
An unexpected discovery: what I took for a bug is actually a feature of Protobuf's design

PHPz
2023-05-09 16:22:09

Hello everyone, I am amazing.

Recently, our project started using protobuf as the format for storing recorded data. Along the way I dug a big pit for myself without noticing, and it took me a long time to discover it.

Introduction to protobuf

protobuf's full name is Protocol Buffers. It was developed by Google and is a language-neutral, platform-neutral, extensible mechanism for serializing structured data. It is similar to XML, but smaller, faster, and simpler. You define how you want your data to be structured once, and then use the code generator to produce source code that handles serialization and deserialization. With that code, structured data can easily be written to and read from a variety of data streams in a variety of programming languages.

The proto2 version supports code generation in Java, Python, Objective-C and C++. With the newer proto3 language version, you can also use Kotlin, Dart, Go, Ruby, PHP, C#, and more.

How did I find it?

In our new project, we record the data produced while the project runs in protobuf format. That way, during debugging, we can debug locally against the data recorded on site.

message ImageData {
  // ms
  int64 timestamp = 1;
  int32 id = 2;
  Data mat = 3;
}

message PointCloud {
  // ms
  int64 timestamp = 1;
  int32 id = 2;
  PointData pointcloud = 3;
}

message State {
  // ms
  int64 timestamp = 1;
  string direction = 2;
}

message Sensor {
  repeated PointCloud point_data = 1;
  repeated ImageData image_data = 2;
  repeated State vehicle_data = 3;
}

This is the set of messages we defined. Because the three data sources in Sensor run at different frame rates, each Sensor we store actually contains only one kind of data; the other two repeated fields are left empty.

We had no problems as long as we recorded a single package. Then we found that a single package could not hold a long recording, so we needed a way to split the recording across several packages.

At the time, I thought this would be very simple: once a package reaches 500M, the subsequent data goes into a new package. I wrote it without any trouble and deployed it on site to record data. After recording for a while, we brought the packages back and replayed them to test our new program. Some of the packages could not be parsed correctly, and the program would hang partway through. After many tests, we confirmed that only some of the packages had this problem.

Our first suspicion was that the way we checked the file size was wrong and interfered with the splitting, because checking the size opened the file. But even after switching to several ways of checking the size that do not open the file, some of the recorded packages still had the problem.

Only then did I suspect that protobuf has special requirements for storing data. I read some articles and learned that protobuf messages are not self-delimiting: to store multiple messages in one file, you have to add delimiters yourself. Otherwise, when parsing the file back, protobuf cannot tell where one message ends and the next begins, and the data is parsed incorrectly.

This is where the pit was. We wrote a series of messages into a single package with no delimiters between them. When protobuf parses the file, the whole content is parsed as a single Sensor. That Sensor contains all the data, because protobuf merges all the stored messages: the entries of each repeated field are simply appended one after another.
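To make the merging concrete, here is a minimal sketch that hand-rolls a small part of the protobuf wire format instead of using the library. It assumes the State payload can be treated as opaque bytes and it only understands length-delimited fields. The helper names (AppendVarint, ReadVarint, EncodeSensorWithOneState, ParseSensor) are made up for illustration and are not protobuf APIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append v as a base-128 varint, the integer encoding of the protobuf wire format.
void AppendVarint(std::string& out, std::uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<char>(v));
}

// Read a varint starting at pos and advance pos. Returns false on truncated input.
bool ReadVarint(const std::string& in, std::size_t& pos, std::uint64_t& v) {
  v = 0;
  for (int shift = 0; shift < 64 && pos < in.size(); shift += 7) {
    const std::uint8_t byte = static_cast<std::uint8_t>(in[pos++]);
    v |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return true;
  }
  return false;
}

// Hand-encode a Sensor holding exactly one vehicle_data entry (field 3, wire type 2).
// state_bytes stands in for an already serialized State and is treated as opaque.
std::string EncodeSensorWithOneState(const std::string& state_bytes) {
  std::string out;
  AppendVarint(out, (3u << 3) | 2u);  // tag byte 0x1A
  AppendVarint(out, state_bytes.size());
  out += state_bytes;
  return out;
}

// Parse the buffer as ONE Sensor, collecting every vehicle_data payload.
// It keeps reading fields until the input ends, because the bytes carry
// no marker that says "this message stops here".
bool ParseSensor(const std::string& in, std::vector<std::string>& states) {
  std::size_t pos = 0;
  while (pos < in.size()) {
    std::uint64_t tag = 0;
    std::uint64_t len = 0;
    if (!ReadVarint(in, pos, tag)) return false;
    if ((tag & 7) != 2) return false;  // this sketch only understands wire type 2
    if (!ReadVarint(in, pos, len) || len > in.size() - pos) return false;
    if ((tag >> 3) == 3) {
      states.push_back(in.substr(pos, static_cast<std::size_t>(len)));
    }
    pos += static_cast<std::size_t>(len);
  }
  return true;
}
```

If two Sensor messages with one vehicle_data entry each are encoded and their bytes concatenated, ParseSensor sees input that looks exactly like one Sensor with two entries, which is the behavior we ran into.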

Looking back, the data in the single packages I had recorded earlier all came out correct. That was pure luck: protobuf just happened to parse them successfully.

How to solve it?

Now that we know protobuf behaves this way, all we need is a way to delimit the messages. The method was really hard to find, because very few people use protobuf the way we do. Searching in Chinese turned up nothing at all. Perhaps most people do not use protobuf to store data in files and instead use it for communication between services.

I finally found the answer in some replies on Stack Overflow. From them I learned that this solution was only officially merged in protobuf 3.3. It seems this feature really is rarely used.

// Declared in google/protobuf/util/delimited_message_util.h
bool SerializeDelimitedToOstream(const MessageLite& message,
                                 std::ostream* output);
bool ParseDelimitedFromZeroCopyStream(MessageLite* message,
                                      io::ZeroCopyInputStream* input,
                                      bool* clean_eof);

With this pair of functions, messages can be written to and read from a file one at a time, as a stream. There is no more need to worry about the data being merged when it is read back.

Of course, data stored this way can no longer be parsed with the original method, because the storage format is different: for each message, the size of the binary data is written first (as a varint), followed by the binary data itself.
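The size-then-bytes layout can be sketched without the library as follows. This only illustrates the framing idea, on the assumption that the size is a base-128 varint limited to 32 bits. WriteDelimited and ReadDelimited are hypothetical names, and real code should call the protobuf functions listed above rather than this sketch.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Write one record as: varint(byte length) followed by the raw bytes.
void WriteDelimited(std::ostream& out, const std::string& payload) {
  std::uint32_t n = static_cast<std::uint32_t>(payload.size());
  while (n >= 0x80) {
    out.put(static_cast<char>((n & 0x7F) | 0x80));
    n >>= 7;
  }
  out.put(static_cast<char>(n));
  out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
}

// Read one record. Returns false at end of input or on a malformed or
// truncated record. clean_eof is set to true only when the input ended
// exactly on a record boundary, i.e. before any length byte was read.
bool ReadDelimited(std::istream& in, std::string& payload, bool& clean_eof) {
  clean_eof = false;
  std::uint32_t n = 0;
  int shift = 0;
  while (true) {
    const int c = in.get();
    if (c == std::char_traits<char>::eof()) {
      clean_eof = (shift == 0);
      return false;
    }
    n |= static_cast<std::uint32_t>(c & 0x7F) << shift;
    if ((c & 0x80) == 0) break;
    shift += 7;
    if (shift > 28) return false;  // more than 5 bytes: not a valid 32-bit length
  }
  payload.resize(n);
  in.read(&payload[0], static_cast<std::streamsize>(n));
  return in.gcount() == static_cast<std::streamsize>(n);
}
```

Because every record carries its own length, a reader can pull records out one by one and stop when clean_eof becomes true, which also makes it safe to start a new package at any record boundary.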

Conclusion

After a lot of back and forth, I finally got out of this message-splitting pit. The usage scenario is probably rather niche, so a lot of the information simply cannot be found by searching. I worked out these problems by reading the source code myself. The C++ source is really difficult to read: there are many template functions and template classes, and it is easy to miss details. In the end, I read the C# code and finally confirmed my understanding.


Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it removed.