
Python object serialization and deserialization: Part 2

PHPz, 2023-09-03 20:33:05

This is the second part of a tutorial on serializing and deserializing Python objects. In the first part, you learned the basics and then delved into the details of Pickle and JSON.

In this part, you'll explore YAML (make sure to have the running example from Part One), discuss performance and security considerations, learn about other serialization formats, and finally learn how to choose the right one.

YAML

YAML is my favorite format. It is a human-friendly data serialization format. Unlike Pickle and JSON, it is not part of the Python standard library, so you need to install it. The package on PyPI is called PyYAML, although the module you import is yaml:

pip install pyyaml

The yaml module only has load() and dump() functions. By default they work with strings, like loads() and dumps(), but they can take a second argument, an open stream, and then dump to or load from a file.

import yaml



print(yaml.dump(simple))



boolean: true

int_list: [1, 2, 3]

none: null

number: 3.44

text: string
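The stream form of dump() and load() described above can be sketched as follows. This is a minimal illustration, assuming Python 3 and the PyYAML package. The `simple` dictionary is rebuilt here from the output above rather than taken from Part One, `io.StringIO` stands in for an open file, and the helper names are made up for this example. It reads the data back with yaml.safe_load(), which is enough for plain dictionaries and lists.

```python
import io

try:
    import yaml  # third-party module, installed as the PyYAML package
except ImportError:
    yaml = None  # lets the sketch load even where PyYAML is missing

# Stand-in for the `simple` dict from Part One (an assumption based on the output).
simple = dict(int_list=[1, 2, 3], text='string', number=3.44,
              boolean=True, none=None)


def roundtrip_via_string(data):
    """dump() without a stream returns a str; safe_load() parses it back."""
    return yaml.safe_load(yaml.dump(data))


def roundtrip_via_stream(data):
    """With a stream as second argument, dump() writes to it instead."""
    stream = io.StringIO()  # an open text file would work the same way
    yaml.dump(data, stream)
    stream.seek(0)
    return yaml.safe_load(stream)


if yaml is not None:
    print(roundtrip_via_string(simple) == simple)
    print(roundtrip_via_stream(simple) == simple)
```

Both helpers should give back a dictionary equal to the original.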

Note how readable YAML is compared to Pickle or even JSON. Now comes the cool part about YAML: it understands Python objects! There is no need for custom encoders and decoders. Here is the complex object from Part One serialized and deserialized with YAML:

> serialized = yaml.dump(complex)

> print(serialized)



a: !!python/object:__main__.A

  simple:

    boolean: true

    int_list: [1, 2, 3]

    none: null

    number: 3.44

    text: string

when: 2016-03-07 00:00:00



> deserialized = yaml.load(serialized)  # newer PyYAML versions require an explicit Loader, e.g. Loader=yaml.Loader (trusted input only)

> deserialized == complex

True

As you can see, YAML has its own notation for tagging Python objects. The output is still very easy to read. The datetime object needs no special tag because YAML supports timestamps natively.

Performance

Before you start thinking about performance, you need to consider whether performance is an issue. If you're serializing/deserializing small amounts of data relatively infrequently (such as reading a config file at the beginning of your program), then performance isn't really an issue and you can move on.

However, if you have profiled your system and found that serialization and/or deserialization is causing performance problems, here are the things to look at.

Performance has two aspects: how fast is the serialization/deserialization, and how big is the serialized representation?

To test the performance of the various serialization formats, I will create a larger data structure and serialize/deserialize it using Pickle, YAML, and JSON. The big_data list contains 5,000 complex objects.

big_data = [dict(a=simple, when=datetime.now().replace(microsecond=0)) for i in range(5000)]
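The two aspects, speed and size, can be measured with nothing but the standard library. The following is a minimal sketch, assuming Python 3. The `simple` dict is a stand-in rebuilt from the YAML output earlier, and `measure` is a hypothetical helper, not something from Part One. Absolute numbers depend on the machine and the Python version and will not match the figures quoted in this article.

```python
import pickle
import timeit
from datetime import datetime

# Stand-in for the `simple` dict from Part One (an assumption).
simple = dict(int_list=[1, 2, 3], text='string', number=3.44,
              boolean=True, none=None)
big_data = [dict(a=simple, when=datetime.now().replace(microsecond=0))
            for i in range(5000)]


def measure(dumps, loads, data, repeat=3, number=5):
    """Return (seconds per dumps call, seconds per loads call, size)."""
    serialized = dumps(data)
    # Best of `repeat` runs, each averaging `number` calls.
    dump_s = min(timeit.repeat(lambda: dumps(data),
                               repeat=repeat, number=number)) / number
    load_s = min(timeit.repeat(lambda: loads(serialized),
                               repeat=repeat, number=number)) / number
    return dump_s, load_s, len(serialized)


dump_s, load_s, size = measure(pickle.dumps, pickle.loads, big_data)
print('pickle: dump %.1f ms, load %.1f ms, %d bytes'
      % (dump_s * 1e3, load_s * 1e3, size))
```

Any pair of functions with the dumps/loads shape can be passed in, so the same helper works for json.dumps and json.loads once the data is JSON-serializable.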

Pickle

I'll use IPython here since it has the convenient %timeit magic function to measure execution time.

import cPickle as pickle  # Python 2; on Python 3, plain "import pickle" already uses the C implementation



In [190]: %timeit serialized = pickle.dumps(big_data)

10 loops, best of 3: 51 ms per loop



In [191]: %timeit deserialized = pickle.loads(serialized)

10 loops, best of 3: 24.2 ms per loop



In [192]: deserialized == big_data

Out[192]: True



In [193]: len(serialized)

Out[193]: 747328

With the default protocol, pickle takes 51 milliseconds to serialize and 24.2 milliseconds to deserialize, and the serialized size is 747,328 bytes.

Let's try using the highest protocol.

In [195]: %timeit serialized = pickle.dumps(big_data, protocol=pickle.HIGHEST_PROTOCOL)

10 loops, best of 3: 21.2 ms per loop



In [196]: %timeit deserialized = pickle.loads(serialized)

10 loops, best of 3: 25.2 ms per loop



In [197]: len(serialized)

Out[197]: 394350

Interesting results. Serialization time dropped from 51 ms to just 21.2 ms, while deserialization time rose slightly, from 24.2 ms to 25.2 ms. The serialized size shrank significantly, to 394,350 bytes (about 53% of the default-protocol size).
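The effect of the protocol on size can be checked with a short loop over every protocol the interpreter supports. This is a sketch under assumptions: it targets Python 3, and `simple` and `big_data` are rebuilt here with a fixed timestamp instead of coming from Part One. Keep in mind that the timings above come from Python 2, where the default is the text-based protocol 0; Python 3 already defaults to a binary protocol, so the gap between the default and HIGHEST_PROTOCOL is much smaller there.

```python
import pickle
from datetime import datetime

# Stand-ins for the article's data (assumptions, not the Part One code).
simple = dict(int_list=[1, 2, 3], text='string', number=3.44,
              boolean=True, none=None)
big_data = [dict(a=simple, when=datetime(2016, 3, 13, 0, 11, 44))
            for i in range(5000)]

sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    serialized = pickle.dumps(big_data, protocol=protocol)
    # Every protocol must give back an equal data structure.
    assert pickle.loads(serialized) == big_data
    sizes[protocol] = len(serialized)
    print('protocol %d: %d bytes' % (protocol, sizes[protocol]))
```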

JSON

In [253]: %timeit serialized = json.dumps(big_data, cls=CustomEncoder)

10 loops, best of 3: 34.7 ms per loop



In [254]: %timeit deserialized = json.loads(serialized, object_hook=decode_object)

10 loops, best of 3: 148 ms per loop



In [255]: len(serialized)

Out[255]: 730000

OK. Encoding is in the same range as Pickle (slower than the highest protocol, faster than the default one), but decoding is much, much worse: about 6x slower than Pickle. What is going on? This is an artifact of the object_hook function, which has to run for every dictionary to check whether it needs to be converted to an object. Decoding runs much faster without the object hook.

%timeit deserialized = json.loads(serialized)

10 loops, best of 3: 36.2 ms per loop

The lesson here is to consider any custom encoding or decoding very carefully when serializing and deserializing to JSON, because it can have a major impact on overall performance.
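To make the cost of the hook concrete, here is a small sketch that counts how often an object hook is invoked. It assumes Python 3.7 or later. `DateTimeEncoder`, `decode_datetime`, and the `__datetime__` marker are hypothetical stand-ins written for this example; they are not the CustomEncoder and decode_object from Part One, which are not shown here.

```python
import json
from datetime import datetime

hook_calls = 0


class DateTimeEncoder(json.JSONEncoder):
    """Encode datetime objects as a tagged dict (illustrative convention)."""

    def default(self, o):
        if isinstance(o, datetime):
            return {'__datetime__': o.isoformat()}
        return super().default(o)


def decode_datetime(d):
    """Object hook: the decoder calls it for every JSON object it parses."""
    global hook_calls
    hook_calls += 1
    if '__datetime__' in d:
        return datetime.fromisoformat(d['__datetime__'])
    return d


data = [dict(a=dict(text='string'), when=datetime(2016, 3, 7))
        for i in range(100)]
serialized = json.dumps(data, cls=DateTimeEncoder)
deserialized = json.loads(serialized, object_hook=decode_datetime)

print(deserialized == data)
print(hook_calls)  # 3 calls per record: inner dict, datetime dict, outer dict
```

With 100 records the hook runs 300 times, even though only 100 of those calls actually convert anything. That per-dictionary Python call is where the decoding time goes.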

YAML

In [293]: %timeit serialized = yaml.dump(big_data)

1 loops, best of 3: 1.22 s per loop



In [294]: %timeit deserialized = yaml.load(serialized)

1 loops, best of 3: 2.03 s per loop



In [295]: len(serialized)

Out[295]: 200091

OK. YAML is really, really slow. However, note something interesting: the serialized size is only 200,091 bytes. Much better than both Pickle and JSON. Let’s take a quick look inside:

In [300]: print(serialized[:211])

- a: &id001

    boolean: true

    int_list: [1, 2, 3]

    none: null

    number: 3.44

    text: string

  when: 2016-03-13 00:11:44

- a: *id001

  when: 2016-03-13 00:11:44

- a: *id001

  when: 2016-03-13 00:11:44

YAML is very clever here. It detects that all 5,000 dictionaries share the same value for the "a" key (they all reference the same simple object), so it stores that value only once, under the anchor &id001, and refers to it with the alias *id001 everywhere else.
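The difference between sharing one object and holding equal copies can be seen in a few lines. This is a rough sketch, assuming Python 3 and PyYAML; the variable names are invented for the example. As far as I know, PyYAML decides on anchors by object identity, so equal but separate dictionaries are written out in full each time.

```python
try:
    import yaml  # third-party module, installed as the PyYAML package
except ImportError:
    yaml = None  # lets the sketch load even where PyYAML is missing

shared = dict(text='string')
# Equal content, but three distinct dict objects.
separate_copies = [dict(a=dict(shared)) for i in range(3)]
# One dict object referenced three times.
same_object = [dict(a=shared) for i in range(3)]

if yaml is not None:
    print(yaml.dump(separate_copies))  # every item is written out in full
    text = yaml.dump(same_object)
    print(text)                        # one anchor, then aliases
    loaded = yaml.safe_load(text)
    # Aliases come back as references to a single object, not as copies.
    print(loaded[0]['a'] is loaded[1]['a'])
```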

Security

Security is often a critical concern. Because they construct Python objects, Pickle and YAML are vulnerable to code execution attacks. A cleverly crafted file can contain arbitrary code that Pickle or YAML will execute. No need to panic: this is by design and is documented in Pickle's documentation:

Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

And in PyYAML's documentation:

Warning: It is not safe to call yaml.load with any data received from an untrusted source! yaml.load is as powerful as pickle.load and so may call any Python function.

Just know that you should not use Pickle or YAML to load serialized data received from untrusted sources. JSON is fine, but if you have a custom encoder/decoder you might be exposed as well.
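To see why unpickling untrusted bytes is dangerous, consider this deliberately harmless sketch. The `Payload` class is hypothetical, and the expression it evaluates is just arithmetic. A real attacker would return something like a shell command runner instead, and would only need to hand you the bytes, not the class.

```python
import pickle


class Payload(object):
    """Hypothetical attacker-side class; only its pickled bytes matter."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        return (eval, ('6 * 7',))


malicious_bytes = pickle.dumps(Payload())

# The victim merely loads the bytes, and the callable runs during loading.
result = pickle.loads(malicious_bytes)
print(result)  # 42: the result of the evaluated expression, not a Payload
```

Nothing in the loading code mentions eval; the instruction to call it travels inside the serialized data.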

The yaml module provides the yaml.safe_load() function, which loads only simple objects, but then you lose much of YAML's power and may as well just use JSON.
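Here is a minimal sketch of that behavior, assuming Python 3 and PyYAML. The two documents and the `try_safe_load` helper are invented for the example. The second document uses the same Python object tag shown earlier in this article.

```python
try:
    import yaml  # third-party module, installed as the PyYAML package
except ImportError:
    yaml = None  # lets the sketch load even where PyYAML is missing

plain_doc = 'boolean: true\nint_list: [1, 2, 3]\n'
tagged_doc = 'a: !!python/object:__main__.A {}\n'


def try_safe_load(text):
    """Return the parsed data, or a short note if safe_load() refuses it."""
    try:
        return yaml.safe_load(text)
    except yaml.YAMLError as exc:
        return 'rejected: %s' % type(exc).__name__


if yaml is not None:
    print(try_safe_load(plain_doc))   # plain mappings and lists load fine
    print(try_safe_load(tagged_doc))  # the Python object tag is refused
```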

Other formats

There are many other serialization formats available. Here are some of them.

Protocol Buffers

Protobuf (i.e. Protocol Buffers) is Google's data interchange format. It is implemented in C++ but has Python bindings. It has a sophisticated schema and packs data efficiently. Very powerful, but not very easy to use.

MessagePack

MessagePack is another popular serialization format. It is also binary and efficient, but unlike Protobuf it does not require a schema. It has a type system similar to JSON's, but richer: keys can be of any type, not just strings, and non-UTF8 strings are supported.

CBOR

CBOR stands for Concise Binary Object Representation. It, too, supports the JSON data model. CBOR is not as well known as Protobuf or MessagePack, but it is interesting for two reasons:

  1. It is an official Internet standard: RFC 7049 (since superseded by RFC 8949).
  2. It is designed for the Internet of Things (IoT).

How to choose?

This is the big question. With so many options, how do you choose? Let's go over the factors that should be taken into account:

  1. Should the serialization format be human-readable and/or human-editable?
  2. Will serialized content be received from untrusted sources?
  3. Is serialization/deserialization a performance bottleneck?
  4. Does serialized data need to be exchanged with non-Python environments?

I'll make it really simple for you and walk through a few common scenarios and the format I recommend for each:

Automatically save the local state of the Python program

Use pickle (cPickle) and HIGHEST_PROTOCOL here. It's fast, efficient, and can store and load most Python objects without any special code. It can also be used as a local persistent cache.
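A local state file along these lines might look like the sketch below. It assumes Python 3, where the plain pickle module is already backed by C. `save_state`, `load_state`, and the file name are hypothetical names chosen for this example.

```python
import os
import pickle
import tempfile


def save_state(state, path):
    """Write the state to disk using the most compact protocol available."""
    with open(path, 'wb') as f:  # pickle data is binary
        pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_state(path, default=None):
    """Read the state back; return `default` if nothing was saved yet."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


state_path = os.path.join(tempfile.mkdtemp(), 'state.pickle')
save_state({'counter': 3, 'seen': {'a', 'b'}}, state_path)
restored = load_state(state_path)
print(restored)
```

Because the file is only ever written and read by your own program, the security concerns about untrusted pickles do not apply, as long as nobody else can modify the file.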

Configuration file

Definitely YAML. Nothing beats its simplicity for anything humans need to read or edit. It is used successfully by Ansible and many other projects. In some situations, you may prefer to use plain Python modules as configuration files. That may be the right choice, but then it is not serialization: the configuration is really part of the program rather than a separate file.

Web API

JSON is the clear winner here. These days, Web APIs are most often consumed by JavaScript web applications, which speak JSON natively. Some Web APIs may return other formats (e.g. CSV for dense tabular result sets), but I would argue that you can pack CSV data into JSON with minimal overhead (there is no need to repeat each row as an object with all the column names).
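One way to pack tabular data into JSON without repeating the column names is to send the column list once, followed by plain rows. The sketch below illustrates the idea; the `columns`/`rows` layout and the `pack`/`unpack` helpers are assumptions made for this example, not an established standard.

```python
import json

# Tabular result set in the verbose "one object per row" form.
rows = [{'id': i, 'name': 'item%d' % i, 'score': i / 2} for i in range(100)]


def pack(rows):
    """Store the column names once, followed by bare value lists."""
    columns = list(rows[0])
    return {'columns': columns,
            'rows': [[row[c] for c in columns] for row in rows]}


def unpack(packed):
    """Rebuild the list of row objects on the receiving side."""
    return [dict(zip(packed['columns'], values)) for values in packed['rows']]


verbose = json.dumps(rows)
compact = json.dumps(pack(rows))
print(len(verbose), len(compact))
print(unpack(json.loads(compact)) == rows)
```

The packed form stays valid JSON, so clients need no CSV parser, and the saving grows with the number of rows and the length of the column names.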

High-volume/low-latency large-scale communication

Use one of the binary protocols: Protobuf (if you need a schema), MessagePack, or CBOR. Run your own tests to verify the performance and the representational power of each option.

Conclusion

Serialization and deserialization of Python objects is an important aspect of distributed systems. You cannot send Python objects directly over the network. You often need to interoperate with other systems implemented in other languages, and sometimes you just want to store the state of your program in persistent storage.

Python comes with several serialization schemes in its standard library, and many more are available as third-party modules. Understanding all the options and the pros and cons of each will allow you to choose the method that best suits your situation.
