This is the second part of a tutorial on serializing and deserializing Python objects. In the first part, you learned the basics and then delved into the details of Pickle and JSON.
In this part, you'll explore YAML (make sure to have the running example from Part One), discuss performance and security considerations, review some other serialization formats, and finally learn how to choose the right scheme.
YAML
YAML is my favorite format. It is a human-friendly data serialization format. Unlike Pickle and JSON, it is not part of the Python standard library, so you need to install it:
pip install pyyaml
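If you don't have the running example from Part One at hand, here is a minimal sketch that is consistent with the dumps shown below. The class name A, its __eq__ method, and the field values are inferred from the output in this part, so Part One's actual code may differ slightly:

from datetime import datetime


class A(object):
    def __init__(self, simple):
        self.simple = simple

    def __eq__(self, other):
        return hasattr(other, 'simple') and self.simple == other.simple

    def __ne__(self, other):
        return not self.__eq__(other)


simple = dict(int_list=[1, 2, 3],
              text='string',
              number=3.44,
              boolean=True,
              none=None)

complex = dict(a=A(simple), when=datetime(2016, 3, 7))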
The yaml module has only load() and dump() functions. By default they work with strings, like loads() and dumps(), but they can take a second argument, which is an open stream, and then dump to or load from a file.
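The stream form isn't shown in the examples that follow, so here is a minimal sketch of dumping to and loading from a file (the file name simple.yaml is just a placeholder, and simple is the dictionary from Part One):

import yaml

# Serialize to a file by passing an open stream as the second argument.
with open('simple.yaml', 'w') as f:
    yaml.dump(simple, f)

# Deserialize by passing an open stream instead of a string.
# Note: newer PyYAML releases require a Loader argument,
# e.g. yaml.load(f, Loader=yaml.FullLoader).
with open('simple.yaml') as f:
    restored = yaml.load(f)

assert restored == simple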
import yaml

print yaml.dump(simple)

boolean: true
int_list: [1, 2, 3]
none: null
number: 3.44
text: string
Please note how readable YAML is compared to Pickle or even JSON. Now comes the cool part about YAML: it understands Python objects! No need for custom encoders and decoders. Here's the complex serialization/deserialization using YAML:
> serialized = yaml.dump(complex)
> print serialized
a: !!python/object:__main__.A
  simple:
    boolean: true
    int_list: [1, 2, 3]
    none: null
    number: 3.44
    text: string
when: 2016-03-07 00:00:00

> deserialized = yaml.load(serialized)
> deserialized == complex
True
As you can see, YAML has its own notation for marking Python objects. The output is still very easy to read. Datetime objects do not require any special markup because YAML inherently supports datetime objects.
Performance
Before you start thinking about performance, you need to consider whether performance is an issue. If you're serializing/deserializing small amounts of data relatively infrequently (such as reading a config file at the beginning of your program), then performance isn't really an issue and you can move on.
However, if you profile your system and find that serialization and/or deserialization are causing performance issues, here is what you should consider.
Performance has two aspects: how fast the serialization/deserialization is, and how large the serialized representation is.
To test the performance of the various serialization formats, I'll create a larger data structure and serialize/deserialize it with Pickle, YAML, and JSON. The big_data list contains 5,000 complex objects.
big_data = [dict(a=simple, when=datetime.now().replace(microsecond=0)) for i in range(5000)]
Pickle
I'll use IPython here since it has the convenient %timeit magic function to measure execution time.
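If you are not using IPython, the standard timeit module gives comparable numbers. A minimal sketch, measuring both speed and size for the pickle case (the repetition count is arbitrary):

import timeit
import cPickle as pickle  # on Python 3, plain "pickle" plays the same role

# How fast: average time over 10 calls, comparable to what %timeit reports.
seconds = timeit.timeit(lambda: pickle.dumps(big_data), number=10) / 10
print('pickle.dumps: %.1f ms per call' % (seconds * 1000))

# How big: the length of the serialized representation.
print('size: %d bytes' % len(pickle.dumps(big_data)))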
import cPickle as pickle

In [190]: %timeit serialized = pickle.dumps(big_data)
10 loops, best of 3: 51 ms per loop

In [191]: %timeit deserialized = pickle.loads(serialized)
10 loops, best of 3: 24.2 ms per loop

In [192]: deserialized == big_data
Out[192]: True

In [193]: len(serialized)
Out[193]: 747328
The default pickle protocol takes 51 milliseconds to serialize and 24.2 milliseconds to deserialize, and the serialized size is 747,328 bytes.
Let's try using the highest protocol.
In [195]: %timeit serialized = pickle.dumps(big_data, protocol=pickle.HIGHEST_PROTOCOL)
10 loops, best of 3: 21.2 ms per loop

In [196]: %timeit deserialized = pickle.loads(serialized)
10 loops, best of 3: 25.2 ms per loop

In [197]: len(serialized)
Out[197]: 394350
Interesting results. The serialization time dropped to just 21.2 ms, while the deserialization time increased slightly to 25.2 ms. The serialized size shrank significantly, to 394,350 bytes (about 53% of the original).
JSON
In [253]: %timeit serialized = json.dumps(big_data, cls=CustomEncoder)
10 loops, best of 3: 34.7 ms per loop

In [254]: %timeit deserialized = json.loads(serialized, object_hook=decode_object)
10 loops, best of 3: 148 ms per loop

In [255]: len(serialized)
Out[255]: 730000
OK. Encoding performance seems a little worse than Pickle's, but decoding performance is much, much worse: about 6x slower. What's going on? This is an artifact of the object_hook function, which has to run for every decoded dictionary to check whether it should be converted to an object. Without the object hook, decoding runs much faster.
%timeit deserialized = json.loads(serialized)
10 loops, best of 3: 36.2 ms per loop
The lesson here is to carefully consider any custom encoders and decoders when serializing and deserializing to JSON, as they can have a significant impact on overall performance.
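For reference, the CustomEncoder and decode_object used in the measurements above come from Part One. Below is a minimal sketch of what such hooks typically look like; the tag names and details are illustrative and may not match Part One exactly:

import json
from datetime import datetime

class CustomEncoder(json.JSONEncoder):
    def default(self, o):
        # Tag values the standard encoder can't handle so they can be rebuilt later.
        if isinstance(o, datetime):
            return {'__datetime__': o.strftime('%Y-%m-%dT%H:%M:%S')}
        if isinstance(o, A):
            return {'__A__': o.simple}
        return json.JSONEncoder.default(self, o)

def decode_object(d):
    # Called for every decoded dict, which is exactly why it is so costly.
    if '__datetime__' in d:
        return datetime.strptime(d['__datetime__'], '%Y-%m-%dT%H:%M:%S')
    if '__A__' in d:
        return A(d['__A__'])
    return d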
YAML
In [293]: %timeit serialized = yaml.dump(big_data)
1 loops, best of 3: 1.22 s per loop

In [294]: %timeit deserialized = yaml.load(serialized)
1 loops, best of 3: 2.03 s per loop

In [295]: len(serialized)
Out[295]: 200091
OK. YAML is really, really slow. However, note something interesting: the serialized size is only 200,091 bytes, much better than both Pickle and JSON. Let's take a quick look inside:
In [300]: print serialized[:211]
- a: &id001
    boolean: true
    int_list: [1, 2, 3]
    none: null
    number: 3.44
    text: string
  when: 2016-03-13 00:11:44
- a: *id001
  when: 2016-03-13 00:11:44
- a: *id001
  when: 2016-03-13 00:11:44
YAML is being very clever here. It notices that all 5,000 dictionaries share the same value for their "a" key, so it stores that value only once and references it with *id001 in all the other objects.
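You can reproduce the anchoring behavior with a tiny example; a minimal sketch (the shared dictionary is arbitrary):

import yaml

shared = dict(x=1)

# Both list entries reference the same object, so PyYAML emits one anchor
# (&id001) and an alias (*id001) instead of repeating the data.
print(yaml.dump([dict(a=shared), dict(a=shared)]))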
Security
Security is often a critical concern. Pickle and YAML, because they construct Python objects, are vulnerable to code-execution attacks: a cleverly crafted file can contain arbitrary code that Pickle or YAML will execute. There is no need to panic. This is by design and is documented in Pickle's documentation:
Warning: The pickle module is not designed to protect against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
And in the PyYAML documentation:
Warning: It is unsafe to call yaml.load with any data received from an untrusted source! yaml.load is as powerful as pickle.load, so it can call any Python function.
Just know that you should not use Pickle or YAML to load serialized data received from untrusted sources. JSON is fine, but if you have a custom encoder/decoder you might be exposed as well.
The yaml module also provides a yaml.safe_load() function that only loads simple objects, but then you lose a lot of YAML's power and might as well just use JSON.
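A minimal sketch of the difference, using the complex object from earlier; the payload is harmless, but it shows that safe_load() refuses to construct arbitrary Python objects:

import yaml

payload = yaml.dump(complex)   # contains a !!python/object tag

yaml.load(payload)             # rebuilds the A instance - unsafe for untrusted input

try:
    yaml.safe_load(payload)    # only allows plain types (dicts, lists, strings, ...)
except yaml.YAMLError as e:
    print('refused: %s' % e)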
Other formats
There are many other serialization formats available. Here are some of them.
Protocol Buffer
Protobuf (Protocol Buffers) is Google's data interchange format. It is implemented in C++ and has Python bindings. It has a sophisticated schema and packs data efficiently. It is very powerful, but not very easy to use.
MessagePack
MessagePack is another popular serialization format. It is also binary and efficient, but unlike Protobuf it does not require a schema. Its type system is similar to JSON's, but a little richer: keys can be of any type, not just strings, and non-UTF-8 strings are supported.
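A minimal usage sketch, assuming the third-party msgpack package (pip install msgpack) is installed:

import msgpack

data = dict(int_list=[1, 2, 3], text='string', number=3.44, boolean=True)

packed = msgpack.packb(data)       # a compact binary byte string
restored = msgpack.unpackb(packed)

print(len(packed))
print(restored)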
CBOR
CBOR stands for Concise Binary Object Representation. Like MessagePack, it supports the JSON data model. CBOR is not as well known as Protobuf or MessagePack, but it is interesting for two reasons:
- It is an official Internet standard: RFC 7049.
- It is designed for the Internet of Things (IoT).
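A minimal usage sketch, assuming the third-party cbor2 package (pip install cbor2) is installed:

import cbor2

data = dict(int_list=[1, 2, 3], text='string', number=3.44)

encoded = cbor2.dumps(data)   # compact binary, following the JSON data model
print(cbor2.loads(encoded) == data)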
How to Choose?
This is the big question. With so many options, how do you choose? Let's consider the factors you should take into account:
- Should the serialization format be human-readable and/or human-editable?
- Will serialized content be received from untrusted sources?
- Is serialization/deserialization a performance bottleneck?
- Does serialized data need to be exchanged with non-Python environments?
I'll make it really simple for you and walk through a few common scenarios and the format I recommend for each:
Automatically saving the local state of a Python program
Use pickle (cPickle) with HIGHEST_PROTOCOL here. It's fast, efficient, and can store and load most Python objects without any special code. It can also be used as a local persistent cache.
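A minimal sketch of saving and restoring program state this way; the file name state.pkl and the helper names are placeholders:

import cPickle as pickle  # on Python 3, just "import pickle"

def save_state(state, path='state.pkl'):
    with open(path, 'wb') as f:
        pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_state(path='state.pkl'):
    with open(path, 'rb') as f:
        return pickle.load(f)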
Configuration file
Definitely YAML. Nothing beats its simplicity for anything humans need to read or edit. It is used successfully by Ansible and many other projects. In some situations you may prefer to use a plain Python module as the configuration file. That can be the right choice, but then it isn't serialization: the configuration is really part of the program, not a separate file.
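A minimal sketch of reading a YAML configuration file; the file name and keys are invented for illustration:

import yaml

# config.yaml might contain, for example:
#   host: localhost
#   port: 8080
#   debug: true
with open('config.yaml') as f:
    config = yaml.safe_load(f)   # safe_load is enough for plain configuration data

print(config)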
Web API
JSON is the clear winner here. Today, web APIs are most often consumed by JavaScript web applications, which speak JSON natively. Some web APIs may return other formats (e.g. CSV for dense tabular result sets), but I would argue that you can pack CSV-style data into JSON with minimal overhead (there is no need to repeat each row as an object with all the column names).
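A minimal sketch of the column-packing idea mentioned above; the rows and column names are made up:

import json

rows = [
    dict(name='alice', score=3.4),
    dict(name='bob', score=7.1),
]

# Send the column names once instead of repeating them in every row object.
columns = ['name', 'score']
packed = dict(columns=columns,
              rows=[[row[c] for c in columns] for row in rows])

print(json.dumps(packed))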
High-volume / low-latency large-scale communication
Use one of the binary protocols: Protobuf (if you need a schema), MessagePack, or CBOR. Run your own tests to verify the performance and the representational power of each option.
Conclusion
Serialization and deserialization of Python objects is an important aspect of distributed systems. You cannot send Python objects directly over the network. You often need to interoperate with other systems implemented in other languages, and sometimes you just want to store the state of your program in persistent storage.
Python comes with several serialization schemes in its standard library, and many more are available as third-party modules. Understanding all the options and the pros and cons of each will allow you to choose the method that best suits your situation.