Python object serialization and deserialization: Part 2-Python Tutorial-php.cn

Home

Backend Development

Python Tutorial

Python object serialization and deserialization: Part 2

PHPz

Sep 03, 2023 pm 08:33 PM

Python 对象序列化和反序列化：第 2 部分

This is the second part of a tutorial on serializing and deserializing Python objects. In the first part, you learned the basics and then delved into the details of Pickle and JSON.

In this part, you'll explore YAML (make sure to have the running example from Part One), discuss performance and security considerations, learn about other serialization formats, and finally learn how to choose the right one. p>

YAML

YAML is my favorite format. It is a human-friendly data serialization format. Unlike Pickle and JSON, it is not part of the Python standard library, so you need to install it:

pip install yaml

The

yaml module only has load() and dump() functions. By default they take strings like loads() and dumps() but can take a second argument which is an open stream and can then dump/ Load to/from file.

import yaml



print yaml.dump(simple)



boolean: true

int_list: [1, 2, 3]

none: null

number: 3.44

text: string

Please note how readable YAML is compared to Pickle or even JSON. Now comes the cool part about YAML: it understands Python objects! No need for custom encoders and decoders. Here's the complex serialization/deserialization using YAML:

> serialized = yaml.dump(complex)

> print serialized



a: !!python/object:__main__.A

  simple:

    boolean: true

    int_list: [1, 2, 3]

    none: null

    number: 3.44

    text: string

when: 2016-03-07 00:00:00



> deserialized = yaml.load(serialized)

> deserialized == complex

True

As you can see, YAML has its own notation for marking Python objects. The output is still very easy to read. Datetime objects do not require any special markup because YAML inherently supports datetime objects.

performance

Before you start thinking about performance, you need to consider whether performance is an issue. If you're serializing/deserializing small amounts of data relatively infrequently (such as reading a config file at the beginning of your program), then performance isn't really an issue and you can move on.

However, assuming you profile your system and find that serialization and/or deserialization is causing performance issues, the following issues need to be addressed.

Performance has two aspects: how fast is the serialization/deserialization, and how big is the serialized representation?

To test the performance of various serialization formats, I will create a larger data structure and serialize/deserialize it using Pickle, YAML, and JSON. big_data List contains 5,000 complex objects.

big_data = [dict(a=simple, when=datetime.now().replace(microsecond=0)) for i in range(5000)]

Pickle

I'll use IPython here since it has the convenient %timeit magic function to measure execution time.

import cPickle as pickle



In [190]: %timeit serialized = pickle.dumps(big_data)

10 loops, best of 3: 51 ms per loop



In [191]: %timeit deserialized = pickle.loads(serialized)

10 loops, best of 3: 24.2 ms per loop



In [192]: deserialized == big_data

Out[192]: True



In [193]: len(serialized)

Out[193]: 747328

Default pickle takes 83.1 milliseconds to serialize and 29.2 milliseconds to deserialize, and the serialization size is 747,328 bytes.

Let's try using the highest protocol.

In [195]: %timeit serialized = pickle.dumps(big_data, protocol=pickle.HIGHEST_PROTOCOL)

10 loops, best of 3: 21.2 ms per loop



In [196]: %timeit deserialized = pickle.loads(serialized)

10 loops, best of 3: 25.2 ms per loop



In [197]: len(serialized)

Out[197]: 394350

Interesting results. Serialization time dropped to just 21.2ms, but deserialization time increased slightly to 25.2ms. The serialized size is significantly reduced to 394,350 bytes (52%).

JSON

In [253] %timeit serialized = json.dumps(big_data, cls=CustomEncoder)

10 loops, best of 3: 34.7 ms per loop



In [253] %timeit deserialized = json.loads(serialized, object_hook=decode_object)

10 loops, best of 3: 148 ms per loop



In [255]: len(serialized)

Out[255]: 730000

OK. Performance on encoding seems a little worse than Pickle, but performance on decoding is much, much worse: 6x slower. How is this going? This is an artifact of the object_hook function that needs to be run for each dictionary to check if it needs to be converted to an object. It runs much faster without using object hooks.

%timeit deserialized = json.loads(serialized)

10 loops, best of 3: 36.2 ms per loop

The lesson here is to carefully consider any custom encoding when serializing and deserializing to JSON, as they can have a significant impact on overall performance.

YAML

In [293]: %timeit serialized = yaml.dump(big_data)

1 loops, best of 3: 1.22 s per loop



In[294]: %timeit deserialized = yaml.load(serialized)

1 loops, best of 3: 2.03 s per loop



In [295]: len(serialized)

Out[295]: 200091

OK. YAML is really, really slow. However, note something interesting: the serialized size is only 200,091 bytes. Much better than both Pickle and JSON. Let’s take a quick look inside:

In [300]: print serialized[:211]

- a: &id001

    boolean: true

    int_list: [1, 2, 3]

    none: null

    number: 3.44

    text: string

  when: 2016-03-13 00:11:44

- a: *id001

  when: 2016-03-13 00:11:44

- a: *id001

  when: 2016-03-13 00:11:44

YAML is very clever here. It determines that all 5,000 dictionaries share the same "a" key value, so it only stores it once and references it using *id001 for all objects.

Safety

Security is often a critical issue. Pickle and YAML are vulnerable to code execution attacks due to the construction of Python objects. Cleverly formatted files can contain arbitrary code that will be executed by Pickle or YAML. No need to panic. This is by design and documented in Pickle's documentation:

Warning: The pickle module is not designed to protect against erroneous or maliciously constructed data. Never cancel data received from untrusted or unauthenticated sources.

And the content in the YAML document:

Warning: It is unsafe to call yaml.load with any data received from an untrusted source! yaml.load is as powerful as pickle.load, so it can call any Python function.

Just know that you should not use Pickle or YAML to load serialized data received from untrusted sources. JSON is fine, but if you have a custom encoder/decoder you might be exposed as well.

The

yaml module provides the yaml.safe_load() function which only loads simple objects, but then you lose a lot of the functionality of YAML and may choose to just use JSON.

Other formats

There are many other serialization formats available. Here are some of them.

Protocol Buffer

Protobuf (i.e. Protocol Buffer) is Google's data interchange format. It is implemented in C but has Python bindings. It has a sophisticated architecture and packages data efficiently. Very powerful, but not very easy to use.

Message package

MessagePack is another popular serialization format. It is also binary and efficient, but unlike Protobuf it does not require a schema. It has a type system similar to JSON, but richer. Keys can be of any type, not just strings and non-UTF8 strings are supported.

CBOR

CBOR stands for Concise Binary Object Representation. Likewise, it supports the JSON data model. CBOR is not as famous as Protobuf or MessagePack, but it is interesting for two reasons:

It is an official Internet standard: RFC 7049.
It is designed for the Internet of Things (IoT).

how to choose?

this is a big problem. So many choices, how do you choose? Let’s consider the various factors that should be considered:

Should the serialization format be human-readable and/or human-editable?
Will serialized content be received from untrusted sources?
Is serialization/deserialization a performance bottleneck?
Does serialized data need to be exchanged with non-Python environments?

I'll make it really simple for you and walk through a few common scenarios and the format I recommend for each:

Automatically save the local state of the Python program

Use pickle (cPickle) and HIGHEST_PROTOCOL here. It's fast, efficient, and can store and load most Python objects without any special code. It can also be used as a local persistent cache.

Configuration file

Definitely YAML. Nothing beats its simplicity for anything humans need to read or edit. It has been successfully used by Ansible and many other projects. In some cases, you may prefer to use direct Python modules as configuration files. This might be the right choice, but it's not serialization, it's actually part of the program, not a separate configuration file.

Web API

JSON is the clear winner here. Today, Web APIs are most commonly used by JavaScript web applications that use JSON natively. Some web APIs may return other formats (e.g. csv for dense tabular result sets), but I think you can pack the csv data into JSON with minimal overhead (no need to repeat each row as an object with all column names ).

High-capacity/low-latency large-scale communication

Use one of the binary protocols: Protobuf (if architecture is required), MessagePack, or CBOR. Run your own tests to verify the performance and representation capabilities of each option.

in conclusion

Serialization and deserialization of Python objects is an important aspect of distributed systems. You cannot send Python objects directly over the network. You often need to interoperate with other systems implemented in other languages, and sometimes you just want to store the state of your program in persistent storage.

Python comes with several serialization schemes in its standard library, and many more are available as third-party modules. Understanding all the options and the pros and cons of each will allow you to choose the method that best suits your situation.

The above is the detailed content of Python object serialization and deserialization: Part 2. For more information, please follow other related articles on the PHP Chinese website!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Python: A Deep Dive into Compilation and InterpretationMay 12, 2025 am 12:14 AM

Pythonusesahybridmodelofcompilationandinterpretation:1)ThePythoninterpretercompilessourcecodeintoplatform-independentbytecode.2)ThePythonVirtualMachine(PVM)thenexecutesthisbytecode,balancingeaseofusewithperformance.

Is Python an interpreted or a compiled language, and why does it matter?May 12, 2025 am 12:09 AM

Pythonisbothinterpretedandcompiled.1)It'scompiledtobytecodeforportabilityacrossplatforms.2)Thebytecodeistheninterpreted,allowingfordynamictypingandrapiddevelopment,thoughitmaybeslowerthanfullycompiledlanguages.

For Loop vs While Loop in Python: Key Differences ExplainedMay 12, 2025 am 12:08 AM

Forloopsareidealwhenyouknowthenumberofiterationsinadvance,whilewhileloopsarebetterforsituationswhereyouneedtoloopuntilaconditionismet.Forloopsaremoreefficientandreadable,suitableforiteratingoversequences,whereaswhileloopsoffermorecontrolandareusefulf

For and While loops: a practical guideMay 12, 2025 am 12:07 AM

Forloopsareusedwhenthenumberofiterationsisknowninadvance,whilewhileloopsareusedwhentheiterationsdependonacondition.1)Forloopsareidealforiteratingoversequenceslikelistsorarrays.2)Whileloopsaresuitableforscenarioswheretheloopcontinuesuntilaspecificcond

Python: Is it Truly Interpreted? Debunking the MythsMay 12, 2025 am 12:05 AM

Pythonisnotpurelyinterpreted;itusesahybridapproachofbytecodecompilationandruntimeinterpretation.1)Pythoncompilessourcecodeintobytecode,whichisthenexecutedbythePythonVirtualMachine(PVM).2)Thisprocessallowsforrapiddevelopmentbutcanimpactperformance,req

Python concatenate lists with same elementMay 11, 2025 am 12:08 AM

ToconcatenatelistsinPythonwiththesameelements,use:1)the operatortokeepduplicates,2)asettoremoveduplicates,or3)listcomprehensionforcontroloverduplicates,eachmethodhasdifferentperformanceandorderimplications.

Interpreted vs Compiled Languages: Python's PlaceMay 11, 2025 am 12:07 AM

Pythonisaninterpretedlanguage,offeringeaseofuseandflexibilitybutfacingperformancelimitationsincriticalapplications.1)InterpretedlanguageslikePythonexecuteline-by-line,allowingimmediatefeedbackandrapidprototyping.2)CompiledlanguageslikeC/C transformt

For and While loops: when do you use each in python?May 11, 2025 am 12:05 AM

Useforloopswhenthenumberofiterationsisknowninadvance,andwhileloopswheniterationsdependonacondition.1)Forloopsareidealforsequenceslikelistsorranges.2)Whileloopssuitscenarioswheretheloopcontinuesuntilaspecificconditionismet,usefulforuserinputsoralgorit

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Roblox: Grow A Garden - Complete Mutation Guide

3 weeks agoByDDD

Roblox: Bubble Gum Simulator Infinity - How To Get And Use Royal Keys

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

How to fix KB5055612 fails to install in Windows 10?

3 weeks agoByDDD

Nordhold: Fusion System, Explained

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Mandragora: Whispers Of The Witch Tree - How To Unlock The Grappling Hook

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

SublimeText3 English version

Recommended: Win version, supports code prompts!

Zend Studio 13.0.1

Powerful PHP integrated development environment

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

MantisBT

Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

VSCode Windows 64-bit Download

A free and powerful IDE editor launched by Microsoft

Hot Topics

1666

1425

1325

1273

1252