1 Unicode
The basic unit of computer storage is the byte, which consists of 8 bits. Because English needs only 26 letters plus a handful of symbols, each English character fits in a single byte. Other languages (such as Chinese, Japanese, and Korean) have so many characters that they must use multiple bytes per character.
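As a rough illustration of the byte counts involved, the sketch below encodes a few characters with legacy national encodings from Python's standard codec set. The sample characters and the choice of codecs (gbk, shift_jis, euc_kr) are illustrative assumptions, not something the text prescribes.

```python
# Byte counts for single characters under a few legacy encodings.
# The sample characters and codecs are illustrative choices.
samples = [
    ("A", "ascii"),       # English letter
    ("中", "gbk"),        # Chinese
    ("日", "shift_jis"),  # Japanese
    ("한", "euc_kr"),     # Korean
]

# Map each character to the number of bytes its encoding needs.
byte_counts = {char: len(char.encode(codec)) for char, codec in samples}

for char, codec in samples:
    print(f"{char!r} encoded as {codec}: {byte_counts[char]} byte(s)")
```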
As computing spread, encodings for non-Latin scripts kept developing, but two major limitations remained:
- No multilingual support: the encoding scheme of one language cannot be used for another language.
- No unified standard: Chinese alone, for example, has several encoding standards, such as GBK, GB2312, and GB18030.
Because the encodings are not uniform, developers have to convert back and forth between them, and errors are inevitable. The Unicode standard was proposed to solve this inconsistency. Unicode organizes and encodes most of the world's writing systems, allowing computers to process text in a uniform way. It currently contains more than 140,000 characters and supports multiple languages by design. (The "uni" in Unicode shares its root with "unification".)
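A minimal sketch of the mismatch described above, with Unicode acting as the common pivot between encodings. The sample text is an arbitrary choice; the codec names are the ones shipped with Python.

```python
text = "中文"

gbk_bytes = text.encode("gbk")      # legacy Chinese encoding
utf8_bytes = text.encode("utf-8")   # a Unicode encoding

# The same text yields different byte sequences under different encodings.
print(gbk_bytes.hex(), utf8_bytes.hex())

# Decoding with the wrong encoding fails here (in other cases it may
# silently produce garbage instead).
try:
    gbk_bytes.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True

# Converting between encodings goes through Unicode code points.
converted = gbk_bytes.decode("gbk").encode("utf-8")
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)
```

Every character has exactly one code point regardless of how it is later serialized, which is what makes the round trip through str safe.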
2 Unicode in Python
2.1 Benefits of Unicode objects
Since Python 3, the str object is represented internally as Unicode, which is why it is called a Unicode object in the source code. The advantage of this representation is that the core logic of a program works on Unicode throughout, and decoding and encoding are needed only at the input and output layers, which avoids most encoding problems.
[Figure: input is decoded to Unicode, the core logic works on Unicode, and output is encoded.]
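The decode-at-input, encode-at-output pattern can be sketched as follows. The function name normalize_greeting and the choice of GBK input and UTF-8 output are hypothetical, picked only for illustration.

```python
def normalize_greeting(raw: bytes, in_encoding: str = "gbk",
                       out_encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: bytes in, bytes out, Unicode in between."""
    text = raw.decode(in_encoding)       # input layer: bytes -> str
    result = text.strip().upper() + "!"  # core logic works on str only
    return result.encode(out_encoding)   # output layer: str -> bytes


incoming = " 你好 python ".encode("gbk")
outgoing = normalize_greeting(incoming)
print(outgoing.decode("utf-8"))
```

The core logic never sees bytes, so it does not need to know which encodings are in use at either edge.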
The memory cost of each additional character depends on the text:

    >>> import sys
    >>> sys.getsizeof('ab') - sys.getsizeof('a')
    1
    >>> sys.getsizeof('一二') - sys.getsizeof('一')
    2
    >>> sys.getsizeof('😀😀') - sys.getsizeof('😀')
    4

As this shows, Python optimizes Unicode objects internally: the underlying storage unit is chosen according to the text content. Depending on the range of the Unicode code points in the text, the underlying storage falls into three categories:
- PyUnicode_1BYTE_KIND: all code points are between U+0000 and U+00FF
- PyUnicode_2BYTE_KIND: all code points are between U+0000 and U+FFFF, and at least one character has a code point greater than U+00FF
- PyUnicode_4BYTE_KIND: all code points are between U+0000 and U+10FFFF, and at least one character has a code point greater than U+FFFF

The corresponding enumeration is as follows:
    enum PyUnicode_Kind {
    /* String contains only wstr byte characters.  This is only possible
       when the string was created with a legacy API and _PyUnicode_Ready()
       has not been called yet.  */
        PyUnicode_WCHAR_KIND = 0,
    /* Return values of the PyUnicode_KIND() macro: */
        PyUnicode_1BYTE_KIND = 1,
        PyUnicode_2BYTE_KIND = 2,
        PyUnicode_4BYTE_KIND = 4
    };
A different storage unit is selected for each category:
    /* Py_UCS4 and Py_UCS2 are typedefs for the respective
       unicode representations. */
    typedef uint32_t Py_UCS4;
    typedef uint16_t Py_UCS2;
    typedef uint8_t Py_UCS1;
The correspondence is as follows:
Text kind | Character storage unit | Character storage unit size (bytes)
---|---|---
PyUnicode_1BYTE_KIND | Py_UCS1 | 1
PyUnicode_2BYTE_KIND | Py_UCS2 | 2
PyUnicode_4BYTE_KIND | Py_UCS4 | 4
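The classification can be mirrored in a few lines of Python. This is a sketch, not CPython's own code: storage_unit_size and measured_unit_size are hypothetical helpers, and the measurement assumes that sys.getsizeof reflects the compact string layout, which is a CPython implementation detail.

```python
import sys


def storage_unit_size(text: str) -> int:
    """Predict the per-character storage size (1, 2 or 4 bytes)."""
    maxchar = max(map(ord, text), default=0)
    if maxchar <= 0xFF:
        return 1   # fits Py_UCS1
    if maxchar <= 0xFFFF:
        return 2   # fits Py_UCS2
    return 4       # needs Py_UCS4


def measured_unit_size(text: str) -> int:
    """Measure how many bytes each extra character costs (CPython only)."""
    # Two freshly built strings of the same kind differ only in the
    # length of their text buffer.
    grown = sys.getsizeof(text * 3) - sys.getsizeof(text * 2)
    return grown // len(text)


for sample in ("ab", "一二", "\U0001F600\U0001F600"):
    print(ascii(sample), storage_unit_size(sample), measured_unit_size(sample))
```

Note that a single wide character is enough to move the whole string to the larger storage unit, since the unit is chosen per string, not per character.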
Each Unicode object also records several state flags:
- interned: whether the string is interned
- kind: the text kind, which determines the size of the underlying character storage unit
- compact: the memory allocation method, that is, whether the object and its text buffer are allocated together or separately
- ascii: whether the text is pure ASCII

The PyUnicode_New function initializes a Unicode object from the number of characters (size) and the largest code point (maxchar). Based on maxchar, it selects the most compact character storage unit and underlying structure for the object. The source code is fairly long, so it is not listed here; its logic is summarized in table form below.
maxchar range | 0 to 127 | 128 to 255 | 256 to 65535 | 65536 to MAX_UNICODE
---|---|---|---|---
kind | PyUnicode_1BYTE_KIND | PyUnicode_1BYTE_KIND | PyUnicode_2BYTE_KIND | PyUnicode_4BYTE_KIND
ascii | 1 | 0 | 0 | 0
Character storage unit size (bytes) | 1 | 1 | 2 | 4
Underlying structure | PyASCIIObject | PyCompactUnicodeObject | PyCompactUnicodeObject | PyCompactUnicodeObject
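The branch logic summarized in the table can be paraphrased in Python. This is a simplified model, not the C source: Layout, choose_layout and layout_of are hypothetical names, and memory allocation and error reporting in PyUnicode_New are reduced here to a plain ValueError.

```python
from typing import NamedTuple

MAX_UNICODE = 0x10FFFF  # largest valid Unicode code point


class Layout(NamedTuple):
    kind: str        # text kind
    ascii: int       # 1 if the text is pure ASCII, else 0
    char_size: int   # bytes per character storage unit
    struct: str      # underlying C structure


def choose_layout(maxchar: int) -> Layout:
    """Model of how the storage layout depends on maxchar."""
    if maxchar < 128:
        return Layout("PyUnicode_1BYTE_KIND", 1, 1, "PyASCIIObject")
    if maxchar < 256:
        return Layout("PyUnicode_1BYTE_KIND", 0, 1, "PyCompactUnicodeObject")
    if maxchar < 65536:
        return Layout("PyUnicode_2BYTE_KIND", 0, 2, "PyCompactUnicodeObject")
    if maxchar > MAX_UNICODE:
        raise ValueError("maxchar exceeds the Unicode range")
    return Layout("PyUnicode_4BYTE_KIND", 0, 4, "PyCompactUnicodeObject")


def layout_of(text: str) -> Layout:
    """Apply the model to an existing string via its largest code point."""
    return choose_layout(max(map(ord, text), default=0))


for sample in ("abc", "caf\u00e9", "一二", "\U0001F600"):
    print(ascii(sample), layout_of(sample))
```

Pure ASCII text gets the smallest structure, which is why the ascii flag is tracked separately from kind even though both cases use one byte per character.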
