Detailed explanation of Unicode and Python Chinese processing methods
In the Python language, Unicode string processing has always been a confusing problem. Many Python enthusiasts have trouble telling apart Unicode, UTF-8, and the many other encodings. The author was once a member of this "confused group", but after more than half a year of effort I have finally sorted out some of the relationships. They are organized below and shared with fellow Python users, in the hope that this short article can also attract more real experts to join in and improve our Python Chinese environment together.
Some of the statements in this article come from consulting references, while others come from the author applying a "guess and verify" method to various existing encoded data. The author's knowledge is limited, and I am afraid many mistakes are hidden in the text. There are many experts among the readers; if anyone finds an error, please point it out. The author's own embarrassment is a small matter, but leaving readers with a wrong idea is a big one, so do not worry about sparing my face.
Section 1 Text Encoding and Unicode Standard
To explain Unicode strings, we must first start with what Unicode encoding is. As we all know, displaying text has always been a basic problem that computers must solve. A computer is not literate; it actually treats text as a series of "pictures", each "picture" corresponding to one character. When a program displays text, it must use a collection of data that records how each character's "picture" is drawn, look up the data for each character, and "draw" that character onto the screen accordingly. Such a "picture" is called a glyph, and the collection of recorded glyph data is called a character set. To make lookup convenient, the glyph data in a character set must be arranged in order, and each character is assigned a unique ID; this ID is the character's encoding. When a computer processes character data, it always uses this encoding to stand for the character it represents. A character set therefore defines the set of characters a computer can handle. Obviously, different countries define character sets of different sizes, and the corresponding character encodings also differ.
In the history of computing, the most widely used standardized character set is the ASCII character set. It is a standard formulated in the United States and developed for North American users. It uses a 7-bit encoding and can represent 128 characters. This character set was eventually adopted officially by ISO as an international standard and is widely used on all kinds of computer systems. Today the BIOS of every PC contains the glyphs of the ASCII character set, which shows how widespread it is.
However, when computers became popular in other countries, the limitations of ASCII were exposed: its character space is simply too small to hold more characters, yet most languages need far more than 128 characters. To handle their own scripts correctly, official bodies and private parties in various countries began designing their own character encoding sets, and eventually a great many national character encodings emerged, such as ISO-8859-1 for Western European characters, the GB series for Simplified Chinese, and SHIFT-JIS for Japanese. At the same time, to keep each new character set compatible with existing ASCII text, most character sets invariably use the ASCII characters as their first 128 characters and make those encodings correspond one-to-one with ASCII.
In this way the problem of displaying each country's characters was solved, but it brought a new problem: garbled text. The character sets used in different countries and regions were usually not governed by a unified specification, so the various character sets are often incompatible with one another. The encoding of the same character in two different character sets is generally different, and the same encoding generally stands for different characters in different character sets. A piece of text written in encoding A will often show up as a jumble of characters on a system that only supports encoding B. Worse still, different character sets often use encodings of different lengths: programs that can only handle single-byte encodings frequently fail to process double-byte or multi-byte text correctly, producing the infamous "half word" problem. This made the already chaotic situation even more chaotic.
In order to solve these problems once and for all, many large companies and organizations in the industry jointly proposed a standard: Unicode. Unicode is actually a new character encoding system. It encodes every character in the character set with a two-byte ID number, thereby defining a coding space that can hold up to 65536 characters, and it includes all the commonly used characters from the various national encodings around the world. Thanks to careful design of the encoding, Unicode solved the garbled-text and "half word" problems that other character sets cause during data exchange. At the same time, Unicode's designers took into account the fact that a huge amount of font data today still uses the various encodings defined by individual countries, and put forward the design idea of "using Unicode as the internal encoding". That is, the character display systems keep using their original encodings, while the internal logic of applications uses Unicode; when displaying text, a program always converts the Unicode string back into the original encoding for display. This way nobody has to redesign their font data systems just to adopt Unicode. To distinguish it from the encodings defined by individual countries, Unicode's designers call Unicode a "wide character encoding", while the national encodings are customarily called "multi-byte encodings". Today the Unicode system has also introduced a four-byte extended encoding and is gradually converging with UCS-4, that is, the ISO 10646 encoding standard, in the hope that one day the ISO 10646 system can unify all text encodings worldwide.
The Unicode system was greeted with high hopes as soon as it was born, and was quickly accepted as an ISO-recognized international standard. However, during its promotion Unicode met with opposition from European and American users. The reason for their opposition was very simple: the encodings they had been using were one byte long, and a double-byte Unicode processing engine cannot handle that existing single-byte data; moreover, if all existing single-byte text had to be converted to Unicode, the workload would be enormous. Furthermore, if all single-byte text were converted to double-byte Unicode, all their text data would take up twice as much space, and all their processing programs would have to be rewritten. They could not accept this cost.
Although Unicode is an internationally recognized standard, the standardization organization could not ignore the demands of European and American users, the largest group of computer users. After consultation among all parties, a variant version of Unicode was produced: UTF-8. UTF-8 is a multi-byte encoding system, with the following encoding rules:
1. UTF-8 encoding is divided into areas:
The first area is single-byte encoding;
the encoding format is: 0xxxxxxx;
it corresponds to Unicode: 0x0000 - 0x007f
The second area is double-byte encoding;
the encoding format is: 110xxxxx 10xxxxxx;
it corresponds to Unicode: 0x0080 - 0x07ff
The third area is three-byte encoding;
the encoding format is: 1110xxxx 10xxxxxx 10xxxxxx;
it corresponds to Unicode: 0x0800 - 0xffff
The fourth area is four-byte encoding;
the encoding format is: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx;
it corresponds to Unicode: 0x00010000 - 0x001fffff
The fifth area is five-byte encoding;
the encoding format is: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx;
it corresponds to Unicode: 0x00200000 - 0x03ffffff
The sixth area is six-byte encoding;
the encoding format is: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx;
it corresponds to Unicode: 0x04000000 - 0x7fffffff
Among them, the first, second and third areas correspond to Unicode's double-byte encoding space, while the fourth area covers Unicode's four-byte extension (by this definition UTF-8 also has the fifth and sixth areas, but the author did not find them in the GNU glibc library and does not know why);
2. The areas one through six are arranged in order, and the characters within them keep the same relative positions as in Unicode;
3. Unicode characters that cannot be displayed are encoded as zero bytes; in other words, they are not included in UTF-8 (this statement comes from a comment in the GNU C library and may not be consistent with the actual situation);
From the UTF-8 encoding rules it is not hard to see that the 128 codes of the first area are exactly the ASCII codes, so a UTF-8 processing engine can handle ASCII text directly. However, UTF-8's compatibility with ASCII comes at the expense of other encodings. For example, Chinese, Japanese and Korean characters were originally mostly two-byte encodings, but their Unicode code points fall into UTF-8's third area, where each character takes three bytes. In other words, if we convert all existing non-ASCII CJK text into UTF-8, its size grows to 1.5 times the original.
Although the author personally feels that UTF-8's encoding scheme is a bit unfair, it solved the transition from the ASCII world to the Unicode world and has therefore won wide acceptance. Typical examples are XML and Java: the default encoding of XML text is UTF-8, and Java source code can actually be written in UTF-8 (JBuilder users will remember this). There is also the well-known GTK 2.0 in the open-source world, which uses UTF-8 as its internal string encoding.
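As a small illustration of these rules, here is a sketch in Python 2 (which the rest of this article assumes) that checks how many UTF-8 bytes characters from different areas take:

# -*- coding: utf-8 -*-
# Sketch: the UTF-8 length of a character depends on which "area"
# its Unicode code point falls into.
for ch in [u'A', u'\u00e9', u'\u4f60']:   # ASCII, a Latin-1 letter, the CJK character "你"
    utf8_bytes = ch.encode('utf-8')
    print repr(ch), '->', repr(utf8_bytes), '(%d bytes)' % len(utf8_bytes)
# expected: 1 byte for u'A', 2 bytes for u'\xe9', 3 bytes for u'\u4f60'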
Having said so much, it may seem we have strayed from the topic, and many Python enthusiasts may be getting impatient: "What does all this have to do with Python?" All right, let us now turn our attention to the world of Python.
Section 2 Python’s Unicode encoding system
In order to handle multi-language text correctly, Python introduced Unicode strings in version 2.0. Since then, strings in the Python language have come in two flavours: the traditional Python strings that existed long before version 2.0, and the new Unicode strings. In Python we use the unicode() built-in function to "decode" a traditional Python string into a Unicode string, and then use the encode() method of the Unicode string to "encode" it back into a traditional Python string. Every Python user is familiar with the above. But did you know that Python's Unicode string is not a true "Unicode-encoded string"? It follows its own rules, and those rules are very simple:
1. The Python Unicode encoding of ASCII characters is the same as their ASCII encoding; in other words, ASCII text inside a Python Unicode string still uses a single-byte encoding;
2. Characters other than ASCII characters use the two-byte (or four-byte) encoding of the Unicode standard. (The author guesses that the Python community defined such a peculiar standard in order to guarantee the generality of ASCII strings.)
In a typical Python application, Unicode strings are used for internal processing, while terminal display is done with traditional Python strings (in fact, Python's print statement cannot print double-byte Unicode-encoded characters at all). In the Python language, traditional Python strings are the so-called "multi-byte encoded" strings, used to represent strings that have been "encoded" into a concrete character set encoding (such as GB, BIG5, KOI8-R, JIS, ISO-8859-1, and of course UTF-8); Python Unicode strings are "wide character encoded" strings, representing Unicode data "decoded" from a concrete character set encoding. So a Python application that needs to use Unicode will usually process string data in the following way:
def foo(string, encoding = "gb2312"):
    # 1. convert multi-byte string to wide character string
    u_string = unicode(string, encoding)
    # 2. do something ...
    # 3. convert wide character string to printable multi-byte string
    return u_string.encode(encoding)
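As a quick usage sketch (the byte string below is "你好" encoded in GB2312, given here only for illustration):

gb_bytes = '\xc4\xe3\xba\xc3'   # "你好" in GB2312
print foo(gb_bytes)             # decoded to Unicode internally, then re-encoded for output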
Let us give an example. Python colleagues who often use PyGTK2 for X Window programming in a Red Hat Linux environment may have noticed this situation long ago: if we write the following statements directly:
import pygtk
pygtk.require('2.0')
import gtk

main = gtk.Window()        # create a window
main.set_title("你好")      # NOTICE!
When such a statement is executed, a warning will appear on the terminal:
Error converting from UTF-8 to 'GB18030': Invalid character sequence in conversion input
and the program window title will not be set to "你好". But if the user has installed a Chinese codec and changes the last statement above to:
u_string = unicode('你好', 'gb2312')
main.set_title(u_string)
then the program window title will be correctly set to "你好". Why is this?
The reason is simple. The gtk.Window.set_title() method always treats the title string it receives as a Unicode string. When the PyGTK system receives the user's main.set_title() request, somewhere inside it processes the received string roughly like this:
class Window(gtk.Widget):
    ...
    def set_title(self, title_unicode_string):
        ...
        # NOTICE! unicode -> multi-byte utf-8
        real_title_string = title_unicode_string.encode('utf-8')
        ...
        # pass real_title_string to the GTK2 C API to draw the title
        ...
We see that the string title_unicode_string is "encoded" into a new string, real_title_string, inside the program. Obviously, this real_title_string is a traditional Python string, and its encoding is UTF-8. In the previous section the author mentioned that GTK2 uses UTF-8 internally, so the GTK2 core system can display the title correctly after receiving real_title_string.
So what if the title entered by the user is an ASCII string (for example "hello world")? Recalling the definition rules of Python Unicode strings, it is not hard to see that if the user's input is an ASCII string, the result of re-encoding it is itself. In other words, if the value of title_unicode_string is an ASCII string, real_title_string and title_unicode_string will have exactly the same value. An ASCII string is also a valid UTF-8 string, so passing it to the GTK2 system causes no problem.
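A quick way to check this claim in a Python 2 session (a sketch, not PyGTK-specific):

u_title = u'hello world'                 # ASCII-only Unicode string
real_title = u_title.encode('utf-8')     # what set_title() does internally
print real_title == 'hello world'        # True: the UTF-8 bytes are identical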
The example above concerns PyGTK2 under Linux, but similar problems are not limited to PyGTK. Besides PyGTK, today's various Python graphics bindings, such as PyQt, Tkinter and so on, all run into Unicode-related problems to a greater or lesser extent.
Now we have figured out Python's Unicode string encoding mechanism, but the question we most want answered is still unresolved: how can we make Python support Chinese with Unicode? We will explain this in the next section.
Section 3 How to make Python's Unicode string support Chinese
After reading the title of this section, some Python colleagues may disagree: "Why must we use Unicode to process Chinese? Aren't traditional Python strings usually good enough?" Indeed, in general, operations such as string concatenation and substring matching are well served by traditional Python strings. However, when it comes to more advanced string operations, such as regular expression matching over multi-language text, text editing, expression parsing and so on, text that mixes large amounts of single-byte and multi-byte characters is very troublesome to handle with traditional strings. Besides, traditional strings can never solve the damned "half word" problem. If we can use Unicode, these problems can be solved easily. Therefore we must face up to, and try to solve, the problem of handling Chinese with Unicode.
From the introduction in the previous section we know that if we want to use Python's Unicode mechanism to process strings, all we need is an encoding/decoding module that can convert bidirectionally between a multi-byte Chinese encoding (including the GB series and the BIG5 series) and Unicode. In Python terminology such an encoding/decoding module is called a codec. So the next question becomes: how do we write such a codec?
If Python's Unicode mechanism were hard-coded into the Python core, adding a new codec to Python would be an arduous task. Fortunately, the designers of Python are not that stupid: they provide an extremely extensible mechanism through which new codecs can be added to Python very easily.
Python's Unicode handling has three most important components: the codecs.py file, the encodings directory, and the aliases.py file. The first two are located in the Python system library's installation directory (for a Win32 distribution, in $PYTHON_HOME/lib/; for Red Hat Linux, in /usr/lib/python-version/; other systems are similar), and the last one is located in the encodings directory. Below we explain each of the three in turn.
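If you want to locate these files on your own machine, a quick sketch (Python 2) is to ask the modules themselves where they live:

import codecs, encodings
print codecs.__file__      # .../lib/python2.x/codecs.py (or .pyc)
print encodings.__file__   # .../lib/python2.x/encodings/__init__.py (or .pyc)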
Let's look at the codecs.py file first. This file defines the interface that a standard codec module should provide. The specific content can be found in your own Python distribution, so it is not repeated here. According to the definitions in codecs.py, a complete codec must provide at least three classes and one standard function:
1. The Codec class
Purpose:
Used to treat the buffer data (a buffer) passed in by the user as a traditional Python string and to "decode" it into the corresponding Unicode string. A complete Codec class definition must provide two methods, Codec.decode() and Codec.encode():
Codec.decode(input, errors = "strict")
Treats the input data as a traditional Python string and "decodes" it into the corresponding Unicode string.
Parameters:
input: input buffer (can be a string, or any object that can be converted into a string representation)
errors: the error-handling option used when a conversion error occurs. One of the following three values can be chosen:
strict (the default): if an error occurs, a UnicodeError exception is raised;
replace: if an error occurs, a default Unicode replacement character is substituted for the offending character;
ignore: if an error occurs, the character is ignored and analysis of the remaining characters continues.
Return value:
A tuple: the first element is the converted Unicode string, and the second element is the length of the input data consumed.
Codec.encode(input, errors = "strict")
Treats the input data as a Unicode string and "encodes" it into the corresponding traditional Python string.
Parameters:
input: input buffer (usually a Unicode string)
errors: the error-handling option when a conversion error occurs; the allowed values are the same as for the Codec.decode() method.
Return value:
A tuple: the first element is the converted traditional Python string, and the second element is the length of the input data consumed.
2. The StreamReader class (should usually inherit from the Codec class)
Used for file input streams; it provides all read operations on file objects, such as the readline() method.
3. The StreamWriter class (should usually inherit from the Codec class)
Used for file output streams; it provides all write operations on file objects, such as the writelines() method.
4. The getregentry() function
Short for "GET REGistry ENTRY", it is used to obtain the four key objects defined in each codec file. Its function body always has the form:
def getregentry():
    return (Codec().encode, Codec().decode, StreamReader, StreamWriter)
Of all the components mentioned above, only the Codec class and the getregentry() function are actually required. The former must be provided because it is the module that actually performs the conversion; the latter is the standard interface through which the Python system obtains the codec's definition, so it must exist. As for StreamReader and StreamWriter, in theory they can simply inherit from the StreamReader and StreamWriter classes in codecs.py and use their default implementations. Of course, many codecs override these two classes to implement special customizations.
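To make the structure concrete, here is a minimal, hypothetical codec module sketch in the Python 2 style described above. The mapping tables are placeholders (an identity map over ASCII); a real Chinese codec would of course carry full GB or BIG5 mapping data:

# mycodec.py -- a minimal, hypothetical codec skeleton (Python 2 style)
import codecs

# placeholder mapping tables; a real codec would map a full character set
decoding_map = codecs.make_identity_dict(range(128))
encoding_map = codecs.make_encoding_map(decoding_map)

class Codec(codecs.Codec):
    def decode(self, input, errors='strict'):
        # multi-byte string -> (Unicode string, number of bytes consumed)
        return codecs.charmap_decode(input, errors, decoding_map)
    def encode(self, input, errors='strict'):
        # Unicode string -> (multi-byte string, number of characters consumed)
        return codecs.charmap_encode(input, errors, encoding_map)

class StreamReader(Codec, codecs.StreamReader):
    pass

class StreamWriter(Codec, codecs.StreamWriter):
    pass

def getregentry():
    return (Codec().encode, Codec().decode, StreamReader, StreamWriter)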
Next, let us talk about the encodings directory. As its name suggests, the encodings directory is the default place where the Python system stores all installed codecs. All the codecs shipped with the Python distribution can be found here, and by convention every new codec installs itself here as well. Note, however, that the Python system does not actually require all codecs to live here: the user may put a new codec anywhere he likes, as long as it can be found on Python's search path.
Merely installing your own codec somewhere Python can find it is not enough; it must also be registered with Python before the system will use it. To register a new codec you must use the aliases.py file in the encodings directory. This file defines only one hash table, aliases. Each of its keys is the name under which a codec is used, that is, the value of the second argument to the unicode() built-in function; the value for each key is a string naming the module that implements that codec. For example, Python's default codec for parsing UTF-8 is utf_8.py, stored in the encodings subdirectory, so the aliases hash table contains an entry recording this correspondence:
'utf-8' : 'utf_8', # the module `utf_8' is the codec for UTF-8
Similarly, if we write a new codec that parses a 'mycharset' character set, and suppose its code file is mycodec.py, stored in the $PYTHON_HOME/lib/site-packages/mycharset/ directory, then we must add the following line to the aliases hash table:
'mycharset' : 'mycharset.mycodec',
There is no need to write out the full path of mycodec.py here, because the site-packages directory is normally already on the Python system's search path.
Whenever the Python interpreter needs to parse a Unicode string, it automatically loads this aliases.py file from the encodings directory. If mycharset has already been registered in the system, we can then use our own codec just like any other built-in encoding. For example, if mycodec.py was registered as above, we can write:
my_unicode_string = unicode(a_multi_byte_string, 'mycharset')
print my_unicode_string.encode('mycharset')
Now we can summarize the steps needed to write a new codec:
First, we need to write our own codec encoding/decoding module;
Second, we must put this module file somewhere the Python interpreter can find it;
Finally, we must register it in the encodings/aliases.py file.
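Putting the three steps together, the resulting file layout looks roughly like the sketch below (the paths and names are only illustrative):

$PYTHON_HOME/lib/site-packages/mycharset/
    __init__.py                          # may be left empty for now
    mycodec.py                           # the codec module from step one
$PYTHON_HOME/lib/encodings/aliases.py    # add: 'mycharset' : 'mycharset.mycodec',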
In theory, with these three steps we can install our own codec into the system. But that is not quite the end of the story; there is still one small problem. Sometimes, for various reasons, we do not want to casually modify system files (for example, a user may work on a centrally administered system whose administrator does not allow anyone to modify system files). In the steps described above we need to modify the contents of aliases.py, which is a system file. If we cannot modify it, does that mean we cannot add new codecs? No, of course we have a way.
The way is this: modify the contents of the encodings.aliases.aliases hash table at run time.
Using the same assumptions as above, if the administrator of the user's system does not allow the registration information for mycodec.py to be written into aliases.py, we can proceed as follows:
1. Put mycodec.py in some directory, for example /home/myname/mycharset/;
2. Write the /home/myname/mycharset/__init__.py file like this:
import encodings.aliases

# update the aliases hash map
encodings.aliases.aliases.update({
    'mycodec' : 'mycharset.mycodec',
})
From then on, whenever we want to use Python, we add /home/myname/ to the search path, and before using our own codec we execute:
import mycharset # execute the script in mycharset/__init__.py
In this way we can use the new codec without touching the original system files. In addition, with the help of Python's site mechanism we can make this import happen automatically. If you do not know what site is, run the following in your Python interactive environment:
import site
print site.__doc__
and browse the documentation of the site module to understand the trick. If you have Red Hat Linux v8 or v9 at hand, you can also look at the Japanese codec shipped with Red Hat's Python distribution to see how it implements automatic loading. Many colleagues may not be able to find where this Japanese codec lives, so here it is:
Red Hat Linux v8: in the /usr/lib/python2.2/site-packages/japanese/ directory;
Red Hat Linux v9: in the /usr/lib/python2.2/lib-dynload/japanese/ directory;
Hint: Red Hat users should look at the japanese.pth file under the site-packages directory and read it together with the site module's documentation; things should then become clear at once.
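For illustration, a .pth file of this kind might look roughly like the sketch below (the file name and paths are hypothetical). Plain lines are added to sys.path, and a line beginning with "import" is executed by the site module at interpreter start-up, which is how a codec package can register itself automatically:

# hypothetical mycharset.pth placed in site-packages
/home/myname
import mycharset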
Conclusion
I remember boasting on the Dohao forum: "If possible, I could write a (Chinese module) for everyone." Looking back now, I cannot help feeling ashamed of how little I knew my own limits. A foolish graduate student who spends all his time studying, takes only seven courses in a semester and still fails two of them, hardly has the standing to be so arrogant in front of everyone. Now, in the second semester, the burden has grown sharply because of those two courses (ten courses!), while my old parents at home are eagerly hoping their son will do them proud. Trying to keep up with my studies and my work at the same time (I have to assist my supervisor with course tutoring, and a university teaching-reform project also needs me to take the lead) already has me struggling to cope; adding a Chinese module on top of that... Alas, please forgive the author for being unable to do it all and having to break his word.
Therefore, the author ventures to lay out here everything he has learned over the past half year, in the hope of finding a group of people, or even just one person, interested in this project, who can take over the knowledge the author has organized and write a complete Chinese module (it should at least include GB and BIG5; personally I think it should even include the HZ code) and contribute it to everyone, whether paid or unpaid. That would be a blessing for all of us Python enthusiasts. Moreover, the Python distribution still does not include any Chinese support module. Since we love Python, if our work could thereby make a small contribution to Python's development, why not?
Appendix: A few small hints
1. Brother LUO Jian has already written a very good Chinese module (there is a link on Dohao; the file name is showfile.zip, and this module is much faster than the draft version I have finished). It supports both GB2312 and GB18030 encodings, but unfortunately not BIG5. If you are interested, you can download this module and study it;
2. Compared with codecs for other character sets, a Chinese module has one peculiarity: its huge number of characters. Relatively small character sets such as GB2312 are manageable and can be looked up with a hash table. But for the enormous GB18030 encoding, if we simply turn all the data into one gigantic lookup table, the query speed will be unbearably slow (this was the author's biggest headache while writing the module). To write a codec with satisfactory speed, one must design some kind of formula that can derive one encoding from the other by simple arithmetic, or at least narrow it down to an approximate range. This requires the programmer to analyse the whole encoding scheme statistically and try to find regularities. The author believes this is the greatest difficulty in writing a Chinese module. Perhaps because my mathematics is simply too weak, I racked my brains but could not find such a rule. I hope some mathematically gifted reader will not begrudge their advice;
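As a hedged illustration of the two lookup strategies discussed above (the mapping data below are placeholders, not real GB2312 or GB18030 tables):

# Strategy 1: plain hash-table lookup -- acceptable for a small set like GB2312
gb_to_unicode = {
    '\xc4\xe3': u'\u4f60',   # placeholder entry ("你")
    '\xba\xc3': u'\u597d',   # placeholder entry ("好")
}

# Strategy 2: arithmetic over contiguous ranges -- needed for a huge set like
# GB18030, where one (start, end, unicode_start) triple can replace thousands
# of individual table entries whenever a run of codes maps linearly.
linear_ranges = [
    (0x1000, 0x1fff, 0x4e00),   # hypothetical placeholder range
]

def range_decode(code):
    for start, end, uni_start in linear_ranges:
        if start <= code <= end:
            return unichr(uni_start + (code - start))
    return None   # fall back to an explicit table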
3. Chinese encodings fall into two big families: GB and BIG5. The GB family further divides into GB2312, GBK and GB18030, while BIG5 divides into BIG5 and BIG5-HKSCS (corresponding to the original BIG5 and the Hong Kong extension respectively). Although encodings within the same family are backward compatible, given their huge numbers of characters, I personally think it is more reasonable to implement them as separate codecs in order to speed up lookup. Of course, if a conversion formula between the corresponding character sets can be found, this separation becomes unnecessary;