
Detailed introduction to Python character encoding

高洛峰 (Original), 2017-03-28 17:19:58

1. Introduction to character encoding

1.1. ASCII

ASCII (American Standard Code for Information Interchange) is a single-byte encoding. In the beginning, the computer world had only English, and a single byte can represent 256 different values — enough for all English characters and many control symbols. ASCII, however, defines only the lower half of that range (values below \x80), and that unused upper half is what later made MBCS possible.

Soon, however, other languages appeared in the computer world, and single-byte ASCII could no longer meet the demand. Each language then developed its own encoding. Because a single byte can represent too few characters, and compatibility with ASCII was also required, these encodings use multiple bytes per character — GBxxx, BIGxxx, and so on. Their rule is: if a byte is below \x80, it still represents an ASCII character; if it is \x80 or above, it combines with the next byte (two bytes in total) to represent one character, after which that next byte is skipped and scanning continues.
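This lead-byte rule can be observed directly with Python's GBK codec — a minimal sketch (byte-string literals are used so it runs under both Python 2.7 and 3):

```python
# 'A' and 'B' are single ASCII bytes (below \x80); \xba\xba is one
# GBK character ('汉'): the lead byte \xba is >= \x80, so it pairs
# with the byte that follows it.
data = b'A\xba\xbaB'
text = data.decode('gbk')
print('%d bytes decode to %d characters' % (len(data), len(text)))
```

Four bytes decode to only three characters, because the two middle bytes form a single character.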

At this point IBM invented the concept of the Code Page: these encodings were collected together and assigned page numbers. GBK is page 936, i.e. CP936, so CP936 can also be used to refer to GBK. MBCS (Multi-Byte Character Set) is the collective name for these encodings. So far all of them use at most two bytes, so they are sometimes also called DBCS (Double-Byte Character Set). It must be clear that MBCS is not one specific encoding: depending on the region setting, MBCS refers to different encodings, and Python on Linux has no encoding named MBCS at all. On Windows you will not even see the name MBCS, because Microsoft, to sound more fashionable, uses ANSI instead — the ANSI entry in Notepad's Save As dialog is MBCS, and under the default locale of Simplified Chinese Windows it means GBK.

Later, people came to feel that so many encodings had made the world too complicated and everyone's head hurt, so they sat down together, slapped their heads, and came up with a plan: the characters of all languages would be represented by one and the same character set. This is Unicode. The original Unicode standard, UCS-2, uses two bytes per character, which is why you often hear that Unicode represents a character with two bytes. Soon afterwards some felt that 256*256 was still too few, so the UCS-4 standard appeared, using four bytes per character — but the one we use most is still UCS-2.

UCS (Unicode Character Set) is just a table mapping characters to code points; for example, the code point of the character "汉" is 6C49. How characters are actually transmitted and stored is the job of UTF (UCS Transformation Format).

At first this was very simple: just store the UCS code point directly, which gives UTF-16. For example, "汉" (U+6C49) can be stored as \x6C\x49 (UTF-16-BE), or byte-swapped as \x49\x6C (UTF-16-LE). But after using it for a while, Americans felt they were taking a big loss: English letters used to need only one byte, and now they need two, doubling the space consumption. So UTF-8 appeared. UTF-8 is an awkward encoding: concretely, it is variable-length and compatible with ASCII, and ASCII characters take only one byte. But the bytes saved there must be taken from somewhere else: you have probably heard that Chinese characters take three bytes in UTF-8, and the characters that need four bytes are even more tearful... (Search for the details of how UCS-2 is transformed into UTF-8.)
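These size trade-offs are easy to verify — a quick sketch that runs under both Python 2.7 and 3:

```python
u = u'\u6c49'                                  # the character '汉', code point U+6C49
assert u.encode('utf-16-be') == b'\x6c\x49'    # the code point itself, big-endian
assert u.encode('utf-16-le') == b'\x49\x6c'    # the same two bytes, swapped
assert u.encode('utf-8') == b'\xe6\xb1\x89'    # three bytes for a CJK character
assert u'A'.encode('utf-8') == b'A'            # ASCII still costs only one byte
print('all checks passed')
```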

Another thing worth mentioning is the BOM (Byte Order Mark). When we save a file, the encoding used is not stored with it; when opening it, we must remember which encoding was used when saving and open it with that same encoding, which causes a lot of trouble. (You may ask: doesn't Notepad let you choose an encoding when opening a file? Try opening Notepad first and then using File -> Open, and see.) UTF introduces the BOM so a file can announce its own encoding: if the first few bytes read are one of the BOMs, the text that follows is in the corresponding encoding:

BOM_UTF8 '\xef\xbb\xbf'

BOM_UTF16_LE '\xff\xfe'

BOM_UTF16_BE '\xfe\xff'

Not all editors write a BOM, but even without one, Unicode can still be read — only then, just as with MBCS encodings, the concrete encoding must be specified separately, otherwise decoding will fail.

You may have heard that UTF-8 does not need a BOM. That is not quite true — it is just that most editors treat UTF-8 as the default encoding when no BOM is present. Even Notepad, which defaults to ANSI (MBCS) when saving, first tries UTF-8 when reading a file, and uses UTF-8 if the bytes decode successfully. This awkward behavior of Notepad causes a bug: create a new text file, type "姹姧", save it as ANSI (MBCS), and when you open it again it will have turned into "汉a". You might as well try it :)
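The BOM values above are available as constants in the codecs module, and the sniffing that editors perform can be sketched roughly like this (sniff_bom is a hypothetical helper written for illustration, not a library API; it runs under both Python 2.7 and 3):

```python
import codecs

def sniff_bom(data):
    # Try each known BOM against the start of the raw bytes.
    for bom, name in ((codecs.BOM_UTF8, 'UTF-8'),
                      (codecs.BOM_UTF16_LE, 'UTF-16-LE'),
                      (codecs.BOM_UTF16_BE, 'UTF-16-BE')):
        if data.startswith(bom):
            return name, data[len(bom):]
    # No BOM: the encoding must be known (or guessed) some other way.
    return None, data

enc, payload = sniff_bom(b'\xef\xbb\xbf\xe6\xb1\x89')
print(enc)                        # UTF-8
print(repr(payload.decode(enc)))
```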

2. Encoding issues in Python 2.x

2.1. str and unicode

str and unicode are both subclasses of basestring. Strictly speaking, str is really a byte string: a sequence of bytes produced by encoding unicode text. Calling len() on the UTF-8-encoded str '汉' gives 3, because the UTF-8 encoding of '汉' is '\xE6\xB1\x89'.

unicode is the true string type, obtained by decoding a byte string str with the correct character encoding, and len(u'汉') == 1.
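A minimal check of this difference (shown with a byte-string literal so it runs under Python 2.7 and 3 alike, where the two types are called str/unicode and bytes/str respectively):

```python
s = b'\xe6\xb1\x89'        # the UTF-8 bytes of '汉' (a str in Python 2)
u = s.decode('UTF-8')      # the real one-character string
print('%d bytes, %d character(s)' % (len(s), len(u)))
```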

Let's take a look at the two basestring instance methods encode() and decode(). Once the difference between str and unicode is clear, these two methods are no longer confusing:

# coding: UTF-8
 
u = u'汉'
print repr(u) # u'\u6c49'
s = u.encode('UTF-8')
print repr(s) # '\xe6\xb1\x89'
u2 = s.decode('UTF-8')
print repr(u2) # u'\u6c49'
 
# Decoding a unicode object is wrong
# s2 = u.decode('UTF-8')
# Likewise, encoding a str is wrong
# u2 = s.encode('UTF-8')

It should be noted that although calling encode() on a str is conceptually wrong, Python does not necessarily throw an exception: for ASCII-only content it simply returns another str with the same content but a different id, and the same goes for calling decode() on a unicode. (With non-ASCII content these calls do in fact raise, because Python first round-trips through the default ASCII codec.) I really do not understand why encode() and decode() were not placed on unicode and str respectively instead of both on basestring — but since that is how it is, let us be careful to avoid misusing them.

2.2. Character encoding declaration

If non-ASCII characters are used in the source code file, a character encoding declaration needs to be made in the header of the file, as follows:

#-*- coding: UTF-8 -*-

In fact, Python only checks that the line is a comment containing coding: (or coding=) followed by the encoding name — a word like "encoding" matches too — and the other characters are there purely for aesthetics. In addition, Python offers many character encodings, with many aliases that are not case-sensitive; UTF-8, for example, can be written as u8. See http://docs.python.org/library/codecs.html#standard-encodings.
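Alias resolution can be inspected with codecs.lookup(), which returns a CodecInfo object whose name attribute is the canonical encoding name — a small sketch:

```python
import codecs

# 'u8', 'UTF8' and 'utf_8' are all case-insensitive aliases of one codec
for alias in ('u8', 'UTF8', 'utf_8', 'UTF-8'):
    print('%-6s -> %s' % (alias, codecs.lookup(alias).name))
```

Every alias resolves to the same canonical name, utf-8.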

It should also be noted that the declared encoding must match the encoding the file is actually saved in, otherwise parsing exceptions are very likely. Today's IDEs usually handle this automatically, re-saving the file in the newly declared encoding, but with plain text editors you need to be careful :)

2.3. Read and write files

When a file is opened with the built-in open(), read() returns a str, which you must decode() with the correct encoding after reading. When writing, if the argument is a unicode, you must first encode() it with the desired output encoding; if it is a str in some other encoding, you must first decode() it with that str's own encoding into unicode, then encode() with the output encoding. Passing a unicode directly to write() makes Python encode it implicitly with the default encoding (normally ASCII), which fails for non-ASCII text, so always do the encoding explicitly.

# coding: UTF-8
 
f = open('test.txt')
s = f.read()
f.close()
print type(s) # <type 'str'>
# The file is known to be GBK-encoded; decode it to unicode
u = s.decode('GBK')
 
f = open('test.txt', 'w')
# Encode to a UTF-8 str
s = u.encode('UTF-8')
f.write(s)
f.close()

In addition, the module codecs provides an open() method, which can specify an encoding to open the file. The file opened using this method will read and return unicode. When writing, if the parameter is unicode, it will be encoded using the encoding specified during open() and then written; if it is str, it will be decoded into unicode according to the character encoding declared in the source code file before performing the aforementioned operation. Compared with the built-in open(), this method is less prone to coding problems.

# coding: GBK
 
import codecs
 
f = codecs.open('test.txt', encoding='UTF-8')
u = f.read()
f.close()
print type(u) # <type 'unicode'>
 
f = codecs.open('test.txt', 'a', encoding='UTF-8')
# Writing unicode
f.write(u)
 
# Writing a str triggers an automatic decode-then-encode:
# this str is GBK-encoded (because of the coding declaration above)
s = '汉'
print repr(s) # '\xba\xba'
# The GBK str is first decoded to unicode, then encoded to UTF-8 and written
f.write(s)
f.close()

2.4. Encoding-related methods

The sys and locale modules provide some functions for querying the default encodings of the current environment.

# coding: gbk
 
import sys
import locale
 
def p(f):
    print '%s.%s(): %s' % (f.__module__, f.__name__, f())
 
# The default character encoding used by the current system
p(sys.getdefaultencoding)
 
# The encoding used to convert Unicode filenames to system filenames
p(sys.getfilesystemencoding)
 
# The default locale, returned as a tuple (language, encoding)
p(locale.getdefaultlocale)
 
# The text-data encoding configured by the user
# (the documentation notes that this function only returns a guess)
p(locale.getpreferredencoding)
 
# \xba\xba is the GBK encoding of '汉'
# mbcs is a discouraged encoding; it is used here only to show why it should be avoided
print r"'\xba\xba'.decode('mbcs'):", repr('\xba\xba'.decode('mbcs'))
 
# Results on the author's Windows machine (region: Chinese (Simplified, PRC))
#sys.getdefaultencoding(): gbk
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp936')
#locale.getpreferredencoding(): cp936
#'\xba\xba'.decode('mbcs'): u'\u6c49'

3. Some suggestions

3.1. Use character encoding declaration, and all source code files in the same project use the same character encoding declaration.

This must be done.

3.2. Abandon str; use unicode everywhere.

Type the u before you type the quotation marks. It is really hard to get used to at first, and you will often forget and have to go back to fix it — but doing so eliminates 90% of encoding problems. If your encoding problems are not serious, you need not follow this advice.

3.3. Use codecs.open() instead of the built-in open().

If your encoding problems are not serious, you need not follow this advice either.

3.4. Character encodings that absolutely need to be avoided: MBCS/DBCS and UTF-16.

The point here is not that GBK and the like must never be used, but that you should never use the encoding literally named 'MBCS' in Python — unless the program will never be ported at all.

In Python, 'MBCS' and 'DBCS' are synonyms, referring to whatever encoding MBCS denotes in the current Windows environment. The Linux build of Python has no such encoding, so a port to Linux is guaranteed to raise exceptions! Moreover, the encoding MBCS refers to changes with the Windows region setting. Here are the results of running the code from section 2.4 under different region settings:

#Chinese (Simplified, PRC)
#sys.getdefaultencoding(): gbk
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp936')
#locale.getpreferredencoding(): cp936
#'\xba\xba'.decode('mbcs'): u'\u6c49'
 
#English (United States)
#sys.getdefaultencoding(): UTF-8
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp1252')
#locale.getpreferredencoding(): cp1252
#'\xba\xba'.decode('mbcs'): u'\xba\xba'
 
#German (Germany)
#sys.getdefaultencoding(): gbk
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp1252')
#locale.getpreferredencoding(): cp1252
#'\xba\xba'.decode('mbcs'): u'\xba\xba'
 
#Japanese (Japan)
#sys.getdefaultencoding(): gbk
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp932')
#locale.getpreferredencoding(): cp932
#'\xba\xba'.decode('mbcs'): u'\uff7a\uff7a'

As you can see, after changing the region, decoding with mbcs gives incorrect results. So when we need GBK, we should write 'GBK' directly, not 'MBCS'.
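Whether 'mbcs' exists at all can be checked at runtime — a small sketch showing the portability trap; on non-Windows platforms even the codec lookup fails:

```python
import codecs
import sys

try:
    codecs.lookup('mbcs')
    print('mbcs codec available (this is Windows)')
except LookupError:
    # On Linux and other platforms the codec simply does not exist,
    # so any code that hard-codes 'mbcs' dies right here.
    print('mbcs codec unavailable on %s' % sys.platform)
```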

The same goes for UTF-16. Although 'UTF-16' is a synonym for 'UTF-16-LE' in the vast majority of operating systems, writing 'UTF-16-LE' directly only costs three extra characters — and if 'UTF-16' ever means 'UTF-16-BE' on some system, you would get wrong results. In practice UTF-16 is used quite rarely, but when it is used you still need to be careful.
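Translated into code, the advice looks like this — a small sketch; note that Python's own 'UTF-16' codec additionally prepends a BOM when encoding, which is one more reason to spell out the byte order:

```python
u = u'\u6c49'                                  # '汉'
assert u.encode('utf-16-le') == b'\x49\x6c'    # explicit little-endian, no BOM
assert u.encode('utf-16-be') == b'\x6c\x49'    # explicit big-endian, no BOM
# The generic codec writes a BOM, so the result is 4 bytes, not 2,
# and it round-trips cleanly only through the same generic codec.
with_bom = u.encode('utf-16')
assert len(with_bom) == 4
assert with_bom.decode('utf-16') == u
print('ok')
```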
