Detailed explanation of Python3's solution to difficult character encoding problems
One of the most important improvements in Python3 is that it fixes the big pitfall that strings and character encoding left behind in Python2. Why is encoding in Python2 so painful? It comes down to a few flaws in Python2's string design:
- Using ASCII as the default encoding, which is very unfriendly to Chinese text processing.
- Awkwardly splitting strings into two types, unicode and str, which misleads developers.
Of course, these are not bugs. As long as you take care when processing text, you can avoid these pitfalls. But Python3 resolves both problems well.
First, Python3 sets the system default encoding to UTF-8:
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>>
Second, text and binary data are clearly distinguished, represented by str and bytes respectively. All text is represented by the str type, which can hold any character in the Unicode character set, while binary data is represented by a new data type, bytes.
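To make the distinction concrete, here is a minimal sketch that compares the same content as text and as bytes. The variable names are only illustrative; the point is that the length of a str counts characters, while the length of the encoded bytes counts bytes.

```python
# Illustrative variable names; the sample string is the one used in this article.
text = "Python之禅"

# Encoding turns text (str) into binary data (bytes).
data = text.encode("utf-8")

print(type(text), len(text))  # length in characters (code points)
print(type(data), len(data))  # length in bytes; each Chinese character takes 3 bytes in UTF-8
```

Here the str has 8 characters, while its UTF-8 form has 12 bytes (6 for "Python" plus 3 for each of the two Chinese characters).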
str
>>> a = "a"
>>> a
'a'
>>> type(a)
<class 'str'>
>>> b = "禅"
>>> b
'禅'
>>> type(b)
<class 'str'>
bytes
In Python3, adding 'b' before the opening quotation mark explicitly marks the literal as a bytes object.
>>> c = b'a'
>>> c
b'a'
>>> type(c)
<class 'bytes'>
>>> d = b'\xe7\xa6\x85'
>>> d
b'\xe7\xa6\x85'
>>> type(d)
<class 'bytes'>
>>>
>>> e = b'禅'
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
The bytes type supports most of the same operations as str, such as slicing, indexing, concatenation with +, and repetition with *. Note that indexing a bytes object returns an integer rather than a one-byte bytes object. However, + cannot be applied to a str and a bytes object together, although Python2 allowed mixing its str and unicode types this way.
>>> b"a"+b"c"
b'ac'
>>> b"a"*2
b'aa'
>>> b"abcdef\xd6"[1:]
b'bcdef\xd6'
>>> b"abcdef\xd6"[-1]
214
>>> b"a" + "b"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
encode and decode
Conversion between str and bytes can be done using the encode and decode methods.
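Because str and bytes cannot be concatenated directly, one side has to be converted explicitly first. The following is a minimal sketch of both directions; the helper names join_as_text and join_as_bytes are hypothetical and exist only for this illustration, and UTF-8 is assumed as the encoding.

```python
def join_as_text(prefix: bytes, suffix: str, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode the bytes side, then concatenate as str."""
    return prefix.decode(encoding) + suffix


def join_as_bytes(prefix: bytes, suffix: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: encode the str side, then concatenate as bytes."""
    return prefix + suffix.encode(encoding)


print(join_as_text(b"a", "b"))   # ab
print(join_as_bytes(b"a", "b"))  # b'ab'
```

Which direction to convert depends on what the result is for: text that will be displayed or processed is usually kept as str, while data headed for a file or socket in binary mode is kept as bytes.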
encode converts characters to bytes. By default, it uses UTF-8.
>>> s = "Python之禅"
>>> s.encode()
b'Python\xe4\xb9\x8b\xe7\xa6\x85'
>>> s.encode("gbk")
b'Python\xd6\xae\xec\xf8'
decode converts bytes back to characters. By default, it also uses UTF-8.
>>> b'Python\xe4\xb9\x8b\xe7\xa6\x85'.decode()
'Python之禅'
>>> b'Python\xd6\xae\xec\xf8'.decode("gbk")
'Python之禅'
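The encoding passed to decode has to match the one used to produce the bytes. As a hedged sketch of what can go wrong, the snippet below decodes the GBK bytes from the example above with the default UTF-8 codec, and then uses the standard errors="replace" option of bytes.decode to get a lossy result instead of an exception. The variable names are illustrative only.

```python
gbk_bytes = "Python之禅".encode("gbk")

# Decoding with the default (UTF-8) fails, because these bytes are not valid UTF-8.
failed = False
try:
    gbk_bytes.decode()
except UnicodeDecodeError as exc:
    failed = True
    print("decode failed:", exc.reason)

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
lossy = gbk_bytes.decode("utf-8", errors="replace")
print(lossy)

# Using the matching codec recovers the original text.
print(gbk_bytes.decode("gbk"))
```

The lossy result is useful for logging or debugging, but it does not recover the original text; only decoding with the codec that produced the bytes does.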