
Detailed explanation of Python3's solution to difficult character encoding problems

PHPz
2017-04-02 13:23:49

One of the most important improvements in Python 3 is that it fixes the pitfalls Python 2 left around strings and character encoding. Why was encoding in Python so painful? Two flaws in Python 2's string design are to blame:
- Using ASCII as the default encoding, which makes processing Chinese (and any other non-ASCII text) very unfriendly.
- Awkwardly splitting strings into two types, unicode and str, which misleads developers.

Strictly speaking, neither of these is a bug, and careful handling avoids the pitfalls. Python 3, however, resolves both problems cleanly.

First, Python 3 sets the default string encoding to UTF-8:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>>
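As a small sketch of what this default means in practice, the snippet below compares calling encode() with no argument against naming the default codec explicitly. The variable names are illustrative only and do not come from the original article.

```python
import sys

# Default codec used when encode()/decode() are called without arguments.
default_codec = sys.getdefaultencoding()

text = "禅"
implicit = text.encode()               # no codec named
explicit = text.encode(default_codec)  # codec named explicitly

print(default_codec)         # utf-8
print(implicit == explicit)  # True
```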

Second, text and binary data are now clearly separated, represented by str and bytes respectively. All text is represented by the str type, which can hold any character in the Unicode character set, while binary data is represented by a new type, bytes.

str

>>> a = "a"
>>> a
'a'
>>> type(a)
<class 'str'>
>>> b = "禅"
>>> b
'禅'
>>> type(b)
<class 'str'>
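To illustrate that a str holds Unicode characters rather than bytes, here is a minimal sketch using the built-ins ord() and chr(). The variable names are made up for this example.

```python
ch = "禅"

code_point = ord(ch)         # integer Unicode code point of the character
same_char = chr(code_point)  # back from code point to a one-character str
escaped = "\u7985"           # the same character written as a Unicode escape

mixed = "Python之禅"          # ASCII and Chinese characters in one str

print(hex(code_point))  # 0x7985
print(len(mixed))       # 8 (characters, not bytes)
```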

bytes

In Python 3, putting b before the opening quotation mark marks the literal as an object of type bytes, which is simply a sequence of bytes. A bytes literal may contain characters in the ASCII range and any other byte values written as hexadecimal escapes, but it cannot contain non-ASCII characters such as Chinese directly.

>>> c = b'a'
>>> c
b'a'
>>> type(c)
<class 'bytes'>

>>> d = b'\xe7\xa6\x85'
>>> d
b'\xe7\xa6\x85'
>>> type(d)
<class 'bytes'>

>>> e = b'禅'
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
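Since a bytes literal cannot hold the character directly, the bytes have to be produced some other way. The sketch below shows a few standard options, assuming the UTF-8 bytes of "禅" are what is wanted. The variable names are illustrative.

```python
# Four ways to obtain the UTF-8 bytes of "禅" without a non-ASCII bytes literal
from_encode = "禅".encode("utf-8")        # encode a str
from_escape = b"\xe7\xa6\x85"             # hexadecimal escapes in a literal
from_hex = bytes.fromhex("e7a685")        # parse a hex string
from_ints = bytes([0xE7, 0xA6, 0x85])     # build from integer byte values

print(from_encode == from_escape == from_hex == from_ints)  # True
```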

The bytes type supports much the same operations as str, such as slicing, indexing, concatenation with +, and repetition with *. Note that indexing a bytes object returns an integer rather than a one-byte bytes object. The + operation cannot be applied between a str and a bytes object, although Python 2 allowed it.

>>> b"a"+b"c"
b'ac'
>>> b"a"*2
b'aa'
>>> b"abcdef\xd6"[1:]
b'bcdef\xd6'
>>> b"abcdef\xd6"[-1]
214
>>> b"a" + "b"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
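The session above suggests two practical points: indexing yields an integer while slicing yields bytes, and combining str with bytes needs an explicit conversion first. A minimal sketch, with illustrative variable names and ASCII chosen as the codec purely for the example:

```python
data = b"abcdef\xd6"

last_as_int = data[-1]      # indexing gives an int: 214
last_as_bytes = data[-1:]   # slicing gives bytes: b'\xd6'
values = list(b"abc")       # iterating gives ints: [97, 98, 99]

# To combine str and bytes, convert one side explicitly.
joined_bytes = b"a" + "b".encode("ascii")  # b'ab'
joined_text = b"a".decode("ascii") + "b"   # 'ab'

print(last_as_int, last_as_bytes, values)
print(joined_bytes, joined_text)
```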

encode and decode

Conversion between str and bytes can be done using the encode and decode methods.


encode converts characters (str) into bytes. UTF-8 is used by default; pass a codec name to use a different encoding.

>>> s = "Python之禅"
>>> s.encode()
b'Python\xe4\xb9\x8b\xe7\xa6\x85'
>>> s.encode("gbk")
b'Python\xd6\xae\xec\xf8'
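Encoding can fail when the target codec has no representation for a character. The sketch below, which goes slightly beyond the original article, uses the standard errors parameter of str.encode to show the failure and two ways of handling it. ASCII is chosen only because it cannot represent Chinese characters.

```python
text = "Python之禅"

# ASCII has no representation for the Chinese characters, so this raises.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Standard error handlers trade the exception for a lossy or escaped result.
replaced = text.encode("ascii", errors="replace")          # b'Python??'
escaped = text.encode("ascii", errors="backslashreplace")  # b'Python\\u4e4b\\u7985'

print(strict_failed, replaced, escaped)
```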

decode converts bytes back into characters (str). Like encode, it uses UTF-8 by default; pass the codec name explicitly when the bytes were produced with a different encoding.

>>> b'Python\xe4\xb9\x8b\xe7\xa6\x85'.decode()
'Python之禅'
>>> b'Python\xd6\xae\xec\xf8'.decode("gbk")
'Python之禅'
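Decoding only recovers the original text when the codec matches the one used for encoding. As a hedged sketch, the snippet below round-trips the same string through UTF-8 and GBK, then shows what happens when the wrong codec is used. Latin-1 and ASCII are picked here only as convenient wrong choices.

```python
text = "Python之禅"
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# Matching codecs round-trip cleanly.
utf8_roundtrip = utf8_bytes.decode("utf-8")
gbk_roundtrip = gbk_bytes.decode("gbk")

# A wrong codec may silently produce garbage: Latin-1 maps every byte
# to some character, so no error is raised.
garbled = utf8_bytes.decode("latin-1")

# Or it may fail outright: ASCII rejects any byte above 0x7F.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(utf8_roundtrip == text, gbk_roundtrip == text)  # True True
print(garbled == text, ascii_failed)                  # False True
```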
