search
HomeBackend DevelopmentPython TutorialLearn about Python encoding and Unicode

I'm sure there are a lot of instructions about Unicode and Python, but in order to facilitate my own understanding and use, I still plan to write a few more things about them.

Byte stream vs Unicode object

Let’s first define a string in Python. When you use the string type, you actually store a byte string.

[  a ][  b ][  c ] = "abc"
[ 97 ][ 98 ][ 99 ] = "abc"

In this example, the string abc is a byte string. 97., 98, and 99 are ASCII codes. The definition in Python 2.x is to treat all strings as ASCII. Unfortunately, ASCII is the least common standard among the Latin character sets.

ASCII uses the first 127 numbers for character mapping. Character maps like windows-1252 and UTF-8 have the same first 127 characters. It is safe to mix string encodings when the value of each byte in your string is less than 127. However, there is a danger in making this assumption, as will be discussed below.

Problems will arise when there are bytes in your string with a value greater than 126. Let's look at a string encoded in windows-1252. The character mapping in Windows-1252 is an 8-bit character mapping, so there will be a total of 256 characters. The first 127 are the same as ASCII, and the next 127 are other characters defined by windows-1252.

A windows-1252 encoded string looks like this:
[ 97 ] [ 98 ] [ 99 ] [ 150 ] = "abc–"

Windows-1252 is still a byte string, but did you see that the value of the last byte is greater than 126? If Python tries to decode this byte stream using the default ASCII standard, it will report an error. Let’s see what happens when Python decodes this string:

>>> x = "abc" + chr(150)
>>> print repr(x)
'abc\x96'
>>> u"Hello" + x
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: &#39;ASCII&#39; codec can&#39;t decode byte 0x96 in position 3: ordinal not in range(128)

Let’s use UTF-8 to encode another string:

A UTF-8 encoded string looks like this:
[ 97 ] [ 98 ] [ 99 ] [ 226 ] [ 128 ] [ 147 ] = "abc–"
[0x61] [0x62] [0x63] [0xe2]  [ 0x80] [ 0x93] = "abc-"

If you pick up the Unicode encoding table that you are familiar with, you will find that the Unicode code point corresponding to the English dash is 8211 (0×2013). This value is greater than the ASCII maximum value of 127. A value larger than one byte can store. Because 8211 (0×2013) is two bytes, UTF-8 must use some tricks to tell the system that it takes three bytes to store one character. Let's look at when Python is going to use the default ASCII to encode a UTF-8 encoded string with a character value greater than 126.

>>> x = "abc\xe2\x80\x93"
>>> print repr(x)
&#39;abc\xe2\x80\x93&#39;
>>> u"Hello" + x
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: &#39;ASCII&#39; codec can&#39;t decode byte 0xe2 in position 3: ordinal not in range(128)

As you can see, Python has always used ASCII encoding by default. When it processes the 4th character, Python throws an error because its value is 226 which is greater than 126. This is the problem with mixed encoding.

Decoding byte stream

When first learning Python Unicode, the term decoding can be confusing. You can decode a byte stream into a Unicode object and encode a Unicode object into a byte stream.

Python needs to know how to decode a byte stream into a Unicode object. When you get a byte stream, you call its "decode" method to create a Unicode object from it.

You'd better decode the byte stream to Unicode as early as possible.

>>> x = "abc\xe2\x80\x93"
>>> x = x.decode("utf-8")
>>> print type(x)
<type &#39;unicode&#39;>
>>> y = "abc" + chr(150)
>>> y = y.decode("windows-1252")
>>> print type(y)
>>> print x + y
abc–abc–

Encoding Unicode into a byte stream

A Unicode object is an encoding-agnostic representation of a text. You can't simply output a Unicode object. It must be turned into a byte string before output. Python would be well-suited for this kind of work, although Python defaults to ASCII when encoding Unicode into a byte stream. This default behavior can be the cause of a lot of headaches.

>>> u = u"abc\u2013"
>>> print u
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: &#39;ascii&#39; codec can&#39;t encode character u&#39;\u2013&#39; in position 3: ordinal not in range(128)
>>> print u.encode("utf-8")
abc–

Using codecs module

The codecs module can provide great help when processing byte streams. You can open files with the defined encoding and the content you read from the file will be automatically converted to Unicode objects.

​Try this:

>>> import codecs
>>> fh = codecs.open("/tmp/utf-8.txt", "w", "utf-8")
>>> fh.write(u"\u2013")
>>> fh.close()

What it does is get a Unicode object and write it to the file in UTF-8 encoding. You can use it in other situations as well.

Try this:

When reading data from a file, codecs.open will create a file object that can automatically convert the UTF-8 encoded file into a Unicode object.

Let’s continue the example above, this time using urllib streams.

>>> stream = urllib.urlopen("http://www.google.com")
>>> Reader = codecs.getreader("utf-8")
>>> fh = Reader(stream)
>>> type(fh.read(1))
<type &#39;unicode&#39;>
>>> Reader
<class encodings.utf_8.StreamReader at 0xa6f890>

Single line version:

>>> fh = codecs.getreader("utf-8")(urllib.urlopen("http://www.google.com"))
>>> type(fh.read(1))

You have to be very careful with codecs modules. What you pass in must be a Unicode object, otherwise it will automatically decode the byte stream as ASCII.

>>> x = "abc\xe2\x80\x93" # our "abc-" utf-8 string
>>> fh = codecs.open("/tmp/foo.txt", "w", "utf-8")
>>> fh.write(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/codecs.py", line 638, in write
  return self.writer.write(data)
File "/usr/lib/python2.5/codecs.py", line 303, in write
  data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: &#39;ascii&#39; codec can&#39;t decode byte 0xe2 in position 3: ordinal not in range(128)

Ouch, Python started using ASCII to decode everything again.

The problem of slicing UTF-8 byte stream

Because a UTF-8 encoded string is a list of bytes, len() and slicing operations do not work properly. First use the string we used before.

[ 97 ] [ 98 ] [ 99 ] [ 226 ] [ 128 ] [ 147 ] = "abc–"

Next do the following:

>>> my_utf8 = "abc–"
>>> print len(my_utf8)
6

What? It looks like 4 characters, but the result of len says it's 6. Because len counts the number of bytes rather than the number of characters.

>>> print repr(my_utf8)
&#39;abc\xe2\x80\x93&#39;

Now let's split this string.

>>> my_utf8[-1] # Get the last char
&#39;\x93&#39;

Let me go, the segmentation result is the last byte, not the last character.

In order to correctly segment UTF-8, you'd better decode the byte stream to create a Unicode object. Then you can operate and count safely.

>>> my_unicode = my_utf8.decode("utf-8")
>>> print repr(my_unicode)
u&#39;abc\u2013&#39;
>>> print len(my_unicode)
4
>>> print my_unicode[-1]
–

When Python automatically encodes/decodes

In some cases, errors will be thrown when Python automatically encodes/decodes using ASCII.

  第一个案例是当它试着将Unicode和字节串合并在一起的时候。

>>> u"" + u"\u2019".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: &#39;ascii&#39; codec can&#39;t decode byte 0xe2 in position 0:   ordinal not in range(128)

  在合并列表的时候会发生同样的情况。Python在列表里有string和Unicode对象的时候会自动地将字节串解码为Unicode。

>>> ",".join([u"This string\u2019s unicode", u"This string\u2019s utf-8".encode("utf-8")])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: &#39;ascii&#39; codec can&#39;t decode byte 0xe2 in position 11:  ordinal not in range(128)

  或者当试着格式化一个字节串的时候:

>>> "%s\n%s" % (u"This string\u2019s unicode", u"This string\u2019s  utf-8".encode("utf-8"),)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: &#39;ascii&#39; codec can&#39;t decode byte 0xe2 in position 11: ordinal not in range(128)

  基本上当你把Unicode和字节串混在一起用的时候,就会导致出错。

  在这个例子里面,你创建一个utf-8文件,然后往里面添加一些Unicode对象的文本。就会报UnicodeDecodeError错误。

>>> buffer = []
>>> fh = open("utf-8-sample.txt")
>>> buffer.append(fh.read())
>>> fh.close()
>>> buffer.append(u"This string\u2019s unicode")
>>> print repr(buffer)
[&#39;This file\xe2\x80\x99s got utf-8 in it\n&#39;, u&#39;This string\u2019s unicode&#39;]
>>> print "\n".join(buffer)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: &#39;ascii&#39; codec can&#39;t decode byte 0xe2 in position 9: ordinal not in range(128)

  你可以使用codecs模块把文件作为Unicode加载来解决这个问题。

>>> import codecs
>>> buffer = []
>>> fh = open("utf-8-sample.txt", "r", "utf-8")
>>> buffer.append(fh.read())
>>> fh.close()
>>> print repr(buffer)
[u&#39;This file\u2019s got utf-8 in it\n&#39;, u&#39;This string\u2019s unicode&#39;]
>>> buffer.append(u"This string\u2019s unicode")
>>> print "\n".join(buffer)
This file’s got utf-8 in it

This string’s unicode

  正如你看到的,由codecs.open 创建的流在当数据被读取的时候自动地将比特串转化为Unicode。

  最佳实践

  1.最先解码,最后编码

  2.默认使用utf-8编码

  3.使用codecs和Unicode对象来简化处理

  最先解码意味着无论何时有字节流输入,需要尽早将输入解码为Unicode。这会防止出现len( )和切分utf-8字节流发生问题。

  最后编码意味着只有在准备输入的时候才进行编码。这个输出可能是一个文件,一个数据库,一个socket等等。只有在处理完成之后才编码unicode对象。最后编码也意味着,不要让Python为你编码Unicode对象。Python将会使用ASCII编码,你的程序会崩溃。

  默认使用UTF-8编码意味着:因为UTF-8可以处理任何Unicode字符,所以你最好用它来替代windows-1252和ASCII。

  codecs模块能够让我们在处理诸如文件或socket这样的流的时候能少踩一些坑。如果没有codecs提供的这个工具,你就必须将文件内容读取为字节流,然后将这个字节流解码为Unicode对象。

  codecs模块能够让你快速的将字节流转化为Unicode对象,省去很多麻烦。

  解释UTF-8

  最后的部分是让你能入门UTF-8,如果你是个超级极客可以无视这一段。

  利用UTF-8,任何在127和255之间的字节是特别的。这些字节告诉系统这些字节是多字节序列的一部分。

Our UTF-8 encoded string looks like this:
[ 97 ] [ 98 ] [ 99 ] [ 226 ] [ 128 ] [ 147 ] = "abc–"

  最后3字节是一个UTF-8多字节序列。如果你把这三个字节里的第一个转化为2进制可以看到以下的结果:

11100010

  前3比特告诉系统它开始了一个3字节序列226,128,147。

  那么完整的字节序列。

11100010 10000000 10010011

  然后你运用三字节序列的下面的掩码。

1110xxxx 10xxxxxx 10xxxxxx
XXXX0010 XX000000 XX010011 Remove the X&#39;s
0010       000000   010011 Collapse the numbers
00100000 00010011          Get Unicode number 0x2013, 8211 The "–"

  这是基本的UTF-8入门,如果想知道更多的细节,可以去看UTF-8的维基页面。

  原文链接: ERIC MORITZ   翻译: 伯乐在线 - 贱圣OMG

The above is the detailed content of Learn about Python encoding and Unicode. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Merging Lists in Python: Choosing the Right MethodMerging Lists in Python: Choosing the Right MethodMay 14, 2025 am 12:11 AM

TomergelistsinPython,youcanusethe operator,extendmethod,listcomprehension,oritertools.chain,eachwithspecificadvantages:1)The operatorissimplebutlessefficientforlargelists;2)extendismemory-efficientbutmodifiestheoriginallist;3)listcomprehensionoffersf

How to concatenate two lists in python 3?How to concatenate two lists in python 3?May 14, 2025 am 12:09 AM

In Python 3, two lists can be connected through a variety of methods: 1) Use operator, which is suitable for small lists, but is inefficient for large lists; 2) Use extend method, which is suitable for large lists, with high memory efficiency, but will modify the original list; 3) Use * operator, which is suitable for merging multiple lists, without modifying the original list; 4) Use itertools.chain, which is suitable for large data sets, with high memory efficiency.

Python concatenate list stringsPython concatenate list stringsMay 14, 2025 am 12:08 AM

Using the join() method is the most efficient way to connect strings from lists in Python. 1) Use the join() method to be efficient and easy to read. 2) The cycle uses operators inefficiently for large lists. 3) The combination of list comprehension and join() is suitable for scenarios that require conversion. 4) The reduce() method is suitable for other types of reductions, but is inefficient for string concatenation. The complete sentence ends.

Python execution, what is that?Python execution, what is that?May 14, 2025 am 12:06 AM

PythonexecutionistheprocessoftransformingPythoncodeintoexecutableinstructions.1)Theinterpreterreadsthecode,convertingitintobytecode,whichthePythonVirtualMachine(PVM)executes.2)TheGlobalInterpreterLock(GIL)managesthreadexecution,potentiallylimitingmul

Python: what are the key featuresPython: what are the key featuresMay 14, 2025 am 12:02 AM

Key features of Python include: 1. The syntax is concise and easy to understand, suitable for beginners; 2. Dynamic type system, improving development speed; 3. Rich standard library, supporting multiple tasks; 4. Strong community and ecosystem, providing extensive support; 5. Interpretation, suitable for scripting and rapid prototyping; 6. Multi-paradigm support, suitable for various programming styles.

Python: compiler or Interpreter?Python: compiler or Interpreter?May 13, 2025 am 12:10 AM

Python is an interpreted language, but it also includes the compilation process. 1) Python code is first compiled into bytecode. 2) Bytecode is interpreted and executed by Python virtual machine. 3) This hybrid mechanism makes Python both flexible and efficient, but not as fast as a fully compiled language.

Python For Loop vs While Loop: When to Use Which?Python For Loop vs While Loop: When to Use Which?May 13, 2025 am 12:07 AM

Useaforloopwheniteratingoverasequenceorforaspecificnumberoftimes;useawhileloopwhencontinuinguntilaconditionismet.Forloopsareidealforknownsequences,whilewhileloopssuitsituationswithundeterminediterations.

Python loops: The most common errorsPython loops: The most common errorsMay 13, 2025 am 12:07 AM

Pythonloopscanleadtoerrorslikeinfiniteloops,modifyinglistsduringiteration,off-by-oneerrors,zero-indexingissues,andnestedloopinefficiencies.Toavoidthese:1)Use'i

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

MantisBT

MantisBT

Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

SecLists

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

EditPlus Chinese cracked version

EditPlus Chinese cracked version

Small size, syntax highlighting, does not support code prompt function

Atom editor mac version download

Atom editor mac version download

The most popular open source editor