About Chinese encoding issues in Python

1. Chinese encoding issues in Python

1.1 Encoding in .py files

By default, Python treats script files as ASCII-encoded. When a file contains characters outside the ASCII range, you must add an encoding declaration. In particular, if a module's .py file contains Chinese characters (strictly speaking, any non-ASCII characters), the encoding must be declared on the first or second line:

# -*- coding: utf-8 -*-

or

# coding=utf-8

Other encodings such as gbk and gb2312 are also acceptable. Without a declaration, Python raises an error like:

SyntaxError: Non-ASCII character '\xe4' in file ChineseTest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
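The declaration has to match a specific pattern to be recognised. The sketch below applies the regular expression given in PEP 263 to the first two lines of a source file. The helper name declared_encoding is hypothetical, and the real interpreter applies a few extra rules (for example about what may appear on the first line) that this simplified version ignores.

```python
import re

# Pattern described in PEP 263 for recognising an encoding declaration.
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')

def declared_encoding(source_lines):
    """Return the encoding declared on line 1 or 2, or None (hypothetical helper)."""
    for line in source_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None

print(declared_encoding(['#!/usr/bin/env python', '# -*- coding: utf-8 -*-']))
print(declared_encoding(['# coding=gbk']))
# A space between "coding" and "=" does not match the pattern.
print(declared_encoding(['# -*- coding =utf-8 -*-']))
```

The last call illustrates why the exact spelling matters: "coding" must be followed immediately by ":" or "=".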

1.2 Encoding and decoding in python

First, a word about string types in Python. There are two, str and unicode, both derived from basestring. A str is a sequence of bytes (at least 8 bits each), while a unicode object is a sequence of Unicode characters, so:

The value of len(u'中国') is 2; the value of len('ab') is also 2.
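To make the difference between counting characters and counting bytes concrete, here is a small sketch. It is written to run on both Python 2 and Python 3, which is an assumption beyond the Python 2 setting of this article (on Python 3 the article's str corresponds to bytes and unicode to str). The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
text = u'中国'                       # two Unicode characters
utf8_bytes = text.encode('utf-8')    # three bytes per character in UTF-8
gb_bytes = text.encode('gb2312')     # two bytes per character in GB2312

lengths = (len(text), len(utf8_bytes), len(gb_bytes))
print(lengths)  # (2, 6, 4)
```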

The documentation of str says: "The string data type is also used to represent arrays of bytes, e.g., to hold data read from a file." In other words, content read from a file or from the network is held in a str object. To convert such a str from one encoding to another, first decode it to unicode, then encode the unicode object to the target encoding, such as utf-8 or gb2312.
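The two-step route (bytes to Unicode, Unicode to bytes) can be wrapped in a small function. This is a minimal sketch that runs on Python 2 and 3; transcode is a hypothetical helper name, not a standard library function.

```python
# -*- coding: utf-8 -*-
def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string by going through Unicode (hypothetical helper)."""
    return data.decode(source_encoding).encode(target_encoding)

gbk_bytes = u'中国'.encode('gbk')             # pretend this was read from a GBK file
utf8_bytes = transcode(gbk_bytes, 'gbk', 'utf-8')
print(utf8_bytes == u'中国'.encode('utf-8'))   # True
```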

Conversion functions provided in Python:

To convert unicode to gb2312, utf-8, etc., use s.encode(encoding):

# -*- coding=UTF-8 -*-
if __name__ == '__main__':
    s = u'中国'
    s_gb = s.encode('gb2312')

To convert utf-8 or GBK data to unicode, use unicode(s, encoding) or s.decode(encoding):

# -*- coding=UTF-8 -*-
if __name__ == '__main__':
    s = u'中国'
    s_utf8 = s.encode('UTF-8')
    assert(s_utf8.decode('utf-8') == s)

Convert ordinary str to unicode

# -*- coding=UTF-8 -*-
if __name__ == '__main__':
    s = '中国'
    su = u'中国'
    # s is utf-8 encoded because the .py file containing it is utf-8 (# -*- coding=UTF-8 -*-)
    s_unicode = s.decode('UTF-8')
    assert(s_unicode == su)

# -*- coding=UTF-8 -*-
if __name__ == '__main__':
    s = '中国'
    s_gb = s.encode('gb2312')

An exception will occur here:

Python automatically decodes s to unicode first and then encodes it to gb2312. Because this decoding is implicit and no codec is specified, Python uses the default encoding reported by sys.getdefaultencoding(). In many cases that default is ASCII, so an error occurs if s is not ASCII-encoded.
In the example above, my default encoding is ascii while s is encoded the same way as the file, in utf-8, so the error is: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
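The implicit step can be reproduced by performing the ASCII decode explicitly. The sketch below runs on Python 2 and 3 (on Python 3 there is no implicit decode, so the explicit call only stands in for what Python 2 does behind the scenes). The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
utf8_bytes = u'中国'.encode('utf-8')

try:
    # Explicitly perform the step that Python 2 performs implicitly.
    utf8_bytes.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message names byte 0xe4 at position 0, the first byte of the UTF-8 encoding of 中.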
There are two ways to fix this.
The first is to state the encoding of s explicitly:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
s.decode('utf-8').encode('gb2312')

The second is to change the default encoding to the file's encoding:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)  # Python 2.5 removes sys.setdefaultencoding after initialization, so sys must be reloaded
sys.setdefaultencoding('utf-8')
s = '中文'
s.encode('gb2312')

File encoding and print function
Create a file Test.txt in ANSI format with the content:
abc中文
Read it with Python:
# coding=gbk
print open("Test.txt").read()
Result: abc中文
Now change the file format to UTF-8:
Result: abc涓枃
Obviously, decoding is needed here:

# coding=gbk
import codecs
print open("Test.txt").read().decode("utf-8")

Result: abc中文
The Test.txt above was edited with EditPlus. When I instead edited it with Windows' built-in Notepad and saved it as UTF-8,
running the script produced an error:

Traceback (most recent call last):
  File "ChineseTest.py", line 3, in <module>
    print open("Test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of a file when saving it as UTF-8.
We therefore need to strip these bytes ourselves when reading. The codecs module in Python defines a constant for them:

# coding=gbk
import codecs
data = open("Test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")

Result: abc中文
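As a possible alternative to stripping the BOM by hand, Python also ships a utf-8-sig codec that skips a leading BOM when decoding. The sketch below compares the two approaches on simulated file contents and runs on Python 2 and 3. The byte string raw is built in memory as a stand-in for what Notepad would write, not read from a real file.

```python
# -*- coding: utf-8 -*-
import codecs

# Simulate the bytes Notepad would write: BOM followed by UTF-8 text.
raw = codecs.BOM_UTF8 + u'abc中文'.encode('utf-8')

# Manual approach, as in the snippet above.
if raw.startswith(codecs.BOM_UTF8):
    stripped = raw[len(codecs.BOM_UTF8):]
else:
    stripped = raw
text_manual = stripped.decode('utf-8')

# Alternative: the 'utf-8-sig' codec skips a leading BOM if present.
text_sig = raw.decode('utf-8-sig')

print(text_manual == text_sig)  # True
```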

(4) Some remaining issues
In the second part, we used the unicode function and the decode method to convert str to unicode. Why did both calls take "gbk" as their argument?
The first guess is that it is because our coding declaration uses gbk (# coding=gbk), but is that really the case?
Modify the source file:

# coding=utf-8
s = "中文"
print unicode(s, "utf-8")

Running it raises an error:

Traceback (most recent call last):
  File "ChineseTest.py", line 3, in <module>
    s = unicode(s, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data

Clearly, if the previous example worked because gbk was used on both sides, then keeping utf-8 on both sides here should also work without error, yet it fails.
As a further example, if we still use gbk for the conversion here:

# coding=utf-8
s = "中文"
print unicode(s, "gbk")

Result: 中文
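One plausible reading of these two experiments is that the source file itself was saved as GBK, so the bytes in s are GBK whatever the coding declaration says. The sketch below reproduces that situation under this assumption by building the GBK bytes directly. It runs on Python 2 and 3, and the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Bytes as they would sit in a source file saved as GBK (assumption).
gbk_bytes = u'中文'.encode('gbk')

try:
    gbk_bytes.decode('utf-8')   # mirrors unicode(s, "utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

decoded = gbk_bytes.decode('gbk')   # mirrors unicode(s, "gbk")
print(utf8_failed)                  # True
```

Decoding succeeds only with the codec that matches how the bytes were actually written.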

Principle of print in Python:
When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.

To print data reliably, you must know the encoding that this display program expects.
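One way to act on this is to encode text explicitly with whatever encoding the output stream reports. The sketch below is a minimal illustration that runs on Python 2 and 3. Both encode_for_stream and FakeGbkConsole are hypothetical names introduced here, and the fallback of utf-8 is an assumption.

```python
# -*- coding: utf-8 -*-
def encode_for_stream(text, stream, fallback='utf-8'):
    """Encode text with the encoding the stream reports (hypothetical helper)."""
    encoding = getattr(stream, 'encoding', None) or fallback
    # 'replace' substitutes '?' for characters the target encoding lacks.
    return text.encode(encoding, 'replace')

class FakeGbkConsole(object):
    """Stand-in for a console that expects GBK (illustration only)."""
    encoding = 'gbk'

gbk_payload = encode_for_stream(u'中文', FakeGbkConsole())
print(gbk_payload == u'中文'.encode('gbk'))  # True
```

In practice one might pass sys.stdout instead of the stand-in class, since it exposes an encoding attribute when attached to a terminal.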


# coding=utf-8
s = "中文"
print unicode(s, "cp936")
# Result: 中文


>>> s="哈哈"
>>> s
'\xe5\x93\x88\xe5\x93\x88'
>>> print s  # why does this work? see the explanation of print above
哈哈
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> print s.encode('utf8')  # before encoding, Python implicitly decodes s to unicode as ascii, then encodes it to utf8
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
>>> print s.decode('utf-8').encode('utf8')
哈哈


chardet makes it easy to detect the encoding of a string or a file:


>>> import urllib, chardet
>>> rawdata = urllib.urlopen('http://www.google.cn/').read()
>>> chardet.detect(rawdata)
{'confidence': 0.98999999999999999, 'encoding': 'GB2312'}

chardet download: http://chardet.feedparser.org/
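When chardet is not available, a much cruder fallback is to try a short list of candidate encodings in order and keep the first one that decodes cleanly. This is not how chardet works (it detects statistically), and it can guess wrong when the bytes happen to be valid in several encodings. The helper name guess_encoding and the candidate list are assumptions made for this sketch, which runs on Python 2 and 3.

```python
# -*- coding: utf-8 -*-
def guess_encoding(data, candidates=('utf-8', 'gbk')):
    """Return the first candidate that decodes data without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None

print(guess_encoding(u'中文'.encode('utf-8')))
print(guess_encoding(u'中文'.encode('gbk')))
```

Putting utf-8 first is deliberate: UTF-8 has strict byte patterns, so text in other encodings rarely passes as valid UTF-8 by accident.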


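UnicodeDecodeError: 'gbk' codec can't decode bytes in position 30664-30665: illegal multibyte sequence
This happens because illegal characters were encountered. In particular, in some programs written in C/C++, the full-width space is often implemented in several different ways, such as \xa3\xa0 or \xa4\x57. These all look like full-width spaces, but they are not the "legal" full-width space (the real one is \xa1\xa1), so an exception occurs during transcoding.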

s.decode('gbk', 'ignore').encode('utf-8')

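Because the prototype of decode is decode([encoding], [errors='strict']), the second argument controls the error-handling strategy. The default, strict, raises an exception when an illegal character is encountered.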


decode( [encoding[, errors]]) 
Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding. errors may be given to set a different error handling scheme. The default is 'strict', meaning that encoding errors raise UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error, see section 4.8.1. 
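The three most common error handlers can be compared side by side. The sketch below runs on Python 2 and 3 and uses a made-up byte string: valid GBK text with one stray 0xff byte, which cannot occur in GBK, inserted in the middle. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Valid GBK text with one stray byte (0xff is never valid in GBK) in the middle.
data = u'中文'.encode('gbk') + b'\xff' + b'abc'

try:
    data.decode('gbk')                    # errors defaults to 'strict'
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = data.decode('gbk', 'ignore')    # drops the bad byte
replaced = data.decode('gbk', 'replace')  # substitutes U+FFFD for it

print(strict_failed)
```

Note that 'ignore' silently loses data, so 'replace' may be preferable when it matters that something was dropped.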





