
About Chinese encoding issues in Python

零到壹度 (Original)
2018-04-16 11:38:50

This article discusses Chinese encoding issues in Python. It may serve as a useful reference for readers dealing with these problems.

1. Chinese encoding issues in Python

1.1 Encoding in .py files

Python's script files are ASCII-encoded by default. When a file contains characters outside the ASCII range, you must add an "encoding declaration" to fix this. If a .py file contains Chinese characters (strictly speaking, any non-ASCII characters), you need to place an encoding declaration on the first or second line:

# -*- coding: utf-8 -*- or # coding=utf-8. Other encodings such as gbk and gb2312 are also acceptable. Otherwise an error like the following appears: SyntaxError: Non-ASCII character '\xe4' in file ChineseTest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details.

1.2 Encoding and decoding in python

First, a word about the string types in Python (here, Python 2). There are two: str and unicode, both derived classes of basestring. The str type is a sequence of 8-bit bytes, while each unit of a unicode object is a Unicode character. Thus:

the value of len(u'中国') is 2, and the value of len('ab') is also 2;

The documentation for str says: "The string data type is also used to represent arrays of bytes, e.g., to hold data read from a file." That is, when you read content from a file or from the network, the object you hold is of type str. If you want to convert a str to a specific encoding, you must first decode the str to unicode, and then encode the unicode to the target encoding, such as utf-8 or gb2312.
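For readers on Python 3, where unicode became str and the old str became bytes, the same two-type model and the decode-then-encode chain can be sketched as follows. This is a Python 3 translation for illustration, not the article's original Python 2 code:

```python
# Python 3 equivalent of the str/unicode split described above:
# str holds Unicode text (old unicode), bytes holds raw 8-bit data (old str).
text = '中国'                # Unicode text, like u'中国' in Python 2
assert len(text) == 2        # two code points

raw = text.encode('utf-8')   # bytes read from a file or socket look like this
assert len(raw) == 6         # UTF-8 uses 3 bytes per CJK character here

# To transcode bytes, first decode to text, then encode to the target codec.
gb = raw.decode('utf-8').encode('gb2312')
assert gb.decode('gb2312') == text
print('round trip ok')
```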

Conversion functions provided in python:

Converting unicode to gb2312, utf-8, etc.:

# -*- coding=UTF-8 -*-
if __name__ == '__main__': 
   s = u'中国'    
   s_gb = s.encode('gb2312')

To convert utf-8 or GBK bytes to unicode, use the function unicode(s, encoding) or s.decode(encoding):

# -*- coding=UTF-8 -*-
if __name__ == '__main__':
    s = u'中国'
    # s is unicode; first convert it to utf-8
    s_utf8 = s.encode('UTF-8')
    assert(s_utf8.decode('utf-8') == s)

Converting an ordinary str to unicode:

# -*- coding=UTF-8 -*-
if __name__ == '__main__':
    s = '中国'
    su = u'中国'
    # s can be decoded to unicode with utf-8, because the .py file
    # containing s is itself encoded as utf-8 (# -*- coding=UTF-8 -*-)
    s_unicode = s.decode('UTF-8')
    assert(s_unicode == su)
    # To convert s to gb2312, first decode to unicode, then encode to gb2312
    s.decode('utf-8').encode('gb2312')
    # What happens if you run s.encode('gb2312') directly?
    s.encode('gb2312')
 
# -*- coding=UTF-8 -*-
if __name__ == '__main__':
    s = '中国'
    # What happens if you run s.encode('gb2312') directly?
    s.encode('gb2312')

An exception will occur here:

Python will automatically decode s to unicode first, and then encode it to gb2312. Because the decoding is performed implicitly and we did not specify a codec, Python uses the codec given by sys.defaultencoding. In many environments sys.defaultencoding is ascii, and an error occurs if s is not ASCII.
In the case above, my sys.defaultencoding is ascii, while s is encoded the same way as the file, i.e. utf-8, so this error occurred: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
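The implicit ASCII decode that Python 2 performs can be reproduced explicitly. A Python 3 sketch (written that way so it runs on modern interpreters): decoding UTF-8 bytes of '中文' with the ascii codec fails on the very first byte, 0xe4, exactly as in the error above:

```python
# Reproduce explicitly what Python 2's implicit sys.defaultencoding decode did.
raw = '中文'.encode('utf-8')   # stands in for the Python 2 byte string s

try:
    raw.decode('ascii')        # the implicit step behind s.encode('gb2312')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)         # names the 'ascii' codec and byte 0xe4

print(message)
```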
In this case, we have two ways to correct the error:
One is to clearly indicate the encoding method of s

#! /usr/bin/env python 
# -*- coding: utf-8 -*- 
s = '中文' 
s.decode('utf-8').encode('gb2312')

The second is to change sys.defaultencoding to the file's encoding:

#! /usr/bin/env python 
# -*- coding: utf-8 -*- 
import sys 
reload(sys)  # Python 2.5 removes sys.setdefaultencoding after initialization; reload sys to restore it
sys.setdefaultencoding('utf-8') 
s = '中文' 
s.encode('gb2312')

1.3 File encoding and the print function

Create a file Test.txt in ANSI format with the content:

abc中文

Read it with Python:

# coding=gbk
print open("Test.txt").read()

Result: abc中文

Now change the file's encoding to UTF-8. Result: abc涓枃

Obviously, decoding is needed here:

# coding=gbk
import codecs
print open("Test.txt").read().decode("utf-8")

Result: abc中文
I edited the above Test.txt with EditPlus. But when I edited it with Windows' built-in Notepad and saved it in UTF-8 format, an error occurred at runtime:

Traceback (most recent call last):
  File "ChineseTest.py", line 3, in <module>
    print open("Test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the UTF-8 BOM) at the beginning of a file when saving it as UTF-8. So we need to strip these bytes ourselves when reading; Python's codecs module defines a constant for them:

# coding=gbk
import codecs
data = open("Test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")

Result: abc中文
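The same BOM-stripping logic can be sketched in Python 3, where it operates on bytes; note that the 'utf-8-sig' codec will also consume a leading BOM for you:

```python
import codecs

# Simulate a Notepad-style UTF-8 file: BOM followed by the text.
data = codecs.BOM_UTF8 + 'abc中文'.encode('utf-8')

# Manual strip, as in the snippet above:
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
assert data.decode('utf-8') == 'abc中文'

# Or let the codec do it: 'utf-8-sig' consumes a leading BOM if present.
data = codecs.BOM_UTF8 + 'abc中文'.encode('utf-8')
assert data.decode('utf-8-sig') == 'abc中文'
print('BOM handled')
```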

1.4 Some remaining issues
Earlier, we used the unicode function and the decode method to convert str to unicode. Why did both of those calls use "gbk" as the parameter?
The first reaction is that our coding declaration (# coding=gbk) uses gbk, but is that really the reason?
Modify the source file:

# coding=utf-8
s = "中文"
print unicode(s, "utf-8")

Run it, and an error occurs:

Traceback (most recent call last):
  File "ChineseTest.py", line 3, in <module>
    s = unicode(s, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data

Obviously, if the earlier version worked because gbk was used on both sides, then here, with utf-8 kept consistent on both sides, it should also work without error.
A further example: what if we still use gbk for the conversion here?

# coding=utf-8
s = "中文"
print unicode(s, "gbk")

Result: 中文

This suggests the source file was actually saved in gbk (ANSI) form, so the bytes of s are gbk regardless of what the coding declaration claims; the declaration must match how the file is really saved.

Principle of print in python:
When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.

  To print data reliably, you must know the encoding that this display program expects.

Simply put, print in Python hands the string straight to the operating system, so you need to encode the str into the form the operating system expects. Windows uses CP936 (almost identical to gbk), so gbk can be used here.
A final test:

# coding=utf-8
s = "中文"
print unicode(s, "cp936")
# Result: 中文
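In Python 3 terms, print hands text to sys.stdout, which encodes it with the console's codec; encoding explicitly shows what bytes a CP936 console must receive. A small sketch (cp936 is Microsoft's gbk variant, available in the standard codec registry):

```python
# On a CP936 (gbk) Windows console, print must ultimately deliver these bytes.
text = '中文'
gbk_bytes = text.encode('cp936')          # cp936 is Microsoft's gbk variant
assert gbk_bytes == b'\xd6\xd0\xce\xc4'   # 中 = D6D0, 文 = CEC4
assert gbk_bytes.decode('gbk') == text    # gbk decodes cp936 output here
print('console bytes:', gbk_bytes)
```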

This also explains why the following outputs are inconsistent:

>>> s = "哈哈"
>>> s
'\xe5\x93\x88\xe5\x93\x88'
>>> print s  # why does this one work? see the explanation of print above
哈哈
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> print s.encode('utf8')  # before encoding, Python implicitly decodes s to unicode with the default ascii codec
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
>>> print s.decode('utf-8').encode('utf8')
哈哈
>>>

Testing for encodings

chardet makes it easy to detect the encoding of a string or file.

For example:

>>> import urllib
>>> rawdata = urllib.urlopen('http://www.google.cn/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'confidence': 0.98999999999999999, 'encoding': 'GB2312'}

chardet download: http://chardet.feedparser.org/

Special note:

In practice you will often read a file, or fetch a page from the web, that looks like gb2312-encoded text, yet decoding it keeps failing. In that case, try decoding with the gb18030 charset instead: decode('gb18030'). If the problem persists, remember that decode takes a second parameter. For example, to convert a string s from gbk to UTF-8:
s.decode('gbk').encode('utf-8')
In real development, however, this approach often raises an exception:
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 30664-30665: illegal multibyte sequence
This happens when an illegal character is encountered. In particular, in some programs written in C/C++, full-width spaces have several different representations, such as \xa3\xa0 or \xa4\x57. These look like full-width spaces but are not "legal" ones (the real full-width space is \xa1\xa1), so the conversion raises an exception.
Such problems are a real headache, because a single illegal character can make an entire string, sometimes a whole article, impossible to transcode.
The solution:

s.decode('gbk', 'ignore').encode('utf-8')

Because the signature of decode is decode([encoding], [errors='strict']), the second parameter controls the error-handling strategy. The default, strict, raises an exception when an illegal character is encountered;
ignore skips illegal characters;
replace substitutes a replacement character for them ('?' when encoding, U+FFFD when decoding);
xmlcharrefreplace uses XML character references (an encode-only handler).
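The effect of these handlers can be demonstrated directly. A Python 3 sketch using the ascii codec on a byte that cannot be decoded:

```python
# The errors parameter controls how undecodable bytes are handled.
bad = b'abc\xe4'                 # 0xe4 is not valid ASCII

# 'strict' (the default) raises UnicodeDecodeError:
try:
    bad.decode('ascii')
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

# 'ignore' drops the offending byte; 'replace' substitutes U+FFFD.
assert bad.decode('ascii', 'ignore') == 'abc'
assert bad.decode('ascii', 'replace') == 'abc\ufffd'
print('error handlers ok')
```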

Python documentation:

decode([encoding[, errors]])
Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding. errors may be given to set a different error handling scheme. The default is 'strict', meaning that encoding errors raise UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error; see section 4.8.1.
Source: http://www.jb51.net/article/16104.htm

