A summary of Python Chinese garbled-text problems
Recently, when running code like this:
#!/usr/bin/env python
s="中文"
print s
I have often run into the following problems:
Problem 1: SyntaxError: Non-ASCII character '\xe4' in file E:\coding\python\Untitled 6.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Problem 2: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 108: ordinal not in range(128)
Problem 3: UnicodeEncodeError: 'gb2312' codec can't encode character u'\u2014' in position 72366: illegal multibyte sequence
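The second error, for instance, appears whenever non-ASCII bytes are decoded with the ascii codec, which is what Python 2 does implicitly when a str is mixed with a unicode object. Below is a minimal sketch that triggers the same kind of error explicitly. The variable names are illustrative and not from the original code, and the snippet is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; names here are assumptions, not the author's code.
raw = b"\xe4\xb8\xad\xe6\x96\x87"   # the utf-8 bytes of "中文"

try:
    raw.decode("ascii")             # ascii only covers byte values 0-127
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# Decoding with the encoding the bytes were actually written in works.
text = raw.decode("utf-8")
print(text == u"\u4e2d\u6587")
```

The fix in this sketch is simply to name the real encoding of the bytes instead of relying on the ascii default.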
These are all character-encoding issues, and they are very frustrating. I could never get Chinese text to work properly, so I looked up a lot of solutions. Below are some that I found over the past few days, shared here.
Internally, Python represents strings as unicode. Encoding conversion therefore usually uses unicode as the intermediate form: a string in some other encoding is first decoded into unicode, and the unicode is then encoded into the target encoding.
decode converts a string in another encoding into unicode. For example, str1.decode('gb2312') converts the gb2312-encoded string str1 into unicode.
encode converts unicode into a string in another encoding. For example, str2.encode('gb2312') converts the unicode string str2 into gb2312.
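The decode-then-encode route described above can be sketched as follows. This is only an illustration: the variable names are made up, the string is written with escapes so the snippet does not depend on the source file's encoding, and it is meant to run on both Python 2.7 and Python 3 (where Python 2's str and unicode correspond to bytes and str).

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; variable names are assumptions.
text = u"\u4e2d\u6587"                # the unicode string "中文"

gb_bytes = text.encode("gb2312")      # unicode -> gb2312 bytes
utf8_bytes = text.encode("utf-8")     # unicode -> utf-8 bytes

# Converting gb2312 bytes to utf-8 bytes goes through unicode.
converted = gb_bytes.decode("gb2312").encode("utf-8")

print(repr(gb_bytes))
print(repr(utf8_bytes))
print(converted == utf8_bytes)
```

The same two-step pattern applies to any pair of encodings: decode with the encoding the bytes are in, then encode with the one you want.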
In some IDEs, string output is always garbled or even raises an error. This happens because the IDE's output console cannot display the string's encoding, not because of a problem in the program itself.
If you run the following code in UliPad:
s=u"中文"
print s
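the following error is reported: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128). This is because UliPad's console output window on English Windows writes its output using the ascii encoding, whereas the string in the code above is unicode, so the output fails.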
Change the last line to: print s.encode('gb2312')
and "中文" is output correctly.
If the last line is instead changed to: print s.encode('utf8')
the output is \xe4\xb8\xad\xe6\x96\x87, which is what the console output window shows when it renders the utf8-encoded string as ascii.
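The "position 0-1" in the error message can be checked without UliPad by attempting the ascii encoding by hand. This is a rough sketch with illustrative names, assuming the console does the equivalent of an ascii encode. It should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; names are assumptions, not the author's code.
text = u"\u4e2d\u6587"   # the unicode string "中文"

try:
    text.encode("ascii")             # roughly what an ascii-only console attempts
    failed_span = None
except UnicodeEncodeError as exc:
    # start is inclusive and end is exclusive: (0, 2) means positions 0-1
    failed_span = (exc.start, exc.end)

print(failed_span)

# Encoding explicitly yields bytes that can be written out as-is.
print(repr(text.encode("gb2312")))
```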
The following code may be more general:
#!/usr/bin/env python
#coding=utf-8
s="中文"
if isinstance(s, unicode):
    #s=u"中文"
    print s.encode('gb2312')
else:
    #s="中文"
    print s.decode('utf-8').encode('gb2312')
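The same check can be wrapped in a small helper so that it is not repeated at every print. The function name to_bytes and its source and target parameters are hypothetical, introduced only for this sketch. It tests for bytes rather than unicode so that it should run on both Python 2.7 (where bytes is an alias of str) and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper for illustration; not part of the original code.
def to_bytes(value, source="utf-8", target="gb2312"):
    """Return value as bytes in the target encoding.

    value may be text (unicode) or bytes already encoded in `source`.
    """
    if isinstance(value, bytes):
        value = value.decode(source)   # bytes -> unicode first
    return value.encode(target)        # unicode -> target bytes


print(repr(to_bytes(u"\u4e2d\u6587")))
print(repr(to_bytes(b"\xe4\xb8\xad\xe6\x96\x87")))
```

Both calls produce the same gb2312 bytes, whether the input starts out as unicode or as utf-8 bytes.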
Look at the following piece of code:
#!/usr/bin/env python
#coding=utf-8
#python version:2.7.4
#system:windows xp
import httplib2

def getPageContent(url):
    '''
    Fetch the page content for a url programmatically with httplib2
    and convert the bytes content to a utf-8 string
    '''
    # Use IE9's user-agent; without a user-agent the request gets 403 Forbidden
    headers={'user-agent':'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
             'cache-control':'no-cache'}
    if url:
        response,content = httplib2.Http().request(url,headers=headers)
        if response.status == 200 :
            return content

import sys
reload(sys)
sys.setdefaultencoding('utf-8') # change the default encoding; the default is ascii
print sys.getdefaultencoding()
content = getPageContent("http://www.oschina.net/")
print content.decode('utf-8').encode('gb2312')
The code above requests the home page of www.oschina.net. Printing the utf-8 content directly does not display the Chinese text, so I wanted to convert it from utf-8 to gb2312, but that triggers Problem 3.
When I change print content.decode('utf-8').encode('gb2312') to print content.decode('utf-8').encode('gb2312', 'ignore'), it works and Chinese is displayed. I am not sure that all of it is displayed, though. It seems that only part of it is, because some characters cannot be encoded in gb2312.
However, when I change the website to www.soso.com, no conversion to gb2312 is needed. The utf-8 content displays Chinese correctly as it is.
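What 'ignore' does to the output can be seen on a short string that contains U+2014, the character named in the third error. The sketch below is illustrative only. The 'replace' handler and the gb18030 encoding are alternatives I am adding for comparison, not something the code above uses, and whether a given console can display gb18030 output is a separate question. It should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; names and the gb18030 alternative are assumptions.
text = u"\u4e2d\u2014\u6587"   # "中", U+2014, "文"

try:
    text.encode("gb2312")      # strict mode, the default
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode("gb2312", "ignore")     # silently drops U+2014
replaced = text.encode("gb2312", "replace")   # substitutes "?" instead
wider = text.encode("gb18030")                # gb18030 can encode all of Unicode

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
print(wider.decode("gb18030") == text)
```

With 'ignore' the unencodable characters disappear without any trace, which matches the impression that only part of the page is shown. 'replace' at least leaves a visible marker where something was lost.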
To summarize:
Writing a unicode Chinese string directly to a file throws the same exception. When handling unicode Chinese strings, you must first call encode to convert them into another encoding before output. This is true in every environment.
In Python, a "str" object is just a byte array. Python does not care whether its content is a valid string or which encoding (gbk, utf-8, unicode) it uses. Keeping track of that is up to the user. The same restrictions apply to "unicode" objects. Keep in mind that the content of a "unicode" object is by no means guaranteed to be a valid unicode string, as we'll see shortly. The Windows console supports gbk-encoded str objects and unicode objects.
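For file output, the encode step can be done by hand or delegated to the file object. The sketch below shows both. The file name out.txt and the variable names are illustrative, and io.open is used because it accepts an encoding argument on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; the path and names are assumptions.
import io
import os
import tempfile

text = u"\u4e2d\u6587"   # the unicode string "中文"
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Option 1: encode by hand and write the resulting bytes.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Option 2: let io.open encode on write and decode on read.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

print(read_back == text)
```

Either way, the encoding is stated explicitly at the point where unicode becomes bytes, so the ascii default never gets involved.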