After comparing how Python 2.7 and Python 3 handle encodings, I ran into something I don't understand:
B = b'\xc4\xe8'
B.decode('latin-1')
print(B.decode('latin-1'))  # result: Äè
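In Python 3 terms, B is of type bytes, which represents a string of raw bytes. Decoding it as "latin-1" yields an actual str (which I take to be UTF-8 by default), and it displays as Western European characters.
In Python 2, however, str is itself the byte type and unicode is the type for decoded text. I tried the following: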
B = '\xc4\ce8'
B = B.decode('latin-1')  # decode it as latin-1
print B
The result is u'\xc4\xe8', not the Western European characters.
PHP中文网2017-04-17 11:56:19
Careful: the escape in your Python 2 snippet should be \xe8, not \ce8.
When Python's interactive interpreter is given a bare expression, it echoes the repr of the value, much like PHP's var_dump. For a unicode string in Python 2, that repr is a u'' literal in which every non-ASCII character is escaped. Only when you actually call print does the interpreter encode the string according to the terminal's locale settings and write the real characters to the screen.
The following session on a Linux system should help answer your question.
pi@linux-0o8x:~> locale | grep LANG
LANG=zh_CN.utf8
pi@linux-0o8x:~> python
Python 2.7.5 (default, May 30 2013, 16:55:57) [GCC] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'中国人 说汉语 用中文 abc123'
u'\u4e2d\u56fd\u4eba \u8bf4\u6c49\u8bed \u7528\u4e2d\u6587 abc123'
>>> B = b'\xc4\xe8'
>>> B.decode('latin-1')
u'\xc4\xe8'
>>> print(B.decode('latin-1'))
Äè
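For comparison, here is a minimal Python 3 sketch of the same distinction (the session above is Python 2). The variable names are only for illustration. In Python 3 the echoed repr keeps printable non-ASCII characters, which is probably why the two versions looked different to you; the built-in ascii() gives the escaped form, close to what Python 2 echoes.

```python
# Python 3 sketch: one decoded string, three ways of showing it.
raw = b'\xc4\xe8'              # two raw bytes
text = raw.decode('latin-1')   # a two-character str: 'Äè'

echoed = repr(text)    # what the Python 3 prompt echoes: 'Äè' (with quotes)
escaped = ascii(text)  # the 10-character string '\xc4\xe8' with literal backslashes
printed = str(text)    # what print() sends to the terminal: Äè (no quotes)

# print(text) would show Äè on a terminal whose encoding supports it.
# The escaped form is safe to print anywhere because it is pure ASCII.
print(escaped)
```

The value itself is identical in all three cases; only the presentation changes.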
Also, a good habit in real code is to avoid passing byte strings around and decoding them ad hoc; convert everything to Unicode and work with that throughout. Simple and hassle-free.
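One way to read that advice, sketched in Python 3: decode once where bytes enter the program, work on text in between, and encode once where data leaves. The helpers read_text and write_bytes are hypothetical names, and the choice of Latin-1 for input and UTF-8 for output is an assumption made only for this example.

```python
def read_text(raw, encoding='latin-1'):
    """Decode incoming bytes once, at the boundary (hypothetical helper)."""
    return raw.decode(encoding)


def write_bytes(text, encoding='utf-8'):
    """Encode only when data leaves the program (hypothetical helper)."""
    return text.encode(encoding)


incoming = b'\xc4\xe8'            # bytes as received, assumed to be Latin-1
text = read_text(incoming)        # Unicode text from here on
lowered = text.lower()            # string operations work on characters, not bytes
outgoing = write_bytes(lowered)   # UTF-8 bytes for output

print(outgoing)  # b'\xc3\xa4\xc3\xa8'
```

Everything between the two helpers only ever sees text, so no encoding decisions leak into the rest of the code.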