
Python 2.7 encoding problem

After comparing how Python 2.7 and Python 3 handle encodings, I ran into the following problem:

In Python 3:

B = b'\xc4\xe8'
B = B.decode('latin-1')
print(B)  # prints Äè

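In Python 3 terms, B is of type bytes, which represents a binary byte string. Decoding it as "latin-1" yields an actual str (which, as I understand it, defaults to UTF-8), and it displays as Western European characters.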


In Python 2, however, str is itself the binary type and unicode is the corresponding text type. I did the following:

B = '\xc4\ce8'
B = B.decode('latin-1')  # decode it as Latin-1
print B  

The result is u'\xc4\xe8',
not the Western European characters.

Why is that?

How can I get the Western European characters shown above in Python 2.7?

over

伊谢尔伦, 2765 days ago


  • PHP中文网, 2017-04-17 11:56:19

    Careful: the second escape in your Python 2 snippet should be \xe8, not \ce8.

    When you type a bare expression at Python's interactive prompt, the interpreter echoes the value's repr(), much like PHP's var_dump. For a unicode string in Python 2, that is a u'...' literal with every non-ASCII character escaped. Only when you actually call print does the interpreter encode the string according to the locale of the terminal and write the real characters to the screen.

    The following session, run on a Linux system, should help answer your question:

    pi@linux-0o8x:~> locale | grep LANG
    LANG=zh_CN.utf8
    pi@linux-0o8x:~> python
    Python 2.7.5 (default, May 30 2013, 16:55:57) [GCC] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> u'中国人 说汉语 用中文 abc123'
    u'\u4e2d\u56fd\u4eba \u8bf4\u6c49\u8bed \u7528\u4e2d\u6587 abc123'
    >>> B = b'\xc4\xe8'
    >>> B.decode('latin-1') 
    u'\xc4\xe8'
    >>> print(B.decode('latin-1'))
    Äè
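
The same distinction can be checked from a script without relying on what the terminal displays. This is only a minimal sketch using the two bytes from the question; the variable names are illustrative, and the `__future__` import is there so it runs under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

raw = b'\xc4\xe8'              # the two bytes from the question
text = raw.decode('latin-1')   # unicode in Python 2, str in Python 3

# Latin-1 maps each byte to the code point with the same number.
code_points = [hex(ord(ch)) for ch in text]

# The bytes that print has to write for this text on a UTF-8 terminal.
utf8_bytes = text.encode('utf-8')

# repr() is what the interactive prompt echoes for a bare expression;
# Python 2.7 escapes non-ASCII characters there, Python 3 does not.
echoed = repr(text)

print(code_points)                   # ['0xc4', '0xe8']
print(len(text), len(utf8_bytes))    # 2 4
```

The decoded value is the same two characters in both versions; only the echoed representation differs. Calling print(text) on a terminal whose locale can encode Ä and è shows the characters themselves.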
    

    In addition, a good habit in real code is to avoid passing byte strings around and decoding them ad hoc: convert everything to Unicode as early as possible and work with Unicode throughout. It is simpler and saves trouble.
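
One way to follow that habit is to decode once when data comes in and encode once when it goes out. The sketch below is an illustration under assumptions, not a prescribed pattern: `read_text` and `write_text` are hypothetical helper names, the file names are made up, and `io.open` is used because it accepts an `encoding` argument in both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import io
import os
import shutil
import tempfile


def read_text(path, encoding):
    """Decode once, at the input boundary (hypothetical helper)."""
    with io.open(path, 'r', encoding=encoding) as handle:
        return handle.read()


def write_text(path, text, encoding):
    """Encode once, at the output boundary (hypothetical helper)."""
    with io.open(path, 'w', encoding=encoding) as handle:
        handle.write(text)


workdir = tempfile.mkdtemp()
source = os.path.join(workdir, 'latin1.txt')   # made-up file names
target = os.path.join(workdir, 'utf8.txt')

# Simulate a Latin-1 encoded input file containing the question's bytes.
with io.open(source, 'wb') as handle:
    handle.write(b'\xc4\xe8')

text = read_text(source, 'latin-1')    # Unicode text from here on
shouted = text.upper()                 # string methods see characters, not bytes
write_text(target, shouted, 'utf-8')   # encode only when writing out

with io.open(target, 'rb') as handle:
    result_bytes = handle.read()

shutil.rmtree(workdir)                 # remove the temporary files
print(len(shouted), len(result_bytes))
```

Everything between the two helpers operates on Unicode text, so the choice of input and output encodings stays confined to the two calls that touch the files.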
