
Python coding summary (encoding type, format, transcoding)

高洛峰
2017-03-01 13:30:26

This article summarizes character encoding in Python in detail, for your reference. The code uses Python 2 syntax (print statements, the unicode type, urllib2). The details are as follows:

[So-called unicode]

Unicode is an abstract encoding, similar to a symbol set: it only specifies the binary code (the code point) of each symbol, not how that code should be stored. In other words, it is only an internal representation and cannot be saved directly, so a storage format such as utf-8 or utf-16 must be specified when saving. In theory, Unicode is an encoding scheme that can accommodate all languages in the world. (Other encoding formats are not covered in detail here.)
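The difference between a code point and its stored form can be seen in a minimal sketch. It is written so that it should run under both Python 2 and Python 3, and the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# One abstract code point, two different storage formats.
ch = u"\u4e2d"                         # the character 中
code_point = ord(ch)                   # the abstract Unicode number
utf8_bytes = ch.encode("utf-8")        # stored as utf-8
utf16_bytes = ch.encode("utf-16-be")   # stored as big-endian utf-16

print(hex(code_point))    # 0x4e2d
print(len(utf8_bytes))    # 3
print(len(utf16_bytes))   # 2
```

The same code point occupies three bytes in utf-8 and two in utf-16, which is why a storage format always has to be named when saving.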

[So-called GB code]

GB means "national standard", that is, the national standard of the People's Republic of China. GB code is an encoding for Chinese characters and includes GB2312 (GB2312-80), GBK, and GB18030. The range of characters covered grows from the first to the last, and each is basically backward compatible with the previous one. In addition, we often encounter an encoding called CP936, which can be treated as roughly equivalent to GBK.
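As a rough illustration of that compatibility, the sketch below encodes one simplified character with every member of the family and then checks which codecs can represent a traditional character. The helper can_encode and the two sample characters are choices made for this example, not part of the original article.

```python
# -*- coding: utf-8 -*-
zhong = u"\u4e2d"   # 中, a common simplified character
han = u"\u6f22"     # 漢, a traditional character that GB2312 does not cover
family = ("gb2312", "gbk", "cp936", "gb18030")

# The common character gets the same two bytes in every codec of the family.
zhong_bytes = dict((codec, zhong.encode(codec)) for codec in family)

def can_encode(text, codec):
    """Return True if text is representable in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# The larger codecs accept a character that the smallest one rejects.
coverage = [(codec, can_encode(han, codec)) for codec in family]
print(coverage)
```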

[Determining the encoding]

1. Use isinstance(s, str) to check whether s is an ordinary byte string (in Python 2, str holds encoded bytes; strings encoded as ascii, utf-8, utf-16, GB2312, GBK, etc. are all of type str);

Use isinstance(s, unicode) to check whether s is a unicode string (that is, an object of type unicode).
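A small sketch of the isinstance() check follows. Because Python 3 renamed the two types (str became bytes, unicode became str), the aliases text_type and binary_type are introduced here as an assumption so that the same code runs on both versions; kind_of is a hypothetical helper name.

```python
import sys

# Python 2 names the two types str/unicode; Python 3 names them bytes/str.
if sys.version_info[0] >= 3:
    text_type, binary_type = str, bytes
else:
    text_type, binary_type = unicode, str  # noqa: F821 (Python 2 only)

def kind_of(s):
    """Classify s as 'unicode', 'bytes', or 'other'."""
    if isinstance(s, text_type):
        return "unicode"
    if isinstance(s, binary_type):
        return "bytes"
    return "other"

print(kind_of(u"\u4e2d"))     # unicode
print(kind_of(b"\xd6\xd0"))   # bytes
print(kind_of(42))            # other
```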

2. Use type() or .__class__

If the encoding is correct:

For example: if stra = "中", then type(stra) returns <type 'str'>, indicating a byte string;

For example: if strb = u"中", then type(strb) returns <type 'unicode'>, indicating a unicode string.


tmp_str = 'tmp_str'
print tmp_str.__class__       # <type 'str'>
print type(tmp_str)           # <type 'str'>
print type(tmp_str).__name__  # str
tmp_str = u'tmp_str'
print tmp_str.__class__       # <type 'unicode'>
print type(tmp_str)           # <type 'unicode'>
print type(tmp_str).__name__  # unicode


3. The best way is to use chardet, especially in web-related work such as fetching HTML pages. The charset declared in a page only states the intended encoding, which is sometimes wrong, and some Chinese characters in the page content may fall outside the range of that encoding. In such cases, detecting the encoding with chardet is the most convenient and usually the most reliable approach.

(1) Installation: after downloading chardet, place the extracted chardet folder in the \Lib\site-packages directory under the Python installation directory, then use import chardet in your program.

(2) Usage method 1: Detect all contents to determine the encoding


import urllib2
import chardet
res = urllib2.urlopen('http://www.php.cn')
res_cont = res.read()
res.close()
print chardet.detect(res_cont)  # {'confidence': 0.99, 'encoding': 'utf-8'}


The return value of the detect function is a dictionary with two key-value pairs: confidence is the detection confidence, and encoding is the detected encoding.
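One way to act on such a result is sketched below. To keep the example self-contained it does not import chardet; sample_result is a hand-written stand-in shaped like the dictionary shown above, and decode_with_result, min_confidence, and fallback are hypothetical names chosen for this illustration.

```python
def decode_with_result(data, result, min_confidence=0.5, fallback="utf-8"):
    """Decode data using a chardet-style result dict, or fall back."""
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # Distrust a missing or low-confidence guess and use the fallback.
    if not encoding or confidence < min_confidence:
        encoding = fallback
    return data.decode(encoding, "replace")

# Hand-written stand-in for what chardet.detect() might return.
sample_result = {"confidence": 0.99, "encoding": "utf-8"}
text = decode_with_result(b"\xe4\xb8\xad", sample_result)
print(text == u"\u4e2d")   # True
```

Checking the confidence before trusting the guess is a design choice for this sketch; the threshold of 0.5 is arbitrary.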

(3) Usage method 2: Detect only part of the content to determine the encoding, which is faster


import urllib2
from chardet.universaldetector import UniversalDetector
res = urllib2.urlopen('http://www.php.cn')
detector = UniversalDetector()
for line in res.readlines():
    # feed lines until the detector reaches its confidence threshold
    detector.feed(line)
    if detector.done:
        break
detector.close()
res.close()
print detector.result
# {'confidence': 0.99, 'encoding': 'utf-8'}


[Convert encoding]

1. To convert from a specific encoding (ASCII, ISO-8859-1, utf-8, utf-16, GBK, GB2312, etc.) to unicode, use unicode(s, charset) or s.decode(charset) directly, where charset is the encoding of s. Note that calling decode() on a unicode object can raise an error, because Python 2 first implicitly encodes it with the default encoding (ascii);


# Convert any string to unicode
def to_unicode(s, encoding):
    if isinstance(s, unicode):
        return s
    else:
        return unicode(s, encoding)


Note: decode() raises an error when it encounters illegal characters (for example the non-standard full-width spaces \xa3\xa0 or \xa4\x57; the real full-width space is \xa1\xa1).

Solution: Use 'ignore' mode, that is: stra.decode('...', 'ignore').encode('utf-8').

Explanation: The function prototype of decode is decode([encoding],[errors='strict']), and the second parameter can be used to control the error handling strategy.

The default is strict, which means an exception is raised when illegal characters are encountered. If set to ignore, illegal characters are skipped. If set to replace, illegal characters are replaced: with ? when encoding and with the replacement character U+FFFD when decoding. If set to xmlcharrefreplace, which applies only when encoding, XML character references are used.
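The error-handling strategies can be compared side by side in a short sketch. For a predictable result it uses utf-8 with a byte (\xff) that is never valid in that encoding, rather than the GBK full-width-space case mentioned above; that substitution is an assumption made for the example.

```python
raw = b"\xe4\xb8\xad\xff"   # utf-8 bytes for 中 followed by an invalid byte

try:
    raw.decode("utf-8")     # errors='strict' is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", "ignore")     # drops the invalid byte
replaced = raw.decode("utf-8", "replace")   # substitutes U+FFFD for it

# When encoding, 'replace' emits '?' and 'xmlcharrefreplace' emits &#N;
as_question = u"\u4e2d".encode("ascii", "replace")
as_xmlref = u"\u4e2d".encode("ascii", "xmlcharrefreplace")

print(strict_failed)   # True
```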

2. To convert from unicode to a specific encoding, use s.encode(charset) directly, where s is a unicode string and charset is the target encoding. Note that calling encode() on a non-unicode string can raise an error, because Python 2 first implicitly decodes it with the default encoding (ascii);

3. Naturally, when converting from one specific encoding to another specific encoding, you can first decode into unicode and then encode into the final encoding.
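The decode-then-encode route between two specific encodings might be wrapped as follows. The function name transcode and its errors parameter are hypothetical, introduced only for this sketch, which should run under both Python 2 and Python 3.

```python
def transcode(data, source, target, errors="strict"):
    """Re-encode bytes from source to target, passing through unicode."""
    return data.decode(source, errors).encode(target, errors)

gbk_bytes = b"\xd6\xd0"                             # 中 in GBK
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")   # 中 in utf-8
print(utf8_bytes == b"\xe4\xb8\xad")                # True
```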

[python command line encoding (system encoding)]

Use Python's built-in locale module to detect the default encoding of the command line (that is, the system encoding) and to set the command-line encoding:


import locale
# get the encoding
print locale.getdefaultlocale()  # ('zh_CN', 'cp936')
# set the encoding
locale.setlocale(locale.LC_ALL, locale='zh_CN.GB2312')
print locale.getlocale()  # ('zh_CN', 'gb2312')


The output shows that the internal encoding of the current system is cp936, which is roughly GBK. In fact, Chinese Windows XP and Windows 7 both use cp936 (GBK) as their internal system encoding.
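The relationship between cp936 and GBK can also be checked from Python itself. The sketch below is an illustration only: it asks the codecs module which codec the name cp936 resolves to and reads the locale's preferred encoding, whose value depends on the machine it runs on.

```python
import codecs
import locale

# Encoding the current locale prefers for text; the value is system-dependent
# (for example cp936 on a Chinese Windows installation).
preferred = locale.getpreferredencoding()

# codecs.lookup() resolves aliases, so cp936 maps to Python's gbk codec.
resolved = codecs.lookup("cp936").name

print("preferred: %s" % preferred)
print("cp936 resolves to: %s" % resolved)
```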

[Encoding in python code]

1. When a string in Python code has no encoding specified, its encoding is the same as the encoding of the code file itself. For example, if the string str = '中文' is in a utf8-encoded code file, the string is utf8-encoded; if it is in a gb2312 file, the string is gb2312-encoded. So how do you know the encoding of the code file itself?

(1) Specify the encoding of the code file yourself: Add "#-*- coding:utf-8 -*-" to the header of the code file to declare that the code file is utf-8 encoded. At this time, the encoding of strings whose encoding is not specified becomes utf-8.
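How such a declaration is recognized can be sketched with a regular expression modeled on the one given in PEP 263. The function declared_encoding and its default argument are hypothetical; the default of ascii reflects Python 2 behavior, whereas Python 3 assumes utf-8.

```python
import re

# Pattern modeled on the one given in PEP 263.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(source_lines, default="ascii"):
    """Return the encoding declared in the first two lines, else default."""
    for line in source_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return default

print(declared_encoding(["#-*- coding:utf-8 -*-", "x = 1"]))   # utf-8
```

Only the first two lines are examined because a declaration placed later in the file is not honored.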

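(2) When the encoding of the code file is not declared, the file is created using Python's default encoding (generally ascii; on Windows it is actually saved as cp936 (GBK)). Use sys.getdefaultencoding() and sys.setdefaultencoding('...') to get and set this default encoding. Note that setdefaultencoding() governs implicit conversions between str and unicode at run time; the bytes written to disk are determined by the editor that saves the file.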


import sys
reload(sys)  # restores sys.setdefaultencoding, which site.py deletes at startup
print sys.getdefaultencoding()  # ascii
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()  # utf-8


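An experiment combining (1) and (2): when the code file's encoding is declared as utf-8, Notepad++ shows the file as UTF-8 without BOM; when no encoding is declared, Notepad++ shows it as ANSI (the default encoding used when saving).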


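(3) How can Python's default encoding be set to utf-8 permanently? There are two methods:

Method 1: edit site.py and modify the setencoding() function to force utf-8;

Method 2: add a file named sitecustomize.py and place it in the \Lib\site-packages directory under the installation directory.

sitecustomize.py is imported and executed by site.py. Because sys.setdefaultencoding() is only deleted at the end of site.py, it can still be called from sitecustomize.py.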
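A minimal sketch of what such a sitecustomize.py could contain is shown below. It only has an effect under Python 2; the hasattr() guard is an addition for this example so that the file stays harmless on interpreters where the function does not exist.

```python
# sitecustomize.py (sketch; takes effect only under Python 2)
import sys

# site.py imports this module before it deletes sys.setdefaultencoding,
# so the function is still reachable at this point under Python 2.
if hasattr(sys, "setdefaultencoding"):
    sys.setdefaultencoding("utf-8")
```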

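2. If a string in Python code has its encoding specified, for example str = u'中文', then the string's encoding is unicode (Python's internal encoding).

(1) There is a common misconception to watch out for here. Suppose a .py file contains the following code: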


stra = u"中"
print stra.encode("gbk")


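Based on the above, stra is in unicode form, so encoding it directly to gbk should work. In practice, however, execution fails with “UnicodeEncodeError: 'gbk' codec can't encode character u'\xd6' in position 0: illegal multibyte sequence”.

The reason: when the Python interpreter imports and executes a code file, it first checks whether the file header contains an encoding declaration (for example #coding:gbk). If it finds one, it first converts the strings in the file to unicode form (here it uses the declared encoding gbk (cp936) to decode the bytes 'd6d0' of stra into the unicode code point U+4E2D). When stra.encode('gbk') then runs, stra is already unicode and U+4E2D is within the range of gbk, so encoding succeeds. If the file header has no encoding declaration, this decoding step is skipped (each byte is used directly as a code point, so stra begins with u'\xd6'), and stra.encode('gbk') fails because u'\xd6' is outside the range of gbk.

(2) To avoid this kind of error, it is best to declare the encoding at the top of the code file, or, more laboriously, to call setdefaultencoding() each time.

(3) In short, unicode is the internal code of the Python interpreter. When a code file is imported and executed, the interpreter first decodes its strings into unicode using the encoding you specified and only then operates on them. So whether you are manipulating strings, using regular expressions, or reading and writing files, it is best to work through unicode.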
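The pitfall described in (1) can be reproduced without editing any file header. The sketch below starts from the two bytes that a GBK-saved file holds for 中 and decodes them in two ways; using latin-1 to mimic "each byte becomes its own code point" is an assumption made for this illustration.

```python
source_bytes = b"\xd6\xd0"   # how a GBK-saved file stores 中

# With a gbk declaration, the bytes are decoded as one GBK sequence.
declared = source_bytes.decode("gbk")          # u'\u4e2d'
# Without one, each byte is effectively taken as its own code point,
# which is the same mapping that latin-1 decoding performs.
undeclared = source_bytes.decode("latin-1")    # u'\xd6\xd0'

print(declared.encode("gbk") == source_bytes)  # True

try:
    undeclared.encode("gbk")
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)   # True
```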

[Other encodings in python]

File system encoding: sys.getfilesystemencoding()
Terminal input encoding: sys.stdin.encoding
Terminal output encoding: sys.stdout.encoding
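These values can be collected in one place, as in the sketch below. The helper stream_encoding and the report dictionary are hypothetical names; getattr() is used because a redirected or replaced stream may not report an encoding.

```python
import sys

def stream_encoding(stream):
    """Return the stream's encoding, or None if it does not report one."""
    return getattr(stream, "encoding", None)

report = {
    "filesystem": sys.getfilesystemencoding(),
    "stdin": stream_encoding(sys.stdin),
    "stdout": stream_encoding(sys.stdout),
    "default": sys.getdefaultencoding(),
}
for name in sorted(report):
    print("%s: %s" % (name, report[name]))
```

The printed values depend on the operating system, the terminal, and the Python version.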
