Explanation of the conversion module codecs in Python (with examples)
This article explains how to convert between character encodings in Python using the codecs module, with examples.
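Converting between encodings usually goes through unicode as the intermediate form: first decode the string from its original encoding into unicode, then encode the unicode string into the target encoding. The first snippet and the notes that follow use Python 2 syntax, where str holds encoded bytes and unicode is a separate type.
str1.decode('gb2312')  # convert a gb2312-encoded string to unicode
str2.encode('gb2312')  # convert a unicode string to gb2312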
Note:
s='Chinese' (standing for a string literal that contains Chinese text): if it is in a utf8 file, the string is utf8 encoded; if it is in a gb2312 file, the string is gb2312 encoded. To convert it, first use the decode method to turn it into unicode, then use the encode method to convert it to the other encoding.
When no encoding is specified, the code file is created in the system default encoding.
If the string is defined as s=u'Chinese', it is unicode, which is Python's internal encoding, regardless of the encoding of the code file itself. It can be converted to the target encoding directly with the encode method.
If a string is already unicode, decoding it can raise an error, so it is usually necessary to check first whether the string is unicode:
isinstance(s, unicode)  # check whether s is unicode (Python 2 only)
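In Python 3 the equivalent check is made against bytes rather than unicode, because str is always Unicode text and only bytes objects have a decode method. A minimal sketch, assuming a hypothetical helper named to_text (not part of codecs) and that the caller knows the source encoding:

```python
def to_text(s, encoding="utf-8"):
    """Return s as str, decoding only when it is still bytes.

    to_text is a hypothetical helper for illustration; the default
    encoding is an assumption and should match the real data.
    """
    if isinstance(s, bytes):
        return s.decode(encoding)
    return s  # already Unicode text, nothing to decode


raw = "中文".encode("gb2312")    # bytes in gb2312
text = to_text(raw, "gb2312")    # decoded to str
same = to_text(text, "gb2312")   # str passes through unchanged
print(text, same)
```

Checking the type first avoids calling decode on a value that is already text, which is the failure the note above warns about.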
The encoding of a file can be checked in several ways:
(1) In Notepad, "File" -> "Save As" shows the current encoding.
(2) Open the file with Notepad and click "Format" in the menu bar to view it.
(3) UltraEdit:
The encoding of a text file is marked by its first bytes (the byte order mark, BOM). The definition is as follows:
ANSI: no marker;
Unicode (UTF-16 little endian): the first two bytes are FF FE;
Unicode big endian (UTF-16 big endian): the first two bytes are FE FF;
UTF-8: the first three bytes are EF BB BF, so the first two are EF BB. A UTF-8 file may also be saved without a BOM.
In this way, the format of a file can usually be determined from its first bytes.
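The byte-marker check can be sketched with the BOM constants that the codecs module provides. The helper name sniff_bom and the returned labels are assumptions made for this example; the sketch ignores UTF-32 and cannot tell ANSI apart from UTF-8 saved without a BOM:

```python
import codecs

# BOM constants from the codecs module, paired with a label.
# The UTF-8 BOM is three bytes (EF BB BF), so it is checked in full.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return an encoding label guessed from a leading BOM, or None.

    sniff_bom is a hypothetical helper; None means "no BOM found",
    which covers ANSI files and UTF-8 files saved without a BOM.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xff\xfeh\x00i\x00"))  # starts with FF FE
print(sniff_bom(b"\xef\xbb\xbfhi"))      # starts with EF BB BF
print(sniff_bom(b"plain text"))          # no BOM
```

In practice the bytes passed in would be the first few bytes of a file opened in binary mode.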
When Python converts between encodings, it uses its internal encoding as the intermediate step. The conversion process is as follows:
Original encoding -> Internal encoding -> Target encoding
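The three-step process above can be sketched in Python 3 as a decode followed by an encode. The function name transcode is a hypothetical helper for this example, not something codecs provides:

```python
def transcode(data, source, target):
    """Re-encode bytes from one encoding to another via str.

    transcode is a hypothetical helper, not a codecs function.
    """
    text = data.decode(source)   # original encoding -> internal str
    return text.encode(target)   # internal str -> target encoding


gb_bytes = "风卷残云".encode("gb2312")
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")
print(len(gb_bytes), len(utf8_bytes))
print(utf8_bytes.decode("utf-8"))
```

The byte lengths differ because the two encodings use different numbers of bytes per character, while the decoded text is the same.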
Python processes text internally as unicode, but there are two internal formats to consider. One is UCS-2, which has 65536 code points. The other is UCS-4, which has 2147483648 code points.
To determine which one the installed Python uses:
import sys
print(sys.maxunicode)
If the output is 65535, the build uses UCS-2; if it is 1114111, it uses UCS-4. Since Python 3.3 the value is always 1114111.
Convert to internal code:
import codecs

c = "风卷残云"
print(type(c))
c = bytes(c, encoding='utf-8')
print(type(c))
print(c)
b = codecs.decode(c, "utf-8")  # equivalent to c.decode()
print(type(b))
print(b)
print(c.decode())
Output:
<class 'str'>
<class 'bytes'>
b'\xe9\xa3\x8e\xe5\x8d\xb7\xe6\xae\x8b\xe4\xba\x91'
<class 'str'>
风卷残云
风卷残云
codecs is the module dedicated to encoding conversion, and its interface can be extended to support other conversions.
In Python 3.x, str is already unicode, so bytes data can be decoded directly with a codec's decode method, with no separate manual conversion step.
import codecs

a = "我爱你"
# look up the utf-8 codec
look = codecs.lookup('utf-8')
type(a)  # str; a bare expression prints nothing in a script
a = bytes(a, encoding='utf-8')
b = look.decode(a)
print(b)
Output:
('我爱你', 9)
In the returned tuple, b[0] is the decoded data and b[1] is the length, that is, the number of bytes consumed (9 here).
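The object returned by codecs.lookup also has an encode method that returns the same kind of tuple. A short sketch of both directions; the variable names are chosen only for this example:

```python
import codecs

info = codecs.lookup("utf-8")   # returns a codecs.CodecInfo
print(info.name)                # canonical codec name

encoded, consumed = info.encode("我爱你")
print(encoded, consumed)        # consumed counts input characters

decoded, used = info.decode(encoded)
print(decoded, used)            # used counts input bytes
```

The second element counts units of the input, so it is a character count when encoding and a byte count when decoding.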
**Use the open method provided by codecs to specify the encoding of the file being opened. The contents are converted to internal unicode automatically when reading.**
f = codecs.open(filepath, 'r', 'utf8')
There are several ways to read the file: f can be traversed with a for loop, or read directly with the read, readline, or readlines methods.
#for i in f:
#    print(i)
#f.readline()
#f.read()
#f.readlines()
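The reading options can be tried end to end with a throwaway file. This is a minimal sketch: the temporary path and the sample text are made up for the example, and in Python 3 the built-in open with an encoding argument can be used in the same way.

```python
import codecs
import os
import tempfile

# Write a small UTF-8 file to a temporary path (the path and the
# sample text are made up for this example).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with codecs.open(path, "w", "utf8") as f:
    f.write("风卷残云\n我爱你\n")

# Iterate line by line; each line is already a decoded str.
with codecs.open(path, "r", "utf8") as f:
    lines = [line.rstrip("\n") for line in f]

# read() returns the whole file as one str.
with codecs.open(path, "r", "utf8") as f:
    content = f.read()

print(lines)
print(len(content))
```

Using with blocks closes the file after each pass, which the one-line open call above leaves to the caller.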