Detailed explanation of the character encoding problem when lxml processes xml
To simplify the problem, the XML content is reduced to the following form:
<?xml version="1.0" encoding="gbk"?><DOCUMENT><da><![CDATA[中文,就是任性]]></da></DOCUMENT>
Its declared encoding is gbk, and one of its nodes contains Chinese text.
Extracting the value of that node with lxml raised the following exception:
lxml.etree.XMLSyntaxError: Extra content at the end of the document
The Python script that produced it was:
from io import BytesIO
from lxml import etree

# The declaration says gbk, but the bytes handed to the parser are UTF-8.
tst = u'<?xml version="1.0" encoding="gbk"?><DOCUMENT><da><![CDATA[中文,就是任性]]></da></DOCUMENT>'
for event, element in etree.iterparse(BytesIO(tst.encode('utf-8'))):
    print("%s, %s" % (element.tag, element.text))
Before the XML was simplified, a different exception was reported:
lxml.etree.XMLSyntaxError: input conversion failed due to input error, bytes 0x8B 0x2C 0xE6 0x9D
Whichever exception appears, it is probably related to the character encoding.
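To make the suspicion concrete, the sketch below uses only Python's built-in codecs (no lxml) to show that the same text yields different bytes under UTF-8 and GBK, and that reading UTF-8 bytes as GBK does not give the text back. The variable names are illustrative and not part of the original script.

```python
# Illustration only: the same text produces different bytes under UTF-8 and GBK.
text = "中文,就是任性"

utf8_bytes = text.encode("utf-8")  # what the script actually feeds the parser
gbk_bytes = text.encode("gbk")     # what the declaration encoding="gbk" promises

# Reading bytes with the codec that wrote them recovers the text.
round_trip = gbk_bytes.decode("gbk")

# Reading UTF-8 bytes as GBK does not; errors="replace" keeps this from raising.
misread = utf8_bytes.decode("gbk", errors="replace")

print(utf8_bytes != gbk_bytes)  # True
print(round_trip == text)       # True
print(misread == text)          # False
```

A parser that trusts the declaration is in the same position as the last line: it decodes bytes with a codec that did not produce them.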
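After several unsuccessful attempts, I came across a question on Stack Overflow describing a problem related to the encoding value in the XML declaration. Here the declaration says gbk, but the script encodes the string as UTF-8 before passing it to the parser, so the parser decodes the bytes with the wrong codec. I tried adding one line of code:
from io import BytesIO
from lxml import etree

tst = u'<?xml version="1.0" encoding="gbk"?><DOCUMENT><da><![CDATA[中文,就是任性]]></da></DOCUMENT>'
# Make the declaration agree with the UTF-8 bytes that are actually parsed.
tst = tst.replace('encoding="gbk"', 'encoding="utf-8"')
for event, element in etree.iterparse(BytesIO(tst.encode('utf-8'))):
    print("%s, %s" % (element.tag, element.text))
The added replace statement changes encoding="gbk" in the declaration to encoding="utf-8", so the declaration now matches the bytes passed to the parser.
This finally produced the expected result:
da, 中文,就是任性
DOCUMENT, None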
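The literal replace only works when the declaration is written exactly as encoding="gbk". A slightly more general version could rewrite whatever encoding the declaration names, in either quote style. The following is a minimal sketch using the standard re module; the helper name to_bytes_with_matching_declaration is hypothetical and is not an lxml function.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration,
# with either quote style. Hypothetical helper, not part of lxml.
_DECL_ENCODING = re.compile(
    r'''^(<\?xml[^>]*?encoding\s*=\s*)(["'])[A-Za-z][A-Za-z0-9._-]*\2'''
)


def to_bytes_with_matching_declaration(xml_text: str, encoding: str = "utf-8") -> bytes:
    """Encode xml_text and make its declaration name the same encoding."""
    fixed = _DECL_ENCODING.sub(
        # Keep the prefix and the original quote character, swap only the name.
        lambda m: f"{m.group(1)}{m.group(2)}{encoding}{m.group(2)}",
        xml_text,
        count=1,
    )
    return fixed.encode(encoding)


tst = '<?xml version="1.0" encoding="gbk"?><DOCUMENT><da><![CDATA[中文,就是任性]]></da></DOCUMENT>'
data = to_bytes_with_matching_declaration(tst)
print(data.decode("utf-8"))
```

The resulting bytes can be wrapped in BytesIO and passed to etree.iterparse in the same way as in the script. Text without a declaration is simply encoded unchanged.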
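An alternative to editing the declaration is to leave it alone and encode the text with the encoding it names, so that declaration and bytes agree from the other direction. The sketch below only prepares the bytes with the standard library; declared_encoding is a hypothetical helper, and whether the parser can then decode GBK depends on how the underlying libxml2 was built, which is an assumption to verify on your system.

```python
import re


def declared_encoding(xml_text: str, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or default if absent."""
    m = re.match(
        r'''<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']''',
        xml_text,
    )
    return m.group(1) if m else default


tst = '<?xml version="1.0" encoding="gbk"?><DOCUMENT><da><![CDATA[中文,就是任性]]></da></DOCUMENT>'
enc = declared_encoding(tst)  # "gbk" for this document
data = tst.encode(enc)        # bytes now agree with the declaration
```

This keeps the document byte-for-byte consistent with its own header, which matters if the bytes are also written back to disk or sent to another consumer.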