Python handles HTML escape characters
This article describes, with examples, how Python handles HTML escape characters. The details are as follows:
While processing web page data with Python recently, I often ran into HTML escape characters (also called HTML character entities), such as &nbsp;, &gt; and &lt;. Character entities are generally used to represent reserved characters in web pages. For example, > is written as &gt; so that the browser does not mistake it for part of a tag. For details, see w3school's page on HTML character entities. Although useful, they get in the way when parsing web data. There are the following ways to handle these escape characters:
1. Unescape with HTMLParser
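    # Python 2 (on Python 3 this module is named html.parser)
    import HTMLParser

    html_cont = "&nbsp;asdfg&gt;123&lt;"
    html_parser = HTMLParser.HTMLParser()
    new_cont = html_parser.unescape(html_cont)
    # The leading character is a non-breaking space (U+00A0), not an ASCII space
    print new_cont  # new_cont = " asdfg>123<"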
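The snippet above is Python 2. On Python 3 the HTMLParser module no longer exists under that name, and the unescape method was removed in Python 3.9. A minimal sketch of the same step, assuming Python 3.4 or later and the same input string, could use the standard library's html.unescape instead:

```python
import html

html_cont = "&nbsp;asdfg&gt;123&lt;"

# html.unescape converts named and numeric character references to text.
new_cont = html.unescape(html_cont)

# &nbsp; becomes U+00A0 (a non-breaking space), not an ordinary ASCII space,
# so repr() is used here to make the difference visible.
print(repr(new_cont))  # '\xa0asdfg>123<'
```

If an ordinary space is wanted, a follow-up new_cont.replace("\xa0", " ") would do it.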
To convert back (except that the space is not converted back to &nbsp;):
    # Python 2; continues from the previous snippet
    import cgi

    new_cont = cgi.escape(new_cont)
    print new_cont  # new_cont = " asdfg&gt;123&lt;"
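cgi.escape is also gone from current Python 3 releases (it was removed in Python 3.8). A rough Python 3 sketch of the reverse step, using html.escape, might look like the following. The final replace that restores &nbsp; is an extra step added here for illustration, not something the escape function does itself:

```python
import html

new_cont = "\xa0asdfg>123<"  # unescaped text; \xa0 is a non-breaking space

# html.escape rewrites &, < and > (and quotes too, since quote=True by default).
escaped = html.escape(new_cont)

# Illustrative extra step: turn the non-breaking space back into &nbsp;.
# This must run after html.escape, otherwise the "&" in "&nbsp;" would
# itself be escaped to "&amp;nbsp;".
restored = escaped.replace("\xa0", "&nbsp;")
print(restored)  # &nbsp;asdfg&gt;123&lt;
```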
2. Replace the entities directly with replace
    # Python 2
    html_cont = "&nbsp;asdfg&gt;123&lt;"
    new_cont = html_cont.replace('&nbsp;', ' ')
    print new_cont  # new_cont = " asdfg&gt;123&lt;"
    new_cont = new_cont.replace('&gt;', '>')
    print new_cont  # new_cont = " asdfg>123&lt;"
    new_cont = new_cont.replace('&lt;', '<')
    print new_cont  # new_cont = " asdfg>123<"
I don't know if there is a better way. In addition, Stack Overflow has an answer about handling escape characters in XML: "python - What's the best way to handle &nbsp;-like entities in XML documents with lxml? - Stack Overflow".
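On the question of a better way than chaining replace calls: one possible alternative is to decode every entity in a single regular-expression pass, looking names up in the standard library's html.entities.name2codepoint table. The sketch below assumes Python 3. The helper name replace_entities and the pattern are made up for illustration, and out-of-range numeric references are not validated. For most purposes html.unescape is the simpler choice.

```python
import re
from html.entities import name2codepoint

# Matches named (&gt;), decimal (&#62;) and hexadecimal (&#x3e;) references.
_ENTITY_RE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def replace_entities(text):
    """Hypothetical helper: decode each entity once, left to right."""
    def decode(match):
        ref = match.group(1)
        if ref.startswith(("#x", "#X")):
            return chr(int(ref[2:], 16))      # hexadecimal reference
        if ref.startswith("#"):
            return chr(int(ref[1:]))          # decimal reference
        if ref in name2codepoint:
            return chr(name2codepoint[ref])   # named reference
        return match.group(0)                 # unknown name: leave as-is
    return _ENTITY_RE.sub(decode, text)


print(repr(replace_entities("&nbsp;asdfg&gt;123&lt;")))   # '\xa0asdfg>123<'
print(replace_entities("&amp;lt; &#62; &#x3c; &bogus;"))  # &lt; > < &bogus;
```

Because the text is scanned once, "&amp;lt;" decodes to the literal text "&lt;" rather than to "<". A chain of replace calls that handles &amp; before &lt; would decode it twice.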