
python handles html escape characters

高洛峰 | Original: 2017-03-01 13:27:57

This article describes, with examples, how to handle HTML escape characters in Python. The details are as follows:

While processing web page data with Python recently, I often ran into HTML escape characters (also called HTML character entities), such as &nbsp;, &lt; and &gt;. Character entities are generally used to represent reserved characters in web pages. For example, > is written as &gt; so that the browser does not mistake it for part of a tag. For details, see w3school's page on HTML character entities. Useful as they are, they get in the way when parsing web data. There are the following ways to handle them:

1. Use HTMLParser to process

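# Python 2 code: the HTMLParser module was renamed html.parser in Python 3.
import HTMLParser
html_cont = "&nbsp;asdfg&gt;123&lt;"
html_parser = HTMLParser.HTMLParser()
new_cont = html_parser.unescape(html_cont)
print new_cont  # new_cont = " asdfg>123<" (the leading character is a no-break space, U+00A0)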

Convert back (only the space cannot be converted back, because cgi.escape rewrites just &, < and >):

import cgi
new_cont = cgi.escape(new_cont)
print new_cont  # new_cont = " asdfg&gt;123&lt;"
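
The code above is Python 2. On Python 3 the same round trip can be sketched with the standard `html` module. This is a minimal sketch, not the original author's code; the variable names `plain` and `escaped` are introduced here only for illustration.

```python
import html

html_cont = "&nbsp;asdfg&gt;123&lt;"

# html.unescape decodes named and numeric character references.
new_cont = html.unescape(html_cont)

# &nbsp; decodes to U+00A0 (no-break space), not to an ASCII space,
# so normalise it explicitly if a plain space is wanted.
plain = new_cont.replace("\u00a0", " ")

# html.escape rewrites &, < and > (and quotes by default);
# it leaves U+00A0 untouched, so the space is not turned back into &nbsp;.
escaped = html.escape(new_cont)

print(repr(new_cont))
print(repr(plain))
print(repr(escaped))
```

This also shows why the space "cannot be converted back": the escaping function has no rule for the no-break space.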

2. Replace directly

html_cont = "&nbsp;asdfg&gt;123&lt;"
new_cont = html_cont.replace('&nbsp;', ' ')
print new_cont  # new_cont = " asdfg&gt;123&lt;"
new_cont = new_cont.replace('&gt;', '>')
print new_cont  # new_cont = " asdfg>123&lt;"
new_cont = new_cont.replace('&lt;', '<')
print new_cont  # new_cont = " asdfg>123<"

I don't know whether there is a better way.
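
One possible refinement of chained `replace` calls is to do all substitutions in a single pass with a regular expression. The sketch below assumes a small, hand-written table (`ENTITY_TABLE` and `replace_entities` are hypothetical names, and the table is deliberately incomplete). A single pass matters once `&amp;` is in the table: replacing `&amp;` first and `&lt;` afterwards would decode the text `&amp;lt;` twice.

```python
import re

# Hypothetical, deliberately small table; extend it with the entities your data contains.
ENTITY_TABLE = {
    "&nbsp;": " ",
    "&gt;": ">",
    "&lt;": "<",
    "&amp;": "&",
}

# One alternation over all keys, so every entity is matched in a single scan.
_ENTITY_RE = re.compile("|".join(re.escape(key) for key in ENTITY_TABLE))


def replace_entities(text):
    """Replace each known entity exactly once, scanning left to right."""
    return _ENTITY_RE.sub(lambda match: ENTITY_TABLE[match.group(0)], text)


single_pass = replace_entities("&amp;lt;")
chained = "&amp;lt;".replace("&amp;", "&").replace("&lt;", "<")

print(replace_entities("&nbsp;asdfg&gt;123&lt;"))
print(single_pass)  # decoded once
print(chained)      # decoded twice
```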

In addition, Stack Overflow has an answer on handling escape characters in XML: "python - What's the best way to handle &nbsp;-like entities in XML documents with lxml?" (Stack Overflow).
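
The sketch below does not reproduce that answer and does not use lxml. It only illustrates, with the Python 3 standard library, why HTML entities are a problem in XML: XML predefines just five entities, so a name such as `&nbsp;` makes a plain XML parser fail. One possible workaround, assumed here, is to rewrite HTML entity names as numeric character references before parsing. The names `XML_PREDEFINED` and `html_entities_to_numeric` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser already understands.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_NAMED_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def html_entities_to_numeric(xml_text):
    """Rewrite HTML-only named entities (e.g. &nbsp;) as numeric references."""
    def to_numeric(match):
        name = match.group(1)
        # Leave XML's own entities and unknown names untouched.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _NAMED_ENTITY_RE.sub(to_numeric, xml_text)


doc = "<p>a&nbsp;b &lt; c</p>"
root = ET.fromstring(html_entities_to_numeric(doc))
print(repr(root.text))
```

Unknown entity names are left as they are, so the parser still reports them instead of silently dropping data.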
