How to Convert XML/HTML Entities to Unicode in Python?

Barbara Streisand
2024-11-04 00:06:30

Converting XML/HTML Entities to Unicode in Python

Challenge:

Scraped web pages often use HTML entities such as &copy; to represent non-ASCII characters. The task is to convert a string containing these entities into a Unicode string that holds the actual characters.

Solution:

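In Python 2, the standard library's HTMLParser class has an undocumented method, unescape(), that does this. Since Python 3.4, the same functionality is available as the documented function html.unescape(); the old HTMLParser.unescape() method was deprecated in 3.4 and removed in Python 3.9.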

Implementation:

For Python 2.7 (legacy):

<code class="python">import HTMLParser

h = HTMLParser.HTMLParser()
result = h.unescape('&copy; 2010')  # u'\xa9 2010'</code>

For Python 3.4 and later:

<code class="python">import html

result = html.unescape('&copy; 2010')  # '\xa9 2010', i.e. '© 2010'</code>
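As an illustration of the two variants above, here is a minimal sketch that picks whichever unescape() is available and applies it to a string mixing named, decimal, and hexadecimal references. The `scraped` sample string is made up for this example and does not come from a real page.

```python
# Prefer the documented function (Python 3.4+); fall back to the
# undocumented HTMLParser method on Python 2.
try:
    from html import unescape
except ImportError:
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

# Hypothetical scraped text: named (&eacute;, &lt;, &gt;),
# decimal (&#169;) and hexadecimal (&#x01ce;) references.
scraped = "Caf&eacute; &#169; 2010 &#x01ce; &lt;b&gt;"

decoded = unescape(scraped)
print(decoded)  # Café © 2010 ǎ <b>
```

Note that markup-significant characters such as < and > are decoded too, so the result should be treated as plain text rather than reinserted into HTML without escaping.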

Example:

Consider the numeric character reference &#x01ce;, which represents "ǎ" (an "a" with a caron, used as a tone mark), code point U+01CE. unescape() converts it to the one-character Unicode string '\u01ce':

<code class="python">result = html.unescape('&#x01ce;')  # '\u01ce', i.e. 'ǎ'</code>
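To check the result of a conversion like this one, the following sketch (assuming Python 3.4 or later) decodes the hexadecimal reference and its decimal equivalent, then inspects the resulting code point with the standard unicodedata module. The variable names are chosen for illustration only.

```python
import html
import unicodedata

hex_form = html.unescape("&#x01ce;")  # hexadecimal reference
dec_form = html.unescape("&#462;")    # decimal reference, 0x01CE == 462

# Both references decode to the same single character.
same = hex_form == dec_form
code_point = hex(ord(hex_form))
name = unicodedata.name(hex_form)

print(same)        # True
print(code_point)  # 0x1ce
print(name)        # LATIN SMALL LETTER A WITH CARON
```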
