How to convert HTML entities to Unicode strings in Python?

Mary-Kate Olsen
2024-11-05 05:21:02

Convert XML/HTML Entities into Unicode String in Python

Question: How can I convert a string containing HTML entities into a Unicode string in Python? For example, the string "&#x01ce;" should be converted to "ǎ", the letter a with a tone mark (u'\u01ce').

Answer:

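In Python 2, the standard library's HTMLParser class has an undocumented method called unescape(). It converts HTML entities, both named and numeric, into their Unicode equivalents. The method was deprecated in Python 3.4 and removed in Python 3.9, so use it only in legacy code.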

<code class="python">import HTMLParser  # Python 2 only
h = HTMLParser.HTMLParser()
h.unescape('&copy; 2010') # u'\xa9 2010'
h.unescape('&#169; 2010') # u'\xa9 2010'</code>

For Python 3.4 and above, use the unescape() function from the html module instead:

<code class="python">import html
html.unescape('&copy; 2010') # '© 2010'
html.unescape('&#169; 2010') # '© 2010'</code>
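
As a quick check against the example in the question, the sketch below runs html.unescape() on the hexadecimal reference, its decimal equivalent, and the already-decoded character, then on a string that mixes named and numeric entities. The variable names and sample strings are illustrative only and assume Python 3.4 or later.

```python
import html

# Three spellings of the same character: hexadecimal reference,
# decimal reference (0x01ce == 462), and the literal character.
samples = ["&#x01ce;", "&#462;", "\u01ce"]
decoded = [html.unescape(s) for s in samples]
print(decoded)

# Named and numeric entities can be mixed with ordinary text;
# text without entities passes through unchanged.
text = html.unescape("&lt;b&gt;caf&eacute;&lt;/b&gt; &amp; &#169; 2010")
print(text)
```

All three samples should decode to the same one-character string, which is the u'\u01ce' value the question asks for.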
