How to Convert XML/HTML Entities to Unicode Strings in Python?

Susan Sarandon
2024-11-04 06:36:02

Converting XML/HTML Entities to Unicode Strings in Python

In web scraping, HTML entities are frequently used to represent non-ASCII characters. To decode these entities in Python and obtain the corresponding Unicode string, use the unescape() function from the standard library: html.unescape() in Python 3.4 and later, or the unescape() method of HTMLParser.HTMLParser in Python 2.
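As a minimal sketch of what decoding means here, assuming Python 3.4 or later (where the function is available as html.unescape), the snippet below decodes the same character written as a named entity, a decimal reference, and a hexadecimal reference. The variable names are illustrative only.

```python
import html

# Three spellings of the same character, the copyright sign (U+00A9).
named = html.unescape("&copy;")         # named entity
decimal = html.unescape("&#169;")       # decimal numeric reference
hexadecimal = html.unescape("&#xa9;")   # hexadecimal numeric reference

# All three forms decode to the same one-character string.
print(named == decimal == hexadecimal == "\u00a9")
```

All three forms should compare equal, since each one names the same code point.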

Example:

Suppose you have the following entity:

&#x01ce;

which represents an "ǎ" with a tone mark. The hexadecimal code point of this character is 01ce (it fits in 16 bits). To convert this entity into the Unicode value u'\u01ce':

Python 2:

import HTMLParser
h = HTMLParser.HTMLParser()
unicode_string = h.unescape('&copy; 2010')  # named entity -> u'\xa9 2010'
unicode_string = h.unescape('&#169; 2010')  # decimal reference -> u'\xa9 2010'

Python 3.4 and later:

import html
unicode_string = html.unescape('&copy; 2010')  # named entity -> '\xa9 2010'
unicode_string = html.unescape('&#169; 2010')  # decimal reference -> '\xa9 2010'

In each case, unicode_string holds the input string with every entity replaced by the Unicode character it represents.
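The two variants can be combined into one import that works on both Python 2 and Python 3.4+. This is only a sketch: decode_entities is a hypothetical helper name, not part of the standard library, and the fallback does not cover Python 3.0 to 3.3, where the module layout differs. It is applied here to the &#x01ce; entity from the example.

```python
try:
    from html import unescape  # Python 3.4 and later
except ImportError:
    # Assumption: the only fallback handled is Python 2.
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape


def decode_entities(text):
    """Hypothetical helper: return text with HTML entities decoded."""
    return unescape(text)


# The hexadecimal reference from the example decodes to U+01CE.
decoded = decode_entities("&#x01ce;")
print(decoded == u"\u01ce")
```

Wrapping the call in a single helper keeps the version check in one place, so the rest of a scraper can call decode_entities() without caring which interpreter it runs on.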
