How Does Python's `unicodedata.normalize()` Function Simplify Unicode Representations?

DDD
2024-11-22 16:12:15

Normalizing Unicode in Python: Simplifying Unicode Representations

In Python, the unicodedata module provides the normalize() function, which converts a Unicode string to one of the standard normalization forms: NFC, NFD, NFKC, or NFKD. This matters because the same visible character can be represented by more than one sequence of code points.

Consider the following example:

import unicodedata

# Precomposed form: a single code point, U+00E1
char = "\u00e1"
print(len(char))  # Output: 1

for c in char:
    print(unicodedata.name(c))  # Output: LATIN SMALL LETTER A WITH ACUTE

# Decomposed form: a base letter followed by a combining mark
char = "a\u0301"
print(len(char))  # Output: 2

for c in char:
    print(unicodedata.name(c))
# Output:
# LATIN SMALL LETTER A
# COMBINING ACUTE ACCENT

The second "á" is made up of two code points: U+0061 (LATIN SMALL LETTER A) followed by U+0301 (COMBINING ACUTE ACCENT). Both forms render as "á", but they are different strings and compare unequal.

To normalize this string, we can use .normalize('NFC'), which returns the composed form:

print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))  # Output: '\xe1'

Conversely, .normalize('NFD') returns the decomposed form:

print(ascii(unicodedata.normalize('NFD', '\u00E1')))  # Output: 'a\u0301'
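A common reason to normalize is to compare strings that may arrive in different forms. The following is a minimal sketch of that idea; the helper name nfc_equal is hypothetical and is not part of the unicodedata module.

```python
import unicodedata


def nfc_equal(left: str, right: str) -> bool:
    """Compare two strings after converting both to NFC.

    nfc_equal is an illustrative helper, not a standard library function.
    """
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


composed = "caf\u00e9"     # 'é' stored as one code point
decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)           # False: different code point sequences
print(nfc_equal(composed, decomposed))  # True: identical after normalization
```

Normalizing both sides to NFD would work equally well for this comparison; what matters is that both strings use the same form.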

Two further normalization forms handle compatibility code points. NFKD and NFKC replace compatibility characters with their compatibility equivalents; NFKC then composes the result canonically. For example, U+2160 (ROMAN NUMERAL ONE) normalizes to "I", and U+2167 (ROMAN NUMERAL EIGHT) normalizes to "VIII" using NFKC:

print(unicodedata.normalize('NFKC', '\u2167'))  # Output: 'VIII'
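To see how the compatibility forms differ from the canonical ones, the sketch below runs a few other compatibility characters through both NFC and NFKC. The sample characters and the helper name compare_forms are chosen here for illustration only.

```python
import unicodedata


def compare_forms(text: str) -> tuple[str, str]:
    """Return the (NFC, NFKC) forms of text. Hypothetical helper for illustration."""
    return unicodedata.normalize("NFC", text), unicodedata.normalize("NFKC", text)


samples = [
    "\ufb01",  # LATIN SMALL LIGATURE FI
    "\u00b2",  # SUPERSCRIPT TWO
    "\u2460",  # CIRCLED DIGIT ONE
]

for char in samples:
    nfc, nfkc = compare_forms(char)
    # NFC leaves each compatibility character unchanged; NFKC replaces it.
    print(unicodedata.name(char), ascii(nfc), ascii(nfkc))
```

NFC keeps each of these characters as it is, while NFKC turns them into "fi", "2", and "1" respectively.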

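Note that normalization is not always reversible. NFKC and NFKD discard the distinction between a compatibility character and its replacement, so the original string cannot be recovered. Even with the canonical forms, converting to NFD and then back to NFC does not always reproduce the original sequence, because some characters decompose but are never recomposed to their original code point.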
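The following short sketch illustrates two round trips that do not return the starting string. The characters used are examples picked for this illustration.

```python
import unicodedata

# Compatibility normalization is lossy: the Roman numeral cannot be recovered.
numeral = "\u2167"                                # ROMAN NUMERAL EIGHT
folded = unicodedata.normalize("NFKC", numeral)   # plain letters 'VIII'
print(unicodedata.normalize("NFKD", folded) == numeral)  # False

# A canonical round trip can also change the string.
angstrom = "\u212b"                                    # ANGSTROM SIGN
decomposed = unicodedata.normalize("NFD", angstrom)    # 'A' + COMBINING RING ABOVE
recomposed = unicodedata.normalize("NFC", decomposed)  # LATIN CAPITAL LETTER A WITH RING ABOVE
print(ascii(recomposed))       # '\xc5'
print(recomposed == angstrom)  # False
```

If the original code points matter, for example for display or for round-tripping data, keep the unnormalized string and use the normalized copy only for comparison or search.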
