Unicode (UTF-8) Reading and Writing to Files in Python
When working with Unicode strings in Python, it's essential to distinguish between a Unicode string (a sequence of code points) and its encoded form (a sequence of bytes, for example UTF-8). Confusing the two leads to unexpected results, as the following Python 2 example shows:
<code class="python">ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)</code>
The output shows the same text in two forms. In the Unicode string, á is the single code point \xe1; in the UTF-8 encoded byte string, it is the two bytes \xc3\xa1:
("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
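The same relationship can be checked in Python 3, where the two forms are separate types (str and bytes). The following is a minimal sketch; the variable names text, encoded, and decoded are illustrative and not part of the original example.

```python
# Python 3 sketch: str holds code points, bytes holds encoded data.
text = u'Capit\xe1n'             # 7 code points; \xe1 is U+00E1 (á)
encoded = text.encode('utf-8')   # 8 bytes; á becomes the two bytes C3 A1

print(repr(text))
print(repr(encoded))
print(len(text), len(encoded))   # 7 8

# Decoding with the same encoding recovers the original string.
decoded = encoded.decode('utf-8')
print(decoded == text)           # True
```

The length difference (7 code points versus 8 bytes) is why the two representations must not be compared or mixed directly.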
To avoid this confusion, specify the file encoding explicitly when reading and writing, so that Python decodes bytes to Unicode on input and encodes Unicode to bytes on output. In Python 2.6 and later, the io module provides an io.open function that accepts an encoding argument:
<code class="python">import io
f = io.open("test", mode="r", encoding="utf-8")
f.read()</code>
With this approach, f.read() returns a decoded Unicode object:
u'Capit\xe1l\n\n'
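A full round trip may make the behaviour easier to see. The sketch below writes a Unicode string with io.open, reads it back as text, and then reads the raw bytes. It assumes a scratch file in a temporary directory (the path is hypothetical, chosen only so the example does not overwrite anything) and uses with blocks so the files are closed automatically.

```python
import io
import os
import tempfile

text = u'Capit\xe1n'

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'test')

# Writing: the Unicode string is encoded to UTF-8 bytes on the way out.
with io.open(path, mode='w', encoding='utf-8') as f:
    f.write(text)

# Reading as text: the bytes are decoded back to a Unicode string.
with io.open(path, mode='r', encoding='utf-8') as f:
    round_tripped = f.read()

# Reading as binary: the raw UTF-8 bytes stored on disk.
with io.open(path, mode='rb') as f:
    raw = f.read()

print(repr(round_tripped))
print(repr(raw))
```

Here round_tripped equals the original string, while raw holds the eight UTF-8 bytes, with á stored as \xc3\xa1.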
In Python 3.x, the io.open function is an alias for the built-in open function, which also supports the encoding argument. Another option is to use the codecs module:
<code class="python">import codecs
f = codecs.open("test", "r", "utf-8")
f.read()</code>
However, be aware that mixing read() and readline() can result in issues when using the codecs module.
By specifying the encoding explicitly when reading and writing files, you ensure that Unicode strings are encoded and decoded correctly, avoiding potential pitfalls.
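To illustrate the kind of pitfall an explicit, matching encoding avoids, the sketch below writes a file as UTF-8 and then reads it back twice: once with a deliberately wrong encoding (Latin-1, chosen here only as an example of a mismatch) and once with the correct one. The scratch path is again hypothetical.

```python
import io
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'test')

with io.open(path, mode='w', encoding='utf-8') as f:
    f.write(u'Capit\xe1n')

# Mismatched encoding: each UTF-8 byte is decoded as its own character.
with io.open(path, mode='r', encoding='latin-1') as f:
    wrong = f.read()

# Matching encoding: the two bytes C3 A1 decode back to a single á.
with io.open(path, mode='r', encoding='utf-8') as f:
    right = f.read()

print(repr(wrong))
print(repr(right))
```

The mismatched read raises no error; it silently returns eight characters instead of seven, which is why the encoding used for reading should always match the one used for writing.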