How Can I Read and Write Unicode (UTF-8) Files Correctly in Python?

Susan Sarandon | Original: 2024-11-05 02:35:02 | 249 views

Unicode (UTF-8) File I/O in Python

In Python, handling Unicode text in files involves encoding and decoding operations. However, understanding these concepts can be challenging, as exemplified by a common issue:

Decoding Confusion:

Consider the following code in Python 2.4:

<code class="python">ss = u'Capit\xe1n'        # Unicode string; \xe1 is the code point for á
ss8 = ss.encode('utf8')   # the same text encoded as a UTF-8 byte string
print(ss, ss8)</code>

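In Python 2, print is a statement, so the parenthesized pair is printed as a tuple and both strings appear in their repr form:

(u'Capit\xe1n', 'Capit\xc3\xa1n')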

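The a-acute character (á) is represented differently in the two strings: as the single code point \xe1 in the Unicode string (u'Capit\xe1n') and as the two bytes \xc3\xa1 in the UTF-8 byte string ('Capit\xc3\xa1n'). The repr of a byte string shows every non-ASCII byte as an escape, hence the \xc3\xa1 sequence in ss8.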
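The relationship between the two representations can be checked directly. The following is a minimal sketch, not part of the original example; the variable names text and data are chosen here for illustration, and the code is written to run on Python 3 as well as Python 2.6+.

```python
text = u'Capit\xe1n'          # Unicode string: 7 code points
data = text.encode('utf-8')   # byte string: á becomes the two bytes 0xC3 0xA1

# The encoded form is one byte longer than the text it represents.
text_length = len(text)
data_length = len(data)

# Decoding with the same codec recovers the original Unicode string.
decoded = data.decode('utf-8')
```

Encoding and decoding are inverse operations only when both sides use the same codec, which is why the encoding should be stated explicitly rather than left to a default.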

Opening the file 'f1' in write mode and writing ss8 followed by a newline stores the bytes 'Capit\xc3\xa1n\n', which is valid UTF-8. Writing the Unicode string ss to another file 'f2' opened the same way does not work: Python 2 first tries to encode it implicitly with the ASCII codec, which cannot represent the a-acute character, and raises a UnicodeEncodeError.

Decoding Solution:

To resolve this confusion, specify the encoding explicitly when opening the file. In Python 2.6 and later, the io.open function can be used:

<code class="python">import io
f = io.open("test", mode="r", encoding="utf-8")</code>

A file opened this way decodes UTF-8 into Unicode strings on every read and, when opened in a write mode, encodes Unicode strings to UTF-8 on every write, eliminating the need for manual encoding and decoding. In Python 3.x, the io.open function is an alias for the built-in open function, which also supports the encoding argument.
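A round trip through a file shows the automatic encoding and decoding at work. This is a hedged sketch rather than the article's own code: the scratch file name capitan.txt and the variable names are assumptions made for the example.

```python
import io
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "capitan.txt")

text = u"Capit\xe1n\n"

# Text mode with an explicit encoding: Unicode in, UTF-8 bytes on disk.
with io.open(path, mode="w", encoding="utf-8") as f:
    f.write(text)

# Reading with the same encoding returns a Unicode string again.
with io.open(path, mode="r", encoding="utf-8") as f:
    round_trip = f.read()

# Binary mode shows what was actually stored: the UTF-8 bytes for the text,
# followed by the platform's line ending.
with io.open(path, mode="rb") as f:
    raw = f.read()
```

Using the with statement closes each file handle promptly, so the bytes are flushed to disk before the next open call reads them.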

Alternatively, the codecs module can be used:

<code class="python">import codecs
f = codecs.open("test", "r", "utf-8")</code>

It's important to note that mixing the read() and readline() methods may cause issues when using codecs.open.
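One way to stay clear of that problem is to use a single access pattern per file handle. The sketch below assumes a hypothetical scratch file named lines.txt and reads it only by iterating over lines; it illustrates the idea and is not a prescribed fix.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")

# codecs.open encodes Unicode strings to UTF-8 on write.
with codecs.open(path, "w", "utf-8") as f:
    f.write(u"Capit\xe1n\nse\xf1or\n")

# Read with one method only (line iteration) instead of alternating
# between read() and readline() on the same handle.
with codecs.open(path, "r", "utf-8") as f:
    lines = [line.rstrip(u"\n") for line in f]
```

Where the io module is available, io.open is generally the simpler choice, since it returns an ordinary text file object.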
