
How to Retain the BOM When Reading UTF-8 Files in Java?

Mary-Kate Olsen
2024-11-24 15:44:15

Reading UTF-8 with BOM Marker: Understanding the Unexpected BOM Output

When a file is encoded in UTF-8 with a Byte-Order Mark (BOM), the BOM can show up at the start of the string read from it. The BOM is the code point U+FEFF, stored as the three bytes 0xEF 0xBB 0xBF at the beginning of the file. Java's UTF-8 decoder does not strip it: those bytes are decoded to the character U+FEFF like any other input.
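To make the byte sequence concrete, here is a minimal sketch that encodes U+FEFF as UTF-8 and then decodes a BOM-prefixed byte array. The class and method names (BomBytesDemo, encodeBom, decode) are made up for illustration and are not part of the original code.

```java
import java.nio.charset.StandardCharsets;

class BomBytesDemo {
    // Encode the single code point U+FEFF as UTF-8.
    static byte[] encodeBom() {
        return "\uFEFF".getBytes(StandardCharsets.UTF_8);
    }

    // Decode raw bytes as UTF-8; a leading BOM is kept as the character U+FEFF.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encodeBom()) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // EF BB BF

        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String s = decode(raw);
        System.out.println(s.length());              // 3: U+FEFF, 'h', 'i'
        System.out.println(s.charAt(0) == '\uFEFF'); // true
    }
}
```

Because U+FEFF is invisible in most terminals and editors, checking the string length or the first character is usually the easiest way to notice it.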

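In the code in question, the file is read line by line through a FileReader wrapped in a BufferedReader. Note that FileReader decodes with the platform default charset unless one is passed explicitly, and UTF-8 is the default only from Java 18 onward. The issue centers on the subsequent line: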

text = new String(tmp.getBytes(), "UTF-8");

This line encodes the tmp string back to bytes with the platform default charset (getBytes() with no argument) and then decodes those bytes as UTF-8. The round trip does not remove the BOM: when the bytes survive it intact, the BOM is decoded to U+FEFF and remains the first character of the first line. When the default charset is not UTF-8, the round trip can also corrupt other non-ASCII characters.
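One way to avoid re-encoding each line is to name the charset when the file is opened. The sketch below assumes Java 7 or later and uses Files.newBufferedReader; the class name ReadKeepingBom and the helper firstLine are hypothetical. It shows that a BOM in the file arrives as a leading U+FEFF in the first line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadKeepingBom {
    // Reads the first line of the file, decoding it explicitly as UTF-8.
    // Returns an empty string for an empty file.
    static String firstLine(Path file) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line = br.readLine();
            return line == null ? "" : line;
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 file that starts with a BOM.
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        try {
            Files.write(tmp, new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'o', 'k', '\n'});
            String line = firstLine(tmp);
            System.out.println(line.length());              // 3: U+FEFF, 'o', 'k'
            System.out.println(line.charAt(0) == '\uFEFF'); // true
        } finally {
            Files.delete(tmp);
        }
    }
}
```

With the charset fixed at the reader, each line is already a correctly decoded String and no getBytes() round trip is needed.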

To control whether the BOM ends up in the output string, check for it explicitly:

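// Assumes tmp was decoded correctly, so a BOM in the file is U+FEFF here.
byte[] bytes = tmp.getBytes("UTF-8");
if (isUTF8WithBOM(bytes)) {
    // Skip the three BOM bytes (0xEF, 0xBB, 0xBF) and decode the rest as UTF-8.
    text = new String(bytes, 3, bytes.length - 3, "UTF-8");
} else {
    // No BOM present: decode the whole array as UTF-8.
    text = new String(bytes, "UTF-8");
}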

The isUTF8WithBOM method checks whether the byte array begins with the UTF-8 BOM sequence (0xEF, 0xBB, 0xBF). If it does, the first three bytes are skipped, so decoding starts at index 3 and the BOM is left out of the output string. To retain the BOM instead, decode the whole array as in the else branch: Java's UTF-8 decoder keeps it as a leading U+FEFF character.
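The article does not show isUTF8WithBOM itself, so the following is one possible implementation, offered as a sketch rather than the author's version. The BomUtil class and the decode helper with its keepBom flag are assumptions added for illustration.

```java
import java.nio.charset.StandardCharsets;

class BomUtil {
    // True if the array starts with the UTF-8 BOM bytes 0xEF 0xBB 0xBF.
    static boolean isUTF8WithBOM(byte[] bytes) {
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    // Decodes UTF-8 bytes, keeping or dropping a leading BOM as requested.
    static String decode(byte[] bytes, boolean keepBom) {
        int offset = (!keepBom && isUTF8WithBOM(bytes)) ? 3 : 0;
        return new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(decode(withBom, false).length()); // 2: BOM stripped
        System.out.println(decode(withBom, true).length());  // 3: BOM kept as U+FEFF
    }
}
```

Masking each byte with 0xFF matters because Java bytes are signed, so a direct comparison of bytes[0] with the int 0xEF would never be true.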
