Reading UTF-8 with BOM Marker: Understanding the Unexpected BOM Output
When reading a UTF-8 file that starts with a byte-order mark (BOM), the BOM can show up as an unexpected character at the start of the resulting string. The BOM is the Unicode character U+FEFF, which UTF-8 encodes as the three bytes 0xEF 0xBB 0xBF at the very beginning of the file. Java's UTF-8 decoder does not skip it, so it is decoded like any other character.
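To make the byte sequence concrete, here is a small sketch that encodes U+FEFF as UTF-8 and prints the resulting bytes in hex. The class and method names (BomBytes, bomAsHex) are made up for illustration and are not part of the original code.

```java
import java.nio.charset.StandardCharsets;

class BomBytes {
    // Encodes the BOM character U+FEFF as UTF-8 and formats the bytes as hex.
    // (Hypothetical helper, for illustration only.)
    static String bomAsHex() {
        byte[] bom = "\uFEFF".getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bom) {
            // Mask to 0..255 so negative bytes print as two hex digits.
            hex.append(String.format("%02X ", b & 0xFF));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(bomAsHex()); // EF BB BF
    }
}
```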
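The Java code in question reads the file through a FileReader wrapped in a BufferedReader. FileReader decodes with the platform default charset unless one is passed explicitly (possible since Java 11), so it only reads UTF-8 correctly when the default charset is UTF-8. The code then tries to repair the text with the following line: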
text = new String(tmp.getBytes(), "UTF-8");
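This line encodes the tmp string back to bytes using the platform default charset and then decodes those bytes as UTF-8. That round trip does not remove the BOM: if the file was decoded as UTF-8, the BOM is already present in tmp as the character U+FEFF, and it is still the first character of text afterwards. If the default charset is not UTF-8, the round trip can additionally corrupt non-ASCII characters.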
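The behavior can be checked with a minimal sketch that uses made-up sample bytes instead of a real file. The class name BomSurvivesDecoding and the decode helper are assumptions for illustration; the point is only that standard UTF-8 decoding in Java keeps a leading BOM as U+FEFF, and that an encode/decode round trip does not drop it.

```java
import java.nio.charset.StandardCharsets;

class BomSurvivesDecoding {
    // Decodes raw bytes as UTF-8, much like a reader configured for UTF-8 would.
    // (Hypothetical helper, for illustration only.)
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample data: the UTF-8 BOM (EF BB BF) followed by "hi".
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String tmp = decode(raw);

        System.out.println(tmp.length());              // 3: U+FEFF, 'h', 'i'
        System.out.println(tmp.charAt(0) == '\uFEFF'); // true

        // Round trip similar to the line discussed above, with the charset made explicit.
        String text = new String(tmp.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println(text.charAt(0) == '\uFEFF'); // true: the BOM is still there
    }
}
```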
To remove the BOM from the output string, the code has to check for it explicitly and skip it:
byte[] bytes = tmp.getBytes("UTF-8");
if (isUTF8WithBOM(bytes)) {
    // Skip the three BOM bytes; name the charset so the platform default is not used.
    text = new String(bytes, 3, bytes.length - 3, "UTF-8");
} else {
    text = new String(bytes, "UTF-8");
}
The isUTF8WithBOM method checks whether the byte array begins with the UTF-8 BOM sequence (0xEF, 0xBB, 0xBF). If it does, the BOM is removed by decoding from offset 3, that is, from the fourth byte onward. The resulting string then no longer starts with the BOM character.
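The text names isUTF8WithBOM but does not show its body, so the following is one possible implementation, not the original author's. The class name BomStripper and the helpers decodeWithoutBom, stripBom, and readFirstLine are likewise assumptions. The sketch also shows a shorter alternative that works directly on the decoded string, and a reader opened with an explicit UTF-8 charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BomStripper {
    // One possible body for the isUTF8WithBOM helper mentioned in the text.
    static boolean isUTF8WithBOM(byte[] bytes) {
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    // Decodes UTF-8 bytes, skipping the three BOM bytes when they are present.
    static String decodeWithoutBom(byte[] bytes) {
        int offset = isUTF8WithBOM(bytes) ? 3 : 0;
        return new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_8);
    }

    // Alternative for text that is already decoded: drop a leading U+FEFF.
    static String stripBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    // Reads the first line with an explicit UTF-8 charset and strips the BOM.
    // Returns null for an empty file.
    static String readFirstLine(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line = reader.readLine();
            return line == null ? null : stripBom(line);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample file: BOM, then "hi" and a newline.
        Path file = Files.createTempFile("bom-demo", ".txt");
        Files.write(file, new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i', '\n'});
        System.out.println(readFirstLine(file)); // hi
        Files.delete(file);
    }
}
```

Only the first line of a file can carry the BOM, so when reading line by line it is enough to apply stripBom to the first line.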