PHP UTF-8, Unicode and BOM issues
1. Introduction
UTF-8 is a Unicode character encoding often used in web applications. Its advantage is that it is a variable-length encoding in which every ASCII character takes only 1 byte, which saves a lot of network bandwidth when transmitting large numbers of web pages made up mostly of ASCII characters.
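As a small illustration of the variable length, the sketch below counts the bytes UTF-8 uses for four sample characters. The sample characters and variable names are chosen here for illustration only; the \u{...} escapes assume PHP 7.0 or later.

```php
<?php
// Each string holds exactly one character, written with \u{...} escapes
// so the example does not depend on how this file itself is saved.
$samples = [
    'A'       => "A",          // U+0041, an ASCII letter
    'e-acute' => "\u{00E9}",   // U+00E9
    'CJK'     => "\u{4E2D}",   // U+4E2D
    'emoji'   => "\u{1F600}",  // U+1F600
];

$byteLengths = [];
foreach ($samples as $label => $char) {
    // strlen() counts bytes, not characters.
    $byteLengths[$label] = strlen($char);
    printf("%-8s %d byte(s): %s\n", $label, strlen($char), strtoupper(bin2hex($char)));
}
```

Only the ASCII letter fits in one byte; the other samples need two, three and four bytes.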
The UTF-8 signature, also called the BOM (Byte Order Mark), is a standard mark used to identify the encoding in the UTF encoding schemes. The BOM is the character U+FEFF: in UTF-16 it appears as the bytes FE FF (big-endian) or FF FE (little-endian), and in UTF-8 it becomes EF BB BF. In UTF-8 the mark is optional, because UTF-8 is a sequence of single bytes and has no byte-order problem; its only use there is to signal that a byte stream is UTF-8 encoded. Microsoft software performs this detection, but some other software does not and treats the three bytes as ordinary characters. Microsoft tools add the three bytes EF BB BF to the start of the UTF-8 text files they save, and Notepad and other Windows programs use them to decide whether a text file is ASCII or UTF-8. This is mostly a Microsoft convention, however, and tools on other platforms generally do not mark UTF-8 text files this way. In other words, a UTF-8 file may or may not have a BOM.
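The check that BOM-aware software performs can be sketched in a few lines. The function name hasUtf8Bom is hypothetical, and the sketch only tests for the signature; it does not validate that the rest of the data is well-formed UTF-8.

```php
<?php
// Returns true when the byte string starts with the UTF-8 signature EF BB BF.
function hasUtf8Bom(string $bytes): bool
{
    // strncmp() compares at most the first 3 bytes, so strings shorter
    // than 3 bytes can never match the signature.
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

var_dump(hasUtf8Bom("\xEF\xBB\xBFhello")); // text saved with a signature
var_dump(hasUtf8Bom("hello"));             // text saved without one
```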
A single BOM at the very start of the data causes no problem. If several files each carry a signature and are combined, however, the resulting byte stream contains multiple UTF-8 signatures, some of them in the middle of the data. This is one cause of the "root element must be well-formed" error that makes XML conversion fail.
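One way to avoid stray signatures when combining files is to strip the BOM from each part before joining them. The following is a minimal sketch under that assumption; joinWithoutBoms and the sample fragments are made up for illustration, and the function removes every signature, including the first.

```php
<?php
// Concatenate byte strings, dropping a leading UTF-8 BOM from each part.
function joinWithoutBoms(array $parts): string
{
    $bom = "\xEF\xBB\xBF";
    $out = '';
    foreach ($parts as $part) {
        if (strncmp($part, $bom, 3) === 0) {
            $part = substr($part, 3); // skip the 3 signature bytes
        }
        $out .= $part;
    }
    return $out;
}

// Three fragments, each saved with its own signature.
$header = "\xEF\xBB\xBF<root>";
$body   = "\xEF\xBB\xBF<item/>";
$footer = "\xEF\xBB\xBF</root>";

$naive = $header . $body . $footer;                    // signatures end up mid-stream
$clean = joinWithoutBoms([$header, $body, $footer]);   // no signatures left

printf("naive: %d BOM(s), clean: %d BOM(s)\n",
    substr_count($naive, "\xEF\xBB\xBF"),
    substr_count($clean, "\xEF\xBB\xBF"));
```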
2. View and Convert
Since a UTF-8 file may or may not have a BOM, how can you tell which kind you have?
Use any software with a hexadecimal editing mode. For example, open the file in UltraEdit-32, switch to hex mode, and check whether the file starts with EF BB BF. If it does, the file was saved with a BOM.
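If no hex editor is at hand, the same check can be approximated from PHP by printing the first bytes of the file in hexadecimal. This is only a sketch: headerHex is a hypothetical helper, and the two temporary files stand in for real files on disk.

```php
<?php
// Return the first $count bytes of a file as space-separated hex pairs,
// similar to what a hex editor shows at offset 0.
function headerHex(string $path, int $count = 3): string
{
    // Read at most $count bytes starting at offset 0.
    $bytes = file_get_contents($path, false, null, 0, $count);
    if ($bytes === false || $bytes === '') {
        return '';
    }
    return strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
}

// Demo files created only for this example.
$withBomFile = tempnam(sys_get_temp_dir(), 'bom');
$plainFile   = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($withBomFile, "\xEF\xBB\xBFabc");
file_put_contents($plainFile, "abc");

$withBomHex = headerHex($withBomFile);
$plainHex   = headerHex($plainFile);
echo "with BOM:    $withBomHex\n";
echo "without BOM: $plainHex\n";

unlink($withBomFile);
unlink($plainFile);
```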
The Notepad that ships with Windows has traditionally added a BOM by default when saving as UTF-8.
There are many ways to convert, including common editors such as UltraEdit-32 and Notepad++. Taking UltraEdit-32 as an example: after opening the file, choose "Save As" and pick the desired encoding in the "Format" column.
Dreamweaver CS3 has a similar option. In "Preferences", if you select Unicode (UTF-8) as the default encoding, you can check Include Unicode Signature (BOM) to write the Byte Order Mark (BOM) into the document; leave it unchecked to save without a BOM.
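The same "with or without signature" choice can also be applied from a script instead of an editor. The sketch below is not how any of these editors work internally; rewriteWithBomSetting is a hypothetical helper that assumes the file is already UTF-8 and only adds or removes the three signature bytes.

```php
<?php
// Rewrite a UTF-8 file in place so that it does or does not start with a BOM.
function rewriteWithBomSetting(string $path, bool $withBom): void
{
    $bom  = "\xEF\xBB\xBF";
    $data = file_get_contents($path);
    if ($data === false) {
        throw new RuntimeException("Cannot read $path");
    }
    // Remove an existing signature first so it is never duplicated.
    if (strncmp($data, $bom, 3) === 0) {
        $data = substr($data, 3);
    }
    file_put_contents($path, ($withBom ? $bom : '') . $data);
}

// Demo on a temporary file created only for this example.
$file = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($file, "\xEF\xBB\xBFhello");

rewriteWithBomSetting($file, false);
$savedWithoutBom = file_get_contents($file);

rewriteWithBomSetting($file, true);
rewriteWithBomSetting($file, true); // calling twice still leaves one BOM
$savedWithBom = file_get_contents($file);

unlink($file);
```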
3. Other knowledge
I learned the following from this article: http://blog.csdn.net/thimin/archive/2007/08/03/1724393.aspx
Files that editors save as "Unicode" are actually UTF-16, whose 16-bit code units happen to equal the Unicode code values for most characters. Conceptually, though, Unicode and UTF are two different things: Unicode is the character set that assigns a number to each character, while a UTF is a scheme for storing and transmitting those numbers. UTF-16 comes in two byte orders, little-endian (LE) and big-endian (BE). UTF-32 is another official UTF encoding, also divided into LE and BE. UTF-7 is not part of the Unicode standard itself and is mainly used for email transmission. The single-byte part of UTF-8 is compatible with ASCII, which is also the lower half (0x00 to 0x7F) of ISO-8859-1. UTF-8 exists mainly because some old systems and library functions cannot handle UTF-16 correctly, and it also saves file space for English text (at the expense of extra space for some non-English characters). For ASCII characters, UTF-8 and ISO-8859-1 both use one byte; for all other characters, UTF-8 uses two to four bytes.
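To make the comparison concrete, the sketch below writes out the bytes of two characters in ISO-8859-1, UTF-8 and both UTF-16 byte orders. The characters are picked for illustration, the helper name hexBytes is hypothetical, and the hand-written UTF-8 branch only covers code points up to U+07FF.

```php
<?php
// Code points to compare: 'A' (U+0041) and 'e-acute' (U+00E9).
$codePoints = ['A' => 0x41, 'e-acute' => 0xE9];

// Format a byte string as space-separated upper-case hex pairs.
function hexBytes(string $bytes): string
{
    return strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
}

$table = [];
foreach ($codePoints as $label => $cp) {
    // UTF-8: one byte up to U+007F, two bytes from U+0080 to U+07FF.
    $utf8 = $cp < 0x80
        ? chr($cp)
        : chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));

    $table[$label] = [
        // ISO-8859-1 stores code points 0x00-0xFF as one byte each.
        'ISO-8859-1' => hexBytes(chr($cp)),
        'UTF-8'      => hexBytes($utf8),
        // UTF-16 uses one 16-bit unit here; only the byte order differs.
        'UTF-16LE'   => hexBytes(pack('v', $cp)),
        'UTF-16BE'   => hexBytes(pack('n', $cp)),
    ];
}

print_r($table);
```

The ASCII letter has the same single byte in ISO-8859-1 and UTF-8, while the accented letter is one byte in ISO-8859-1 but two bytes in UTF-8.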
A more detailed description of BOM comes from here:
UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, by contrast, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself. If the receiver then sees the bytes FE FF, the byte stream is big-endian; if it sees FF FE, the byte stream is little-endian. This is why the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
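The receiver's decision described above can be sketched as follows. The function name utf16ByteOrder is hypothetical, and the sketch assumes the input is known to be UTF-16, since a little-endian UTF-32 stream also begins with FF FE.

```php
<?php
// Inspect the first two bytes of a UTF-16 stream and report its byte order:
// 'BE' for FE FF, 'LE' for FF FE, null when no BOM is present.
function utf16ByteOrder(string $bytes): ?string
{
    $head = substr($bytes, 0, 2);
    if ($head === "\xFE\xFF") {
        return 'BE';
    }
    if ($head === "\xFF\xFE") {
        return 'LE';
    }
    return null;
}

// U+FEFF followed by 'A' (U+0041), serialized in both byte orders.
// pack('n') writes a 16-bit value big-endian, pack('v') little-endian.
$bigEndian    = pack('n', 0xFEFF) . pack('n', 0x0041);
$littleEndian = pack('v', 0xFEFF) . pack('v', 0x0041);

var_dump(utf16ByteOrder($bigEndian));
var_dump(utf16ByteOrder($littleEndian));
```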
UTF-8 does not require a BOM to indicate the byte order, but can use the BOM to indicate the encoding method. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF. So if the receiver receives a byte stream starting with EF BB BF, it knows that it is UTF-8 encoded.
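The bytes EF BB BF can be derived by hand from the code point FEFF using the standard UTF-8 bit layout. The encoder below is a minimal sketch written for this illustration: the name encodeCodePointUtf8 is hypothetical, and it does not reject surrogates or values above U+10FFFF.

```php
<?php
// Encode a single Unicode code point as UTF-8 bytes.
function encodeCodePointUtf8(int $cp): string
{
    if ($cp < 0x80) {            // 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// FEFF falls in the three-byte range, so it becomes EF BB BF.
$bom = encodeCodePointUtf8(0xFEFF);
echo strtoupper(bin2hex($bom)), "\n";
```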
Windows uses BOM to mark the encoding method of text files.
PHP also does not support BOM.
PHP did not take the BOM into account when it was designed, which means it does not skip the three BOM bytes at the start of a UTF-8 encoded file. Since only the code after <?php or <? is executed as PHP, these three bytes are output directly as ordinary content. ※ One additional note: the problem is especially likely when PHP includes template files, where these three bytes can make the page display abnormally in the browser.
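The effect on included templates can be reproduced with a small experiment. This is a sketch that assumes PHP's default configuration (the zend.multibyte setting, which changes how script files are read, is assumed to be off); renderTemplate and the temporary template files are made up for the example.

```php
<?php
// Include a template and capture whatever it sends to the output.
function renderTemplate(string $path): string
{
    ob_start();
    include $path;
    return ob_get_clean();
}

$bom  = "\xEF\xBB\xBF";
$code = "<?php echo 'hello';\n";

// Two hypothetical templates: identical code, saved with and without a BOM.
$withBom    = tempnam(sys_get_temp_dir(), 'tpl');
$withoutBom = tempnam(sys_get_temp_dir(), 'tpl');
file_put_contents($withBom, $bom . $code);
file_put_contents($withoutBom, $code);

$dirty = renderTemplate($withBom);     // BOM bytes sit before the opening tag
$clean = renderTemplate($withoutBom);

echo strtoupper(bin2hex($dirty)), "\n"; // starts with EFBBBF
echo strtoupper(bin2hex($clean)), "\n";

unlink($withBom);
unlink($withoutBom);
```

The three bytes in front of the opening tag reach the output unchanged, so saving templates without a BOM is the simplest fix.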