
Detailed explanation of php files and character encoding

小云云 Original: 2018-03-14 15:15:05

My initial question was: what is the difference between text files and binary files? Why can a text editor display the content of one, while the content of the other usually cannot be displayed properly?

A course note from the University of Maryland explains the difference clearly: a text file is a kind of binary file, and its underlying storage is also 0s and 1s. Text files are readable and portable, but the set of characters they can express is limited. Binary files store data compactly and are not restricted by any character encoding. A text file can essentially hold only content made up of a limited character set (digits, letters, punctuation, and so on), while a binary file has no character constraints and can hold images, audio, video, and any other data.

Storing a number shows the difference between the two vividly. To store the number 1234567890, a text file has to store the ASCII code of each of its ten digits. In hexadecimal these are:

31 32 33 34 35 36 37 38 39 30, occupying 10 bytes. The binary form of 1234567890 is "0100 1001 1001 0110 0000 0010 1101 0010", which occupies 4 bytes (32 bits, at 8 bits per byte). Its hexadecimal representation in the file (big endian) is: 49 96 02 D2.
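The comparison above can be reproduced with a short PHP sketch. It assumes a 32-bit unsigned big-endian layout for the binary form (the "N" format of pack); the variable names are only illustrative.

```php
<?php
$number = 1234567890;

// Text form: one ASCII byte per decimal digit.
$asText = (string) $number;

// Binary form: "N" packs an unsigned 32-bit integer in big-endian order.
$asBinary = pack("N", $number);

// Print the size and the hex dump of each form.
echo strlen($asText) . " bytes: " . implode(" ", str_split(bin2hex($asText), 2)) . "\n";
echo strlen($asBinary) . " bytes: " . implode(" ", str_split(bin2hex($asBinary), 2)) . "\n";
```

Note that bin2hex prints lowercase hex digits, so the last byte appears as d2 rather than D2.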

Text files store content in characters, while binary files store content in bytes. This is the most essential difference between the two. Several familiar conclusions follow from it: binary files are usually more compact than text files and take up less space; text files are more user-friendly and can be edited in a WYSIWYG way; binary files often need a dedicated program to open them.

This also explains why binary files usually look garbled in a text editor. For example, suppose a binary file stores the integer 1234 in four bytes, which in hexadecimal is:

00 00 04 D2. When a text editor opens the file and interprets it character by character, these bytes do not spell out any displayable characters, so they are shown as gibberish. The garbling happens because the text editor cannot parse the byte stream correctly, which is why binary files need to be opened with dedicated software. A JPG file needs an image viewer; opening it in a music player will not work. Likewise, video files need a video player, and compressed archives need compression software.
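As a rough illustration of the two readings of those four bytes, the sketch below decodes them once as a big-endian 32-bit integer and once byte by byte as ASCII text. The printable range 0x20 to 0x7E is the usual ASCII convention, used here as an assumption about what an editor can display.

```php
<?php
$bytes = "\x00\x00\x04\xD2";

// Read as binary data: "N" is an unsigned 32-bit big-endian integer.
$value = unpack("N", $bytes)[1];

// Read as text: count the bytes that are printable ASCII characters.
$printable = 0;
foreach (str_split($bytes) as $byte) {
    $code = ord($byte);
    if ($code >= 0x20 && $code <= 0x7E) {
        ++$printable;
    }
}

echo "as integer: " . $value . "\n";
echo "printable characters: " . $printable . " of " . strlen($bytes) . "\n";
```

None of the four bytes falls in the printable range, which is why an editor has nothing sensible to show.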

File format

Now that the difference between text and binary files is clear, let's look at file formats. Windows identifies a file's format by its extension and launches the associated program to open it. On Unix-like systems the extension is optional, so how do we know what format a file is in?

Fortunately, the file command can tell us. The extension is not what defines a file format; the content is. Rename a.zip to a.txt, a.jpg, or a.mp3: whatever the name, file reveals its true identity: Zip archive data, at least v1.0 to extract.
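Content-based detection generally works by checking the signature ("magic number") at the start of a file. The sketch below is a much-simplified illustration of that idea, not how file itself is implemented. sniffFormat is a hypothetical helper, and its signature table is a small hand-picked subset.

```php
<?php
// Hypothetical helper: guess a format from the first bytes of a file.
// The signature table is illustrative only; the real file command
// consults a far larger database.
function sniffFormat(string $head): string
{
    $signatures = [
        "PK\x03\x04"        => "Zip archive data",
        "\x89PNG\r\n\x1A\n" => "PNG image data",
        "\xFF\xD8\xFF"      => "JPEG image data",
        "%PDF-"             => "PDF document",
    ];
    foreach ($signatures as $magic => $label) {
        // Compare only as many bytes as the signature is long.
        if (strncmp($head, $magic, strlen($magic)) === 0) {
            return $label;
        }
    }
    return "unknown";
}

// In practice the head would come from the file, for example:
// $head = file_get_contents($path, false, null, 0, 8);
echo sniffFormat("PK\x03\x04\x0A\x00") . "\n";
```

Because only the leading bytes are inspected, renaming the file has no effect on the result.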

Encoding

Having covered files, let's turn to the encoding of file contents. ASCII defines 128 characters (codes 0 to 127), and there is little to say about encoding them, since almost every encoding is compatible with ASCII. Double-byte and multi-byte characters, encoding schemes, and byte order are what trouble programmers. A Chinese character takes two bytes in GBK, for example. For encodings built from multi-byte code units (such as UTF-16), byte order determines the final stored form, and multi-byte values sent over a network are conventionally converted to network byte order (big endian) so that the receiver can parse them. Developers who are unfamiliar with character encoding will find garbled text in communication hard to debug.

The UCS (Universal Multiple-Octet Coded Character Set) standard lets developers get away from the confusing multi-byte character sets. In UCS, every character has a unique code point, and the character can be looked up from its code point. UCS-2 represents a code point with two bytes (UCS-4 uses four), one code point per character. Two bytes allow 2^16 = 65,536 code points, which covers the characters in common use worldwide (UCS-4 can in theory hold about 2 billion characters; well over 100,000 are assigned today). Note that UCS only defines the one-to-one mapping between code points and characters; it does not define how they are stored in a computer.

Defining how Unicode characters are stored is the job of UTF (Unicode Transformation Format). The most common schemes are UTF-16 and UTF-8. UTF-16 represents a character with two bytes (four for characters beyond U+FFFF, which use a surrogate pair). The string APIs of Windows, macOS, and the Java platform are built on UTF-16. Because the unit is two bytes, there are two variants: big-endian and little-endian. For files that contain only ASCII characters, UTF-16 wastes 50% of the storage. The UTF-8 scheme proposed by Ken Thompson (co-creator of Unix) and Rob Pike (both later co-designers of Go) quickly became popular. UTF-8 is a stream of single bytes, so there is no byte-order problem and no BOM is required. UTF-8 is now the common standard on the web.
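The mapping between a character and its code point can be inspected directly. This is a minimal sketch that assumes PHP 7.2 or later with the mbstring extension (for mb_ord and mb_chr) and a source file saved as UTF-8.

```php
<?php
// Character to code point.
$codePoint = mb_ord("中", "UTF-8");
printf("U+%04X\n", $codePoint);

// Code point back to character.
$char = mb_chr(0x4E2D, "UTF-8");
echo $char . "\n";
```

The code point itself is just a number; how it ends up as bytes depends on the encoding scheme chosen afterwards.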
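To make the byte-order point concrete, the sketch below encodes the same two-character string in UTF-8 and in both UTF-16 variants and dumps the bytes. It assumes the mbstring extension and a UTF-8 source file; hexBytes is a hypothetical helper used only for display.

```php
<?php
// Hypothetical display helper: "4e2d" -> "4e 2d".
function hexBytes(string $bytes): string
{
    return implode(" ", str_split(bin2hex($bytes), 2));
}

$text = "A中";

foreach (["UTF-8", "UTF-16BE", "UTF-16LE"] as $encoding) {
    $encoded = mb_convert_encoding($text, $encoding, "UTF-8");
    printf("%-8s %s\n", $encoding, hexBytes($encoded));
}
// UTF-8    41 e4 b8 ad
// UTF-16BE 00 41 4e 2d
// UTF-16LE 41 00 2d 4e
```

The ASCII letter costs one byte in UTF-8 but two in UTF-16, and the two UTF-16 variants differ only in the order of the bytes within each pair.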

Correspondence

UCS-2 covers the range U+0000 to U+FFFF. The correspondence between code points (including those above U+FFFF) and UTF-8 is as follows:

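Code point range (hex)    UTF-8 byte sequence (binary)
0000 0000-0000 007F       0xxxxxxx
0000 0080-0000 07FF       110xxxxx 10xxxxxx
0000 0800-0000 FFFF       1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx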

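As the encoding shows, a lot of space is wasted compared with raw binary. That cannot be helped, though: displayable characters are easier to read and understand, and humans find that hard to resist.

The UTF-8 conversion rules are:
1. If the first bit of a byte is 0, it is an ASCII byte, and the remaining 7 bits are the ASCII code. This is why UTF-8 is compatible with ASCII.
2. If the first bit of the first byte is 1, the number of consecutive leading 1 bits is the number of bytes that together make up the character, and every following byte must start with 10.

With these rules in hand, a program can easily process a UTF-8 encoded byte stream. For example, to find the UTF-8 encoding of "中", we can do the following (note that the file itself must be saved as UTF-8):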
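The rules can also be applied in the other direction, to build the bytes from a code point. The following is a minimal sketch under that reading of the rules; codePointToUtf8 is a hypothetical name, and for real work mb_chr from the mbstring extension does the same job.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function codePointToUtf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("not a valid Unicode scalar value");
    }
    if ($cp <= 0x7F) {
        // 0xxxxxxx: plain ASCII.
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codePointToUtf8(0x4E2D)) . "\n"; // e4b8ad
```

Each branch fills the x positions of one byte pattern with the bits of the code point, six bits per continuation byte.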

$char = "中";
$length = strlen($char);
$bytes = pack("a" . $length, $char);
echo "UTF-8:" . bin2hex($bytes) . "\n";

// Or print the bytes one at a time
echo "UTF-8:";
for ($index = 0; $index < $length; ++$index) {
    echo bin2hex($char[$index]);
}
echo PHP_EOL;

We can also write a strlen function for UTF-8 encoded strings:

function myStrlen(string $string): int
{
    $slen = strlen($string);
    $mlen = 0;
    $maxByteLength = 4;
    $maxOffset = 7;
    for ($i = 0; $i < $slen; ++$i) {
        $byte = ord($string[$i]);
        // Count the leading 1 bits, from 0xxxxxxx up to
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Only the first byte needs checking.
        for ($offset = 0; $offset <= $maxByteLength; ++$offset) {
            $result = $byte & (1 << ($maxOffset - $offset));
            if ($result === 0) {
                // $offset leading 1 bits means an $offset-byte character
                // (0 for ASCII), so skip its $offset - 1 continuation bytes.
                $i += max(0, $offset - 1);
                ++$mlen;
                break;
            }
        }
    }
    return $mlen;
}

$string = "Coder不是工程师！";
echo "mb_strlen:" . mb_strlen($string) . "\n";
echo "myStrlen:" . myStrlen($string) . "\n";
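A different technique from the one used above gives an independent cross-check: every character contributes exactly one byte that is not a continuation byte (10xxxxxx), so counting those bytes counts the characters. This is a sketch that assumes well-formed UTF-8 input; countUtf8Chars is a hypothetical name.

```php
<?php
// Hypothetical helper: count characters in well-formed UTF-8
// by counting the bytes that do not match 10xxxxxx.
function countUtf8Chars(string $string): int
{
    $count = 0;
    $length = strlen($string);
    for ($i = 0; $i < $length; ++$i) {
        // Mask the top two bits; 0x80 (binary 10) marks a continuation byte.
        if ((ord($string[$i]) & 0xC0) !== 0x80) {
            ++$count;
        }
    }
    return $count;
}

echo countUtf8Chars("Coder不是工程师！") . "\n"; // 11
```

Mixed strings such as "中a" are a useful test case for any hand-written length function, since an error in how many bytes get skipped after a lead byte only shows up when a short character follows a long one.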

