How much do you know about the character set encodings ASCII, Unicode and UTF-8? Character set encoding summary (collection)
How much do you know about the character set encodings ASCII, Unicode and UTF-8? This article introduces the three encodings, the problems each one was meant to solve, how to convert between them, and a worked example.
1. ASCII code
We know that inside a computer, all information is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states (2 to the power of 8). A group of eight bits is called a byte. In other words, one byte can represent 256 different states, from 00000000 to 11111111, and each state can be assigned to one symbol.
In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and binary values. It is called ASCII and is still in use today.
ASCII defines 128 character codes in total. For example, SPACE is 32 (binary 00100000), and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 non-printing control characters: codes 0 to 31 and 127) occupy only the low 7 bits of a byte; the highest bit is always set to 0.
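The values above are easy to check. The following is a minimal sketch in Python (the language is a choice made for these examples; the article itself names none), and the helper name ascii_bits is hypothetical:

```python
# Show the ASCII code and the 8-bit binary form of a character.
def ascii_bits(ch: str) -> str:
    """Return the 8-bit binary string of a single ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    # ASCII uses only the low 7 bits, so the leading bit is always 0.
    return format(code, "08b")

for ch in (" ", "A"):
    print(repr(ch), ord(ch), ascii_bits(ch))
```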
[Table: ASCII control characters]
[Table: ASCII displayable characters]
2. Non-ASCII encoding
128 symbols are enough to encode English, but not enough for other languages. French, for example, has letters with accent marks, which ASCII cannot represent. As a result, some European countries decided to use the unused highest bit of the byte to encode new symbols. For example, é in a French encoding is 130 (binary 10000010). The encodings used in these European countries can therefore represent up to 256 symbols.
However, a new problem arises here. Different countries have different letters, so even though they all use a 256-symbol encoding, the same value stands for different letters. For example, 130 represents é in a French encoding, the letter Gimel (ג) in a Hebrew encoding, and yet another symbol in a Russian encoding. In all of these encodings, however, the symbols represented by 0 to 127 are the same; only the range 128 to 255 differs.
As for the scripts of Asian countries, they use far more symbols; there are as many as 100,000 Chinese characters. One byte can represent only 256 symbols, which is nowhere near enough, so multiple bytes must be used to represent one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent up to 256 x 256 = 65536 symbols.
Chinese encoding deserves an article of its own and is not covered in this note. The only point made here is that, although the GB family of encodings also uses multiple bytes per symbol, it is unrelated to the Unicode and UTF-8 encodings described later.
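The ambiguity of byte 130 can be reproduced with a short sketch. The article does not say which encodings it has in mind; the legacy DOS code pages cp437, cp862 and cp866 are assumed here because they happen to match the description.

```python
# Decode the same byte value (130, hex 0x82) under three legacy code pages.
# The choice of code pages is an assumption, not something the article states.
raw = bytes([130])

french = raw.decode("cp437")   # IBM PC code page containing French accented letters
hebrew = raw.decode("cp862")   # DOS Hebrew
russian = raw.decode("cp866")  # DOS Cyrillic
print(french, hebrew, russian)

# Bytes 0 to 127 decode to the same text in all three code pages.
ascii_part = bytes(range(128))
same_low_half = (
    ascii_part.decode("cp437")
    == ascii_part.decode("cp862")
    == ascii_part.decode("cp866")
)
print(same_low_half)
```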
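As a small illustration of a two-byte GB encoding, the sketch below encodes the character 严 (Yan, U+4E25, the example character used later in this article) with Python's gb2312 codec. The resulting bytes D1 CF have no arithmetic relationship to the Unicode value 4E25 or to the UTF-8 bytes.

```python
# Encode one Chinese character with GB2312 and compare with its Unicode value.
ch = "\u4e25"  # the character Yan

gb_bytes = ch.encode("gb2312")     # two bytes in GB2312
utf8_bytes = ch.encode("utf-8")    # a different byte sequence in UTF-8

print(gb_bytes.hex().upper(), len(gb_bytes))
print(format(ord(ch), "04X"))
print(utf8_bytes.hex().upper())
```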
3. Unicode
As mentioned in the previous section, there are many encodings in the world, and the same binary value can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; if you interpret it with the wrong one, garbled characters appear. Why are emails often garbled? Because the sender and the recipient use different encodings.
One can imagine that if there were a single encoding that included every symbol in the world and gave each symbol a unique code, the garbled-text problem would disappear. This is Unicode: as its name suggests, an encoding of all symbols.
Unicode is of course a very large set, with room for more than 1 million symbols today. Each symbol has its own code. For example, U+0639 represents the Arabic letter Ain, U+0041 the English capital letter A, and U+4E25 the Chinese character 严 (Yan). For the full symbol tables, see unicode.org or a dedicated Chinese character table.
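Garbled text of this kind is easy to reproduce. In the sketch below, the "sender" writes UTF-8 and the "receiver" wrongly assumes Latin-1; both encodings are chosen only for illustration.

```python
# Reproduce garbled text by decoding bytes with the wrong encoding.
text = "\u4e25"                    # the character Yan
data = text.encode("utf-8")        # the sender stores the text as UTF-8

wrong = data.decode("latin-1")     # the receiver guesses the wrong encoding
right = data.decode("utf-8")       # the receiver uses the correct encoding

print(repr(wrong))   # three unrelated Latin-1 characters
print(repr(right))   # the original character
```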
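The three code points mentioned above can be looked up with Python's standard unicodedata module. This is only a sketch; the helper name code_point is hypothetical.

```python
import unicodedata

def code_point(ch: str) -> str:
    """Format the Unicode code point of one character as U+XXXX."""
    return f"U+{ord(ch):04X}"

# Arabic letter Ain, Latin capital A, and the Chinese character Yan.
for ch in ("\u0639", "A", "\u4e25"):
    print(code_point(ch), unicodedata.name(ch))
```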
4. Problems with Unicode
It should be noted that Unicode is only a symbol set. It specifies the binary code of each symbol, but not how that code should be stored.
For example, the Unicode code of the Chinese character 严 (Yan) is hexadecimal 4E25, which is a 15-bit binary number (100111000100101). In other words, representing this symbol takes at least 2 bytes. Larger code points may need 3 or 4 bytes, or even more.
There are two serious problems here. First, how can Unicode be distinguished from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? Second, we already know that one byte is enough for an English letter. If Unicode stipulated that every symbol be stored in three or four bytes, each English letter would have to be preceded by two or three bytes of zeros. That is a huge waste of storage: the text file would be two or three times larger, which is unacceptable.
These two problems had two consequences: 1) several storage formats for Unicode emerged, which means there are many different binary formats that can represent Unicode; 2) Unicode could not gain wide adoption for a long time, until the rise of the Internet.
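The arithmetic can be checked in a few lines of Python; the variable names are illustrative only.

```python
# How many bits and bytes does the code point 4E25 need?
cp = 0x4E25

bits = bin(cp)[2:]                      # binary digits without the "0b" prefix
n_bits = cp.bit_length()                # number of significant bits
n_bytes = (n_bits + 7) // 8             # round up to whole bytes

print(bits, n_bits, n_bytes)
```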
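To make the storage cost concrete, the sketch below uses UTF-32 as a stand-in for the hypothetical "four bytes for every symbol" scheme described above, and compares it with plain ASCII for an English string.

```python
# Compare one byte per letter (ASCII) with a fixed four bytes per letter.
text = "Hello"

as_ascii = text.encode("ascii")
as_fixed4 = text.encode("utf-32-be")   # big endian, no byte order mark

print(len(as_ascii), len(as_fixed4))
print("A".encode("utf-32-be"))         # three zero bytes precede the letter
```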
5. UTF-8
The spread of the Internet created strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character), but they are rarely used on the Internet. To repeat the relationship: UTF-8 is one implementation of Unicode.
One of the main features of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, and the length depends on the symbol.
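The difference in length between the three implementations can be seen by encoding a few sample characters. The samples below (a Latin letter, an accented letter, a Chinese character and an emoji) are chosen for illustration.

```python
# Byte length of the same characters in UTF-8, UTF-16 and UTF-32.
samples = ["A", "\u00e9", "\u4e25", "\U0001F600"]

sizes = {
    ch: (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),   # "-le" variants add no byte order mark
        len(ch.encode("utf-32-le")),
    )
    for ch in samples
}

for ch, (u8, u16, u32) in sizes.items():
    print(f"U+{ord(ch):04X}: UTF-8={u8} UTF-16={u16} UTF-32={u32}")
```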
The encoding rules of UTF-8 are very simple; there are only two:
1. For a single-byte symbol, the first bit of the byte is set to 0, and the remaining 7 bits are the Unicode code of the symbol. For English letters, therefore, UTF-8 and ASCII are identical.
2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and the (n+1)-th bit is set to 0; the first two bits of each following byte are always set to 10. All remaining bits hold the Unicode code of the symbol.
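The following table summarizes the encoding rules. The letter x indicates the bits available for the code.
Unicode range (hexadecimal) | UTF-8 encoding (binary)
0000 0000 - 0000 007F | 0xxxxxxx
0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
According to the table above, interpreting UTF-8 is very simple. If the first bit of a byte is 0, the byte is a character on its own; if the first bit is 1, the number of consecutive 1s indicates how many bytes the current character occupies.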
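The interpretation rule just described can be sketched as follows. The function names are hypothetical, and the sketch only inspects the leading bits of the first byte; it does not validate the continuation bytes as a real decoder would.

```python
def sequence_length(first_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with first_byte."""
    if (first_byte & 0b1000_0000) == 0:
        return 1                       # 0xxxxxxx: a character on its own
    ones = 0
    mask = 0b1000_0000
    while first_byte & mask:           # count the consecutive leading 1 bits
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 4:
        # 10xxxxxx is a continuation byte; more than four 1s is not valid.
        raise ValueError("not a valid UTF-8 lead byte")
    return ones

def split_characters(data: bytes) -> list:
    """Split UTF-8 bytes into one bytes object per character."""
    chunks, i = [], 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks

print(split_characters("A\u4e25".encode("utf-8")))
```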
Next, we take the Chinese character 严 (Yan) as an example to show how UTF-8 encoding is carried out.
The Unicode code of 严 is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so its UTF-8 encoding requires three bytes in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary digit of 严, fill in the x positions of the format from back to front, and pad the remaining positions with 0. This gives the UTF-8 encoding 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.
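The fill-in procedure above can be written as bit operations. This is a minimal sketch with a hypothetical function name, utf8_encode; it does not reject surrogate code points (D800 to DFFF) as a strict encoder would, and in practice Python's built-in str.encode does this work.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by filling in the x bits."""
    if cp < 0x80:                                  # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                                 # 110xxxxx 10xxxxxx
        return bytes([0b1100_0000 | (cp >> 6),
                      0b1000_0000 | (cp & 0x3F)])
    if cp < 0x10000:                               # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b1110_0000 | (cp >> 12),
                      0b1000_0000 | ((cp >> 6) & 0x3F),
                      0b1000_0000 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                             # 11110xxx and three 10xxxxxx
        return bytes([0b1111_0000 | (cp >> 18),
                      0b1000_0000 | ((cp >> 12) & 0x3F),
                      0b1000_0000 | ((cp >> 6) & 0x3F),
                      0b1000_0000 | (cp & 0x3F)])
    raise ValueError("code point out of range")

encoded = utf8_encode(0x4E25)
print(" ".join(format(b, "08b") for b in encoded), encoded.hex().upper())
```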
6. Conversion between Unicode and UTF-8
The example in the previous section shows that the Unicode code of 严 (Yan) is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Programs can convert between them.
On Windows, one of the simplest ways to convert is the built-in Notepad program, notepad.exe. After opening a file, choose the Save As command in the File menu; the dialog box that appears has an Encoding drop-down list at the bottom.
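As one example of such a program, the sketch below converts in both directions using Python's built-in codec. The two function names are hypothetical.

```python
def code_point_to_utf8_hex(cp: int) -> str:
    """Unicode code point -> UTF-8 bytes, written as uppercase hex."""
    return chr(cp).encode("utf-8").hex().upper()

def utf8_hex_to_code_point(hex_bytes: str) -> int:
    """UTF-8 bytes written as hex -> Unicode code point of the one character."""
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    if len(text) != 1:
        raise ValueError("expected the encoding of exactly one character")
    return ord(text)

print(code_point_to_utf8_hex(0x4E25))
print(format(utf8_hex_to_code_point("E4B8A5"), "04X"))
```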
There are four options: ANSI, Unicode, Unicode big endian and UTF-8.
ANSI is the default encoding. For English files it is ASCII; for Simplified Chinese files it is GB2312 (on the Simplified Chinese version of Windows only; the Traditional Chinese version uses Big5).
Unicode here refers to the UCS-2 encoding used by notepad.exe, which stores the Unicode code of each character directly in two bytes. This option uses the little endian format.
Unicode big endian is the counterpart of the previous option. The meaning of little endian and big endian is explained in the next section.
UTF-8 is the encoding described in the previous section.
After choosing an encoding, click the Save button, and the file is converted to that encoding immediately.
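The same conversion can be done without Notepad. The sketch below is one possible approach, not what Notepad does internally: it reads a file in one encoding and writes it back in another. The function name convert_file and the file names are hypothetical, and the whole file is read into memory, so it suits small text files only.

```python
from pathlib import Path
import tempfile

def convert_file(src: Path, dst: Path, src_encoding: str, dst_encoding: str) -> None:
    """Re-encode a text file from src_encoding to dst_encoding."""
    text = src.read_text(encoding=src_encoding)
    dst.write_text(text, encoding=dst_encoding)

# Demonstration on a temporary file holding the character Yan in GB2312.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "ansi.txt"
    dst = Path(tmp) / "utf8.txt"
    src.write_bytes(b"\xd1\xcf")
    convert_file(src, dst, "gb2312", "utf-8")
    converted = dst.read_bytes()

print(converted.hex().upper())
```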
7. Little endian and Big endian
As mentioned in the previous section, the UCS-2 format can store Unicode codes directly (for code points that do not exceed 0xFFFF). Taking the Chinese character 严 (Yan) as an example, the Unicode code is 4E25 and is stored in two bytes, 4E and 25. Storing 4E first and 25 second is the big endian method; storing 25 first and 4E second is the little endian method.
These two odd names come from "Gulliver's Travels" by the British writer Jonathan Swift. In the book, civil war breaks out in Lilliput over whether eggs should be cracked at the big end (big-endian) or the little end (little-endian). Because of this dispute, six wars broke out, one emperor lost his life, and another emperor lost his throne.
Putting the first (high-order) byte first is "big endian"; putting the second (low-order) byte first is "little endian".
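The two byte orders can be produced directly from the number 4E25. A minimal sketch:

```python
# Store the 16-bit value 4E25 in both byte orders.
cp = 0x4E25

big = cp.to_bytes(2, "big")        # high-order byte 4E first
little = cp.to_bytes(2, "little")  # low-order byte 25 first

print(big.hex().upper(), little.hex().upper())

# Python's UTF-16 codecs give the same bytes for this character.
ch = chr(cp)
print(ch.encode("utf-16-be") == big, ch.encode("utf-16-le") == little)
```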
So naturally, a question will arise: How does the computer know which way a certain file is encoded?
The Unicode specification defines a character that can be placed at the start of a file to indicate the byte order. This character is named "zero width no-break space" and is represented by FEFF. It is exactly two bytes long, and FF is one greater than FE.
If the first two bytes of a text file are FE FF, the file uses big endian order; if the first two bytes are FF FE, the file uses little endian order.
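A check of the first two bytes could look like the sketch below. The function name is hypothetical. It covers only the two-byte marks discussed here; a file in another encoding can also happen to begin with FF FE (the UTF-32 little endian mark starts with the same two bytes), so this is a heuristic rather than a complete detector.

```python
import codecs

def detect_utf16_byte_order(data: bytes):
    """Return "big", "little", or None based on the first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little"
    return None                                # no two-byte mark found

print(detect_utf16_byte_order(b"\xfe\xff\x4e\x25"))
print(detect_utf16_byte_order(b"\xff\xfe\x25\x4e"))
print(detect_utf16_byte_order(b"\xe4\xb8\xa5"))
```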
8. Example
Here is an example.
Open Notepad (notepad.exe), create a new text file containing the single character 严 (Yan), and save it in turn with the ANSI, Unicode, Unicode big endian and UTF-8 encodings.
Then use the hex mode of the text editor UltraEdit to inspect the bytes of each file.
ANSI: the file is two bytes, D1 CF, which is exactly the GB2312 encoding of 严. This also implies that GB2312 is stored with the high byte first (big endian order).
Unicode: the file is four bytes, FF FE 25 4E. FF FE indicates little endian storage, and the actual code is 4E25.
Unicode big endian: the file is four bytes, FE FF 4E 25. FE FF indicates big endian storage.
UTF-8: the file is six bytes, EF BB BF E4 B8 A5. The first three bytes, EF BB BF, indicate that this is UTF-8; the last three, E4 B8 A5, are the encoding of 严 itself, stored in the same order as the encoding.
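The four byte sequences can be reproduced without Notepad or UltraEdit. The sketch below assumes that the Notepad options correspond to Python's gb2312, utf-16-le, utf-16-be and utf-8-sig codecs, with the byte order mark written explicitly for the two UTF-16 cases; that mapping is an assumption made for this example.

```python
import codecs

ch = "\u4e25"  # the character Yan

encodings = {
    "ANSI (GB2312)": ch.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),   # utf-8-sig prepends EF BB BF
}

for name, data in encodings.items():
    print(name + ":", " ".join(format(b, "02X") for b in data))
```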
9. Extended reading (extracurricular knowledge)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (the most basic knowledge about character sets)
Talk about Unicode encoding
RFC 3629: UTF-8, a transformation format of ISO 10646 (the specification of UTF-8)