Detailed explanation of the character sets and encodings used in PHP and when each should be used
A character set is a collection of characters. There are many character sets, and each contains a different number of characters. Common ones include the ASCII, GB2312, BIG5, GB 18030, and Unicode character sets. For a computer to process text in these character sets accurately, the characters must be encoded so that the computer can recognize and store them.
Chinese has a very large number of characters, and it is written in two forms, Simplified and Traditional, with different writing rules, whereas computers were originally designed around single-byte English characters. Encoding Chinese characters is therefore the technical foundation of Chinese information exchange. This article discusses several typical character sets in chronological order, including several representative Chinese character sets, and examines their origins, characteristics, and technical features.
ASCII character set
1. Origin of the name
ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet.
2. Features
It is mainly used to display modern English and other Western European languages. It is the most common single-byte encoding system today and is equivalent to the international standard ISO 646.
3. Contents
Control characters: carriage return, backspace, line feed, etc.
Printable characters: uppercase and lowercase English letters, Arabic numerals, and Western punctuation symbols
4. Technical characteristics
Each character is represented by 7 bits, for a total of 128 characters.
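As a rough illustration of the 7-bit limit, the following PHP sketch checks whether every byte of a string falls in the range 0x00-0x7F. The helper name `isSevenBitAscii` is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper: true when every byte of $s fits in 7 bits (0x00-0x7F).
function isSevenBitAscii(string $s): bool
{
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        if (ord($s[$i]) > 0x7F) {
            return false;
        }
    }
    return true;
}

var_dump(ord('A'));                     // int(65), i.e. 0x41
var_dump(isSevenBitAscii("Hello\r\n")); // bool(true): letters plus the CR and LF control characters
var_dump(isSevenBitAscii("\xE9"));      // bool(false): 0xE9 needs the eighth bit
```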
5. ASCII extended character set
Seven-bit ASCII supports only 128 characters. To represent more of the characters commonly used in Europe, ASCII was extended: the extended ASCII character set uses 8 bits per character, for a total of 256 characters.
The symbols added by the extended ASCII character set include box-drawing symbols, mathematical symbols, Greek letters, and special Latin symbols.
GB2312 character set
1. Origin of the name
GB2312, also known as the GB2312-80 character set, has the full name "Chinese Coded Character Set for Information Exchange, Basic Set". It was issued by the former State Administration of Standards of China and took effect on May 1, 1981.
2. Features
GB2312 is China's national standard character set for Simplified Chinese. The Chinese characters it contains cover 99.75% of usage by frequency, which basically meets the needs of computer processing of Chinese. It is widely used in mainland China and Singapore.
3. Contents
GB2312 contains Simplified Chinese characters plus general symbols, serial numbers, digits, Latin letters, Japanese kana, Greek letters, Russian (Cyrillic) letters, Hanyu Pinyin symbols, and zhuyin (bopomofo) letters, for a total of 7,445 graphic characters. Of these, 6,763 are Chinese characters (3,755 first-level and 3,008 second-level), and 682 are full-width characters such as Latin letters, Greek letters, Japanese hiragana and katakana, and Cyrillic letters.
4. Technical features
(1) Zone representation:
GB2312 divides the characters it contains into zones (also called areas), each holding 94 characters or symbols. A character is identified by its zone number and its position within the zone; this representation is called the zone-position code (区位码).
The zones are assigned as follows: zones 01-09 hold special symbols; zones 16-55 hold first-level Chinese characters, sorted by pinyin; zones 56-87 hold second-level Chinese characters, sorted by radical and stroke count; zones 10-15 and 88-94 are unassigned.
(2) Double-byte representation
Each character is stored in two bytes. By convention the first byte is called the "high byte" and the second the "low byte".
The high byte ranges over 0xA1-0xF7 (0xA0 plus the zone number, 01-87), and the low byte over 0xA1-0xFE (0xA0 plus the position, 01-94).
5. Encoding example
Take "啊", the first Chinese character in GB2312, as an example. It is in zone 16 at position 01, so its zone-position code is 1601. Most programs add 0xA0 to both the zone and the position to obtain the code they actually process, here 0xB0A1: 0xB0 = 0xA0 + 16 and 0xA1 = 0xA0 + 1.
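The 0xA0 offset arithmetic in this example can be sketched in PHP as follows. The helper `gb2312Bytes` is a hypothetical name, and its two parameters are the area (zone) number and the position within that area. The decoding step assumes the mbstring extension is loaded; mbstring refers to this byte encoding as "EUC-CN".

```php
<?php
// Hypothetical helper: convert a GB2312 zone/position pair (each 1-94)
// into the two-byte code by adding 0xA0 to each part.
function gb2312Bytes(int $zone, int $position): string
{
    if ($zone < 1 || $zone > 94 || $position < 1 || $position > 94) {
        throw new InvalidArgumentException('zone and position must be in 1..94');
    }
    return chr(0xA0 + $zone) . chr(0xA0 + $position);
}

$bytes = gb2312Bytes(16, 1);
echo strtoupper(bin2hex($bytes)), "\n"; // B0A1

// Assumption: mbstring is available; skip the decoding step otherwise.
if (function_exists('mb_convert_encoding')) {
    echo mb_convert_encoding($bytes, 'UTF-8', 'EUC-CN'), "\n"; // the character at zone 16, position 01
}
```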
BIG5 character set
1. Origin of the name
Also known as the Big Five code, it was developed in 1984 by Taiwan's Institute for Information Industry together with five software companies (Acer, MiTAC, Jiajia, Zero One, and FIC), hence the name.
Big5 was created because vendors in Taiwan had each introduced their own encodings, such as the ETen (Yitian) code, IBM PS55, and the Wang (Wangan) code, which were incompatible with one another. In addition, the Taiwan government had not yet issued an official Chinese character encoding, and mainland China's GB2312 did not include Traditional Chinese characters.
2. Features
The Big5 character set contains a total of 13,053 Chinese characters and is used in Taiwan, China. Curiously, two characters are encoded twice: "兀" (0xA461 and 0xC94A) and "嗀" (0xDCD1 and 0xDDFC).
3. Character encoding method
Big5 is a double-byte encoding: each character is stored in two bytes, the first called the "high byte" and the second the "low byte". The high byte ranges over 0xA1-0xF9, and the low byte over 0x40-0x7E and 0xA1-0xFE.
The code ranges are assigned as follows: 0xA140-0xA3BF hold punctuation marks, Greek letters, and special symbols, with 0xA259-0xA261 holding the characters for two-syllable units of measurement; 0xA440-0xC67E hold the commonly used Chinese characters, sorted first by stroke count and then by radical; 0xC940-0xF9D5 hold the less commonly used Chinese characters, sorted the same way.
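A minimal PHP sketch of the byte ranges quoted above: the hypothetical helper `isBig5Pair` only tests whether two bytes fall inside the stated high-byte and low-byte ranges. It does not check whether a character is actually assigned to that code.

```php
<?php
// Hypothetical helper: check a two-byte sequence against the Big5 byte
// ranges quoted in the text (high 0xA1-0xF9; low 0x40-0x7E or 0xA1-0xFE).
function isBig5Pair(string $pair): bool
{
    if (strlen($pair) !== 2) {
        return false;
    }
    $high = ord($pair[0]);
    $low  = ord($pair[1]);
    $highOk = $high >= 0xA1 && $high <= 0xF9;
    $lowOk  = ($low >= 0x40 && $low <= 0x7E) || ($low >= 0xA1 && $low <= 0xFE);
    return $highOk && $lowOk;
}

var_dump(isBig5Pair("\xA4\x61")); // bool(true): inside the 0xA440-0xC67E block
var_dump(isBig5Pair("\xA4\x7F")); // bool(false): low byte 0x7F is outside both low-byte ranges
var_dump(isBig5Pair("\x41\x42")); // bool(false): the first byte is plain ASCII
```

Because the low byte may fall in 0x40-0x7E, the second byte of a Big5 character can coincide with an ASCII character, which is worth keeping in mind when processing Big5 text byte by byte.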
4. Limitations of Big5
Although Big5 contains more than 10,000 characters, it does not cover many characters in everyday circulation, such as those used in personal names, place names, dialects, chemistry, and biology, nor does it include Japanese hiragana and katakana.
For example, Taiwan regards "着" as a variant of "著", so "着" is not included. Some Kangxi dictionary radicals (such as "亠" and "疒") and some characters commonly used in personal names are also not included.
GB18030 character set
1. Origin of the name
The full name of GB 18030 is GB18030-2000 "Extension of the Basic Set of Chinese Coded Character Set for Information Exchange". It is the Chinese government's new national standard for Chinese character encoding, released on March 17, 2000. Software released on the Chinese market after August 31, 2001 must comply with it.
2. Features
The GB 18030 standard was drawn up with broad participation and review by well-known information technology companies at home and abroad, and was issued jointly by the Ministry of Information Industry and the former State Administration of Quality and Technical Supervision. It addresses the computer encoding of a large character set made up of Chinese characters, Japanese kana, Korean, and the scripts of China's ethnic minorities. The total code space exceeds 1.5 million code positions, and the standard includes 27,484 Chinese characters, covering Chinese, Japanese, Korean, and Chinese minority scripts. It meets the requirements of information exchange in East Asia: multiple scripts, a large character repertoire, multiple uses, and a unified encoding format. It is also compatible with Unicode 3.0, includes the Unicode block "CJK Unified Ideographs Extension A", and is compatible with the earlier national character encoding standards (GB2312, GB13000.1).
3. Encoding method
The GB 18030 standard encodes characters in one, two, or four bytes. The single-byte part uses 0x00 to 0x7F (matching the corresponding ASCII codes). In the double-byte part, the first byte ranges from 0x81 to 0xFE and the second byte from 0x40 to 0x7E and from 0x80 to 0xFE. The four-byte part uses 0x30 to 0x39, not used in GB/T 11383, as the suffix that extends the double-byte encoding; the extended four-byte codes range from 0x81308130 to 0xFE39FE39, with the first and third bytes in 0x81 to 0xFE and the second and fourth bytes in 0x30 to 0x39.
4. Contents
The double-byte part mainly contains all 20,902 CJK Chinese characters of GB13000.1, related punctuation marks, 13 ideographic description characters, 80 supplementary Chinese characters and radicals/components, and the double-byte euro sign. The four-byte part contains every character of GB 13000.1 not covered by the double-byte part, including CJK Unified Ideographs Extension A.
Unicode character set
1. The origin of the name
The Unicode character set, whose repertoire corresponds to the Universal Multiple-Octet Coded Character Set (UCS), is a character encoding system developed by an organization called the Unicode Consortium to support the exchange, processing, and display of written text in the world's languages. Development began in 1990 and the encoding was officially announced in 1994. The version current when this text was written was Unicode 4.1.0, released on March 31, 2005.
2. Features
Unicode is a character encoding used on computers. It sets a unified and unique binary encoding for each character in each language to meet the requirements for cross-language and cross-platform text conversion and processing.
3. Encoding method
The Unicode standard always writes code points as hexadecimal numbers prefixed with "U+". For example, the code point of the letter "A" is 0041 in hexadecimal and that of the euro sign "€" is 20AC, so "A" is written "U+0041" and "€" is written "U+20AC".
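The "U+" notation can be reproduced in PHP with a short sketch like the one below. It assumes PHP 7.2 or later with the mbstring extension (for `mb_ord`); `codePointLabel` is a hypothetical helper name.

```php
<?php
// Assumption: PHP 7.2+ with the mbstring extension (for mb_ord).
// Hypothetical helper: format the first character of a UTF-8 string as U+XXXX.
function codePointLabel(string $char): string
{
    return sprintf('U+%04X', mb_ord($char, 'UTF-8'));
}

echo codePointLabel('A'), "\n";        // U+0041
echo codePointLabel("\u{20AC}"), "\n"; // U+20AC (the euro sign)
```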
4. UTF-8 encoding
UTF-8 is one of the ways of using Unicode. UTF stands for Unicode Transformation Format, that is, a way of transforming Unicode code points into a particular byte format.
UTF-8 makes it easy to transmit text in different languages between computers over a network, because it lets Unicode text pass correctly through existing systems built to handle single-byte characters.
UTF-8 stores Unicode characters in a variable number of bytes. ASCII letters still take 1 byte; accented letters, Greek letters, and Cyrillic letters take 2 bytes; commonly used Chinese characters take 3 bytes; and supplementary-plane characters take 4 bytes.
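To see the variable byte lengths, one option in PHP is to rely on the fact that `strlen()` counts bytes rather than characters. The sample characters below are illustrative choices, written with `\u{...}` escapes (PHP 7 or later) so that the result does not depend on how the source file is saved.

```php
<?php
// strlen() counts bytes, so it shows how many bytes UTF-8 uses per character.
$samples = [
    'Latin letter A'      => "\u{0041}",
    'Greek alpha'         => "\u{03B1}",
    'Chinese character'   => "\u{4E2D}",
    'Supplementary plane' => "\u{1F600}",
];

foreach ($samples as $label => $char) {
    // Print the label, the byte count, and the raw bytes in hexadecimal.
    printf("%-20s %d byte(s): %s\n", $label, strlen($char), bin2hex($char));
}
```

The four samples take 1, 2, 3, and 4 bytes respectively.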
5. UTF-16 and UTF-32 encoding
UTF-32, UTF-16, and UTF-8 are all character encoding schemes for the Unicode coded character set. UTF-16 encodes a Unicode code point as a sequence of one or two 16-bit code units; UTF-32 represents each Unicode code point as a 32-bit integer with the same value.
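The difference between the two schemes can be illustrated with a small PHP sketch. It assumes the mbstring extension is loaded; `hexIn` is a hypothetical helper, and the big-endian variants are chosen only so that the hex output reads in code-point order.

```php
<?php
// Assumption: the mbstring extension is loaded.
// Hypothetical helper: hex dump of one UTF-8 character in a target encoding.
function hexIn(string $utf8Char, string $encoding): string
{
    return bin2hex(mb_convert_encoding($utf8Char, $encoding, 'UTF-8'));
}

$bmp           = "\u{4E2D}";  // a character in the Basic Multilingual Plane
$supplementary = "\u{1F600}"; // a supplementary-plane character

echo hexIn($bmp, 'UTF-16BE'), "\n";           // 4e2d     (one 16-bit code unit)
echo hexIn($supplementary, 'UTF-16BE'), "\n"; // d83dde00 (two 16-bit code units)
echo hexIn($bmp, 'UTF-32BE'), "\n";           // 00004e2d (the code point as a 32-bit integer)
echo hexIn($supplementary, 'UTF-32BE'), "\n"; // 0001f600
```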
Solutions to garbled-text problems in PHP applications
1) Use the <meta http-equiv="content-type" content="text/html; charset=xxx"> tag to set the page encoding
This tag declares which character encoding the client's browser should use to display the page. xxx can be GB2312, GBK, UTF-8 (note that MySQL spells it UTF8), and so on. Most pages can use this method to tell the browser which encoding to use, avoiding garbled characters. Sometimes, however, the tag seems to have no effect: whatever xxx is, the browser always uses the same encoding. The reason is explained later.
Note that the <meta> tag is part of the HTML content and is only a declaration; it takes effect only once the server has delivered the HTML to the browser.
2) header("content-type:text/html; charset=xxx");
The header() function sends its argument as an HTTP header. With the argument shown above, its effect is basically the same as the <meta> tag, and the two strings look similar. The difference is that with header() the browser always uses the xxx encoding you asked for, which makes this function very useful. To see why, consider the difference between the HTTP header and the HTML content:
The HTTP header is sent by the server before the HTML content, whereas the <meta> tag is part of the HTML content, so what header() sends reaches the browser first. Put simply, header() has higher priority than <meta>. If a PHP page contains both header("content-type:text/html; charset=xxx") and <meta http-equiv="content-type" content="text/html; charset=xxx">, the browser honors the HTTP header and ignores the <meta> tag. Of course, header() can only be used in PHP pages.
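A minimal sketch of a page that sends the HTTP header and also repeats the charset in the HTML might look like this. The helper `contentTypeHeader` is a hypothetical name, and UTF-8 is only an assumed choice of charset.

```php
<?php
// Hypothetical helper: build the Content-Type header line for an HTML page.
function contentTypeHeader(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

$charset = 'UTF-8'; // assumption: the page content is actually produced as UTF-8

// header() must run before any output is sent, so guard with headers_sent().
if (!headers_sent()) {
    header(contentTypeHeader($charset));
}

// Repeating the same charset in the HTML keeps both declarations consistent.
echo '<!DOCTYPE html><html><head>';
echo '<meta http-equiv="content-type" content="text/html; charset=' . $charset . '">';
echo '<title>Charset demo</title></head><body>OK</body></html>';
```

Keeping the charset in one variable avoids the situation where the header and the HTML declare different encodings.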
One question remains: why does header() always work while the <meta> tag sometimes does not? The answer lies in Apache, discussed next.
3) AddDefaultCharset
The conf folder under the Apache root directory contains Apache's main configuration file, httpd.conf.
Open httpd.conf in a text editor. Around line 708 (the exact line varies between versions) you will find AddDefaultCharset xxx, where xxx is the name of an encoding. This line sets the character set in the HTTP header of every web page served to the default xxx, which is equivalent to adding header("content-type: text/html; charset=xxx") to each file. This explains why the browser always uses gb2312 even though the <meta> tag specifies utf-8.
If the page itself calls header("content-type:text/html; charset=xxx"), the default is replaced by the character set you specify, so header() always works. If you comment out AddDefaultCharset xxx by putting a "#" in front of it, and the page does not call header("content-type..."), then the <meta> tag takes effect.
The priority order, from highest to lowest, is:
header("content-type:text/html; charset=xxx")
AddDefaultCharset xxx
<meta http-equiv="content-type" content="text/html; charset=xxx">
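The order described in this section (header() first, then AddDefaultCharset, then the meta tag) can be modelled with a short PHP sketch. This is only a model of the article's description, not of how any particular browser or server is implemented; `effectiveCharset` is a hypothetical function, and nullable type declarations require PHP 7.1 or later.

```php
<?php
// Hypothetical model of the order described above. Each argument is the
// charset declared at that level, or null when that level declares nothing.
function effectiveCharset(?string $phpHeader, ?string $apacheDefault, ?string $metaTag): ?string
{
    // The first non-null declaration wins, checked from highest priority down.
    return $phpHeader ?? $apacheDefault ?? $metaTag;
}

var_dump(effectiveCharset('utf-8', 'gb2312', 'big5')); // string(5) "utf-8"
var_dump(effectiveCharset(null, 'gb2312', 'utf-8'));   // string(6) "gb2312"
var_dump(effectiveCharset(null, null, 'utf-8'));       // string(5) "utf-8"
```

The second call corresponds to the case discussed earlier, where the page declares utf-8 in its meta tag but the browser still uses gb2312 because of the server default.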
If you are a web programmer, I recommend adding header("content-type: text/html; charset=xxx") to every page, so that it displays correctly on any server and is highly portable.
4) The default_charset configuration in php.ini:
The default_charset = "gb2312" setting in php.ini defines PHP's default character set. It is generally recommended to comment this line out and let the browser choose the encoding from the charset declared in the page's header rather than forcing one, so that pages in multiple languages can be served from the same server.
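To check what a given server is configured with, a script can read the setting at run time. The sketch below uses `ini_get` and `ini_set`, which are standard PHP functions; overriding the value to UTF-8 is only an assumed choice for illustration.

```php
<?php
// Read the value configured in php.ini (PHP 5.6 and later default to "UTF-8").
echo 'configured: ', var_export(ini_get('default_charset'), true), "\n";

// Override it for this script only; assumption: UTF-8 is the desired charset.
ini_set('default_charset', 'UTF-8');
echo 'effective:  ', ini_get('default_charset'), "\n";
```

PHP uses default_charset for the Content-Type header it sends when the script does not call header() itself, so a per-script override like this is an alternative to editing php.ini on a shared server.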