Detailed explanation of various PHP encoding sets and under what circumstances they should be used

高洛峰

Nov 30, 2016 pm 02:00 PM


A character set is a collection of characters. There are many character sets, and each contains a different number of characters. Common ones include the ASCII, GB2312, BIG5, GB 18030, and Unicode character sets. For a computer to process text in any of these character sets accurately, the characters must be encoded so that the computer can recognize and store them.

Chinese has a very large number of characters, and they fall into two groups, Simplified and Traditional, with different writing rules. Computers, however, were originally designed around single-byte English characters, so the encoding of Chinese characters is the technical foundation of Chinese information exchange. This article covers several typical character sets in chronological order, including several representative Chinese character sets, and looks at the origin, characteristics, and technical features of each.

　ASCII character set

　1. Origin of the name

ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet.

　2. Features

　It is mainly used to display modern English and other Western European languages. It is the most common single-byte encoding system today and is equivalent to the international standard ISO 646.

3. Contents

Control characters: carriage return, backspace, line feed, and so on.

Displayable characters: uppercase and lowercase English letters, Arabic numerals, and Western punctuation and symbols.

4. Technical characteristics

Each character is represented by 7 bits, giving 128 characters in total.

5. ASCII extended character set

Because standard ASCII is a 7-bit encoding, it supports only 128 characters. To represent more of the characters commonly used in European languages, ASCII was extended. The extended ASCII character set uses 8 bits per character, for a total of 256 characters.

The additional symbols in the extended ASCII character set include box-drawing (table) symbols, calculation symbols, Greek letters, and special Latin symbols.
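
The 7-bit limit described above can be checked directly in PHP. The following is a minimal sketch, not part of the original article: is_ascii() is a hypothetical helper name, and it simply tests whether every byte of a string falls in the range 0x00-0x7F.

```php
<?php
/**
 * Returns true when every byte of $text is in the 7-bit ASCII range 0x00-0x7F.
 * is_ascii() is a hypothetical helper used only for this illustration.
 */
function is_ascii(string $text): bool
{
    // No /u modifier: the pattern is matched against raw bytes.
    return preg_match('/^[\x00-\x7F]*$/D', $text) === 1;
}

var_dump(ord('A'));             // numeric code of "A"
var_dump(is_ascii('Hello'));    // only 7-bit characters
var_dump(is_ascii("caf\xE9"));  // byte 0xE9 needs the 8th bit, so it is outside ASCII
```

Any byte with the high bit set belongs to an extended or multi-byte encoding, which is why the same byte can display differently depending on the character set the reader assumes.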

　GB2312 character set

1. Origin of the name

GB2312, also known as the GB2312-80 character set, has the full name "Chinese Coded Character Set for Information Exchange, Basic Set". It was issued by the former China State Administration of Standards and came into effect on May 1, 1981.

　2. Features

GB2312 is China's national standard character set for Simplified Chinese. The Chinese characters it contains cover 99.75% of usage frequency, which basically meets the needs of computer processing of Chinese. It is widely used in mainland China and Singapore.

3. Contents

GB2312 contains 7,445 graphic characters in total: Simplified Chinese characters plus general symbols, serial numbers, digits, Latin letters, Japanese kana, Greek letters, Russian (Cyrillic) letters, Hanyu Pinyin symbols, and Chinese phonetic (zhuyin) letters. Of these, 6,763 are Chinese characters (3,755 first-level and 3,008 second-level), and 682 are full-width non-Chinese characters such as Latin letters, Greek letters, Japanese hiragana and katakana, and Cyrillic letters.

　4. Technical features

　 (1) Partition representation:

GB2312 divides its characters into zones ("areas"), each holding 94 Chinese characters or symbols. This representation is also called the zone-position code (location code).

The zones are allocated as follows: zones 01-09 hold special symbols; zones 16-55 hold the first-level Chinese characters, sorted by pinyin; zones 56-87 hold the second-level Chinese characters, sorted by radical and stroke count; zones 10-15 and 88-94 are unassigned.

　(2) Double-byte representation

Each character is stored in two bytes. By convention the first byte is called the "high byte" and the second the "low byte".

The high byte uses 0xA1-0xF7 (the zone number 01-87 plus 0xA0), and the low byte uses 0xA1-0xFE (the position number 01-94 plus 0xA0).

　 5. Encoding example

Take "啊", the first Chinese character in the GB2312 character set, as an example. Its zone number is 16 and its position number is 01, so its zone-position code is 1601. Most computer programs add 0xA0 to the high byte and to the low byte to obtain the internal code used by the program, 0xB0A1. The calculation is: 0xB0 = 0xA0 + 16, 0xA1 = 0xA0 + 1.
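
The calculation above can be sketched in a few lines of PHP. This is an illustration only: the function name gb2312_bytes_from_zone_position() is hypothetical, and it does nothing more than add 0xA0 to the zone and position numbers as described.

```php
<?php
/**
 * Builds the two-byte GB2312 internal code from a zone number and a position number.
 * Hypothetical helper for illustration; both numbers must be in the range 1-94.
 */
function gb2312_bytes_from_zone_position(int $zone, int $position): string
{
    if ($zone < 1 || $zone > 94 || $position < 1 || $position > 94) {
        throw new InvalidArgumentException('Zone and position must be between 1 and 94');
    }
    // High byte = 0xA0 + zone, low byte = 0xA0 + position.
    return chr(0xA0 + $zone) . chr(0xA0 + $position);
}

// Zone 16, position 01 is the first Chinese character in GB2312.
echo strtoupper(bin2hex(gb2312_bytes_from_zone_position(16, 1))), PHP_EOL; // B0A1
```

The result is a raw byte string, so it only displays as a Chinese character when the output is interpreted as GB2312.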

　BIG5 character set

1. Origin of the name

Big5, also known as the Big Five code, was created in 1984 by Taiwan's Institute for Information Industry together with five software companies: Acer, MiTAC, Jiajia, Zero One, and FIC. That is why it is called Big Five.

Big5 was created because vendors in Taiwan at the time had each launched their own codes, such as the ETen (Yitian) code, IBM PS55, and the Wang (Wangan) code, which were incompatible with one another. In addition, the Taiwan government had not yet issued an official Chinese character code, and mainland China's GB2312 encoding did not include Traditional Chinese characters.

　2. Features

The Big5 character set contains a total of 13,053 Chinese characters. This character set is used in Taiwan, China. Curiously, two characters appear twice in it: "兀" (0xA461 and 0xC94A) and "嗀" (0xDCD1 and 0xDDFC).

　3. Character encoding method

Big5 uses a double-byte storage method: each character is encoded in two bytes. The first byte is called the "high byte" and the second the "low byte". The high byte ranges from 0xA1 to 0xF9, and the low byte ranges from 0x40 to 0x7E and from 0xA1 to 0xFE.

The character types in each range are as follows: 0xA140-0xA3BF holds punctuation marks, Greek letters, and special symbols; within it, 0xA259-0xA261 holds the two-syllable unit-of-measurement characters 兙兛兞兝兡兣嗧瓩糎. 0xA440-0xC67E holds the frequently used Chinese characters, sorted first by stroke count and then by radical. 0xC940-0xF9D5 holds the less frequently used Chinese characters, sorted the same way.
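
The byte ranges above translate into a simple range check. The sketch below is an illustration under the ranges stated in this article only; big5_is_valid_pair() and big5_category() are hypothetical helper names, and vendor extensions to Big5 are not considered.

```php
<?php
/** Checks a high/low byte pair against the Big5 ranges given in the text. */
function big5_is_valid_pair(int $high, int $low): bool
{
    $highOk = $high >= 0xA1 && $high <= 0xF9;
    $lowOk  = ($low >= 0x40 && $low <= 0x7E) || ($low >= 0xA1 && $low <= 0xFE);
    return $highOk && $lowOk;
}

/** Classifies a two-byte Big5 code such as 0xA461 by the ranges listed above. */
function big5_category(int $code): string
{
    if (!big5_is_valid_pair($code >> 8, $code & 0xFF)) {
        return 'invalid';
    }
    if ($code >= 0xA140 && $code <= 0xA3BF) {
        return 'symbols';
    }
    if ($code >= 0xA440 && $code <= 0xC67E) {
        return 'frequently used Chinese characters';
    }
    if ($code >= 0xC940 && $code <= 0xF9D5) {
        return 'less frequently used Chinese characters';
    }
    return 'other';
}

// The duplicated character mentioned earlier sits in two different blocks.
echo big5_category(0xA461), PHP_EOL;
echo big5_category(0xC94A), PHP_EOL;
```

Note that the low byte can fall in 0x40-0x7E, which overlaps printable ASCII; this is one reason byte-oriented string functions can mangle Big5 text.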

　　4.Limitations of Big5

Although Big5 contains more than 10,000 characters, it does not cover many characters used in personal names, place names, dialects, chemistry, biology, and other fields, nor does it include Japanese hiragana and katakana.

For example, Taiwan treats "着" as a variant of "著", so "着" is not included. Some radical characters from the Kangxi dictionary (such as "亠" and "疒") and some characters commonly used in personal names are also missing from Big5.

GB18030 character set

1. Origin of the name

The full name of GB 18030 is GB18030-2000 "Extension of the Basic Set of Chinese Character Encoding for Information Exchange". It is the Chinese government's new national standard for Chinese character encoding, released on March 17, 2000. Software released on the Chinese market after August 31, 2001 must comply with this standard.

　2. Features

The GB 18030 character set standard was developed with broad participation and review by well-known information technology companies at home and abroad, and was jointly issued by the Ministry of Information Industry and the former State Administration of Quality and Technical Supervision. It addresses the computer encoding of a large character set made up of Chinese characters, Japanese kana, Korean, and the scripts of China's ethnic minorities. The standard's total encoding space exceeds 1.5 million code positions, and it includes 27,484 Chinese characters, covering Chinese, Japanese, Korean, and Chinese minority scripts. It meets the East Asian requirements for information exchange: multiple scripts, a large character repertoire, multiple applications, and a unified encoding format. It is also compatible with Unicode 3.0, incorporates the Unicode extension "CJK Unified Ideographs Extension A", and remains compatible with the earlier national character encoding standards (GB2312 and GB13000.1).

　3. Encoding method

The GB 18030 standard encodes characters in one, two, or four bytes. The single-byte part uses 0x00 to 0x7F (matching the ASCII codes). In the double-byte part, the first byte ranges from 0x81 to 0xFE and the second byte ranges from 0x40 to 0x7E and from 0x80 to 0xFE. The four-byte part extends the double-byte encoding by using 0x30 to 0x39, which GB/T 11383 leaves unused, as the second byte. The four-byte range is 0x81308130 to 0xFE39FE39: the first and third bytes range from 0x81 to 0xFE, and the second and fourth bytes range from 0x30 to 0x39.

4. Contents

The double-byte part mainly contains all 20,902 CJK Chinese characters in GB13000.1, 13 related punctuation marks, ideographic description characters, 80 supplementary Chinese characters and radicals/components, the double-byte euro sign, and so on. The four-byte part contains all characters in GB 13000.1 other than the double-byte characters above, including CJK Unified Ideographs Extension A.

Unicode character set

1. The origin of the name

The Unicode character set takes its name from "Universal Multiple-Octet Coded Character Set" (abbreviated UCS). It is a character encoding system developed by an organization called the Unicode Consortium to support the interchange, processing, and display of written text in the world's languages. Development began in 1990 and the encoding was officially announced in 1994. The version cited in this article is Unicode 4.1.0, released on March 31, 2005.

2. Features

Unicode is a character encoding used on computers. It assigns a unified, unique binary code to every character in every language, to meet the requirements of cross-language and cross-platform text conversion and processing.

　3. Encoding method

Unicode code points are always written in hexadecimal and prefixed with "U+". For example, the code of the letter "A" is 0041 in hexadecimal and the code of the character "€" is 20AC in hexadecimal, so the code of "A" is written as "U+0041".
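
As a small illustration of this notation, the following sketch formats an integer code point in the "U+" style. The function name format_code_point() is hypothetical and is not part of any PHP extension.

```php
<?php
/** Formats an integer code point as "U+" followed by at least four uppercase hex digits. */
function format_code_point(int $codePoint): string
{
    return sprintf('U+%04X', $codePoint);
}

echo format_code_point(0x41), PHP_EOL;   // U+0041
echo format_code_point(0x20AC), PHP_EOL; // U+20AC
```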

　4.UTF-8 encoding

UTF-8 is one of the ways of using Unicode. UTF stands for Unicode Transformation Format, that is, a way of transforming Unicode code points into a particular byte format.

UTF-8 makes it easy to transmit text in different languages and encodings between computers over a network, and it allows multi-byte Unicode text to pass correctly through existing systems designed for single-byte characters.

UTF-8 stores Unicode characters in a variable number of bytes. ASCII letters still take 1 byte; accented characters, Greek letters, and Cyrillic letters take 2 bytes; commonly used Chinese characters take 3 bytes; and supplementary-plane characters take 4 bytes.
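
The byte counts above can be verified in PHP, where strlen() counts bytes rather than characters. This is a minimal sketch: utf8_length_for_code_point() is a hypothetical helper, and the "\u{...}" string escapes it relies on require PHP 7 or later.

```php
<?php
/** Returns how many bytes UTF-8 needs for a code point (surrogates are not checked here). */
function utf8_length_for_code_point(int $codePoint): int
{
    if ($codePoint < 0 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Not a Unicode code point');
    }
    if ($codePoint <= 0x7F) {
        return 1; // ASCII
    }
    if ($codePoint <= 0x7FF) {
        return 2; // accented Latin, Greek, Cyrillic
    }
    if ($codePoint <= 0xFFFF) {
        return 3; // most Chinese characters
    }
    return 4;     // supplementary planes
}

// "\u{...}" escapes always produce UTF-8 bytes, whatever the encoding of this source file.
$samples = [0x41 => "A", 0x3A9 => "\u{03A9}", 0x554A => "\u{554A}", 0x1F600 => "\u{1F600}"];
foreach ($samples as $codePoint => $char) {
    printf("U+%04X: strlen=%d, expected=%d\n",
        $codePoint, strlen($char), utf8_length_for_code_point($codePoint));
}
```

Because strlen() reports bytes, a three-character Chinese string has a length of 9 in UTF-8; counting characters requires an encoding-aware function instead.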

　5.UTF-16 and UTF-32 encoding

UTF-32, UTF-16, and UTF-8 are all character encoding schemes for the Unicode coded character set. UTF-16 encodes each Unicode code point as a sequence of one or two 16-bit code units; UTF-32 represents each Unicode code point as a 32-bit integer of the same value.
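
To make "one or two 16-bit code units" concrete, here is a minimal sketch of the UTF-16 rule: code points below U+10000 are stored as a single unit, and higher code points are split into a surrogate pair. The function name utf16_code_units() is hypothetical.

```php
<?php
/** Returns the UTF-16 code units (as integers) for one Unicode code point. */
function utf16_code_units(int $codePoint): array
{
    $isSurrogate = $codePoint >= 0xD800 && $codePoint <= 0xDFFF;
    if ($codePoint < 0 || $codePoint > 0x10FFFF || $isSurrogate) {
        throw new InvalidArgumentException('Not an encodable Unicode code point');
    }
    if ($codePoint < 0x10000) {
        return [$codePoint]; // one 16-bit unit with the same value
    }
    // Split the 20 remaining bits into a high and a low surrogate.
    $offset = $codePoint - 0x10000;
    return [0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF)];
}

foreach ([0x554A, 0x1F600] as $codePoint) {
    $units = array_map(fn (int $u): string => sprintf('%04X', $u), utf16_code_units($codePoint));
    printf("U+%04X -> %s\n", $codePoint, implode(' ', $units));
}
```

In UTF-32 no such splitting is needed, since the single 32-bit unit holds the code point value itself.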

Solutions to garbled-text problems in PHP applications

1) Use the <meta http-equiv="content-type" content="text/html; charset=xxx"> tag to set the page encoding

This tag declares which character encoding the client's browser should use to display the page. xxx can be GB2312, GBK, UTF-8 (note that MySQL spells it UTF8), and so on. Most pages can therefore use this tag to tell the browser which encoding to use, avoiding encoding errors and garbled characters. Sometimes, however, the tag seems to have no effect: whatever xxx is, the browser always uses the same encoding. The reason is explained later.

Note that the <meta> tag is part of the HTML content. It is only a declaration, and the browser can act on it only after the server has delivered the HTML.

　2) header("content-type:text/html; charset=xxx");

The header() function sends the string in the parentheses as an HTTP header. With the content shown in the heading above, its effect is basically the same as the <meta> tag, and the two strings look alike. The difference is that with header() the browser always uses the xxx encoding you asked for, so the function is very useful. Why? To answer that we need to look at the difference between the HTTP header and the HTML content:

HTTP headers are strings that the server sends, over the HTTP protocol, before it sends the HTML content to the browser. The <meta> tag is part of the HTML content, so what header() sends reaches the browser first. Put simply, header() has higher priority than <meta> (if that is the right way to put it). If a PHP page contains both header("content-type:text/html; charset=xxx") and a <meta> tag that declares a charset, the browser honors only the HTTP header and ignores the meta tag. Of course, this function can only be used in PHP pages.

One question remains: why does the former always work while the latter sometimes does not? That is why we turn to Apache next.

　3) AddDefaultCharset

The conf folder in the Apache root directory contains httpd.conf, the configuration file for the whole server.

Open httpd.conf in a text editor. Around line 708 (the line number varies between versions) there is an AddDefaultCharset xxx directive, where xxx is the encoding name. This line sets the character set in the HTTP header of every web page served by the server to the default xxx. Having this line is equivalent to adding header("content-type: text/html; charset=xxx") to every file. This explains why the browser always uses gb2312 even though the meta tag is set to utf-8.

If a page calls header("content-type:text/html; charset=xxx"), the default character set is replaced by the one you specify, so this function always works. If you comment out AddDefaultCharset xxx by putting a "#" in front of it, and the page does not call header("content-type..."), then the meta tag takes effect.

The priority order, from highest to lowest, is:

header("content-type:text/html; charset=xxx")

AddDefaultCharset xxx

<meta http-equiv="content-type" content="text/html; charset=xxx">

If you are a web programmer, I recommend adding header("content-type: text/html; charset=xxx") to every page, so that it displays correctly on any server and is highly portable.
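
A minimal sketch of that recommendation follows. The helper content_type_header() and the choice of UTF-8 are assumptions made for this example; the important point is that header() has to run before the page produces any output.

```php
<?php
/** Builds the Content-Type header line for an HTML page in the given charset. */
function content_type_header(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works while no output has been sent yet, so guard the call.
if (!headers_sent()) {
    header(content_type_header('UTF-8'));
}

// Page output starts only after the header has been set.
echo '<p>Page content goes here.</p>', PHP_EOL;
```

Stray whitespace or a byte-order mark before the opening PHP tag counts as output, which is a common reason for header() to fail with a "headers already sent" warning.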

　4) The default_charset configuration in php.ini:

The default_charset = "gb2312" setting in php.ini defines PHP's default character set. It is generally recommended to comment this line out and let the browser choose the encoding from the charset declared in the page header rather than forcing one. That way the same server can serve pages in multiple languages.
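
The current value of this setting can also be inspected, and overridden for a single request, from PHP code. The sketch below assumes a script that wants UTF-8 regardless of what php.ini says; note that since PHP 5.6 the built-in default for default_charset is already "UTF-8".

```php
<?php
// Value configured in php.ini (or the built-in default if the line is commented out).
$configured = ini_get('default_charset');

// Override for this request only; php.ini itself is not modified.
ini_set('default_charset', 'UTF-8');

printf("configured: %s, now: %s\n", var_export($configured, true), ini_get('default_charset'));
```

An explicit header() call that names a charset still decides what the browser is told for that page, so this override mainly matters for pages that send no Content-Type header of their own.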
