What is the difference between UTF-8 and GBK / GB2312?
This article explains the difference between the UTF-8 and GBK/GB2312 encodings. It may be a useful reference for anyone who needs it; hopefully it helps.
UTF-8 (Unicode Transformation Format, 8-bit): a BOM is allowed but usually not included. It is a multi-byte encoding designed for international text: English letters take 8 bits (one byte) and Chinese characters take 24 bits (three bytes). UTF-8 covers the characters used by every country in the world, so it is an international encoding with strong versatility. UTF-8 text displays correctly in any browser that supports the UTF-8 character set; for example, Chinese in a UTF-8 page will display even on an English-language IE abroad, with no need to download IE's Chinese language pack.
GBK is a standard that extends the national standard GB2312 while remaining compatible with it. GBK encodes Chinese characters in double bytes (ordinary ASCII characters remain single-byte), and to mark a Chinese character the high bit of each of its bytes is set to 1. GBK covers all Chinese characters and is a national encoding. It is less versatile than UTF-8, but Chinese text stored as UTF-8 takes more space than the same text in GBK.
GBK, GB2312 and the like must be converted to UTF-8 by way of Unicode:
GBK, GB2312--Unicode--UTF8
UTF8--Unicode--GBK, GB2312
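The two conversion paths above can be sketched in Python, whose `str` type is Unicode, so `decode()` and `encode()` make the intermediate Unicode step explicit (a minimal illustration, not an excerpt from any particular library):

```python
# GBK --> Unicode --> UTF-8
gbk_bytes = "汉字".encode("gbk")      # b'\xba\xba\xd7\xd6'
text = gbk_bytes.decode("gbk")        # GBK --> Unicode (a Python str)
utf8_bytes = text.encode("utf-8")     # Unicode --> UTF-8
assert utf8_bytes == b'\xe6\xb1\x89\xe5\xad\x97'

# And back: UTF-8 --> Unicode --> GBK
assert utf8_bytes.decode("utf-8").encode("gbk") == gbk_bytes
```

There is no direct GBK-to-UTF-8 table; every conversion passes through the Unicode code point in the middle.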
pCSS5 sums it up simply:
1. GBK here usually means the GB2312 encoding, which supports only Simplified Chinese characters.
2. UTF usually means UTF-8, which supports Simplified Chinese, Traditional Chinese, English, Japanese, Korean and other languages (a much wider range of characters).
3. Both UTF-8 and GB2312 are commonly used in China; choose according to your own needs.
The specific details are as follows:
For a website or forum that contains mostly English characters, UTF-8 is recommended to save space. Note, however, that many forum plug-ins still support only GBK.
Detailed explanation of the difference between encodings
Simply put, Unicode, GBK and Big5 are code-value assignments (each maps characters to numbers), while UTF-8, UTF-16 and the like are ways of representing those values. The three assignments are mutually incompatible: the same Chinese character has completely different values in each. For example, the Unicode value of a given Chinese character differs from its GBK value; suppose the Unicode value is a040 and the GBK value is b030. UTF-8 is a representation of the code value, and it is defined only for Unicode values. So to convert GBK to UTF-8, you must first convert to Unicode, and then from Unicode to UTF-8.
For details, please see the article below.
Talk about Unicode encoding and briefly explain UCS, UTF, BMP, BOM and other terms
This is an interesting read written by programmers for programmers. The so-called fun means that you can easily understand some previously unclear concepts and improve your knowledge, which is similar to upgrading in an RPG game. The motivation for organizing this article is two questions:
Question 1:
Using Windows Notepad's "Save As", you can convert between the GBK, Unicode, Unicode big endian and UTF-8 encodings. The result is still just a .txt file, so how does Windows identify which encoding it uses?
I noticed long ago that txt files saved as Unicode, Unicode big endian or UTF-8 begin with a few extra bytes: FF FE (Unicode), FE FF (Unicode big endian), and EF BB BF (UTF-8). But what standard are these markers based on?
Question 2:
I recently came across a ConvertUTF.c on the Internet that implements conversion between the UTF-32, UTF-16 and UTF-8 encodings. I already knew about the Unicode (UCS-2), GBK and UTF-8 encodings, but this program confused me a little: I couldn't remember what the relationship between UTF-16 and UCS-2 is.
After checking the relevant information, I finally clarified these issues, and also learned some details about Unicode. Write an article and send it to friends who have similar questions. This article is written as easy to understand as possible, but readers are required to know what bytes are and what hexadecimal is.
0. Big endian and little endian
Big endian and little endian are different ways for the CPU to handle multi-byte numbers. For example, the Unicode encoding of the character "汉" is 6C49. So when writing to a file, should 6C be written in front or 49 be written in front? If 6C is written in front, it is big endian. If 49 is written in front, it is little endian.
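Python's `struct` module and the UTF-16 codecs can illustrate the two byte orders for U+6C49 (a quick sketch, not from the original article):

```python
import struct

# "汉" is U+6C49; byte order decides whether 6C or 49 is written first.
code = ord("汉")
assert code == 0x6C49

big = struct.pack(">H", code)     # big endian: high byte 6C first
little = struct.pack("<H", code)  # little endian: low byte 49 first
assert big == b'\x6c\x49'
assert little == b'\x49\x6c'

# The UTF-16 codecs expose exactly the same choice:
assert "汉".encode("utf-16-be") == big
assert "汉".encode("utf-16-le") == little
```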
The word "endian" comes from "Gulliver's Travels": the civil war in Lilliput was fought over whether eggs should be cracked from the big end (Big-Endian) or the little end (Little-Endian). It led to six rebellions; one emperor lost his life, and another lost his throne.
We generally translate endian as "byte order", and big endian and little endian are called "big end" and "little end".
1. Character encoding, internal code, and by the way, Chinese character encoding
Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII. To process Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.
GB2312 (1980) contains 7445 characters in total: 6763 Chinese characters and 682 other symbols. The internal code range of the Chinese character area is B0-F7 for the high byte and A1-FE for the low byte, occupying 72*94 = 6768 code points. Of these, the 5 code points D7FA-D7FE are vacant.
GB2312 supports too few Chinese characters. The 1995 Chinese character expansion specification GBK1.0 includes 21,886 symbols, which is divided into Chinese character area and graphic symbol area. The Chinese character area includes 21003 characters.
From ASCII and GB2312 to GBK, these encodings are backward compatible: the same character always has the same encoding in each scheme, and later standards support more characters. In these encodings English and Chinese can be processed uniformly; a Chinese-character byte is distinguished by the high bit of the high byte being nonzero. In programmers' terminology, GB2312 and GBK both belong to the double-byte character sets (DBCS).
GB18030, issued in 2000, is the official national standard that replaced GBK 1.0. It includes 27484 Chinese characters as well as Tibetan, Mongolian, Uyghur and other major ethnic-minority scripts. For Chinese characters, GB18030 adds the 6582 characters of CJK Extension A (Unicode 0x3400-0x4DB5) to the 20902 characters of GB13000.1, for a total of 27484.
CJK means China, Japan and Korea. In order to save code bits, Unicode uniformly encodes characters in the three languages of China, Japan and Korea. GB13000.1 is the Chinese version of ISO/IEC 10646-1, equivalent to Unicode 1.1.
The encoding of GB18030 adopts single-byte, double-byte and 4-byte schemes. Among them, single byte, double byte and GBK are fully compatible. The code bit of 4-byte encoding contains 6582 Chinese characters of CJK extension A. For example: the encoding of UCS 0x3400 in GB18030 should be 8139EF30, and the encoding of UCS 0x3401 in GB18030 should be 8139EF31.
Microsoft provides an upgrade package for GB18030, but it only supplies a new font covering the 6582 characters of CJK Extension A — NSimSun-18030 (新宋体-18030) — and does not change the internal code. The internal code of Windows is still GBK.
There are some details here:
The original text of GB2312 is still the location code. To get from the location code to the internal code, A0 must be added to the high byte and the low byte respectively.
For any character encoding, the order of coding units is specified by the encoding scheme and has nothing to do with endian. For example, the coding unit of GBK is byte, and two bytes are used to represent a Chinese character. The order of these two bytes is fixed and is not affected by CPU byte order. The encoding unit of UTF-16 is word (double-byte). The order between words is specified by the encoding scheme. Only the byte arrangement within the word will be affected by endian. UTF-16 will be introduced later.
In GB2312 the high bits of both bytes are 1, but only 128*128 = 16384 code points satisfy that condition. Therefore the high bit of the low byte in GBK and GB18030 is not necessarily 1. This does not affect parsing of a DBCS character stream: while reading, whenever a byte with its high bit set to 1 is encountered, that byte and the one following it can be taken together as a double-byte code, regardless of what the high bit of the low byte is.
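The parsing rule just described can be sketched as a small Python function (`split_gbk_stream` is an illustrative helper name, not a standard API):

```python
def split_gbk_stream(data):
    """Split a GBK byte stream into per-character byte groups.

    As described above: a byte with its high bit set starts a
    double-byte code, and the high bit of the second byte is ignored.
    """
    chars, i = [], 0
    while i < len(data):
        if data[i] & 0x80:              # high bit 1: double-byte character
            chars.append(data[i:i + 2])
            i += 2
        else:                            # high bit 0: single-byte (ASCII)
            chars.append(data[i:i + 1])
            i += 1
    return chars

stream = "GBK汉字abc".encode("gbk")
assert split_gbk_stream(stream) == [
    b'G', b'B', b'K', b'\xba\xba', b'\xd7\xd6', b'a', b'b', b'c'
]
```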
2. Unicode, UCS and UTF
As mentioned earlier, the encoding methods from ASCII, GB2312, GBK to GB18030 are backward compatible. Unicode is only compatible with ASCII (more precisely, compatible with ISO-8859-1) and is not compatible with GB code. For example, the Unicode encoding of the character "汉" is 6C49, while the GB code is BABA.
Unicode is also a character encoding method, but it is a scheme designed by international organizations to accommodate all of the world's languages and scripts. The scientific name of Unicode is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS can also be read as an abbreviation of "Unicode Character Set".
According to Wikipedia (http://zh.wikipedia.org/wiki/): historically, two organizations tried to design Unicode independently — the International Organization for Standardization (ISO) and a consortium of software manufacturers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.
Around 1991, both sides recognized that the world did not need two incompatible character sets, so they began to merge their work and cooperate on a single code list. Starting from Unicode 2.0, the Unicode project uses the same character repertoire and code points as ISO 10646-1.
Currently both projects still exist and publish their respective standards independently. The latest version of the Unicode Consortium is Unicode 4.1.0 in 2005. ISO's latest standard is ISO 10646-3:2003.
UCS only stipulates how to encode characters; it does not specify how to transmit or store that encoding. For example, the UCS encoding of "汉" is 6C49. I could transmit and store this encoding as 4 ASCII digits, or I could use UTF-8: the 3 consecutive bytes E6 B1 89. The key is that both parties to the communication must agree. UTF-8, UTF-7 and UTF-16 are all widely accepted schemes. A particular benefit of UTF-8 is that it is fully compatible with ASCII. UTF is the abbreviation of "UCS Transformation Format".
IETF's RFC 2781 and RFC 3629 describe the UTF-16 and UTF-8 encodings clearly, crisply and rigorously, in the usual RFC style. (I can never remember that IETF stands for Internet Engineering Task Force, but the RFCs it maintains are the basis of all specifications on the Internet.)
2.1. Internal code and code page
The Windows kernel already supports the Unicode character set, so the kernel can support all the world's languages. However, since a great many existing programs and documents use a particular language's encoding, such as GBK, it is impossible for Windows to drop support for the existing encodings and switch everything to Unicode.
Windows uses code pages to adapt to various countries and regions. The code page can be understood as the internal code mentioned earlier. The code page corresponding to GBK is CP936.
Microsoft also defines code page for GB18030: CP54936. However, since GB18030 has some 4-byte encodings, and the Windows code page only supports single-byte and double-byte encodings, this code page cannot really be used.
3. UCS-2, UCS-4, BMP
UCS has two formats: UCS-2 and UCS-4. As the name suggests, UCS-2 is encoded with two bytes, and UCS-4 is encoded with 4 bytes (actually only 31 bits are used, the highest bit must be 0). Let's do some simple math games:
UCS-2 has 2^16=65536 code points, and UCS-4 has 2^31=2147483648 code points.
UCS-4 is divided into 2^7 = 128 groups according to the highest byte (whose high bit is 0). Each group is divided into 256 planes by the next byte, each plane into 256 rows by the third byte, and each row contains 256 cells. Cells in the same row differ only in the final byte.
Plane 0 of group 0 is called Basic Multilingual Plane, or BMP. Or in UCS-4, the code bits with the upper two bytes being 0 are called BMP.
Remove the two leading zero bytes from the BMP of UCS-4 and you get UCS-2; prepend two zero bytes to the two bytes of UCS-2 and you get the BMP of UCS-4. The current UCS-4 specification allocates no characters outside the BMP.
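For a BMP character, this zero-byte relationship can be seen directly in Python, using the UTF-16 and UTF-32 big-endian codecs as stand-ins for UCS-2 and UCS-4 (valid here because "汉" lies inside the BMP):

```python
ucs2 = "汉".encode("utf-16-be")   # UCS-2 / UTF-16 view of U+6C49
ucs4 = "汉".encode("utf-32-be")   # UCS-4 / UTF-32 view

assert ucs2 == b'\x6c\x49'
assert ucs4 == b'\x00\x00' + ucs2  # two zero bytes, then the UCS-2 bytes
```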
4. UTF encoding
UTF-8 encodes UCS in 8-bit units. The encoding method from UCS-2 to UTF-8 is as follows:
UCS-2 encoding (hexadecimal) UTF-8 byte stream (binary)
0000 - 007F 0xxxxxxx
0080 - 07FF 110xxxxx 10xxxxxx
0800 - FFFF 1110xxxx 10xxxxxx 10xxxxxx
For example, the Unicode encoding of "汉" is 6C49, which falls in the range 0800-FFFF, so the 3-byte template applies: 1110xxxx 10xxxxxx 10xxxxxx. Writing 6C49 in binary gives 0110 110001 001001; substituting these bits for the x's in the template in turn yields 11100110 10110001 10001001, which is E6 B1 89.
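The three templates can be turned into a few lines of Python; `ucs2_to_utf8` is an illustrative name for this sketch, and the result can be checked against the built-in codec:

```python
def ucs2_to_utf8(code):
    """Encode a UCS-2 code point using the three templates above."""
    if code <= 0x7F:                         # 0000-007F: 0xxxxxxx
        return bytes([code])
    if code <= 0x7FF:                        # 0080-07FF: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # 0800-FFFF: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (code >> 12),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])

# 0x6C49 ("汉") lands in the three-byte template and yields E6 B1 89:
assert ucs2_to_utf8(0x6C49) == b'\xe6\xb1\x89' == "汉".encode("utf-8")
```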
Readers can use Notepad to test whether our coding is correct. It should be noted that UltraEdit will automatically convert to UTF-16 when opening a UTF-8 encoded text file, which may cause confusion. You can turn this option off in settings. A better tool is Hex Workshop.
UTF-16 encodes UCS in 16-bit units. For UCS codes below 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to the UCS code; for codes of 0x10000 and above, an algorithm is defined. However, since everything actually in use — UCS-2, and the BMP of UCS-4 — lies below 0x10000, UTF-16 and UCS-2 can for now be considered basically the same. The difference is that UCS-2 is only an encoding scheme, while UTF-16 is used for actual transmission, so the question of byte order has to be considered.
5. UTF byte order and BOM
UTF-8 uses the byte as its encoding unit, so it has no byte-order problem. UTF-16 uses two bytes as its encoding unit, so before interpreting a UTF-16 text you must know the byte order of each unit. For example, the Unicode encoding of "奎" is 594E, and that of "乙" is 4E59. If we receive the UTF-16 byte stream 59 4E, is it "奎" or "乙"?
The Unicode specification's recommended way of marking byte order is the BOM. BOM here is not the "Bill Of Materials" BOM, but the Byte Order Mark. The BOM is a clever little idea:
UCS defines a character called ZERO WIDTH NO-BREAK SPACE, whose encoding is FEFF. FFFE, on the other hand, is not a valid UCS character, so it should never appear in actual transmission. The UCS specification recommends transmitting the character ZERO WIDTH NO-BREAK SPACE before transmitting a byte stream.
In this way, if the receiver receives FEFF, it means that the byte stream is Big-Endian; if it receives FFFE, it means that the byte stream is Little-Endian. Therefore the character "ZERO WIDTH NO-BREAK SPACE" is also called BOM.
UTF-8 needs no BOM to indicate byte order, but a BOM can be used to indicate the encoding itself. The UTF-8 encoding of ZERO WIDTH NO-BREAK SPACE is EF BB BF (readers can verify this with the encoding method introduced earlier). So if a receiver gets a byte stream starting with EF BB BF, it knows the stream is UTF-8 encoded.
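The BOM-sniffing logic described above can be sketched as a small function, based on the marker bytes listed earlier (FF FE, FE FF, EF BB BF); `sniff_bom` is an illustrative name, not a standard API:

```python
def sniff_bom(data):
    """Guess a text file's encoding from its leading BOM bytes."""
    if data.startswith(b'\xef\xbb\xbf'):
        return "utf-8-sig"        # UTF-8 with BOM
    if data.startswith(b'\xfe\xff'):
        return "utf-16-be"        # big endian
    if data.startswith(b'\xff\xfe'):
        return "utf-16-le"        # little endian
    return "unknown"              # no BOM: fall back to a default (e.g. ANSI/GBK)

assert sniff_bom("汉".encode("utf-8-sig")) == "utf-8-sig"
assert sniff_bom(b'\xfe\xff\x6c\x49') == "utf-16-be"
assert sniff_bom(b'\xff\xfe\x49\x6c') == "utf-16-le"
assert sniff_bom(b'\xba\xba') == "unknown"
```

Note that the UTF-8 test must come first: EF BB BF does not collide with the UTF-16 marks, but checking the longer prefix first is the safe habit.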
Windows uses BOM to mark the encoding method of text files.
6. Further reference materials
The main reference for this article is "Short overview of ISO-IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).
I also found two pieces of information that looked good, but because I had already found the answers to my initial questions, I didn’t read them:
"Understanding Unicode A general introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a)
"Character set encoding basics Understanding character set encodings and legacy encodings" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter03)
I have written packages for converting between UTF-8, UCS-2 and GBK, including versions that use the Windows API and versions that do not. If I have time later, I will tidy them up and put them on my personal homepage.
I started writing this article only after I had thought all the issues through, expecting to finish it quickly. Unexpectedly, weighing the wording and checking the details took a long time: I wrote from 1:30 in the afternoon until 9:00 at night. I hope some readers benefit from it.
Appendix 1: On the location code, GB2312, internal code and code page
Some friends still have questions about this sentence in the article: "The original text of GB2312 is still the location code. To get from the location code to the internal code, A0 must be added to the high byte and the low byte respectively."
Let me explain in detail:
"The original text of GB2312" refers to the national standard issued in 1980, "Basic Set of Chinese Coded Character Sets for Information Interchange of the People's Republic of China, GB2312-80". This standard encodes Chinese characters and Chinese symbols with two numbers: the first is called the "area" (区) and the second the "position" (位), hence the name location code (区位码). Areas 1-9 hold Chinese symbols, areas 16-55 first-level Chinese characters, and areas 56-87 second-level Chinese characters. Windows still ships a location-code input method: entering 1601, for example, produces "啊". (This input method also recognizes hexadecimal GB2312 codes alongside decimal location codes, so entering B0A1 produces "啊" as well.)
Internal code refers to the character encoding used inside the operating system. The internal codes of early operating systems were language-dependent. Today's Windows supports Unicode internally and uses code pages to adapt to the various languages, so the concept of "internal code" has become rather vague. Microsoft generally calls the encoding specified by the default code page the internal code.
There is no official definition of the term internal code, and code page is just the name of the company Microsoft. As programmers, as long as we know what they are, there is no need to examine these terms too much.
The so-called code page is the character encoding for a particular language. For example, the code page of GBK is CP936, the code page of Big5 is CP950, and the code page of GB2312 is CP20936.
Windows has the concept of a default code page, that is, what encoding is used by default to interpret characters. For example, Windows Notepad opens a text file, and the content inside is a byte stream: BA, BA, D7, D6. How should Windows interpret it?
Should it be interpreted as Unicode, GBK, Big5, or ISO 8859-1? If interpreted as GBK, you get the word "汉字" ("Chinese characters"). Interpreted under other encodings, the corresponding characters may not exist, or the wrong characters may be found. "Wrong" here means inconsistent with the text author's intention, producing garbled characters (mojibake).
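Python can demonstrate the same four bytes read under two different code pages (a quick sketch of the situation above):

```python
data = b'\xba\xba\xd7\xd6'                # the byte stream Notepad sees
assert data.decode("gbk") == "汉字"       # interpreted as GBK: the intended text
assert data.decode("latin-1") == "ºº×Ö"   # interpreted as ISO 8859-1: mojibake
```

The bytes themselves carry no label; only the code page chosen by the reader decides what they mean.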
The answer is that Windows interprets the byte stream in the text file according to the current default code page. The default code page can be set through the Regional Options in Control Panel. There is an ANSI item in Notepad's Save As, which actually saves according to the encoding method of the default code page.
The internal code of Windows is Unicode, which can technically support multiple code pages at the same time. As long as the file can explain what encoding it uses and the user has installed the corresponding code page, Windows can display it correctly. For example, charset can be specified in an HTML file.
Some HTML authors, especially English-speaking ones, assume everyone in the world uses English and do not specify a charset in the file. If such a file uses characters in the range 0x80-0xFF, a Chinese Windows system interprets them as GBK by default and garbled characters appear. The fix is simply to add a charset declaration to the HTML file, for example: <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">.
If the code page used by the original author is compatible with ISO8859-1, there will be no garbled characters.
Finally, back to the location code. The location code of "啊" is 1601, which written in hexadecimal is 0x10, 0x01. This conflicts with the ASCII codes widely used by computers. To stay compatible with ASCII in the range 00-7F, A0 is added to both the high and low bytes of the location code; the code of "啊" thus becomes B0A1. The code with the two A0s added is also called the GB2312 code, although the original text of GB2312 never mentions this at all.
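The arithmetic in that last paragraph can be verified directly (a minimal check using Python's `gb2312` codec):

```python
# "啊" sits at location 16-01. Adding A0 to the area and to the position
# yields its GB2312 internal code B0A1, as the appendix describes:
area, pos = 16, 1
internal = bytes([area + 0xA0, pos + 0xA0])
assert internal == b'\xb0\xa1'
assert internal.decode("gb2312") == "啊"
```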
The above is a complete introduction to the differences between UTF-8 and GBK/GB2312. For more HTML tutorials, please follow the PHP Chinese website.