UTF-8 encoded Chinese characters usually occupy 3 bytes. In UTF-8, one Chinese character takes three bytes, and one Chinese punctuation mark takes three bytes; in what is often called "Unicode" encoding (that is, UTF-16), one Chinese character (including traditional Chinese) takes two bytes. UTF-8 uses 1 to 4 bytes to encode each character. A US-ASCII character needs only 1 byte. Latin letters with diacritical marks, and Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and other letters, require 2 bytes.
The operating environment of this tutorial: Windows 7 system, Dell G3 computer.
How many bytes do UTF-8 encoded Chinese characters occupy?
In UTF-8 encoding: one Chinese character is equal to three bytes, and Chinese punctuation occupies three bytes.
One English character is equal to one byte, and English punctuation occupies one byte.
In "Unicode" encoding (UTF-16): one English character takes two bytes, and one Chinese character (including traditional Chinese) takes two bytes. Chinese punctuation takes two bytes, and English punctuation takes two bytes. Characters outside the Basic Multilingual Plane take four bytes.
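The byte counts above can be checked directly. The following is a minimal Python sketch, not part of the original tutorial; the helper name `byte_counts` and the sample characters are chosen here for illustration, and "Unicode encoding" is assumed to mean UTF-16.

```python
# Compare how many bytes the same character needs in UTF-8 and in UTF-16.
# "utf-16-le" is used so that Python does not prepend the 2-byte byte-order mark.

def byte_counts(ch):
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# English letter, English comma, Chinese character, Chinese (fullwidth) comma,
# and a traditional Chinese character.
samples = ["A", ",", "中", "，", "漢"]

for ch in samples:
    utf8_len, utf16_len = byte_counts(ch)
    print(f"{ch!r}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

English characters come out as 1 byte in UTF-8 and 2 bytes in UTF-16, while the Chinese characters and punctuation come out as 3 bytes in UTF-8 and 2 bytes in UTF-16.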
UTF-8 uses 1~4 bytes to encode each character:
1. A US-ASCII character needs only 1 byte (Unicode range U+0000 to U+007F).
2. Latin letters with diacritical marks, and Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and other letters, require 2 bytes (Unicode range U+0080 to U+07FF).
3. The remaining commonly used characters (including Chinese, Japanese and Korean characters, Southeast Asian characters, some Middle Eastern characters, etc.) use 3 bytes (Unicode range U+0800 to U+FFFF).
4. Rarely used characters beyond that range (U+10000 to U+10FFFF), such as rare CJK ideographs, use 4 bytes.
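The four cases can be expressed as a simple lookup on the code point. This is a small Python sketch under that assumption; `utf8_length` is a hypothetical helper written for this example, and its result is compared against Python's own encoder.

```python
def utf8_length(ch):
    """Return the number of bytes UTF-8 needs for a single character.

    Surrogate code points (U+D800 to U+DFFF) are not valid in UTF-8
    and are not handled here.
    """
    cp = ord(ch)
    if cp <= 0x7F:       # US-ASCII
        return 1
    if cp <= 0x7FF:      # accented Latin, Greek, Cyrillic, Hebrew, Arabic, ...
        return 2
    if cp <= 0xFFFF:     # most commonly used characters, including most CJK
        return 3
    return 4             # everything above U+FFFF

# ASCII, accented Latin, Greek, Hebrew, Chinese, Korean, and a rare CJK ideograph.
examples = ["A", "é", "Ω", "א", "中", "한", "𠀀"]

for ch in examples:
    # Cross-check the rule against the real encoder.
    assert utf8_length(ch) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_length(ch)} byte(s)")
```

Note that `𠀀` (U+20000) is a Chinese character that needs 4 bytes, so "3 bytes per Chinese character" holds for commonly used characters rather than for every one.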
Extended knowledge:
UTF-8 (8-bit Universal Character Set/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so existing software that processes ASCII text can continue to be used with little or no modification. As a result, it has gradually become the preferred encoding for email, web pages, and other applications that store or transmit text.
Character set:
UTF-8 encoding rules: a single-byte sequence has a value in the range 0x00-0x7F. Longer sequences are laid out as follows, according to their length:
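UTF-8 has four encoded forms, named by length: UTF8-1, UTF8-2, UTF8-3 and UTF8-4. The valid byte ranges for each form are:

| Form | First byte | Second byte | Third byte | Fourth byte |
| --- | --- | --- | --- | --- |
| UTF8-1 | 0x00-0x7F | | | |
| UTF8-2 | 0xC2-0xDF | 0x80-0xBF | | |
| UTF8-3 | 0xE0 | 0xA0-0xBF | 0x80-0xBF | |
| UTF8-3 | 0xE1-0xEC | 0x80-0xBF | 0x80-0xBF | |
| UTF8-3 | 0xED | 0x80-0x9F | 0x80-0xBF | |
| UTF8-3 | 0xEE-0xEF | 0x80-0xBF | 0x80-0xBF | |
| UTF8-4 | 0xF0 | 0x90-0xBF | 0x80-0xBF | 0x80-0xBF |
| UTF8-4 | 0xF1-0xF3 | 0x80-0xBF | 0x80-0xBF | 0x80-0xBF |
| UTF8-4 | 0xF4 | 0x80-0x8F | 0x80-0xBF | 0x80-0xBF |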
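The byte ranges listed above are enough to check whether a byte string is well-formed UTF-8. The following is a rough Python sketch of such a check, written for illustration; `WELL_FORMED` and `is_well_formed_utf8` are hypothetical names, and in practice `bytes.decode("utf-8")` already performs this validation.

```python
# Each entry is one allowed sequence shape: a (low, high) range for the
# first byte, followed by a (low, high) range for each following byte.
WELL_FORMED = [
    ((0x00, 0x7F),),                                            # UTF8-1
    ((0xC2, 0xDF), (0x80, 0xBF)),                               # UTF8-2
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),                 # UTF8-3
    ((0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)),
    ((0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),   # UTF8-4
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]

def is_well_formed_utf8(data):
    """Return True if every byte of `data` belongs to an allowed sequence."""
    i = 0
    while i < len(data):
        for pattern in WELL_FORMED:
            chunk = data[i:i + len(pattern)]
            if len(chunk) == len(pattern) and all(
                low <= byte <= high for byte, (low, high) in zip(chunk, pattern)
            ):
                i += len(pattern)   # consume the whole sequence
                break
        else:
            return False            # no allowed shape matches at position i
    return True

print(is_well_formed_utf8("中文，abc".encode("utf-8")))  # valid text
print(is_well_formed_utf8(b"\xc0\x80"))                  # overlong form of U+0000
print(is_well_formed_utf8(b"\xe4\xb8"))                  # truncated 3-byte sequence
```

The restricted second-byte ranges after 0xE0, 0xED, 0xF0 and 0xF4 are what exclude overlong encodings, surrogate code points, and values above U+10FFFF.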