How many bytes do utf8 encoded Chinese characters occupy?

青灯夜游 (Original)
2023-02-21 11:40:52

UTF-8 encoded Chinese characters generally occupy 3 bytes. In UTF-8, a commonly used Chinese character (simplified or traditional) takes three bytes, and a full-width Chinese punctuation mark also takes three bytes; in UTF-16, which is often loosely called "Unicode encoding", the same characters take two bytes. UTF-8 uses 1~4 bytes to encode each character. A US-ASCII character needs only 1 byte. Latin letters with diacritical marks, as well as Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and other letters, require 2 bytes.

The operating environment of this tutorial: Windows 7 system, Dell G3 computer.

How many bytes do utf-8 encoded Chinese characters occupy?

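In UTF-8 encoding: one commonly used Chinese character takes three bytes, and full-width Chinese punctuation also takes three bytes. Rare characters outside the Basic Multilingual Plane (for example, those in CJK Extension B) take four bytes.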

One English character is equal to one byte, and English punctuation occupies one byte.

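"Unicode" encoding (the name many Windows tools use for UTF-16): one English character takes two bytes, and one commonly used Chinese character (including traditional Chinese) takes two bytes. Chinese punctuation takes two bytes, and English punctuation also takes two bytes. Characters outside the Basic Multilingual Plane take four bytes in UTF-16.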
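The byte counts above can be checked directly. The following is a minimal Python sketch, assuming that "Unicode encoding" here means UTF-16; the helper name `byte_lengths` and the sample characters are illustrative choices, not part of the original article.

```python
# Compare how many bytes one character takes in UTF-8 and in UTF-16.
# "utf-16-le" is used so that Python does not prepend a 2-byte byte-order mark.

def byte_lengths(ch: str) -> tuple:
    """Return (UTF-8 length, UTF-16 length) in bytes for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# English letter, English comma, Chinese character, full-width Chinese comma,
# and a rare CJK Extension B character (U+20000).
samples = ["A", ",", "\u4e2d", "\uff0c", "\U00020000"]

for ch in samples:
    u8, u16 = byte_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

The last sample shows why "three bytes per Chinese character" is a rule for common characters rather than for all of them.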

UTF-8 uses 1~4 bytes to encode each character:

1. A US-ASCII character needs only 1 byte (Unicode range U+0000~U+007F).

2. Latin letters with diacritical marks, as well as Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and other letters, require 2-byte encoding (Unicode range U+0080~U+07FF).

3. The remaining characters of the Basic Multilingual Plane, which cover most commonly used characters in other scripts (including Chinese, Japanese and Korean characters, Southeast Asian scripts, Middle Eastern scripts, etc.), use 3-byte encoding (Unicode range U+0800~U+FFFF).

4. Rarely used characters in the supplementary planes use 4-byte encoding (Unicode range U+10000~U+10FFFF).
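The four cases above amount to a lookup on the code point. Below is a small Python sketch of that lookup, checked against Python's built-in encoder; the function name `utf8_length` is a hypothetical helper introduced only for illustration.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and cannot be encoded in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0x7F:
        return 1  # US-ASCII
    if code_point <= 0x7FF:
        return 2  # Latin with diacritics, Greek, Cyrillic, Hebrew, Arabic, ...
    if code_point <= 0xFFFF:
        return 3  # rest of the Basic Multilingual Plane, including common CJK
    return 4      # supplementary planes

# Cross-check the ranges against the real encoder for a few sample characters.
for ch in ["A", "\u00e9", "\u03a9", "\u4e2d", "\u3002", "\U00020000", "\U0001F600"]:
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_length(ord(ch))} byte(s)")
```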

Extended knowledge:

UTF-8 (8-bit Universal Character Set/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so existing software that processes ASCII text can continue to be used with few or no modifications. As a result, it has gradually become the preferred encoding for email, web pages, and other applications that store or transmit text.
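The ASCII compatibility described above can be observed in a few lines of Python. This is only an illustrative sketch; the sample strings and variable names are arbitrary.

```python
# Pure ASCII text produces exactly the same bytes in ASCII and in UTF-8.
text = "UTF-8 test"
same_bytes = text.encode("ascii") == text.encode("utf-8")

# In mixed text, every byte that belongs to a non-ASCII character has its
# high bit set (value >= 0x80), so it can never be mistaken for an ASCII byte.
mixed = "abc\u4e2d\u6587".encode("utf-8")  # "abc" followed by two Chinese characters
high_bit = [b >= 0x80 for b in mixed]

print(same_bytes)
print(len(mixed), high_bit)
```

Here the three ASCII letters contribute one byte each and the two Chinese characters three bytes each, for nine bytes in total.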

Character set:

UTF-8 encoding rules: a single-byte sequence has a value in 0x00-0x7F. Longer sequences are constrained byte by byte, depending on their length, as listed in the table below.

UTF-8 defines four sequence forms by length, named UTF8-1, UTF8-2, UTF8-3 and UTF8-4 (the names used in RFC 3629). Among them:

UTF8, hexadecimal encoding table

Encoding   Byte 1        Byte 2        Byte 3        Byte 4
UTF8-1     0x00-0x7F
UTF8-2     0xC2-0xDF     0x80-0xBF
UTF8-3     0xE0          0xA0-0xBF     0x80-0xBF
UTF8-3     0xE1-0xEC     0x80-0xBF     0x80-0xBF
UTF8-3     0xED          0x80-0x9F     0x80-0xBF
UTF8-3     0xEE-0xEF     0x80-0xBF     0x80-0xBF
UTF8-4     0xF0          0x90-0xBF     0x80-0xBF     0x80-0xBF
UTF8-4     0xF1-0xF3     0x80-0xBF     0x80-0xBF     0x80-0xBF
UTF8-4     0xF4          0x80-0x8F     0x80-0xBF     0x80-0xBF

Note: Each encoding may have several byte ranges. Each row lists one of them, with one column per byte. For example, the first range of UTF8-3 requires the first byte to be 0xE0, the second byte to be in 0xA0-0xBF, and the third byte to be in 0x80-0xBF.
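The hexadecimal encoding table can be turned into a simple validity check. The sketch below transcribes the byte ranges as given in RFC 3629 and tests a byte string against them; `UTF8_ROWS` and `is_valid_utf8` are hypothetical names, and in practice calling `bytes.decode("utf-8")` does the same job.

```python
# One entry per table row: an inclusive (low, high) range for each byte.
UTF8_ROWS = [
    ((0x00, 0x7F),),                                           # UTF8-1
    ((0xC2, 0xDF), (0x80, 0xBF)),                              # UTF8-2
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),                # UTF8-3
    ((0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)),
    ((0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),  # UTF8-4
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a concatenation of sequences matching the table rows."""
    i = 0
    while i < len(data):
        for row in UTF8_ROWS:
            chunk = data[i:i + len(row)]
            if len(chunk) == len(row) and all(
                lo <= b <= hi for b, (lo, hi) in zip(chunk, row)
            ):
                i += len(row)  # consume one complete character
                break
        else:
            return False       # no row matched at this position
    return True

print(is_valid_utf8("\u4e2d\u6587".encode("utf-8")))  # two Chinese characters
print(is_valid_utf8(b"\xe0\x80\x80"))                 # overlong form, excluded by the 0xE0 row
```

The narrowed second-byte ranges in the 0xE0, 0xED, 0xF0 and 0xF4 rows are what exclude overlong forms, UTF-16 surrogates, and values above U+10FFFF.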

