What encoding is utf-8?-Common Problem-php.cn

What encoding is utf-8?

青灯夜游

Oct 21, 2020, 04:25 PM

UTF-8 is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and it encodes the ASCII characters as the same single bytes that ASCII uses, so software written to process ASCII text can continue to be used with little or no modification.

What encoding is utf-8?

UTF-8 (8-bit Universal Character Set/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so software written to process ASCII characters can continue to be used with little or no modification. As a result, it has gradually become the preferred encoding for email, web pages, and other applications that store or transmit text.

Basic features

UCS characters U+0000 to U+007F (ASCII) are encoded as the single bytes 0x00 to 0x7F (ASCII compatible). This means that a file containing only 7-bit ASCII characters is byte-for-byte identical in ASCII and UTF-8.
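The ASCII compatibility described above can be checked directly. The following is a minimal sketch in Python; the sample string and variable names are chosen only for illustration.

```python
# Text limited to U+0000..U+007F encodes to the same bytes in ASCII and UTF-8.
text = "Hello, UTF-8!"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# The two encodings produce identical byte strings for pure ASCII text.
same_bytes = ascii_bytes == utf8_bytes

# Every byte is below 0x80, so the most significant bit is never set.
all_seven_bit = all(b < 0x80 for b in utf8_bytes)

print(same_bytes, all_seven_bit)
```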

All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has its most significant bit set. Therefore, an ASCII byte (0x00-0x7F) can never appear as part of another character. The first byte of a multibyte sequence indicates how many bytes the character occupies. In the original design it lies in the range 0xC0 to 0xFD; in UTF-8 as currently standardized (RFC 3629), which stops at 4 bytes, it lies in the range 0xC2 to 0xF4. The remaining bytes of the sequence are in the range 0x80 to 0xBF. This makes resynchronization easy, and it makes the encoding stateless and robust against missing bytes.
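Because continuation bytes always have the form 10xxxxxx (0x80 to 0xBF), a reader that lands in the middle of a character can skip forward to the next character boundary. The sketch below illustrates this idea; `is_continuation` and `next_boundary` are hypothetical helper names, not part of any library.

```python
def is_continuation(byte: int) -> bool:
    """Return True for bytes of the form 10xxxxxx (0x80..0xBF)."""
    return (byte & 0xC0) == 0x80


def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a continuation byte."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "aé中".encode("utf-8")  # bytes: 61 | C3 A9 | E4 B8 AD

# Simulate a lost byte: the stream now begins in the middle of "é".
damaged = data[2:]  # bytes: A9 | E4 B8 AD

# Skip the orphaned continuation byte and decode from the next boundary.
resync = next_boundary(damaged, 0)
recovered = damaged[resync:].decode("utf-8")

print(resync, recovered)
```

Only the damaged character is lost; everything after the next boundary decodes normally.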

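In the original design, a UTF-8 encoded character could theoretically be up to 6 bytes long. The current standard (RFC 3629) limits the encoding to 4 bytes, which covers all of Unicode up to U+10FFFF. Characters in the 16-bit BMP need at most 3 bytes. The sorting order of big-endian UCS-4 byte strings is preserved, so comparing UTF-8 strings byte by byte gives the same order as comparing them by code point. Bytes 0xFE and 0xFF are never used in UTF-8 encoding.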
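Two of these properties can be spot-checked on a handful of characters. This is a small sketch over an arbitrary sample, not a proof; the sample list is an assumption made for illustration.

```python
# Characters taking 1, 2, 3, and 4 bytes in UTF-8, in no particular order.
samples = ["z", "é", "中", "a", "\U0001F600", "Ω"]

# Sort once by code point and once by the raw UTF-8 bytes.
by_code_point = sorted(samples, key=ord)
by_utf8_bytes = sorted(samples, key=lambda ch: ch.encode("utf-8"))

# Bytewise order of the encodings matches code point order.
same_order = by_code_point == by_utf8_bytes

# None of the encoded bytes in this sample is 0xFE or 0xFF.
no_fe_ff = all(
    b not in (0xFE, 0xFF) for ch in samples for b in ch.encode("utf-8")
)

print(same_order, no_fe_ff)
```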

Number of encoding bytes

UTF-8 uses 1 to 4 bytes to encode each character:

·A US-ASCII character requires only 1 byte (Unicode range U+0000 to U+007F).

·Latin letters with diacritical marks, as well as Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, and Thaana letters, require 2 bytes (Unicode range U+0080 to U+07FF).

·The remaining characters of the Basic Multilingual Plane (Unicode range U+0800 to U+FFFF), which include most commonly used characters such as Chinese, Japanese, and Korean characters, Southeast Asian scripts, and Middle Eastern scripts, require 3 bytes.

·Characters outside the Basic Multilingual Plane (Unicode range U+10000 to U+10FFFF), which are mostly rarely used, require 4 bytes.
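The ranges in the list above map directly to a byte count. The sketch below assumes the 4-byte limit of current UTF-8; `utf8_length` is a hypothetical helper written for this example, and it does not treat surrogate code points specially.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point <= 0x7F:
        return 1  # ASCII
    if code_point <= 0x7FF:
        return 2  # e.g. Latin with diacritics, Greek, Cyrillic
    if code_point <= 0xFFFF:
        return 3  # rest of the Basic Multilingual Plane
    return 4      # supplementary planes


samples = ["A", "é", "中", "\U0001F600"]
lengths = [utf8_length(ord(ch)) for ch in samples]

# Cross-check against Python's built-in encoder.
matches_builtin = all(
    utf8_length(ord(ch)) == len(ch.encode("utf-8")) for ch in samples
)

print(lengths, matches_builtin)
```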

UTF-8 encoding rules:

If a character is encoded in a single byte, the highest bit of that byte is 0. If it is encoded in multiple bytes, the number of consecutive 1 bits at the start of the first byte, counting from the highest bit, gives the number of bytes in the encoding, and each of the remaining bytes starts with 10.
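These rules are enough to write an encoder by hand. The following is a minimal sketch assuming the 4-byte limit of current UTF-8; `encode_code_point` is a hypothetical name, and in practice Python's built-in `str.encode("utf-8")` should be used instead.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by applying the bit rules."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


encoded = encode_code_point(ord("中"))  # U+4E2D
bits = " ".join(format(b, "08b") for b in encoded)
print(encoded.hex(), bits)
```

For U+4E2D the first byte starts with three 1 bits (a 3-byte sequence), and the two following bytes each start with 10.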
