How much do you know about character set encodings ASCII, Unicode and UTF-8? Character set encoding summary (collection)

寻∝梦

Aug 31, 2018 am 11:22 AM

ascii, unicode, utf-8

How much do you know about the character set encodings ASCII, Unicode and UTF-8? This article introduces the three encodings, the problems each one solves, how to convert between them, and a worked example.

1. ASCII code

We know that inside a computer, all information is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits can form 256 combinations. A group of eight bits is called a byte. In other words, one byte can represent 256 different states, from 00000000 to 11111111, and each state can correspond to one symbol, for 256 symbols in total.

In the 1960s, the United States formulated a set of character encodings that unified the relationship between English characters and binary bits. This was called ASCII and is still used today.

The ASCII code defines 128 character codes in total. For example, SPACE is 32 (binary 00100000), and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 control characters that cannot be printed: codes 0 to 31 and 127) occupy only the last 7 bits of a byte, and the first bit is always set to 0.
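As a quick check of the values above, the following Python sketch prints the code and the 8-bit binary form of a few ASCII characters. The helper name `ascii_bits` is made up for this illustration and is not part of the original article.

```python
def ascii_bits(ch: str) -> str:
    """Return the 8-bit binary string for a single ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    # Pad to 8 digits so the leading 0 bit is visible.
    return format(code, "08b")


for ch in (" ", "A"):
    print(repr(ch), ord(ch), ascii_bits(ch))
```

The leading bit of every result is 0, which is the unused highest bit mentioned in the text.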

[Table: ASCII control characters]

[Table: ASCII displayable characters]

2. Non-ASCII encoding

128 symbols are enough to encode English, but they are not enough for other languages. For example, French letters with accent marks above them cannot be represented in ASCII. As a result, some European countries decided to use the idle highest bit of the byte to encode new symbols. For example, the code for é in one French encoding is 130 (binary 10000010). In this way, the encoding systems used in these European countries can represent up to 256 symbols.

However, a new problem arises here. Different countries have different letters, so even if they all use a 256-symbol encoding, the letters they represent are different. For example, 130 represents é in French encoding, represents the letter Gimel (ג) in Hebrew encoding, and represents another symbol in Russian encoding. But no matter what, in all these encoding methods, the symbols represented by 0--127 are the same, and the only difference is the section 128--255.
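The effect can be reproduced with a short Python sketch. The article does not name the encodings it has in mind, so the code pages used here (`cp437`, `cp862` and `cp866`, the legacy DOS code pages for Western European, Hebrew and Russian text) are an assumption chosen because byte 130 matches the description in each of them.

```python
# The same single byte, interpreted under three legacy 8-bit code pages.
raw = bytes([130])  # 0x82, binary 10000010

# Assumption: these DOS code pages stand in for the "French", "Hebrew"
# and "Russian" encodings mentioned in the text.
decoded = {name: raw.decode(name) for name in ("cp437", "cp862", "cp866")}
for name, symbol in decoded.items():
    print(name, symbol)

# Bytes in the range 0-127 decode to the same symbol everywhere.
shared = {name: bytes([65]).decode(name) for name in decoded}
print(shared)
```

Only the upper half of the byte range differs between code pages, which is why plain English text survives a wrong guess while accented or non-Latin text turns into garbage.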

Asian languages use even more symbols; Chinese alone has as many as 100,000 characters. One byte can only represent 256 symbols, which is definitely not enough, so multiple bytes must be used to express one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes to represent a Chinese character, so in theory it can represent up to 256 x 256 = 65536 symbols.

Chinese encoding deserves an article of its own and is not covered in this note. It is only pointed out here that, although the GB family of Chinese encodings also uses multiple bytes to represent a symbol, it has nothing to do with the Unicode and UTF-8 described later.

3. Unicode

As mentioned in the previous section, there are many encoding methods in the world, and the same binary number can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; if you interpret it with the wrong encoding, garbled characters appear. Why are emails often garbled? Because the sender and the recipient use different encodings.

Now imagine an encoding that includes every symbol in the world and gives each symbol a unique code. The garbled text problem would then disappear. This is Unicode: as its name suggests, a single encoding for all symbols.

Unicode is of course a very large collection, with room for more than 1 million symbols. Each symbol has a different code. For example, U+0639 represents the Arabic letter Ain, U+0041 represents the English capital letter A, and U+4E25 represents the Chinese character 严 (Yan). For the full symbol table, you can check unicode.org or a specialized Chinese character table.
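The mapping between symbols and code points can be inspected with Python's built-in `ord`, `chr` and the standard `unicodedata` module. This is only an illustrative sketch; the helper name `code_point_label` is made up here.

```python
import unicodedata


def code_point_label(ch: str) -> str:
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in ("\u0639", "A", "\u4e25"):
    # unicodedata.name returns the official Unicode character name.
    print(code_point_label(ch), ch, unicodedata.name(ch))

# chr goes the other way: from code point to character.
print(chr(0x4E25))
```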

4. Problems with Unicode

It should be noted that Unicode is just a symbol set. It only specifies the binary code of each symbol; it does not specify how that binary code should be stored.

For example, the Unicode code of the Chinese character 严 is the hexadecimal number 4E25, which is 15 digits long in binary (100111000100101). In other words, representing this symbol requires at least 2 bytes. Symbols with larger codes may need 3 or 4 bytes.
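The arithmetic can be verified in a few lines of Python. The variable names are illustrative only.

```python
code_point = 0x4E25

# Binary form without the "0b" prefix.
bits = format(code_point, "b")
print(bits, len(bits))

# Round the bit count up to whole bytes.
min_bytes = (code_point.bit_length() + 7) // 8
print(min_bytes)
```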

There are two serious problems here. The first is how to distinguish Unicode from ASCII: how does the computer know that three bytes represent one symbol rather than three separate symbols? The second is that one byte is already enough to represent an English letter. If Unicode stipulated that every symbol be stored in three or four bytes, each English letter would have to be preceded by two or three bytes of zeros. That would be a huge waste of storage, and text files would become two or three times larger, which is unacceptable.

These problems had two consequences: 1) Multiple ways of storing Unicode emerged, which means that many different binary formats can be used to represent Unicode. 2) Unicode could not be widely adopted for a long time, until the Internet arrived.

5. UTF-8

The spread of the Internet created a strong demand for a unified encoding method. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (characters are represented by two or four bytes) and UTF-32 (characters are represented by four bytes), but they are rarely used on the Internet. To repeat the key relationship: UTF-8 is one implementation of Unicode.

One of the biggest features of UTF-8 is that it is a variable-length encoding method. It can use 1~4 bytes to represent a symbol, and the byte length varies according to different symbols.
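To make the variable length concrete, here is a small Python sketch that measures how many bytes a few sample characters take in UTF-8, UTF-16 and UTF-32. The sample characters are chosen for illustration and are not from the original article, apart from 严.

```python
samples = ["A", "\u00e9", "\u4e25", "\U0001F600"]  # A, e-acute, 严, an emoji

# The "-le" codec variants are used so that no byte order mark is added.
lengths = {
    ch: (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )
    for ch in samples
}

for ch, (u8, u16, u32) in lengths.items():
    print(f"U+{ord(ch):04X}: utf-8={u8} utf-16={u16} utf-32={u32}")
```

UTF-8 grows from one to four bytes as the code point grows, UTF-16 uses two or four, and UTF-32 always uses four.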

The encoding rules of UTF-8 are very simple. There are only two:

1. For single-byte symbols, the first bit of the byte is set to 0, and the following 7 bits are the Unicode code of the symbol. So for English letters, UTF-8 encoding and ASCII encoding are the same.

2. For n-byte symbols (n > 1), the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of each following byte are always set to 10. All remaining bits hold the Unicode code of the symbol.

The following table summarizes the encoding rules. The letter x indicates the available encoding bits.

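Unicode code range (hexadecimal) | UTF-8 encoding (binary)
0000 0000 - 0000 007F            | 0xxxxxxx
0000 0080 - 0000 07FF            | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF            | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF            | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx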

According to the rules above, interpreting UTF-8 is very simple. If the first bit of a byte is 0, the byte is a character on its own; if the first bit is 1, the number of consecutive leading 1s indicates how many bytes the current character occupies.
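The interpretation rule can be sketched in Python as follows. This is a minimal illustration, not a validating decoder: it trusts the continuation bytes and does not reject overlong or otherwise malformed sequences. The function names are made up for this example.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a character occupies, judged by its first byte."""
    if first_byte & 0b10000000 == 0:
        return 1  # 0xxxxxxx
    if first_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if first_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if first_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    raise ValueError("continuation byte or invalid leading byte")


def split_characters(data: bytes) -> list:
    """Cut a UTF-8 byte string into one chunk per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


print(split_characters("A\u4e25".encode("utf-8")))
```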

Next, we take the Chinese character 严 as an example to demonstrate UTF-8 encoding.

The Unicode code of 严 is 4E25 (100111000100101). 4E25 falls in the third range of the encoding rules (0000 0800 - 0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary digit of 严, fill in the x positions from back to front, and pad the leftover positions with 0. This gives the UTF-8 encoding of 严 as 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.
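The same bit-filling steps can be written out with shifts and masks. The sketch below handles only the three-byte range and omits the check for the surrogate range D800-DFFF; the function name `encode_three_bytes` is made up for this example. Python's built-in encoder is used to confirm the result.

```python
def encode_three_bytes(code_point: int) -> bytes:
    """Encode a code point in 0800-FFFF as 1110xxxx 10xxxxxx 10xxxxxx."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("code point is outside the three-byte range")
    first = 0b11100000 | (code_point >> 12)               # top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0b111111)  # middle 6 bits
    third = 0b10000000 | (code_point & 0b111111)          # low 6 bits
    return bytes([first, second, third])


encoded = encode_three_bytes(0x4E25)
print(" ".join(format(b, "08b") for b in encoded))
print(encoded.hex().upper())
print(encoded == "\u4e25".encode("utf-8"))
```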

6. Conversion between Unicode and UTF-8

The example in the previous section shows that the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Converting between them can be done by a program.
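As one possible way to do the conversion in a program, the following Python sketch relies on the built-in `str.encode` and `bytes.decode` methods. The two function names are hypothetical helpers written for this example.

```python
def code_point_to_utf8_hex(code_point: int) -> str:
    """Unicode code point -> UTF-8 bytes, shown as uppercase hex."""
    return chr(code_point).encode("utf-8").hex().upper()


def utf8_hex_to_code_point(hex_string: str) -> int:
    """UTF-8 bytes given as hex -> Unicode code point of the single character."""
    text = bytes.fromhex(hex_string).decode("utf-8")
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    return ord(text)


print(code_point_to_utf8_hex(0x4E25))
print(format(utf8_hex_to_code_point("E4B8A5"), "04X"))
```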

On Windows, one of the simplest conversion methods is to use the built-in Notepad program, notepad.exe. After opening the file, choose the Save As command in the File menu. A dialog box pops up with an Encoding drop-down list at the bottom.

There are four options: ANSI, Unicode, Unicode big endian and UTF-8.

ANSI is the default encoding. For English files it is ASCII, and for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).
Unicode here refers to the UCS-2 encoding used by notepad.exe, which stores the Unicode code of each character directly in two bytes. This option uses the little endian format.
Unicode big endian is the counterpart of the previous option. The meaning of little endian and big endian is explained in the next section.
UTF-8 is the encoding method described in the previous section.

After selecting the "encoding method", click the "Save" button, and the encoding method of the file will be converted immediately.
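The same kind of conversion can be scripted. The sketch below is one possible approach in Python, not a description of what Notepad does internally: it decodes the bytes with the source encoding and re-encodes them with the target encoding. The function names and the file paths passed to `convert_file` are placeholders.

```python
from pathlib import Path


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Reinterpret bytes written in one encoding as bytes in another."""
    return data.decode(source).encode(target)


def convert_file(src: str, dst: str, source: str, target: str) -> None:
    """Read src in the source encoding and write dst in the target encoding."""
    Path(dst).write_bytes(transcode(Path(src).read_bytes(), source, target))


# GB2312 bytes for one Chinese character, converted to UTF-8.
print(transcode(b"\xd1\xcf", "gb2312", "utf-8").hex().upper())
```

The source encoding has to be known in advance; decoding with the wrong one either raises an error or silently produces garbled text.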

7. Little endian and Big endian

As mentioned in the previous section, the UCS-2 format can store Unicode codes directly (for code points that do not exceed 0xFFFF). Taking the Chinese character 严 as an example, its Unicode code is 4E25, which needs two bytes: one byte is 4E and the other is 25. Storing 4E first and 25 second is the big endian method; storing 25 first and 4E second is the little endian method.

These two odd names come from "Gulliver's Travels" by the British writer Swift. In the book, a civil war broke out in Lilliput over whether eggs should be cracked from the big end (Big-endian) or the little end (Little-endian). Because of this dispute, six rebellions broke out, one emperor lost his life, and another lost his throne.

Storing the high-order byte first is "big endian"; storing the low-order byte first is "little endian".
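The two byte orders are easy to compare with `int.to_bytes`. This is a small illustrative sketch; the variable names are arbitrary.

```python
code_point = 0x4E25

big = code_point.to_bytes(2, "big")        # high-order byte first
little = code_point.to_bytes(2, "little")  # low-order byte first

print(big.hex().upper(), little.hex().upper())

# Python's UTF-16 codecs produce the same bytes for this character.
print("\u4e25".encode("utf-16-be") == big)
print("\u4e25".encode("utf-16-le") == little)
```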

So naturally, a question will arise: How does the computer know which way a certain file is encoded?

The Unicode specification defines a character that can be placed at the very start of a file to indicate the byte order. This character is called "zero width no-break space" and is represented by FEFF. It occupies exactly two bytes, and FF is one greater than FE.

If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
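A check along these lines might look like the following Python sketch. It covers only the two-byte marks described above, using the constants from the standard `codecs` module; the function name `detect_utf16_byte_order` is made up for this example.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of two-byte Unicode text from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little endian"
    return "unknown"


print(detect_utf16_byte_order(bytes.fromhex("FEFF4E25")))
print(detect_utf16_byte_order(bytes.fromhex("FFFE254E")))
print(detect_utf16_byte_order(b"plain text"))
```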

8. Example

Here is an example.

Open the Notepad program notepad.exe, create a new text file whose content is the single character 严, and save it four times, in ANSI, Unicode, Unicode big endian and UTF-8 encoding.

Then, use the "hex function" in the text editing software UltraEdit to observe the internal encoding of the file.

ANSI: The file is two bytes, D1 CF, which is exactly the GB2312 encoding of 严. This also implies that GB2312 is stored in big endian order.
Unicode: The file is four bytes, FF FE 25 4E, where FF FE indicates little endian storage and the actual code is 4E25.
Unicode big endian: The file is four bytes, FE FF 4E 25, where FE FF indicates big endian storage.
UTF-8: The file is six bytes, EF BB BF E4 B8 A5. The first three bytes, EF BB BF, indicate that this is UTF-8 encoding, and the last three, E4 B8 A5, are the actual encoding of 严. Its storage order is the same as the encoding order.
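The four byte sequences can be reproduced without Notepad or a hex editor. The Python sketch below is an approximation of the four save options under stated assumptions: `gb2312` stands in for ANSI on Simplified Chinese Windows, the two Unicode options are built from a byte order mark plus the UTF-16 bytes, and `utf-8-sig` is Python's codec for UTF-8 with a leading EF BB BF.

```python
import codecs

ch = "\u4e25"  # the character 严

results = {
    "ANSI (GB2312)": ch.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),
}

for label, data in results.items():
    # Print each file's bytes as space-separated hex pairs.
    print(f"{label}: {' '.join(format(b, '02X') for b in data)}")
```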

9. Extended reading (extracurricular knowledge)

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (the most basic knowledge about character sets)

Talk about Unicode encoding

RFC 3629: UTF-8, a transformation format of ISO 10646 (the specification of how UTF-8 is implemented)
