The main differences between the encodings used in MySQL are as follows: ASCII stores each character's serial number in the coded character set directly as a numerical value in the computer; Latin1 is a single-byte extension of ASCII; UTF-8 is a variable-length character encoding for Unicode.
This article introduces some of the encodings available in MySQL; it does not cover every character set.
1. Introduction to character set
Character is the general term for all kinds of letters and symbols, including the scripts of various countries, punctuation marks, graphic symbols, numbers, and so on.
A character set is a collection of characters. There are many character sets, and each contains a different number of characters. Common character sets include ASCII, GB2312, BIG5, GB18030, and Unicode. For a computer to process text in these character sets accurately, the characters must be encoded so that the computer can recognize and store them.
Character encoding maps each character in a character set to an object in a specified set (for example a bit pattern, a sequence of natural numbers, or a sequence of bytes) so that text can be stored in a computer and transmitted over a communication network. A common example is encoding the Latin alphabet in ASCII, which numbers letters, digits, and other symbols and represents each number in 7 binary bits.
Collation refers to the comparison rules between characters in the same character set. Only once the collation is determined can we define which characters in a character set are equivalent and how characters are ordered relative to one another. A character set can have multiple collations. A MySQL collation name starts with the name of the character set it belongs to, has a language or country name (or general) in the middle, and ends with ci, cs, or bin. A collation ending in ci is case-insensitive, a collation ending in cs is case-sensitive, and a collation ending in bin compares characters by their binary encoded values. For example, utf8_general_ci is a case-insensitive collation for the utf8 character set.
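The effect of the ci and bin suffixes can be illustrated outside MySQL with a small Python sketch. This is only an analogy: it assumes that case-insensitive comparison can be approximated with str.casefold() and binary comparison with a comparison of the encoded bytes. The function names equal_ci and equal_bin are hypothetical, and MySQL's real collations apply more elaborate rules.

```python
def equal_ci(a: str, b: str) -> bool:
    """Rough stand-in for a *_ci collation: compare after case folding."""
    return a.casefold() == b.casefold()


def equal_bin(a: str, b: str, encoding: str = "utf-8") -> bool:
    """Rough stand-in for a *_bin collation: compare the encoded byte values."""
    return a.encode(encoding) == b.encode(encoding)


print(equal_ci("MySQL", "mysql"))   # True
print(equal_bin("MySQL", "mysql"))  # False
```

The two strings differ only in case, so the case-folded comparison treats them as equal while the byte-wise comparison does not.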
2. ASCII encoding
ASCII is both a coded character set and a character encoding: it stores a character's serial number in the coded character set directly in the computer as a numerical value.
For example, in ASCII the character A is at position 65 in the table, so its serial number is 65 and its encoded value is 0100 0001, which is decimal 65 written in binary.
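The example can be checked with a few lines of Python. This is just a quick illustration using the built-in ord, format, and str.encode; the variable names are arbitrary.

```python
code = ord("A")                # serial number of "A" in the ASCII table
bits = format(code, "08b")     # the same number written as 8 binary digits
encoded = "A".encode("ascii")  # the single byte that is actually stored

print(code)        # 65
print(bits)        # 01000001
print(encoded[0])  # 65
```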
3. Latin1 character set
The Latin1 character set extends the ASCII character set. It still uses one byte per character, but it also uses the high bit of the byte, which expands the range of the character set from 128 to 256 values.
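As a small illustration, the following Python sketch encodes one ASCII character and one non-ASCII character with Python's built-in latin-1 codec. The choice of "é" (U+00E9) as the sample character is an assumption made for the example.

```python
e_acute = "\u00e9"  # the character "é", outside ASCII but inside Latin1

ascii_byte = "A".encode("latin-1")      # ASCII characters keep their ASCII value
latin1_byte = e_acute.encode("latin-1")  # still a single byte

print(ascii_byte[0])                  # 65
print(latin1_byte[0])                 # 233
print(format(latin1_byte[0], "08b"))  # 11101001, the high bit is 1

try:
    e_acute.encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")
```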
4. UTF-8 encoding
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode (also known as the universal code). It was created by Ken Thompson and Rob Pike in 1992 and is now standardized as RFC 3629. The original design encoded a character in 1 to 6 bytes; RFC 3629 restricts UTF-8 to 1 to 4 bytes, which is enough for every Unicode code point up to U+10FFFF.
UTF-8 is a variable-length byte encoding. If a character is encoded in a single byte, the highest bit of that byte is 0. If it is encoded in multiple bytes, the number of consecutive 1 bits at the start of the first byte gives the number of bytes in the encoding, and each remaining byte starts with 10. The original scheme allows up to 6 bytes, although RFC 3629 permits only the 1-byte to 4-byte forms. The byte patterns are shown in the table:
1 byte   0xxxxxxx
2 bytes  110xxxxx 10xxxxxx
3 bytes  1110xxxx 10xxxxxx 10xxxxxx
4 bytes  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
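The rule behind the table, counting the leading 1 bits of the first byte, can be sketched in Python as follows. The function name sequence_length is hypothetical, and the sketch deliberately accepts the 5-byte and 6-byte patterns from the table even though a strict RFC 3629 decoder would reject them.

```python
def sequence_length(first_byte: int) -> int:
    """Return the total byte count announced by a UTF-8 lead byte.

    Follows the table above, including its 5- and 6-byte rows.
    Raises ValueError for a continuation byte (10xxxxxx) and for
    0xFE / 0xFF, which never start a sequence.
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    if first_byte & 0x80 == 0:
        return 1  # 0xxxxxxx: single-byte character

    # Count the consecutive 1 bits starting from the highest bit.
    ones = 0
    mask = 0x80
    while mask and first_byte & mask:
        ones += 1
        mask >>= 1

    if ones == 1 or ones > 6:
        raise ValueError("not a valid lead byte")
    return ones


print(sequence_length(0x41))  # 1
print(sequence_length(0xE4))  # 3
```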
Therefore, at most 31 bits in UTF-8 are available to carry a character's code, namely the bits marked x in the 6-byte row of the table above (the 4-byte form provides 21). Apart from the control bits (the leading 10 of each continuation byte, and so on), the x bits correspond one-to-one with the bits of the Unicode code point, in the same order.
When converting a Unicode code point to UTF-8, first drop the leading zero bits, then choose the shortest UTF-8 form that can hold the remaining bits. As a result, characters in the basic ASCII character set (Unicode is compatible with ASCII) need only one byte in UTF-8, with 7 significant bits.
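The conversion can be written out by hand as a minimal Python sketch. It assumes the 1-byte to 4-byte limit of RFC 3629 rather than the 6-byte original design, and the function name utf8_encode is hypothetical; in practice Python's built-in str.encode("utf-8") does the same job, and the loop at the end checks the sketch against it.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, using 1 to 4 bytes."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:      # at most 7 significant bits
        return bytes([code_point])
    if code_point < 0x800:     # at most 11 significant bits
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:   # at most 16 significant bits
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),  # at most 21 significant bits
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Sample characters needing 1, 2, 3, and 4 bytes respectively.
for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), "byte(s):", encoded.hex())
```

Each branch corresponds to one row of the byte-pattern table: the lead byte carries the highest bits of the code point, and every continuation byte carries the next 6 bits behind a 10 prefix.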