To let computers recognize text in languages such as Chinese, Japanese, English, and Greek, the commonly used characters and symbols are each assigned a code. This set of encoded characters is called a character set.
The character set determines how text is stored.
A character set plays the same role inside the computer that a language plays for people.
For example:
If I speak English, I need to store what I say as English text.
If I speak Chinese but store it using English characters, nobody can read or understand the result. This is what we call gibberish (garbled text).
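The example above can be reproduced in a few lines. This is a minimal sketch, assuming Python's built-in codecs; the sample text and the choice of `latin-1` as the "wrong" character set are illustrative assumptions, not something the original text specifies.

```python
text = "中文"  # Chinese text to store

# Storing it with an English-only character set fails outright.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Storing it as UTF-8 but reading it back with the wrong character set
# raises no error; it silently produces gibberish.
stored = text.encode("utf-8")
garbled = stored.decode("latin-1")

# Reading it back with the character set it was stored in recovers the text.
restored = stored.decode("utf-8")

print(ascii_ok)          # False
print(garbled == text)   # False
print(restored == text)  # True
```

The bytes on disk are the same in both reads; only the character set used to interpret them differs.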
There are dozens, even hundreds, of character sets, so we do not need to know every one of them, or even how a character set is turned into the characters people see.
We only need to understand a few common character sets:
Character set | Description | Byte length |
---|---|---|
ASCII | American Standard Code for Information Interchange | 1 byte |
GBK | Chinese Character Internal Code Extension Specification | 1 or 2 bytes (2 for Chinese characters) |
Unicode | Universal code | 4 bytes (as UTF-32) |
UTF-8 | Variable-length character encoding for Unicode | 1 to 4 bytes |
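The byte lengths can be checked by encoding a few sample characters. The sketch below assumes Python's built-in codecs; `utf-32-be` stands in here for a fixed-width Unicode form, and the sample characters and the helper name `byte_length` are chosen purely for illustration. A result of `None` means the character set cannot represent that character.

```python
samples = ["A", "中", "😀"]
encodings = ["ascii", "gbk", "utf-32-be", "utf-8"]

def byte_length(ch, encoding):
    """Return the encoded length in bytes, or None if ch cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

# Print one row per sample character, one column per encoding.
for ch in samples:
    print(ch, {enc: byte_length(ch, enc) for enc in encodings})
```

An ASCII letter takes a single byte in ASCII, GBK, and UTF-8, while the Chinese character takes 2 bytes in GBK and 3 bytes in UTF-8.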
ASCII uses 7-bit or 8-bit binary combinations to represent 128 or 256 possible characters. Standard ASCII, also called basic ASCII, uses 7 bits to represent all uppercase and lowercase letters, the digits 0 to 9, punctuation marks, and the special control characters used in American English.
Among them:
0~31 and 127 (33 in total) are control characters or communication characters; the rest are printable. Control characters include LF (line feed), CR (carriage return), FF (form feed), DEL (delete), BS (backspace), and BEL (bell). Communication characters include SOH (start of heading), EOT (end of transmission), and ACK (acknowledge). ASCII values 8, 9, 10, and 13 are the backspace, tab, line feed, and carriage return characters respectively. They have no graphic form of their own, but affect how text is displayed, depending on the application.
32~126 (95 in total) are printable characters (32 is the space), of which 48~57 are the ten Arabic numerals 0 to 9.
65~90 are 26 uppercase English letters, 97~122 are 26 lowercase English letters, and the rest are some punctuation marks, arithmetic symbols, etc.
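The ranges listed above can be written as a small classifier. This is only a sketch; the function name `classify_ascii` and its labels are made up for illustration.

```python
def classify_ascii(code):
    """Name the range that a 7-bit ASCII code falls in."""
    if not 0 <= code <= 127:
        raise ValueError("not a standard ASCII code")
    if code <= 31 or code == 127:
        return "control"
    if code == 32:
        return "space"
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "uppercase"
    if 97 <= code <= 122:
        return "lowercase"
    return "punctuation or symbol"

# ord() gives the code of a character; chr() goes the other way.
for ch in ["\n", " ", "7", "A", "z", "+"]:
    print(ord(ch), classify_ascii(ord(ch)))

control_count = sum(classify_ascii(c) == "control" for c in range(128))
print(control_count)  # 33
```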
GBK is a Chinese character encoding specification defined by the People's Republic of China. It extends the earlier GB 2312 and is backward compatible with it.
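Backward compatibility here means that a character found in GB 2312 has the same bytes in GBK, while GBK also covers characters that GB 2312 lacks. A minimal check, assuming Python's built-in `gb2312` and `gbk` codecs; the two sample characters are picked for illustration only.

```python
common = "中"  # present in both GB 2312 and GBK
extra = "龍"   # traditional form, used here as a sample GBK-only character

# A GB 2312 character keeps the same byte sequence in GBK.
same_bytes = common.encode("gb2312") == common.encode("gbk")

# The traditional character cannot be encoded in GB 2312 ...
try:
    extra.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

# ... but GBK encodes it in two bytes.
gbk_len = len(extra.encode("gbk"))

print(same_bytes, in_gb2312, gbk_len)
```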
Unicode (also called Universal Code) is a character encoding scheme developed by an international organization to accommodate all the scripts and symbols in the world, so that text can be converted and processed across languages and platforms.
UTF-8 is a variable-length character encoding for Unicode. Stored as fixed two-byte units, Unicode text takes twice as much space as ASCII, and for ASCII characters the high byte is always 0 and carries no information. To solve this problem, intermediate encoding formats appeared. They are called Unicode Transformation Formats, or UTF.
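The wasted high byte is easy to see by encoding plain English text both ways. This sketch assumes that the two-byte form in question behaves like UTF-16, for which Python's `utf-16-be` codec is used; the sample string is arbitrary.

```python
text = "hello"

utf16 = text.encode("utf-16-be")  # two bytes per character for this text
utf8 = text.encode("utf-8")       # one byte per character for this text

print(utf16)                         # b'\x00h\x00e\x00l\x00l\x00o'
print(len(utf16), len(utf8))         # 10 5
print(utf8 == text.encode("ascii"))  # True
```

Every second byte in the two-byte form is 0, which is exactly the space that UTF-8 saves for ASCII text.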
The character sets commonly used for Chinese are utf-8 and GBK.
The names actually used in MySQL are as follows:
Character set | Description |
---|---|
gbk_chinese_ci | Simplified Chinese, case-insensitive |
utf8_general_ci | Unicode (multi-language), case-insensitive |
Look at Figure 1 and you will find that a MySQL character set name consists of three parts:
1. Character set
2. Language
3. Type
A name ending in bin is a binary collation, in which characters are compared by their binary code values. A name ending in ci is case-insensitive when comparing and sorting.
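The three-part naming can be illustrated by splitting a name on its underscores. The helper `split_collation` below is hypothetical and written only for this illustration; it is not part of MySQL or of any MySQL driver, and it assumes that the first piece is the character set and the last piece is the type.

```python
def split_collation(name):
    """Split a name such as 'utf8_general_ci' into (charset, language, type).

    Hypothetical helper for illustration only. The language part is None
    when the name has just two pieces, as in 'utf8_bin'.
    """
    parts = name.split("_")
    charset = parts[0]
    suffix = parts[-1] if len(parts) > 1 else None
    language = "_".join(parts[1:-1]) or None
    return charset, language, suffix

for name in ["gbk_chinese_ci", "utf8_general_ci", "utf8_bin"]:
    print(split_collation(name))
```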
Note:
In MySQL, UTF-8 is written as utf8, without the hyphen.
(Figure 1)