
An in-depth explanation of the Chinese character encoding conversion method in PHP

WBOY
2016-07-25 08:53:46
This article covers the basics of Chinese character encoding conversion in PHP and analyzes the principles and methods behind it. Readers who need it can use it as a reference.

This article explains the character set support introduced in MySQL 4.1 and how PHP should adapt to that change. It also applies to MySQL 5 and later versions.

1. Principles

MySQL's character set support involves two concepts: the "character set" and the "collation".

1. Collations
The word "collation" is sometimes rendered in Chinese as "verification". In web development the term is used almost exclusively in MySQL. Its main function is to tell MySQL how to compare characters. For example, for the ascii character set a collation specifies that a is less than b, that a is equal to a, and whether a is equal to A. You can usually ignore collations, because every character set has a default collation and the default is normally sufficient.

2. Character set
By contrast, the character set is a broader concept. Even ordinary text files under Windows involve character sets. Different character sets specify different ways of encoding characters. A character set is a set of symbols together with their encodings. For example, the ASCII character set includes digits, uppercase and lowercase letters, symbols such as the semicolon, and control characters such as the line feed. Each character is represented in 7 bits (A is encoded as 65, a as 97, and b as 98).

ASCII only specifies encodings for English letters and symbols, so non-English languages cannot be represented in ASCII. For this reason, different countries created encodings for their own languages; China, for example, has gb2312. However, these national encodings differ from one another, and they also cause cross-platform problems. International standards organizations therefore developed internationally accepted encodings, the most commonly used of which is utf8.

ascii encodes only English symbols and English letters; gb2312 encodes English symbols, English letters, and Chinese characters; utf8 encodes the languages of the whole world. Therefore gb2312 contains the ascii characters, and utf8 contains the gb2312 characters.

It follows that utf8 is the character set that covers the widest range of characters. For that reason, multi-language web systems generally use the utf8 character set (phpmyadmin uses utf8 encoding). Storing any text involves a character set, whether the text lives in a database or in an ordinary text file.
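As a rough illustration of how the same character is encoded differently in each character set, the following sketch prints the ASCII codes of a few letters and the bytes of one Chinese character in utf8 and gb2312. It assumes PHP 7 or later with the mbstring extension loaded; the variable names are only for illustration.

```php
<?php
// "\u{4E2D}" is the Chinese character 中, written as an escape so the result
// does not depend on how this source file itself was saved (PHP 7+).
$utf8 = "\u{4E2D}";

// Convert the UTF-8 bytes to GB2312 (requires the mbstring extension).
$gb2312 = mb_convert_encoding($utf8, 'GB2312', 'UTF-8');

// ASCII codes of A, a and b.
echo ord('A'), ' ', ord('a'), ' ', ord('b'), "\n"; // 65 97 98

// The same character occupies different bytes in each encoding.
echo bin2hex($utf8), "\n";   // e4b8ad (3 bytes in utf8)
echo bin2hex($gb2312), "\n"; // d6d0   (2 bytes in gb2312)
```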

Main terms:
Characters: Chinese characters, English letters, punctuation marks, Latin letters, and so on.
Encoding: the conversion of characters into a format the computer can store; for example, a is represented by 97.
Character set: a set of characters together with the corresponding encoding method.

a. mysql character set
MySQL supports multiple character sets and can convert between them (which helps portability and multi-language support). A character set can be set at the server level, the database level, the table level, and the column level. The only place where a character set is actually used is a column that stores characters. For example, if col1 in table1 is a character type, col1 uses a character set; if col2 in table1 is of type int, the concept of a character set does not apply to col2. The server-level, database-level, and table-level character sets are all just defaults for the column character set.

MySQL must have a server character set, which can be specified at compile time, with a startup parameter, or in the configuration file. The server character set is only the default for databases. When creating a database you can specify its character set; if you do not, the server's character set is used. Similarly, when creating a table you can specify the table-level character set; if you do not, the database character set is used. When creating a column you can specify the column's character set; if you do not, the table's character set is used.

Normally you only need to set the server-level character set, and the database-level, table-level, and column-level character sets are inherited from it. Since utf8 is the widest character set, the server-level character set is normally set to utf8.
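The inheritance chain described above can be sketched as a simple fallback. This is only a model of the rule, not MySQL code; the function name resolveColumnCharset and its parameters are hypothetical, and null stands for "not specified at this level".

```php
<?php
/**
 * Hypothetical helper: returns the character set a column ends up with,
 * given what was declared at each level (null means "not specified").
 */
function resolveColumnCharset(
    string $server,
    ?string $database = null,
    ?string $table = null,
    ?string $column = null
): string {
    // Each level falls back to the level above it when nothing is specified.
    $databaseCharset = $database ?? $server;
    $tableCharset = $table ?? $databaseCharset;
    return $column ?? $tableCharset;
}

// Only the server level is set: the column inherits utf8.
echo resolveColumnCharset('utf8'), "\n";                       // utf8
// The table overrides what it would have inherited.
echo resolveColumnCharset('utf8', null, 'gb2312'), "\n";       // gb2312
// The database overrides the server default.
echo resolveColumnCharset('latin1', 'utf8', null, null), "\n"; // utf8
```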

b. Character set issues for ordinary text
Storing any text involves a character set, and ordinary text files are no exception. On Windows 2000 and later, the "Save as..." dialog box in Notepad has an option for choosing the encoding in which the text is stored. Most people use the default encoding, so they normally never run into character set problems.

Under Windows you choose the encoding when saving a text file, but when a text file is opened the encoding is detected automatically. There is a well-known joke on the Internet about using Notepad on Windows 2000+ to play China Mobile off against China Unicom; you can search for it. The problem behind it is that Windows guesses the wrong encoding when it opens the text file.

Because automatic detection is sometimes wrong, some text file formats declare the encoding they use. html files are one example. An html file is a text file: it is stored in some encoding, and html syntax inside the file also declares which encoding the file uses (for example, through a charset declaration in a meta tag). If the html file does not declare an encoding, the browser detects the encoding automatically. If the html file declares one, the browser uses the declared encoding.

Normally the charset declared in the html file matches the encoding in which the file is actually stored, but the two can differ. When they differ, the web page is garbled (this kind of garbling only involves the text file and has nothing to do with the database). Specialized web page editors such as Dreamweaver automatically save the file in the encoding given by the charset value in the page.
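A mismatch between the declared charset and the stored bytes can sometimes be detected programmatically. The sketch below assumes the mbstring extension; declaredCharset and encodingMatchesDeclaration are hypothetical helpers, and the pattern only handles a simple charset=... declaration. A check like this can only catch byte sequences that are invalid in the declared charset, so passing it does not prove the declaration is right.

```php
<?php
// Hypothetical helper: extract the charset a page declares in its markup.
function declaredCharset(string $html): ?string
{
    if (preg_match('/charset\s*=\s*["\']?([A-Za-z0-9_-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: are the page's bytes valid in the charset it declares?
// (mb_check_encoding rejects charset names that mbstring does not know.)
function encodingMatchesDeclaration(string $html): bool
{
    $charset = declaredCharset($html);
    return $charset !== null && mb_check_encoding($html, $charset);
}

$body = "\u{4E2D}\u{6587}"; // 中文 as UTF-8 bytes

// Declared utf-8, stored as utf-8: consistent.
$consistent = '<meta charset="utf-8"><p>' . $body . '</p>';

// Declared utf-8, but the text was stored as gb2312: inconsistent.
$inconsistent = '<meta charset="utf-8"><p>'
    . mb_convert_encoding($body, 'GB2312', 'UTF-8') . '</p>';

var_dump(encodingMatchesDeclaration($consistent));   // bool(true)
var_dump(encodingMatchesDeclaration($inconsistent)); // bool(false)
```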

c. Character set problem of php+mysql
What PHP ultimately generates is a text file, but to do so it retrieves text from the database or stores text in it. Because MySQL supports multiple character sets, MySQL does not know by default which encoding the text sent by PHP is in. MySQL therefore needs the client (php) to tell it which character sets it is working with.

By setting character_set_client, php tells mysql which encoding the text it sends is in. By setting character_set_results, php tells mysql which encoding it wants returned data to be in. By setting character_set_connection, php tells mysql which encoding to use for the text inside a query.

mysql stores text in the encoding configured for the column. Suppose MySQL stores text as setserver, PHP's character_set_client is setclient, and PHP's character_set_results is setresult. Then mysql converts text sent from php from setclient to setserver before storing it in the database. When php retrieves text, mysql converts it from setserver to setresult and then sends it to php.

The php file (that is, the html that is finally generated) has an encoding of its own. If the encoding of the text returned by mysql differs from the encoding of the php file, the whole web page is garbled. This is why PHP generally tells MySQL its own encoding.

To avoid garbled text, three encodings must agree: first, the encoding in which the web page itself is stored; second, the encoding declared in the HTML; third, the encoding PHP reports to mysql (character_set_client and character_set_results). The first and second usually agree if you write the page in an editor such as dw (Dreamweaver), but they may differ if you write it in Notepad. The third has to be reported to mysql explicitly, which can be done in php with mysql_query("set names characterx").
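As a sketch of what "set names" amounts to: the MySQL manual describes SET NAMES as equivalent to setting the three session variables discussed above. The helper names below are hypothetical. The article's mysql_query() belongs to the old mysql extension, which was removed in PHP 7, so the usage function assumes the mysqli extension instead.

```php
<?php
// Hypothetical helper: the three statements that "SET NAMES <charset>" stands for.
function setNamesEquivalent(string $charset): array
{
    // Accept only simple charset names so the value is safe to place in SQL.
    if (!preg_match('/^[a-z0-9_]+\z/i', $charset)) {
        throw new InvalidArgumentException("Unexpected charset name: $charset");
    }
    return [
        "SET character_set_client = $charset",
        "SET character_set_results = $charset",
        "SET character_set_connection = $charset",
    ];
}

// Hypothetical usage with mysqli: set_charset() sets the connection charset
// and also informs the client library, which matters for string escaping.
function useCharset(mysqli $db, string $charset): void
{
    if (!$db->set_charset($charset)) {
        throw new RuntimeException('Could not set charset: ' . $db->error);
    }
}

print_r(setNamesEquivalent('gb2312'));
```

With an open connection in $db, calling useCharset($db, 'utf8') once after connecting plays the role of mysql_query("set names utf8") in the text.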

d. Character set conversion problem
Converting from a small character set to a large one loses no data, but converting from a large character set to a small one may. For example, some characters in utf8 do not exist in gb2312, so they may be lost when converting from utf8 to gb2312. There is one exception: if text is first converted from gb2312 to utf8 and then back from utf8 to gb2312, nothing is lost, because every character in the text came from gb2312 in the first place, so only gb2312 characters are ever being converted.

Because utf8 can accommodate the characters of all languages, databases generally use utf8 encoding. Any character can then be stored in a utf8-encoded database.
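The two cases can be tried out with mbstring. This is a minimal sketch that assumes the mbstring extension; the Hangul syllable is just one convenient example of a character that gb2312 does not contain, and what mbstring substitutes for it depends on its configuration.

```php
<?php
$gbOriginal = "\xD6\xD0\xCE\xC4"; // 中文 stored as gb2312 bytes

// Small -> large -> small: every character came from gb2312, so nothing is lost.
$asUtf8 = mb_convert_encoding($gbOriginal, 'UTF-8', 'GB2312');
$gbAgain = mb_convert_encoding($asUtf8, 'GB2312', 'UTF-8');
var_dump($gbAgain === $gbOriginal); // bool(true)

// Large -> small: the Hangul syllable U+D55C has no gb2312 code, so mbstring
// replaces it during conversion, and converting back cannot restore it.
$hangul = "\u{D55C}";
$roundTrip = mb_convert_encoding(
    mb_convert_encoding($hangul, 'GB2312', 'UTF-8'),
    'UTF-8',
    'GB2312'
);
var_dump($roundTrip === $hangul); // bool(false)
```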

e. phpmyadmin garbled problem
phpmyadmin supports multiple languages, so its html pages have to use utf8 encoding. Because the pages are utf8, phpmyadmin has to use utf8 for character_set_client and character_set_results when it connects to mysql.

At present, PHP can only tell MySQL which encoding it uses by issuing set names (or a few equivalent statements) when it connects. If no encoding is declared explicitly, latin1 is used. Many programs never declare character_set_client, so their gb2312 text is stored in the database as if it were latin1, and when phpmyadmin reads it as utf8 the result is inevitably garbled. If the PHP program stores its data with the correct encoding declared, there is no problem. Therefore it is not phpmyadmin that needs to be modified. (Modifying phpmyadmin can sometimes make the garbling go away, but that does not address the root cause.)
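The effect can be simulated without a database. The sketch below assumes the mbstring extension and uses ISO-8859-1 as a stand-in for MySQL's latin1, which is close enough for these particular bytes: it treats gb2312 bytes as latin1 and converts them to utf8, which is roughly what a utf8 client such as phpmyadmin ends up displaying.

```php
<?php
$gbBytes = "\xD6\xD0\xCE\xC4"; // 中文, sent as gb2312 without declaring it

// The bytes are taken to be latin1 and converted to utf8 for the reader.
$garbled = mb_convert_encoding($gbBytes, 'UTF-8', 'ISO-8859-1');
echo $garbled, "\n"; // ÖÐÎÄ

// With the encoding declared correctly, the same bytes convert cleanly.
$correct = mb_convert_encoding($gbBytes, 'UTF-8', 'GB2312');
echo $correct, "\n"; // 中文
```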

2. Summary

1. Try to store the database in utf8 (edit /etc/my.cnf and add default-character-set=utf8 to the [mysqld] section; on MySQL 5.5 and later the corresponding [mysqld] option is character-set-server=utf8). Convert an existing database to utf8 first.
2. Before querying the database, the PHP program executes mysql_query("set names xxxx"); where xxxx is the encoding of your web page (charset=xxxx). If the web page has charset=utf8, then xxxx=utf8; if the web page has charset=gb2312, then xxxx=gb2312; if the web page has charset=ipaddr, then xxxx=ipaddr (just kidding, there is no such encoding). Almost every web program keeps its database connection code in one shared file. Just add mysql_query("set names xxxx") to that file.
3. phpmyadmin does not need to be changed.
4. To make sure the actual encoding of the web page (the encoding chosen in the Windows save dialog box) matches its declared encoding (charset=?), create the web page with a tool such as dw (Dreamweaver).
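A shared connection file along the lines of point 2 might look like the sketch below. It assumes the mysqli extension; the function names, the $config keys, and the contents of the name map are all hypothetical. The map exists because HTML usually spells the charset "utf-8" while MySQL calls it utf8.

```php
<?php
// Hypothetical mapping from charset names as written in HTML to MySQL names.
function mysqlCharsetFor(string $pageCharset): string
{
    $map = [
        'utf-8'  => 'utf8',
        'utf8'   => 'utf8',
        'gb2312' => 'gb2312',
        'gbk'    => 'gbk',
    ];
    $key = strtolower(trim($pageCharset));
    if (!isset($map[$key])) {
        throw new InvalidArgumentException("No MySQL charset known for: $pageCharset");
    }
    return $map[$key];
}

// Hypothetical shared connection function; the $config keys are placeholders.
function connectWithPageCharset(array $config, string $pageCharset): mysqli
{
    $db = new mysqli(
        $config['host'],
        $config['user'],
        $config['password'],
        $config['database']
    );
    if ($db->connect_error) {
        throw new RuntimeException('Connection failed: ' . $db->connect_error);
    }
    // Same effect as the article's "set names xxxx", issued once per connection.
    if (!$db->query('SET NAMES ' . mysqlCharsetFor($pageCharset))) {
        throw new RuntimeException('SET NAMES failed: ' . $db->error);
    }
    return $db;
}

echo mysqlCharsetFor('UTF-8'), "\n"; // utf8
```

Every page would then obtain its connection through connectWithPageCharset(), passing the same charset it declares in its HTML.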


