What does it mean that PHP does not support Unicode?

藏色散人

Jul 27, 2021 am 09:35 AM

Saying that PHP does not support Unicode means that PHP strings carry no encoding information. The native string functions therefore have no way of knowing how the bytes map to text, and simply assume that one character is one byte. That is sufficient for English and other ASCII text, but it produces errors with multi-byte characters such as Chinese.

The operating environment of this article: Windows 7, PHP 7.1, Dell G3 computer.

Why is it said that PHP does not support Unicode encoding?

I often see the claim that PHP does not support Unicode, or that it does not support Unicode at the low level. I knew that encoding in PHP is painful and that the string functions are inconsistent, yet PHP can still display Chinese, so I never understood what the claim actually meant. I spent some time sorting this out.

Let’s start with an example:

A PHP script is as follows. Assume that the encoding of the file is UTF-8:

// File encoding: UTF-8
echo strlen("中文"); // 6
echo substr("中文", 0, 1); // garbled output (a single byte, not a complete character)
echo substr("中文", 0, 3); // 中

This looks strange: one Chinese character appears to be treated as 3 characters. The explanation starts with how PHP stores strings.

I summarized it as follows:

A PHP string is an array of bytes, much like char a[3] = "abc" in C: each element is one byte, and the native functions treat each byte as one character.

A string also stores no encoding information alongside its bytes, so PHP does not know which encoding those bytes are meant to be read in.

Going one step further, a string literal takes on the encoding of the script file it is written in. For example, given $string = "中文";, if the script file is saved as UTF-8, the string holds the UTF-8 bytes of 中文: E4B8ADE69687.

As noted above, PHP does not record the encoding of a string. So even though 中文 is stored as E4B8ADE69687, the native string functions see nothing more than a sequence of bytes. PHP's native string functions therefore work on single bytes and treat each byte as one character.

With these points in mind, the earlier code example is easy to explain:

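// File encoding: UTF-8
echo bin2hex("中文"); // e4b8ade69687: the bytes that make up "中文"
echo strlen("中文"); // 6: the length is counted in bytes
echo substr("中文", 0, 1); // takes 1 byte from offset 0, i.e. e4, which is not a complete character, hence the garbled output
echo substr("中文", 0, 3); // takes 3 bytes from offset 0, which is exactly the encoding of 中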
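
To make the byte-versus-character distinction concrete, here is a minimal sketch that counts characters without any extension. The helper name utf8_char_count is hypothetical (it is not a PHP built-in), and the sketch assumes the input is valid UTF-8. The string is written with byte escapes so that the result does not depend on how the script file itself is saved.

```php
<?php
// Hypothetical helper (not a PHP built-in): counts UTF-8 characters by
// skipping continuation bytes, which always have the bit pattern 10xxxxxx.
// Assumes the input is valid UTF-8.
function utf8_char_count(string $bytes): int
{
    $count = 0;
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        // $bytes[$i] is the single byte at offset $i; ord() gives its value.
        if ((ord($bytes[$i]) & 0xC0) !== 0x80) {
            $count++; // ASCII byte or lead byte: a new character starts here
        }
    }
    return $count;
}

$s = "\xE4\xB8\xAD\xE6\x96\x87"; // the UTF-8 bytes of "中文"

echo strlen($s), "\n";           // 6 (bytes)
echo utf8_char_count($s), "\n";  // 2 (characters)
echo bin2hex($s[0]), "\n";       // e4 (indexing returns one byte, not one character)
```

The point of the sketch is only that the character count has to be derived from the bytes under an assumed encoding; the string itself never says which encoding that is.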

Similarly, if you change the file encoding to GBK or another encoding and repeat the experiment, you get analogous results. In GBK, for example, one Chinese character occupies 2 bytes.
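
The GBK case can be reproduced without re-saving the script file. The sketch below assumes the mbstring extension is loaded and uses mb_convert_encoding to obtain the GBK bytes of the same text.

```php
<?php
// Assumes the mbstring extension is loaded.
$utf8 = "\xE4\xB8\xAD\xE6\x96\x87";                 // "中文" as UTF-8 bytes
$gbk  = mb_convert_encoding($utf8, 'GBK', 'UTF-8'); // the same text as GBK bytes

echo bin2hex($gbk), "\n";               // d6d0cec4
echo strlen($gbk), "\n";                // 4: two bytes per character
echo bin2hex(substr($gbk, 0, 2)), "\n"; // d6d0: the two bytes of the first character
```

The text is the same in both cases; only the bytes differ, and strlen reports whatever the byte count happens to be.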

At this point it should be clear what it means that PHP does not support Unicode at the low level. In summary:

PHP strings do not store the encoding of their characters, so the native string functions do not know how the binary data maps to text and can only assume that one character is a single byte. This is sufficient for English and other ASCII text, but for Chinese and other multi-byte characters it produces errors.

By contrast, consider a language that is said to support Unicode at the low level:

var string = "中文";
console.log(string.length); // 2
console.log(string.substr(0, 1)); // 中

You can see that JS recognizes and processes these multi-byte characters correctly. The reason is that a JS string is defined as a sequence of UTF-16 code units rather than raw bytes, so length and substr count code units, and each of these Chinese characters is exactly one code unit. Characters outside the Basic Multilingual Plane, such as most emoji, take two code units, so even in JS the length is not always the number of characters.
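
For comparison, the number that JavaScript's length reports (a count of UTF-16 code units) can be reproduced in PHP. This is a rough sketch: the helper name utf16_units is hypothetical, and it assumes the mbstring extension is loaded and the input is valid UTF-8.

```php
<?php
// Hypothetical helper: number of UTF-16 code units needed for a UTF-8 string.
// Assumes the mbstring extension is loaded and the input is valid UTF-8.
function utf16_units(string $utf8): int
{
    // Each UTF-16 code unit is 2 bytes; UTF-16LE output carries no byte order mark.
    return intdiv(strlen(mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8')), 2);
}

$chinese = "\xE4\xB8\xAD\xE6\x96\x87"; // "中文" as UTF-8 bytes
$emoji   = "\xF0\x9F\x98\x80";         // U+1F600 as UTF-8 bytes

echo utf16_units($chinese), "\n";      // 2: one code unit per character
echo utf16_units($emoji), "\n";        // 2: a single character stored as a surrogate pair
echo mb_strlen($emoji, 'UTF-8'), "\n"; // 1: mb_strlen counts code points
```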

This raises a question: how can multi-byte characters be handled correctly in PHP? The answer is the mbstring extension (for details, see: http://php.net/manual/zh/book.mbstring.php). The name mbstring stands for multi-byte string.

This extension provides a family of functions that mirror the native string functions and handle multi-byte characters correctly. For example, strlen corresponds to mb_strlen. These counterparts behave much like the native functions, except that they usually accept an additional optional parameter: encoding.
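
As a small illustration of this correspondence beyond strlen, the sketch below puts two more native functions next to their mb_ counterparts. It assumes the mbstring extension is loaded; the strings are written with byte escapes so the output does not depend on the file encoding.

```php
<?php
// Assumes the mbstring extension is loaded.
$s   = "\xE4\xB8\xAD\xE6\x96\x87"; // "中文" as UTF-8 bytes
$wen = "\xE6\x96\x87";             // "文" as UTF-8 bytes

echo strpos($s, $wen), "\n";                      // 3: offset counted in bytes
echo mb_strpos($s, $wen, 0, 'UTF-8'), "\n";       // 1: offset counted in characters

echo bin2hex(substr($s, 0, 1)), "\n";             // e4: one byte, not a full character
echo bin2hex(mb_substr($s, 0, 1, 'UTF-8')), "\n"; // e4b8ad: the whole first character
```

Passing the encoding explicitly, as done here, avoids relying on the configured default.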

Examples are as follows:

// Script file encoding: UTF-8
echo strlen("中文"); // 6
echo mb_strlen("中文", "UTF-8"); // 2: with the encoding passed explicitly, the bytes e4b8ade69687 are decoded as UTF-8
echo mb_strlen("中文"); // 2: if the encoding is omitted, the internal character encoding is used (per the documentation), which is UTF-8 here
echo mb_strlen("中文", "GBK"); // 3: the bytes e4b8ade69687 are interpreted as GBK, at 2 bytes per character
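
Since the result depends on the encoding in effect, one option is to set the internal encoding once and to validate input before measuring it. The following is a minimal sketch, assuming the mbstring extension is loaded; whether to set the encoding globally or pass it on every call is a design choice left to the reader.

```php
<?php
// Assumes the mbstring extension is loaded.
$s = "\xE4\xB8\xAD\xE6\x96\x87"; // "中文" as UTF-8 bytes

mb_internal_encoding('UTF-8');     // set the default used when encoding is omitted
echo mb_internal_encoding(), "\n"; // UTF-8
echo mb_strlen($s), "\n";          // 2

// mb_check_encoding reports whether the bytes are valid in the given encoding.
var_dump(mb_check_encoding($s, 'UTF-8'));               // bool(true)
var_dump(mb_check_encoding(substr($s, 0, 1), 'UTF-8')); // bool(false): a lone lead byte
```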

