To fix garbled output when a PHP regular expression matches Chinese characters, add the u (UTF-8) modifier to the pattern, for example "preg_replace('/[万]/u','wan',$a);".
The problem:
Using a regular expression to match Chinese characters in a PHP string produces garbled output.
<?php
echo '<h2>Matching Chinese with a regular expression</h2><br>';
$a = '天地不仁,以万物为刍狗';
$b = preg_replace('/万/', '萬', $a);
echo $b;

echo '<h2>With square brackets added, the replacement result is garbled</h2><br>';
$c = '天地不仁,以万物为刍狗';
$d = preg_replace('/[万]/', '萬', $c);
echo $d;
?>
The output of this program can be seen at http://nyaii.com/s/test.php. Once the Chinese character in the pattern is wrapped in square brackets, the result is garbled. The equivalent code works fine in JavaScript:
'天地不仁'.replace(/[天]/,'') //outputs "地不仁"
Solution:
Add the u (UTF-8) modifier to the pattern:
$d = preg_replace('/[万]/u','萬',$a);
For the other pattern modifiers, see
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
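As a quick check of the fix, the sketch below runs both patterns on the sample string and tests whether each result is still valid UTF-8. It assumes the script file itself is saved as UTF-8. The helper name isValidUtf8 is made up for this illustration; it relies on preg_match returning false under the u modifier when the subject is not valid UTF-8.

```php
<?php
// Assumption: this file is saved as UTF-8.

// Hypothetical helper: an empty pattern with the u modifier matches any
// valid UTF-8 string and fails (returns false) on malformed input.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$a = '天地不仁,以万物为刍狗';

// Without u, the class is a set of single bytes, so the replacement
// lands in the middle of other characters.
$withoutU = preg_replace('/[万]/', '萬', $a);

// With u, the class contains exactly one character.
$withU = preg_replace('/[万]/u', '萬', $a);

var_dump(isValidUtf8($withoutU)); // bool(false)
var_dump(isValidUtf8($withU));    // bool(true)
echo $withU, "\n";                // 天地不仁,以萬物为刍狗
```

The same validity check can be useful before echoing any string that has passed through a pattern without the u modifier.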
The following is a supplement that answers the questions raised in the comments on this topic.
As for why the u modifier is needed inside []: strictly speaking, it is best to add the u modifier in both cases.
But why does [] cause garbled characters? This has to be explained at the byte level rather than the character level.
First, recall that a PHP string is a plain sequence of bytes with no built-in notion of Unicode characters. Now consider this code:
<?php
$a = "万";
echo strlen($a); // 3
for ($i = 0; $i < strlen($a); $i++) {
    echo dechex(ord($a[$i])) . ' '; // e4 b8 87
}
This shows that the UTF-8 encoding of the character 万 is the three bytes e4 b8 87.
So when the u modifier is not set, the regular expression engine does not treat 万 as a single character, but as three consecutive bytes.
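One way to observe this byte-versus-character difference directly is to count what the dot matches in each mode. This is a minimal sketch, assuming the file is saved as UTF-8; the variable names are illustrative only.

```php
<?php
// Assumption: this file is saved as UTF-8.
$a = '万';

// Byte mode: the dot matches one byte at a time, so it matches three times.
$bytesSeen = preg_match_all('/./', $a);

// UTF-8 mode: the dot matches one whole code point, so it matches once.
$charsSeen = preg_match_all('/./u', $a);

// An anchored single dot only covers the whole string in UTF-8 mode.
$oneDotByteMode = preg_match('/^.$/', $a);
$oneDotUtf8Mode = preg_match('/^.$/u', $a);

echo $bytesSeen, ' ', $charsSeen, "\n";           // 3 1
echo $oneDotByteMode, ' ', $oneDotUtf8Mode, "\n"; // 0 1
```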
In summary:
When the pattern has no [], the engine looks for three consecutive bytes with the hexadecimal values e4 b8 87. In other words, your actual pattern is \xe4\xb8\x87. In a valid UTF-8 string this byte sequence occurs only where the character 万 occurs, so the replacement produces no garbled characters. This is safe only for a literal sequence, though: any construct that applies to a single character, such as a dot, a quantifier, or a character class, applies to a single byte instead, which breaks for every multi-byte character, including four-byte ones such as emoji.
When you wrap 万 in [], what the engine actually looks for is [\xe4\xb8\x87]. Anyone familiar with regular expressions will see that this matches any one of these three bytes, so it also hits the bytes of other Chinese characters besides 万 and replaces them individually, leaving invalid byte sequences behind.
When you add the u modifier, the engine treats 万 as a single character, so the problem no longer occurs.
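The three cases above can be checked with a short script. This is a sketch under the assumption that the file is saved as UTF-8; the variable names are chosen for illustration.

```php
<?php
// Assumption: this file is saved as UTF-8.
$a = '天地不仁,以万物为刍狗';

// Case 1: without [], the literal is the same pattern as its three bytes.
$literal = preg_replace('/万/', '萬', $a);
$escaped = preg_replace('/\xe4\xb8\x87/', '萬', $a);
var_dump($literal === $escaped); // bool(true)

// Case 2: with [] and no u, every single byte e4, b8 or 87 is a match,
// including the ones inside 不, 仁, 以 and 为.
$byteHits = preg_match_all('/[万]/', $a, $m);
$hitBytes = implode(' ', array_map('bin2hex', $m[0]));
echo $byteHits, ': ', $hitBytes, "\n"; // 9: e4 b8 e4 e4 e4 b8 87 e4 b8

// Case 3: with the u modifier, only the character 万 itself matches.
$charHits = preg_match_all('/[万]/u', $a);
echo $charHits, "\n"; // 1
```

Nine byte-level matches against one character-level match is why the output without the u modifier contains stray bytes that no longer form valid characters.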
As for JavaScript, its strings are sequences of UTF-16 code units, and 万 (U+4E07) fits in a single code unit. The regular expression engine therefore sees it as one character rather than as separate bytes, so this problem does not occur.