Perfect PHP regular expression matching for Chinese
1. In general, you can match Chinese with metacharacters: /.*?/s will match a stretch of Chinese text. This works in program code in both ANSI (gb2312) and utf-8 environments. As a reminder, though, \w cannot match Chinese. I once read in the book "Mastering Regular Expressions" (People's Posts and Telecommunications Publishing House, edited by Sha Jin) that \w can be used to match Chinese characters. I would like to correct that: in PHP it does not work, at least without the u modifier. You can use "/./", "/[^\d]/", or "/[^a]/" to match Chinese characters.
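A minimal sketch of the point above, assuming the source file is saved as UTF-8 and PHP runs under its default locale. The variable names are illustrative only. It checks how much of a Chinese character the dot consumes with and without the u modifier, and whether \w matches it in byte mode.

```php
<?php
// Assumption: this file is saved as UTF-8, so '中' is the three bytes E4 B8 AD.
$char = '中';

// Without the u modifier PCRE works byte by byte: '.' consumes one byte only.
preg_match('/./', $char, $byteMatch);
$byteLen = strlen($byteMatch[0]);

// With the u modifier '.' consumes the whole UTF-8 character.
preg_match('/./u', $char, $charMatch);
$charLen = strlen($charMatch[0]);

// In byte mode \w only knows single-byte word characters,
// so none of the bytes of '中' count as a word character.
$wordInByteMode = preg_match('/\w/', $char);

echo "bytes matched by . without u: $byteLen\n";
echo "bytes matched by . with u: $charLen\n";
echo "\\w matched without u: $wordInByteMode\n";
```

This is why patterns such as "/./" appear to match Chinese without u: they match its individual bytes, which is worth keeping in mind when counting or splitting characters.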
2. If you want to accurately match Chinese, that is, match pure Chinese characters, or match Chinese characters plus full-width punctuation, you need to use different methods according to different encoding environments. The following is an introduction to two commonly used encodings (gb2312, utf-8):
In the ANSI (gb2312) environment, you can match with a [chr(0xnn)-chr(0xmm)] character class. For example, one online article gives "/[".chr(0xb0)."-".chr(0xf7)."]+/". This works, but it is too general: it matches characters throughout the gb2312 encoding table, including Chinese characters, punctuation, Japanese hiragana, and some symbols I cannot identify. The encoding table shows that Chinese characters occupy the range 0xb0a1-0xf7fe, that gb2312 encodes each character in two bytes, and that the highest bit of each byte is 1. You can use this to write a regular expression that matches only Chinese characters:
"/([".chr(0xb0)."-".chr(0xf7)."][".chr(0xa1)."-".chr(0xfe)."])/". This expression matches one Chinese character, and it is easy to add a quantifier to match several.
By analogy, to match full-width punctuation but not Chinese characters, you can write:
"/([".chr(0xa1)."-".chr(0xa3)."][".chr(0xa1)."-".chr(0xff)."])/", which matches symbols in the encoding range 0xa1a1-0xa3ff. Other ranges work the same way.
3. Next, matching Chinese in the UTF-8 environment. As above, you can use the Unicode encoding table to decide what to match. The table shows that Chinese characters occupy the range 0x4e00-0x9fa5, so the regular expression can be written like this:
"/[\x{4e00}-\x{9fa5}]/u", where \x{nnnn} denotes a character by its hexadecimal code. Please check the PHP manual for more information. Pay special attention to the pattern modifier u. The PHP manual says this: u (PCRE_UTF8) This modifier enables an additional feature in PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available since PHP 4.1.0 under Unix and since PHP 4.2.3 under win32. Patterns are checked for UTF-8 validity since PHP 4.3.5. This is exactly what a correct match requires. From experience, it is also best to add the u modifier whenever you use metacharacters to match strings in a UTF-8 environment.
Here are two examples:
(1) In ANSI programming environment:
$strtest = "yyg Chinese character yyg";
$pregstr = "/([".chr(0xb0)."-".chr(0xf7)."][".chr(0xa1)."-".chr(0xfe)."])+/i";
if(preg_match($pregstr,$strtest,$matchArray)){
echo $matchArray[0];
}
//output: Chinese characters
(2) In Utf-8 programming environment:
$strtest = "yyg Chinese character yyg";
$pregstr = "/[x{4e00}-x{9fa5}]+/u";
if(preg_match($pregstr,$strtest,$matchArray)){
echo $matchArray[0];
}
//output: Chinese characters
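Example (2) returns only the first run of Chinese characters. As a hedged sketch, assuming a UTF-8 source file, the same range can be used with preg_match_all to collect every run, and dropping the u modifier shows why it matters. The sample text and variable names are illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8.
$text = 'PHP 正则 matches 中文字符 here';

// The u modifier makes PCRE read both the pattern and the subject as UTF-8.
$pattern = '/[\x{4e00}-\x{9fa5}]+/u';
preg_match_all($pattern, $text, $matches);
$runs = $matches[0];

// Without u, code points above 0xff are rejected when the pattern is compiled,
// so preg_match() raises a warning (suppressed here) and returns false.
$withoutU = @preg_match('/[\x{4e00}-\x{9fa5}]+/', $text);

echo implode(', ', $runs), "\n"; // 正则, 中文字符
var_dump($withoutU);             // bool(false)
```

Checking the return value against false, rather than just treating it as "no match", helps distinguish a broken pattern from text that simply contains no Chinese.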
Author: zdrjlamp