
Summary and analysis of problems matching Chinese characters with PHP regular expressions

WBOY
2016-07-21 15:19:38

The code is as follows:

$str = '中华人民共和国123456789abcdefg'; // "People's Republic of China" followed by digits and letters
echo preg_match("/^[\u4e00-\u9fa5_a-zA-Z0-9]{3,15}$/", $str);

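Run the code above and see what it reports:

Warning: preg_match(): Compilation failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X at offset 3 in F:wwwrootphptest.php on line 2
According to this message, the PCRE build behind PHP's preg functions does not support the following Perl escape sequences: \L, \l, \N, \P, \p, \U, \u, and \X. (PCRE builds with Unicode property support do accept \p, \P, and \X; the \u form from JavaScript is rejected by PHP either way.)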

In UTF-8 mode, "\x{...}" is allowed, where the content of the curly brackets is a string of hexadecimal digits.

The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if its value is greater than 127.
So the problem can be solved like this:

preg_match("/^[\x80-\xff_a-zA-Z0-9]{3,15}$/", $str);
preg_match('/[\x{2460}-\x{2468}]/u', $str);
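
The difference between the two forms can be seen by counting matches. The following is a minimal sketch, not part of the original article: the sample string (three copies of 中文, written as raw UTF-8 bytes so the result does not depend on the file's encoding) and the variable names are assumptions chosen for illustration.

```php
<?php
// '中文' repeated three times, spelled out as UTF-8 bytes (3 bytes per character).
$s = str_repeat("\xe4\xb8\xad\xe6\x96\x87", 3);

// Without /u the subject is treated as bytes: \x80-\xff matches every non-ASCII byte.
$byteCount = preg_match_all('/[\x80-\xff]/', $s);

// With /u the subject is treated as UTF-8: \x{...} names whole code points.
$charCount = preg_match_all('/[\x{80}-\x{10ffff}]/u', $s);

echo $byteCount, ' ', $charCount, "\n"; // 18 6
```

This is why a quantifier such as {3,15} limits bytes in the first pattern but characters in a /u pattern.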


Match internal coded Chinese characters
Test according to the method he provided, the code is as follows:

Copy the code The code is as follows:

$str = "php programming";
if (preg_match("/^[x{2460}-x{2468}]+$/u",$str )) {
print("This string is all in Chinese");
} else {
print("This string is not all in Chinese");
}
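
A likely reason this range is of no use for detecting Chinese is that U+2460 to U+2468 are the circled digits ① to ⑨, not Chinese characters. The sketch below checks that under the assumption of PHP 7 or later (for the "\u{...}" string escape); the variable names are illustrative only.

```php
<?php
// "\u{...}" inside double quotes is PHP 7+ syntax for a single code point.
$circledOne = "\u{2460}"; // ①
$zhong      = "\u{4e2d}"; // 中

$pattern = '/^[\x{2460}-\x{2468}]+$/u';

$matchesCircled = preg_match($pattern, $circledOne);
$matchesChinese = preg_match($pattern, $zhong);

var_dump($matchesCircled); // int(1)
var_dump($matchesChinese); // int(0)
```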


I found that this time I still misjudged whether it was Chinese or not. However, since the hexadecimal data represented by x, why is it different from the range x4e00-x9fa5 provided in js? So I changed to the following code:

Copy code The code is as follows:

$str = "php Programming";
if (preg_match("/^[x4e00-x9fa5]+$/u",$str)) {
print("The string is all in Chinese");
} else {
print("The string is not all in Chinese");
}


I expected this to succeed, but unexpectedly a warning appeared again:
Warning: preg_match() [function.preg-match]: Compilation failed: invalid UTF-8 string at offset 6 in test.php on line 3
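
One plausible reading of this warning: inside a double-quoted PHP string, \x4e and \x9f are expanded by PHP itself before PCRE ever sees the pattern, so the character class contains the lone byte 0x9f, which is not valid UTF-8. The sketch below only inspects the strings; the variable names are assumptions for illustration.

```php
<?php
// Double quotes: PHP turns \x4e into "N" and \x9f into the raw byte 0x9f.
$expanded = "\x4e00-\x9fa5";
echo bin2hex($expanded), "\n"; // 4e30302d9f6135

// Single quotes: the backslashes survive and are left for PCRE to interpret.
$literal = '\x4e00-\x9fa5';
echo strlen($expanded), ' ', strlen($literal), "\n"; // 7 13
```

Counting from the "^" in the pattern, the byte 0x9f sits at offset 6, which agrees with the offset in the message.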

It seems that there is another wrong expression, so I compared the expression in that article and wrapped "4e00" and "9fa5" with "{" and "}" respectively. I ran it again and found that it was really accurate:

Copy code The code is as follows:

$str = "php programming";
if (preg_match("/^[x{4e00}- x{9fa5}]+$/u",$str)) {
print("This string is all in Chinese");
} else {
print("This string is not all in Chinese ");
}


So the correct expression for matching Chinese characters under UTF-8 encoding in PHP is /^[\x{4e00}-\x{9fa5}]+$/u.

Final summary

The code is as follows:

//if (preg_match("/^[".chr(0xa1)."-".chr(0xff)."]+$/", $str)) { // Can only be used with GB2312
if (preg_match("/^[\x7f-\xff]+$/", $str)) { // Compatible with GB2312 and UTF-8
    echo "Correct input";
} else {
    echo "Wrong input";
}
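
It is worth keeping in mind what this byte range actually accepts. The sketch below, with sample byte sequences chosen as assumptions for illustration, suggests that the pattern passes any run of bytes from 0x7f to 0xff, so it works across encodings but does not distinguish Chinese from other non-ASCII text.

```php
<?php
$pattern = '/^[\x7f-\xff]+$/';

$utf8Chinese = "\xe4\xb8\xad"; // 中 encoded as UTF-8
$gbChinese   = "\xd6\xd0";     // 中 encoded as GB2312
$utf8Accent  = "\xc3\xa9";     // é encoded as UTF-8, not Chinese

var_dump(preg_match($pattern, $utf8Chinese)); // int(1)
var_dump(preg_match($pattern, $gbChinese));   // int(1)
var_dump(preg_match($pattern, $utf8Accent));  // int(1)
```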


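Double-byte character encoding ranges

1. GBK (GB2312/GB18030)
\x00-\xff GBK double-byte encoding range
\x20-\x7f ASCII
\xa1-\xff Chinese (GB2312)
\x80-\xff Chinese (GBK)

2. UTF-8 (Unicode)

\u4e00-\u9fa5 (Chinese)
\x3130-\x318F (Korean)
\xAC00-\xD7A3 (Korean)
\u0800-\u4e00 (Japanese)

In PHP, these UTF-8 ranges have to be written in the \x{...} form together with the /u modifier, for example \x{4e00}-\x{9fa5}.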
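
The UTF-8 ranges from the list above can be tried out as follows. This is only a sketch: the array names and sample characters are assumptions for illustration, the ranges are copied as listed (the Japanese range is broad and overlaps other scripts), and the "\u{...}" string escape assumes PHP 7 or later.

```php
<?php
// Ranges from the list, rewritten in PCRE's \x{...} form with the /u modifier.
$ranges = [
    'Chinese'  => '/^[\x{4e00}-\x{9fa5}]+$/u',
    'Korean'   => '/^[\x{3130}-\x{318f}\x{ac00}-\x{d7a3}]+$/u',
    'Japanese' => '/^[\x{0800}-\x{4e00}]+$/u',
];

// One sample per range, written as code points.
$samples = [
    'Chinese'  => "\u{4e2d}\u{6587}", // 中文
    'Korean'   => "\u{d55c}\u{ae00}", // 한글
    'Japanese' => "\u{3042}",         // あ
];

$results = [];
foreach ($ranges as $label => $pattern) {
    $results[$label] = preg_match($pattern, $samples[$label]);
    printf("%s: %d\n", $label, $results[$label]);
}
```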
