Thoughts on a UTF-8 detection regular expression (PHP version), prompted by an encoding-detection problem

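Background:

Recently I ran into a case where an interface had to accept input that might be encoded as either UTF-8 or GBK. Anyone who has worked on encoding conversion knows that a string carries no marker saying which encoding it uses. UTF-8 has a distinctive byte structure, though, so it can be checked with a regular expression. The plan: if the input is detected as UTF-8, convert it; if not, treat it as GBK. For common encoding problems, see: "Digging in from garbled text in web programs (BOM headers, character sets, and mojibake)".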
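The plan described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the helper name `to_utf8_assuming_utf8_or_gbk` is hypothetical, it normalizes to UTF-8 (the direction is an assumption), and it relies on the mbstring extension being available. As the rest of this article shows, a check of this kind cannot be fully reliable.

```php
<?php
// Hypothetical helper (not from the article): normalize input to UTF-8,
// assuming the only possible source encodings are UTF-8 and GBK.
function to_utf8_assuming_utf8_or_gbk(string $input): string
{
    // mb_check_encoding() reports whether the bytes are well-formed UTF-8.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    // Anything that is not well-formed UTF-8 is treated as GBK.
    return mb_convert_encoding($input, 'UTF-8', 'GBK');
}

$gbk  = "\xD6\xD0\xB9\xFA";          // "中国" written out as GBK bytes
$utf8 = "\xE4\xB8\xAD\xE5\x9B\xBD";  // "中国" written out as UTF-8 bytes

var_dump(to_utf8_assuming_utf8_or_gbk($gbk) === $utf8);
var_dump(to_utf8_assuming_utf8_or_gbk($utf8) === $utf8);
```

Writing the sample strings as byte escapes keeps the example independent of the encoding the source file is saved in.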

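Action:

Knowing the principle, I took on the task right away and got to work. I remembered that PHP has the mbstring module, which can detect and convert encodings: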

<?php
// this file is saved as GBK
$str = "中国";
$aStrList = array($str, iconv('gbk', 'utf-8', $str));
foreach ($aStrList as $v) {
    echo mb_convert_encoding($v, 'gbk', 'utf-8,gbk'), "\r\n";
}

Output:

[Figure: screenshot of the script output]

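Two copies of "中国" in different encodings are both converted to GBK by the single function mb_convert_encoding. It first tries to decode the input as UTF-8, and if that fails it falls back to GBK. It looked like the problem was solved, haha, time to hand it in...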

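Problem:

After the release things were quiet for a few days, then a report came in: the Chinese text "袁小" was decoded incorrectly. ⊙﹏⊙b (sweat)... I wondered: is PHP's built-in detection module broken, or am I missing something?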

[Figure: screenshot of the test output for "袁小"]

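⊙﹏⊙b (sweat)... So there really is a problem. Checking the manual: <strong>the mbstring module's encoding detection only examines part of the string; as soon as the bytes match some character set, it concludes the string is in that encoding. This is not a bug in mbstring: a string carries no encoding identifier of its own, so no language can detect the encoding with complete certainty.</strong>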


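Question:

Could I write my own regular expression for the check and see how it behaves? To write one, I first need to understand the UTF-8 encoding rules; see: http://zh.wikipedia.org/zh/UTF-8

The encoding currently has only these 6 byte-length classes. PHP code to compute the range of each class: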

<?php
// compute the byte range of each UTF-8 class
echo base_convert('1111111', 2, 16), "\r\n"; // class 1
echo base_convert('10000000', 2, 16), base_convert('10111111', 2, 16), "\r\n"; // continuation bytes
echo base_convert('11000000', 2, 16), base_convert('11011111', 2, 16), "\r\n"; // class 2
echo base_convert('11100000', 2, 16), base_convert('11101111', 2, 16), "\r\n"; // class 3
echo base_convert('11110000', 2, 16), base_convert('11110111', 2, 16), "\r\n"; // class 4
echo base_convert('11111000', 2, 16), base_convert('11111011', 2, 16), "\r\n"; // class 5
echo base_convert('11111100', 2, 16), base_convert('11111101', 2, 16), "\r\n"; // class 6

Output:

7f
80bf
c0df
e0ef
f0f7
f8fb
fcfd

From the 6 classes above we get the corresponding regular expression:

[\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\xf8-\xfb][\x80-\xbf]{4}|[\xfc-\xfd][\x80-\xbf]{5}

Each alternative above covers the byte range of one class.
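To make the mapping from lead byte to sequence length concrete, here is a small sketch. The function name `utf8_sequence_length` is hypothetical. It follows the six classes used in this article; note that the current UTF-8 definition (RFC 3629) only permits sequences of up to 4 bytes, so classes 5 and 6 are kept here purely to mirror the regular expression above. Unlike the regular expression, which starts at \x01, the sketch also counts the NUL byte as a one-byte character.

```php
<?php
// Hypothetical helper: expected sequence length for a lead byte, following
// the six classes listed in the article (the historical 1-6 byte scheme).
function utf8_sequence_length(int $byte): int
{
    if ($byte >= 0x00 && $byte <= 0x7F) return 1; // 0xxxxxxx
    if ($byte >= 0xC0 && $byte <= 0xDF) return 2; // 110xxxxx
    if ($byte >= 0xE0 && $byte <= 0xEF) return 3; // 1110xxxx
    if ($byte >= 0xF0 && $byte <= 0xF7) return 4; // 11110xxx
    if ($byte >= 0xF8 && $byte <= 0xFB) return 5; // 111110xx
    if ($byte >= 0xFC && $byte <= 0xFD) return 6; // 1111110x
    // 0x80-0xBF are continuation bytes; 0xFE and 0xFF never appear in UTF-8.
    return 0;
}

// The lead byte 0xE4 opens a three-byte sequence.
echo utf8_sequence_length(ord("\xE4")), "\n";
```

A return value of 0 means the byte cannot start a sequence, which is the case the regular expression rejects by having no alternative that begins with such a byte.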

<?php
// this file is saved as GBK
$str = "袁";
echo urlencode($str);
echo is_utf8($str);

function is_utf8($str)
{
    /// UTF-8 detection function based on a regular expression
    /// copyright qq:8292669  http://www.cnblogs.com/chengmo
    $re = '/^([\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\xf8-\xfb][\x80-\xbf]{4}|[\xfc-\xfd][\x80-\xbf]{5})+$/';
    return preg_match($re, $str);
}

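<strong><span style="line-height: 1.5; color: #ff0000;">The code above returns 1, yet "袁" here is actually GBK-encoded. So this function still cannot check for UTF-8 conclusively. The reason is visible in the regular expression: the 6 UTF-8 classes correspond to sequences of 1 to 6 bytes, while GBK uses 1 to 2 bytes, so the two encodings overlap at lengths of 1 and 2 bytes. At 1 byte, GBK and UTF-8 map the same byte values to the same characters; at 2 bytes, the same byte values map to different characters in each encoding.</span></strong>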
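The overlap can be reproduced without depending on how the source file is saved by writing the bytes out explicitly. In this sketch the byte values are my assumption about the GBK encoding of the two characters (D4 AC for "袁", D6 D0 for "中"); the pattern is the one from this article, unchanged.

```php
<?php
// The pattern from the article, unchanged.
$re = '/^([\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\xf8-\xfb][\x80-\xbf]{4}|[\xfc-\xfd][\x80-\xbf]{5})+$/';

$samples = [
    'D4AC' => "\xD4\xAC", // assumed GBK bytes for "袁": lead in C0-DF, trail in 80-BF
    'D6D0' => "\xD6\xD0", // assumed GBK bytes for "中": trail byte D0 is outside 80-BF
];

foreach ($samples as $hex => $bytes) {
    // preg_match() returns 1 when the bytes look like UTF-8 to this pattern.
    echo $hex, ' => ', preg_match($re, $bytes), "\n";
}
```

The first pair has the shape of a two-byte UTF-8 sequence and is accepted; the second is rejected because its second byte cannot be a UTF-8 continuation byte.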

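Checking the GBK code table at http://www.knowsky.com/resource/gb2312tbl.htm confirms this further. The affected range is:

[c0-df][a0-bf]: every Chinese character whose bytes fall in this range is affected. <strong>If a string consists purely of characters from this range, its encoding cannot be determined. If such characters are combined with characters from other ranges, the string can be identified correctly.</strong>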
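One way to act on this observation is to flag the undecidable inputs explicitly instead of guessing. The sketch below is only an illustration: the function name `is_ambiguous_gbk_utf8` is hypothetical, the byte range is taken directly from the text above, and ASCII is allowed anywhere because it is identical in both encodings.

```php
<?php
// Hypothetical helper: true when every multi-byte unit in $bytes is a pair
// from the [C0-DF][A0-BF] range named in the article, so a UTF-8 pattern
// match alone cannot tell GBK from UTF-8.
function is_ambiguous_gbk_utf8(string $bytes): bool
{
    // Optional ASCII, then one or more overlapping pairs, each optionally
    // followed by more ASCII. The D modifier makes $ match only at the very end.
    $re = '/^[\x01-\x7f]*(?:[\xc0-\xdf][\xa0-\xbf][\x01-\x7f]*)+$/D';
    return preg_match($re, $bytes) === 1;
}

var_dump(is_ambiguous_gbk_utf8("\xD4\xAC\xD0\xA1")); // two pairs from the overlapping range
var_dump(is_ambiguous_gbk_utf8("\xD6\xD0"));         // trail byte D0 is outside A0-BF
var_dump(is_ambiguous_gbk_utf8("abc"));              // pure ASCII has nothing to disambiguate
```

A caller could route inputs for which this returns true to a fallback, such as a default encoding agreed with the sender, rather than trusting either detection result.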


The characters where the GBK and UTF-8 character sets overlap are: (GBK code table)
