PHP Chinese Word Segmentation - jerrylsxu

WBOY, 2016-05-20 10:14:55

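The most common approach to word segmentation is bigram splitting, which pairs each Chinese character with the one that follows it: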

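$str = '这是我的网站www.7di.net!';
//$str = iconv('GB2312', 'UTF-8', $str); // uncomment if the input is GB2312
$result = spStr($str);
print_r($result);

/**
 * Chinese bigram segmentation (UTF-8 version).
 * Returns numbers first, then Latin words, then two-character Chinese tokens.
 */
function spStr($str)
{
    $cstr = array();

    // Punctuation treated as separators. The entries through "'" come from the
    // original list; the remainder is a representative set (assumed).
    $search = array(",", "/", "\\", ".", ";", ":", "\"", "!", "~", "`", "^", "(", ")", "?", "-", "\t", "\n", "'",
        "<", ">", "\r", ",", "。", "!", "?", ";", ":", "、", "(", ")", "《", "》", "【", "】", "“", "”", "‘", "’", "…");

    $str = str_replace($search, ' ', $str);
    preg_match_all('/[a-zA-Z]+/', $str, $estr); // Latin words
    preg_match_all('/[0-9]+/', $str, $nstr);    // numbers

    // Remove the alphanumeric runs, then collapse repeated whitespace.
    $str = preg_replace('/[0-9a-zA-Z]+/', ' ', $str);
    $str = preg_replace('/\s{2,}/', ' ', $str);

    $str = explode(' ', trim($str));

    foreach ($str as $s) {
        $l = strlen($s);

        // Each Chinese character is assumed to occupy 3 bytes in UTF-8.
        for ($i = 0; $i < $l; $i = $i + 3) {
            $ns1 = substr($s, $i, 3);
            if (isset($s[$i + 3])) {
                $ns2 = substr($s, $i + 3, 3);
                if (preg_match('/[\x80-\xff]{3}/', $ns2)) $cstr[] = $ns1 . $ns2;
            } elseif ($i == 0) {
                // A single isolated character is kept as its own token.
                $cstr[] = $ns1;
            }
        }
    }

    $estr = isset($estr[0]) ? $estr[0] : array();
    $nstr = isset($nstr[0]) ? $nstr[0] : array();

    return array_merge($nstr, $estr, $cstr);
}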

The output is:

Array ( [0] => 7 [1] => www [2] => di [3] => net [4] => 这是 [5] => 是我 [6] => 我的 [7] => 的网 [8] => 网站 )
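The byte-offset loop in spStr() relies on every Chinese character being exactly 3 bytes long in UTF-8. As a minimal sketch of the same bigram rule applied per character instead of per byte, the hypothetical helper below (bigrams() is not part of the original article) splits the string with PCRE's `u` modifier and then pairs neighbouring characters. It assumes the input has already had punctuation and alphanumerics removed, as spStr() does before its loop.

```php
<?php
/**
 * Hypothetical helper (not from the original article): split a UTF-8 string
 * into overlapping two-character tokens, working per character, not per byte.
 */
function bigrams(string $text): array
{
    // With the "u" modifier PCRE treats the subject as UTF-8, so an empty
    // pattern splits between characters whatever their byte length.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return array(); // input was not valid UTF-8
    }

    $count = count($chars);

    // Same rule as spStr(): a lone character is kept as its own token.
    if ($count === 1) {
        return $chars;
    }

    $out = array();
    for ($i = 0; $i < $count - 1; $i++) {
        $out[] = $chars[$i] . $chars[$i + 1];
    }
    return $out;
}

// Prints the five tokens 这是, 是我, 我的, 的网, 网站
print_r(bigrams('这是我的网站'));
```

For the sample phrase this yields the same five Chinese tokens as the listing above, and it would also cope with characters that are not 3 bytes long.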

Next, convert the result above to GB2312 qu-wei (row-cell) codes. The PHP code is:

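$code = array();
foreach ($result as $s) {
    // gbCode() expects GB2312 bytes, so convert each token from UTF-8 first.
    $s = iconv('UTF-8', 'GB2312', $s);
    $code[] = gbCode($s);
}
$code = implode(' ', $code); // space separator assumed
echo $code;

/**
 * Convert a GB2312-encoded string to its qu-wei code:
 * four decimal digits per character (row, then cell).
 */
function gbCode($str) {
    $return = '';

    // Tokens that are not made up entirely of high bytes (numbers, Latin words) pass through unchanged.
    if (!preg_match('/^[\x80-\xff]{2,}$/', $str)) return $str;

    $len = strlen($str);
    for ($i = 0; $i < $len; $i = $i + 2) {
        // Each GB2312 byte is the row or cell number plus 160 (0xA0).
        $return .= sprintf('%02d%02d', ord($str[$i]) - 160, ord($str[$i + 1]) - 160);
    }

    return $return;
}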
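To make the arithmetic concrete, here is a small worked example for a single character. In GB2312 the character 啊 is encoded as the bytes 0xB0 0xA1; subtracting 0xA0 (160) from each byte gives row 16 and cell 01, so its qu-wei code is 1601. The helper name quwei() is hypothetical and used only for illustration, and the bytes are written as escapes so the sketch does not depend on iconv.

```php
<?php
/**
 * Hypothetical helper (not from the original article): qu-wei code of one
 * two-byte GB2312 character.
 */
function quwei(string $gbChar): string
{
    $qu  = ord($gbChar[0]) - 0xA0; // row number
    $wei = ord($gbChar[1]) - 0xA0; // cell number within the row
    return sprintf('%02d%02d', $qu, $wei);
}

// 0xB0 0xA1 is 啊 in GB2312: 176 - 160 = 16, 161 - 160 = 1
echo quwei("\xB0\xA1"), "\n"; // prints 1601
```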
