
Word segmentation algorithm: PHP unary (single-character) word segmentation

WBOY
2016-07-29 08:41:19

The code is as follows:


/**
 * Unary (single-character) word segmentation.
 *
 * UTF-8 character widths are read from the first byte: a value of 192 (0xC0)
 * or less is treated as a 1-byte character, a value above 192 and below
 * 224 (0xE0) starts a 2-byte character, and anything else is treated as a
 * 3-byte character.
 *
 * For MySQL full-text search to index single-character tokens, set
 * ft_min_word_len=1 in MySQL's my.ini file. Run the query
 * SHOW VARIABLES LIKE '%ft%' to view the full-text search settings.
 *
 * @access global
 * @param string  $str    Text to segment
 * @param boolean $unique Whether to remove duplicate values
 * @param boolean $merge  Whether to merge the additional values into the result
 * @return array
 */
function seg_word($str, $unique = false, $merge = true)
{
    $str = trim(strip_tags($str));
    $strlen = strlen($str);
    if ($strlen == 0) return array();
    $spc = ' ';
    // Add characters to be filtered as needed
    $search = array(
        // ASCII punctuation and control characters
        ',', '/', '\\', '.', ';', ':', '\'', '!', '~', '"', '`', '^', '(', ')', '?', '-', "\t", "\n", '\'', '<', '>', "\r", "\r\n", '$', '&', '%', '#', '@', '+', '=', '{', '}', '[', ']',
        // Full-width (CJK) punctuation (assumed set; extend as needed)
        ')', '(', '.', '。', ',', '!', ';', '“', '”', '‘', '’', '、', '—', ' ', '《', '》', '-', '…', '【', '】', ':',
    );
    // Arabic digit => Chinese numeral
    $numpairs = array('1'=>'一', '2'=>'二', '3'=>'三', '4'=>'四', '5'=>'五', '6'=>'六', '7'=>'七', '8'=>'八', '9'=>'九', '0'=>'零');
    // alab_num() is not defined in this listing; it is assumed to be a helper
    // provided elsewhere, so it is only called when it exists.
    if (function_exists('alab_num')) {
        $str = alab_num($str);
    }
    $str = str_replace($search, ' ', $str);
    $ord = $i = $k = 0;
    $prechar = 0; // 0 = blank, 1 = English and symbols, 2 = Chinese
    $result = array();
    $annex = array();
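    // isset() stops the loop cleanly at the end of the string
    while (isset($str[$i]) && ($ord = ord($str[$i])))
    {
        // 1-byte character
        if ($ord <= 0xC0)
        {
            // Skip whitespace and control characters; this closes the current token
            if ($ord < 33) {
                $prechar = 0;
                $i++;
                $k++;
                continue;
            }
            // For an Arabic digit, record the matching Chinese numeral as an additional value
            if (isset($numpairs[$str[$i]])) {
                $annex[] = $numpairs[$str[$i]];
            }
            // If the preceding character is Chinese, start a new token
            if ($prechar == 2) {
                $result[++$k] = $str[$i];
            }
            else {
                // Otherwise append to the current token, creating it if needed
                $result[$k] = (isset($result[$k]) ? $result[$k] : '') . $str[$i];
            }
            $prechar = 1;
            $i++;
        }
        else // 2- or 3-byte character (Chinese)
        {
            if ($ord < 0xE0)
                $step = 2;
            else
                $step = 3;
            $c = substr($str, $i, $step);
            // For a Chinese numeral, record the matching Arabic digit as an additional value
            if (false !== $key = array_search($c, $numpairs)) {
                $annex[] = $key;
            }
            // Each Chinese character becomes its own token
            if ($prechar != 0) {
                $result[++$k] = $c;
            }
            else {
                $result[$k] = (isset($result[$k]) ? $result[$k] : '') . $c;
            }
            $prechar = 2;
            $i += $step;
        }
    }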
    $result = $merge ? array_merge($result, $annex) : $result;
    return $unique ? array_unique($result) : $result;
}
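
The listing above treats every multi-byte character as either 2 or 3 bytes wide, so a 4-byte UTF-8 sequence (lead byte 0xF0 or higher, for example an emoji) would be cut in the wrong place. The following is a minimal sketch of how the lead-byte test could be extended to cover that case. The helper names utf8_seq_len() and utf8_chars() are hypothetical and are not part of the original code; the sketch assumes PHP 7 or later and valid UTF-8 input.

```php
<?php
/**
 * Return the byte length of the UTF-8 sequence that starts with $byte.
 * Hypothetical helper, not part of the original listing.
 */
function utf8_seq_len(int $byte): int
{
    if ($byte < 0x80) return 1;                  // 0xxxxxxx: ASCII
    if ($byte >= 0xC0 && $byte < 0xE0) return 2; // 110xxxxx
    if ($byte >= 0xE0 && $byte < 0xF0) return 3; // 1110xxxx
    if ($byte >= 0xF0 && $byte < 0xF8) return 4; // 11110xxx
    return 1; // continuation or invalid byte: step over it one byte at a time
}

/**
 * Split a UTF-8 string into an array of characters.
 * Hypothetical helper; assumes the input is valid UTF-8.
 */
function utf8_chars(string $str): array
{
    $chars = [];
    $len = strlen($str);
    for ($i = 0; $i < $len; $i += $step) {
        $step = utf8_seq_len(ord($str[$i]));
        $chars[] = substr($str, $i, $step);
    }
    return $chars;
}

// Three ASCII letters followed by two 3-byte Chinese characters
print_r(utf8_chars("PHP分词"));
```

In seg_word() the same idea would amount to replacing the two-way choice of $step with a call to such a helper. Whether 4-byte characters should then be tokenized like Chinese characters is a design decision the original code does not address.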

