
A Simple Chinese Word Segmentation System in PHP (1/2)

WBOY | Original | 2016-06-13 11:24:25

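Structure of the simple PHP Chinese word segmentation system: a first-character hash table plus trie index tree nodes. Advantage: during segmentation the length of the word being looked up does not need to be known in advance, because matching proceeds character by character along the branches of the tree. Disadvantage: the tree is relatively complex to build and maintain, and its many branches waste some space.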
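To make the structure concrete, here is a minimal sketch of what such a trie might look like as nested PHP arrays, and how a lookup can walk it without knowing the word length beforehand. The two sample words and the `prefixWords` helper are illustrative assumptions, not part of the original class. The root's `children` array plays the role of the first-character hash table, since PHP arrays are hash tables.

```php
<?php
// Hypothetical hand-built trie holding the words "ab" and "abc".
// Each node has a 'children' map keyed by the next byte and an
// 'isword' flag that marks the end of a dictionary word.
$trie = array(
    'isword'   => false,
    'children' => array(
        'a' => array(
            'isword'   => false,
            'children' => array(
                'b' => array(
                    'isword'   => true, // "ab" ends here
                    'children' => array(
                        'c' => array('isword' => true, 'children' => array()), // "abc"
                    ),
                ),
            ),
        ),
    ),
);

// Illustrative helper: follow $text from the root and collect every
// dictionary word that is a prefix of $text.
function prefixWords(array $trie, $text)
{
    $node  = $trie;
    $found = array();
    $len   = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        $ch = $text[$i];
        if (!isset($node['children'][$ch])) {
            break; // no branch for this byte, so stop walking
        }
        $node = $node['children'][$ch];
        if ($node['isword']) {
            $found[] = substr($text, 0, $i + 1);
        }
    }
    return $found;
}

print_r(prefixWords($trie, 'abcd')); // prints the words "ab" and "abc"
```

The walk stops as soon as no branch matches, so the cost depends on the length of the matched prefix rather than on the size of the dictionary.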

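<?php
/**
 * PHP tutorial: a simple Chinese word segmentation system
 *
 * Structure: first-character hash table, trie index tree nodes.
 * Advantage: during segmentation the length of the word being looked up
 *            does not need to be known in advance; matching proceeds
 *            character by character along the tree.
 * Disadvantage: relatively complex to build and maintain; the many branches
 *               of the word tree waste some space.
 *
 * @version 0.1
 * @todo Build a general-purpose dictionary algorithm and write a simple word segmenter
 * @author shjuto@gmail.com
 * Trie dictionary tree
 *
 */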

class trie
{
        // Root node of the dictionary tree
        private $trie;

        function __construct()
        {
                // Initialize the root node on the instance
                $this->trie = array('children' => array(), 'isword' => false);
        }

        /**
         * Add a word to the dictionary.
         *
         * @param string $word
         */
        function setword($word = '')
        {
                $trienode = &$this->trie;
                $length = strlen($word);
                for ($i = 0; $i < $length; $i++)
                {
                        $character = $word[$i];
                        if (!isset($trienode['children'][$character]))
                        {
                                $trienode['children'][$character] = array('children' => array(), 'isword' => false);
                        }
                        if ($i == $length - 1)
                        {
                                // Mark the end of the word while keeping any existing children
                                $trienode['children'][$character]['isword'] = true;
                        }
                        $trienode = &$trienode['children'][$character];
                }
        }

        /**
         * Check whether $word is a dictionary word.
         *
         * @param string $word
         * @return bool true/false
         */
        function isword($word)
        {
                $trienode = $this->trie;
                $length = strlen($word);
                for ($i = 0; $i < $length; $i++)
                {
                        $character = $word[$i];
                        if (!isset($trienode['children'][$character]))
                        {
                                return false;
                        }
                        $trienode = $trienode['children'][$character];
                }
                // End of the word: it is a dictionary word only if this node is marked as one
                return $length > 0 && !empty($trienode['isword']);
        }


        /**
         * Find the positions at which dictionary words occur in $text.
         *
         * @param string $text
         * @return array list of array('position' => $position, 'word' => $word);
         *               positions are byte offsets into $text
         */
        function search($text = "")
        {
                $textlen = strlen($text);
                $trienode = $tree = $this->trie;
                $find = array();
                $wordrootposition = 0; // start position of the current candidate word
                $prenode = false; // backtracking flag: with "ab" in the dictionary and the text "aab", $i must step back once
                $word = '';
                for ($i = 0; $i < $textlen; $i++)
                {
                        if (isset($trienode['children'][$text[$i]]))
                        {
                                $word = $word . $text[$i];
                                $trienode = $trienode['children'][$text[$i]];
                                if ($prenode == false)
                                {
                                        $wordrootposition = $i;
                                }
                                $prenode = true;
                                if ($trienode['isword'])
                                {
                                        $find[] = array('position' => $wordrootposition, 'word' => $word);
                                }
                        }
                        else
                        {
                                // Mismatch: return to the root and retry the current byte if a match was in progress
                                $trienode = $tree;
                                $word = '';
                                if ($prenode)
                                {
                                        $i = $i - 1;
                                        $prenode = false;
                                }
                        }
                }
                return $find;
        }
}
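
The class above indexes strings with `$word[$i]`, which reads one byte at a time, so a Chinese character in UTF-8 occupies several tree levels and reported positions are byte offsets. As a rough sketch of where the `@todo` might lead, the following character-level variant assumes valid UTF-8 input and segments text with forward maximum matching, a technique swapped in here for illustration rather than taken from the original. The names `Utf8Trie`, `add`, and `segment` are hypothetical.

```php
<?php
// Hypothetical character-level trie; assumes valid UTF-8 input.
class Utf8Trie
{
    private $root = array('children' => array(), 'isword' => false);

    // Split a UTF-8 string into an array of characters.
    private static function chars($s)
    {
        $parts = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
        return $parts === false ? array() : $parts; // false means invalid UTF-8
    }

    // Add a word to the dictionary, one character per tree level.
    public function add($word)
    {
        $chars = self::chars($word);
        if (count($chars) === 0) {
            return; // ignore empty or invalid input
        }
        $node = &$this->root;
        foreach ($chars as $ch) {
            if (!isset($node['children'][$ch])) {
                $node['children'][$ch] = array('children' => array(), 'isword' => false);
            }
            $node = &$node['children'][$ch];
        }
        $node['isword'] = true;
    }

    // Forward maximum matching: at each position take the longest
    // dictionary word, otherwise emit a single character.
    public function segment($text)
    {
        $chars = self::chars($text);
        $n     = count($chars);
        $out   = array();
        $i     = 0;
        while ($i < $n) {
            $node = $this->root;
            $best = 1; // fall back to one character
            for ($j = $i; $j < $n && isset($node['children'][$chars[$j]]); $j++) {
                $node = $node['children'][$chars[$j]];
                if ($node['isword']) {
                    $best = $j - $i + 1; // remember the longest match so far
                }
            }
            $out[] = implode('', array_slice($chars, $i, $best));
            $i += $best;
        }
        return $out;
    }
}

$trie = new Utf8Trie();
foreach (array('中文', '分词', '中文分词', '系统') as $w) {
    $trie->add($w);
}
print_r($trie->segment('简单中文分词系统')); // 简, 单, 中文分词, 系统
```

Forward maximum matching is only one possible strategy; it always prefers the longest word at each position, which can mis-segment ambiguous text.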
