Building a Keyword Matching Project (Search Engine) Step by Step ---- Day 21

Author: WBOY (original)
Published: 2016-07-13 10:19:26, 816 views

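Guest posts: The Underdog's Tricky Form Tool, A Few Things About Databases

Object orientation, going deeper: Understanding Object Orientation ---- A Newcomer's First Look, Object Orientation Side Story ---- Sleepwalking Through Ideas (1), Understanding Object Orientation ---- How to Find the Classes

Load balancing: Load Balancing ---- Understanding the Concepts, Load Balancing ---- Configuration in Practice (Nginx)

 

Rant: I still owe you Understanding Object Orientation ---- Converting Classes, Object Orientation Side Story ---- Sleepwalking Through Ideas (2), Load Balancing ---- File Service Strategies, and this series, Building a Keyword Matching Project (Search Engine) Step by Step. That really is a lot. Can I take a break for a while?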

 

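Day 21

Starting point: Building a Keyword Matching Project (Search Engine) Step by Step ---- Day 1

Recap: Building a Keyword Matching Project (Search Engine) Step by Step ---- Day 20

Today there is one piece of theory to understand: test-driven programming. I introduced the concept earlier, in Building a Keyword Matching Project (Search Engine) Step by Step ---- Day 11.

Today Xiao Shuaishuai had a quirky moment and actually put the idea to use.

Right, the story proper starts below.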

 

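Xiao Shuaishuai showed Boss Yu the business word-splitting method he had written, and Boss Yu was very pleased.

However, the business splitter can only recognize a limited set of phrases, and as its dictionary grows, the running time grows with it.

Boss Yu wondered aloud whether another word-segmentation extension could make up for these shortcomings.

After all, it is built by specialists, so it should be more dependable.

Boss Yu, being experienced, recommended that Xiao Shuaishuai look into SCWS.

SCWS stands for Simple Chinese Word Segmentation (a simple Chinese word segmentation system).
Official site: http://www.xunsearch.com/scws/index.php

Xiao Shuaishuai was delighted, of course: another new thing to learn.

He installed SCWS by following its installation guide.

He also installed the PHP extension and wrote a quick test: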

<?php
class TestSCWS {

    public static function split($keyword){

        // The SCWS PHP extension must be loaded before scws_new() can be called.
        if (!extension_loaded("scws")) {
            throw new Exception("scws extension load fail");
        }

        $so = scws_new();
        $so->set_charset('utf8');

        $so->send_text($keyword);
        $ret = array();
        // Keep fetching batches of tokens until get_result() returns a falsy value.
        while ($res = $so->get_result()) {
            foreach ($res as $tmp) {
                if (self::isValidate($tmp)) {
                    $ret[] = $tmp;
                }
            }
        }
        $so->close();
        return $ret;
    }

    // Reject tokens that are nothing but a carriage return or a line feed.
    public static function isValidate($scws_words)
    {
        if ($scws_words['len'] == 1 && ($scws_words['word'] == "\r" || $scws_words['word'] == "\n")) {
            return false;
        }
        return true;
    }

}


var_dump(TestSCWS::split("连衣裙xxl裙连衣裙"));
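
To see the filtering rule on its own, without needing the SCWS extension installed, here is a minimal sketch. The tokens are written by hand and are not real SCWS output; only the `word` and `len` keys are modelled because those are the two keys the code above reads, and the `len` values are purely illustrative. `filterTokens` is a made-up helper name.

```php
<?php
// Hand-written sample tokens, NOT real SCWS output.
// The 'len' values are illustrative only.
$tokens = array(
    array('word' => '连衣裙', 'len' => 9),
    array('word' => "\n",     'len' => 1),
    array('word' => 'xxl',    'len' => 3),
    array('word' => "\r",     'len' => 1),
);

/**
 * Hypothetical helper: keep every token that is not a bare CR or LF,
 * mirroring the rule used by isValidate().
 */
function filterTokens(array $tokens)
{
    $words = array();
    foreach ($tokens as $token) {
        $isLineBreak = $token['len'] == 1
            && ($token['word'] === "\r" || $token['word'] === "\n");
        if (!$isLineBreak) {
            $words[] = $token['word'];
        }
    }
    return $words;
}

print_r(filterTokens($tokens));
```

Only the two line-break tokens are dropped; everything else passes through in its original order.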

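The test passed, exactly as he had imagined. Xiao Shuaishuai was thrilled and went to ask Boss Yu: "Boss Yu, I know how to use SCWS now. What's next?"

Unhurried, Boss Yu told him: "Start by writing a ScwsSplitter that splits keywords."

Xiao Shuaishuai, pleased to have learned something new, agreed at once.

True to his word, he produced this code: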

class ScwsSplitter {

    public $keyword;

    public function split(){

        if (!extension_loaded("scws")) {
            throw new Exception("scws extension load fail");
        }

        $keywordEntity = new KeywordEntity($this->keyword);

        $so = scws_new();
        $so->set_charset('utf8');

        $so->send_text($this->keyword);

        while ($res = $so->get_result()) {
            foreach ($res as $tmp) {
                if ($this->isValidate($tmp)) {
                    $keywordEntity->addElement($tmp["word"]);
                }
            }
        }
        $so->close();
        return $keywordEntity;
    }

    public function isValidate($scws_words)
    {
        if ($scws_words['len'] == 1 && ($scws_words['word'] == "\r" || $scws_words['word'] == "\n")) {
            return false;
        }
        return true;
    }

}

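Xiao Shuaishuai went back to Boss Yu and said: "I've finished the SCWS segmentation code."

Boss Yu was impressed by his speed.

Then he asked: "Suppose I use both: business splitting first, then SCWS on whatever is left over. Do you have a good way to do that?"

Xiao Shuaishuai asked: "Why do it that way? Isn't that redundant?"

Boss Yu replied: "The business has proper nouns that SCWS cannot segment. What do we do about those?"

Xiao Shuaishuai said: "When I read the docs I saw that SCWS supports custom dictionaries and rule files. Why not use those?"

Boss Yu answered: "That would work, but how do we make sure the operations staff can maintain them? We need to learn to hand this kind of work off."

Xiao Shuaishuai: ......

After a moment of silence, he figured that since both classes were already written, using them together was the quickest route, and agreed: "OK, I'll go back and rework it..."

First, following the test-driven programming idea, he wrote the entry-point code: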

class SplitterApp {

    public static function split($keyword, $cid){

        $keywordEntity = new KeywordEntity($keyword);

        # Business-dictionary splitting
        $termSplitter = new TermSplitter($keywordEntity);
        $seg = new DBSegmentation();
        $seg->cid = $cid;
        $termSplitter->setDictionary($seg->transferDictionary());
        $termSplitter->split();

        # SCWS splitting
        $scwsSplitter = new ScwsSplitter($keywordEntity);
        $scwsSplitter->split();

        # Handle whatever single words or phrases are still left over
        $elementWords = $keywordEntity->getElementWords();
        $remainKeyword = str_replace($elementWords, "::", $keywordEntity->keyword);
        $remainElements = explode("::", $remainKeyword);
        foreach ($remainElements as $element) {
            // Compare against '' rather than empty(), which would also drop "0".
            if ($element !== '') {
                $keywordEntity->addElement($element);
            }
        }
        return $keywordEntity;
    }
}
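
The leftover step at the end of the entry point can be tried in isolation. The sketch below is a minimal, self-contained version of it; `remainingFragments` is a hypothetical helper name, and the word list is passed in directly instead of coming from a KeywordEntity.

```php
<?php
/**
 * Hypothetical helper: return the parts of $keyword not covered by $words.
 * $words is expected to be sorted longest first, as getElementWords() does.
 */
function remainingFragments($keyword, array $words)
{
    // Every known word is replaced by the "::" marker ...
    $masked = str_replace($words, '::', $keyword);
    // ... and the pieces between markers are what is left over.
    $fragments = explode('::', $masked);
    // Drop the empty strings produced by adjacent or leading markers.
    return array_values(array_filter($fragments, function ($fragment) {
        return $fragment !== '';
    }));
}

print_r(remainingFragments('连衣裙xl裙宽衣裙', array('连衣裙', 'xl')));
```

With the two words masked out, the only fragment left is 裙宽衣裙. One limitation of this marker approach: a keyword that itself contains "::" would be split at that point too.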

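Xiao Shuaishuai chuckled: with the test entry point in place, the rest would be no trouble.

First up was getElementWords on KeywordEntity, so he dealt with that first.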

class KeywordEntity
{

    public $keyword;
    public $elements = array();

    public function __construct($keyword)
    {
        $this->keyword = $keyword;
    }

    public function addElement($word, $times = 1)
    {

        if (isset($this->elements[$word])) {
            $this->elements[$word]->times += $times;
        } else {
            $this->elements[$word] = new KeywordElement($word, $times);
        }
    }

    public function getElementWords()
    {
        $elementWords = array_keys($this->elements);
        // Longest words first, so they are replaced before their substrings.
        usort($elementWords, function ($a, $b) {
            return UTF8::length($b) - UTF8::length($a);
        });
        return $elementWords;
    }

    /**
     * @desc Calculate the weight of a word within the keyword.
     *       strlen() counts bytes, so multi-byte characters weigh more than ASCII ones.
     * @param string $word
     * @return float
     */
    public function calculateWeight($word)
    {
        $element = $this->elements[$word];
        return round(strlen($element->word) * $element->times / strlen($this->keyword), 3);
    }
}

class KeywordElement
{
    public $word;
    public $times;

    public function __construct($word, $times)
    {
        $this->word = $word;
        $this->times = $times;
    }
}
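
As a worked example of the weight formula in calculateWeight, the sketch below computes it for one word, once in bytes as the class does and once in characters. Both function names are hypothetical, the character-based variant is my own addition rather than something from the series, and it assumes the mbstring extension and a UTF-8 source file.

```php
<?php
/**
 * Hypothetical stand-alone version of calculateWeight(): the share of the
 * keyword covered by a word, measured in bytes (strlen).
 */
function weightByBytes($keyword, $word, $times)
{
    return round(strlen($word) * $times / strlen($keyword), 3);
}

/** Variant measured in characters; assumes the mbstring extension. */
function weightByChars($keyword, $word, $times)
{
    return round(
        mb_strlen($word, 'UTF-8') * $times / mb_strlen($keyword, 'UTF-8'),
        3
    );
}

$keyword = '连衣裙xxl裙连衣裙';
var_dump(weightByBytes($keyword, '连衣裙', 2));
var_dump(weightByChars($keyword, '连衣裙', 2));
```

In UTF-8 each of these Chinese characters takes 3 bytes, so the keyword is 24 bytes or 10 characters long. The word 连衣裙 appears twice, giving 18 / 24 = 0.75 by bytes and 6 / 10 = 0.6 by characters. Which measure is better depends on whether Chinese text should outweigh ASCII text of the same visible length.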

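Next came the splitting itself. He first extracted a common base class, Splitter. What does it need?

  1. An abstract split method

  2. A way to get the parts of the keyword that still need splitting

  3. A check for whether a fragment needs splitting

Following this list, Xiao Shuaishuai wrote the code below: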

abstract class Splitter {

    /**
     * @var KeywordEntity $keywordEntity
     */
    public $keywordEntity;

    public function __construct($keywordEntity){
        $this->keywordEntity = $keywordEntity;
    }

    public abstract function split();


    /**
     * Get the fragments of the keyword that have not been split yet,
     * filtering out single words.
     *
     * @return array
     */
    public function getRemainKeywords()
    {
        $elementWords = $this->keywordEntity->getElementWords();

        $remainKeyword = str_replace($elementWords, "::", $this->keywordEntity->keyword);
        $remainElements = explode("::", $remainKeyword);
        $ret = array();
        foreach ($remainElements as $element) {
            if ($this->isSplit($element)) {
                $ret[] = $element;
            }
        }
        return $ret;
    }

    /**
     * Whether the fragment still needs splitting.
     *
     * @param $element
     * @return bool
     */
    public function isSplit($element)
    {
        if (UTF8::isPhrase($element)) {
            return true;
        }

        return false;
    }
}
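
The classes above call UTF8::length and UTF8::isPhrase, but the UTF8 helper itself is not shown in this article. The sketch below is a hypothetical stand-in so the code can be experimented with; both method bodies are guesses. In particular, treating a "phrase" as any fragment longer than one character is an assumption based on the docblock of getRemainKeywords, and the code assumes the mbstring extension.

```php
<?php
/**
 * Hypothetical stand-in for the UTF8 helper used by the splitters.
 * The real class is not shown in this article; these bodies are guesses.
 */
class UTF8
{
    /** Number of characters (not bytes) in a UTF-8 string. */
    public static function length($text)
    {
        return mb_strlen((string) $text, 'UTF-8');
    }

    /** Assumption: a "phrase" is any fragment longer than one character. */
    public static function isPhrase($text)
    {
        return self::length($text) > 1;
    }
}

var_dump(UTF8::length('连衣裙'));   // int(3)
var_dump(UTF8::isPhrase('裙'));     // bool(false)
var_dump(UTF8::isPhrase('宽衣裙')); // bool(true)
```

Under this assumption a single leftover character such as 裙 is not handed to the next splitter, while a longer fragment such as 宽衣裙 is.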

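Xiao Shuaishuai then went on to implement the business splitting algorithm and the SCWS splitting algorithm. He smirked: a small job like this was well within his reach.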

class TermSplitter extends Splitter {

    private $dictionary = array();

    public function setDictionary($dictionary = array())
    {
        // Longest phrases first, so they are matched before their substrings.
        usort($dictionary, function ($a, $b) {
            return UTF8::length($b) - UTF8::length($a);
        });

        $this->dictionary = $dictionary;
    }

    public function getDictionary()
    {
        return $this->dictionary;
    }

    /**
     * Split the keyword into phrases or single words, recording each
     * match on the shared KeywordEntity.
     *
     * @return void
     */
    public function split()
    {
        foreach ($this->dictionary as $phrase) {
            $remainKeyword = implode("::", $this->getRemainKeywords());
            // Escape regex metacharacters in the phrase; "u" enables UTF-8 matching.
            $pattern = "/" . preg_quote($phrase, "/") . "/u";
            $matchTimes = preg_match_all($pattern, $remainKeyword, $matches);
            if ($matchTimes > 0) {
                $this->keywordEntity->addElement($phrase, $matchTimes);
            }
        }
    }
}


class ScwsSplitter extends Splitter
{
    public function split()
    {
        if (!extension_loaded("scws")) {
            throw new Exception("scws extension load fail");
        }

        // Only the fragments the business splitter left behind are sent to SCWS.
        $remainElements = $this->getRemainKeywords();
        foreach ($remainElements as $element) {

            $so = scws_new();
            $so->set_charset('utf8');
            $so->send_text($element);
            while ($res = $so->get_result()) {
                foreach ($res as $tmp) {
                    if ($this->isValidate($tmp)) {
                        $this->keywordEntity->addElement($tmp['word']);
                    }
                }
            }
            $so->close();
        }
    }

    /**
     * @param array $scws_words
     * @return bool
     */
    public function isValidate($scws_words)
    {
        if ($scws_words['len'] == 1 && ($scws_words['word'] == "\r" || $scws_words['word'] == "\n")) {
            return false;
        }
        return true;
    }

}
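
TermSplitter counts dictionary phrases with a regular expression, so a phrase containing characters such as + or . needs escaping before it goes into the pattern. Here is a minimal sketch of that counting step on its own; `countPhrase` is a hypothetical helper name and the sample strings are made up.

```php
<?php
/**
 * Hypothetical helper: count how often a dictionary phrase occurs in $text.
 */
function countPhrase($phrase, $text)
{
    // preg_quote() escapes regex metacharacters; its second argument also
    // escapes the "/" delimiter. The "u" flag treats both strings as UTF-8.
    $pattern = '/' . preg_quote($phrase, '/') . '/u';
    return (int) preg_match_all($pattern, $text);
}

var_dump(countPhrase('连衣裙', '连衣裙xl裙连衣裙')); // int(2)
var_dump(countPhrase('c++', 'c++入门 c++进阶'));      // int(2)
```

Without preg_quote, a phrase like c++ is not even a valid pattern. For purely literal phrases, substr_count() would give the same count without involving a regex at all.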

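Xiao Shuaishuai finally got all of this code done and, in high spirits, drew a UML diagram to share with everyone:

[Figure: UML diagram]

Xiao Shuaishuai really was growing fast. After looking it over, Boss Yu praised him three times in a row.

To test it, he wrote the following test code: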

class SplitterAppTest {

    public static function split($keyword){

        $keywordEntity = new KeywordEntity($keyword);

        # Business-dictionary splitting
        $termSplitter = new TermSplitter($keywordEntity);
        $seg = new TestSegmentation();
        $termSplitter->setDictionary($seg->transferDictionary());
        $termSplitter->split();

        # SCWS splitting
        $scwsSplitter = new ScwsSplitter($keywordEntity);
        $scwsSplitter->split();

        # Handle whatever single words or phrases are still left over
        $elementWords = $keywordEntity->getElementWords();
        $remainKeyword = str_replace($elementWords, "::", $keywordEntity->keyword);
        $remainElements = explode("::", $remainKeyword);
        foreach ($remainElements as $element) {
            // Compare against '' rather than empty(), which would also drop "0".
            if ($element !== '') {
                $keywordEntity->addElement($element);
            }
        }
        return $keywordEntity;
    }
}


SplitterAppTest::split("连衣裙xl裙宽衣裙");
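
The test above relies on a TestSegmentation class that is not shown in this article. A minimal stub might look like the sketch below; the class body and the word list are made up for illustration, and the only thing taken from the article is that transferDictionary() must return a list of phrases for setDictionary().

```php
<?php
/**
 * Hypothetical stub: the article does not show TestSegmentation, so the
 * word list and the class body here are invented for illustration.
 */
class TestSegmentation
{
    /** Return a fixed business dictionary instead of querying a database. */
    public function transferDictionary()
    {
        return array('连衣裙', 'xl');
    }
}

$seg = new TestSegmentation();
print_r($seg->transferDictionary());
```

Swapping DBSegmentation for a fixed stub like this keeps the test independent of the database and of the category id, which is what makes the entry point easy to exercise first and fill in later.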

Xiao Shuaishuai daydreamed: one day, I'll have you all under my feet.

 
