
Teach You Step by Step How to Build a Keyword Matching Project (Search Engine) ---- Day 20

WBOY
Original
2016-07-13 10:19:35 (888 views)


Guest appearances: Diaosi's form-handling tool, a few things about databases

Advanced object orientation: Understanding Object Orientation - A Newcomer's First Encounter, Object Orientation Extra - Wandering Thoughts (1), Understanding Object Orientation - How to Identify Classes

Load Balancing: Load Balancing - Concept Understanding, Load Balancing - Implementation Configuration (Nginx)

Complaints: Some readers have reported that the articles become harder to understand toward the end and that they cannot keep up with the pace. Others have asked why Xiao Shuai Shuai's skills are improving so quickly, and wondered whether they themselves were just slow. Some only read the text and skip the code, because the code is too hard to follow.

I have been thinking about this over the past few days, which is why I added some articles on object-oriented programming. I hope they help those who are struggling to keep up. To be honest, without feedback from readers I can only pace the series according to my own idea of what Xiao Shuai Shuai is capable of.

Day 20

Starting point: Teach you step by step how to do keyword matching project (search engine) ---- Day 1

Review: Teach you step by step how to do keyword matching project (search engine) ---- Day 19

Last time, Xiao Shuai Shuai wrote a first version of the word segmentation algorithm. When he showed it to Boss Yu, he was told to rewrite it.

The reasons are as follows:

1. How can it be tested, and with what test data?

2. Is Splitter doing too much?

3. What happens when a keyword contains a repeated phrase, as in "dress xxl dress"?

Xiao Shuai Shuai took these questions away and began to refactor.

First, he noticed that detecting whether a string is Chinese, English, or a mix of both, and calculating its length, formed a concern of their own. He extracted them into a class:

<?php

class UTF8 {

    /**
     * Check whether the string is UTF-8. In practice this tests for Chinese
     * characters: three-byte sequences whose lead byte is 228-233 (0xE4-0xE9),
     * found at the start, at the end, or as a run of two or more.
     * @param string $char
     * @return bool
     */
    public static function is($char){
        return (preg_match("/^([".chr(228)."-".chr(233)."]{1}[".chr(128)."-".chr(191)."]{1}[".chr(128)."-".chr(191)."]{1}){1}/",$char) ||
            preg_match("/([".chr(228)."-".chr(233)."]{1}[".chr(128)."-".chr(191)."]{1}[".chr(128)."-".chr(191)."]{1}){1}$/",$char) ||
            preg_match("/([".chr(228)."-".chr(233)."]{1}[".chr(128)."-".chr(191)."]{1}[".chr(128)."-".chr(191)."]{1}){2,}/",$char));
    }

    /**
     * Count the number of UTF-8 characters.
     * Approximation: text containing Chinese is assumed to use three bytes per
     * character, so mixed Chinese/ASCII strings are undercounted.
     * @param string $char
     * @return float|int
     */
    public static function length($char) {

        if(self::is($char))
            return ceil(strlen($char)/3);
        return strlen($char);
    }

    /**
     * Check whether the word is a phrase (more than one character long).
     * @param string $word
     * @return bool
     */
    public static function isPhrase($word){
        return self::length($word) > 1;
    }

}
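The length heuristic above divides the byte length by three, which is exact for pure Chinese text but not for mixed text: a string such as "xxl裙" is 6 bytes, so it is reported as 2 characters rather than 4. As a minimal sketch of counting characters instead of bytes, assuming valid UTF-8 input and a PCRE build with UTF-8 support, one could write the following. The helper name `utf8_char_count` is hypothetical and is not part of the author's class.

```php
<?php

/**
 * Hypothetical helper (not part of the original UTF8 class):
 * counts UTF-8 characters instead of bytes.
 * Assumes the input is valid UTF-8.
 */
function utf8_char_count($text) {
    // With the "u" modifier, "." matches one code point; "s" lets it match newlines too.
    $count = preg_match_all('/./us', $text, $unused);
    // preg_match_all() returns false on error, for example on invalid UTF-8.
    return $count === false ? 0 : $count;
}

echo strlen("连衣裙"), "\n";          // byte length: 9
echo utf8_char_count("连衣裙"), "\n"; // character count: 3
echo utf8_char_count("xxl裙"), "\n";  // mixed text: 4
```

Whether the series should switch to a character-based count is a design choice; the byte heuristic is kept in the class above because the rest of the code depends on it.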

Xiao Shuai Shuai also realized that the dictionary may come from more than one source, for example from the database or from hand-written test data. Allowing for this addresses Boss Yu's complaint that the code could not be tested. He therefore split the dictionary source out into classes of its own:

<?php

class DBSegmentation {

    public $cid;

    /**
     * Get the segmentation phrases stored for this category.
     * @return array
     */
    public function transferDictionary(){
        $ret = array();
        // $this->cid is interpolated directly, so it must come from a trusted source.
        $sql = "select word from category_linklist where cid='$this->cid'";
        $rows = DB::makeArray($sql);
        foreach($rows as $strWords){
            $words = explode(",",$strWords);

            foreach($words as $word){
                if(UTF8::isPhrase($word)){
                    $ret[] = $word;
                }
            }
        }
        return $ret;
    }
}

class TestSegmentation {

    /**
     * Return a fixed dictionary for testing, without touching the database.
     * @return array
     */
    public function transferDictionary(){
        $groups = array(
            "连衣裙,连衣",
            "XXL,xxl,加大,加大码",
            "X码,中码",
            "外套,衣,衣服,外衣,上衣",
            "女款,女士,女生,女性"
        );

        $ret = array();
        foreach($groups as $strWords){
            $words = explode(",",$strWords);

            foreach($words as $word){
                if(UTF8::isPhrase($word)){
                    $ret[] = $word;
                }
            }
        }
        return $ret;

    }
}
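Both classes expose the same `transferDictionary()` method, but nothing in the code states that they are interchangeable. One possible way to make that contract explicit is a shared interface. The sketch below is only an illustration under that assumption: `DictionarySource`, `ArrayDictionarySource`, and `loadDictionary` are hypothetical names that do not appear in the series, and the `UTF8::isPhrase` filter is left out to keep the example self-contained.

```php
<?php

// Hypothetical interface: the original classes share the method name
// transferDictionary() but do not declare a common type.
interface DictionarySource {
    /** @return string[] list of phrases */
    public function transferDictionary();
}

// Illustrative in-memory source; it plays the role of TestSegmentation.
class ArrayDictionarySource implements DictionarySource {
    private $groups;

    public function __construct(array $groups) {
        $this->groups = $groups;
    }

    public function transferDictionary() {
        $phrases = array();
        foreach ($this->groups as $group) {
            // Each group is a comma-separated list of synonyms.
            foreach (explode(",", $group) as $word) {
                if ($word !== "") {
                    $phrases[] = $word;
                }
            }
        }
        return $phrases;
    }
}

// A consumer can depend on the interface rather than on a concrete class.
function loadDictionary(DictionarySource $source) {
    return $source->transferDictionary();
}

$dictionary = loadDictionary(new ArrayDictionarySource(array("连衣裙,连衣", "XXL,xxl")));
print_r($dictionary);
```

With a type like this, a database-backed source and a test source could be passed to the same consumer without that consumer knowing which one it received.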

With the dictionary handled elsewhere, Splitter can focus solely on word segmentation. The code is as follows:

<?php

class Splitter {

    public $keyword;
    private $dictionary = array();

    public function setDictionary($dictionary = array()){

        // Longest phrases first, so that "连衣裙" is matched before "连衣".
        usort($dictionary,function($a,$b){
            $lengthA = UTF8::length($a);
            $lengthB = UTF8::length($b);
            if($lengthA == $lengthB)
                return 0;
            return ($lengthA < $lengthB)?1:-1;
        });

        $this->dictionary = $dictionary;
    }

    public function getDictionary(){
        return $this->dictionary;
    }

    /**
     * Split the keyword into phrases and single words.
     * @return KeywordEntity $keywordEntity
     */
    public function split(){

        $remainKeyword = $this->keyword;

        $keywordEntity = new KeywordEntity($this->keyword);

        foreach($this->dictionary as $phrase){

            // Quote the phrase so regex metacharacters in it are matched literally.
            $matchTimes = preg_match_all("/".preg_quote($phrase,"/")."/",$remainKeyword,$matches);
            if($matchTimes>0){
                $keywordEntity->addElement($phrase,$matchTimes);

                // Mark the matched phrase so it is not matched again.
                $remainKeyword = str_replace($phrase,"::",$remainKeyword);
            }
        }

        // Whatever remains between the markers is treated as single words.
        $remainKeywords = explode("::",$remainKeyword);
        foreach($remainKeywords as $splitWord){

            if(!empty($splitWord)){
                $keywordEntity->addElement($splitWord);
            }
        }

        return $keywordEntity;

    }

}


class KeywordEntity {

    public $keyword;
    public $elements = array();

    public function __construct($keyword){
        $this->keyword = $keyword;
    }

    public function addElement($word,$times=1){

        if(isset($this->elements[$word])){
            $this->elements[$word]->times += $times;
        }else{
            // Key by word so repeats are merged and calculateWeight() can find the element.
            $this->elements[$word] = new KeywordElement($word,$times);
        }
    }

    /**
     * Calculate the weight of a word within the UTF-8 keyword.
     * @param string $word
     * @return float
     */
    public function calculateWeight($word)
    {
        $element = $this->elements[$word];
        return round(strlen($element->word)*$element->times / strlen($this->keyword), 3);
    }
}


class KeywordElement {
    public $word;
    public $times;

    public function __construct($word,$times){
        $this->word = $word;
        $this->times = $times;
    }
}

He also moved the weight calculation into a dedicated class, KeywordEntity.
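To make the weight formula concrete, here is a small worked example. It restates the expression from `calculateWeight()` as a standalone function (the name `phraseWeight` is hypothetical) and assumes the keyword "连衣裙xxl裙连衣裙" has been segmented into "连衣裙" twice, "xxl" once, and "裙" once. Lengths are byte lengths, as in the original formula.

```php
<?php

// Standalone restatement of the formula used by calculateWeight():
// byte length of the word, times its occurrences, over the byte length of the keyword.
function phraseWeight($word, $times, $keyword) {
    return round(strlen($word) * $times / strlen($keyword), 3);
}

// 7 Chinese characters (3 bytes each) plus 3 ASCII bytes = 24 bytes.
$keyword = "连衣裙xxl裙连衣裙";

echo phraseWeight("连衣裙", 2, $keyword), "\n"; // 9 * 2 / 24 = 0.75
echo phraseWeight("xxl", 1, $keyword), "\n";    // 3 / 24 = 0.125
echo phraseWeight("裙", 1, $keyword), "\n";     // 3 / 24 = 0.125
```

Under that segmentation the three weights add up to 1, because every byte of the keyword is attributed to exactly one element.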

After Xiao Shuai Shuai finished writing, he also wrote a test example:

<?php

$segmentation = new TestSegmentation();

$splitter = new Splitter();
$splitter->setDictionary($segmentation->transferDictionary());
$splitter->keyword = "连衣裙xxl裙连衣裙";
$keywordEntity = $splitter->split();

var_dump($keywordEntity);
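Dumping the result means checking it by eye. Since Boss Yu's first question was about testing, the same check could also be written as an assertion. The sketch below is not the author's Splitter: `segment` is a condensed, hypothetical stand-in that returns a plain word-to-count array, and it assumes that longer phrases should be tried before shorter ones.

```php
<?php

// Condensed, hypothetical stand-in for Splitter::split(): returns word => occurrences.
function segment($keyword, array $dictionary) {
    // Try longer phrases first so that "连衣裙" wins over "连衣".
    usort($dictionary, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    $counts = array();
    $remain = $keyword;
    foreach ($dictionary as $phrase) {
        $times = substr_count($remain, $phrase);
        if ($times > 0) {
            $counts[$phrase] = $times;
            // Replace matched phrases with a marker so they are not matched again.
            $remain = str_replace($phrase, "::", $remain);
        }
    }

    // Whatever is left between the "::" markers counts as single words.
    foreach (explode("::", $remain) as $rest) {
        if ($rest !== "") {
            $counts[$rest] = isset($counts[$rest]) ? $counts[$rest] + 1 : 1;
        }
    }
    return $counts;
}

$result = segment("连衣裙xxl裙连衣裙", array("连衣", "连衣裙", "xxl"));
$expected = array("连衣裙" => 2, "xxl" => 1, "裙" => 1);

echo ($result == $expected) ? "ok" : "unexpected segmentation", "\n";
```

A check of this kind covers the repeated-phrase case directly: the phrase that appears twice is recorded once with a count of 2.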

This way, even if the segmentation algorithm changes, the rest of the code can absorb the change calmly.

Xiao Shuai Shuai took the lesson to heart: when a class feels like it is doing too many things, consider the single responsibility principle.

Single Responsibility Principle: a class should have only one reason to change, which means it should have only one responsibility. Each responsibility is an axis of change. If a class takes on more than one responsibility, those responsibilities become coupled, which leads to a fragile design: a change to one responsibility may affect the others. Coupled responsibilities also hurt reusability. A typical example is separating logic from the user interface. (From Baidu Encyclopedia)

When Boss Yu asked whether other word segmentation algorithms existed and whether they could be used, Xiao Shuai Shuai was delighted, because the code was now clean enough to accommodate them.

To find out how Xiao Shuai Shuai gets on with third-party word segmentation extensions, stay tuned for the next installment: Teach you step by step how to do keyword matching project (search engine) ---- Day 21
