
Teach you step by step how to do a keyword matching project (search engine) ---- Day 21

WBOY
2016-07-13 10:19:26


Guest appearances: Diaosi's Deceptive Form Tool, A Few Things About Databases

Object-oriented sublimation: Object-Oriented Understanding - First Acquaintance for Newcomers, Object-Oriented Extras - Sleepwalking of Thoughts (1), Object-Oriented Understanding - How to Find Classes

Load balancing: Load Balancing - Concept Understanding, Load Balancing - Implementation Configuration (Nginx)

Rant: the articles I still owe now include Object-Oriented Understanding - Class Transformation, Object-Oriented Extras - Sleepwalking of Thoughts (2), Load Balancing - File Service Strategy, and Teach you step by step how to do a keyword matching project (search engine). That is really a lot. Can I rest for a while?

Day 21

Starting point: Teach you step by step how to do a keyword matching project (search engine) ---- Day 1

Review: Teach you step by step how to do a keyword matching project (search engine) ---- Day 20

There is one piece of theory to understand today: test-driven programming. I mentioned the concept earlier, in Teach you step by step how to do a keyword matching project (search engine) ---- Day 11.

Today Xiao Shuaishuai shows off a little and puts this idea to use.

Okay, on to the main text.

The story goes that Xiao Shuaishuai showed Boss Yu the business word-splitting method he had written, and Boss Yu was very pleased.

However, the set of phrases available to business word splitting is limited, and as the amount of business word-splitting data grows, the computation time grows with it.

Boss Yu asked whether another word segmentation extension could be used to make up for the shortcomings of business word splitting.

After all, one built by professionals is more reliable.

Boss Yu, being very experienced, recommended that Xiao Shuaishuai learn how to use SCWS.

SCWS stands for Simple Chinese Word Segmentation, a simple Chinese word segmentation system.
Official website: http://www.xunsearch.com/scws/index.php

Xiao Shuaishuai was of course delighted to hear this, because it meant learning something new.

He installed SCWS by following the SCWS installation document, installed the PHP extension as well, and tried writing some test code:

<?php
class TestSCWS {

    public static function split($keyword){

        if (!extension_loaded("scws")) {
            throw new Exception("scws extension load fail");
        }

        $so = scws_new();
        $so->set_charset('utf8');

        $so->send_text($keyword);
        $ret = array();
        while ($res = $so->get_result()) {
            foreach ($res as $tmp) {
                if (self::isValidate($tmp)) {
                    $ret[] = $tmp;
                }
            }
        }
        $so->close();
        return $ret;
    }

    public static function isValidate($scws_words)
    {
        if ($scws_words['len'] == 1 && ($scws_words['word'] == "\r" || $scws_words['word'] == "\n")) {
            return false;
        }
        return true;
    }

}


var_dump(TestSCWS::split("连衣裙xxl裙连衣裙"));
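
The filtering step can also be tried without the extension installed. The following is only a minimal sketch: the tokens are written by hand rather than produced by SCWS, and it assumes each token is an array carrying the 'word' and 'len' keys that isValidate relies on. The function name isUsableToken is made up for this illustration.

```php
<?php
// Hand-written tokens standing in for one batch returned by get_result().
// Only the 'word' and 'len' keys used by isValidate() are modelled here.
$tokens = array(
    array('word' => '连衣裙', 'len' => 9),
    array('word' => "\n",     'len' => 1),
    array('word' => 'xxl',    'len' => 3),
    array('word' => "\r",     'len' => 1),
);

// Same rule as isValidate(): reject one-byte tokens that are only a line break.
function isUsableToken(array $token)
{
    $isLineBreak = ($token['word'] === "\r" || $token['word'] === "\n");
    return !($token['len'] == 1 && $isLineBreak);
}

$words = array();
foreach ($tokens as $token) {
    if (isUsableToken($token)) {
        $words[] = $token['word'];
    }
}

print_r($words); // the two line-break tokens have been dropped
```

Keeping the rule in its own small function makes it easy to check in isolation before any real segmentation output is involved.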

The test passed, exactly as expected. Xiao Shuaishuai was very happy and asked Boss Yu: Boss Yu, I can use SCWS now. What should I do next?

Boss Yu replied unhurriedly: First write a ScwsSplitter to split the keywords.

Xiao Shuaishuai was happy to be learning something new and said yes to the boss.

Xiao Shuaishuai was as good as his word. The code is as follows:

class ScwsSplitter {

    public $keyword;

    public function split(){

        if (!extension_loaded("scws")) {
            throw new Exception("scws extension load fail");
        }

        $keywordEntity = new KeywordEntity($this->keyword);

        $so = scws_new();
        $so->set_charset('utf8');

        $so->send_text($this->keyword);

        while ($res = $so->get_result()) {
            foreach ($res as $tmp) {
                if ($this->isValidate($tmp)) {
                    $keywordEntity->addElement($tmp["word"]);
                }
            }
        }
        $so->close();
        return $keywordEntity;
    }

    public function isValidate($scws_words)
    {
        if ($scws_words['len'] == 1 && ($scws_words['word'] == "\r" || $scws_words['word'] == "\n")) {
            return false;
        }
        return true;
    }

}

Xiao Shuaishuai ran off to find Boss Yu again and said: I have written the word segmentation code for SCWS.

Boss Yu admired Xiao Shuaishuai's efficiency.

He went on: Suppose I want to use both at the same time, applying business word segmentation first and then SCWS word segmentation to whatever words remain. Does Xiao Shuaishuai have a good plan for that?

Xiao Shuaishuai asked: Why do it that way? Isn't that unnecessary?

Boss Yu said: The business has some proper nouns that SCWS cannot recognize. What do we do about those?

Xiao Shuaishuai replied: When I read the documentation, I saw that SCWS has settings for dictionary and rule files. Can't we use those?

Boss Yu said: We could, but how do we make sure the operations staff can maintain them? We have to learn to hand these things over.
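
For reference, the option Xiao Shuaishuai has in mind might look roughly like the sketch below. It assumes the SCWS PHP extension's set_dict and set_rule methods behave as its documentation describes; both file paths are hypothetical placeholders, and the helper name buildSegmenter is made up here. The function returns null when the extension is missing, so the sketch can be run anywhere.

```php
<?php
// Builds a segmenter that uses a custom dictionary and rule file.
// Returns null when the scws extension is not installed.
function buildSegmenter($dictPath, $rulePath)
{
    if (!extension_loaded('scws')) {
        return null;
    }
    $so = scws_new();
    $so->set_charset('utf8');
    $so->set_dict($dictPath); // dictionary file to segment with
    $so->set_rule($rulePath); // rule file to segment with
    return $so;
}

// Both paths are hypothetical placeholders, not files shipped with this project.
$so = buildSegmenter('/path/to/dict.utf8.xdb', '/path/to/rules.utf8.ini');
echo ($so === null) ? "scws extension not loaded\n" : "segmenter ready\n";
```

Boss Yu's objection still stands with this approach: someone has to keep those files up to date, which is why the story keeps the business words in the database instead.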

Xiao Shuaishuai: ...

Xiao Shuaishuai was silent for a moment. Since both classes were already written, he figured that using them together was the fastest solution, so he agreed: Okay, I will go back and change it...

First, following the idea of test-driven programming, Xiao Shuaishuai wrote the entry code:

class SplitterApp {

    public static function split($keyword, $cid){

        $keywordEntity = new KeywordEntity($keyword);

        # Business word segmentation
        $termSplitter = new TermSplitter($keywordEntity);
        $seg = new DBSegmentation();
        $seg->cid = $cid;
        $termSplitter->setDictionary($seg->transferDictionary());
        $termSplitter->split();

        # SCWS word segmentation
        $scwsSplitter = new ScwsSplitter($keywordEntity);
        $scwsSplitter->split();

        # Handle any words or phrases that are still left over
        $elementWords = $keywordEntity->getElementWords();
        $remainKeyword = str_replace($elementWords, "::", $keywordEntity->keyword);
        $remainElements = explode("::", $remainKeyword);
        foreach ($remainElements as $element) {
            // Compare against '' rather than using empty(), so a leftover "0" is kept
            if ($element !== '') {
                $keywordEntity->addElement($element);
            }
        }
        return $keywordEntity;
    }
}
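
The leftover step at the end of split() can be tried in isolation. This is a minimal sketch, not the project's code: leftoverPieces and charLength are names invented for the illustration, and charLength stands in for the series' UTF8::length helper, whose implementation is not shown in this article. It also shows why the word list is best ordered longest first before str_replace runs.

```php
<?php
// Number of UTF-8 characters; a stand-in for the article's UTF8::length() helper.
function charLength($text)
{
    return preg_match_all('/./us', $text, $unused);
}

// Returns the pieces of $keyword that are not covered by any word in $words.
function leftoverPieces($keyword, array $words, $longestFirst = true)
{
    if ($longestFirst) {
        usort($words, function ($a, $b) {
            return charLength($b) - charLength($a);
        });
    }
    // Every known word becomes a "::" marker, then the markers are used as separators.
    $marked = str_replace($words, '::', $keyword);
    $pieces = array();
    foreach (explode('::', $marked) as $piece) {
        if ($piece !== '') {
            $pieces[] = $piece;
        }
    }
    return $pieces;
}

$keyword = '连衣裙xl裙宽衣裙';
$words = array('裙', 'xl', '连衣裙');

// Longest first: "连衣裙" is removed whole, so only "宽衣" is left over.
$longestFirst = leftoverPieces($keyword, $words);
print_r($longestFirst);

// In the order given: "裙" is replaced first, "连衣裙" can no longer match,
// and a stray "连衣" is left behind as well.
$asGiven = leftoverPieces($keyword, $words, false);
print_r($asGiven);
```

The second call is the failure mode that a longest-first ordering avoids: a short word eats part of a longer one before the longer one gets a chance to match.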

Xiao Shuaishuai chuckled: with the test entry point in place, why worry about not being able to handle the rest?

First up is getElementWords of KeywordEntity, so get that done first.

class KeywordEntity
{

    public $keyword;
    public $elements = array();

    public function __construct($keyword)
    {
        $this->keyword = $keyword;
    }

    public function addElement($word, $times = 1)
    {
        if (isset($this->elements[$word])) {
            $this->elements[$word]->times += $times;
        } else {
            $this->elements[$word] = new KeywordElement($word, $times);
        }
    }

    public function getElementWords()
    {
        $elementWords = array_keys($this->elements);
        // Longest words first; returns 0 for equal lengths so the comparison is consistent
        usort($elementWords, function ($a, $b) {
            return UTF8::length($b) - UTF8::length($a);
        });
        return $elementWords;
    }

    /**
     * @desc Calculate the weight of a UTF-8 word within the keyword
     * @param string $word
     * @return float
     */
    public function calculateWeight($word)
    {
        $element = $this->elements[$word];
        return round(strlen($element->word) * $element->times / strlen($this->keyword), 3);
    }
}

class KeywordElement
{
    public $word;
    public $times;

    public function __construct($word, $times)
    {
        $this->word = $word;
        $this->times = $times;
    }
}
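
To make the weight formula in calculateWeight concrete, here is a small worked sketch. The function weightOf is a made-up standalone copy of the same arithmetic, and the occurrence counts passed in are chosen by hand for the example rather than produced by a splitter. Note that strlen counts bytes, so with UTF-8 input a Chinese character counts three times as much as an ASCII letter.

```php
<?php
// Same arithmetic as calculateWeight(): byte length of the word, times its
// occurrence count, divided by the byte length of the whole keyword.
function weightOf($word, $times, $keyword)
{
    return round(strlen($word) * $times / strlen($keyword), 3);
}

// 7 Chinese characters (3 bytes each in UTF-8) plus 2 ASCII bytes: 23 bytes in total.
$keyword = '连衣裙xl裙宽衣裙';

echo weightOf('连衣裙', 1, $keyword), "\n"; // 9 / 23
echo weightOf('裙', 2, $keyword), "\n";     // 6 / 23
echo weightOf('xl', 1, $keyword), "\n";     // 2 / 23
```

Rounded to three places these come out as 0.391, 0.261 and 0.087, provided the source file itself is saved as UTF-8.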

The second step is word segmentation. First, extract a common base class: Splitter becomes the shared parent class. What methods does it need?

1. An abstract split method

2. Getting the keyword phrases that remain to be split

3. Deciding whether a piece needs to be split

Following this outline, Xiao Shuaishuai wrote the following code:

abstract class Splitter {

    /**
     * @var KeywordEntity $keywordEntity
     */
    public $keywordEntity;

    public function __construct($keywordEntity){
        $this->keywordEntity = $keywordEntity;
    }

    public abstract function split();


    /**
     * Get the substrings that have not been split yet, filtering out single words
     *
     * @return array
     */
    public function getRemainKeywords()
    {
        $elementWords = $this->keywordEntity->getElementWords();

        $remainKeyword = str_replace($elementWords, "::", $this->keywordEntity->keyword);
        $remainElements = explode("::", $remainKeyword);
        $ret = array();
        foreach ($remainElements as $element) {
            if ($this->isSplit($element)) {
                $ret[] = $element;
            }
        }
        return $ret;
    }

    /**
     * Whether the element still needs to be split
     *
     * @param $element
     * @return bool
     */
    public function isSplit($element)
    {
        if (UTF8::isPhrase($element)) {
            return true;
        }

        return false;
    }
}
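
The shape of this base class can be seen end to end in a reduced sketch that runs without SCWS or a database. Everything here is a stand-in: DemoEntity, DemoSplitter, PhraseSplitter and WhitespaceSplitter are hypothetical names, the whitespace fallback merely plays the role SCWS has in the story, and the UTF8::isPhrase check is replaced by a simple non-blank test.

```php
<?php
// Reduced stand-in for KeywordEntity: the keyword and the words found so far.
class DemoEntity
{
    public $keyword;
    public $words = array();

    public function __construct($keyword)
    {
        $this->keyword = $keyword;
    }
}

// Reduced stand-in for Splitter: subclasses only decide how to cut the leftovers.
abstract class DemoSplitter
{
    protected $entity;

    public function __construct(DemoEntity $entity)
    {
        $this->entity = $entity;
    }

    abstract public function split();

    // Parts of the keyword that no earlier splitter has claimed.
    protected function remaining()
    {
        $words = $this->entity->words;
        usort($words, function ($a, $b) {
            return strlen($b) - strlen($a); // longest first
        });
        $marked = str_replace($words, '::', $this->entity->keyword);
        return array_filter(explode('::', $marked), function ($piece) {
            return trim($piece) !== '';
        });
    }
}

// First stage: claims phrases from a fixed list.
class PhraseSplitter extends DemoSplitter
{
    public $phrases = array();

    public function split()
    {
        foreach ($this->phrases as $phrase) {
            foreach ($this->remaining() as $piece) {
                if (strpos($piece, $phrase) !== false) {
                    $this->entity->words[] = $phrase;
                    break;
                }
            }
        }
    }
}

// Second stage (hypothetical fallback): cuts whatever is left on whitespace.
class WhitespaceSplitter extends DemoSplitter
{
    public function split()
    {
        foreach ($this->remaining() as $piece) {
            foreach (preg_split('/\s+/', $piece, -1, PREG_SPLIT_NO_EMPTY) as $word) {
                $this->entity->words[] = $word;
            }
        }
    }
}

$entity = new DemoEntity('red dress xl cotton');
$first = new PhraseSplitter($entity);
$first->phrases = array('red dress');
$first->split();
$second = new WhitespaceSplitter($entity);
$second->split();

print_r($entity->words); // the phrase first, then the fallback's words
```

Because both splitters share one entity, the second stage only ever sees what the first stage left unclaimed, which is the behaviour Boss Yu asked for.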

Then Xiao Shuaishuai went on to implement the business splitting algorithm and the SCWS splitting algorithm. He grinned slyly: a small thing like this is no trouble at all.

class TermSplitter extends Splitter {

    private $dictionary = array();

    public function setDictionary($dictionary = array())
    {
        // Longest phrases first, so longer phrases are matched before their substrings
        usort($dictionary, function ($a, $b) {
            return UTF8::length($b) - UTF8::length($a);
        });

        $this->dictionary = $dictionary;
    }

    public function getDictionary()
    {
        return $this->dictionary;
    }

    /**
     * Split the keyword into phrases or words.
     * Matches are recorded on the shared keyword entity.
     *
     * @return void
     */
    public function split()
    {
        foreach ($this->dictionary as $phrase) {
            $remainKeyword = implode("::", $this->getRemainKeywords());
            // Escape the phrase so regex metacharacters in it are matched literally
            $pattern = "/" . preg_quote($phrase, "/") . "/u";
            $matchTimes = preg_match_all($pattern, $remainKeyword, $matches);
            if ($matchTimes > 0) {
                $this->keywordEntity->addElement($phrase, $matchTimes);
            }
        }
    }
}


class ScwsSplitter extends Splitter
{
    public function split()
    {
        if (!extension_loaded("scws")) {
            throw new Exception("scws extension load fail");
        }

        // Only the pieces left over by earlier splitters are sent to SCWS
        $remainElements = $this->getRemainKeywords();
        foreach ($remainElements as $element) {

            $so = scws_new();
            $so->set_charset('utf8');
            $so->send_text($element);
            while ($res = $so->get_result()) {
                foreach ($res as $tmp) {
                    if ($this->isValidate($tmp)) {
                        $this->keywordEntity->addElement($tmp['word']);
                    }
                }
            }
            $so->close();
        }
    }

    /**
     * @param array $scws_words
     * @return bool
     */
    public function isValidate($scws_words)
    {
        if ($scws_words['len'] == 1 && ($scws_words['word'] == "\r" || $scws_words['word'] == "\n")) {
            return false;
        }
        return true;
    }

}
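
One detail worth care when dictionary phrases are dropped into a regular expression is escaping. The sketch below is an illustration with made-up data (the phrase "1.5m" and the helper name countPhrase are not from the project): it compares counting a phrase with and without preg_quote.

```php
<?php
// Counts literal occurrences of $phrase in $text.
function countPhrase($phrase, $text)
{
    // preg_quote() escapes regex metacharacters; its second argument also escapes the "/" delimiter.
    $pattern = '/' . preg_quote($phrase, '/') . '/u';
    return preg_match_all($pattern, $text, $unused);
}

$text = '1.5m 125m';

// Unescaped, the dot matches any character, so "125m" is counted as well.
$loose = preg_match_all('/1.5m/u', $text, $unused);

// Escaped, only the literal "1.5m" is counted.
$strict = countPhrase('1.5m', $text);

echo $loose, ' ', $strict, "\n";
```

A phrase containing "/" or an unbalanced bracket would be worse than a miscount: the unescaped pattern would fail to compile at all.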

Xiao Shuaishuai finally finished all of this code. In his excitement he also drew a UML diagram to share with everyone.

Xiao Shuaishuai's growth is really amazing. Boss Yu praised him three times after seeing it.

For testing, Xiao Shuaishuai wrote the following test code:

class SplitterAppTest {

    public static function split($keyword){

        $keywordEntity = new KeywordEntity($keyword);

        # Business word segmentation
        $termSplitter = new TermSplitter($keywordEntity);
        $seg = new TestSegmentation();
        $termSplitter->setDictionary($seg->transferDictionary());
        $termSplitter->split();

        # SCWS word segmentation
        $scwsSplitter = new ScwsSplitter($keywordEntity);
        $scwsSplitter->split();

        # Handle any words or phrases that are still left over
        $elementWords = $keywordEntity->getElementWords();
        $remainKeyword = str_replace($elementWords, "::", $keywordEntity->keyword);
        $remainElements = explode("::", $remainKeyword);
        foreach ($remainElements as $element) {
            // Compare against '' rather than using empty(), so a leftover "0" is kept
            if ($element !== '') {
                $keywordEntity->addElement($element);
            }
        }
        return $keywordEntity;
    }
}


SplitterAppTest::split("连衣裙xl裙宽衣裙");

Xiao Shuaishuai daydreamed to himself: one day I will trample you under my feet.
