
Introduction to the principles of implementing Chinese full-text search in PHP

藏色散人
2019-04-26 10:48:02

In typical development, articles and related content are searched by keyword tags and titles, and such searches generally rely on inefficient LIKE statements. Because of this inefficiency, in slightly larger projects we cannot search the full body fields of articles or related content: the load on the server is too high and the queries are extremely slow.

Common solutions

1. Sphinx + Coreseek

Advantages: Mature and stable technology

Disadvantages: Sphinx does not support Chinese, and Coreseek is no longer maintained (in a purely English environment, Sphinx is excellent)

2. Xunsearch

Advantages: Mature and stable technology

Disadvantages: The installation process is complicated and the configuration is not flexible enough

3. MySQL full-text search

Advantages: Easy installation and high efficiency

Disadvantages: Chinese support is not good enough

hcoder's solution (self-built word segmentation)

Advantages: Simple installation (a PHP component); the underlying layer is written by the developers themselves, so it is easier to understand and to optimize

Disadvantages: Developers need a grounding in PHP and MySQL and have to write the code for the whole process themselves

Principle

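1. Word extraction stage
Article data table -> read the article rows one by one -> combine all the text content -> segment into words and deduplicate -> record in a new data table
2. Search stage
Search the keyword record table -> merge the article data -> deduplicate -> display the data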
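
The two stages above can be sketched in PHP as follows. This is only a minimal illustration under stated assumptions, not hcoder's actual code: plain arrays stand in for the MySQL tables, the function names (segmentText, buildWordIndex, searchArticles) are hypothetical, and the segmenter is a crude stand-in that emits overlapping two-character grams so that the example runs without any extension. In a real setup the segmenter would be replaced by SCWS.

```php
<?php
// Stand-in segmenter (an assumption, NOT SCWS): overlapping two-character
// grams taken from each run of letters/digits, with duplicates removed.
function segmentText(string $text): array
{
    $words = [];
    foreach (preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $run) {
        $chars = preg_split('//u', $run, -1, PREG_SPLIT_NO_EMPTY);
        if (count($chars) === 1) {
            $words[] = $chars[0];
        }
        for ($i = 0; $i + 1 < count($chars); $i++) {
            $words[] = $chars[$i] . $chars[$i + 1];
        }
    }
    return array_values(array_unique($words));
}

// Stage 1: build the word table (word => list of article ids).
function buildWordIndex(array $articles): array
{
    $index = [];
    foreach ($articles as $article) {                           // read rows one by one
        $text = $article['title'] . ' ' . $article['content'];  // combine all text
        foreach (segmentText($text) as $word) {                 // segment, deduplicated
            $index[$word][] = $article['id'];                   // record word -> article
        }
    }
    return $index;
}

// Stage 2: look up the keywords, merge the matching articles, deduplicate.
function searchArticles(array $index, array $articlesById, string $keywords): array
{
    $ids = [];
    foreach (segmentText($keywords) as $word) {
        foreach ($index[$word] ?? [] as $id) {
            $ids[$id] = true;                                   // array keys deduplicate
        }
    }
    $results = [];
    foreach (array_keys($ids) as $id) {
        $results[] = $articlesById[$id];
    }
    return $results;
}

// Hypothetical sample data, keyed by article id.
$articles = [
    1 => ['id' => 1, 'title' => '中文分词', 'content' => '分词是全文检索的基础'],
    2 => ['id' => 2, 'title' => 'PHP 教程', 'content' => '使用 PHP 连接数据库'],
];
$index = buildWordIndex($articles);
$hits  = searchArticles($index, $articles, '全文检索');
```

Building the word table once, up front, is what lets the search stage use exact lookups on words instead of LIKE scans over the article bodies.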

Third-party component used (SCWS)

http://www.xunsearch.com/scws/

SCWS is an acronym for Simple Chinese Word Segmentation, i.e. a simple Chinese word segmentation system.

This is a mechanical Chinese word segmentation engine based on a word-frequency dictionary, which can segment a whole passage of Chinese text into words with basically correct results. The word is the smallest morpheme unit in Chinese, but in writing, words are not separated by spaces as they are in English. Segmenting words accurately and quickly has therefore always been a difficult problem in Chinese text processing.
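
As a rough illustration of what "mechanical", dictionary-based segmentation means, the sketch below uses forward maximum matching: at each position it takes the longest dictionary entry that matches and falls back to a single character. This is a generic textbook technique shown only for illustration; the source does not say that SCWS works this way internally, and it ignores word frequencies entirely. The tiny dictionary and the function name forwardMaxMatch are made up for the example.

```php
<?php
// Forward maximum matching over a set-like dictionary (word => true).
// Illustrative only: the dictionary and the function name are hypothetical.
function forwardMaxMatch(string $text, array $dict, int $maxLen = 4): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $words = [];
    $i = 0;
    while ($i < $n) {
        $matched = $chars[$i];   // fall back to a single character
        $step = 1;
        // Try the longest candidate first, then shrink it.
        for ($len = min($maxLen, $n - $i); $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $i, $len));
            if (isset($dict[$candidate])) {
                $matched = $candidate;
                $step = $len;
                break;
            }
        }
        $words[] = $matched;
        $i += $step;
    }
    return $words;
}

$dict  = array_fill_keys(['中文', '分词', '全文', '检索', '全文检索'], true);
$words = forwardMaxMatch('中文全文检索分词', $dict);
```

Because the longest entry wins, "全文检索" is kept as one word here even though "全文" and "检索" are also in the dictionary.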

SCWS is written in pure C and does not depend on any external libraries. It can be embedded directly into applications as a dynamic link library. Supported Chinese encodings include GBK and UTF-8. In addition, a PHP extension module is provided so that word segmentation can be used quickly and easily from PHP.
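
A minimal sketch of calling the PHP extension might look like the following. It assumes the extension is installed with its default dictionary and rule file configured in php.ini, and the method names follow the extension's documented usage as best recalled here, so they should be checked against the SCWS documentation before use. The wrapper name scwsSegment is hypothetical, and the per-character fallback exists only so the sketch still runs where the extension is missing; it is not real segmentation.

```php
<?php
// Segment text with the SCWS PHP extension when it is loaded.
// Assumption: the default dictionary and rule set are configured in php.ini;
// otherwise set_dict() / set_rule() would be needed as well.
function scwsSegment(string $text): array
{
    if (!function_exists('scws_new')) {
        // Fallback without the extension: one "word" per character.
        return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    }
    $so = scws_new();
    $so->set_charset('utf8');
    $so->set_ignore(true);                  // skip punctuation
    $so->send_text($text);
    $words = [];
    while ($batch = $so->get_result()) {    // results arrive in batches
        foreach ($batch as $item) {
            $words[] = $item['word'];
        }
    }
    $so->close();
    return $words;
}

$segments = scwsSegment('中文分词');
```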

The segmentation algorithm itself contains little that is new. It uses a word-frequency dictionary compiled by the author, supplemented by rule-based recognition of certain proper nouns, personal names, place names, numbers and dates, to achieve basic word segmentation. In small-scale tests the accuracy was between 90% and 95%, which is basically adequate for small search engines, keyword extraction and similar uses. The first prototype version was released in late 2005.

SCWS was developed by hightman and released as open source under the BSD license. The source code is hosted on GitHub.


Statement:
This article is reproduced from hcoder.net. If there is any infringement, please contact admin@php.cn to have it removed.