
In-depth analysis of Chinese full-text search in MySQL 5.7

黄舟 · Original · 2017-01-18 11:48:21


InnoDB's default full-text parser works well for Latin-script languages, because they separate words with spaces. Chinese, Japanese, and Korean have no such separator, and a single word can consist of several characters, so they need a different approach. Starting with MySQL 5.7.6, a new full-text parser plugin handles them: the n-gram parser.

Preface

MySQL has supported full-text search for a long time, but in practice only for languages such as English, because it always uses spaces as the word separator. Spaces obviously do not work for Chinese, where segmentation has to follow the meaning of the text. Starting with MySQL 5.7, MySQL ships a built-in ngram full-text parser plugin that supports Chinese word segmentation and works with both the MyISAM and InnoDB engines.

Before using the ngram parser for Chinese search, set its token size (ngram_token_size, default 2) in the MySQL configuration file. For example:

[mysqld]
ngram_token_size=2

Here the token size is set to 2. Remember that the larger the token size, the larger the index, so choose a size that fits your own situation.

Sample table structure:

CREATE TABLE articles (
   id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
   title VARCHAR(200),
   body TEXT,
   FULLTEXT (title,body) WITH PARSER ngram
  ) ENGINE=InnoDB CHARACTER SET utf8mb4;

Sample data (6 rows):

mysql> select * from articles\G
***************************1. row ***************************
  id: 1
title: 数据库管理
 body: 在本教程中我将向你展示如何管理数据库
***************************2. row ***************************
  id: 2
title: 数据库应用开发
 body: 学习开发数据库应用程序
***************************3. row ***************************
  id: 3
title: MySQL完全手册
 body: 学习MySQL的一切
***************************4. row ***************************
  id: 4
title: 数据库与事务处理
 body: 系统的学习数据库的事务概论
***************************5. row ***************************
  id: 5
title: NoSQL精髓
 body: 学习了解各种非结构化数据库
***************************6. row ***************************
  id: 6
title: SQL 语言详解
 body: 详细了解如果使用各种SQL
6 rows in set (0.00 sec)

Explicitly specify which table's full-text index to inspect (in database/table form):

mysql> SET GLOBAL innodb_ft_aux_table="new_feature/articles";
Query OK, 0 rows affected (0.00 sec)

The INNODB_FT_INDEX_CACHE system table then shows how the data in articles has been tokenized.

mysql> SELECT * FROM information_schema.INNODB_FT_INDEX_CACHE LIMIT 20,10;
+------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| 中我 |   2 |   2 |   1 |  2 |  28 |
| 习m |   4 |   4 |   1 |  4 |  21 |
| 习了 |   6 |   6 |   1 |  6 |  16 |
| 习开 |   3 |   3 |   1 |  3 |  25 |
| 习数 |   5 |   5 |   1 |  5 |  37 |
| 了解 |   6 |   7 |   2 |  6 |  19 |
| 了解 |   6 |   7 |   2 |  7 |  23 |
| 事务 |   5 |   5 |   1 |  5 |  12 |
| 事务 |   5 |   5 |   1 |  5 |  40 |
| 何管 |   2 |   2 |   1 |  2 |  52 |
+------+--------------+-------------+-----------+--------+----------+
10 rows in set (0.00 sec)

You can see that with a token size of 2, all text is split into two-character tokens. The output also includes each token's document ID, position, and other information.
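To make the splitting concrete, here is a minimal Python sketch of n-gram tokenization. It is not MySQL's implementation: the function name `ngrams` is hypothetical, and the lowercasing and whitespace handling are assumptions inferred from tokens such as 习m in the output above.

```python
def ngrams(text: str, n: int = 2) -> list[str]:
    """Split text into overlapping n-character tokens (simplified sketch)."""
    tokens = []
    # Assumption: whitespace never ends up inside a token, so split on it first.
    for chunk in text.lower().split():
        # Slide a window of n characters across the chunk, one step at a time.
        tokens.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return tokens


print(ngrams("数据库管理"))       # ['数据', '据库', '库管', '管理']
print(ngrams("学习MySQL的一切"))  # includes '习m', as in the index cache above
```

A string of L characters yields L - n + 1 tokens, so every adjacent pair of characters becomes its own index entry when n is 2.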

Next comes a series of search demonstrations. The syntax is the same as for the original English full-text search.

1. Search in natural language mode:

1. Count the matching rows:

mysql> SELECT COUNT(*) FROM articles
    -> WHERE MATCH (title,body) AGAINST ('数据库' IN NATURAL LANGUAGE MODE);
+----------+
| COUNT(*) |
+----------+
|  4 |
+----------+
1 row in set (0.05 sec)

2. Get the relevance score of each row:

mysql> SELECT id, MATCH (title,body) AGAINST ('数据库' IN NATURAL LANGUAGE MODE)
    -> AS score FROM articles;
+----+----------------------+
| id| score    |
+----+----------------------+
| 1 | 0.12403252720832825 |
| 2 | 0.12403252720832825 |
| 3 |     0 |
| 4 | 0.12403252720832825 |
| 5 | 0.062016263604164124|
| 6 |     0 |
+----+----------------------+
6 rows in set (0.00 sec)
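One way to read these results: according to the MySQL reference manual, in natural language mode the search term is converted to a union of ngram terms, so with a token size of 2 the term 数据库 becomes 数据 and 据库, and a row matches if it contains either one. The Python sketch below imitates only that matching rule on the sample data. The names `ngrams`, `ARTICLES`, and `natural_match` are hypothetical, and relevance scoring is not modeled.

```python
def ngrams(text, n=2):
    """Overlapping n-character tokens; lowercasing is an assumption."""
    tokens = []
    for chunk in text.lower().split():
        tokens.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return tokens


# The six sample rows from the articles table: id -> (title, body).
ARTICLES = {
    1: ("数据库管理", "在本教程中我将向你展示如何管理数据库"),
    2: ("数据库应用开发", "学习开发数据库应用程序"),
    3: ("MySQL完全手册", "学习MySQL的一切"),
    4: ("数据库与事务处理", "系统的学习数据库的事务概论"),
    5: ("NoSQL精髓", "学习了解各种非结构化数据库"),
    6: ("SQL 语言详解", "详细了解如果使用各种SQL"),
}


def natural_match(term, n=2):
    """Return ids of rows sharing at least one n-gram with the search term."""
    wanted = set(ngrams(term, n))
    hits = []
    for doc_id, (title, body) in ARTICLES.items():
        doc_tokens = set(ngrams(title, n)) | set(ngrams(body, n))
        if wanted & doc_tokens:  # union semantics: any shared token counts
            hits.append(doc_id)
    return hits


print(natural_match("数据库"))  # [1, 2, 4, 5]
```

The four ids agree with COUNT(*) = 4 and with the rows that have a non-zero score above.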

2. Search in Boolean mode, which is more complicated than natural mode search:

1. Match rows that contain both 数据库 (database) and 管理 (management):

mysql> SELECT * FROM articles WHERE MATCH (title,body)
  ->  AGAINST ('+数据库 +管理' IN BOOLEAN MODE);
+----+------------+--------------------------------------+
| id| title  | body         |
+----+------------+--------------------------------------+
| 1 | 数据库管理 | 在本教程中我将向你展示如何管理数据库  |
+----+------------+--------------------------------------+
1 row in set (0.00 sec)

2. Match rows that contain 数据库 (database) but not 管理 (management):

mysql> SELECT * FROM articles WHERE MATCH (title,body)
  ->  AGAINST ('+数据库 -管理' IN BOOLEAN MODE);
+----+------------------+----------------------------+
| id| title    | body      |
+----+------------------+----------------------------+
| 2 | 数据库应用开发  | 学习开发数据库应用程序   |
| 4 | 数据库与事务处理 | 系统的学习数据库的事务概论  |
| 5 | NoSQL 精髓  | 学习了解各种非结构化数据库  |
+----+------------------+----------------------------+
3 rows in set (0.00 sec)
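The two Boolean queries above can be imitated with a short Python sketch. According to the MySQL reference manual, in Boolean mode each search term is converted to an ngram phrase search, so 数据库 must appear as the consecutive tokens 数据 and 据库. Everything here (`ngrams`, `ARTICLES`, `contains_phrase`, `has_term`, `boolean_match`) is a hypothetical helper for illustration, not MySQL's implementation, and only the + and - operators are modeled.

```python
def ngrams(text, n=2):
    """Overlapping n-character tokens; lowercasing is an assumption."""
    tokens = []
    for chunk in text.lower().split():
        tokens.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return tokens


# The six sample rows from the articles table: id -> (title, body).
ARTICLES = {
    1: ("数据库管理", "在本教程中我将向你展示如何管理数据库"),
    2: ("数据库应用开发", "学习开发数据库应用程序"),
    3: ("MySQL完全手册", "学习MySQL的一切"),
    4: ("数据库与事务处理", "系统的学习数据库的事务概论"),
    5: ("NoSQL精髓", "学习了解各种非结构化数据库"),
    6: ("SQL 语言详解", "详细了解如果使用各种SQL"),
}


def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside doc_tokens."""
    k = len(phrase_tokens)
    return any(doc_tokens[i:i + k] == phrase_tokens
               for i in range(len(doc_tokens) - k + 1))


def has_term(fields, term, n=2):
    """True if any field (title or body) contains the term as an n-gram phrase."""
    phrase = ngrams(term, n)
    return any(contains_phrase(ngrams(field, n), phrase) for field in fields)


def boolean_match(required, excluded=(), n=2):
    """Ids of rows holding every required (+) term and no excluded (-) term."""
    return [doc_id for doc_id, fields in ARTICLES.items()
            if all(has_term(fields, t, n) for t in required)
            and not any(has_term(fields, t, n) for t in excluded)]


print(boolean_match(["数据库", "管理"]))             # [1]
print(boolean_match(["数据库"], excluded=["管理"]))  # [2, 4, 5]
```

Both results match the rows MySQL returned for '+数据库 +管理' and '+数据库 -管理'.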

3. Require MySQL, and raise the relevance of rows that also contain 数据库 (the > operator increases a term's contribution to the score; < would decrease it):

mysql> SELECT * FROM articles WHERE MATCH (title,body)
  ->  AGAINST ('>数据库 +MySQL' IN BOOLEAN MODE);
+----+---------------+-----------------+
| id| title   | body   |
+----+---------------+-----------------+
| 3 | MySQL完全手册 |学习MySQL的一切 |
+----+---------------+-----------------+
1 row in set (0.00 sec)


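3. Query expansion mode. The search runs twice, and the second pass adds terms taken from the most relevant rows of the first pass. For example, a search for "database" can also return rows that mention MySQL, Oracle, or DB2.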

mysql> SELECT * FROM articles
  ->  WHERE MATCH (title,body)
  ->  AGAINST ('数据库' WITH QUERY EXPANSION);
+----+------------------+--------------------------------------+
| id| title   | body         |
+----+------------------+--------------------------------------+
| 1 | 数据库管理  | 在本教程中我将向你展示如何管理数据库  |
| 4 | 数据库与事务处理 | 系统的学习数据库的事务概论    |
| 2 | 数据库应用开发  | 学习开发数据库应用程序     |
| 5 | NoSQL 精髓  | 学习了解各种非结构化数据库    |
| 6 | SQL 语言详解  | 详细了解如果使用各种SQL     |
| 3 | MySQL完全手册  | 学习MySQL的一切      |
+----+------------------+--------------------------------------+
6 rows in set (0.01 sec)
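Why all six rows come back can be illustrated with a rough Python sketch of two-pass query expansion. This is a deliberate simplification, not MySQL's algorithm: MySQL expands the query with only the few most relevant documents and ranks the results, while this sketch expands with every first-pass hit and ignores ranking. The names `ngrams`, `ARTICLES`, `doc_tokens`, `search`, and `search_with_expansion` are hypothetical.

```python
def ngrams(text, n=2):
    """Overlapping n-character tokens; lowercasing is an assumption."""
    tokens = []
    for chunk in text.lower().split():
        tokens.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return tokens


# The six sample rows from the articles table: id -> (title, body).
ARTICLES = {
    1: ("数据库管理", "在本教程中我将向你展示如何管理数据库"),
    2: ("数据库应用开发", "学习开发数据库应用程序"),
    3: ("MySQL完全手册", "学习MySQL的一切"),
    4: ("数据库与事务处理", "系统的学习数据库的事务概论"),
    5: ("NoSQL精髓", "学习了解各种非结构化数据库"),
    6: ("SQL 语言详解", "详细了解如果使用各种SQL"),
}


def doc_tokens(doc_id, n=2):
    """All n-grams found in a row's title and body."""
    title, body = ARTICLES[doc_id]
    return set(ngrams(title, n)) | set(ngrams(body, n))


def search(query_tokens, n=2):
    """Ids of rows sharing at least one token with the query."""
    return [doc_id for doc_id in ARTICLES if query_tokens & doc_tokens(doc_id, n)]


def search_with_expansion(term, n=2):
    """First pass on the term, second pass on the term plus first-pass tokens."""
    query = set(ngrams(term, n))
    first_pass = search(query, n)
    for doc_id in first_pass:
        query |= doc_tokens(doc_id, n)  # simplification: expand with every hit
    return search(query, n)


print(search("数据库" and set(ngrams("数据库"))))  # first pass: [1, 2, 4, 5]
print(search_with_expansion("数据库"))             # second pass: all six ids
```

Rows 3 and 6 never mention 数据库, but they share tokens such as 学习, 了解, and 各种 with first-pass hits, which is enough to pull them in on the second pass.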

Of course, this is only a functional demonstration; anyone interested can run more detailed performance tests. Since n-gram is a commonly used segmentation algorithm for Chinese retrieval and is already widely used on the Internet, its integration into MySQL should work well in practice.
