Home  >  Article  >  Backend Development  >  mysql full text search techniques_PHP tutorial

mysql full text search techniques_PHP tutorial

WBOY
WBOYOriginal
2016-07-21 15:54:53755browse

<< Back to man.ChinaUnix.net


MySQL Reference Manual for version 4.1.0-alpha.



------ -------------------------------------------------- -----------------------

6.8 MySQL full-text search

As of 3.23.23, MySQL begins to support full-text indexing and search. The full-text index in MySQL is a FULLTEXT type index. FULLTEXT indexes are used on MyISAM tables and can be created on CHAR, VARCHAR or TEXT columns using ALTER TABLE or CREATE INDEX at or after CREATE TABLE. For large databases, it is very fast to load the data into a table without a FULLTEXT index and then use ALTER TABLE (or CREATE INDEX) to create the index. Loading data into a table that already has a FULLTEXT index will be very slow.

Full text search is completed through the MATCH() function.

mysql> CREATE TABLE articles (
-> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
-> title VARCHAR(200),
-> body TEXT,
-> FULLTEXT (title,body)
-> );
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO articles VALUES
-> ( NULL,'MySQL Tutorial', 'DBMS stands for DataBase ...'),
-> (NULL,'How To Use MySQL Efficiently', 'After you went through a ...'),
-> (NULL,'Optimising MySQL','In this tutorial we will show ...'),
-> (NULL,'1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
-> (NULL,'MySQL vs. YourSQL', 'In the following database comparison ...'),
-> (NULL,'MySQL Security', 'When configured properly, MySQL ...');
Query OK, 6 rows affected (0.00 sec)
Records: 6 Duplicates: 0 Warnings: 0

mysql> SELECT * FROM articles
-> WHERE MATCH (title,body) AGAINST ('database');
+----+--------------------------------+---- -----------------------------------------------+
| id | title | body | 🎜>+----+-------------------+----------------------- --------------------+
| 5 | MySQL vs. YourSQL | In the following database comparison ... |
| 1 | MySQL Tutorial | DBMS stands for DataBase ... |
+----+--------------------------------+------------- --------------------------------+
2 rows in set (0.00 sec)

Function MATCH( ) performs a natural language search for a string against a text set (a set of one or more columns contained in a FULLTEXT index). The search string is given as an argument to AGAINST(). The search is performed ignoring the case of letters. For each row in the table, MATCH() returns a correlation value. That is, the similarity measure between the search string and the text of the record row in the column specified in the MATCH() list.

When MATCH() is used in a WHERE clause (see the example above), the returned rows are automatically sorted in order of relevance from highest to lowest. Correlation values ​​are non-negative floating point numbers. Zero correlation means no similarity. Relevance is calculated based on: the number of words in a row, the number of unique words in a row, the total number of words in a set, and the number of documents (rows) that contain a particular word.

It can also perform a logical pattern search. This is described in the following sections.

The previous examples are some basic instructions on the use of function MATCH(). Rows are returned in order of decreasing similarity.

The next example shows how to retrieve an explicit similarity value. If there is neither a WHERE nor an ORDER BY clause, the returned rows are not sorted.

mysql> SELECT id,MATCH (title,body) AGAINST ('Tutorial') FROM articles;
+----+------------- --------------------------+
| id | MATCH (title,body) AGAINST ('Tutorial') |
+ ----+-----------------------------------------+
| 1 | 0.64840710366884 |
| 2 | 0                                                                                                                                                                                                                       ---+-----------------------------------------+
6 rows in set (0.00 sec)

The example below is a little more complicated. The query returns similarities and still returns rows in descending order of similarity. To accomplish this result, you should specify MATCH() twice. This causes no additional overhead because the MySQL optimizer will notice the same MATCH() call twice and call the full-text search code only once.

mysql> SELECT id, body, MATCH (title,body) AGAINST
-> ('Security implications of running MySQL as root') AS score
-> FROM articles WHERE MATCH ( title,body) AGAINST
-> ('Security implications of running MySQL as root');
+----+---------------- ------------------+------------------+
| id | body                                                                                               >+----+----------------------------------------+----- ------------+
| 4 | 1. Never run mysqld as root. 2. ... | 1.5055546709332 |
| 6 | When configured properly, MySQL ... | 1.31140957288 |
+----+----------------------------------------+- ----------------+
2 rows in set (0.00 sec)

MySQL uses a very simple parser to separate text into words. A "word" is any sequence of characters composed of text, data, "'" and "_". Any "word" that appears on the stopword list, or is too short (3 characters or less) will be ignored.

Each suitable word in the set and query is measured according to its importance in the set and query. This way, a word that appears in multiple documents will have a lower weight (perhaps even a zero weight) because it has lower semantic value in this particular set. Otherwise, if the word is less, it will get a higher weight. The weights of the words are then combined to calculate the similarity of the record lines.

Such a technique works well with large sets (in fact, it will be carefully harmonized with them). For very small tables, word classification does not adequately reflect their semantic value, and sometimes this model can produce strange results.

mysql> SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('MySQL');
Empty set (0.00 sec)

In the above example, the search term MySQL But got no results because this word appears in more than half of the recorded lines.As such, it is effectively treated as a stopword (i.e., a word with zero semantic value). This is ideal behavior - a natural language query should not return every second row from a 1GB table.

Words that match half of the rows in the table are unlikely to find relevant documents. In fact, it may find many irrelevant documents. We all know that this happens all the time when we are trying to search for something on the internet through a search engine. For this reason, in this particular dataset, such rows are set to a low semantic value.

As of 4.0.1, MySQL can also use the IN BOOLEAN MODE modifier to perform a logical full-text search.

mysql> SELECT * FROM articles WHERE MATCH (title,body)
-> AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
+----+--- --------------------------+---------------------- ---------------+
| id | title                                                          ---------------+---------------------------------- ---+
| 1 | MySQL Tutorial | DBMS stands for DataBase... |
| 2 | How To Use MySQL Efficiently | After you went through a... |
| 3 | Optimizing MySQL | In this tutorial we will show ... |
| 4 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... |
| 6 | MySQL Security | When configured properly, MySQL. .. |
+----+----------------------------------+------- ----------------------------------+

This query returns all rows containing the word MySQL (Note: The 50% threshold is not used), but it does not contain the word YourSQL. Note that a logical pattern search does not automatically sort rows in descending order of similar values. As you can see from the above results, the highest similarity value (the one that contains MySQL twice) is listed last, not first. A logical full-text search can work even without a FULLTEXT index, however it is slower.

Logical full-text search supports the following operators:

+
A leading plus sign indicates that the word must appear in each returned record row.

-
A leading minus sign indicates that the word must not appear in each returned record row.

The default (when neither plus nor minus sign is specified) word is optional, but rows containing it will be ranked higher. This mimics the behavior of MATCH() ... AGAINST() without the IN BOOLEAN MODE modifier.

< >
These two operators are used to change the base value of the similarity value of a word. The < operator decrements the base value, and the > operator increases it. See the example below.

( )
Parentheses are used to group words in a subexpression.

~
A leading negation sign acts like a negation operator, causing row similarity words to have a negative base value. It is useful for labeling a noisy word. A record containing such a word will be ranked lower, but will not be completely excluded, because the - operator can be used.

*
An asterisk is the truncation operator. Unlike other operators, it should be appended to a word, not before it.

"
phrase, enclosed in double quotes ", matches only lines containing the phrase (literally, as if typed).
Here are some examples:

apple banana
Find rows that contain at least one of the above words
+apple +juice
... Both words are included
+apple macintosh
... Contains the word "apple", but if it also contains "macintosh", it will be ranked higher
+apple -macintosh
... Contains "apple" but does not contain "macintosh"
+apple +(>pie ... contains "apple" and "pie", or "apple" and "strudel" (in any order ), but "apple pie" is arranged a little higher than "apple strudel"
apple*
... Contains "apple", "apples", "applesauce" and "applet"
"some words "
... can contain "some words of wisdom", but not "some noise words"
6.8.1 Full text restrictions
All parameters of the MATCH() function must come from the same table The columns must also be part of the same FULLTEXT index, unless MATCH() is IN BOOLEAN MODE.

MATCH() column list must exactly match the column list defined in a FULLTEXT index of the table, unless MATCH() is IN BOOLEAN MODE . The parameter of

AGAINST() must be a constant string.
6.8.2 Fine-tuning MySQL full-text search
Unfortunately, full-text search still has very few user-tunable parameters, although adding a few would rank high on the TODO. If you have a MySQL source distribution (see Section 2.3 Installing a MySQL Source Distribution), you can exert more control over full-text search.

Note that full-text search is carefully tuned for the best search results. Modifying the default behavior will, in most cases, only make search results worse. Do not modify MySQL source code unless you know what you are doing!

The minimum length of the indexed word is specified by the MySQL variable ft_min_word_len. See section 4.5.6.4 SHOW VARIABLES. Change it to your desired value and rebuild your FULLTEXT index. (This variable is only supported starting from MySQL 4.0) The

stopword list can be read from the file specified by the ft_stopword_file variable. See section 4.5.6.4 SHOW VARIABLES. After modifying the stopword list, rebuild your FULLTEXT index. (This variable is only supported starting from MySQL 4.0.10)

50% threshold selection is determined by the specific measurement mode selected. To disable it, modify the following line in the `myisam/ftdefs.h' file:
#define GWS_IN_USE GWS_PROB

to:
#define GWS_IN_USE GWS_FREQ

and then recompile MySQL. In this case, there is no need to rebuild the index. NOTE: Using this will severely reduce MySQL's ability to provide sufficient similarity values ​​for MATCH(). If you really need to search for such a public word, it is better to use the IN BOOLEAN MODE search instead, which does not respect the 50% threshold.

Sometimes search engine maintainers wish to change the operators used for logical full-text searches. These are defined by the variable ft_boolean_syntax. See section 4.5.6.4 SHOW VARIABLES. However, this variable is read-only and its value is set in `myisam/ft_static.c'.
For these changes, you are required to rebuild your FULLTEXT index. For a MyISAM table, the easiest way to rebuild the index file is as follows:

mysql> REPAIR TABLE tbl_name QUICK;

6.8.3 Full text search TODO
Make all operations on FULLTEXT index faster
Proximity operator
Support for "always-index words". They can be any string the user wishes to treat as a word, such as "C++", "AS/400", "TCP/IP", etc.
Supports full-text search in MERGE table
Many Byte character support
Create a stopword list according to the language of the data
Stemming (depending on the language of the data, of course)
Generic user-suppliable UDF preparser.
Make the schema more flexible (by providing CREATE/ FULLTEXT in ALTER TABLE adds certain adjustable parameters)

www.bkjia.comtruehttp: //www.bkjia.com/PHPjc/318357.htmlTechArticleBacktoman.ChinaUnix.net MySQLReferenceManualforversion4.1.0-alpha. ------------- -------------------------------------------------- ------------------ 6.8 When MySQL full-text search reaches 3.23.23,...
Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn