TodayMySQL database column introduces the comparison between MySQL index and ElasticSearch index.
I am maintaining the search function of the product during this period, and every time I see ## on the management console #elasticsearch I am very curious about how he achieves such efficient query efficiency.
MySQL on my local machine.
. It will segment the data and save the index. It is good at managing a large number of Index data, compared to
MySQL, is not good at frequently updating data and related queries.
MySQL. Everyone must be familiar with the word index. It usually exists in some query scenarios and is a typical space-for-time exchange. case.
以下内容以 Innodb 引擎为例。复制代码Common data structures Assuming that we design the index of
MySQL ourselves, what are the options?
Java
HashMap
O (1), for example, when we want to query the data of
id=3, we need to hash 3 and then find the corresponding position in this array.
1≤id≤6, the hash table cannot satisfy it well. Since it is unordered, all the data must be It takes one pass to know which data belongs to this interval.
id=4 data, you only need to use binary search to efficiently locate the data
O(logn).
id=2.5, all subsequent data must be moved by one bit at the same time, and the writing efficiency will become very low.
The left node is smaller than the parent node and the right node is larger than the parent node.So assuming we want to query the data of
id=11, we only need to query
10—>12—>11 to finally find the data. The time complexity is
O(logn), and similarly when writing data, it is also
O(logn).
5≤id≤20, we need to query the left subtree of 10 nodes first and then query the left subtree of 10 nodes. Only the right subtree can finally query all the data.
Redis ##sort set
is implemented using a skip table. <p>Here we briefly introduce the advantages of the data structure implemented by jump tables. </p>
<p>We all know that even querying an <strong>ordered linked list</strong> is not efficient. Since it cannot use array subscripts for binary search, the time complexity is<code>o( n)
But we can also cleverly optimize the linked list to implement binary search in disguise, as shown below:
## We can extract primary indexes and secondary indexes for the lowest level data. Depending on the amount of data, we can extract N-level indexes. When we query, we can use the index here to implement a binary search in disguise. Suppose you want to query the data ofid=13, you only need to traverse the four nodes
1->7->10->13 to query When it comes to data, when the number is larger, the efficiency improvement will be more obvious.
The linked list is in order) to the target node. The entire range of data is queried.
At the same time, since we will not store real data in the index, but only store a pointer, the space occupied can be ignored compared to the linked list at the bottom where data is stored. Optimization of balanced binary treeBut in fact,Innodb in
MySQL does not use a skip table, but uses one called
B Tree data structure.
B tree here can be thought of as evolving from a balanced binary tree.
MySQL will store the index file directly in in the disk.
This is actually the origin of B-tree.Some suggestions for using indexesIn fact, through the understanding of
B tree in the above figure, we can also optimize some small details of daily work; for example, why we need the most Is the good thing increasing in order?
B tree index. Mobile has already written the data.
That’s why we require the primary key of the database to have an increasing trend as much as possible. The most reasonable thing is to auto-increment the primary key without considering the situation of split tables.Overall, the idea is similar to that of a skip table, but adjustments have been made based on the usage scenario (for example, all data is stored in leaf nodes). ES Index
MySQL After chatting, let’s take a look at how
Elasticsearch uses indexes.
Inverted index is used; before we formally talk about inverted index, let’s talk about His opposite
is ranked index .
The above figure is an example. The way we can query specific objects through doc_id
is called using Forward index
, in fact, can also be understood as a hash table.
The essence is to find value through key.
For example, through doc_id=4
, you can quickly query the data name=jetty wang,age=20
.
Then if in turn I want to query what data contains li
in name
? How to query efficiently in this way?
Just using the forward index mentioned above obviously does not have any effect. We can only traverse all the data in sequence and then determine whether the name contains li
; this is very inefficient.
But if we rebuild an index structure:
When querying name
contains li
data, you only need to query the data contained in Posting List
through this index structure, and then query the final data through mapping.
This index structure is actually Inverted index
.
But how to efficiently query li
in this index structure? Combined with our previous experience, as long as we add Term
Arranged in order, you can use the data structure of the binary tree search tree to query the data under o(logn)
.
The process of splitting a text into independent Term
is actually what we often call word segmentation.
Merging all Term
together is a Term Dictionary
, which can also be called a word dictionary.
When the amount of our text is huge, there will be a lot of Term
after word segmentation. If such an inverted index data structure is stored in memory, it will definitely not be enough. But if it is stored on disk like MySQL
, the efficiency is not that high.
So we can choose a compromise method. Since the entire Term Dictionary
cannot be put into memory, then we can Term Dictionary
Create an index and put it into memory.
In this way, Term Dictionary
can be queried efficiently, and finally Posting List
can be queried through Term Dictionary
.
Compared with the B tree
in MySQL
, it will also reduce the disk IO
several times.
ThisTerm Index
We can use this Trie tree
which is what we often sayDictionary tree
to store.
For more information about dictionary trees, please see here.
If we search for Term
starting with j
, the first step is to search through The Term Index
in the memory queries the position of the Term
starting with j
in the Term Dictionary
dictionary file (this position can be a File pointer, possibly an interval range).
Then take out all the Term
in this position range. Since they have been sorted, you can quickly locate the specific position through binary search; in this way, you can query the Posting List
.
Finally, the target data can be retrieved from the original file through the location information in Posting List
.
Of courseElasticSearch
has also made many targeted optimizations. When we retrieve two fields, we can use bitmap
optimize.
For example, now we need to query the data of name=li and age=18
. At this time, we need to retrieve the respective results Posting List
through these two fields.
The simplest method is to traverse the two collections separately and remove the duplicate data, but this is obviously inefficient.
At this time we can use the bitmap
method to store (also save storage space), and at the same time use the innate bits and
** calculations to get the result . **
[1, 3, 5]
⇒ 10101
##[1, 2, 4, 5] ⇒
11011
10001 ⇒
[1, 5]
Posting List is solved as
[1, 5], which is naturally much more efficient.
MySQL. It just filters out the data with a small amount of data first and then filters the second field. Naturally, the efficiency is not as good
ES High.
Posting List will also be compressed in the latest version of
ES. For specific compression rules, you can check the official documentation, which will not be introduced in detail here.
ES inverted index. Only by writing it myself can I deepen my understanding.
Related free learning recommendations:mysql database (video)
The above is the detailed content of MySQL index VS ElasticSearch index. For more information, please follow other related articles on the PHP Chinese website!