Home >Database >Mysql Tutorial >MySQL index VS ElasticSearch index

MySQL index VS ElasticSearch index

coldplay.xixiforward: 2020-10-09 17:03:572061browse

TodayMySQL database column introduces the comparison between MySQL index and ElasticSearch index.

Preface

I am maintaining the search function of the product during this period, and every time I see ## on the management console #elasticsearch I am very curious about how he achieves such efficient query efficiency.

This is even faster than querying by primary key using

MySQL on my local machine.

I searched for relevant information:

There are many answers to this type of question on the Internet. The general meaning is as follows:

Lucene. It will segment the data and save the index. It is good at managing a large number of Index data, compared to MySQL, is not good at frequently updating data and related queries.

The explanation is not very thorough, and there is no analysis of the relevant principles; but since the index is mentioned repeatedly, let's compare the differences between the two from the perspective of the index.

MySQL Index

Let’s start with

MySQL. Everyone must be familiar with the word index. It usually exists in some query scenarios and is a typical space-for-time exchange. case.

以下内容以 Innodb 引擎为例。复制代码

Common data structures

Assuming that we design the index of

MySQL ourselves, what are the options?

Hash table

The first thing we should think of is the hash table, which is a very common and efficient data structure for querying and writing, and corresponds to

Java HashMap

This data structure should not need much introduction, its writing efficiency is very high

O (1), for example, when we want to query the data of id=3, we need to hash 3 and then find the corresponding position in this array.

But if we want to query interval data like

1≤id≤6, the hash table cannot satisfy it well. Since it is unordered, all the data must be It takes one pass to know which data belongs to this interval.

Ordered array

Ordered array query efficiency is also very high, when we want to query

id=4 data, you only need to use binary search to efficiently locate the data O(logn).

At the same time, since the data is also ordered, it can naturally support interval queries; so it seems that ordered arrays are suitable for use as indexes?

Of course not, it has another major problem ;Suppose we insert the data with

id=2.5, all subsequent data must be moved by one bit at the same time, and the writing efficiency will become very low.

Balanced Binary Tree

Since the writing efficiency of ordered arrays is not high, let’s take a look at the ones with high writing efficiency. It’s easy to think of binary trees; here we use balanced binary trees as Example:

Due to the characteristics of a balanced binary tree:

The left node is smaller than the parent node and the right node is larger than the parent node.

So assuming we want to query the data of

id=11, we only need to query 10—>12—>11 to finally find the data. The time complexity is O(logn), and similarly when writing data, it is also O(logn).

But it still does not support interval range search very well. Suppose we want to query the data of

5≤id≤20, we need to query the left subtree of 10 nodes first and then query the left subtree of 10 nodes. Only the right subtree can finally query all the data.

As a result, such query efficiency is not high.

Jump table

The skip table may not be as common as the hash table, ordered array, and binary tree mentioned above, but in fact, the # in

Redis ##sort set is implemented using a skip table. Here we briefly introduce the advantages of the data structure implemented by jump tables. We all know that even querying an ordered linked list is not efficient. Since it cannot use array subscripts for binary search, the time complexity is<code>o( n)

But we can also cleverly optimize the linked list to implement binary search in disguise, as shown below:

## We can extract primary indexes and secondary indexes for the lowest level data. Depending on the amount of data, we can extract N-level indexes.

When we query, we can use the index here to implement a binary search in disguise.

Suppose you want to query the data of

id=13, you only need to traverse the four nodes 1->7->10->13 to query When it comes to data, when the number is larger, the efficiency improvement will be more obvious.

At the same time, interval query is also supported. It is similar to the query of a single node just now. You only need to query the starting node, and then traverse it in sequence (

The linked list is in order) to the target node. The entire range of data is queried.

At the same time, since we will not store real data in the index, but only store a pointer, the space occupied can be ignored compared to the linked list at the bottom where data is stored.

Optimization of balanced binary tree

But in fact,

Innodb in MySQL does not use a skip table, but uses one called B Tree data structure.

This data structure is not like the binary tree that university teachers often talk about as a basic data structure, because this type of data structure is evolved from the basic data structure according to the demand scenario in actual projects.

For example, the

B tree here can be thought of as evolving from a balanced binary tree.

Just now we mentioned that the interval query efficiency of binary trees is not high. This can be optimized:

In the original binary tree After optimization based on: All non-leaf nodes do not store data, but only serve as indexes for leaf nodes, and all data are stored in leaf nodes.

In this way, the data of all leaf nodes are stored in order, and interval queries can be well supported.

You only need to query the position of the starting node first, and then traverse the leaf nodes in sequence.

When the amount of data is huge, it is obvious that the index file cannot be stored in the memory. Although it is very fast, it consumes a lot of resources; so

MySQL will store the index file directly in in the disk.

This is slightly different from the elasticsearch index mentioned later.

Since the index is stored on the disk, we must reduce the IO with the disk as much as possible (the efficiency of disk IO is not of the same order of magnitude as that of memory)

As can be seen from the above figure, we need to Querying a piece of data requires at least 4 IO times. Obviously, the number of IO times is closely related to the height of the tree. The lower the height of the tree, the fewer the number of IO times and the better the performance.

How can we reduce the height of the tree?

#We can try to change the binary tree into a trinomial tree, so that the height of the tree will drop a lot, and the number of IOs when querying data will naturally decrease. Reduced, while query efficiency will be greatly improved.

This is actually the origin of B-tree.

Some suggestions for using indexes

In fact, through the understanding of

B tree in the above figure, we can also optimize some small details of daily work; for example, why we need the most Is the good thing increasing in order?

Assuming that the primary key data we write is unordered, then it is possible that the id of the data written later is smaller than the one written before, which may be necessary when maintaining the

B tree index. Mobile has already written the data.

If you are writing data incrementally, you will not have this consideration. You only need to write it sequentially each time.

That’s why we require the primary key of the database to have an increasing trend as much as possible. The most reasonable thing is to auto-increment the primary key without considering the situation of split tables.

Overall, the idea is similar to that of a skip table, but adjustments have been made based on the usage scenario (for example, all data is stored in leaf nodes).

ES Index

MySQL After chatting, let’s take a look at how Elasticsearch uses indexes.

Forward index

In ES, a data structure called

Inverted index is used; before we formally talk about inverted index, let’s talk about His opposite is ranked index .

The above figure is an example. The way we can query specific objects through doc_id is called using Forward index, in fact, can also be understood as a hash table.

The essence is to find value through key.

For example, through doc_id=4, you can quickly query the data name=jetty wang,age=20.

Inverted index

Then if in turn I want to query what data contains li in name? How to query efficiently in this way?

Just using the forward index mentioned above obviously does not have any effect. We can only traverse all the data in sequence and then determine whether the name contains li; this is very inefficient.

But if we rebuild an index structure:

When querying name contains li data, you only need to query the data contained in Posting List through this index structure, and then query the final data through mapping.

This index structure is actually Inverted index.

Term Dictionary

But how to efficiently query li in this index structure? Combined with our previous experience, as long as we add Term Arranged in order, you can use the data structure of the binary tree search tree to query the data under o(logn).

The process of splitting a text into independent Term is actually what we often call word segmentation.

Merging all Term together is a Term Dictionary, which can also be called a word dictionary.

English word segmentation is relatively simple. You only need to separate the text by spaces and punctuation marks to split the words. Chinese is relatively complicated, but there are also many open source tools to support it (since it is not the focus of this article, word segmentation is Those who are interested can search by themselves).

When the amount of our text is huge, there will be a lot of Term after word segmentation. If such an inverted index data structure is stored in memory, it will definitely not be enough. But if it is stored on disk like MySQL, the efficiency is not that high.

Term Index

So we can choose a compromise method. Since the entire Term Dictionary cannot be put into memory, then we can Term Dictionary Create an index and put it into memory.

In this way, Term Dictionary can be queried efficiently, and finally Posting List can be queried through Term Dictionary.

Compared with the B tree in MySQL, it will also reduce the disk IO several times.

ThisTerm Index We can use this Trie tree which is what we often sayDictionary tree to store.

For more information about dictionary trees, please see here.

If we search for Term starting with j, the first step is to search through The Term Index in the memory queries the position of the Term starting with j in the Term Dictionary dictionary file (this position can be a File pointer, possibly an interval range).

Then take out all the Term in this position range. Since they have been sorted, you can quickly locate the specific position through binary search; in this way, you can query the Posting List.

Finally, the target data can be retrieved from the original file through the location information in Posting List.

More optimization

Of courseElasticSearch has also made many targeted optimizations. When we retrieve two fields, we can use bitmap optimize.

For example, now we need to query the data of name=li and age=18. At this time, we need to retrieve the respective results Posting List through these two fields.

The simplest method is to traverse the two collections separately and remove the duplicate data, but this is obviously inefficient.

At this time we can use the bitmap method to store (also save storage space), and at the same time use the innate bits and ** calculations to get the result . **

[1, 3, 5] ⇒ 10101

##[1, 2, 4, 5] ⇒ 11011

In this way, the result can be obtained by summing the two binary arrays:

10001 ⇒ [1, 5]

In the end,

Posting List is solved as [1, 5], which is naturally much more efficient.

There is no special optimization for the same query requirement in

MySQL. It just filters out the data with a small amount of data first and then filters the second field. Naturally, the efficiency is not as good ES High.

Of course,

Posting List will also be compressed in the latest version of ES. For specific compression rules, you can check the official documentation, which will not be introduced in detail here.

Summary

Finally, let’s summarize:

Through the above content, we can see that no matter how complex the product is, In the end, they are all composed of basic data structures, but they will be optimized specifically for different application scenarios. Therefore, after you have a good foundation of data structures and algorithms, you can quickly get started when looking at a new technology or middleware, and you can even know the optimization yourself. direction.

Finally draw a pie. In the future, I will try to make a stand-alone search engine based on the idea of

ES inverted index. Only by writing it myself can I deepen my understanding.

Related free learning recommendations:
mysql database (video)

The above is the detailed content of MySQL index VS ElasticSearch index. For more information, please follow other related articles on the PHP Chinese website!

Statement：

This article is reproduced at:juejin.im. If there is any infringement, please contact admin@php.cn delete

Previous article：How to open database with mysqlNext article：How to open database with mysql

See more