By pre-sorting, the index can be searched using high-efficiency algorithms such as binary search. The complexity of general sequential search is O(n), while the complexity of binary search is O(log2n); when n is very large, the efficiency difference between the two is huge.
Mysql is a very popular database on the Internet. The design of its underlying storage engine and data retrieval engine is very important. In particular, the storage form of Mysql data and the design of the index determine the overall data of Mysql. Retrieval performance.
We know that the function of an index is to quickly retrieve data, and the essence of fast retrieval is the data structure. Through the selection of different data structures, various data can be quickly retrieved. In the database, efficient search algorithms are very important, because a large amount of data is stored in the database, and an efficient index can save huge time. For example, in the following data table, if Mysql does not implement the index algorithm, then to find the data with id=7, you can only use violent sequential traversal to find the data. To find the data with id=7, you need to compare it 7 times. If this table stores 10 million pieces of data To search for the data with id=1000W, it will be compared 1000W times. This speed is unacceptable.
Hash table (Hash)
Hash table is an effective tool for fast data retrieval. Hash algorithm: Also called hash algorithm, it converts any value (key) into a fixed-length key address through a hash function, and uses this address to create a data structure for specific data.select * from user where id=7;The hash algorithm first calculates the physical address addr=hash(7)=4231 to store the data with id=7, and the physical address mapped by 4231 is 0x77, and 0x77 is where id=7 is stored. The physical address of the data. The data corresponding to user_name='g' can be found through this independent address. This is the calculation process used by the hash algorithm to quickly retrieve data. But the hash algorithm has a data collision problem, that is, the hash function may calculate the same result for different keys. For example, hash(7) may calculate the same result as hash(199). That is, different keys are mapped to the same result. This is a collision problem. A common way to solve the collision problem is the chain address method, which uses a linked list to connect the colliding data. After calculating the hash value, you also need to check whether the hash value has a collision in the data linked list. If so, it will be traversed to the end of the linked list until the data corresponding to the real key is found.
Because considering that a common method for data retrieval is range search, such as the following SQL statement:
select * from user where id \>3;
For the above statement, what we hope to do is to find the data with id>3, This is a very typical range search. If you use an index implemented by a hash algorithm, how to do a range search? A simple idea is to find all the data at once and load it into the memory, and then filter the data within the target range in the memory. But this range search method is too cumbersome and not efficient at all.
Therefore, although the index implemented using the hash algorithm can quickly retrieve data, it cannot perform efficient range search of data. Therefore, the hash index is not a data structure suitable as the underlying index of Mysql.
Binary search tree (BST)
Binary search tree is a data structure that supports fast data search, as shown in the figure below:
The time complexity of a binary search tree is O(lgn). For example, for the above binary tree structure, we need to calculate and compare 3 times to retrieve it. The data with id=7 saves half the time compared to direct traversal query. From the perspective of retrieval efficiency, it seems that high-speed retrieval can be achieved. In addition, can the structure of the binary tree solve the range search function that the hash index cannot provide?
The answer is yes. Observe the picture above, the leaf nodes of the binary tree are arranged in order, in ascending order from left to right. If we need to find the data with id>5, then we can take out the node with node 6 and its right subtree. , range search is relatively easy to implement.
But the ordinary binary search tree has a fatal shortcoming: in extreme cases, it will degenerate into a linear linked list, binary search will also degenerate into a traversal search, the time complexity will degenerate to O(N), and the retrieval performance will drop sharply. For example, in the following situation, the binary tree is extremely unbalanced and has degenerated into a linked list, and the retrieval speed is greatly reduced. At this time, the number of calculations required to retrieve the data with id=7 has become 7.
In the database, data auto-increment is a very common form. For example, the primary key of a table is id. The primary key is generally auto-incrementing by default. If a data structure such as a binary tree is used as an index, the linear search problem caused by the unbalanced state introduced above will inevitably occur. Therefore, a simple binary search tree has the problem of reduced retrieval performance caused by imbalance, and cannot be directly used to implement the underlying index of Mysql.
AVL tree and red-black tree
Binary search tree has an imbalance problem, so scholars proposed to make the binary tree through automatic rotation and adjustment of tree nodes By always maintaining a basically balanced state, you can maintain the best search performance of the binary search tree. Binary trees with self-adjusting equilibrium states based on this idea include AVL trees and red-black trees.
First of all, we briefly introduce the red-black tree. This is a tree structure that automatically adjusts the tree shape. For example, when the binary tree is in an unbalanced state, the red-black tree will automatically rotate left-right nodes and the nodes will change color. Adjusting the shape of the tree to maintain a basic balanced state (time complexity is O(logn)) ensures that the search efficiency will not be significantly reduced. For example, if data nodes are inserted in ascending order from 1 to 7, an ordinary binary search tree will degenerate into a linked list, but a red-black tree will continuously adjust the shape of the tree to maintain a basic balance, as shown in the figure below. The number of nodes to be compared when searching for id=7 in the red-black tree below is 4, which still maintains the good search efficiency of the binary tree.
The red-black tree has good average search efficiency, and there is no extreme O(n) situation. Can the red-black tree be used as the underlying index implementation of Mysql? In fact, red-black trees also have some problems. Look at the following example.
The red-black tree inserts 1~7 nodes sequentially, and the number of nodes that need to be calculated when searching for id=7 is 4.
The red-black tree inserts 1~16 nodes sequentially, and the number of nodes that need to be compared when searching for id=16 is 6 times . Observe the shape of this tree. Is it true that when data is inserted sequentially, the shape of the tree has always been in a "right-leaning" trend? Fundamentally, the red-black tree does not completely solve the binary search tree. Although this "right-leaning" trend is far less exaggerated than the binary search tree degenerating into a linear linked list, the basic primary key auto-increment operation in the database, the primary key is generally Millions and tens of millions. If the red-black tree has this kind of problem, it will also consume a huge amount of search performance. Our database cannot tolerate this meaningless waiting.
Now consider another more strict self-balancing binary tree, the AVL tree. Because the AVL tree is an absolutely balanced binary tree, it consumes more performance in adjusting the shape of the binary tree.
AVL tree inserts 1~7 nodes sequentially, and the number of times to compare the node with id=7 is 3.
AVL tree sequentially inserts 1~16 nodes, and finds id=16. The number of nodes that need to be compared is 4. In terms of search efficiency, the search speed of AVL tree is higher than that of red-black tree (AVL tree is 4 comparisons, red-black tree is 6 comparisons). Judging from the shape of the tree, AVL trees do not have the "right leaning" problem of red-black trees. In other words, a large number of sequential insertions will not lead to a decrease in query performance, which fundamentally solves the problem of red-black trees.
Summarize the advantages of AVL trees:
Good search performance (O(logn)), there is no extreme inefficient search situation.
Can realize range search and data sorting.
It seems that the AVL tree is really good as a data structure for data search, but the AVL tree is not suitable for the index data structure of the Mysql database, because consider this problem:
The bottleneck of database query data is disk IO. If we use an AVL tree, each of our tree nodes only stores one data. We can only take out the data on one node and load it into the memory with one disk IO. Then For example, to query the data id=7, we have to perform disk IO three times, which is very time-consuming. Therefore, when designing database indexes, we need to first consider how to reduce the number of disk IOs as much as possible.
Disk IO has a characteristic, that is, the time it takes to read 1B data and 1KB data from the disk is basically the same. Based on this idea, we can read as many times as possible on a tree node. Store data locally, and load more data into memory with one disk IO. This is the design principle of B-tree and B-tree.
B tree
The following B tree is limited to storing up to two keys per node. If a node has more than two keys, it will automatically split. For example, the following B-tree stores 7 data. You only need to query two nodes to know the specific location of the data with id=7. That is, you can query the specified data with two disk IOs, which is better than the AVL tree.
The following is a B-tree that stores 16 pieces of data. Similarly, each node stores up to 2 keys. Query The data with id=16 needs to be queried and compared on 4 nodes, which means 4 times of disk IO. Looks like query performance is the same as AVL tree.
But considering that the time consumed by disk IO to read one piece of data is basically the same as that of reading 100 pieces of data, then our optimization The idea can be changed to: read as much data as possible into the memory in one disk IO. This is directly reflected in the structure of the tree, that is, the key that each node can store can be increased appropriately.
When we set the number of keys limited to a single node to 6, for a B-tree that stores 7 pieces of data, the disk IO required to query the data with id=7 is 2 times.
A B-tree that stores 16 pieces of data, querying the data with id=7 requires 2 disk IOs. Compared with the AVL tree, the number of disk IOs is reduced to half.
So in terms of selection of database index data structure, B-tree is a very good choice. In summary, B-tree has the following advantages when used as a database index:
Excellent retrieval speed, time complexity: The search performance of B-tree is equal to O(h* logn), where h is the tree height and n is the number of keywords in each node;
As little disk IO as possible speeds up the retrieval speed;
Can support range search.
B tree
What is the difference between B tree and B tree?
First, B tree stores data in one node, and B tree stores indexes (addresses), so one node in B tree cannot store a lot of data , but one node of the B-tree can store many indexes, and the leaf nodes of the B-tree store all the data.
Second, The leaf nodes of the B tree are connected in series using a linked list in the data stage to facilitate range search.
MYI: Index file in the table (myisam index)
From the generated file, the underlying data and index of these two engines The organization method is different. The MyISAM engine separates the data and index, and each person has one file, which is called the non-clustered index method; the Innodb engine puts the data and index in the same file, which is called the clustered index method. The following will analyze how these two engines rely on the B-tree data structure to organize engine implementation from the perspective of underlying implementation.
The underlying implementation of MyISAM engine (non-clustered index mode)
MyISAM uses a non-clustered index mode, that is, the data and index fall into two different on the file. When MyISAM creates a table, it uses the primary key as the KEY to create a primary index B-tree. The leaf nodes of the tree store the physical address of the corresponding data. After we get this physical address, we can directly locate the specific data record in the MyISAM data file.
When we add an index to a field, we will also generate an index tree for the corresponding field. The leaf nodes of the index tree also record the physical address of the corresponding data, and then use this physical address to locate the specific data record in the data file.
The underlying implementation of the Innodb engine (clustered index method)
InnoDB is a clustered index method, so the data and index are stored in the same file. First, InnoDB will build an index B tree based on the primary key ID as KEY, as shown in the lower left figure. The leaf nodes of the B tree store the data corresponding to the primary key ID. For example, when executing the statement select * from user_info where id=15, InnoDB The primary key ID index B-tree will be queried to find the corresponding user_name='Bob'.
This is when InnoDB will automatically build the primary key ID index tree when creating a table. This is why Mysql requires the primary key to be specified when creating a table. How does InnoDB build an index tree when we add an index to a field in the table? For example, if we want to add an index to the user_name field, then InnoDB will create a user_name index B-tree. The KEY of user_name is stored in the node, and the data stored in the leaf nodes is the primary key KEY. Note that the leaves store the primary key KEY! After getting the primary key KEY, InnoDB will go to the primary key index tree to find the corresponding data based on the primary key KEY just found in the user_name index tree.
The question is, why does InnoDB only store specific data in the leaf nodes of the primary key index tree, but not in other index trees? What about the specific data? What if we need to find the primary key first and then find the corresponding data in the primary key index tree?
It's actually very simple, because InnoDB needs to save storage space. There may be many indexes in a table. InnoDB will generate an index tree for each indexed field. If the index tree of each field stores specific data, then the index data file of this table will become very huge (data Extremely redundant). From the perspective of saving disk space, there is really no need to store specific data in each field index tree. Through this seemingly "unnecessary" step, huge disk space is saved at the expense of less query performance. This It's very worthwhile.
When comparing the features of InnoDB and MyISAM, it was mentioned that MyISAM has better query performance. The reason can be seen from the design of the index file data file above: MyISAM can directly locate the physical address after directly finding it. Data records, but after InnoDB queries the leaf nodes, it needs to query the primary key index tree again to locate the specific data. It means that MyISAM can find the data in one step, but InnoDB requires two steps. Of course, MyISAM's query performance is higher.
The above is the detailed content of What is the reason why mysql index is fast?. For more information, please follow other related articles on the PHP Chinese website!