What is an index
An index is a data structure whose function is to improve the efficiency of data query. A common metaphor is to compare it to the catalog of a book. Through the table of contents, you can accurately find the page where the content of a certain chapter is located.
There is actually no point in using an index when the amount of data is small. Even if there is no index and it needs to traverse the data one by one, it does not take much time for the computer. Once the amount of data is large, indexing is necessary to ensure that we can provide normal external services and ensure user experience.
Index type
Index is a data structure, and there are multiple implementations to deal with different scenarios. In MySQL, the main ones are Hash index and B Tree.
Hash Index
Hash I believe everyone should be familiar with it. Hash is a data structure in the form of key-value. The implementation is generally an array-linked list structure. The position of the key in the array is calculated through the hash function, and then if a hash conflict occurs, it is resolved through the linked list (zipper method). Of course there are other ways to resolve hash conflicts. The data structure of hash is very commonly used. For example, our system uses HashMap to build hotspot data cache, and the access efficiency is very good.
The hash structure stores data. First, the hash value of the key is calculated to determine its position in the array. If there is a conflict, a linked list is built at the array position. This obviously has several problems:
Even the calculated positions of keys with the same characteristics may be far apart, making continuous queries inefficient. That is, range queries are not supported.
The hash index stores the calculated hash value and row pointer, but does not store the specific row value, so querying the data through the hash index requires two queries (first query the position of the row, and then find the specific Data)
The premise of hash index query data is to calculate the hash value, which requires the key to be a key that can accurately point to a piece of data, so matching queries such as like are not supported.
So what we can know is that the hash index is suitable for quickly selecting a certain row of data.
B Tree structure
Judging from the name, this is obviously a tree structure. Tree structures are a must-have in data structure textbooks in college. The tree structure is a particularly important data structure that is used in many places.
We mentioned above that hash indexes cannot perform range queries. There is also a structure in the tree structure that is convenient for ordered queries - a binary search tree. The structure of the binary search tree requires that the value of the parent node is greater than the left child node and less than the right child node, as shown below:
The time complexity of the binary tree query in the above figure It is O(log(n)). Of course, to ensure the time complexity of O(log(n)), we need to ensure that the binary tree remains balanced at all times.
Although the tree structure is also used in the MySQL index, it is not a binary tree. Because the data in the database is ultimately stored on the disk, and if the tree has too many nodes, it will take a lot of time to transfer between nodes. In the implementation of MySQL, we choose to put more content on the same node, and transfer the operations on the same node to the memory to reduce the number of transfers between nodes in external memory to achieve the purpose of improving efficiency. This is B Tree. In the implementation of B Tree, a three-layer tree structure can basically meet almost all our needs.
Related recommendations: "mysql database knowledge learning"
B-Tree
To understand B Tree first You have to understand B-Tree. B-Tree is a balanced tree. The B here refers to Balance rather than Binary. To be more precise, B-Tree is a multi-way balanced search tree.
The multi-path balanced search tree is as shown below:
This is a 2-3 tree, which means that each node stores two values, and at the same time The number of branches per node is 3. As can be seen from the above figure, the middle structure is very suitable for querying data. The value of the left subtree of each node is less than the smallest value of the current node, the values of the middle subtree are all in the middle of the two values of the current node, and the values of the right subtree are all greater than the maximum value of the current node.
For example, we want to find the value 24:
(1) First, judge from the root node that 24 is between the root nodes (15, 25), so the left and right subtrees are excluded and the search is from the middle.
(2) Then find the root node (18,22) of the intermediate subtree. Comparison finds that 24 is greater than the maximum value of the node, excluding the left subtree and the intermediate subtree.
(3) Find the right subtree, judge that the maximum value of the node is exactly equal to 24, and the query ends.
Based on the above process, the B-tree search can be summarized:
(1) Starting from the root node, perform a binary search on the keyword (ordered) sequence within the node.
(2) If hit, end, otherwise enter the child node of the range to which the query keyword belongs;
(3) Repeat the above process until the corresponding child node is empty or is already a leaf node;
It can be seen that the search performance is equivalent to doing a binary search in the keyword set . From here it seems that there is nothing wrong with B-Tree, but it should be noted that each node in B-Tree stores the index key and the specific row data it represents. In MySQL, database loading data is loaded in page units, and the size of each page is fixed (default 16k). If each node stores all values, then there will be very few nodes that can be stored on a page, and a query may load data from memory multiple times, resulting in reduced performance.
B Tree
B Tree is a variant of B-Tree, making it more suitable for indexing external storage files.
The biggest difference between the two is that each node of B-Tree stores all data, while the data that B-Tree needs to store is all on the leaf nodes, and a sequential access pointer is added to each leaf. Each node has an address pointing to the next adjacent leaf node. This structure ensures that more index nodes can be stored in one memory page, and is more suitable for range queries.
Index
Because the storage engine is responsible for implementing the index, the indexes discussed next are all based on MySQL's InnoDB engine.
Clustered index
Clustering means that data rows and adjacent key value clusters are stored together. Some databases allow you to select a specific index as a clustered index, while in the implementation of InnoDB, the primary key index is directly designated as a clustered index. If no primary key is defined, InnoDB will choose a unique non-null index to replace the primary key index. If such an index is not defined, InnoDB will implicitly define a primary key as a clustered index (row_id).
Clustered index example is shown in the figure:
Non-clustered index index
Except the primary key in InnoDB Except for the index, everything else is a non-clustered index, so it is also called a non-primary key index. The leaf nodes of non-primary key indexes do not store the value of a row, but the primary key value of a specific row. The definition of clustering is not met.
An example of a non-clustered index is shown in the figure:
The difference between clustered index and non-clustered index during query
It can be seen from the above two index example diagrams that if the query is through the primary key index, the data rows will be directly queried and returned. However, if you query through a non-primary key index, you first need to determine the primary key through the index, and then use the obtained primary key to find the data of the specific row from the primary key index. The subsequent process of obtaining data from the primary key index through the obtained primary key is called Return to table.
The process of returning the table makes querying through ordinary indexes one step more than querying through primary key indexes, and in many cases the efficiency is relatively low. Therefore, in our query process, if we can determine the data only through the primary key, it is best to query directly using the primary key.
Covered index
The above describes the process of returning the table through non-primary key query, but it should be noted that not every query has the process of returning the table. First, for an ordinary index, its leaf nodes store the value of the primary key. So what if the data I need now is only the value of the primary key? After obtaining the value of the primary key through the ordinary index, there is no need to look it up in the primary key index, so there is no process of returning to the table.
In the above example, the non-primary key index already contains the value we need, so the index is also called a covering index. Covering index is not a fixed structure. It can be a single index (an index on one field) or a compound index. Anything that can directly provide query results without the need to perform a table return process can be called a covering index.
Many times it is impossible for us to determine data only through the primary key. Using ordinary indexes may lead to inefficiency, so covering indexes are also a very common performance optimization method in the daily development process.
Of course, covering index pages is not always a good thing. For example, I have now created an index index(a,b). The advantage of establishing an index using two fields, a and b, is that the table will not be returned when querying the ab field. However, if you query only through the b field, you cannot use this index. The index items of the created index are sorted according to the order of fields appearing in the index definition.
Leftmost prefix principle
Assume that there is an index index (a, b), then if you query through a and b, the index can be applied, use a alone to The query can also be applied to the index, but if you use b alone to query, it cannot be applied to the index. This is the leftmost prefix principle. When matching the index, the leftmost n fields of the index will be matched. If there is a match, the index can be applied.
Due to the existence of the leftmost prefix principle, we may need to consider more things when building an index.
First of all, you need to be clear that an index is a data structure. When building an index, storage space is consumed. Therefore, it is not that the more indexes are created, the better. Instead, the number of indexes should be reduced as much as possible according to needs.
The existence of the leftmost prefix principle allows a joint index to be used as multiple indexes. Of course, the premise is to design the order of the fields in the index (in fact, the leftmost prefix principle is not only applicable to Union index, also used for string index, the leftmost n characters in the string index are equivalent to the leftmost n fields in the union index).
For example, index(a,b), with this index, we do not need to create a separate index for a, so when designing a joint index, generally put the fields with higher frequency of use first.
Then move the fields with higher discrimination to the front. The discrimination is the repetition rate of the values in the field. The lower the repetition rate, the higher the discrimination. For example, gender is not suitable as an index. Fields with higher distinction can filter out more rows after one filter.
Then what needs to be considered is the size of the field. Since the index also needs to occupy space, smaller fields are generally selected.
Reference materials
MySQL operation and maintenance internal reference: MySQL, Galera, Inception core principles and best practices
The above is the detailed content of Understand the indexes in MySQL in one article. For more information, please follow other related articles on the PHP Chinese website!