Home  >  Article  >  Database  >  High-performance MySQL-detailed explanation of creating high-performance indexes (picture and text)

High-performance MySQL-detailed explanation of creating high-performance indexes (picture and text)

黄舟
黄舟Original
2017-03-15 17:20:241682browse


This article is about creating index:
(1) Type of index
(2) Advantages of index
(3) Strategies for optimizing indexes

Here is a mind map of indexing:
High-performance MySQL-detailed explanation of creating high-performance indexes (picture and text)

Index is a method used by storage engines to quickly find records data structure. Indexes are the most effective means for optimizing query performance. Indexes can easily improve query performance by several orders of magnitude. Indexing We generally add an index to a certain column.

The storage engine first finds the corresponding value in the index, and then finds the corresponding data row based on the rowid on the matching index record. For example, run the following query statement:

SELECT first_name from actor where actor_id=5;

If there is an index on the actor_id column, MySQL will use the index to find the row corresponding to actor_id 5. That is to say, MySQL first searches on the index Finds by value and returns all rows containing that value.

The index can contain the values ​​of one or more columns. If the index contains multiple columns, the order of the columns is also very important, because MySQL can only efficiently use the leftmost prefix column of the index. There is a big difference between creating an index that contains two columns and creating two indexes that contain one column.

1. Type of index data structure:

The most common indexes are B-Tree indexes and hash indexes.

(1) B-Tree tree index

Generally, index refers to B-Tree index, which uses B-Tree data structure to store data. In fact, it is implemented based on B+Tree. Each leaf node contains a pointer to the next leaf node.

B-Tree means that all values ​​are stored in order. For example, for the name attribute, they are stored in the order from a-z. After using the B-Tree index, the storage engine no longer needs to perform a full table scan to obtain the required data. Instead, it starts searching from the root node of the index. The final result is that either the corresponding value is found or the record does not exist. This enables faster access to data.

B-Tree organizes and stores index columns sequentially, so it is very suitable for searching range data. (For example, searching for names starting with I-k will be very efficient)

Query type suitable for B-Tree index
(1) Full value matching: and in the index All columns are matched.

(2) Match the leftmost prefix: For an index containing multiple columns, only the first column of the index is used.

(3) Match column prefix: match the beginning of the value of a certain column. (For example, when matching the name field, only names starting with J are matched.) Only the first column of the index is used here.

(4) Matching range value: Match the records whose field is within a certain range. Only the first column of the index is used here.

(5) Exactly match a certain column and range match another column: For the case where an index contains multiple fields, such as exact match on the first column and range match on the second column.

(6) Query that only accesses the index: it accesses the index rows without accessing the data rows of other fields in the record.

The above range matching is mainly because the index stores the index columns in order, which leads to the high efficiency of range matching.

There are also some restrictions on B-Tree indexes:
(1) The index can only search from the leftmost column

(2) If there is a range search of a certain column in the query , then all columns to the right of it cannot use index optimization.

Seeing the above two restrictions, you should be able to understand that when the index contains multiple columns, the order of the index columns is very important.

(2) Hash hash index

The hash index is implemented based on the hash table. Only queries that accurately match all columns of the index are valid. For each row of data, the data storage engine calculates a hash code for all index columns. The hash code is a smaller value, and the hash codes calculated for rows with different key values ​​are also different.

1) The hash index only stores hash values ​​and row pointers, and does not store specific field values, so there must be a process of reading rows.

2) Hash index is not stored in the order of index value, so it cannot be used for sorting.

3) The hash index only supports equality comparison queries and does not support range comparison queries. This is related to the characteristics of the hash table.

4) Hash indexes have the problem of hash conflicts. For hash conflict data, all row pointers in the linked list must be traversed.

With the above limitations, hash indexes are only suitable for specific occasions, but once they are suitable for hash indexes, the performance will be particularly high.

When using a hash index, usually the value before the hash must be included in the query conditions, such as:

mysql>select * from words where crc=crc32(‘gnu’) and word=’gnu’;

这里crc字段就是word字段哈希之后的值,因为hash之后可能存在冲突,带上原本的值做上二次比较,就可以精确定位。

2.索引的优点:

索引可以让服务器快速定位到表的指定位置。但是这不是唯一的作用,比如:
(1)对于B-Tree索引,由于B-Tree是按照顺序存储数据的,所以用来做order by 操作或则是 group by操作的效率很高。

(2)因为索引中存储了实际的列值,所以某些查询只需要索引就可以完成全部查询。

总结来说就是3点:
(1)索引大大减少服务器需要扫描的数据量;

(2)索引可以帮助服务器避免排序和临时表;

(3)索引可以将随机IO变为排序IO。

3.高性能的索引策略

先概括一下索引的策略:
 1)单列索引
 2)多列索引
 3)前缀索引
 4)聚簇索引
 5)覆盖索引

单列索引

所谓单列索引是指:使用数据表字段中的某一列作为索引。但是索引列不能是表达式的一部分,也不能是函数的参数

比如对于下面的一个例子:

select actor_id from actor where actor_id+1=5;

对于这样的一个SQL,where语句后面 是一个表达式,其实很明显是actor_id=4的条件,但是MySQL却无法解析,索引无法正却使用索引。

还有一种是函数参数:也是无法正常的使用索引的

select ... where TO_DAYS(CURRENT_DATE) - TO_DAYS(date_col)<=10;

多列索引以及选择合适的索引顺序

注意这里要区分:为每个列创建独立的索引和为多个列创建一个索引的区别。

比如下面这种情况:

CREATE TABLE t{
c1 int,
c2 int,
c3 int,key(c1),key(c2),key(c3)
}

这一种就是为表中的3个列都创建了索引。

但是多个列创建索引就是:创建了一个索引,包含customer_id,和staff_id

alter table payment add KEY(customer_id, staff_id);

上面这个索引其实是包含了两个索引,一个是customer_id这个索引,还有一个是(customer_id,staff_id)。注意staff_id并不能作为单独的索引使用。



对于多列索引,最重要的就是怎么选择索引列的顺序,其实这一点与实际的查询需求有关。主要是为了满足排序和分组。


先从数据结构层次来分析,我们知道索引是以B-Tree的形式存储的,在一个多列索引列中,索引的顺序意味着索引首先按照最左列进行排序,其次是第二列。所以对于一个多列索引,如果以第二列或则第三列直接作为索引,基本是没有用到索引。由于索引的有序性很好的满足了order by、group by和distinct等子句的查询需求。

从上面的分析我们就能认识到多列索引中列的顺序是多么重要。关于多列索引中有一点经验法则:
(1)在不需要考虑排序和分组时,通常情况下将选择性最高的列放在索引最前列。(这时候索引只需要优化where查询条件,能够很快过滤出需要的行)

索引列的选择性定义:不重复的索引值和数据表的记录总数的比值。索引的选择性越高也就是查询效率越高。比如对于人员信息表,phone这一字段的选择性是很高的,几乎为1,但是对于sex性别这一字段的选择性是非常低的,因为只有两个选择男或则是女,几乎为0。

(2)实际情况下也与数据的分布有很大关系。
以下面的查询为例:

SELECT * FROM item WHERE staff_id=2 AND customer_id=584;

这时候应该创建(staff_id, customer_id)的索引还是应该创建(customer_id,staff_id)的索引呢?这时候就应该确认一下那个字段的选择性更高,先查询一下staff_id和customer_id的总数,哪个小就将哪个放在前面。

前缀索引

前缀索引:有时候需要索引的列可能会很长,这时候会导致索引大而且很慢,我们可以只索引列开始的部分(也就是只索引某一列的前面几个字符),这样可以大大节省索引空间也能加快索引的速度,但是也会降低索引的选择性(也就是索引查出来的结果会变多)。

使用的技巧在于:选择足够长的前缀保证较高的选择性,同时又不能太长,避免占用太多的存储空间。

Clustered index

Clustered index is not a separate index type, but a data storage method. Here we mainly use InnoDB as an example to illustrate the clustered index.

The clustered index in InnoDB actually stores the B-tree index and data rows in the same structure. When there is a clustered index in a table, its data rows are actually stored in the leaf pages of the index. The meaning of clustering is actually that data rows and key values ​​in adjacent B-Tree are stored compactly together. Data rows can only be stored in one place, so there can only be one clustered index.

The following is an example diagram to illustrate: the index column is an integer value, the leaf page contains all the data of the row, but the node page only contains the index column (the integer value in the figure below).

High-performance MySQL-detailed explanation of creating high-performance indexes (picture and text)

In the current version of MySQL, InnoDB's clustered index only supports using primary keys to cluster data. If no primary key is defined, InnoDB will choose a unique non-empty index instead.

Advantages of clustered data:
(1) Related data can be saved together. For example, when querying email addresses, user ID is used as the primary key and the data is clustered by user ID. In this way, all emails of a user can be obtained by reading only a small number of data pages from the disk.

(2) Data access is faster. A clustered index stores the index and data in a B-Tree, so retrieving data from a clustered index is usually faster than looking up the same index. (Of course, there are cases where the search column is the index column)

(3) Queries using covering index scans can directly use the primary key in the page node.

The above advantages can greatly improve performance when querying and designing tables, but there are also some disadvantages:
(1) Clustered data greatly improves the performance of IO-intensive applications, but all data If placed in memory, the order of access is not important, and the clustered index loses its advantage.

(2) The insertion speed depends heavily on the insertion order.

(3) Updating clustered index columns is very expensive and will force each updated row of InnoDB to be moved to a new location.

Covering index

If an index contains (or covers) the values ​​of all the fields that need to be queried, we call it a covering index.

Covering index is a very useful tool. For indexes, you only need to scan the index to get all the data in the leaf nodes of the index, without the need to query back to the table, which can be extremely effective. Greatly improve performance. There are also many benefits:

(1) Index entries are usually much smaller than the size of the data row. If you only need to read the index, MySQL will greatly reduce the amount of data access, which will impose a heavy load on the cache. Very important.

(2) Because the index is stored in column value order, IO-intensive range searches will require much less Io than randomly reading each row of data from the disk.

Use index scan for sorting

MySQL has two ways to generate ordered results:
(1) Through order by sorting operation;

( 2) Scan in index order;
If the type value explained is index, it means that MySQL uses index scanning for sorting.

The index scan itself is very fast. You only need to move from one record to the next. However, if the index column cannot cover all the query fields, then each time you scan an index record, you have to go back to the table and query it once. The performance is not as good as direct sequential full table scan.

Try to design the same index that satisfies sorting and can be used for search.

4. Disadvantages of index

(1) For insert, update, and delete operations, the index needs to be updated synchronously, resulting in slow speed.

(2) Indexes will take up a lot of storage.

The above is the detailed content of High-performance MySQL-detailed explanation of creating high-performance indexes (picture and text). For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn