Home >Database >Mysql Tutorial >Mysql uses indexes to implement query optimization_MySQL
The purpose of the index is to improve query efficiency, which can be compared to a dictionary. If we want to look up the word "mysql", we definitely need to locate the m letter, then find the y letter from bottom to bottom, and then find the remaining sql. Without an index, you may need to look through all the words to find what you want.
1. Advantages of index
Suppose you have three unindexed tables t1, t2 and t3. Each table contains data columns i1, i2 and i3 respectively, and each table contains 1000 data rows, with serial numbers from 1 to 1000. A query to find combinations of data rows that match certain values might look like this:
SELECT t1.i1, t2.i2, t3.i3 FROM t1, t2, t3 WHERE t1.i1 = t2.i2 AND t2.i1 = t3.i3;
The result of this query should be 1000 rows, each data row containing three equal values. If we were processing this query without indexes, we would have no way of knowing which rows contained which values without scanning all of the tables. Therefore you must try all combinations to find records that match the WHERE condition. The number of possible combinations is 1000 x 1000 x 1000 (a billion!), which is a million times the number of matching records. This wastes a lot of work. This example shows that without the use of indexes, as the records in the tables grow, the time spent processing joins between these tables grows faster, resulting in poor performance. We can achieve significant speed improvements by indexing these data tables, because indexes allow queries to be processed as follows:
1. Select the first row in table t1 and view the value of that data row.
2. Use the index on table t2 to directly locate the data row that matches the value of t1. Similarly, use the index on table t3 to directly locate the row of data that matches the value in table t2.
3. Process the next row of table t1 and repeat the previous process. This operation is performed until all data rows in t1 have been checked.
In this case, we still perform a full scan on table t1, but we can perform index lookups on t2 and t3 to get rows directly from these tables. Running the above query this way would theoretically be a million times faster. Of course, this example is artificially established to draw conclusions. However, the problem it solves is real, and adding indexes to tables that don't have indexes often results in staggering performance improvements.
-
2. The cost of index
First, indexes speed up retrieval, but slow down insertions and deletions, as well as updating values in indexed data columns. That is, indexes slow down most operations involving writes. The reason for this phenomenon is that when writing a record, not only the data rows need to be written, but also all indexes need to be changed. The more indexes a data table has, the more modifications need to be made and the greater the reduction in average performance. In the "Loading Data Efficiently" section of this article, we'll take a closer look at these phenomena and find out how to deal with them.
Secondly, the index will consume disk space, and multiple indexes will consume more disk space accordingly. This may result in reaching the data table size limit faster:
· For MyISAM tables, frequent indexing may cause the index file to reach the maximum limit faster than the data file.
· For BDB tables, which store data and index values together in the same file, adding an index causes such tables to reach the maximum file limit faster.
· All tables allocated in InnoDB’s shared tablespace compete to use the same common space pool, so adding indexes will exhaust storage in the tablespace faster. However, unlike the files used by MyISAM and BDB tables, the InnoDB shared tablespace is not limited by the file size of the operating system because we can configure it to use multiple files. As long as additional disk space is available, you can expand the table space by adding new components.
An InnoDB table using a separate table space is subject to the same constraints as a BDB table, because its data and index values are stored in a single file.
What these elements actually mean is: If you don’t need a special index to help queries execute faster, don’t create an index.
3. Select index
It is assumed that you already know the syntax for indexing, but the syntax does not tell you how the data table should be indexed. This requires us to consider how data tables are used. This section guides you on how to identify candidate data columns for indexing, and how to best build indexes:
Index data columns used for searching, sorting, and grouping are not just for output display. In other words, the best candidate data columns for indexing are those that appear in the WHERE clause, join clause, ORDER BY, or GROUP BY clause. Data columns that only appear in the output data column list after the SELECT keyword are not good candidate columns:
SELECT col_a <- 不是备选列 FROM tbl1 LEFT JOIN tbl2 ON tbl1.col_b = tbl2.col_c <- 备选列 WHERE col_d = expr; <- 备选列
Of course, the displayed data column and the data column used in the WHERE clause may also be the same. Our point is that the data columns in the output list are inherently not good candidates for indexing.
Join子句或WHERE子句中类似col1 = col2形式的表达式中的数据列都是特别好的索引备选列。前面显示的查询中的col_b和col_c就是这样的例子。如果MySQL能够利用联结列来优化查询,它一定会通过减少整表扫描来大幅度减少潜在的表-行组合。
考虑数据列的基数(cardinality)。基数是数据列所包含的不同值的数量。例如,某个数据列包含值1、3、7、4、7、3,那么它的基数就是4。索引的基数相对于数据表行数较高(也就是说,列中包含很多不同的值,重复的值很少)的时候,它的工作效果最好。如果某数据列含有很多不同的年龄,索引会很快地分辨数据行。如果某个数据列用于记录性别(只有”M”和”F”两种值),那么索引的用处就不大。如果值出现的几率几乎相等,那么无论搜索哪个值都可能得到一半的数据行。在这些情况下,最好根本不要使用索引,因为查询优化器发现某个值出现在表的数据行中的百分比很高的时候,它一般会忽略索引,进行全表扫描。惯用的百分比界线是”30%”。现在查询优化器更加复杂,把其它一些因素也考虑进去了,因此这个百分比并不是MySQL决定选择使用扫描还是索引的唯一因素。
索引较短的值。尽可能地使用较小的数据类型。例如,如果MEDIUMINT足够保存你需要存储的值,就不要使用BIGINT数据列。如果你的值不会长于25个字符,就不要使用CHAR(100)。较小的值通过几个方面改善了索引的处理速度:
· 较短的值可以更快地进行比较,因此索引的查找速度更快了。
· 较小的值导致较小的索引,需要更少的磁盘I/O。
· 使用较短的键值的时候,键缓存中的索引块(block)可以保存更多的键值。MySQL可以在内存中一次保持更多的键,在不需要从磁盘读取额外的索引块的情况下,提高键值定位的可能性。
对于InnoDB和BDB等使用聚簇索引(clustered index)的存储引擎来说,保持主键(primary key)短小的优势更突出。聚簇索引中数据行和主键值存储在一起(聚簇在一起)。其它的索引都是次级索引;它们存储主键值和次级索引值。次级索引屈从主键值,它们被用于定位数据行。这暗示主键值都被复制到每个次级索引中,因此如果主键值很长,每个次级索引就需要更多的额外空间。
索引字符串值的前缀(prefixe)。如果你需要索引一个字符串数据列,那么最好在任何适当的情况下都应该指定前缀长度。例如,如果有CHAR(200)数据列,如果前面10个或20个字符都不同,就不要索引整个数据列。索引前面10个或20个字符会节省大量的空间,并且可能使你的查询速度更快。通过索引较短的值,你可以获得那些与比较速度和磁盘I/O节省相关的好处。当然你也需要利用常识。仅仅索引某个数据列的第一个字符串可能用处不大,因为如果这样操作,那么在索引中不会有太多的唯一值。
你可以索引CHAR、VARCHAR、BINARY、VARBINARY、BLOB和TEXT数据列的前缀。
使用最左(leftmost)前缀。建立多列复合索引的时候,你实际上建立了MySQL可以使用的多个索引。复合索引可以作为多个索引使用,因为索引中最左边的列集合都可以用于匹配数据行。这种列集合被称为”最左前缀”(它与索引某个列的前缀不同,那种索引把某个列的前面几个字符作为索引值)。
假设你在表的state、city和zip数据列上建立了复合索引。索引中的数据行按照state/city/zip次序排列,因此它们也会自动地按照state/city和state次序排列。这意味着,即使你在查询中只指定了state值,或者指定state和city值,MySQL也可以使用这个索引。因此,这个索引可以被用于搜索如下所示的数据列组合:
state, city, zip state, city state
MySQL不能利用这个索引来搜索没有包含在最左前缀的内容。例如,如果你按照city或zip来搜索,就不会使用到这个索引。如果你搜索给定的state和具体的ZIP代码(索引的1和3列),该索引也是不能用于这种组合值的,尽管MySQL可以利用索引来查找匹配的state从而缩小搜索的范围。
不要过多地索引。不要认为”索引越多,性能越高”,不要对每个数据列都进行索引。我们在前面提到过,每个额外的索引都会花费更多的磁盘空间,并降低写操作的性能。当你修改表的内容的时候,索引就必须被更新,甚至可能重新整理。如果你的索引很少使用或永不使用,你就没有必要减小表的修改操作的速度。此外,为检索操作生成执行计划的时候,MySQL会考虑索引。建立额外的索引会给查询优化器增加更多的工作量。如果索引太多,有可能(未必)出现MySQL选择最优索引失败的情况。维护自己必须的索引可以帮助查询优化器来避免这类错误。
如果你考虑给已经索引过的表添加索引,那么就要考虑你将增加的索引是否是已有的多列索引的最左前缀。如果是这样的,不用增加索引,因为已经有了(例如,如果你在state、city和zip上建立了索引,那么没有必要再增加state的索引)。
让索引类型与你所执行的比较的类型相匹配。在你建立索引的时候,大多数存储引擎会选择它们将使用的索引实现。例如,InnoDB通常使用B树索引。MySQL也使用B树索引,它只在三维数据类型上使用R树索引。但是,MEMORY存储引擎支持散列索引和B树索引,并允许你选择使用哪种索引。为了选择索引类型,需要考虑在索引数据列上将执行的比较操作类型:
· 对于散列(hash)索引,会在每个数据列值上应用散列函数。生成的结果散列值存储在索引中,并用于执行查询。散列函数实现的算法类似于为不同的输入值生成不同的散列值。使用散列值的好处是散列值比原始值的比较效率更高。散列索引用于执行=或96b4fef55684b9312718d5de63fb7121操作等精确匹配的时候速度非常快。但是对于查询一个值的范围效果就非常差了:
id < 30 weight BETWEEN 100 AND 150
· B树索引可以用于高效率地执行精确的或者基于范围(使用操作a792d7cec6e729943d26b51f1cfad30a=、>、a8093152e673feb7aba1828c43532094、!=和BETWEEN)的比较。B树索引也可以用于LIKE模式匹配,前提是该模式以文字串而不是通配符开头。
如果你使用的MEMORY数据表只进行精确值查询,散列索引是很好的选择。这是MEMORY表使用的默认的索引类型,因此你不需要特意指定。如果你希望在MEMORY表上执行基于范围的比较,应该使用B树索引。为了指定这种索引类型,需要给索引定义添加USING BTREE。例如:
CREATE TABLE lookup ( id INT NOT NULL, name CHAR(20), PRIMARY KEY USING BTREE (id) ) ENGINE = MEMORY;
如果你希望执行的语句的类型允许,单个MEMORY表可以同时拥有散列索引和B树索引,即使在同一个数据列上。
有些类型的比较不能使用索引。如果你只是通过把值传递到函数(例如STRCMP())中来执行比较操作,那么对它进行索引就没有价值。服务器必须计算出每个数据行的函数值,它会排除数据列上索引的使用。
使用慢查询(slow-query)日志来识别执行情况较差的查询。这个日志可以帮助你找出从索引中受益的查询。你可以直接查看日志(它是文本文件),或者使用mysqldumpslow工具来统计它的内容。如果某个给定的查询多次出现在”慢查询”日志中,这就是一个线索,某个查询可能没有优化编写。你可以重新编写它,使它运行得更快。你要记住,在评估”慢查询”日志的时候,”慢”是根据实际时间测定的,在负载较大的服务器上”慢查询”日志中出现的查询会多一些。
*4.建索引的几大原则*
4.1.最左前缀匹配原则,非常重要的原则,mysql会一直向右匹配直到遇到范围查询(>、d0d3ef2feb1bc4c8d48c16963c753a7e 3 and d = 4 如果建立(a,b,c,d)顺序的索引,d是用不到索引的,如果建立(a,b,d,c)的索引则都可以用到,a,b,d的顺序可以任意调整。
4.2.=和in可以乱序,比如a = 1 and b = 2 and c = 3 建立(a,b,c)索引可以任意顺序,mysql的查询优化器会帮你优化成索引可以识别的形式
4.3.尽量选择区分度高的列作为索引,区分度的公式是count(distinct col)/count(*),表示字段不重复的比例,比例越大我们扫描的记录数越少,唯一键的区分度是1,而一些状态、性别字段可能在大数据面前区分度就是0,那可能有人会问,这个比例有什么经验值吗?使用场景不同,这个值也很难确定,一般需要join的字段我们都要求是0.1以上,即平均1条扫描10条记录
4.4. Index columns cannot participate in calculations. Keep the columns "clean". For example, if from_unixtime(create_time) = '2014-05-29', the index cannot be used. The reason is very simple. All fields stored in the b+ tree are fields in the data table. value, but when retrieving, you need to apply the function to all elements to compare, which is obviously too costly. So the statement should be written as create_time = unix_timestamp('2014-05-29');
4.5. Expand the index as much as possible and do not create a new index. For example, there is already an index of a in the table, and now you want to add an index of (a, b), then you only need to modify the original index.
The above is the editor's introduction to Mysql's use of indexes to achieve query optimization. I hope it will be helpful to you. If you have any questions, please leave me a message and the editor will reply to you in time. Thank you very much for your support of the website!