Online slow query accident caused by wrong index selection in MySQL
Hello again! Another two weeks have gone by, and a few more half-written drafts have piled up in my cloud notes. Some are waiting for more material because their quality is not yet where I want it, while others are just a flash of inspiration with no content at all. I envy the prolific authors who publish five or six articles a week; I could not keep up even with two livers. Anyway, enough rambling...
Recently I ran into a database failure in production caused by a slow SQL query, and it affected the online business. The investigation showed that the MySQL optimizer had chosen the wrong index when executing the statement (strictly speaking not "wrong", just an index that took much longer to execute). While investigating, I read a lot of material and learned the basic criteria the MySQL optimizer uses to select an index. In this article I share how I approached the problem. My understanding of MySQL is limited, so if I have made mistakes, corrections and reasoned discussion are welcome.
This incident also shows how important it is to understand how MySQL works internally; that understanding is what lets you solve problems on your own. Imagine a dark and stormy night: the company's production system suddenly goes down, your colleagues are offline, and you are the only one in a position to fix it. If you get stuck at that moment on what should be an engineer's basic skills, just ask yourself how embarrassing that would be...
The main content of this article:
At 11 o'clock on July 24, one of our production databases suddenly triggered a large number of alarms. The number of slow queries exceeded the threshold and caused a surge in connections, which made the database slow to respond and affected the business. According to the chart, slow queries peaked at 14,000 per minute, whereas under normal conditions the count stays in the single digits, as shown below:
I immediately checked the slow SQL log and found that all the slow queries came from the same type of statement (I have hidden private data such as table names):
select * from sample_table where 1 = 1 and (city_id = 565) and (type = 13) order by id desc limit 0, 1
The statement looks very simple, nothing special. Yet each execution took an astonishing 44 s.
That is simply staggering; "slow" no longer describes it...
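As a rough sketch of how such statements can be spotted, assuming you have a session on the affected server with sufficient privileges, the following standard MySQL statements show whether the slow query log is enabled, what its threshold is, and which statements are running right now. None of this output comes from the original incident.

```sql
-- Is the slow query log enabled, and which file does it write to?
SHOW VARIABLES LIKE 'slow_query_log%';

-- Statements that run longer than this many seconds are logged as slow.
SHOW VARIABLES LIKE 'long_query_time';

-- Statements executing right now; the Time column is the elapsed time in seconds.
SHOW FULL PROCESSLIST;
```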
Next check the table data information, as shown below:
You can see that the table holds a large amount of data: the estimated row count is 83,683,240, roughly 80 million rows.
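The screenshot is not reproduced here. One way to obtain a similar estimate, assuming the table lives in the current default schema, is to query information_schema. For InnoDB tables the TABLE_ROWS value is an approximation derived from statistics, not an exact count.

```sql
SELECT TABLE_NAME,
       TABLE_ROWS,                                    -- estimated row count (approximate for InnoDB)
       ROUND(DATA_LENGTH  / 1024 / 1024) AS data_mb,  -- size of the row data (clustered index)
       ROUND(INDEX_LENGTH / 1024 / 1024) AS index_mb  -- combined size of the secondary indexes
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'sample_table';
```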
That is the overall situation; now let's start troubleshooting.
First, of course, you have to suspect that the statement is not using an index. Check the indexes in the table's CREATE TABLE statement (DDL):
KEY `idx_1` (`city_id`,`type`,`rank`),
KEY `idx_log_dt_city_id_rank` (`log_dt`,`city_id`,`rank`),
KEY `idx_city_id_type` (`city_id`,`type`)
Please ignore the overlap between idx_1 and idx_city_id_type; it is a legacy issue.
You can see that the table has both idx_city_id_type and idx_1. Our query conditions are city_id and type, and either index could serve them.
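For reference, the KEY clauses quoted above are what the table definition prints. A minimal way to retrieve it, using the article's anonymized table name:

```sql
-- Prints the full CREATE TABLE statement, including every KEY clause
-- and the PRIMARY KEY definition.
SHOW CREATE TABLE sample_table;
```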
But do our query conditions really involve only city_id and type? (Sharp-eyed readers will already have spotted the problem. Keep it in mind as we go on.)
Since suitable indexes exist, the next step is to check whether the statement actually uses one. MySQL provides Explain for analyzing SQL statements, most commonly SELECT queries.
The more important fields in the Explain output are:
For a more detailed introduction to Explain, see: MySQL performance optimization artifact Explain usage analysis
We use Explain to analyze this statement:
select * from sample_table where city_id = 565 and type = 13 order by id desc limit 0,1
Get the result:
As you can see, although possible_keys lists our indexes, the primary key index was the one finally used. The table has tens of millions of rows and the query conditions actually match no rows, so MySQL spends a long time walking the primary key index, which results in a slow query.
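The analysis described above amounts to prefixing the statement with EXPLAIN. The sketch below uses the article's anonymized table name; the comments only name the standard output columns and do not reproduce the original screenshot.

```sql
-- Ask the optimizer for its plan without executing the statement.
EXPLAIN
SELECT *
FROM sample_table
WHERE city_id = 565 AND type = 13
ORDER BY id DESC
LIMIT 0, 1;

-- Columns worth reading first:
--   possible_keys : indexes the optimizer considered
--   key           : the index it actually chose (PRIMARY in this incident)
--   rows          : estimated number of rows to examine
--   Extra         : additional work, e.g. "Using filesort"
```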
We can use force index(idx_city_id_type) to let the statement select the joint index we set:
select * from sample_table force index(idx_city_id_type) where ( ( (1 = 1) and (city_id = 565) ) and (type = 13) ) order by id desc limit 0, 1
This time it clearly executes very quickly. Analyzing the statement:
The actual execution time was 0.00175714 s. With the composite index, it is no longer a slow query.
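The screenshot is not reproduced here. As a sketch, the plan of the forced-index statement can be inspected with EXPLAIN, and a timing with this many decimal places can be obtained from session profiling. That the author measured it this way is an assumption on my part; note also that SHOW PROFILES is deprecated in recent MySQL versions in favor of the Performance Schema.

```sql
-- Plan for the statement with the composite index forced.
EXPLAIN
SELECT *
FROM sample_table FORCE INDEX (idx_city_id_type)
WHERE city_id = 565 AND type = 13
ORDER BY id DESC
LIMIT 0, 1;

-- Session-level timing (assumed measurement method, deprecated feature).
SET profiling = 1;

SELECT *
FROM sample_table FORCE INDEX (idx_city_id_type)
WHERE city_id = 565 AND type = 13
ORDER BY id DESC
LIMIT 0, 1;

-- Lists recent statements of this session with their Duration in seconds.
SHOW PROFILES;
```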
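The problem is found. To summarize: the MySQL optimizer believed that with limit 1, walking the primary key index would find that one row faster. With the composite index it would have to scan the index and then sort, whereas the primary key index is inherently ordered. Weighing these, the optimizer chose the primary key index. In reality, MySQL traversed 80 million rows without finding the chosen one (a row that satisfies the conditions), so a lot of time was wasted.
The execution flow of a MySQL statement is roughly as shown in the figure below; the query optimizer is where the index is chosen:
To quote an explanation from the references:
First of all, choosing the index is the job of the MySQL optimizer.
The optimizer's goal in choosing an index is to find an optimal execution plan and execute the statement at the lowest cost. In a database, the number of rows scanned is one of the factors that affect execution cost. The fewer rows scanned, the fewer disk accesses and the less CPU consumed.
Of course, the number of rows scanned is not the only criterion; the optimizer also weighs factors such as whether a temporary table is used and whether sorting is needed.
In summary, the optimizer considers many factors: rows scanned, use of temporary tables, sorting, and so on.
Let's look back at the two explain screenshots from earlier:
For the query that used the primary key index, the estimated rows value is 1833, while with the composite index forced it is 45640, and the Extra column shows Using filesort, meaning an additional sort is required. So without the forced index, the optimizer chose the primary key index, because it estimated fewer rows to scan and no extra sort, the primary key index being inherently ordered.
You may ask why rows is only 1833 when the entire primary key index was actually scanned, far more than a few thousand rows. The rows value in explain is MySQL's estimate, derived from the query conditions, the index and the limit taken together.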
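To see how the optimizer weighs these factors for a particular statement, MySQL 5.6 and later offer the optimizer trace. The following is a sketch, not something the original article ran. Be aware that the traced SELECT really executes, so on a table like this one it would take as long as the slow query itself; it is better tried on a replica or a test copy.

```sql
-- Enable tracing for this session only.
SET optimizer_trace = 'enabled=on';

-- The statement under investigation (this executes it for real).
SELECT *
FROM sample_table
WHERE city_id = 565 AND type = 13
ORDER BY id DESC
LIMIT 0, 1;

-- The TRACE column holds a JSON document with the candidate access paths
-- and their estimated costs, including the step where the optimizer
-- reconsiders its index choice because of ORDER BY ... LIMIT.
SELECT TRACE FROM information_schema.OPTIMIZER_TRACE;

SET optimizer_trace = 'enabled=off';
```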
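How does MySQL obtain the cardinality of an index? Here is a brief introduction to MySQL's sampling statistics.
Why sample? Reading the whole table and counting row by row would give an exact result, but the cost is too high, so sampling is the only option.
When sampling, InnoDB by default picks N data pages, counts the distinct values on those pages, takes the average, and multiplies it by the number of pages in the index to get the index cardinality.
Tables are updated continuously, so index statistics do not stay fixed. When the number of changed rows exceeds 1/M of the table, index statistics are automatically recalculated.
MySQL has two ways of storing index statistics, selected with the innodb_stats_persistent parameter:
When set to on, statistics are stored persistently. In this case the default N is 20 and M is 10.
When set to off, statistics are stored only in memory. In this case the default N is 8 and M is 16.
Because this is sampling, whether N is 20 or 8, the cardinality can easily be inaccurate.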
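The settings mentioned in the quote can be inspected on a running server. The sketch below uses the InnoDB system variables that, as far as I know, correspond to N in each mode, plus the system table where persistent statistics are stored; reading mysql.innodb_index_stats requires appropriate privileges.

```sql
-- Are statistics persisted to disk (ON) or kept only in memory (OFF)?
SHOW VARIABLES LIKE 'innodb_stats_persistent';

-- N for persistent statistics (default 20).
SHOW VARIABLES LIKE 'innodb_stats_persistent_sample_pages';

-- N for non-persistent statistics (default 8).
SHOW VARIABLES LIKE 'innodb_stats_transient_sample_pages';

-- Whether persistent statistics are recalculated automatically after
-- a significant share of the rows has changed.
SHOW VARIABLES LIKE 'innodb_stats_auto_recalc';

-- Persisted statistics for the table, one row per index and statistic.
SELECT index_name, stat_name, stat_value, sample_size, last_update
FROM mysql.innodb_index_stats
WHERE table_name = 'sample_table';
```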
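We can use the analyze table t command to recompute index statistics. However, running this command in production requires contacting the DBA, so I did not experiment with it; you can try it yourselves.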
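A minimal sketch of that experiment, meant for a test environment rather than production. The Cardinality column of SHOW INDEX is the sampled estimate discussed above, so comparing it before and after shows whether the statistics had drifted.

```sql
-- Estimated number of distinct values per index column before the refresh.
SHOW INDEX FROM sample_table;

-- Re-sample the index pages and rebuild the statistics.
ANALYZE TABLE sample_table;

-- Compare the Cardinality column with the earlier output.
SHOW INDEX FROM sample_table;
```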
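Why do I say that? Because if this table had a composite index on city_id, type and id, the optimizer would use it, since that index already delivers the rows in sorted order.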
Does increasing the limit change the estimated rows, and in turn the optimizer's choice of index?
The answer is yes.
Let's run it with limit 10:
select * from sample_table where city_id = 565 and type = 13 order by id desc limit 0,10
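In the figure, rows has become 18211, a tenfold increase. What happens with limit 100?
The optimizer chose the composite index. My rough guess is that rows would have multiplied again, so the optimizer gave up on the primary key index; it would rather use the composite index and sort afterwards than use the primary key index.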
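The comparison above can be repeated by running EXPLAIN with each limit and watching the key and rows columns. The figures in the comments are the ones reported in this article; they depend on the table's statistics and will differ elsewhere.

```sql
-- Article's result: key = PRIMARY, rows = 1833
EXPLAIN SELECT * FROM sample_table
WHERE city_id = 565 AND type = 13 ORDER BY id DESC LIMIT 0, 1;

-- Article's result: key = PRIMARY, rows = 18211
EXPLAIN SELECT * FROM sample_table
WHERE city_id = 565 AND type = 13 ORDER BY id DESC LIMIT 0, 10;

-- Article's result: the optimizer switches to a composite index
EXPLAIN SELECT * FROM sample_table
WHERE city_id = 565 AND type = 13 ORDER BY id DESC LIMIT 0, 100;
```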
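Q: This query had been running stably in production for a very long time. Why did it suddenly become slow?
A: Previously, the query conditions always matched some rows, so limit 1 found a row quickly and returned. This time the conditions in the code matched nothing, so the entire primary key index was scanned.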
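One way to confirm this explanation for a given pair of values is to count the matching rows through the composite index, which avoids the primary key scan entirely. This is a sketch of a check I would run, not a step taken in the original article.

```sql
-- Resolved from idx_city_id_type alone, so it returns quickly even when
-- nothing matches. A result of 0 means the ORDER BY id ... LIMIT 1 query
-- on the primary key has to walk the whole table before giving up.
SELECT COUNT(*)
FROM sample_table FORCE INDEX (idx_city_id_type)
WHERE city_id = 565 AND type = 13;
```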
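Now that we know why MySQL chose this index, we can list solutions along the lines above.
There are two main directions:
As in my very first step above, we can simply use force index to make the statement use the index we want.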
select * from sample_table force index(idx_city_id_type) where ( ( (1 = 1) and (city_id = 565) ) and (type = 13) ) order by id desc limit 0, 1
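The advantage is that it takes effect quickly and solves the problem immediately.
The drawbacks are also obvious:
force index() is not easy to add. Let's try another approach and guide the optimizer toward the composite index.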
By increasing the limit, we can make the estimated number of scanned rows grow quickly, for example by changing it to limit 0, 1000:
SELECT * FROM sample_table where city_id = 565 and type = 13 order by id desc LIMIT 0,1000
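This makes the query use the composite index and then sort. But forcing the limit up like this always feels like tuning parameters against a black box. Is there a more elegant solution?
Our slow query uses order by id, but the composite index does not include the id column, so the optimizer expects to sort after using the composite index and is simply reluctant to choose it.
We can create a new composite index on city_id, type and id to solve this.
This has drawbacks too. For example, my table has reached 80 million rows, so building an index is very time-consuming, and a single index is typically around 3.4 GB. Solving every problem by adding indexes without restraint may create new problems; a table should not have too many indexes.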
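A sketch of what adding such an index could look like. The index name idx_city_id_type_id is hypothetical, and the ALGORITHM and LOCK clauses assume MySQL 5.6 or later with InnoDB, where a secondary index can be built without blocking writes. On a table of this size the build still takes a long time and should be coordinated with the DBA.

```sql
-- Hypothetical index name; the columns follow the article's suggestion.
ALTER TABLE sample_table
  ADD INDEX idx_city_id_type_id (city_id, type, id),
  ALGORITHM = INPLACE,
  LOCK = NONE;

-- Afterwards, check whether the optimizer now picks the new index
-- and whether "Using filesort" has disappeared from Extra.
EXPLAIN
SELECT *
FROM sample_table
WHERE city_id = 565 AND type = 13
ORDER BY id DESC
LIMIT 0, 1;
```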
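What else can we do? We can use a subquery: inside the subquery, use the composite index on city_id and type first, then apply limit 1 to the result set to pick the first row.
But subqueries carry risks, and DBAs generally do not recommend them either; they suggest doing complex queries in application logic instead. Of course, our statement is not complex~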
Select * From sample_table Where id in (Select id From `newhome_db`.`af_hot_price_region` where (city_id = 565 and type = 13)) limit 0, 1
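SQL optimization is a big undertaking, and there are many more ways to fix this slow query. I won't go through them one by one; I leave them to you as an exercise.
This article walked through a production slow-query incident caused by the MySQL optimizer choosing the wrong index. As you can see, the optimizer does not choose an index by any single criterion; the choice is the result of weighing several factors. My own understanding of this area is not deep and I still have a lot to learn; I hope to write a proper summary of index selection someday (digging myself a hole here). Enough talk: time to pick up my very thick copy of "High Performance MySQL" and start...
weighing down the lid of my instant noodles...
Finally, a summary of the article: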