Home >Database >Mysql Tutorial >Why is count(*) so slow? Cause Analysis
Why is count(*) so slow? The following article will analyze the reasons for you and talk about the execution process of count(*). I hope it will be helpful to everyone!
I didn’t want to write this article, because I think most experienced developers have encountered this thing and must have understood the related reasons, but recently I saw that several technical public accounts that follow are pushing related articles. It really surprised me!
First go to the conclusion of the public account article:
Conclusion: count(*) ≈ count(1) > count(id) > count (ordinary index column) > count (unindexed column)
I I don’t want to give it away, the above conclusion is pure fart. It is simply something that comes out of personal fantasies, and I am not even willing to verify it. Even if I take a look at the execution plan, I cannot come to such an outrageous conclusion.
I can’t believe this is an article that has been reposted by multiple technical public accounts! All the following content is based on the analysis ofmysql 5.7 InnoDB engine.
CREATE TABLE `hospital_statistics_data` ( `pk_id` bigint unsigned NOT NULL AUTO_INCREMENT COMMENT '主键', `id` varchar(36) COLLATE utf8mb4_general_ci NOT NULL COMMENT '外键', `hospital_code` varchar(36) COLLATE utf8mb4_general_ci NOT NULL COMMENT '医院编码', `biz_type` tinyint NOT NULL COMMENT '1服务流程 2管理效果', `item_code` varchar(36) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT '考核项目编码', `item_name` varchar(64) COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT '考核项目名称', `item_value` varchar(36) COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT '考核结果', `is_deleted` tinyint DEFAULT NULL COMMENT '是否删除 0否 1是', `gmt_created` datetime DEFAULT NULL COMMENT '创建时间', `gmt_modified` datetime DEFAULT NULL COMMENT 'gmt_modified', `gmt_deleted` datetime(3) DEFAULT '9999-12-31 23:59:59.000' COMMENT '删除时间', PRIMARY KEY (`pk_id`) ) DEFAULT CHARSET=utf8mb4 COMMENT='医院统计数据';The initial state of this table has only one
clustered index.
1) Take a look at the execution plan when there is only one clustered index.
EXPLAIN select COUNT(*) from hospital_statistics_data;Result: #The meaning of each parameter of the execution plan is not within the scope of this article, you can understand it yourself. Only focus on the following attributes here.
count(*) will also go to the index , and in the current case a clustered index is used.
2) There is a non-clustered index (secondary index)
Add a hospital_code index to the table.alter table hospital_statistics_data add index idx_hospital_code(hospital_code)There are two indexes in the table at this time,
primary key and
hospital_code.
EXPLAIN select COUNT(*) from hospital_statistics_data;Result:
##Similarly, look at the three fields of type, key and key_len .
Do you feel a little "
magical"? Why does the index change to the newly added
idx_hospital_code. Don’t rush to the conclusion, and then look at the following situation.
On the basis of the above, add another secondary index.
alter table hospital_statistics_data add index idx_biz_type(biz_type)
There are three indexes in the table at this time, primary key, hospital_code and biz_type.
Similarly, execute:
EXPLAIN select COUNT(*) from hospital_statistics_data;
Result:
Are you even more confused? The index is... and... .Changed.
becomes the newly added idx_biz_type.
Let’s not talk about why the above changes occurred, let’s continue the analysis below.
Based on the above three indexes, let’s take a look at them respectively,
count(1), count(id)
, count(index)
, count (no index)
What are the differences between these four situations and the execution plan of count(*).
count(index)
这里选取biz_type索引字段。
count(无索引)
小结:
count(index) 会使用当前index指定的索引。
count(无索引) 是全表扫描,未走索引。
count(1) , count(*), count(id) 一样都会选择idx_biz_type索引
看到这,你还觉得那些千篇一律的公众号文章的结论正确吗?
mysql 分为service层
和引擎层
。
所有的sql在执行前会经过service层的优化,优化分为很多类型,简单的来说可分为成本
和规则
。
执行计划所反映的是service层经过sql优化后,可能的执行过程。并非绝对(免得有些人说我只看执行计划过于片面)。绝大多数情况执行计划是可信的
。
索引类型分为聚簇索引
和非聚簇索引(二级索引)
。其中数据都是挂在聚簇索引上的,非聚簇索引上只是记录的主键id。
抛开数据内存,只谈数据量,都是扯淡。什么500w就是极限,什么2个表以上的join都需要优化了,什么is null不会走索引等,纯纯的放屁。
相信一点,编写mysql代码的人比,看此文章的大部分人都要优秀。他们会尽可能在执行前,对我这样菜逼写的乱七八糟的sql进行优化。
其实原因非常非常简单,上面也说了,service层会基于成本进行优化。
并且,正常情况下,非聚簇索引
所占有的内存要远远小于聚簇索引
。所以问题来了,如果你是mysql的开发人员,你在执行count(*)查询的时候会使用那个索引?
我相信正常人都会使用非聚簇索引
。
那如果存在2个甚至多个非聚簇索引又该如何选择呢?
那肯定选择最短的,占用内存最小的一个呀,在回头看看上面的实例,还迷惑吗。
同样都是非聚簇索引。idx_hospital_code
的len
是146
字节;而idx_biz_type
的len
只有1
。那还要选吗?
那为何count(*)走了索引,却还是很慢呢?
这里要明确一点,索引只是提升效率的一种方式,但不能完全的解决效率问题。count(*)有一个明显的缺陷,就是它要计算总数,那就意味着要遍历所有符合条件的数据,相当于一个计数器,在数据量足够大的情况下,即使使用非聚簇索引也无法优化太多。
官方文档:
InnoDBhandlesSELECT COUNT(*)andSELECT COUNT(1)operations in the same way. There is no performance difference.
简单的来说就是,InnoDB下 count(*) 等价于 count(1)
既然会自动走索引,那么上面那个所谓的速度排序还觉得对吗? count(*)的性能跟数据量有很大的关系,此外最好有一个字段长度较短的二级索引。
拓展:
另外,多说一下,关于网上说的那些索引失效的情况,大多都是片面的,我这里只说一点。量变才能引起质变,索引的失效取决于你圈定数据的范围,若你圈定的数据量占整体数据量的比例过高,则会放弃使用索引,反之则会优先使用索引。但是此规则并不是完美的,有时候可能与你预期的不同,也可以通过一些技巧强制使用索引,但这种方式少用。
举个栗子:
通过上面这个表hospital_statistics_data
,我进行了如下查询:
select * from hospital_statistics_data where hospital_code is not null;
此时这个sql会使用到hospital_code
的索引吗?
这里也不卖关子了,若hospital_code只有很少一部分数据是null
值,那么将不会走索引,反之则走索引。
原因就2个字:回表
。
It’s like buying sugar oranges. If you only buy a few kilograms, then you can just pick the best ones in the basket. But if you want to buy a basket, I believe the boss will not let you pick one by one, but will give you a whole basket at once. Of course, everyone is not stupid, and they all know that there must be a few bad fruits in the basket. But this is the most efficient and causes less loss to the boss.
Excerpted from "Understanding MySQL from the Root". I strongly recommend that those who have not systematically learned MySQL read this book.
1. First maintain a count variable in the server layer
2. The server layer asks the InnoDB engine for the first record
3. InnoDB finds the first secondary index record and return it to the server layer (note: since it is only counting the number of records at this time, there is no need to return the table)
4. Since the parameter of the COUNT function is *, MySQL will treat * as a constant 0 . Since 0 is not NULL, the server layer adds 1 to the count variable.
5. The server layer asks InnoDB for the next record.
6.InnoDB finds the next secondary index record through the next_record attribute of the secondary index record and returns it to the server layer.
7. The server layer continues to add 1 to the count variable.
8. Repeat the above process until InnoDB returns no recordable message to the server layer.
9. The server layer sends the final value of the count variable to the client.
After I finished writing, I still felt quite depressed. There are fewer and fewer good articles that can be obtained from the public account, and now it is paid for knowledge. It's time.
I really miss the days when I first started working. At that time, I spent some time reading official account articles every morning, but now they are all advertisements. Why!
But it’s normal, no one can always generate electricity for love.
It is recommended to read more books when studying. Generally, those that can be written into books are not too bad. Now all I can search at night are the same articles, I don’t know whether they are right or wrong. Online
[Related recommendations: mysql video tutorial]
The above is the detailed content of Why is count(*) so slow? Cause Analysis. For more information, please follow other related articles on the PHP Chinese website!