1. Example document structure:
{
  _id: xxxx,
  user: 'xiaoming',
  level: 5,
  from: 'iPhone',
  info: 'something wrong'
}
2. Scenario: there are six to seven million documents with user 'xiaoming'.
3. Question: how can this aggregate + group + sum query be sped up?
aggregate([
  {$match: {user: 'xiaoming', info: {$regex: 'wrong'}}},
  {$group: {_id: null, count: {$sum: 1}}}
])
The query above counts xiaoming's documents whose info contains 'wrong'. The result is
{"_id": null, "count": 2299999}
and it takes 30-40 s. I have tried three indexes (user, info, and user+info) and none of them improved the speed.
Searching Baidu and Google only turned up "conditional counting is slow, no solution".
How can I make this more efficient? Can it be done within 10 s?
PHP中文网2017-05-02 09:22:16
First, a caveat: you should not set expectations too high for OLAP-style operations. They touch large amounts of data, and the I/O alone far exceeds that of a typical OLTP operation, so demanding OLTP-level speed and concurrency is unrealistic and meaningless. That does not mean there is no room for optimization, though.
Let's start with the index. How long does it take to find the 6 million {user: "xiaoming"} documents without an index? A full table scan (COLLSCAN) that picks 6 million documents out of 7 million is clearly a different story from picking 6 million out of 100 million; with an index hit (IXSCAN), the difference shrinks to almost nothing. So it is wrong to say the {user: 1} index has no effect; most likely the collection is simply too small for the difference to show. Incidentally, to judge whether efficiency differs you should look at the execution plan, not the execution time, because timings are unreliable.
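To make "look at the execution plan" concrete: `db.coll.find({user: "xiaoming"}).explain("executionStats")` returns a nested JSON document, and the stage names tell you whether the query did a COLLSCAN or an IXSCAN. A small helper can flatten the winning plan; note that `sample` below is a hand-written stand-in with the shape of real explain output, not actual server output:

```javascript
// Walk explain() output and collect the stage names of the winning
// plan, so you can check for COLLSCAN vs IXSCAN at a glance.
function planStages(explainDoc) {
  const stages = [];
  let stage = explainDoc.queryPlanner.winningPlan;
  while (stage) {
    stages.push(stage.stage);
    stage = stage.inputStage; // simple plans chain via inputStage
  }
  return stages;
}

// Hand-written sample with the shape of a plan that used {user: 1}:
const sample = {
  queryPlanner: {
    winningPlan: {
      stage: "FETCH",
      inputStage: { stage: "IXSCAN", indexName: "user_1" }
    }
  }
};

// planStages(sample) → ["FETCH", "IXSCAN"]
```

If the list contains "COLLSCAN", the index was not used; `executionStats.totalDocsExamined` in the same document tells you how many documents were actually scanned.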
Even with the user index in place, there are still 6 million matching documents, and the remaining condition is a regex. A regex like this cannot hit an index, so an index on info makes no difference either way: after the 6 million documents are found, there is still a filter pass over all of them. The only thing that might help here is a full-text index, but full-text search is not a complete replacement for regular expressions; you will need to read the documentation for your specific pattern. If a full-text index turns out to be feasible, you can create a compound index:
db.coll.createIndex({
  user: 1,
  info: "text"
});
The corresponding query should be changed to:
db.coll.aggregate([
  {$match: {user: 'xiaoming', $text: {$search: "wrong"}}},
  {$group: {_id: null, count: {$sum: 1}}}
])
For an introduction to compound full-text indexes, refer to the documentation; there are some limitations to be aware of. With this optimization, the time can be expected to drop below 20 seconds on the same hardware, which is still far from the 10 seconds you want. The reason was given at the start: you cannot expect that much from OLAP. If you really have this requirement, tackle it at the source and consider maintaining the count incrementally, updating it every time a matching info field is inserted or updated.
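The "count at the source" idea can be sketched in mongo shell syntax. This is a hypothetical scheme, not from the original answer: the counters collection and its _id format are made up for illustration, and the counter write would sit in the application's insert path.

```javascript
// On every insert of a matching document, bump a pre-aggregated
// counter in the same write path (counters collection is hypothetical).
db.coll.insertOne({user: "xiaoming", info: "something wrong"});
db.counters.updateOne(
  {_id: "xiaoming:wrong"},   // one counter document per (user, keyword)
  {$inc: {count: 1}},
  {upsert: true}             // create the counter on first use
);

// Reading the count is then a single point lookup on _id,
// instead of an aggregation over millions of documents:
db.counters.findOne({_id: "xiaoming:wrong"}).count
```

Deletes and updates that change info would need a matching $inc of -1, so this trades write-path complexity for O(1) reads.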
某草草2017-05-02 09:22:16
I'm not sure, but would it help to split it into two $match stages?
Something like:
aggregate([
  {$match: {user: 'xiaoming'}},
  {$match: {info: /wrong/}},
  {$group: {_id: null, count: {$sum: 1}}}
])
I think most of the time is spent on the regex.
If you add an index, index user.
给我你的怀抱2017-05-02 09:22:16
If the real-time requirement is not high, you can count periodically and cache the result.
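The periodic-count-and-cache idea can be sketched as a read-through cache. This is a minimal illustration, not from the answer: makeCachedCount and its ttlMs parameter are made-up names, and countFn stands in for the expensive aggregate query.

```javascript
// Wrap an expensive count so callers see a cached value that is
// refreshed at most once per ttlMs milliseconds.
function makeCachedCount(countFn, ttlMs) {
  let cached = null;
  let fetchedAt = -Infinity;
  return function () {
    const now = Date.now();
    if (now - fetchedAt >= ttlMs) { // cache expired: recount
      cached = countFn();
      fetchedAt = now;
    }
    return cached;                  // otherwise serve the cached count
  };
}
```

In production the refresh would more likely be a scheduled job writing into a shared cache such as Redis; the function above only illustrates the read-through shape and its staleness window.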