How to Achieve Deep Paging over Hundreds of Millions of Rows with MySQL, ES, and MongoDB


Guanhui

Jul 27, 2020 pm 05:24 PM

mysql

Interview Questions and Real-World Experience

Interview question: How to achieve deep paging when the amount of data is large?

You may run into this question in an interview or while preparing for one. Most answers boil down to sharding databases and tables and building indexes. That is the standard, correct answer, but reality is harsh, so the interviewer will usually follow up: given a tight schedule and too few people, how do you achieve deep paging?

At this point, candidates without hands-on experience are usually stumped. So hear me out.

Painful Lessons

First, let's be clear: deep paging can be done, but random jumps to deep pages absolutely must be banned.

Here is a screenshot:

[Figure: pagination screenshot]

Take a guess: if I click page 142360, will the service fall over?

MySQL and MongoDB cope reasonably well. They are proper databases, and a badly handled query is at worst slow. ES is a different matter: we have to loop over the Search After API to fetch the data, which brings memory usage into play. If the code is written carelessly, it can lead straight to an out-of-memory error.

Why random deep page jumps cannot be allowed

Let's briefly look, from a technical point of view, at why random deep page jumps cannot be allowed, or in other words, why deep paging is not recommended.

MySQL

The basic principle of paging:

SELECT * FROM test ORDER BY id DESC LIMIT 10000, 20;

LIMIT 10000, 20 means scanning 10020 rows that satisfy the conditions, throwing away the first 10000, and returning the last 20. With LIMIT 1000000, 100, 1000100 rows have to be scanned. In a highly concurrent application where every query scans more than a million rows, it would be surprising if the service did not fall over.
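To make the cost concrete, here is a small sketch of that arithmetic. It assumes the simple model described above, in which an offset query reads offset + size rows and discards the first offset of them; the function name rows_scanned is hypothetical and exists only for illustration.

```python
def rows_scanned(page: int, page_size: int) -> int:
    """Rows read by LIMIT (page - 1) * page_size, page_size under the simple model above."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    offset = (page - 1) * page_size
    # The engine walks past `offset` rows before it can return `page_size` rows.
    return offset + page_size


# The cost grows linearly with the page number, even though
# every page returns the same number of rows to the caller.
for page in (1, 501, 50001):
    print(page, rows_scanned(page, 20))
```

Page 501 with 20 rows per page corresponds to the LIMIT 10000, 20 query above.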

MongoDB

The basic principle of paging:

db.t_data.find().limit(5).skip(5);

Likewise, as the page number grows, the number of items skipped by skip grows too. The operation is implemented by iterating the cursor, so the CPU cost becomes very noticeable. When page numbers are very large and requests are frequent, the service will inevitably fall over.

ElasticSearch

From a business perspective, ElasticSearch is not a typical database; it is a search engine. If the data you want does not show up under the filter conditions, paging deeper will not find it either. Taking a step back, if we do use ES as a database for queries, paging will definitely hit the max_result_window limit. See that? The official default caps the offset (from + size) at ten thousand.

Query process:

1. To query page 501 with 10 items per page, the client sends the request to some node.
2. That node broadcasts the request to every shard, and each shard retrieves its own first 5010 items.
3. The shard results are returned to that node, which merges them and keeps the first 5010 items overall.
4. The last 10 of those items are returned to the client.
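The steps above can be sketched as a small in-memory simulation. This is not ES code: the shards are plain sorted Python lists, and the function name query_page is hypothetical. It only illustrates how much data the coordinating node has to receive and merge for a single deep page.

```python
import heapq
from itertools import islice


def query_page(shards, page, page_size):
    """Simulate offset paging across shards.

    `shards` is a list of lists, each already sorted ascending by the sort key.
    Returns (hits for the requested page, number of items sent to the coordinator).
    """
    offset = (page - 1) * page_size
    window = offset + page_size
    # Each shard can only answer with its own first `window` items,
    # because it cannot know which of them make the global page.
    per_shard = [shard[:window] for shard in shards]
    received = sum(len(hits) for hits in per_shard)
    # The coordinating node merges the sorted shard results and keeps the first `window`.
    merged = list(islice(heapq.merge(*per_shard), window))
    return merged[offset:], received


# Three shards holding 10000 documents each, distributed round-robin.
shards = [list(range(start, 30000, 3)) for start in range(3)]
hits, received = query_page(shards, page=501, page_size=10)
print(len(hits), received)
```

For page 501 the coordinator receives 5010 items from each of the three shards just to return 10 of them.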

This shows why the offset has to be limited. Moreover, if you implement a deep page jump by scrolling with something like the Search After API, every step still scrolls through thousands of items, and in total you may have to scroll through millions or tens of millions of items just to reach the final 20. You can imagine how efficient that is.

Realign with the product team

As the saying goes, problems that technology cannot solve should be solved by the business!

During my internship I fell for the product team's insistence that deep paging with page jumps had to be implemented. Now it is time to set things right, and the business side must make the following changes:

Add default filter conditions wherever possible, such as a time range, to reduce the amount of data displayed

Change how page jumps are presented: switch to a scrolling display, or allow jumps only within a small range of pages

Scrolling display reference picture:

[Figure: scrolling display example]

Small-range page jump reference picture:

[Figure: small-range page jump example]
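As a rough illustration of small-range page jumps, the pager can offer only a few page numbers around the current one. The sketch below is an assumption about how such a pager might be built, not something taken from the original screenshots; page_window, radius, and last_page are hypothetical names.

```python
def page_window(current, radius=2, last_page=None):
    """Return the page numbers a pager may link to around `current`."""
    if current < 1 or radius < 0:
        raise ValueError("current must be >= 1 and radius >= 0")
    first = max(1, current - radius)
    last = current + radius
    if last_page is not None:
        # Only clamp when the last page is actually known.
        last = min(last, last_page)
    return list(range(first, last + 1))


print(page_window(1))
print(page_window(10))
print(page_window(10, last_page=11))
```

Because a jump never exceeds radius pages, any offset the backend has to skip stays bounded by radius times the page size, no matter how deep the user has scrolled.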

General solution

The quick fixes that fit a short schedule mainly come down to the following points:

MySQL

Original paging SQL:

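# First page
SELECT * FROM `year_score` where `year` = 2017 ORDER BY id limit 0, 20;
# Page N (pseudocode: compute the offset (N - 1) * 20 in the application, because MySQL's LIMIT does not accept expressions)
SELECT * FROM `year_score` where `year` = 2017 ORDER BY id limit (N - 1) * 20, 20;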

Using context from the previous page, this can be rewritten as:

# XXXX is a value already known, such as the last id on the previous page
SELECT * FROM `year_score` where `year` = 2017 and id > XXXX ORDER BY id limit 20;
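The rewritten query can be driven from application code by remembering the last id of each page. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the score column, the sample data, and the function name fetch_page_after are assumptions made for illustration.

```python
import sqlite3


def fetch_page_after(conn, year, last_id, page_size=20):
    """Return the next page of rows whose id is greater than last_id."""
    cur = conn.execute(
        "SELECT id, year, score FROM year_score "
        "WHERE year = ? AND id > ? ORDER BY id LIMIT ?",
        (year, last_id, page_size),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE year_score (id INTEGER PRIMARY KEY, year INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO year_score (id, year, score) VALUES (?, ?, ?)",
    [(i, 2017 if i % 2 else 2016, i % 100) for i in range(1, 201)],
)

last_id = 0  # ids start at 1 here, so 0 means "before the first row"
pages = []
while True:
    rows = fetch_page_after(conn, 2017, last_id)
    if not rows:
        break
    pages.append(rows)
    last_id = rows[-1][0]  # the last id seen anchors the next page

print(len(pages), [page[0][0] for page in pages])
```

Each call starts reading at the anchor id rather than at the beginning of the result set, so page 5 costs about the same as page 1. The trade-off is that the client can only move to the next page, not jump to an arbitrary one.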

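As mentioned in the earlier article on SQL optimization and diagnosis, LIMIT stops the query as soon as enough matching rows have been found, so the total number of rows scanned under this scheme drops sharply and efficiency improves dramatically.

ElasticSearch

The approach is the same as for MySQL. With it we can use the from/size API freely, without worrying about the maximum limit.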
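As a sketch of what such a request might look like, the function below builds a search body that filters on an id range and always keeps from at 0, so from + size never approaches max_result_window. It assumes the indexed documents carry numeric id and year fields, mirroring the MySQL example; the function name next_page_query is hypothetical, and the body is only constructed here, not sent to a cluster.

```python
import json


def next_page_query(year, last_id, page_size=20):
    """Build a search body that pages by id range instead of by a growing offset."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"year": year}},
                    # Only documents after the last one already shown.
                    {"range": {"id": {"gt": last_id}}},
                ]
            }
        },
        "sort": [{"id": "asc"}],
        "from": 0,  # the offset never grows, whatever page the user is on
        "size": page_size,
    }


body = next_page_query(2017, last_id=123456)
print(json.dumps(body, indent=2))
```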

MongoDB

The approach is much the same. The basic code is as follows:

[Figure: code screenshot]
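The code in the original screenshot is not recoverable, so what follows is not the author's code but a minimal sketch of the same idea, assuming documents are paged in ascending _id order. The function name next_page_filter is hypothetical, and the pymongo calls appear only in comments because the sketch itself needs no database.

```python
def next_page_filter(base_filter, last_id):
    """Extend a filter so that only documents after last_id (by _id) match."""
    query = dict(base_filter)  # copy, so the caller's filter is left untouched
    if last_id is not None:
        query["_id"] = {"$gt": last_id}
    return query


first = next_page_filter({"year": 2017}, None)
later = next_page_filter({"year": 2017}, 4200)
print(first, later)

# With pymongo (an assumption; not imported here) a page would be fetched roughly as:
#   docs = list(collection.find(later).sort("_id", 1).limit(20))
#   last_id = docs[-1]["_id"] if docs else last_id
```

Replacing skip with a range condition on _id lets the server seek directly through the _id index instead of iterating the cursor past every skipped document.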

Related performance test:

[Figure: performance test results]

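If random deep page jumps are unavoidable

What if you lose the argument with the product manager? Don't worry, there is still a sliver of hope.

The SQL optimization article also covered a technique for MySQL deep paging. The code is as follows: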

# Bad example (took 129.570s)
select * from task_result LIMIT 20000000, 10;
# Good example (took 5.114s)
SELECT a.* FROM task_result a, (select id from task_result LIMIT 20000000, 10) b where a.id = b.id;
# Notes
# task_result is a production table with 34 million rows in total; id is the primary key, and the offset reaches 20 million

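The core logic of this approach relies on the clustered index: it first obtains the primary key IDs at the requested offset quickly, without going back to the table for full rows, and then uses the clustered index to look up those rows. At that point only 10 rows are involved, so the lookup is very efficient.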
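The shape of this query can be tried out locally with Python's built-in sqlite3 module. The sketch below is a stand-in, not a MySQL benchmark: the payload column and the 1000 sample rows are assumptions, and an explicit ORDER BY id has been added so that the offset is deterministic. It only checks that the id-first form returns the same rows as the naive form; the timings quoted above come from the original production table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_result (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO task_result (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 1001)],
)

offset, size = 900, 10

# Naive form: every skipped row is read in full before being thrown away.
naive = conn.execute(
    "SELECT * FROM task_result ORDER BY id LIMIT ? OFFSET ?", (size, offset)
).fetchall()

# Id-first form: the inner query pages over primary keys only,
# then the outer query fetches just those few rows by primary key.
deferred = conn.execute(
    "SELECT a.* FROM task_result a "
    "JOIN (SELECT id FROM task_result ORDER BY id LIMIT ? OFFSET ?) b ON a.id = b.id "
    "ORDER BY a.id",
    (size, offset),
).fetchall()

print(naive == deferred, [row[0] for row in deferred])
```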

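So when dealing with MySQL, ES, and MongoDB, we can use the same approach:

Restrict the fields fetched: using only the filter conditions, deep-page to obtain just the primary key IDs
Fetch the required data directly by primary key ID

Drawback: when the offset is very large, the query still takes a long time, such as the 5s in the example above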


Article source: https://juejin.im/post/5f0de4d06fb9a07e8a19a641
