1 Main database downtime
First, let's look at the disaster recovery process when the master goes down, as shown below.
When the master goes down, the most common disaster recovery strategy is a master switchover (failover). Specifically, one slave is selected from the cluster's remaining slaves and promoted to master. The other slaves are then attached to the new master as its slaves, which restores the complete master-slave cluster structure.
That is the complete disaster recovery process. Its most expensive part is re-attaching the slaves, not switching the master.
The reason is that, unlike MySQL or MongoDB, Redis cannot continue replicating from a new master based on a synchronization point once the master has changed. When a slave in a Redis cluster is pointed at a different master, Redis clears the slave's data, performs a full synchronization of the dataset from the new master, and only then resumes incremental replication (the "resumed transfer").
The full rebuild (redo) of a slave proceeds as follows:
1. The master uses bgsave to dump its data to disk.
2. The master sends the RDB file to the slave.
3. The slave loads the RDB file.
4. After loading completes, the slave resumes incremental replication (the resumed transfer) and starts serving requests at the same time.
Obviously, the more memory the Redis instance uses, the longer each of the steps above takes. Our measured results are as follows (on machines whose performance we consider fairly good):
As the data shows, once the dataset reaches 20 GB, rebuilding a single slave takes nearly 20 minutes. With 10 slaves rebuilt one after another, the total comes to about 200 minutes. If the slaves are serving a large volume of read requests at that time, can you tolerate such a long recovery?
At this point you will probably ask: why not rebuild all the slaves at the same time? Because if every slave requests an RDB file from the master simultaneously, the master's network card is saturated immediately and the master can no longer serve requests normally. The master then effectively goes down again, which only makes a bad situation worse.
Of course, the slaves can be restored in batches, for example two at a time, but that only cuts the total recovery time from 200 minutes to 100 minutes. That is hardly a real improvement.
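The arithmetic behind the 200-minute and 100-minute figures can be written out as a small sketch. The function name and the assumption that every slave in a batch finishes in the same time (so the master's network card is not the bottleneck) are illustrative simplifications, not something the measurements above establish.

```python
import math


def total_recovery_minutes(slave_count, minutes_per_slave, batch_size=1):
    """Total time to rebuild all slaves when `batch_size` slaves resync at once.

    Simplifying assumption: each batch takes `minutes_per_slave`, i.e. running
    several resyncs in parallel does not slow any of them down.
    """
    if slave_count < 0 or batch_size < 1:
        raise ValueError("slave_count must be >= 0 and batch_size must be >= 1")
    batches = math.ceil(slave_count / batch_size)
    return batches * minutes_per_slave


if __name__ == "__main__":
    print(total_recovery_minutes(10, 20))                # one slave at a time
    print(total_recovery_minutes(10, 20, batch_size=2))  # groups of two
```

Larger batches shorten the total only as long as the master's network card can carry that many RDB transfers at once, which is exactly the limit described above.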
Another important issue lies in the part of step 4 marked in red: the resumed transfer. It can be thought of as a simplified MongoDB oplog: a fixed-size region of memory that we call the "synchronization buffer".
Write operations on the Redis master are stored in this region and then sent to the slaves. If steps 1, 2, and 3 above take too long, the synchronization buffer is likely to be overwritten. What does the slave do when it cannot find the position from which to resume? It redoes steps 1, 2, and 3!
But since we cannot make steps 1, 2, and 3 any faster, the slave is trapped in a vicious cycle: it keeps requesting the full dataset from the master, which puts a heavy load on the master's network card.
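Whether the vicious cycle starts comes down to one comparison: how many bytes the master writes while steps 1, 2, and 3 are running, versus how large the synchronization buffer is. The sketch below makes that comparison explicit. The function names, the buffer size, and the write rate are all hypothetical values chosen for illustration, and a constant write rate is assumed.

```python
import math


def backlog_survives(backlog_bytes, write_bytes_per_sec, resync_seconds):
    """Return True if the writes made during the resync still fit in the buffer."""
    written = write_bytes_per_sec * resync_seconds
    return written <= backlog_bytes


def min_backlog_bytes(write_bytes_per_sec, resync_seconds):
    """Smallest buffer that is not overwritten during a resync of this length."""
    return math.ceil(write_bytes_per_sec * resync_seconds)


if __name__ == "__main__":
    one_mib = 1024 * 1024
    rate = 100 * 1024          # hypothetical: 100 KiB of writes per second
    resync = 20 * 60           # a 20-minute rebuild, in seconds
    print(backlog_survives(one_mib, rate, resync))
    print(min_backlog_bytes(rate, resync))
```

With these illustrative numbers the buffer is overwritten long before the rebuild finishes, so the slave would have to start over.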
2 Capacity expansion problem
Traffic often surges without warning. Before the cause has been found, the usual emergency response is to expand capacity.
According to the table in scenario 1, adding a slave for a 20 GB Redis instance takes nearly 20 minutes. Can the business tolerate 20 minutes at such a critical moment? It may well be dead before the expansion finishes.
3 A poor network leads to redoing the slave library and eventually triggering an avalanche
The biggest problem in this scenario is that replication between the master and the slave is interrupted while the master is most likely still accepting write requests, so once the interruption lasts too long the synchronization buffer is likely to be overwritten. At that point the slave's last synchronization position has been lost. After the network recovers, the master has not changed, but because the slave's synchronization position is gone, the slave must be rebuilt through steps 1, 2, 3, and 4 from problem 1. If the master's memory footprint is large, the rebuild is very slow and read requests sent to the slave are seriously affected. Meanwhile, because the RDB file being transferred is so large, the master's network card is also heavily loaded for a long time.
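The way a synchronization position gets lost can be illustrated with a toy model of a fixed-size buffer. This is a simplified sketch for intuition, not Redis's actual implementation: the class name, method names, and byte counts are hypothetical, and only offsets are tracked, not the data itself.

```python
class SyncBuffer:
    """Toy model of a fixed-size synchronization buffer on the master."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.master_offset = 0  # total bytes the master has ever written

    def write(self, nbytes):
        """Record a write; older bytes beyond `capacity` are overwritten."""
        self.master_offset += nbytes

    def oldest_offset(self):
        """Earliest offset whose data is still held in the buffer."""
        return max(0, self.master_offset - self.capacity)

    def can_resume(self, slave_offset):
        """True if a slave at `slave_offset` can catch up from the buffer."""
        return self.oldest_offset() <= slave_offset <= self.master_offset


if __name__ == "__main__":
    buf = SyncBuffer(capacity=1000)
    buf.write(500)
    slave_offset = 500          # the slave is in sync, then the network drops
    buf.write(800)              # short outage: the master keeps taking writes
    print(buf.can_resume(slave_offset))
    buf.write(800)              # longer outage: the buffer wraps past the slave
    print(buf.can_resume(slave_offset))
```

After the short outage the slave's position is still inside the buffer and it can resume. After the longer one the position has been overwritten, which is the case that forces a full rebuild.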
4 The larger the memory, the longer persistence-triggering operations block the main thread
Redis is a single-threaded in-memory database. When it needs to perform a time-consuming operation such as bgsave or bgrewriteaof, it forks a new process. During the fork, shareable data does not need to be copied, but the memory page table of the parent process's address space does. The main thread performs this copy, and it blocks all reads and writes while doing so. The more memory is in use, the longer the copy takes. For example, for a Redis instance using 20 GB of memory, bgsave spends about 750 ms copying the page table, and the Redis main thread is blocked for those 750 ms.
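To get a feel for why the copy grows with memory, the size of the page table can be estimated roughly. The sketch below assumes 4 KiB pages and 8-byte page-table entries, as on typical x86-64 Linux, counts only the lowest-level entries, and ignores huge pages. It estimates how much data has to be copied, not how long the copy takes.

```python
PAGE_SIZE = 4 * 1024  # bytes per memory page (assumption: 4 KiB pages)
PTE_SIZE = 8          # bytes per page-table entry (assumption: x86-64)


def page_table_bytes(resident_bytes):
    """Rough size of the lowest-level page table for `resident_bytes` of memory."""
    pages = -(-resident_bytes // PAGE_SIZE)  # ceiling division
    return pages * PTE_SIZE


if __name__ == "__main__":
    twenty_gib = 20 * 1024 ** 3
    print(page_table_bytes(twenty_gib) / 1024 ** 2, "MiB")
```

Under these assumptions a 20 GiB instance has about 40 MiB of page-table entries, and the estimate scales linearly with memory, consistent with the observation that larger instances block longer on fork.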
Solution
The solution, of course, is to reduce memory usage as much as possible. In general, we do the following:
1 Set the expiration time
Set an expiration time on time-sensitive keys and let Redis's own expired-key cleanup reclaim the memory they occupy. This also spares the business the trouble of cleaning up periodically.
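A minimal sketch of writing a key together with its expiration follows. It assumes a redis-py-style client whose `set` method accepts an `ex` argument (a TTL in seconds). The helper name, the key, and the `RecordingClient` stand-in are hypothetical and exist only so the example runs without a server.

```python
def cache_with_ttl(client, key, value, ttl_seconds):
    """Store a time-sensitive value so that Redis expires it automatically."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    # With redis-py, `ex` sets the time to live in seconds.
    client.set(key, value, ex=ttl_seconds)


class RecordingClient:
    """Stand-in for a real client; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def set(self, name, value, ex=None):
        self.calls.append((name, value, ex))
        return True


if __name__ == "__main__":
    client = RecordingClient()
    cache_with_ttl(client, "session:42", "payload", ttl_seconds=3600)
    print(client.calls)
```

Setting the value and the TTL in one call avoids the window in which a key exists without an expiration.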
2 Do not store garbage in redis
This sounds too obvious to need saying, but has anyone else made the same mistake we did?
3 Clean up useless data in a timely manner
For example, suppose one Redis instance holds data for 3 businesses and two of them are taken offline after a while. The data belonging to those two businesses should then be cleaned up.
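One way to do such a cleanup is sketched below. It assumes each business's keys share a common prefix (the article does not say how the keys are named) and a redis-py-style client offering `scan_iter` and `delete`. The `InMemoryClient` class is a hypothetical stand-in so the example runs without a server.

```python
import fnmatch


def delete_by_prefix(client, prefix, batch_size=500):
    """Delete every key starting with `prefix`, in batches; return the count.

    Note: glob characters inside `prefix` are not escaped in this sketch.
    """
    deleted = 0
    batch = []
    for key in client.scan_iter(match=prefix + "*", count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch.clear()
    if batch:
        deleted += client.delete(*batch)
    return deleted


class InMemoryClient:
    """Stand-in for a real client, backed by a plain dict."""

    def __init__(self, data):
        self.data = dict(data)

    def scan_iter(self, match=None, count=None):
        # Snapshot the keys so that deleting during iteration is safe here.
        for key in list(self.data):
            if match is None or fnmatch.fnmatchcase(key, match):
                yield key

    def delete(self, *names):
        return sum(1 for name in names if self.data.pop(name, None) is not None)


if __name__ == "__main__":
    client = InMemoryClient({"bizA:1": "x", "bizA:2": "y", "bizB:1": "z"})
    print(delete_by_prefix(client, "bizA:"), sorted(client.data))
```

Iterating incrementally and deleting in small batches keeps any single command short, which matters on a single-threaded server.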
4 Try to compress the data as much as possible
For example, compressing long text data can significantly reduce memory usage.
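A minimal sketch of compressing text before it is stored, using Python's standard `zlib` module. The helper names and the sample text are illustrative. Actual savings depend on the data: short or high-entropy values may not shrink at all.

```python
import zlib


def pack(text):
    """Compress a text value before writing it to Redis."""
    return zlib.compress(text.encode("utf-8"), 6)


def unpack(blob):
    """Restore the original text after reading it back."""
    return zlib.decompress(blob).decode("utf-8")


if __name__ == "__main__":
    text = "a long and fairly repetitive description of a product. " * 200
    blob = pack(text)
    print(len(text.encode("utf-8")), "bytes before,", len(blob), "bytes after")
    assert unpack(blob) == text
```

The trade-off is extra CPU time in the application on every read and write, in exchange for less memory on the Redis side.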
5 Pay attention to memory growth and locate large-capacity keys
Whether you are a DBA or a developer, you must keep an eye on memory when you use Redis; otherwise you are not doing your job. Analyze which keys in the Redis instance are relatively large to help the business quickly locate abnormal keys (unexpected key growth is often the root of the problem).
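One possible way to find the largest keys is sketched below. It assumes a redis-py-style client with `scan_iter` and `memory_usage` (the latter wraps the MEMORY USAGE command, available in Redis 4.0 and later). The function name and the `InMemoryClient` stand-in are hypothetical, and the stand-in simply returns preset sizes.

```python
import heapq


def largest_keys(client, top_n=10, scan_count=1000):
    """Return the `top_n` largest keys as (size_in_bytes, key) pairs."""

    def sizes():
        for key in client.scan_iter(count=scan_count):
            size = client.memory_usage(key)
            if size is not None:  # the key may have expired in the meantime
                yield (size, key)

    return heapq.nlargest(top_n, sizes())


class InMemoryClient:
    """Stand-in for a real client that reports preset sizes per key."""

    def __init__(self, sizes):
        self.sizes = dict(sizes)

    def scan_iter(self, match=None, count=None):
        return iter(list(self.sizes))

    def memory_usage(self, key):
        return self.sizes.get(key)


if __name__ == "__main__":
    client = InMemoryClient({"a": 10, "b": 5000, "c": 300})
    print(largest_keys(client, top_n=2))
```

Scanning incrementally avoids blocking the server the way a single KEYS call would. On a large instance such a scan is still best run against a slave or during a quiet period.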
6 pika
If you really don't want the hassle, migrate the business to pika, a newer open-source project, so that you no longer have to pay so much attention to memory and the problems caused by oversized Redis memory stop being problems.