During our work a few days ago, we suddenly received an alert that our Redis had crashed, and many people were discussing a certain Redis connection timeout. I thought there was a big problem at first, but who knew it would recover after a while. At that time, I logged into the server and checked the monitoring. Take a look at QPS for the first time:
You can see that the QPS is not high, but why did the data not be obtained for a period of time? Then continue to look at the cpu usage of Redis:
You can see that the cpu is saturated, which can also explain why the graph is broken, because redis is single-threaded, and after using 100% of the cpu, it cannot process other commands, and zabbix cannot execute info. The command is taken from qps. So we already know that the problem is caused by CPU saturation, so what is the reason? Then continue to check whether there are any slow logs during the period of high CPU usage:
This does not appear to be the culprit of high CPU usage, so it is difficult to troubleshoot. This example is a master node and a slave node. Then let me take a look at the cpu usage of the slave library:
Damn it, what’s going on? How come the CPU that is not used by the slave library is also used at 74%? Isn't this scientific? Whatever, check if there are any slow logs from the slave library:
Damn it, what’s going on? No one is using the slave library. Check if it is read-only:
127.0.0.1:6103> CONFIG GET "slave-read-only" 1) "slave-read-only" 2) "yes" 127.0.0.1:6103>
It seems to be read-only, which confused me. Finally, it suddenly occurred to me that the main library has a big key that has expired at this point, and the slow operation of the expired key in the main library will not record slow logs. The key expiration of the slave library is deleted by the main library initiating a DEL instruction. At this time, the slave library will record the slow log. From the slow log above, you can see that the maximum DEL operation is 335ms. No wonder there are application connection timeouts.
Use the command info commandstats again to see:
The above is the detailed content of Example analysis of Redis timeout troubleshooting. For more information, please follow other related articles on the PHP Chinese website!