search
HomeBackend DevelopmentPHP TutorialAnalysis on Redis cluster failure

Analysis on Redis cluster failure

Dec 14, 2017 pm 03:04 PM
redisanalyzeFault

Redis Cluster is an advanced version of Redis that implements distribution and allows single points of failure. Redis cluster does not have the most important or central node. One of the main goals of this version is to design a linear scalable function (nodes can be added and deleted at will).

This article mainly introduces the detailed analysis of Redis cluster failures, hoping to help everyone.

Fault symptoms:

Business level display prompts that query redis failed

Cluster composition:

3 master and 3 slaves, each node has 8GB of data

Machine distribution:

In the same rack,

xx.x.xxx.199
xx.x.xxx.200
xx.x.xxx.201

redis-server process status:

Through the command ps -eo pid,lstart | grep $pid,

found that the process has continued to run For 3 months

The node status of the cluster before the failure:

xx.x.xxx.200:8371(bedab2c537fe94f8c0363ac4ae97d56832316e65) master
xx.x.xxx.199:8373(792020fe66c00ae56e27cd7a048ba6bb2b67adb6) slave
xx.x.xxx.201:8375(5ab4f85306da6d633e4834b4d3327f45af02171b) master
xx.x.xxx.201 :8372(826607654f5ec81c3756a4a21f357e644efe605a) slave
xx.x.xxx.199:8370(462cadcb41e635d460425430d318f2fe464665c5) master
xx.x.xxx.200:8374(1238085b578390f3c8efa30824fd9a4baba10ddf) slave

# ---------- ------------------------The following is the log analysis---------------------- ----------------

Step 1:
Master node 8371 loses connection with slave node 8373 :
46590:M 09 Sep 18:57:51.379 # Connection with slave xx.x.xxx.199:8373 lost.

Step 2: Master node 8370/8375 determines that 8371 is lost:
42645:M 09 Sep 18:57:50.117 * Marking node bedab2c537fe94f8c0363ac4ae97d56832316e65 as failing (quorum reached).

Step 3:
Slave node 8372/8373/8374 received master node 8375 saying that 8371 lost contact:
46986:S 09 Sep 18:57:50.120 * FAIL message received from 5ab4f85306da6d633e4834b4d3327f45af02171b about bedab2c537fe94f8c0 363ac4ae97d56832316e65

Step 4: Master node 8370/8375 authorizes 8373 to upgrade to master node transfer:
42645:M 09 Sep 18:57:51.055 # Failover auth granted to 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 for epoch 16

Step 5: The original master node 8371 modifies its configuration and becomes the slave node of 8373:
46590:M 09 Sep 18:57:51.488 # Configuration change detected. Reconfiguring myself as a replica of 792020fe66c00ae56e27cd7a048ba6bb2b67adb6

Step 6:Master node 8370/8375/8373 clear 8371 failure status:
42645:M 09 Sep 18:57:51.522 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: master without slots is reachable again.

Step 7:Start from new slave node 8371 from new master node 8373 , the first full synchronization data:
8373 Log::
4255:M 09 Sep 18:57:51.906 * Full resync requested by slave xx.x.xxx.200:8371
4255:M 09 Sep 18:57:51.906 * Starting BGSAVE for SYNC with target: disk
4255:M 09 Sep 18:57:51.941 * Background saving started by pid 5230
8371 Log: :
46590:S 09 Sep 18:57:51.948 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440721826993

Step 8:Master node 8370/8375 determines that 8373 (new master) has lost contact:
42645:M 09 Sep 18:58:00.320 * Marking node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 as failing (quorum reached).

Step 9: Master node 8370/8375 determination 8373 (new master) restored:
60295:M 09 Sep 18:58:18.181 * Clear FAIL state for node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6: is reachable again and nobody is serving its slots after some time.

Step 10:
Master node 8373 completes the BGSAVE operation required for full synchronization:
5230:C 09 Sep 18:59:01.474 * DB saved on disk
5230:C 09 Sep 18:59:01.491 * RDB: 7112 MB of memory used by copy-on-write
4255:M 09 Sep 18:59:01.877 * Background saving terminated with success

Step 11: Start receiving data from master node 8373 from node 8371:
46590:S 09 Sep 18:59:02.263 * MASTER SLAVE sync: receiving 2657606930 bytes from master

Step 12: Master node 8373 finds that slave node 8371 has restricted the output buffer:
4255:M 09 Sep 19:00:19.014 # Client id =14259015 addr=xx.x.xxx.200:21772 fd=844 name= age=148 idle=148 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl= 16349 oll=4103 omem=95944066 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
4255:M 09 Sep 19:00:19.015 # Connection with slave xx.x.xxx.200: 8371 lost.

Step 13:
Slave node 8371 failed to synchronize data from master node 8373, the connection was broken, and the first full synchronization failed:
46590:S 09 Sep 19:00:19.018 # I/O error trying to sync with MASTER: connection lost
46590:S 09 Sep 19:00:20.102 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00 :20.102 * MASTER SLAVE sync started

Step 14:
Restarts synchronization from node 8371, the connection fails, and the number of connections to master node 8373 is full Up:
46590:S 09 Sep 19:00:21.103 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:21.103 * MASTER SLAVE sync started
46590:S 09 Sep 19:00:21.104 * Non blocking connect for SYNC fired the event.
46590:S 09 Sep 19:00:21.104 # Error reply to PING from master: '-ERR max number of clients reached'

Step 15:
Slave node 8371 reconnects to master node 8373, and starts full synchronization for the second time:
8371 Log:
46590:S 09 Sep 19:00:49.175 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:49.175 * MASTER SLAVE sync started
46590:S 09 Sep 19:00:49.175 * Non blocking connect for SYNC fired the event.
46590:S 09 Sep 19:00:49.176 * Master replied to PING, replication can continue...
46590: S 09 Sep 19:00:49.179 * Partial resynchronization not possible (no cached master)
46590:S 09 Sep 19:00:49.501 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440780763454
8373 Log:
4255 :M 09 Sep 19:00:49.176 * Slave xx.x.xxx.200:8371 asks for synchronization
4255:M 09 Sep 19:00:49.176 * Full resync requested by slave xx.x.xxx.200: 8371
4255:M 09 Sep 19:00:49.176 * Starting BGSAVE for SYNC with target: disk
4255:M 09 Sep 19:00:49.498 * Background saving started by pid 18413
18413:C 09 Sep 19:01:52.466 * DB saved on disk
18413:C 09 Sep 19:01:52.620 * RDB: 2124 MB of memory used by copy-on-write
4255:M 09 Sep 19:01: 53.186 * Background saving terminated with success

Step 16:
Successfully synchronized data from node 8371 and started loading memory:
46590:S 09 Sep 19: 01:53.190 * MASTER SLAVE sync: receiving 2637183250 bytes from master
46590:S 09 Sep 19:04:51.485 * MASTER SLAVE sync: Flushing old data
46590:S 09 Sep 19:05:58.695 * MASTER SLAVE sync: Loading DB in memory

Step 17:
The cluster returns to normal:
42645 ; It took 7 minutes:
46590:S 09 Sep 19:08:19.303 * MASTER SLAVE sync: Finished with success


8371 Reason for loss of contact Analysis:

Since several machines are in the same rack, network interruption is unlikely to occur, so I checked the slow query log through the SLOWLOG GET command and found that there is The KEYS command was executed, which took 8.3 seconds. Then I checked the cluster node timeout setting and found that it was 5s (cluster-node-timeout 5000)

The reason why the node lost connection :


The client executed a command that took 8.3 seconds.
2016/9/9 18:57:43 Start execution KEYS command
2016/9/9 18:57:50 8371 is judged to be disconnected (redis log)

2016/9/9 18:57:51 KEYS command is executed


In summary, there are the following problems:


1. Due to the relatively short cluster-node-timeout setting, slow query of KEYS caused the cluster judgment node 8371 to lose contact

2. Due to 8371 losing contact, 8373 was upgraded to the master and master-slave synchronization started.

3. Due to the limitation of configuring client-output-buffer-limit, the first The first full synchronization failed

4. Due to a problem with the connection pool of the PHP client, the server was connected to the server frantically, resulting in an effect similar to a SYN attack

5. First After the first full synchronization failed, it took 30 seconds for the slave node to reconnect to the master node (exceeding the maximum number of connections 1w)


About the client-output-buffer-limit parameter:


# The syntax of every client-output-buffer-limit directive is the following: 
# 
# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds> 
# 
# A client is immediately disconnected once the hard limit is reached, or if 
# the soft limit is reached and remains reached for the specified number of 
# seconds (continuously). 
# So for instance if the hard limit is 32 megabytes and the soft limit is 
# 16 megabytes / 10 seconds, the client will get disconnected immediately 
# if the size of the output buffers reach 32 megabytes, but will also get 
# disconnected if the client reaches 16 megabytes and continuously overcomes 
# the limit for 10 seconds. 
# 
# By default normal clients are not limited because they don&#39;t receive data 
# without asking (in a push way), but just after a request, so only 
# asynchronous clients may create a scenario where data is requested faster 
# than it can read. 
# 
# Instead there is a default limit for pubsub and slave clients, since 
# subscribers and slaves receive data in a push fashion. 
# 
# Both the hard or the soft limit can be disabled by setting them to zero. 
client-output-buffer-limit normal 0 0 0 
client-output-buffer-limit slave 256mb 64mb 60 
client-output-buffer-limit pubsub 32mb 8mb 60

##Take action:


1. Cut a single instance to less than 4G, otherwise the master-slave switch will take a long time
2. Adjust the client-output-buffer-limit parameter to prevent synchronization Failed in the middle

3. Adjust cluster-node-timeout, it cannot be less than 15s


4. Prohibit any slow query that takes longer than cluster-node-timeout , because it will cause master-slave switching


5. Fix the crazy connection method of the client similar to SYN attack


Related recommendations:


Redis Full record of cluster construction


Detailed explanation of redis cluster specification knowledge

redis cluster actual combat

The above is the detailed content of Analysis on Redis cluster failure. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
How can you protect against Cross-Site Scripting (XSS) attacks related to sessions?How can you protect against Cross-Site Scripting (XSS) attacks related to sessions?Apr 23, 2025 am 12:16 AM

To protect the application from session-related XSS attacks, the following measures are required: 1. Set the HttpOnly and Secure flags to protect the session cookies. 2. Export codes for all user inputs. 3. Implement content security policy (CSP) to limit script sources. Through these policies, session-related XSS attacks can be effectively protected and user data can be ensured.

How can you optimize PHP session performance?How can you optimize PHP session performance?Apr 23, 2025 am 12:13 AM

Methods to optimize PHP session performance include: 1. Delay session start, 2. Use database to store sessions, 3. Compress session data, 4. Manage session life cycle, and 5. Implement session sharing. These strategies can significantly improve the efficiency of applications in high concurrency environments.

What is the session.gc_maxlifetime configuration setting?What is the session.gc_maxlifetime configuration setting?Apr 23, 2025 am 12:10 AM

Thesession.gc_maxlifetimesettinginPHPdeterminesthelifespanofsessiondata,setinseconds.1)It'sconfiguredinphp.iniorviaini_set().2)Abalanceisneededtoavoidperformanceissuesandunexpectedlogouts.3)PHP'sgarbagecollectionisprobabilistic,influencedbygc_probabi

How do you configure the session name in PHP?How do you configure the session name in PHP?Apr 23, 2025 am 12:08 AM

In PHP, you can use the session_name() function to configure the session name. The specific steps are as follows: 1. Use the session_name() function to set the session name, such as session_name("my_session"). 2. After setting the session name, call session_start() to start the session. Configuring session names can avoid session data conflicts between multiple applications and enhance security, but pay attention to the uniqueness, security, length and setting timing of session names.

How often should you regenerate session IDs?How often should you regenerate session IDs?Apr 23, 2025 am 12:03 AM

The session ID should be regenerated regularly at login, before sensitive operations, and every 30 minutes. 1. Regenerate the session ID when logging in to prevent session fixed attacks. 2. Regenerate before sensitive operations to improve safety. 3. Regular regeneration reduces long-term utilization risks, but the user experience needs to be weighed.

How do you set the session cookie parameters in PHP?How do you set the session cookie parameters in PHP?Apr 22, 2025 pm 05:33 PM

Setting session cookie parameters in PHP can be achieved through the session_set_cookie_params() function. 1) Use this function to set parameters, such as expiration time, path, domain name, security flag, etc.; 2) Call session_start() to make the parameters take effect; 3) Dynamically adjust parameters according to needs, such as user login status; 4) Pay attention to setting secure and httponly flags to improve security.

What is the main purpose of using sessions in PHP?What is the main purpose of using sessions in PHP?Apr 22, 2025 pm 05:25 PM

The main purpose of using sessions in PHP is to maintain the status of the user between different pages. 1) The session is started through the session_start() function, creating a unique session ID and storing it in the user cookie. 2) Session data is saved on the server, allowing data to be passed between different requests, such as login status and shopping cart content.

How can you share sessions across subdomains?How can you share sessions across subdomains?Apr 22, 2025 pm 05:21 PM

How to share a session between subdomains? Implemented by setting session cookies for common domain names. 1. Set the domain of the session cookie to .example.com on the server side. 2. Choose the appropriate session storage method, such as memory, database or distributed cache. 3. Pass the session ID through cookies, and the server retrieves and updates the session data based on the ID.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

SAP NetWeaver Server Adapter for Eclipse

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

VSCode Windows 64-bit Download

VSCode Windows 64-bit Download

A free and powerful IDE editor launched by Microsoft

SecLists

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

Safe Exam Browser

Safe Exam Browser

Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.