How to prevent split-brain in an HA cluster
1. Introduction
Split-brain refers to the situation in a high-availability (HA) system where the connection between the two nodes is lost and the system that was originally a single whole splits into two independent nodes. At that point the two nodes start competing for shared resources, causing system chaos and data corruption.
For the HA of stateless services, split-brain hardly matters; for the HA of stateful services (such as MySQL), split-brain must be strictly prevented. (Some production systems nevertheless configure stateful services following the HA pattern for stateless services, and the results can be imagined...)
2. How to prevent HA cluster split-brain
Two methods are generally used:
1. Arbitration
When the two nodes disagree, a third-party arbiter decides which one to trust. The arbiter may be a lock service, a shared disk, or something else.
2. Fencing
When the state of a node cannot be determined, the other node is killed through fencing to guarantee that shared resources are completely released. The prerequisite is a reliable fencing device.
Ideally, neither of the above should be missing.
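As a rough illustration, the following Python sketch shows the decision a node might make when it loses contact with its peer; every function name here is a placeholder rather than a real cluster API.

    # Combine arbitration and fencing: ask the arbiter (quorum) first, and only
    # take over after the peer has been fenced. All callables are placeholders.
    def on_peer_lost(have_quorum, fence_peer, release_resources, start_service):
        if not have_quorum():
            # We may be the minority partition: give up shared resources and
            # do not attempt a takeover.
            release_resources()
            return
        # We hold the majority, but the peer's state is unknown: fence it
        # before touching its resources.
        if not fence_peer():
            raise RuntimeError("fencing failed, refusing to take over")
        start_service()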
However, if the nodes do not use shared resources, as in database HA based on master-slave replication, the fencing device can be safely omitted and only the quorum kept. In many environments, such as cloud hosts, no fencing device is available anyway.
So can we instead omit arbitration and keep only the fencing device?
No. When the two nodes lose contact with each other, they will try to fence each other at the same time. If the fencing method is reboot, the two machines will keep restarting each other endlessly. If the fencing method is power-off, the outcome may be that both nodes die, or that one survives. Worse, if the reason the nodes lost contact is that one of them has a failed network card, and the survivor happens to be that faulty node, the ending is tragic.
So a simple two-node cluster cannot prevent split-brain in any case.
3. Is it safe without a fencing device?
Let us take PostgreSQL or MySQL data replication as an example to examine this question.
In a replication-based setup the master and slave do not share resources, so having both nodes alive is not a problem in itself. The question is whether a client might access the node that is supposed to be dead, which brings us to client routing.
There are several approaches to client routing: VIP-based, proxy-based, DNS-based, or simply having the client maintain a list of server addresses and work out master and slave by itself. Whichever approach is used, the routing must be updated when a master-slave switchover occurs.
DNS-based routing is unreliable because DNS records may be cached by clients and are difficult to flush.
VIP-based routing has some pitfalls. If the node that is supposed to be dead does not remove its VIP, it can come back to cause trouble at any time (even if the new master has refreshed the ARP caches of all hosts with arping, once the ARP entry on some host expires and it sends an ARP query, an IP conflict occurs). The VIP should therefore be treated as a special kind of shared resource that must be removed from the failed node. The simplest way is for the failed node to remove the VIP itself once it discovers that it has lost contact, provided it is still alive (if it is already dead, there is nothing to remove). And what if the process responsible for removing the VIP cannot do its job? In that case a not-entirely-reliable soft fencing device (such as ssh) can be used.
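The following rough sketch (not production code) shows the two VIP actions just described: the failed node dropping its own VIP once it notices it has lost contact, and the new master using ssh as a soft fence to drop the VIP on the peer in case the peer's own cleanup did not run. The VIP, interface and peer name are made-up examples.

    import subprocess

    VIP = "192.0.2.100"   # example virtual IP
    IFACE = "eth0"        # example interface holding the VIP

    def release_local_vip():
        """Run on the node that has lost contact: give up the VIP voluntarily."""
        subprocess.run(["ip", "addr", "del", f"{VIP}/32", "dev", IFACE], check=False)

    def soft_fence_peer(peer="old-master"):
        """Run on the new master: best-effort removal of the VIP on the old
        master over ssh. Only a soft fence; it fails if the peer is unreachable."""
        cmd = f"ip addr del {VIP}/32 dev {IFACE}"
        try:
            result = subprocess.run(["ssh", peer, cmd], timeout=10)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

    # After taking over the VIP, the new master would also send gratuitous ARP
    # (e.g. with arping) so that neighbours refresh their ARP caches.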
Proxy-based routing is more reliable, because the proxy is the single entry point to the service: as long as the proxy itself is updated, clients cannot hit the wrong node. However, the high availability of the proxy must then also be taken into account.
With the server-address-list approach, the client determines master and slave by querying the backend services (for example, by checking whether a PostgreSQL/MySQL session is in read-only mode). If two masters exist at that moment, the client gets confused. To prevent this, the original master must stop its service by itself once it discovers that it has lost contact, just like the VIP removal above.
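The sketch below shows how such a client-side probe might look, assuming PostgreSQL and the psycopg2 driver; for MySQL the equivalent check would be SELECT @@global.read_only. Hosts and credentials are examples only.

    import psycopg2

    CANDIDATES = ["10.0.0.1", "10.0.0.2"]   # example master/slave addresses

    def find_master():
        masters = []
        for host in CANDIDATES:
            try:
                conn = psycopg2.connect(host=host, dbname="postgres",
                                        user="repl_check", connect_timeout=3)
                with conn.cursor() as cur:
                    cur.execute("SELECT pg_is_in_recovery()")
                    in_recovery = cur.fetchone()[0]
                conn.close()
                if not in_recovery:
                    masters.append(host)
            except psycopg2.Error:
                continue    # unreachable node: skip it
        if len(masters) != 1:
            # Zero masters (failover in progress) or two masters (split-brain):
            # refuse to write rather than guess.
            raise RuntimeError("expected exactly one master, found %s" % masters)
        return masters[0]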
Therefore, to keep the failed node from causing trouble, it should release its resources by itself after losing contact; and to cope with the case where the resource-releasing process itself fails, a soft fence can be added. Under these conditions it can be considered safe to run without a reliable physical fencing device.
4. Can data be lost after master-slave switching?
Whether data is lost during a master-slave switchover and whether split-brain occurs can be treated as two separate issues. Again, take PostgreSQL or MySQL replication as an example.
For PostgreSQL configured with synchronous streaming replication, no data is lost even if the routing is wrong: a client routed to the wrong node cannot actually commit anything, because that node keeps waiting for feedback from its slave, and the slave it is waiting for has already become the new master and simply ignores it. This is of course not a good state to stay in, but it buys the cluster monitoring software plenty of time to correct the routing error.
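Before relying on this no-data-loss guarantee, an external monitor can verify that at least one standby really is attached in synchronous mode. A small sketch against the primary, again using psycopg2 with example connection details:

    import psycopg2

    def has_sync_standby(host="10.0.0.1"):
        conn = psycopg2.connect(host=host, dbname="postgres", user="repl_check")
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_replication "
                        "WHERE sync_state = 'sync'")
            count = cur.fetchone()[0]
        conn.close()
        return count > 0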
For MySQL, even when configured with semi-synchronous replication, it may automatically downgrade to asynchronous replication after a timeout. To prevent this degradation, set rpl_semi_sync_master_timeout to an extremely large value and keep rpl_semi_sync_master_wait_no_slave on (the default). The price is that if the slave then fails, the master stalls as well. The remedies are the same as for PostgreSQL: either run one master with two slaves, so the master keeps working as long as at least one slave is up, or use external cluster monitoring software to switch dynamically between semi-synchronous and asynchronous replication.
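A sketch of the kind of external monitor meant here: it keeps semi-synchronous replication enforced while a semi-sync slave is connected, and only degrades to asynchronous under the monitor's control after no slave has been present for a grace period. It assumes the semi-sync plugin is installed and uses the PyMySQL driver; host, credentials and thresholds are examples.

    import time
    import pymysql

    def semi_sync_clients(conn):
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_clients'")
            _, value = cur.fetchone()
        return int(value)

    def monitor(host="10.0.0.1", grace_seconds=30):
        conn = pymysql.connect(host=host, user="monitor", password="secret",
                               autocommit=True)
        missing_since = None
        while True:
            if semi_sync_clients(conn) > 0:
                missing_since = None
                with conn.cursor() as cur:
                    cur.execute("SET GLOBAL rpl_semi_sync_master_enabled = ON")
            else:
                missing_since = missing_since or time.time()
                if time.time() - missing_since > grace_seconds:
                    # No semi-sync slave for too long: degrade deliberately so
                    # the master does not stay blocked forever.
                    with conn.cursor() as cur:
                        cur.execute("SET GLOBAL rpl_semi_sync_master_enabled = OFF")
            time.sleep(5)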
If you configured asynchronous replication in the first place, you have already accepted the possibility of losing data, so losing a little data during a switchover is no big deal; but the number of automatic switchovers must still be controlled. For example, an old master that has already been failed over must not be allowed to come back online automatically; otherwise, if failovers are triggered by network jitter, master and slave will keep switching back and forth, losing data and destroying consistency.
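One simple way to enforce the "old master must not come back automatically" rule is a marker that the failover logic drops on the demoted node and that only an operator removes. A sketch, with a made-up path:

    import os

    DEMOTED_MARKER = "/var/lib/ha/demoted"   # example location

    def mark_demoted():
        """Called by the failover logic on the node that has just been demoted."""
        os.makedirs(os.path.dirname(DEMOTED_MARKER), exist_ok=True)
        open(DEMOTED_MARKER, "w").close()

    def may_start_as_master():
        """Checked before any automatic promotion; an operator deletes the
        marker only after verifying the node's data and replication state."""
        return not os.path.exists(DEMOTED_MARKER)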
5. How to implement the above strategy
You can implement a script that follows the above logic from scratch, but I prefer to build on mature cluster software such as Pacemaker + Corosync together with suitable resource agents. I strongly advise against Keepalived: it is not suited to the HA of stateful services, and even after bolting arbitration and fencing onto such a solution, it still feels awkward.
There are also a few things to watch out for when using Pacemaker + Corosync.
1) Understand the functions and principles of Resource Agent
Only by understanding what a resource agent does and how it works can you know which scenarios it is suitable for. For example, the pgsql resource agent is fairly complete: it supports synchronous and asynchronous streaming replication, can switch between the two automatically, and can ensure that no data is lost under synchronous replication. The current MySQL resource agent, by contrast, is very weak: without GTID and without log compensation it loses data easily. It is better not to use it and to keep using MHA instead (but be sure to guard against split-brain when deploying MHA).
2) Ensure quorum
Quorum can be regarded as Pacemaker's own arbitration mechanism: a majority of all the nodes in the cluster elects a coordinator, and every instruction in the cluster is issued by this coordinator, which eliminates split-brain cleanly. For this mechanism to work, the cluster needs at least 3 nodes and no-quorum-policy set to stop, which is also the default value. (Many tutorials set no-quorum-policy to ignore for the convenience of a demo; doing this in production without any other arbitration mechanism is very dangerous!)
But what if there are only 2 nodes?
The first option is to borrow a machine so that the cluster has 3 nodes, then use location constraints to prevent resources from being placed on the borrowed node.
The second is to merge several small clusters that cannot reach quorum on their own into one large cluster, again using location constraints to control where resources are placed.
But if you have many two-node clusters, cannot find enough spare nodes to make up the numbers, and do not want to merge the two-node clusters into one large cluster (for example, because it would be inconvenient to manage), then consider the third option.
The third option is to configure a preemptable resource, together with the service and a colocation constraint binding the service to that resource: whichever node seizes the preemptable resource provides the service. The preemptable resource can be a lock service, for example one built on ZooKeeper, or simply one written from scratch, like the following example.
http://my.oschina.net/hanhanztj/blog/515065
(This example uses short-lived HTTP connections. A more thorough approach is a long-lived connection with heartbeat detection, so that the server notices promptly when the connection is broken and releases the lock.)
However, the high availability of this preemptable resource must itself be guaranteed. You can make the lock service highly available in its own right, or, more simply, deploy 3 lock services: one on each of the two cluster nodes and a third on a separate, dedicated arbitration node, and consider the lock acquired only when at least 2 of the 3 locks are obtained. Such an arbitration node can provide arbitration for many clusters (a machine can run only one Pacemaker instance, otherwise an arbitration node running N Pacemaker instances could do the same job). Still, unless there is no other way, prefer the earlier approaches that satisfy Pacemaker's own quorum; they are simpler and more reliable.
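A sketch of the "at least 2 of 3 locks" rule, assuming three simple HTTP lock services like the one in the example linked above. The URL scheme (/lock?holder=...) is made up for illustration; a real service would also need lock expiry or heartbeats, as noted earlier.

    import urllib.request
    import urllib.error

    LOCK_SERVICES = [
        "http://node1:8000", "http://node2:8000", "http://arbiter:8000",
    ]

    def try_acquire(base_url, holder, timeout=2):
        try:
            with urllib.request.urlopen("%s/lock?holder=%s" % (base_url, holder),
                                        timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    def acquire_majority(holder):
        acquired = sum(try_acquire(url, holder) for url in LOCK_SERVICES)
        return acquired >= 2    # quorum: at least 2 of the 3 locks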
6. References
http://blog.chinaunix.net/uid-20726500-id-4461367.html
http://my.oschina.net/hanhanztj/blog/515065
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html
http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster
http://mysqllover.com/?p=799
http://gmt-24.net/archives/1077