
Let's talk about how GitHub makes MySQL highly available

青灯夜游 (repost), 2022-10-26 19:19:30

GitHub uses MySQL as the data store for all non-git data, and database availability is critical to GitHub's operation. The GitHub website itself, the GitHub API, the authentication service and more all need database access. GitHub runs multiple MySQL clusters to serve different services and tasks. The clusters use a classic master-replica setup: a single node in the cluster (the master) accepts writes, while the remaining nodes (the replicas) replay the master's changes and serve reads.

The availability of the master is particularly critical. Without a master, a cluster cannot accept writes: any data that needs to be persisted cannot be stored. As a result, any change on GitHub, such as a commit, an issue, user creation, a code review or repository creation, cannot be completed.

In order to ensure the normal operation of the business, we naturally need to have an available database node in the cluster that supports writing. At the same time, we must also be able to quickly discover available writable service database nodes.

That is, if the master fails, a new master must come online immediately to serve writes, and the other nodes in the cluster must be able to identify the new master quickly. The time it takes to detect the failure, fail over, and advertise the new master's identity adds up to the total outage time.

This article explains GitHub's MySQL high availability and master service discovery solution, which allows us to reliably run operations across data centers, tolerate data center isolation, and shorten downtime in case of failure.

High Availability Implementation

The solution described in this article is an improved version of GitHub's previous high availability solution. As mentioned before, a MySQL high availability strategy must adapt to business changes. We expect MySQL and other services at GitHub to have high availability solutions that can cope with change.

When designing high availability and service discovery system solutions, starting from the following questions may help us quickly find a suitable solution:

  • What is the maximum tolerable outage time?
  • How reliable is outage detection? Can we tolerate false positives (which cause premature failovers)?
  • How reliable is failover? What conditions can cause a failover to fail?
  • Does the solution work across data centers, and how? What happens under different network conditions, with high or low latency?
  • Can the solution withstand the failure or network isolation of an entire data center (DC)?
  • Is there a mechanism to prevent or mitigate split-brain (two nodes that should act as one system become independent of each other and both accept writes)?
  • Can data loss be tolerated? What level of loss is tolerated?

In order to explain the above issues, let’s first take a look at our previous high availability solution and why we want to improve it.

Abandon the discovery mechanism based on VIP and DNS

In the previous solution, the following technical solution was applied:

  • Use orchestrator for failure detection and failover.
  • Use VIP and DNS methods as the master node discovery solution.

Clients find the master by resolving a name, such as mysql-writer-1.github.net, to the virtual IP address (VIP) held by the master node.

Under normal circumstances, then, a client simply resolves the name and connects to the master at the resulting IP.

Consider the topology of three data centers:

[Figure: MySQL topology spanning three data centers]

Once the master fails, one of the replicas must be promoted to take its place.

orchestrator detects the failure, selects a new master, and then reassigns the name and the virtual IP (VIP). Clients do not know the identity of the master: all they have is a name, so that name must now resolve to the new master. Consider the following:

VIPs are cooperative: they are claimed and owned by the database servers themselves. A server must send an ARP request to acquire or release a VIP, and before the new master can acquire the VIP, the old server must first release it. This has some undesirable effects:

  • The failover sequence first asks the failed master to release the VIP, and then asks the newly promoted master to acquire it. But what if the failed master is unreachable, or refuses to release the VIP? In a failure scenario it is quite likely that the failed machine responds late, or not at all, to the release request. This leads to two problems:
    • Split-brain: if two hosts claim the same VIP, different clients will connect to different hosts depending on the shortest network path.
    • The VIP reassignment relies on the cooperation of two independent servers, which makes the procedure unreliable.
  • Even if the failed master does release the VIP, the procedure is time-consuming, because the switchover still has to contact the failed machine.
  • Even once the VIP is reassigned, clients' existing connections are not guaranteed to disconnect from the old master, so the system can still end up in a split-brain state.

VIPs are also bound by physical location: a VIP belongs to a particular switch or router. We can therefore only reassign a VIP to servers in the same location. In particular, there are cases where we cannot assign the VIP to a server in another data center and must make a DNS change instead.

  • DNS changes take longer to propagate. Clients cache DNS names for a configured time (the TTL). A cross-DC failover therefore means more outage time: clients take longer to recognize the new master.
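
To see why cached DNS answers stretch the outage, here is a minimal sketch of a client-side resolver cache. The class name `CachingResolver`, the IP addresses and the 60-second TTL are illustrative assumptions, not values from GitHub's setup.

```python
class CachingResolver:
    """Toy client-side DNS cache: answers are reused until their TTL expires."""

    def __init__(self, authoritative, ttl_seconds):
        self.authoritative = authoritative  # dict: name -> current IP (hypothetical)
        self.ttl = ttl_seconds
        self._cache = {}  # name -> (ip, expires_at)

    def resolve(self, name, now):
        cached = self._cache.get(name)
        if cached and now < cached[1]:
            return cached[0]  # possibly stale answer
        ip = self.authoritative[name]
        self._cache[name] = (ip, now + self.ttl)
        return ip


records = {"mysql-writer-1.github.net": "10.0.0.1"}  # old master (made-up IP)
resolver = CachingResolver(records, ttl_seconds=60)

before = resolver.resolve("mysql-writer-1.github.net", now=0)
records["mysql-writer-1.github.net"] = "10.0.1.1"  # failover updates the DNS record
during = resolver.resolve("mysql-writer-1.github.net", now=10)  # still served from cache
after = resolver.resolve("mysql-writer-1.github.net", now=61)   # TTL expired, refreshed
```

In this sketch the client keeps connecting to the old address until its cached entry expires, however quickly the record itself was changed.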

These limitations alone were enough to push us to look for a new solution, but there was more to consider:

  • Masters injected their own heartbeats through the pt-heartbeat service, for the purpose of lag measurement and throttling control. The service had to be started on the newly promoted master and, if possible, shut down on the old master.

  • Similarly, Pseudo-GTID injection was managed by the servers themselves. It had to be started on the new master and preferably stopped on the old master.

  • The new master was set to be writable. If possible, the old master was set to read_only.

These extra steps added to the total downtime and introduced their own failure points and friction.

The solution worked, and GitHub has successfully failed over MySQL with it, but we wanted our HA to improve in the following areas:

  • Data center agnostic.
  • Tolerate data center failures.
  • Remove unreliable collaboration workflows.
  • Reduce total downtime.
  • Wherever possible, have lossless failover.

GitHub's high-availability solutions: orchestrator, Consul, GLB

The new strategy improves, solves or mitigates most of the problems mentioned above. The current high availability setup consists of:

  • orchestrator for failure detection and failover. We use a cross-data-center orchestrator/raft setup.
  • Hashicorp's Consul for service discovery.
  • GLB/HAProxy as a proxy layer between clients and the writer nodes. The GLB director is open source.
  • anycast for network routing.


The new setup removes VIP and DNS changes altogether. Although we introduce more components, we are able to decouple them, simplify each task, and rely on robust and stable solutions. Details follow.

Normal process

Normally, the application connects to the write node through GLB/HAProxy.

Applications are never aware of the master's identity. As before, they use a name: for example, the master of cluster1 is mysql-writer-1.github.net. In the current setup, however, this name resolves to an anycast IP.

With anycast, the name resolves to the same IP everywhere, but traffic is routed differently depending on the client's location. In particular, each of our data centers has GLB, our highly available load balancer, deployed on multiple boxes. Traffic to mysql-writer-1.github.net is always routed to the GLB cluster in the local data center, so all clients are served by local proxies.

We run GLB on top of HAProxy. HAProxy has writer pools, one per MySQL cluster, and each pool has exactly one backend server: the cluster's master. All GLB/HAProxy boxes have the same pools, and these pools all point to the same backend servers. So if an application wants to write to mysql-writer-1.github.net, it does not matter which GLB server it connects to: it will always be routed to the actual cluster1 master.
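
The pool layout can be pictured with a small model. This is only a sketch: the box names and backend host names are made up, and a real deployment expresses the pools in HAProxy configuration rather than Python dictionaries.

```python
# One writer pool per MySQL cluster, each with exactly one backend: the master.
# Backend host names are made up for illustration.
WRITER_POOLS = {
    "mysql-writer-1.github.net": "cluster1-master.example:3306",
    "mysql-writer-2.github.net": "cluster2-master.example:3306",
}

# Every GLB/HAProxy box carries an identical copy of the pools (box names hypothetical).
GLB_BOXES = {box: dict(WRITER_POOLS) for box in ("dc1-glb-a", "dc1-glb-b", "dc2-glb-a")}


def route(glb_box, writer_name):
    """Return the backend that a write to `writer_name` reaches via `glb_box`."""
    return GLB_BOXES[glb_box][writer_name]


# Whichever box anycast happens to pick, the write lands on the same master.
backends = {route(box, "mysql-writer-1.github.net") for box in GLB_BOXES}
```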

As far as applications are concerned, discovery ends at GLB, and rediscovery is never needed. GLB is responsible for directing all traffic to the correct destination.

How does GLB know which servers to list as backends, and how are changes propagated to GLB?

Discovery via Consul

Consul is known for its service discovery solutions and also provides DNS services. However, in our solution we use it as a highly available key-value (KV) store.

In Consul’s KV storage, we write the identity of the cluster master node. For each cluster, there is a set of KV entries indicating the cluster's master fqdn, port, ipv4, ipv6.

Each GLB/HAProxy node runs consul-template: a service that watches for changes in Consul data (in our case, changes to the cluster master data). consul-template produces a valid configuration file and can reload HAProxy when the configuration changes.

Thus, each GLB/HAProxy machine observes a change to a master's identity in Consul, reconfigures itself to set the new master as the single entry in the cluster's backend pool, and reloads to reflect the change.
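
The watch, render and reload cycle can be sketched as follows. This is not consul-template itself: the KV key layout (`mysql/master/<cluster>/...`), the backend section name and the `ProxyNode` class are assumptions for illustration, and the reload is represented by a counter.

```python
def render_writer_backend(cluster, kv):
    """Render an HAProxy backend section with the master as its only server."""
    host = kv[f"mysql/master/{cluster}/fqdn"]  # hypothetical key layout
    port = kv[f"mysql/master/{cluster}/port"]
    return (
        f"backend mysql_{cluster}_writer\n"
        f"    mode tcp\n"
        f"    server master {host}:{port} check\n"
    )


class ProxyNode:
    """Stand-in for one GLB/HAProxy box reacting to KV changes."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.config = None
        self.reloads = 0

    def on_kv_change(self, kv):
        new_config = render_writer_backend(self.cluster, kv)
        if new_config != self.config:  # reload only when the rendered config differs
            self.config = new_config
            self.reloads += 1  # stands in for reloading HAProxy


node = ProxyNode("cluster1")
kv = {"mysql/master/cluster1/fqdn": "db-a.example", "mysql/master/cluster1/port": "3306"}
node.on_kv_change(kv)                              # initial render
kv["mysql/master/cluster1/fqdn"] = "db-b.example"  # failover: new master written to KV
node.on_kv_change(kv)                              # re-render and reload
```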

At GitHub, we have a Consul setup in each data center, and each setup is highly available. However, these setups are independent of each other: they do not replicate to each other and do not share any data.

How does Consul learn about changes, and how is information distributed across DCs?

orchestrator/raft

We run an orchestrator/raft setup: orchestrator nodes communicate with each other through the raft consensus mechanism. Each data center has one or two orchestrator nodes.

orchestrator is responsible for failure detection and MySQL failover, and for communicating the change of master to Consul. Failover is operated by a single orchestrator/raft leader node, but the news that the cluster now has a new master is propagated to all orchestrator nodes through the raft mechanism.

When orchestrator nodes receive the news of a master change, each of them communicates with its local Consul setup and invokes a KV write. A DC with more than one orchestrator node will issue multiple (identical) writes to Consul.
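
Because every node writes the same values, the repeated writes are harmless. A minimal sketch, with plain dictionaries standing in for the per-DC Consul KV stores and with node names, host names and key layout all assumed for illustration:

```python
def announce_master(kv, cluster, fqdn, port):
    """Write the master identity for `cluster` into a local KV store (a dict here)."""
    kv[f"mysql/master/{cluster}/fqdn"] = fqdn  # hypothetical key layout
    kv[f"mysql/master/{cluster}/port"] = str(port)


# One independent KV store per DC; dc1 hosts two orchestrator nodes.
consul = {"dc1": {}, "dc2": {}, "dc3": {}}
orchestrator_nodes = [("orc-1", "dc1"), ("orc-2", "dc1"), ("orc-3", "dc2"), ("orc-4", "dc3")]

# Each node writes only to the Consul setup in its own DC.
for _node, dc in orchestrator_nodes:
    announce_master(consul[dc], "cluster1", "db-b.example", 3306)
```

After the loop, all three stores hold the same master identity even though dc1 received the write twice and the stores never talk to each other.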

Combining the processes

In the case of master crash:

  • The orchestrator nodes detect the failure.
  • The orchestrator/raft leader kicks off a recovery, and a replica is promoted to be the new master.
  • orchestrator/raft advertises the master change to all raft cluster nodes.
  • Each orchestrator/raft member receives the master change notification and writes the new master's identity to its local Consul KV store.
  • Each GLB/HAProxy box runs consul-template, which observes the change in the Consul KV store and reconfigures and reloads HAProxy.
  • Client traffic is redirected to the new master.

Each component has a clear responsibility, and the entire design is both decoupled and simplified. orchestrator does not know about the load balancers. Consul does not need to know where the information came from. The proxies only care about Consul. The clients only care about the proxy.

Additionally:

  • There are no DNS changes to be propagated
  • There is no TTL.
  • The flow does not require the cooperation of the failed master, which is largely ignored.

Additional Details

To further protect traffic, we also have the following:

  • HAProxy is configured with a very short hard-stop-after. When it reloads with a new backend server in a writer pool, it automatically terminates any existing connections to the old master.
    • With hard-stop-after we do not even need the clients' cooperation, which mitigates split-brain scenarios. Note that this is not airtight: some time passes before the old connections are terminated. But beyond that point we are comfortable and do not expect any nasty surprises.
  • We do not strictly require Consul to be available at all times. In fact, we only need it to be available at failover time. If Consul happens to be down, GLB continues to operate with the last known values and takes no drastic action.
  • GLB is set up to validate the identity of the newly promoted master. Similar to our context-aware MySQL pools, a check is made on the backend server to confirm that it is indeed a writer node. If we happen to delete the master's identity in Consul, no problem: the empty entry is ignored. If we mistakenly write the name of a non-master server in Consul, no problem: GLB refuses to update and keeps running with the last known state.
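
The validation rule in the last bullet can be sketched as a small decision function. The function name `choose_backend` and the `is_writer` callback are hypothetical; the actual check GLB performs against the backend is not described here.

```python
def choose_backend(last_known, proposed, is_writer):
    """Decide which backend to use after a KV update.

    last_known: backend currently in use
    proposed:   value read from the KV store (may be empty)
    is_writer:  callable that checks whether a host is really a writable master
    """
    if not proposed:
        return last_known  # empty entry: ignore it
    if not is_writer(proposed):
        return last_known  # not actually a master: refuse the update
    return proposed


writers = {"db-b.example"}  # made-up host that passes the writer check


def is_writer(host):
    return host in writers


kept_on_empty = choose_backend("db-a.example", "", is_writer)
kept_on_wrong = choose_backend("db-a.example", "db-c.example", is_writer)
switched = choose_backend("db-a.example", "db-b.example", is_writer)
```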

We will further address the problem and pursue the HA goals in the following sections.

Orchestrator/RAFT Fault Detection

orchestrator uses a holistic approach to failure detection, which makes it very reliable. We have observed no false positives: we have had no premature failovers and therefore no unnecessary downtime.

orchestrator/raft further addresses the case of complete DC network isolation (also known as DC fencing). DC network isolation can cause confusion: servers within that DC can still communicate with each other. Is it they that are isolated from the other DCs, or are the other DCs the ones that are isolated?

In an orchestrator/raft setup, the raft leader is the node that runs failovers. The leader is the node that has the support of the majority of the group (the quorum). Our orchestrator nodes are deployed so that no single data center holds a majority, while any n-1 data centers do.

When a DC is completely network isolated, the orchestrator nodes in that DC are disconnected from their peers in other DCs. As a result, the orchestrator nodes in the isolated DC cannot be the leader of the raft cluster. If such a node happens to be the leader, it steps down, and a new leader is elected from one of the other DCs. That leader has the support of all the other DCs, which can still communicate with each other.

Therefore, the orchestrator node that calls the shots is always outside the isolated data center. If the master is in the isolated DC, orchestrator initiates a failover and replaces it with a server from one of the available DCs. We mitigate DC isolation by delegating the decision to the quorum in the non-isolated DCs.
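
The deployment rule (no single DC holds a majority, any n-1 DCs do) is easy to check arithmetically. The following sketch assumes a simple majority quorum; the function names and the example node counts are illustrative, not GitHub's actual layout.

```python
def majority(total_nodes):
    """Smallest number of nodes that forms a raft quorum."""
    return total_nodes // 2 + 1


def tolerates_single_dc_isolation(nodes_per_dc):
    """True if no DC alone has a quorum and the remaining DCs always do."""
    total = sum(nodes_per_dc.values())
    need = majority(total)
    no_dc_has_majority = all(n < need for n in nodes_per_dc.values())
    others_have_majority = all(total - n >= need for n in nodes_per_dc.values())
    return no_dc_has_majority and others_have_majority


balanced = tolerates_single_dc_isolation({"dc1": 2, "dc2": 2, "dc3": 1})   # 5 nodes, quorum 3
top_heavy = tolerates_single_dc_isolation({"dc1": 3, "dc2": 1, "dc3": 1})  # dc1 alone is a quorum
too_thin = tolerates_single_dc_isolation({"dc1": 2, "dc2": 1, "dc3": 1})   # losing dc1 leaves 2 of 4
```

With one or two nodes per DC, as described above, a layout such as 2+2+1 satisfies both conditions, while 3+1+1 or 2+1+1 does not.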

Faster Advertising

Total downtime can be further reduced by advertising the master change sooner. How can this be achieved?

When orchestrator begins a failover, it looks at the fleet of servers available for promotion. By understanding replication rules and honoring hints and limitations, it can make an educated decision about the best course of action.

It may recognize that a server available for promotion is also an ideal candidate, in that:

  • nothing prevents the promotion of the server (and the user may have hinted that this server is preferred for promotion), and
  • the server is expected to be able to take all of its siblings as replicas.

In this case, orchestrator first makes the server writable and immediately advertises the promotion (in our case, by writing to the Consul KV store), while it starts repairing the replication tree asynchronously, an operation that usually takes a few more seconds. The replication tree is likely to be intact by the time the GLB servers have fully reloaded, but this is not strictly required. The server is ready to accept writes!
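
The ordering described above can be sketched as a plan of steps. The step names and both functions are hypothetical, and the ordering shown for the non-ideal case (repair the replication tree before advertising) is an assumption inferred from the text rather than documented orchestrator behavior.

```python
def is_ideal_candidate(promotion_blocked, can_take_all_siblings):
    """Both conditions from the list above must hold."""
    return not promotion_blocked and can_take_all_siblings


def failover_plan(ideal):
    """Return the order of steps; names are illustrative only."""
    if ideal:
        # Advertise right away; the tree repair proceeds asynchronously afterwards.
        return ["set_writable", "advertise_master", "repair_replication_tree"]
    # Assumed fallback: make sure the tree is in shape before advertising.
    return ["repair_replication_tree", "set_writable", "advertise_master"]


fast_path = failover_plan(is_ideal_candidate(promotion_blocked=False, can_take_all_siblings=True))
slow_path = failover_plan(is_ideal_candidate(promotion_blocked=False, can_take_all_siblings=False))
```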

Semi-synchronous replication

In MySQL's semi-synchronous replication, the master does not acknowledge a transaction commit until the change is known to have been shipped to one or more replicas. This provides a way to achieve lossless failover: any change applied on the master is either applied, or waiting to be applied, on one of the replicas.

Consistency comes at a cost: a risk to availability. If no replica acknowledges receipt of the changes, the master blocks and writes stall. Fortunately, there is a configurable timeout, after which the master falls back to asynchronous replication, making writes available again.

We have set the timeout to a reasonably low value: 500ms. This is more than enough to ship changes from the master to replicas in the local DC, and typically also to remote DCs. With this timeout we observe perfect semi-synchronous behavior (no fallback to asynchronous replication), and we are comfortable with a very short blocking period in case of acknowledgment failure.
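
The trade-off behind the timeout can be illustrated with a toy model of a single commit. This is a simplification under stated assumptions, not MySQL's implementation: the function name and the latency figures are made up, and only the 500ms value comes from the text. In the classic MySQL semi-sync plugin the corresponding setting is `rpl_semi_sync_master_timeout`, expressed in milliseconds.

```python
def commit_mode(ack_latencies_ms, timeout_ms=500):
    """Model one commit: wait for the fastest replica ack, or give up at the timeout.

    Returns (mode, time the commit was blocked in ms).
    """
    fastest_ack = min(ack_latencies_ms, default=None)
    if fastest_ack is not None and fastest_ack <= timeout_ms:
        return "semi-sync", fastest_ack
    return "async-fallback", timeout_ms


healthy = commit_mode([2, 80])   # local replica acks quickly
no_acks = commit_mode([])        # no replica reachable: block for the full timeout
too_slow = commit_mode([900])    # ack arrives after the timeout
```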

We enable semi-sync on replicas in the local DC, and in the event of a master's death we expect (although do not strictly enforce) a lossless failover. Lossless failover on a complete DC failure is costly, and we do not expect it.

While experimenting with the semi-sync timeout, we also observed a behavior that works in our favor: we are able to influence the identity of the ideal candidate in case of a master failure. By enabling semi-sync on designated servers and marking them as candidates, we reduce total downtime by influencing the outcome of a failover. In our experiments, we observed that we typically end up with the ideal candidate and can therefore advertise quickly.

Inject heartbeat

Instead of managing the startup and shutdown of the pt-heartbeat service on promoted and demoted masters, we chose to run it everywhere, all the time. This required some patching so that pt-heartbeat could cope with servers changing back and forth between read_only and writable, or crashing completely.

In our current setup, pt-heartbeat services run on masters and on replicas. On masters, they generate the heartbeat events. On replicas, they identify that the server is read_only and periodically recheck its status. As soon as a server is promoted to master, pt-heartbeat on that server identifies it as writable and begins injecting heartbeat events.
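
The "run everywhere, recheck periodically" behavior can be sketched as a single tick of a loop. This is not pt-heartbeat's code: `FakeServer` and `heartbeat_tick` are hypothetical stand-ins, and a real implementation would query the server's read_only status and write to a heartbeat table.

```python
class FakeServer:
    """In-memory stand-in for a MySQL server."""

    def __init__(self, read_only, alive=True):
        self.read_only = read_only
        self.alive = alive
        self.heartbeats = []

    def is_read_only(self):
        if not self.alive:
            raise ConnectionError("server is down")
        return self.read_only


def heartbeat_tick(server, now):
    """Inject one heartbeat if the server is a writable master; otherwise do nothing."""
    try:
        if server.is_read_only():
            return False  # replica: skip, recheck on the next tick
    except ConnectionError:
        return False  # crashed server: keep running, retry later
    server.heartbeats.append(now)
    return True


server = FakeServer(read_only=True)
as_replica = heartbeat_tick(server, now=1)   # read_only: nothing injected
server.read_only = False                     # server is promoted to master
as_master = heartbeat_tick(server, now=2)    # writable: heartbeat injected
server.alive = False
after_crash = heartbeat_tick(server, now=3)  # crash is tolerated, no exception escapes
```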

orchestrator Ownership Delegation

We further delegated the following to orchestrator:

  • Pseudo-GTID injection,
  • setting the promoted master to be writable and clearing its replication state, and
  • setting the old master to read_only, if possible.

This reduces friction all around the new master. A newly promoted master is clearly expected to be alive and accessible, otherwise we would not promote it. It therefore makes sense to let orchestrator apply these changes directly to the promoted master.

Limitations and Disadvantages

The proxy layer keeps applications unaware of the master's identity, but it also masks the applications' identities from the master. All the master sees are connections coming from the proxy layer, and we lose information about the actual source of each connection.

As distributed systems go, we are still left with unhandled scenarios.

Notably, in a data center isolation scenario where the master is in the isolated DC, applications in that DC can still write to the master. This can lead to inconsistent state once the network is restored. We are working to mitigate this split-brain by implementing a reliable STONITH from within the isolated DC. As before, some time will pass before the old master is brought down, and there may be a brief period of split-brain. The operational cost of avoiding split-brain altogether is very high.

More scenarios exist: Consul being down at failover time, partial DC isolation, and others. We know that in a distributed system of this nature it is impossible to close every loophole, so we focus on the most important cases.

Results

Our orchestrator/GLB/Consul setup provides us with:

  • reliable failure detection,
  • data-center-agnostic failover,
  • typically lossless failover,
  • support for data center network isolation,
  • mitigation of split-brain (more in the works),
  • no dependence on cooperation from the failed master,
  • a total outage time of between 10 and 13 seconds in most cases.
    • In uncommon cases we may see up to 20 seconds of total outage time, and in extreme cases up to 25 seconds.

Conclusion

The orchestration/proxy/service-discovery paradigm uses well-known and trusted components in a decoupled architecture, which makes it easier to deploy, operate, and observe, and allows each component to scale up or down independently. We continue to test our setup and look for improvements.

Original address: https://github.blog/2018-06-20-mysql-high-availability-at-github/

Translation address: https://learnku.com/mysql/t/36820
