検索

This blog post describes the new algorithm used in Redis Cluster in order to propagate and update metadata, that is hopefully significantly safer than the previous algorithm used. The Redis Cluster specification was not yet updated, as I'm

This blog post describes the new algorithm used in Redis Cluster in order to propagate and update metadata, that is hopefully significantly safer than the previous algorithm used. The Redis Cluster specification was not yet updated, as I'm rewriting it from scratch, so this blog post serves as a first way to share the algorithm with the community.

Let's start with the problem to solve. Redis Cluster uses a master - slave design in order to recover from nodes failures. The key space is partitioned across the different masters in the cluster, using a concept that we call "hash slots". Basically every key is hashed into a number between 0 and 16383. If a given key hashes to 15, it means it is in the hash slot number 15. These 16k hash slots are split among the different masters.

At every single time only one master should serve a given hash slot. Slaves just replicate the master dataset so that it is possible to fail over a master and put the cluster again into an usable state where all the hash slots are served by one node.

Redis Cluster is client assisted and nodes are not capable to forward queries to other nodes. However nodes are able to redirect a client to the right node every time a client tries to access a key that is served by a different node. This means that every node in the cluster should know the map between the hash slots and the nodes serving them.

The problem I was trying to solve is, how to take this map in sync between nodes in a safe way? A safe way means that even in the event of net splits, eventually all the nodes will agree about the hash slots configuration.

Another problem to solve was the slave promotion. A master can have multiple slaves, how to detect, and how to act, when a master is failing and a slave should be promoted to replace it?

Metadata is not data
====================

In the case of Redis Cluster handling of metadata is significantly different than the way the user data itself is handled. The focus of Redis Cluster is:

1) Speed.
2) No need for merge operations, so that it is semantically simple to handle the very large values typical of Redis.
3) The ability to retain most writes originating from clients connected to the majority of masters.

Given the priorities, Redis Cluster, like the vanilla single node version of Redis, uses asynchronous replication where changes to the data set are streamed to slave nodes with an asynchronous acknowledgement from slaves. In other words when a node receives a write, the client most of the times directly talk with the node in charge for the key hash slot, and the node has no other chatting to do with other nodes.

However this means that Redis Cluster is not a true CP system: there is a window where writes can be lost. The trivial case to lose a write is to write to a master that stops working after accepting our write and replying to the client, but before being able to propagate the write to its slaves.

This window is very small, in the sub-millisecond range. However when a client is partitioned away from the majority of the master nodes there is a bigger window to lose data, as a partitioned master will continue to accept writes for some time, while on the majority side the same master may be failed over by a slave. So when the partition will be fixed, the master will be reconfigured as a slave and writes will be lost.

Apart from the replicas, a key is stored into a single master node, so there is no need to agree or merge its value. Given the design, there is no need to use an agreement protocol in order to write or read data to/from the cluster. However metadata is a different story, we want that every node has a coherent vision of the cluster setup, and that the right configuration is eventually propagated to all the nodes even in case of failures and network partitions.

Using an agreement algorithm
============================

The simplest way to solve such a problem is to use a consensus algorithm such as Paxos or Raft, and this was the direction I was going to take. However implementing consensus algorithms is hard. Actually it is so hard that often years are needed for implementations to stabilize.

At some point I noticed something that makes the work of Redis Cluster nodes simpler, that is, information about hash slots is always idempotent. If hash slot 5 is served by A, and later because the configuration changes hash slot 5 is served by B, nodes don't need to know what happened, everything they need is that the configuration for an hash slot was updated.

This changes everything basically: agreement protocols are able to take a state machine synchronized by running the same sequence of operations in every node. If the state machine is deterministic, then the internal state will be the same in all the nodes eventually. However all the state Redis Cluster needs, for a given slot, can be embedded into a single packet.

Because of this we don't need a log of operations stored on disk, nor a way to make sure to fetch all the operations still not fetched, or to figure out what should be applied and what not, all the state can be copied in a single message. In short Redis Cluster does not require a full agreement protocol so I stolen just what I needed from Raft, and tried to figure out the rest.

Failure detection
=================

In order to see if a node has issues, Redis Cluster still uses the old algorithm that is based on gossip. Nodes inform other nodes about the state of a few random nodes using ping / pong packets. These ping / pong packets are in turn used to check if a node is reachable from the point of view of the sender of the ping. If the (informal) majority of masters agree that a given node is not reachable, then the node is flagged as failing. I said "informal" as there is no true agreement here, but simply:

1) Every node flags other nodes are failing if the majority of master nodes signaled the node as down in a given time range.
2) Every node removes the flag if the node is back reachable and is a salve, or a master that after some time is still serving slots from our point of view (was not failed over).

The failure detection is completely informal and has the only property that eventually all the nodes will agree on the failure: either the majority of nodes will mark it as failing resulting into a chain effect that will force all the other nodes to mark the node as failing, OR there is no majority and if the node is reachable again everybody will clear the flag.

The point here is that the failure detection does not require any safety, it is only useful in order to trigger the safe part of the algorithm, that is, replacing the master with a slave and update the cluster configuration.

Slave promotion
===============

Promoting a slave must be a safe operation, and one that should ensure that the configuration change is propagated across the cluster as soon as possible.

Slave promotion is up to slaves and uses a mechanism very similar to the Raft algorithm leader election. This is what happens:

1) A slave detects its master is failing.
2) The slave will try to win the election in order to promote itself to master.
3) If it is successful, it will change its state and will advertise the new configuration.
4) If it is unsuccessful it will try again to win the election after some time.

Every slave will try to start the election at a slightly different time in order to avoid a split brain condition that will require a new election. Redis Cluster uses a random delay that is driven by the number of seconds a slave was disconnected from the master, in order to favor slaves that were able to talk with the master more recently (slaves with too old data don't try at all to get elected).

Every cluster node has the concept of currentTerm as in Raft, that is called currentEpoch in Redis Cluster. Every node tries to have a currentEpoch that is the highest found among other nodes, so this information is always added in ping /pong packets headers. Every time a node sees a currentEpoch of another node that is greater than its epoch, it updates its currentEpoch.

The election is like Raft: a slave that tries to get elected increments its currentEpoch and sends a failover-auth-request to every master hoping to get its vote. Masters refuse to vote if the master instance of the slave is not failing from the point of view of a given master, or if the currentTerm advertised by the slave is smaller than the currentTerm of the receiving master.

Also masters vote a single time for every epoch: this ensures that for every epoch we can have just a single winner, this is central both in the Raft algorithm and in the Redis Cluster slave election.

Basically, if a slave wins the election, it uses the epoch at which the election was won as the version of its configuration, and newer configurations always win over older configurations.

The configEpoch
===============

In order to make more clear how it works, we need to add some more information. Basically every ping / pong packet does not just publish the currentEpoch, but also the configEpoch, that is, the epoch at which the master started to serve its set of hash slots. Initially when a cluster is created every master uses a configEpoch of zero. As failover events happen, there will be nodes with greater configEpoch values.

As in the old algorithm, the ping and pong packets also carry a bitmap with the hash slots served by a given node. Since every node knows the last observed configEpoch of every other node, it can detect configuration changes to incorporate.

For instance if node B claims to serve hash slot 5 that was previously served by node A, but the configEpoch of node B is greater than the configEpoch we have for A, then we accept the new configuration.

The same mechanism is also used in order to reconfigure a reappearing master as a slave, or to reconfigure all the other slaves after a failover. The old master's served slots count will drop to zero, and the nodes will switch as replicas of the node that is serving the slots now.

The real algorithm used has more details that don't change the semantics, but make everything more fluid in common cases. For example after a slave wins an election it broadcasts a PONG to every node in order to make the configuration change faster, and to prevent other slaves from initiating a new useless election.

Similarly a master that was partitioned out from the majority for enough time (the same time needed to flag it as failing) stop accepting writes, and will not accept writes for a few more seconds even after the majority of masters is reachable again, in order to give some time to the other nodes to inform it of configuration changes. This makes less likely that a client with an old routing table will try and succeed to write to the returning master that is now failed over.

From the point of view of the code, the implementation is requiring a minor amount of code, as everything was already implemented in the old algorithm even if in a broken way, it was unsafe but the basic abstractions and message formats were ok.

All in all I'm failing in love with distributed programming and I hope to learn more in the next weeks... Comments
声明
この記事の内容はネチズンが自主的に寄稿したものであり、著作権は原著者に帰属します。このサイトは、それに相当する法的責任を負いません。盗作または侵害の疑いのあるコンテンツを見つけた場合は、admin@php.cn までご連絡ください。
MySQLと他のSQL方言の構文の違いは何ですか?MySQLと他のSQL方言の構文の違いは何ですか?Apr 27, 2025 am 12:26 AM

MySQLdiffersfromotherSQLdialectsinsyntaxforLIMIT,auto-increment,stringcomparison,subqueries,andperformanceanalysis.1)MySQLusesLIMIT,whileSQLServerusesTOPandOracleusesROWNUM.2)MySQL'sAUTO_INCREMENTcontrastswithPostgreSQL'sSERIALandOracle'ssequenceandt

MySQLパーティションは何ですか?MySQLパーティションは何ですか?Apr 27, 2025 am 12:23 AM

MySQLパーティション化により、パフォーマンスが向上し、メンテナンスが簡素化されます。 1)大きなテーブルを特定の基準(日付範囲など)、2)物理的に独立したファイルに物理的に分割する、3)MySQLはクエリするときに関連するパーティションに焦点を合わせることができます。

MySQLで特権をどのように許可して取り消しますか?MySQLで特権をどのように許可して取り消しますか?Apr 27, 2025 am 12:21 AM

mysqlで許可を許可および取り消す方法は? 1。grantallprivilegesondatabase_name.to'username'@'host 'などの許可を付与するために付与ステートメントを使用してください。 2。Revokeallprivilegesondatabase_name.from'username'@'host 'など、Revoke Statementを使用して、許可のタイムリーな通信を確保します。

InnoDBとMyisamストレージエンジンの違いを説明してください。InnoDBとMyisamストレージエンジンの違いを説明してください。Apr 27, 2025 am 12:20 AM

INNODBは、トランザクションサポートと高い並行性を必要とするアプリケーションに適していますが、Myisamはより多くの読み取りとより少ない書き込みを必要とするアプリケーションに適しています。 1.INNODBは、eコマースおよび銀行システムに適したトランザクションおよび銀行レベルのロックをサポートしています。 2. Myisamは、ブログやコンテンツ管理システムに適した、迅速な読み取りとインデックス作成を提供します。

MySQLの結合のさまざまなタイプは何ですか?MySQLの結合のさまざまなタイプは何ですか?Apr 27, 2025 am 12:13 AM

MySQLには4つのメイン結合タイプがあります:innerjoin、leftjoin、rightjoin、fullouterjoin。 1.InnerJoinは、結合条件を満たす2つのテーブルのすべての行を返します。 2.右のテーブルに一致する行がない場合でも、Leftjoinは左のテーブルのすべての行を返します。 3。右joinはleftjoinに反しており、右のテーブルのすべての行を返します。 4.fullouterjoinは、結合条件を満たしている、または満たさない2つのテーブルのすべての行を返します。

MySQLで利用可能なさまざまなストレージエンジンは何ですか?MySQLで利用可能なさまざまなストレージエンジンは何ですか?Apr 26, 2025 am 12:27 AM

mysqloffersvariousstorageEngines、それぞれのfordifferentusecases:1)Innodbisidealforapplicationsingingidcomplianceanceandhighconcurrency、support transactions andforeignkeys.2)myisamisbestforread-havyworkloads、transactionsupptort.3)

MySQLの一般的なセキュリティの脆弱性は何ですか?MySQLの一般的なセキュリティの脆弱性は何ですか?Apr 26, 2025 am 12:27 AM

MySQLの一般的なセキュリティの脆弱性には、SQLインジェクション、弱いパスワード、不適切な許可構成、および非合事ソフトウェアが含まれます。 1。SQL注射は、前処理ステートメントを使用することで防ぐことができます。 2。強力なパスワード戦略を強制的に使用することにより、弱いパスワードを回避できます。 3.不適切な許可構成は、ユーザー許可の定期的なレビューと調整を通じて解決できます。 4.未使用のソフトウェアは、MySQLバージョンを定期的にチェックして更新することでパッチを適用できます。

MySQLでスロークエリをどのように識別できますか?MySQLでスロークエリをどのように識別できますか?Apr 26, 2025 am 12:15 AM

MySQLの遅いクエリを識別することは、遅いクエリログを有効にし、しきい値を設定することで実現できます。 1.スロークエリログを有効にし、しきい値を設定します。 2.スロークエリログファイルを表示および分析し、詳細な分析のためにMySQLDumpSlowやPT-Query-Digestなどのツールを使用します。 3.インデックスの最適化、クエリの書き換え、およびselect*の使用を回避することで、遅いクエリの最適化を実現できます。

See all articles

ホットAIツール

Undresser.AI Undress

Undresser.AI Undress

リアルなヌード写真を作成する AI 搭載アプリ

AI Clothes Remover

AI Clothes Remover

写真から衣服を削除するオンライン AI ツール。

Undress AI Tool

Undress AI Tool

脱衣画像を無料で

Clothoff.io

Clothoff.io

AI衣類リムーバー

Video Face Swap

Video Face Swap

完全無料の AI 顔交換ツールを使用して、あらゆるビデオの顔を簡単に交換できます。

ホットツール

VSCode Windows 64 ビットのダウンロード

VSCode Windows 64 ビットのダウンロード

Microsoft によって発売された無料で強力な IDE エディター

SublimeText3 Linux 新バージョン

SublimeText3 Linux 新バージョン

SublimeText3 Linux 最新バージョン

メモ帳++7.3.1

メモ帳++7.3.1

使いやすく無料のコードエディター

SublimeText3 中国語版

SublimeText3 中国語版

中国語版、とても使いやすい

mPDF

mPDF

mPDF は、UTF-8 でエンコードされた HTML から PDF ファイルを生成できる PHP ライブラリです。オリジナルの作者である Ian Back は、Web サイトから「オンザフライ」で PDF ファイルを出力し、さまざまな言語を処理するために mPDF を作成しました。 HTML2FPDF などのオリジナルのスクリプトよりも遅く、Unicode フォントを使用すると生成されるファイルが大きくなりますが、CSS スタイルなどをサポートし、多くの機能強化が施されています。 RTL (アラビア語とヘブライ語) や CJK (中国語、日本語、韓国語) を含むほぼすべての言語をサポートします。ネストされたブロックレベル要素 (P、DIV など) をサポートします。