New Redis Cluster meta-data handling-MySQL 튜토리얼-php.cn

집

데이터 베이스

MySQL 튜토리얼

New Redis Cluster meta-data handling

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Jun 07, 2016 pm 04:35 PM

clusternewredis

This blog post describes the new algorithm used in Redis Cluster in order to propagate and update metadata, that is hopefully significantly safer than the previous algorithm used. The Redis Cluster specification was not yet updated, as I'm rewriting it from scratch, so this blog post serves as a first way to share the algorithm with the community.

Let's start with the problem to solve. Redis Cluster uses a master - slave design in order to recover from nodes failures. The key space is partitioned across the different masters in the cluster, using a concept that we call "hash slots". Basically every key is hashed into a number between 0 and 16383. If a given key hashes to 15, it means it is in the hash slot number 15. These 16k hash slots are split among the different masters.

At every single time only one master should serve a given hash slot. Slaves just replicate the master dataset so that it is possible to fail over a master and put the cluster again into an usable state where all the hash slots are served by one node.

Redis Cluster is client assisted and nodes are not capable to forward queries to other nodes. However nodes are able to redirect a client to the right node every time a client tries to access a key that is served by a different node. This means that every node in the cluster should know the map between the hash slots and the nodes serving them.

The problem I was trying to solve is, how to take this map in sync between nodes in a safe way? A safe way means that even in the event of net splits, eventually all the nodes will agree about the hash slots configuration.

Another problem to solve was the slave promotion. A master can have multiple slaves, how to detect, and how to act, when a master is failing and a slave should be promoted to replace it?

Metadata is not data
====================

In the case of Redis Cluster handling of metadata is significantly different than the way the user data itself is handled. The focus of Redis Cluster is:

1) Speed.
2) No need for merge operations, so that it is semantically simple to handle the very large values typical of Redis.
3) The ability to retain most writes originating from clients connected to the majority of masters.

Given the priorities, Redis Cluster, like the vanilla single node version of Redis, uses asynchronous replication where changes to the data set are streamed to slave nodes with an asynchronous acknowledgement from slaves. In other words when a node receives a write, the client most of the times directly talk with the node in charge for the key hash slot, and the node has no other chatting to do with other nodes.

However this means that Redis Cluster is not a true CP system: there is a window where writes can be lost. The trivial case to lose a write is to write to a master that stops working after accepting our write and replying to the client, but before being able to propagate the write to its slaves.

This window is very small, in the sub-millisecond range. However when a client is partitioned away from the majority of the master nodes there is a bigger window to lose data, as a partitioned master will continue to accept writes for some time, while on the majority side the same master may be failed over by a slave. So when the partition will be fixed, the master will be reconfigured as a slave and writes will be lost.

Apart from the replicas, a key is stored into a single master node, so there is no need to agree or merge its value. Given the design, there is no need to use an agreement protocol in order to write or read data to/from the cluster. However metadata is a different story, we want that every node has a coherent vision of the cluster setup, and that the right configuration is eventually propagated to all the nodes even in case of failures and network partitions.

Using an agreement algorithm
============================

The simplest way to solve such a problem is to use a consensus algorithm such as Paxos or Raft, and this was the direction I was going to take. However implementing consensus algorithms is hard. Actually it is so hard that often years are needed for implementations to stabilize.

At some point I noticed something that makes the work of Redis Cluster nodes simpler, that is, information about hash slots is always idempotent. If hash slot 5 is served by A, and later because the configuration changes hash slot 5 is served by B, nodes don't need to know what happened, everything they need is that the configuration for an hash slot was updated.

This changes everything basically: agreement protocols are able to take a state machine synchronized by running the same sequence of operations in every node. If the state machine is deterministic, then the internal state will be the same in all the nodes eventually. However all the state Redis Cluster needs, for a given slot, can be embedded into a single packet.

Because of this we don't need a log of operations stored on disk, nor a way to make sure to fetch all the operations still not fetched, or to figure out what should be applied and what not, all the state can be copied in a single message. In short Redis Cluster does not require a full agreement protocol so I stolen just what I needed from Raft, and tried to figure out the rest.

Failure detection
=================

In order to see if a node has issues, Redis Cluster still uses the old algorithm that is based on gossip. Nodes inform other nodes about the state of a few random nodes using ping / pong packets. These ping / pong packets are in turn used to check if a node is reachable from the point of view of the sender of the ping. If the (informal) majority of masters agree that a given node is not reachable, then the node is flagged as failing. I said "informal" as there is no true agreement here, but simply:

1) Every node flags other nodes are failing if the majority of master nodes signaled the node as down in a given time range.
2) Every node removes the flag if the node is back reachable and is a salve, or a master that after some time is still serving slots from our point of view (was not failed over).

The failure detection is completely informal and has the only property that eventually all the nodes will agree on the failure: either the majority of nodes will mark it as failing resulting into a chain effect that will force all the other nodes to mark the node as failing, OR there is no majority and if the node is reachable again everybody will clear the flag.

The point here is that the failure detection does not require any safety, it is only useful in order to trigger the safe part of the algorithm, that is, replacing the master with a slave and update the cluster configuration.

Slave promotion
===============

Promoting a slave must be a safe operation, and one that should ensure that the configuration change is propagated across the cluster as soon as possible.

Slave promotion is up to slaves and uses a mechanism very similar to the Raft algorithm leader election. This is what happens:

1) A slave detects its master is failing.
2) The slave will try to win the election in order to promote itself to master.
3) If it is successful, it will change its state and will advertise the new configuration.
4) If it is unsuccessful it will try again to win the election after some time.

Every slave will try to start the election at a slightly different time in order to avoid a split brain condition that will require a new election. Redis Cluster uses a random delay that is driven by the number of seconds a slave was disconnected from the master, in order to favor slaves that were able to talk with the master more recently (slaves with too old data don't try at all to get elected).

Every cluster node has the concept of currentTerm as in Raft, that is called currentEpoch in Redis Cluster. Every node tries to have a currentEpoch that is the highest found among other nodes, so this information is always added in ping /pong packets headers. Every time a node sees a currentEpoch of another node that is greater than its epoch, it updates its currentEpoch.

The election is like Raft: a slave that tries to get elected increments its currentEpoch and sends a failover-auth-request to every master hoping to get its vote. Masters refuse to vote if the master instance of the slave is not failing from the point of view of a given master, or if the currentTerm advertised by the slave is smaller than the currentTerm of the receiving master.

Also masters vote a single time for every epoch: this ensures that for every epoch we can have just a single winner, this is central both in the Raft algorithm and in the Redis Cluster slave election.

Basically, if a slave wins the election, it uses the epoch at which the election was won as the version of its configuration, and newer configurations always win over older configurations.

The configEpoch
===============

In order to make more clear how it works, we need to add some more information. Basically every ping / pong packet does not just publish the currentEpoch, but also the configEpoch, that is, the epoch at which the master started to serve its set of hash slots. Initially when a cluster is created every master uses a configEpoch of zero. As failover events happen, there will be nodes with greater configEpoch values.

As in the old algorithm, the ping and pong packets also carry a bitmap with the hash slots served by a given node. Since every node knows the last observed configEpoch of every other node, it can detect configuration changes to incorporate.

For instance if node B claims to serve hash slot 5 that was previously served by node A, but the configEpoch of node B is greater than the configEpoch we have for A, then we accept the new configuration.

The same mechanism is also used in order to reconfigure a reappearing master as a slave, or to reconfigure all the other slaves after a failover. The old master's served slots count will drop to zero, and the nodes will switch as replicas of the node that is serving the slots now.

The real algorithm used has more details that don't change the semantics, but make everything more fluid in common cases. For example after a slave wins an election it broadcasts a PONG to every node in order to make the configuration change faster, and to prevent other slaves from initiating a new useless election.

Similarly a master that was partitioned out from the majority for enough time (the same time needed to flag it as failing) stop accepting writes, and will not accept writes for a few more seconds even after the majority of masters is reachable again, in order to give some time to the other nodes to inform it of configuration changes. This makes less likely that a client with an old routing table will try and succeed to write to the returning master that is now failed over.

From the point of view of the code, the implementation is requiring a minor amount of code, as everything was already implemented in the old algorithm even if in a broken way, it was unsafe but the basic abstractions and message formats were ok.

All in all I'm failing in love with distributed programming and I hope to learn more in the next weeks... Comments

성명

본 글의 내용은 네티즌들의 자발적인 기여로 작성되었으며, 저작권은 원저작자에게 있습니다. 본 사이트는 이에 상응하는 법적 책임을 지지 않습니다. 표절이나 침해가 의심되는 콘텐츠를 발견한 경우 admin@php.cn으로 문의하세요.

관련 기사

MySQL과 다른 SQL 방언의 구문의 차이점은 무엇입니까?Apr 27, 2025 am 12:26 AM

mysqldiffersfromothersqldialectsinsyntaxforlimit, 자동 점유, 문자열 comparison, 하위 쿼리 및 퍼포먼스 앤 알리 분석 .1) mysqluse Slimit, whilesqlSerVerusestOpandoracleSrownum.2) MySql'Sauto_incrementContrastSwithPostgresql'serialandoracle '

MySQL 파티셔닝이란 무엇입니까?Apr 27, 2025 am 12:23 AM

MySQL 파티셔닝은 성능을 향상시키고 유지 보수를 단순화합니다. 1) 큰 테이블을 특정 기준 (예 : 날짜 범위)으로 작은 조각으로 나누고, 2) 데이터를 독립적 인 파일로 물리적으로 나눌 수 있습니다.

MySQL에서 어떻게 권한을 부여하고 취소합니까?Apr 27, 2025 am 12:21 AM

MySQL에서 권한을 부여하고 취소하는 방법은 무엇입니까? 1. 보조금 명세서를 사용하여 grantallprivilegesondatabase_name.to'username'@'host '와 같은 부여 권한; 2. Revoke 문을 사용하여 Revokeallprivilegesondatabase_name.from'username'@'host '와 같은 권한을 취소하여 허가 변경의 적시에 의사 소통을 보장하십시오.

InnoDB와 MyISAM 스토리지 엔진의 차이점을 설명하십시오.Apr 27, 2025 am 12:20 AM

InnoDB는 거래 지원 및 높은 동시성이 필요한 응용 프로그램에 적합한 반면, MyISAM은 더 많은 읽기와 덜 쓰는 응용 프로그램에 적합합니다. 1. INNODB는 전자 상거래 및 은행 시스템에 적합한 거래 및 은행 수준의 자물쇠를 지원합니다. 2. Myisam은 블로깅 및 컨텐츠 관리 시스템에 적합한 빠른 읽기 및 색인을 제공합니다.

MySQL의 다른 유형의 조인은 무엇입니까?Apr 27, 2025 am 12:13 AM

MySQL에는 Innerjoin, Leftjoin, RightJoin 및 FullouterJoin의 네 가지 주요 조인 유형이 있습니다. 1. 결합 조건을 충족하는 두 테이블의 모든 행을 반환합니다. 2. Leftjoin 오른쪽 테이블에 일치하는 행이 없더라도 왼쪽 테이블의 모든 행을 반환합니다. 3. RightJoin은 LeftJoin과 상반되며 오른쪽 테이블의 모든 행을 반환합니다. 4. FULLOUTERNOIN은 조건을 충족 시키거나 충족하지 않는 두 테이블의 모든 행을 반환합니다.

MySQL에서 사용 가능한 다른 스토리지 엔진은 무엇입니까?Apr 26, 2025 am 12:27 AM

mysqloffersvariousStorageEngines, 각각의 everitedforentUsecases : 1) innodbisidealforapplicationsneedingAcidCoInceandHighConcurrency, 지원 트랜잭션 및 foreignKeys.2) myIsAmisbestforread-heverworkloads, memoryengineis

MySQL의 일반적인 보안 취약점은 무엇입니까?Apr 26, 2025 am 12:27 AM

MySQL의 일반적인 보안 취약점에는 SQL 주입, 약한 암호, 부적절한 권한 구성 및 업데이트되지 않은 소프트웨어가 포함됩니다. 1. 전처리 명령문을 사용하여 SQL 주입을 방지 할 수 있습니다. 2. 강력한 비밀번호 전략을 사용하여 약한 암호는 피할 수 있습니다. 3. 정기적 인 검토 및 사용자 권한 조정을 통해 부적절한 권한 구성을 해결할 수 있습니다. 4. Unupdated 소프트웨어는 MySQL 버전을 정기적으로 확인하고 업데이트하여 패치 할 수 있습니다.

MySQL에서 느린 쿼리를 어떻게 식별 할 수 있습니까?Apr 26, 2025 am 12:15 AM

느린 쿼리 로그를 활성화하고 임계 값을 설정하여 MySQL에서 느린 쿼리를 식별 할 수 있습니다. 1. 느린 쿼리 로그를 활성화하고 임계 값을 설정하십시오. 2. 느린 쿼리 로그 파일을보고 분석하고 심층 분석을 위해 MySQLDumpSlow 또는 PT-Query 소수성과 같은 도구를 사용하십시오. 3. 인덱스 최적화, 쿼리 재 작성 및 select*의 사용을 피함으로써 느린 쿼리 최적화를 달성 할 수 있습니다.

See all articles

핫 AI 도구

Undresser.AI Undress

사실적인 누드 사진을 만들기 위한 AI 기반 앱

AI Clothes Remover

사진에서 옷을 제거하는 온라인 AI 도구입니다.

Undress AI Tool

무료로 이미지를 벗다

Clothoff.io

AI 옷 제거제

Video Face Swap

완전히 무료인 AI 얼굴 교환 도구를 사용하여 모든 비디오의 얼굴을 쉽게 바꾸세요!

뜨거운 도구

Dreamweaver Mac版

시각적 웹 개발 도구

SecList

SecLists는 최고의 보안 테스터의 동반자입니다. 보안 평가 시 자주 사용되는 다양한 유형의 목록을 한 곳에 모아 놓은 것입니다. SecLists는 보안 테스터에게 필요할 수 있는 모든 목록을 편리하게 제공하여 보안 테스트를 더욱 효율적이고 생산적으로 만드는 데 도움이 됩니다. 목록 유형에는 사용자 이름, 비밀번호, URL, 퍼징 페이로드, 민감한 데이터 패턴, 웹 셸 등이 포함됩니다. 테스터는 이 저장소를 새로운 테스트 시스템으로 간단히 가져올 수 있으며 필요한 모든 유형의 목록에 액세스할 수 있습니다.

mPDF

mPDF는 UTF-8로 인코딩된 HTML에서 PDF 파일을 생성할 수 있는 PHP 라이브러리입니다. 원저자인 Ian Back은 자신의 웹 사이트에서 "즉시" PDF 파일을 출력하고 다양한 언어를 처리하기 위해 mPDF를 작성했습니다. HTML2FPDF와 같은 원본 스크립트보다 유니코드 글꼴을 사용할 때 속도가 느리고 더 큰 파일을 생성하지만 CSS 스타일 등을 지원하고 많은 개선 사항이 있습니다. RTL(아랍어, 히브리어), CJK(중국어, 일본어, 한국어)를 포함한 거의 모든 언어를 지원합니다. 중첩된 블록 수준 요소(예: P, DIV)를 지원합니다.