Understanding distributed consensus algorithms

Time:2021-9-22

Starting with RocketMQ's support for automatic failover


Before version 4.5, RocketMQ only had a master / slave deployment mode. Each broker group had one master and zero or more slaves. A slave kept up with the master's data through synchronous or asynchronous replication. This master / slave deployment mode provides a certain degree of high availability.

However, this deployment model has some defects. In terms of failover, for example, if the master node goes down it has to be restarted or switched over manually; a slave node cannot be promoted to master automatically. A new multi-replica architecture is therefore needed to solve this problem.

The key to RocketMQ's high-availability multi-replica architecture: DLedger, a commit log repository based on the Raft protocol
Apache RocketMQ: version 4.5.0 implements a high-availability multi-replica architecture based on the Raft protocol

The master broker loses the ability to write messages after it goes down

The master broker supports both reads and writes, while slave brokers only support reads, and only the slave with brokerId = 1 takes on message-read load. (RocketMQ tells master from slave by brokerId: brokerId = 0 is the master, any non-zero value is a slave.) Once the master goes down, the broker group can therefore no longer accept writes.

Our mall currently runs RocketMQ version 3.2.6. There are two nodes (broker a and broker C), each with one master and one slave.

Besides the one-master, one-slave setup, RocketMQ brokers support a variety of deployment modes:

  • Single master single slave
  • Single master multi slave synchronous replication
  • Single master multi slave asynchronous replication
  • Multi master zero slave

Kafka leader election

For comparison, let's look at how Kafka does it (the Taobao middleware team implemented RocketMQ in Java after thoroughly reviewing Kafka). Replicas in Kafka are divided into follower replicas and a leader replica. A follower replica does not serve clients and is only used to back up data (it handles neither reads nor writes). The advantage is that clients never see the message lag of a follower; the disadvantage is that reads cannot be scaled out horizontally.

The general idea of Kafka leader election is as follows:

The leader creates an ephemeral node in ZooKeeper, and all followers register a watch on this node. When the leader goes down, all followers in the ISR try to create the node. The one that succeeds (ZooKeeper guarantees that only one creation can succeed) becomes the new leader, and the other replicas remain followers.
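The race to create the ephemeral node can be sketched as follows. This is only an in-memory illustration: `FakeZooKeeper`, `try_create`, `session_lost` and `elect` are hypothetical names invented for this example, not ZooKeeper or Kafka APIs, and the replicas "race" in a fixed order.

```python
class FakeZooKeeper:
    """In-memory stand-in for ZooKeeper that models a single ephemeral node."""

    def __init__(self):
        self.leader_node = None  # id of the replica that owns the node

    def try_create(self, replica_id):
        # Creation is atomic: it succeeds only if the node does not exist yet.
        if self.leader_node is None:
            self.leader_node = replica_id
            return True
        return False

    def session_lost(self, replica_id):
        # An ephemeral node disappears when its owner's session ends.
        if self.leader_node == replica_id:
            self.leader_node = None


def elect(zk, isr):
    """Every in-sync replica tries to create the node; return the winner."""
    for replica_id in isr:
        zk.try_create(replica_id)
    return zk.leader_node


zk = FakeZooKeeper()
replicas = ["r1", "r2", "r3"]
leader = elect(zk, replicas)
zk.session_lost(leader)  # the leader goes down, its node is removed
new_leader = elect(zk, [r for r in replicas if r != leader])
print(leader, new_leader)
```

Only one `try_create` call can succeed per election, which is the property the election relies on.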

Rocketmq best practices
RocketMQ architecture
Related concepts of rocketmq

What is the Raft protocol

Lamport first proposed a distributed consensus algorithm named Paxos (after a Greek island, in a nod to the democratic elections of ancient Greek city-states). The algorithm was hard to understand and its presentation was not concise, though that did not stand in the way of a Turing Award: the algorithm was proposed in 1990, was later applied in many projects by Internet companies, and Lamport received the Turing Award in 2013.
Finally, in 2014, the Raft protocol was proposed as a simpler, easier-to-understand alternative to Paxos. Raft has three roles:

  • Leader
  • Follower
  • Candidate

Raft protocol divides the problem into several sub problems to solve:

  • Leader election
  • Log replication
  • Safety

Leader election

  • During initialization, or when the existing leader goes down or loses contact, a follower starts a round of leader election with a new term.
  • If a round of election succeeds, the new leader starts working. Otherwise the term ends without a leader, and the next round of election uses a new term number.
  • When a follower times out waiting for the leader's heartbeat, it becomes a candidate and starts a new round of election. A candidate first votes for itself, then sends vote requests to the other servers.
  • Each node may vote only once per term, on a first-come, first-served basis.
  • A candidate that receives votes from more than half of the nodes becomes the new leader. If no leader is elected before the timeout, the term ends and the next round of election starts with a new term number.

Note: if candidate A receives a heartbeat from another node B during the election, and the term number in the heartbeat is not less than A's own term number, A immediately returns to the follower state and recognizes B as the leader.
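The voting rules above can be sketched in a few lines of Python. This is a simplified illustration, not a full Raft implementation: the `Node` class and the `run_election` helper are names made up for this example, messages are plain method calls, and the log up-to-date check that real Raft applies before granting a vote is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    node_id: str
    current_term: int = 0
    voted_for: Optional[str] = None
    role: str = "follower"

    def handle_vote_request(self, candidate_id: str, term: int) -> bool:
        if term < self.current_term:
            return False  # stale candidate
        if term > self.current_term:
            # A newer term resets our vote and demotes us to follower.
            self.current_term = term
            self.voted_for = None
            self.role = "follower"
        # One vote per term, first come, first served.
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

    def handle_heartbeat(self, leader_id: str, term: int) -> bool:
        if term < self.current_term:
            return False  # heartbeat from an outdated leader
        # Term not less than ours: step down and recognize the sender.
        self.current_term = term
        self.role = "follower"
        return True


def run_election(candidate: Node, peers: List[Node]) -> bool:
    """The candidate starts a new term, votes for itself, then asks peers."""
    candidate.current_term += 1
    candidate.role = "candidate"
    candidate.voted_for = candidate.node_id
    votes = 1
    for peer in peers:
        if peer.handle_vote_request(candidate.node_id, candidate.current_term):
            votes += 1
    if votes > (len(peers) + 1) // 2:  # more than half of the cluster
        candidate.role = "leader"
        return True
    return False


a, b, c = Node("a"), Node("b"), Node("c")
print(run_election(a, [b, c]), a.role, a.current_term)
```

After this election, `b` has already voted for `a` in term 1, so a later vote request from `c` for the same term would be refused.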

Two random times
Raft uses two randomized delays to mitigate the split-vote problem. They reduce the probability that several servers start an election at the same time, and the probability that an election fails because two candidates each win only half of the votes.

  • The timeout each follower waits for the leader's heartbeat is random (150 ms to 300 ms)
  • If two candidates run at the same time and get the same number of votes, each candidate waits for a random delay and then sends its vote requests to the other nodes again
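A minimal sketch of the randomized timeout, assuming the 150 ms to 300 ms range quoted above. The function names are invented for this example; a real node would arm a timer instead of comparing numbers.

```python
import random


def election_timeout_ms(rng, low=150.0, high=300.0):
    """Pick a random heartbeat timeout in [low, high] milliseconds."""
    return rng.uniform(low, high)


def first_to_time_out(node_ids, rng):
    """Return the node whose timeout fires first, plus all drawn timeouts."""
    timeouts = {node_id: election_timeout_ms(rng) for node_id in node_ids}
    return min(timeouts, key=timeouts.get), timeouts


rng = random.Random()  # pass random.Random(seed) for a repeatable run
candidate, timeouts = first_to_time_out(["a", "b", "c"], rng)
print(candidate, timeouts)
```

Because the timeouts are drawn independently, one follower usually times out clearly before the others, becomes a candidate, and collects votes before anyone else starts competing.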

Raft protocol features:

  • There is only one leader in the system, and it handles all read and write requests sent by clients
  • The leader is responsible for communicating with all followers, replicating each proposal / value / change to them, and collecting responses from a majority of followers
  • The failure of a minority of nodes does not affect the overall availability of the system
  • The leader maintains heartbeats with all followers
  • If the leader goes down, the system automatically elects a new one. During the election the system cannot serve external requests

Log replication

The leader is responsible for log replication. When it receives a request from a client (containing a command to execute), it creates a new log entry, appends it to its local log, and then replicates the new entry to the follower nodes. If a follower is unavailable, the leader retries the append-entries message indefinitely until all followers have received and stored the entry.

Once the leader has confirmation that the entry is stored on a majority of nodes, it commits the entry locally. It then informs the followers that the entry is committed (in Raft this information travels with subsequent append-entries messages, including heartbeats), so that they apply the entry to their local state machines. This is how the logs of the servers in the cluster are kept consistent.
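The majority rule for committing can be illustrated with a small helper. This is a sketch under simplifying assumptions: `majority_commit_index` is a hypothetical function name, indexes are plain integers, and the additional Raft rule that a leader only commits entries from its own term this way is ignored.

```python
def majority_commit_index(match_index, leader_last_index):
    """Highest log index that is stored on a majority of the cluster.

    match_index maps each follower to the highest index it has confirmed;
    the leader counts as one more copy at leader_last_index.
    """
    indexes = sorted(list(match_index.values()) + [leader_last_index],
                     reverse=True)
    majority = len(indexes) // 2 + 1
    # The majority-th largest index is present on at least `majority` nodes.
    return indexes[majority - 1]


followers = {"b": 7, "c": 5, "d": 3, "e": 2}
print(majority_commit_index(followers, leader_last_index=7))
```

With five nodes, index 5 is held by the leader, `b` and `c`, which is a majority, so entries up to index 5 can be committed even though `d` and `e` lag behind.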

When the leader crashes, the logs may become inconsistent: some entries of the old leader may not have been fully replicated across the cluster. A new leader resolves the inconsistency by forcing followers to duplicate its own log. Roughly, the leader compares each follower's log with its own, finds the latest entry on the follower up to which the two logs are identical, deletes every entry in the follower's log after that point, and sends its own entries from there on. This mechanism restores log consistency in the cluster after a failure.
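The repair step can be sketched as a prefix comparison. This is a simplification for illustration: entries are modelled as `(term, command)` tuples, both function names are made up for this example, and the whole comparison happens in one place, whereas a real Raft leader discovers the matching point by probing backwards over successive append-entries messages.

```python
def last_matching_index(leader_log, follower_log):
    """Number of leading entries on which both logs are identical."""
    count = 0
    for leader_entry, follower_entry in zip(leader_log, follower_log):
        if leader_entry != follower_entry:
            break
        count += 1
    return count


def repair_follower(leader_log, follower_log):
    """Drop the follower's entries after the last match, then copy the
    leader's remaining entries."""
    keep = last_matching_index(leader_log, follower_log)
    return follower_log[:keep] + leader_log[keep:]


leader = [(1, "x=1"), (1, "y=2"), (3, "z=3")]
follower = [(1, "x=1"), (1, "y=2"), (2, "y=9"), (2, "w=0")]
print(repair_follower(leader, follower))
```

The follower's two term-2 entries were never committed, so they are discarded and replaced by the leader's term-3 entry.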

Look at the animation and understand the raft algorithm

Demo animation
Chinese translation of animation demonstration
Stanford raft Museum


Engineering practice of related algorithms

Project                        Algorithm            Time / origin
MongoDB                        Bully -> Raft        2007-2009, open source
ZooKeeper                      ZAB (multi-Paxos)    Yahoo, 2008
Chubby                         Paxos                Google
MySQL 5.7 Group Replication    Paxos                2013-2015
RocketMQ                       none -> Raft         open source in 2012
Kafka                          ZooKeeper -> Raft    LinkedIn, open source in 2011
etcd                           Raft                 CoreOS; CNCF 2018

The general technical trend is: master / slave -> Raft, Paxos -> Raft

Middleware using raft protocol
Kafka is discussing replacing zookeeper with raft algorithm
MySQL 5.7 Group Replication Background
RocketMQ Add store with dledger

Paxos before raft

Paxos is an algorithm used to reach consensus among a group of distributed computers communicating over an asynchronous network. One or more clients propose a value to Paxos, and consensus is reached when a majority of the systems running Paxos agree on one of the proposed values. Paxos is widely used and is legendary in computer science because it was the first consensus algorithm to be rigorously proved correct.

Paxos simply selects a single value from one or more proposed values and lets everyone know what that value is. To build a replicated log with Paxos (for example, for a replicated state machine), you need to run Paxos repeatedly. This is called multi-Paxos. Multi-Paxos allows some optimizations, but they are not discussed here.

Related roles

Paxos has three roles:

  1. Proposer: receives a request (a value) from a client and tries to persuade the acceptors to accept the value it proposes.
  2. Acceptor (the recipient of proposals): accepts certain proposed values from proposers and lets proposers know whether another proposal has already been accepted. An acceptor's response represents a vote for a particular proposal.
  3. Learner: stores a copy of the result of the consensus vote.

Basic Paxos protocol

This is the most basic protocol of the Paxos family. Each successful run of Basic Paxos agrees on a single proposed value. If a round of negotiation fails, further rounds are usually carried out. A successful round has two phases: phase 1 (split into parts a and b) and phase 2 (split into parts a and b).


Phase 1
Phase 1a: Prepare

A proposer creates a message, which we call a Prepare, identified by a number n. Note that n is not the value to be proposed or agreed on, but just a number that uniquely identifies this initial message sent by the proposer to the acceptors. The number n must be greater than any number used in a previous Prepare message by this proposer. The proposer then sends the Prepare message containing n to the acceptors. Note that the Prepare message contains only the number n (it does not have to contain the proposed value, usually denoted v). If the proposer cannot communicate with at least a majority of the acceptors, it should not start the Paxos negotiation process.
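One common way to obtain proposal numbers that are both increasing and unique across proposers is to interleave a round counter with a proposer id. The sketch below assumes a fixed, known number of proposers; the class `ProposalNumbers` and its parameters are hypothetical names for this example and are not prescribed by Paxos itself.

```python
class ProposalNumbers:
    """Generates n = round * num_proposers + proposer_id.

    Numbers from different proposers never collide (they differ modulo
    num_proposers), and each call returns a larger number than the last.
    """

    def __init__(self, proposer_id, num_proposers):
        self.proposer_id = proposer_id      # 0 <= proposer_id < num_proposers
        self.num_proposers = num_proposers
        self.round = 0

    def next(self, highest_seen=0):
        # Jump past any proposal number we have heard about (for example
        # from a Nack), so the new number is not immediately stale.
        self.round = max(self.round, highest_seen // self.num_proposers) + 1
        return self.round * self.num_proposers + self.proposer_id


p0 = ProposalNumbers(proposer_id=0, num_proposers=3)
p1 = ProposalNumbers(proposer_id=1, num_proposers=3)
print(p0.next(), p0.next(), p1.next(), p1.next(highest_seen=6))
```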

Phase 1b: Promise

Every acceptor waits for a Prepare message from any proposer. When an acceptor receives a Prepare message, it must look at the proposal number n of that message. There are two cases.

If n is greater than every proposal number the acceptor has previously received from any proposer, the acceptor must return a message to the proposer, which we call a Promise, committing itself to ignore all future proposals numbered less than n. If the acceptor has accepted a proposal in the past, its response must include the number of that proposal (say m) and the corresponding accepted value (say w).

Otherwise (that is, n is less than or equal to a proposal number the acceptor has previously received), the acceptor can ignore the received proposal and does not have to respond to the proposer. As an optimization, however, sending a rejection (Nack) tells the proposer that it can stop trying to build consensus with proposal number n.

Phase 2
Phase 2a: Propose / Accept

If the proposer receives Promises from a majority of acceptors, it needs to set a value v for its proposal. If any acceptors had previously accepted a proposal, they will have sent its value to the proposer, who must now set v to the value associated with the highest proposal number reported by the acceptors; call it z. If no acceptor has accepted a proposal so far, the proposer may choose any value, say x.

The proposer sends an Accept message [n, v] to the acceptors, with proposal number n (the same number as in the Prepare message it sent earlier) and proposal value v (v = z or v = x).

The Accept message should be read as a request, as in "please accept this proposal!".

Phase 2b: Accepted

If an acceptor receives an Accept message [n, v], it must accept the proposal as long as n is greater than or equal to the largest proposal number max_n it has promised (in phase 1b).

In other words, if the acceptor has not promised (in phase 1b) to consider only proposals numbered greater than n, it registers the value v of the newly received Accept message as its accepted value and sends an Accepted message to the proposer and to every learner.

Otherwise, it can ignore the Accept message or request.

Note that an acceptor can accept multiple proposals. For example, when a new proposer that is unaware of the value being negotiated starts a new round with a higher proposal number n, an acceptor can promise and later accept the new proposed value even though it has accepted another one before. Under certain failures these proposals may even carry different values. However, the Paxos protocol guarantees that the acceptors will ultimately agree on a single value.

Talk is cheap, show me the code

Stage 1a: proposer (prepare)

The proposer initiates a Prepare message, choosing a unique, ever-increasing ID.

Here ID corresponds to n above, and VALUE corresponds to v above.

ID = cnt++;
send PREPARE(ID)

Stage 1b: acceptor (promise)

The acceptor receives a PREPARE(ID) message:

    if (ID <= max_id)
        do not respond (or respond with a "fail" message)
    else
        max_id = ID     // save highest ID we've seen so far
        if (proposal_accepted == true) // was a proposal already accepted?
            respond: PROMISE(ID, accepted_ID, accepted_VALUE)
        else
            respond: PROMISE(ID)

Stage 2a: proposer (propose)

Now the proposer checks whether it can use its own proposed value, or whether it must use the value of the highest-numbered proposal among the responses it received:

did I receive PROMISE responses from a majority of acceptors?
if yes
    do any responses contain accepted values (from other proposals)?
    if yes
        val = accepted_VALUE //value in PROMISE with the highest acceptedID
    if no
        val = VALUE     // we can use our proposed value
    send PROPOSE(ID, val) to at least a majority of acceptors

Stage 2b: acceptor (accepted)

Each acceptor receives a PROPOSE(ID, VALUE) message from the proposer. If the ID is at least as large as the largest proposal number it has processed so far, it accepts the proposal and propagates the value to the proposer and all learners.

if (ID >= max_id) // is the ID at least the largest I have seen so far?
    max_id = ID                  // remember it so older proposals are rejected
    proposal_accepted = true     // note that we accepted a proposal
    accepted_ID = ID             // save the accepted proposal number
    accepted_VALUE = VALUE       // save the accepted proposal data
    respond: ACCEPTED(ID, VALUE) to the proposer and all learners
else
    do not respond (or respond with a "fail" message)

If a majority of acceptors accept the proposal, consensus is reached. Note: the consensus is on the proposal value, not the proposal ID.
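The four stages can be put together into a small runnable sketch. It is an illustration under strong assumptions, not a production implementation: everything runs in one process, messages are direct method calls that never get lost, learners are omitted, and the names `Acceptor`, `on_prepare`, `on_accept` and `propose` are chosen for this example. The acceptor follows the "greater than or equal" rule described in phase 2b.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Acceptor:
    max_id: int = 0                    # highest proposal number seen so far
    accepted_id: Optional[int] = None  # number of the proposal we accepted
    accepted_value: Any = None         # value of the proposal we accepted

    def on_prepare(self, n: int) -> Optional[Tuple[Optional[int], Any]]:
        """Stage 1b: promise, or return None to ignore a stale number."""
        if n <= self.max_id:
            return None
        self.max_id = n
        return (self.accepted_id, self.accepted_value)

    def on_accept(self, n: int, value: Any) -> bool:
        """Stage 2b: accept unless we promised a higher number."""
        if n < self.max_id:
            return False
        self.max_id = n
        self.accepted_id = n
        self.accepted_value = value
        return True


def propose(acceptors: List[Acceptor], n: int, value: Any) -> Any:
    """Run one round; return the chosen value, or None if the round fails."""
    majority = len(acceptors) // 2 + 1
    # Stage 1a / 1b: send Prepare and collect promises.
    replies = [a.on_prepare(n) for a in acceptors]
    promises = [r for r in replies if r is not None]
    if len(promises) < majority:
        return None
    # Stage 2a: adopt the value of the highest-numbered accepted proposal.
    prior = [p for p in promises if p[0] is not None]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]
    # Stage 2b: ask the acceptors to accept (n, value).
    accepted = sum(1 for a in acceptors if a.on_accept(n, value))
    return value if accepted >= majority else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "x"))  # first round decides on its own value
print(propose(acceptors, 2, "y"))  # a later round must adopt the earlier value
print(propose(acceptors, 1, "z"))  # a stale proposal number gets no promises
```

The second call shows why the consensus is on the value and not on the ID: the proposer with ID 2 wanted "y" but ends up re-proposing the value that was already accepted.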

Understanding Paxos

Watch video to learn Paxos
Stanford Paxos lecture
Chinese translation of video screenshots

You can use the following figure to help understand
Graphical distributed consistency protocol Paxos