Raft algorithm for leader election


Original address: https://qeesung.github.io/202…
Raft paper address: https://ramcloud.atlassian.ne…

The Raft paper breaks the algorithm into three parts:

  • Leader election
  • Log replication
  • Safety

This article mainly covers leader election.

Node states in Raft

A node in Raft is in one of three states:

  • Leader state: Leader
  • Follower state: Follower
  • Candidate state: Candidate

Each node is a state machine. Raft moves a node from one state to another according to heartbeats, the current term, and other conditions, as shown in the following figure:

[Figure: Raft algorithm for leader election]

First, when the Raft nodes start, they are all in the Follower state. Because there is no Leader at this point, no Follower receives a heartbeat from a Leader, so the Followers become Candidates and start electing a Leader.

When a node is in the Candidate state, it sends a RequestVote request to all other nodes concurrently (details are given in the following sections). From the Candidate state, a node can make one of three state transitions:

  • Start a new round of election: if, within a fixed time after the vote requests are sent, the number of nodes that grant their vote does not reach N/2+1, the election times out and the node enters the next round of election.
  • Win the election and become the new Leader: if the node receives votes from at least N/2+1 nodes (a majority), the election succeeds and the current node becomes the new Leader.
  • Become a Follower: if, during the election, the node receives a heartbeat from a Leader on another node, or a vote response whose Term is greater than its own Term, then a newer term exists (a new Leader or a newer election) and the node becomes a Follower.

If a node wins the election, it becomes the Leader and sends heartbeats to all nodes at a fixed interval. If a heartbeat response carries a Term larger than the current node's Term, the current node becomes a Follower.

For example, suppose the Leader's network is unstable and the node is offline for a period of time. By the time the network recovers, the other nodes will have elected a new Leader. The old Leader was offline during that election and does not know about it, so it still sends heartbeats to the other nodes. They reply with the current, newer Term, and the obsolete Leader becomes a Follower.
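The transitions described above can be summarized as a small lookup table. This is only an illustrative sketch: the names Role, Event, TRANSITIONS, and next_role are hypothetical and do not come from the Raft paper or from any library.

```python
from enum import Enum, auto


class Role(Enum):
    FOLLOWER = auto()
    CANDIDATE = auto()
    LEADER = auto()


class Event(Enum):
    # Hypothetical event names chosen for this sketch.
    ELECTION_TIMEOUT = auto()   # no heartbeat, or no majority, within the timeout
    MAJORITY_VOTES = auto()     # votes granted by at least N/2+1 nodes
    LEADER_HEARTBEAT = auto()   # heartbeat from the Leader of the current term
    HIGHER_TERM_SEEN = auto()   # a request or response carried a larger Term


# (current role, event) -> next role. Pairs that are not listed leave the role unchanged.
TRANSITIONS = {
    (Role.FOLLOWER, Event.ELECTION_TIMEOUT): Role.CANDIDATE,
    (Role.CANDIDATE, Event.ELECTION_TIMEOUT): Role.CANDIDATE,  # next election round
    (Role.CANDIDATE, Event.MAJORITY_VOTES): Role.LEADER,
    (Role.CANDIDATE, Event.LEADER_HEARTBEAT): Role.FOLLOWER,
    (Role.CANDIDATE, Event.HIGHER_TERM_SEEN): Role.FOLLOWER,
    (Role.LEADER, Event.HIGHER_TERM_SEEN): Role.FOLLOWER,
}


def next_role(role: Role, event: Event) -> Role:
    """Return the role a node takes after the given event."""
    return TRANSITIONS.get((role, event), role)
```

For instance, next_role(Role.LEADER, Event.HIGHER_TERM_SEEN) models the obsolete Leader from the example stepping down once it learns of the newer Term.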

Leader election

The whole cluster must be able to elect exactly one Leader even in the presence of packet loss, reordering, delay, and other unstable network conditions.

RequestVote RPC

As mentioned above, if a Follower does not receive a heartbeat within a certain period of time, it switches to the Candidate state and starts a new round of election. During the election it sends a RequestVote RPC to all nodes in the cluster.

RPC request parameters:

  • term: term number of the current candidate
  • candidateId: ID of the candidate
  • lastLogIndex: the last log entry index value of the candidate
  • lastLogTerm: the term number of the candidate’s last log entry

Here, lastLogIndex and lastLogTerm are used to judge whether the candidate's log is at least as new as the receiving server's log (explained later). A server grants its vote only if the candidate's log is at least as new as its own.

RPC response value:

  • term: the current term number of the node that received the request
  • voteGranted: whether the node agrees to vote for the candidate
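As a rough illustration, the request and response could be modeled as two small record types. The class names and the use of a string for the candidate ID are assumptions made for this sketch; only the field meanings come from the lists above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestVoteArgs:
    """Parameters a candidate sends when asking for a vote."""
    term: int            # term number of the candidate
    candidate_id: str    # ID of the candidate (string type is an assumption)
    last_log_index: int  # index of the candidate's last log entry
    last_log_term: int   # term of the candidate's last log entry


@dataclass(frozen=True)
class RequestVoteReply:
    """Answer returned by the node that received the request."""
    term: int            # current term of the responding node
    vote_granted: bool   # True if the node votes for the candidate
```

A request could then be built as RequestVoteArgs(term=3, candidate_id="node-a", last_log_index=7, last_log_term=2).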

Candidate sends the RequestVote RPC

When does a candidate send a RequestVote RPC?

If the Leader fails, then essentially all Followers switch to Candidate at the same time and send RequestVote RPCs at the same time. This can split the votes evenly so that no candidate reaches a majority, and a new round of voting has to be started. To avoid split votes, each node's election timeout is chosen randomly from a fixed interval (e.g. 150-300 ms).
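A minimal sketch of picking such a randomized timeout is shown below. The constant and function names are hypothetical, and the 150-300 ms bounds are simply the example interval from the text, not a recommendation.

```python
import random

# Example interval from the text; real deployments tune these values.
ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300


def random_election_timeout_ms(rng: random.Random) -> float:
    """Pick a fresh election timeout uniformly from the configured interval."""
    return rng.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)
```

Each node would draw a new value every time it resets its election timer, so that after a Leader failure one node usually times out first and can collect votes before the others become candidates.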

How does a candidate send the RequestVote RPC?
  1. Increment the current node's term number
  2. Vote for itself
  3. Reset the election timeout timer
  4. Send a RequestVote RPC to all other servers
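The four steps above can be sketched as follows. The Node class, its fields, and the convention that an empty log reports index 0 and term 0 are assumptions for illustration; the function only builds the requests and leaves the actual network sending to the caller.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    # Hypothetical in-memory node state for this sketch.
    node_id: str
    peers: list[str]
    current_term: int = 0
    voted_for: Optional[str] = None
    role: str = "follower"
    votes_received: set[str] = field(default_factory=set)
    log_terms: list[int] = field(default_factory=list)  # term of each log entry, in order
    timer_resets: int = 0  # stand-in for a real election timer


def start_election(node: Node) -> list[tuple[str, dict]]:
    """Run steps 1-4 and return one (peer, request) pair per other server."""
    node.role = "candidate"
    node.current_term += 1                # step 1: increment the term
    node.voted_for = node.node_id         # step 2: vote for itself
    node.votes_received = {node.node_id}
    node.timer_resets += 1                # step 3: reset the election timer
    request = {
        "term": node.current_term,
        "candidateId": node.node_id,
        "lastLogIndex": len(node.log_terms),  # 0 when the log is empty
        "lastLogTerm": node.log_terms[-1] if node.log_terms else 0,
    }
    # Step 4: one copy of the request for every other server.
    return [(peer, dict(request)) for peer in node.peers]
```

Returning the messages instead of sending them keeps the sketch free of any particular RPC library.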
How does a node handle a RequestVote RPC?
  1. Compare the current Term with the Term in the request parameters:

    • If the current Term is greater than the Term in the request, refuse to vote (set voteGranted to false) and return the current Term
    • If the current Term is smaller than the Term in the request, update the current Term to the Term in the request and switch to the Follower state; if the two are equal, continue without changing state
  2. Check the node's voting state for the current term:

    • If the current node has not voted for any node yet, or has already voted for this same candidate, continue to check whether the logs match (step 3)
    • Otherwise, reject the vote (set voteGranted to false), because a node cannot vote for more than one node in the same term
  3. Check whether the candidate's log is at least as new as the current node's log by comparing lastLogIndex and lastLogTerm with the current node's log, to ensure that the newly elected Leader does not lose committed log entries:

    • If the logs match, that is, the candidate's lastLogTerm equals the term of the current node's last log entry and the candidate's lastLogIndex is at least the current node's last log index, or the candidate's lastLogTerm is higher than the term of the current node's last log entry, then vote for the candidate (set voteGranted to true) and remain a Follower
    • Otherwise, reject the vote (set voteGranted to false)
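The three checks can be put together in one handler, sketched below under a few assumptions. The Voter class and handle_request_vote are hypothetical names, an empty log is treated as having last index 0 and last term 0, and the recorded vote is cleared whenever the Term advances, since a vote only counts within a single term.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Voter:
    # Hypothetical state of the node that receives the request.
    current_term: int = 0
    voted_for: Optional[str] = None
    role: str = "follower"
    log_terms: list[int] = field(default_factory=list)  # term of each log entry, in order


def handle_request_vote(voter: Voter, term: int, candidate_id: str,
                        last_log_index: int, last_log_term: int) -> tuple[int, bool]:
    """Return (term, vote_granted) for one vote request."""
    # Step 1: compare terms.
    if term < voter.current_term:
        return voter.current_term, False
    if term > voter.current_term:
        voter.current_term = term
        voter.voted_for = None      # a new term starts with no vote cast
        voter.role = "follower"
    # Step 2: at most one vote per term.
    if voter.voted_for not in (None, candidate_id):
        return voter.current_term, False
    # Step 3: the candidate's log must be at least as new as ours.
    my_last_index = len(voter.log_terms)
    my_last_term = voter.log_terms[-1] if voter.log_terms else 0
    # Tuple comparison: higher last term wins; on equal terms, the longer log wins.
    if (last_log_term, last_log_index) < (my_last_term, my_last_index):
        return voter.current_term, False
    voter.voted_for = candidate_id
    return voter.current_term, True
```

A real node would also persist current_term and voted_for before replying and reset its election timer when it grants a vote; both are left out here to keep the sketch short.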
How does a candidate handle the responses to its vote requests?

Each candidate sends one round of vote requests per term. If, within the specified time, it receives granting responses from at least N/2+1 nodes (a majority), the vote succeeds and the candidate is promoted to Leader.

Because the network may be unstable during the voting process, vote requests and responses may be lost, reordered, or delayed, so a response may arrive that does not match the candidate's current term. A response that does not match the current term is discarded directly.

The complete processing flow is as follows:

  1. Check whether the Term in the response is larger than the current candidate's Term:

    • If so, another node has started a newer round of election or a new Leader has been elected; the current node switches from Candidate to Follower and updates its Term
    • Otherwise, proceed to step 2
  2. Check whether the Term in the response is equal to the current node's Term:

    • If they are equal, the response belongs to the current round of voting; proceed to step 3
    • Otherwise, this is an expired response to an earlier vote request, and it is discarded directly
  3. Check whether the response grants the vote:

    • If yes, increase the number of votes received for the current term; if the number of nodes that granted their vote has reached N/2+1, switch to Leader
    • If not, one possible reason is that the logs do not match, because a Leader's log must be at least as new as the Follower's
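The processing flow above could look roughly like this in code. The Candidate class and handle_vote_response are hypothetical names; votes are kept in a set keyed by sender, which is a design choice of this sketch so that a duplicated response is not counted twice.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    # Hypothetical candidate state; votes should start with the node's own ID.
    node_id: str
    cluster_size: int
    current_term: int
    role: str = "candidate"
    votes: set[str] = field(default_factory=set)


def handle_vote_response(c: Candidate, sender: str, term: int, vote_granted: bool) -> None:
    """Apply steps 1-3 to a single vote response."""
    # Step 1: a larger Term means this candidate is out of date.
    if term > c.current_term:
        c.current_term = term
        c.role = "follower"
        return
    # Step 2: drop expired responses, and ignore late ones once the election is over.
    if term != c.current_term or c.role != "candidate":
        return
    # Step 3: count the vote and check for a majority.
    if vote_granted:
        c.votes.add(sender)
        if len(c.votes) >= c.cluster_size // 2 + 1:
            c.role = "leader"
```

With five nodes, a candidate that has voted for itself needs two more granting responses in its current term before it switches to Leader.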

Safety of leader election

As the RequestVote processing flow above shows, a Leader has to meet certain requirements: if a candidate does not meet them, other nodes will not vote for it.

What would happen if any node in the cluster could become Leader? Log entries that have already been committed could be overwritten. If the state machine has already applied the overwritten entries, the results become inconsistent. So, for the sake of election safety, Raft adds the following restrictions:

  1. A Leader never overwrites any of its own log entries; Followers replicate strictly from the Leader (forcibly overwriting their own entries if necessary)
  2. When a Leader is elected, the Candidate's log must be at least as new as the voting node's log (“new” is explained below); otherwise, the vote is rejected. Committed log entries must be stored on at least N/2+1 nodes, and winning the vote requires the agreement of at least N/2+1 nodes. Any two such groups share at least one node, so among the voters there must be a node that contains all committed log entries.

Here “new” means: the candidate's last log entry has the same term as the current node's last log entry and the candidate's log is at least as long as the current node's log, or the candidate's last log entry has a higher term than the current node's last log entry.
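Restriction 2 relies on the fact that two groups of N/2+1 nodes always share a member. The brute-force check below illustrates this for small clusters; the function names are hypothetical, and the check is a sanity test rather than a proof.

```python
from itertools import combinations


def majority(n: int) -> int:
    """Smallest group size counted as a majority of n nodes (N/2+1)."""
    return n // 2 + 1


def every_pair_overlaps(n: int, size: int) -> bool:
    """True if any two groups of `size` nodes out of n share at least one node."""
    groups = [set(g) for g in combinations(range(n), size)]
    return all(a & b for a in groups for b in groups)
```

Checking only groups of exactly N/2+1 nodes is enough, because every larger group contains one of them. For a four-node cluster, groups of size 2 can miss each other entirely (for example nodes {0, 1} and {2, 3}), while groups of size majority(4) cannot.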