The Raft algorithm is divided into two stages: leader election and log replication.
A node takes one of three roles: follower, candidate, leader (in that order).
Roles cannot be skipped: a leader can only step down to follower, a follower can only be promoted to candidate, and a candidate can either step down to follower or be elected leader. How do these transitions happen? Do the term and the timeout change along the way?
Suppose the system consists of three nodes, A, B and C, at startup:
1. If the whole system is started fresh, that is, no data was persisted beforehand (which data is persisted?), all three nodes are initialized as followers, and each computes its election timeout (a random value of 150-200ms). When the timeout expires, the node changes its role to candidate and sends a RequestVote RPC. First, the parameters of the vote request:
Term: the initiator's term (under what circumstances is the term incremented?)
Candidate ID: the initiator's ID
LastLogIndex: the index of the latest log entry on the current node (logindex-1), which can be understood as an array subscript
LastLogTerm: the term of that log entry
The reply carries:
Term: the recipient's term. Note that this is the term after the request has been processed. For example, if your term is larger than mine and you ask me for a vote, I return the updated term. I think it is used for the case where a candidate requests a vote from a node that has already voted for someone else and whose term has been updated and is now larger than the later candidate's term.
Vote granted: whether the vote was granted; true means the recipient voted for the candidate.
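As a minimal sketch, the two messages above could be modelled like this. The class and field names are illustrative assumptions, not taken from any particular Raft library, and the example assumes the convention that an empty log has last index 0 and last term 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestVoteArgs:
    term: int            # initiator's (candidate's) current term
    candidate_id: str    # ID of the node asking for the vote
    last_log_index: int  # index of the candidate's last log entry
    last_log_term: int   # term of that last log entry


@dataclass(frozen=True)
class RequestVoteReply:
    term: int            # recipient's term after it processed the request
    vote_granted: bool   # True means the recipient voted for the candidate


# Hypothetical exchange: node A times out in term 1 with an empty log.
args = RequestVoteArgs(term=1, candidate_id="A", last_log_index=0, last_log_term=0)
reply = RequestVoteReply(term=1, vote_granted=True)
```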
When a vote is initiated, there are three possible outcomes:
1. If the candidate receives votes from more than half of the nodes, it wins the election, becomes leader and resets its timeout. (On which occasions is the timeout reset?)
2. If the candidate receives a vote request from another node whose term is larger than its own, the candidate steps down to follower.
3. If no result is reached, for example because of a network failure, or because two nodes sent requests at the same time and neither got more than half of the votes, the candidate tries again once the timeout expires. If an RPC call fails, it resends the request to that node directly.
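The three outcomes can be sketched as a small decision function. This is an illustration under stated assumptions: `election_outcome` and its parameters are hypothetical names, `votes_granted` includes the candidate's own vote, and `highest_term_seen` stands for the largest term observed in any message during the round.

```python
from enum import Enum


class Outcome(Enum):
    BECOME_LEADER = "leader"
    STEP_DOWN = "follower"
    RETRY = "retry after timeout"


def election_outcome(my_term, votes_granted, cluster_size, highest_term_seen):
    """Decide what a candidate does once its election round has played out."""
    if highest_term_seen > my_term:
        return Outcome.STEP_DOWN       # outcome 2: someone has a larger term
    if votes_granted > cluster_size // 2:
        return Outcome.BECOME_LEADER   # outcome 1: more than half of the nodes
    return Outcome.RETRY               # outcome 3: split vote or lost messages
```

In a three-node cluster, two votes (the candidate's own plus one other) are already more than half, so `election_outcome(1, 2, 3, 1)` yields `Outcome.BECOME_LEADER`.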
When a candidate sends a request to another node, what does that node do?
1. There are two situations in which the vote is refused directly:
The initiator's term is smaller than the recipient's own, which means the initiator has fallen behind. This request is rejected directly.
The initiator has the same term as the recipient, but another candidate's request arrived earlier. Because a node can vote for only one candidate per term, the recipient has to refuse.
One more case: if I find that I have already voted for you in this term, I return true again, because the previous reply may have been lost.
2. There are two situations that need further checks:
The terms are equal and I have not voted for anyone else, or the other node's term is larger than mine. In the latter case, whether or not I have voted before, I change my state to follower and reset the timeout.
In both cases, the index and term of the last log entry must be checked further.
First take lastLogIndex and lastLogTerm from the initiator, then look up the term of my own last log entry. If my last log term is larger than his, a newer leader has been elected on my side, so my log is newer and I refuse. If the two terms are equal, the log lengths are compared, and whoever has the longer log is more up to date. If my last log term is smaller than his, a newer leader has been elected on his side, so his log is newer.
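The receiver-side rules above can be put together in one function. The following is a sketch, not a reference implementation: `Node`, `handle_request_vote` and the field names are assumptions made for illustration, log indexes are taken as 1-based with (0, 0) for an empty log, and persistence and timer handling are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    current_term: int = 0
    voted_for: Optional[str] = None
    role: str = "follower"
    log_terms: List[int] = field(default_factory=list)  # term of each entry, in order

    def last_log(self) -> Tuple[int, int]:
        """Return (last_log_index, last_log_term); (0, 0) for an empty log."""
        if not self.log_terms:
            return 0, 0
        return len(self.log_terms), self.log_terms[-1]


def handle_request_vote(node, term, candidate_id, last_log_index, last_log_term):
    """Return (term, vote_granted) following the rules described above."""
    if term < node.current_term:
        return node.current_term, False   # initiator is behind
    if term > node.current_term:
        # A larger term always wins: adopt it and forget any earlier vote.
        node.current_term = term
        node.voted_for = None
        node.role = "follower"
    if node.voted_for == candidate_id:
        return node.current_term, True    # earlier reply may have been lost
    if node.voted_for is not None:
        return node.current_term, False   # only one vote per term
    my_index, my_term = node.last_log()
    # Compare last log term first, then last log index.
    up_to_date = (last_log_term, last_log_index) >= (my_term, my_index)
    if up_to_date:
        node.voted_for = candidate_id
    return node.current_term, up_to_date
```

Comparing the pairs `(term, index)` as tuples gives exactly the ordering in the text: the larger last log term wins, and the index only matters when the terms are equal. Note that a request with a larger term still updates the receiver's term even when the vote is refused because of the log check.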
Theoretical basis: 1. Two entries with the same term and the same index store the same data. 2. All entries preceding an entry with the same term and the same index are identical as well.
Raft ensures that all entries of a given term come from a single leader, and that the leader creates only one log entry per index, whose position never changes.
However, an entry at that position may be overwritten. If it has not been committed and a leader with a larger term is elected, that leader may find the entry invalid and replace it.
That is why lastLogIndex and lastLogTerm are compared.
1. Five nodes A, B, C, D and E; A is elected leader in term 1. A network partition later isolates D and E. They cannot elect a leader, so their term keeps rising. Will they be elected after the network reconnects?
No, as long as A, B and C accepted entries in the meantime: the last log entries of D and E are then older than those of A, B and C, so the majority refuses to vote for them.
2. Same premise. This time A and B are partitioned away and C, D and E elect a new leader. A had accepted many logs, and then A restarts repeatedly, so its own term becomes very large. Now its term is large and it has many logs. Can it be elected? No, for the same reason. All its logs were received in term 1. It cannot have received any in a higher term, because in those higher terms it never became leader, and a non-leader does not accept new logs.
At this time, the main operator, cdash, is synchronized with the main user, but then the other two are synchronized with the main one.
Since B, C and D voted for A the second time, they must know that the latest term is 2. However, index 2 has not been synchronized to all of them. E now stands for election with term 3 and lastLogIndex 1. From the others' point of view, the term at index 1 is also 1, so the last log term is the same and E's log index is equal to theirs, and C, D and E vote for E. B does not vote for E, because E's log is not as up to date as B's.
Then A becomes leader again and continues to replicate log index 2 to the others. At this point it has been copied to A, B and C. Can index 2 be marked as committed?
No. Because if A crashes again, E can be elected again. Why can it be elected again?
Because E's new term is 5 and its lastLogIndex is 2. First, its term is larger. Second, the term of its last log entry is 3, which is larger than the term 2 the others hold at index 2, so B, C and D all vote for E.
E then overwrites index 2 on the other nodes. If the term-2 entry at index 2 had already been committed,
this would result in inconsistent data.
What is the fix? Do not commit it. If the entry is not committed and A crashes, nothing committed is lost, so there is no problem.
If instead A, once elected, receives a new log entry in term 4 and replicates it before crashing, the earlier entry is committed along with it during that replication. At that point, because the logs of A, B and C are newer, E cannot be elected.
- If the leader commits several log entries while a follower is down, that follower might be elected leader after it comes back online and overwrite those entries, which would cause inconsistency.
Raft guarantees, by restricting leader election, that the leader elected for any given term holds all log entries committed in previous terms. The restriction rule is: the RequestVote message sent by a candidate must carry the index and term of its last log entry; the receiver checks that this index and term are at least as up to date as the last entry of its own log, and otherwise refuses to vote. The condition for the previous leader to commit a log entry is that the entry has been replicated to a majority of the cluster, and the condition for a candidate to be elected leader is also a majority of votes. The two majorities must intersect, that is, at least one member both holds the entry and voted for the new leader. The new leader's log is therefore at least as up to date as that member's, so the new leader holds the entry too. This proves that a later leader must hold every entry committed by an earlier leader.
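The intersection argument can be checked by brute force for a small cluster. The sketch below assumes a five-server cluster named S1 to S5 and simply enumerates every possible majority.

```python
from itertools import combinations

servers = ["S1", "S2", "S3", "S4", "S5"]
majority = len(servers) // 2 + 1  # 3 of 5

# Every subset that is exactly a minimal majority.
quorums = [set(q) for q in combinations(servers, majority)]

# Every quorum that stored a committed entry shares at least one server
# with every quorum that could elect a leader.
all_intersect = all(a & b for a in quorums for b in quorums)
print(len(quorums), all_intersect)
```

Larger majorities contain one of these minimal ones, so checking the minimal quorums is enough.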
- Even with the election rule above, consistency is still not guaranteed. That is, after a leader commits a log entry from a previous term, the entry may still be overwritten by a later leader, resulting in inconsistency. As shown in the figure below:
(a) S1 is the leader and partially replicates index 2;
(b) S1 goes down. S5 gets the votes of S3, S4 and itself and is elected the new leader (S2 will not vote for S5 because S2's log is newer than S5's), and writes a new entry at index 2 with term = 3;
(c) S5 goes down, S1 recovers, is elected leader, and continues replicating logs (that is, the index-2 entry from term 2 is copied to S3). At this point the entry (term = 2, index = 2) has been replicated to a majority of servers, but it has not yet been committed;
(d) S1 goes down again, and S5 recovers and is elected leader (with the votes of S2, S3 and S4, because S2, S3 and S4 are at term 4, which is lower than S5's term 5, and their last log entry (term = 2, index = 2) is not newer than the last log entry in S5, so the election can succeed). S5 then overwrites index 2 on the followers with its own index-2 entry from term 3. (Note: the term-2 entry at index 2 had already been replicated to three servers, yet it is still overwritten);
(e) However, if S1 had replicated an entry from its current term (term 4) to a majority before going down, that entry is committed (S5 then cannot win the election, because the logs of S1, S2 and S3 contain a term-4 entry and are newer than S5's). At this point, all preceding entries are committed as well.
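Cases (d) and (e) can be replayed with the log comparison alone. This is a sketch under stated assumptions: the helper names are hypothetical, each log is reduced to the list of entry terms, S1 is treated as down, and the term check is assumed to pass because S5 campaigns with a term larger than everyone else's.

```python
def last_log(log_terms):
    """(term, index) of the last entry; index is 1-based, (0, 0) if empty."""
    return (log_terms[-1], len(log_terms)) if log_terms else (0, 0)


def votes_for(candidate, logs, down=()):
    """Servers that would grant `candidate` a vote based on log freshness only."""
    cand = last_log(logs[candidate])
    return sorted(
        s for s, log in logs.items()
        if s not in down and (s == candidate or cand >= last_log(log))
    )


# Each list holds the term of the entry at index 1, 2, 3, ...
case_d = {"S1": [1, 2], "S2": [1, 2], "S3": [1, 2], "S4": [1], "S5": [1, 3]}
case_e = {"S1": [1, 2, 4], "S2": [1, 2, 4], "S3": [1, 2, 4], "S4": [1], "S5": [1, 3]}

print(votes_for("S5", case_d, down={"S1"}))  # S2, S3, S4 and S5: a majority of 5
print(votes_for("S5", case_e, down={"S1"}))  # only S4 and S5: not a majority
```

In case (d) the term-2 entry at index 2 sits on three servers and S5 still wins, which is why counting replicas of an old-term entry is not enough to call it committed.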
Suppose that in case (c) above, once the log entry with term = 2 and index = 2 has been replicated to a majority, the elected S1 commits it. The later entry with term = 3 and index = 2 would then overwrite it. Two different entries would be committed successively at the same index, which violates State Machine Safety and produces inconsistency. That is, when a new leader is elected, because the members' logs have progressed differently, it will likely continue replicating log entries from previous terms. Even if those entries are replicated to a majority of servers and committed, they may still be overwritten: because entries from an earlier term are older, a server that lacks them can still be elected leader and then overwrite them.
To eliminate this scenario, Raft stipulates that a leader may replicate log entries from previous terms but never commits them on its own initiative. Instead, it commits an entry from its current term, and the entries from previous terms are thereby committed indirectly.
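The commit restriction can be sketched as the check a leader runs before advancing its commit index. The function name, the `match_index` map (highest replicated index per server, leader included) and the 1-based indexing are assumptions made for this illustration.

```python
def advance_commit_index(log_terms, match_index, current_term, commit_index):
    """Return the new commit index for a leader.

    log_terms[i - 1] is the term of the entry at 1-based index i.
    match_index maps every server (leader included) to the highest index
    known to be replicated on it.
    """
    majority = len(match_index) // 2 + 1
    for n in range(len(log_terms), commit_index, -1):
        replicated = sum(1 for m in match_index.values() if m >= n)
        # Only an entry from the leader's own term may be committed by counting.
        if replicated >= majority and log_terms[n - 1] == current_term:
            return n  # committing n also commits everything before it
    return commit_index


# Case (c): the term-2 entry is on three servers, but S1 is in term 4.
stuck = advance_commit_index(
    [1, 2], {"S1": 2, "S2": 2, "S3": 2, "S4": 1, "S5": 1}, 4, 1)

# Case (e): a term-4 entry reaches three servers, so indexes 2 and 3 commit.
moved = advance_commit_index(
    [1, 2, 4], {"S1": 3, "S2": 3, "S3": 3, "S4": 1, "S5": 1}, 4, 1)
```

Here `stuck` stays at 1 and `moved` becomes 3: the old entry at index 2 is committed only as a side effect of committing the current-term entry after it.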