An Analysis of Consensus Protocols: From Logical Clocks to Raft



Over the Spring Festival I had some free time at home and read several papers on consensus protocols. Before reading them I had a number of open questions: how a leader relates to two-phase commit, how ZooKeeper's ZAB protocol differs from Raft, and how Paxos can be used in a real project. I found the answers in these papers. Below I try to explain these protocols in my own words so that they are easier to understand. I still have a few doubts of my own, which I raise along the way, and I welcome discussion. My knowledge is limited, so there will inevitably be mistakes in this article; corrections are welcome.

Logical clock

Logical clocks are not really a consensus protocol. They are an idea put forward by Lamport in 1978 to deal with the fact that the clocks of different machines in a distributed system disagree. On a single machine, we can tell the order of two events by stamping each with the machine time. In a distributed system, however, each machine's clock has some error, so physical clocks cannot reliably tell us the order of two events. In practice, though, we only care about the order of two events when they are related. For example, given two transactions, one modifying rowa and the other modifying rowb, we don't really care which happens first. A logical clock defines the order of two related events, that is, the 'happens before' relation. For unrelated events, logical clocks do not determine an order, so 'happens before' is a partial order.

Figures and examples from this blog

In this figure, the arrows represent messages between processes, and A, B and C are three processes in a distributed system.

The algorithm for logical clocks is simple. Each event corresponds to a Lamport timestamp, with an initial value of 0:

  • If the event occurs inside a node, add 1 to the timestamp.
  • If the event is a send, add 1 to the timestamp and attach the timestamp to the message.
  • If the event is a receive, set the timestamp to max(local timestamp, timestamp in the message) + 1.

This guarantees that, for every matching pair of send and receive events, the timestamp of the send is less than the timestamp of the receive. Two unrelated events, such as A3 and B5, can have the same logical time. Precisely because they are unrelated, we are free to order them any way we like. For example, we can stipulate that when Lamport timestamps are equal, events of process A come before those of process B, which come before those of process C; then we get A3 'happens before' B5. In the physical world B5 obviously happens earlier than A3, but that doesn't matter.
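The three rules and the tie-break can be sketched in a few lines of Python. This is only a minimal illustration: the `Process` class, its method names and `total_order_key` are made up for this example and do not come from Lamport's paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Process:
    """A process with a Lamport clock (hypothetical helper for illustration)."""
    name: str
    clock: int = 0  # the timestamp starts at 0

    def local_event(self) -> int:
        # Rule 1: an internal event adds 1 to the timestamp.
        self.clock += 1
        return self.clock

    def send(self) -> int:
        # Rule 2: a send adds 1; the returned value travels with the message.
        self.clock += 1
        return self.clock

    def receive(self, message_ts: int) -> int:
        # Rule 3: a receive takes max(local, message) + 1.
        self.clock = max(self.clock, message_ts) + 1
        return self.clock


def total_order_key(timestamp: int, process_name: str) -> Tuple[int, str]:
    # Break ties between equal timestamps by process name: A before B before C.
    return (timestamp, process_name)


a, b = Process("A"), Process("B")
a.local_event()                    # A's clock becomes 1
sent_ts = a.send()                 # A's clock becomes 2
b.local_event()                    # B's clock becomes 1
received_ts = b.receive(sent_ts)   # B's clock becomes max(1, 2) + 1 = 3
```

The send always gets a smaller timestamp than the matching receive, and sorting events by `total_order_key` turns the partial order into one arbitrary but consistent total order.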

Logical clocks do not seem to be widely used at present. The one use I know of is Amazon's Dynamo, which uses vector clocks to resolve ordering between multiple versions (if there are other practical applications, please point them out; I may simply be uninformed), while Google's Spanner solves the clock problem with physical atomic clocks instead. But in Lamport's logical clock algorithm we can already see the shadow of a consensus protocol.

Replicated State Machine

Discussions of consensus protocols usually involve replicated state machines, because the two are normally used together to provide high availability and fault tolerance in distributed systems. Many distributed systems use replicated state machines to synchronize data between replicas, such as HDFS, Chubby and ZooKeeper.

A replicated state machine keeps a persistent log on every replica of a distributed system and uses a consensus protocol to ensure that the logs of all replicas are identical. The state machine inside each replica then replays the commands in the log, in log order. As a result, a client reads the same data from any replica. The core of a replicated state machine is the Consensus module in the figure, and that is where Paxos, ZAB, Raft and the other consensus protocols discussed today come in.
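To make the replay idea concrete, here is a minimal sketch that assumes the consensus module has already produced an identical log on every replica. The `StateMachine` class and the `(key, value)` command format are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# A command is a hypothetical "set key = value" operation.
Command = Tuple[str, int]


class StateMachine:
    """Deterministic key-value state machine that replays a log in order."""

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}
        self.last_applied = 0  # number of log entries applied so far

    def apply_log(self, log: List[Command]) -> None:
        # Apply only the entries that have not been applied yet, in log order.
        for key, value in log[self.last_applied:]:
            self.data[key] = value
            self.last_applied += 1


# Assume the consensus protocol has made this log identical on all replicas.
agreed_log: List[Command] = [("x", 1), ("y", 2), ("x", 3)]

replicas = [StateMachine() for _ in range(3)]
for replica in replicas:
    replica.apply_log(list(agreed_log))
```

Because every replica applies the same commands in the same order, all three end up with the same `data`, which is why a read can be served from any of them.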


Paxos

Paxos is the consensus protocol proposed by Lamport in the 1990s. It has always had a reputation for being hard to understand, so in 2001 Lamport published a new paper, Paxos Made Simple, in which he says that Paxos is among the simplest of distributed algorithms and very easy to understand... But the industry still agrees that Paxos is hard to understand. After reading Lamport's paper, I feel that the Paxos protocol itself is quite understandable, apart from the complicated correctness proof. However, Paxos is still too theoretical and far from concrete engineering practice. When I first looked at Paxos I was confused too: it seemed to reach agreement on only a single value, and once agreed the value can never be modified, so how can Paxos be used to implement a replicated state machine? In addition, the outcome of a Paxos round is known only to the Proposer and some of the Acceptors. It becomes much easier to understand if you treat Paxos as a piece of theory rather than as a solution to a practical engineering problem. Lamport's paper gives only a general idea of how to apply Paxos to a state machine, with no concrete implementation logic. Paxos cannot be used directly in a replicated state machine; a lot has to be added to it, which is why Paxos has so many variants.


Basic-Paxos

Basic-Paxos, the Paxos algorithm originally proposed by Lamport, is actually very simple and can be explained in a few sentences. Below I describe it in my own words and then give an example. To understand Paxos, just remember that it can only agree on one value: once a proposal is chosen, the value never changes. In other words, the whole Paxos group accepts only one proposal (or several proposals, all carrying the same value). How to agree on multiple values and so build a replicated state machine is covered in the next section, Multi-Paxos.

Paxos has no concept of a Leader. Apart from the Learner (which only learns the outcome of a proposal, so we can leave it out of the discussion), there are only Proposers and Acceptors, and Paxos allows several Proposers to propose at the same time. A Proposer proposes a value for all Acceptors to agree on. First, in the Prepare phase, the Proposer sends a proposal ID n to every Acceptor (note that in this phase the Proposer does not send the value). If an Acceptor has never received a Prepare with an ID greater than or equal to n, it replies to the Proposer and promises not to accept any proposal whose ID is less than n, and not to respond to any Prepare whose ID is less than or equal to n. If the Acceptor has already promised a proposal ID larger than n, it does not reply. If the Acceptor has previously accepted (that is, completed the second phase for) a proposal with an ID less than n, it returns that proposal's ID and value to the Proposer; otherwise it returns a null value.

Once the Proposer has received replies from more than half of the Acceptors, it can start the second phase, Accept. At this point, however, the value it may propose is restricted. Only if none of the replies contains a previously accepted value is it free to propose a new value; otherwise it must use the value of the proposal with the highest ID among the replies. The Proposer then sends an Accept request with this value and proposal ID n to every Acceptor. Even though the Proposer received the Acceptors' promises earlier, an Acceptor may have promised a proposal with a higher ID before the Accept request arrives, in which case the Accept fails. In other words, because there can be multiple Proposers, the second phase may still be rejected even though the first phase succeeded.
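The two phases described above can be sketched as follows. This is a single-process simulation under simplifying assumptions (no network, no failures, no Learner), not a production implementation; the `Acceptor` class, the `propose` function and the convention that ID 0 means "nothing yet" are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised_id: int = 0                  # highest Prepare ID promised so far
    accepted_id: int = 0                  # ID of the accepted proposal, 0 if none
    accepted_value: Optional[str] = None  # value of the accepted proposal

    def prepare(self, n: int) -> Optional[Tuple[int, Optional[str]]]:
        """Phase 1: promise only if n is higher than anything promised before."""
        if n <= self.promised_id:
            return None  # no reply, which the Proposer treats as a rejection
        self.promised_id = n
        return (self.accepted_id, self.accepted_value)

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher ID has been promised in the meantime."""
        if n < self.promised_id:
            return False
        self.promised_id = n
        self.accepted_id = n
        self.accepted_value = value
        return True


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run both phases; return the chosen value, or None if the round fails."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    promises = [r for r in replies if r is not None]
    if len(promises) < majority:
        return None
    # If any Acceptor already accepted a value, the Proposer must reuse the
    # value of the accepted proposal with the highest ID.
    _, previous_value = max(promises, key=lambda reply: reply[0])
    if previous_value is not None:
        value = previous_value
    accepted = sum(1 for a in acceptors if a.accept(n, value))
    return value if accepted >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "server1")    # chosen: "server1"
second = propose(acceptors, 2, "server2")   # must reuse "server1"
stale = propose(acceptors, 1, "server3")    # ID too low, round fails
```

The second call shows the key property: once a value has been chosen, a later Proposer with a higher ID can only re-propose that same value.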

Let me give you an example from this blog.

Suppose there are three servers: Server1, Server2 and Server3. Each of them wants to use Paxos to get everyone to agree that it is the leader. These servers play the Proposer role, and the value each proposes is its own server name. They need the consent of acceptors 1 to 3. Server2 initiates proposal [1] (that is, its proposal ID is 1), Server1 initiates proposal [2], and Server3 initiates proposal [3].

First, the Prepare stage:

Suppose the message from Server1 arrives at acceptor 1 and acceptor 2 first. Neither has received a request yet, so they accept this one, return [2, null] to Server1, and promise not to accept requests numbered less than 2.

Next, the message from Server2 arrives at acceptor 2 and acceptor 3. Acceptor 3 has not received a request yet, so it returns [1, null] to Server2 and promises not to accept requests numbered less than 1. Acceptor 2 has already accepted Server1's request and promised not to accept requests numbered less than 2, so it rejects Server2's request.

Finally, the message from Server3 arrives at acceptor 2 and acceptor 3. Both have already made a promise, but 3 is larger than the numbers they have promised (2 for acceptor 2, 1 for acceptor 3), so both accept the request and return [3, null] to Server3.

At this point Server2 has not received replies from more than half of the acceptors, so it picks a new number, 4, and sends it to acceptor 2 and acceptor 3. Since 4 is larger than 3, the number they have already promised, both accept the request and return [4, null] to Server2.

Next, we enter the Accept phase.

Server3 has received replies from more than half of the acceptors (2), all with a null value, so Server3 submits the proposal [3, server3].

Server1 also received replies from more than half of the acceptors in the Prepare phase, all with a null value, so Server1 submits the proposal [2, server1].

Server2 also received replies from more than half of the acceptors, all with a null value, so Server2 submits the proposal [4, server2].

Acceptor 1 and acceptor 2 receive Server1's proposal [2, server1]. Acceptor 1 accepts it; acceptor 2 has promised not to accept proposals numbered less than 4, so it rejects it.

Acceptor 2 and acceptor 3 receive Server2's proposal [4, server2], and both accept it.

Acceptor 2 and acceptor 3 receive Server3's proposal [3, server3]. Both have promised not to accept proposals numbered less than 4, so they reject it.

At this point more than half of the acceptors (acceptor 2 and acceptor 3) have accepted the proposal [4, server2]. The learner sees that the proposal has been chosen and learns it, so Server2 becomes the final leader.


Multi-Paxos

As I said above, Paxos is too theoretical to be used directly in a replicated state machine. Broadly, there are three reasons:

  • Paxos can only decide a single value, so it cannot be used directly for continuous log replication.
  • Because there are multiple Proposers, livelock is possible. In the example above, Server2 had to make two proposals before one was finally accepted; in extreme cases it can take many more.
  • The outcome of a proposal may be known to only some of the Acceptors, which does not meet the requirement that every replica of a replicated state machine hold an identical log.

Multi-Paxos exists to solve exactly these three problems so that Paxos can be used in a state machine. The first problem is easy to solve: each index in the log gets its own Paxos instance for its log entry. The second is also simple: do not allow multiple Proposers in a Paxos group. Before writing, first elect a leader using Paxos (as in my example above) and then let only this leader write, which avoids the livelock problem. Moreover, with a single leader we can skip most of the Prepare phases. The leader only needs to run Prepare once after it is elected; as long as no Acceptor has since received a Prepare request from another leader, every write can go straight to Accept. If an Acceptor rejects, a new leader has started writing. To solve the third problem, Multi-Paxos introduces a firstUnchosenIndex on each server, which lets the leader synchronize the chosen values to every Acceptor. With these problems solved, Paxos can be used in practical engineering.
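The firstUnchosenIndex idea can be illustrated with a small sketch. It assumes each log index is decided by its own Paxos instance and models only the catch-up step, not the Accept rounds themselves; the `Replica` class and `sync_follower` function are hypothetical names for this example.

```python
from typing import Dict


class Replica:
    """Per-server log: one slot per index, each decided by its own Paxos instance."""

    def __init__(self) -> None:
        self.chosen: Dict[int, str] = {}  # index -> value known to be chosen

    def first_unchosen_index(self) -> int:
        # Smallest index whose value this server does not yet know to be chosen.
        index = 0
        while index in self.chosen:
            index += 1
        return index


def sync_follower(leader: Replica, follower: Replica) -> int:
    """Leader pushes chosen values until the follower catches up.

    Returns the number of values that had to be sent.
    """
    sent = 0
    while follower.first_unchosen_index() < leader.first_unchosen_index():
        index = follower.first_unchosen_index()
        follower.chosen[index] = leader.chosen[index]
        sent += 1
    return sent


leader = Replica()
leader.chosen = {0: "a", 1: "b", 2: "c"}

lagging = Replica()
lagging.chosen = {0: "a", 2: "c"}   # missed the outcome of index 1

sent_count = sync_follower(leader, lagging)
```

Here the follower reports index 1 as its first unchosen index, so the leader knows exactly which chosen value it still has to deliver.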

Paxos has accumulated many extensions and variants. In fact, ZAB and Raft, which I discuss next, can both be regarded as modifications and variants of Paxos. There is also a widely circulated saying: "There is only one consensus algorithm in the world, and that is Paxos."


ZAB

ZAB, ZooKeeper Atomic Broadcast, is the consensus protocol used in ZooKeeper. It is specific to ZooKeeper: it is tightly coupled to it and has not been split out into an independent library, so it is not widely used outside ZooKeeper. The ZAB paper does, however, give a detailed proof that the protocol strictly meets its consistency requirements.

ZAB was born in 2007, before Raft had been invented. According to the ZAB paper, ZooKeeper did not use Paxos directly because its authors believed Paxos could not meet their requirements. For example, Paxos allows multiple proposers, which can prevent multiple commands submitted by a client from being executed in FIFO order. In addition, some followers may be left with incomplete data during recovery. These claims are about the original Paxos protocol; some variants, such as Multi-Paxos, have since solved these problems. We can only look at this from a historical point of view: because the Paxos of that time did not solve these problems well, the ZooKeeper developers created a new consensus protocol, ZAB.

ZAB is in fact very similar to the later Raft. It has a leader election process, a recovery process, and a two-phase write: the leader first initiates a round of voting and, once more than half of the servers agree, initiates a commit. The epoch number of each leader in ZAB is equivalent to the term in Raft, which I discuss next. The difference is that in ZAB the epoch number and the transaction counter together form the zxid stored in every entry.
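As a small illustration of how an epoch and a counter can be combined into one zxid, the sketch below assumes the 64-bit layout ZooKeeper uses, with the epoch in the high 32 bits and the per-epoch counter in the low 32 bits. That layout is not spelled out in this article, and the function names are made up for the example.

```python
from typing import Tuple

COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch: int, counter: int) -> int:
    # Epoch in the high 32 bits, counter in the low 32 bits (assumed layout).
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def split_zxid(zxid: int) -> Tuple[int, int]:
    # Recover (epoch, counter) from a zxid.
    return zxid >> COUNTER_BITS, zxid & COUNTER_MASK


old_leader_last = make_zxid(epoch=1, counter=99999)
new_leader_first = make_zxid(epoch=2, counter=1)
```

With this encoding, any entry written by a newer leader compares greater than every entry written under an older epoch, which is what makes "largest zxid" a meaningful criterion during election.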

ZAB replicates the log with a two-phase commit. As I understand it, the first phase is a vote: it only needs more than half of the servers to agree, and it does not actually transmit the data to the followers. Its practical effect is to confirm that more than half of the machines are alive and in the same network partition at that moment. In the second phase, commit, the data is sent to every follower, each follower (and the leader) appends it to its log, and the write is complete. If the vote in the first phase succeeds, it does not matter if a follower crashes during the second phase: after it restarts, the leader will make sure the follower's data is brought in line with its own. If the leader crashes during the commit phase and the write has already been committed on at least one follower, that follower will be elected leader because its zxid is the largest, and once elected it will make all followers commit the message. If no follower has committed the message when the leader crashes, the write is considered not to have completed.

Because the log is written only at commit time, ZAB's log needs only append-only capability.

In addition, ZAB supports stale reads from a replica. For a strongly consistent read, you can use a sync read: first issue a dummy write that does not actually write anything; once that operation completes and the local replica has also committed the sync operation, read from the local replica. This guarantees that the read sees all data correctly written before the sync point. In Raft, by contrast, all reads and writes go through the leader.


Raft

Raft is a new consensus protocol proposed at Stanford University in 2014. Its authors argue that a new protocol was needed because Paxos is too hard to understand and is only a theory, far from actual engineering implementation. They make several complaints about Paxos:

  1. Paxos does not require a Leader, and every Proposer can make proposals. Raft separates leader election from agreement from the start of its design, whereas Paxos mixes leader election and the proposal phase together, which makes Paxos harder to understand.
  2. The original Paxos protocol agrees on only a single value, and once that value is chosen it cannot be changed. In real systems, including databases, we need to agree continuously on the values of log entries. So understanding Paxos itself is not enough; Paxos has to be improved and supplemented before it can really be applied in engineering. These supplements are themselves very complex, and although Paxos has been proven correct by Lamport, the improved Paxos-based algorithms such as Multi-Paxos are unproven once the supplements are added.
  3. Paxos provides only a very rough description, so every later improvement of Paxos, and every project that uses Paxos, such as Google's Chubby, ends up with its own implementation to solve specific problems in Paxos. The details of Chubby's implementation, however, are not public. In other words, anyone who wants to use Paxos in a project basically has to customize and implement their own version of the protocol.

Raft's authors therefore had a very clear goal when designing it: make the protocol easier to understand. Whenever there were several alternatives during the design, they chose the one that was easiest to understand. The authors give an example. In Raft's leader election, it would have been possible to attach an ID to each server and have everyone vote for the server with the largest ID; agreement would be reached faster (similar to ZAB). But this scheme adds the concept of a server ID, and when the server with the highest ID crashes, a server with a lower ID has to wait before it can become leader, which hurts availability. So Raft's election uses a very simple scheme: every server sleeps for a random period, and the first server to wake up starts a vote and collects a majority of the votes. In a normal network, the first server to request votes is also the first to receive the other servers' votes, so one round of voting is usually enough to elect the leader. The whole election process is simple and straightforward.

Leader election aside, the design of the whole Raft protocol is also very simple. The interaction between leader and followers uses only two RPCs (leaving aside snapshots and membership changes). One of them, RequestVote, is needed only during leader election. In other words, all data exchange happens through a single RPC, AppendEntries.
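The random-sleep scheme can be sketched as below. The 150 to 300 ms range is only an assumed default for this example, and `pick_timeouts` and `first_to_wake` are hypothetical helper names; the sketch ignores network delay and simply shows why one server usually gets a head start.

```python
import random
from typing import Dict, List


def pick_timeouts(servers: List[str], low_ms: int = 150, high_ms: int = 300,
                  seed: int = 0) -> Dict[str, int]:
    """Each server independently draws a random sleep time in [low_ms, high_ms]."""
    rng = random.Random(seed)
    return {name: rng.randint(low_ms, high_ms) for name in servers}


def first_to_wake(timeouts: Dict[str, int]) -> str:
    """The server with the shortest sleep wakes first and starts the vote."""
    return min(timeouts, key=lambda name: timeouts[name])


# Randomly drawn timeouts for a five-server cluster.
random_timeouts = pick_timeouts(["S1", "S2", "S3", "S4", "S5"])

# A fixed example so the outcome is easy to follow by hand.
fixed_timeouts = {"S1": 212, "S2": 157, "S3": 289}
candidate = first_to_wake(fixed_timeouts)   # S2 has the shortest timeout
```

Because the timeouts are spread over a range, two servers rarely wake at the same moment, so split votes are uncommon and no server ID ranking is needed.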

To understand Raft, we first need to understand the concept of a term. Every leader has its own term, and the term is stored in every log entry to record which leader's term the entry was written in. A term also acts like a lease. If the leader does not send a heartbeat within the specified time (a heartbeat is also an AppendEntries RPC), a follower assumes the leader has failed, takes the highest term it has seen plus one as the new term, and starts an election. If a candidate's term is not high enough, followers vote against it, which ensures that the new leader's term is the highest. If no one gets enough votes within the timeout (which is possible), the follower adds one to the term again and starts a new election, until a leader is elected. The original Raft implementation is written in C++, and this timeout can be set very short, typically tens of ms. So in Raft a failed leader can be detected within tens of ms, and recovery can be very fast. For Raft libraries implemented in Java, such as Ratis, I don't think the timeout can be set this short once GC pauses are taken into account.
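The term comparison during voting can be sketched like this. It covers only the term check and the one-vote-per-term rule described above; the Raft paper additionally requires the candidate's log to be at least as up to date as the voter's, which is deliberately left out here. The `Follower` class and `on_request_vote` method are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Follower:
    current_term: int = 0
    voted_for: Optional[str] = None  # candidate voted for in current_term

    def on_request_vote(self, candidate_id: str, candidate_term: int) -> bool:
        # A candidate with a stale term is rejected outright.
        if candidate_term < self.current_term:
            return False
        # Seeing a higher term moves this server into that term
        # and clears the vote it cast in the old term.
        if candidate_term > self.current_term:
            self.current_term = candidate_term
            self.voted_for = None
        # At most one vote per term (repeating the same vote is allowed).
        if self.voted_for is None or self.voted_for == candidate_id:
            self.voted_for = candidate_id
            return True
        return False


follower = Follower(current_term=3)
stale_vote = follower.on_request_vote("S1", candidate_term=2)    # rejected
first_vote = follower.on_request_vote("S1", candidate_term=4)    # granted
second_vote = follower.on_request_vote("S2", candidate_term=4)   # already voted
next_term_vote = follower.on_request_vote("S2", candidate_term=5)  # granted
```

Since each server grants at most one vote per term, at most one candidate can collect a majority in any given term.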

A write by the leader is also a two-phase commit. The leader first writes the entry at the first empty index in its own log and sends it to every follower through the AppendEntries RPC. If more than half of the servers (including the leader itself) reply true, the leader advances commitIndex by 1 in the next AppendEntries call, which means the entry has been committed. For example, in the following figure the leader writes x = 4 into the entry at index = 8 and sends it to all followers. Once it has received votes from the first server (itself), the third and the fifth (the figure shows no entry at index = 8 for it, but since all its earlier entries match the leader's, it will certainly agree), the leader has a majority. In the next RPC the commit index moves forward one position, meaning all entries with index <= 8 have been committed.

The logs of the second and fourth servers lag well behind. One possible reason is that earlier RPCs failed; the leader retries indefinitely until these followers' logs match its own. Another possibility is that the two servers have restarted and are recovering. When the leader sends these two servers the RPC that writes index = 8, it also sends the term and index of the preceding entry, that is, prevLogIndex = 7 and prevLogTerm = 3. On the second server the entry at index = 7 is empty, so its log is inconsistent with the leader's and it returns false. The leader then keeps stepping backward until it finds an entry that matches the second server's log; from that point on it resends its own log to the follower, which completes the recovery. Raft guarantees that, in the replicated logs of all members, if the entries at a given index have the same term, their contents are identical. If they differ, the leader overwrites the follower's entry at that index to match its own.
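The consistency check and the leader's step-back loop can be sketched as follows. This is a simplified in-memory model with 1-based log indexes; it leaves out terms on the RPC itself, commit index handling and persistence. `Entry`, `append_entries` and `replicate` are hypothetical names, and the commands in the example log are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    term: int
    command: str


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Follower side. prev_index is 1-based; 0 means 'before the first entry'."""
    if prev_index > len(log):
        return False  # the follower has no entry at prev_index
    if prev_index > 0 and log[prev_index - 1].term != prev_term:
        return False  # entry exists but was written in a different term
    for offset, entry in enumerate(entries):
        position = prev_index + offset  # 0-based slot for this entry
        if position < len(log):
            if log[position].term != entry.term:
                del log[position:]      # drop the conflicting suffix
                log.append(entry)
        else:
            log.append(entry)
    return True


def replicate(leader_log: List[Entry], follower_log: List[Entry]) -> int:
    """Leader side: step back until the check passes. Returns RPC attempts."""
    next_index = len(leader_log)  # start by sending only the newest entry
    attempts = 0
    while True:
        attempts += 1
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1].term if prev_index > 0 else 0
        if append_entries(follower_log, prev_index, prev_term,
                          leader_log[prev_index:]):
            return attempts
        next_index -= 1


# Leader has 8 entries; the entry at index 7 was written in term 3.
leader_log = [Entry(t, f"cmd{i + 1}") for i, t in enumerate([1, 1, 1, 2, 3, 3, 3, 3])]
lagging_log = list(leader_log[:5])   # follower only has indexes 1..5
attempts = replicate(leader_log, lagging_log)
```

In this run the first two attempts fail (prevLogIndex 7 and 6 do not exist on the follower), and the third succeeds at prevLogIndex 5, after which the follower's log matches the leader's.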

With the description above I have basically covered Raft's leader election, write path and recovery. From here we can see some very interesting things about Raft.

The first interesting point is that entries in Raft's log can be modified. For example, a follower receives an append request from a leader and writes the value at some index. The leader then crashes, and the newly elected leader may reuse that index, so the follower's entry at that index gets overwritten with new content. As a consequence, Raft's log cannot be implemented on an append-only file or file system. With ZAB and Paxos, by contrast, the log is only ever appended to, so the file system needs only append capability and no random-access modification.

The second interesting point is that, for simplicity, Raft maintains only a single commit index: every entry with an index less than or equal to the commit index is considered committed. As a result, suppose the leader crashes in the middle of a write, before it has received a majority of votes (or even before it has sent the entry to any follower). If the same server is elected leader again after restarting, the value will still end up permanently committed. Because the leader's log contains the value, the leader will make every follower's log match its own, and once a later write advances the commit index, this value is implicitly committed as well.

For example, suppose there are five servers and S2 is the leader. It writes the entry at index = 1 to its own log but crashes before it can send AppendEntries to the other servers.

After S2 restarts, it may still be re-elected leader. If it is, the entry at index = 1 is still copied to every server (but the commit index is not moved forward).

S2 then performs another write, which moves the commit index to 2, so the entry at index = 1 is also considered committed.

This behavior is a bit odd, because it means Raft can end up committing a value that a majority never agreed to when it was first written. Whether it happens depends on who becomes leader. If S2 is not elected leader after restarting in the example above, the entry at index = 1 is overwritten with the new leader's content, and the content that was never voted on is not committed.

Odd as it is, this behavior does not cause any problems, because the leader and followers still end up consistent, and from the client's point of view a leader crash during a write leaves the outcome unknown anyway. The Raft paper also points out that if users want exactly-once semantics, they can attach something like a UUID to each write, and the leader checks whether that UUID has already been written before writing. This guarantees exactly-once semantics to a certain extent.
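The UUID check can be sketched as a state machine that remembers which request IDs it has already applied. This is only an illustration of the deduplication idea under the assumption that clients reuse the same ID when retrying; `DedupCounter` and `increment` are hypothetical names.

```python
import uuid
from typing import Dict, Set


class DedupCounter:
    """State machine that applies each request ID at most once."""

    def __init__(self) -> None:
        self.values: Dict[str, int] = {}
        self.applied_ids: Set[str] = set()

    def increment(self, request_id: str, key: str, delta: int) -> bool:
        # A retried request carries the same ID and is skipped.
        if request_id in self.applied_ids:
            return False
        self.applied_ids.add(request_id)
        self.values[key] = self.values.get(key, 0) + delta
        return True


state = DedupCounter()
request_id = str(uuid.uuid4())   # the client generates one ID per logical write

first_try = state.increment(request_id, "x", 5)   # applied
retry = state.increment(request_id, "x", 5)       # duplicate, ignored
```

If the client never learns whether its first attempt succeeded, it can safely retry with the same ID, and the increment still takes effect only once.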

The Raft paper also compares Raft with ZAB. It says one drawback of ZAB is that leaders and followers need to exchange data back and forth during recovery. I don't quite understand this: as I understand it, when ZAB re-elects a leader it picks the server with the largest zxid, and the other followers then fill in their missing data from that leader. I don't see where the leader would need to fetch data from followers.

Afterword

Improved versions of Paxos are now used in many distributed products, such as Chubby, PaxosStore, Alibaba Cloud's X-DB and Ant Financial's OceanBase. They all chose Paxos but added their own supplements and improvements. As for Raft, it is generally believed that Raft can only commit entries sequentially and so does not perform as well as Paxos. But TiKV uses Raft, and its published material says it has made many optimizations that give Raft very good performance. PolarDB, another Alibaba database, uses an improved version called ParallelRaft that lets Raft commit in parallel. I expect more Paxos/Raft-based products to appear in the future, along with more improvements to both protocols.


References

  1. "Time, Clocks, and the Ordering of Events in a Distributed System"
  2. "Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial"
  3. "Paxos Made Simple"
  4. "Paxos Made Live: An Engineering Perspective"
  5. Multi-Paxos (slides from Stanford University)
  6. "Zab: High-performance broadcast for primary-backup systems"
  7. "In Search of an Understandable Consensus Algorithm" (Raft)

Author: Zhengyan


This article is original content from the Yunqi Community and may not be reproduced without permission.
