Consensus problem


This article covers the concepts, algorithms, and applications of distributed consensus: Raft, Paxos (Basic, Multi, Fast), zk, etcd, and Chubby, along with some reflections. How the zk, etcd, and Chubby interfaces are applied to service discovery, distributed locks, and so on is omitted.

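Consensus: one or more nodes propose values and the algorithm chooses one of them. It must satisfy four properties: agreement (all nodes decide the same value), integrity (once a node has decided, it does not change its decision), validity (the decided value is one that some node proposed), and termination (every node that does not crash eventually decides; a crashed node is assumed not to come back).
Paxos satisfies these properties as stated, while Raft and ZAB also consider nodes that go down and come back, relying on a persistent log. Consensus can solve the following problems from the previous article (…):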
Total order broadcast is equivalent to repeated rounds of consensus: Raft and ZAB implement total order broadcast directly, and Multi-Paxos also has a leader, which only decides the order of values. With a fixed leader, every write goes to the leader, the leader maintains the total order, and the other nodes synchronize from it, so there are fewer conflicts (without a leader, Paxos needs one round to learn the highest number, another to vote, and then one more prepare). Drawing the leader and leaderless cases as diagrams with sequence numbers makes the role of the leader easier to understand. A single leader is, however, a bottleneck.
Single-leader consensus: 1. elect a leader; 2. vote on the leader's proposals (this guards against step 1 leaving a node that wrongly believes it is the leader). Voting is synchronous, dynamic membership changes are difficult, and node failure is detected by timeouts, so if only one particular network link is unreliable the cluster can fall into frequent leader changes. A node that is no longer the real leader cannot get its decisions accepted, which prevents a multi-leader situation when it comes back.
Disadvantages: getting a majority to agree is slow, and dynamic scaling is hard in the basic algorithms; in etcd, for example, it is done manually.

Consensus algorithm


  • Consensus process
    Data consistency is achieved through log replication. The client sends requests to the leader (writes go only to the leader; followers are backups used for recovery). The leader appends the entry to its log and replicates it to the followers. Once a majority of nodes have written the entry to their logs and replied, the leader commits it, returns a confirmation to the client, and tells the followers to commit; the followers commit and send an acknowledgement back to the leader. All of these messages are carried on heartbeats. All communication between servers in Raft is by RPC, and there are only two RPC types: RequestVote, which is used to elect a leader, and AppendEntries, which replicates log entries and also serves as the heartbeat. The log and the voting state must be written to disk continuously so that a node restarts correctly after a crash.
  • Leader election
    Roles: leader (identified by a term field), candidate, follower. Each node has a randomized election timeout between T and 2T. The leader stays in contact with the followers through heartbeats. When a follower's election timeout expires without a heartbeat from the leader, it becomes a candidate and votes for itself. A node can cast only one vote per term, and it refuses to vote if its own log is more up to date than that of the requesting candidate. If a round of voting ends with several candidates and no winner, each candidate picks a new random timeout and tries again.
  • Leader synchronizes its log to followers
    In the consensus process above, once the leader has committed an entry it keeps retrying the RPC to each follower until the request succeeds; even if a follower is down, the leader keeps sending until it succeeds. What happens if the leader itself goes down before the followers have caught up? 1. A leader's log is append-only, so an election picks the candidate with the larger term and the longer log. 2. The new leader copies its log to the other machines. An entry counts as committed once it is on a majority of machines and an entry from the leader's current term has also been replicated to a majority (entries from earlier terms are not committed again on their own); followers that are behind are brought up to date as part of replicating the new term's entries.
  • Log consistency
    Each log entry is identified by term + index. Every AppendEntries request carries the term and index of the entry that precedes the new ones. The follower checks whether its own log contains the same entry and accepts the new entries only if it does, which keeps the logs of all machines consistent.
    To restore log consistency, the leader keeps a state variable for every follower in the cluster, nextIndex: 1) nextIndex is the index of the next log entry the leader will send to that follower; 2) when a leader takes office, nextIndex is initialized to (1 + the leader's last index).
    When the leader sees a request rejected, its reaction is very simple: decrement nextIndex by 1 and try again.
  • Need to persist term and vote
    The current term must be saved to disk.
    A server can cast only one vote in a given term; once it has voted for a candidate, it must reject vote requests from other candidates. The server does not care who is asking; it simply votes for the candidate that asks first. To guarantee this, the vote must be persisted to disk, so that even if the server crashes right after voting and restarts immediately, it will not vote for a second candidate in the same term.
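
As a rough illustration of the rules above (one vote per term, the log up-to-date check, the AppendEntries consistency check, and the nextIndex back-off), here is a minimal in-memory Python sketch. The names `Node`, `request_vote`, `append_entries`, `sync_follower`, and the no-op `persist` are hypothetical and not taken from any Raft implementation; a real node would write term, vote, and log to disk in `persist`, talk over RPC, and truncate its log only on an actual conflict.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Entry = Tuple[int, str]  # (term, command); the entry's index is its position + 1


@dataclass
class Node:
    current_term: int = 0
    voted_for: Optional[str] = None
    log: List[Entry] = field(default_factory=list)

    def persist(self) -> None:
        """Placeholder (assumption): a real node fsyncs term, vote and log here."""

    def last_log(self) -> Tuple[int, int]:
        # (term, index) of the last entry, or (0, 0) for an empty log
        return (self.log[-1][0], len(self.log)) if self.log else (0, 0)

    def request_vote(self, term: int, candidate: str,
                     last_term: int, last_index: int) -> bool:
        if term < self.current_term:
            return False
        if term > self.current_term:  # a newer term resets the vote
            self.current_term, self.voted_for = term, None
        # Refuse if our own log is more up to date than the candidate's.
        up_to_date = (last_term, last_index) >= self.last_log()
        granted = self.voted_for in (None, candidate) and up_to_date
        if granted:
            self.voted_for = candidate
        self.persist()  # must reach disk before the reply is sent
        return granted

    def append_entries(self, term: int, prev_index: int, prev_term: int,
                       entries: List[Entry]) -> bool:
        if term < self.current_term:
            return False
        if term > self.current_term:
            self.current_term, self.voted_for = term, None
        # Consistency check: we must hold the entry that precedes the new ones.
        if prev_index > 0 and (len(self.log) < prev_index
                               or self.log[prev_index - 1][0] != prev_term):
            return False
        # Simplification: always replace the suffix after prev_index.
        self.log = self.log[:prev_index] + list(entries)
        self.persist()
        return True


def sync_follower(leader: Node, follower: Node) -> int:
    """Back nextIndex off by one until the follower accepts; return that index."""
    next_index = leader.last_log()[1] + 1
    while True:
        prev_index = next_index - 1
        prev_term = leader.log[prev_index - 1][0] if prev_index > 0 else 0
        if follower.append_entries(leader.current_term, prev_index,
                                   prev_term, leader.log[prev_index:]):
            return next_index
        if next_index <= 1:
            raise RuntimeError("follower rejected the leader's term")
        next_index -= 1
```

With a leader log of terms 1, 1, 3 and a follower log of terms 1, 2, 2, `sync_follower` backs off to index 2, the first point where the logs agree, and overwrites the follower's divergent suffix.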


  • basic paxos
    The first phase collects the latest (highest) N and its value V, and the second phase issues the proposal:
    [Figure: the two phases of Basic Paxos]
    In the figure, the proposer in effect acts as the leader. The roles in the algorithm are normally proposer, leader, acceptor, and learner (ignored for now); the algorithm determines the proposer's proposal number and whether the proposal succeeds. Each proposal goes to the leader first (optional, not essential), and the leader sends it to the acceptors. If there is no conflict, the acceptors report no earlier value and any value may be proposed; otherwise they return the value already chosen and the process continues with that value.
    Problem: multiple proposers can livelock, each raising N in turn so that no proposal is ever accepted:

    [Figure: competing proposers raising N in turn]

    The above one is…. To make it easier to understand, implementation details have been removed. In a real application the client does not itself handle a conflict by incrementing N and voting again, nor does it send results to the learners; that is the job of another role. In Basic Paxos this role is a group of coordinators C, which can be the same set as the acceptors or a subset of them. In each round a random member of C acts as leader, responsible for collecting the results of the round and notifying the learners. The flow is: proposer -> leader (each client may pick a node at random as the leader of this round) -> prepare -> acceptors return the highest N and its value V -> accept request with N -> acceptors -> leader -> reply to proposer -> the client fails or votes again -> learners are notified after a successful vote. Meanwhile, another client (CLIENT2) may send its request through a different leader.
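
The two phases can be sketched in a few lines of Python. This is a minimal single-process sketch, assuming the proposer also plays the leader and talks to the acceptors by direct method calls; the names `Acceptor`, `prepare`, `accept`, and `propose` are illustrative, and persistence, learners, and retries with a higher N are left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised_n: int = 0            # highest N this acceptor has promised
    accepted_n: int = 0            # N of the value it last accepted
    accepted_v: Optional[str] = None

    def prepare(self, n: int) -> Optional[Tuple[int, Optional[str]]]:
        """Phase 1: promise to ignore proposals below n, report any accepted value."""
        if n > self.promised_n:
            self.promised_n = n
            return (self.accepted_n, self.accepted_v)
        return None

    def accept(self, n: int, v: str) -> bool:
        """Phase 2: accept unless a higher N has been promised in the meantime."""
        if n >= self.promised_n:
            self.promised_n = self.accepted_n = n
            self.accepted_v = v
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run one round with number n; return the chosen value or None on failure."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None                # someone holds a higher N: retry with a larger n
    # If any acceptor already accepted a value, adopt the one with the highest N.
    _, prev_v = max(promises, key=lambda p: p[0])
    if prev_v is not None:
        value = prev_v
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None
```

With three fresh acceptors, `propose(acceptors, 1, "x")` chooses "x"; a later `propose(acceptors, 2, "y")` also returns "x", because the second proposer must adopt the value that was already accepted, and a late `propose(acceptors, 1, "z")` fails because a higher N has been promised.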

  • fast paxos
    If proposers, acceptors, leaders, and learners are all distributed and persist their state, the cost of persistence plus the message round trips becomes much higher.
    If the leader finds no conflict, it stops participating: proposals are submitted directly to the acceptors (within a round, only the first to arrive is accepted), and the acceptors send the result directly to the learners. This can be understood as an optimistic-locking approach. Learners and clients each determine the outcome by themselves.
    If a proposal does not succeed (acceptors vote for whichever arrives first, and no value gets more than half), there are two options. 1. Bring the leader back in: acceptors send their votes asynchronously to the coordinator, the coordinator chooses a value (because each acceptor votes only once), and sends out the result. 2. No leader: after voting, each acceptor sends its vote to all acceptors, and the other acceptors can vote in round i + 1 after receiving these messages (even when the votes are split in half a winner can still be determined, but the comparison is complex and depends on the acceptor set of each proposal; see the paper for this part).…
  • multi-paxos
    When the leader is stable, the prepare phase can be skipped. Put simply, a sequence number is used to tell whether the leader is stable (the same approach as Raft and zk: consensus is turned into a totally ordered sequence). A stable leader sends each update with its sequence number directly to the acceptors, and the acceptors must record the sequence number. If an acceptor finds that it has already seen a larger number, it returns false and the leader has to run prepare again. Because prepare was skipped, the leader does not learn the largest N each time, so this check is how it finds out whether it is still the stable leader appending to the total order. When prepare is run, this check is not required.
    The specific procedure is as follows:

    [Figure: Multi-Paxos procedure]

Chubby is a typical application of the Multi-Paxos algorithm. While the master runs stably, it only needs to use the same proposal number to execute the Promise -> Accept phase of each instance in turn.
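
The idea of skipping prepare while the leader is stable can be sketched as follows. This is a simplified Python sketch under several assumptions: `SlotAcceptor` and `Leader` are hypothetical names, messages are direct method calls, and a newly prepared leader simply continues after the highest slot it sees instead of re-proposing earlier values or filling holes.

```python
from typing import Dict, List, Optional, Tuple


class SlotAcceptor:
    def __init__(self) -> None:
        self.promised = 0                               # highest ballot promised
        self.accepted: Dict[int, Tuple[int, str]] = {}  # slot -> (ballot, value)

    def prepare(self, ballot: int) -> Optional[Dict[int, Tuple[int, str]]]:
        if ballot > self.promised:
            self.promised = ballot
            return dict(self.accepted)
        return None

    def accept(self, ballot: int, slot: int, value: str) -> bool:
        if ballot < self.promised:
            return False                 # a newer leader exists: "return false"
        self.promised = ballot
        self.accepted[slot] = (ballot, value)
        return True


class Leader:
    def __init__(self, ballot: int, acceptors: List[SlotAcceptor]) -> None:
        self.ballot, self.acceptors = ballot, acceptors
        self.next_slot = 1
        self.stable = False
        self.prepares_sent = 0           # only here to show how often prepare runs

    def _prepare(self) -> bool:
        self.prepares_sent += 1
        replies = [a.prepare(self.ballot) for a in self.acceptors]
        granted = [r for r in replies if r is not None]
        self.stable = len(granted) > len(self.acceptors) // 2
        if self.stable:
            seen = [slot for r in granted for slot in r]
            self.next_slot = max(seen, default=0) + 1
        return self.stable

    def submit(self, value: str) -> Optional[int]:
        """Return the slot the value was chosen in, or None if leadership was lost."""
        if not self.stable and not self._prepare():
            return None
        slot = self.next_slot
        acks = sum(a.accept(self.ballot, slot, value) for a in self.acceptors)
        if acks <= len(self.acceptors) // 2:
            self.stable = False          # rejected: must prepare again next time
            return None
        self.next_slot += 1
        return slot
```

A stable leader that submits three values in a row runs prepare only once. When a second leader with a higher ballot takes over, the first leader's next accept is rejected and it falls back to prepare.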

Raft / Paxos / ZAB / Multi-Paxos distinctions

  • Raft needs a leader. During an election each follower can vote only once; if the election fails, candidates retry after a random timeout. When there is a leader, consensus follows the leader's log numbering and the followers simply replicate it, which keeps the system stable and lets a follower take over as leader.
  • Multi-Paxos is similar to Raft in that, once prepare is removed, it is log replication. The difference is that Multi-Paxos allows any acceptor to be promoted to leader, while Raft is very strict: for example, only a node with the most complete log can be elected. After prepare, a Multi-Paxos leader knows the largest index so far and fills the holes left by earlier entries asynchronously. Raft considered hole filling too complicated and instead accepted a more complex election.
  • In Paxos the leader is not so important (Fast Paxos does not even involve the leader when there is no conflict). It can be chosen at random each time and merely tallies the votes. In Fast Paxos, when handling a conflict, each acceptor can update its vote and vote again (the conflict resolution, including whether to vote at all, follows complex logic over the acceptor sets; in zk it is based on the number of votes already collected).
  • ZAB still has a leader. When there is no leader, ZAB's election algorithm is similar to Fast Paxos: each node already has its maximum zxid (similar to the prepare phase, except that it was saved earlier), sends its first choice directly to the acceptors, and conflicts are handled without a coordinator. When there is a leader, it uses the Paxos idea of collecting and synchronizing information to ensure consistency. The leader handles writes and replies only after a majority of nodes have succeeded.
  • The strength of Paxos is that it tolerates the failure of individual voters. A single voter can cast only one vote at a time.


ZK is positioned as a distributed coordination service
The following introduces the common functions, architecture, and consensus process of zk.


Its data is organized like a file system: each node is a znode, and intermediate nodes mainly serve as path components of the namespace. The data is held in memory with a log recorded on disk; each replica contains the complete in-memory data and the log. Each znode maintains the node's version, zxid, and other metadata.
ZooKeeper's design of QuorumPeer, which represents each node, is quite flexible. QuorumPeer consists of four components: ServerCnxnFactory, ZKDatabase, Election, and Leader/Follower/Observer.


1. Reliable, consistent storage of metadata in a single zk cluster; the data is stored in all zk replicas (the data set is small enough to be held entirely in memory)
Examples: routing, database selection, scheduler
2. Locks within a single zk cluster, with fencing tokens (obtained with the lock, or the zxid)
3. Change notification: each change is sent to all nodes
Examples: the watch mechanism
4. Failure detection and service discovery
Each ZooKeeper client is configured with the list of servers in the ensemble. At startup, the client tries to connect to a server in the list. If a connection fails, it tries another server, and so on, until it either establishes a connection with a server or fails because all ZooKeeper servers are unavailable.
Whenever a session has been idle for more than a certain period, the client sends a ping request (also known as a heartbeat) to keep the session from expiring. Ping requests are sent automatically by ZooKeeper's client library, so our code does not need to deal with keeping the session alive. This period should be short enough to detect a server failure (which shows up as a read timeout) and reconnect to another server within the session timeout.
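
The connection failover described above amounts to a simple loop. The following Python sketch is illustrative only: `connect_any` is a hypothetical helper, not part of any ZooKeeper client library, and the `connect` callable stands in for whatever actually opens a session.

```python
from typing import Callable, Dict, Iterable, Tuple, TypeVar

S = TypeVar("S")


def connect_any(servers: Iterable[str],
                connect: Callable[[str], S]) -> Tuple[str, S]:
    """Try each server in turn; return (address, session) for the first success."""
    errors: Dict[str, OSError] = {}
    for addr in servers:
        try:
            return addr, connect(addr)
        except OSError as exc:       # refused, timed out, unreachable, ...
            errors[addr] = exc
    raise ConnectionError(f"all ZooKeeper servers unavailable: {sorted(errors)}")
```

The same loop would be run again when a heartbeat times out, as long as the session timeout has not yet elapsed.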

Zookeeper data synchronization process

A monotonically increasing transaction ID (zxid) identifies each transaction. Every proposal carries a zxid. In the implementation, zxid is a 64-bit number. Its high 32 bits are the epoch, which identifies whether leadership has changed: every time a leader is elected it gets a new epoch, which identifies that leader's period of rule. The low 32 bits are an incrementing counter.
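
The layout of the zxid (epoch in the upper 32 bits, counter in the lower 32) can be written as a few bit operations. The helper names below are made up for illustration and are not ZooKeeper APIs.

```python
EPOCH_SHIFT = 32
COUNTER_MASK = (1 << 32) - 1


def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch and a per-epoch counter into one 64-bit zxid."""
    if not 0 <= counter <= COUNTER_MASK:
        raise ValueError("counter does not fit in 32 bits")
    return (epoch << EPOCH_SHIFT) | counter


def epoch_of(zxid: int) -> int:
    return zxid >> EPOCH_SHIFT


def counter_of(zxid: int) -> int:
    return zxid & COUNTER_MASK
```

Because the epoch occupies the upper bits, any zxid from a newer epoch compares greater than every zxid from an older one, so plain integer comparison gives the transaction order.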

  • zab protocol

    Leader election
        During leader election, electionEpoch increases, and the larger a node's lastProcessedZxid, the more likely it is to become leader.
        First: the leader collects each follower's lastProcessedZxid, mainly to determine, by comparing it with the leader's own lastProcessedZxid, the range of data the follower needs to synchronize.
        Second: a new peerEpoch is chosen, mainly to prevent the old leader from committing operations (when the old leader sends a command to a follower, the follower finds that the epoch in the zxid is smaller than the current peerEpoch and rejects it outright, preventing inconsistency).
        The follower's transaction log is made consistent with the leader's based on the lastProcessedZxid of each. If the follower has more than the leader, the extra part is deleted; if it has less, the missing part is added. If the two logs diverge, the follower deletes the divergent zxid and everything after it, and then synchronizes the data after that point from the leader.
        Normal processing of client requests: the leader turns the client's transaction request into a proposal and sends it to all followers. Once more than half of the followers reply OK, the leader commits the proposal, sends a commit request to all followers, and at the same time returns OK to the client.
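
The broadcast step and the rejection of a stale leader can be illustrated with a toy model. In this Python sketch the class `Follower` and the function `broadcast` are hypothetical, messages are plain method calls, and the quorum is counted over the whole ensemble with the leader counting itself, which is an assumption of the sketch.

```python
from typing import List, Tuple


class Follower:
    def __init__(self) -> None:
        self.current_epoch = 0
        self.pending: List[Tuple[int, str]] = []
        self.committed: List[Tuple[int, str]] = []

    def new_epoch(self, epoch: int) -> None:
        """Called when a new leader has been established."""
        self.current_epoch = max(self.current_epoch, epoch)

    def propose(self, zxid: int, data: str) -> bool:
        # The high 32 bits of the zxid are the proposing leader's epoch.
        if (zxid >> 32) < self.current_epoch:
            return False               # proposal from an old leader: reject
        self.pending.append((zxid, data))
        return True

    def commit(self, zxid: int) -> None:
        for item in list(self.pending):
            if item[0] == zxid:
                self.pending.remove(item)
                self.committed.append(item)


def broadcast(epoch: int, counter: int, data: str,
              followers: List[Follower]) -> bool:
    """Propose to all followers; commit once a majority of the ensemble acked."""
    zxid = (epoch << 32) | counter
    acked = [f for f in followers if f.propose(zxid, data)]
    ensemble = len(followers) + 1      # followers plus the leader itself
    if 1 + len(acked) <= ensemble // 2:
        return False
    for f in acked:
        f.commit(zxid)
    return True
```

After the followers have moved to epoch 2, a `broadcast` from the epoch-1 leader is rejected by every follower and fails, while the epoch-2 leader's proposal is acknowledged and committed.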

In fact, the algorithm in ZooKeeper has three phases: FLE => Recovery => Broadcast

  • fast leader election
    Based on Fast Paxos. Votes are sent to all nodes; no randomly chosen leader takes part in collecting them.

    LOOKING: Entering Leader Election
    FOLLOWING: Leader election is over, enter follower status
    LEADING: The leader election is over and in the leader state
    OBSERVING: In the Observer State
    1. serverA first increments its electionEpoch and then votes for itself.
    2. serverB receives this notification and compares the votes (PK).
    If the electionEpoch in the notification received by serverB is larger than its own, serverB updates its electionEpoch to serverA's electionEpoch.
    If the electionEpoch in the notification received by serverB is smaller than its own, serverB sends a notification to serverA containing serverB's own vote and electionEpoch, and serverA updates its electionEpoch on receipt.
    Once the electionEpochs agree, the votes are compared: first by proposedEpoch, then by proposedZxid, and finally by proposedLeader.
    After the comparison, if this server's vote loses, it updates its vote to the other party's vote and re-sends it to all servers. If this server's vote is not beaten, it looks at the sender's state: if the sender is LOOKING, it checks whether the vote now has more than half of the servers; if the sender is FOLLOWING/LEADING, this server is lagging behind an election that has already finished, and it converges quickly to the existing result.
  • Recovery
  • Follower reading and writing process diagram:

[Figure: follower read and write flow]
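
The vote comparison described under fast leader election (proposedEpoch first, then proposedZxid, then proposedLeader) maps directly onto tuple comparison. The Python sketch below uses hypothetical names (`Vote`, `Server`, `on_notification`) and leaves out the vote counting and the LOOKING/FOLLOWING/LEADING handling; it only shows how one notification is processed.

```python
from typing import NamedTuple


class Vote(NamedTuple):
    proposed_epoch: int
    proposed_zxid: int
    proposed_leader: int   # server id, the final tie-breaker


class Server:
    def __init__(self, sid: int, epoch: int, last_zxid: int) -> None:
        self.election_epoch = 0
        self.vote = Vote(epoch, last_zxid, sid)   # initially vote for itself

    def start_election(self) -> None:
        self.election_epoch += 1

    def on_notification(self, n_election_epoch: int, n_vote: Vote) -> bool:
        """Process one notification; True means our vote changed and must be re-sent."""
        if n_election_epoch < self.election_epoch:
            # Sender is behind: the caller should reply with our vote and epoch.
            return False
        if n_election_epoch > self.election_epoch:
            self.election_epoch = n_election_epoch
        # NamedTuples compare field by field, in the order epoch, zxid, leader id.
        if n_vote > self.vote:
            self.vote = n_vote
            return True
        return False
```

For example, a server whose last zxid is 90 switches its vote to a peer whose last zxid is 100 in the same epoch, while two servers with identical epoch and zxid are separated by the larger server id.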


etcd is often used for configuration sharing and service discovery, and is simpler than zk. It is written in Go and easy to deploy, uses HTTP as its interface, and uses the Raft algorithm to guarantee strong consistency, which makes it easy for users to understand. No client needs to be installed. It provides a key-value store (up to a few GB of data with consistent ordering and linearizable reads), watch, lease, lock, and election. Because its consensus is a complete implementation of Raft, we only briefly cover deployment modes, node composition, data persistence, and so on.

Architecture diagram:

A single node looks as follows:
[Figure: architecture of a single etcd node]
Store: provides the API to users
A cluster distinguishes proxy, leader, and follower nodes

  • Startup
    There are three ways to start an etcd cluster: static configuration, etcd's own discovery service, and discovery through DNS.
    Self-discovery:
    A node first builds a cluster containing only its own URL and then, according to its startup parameters, enters the JoinCluster function in discovery/discovery.go. The token address of the etcd used for discovery is known beforehand and contains the cluster size. The process is essentially one of repeated watching and waiting. The node first registers its own information under the token directory of that etcd and then watches the number of nodes in the token directory. If the number has not reached the required size, it keeps waiting; once the number is reached, discovery ends and the normal startup process begins.
  • Function
    Proxy, leader, follower. A proxy is only responsible for forwarding. The etcd proxy supports two modes of operation, readwrite and readonly. The default is readwrite, in which the proxy forwards all read and write requests to the etcd cluster; in readonly mode it forwards only read requests, and write requests return HTTP 501 errors. Using proxies keeps the number of voting members small. A write returns success only after the followers have synchronized the data, and because a failed node does not recover automatically but has to be restored by the administrator, the cluster cannot have many voting nodes. When a regular node fails, the administrator can handle it manually; the proxy thus also serves as a kind of backup.
    etcd can proxy requests to the leader node, so if you can reach any etcd node you can read and write the whole cluster regardless of the network topology; otherwise you would have to connect to the leader.
    There is no automatic failure recovery (automatic recovery is error-prone, and given the high availability, administrators have time to recover nodes themselves).
  • Data persistence: WAL + snapshot (after which the old WAL is deleted)
    On startup, the cluster configuration is read from the snapshot, including the token, information about the other nodes, and so on. Then the contents of the WAL directory are loaded and sorted in ascending order. Using the term and index obtained from the snapshot, etcd finds the WAL record immediately following the snapshot and replays forward from there until all WAL entries have been traversed; the Entry records are stored in the in-memory ents variable. The WAL then switches to append mode (reading and appending are mutually exclusive), ready for new entries to be added.
    When the number of entries in the WAL reaches the configured value (default 10000), the WAL is segmented and a snapshot is taken at the same time; this can be seen in the snapshot function of etcdserver/server.go. So in effect only one snapshot and one WAL file are in use in the data directory, and by default etcd keeps five historical files of each.
  • Data KV
    An in-memory B-tree index and an on-disk B+ tree; historical versions are stored, and historical versions that are no longer referenced are compacted and deleted.
  • Application:…
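
The snapshot-plus-WAL recovery described under Data persistence can be sketched as follows. This is a simplified Python model, not etcd's code: `Entry`, `Snapshot`, `recover`, and `maybe_snapshot` are hypothetical names, each WAL entry is assumed to be a single key-value write, and only the default threshold of 10000 comes from the text above.

```python
from typing import Dict, List, NamedTuple, Tuple


class Entry(NamedTuple):
    term: int
    index: int
    key: str
    value: str


class Snapshot(NamedTuple):
    term: int
    index: int              # index of the last entry included in the snapshot
    data: Dict[str, str]


def recover(snapshot: Snapshot,
            wal: List[Entry]) -> Tuple[Dict[str, str], List[Entry]]:
    """Rebuild state: start from the snapshot, replay only the newer WAL entries."""
    state = dict(snapshot.data)
    ents = [e for e in sorted(wal, key=lambda e: e.index)
            if e.index > snapshot.index]
    for e in ents:
        state[e.key] = e.value
    return state, ents


def maybe_snapshot(state: Dict[str, str], ents: List[Entry], snapshot: Snapshot,
                   snap_count: int = 10000) -> Tuple[Snapshot, List[Entry]]:
    """Take a new snapshot once snap_count entries have accumulated since the last."""
    if not ents or ents[-1].index - snapshot.index < snap_count:
        return snapshot, ents
    last = ents[-1]
    return Snapshot(last.term, last.index, dict(state)), []
```

Entries at or below the snapshot index are skipped during replay because the snapshot already contains their effect, which is what allows the older WAL files to be deleted.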


Large systems such as GFS and Bigtable use Chubby to solve a series of problems related to distributed lock services, such as distributed coordination, metadata storage, and master election.
The client asks the other machines where the master is and then sends its requests to the master until the master changes.
Data is organized in the same way as in zk.
Only the master serves reads and writes (the data and logs of the replicas may have holes). Reliability and availability are high, but throughput is not as good as zk's. Requests block while a master is being elected.

Comparison of etcd, chubby and ZK

Because zk and etcd fill the gaps in their followers' logs, both the leader and the followers can serve reads. The etcd leader stays fixed unless it fails, in which case Raft elects a replacement. Chubby (2006) uses Multi-Paxos; its master generally does not change, and it does not guarantee that master and replica data are consistent in every round, so only the master can read and write. Throughput is therefore lower, but distributed locking for 10,000 machines is still feasible.
etcd (2014) is newer and improves on its predecessors: it has an HTTP interface and everything is lighter and simpler. Its one shortcoming is the lack of automatic failure recovery. zk runs an election each time (but, being based on zxid, it is basically as stable as Multi-Paxos) and can recover automatically.
