CAP Consistency Protocol and Application Analysis


I. Consistency

1.1 CAP Theory

  1. C (Consistency): In a distributed environment, consistency means that multiple replicas hold the same value at the same time.
  2. A (Availability): The services provided by the system must always be available, even if some nodes in the cluster fail.
  3. P (Partition tolerance): When the system encounters node failures or network partitions, it can still provide consistent and available service to the outside world. In practical terms, a partition corresponds to a communication deadline: if the system cannot reach data consistency within a certain time limit, a partition has occurred, and the current operation must then choose between C and A.

1.2 Proof that CAP cannot be satisfied at the same time

Suppose the system has five nodes, n1~n5. n1, n2, n3 are in machine room A; n4, n5 are in machine room B. Now a network partition occurs: machine rooms A and B cannot reach each other.
Guaranteeing consistency: a client writes data in room A, the write cannot be synchronized to room B, and the write fails. Availability is lost.
Guaranteeing availability: data is written successfully to nodes n1~n3 in room A and returns success, and is also written successfully to nodes n4~n5 in room B and returns success. The same data is now inconsistent between rooms A and B. The quick-witted may think of zookeeper: when a node goes down, the system evicts it, and a write still succeeds as long as more than half of the remaining nodes accept it. Does zookeeper then satisfy CAP simultaneously? There is a misconception here: "the system evicts the node" carries an implicit condition, namely that the system has introduced a scheduler, a scheduler that kicks out bad nodes. When a network partition appears between the scheduler and the zookeeper nodes, the whole system is still unavailable.
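The majority-write behaviour in this proof can be sketched as a simple quorum check. This is a hypothetical model with invented names, not any real system's API:

```python
# Sketch: a majority (quorum) write under a network partition.

def quorum_write(total_nodes, reachable_nodes):
    """Return True if a write can commit: it needs acks from a strict majority."""
    quorum = total_nodes // 2 + 1
    return reachable_nodes >= quorum

# 5 nodes split 3/2 by a partition: the 3-node side can still commit (CP);
# the 2-node side cannot. It must either refuse writes (losing A) or accept
# them and diverge (losing C).
print(quorum_write(5, 3))  # True
print(quorum_write(5, 2))  # False
```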

1.3 Common Scenarios

CA without P: In a distributed environment, P is unavoidable. Natural disasters (a certain soft company's Azure was struck by lightning) and man-made disasters (a certain company's cables between machine rooms A and B were cut) can all cause P.
CP without A: Every write request must reach strong consistency across servers before returning, and P (a partition) can stretch the synchronization time indefinitely. Consistency is guaranteed at the cost of availability. Examples: distributed database transactions, two-phase commit, three-phase commit, etc.
AP without C: When a network partition occurs and clusters A and B lose contact, in order to stay highly available, a write returns success once part of the nodes have accepted it, so for some period clients read different data from different machines. For example, in redis's asynchronous master-slave replication architecture, when the master goes down the system fails over to a slave; because replication is asynchronous, the slave may not have the latest data, which leads to consistency problems.

II. Consistency Protocols

2.1 Two-Phase Commit Protocol (2PC)

Two-phase commit is an algorithm, in the fields of computer networking and databases, for making all nodes of a distributed system agree on whether to commit a transaction. In a distributed system, each node knows whether its own operation succeeded or failed, but cannot know this about the other nodes. When a transaction spans multiple nodes, in order to preserve the transaction's ACID properties, a component must be introduced to act as coordinator, gather the operation results of all nodes (called participants), and finally tell these nodes whether to actually commit their results (for example, write the updated data to disk). The two-phase commit algorithm can therefore be summarized as: participants report the success or failure of their operations to the coordinator, and the coordinator decides, based on the feedback from all participants, whether to commit or abort.

2.1.1 Two Roles

  • Coordinator
  • Participant

2.1.2 Processing Phases

  • Voting (prepare) phase: The transaction coordinator sends a Prepare message to every participant. After a participant receives the message and successfully writes its redo and undo logs locally, it returns an agreement message to the coordinator; otherwise it returns an abort message.
  • Commit (execution) phase: After the coordinator has received messages from all participants, if any participant returned abort, it sends a rollback instruction to every participant; otherwise it sends a commit message to every participant.

2.1.3 Handling of Abnormal Conditions

  • Coordinator failure: A standby coordinator takes over and queries the participants about how far execution has progressed.
  • Participant failure: The coordinator waits for the participant to restart and resume execution.
  • Coordinator and participant fail at the same time: the coordinator fails, and then a participant fails as well. For example, take machines 1, 2, 3, 4, where machine 4 is the coordinator and machines 1, 2, 3 are participants. Machine 4 sends the transaction commit to machines 1 and 2 and then goes down, and at the same time machine 3, which has not yet received the commit, also goes down. When the standby coordinator starts and asks the participants, it can never learn what state machine 3 was in (it might have accepted the commit, or have replied that it could or could not execute: three possible states), because machine 3 is dead. 2PC cannot resolve this situation; resolving it requires the 3PC described below.
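The coordinator's decision rule described above can be sketched in a few lines. This is a hypothetical in-process model with invented names; real implementations add persistent logs, timeouts, and recovery:

```python
# Sketch of a 2PC coordinator's decision rule.

class Participant:
    def __init__(self, can_prepare):
        self.can_prepare = can_prepare
        self.state = "idle"
    def prepare(self):
        # Write redo/undo logs and lock resources; vote yes/no.
        self.state = "prepared" if self.can_prepare else "aborted"
        return self.can_prepare
    def commit(self):
        self.state = "committed"
    def rollback(self):
        self.state = "rolled back"

def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if *all* voted yes; otherwise roll everyone back.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "rolled back"

print(two_phase_commit([Participant(True), Participant(True)]))   # committed
print(two_phase_commit([Participant(True), Participant(False)]))  # rolled back
```

Note that nothing in this sketch handles the coordinator dying between the two phases; that gap is exactly the blocking problem discussed next.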

2.1.4 Disadvantages

  • Synchronous blocking: All participating nodes block on the transaction. For example, with the statement "update table set status=1 where current_day=20181103", the participants lock the records in the table matching current_day=20181103, and every other operation touching current_day=20181103 is blocked.
  • A single point of failure blocks other transactions: if the coordinator goes down while executing the commit phase, all participants are left blocked holding the transaction's resources and cannot complete the related transaction operations.
  • Participant and coordinator go down at the same time: the coordinator goes down after sending the commit message, and the only participant that received it goes down too. The coordinator that takes over is in the dark about the state of the transaction; it is safe neither to commit nor to roll back. Two-phase commit cannot remedy this.

2.2 Three-Phase Commit Protocol (3PC)

2PC only handles single-machine failures, and even those only barely. When the coordinator and a participant fail at the same time, 2PC's theory is incomplete. This is where 3PC comes on stage.
3PC is a protocol that patches 2PC's weaknesses, with two major changes:

  1. A preparation phase is inserted between the first and second phases of 2PC, so that even if participants and the coordinator fail at the same time, the system does not block and consistency is preserved.
  2. Timeouts are introduced between the coordinator and the participants.

2.2.1 Three Phases

  • CanCommit phase: The coordinator sends a commit inquiry to the participants and waits for their responses. Unlike the first phase of 2PC, participants do not lock resources or write redo/undo logs here, so the cost of aborting is low.
  • PreCommit phase: If all participants returned yes, the coordinator sends a Prepare message and the participants write their redo and undo logs locally. If not all returned yes, or if the coordinator times out waiting, it sends the participants a request to abort the transaction.
  • DoCommit phase: If all Prepare messages returned success, the protocol enters the execution phase and the coordinator sends a commit message to the participants; otherwise the transaction is rolled back. In this phase, if a participant does not receive the doCommit message within a certain time, a timeout fires and it commits its own transaction anyway. The logic is that reaching this phase means every node was healthy during the inquiry phase; even if some nodes fail at commit time, there is reason to believe most nodes are healthy, so committing is acceptable.
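The participant-side behaviour of the three phases, including the distinctive timeout rules, can be sketched as a small decision function. The phase names and message strings are invented for illustration:

```python
# Sketch of a 3PC participant's per-phase rules (hypothetical model).

def participant_step(phase, message, timed_out):
    """Return the participant's action in a given 3PC phase."""
    if phase == "can_commit":
        # No resources locked yet, so aborting here is cheap.
        return "vote_yes" if message == "can_commit?" else "abort"
    if phase == "pre_commit":
        if timed_out:
            return "abort"   # never saw pre-commit: assume failure
        return "ack"         # write redo/undo logs, lock resources
    if phase == "do_commit":
        if timed_out:
            # 3PC's key rule: after pre-commit, every node is known to have
            # agreed, so a silent coordinator means "commit unilaterally".
            return "commit"
        return "commit" if message == "do_commit" else "rollback"
    raise ValueError(phase)

print(participant_step("do_commit", None, timed_out=True))   # commit
print(participant_step("pre_commit", None, timed_out=True))  # abort
```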

2.2.2 Disadvantages

  • It cannot solve data inconsistency caused by network partitions. For example, take participant nodes 1–5, with nodes 1, 2, 3 in machine room A and nodes 4, 5 in machine room B. In the PreCommit phase all five nodes receive the Prepare message, but node 1 fails to execute it, so the coordinator sends a rollback message to nodes 1–5. At that moment a network partition separates rooms A and B: nodes 1–3 roll back, but nodes 4–5 never receive the rollback message and commit the transaction. When the partition heals, the data is inconsistent.
  • It cannot solve fail-recovery either.

Thanks to the timeout mechanism in 3PC, the case 2PC could not handle, the participant and coordinator going down at the same time, can be resolved: once a participant receives no message from the coordinator within the timeout, it commits on its own. This also keeps participants from holding shared resources indefinitely. But in the presence of network partitions, 3PC cannot guarantee data consistency.

2.3 Paxos Protocol

Protocols like 2PC and 3PC need to introduce a coordinator role. When the coordinator goes down, the whole transaction cannot be committed and the participants' resources stay locked, which is catastrophic for the system; and network partitions, which are quite likely, may lead to inconsistent data. Is there a solution that needs no coordinator role, lets every participant coordinate the transaction, and preserves consistency as far as possible under network partitions? This is where Paxos appears.

The Paxos algorithm is a message-passing-based consensus algorithm proposed by Lamport in 1990. Because it is hard to understand, it attracted little attention at first; Lamport published it again eight years later, and even then Paxos was not taken seriously. In 2006, three Google papers shook the field: the Chubby lock service used Paxos for consistency within a Chubby cell, and Paxos finally received attention.

2.3.1 What Problems Does It Solve?

  • The Paxos protocol is a communication protocol that solves the problem of multiple nodes in a distributed system agreeing on some value (a proposal). It tolerates a minority of nodes going offline: the remaining majority can still reach agreement. In other words, every node is both a participant and a decision maker.

2.3.2 Two Roles (both can be on the same machine)

  • Proposer: the server that makes proposals
  • Acceptor: the server that approves proposals

Since Paxos is very similar to the ZAB protocol used by zookeeper, described below, see the Zookeeper Principles section for a detailed explanation.

2.4 Raft protocol

Paxos demonstrated the feasibility of consensus protocols, but the demonstration is famously obscure and lacks the details needed for implementation; the engineering difficulty is so well known that essentially only ZK's Zab protocol counts as an implementation in that lineage. Then Raft, a distributed consistent-replication protocol that is easy to implement and understand, was proposed in Stanford University's RAMCloud project. Java, C++, Go and other languages all have implementations of it.

2.4.1 Basic Terms

  • Node states

    • Leader: accepts client update requests, writes them locally, and then synchronizes them to the other replicas
    • Follower (slave node): accepts update requests from the Leader and writes them to its local log file; serves read requests from clients
    • Candidate: if a Follower receives no Leader heartbeat for a period of time, it judges that the Leader may have failed and initiates an election, changing its state from Follower to Candidate until the election ends
  • TermId: term number. Time is divided into terms; each election produces a new termId, and a term has only one leader. The termId is the equivalent of Paxos's proposal id.
  • RequestVote: a vote request, initiated by a Candidate during an election; on receiving a quorum of responses it becomes leader.
  • AppendEntries: log append; the mechanism by which the leader sends logs and heartbeats.
  • Election timeout: if a follower receives no message (log append or heartbeat) for a period of time, an election timeout fires.

2.4.2 Characteristics

  • A Leader never modifies its own log; it only appends, and log entries flow only from Leader to Follower. For example, suppose node 1, the Leader about to go down, has committed log entry 1 but not entries 2 and 3. After it goes down, node 2 becomes leader with only entry 1 as its latest log, and then commits entry 4. Unluckily, node 1 then restarts. At this point node 2's entry 4 is appended after node 1's entry 1, and node 1's entries 2 and 3 are lost.
  • Consistency is guaranteed by term-id and log-id, which increase logically and are independent of each node's physical clock.

2.4.3 When an Election Is Triggered

  1. No Leader heartbeat is received within the timeout
  2. At startup

2.4.4 Election Process

(Figure raft-2: Raft terms and elections)

As shown in figure raft-2, Raft divides time into multiple terms (terms of office), identified by consecutive integers, and each term begins with an election. For example, take Follower node 1: at the boundary between term 1 and term 2 it could not contact the Leader, so it incremented the current term number by 1 to 2 and entered term 2. The election completed in the blue part of term 2, and the green part is normal operation. Of course, a term does not necessarily elect a Leader; in that case the current term is incremented again and the election continues, like t3 in the figure. The election rule is: in each round, every voter has one vote, first vote request served first; a voter grants its vote when it finds the candidate node's log id is greater than or equal to its own, and otherwise refuses. A node that receives more than half of the votes becomes the primary node. (Note: this does not mean the elected node's transaction id must be the largest. For example, in figure raft-1, take six nodes a–f, where the number in each box is the election term. In the fourth round of elections, a votes first; among the six machines, a–e will vote for a, so even if f does not vote for a, a wins the election.) If there is no transaction id (for example, right after startup), voting is first come, first served. The Leader then replicates the latest logs to every node and provides service to the outside.
Beyond these election restrictions there are other safeguards, such as the commit restriction: a successfully elected Leader is guaranteed to contain all committed logs.
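The vote-granting rule above can be sketched as a predicate. One caveat: the text compares a single "log id", while Raft's precise rule compares the last log entry's term first and its index second; the sketch below uses the precise rule, with invented parameter names:

```python
# Sketch of Raft's vote-granting rule (simplified; a real follower also
# checks the candidate's term number and persists its vote).

def grant_vote(voter_last_term, voter_last_index,
               cand_last_term, cand_last_index, already_voted):
    """One vote per term; grant only if the candidate's log is at least as
    up-to-date as the voter's (compare last term, then last index)."""
    if already_voted:
        return False
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index

print(grant_vote(3, 10, 4, 5, False))   # True: higher last term wins
print(grant_vote(3, 10, 3, 9, False))   # False: same term, shorter log
print(grant_vote(3, 10, 3, 10, True))   # False: already voted this round
```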

2.4.5 Log Replication Process

In Raft's log-write flow, after the primary node receives a set x=1 request it writes its local log and then broadcasts the x=1 log entry. When a follower receives the request, it writes the entry to its local log and returns success. When the leader has received responses from more than half of the nodes, it changes the entry's status to committed, then broadcasts to the Followers to commit the entry. After committing, each node updates the applied log index in its state machine.
firstLogIndex/lastLogIndex are the start and end index positions held on the node (committed, uncommitted, or written to the state machine); commitIndex is the highest committed index; applyIndex is the highest index already applied to the state machine.
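How the leader derives its commit index from the "more than half responded" rule can be sketched as follows. This is a simplified model with invented names; real Raft additionally requires the entry at that index to carry the leader's current term before committing it:

```python
# Sketch: how a Raft leader advances commitIndex.

def leader_commit_index(match_indexes):
    """match_indexes: highest replicated log index per node (leader included).
    The commit index is the largest index present on a strict majority."""
    ranked = sorted(match_indexes, reverse=True)
    majority = len(match_indexes) // 2   # position of the (n//2 + 1)-th largest
    return ranked[majority]

# 5 nodes holding indexes 7,7,7,5,3: index 7 is on 3 of 5 nodes, so it commits.
print(leader_commit_index([7, 7, 7, 5, 3]))  # 7
print(leader_commit_index([7, 7, 5, 5, 3]))  # 5
```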

The essence of log replication is to make the Followers' committed logs exactly match the Leader's in both order and content, which guarantees consistency.
The specific principles are:
Principle 1: if two log entries on different raft nodes have the same term and logIndex, their contents are exactly the same.
Principle 2: if, on two different raft nodes, two log segments have the same starting and ending term and logIndex, the contents of the two segments are exactly the same.
How are these guaranteed?
For the first principle, a fresh logIndex is used whenever an entry is created, which keeps logIndex unique, and an entry is never changed after creation; after the leader replicates it to a follower, the logIndex, term and log content all stay the same.
For the second, when the Leader replicates to a Follower, it sends not only the current latest entry's currentTermId and currentLogIndex but also the previous entry's preCurrentTermId and preCurrentLogIndex. In figure raft-1, node d is at term 7, logIndex 12. When synchronizing with node a, it sends ((term7, logIndex11), (term7, logIndex12)); node a cannot find a (term7, logIndex11) entry, which makes the Leader, node d, resend. Node d resends ((term6, logIndex10), (term7, logIndex11)); node a has no (term6, logIndex10) entry either and still refuses to synchronize. It then sends ((term6, logIndex9), (term6, logIndex10)); node a does have (term6, logIndex9). The leader node then hands the log contents from (term6, logIndex9) through (term7, logIndex11) to node a, and node a ends up with the same log as node d.
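The consistency check and back-off described above can be sketched with a toy log model, where a log is a list of (term, value) entries with 1-based log indexes. All names here are invented for illustration:

```python
# Sketch of the AppendEntries consistency check and leader back-off.

def append_entries(log, prev_index, prev_term, entries):
    """Follower accepts new entries only if it holds (prev_index, prev_term);
    otherwise it rejects, and the leader retries one entry further back."""
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False  # mismatch: leader will decrement prev_index and resend
    # Truncate any conflicting suffix, then append the leader's entries.
    del log[prev_index:]
    log.extend(entries)
    return True

follower = [(6, "a"), (6, "b")]  # holds entries up to (term6, logIndex2)
# Leader tries to append after (term7, logIndex3): the follower lacks it.
print(append_entries(follower, 3, 7, [(7, "d")]))            # False
# Leader backs up and sends starting after (term6, logIndex2): it matches.
print(append_entries(follower, 2, 6, [(7, "c"), (7, "d")]))  # True
print(follower)  # now identical to the leader's log
```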

III. Zookeeper Principles

3.1 Overview

Burrows, the designer and developer of Chubby, Google's coarse-grained lock service, once said: "all consensus protocols are essentially either Paxos or a variant of it." Paxos solves the problem of multiple nodes in a distributed system agreeing on a value, but it introduces other problems. Because every node can both propose and approve proposals, when three or more proposers send prepare requests, it is hard for any one proposer to receive more than half of the replies and carry on with the first phase of the protocol. Such contention slows elections down.
So zookeeper proposed the ZAB protocol on top of paxos. In essence, only one machine may make proposals (the Proposer role), and it is called the Leader; the other participants play the Acceptor role. To keep the Leader robust, a Leader election mechanism is introduced.

The ZAB protocol also solves these problems:

  1. As long as fewer than half of the nodes are down, the cluster can still serve the outside world.
  2. All write requests from clients are handled by the Leader; after a successful write, all Followers and Observers need to be synchronized.
  3. When the Leader goes down or the cluster restarts, it must be guaranteed that every transaction the Leader has committed is eventually committed by all servers, and that the cluster quickly returns to its pre-failure state.

3.2 Basic Concepts

  • Basic terms

    • Data node (ZNode): the smallest data unit in the ZK data model. The data model is a tree whose nodes are uniquely identified by slash-separated (/) path names. A data node can store data content and a series of attribute information, and can also have child nodes mounted under it, forming a hierarchical namespace.
    • Transaction and zxid: a transaction is an operation that changes the state of the Zookeeper server, generally including creating and deleting data nodes, updating a data node's content, and creating and expiring client sessions. For each transaction request, ZK assigns a globally unique transaction id, the zxid, a 64-bit number: the high 32 bits identify the election epoch in which the transaction occurred (incremented by 1 on each leader election), and the low 32 bits give the transaction's position within the current epoch (incremented by 1 for each transaction request processed, and reset to 0 when a new leader is elected).
    • Transaction log: all transaction operations are recorded in log files, whose directory is set by the dataLogDir configuration item. Each file is named with the zxid of the first transaction written to it, to make later lookup easy. ZK uses a "disk space pre-allocation" strategy to reduce disk seeks and improve the server's responsiveness to transaction requests. By default every transaction-log write is flushed to disk in real time; it can also be set to non-real-time (write to an in-memory file stream and flush to disk in batches at fixed intervals), but that risks losing data on power failure.
    • Data snapshot: the data snapshot is the other core mechanism in ZK's data store. Snapshots record the full in-memory data content of the ZK server at a point in time and write it to a file in the directory set by the dataDir configuration item. The configurable parameter snapCount sets the number of transaction operations between two snapshots: while recording the transaction log, a ZK node checks whether a snapshot is needed (when the number of transactions since the last snapshot reaches a value between snapCount/2 and snapCount, a snapshot is triggered; the value is randomized so that all nodes do not snapshot at the same time, which would slow down the whole cluster).
  • Core roles

    • Leader: the sole scheduler and processor of transaction (write) requests, guaranteeing the order of transactions in the cluster.
    • Follower: handles non-transaction requests, forwards transaction requests to the Leader, and takes part in proposal voting and Leader elections.
    • Observer: serves read requests like a Follower but takes part in no voting; it exists to scale read throughput without affecting write performance.
  • Node states

    • LOOKING: the node is in the election state and does not serve requests until the election ends.
    • FOLLOWING: the system's slave-node state; it receives updates from the master node and writes them to its local log.
    • LEADING: the system's master-node state; it accepts client updates, writes its local log, and replicates to the slave nodes.
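The 64-bit zxid layout described in the basic terms above (high 32 bits: election epoch, low 32 bits: per-epoch counter) can be sketched with two helper functions; the function names are invented:

```python
# Sketch: composing and decomposing a 64-bit zxid.

def make_zxid(epoch, counter):
    """High 32 bits: election epoch; low 32 bits: per-epoch transaction counter."""
    return (epoch << 32) | counter

def split_zxid(zxid):
    return zxid >> 32, zxid & 0xFFFFFFFF

zxid = make_zxid(5, 0x11)
print(hex(zxid))          # 0x500000011
print(split_zxid(zxid))   # (5, 17)

# A new leader election bumps the epoch and resets the counter to 0.
print(hex(make_zxid(6, 0)))  # 0x600000000
```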

3.3 Common Misunderstandings

  • "Write data to one node and you can immediately read it from any node" is wrong. ZK writes must go serially through the leader, and a write succeeds as soon as more than half of the nodes have written it, while any node can serve reads. For example: in a zk cluster with nodes 1–5, the latest data is written to nodes 1–3 and success is returned; a subsequent request to read the latest data may be routed to nodes 4–5, which have not yet synchronized it, so the latest data cannot be read. If you need to read the latest data, issue a sync command before reading.
  • "A ZK cluster cannot have an even number of nodes" is also wrong. ZK requires more than half of the nodes to be working. With four nodes, "more than half" is 3, so at most one machine may be down; with three nodes, "more than half" is 2, which also tolerates at most one machine down. Four nodes cost one more machine but are no more robust than a three-node cluster, so on cost grounds an even count is simply not recommended.
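The even-versus-odd argument above is just quorum arithmetic, which can be checked directly (the function name is invented):

```python
# Sketch: fault tolerance of a ZK ensemble. A cluster of n nodes needs a
# strict majority alive, so it tolerates n - (n//2 + 1) failures.

def tolerated_failures(n):
    return n - (n // 2 + 1)

for n in (3, 4, 5, 6):
    print(n, "nodes tolerate", tolerated_failures(n), "failures")
# 3 and 4 nodes both tolerate 1; 5 and 6 both tolerate 2. The even-sized
# cluster buys no extra robustness for the extra machine.
```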

3.4 Election Synchronization Process

3.4.1 When Elections Happen

  1. Node Start
  2. The node cannot maintain a connection with Leader during its operation.
  3. Leader loses more than half of its nodes

3.4.2 How Transactions Are Guaranteed

The ZAB protocol resembles two-phase commit. Suppose a client issues a write request, say setting /my/test to 1. The Leader generates the corresponding transaction proposal (if the current zxid is 0x5000010, the new zxid is 0x5000011), writes the operation set /my/test 1 (pseudocode here) to its local transaction log, and then synchronizes the set /my/test 1 log entry to all Followers. Each Follower receives the proposal and writes it to its own transaction log. Once more than half of the Followers respond, the Leader broadcasts a commit request; on receiving the commit request, the Followers apply zxid 0x5000011 from the log file to memory.

That is the normal flow. There are two failure cases. First, the Leader goes down right after writing its local transaction log, before sending the synchronization request; even if it later restarts as a follower, that log entry is lost (the newly elected leader does not have it, so there is nothing to synchronize from). Second, the Leader sends the synchronization request but goes down before the commit; in that case the log entry is not lost and will be committed on the other nodes through synchronization.

3.4.3 Voting process during server startup

Suppose five ZK machines, numbered 1–5 in turn:

  1. Node 1 starts; the vote requests it sends get no response, so it stays in the Looking state.
  2. Node 2 starts and communicates with node 1, exchanging election results. Since neither has historical data, their zxids cannot be compared, so node 2, with the larger id, wins; but since more than half of the nodes have not yet joined, both 1 and 2 stay in the Looking state.
  3. Node 3 starts. By the same reasoning, node 3, now holding the largest id, wins, and more than half of the nodes have taken part in the election, so node 3 becomes Leader.
  4. Node 4 starts and communicates with nodes 1–3. Learning that node 3 is the current leader and that its own zxid is not larger than node 3's, it acknowledges node 3's leader role.
  5. Node 5 starts and, like node 4, acknowledges node 3 as leader.

3.4.4 Election Process While Running

(Figure: election flow while running, with a Looking branch on the left and a Leading/Following branch on the right)

1. Node 1 votes, voting for itself in the first round, and then enters the Looking wait state.
2. Other nodes (say node 2) receive the vote. If node 2 is in the Looking state, it broadcasts its own vote (the Looking branch on the left of the figure above); if it is not in the Looking state, it tells node 1 who the current Leader is, sparing it a futile election (the Leading/Following branch on the right of the figure above).
3. Node 1 then receives node 2's vote. If node 2's zxid is larger, node 1 empties its ballot box, creates a new one, and broadcasts its latest vote. Within one election round, once a node has collected everyone's votes and finds some node with more than half of the votes in its ballot box, a leader has been elected and voting stops; otherwise the process keeps looping.

The zookeeper election prefers the larger zxid, since the node with the largest zxid holds the newest data. If there are no zxids, as when the system has just started, the machine ids are compared instead, preferring the larger id.
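The comparison rule, larger zxid first, then larger server id as tiebreaker, can be sketched with a tuple comparison. This is a simplification (real FastLeaderElection also compares election epochs), and the names are invented:

```python
# Sketch of ZK's vote comparison: prefer larger zxid, break ties by
# larger server id.

def better_vote(a, b):
    """Each vote is (zxid, server_id); return the winning vote."""
    return max(a, b)  # tuple comparison: zxid first, then server_id

print(better_vote((0x200, 1), (0x100, 5)))  # larger zxid wins despite smaller id
print(better_vote((0x200, 1), (0x200, 3)))  # tie on zxid: larger id wins
```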

3.5 Synchronization Process

After a Leader is elected, ZK enters the state-synchronization process: in essence, applying the log data up to the newest zxid to the other nodes. This includes zxids that were written to follower logs but not yet committed; they live in the leader's proposal cache queue, committedLog.

Synchronization starts by initializing three zxid values:

peerLastZxid: the last zxid the learner server processed.
minCommittedLog: the smallest zxid in the leader server's proposal cache queue committedLog.
maxCommittedLog: the largest zxid in the leader server's proposal cache queue committedLog.
The system compares the learner's peerLastZxid against the leader's minCommittedLog and maxCommittedLog and picks a synchronization strategy accordingly.

3.5.1 Direct Differential Synchronization

Scenario: peerLastZxid lies between minCommittedLog and maxCommittedLog.

This scenario arises in the case described above where the Leader sent the synchronization request but went down before the commit. The new Leader sends the Proposal packets and commit instruction packets, finishing the work the previous leader left undone.

For example, the leader's proposal cache queue holds 0x20001, 0x20002, 0x20003, 0x20004, and the learner's peerLastZxid is 0x20002. The Leader synchronizes the two proposals 0x20003 and 0x20004 to the learner.

3.5.2 Rollback-Then-Differential Synchronization / Rollback-Only Synchronization

This scenario arises in the case described above where the Leader went down right after writing its local transaction log, without sending the synchronization request, and later rejoins as a learner during log synchronization.

For example, leader node 1, about to go down, has already processed 0x20001 and 0x20002 and goes down while handling 0x20003, before proposing it. Node 2 is later elected the new leader. During data synchronization, node 1 magically revives. If the new leader has not yet processed new transactions and its queue is 0x20001, 0x20002, then node 1 is simply rolled back to 0x20002 and the 0x20003 log entry is discarded: this is rollback-only synchronization. If the new leader has already processed transactions 0x30001 and 0x30002, so its queue is 0x20001, 0x20002, 0x30001, 0x30002, then node 1 is first rolled back to 0x20002 and then differentially synchronized with 0x30001 and 0x30002.

3.5.3 Full Synchronization

When peerLastZxid is less than minCommittedLog, or the leader has no cache queue at all, the Leader uses the SNAP command directly for full synchronization.
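The choice among the three synchronization strategies above can be sketched as one comparison function. This is a simplified, hypothetical rendering (in ZooKeeper the real logic lives server-side and has more cases):

```python
# Sketch: choosing a ZK synchronization strategy from the three zxid values.

def sync_strategy(peer_last_zxid, min_committed_log, max_committed_log):
    if min_committed_log <= peer_last_zxid <= max_committed_log:
        return "DIFF"   # send only the proposals the learner is missing
    if peer_last_zxid > max_committed_log:
        return "TRUNC"  # learner holds uncommitted extras: roll it back
    return "SNAP"       # too far behind (or no cache queue): full snapshot

print(sync_strategy(0x20002, 0x20001, 0x20004))  # DIFF
print(sync_strategy(0x20003, 0x20001, 0x20002))  # TRUNC
print(sync_strategy(0x10000, 0x20001, 0x20004))  # SNAP
```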

IV. Using Raft + RocksDB to Build a Distributed KV Storage Service

At present, most open-source cached KV systems are AP systems: for example a redis cluster with master-slave replication, where the master replicates to the slaves asynchronously. A slave takes over after the master stops serving, but if the master accepted a write and went down before the slave could synchronize it, and that slave is then promoted to primary and continues to serve, some data is lost. That is unacceptable for systems requiring strong consistency. For example, redis is widely used for distributed locks, which therefore carry an inherent defect: if the master stops serving, the lock is not entirely reliable. The probability of this is very small, but when it happens the mistake is fatal.

To build a CP KV storage system that stays compatible with existing redis usage, Youzan developed ZanKV (its ZanRedisDB has been open-sourced).

(Figure: ZanKV architecture)

The underlying storage is RocksDB (whose underlying structure is an LSM tree). A set x=1 arrives over the redis protocol, and its content is synchronously written to the RocksDB instances on the other nodes via the Raft protocol. Backed by Raft, and with RocksDB's excellent storage performance, the system copes easily even with a series of abnormal situations: network partitions, the master node going down, slave nodes going down, and so on. For scaling out, the system maintains a mapping table that relates partitions to nodes; the mapping table is generated by certain algorithms plus flexible policies to make expansion easy. For the specifics, see "Using Open Source Technology to Build a Distributed KV Storage Service" in the references.

V. Summary

This article covered consistency from three angles. The first part described the core theory of distributed architecture, CAP, with a simple proof. The second part introduced the consistency protocols under CAP, focusing on the Raft protocol. The third part focused on the principles of the widely used zookeeper.

To guarantee that committed data is not lost, such systems use a WAL (write-ahead log): write the operation's content to the log before modifying the data, then modify the data. Even if something goes wrong while modifying the data, the data can be recovered by replaying the operation log.
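The write-ahead-log idea can be sketched with a toy in-memory model. Everything here is invented for illustration (a real WAL appends to an fsync'd file, not a Python list):

```python
# Sketch of the WAL idea: append the operation to a log *before* mutating
# the data, so a crash mid-update is recoverable by replaying the log.

class TinyWAL:
    def __init__(self):
        self.log = []    # stands in for a durable on-disk log
        self.data = {}

    def set(self, key, value, crash_before_apply=False):
        self.log.append(("set", key, value))  # 1. write ahead
        if crash_before_apply:
            return                            # simulated crash
        self.data[key] = value                # 2. apply the change

    def recover(self):
        """Replay the log to rebuild the data after a crash."""
        for op, key, value in self.log:
            if op == "set":
                self.data[key] = value

db = TinyWAL()
db.set("x", 1)
db.set("y", 2, crash_before_apply=True)  # crash: y never reached the data
print(db.data)   # {'x': 1}
db.recover()
print(db.data)   # {'x': 1, 'y': 2}
```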

Distributed storage systems are designed under the assumption that machines are unstable and may go down at any time: even if a machine goes down, data written by users must not be lost, and single points of failure must be avoided. To this end, every piece of written data is stored in multiple replicas at once, as with ZK's node data replication or etcd's data replication. Replication brings consistency problems between nodes, such as how to resynchronize when the master's and slaves' data disagree; it also brings availability problems, such as how to quickly elect a new master and recover data after the leader node goes down. Fortunately there are mature theories for all of this: the Paxos protocol, the ZAB protocol, the Raft protocol, and so on.

References / Books
From Paxos to Zookeeper: Principles and Practice of Distributed Consistency

Using Open Source Technology to Build a Distributed KV Storage Service

On Distributed Transactions, the Two-Phase Commit Protocol and the Three-Phase Commit Protocol

Data Synchronization Between the Zookeeper Leader and Learners

An Analysis of Zookeeper's Consistency Principles

Illustrated Distributed Protocols: Raft

Raft Protocol Explained