Interviewer: Young man, what do you think of the principles of distributed systems?

  • 1 Concept
  • 1.1 model
  • 1.2 replicas
  • 1.3 indicators for measuring distributed systems
  • 2. Principle of distributed system
  • 2.1 data distribution
  • 2.2 basic replica protocol
  • 2.3 lease mechanism
  • 2.4 quorum mechanism
  • 2.5 log technology
  • 2.6 two-phase commit protocol
  • 2.7 MVCC
  • 2.8 Paxos protocol
  • 2.9 CAP

1 Concept

1.1 model


In a specific project, a node is often a process on an operating system. In this model, the node is considered as a complete and indivisible whole. If a program process is actually composed of several relatively independent parts, a process can be divided into multiple nodes in the model.


  1. Machine downtime: Machine downtime is one of the most common exceptions. In a large cluster, the daily probability of a machine going down is about one in a thousand. In practice, the recovery time of a down machine is usually taken to be 24 hours, and restarting it generally requires manual intervention.
  2. Network anomalies: Messages can be lost, or two nodes can become unable to communicate with each other at all, which is a “network partition”. Messages can arrive out of order, that is, with some probability they reach the destination node in a different order from the one in which they were sent; sequence numbers and similar mechanisms are used to handle this, so that invalid and expired messages do not affect the correctness of the system. Data can be corrupted in transit. TCP is also not fully reliable: TCP provides a reliable, connection-oriented transport service to the application layer, but when designing a distributed protocol you cannot assume that all network communication is reliable just because it runs over TCP. TCP only guarantees that messages within the same TCP connection are not reordered; it guarantees nothing about the order of messages across different TCP connections.
  3. Distributed tristate: Suppose a node initiates an RPC (remote procedure call) to another node, that is, node A sends a message to node B, node B performs some operation according to the message content, and node B returns the result to node A in another message. The result of the RPC then has three possible states: “success”, “failure” and “timeout (unknown)”. These are called the three states of a distributed system.
  4. Loss of stored data: For stateful nodes, data loss means state loss. Usually the stored state can only be read and recovered from other nodes.
  5. Exception handling principle: The golden rule of exception handling, tested by a great deal of engineering practice, is that any exception considered in the design phase will certainly occur in actual operation, while the exceptions encountered in actual operation may well not have been considered in the design. Therefore, unless the requirements explicitly permit it, no exception may be ignored in the system design.
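The three RPC states in item 3 can be illustrated with a minimal sketch. It simulates node B with a local thread instead of a real network call; `call_with_timeout`, `RpcState` and `slow_handler` are hypothetical names introduced only for this illustration.

```python
import enum
import queue
import threading
import time


class RpcState(enum.Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout (unknown)"


def call_with_timeout(handler, request, timeout_s):
    """Run handler(request) on another thread and classify the outcome."""
    replies = queue.Queue(maxsize=1)

    def node_b():
        try:
            replies.put((RpcState.SUCCESS, handler(request)))
        except Exception as exc:  # node B reported an error
            replies.put((RpcState.FAILURE, exc))

    threading.Thread(target=node_b, daemon=True).start()
    try:
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        # No reply in time: node A cannot tell whether B executed the operation.
        return (RpcState.TIMEOUT, None)


def slow_handler(request):
    time.sleep(0.5)  # the reply arrives after the caller has given up
    return request
```

In the timeout case the operation may still complete on node B, which is why a caller usually has to make the operation safe to retry or query its outcome later.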

1.2 replicas

A replica (or copy) is redundancy provided for data or services in a distributed system. A data replica means that the same data is persisted on different nodes, so that when the data stored on one node is lost it can be read from a replica. Data replication is the only way to solve the problem of data loss in a distributed system. The other kind is a service replica: several nodes provide the same service, which usually does not depend on the local storage of the node and obtains the data it needs from other nodes.

Replica protocol is the core theory throughout the whole distributed system.

Replica consistency

A distributed system uses a replica control protocol so that, under certain constraints, data read from outside the system is the same no matter which replica inside the system serves it. This property is called replica consistency. Replica consistency is a property of the distributed system as a whole, not of a single replica.

  1. Strong consistency: at any time, any user or node can read the most recent successfully updated replica data. Strong consistency is the strictest consistency requirement and also the hardest to achieve in practice.
  2. Monotonic consistency: once a user has read the value of a piece of data after a certain update, that user will never again read a value older than it. Monotonic consistency is a practical consistency level that is weaker than strong consistency. Generally speaking, users only care about the consistency observed from their own perspective, not the consistency seen by other users.
  3. Session consistency: once a user has read the value of a piece of data after a certain update within a session, the user will not read an older value within that session. By introducing the concept of a session, session consistency relaxes the constraints of monotonic consistency further. It only guarantees monotonic reads within a single session of a single user, and guarantees nothing between different users or between different sessions of the same user. In practice there are many mechanisms that correspond to the concept of a session, such as sessions in PHP.
  4. Eventual consistency: once an update succeeds, the data on all replicas will eventually reach a completely consistent state, but the time needed to reach it cannot be guaranteed. In an eventually consistent system, as long as a user always reads from the same replica, the effect is similar to monotonic consistency, but once the user switches to a different replica no consistency can be guaranteed.
  5. Weak consistency: after an update succeeds, there is no guarantee that the user can read the updated value within any given time, and even if the new value has been read on one replica, there is no guarantee that it can be read on other replicas. Weakly consistent systems are difficult to use in practice and require more work on the application side to make the system usable.

1.3 indicators for measuring distributed systems

  1. Performance: The throughput of a system is the total amount of data it can process in a given time, usually measured as the amount of data processed per second. The response latency of a system is the time it needs to complete a certain function. The concurrency of a system is its ability to serve a certain function for many requests at the same time, usually measured in QPS (queries per second). These three performance indicators often constrain each other: a system that pursues high throughput finds it hard to achieve low latency, and a system with a long average response time finds it hard to raise QPS.
  2. Availability: Availability is the ability of a system to provide service correctly in the face of various exceptions. It can be measured by the ratio of the time the system is out of service to the time it serves normally, or by the ratio of the number of failures to the number of successes of a certain function. Availability is an important indicator for a distributed system: it measures the robustness of the system and reflects its fault tolerance.
  3. Scalability: Scalability is the property of a distributed system that lets it improve performance (throughput, latency, concurrency), storage capacity and computing power by increasing the number of machines in the cluster. A good distributed system always pursues “linear scalability”, which means that a given indicator of the system grows linearly with the number of machines in the cluster.
  4. Consistency: To improve availability, a distributed system inevitably uses replicas, which raises the problem of replica consistency. The stronger the consistency model, the easier the system is for users to use.

2. Principle of distributed system

2.1 data distribution

A distributed system, as the name suggests, uses multiple computers to solve computing and storage problems that a single computer cannot solve. The biggest difference between a stand-alone system and a distributed system is the scale of the problem, that is, the amount of data to be computed on and stored. To solve a single-machine problem with a distributed system, the first step is to decompose the problem so that each machine in the distributed system is responsible for a subset of the original problem. Since the input of the problem is data, whether for computation or for storage, how to split up the input data becomes the basic problem of a distributed system.

Hash mode


The disadvantage of distributing data by hash is also obvious: poor scalability. Once the cluster needs to be expanded, almost all the data has to be migrated and redistributed. In engineering practice, a hash-distributed system is therefore often expanded by doubling the cluster size and recomputing the hash for each piece of data, so that only half of the data on each original machine needs to be migrated to one corresponding new machine to complete the expansion.
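The effect of the cluster size on migration can be checked with a short sketch. It assumes data is placed by `hash(key) mod number_of_machines`; the use of MD5 and the names `machine_for` and `moved_fraction` are choices made only for this illustration.

```python
import hashlib


def machine_for(key: str, num_machines: int) -> int:
    """Map a key to a machine by hashing it and taking the remainder."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose machine changes when the cluster is resized."""
    moved = sum(1 for k in keys if machine_for(k, old_n) != machine_for(k, new_n))
    return moved / len(keys)


keys = [f"user-{i}" for i in range(10_000)]
frac_plus_one = moved_fraction(keys, 4, 5)  # add a single machine
frac_doubled = moved_fraction(keys, 4, 8)   # double the cluster
```

Growing from 4 to 5 machines relocates most keys, while doubling from 4 to 8 relocates roughly half of them, and every relocated key moves from machine `i` to machine `i + 4`.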

One way to address the poor scalability of the hash method is not to map hash values to machines simply by taking the remainder, but to manage the mapping as metadata on a dedicated metadata server. The modulus applied to the hash value is then usually larger than the number of machines, so each machine is responsible for several remainders. The price is that a large amount of metadata has to be maintained by a more complex mechanism. Another disadvantage of distributing data by hash is that once the data for some feature value is seriously uneven, “data skew” easily appears.


Distribution by data range

Distribution by data range is another common approach. It divides the data into intervals according to the range of a feature value, so that each server (or group of servers) in the cluster handles the data in a different interval.


In engineering practice, to make data migration and other load-balancing operations easier, dynamic interval splitting is often used to keep the amount of data served in each interval as even as possible. When the amount of data in an interval grows large, the interval is split into two, so that the amount of data in each interval stays below a relatively fixed threshold.
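Dynamic interval splitting can be sketched as follows. This is a simplified single-process model that assumes unique string keys and splits an interval at its median key once it exceeds a threshold; `RangePartitioner` and its methods are hypothetical names.

```python
import bisect


class RangePartitioner:
    """Keeps sorted intervals of keys and splits any interval that grows too large."""

    def __init__(self, max_keys_per_range: int):
        self.max_keys = max_keys_per_range
        # Interval i covers [lower_bounds[i], lower_bounds[i + 1]).
        self.lower_bounds = [""]
        self.ranges = [[]]

    def range_index(self, key: str) -> int:
        """Find the interval responsible for a key (the metadata lookup)."""
        return bisect.bisect_right(self.lower_bounds, key) - 1

    def insert(self, key: str) -> None:
        i = self.range_index(key)
        bisect.insort(self.ranges[i], key)
        if len(self.ranges[i]) > self.max_keys:
            self._split(i)

    def _split(self, i: int) -> None:
        keys = self.ranges[i]
        mid = len(keys) // 2
        self.ranges[i:i + 1] = [keys[:mid], keys[mid:]]
        self.lower_bounds.insert(i + 1, keys[mid])


partitioner = RangePartitioner(max_keys_per_range=4)
for n in range(10):
    partitioner.insert(f"k{n:02d}")
```

The list `lower_bounds` plays the role of the meta information described in the text: it is all a client needs in order to route a key to the right interval.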

Generally, a dedicated server is needed to maintain the data distribution information, called meta information, in memory. For very large clusters the meta information itself can become so large that a single machine cannot maintain it alone, and several machines have to be used as meta information servers.

Distribution by data volume

Distribution by data volume has nothing to do with specific data characteristics. Instead, the data is treated as a sequentially growing file that is cut into chunks of a relatively fixed size, and different chunks are placed on different servers. As with distribution by data range, the specific placement of the chunks has to be recorded and managed as metadata on a metadata server.

Because it is independent of the data content, distribution by data volume generally has no data skew problem: the data is always cut evenly and spread across the cluster. When the cluster needs to rebalance, it only has to migrate chunks. Cluster expansion is also largely unconstrained: moving some of the chunks to the new machine completes the expansion. The disadvantage is that more complex meta information has to be managed. As with distribution by range, when the cluster is large the volume of meta information becomes large as well, and managing it efficiently becomes a new problem.

Consistent Hashing

Consistent hashing is another data distribution method widely used in engineering. It was first used as the common data distribution algorithm for distributed hash tables (DHTs) in P2P networks. The basic idea is to compute a hash of the data or of a data feature and to treat the output range of the hash function as a closed ring, in which the maximum output value is the predecessor of the minimum value. Nodes are placed at random positions on the ring, and each node is responsible for all data whose hash falls in the range that starts at its own position and runs clockwise to the next node.


Consistent hashing requires the positions of the nodes on the ring to be managed as meta information, which is more complex than distributing data directly by hash. However, the position information depends only on the number of machines in the cluster, so the amount of meta information is usually much smaller than for distribution by data range or by data volume.

A common improvement to the basic algorithm is to introduce virtual nodes. Many virtual nodes are created when the system is first set up, and their number is generally much larger than the number of machines the cluster will ever have. The virtual nodes are evenly distributed on the consistent hash ring and play the same role as the nodes in the basic algorithm. Each real node is assigned several virtual nodes. To operate on a piece of data, the system first finds the corresponding virtual node on the ring from the hash of the data, and then finds the corresponding real node through the metadata. Virtual nodes have several advantages. First, when a node becomes unavailable, several of its virtual nodes become unavailable, so the load of the failed node is taken over by several real nodes rather than by a single neighbor. In the same way, when a new node is added it can be assigned several virtual nodes, so that it takes over load from several existing nodes. Overall, it becomes easier to keep the load balanced during capacity expansion.
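A minimal sketch of a consistent hash ring with virtual nodes follows. It makes two simplifying assumptions that differ from the description above: virtual node positions are derived by hashing `"<node>#<i>"` rather than being spaced evenly, and virtual nodes are created when a node joins rather than up front. `ConsistentHashRing` and `ring_hash` are hypothetical names, and lookups assume the ring is not empty.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Position on the ring: the first 64 bits of an MD5 digest."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, vnodes_per_node: int = 100):
        self.vnodes_per_node = vnodes_per_node
        self.positions = []  # sorted virtual-node positions
        self.owner = {}      # virtual-node position -> real node (the metadata)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes_per_node):
            pos = ring_hash(f"{node}#{i}")
            bisect.insort(self.positions, pos)
            self.owner[pos] = node

    def remove_node(self, node: str) -> None:
        self.positions = [p for p in self.positions if self.owner[p] != node]
        self.owner = {p: n for p, n in self.owner.items() if n != node}

    def node_for(self, key: str) -> str:
        # A virtual node owns the range from its own position clockwise to the
        # next one, so a key belongs to the closest position at or before it.
        # Index -1 wraps around to the last position on the ring.
        i = bisect.bisect_right(self.positions, ring_hash(key)) - 1
        return self.owner[self.positions[i]]
```

Removing a node only reassigns the keys that node owned; keys owned by the remaining nodes stay where they are, and the orphaned keys are spread over whichever nodes own the preceding virtual nodes.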

Replica and data distribution

The basic means of achieving fault tolerance and improving availability in a distributed system is to use replicas. How the replicas are distributed mainly affects the scalability of the system. A basic replication strategy takes the machine as the unit: several machines are replicas of one another, and the data on the replica machines is exactly the same. This strategy works with all of the data distribution methods above. Its advantage is simplicity; its disadvantages are inefficient data recovery and poor scalability.

A better approach is not to use the machine as the unit of replication, but to split the data into reasonably sized data segments and replicate at the level of data segments. In practice, the segments are usually kept as equal in size as possible and below a certain limit. Data segments go by many names, such as segment, fragment, chunk and partition. The choice of data segment is directly tied to the data distribution method. For hash-based distribution, each remainder bucket can be used as a data segment, and to control the segment size the number of buckets is often larger than the cluster size. Once the data is divided into segments, replicas can be managed per segment, so that replicas are no longer tied to machines and each machine can hold replicas of some data segments.


Once the replica distribution is independent of machines, recovery after data loss becomes very efficient. When the data on one machine is lost, the replicas of its data segments are spread across all the machines in the cluster rather than held on just a few replica machines, so the data can be copied and recovered from the whole cluster in parallel, and each source machine only needs to devote very few resources to the copy. Even if each source machine is rate-limited to 1 MB/s, with 100 machines participating the recovery speed can reach 100 MB/s. Machine-independent replica distribution also helps cluster fault tolerance: if a machine goes down, the replicas that were on it are scattered over the whole cluster, so the extra load naturally spreads over the whole cluster. Finally, it helps cluster expansion. In theory, if the cluster has N machines, then when a new machine is added each existing machine only needs to migrate a share of 1/N - 1/(N+1) of the data to the new machine to reach the new balance. Since data is migrated from every machine in the cluster, this is as efficient as data recovery. In practice, however, fully machine-independent replicas increase the amount of metadata to manage and make replicas harder to maintain. A compromise is to group several data segments into a data segment group and manage replicas at the granularity of the group, which keeps the replica granularity within a suitable range.
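The two figures in this paragraph can be verified with a few lines of arithmetic. The function names are hypothetical, and the model assumes perfectly balanced data and perfectly parallel recovery.

```python
from fractions import Fraction


def recovery_speed_mb_s(per_source_limit_mb_s: float, num_sources: int) -> float:
    """Aggregate recovery speed when every source machine copies in parallel."""
    return per_source_limit_mb_s * num_sources


def migrated_share_per_machine(n: int) -> Fraction:
    """Share of the total data each of n machines hands to one new machine."""
    return Fraction(1, n) - Fraction(1, n + 1)


speed = recovery_speed_mb_s(1, 100)    # 100 machines limited to 1 MB/s each
share = migrated_share_per_machine(4)  # a cluster of 4 grows to 5
```

For N = 4 each machine gives up 1/4 - 1/5 = 1/20 of the total data, and the four contributions add up to 1/5, which is exactly the fair share of the new machine.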

Localized calculations

In a distributed system, the distribution of data also strongly affects the distribution of computation. The computing node and the storage node that holds the data to be computed on can be on the same physical machine or on different ones. If they are on different machines, the data has to be transferred over the network, which is expensive, and network bandwidth can even become the bottleneck of the whole system. The alternative is to schedule the computation onto a computing node on the same physical machine as the storage node, which is called localized computing. Localized computing is an important optimization of computation scheduling and embodies an important idea in distributed scheduling: “moving computation is better than moving data”.

The choice of data distribution mode

In engineering practice, the data distribution method can be chosen according to the requirements and the implementation complexity. The methods can also be combined flexibly, which often yields the advantages of several methods and a better overall result.

For example, to solve the data skew problem, distribution by data volume can be introduced on top of distribution by hash. The data is divided according to the hash of the user ID. When the amount of data for one user ID is especially large, all of that user’s data still lands on one machine. In that case, distribution by data volume is used: the amount of data per user is counted, the data of a user that exceeds a certain threshold is cut into several uniform data segments, and these segments are distributed across the cluster. Since most users’ data volume does not exceed the threshold, only the segment distribution of the users above the threshold has to be stored in the metadata, so the size of the metadata stays under control. This combination of hash distribution and distribution by data volume has been used in a real system with good results.

2.2 basic replica protocol

A replica control protocol is a distributed protocol that controls how replica data is read and written according to a specific protocol flow, so that the replicas meet certain availability and consistency requirements. The protocol must tolerate abnormal states to some degree so that the system has a certain availability, and at the same time it must provide a certain level of consistency. According to the CAP theorem (analyzed in detail in section 2.9), it is impossible to design a replica protocol that both satisfies strong consistency and stays available under every kind of network exception. In practice, replica control protocols therefore always trade off availability, consistency and performance according to the specific requirements.

Replica control protocols can be divided into two categories: “centralized replica Control Protocol” and “decentralized replica Control Protocol”.

Centralized replica control protocol

The basic idea of a centralized replica control protocol is that a central node coordinates updates to the replica data and maintains consistency among the replicas. Figure shows the general architecture of the centralized replica protocol. The advantage of a centralized protocol is that it is relatively simple: all replica-related control is done by the central node. Concurrency control is also done by the central node, which reduces a distributed concurrency control problem to a single-machine one. Concurrency control means resolving conflicts such as “write-write” and “read-write” when several nodes need to modify replica data at the same time. On a single machine, locking is the usual method. Locking is also common for distributed concurrency control, but without a central node to manage locks uniformly, a fully distributed lock system is needed, which makes the protocol very complex. The disadvantage of a centralized protocol is that the availability of the system depends on the central node. When the central node fails or communication with it is interrupted, the system loses some services (usually at least the update service), so a centralized replica control protocol implies a certain amount of service downtime.


Primary secondary protocol

In the primary secondary protocol, replicas are divided into two categories. There is only one replica as the primary replica, and all the replicas except the primary are secondary replicas. The node that maintains the primary replica is the central node, which is responsible for data update, concurrency control and replica consistency coordination.

Primary secondary protocols generally solve four kinds of problems: data update process, data reading mode, primary replica determination and switching, and data synchronization.

Basic process of data update
  1. Data updates are coordinated by the primary node.
  2. The external node sends the update operation to the primary node
  3. The primary node performs concurrency control, which determines the order of concurrent update operations
  4. The primary node sends the update operation to the secondary node
  5. The primary node decides whether the update is successful according to the completion of the secondary node and returns the result to the external node
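The update flow above can be sketched in a single process. This is a simplified model, not a full protocol: a lock stands in for concurrency control, the primary assigns sequence numbers to fix the order of updates, and `required_acks` is an assumed policy for how many secondaries must succeed. `Replica` and `Primary` are hypothetical names.

```python
import threading


class Replica:
    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.log = []  # applied (sequence number, operation) pairs

    def apply(self, seq: int, op: str) -> bool:
        if not self.healthy:
            return False  # simulates a failed or unreachable secondary
        self.log.append((seq, op))
        return True


class Primary(Replica):
    def __init__(self, name: str, secondaries, required_acks: int):
        super().__init__(name)
        self.secondaries = secondaries
        self.required_acks = required_acks  # assumed success policy
        self._lock = threading.Lock()
        self._next_seq = 0

    def update(self, op: str) -> bool:
        # Step 3: the lock serializes concurrent updates into one order.
        with self._lock:
            seq = self._next_seq
            self._next_seq += 1
            self.apply(seq, op)
            # Step 4: forward the update to every secondary.
            acks = sum(1 for s in self.secondaries if s.apply(seq, op))
        # Step 5: decide success from the secondaries' results.
        return acks >= self.required_acks
```

Note that when `update` returns `False`, some replicas may already have applied the operation; a real protocol needs a way to repair or discard such data.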


In engineering practice, if the primary sends data to the other N replicas at the same time, the update throughput of each secondary is limited by the total outbound network bandwidth of the primary and is at most 1/N of it. To solve this problem, some systems (for example, GFS) synchronize data in relay fashion: the primary sends the update to the first secondary, the first secondary sends it to the second secondary, and so on.

Data reading mode

The way data is read is also closely tied to consistency. If only eventual consistency is required, reading any replica satisfies the requirement. If session consistency is needed, a version number can be kept for each replica and incremented after every update; when the user reads a replica, the version number is checked to ensure that the data the user reads only moves forward within the session. Achieving strong consistency with a primary-secondary protocol is harder. Two approaches are:

  1. Since the data update process is controlled by the primary, the data on the primary replica is always the latest. Therefore, if reads always go only to the primary replica, strong consistency can be achieved. If only the primary is read, the secondary replicas provide no read service. In practice, if replicas are not bound to machines but maintained per data segment, having only the primary replica serve reads does not waste machine resources in many scenarios. If the primaries are also assigned at random, each machine holds the primary replicas of some data segments and the secondary replicas of others, so every server actually serves both reads and writes.
  2. The primary node controls the availability of the secondary nodes. When the primary fails to update a secondary replica, it marks that secondary as unavailable, so that users no longer read it. The unavailable secondary can keep trying to synchronize with the primary, and once synchronization is complete the primary can mark it as available again. In this way every available replica, whether primary or secondary, is readable, and at any given time a secondary replica is either up to date with the primary or marked as unavailable, which satisfies a high consistency requirement. This approach relies on a central metadata management system to record which replicas are available and which are not. In a sense, it improves the consistency of the system by reducing its availability.
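The version-number check used for session consistency can be sketched as follows. The session remembers the highest version it has seen and refuses replicas that are older; `VersionedReplica` and `Session` are hypothetical names, and raising an error when no replica is fresh enough is just one possible policy.

```python
class VersionedReplica:
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version: int, value) -> None:
        self.version, self.value = version, value


class Session:
    """Tracks the highest replica version this session has observed."""

    def __init__(self):
        self.last_seen = 0

    def read(self, replicas):
        for replica in replicas:
            if replica.version >= self.last_seen:
                self.last_seen = replica.version
                return replica.value
        raise RuntimeError("no replica is fresh enough for this session")


fresh, stale = VersionedReplica(), VersionedReplica()
fresh.apply(2, "v2")
stale.apply(1, "v1")
session = Session()
first = session.read([fresh, stale])   # sees version 2
second = session.read([stale, fresh])  # skips the stale replica
```

A different session that has seen nothing yet would be allowed to read the stale replica, which is exactly the relaxation that session consistency permits.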

Determination and switch of primary replica

In the primary secondary protocol, another key problem is how to determine the primary replica, especially when the machine where the original primary replica is located is down and other exceptions occur, there needs to be a mechanism to switch the primary replica, so that a secondary replica becomes a new primary replica.

Generally, in a primary secondary distributed system, the information of which copy is primary belongs to meta information, which is maintained by a special metadata server. When performing the update operation, first query the metadata server to obtain the primary information of the replica, so as to further perform the data update process.

In a distributed system, reliably detecting that a node is abnormal takes a certain detection time, usually on the order of 10 seconds. This means that once the primary fails, it can take up to 10 seconds before the system even starts the primary switch. During these 10 seconds the system cannot provide update service because there is no primary, and if the system can only read from the primary replica, it cannot even provide read service during this period. This shows that the biggest disadvantage of primary-backup replica protocols is the downtime caused by switching the primary.

Data synchronization

Inconsistent secondary copies need to be synchronized with primary.

Generally, there are three types of inconsistency. First, because of network partitions and other anomalies, the data on a secondary lags behind the data on the primary. Second, under some protocols the data on a secondary may be dirty and must be discarded. Dirty data arises when the primary replica did not perform a certain update but the secondary replica performed the extra modification anyway, leaving the secondary with wrong data. Third, the secondary is a newly added replica with no data at all and has to copy the data from other replicas.

For the first case, the usual synchronization method is to replay the operation log of the primary (usually the redo log) to catch up with the primary. For dirty data, the best option is a protocol that never produces dirty data. If the protocol cannot rule it out, the probability should be made very low, so that when dirty data does occur the replica holding it can simply be discarded and treated as a replica with no data. Alternatively, methods based on an undo log can be designed to remove the dirty data. If the secondary has no data at all, the usual way is to copy the data of the primary replica directly, which is much faster than replaying the log from the start. However, the primary must keep serving updates while the data is being copied, which requires the primary replica to support snapshots: take a snapshot of the replica data at a certain moment, copy the snapshot, and after the copy is complete replay the log to catch up with the updates made since the snapshot was taken.
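Snapshot copy followed by log replay can be sketched in memory. This assumes the redo log is an append-only list of key/value writes and that a snapshot records the log position at which it was taken; `PrimaryStore` and `SecondaryStore` are hypothetical names.

```python
import copy


class PrimaryStore:
    def __init__(self):
        self.data = {}
        self.redo_log = []  # (key, value) entries in commit order

    def put(self, key, value) -> None:
        self.redo_log.append((key, value))
        self.data[key] = value

    def snapshot(self):
        """Return (log position, copy of the data) as of this moment."""
        return len(self.redo_log), copy.deepcopy(self.data)


class SecondaryStore:
    def __init__(self):
        self.data = {}
        self.applied = 0  # number of redo-log entries already applied

    def load_snapshot(self, log_position: int, data: dict) -> None:
        self.data, self.applied = dict(data), log_position

    def replay(self, redo_log) -> None:
        """Apply only the log entries written after the last applied position."""
        for key, value in redo_log[self.applied:]:
            self.data[key] = value
        self.applied = len(redo_log)


primary = PrimaryStore()
primary.put("a", 1)
primary.put("b", 2)
position, image = primary.snapshot()
primary.put("a", 3)  # the primary keeps serving updates during the copy
secondary = SecondaryStore()
secondary.load_snapshot(position, image)
after_copy = dict(secondary.data)
secondary.replay(primary.redo_log)
```

The same `replay` method also covers the first case in the text, a secondary that merely lags behind the primary.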

Decentralized replica control protocol

There is no central node in the decentralized replica control protocol, all nodes in the protocol are completely equal, and the nodes reach agreement through equal negotiation. Therefore, the decentralized protocol does not have the problem of service outage caused by the exception of the centralized node.

The biggest disadvantage of a decentralized protocol is that its flow is usually complex. Especially when strong consistency is required, the protocol becomes complex and hard to understand. Because of this complexity, the efficiency or performance of a decentralized protocol is generally lower than that of a centralized one. As a loose analogy, a centralized replica control protocol resembles an autocracy: it is efficient but depends heavily on the central node, and once the central node fails the system is badly affected. A decentralized replica control protocol resembles a democracy: decisions are negotiated collectively and efficiency is low, but the failure of individual nodes has little impact on the system as a whole.


2.3 lease mechanism

The lease mechanism is one of the most important distributed protocols and is widely used in practical distributed systems.

Distributed cache system based on lease

The background of the problem is as follows. In a distributed system there is a central server node that stores and maintains some data, namely the metadata of the system. Other nodes read and modify the metadata by accessing the central server node. Since all kinds of operations in the system depend on the metadata, if every metadata read went to the central server node, its performance would become the bottleneck of the system. A metadata cache is therefore designed to cache metadata on each node, reducing accesses to the central server node and improving performance. On the other hand, correct operation of the system strictly depends on correct metadata, which requires the cached data on each node to always be consistent with the data on the central server; the cache must never hold stale, dirty data. Finally, the cache system should cope with node downtime, network interruption and other anomalies as far as possible, and maximize the availability of the system.

To this end, a cache system is designed with the lease mechanism. The basic principle is as follows. When the central server sends data to a node, it issues a lease to that node at the same time. Each lease has a validity period, similar to the expiry date on a credit card. The validity period is usually a specific point in time, such as 12:00:10; once real time passes this point, the lease expires. The validity period of a lease therefore has nothing to do with when the node receives it, and a lease may already have expired by the time the node receives it. Here we first assume that the clocks of the central server and of each node are synchronized; the impact of unsynchronized clocks on leases is discussed in the next section. The meaning of a lease issued by the central server is: within the validity period of the lease, the central server guarantees that it will not modify the value of the corresponding data. So after receiving the data and the lease, the node adds the data to its local cache, and once the lease expires, the node deletes the corresponding cached data. When the central server wants to modify the data, it first blocks all new read requests, waits until every lease previously issued for the data has expired, and then modifies the value.

Process: a client node reads metadata through the lease-based cache

  1. Check whether the metadata is in the local cache and its lease is still within the validity period.
    1.1 Yes: return the metadata in the cache directly.
    1.2 No: ask the central server node for the metadata.
      1.2.1 After receiving the read request, the server returns the metadata together with a corresponding lease.
      1.2.2 Did the client successfully receive the data returned by the server?
        Failure or timeout: exit the process; the read has failed and can be retried.
        Success: record the metadata and its lease in memory, and return the metadata.

Process: a client node modifies metadata through the lease-based cache

  1. The node sends a metadata modification request to the server.
  2. After receiving the modification request, the server blocks all new read requests, that is, it accepts read requests but does not return data.
  3. The server waits until all leases related to the metadata have timed out.
  4. The server modifies the metadata and returns success to the client node.
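The read and modify processes above can be sketched roughly as follows. This is only a minimal single-process illustration under the synchronized-clock assumption made in the text: the names `FakeClock`, `LeaseServer` and `CachingClient` are hypothetical, the client simply shares the server's clock, and instead of blocking until the leases expire, `write` refuses to run while a lease may still be valid.

```python
class FakeClock:
    """Hypothetical shared clock, so that time can be advanced by hand."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class LeaseServer:
    """Central server: every read returns the value plus a lease expiry time."""

    def __init__(self, lease_seconds, clock):
        self.lease_seconds = lease_seconds
        self.clock = clock
        self.data = {}
        self.max_expiry = {}  # key -> latest expiry ever issued for that key

    def read(self, key):
        expiry = self.clock() + self.lease_seconds
        self.max_expiry[key] = max(self.max_expiry.get(key, 0.0), expiry)
        return self.data.get(key), expiry

    def write(self, key, value):
        # A real server would block new reads and wait; this sketch refuses.
        if self.clock() < self.max_expiry.get(key, 0.0):
            raise RuntimeError("outstanding leases have not expired yet")
        self.data[key] = value


class CachingClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}  # key -> (value, lease expiry)

    def read(self, key):
        hit = self.cache.get(key)
        if hit is not None and self.server.clock() < hit[1]:
            return hit[0]  # step 1.1: cached and lease still valid
        value, expiry = self.server.read(key)  # step 1.2: ask the server
        self.cache[key] = (value, expiry)
        return value
```

With a 10 second lease, a write attempted right after a client read is rejected, and it goes through once the clock has advanced past the lease expiry, at which point the client also stops trusting its cached copy.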

The mechanism above ensures that the cache on every node is consistent with the data on the central server. This is because the central server grants a lease to the node whenever it sends data, and during the validity period of the lease the server does not modify the data, so the client node can safely cache the data within that period. The key to the fault tolerance of this lease mechanism is the following: once the server has sent the data and the lease, no matter whether the client receives them, whether the client later goes down, or whether the network later stays normal, the server only has to wait for the lease to time out to be sure that the client node no longer caches the data. It can then modify the data safely without breaking the consistency of the cache.

The basic process above has some performance and availability problems, but it is easy to optimize. Optimization point 1: when the server modifies metadata, it must first block all new read requests, so no read service is available. The blocking prevents new leases from being issued; otherwise new client nodes would keep obtaining leases and caching the data, and the modification could never proceed, a form of “livelock”. The optimization is simple: once the server has entered the modification process, it answers a read request with data only and issues no lease. While the modification process is running, clients can therefore read the metadata but cannot cache it. A further optimization is that, while the modification process is in progress, every lease the server issues carries the latest expiry time among the leases already issued. Clients can then continue to cache metadata after the server has entered the modification process, yet the time the server waits for all leases to expire is not extended by the newly issued leases.

Finally, consider the difference between the cache mechanism and the multi-replica mechanism. They are similar in that both keep a copy of the data on multiple nodes. The cache mechanism is much simpler, however: cached data can be deleted and discarded at any time, and the only consequence of a cache miss is that the data must be read from the data source. Replicas are different: every lost replica lowers the quality of service, and once the number of replicas drops below a certain level the service is no longer available.

Analysis of lease mechanism

Definition of lease: a lease is a promise granted by the issuer for a certain validity period. Once the issuer has sent a lease, it strictly keeps the promise as long as the lease has not expired, no matter whether the receiver has received it and no matter what state the receiver is in afterwards. The receiver, for its part, may rely on the issuer’s promise within the validity period of the lease, but once the lease expires it must not continue to rely on the promise.

The lease mechanism is highly fault tolerant. First, by introducing a validity period, the lease mechanism tolerates network exceptions very well. Issuing a lease only requires one-way communication over the network: even if the receiver cannot send messages to the issuer, issuing the lease is not affected. Because the validity period of a lease is a fixed point in time, the semantics of a lease are independent of when it is sent, so the issuer can send the same lease to the receiver repeatedly. If the issuer occasionally fails to send a lease, it can simply resend it. Once the receiver has accepted the lease, the mechanism no longer depends on network communication; even if the network is completely interrupted, the lease is not affected. Second, the lease mechanism tolerates node downtime well. If the issuer goes down, it usually cannot change its earlier promise, so the correctness of the lease is not affected. After the issuer recovers, if it has recovered the earlier lease information, it can continue to keep its lease promises. If it cannot recover the lease information, it only needs to wait for one maximum lease timeout so that all leases become invalid, and the lease mechanism is not broken.

For example, in the cache system of the previous section, once the server goes down, the metadata is certainly not modified. After the server recovers, it only needs to wait for one maximum lease timeout, after which the cached information on all nodes has been cleared. If the receiver goes down, the issuer needs no extra fault-tolerance handling: it simply waits for the lease to expire and can then take back its promise, which in practice means taking back the permission or identity granted earlier. Finally, the lease mechanism does not depend on storage. The issuer may persist the leases it has issued, so that leases still within their validity period remain valid after recovery from an outage. But this is only an optimization. As analyzed above, even if the issuer has not persisted any lease information, it can invalidate all previously issued leases by waiting for one maximum lease time, which keeps the mechanism effective.

The lease mechanism depends on the validity period, which requires the clocks of the issuer and the receiver to be synchronized. On the one hand, if the issuer’s clock is slower than the receiver’s, the issuer still considers the lease valid when the receiver thinks it has expired. The receiver can handle this by applying for a new lease before the old one is due. On the other hand, if the issuer’s clock is faster than the receiver’s, the receiver still considers the lease valid when the issuer thinks it has expired. The issuer may then issue the lease to other nodes, which breaks the promise and affects the correctness of the system. For this kind of clock skew, the usual practice is to make the validity period on the issuer side slightly longer than on the receiver side; the difference only has to exceed the clock error to avoid any impact on the validity of the lease.
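The effect of the extra margin on the issuer side can be checked with a small calculation. The sketch below is an illustration only: `issue_lease`, `receiver_still_trusts`, and the parameter `receiver_lag` (how many seconds the receiver's clock lags behind the issuer's) are hypothetical names, and the numbers are arbitrary.

```python
def issue_lease(issuer_now, duration, margin):
    """Return (expiry told to the receiver, expiry the issuer itself waits for)."""
    receiver_expiry = issuer_now + duration
    issuer_expiry = receiver_expiry + margin  # issuer keeps its promise a bit longer
    return receiver_expiry, issuer_expiry


def receiver_still_trusts(issuer_time, receiver_lag, receiver_expiry):
    """Would a receiver whose clock lags by receiver_lag seconds still
    consider the lease valid when the issuer's clock shows issuer_time?"""
    receiver_time = issuer_time - receiver_lag
    return receiver_time < receiver_expiry


receiver_expiry, issuer_expiry = issue_lease(100.0, 10.0, 0.5)
# Clock error smaller than the margin: the receiver has already given up the lease.
safe = not receiver_still_trusts(issuer_expiry, 0.3, receiver_expiry)
# Clock error larger than the margin: the receiver would still trust the lease.
unsafe = receiver_still_trusts(issuer_expiry, 0.8, receiver_expiry)
```

In this example a margin of 0.5 seconds covers a clock error of 0.3 seconds but not one of 0.8 seconds, which matches the rule that the margin must exceed the clock error.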

Node state determination based on lease mechanism

Distributed protocols rely on a globally consistent view of node state: once node Q considers some node A abnormal, node A must also consider itself abnormal, so that A stops acting as the primary and the “double master” problem is avoided. There are two ways to solve this problem. The first is to design the distributed protocol so that it tolerates “double master” errors, that is, it does not depend on a globally consistent view of node state, or the globally consistent state is the result of negotiation among all nodes. The second is to use the lease mechanism. The first approach means giving up the centralized design in favor of a decentralized one, which is beyond the scope of this section. The following focuses on determining node state with the lease mechanism.

The central node sends leases to the other nodes. A node that holds a valid lease is considered able to provide service normally. In example 2.3.1, nodes A, B and C still send heartbeats periodically to report their status, and node Q sends a lease after receiving a heartbeat, which means that node Q confirms the status of nodes A, B and C and allows them to work normally within the validity period of the lease. Node Q can give the primary node a special lease indicating that it may work as the primary. When node Q wants to switch to a new primary, it only needs to wait for the lease of the previous primary to expire; it can then safely issue a new lease to the new primary without any “double master” problem.
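A rough sketch of how node Q might hand out the primary lease is given below. The class `LeaseIssuer` and its methods are hypothetical, time is passed in explicitly as `now`, and clocks are assumed to be synchronized.

```python
class LeaseIssuer:
    """Hypothetical node Q: grants the primary lease to one node at a time."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.primary = None
        self.expiry = 0.0

    def on_heartbeat(self, node, now):
        """Renew the primary lease if the heartbeat comes from the current primary."""
        if node == self.primary:
            self.expiry = now + self.lease_seconds
            return self.expiry
        return None

    def try_make_primary(self, node, now):
        """Grant the primary lease only if no other node can still hold a valid one."""
        if self.primary not in (None, node) and now < self.expiry:
            return None  # the old primary may still be acting on its lease
        self.primary = node
        self.expiry = now + self.lease_seconds
        return self.expiry
```

In this sketch a switch from A to B is refused until the last lease given to A has run out, so a real node Q would also stop renewing A's lease once it has decided to switch.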

In a real system, using a single central node to issue leases is also very risky: once that node goes down or its network becomes abnormal, no node holds a lease and the system becomes highly unavailable. For this reason, real systems use several central nodes that act as replicas of each other and form a small cluster, which is highly available and provides the lease-issuing function. Chubby and ZooKeeper are both based on this design.

The choice of validity time of lease

In engineering, a commonly chosen lease duration is 10 seconds, which is a verified empirical value. In practice it can serve as a reference when choosing a suitable duration.

2.4 quorum mechanism

First, some conventions. Update (write) operations form a sequential series, and their order is determined by some other mechanism (for example, in a primary-secondary architecture the order is decided by the primary). Each update operation is denoted w_i, where i is the monotonically increasing sequence number of the update. Every successfully executed w_i changes the replica data; the result is called a data version and is denoted v_i. Assume that each replica keeps all historical versions of the data.


Write-all-read-one (WARO) is one of the simplest replica control rules. As the name suggests, an update writes to all replicas and is considered successful only when it has succeeded on all of them. This guarantees that all replicas are consistent, so a read can take the data from any replica.

Because an update must succeed on all N replicas, a single abnormal replica makes the update fail and the update service unavailable. For the update service, although there are N replicas, the system cannot tolerate the failure of even one. On the other hand, as long as one of the N replicas is normal, the system can provide read service, so for reads the system tolerates the failure of N-1 replicas. This analysis shows that under WARO the availability of the read service is high but the availability of the update service is not: although replicas are used, the update service is no more available than with no replicas at all.

Quorum definition

Under the quorum mechanism, once an update operation w_i has succeeded on W of the N replicas, it is called a “successfully submitted update operation” and the corresponding data is called “successfully submitted data”. Let R > N-W. Because w_i has succeeded on only W replicas, a read needs to touch at most R replicas to be sure of seeing the data v_i written by w_i: since W + R > N, any set of R replicas must intersect the set of W replicas on which the update succeeded. As shown in Figure 2-10, the principle of the quorum mechanism can be illustrated with a Venn diagram.


For example, a system has five replicas, with W = 3 and R = 3. Initially the data on all five replicas is consistent and is v1. An update operation w2 then succeeds on the first three replicas, so the replicas become (v2 v2 v2 v1 v1). From then on, any set of three replicas must include v2. Setting W = N and R = 1 in the definition above gives WARO, so WARO is a special case of the quorum mechanism. The availability of the quorum mechanism can be analyzed in the same way as for WARO. Restrict the quorum parameters to W + R = N + 1. Because an update succeeds only if it succeeds on W replicas, once N-W+1 replicas are abnormal the update can never succeed on W replicas and the update service is unavailable. Likewise, once N-R+1 replicas are abnormal, a read can no longer be guaranteed to reach a set of replicas that intersects the W replicas, and the consistency of the read service degrades.
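The intersection property behind W + R > N can be checked by brute force for small N. The helper below is only an illustrative sketch with a hypothetical name, `quorums_always_intersect`; it enumerates every possible write set and read set.

```python
from itertools import combinations


def quorums_always_intersect(n, w, r):
    """True if every set of w replicas shares a replica with every set of r replicas."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )
```

For N = 5 the check passes with W = 3, R = 3 and with the WARO setting W = 5, R = 1, and it fails with W = 3, R = 2, where W + R = N.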

It should be stressed again that the quorum mechanism alone cannot guarantee strong consistency, because the quorum mechanism by itself cannot determine the latest successfully submitted version number. Unless that version number is managed as metadata by a dedicated metadata server or metadata cluster, it is hard to determine. The next section discusses the situations in which the latest successfully submitted version number can be determined through the quorum mechanism alone.

The three parameters N, W and R of the quorum mechanism control the availability of the system and are also the service commitment of the system to its users: the data has at most N replicas, but success is returned to the user as soon as the update has succeeded on W replicas. A quorum system with high consistency requirements should additionally promise never to return data that has not been successfully submitted, that is, the data read must be data that once succeeded on W replicas.

Read the latest successfully submitted data

The quorum mechanism only requires an update to succeed on W of the N replicas, and reading R replicas is then guaranteed to include the latest successfully submitted data. However, because some updates may not have succeeded, reading only R replicas does not necessarily tell which version is the latest successfully submitted one. A strongly consistent quorum system can proceed as follows, given that updates are applied strictly in sequence as agreed above. Read R replicas and take the highest version among them. If W replicas of that version have already been read, it is the latest successfully submitted data. If fewer than W have been read, say X, continue to read other replicas. If W replicas of that version are eventually read, it is the latest successfully submitted data; if it is certain that the version does not exist on W replicas, then the second-highest version among the R replicas is the latest successfully submitted one. Example: after reading (v2 v1 v1), continue to read the remaining replicas. If the remaining two replicas are (v2 v2), then v2 is the latest successfully submitted version; if they are (v2 v1) or (v1 v1), then v1 is the latest successfully submitted version; if reading either of the remaining two replicas times out or fails, it is impossible to determine which version is the latest successfully submitted one.
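The reading rule in this paragraph can be sketched as a small function. This is an illustration under stated assumptions rather than a reference implementation: the name `latest_committed` is hypothetical, versions are plain integers, `None` marks a replica whose read timed out or failed, and, like the example in the text, the sketch gives up as soon as it meets such a replica.

```python
def latest_committed(first_reads, remaining_reads, w):
    """Return the latest successfully submitted version, or None if undetermined.

    first_reads: versions seen on the first R replicas.
    remaining_reads: versions on the other replicas, read one by one;
    None stands for a read that timed out or failed.
    """
    top = max(first_reads)
    seen = list(first_reads)
    for version in remaining_reads:
        # A replica holding a newer version also holds `top` in its history.
        if sum(1 for v in seen if v >= top) >= w:
            return top
        if version is None:
            return None  # conservative rule from the text: cannot decide
        seen.append(version)
    if sum(1 for v in seen if v >= top) >= w:
        return top
    lower = [v for v in seen if v < top]
    return max(lower) if lower else None
```

With W = 3 and first reads (v2 v1 v1), the function returns 2 when the remaining replicas are (v2 v2), returns 1 when they are (v2 v1) or (v1 v1), and returns `None` when one of them cannot be read.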

So when the quorum mechanism is used alone, determining the latest successfully submitted version requires reading at most R + (W-1) = N replicas (with W + R = N + 1). As soon as any replica is abnormal, reading the latest successfully submitted version may become unavailable. In practical engineering, other technical means should be used wherever possible to avoid reading the latest successfully submitted version through the quorum mechanism. For example, when the quorum mechanism is combined with a primary-secondary control protocol, the latest successfully submitted data can be obtained by reading the primary.

Selecting primary replica based on quorum mechanism

Data can be read in different ways depending on the consistency requirement. If strongly consistent, immediate reads of the latest successfully submitted data are needed, one can simply read only the data on the primary replica, or read in the way described in the previous section. If session consistency is needed, one can read selectively from the replicas according to the version number of the data read earlier. If only weak consistency is needed, any replica can be read.

In a primary-secondary protocol, when the primary becomes abnormal a new primary must be selected, after which the secondary replicas synchronize with it. Usually a central node selects the new primary. With the quorum mechanism, the common selection method resembles the way data is read: the central node reads R replicas and selects the one with the highest version number as the new primary. The new primary starts to provide read-write service after it has completed data synchronization with at least W replicas. First, the replica with the highest version number among the R replicas must contain the latest successfully submitted data. Second, although it is not certain that the highest version number is successfully submitted data, the new primary subsequently synchronizes the data to the secondaries so that the number of replicas holding that version reaches W, which turns it into successfully submitted data.

For example, in a system with N = 5, W = 3, R = 3, the highest version numbers of the replicas at some moment are (v2 v2 v1 v1 v1). At this point v1 is the latest successfully submitted data of the system, and v2 is data in an intermediate state that has not been successfully submitted. Suppose the original primary fails at this moment and the central node performs a primary switch. Whether such “intermediate state” data is deleted as “dirty data” or becomes effective as new data after synchronization depends entirely on whether it takes part in the election of the new primary. The two cases are analyzed below.


First, as shown in Figure 2-12, if the central node successfully communicates with three of the replicas and reads the version numbers (v1 v1 v1), any of them is selected as the primary. The new primary takes v1 as the latest successfully submitted version and synchronizes with the other replicas. When it synchronizes with the first and second replicas, their version number is higher than that of the primary, so their data is dirty data and can be handled with the method for dealing with dirty data described earlier. In practice, the new primary may also start to provide data service after synchronizing with the last two replicas and then move its own version number on to v2. If the system cannot guarantee that this later v2 is exactly the same as the earlier v2, then when the new primary synchronizes with the first and second replicas it must compare not only the version number but also the actual content of the update operation.


Second, if the central node successfully communicates with three other replicas and reads the version numbers (v2 v1 v1), the replica with version v2 is selected as the new primary. Once the new primary has completed data synchronization with the other two replicas, the number of replicas holding v2 reaches W, v2 becomes the latest successfully submitted version, and the new primary can provide normal read-write service.
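The two cases can be replayed with a short sketch. The function `elect_and_sync` is a hypothetical helper: it picks the reachable replica with the highest version, copies that version to the other reachable replicas, and reports whether the version now exists on W replicas and which replicas hold newer, dirty data.

```python
def elect_and_sync(versions, reachable_ids, w):
    """versions: replica id -> highest version held; reachable_ids: the R replicas read.

    Returns (primary id, primary version, committed after sync, ids with dirty data).
    """
    primary = max(reachable_ids, key=lambda replica: versions[replica])
    target = versions[primary]
    synced = dict(versions)
    for replica in reachable_ids:
        synced[replica] = target  # the new primary synchronizes the reachable replicas
    committed = sum(1 for v in synced.values() if v == target) >= w
    dirty = sorted(r for r, v in synced.items() if v > target)
    return primary, target, committed, dirty


replicas = {0: 2, 1: 2, 2: 1, 3: 1, 4: 1}  # (v2 v2 v1 v1 v1)
case_one = elect_and_sync(replicas, [2, 3, 4], 3)
case_two = elect_and_sync(replicas, [1, 2, 3], 3)
```

In the first case the new primary holds v1 and the first two replicas are reported as dirty; in the second case the new primary holds v2, and after synchronization v2 is present on at least W replicas.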

2.5 log technology

Log technology is one of the main techniques for recovery from downtime. It was first used in database systems. Strictly speaking, log technology is not a distributed system technique, but in distributed systems practice it is widely used for downtime recovery, and systems such as BigTable even store their logs in a distributed system, which further strengthens fault tolerance.

Redo log and check point

Consider designing a high-speed single-machine query system that keeps all data in memory to serve queries quickly. Each update operation changes a small part of the data (for example, one key in a key-value store). The problem is to use log technology to make this in-memory query system recoverable after downtime. Unlike database transactions, every successful update in this model takes effect immediately. This is equivalent to each database transaction containing exactly one update operation, and each update can and must be committed immediately (auto commit).

  • Redo Log
  1. Append the result of the update operation to the log file on disk (for example, for Set K1 = 1, record K1 = 1)
  2. Modify the data in memory according to the update operation
  3. Return update success

The redo log process shows that what a redo log records is the result after the update operation (undo logs are not discussed here, but this is one of the differences from an undo log). Moreover, because log records are appended to the file sequentially, a redo log is efficient on disks and other storage devices that are strong at sequential writes.

It’s very simple to use redo log to recover from downtime. You just need to “replay” the log.

Process 2.5.2: redo log downtime recovery

  1. Read the result of each update operation from the beginning of the log file, and use these results to modify the data in memory.

The redo log recovery process also shows that only update results written to the log file can be recovered after an outage. This is why the redo log process updates the log file first and the data in memory second. If the memory were updated first, users could read the updated data immediately; if the machine then went down after the memory modification but before the log write, the last update could not be recovered even though users might already have read it, which causes inconsistency.
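The redo log update and recovery processes can be sketched as follows. This is a minimal illustration, not a production design: the class `RedoLogStore`, the one-JSON-record-per-line log format and the `demo` helper are all assumptions made for the example.

```python
import json
import os
import tempfile


class RedoLogStore:
    """In-memory key-value data protected by an append-only redo log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self.recover()

    def recover(self):
        """Replay the log from the beginning to rebuild the in-memory data."""
        self.data = {}
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                try:
                    key, value = json.loads(line)
                except ValueError:
                    break  # torn final record left by a crash in mid-append
                self.data[key] = value

    def update(self, key, value):
        # Step 1: append the result of the update to the log and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps([key, value]) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # Step 2: only then modify the data in memory.
        self.data[key] = value
        # Step 3: report success.
        return True


def demo():
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "redo.log")
        store = RedoLogStore(path)
        store.update("K1", 1)
        store.update("K1", 2)
        store.update("K2", 3)
        # Simulate an outage: build a fresh store that only has the log file.
        return RedoLogStore(path).data
```

Calling `demo()` rebuilds the state purely from the log, so the recovered data contains the last value written for each key.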

  • Check point

In the simplified model, check point technology dumps the data in memory to disk in a form that is easy to reload, so as to reduce the amount of log data that has to be replayed during downtime recovery.

Process: check point

  1. Record “begin check point” to log file
  2. Dump the data in memory to disk in a way that is easy to reload
  3. Record “end check point” in the log file.

During the check point process, data can continue to be updated according to process 2.5.1. Whether data updated during this period is dumped to disk depends on the implementation. For example, if K1 = v1 when the check point begins and an update during the check point sets K1 = v2, the value of K1 dumped to disk may be either v1 or v2.

Process: downtime recovery process based on checkpoint

  1. Load the data dumped to disk into memory.
  2. Scan the log file from back to front to find the last “end check point” record.
  3. From that “end check point” record, search backward for the nearest “begin check point” record, and replay all update records after it.
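The check point process and the recovery process based on it can be sketched together. To keep the example short, the log is a Python list and the disk is a dict, which is an assumption of this sketch; `update`, `checkpoint` and `recover` are hypothetical names.

```python
def update(log, memory, key, value):
    """Redo log update: log the result first, then change memory."""
    log.append(("set", key, value))
    memory[key] = value


def checkpoint(log, memory, disk):
    log.append(("begin_checkpoint",))
    disk["dump"] = dict(memory)  # dump in a form that is easy to reload
    log.append(("end_checkpoint",))


def recover(log, disk):
    # Step 1: load the dumped data into memory.
    memory = dict(disk.get("dump", {}))
    # Step 2: find the last "end check point" record.
    end = max(
        (i for i, record in enumerate(log) if record[0] == "end_checkpoint"),
        default=None,
    )
    # Step 3: find the nearest "begin check point" before it and replay from there.
    start = 0
    if end is not None:
        start = max(i for i in range(end) if log[i][0] == "begin_checkpoint") + 1
    for record in log[start:]:
        if record[0] == "set":
            memory[record[1]] = record[2]
    return memory
```

Because replaying a set record is idempotent, replaying a few records whose effect is already in the dump does no harm, which is why the value of a key updated during the check point may be dumped either way.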
  • No Undo/No Redo log

Now suppose the data is maintained on disk and a batch of updates consists of several update operations that must take effect atomically, that is, either all of them take effect or none does.


The 0/1 directory technique uses two directory structures, called directory 0 and directory 1, plus another structure called the master record. The directory currently in use is called the active directory, and the master record states either “use directory 0” or “use directory 1”. Directory 0 and directory 1 record the location of each piece of data in the log file. Updates in the 0/1 directory scheme are always applied to the inactive directory; only at the moment the data is to take effect is the 0/1 value in the master record flipped, which switches the active directory.

Process: 0 / 1 directory data update process

  1. Copy the active directory completely to the inactive directory.
  2. For each update operation, create a new log entry to record the value after the operation, and change the location of the corresponding data to the location of the new log entry in the inactive directory.
  3. Atomically modify the master record: flip the value in the master record so that the inactive directory takes effect.

The update process of the 0/1 directory is very simple: switching the master record between directory 0 and directory 1 makes a batch of changes take effect atomically. Through the directories, the 0/1 directory technique reduces the atomicity of a batch of operations to the atomic switch of the master record. Atomic modification of multiple records is generally hard to achieve, whereas atomic modification of a single record usually is achievable, so the implementation becomes easier. The 0/1 directory idea is widely used in engineering, and its form is not limited to the process above: it can be two in-memory data structures switched back and forth, or two file directories on disk switched back and forth.
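An in-memory version of the 0/1 directory update process might look like the sketch below. The class `ZeroOneDirectory` and the `fail_before_switch` flag, which simulates an outage just before the master record is flipped, are hypothetical and exist only for illustration.

```python
class ZeroOneDirectory:
    def __init__(self):
        self.log = []                # append-only log of values
        self.directories = [{}, {}]  # directory 0 and 1: key -> position in the log
        self.master = 0              # master record: index of the active directory

    def read(self, key):
        return self.log[self.directories[self.master][key]]

    def apply_batch(self, updates, fail_before_switch=False):
        inactive = 1 - self.master
        # Step 1: copy the active directory completely to the inactive directory.
        self.directories[inactive] = dict(self.directories[self.master])
        # Step 2: append a log entry per update and point the inactive directory at it.
        for key, value in updates.items():
            self.log.append(value)
            self.directories[inactive][key] = len(self.log) - 1
        if fail_before_switch:
            raise RuntimeError("simulated outage before the master record is flipped")
        # Step 3: flip the master record; the whole batch takes effect at once.
        self.master = inactive
```

If the batch is interrupted before step 3, readers still see the old values of every key, because they only ever look at the directory named by the master record.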

2.6 two phase submission protocol

The two-phase commit protocol is a classic strongly consistent, centralized replica control protocol. Although it has many problems in engineering practice, studying it gives a good understanding of several typical problems of distributed systems.

Process description

The two-phase commit protocol is a typical “centralized replica control” protocol. The nodes taking part are divided into two kinds: one central coordinator node and N participant nodes. Each participant node is a node that manages a database replica, as described in the background above.

The idea of two-phase commit is fairly simple. In the first phase, the coordinator asks all participants whether they can commit the transaction (asks them to vote), and every participant sends its vote to the coordinator. In the second phase, the coordinator decides from the votes whether the transaction can be committed globally and notifies all participants to carry out the decision. Within one two-phase commit process, a participant cannot change its vote. The precondition for a global commit is that all participants agree to commit the transaction; if even one participant votes to abort, the transaction must be aborted.

Process: two-phase commit coordinator process

  1. Write the local log record “begin_commit” and enter the WAIT state;
  2. Send the “prepare” message to all participants;
  3. Wait for and receive the participants’ responses to the “prepare” message;
    3.1 If a “vote abort” message is received from any participant:
      3.1.1 Write the local log record “global abort” and enter the ABORT state;
      3.1.2 Send the “global abort” message to all participants;
    3.2 If a “vote commit” message is received from all participants:
      3.2.1 Write the local log record “global commit” and enter the COMMIT state;
      3.2.2 Send the “global commit” message to all participants;
  4. Wait for and receive the participants’ confirmations of the “global abort” or “global commit” message. Once confirmations from all participants have been received, write the local log record “end_transaction”; the process ends.

Process: two-phase commit participant process

  1. Write the local log record “init” and enter the INIT state.
  2. Wait for and receive the “prepare” message sent by the coordinator.
    2.1 If the participant can commit the transaction:
      2.1.1 Write the local log record “ready” and enter the READY state.
      2.1.2 Send the “vote commit” message to the coordinator.
      2.1.3 Wait for the coordinator’s message.
        If the “global abort” message is received: write the local log record “abort”, enter the ABORT state, and send a confirmation of “global abort” to the coordinator.
        If the “global commit” message is received: write the local log record “commit”, enter the COMMIT state, and send a confirmation of “global commit” to the coordinator.
    2.2 If the participant cannot commit the transaction:
      2.2.1 Write the local log record “abort” and enter the ABORT state.
      2.2.2 Send the “vote abort” message to the coordinator.
      2.2.3 The process ends for this participant.
      2.2.4 If the coordinator’s “global abort” message is received later, the participant can still respond to it.
  3. Even after the process has ended, the participant sends the corresponding confirmation whenever it receives a “global abort” or “global commit” message from the coordinator.
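The coordinator and participant processes can be played through in a single process with direct method calls standing in for network messages. The sketch below covers only the failure-free path; `Participant` and `run_coordinator` are hypothetical names, and the log record strings follow the ones used in the process descriptions.

```python
class Participant:
    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit
        self.log = ["init"]  # local log; the last record gives the current state

    def on_prepare(self):
        if self.can_commit:
            self.log.append("ready")
            return "vote commit"
        self.log.append("abort")
        return "vote abort"

    def on_global(self, decision):
        """Handle "global commit" or "global abort" and always confirm it."""
        record = "commit" if decision == "global commit" else "abort"
        if self.log[-1] != record:
            self.log.append(record)
        return "confirm " + decision


def run_coordinator(participants):
    log = ["begin_commit"]  # coordinator is now in the WAIT state
    votes = [p.on_prepare() for p in participants]
    if all(vote == "vote commit" for vote in votes):
        decision = "global commit"
    else:
        decision = "global abort"  # a single "vote abort" aborts the transaction
    log.append(decision)
    confirmations = [p.on_global(decision) for p in participants]
    if len(confirmations) == len(participants):
        log.append("end_transaction")
    return log
```

Running the coordinator with one participant that cannot commit ends with a “global abort” record in the coordinator log and an “abort” record in every participant log.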

exception handling

Downtime recovery
  1. After the coordinator recovers from downtime, it first determines its state before the outage from the log. If the last record in the log is “begin_commit”, the coordinator was in the WAIT state before the outage. It may or may not have sent the “prepare” message, but it has certainly not sent a “global commit” or “global abort” message, so the global state of the transaction has not been decided. The coordinator can then resend the “prepare” message and continue the two-phase commit process; even if a participant has already responded to the “prepare” message, it merely retransmits its earlier response, which does not affect the consistency of the protocol. If the last record in the log is “global commit” or “global abort”, the coordinator was in the COMMIT or ABORT state before the outage. It then only needs to resend the “global commit” or “global abort” message to all participants to continue the two-phase commit process.
  2. After a participant recovers from downtime, it first determines its state before the outage from the log. If the last record in the log is “init”, the participant is in the INIT state and has not yet voted on the transaction; it can continue the process and wait for the coordinator’s “prepare” message. If the last record is “ready”, the participant is in the READY state and has already decided its vote, but it is unknown whether it sent the “vote commit” message to the coordinator before the outage. The participant can therefore resend “vote commit” to the coordinator and continue the protocol. If the last record is “commit” or “abort”, the participant has already received the coordinator’s “global commit” message (COMMIT state) or “global abort” message (ABORT state). It is unknown whether it has sent the confirmation to the coordinator, but even if it has not, the coordinator keeps resending “global commit” or “global abort”, so the participant only needs to confirm these messages when it receives them, which does not affect the global consistency of the protocol.
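The recovery rules amount to a lookup on the last log record. The two functions below are a hypothetical summary of the two items above; the returned strings are informal descriptions, and the “end_transaction” case is an assumption of this sketch, since the text does not discuss it.

```python
def coordinator_recovery_action(log):
    last = log[-1]
    if last == "begin_commit":
        return "resend prepare"  # WAIT state: no global decision was sent yet
    if last in ("global commit", "global abort"):
        return "resend " + last  # COMMIT or ABORT state: repeat the decision
    return "nothing to do"       # assumed: the transaction had already ended


def participant_recovery_action(log):
    last = log[-1]
    if last == "init":
        return "wait for prepare"    # INIT state: no vote has been cast
    if last == "ready":
        return "resend vote commit"  # READY state: the vote may have been lost
    return "confirm the global decision when it is resent"  # COMMIT or ABORT state
```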

protocol analysis

The two-phase commit protocol is rarely used in engineering practice. The main reasons are as follows:

  1. The two-phase commit protocol has poor fault tolerance. The analysis above shows that in some cases the protocol cannot make progress and the state of the process cannot be determined. In engineering, a good distributed protocol can always keep executing even when exceptions occur. Recall the lease mechanism (2.3): once a lease has been issued, whatever exception occurs, the lease server node can always decide by time whether the lease is still valid, or revoke the lease permission by waiting for the lease to time out. At no point in the lease protocol is any process blocked and unable to proceed. Compared with the simple and effective lease mechanism, the two-phase commit protocol is more complex and has poorer fault tolerance.
  2. The two-phase commit protocol performs poorly. In a successful two-phase commit, the coordinator and each participant must exchange at least four messages in two rounds: “prepare”, “vote commit”, “global commit”, and “confirm global commit”. So many interactions degrade performance. In addition, the coordinator has to wait for the votes of all participants, so a single slow participant slows down the whole global process.
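
The performance argument can be made concrete with a rough cost model. This is a sketch under simplifying assumptions that are not in the original text: the coordinator sends to all participants in parallel, each round takes one round trip, and both rounds see the same round-trip time per participant. The function names are hypothetical.

```python
from typing import Sequence

# prepare, vote commit, global commit, confirm global commit
MESSAGES_PER_PARTICIPANT = 4


def total_messages(num_participants: int) -> int:
    """Messages exchanged in one successful two-phase commit."""
    return MESSAGES_PER_PARTICIPANT * num_participants


def commit_latency(round_trip_times: Sequence[float]) -> float:
    """Latency of two rounds, each gated by the slowest participant (assumed model)."""
    slowest = max(round_trip_times)
    return 2 * slowest


print(total_messages(3))
print(commit_latency([0.01, 0.02, 0.5]))
```

In this model two fast participants do not help at all: the single participant with a 0.5 second round trip determines the latency of the whole transaction.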

Although some improved two-phase commit protocols offer better fault tolerance and performance, this family of protocols is still seldom used in engineering, and its theoretical value outweighs its practical significance.

2.7 MVCC

MVCC (multi-version concurrency control) was first proposed for database systems, but the idea is not limited to single-machine systems; it is equally effective in distributed systems.

MVCC is a concurrency control technique that works with multiple versions of the data. Its basic idea is that each transaction generates a new version of the data, and a reader obtains a complete view of transaction results by selecting the appropriate version. With MVCC, each transaction makes its updates on top of a valid base version, and transactions can run in parallel, so the versions form a graph structure.

For example, suppose the base data is at version 1 and two transactions start at the same time: transaction A and transaction B. Each makes some local modifications to the data; these modifications are visible only to the transaction itself and do not affect the real data. Transaction A commits first and generates data version 2. Transaction C is then started on top of version 2, commits, and generates data version 3. Finally, transaction B commits, and its result must be reconciled with that of transaction C. If there is no data conflict, that is, transaction B did not modify any variable modified by transaction A or transaction C, then transaction B can commit; otherwise transaction B fails to commit.

The MVCC process closely resembles that of SVN and other version control systems; put differently, such version control systems apply the MVCC idea. For a transaction to modify data locally on top of a base version without affecting the real data, there are usually two approaches. The first is to copy the data of the base version in full and then modify the copy; SVN works this way, and an SVN checkout is the copying step. The second is for each transaction to record only its update operations rather than the complete data; when the data is read, the update operations are applied to the base version to compute the result. This is similar to an incremental commit in SVN.
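
The A, B, C scenario can be replayed with a small in-memory sketch. It uses the second approach, where a transaction records only its own updates, and it detects conflicts by comparing written keys. The class names `VersionedStore` and `Transaction` and the key-level conflict rule are assumptions chosen for illustration, not the design of any particular database.

```python
class Transaction:
    def __init__(self, store, base):
        self.store = store
        self.base = base      # version number this transaction started from
        self.writes = {}      # local updates, invisible to other transactions

    def read(self, key):
        # Apply the local updates on top of the base version.
        if key in self.writes:
            return self.writes[key]
        return self.store.versions[self.base - 1][key]

    def write(self, key, value):
        self.writes[key] = value


class VersionedStore:
    def __init__(self, data):
        self.versions = [dict(data)]   # versions[i] holds data version i + 1
        self.changed = [set()]         # keys modified when creating each version

    def begin(self):
        return Transaction(self, base=len(self.versions))

    def commit(self, txn):
        # Conflict if any version committed after txn.base touched a key txn wrote.
        for keys in self.changed[txn.base:]:
            if not keys.isdisjoint(txn.writes):
                return False
        snapshot = dict(self.versions[-1])
        snapshot.update(txn.writes)
        self.versions.append(snapshot)
        self.changed.append(set(txn.writes))
        return True


store = VersionedStore({"x": 0, "y": 0, "z": 0})
a, b = store.begin(), store.begin()   # both based on version 1
a.write("x", 1)
b.write("y", 2)
committed_a = store.commit(a)         # creates version 2
c = store.begin()                     # based on version 2
c.write("z", 3)
committed_c = store.commit(c)         # creates version 3
committed_b = store.commit(b)         # B wrote only y, so there is no conflict
print(committed_a, committed_c, committed_b, store.versions[-1])
```

Had transaction B written `x` or `z` instead of `y`, its commit would have returned `False`, matching the failure case described above.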

2.8 Paxos protocol

The Paxos protocol is one of the few decentralized distributed protocols with strong consistency and high availability that have been proven in engineering practice. Its process is fairly complex, but the basic idea is easy to understand and resembles voting in human society. In Paxos there is a group of completely equivalent participating nodes (called acceptors). Each node makes a decision on an event, and a decision takes effect once more than half of the nodes approve it. As long as more than half of the nodes are working normally, Paxos works correctly and tolerates exceptions such as machine downtime and network partitions.


  • Proposer: the proposer. There can be more than one proposer, and each proposer proposes a value. A value can be any operation in a real project, such as “set a variable to a certain value” or “make a certain node the current primary”; Paxos abstracts all such operations as a value. Different proposers can propose different, even contradictory, values: one may propose “set variable x to 1” while another proposes “set variable x to 2”. Within a single Paxos instance, however, at most one value can be approved.
  • Acceptor: the approver. There are N acceptors, and a value proposed by a proposer must be approved by more than half of them (N / 2 + 1). The acceptors are completely equal and independent.
  • Learner: the learner. A learner learns the approved value by reading the value chosen by each acceptor; if a value has been approved by more than half of the acceptors, the learner learns that value. This mirrors the quorum mechanism (2.4): a value must be approved by W = N / 2 + 1 acceptors, so a learner needs to read at least N / 2 + 1 acceptors, and at most N acceptors, to learn an approved value.

These three roles are only a logical division. In practice, a single node can play all three roles at the same time.
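
The majority rule can be checked directly. This sketch assumes integer division for N / 2 + 1 and brute-forces the property that makes the rule work: any two majorities share at least one acceptor. The function names `majority`, `every_pair_of_quorums_overlaps`, and `learn` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def majority(n: int) -> int:
    """Quorum size N / 2 + 1 with integer division."""
    return n // 2 + 1


def every_pair_of_quorums_overlaps(n: int) -> bool:
    """Brute-force check that any two majorities of n acceptors intersect."""
    quorums = list(combinations(range(n), majority(n)))
    return all(set(q1) & set(q2) for q1 in quorums for q2 in quorums)


def learn(values_read, n: int):
    """Return the value held by a majority of the n acceptors, or None if there is none.

    values_read holds the values read so far; None means nothing was approved.
    """
    counts = Counter(v for v in values_read if v is not None)
    for value, count in counts.items():
        if count >= majority(n):
            return value
    return None


print(majority(5), every_pair_of_quorums_overlaps(5))
print(learn(["v1", "v1", "v1"], 5))
print(learn(["v1", "v1", None, None, "v2"], 5))
```

Because two majorities always overlap, two different values can never both be approved by more than half of the acceptors.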

protocol process

Paxos proceeds round by round, and each round has a number. A round may or may not approve a value. Once a value has been approved in some round, later rounds can only approve that same value. Together these rounds constitute one Paxos protocol instance; that is, a single Paxos instance can approve only one value, which is an important expression of the strong consistency of Paxos. Each round is divided into two phases, the preparation phase and the approval phase, and the proposer and the acceptor each have their own processing flow in these phases.

Process: proposer’s process (preparation phase)

  1. Send the message “prepare(b)” to all acceptors, where b is the Paxos round number, which increases with each round.
  2. If the message “reject(B)” is received from any acceptor, where B is the largest round number that acceptor has received, the current round of Paxos has failed for this proposer. Set the round number b to B + 1 and repeat step 1.

(approval phase; the proposer chooses differently according to the acceptor messages received)

  3. If the number of “promise(b, V_i)” messages received from acceptors reaches N / 2 + 1 (N is the total number of acceptors and the division is integer division, the same below), where V_i means that the acceptor most recently approved value V in round i:
     3.1 If V is empty in the received “promise(b, V)” messages, the proposer selects a value v and broadcasts accept(b, v) to all acceptors.
     3.2 Otherwise, among all received “promise(b, V_i)” messages, the proposer selects the value V with the largest i and broadcasts accept(b, V) to all acceptors.
  4. If nack(B) is received, set the round number b to B + 1 and repeat step 1.

Process: acceptor’s process (preparation phase)

  1. Receive the message prepare(b) from a proposer. Here B is the largest Paxos round number this acceptor has received so far, and V is the value this acceptor has approved, which may be empty.
     1.1 If b > B, reply promise(b, V_i), where i is the round in which V was approved, and set B = b; this guarantees that proposals numbered lower than b will no longer be accepted.
     1.2 Otherwise, reply reject(B).

(approval phase)

  2. Receive the message accept(b, v).
     2.1 If b < B, reply nack(B), which tells the proposer that this acceptor has already received a proposal with a larger number.
     2.2 Otherwise, set V = v, meaning that the value approved by this acceptor is v, and broadcast the accepted message.


In the basic example there are five acceptors and one proposer, and no network or downtime exceptions occur. We focus on how the variables B and V on each acceptor and the variable b on the proposer change.

  1. Initial state.
  2. The proposer sends “prepare(1)” to all acceptors; every acceptor handles it correctly and replies promise(1, null).
  3. The proposer receives five promise(1, null) messages; more than half of the promises carry an empty value, so the proposer sends accept(1, v1), where v1 is the value it has chosen.
  4. v1 has now been approved by more than half of the acceptors, so v1 is the value approved by this Paxos instance. A learner that learns the value can only learn v1.

Within the same Paxos instance, an approved value cannot be changed, even if a later proposer starts a new round with a higher round number. The core of Paxos is that “the approved value cannot be changed”, and this is the basis of the correctness of the whole protocol.
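
The basic example and the “approved value cannot be changed” rule can be replayed with a small simulation. This is a sketch under strong assumptions: everything runs in one process, messages are plain method calls that are never lost or reordered, and no node fails. The names `Acceptor`, `run_round`, and `V_round` are made up for the sketch; `V_round` stores the round in which V was approved, which the V_i notation implies.

```python
class Acceptor:
    def __init__(self):
        self.B = 0           # largest round number received so far
        self.V = None        # approved value, may be empty
        self.V_round = 0     # round in which V was approved (the i in V_i)

    def on_prepare(self, b):
        if b > self.B:
            self.B = b       # proposals numbered below b are refused from now on
            return ("promise", b, self.V, self.V_round)
        return ("reject", self.B)

    def on_accept(self, b, v):
        if b < self.B:
            return ("nack", self.B)
        self.V, self.V_round = v, b
        return ("accepted", b, v)


def run_round(acceptors, b, value):
    """Run one round as a proposer; return the approved value, or None on failure."""
    quorum = len(acceptors) // 2 + 1
    replies = [acc.on_prepare(b) for acc in acceptors]
    if any(reply[0] == "reject" for reply in replies):
        return None
    promises = [reply for reply in replies if reply[0] == "promise"]
    if len(promises) < quorum:
        return None
    previously_approved = [(r[3], r[2]) for r in promises if r[2] is not None]
    if previously_approved:
        # Re-propose the value approved in the largest earlier round.
        value = max(previously_approved, key=lambda pair: pair[0])[1]
    acks = [acc.on_accept(b, value) for acc in acceptors]
    accepted = sum(1 for ack in acks if ack[0] == "accepted")
    return value if accepted >= quorum else None


acceptors = [Acceptor() for _ in range(5)]
first = run_round(acceptors, 1, "v1")    # basic example: prepare(1), then accept(1, v1)
second = run_round(acceptors, 2, "v2")   # later proposer, higher round, wants v2
stale = run_round(acceptors, 1, "v3")    # round number too low, so it is rejected
print(first, second, stale)
```

The second proposer asks for v2 but ends up re-proposing v1, because every promise it receives already carries v1. This is how a higher round preserves the approved value instead of overwriting it.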

The Paxos protocol was deliberately designed, and its design process is also the process of deriving the protocol. Paxos uses the quorum mechanism with W = R = N / 2 + 1. In short, the protocol is the process of a proposer updating the acceptors: once a proposer has successfully updated more than half of the acceptors, the update has succeeded. A learner reads the acceptors by quorum: once a value is read successfully from more than half of the acceptors, it is an approved value. By introducing rounds, a higher-round proposal preempts a lower-round proposal, which avoids deadlock. The key point of the protocol design is how to satisfy the constraint that “only one value is approved in one instance of the Paxos algorithm”.

2.9 CAP

The definition of the CAP theory is very simple. The three letters of CAP stand for three mutually conflicting properties of a distributed system:

  • Consistency: replica consistency; in the CAP theory this specifically means strong consistency (1.3.4);
  • Availability: the system can still provide service when an exception occurs;
  • Tolerance to the partition of network: the system can tolerate network partition exceptions.

The CAP theory states that it is impossible to design a distributed protocol that has all three CAP properties at once, that is, one under which 1) the replicas are always strongly consistent, 2) the service is always available, and 3) any network partition exception can be tolerated. A distributed system protocol can only strike a compromise among the three.

The second law of thermodynamics shows that a perpetual motion machine cannot exist, so there is no point in trying to design one. Similarly, the significance of the CAP theory is that we should not attempt to design a perfect system with all three CAP properties, because such a system has been proved in theory not to exist.

  • Lease mechanism: the lease mechanism sacrifices A in some exceptional cases in exchange for complete C and good P.
  • Quorum mechanism: the quorum mechanism compromises among all three CAP properties, with some C, fairly good A, and fairly good P; it is a relatively balanced distributed protocol.
  • Two-phase commit protocol: a two-phase commit system has complete C, poor A, and poor P.
  • Paxos protocol: also a strongly consistent protocol, Paxos is much better than the two-phase commit protocol in terms of CAP. It has complete C, fairly good A, and fairly good P. The A and P properties of Paxos are similar to those of the quorum mechanism, because Paxos itself incorporates the quorum mechanism.