Distributed coordinated synchronization


The essence is to give programs distributed across different computers “team spirit”; in other words, to let the programs work together to achieve a business goal.

Distributed mutex

When a program is using a shared resource, it does not want to be disturbed by other programs. This requires that only one program can access the resource at any given time.

In a distributed system, this exclusive way of accessing resources is called distributed mutual exclusion, and the shared resources that are mutually exclusive are called critical resources.

How can programs in a distributed system access critical resources in a mutually exclusive way? There are three classic algorithms: the centralized algorithm, the distributed algorithm, and the token ring algorithm.

Centralized algorithm

We introduce a coordinator program to obtain a distributed mutual exclusion algorithm. Whenever a program needs to access a critical resource, it first sends a request to the coordinator. If no program is currently using the resource, the coordinator authorizes the requester immediately; otherwise, the requester is queued in first-come, first-served order. When a program has finished using the resource, it notifies the coordinator, which takes the first request from the queue and sends that requester an authorization message. A program that receives an authorization message can access the critical resource directly.

We call this mutual exclusion algorithm the centralized algorithm, or the central server algorithm, because the coordinator acts as a centralized program or central server.

The schematic diagram of the centralized algorithm is as follows:

As the process above shows, a program needs the following steps and message exchanges to complete one critical resource access:

  1. The program sends a request for authorization to the coordinator: one message;
  2. The coordinator sends the authorization to the program: one message;
  3. After using the critical resource, the program sends a release notice to the coordinator: one message.
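The request, grant, and release exchange described above can be sketched as a small in-memory coordinator. This is only an illustration under simplifying assumptions: the `Coordinator` class, its method names, and the `messages` counter are hypothetical, and real network communication, timeouts, and failures are not modeled.

```python
from collections import deque


class Coordinator:
    """Hypothetical in-memory coordinator for a single critical resource."""

    def __init__(self):
        self.holder = None        # program currently authorized, if any
        self.waiting = deque()    # FIFO queue of programs waiting their turn
        self.messages = 0         # messages sent to or by the coordinator

    def request(self, program):
        """Handle a request message; return True if access is granted now."""
        self.messages += 1                # request: program -> coordinator
        if self.holder is None:
            self.holder = program
            self.messages += 1            # grant: coordinator -> program
            return True
        self.waiting.append(program)      # first come, first served
        return False

    def release(self, program):
        """Handle a release message; return the next authorized program, if any."""
        if program != self.holder:
            raise ValueError(f"{program} does not hold the resource")
        self.messages += 1                # release: program -> coordinator
        self.holder = self.waiting.popleft() if self.waiting else None
        if self.holder is not None:
            self.messages += 1            # grant: coordinator -> next program
        return self.holder


coordinator = Coordinator()
granted_a = coordinator.request("A")     # resource is free, so A is granted
granted_b = coordinator.request("B")     # A holds the resource, so B is queued
next_holder = coordinator.release("A")   # B is taken from the head of the queue
coordinator.release("B")
```

With two programs each completing one access, the counter ends at 6, which matches the three messages per access listed above.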

The advantages of the centralized algorithm are clear: it is intuitive, simple, and easy to implement, and it needs few messages, because every program communicates only with the coordinator and never with other programs. However, the coordinator is also the source of the problems with this algorithm:

  • On the one hand, the coordinator becomes the performance bottleneck of the system. If 100 programs want to access the critical resource, the coordinator has to process 100 * 3 = 300 messages. In other words, the number of messages processed by the coordinator grows linearly with the number of programs that need to access the critical resource.
  • On the other hand, the coordinator is a single point of failure. If it fails, no program can access the critical resource, and the whole system becomes unavailable.

Therefore, when using the centralized algorithm, the coordinator should run on a server with good performance and high reliability.

Distributed algorithm

When a program wants to access a critical resource, it first sends a request message to every other program in the system, and it may access the resource only after receiving a consent message from all of them. The request message must include the requested resource, the ID of the requester, and the time at which the request was initiated.

This resembles democratic negotiation. In the distributed field it is called the distributed algorithm, or the algorithm using multicast and logical clocks.
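The text does not spell out how a program decides whether to consent to an incoming request. The sketch below assumes a Ricart-Agrawala style rule, in which the earlier request wins and ties are broken by requester ID. The `Request` type, the state labels, and `should_consent` are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    resource: str
    requester_id: int
    timestamp: int   # logical clock value when the request was initiated


def should_consent(state: str, own: Optional[Request], incoming: Request) -> bool:
    """Return True to send consent now, False to defer the reply.

    state is one of "idle", "waiting", "using" (hypothetical labels).
    """
    if state == "idle":
        return True      # not interested in the resource: consent immediately
    if state == "using":
        return False     # defer until this program has finished its access
    # Both programs want the resource: the earlier request wins,
    # and equal timestamps are ordered by requester ID.
    return (incoming.timestamp, incoming.requester_id) < (own.timestamp, own.requester_id)


mine = Request("file", requester_id=2, timestamp=10)
earlier = Request("file", requester_id=3, timestamp=7)
later = Request("file", requester_id=1, timestamp=12)
```

Here a program waiting with `mine` would consent to `earlier` and defer its reply to `later`, which is what produces the first-come, first-served order.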

To complete one critical resource access, a program needs the following message exchanges:

  • Sending the access request to the other n-1 programs takes n-1 messages;
  • Receiving the consent messages from the other n-1 programs takes another n-1 messages.

So a program needs at least 2 * (n-1) messages to access a critical resource successfully. If all n programs in the system need to access critical resources at the same time, 2n(n-1) messages are generated. In large-scale systems, the number of messages therefore grows quadratically with the number of programs that need access, which easily leads to a high “communication cost”.
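To make the growth concrete, the two formulas can be evaluated for a few cluster sizes. The function names are illustrative only.

```python
def messages_per_access(n: int) -> int:
    """n-1 requests to the peers plus n-1 consent replies."""
    return 2 * (n - 1)


def messages_if_all_access(n: int) -> int:
    """Total when each of the n programs performs one access."""
    return n * messages_per_access(n)


counts = {n: messages_if_all_access(n) for n in (10, 100, 1000)}
```

Under these formulas, 10 programs generate 180 messages, 100 programs generate 19,800, and 1,000 programs generate 1,998,000: a tenfold increase in programs costs roughly a hundredfold increase in messages.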

From the analysis above, the distributed algorithm, based on the “first come, first served” and unanimous vote mechanisms, lets every program access resources fairly in chronological order, and it is simple and easy to implement. However, the availability of this algorithm is very low, mainly for two reasons:

  • As more and more programs in the system need to access critical resources, a “signaling storm” easily arises: the requests a program receives far exceed its processing capacity, so that it can no longer carry out its normal business.
  • Once a program fails and cannot send its consent message, all other programs keep waiting for the reply, which stalls the whole system and makes it unavailable. Compared with a coordinator failure in the centralized algorithm, the availability of the distributed algorithm is therefore even lower.

One improvement for the low availability is to ignore a program once it is detected to have failed, instead of waiting for its consent message. It is like a cafeteria: if a person has left, you do not need to ask for their permission before using the coffee machine. However, each program then has to monitor the other programs for failures, which adds considerable complexity.

Therefore, the distributed algorithm suits systems with a small number of nodes that change infrequently. Because every program has to communicate with every other program, it also suits systems with a P2P structure, for example a distributed file system running in a LAN.

So, in our work, what kind of scenario is suitable for distributed algorithm?

Hadoop is a familiar distributed system, and file modification in HDFS is a typical application of the distributed algorithm.

As shown in the figure below, computers 1, 2 and 3 in the same LAN have backup information of the same file, and they can communicate with each other. This shared file is a critical resource. When computer 1 wants to modify the shared file, it needs to perform the following operations:

  1. Computer 1 sends a file modification request to computers 2 and 3;
  2. Computers 2 and 3 find that they do not need to use resources, so they agree to the request of computer 1;
  3. Computer 1 starts to modify the file after receiving the consent message from all other computers;
  4. After computer 1 finishes the modification, it notifies computers 2 and 3 that the file has been modified and sends them the modified file data;
  5. After receiving the new file data from computer 1, computers 2 and 3 update their local backup files.

To sum up: the distributed algorithm is a fair access mechanism based on “first come, first served” and a unanimous vote, but it has a higher communication cost and lower availability than the centralized algorithm. It suits scenarios in which critical resources are accessed infrequently and the system is small.

Token ring algorithm

All programs form a ring structure, and the token is passed between programs in a clockwise (or counterclockwise) direction. The program receiving the token has the right to access the critical resources, and after the access, the token is sent to the next program; if the program does not need to access the critical resources, the token is directly sent to the next program.

Because a program does not have to ask every other program for consent before using the critical resource, as it does in the distributed algorithm, a single program communicates more efficiently under the token ring algorithm. In addition, every program gets a chance to access the critical resource in each cycle, so the token ring algorithm is very fair.

However, every program in the ring has to receive and pass on the token whether or not it wants to access the resource, which causes some useless communication. Suppose there are 100 programs in the system: program 1 can access the resource again only after the other 99 programs have passed the token along, even if none of them needs the resource, which hurts the real-time performance of the system.
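One trip of the token around the ring can be simulated in a few lines. This is a minimal sketch assuming a static ring with no lost token and no failed program; `token_ring_round` and its parameters are hypothetical names.

```python
def token_ring_round(programs, wants_access, start=0):
    """Pass the token once around the ring, starting at index `start`.

    Returns (programs that accessed the resource, in order; number of token hops).
    """
    accessed = []
    hops = 0
    n = len(programs)
    for step in range(n):
        holder = programs[(start + step) % n]
        if holder in wants_access:
            accessed.append(holder)   # holder uses the resource, then passes the token
        hops += 1                     # the token is sent to the next program
    return accessed, hops


ring = ["P1", "P2", "P3", "P4"]
accessed, hops = token_ring_round(ring, wants_access={"P3", "P1"})

# Only one of 100 programs wants the resource, yet the token still makes 100 hops.
big_ring = [f"P{i}" for i in range(100)]
_, idle_hops = token_ring_round(big_ring, wants_access={"P0"})
```

The second call illustrates the useless communication mentioned above: the number of hops depends on the size of the ring, not on how many programs actually need the resource.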

Summary: the token ring algorithm is very fair and, once its single point of failure problem has been addressed, highly stable. It suits small systems in which every program uses the critical resource frequently and for a short time.

Is there a distributed mutex algorithm for large scale systems?

As shown above, none of the three mutual exclusion algorithms (centralized, distributed, and token ring) suits systems that are very large and have very many nodes. So what kind of mutual exclusion algorithm suits large-scale systems?

Because large-scale systems are complex, it is natural to use a relatively complex mutual exclusion algorithm. A popular choice today is the two-tier distributed token ring algorithm, which organizes the nodes of the whole WAN system into a two-tier structure. It can be used in systems with a large number of nodes, or in WAN systems.

As we know, a WAN is composed of multiple LANs, so in this algorithm the LAN is the lower tier and the WAN is the upper tier. Each LAN contains several local processes and one coordination process. The local processes logically form a ring, and each ring has a local token T passed among its local processes. The LANs communicate with each other through their coordination processes, which also form a ring: the global ring of the WAN. In this global ring, a global token is passed among the coordination processes.

Summary

In essence, we can use the combination of centralized algorithm and token ring algorithm to achieve large-scale mutual exclusion

Centralized algorithm: following the Redis cluster communication model, requests can be distributed to different masters by hashing the key, so that a large number of requests can be handled. Each master is protected against a single point of failure by a small cluster of master and slave nodes.
Distributed algorithm: consent can be considered granted once more than half of the cluster agrees, which reduces the number of messages, as in distributed election scenarios.
Token ring algorithm: participants can be weighted by how frequently they use the resource, and the next participant can be selected with a smooth weighted round-robin algorithm.
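As an illustration of the last point, the following sketch implements smooth weighted round-robin selection in the style popularized by nginx. The participant names and weights are made up, and how the weights would be derived from actual access frequency is left open.

```python
def smooth_weighted_order(weights, picks):
    """Return the order in which participants are selected.

    weights maps participant name -> positive integer weight.
    """
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    order = []
    for _ in range(picks):
        for name, weight in weights.items():
            current[name] += weight               # every participant gains its weight
        chosen = max(current, key=current.get)    # largest current value wins; first one on ties
        current[chosen] -= total                  # the winner is pushed back by the total weight
        order.append(chosen)
    return order


order = smooth_weighted_order({"a": 5, "b": 1, "c": 1}, picks=7)
# order == ["a", "a", "b", "a", "c", "a", "a"]
```

Participant `a` receives 5 of every 7 turns, but its turns are spread out rather than taken back to back, so the low-weight participants are not starved for long stretches.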

Distributed election

This raises a question: how do the multiple nodes of a cluster cooperate, and how are they managed? For example, in a database cluster, how can we ensure that the data written to each node is consistent?

You may say that this is simple: just choose a “leader” to schedule and manage the other nodes. That idea is exactly right. In the distributed field this “leader” is called the master node, and the process of choosing it is called distributed election. Then you may ask: how is the master chosen?

Why distributed elections

The master node is responsible for the coordination and management of other nodes in a distributed cluster, that is to say, other nodes must obey the arrangement of the master node.

The existence of the master node can ensure the orderly operation of other nodes and the consistency of the written data in the database cluster on each node. Consistency here means that the data in each cluster node is the same, and there is no difference.

Of course, if the master node fails, the cluster falls into chaos, just as a country falls into chaos when its emperor dies. For example, after the primary node of a database cluster fails, the data on the nodes may become inconsistent.

As the saying goes, “a country cannot be without a monarch for a single day”; correspondingly, in a distributed system, a cluster cannot be without a master for a single moment. To sum up, the role of election is to select a master node that coordinates and manages the other nodes, so as to ensure the orderly operation of the cluster and the consistency of data among nodes.

Distributed election algorithm

At present, the common methods of electing a master are election by ID (such as the Bully algorithm) and majority voting (such as the Raft and Zab algorithms).

Bully algorithm

The Bully algorithm is a domineering way to elect a cluster master. Why domineering? Because its election principle is that the “eldest” rules: among all living nodes, the node with the largest ID is selected as the master node.

In the Bully algorithm, a node has one of two roles: ordinary node or master node. At initialization, all nodes are equal and each has the right to become the master. Once the election succeeds, exactly one node is the master and all other nodes are ordinary nodes. A new master is elected if and only if the master node fails or loses contact with the other nodes.

During the election, the Bully algorithm uses three kinds of messages:

  • Election message, used to initiate an election;
  • Alive message, the response to an Election message;
  • Victory message, sent by the newly elected master to the other nodes to declare its sovereignty.

The election principle of the Bully algorithm, that the “eldest” rules, assumes that every node in the cluster knows the IDs of the other nodes. On this premise, the election proceeds as follows:

  1. Each node in the cluster checks whether its ID is the largest among the living nodes. If it is, it sends a Victory message directly to the other nodes to declare its sovereignty;
  2. If it is not the node with the largest ID, it sends an Election message to all nodes with a larger ID and waits for their replies;
  3. If the node receives no Alive message within a given time, it considers itself the master and sends a Victory message to the other nodes to declare that it has become the master; if it receives an Alive message from a node with a larger ID, it waits for that node to send a Victory message;
  4. If the node receives an Election message from a node with a smaller ID, it replies with an Alive message to tell the sender that its own ID is larger, and starts a new election.
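The four steps can be traced with a small sequential simulation. It is a simplification: in a real cluster every node that answers with Alive starts its own election concurrently, whereas here the election is simply handed to the smallest responder, and timeouts are modeled by checking membership in `alive_ids`. The function name and message tuples are hypothetical.

```python
def bully_election(initiator, all_ids, alive_ids):
    """Run one election started by `initiator`; return (leader, message log).

    Each log entry is (message type, sender ID, receiver ID).
    """
    log = []
    candidate = initiator
    while True:
        higher = [i for i in sorted(all_ids) if i > candidate]
        responders = []
        for target in higher:
            log.append(("Election", candidate, target))
            if target in alive_ids:                  # dead nodes never reply
                log.append(("Alive", target, candidate))
                responders.append(target)
        if not responders:
            break                                    # nobody larger is alive: candidate wins
        candidate = responders[0]                    # a larger node takes over the election
    for target in sorted(alive_ids):
        if target != candidate:
            log.append(("Victory", candidate, target))
    return candidate, log


# Node 5 (the old master) has failed; node 2 notices and starts an election.
leader, log = bully_election(initiator=2, all_ids={1, 2, 3, 4, 5}, alive_ids={1, 2, 3, 4})
victories = [entry for entry in log if entry[0] == "Victory"]
```

In this run node 4 wins, because it is the living node with the largest ID, and it announces its victory to nodes 1, 2, and 3.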

Actually, victory can be understood as a special kind of election

Many open source projects have adopted the Bully algorithm to elect a master, for example the replica set failover feature of MongoDB. In the distributed election of MongoDB, the ID is represented by the timestamp of the last operation on the node. The node with the latest timestamp has the largest ID; that is, the living node with the latest timestamp is the master node.

To sum up, the Bully algorithm elects in a domineering and simple way: whichever living node has the largest ID is the master, and the other nodes must obey unconditionally. The advantages of this algorithm are that elections are fast, complexity is low, and it is easy to implement.

The disadvantage is that every node needs global node information, so more extra information has to be stored. In addition, whenever a new node with an ID larger than that of the current master joins the cluster, or such a node rejoins after recovering from a failure, a re-election may be triggered and that node may become the new master. If such a node leaves and joins the cluster frequently, the master switches frequently.

Raft algorithm

The Raft algorithm is a typical majority voting algorithm. Its voting mechanism is similar to democratic voting in daily life, and its core idea is that “the minority is subordinate to the majority”. In other words, in the Raft algorithm, the node with the most votes becomes the master.

In the Raft algorithm, a cluster node takes one of three roles: leader, candidate, or follower. The election proceeds as follows:

  1. At initialization, all nodes are in the follower state.
  2. When the election starts, the nodes change from follower to candidate and send election requests to the other nodes.
  3. The other nodes reply, in the order in which they received the election requests, whether they agree that the requester becomes the master. Note that in each round of election a node can cast only one vote.
  4. If the node that initiated the election request obtains more than half of the votes, it becomes the master node and its state changes to leader, while the other nodes drop from candidate back to follower. The leader and the followers exchange heartbeat packets periodically to detect whether the master node is alive.
  5. When the term of the leader is up, that is, when other servers start the next election round, the leader is demoted to follower and a new round of election begins.
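The two rules that matter most in these steps, one vote per node per round and victory only with more than half of the votes, can be sketched as follows. This is a deliberately reduced model: the `Node` class and `run_election` are hypothetical, and the log comparison, randomized timeouts, and heartbeats of the real Raft protocol are omitted.

```python
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.voted_for = {}   # election round (term) -> candidate this node voted for

    def request_vote(self, term, candidate_id):
        """Grant the vote only if this node has not voted for anyone else in this term."""
        if term not in self.voted_for:
            self.voted_for[term] = candidate_id
        return self.voted_for[term] == candidate_id


def run_election(term, candidate_id, nodes):
    """Return True if the candidate collects votes from more than half of the nodes."""
    votes = sum(node.request_vote(term, candidate_id) for node in nodes)
    return votes > len(nodes) // 2


nodes = [Node(i) for i in range(1, 6)]
first = run_election(term=1, candidate_id=1, nodes=nodes)    # all five vote for node 1
second = run_election(term=1, candidate_id=2, nodes=nodes)   # votes for term 1 are already spent
third = run_election(term=2, candidate_id=2, nodes=nodes)    # a new term brings fresh votes
```

The second call fails because every node has already used its single vote for that round, which is the mechanism that prevents two masters from being elected in the same round.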

Note that each node can vote only once per round. This kind of election is similar to the election of deputies to the National People’s Congress: normally every deputy serves a fixed term, a re-election is triggered when the term ends, and each voter can cast the single vote in their hands for only one of the candidates. In the Raft algorithm, elections are held periodically, and each period consists of two phases: election and tenure. The election phase corresponds to voting, and the tenure phase corresponds to the term served by the node after it becomes the master. There are exceptions, however: if the master node fails, an election is launched immediately to select a new master.

Kubernetes, open-sourced by Google, is good at container management and scheduling. To ensure reliability, three nodes are usually deployed for data backup; one of the three is elected as the primary node and the other two act as secondary nodes. Kubernetes relies on the open source etcd component for master election. etcd, a highly available and strongly consistent service discovery repository, uses the Raft algorithm to implement master election and consistency.

To sum up, the Raft algorithm elects quickly, has low complexity, and is easy to implement. Its disadvantage is that every node in the system must be able to communicate with every other node, and a candidate needs more than half of the votes to be elected leader, so the communication volume is large. Its election stability is better than that of the Bully algorithm: when a new node joins or a node recovers from a failure, an election may be triggered, but the master does not necessarily switch. The master switches only if the new or recovered node obtains more than half of the votes.

Zab algorithm

The Zab (ZooKeeper Atomic Broadcast) election algorithm was designed for ZooKeeper to implement distributed coordination. Compared with the voting mechanism of the Raft algorithm, the Zab algorithm additionally uses the node ID and data ID as criteria for electing the master: the larger the node ID and data ID, the newer the data and the higher the priority to become master. Compared with the Raft algorithm, the Zab algorithm thus keeps the data as up to date as possible, and in this sense it can be regarded as an improvement on the Raft algorithm.

When using Zab algorithm, each node in the cluster has three roles

  • Leader, master node;
  • Follower, follower node;
  • Observer, observer, no vote.

During the election process, the nodes in the cluster have four states

  • Looking state, that is, the election state. A node in this state believes that the cluster currently has no leader, so it enters the election.
  • Leading state, that is, the leader state. It indicates that a leader has been elected and the current node is the leader.
  • Following state, that is, the follower state. After a master has been elected, the other non-master nodes change their state to following, which indicates that they follow the leader.
  • Observer state, that is, the observer state. It indicates that the current node is an observer, which only watches and has neither the right to vote nor the right to be elected.

In the voting process, each node has a unique triplet (server_id, server_zxID, epoch), where server_id is the unique ID of the node; server_zxID is the ID of the data stored on the node (the larger the data ID, the newer the data and the greater the election weight); and epoch is the current election round, generally represented by a logical clock.

The core of the Zab election algorithm is that “the minority is subordinate to the majority, and the node with the larger ID has priority to become the master”. During the election, a pair (vote_id, vote_zxID) indicates which node a vote is for, where vote_id is the ID of the node being voted for and vote_zxID is the server_zxID of that node. The election rule of the Zab algorithm is: the node with the largest server_zxID becomes the leader; if the server_zxID values are equal, the node with the largest server_id becomes the leader.

Next, I’ll take the cluster of three servers as an example. Here, each server represents a node, and I’ll introduce the process of Zab selection.

Step 1: when the system has just started, all three servers are in the first voting round, that is, epoch = 1, and zxID is 0. At this point, each server votes for itself and broadcasts its vote <epoch, vote_id, vote_zxID>.

Step 2: according to the judgment rules, the epoch and zxID of the three servers are the same, so the server_id values are compared and the largest one wins. Therefore, server 1 and server 2 change their vote_id to 3, update their ballot boxes, and rebroadcast their votes.

Step 3: now every server in the system has voted for server 3, so server 3 is elected leader and enters the leading state, sending heartbeat packets to the other servers to maintain the connection; server 1 and server 2 enter the following state.
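The three steps can be reproduced with a short simulation of the comparison rule. It assumes that every broadcast reaches every server and that all servers are in the same epoch; the function names are hypothetical, and a vote is stored as (epoch, vote_zxID, vote_id) so that plain tuple comparison applies the rule in the right order.

```python
from collections import Counter


def better_vote(a, b):
    """Return the preferred vote; each vote is (epoch, vote_zxID, vote_id)."""
    return max(a, b)   # compares epoch first, then data ID, then node ID


def elect(servers, epoch=1):
    """servers maps server_id -> server_zxID. Return the elected leader ID, or None."""
    # Step 1: every server votes for itself and broadcasts the vote.
    votes = {sid: (epoch, zxid, sid) for sid, zxid in servers.items()}
    # Step 2: every server adopts the best vote it has received and rebroadcasts it.
    best = max(votes.values())
    votes = {sid: better_vote(vote, best) for sid, vote in votes.items()}
    # Step 3: a server that holds more than half of the votes becomes the leader.
    tally = Counter(vote[2] for vote in votes.values())
    leader, count = tally.most_common(1)[0]
    return leader if count > len(servers) // 2 else None


fresh_cluster = elect({1: 0, 2: 0, 3: 0})     # equal zxID: the largest server_id wins
after_writes = elect({1: 5, 2: 7, 3: 6})      # the server with the newest data wins
```

The first call matches the worked example, electing server 3. In the second call server 2 wins despite its smaller ID, because its data ID is the largest.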

To sum up, the Zab algorithm has high performance and places no special requirements on the system. It sends information by broadcast: if there are n nodes in the cluster and every node broadcasts at the same time, n * (n-1) messages are in flight, which easily causes a broadcast storm. Besides voting, it also compares node IDs and data IDs, which means that it has to know the node ID and data ID of every node, so the election takes relatively long. However, the algorithm has good election stability. When a new node joins or a node recovers from a failure, an election is triggered, but the master does not necessarily switch. The master switches only if the new or recovered node has the largest data ID and node ID and obtains more than half of the votes.

Comparative analysis of three election algorithms


Knowledge expansion: why do majority algorithms usually use an odd number of nodes instead of an even number?

The core of the majority algorithm is that the minority is subordinate to the majority and the node with more votes wins. Imagine a cluster with an even number of nodes: when two nodes each get half of the votes, which one should become the master?

The answer is that in this case no master can be elected, and a new vote must be held. But even if the vote is repeated, the probability that two nodes again receive the same number of votes is high. Therefore, majority election algorithms usually use an odd number of nodes.

This is also a key reason why ZooKeeper, etcd, Kubernetes, and other open source software usually elect their masters with an odd number of nodes.
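The tie argument can be checked with simple arithmetic. The sketch below also shows a complementary observation that is not made in the text above: adding one node to an odd-sized cluster raises the number of votes needed without raising the number of failures the cluster can survive. The function names are illustrative, and the tie check assumes that all n nodes vote for one of two candidates.

```python
def majority(n: int) -> int:
    """Smallest number of votes that is more than half of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Nodes that may fail while a majority can still be formed."""
    return n - majority(n)


def two_way_tie_possible(n: int) -> bool:
    """True if n votes can be split evenly between two candidates."""
    return n % 2 == 0


summary = {n: (majority(n), tolerated_failures(n), two_way_tie_possible(n))
           for n in (3, 4, 5, 6)}
```

For example, clusters of 3 and 4 nodes both tolerate a single failure, but only the 4-node cluster can end in a 2 to 2 split.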

Distributed consensus

In essence, the distributed election problem is a traditional distributed consensus problem, solved mainly with a majority voting strategy. If an election method based on majority voting is applied to the consistency problem of distributed online bookkeeping, the bookkeeping right usually lies entirely in the hands of the master node, which makes it easy for the master node to falsify records and turns it into a performance bottleneck.

Distributed consensus is a process that makes all nodes agree on a certain state when multiple nodes can operate or record independently. Through the consensus mechanism, we can make the data of multiple nodes in the distributed system agree.

Distributed consensus technology is the core of the consensus mechanism in blockchain technology.

Knowledge expansion: what is the difference between consistency and consensus?

Consistency means that, given a series of operations across multiple nodes in a distributed system, the data or state presented to the outside is consistent under the guarantee of an agreed protocol.

Consensus refers to the process in which multiple nodes in a distributed system agree on a certain state.

In other words, consistency emphasizes the result, consensus emphasizes the process of reaching agreement, and the consensus algorithm is the core technology that allows a system to satisfy different degrees of consistency.

Distributed transaction

A distributed transaction is a transaction that runs in a distributed system and is composed of multiple local transactions.

E-commerce order processing is a typical distributed transaction.

To understand distributed transactions in depth, we first need to understand their characteristics. A distributed transaction is a combination of multiple transactions, so the ACID properties of transactions are also the basic properties of distributed transactions. The specific meaning of ACID is as follows:

  • Atomicity means that a transaction has only two possible final states: all operations executed successfully, or none executed. If any operation of the transaction fails, the whole transaction fails, and all operations are cancelled (that is, rolled back), as if the transaction had never been executed.
  • Consistency means that data integrity is preserved before and after the transaction, that is, the integrity constraints remain satisfied. For example, user A and user B have 800 yuan and 600 yuan respectively in the bank, 1400 yuan in total. User A transfers 200 yuan to user B in two steps: deducting 200 yuan from the account of A and adding 200 yuan to the account of B. Consistency means that after these steps user A has 600 yuan and user B has 800 yuan, still 1400 yuan in total; it can never happen that 200 yuan is deducted from user A while nothing is added to user B (in which case both would have 600 yuan, 1200 yuan in total).
  • Isolation means that when multiple transactions execute concurrently in the system, they do not interfere with each other; that is, the operations and data used in one transaction are isolated from other concurrent transactions.
  • Durability, also known as permanence, means that once a transaction is completed, its updates to the database are saved permanently. Even after a system crash or downtime, as long as the database can be accessed again, it can be restored to the state it had when the transaction completed.
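The bank transfer example can be tried out on a single local database. The sketch below uses the `sqlite3` module from the Python standard library with an in-memory database; the table layout and the `fail_midway` switch are made up for the demonstration, and it shows only a local transaction, not a distributed one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 800), ("B", 600)])
conn.commit()


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:   # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src))
            if fail_midway:
                raise RuntimeError("simulated crash between the two steps")
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst))
        return True
    except RuntimeError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))


failed = transfer(conn, "A", "B", 200, fail_midway=True)
after_failure = balances(conn)    # unchanged: the deduction was rolled back
ok = transfer(conn, "A", "B", 200)
after_success = balances(conn)    # A has 600, B has 800, still 1400 in total
```

The failed attempt leaves both balances untouched, which is atomicity at work, and the total stays at 1400 yuan in every observable state, which is the consistency condition from the example.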

How to implement distributed transaction

There are three basic methods for implementing distributed transactions:

  • Two-phase commit based on the XA protocol
  • Three-phase commit
  • Message-based eventual consistency

Among them, the XA-based two-phase commit method and the three-phase commit method provide strong consistency and comply with ACID, while the message-based eventual consistency method provides eventual consistency and complies with BASE theory.

Two phase commit method based on XA protocol

XA is a distributed transaction protocol that specifies the interface between the transaction manager and the resource manager. Accordingly, the XA protocol can be divided into two parts: the transaction manager and the local resource manager.

The way XA implements distributed transactions is similar to the centralized algorithm introduced earlier: the transaction manager acts as the coordinator and is responsible for committing and rolling back the local resources, while the resource managers are the participants of the distributed transaction and are usually implemented by databases. For example, Oracle, DB2, and other commercial databases all implement the XA interface.

In the two-phase commit method based on the XA protocol, the two-phase commit protocol (2PC) ensures the consistency of transaction commits in a distributed system. It is the mechanism XA uses to coordinate multiple resources in a global transaction.

To keep the nodes consistent, a coordinator is introduced to manage all nodes and to make sure that they commit their results correctly; if the commit fails, the transaction is abandoned. The specific process of the two-phase commit protocol is as follows.

The two-phase commit protocol runs in two phases: voting and commit.

  • Voting is the first phase. The coordinator (the transaction manager) sends a canCommit request to the transaction participants (the cohorts, that is, the local resource managers) and waits for their responses. On receiving the request, a participant executes the transaction operations in the request and records the log information, but does not commit. If the participant executes successfully, it sends a “Yes” message to the coordinator to indicate that it agrees to the operation; otherwise it sends a “No” message to indicate that the operation is terminated.

When all participants have returned their results (Yes or No messages), the system enters the commit phase. In this phase, the coordinator sends doCommit or doAbort instructions to the participants according to the messages returned by all participants:

  • If the coordinator receives only “Yes” messages, it sends a “doCommit” message to the participants, which complete the remaining operations, release their resources, and return a “haveCommitted” message to the coordinator;
  • If the messages received by the coordinator include a “No” message, it sends a “doAbort” message to all participants. The participants that answered “Yes” roll back the operation according to their rollback logs, and then all participants send a “haveCommitted” message to the coordinator;
  • When the coordinator has received the “haveCommitted” messages, the whole transaction is over.

The rollback operation on each individual machine must be guaranteed not to fail.

The idea of the two-phase commit algorithm can be summarized as follows: the coordinator sends the requested transaction operation, the participants report the result of the operation to the coordinator, and the coordinator decides, based on the feedback from all participants, whether the participants commit or abort the operation.
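The decision logic summarized above fits in a few lines. This is a minimal in-process sketch, not an XA implementation: the `Participant` class and `two_phase_commit` function are hypothetical, and message loss, timeouts, and coordinator failure are not modeled.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def vote(self):
        """Phase 1: execute tentatively, write the log, and answer Yes or No."""
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"   # prepared work is undone from the rollback log


def two_phase_commit(participants):
    """Return True if the transaction committed on every participant."""
    votes = [p.vote() for p in participants]   # voting phase: ask everyone
    decision = all(votes)                      # commit only on unanimous Yes
    for p in participants:                     # commit phase: broadcast the decision
        if decision:
            p.commit()
        else:
            p.abort()
    return decision


order, stock = Participant("order"), Participant("stock")
committed = two_phase_commit([order, stock])

payment, coupon = Participant("payment"), Participant("coupon", can_commit=False)
aborted = two_phase_commit([payment, coupon])
```

In the second run a single No vote from `coupon` causes `payment`, which had already prepared, to be rolled back as well.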

Although the XA-based two-phase commit algorithm basically satisfies the ACID properties of transactions, it still has some shortcomings:

  • Synchronous blocking problem: during the execution of the two-phase commit algorithm, all participating nodes block on the transaction. In other words, while one local resource manager holds a critical resource, other resource managers that want to access the same critical resource are blocked.
  • Single point of failure problem: the XA-based two-phase commit algorithm is similar to the centralized algorithm. Once the transaction manager fails, the whole system stalls. In the commit phase especially, if the transaction manager fails, the resource managers keep the transaction resources locked while waiting for the manager’s message, which blocks the whole system.
  • Data inconsistency problem: in the commit phase, after the coordinator starts sending docommit requests to the participants, a local network exception or a coordinator failure during sending may cause only some of the participants to receive the request and commit. The participants that do not receive the request cannot commit the transaction, so the data in the distributed system becomes inconsistent.
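The data inconsistency case can be made concrete with a toy simulation. The function name and the `crash_after` parameter are hypothetical knobs for this sketch; they only model a coordinator that stops sending docommit partway through the participant list.

```python
def deliver_commit(states: dict, crash_after: int) -> dict:
    """Send docommit to participants in order; the simulated coordinator
    crashes after `crash_after` messages have been sent."""
    result = dict(states)
    for sent, name in enumerate(states):
        if sent == crash_after:
            break  # coordinator fails; the remaining participants hear nothing
        result[name] = "committed"
    return result


prepared = {"p1": "prepared", "p2": "prepared", "p3": "prepared"}
after_crash = deliver_commit(prepared, crash_after=1)
# p1 has committed while p2 and p3 are still prepared and holding resources.
```

After the simulated crash, one participant has committed and the others are still waiting, which is exactly the inconsistent state described in the last bullet.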

Three-phase commit method

The three-phase commit protocol (3PC) is an improvement on 2PC. To address the synchronous blocking and data inconsistency problems of two-phase commit, three-phase commit introduces a timeout mechanism and a preparation phase.

  • A timeout mechanism is introduced in both the coordinator and the participants. If the coordinator or a participant does not receive a response from the other nodes within the specified time, it chooses to commit or abort the whole transaction according to its current state.
  • A preparation phase is inserted between the first and second phases, that is, a precommit phase is added before the commit phase. The precommit phase eliminates some inconsistencies so that the states of the participating nodes are consistent before the final commit.

In other words, in addition to introducing the timeout mechanism, 3PC splits the commit phase of 2PC into two phases, so the three-phase commit protocol has three phases: cancommit, precommit and docommit.

First, the cancommit phase.

The cancommit phase is similar to the voting phase of 2PC: the coordinator sends a cancommit request to the participants, asking whether they can execute the transaction commit operation, and then waits for their responses. A participant that receives the cancommit request replies “yes” if the transaction can be executed smoothly; otherwise, it replies “no”.
The process of successful and failed transaction requests between different nodes in the cancommit phase is as follows.

Second, the precommit phase.

According to the response of the participants, the coordinator decides whether the precommit operation can be carried out.

  • If all participants reply “yes”, the coordinator performs the pre-execution of the transaction:
    • Send precommit request. The coordinator sends a precommit request to the participants and enters the precommit phase.
    • Transaction precommit. After receiving the precommit request, the participant executes the transaction operation and records the undo and redo information in the transaction log.
    • Response feedback. If the participant executes the transaction operation successfully, it returns an ACK response and waits for the final instruction.
  • If any participant sends a “no” message to the coordinator, or the coordinator times out waiting for a participant’s response, the transaction is interrupted:
    • Send interrupt request. The coordinator sends an “abort” message to all participants.
    • Interrupt the transaction. The participant interrupts the transaction after it receives the “abort” message, or after it times out waiting for the coordinator’s message.

Third, the docommit phase.

In the docommit phase, the transaction is actually committed. According to the message sent by the coordinator in the precommit phase, the protocol enters either the commit execution phase or the transaction interruption phase.

  • Commit execution phase:
    • Send commit request. After the coordinator receives the ACK responses from all participants, it moves from the precommit state to the commit state and sends a docommit message to all participants.
    • Transaction commit. After the participant receives the docommit message, it formally commits the transaction and then releases all locked resources.
    • Response feedback. After the participant commits the transaction, it sends an ACK response to the coordinator.
    • Complete the transaction. After the coordinator receives the ACK responses from all participants, the transaction is completed.
  • Transaction interruption phase:
    • Send interrupt request. The coordinator sends an abort request to all participants.
    • Transaction rollback. After the participant receives the abort message, it uses the undo information recorded in the precommit phase to roll back the transaction and releases all locked resources.
    • Feedback results. After the participant completes the rollback, it sends an ACK message to the coordinator.
    • Interrupt the transaction. After receiving the ACK messages from the participants, the coordinator interrupts and ends the transaction.

The process of transaction execution success and failure (transaction interruption) on different nodes in the execution phase is as follows.

In the docommit phase, if a participant has sent its ACK message to the coordinator and then receives no response from the coordinator for a long time, it commits the transaction automatically by default once the timeout expires, so it is not blocked as in two-phase commit.
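The three phases and the participant-side timeout rule can be sketched as a small state machine. This is a simplified in-process model, not a full protocol implementation: the class names, the `vote_yes` flag and the `coordinator_fails_before_docommit` switch are assumptions made for illustration, and timeouts are triggered by an explicit method call instead of a real clock.

```python
from enum import Enum


class State(Enum):
    INIT = "init"
    READY = "ready"                # replied yes in the cancommit phase
    PRECOMMITTED = "precommitted"  # acknowledged the precommit request
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    def __init__(self, vote_yes: bool = True):
        self.vote_yes = vote_yes
        self.state = State.INIT

    def can_commit(self) -> bool:
        if self.vote_yes:
            self.state = State.READY
        return self.vote_yes

    def pre_commit(self) -> str:
        self.state = State.PRECOMMITTED
        return "ACK"

    def do_commit(self) -> str:
        self.state = State.COMMITTED
        return "ACK"

    def abort(self) -> str:
        self.state = State.ABORTED
        return "ACK"

    def on_timeout(self) -> None:
        # After precommit, a silent coordinator leads to a default commit;
        # before precommit, the participant interrupts the transaction.
        if self.state is State.PRECOMMITTED:
            self.state = State.COMMITTED
        elif self.state in (State.INIT, State.READY):
            self.state = State.ABORTED


def three_phase_commit(participants, coordinator_fails_before_docommit=False) -> str:
    # Phase 1: cancommit. A list is built so that every participant is asked.
    if not all([p.can_commit() for p in participants]):
        for p in participants:
            p.abort()
        return "abort"
    # Phase 2: precommit.
    for p in participants:
        p.pre_commit()
    if coordinator_fails_before_docommit:
        for p in participants:
            p.on_timeout()  # no docommit arrives, so each participant times out
        return "commit-by-timeout"
    # Phase 3: docommit.
    for p in participants:
        p.do_commit()
    return "commit"
```

With `coordinator_fails_before_docommit=True`, every participant still ends in the committed state, which shows why a participant in 3PC does not stay blocked when the coordinator disappears after precommit.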

Eventual consistency scheme based on distributed messages

2PC and 3PC share two disadvantages: both need to lock resources, which reduces system performance, and neither solves the data inconsistency problem. Hence the approach of guaranteeing the eventual consistency of transactions through distributed messages.

In eBay’s distributed system architecture, the architects’ core idea for solving the consistency problem is to execute the transactions that need to be distributed asynchronously through messages or logs. The messages or logs can be saved to local files, databases or message queues and then retried according to business rules. This is the eventual consistency scheme based on distributed messages, used here to solve the distributed transaction problem.

Transaction processing under the message-based eventual consistency scheme introduces a message middleware (message queue, MQ) to deliver messages among multiple applications. The schematic diagram of multi-node distributed transaction execution negotiated through message middleware is as follows.
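The message-based flow can also be sketched in code. The example below is a toy model under stated assumptions: a `deque` stands in for the message middleware, `FlakyInventoryService` is a hypothetical downstream application that fails a configurable number of times, and the retry rule is simply "try again until acknowledged or until `max_attempts` is reached".

```python
from collections import deque


class FlakyInventoryService:
    """Hypothetical consumer that fails its first `failures` deliveries."""
    def __init__(self, stock: int, failures: int):
        self.stock = stock
        self.failures = failures
        self.seen = set()

    def handle(self, msg: dict) -> bool:
        if self.failures > 0:
            self.failures -= 1
            return False               # simulated transient failure, no ack
        if msg["id"] not in self.seen:  # idempotent: apply each message once
            self.seen.add(msg["id"])
            self.stock -= msg["qty"]
        return True


def place_order(orders: list, queue: deque, order_id: str, qty: int) -> None:
    # Local step: record the order and enqueue the message for other services.
    orders.append(order_id)
    queue.append({"id": order_id, "qty": qty})


def drain(queue: deque, consumer, max_attempts: int = 5) -> int:
    """Deliver queued messages with retries; returns the attempts used."""
    attempts = 0
    while queue and attempts < max_attempts:
        attempts += 1
        if consumer.handle(queue[0]):
            queue.popleft()            # acknowledged, remove from the queue
    return attempts
```

Between `place_order` and a successful `drain`, the order exists while the stock is unchanged, so the two services disagree for a while; after the retries succeed they agree again, which is the behavior this scheme accepts in exchange for not locking resources.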

Rigid transaction and flexible transaction

When discussing transactions, we often mention rigid transactions and flexible transactions, but the two are easy to confuse. So, in today’s knowledge expansion, I’ll explain what rigid and flexible transactions are and how they differ.

  • A rigid transaction follows the ACID principle and has strong consistency. Database transactions are an example.
  • A flexible transaction uses different methods to achieve eventual consistency depending on the business scenario. That is, we can make trade-offs according to the characteristics of the business and tolerate data inconsistency for a certain period of time.

In summary, unlike a rigid transaction, a flexible transaction allows the data on different nodes to be inconsistent for a certain period of time, but requires eventual consistency. The eventual consistency of flexible transactions follows BASE theory.

BASE theory

BASE theory comprises basically available, soft state and eventual consistency.

  • Basically available: when the distributed system fails, it is allowed to lose the availability of some functions. For example, during the 618 shopping festival, some e-commerce platforms degrade the functions of non-core links.
  • Soft state: in flexible transactions, the system is allowed to have intermediate states that do not affect its overall availability. For example, when a database separates reads from writes, there is a delay while the write database synchronizes to the read database (the master synchronizes to the slave), and that delay is a soft state.
  • Eventual consistency: data may be inconsistent during operation because of synchronization delay, but in the final state the data is consistent.
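The read-write separation example can be illustrated with a toy store. The `ReplicatedStore` class is hypothetical, and replication is modeled as an explicit `sync()` call so that the intermediate (soft) state is easy to observe.

```python
class ReplicatedStore:
    """Toy master/replica pair; the replica is updated only when sync() runs."""
    def __init__(self):
        self.master = {}
        self.replica = {}
        self.pending = []  # writes not yet applied to the replica

    def write(self, key, value):
        self.master[key] = value
        self.pending.append((key, value))

    def read(self, key):
        return self.replica.get(key)  # reads are served by the replica

    def sync(self):
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()
```

A read issued after `write` but before `sync` still returns the old value (soft state); once `sync` has run, the replica matches the master (eventual consistency).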

It can be seen that, to support large-scale distributed systems, BASE theory obtains high availability by sacrificing strong consistency while guaranteeing eventual consistency, which weakens the ACID principle. Of the three distributed transaction implementations covered today, two-phase commit and three-phase commit follow the ACID principle, while the message-based eventual consistency scheme follows BASE theory.

Distributed lock

A lock is a marker used when multiple threads access the same shared resource concurrently; it ensures that only one thread can access the shared resource at any given time.

Unlike an ordinary lock, a distributed lock is a lock for a system deployed on multiple machines in a distributed environment, used to achieve distributed mutual exclusion among multiple processes. So that every process can see the lock, the lock is stored in shared storage (such as Redis, Memcache or a database). Multiple processes can then contend for the same critical resource concurrently while only one of them accesses it at a time, which keeps the data consistent.

Three implementation methods and comparison of distributed lock

Next, I’ll show you three mainstream methods for implementing distributed locks:

  • Distributed lock based on a database. The database here refers to a relational database;
  • Distributed lock based on a cache;
  • Distributed lock based on ZooKeeper.

Implementation of distributed lock based on Database

When we want to lock a resource, we add a record to the table. When we want to release the lock, we delete the record.

A database-based distributed lock is the easiest to understand. However, because database data must be persisted to disk, frequent database access causes high I/O overhead, so this kind of lock suits scenarios with low concurrency and low performance requirements.

A database-based distributed lock is relatively simple to implement. The straightforward way is to create a lock table and insert a record into it for the applicant. If the record is inserted successfully, the lock is obtained; when the record is deleted, the lock is released. Because this method relies on the database, it has two main disadvantages:

  • Single point of failure problem. Once the database is unavailable, the whole system crashes.
  • Deadlock problem. Records in the database have no expiration time. If the service that obtained the lock crashes without deleting its record from the lock table, the lock is never released and other processes can no longer obtain it.
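The lock table idea can be sketched with Python’s built-in `sqlite3` module standing in for a shared relational database. The table name `lock_table`, its columns and the helper names are assumptions for this sketch; uniqueness of the resource column is what makes a second insert fail.

```python
import sqlite3


def make_lock_table(conn: sqlite3.Connection) -> None:
    # The primary key makes a second insert for the same resource fail.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lock_table ("
        "resource TEXT PRIMARY KEY, owner TEXT NOT NULL)"
    )
    conn.commit()


def acquire(conn: sqlite3.Connection, resource: str, owner: str) -> bool:
    try:
        conn.execute(
            "INSERT INTO lock_table (resource, owner) VALUES (?, ?)",
            (resource, owner),
        )
        conn.commit()
        return True   # record created: lock obtained
    except sqlite3.IntegrityError:
        conn.rollback()
        return False  # record already exists: someone else holds the lock


def release(conn: sqlite3.Connection, resource: str, owner: str) -> bool:
    # Only the owner's record is deleted, so one process cannot free another's lock.
    cur = conn.execute(
        "DELETE FROM lock_table WHERE resource = ? AND owner = ?",
        (resource, owner),
    )
    conn.commit()
    return cur.rowcount == 1
```

Note that nothing in this sketch removes a record whose owner has crashed, which is the deadlock problem listed above.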

Implementation of distributed lock based on cache

Redis can implement a distributed lock with the setnx(key, value) function. Key and value are the two attributes of the cache-based distributed lock: key is the lock ID, and value = current time + timeout. In other words, after a process obtains the lock identified by key, if it does not release the lock before the time stored in value, the system releases the lock actively.

The setnx function returns 1 or 0:

  • Return 1: the server has obtained the lock. setnx sets the value corresponding to the key to the current time + the validity period of the lock.
  • Return 0: another server has already obtained the lock, and the process cannot enter the critical section. The server can keep retrying the setnx operation to obtain the lock.
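The setnx-based scheme can be sketched without a running cache server. `FakeCache` below is an in-memory stand-in that only mimics the set-if-absent, get and delete behavior; it is not a Redis client, and `try_lock` with its explicit `now` parameter is a hypothetical helper chosen so the example is deterministic.

```python
class FakeCache:
    """In-memory stand-in for a cache server (not a real Redis client)."""
    def __init__(self):
        self.data = {}

    def setnx(self, key, value) -> int:
        # Set only if the key does not exist; 1 on success, 0 otherwise.
        if key in self.data:
            return 0
        self.data[key] = value
        return 1

    def get(self, key):
        return self.data.get(key)

    def delete(self, key) -> int:
        return 1 if self.data.pop(key, None) is not None else 0


def try_lock(cache: FakeCache, key: str, now: float, timeout: float) -> bool:
    # value = current time + timeout, as described above.
    if cache.setnx(key, now + timeout) == 1:
        return True
    # The lock exists: free it if its expiry time has passed, then retry once.
    # Against a real server this check-then-delete sequence is not atomic and
    # would need an atomic primitive; it is kept simple here on purpose.
    expires_at = cache.get(key)
    if expires_at is not None and expires_at <= now:
        cache.delete(key)
        return cache.setnx(key, now + timeout) == 1
    return False
```

A caller that receives `False` would normally sleep briefly and call `try_lock` again, which corresponds to the "keep retrying" behavior for return value 0.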

Compared with the database-based scheme, the cache-based distributed lock has the following advantages:

  • Better performance. Data is stored in memory instead of on disk, which avoids frequent I/O operations.
  • Many caches can be deployed as clusters across machines, which avoids a single point of failure.
  • Many cache services provide methods that can be used to implement distributed locks, such as Redis’s setnx method.
  • The timeout can be set directly to control the release of the lock, because these cache servers generally support automatic deletion of expired data.

The disadvantage of this scheme is that controlling lock expiration through a timeout is not very reliable: a process may take a long time to execute, or may be paused by the system’s memory reclamation, so that the timeout elapses and the lock is released while the process is still using it.
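The timing problem can be checked with a few lines of arithmetic. The function and its parameters are hypothetical: process A takes the lock at time 0, and process B starts retrying at `b_arrives`.

```python
def both_in_critical_section(timeout: float, a_work: float, b_arrives: float) -> bool:
    """Return True if B obtains the lock while A is still working."""
    # A's lock disappears at explicit release or at expiry, whichever is first.
    a_lock_gone = min(a_work, timeout)
    # B obtains the lock as soon as it is free and B has arrived.
    b_acquires = max(b_arrives, a_lock_gone)
    return b_acquires < a_work
```

For example, with a timeout of 10 seconds, a task that runs for 15 seconds and a competitor arriving at second 2, the function returns `True`: the lock expires at second 10 and the competitor enters the critical section while the first process is still running.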

Implementation of distributed lock based on zookeeper

ZooKeeper implements distributed locks on top of its tree-shaped data storage structure, to solve the data consistency problem when multiple processes access the same critical resource at the same time. The tree is composed of four kinds of nodes:

  • Persistent node. This is the default node type; it remains in ZooKeeper until it is explicitly deleted.
  • Persistent sequential node. When such a node is created, ZooKeeper numbers it according to the chronological order of creation.
  • Temporary node. Unlike a persistent node, a temporary node is deleted when the client that created it disconnects from ZooKeeper.
  • Temporary sequential node. A temporary node that is numbered in chronological order of creation.

Given these characteristics, ZooKeeper-based distributed locks are built on temporary sequential nodes.

Let’s take an e-commerce site selling hair dryers as an example. Suppose that users A, B and C all submit a request to buy a hair dryer at 0:00 on November 11. ZooKeeper uses the following method to implement the distributed lock:

  1. Under the shared_lock directory, the persistent node corresponding to this method, create a temporary sequential node for each process. As shown in the figure below, the hair dryer has one shared_lock directory. When someone buys a hair dryer, a temporary sequential node is created for that user.
  2. Each process gets the list of all temporary nodes under the shared_lock directory and registers a watcher for child node changes to listen to the node.
  3. Each process determines whether its own node number is the smallest of all the child nodes under shared_lock. If it is the smallest, the process obtains the lock. For example, user A’s order reaches the server first, so a temporary sequential node locknode1 with number 1 is created. This number is the smallest in the persistent node directory, so user A obtains the distributed lock, can access the critical resource, and can purchase the hair dryer.
  4. If the temporary node number of this process is not the smallest, there are two cases: a. this process is a read request: if there is a write request among the nodes with smaller serial numbers, it waits; b. this process is a write request: if there is a request (read or write) among the nodes with smaller serial numbers, it waits.

For example, user B also wants to buy a hair dryer, but before B’s request, user C wants to check the stock of hair dryers. Therefore, user B can purchase the hair dryer only after user A has purchased one and user C has finished checking the stock.
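The decision each process makes in steps 3 and 4 can be sketched as a pure function over the child node list. This does not talk to ZooKeeper; the dictionary layout and the name `may_acquire` are assumptions, and the sketch assumes that a write request waits for every earlier request (read or write), as in the hair dryer example.

```python
def may_acquire(children: dict, name: str) -> bool:
    """children maps node name -> (sequence number, "read" or "write")."""
    seq, kind = children[name]
    earlier_kinds = [k for s, k in children.values() if s < seq]
    if not earlier_kinds:
        return True                          # smallest number: lock obtained
    if kind == "read":
        return "write" not in earlier_kinds  # reads wait only for earlier writes
    return False                             # writes wait for every earlier request


# User A buys (write), user C checks stock (read), user B buys (write).
children = {"A": (1, "write"), "C": (2, "read"), "B": (3, "write")}
```

Removing A’s node from `children` lets C proceed while B keeps waiting, and removing C’s node afterwards lets B obtain the lock, which matches the order described in the example.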

As you can see, ZooKeeper handles the various problems encountered in the design of distributed locks, such as single point of failure, non-reentrancy and deadlock. Although the distributed lock implemented with ZooKeeper covers almost all the characteristics expected of a distributed lock and is easy to implement, it needs to create and delete nodes frequently, so its performance is not as good as that of the cache-based distributed lock.

It is worth noting that “easy to implement” here compares implementations of a distributed lock with the same full set of features, which is different from the simplicity of the database-based implementation mentioned earlier. The database-based distributed lock has single point of failure and deadlock problems, and solving them with database technology alone is very complex. ZooKeeper already provides the relevant functional components, so it can easily solve the various problems encountered in the design of distributed locks. Therefore, ZooKeeper is the simplest choice for implementing a complete distributed lock without these defects.

From the above analysis, we can see that to make a distributed lock usable, the design should consider the following points:

  • Mutual exclusion: in a distributed environment, the distributed lock should ensure that a resource or a method can be operated on by only one thread or process of one machine at a time.
  • A lock expiration mechanism to prevent deadlock: even if a process crashes while holding the lock and fails to release it, subsequent processes can still obtain the lock.
  • Reentrancy: a process that holds the lock can access the critical resource multiple times before releasing it.
  • Highly available and well-performing lock acquisition and release.
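Mutual exclusion and reentrancy come down to a small piece of bookkeeping: who owns the lock and how many times they have acquired it. The class below is a single-process sketch of that bookkeeping only; the name `ReentrantLock` and the `who` identifier are assumptions, and a distributed version would keep the same two fields in shared storage.

```python
class ReentrantLock:
    def __init__(self):
        self.owner = None  # identifier of the current holder, if any
        self.count = 0     # how many times the holder has acquired the lock

    def acquire(self, who: str) -> bool:
        # Free lock, or the same holder entering again: succeed and count it.
        if self.owner is None or self.owner == who:
            self.owner = who
            self.count += 1
            return True
        return False       # held by someone else: mutual exclusion

    def release(self, who: str) -> bool:
        if self.owner != who:
            return False   # only the holder may release
        self.count -= 1
        if self.count == 0:
            self.owner = None
        return True
```

The lock becomes available to other processes only when the holder has released it as many times as it acquired it.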

How to solve the herd effect of distributed locks?

The herd effect is often encountered with distributed locks. It means that throughout the competition for a distributed lock, a large number of “watcher notification” and “get child node list” operations run repeatedly, and for most nodes the result is only to find that they are not the node with the smallest number, so they go back to waiting for the next notification instead of executing business logic.

This puts a heavy performance and network load on the ZooKeeper server. Worse, if the clients of several nodes finish their transactions at the same time, or interrupted transactions cause their nodes to disappear, the ZooKeeper server sends a large number of event notifications to the other clients within a short time.

How to solve this problem? The specific method can be divided into the following three steps.

  1. Under the persistent node directory corresponding to this method, create a temporary sequential node for each process.
  2. Each process gets the list of all temporary nodes and checks whether its own number is the smallest. If it is the smallest, it obtains the lock.
  3. If the temporary node number of this process is not the smallest, continue to judge:
    • If this process is a read request, it registers a watch on the last write request node whose serial number is smaller than its own. When it is notified that the node has released the lock, it obtains the lock;
    • If this process is a write request, it registers a watch on the last request node whose serial number is smaller than its own. When it is notified that the node has released the lock, it obtains the lock.
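The watch rule in step 3 can be sketched as a function that picks the single node a process should watch. It works on a plain dictionary instead of a ZooKeeper connection; the name `node_to_watch` is hypothetical, and returning `None` is this sketch’s way of saying that there is nothing to wait for, so the process obtains the lock (an assumption for a read request with no earlier write request).

```python
def node_to_watch(children: dict, name: str):
    """children maps node name -> (sequence number, "read" or "write").
    Returns the name of the one node to watch, or None if the lock is free."""
    seq, kind = children[name]
    # Earlier nodes, ordered by sequence number.
    earlier = sorted((s, k, n) for n, (s, k) in children.items() if s < seq)
    if kind == "read":
        # A read request only cares about earlier write requests.
        earlier = [entry for entry in earlier if entry[1] == "write"]
    return earlier[-1][2] if earlier else None


children = {
    "n1": (1, "write"),
    "n2": (2, "read"),
    "n3": (3, "read"),
    "n4": (4, "write"),
}
```

Because each process watches exactly one node, the deletion of a node wakes up only the processes that were waiting on it, instead of notifying every client under the directory.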