Seven solutions for distributed transactions


1、 What is a distributed transaction

A distributed transaction is one in which the transaction participants, the servers that support the transaction, the resource servers, and the transaction manager are located on different nodes of a distributed system. A large operation is made up of several smaller operations that are spread across different services, and either all of them succeed or none of them takes effect.

2、 Why distributed transactions are needed

For example:


A money transfer is the classic distributed transaction scenario. Suppose user A initiates an inter-bank transfer to user B: the banking system first deducts the money from user A's account and then increases the balance of user B's account. If one of the steps fails, two abnormal outcomes are possible:

  1. User A's account was debited successfully, but user B's balance was not increased.
  2. The debit from user A's account failed, but user B's balance was increased.

Neither outcome is acceptable, so a transaction is required to guarantee that the transfer either happens completely or not at all.

In a monolithic application, a single @Transactional annotation is enough to start a transaction and guarantee the atomicity of the whole operation.
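To make the single-database case concrete, here is a minimal sketch of an atomic transfer. It uses Python's built-in sqlite3 module rather than Spring's @Transactional, and the `account` table, the `transfer` function, and the `balances` helper are assumptions made up for illustration.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst inside one local transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE name = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
        return True
    except ValueError:
        return False  # the debit above has already been rolled back

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM account"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()
```

Because both updates run in the same local transaction, a failed transfer leaves both balances untouched. This guarantee is exactly what is lost once the two accounts live in different services or databases.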

However, a real application architecture is rarely a single service. A distributed microservice architecture is a typical example:


For example, an order service and an inventory-deduction service must keep their resulting states consistent with each other, which is why distributed transactions are needed.

3、 Distributed theory

CAP theorem

In a distributed system, the following three properties cannot all be satisfied at the same time:

Consistency:
Do all data replicas in the distributed system hold the same value at the same moment? (Equivalent to every node accessing the same, latest copy of the data.)

Availability:
After some nodes in the cluster fail, can the cluster as a whole still respond to read and write requests from clients? (High availability for data updates.)

Partition tolerance:
Even if a network partition or the failure of a single component prevents some nodes from communicating, the operation can still be completed.

Specifically, in any distributed database design, an application can support at most two of the three properties above. Any scale-out strategy depends on data partitioning, so partition tolerance is effectively required, and designers must choose between consistency and availability.

BASE theory

Distributed systems often pursue availability and treat it as more important than consistency. So how is high availability achieved?

The answer is BASE theory, which extends the CAP theorem. BASE stands for:

  1. Basically available
  2. Soft state
  3. Eventually consistent

BASE theory is the result of a trade-off between consistency and availability in CAP. Its core idea is that, where strong consistency cannot be achieved, each application can choose a method suited to its own business characteristics to make the system eventually consistent. The following sections describe distributed transaction solutions.

4、 Two-phase commit (2PC)

Readers familiar with MySQL should already know two-phase commit: MySQL commits transactions in two phases through its log system.

The two-phase protocol can be used in a single-machine, centralized system, where a transaction manager coordinates multiple resource managers. It can also be used in a distributed system, where a global transaction manager coordinates the local transaction manager of each subsystem to complete the two-phase commit. The protocol has two roles: in the description below, node A is the transaction coordinator and nodes B and C are the transaction participants.


The first stage: voting stage

  1. The coordinator first writes the command to its log.
  2. It sends a prepare command to the two participants, nodes B and C.
  3. After receiving the message, B and C each decide, based on their own state, whether they are able to commit.
  4. Each participant records the result in its log.
  5. Each participant returns the result to the coordinator.

The second stage: decision stage

After node A has received the responses from both participants B and C:

  1. It determines whether all participants are able to commit.
  2. If so, it writes to the log and issues the commit command; if any participant cannot commit, it writes to the log and issues the abort command.
  3. The participants receive the command issued by the coordinator and execute it.
  4. They write the executed command and its result to the log.
  5. They return the result to the coordinator.
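The two phases above can be sketched as a small in-memory simulation. This is only an illustration under simplifying assumptions: there is no network, no crash recovery, and no durable log, and the `Participant` and `Coordinator` classes and their method names are hypothetical.

```python
class Participant:
    """Hypothetical participant (node B or C) that logs each step."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"
        self.log = []

    def prepare(self):
        # Phase 1: decide locally whether a commit is possible and log it.
        vote = "yes" if self.can_commit else "no"
        self.log.append(f"prepare:{vote}")
        self.state = "prepared" if self.can_commit else "aborted"
        return vote

    def commit(self):
        self.log.append("commit")
        self.state = "committed"

    def abort(self):
        self.log.append("abort")
        self.state = "aborted"


class Coordinator:
    """Hypothetical coordinator (node A)."""

    def __init__(self, participants):
        self.participants = participants
        self.log = []

    def run(self):
        # Phase 1 (voting): log the command, then collect every vote.
        self.log.append("prepare")
        votes = [p.prepare() for p in self.participants]
        # Phase 2 (decision): commit only if every participant voted yes.
        decision = "commit" if all(v == "yes" for v in votes) else "abort"
        self.log.append(decision)
        for p in self.participants:
            if decision == "commit":
                p.commit()
            else:
                p.abort()
        return decision
```

A single "no" vote is enough to abort every participant, including those that voted "yes".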

Possible problems

  1. Single point of failure: once the transaction manager fails, the whole system becomes unavailable.
  2. Inconsistent data: in phase two, if the transaction manager has sent only some of the commit messages when the network fails, only some participants receive the commit and commit their transactions, leaving the system's data inconsistent.
  3. Long response time: the whole message chain is serial and each step waits for a response, so the protocol is not suitable for high-concurrency scenarios.
  4. Uncertainty: if the transaction manager sends a commit that only one participant receives, and that participant and the transaction manager then go down at the same time, the newly elected transaction manager cannot determine whether the commit succeeded.

5、 Three-phase commit (3PC)

Compared with 2PC, 3PC adds a CanCommit phase and a timeout mechanism. If a participant does not receive the commit request from the coordinator within a certain period, it commits automatically, which addresses the single point of failure in 2PC. However, the performance problems and the risk of inconsistency are not fundamentally solved.

Phase I: CanCommit

In this phase the coordinator simply asks the transaction participants whether they are able to complete the transaction. If all of them answer yes, the protocol enters the second phase. If any participant answers no or the wait for a response times out, the transaction is interrupted and an abort request is sent to all participants.

Phase II: PreCommit

The coordinator sends a PreCommit request to all participants. On receiving it, each participant starts executing the transaction and records the undo and redo information in its transaction log. Once a participant has executed the transaction (which is still uncommitted at this point), it returns an ACK to the coordinator to indicate that it is ready to commit, and waits for the coordinator's next instruction.

Phase III: DoCommit

If every participant node completed PreCommit in phase II, the coordinator moves from the "pre-commit" state to the "commit" state and sends a DoCommit request to all participant nodes. On receiving the request, each participant commits its transaction and returns an ACK to the coordinator. When the coordinator has received the ACK from all participants, the transaction is complete. Conversely, if any participant node fails to acknowledge PreCommit or its acknowledgement times out, the coordinator sends an abort request to all participant nodes to interrupt the transaction.
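The three phases and the participant-side timeout can be sketched as follows. This is a toy model under stated assumptions: timeouts are simulated by a flag instead of a real clock, and `Participant3PC`, `run_3pc`, and the `coordinator_lost_before_do_commit` parameter are hypothetical names.

```python
class Participant3PC:
    """Hypothetical 3PC participant."""

    def __init__(self, able=True):
        self.able = able
        self.state = "init"

    def can_commit(self):
        # Phase I: only answer whether the transaction could be completed.
        return self.able

    def pre_commit(self):
        # Phase II: execute the work and write undo/redo logs, uncommitted.
        self.state = "precommitted"
        return "ack"

    def do_commit(self):
        # Phase III: make the work permanent.
        self.state = "committed"
        return "ack"

    def abort(self):
        self.state = "aborted"

    def on_timeout(self):
        # No DoCommit arrived in time: a pre-committed participant commits.
        if self.state == "precommitted":
            self.state = "committed"


def run_3pc(participants, coordinator_lost_before_do_commit=False):
    # Phase I: CanCommit
    if not all(p.can_commit() for p in participants):
        for p in participants:
            p.abort()
        return "abort"
    # Phase II: PreCommit
    acks = [p.pre_commit() for p in participants]
    if any(ack != "ack" for ack in acks):
        for p in participants:
            p.abort()
        return "abort"
    # Phase III: DoCommit, or the timeout path if the coordinator is gone
    if coordinator_lost_before_do_commit:
        for p in participants:
            p.on_timeout()
        return "commit_by_timeout"
    for p in participants:
        p.do_commit()
    return "commit"
```

The timeout path shows why 3PC no longer blocks forever when the coordinator disappears. It also hints at the remaining inconsistency risk: a participant that times out commits even if the coordinator had meant to abort.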

6、 Compensation transaction (TCC)

TCC is a compensation mechanism. Its core idea is that, for each operation, a corresponding confirmation operation and a compensation (cancellation) operation must be registered. It is divided into three phases: Try, Confirm and Cancel.

  1. The Try phase has two main jobs: checking the business system and reserving resources.

  2. The Confirm phase confirms and commits the business operation. It starts once the Try phase has succeeded, and by default it is assumed not to fail: as long as Try succeeds, Confirm will succeed.

  3. The Cancel phase cancels the business operation and releases the reserved resources when an error occurs and a rollback is needed.

Take placing an order and deducting inventory as an example:


Execution process:

  1. Try phase: the order system sets the current order status to "paying"; the inventory system checks that the remaining inventory quantity is at least 1, and then sets the available inventory quantity to the remaining inventory quantity minus 1.
  • If the Try phase succeeds, the Confirm phase runs: the order status is changed to "payment succeeded" and the remaining inventory quantity is set to the available inventory quantity.
  • If the Try phase fails, the Cancel phase runs: the order status is changed to "payment failed" and the available inventory quantity is reset to the remaining inventory quantity.
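The order and inventory flow above can be sketched in a few lines. The `Order` and `Inventory` classes, the status strings, and the `place_order` function are hypothetical, and both services live in one process here, whereas in practice each Try, Confirm, and Cancel would be a remote call.

```python
class Inventory:
    def __init__(self, remaining):
        self.remaining = remaining  # stock actually on hand
        self.available = remaining  # stock not yet reserved

    def try_reserve(self):
        # Try: check the stock and reserve one unit.
        if self.remaining < 1:
            raise RuntimeError("out of stock")
        self.available = self.remaining - 1

    def confirm(self):
        # Confirm: make the reservation permanent.
        self.remaining = self.available

    def cancel(self):
        # Cancel: release the reservation.
        self.available = self.remaining


class Order:
    def __init__(self):
        self.status = "created"

    def try_pay(self):
        self.status = "paying"

    def confirm(self):
        self.status = "payment_succeeded"

    def cancel(self):
        self.status = "payment_failed"


def place_order(order, inventory):
    """Run Try on both services, then Confirm or Cancel both."""
    try:
        order.try_pay()
        inventory.try_reserve()
    except RuntimeError:
        order.cancel()
        inventory.cancel()
        return False
    order.confirm()
    inventory.confirm()
    return True
```

Note that every business operation needs three hand-written methods, which is where the extra code complexity of TCC comes from.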

Compared with 2PC, the TCC transaction mechanism addresses several of its disadvantages:

  1. Coordinator single point of failure: the main business service initiates and completes the business activity, and the business activity manager can itself be deployed as a multi-node cluster.

  2. Synchronous blocking: a timeout is introduced, and compensation runs after the timeout. The resource as a whole is not locked; instead, resource handling becomes part of the business logic, so the locking granularity is smaller.

  3. Data consistency: with the compensation mechanism in place, consistency is controlled by the business activity manager.

In short, TCC is a two-phase commit implemented by hand in application code. The code differs for each business scenario, and the complexity of the business code increases considerably, so this pattern cannot be reused easily.

7、 Local message table


Execution process:

  1. The message producer creates an additional message table that records the sending status of each message. The message table and the business data must be committed in one transaction, which means they must live in the same database. The message is then delivered to the consumer through MQ. If sending fails, it is sent again.

  2. The message consumer processes the message and completes its own business logic. If the failure is a business failure, the consumer can send a compensation message to the producer, telling it to roll back or take other corrective action.

  3. If the consumer's local transaction succeeds, the message has been processed successfully. If it fails, processing is retried.

  4. The producer and the consumer scan their local message tables periodically and resend any unfinished or failed messages.
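The producer side of this scheme can be sketched with sqlite3. The table layout, the `create_order` and `relay_pending` functions, and the in-memory `queue` standing in for MQ are all assumptions chosen for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
db.execute(
    "CREATE TABLE message ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, status TEXT)")

def create_order(order_id, item):
    """Write the business row and the message row in ONE local transaction."""
    with db:  # both inserts commit together or not at all
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, item))
        db.execute(
            "INSERT INTO message (payload, status) VALUES (?, 'pending')",
            (f"order_created:{order_id}",))

def relay_pending(send):
    """Scan the message table and (re)send every message still pending."""
    rows = db.execute(
        "SELECT id, payload FROM message WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for msg_id, payload in rows:
        try:
            send(payload)
        except ConnectionError:
            continue  # leave it pending; the next scan retries
        with db:
            db.execute(
                "UPDATE message SET status = 'sent' WHERE id = ?", (msg_id,))

queue = []  # stands in for the MQ

def broken_send(payload):
    raise ConnectionError("MQ unreachable")

create_order(1, "book")
relay_pending(broken_send)   # sending fails, the message stays pending
relay_pending(queue.append)  # a later scan succeeds
```

If the process crashed after `send` but before the status update, the next scan would send the same message again, so the consumer should be prepared to handle duplicates.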

8、 Message transaction

The principle of a message transaction is to decouple two transactions asynchronously through message-oriented middleware. It is somewhat similar to the local message table described above, but it relies on the mechanism of the message middleware itself. In essence, it moves the local message table into the message middleware.

Execution process:

  1. Send the prepare message to the message middleware.
  2. After it has been sent successfully, execute the local transaction.
  3. If the local transaction succeeds, commit, and the message middleware delivers the message to the consumer. If the local transaction fails, roll back, and the message middleware deletes the prepare message.
  4. The consumer receives the message and consumes it. If consumption fails, it keeps retrying.

This scheme also achieves eventual consistency. Compared with the local message table, there is no need to build a message table and no reliance on local database transactions to store the message, so the scheme is better suited to high-concurrency scenarios. At present, the only product on the market that implements this scheme is Alibaba's RocketMQ.
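The prepare, commit, and rollback steps can be sketched with a toy in-memory broker. This is not RocketMQ's API: the `Broker` class, its methods, and `send_in_transaction` are hypothetical names used only to show the control flow.

```python
class Broker:
    """Hypothetical in-memory broker that supports prepare (half) messages."""

    def __init__(self):
        self.half = {}        # prepared messages, invisible to consumers
        self.delivered = []   # messages visible to consumers
        self._next_id = 0

    def send_prepare(self, payload):
        self._next_id += 1
        self.half[self._next_id] = payload
        return self._next_id

    def commit(self, msg_id):
        # Make the prepared message visible to consumers.
        self.delivered.append(self.half.pop(msg_id))

    def rollback(self, msg_id):
        # Delete the prepared message; consumers never see it.
        self.half.pop(msg_id)


def send_in_transaction(broker, payload, local_tx):
    """Step 1: prepare. Step 2: run local_tx. Step 3: commit or roll back."""
    msg_id = broker.send_prepare(payload)
    try:
        local_tx()
    except Exception:
        broker.rollback(msg_id)
        return False
    broker.commit(msg_id)
    return True


def local_ok():
    pass  # pretend the local database transaction succeeded

def local_fail():
    raise RuntimeError("local transaction failed")
```

A real broker also has to cope with the producer crashing between the local transaction and the commit or rollback call. That case is not modelled here.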

9、 Best effort notification

The best-effort notification scheme is relatively simple to implement and suits businesses with loose requirements on eventual consistency.

Execution process:

  1. After system A has executed its local transaction, it sends a message to MQ.
  2. A dedicated service consumes the message from MQ and calls the interface of system B.
  3. If system B executes successfully, the process is complete. If system B fails, the best-effort notification service periodically calls system B again, up to N times, and gives up if it still fails.
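The retry loop in step 3 can be sketched as below. The `notify_best_effort` function, its parameters, and the `FlakySystemB` test double are hypothetical, and the fixed delay between attempts is an assumption.

```python
import time

def notify_best_effort(call_system_b, message, max_attempts=3, delay_seconds=0.0):
    """Call system B up to max_attempts times.

    Returns the attempt number that succeeded, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            call_system_b(message)
            return attempt
        except Exception:
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait before the next attempt
    return None  # give up after N attempts


class FlakySystemB:
    """Test double: fails the first `failures` calls, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.received = []

    def __call__(self, message):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("system B unavailable")
        self.received.append(message)
```

When the function returns None, the notification is simply lost, which is why the scheme only fits businesses that can tolerate that outcome or reconcile it later.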

10、 Saga transaction model (long-running transactions)

Its core idea is to split a long transaction into multiple short local transactions that are coordinated by a Saga transaction coordinator. If every step ends normally, the transaction completes normally. If a step fails, the compensation operations of the completed steps are invoked in reverse order. A distributed transaction in the Seata framework involves three roles:

  • Transaction Coordinator (TC): maintains the running state of global transactions, and coordinates and drives the commit or rollback of global transactions.
  • Transaction Manager (TM): controls the boundary of a global transaction, is responsible for starting it, and finally initiates the global commit or global rollback decision.
  • Resource Manager (RM): controls branch transactions, is responsible for branch registration and status reporting, receives instructions from the transaction coordinator, and drives the commit and rollback of branch (local) transactions.

The Seata framework maintains an undo_log table for each RM, which stores the rollback data of each local transaction.
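Independently of Seata, the Saga idea stated at the start of this section (run short local transactions in order and, on failure, compensate the completed ones in reverse order) can be sketched as follows. The `run_saga` function and the step names are hypothetical.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, call the compensations of the steps that already
    completed, in reverse order, and return False.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


trace = []

def fail_payment():
    raise RuntimeError("payment declined")

steps = [
    (lambda: trace.append("create order"), lambda: trace.append("cancel order")),
    (lambda: trace.append("reserve stock"), lambda: trace.append("release stock")),
    (fail_payment, lambda: trace.append("refund")),
]
ok = run_saga(steps)
```

The failed step itself is not compensated, because it never completed. Only the two earlier steps are undone, newest first.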

Specific process:

  1. First, TM asks TC to start a global transaction. The global transaction is created and a globally unique XID is generated.

  2. The XID is propagated in the context of the microservice call chain.

  3. RM starts executing the branch transaction. RM first parses the SQL statement and generates the corresponding undo_log record. The undo_log table records the branch ID, the global transaction ID, and the redo and undo data of the transaction execution, for use in phase-two recovery.

  4. RM executes the business SQL and the insertion of the undo_log data in the same local transaction. Before committing this local transaction, RM asks TC for a global lock on the affected record.

  5. If the lock cannot be obtained, another transaction is also operating on this record, so RM retries for a period of time. If the retries fail, RM rolls back the local transaction and reports the failure of the local transaction to TC.

  6. If RM obtains the global lock on the relevant records before committing, it commits the local transaction directly and reports the success of the local transaction to TC. The global lock is not released at this point; its release depends on whether phase two issues a commit or a rollback.

  7. TC issues the commit or rollback command to RM according to the execution results of all branch transactions.

  • If RM receives the commit command from TC, it immediately releases the global lock on the relevant records, puts the commit request into an asynchronous task queue, and immediately returns a success result to TC. When the commit request in the asynchronous queue is actually executed, all it does is delete the corresponding undo_log record.

  • If RM receives the rollback command from TC, it starts a local transaction and finds the corresponding undo_log record by XID and branch ID. It then compares the after image in the undo_log with the current data:

    • If they differ, the data has been modified by something other than the current global transaction. This situation is handled according to the configured policy.

    • If they are the same, RM generates and executes the rollback statement from the before image and the business SQL information in the undo_log, commits the local transaction to complete the rollback, and finally releases the global lock on the relevant records.
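The before-image and after-image check in phase two can be illustrated with a toy model. This is not Seata's API: a dict stands in for the database table and for the undo_log table, global locks are omitted, and `execute_branch`, `rollback_branch`, and `commit_branch` are hypothetical names.

```python
def execute_branch(table, key, new_value, undo_log, xid, branch_id):
    """Phase one: apply the update and record the before/after images."""
    before = table[key]
    table[key] = new_value
    undo_log[(xid, branch_id)] = {
        "key": key, "before": before, "after": new_value}

def commit_branch(undo_log, xid, branch_id):
    """Phase two, commit: only the undo log record has to be deleted."""
    undo_log.pop((xid, branch_id), None)
    return "committed"

def rollback_branch(table, undo_log, xid, branch_id):
    """Phase two, rollback: restore the before image if it is safe."""
    record = undo_log[(xid, branch_id)]
    if table[record["key"]] != record["after"]:
        # The row was changed outside this global transaction.
        return "needs_policy_handling"
    table[record["key"]] = record["before"]
    del undo_log[(xid, branch_id)]
    return "rolled_back"
```

For example, after `execute_branch(stock, "sku-1", 9, undo, "xid-1", 1)` on `stock = {"sku-1": 10}`, a rollback restores the value 10. If some other writer has changed the row in the meantime, the rollback refuses to overwrite it.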

11、 Summary

Distributed transactions are a hard technical problem in their own right, and the scheme to use in a given business still has to be chosen according to that business's characteristics. Distributed transactions make the process more complex and bring a lot of extra overhead: more code, more complicated business logic, and lower performance. Therefore, in real development, distributed transactions should be avoided wherever possible.