Abstract: How can the problem of distributed transactions be overcome under a microservice architecture?
What is a microservice? What are the advantages and difficulties of microservices?
What is a microservice architecture?
In short, a microservice architecture is a distributed system that is divided into independent service units along business lines. It addresses the shortcomings of monolithic systems and can meet increasingly complex business requirements. Each microservice focuses on doing one task and doing it well.
Advantages of microservice architecture
- Complex business logic is divided into several smaller domains, each implemented as a service, which simplifies complex problems. This helps the division of labor and reduces the learning cost for newcomers.
- A microservice system is a distributed system in which business domains are fully decoupled. As the business grows, services can be subdivided further by business, which gives strong horizontal scalability.
- Services communicate over HTTP and are fully independent, so each service can choose the programming language and database that suit its business scenario.
- Services are deployed independently. Modifying and deploying one service has no impact on other services.
Although microservices have the advantages above, microservice practice is still at an exploratory stage. Many small and medium-sized Internet companies find microservices difficult to implement because of limited experience, technical strength and other problems. Chris Richardson, a well-known architect, has pointed out that the main difficulties of microservices at present are as follows:
- After a single application is split into a distributed system, the communication mechanism and fault handling measures between processes become more complex.
- After a system is split into microservices, a seemingly simple function may need to call multiple services and operate on multiple databases internally, so the distributed transaction problem across service calls becomes very prominent.
- With a large number of microservices, testing, deployment and monitoring become more difficult.
As RPC frameworks have matured, the first problem has gradually been solved. For example, Dubbo supports a variety of communication protocols, and Spring Cloud supports RESTful calls well. For the third problem, with the development of Docker and DevOps technologies and the launch of automated operation and maintenance tools on public cloud PaaS platforms, the testing, deployment, and operation and maintenance of microservices are becoming easier and easier.
For the second problem, there is still no general solution to the transaction problem caused by microservices. Distributed transactions have become the biggest obstacle to adopting microservices and the most challenging technical problem. Next, we discuss various solutions for distributed transactions under the microservice architecture.
How to overcome the problem of distributed transaction under microservice architecture?
What is a transaction
A transaction is a logical processing unit composed of a group of SQL statements. A transaction has the following four properties, usually referred to as the ACID properties:
Atomicity: a transaction is an atomic unit of work. Either all of its modifications to the data are performed, or none of them are.
Consistency: data must be consistent both when a transaction begins and when it completes. This means that all relevant data rules must be applied to the transaction's modifications to maintain data integrity.
Isolation: the database system provides an isolation mechanism to ensure that transactions execute in an "independent" environment that is not affected by external concurrent operations. The isolation levels of database transactions, from low to high, are read uncommitted, read committed, repeatable read and serializable.
Durability: after a transaction completes, its data modifications are permanent and survive even a system failure.
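As a minimal sketch of atomicity and consistency in a local transaction, the following uses Python's built-in sqlite3 module. The `account` table, the `transfer` function and the "no negative balance" rule are assumptions made for illustration; they are not part of the original article.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
            # Illustrative consistency rule: no balance may become negative.
            (bad,) = conn.execute("SELECT COUNT(*) FROM account WHERE balance < 0").fetchone()
            if bad:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False  # both updates were rolled back together

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # commits
failed = transfer(conn, "alice", "bob", 500)  # violates the rule, so nothing is applied
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

The second transfer leaves both balances untouched, which is the "all or nothing" behaviour that becomes hard to keep once the two updates live in different services.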
Typical scenarios for distributed transactions:
Bank transfers are a typical distributed transaction scenario. They usually fall into the following three cases:
A. Intra-branch transfer: transfer within the same branch of the same bank
B. Intra-bank transfer: transfer between different branches of the same bank
C. Inter-bank transfer: transfer between different bank systems
In a traditional centralized architecture, A and B are usually local transactions, and C is a distributed transaction. After the business is converted to microservices, transfer-in and transfer-out are usually different microservices, and the same microservice usually runs as several instances. A may become a distributed transaction, or this may be avoided by some means so that it completes in a local transaction. B and C are difficult to avoid and can only be distributed transactions.
Microservice best practice suggests avoiding distributed transactions as much as possible, but in many business scenarios (such as the B and C transfer scenarios above), distributed transactions are an unavoidable technical problem.
Common solutions for distributed transactions
To solve the consistency problem of distributed systems, practitioners have distilled many typical protocols and algorithms from repeatedly weighing performance against data consistency. Among them, the most commonly used is the two-phase commit protocol.
Two-phase commit scheme
Transaction middleware and databases use two-phase commit through the XA interface specification to complete a global transaction. The XA specification is based on the two-phase commit protocol.
The first phase is the voting phase, in which every participant reports to the coordinator whether its part of the transaction can succeed. The second phase is the execution phase, in which the coordinator, based on the feedback from all participants, notifies every participant to commit or roll back, so that all branches act consistently.
The two-phase commit scheme is widely used. Typical commercial software includes Oracle Tuxedo and IBM CICS. Its advantage is low intrusion into business code, but its disadvantages are also obvious:
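The two phases can be sketched in a few lines of Python. This is a simplified in-memory illustration, not an XA implementation: the `Participant` class, its `can_commit` flag and the `two_phase_commit` function are hypothetical names, and a real resource would lock and persist its changes during the prepare step.

```python
class Participant:
    """Hypothetical branch resource; `can_commit` simulates its vote."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: report whether this branch is able to commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def two_phase_commit(participants):
    """Coordinator logic: collect votes, then apply one decision everywhere."""
    # Phase 1 (voting): every participant is asked, even after a "no" vote.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2 (execution): all branches commit, or all branches roll back.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.rollback()
    return "committed" if decision else "rolled_back"
```

A single "no" vote is enough to roll back every branch, and between the two phases each prepared participant simply waits for the coordinator's decision.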
Low performance: because of the characteristics of the XA protocol, transaction resources are not released for a long time, locks are held for a long period, and the application layer has no way to intervene. Performance is very poor in scenarios with frequent concurrent data conflicts.
Single point of failure: the coordinator plays a central role in the whole two-phase commit process. Once the coordinator's server goes down, the normal operation of the whole database cluster is affected. For example, if the coordinator fails in the second phase and cannot send the commit or rollback notification, the participants remain blocked.
Synchronous blocking: during two-phase commit, all participants must follow the coordinator's unified scheduling. During this period they are blocked and cannot perform other operations, which is extremely inefficient.
Therefore, the two-phase commit scheme is rarely used in Internet services, because it cannot meet high-concurrency requirements.
To make up for the low performance of this scheme, many alternatives have been proposed. They work at the application layer, that is, by intruding into the business code. The more typical ones are the TCC scheme and the eventual consistency scheme based on reliable messages.
TCC transaction scheme
The TCC transaction model has been widely applied in e-commerce and finance. The TCC scheme is in effect an improvement on two-phase commit. It explicitly divides each branch of the overall business logic into three operations: try, confirm and cancel. The try operation completes business preparation, the confirm operation completes business submission, and the cancel operation completes transaction rollback. The basic principle is shown in the figure below.
When a transaction starts, the business application registers with the transaction coordinator to start the transaction. The business application then calls the try interface of every service to complete the first-phase preparation. After that, the transaction coordinator decides whether to call the confirm interface or the cancel interface according to what the try interfaces returned. If a confirm or cancel call fails, it is retried.
The TCC scheme lets the application define the granularity of database operations, which makes it possible to reduce lock conflicts and improve throughput. For example, Huawei's distributed transaction middleware DTM has very high performance: on a commonly configured server it can support 10,000+ TPS for global transactions and 30,000+ TPS for branch transactions. Of course, the TCC scheme also has shortcomings, mainly in the following two respects:
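The try, confirm and cancel operations can be sketched as follows. This is an in-memory illustration under stated assumptions: the `Account` class, the `run_tcc` function and the idea of freezing outgoing funds during try are choices made for this example and do not describe any particular TCC framework. Coordinator registration and retries are left out.

```python
class Account:
    """Hypothetical TCC participant; `pending` holds what each Try reserved."""

    def __init__(self, balance):
        self.balance = balance
        self.pending = {}  # xid -> signed amount reserved by try_

    def try_(self, xid, delta):
        # Try: check and reserve resources, without finalizing anything.
        if delta < 0 and self.balance + delta < 0:
            return False
        if delta < 0:
            self.balance += delta  # freeze outgoing funds right away
        self.pending[xid] = delta
        return True

    def confirm(self, xid):
        # Confirm: pop() makes a repeated call a no-op, so retries are safe.
        delta = self.pending.pop(xid, 0)
        if delta > 0:
            self.balance += delta  # incoming funds become available only now

    def cancel(self, xid):
        # Cancel: release whatever Try reserved; also safe to repeat.
        delta = self.pending.pop(xid, 0)
        if delta < 0:
            self.balance -= delta


def run_tcc(xid, branches):
    """branches is a list of (account, delta); returns the final decision."""
    tried, ok = [], True
    for account, delta in branches:
        if account.try_(xid, delta):
            tried.append(account)
        else:
            ok = False
            break
    for account in tried:
        if ok:
            account.confirm(xid)
        else:
            account.cancel(xid)
    return "confirmed" if ok else "cancelled"
```

For example, `run_tcc("x1", [(a, -30), (b, 30)])` moves 30 from `a` to `b`, while a failing try on any branch cancels the branches that had already reserved resources.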
Strong business intrusion. Each branch of the business logic needs to implement three operations: try, confirm and cancel. This is highly invasive for the application, and the cost of conversion is high.
Difficult to implement. To meet the consistency requirements, the implementation must make operations idempotent so that they can be executed repeatedly, prevent resources from being left suspended, and handle concurrent access control and data visibility control carefully.
For these reasons, TCC solutions are mostly adopted by large companies with strong R&D capability and urgent needs. Microservices advocate lightweight services, yet much of the transaction processing logic in the TCC scheme has to be hand-coded, which is complex and requires a large amount of development work.
Message-based eventual consistency scheme
The message consistency scheme uses message middleware to keep the data operations of upstream and downstream applications consistent. The basic idea is to put the local operation and the message sending into one local transaction, so that the local operation and the message sending either both succeed or both fail. The downstream application subscribes to the message system and performs the corresponding operation after receiving the message.
In essence, the message-based eventual consistency scheme converts a distributed transaction into two local transactions and then relies on the retry mechanism of the downstream service to reach eventual consistency. This scheme is also very intrusive for the application: it requires a lot of business changes, and the cost is very high.
Solutions that intrude into the code are compromises based on the existing situation, and they are not elegant to implement. For example, a transactional call is usually accompanied by a series of reverse operations on the transaction interface, and commit logic must always be paired with rollback logic. Such code makes the project bloated and costly to maintain.
Given the pain points of the distributed transaction solutions above, the ideal distributed transaction solution should clearly have good performance and no intrusion into the business. The business layer should not need to care about the constraints of the distributed transaction mechanism, so that transactions are separated from business logic. This is the non-intrusive transaction recommended in this article.
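One common way to put the local operation and the message into a single local transaction is an outbox table. The sketch below uses sqlite3 and assumes that pattern; the table names, the `place_order`, `relay` and `consume` functions, and the direct function call that stands in for the message middleware are all illustrative assumptions rather than something the article specifies.

```python
import sqlite3

upstream = sqlite3.connect(":memory:")
upstream.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
upstream.execute(
    "CREATE TABLE outbox (msg_id INTEGER PRIMARY KEY, payload TEXT, sent INTEGER DEFAULT 0)"
)

downstream = sqlite3.connect(":memory:")
downstream.execute("CREATE TABLE processed (msg_id INTEGER PRIMARY KEY)")
downstream.execute("CREATE TABLE stock (item TEXT PRIMARY KEY, qty INTEGER)")
downstream.execute("INSERT INTO stock VALUES ('book', 5)")
downstream.commit()

def place_order(item):
    """Upstream: business row and outgoing message share one local transaction."""
    with upstream:
        upstream.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        upstream.execute("INSERT INTO outbox (payload) VALUES (?)", (item,))

def relay(deliver):
    """Stand-in for the middleware: push unsent messages, mark them once delivered."""
    rows = upstream.execute("SELECT msg_id, payload FROM outbox WHERE sent = 0").fetchall()
    for msg_id, payload in rows:
        deliver(msg_id, payload)  # if this raises, the row stays unsent and is retried
        with upstream:
            upstream.execute("UPDATE outbox SET sent = 1 WHERE msg_id = ?", (msg_id,))

def consume(msg_id, item):
    """Downstream: a second local transaction, idempotent because of `processed`."""
    with downstream:
        try:
            downstream.execute("INSERT INTO processed VALUES (?)", (msg_id,))
        except sqlite3.IntegrityError:
            return  # duplicate delivery, already applied
        downstream.execute("UPDATE stock SET qty = qty - 1 WHERE item = ?", (item,))
```

Because delivery may be retried, the consumer records each message ID it has handled, so a duplicate delivery does not deduct stock twice.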
Non-intrusive transaction scheme
a. Typical architecture
The typical architecture of non-intrusive transactions is shown in the figure below:
Transaction core components include:
Transaction Coordinator (TC): the brain of the distributed transaction. It generates and maintains global transactions and branch transactions, and drives the two-phase processing of transaction commit and rollback. The TC server provides transaction coordination as a cluster.
Transaction Manager (TM): defines the boundaries of global transactions and communicates with the transaction coordinator to start, commit or roll back global transactions.
Resource Manager (RM): manages the resources used by branch transactions, communicates with the transaction coordinator to start and end branch transactions, and receives instructions from the transaction coordinator to complete the second-phase commit or rollback of each branch.
Lock Server (LS): the distributed lock server, which can query, lock and release the resources being operated on by ongoing distributed transactions.
A distributed transaction is called a global transaction, and it has several branch transactions attached to it. A branch transaction is a local transaction that satisfies ACID. The core idea of non-intrusive transactions is that the resource manager intercepts the business SQL, parses it, performs some additional data processing, and generates and saves an undo log. If the global transaction is rolled back, every branch transaction is rolled back using its own undo log.
An obvious concern is that two global transactions may modify the same data in parallel, which could corrupt the data when a rollback is performed from the undo log. The solution is to lock the data modified by the transaction through the lock server. The lock is released immediately once the global transaction is committed. If the global transaction is rolled back, the lock is released only after the branch rollback has completed.
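The locking rule can be illustrated with a small in-memory stand-in for the lock server. The `LockServer` class, its method names and the `(table, primary_key)` key format are assumptions for this sketch; a real lock server would be a separate, highly available service.

```python
class LockServer:
    """In-memory stand-in for LS; each key is a (table, primary_key) pair."""

    def __init__(self):
        self._owners = {}  # row key -> XID of the global transaction holding it

    def lock(self, xid, keys):
        """Lock every key for xid, or none if another transaction holds one."""
        if any(self._owners.get(k, xid) != xid for k in keys):
            return False
        for k in keys:
            self._owners[k] = xid
        return True

    def query(self, key):
        """Return the XID holding the key, or None if it is free."""
        return self._owners.get(key)

    def release(self, xid):
        """Release everything held by xid (on commit, or after rollback completes)."""
        for k in [k for k, owner in self._owners.items() if owner == xid]:
            del self._owners[k]
```

While one global transaction holds a row, a second one that tries to modify the same row is refused, so an undo log can never restore a value that another in-flight global transaction has overwritten.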
b. Typical process
The main execution steps of a typical distributed transaction are as follows:
1. TM asks TC to start a new global transaction. TC creates the global transaction and returns its global transaction ID (XID).
2. A transaction context is built from the XID and propagated along the call chain of microservices.
3. RM detects that it is running inside a transaction context, obtains the global transaction ID, parses the SQL, generates the undo log and the distributed transaction lock data, and asks TC to create a branch transaction.
4. TC acquires the locks through LS. Once locking succeeds, it creates a branch transaction ID and returns it.
5. RM associates the branch transaction ID with the undo log and commits the undo log together with the original business SQL in one local transaction.
6. Steps 3 to 5 are repeated to create a branch transaction for each local transaction within the scope of the global transaction.
7. If no exception occurs within the global transaction boundary, TM asks TC to commit the global transaction. If an exception occurs, TM asks TC to roll back the global transaction.
8. TC marks the status of the global transaction. If the status is committed, it releases the locks through LS immediately. It then drives second-phase processing for all branch transactions under the global transaction identified by the XID by sending requests to the RMs.
9. RM completes the commit or rollback of each branch transaction and returns the status to TC.
10. TC releases, through LS, the locks of each branch that has finished rolling back. After all branches are complete, the result of the global transaction is returned to TM.
Second-phase processing is the key part of the process and is described in detail below.
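The ten steps above can be traced in a compact in-memory sketch. Everything here is a simplifying assumption: the `Coordinator` class folds the lock server into a dict, the `ResourceManager` uses a dict as its "database" instead of intercepting SQL, the XID is passed as a plain argument rather than propagated through a call chain, and the `transfer` function plays the TM role.

```python
import itertools

class Coordinator:
    """Hypothetical TC; the `locks` dict stands in for the lock server."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.branches = {}  # xid -> list of (rm, branch_id)
        self.locks = {}     # row key -> xid

    def begin(self):  # step 1
        xid = f"xid-{next(self._ids)}"
        self.branches[xid] = []
        return xid

    def register_branch(self, xid, rm, key):  # step 4
        if self.locks.setdefault(key, xid) != xid:
            raise RuntimeError(f"{key} is locked by another global transaction")
        branch_id = next(self._ids)
        self.branches[xid].append((rm, branch_id))
        return branch_id

    def finish(self, xid, commit):  # steps 8 to 10
        if commit:
            self._unlock(xid)  # committed: locks are released immediately
        for rm, branch_id in reversed(self.branches.pop(xid)):
            if commit:
                rm.branch_commit(branch_id)
            else:
                rm.branch_rollback(branch_id)
        self._unlock(xid)      # rolled back: locks are released after the branches finish
        return "committed" if commit else "rolled_back"

    def _unlock(self, xid):
        self.locks = {k: v for k, v in self.locks.items() if v != xid}


class ResourceManager:
    """Hypothetical RM whose 'database' is a plain dict of rows."""

    def __init__(self, name, tc, rows):
        self.name, self.tc, self.rows = name, tc, rows
        self.undo = {}  # branch_id -> (key, before image)

    def update(self, xid, key, value):  # steps 3 to 5
        before = self.rows[key]
        branch_id = self.tc.register_branch(xid, self, (self.name, key))
        self.undo[branch_id] = (key, before)  # a real RM writes this in the same local transaction
        self.rows[key] = value

    def branch_commit(self, branch_id):  # step 9, commit path
        self.undo.pop(branch_id, None)

    def branch_rollback(self, branch_id):  # step 9, rollback path
        key, before = self.undo.pop(branch_id)
        self.rows[key] = before


def transfer(tc, src, dst, amount):
    """TM role: define the global transaction boundary (steps 1, 2 and 7)."""
    xid = tc.begin()
    try:
        src.update(xid, "balance", src.rows["balance"] - amount)
        dst.update(xid, "balance", dst.rows["balance"] + amount)
        if src.rows["balance"] < 0:
            raise ValueError("insufficient funds")
    except Exception:
        return tc.finish(xid, commit=False)
    return tc.finish(xid, commit=True)
```

The business code in `transfer` contains no compensation logic of its own, which is the sense in which the scheme is non-intrusive: rollback is derived from the before images that the resource manager recorded.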
c. Branch transaction commit
If the global transaction status is commit, branch commit is initiated for each branch, as shown in the following figure:
RM receives the branch commit request, puts the branch transaction ID into a queue and returns immediately. A background thread periodically takes a batch of branch transaction IDs from the queue and builds SQL that deletes the corresponding undo logs in bulk. Branch commits can be processed asynchronously in batches because the global transaction has already been committed, so the undo log, as an intermediate state, no longer matters. It only needs to be cleaned up regularly.
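A minimal sketch of this asynchronous cleanup, using Python's queue and sqlite3 modules, might look as follows. The `undo_log` table layout and the `on_branch_commit` and `purge_undo_logs` functions are assumptions for illustration; in a real RM the purge would be invoked periodically by a background thread rather than called by hand.

```python
import queue
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE undo_log (branch_id INTEGER PRIMARY KEY, before_image TEXT)")
conn.executemany(
    "INSERT INTO undo_log VALUES (?, ?)", [(i, "before-image") for i in range(1, 6)]
)
conn.commit()

committed = queue.Queue()  # branch IDs whose undo logs may be discarded

def on_branch_commit(branch_id):
    """Handle a branch commit request: enqueue the ID and return at once."""
    committed.put(branch_id)
    return "ok"

def purge_undo_logs(batch_size=100):
    """One pass of the periodic cleaner: delete up to batch_size undo logs."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(committed.get_nowait())
        except queue.Empty:
            break
    if batch:
        marks = ",".join("?" * len(batch))
        with conn:  # a single DELETE statement for the whole batch
            conn.execute(f"DELETE FROM undo_log WHERE branch_id IN ({marks})", batch)
    return len(batch)
```

The commit request itself does no database work, so the second phase of a committed global transaction stays cheap.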
d. Branch transaction rollback
If the global transaction status is rollback or timeout, branch rollback is initiated for each branch, as shown in the following figure:
RM receives the branch rollback request, starts a local transaction, finds the corresponding undo log by branch ID, builds and executes the rollback SQL statement, deletes the undo log, and then commits the local transaction. If this completes successfully, TC cleans up the resources held by the branch through LS after receiving the response.
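The rollback path can be sketched with sqlite3 as follows. The schema, the JSON before image and the function names are assumptions for this example; a real resource manager would derive the rollback SQL from the parsed business SQL rather than from a hand-written helper.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE undo_log (branch_id INTEGER PRIMARY KEY, before_image TEXT)")
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.commit()

def update_balance(branch_id, account_id, new_balance):
    """Forward path: business update and undo log commit in one local transaction."""
    with conn:
        (old,) = conn.execute(
            "SELECT balance FROM account WHERE id = ?", (account_id,)
        ).fetchone()
        image = json.dumps({"id": account_id, "balance": old})
        conn.execute("INSERT INTO undo_log VALUES (?, ?)", (branch_id, image))
        conn.execute(
            "UPDATE account SET balance = ? WHERE id = ?", (new_balance, account_id)
        )

def rollback_branch(branch_id):
    """Restore the before image and delete the undo log in one local transaction."""
    with conn:
        row = conn.execute(
            "SELECT before_image FROM undo_log WHERE branch_id = ?", (branch_id,)
        ).fetchone()
        if row is None:
            return False  # no undo log left: already rolled back, so a retry is harmless
        image = json.loads(row[0])
        conn.execute(
            "UPDATE account SET balance = ? WHERE id = ?", (image["balance"], image["id"])
        )
        conn.execute("DELETE FROM undo_log WHERE branch_id = ?", (branch_id,))
    return True
```

Restoring the data and deleting the undo log in the same local transaction means a crash cannot leave the row restored with the log still present, or the log deleted with the row unrestored.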
e. Performance analysis
An important performance advantage of non-intrusive transactions over XA two-phase commit is that resources are locked for less time. In practice, most transactions commit and few roll back. With XA, resources are released only in phase two, whether the transaction commits or rolls back. With the non-intrusive transactions introduced in this article, a global transaction in the commit state does not need to hold its locks through phase two, because they are released as soon as the commit is decided. Only the small proportion of global transactions that roll back keep their locks until phase two.
Non-intrusive transactions are not tied to the database XA interface and are fully controllable. The design of TC, RM and LS has a great impact on performance, and a good design and implementation can achieve very high performance. Practice with non-intrusive transactions has shown that they can meet the performance requirements of most high-concurrency business scenarios.
Example: distributed transaction transformation of a typical core business system
Huawei Cloud Stack carried out the distributed transaction transformation of an operator's core business system. In common high-concurrency scenarios such as recharges at the beginning of the month and fee-deduction peaks, the customer's business challenged the distributed system in the following ways:
- Highly concurrent distributed transactions access the account tables. XA two-phase commit seriously affects the business because of its long locking time. The overall performance requirement is 1,000+ TPS. Traditional or open-source distributed transaction solutions have difficulty meeting the requirements for high availability and high performance.
- Consistency between XA transactions and other database operations. The XA transaction needs to be treated as one branch of a DTM TCC transaction, with the other database operations as another branch.
Through a series of innovative technologies, DTM, the distributed transaction middleware of the Huawei Cloud Stack hybrid cloud solution, provides distributed transaction services that are high-performance, highly available, highly reliable, secure, low-intrusion and easy to use. It supports both the TCC transaction model and the non-intrusive transaction model, helping enterprises move to microservices and solve the data consistency problem in distributed systems gracefully.