99% of people can understand the “compensation” mechanism of distributed systems


Summary: Let’s talk about how the “compensation” mechanism lets a system absorb its “internal injuries” while staying highly available to the outside.

1、 What is the significance of the “compensation” mechanism?

Take the shopping scenario of e-commerce as an example:

Client -> shopping cart microservice -> order microservice -> payment microservice.

This call chain is very common.

So why do we need to think about a compensation mechanism?

As mentioned in previous articles, a single cross-machine call may pass through DNS services, network cards, switches, routers, load balancers and other devices. These devices are not always stable, and an error at any link in the transmission path will cause problems.

In a distributed scenario, a complete business operation is made up of multiple cross-machine calls, so the probability that at least one of them fails compounds with every additional hop.
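To make the compounding concrete, here is a rough sketch. It assumes every hop fails independently with the same probability `p`, which is a simplification, and the 0.1% figure is purely illustrative.

```python
def chain_failure_probability(p: float, hops: int) -> float:
    """Probability that at least one of `hops` independent calls fails."""
    return 1 - (1 - p) ** hops


# Illustrative only: a 0.1% failure rate per hop.
for hops in (1, 3, 10, 30):
    print(hops, round(chain_failure_probability(0.001, hops), 5))
```

Under these assumptions, the longer the call chain, the more likely it is that some hop fails, even when each individual hop is very reliable.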

However, these problems do not necessarily mean that the system cannot process the request at all, so we should absorb such exceptions automatically wherever possible.

You may ask: I have seen “compensation”, “transaction compensation” and “retry” before. What is the relationship between them?

In fact, you don’t have to worry about these names. They all serve the same purpose: once an exception occurs during an operation, eliminate the “inconsistent” state it caused through an internal mechanism.

Digression: in Brother Z’s opinion, whatever method is used, as long as the problem is solved through some additional step, it can be understood as “compensation”. So “transaction compensation” and “retry” are both subsets of “compensation”. The former is a reverse operation, while the latter is a forward operation.

In terms of outcome, though, the two mean different things. “Transaction compensation” means “giving up”: the current operation is bound to fail.

▲ transaction compensation

“Retry”, by contrast, still has a chance of success. The two methods suit different scenarios.

▲ retry

Because “compensation” is by nature an additional process, and we can afford to run this additional process, timeliness is clearly not the first consideration. So the core principle of compensation is: better slow than wrong.

Therefore, do not settle on a compensation scheme hastily; evaluate it carefully. Mistakes cannot be avoided 100%, but this mindset will reduce how often they occur.

2、 What should compensation do?

The main methods of “compensation” are “transaction compensation” and “retry”, which will be called “rollback” and “retry” below.

Let’s talk about rollback first. It is logically simpler than “retry”.


Brother Z divides rollback into two modes: explicit rollback (calling the reverse interface) and implicit rollback (without calling the reverse interface).

The most common is “explicit rollback”. It comes down to two things:

First, determine the step that failed and its state, which determines the scope of the rollback. A business process is usually laid out at design time, so the scope is fairly easy to determine. The one thing to note is: if not every service in the business process provides a rollback interface, put the services that do provide one first when orchestrating the calls, so that there is still a chance to roll back when a later service fails.
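The ordering rule can be sketched roughly as follows. This is a minimal, single-process illustration, not a production orchestrator; the step names and the `run_with_rollback` helper are hypothetical and stand in for real service calls.

```python
from typing import Callable, List, Optional, Tuple

# (name, action, rollback); rollback is None when the service
# offers no reverse interface.
Step = Tuple[str, Callable[[], None], Optional[Callable[[], None]]]


def run_with_rollback(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo the completed ones in reverse."""
    done: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            done.append(step)
        except Exception as exc:
            print(f"{name} failed: {exc}; rolling back")
            for _, _, rollback in reversed(done):
                if rollback is not None:
                    rollback()
            return False
    return True


events: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("payment gateway timeout")


# Services with a rollback interface come first; the one without comes last.
steps: List[Step] = [
    ("reserve_stock", lambda: events.append("stock reserved"),
     lambda: events.append("stock released")),
    ("create_order", lambda: events.append("order created"),
     lambda: events.append("order cancelled")),
    ("charge_payment", fail_payment, None),
]
ok = run_with_rollback(steps)
```

Because the step without a reverse interface runs last, its failure leaves only steps that can still be undone.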

Second, provide the business data needed by the “rollback” operation. The more data supplied at rollback time, the more robust the program, because the receiving service can run business checks when it gets the “rollback” request, such as whether the accounts match and whether the amounts are consistent.

Since the structure and size of this intermediate-state data are not fixed, Brother Z suggests serializing it to JSON and keeping it in a NoSQL store.
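As a rough illustration of storing rollback data as JSON and checking it before undoing anything: the sketch below uses an in-memory dict as a stand-in for the NoSQL store, and the field names (`account`, `amount_cents`) and the `refund` function are assumptions made up for the example.

```python
import json

# Stand-in for a NoSQL store, keyed by transaction ID.
rollback_store = {}


def save_rollback_context(tx_id: str, context: dict) -> None:
    rollback_store[tx_id] = json.dumps(context)


def load_rollback_context(tx_id: str) -> dict:
    return json.loads(rollback_store[tx_id])


def refund(tx_id: str, amount_cents: int) -> str:
    """Roll back a charge, but only after checking it against stored data."""
    ctx = load_rollback_context(tx_id)
    if ctx["amount_cents"] != amount_cents:
        raise ValueError("amount mismatch, refusing to roll back")
    return f"refunded {amount_cents} to {ctx['account']}"


save_rollback_context(
    "tx-1001",
    {"step": "charge_payment", "account": "A-42", "amount_cents": 1999},
)
print(refund("tx-1001", 1999))
```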

“Implicit rollback” is used in relatively few scenarios. It means you do not need to do anything extra to roll back, because the downstream service has a mechanism such as “reservation” plus “timeout expiry”. For example:

In the e-commerce scenario, the goods in an order reserve inventory first and wait for the user to pay within 15 minutes. If no payment is received, the inventory is released.
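A minimal sketch of such a reservation with timeout expiry might look like this. The `Inventory` class and its method names are hypothetical, and expired holds are released lazily on the next call rather than by a background job, which is just one possible design.

```python
import time


class Inventory:
    def __init__(self, stock: int, hold_seconds: float = 15 * 60):
        self.stock = stock
        self.hold_seconds = hold_seconds
        self.holds = {}  # order_id -> (quantity, expires_at)

    def _release_expired(self, now: float) -> None:
        # Implicit rollback: unpaid holds simply expire and return their stock.
        for order_id, (qty, expires_at) in list(self.holds.items()):
            if now >= expires_at:
                self.stock += qty
                del self.holds[order_id]

    def reserve(self, order_id: str, qty: int, now: float = None) -> bool:
        now = time.time() if now is None else now
        self._release_expired(now)
        if qty > self.stock:
            return False
        self.stock -= qty
        self.holds[order_id] = (qty, now + self.hold_seconds)
        return True

    def confirm_payment(self, order_id: str, now: float = None) -> bool:
        """Return True if the hold was still valid when payment arrived."""
        now = time.time() if now is None else now
        self._release_expired(now)
        return self.holds.pop(order_id, None) is not None

    def available(self, now: float = None) -> int:
        now = time.time() if now is None else now
        self._release_expired(now)
        return self.stock
```

The caller never invokes a reverse interface: if payment does not arrive in time, the stock comes back on its own.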

Next, let’s talk about “retry”, which has many more variations and is easier to get wrong.


The biggest advantage of “retry” is that the business system does not need to provide a “reverse interface”, which greatly lowers long-term development cost. After all, the business changes every day. So, where possible, prefer retry.

However, retry applies to fewer scenarios than rollback, so the first step is to judge whether the current scenario is suitable for retry. For example:

  • When the downstream system returns a temporary state such as “request timeout” or “rate limited”, we can retry.
  • When it returns a business error that cannot be recovered from, such as “insufficient balance” or “no permission”, no retry is required.
  • When a middleware or RPC framework returns an error with no expectation of recovery, such as HTTP 503 or 404, no retry is required.
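The decision above can be sketched as a simple classification. The `Outcome` enum and its members are hypothetical names chosen for illustration; a real system would map its own error codes.

```python
from enum import Enum


class Outcome(Enum):
    TIMEOUT = "request timeout"
    RATE_LIMITED = "rate limited"
    INSUFFICIENT_BALANCE = "insufficient balance"
    NO_PERMISSION = "no permission"
    NOT_FOUND = "404"


# Only temporary states are worth retrying.
RETRYABLE = {Outcome.TIMEOUT, Outcome.RATE_LIMITED}


def should_retry(outcome: Outcome) -> bool:
    return outcome in RETRYABLE
```

Keeping the retryable set explicit, and treating everything else as final, avoids retrying errors that can never succeed.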

Once you have decided to retry, you also need to choose an appropriate retry strategy. The main strategies are as follows.

Strategy 1. Immediate retry. Sometimes the fault is transient, caused by something like a network packet collision or a traffic peak on a hardware component. In this case it is appropriate to retry the operation immediately. However, do not retry immediately more than once. If the immediate retry fails, switch to another strategy.

Strategy 2. Fixed interval. The application waits the same amount of time between attempts, for example retrying every 3 seconds. (The specific numbers in all the example code below are for reference only.)

Strategies 1 and 2 are mostly used for interactive operations in front-end systems.
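Here is one way the first two strategies could be combined: a single immediate retry, followed by fixed-interval retries. The function name, the default values, and the choice of `TimeoutError` as the retryable error are all assumptions for the sake of the sketch.

```python
import time


def call_with_retry(operation, max_attempts=4, interval_seconds=3.0,
                    sleep=time.sleep):
    """One immediate retry, then fixed-interval retries, up to max_attempts calls."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError as exc:
            last_error = exc
            if attempt == max_attempts:
                break
            # The first retry is immediate; later ones wait a fixed interval.
            sleep(0 if attempt == 1 else interval_seconds)
    raise last_error
```

The `sleep` parameter is injectable so that the waiting behaviour can be observed or replaced in tests without actually pausing.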

Strategy 3. Incremental interval. The retry interval grows by a fixed step each time, for example 0 seconds for the first retry, 3 seconds for the second, 6 seconds for the third, then 9, 12 and 15.

return (retryCount - 1) * incrementInterval;

The more often a request has failed, the lower its retry priority, which makes way for newly arriving retry requests.

Strategy 4. Exponential interval. The retry interval grows exponentially. The goal is the same as with the incremental interval, namely to push requests with more failures further down in priority, but the interval grows faster.

return 2 ^ retryCount;  // here ^ means "to the power of", not bitwise XOR

Strategy 5. Full jitter. Add randomness on top of the growing interval (the exponential part can be swapped for incremental growth). It suits scenarios where a large burst of retry requests generated at the same moment needs to be spread out.

return random(0 , 2 ^ retryCount);

Strategy 6. Equal jitter. A middle ground between “exponential interval” and “full jitter” that reduces the effect of randomness. It suits the same scenarios as “full jitter”.

var baseNum = 2 ^ retryCount;
return baseNum + random(0, baseNum);

The behaviour of strategies 3, 4, 5 and 6 is roughly as follows (the x-axis is the number of retries).

▲ retry interval of strategies 3 to 6 by number of retries
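To compare strategies 3 to 6 side by side, the formulas from the text can be transcribed into Python as below. This is a direct, hedged transcription: `random(0, x)` is assumed to mean a uniform random number between 0 and x, the 3-second increment is taken from the example in strategy 3, and the function names are made up for the sketch.

```python
import random


def incremental(retry_count: int, increment: int = 3) -> int:
    return (retry_count - 1) * increment


def exponential(retry_count: int) -> int:
    return 2 ** retry_count


def full_jitter(retry_count: int, rng=random) -> float:
    return rng.uniform(0, 2 ** retry_count)


def equal_jitter(retry_count: int, rng=random) -> float:
    base = 2 ** retry_count
    return base + rng.uniform(0, base)


# Columns: retry count, incremental, exponential, full jitter, equal jitter.
for n in range(1, 7):
    print(n, incremental(n), exponential(n),
          round(full_jitter(n), 2), round(equal_jitter(n), 2))
```

The two jitter columns change on every run, which is the point: requests that failed at the same moment no longer retry at the same moment.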

Why does “retry” have pitfalls?

As mentioned earlier, to keep development cost down, a “retry” will most likely reuse the regular call interface. That immediately raises the question of idempotency.

If the technical solution chosen to implement “retry” cannot guarantee 100% that a request is never executed twice, the “idempotency” problem must be considered. Even if it can, consider idempotency anyway where possible, to guard against accidents.

Idempotency: no matter how many times a call is repeated, if the resulting state of the program (all relevant data changes) is the same as after a single call, the call is idempotent.

This means that the operation can be repeated or retried as needed without causing unexpected effects. For non idempotent operations, the algorithm may have to track whether the operation has been performed.

So, once a function supports “retry”, idempotency needs to be considered for every interface on the whole call chain. Repeated calls must not cause business data to be added or subtracted cumulatively.

Achieving idempotency comes down to identifying duplicate requests and filtering them out. The idea is:

  1. Define a unique identifier for each request.
  2. During retry, judge whether the request has been executed or is being executed. If so, discard the request.

For point 1, we can use a globally unique ID generator or ID generation service (further reading: “A necessary medicine in distributed systems: generating globally unique document numbers”). Or, to keep it simple and crude, use the GUID or UUID provided by the standard class library.

Then, in the calling client, the RPC framework attaches this unique identifier to each request as a field.
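A minimal sketch of the “simple and crude” option, using Python's standard `uuid` module. The request shape and the helper names `new_request` and `retry_of` are assumptions for illustration, not part of any particular RPC framework.

```python
import uuid


def new_request(payload: dict) -> dict:
    """Attach a fresh unique ID to a new request."""
    return {"request_id": str(uuid.uuid4()), "payload": payload}


def retry_of(request: dict) -> dict:
    # A retry reuses the original request_id so the server can recognise it.
    return dict(request)
```

The important detail is that the ID is assigned once, when the request is first created, and is carried unchanged through every retry.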

For point 2, we can use AOP on the server side to run checks before and after the actual processing logic.

The general idea in code is as follows.

// [Before method execution]
if (isExistLog(requestId)) {  // 1. Has this request been received before? (the log is written in step 3)
    var lastResult = getLastResult(requestId);  // 2. Has the previous request finished? (the result is written in step 4)
    if (lastResult == null) {
        var result = waitResult(requestId);  // Still in progress: wait for it to complete
        return result;
    } else {
        return lastResult;
    }
} else {
    log(requestId);  // 3. Record that the request has been received
}

// Do something...

// [After method execution]
logResult(requestId, result);  // 4. Record the result

If “compensation” is done through MQ, this can be handled directly in the SDK that wraps the MQ client: assign a globally unique ID on the producer side and deduplicate by that ID on the consumer side.
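A rough sketch of that producer and consumer pairing follows. A plain list stands in for the queue and an in-memory set for the deduplication record; `produce` and `DedupConsumer` are hypothetical names and do not correspond to any specific MQ SDK.

```python
import uuid


def produce(queue: list, body: str) -> None:
    """Producer side: stamp every message with a globally unique ID."""
    queue.append({"message_id": str(uuid.uuid4()), "body": body})


class DedupConsumer:
    """Consumer side: skip any message whose ID has already been handled."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()

    def consume(self, message: dict) -> bool:
        if message["message_id"] in self.seen:
            return False  # duplicate delivery, ignore
        self.handler(message["body"])
        self.seen.add(message["message_id"])
        return True
```

In a real deployment the set of seen IDs would have to live in shared, durable storage so that it survives restarts and is visible to every consumer instance.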

3、 Best practices for retry

Here are some best practices Brother Z has accumulated. They all concern “retry”, which is indeed the most commonly used approach in practice.

“Retry” is a prime candidate for being “degraded” under high load, and it should also be subject to “rate limiting” and “circuit breaking”. The “spear” of retry works best when paired with the “shield” of rate limiting and circuit breaking.

Weigh the cost against the benefit of adding a compensation mechanism. For problems that are not very important, “fail fast” rather than “retry”.

It must be noted that overly aggressive retry strategies (such as too short interval or too many retries) will adversely affect downstream services.

Be sure to make a termination policy for retry.
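One possible termination policy combines a maximum number of attempts with an overall time budget. The sketch below assumes exponential delays and `TimeoutError` as the retryable error; the function name and defaults are illustrative only.

```python
import time


def retry_until(operation, max_attempts=5, deadline_seconds=30.0,
                delay=lambda attempt: 2 ** attempt,
                clock=time.monotonic, sleep=time.sleep):
    """Retry until success, the attempt limit, or the time budget is hit."""
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            wait = delay(attempt)
            out_of_attempts = attempt == max_attempts
            # Stop if the next wait would push us past the deadline.
            out_of_time = clock() - start + wait > deadline_seconds
            if out_of_attempts or out_of_time:
                raise
            sleep(wait)
```

Whichever limit is reached first ends the retry loop, and the last error is passed on to the caller so it can fail fast or roll back.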

When rollback is difficult or costly, long intervals and a large number of retries are acceptable. The “saga” pattern often mentioned in DDD follows the same idea. The precondition is that no other operation is blocked because scarce resources are held or locked (for example, in a serial sequence 1, 2, 3, 4, 5, steps 3, 4 and 5 cannot proceed because step 2 has not finished).

4、 Summary

In this article, we first talked about the meaning of “compensation” and the implementation ideas of two ways of compensation, “rollback” and “retry”.

Then I reminded you to watch out for idempotency when using “retry”, and Brother Z gave a solution.

Finally, Brother Z shared some best practices for “retry”.

I hope it will help you.


Have you ever had to do “compensation” by hand? Feel free to vent in the comments.

Brother Z himself has stayed up until midnight many times cleaning up the chaos caused by an “accident”, which is hard to forget.
