It looks intimidating, but the CAP theorem is simple and practical


When developing distributed systems, we often hear about the CAP theorem, or have to deal with data consistency between nodes.
But what exactly is the CAP theorem?

The CAP theorem is simple, yet it guides the high-level design of a great deal of software, so it is often described as one of the theories every architect must master.
Since theory tends to be abstract and tedious, let’s start with an example:
One day you are on a losing streak in Honor of Kings, so you find an expert Luban player to carry you:

So the following conversation takes place:

Shooter Luban: help me, thank you

Support Cai Wenji: received
Support Cai Wenji: I’m on my way to gank

Shooter Luban: copy that

If everything works, all is well.
But suppose YY voice chat suddenly drops at this moment:

(1) Shooter Luban: help me, thank you

 Cai Wenji: no response

(2) Support Cai Wenji: I’m on my way to gank

 Shooter Luban: no response


Then there are two strategies:
1. Ignore the YY voice problem for now and keep playing the game.
2. Stop playing, switch out of the game, and call each other to find out what happened.

This is exactly what the CAP theorem is about:
in a distributed scenario, when communication between nodes fails, how should each node respond? Should it pause responding and wait for the connection to recover, or should it keep responding and tolerate inconsistent data?
Let’s first look at the formal definitions of CAP in computing:

Consistency: in a distributed system, all replicas of the data are identical at any given moment;
Availability: the nodes of the distributed system can always respond to reads and writes normally, without timing out;
Partition tolerance: when communication between nodes fails, the distributed system as a whole keeps running rather than crashing outright;
P is the background condition: in a distributed system, the whole system keeps running even when communication between nodes breaks, and it continues to operate normally once the network is restored.

Both theory and practice show that A and C cannot coexist once P is taken as given.
That is, you cannot guarantee both the availability and the consistency of the nodes at the same time.
The formal proof is not reproduced here; my own understanding is as follows:
When the network between nodes goes down, nodes that stay available will inevitably accept data modifications, and because the network is down those modifications cannot be synchronized to the other nodes, so the data becomes inconsistent. There are then two possible strategies:
(1) If the nodes must stay consistent, the only choice is to stop accepting writes. The nodes cannot synchronize data, so they have to reject write operations, which sacrifices high availability.
(2) If writes are still accepted, the data on different nodes diverges, and because of the network failure it cannot be synchronized in time, which sacrifices strong consistency.
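
The two strategies can be sketched as a toy simulation in Python. This is only an illustration under simplifying assumptions: the Node class, the mode and link_up fields, and the helper functions are hypothetical names invented for this sketch, there are only two replicas, and replication is modeled as a direct in-memory copy.

```python
class Node:
    """A replica holding one copy of the data (hypothetical, for illustration)."""

    def __init__(self, name, mode):
        self.name = name
        self.mode = mode      # "CP" or "AP"
        self.data = {}
        self.peer = None
        self.link_up = True   # False simulates a network partition

    def write(self, key, value):
        if self.link_up:
            # Healthy network: apply locally and replicate to the peer.
            self.data[key] = value
            self.peer.data[key] = value
            return True
        if self.mode == "CP":
            # Cannot replicate, so refuse the write: consistent but unavailable.
            return False
        # AP: accept the write locally; the two replicas now diverge.
        self.data[key] = value
        return True


def make_cluster(mode):
    a, b = Node("a", mode), Node("b", mode)
    a.peer, b.peer = b, a
    return a, b


def partition(a, b):
    # Cut the link in both directions.
    a.link_up = b.link_up = False


def demo(mode):
    a, b = make_cluster(mode)
    a.write("hp", 100)            # replicated while the network is healthy
    partition(a, b)
    accepted = a.write("hp", 60)  # attempted during the partition
    return accepted, a.data == b.data


print(demo("CP"))  # (False, True): write rejected, replicas still identical
print(demo("AP"))  # (True, False): write accepted, replicas now differ
```

In this sketch the CP node gives up the write to keep both copies identical, while the AP node keeps serving writes at the cost of divergence, which mirrors options (1) and (2) above.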

What is this theory good for?
It makes clear that a distributed system cannot have perfectly consistent data and high availability at the same time in every situation. Architectural design has to make trade-offs:
1. Guarantee availability and give up strong consistency
2. Guarantee consistency and give up high availability
The choice depends on the business. If your system can tolerate short service pauses but must never produce wrong data, prefer a CP scheme; examples are real-time calling systems and financial transaction systems. If your system can tolerate temporary data inconsistency but must stay highly available, as with live streaming, likes, and comments, prefer an AP scheme. Neither scheme is inherently better; business needs drive the choice of technology.
Many people ask: doesn’t CAP mean choosing two out of three? Why are there only CP and AP here?
That understanding is not wrong. However (this is the key point, pay attention):
P stands for partition tolerance, meaning the system keeps working when network communication fails in a distributed scenario. This is the premise.
CP and AP are the two schemes discussed under this premise. If you cross out P and take AC, you are talking about consistency and availability on a single machine, which makes the discussion pointless. It is like R = U / I in middle school physics: resistance equals voltage divided by current. But you cannot conclude that resistance is directly proportional to voltage and inversely proportional to current, because resistance depends only on the material and its dimensions. You can use the formula to calculate a value, but that does not mean the formula should be read literally.

In addition, note that the cases above are theoretical models that ignore the network delay that synchronizing data requires. In practice, because of network latency, an efficient CP system is very hard to implement. And since Internet products demand fast responses, AP designs usually make up the larger share of real-world development. So how do we limit the impact of data inconsistency?
Here are two ideas:
1. Rely on eventual consistency. Temporary inconsistency is acceptable as long as it does not affect the core data calculations, provided that the data eventually converges and the operations are idempotent.
2. Shorten the time it takes to restore consistency. When data becomes inconsistent, the system should recover it as quickly as possible. If the data has already been recovered by the time the next request arrives, the outside world never notices the inconsistency.
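
A minimal Python sketch of these two ideas, using a "like" counter as the example. Everything here is an assumption made for illustration: the Replica class, the op_id strings, and the sync_from method are hypothetical, and a real system would also need persistence, ordering, and conflict handling that this sketch leaves out.

```python
class Replica:
    """One copy of a like counter (hypothetical, for illustration)."""

    def __init__(self):
        # op_id -> amount. Keying by op_id is what makes each operation idempotent.
        self.ops = {}

    def apply(self, op_id, amount):
        # Applying an op_id that was already seen is a no-op,
        # so retries and repeated syncs never double-count.
        self.ops.setdefault(op_id, amount)

    def total(self):
        return sum(self.ops.values())

    def sync_from(self, other):
        # Recovery step: pull every operation the other replica has seen.
        for op_id, amount in other.ops.items():
            self.apply(op_id, amount)


def demo():
    a, b = Replica(), Replica()
    # During the partition each replica accepts likes on its own (AP).
    a.apply("like-1", 1)
    a.apply("like-2", 1)
    b.apply("like-1", 1)   # the same request, retried against the other replica
    b.apply("like-3", 1)
    b.apply("like-4", 1)
    diverged = (a.total(), b.total())
    # The network recovers: the replicas exchange operations in both directions.
    a.sync_from(b)
    b.sync_from(a)
    a.sync_from(b)         # repeating the sync changes nothing
    return diverged, a.total(), b.total()


print(demo())  # ((2, 3), 4, 4)
```

The replicas disagree while the link is down, but converge to the same total once they sync, and the retried "like-1" is counted only once. Running the sync as soon as the link comes back is one simple way to shorten the window in which a reader could observe the inconsistency.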