Distributed Systems: The Past and Present of CAP Theory


CAP theory is an important theory in distributed system design. Although it provides a very useful basis for system design, it has also given rise to many misunderstandings. This article starts with the background of CAP’s birth, then explains the theory, and finally analyzes some newer interpretations of CAP in the current context and clarifies some common misunderstandings.

The Background of the Birth of CAP Theory

CAP theory grew out of the debate over “data consistency vs. availability”. Brewer, the author of CAP, began studying cluster-based wide-area systems (essentially early cloud computing) in the 1990s. For such systems, availability is the primary goal, so they use caching or deferred updates to improve availability. These techniques improve availability, but at the cost of data consistency.

Brewer proposed BASE (Basically Available, Soft state, Eventual consistency) in the 1990s, but it was not well accepted at the time, because people still valued the advantages of ACID and were unwilling to give up strong consistency. Brewer therefore put forward CAP theory to open up the design space of distributed systems: the “pick two out of three” formula was meant to free designers’ thinking so that they would not cling to consistency alone.

Understanding the background of CAP’s birth gives us a deeper understanding of the theory and the insight behind it. Although the “pick two out of three” idea helps open up design thinking, it has also led to many misunderstandings, which we analyze one by one below. First, let us look at the explanation of CAP theory.

Classical Explanation of CAP Theory

The CAP theorem is one of the most fundamental theories in distributed system design. It states that a distributed data store cannot satisfy all three of the following conditions at the same time.

  • Consistency: Every read receives either the most recently written data or an error.
  • Availability: Every request receives a (non-error) response, with no guarantee that it contains the most recently written data.
  • Partition tolerance: The system continues to operate even though an arbitrary number of messages are lost (or delayed) by the network between nodes.

The CAP theorem shows that under a network partition, one must choose between consistency and availability. When a partition occurs (a network failure or high latency between nodes), the system either gives up consistency (by allowing data to be written in different partitions) or gives up availability (by stopping service once the partition is detected). In the absence of network failure, that is, when the distributed system is running normally, consistency and availability can both be satisfied.

It should be noted that consistency in the CAP theorem is quite different from consistency in ACID database transactions. The C in ACID means that a transaction cannot break any database rules, such as the uniqueness of keys. In contrast, the C in CAP refers only to single-copy consistency, so it is a strict subset of the ACID consistency constraints.

CAP theory may seem hard to understand, but it can be derived from one core point without any memorization. When a network partition occurs:

  • If the system does not allow writes, its availability is reduced, but the data in different partitions stays consistent; that is, consistency is chosen.
  • If the system allows writes, the data in different partitions becomes inconsistent, but availability is preserved; that is, availability is chosen.
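The two cases above can be made concrete with a toy model. The following is a minimal sketch, not a real replication protocol: the `Replica` and `PartitionedStore` classes, the `mode` flag, and the two-replica layout are all assumptions introduced purely for illustration.

```python
class Replica:
    """One copy of a single-value register (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.value = None


class PartitionedStore:
    """Two replicas plus a flag saying whether they can currently communicate."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.partitioned = False
        self.replicas = {"a": Replica("a"), "b": Replica("b")}

    def write(self, replica_name, value):
        """Return True if the write was accepted, False if it was refused."""
        if not self.partitioned:
            # Healthy network: every copy is updated, so C and A both hold.
            for replica in self.replicas.values():
                replica.value = value
            return True
        if self.mode == "CP":
            # Partitioned and consistency chosen: refuse the write.
            return False
        # Partitioned and availability chosen: accept locally; copies diverge.
        self.replicas[replica_name].value = value
        return True

    def is_consistent(self):
        """True when all replicas hold the same value."""
        return len({r.value for r in self.replicas.values()}) == 1
```

In this sketch a CP store that is partitioned rejects writes and its replicas stay identical, while an AP store keeps accepting writes and its replicas diverge until something reconciles them.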

New Understanding of CAP

CAP is often misunderstood, largely because the scope of availability and consistency is left ambiguous when CAP is discussed. If availability, consistency and partition tolerance are not defined for the specific scenario, CAP ends up constraining system design. First, because partitions seldom happen, there is no reason to sacrifice C or A when the system is not partitioned. Second, the trade-off between C and A can be made repeatedly within the same system at very fine granularity, and each decision may vary with the specific operation, or even with the specific data or users involved. Finally, all three properties are matters of degree rather than black and white. Availability obviously varies continuously from 0% to 100%. Consistency comes in many levels, and even partitioning has nuances; for example, different parts of the system may disagree about whether a partition exists.

What is partition tolerance?

In the real world, communication between the nodes of a distributed system is reliable under normal circumstances, with no message loss or high latency. But the network itself is unreliable, so messages will occasionally be lost or badly delayed. When that happens, nodes in different regions cannot communicate (that is, exchange messages) for a period of time, and a partition has formed.

Partition tolerance refers to the ability of a distributed system to continue operating and serving external requests when a network partition occurs. Note that “still able to serve external requests” here is different from the availability requirement. Availability requires that every request receive a response, which means that every node, on either side of a partition, must be able to serve requests even while the network is partitioned. Partition tolerance only emphasizes that the system remains available (possibly only partially) after a partition appears.

For example, a system that uses Paxos for data replication is a typical CP system. Even if there is a network partition, the primary partition can still provide service, so the system is partition tolerant. A counterexample: a system that replicates data using 2PC (two-phase commit) is not partition tolerant; when a network partition occurs, the whole system blocks.

Scope of availability

Availability is actually intuitive: every request receives a (non-error) response, with no guarantee that it contains the most recently written data. To put it another way, every node in the distributed system can respond to external requests, but the responses are not required to be consistent.

What often puzzles us is the criterion for measuring system availability. The key point is the scope of availability: talking about availability is meaningless without fixing its scope in a specific scenario, which is what draws the boundaries. Simply judging whether an algorithm meets the availability requirement is not rigorous, because engineering implementations have many techniques for compensating and correcting.

For example, Google Docs is a very typical AP system. It can still be used when the network is down. The trick is that when it detects that the network is down, it enters offline mode, lets the user keep editing, and merges the changes after the network is restored. For Google Docs, then, the user’s browser is also a node of the system. When a network partition occurs, it can still serve the user, but at the cost of giving up consistency, because the user’s changes are known only locally and the server is unaware of them. In this example the scope of availability includes the user’s browser; the nodes of a distributed system need not be servers, as we usually assume.

It is worth noting that in the real world we generally do not pursue perfect availability, so we usually speak of high availability instead: keeping as many nodes as possible able to serve requests. This is one of the reasons why the Paxos consensus algorithm is becoming more and more popular.

Scope of Consistency

When discussing consistency, it is necessary to clarify its scope: state is consistent within a certain boundary, and consistency beyond that boundary cannot be discussed. For example, when a network partition occurs, Paxos guarantees full consistency and availability within the primary partition, while service outside that partition is unavailable. It is worth noting that choosing consistency during a partition (CP) does not mean losing availability completely; that depends on how the consistency algorithm is implemented. For example, standard two-phase commit is completely unavailable when a partition occurs, while Paxos preserves the consistency and availability of the primary partition.

From the discussion above, we can see that the scope requirement for availability is stricter than the scope requirement for consistency. Availability in CAP theory means availability of the whole system: if even some nodes are unavailable, the availability constraint is violated. The consistency requirement is not as demanding. When a network partition occurs, as long as data in the primary partition is consistent, the system is still considered to satisfy the consistency constraint. Why? Because during a partition the client can still obtain the latest value by reading from the primary partition (it reads from more than half of the nodes, and if the values agree, the data is the latest). The system therefore meets the consistency requirement of CAP theory.
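The majority-read rule in the parenthesis above can be sketched in a few lines. This is a simplification under stated assumptions: the function name `majority_read` and its parameters are hypothetical, replies are compared by plain value, and a real quorum system would normally compare version numbers or ballots rather than raw values.

```python
from collections import Counter


def majority_read(replies, cluster_size):
    """Return the agreed value if more than half of the cluster reports it.

    replies: values returned by the nodes the client could reach.
    cluster_size: total number of nodes, reachable or not.
    Returns None when no value is confirmed by a majority, which is what a
    client stuck in a minority partition would see.
    """
    quorum = cluster_size // 2 + 1
    if len(replies) < quorum:
        # Fewer than a majority of nodes reachable: cannot confirm the latest value.
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= quorum else None
```

With a five-node cluster, three matching replies are enough to confirm a value, while two replies (a minority partition) never are, regardless of what they contain.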

Managing Partitions

Network partitions are inevitable in distributed systems. Classical CAP theory ignores network latency, but in the real world latency and partitions are closely related: when the system fails to reach agreement within a bounded period of time (because network latency is very high), a partition has effectively occurred. At that point a choice must be made between consistency and availability. Continuing to retry means choosing consistency and giving up availability; giving up data consistency and completing the operation means choosing availability. It is worth noting that giving up data consistency during a partition does not mean ignoring it completely: engineering implementations generally use retries to achieve eventual consistency.
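The timeout-driven choice described above might look roughly like the following. It is only a sketch under assumptions: `send` stands in for a hypothetical transport call that returns `False` on timeout, and `replicate`, `drain`, `prefer` and `pending` are illustrative names, not part of any real library.

```python
def replicate(send, update, prefer, pending, max_attempts=3):
    """Try to replicate `update` to a peer and report what happened.

    send(update) -> bool: hypothetical transport call, False means timeout.
    prefer: "consistency" or "availability".
    pending: list used as a retry queue for updates accepted without the peer.
    Returns "committed", "rejected", or "accepted-locally".
    """
    if send(update):
        return "committed"
    # A timeout is treated as evidence of a partition.
    if prefer == "consistency":
        # Keep retrying; the caller waits, so availability is given up.
        for _ in range(max_attempts - 1):
            if send(update):
                return "committed"
        return "rejected"
    # Availability preferred: finish the operation now, reconcile later.
    pending.append(update)
    return "accepted-locally"


def drain(send, pending):
    """Resend queued updates; keep the ones that still fail. Returns how many remain."""
    still_pending = [u for u in pending if not send(u)]
    pending[:] = still_pending
    return len(still_pending)
```

Calling `drain` periodically after the network recovers is one simple way the "retry to reach eventual consistency" idea can be realized; real systems also have to handle conflicting updates, which this sketch leaves out.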

From the above analysis, balancing the impact on availability and consistency during a partition is a key issue in distributed system design. Managing partitions therefore requires not only actively detecting them but also preparing a recovery process for the effects of what happened during the partition. In other words, we can apply CAP theory from another angle: how to choose between consistency and availability once the system enters partition mode.

There are three steps to manage partitions:

  • Detect the start of a partition
  • Enter an explicit partition mode and restrict certain operations
  • Start the partition recovery process when communication is restored
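The three steps above can be pictured as a small state machine. The sketch below is illustrative only: `PartitionManager`, the heartbeat-based detection, and the idea of passing in a set of restricted operation names are assumptions, and the actual reconciliation of logged operations is left to the caller.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    PARTITION = "partition"


class PartitionManager:
    """Tracks partition mode and logs operations for later recovery."""

    def __init__(self, restricted_ops):
        self.mode = Mode.NORMAL
        self.restricted_ops = set(restricted_ops)
        self.log = []  # operations performed while partitioned

    def on_heartbeat(self, peer_reachable):
        """Feed in the result of a (hypothetical) heartbeat check.

        Returns the operations to reconcile when recovery starts, else [].
        """
        if self.mode is Mode.NORMAL and not peer_reachable:
            # Steps 1 and 2: partition detected, enter partition mode.
            self.mode = Mode.PARTITION
            return []
        if self.mode is Mode.PARTITION and peer_reachable:
            # Step 3: communication restored, hand the log over for recovery.
            to_reconcile, self.log = self.log, []
            self.mode = Mode.NORMAL
            return to_reconcile
        return []

    def execute(self, op):
        """Return True if `op` may run now; restricted ops are refused while partitioned."""
        if self.mode is Mode.PARTITION:
            if op in self.restricted_ops:
                return False
            self.log.append(op)
        return True
```

Which operations go into `restricted_ops` is exactly the per-operation consistency versus availability decision: operations that cannot tolerate divergence are refused during the partition, and the rest are logged so the recovery step can reconcile them.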

When the system enters partition mode, it has two choices:

  • Choose consistency: for example, with the Paxos algorithm only the majority (primary) partition can operate and the other partitions are unavailable; when the network is restored, the minority nodes synchronize their data with the majority.
  • Choose availability: for example, Google Docs enters offline mode when a partition occurs, and the client and server data are reconciled once the network is restored.


Theory is abstracted from reality and serves reality, but it is by no means equal to reality. The misunderstanding of “pick two out of three” in CAP theory stems from the fact that we often equate theory with reality. CAP was born mainly to broaden design thinking so that it is not confined by strong consistency; mechanically applying “pick two out of three” instead restricts design thinking. In the real world, different business scenarios have different requirements for availability and consistency, and the scope and degree of consistency and availability vary continuously rather than being all-or-nothing. A better system design can therefore be achieved only by understanding CAP theory accurately, approaching it from the perspective of managing partitions, and taking the specific business scenario into account.

Reference material

  • CAP twelve years later: How the “rules” have changed
  • Cluster-Based Scalable Network Services
  • Harvest, Yield and Scalable Tolerant Systems

Author: Xiao Hansong

This article is original content from the Yunqi Community and may not be reproduced without permission.