A review of the CAP theorem for distributed systems

Time: 2021-4-19

1. The history of the CAP theorem

In July 2000, Professor Eric Brewer proposed the CAP conjecture. Two years later, Seth Gilbert and Nancy Lynch proved it formally, and CAP has since been recognized as a theorem in the field of distributed computing.

2. The background and definition of CAP

CAP theory applies to distributed systems. Ideally, a distributed system would have three basic properties: consistency, availability, and partition tolerance. (The usual Chinese translation of “partition tolerance” is not intuitive and fails to convey the original meaning of “partition”, which artificially raises the cost of understanding it, at least it did for me; the term is explained separately below.) Put simply, the CAP theorem says that no distributed architecture can fully satisfy all three properties at the same time. Architects should not try to design a “perfect” architecture that does; instead they should trade off among C, A, and P according to the situation and the actual requirements.

A literal translation of the three terms is not easy to understand, so here is a concrete example.

Consider the following hypothetical minimal (but typical) distributed application scenario:

  • Two servers, node1 and node2, form a service cluster that serves external clients
  • A client can access the service on either server at random
  • The two servers can also communicate with each other internally

What properties does this simple distributed system need in order to be a good system (product)? Consider the following scenarios:

  1. A customer accesses node1 and writes some data (for example, deposits 100 yuan into an account). When the customer later reads the value, the request happens to go to node2. The system needs to ensure that node2 also returns the correct value. This is the consistency requirement.

    The authoritative definition of consistency is as follows (from the authors who proved the CAP theorem):

    Consistency

    any read operation that begins after a write operation completes must return that value, or the result of a later write operation

  2. When a customer sends a request to a node that is working normally, the system must ensure that the node responds (the response may be an error, and it may be delayed, but there must be a response). In other words, every request must eventually get an answer. This is the availability requirement.

    Availability

    every request received by a non-failing node in the system must result in a response

  3. In a distributed environment, no node is fully reliable, and communication between nodes can also fail. When some nodes become unreachable (because a node itself fails or because part of the network fails), the system is split into so-called “partitions”. The better the service a system can still provide while partitioned (for example, the more consistency and availability it retains), the better its partition tolerance.

    If “partition” is still unclear, consider the English dictionary definition:

    (n.) a wall or screen that separates one part of a room from another

    (v.) to separate one area, one part of a room, etc. from another with a wall or screen

    That is, when some nodes are cut off from the rest of the cluster for whatever reason, the system as a whole can still work and behaves as if nothing were wrong.

    Partition Tolerance

    the network will be allowed to lose arbitrarily many messages sent from one node to another
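The consistency definition quoted above can be turned into a small checker. The following is only a minimal sketch under simplifying assumptions: operations are recorded with start and end timestamps, each write stores a distinct value, and the names `Op` and `is_consistent` are hypothetical, not taken from the original proof.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Op:
    kind: str     # "write" or "read"
    value: int    # value written, or value the read returned
    start: float  # time the operation began
    end: float    # time the operation completed


def is_consistent(history: List[Op]) -> bool:
    """Check the quoted rule on a recorded history (simplified).

    A read that begins after a write completes must return that
    write's value, or the value of a later write.
    """
    writes = sorted((op for op in history if op.kind == "write"),
                    key=lambda op: op.end)
    for read in (op for op in history if op.kind == "read"):
        # Writes that had already completed when this read began.
        completed = [w for w in writes if w.end < read.start]
        if not completed:
            continue  # no completed write constrains this read
        latest = completed[-1]
        # Acceptable values: the latest completed write, or any write
        # finishing after it that had started before the read ended.
        allowed = {w.value for w in writes
                   if w.end >= latest.end and w.start <= read.end}
        if read.value not in allowed:
            return False
    return True
```

In the deposit example, a history where 100 is written through node1 and then read back as 100 through node2 passes this check, while a history where the later read still returns the old balance fails it.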

3. Why C, A, and P cannot all be satisfied at the same time

If you want to go deeper, the original proof is here:

https://groups.csail.mit.edu/tds/papers/Gilbert/Brewer2.pdf

For a more approachable treatment, see this article:

https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/

A simpler explanation is as follows. Consider this scenario:

The client writes data to node1; a partition cuts node2 off, so node1’s data cannot be synchronized to node2; the client then reads the data from node2.

  • Satisfying A and P: while node2 is partitioned, the system still returns a result to the client immediately. But node2 has not yet received node1’s data, so the result cannot be guaranteed to be consistent.
  • Satisfying C and P: while node2 is partitioned, the system returns only consistent results to the client. node2 can answer only after it has correctly synchronized node1’s data (which may never happen), so it cannot respond immediately (or may never respond), and availability is lost.
  • Satisfying C and A: node2 always returns accurate, consistent data to the client immediately, which is only possible if no partition ever occurs. In a distributed system this cannot be guaranteed: although the probability that all nodes in the cluster fail at the same time is very low, the probability that a single node fails is relatively high.
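The scenario above can be played out in a toy simulation. This is a rough sketch, not a real replication protocol: the `Cluster`, `Node`, and `Unavailable` names are hypothetical, replication is modeled as a direct copy, and the "CP" behavior is reduced to node2 refusing reads while it is cut off.

```python
class Unavailable(Exception):
    """Raised when a node refuses to answer rather than risk a stale read."""


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0  # e.g. the account balance


class Cluster:
    def __init__(self, mode: str) -> None:
        if mode not in ("AP", "CP"):
            raise ValueError("mode must be 'AP' or 'CP'")
        self.mode = mode
        self.node1 = Node("node1")
        self.node2 = Node("node2")
        self.partitioned = False  # True while node1 and node2 cannot talk

    def write_node1(self, value: int) -> None:
        """Client writes through node1; replication needs a working link."""
        self.node1.value = value
        if not self.partitioned:
            self.node2.value = value

    def read_node2(self) -> int:
        """Client reads through node2."""
        if self.partitioned and self.mode == "CP":
            # CP choice: give up availability instead of returning stale data.
            raise Unavailable("node2 cannot confirm it has the latest value")
        # AP choice (or no partition): answer immediately with local data.
        return self.node2.value

    def heal(self) -> None:
        """The partition ends and node2 catches up with node1."""
        self.partitioned = False
        self.node2.value = self.node1.value
```

With `mode="AP"`, a read from node2 during the partition returns immediately but may return the old value. With `mode="CP"`, the same read raises `Unavailable` until `heal()` is called. Neither mode gives both guarantees while the partition lasts.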

4. The application of CAP theory in practice

Since the theory says this, we should not waste time trying to design a perfect distributed system. That is the value of the theory as a methodology.

Because partitions cannot be avoided in a distributed scenario, all we can do is make the product as partition tolerant as possible. In other words, we need to trade off consistency against availability when a partition occurs.

  • CP: give priority to data consistency. When the data cannot be kept consistent, reduce the availability of the system as needed, for example by rejecting the current request and letting the client retry, or by slowing the response to the client (such as the 5-second waiting screen at the end of a bank transfer). ZooKeeper, which is designed to coordinate services in a distributed system and keep data consistent across service nodes, is an example of CP. Various distributed data stores, such as Redis and HBase, are also commonly cited as CP examples that favor data consistency, although where a particular product falls depends on how it is configured and deployed.
  • AP: give priority to the availability of the system and relax the requirement for data consistency. For example, the available stock shown on the ordering page of an e-commerce website changes constantly. Keeping it consistent would require fetching the latest data on every view, which would slow the website’s responses, reduce availability, and hurt the user experience. Instead, the page can show a possibly stale number and check whether the stock actually satisfies the order at the moment the order is placed.
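The e-commerce example in the AP bullet can be sketched as follows. This is an illustrative, single-process model only; the `Inventory` class and its method names are hypothetical, and a real system would keep the displayed copy and the authoritative count in separate services.

```python
class Inventory:
    def __init__(self, stock: int) -> None:
        self._stock = stock   # authoritative count, checked when ordering
        self._cached = stock  # possibly stale copy shown on the listing page

    def displayed_stock(self) -> int:
        """AP-style read: always answers immediately, may be out of date."""
        return self._cached

    def refresh_cache(self) -> None:
        """Occasional background refresh of the displayed number."""
        self._cached = self._stock

    def place_order(self, quantity: int) -> bool:
        """Check the authoritative count at the moment the order is placed."""
        if quantity <= 0 or quantity > self._stock:
            return False  # reject: not enough stock right now
        self._stock -= quantity
        return True
```

The listing page stays fast because `displayed_stock()` never waits for the latest data, while `place_order()` is the single point where stale information is not acceptable.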

In fact, as infrastructure keeps improving, the occurrence of P in a distributed system can be controlled more and more finely. When partitions are not a concern, the system can achieve both C and A most of the time. For details, see another article by the author of the original conjecture (strongly recommended): CAP Twelve Years Later: How the “Rules” Have Changed

Chinese version: https://www.infoq.cn/article/cap-twelve-years-later-how-the-rules-have-changed

English version: https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed