Reflections on the TiDB paper: an HTAP database with strong data consistency and resource isolation


About the author
Chen Xianlin is the head of the middle platform at Fish Technology, which he built from 0 to 1. He has experience in distributed architecture, service governance, stability engineering, high-concurrency and high-QPS systems, and the organizational structure of middle-platform teams. He advocates simple, elegant design and follows cloud native technology and distributed databases.

The PingCAP team's paper "TiDB: A Raft-based HTAP Database" was accepted at VLDB 2020. This is an affirmation of what TiDB has achieved so far, and it is gratifying to see domestic database technology developing so quickly. TiDB's approach to high availability, horizontal scaling and ACID transactions was published long ago and is familiar by now, so I will not repeat it here. What follows are some thoughts prompted by the paper on how to build an HTAP database that offers both strong data consistency and resource isolation.


As we all know, databases are divided into OLTP and OLAP types. Why should they be divided this way?
First, OLTP and OLAP describe two clearly different data processing workloads. An OLTP operation touches little data but has strict real-time and transactional requirements and a high degree of concurrency. An OLAP operation has looser real-time and transactional requirements, but it touches a large amount of data and its query patterns are not fixed, so it is hard to cover with indexes. Second, early databases made no distinction between the two: OLTP and OLAP operations both ran in a single database, usually a relational one. As data volumes grew, a relational database could no longer serve OLTP and OLAP requests at the same time, and, worse, the OLAP requests could degrade the OLTP ones. Databases better suited to the OLAP workload were therefore designed, and OLTP data was synchronized into the OLAP database before OLAP operations were run on it.
This approach resolves the conflict between OLTP and OLAP workloads, but it introduces an extra external replication step from OLTP to OLAP, and the data seen by OLAP operations loses freshness and consistency as a result. In other words, the problem is solved outside the database system, through heterogeneous systems, at the cost of real-time and consistent data. In this paper, TiDB proposes a new scheme that solves the problem inside the database system without sacrificing the freshness or consistency of the data.

Strong data consistency or resource isolation

As mentioned above, OLTP and OLAP are two very different workloads. Schemes that try to serve both generally fall into two groups:

  1. Design a single storage engine that suits both OLTP and OLAP workloads and use it to serve every request. This handles the freshness and consistency of data well. However, it is hard to keep the OLTP and OLAP workloads from affecting each other, and requiring one engine to suit both workloads places many restrictions on its design and optimization. It feels like being asked to design a colorful shade of black.
  2. Put two storage engines in the database, one responsible for the OLTP workload and one for the OLAP workload. This avoids the problems above. However, data must now be replicated between the two engines, and achieving strong data consistency and resource isolation at the same time becomes a hard problem.

In this paper, TiDB chooses scheme 2. It provides a row storage engine, TiKV, for the OLTP workload and a column storage engine, TiFlash, for the OLAP workload. How, then, does it achieve both strong data consistency and resource isolation?

Generally speaking, for a distributed storage system, strong data consistency and resource isolation are an either-or choice. Choosing strong consistency usually means copying data to multiple instances through synchronous replication (such as synchronous double writes). This tightly couples all computing and storage resources into one system, so a small local problem can spread and affect the whole. Choosing resource isolation usually means copying data to other instances through asynchronous replication (such as master-slave synchronization). This keeps resources isolated from each other, but strong consistency of the data can no longer be guaranteed.

The solution given in the TiDB paper is to extend the Raft algorithm with a learner role.

Follower or Learner

TiKV calls each contiguous range of data a Region (96 MB by default). Each Region is a Raft group, and data is copied from the leader to the followers through the Raft protocol. This is synchronous replication: a write succeeds only after more than half of the nodes have replicated it. If TiFlash also synchronized data as a follower, replication between TiKV and TiFlash could loosely be regarded as synchronous. (Strictly speaking, it sits between synchronous and asynchronous: a slow or failed TiFlash follower acts as one more node that fails to replicate, which lowers the chance of reaching a majority and so reduces the availability of TiKV.) The two storage engines would then affect each other, and the goal of resource isolation could not be met.

Therefore, TiDB extends the Raft algorithm with a learner role. A learner only receives the Raft log of its Raft group asynchronously; it does not take part in committing log entries or in leader election. As a result, TiFlash's learner nodes add very little overhead to TiKV during replication. After receiving data through the learner, TiFlash converts the row-format tuples into columnar storage, so that the TiDB cluster holds the data in both row and column form.
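
To make the difference between a follower and a learner concrete, here is a minimal sketch of how a commit decision changes when the TiFlash replica does or does not count toward the majority. The `Replica` class and the helper names are hypothetical and written only for illustration; they are not TiKV's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    is_learner: bool = False  # learners receive the log but never vote
    acked: bool = False       # has this replica persisted the log entry?


def quorum_size(replicas):
    """Majority of the voting members; learners are not counted."""
    voters = [r for r in replicas if not r.is_learner]
    return len(voters) // 2 + 1


def is_committed(replicas):
    """An entry commits once a majority of voters have acknowledged it."""
    acks = sum(1 for r in replicas if not r.is_learner and r.acked)
    return acks >= quorum_size(replicas)


# Three TiKV voters plus a slow TiFlash replica that joined as a learner.
group = [
    Replica("tikv-leader", acked=True),
    Replica("tikv-follower-1", acked=True),
    Replica("tikv-follower-2", acked=False),
    Replica("tiflash", is_learner=True, acked=False),
]
learner_case = (quorum_size(group), is_committed(group))

# The same group, but with TiFlash as an ordinary voting follower.
as_follower = [Replica(r.name, is_learner=False, acked=r.acked) for r in group]
follower_case = (quorum_size(as_follower), is_committed(as_follower))

print(learner_case, follower_case)
```

With the learner, the quorum stays at 2 of 3 voters and the write commits even though TiFlash is lagging. With TiFlash as a follower, the quorum rises to 3 of 4, and the same two acknowledgements are no longer enough.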

You may have spotted a problem here. If TiFlash's learner receives the Raft log of its Raft group only asynchronously, how can TiFlash guarantee strong data consistency?

The problem is solved at the moment TiFlash reads data. Similar to Raft's follower read mechanism, the learner node provides snapshot isolation, so data can be read from TiFlash at a specific timestamp. On receiving a read request, the learner sends a ReadIndex request to its leader. It then waits until the Raft log up to the returned read index has been replicated and replayed into local storage, and finally filters out the data that satisfies the request according to the requested timestamp.
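
The read path just described can be sketched as follows. This is a simplified, single-process model under my own assumptions: the `Leader` and `Learner` classes are hypothetical, the leader's log is treated as already committed, and "waiting for the log to be replicated and replayed" is modelled as the learner pulling entries synchronously.

```python
class Leader:
    def __init__(self):
        self.log = []  # committed entries: (key, value, commit_ts)

    def write(self, key, value, commit_ts):
        self.log.append((key, value, commit_ts))

    def read_index(self):
        # A real Raft leader also confirms it is still the leader here.
        return len(self.log)


class Learner:
    def __init__(self, leader):
        self.leader = leader
        self.applied_index = 0
        self.versions = {}  # key -> list of (commit_ts, value)

    def catch_up(self, target_index):
        # Stand-in for waiting until the log is replicated and replayed.
        while self.applied_index < target_index:
            key, value, commit_ts = self.leader.log[self.applied_index]
            self.versions.setdefault(key, []).append((commit_ts, value))
            self.applied_index += 1

    def read(self, key, read_ts):
        target = self.leader.read_index()  # 1. ask the leader for a read index
        self.catch_up(target)              # 2. replay the log up to that index
        # 3. keep only the versions visible at the requested timestamp
        visible = [(ts, v) for ts, v in self.versions.get(key, []) if ts <= read_ts]
        return max(visible, key=lambda pair: pair[0])[1] if visible else None


leader = Leader()
learner = Learner(leader)
leader.write("balance", 100, commit_ts=10)
leader.write("balance", 80, commit_ts=20)

lag_before_read = leader.read_index() - learner.applied_index
latest = learner.read("balance", read_ts=25)
older = learner.read("balance", read_ts=15)
missing = learner.read("balance", read_ts=5)
print(lag_before_read, latest, older, missing)
```

The learner is two entries behind before the first read, yet every read still sees a consistent snapshot, because the replication lag is paid at read time rather than at write time.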

In this way, TiDB replaces a synchronous replication mechanism with an asynchronous one while still guaranteeing strong data consistency. When TiFlash reads data, its learner only needs to perform one ReadIndex operation against the TiKV leader, which is very lightweight, so TiKV and TiFlash affect each other very little. The experimental data in the paper confirm this. When AP and TP operations run at the same time in TiDB, AP reduces TP throughput by less than 10% at most, and TP reduces AP throughput by less than 5% at most.

The experiments in the paper also show that the delay introduced by asynchronous replication from TiKV to TiFlash is very small. With 10 warehouses of data, most replication delays are under 100 ms and the maximum does not exceed 300 ms. With 100 warehouses, most delays are under 500 ms and the maximum does not exceed 1500 ms. Moreover, this delay does not weaken TiFlash's consistency level. It only makes requests on TiFlash slightly slower, because the data has to be synchronized when a read request arrives.


At this point TiDB has two storage engines: the OLTP-friendly row store TiKV and the OLAP-friendly column store TiFlash. But that is not the key point. The key point is that the data in the two engines is strongly consistent and can be read at the same snapshot isolation level, which is a great advantage for the query optimizer in the computing layer. For a given request, the optimizer can choose among three scan methods: a row scan or an index scan on TiKV, or a column scan on TiFlash. Different parts of the same request can even use different scan methods, which gives the optimizer a huge space to optimize in. The experimental data in the paper also show that AP requests that use both TiKV and TiFlash perform better than those that use either one alone.
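
As a rough illustration of why having three scan methods over the same consistent data helps, here is a toy cost comparison. The cost formulas and the `LOOKUP_PENALTY` constant are invented for this sketch and are not the cost model from the paper or from TiDB's optimizer; they only capture the intuition that an index scan touches few rows and a column scan touches few columns.

```python
LOOKUP_PENALTY = 10  # hypothetical extra cost per row for going through an index


def choose_scan(total_rows, selectivity, columns_needed, total_columns, has_index):
    """Return the cheapest scan method and the estimated cost of each."""
    costs = {
        # TiKV row scan: every row is read with all of its columns.
        "tikv_row_scan": total_rows * total_columns,
        # TiFlash column scan: every row is read, but only the needed columns.
        "tiflash_column_scan": total_rows * columns_needed,
    }
    if has_index:
        # TiKV index scan: only matching rows are read, at a per-row penalty.
        matching_rows = total_rows * selectivity
        costs["tikv_index_scan"] = matching_rows * (total_columns + LOOKUP_PENALTY)
    best = min(costs, key=costs.get)
    return best, costs


# A TP-style lookup: a handful of rows, all columns.
tp_choice, _ = choose_scan(1_000_000, 0.00001, 20, 20, has_index=True)

# An AP-style aggregate: half the table, two columns.
ap_choice, _ = choose_scan(1_000_000, 0.5, 2, 20, has_index=True)

print(tp_choice, ap_choice)
```

Under these assumed costs the lookup goes to the TiKV index and the aggregate goes to TiFlash, and neither choice changes the result, because both engines serve the same snapshot.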

Let's return to the question from the beginning of the article. Because a database must handle both OLTP and OLAP workloads well, databases were split into OLTP and OLAP types by workload, and users had to classify each request as OLTP or OLAP and send it to the matching type of database. TiDB offers another solution: the user sees only one database. After analyzing a request, the database itself decides which storage engine to use, or whether to use both at once. The higher-level abstraction of classifying databases and queries by workload disappears.

When people face a problem for which no fundamental solution exists yet, they tend to break it down by scenario and solve each small scenario separately. This is only an expedient; the fundamental solution has to wait for progress in theory or technology. Take communication technology as an example. Wired phones first solved communication between fixed places. Pagers then made it possible to receive messages on the move, and a pager combined with a wired phone became a workable answer to mobile communication. Finally, the mobile phone solved the problem directly, and the wired phone and the pager, both targeted solutions to individual communication scenarios, slowly left the stage of history. The same is true for databases: the problem is first split into different workloads that are solved one by one, and eventually a unified solution takes shape. TiDB has taken a big step in that direction, and we will wait and see.

Single point or horizontal expansion

The paper also contains another interesting point. In a distributed architecture, any single point that cannot be scaled horizontally is an original sin, because in theory it can become the bottleneck of the whole system. TiDB is a horizontally scalable distributed database, yet architecturally it has one single-point dependency: obtaining timestamps from PD. The paper uses rigorous performance tests to show that this will not become the bottleneck of the whole system. The authors clearly felt the need to defend themselves on this point.
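
One intuition for why a single timestamp source can remain cheap is that handing out a timestamp is only an in-memory counter increment, and a caller can reserve a whole block in one request. The sketch below is my own illustration of that idea. The `TimestampOracle` class, the 18-bit logical counter and the batch interface are assumptions made for the example, not details taken from the article or a description of PD's code.

```python
class TimestampOracle:
    LOGICAL_BITS = 18  # assumed layout: physical milliseconds, then a logical counter

    def __init__(self, clock):
        self.clock = clock  # callable returning the current time in milliseconds
        self.physical = 0
        self.logical = 0

    def allocate(self, count=1):
        """Reserve `count` consecutive timestamps and return the first one."""
        now = self.clock()
        if now > self.physical:
            self.physical, self.logical = now, 0
        if self.logical + count > (1 << self.LOGICAL_BITS):
            # Logical counter exhausted: borrow the next millisecond.
            self.physical, self.logical = self.physical + 1, 0
        first = (self.physical << self.LOGICAL_BITS) | self.logical
        self.logical += count
        return first


# A fake clock keeps the example deterministic.
ticks = iter([1000, 1000, 1001])
oracle = TimestampOracle(clock=lambda: next(ticks))

a = oracle.allocate()           # one timestamp
b = oracle.allocate(count=100)  # a block of 100 from a single request
c = oracle.allocate()           # the clock has advanced by one millisecond
print(a < b < c)
```

Each request costs a few integer operations, and batching lets one round trip serve many transactions, which is consistent with the paper's finding that this single point does not limit the system.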

Complete decentralization or unified centralized scheduling

Among today's distributed storage systems, many designs from abroad are completely decentralized, such as Cassandra and CockroachDB. TiDB's architecture is not: it has PD, which acts as a central brain. This loosely mirrors the difference between Eastern and Western thinking, big government versus small government, which is an interesting parallel.

Vitalik Buterin has given the main reasons for choosing a fully decentralized design: fault tolerance, attack resistance and collusion resistance. Databases are deployed on trusted internal networks, so attack resistance and collusion resistance are not a concern. This is unlike Bitcoin and other digital currencies, which adopt a decentralized architecture for social and political reasons, so that no person or organization can control them. Fault tolerance, moreover, can also be achieved in a centralized architecture.

More importantly, NewSQL and HTAP databases such as TiDB are designed for massive data, and the number of nodes and the volume of data managed by a database cluster will keep growing. Especially when combined with the native elasticity of the cloud, the database's ability to schedule intelligently will be a key factor in its performance and stability. A decentralized architecture makes scheduling decisions harder, especially those that need a global view and cooperation among many nodes.

Therefore, as long as the centralized role is not the system bottleneck, centralized scheduling has a natural advantage. After all, what matters most for scheduling is a global view and the ability to coordinate many nodes. TiDB positions PD, the brain of the cluster, as a very lightweight role that holds no persistent scheduling-related state, so it does not limit the horizontal scalability of the whole system.

Present or future

Looking inside TiDB, we can see an architecture that fully separates storage from computing. At present the computing layer has two engines, the SQL engine and TiSpark, and the storage layer also has two engines, TiKV and TiFlash. In the future, both layers can easily be extended with new engines. TiDB's goal is therefore not just to be a database, but to build a distributed storage ecosystem.

From the perspective of a single TiDB cluster, HTAP with strong data consistency and isolated resources is a very efficient capability. It removes the step of synchronizing data from an OLTP database to an OLAP database, and also the step of synchronizing OLAP results back to the OLTP database when online business needs them. Engineers can then happily do their everyday "brick moving" of writing SQL, instead of constantly moving data around. Compared with moving bricks, moving data is like carrying water: it is far more prone to spills and leaks.

From the perspective of multiple TiDB clusters, TiDB provides a horizontally scalable HTAP database but does not provide tenant isolation. For reasons of business isolation, data-level isolation and operational guarantees (such as backup and recovery), a company therefore cannot put all of its data into one TiDB cluster. And although TiDB provides OLAP capability, if the data needed for an AP operation is spread across several clusters, it still has to be synchronized externally into one database with OLAP capability (which may itself be TiDB). That reintroduces the very problem TiDB solved with HTAP. One feasible idea, I think, is to add a layer like Google F1 on top of the TiDB clusters. Under a unified F1 layer there can be many TiDB clusters, each completely isolated and each equivalent to one tenant, while the F1 layer provides metadata management, routing of read and write requests, and AP capability across clusters. Another idea is to solve it inside TiDB by adding the concept of tenants to the storage layer. Each tenant maps to a group of storage nodes, so the storage layer is isolated between tenants, while the computing layer can run AP operations across tenants. In short, the storage layer wants resource isolation while the computing layer wants a unified view. We look forward to TiDB's future solutions here.


Finally, let's talk about the value of this paper. Generally speaking, the value of an engineering paper differs from that of an academic paper. Its contribution is not some great innovation in ideas or theory. Instead, it tells you that a direction is feasible, and its concrete engineering implementation spares others a great deal of uncertain exploration. For example, Google's papers on GFS, MapReduce and BigTable contain little academic innovation; almost all of the ideas and theories already existed. But they showed that it was feasible to implement those ideas in a distributed way, which was of great significance and greatly promoted the adoption and development of distributed storage systems.

Therefore, as the industry's first paper on an industrial implementation of a real-time HTAP distributed database, the TiDB paper can be expected to accelerate the adoption and development of real-time HTAP distributed databases. Seen from this angle, its significance is very great.