How is TiDB's columnar storage engine implemented?


Author: Wei Wan

TiDB is a distributed HTAP database. It currently has two kinds of storage nodes: TiKV and TiFlash. TiKV uses row storage and is better suited to TP workloads; TiFlash uses columnar storage and excels at AP workloads. TiFlash synchronizes data from TiKV nodes in real time through the Raft protocol, with millisecond-level delay and excellent analytical performance, and supports real-time replication of TiKV updates as well as online DDL. TiFlash joins TiDB's Raft system as a Raft learner, so the two kinds of nodes are integrated into one database cluster, with the upper layer querying uniformly through TiDB nodes, making TiDB a true HTAP database.


To support real-time updates on columnar storage, we developed a new columnar storage engine for TiFlash called Delta Tree. It sustains high-TPS writes while still maintaining good read performance. This article explains the design of Delta Tree in detail; discussion is welcome.

Overall architecture

The architecture of Delta Tree draws heavily on the design ideas of the B+ tree and the LSM tree. At the top level, Delta Tree partitions table data into ranges by primary key; each partitioned data block is called a segment. Within a segment, a hierarchical structure similar to an LSM tree is used. Partitioning reduces the amount of data in each range and thus the complexity of operating on it; we will see the great advantages this design brings later.


The granularity of a segment is usually about 1.5 million rows, much larger than the leaf node of a traditional B+ tree. The number of segments on one machine is usually under 100,000, so we can keep all segment metadata in memory, which simplifies the engineering. Like B+ tree leaf nodes, segments support split and merge. In the initial state there is only one segment, with the range [-∞, +∞).


Two-Level LSM Tree

Within a segment, data is organized hierarchically, similar to an LSM tree. Because a segment holds far less data than typical LSM tree implementations, Delta Tree needs only two fixed levels, the delta layer and the stable layer, corresponding to L0 and L1 of an LSM tree. We know that for an LSM tree, fewer levels mean smaller write amplification. Under the default configuration, the theoretical write amplification of Delta Tree (ignoring compression) is about 19x. And because columnar storage stores data of the same type contiguously, it is naturally friendly to compression algorithms; in production environments, the actual write amplification of the Delta Tree engine is commonly below 5x.


The overall architecture of Delta Tree was introduced above. Next, we discuss the internal design of a segment in detail.


The internal data management unit of a segment is the pack. A pack usually contains 8K rows or more. The schema of a relational table consists of multiple column definitions, each including the column name, column ID, column type, and default value. Because we support DDL operations such as adding columns, dropping columns, and changing data types, different packs may have different schemas. A pack's data is likewise composed of column data, each column being a one-dimensional array. In addition to the primary key column (PK) and the columns in the schema, a pack also contains a version column and a del_mark column. version is the commit timestamp of the transaction, used to implement MVCC; del_mark is a boolean indicating whether the row has been deleted.
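To make the pack layout concrete, here is a minimal sketch in Python. The class and its fields are purely illustrative (the column names pk, version, and del_mark follow the text; everything else is an assumption, not TiFlash's actual data structures):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Pack:
    """Illustrative sketch of a pack: per-column arrays, plus the hidden
    pk, version and del_mark columns described in the text."""
    pk: List[int]                  # primary key column
    version: List[int]             # transaction commit timestamp, for MVCC
    del_mark: List[bool]           # True means the row was deleted
    columns: Dict[str, list] = field(default_factory=dict)  # schema columns

    def num_rows(self) -> int:
        return len(self.pk)

# A tiny pack: row 3 is a delete marker for pk=3 committed at version 105.
pack = Pack(pk=[1, 2, 3], version=[100, 100, 105],
            del_mark=[False, False, True],
            columns={"amount": [9.5, 3.0, 7.2]})
assert pack.num_rows() == 3
```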

The purpose of dividing segment data into packs is to use packs as both the IO unit and the index filtering unit. In data analysis scenarios, data is fetched from the storage engine by scan; to improve scan performance, our IO blocks are usually large, so each read fetches one or more packs. Also, in analytical scenarios a traditional row-level precise index is usually of little use, but we can still implement simple rough indexes, such as a min-max index, whose filtering unit is the pack.
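The pack-granularity rough index mentioned above can be sketched as follows (a hypothetical min-max index, not TiFlash's implementation): a scan with a range predicate only reads packs whose [min, max] interval can intersect the predicate.

```python
# Build a min-max entry per pack, then use it to skip packs that cannot
# possibly contain values in the queried range [lo, hi].
def build_minmax(pack_values):
    """pack_values: one list of column values per pack."""
    return [(min(vs), max(vs)) for vs in pack_values]

def packs_to_read(index, lo, hi):
    """Indices of packs whose value range may intersect [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(index) if mx >= lo and mn <= hi]

idx = build_minmax([[1, 5], [10, 20], [21, 30]])
# Pack 0 (values 1..5) is pruned entirely; only packs 1 and 2 are read.
assert packs_to_read(idx, 12, 25) == [1, 2]
```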

If you are familiar with TiDB's architecture, you know that TiKV's data is scheduled in units of Regions, where a Region is a virtual data block partitioned by range. Inside a Delta Tree pack, data is sorted in ascending order by the combined (PK, version) key, which is consistent with the data order in TiKV. In this way TiFlash can join a TiDB cluster seamlessly and reuse the existing Region scheduling mechanism.

Delta Layer

The delta layer is equivalent to L0 of an LSM tree. It can be regarded as an incremental update to the segment, hence the name delta. Similar to the memtable of an LSM tree, the newest data is first written to a data structure called the delta cache; when it is full, it is flushed to the delta layer on disk. When the delta layer is full, it is merged with the stable layer (this action is called delta merge) to produce a new stable layer.
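The write path just described can be sketched in a few lines. The thresholds below are arbitrary toy values, not TiFlash defaults, and rows are reduced to bare keys for brevity:

```python
# Toy limits: flush the in-memory cache at 4 rows, delta merge at 8 rows.
CACHE_LIMIT = 4
DELTA_LIMIT = 8

class Segment:
    def __init__(self):
        self.cache, self.delta, self.stable = [], [], []

    def write(self, row):
        self.cache.append(row)
        if len(self.cache) >= CACHE_LIMIT:   # flush delta cache to delta layer
            self.delta.extend(self.cache)
            self.cache.clear()
        if len(self.delta) >= DELTA_LIMIT:   # delta merge: rebuild stable layer
            self.stable = sorted(self.stable + self.delta)
            self.delta.clear()

seg = Segment()
for pk in [5, 1, 9, 3, 7, 2, 8, 4]:
    seg.write(pk)
# After two cache flushes the delta layer hit its limit and was merged
# into a globally sorted stable layer.
assert seg.stable == [1, 2, 3, 4, 5, 7, 8, 9] and seg.delta == []
```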

Stable Layer

The stable layer is equivalent to L1 of an LSM tree and holds most of a segment's data. It is stored in an immutable file called a DTFile, and a segment has exactly one DTFile. The stable layer is also composed of packs, with data sorted in ascending order by the combined (PK, version) key. The difference is that the stable layer is globally sorted, while the delta layer only guarantees order within each pack. The reason is simple: delta layer data is flushed from the delta cache, so the key ranges of its packs may overlap, whereas stable layer data is sorted by the delta merge action to achieve global order.

When a segment's total data volume exceeds the configured capacity limit, it is split into two segments at the midpoint of its range; conversely, if two adjacent segments are both small, they may be merged into one.
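The split and merge of segment ranges can be sketched as follows (a simplification assuming half-open integer key ranges; real PK ranges and split-point selection are more involved):

```python
# Split a segment range [lo, hi) at its midpoint; merge two adjacent ranges.
def split(seg_range):
    lo, hi = seg_range
    mid = (lo + hi) // 2
    return (lo, mid), (mid, hi)

def merge(left, right):
    assert left[1] == right[0], "only adjacent segments can be merged"
    return (left[0], right[1])

a, b = split((0, 100))
assert a == (0, 50) and b == (50, 100)
assert merge(a, b) == (0, 100)
```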

Storage mode

The delta layer and stable layer of the Delta Tree engine are stored differently on disk: the former uses PageStorage (PS), the latter uses DTFiles.



PS is similar to object storage. It stores and manages data blocks called pages, each of which is just a run of bytes. It is designed to store the pack data of the delta layer and the metadata of segments. PS supports get, insert, delete, and update operations on pages, and multiple operations can be combined into one WriteBatch for atomic writes. Pages are stored in PageFiles, and one PageFile can hold many pages. We maintain the page metadata, a PageMap, in memory; it locates the specific PageFile and file offset where a page is stored, enabling random reads of pages.

PS may have one or more writable PageFiles. When a writable PageFile is full, it becomes a read-only PageFile. All page update operations are appended directly to the end of a PageFile: for example, updating a page means writing a new copy of the page into the writable PageFile and then pointing the page ID at the new copy in the in-memory metadata table. Over time, many invalid data segments accumulate in PageFiles, so we designed GC threads that merge PageFiles with low utilization in the background, improving read efficiency and reclaiming disk space.
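The append-only update scheme above can be sketched with a toy PageStorage (illustrative only; a single bytearray stands in for the writable PageFile, and the API names are assumptions):

```python
# Every put/update appends to the tail of the current page file; the
# in-memory page map (page id -> offset, length) always points at the
# newest copy, so reads never see stale data.
class PageStorage:
    def __init__(self):
        self.file = bytearray()        # stands in for the writable PageFile
        self.pagemap = {}              # page id -> (offset, length)

    def put(self, page_id, data: bytes):
        offset = len(self.file)
        self.file += data              # append-only write
        self.pagemap[page_id] = (offset, len(data))

    def get(self, page_id) -> bytes:
        off, ln = self.pagemap[page_id]
        return bytes(self.file[off:off + ln])

ps = PageStorage()
ps.put(1, b"v1")
ps.put(1, b"v2-updated")               # update = append + repoint the map
assert ps.get(1) == b"v2-updated"
# The old bytes stay in the file as garbage until a background GC pass.
assert len(ps.file) == len(b"v1") + len(b"v2-updated")
```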

A pack in the delta layer is serialized into a page and stored in PS. Because business queries usually touch only some columns of a table, PS also supports reading only part of a page.


A DTFile takes the form of a folder on the file system. Inside the folder is a standard columnar format: one column is stored in one file. DTFiles are read-only and are read sequentially, which suits stable layer data. DTFiles are generated in three cases: delta merge, segment split, and segment merge.

The stable layer is suited to DTFile storage, while the delta layer is better suited to PageStorage, mainly because of the following differences:

  • A DTFile stores one file per column, so reading or writing consecutive packs allows column-level IO consolidation, which fits the read and generation pattern of the stable layer well.
  • A DTFile is read-only once written. Using it for delta layer data would force a one-to-one mapping between packs and DTFiles, producing a large number of small files and hurting system performance. PageStorage stores many pages together and has no small-file problem.
  • PageStorage can read pages randomly, which matches the read pattern of the delta layer.
  • PageStorage reduces the number of IOs during writes. Writing a DTFile requires opening as many files as there are columns, and each column generates at least one IO. Since the stable layer is regenerated infrequently, and writing multiple packs allows column-level IO consolidation, each IO is reasonably large and DTFile storage is appropriate. The delta layer, by contrast, writes only one pack at a time; with DTFile storage the IO unit could only be one column of one pack, whereas with PageStorage we can serialize all of a pack's column data into a page and write it to disk in one shot, reducing the number of IOs.

Write optimization

At present, TiFlash is mainly aimed at real-time OLAP scenarios, so its design must support both high-frequency writes and batch writes. High-frequency writes come from replicating TiKV's update operations from OLTP workloads in real time; batch writes arise when importing data in bulk or when TiFlash columnar replication has just been enabled, both of which require writing a large amount of data in a short time. We handle these two scenarios differently to achieve the best result in each.

Delta Cache

To relieve the IOPS pressure of high-frequency writes, we designed an in-memory cache in the delta layer, called the delta cache. Updates are written to the delta cache first and flushed to disk only when it is full. Batch writes skip the delta cache; such data is written directly to disk.

You may notice that the delta cache risks losing data. A typical storage engine therefore writes a WAL before writing data, and recovers from the last flush point after a restart. But because Delta Tree is a columnar engine designed for TiDB, it can exploit a property of TiDB, namely the Raft protocol: any update is first written to the Raft log and is applied to the state machine (i.e. the database storage engine) only after a majority of replicas have confirmed it. So we use the Raft log directly as the WAL; that is, the Raft log applied index is advanced only after a flush.

Continuous write capability

Similar to an LSM tree, Delta Tree uses background threads to merge the delta layer into the stable layer while data is being written. The goal is to keep the delta layer's share of the data within a certain proportion, so as to maintain good read performance. Delta merge is the main source of Delta Tree's write amplification: write amplification ≈ average segment size / delta limit. Thus a higher delta merge frequency means greater write amplification, but also a smaller delta layer ratio and better read performance.
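A back-of-the-envelope check of that formula: the actual default sizes are not given in this article, so the numbers below are assumptions chosen only to reproduce the roughly 19x theoretical figure quoted earlier.

```python
# write amplification ~= average segment size / delta limit
def write_amplification(avg_segment_rows, delta_limit_rows):
    return avg_segment_rows / delta_limit_rows

# Assumed values: ~1.5M rows per segment (per the text), and a
# hypothetical delta limit of 80K rows.
wa = write_amplification(avg_segment_rows=1_500_000, delta_limit_rows=80_000)
assert round(wa) == 19   # consistent with the ~19x theoretical figure
```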

TiFlash's design philosophy is to preserve write capability as much as possible and find a balance between write and read performance. For example, when the business write volume is very large and writes come in too fast for background delta merge to keep up, a write stall may occur, blocking foreground writes. In this case Delta Tree dynamically throttles the write speed and lowers the delta merge frequency, reducing write amplification so that writes and background delta merge return to balance. This sacrifices some read and write performance, but avoids the extreme case of being completely unable to write, which is a better experience for the business.

Read optimization

Delta Tree was not implemented directly as a columnar LSM tree; it goes through all the designs above largely to accelerate reads. Delta Tree's segment partitioning reduces read amplification, and the two-level structure inside a segment also facilitates read acceleration.

Reading data from the storage engine corresponds to the scan operator in the execution plan. Its main task is, given a range such as [x, y), to return the data in that range ordered by (PK, version). Three parts of this process are time-consuming, and the design of Delta Tree optimizes each of them:

A. The cost of reading IO and decompressing the data itself.

B. The cost of multi-way merging of multiple sorted data streams, e.g. with the common min-heap algorithm.

C. After multi-way merging, the cost of copying data from the multiple input streams into the output stream.
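Part B, the multi-way merge of sorted streams, can be written with the standard min-heap approach. A minimal version (the standard library's heapq.merge does essentially the same thing):

```python
import heapq

# Merge k sorted streams into one sorted stream using a min-heap whose
# entries are (value, stream index, position in stream).
def kway_merge(streams):
    heap = [(s[0], i, 0) for i, s in enumerate(streams) if s]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(streams[i]):                 # refill from same stream
            heapq.heappush(heap, (streams[i][j + 1], i, j + 1))
    return out

assert kway_merge([[1, 4, 7], [2, 5], [3, 6, 8]]) == [1, 2, 3, 4, 5, 6, 7, 8]
```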

Reduce read amplification

We know that a TiDB cluster's data is managed in Regions; by scheduling Region replicas to be distributed sensibly across the cluster's storage nodes (TiKV and TiFlash), we achieve load balancing and high data availability. Because Regions are only logical blocks, the Regions on a node are not contiguous. So when a scan involves many Regions, read amplification is unavoidable. Obviously, the more files a Region's data is spread across, the larger its read amplification; that is, the more levels, the greater the read amplification. Delta Tree reduces the amount of data per range and the number of levels through segment partitioning, thereby reducing read amplification, i.e. optimizing part A of the cost above.

Read index: delta index

Since multi-way merging is expensive, can we avoid redoing it on every read? The answer is yes; in fact, some in-memory databases have practiced similar ideas. The specific idea: after the first scan completes, we try to save the information produced by the multi-way merge algorithm, so that the next scan can reuse it and only needs to process the incremental part. This reusable information is called the delta index, implemented as a B+ tree. The design of the delta index went through many iterations and drew on many existing database schemes.

  1. First, view the stable layer and delta layer together; each row of data then has a tuple ID usable for indexing. Then, borrowing the idea of a secondary index, build a B+ tree in memory whose leaf-node entries are sorted in ascending order by (PK, version). The content of an entry is ((PK, version), is_insert, Tuple Id), where is_insert indicates whether this is an insert or a delete.
  2. At this point every row in the segment needs a corresponding entry, which puts enormous pressure on memory, so the index must be compressed. We observed that (PK, version) is unnecessary: since we have the tuple ID, it can be fetched from the original data and need not be stored in the delta index. This is critical for scenarios with large PKs.
  3. Tuple IDs often increase consecutively, because stable layer data usually makes up the vast majority of a segment. So N consecutive stable entries can be replaced by a single (Tuple Id, N).
  4. Observing further, the tuple IDs of stable layer data must be in ascending order, because the stable layer is globally sorted. So we can record only the delta layer entries, plus how many stable tuples are interleaved between two delta entries. To do this, the tuple IDs of the stable and delta layers need to be distinguished; they are called the Stable Id and Delta Id respectively.


The final entry format is (is_insert, Delta Id, (Stable Id, N)), and the number of entries equals the amount of delta layer data. Generally the delta layer accounts for less than 3% of the total data, and an entry needs only 16 bytes; if a single node stores an estimated 10 billion rows, this requires about 4.5 GB of memory, an acceptable overhead.
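The memory estimate above checks out arithmetically: 10 billion rows with ~3% in the delta layer gives 300 million entries at 16 bytes each.

```python
# Sanity-check the ~4.5 GB figure: rows * delta ratio * bytes per entry.
rows = 10_000_000_000
delta_ratio = 0.03          # delta layer share of total data (< 3%)
entry_bytes = 16            # one delta index entry

mem_gib = rows * delta_ratio * entry_bytes / 2**30
assert 4.4 < mem_gib < 4.5  # ~4.47 GiB, i.e. roughly the 4.5 GB in the text
```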

With the information recorded by the delta index, a scan can easily combine the delta layer and the stable layer into one sorted output stream. There is a small wrinkle: between two scans, new data may have been written to the delta layer, so we still need to process this incremental data and apply it to the delta index. Since each scan reuses the result of the previous one, the cost is amortized; that is, we optimize part B of the cost above.

As for part C, Delta Tree's main optimization is to copy runs of consecutive stable layer data in batches, saving CPU cycles. Of course, the multi-way merge of an LSM tree can also batch-copy, but it is troublesome to implement. Moreover, the data volume ratio between adjacent LSM tree levels is about 10x, while in Delta Tree it is usually above 37x, so batch copying is more effective in Delta Tree than in an LSM tree.

Delta Tree vs LSM Tree

At first, TiFlash tried an implementation based on the LSM tree architecture, but its read performance could not meet the requirements, and there were other problems as well; that is why the Delta Tree project was launched. Below is a comparison of scan time between Delta Tree and a columnar engine based on an LSM tree, under different data volumes (tuple counts) and different update TPS (transactions per second).

[Figure: scan time of Delta Tree vs. an LSM-tree-based columnar engine under different data volumes and update TPS]

Below is a per-query time comparison on the OnTime dataset. Delta Tree has a clear advantage in most cases, not only because of its better scan performance but also because it supports rough indexes (such as min-max), which avoid reading irrelevant data.

[Figure: per-query time comparison on the OnTime dataset]

Below is a comparison between TiFlash, using the Delta Tree engine, and other databases.

[Figure: comparison between TiFlash and other databases]

How to handle transactions

A TiDB cluster with TiFlash replicas enabled still supports transactions and guarantees the same isolation level as TiKV alone, which is rare among AP databases. For details, please refer to the TiDB documentation. Some readers may be curious how TiFlash does this; here is a brief introduction to the key points. Those interested can read the paper we recently published at VLDB, "TiDB: A Raft-based HTAP Database".


At present, TiFlash only replicates TiKV's modifications in real time and does not serve data mid-transaction, so it does not need to provide point-query capability, which sidesteps a weakness of columnar storage.

As in TiKV, every row in Delta Tree carries a version field, which is the transaction's commit_ts. During a query, the rows with version <= query_ts are filtered out as a data snapshot according to query_ts, implementing MVCC.
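The snapshot read just described can be sketched as follows (a simplified model, not TiFlash code): for each primary key, keep the newest version visible at query_ts, and drop the row if that version is a delete mark.

```python
# MVCC snapshot read over (pk, version, del_mark, value) rows.
def mvcc_read(rows, query_ts):
    """Return {pk: value} as visible at query_ts."""
    latest = {}
    for pk, version, del_mark, value in rows:
        if version <= query_ts:                       # invisible if too new
            if pk not in latest or version > latest[pk][0]:
                latest[pk] = (version, del_mark, value)
    # A newest-visible delete mark hides the row entirely.
    return {pk: v for pk, (ver, dm, v) in latest.items() if not dm}

rows = [(1, 100, False, "a"), (1, 120, False, "a2"),
        (2, 110, False, "b"), (2, 130, True, None)]
assert mvcc_read(rows, 125) == {1: "a2", 2: "b"}
assert mvcc_read(rows, 135) == {1: "a2"}   # pk 2 deleted at version 130
```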

The distributed transaction model used by TiDB is Percolator, which takes locks during a transaction. Generally these locks need to be persisted. TiKV can persist them into its own storage, RocksDB, but TiFlash's columnar Delta Tree cannot efficiently support high-TPS, low-latency reads in a key-value style. TiFlash's current solution is simple: since these locks are inconvenient to persist, keep them in memory. Only the committed modifications of a transaction are written to the Delta Tree store; uncommitted transactions stay in memory. Note that there is no risk of losing this lock data and uncommitted data, thanks to two mechanisms:

  1. As mentioned earlier, all data lands in the Raft log first. Whatever in memory has not been persisted can be recovered from the Raft log after a restart.
  2. The in-memory state, including locks and uncommitted data, is periodically checkpointed to disk, after which the Raft log applied index is advanced. After a restart, the checkpoint is restored from disk into memory.


For a long time there has been a huge technical gap between online business systems and analysis systems: data updates could not flow between them smoothly in real time. The core reason is that analytical databases are usually poor at storing updates. Imagine the pain of synchronizing a MySQL binlog into Hive in real time. The Delta Tree columnar engine solves this problem, so that the columnar storage best suited to analytical scenarios can also be updated in real time.

By introducing TiFlash, TiDB integrates row storage and columnar storage in one database system, making real-time analysis of business data very simple. You can even use row and columnar data in the same SQL statement, with strict data consistency guaranteed. In TiDB, a single SQL statement is enough to enable this capability: ALTER TABLE my_table SET TIFLASH REPLICA 1;
