Author: Xiao Chen / Da Shu open laboratory
In the last articleComparison between in memory database parsing and mainstream products (I)In this paper, we introduce the knowledge of disk based database management system, and briefly describe the technical development of memory database. This article will introduce the characteristics of memory database from the perspective of data organization and index, and introduce the actual technical implementation of several products.
– data organization in database management system—
Fixed length block vs variable length block
When the memory database manages the data in memory, although it no longer needs to organize the data in the form of slotted page, it can not arbitrarily allocate address space for the data in memory. It still needs to organize the data in the form of block / page. The traditional disk based DBMS organizes data in the form of slotted page for the sake of reading and writing performance, because the disk interface takes block / page as the reading and writing unit. In memory database, data is organized in blocks to facilitate addressing and management. Data blocks are usually divided into fixed length data block and variable length data block.
Assuming that a data set has been loaded into memory, for convenience, the memory database will put all the fixed length attributes of records into the fixed length data block during data organization; All variable length attributes are saved in another variable length data block. For example, generally, all attributes less than 8 bytes in the data table are placed in the fixed length data block, variable length attributes and attributes more than 8 bytes are placed separately in the variable length data block, and a pointer to its address is placed in the fixed length data block. The advantage of using fixed length data block to manage data is fast addressing. The storage position of records in data block can be determined by record length and number; The space required for the record address pointer is small, which makes the memory address for storing this record in the index structure or other structures the most compact, and the prediction of CPU is more accurate when doing pre fetch.
In the traditional disk based DBMS, the record address saved by the index leaf node is page ID + offset, and the page table is responsible for mapping the page ID to the buffer frame; In the memory database, the record address saved by the leaf node of the index is the direct memory address. In the traditional disk based DBMS, when accessing a page in the buffer, you need to lock / unlock / modify the page. Because there may be many types of locks in the real system, if a thread wants to access a page, it often needs to add several types of locks. Now there is no buffer in the memory database, so the overhead of latch is saved and the performance is greatly improved.
Data organization: data partition, multi version, row / column storage
In the system of multi-core or multi CPU shared memory, the concurrent access conflict to data always exists. Current in memory database systems can be divided intoPartition SystemandNon-Partition SystemTwo. Partition system divides all data into disjoint partitions. Each partition is assigned to a core (or a node in a distributed system). All operations are executed serially without concurrent data access. Ideally, the best performance can be obtained. However, the disadvantages of such systems are also obvious, such as how to divide partitions and how to deal with cross partition transactions. For the non partition system, all cores and all threads can access all data, so there must be concurrent access conflicts, and the data structure supporting concurrent access must be adopted. At present, the general database is mostly designed by non partition system. The main reason why the partition design is not adopted is that it is difficult to partition the data effectively in the general scenario, and the partition database cannot be used.
In the non partition system, if two threads access the same data item, there will be a conflict. At this time, the multi version solution can be considered. The advantage of multi version is that it can improve the degree of concurrency. Its basic idea is to make all read operations not block write operations through multi version data, so as to improve the performance of the whole system. For those systems that read more and write less, the performance of multi version will be very good, but for some write heavy systems, the performance is not ideal.
Another consideration for data organization is the organizational form of row and column. Traditional database systems maintain data on disk, which is divided into row storage and column storage. As the name suggests, row storage stores data by row and column storage stores data by column. If you operate on all attributes of a small number of records, row storage is more appropriate. If you read only part of the column data of a large number of records, the performance of column storage is better. For example, a record has 100 attributes. This read operation needs to read one of the attributes of all records. If it is stored by row, the column needs to be filtered after the block is read in; If you store by column, you can only read the block corresponding to this column of data, so the performance will be better and suitable for statistical analysis. But the memory database will not have this problem. All data are stored in memory, whether row or column, and the cost of access is the same. Therefore, in the memory database, row / column storage can be exchanged or arbitrarily selected. Of course, for TP applications, line storage is more used, because all attributes can be read out at one time. But even for column storage, the performance is not as bad as that in disk based database systems. For example, SAP Hana is a mixed row and column storage. The front-end transaction engine is row storage. After consolidation, the back-end is converted to column storage.
– Comparison of in memory database systems—
Next, from the perspective of data organization, briefly introduce four representative systems: hekaton, the memory database engine of SQL server, hyper, the memory database system of Munich University of technology, Hana of sap, and h-store / voltdb of Michael Stonebraker, the winner of Turing Award.
Hekaton is a non partition system. All threads can access arbitrary data. Hekaton’s concurrency control does not use the lock based protocol, but uses the multi version mechanism. Each version of each record has a start timestamp and an end timestamp to determine the visible range of the version.
Each table in hekaton has at most 8 indexes, which can be hash or range indexes. At the same time, all record versions do not require continuous storage in memory, but can be non continuous storage. Different versions of the same record are associated through a pointer link.
As shown in the figure above, there is a table containing name, city and amount fields. There is a hash index on the name field and a b-tree index on the city field. The black arrow represents the pointer corresponding to the name index, and the first record corresponding to the name John points to the next record with the same initial name. Each record contains start and end time stamps. Red indicates that there is a transaction updating record. The end time stamp will be replaced after the transaction is committed. The same is true for the b-tree index. The blue arrow pointer is concatenated according to the city value.
H-store / voltdb is a partition system. Each partition is deployed on a node, and the tasks on each node are executed serially. H-store / voltdb does not have concurrency control, but has simple lock control. A partition corresponds to a lock. If a transaction is to be executed on a partition, it needs to get the lock of the partition before execution. In order to solve the problem of cross partition execution, h-store / voltdb requires that the transaction must get the locks of all relevant partitions at the same time to start execution, which is equivalent to locking all transaction related partitions at the same time.
H-store / voltdb adopts a two-tier architecture: the upper layer isTransaction Coordinator, determine whether the transaction needs to be executed across partitions; The lower layer isExecution engineIt is responsible for data storage, indexing and transaction execution, and adopts a single version row storage structure.
The data blocks in h-store / voltdb are divided into fixed length and variable length: the length of each record in the fixed length data block is the same, and the 8-byte address is used in the index to point to the position of each record in the fixed length data block; The variable length attribute is stored in the variable length data block. In the record of the fixed length data block, it corresponds to a pointer (non inline data) to point to its specific position in the variable length data block. In this data organization mode, a compressed block look up table can be used to address data records.
Hyper is a multi version non partition system, and each transaction can access any data. Meanwhile, hyper is a TP and AP hybrid processing system for HTAP services. Hyper passCopy on WriteThe mechanism realizes the mixed processing of TP and AP. Assuming that the current system is conducting transactions on the dataset, if an AP request occurs, hyper will take a snapshot of the dataset through the fork function of the operating system, and then analyze it on the snapshot. The copy on write mechanism does not copy all data in memory. Only when the data changes due to OLTP service, the snapshot will actually copy the original data, while the unchanged data will be referenced to the same physical memory address through the virtual address.
In addition, hyper adopts multi version control. All updates are based on the original records. Each record maintains an undo buffer to store incremental update data, and indicates the current latest version through the version vector. Therefore, the modified record can be found through transaction, and the modified version can be retrieved by reverse application of incremental data. Of course, the data version can also be fused or restored regularly.
SAP Hana is a non partition hybrid storage system. Physical records go through three stages in the storage medium: 1. Transaction records are stored in L1 Delta (line storage mode); 2. Then the records are converted into columns and stored in L2 Delta (column storage, unsorted dictionary coding); 3. The main memory of SAP Hana is column memory (highly compressed and encoded by sorting Dictionary). Each record experiences the mapping and merging from row storage to column storage, which is equivalent to a multi version design.
– indexing technology in database management system –
When designing indexes in the field of memory database, we mainly consider cache awareness and multi-core and multi socket parallelism.
Since the in memory database is no longer limited by disk I / O, the purpose of indexing is to accelerate the access speed between CPU and memory. Although the memory price is low now, there is still a large difference between the growth rate of memory speed and the growth rate of CPU main frequency. Therefore, for the memory database, the purpose of indexing technology is to provide data to the CPU in time and put the required data into the CPU cache as fast as possible.
For the parallel processing of multi-core and multi CPU, we began to consider how to reasonably construct the index if the data structure and data are placed in memory in the 1980s. Among them, the mm-dbms project of the University of Wisconsin in 1986 proposed a self balanced binary search tree T-tree index. Each binary node stores data records within the value range, and two pointers point to its two child nodes. The memory overhead of T-tree index structure is small. Because memory was expensive in the 1980s, the main measurement is not whether the performance is optimal, but whether it occupies the minimum memory space. The disadvantage of T-tree is its performance. It needs to be load balanced regularly, and scanning and pointers will also affect its performance. Early commercial systems, such as times ten, used the data structure of T-tree.
So why should cache awareness be considered in index design? In 1999, a study found that cache stall or cache miss in memory access is the main performance bottleneck of memory system. A performance test was conducted in this study. Through the evaluation of four A / B / C / D systems, the time proportion of the following processes was tested: computing, memory stalls, branch miseditons and resource stalls. Computing represents the time actually used for calculation; Memory stall is the time to wait for memory access; Branch mispredictions refers to the cost of CPU instruction branch prediction failure; Resource stalls refers to the time waiting for other resources, such as network, disk, etc.
It can be seen that memory stall will occupy a large proportion of overhead in different test scenarios. Therefore, for the memory index structure, the main purpose of developing cache oriented index is to reduce the overhead of memory stall.
Here are some typical examples of memory index structures. The first is CSB + – tree, which is still a B + – tree logically, but some changes have been made. First, the size of each node is a multiple of the length of a cache line; At the same time, CSB + – tree organizes all the child nodes of a node into children group. A parent node points to its children group through a pointer to reduce the number of pointers in the data structure. Because the CSB + – tree node matches the cache line length, good pre fetch performance can be achieved as long as it is read in order. When the tree splits, the CSB + – tree will reallocate the group in memory, because the CSB + – tree nodes do not need to be continuous in memory. You can create new pointer links after they are arranged.
Another example is Pb + – trees (pre fetching B + – trees). It is not a new structure, but implements B + – tree in memory, and the size of each node is equal to the length multiple of cache line. Pb + – trees is special in that pre fetching is introduced in the whole system implementation process to help the system prefetch data by adding some additional information.
Pb + – trees tend to use flat trees to organize data. The performance of search and scan is given in the paper. The performance of search is improved by 1.5 times and that of scan is improved by 6 times. Compared with CSB + – trees, Pb + – trees account for less data cache stalls when processing search.
Another performance comparison is that when prefetching is not used, it takes 900 clock cycles to read the tertiary index with a node size equal to two cache lines, while it only takes 480 cycles after prefetching. Another implementation of Pb + – trees is that it adds a jump pointer array to each node to judge how many cache lines to skip during scanning to prefetch the next value.
BW tree is an index used in hekaton system. The basic idea is to compare memory values through compare and swap instruction level atomic operations. If the old and new values are equal, they will be updated. If they are not equal, they will not be updated. For example, if the original value is 20 (stored on disk) and the memory address corresponds to 30, updating 30 to 40 will not succeed. This is an atomic operation, which can be used to realize uninterrupted data exchange operation in multithreaded programming.
There is a mapping table in BW tree. Each node has a storage location in the mapping table. The mapping table will store the address of the node in memory. For BW tree, the pointer from the parent node to the child node is not a physical pointer, but a logical pointer, that is, the location recorded in the mapping table is not a real memory location.
BW tree adopts the design that the node update does not directly modify the node, but saves the modified content by adding delta record (incremental record), and then points to the delta record in the mapping table. If there is a new update, it will continue to point to the new delta record. When reading the contents of a node, all delta records are actually merged. Because the BW tree is updated through an atomic operation, only one change can succeed in competition. Therefore, it is a latch free structure, which can solve the competition problem only by compare and swap, and does not need to rely on the locking mechanism.
Adaptive Radix Tree
Hyper’s index tree is designed using adaptive radius tree. The traditional radius tree is a prefix tree. Its advantage is that the depth of the tree does not depend on the number of indexed values, but on the length of the search key. Its disadvantage is that each node needs to maintain the information of possible child nodes, resulting in large storage overhead of each node.
In the adaptive radius tree, different types of length formats are provided for each node, and different numbers of child nodes such as 4 / 16 / 48 / 256 can be saved respectively. Node4 is the smallest node type and can store up to 4 child node pointers. The key is used to represent the value stored by the node. The pointer can point to the leaf node or the internal node of the next layer. Node16 and node4 are structurally consistent, but node16 can store 16 unsigned chars and 16 pointers. When storing the 17th key, it needs to be expanded to node48. Node48 is structurally different from node4 / 16. It has 256 index slots and 48 pointers. These 256 index slots correspond to 0-255 of unsigned char. The value of each index slot corresponds to the position of the pointer, 1-48 respectively. If a byte does not exist, the value of its index slot is 0. When storing the 49th key byte, it needs to be expanded to node256. Node256 results are relatively simple. 256 pointers are stored directly, and each pointer corresponds to the 0-255 range of unsigned char.
For example, in this example, we want to index an integer (+ 218237439). The binary representation of the integer is 32 bits, and then convert the 32 bits into 4 bytes. The 4 bytes are represented in decimal as 13, 2, 9 and 255, which is its search key. In the index tree, the first layer is node 4, 13, which meets the storage requirements of this layer, so it is retained on the first layer node, and the subsequent bits enter the next layer for storage, and the next layer is node 48, which is used to store 2; Each subsequent bit is stored in each layer below. Since the integer in this example is represented by 4 bytes, there are 4 layers in total. It can be seen that the structure of each node is different. It is stored one by one according to the number of bytes and order. The number of different values in this layer depends on the type of node selected. If the current type is not enough, you can increase the number, and the keys that each node can accommodate change dynamically, which can not only save space, but also improve the locality of the cache.
In addition, adaptive radius also adopts the path compression mechanism. If the parent node of a path has only one child node, it will be compressed and merged. Adaptive radius adopts such an index structure because the size of each node is equal to one cache line, and all operations can be implemented on the basis of one cache line.
OLFIT on B+-Trees
Olfit on B + – trees (optimal latch free index access protocol) is an index technology adopted by hanap * time, which can ensure the cache coherence of CPU on multi-core database. In the architecture of multiprocessor computers, the caches of multiple CPUs will cache data in the same memory. In the memory database, the stored data will be read to the corresponding cache before processing; If the memory data changes during cache data processing, the cache data will become invalid due to inconsistency with the memory data. Cache coherence is to solve this problem.
Consider this scenario: as shown in the figure below, there is a tree data structure in memory, which is processed by four CPUs, and each CPU has its own cache. Assuming that cpu-p1 reads N1, N2 and N4 first, N1, N2 and N4 will be in the cache. Then, when cpu-p2 reads N1, N2 and N5, it is assumed that the data structure is not latch free. If modification is allowed while reading, a latch is required to lock it during reading and release it after reading. Because latch and data are put together in the memory database, although the data does not change, the state of latch has changed, and the computer hardware structure will think that memory has changed. Therefore, when multiple CPUs read the same data, only the last read state is valid, and the previous read will be considered invalid. This will cause the CPU cache to fail because the latch state changes, even if the read operation is performed. Therefore, olfit has designed a set of mechanisms that require write operations to apply for latch, and read operations are not required. Olfit maintains read-write transactions through the version number. Each CPU copies the version number to the local register before reading, and then judges whether the version number is the same as that before reading after reading the data. If it is the same, it will continue to execute normally. If it is different, it means that the cache is invalid. Therefore, read requests will not cause cache invalidation of other CPUs.
From this example, we can see that the problems considered by memory database are different from disk based database. Without disk I / O, we need to consider the performance limitations of other aspects.
Skilists is a technology used by the data processing engine of memsql. Its bottom layer is an ordered list. The upper layer extracts data according to a certain probability (p-value) and then makes a list. When searching a large list, you can start from the top layer and progress layer by layer, which is similar to binary search. The granularity can be customized according to the actual situation. The reason for this design is that all insertion operations into the list can be completed through the compare and swap atomic operation, thus saving the cost of locking.
– Summary of this paper –
Firstly, this paper introduces the data organization of memory database, including data partition, partition / non partition system differences and storage methods, and compares the actual implementation of four products. Then, this paper introduces the index technology of six memory database systems, and briefly describes the principle of index query through examples. The next article will continue to analyze the memory database, and discuss the optimization design of memory database for query performance and availability from the perspective of concurrency control, persistent storage and compiling query.
Note: refer to the following materials for relevant contents of this paper:
1. Pavlo, Andrew & Curino, Carlo & Zdonik, Stan. (2012). Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. Proceedings of the ACM SIGMOD International Conference on Management of Data. DOI: 10.1145/2213836.2213844.
2. Kemper, Alfons & Neumann, Thomas. (2011). HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. Proceedings – International Conference on Data Engineering. 195-206. DOI: 10.1109/ICDE.2011.5767867.
3. Faerber, Frans & Kemper, Alfons & Larson, Per-Åke & Levandoski, Justin & Neumann, Tjomas & Pavlo, Andrew. (2017). Main Memory Database Systems. Foundations and Trends in Databases. 8. 1-130. DOI: 10.1561/1900000058.
- Sikka, Vishal & Färber, Franz & Lehner, Wolfgang & Cha, Sang & Peh, Thomas & Bornhövd, Christof. (2012). Efficient Transaction Processing in SAP HANA Database –The End of a Column Store Myth. DOI: 10.1145/2213836.2213946.
5. Diaconu, Cristian & Freedman, Craig & Ismert, Erik & Larson, Per-Åke & Mittal, Pravin & Stonecipher, Ryan & Verma, Nitin & Zwilling, Mike. (2013). Hekaton: SQL server’s memory-optimized OLTP engine. 1243-1254. DOI: 10.1145/2463676.2463710.