Author: Chen / Big Data Open Laboratory
In the previous article, "In-Memory Database Parsing and Comparison with Mainstream Products (II)", we introduced the characteristics of in-memory databases and the technical implementations of several products from the perspective of data organization and indexing. This article continues the analysis, covering several techniques from the perspectives of concurrency control, persistence, and query processing, to offer a broader and more detailed discussion of in-memory database technology.
– Concurrency Control in Database Management Systems –
1. Two concurrency control strategies for in-memory databases
a. Multi-version concurrency control
There are two main strategies for concurrency control in in-memory databases: 1. multi-version concurrency control; 2. partitioned processing. Concurrency control mechanisms can be divided into optimistic and pessimistic types. Pessimistic concurrency control assumes that contention for resources is always present, so it locks data before access and releases the lock afterwards. Optimistic concurrency control assumes that in most cases there is no contention; it only checks for conflicts before the final commit, rolling back if a conflict is found and committing otherwise.
Optimistic concurrency control is mostly implemented without locks and is usually multi-versioned: each update produces a new version, and a read operation selects the appropriate older version according to its visibility range. Reads do not block writes, so the degree of concurrency is high. The drawback is the extra cost: a new version must be created on every update, and as versions accumulate, reclaiming old versions also takes extra work. In-memory databases mostly adopt optimistic multi-version concurrency control. Compared with lock-based pessimistic concurrency control, it has lower overhead and supports highly concurrent workloads; the disadvantage is that under heavy write contention, when the conflict probability between transactions is high, many transactions fail and roll back.
b. Partitioned processing
The other concurrency control strategy for in-memory databases is to divide the database into multiple partitions, with each partition processing transactions serially. The advantage is that single-partition transactions incur no concurrency control overhead; the disadvantage is that system throughput drops sharply once cross-partition transactions appear. Consequently, unless all workloads can be guaranteed to run within a single partition, performance becomes unpredictable.
2. Multi-version concurrency control in Hekaton
Hekaton adopts optimistic multi-version concurrency control. When a transaction starts, the system assigns it a read timestamp and marks it active, then begins execution. During execution, the system tracks the data the transaction reads, scans, and writes. In the pre-commit phase, the transaction first obtains an end timestamp and then validates whether the versions it read and scanned are still valid. If validation passes, the new versions are written to the log and the commit executes, making all new versions visible. After the commit, post-processing records the version timestamps, and only then does the transaction truly end.
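As a rough illustration, the timestamp-and-validation lifecycle described above can be sketched in Python. The `Version`/`Store`/`Txn` classes, the single commit lock, and the in-memory version chains are inventions for this example, not Hekaton's actual data structures:

```python
import itertools
import threading

_ts = itertools.count(1)  # global monotonic timestamp generator

class Version:
    """One version of a record, visible in [begin_ts, end_ts)."""
    def __init__(self, value, begin_ts, end_ts=float("inf")):
        self.value, self.begin_ts, self.end_ts = value, begin_ts, end_ts

class Store:
    """Each logical record is a chain of versions, newest first."""
    def __init__(self):
        self.data = {}                  # key -> list[Version]
        self.lock = threading.Lock()

    def visible(self, key, ts):
        for v in self.data.get(key, []):
            if v.begin_ts <= ts < v.end_ts:
                return v
        return None

class Txn:
    def __init__(self, store):
        self.store = store
        self.begin_ts = next(_ts)       # read timestamp assigned at start
        self.read_set = []              # (key, version) pairs observed
        self.write_set = {}             # key -> new value

    def read(self, key):
        v = self.store.visible(key, self.begin_ts)
        self.read_set.append((key, v))
        return v.value if v else None

    def write(self, key, value):
        self.write_set[key] = value

    def commit(self):
        with self.store.lock:           # commits serialized for simplicity
            end_ts = next(_ts)
            # validation: every version we read must still be the visible one
            for key, v in self.read_set:
                if self.store.visible(key, end_ts) is not v:
                    return False        # conflict -> abort
            # install new versions and close the superseded ones
            for key, value in self.write_set.items():
                chain = self.store.data.setdefault(key, [])
                if chain:
                    chain[0].end_ts = end_ts
                chain.insert(0, Version(value, end_ts))
            return True
```

A transaction that read a record which another transaction has since updated fails validation at commit and must roll back, which is exactly the abort-on-conflict behavior described above.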
a. Transaction validation in Hekaton
i) Read Stability: Hekaton guarantees read stability, meaning that every record version read at the beginning of a transaction is still visible at commit time.
ii) Phantom Avoidance: a phantom occurs when a transaction executes the same conditional query at its beginning and at its end and gets different results. Phantoms arise because other transactions insert, delete, or update records in the same data set while the transaction runs. How are phantoms avoided? The scan is repeated at commit to check whether the scanned data has any new versions, ensuring that the versions recorded at the start of the transaction match those at its end.
An advantage of Hekaton's concurrency control is that read-only transactions need no validation, because multi-versioning guarantees that the record versions visible at the start of a transaction still exist at its end. For transactions that perform updates, whether validation is required depends on the isolation level: under snapshot isolation, no validation is needed; under repeatable read, read stability must be checked; under serializable isolation, both read stability and phantom avoidance must be guaranteed.
b. Hekaton's version recycling strategy
Version recycling in Hekaton is not handled by a dedicated thread; instead, each transaction cleans up after itself. As shown in the figure below, the transaction with ID 250 has an end timestamp of 150 and a terminated status. At this point its write set holds all the old versions it superseded, and the system checks whether the begin timestamps of all currently active transactions are greater than 150, the end timestamp of transaction 250. If they are, no active transaction can ever read a version that was superseded before timestamp 150, so those old versions can be reclaimed by the transaction itself. This cleanup is extra work that each thread performs while processing transactions.
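The reclamation rule above reduces to a simple visibility test: a version is dead once its end timestamp precedes the begin timestamp of the oldest active transaction. A minimal sketch, with the chain layout as an assumed dict-of-dicts rather than Hekaton's real structures:

```python
INF = float("inf")

def reclaim_old_versions(chains, active_begin_ts):
    """Cooperative garbage collection in the spirit of Hekaton.
    chains: key -> list of {"value", "begin_ts", "end_ts"} version dicts.
    active_begin_ts: begin timestamps of currently running transactions.
    Returns the number of versions reclaimed."""
    oldest = min(active_begin_ts, default=INF)
    reclaimed = 0
    for key, chain in chains.items():
        keep = []
        for v in chain:
            # a version is dead once it was superseded (end_ts set)
            # before the oldest active transaction began
            if v["end_ts"] < oldest:
                reclaimed += 1
            else:
                keep.append(v)
        chains[key] = keep
    return reclaimed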
3. Multi-version concurrency control in HyPer
HyPer's concurrency control differs from Hekaton's in three main ways: 1. updates are made in place at the record location, modifications are kept in an undo buffer, and the record is linked to all of its modifications; 2. validation is performed by comparing a transaction's reads against the most recent updates (discussed below); 3. commits are serialized: committing transactions are ordered and processed one at a time.
For transaction validation, HyPer records read predicates in the log, including queries and range scans, as well as insert, delete, and update records. In HyPer, inserts/deletes/updates identify the modified records through their corresponding undo buffers, so tracking changes is cheap.
For each transaction, it suffices to check whether any other transaction, between the transaction's start and its commit, has inserted, deleted, or modified records that satisfy its search predicates; if so, a phantom has occurred and the transaction is aborted immediately.
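The in-place-update-plus-undo-buffer layout described above can be sketched as follows. The `Record` class and its field names are invented for illustration; HyPer's real undo buffers are per-transaction memory regions, not per-record Python lists:

```python
INF = float("inf")

class Record:
    """HyPer-style in-place storage: the record itself holds the newest
    value, and each update pushes the before-image into an undo chain
    linked to the record."""
    def __init__(self, value, ts):
        self.value = value
        self.ts = ts            # timestamp of the newest update
        self.undo = []          # (old_value, old_ts) pairs, newest first

    def update(self, value, ts):
        # save the before-image, then overwrite in place
        self.undo.insert(0, (self.value, self.ts))
        self.value, self.ts = value, ts

    def read(self, ts):
        """Return the value visible to a transaction that began at ts:
        the newest value if it is old enough, otherwise walk the undo
        chain back to the right before-image."""
        if ts >= self.ts:
            return self.value
        for old_value, old_ts in self.undo:
            if ts >= old_ts:
                return old_value
        raise KeyError("no version visible at ts %d" % ts)
```

New readers see the in-place value directly and pay nothing, while older readers reconstruct their snapshot by walking the undo chain, which is why this design favors workloads where most reads are of the latest state.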
4. Concurrency control in HANA and H-Store/VoltDB
HANA's concurrency control is relatively simple: it adopts pessimistic multi-version control. Data structures are protected by row-level locks, and the visibility range of each version of each row is determined by timestamps. Every transaction must acquire a write lock before updating or deleting, and deadlock detection is required.
H-Store/VoltDB is a partitioned system with very coarse lock granularity: one lock per partition. A transaction executing on a node must therefore acquire all of that node's resources, and a transaction that touches two partitions must acquire both partitions' locks. The advantage of a partitioned system is that single-partition processing is very fast; the disadvantage is that multi-partition transactions are very inefficient. The system is also highly sensitive to load skew: any hot data becomes a system bottleneck.
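The coarse per-partition locking can be sketched as below. The `PartitionedDB` class is invented for illustration; real H-Store/VoltDB queues transactions at each partition's single execution engine rather than using OS locks, but the cost structure is the same: one lock for single-partition work, several for cross-partition work:

```python
import threading

class PartitionedDB:
    """H-Store/VoltDB-style coarse locking: one lock per partition.
    A single-partition transaction takes one lock and runs serially
    inside it; a cross-partition transaction must take every involved
    lock before it can run."""
    def __init__(self, n_partitions):
        self.locks = [threading.Lock() for _ in range(n_partitions)]
        self.data = [dict() for _ in range(n_partitions)]

    def run(self, partitions, work):
        """Execute work(data) while holding all listed partition locks."""
        held = sorted(set(partitions))   # fixed order prevents deadlock
        for p in held:
            self.locks[p].acquire()
        try:
            return work(self.data)
        finally:
            for p in reversed(held):
                self.locks[p].release()
```

A single-partition call like `db.run([0], ...)` pays one uncontended lock; `db.run([0, 1], ...)` blocks all work on both partitions for its whole duration, which is why throughput collapses when cross-partition transactions are frequent.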
5. Concurrency control when the workload is known in advance
If the data sets that each transaction will read and write can be obtained in advance, the execution order of all transactions can be determined before execution. Calvin is a database prototype system designed on this assumption, built around VLL (Very Lightweight Locking). In general workloads the read-write sets cannot be known ahead of time, but in stored-procedure-based applications they can be determined in advance; for such scenarios, systems like Calvin are worth considering.
– Persistence Technology in Database Management Systems –
Like disk-based databases, in-memory databases need logs and checkpoints. The purpose of a checkpoint is to let recovery start from the most recent checkpoint instead of replaying everything. Because checkpointing involves writing to disk, it affects performance, so the related processing must be made as fast as possible.
One difference is that the logs and checkpoints of an in-memory database need not contain indexes: indexes can be rebuilt from the base data during recovery. The rebuilt indexes live in memory and are never written to disk; if the in-memory index data is lost, it is simply rebuilt. Another difference is that an in-memory database checkpoint carries more data. When a disk-oriented database checkpoints, it only needs to write the dirty pages in memory to disk, but an in-memory database checkpoint must write all of the data, in full, no matter how much there is. The data written at an in-memory database checkpoint is therefore far larger than for a disk-based database.
For persistence performance, the first concern is high throughput and low latency when writing the log; the second is how to rebuild the whole database quickly during recovery. Hekaton keeps records and indexes in memory and logs all operations to disk. The log records only data updates, not index updates. During checkpointing, Hekaton processes the log in parallel, partitioned by primary-key range. As shown in the figure below, there are three primary-key ranges: 100~199, 200~299, and 300~399; green represents data and red represents deleted records (delete records are stored in separate files). During recovery, Hekaton rebuilds indexes and data in memory with a parallel algorithm: it filters the data files against the delete records to remove deleted data, then replays the log from the checkpoint onward.
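The per-key-range recovery scheme above can be sketched as follows. The function names and the list-based "file" formats are simplifications invented for the example; the structure (load data, filter deletes, replay log tail, all ranges in parallel) follows the description:

```python
from concurrent.futures import ThreadPoolExecutor

def recover_range(data_file, deleted, log):
    """Rebuild one primary-key range: load the checkpoint data, filter
    out keys listed in the delete file, then replay the log tail."""
    table = {k: v for k, v in data_file if k not in deleted}
    for op, k, v in log:
        if op == "put":
            table[k] = v
        elif op == "del":
            table.pop(k, None)
    return table

def recover(ranges):
    """ranges: one (data_file, deleted, log) tuple per key range.
    Ranges are disjoint, so they can be rebuilt in parallel and
    merged at the end."""
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda r: recover_range(*r), ranges)
    merged = {}
    for part in parts:
        merged.update(part)
    return merged
```

Because primary-key ranges never overlap, each worker rebuilds its slice independently, which is what makes the parallel recovery algorithm effective.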
Checkpoints in other systems
- Systems that use logical logging, such as H-Store/VoltDB, do not record concrete data changes but rather the operations and commands that were executed. The advantage is that far less log information is recorded. For checkpointing, H-Store/VoltDB uses copy-on-write (COW): the normal state is single-versioned, but another version is "copied" out while the snapshot is written, and the versions are merged afterwards.
- The other approach is to periodically write snapshots to disk (excluding indexes); for example, HyPer produces snapshots through the operating system's fork() call.
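The fork-based trick works because fork() gives the child a copy-on-write image of the parent's memory: the child serializes a consistent snapshot while the parent keeps applying updates. A toy sketch (Unix-only; the function name and the use of JSON as the snapshot format are inventions for the example, not HyPer's actual mechanism):

```python
import json
import os

def snapshot_to_file(db, path):
    """Fork a child that serializes the database state as of the fork,
    while the parent continues processing updates immediately."""
    pid = os.fork()
    if pid == 0:
        # child: its copy-on-write memory image is frozen at fork time,
        # so later parent updates are invisible here
        with open(path, "w") as f:
            json.dump(db, f)
        os._exit(0)
    return pid  # parent: caller may waitpid() on this later
```

Even if the parent mutates `db` right after the call, the file written by the child reflects the pre-mutation state, because the two processes no longer share writable memory.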
– Query Processing in Database Management Systems –
Traditional query processing uses the Volcano model, in which every node of the query tree is a generic operator. The advantage is that operators can be combined arbitrarily; however, a record obtained by an operator is just a byte array, and a further method call is needed to resolve each attribute and its type. If this design were carried into an in-memory database, attribute and type resolution would happen at run time rather than compile time, hurting performance.
In addition, consider get-next: with millions of rows it is called millions of times, and get-next is usually a virtual function invoked through a pointer, which is slower than calling a memory address directly. Moreover, such function code is laid out discontinuously in memory, forcing constant jumps. In summary, the query processing approach of a traditional DBMS does not suit an in-memory database, especially at the lower levels of execution.
In-memory databases usually adopt compiled execution. The query is first parsed, then optimized to produce an execution plan; the plan is then compiled against templates into executable machine code, which is loaded into the database engine and invoked directly at execution time.
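The contrast between a generic interpreted operator and a query-specialized compiled one can be sketched as follows. Both functions are invented for illustration, and Python's `compile`/`exec` stands in for real machine-code generation (e.g. via LLVM):

```python
def interpret_filter(rows, schema, col, op, const):
    """Volcano-style generic operator: the column offset and the
    comparison are resolved on every call, at run time."""
    idx = schema.index(col)
    out = []
    for row in rows:
        value = row[idx]
        if op == ">" and value > const:
            out.append(row)
        elif op == "==" and value == const:
            out.append(row)
    return out

def compile_filter(schema, col, op, const):
    """Compiled-execution sketch: generate a function specialized for
    one query against one table, so the offset and comparison are
    fixed before execution begins."""
    idx = schema.index(col)
    src = ("def q(rows):\n"
           "    return [r for r in rows if r[%d] %s %r]\n"
           % (idx, op, const))
    env = {}
    exec(compile(src, "<generated>", "exec"), env)
    return env["q"]
```

The generated `q` contains no schema lookups, no operator dispatch, and no type interpretation in its inner loop, which is the essence of why compiled plans outperform the interpreted Volcano model.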
The figure below breaks down the time spent by different query processing approaches; resource stalls account for only a small fraction of the compiled execution approach.
Another figure illustrates the current CPU architecture. A hardware prefetcher sits between the L2 cache and main memory, and the L2 cache is split into an instruction cache and a data cache. The instruction cache is fed by branch prediction, and the data cache is fed by the prefetcher based on sequential access patterns. Database system design therefore needs to fully exploit the prefetcher under this architecture, so that the caches keep supplying instructions and data to the CPU's execution units and cache stalls are avoided.
Hekaton's compiled query processing
Hekaton compiles T-SQL stored procedures; the intermediate representation is called the MAT (Mixed Abstract Tree), from which the final C code is generated and compiled into a library. The libraries it produces differ from generic operators in that a generic operator must interpret data types at run time, whereas Hekaton compiles the table definition together with the query: each library handles only its specific table rather than being generic, so the data types are known in advance. This approach yields a 3~4x performance improvement.
HyPer and MemSQL compiled query processing
HyPer compiles the query tree pipeline segment by pipeline segment. MemSQL compiles queries into code using LLVM and its MPL intermediate language.
The figure below shows a MemSQL performance test. Without compiled execution, MemSQL takes 0.05 seconds for each of two runs of the same query; with compiled execution, the first run takes 0.08 seconds, but re-execution takes only 0.02 seconds.
– Other In-Memory Database Systems –
Besides the in-memory databases discussed above, there are other well-known in-memory DBMSs.
i) SolidDB: a hybrid database system born in 1992, with optimized engines for both disk and memory. It uses a VTrie (variable-length trie) index, a pessimistic locking mechanism for concurrency control, and recovers through snapshot checkpoints.
ii) Oracle TimesTen: originally a research project at HP Labs called Smallbase, acquired by Oracle in 2005, and now mostly used as an in-memory acceleration front end for large database systems. Oracle TimesTen supports flexible deployment, either as a standalone DBMS engine or as a transaction cache in front of an RDBMS; it supports in-memory repositories for BI workloads, controls concurrency through locking, handles write conflicts with row-level latching, and provides durability through write-ahead logging and checkpointing.
iii) Altibase: founded in Korea in 1999 and widely used in telecommunications, finance, and manufacturing. Altibase stores records in pages, checkpoints at page granularity, and is compatible with traditional DBMS engines; it supports multi-version concurrency control, uses write-ahead logging and checkpoints for durability and recovery, and accesses page data latch-free.
iv) P*Time: developed in Korea in the early 2000s, sold to SAP in 2005, and now part of SAP HANA. P*Time has excellent performance: it uses differential (XOR) encoding for log records, adopts a hybrid storage layout, and supports databases larger than memory.
– Summary of This Article –
Every database system is designed for a specific hardware environment. Disk-based database systems were designed for a single CPU core, small memory, and slow disks. In-memory databases use memory as primary storage and avoid repeated disk reads and writes, so disk I/O is no longer the performance bottleneck; other bottlenecks must be addressed instead, such as: 1. the overhead of locking/latching; 2. cache-line misses, where data structures that are poorly defined or poorly organized in memory fail to match the CPU's cache hierarchy and hurt compute performance; 3. pointer chasing, where finding a record's address requires dereferencing another pointer or consulting another table, also a major cost. Beyond these there are predicate evaluation, the large amounts of copying during data migration and storage, and the network communication between applications and the database system in distributed deployments.
In this column we have introduced the characteristics of traditional disk-based DBMSs and in-memory databases, and compared the two in terms of data organization, indexing, concurrency control, statement processing and compilation, and persistence:
1. Data organization: in an in-memory database, records are managed as fixed-length or variable-length without contiguous data storage, and page-ID+offset indirection is replaced by direct pointers;
2. Index optimization: consider memory-oriented cache-line optimization, latch-free techniques for fast memory access, and index updates that are not logged;
3. Concurrency control: both pessimistic and optimistic concurrency control can be adopted; the difference from a traditional disk-based database is that an in-memory database binds lock information to the data itself rather than managing it in a separate lock table;
4. Query processing: sequential and random access perform similarly in the in-memory scenario, and query performance can be improved through compiled execution;
5. Persistence: logging still goes through WAL (write-ahead logging), but with lightweight logs that record as little as possible, to reduce the latency of flushing the log to disk.
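The lightweight WAL discipline in point 5 can be summarized in a few lines of code. This is a minimal sketch with invented function names, not any particular system's log format; the essential invariant is that the log record reaches stable storage before the update is acknowledged:

```python
import json
import os

def wal_append(log_path, op):
    """Append one change record and force it to stable storage
    (write-ahead: the log hits disk before the update is acknowledged)."""
    with open(log_path, "a") as f:
        f.write(json.dumps(op) + "\n")
        f.flush()
        os.fsync(f.fileno())

def wal_replay(log_path):
    """Rebuild the in-memory table by replaying the log from the start
    (a real system would replay only from the latest checkpoint)."""
    table = {}
    with open(log_path) as f:
        for line in f:
            op = json.loads(line)
            if op["type"] == "put":
                table[op["key"]] = op["value"]
            elif op["type"] == "del":
                table.pop(op["key"], None)
    return table
```

Keeping each record small (here, one JSON line holding only the logical change) is exactly the "record as little as possible" advice above, since the fsync on the log is on every transaction's critical path.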
In-memory databases first appeared in the 1970s and have since gone through theoretical maturation, production deployment, and market validation. The rise of high-concurrency, high-traffic, low-latency platforms such as flash sales, mobile payment, and short video has created great demands and challenges for database performance. At the same time, the continuing improvement of memory in capacity and price per unit, as well as the recent development of non-volatile memory (NVM), has driven the development of in-memory databases. These factors give in-memory databases a broad market and many deployment opportunities in the future.
Note: this article draws on the following materials:
1. Pavlo, Andrew & Curino, Carlo & Zdonik, Stan. (2012). Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. Proceedings of the ACM SIGMOD International Conference on Management of Data. DOI: 10.1145/2213836.2213844.
2. Kemper, Alfons & Neumann, Thomas. (2011). HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. Proceedings – International Conference on Data Engineering. 195-206. DOI: 10.1109/ICDE.2011.5767867.
3. Faerber, Frans & Kemper, Alfons & Larson, Per-Åke & Levandoski, Justin & Neumann, Thomas & Pavlo, Andrew. (2017). Main Memory Database Systems. Foundations and Trends in Databases. 8. 1-130. DOI: 10.1561/1900000058.
4. Sikka, Vishal & Färber, Franz & Lehner, Wolfgang & Cha, Sang & Peh, Thomas & Bornhövd, Christof. (2012). Efficient Transaction Processing in SAP HANA Database – The End of a Column Store Myth. DOI: 10.1145/2213836.2213946.
5. Diaconu, Cristian & Freedman, Craig & Ismert, Erik & Larson, Per-Åke & Mittal, Pravin & Stonecipher, Ryan & Verma, Nitin & Zwilling, Mike. (2013). Hekaton: SQL server’s memory-optimized OLTP engine. 1243-1254. DOI: 10.1145/2463676.2463710.