Comparison of common technologies and schemes of database recovery subsystem (2)


Author: Xiao Chen / Big Data Open Laboratory

In the previous article, "Comparison of Common Technologies and Schemes of Database Recovery Subsystems (I)", we introduced the logging & recovery subsystem of a database management system and discussed in detail the concepts and technical implementation of ARIES, the mainstream recovery algorithm based on physical logging. This article shares Professor Gong Xueqing's introduction to the principle of logical undo logging and to the recovery techniques of two database systems: SQL Server (Azure) and Silo.

— Logical Undo Logging —

In the previous article, we briefly introduced the optimization idea of early lock release: releasing the lock on an index in advance to improve concurrency. However, this creates dependencies between transactions and can cause cascading rollbacks. For example, suppose the first transaction has released its lock but must roll back because flushing its log failed. By then the lock has already been acquired by the next transaction, so that transaction must be rolled back together with the first one, which severely hurts system performance.

To address this problem, logical undo logging was introduced into the recovery system. The basic idea of logical undo logging is that, when a transaction rolls back, the system undoes the operation rather than restoring the data's old value: the undo of an insertion is a deletion, and the undo of adding 1 to a data item is subtracting 1 from it.

In addition, the concept of operation logging is introduced when processing logical undo: some transaction operations are logged in a special way, called transaction operation logging. When an operation starts, a special log record of the form <Ti, Oj, operation-begin> is written, where Oj is a unique operation ID. During the operation, the system performs normal physical redo and undo logging. When the operation finishes, a special record <Ti, Oj, operation-end, U> is written, where U is the logical undo action for the set of operations already performed.

For example, suppose an index inserts a new key-value pair (K5, rid7) into leaf node i9, where K5 is stored at position X on the i9 page, replacing the original value old1, and rid7 is stored at position X+8, replacing the original value old2. When logging this transaction, the record <T1, O1, operation-begin> is written first, the update operations are recorded in the middle as physical log records, and at the end the record <T1, O1, operation-end, (delete i9, K5, rid7)> is written, in which (delete i9, K5, rid7) is the logical undo of T1's operation: delete K5 and rid7 from the i9 node page.


After the operation completes, the lock can be released early, allowing another transaction to successfully insert its own <key, record ID> pair into the page. This may cause all key values on the page to be reordered, so K5 and rid7 leave positions X and X+8. If physical undo were performed now, K5 and rid7 would have to be removed from X and X+8, but their positions have changed, so performing physical undo from the original log records is infeasible. Logically, however, undo only needs to delete K5 and rid7 from index node i9. This is why the operation log carries (delete i9, K5, rid7): if undo is needed, it only has to follow this logical instruction.

During rollback, the system scans the log backwards. If no operation-end log record has been seen for an operation, physical undo is performed: the lock is released only when the operation ends, so in this case the lock has not yet been released and no other transaction can have modified the locked data items. If an operation-end record is found, the lock has already been released and only logical undo can be performed; all log records between operation-begin and operation-end are skipped, because their undo has been replaced by the logical undo. If a <T, O, operation-abort> record is encountered during undo, the operation was already aborted, so the intermediate records can be skipped directly back to the transaction's <T, O, operation-begin> to continue the undo. When the scan reaches the <T, start> record, all of the transaction's log records have been undone, and a <T, abort> record is written to mark the end of the undo.
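The backward scan above can be sketched in Python. The tuple-based log record layout (records tagged operation_begin / operation_end) is an illustrative assumption, not any real system's log format, and operation-abort handling is omitted for brevity:

```python
def rollback(log, txn):
    """Scan a transaction's log backwards, collecting logical undo actions
    for completed operations and physical undo actions otherwise."""
    actions = []              # undo actions, in the order they must run
    skipping_until = None     # operation id whose begin we must reach
    i = len(log) - 1
    while i >= 0:
        rec = log[i]
        i -= 1
        if rec[0] != txn:
            continue
        kind = rec[1]
        if skipping_until is not None:
            # Skip the physical records inside a completed operation;
            # their undo is replaced by the operation's logical undo.
            if kind == "operation_begin" and rec[2] == skipping_until:
                skipping_until = None
            continue
        if kind == "operation_end":
            # Lock already released: use the logical undo action rec[3].
            actions.append(("logical", rec[3]))
            skipping_until = rec[2]
        elif kind == "update":
            # (txn, "update", page, offset, old, new): physical undo.
            actions.append(("physical", (rec[2], rec[3], rec[4])))
        elif kind == "begin":
            # Everything undone: record the transaction's abort.
            actions.append(("abort_record", txn))
    return actions
```

Note how the physical updates bracketed by operation_begin/operation_end never contribute physical undo actions once the operation-end record has been seen.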

Note that all redo operations are physical. Logical redo is very complex to implement, for example in determining the redo order, so in most systems all redo information is physical. Moreover, physical redo does not conflict with early lock release, because redo is only performed after a failure: at that point the system has crashed and all locks are gone anyway, and locks must be re-acquired during redo.

— SQL Server: Constant Time Recovery —

Most commercial database systems, such as SQL Server, use an ARIES-style recovery system, in which the work of undoing uncommitted transactions is proportional to the work those transactions performed: the more a transaction did, the longer its undo takes. A single statement may update many records, and undo must revert them one by one, which can make undo very slow. Correctness is guaranteed, but the delay is hard to accept for cloud services or systems with high availability requirements. To deal with this, Constant Time Recovery (CTR) was developed: it combines the ARIES recovery system with multi-version concurrency control to complete recovery within bounded time. The basic idea is to use the different data versions in a multi-version database system to bring the data into a correct state during undo, rather than recovering from the information in the original WAL log alone.

  • Concurrency control in MS SQL Server

SQL Server introduced multi-version concurrency control in 2005, but initially the multiple versions were used only to implement the snapshot isolation level, not recovery: the system can read data as of a snapshot timestamp. Under this scheme, updates are applied in place on the data page, but the old version is not discarded; it is placed in a separate version store. The version store is a special append-only table in which an updated record points to its old version, and each old version points to a still older one, forming a version chain. When accessing data, the transaction's timestamp decides which version to read. Updates to the version store need not be logged, because after a restart all timestamps are new, and no snapshot older than the latest version will ever be requested; only the latest version must be maintained. For versions no longer needed by any active transaction, a garbage-collection mechanism discards old versions based on the current timestamp.
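The version chain and snapshot read described above can be sketched as follows. The class names, and the plain Python list standing in for the append-only version store, are illustrative assumptions:

```python
class Versioned:
    """One version of a record, stamped with its creation timestamp."""
    def __init__(self, value, ts):
        self.value, self.ts, self.older = value, ts, None

class Record:
    def __init__(self, value, ts):
        self.latest = Versioned(value, ts)

    def update(self, value, ts, version_store):
        old = self.latest
        version_store.append(old)       # old version goes to the version store
        self.latest = Versioned(value, ts)
        self.latest.older = old         # link to the previous version

    def read(self, snapshot_ts):
        # Walk the chain to the newest version visible at snapshot_ts.
        v = self.latest
        while v is not None and v.ts > snapshot_ts:
            v = v.older
        return None if v is None else v.value
```

A reader with snapshot timestamp 25 on a record updated at times 10, 20, and 30 would follow the chain past the timestamp-30 version and return the timestamp-20 value.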


CTR builds on the original version store and makes it persistent, so that old versions are durably stored. Under CTR, updates to the version store are logged for recovery, which increases the version store's volume and overhead. Therefore two strategies are used for storing old versions: in-row versioning and off-row versioning.

In-row versioning means that when an update changes only a small amount of data, the old version is not stored in the version store; instead a delta is appended to the record, describing the change to the attribute's value. The purpose is to reduce the overhead of versioning, since a change written at the same location costs relatively little disk I/O.

Off-row versioning uses a special system table to store the old versions of all tables, recording the redo of its insert operations through the WAL. It is adopted when a modification is too large for in-row versioning to capture. As shown in the figure below, col 2 of row A4 is 444; after it is updated to 555, a delta is written to record the version change. This in-row approach is limited by the amount of modified data and by whether the page holding the record has free space: if there is free space, the delta can be written there; if not, the old version must be placed in the off-row versioning table.


CTR recovery proceeds in three phases (analysis, redo, and undo). The analysis phase is similar to ARIES: it determines the state of each transaction, such as active, committed, or requiring undo. During redo, the system replays both the main table and the version-store table, restoring them to their state at the crash; once redo completes, the database can come online and serve requests. The third phase is undo. Since the analysis phase has already identified the uncommitted transactions, the undo phase simply marks them as aborted. Because each version of a record stores the ID of the transaction that created it, a later transaction reading a version first checks that transaction's status: if it is aborted, the version is ignored and the previous, older version is read instead. With this scheme, reading an unavailable version requires following the chain to an earlier version, which adds some performance overhead, but it greatly reduces the database's offline time. After service resumes, the system performs garbage collection in the remaining time to gradually clear out invalid old versions. This mechanism is called Logical Revert.

  • Logical Revert

There are two ways to perform Logical Revert. The first is a background cleanup process that scans all data blocks to determine which garbage can be collected. The rule is: if the latest version in the main table comes from an aborted transaction, take an older committed version from the version store and put it back into the main table. Even if this is not done immediately, later reads will simply fetch the correct data from the version store, so versions can be moved back gradually by the background garbage-collection process. The second way: if a transaction updating a data item finds that the version in the main table belongs to an aborted transaction, it simply overwrites that version; the correct committed version remains in the version store.
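A minimal sketch of CTR-style undo-by-marking: recovery marks every in-flight transaction as aborted in constant time, and readers skip versions created by aborted transactions by following the version chain. The structures here are illustrative assumptions, not SQL Server's actual layout:

```python
txn_status = {}          # transaction id -> "committed" / "aborted"

class Version:
    def __init__(self, value, txn, older=None):
        self.value, self.txn, self.older = value, txn, older

def recover(active_txns):
    # Constant-time undo: just mark each in-flight transaction aborted,
    # instead of reverting its records one by one.
    for t in active_txns:
        txn_status[t] = "aborted"

def read_committed(version):
    # Skip versions whose creating transaction is not committed.
    v = version
    while v is not None and txn_status.get(v.txn) != "committed":
        v = v.older
    return None if v is None else v.value
```

The cost of undo no longer depends on how much work the aborted transactions did; the per-read chain walk is the price paid until Logical Revert cleans the versions up.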

Thus CTR recovery takes bounded time: the database can resume work as soon as the first two phases complete, and the time those phases take depends only on the distance to the last checkpoint. If the checkpoint interval is determined by a fixed log size, the recovery time will not exceed a fixed bound.

— Silo: Force Recovery —

Silo is a high-performance in-memory database prototype developed by Harvard University and MIT, designed to address the drop in throughput under high concurrency. If each CPU core runs one thread executing transactions, then in the absence of contention throughput grows with the number of cores, but beyond a certain point it falls, due to a bottleneck caused by contention for some shared resource. Although threads execute their transactions independently, every transaction must obtain a transaction ID before committing. The transaction ID is global: at commit time a transaction obtains it via an atomic_fetch_and_add(&global_tid). ID assignment is implemented through a global manager role that increments a transaction counter, guaranteeing that transaction IDs are globally unique and increasing. The speed of the manager's write operations therefore becomes the upper bound on system performance: as concurrency rises and transactions compete to obtain IDs, waiting times grow and throughput drops.

  • Optimistic concurrency control

Silo's solution to this bottleneck is to adopt optimistic concurrency control in a multi-core, shared-memory database. Optimistic concurrency control (introduced in "Analysis of In-Memory Databases and Mainstream Products (III)") assumes that transactions do not affect each other during execution and checks for conflicts only at commit time; if there is no conflict, the transaction obtains a global transaction ID and commits. Through its force-recovery design, Silo eliminates the required global transaction ID and uses the idea of group commit, dividing time into epochs of 40 ms each. An epoch contains all the transactions of its time period, so the whole group can be committed together by committing the epoch, without each transaction individually requesting a global transaction ID. The drawback of this design is that if a transaction's execution time exceeds 40 ms, spanning epochs affects commit and recovery.

In Silo, each transaction is distinguished by a sequence number plus an epoch number. The sequence number determines the order of transactions during execution, and the recovery strategy is determined jointly by the sequence number and the epoch number. Each transaction has a transaction ID (TID); transactions are group-committed per epoch, and overall commit order is serialized by epoch number. The TID is 64 bits and consists of status bits, a sequence number, and an epoch number: the high bits hold the epoch number, the middle holds the sequence number, and the lowest three bits are status bits. Each record stores the TID of its last writer, and the status bits hold the latch (lock) and related information used when accessing the record. One important difference between the in-memory database and a traditional disk-based DBMS is that lock management is co-located with the record: the data buffer and lock table are not managed separately.
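The TID layout can be illustrated by packing and unpacking a 64-bit integer. The exact field widths used here (29-bit sequence number, 32-bit epoch) are assumptions; the text only fixes the three low status bits:

```python
STATUS_BITS = 3
SEQ_BITS = 29
LOCK_BIT = 1 << 0                       # one of the three status bits

def pack_tid(epoch, seq, status=0):
    """Build a 64-bit TID: epoch (high) | sequence (middle) | status (low)."""
    assert status < (1 << STATUS_BITS) and seq < (1 << SEQ_BITS)
    return (epoch << (SEQ_BITS + STATUS_BITS)) | (seq << STATUS_BITS) | status

def unpack_tid(tid):
    return (tid >> (SEQ_BITS + STATUS_BITS),               # epoch number
            (tid >> STATUS_BITS) & ((1 << SEQ_BITS) - 1),  # sequence number
            tid & ((1 << STATUS_BITS) - 1))                # status bits

def is_locked(tid):
    return bool(tid & LOCK_BIT)
```

Because the epoch occupies the high bits, comparing raw TIDs as integers orders transactions by epoch first, which is exactly the epoch-serialized commit order described above.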


  • Three phases of transaction commit

Since Silo uses standard optimistic concurrency control, conflicts are checked only at commit time. During the pre-commit phase, when the transaction reads a data item, it stores that item's TID in its local read set and then reads the value; when it modifies a data record, it puts the modified record into its local write set.

Silo commits a transaction in three phases. First, it acquires the locks on all records in its local write set; the lock information lives in the status bits of the record's TID and is obtained with the atomic compare-and-set operation. The transaction then reads the current epoch number. A dedicated thread in the system advances the epoch (+1 every 40 ms), so transactions never contend to write the epoch number; they only read it. Second, the transaction validates its read set. Each data item in Silo carries the TID of the last transaction that updated it; if a record's TID has changed, or the record is locked by another transaction, the record changed between read and commit and the transaction must be rolled back. Finally, the transaction's TID is generated: the sequence number in the TID must be greater than the TIDs of all records it read, to ensure that TIDs increase. The commit result is returned to the client only after the logs for the transaction's epoch have been made durable.
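The three phases can be sketched as follows, with a tiny record class standing in for Silo's actual records and plain flags standing in for its atomic compare-and-set; this single-threaded illustration is an assumption-laden sketch, not the real protocol:

```python
class Rec:
    def __init__(self, value, tid=0):
        self.value, self.tid, self.locked = value, tid, False

def commit(read_set, write_set):
    """read_set:  list of (record, tid_seen_at_read)
       write_set: list of (record, new_value)
       Returns the new TID on success, None on conflict/abort."""
    wrecs = [r for r, _ in write_set]
    # Phase 1: lock every record in the write set (in a fixed order,
    # here by object id, so concurrent committers cannot deadlock).
    for rec in sorted(wrecs, key=id):
        rec.locked = True
    # Phase 2: validate the read set.
    for rec, tid_seen in read_set:
        if rec.tid != tid_seen or (rec.locked and rec not in wrecs):
            for r in wrecs:            # conflict detected: roll back
                r.locked = False
            return None
    # Phase 3: choose a TID larger than every TID observed, then install.
    tid = max([t for _, t in read_set] + [r.tid for r in wrecs]) + 1
    for rec, value in write_set:
        rec.value, rec.tid, rec.locked = value, tid, False
    return tid
```

A real implementation would also fold the current epoch number into the TID's high bits and hold off acknowledging the client until that epoch is durable.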

  • Recovery: SiloR

SiloR is Silo's recovery subsystem. It uses physical logging and checkpoints to guarantee transaction durability, and parallelizes both. As noted earlier, log writing is the slowest operation in an in-memory database, because logging involves writing to disk, and disk I/O is the performance bottleneck of the whole architecture. SiloR therefore writes logs concurrently, under the following assumption: each disk in the system has one logger thread serving a group of worker threads, and the logger thread and its worker threads share a CPU socket.

Under this assumption, the logger thread maintains a log buffer pool (containing multiple log buffers). A worker thread that wants to execute first asks the logger for a log buffer to write its records into; when the buffer is full, the worker returns it to the logger and obtains a new one, and if no buffer is available the worker blocks. The logger periodically flushes buffers to disk and, once a buffer is empty, hands it back to the workers. As for log files, a new file is created every 100 epochs, and the old file is renamed according to a fixed rule whose final component identifies the largest epoch number in that log. Each log record contains the transaction's TID and the set of record updates (table, key, old value -> new value).
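The buffer handoff between workers and the logger thread can be sketched with queues standing in for the shared pool; the pool size and record shapes are illustrative assumptions:

```python
import queue

free_bufs = queue.Queue()       # empty buffers available to workers
full_bufs = queue.Queue()       # filled buffers awaiting flush
for _ in range(4):              # a pool of 4 log buffers
    free_bufs.put([])

def worker_log(records):
    buf = free_bufs.get()       # blocks if no buffer is free
    buf.extend(records)
    full_bufs.put(buf)          # hand the filled buffer to the logger

def logger_flush(disk):
    buf = full_bufs.get()
    disk.extend(buf)            # "flush" the buffer's records to disk
    buf.clear()
    free_bufs.put(buf)          # recycle the now-empty buffer
```

The blocking `get` on the free-buffer queue is what throttles workers when the logger falls behind, matching the behavior described above.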


The above describes one core of a multi-core CPU; every core works the same way. A dedicated thread tracks which logs have been flushed to disk and writes the latest durable epoch number to a fixed location on disk. Each transaction compares its own epoch with this persistent epoch (hereafter pepoch); if its epoch is less than or equal to pepoch, its logs are durable and its result can be returned to the client.

  • Recovery process

Like the ARIES recovery system, SiloR also needs checkpoints. Its first step is to read and restore the data of the last checkpoint; since the in-memory database does not log the index, the index must be rebuilt in memory. The second step is log replay. The in-memory database performs only redo, no undo, and redo is not performed in ARIES order: scanning from newest to oldest, records are updated directly to their latest version. During replay, SiloR first checks the pepoch log file to find the latest persistent epoch number; log records beyond the pepoch belong to transactions that were never acknowledged to clients and can be ignored. Log replay uses value logging: for each log record, check whether the data record already exists. If not, the data record is created from the log record; if it exists, the record's TID is compared with the log record's TID, and redo is performed, replacing the old value with the logged new value, only if the log's TID is greater. Thus SiloR's redo does not step through history toward the crash point as ARIES does: ARIES must restore the on-disk data to its exact final state, whereas SiloR only needs to restore the in-memory data to the latest correct state.
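The replay rule just described (ignore records past pepoch; install a logged value only if its TID is newer than what the table holds) can be sketched as follows; the tuple layouts are illustrative assumptions:

```python
def replay(table, log_records, pepoch):
    """table: key -> (tid, value); log record: (epoch, tid, key, value).
    Installs each surviving logged value only if its TID is the newest."""
    for epoch, tid, key, value in log_records:
        if epoch > pepoch:
            continue                    # never acknowledged to any client
        cur = table.get(key)
        if cur is None or tid > cur[0]:
            table[key] = (tid, value)   # newer version wins
    return table
```

Because only the newest TID per key is ever installed, the scan order does not matter, which is what makes parallel replay straightforward.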

— In-Memory Checkpoint —

For an in-memory database, the recovery system is simpler than in a disk-based DBMS: only the data needs to be logged, not the index, and only redo logs are needed, not undo logs, since all data is overwritten in place; there are no dirty pages to manage, and no buffer pool or buffer flushing. However, the in-memory database is still limited by the time cost of log synchronization, because logs must still be flushed to non-volatile storage (disk). In the early 1980s, in-memory database research assumed that memory contents would not be lost, for example with battery-backed memory (which can keep working for a period after a power failure); there are also technologies such as non-volatile memory (NVM), but these are far from widespread, so persistent storage must still be considered.

A persistence mechanism must have minimal impact on performance, affecting neither throughput nor latency. To reduce the impact on transaction execution, the primary goal for an in-memory database is recovery speed: completing recovery as quickly as possible after a failure. Serial recovery cannot meet this requirement, so much current research focuses on parallel logging for in-memory databases, and parallelism in turn complicates the implementation of locks and checkpoints.

For in-memory checkpoints, the implementation is usually tightly integrated with concurrency control, whose design determines how checkpoints are implemented. The ideal in-memory checkpoint must not interfere with normal transaction execution, must not introduce extra latency, and must not occupy too much memory.

Types of checkpoints: Checkpoints are divided into consistent checkpoints and fuzzy checkpoints. A consistent checkpoint contains no uncommitted transactions; if any are present while the checkpoint is taken, they are removed during checkpointing. A fuzzy checkpoint contains both committed and uncommitted transactions; the uncommitted ones are removed only during recovery.

Checkpoint mechanisms: There are two kinds. The first uses features of the database system itself, such as taking snapshots via multi-version storage. The second works at the operating-system level: it forks a child process of the database process via the fork function, copying all the data in memory; extra work is then required to roll back changes that were in flight and not yet committed.

Checkpoint content: In terms of content, there are two types. One copies the full data set at every checkpoint; the other is the incremental checkpoint, which records only the changes made since the previous checkpoint. The difference lies in the amount of data written per checkpoint and the amount of data that must be read during recovery.
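The trade-off between full and incremental checkpoints can be sketched over a simple key-value table; the dirty-key tracking used here is an illustrative assumption:

```python
def full_checkpoint(table):
    """Copy the whole table: large writes, but recovery needs one file."""
    return dict(table)

def incremental_checkpoint(table, dirty_keys):
    """Copy only keys changed since the last checkpoint: small writes,
    but recovery must replay every delta since the last full copy."""
    delta = {k: table[k] for k in dirty_keys}
    dirty_keys.clear()              # these changes are now captured
    return delta

def restore(base, deltas):
    state = dict(base)
    for d in deltas:                # apply increments in order
        state.update(d)
    return state
```

This makes the text's point concrete: the incremental scheme writes less per checkpoint but must accumulate and reapply deltas at recovery time.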

Checkpoint frequency: There are three approaches: time-based periodic checkpoints; checkpoints triggered by a fixed log size; and forced checkpoints, for example a mandatory checkpoint before the database goes offline.

– Summary –

In these two columns, we introduced the recovery subsystem of database systems, covering the concepts and technical implementation of ARIES, the mainstream physical-logging recovery system, as well as logical undo logging. We also introduced the recovery strategies of two database systems: SQL Server's Constant Time Recovery (CTR) and the force-recovery design of the in-memory database Silo. The next installment will discuss concurrency control in database systems.


  1. C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 1992. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking And Partial Rollbacks Using Write-Ahead Logging. ACM Trans. Database Syst. 17, 1 (March 1992), 94–162.
  2. Antonopoulos, P., Byrne, P., Chen, W., Diaconu, C., Kodandaramaih, R. T., Kodavalla, H., … & Venkataramanappa, G. M. (2019). Constant time recovery in Azure SQL database. Proceedings of the VLDB Endowment, 12(12), 2143-2154.
  3. Zheng, W., Tu, S., Kohler, E., & Liskov, B. (2014). Fast databases with fast durability and recovery through multicore parallelism. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (pp. 465-477).
  4. Ren, K., Diamond, T., Abadi, D. J., & Thomson, A. (2016, June). Low-overhead asynchronous checkpointing in main-memory database systems. In Proceedings of the 2016 International Conference on Management of Data (pp. 1539-1551).
  5. Kemper, A., & Neumann, T. (2011, April). HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In 2011 IEEE 27th International Conference on Data Engineering (pp. 195-206). IEEE.