Comparison of common technologies and schemes of database recovery subsystem (I)


Author: Xiao Chen / big data open laboratory

For transactional database, the key function is to ensure the acid attribute of things, in which atomicity and persistence depend on the recovery subsystem. If it is found that the transaction cannot continue during the transaction, it needs to be rolled back with the recovery subsystem; Or in case of system crash, it is also necessary to rely on the recovery subsystem to restore the database to the state before the crash. In this column, we mainly introduce logging protocols / recovery algorithms, which are two key parts of the transactional database recovery subsystem.

— Logging Schemes—

The key of the recovery subsystem is the recovery algorithm, which aims to realize two processes. The first is the preparation for system recovery during transaction execution. At present, most systems usually use logging. Although recording data update logs at the same time during transaction execution will have additional overhead, if there is no log, system recovery and rollback of unfinished transactions cannot be realized once the system crashes. In addition, there is a shadow paging scheme, that is, each modification of data is carried out through copy on write. When updating data, copy a copy of the original data and update it on the copy. Finally, the operation is completed by replacing the original data with a copy. Shadow paging scheme costs a lot and is generally used in scenes with infrequent updates, such as text editors and other similar scenes. Therefore, log based schemes are mostly used in transactional database systems. The second process is how to use the log information recorded by the system and adopt appropriate strategies to ensure that the database can be restored to the correct state in case of system failure or transaction rollback.

  • Physical Logging & Logical Logging

Logging is divided into physical logging and logical logging. Physical logging refers to recording the modification of data item in the log record. For example, the value of data item a before modification is 90 and after modification is 100. Physical logging will record the change process of data item a. In a database system, physical logging may be value logging, that is, record data item, data item ID, attribute value before / after modification and other information; It may also be a real physical logging, that is, recording the disk page pageid, offset and length, as well as the values before and after modification.

The other is logical logging, which does not record execution results, but only records data modification operations, such as delete / update. Compared with physical logging, which restores or replays according to the values before and after modification, logical logging needs to re perform the operations in the log during replay, and reverse the operations recorded in the log during rollback, such as inserting the corresponding deletion, etc.

  • Physical Logging VS Logical Logging

The two kinds of logging have their own advantages and disadvantages. The log content recorded by logical logging is less, such as update operation. Logical logging only needs to record an update statement, with less logging overhead. The disadvantage is that it is difficult to implement in the concurrent scenario. When multiple transactions generate update operations at the same time, these operations will be scheduled as serial sequence execution in the database. A mechanism is needed to ensure that the execution sequence of each playback operation is consistent with the sequence generated by the scheduling. Therefore, most database systems use physical logging to ensure the consistency of data recovery. The execution sequence of transaction operations generated by the transaction manager (concurrency control subsystem) will be recorded in the form of log. According to the log sequence, the recovery subsystem can ensure that the rollback and replay of each data item modification are strictly executed in sequence. However, at present, some database systems still use logical logging, such as the memory database engine voltdb. This is because the voltdb engine is designed without concurrency control and each CPU core performs all operations in sequence. Therefore, it can be played back in sequence through logical logging.

For the database management system, to ensure the data persistence and correctness in case of failure, the recovery subsystem is indispensable. During the execution of a transaction, it can be rolled back when it needs to be revoked to ensure atomicity; However, restoring the subsystem at the same time will have a performance impact, because all log records can only be really dropped when they are brushed to the disk. Even if all transactions are completed, they must wait for the log to be dropped before responding to the client. Therefore, the performance of logging often becomes the performance bottleneck of the whole system.

— Recovery System Optimization

For the optimization of log or recovery subsystem, there are two types of technologies: Group commit and early lock release.

  • Group Commit

Group commit is to brush a group of transaction logs executed in parallel to disk, rather than once for each log by transaction. The log has a separate log buffer. All transactions write the log into the log buffer first. Set a separate thread to periodically brush the contents of the log buffer into the disk, or when the log buffer is full, brush it to the disk.

The operating system provides different ways to write to the disk, such as sync, fsync and fdatasync. When sync brushes the data to the operating system file buffer, it is regarded as the end, and then the operating system background process brushes the contents of the buffer to the disk. Therefore, brushing the disk through sync may cause data loss. The database system usually uses fsync to drop the log disk, and only returns when the record is actually written to the disk. Fsync writes file data and file metadata such as modification time, file length and other information to disk; The difference between fdatasync and fsync is that it only brushes data instead of metadata.

In some DBMS, fsync and fdatasync are mixed: when metadata modification does not affect logging, for example, only file modification time changes, only fdatasync can be used; However, if the operation modifies the file length, fdatasync cannot be used at this time, because fdatasync does not save metadata modification information, which will cause partial loss of content during recovery. When many DBMS write logs, they do not incrementally increase the contents of the log file, but allocate enough space for the log file at one time. The length of the log file remains unchanged in the subsequent log writing process. Therefore, fdatasync can be used to write the log to disk. It can be seen that group commit writes a group of transactions to disk using one system call at a time, merging many transaction I / O, thereby reducing the I / O of the whole system.

  • Early Lock Release

When implementing concurrency control based on lock mechanism, if the lock of the previous transaction is not released, the subsequent transaction can only be in the state of waiting for the lock. The black part in the figure indicates the ongoing transaction operation, and the gray part indicates the time waiting for the log to drop. Although the data is not modified at this time, the lock can be released only after the log is brushed to the disk. Early lock release is an optimization method for improving performance in this scenario. The strategy is to release the lock when the processing work in the transaction is completed, and then drop the log to the disk, so as to shorten the waiting time for the lock in the next transaction and improve the degree of concurrent execution.
Comparison of common technologies and schemes of database recovery subsystem (I)

However, this method also has defects. For example, the first transaction has released the lock, but there is a failure when the log disk falls and needs to be rolled back. However, since the lock has been obtained by the next transaction and the next transaction needs to be rolled back together with the previous transaction, the system needs to maintain the dependencies between transactions. In reality, lock early release technology is widely used in database. For the index structure, locking a node in the index will have a large impact range, because an index leaf node often involves a series of many data records. If the leaf node is locked, the relevant records will be locked. Therefore, in the use of indexes, early lock release is usually used instead of two-stage blocking protocol to shorten the time when data records are locked.

– Aries algorithm

In disk based database systems, recovery subsystems are mostly implemented based on Aries (algorithms for recovery and isolation expanding semantics) algorithm. Aries adopts data buffer and log buffer managementSteal + No ForceThe management strategy of steel + no force is described inComparison between in memory database parsing and mainstream products (I)As detailed in). In Aries, each log will have a sequence number LSN (log sequence number). As shown in the figure below, the log of LSN 10 is the update operation of transaction T1 writing page 5; LSN 20 is the update operation for transaction T2 to write page 3. It should be noted that the transaction end record will be kept in the log to identify that the transaction has been committed and returned to the client, indicating that the transaction has completed operations. If there is only commit but no end in the log, it may mean that the transaction has completed, but the client may not receive a response.

Comparison of common technologies and schemes of database recovery subsystem (I)

  • Aries three-phase recovery

The recovery algorithm of Aries is divided into three stages: analysis, redo and undo. The specific details of each stage will be described in detail later.

  1. Analysis:After a crash restart occurs, the system will first read the log file from the disk, analyze the contents of the log file, and judge which transactions are in the active state when the system crashes and which pages have been modified when the system fails.
  2. Redo:In the redo phase, the system reproduces the fault scene according to the log, including restoring the dirty page in the memory to the state at the time of crash, which is equivalent to replaying the log history, and executing each log record once, including the transaction log without commit.
  3. Undo:In the undo phase, the system begins to undo unfinished transactions. The above figure is a simple example of log recording. The system crashes after LSN 60. There is a mark of transaction T2 end in the log, so T2 has been committed, but transactions T1 and T3 have not been completed. If the modifications of transactions T1 and T3 to P1, P3 and P5 have been dropped, they need to be revoked from the disk.
  • Data structure of logging

For Aries recovery subsystem, the recovery process needs to be based on the information stored in logging. The log in Aries consists of multiple log records. One log record contains the transaction ID, page ID + offset, length, values before and after modification and additional control information of the modified data item.

The log types of Aries include update, commit, abort, end, and compensation log record (CLR). CLR is used to prevent the impact of failures during transaction rollback. When a transaction is rolled back, a CLR is recorded for each rollback log. The system can judge which operations have been rolled back through CLR. If the CLR is not recorded, the operation may be rolled back twice.

Comparison of common technologies and schemes of database recovery subsystem (I)

During normal log recording, Aries will record redo and undo information. The recorded log includes values before and after modification. Generally speaking, the log drop disk is written sequentially, so the database will arrange separate disks for the log service in configuration, and will not be mixed with the disks storing data records, so as to improve the performance of log writing.

The following is the schematic diagram of log disk dropping in Aries. The sequence on the right side of the diagram represents all logs, the blue part represents the logs that have been disk dropped, and the orange part represents the logs that are still in the log buffer. Aries will record the flushed LSN, indicating which buffers have been flushed to disk. In addition, a page LSN will be recorded in each disk block where the data is saved, which is used to represent the maximum log number corresponding to all operations modifying the data page (that is, the log number corresponding to the last operation modifying the data page). When brushing the data in the data buffer to the disk, determine whether the data can be brushed to the disk by judging the size of page LSN and flushed LSN. If the page LSN is less than or equal to the flushed LSN, it means that all log records for modifying the data page have been dropped, so the data can also be dropped. This is the so-called wal (write ahead log). The log is always written to the disk before the data.

In addition, prev LSN is saved in the log record to correspond to the previous log number of the transaction to which the log belongs. Since all transactions in the system share the log buffer, the generated logs are interspersed. All LSNS belonging to the same transaction can be connected in series through prev LSN to find all logs corresponding to the transaction.

Comparison of common technologies and schemes of database recovery subsystem (I)

Xact table and dirty page table also need to be maintained in the recovery subsystem. Xact table is used to record the status of all active transactions, such as active, commit, abort, end, etc; At the same time, the log number last LSN generated by the transaction is also recorded. The dirty page table is used to record which data pages have been modified after being loaded into the buffer from the disk, and the log number rec LSN at the earliest modification of each page (that is, the log number corresponding to the first modification operation after the data page is loaded into the buffer).

In addition to recording information in the log, in order to ensure that the recovery can be completed successfully, the database system also needs to record the LSN of the checkpoint with master record to ensure that each recovery only needs to start from the nearest checkpoint. Because the database system needs to shut down when doing checkpoint (no transaction execution is allowed), which is difficult for users to accept, the checkpoint in Aries adopts fuzzy checkpoint mode, that is, transactions can be continuously executed when doing checkpoint. Fuzzy checkpoint generates two log records: begin_ Checkpoint and end_ Checkpoint。 Begin_ Checkpoint is responsible for recording the start time of checkpoint, end_ Checkpoint records Xact table and dirty page table, and the LSN of checkpoint will be written to the master record on the disk for persistent storage. The above is all the data structures required for recovery. The sorting of various LSNS is summarized in the table below.

Comparison of common technologies and schemes of database recovery subsystem (I)

Comparison of common technologies and schemes of database recovery subsystem (I)

– transaction recovery of database system –
  • Simple transaction recovery

For simple transaction recovery (the system does not fail, but the transaction does not continue during execution), rollback is required at this time. When rolling back, the system first finds the latest LSN from the Xact table for undo, then finds the preamble log record through the prev LSN of the log record, and then continues undo until the whole transaction is completely replayed to the starting state. Similar to normal transaction operations, the data during undo actually needs to be locked, and the compensation log CLR will be recorded before rollback. The CLR will record the LSN number of undo next to point to the next LSN requiring undo. When undo goes to the LSN of transaction start, transaction abort and transaction end will be recorded to indicate the end of rollback.

  • Failed transaction recovery

As mentioned earlier, the fault recovery of Aries is divided into three stages. The implementation details of the three stages will be introduced in detail below.

  1. Analysis phase

    In the analysis phase, the system obtains the last checkpoint log record from the master record on the disk, reconstructs the Xact table and dirty page table, and retrieves the data from begin_ Checkpoint logging starts to process subsequent logging in sequence. When an end log of a transaction is encountered, it is removed from the Xact table; If the commit log of a transaction is encountered, update the status of the corresponding transaction in the Xact table; If other log records are encountered, judge whether the transaction is in the Xact table. If it is not, add it to the Xact table, and update the last LSN of the transaction in the Xact table to the LSN of the current log record. In addition, the system will judge whether the updated data page in the log record is in the dirty page table. If not, add the data page to the dirty page table and set its rec LSN as the current log number.

  2. Redo phase

    In the redo phase, the system first finds the smallest of all pagerec LSNS in the dirty page table as the starting position of redo, because the data modifications corresponding to the previous log records have been dropped and will not appear in the dirty page table. Then, the system starts from the starting position of redo and performs redo operations (replay) on subsequent update log records (including CLR) in sequence. If the page updated by the operation is not in the dirty page table, or the page is in the dirty page table, but the rec LSN is greater than the current LSN, or the page LSN on the disk is greater than the current LSN, it means that the record corresponding to the LSN has been dropped and can be skipped directly without redo. During redo, the system does not need to record logs, because redo only reconstructs the whole memory state. If a system failure occurs during redo, it will be repeated according to the original operation.

  3. Undo phase

    The purpose of undo phase is to undo the unfinished transactions in case of system failure. At the beginning, a log set requiring undo will be established, the last log number of each transaction to be rolled back will be put into the set, and then cyclic processing will be started. First, the system selects the largest LSN from the set, that is, the last one for undo. If this log is a CLR compensation log and its undo next is empty, it indicates that the transaction has completed undo, and an end log can be recorded to indicate the end of the transaction; If the undo next of the compensation log is not equal to empty, it means that there is another log that needs undo, then put the LSN of the next log into the set; If the log is updated, roll back the log and record a CLR log, and then add the prev LSN of the log to the collection. The system will continue to cycle according to the above process until the entire undo set is empty.

    Next, sort out the whole process through examples. Firstly, the system made a fuzzy checkpoint, and there were two updates: T1 modified P5 and T2 modified P3. Then, T1 abort is cancelled, and lsn40 records the compensation log – rollback lsn10, and then T1 end. Next, other transactions continue: T3 modifies P1, T2 modifies P5. Crash appears at this time. How to recover? First, in the analysis process, scan backward from checkpoint. It is found that T1 has no end and redo is not required for T2 and T3. Therefore, there are only T2 and T3 in Xact table, and the dirty page table includes P1, P3 and P5. After the analysis is completed, redo redo is performed to recover the fault site, and then record the CLR when undoing each log in T2 and T3 until undoing to the first log of each transaction.

    Comparison of common technologies and schemes of database recovery subsystem (I)

    If crash occurs again during the recovery process (as shown in the figure below), the two operations that have been undone will redo the two CLRs because the CLRs are recorded, and the new undo process will not be rolled back again. The recovery system will continue on the original basis until all transactions are undo completed.

    Comparison of common technologies and schemes of database recovery subsystem (I)

  • Aries summary

Aries is a recovery system with mature design and can ensure transaction atomicity and persistence. It uses wal and steel + no force buffer management strategies without affecting the correctness of the system. In Aries, LSN is the unique identification of monotonically increasing log records, which connects all logs of a transaction in series through links. Page LSN is used to record the log number corresponding to the last modification operation of each page. The system reduces the cost of recovery through checkpoint. The whole recovery is divided into three steps: analysis, redo and undo. The purpose of analysis is to find out which transactions need redo, which pages have been modified, and whether the modification has been dropped; Then recover the fault site through redo, and roll back the transactions that need to be revoked using undo.

– Summary of this paper –

This paper introduces the basic concepts of logging and recovery, and discusses the technical principle of Aries, the recovery subsystem in the traditional disk based database management system. The next article will continue to explore the database recovery subsystem, discuss early lock release and logical undo in DBMS recovery, and introduce two database recovery technologies and memory database recovery methods.


  1. C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 1992. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking And Partial Rollbacks Using Write-Ahead Logging. ACM Trans. Database Syst. 17, 1 (March 1992), 94–162.
  2. Antonopoulos, P., Byrne, P., Chen, W., Diaconu, C., Kodandaramaih, R. T., Kodavalla, H., … & Venkataramanappa, G. M. (2019). Constant time recovery in Azure SQL database. Proceedings of the VLDB Endowment, 12(12), 2143-2154.
  3. Zheng, W., Tu, S., Kohler, E., & Liskov, B. (2014). Fast databases with fast durability and recovery through multicore parallelism. In 11th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 14) (pp. 465-477).
  4. Ren, K., Diamond, T., Abadi, D. J., & Thomson, A. (2016, June). Low-overhead asynchronous checkpointing in main-memory database systems. In Proceedings of the 2016 International Conference on Management of Data (pp. 1539-1551).
  5. Kemper, A., & Neumann, T. (2011, April). HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In 2011 IEEE 27th International Conference on Data Engineering (pp. 195-206). IEEE.