[MySQL in Plain Language] How MySQL Transactions and Logs Work


In back-end interviews, MySQL is all but unavoidable, and questions about transactions and logs are where candidates most often stumble. Most people know that MySQL guarantees transactional behavior through the redo log, binlog, and undo log, and that these logs can also be used to recover data. But go one step further: how exactly is transactional behavior guaranteed? How does data reach the disk during an update? What happens if the contents of the redo log and the binlog disagree? Why is writing logs to disk more efficient than writing the data directly? If you cannot answer these, the interviewer (especially at a large company) will most likely tell you to go home and wait for the result.

Redo log and binlog

Most articles already cover this, but for completeness, let’s start with the differences between the redo log and the binlog.

Different location

First, the two logs live in different layers. MySQL’s overall architecture can be divided into the server layer and the storage engine layer. Storage engines are pluggable; common ones include MyISAM, InnoDB, and MEMORY. You specify the engine when creating a table (create table … engine = InnoDB).

The binlog is a server-layer log: whichever storage engine is used, the binlog records the statements that were executed. The redo log is specific to the InnoDB storage engine.

Different sizes

The binlog is stored in multiple log files. The size of a single file is set by max_binlog_size, and writes are appended. When the current file exceeds max_binlog_size, MySQL creates a new log file and switches to it. In addition, expire_logs_days sets the number of days binlog files are retained.

The size of the redo log is fixed. The configuration parameters innodb_log_files_in_group and innodb_log_file_size set the number of log files and the size of each one. Writes are circular: when the end of the last file is reached, writing wraps around to the beginning.

Different records

The binlog records operations logically. There are two main record formats: statement and row (a third format, mixed, switches between the two). The statement format records the SQL statements themselves; the row format records the contents of the rows before and after the update.

The redo log records physical modifications to data pages, for example “what change was made on which data page”.

Two-phase update process

Now that we understand the differences between the two logs, let’s follow the execution of an update statement to see how they are written. The statement is update t set a = a + 1 where id = 1.

  1. The executor asks the InnoDB engine for the row with id = 1. If the row is already in memory, the engine returns it directly; otherwise the engine first reads it from disk into memory and then returns it.

  2. The executor takes the row returned by the engine, adds 1 to the value of a to produce the new row, and then calls the engine interface to write the new row.

  3. The engine updates the row in memory and records the change to the in-memory data page in the redo log buffer. At this point the redo log record is in the prepare state. The engine then tells the executor that execution is complete and the transaction can be committed at any time.

  4. When the server layer commits the transaction, it first writes the operation to the binlog buffer and then calls the engine’s commit interface. The engine changes the state of the redo log record it just wrote to commit. The update is complete.

Notice that after an update, not only is the data changed in memory first; the redo log and binlog are also written to memory first, and are then flushed to disk according to the configured flush policy.
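The four steps above can be sketched as a toy model. This is only an illustration under stated assumptions: the Engine class, its method names, and the record layout are hypothetical and do not correspond to InnoDB’s real interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Engine:
    """Toy stand-in for InnoDB (hypothetical): in-memory rows plus a redo log buffer."""
    rows: dict = field(default_factory=dict)
    redo_buffer: list = field(default_factory=list)

    def read(self, row_id):
        return self.rows[row_id]

    def write_prepare(self, trx_id, row_id, new_value):
        # Step 3: update the row in memory and log the change in the prepare state.
        self.rows[row_id] = new_value
        self.redo_buffer.append(
            {"trx": trx_id, "row": row_id, "value": new_value, "state": "prepare"}
        )

    def commit(self, trx_id):
        # Step 4 (engine side): mark the transaction's redo records as committed.
        for record in self.redo_buffer:
            if record["trx"] == trx_id:
                record["state"] = "commit"


def update_a_plus_one(engine, binlog_buffer, trx_id, row_id):
    old_value = engine.read(row_id)                   # step 1: fetch the row
    new_value = old_value + 1                         # step 2: executor computes a + 1
    engine.write_prepare(trx_id, row_id, new_value)   # step 3: redo record, prepare
    binlog_buffer.append(                             # step 4: binlog first ...
        f"trx {trx_id}: update t set a = a + 1 where id = {row_id}"
    )
    engine.commit(trx_id)                             # ... then redo record, commit


engine = Engine(rows={1: 10})
binlog_buffer = []
update_a_plus_one(engine, binlog_buffer, trx_id=7, row_id=1)
print(engine.rows[1], engine.redo_buffer[0]["state"], len(binlog_buffer))
```

Everything here lives in memory, which mirrors the observation above: nothing in the update path itself forces a disk write.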

Flushing logs to disk

Binlog flush strategy

MySQL uses the sync_binlog parameter to control how the binlog buffer is flushed to disk.

sync_binlog = 0 means that MySQL does not control when the binlog is flushed and leaves it to the file system’s cache flushing policy. Performance is best and risk is greatest: if the system crashes, binlog data that has not yet reached the disk is lost.

sync_binlog = 1 means that the log data in the buffer is synchronized to disk every time a transaction commits. This is the safest setting, but because the disk is flushed so often, performance is also the worst.

sync_binlog = N (N > 1) means that the log data is flushed to disk after every N transactions have been written to the binlog buffer.
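To make the three settings concrete, here is a small counting model. It is a sketch only: the BinlogSyncPolicy class is hypothetical and simply counts how many fsync calls MySQL itself would issue and how many committed transactions would still be unsynced.

```python
class BinlogSyncPolicy:
    """Hypothetical counter model of sync_binlog; no real I/O happens."""

    def __init__(self, sync_binlog):
        self.sync_binlog = sync_binlog
        self.unsynced = 0      # committed transactions not yet fsynced by MySQL
        self.fsync_calls = 0

    def on_commit(self):
        self.unsynced += 1
        # sync_binlog = 0: MySQL never calls fsync itself, the OS decides.
        if self.sync_binlog > 0 and self.unsynced >= self.sync_binlog:
            self.fsync_calls += 1
            self.unsynced = 0


def run(sync_binlog, commits):
    policy = BinlogSyncPolicy(sync_binlog)
    for _ in range(commits):
        policy.on_commit()
    return policy.fsync_calls, policy.unsynced


for setting in (0, 1, 4):
    print(setting, run(setting, commits=10))
```

The second number in each result is the exposure: the transactions that would be at risk if the operating system crashed at that moment.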

Redo log flush strategy

Before discussing redo log persistence, let’s first look at the write and fsync system calls. In an operating system, memory is divided into user space and kernel space. User space holds the application’s own cached data, and the redo log buffer lives in user space. To persist data from user space to disk, the application first calls write to copy the data into kernel space, and then calls fsync to flush the data in kernel space to disk.

MySQL uses the innodb_flush_log_at_trx_commit parameter to control when the redo log buffer is written to disk.
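The write-then-fsync sequence can be tried directly with Python’s os module. This is a minimal sketch: the persist helper, the file name, and the payload are made up for illustration, and a production writer would also loop on partial writes.

```python
import os
import tempfile


def persist(path, payload: bytes):
    """Append payload to path and force it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)  # user space -> kernel space (page cache)
        os.fsync(fd)           # kernel space -> disk
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as directory:
    log_path = os.path.join(directory, "redo.log")  # hypothetical file name
    persist(log_path, b"page 5: set a = 11\n")
    with open(log_path, "rb") as log_file:
        data = log_file.read()

print(data)
```

If the process dies after os.write but before os.fsync, the data is still in the kernel and will normally reach the disk; if the whole machine dies in that window, it may not.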

innodb_flush_log_at_trx_commit = 0 means that at commit the log stays in the redo log buffer. Write and fsync are then called at the interval set by innodb_flush_log_at_timeout to persist the log to disk. innodb_flush_log_at_timeout defaults to 1, so the log is written to disk once per second. Writes are batched, so I/O performance is good, but the risk of data loss is high: up to about one second of committed transactions can be lost.

innodb_flush_log_at_trx_commit = 1 means that at commit both write and fsync are called to write the log to disk. No committed transaction is lost, but I/O performance is the worst of the three settings.

innodb_flush_log_at_trx_commit = 2 means that at commit only write is called, which puts the log in the kernel cache; fsync is then called once per second to write the log to disk. This is also relatively safe: even if the MySQL process crashes, the log in the OS cache is not lost. Of course, if the operating system also crashes, that part of the log is gone.
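The difference between the three settings is easiest to see by asking what survives each kind of crash. The model below is a rough sketch with three in-memory lists standing in for the redo log buffer, the OS cache, and the disk; the RedoLog class and its method names are hypothetical.

```python
class RedoLog:
    """Hypothetical model of innodb_flush_log_at_trx_commit = 0, 1 or 2."""

    def __init__(self, flush_at_commit):
        self.mode = flush_at_commit
        self.log_buffer = []  # user space: redo log buffer
        self.os_cache = []    # kernel space: page cache
        self.disk = []        # durable storage

    def _write(self):
        self.os_cache += self.log_buffer
        self.log_buffer = []

    def _fsync(self):
        self.disk += self.os_cache
        self.os_cache = []

    def commit(self, record):
        self.log_buffer.append(record)
        if self.mode in (1, 2):
            self._write()
        if self.mode == 1:
            self._fsync()

    def background_tick(self):
        # Stands in for the periodic flush (innodb_flush_log_at_timeout).
        self._write()
        self._fsync()

    def survivors(self, crash):
        # "mysql": only the MySQL process dies; "os": the whole machine dies.
        if crash == "mysql":
            return self.disk + self.os_cache
        return list(self.disk)


results = {}
for mode in (0, 1, 2):
    log = RedoLog(mode)
    log.commit("trx1")
    results[mode] = (log.survivors("mysql"), log.survivors("os"))
print(results)
```

With setting 0 the record is lost in either crash until the background flush runs, with setting 1 it survives both, and with setting 2 it survives a MySQL crash but not an OS crash.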


Q: Can redo log records in the prepare state be flushed to disk?
A: Yes. For example, suppose transactions A and B are running at the same time. A is in the prepare state, and B commits, which triggers a log flush. A’s redo log records are flushed to disk along with B’s.

Q: Is the binlog redundant? Can the redo log replace it?
A: As far as transaction support alone is concerned, the binlog contributes little. During crash recovery the binlog is consulted to decide whether a prepared transaction should be committed, but only because the binlog may already have been applied on the standby database; if the primary simply rolled the transaction back, the primary and standby data would become inconsistent.
However, the redo log cannot provide the binlog’s “archive” function. The redo log has a fixed size and is written circularly, so older records are overwritten and cannot be kept permanently. The binlog has no overall size limit and is written by appending. As long as the binlog files are retained, the database can be restored to its state at any point in time.

Q: The flush strategies for the binlog and redo log also write to disk frequently. How is that different from writing the data itself directly to disk?
A: A log file occupies consecutive pages on disk, so writing the log needs only one seek and is sequential I/O. Data is different: the rows changed by one transaction may be spread over several non-adjacent data pages, so writing them requires multiple “seek -> rotate” positioning steps. That is random I/O, which on a mechanical disk can be orders of magnitude slower than sequential I/O.

Flushing data to disk

To avoid the performance bottleneck of writing to disk on every change, data pages are modified in memory first. Modified pages in memory are called dirty pages (the data in the page no longer matches the disk, so it is “dirty”). Dirty pages must be synchronized to disk at some point; this process is called flushing dirty pages.


In InnoDB, every modification of a data page generates an 8-byte log sequence number (LSN) that marks the version. The LSN increases monotonically across the whole instance and grows as the log is written. LSNs are stored both in data pages and in the redo log.
During the update process, several LSNs deserve attention:

  1. When data in an in-memory data page is modified, the LSN in that page is updated. We will call it data_in_buffer_lsn.

  2. When a log record is written to the redo log buffer, the corresponding LSN is recorded. We will call it redo_log_in_buffer_lsn.

  3. When one of the redo log flush strategies is triggered, the log in the redo log buffer is flushed to disk and the corresponding LSN is recorded in the file. We will call it redo_log_on_disk_lsn.

  4. When a data page is flushed from memory to disk, the current LSN is recorded in the corresponding data page on disk. We will call it data_on_disk_lsn.

  5. At appropriate times, InnoDB synchronizes the data page changes recorded in the redo log to disk. The progress of this synchronization is also marked by an LSN, called checkpoint_lsn (covered in detail later).

You can view the value of each LSN with show engine innodb status.

The LSN can be understood as the total amount of redo log the database has generated since it was created. The larger the value, the more updates the database has seen, so it can also be read as a kind of update time. Each data page also carries an LSN, which is the LSN at the time the page was last modified; the larger the value, the later the modification. For example, suppose the LSN of data page A is 100, the LSN of data page B is 200, the checkpoint LSN is 150, and the system LSN is 300. This means the system has generated updates up to 300, and every page whose LSN is below 150 has been flushed to disk. The latest version of page A is therefore certainly on disk, while page B may not be and may still exist only in memory.

Let’s look at when InnoDB flushes dirty pages.

When data is flushed

Periodic flushing

InnoDB’s main thread periodically flushes a certain proportion of dirty pages to disk. This process is asynchronous and does not block other operations such as queries and updates.
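The page A / page B example reduces to a single comparison. The helper below is a sketch with a made-up name, must_be_on_disk, using the numbers from the paragraph above.

```python
def must_be_on_disk(page_lsn, checkpoint_lsn):
    """True if the page's latest change is guaranteed to have been flushed."""
    return page_lsn < checkpoint_lsn


checkpoint_lsn = 150
pages = {"A": 100, "B": 200}
status = {name: must_be_on_disk(lsn, checkpoint_lsn) for name, lsn in pages.items()}
print(status)
```

A False result does not mean the page is definitely missing from disk, only that the checkpoint gives no guarantee about it.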

Insufficient system memory

InnoDB maintains an LRU list of in-memory data pages and uses a separate page cleaner thread to keep a certain number of free pages available. When free pages run short, the pages at the tail of the LRU list are evicted. If an evicted page is dirty, it is flushed to disk first.

The proportion of dirty pages is too high

InnoDB has a parameter, innodb_max_dirty_pages_pct, that controls the proportion of dirty pages in memory. When the proportion of dirty pages exceeds this setting, some dirty pages are flushed to disk.

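mysql> show variables like 'innodb_max_dirty_pages_pct';
+----------------------------+-----------+
| Variable_name              | Value     |
+----------------------------+-----------+
| innodb_max_dirty_pages_pct | 90.000000 |
+----------------------------+-----------+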

The database shuts down normally

The innodb_fast_shutdown parameter controls what is flushed when the database is shut down. When it is set to 1, all dirty log pages and dirty data pages are flushed to disk; when it is set to 2, only the log is guaranteed to be flushed.

Redo log checkpoint

Let’s review the update process. Once the update has been recorded in the redo log and the data has been changed in memory, the update operation is finished. If the database then shuts down abnormally, the changes to the affected data pages must be recovered from the redo log at the next startup.

However, the redo log has a fixed size and is written circularly: when the end is reached, writing wraps around to the beginning. As updates accumulate, older redo log records would therefore be overwritten and some changes would be lost.

Can we simply not limit the size of the redo log?
Imagine that the redo log reaches 1 TB and the database holds 10 TB of data. After an abnormal restart, we would need to read 1 TB of logs to recover the page changes, and if every data page had been modified, we would also need to load all 10 TB of data into memory. So an unbounded redo log causes two further problems:

  1. Slow recovery speed;
  2. Memory cannot cache all data in the database.

The redo log therefore uses a checkpoint strategy, which periodically and gradually flushes the data changes recorded in the redo log to disk. The progress is marked with an LSN called checkpoint_lsn. checkpoint_lsn divides the redo log into two parts: for log records below checkpoint_lsn, the corresponding data page changes have already been flushed to disk, so that part of the log can be overwritten; for log records above checkpoint_lsn, the corresponding changes have not yet been synchronized to disk.

Redo log checkpoint flushing is either asynchronous or synchronous.

checkpoint_age = redo_lsn - checkpoint_lsn
async_water_mark = 75% * total_redo_log_file_size
sync_water_mark = 90% * total_redo_log_file_size

checkpoint_age < async_water_mark means there is little dirty page data at the moment, and no redo log checkpoint is triggered.

async_water_mark < checkpoint_age < sync_water_mark triggers an asynchronous flush of some dirty pages to disk until checkpoint_age < async_water_mark holds again. An asynchronous flush does not affect other update operations.

checkpoint_age > sync_water_mark happens when the redo log capacity is small and a large number of updates arrive at once, leaving little usable log space. This triggers a synchronous flush, which flushes dirty pages to disk until checkpoint_age < async_water_mark holds again. A synchronous flush blocks user update operations.
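The three cases translate directly into a small decision function. This is a sketch built only from the formulas above; the function name checkpoint_action and the returned labels are made up, and the exact boundary behavior at the two water marks is an assumption.

```python
def checkpoint_action(redo_lsn, checkpoint_lsn, total_redo_log_file_size):
    """Classify the flush mode from the checkpoint age (labels are illustrative)."""
    checkpoint_age = redo_lsn - checkpoint_lsn
    async_water_mark = 0.75 * total_redo_log_file_size
    sync_water_mark = 0.90 * total_redo_log_file_size

    if checkpoint_age < async_water_mark:
        return "none"   # few dirty pages, no checkpoint flush needed
    if checkpoint_age < sync_water_mark:
        return "async"  # flush in the background, updates continue
    return "sync"       # flush now, user updates are blocked


for redo_lsn in (500, 800, 950):
    print(redo_lsn, checkpoint_action(redo_lsn, checkpoint_lsn=0,
                                      total_redo_log_file_size=1000))
```

With a total redo log size of 1000, the water marks fall at 750 and 900, so the three sample LSNs land in the three different zones.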


Q: Besides the redo log checkpoint, several other situations flush dirty pages. Do they also advance checkpoint_lsn?
A: No. The buffer pool maintains a flush_list that manages dirty pages. When a data page becomes dirty because its data was modified, it is added to the flush_list; once the dirty page has been flushed to disk, it is removed from the flush_list.
The flush_list is sorted in ascending order of each page’s earliest modification LSN (oldest_modification). For example, when a clean page becomes dirty with data_in_buffer_lsn = 100, it takes position 1 in the flush_list. When the page is modified again, data_in_buffer_lsn becomes 120, but its position in the flush_list does not change.
When a redo checkpoint is performed, the selected log record only needs to be compared with the oldest page on the flush_list (the page with the smallest LSN on the flush_list):

  1. page_no_flush_list != page_no_redo (the oldest page on the flush_list is not the page named in the redo record) means the dirty page has already been synchronized to disk, so checkpoint_lsn is simply advanced.
  2. page_no_flush_list == page_no_redo means the dirty page is flushed to disk first, and then checkpoint_lsn is advanced.
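The flush_list behavior can be sketched with an ordinary dictionary. This is a simplified model, not InnoDB’s data structure: the FlushList class is hypothetical, and it assumes the checkpoint can advance to the smallest oldest_modification still on the list.

```python
class FlushList:
    """Hypothetical model: page_no -> oldest_modification LSN of each dirty page."""

    def __init__(self):
        self.oldest_modification = {}

    def modify(self, page_no, lsn):
        # Only the first change since the page was clean fixes its position;
        # later changes leave oldest_modification untouched.
        self.oldest_modification.setdefault(page_no, lsn)

    def flush(self, page_no):
        # A flushed page is no longer dirty and leaves the list.
        self.oldest_modification.pop(page_no, None)

    def checkpoint_lsn(self, current_lsn):
        # Everything below the oldest unflushed change is already on disk.
        if not self.oldest_modification:
            return current_lsn
        return min(self.oldest_modification.values())


flush_list = FlushList()
flush_list.modify(page_no=1, lsn=100)
flush_list.modify(page_no=2, lsn=110)
flush_list.modify(page_no=1, lsn=120)  # page 1 stays at position 100

before = flush_list.checkpoint_lsn(current_lsn=130)
flush_list.flush(2)                    # e.g. evicted from the LRU list
after_page_2 = flush_list.checkpoint_lsn(current_lsn=130)
flush_list.flush(1)
after_page_1 = flush_list.checkpoint_lsn(current_lsn=130)
print(before, after_page_2, after_page_1)
```

Flushing page 2 does not move the checkpoint because page 1 still holds an older unflushed change; only the oldest page on the list limits how far the checkpoint can advance.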

Q: Where is the checkpoint information stored, and how?
A: The checkpoint information is stored in the header of the first redo log file, in two copies that are written alternately.
The header of the first redo log file has two slots for checkpoint information, and successive checkpoints alternate between them. Suppose there were only one checkpoint slot: if the server crashed when the checkpoint was only half written, the whole slot would be unusable, crash recovery would be impossible, and the database could not start. With two slots, even if one is damaged, the other can still be used to attempt recovery. The log it points to may already have been overwritten, but the chance of a successful recovery is at least higher. Writing the two slots in turn also reduces the impact of a disk sector failure.

Crash recovery

Suppose a user modifies data and the transaction commits successfully, but the changed pages are still only in memory and have not been flushed to disk. If the database crashes at this point, then after restart the changes of successfully committed transactions must be recovered from the log and written to disk again, so that no data is lost. Uncommitted transactions must also be rolled back. Besides the redo log and binlog, crash recovery also depends on the undo log.

Undo log

Undo logs are generated by data-changing operations: deleting a record logs a corresponding insert, updating a record logs the opposite update, and inserting a record logs a delete.

When a transaction needs to be rolled back, executing the corresponding undo operations restores the data. Undo logs also help guarantee transaction isolation. Suppose the isolation level is read committed. While uncommitted transaction A has modified the row with id = 1, transaction B wants to read the row with id = 1. B starts from the latest version of the row and follows the undo log back until it finds a version that is visible to it.

Undo log pages are treated like ordinary data pages: modifying an undo page requires writing the redo log first, and undo pages can also be evicted from memory by the LRU rules and read back from disk later.
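The “record the opposite operation” idea can be sketched on a plain dictionary. This is an illustration only: the tuple layout and the function names apply_with_undo and rollback are hypothetical, and real undo records are far more compact than full inverse statements.

```python
def apply_with_undo(table, undo_log, op, key, value=None):
    """Apply op to table and append the inverse operation to undo_log."""
    if op == "insert":
        table[key] = value
        undo_log.append(("delete", key, None))
    elif op == "delete":
        undo_log.append(("insert", key, table.pop(key)))
    elif op == "update":
        undo_log.append(("update", key, table[key]))
        table[key] = value
    else:
        raise ValueError(f"unknown operation: {op}")


def rollback(table, undo_log):
    """Undo every change, newest first."""
    while undo_log:
        op, key, old_value = undo_log.pop()
        if op == "delete":
            del table[key]
        else:  # "insert" and "update" both put the old value back
            table[key] = old_value


table = {1: "a"}
undo_log = []
apply_with_undo(table, undo_log, "update", 1, "b")
apply_with_undo(table, undo_log, "insert", 2, "c")
apply_with_undo(table, undo_log, "delete", 1)
after_changes = dict(table)

rollback(table, undo_log)
print(after_changes, table)
```

Replaying the undo records in reverse order matters: the last change has to be undone first, otherwise an earlier inverse could act on a row that is not in the expected state.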

Crash recovery process

The whole crash recovery process can be divided into two parts: redo roll forward and undo rollback.

Redo roll forward

Changes recorded in the log before checkpoint_lsn have already been flushed to the data pages and need no attention. First, initialize a hash_table and scan the log after checkpoint_lsn. Log records for the same data page are placed in the same slot of the hash_table and sorted by LSN in ascending order. After the scan, traverse the whole hash table and apply each page’s log records in turn. Once they are applied, the data pages in memory are restored to their state before the crash.
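The roll-forward scan can be sketched as follows. The record and page layouts are hypothetical, and one detail is an assumption added for the sketch: a record is skipped when the page’s own LSN shows the change is already present.

```python
from collections import defaultdict


def roll_forward(redo_records, checkpoint_lsn, pages):
    """Replay redo records newer than checkpoint_lsn onto pages (toy model)."""
    # Group the records after the checkpoint by data page.
    hash_table = defaultdict(list)
    for record in redo_records:
        if record["lsn"] > checkpoint_lsn:
            hash_table[record["page_no"]].append(record)

    # Apply each page's records in ascending LSN order.
    for page_no, records in hash_table.items():
        page = pages.setdefault(page_no, {"lsn": 0, "data": {}})
        for record in sorted(records, key=lambda r: r["lsn"]):
            if record["lsn"] > page["lsn"]:  # assumption: skip changes already on the page
                page["data"][record["field"]] = record["value"]
                page["lsn"] = record["lsn"]
    return pages


redo_records = [
    {"lsn": 100, "page_no": 1, "field": "a", "value": 1},
    {"lsn": 160, "page_no": 1, "field": "a", "value": 2},
    {"lsn": 170, "page_no": 2, "field": "b", "value": 9},
    {"lsn": 180, "page_no": 1, "field": "a", "value": 3},
]
# Page contents as read from disk after the crash.
pages = {
    1: {"lsn": 160, "data": {"a": 2}},
    2: {"lsn": 120, "data": {"b": 0}},
}
recovered = roll_forward(redo_records, checkpoint_lsn=150, pages=pages)
print(recovered)
```

Grouping by page means each page is read once and brought fully up to date, instead of being revisited for every record in log order.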

Undo rollback

Next, initialize the undo log. Undo logs are divided by operation type into undo_insert_list and undo_update_list. Traverse both linked lists and rebuild each transaction’s state from the state recorded in the log: TRX_ACTIVE means the transaction must be rolled back, and TRX_STATE_PREPARED means it may need to be rolled back. Add each transaction to trx_list, then traverse trx_list and roll back or commit according to the transaction state. Transactions in the TRX_ACTIVE state are rolled back directly using the undo log. For transactions in the TRX_STATE_PREPARED state, the server-layer binlog decides: if the binlog for the transaction has been written and is complete, the transaction is committed; otherwise it is rolled back.


Q: When are undo logs deleted?
A: Undo logs can be divided by operation type into insert undo and update/delete undo. An insert is visible only to the current transaction until that transaction commits, so its undo log can be deleted as soon as the transaction commits. For update and delete operations, other transactions may still need the old version of the data, so the undo log can be deleted only once the transaction ID of the undo record is smaller than that of every active transaction snapshot in the database (at that point the change is visible to all transactions).
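The commit-or-rollback decision at the end of recovery can be written as a small function. This is a sketch: the function name resolve is made up, plain transaction IDs stand in for whatever identifier links a transaction to its binlog events, and complete_binlog_trx_ids is assumed to hold the transactions whose binlog was fully written.

```python
def resolve(trx_list, complete_binlog_trx_ids):
    """Decide the fate of each transaction found during recovery (toy model)."""
    decisions = {}
    for trx_id, state in trx_list.items():
        if state == "TRX_ACTIVE":
            # Never reached prepare: always rolled back with the undo log.
            decisions[trx_id] = "rollback"
        elif state == "TRX_STATE_PREPARED":
            # Prepared: the binlog has the final say.
            if trx_id in complete_binlog_trx_ids:
                decisions[trx_id] = "commit"
            else:
                decisions[trx_id] = "rollback"
        else:
            raise ValueError(f"unexpected state: {state}")
    return decisions


trx_list = {
    1: "TRX_ACTIVE",
    2: "TRX_STATE_PREPARED",
    3: "TRX_STATE_PREPARED",
}
decisions = resolve(trx_list, complete_binlog_trx_ids={2})
print(decisions)
```

Transaction 2 is committed because its binlog is complete and may already have reached a standby; transaction 3 is rolled back because no one outside the primary can have seen it.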

