Buffer cache management is a core subsystem of a DBMS: it governs how data pages are accessed, flushed, and evicted. Although the operating system has its own page cache, it is not designed specifically for database workloads, so most database systems manage their own cache. Almost every data page access goes through the buffer pool, which makes its concurrency control especially important — it directly affects throughput and response time. This article reviews how the MySQL (InnoDB) buffer pool has evolved across recent versions (if anything is missing, please comment and add), to give a sense of the progress made in this area over the years.
Before MySQL 5.5
Only a single buffer pool could be configured, sized by innodb_buffer_pool_size. The master thread was responsible for flushing dirty pages, and scalability was poor.
MySQL 5.5
The innodb_buffer_pool_instances parameter was introduced to split the buffer pool into multiple instances, reducing contention on buffer pool access. At this point the master thread was still responsible for flushing dirty pages.
MySQL 5.6
Buffer pool page ID dump and load was introduced: the page numbers of the pages currently in memory can be saved to a file at any time, and those pages are automatically loaded back into memory on restart, keeping the cache warm. In addition, this version introduced the page cleaner for the first time, moving flush list / LRU flushing to a dedicated thread and reducing the burden on the master thread.
MySQL 5.7
This version shipped an important feature: online buffer pool resize. Whether it is truly "online" deserves a question mark, because resizing takes a number of global locks, which can easily hang the instance under high load (bug#81615).
Unlike before, each buffer pool instance is now composed of multiple chunks, with the size of each chunk controlled by the innodb_buffer_pool_chunk_size parameter (128MB by default). Buffer pool resizing grows or shrinks the pool one chunk at a time.
Another point worth noting is that the memory actually used may be larger than the buffer pool size you configured, especially for large buffer pools, because of internal alignment: the buffer pool size is rounded up to a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances (bug#80350).
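As a rough sketch of this alignment rule (the function name is illustrative, not InnoDB code):

```python
def aligned_buffer_pool_size(requested_bytes, chunk_size, instances):
    """Round the requested buffer pool size up to a multiple of
    chunk_size * instances, mirroring InnoDB's alignment rule."""
    unit = chunk_size * instances
    # Integer ceiling division, then scale back up to bytes.
    return -(-requested_bytes // unit) * unit

GB = 1024 ** 3
MB = 1024 ** 2

# Requesting 9 GB with the default 128 MB chunks and 16 instances:
# the alignment unit is 128 MB * 16 = 2 GB, so InnoDB actually
# allocates 10 GB.
print(aligned_buffer_pool_size(9 * GB, 128 * MB, 16) // GB)  # -> 10
```

So with many instances and the default chunk size, the rounding unit can reach gigabytes, which is why the discrepancy is most visible on large buffer pools.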
Normally the IO mode for data files is set to O_DIRECT, but after each modification an fsync is still needed to persist file metadata. On some file systems this fsync is unnecessary, so a new option, O_DIRECT_NO_FSYNC, was added. It originated at Facebook, who also added a special case for it: skip the fsync unless the file size changes. (There has been an interesting recent discussion on the bug list about the safety of this option, and the official documentation has been updated accordingly; if you are interested, see bug#94912: O_DIRECT_NO_FSYNC possible write hole.)
Another important feature is the long-awaited multiple page cleaner: several background threads can flush dirty pages concurrently, improving flushing performance and helping keep user threads out of single page flush. Of course, this is still not perfect, for four main reasons:
- User threads can still fall into single page flush, and once many threads do, performance degrades badly: extremely frequent fsyncs, fierce contention on the doublewrite buffer, thread switching, and so on.
- When redo space runs low, user threads also join page flushing, which is very common under high load. You may find that system performance drops sharply after running for a while: redo is generated faster than pages can be flushed, so the checkpoint cannot advance, and user threads may end up doing a fuzzy checkpoint themselves, at which point performance is dismal.
- The doublewrite buffer (dblwr) becomes a serious single-point bottleneck. If your storage does not support atomic writes, you must enable the doublewrite buffer: pages are first written to a fixed area of ibdata, protected by a lock, and split into two parts (single page flush and batch flush). Even with multiple page cleaners, overall scalability is ultimately bounded by dblwr.
- There is no dedicated LRU eviction thread; eviction is also the page cleaners' job. For example, when the buffer pool is full and there are many dirty pages at the same time, the page cleaners may be busy flushing the flush list while user threads cannot get a free page, and thus fall into single page flush.
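The fallback path described in the last two points can be sketched as follows. This is a toy model under stated assumptions — the class and method names are illustrative, not InnoDB code:

```python
class BufferPoolSketch:
    """Toy model of how a user thread obtains a free page
    (illustrative only; the real logic lives in InnoDB's buf0lru.cc)."""

    def __init__(self, free_pages, lru):
        self.free_list = list(free_pages)   # page ids ready for reuse
        self.lru = list(lru)                # (page_id, is_dirty), tail last
        self.single_page_flushes = 0

    def get_free_page(self):
        # 1. Fast path: take a page from the free list.
        if self.free_list:
            return self.free_list.pop()
        # 2. Scan from the LRU tail for a clean page to evict.
        for i in range(len(self.lru) - 1, -1, -1):
            page_id, dirty = self.lru[i]
            if not dirty:
                del self.lru[i]
                return page_id
        # 3. Worst case: every candidate is dirty, so the user thread
        #    flushes one page itself (single page flush) and reuses it.
        #    This is the slow path the page cleaners should prevent.
        page_id, _ = self.lru.pop()
        self.single_page_flushes += 1
        return page_id
```

When the free list is empty and the LRU tail holds only dirty pages, every `get_free_page` call ends in step 3 — which is exactly the pathological state described above.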
If these problems bother you, you can try Percona Server. They have always been good at optimizing IO-bound scenarios, and all of the problems above are addressed there; for dblwr in particular they implemented a multi-partition improvement.
MySQL 8.0
A feature was added that allows excluding the buffer pool from the core file generated when the instance crashes, greatly reducing core file size. Note that instances often crash because a file is corrupted, and repeated crash-and-core cycles can quickly fill up the disk. This is controlled by the innodb_buffer_pool_in_core_file parameter.
In addition, the most important improvement in 8.0 is that the global buffer pool mutex was finally split: each list is protected by its own dedicated mutex, greatly improving access scalability. This was in fact contributed upstream by Percona, who had implemented the feature back in version 5.5 (WL#8423: InnoDB: remove the buffer pool mutex, and bug#75534).
The original big mutex was split into several parts:
- LRU_list_mutex for the LRU_list;
- zip_free mutex for the zip_free arrays;
- zip_hash mutex for the zip_hash hash and in_zip_hash flag;
- free_list_mutex for the free_list and withdraw list;
- flush_state_mutex for the init_flush, n_flush, no_flush arrays.
Because the log system was reimplemented in a lock-free way, the flush order mutex was also removed. As a result, pages on the flush list may not be strictly ordered by LSN. Unlike before, the checkpoint LSN is therefore no longer necessarily the boundary of a log record — it may fall in the middle of one — which adds a certain amount of complexity to crash recovery (the log must be backtracked).
log_free_check has also changed: when the sync point is exceeded, the user thread no longer does the preflush itself, but asks a background thread to do it and waits (log_request_checkpoint). The log checkpointer thread then evaluates log_consider_sync_flush; at that point, if the innodb_flush_sync parameter is enabled, the flush operation is handed to the page cleaner threads, which ignore the IO capacity limit and flush dirty pages aggressively.
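The request-and-wait handoff described above can be sketched with a condition variable. This is a simplified model — the class and fields are hypothetical, and the real coordination in InnoDB's log system is considerably more involved:

```python
import threading

class CheckpointerSketch:
    """Toy model: user threads ask a background thread to advance the
    checkpoint and block until it has (cf. log_request_checkpoint /
    log_consider_sync_flush; names here are illustrative)."""

    def __init__(self):
        self.cond = threading.Condition()
        self.requested_lsn = 0
        self.checkpoint_lsn = 0

    def request_checkpoint(self, lsn):
        # Called by a user thread: record the target LSN, wake the
        # background thread, and wait until the checkpoint reaches it.
        with self.cond:
            self.requested_lsn = max(self.requested_lsn, lsn)
            self.cond.notify_all()
            self.cond.wait_for(lambda: self.checkpoint_lsn >= lsn)

    def checkpointer_loop_once(self):
        # Background thread: wait for a request, "flush" up to the
        # requested LSN, then wake all waiting user threads.
        with self.cond:
            self.cond.wait_for(lambda: self.requested_lsn > self.checkpoint_lsn)
            self.checkpoint_lsn = self.requested_lsn  # pretend we flushed
            self.cond.notify_all()
```

The key point of the design is that user threads no longer flush themselves; they only publish a target LSN and sleep, which centralizes the flushing decision in one background thread.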
8.0 also added a new parameter, innodb_fsync_threshold. When a file is created, its size has to be set up front, and if the server hosts multiple running instances, the resulting burst of writes may significantly disturb the others. To solve this, starting from 8.0.13 a threshold was introduced into the function os_file_set_size (usually called when creating or truncating files): an fsync is issued for every threshold bytes written, avoiding an impact on the rest of the system. This patch was contributed upstream by Facebook.
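The behaviour can be sketched as follows — a periodic fsync while extending a file. The helper itself is hypothetical, not the InnoDB implementation:

```python
import os

def set_file_size(path, size, fsync_threshold, chunk=1024 * 1024):
    """Extend a file to `size` bytes of zeros, fsyncing every
    `fsync_threshold` bytes written -- a simplified sketch of the
    throttling added to os_file_set_size in 8.0.13."""
    fsyncs = 0
    written_since_fsync = 0
    with open(path, "wb") as f:
        remaining = size
        buf = b"\0" * chunk
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(buf[:n])
            remaining -= n
            written_since_fsync += n
            if fsync_threshold and written_since_fsync >= fsync_threshold:
                f.flush()
                os.fsync(f.fileno())   # bound the amount of unsynced data
                written_since_fsync = 0
                fsyncs += 1
    return fsyncs

# e.g. extending a file to 8 MB with a 2 MB threshold issues 4 fsyncs
```

Spacing the fsyncs out keeps the page cache from accumulating a huge backlog of dirty file data that would otherwise be flushed in one disruptive burst.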
Of course, there are also some auxiliary structures for locating pages in the buffer pool quickly:
- Adaptive hash index: indexes the records on leaf pages directly. When certain conditions are met, a record can be located on its leaf page without descending from the root, reducing the number of pages read.
- Page hash: each buffer pool instance maintains an auxiliary page hash for fast access to resident pages; reads take an S lock and writing a new page takes an X lock. The page hash is partitioned, with 16 partitions by default, controlled by the innodb_page_hash_locks parameter. Unfortunately, in the current code this parameter is debug-only; if you want to configure it, you need to modify the code slightly and move the parameter definition out of the debug macro.
- Change buffer: when a secondary index page is not in memory, the modification can be cached in a btree (ibuf) inside ibdata; the next time the page is read, the cached changes are merged into it. The background master thread also periodically tries to merge the ibuf.
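The partitioned page hash mentioned above is a standard lock-striping scheme; a minimal sketch, with illustrative names (real InnoDB uses rw-locks rather than plain mutexes):

```python
import threading

class PartitionedPageHash:
    """Simplified sketch of a partitioned page hash: the page id picks
    one of N partitions, each guarded by its own lock, so lookups on
    different partitions do not contend (illustrative, not InnoDB code)."""

    def __init__(self, n_partitions=16):     # 16 partitions by default
        self.n = n_partitions
        self.locks = [threading.Lock() for _ in range(n_partitions)]
        self.tables = [{} for _ in range(n_partitions)]

    def _part(self, page_id):
        return page_id % self.n

    def insert(self, page_id, frame):
        p = self._part(page_id)
        with self.locks[p]:   # "X lock", but only on one partition
            self.tables[p][page_id] = frame

    def lookup(self, page_id):
        p = self._part(page_id)
        with self.locks[p]:   # "S lock" (a plain Lock here for brevity)
            return self.tables[p].get(page_id)
```

With 16 partitions, two threads touching different pages usually hash to different partitions and never block each other, which is the whole point of splitting the hash.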
Finally, I hear that the MySQL team is working on the doublewrite buffer bottleneck. Looking forward to it.
This is original content from the Yunqi community and may not be reproduced without permission.