In-depth analysis of the disk caching mechanism and SSD write amplification under Linux


Some time ago, while developing a system that used SSDs as a cache, I noticed that a large amount of data accumulated in the disk cache during high-speed writes. If too much cached data has not been written to disk in time, a machine failure can cause serious data loss. On the other hand, flushing every write to disk immediately makes writing far too slow. To understand how Linux handles disk writes, I recently studied the mechanism in depth.
The VFS (virtual file system) layer lets Linux work with many different file systems, such as ext3, ext4, XFS and NTFS. Besides providing a common interface to all file systems, it plays another role that matters for system performance: caching. VFS introduces the disk cache, a software mechanism that lets the kernel keep in RAM some information that normally lives on disk, so that further accesses to that data can be served quickly without going to the slow disk itself. Disk caches fall roughly into the following three types:
Directory entry (dentry) cache: stores the dentry objects that describe file system pathnames.
Inode cache: stores the inode objects that describe disk inodes.
Page cache: stores whole pages of data. Each page holds data belonging to a file, and virtually all file read and write operations go through the page cache. It is the main disk cache used by the Linux kernel.
Because of this cache, file data is written with delayed writes. Unless a synchronous write mode is used when calling the write system calls, most data is first stored in the cache and is flushed to disk later, when certain conditions are met.
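The difference between a delayed write and a synchronous one can be seen from user space. The following is a minimal sketch in Python, not kernel code; the helper name durable_write is hypothetical. A plain os.write() normally returns as soon as the data is in the page cache, while os.fsync() blocks until the kernel has pushed that file's dirty pages to the device.

```python
import os


def durable_write(path: str, data: bytes) -> int:
    """Write data to path and return only after it has been flushed to disk.

    Hypothetical helper for illustration: os.write() alone normally leaves
    the data in the page cache; os.fsync() forces it out to the device.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        view = memoryview(data)
        while written < len(data):
            # os.write() may write fewer bytes than requested, so loop.
            written += os.write(fd, view[written:])
        os.fsync(fd)  # block until the kernel has written the dirty pages
        return written
    finally:
        os.close(fd)
```

Calling fsync after every write gives the safety described above at the cost of throughput, which is exactly the trade-off that motivated this study.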

How does the kernel flush data to disk? The following two points answer this question.

1. Write dirty pages to disk
As we know, the kernel keeps filling the page cache with pages containing block device data. Whenever a process modifies some data, the corresponding page is marked as dirty, that is, its PG_dirty flag is set.
Unix systems allow the writing of dirty buffers to block devices to be delayed, because this strategy can significantly improve system performance. Several writes to a page in the cache may be satisfied by a single slow physical update of the corresponding disk blocks. In addition, write operations are less urgent than read operations: a process is usually not suspended because of a delayed write, whereas it is most often suspended because of a delayed read. Thanks to delayed writes, a physical block device ends up servicing many more read requests than write requests.
A dirty page might stay in main memory until the last possible moment, that is, until system shutdown. However, pushing the delayed-write strategy to this limit has two major drawbacks:
First, if a hardware error or power failure occurs, the contents of RAM can no longer be retrieved, so many modifications made to files since the system started are lost.
Second, the page cache (and hence the RAM required to hold it) would have to be huge, at least as large as the block devices being accessed.
Therefore, dirty pages are flushed (written) to disk under the following conditions:
The page cache becomes too full and more pages are needed, or the number of dirty pages becomes too large.
Too much time has elapsed since a page became dirty.
A process requests that all pending changes to a block device or to a particular file be flushed, by invoking the sync(), fsync() or fdatasync() system call.
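To watch these conditions at work, the amount of dirty data currently held in memory can be read from /proc/meminfo, which reports it in the Dirty and Writeback fields (in kB). The sketch below only parses that text; the function names are hypothetical, and the parser takes the file contents as a string so it can be tried on any sample.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def dirty_summary(text: str) -> tuple:
    """Return (dirty_kb, writeback_kb); missing fields count as 0."""
    info = parse_meminfo(text)
    return info.get("Dirty", 0), info.get("Writeback", 0)


def read_dirty_summary(path: str = "/proc/meminfo") -> tuple:
    """Read the live values on a Linux machine."""
    with open(path) as f:
        return dirty_summary(f.read())
```

Running read_dirty_summary() repeatedly during a heavy write shows Dirty growing and then dropping as the kernel writes pages back.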
Buffer pages make the matter more complicated. The buffer heads associated with each buffer page let the kernel track the status of each individual block buffer. The PG_dirty flag of a buffer page should be set if at least one of its buffer heads has the BH_Dirty flag set. When the kernel selects a buffer page for flushing, it scans the associated buffer heads and writes only the contents of the dirty blocks to disk. Once the kernel has flushed all the dirty blocks of a buffer page to disk, it clears the page's PG_dirty flag.
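The relationship between the per-block dirty state and the page-level flag can be illustrated with a toy model. This is not kernel code: the class name BufferPage and the choice of four blocks per page are assumptions made only for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BufferPage:
    """Toy model of a buffer page: one dirty bit per block buffer.

    Four blocks per page is an assumption for illustration only.
    """
    block_dirty: List[bool] = field(default_factory=lambda: [False] * 4)

    @property
    def page_dirty(self) -> bool:
        # The page counts as dirty as soon as any of its buffers is dirty.
        return any(self.block_dirty)

    def mark_block_dirty(self, index: int) -> None:
        self.block_dirty[index] = True

    def flush(self) -> List[int]:
        """Write back only the dirty blocks and return their indices."""
        written = [i for i, dirty in enumerate(self.block_dirty) if dirty]
        for i in written:
            self.block_dirty[i] = False  # the block is clean once written
        return written
```

After flush() returns, every block is clean, so page_dirty becomes False again, which mirrors the clearing of the page flag described above.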

2. The pdflush kernel threads
Earlier versions of Linux used the bdflush kernel thread to systematically scan the page cache for dirty pages to flush, and a second kernel thread, kupdate, to ensure that no page stayed dirty for too long. Linux 2.6 replaced both with a group of general-purpose kernel threads called pdflush.
These kernel threads have a flexible structure. Each acts on two parameters: a pointer to the function to be executed by the thread and an argument for that function. The number of pdflush kernel threads is adjusted dynamically: new threads are created when there are too few, and threads are killed when there are too many. Because the functions executed by these threads can block, having several pdflush threads rather than a single one improves system performance.
The creation and termination of pdflush threads follow these rules:
There must be at least two and at most eight pdflush kernel threads.
If there has been no idle pdflush thread during the last second, a new pdflush thread should be created.
If more than one second has elapsed since a pdflush thread last became idle, a pdflush thread should be removed.
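The three rules above can be written down as a small decision function. This is a simplified model of the policy, not the kernel's implementation; the function name and both time parameters are hypothetical inputs chosen for the example.

```python
MIN_PDFLUSH_THREADS = 2
MAX_PDFLUSH_THREADS = 8
IDLE_WINDOW_SECONDS = 1.0


def adjust_pool(num_threads: int,
                seconds_without_idle_thread: float,
                seconds_since_thread_went_idle: float) -> int:
    """Apply the pool rules once and return the new thread count.

    seconds_without_idle_thread: how long every thread has been busy
        (0 if some thread is idle right now).
    seconds_since_thread_went_idle: time since a thread last became idle.
    """
    if seconds_without_idle_thread >= IDLE_WINDOW_SECONDS:
        # No idle thread for a whole second: grow, but never above the maximum.
        return min(num_threads + 1, MAX_PDFLUSH_THREADS)
    if seconds_since_thread_went_idle > IDLE_WINDOW_SECONDS:
        # A thread has been idle for over a second: shrink, never below the minimum.
        return max(num_threads - 1, MIN_PDFLUSH_THREADS)
    return num_threads
```

For example, adjust_pool(2, 1.5, 0.0) grows the pool to three threads, while adjust_pool(2, 0.0, 3.0) leaves it at the minimum of two.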
Each pdflush kernel thread has a pdflush_work descriptor, whose data structure is as follows:

(Figure: fields of the pdflush_work descriptor)

When there are no dirty pages to flush, a pdflush thread sleeps until it is woken up by the pdflush_operation() function. What work does a pdflush kernel thread do once it is awake? Part of it is flushing dirty data. In particular, pdflush usually executes one of the following callback functions:
    1. background_writeout(): systematically scans the page cache for dirty pages to flush.
To find the dirty pages that need flushing, the kernel has to search exhaustively all the address_space objects (which are search trees) that correspond to inodes with an image on disk. Since the page cache may contain a huge number of pages, scanning the whole cache in a single execution flow would keep the CPU and the disk busy for a long time. Linux therefore uses a fairly complex mechanism that splits the scan of the page cache into several execution flows. The wakeup_bdflush() function is invoked when memory is scarce or when a user explicitly requests a flush (for example, a user-mode process issues the sync() system call). wakeup_bdflush() calls pdflush_operation() to wake up a pdflush kernel thread and delegate to it the execution of the background_writeout() callback. background_writeout() retrieves the specified number of dirty pages from the page cache and writes them back to disk. In addition, a pdflush kernel thread executing background_writeout() is woken up whenever a process modifies the contents of a page in the page cache and thereby pushes the fraction of dirty pages above a dirty background threshold. The background threshold is typically set to 10% of all pages in the system, and can be adjusted by writing to the file /proc/sys/vm/dirty_background_ratio.
    2. wb_kupdate(): checks whether any pages in the page cache have been dirty for a long time, so that no page starves by going unflushed for too long.
During initialization the kernel sets up the wb_timer dynamic timer, whose interval is the number of hundredths of a second specified in the dirty_writeback_centisecs file (usually 500, that is, 5 seconds; the value can be changed by writing to /proc/sys/vm/dirty_writeback_centisecs). The timer function calls pdflush_operation(), passing it the address of the wb_kupdate() function. wb_kupdate() walks the page cache looking for old dirty inodes, writes to disk every page that has been dirty for more than 30 seconds, and then resets the timer.
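The two triggers described above come down to simple arithmetic, sketched below. The defaults (10%, 500 centiseconds, 30 seconds) are the values quoted in the text; the function names are hypothetical and the code only models the decisions, it does not read or change any kernel setting.

```python
def background_threshold_pages(total_pages: int, background_ratio: int = 10) -> int:
    """Dirty-page count above which background writeout is triggered."""
    return total_pages * background_ratio // 100


def should_start_background_writeout(dirty_pages: int, total_pages: int,
                                     background_ratio: int = 10) -> bool:
    """True once the dirty pages exceed the background threshold."""
    return dirty_pages > background_threshold_pages(total_pages, background_ratio)


def writeback_interval_seconds(centisecs: int = 500) -> float:
    """Convert a dirty_writeback_centisecs value to seconds."""
    return centisecs / 100.0


def pages_for_kupdate(dirtied_at: dict, now: float,
                      expire_seconds: float = 30.0) -> list:
    """Return the pages that have been dirty for longer than expire_seconds.

    dirtied_at maps a page number to the time (in seconds) it became dirty.
    """
    return sorted(page for page, t in dirtied_at.items()
                  if now - t > expire_seconds)
```

With one million pages in the system, the background threshold is 100,000 dirty pages, and with the default interval the periodic check runs every 5 seconds.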

PS: About SSD write amplification

SSDs are increasingly used as server disks. Designing and implementing a cache system that stores data blocks on an SSD (solid-state drive) raises some problems. For example, once the disk is full, if the least recently used data blocks are evicted and new data keeps being written, write speed gradually becomes much lower than it was at the beginning. To find out why, we looked up information about SSDs. It turns out that this behavior is determined by the hardware design of the SSD itself and eventually shows up at the application level. The phenomenon is called write amplification (WA), a very important property of flash memory and SSDs. The term was first used by Intel and SiliconSystems (acquired by Western Digital in 2009) in their papers and publications in 2008. Here is a brief explanation of why it happens and how it works.
An SSD is designed completely differently from a traditional mechanical disk: it is a purely electronic device with no read/write head. Because no seek between tracks is needed when reading or writing data, an SSD can deliver high IOPS. With no head to move, an SSD also consumes less power, which is a real benefit for companies running data centers.
Compared with traditional disks, SSDs have great performance advantages, but everything has two sides, and SSDs have their own problems. Data already written to an SSD cannot be updated in place; it can only be overwritten, and the old data must be erased before the rewrite. Moreover, an erase cannot be performed on a single sector (page), only on a whole block. Before a block is erased, the valid data it still holds must be read out and then written back together with the new data. These repeated operations increase the amount of data written, shorten the life of the flash memory, and consume flash bandwidth, which indirectly hurts random write performance.
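Write amplification is usually expressed as the amount of data physically written to flash divided by the amount of data the host asked to write. The sketch below applies that ratio to the read-erase-write cycle just described, for a single block. It is a deliberately simplified model: the function name and the page counts in the examples are assumptions, and real SSD controllers are far more elaborate.

```python
def write_amplification(pages_per_block: int, valid_pages: int,
                        new_pages: int) -> float:
    """Write amplification for rewriting new_pages pages inside one block.

    valid_pages is the number of other pages in the block that still hold
    valid data and must be copied back after the erase.
    """
    if new_pages <= 0:
        raise ValueError("new_pages must be positive")
    if valid_pages < 0 or valid_pages + new_pages > pages_per_block:
        raise ValueError("valid_pages and new_pages must fit in one block")
    host_writes = new_pages                  # what the application asked for
    flash_writes = valid_pages + new_pages   # what the flash actually absorbs
    return flash_writes / host_writes
```

Updating a single page in a 128-page block whose other 127 pages are valid gives a ratio of 128, while filling an empty block with 128 new pages gives 1, which is why a nearly full drive receiving small random writes slows down so much.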

Reducing write amplification
In practice, write amplification on SSDs is hard to eliminate completely; we can only reduce it. A very simple way is to use only part of the capacity of a large SSD, for example only 64GB of a 128GB drive. In the worst case, this can reduce write amplification by a factor of about three. Of course, this wastes some capacity. Another way is to write data sequentially: with sequential writes, write amplification is generally 1, although other factors can affect the value.
Beyond these methods, TRIM is currently regarded as the better approach. TRIM works at the operating system level: the operating system uses the TRIM command to tell the SSD that the data in certain pages is no longer needed and can be reclaimed. The main difference in an operating system that supports TRIM lies in how deletion is handled. In the era of mechanical disks, deleting a page only marked it as available in the file system's records; the data itself was not removed. With an SSD and an operating system that supports TRIM, deleting a page also informs the SSD that the data in that page is no longer needed. The SSD runs a garbage collection process when it is idle, gathering pages that are no longer needed and erasing them together. As a result, each write operation can place new data on pages that have already been erased.
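The benefit of TRIM for garbage collection can be shown with a toy calculation. Without TRIM, the SSD has no way of knowing that pages belonging to deleted files are stale, so it copies them along with the live data before erasing a block. The model and function names below are assumptions made for illustration, not a description of any particular controller.

```python
def gc_pages_to_copy(live_pages: int, deleted_pages: int, trimmed: bool) -> int:
    """Pages the SSD must copy elsewhere before it can erase a block.

    live_pages: pages the file system still uses.
    deleted_pages: pages the file system has freed.
    trimmed: whether the SSD was told about the freed pages via TRIM.
    """
    if live_pages < 0 or deleted_pages < 0:
        raise ValueError("page counts must be non-negative")
    # Without TRIM the deleted pages still look valid to the SSD.
    return live_pages if trimmed else live_pages + deleted_pages


def freed_pages(pages_per_block: int, live_pages: int,
                deleted_pages: int, trimmed: bool) -> int:
    """Pages that become available for new writes after collecting the block."""
    copied = gc_pages_to_copy(live_pages, deleted_pages, trimmed)
    if copied > pages_per_block:
        raise ValueError("block cannot hold that many pages")
    return pages_per_block - copied
```

For a 128-page block with 40 live and 60 deleted pages, garbage collection copies 100 pages and frees 28 without TRIM, but copies only 40 and frees 88 with it.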

Despite write amplification, there is no reason to avoid SSDs. They have been used for cache acceleration in many projects, especially database caching, where their fast reads are put to full use. With the release of Facebook’s open source project flashcache and its extensive use inside Facebook, flashcache has become a relatively mature solution, which has led more companies to choose SSDs for storage or caching.