Things you don’t know

Time:2022-3-24

Since Samsung became the first major vendor to enter the SSD market in 2005, SSDs have become a very common storage medium in just 15 years. Compared with mechanical hard disk drives (HDDs), SSDs deliver hundreds of times the IOPS and bandwidth, and NVMe SSDs improve on ordinary SATA SSDs by nearly another ten times. For most people, though, moving to a SATA or NVMe SSD is just a change of media and performance. Ordinary users, and even IT engineers, tend to assume that as long as SSDs are used, the storage system will access data hundreds of times faster. Is this really the case? The question is much like asking whether a car becomes fast as soon as a Ferrari engine is installed. Probably only Ferrari engineers know how much drag a given change to the body adds, and how many hundredths of a second it costs.

This article tries to explain, as simply as possible, the things about SSDs that you want to know but don’t.

SSD background you should know

SSD cell, page and block. An SSD has two important components: the storage cells and the controller. We will talk about the controller later; let’s start with the cells. Today’s mainstream SSDs store data in NAND flash cells (newer SSDs use 3D NAND). Each cell can store 1 bit (SLC), 2 bits (MLC), 3 bits (TLC) or even 4 bits (QLC). The more bits a cell stores, the higher the density and the lower the manufacturing cost, but the lower its endurance (its service life, measured in erase cycles). The smallest unit an SSD reads or writes is not a cell but a page made up of a group of cells; a typical page size is 4KB. SSDs also differ from magnetic disks in one important way: a magnetic disk can overwrite data in place, whereas an SSD must erase previously written cells before they can be written again. The smallest unit of erasure is neither a cell nor a page but a block made up of many pages. A typical block is 512KB or 1MB, that is, 128 or 256 pages. The other behaviors of SSDs, and the ways storage systems optimize for them, all follow from these basic characteristics.
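To make the page and block arithmetic concrete, here is a minimal Python sketch. It only uses the typical sizes quoted above (4KB pages, 512KB or 1MB blocks); the function names are our own for illustration, and real devices may use different geometries.

```python
# Typical sizes quoted in the text; real devices vary.
PAGE_SIZE = 4 * 1024  # bytes


def pages_per_block(block_size: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages that are erased together when one block is erased."""
    return block_size // page_size


def pages_touched(offset: int, length: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages covered by the byte range [offset, offset + length)."""
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


print(pages_per_block(512 * 1024))   # 128
print(pages_per_block(1024 * 1024))  # 256
print(pages_touched(0, 1))           # 1: even a 1-byte write occupies a whole page
print(pages_touched(4000, 200))      # 2: the range crosses a page boundary
```

The last two lines show why the page is the unit that matters: a write smaller than a page still costs a page, and an unaligned write can cost two.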

Data manipulation and garbage collection (GC). Data operations include reads and writes. Read latency is relatively stable, while write latency varies with how the disk has been used; under normal circumstances, latency is tens of microseconds. Compared with a mechanical hard disk, an SSD has one extra operation, erasure, which works on whole blocks as mentioned above. Garbage collection reclaims blocks that have been written but whose data is no longer valid. The SSD controller keeps a threshold for the number of free blocks; when the number of free blocks falls below it, garbage collection starts.
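The threshold rule can be sketched in a few lines of Python. This is a toy model, not how any particular controller works: the Block class and both function names are hypothetical, and choosing the block with the fewest valid pages as the victim is a common greedy policy that we assume here, not something the text specifies.

```python
from dataclasses import dataclass


@dataclass
class Block:
    valid_pages: int = 0    # pages still holding live data
    invalid_pages: int = 0  # pages whose data is stale

    @property
    def is_free(self) -> bool:
        return self.valid_pages == 0 and self.invalid_pages == 0


def needs_gc(blocks, free_threshold: int) -> bool:
    """GC starts when the number of free blocks drops below the threshold."""
    free = sum(1 for b in blocks if b.is_free)
    return free < free_threshold


def pick_victim(blocks):
    """Assumed greedy policy: reclaim the block with the fewest valid pages."""
    candidates = [b for b in blocks if b.invalid_pages > 0]
    return min(candidates, key=lambda b: b.valid_pages, default=None)


blocks = [Block(), Block(100, 28), Block(10, 118), Block(128, 0)]
if needs_gc(blocks, free_threshold=2):
    victim = pick_victim(blocks)
    print(f"copy {victim.valid_pages} valid pages, then erase the block")
```

The fewer valid pages the victim holds, the less data has to be copied before the erase, which is why the fullness of the disk matters so much later in this article.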

Wear leveling and write amplification. An SSD block can only be erased a limited number of times, counted in program/erase (P/E) cycles. The more frequent the writes, the more frequent the erases. Once a block reaches its maximum number of P/E cycles, it can no longer be written. For SLC the limit is usually about 100,000 cycles, for MLC about 10,000, and for TLC a few thousand. To preserve usable capacity and write latency, the SSD controller has to even out the erase counts across all blocks. This is one of its core tasks and is known as wear leveling. During wear leveling, valid data is moved between blocks and the old blocks are then erased. Because these internal moves are extra writes on top of what the host requested, the amount of data physically written to flash is usually larger than the amount the application wrote. This effect is called write amplification (WA).
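Write amplification is usually expressed as a ratio, which the following Python sketch computes. The lifetime estimate is a rough upper bound under the assumption of perfect wear leveling; the function names and the sample numbers (a 1TB drive, 3000 P/E cycles, 100GB written by the host) are illustrative assumptions, not measurements.

```python
def write_amplification(host_bytes: int, flash_bytes: int) -> float:
    """Ratio of bytes physically written to flash to bytes written by the host."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return flash_bytes / host_bytes


def lifetime_host_bytes(capacity_bytes: int, pe_cycles: int, wa: float) -> float:
    """Rough upper bound on host writes before wear-out.

    Assumes perfect wear leveling, so every block reaches pe_cycles together.
    """
    return capacity_bytes * pe_cycles / wa


GB = 10**9
wa = write_amplification(host_bytes=100 * GB, flash_bytes=250 * GB)
print(wa)  # 2.5

# Hypothetical 1TB drive rated for 3000 P/E cycles.
print(lifetime_host_bytes(1000 * GB, 3000, wa) / 10**12, "TB of host writes")
```

Halving the write amplification doubles the amount of data the host can write over the life of the drive, which is why the optimizations below pay attention to it.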

SSD controller. By now it should be clear that an SSD is far from simply picking a cell to write data into and reading it back when needed. Read/write addressing, moving pages inside the SSD, erasing blocks, controlling write amplification and leveling wear are all done by the SSD controller. One point may be confusing: if data is moved from its original page to a new location and the old page is erased, how does the upper-layer program find the new address? That is the controller’s job, and much of this logic is even built into hardware. One example is the translation between logical (virtual) and physical addresses: the upper-layer application addresses data by logical address, so changes to the underlying physical address do not affect it at all. These are circuit-level operations with latencies of microseconds or even nanoseconds.
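The address translation idea can be illustrated with a toy mapping table in Python. This is only a sketch under simplifying assumptions: ToyFTL and its fields are hypothetical names, pages are handed out sequentially, and there is no garbage collection or wear leveling. Real controllers do this in firmware and hardware.

```python
class ToyFTL:
    """Toy logical-to-physical page map; not a real controller design."""

    def __init__(self, num_physical_pages: int):
        self.l2p = {}                             # logical page -> physical page
        self.flash = [None] * num_physical_pages  # simulated flash pages
        self.invalid = set()                      # physical pages holding stale data
        self.next_free = 0                        # next never-written physical page

    def write(self, lpn: int, data: bytes) -> None:
        if self.next_free >= len(self.flash):
            raise RuntimeError("no free pages; a real controller would run GC")
        old = self.l2p.get(lpn)
        if old is not None:
            self.invalid.add(old)  # out-of-place update: the old copy becomes stale
        ppn = self.next_free
        self.next_free += 1
        self.flash[ppn] = data
        self.l2p[lpn] = ppn        # the logical address now points at the new page

    def read(self, lpn: int) -> bytes:
        return self.flash[self.l2p[lpn]]


ftl = ToyFTL(num_physical_pages=8)
ftl.write(5, b"v1")
ftl.write(5, b"v2")          # same logical page, new physical page
print(ftl.read(5))           # b'v2'
print(ftl.l2p[5], ftl.invalid)
```

The caller always reads logical page 5 and never sees that the data moved from physical page 0 to physical page 1; the stale page is simply left for garbage collection.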

Storage system optimization for SSDs

Because SSDs differ from HDDs in these ways, storage systems should also be optimized for SSDs. The benefits show up in several areas, including better performance, more efficient use of the SSD and a longer SSD life. Here are some common optimization methods for SSD-based storage systems.

Maximize local SSD performance

In the HDD era, HDD latency was at the millisecond level, large enough to mask the impact of network latency. As long as the network protocol and network interaction were reasonably optimized, applications could access remote HDDs without a noticeable penalty. SSD latency, however, is at the microsecond level. Unless a very low-latency, high-performance network is used, network latency noticeably slows access to data on a remote SSD. A distributed storage system therefore has to distribute data across nodes while still making full use of local SSDs; in other words, the data placement strategy is a trade-off.

For this reason, in the yrcloudfile distributed file system we combine the metadata placement algorithm and strategy so that part of the metadata is accessed on local SSDs. Intelligent caching will also be introduced to keep large amounts of hot data on designated local SSD devices and further reduce access latency.

Control SSD capacity usage

How full an SSD is affects both the write amplification factor and the write performance lost to GC.

During GC, blocks are erased to create free blocks. Before a block can be erased, the pages in it that still hold valid data must be moved elsewhere. Creating one free block may therefore require compacting and moving pages from several blocks; how many depends on how full the blocks are.

Suppose a fraction A of the SSD capacity is in use. As a rule of thumb, freeing one block requires moving and compacting 1/(1 - A) blocks. Clearly, the higher the utilization, the more blocks must be moved to free one, which consumes more resources and lengthens IO wait times. For example, at A = 50% only 2 blocks are compacted to free one block; at A = 80% about 5 blocks of data are moved to free one. Counted in pages, the amount of data involved is even more striking. If each block has P pages, of which a fraction A hold valid data, then P × A/(1 - A) pages must be copied per garbage collection. With 128 pages per block, that is 128 pages at A = 50%, 512 pages at A = 80% and 2432 pages at A = 95%, as shown in the figure below:

Relationship between SSD capacity usage and the number of pages moved by GC

Controlling how full an SSD gets therefore has practical benefits for GC efficiency, disk life and the write latency seen by upper-layer applications.
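The rule-of-thumb estimate in this section can be checked with a short Python sketch. It assumes the simple model described here, in which a fraction A of every block holds valid data; the function names are our own.

```python
def blocks_compacted(a: float) -> float:
    """Blocks that must be compacted to free one block at utilization a."""
    if not 0 <= a < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1 / (1 - a)


def pages_copied(a: float, pages_per_block: int = 128) -> float:
    """Valid pages copied to free one block: P * A / (1 - A)."""
    return pages_per_block * a * blocks_compacted(a)


for a in (0.5, 0.8, 0.95):
    print(f"{a:.0%} full: {blocks_compacted(a):.0f} blocks, "
          f"{pages_copied(a):.0f} pages copied")
```

The cost grows sharply as utilization approaches 100%: going from 80% to 95% full raises the copy work from 512 to 2432 pages per freed block in this model.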

Use multithreading for small IO access

An SSD has several levels of internal parallelism, including channel, package, chip and plane. A single IO thread cannot exploit all of them, so issuing small IOs from one thread leads to longer overall access latency. Issuing them concurrently from multiple threads takes advantage of this internal parallelism. Native command queuing support on the SSD lets it distribute read and write operations effectively across multiple channels, improving internal IO concurrency. Upper-layer applications and storage systems should therefore issue small IOs concurrently wherever possible, which greatly benefits read and write performance. If multithreading is hard to achieve within a single application, consider having multiple applications access data concurrently so that the SSD’s parallelism is still fully used.

For example, we ran an application that issues 10KB write IOs; the result is shown in the figure below. With one IO thread it reaches 115MB/s. Two threads roughly double the throughput, and 4 threads double it again. With 8 threads, about 500MB/s can be achieved.
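As a rough illustration of issuing small IOs concurrently, the Python sketch below writes 10KB chunks from a thread pool. It is not the test program used in this article: the function names are our own, it relies on os.pwrite (available on Unix-like systems), and it writes through the page cache, so a real benchmark would need direct IO or a dedicated tool to measure the device itself.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

IO_SIZE = 10 * 1024  # 10KB per write


def write_chunk(fd: int, index: int, payload: bytes) -> int:
    # os.pwrite takes an explicit offset, so threads do not share a file position.
    return os.pwrite(fd, payload, index * len(payload))


def concurrent_write(path: str, num_ios: int, num_threads: int,
                     io_size: int = IO_SIZE) -> int:
    """Issue num_ios writes of io_size bytes from num_threads threads."""
    payload = b"\xab" * io_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            written = sum(pool.map(lambda i: write_chunk(fd, i, payload),
                                   range(num_ios)))
        os.fsync(fd)  # flush cached data to the device before returning
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        total = concurrent_write(os.path.join(tmp, "data.bin"),
                                 num_ios=64, num_threads=8)
        print(f"wrote {total} bytes")
```

Varying num_threads while timing the call is one simple way to see whether a given device and file system benefit from more concurrent small IOs.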

So how small is a “small” IO? The boundary is generally taken to be the IO size at which a single request can already make full use of the SSD’s internal parallelism. For example, if the SSD page size is 4KB and the SSD supports a parallelism of 16, the threshold is 4KB × 16 = 64KB.
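The threshold is a simple product, sketched below in Python. The function names are hypothetical, and treating an IO exactly at the threshold as "not small" is our assumption; the page size and parallelism of a real SSD have to come from the vendor or from measurement.

```python
def small_io_threshold(page_size: int, parallelism: int) -> int:
    """IO size at which one request can already span all parallel units."""
    return page_size * parallelism


def is_small_io(io_size: int, page_size: int = 4 * 1024,
                parallelism: int = 16) -> bool:
    """Assumption: an IO below the threshold counts as small."""
    return io_size < small_io_threshold(page_size, parallelism)


print(small_io_threshold(4 * 1024, 16) // 1024, "KB")  # 64 KB
print(is_small_io(10 * 1024))    # True: a 10KB write should be issued concurrently
print(is_small_io(128 * 1024))   # False
```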

Summary

SSDs are now widely used in storage systems. In general, a storage system built on SSDs performs better than one built on HDDs. Without targeted optimization, however, simply treating an SSD as an ordinary storage device cannot bring out its full performance, especially with NVMe. The reason is that SSDs work quite differently from HDDs and have different access characteristics. To fully realize the performance advantages of SSDs, modern storage systems, and distributed storage systems in particular, need to be optimized for them.

We hope this article has helped you understand how SSDs work and the basic ways to optimize for them. In future articles we will share more methods and practices for programming against SSDs and improving performance.