Linux Storage Stack Diagram

Time: 2020-5-7

Nothing beats a good diagram in work and study (even though many people mock slide decks), especially an accurate one. To draw a diagram of a structure or a process you have to understand it thoroughly, and to explain someone else's diagram you need at least a basic grasp of it. Neither is easy; at least it is not for me. For the Linux storage architecture there is a very precise diagram, the "Linux Storage Stack Diagram". It is an excellent summary: every module involved in storage is laid out, so a learner can get a clear picture of a complex system. This article gives a brief introduction to each part of the diagram without going into the concrete implementations.

https://www.thomas-krenn.com/…

Linux Storage Stack Diagram

Colors are used to distinguish different components.

  • Sky blue: hardware storage devices.
  • Orange: transport protocols.
  • Blue: device files in the Linux system.
  • Yellow: I/O scheduling policies.
  • Green: Linux file systems.
  • Blue-green: bio, the basic data structure of Linux storage operations.

File system

VFS

VFS is the virtual file system layer provided by the Linux kernel. It offers users a set of standard system calls for operating on files, such as open(), read(), and write(), so that users do not have to care about the underlying file system or storage medium. At the same time, VFS constrains the underlying file systems by defining the unified abstract interface and operating model they must implement.
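As a small illustration (a sketch, not something taken from the diagram), the same open()/read()/write() calls work no matter which file system backs the path; here the file happens to live on the proc pseudo file system:

/* A minimal sketch: the same VFS system calls work regardless of the
 * underlying file system (ext4, NFS, tmpfs, proc, ...). The path below
 * is just an example. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[256];
    int fd = open("/proc/version", O_RDONLY);    /* a file on a pseudo file system */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    ssize_t n = read(fd, buf, sizeof(buf) - 1);  /* VFS dispatches to the proc read handler */
    if (n > 0) {
        buf[n] = '\0';
        printf("%s", buf);
    }
    close(fd);
    return 0;
}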

Underlying file system

Linux supports many file systems, which can be roughly divided into the following categories.

  • Disk file systems: file systems based on a physical storage device, used to manage the device's storage space, such as ext2, ext4, and XFS.
  • Network file systems: used to access files on other devices across the network, such as NFS and smbfs. The target of a network file system is a network device, so it does not go through the block layer.
  • Stacked file systems: file systems layered on top of other file systems. They do not store data themselves but extend the underlying files, such as eCryptfs and Wrapfs.
  • Pseudo file systems: called pseudo because they do not manage real storage space. They organize virtual directories and files through which system or hardware data can be accessed; since they wrap access to data rather than store it, a pseudo file system cannot be used as storage space. Examples are proc and sysfs.
  • Special file systems: these are also pseudo file systems, but they behave more like disk file systems, except that they read and write memory rather than a disk device. Examples are tmpfs and ramfs.
  • User-space file systems: better known as FUSE. FUSE gives developers a way to implement a file system in user space without modifying the kernel, which is more flexible but less efficient. FUSE targets the user-space file system directly and does not go through the block layer (a minimal sketch follows this list).
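For a feel of how FUSE looks from the developer's side, here is a minimal read-only file system sketch. It assumes the libfuse 2 high-level API (libfuse 3 changes some callback signatures), and names such as hellofs and /hello are illustrative:

/*
 * A minimal read-only FUSE file system exposing a single file, /hello,
 * whose content comes from user space.
 * Build (illustrative): gcc hellofs.c -o hellofs `pkg-config fuse --cflags --libs`
 * Run (illustrative):   ./hellofs /tmp/mnt   then: cat /tmp/mnt/hello
 */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

static const char *hello_path = "/hello";
static const char *hello_str  = "Hello from user space!\n";

/* report file attributes: "/" is a directory, "/hello" a read-only file */
static int hello_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, hello_path) == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = strlen(hello_str);
    } else {
        return -ENOENT;
    }
    return 0;
}

/* list the root directory */
static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi)
{
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, hello_path + 1, NULL, 0);
    return 0;
}

/* serve reads of /hello from the in-memory string */
static int hello_read(const char *path, char *buf, size_t size, off_t offset,
                      struct fuse_file_info *fi)
{
    size_t len = strlen(hello_str);
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    if ((size_t)offset >= len)
        return 0;
    if (offset + size > len)
        size = len - offset;
    memcpy(buf, hello_str + offset, size);
    return size;
}

static struct fuse_operations hello_oper = {
    .getattr = hello_getattr,
    .readdir = hello_readdir,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &hello_oper, NULL);
}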

Block Layer

The block layer is the middle layer of the Linux storage system, connecting file systems and block devices. It abstracts the read and write requests of the upper file system into bios and, following a scheduling policy, hands those bios down to the devices. The block layer covers the blue-green and yellow parts of the figure together with the bio transfer path between them.

Page cache

When a file is opened, the O_DIRECT flag decides whether the page cache is used. With O_DIRECT, reads and writes bypass the cache and access the block device directly; otherwise they go through the page cache. The main behaviors of the page cache are as follows (a small O_DIRECT sketch follows the list).

  • On a read, if the accessed page is in the page cache (a hit), the page is returned directly.
  • On a read, if the accessed page is not in the page cache (a miss), a page fault is taken; the system allocates a cache page and fills it with the data at the accessed address, and the upper layer's retried read then hits the cache.
  • On a write, if the cache hits, the data is written into the cache page.
  • On a write, if the cache misses, a page fault is taken and the system allocates a cache page; the upper layer's retried write then hits the cache.
  • When a cached page in the page cache is modified, it is marked dirty. Dirty pages are written back to disk when the upper layer calls sync or when the pdflush/writeback process runs.
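A minimal sketch of bypassing the page cache with O_DIRECT, assuming a file named testfile and a 4096-byte logical block size (both illustrative); O_DIRECT requires the buffer, offset, and size to be suitably aligned:

#define _GNU_SOURCE           /* exposes O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) {      /* aligned buffer for direct I/O */
        perror("posix_memalign");
        return 1;
    }
    int fd = open("testfile", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    ssize_t n = read(fd, buf, 4096);             /* goes straight to the block layer */
    printf("read %zd bytes without the page cache\n", n);
    close(fd);
    free(buf);
    return 0;
}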

BIO

A bio represents a read or write request to a block device and is described by a structure in the kernel.

struct bvec_iter {
    sector_t        bi_sector;      /* device address, in sectors (512 bytes) */
    unsigned int    bi_size;        /* size of the data to transfer, in bytes */
    unsigned int    bi_idx;         /* current index into bvl_vec */
    unsigned int    bi_bvec_done;   /* bytes completed in the current bvec */
};

struct bio {
    struct bio          *bi_next;       /* request queue link */
    struct block_device *bi_bdev;       /* points to the target block device */
    int                  bi_error;
    unsigned int         bi_opf;        /* request flags */
    unsigned short       bi_flags;      /* status, command */
    unsigned short       bi_ioprio;
    struct bvec_iter     bi_iter;
    unsigned int         bi_phys_segments; /* number of segments after physical address merging */

    /*
     * To keep track of the max segment size, we account for the
     * sizes of the first and last mergeable segments in this bio.
     */
    unsigned int         bi_seg_front_size; /* size of the first mergeable segment */
    unsigned int         bi_seg_back_size;  /* size of the last mergeable segment */

    atomic_t             __bi_remaining;
    bio_end_io_t        *bi_end_io;     /* completion callback, notifies the submitter when this bio finishes */
    ......
    unsigned short       bi_vcnt;       /* how many bio_vec's */
    unsigned short       bi_max_vecs;   /* maximum number of bvl_vecs */
    atomic_t             __bi_cnt;      /* use count */
    struct bio_vec      *bi_io_vec;     /* pointer to the actual vec list */
    struct bio_set      *bi_pool;
    ......
};

After a bio is built, a transfer request can be created through generic_make_request(), which adds the request to a request queue. The request queue is described in the kernel by struct request_queue, which contains a doubly linked list of requests plus the related control information. Each item on the list is a request, and a request is made up of bios, each of which may contain several segments. Because a single bio can only cover consecutive disk blocks while a request need not be contiguous on disk, one request may contain one or more bios; and although the disk blocks within a bio are contiguous, the corresponding pages may not be contiguous in memory, so a bio may consist of several segments.
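A rough kernel-side sketch of building and submitting a single-page read bio, written against roughly the same kernel era as the structure above (the helpers and field names vary between kernel versions, and my_end_io and submit_one_page_read are made-up names):

#include <linux/bio.h>
#include <linux/blkdev.h>

static void my_end_io(struct bio *bio)
{
    /* called when the request completes; check bio->bi_error here */
    bio_put(bio);
}

static int submit_one_page_read(struct block_device *bdev, struct page *page,
                                sector_t sector)
{
    struct bio *bio = bio_alloc(GFP_KERNEL, 1);   /* room for one bio_vec */

    if (!bio)
        return -ENOMEM;

    bio->bi_bdev = bdev;                          /* target block device */
    bio->bi_iter.bi_sector = sector;              /* start sector on the device */
    bio_add_page(bio, page, PAGE_SIZE, 0);        /* attach the data page */
    bio->bi_end_io = my_end_io;                   /* completion callback */
    bio_set_op_attrs(bio, REQ_OP_READ, 0);        /* mark it as a read */

    submit_bio(bio);                              /* hand the bio to the block layer
                                                     (generic_make_request underneath) */
    return 0;
}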

Scheduler

Once the read and write requests have been organized into a request queue, what remains is actually accessing the disk, and that is the job of I/O scheduling. The disk sectors a bio targets must first be addressed, and addressing means positioning the disk head over a particular block, which is relatively slow. To optimize addressing, the kernel neither simply accepts requests in order nor submits them to the disk immediately; instead it performs the preparatory operations called merging and sorting before submitting, which can greatly improve overall system performance. This is what I/O scheduling does.

The current kernel supports two kinds of I/O scheduler: single-queue and multi-queue. Single-queue is labeled "I/O scheduler" in the figure, and multi-queue is labeled blkmq. Both are schedulers; they differ in how requests are organized.

Single-queue reduces disk addressing time by merging and sorting. Merging combines several contiguous requests into one larger I/O request so the hardware can be used to the fullest. The whole request queue is kept sorted in the direction of increasing sector numbers; the point of the ordering is not only to shorten the addressing time of a single request but to shorten it for all requests, by keeping the disk head sweeping in one direction. The scheduling policies used with single-queue include noop, deadline, and CFQ.

  • noop: the simplest I/O scheduler algorithm. It puts I/O requests into a queue and executes them in order, merging adjacent contiguous requests along the way.
  • deadline: guarantees that an I/O request is served within a certain period of time, so no request is starved.
  • CFQ: the Completely Fair Queuing algorithm. It tries to give every process competing for the block device its own request queue and its own time slice. Within the time slice the scheduler allocates to it, a process can send its read and write requests to the underlying block device; when the time slice is used up, the process's request queue is suspended and waits to be scheduled again.

Earlier kernels only had single-queue. At that time storage devices were mainly HDDs, whose random-addressing performance is very poor, so a single queue was enough to keep up with the device. As SSDs matured, with their excellent random-addressing performance, the bottleneck shifted to the request queue itself; combined with multi-core CPUs, multi-queue was designed. Multi-queue sets up a software queue per CPU core or per socket, which also removes the multi-core lock contention of single-queue. If the storage device supports several parallel hardware dispatch queues, transfer performance improves greatly. The scheduling policies currently supported by multi-queue include mq-deadline, BFQ, and Kyber. The sketch below shows how the active scheduler can be inspected and switched through sysfs.
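A minimal sketch of reading and switching a disk's scheduler through the sysfs interface; sda and deadline are illustrative, switching requires root, and the names listed depend on the kernel and on whether the device uses blkmq:

#include <stdio.h>

int main(void)
{
    const char *path = "/sys/block/sda/queue/scheduler";
    char line[256];

    FILE *f = fopen(path, "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fgets(line, sizeof(line), f))
        printf("available (active in brackets): %s", line);  /* e.g. "noop [deadline] cfq" */
    fclose(f);

    /* switch the active scheduler by writing one of the listed names back */
    f = fopen(path, "w");
    if (f) {
        fputs("deadline", f);
        fclose(f);
    }
    return 0;
}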

Block device

Device files

A device file is the interface through which a Linux system accesses a hardware device: the driver abstracts the hardware device into a device file for applications to use. When a device driver is loaded, it creates a device file node under /dev/; if it is a block device, a symbolic link named after the device number is also created under /dev/block/. In the figure, block devices are divided into the following categories (a small example of talking to a block device file follows the list).

  • Logical devices: shown in the figure as "devices on top of 'normal' block devices". They use the device mapper to remap physical block devices, and this mapping mechanism lets storage resources be managed as needed. Examples include LVM, DM, and bcache.
  • SCSI devices: devices exposed through the SCSI standard device files, including sd (hard disks, e.g. sda) and sr (optical drives).
  • Other block devices: each block device has its own transport protocol. One class represents real hardware devices, such as mmc and nvme; the other represents virtual block devices, such as loop and zram.
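A minimal sketch of using a block device file from user space: querying the device size with the BLKGETSIZE64 ioctl. /dev/sda is an example path and normally requires root to open:

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    unsigned long long bytes = 0;
    int fd = open("/dev/sda", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ioctl(fd, BLKGETSIZE64, &bytes) == 0)    /* size of the whole device in bytes */
        printf("/dev/sda is %llu bytes (%llu GiB)\n", bytes, bytes >> 30);
    close(fd);
    return 0;
}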

Transport protocol

The orange parts of the figure are the technologies behind the block devices, which may be software implementations of a hardware specification or pure software architectures. SCSI and LIO are circled separately in the figure because these two parts are relatively complex. SCSI covers many hardware specifications; the most common case is accessing HDDs and SSDs through libata.

LIO (Linux-IO) is built on the SCSI engine and implements the SCSI target described in the SCSI Architectural Model (SAM). LIO entered the kernel after Linux 2.6.38. The SAN technologies supported by LIO include Fibre Channel, FCoE, iSCSI, iSER, SRP, USB, and so on; at the same time it can expose emulated SCSI devices to the local machine and provide virtual machines with SCSI devices based on virtio. LIO lets users build the various SCSI and SAN functions on a relatively cheap Linux system instead of buying expensive dedicated appliances. As the figure shows, the front end of LIO consists of fabric modules (Fibre Channel, FCoE, iSCSI, and so on), which are used to reach the emulated SCSI devices; a fabric module is the transport that carries SCSI commands, for example iSCSI carries SCSI commands over TCP/IP and vhost carries them over virtio queues. The back end of LIO implements how the disk data is actually accessed: fileio goes through the Linux VFS, iblock accesses Linux block devices, pscsi passes commands straight to SCSI devices, and the memory-copy ramdisk backend provides a RAM disk for the emulated SCSI devices.

Hardware devices

The sky-blue part of the figure is the actual hardware storage devices. Among them, virtio_pci, para-virtualized SCSI, and VMware's para-virtualized SCSI are virtualized hardware devices.