ClickHouse series – extra chapter – zero copy


This article explains in detail where the 4K per read comes from in the discussion of unordered storage in Chapter 3. The appendix to Chapter 3 already told readers that "the reason is that when the operating system reads the disk, following the principle of data locality, it reads in pages, and the size of each page is 4K by default." This article goes deeper into a common optimization in the computer field: zero-copy technology.

Linux provides three sets of APIs for file operations:

  1. System calls
  2. Standard I/O
  3. mmap

The first, system calls, is the file API that the operating system provides directly for byte-level reads and writes on files. The operating system implements a page cache internally, which is transparent to the application, and it adjusts the cache size according to how pages are accessed.
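As a rough illustration of this first API, the sketch below uses Python's `os.open`, `os.read` and `os.write`, which are thin wrappers over the corresponding system calls. The helper names `read_with_syscalls` and `write_with_syscalls` and the 4096-byte chunk size are assumptions made for this example, not something taken from ClickHouse.

```python
import os
import tempfile

def read_with_syscalls(path, chunk_size=4096):
    """Read a whole file with os.read, one read(2) call per loop iteration."""
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)  # data is served from the page cache
            if not chunk:                    # empty bytes object means end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)

def write_with_syscalls(path, data):
    """Write data with os.write, retrying until every byte has been accepted."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than requested
            view = view[written:]
    finally:
        os.close(fd)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.bin")
        write_with_syscalls(demo_path, b"hello, page cache")
        print(read_with_syscalls(demo_path))
```

Every loop iteration here crosses from user mode into kernel mode, which is the cost the next API tries to reduce.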

The second, standard I/O, is the well-known <stdio.h>. It operates on files through streams. The reason for developing stdio is that the page cache behind the system calls is too large (16K to 128K), and some simple applications do not need such a large cache. In addition, every system call requires the CPU to switch from user mode to kernel mode, which is expensive. Standard I/O was therefore developed, and it can be regarded as a user-space buffer in front of the kernel.
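The effect of a user-space buffer can be sketched with Python's `io.BufferedReader`, which plays a role similar to a stdio stream (it is Python's own buffering layer, not `<stdio.h>` itself). `CountingRaw` and `count_raw_reads` are hypothetical helpers written for this example; the raw stream stands in for the system-call layer and simply counts how often it is asked for data.

```python
import io

class CountingRaw(io.RawIOBase):
    """Hypothetical raw stream that counts requests reaching the lower layer."""

    def __init__(self, data):
        super().__init__()
        self._data = data
        self._pos = 0
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, buf):
        # Each call here corresponds to one trip to the layer below the buffer.
        self.calls += 1
        chunk = self._data[self._pos:self._pos + len(buf)]
        buf[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)  # 0 signals end of file

def count_raw_reads(total_bytes, buffer_size):
    """Read total_bytes one byte at a time; return (bytes_read, raw_calls)."""
    raw = CountingRaw(b"a" * total_bytes)
    reader = io.BufferedReader(raw, buffer_size=buffer_size)
    bytes_read = 0
    while reader.read(1):  # served from the user-space buffer when possible
        bytes_read += 1
    return bytes_read, raw.calls

if __name__ == "__main__":
    print(count_raw_reads(10000, 4096))
    print(count_raw_reads(10000, 16))
```

With a larger buffer, the same one-byte reads reach the lower layer far less often, which is the saving that standard I/O aims for.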

The third, mmap, is often called zero copy. The second method, standard I/O, is suitable for simple programs, but it has a well-known drawback for programs with high performance requirements, such as databases. In essence, standard I/O buffers the first method by keeping a copy of the data in user space: it copies the data produced by the read() system call into user space, subsequent operations are carried out in user space, and the data is written back to the kernel when appropriate. At this point the data has been copied twice: the read() system call copies the data into kernel-space memory, and standard I/O then copies it into user space. This inevitably costs performance. Fortunately, the kernel provides mmap, which lets an application map a file directly into the address space of the current process, so that the application operates on memory directly and the kernel is responsible for synchronizing the contents to disk.
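A minimal sketch of the mmap approach, using Python's standard `mmap` module. The helper `patch_in_place` is a hypothetical name for this example: it maps a file, finds a byte string and overwrites it through the mapping, with no explicit read() or write() on the file contents.

```python
import mmap
import os
import tempfile

def patch_in_place(path, old, new):
    """Overwrite the first occurrence of old with new through a memory mapping."""
    if len(old) != len(new):
        raise ValueError("a mapping cannot change the size of the file")
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0) as mm:
            pos = mm.find(old)
            if pos == -1:
                return False
            mm[pos:pos + len(new)] = new  # plain memory write into the mapping
            mm.flush()                    # ask the kernel to write dirty pages back
            return True

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.txt")
        with open(demo_path, "wb") as out:
            out.write(b"hello mmap world")
        print(patch_in_place(demo_path, b"mmap", b"MMAP"))
```

The replacement must have the same length as the original bytes because a mapping has a fixed size; growing the file would require resizing it first.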

Many databases use mmap to achieve zero copy and avoid the two copies of standard I/O. However, mmap maps the contents of a file into memory, and the smallest unit of memory that the operating system manages through the memory management unit (MMU) is the page, so a mapping must be sized as an integer multiple of the page size. This is why 4K appears in the calculation in Chapter 3.
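The rounding to whole pages can be written out as a small calculation. `mapped_length` is a hypothetical helper for this example; `mmap.PAGESIZE` reports the page size of the running system, which is commonly but not always 4096 bytes.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE  # page size of this machine, commonly 4096

def mapped_length(nbytes, page_size=PAGE_SIZE):
    """Return (pages, bytes) that a mapping of nbytes actually occupies."""
    pages = -(-nbytes // page_size)  # ceiling division
    return pages, pages * page_size

if __name__ == "__main__":
    for size in (1, 4096, 4097):
        print(size, mapped_length(size, 4096))
```

Even a single byte therefore occupies one full page, and one byte past a page boundary occupies two.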

Another advantage of mmap is that, apart from a small number of page faults, reads and writes through the mapping are carried out in user space and generate no system calls. In addition, an application using mmap can call madvise() to control on demand whether the kernel uses its read-ahead mechanism, and thereby control the size of the page cache.
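As a hedged sketch of the madvise() hint: Python exposes it as `mmap.madvise` from version 3.8 on, and the `MADV_*` constants exist only on platforms that support them, so the example checks for both before using them. `sum_bytes_sequential` is a hypothetical helper that scans a mapped file page by page after telling the kernel the access will be sequential.

```python
import mmap
import os
import tempfile

def sum_bytes_sequential(path):
    """Sum all bytes of a file through a read-only mapping, scanned front to back."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Hint that pages will be touched in order, so read-ahead is useful.
            # This is advisory only; the kernel is free to ignore it.
            if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                mm.madvise(mmap.MADV_SEQUENTIAL)
            total = 0
            for offset in range(0, len(mm), mmap.PAGESIZE):
                # The slice copies one page worth of bytes so sum() can consume it.
                total += sum(mm[offset:offset + mmap.PAGESIZE])
            return total

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.bin")
        with open(demo_path, "wb") as out:
            out.write(b"\x01" * 10000)
        print(sum_bytes_sequential(demo_path))
```

For access patterns that jump around the file, `MADV_RANDOM` is the corresponding hint that read-ahead is unlikely to help.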

mmap is a technique widely used in modern applications. Kafka's commit log and PostgreSQL's storage engine… these well-known systems make extensive use of zero-copy techniques. However, as noted above, mmap is not without disadvantages. The main one is that a mapping must be sized as an integer multiple of the page size, which easily wastes space. mmap therefore brings the greatest performance improvement when the files are large, so the padding is negligible, or when the file size is exactly an integer multiple of the page size, so there is no padding at all.
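To put rough numbers on the wasted space, the sketch below computes what fraction of the mapped pages is padding. `padding_overhead` is a hypothetical helper and the 4096-byte page size is an assumption matching the 4K figure used in this series.

```python
def padding_overhead(file_size, page_size=4096):
    """Fraction of the mapped bytes that is padding beyond the end of the file."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    mapped = -(-file_size // page_size) * page_size  # round up to whole pages
    return (mapped - file_size) / mapped

if __name__ == "__main__":
    for size in (100, 8192, 2**30 + 1):
        print(size, padding_overhead(size))
```

A 100-byte file leaves almost the whole page unused, a file of exactly two pages wastes nothing, and for a file of about 1 GiB the padding is a negligible fraction of the mapping.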
