Analysis of zero copy technology in Linux


This article discusses the main zero-copy techniques in Linux and the scenarios each one suits. To establish the concept of zero copy quickly, we start with a common scenario.

Introduction

When writing a server program (a web server or file server), file download is a basic feature. The server's task is to send a file from its disk out through a connected socket without modifying it. We usually use the following code:

  while ((n = read(diskfd, buf, BUF_SIZE)) > 0)
      write(sockfd, buf, n);

The basic operation reads the file contents from disk into a buffer and then sends the buffer contents to the socket. However, Linux I/O is buffered by default. Only two system calls, read and write, appear in the code, and they hide what the operating system does underneath. In fact, these I/O operations involve several data copies.
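A fuller version of the loop above might look like the following sketch. It assumes blocking descriptors; the function name copy_read_write and the 4096-byte BUF_SIZE are choices made for illustration and are not part of the original program. The comments mark the two copies that the CPU performs.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define BUF_SIZE 4096   /* illustrative size, not taken from the article */

/* Copy everything from in_fd to out_fd through a user-space buffer.
 * Returns the total number of bytes copied, or -1 on error. */
ssize_t copy_read_write(int in_fd, int out_fd)
{
    char buf[BUF_SIZE];
    ssize_t total = 0;

    for (;;) {
        /* CPU copy 1: kernel buffer -> user buffer */
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            return total;               /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, try again */
            return -1;
        }
        /* write() may accept fewer bytes than requested, so loop. */
        for (ssize_t off = 0; off < n; ) {
            /* CPU copy 2: user buffer -> kernel socket buffer */
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

A server would call copy_read_write(diskfd, sockfd) once per download; every byte passes through buf on its way out.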

When an application accesses a piece of data, the operating system first checks whether the file was accessed recently and its contents are already cached in a kernel buffer. If so, the operating system copies the contents of the kernel buffer to the user-space buffer at the buf address passed to the read system call. If not, the operating system first copies the data from disk into the kernel buffer, a step that is nowadays mostly performed by DMA, and then copies the contents of the kernel buffer to the user buffer.

Next, the write system call copies the contents of the user buffer into a kernel buffer associated with the network stack, and finally the contents of that kernel buffer are sent to the network card. A picture makes this clearer:

Data copy

The figure also shows that, even with DMA handling the transfers to and from the hardware, the data is copied four times in total, and the CPU itself still has to perform two of those copies.

Throughout this process we made no changes to the file contents, so copying the data back and forth between kernel space and user space is pure waste. Zero copy exists mainly to remove this inefficiency.

What is zero copy?

The main task of zero copy is to avoid having the CPU copy data from one memory area to another. Zero-copy techniques reduce unnecessary copies or hand such simple transfer work to other components, which frees the CPU to focus on other tasks and makes more effective use of system resources.

Let’s return to the example in the introduction. How can we reduce the number of data copies? An obvious target is the copying back and forth between kernel space and user space. This leads to one class of zero copy:

Data transmission does not need to go through user space.

Using mmap

One way to reduce the number of copies is to call mmap() instead of the read call:

  buf = mmap(NULL, len, PROT_READ, MAP_SHARED, diskfd, 0);
  write(sockfd, buf, len);

When the application calls mmap(), the data on disk is copied into the kernel buffer by DMA, and the operating system then shares that kernel buffer with the application, so its contents do not need to be copied into user space. When the application then calls write(), the operating system copies the contents of the kernel buffer directly to the socket buffer, entirely in kernel mode. Finally, the data in the socket buffer is sent to the network card. Again, a picture makes this easy to see:


Using mmap instead of read eliminates one copy, which clearly improves efficiency when a large amount of data is transferred. But mmap has a cost and some hidden pitfalls. For example, if your program maps a file and another process then truncates it, the write system call is terminated by a SIGBUS signal for accessing an invalid address. By default SIGBUS kills your process and generates a core dump. A server brought down this way means real losses.
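The mmap-based transfer described above can be sketched as a complete function. This is only an illustration under stated assumptions: the name send_file_with_mmap is hypothetical, the whole file is mapped in one piece, and nothing here protects against the truncation problem just mentioned.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Map the file at path and write the mapping to out_fd.
 * Returns the number of bytes written, or -1 on error.
 * (Hypothetical helper; no protection against truncation.) */
ssize_t send_file_with_mmap(const char *path, int out_fd)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;
    if (len == 0) {                     /* mmap() rejects zero-length mappings */
        close(fd);
        return 0;
    }

    /* The mapping exposes the kernel's cached pages to the process. */
    char *buf = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping stays valid after close() */
    if (buf == MAP_FAILED)
        return -1;

    size_t off = 0;
    while (off < len) {
        /* Kernel copies from the mapped pages to the destination buffer. */
        ssize_t w = write(out_fd, buf + off, len - off);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            munmap(buf, len);
            return -1;
        }
        off += (size_t)w;
    }
    munmap(buf, len);
    return (ssize_t)off;
}
```

For very large files a real server would more likely map and send the file in windows rather than all at once; the single mapping keeps the sketch short.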

Two approaches are commonly used to avoid the SIGBUS problem:

1. Install a signal handler for SIGBUS

When SIGBUS arrives, the signal handler simply returns. The write system call then returns the number of bytes written before it was interrupted, with errno set to success. This is a poor way to handle it, however, because it does not address the root cause of the problem.

2. Use a file lease lock

This is the approach usually taken: we request a lease lock on the file descriptor from the kernel. When another process tries to truncate the file, the kernel sends us the real-time signal RT_SIGNAL_LEASE to tell us that it is breaking the read or write lease we hold on the file. The write system call is therefore interrupted before the program touches invalid memory and is killed by SIGBUS. write returns the number of bytes already written, with errno set to success.

We should take the lease before mapping the file with mmap and release it once we have finished with the file:

  /* RT_SIGNAL_LEASE is an application-chosen real-time signal number */
  if (fcntl(diskfd, F_SETSIG, RT_SIGNAL_LEASE) == -1) {
      perror("kernel lease set signal");
      return -1;
  }
  /* l_type can be F_RDLCK or F_WRLCK to take the lease */
  /* l_type can be F_UNLCK to release the lease */
  if (fcntl(diskfd, F_SETLEASE, l_type) == -1) {
      perror("kernel lease set type");
      return -1;
  }

Using sendfile

Starting with the 2.1 kernel, Linux introduced sendfile to simplify this operation:

  #include <sys/sendfile.h>
  ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

The sendfile() system call transfers file contents (bytes) between the descriptor in_fd, which represents the input file, and the descriptor out_fd, which represents the output file. In its original form, out_fd must refer to a socket (kernels from 2.6.33 on also accept a regular file), and the file referred to by in_fd must support mmap-like operations, so it cannot be a socket. As a result, sendfile is essentially a tool for moving data from a file to a socket, not the other way round.

Using sendfile reduces not only the number of data copies but also the number of context switches, because the whole transfer takes place in kernel space.
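A minimal sketch of a download routine built on sendfile follows. The function name send_file_with_sendfile is hypothetical, and a blocking out_fd is assumed; a non-blocking socket would also need EAGAIN handling, which is omitted here.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Send the whole file at path to out_fd without a user-space buffer.
 * Returns the number of bytes sent, or -1 on error.
 * (Hypothetical helper; assumes out_fd is in blocking mode.) */
ssize_t send_file_with_sendfile(const char *path, int out_fd)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }

    off_t offset = 0;                   /* advanced by sendfile() itself */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        if (n == 0)
            break;                      /* file shrank underneath us */
    }
    close(fd);
    return (ssize_t)offset;
}
```

Compared with the read and write loop, there is no buffer in the process at all: each iteration is a single system call, and the data never enters user space.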

Sendfile system call procedure

What happens if another process truncates the file while we are calling sendfile? Assuming no signal handler is installed, sendfile simply returns the number of bytes transferred before it was interrupted, with errno set to success. If we take a lease on the file before calling sendfile, the behavior is the same, and we also receive the RT_SIGNAL_LEASE signal.

So far we have reduced the number of data copies, but one copy remains: the copy from the page cache to the socket buffer. Can we eliminate this one as well?

With help from the hardware, we can. Previously we copied the page cache data to the socket buffer. In fact, it is enough to append a buffer descriptor, together with the data length, to the socket buffer. The DMA controller can then gather the data directly from the page cache, package it, and send it to the network.

In summary, the sendfile system call uses the DMA engine to copy the file contents into the kernel buffer, and then appends a buffer descriptor carrying the location and length of the data to the socket buffer. No data is copied from the kernel buffer to the socket buffer in this step. The DMA engine copies the data from the kernel buffer straight to the protocol engine, which avoids the last copy.

Sendfile with DMA

However, this gather-copy capability requires support from both the hardware and the driver.

Using splice

sendfile is only suitable for copying data from a file to a socket, which limits its use. Linux 2.6.17 introduced the splice system call to move data between two file descriptors:

  #define _GNU_SOURCE /* See feature_test_macros(7) */
  #include <fcntl.h>
  ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);

The splice call moves data between two file descriptors without copying it back and forth between kernel space and user space. It transfers len bytes from fd_in to fd_out, but one of the two must be a pipe, which is a limitation of the current splice. The flags parameter accepts the following values:

  • SPLICE_F_MOVE: try to move the data instead of copying it. This is only a hint to the kernel: if the kernel cannot move the data from the pipe, or the pipe buffers are not full pages, the data is still copied. The initial implementation had problems, so starting with Linux 2.6.21 this option is a no-op; a later version of Linux may restore a correct implementation.
  • SPLICE_F_NONBLOCK: do not block in the splice operation. However, if the file descriptors themselves are not set to non-blocking I/O, the splice call may still block.
  • SPLICE_F_MORE: more data will follow in a subsequent splice call.

The splice call builds on the pipe buffer mechanism provided by Linux, which is why at least one descriptor must be a pipe.
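Because one side of every splice call must be a pipe, moving data between two descriptors that are not pipes takes two calls with a pipe in the middle. The sketch below illustrates that pattern; the function name splice_copy is hypothetical, and blocking descriptors are assumed.

```c
#define _GNU_SOURCE             /* for splice(); must precede every #include */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Move up to len bytes from in_fd to out_fd through an intermediate pipe.
 * Returns the number of bytes moved, or -1 on error.
 * (Hypothetical helper; assumes blocking descriptors.) */
ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
    int p[2];
    if (pipe(p) == -1)
        return -1;

    ssize_t total = 0;
    int failed = 0;
    while (len > 0 && !failed) {
        /* Stage 1: source -> pipe. */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
        if (n == 0)
            break;                      /* end of input */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            failed = 1;
            break;
        }
        len -= (size_t)n;

        /* Stage 2: pipe -> destination, until the pipe is drained. */
        while (n > 0) {
            ssize_t w = splice(p[0], NULL, out_fd, NULL, (size_t)n,
                               SPLICE_F_MOVE);
            if (w < 0 && errno == EINTR)
                continue;
            if (w <= 0) {
                failed = 1;
                break;
            }
            n -= w;
            total += w;
        }
    }
    close(p[0]);
    close(p[1]);
    return failed ? -1 : total;
}
```

Passing NULL for both offsets makes each call use and advance the descriptors' own file positions, which is also the only option when a descriptor is a pipe or a socket.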

The zero-copy techniques above all work by reducing data copies between user space and kernel space. Sometimes, however, data must be copied between the two. In that case we can only optimize when the copy happens. Linux usually uses copy-on-write, also called COW, to reduce this overhead.

For reasons of space, this article does not cover copy-on-write in detail. In outline: if several programs access the same piece of data at the same time, each program holds a pointer to it, and from each program's point of view it owns the data exclusively. Only when a program needs to modify the data is the content copied into that program's own address space, at which point the copy becomes the program's private data. A program that never modifies the data never needs to copy it, which reduces the number of copies. Copy-on-write deserves an article of its own…
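One place where copy-on-write can be observed from user space is a private file mapping. The sketch below is only an illustration: the function name cow_first_byte is hypothetical, and MAP_PRIVATE is just one of the places where Linux applies copy-on-write. The function modifies the first byte of the mapping and then reads the file again to show that the original data is untouched.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Privately map the file at path, overwrite the first byte in memory,
 * and return the byte the file itself still holds at offset 0.
 * Returns -1 on error. (Hypothetical helper for illustration.) */
int cow_first_byte(const char *path, char new_value)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    unsigned char before, after;
    if (pread(fd, &before, 1, 0) != 1) {    /* also rejects empty files */
        close(fd);
        return -1;
    }

    /* MAP_PRIVATE: pages are shared with the page cache until written. */
    char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    p[0] = new_value;   /* first write: the kernel gives us a private copy */

    ssize_t n = pread(fd, &after, 1, 0);    /* the file keeps the old byte */
    munmap(p, 1);
    close(fd);
    return n == 1 ? after : -1;
}
```

Until the assignment to p[0], the process and the page cache share the same physical page; the copy is made only at the moment of the write.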

There are other zero-copy techniques as well. For example, adding the O_DIRECT flag to traditional Linux I/O enables direct I/O, which bypasses the automatic caching, and there is also the still-immature fbufs technique. This article does not cover every zero-copy technique, only some common ones; interested readers can explore the rest on their own. Mature server-side projects often modify the I/O part of the kernel to raise their data transfer rate.

Author: the trees of Kabala_…
