Performance Optimization: how to receive data faster

Time: 2022-01-11

From the network card to the application, a packet passes through a series of components. What does the driver do? What does the kernel do? What can we do to optimize? The whole path involves many finely tunable software and hardware parameters that interact with each other, and there is no once-and-for-all "silver bullet". In this article, Yang Peng, senior engineer of cloud system development, introduces how to reach the optimal configuration for a given scenario by first understanding the underlying mechanisms in depth.

The article is compiled from Yang Peng's keynote speech "Performance Optimization: Receiving Data Faster" at the Youpai Cloud Open Talk technology salon in Beijing. The live video and slides are available from the original post.

Hello, I'm Yang Peng, a development engineer at Youpai Cloud. I have been with Youpai Cloud for four years, working on the CDN's underlying systems and responsible for core components such as scheduling, caching and load balancing. I'm glad to share my experience in network data processing with you. Today's topic is "how to receive data faster"; it mainly covers methods and practices for accelerating network data processing, and I hope it helps you understand how to squeeze out optimization at the system level with as little impact on the application as possible. Let's get to the point.

First of all, what should be the first thing you think of when attempting any optimization? Personally, I think it is measurement. Before making any change or optimization, you should know clearly which metrics reflect the current problem; only then, after making the corresponding adjustment, can you verify its actual effect through those metrics.

Around that core of metrics there is a basic principle for today's topic. For optimization at the network level, if we can monitor the packet loss rate at each layer of the network stack, we can clearly tell at which layer a problem occurs. With clear, monitorable metrics, making the corresponding adjustment and verifying its effect becomes simple. Of course, these two points are rather abstract; what follows is the concrete part.

[Figure: the path of a received packet from the network card to the application]

As shown in the figure above, a received packet flows through many stages from entering the network card to reaching the application layer. At this point you don't need to care about every step, only the few key stages on the critical path:

  • First, the packet arrives at the network card;
  • Second, the network card generates an interrupt to tell the CPU that data has arrived;
  • Third, the kernel takes over, pulls the data out of the network card and hands it to the kernel protocol stack for further processing.

These are the three key stages. The hand-drawn diagram on the right of the figure corresponds to these steps and deliberately uses two colors, because the rest of the talk is organized around those two parts: the driver at the top and the kernel below. Of course the kernel is huge; this article only touches the kernel network subsystem, and more specifically the interaction between the kernel and the driver.

Network card driver

Let's start with the network card driver. The network card is hardware and the driver is software, and most of what follows concerns the driver. This part can be divided into four topics: initialization, startup, monitoring and tuning, beginning with the initialization process.

Network card driver – initialization

The driver initialization process is largely hardware-specific, so there is no need to pay too much attention to it. One thing worth noting, however, is the registration of the ethtool operations. ethtool can perform all kinds of operations on the network card: it can not only read the card's configuration but also change its configuration parameters. It is a very powerful tool.

How does it control the network card? During initialization, each network card's driver registers, through a kernel interface, the set of ethtool operations it supports. ethtool is a very general interface: say it defines a hundred operations, a given card model may only support a subset, and which ones are supported is declared in this step.
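For example (a quick sketch; eth0 is just a placeholder interface name), you can ask ethtool which driver sits behind an interface and which offload features that particular driver/NIC combination actually supports:

$ sudo ethtool -i eth0        # driver name, version, firmware
$ sudo ethtool -k eth0        # which offload features are supported / enabled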

[Figure: structure assignment during driver initialization, registering the open/close callbacks]

The snippet in the figure above is the structure assignment done during initialization. Look at the first two members: the driver tells the kernel which callback functions to invoke to operate this network card, and the most important ones are open and stop. Anyone who has used ifconfig to operate a NIC is familiar with these: ifconfig up / down on an interface ends up calling the functions registered here.

Network card driver – start

After initialization comes the start (open) process, which consists of four steps: allocating the RX/TX queue memory, enabling NAPI, registering the interrupt handler, and enabling interrupts. Registering an interrupt handler and enabling interrupts is natural; any hardware attached to the machine needs it, so that later events can be reported to the system through interrupts.

NAPI, the second step, will be described in detail later. Here let's first focus on the memory allocated at startup. When the network card receives data, it must copy it from the link layer into the machine's memory, and that memory is requested from the kernel through an OS interface when the network card starts. Once the memory has been allocated and its address is fixed, the network card can later transfer incoming packets directly to that fixed memory address via DMA, without the CPU even being involved.
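As a small illustration (interface name and sizes are only examples, and the maximums are hardware-dependent), the size of these DMA ring buffers can be inspected and, within the reported maximums, enlarged with ethtool:

$ sudo ethtool -g eth0           # show maximum and current RX/TX ring sizes
$ sudo ethtool -G eth0 rx 4096   # grow the RX ring to better absorb bursts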

[Figure: RX/TX queue memory allocation at NIC startup]

The queue memory allocation can be seen in the figure above. Long ago network cards had a single queue, but most modern cards are multi-queue. The benefit is that the packets received by the machine's NIC can be load-balanced across multiple CPUs, so multiple queues are provided. This concept will come up again in more detail later.

[Figure: the NAPI interrupt-plus-polling model]

Next let's look in detail at the second step of the start process, NAPI, a very important part of the modern packet processing framework. The NAPI mechanism is what makes 10G, 20G, 25G and other very high-speed NICs workable. NAPI itself is not complex; its core combines two things: interrupts and polling. Traditionally, the NIC raises an interrupt for every received packet, the packet is processed in the interrupt handler, and the receive-interrupt-process cycle repeats for each packet. The advantage of NAPI is that only one interrupt is needed: after it fires, all the data sitting in the queue memory is drained by polling, which is far more efficient.

Network card driver – monitoring

Next comes the monitoring that can be done at the driver layer, and where the relevant data comes from.


$ sudo ethtool -S eth0
NIC statistics:
     rx_packets: 597028087
     tx_packets: 5924278060
     rx_bytes: 112643393747
     tx_bytes: 990080156714
     rx_broadcast: 96
     tx_broadcast: 116
     rx_multicast: 20294528
     .... 

First of all, a very important tool is ethtool, which can fetch statistics from the network card: packets received, traffic processed, and other routine information. What deserves more attention is the anomaly counters.


$ cat /sys/class/net/eth0/statistics/rx_dropped
2

Through the sysfs interface you can see the NIC's drop counter, which is a sign that something abnormal is happening in the system.
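Absolute values are less useful than their rate of change, so a simple approach is to sample the counter periodically and watch the delta; a minimal sketch (interface name assumed):

# Print the per-second increase of the NIC drop counter
IFACE=eth0
prev=$(cat /sys/class/net/$IFACE/statistics/rx_dropped)
while sleep 1; do
    cur=$(cat /sys/class/net/$IFACE/statistics/rx_dropped)
    echo "$(date +%T) rx_dropped/s: $((cur - prev))"
    prev=$cur
done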

[Figure: NIC statistics from the /proc interface]

This third interface gives information similar to the two above, just in a somewhat messier format; knowing it exists is enough.

[Figure: ifconfig output from the online case, showing a high errors count]

The figure above is an online case worth sharing. At the time there was a business-level anomaly, and the investigation eventually pointed at the network card layer, so further analysis was needed. With ifconfig you can view some NIC statistics directly. In the figure, the errors counter of the NIC is very high, which clearly indicates a problem. More interestingly, the frame counter to its right has exactly the same value. Since errors is the sum of many kinds of NIC errors, and the adjacent dropped and overruns counters are both zero, at that moment essentially all of the NIC's errors came from frame errors.

Of course this is only a snapshot; the lower part of the figure is monitoring data over time, where the fluctuation is obvious, so it really was an anomaly on one machine. Frame errors are generally caused by packets failing the CRC check when the NIC receives them: the packet's contents are checksummed on arrival, and if the result does not match the checksum carried by the frame, the packet is considered corrupted and is dropped directly.
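For a quick breakdown of where receive errors come from, `ip` with a doubled `-s` prints the detailed error counters, and many drivers also expose their own error counters via ethtool; for example (interface name assumed):

$ ip -s -s link show dev eth0                     # RX errors: length / crc / frame / fifo / missed
$ sudo ethtool -S eth0 | grep -iE 'err|crc|drop'  # driver-specific error counters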

The cause is easy to reason about: there are two endpoints and one link. The machine's NIC connects via a cable to the uplink switch, so the problem is either the cable, the machine's own NIC, or the peer port on the uplink switch. Following the most likely cause first, we coordinated with ops to replace the machine's cable, and the metrics soon reflected the effect: the error counter dropped sharply until it disappeared entirely, and the business on top quickly returned to normal.

Network card driver – tuning

After monitoring, let's look at tuning. There is not much that can be adjusted at this level; it is mainly the NIC multi-queue configuration, which is quite intuitive: you can change the number and size of the queues, the weights between queues, and even the hash fields used to spread flows across them.

$ sudo ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:   0
TX:   0
Other:    0
Combined: 8
Current hardware settings:
RX:   0
TX:   0
Other:    0
Combined: 4
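To actually change these settings (a sketch only; which of them work depends on the NIC and driver):

$ sudo ethtool -L eth0 combined 8                 # use all 8 hardware channels instead of 4
$ sudo ethtool -X eth0 weight 6 2 1 1 1 1 1 1     # uneven RSS weights between the queues
$ sudo ethtool -n eth0 rx-flow-hash tcp4          # show which header fields feed the hash for TCP/IPv4
$ sudo ethtool -N eth0 rx-flow-hash tcp4 sdfn     # hash on src/dst IP and src/dst port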

The ethtool -l output above shows how to inspect the multi-queue setup. To illustrate the earlier concept with an example: suppose a web server is bound to CPU2 on a multi-CPU machine whose NIC is also multi-queue, and one particular queue happens to be processed by CPU2. There is a subtle problem here: because the NIC has several queues, traffic for port 80 will be steered to only one of them, and if that queue is not the one handled by CPU2, the data has to be moved across CPUs when it is passed up to the application layer. If it was processed on CPU1 and has to be handed to CPU2, the CPU caches are invalidated along the way, which is an expensive operation for modern high-speed CPUs.

So what can be done? With the tools mentioned above, we can explicitly steer TCP traffic for port 80 into the NIC queue handled by CPU2. The effect is that a packet stays on the same CPU from the moment it arrives at the NIC, through kernel processing, all the way to the application layer. The biggest win is the cache: the CPU cache stays hot, and overall latency improves noticeably. Of course this example is not meant to be practical; it just illustrates the kind of effect that can be achieved.
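On NICs that support n-tuple filtering, this kind of steering can be expressed directly with ethtool; a sketch, assuming the hardware supports it and that queue 2 is the one serviced by CPU2:

$ sudo ethtool -K eth0 ntuple on                            # enable n-tuple flow steering
$ sudo ethtool -U eth0 flow-type tcp4 dst-port 80 action 2  # TCP port 80 traffic -> RX queue 2
$ sudo ethtool -u eth0                                      # list the installed rules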

Kernel network subsystem

With the network card driver covered, the next part is the kernel network subsystem, split into two topics: soft interrupts and the initialization of the network subsystem.

Soft interrupt

[Figure: the netdev 0x15 conference]

A small digression: netdev, shown in the figure above, is the annual conference of the Linux networking subsystem community. An amusing detail is that the edition number is written as a hexadecimal number: the one shown is 0x15, which is 21 in decimal and matches the year '21, a nicely geeky touch. Anyone interested in the network subsystem may want to follow it.

[Figure: user space, kernel and hardware layers and how they interact]

Back to the topic: the kernel has many mechanisms for deferring work, and the softirq is only one of them. The figure above shows the basic structure of Linux: user space on top, the kernel in the middle, hardware at the bottom; a very abstract layering. User space and the kernel interact in two ways: a system call or an exception traps into kernel mode. How does the underlying hardware interact with the kernel? The answer is interrupts: whenever the hardware needs the kernel to handle an event, it must raise an interrupt signal to notify the CPU and the kernel.

This mechanism is fine in the general case, but for network data, raising one interrupt per packet causes two obvious problems.

Problem 1: while an interrupt is being handled, further interrupt signals are masked, so if handling takes a long time, the signals that arrive in the meantime are lost. Suppose, exaggerating, it takes ten seconds to process one packet, and five more packets arrive during those ten seconds; because their interrupt signals were lost, even after the first packet is done the later ones will never be processed. On the TCP side: a client sends a packet, the server takes seconds to process it, and the three follow-up packets the client sent in the meantime are never noticed by the server, which believes it only received one packet. Meanwhile the client waits for the server's reply, and both sides end up stuck. That shows how serious signal loss is.

Problem 2: if every packet triggers an interrupt, a huge number of packets means a huge number of interrupts. At 100,000, 500,000 or even millions of PPS, the CPU would spend all its time handling network interrupts and get nothing else done.

The solution to both problems is the same: keep interrupt handling as short as possible. Concretely, do almost nothing in the hard interrupt handler and hand the real work over to the softirq mechanism. The practical result is that the hardware interrupt handler does only the bare minimum, such as acknowledging the data, and defers the rest to the softirq; that is exactly why softirqs exist.

static struct smp_hotplug_thread softirq_threads = {
  .store              = &ksoftirqd,
  .thread_should_run  = ksoftirqd_should_run,
  .thread_fn          = run_ksoftirqd,
  .thread_comm        = "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void)
{
  register_cpu_notifier(&cpu_nfb);

  BUG_ON(smpboot_register_percpu_thread(&softirq_threads));

  return 0;
}
early_initcall(spawn_ksoftirqd);

The softirq mechanism is implemented with kernel threads. The code above shows the corresponding definition: each CPU gets a ksoftirqd kernel thread, so a multi-CPU machine has several of them. Note the thread_comm member, "ksoftirqd/%u": on a machine with three CPUs there will be three threads, named ksoftirqd/0, ksoftirqd/1 and ksoftirqd/2.

[Figure: /proc/softirqs output]

Softirq activity can be seen in /proc/softirqs. There are not many softirq types, just a handful; the two related to networking are NET_TX and NET_RX, covering the transmit and receive paths respectively.
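A convenient way to see whether the two network softirqs are spread evenly across CPUs is to watch how their counters grow, for example:

$ watch -d -n 1 'grep -E "NET_RX|NET_TX" /proc/softirqs'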

Kernel initialization

With softirqs covered, let's look at the kernel network subsystem's initialization, which has two main steps:

  • For each CPU, a data structure is created, with many members hanging off it that are closely involved in all subsequent processing;
  • The softirq handlers for the two softirqs mentioned above, NET_TX and NET_RX, are registered.

[Figure: hand-drawn packet receive flow, from the NIC to the NAPI wakeup]

The figure above is a hand-drawn sketch of how a packet is processed:

  • Step 1: the network card receives the packet;
  • Step 2: the packet is copied into memory via DMA;
  • Step 3: an interrupt is raised to notify the CPU, and interrupt handling begins. The key part of the handling has two steps: mask further interrupt signals, and wake up the NAPI mechanism.

static irqreturn_t igb_msix_ring(int irq, void *data)
{
  struct igb_q_vector *q_vector = data;

  /* Write the ITR value calculated from the previous interrupt. */
  igb_write_itr(q_vector);

  napi_schedule(&q_vector->napi);

  return IRQ_HANDLED;
}

The code above is everything the igb NIC driver's interrupt handler does. Leaving out the variable declaration at the top and the return at the bottom, it is only two lines, extremely short. Apart from writing back the ITR value, the hard interrupt handler does nothing except schedule the NAPI softirq processing, so it returns very quickly.

NAPI activation


/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd, struct napi_struct *napi)
{
  list_add_tail(&napi->poll_list, &sd->poll_list);
  __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}

Activating NAPI is also very simple, essentially two steps. Recall that when the kernel network subsystem is initialized, each CPU gets a structure; the first step inserts the queue's information into that structure's list. In other words, when a NIC queue receives data, it registers its own queue information with the corresponding CPU, binding the two together so that a given CPU handles a given queue.

Besides that, just as a hard interrupt must be triggered, the softirq must be raised too. The following figure puts several steps together; the earlier ones won't be repeated, and what deserves attention is how the softirq is triggered. Like hard interrupts, softirqs also have a vector table, with one handler per interrupt number; when a softirq needs to be handled, its handler is simply looked up in that vector table, exactly the same way as for hard interrupts.

[Figure: raising the NET_RX softirq, with the preceding steps combined]

Data receiving – monitoring

With the mechanism covered, let's see what can be monitored. /proc/interrupts shows interrupt handling: the first column is the interrupt number, and every device has its own, hard-coded interrupt number. For networking you only care about the numbers belonging to the NIC, which in the figure are 65, 66, 67, 68 and so on. The absolute counts are not meaningful by themselves; what matters is their distribution: whether the interrupts are being handled by different CPUs. If they all land on a single CPU, you need to adjust and spread them out.
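A quick way to check that distribution (how interrupt lines are named varies between drivers; here they are matched by interface name):

$ grep eth0 /proc/interrupts                      # per-CPU counts for the NIC's interrupt vectors
$ watch -d -n 1 'grep eth0 /proc/interrupts'      # watch live which CPUs actually take them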

[Figure: /proc/interrupts output]

Data receiving – tuning

There are two interrupt-related adjustments: interrupt merging (coalescing) and interrupt affinity.

Adaptive interrupt merging

  • rx-usecs: how long to delay, in microseconds, before raising an interrupt after a frame arrives
  • rx-frames: the maximum number of frames to accumulate before raising an interrupt
  • rx-usecs-irq: how long to delay the next interrupt while the CPU is still servicing a previous one
  • rx-frames-irq: the maximum number of frames to accumulate while an interrupt is being serviced

These are features provided by the NIC hardware itself. NAPI is, in essence, also an interrupt merging mechanism: if many packets arrive, NAPI can get away with a single interrupt, so strictly speaking hardware merging is not required; the practical effect is the same as NAPI's, a reduction in the total number of interrupts.
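These parameters are read and written with ethtool; a sketch with purely illustrative values (not every NIC supports every parameter):

$ sudo ethtool -c eth0                            # show current coalescing settings
$ sudo ethtool -C eth0 adaptive-rx on             # let the driver adapt the thresholds
$ sudo ethtool -C eth0 rx-usecs 50 rx-frames 64   # or: interrupt after 50 us or 64 frames, whichever first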

Interrupt affinity

$ sudo bash -c 'echo 1 > /proc/irq/8/smp_affinity'

This ties in closely with NIC multi-queue: if the NIC has several queues, you can manually specify which CPU handles each one and spread the processing load evenly across the machine's available CPUs. The configuration is simple: write a number into the corresponding file under /proc. The value is a CPU bitmask; interpreted in binary, each bit selects a CPU. Writing 1 means CPU0 handles the interrupt; writing 4, which is binary 100, hands it to CPU2.
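In practice you usually spread a multi-queue NIC's interrupt vectors over the CPUs one by one; a rough sketch (the round-robin policy is just an example, and smp_affinity_list is the variant that takes CPU numbers instead of a bitmask):

# Spread all IRQs belonging to eth0 across the available CPUs round-robin
IFACE=eth0
cpus=$(nproc)
i=0
for irq in $(grep "$IFACE" /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
    echo $((i % cpus)) | sudo tee /proc/irq/$irq/smp_affinity_list > /dev/null
    i=$((i + 1))
done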

One small caveat: many distributions ship an irqbalance daemon (http://irqbalance.github.io/i…), which will overwrite manual affinity settings. At its core that program does the same thing, writing into the files above, just automatically; if you are interested you can read its code (https://github.com/Irqbalance…): it opens the same file and writes the corresponding value into it.

Kernel – data processing

Finally, the data-processing part. Once data has reached the NIC and sits in the queue memory, the kernel has to pull it out. If the machine's PPS reaches hundreds of thousands or even millions and the CPU does nothing but process network data, no other business logic can run. So packet processing must not monopolize the CPU, and the core question is how to limit it.

There are two limits addressing this problem: an overall limit and a per-queue limit.

while (!list_empty(&sd->poll_list)) {
  struct napi_struct *n;
  int work, weight;

  /* If softirq window is exhausted then punt.
   * Allow this to run for 2 jiffies since which will allow
   * an average latency of 1.5/HZ.
   */
  if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
    goto softnet_break;

The overall limit is easy to understand. A CPU normally corresponds to one queue, but if there are fewer CPUs than queues, one CPU may have to poll several of them; the budget and time limit in the code above cap the total work the CPU does across all of them in one softirq run.

weight = n->weight;

work = 0;
if (test_bit(NAPI_STATE_SCHED, &n->state)) {
        work = n->poll(n, weight);
        trace_napi_poll(n);
}

WARN_ON_ONCE(work > weight);

budget -= work;

The per-queue limit caps how many packets one queue may process in a round; when its weight is reached, polling stops and the rest waits for the next round.

softnet_break:
  sd->time_squeeze++;
  __raise_softirq_irqoff(NET_RX_SOFTIRQ);
  goto out;

Being cut short here is a key event, and fortunately there is a corresponding counter, time_squeeze. With that statistic you can judge whether the machine's network processing is hitting a bottleneck and how often processing is forcibly interrupted.
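If time_squeeze keeps growing, the overall budget can be raised through sysctl; a sketch with illustrative values (net.core.netdev_budget_usecs only exists on newer kernels):

$ sysctl net.core.netdev_budget net.core.dev_weight   # total packets per softirq run, per-poll weight
$ sysctl net.core.netdev_budget_usecs                 # time limit for one softirq run
$ sudo sysctl -w net.core.netdev_budget=600           # allow more packets before being squeezed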

[Figure: /proc/net/softnet_stat output]

The figure above shows the per-CPU statistics used for this monitoring, the /proc/net/softnet_stat file. The format is simple: one line per CPU, values separated by spaces, printed in hexadecimal. What does each column mean? Unfortunately there is no documentation; you have to check the kernel version in use and read the corresponding code.

seq_printf(seq,
     "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
     sd->processed, sd->dropped, sd->time_squeeze, 0,
     0, 0, 0, 0, /* was fastroute */
     sd->cpu_collision, sd->received_rps, flow_limit_count);

The code above is where each field in the file comes from; the reality on your machine may differ, since the number and order of fields change as the kernel evolves. The squeeze field is the one tied to how often network data processing gets cut short (a small parsing sketch follows the list below):

  • sd->processed: number of packets processed (with multi-NIC bonding this can exceed the number actually received)
  • sd->dropped: packets dropped because the input queue was full
  • sd->time_squeeze: number of times the net_rx_action softirq processing was cut short
  • sd->cpu_collision: collisions acquiring the device lock while transmitting, e.g. several CPUs sending at the same time
  • sd->received_rps: number of times this CPU was woken up (via an inter-processor interrupt)
  • sd->flow_limit_count: number of times the flow limit was triggered
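Since the file is hexadecimal with one line per CPU, a small script makes the interesting columns readable (the column positions assume the field order shown above):

# Print processed / dropped / time_squeeze per CPU in decimal
cpu=0
while read -r processed dropped squeeze rest; do
    printf 'cpu%-2d processed=%d dropped=%d time_squeeze=%d\n' \
        "$cpu" "0x$processed" "0x$dropped" "0x$squeeze"
    cpu=$((cpu + 1))
done < /proc/net/softnet_stat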

The figures below show a case from production that was eventually traced down to the CPU level. Figure 1 is top output, showing per-CPU usage; the usage of CPU4, marked with the red box, is abnormal, in particular the si value in the penultimate column reaches 89%. si is short for softirq and is the share of CPU time spent handling soft interrupts, and for CPU4 it is clearly far too high. Figure 2 is the corresponding softnet_stat output: CPU4 is the fifth row, and its third column, the squeeze counter, is much higher than that of the other CPUs, meaning its network data processing was frequently being cut short.

[Figure 1: top output showing 89% si on CPU4]

[Figure 2: /proc/net/softnet_stat output with a high squeeze count for CPU4]

Given the above, the inference was that CPU4 had degraded somehow, perhaps due to hardware quality or some other reason. To verify this, I wrote a simple Python script: an endless loop that does nothing but increment a counter. Each run was pinned to one CPU, and the elapsed time on different CPUs was compared. The result showed CPU4 taking several times longer than the others, which confirmed the inference. After coordinating with ops to replace the CPU, the metric in question returned to normal.
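The original test was a small Python busy loop; the same idea can be sketched in shell, pinning an identical loop to each CPU with taskset and comparing the elapsed time (CPU numbers are just examples):

# Compare how long the same busy loop takes on CPU1 vs. the suspect CPU4
for cpu in 1 4; do
    echo "CPU$cpu:"
    time taskset -c "$cpu" bash -c 'i=0; while [ $i -lt 5000000 ]; do i=$((i+1)); done'
done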

Summary

Everything above only covers the path from the network card into the kernel, before the packet even reaches the familiar protocol layers; it is just the first step of a long march. What follows is a further series of steps, such as packet merging (GRO) and software multi-queue mechanisms (RPS and RFS), which build on load balancing while taking flow characteristics, i.e. the IP and port four-tuple, into account, before the data is finally delivered to the IP layer and on to the familiar TCP layer.

Overall, today's sharing has revolved around the driver. The point I want to emphasize is that the core of performance optimization is metrics: what cannot be measured is hard to improve. Have metrics first, so that every optimization is meaningful.
