Interviewer: how to write code to make CPU run faster?

Time: 2020-11-27



Preface

Code is executed by the CPU, so how well the code is written directly affects how efficiently the CPU can execute it. This matters especially when writing compute-intensive programs: if we ignore the CPU's execution efficiency, system performance suffers badly.

The CPU cache is embedded in the CPU. Its capacity is very small, but it sits very close to the CPU core, so its read and write speed is extremely fast. If the CPU can read data from the CPU cache instead of from memory, operations run much faster.

However, most people don't know how the CPU cache works, so they don't know how to write code that cooperates with it. Once you master it, you will have new optimization ideas when writing code.

So, let's take a look at what the CPU cache looks like, how it works, and how to write code that makes the CPU execute faster.



Main text

How fast is the CPU cache?

You may wonder why we need a CPU cache when we already have memory. According to Moore's law, CPU speed doubles roughly every 18 months, equivalent to about 60% growth per year. Memory speed also keeps growing, but far more slowly, about 7% per year on average. As a result, the access-performance gap between CPU and memory keeps widening.

Today, a single memory access takes roughly 200~300 clock cycles, which means the access speeds of the CPU and memory now differ by 200~300 times.

To bridge the performance gap between CPU and memory, the CPU cache, also known simply as the cache, was introduced into the CPU.

The CPU cache is usually divided into three levels of different sizes: L1 cache, L2 cache, and L3 cache.

Because the CPU cache is made of SRAM, it is much more expensive than the DRAM used for memory. Today it costs about US$7 to produce 1 MB of CPU cache, while 1 MB of memory costs only about US$0.015, a cost difference of roughly 466 times. That is why CPU cache capacity is not measured in GB like memory, but in KB or MB.

On a Linux system, we can check the size of each cache level as shown below. For example, on my server, the L1 cache closest to the CPU core is 32KB, the L2 cache is 256KB, and the largest, the L3 cache, is 3MB.
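On a typical Linux system, these sizes are exposed through sysfs; for example (index numbers and output vary by machine):

    $ cat /sys/devices/system/cpu/cpu0/cache/index0/size   # L1 data cache
    32K
    $ cat /sys/devices/system/cpu/cpu0/cache/index1/size   # L1 instruction cache
    32K
    $ cat /sys/devices/system/cpu/cpu0/cache/index2/size   # L2 cache
    256K
    $ cat /sys/devices/system/cpu/cpu0/cache/index3/size   # L3 cache
    3072K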


Note that the L1 cache is usually split into a "data cache" and an "instruction cache", meaning data and instructions are cached separately at the L1 level: in the sysfs listing above, index0 is the data cache and index1 is the instruction cache, and the two are usually the same size.

In addition, you will notice that the L3 cache is much larger than the L1 and L2 caches. That is because the L1 and L2 caches are private to each CPU core, while the L3 cache is shared by multiple CPU cores.

When a program runs, data in memory is first loaded into the shared L3 cache, then into the L2 cache private to each core, and finally into the fastest L1 cache, where it is read by the CPU. From closest to the core outward, the hierarchy is: L1 cache → L2 cache → shared L3 cache → memory.


The closer the cache is to the CPU core, the faster the access: accessing the L1 cache takes about 2~4 clock cycles, the L2 cache about 10~20 clock cycles, the L3 cache about 20~60 clock cycles, while a memory access takes about 200~300 clock cycles, as shown in the table below:

    Storage level    Approximate access time
    L1 cache         2~4 clock cycles
    L2 cache         10~20 clock cycles
    L3 cache         20~60 clock cycles
    Memory           200~300 clock cycles

Therefore, the CPU reads data from the L1 cache more than 100 times faster than from memory.


What is the data structure and reading process of CPU cache?

The data in the CPU cache is loaded from memory in small chunks rather than as individual array elements. In the CPU cache, such a small chunk of data is called a cache line.

You can check the CPU cache line size on a Linux system as shown below. On my server, the L1 cache line size is 64 bytes, which means the L1 cache loads 64 bytes of data at a time.
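On Linux this is exposed in sysfs as well (illustrative output):

    $ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
    64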


For example, given an int array[100], when array[0] is loaded, an int element occupies only 4 bytes in memory, well under 64 bytes, so the CPU sequentially loads the array elements up to array[15]. That is, array[0]~array[15] are all cached together. The next time these array elements are accessed, they are read directly from the CPU cache instead of from memory, which greatly improves the CPU's read performance.
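As a rough illustration (assuming 4-byte ints and 64-byte cache lines, so 16 consecutive ints share one line):

    #include <stdio.h>

    int main(void)
    {
        int array[100];

        /* Touching array[0] pulls a whole cache line, i.e. roughly
         * array[0]..array[15], into L1 at once, so a sequential pass
         * incurs about one memory fetch per 16 elements. */
        for (int i = 0; i < 100; i++)
            array[i] = i;

        printf("ints per 64-byte cache line: %zu\n", 64 / sizeof(int));
        return 0;
    }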

In fact, whenever the CPU reads data, it always accesses the cache first, whether or not the data is cached. Only when the data is not found in the cache does it access memory, load the data from memory into the cache, and then read it from the CPU cache.


This access mechanism follows the same logic as using memory as a cache for the hard disk: if the data is cached in memory, return it directly; otherwise, access the slower disk.

How does the CPU know whether the memory data it wants is in the cache? And if it is, how does it find the corresponding cached data? Let's start from the simplest and most basic strategy, the direct mapped cache (Direct Mapped Cache), to understand the data structure and access logic of the whole CPU cache.

As mentioned earlier, when the CPU accesses memory data, it reads in small chunks whose size depends on coherency_line_size, usually 64 bytes. In memory, such a chunk of data is called a memory block (Block); when reading, the CPU needs the address of the memory block the data lives in.

For the direct mapped cache, the strategy is to "map" a memory block address to a CPU line address. The mapping is implemented with a modulo operation: the result of the modulo is the index of the CPU line that the memory block address maps to.

For example, suppose memory is divided into 32 memory blocks and the CPU cache has 8 CPU lines. If the CPU wants to access memory block 15, and the data in memory block 15 has been cached, it must be in CPU line 7, because 15 % 8 = 7.

Sharp-eyed readers will have noticed that with modulo mapping, multiple memory blocks map to the same CPU line. In the example above, besides memory block 15, memory blocks 7, 23, and 31 all map to CPU line 7.


Therefore, to distinguish different memory blocks, we also store a group tag in each CPU line. The group tag records which memory block the data currently stored in this CPU line came from, so we can use it to tell memory blocks apart.

In addition to the group tag, a CPU line holds two more pieces of information:

  • One is the actual data (Data) loaded from memory;
  • The other is a valid bit (Valid bit), which marks whether the data in the CPU line is valid. If the valid bit is 0, the CPU accesses memory directly and reloads the data, regardless of what is in the CPU line.

When the CPU reads data from the CPU cache, it does not read the entire data block in the CPU line; it reads just the piece of data it needs, called a word (Word). How does it locate the required word within the data block of the corresponding CPU line? With an offset.

Therefore, a memory access address contains three kinds of information: the group tag, the CPU line index, and the offset. With these, the CPU can locate cached data in the CPU cache. Correspondingly, the data structure in the CPU cache consists of an index, a valid bit, a group tag, and a data block.


If the data in memory is already in the CPU cache, the CPU goes through these four steps when it accesses a memory address (a minimal C sketch follows the list):

  1. Use the index field of the memory address to compute the index into the CPU cache, i.e., locate the corresponding CPU line;
  2. Check the valid bit in that CPU line to confirm the data it holds is valid. If it is invalid, the CPU accesses memory directly and reloads the data; if it is valid, proceed to the next step;
  3. Compare the group tag in the memory address with the group tag stored in the CPU line to confirm the CPU line holds the memory data we want. If not, the CPU accesses memory directly and reloads the data; if so, proceed to the next step;
  4. Use the offset field of the memory address to read the required word out of the CPU line's data block.
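A minimal C sketch of this four-step lookup, under assumed parameters (8 CPU lines, 64-byte data blocks, word-aligned offsets; real hardware differs):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define NUM_LINES   8          /* assumed number of CPU lines        */
    #define BLOCK_SIZE  64         /* assumed data block size in bytes   */
    #define OFFSET_BITS 6          /* log2(64): offset within the block  */
    #define INDEX_BITS  3          /* log2(8):  which CPU line           */

    struct cpu_line {
        bool     valid;            /* valid bit                          */
        uint64_t tag;              /* group tag                          */
        uint8_t  data[BLOCK_SIZE]; /* data block                         */
    };

    static struct cpu_line cache[NUM_LINES];

    /* Returns true on a cache hit and copies the requested word out. */
    bool cache_read(uint64_t addr, uint32_t *word)
    {
        uint64_t offset = addr & (BLOCK_SIZE - 1);
        uint64_t index  = (addr >> OFFSET_BITS) & (NUM_LINES - 1);
        uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        struct cpu_line *line = &cache[index]; /* step 1: locate the line */
        if (!line->valid)                      /* step 2: check valid bit */
            return false;                      /* miss: go to memory      */
        if (line->tag != tag)                  /* step 3: compare tags    */
            return false;                      /* miss: go to memory      */
        memcpy(word, &line->data[offset], sizeof(*word)); /* step 4: word */
        return true;
    }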

By now, I believe you have a good grasp of the direct mapped cache. Besides it, there are other strategies for locating data in the CPU cache by memory address, such as the fully associative cache (Fully Associative Cache) and the set associative cache (Set Associative Cache). Their data structures are all similar; once you understand how the direct mapped cache works, you will quickly understand the other strategies if you care to look into them.


How to write code to make CPU run faster?

We know that accessing memory is more than 100 times slower for the CPU than accessing the CPU cache, so if the data the CPU operates on is in the cache, performance improves greatly. When the accessed data is in the CPU cache, we call it a cache hit. The higher the cache hit rate, the better the code performs and the faster the CPU runs.

So the question "how do you write code that makes the CPU run faster?" can be restated as "how do you write code with a high CPU cache hit rate?".

As mentioned earlier, the L1 cache is usually split into a "data cache" and an "instruction cache" because the CPU handles data and instructions separately. Take the operation 1+1=2: the + (the instruction) is placed in the instruction cache, while the number 1 (the data) is placed in the data cache.

So, let's look at the cache hit rates of the data cache and the instruction cache separately.

How to improve the hit rate of data cache?

Suppose you want to traverse a two-dimensional array. There are two forms, sketched below. Although both produce the same result, which form do you think is more efficient, and why?

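The article's original listing was an image and is not preserved; a minimal sketch of the two forms, assuming an N×N int array, might look like this:

    #define N 1024
    int array[N][N];

    /* Form one: array[i][j] — row-major order, matching the memory layout. */
    void form_one(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                array[i][j] = 0;
    }

    /* Form two: array[j][i] — jumps N elements ahead on every step. */
    void form_two(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                array[j][i] = 0;
    }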

After testing, form one (array[i][j]) executes several times faster than form two (array[j][i]).

The reason for such a big gap is that the memory occupied by the two-dimensional array array is contiguous. For example, if the length N is 2, the array elements are laid out in memory in the following order:

    array[0][0]  array[0][1]  array[1][0]  array[1][1]

Form one uses array[i][j], so the order of access exactly matches the storage order of the elements in memory. When the CPU accesses array[0][0], the data is not yet in the cache, so the following three elements are loaded into the CPU cache along with it "sequentially". When the CPU then accesses those three array elements, it finds them in the CPU cache. The cache hit rate is therefore very high, and hits need no memory access, which greatly improves the code's performance.

If instead we use form two, array[j][i], the order of access is as follows:

    array[0][0]  array[1][0]  array[0][1]  array[1][1]

As you can see, the accesses jump around instead of proceeding sequentially. If N is large, then while operating on array[j][i] there is no way array[j+1][i] has also been read into the CPU cache, so array[j+1][i] must be fetched from memory. Clearly, this discontinuous, jumping access pattern cannot take full advantage of the CPU cache, so the code's performance is lower.

When accessing the array[0][0] element, how many elements does the CPU load from memory into the CPU cache at one time? As mentioned earlier, this is tied to the CPU cache line, which determines how much data the CPU cache loads at once; on Linux you can read it from coherency_line_size as shown earlier, and it is usually 64 bytes.


In other words, when the CPU accesses memory data that is not in the CPU cache, it loads 64 contiguous bytes into the cache at one time. Since array[0][0] occupies less than 64 bytes, array[0][0]~array[0][15] are read into the CPU cache in order. Sequential access via array[i][j] takes advantage of exactly this feature, which is why it is faster than jumping access via array[j][i].

Therefore, when traversing an array, accessing elements in memory-layout order makes effective use of the CPU cache, and the performance of our code improves substantially.

How to improve the hit rate of instruction cache?

The way to raise the data cache hit rate is to access data in memory-layout order. What about the instruction cache?

Let's take an example. Suppose we have a one-dimensional array filled with random numbers between 0 and 100.


Next, we perform two operations on this array (a sketch of both follows the list):


  • In the first operation, loop through the array and set the array elements less than 50 to 0;
  • The second operation is to sort the array;
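The original listings were images; a minimal sketch of the setup and both operations (the array size and helper names are assumptions) might be:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 1024

    int array[N];

    /* Comparison callback for qsort: ascending order. */
    int cmp(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    int main(void)
    {
        /* A one-dimensional array of random numbers between 0 and 100. */
        srand((unsigned)time(NULL));
        for (int i = 0; i < N; i++)
            array[i] = rand() % 100;

        /* Operation 1: traverse and set elements less than 50 to 0. */
        for (int i = 0; i < N; i++)
            if (array[i] < 50)
                array[i] = 0;

        /* Operation 2: sort the array. */
        qsort(array, N, sizeof(int), cmp);
        return 0;
    }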

So here is the question: do you think it is faster to traverse first and then sort, or to sort first and then traverse?

Before answering, let's first learn about the CPU's branch predictor. An if statement means at least two different instruction paths may be executed: the if branch or the else branch. If branch prediction can predict which branch's instructions will execute next, those instructions can be placed into the instruction cache "in advance", so the CPU reads them straight from the cache and executes them very quickly.

When the array elements are random, branch prediction cannot work effectively. When the array elements are ordered, the branch predictor can dynamically predict future branches from the history of taken branches, and the prediction hit rate becomes very high.

Therefore, sorting first and then traversing is faster. After sorting, the numbers run from small to large, so the early loop iterations hit the if < 50 condition over and over; the branch predictor therefore places the array[i] = 0 instruction inside the if into the instruction cache in advance, and subsequent executions only need to read it from the cache.

If you are quite sure how likely the expression in an if is to be true, you can hint the compiler. In C/C++, for example, GCC's __builtin_expect is commonly wrapped into likely and unlikely macros (the Linux kernel defines them this way): if the if condition is very likely true, wrap the expression with likely; otherwise, use unlikely.
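A minimal sketch of such macros and their use, assuming GCC or Clang (the macro definitions below are the conventional ones, not something the compiler ships by itself):

    /* Conventional hint macros built on GCC/Clang's __builtin_expect. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    void zero_small(int *array, int n)
    {
        for (int i = 0; i < n; i++) {
            /* Hint that the condition is usually true, so the compiler
             * lays out the hot path for the common case. */
            if (likely(array[i] < 50))
                array[i] = 0;
        }
    }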


That said, the CPU's own dynamic branch prediction is already quite accurate, so it is recommended to use these two macros only when you are very sure the CPU's prediction is poor and you actually know the real probability.

How to improve the cache hit rate of multi-core CPU?

On a single-core CPU, only one process can execute at a time, but the operating system gives each process a time slice and schedules the next process when the slice runs out, so processes take turns occupying the CPU. Viewed macroscopically, all the processes appear to be running at the same time.

Modern CPUs are multi-core, and a process may bounce back and forth between cores, which is bad for the CPU cache. Although the L3 cache is shared among cores, the L1 and L2 caches are private to each core: if a process keeps switching cores, every core's cache hit rate suffers. Conversely, if the process always runs on the same core, the hit rates of its L1 and L2 data caches can be kept high, and a high hit rate means the CPU accesses memory less often.

When several "compute-intensive" threads are running at the same time, we can bind each thread to a specific CPU core so that the cache hit rate does not drop from switching cores. This can improve performance considerably.

On Linux, the sched_setaffinity system call implements binding a thread to a CPU core.
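The original listing is not preserved here; a minimal sketch of pinning the calling thread to core 0 might look like this:

    #define _GNU_SOURCE            /* for cpu_set_t and sched_setaffinity */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);            /* start from an empty CPU set */
        CPU_SET(0, &set);          /* allow only CPU core 0       */

        /* pid 0 means "the calling thread". */
        if (sched_setaffinity(0, sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("bound to CPU core 0\n");
        return 0;
    }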



Summary

As computing technology has advanced, the access-speed gap between CPU and memory has grown wider and wider; today it is several hundredfold. To bridge it, the CPU cache was embedded into the CPU as a cache layer between the CPU and memory. Because it sits very close to the CPU core, it is very fast to access; but because the materials it requires are expensive, its capacity is not several GB like memory, only tens of KB to a few MB.

When the CPU accesses data, it accesses the CPU cache first. On a cache hit it returns the data directly, with no need to fetch it from memory each time. So the higher the cache hit rate, the better the code performs.

Note, however, that on a cache miss the CPU does not read just the one datum from memory; it reads a whole chunk of data at a time, stores it in the CPU cache, and only then reads it from the cache.

There are many strategies for mapping memory addresses to CPU cache addresses. A relatively simple one is the direct mapped cache: it splits a memory address into "index + group tag + offset", which lets a large memory address space map onto a very small CPU cache.

If you want to write code that makes the CPU run faster, you need to write code with a high cache hit rate. Since the CPU's L1 cache is split into a data cache and an instruction cache, we need to raise their hit rates separately:

  • For data cache, when traversing data, we should operate according to the order of memory layout. This is because CPU cache operates data in batches according to CPU cache line, so the performance can be effectively improved when operating continuous memory data in sequence;
  • For instruction cache, regular conditional branch statements can make the branch predictor of CPU work and further improve the efficiency of execution;

In addition, on a multi-core CPU system, a thread may switch back and forth between CPU cores, which hurts each core's cache hit rate. So to raise a process's cache hit rate, consider binding its threads to specific CPU cores.


Postscript


Hello, I'm Xiaolin, and I love illustrating computer fundamentals. If you found this article helpful, please share it with your friends and give Xiaolin a "like"; it means a lot to Xiaolin. Thank you for your support, and see you next time!


Recommended reading

Before I knew it, I published three articles this week. If you haven't read the first two yet, go take a look!

Oh, my God! I know that the hard disk is very slow, but I didn’t expect to be 10 million times slower than the CPU cache

The secret of CPU executing program is hidden in these 15 pictures