Think OS, Chapter 7: Caching


Chapter 7 Caching

By Allen B. Downey

Original: Chapter 7 caching

Translator: Flying Dragon

License: CC BY-NC-SA 4.0

7.1 How programs run

To understand caching, you have to understand how computers execute programs. For a deep understanding of this topic, you should study computer architecture. My goal in this chapter is to provide a simple model of program execution.

When a program starts, the code (or "program text") is usually on a hard disk or solid-state drive. The operating system creates a new process to run the program, then the "loader" copies the text from storage into main memory and starts the program by calling main.

While the program is running, most of its data is stored in main memory, but some of it is in registers, which are small, fast storage units on the CPU. These registers include:

  • The program counter (PC), which contains the address (in memory) of the next instruction in the program.

  • The instruction register (IR), which contains the machine code of the instruction currently executing.

  • The stack pointer (SP), which contains the address of the stack frame of the current function, holding its parameters and local variables.

  • General-purpose registers, which hold the data the program is currently working with.

  • The status register, or flag register, which contains information about the current computation. For example, the flag register usually contains a bit that is set if the result of the previous operation was zero.

While the program is running, the CPU executes the following steps, called the "instruction cycle":

  • Fetch: the next instruction is fetched from memory and stored in the instruction register.

  • Decode: part of the CPU, called the "control unit", decodes the instruction and sends signals to the other parts of the CPU.

  • Execute: signals from the control unit cause the appropriate computation to occur.

Most computers can execute a few hundred different instructions, called the "instruction set". But most instructions fall into a few general categories:

  • Load: sends a value from memory to a register.

  • Arithmetic/logic: loads operands from registers, performs an arithmetic operation, and stores the result in a register.

  • Store: sends a value from a register to memory.

  • Jump/branch: modifies the program counter, causing the flow of execution to jump to another location in the program. Branches are usually conditional: they check a flag in the flag register and jump only if it is set.

Some instruction sets, including the ubiquitous x86, provide instructions that combine a load with an arithmetic operation.

During each instruction cycle, one instruction is read from the program text. In addition, about half of the instructions in a typical program load or store data. And therein lies one of the fundamental problems of computer architecture: the "memory bottleneck".

On a current desktop machine, the CPU typically runs at 2 GHz, which means it initiates a new instruction every 0.5 ns. But the time it takes to transfer data to or from memory is about 10 ns. If the CPU has to wait 10 ns to fetch the next instruction and another 10 ns to load its data, it might take 40 clock cycles to complete one instruction.

7.2 Cache performance

The solution to this problem, or at least a partial solution, is caching. A "cache" is a small, fast memory located on the CPU. On current computers, a cache is typically 1–2 MiB, with access times of 1–2 ns.

When the CPU reads data from memory, it stores a copy in the cache. If you read the same data again, the CPU will read the cache directly, without waiting for memory.

Eventually, the cache fills up; to bring in something new, we have to evict some data. So if the CPU loads a value and then reads it again much later, the value might no longer be in the cache.

The performance of many programs is limited by the efficiency of the cache. If the data needed by the CPU is usually in the cache, the program can run at the full speed of the CPU. If the CPU often needs data that is not in the cache, the program will be limited by the speed of memory.

The "hit rate", h, is the fraction of memory accesses that find data in the cache. The "miss rate", m, is the fraction of accesses that have to go to memory. If Th is the time to process a cache hit and Tm is the time for a cache miss, the average time for each memory access is:

h * Th + m * Tm

Equivalently, we can define the "miss penalty", which is the extra time required to process a cache miss, Tp = Tm - Th. Since h = 1 - m, the average access time can be rewritten as:

Th + m * Tp

When the miss rate is low, the average access time approaches Th; in other words, the program can perform as if memory ran at cache speed.

7.3 Locality

When a program reads a byte for the first time, the cache usually loads a "block" or "line" of data that includes the requested byte and some of its neighbors. If the program goes on to read any of those neighbors, they are already in the cache.

For example, suppose the block size is 64 B, and you read a string with length 64 whose first byte happens to fall at the beginning of a block. When you load the first byte, you incur a miss penalty, but after that the rest of the string is in the cache. After reading the whole string, the hit rate is 63/64. If the string spans two blocks, you incur two miss penalties. But even then, the hit rate is 62/64, almost 97%.

On the other hand, if the program jumps around unpredictably, reading data from scattered locations in memory and seldom accessing the same location twice, cache performance will be poor.

The tendency of a program to use the same data more than once is called "temporal locality". The tendency to use data in nearby locations is called "spatial locality". Fortunately, many programs exhibit both kinds of locality naturally:

  • Most programs contain blocks of code with no jumps or branches. Within these blocks, instructions run sequentially, so the access pattern has spatial locality.

  • In a loop, the program executes the same instructions many times, so the access pattern has temporal locality.

  • The result of one instruction is often used as an operand of the next instruction, so the data access pattern has temporal locality.

  • When a program executes a function, its parameters and local variables are stored on the stack; accessing these values has spatial locality.

  • One of the most common programming idioms is to traverse the elements of an array, reading or writing them in sequence. This pattern also has spatial locality.

In the next section, we will explore the relationship between a program's access pattern and its cache performance.

7.4 Measuring cache performance

When I was a graduate student at UC Berkeley, I was a teaching assistant for Brian Harvey's computer architecture class. One of my favorite exercises involved a program that iterates through an array, reading and writing elements, and measures the average time per operation. By varying the size of the array, it is possible to infer the size of the cache, the block size, and some other attributes.

My modified version of this program is in the cache directory of the repository for this book.

The kernel of the program is this loop:

iters = 0;
do {
    sec0 = get_seconds();

    for (index = 0; index < limit; index += stride) 
        array[index] = array[index] + 1;
    iters = iters + 1; 
    sec = sec + (get_seconds() - sec0);
} while (sec < 0.1);

The inner for loop traverses the array. limit determines how much of the array it traverses; stride determines how many elements it skips over. For example, if limit is 16 and stride is 4, the loop accesses elements 0, 4, 8, and 12.

sec keeps track of the total CPU time used by the inner loop. The outer loop runs until sec exceeds 0.1 seconds, which is long enough that we can compute the average time with reasonable precision.

get_seconds uses the system call clock_gettime, converts the result to seconds, and returns it as a double:

double get_seconds(){
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}


Figure 7.1: Average miss penalty as a function of array size and stride.

To isolate the time spent accessing the data, the program runs a second loop that is identical except that the inner loop does not access the array; it only adds the loop index to a temporary variable:

iters2 = 0;
do {
    sec0 = get_seconds();
    for (index = 0; index < limit; index += stride) 
        temp = temp + index;
    iters2 = iters2 + 1;
    sec = sec - (get_seconds() - sec0);

} while (iters2 < iters);

The second loop runs the same number of iterations as the first. After each iteration, it subtracts the elapsed time from sec. When the loop completes, sec contains the total time for all of the array accesses, minus the time spent incrementing temp. This difference is the total miss penalty incurred by all of the accesses. Finally, we divide by the number of accesses to get the average miss penalty per access, in ns:

sec * 1e9 / iters / limit * stride

If you compile and run cache.c, you should see output like this:

Size:    4096 Stride:       8 read+write: 0.8633 ns
Size:    4096 Stride:      16 read+write: 0.7023 ns
Size:    4096 Stride:      32 read+write: 0.7105 ns
Size:    4096 Stride:      64 read+write: 0.7058 ns

If you have Python and matplotlib installed, you can use graph_data.py to plot the results. Figure 7.1 shows the results from running it on a Dell Optiplex 7010. Notice that the array size and stride are reported in bytes, not in numbers of array elements.

Take a minute to consider this graph and see what you can infer about the cache. Here are some things to think about:

  • The program reads through the array many times, so it has plenty of temporal locality. If the entire array fits in the cache, the average miss penalty should be near zero.

  • When the stride is 4 bytes, we read every element of the array, so the program has plenty of spatial locality. If the block size is big enough to contain 64 elements, for example, the hit rate would be 63/64, even if the array does not fit in the cache.

  • If the stride is the same as the block size (or bigger), the spatial locality is effectively zero, because each time we read a block, we access only one element. In that case, we will see the maximum miss penalty.

In summary, we expect good cache performance if the array is smaller than the cache size or if the stride is smaller than the block size. Performance only degrades if the array is bigger than the cache and the stride is large.

In Figure 7.1, cache performance is good, for all strides, as long as the array is less than 2**22 bytes. We might infer that the cache size is near 4 MiB; in fact, according to the specifications, it is 3 MiB.

When the stride is 8, 16, or 32 B, cache performance is good. At 64 B it begins to degrade, and for larger strides the average miss penalty is about 9 ns. We can infer that the block size is 128 B.

Many processors use multi-level caches that include a small, fast cache and a bigger, slower cache. In this example, the miss penalty seems to increase a little when the array size is bigger than 2**14 B, so it's possible that this processor also has a 16 KiB cache with an access time less than 1 ns.

7.5 Programming for cache performance

Memory caching is implemented in hardware, so most of the time programmers don't need to know much about it. But if you know how caches work, you can write programs that use them more effectively.

For example, if you are working with a large array, it might be faster to traverse the array once, performing several operations with each element, rather than traversing the array several times.

If you are working with a two-dimensional array, it might be stored as an array of rows. If you traverse through the elements, it would be faster to traverse by row, with stride equal to the element size, rather than by column, with stride equal to the row length.

Linked data structures don't always exhibit spatial locality, because the nodes aren't necessarily contiguous in memory. But if you allocate many nodes at the same time, they are usually co-located in the heap. Or, even better, if you allocate an array of nodes all at once, you know they will be contiguous.

Recursive strategies like mergesort often have good cache behavior because they break big arrays into smaller pieces and then work with the pieces. Sometimes these algorithms can be tuned to take advantage of cache behavior.

For applications where performance is critical, it is possible to design algorithms tailored to the cache size, block size, and other hardware characteristics. Algorithms like that are called "cache-aware". The obvious drawback of cache-aware algorithms is that they are hardware-specific.

7.6 The memory hierarchy

At several points in this chapter, you might have asked yourself: "If caches are so much faster than main memory, why not make a really big cache and forget about main memory?"

Without going too far into computer architecture, there are two reasons: electronics and economics. Caches are fast because they are small and close to the CPU, which minimizes delays due to capacitance and signal propagation. If you make a cache big, it will be slower.

In addition, caches take up space on the processor chip, and bigger chips are more expensive. Main memory is usually dynamic random-access memory (DRAM), which uses only one transistor and one capacitor per bit, so it is possible to pack more memory into the same amount of space. But this way of implementing memory is slower than the way caches are implemented.

Also, main memory is usually packaged in a dual in-line memory module (DIMM) that includes 16 or more chips. Several small chips are cheaper than one big one.

The trade-off between speed, size, and cost is the fundamental reason for caching. If there were a memory technology that was fast, big, and cheap, we wouldn't need anything else.

The same principle applies to storage as well as memory. Solid-state drives are fast, but they are more expensive than hard drives, so they tend to be smaller. Tape drives are even slower than hard drives, but they can store large amounts of data relatively cheaply.

The table below shows the typical access time, size, and cost for each technology.

Device       Access time   Typical size   Cost
Register     0.5 ns        256 B          ?
Cache        1 ns          2 MiB          ?
DRAM         10 ns         4 GiB          $10 / GiB
SSD          10 µs         100 GiB        $1 / GiB
HDD          5 ms          500 GiB        $0.25 / GiB
Tape         minutes       1–2 TiB        $0.02 / GiB

The number and size of registers depends on details of the architecture. Current computers have about 32 general-purpose registers, each of which can store one "word". On a 32-bit computer, a word is 32 bits, or 4 bytes. On a 64-bit computer, a word is 64 bits, or 8 bytes. So the total size of the register file is 100–300 bytes.

The cost of registers and caches is hard to quantify. They contribute to the cost of the chips they are part of, but consumers don't see that cost directly.

For the other numbers in the table, I looked at the specifications for typical hardware for sale from online computer stores. By the time you read this, these numbers will be obsolete, but they give you an idea of what the performance and cost gaps looked like at one point in time.

These technologies make up the "memory hierarchy". Each level of the hierarchy is bigger and slower than the one above it, and in some sense, each level acts as a cache for the one below it. You can think of main memory as a cache for programs and data that are stored permanently on SSDs and HDDs. And if you are working with very large datasets stored on tape, you could use hard drives to cache subsets of the data.

7.7 Caching policy

The memory hierarchy suggests a framework for thinking about caching. At every level of the hierarchy, we have to address four fundamental questions of caching:

  • Who moves data up and down the hierarchy? At the top of the hierarchy, register allocation is usually done by the compiler. Hardware on the CPU handles the memory cache. Users implicitly move data from storage to memory when they execute programs and open files, but the operating system also moves data back from memory to storage. At the bottom of the hierarchy, administrators move data explicitly between disk and tape.

  • What gets moved? In general, block sizes are small at the top of the hierarchy and bigger at the bottom. In a memory cache, a typical block size is 128 B. Pages in memory might be 4 KiB, but when the operating system reads a file from disk, it might read tens or hundreds of blocks at a time.

  • When does data get moved? In the most basic caches, data is moved into the cache when it is used for the first time. But many caches use some kind of "prefetching", meaning that data is loaded before it is explicitly requested. We have already seen one form of prefetching: loading an entire block when only part of it is requested.

  • Where in the cache does the data go? When the cache is full, we can't bring anything in without kicking something out. Ideally, we want to keep data that will be used again soon and replace data that won't.

The answers to these questions make up the "cache policy". Near the top of the hierarchy, cache policies tend to be simple because they are so fast and are implemented in hardware. Near the bottom, there is more time to make decisions, and well-designed policies can make a big difference.

Most cache policies are based on the principle that history repeats itself: if we have information about the recent past, we can use it to predict the immediate future. For example, if a block of data has been used recently, we expect it to be used again soon. This principle suggests a replacement policy called "least recently used", or LRU, which removes from the cache the block that has gone unused for the longest time. For more on this topic, see the Wikipedia page on cache algorithms.

7.8 Paging

In systems with virtual memory, the operating system can move pages back and forth between storage and memory. As I mentioned in Section 6.2, this mechanism is called "paging", or sometimes "swapping".

Here is the workflow:

  1. Process A calls malloc to allocate a page. If there is no free space in the heap with the requested size, malloc calls sbrk to ask the operating system for more memory.

  2. If there is a free page in physical memory, the operating system adds it to the page table for Process A, creating a new range of valid virtual addresses.

  3. If there are no free pages, the paging system chooses a "victim page" belonging to Process B. It copies the contents of the victim page from memory to disk, then it modifies the page table for Process B to indicate that this page is "swapped out".

  4. Once the data from Process B is written, the page can be reallocated to Process A. To prevent Process A from reading Process B's data, the page should be cleared.

  5. This is when the call to sbrk returns, giving malloc additional space in the heap. Then malloc allocates the requested memory and returns. Process A can resume.

  6. When Process A completes, or is interrupted, the scheduler might allow Process B to resume. When Process B accesses a page that has been swapped out, the memory management unit notices that the page is "invalid" and causes an interrupt.

  7. When the operating system handles the interrupt, it sees that the page is swapped out, so it transfers the page back from disk to memory.

  8. Once the page is swapped in, Process B can resume.

When paging works well, it can greatly improve the utilization of physical memory, allowing more processes to run in less space. Here's why:

  • Most processes don't use all of their allocated memory. Many parts of the text segment are never executed, or are executed once and never again. Those pages can be swapped out without causing any problems.

  • If a program leaks memory, it might lose track of allocated space and never use it again. By swapping those pages out, the operating system can effectively plug the leak.

  • On most systems, there are processes, like daemons, that sit idle most of the time and only "wake up" to respond to events occasionally. While they are idle, these processes can be swapped out.

  • In addition, there might be many processes running the same program. These processes can share the same text segment, avoiding the need to keep multiple copies in physical memory.

If you add up the total memory allocated to all processes, it can exceed the size of physical memory, and the system can still run well.

Up to a point.

When a process accesses a page that's swapped out, it has to get the data back from disk, which can take several milliseconds. The delay is often noticeable: if you leave a window idle for a while and then switch back to it, it might run slowly at first, and you might hear the disk drive working while pages are swapped in.

Occasional delays like that might be acceptable, but if you have too many processes using too much space, they start to interfere with each other. When Process A runs, it evicts the pages Process B needs, and then when B runs, it evicts the pages A needs. When this happens, both processes slow to a crawl and the system can become unresponsive. This scenario, which we would rather avoid, is called "thrashing".

In theory, operating systems could avoid thrashing by detecting an increase in paging and blocking or killing processes until the system becomes responsive again. But as far as I can tell, most systems don't do this, or don't do it well; it is often left to users to limit their use of physical memory, or to try to recover from thrashing when it occurs.