We all know that memory access is normally very fast. Today, let's dig a little deeper and think about two questions.
Question 1: What is the latency of a memory access? Can you make a rough estimate?
For example, the Speed reported for my memory module is 1066 MHz. Does that mean the memory IO latency is 1 s / 1066 MHz = 0.93 ns?
That calculation is badly wrong.
Question 2: Is random IO slower than sequential IO in memory?
We all know that random IO on a disk is much slower than sequential IO (the operating system even implements an elevator scheduling algorithm to alleviate the problem). Is random IO in memory also slower than sequential IO?
To answer both questions properly, we have to look at the physical process of memory IO.
First, a story about a librarian
Before diving into the dry details of how memory works, I want to tell you a story and introduce you to someone: the librarian.
In our story, you are the protagonist. You have a house, and in the house there is a servant who helps you process all kinds of books every day. But housing in Beijing is expensive, so your house is very small and can hold only 64 books. Across the street is the Beijing Library (your house may be small, but the location is good). Every book you need can be found there, and a librarian is responsible for helping you find the books you want.
OK, the director shouts “Action!” and the scene begins.
Scene 1: You find that you need the result from book number 0. Your servant crosses the road and asks the librarian for books 0-63. The librarian looks the books up on the computer and finds that they are on the second floor, so he spends some time taking the elevator there, and then some more time finding the books on that floor. Your servant then carries the 64 books home, puts them in the living room, picks up book 0, and starts processing it for you.
Scene 2: You find that you need the result from book number 1 and tell your servant. He can take it straight from the living room. This is the shortest wait of all.
Scene 3: You find that you need book number 65 and tell your servant. He crosses the road to the librarian, who is still on the second floor. This time the request is for books 64-127, so the librarian does not have to change floors; he only spends time finding the books. Your servant puts books 64-127 in the living room (the previous books 0-63 are thrown out) and starts processing book 65.
Scene 4: You find that you need book number 10000 and tell your servant. He crosses the road to the library and finds the librarian, who discovers that the book is on the 10th floor. He has to spend time taking the elevator again, and once he gets there he has to spend more time finding the book.
In these four scenes, you have probably noticed how the time taken differs from case to case.
- Scenes 1 and 4 take the longest, because the librarian has to spend time taking the elevator to the right floor and then more time finding the books on that floor.
- Scene 3 is next, because the librarian is already on the right floor and only needs to spend time finding the books.
- Scene 2 is the fastest, because the servant only has to fetch the book from the living room and does not even need to cross the road.
I made up this example because memory works in much the same way. Now let's analyze how memory actually works.
Physical structure of memory
In “Take you to deeply understand the bottom principle of memory alignment”, we learned about the physical structure of memory chips and the IO process. Let's review it here.
Memory is composed of chips. Inside each chip, there are eight banks. Its structure is shown in the figure below:
Each bank is a matrix on a two-dimensional plane, as we mentioned in the previous article. Each element in the matrix holds one byte, that is, 8 bits.
Whenever the CPU requests data from memory, the eight banks work in parallel. Once the row address is located, each bank copies the corresponding row into its row buffer. The column address then selects the element to read from that row, and the data from the eight banks is spliced together into one 64-bit-wide word that is returned to the CPU.
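The row-buffer-then-column read described above can be pictured with a small Python sketch. Everything in it is an illustrative assumption rather than real hardware behaviour: the `Bank` class, the `read_word` helper, the 4 x 4 bank size, the fill values, and the choice that bank 0 supplies the least significant byte.

```python
class Bank:
    """Toy model of one bank: a 2-D matrix of bytes plus a row buffer."""

    def __init__(self, rows, cols, fill):
        # Every cell holds one byte; here each bank is filled with one value.
        self.cells = [[fill] * cols for _ in range(rows)]
        self.row_buffer = None

    def read(self, row, col):
        # Step 1: copy the addressed row into the row buffer.
        self.row_buffer = list(self.cells[row])
        # Step 2: pick the byte at the column address out of the buffer.
        return self.row_buffer[col]


def read_word(banks, row, col):
    """Splice one byte from each of the eight banks into a 64-bit word."""
    word = 0
    for index, bank in enumerate(banks):
        # Assumption: bank 0 provides the least significant byte.
        word |= bank.read(row, col) << (8 * index)
    return word


# Eight hypothetical banks, filled with 0x10, 0x11, ..., 0x17.
banks = [Bank(rows=4, cols=4, fill=0x10 + i) for i in range(8)]
word = read_word(banks, row=2, col=1)
print(hex(word))  # 0x1716151413121110
```

Each bank contributes 8 bits, so eight banks together produce the 64-bit value returned to the CPU.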
The figures above give a rough picture of the memory IO process. There is a delay between each operation in this process, so let's look at these delays next.
Memory IO delay
At the end of “From DDR to DDR4, the memory core frequency has not made much progress”, we mentioned that memory has four timing parameters: CL, tRCD, tRP, and tRAS. Let's look at what each of them means in detail:
- CL (CAS latency): the number of clock cycles between sending a column address to memory and the start of the data response
- tRCD (row address to column address delay): the minimum number of clock cycles between opening a row of memory and accessing its columns
- tRP (row precharge time): the minimum number of clock cycles between issuing a precharge command and opening the next row
- tRAS (row active time): the minimum number of clock cycles between a row active command and the precharge command, which limits how soon the next precharge can be issued
Note that CL is a fixed number of cycles, while the other three are minimum numbers of cycles. All of these parameters are measured in clock cycles. Modern memory transfers data on both the rising and falling edges of each clock cycle, so the clock frequency is Speed / 2. For example, my machine reports a Speed of 1066 MHz, so the clock frequency is 533 MHz. You can check the Speed of your own machine with the following command:
# dmidecode | grep -P -A16 "Memory Device"
Memory Device
......
Speed: 1067 MHz
......
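As a rough sketch of the Speed / 2 rule, the Python snippet below pulls the Speed value out of dmidecode-style text and converts it into a clock frequency and a cycle time. The helper names `parse_speed_mhz` and `clock_from_speed` and the `sample_output` string are hypothetical; the sample only mimics the two lines quoted in this article, and the pattern also accepts `MT/s` on the assumption that some dmidecode versions print that unit instead of `MHz`.

```python
import re


def parse_speed_mhz(dmidecode_text):
    """Return the first 'Speed: N MHz' value in the text, or None."""
    # Anchored at the start of a line so that other fields whose names
    # merely end in "Speed:" are not picked up by mistake.
    match = re.search(r"^\s*Speed:\s*(\d+)\s*(?:MHz|MT/s)",
                      dmidecode_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def clock_from_speed(speed_mhz):
    """Apply the Speed / 2 rule and return (clock in MHz, cycle time in ns)."""
    clock_mhz = speed_mhz / 2       # data moves on both clock edges
    cycle_ns = 1000.0 / clock_mhz   # 1 / (f MHz) = 1000 / f nanoseconds
    return clock_mhz, cycle_ns


# Hypothetical sample that only mimics the lines shown in the article.
sample_output = """\
Memory Device
\tSpeed: 1067 MHz
"""

speed = parse_speed_mhz(sample_output)
clock_mhz, cycle_ns = clock_from_speed(speed)
print(f"Speed {speed} MHz -> clock {clock_mhz} MHz, {cycle_ns:.2f} ns per cycle")
```

On a real machine the text could come from running `dmidecode --type memory` (which normally needs root) and passing its output to `parse_speed_mhz`.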
Just like the librarian, memory has a similar set of working scenarios:
Scenario 1: Your process needs one byte of data at memory address 0x0000. The CPU sends a request to the memory controller. The memory controller precharges the row, which takes tRP clock cycles. It then issues the command to open the row, which takes tRCD clock cycles. Then it sends the column address and waits another CL cycles. Finally, all the data at 0x0000-0x0007 is returned to the CPU, which puts it into its own cache and starts computing on the byte at 0x0000.
Scenario 2: Your process needs one byte of data at memory address 0x0003. The CPU finds it in its own cache and uses it directly. There is no memory IO at all in this scenario.
Scenario 3: Your process needs one byte of data at memory address 0x0008. The CPU cache misses, so the CPU sends a request to the memory controller. The memory controller finds that the row address is the same as in the previous access, so it only has to send the column address and wait CL cycles before it gets the data at 0x0008-0x000F and returns it to the CPU.
Scenario 4: Your process needs one byte of data at memory address 0xF000. Again the CPU cache misses and the request goes to the memory controller. The memory controller takes a look (feeling a little depressed): the row address has changed again, just as in scenario 1. It has to wait tRP + tRCD + CL cycles again before it can get the data and return it.
In a real computer, memory IO also involves translating logical addresses into physical addresses. We ignore that step here.
Scenarios 1 and 4 are random IO, scenario 2 involves no memory IO, and scenario 3 is sequential IO. From this description we can draw a conclusion: just like a disk, memory also has the problem that random IO is slower than sequential IO. If the row address differs from that of the previous access, the row has to be copied into the row buffer again, and the latency is tRP + tRCD + CL cycles. For sequential IO (the row address stays the same), only CL cycles are needed.
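The four scenarios can be replayed with a toy Python model. This is only a sketch under stated assumptions: the function name `classify_accesses` is hypothetical, the 1024-byte row size is made up for illustration, the CPU cache is reduced to a single 8-byte line, and the 7-7-7 defaults are the timings this article quotes for the author's machine.

```python
def classify_accesses(addresses, cl=7, trcd=7, trp=7,
                      row_bytes=1024, line_bytes=8):
    """Label each access as a cache hit, row hit or row miss.

    Assumptions: one open row, a one-line CPU cache, and a
    hypothetical row size of `row_bytes` bytes.
    """
    cached_line = None   # the 8-byte block currently held by the CPU
    open_row = None      # the row currently sitting in the row buffer
    results = []
    for addr in addresses:
        line = addr // line_bytes
        row = addr // row_bytes
        if line == cached_line:
            # Scenario 2: served from the CPU cache, no memory IO.
            results.append((addr, "cache hit", 0))
            continue
        if row == open_row:
            # Scenario 3: same row as last time, only CL is paid.
            kind, cycles = "row hit", cl
        else:
            # Scenarios 1 and 4: precharge, open the row, then read.
            kind, cycles = "row miss", trp + trcd + cl
        open_row, cached_line = row, line
        results.append((addr, kind, cycles))
    return results


results = classify_accesses([0x0000, 0x0003, 0x0008, 0xF000])
for addr, kind, cycles in results:
    print(f"0x{addr:04X}: {kind:9s} {cycles:2d} cycles")
```

With these inputs the model reports 21, 0, 7 and 21 cycles, matching the order of costs in the four scenarios.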
Now let's estimate the memory latency using the memory parameters of my machine.
The Speed value reported by dmidecode divided by 2 gives the clock frequency: 1066 / 2 = 533 MHz. The timings are 7-7-7-24.
- Random IO
In this case, tRP + tRCD + CL clock cycles are required: 7 + 7 + 7 = 21 cycles. However, tRAS also applies: consecutive row precharges must be at least 24 cycles apart. So we have to calculate with 24 cycles: 24 * (1 s / 533 MHz) = 45 ns
- Sequential IO
In this case, only CL clock cycles are required: 7 * (1 s / 533 MHz) = 13 ns
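The two estimates above can be reproduced with a few lines of Python. The helper name `latency_ns` is hypothetical, and taking the larger of tRP + tRCD + CL and tRAS is simply this article's rule of thumb for random IO, not a full DRAM timing model.

```python
def latency_ns(cycles, clock_mhz):
    """Convert a number of clock cycles into nanoseconds."""
    return cycles * 1000.0 / clock_mhz


CLOCK_MHZ = 533                      # Speed 1066 MHz / 2
CL, TRCD, TRP, TRAS = 7, 7, 7, 24    # the 7-7-7-24 timings

# Random IO: tRP + tRCD + CL, but never fewer cycles than tRAS.
random_cycles = max(TRP + TRCD + CL, TRAS)
random_ns = latency_ns(random_cycles, CLOCK_MHZ)

# Sequential IO: the row is already open, so only CL is paid.
sequential_ns = latency_ns(CL, CLOCK_MHZ)

print(f"random IO:     {random_cycles} cycles = {random_ns:.1f} ns")
print(f"sequential IO: {CL} cycles = {sequential_ns:.1f} ns")
```

This prints about 45.0 ns for random IO and 13.1 ns for sequential IO, roughly a factor of 3.4 between the two.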
Extension: revisiting the CPU cache line
For memory, a single random IO costs several times as much as a sequential IO. So the computer tries to keep memory IO sequential wherever it can. The key is the cache line. When the CPU has a cache miss, it never actually requests just 1 byte or 8 bytes from memory. Instead, it fetches 64 bytes at a time and puts them in its own cache.
Using the example above:
- If 8 bytes are requested randomly: the time taken is 45 ns
- If 64 bytes are requested randomly: the time taken is 45 + 7 * 13 = 136 ns
The extra cost is modest, because only the first 8 bytes are random IO and the following seven 8-byte reads are sequential IO. The data is 8 times as much, but the IO time is only about 3 times as long, and the extra data fetched is likely to be needed soon. That is why the computer works this way internally: it helps you avoid some random IO!
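The 45 + 7 * 13 arithmetic generalizes to any request size. The sketch below assumes, as the example does, one random access followed by sequential accesses of 8 bytes each; the function name `fetch_ns` is hypothetical and the 45 ns and 13 ns figures are the rounded estimates from this article.

```python
RANDOM_NS = 45       # first access: the row has to be opened
SEQUENTIAL_NS = 13   # later accesses in the same row: only CL
BUS_BYTES = 8        # 64 bits returned per access


def fetch_ns(nbytes):
    """Estimated time to fetch nbytes starting with one random access."""
    accesses = -(-nbytes // BUS_BYTES)   # ceiling division
    return RANDOM_NS + (accesses - 1) * SEQUENTIAL_NS


small = fetch_ns(8)
line = fetch_ns(64)
print(f"8 bytes:  {small} ns")
print(f"64 bytes: {line} ns ({line / small:.1f}x the time for 8x the data)")
```

Fetching a full 64-byte cache line costs 136 ns against 45 ns for 8 bytes, which is the "8 times the data for about 3 times the time" trade-off.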
In addition, memory also supports burst mode. In this mode, the row and column addresses are passed in only once, and the memory is told to return a run of consecutive data starting at that address, for example 64 bytes. Only the first 8 bytes incur the real row and column access latency; the following seven 8-byte transfers are sent out directly at the memory's data rate, which takes less time.
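One way to read the burst-mode description is as a back-of-the-envelope estimate: pay the full random-access latency once, then one data-rate tick per remaining 8-byte transfer. The numbers below reuse this article's 45 ns random latency and 1066 MHz Speed; the variable names are hypothetical, and the result is an assumption-laden estimate, not a measurement.

```python
DATA_RATE_MHZ = 1066     # the Speed value, one transfer per tick
FIRST_ACCESS_NS = 45.0   # full row + column latency, paid once
BURST_LENGTH = 8         # eight 8-byte transfers = 64 bytes

# Time per transfer at the data rate: 1000 / 1066 is about 0.94 ns.
transfer_ns = 1000.0 / DATA_RATE_MHZ

# The first transfer pays the full latency, the other seven stream out.
burst_ns = FIRST_ACCESS_NS + (BURST_LENGTH - 1) * transfer_ns

print(f"per transfer: {transfer_ns:.2f} ns")
print(f"64-byte burst: {burst_ns:.1f} ns")
```

Under these assumptions a 64-byte burst takes about 51.6 ns, compared with the 136 ns estimate for eight separate accesses.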
The memory series from “developing internal strength training”:
- 1. Take you to deeply understand the bottom principle of memory alignment
- 2. Random memory access is also slower than sequential access: an in-depth look at the memory IO process
- 3. From DDR to DDR4, the memory core frequency has not made much progress
- 4. Measuring the actual access latency difference between sequential IO and random IO
- 5. Expose the “lies” of memory manufacturers and measure the real performance of memory bandwidth
- 6. Memory access latency differences under NUMA architecture!
- 7. The essence of PHP7 memory performance optimization
- 8. A hands-on project in improving memory performance
- 9. Challenging the maximum memory limit of a single Redis instance and “encountering” the NUMA trap!