Memory Barrier and Its Application in the JVM

Date: 2020-06-17

Author: Guo Rui, Senior Engineer, LeanCloud Backend

Video version of this talk: Memory Barrier and Its Application in the JVM

MESI

Here is the Wikipedia entry for MESI: MESI protocol – Wikipedia. MESI is a cache coherence protocol, named after the four states a cache line can be in: Modified, Exclusive, Shared, and Invalid.

  • Modified: the CPU owns the cache line and has modified its data. Before reusing this cache line to store other data, the CPU must write the modified data back to main memory, or transfer the cache line to another CPU;
  • Exclusive: like Modified, the CPU owns the cache line, but has not modified it yet. The CPU can discard the data directly or transfer it to another CPU;
  • Shared: the cache line's data is up to date and can be discarded or transferred to other CPUs, but the current CPU cannot modify it; to modify it, the line must first transition to the Exclusive state;
  • Invalid: the cache line holds no valid data, which is equivalent to being empty. When a CPU looks for an empty cache line to hold new data, it looks for lines in the Invalid state;

There is a great visualization tool that shows how a cache line flows between these four states: the VivioJS MESI animation. Both the address bus and the data bus are visible to all CPUs. For example, if CPU 0 wants to read data, it first puts the request on the address bus; memory and all other CPUs see the request, and when memory returns the data over the data bus, all CPUs see the data as well. As I understand it, this is what makes bus snooping possible. The tool also lets you scroll the mouse wheel to step through the data flow clock by clock.


Most of the content and figures below come from Appendix C of *Is Parallel Programming Hard, And, If So, What Can You Do About It?*. Because the real MESI protocol is complex and its state transitions are intricate, the book simplifies the protocol and introduces it in a more intuitive way, which makes it much easier to understand. If you want to know what the protocol really looks like, see the Wikipedia entry mentioned above.

Protocol Messages

  • Read: reads the data at a physical memory address;
  • Read Response: carries the data requested by a Read; it can come from main memory or from another CPU. For example, if the requested data sits in a cache line in the Modified state on another CPU, that CPU responds to the Read;
  • Invalidate: carries a physical memory address, telling all other CPUs to evict the cache line for that address from their caches;
  • Invalidate Acknowledge: after receiving an Invalidate and cleaning up its own cache, a CPU must reply with an Invalidate Acknowledge;
  • Read Invalidate: equivalent to a Read plus an Invalidate. The receiving CPU must send back a Read Response and also clean up its own cache, replying with an Invalidate Acknowledge when done, i.e. it replies twice;
  • Writeback: carries an address and the data to be written; used to write data back to main memory or to some other place.

A Writeback is usually triggered when a CPU's cache runs short of space. For example, new data needs to be loaded but the cache is full, so some cache line must be chosen and discarded. If the discarded cache line is in the Modified state, discarding it triggers a Writeback: the data is written to main memory, or to an outer-level cache of the same CPU, or directly to another CPU. For instance, if some CPU read this data before, it may still be interested in it, so to keep the data in cache a Writeback may send the data to that CPU.

An example:

[Figure: example operation sequence]

In the table, the leftmost column is the sequence number of each step, CPU is the number of the CPU performing the operation, and Operation is the operation performed; RMW means read-modify-write. In the Memory column, V indicates that the data in memory is valid.

  1. At the beginning, all cache lines are Invalid;
  2. CPU 0 reads the data at address 0 with a Read message, and the cache line for address 0 becomes Shared;
  3. CPU 3 then also issues a Read for address 0, and its cache line for address 0 becomes Shared as well;
  4. CPU 0 reads address 8 from memory, replacing the cache line that previously held address 0. The cache line for address 8 is also marked Shared;
  5. CPU 2 wants to read and then modify the data at address 0, so it sends a Read Invalidate request: it first loads the address-0 data into a cache line, then makes CPU 3, which still caches address 0, set its line to Invalid;
  6. CPU 2 modifies the address-0 data, so its cache line enters the Modified state and the copy in memory becomes invalid;
  7. CPU 1 sends a Read Invalidate request, obtains the latest value of address 0 from CPU 2, and sets CPU 2's cache line to Invalid. After reading it, CPU 1 modifies the data, and its cache line enters the Modified state. Note that CPU 2 never writes the address-0 data back to memory here;
  8. CPU 1 reads address 8. Because its cache line is full, it first Writebacks the modified address-0 data to memory, then reads address 8 into the cache line, which becomes Shared. At this point the address-0 data in memory is valid again.

The real MESI protocol is much more complex. Because it maintains data consistency between caches, every MESI event comes from one of two sides: the local CPU or the bus, and different request sources produce different results in each state. The following picture is from the Wikipedia MESI protocol entry, posted here for reference.

[Figure: MESI state transition table, from Wikipedia]

Memory Barrier

Store Buffer

Suppose CPU 0 wants to write data to an address. There are two situations:

  1. CPU 0 has already read the cache line containing the target data, and it is in the Shared state;
  2. The target data's cache line is not in CPU 0's cache at all;

In the first case, CPU 0 only needs to send an Invalidate to the other CPUs. Once it receives an Invalidate Ack from every CPU, it can convert the cache line to the Exclusive state. In the second case, CPU 0 needs to send a Read Invalidate to all CPUs; the CPU holding the latest copy of the target data sends it to CPU 0 and marks its own cache line Invalid.

Whether it is an Invalidate or a Read Invalidate, CPU 0 has to wait for all other CPUs to return an Invalidate Ack before it can safely operate on the data, and this wait can be long. Since CPU 0 only wants to write to the target address and does not care at all about the data's current value on other CPUs, the wait can be optimized away with a store buffer:

[Figure: CPU caches with store buffers]

Each time it writes data, the CPU sends an Invalidate to the other CPUs and, at the same time, puts the newly written data into the store buffer. Once all CPUs have replied with an Invalidate Ack, the corresponding data is moved out of the store buffer and written into the CPU's actual cache line.

Besides avoiding the wait for Invalidate Acks, the store buffer also optimizes the Write Miss case. For example, even on a single CPU, if the target memory is not in the cache, the CPU would normally have to wait for the data to be loaded from memory into the cache before it could write. With a store buffer, the CPU can instead drop the new value into the store buffer and carry on with other work; once the line has been loaded into the cache, the buffered write is applied to it.

One more question about Invalidate: can two CPUs concurrently Invalidate the same cache line?

This conflict is resolved by the bus. As the MESI visualization tool shows, every operation must go through the address bus first, and the address bus is locked during the access, so within any given period only one CPU is operating on the bus and on a given cache line. Two CPUs can still keep modifying the same memory location, though, which makes the same cache line bounce back and forth between them.

Store Forwarding

The store buffer structure shown above still has a problem: when reading data, the CPU must also consult the store buffer, not just the cache, because the newest value of a location may still be sitting in the store buffer, not yet written to the cache line.

For example, consider the following code. Initially a is not cached on CPU 0; it lives on CPU 1 with value 0, while b is on CPU 0:

a = 1;
b = a + 1;
assert(b == 2); 

Because CPU 0 does not have a in its cache, the write of 1 goes into the store buffer, and a Read Invalidate must be sent to CPU 1. When CPU 1 sends back a's data, its value is 0. If CPU 0 executes a + 1 without checking the store buffer, b ends up as 1 rather than 2, and the assert fails.

So the more correct structure is as follows:

[Figure: store buffer consulted by loads (store forwarding)]

In the beginning, CPU 0 wants to set a to 1. It does not care what the value is now, but it still cannot simply send an Invalidate to the other CPUs: the cache line holding a may also hold other data, and an immediate Invalidate would lose that other data. So a plain Invalidate is only used when the cache line is already in the Shared state and is about to be upgraded to Exclusive.

Write Barrier

//CPU 0 executes foo() and owns the cache line of b
void foo(void) 
{ 
    a = 1; 
    b = 1; 
} 
//CPU 1 executes bar() and owns the cache line of a
void bar(void)
{
    while (b == 0) continue; 
    assert(a == 1);
}

For CPU 0, a is not in its cache at first, so it sends a Read Invalidate to obtain ownership of the cache line holding a, and the new value of a sits in the store buffer. After that, CPU 0 can immediately write b = 1, because b's cache line is already on CPU 0 in the Exclusive state.

For CPU 1, it has no cache line for b, so it must first send a Read for b's value. If CPU 0 has just finished b = 1, CPU 1 reads 1 and exits the loop. At this point, CPU 1 may not yet have received CPU 0's Read Invalidate, or it may have received it but only processed the Read half, sending a's value back to CPU 0 in a Read Response without having done the Invalidate. That is, CPU 1 still holds a valid cache line for a, and CPU 0 still cannot move a from its store buffer into its own cache line. CPU 1 therefore reads a as 0, and the assert fails.

The root cause is the store buffer. Without a write barrier, writes can become visible out of order, so a later write may be seen by other CPUs before an earlier one.

A possible question here: for the above to happen, CPU 1 must be able to send its Read for b's cache line before it has finished processing the Read Invalidate it received, which seems unreasonable if a cache processes one request at a time. There may be a blind spot in my understanding here; my guess is that a Read Invalidate is really two operations, a Read and an Invalidate. The Read can return quickly, but the Invalidate may be heavy, for example requiring a write back to main memory. So the cache may be allowed, while it is still completing the Invalidate and before returning the Invalidate Ack, to send out the CPU's own lightweight Read first. After all, the CPU's read only needs to be forwarded by the cache, whereas the Invalidate requires the cache to update its own flags and do much more work.

The fix is the write barrier. Its effect is to mark every entry currently in the store buffer. Writes issued after the barrier cannot go straight to their cache lines even if the CPU already owns them; they too must enter the store buffer, but without the mark. Only when the marked writes have drained, i.e. their Invalidate Acks have arrived and they have been written to their cache lines, can the unmarked writes behind the barrier be written to their cache lines.

The same code, now with a write barrier:

//CPU 0 executes foo() and owns the cache line of b
void foo(void) 
{ 
    a = 1; 
    smp_wmb();
    b = 1; 
} 
//CPU 1 executes bar() and owns the cache line of a
void bar(void)
{
    while (b == 0) continue; 
    assert(a == 1);
}

Now, for CPU 0, the write to a goes into the store buffer with the special mark, and the write to b goes into the store buffer behind it. As long as CPU 1 has not replied with an Invalidate Ack, CPU 0's write to b is not visible either, so CPU 1's Reads of b keep returning 0. Once CPU 1 replies with the Invalidate Ack for a's cache line, CPU 0 writes the buffered a = 1 to its own cache line, then drains all the unmarked writes queued behind it, i.e. b = 1, into their cache lines. Now when CPU 1 reads b again it gets the new value 1. And because CPU 1 has already acknowledged the Invalidate, its line for a is in the Invalid state, so reading a fetches the new value 1, and the assert succeeds.

Invalidate Queue

The store buffer on each CPU is small. When it is full, subsequent writes must wait for a free slot, which hurts performance. In particular, after a write barrier every subsequent write must go through the store buffer, greatly increasing the number of queued writes. It is therefore desirable to shorten how long write requests sit in the store buffer.

As mentioned earlier, the store buffer exists because waiting for Invalidate Acks can take long, so the way to shorten store buffer queuing is to reply with Invalidate Acks sooner. An Invalidate's long latency comes from two sources:

  1. If the cache is very busy, for example the CPU is issuing many reads and writes against it, the cache may miss Invalidate messages, delaying the Invalidate (I think a message can be dropped if the signal on the bus goes unhandled, with the sender retrying later);
  2. A burst of Invalidates may arrive faster than the cache can process them, and each one requires an Invalidate Ack, which also takes up bus time;

So the solution is to give each CPU an invalidate queue: when an Invalidate request arrives, it is put into the queue and the Ack is sent back immediately.

[Figure: caches with invalidate queues]

The problem is equally obvious: a cache line with a pending Invalidate in the queue should be treated as invalidated, and the CPU should not read or write it. But because the Invalidate request merely sits in the queue, the CPU still thinks the cache line is usable and operates on stale data. As the figure shows, the CPU and the invalidate queue sit on opposite sides of the cache, so unlike with the store buffer, the CPU cannot peek into the invalidate queue to check whether a cache line has a pending invalidation. This is why the CPU can read stale data.

On the other hand, before a CPU itself sends an Invalidate for some cache line, it must first drain its own invalidate queue, or at least give the cache a way to confirm that that line has no pending invalidation.

Read Barrier

Because of the invalidate queue, a CPU may read stale values. The scenario is as follows:

//CPU 0 executes foo(); a is Shared, b is Exclusive
void foo(void) 
{ 
    a = 1; 
    smp_wmb();
    b = 1; 
} 
//CPU 1 executes bar(); a is in the Shared state
void bar(void)
{
    while (b == 0) continue; 
    assert(a == 1);
}

CPU 0 writes a = 1 into its store buffer and sends an Invalidate (not a Read Invalidate, because a is in the Shared state) to CPU 1. CPU 1 puts the Invalidate request into its queue and replies immediately, so CPU 0 can write 1 into the cache lines of both a and b. When CPU 1 next reads b, it gets b's new value 1 and exits the loop; but when it reads a, it still believes a is in the Shared state and reads the stale value, say 0, so the assert fails. Afterwards, even though the program has already failed, CPU 1 still processes its invalidate queue and marks a's cache line Invalid.

The fix is to add a read barrier. A read barrier does not make the CPU stop and process the entire invalidate queue on the spot before executing anything else. Instead, it marks the entries currently in the invalidate queue and continues executing; only when the next load is about to read from a cache line does the CPU wait until all the entries marked at the barrier have been processed before performing that load. Invalidate requests that arrive after the marking are not marked, so the next load does not wait for them.

//CPU 0 executes foo(); a is Shared, b is Exclusive
void foo(void) 
{ 
    a = 1; 
    smp_wmb();
    b = 1; 
} 
//CPU 1 executes bar(); a is in the Shared state
void bar(void)
{
    while (b == 0) continue; 
    smp_rmb();
    assert(a == 1);
}

With the read barrier in place, after CPU 1 reads b == 1 and leaves the loop, it marks all entries in the invalidate queue and continues. The next operation reads a, so the CPU first waits until all the marked entries have actually been applied. It then finds a's cache line in the Invalid state, sends a Read to CPU 0, gets the latest value of a's cache line, and the assert succeeds.

Besides the read barrier and the write barrier, there is also a full barrier that combines both: it makes all subsequent writes go through the store buffer and queue up, and makes subsequent reads wait until the invalidate queue has been processed.

Other references

  • http://www.rdrop.com/users/pa…
  • Intel® 64 and IA-32 Architectures Software Developer Manuals | Intel® Software
  • Memory Barriers Are Like Source Control Operations