Memory barriers and their application in the JVM (2)

Time:2020-6-13

Author: Guo Rui, senior backend engineer at LeanCloud

A video version of this content is available: Memory barrier and its application in JVM

Java Memory Model (JMM)

To run on CPUs of different architectures, Java distills its own memory model: it defines how Java programs interact with this abstract model, how a program executes, which instructions may be reordered and which may not, and how visibility between instructions works. That yields the basic specification for running Java programs. Defining such a model is difficult. It must be flexible enough to fit different hardware architectures, so that each of them can satisfy the specification when implementing the JVM; and it must be rigorous enough that application programmers can rely on it to know, without ambiguity or concurrency surprises, how their programs will behave on all these systems.

Figure 12-1 of the well-known book "Deep Understanding of the Java Virtual Machine" shows the relationship among threads, main memory, and working memory in the JMM. The image comes from the Kindle edition of the book:

[Figure 12-1: threads, working memory, and main memory in the JMM]

As the term "memory model" suggests, this is a simulation of the real world. In the figure, a Java thread corresponds to a CPU, and working memory corresponds to the CPU's cache. The set of save and load instructions Java abstracts corresponds to the cache coherence protocol — MESI and its relatives. Finally, main memory corresponds to RAM. Real-world hardware has to fit itself into this model according to its own situation.

The JMM was finalized with JSR-133. These days its detailed description lives in the Java language specification — for Java 11, in Chapter 17, "Threads and Locks". Besides the specification, there is the famous "JSR-133 Cookbook for Compiler Writers".

Memory barriers on the JVM

The JVM provides four kinds of barriers — the full cross product of the read and write operations placed before and after the barrier. Each name is the combination of the operation on its left and the operation on its right: the LoadLoad barrier is the one placed between two load operations, and LoadStore is the barrier between a load and a store. The barrier types and their meanings are as follows:

  • LoadLoad: for the sequence Load1; LoadLoad; Load2, guarantees that the read in Load2 cannot be reordered before Load1. Similar to the Read Barrier described before: the invalidate queue must be processed before Load2 reads;
  • StoreStore: for the sequence Store1; StoreStore; Store2, guarantees that the data written by Store1 and earlier is written out before Store2 — that is, other CPUs must see Store1's data before Store2's. This may be implemented by flushing the store buffer, or by routing all writes through the store buffer so they stay ordered;
  • LoadStore: for the sequence Load1; LoadStore; Store2, guarantees that the data read by Load1 is in the cache before the data written by Store2 and later becomes visible to other CPUs; Store2 may even depend on the current value read by Load1. The use case for this barrier is hard to map onto the cache architecture of the previous section — that was a deliberately minimal structure, and the JVM's barriers must be abstract enough to cover all cache architectures. Stepping outside that model, my understanding is that a CPU might perform this reordering when it judges that writing Store2 out (flushing it and invalidating the corresponding cache line on other CPUs) is faster than reading Load1 in from memory;
  • StoreLoad: for the sequence Store1; StoreLoad; Load2, guarantees that the data written by Store1 is visible to other CPUs before Load2 reads into the cache. If Store1 and Load2 operate on the same address, the StoreLoad barrier must ensure that Load2 does not read a value sitting in the store buffer, but the committed value pulled from memory, possibly modified by another CPU. StoreLoad is generally the heaviest barrier, and the one that can stand in for all the others.

The best explanation of the four barriers above comes from jdk/MemoryBarriers.java at commit 6bab0f539fba8fb441697846347597b4a0ade428 in the openjdk/jdk repository on GitHub; it feels a bit more detailed than the JSR-133 Cookbook.
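The four barrier types above map directly onto the explicit fence methods that java.lang.invoke.VarHandle has exposed since Java 9, which makes the abstraction concrete. Below is a minimal sketch with made-up field names; note that the JDK offers no stand-alone LoadStore fence — acquireFence() covers LoadLoad plus LoadStore, and fullFence() is the StoreLoad-strength barrier:

```java
import java.lang.invoke.VarHandle;

public class FenceDemo {
    static int x, y;

    static void run() {
        x = 1;
        VarHandle.storeStoreFence(); // StoreStore: the write to x is ordered before the write to y
        y = 2;

        VarHandle.fullFence();       // StoreLoad strength: orders everything; the heaviest fence

        int r1 = x;
        VarHandle.loadLoadFence();   // LoadLoad: the read of x is ordered before the read of y
        int r2 = y;

        VarHandle.acquireFence();    // LoadLoad + LoadStore: the loads above are ordered
        x = r1 + r2;                 // before this store
    }

    public static void main(String[] args) {
        run();
        System.out.println(x + " " + y); // prints "3 2"
    }
}
```

In single-threaded code the fences change nothing observable, which is exactly the point: they only constrain how the compiler and CPU may reorder accesses relative to other threads.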

Why is StoreLoad the heaviest?

What "heavy" really means is the number of memory interactions: more interactions mean higher latency, i.e. a heavier barrier. StoreStore and LoadLoad were not singled out because each constrains only writes or only reads — a single kind of memory interaction. Only LoadStore and StoreLoad appear to constrain both. But the constraint in LoadStore is really on the load side: the load's data must come in, while the visibility of the store's data is not required — only that the store not be reordered before the load.

StoreLoad is different: the load cannot be reordered before the store, which requires the store buffer to be written out before the load runs. If the store buffer were not flushed, the read could execute first and the flush happen afterwards, effectively reordering the write after the read. And once the data is written out, other CPUs can see it; having seen it, they may modify the memory the following load targets, invalidating the cache line that load would use. If the load were allowed to read from a possibly-invalidated cache line, it would in effect be reordered before the store, since the data may have been sitting in the cache from before the store — an early read. To avoid this, after the store completes, the invalidate queue must be processed to learn whether the cache line holding the load's target has been invalidated. So satisfying StoreLoad means flushing the store buffer on one side and processing the invalidate queue on the other: in the worst case two memory interactions, one write and one read, which makes it the heaviest.

Why can StoreLoad implement the functions of all the other barriers?

This follows from the answer to the previous question: StoreLoad constrains both read and write operations, so it can implement the functions of the other barriers, each of which constrains only one side of reading and writing.

That said, these four barriers are Java's cross-platform design. On a concrete CPU, the JVM for that platform may be able to optimize some of them away. For example, many CPUs guarantee the ordering of consecutive reads and writes of the same variable, so no barrier is needed there: for Load x; Load x.field — reading x and then a field of x — if the CPU guarantees ordering for accesses to the same memory, the barrier between the two loads can be dropped. In the assembly compiled from the bytecode, the spot where a barrier would have been inserted becomes a nop, an empty operation. On x86, only StoreLoad is a real barrier. x86 has no invalidate queue, and every store enters the store buffer in order, so StoreStore and LoadLoad are unnecessary. x86 also guarantees that stores, although written out asynchronously through the store buffer, are never reordered before earlier loads, so LoadStore is unnecessary too. Only StoreLoad remains in use by JVMs on the x86 platform.

How the barrier is used on x86 can be seen in the OpenJDK code, in src/hotspot/cpu/x86/assembler_x86.hpp. There you can see that x86 uses lock to implement StoreLoad, the only barrier that does anything there. The comments in that code explain why lock is used.

volatile

One of the main applications of barriers on the JVM is the implementation of the volatile keyword. Oracle describes the keyword as follows:

Using volatile variables reduces the risk of memory consistency errors, because any write to a volatile variable establishes a happens-before relationship with subsequent reads of that same variable. This means that changes to a volatile variable are always visible to other threads. What’s more, it also means that when a thread reads a volatile variable, it sees not just the latest change to the volatile, but also the side effects of the code that led up the change.

This is from Oracle's description of atomic access. Basically, a variable marked volatile must maintain two characteristics:

  • Visibility: every read of a volatile variable reads its latest value, i.e. the last write to it, no matter which thread performed that write.
  • No instruction reordering, i.e. maintaining the happens-before relationship. A write to a volatile variable cannot be reordered before the operations that precede it, so that other threads, on seeing the written value, know that everything before the write has already happened. A read of a volatile variable cannot be reordered after subsequent operations — for example, if I read a volatile and then act on the value I read, reordering the action to before the read would violate the requirement to read the latest value, since the action would be based on a stale volatile value.

Two things are worth noting. First, not all reordering is forbidden — only that a volatile write cannot move earlier and a volatile read cannot move later; other reorderings are still allowed. Second, the reordering prohibition exists precisely to provide visibility. Both characteristics of volatile can therefore be maintained with barriers.
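Both characteristics show up in the classic flag-publication pattern. A minimal sketch (class and field names are mine, not from the article): the barrier before the volatile store publishes the plain write to data, and the barrier after the volatile load keeps the read of data from floating above it:

```java
public class VolatileFlag {
    static int data;                 // plain variable
    static volatile boolean ready;   // volatile guard

    static void publish() {
        data = 42;     // normal store
        ready = true;  // volatile store: the barrier before it makes data=42 visible first
    }

    static int consume() {
        while (!ready) {        // volatile load: spin until the flag flips
            Thread.onSpinWait();
        }
        return data;            // cannot be reordered above the volatile read,
    }                           // so it is guaranteed to observe 42

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(VolatileFlag::publish);
        writer.start();
        int seen = consume();
        writer.join();
        System.out.println(seen); // always 42, never 0
    }
}
```

Without volatile on ready, the reader could both spin forever (the flag cached in a register) and, on a weaker architecture, observe ready == true while still seeing the stale data.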

Call the read and write of an ordinary variable a Normal Load and Normal Store: given int a = 1;, the statement a = 2; is a Normal Store, and int b = a; performs a Normal Load of a. If the variable is declared volatile, the corresponding operations are a Volatile Load and Volatile Store. volatile has no effect on the bytecode of the operations themselves: the bytecode generated for a Java method is identical whether or not the variables it touches are volatile. At the bytecode level, volatile affects the field's access_flags in the class file (see section 4.5 of the Java 11 edition of the Java Virtual Machine Specification): declaring a member variable volatile makes the compiler record that the field is volatile. When the JVM compiles bytecode to assembly and meets a bytecode such as getfield or putfield whose target field carries the volatile mark, it inserts the barriers required by the JMM into the assembly.
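The ACC_VOLATILE access flag (bit 0x0040 in the field's access_flags) can be observed from Java itself: java.lang.reflect.Modifier.isVolatile() checks exactly that bit. A small sketch with made-up field names:

```java
import java.lang.reflect.Modifier;

public class AccessFlagsDemo {
    volatile int counter;  // compiled with ACC_VOLATILE (0x0040) set in access_flags
    int plain;             // compiled without it

    static boolean isVolatileField(String name) {
        try {
            return Modifier.isVolatile(
                    AccessFlagsDemo.class.getDeclaredField(name).getModifiers());
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(name, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isVolatileField("counter")); // true
        System.out.println(isVolatileField("plain"));   // false
    }
}
```

You can confirm the same thing without running anything by inspecting the class file with javap -v and looking at the field's flags.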

Following volatile semantics, let's work out, in turn, which barrier each of the following operation sequences requires. Note that the two operations must touch different variables:

  • Normal Store, Volatile Store. Write an ordinary variable first, then write a volatile one. This obviously needs a StoreStore barrier.
  • Volatile Store, Volatile Store. Also obviously StoreStore: when the second write is seen by other CPUs, all writes before it must be visible as well.
  • Normal Load, Volatile Store. Must use LoadStore, to keep the store from being reordered before the load.
  • Volatile Load, Volatile Store. Must use LoadStore, for the same reason.

The reason barriers are used in the four cases above goes back to Oracle's atomic-access description quoted earlier: when a write to a volatile variable is seen by other CPUs, every operation before that write must have completed — no matter whether the preceding operation was a read or a write, volatile or not. If the store could be reordered before the preceding operation, that contract would be broken. So for volatile variables, the barrier is added before the store. When the operation after the store touches only an ordinary variable, no barrier is used; whether it gets reordered does not matter:

  • Volatile Store, Normal Load
  • Volatile Store, Normal Store

For reads of a volatile variable, to satisfy the volatile semantics described above — avoiding the reordering of subsequent operations to before the volatile read — each volatile read is followed by a barrier:

  • Volatile Load, Volatile Load. Must use LoadLoad.
  • Volatile Load, Normal Load. Must use LoadLoad.
  • Volatile Load, Normal Store. Must use LoadStore.

If the later operation is a Volatile Load, the pair may be reordered freely and needs no barrier:

  • Normal Store, Volatile Load。
  • Normal Load, Volatile Load。

Finally, there is one special sequence: a volatile store followed by a volatile load:

  • Volatile Store, Volatile Load. Requires StoreLoad: the operations up to and including the store may have changed the variable the later load reads, and that load must be able to see the change.

That leaves four combinations of only normal operations, all of which may be reordered freely with no impact:

  • Normal Store, Normal Load
  • Normal Load, Normal Load
  • Normal Store, Normal Store
  • Normal Load, Normal Store

The table mapping these barrier rules to concrete operations in Java is as follows:

[Table: required barriers for each pair of Java memory operations, from the JSR-133 Cookbook]

In the table, MonitorEnter and MonitorExit correspond to entering and exiting a synchronized block. MonitorEnter uses barriers the same way a Volatile Load does; MonitorExit uses barriers the same way a Volatile Store does.

Summing up the table, the rules for using barriers are easy to remember. On the write side: whenever a volatile variable is written, a barrier goes before the write, to guarantee that when the write is seen by other CPUs, everything that happened before it can be seen as well. The barrier prevents the write from moving forward — without it, the volatile write could already be visible to other CPUs while some earlier operation had not yet been perceived by them.

On the read side: whenever a volatile variable is read, a barrier goes after the read, to guarantee that the operations that follow really act on the freshly read value.

On top of that there is the special Volatile Store, Volatile Load pair: to guarantee the next read sees the changes produced by the previous write, a StoreLoad Barrier must be added.

Besides the volatile-related uses of barriers in the table above, the JMM has one more place where a barrier is required: the final modifier. A write to a final field must be followed by a StoreStore Barrier, for example x.finalField = v; StoreStore; sharedRef = x;. Here is a set of example operations to see which reads and writes of which variables need barriers.
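A sketch of what the final-field rule protects (names here are illustrative): the StoreStore barrier at the end of the constructor guarantees that a thread which sees the published reference also sees the final field's value, even with no synchronization at all:

```java
public class FinalPublishDemo {
    static class Config {
        final int port;        // final field: the JVM emits a StoreStore barrier
        Config(int port) {     // at the end of the constructor, ordering this
            this.port = port;  // write before the reference escapes
        }
    }

    static Config shared;      // published without any synchronization

    static Config publish() {
        shared = new Config(8080); // x.finalField = v; StoreStore; sharedRef = x
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(publish().port); // a reader that sees `shared` non-null
    }                                       // is guaranteed to see port == 8080
}
```

If port were a non-final plain field, a reader on another thread could in principle observe the Config reference before the field's value — the classic unsafe-publication bug.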

Finally, you can look at the example given in the JSR-133 Cookbook to get a rough feel for how barriers are added when operating on various kinds of variables:

[Figure: barrier insertion example from the JSR-133 Cookbook]

How volatile maintains visibility

To summarize, volatile visibility comprises two aspects:

  1. A written volatile variable can be read by other CPUs on their next read after the write;
  2. When other CPUs see the latest value of a volatile variable, they can also see the operations that preceded the write to it.

The first aspect is achieved mainly through:

  1. Reads of a volatile variable never use a register; every read goes back out to memory;
  2. Operations after a volatile read are not reordered to before the read.

The second aspect is achieved through the barrier placed when writing a volatile variable, which guarantees that the operations before the volatile write happen before the write.

One last special point: with a StoreLoad Barrier, writing the volatile triggers a drain of the store buffer, so the write can be seen by other CPUs "immediately".

The most common explanation of how volatile achieves visibility goes: "after writing the data, a write barrier flushes the cache to main memory; before reading the data, a barrier forces the read to come from main memory".

From what the JMM tells us — at least from the JMM's perspective — that statement is not accurate. For one thing, the barrier goes before the volatile write, not after it, and after the volatile read, not before. For another, "flushing the cache" is wrong: even the StoreLoad barrier flushes the store buffer into the cache, not the cache into main memory. If the write's target is already in the current CPU's cache, then even when the store buffer is drained the data lands in the cache, and no cache write-back to memory is triggered — syncing to memory would be pointless anyway, since other CPUs do not necessarily care about this value. Likewise, even with the barrier after a volatile read, if the target is in the current CPU's cache in a valid state, the read is served straight from the cache; it does not go back to memory to pull the data again.

One thing to add: reads and writes of volatile and ordinary variables go down the same path — both are handed to the cache, and the caches maintain coherence through MESI and its variant protocols. The only difference between the two kinds of variables is the use of barriers.

Is a volatile read free?

On x86, every barrier except StoreLoad is an empty operation, yet reading a volatile variable is still not entirely free of overhead. For one thing, the Java compiler fills the positions where barriers would go with nop, which blocks some of its optimizations: instructions that could otherwise have been reordered no longer are. For another, because the variable is declared volatile, every read must come from memory (or rather the cache); the variable cannot be kept in a register and reused. That also reduces access performance.

Ideally, a volatile field should be read much more than it is written, and written by only one thread as far as possible. Reading a volatile still costs more than reading an ordinary variable, but in general the overhead is not particularly heavy.

Revisiting the example from the false sharing article

The article [[CPU cache basics]] introduced the concept of false sharing and how to observe it. One key point there: to observe false sharing clearly, the variables the threads operate on must be declared volatile — then, when false sharing occurs, performance degrades a lot; but if you remove volatile, the degradation shrinks. Why?

In short, without volatile there is no barrier. If the cache line holding the variable is not in the current CPU, the modification is parked in the store buffer and performed for real only once the target cache line loads. Writes thus accumulate in the store buffer: an a++ loop does not add one to the cache on every execution; once the cache line loads, it can directly apply "add 10" or "add 100", merging a batch of increments into a single cache-line write. With volatile and its barriers, guaranteeing the visibility of written data introduces the cost of waiting for the store buffer to drain into the cache line: when the target cache line is not yet in the current CPU's cache, the write goes into the store buffer, but on reaching, say, a StoreLoad barrier, execution must wait for the store buffer to drain before the next instruction. For a++, the increments can no longer accumulate; every write has to wait for the cache line to load and for the store buffer to drain before the next write proceeds. This magnifies the impact of cache misses, so when false sharing makes the cache line bounce back and forth among CPUs, the variables modified under volatile run much slower.

Going further, I ran tests on my own machine. It is an x86 machine, so only the StoreLoad barrier actually does anything. Let's look at the OpenJDK code to see how the StoreLoad barrier is inserted.

First, the JSR-133 Cookbook defines a set of barriers, but the JVM defines even more of them, in src/hotspot/share/runtime/orderAccess.hpp.

Each OS and CPU architecture combination has its own OrderAccess implementation. For Linux x86 it is src/hotspot/os_cpu/linux_x86/orderAccess_linux_x86.hpp; BSD x86 is similar to Linux x86, and src/hotspot/os_cpu/bsd_x86/orderAccess_bsd_x86.hpp defines it as follows:

inline void OrderAccess::loadload()   { compiler_barrier(); }
inline void OrderAccess::storestore() { compiler_barrier(); }
inline void OrderAccess::loadstore()  { compiler_barrier(); }
inline void OrderAccess::storeload()  { fence();            }

inline void OrderAccess::acquire()    { compiler_barrier(); }
inline void OrderAccess::release()    { compiler_barrier(); }

inline void OrderAccess::fence() {
   // always use locked addl since mfence is sometimes expensive
#ifdef AMD64
  __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
#else
  __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
#endif
  compiler_barrier();
}

compiler_barrier() only prevents the compiler from reordering instructions; the operation it emits is empty. You can see that only StoreLoad has a real effect, corresponding to fence(), whose implementation uses lock. Why lock is used is explained in the assembler_x86 comments mentioned earlier.

Each write to a volatile variable must be followed by a StoreLoad barrier, which can be seen in the code that interprets bytecode, src/hotspot/share/interpreter/bytecodeInterpreter.cpp: when executing putfield on a volatile variable, a StoreLoad barrier is added right after the write. We can also confirm that MonitorExit is equivalent to a volatile write, as the JSR-133 Cookbook says; the evidence is in the OpenJDK code in src/hotspot/share/runtime/objectMonitor.cpp.

The JSR-133 Cookbook also says that final fields need a StoreStore barrier; that too can be found in src/hotspot/share/interpreter/bytecodeInterpreter.cpp.

Here the question comes back. According to the table in the JSR-133 Cookbook, two consecutive volatile writes should need only StoreStore — so why does the code above use StoreLoad? The Cookbook's StoreLoad means Store1; StoreLoad; Load2 in the sense that no read after the barrier may be reordered before Store1 — not merely the read immediately following Store1, but any later read, however far away. So my understanding is: for a volatile variable, somewhere in the program there is always a read of it after it is written, so finishing a volatile write in practice always pairs with a StoreLoad barrier. Only in theory — a volatile that is written but never read — could a StoreStore be generated instead. Of course, this is just my conclusion from the JDK code and my own tests.

How can we check whether the above is true? We need to print the code the JDK compiles. You can refer to this article; in short, there are two key points:

  • Start the Java program with the flags -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly;
  • Download or build hsdis and put it under jre/lib in JAVA_HOME.

If hsdis is missing, you will see this when starting the program:

Could not load hsdis-amd64.dylib; library not loadable; PrintAssembly is disabled

After that, print the compiled code from the false sharing test. You can see in the assembly instructions that every completed write to the volatile valueA or valueB is followed by a lock instruction; even after the JIT kicks in, the assembly looks like:

0x0000000110f9b180: lock addl $0x0,(%rsp)     ;*putfield valueA
                                                ; - cn.leancloud.filter.service.SomeClassBench::[email protected] (line 22)

Other applications of memory barriers in the JVM

The atomic classes' lazySet

Another interesting use of barriers is the lazySet operation on the atomic classes. Take the most common one, AtomicInteger: its internal state, value, is a volatile int. A normal set() changes the state to the target value, and after the modification other CPUs can see it, thanks to the barriers. lazySet, compared with set, looks like this:

public final void set(int newValue) {
    value = newValue;
}
public final void lazySet(int newValue) {
    unsafe.putOrderedInt(this, valueOffset, newValue);
}

Java gives no explanation of unsafe.putOrderedInt() at all, but the purpose of lazySet() is stated in Bug ID: JDK-6275329, "Add lazySet methods to atomic classes": only a StoreStore barrier is placed before the write to the volatile state. It guarantees merely that this write is not reordered before earlier writes; when the write must drain to memory is left unspecified. That makes it a lightweight write operation, which can improve performance in specific scenarios.
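A small sketch of the difference in guarantees (the weaker ordering only matters cross-thread; within the writing thread, program order still applies and both variants behave identically):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LazySetDemo {
    static int lastWrite(AtomicInteger state) {
        state.set(1);      // full volatile store: barriers make it promptly visible
        state.lazySet(2);  // only a StoreStore before the write: it cannot be reordered
                           // before earlier writes, but may reach other threads later
        return state.get(); // this thread still sees 2, by program order
    }

    public static void main(String[] args) {
        System.out.println(lastWrite(new AtomicInteger(0))); // prints 2
    }
}
```

Since Java 9 the same semantics are exposed directly as AtomicInteger.setRelease() and, more generally, VarHandle.setRelease(); the lazySet javadoc is now specified in terms of it.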

The trick inside ConcurrentLinkedQueue

A brief description of the trick. Suppose there are four volatile variables a, b, c, d, written naively as:

a = 1;
b = 2;
c = 3;
d = 4;

A barrier is inserted in the middle of every pair of statements. Writing it this way might be fine if every barrier really were StoreStore; but if volatile reads are mixed in among the volatile writes, some barriers may be promoted to the heaviest one, StoreLoad, which costs a lot. If instead a, b, c at the start are written the way ordinary variables are written, and only the last one, d, is updated volatile-style — so that only d = 4 carries a barrier in front — then it is still guaranteed that when d = 4 is seen by other CPUs, the values of a, b, and c are seen as well. This reduces the number of barriers and improves performance.

The unsafe object described in the previous section also has a putObject method, which updates a volatile variable the way an ordinary variable is updated, i.e. without any barrier. With putObject, the optimization above can be achieved.

ConcurrentLinkedQueue, the lock-free queue in the Java standard library, uses exactly this trick. Since it is a linked list, there is a Node class to carry the data, and Nodes linked together form the list. Inside Node, a volatile-qualified variable points to the Node's stored data. Part of Node's code is:

private static class Node<E> {
    volatile E item;
    volatile Node<E> next;
    Node(E item) {
        UNSAFE.putObject(this, itemOffset, item);
    }
    ....
}

When a Node is constructed, it is attached to the list by a CAS on the tail Node's next, and it needs to be visible to other CPUs only after that CAS succeeds. While a Node is first being constructed, its internal item cannot be accessed by any other thread, so the constructor can use putObject directly to update item; only later, when CASing the tail Node's next, is next updated volatile-style, bringing the barrier with it. Once next is updated, the update to next and the update to item inside the Node are all visible to other CPUs. This reduces the cost of operating on volatile variables.
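The same "plain store inside the constructor, barriered store only on publication" pattern can be written today without Unsafe, using the VarHandle API from Java 9 (the names Box, ITEM, and publishAndRead below are mine): VarHandle.set() performs a plain, barrier-free write even on a volatile field, just like putObject.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class PlainThenPublish {
    static class Box {
        volatile String item;
        private static final VarHandle ITEM;
        static {
            try {
                ITEM = MethodHandles.lookup()
                        .findVarHandle(Box.class, "item", String.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }
        Box(String item) {
            ITEM.set(this, item); // plain store, no barrier: nobody else can
        }                         // reach this Box yet — same idea as putObject
    }

    static volatile Box shared;

    static String publishAndRead() {
        shared = new Box("payload"); // the single volatile store publishes the Box;
        return shared.item;          // its barrier also covers the plain store to item
    }

    public static void main(String[] args) {
        System.out.println(publishAndRead()); // prints "payload"
    }
}
```

The current JDK's ConcurrentLinkedQueue has itself migrated from Unsafe to VarHandles, using exactly this kind of relaxed access mode in Node construction.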