As mentioned in the previous article, the Java memory model governs how the Java virtual machine works with the computer's memory. The Java virtual machine is a model of a complete computer, so it naturally includes a memory model, known as the Java memory model (JMM).
If you want to design concurrent programs that behave correctly and perform well, it is important to understand the Java memory model. It specifies how and when a thread sees the value of a shared variable modified by other threads, and how to synchronize access to shared variables when necessary.
1 Java Memory Model
Let's start with a diagram of the memory a Java thread uses while it runs, shown in the figure below:
Java thread running memory diagram
The figure shows that a running thread has a small area of memory dedicated to it, called its working memory. Once a Java program has copied variables into a thread's working memory, the thread operates on those copies, and the moment at which the thread's values are synchronized back to main memory is unpredictable.
Based on the thread memory diagram above, the Java memory model is abstractly divided into thread stacks and a heap inside the JVM, as shown in the figure below:
JMM is divided into thread stack and heap
1.1 thread stack and heap
Each thread running in the Java virtual machine has its own thread stack. The thread stack contains information about the current execution point of each method the thread has called (the call stack). The thread stack also has the following characteristics:
Even if two threads execute the same code, each still creates its local variables on its own thread stack. Therefore, each thread has its own version of each local variable.
All local variables of primitive types are stored on the thread stack, so they are not visible to other threads. A thread may pass a copy of a primitive variable to another thread, but it cannot share the primitive local variable itself.
The heap contains all the objects created in the Java program, no matter which thread created them. This includes the object versions of the primitive types (the wrapper classes such as Integer and Long). Whether an object is assigned to a local variable or used as a member variable of another object, the object itself is still stored on the heap.
So the call stack and local variables are stored on the thread stack, and objects are stored on the heap, as shown in the figure below:
Thread stack and heap & variable, object, call stack
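The difference between a copied primitive and a shared heap object can be sketched in a few lines. This is only an illustration: the class StackVsHeapDemo and its nested Box class are hypothetical names, not something from the article.

```java
final class StackVsHeapDemo {
    /** A plain object; its field lives on the heap together with the object. */
    static final class Box {
        int value;
    }

    /** Returns {primitive as seen by the caller, box.value as seen by the caller}. */
    static int[] run() {
        int primitive = 1;      // local primitive: lives on the calling thread's stack
        Box box = new Box();    // the reference is on the stack, the object is on the heap
        box.value = 1;

        final int copyForWorker = primitive; // the worker only ever receives a copy
        Thread worker = new Thread(() -> {
            int mine = copyForWorker; // a new local on the worker's own stack
            mine += 41;               // changes only the worker's local (now 42)
            box.value += mine;        // changes the shared heap object (1 + 42)
        });
        worker.start();
        try {
            worker.join(); // join() makes the worker's writes visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { primitive, box.value };
    }

    public static void main(String[] args) {
        int[] result = run();
        System.out.println("primitive = " + result[0] + ", box.value = " + result[1]);
    }
}
```

The caller's primitive is unchanged after the worker finishes, while the change to the heap object is seen by both threads.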
Objects stored on the heap can be accessed by all threads that hold a reference to the object. When a thread can access an object, it can also access the object's member variables. If two threads call the same method on the same object at the same time, both access the object's member variables, but each thread has its own private copy of the method's local variables.
The above points are shown in the figure below:
Stack, heap & local variable, static variable
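As a rough sketch of this point, two threads below call the same method on the same object. The class SharedCounter and its members are made-up names for illustration, and the synchronized block is added only so that the shared total is deterministic.

```java
final class SharedCounter {
    private int hits; // member variable: stored on the heap, shared by every thread

    /** Increments the shared field n times and returns this call's private count. */
    int work(int n) {
        int local = 0; // local variable: one private copy per call, on the caller's stack
        for (int i = 0; i < n; i++) {
            local++;
            synchronized (this) {
                hits++; // both threads update the same heap field
            }
        }
        return local;
    }

    synchronized int hits() {
        return hits;
    }

    /** Returns {local count of thread 1, local count of thread 2, shared total}. */
    static int[] run() {
        SharedCounter shared = new SharedCounter();
        int[] locals = new int[2];
        Thread t1 = new Thread(() -> locals[0] = shared.work(1000));
        Thread t2 = new Thread(() -> locals[1] = shared.work(1000));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { locals[0], locals[1], shared.hits() };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println("local1=" + r[0] + " local2=" + r[1] + " shared=" + r[2]);
    }
}
```

Each thread counts to 1000 in its own local variable, while the member variable receives the increments of both threads.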
1.2 CPU and memory
As is well known, the CPU is the brain of the computer and executes the program's instructions, while memory stores data, including the program's own data. Memory is much slower than the CPU: today it takes more than 200 CPU cycles to fetch a piece of data from main memory, whereas a CPU register can generally be accessed in a single cycle. Caches sit between the two to bridge this gap. The following is a simple diagram of the CPU cache:
Schematic diagram of CPU cache
With the development of multi-core technology, the CPU cache came to be divided into three levels: L1, L2 and L3. The lower the level number, the closer the cache is to the CPU, so the faster it is and the smaller its capacity.
Use "cat /proc/cpuinfo" or "lscpu" on Linux (Ubuntu included) to check the caches of your machine. For more details, you can use the following command:
As with a database cache, a lookup starts in the fastest cache. On a cache miss, the next level is searched, and so on down to the third-level cache; if the data is not there either, it has to be fetched from main memory. Each additional miss means a longer wait for the data.
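The lookup order can be mimicked with a toy model. Everything here is an assumption made for illustration: the class CacheHierarchySketch is hypothetical, the per-level costs of 4, 12 and 40 cycles are invented, and only the figure of roughly 200 cycles for main memory comes from the text above. Real caches also evict lines, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CacheHierarchySketch {
    // Hypothetical lookup costs in CPU cycles for L1, L2 and L3.
    private static final int[] LEVEL_COST = { 4, 12, 40 };
    // Rough main-memory cost taken from the article's "more than 200 cycles".
    private static final int MEMORY_COST = 200;

    private final List<Set<Long>> levels = new ArrayList<>();

    CacheHierarchySketch() {
        for (int i = 0; i < LEVEL_COST.length; i++) {
            levels.add(new HashSet<>());
        }
    }

    /** Returns the simulated number of cycles needed to load the given address. */
    int load(long address) {
        int cycles = 0;
        boolean hit = false;
        for (int i = 0; i < levels.size(); i++) {
            cycles += LEVEL_COST[i];           // pay for probing this level
            if (levels.get(i).contains(address)) {
                hit = true;                    // found: stop searching
                break;
            }
        }
        if (!hit) {
            cycles += MEMORY_COST;             // missed every level: go to main memory
        }
        for (Set<Long> level : levels) {
            level.add(address);                // fill all levels (no eviction modeled)
        }
        return cycles;
    }

    public static void main(String[] args) {
        CacheHierarchySketch cache = new CacheHierarchySketch();
        System.out.println("first load:  " + cache.load(0x1000L) + " cycles");
        System.out.println("second load: " + cache.load(0x1000L) + " cycles");
    }
}
```

Under these assumed costs the first load misses every level, while the second load of the same address is served by L1.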
At the same time, so that the cache can be accessed efficiently, data is not written to it one arbitrary piece at a time. A cache is made up of cache lines, typically 64 bytes each. You can use the following shell command to view coherency_line_size, the size of the machine's cache line:
The CPU accesses the cache with the line as the smallest unit of operation. For example, a Java long takes up 8 bytes, so one 64-byte cache line holds 8 long values. If you access a long array, loading one element into the cache brings in up to seven of its neighbors at no extra cost, which is why arrays can be traversed very quickly.
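The arithmetic and its effect on traversal order can be sketched as follows. The class CacheLineTraversal is a hypothetical name, the 64-byte line size is the typical value quoted above rather than a measured one, and the timings printed by main depend on the machine and the JVM, so no particular numbers should be expected.

```java
final class CacheLineTraversal {
    // Typical line size from the text; the real value depends on the machine.
    static final int ASSUMED_LINE_BYTES = 64;

    /** Number of long values that fit in one cache line of the assumed size. */
    static int longsPerLine() {
        return ASSUMED_LINE_BYTES / Long.BYTES;
    }

    /** Builds an n x n grid where cell (i, j) holds i + j. */
    static long[][] filled(int n) {
        long[][] grid = new long[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                grid[i][j] = i + j;
            }
        }
        return grid;
    }

    /** Walks each row left to right, so neighboring elements share cache lines. */
    static long sumRowMajor(long[][] grid) {
        long sum = 0;
        for (int i = 0; i < grid.length; i++) {
            for (int j = 0; j < grid[i].length; j++) {
                sum += grid[i][j];
            }
        }
        return sum;
    }

    /** Walks column by column (square grid assumed), jumping to a new row each step. */
    static long sumColumnMajor(long[][] grid) {
        long sum = 0;
        for (int j = 0; j < grid.length; j++) {
            for (int i = 0; i < grid.length; i++) {
                sum += grid[i][j];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("longs per line: " + longsPerLine());
        long[][] grid = filled(1024);
        long t0 = System.nanoTime();
        long a = sumRowMajor(grid);
        long t1 = System.nanoTime();
        long b = sumColumnMajor(grid);
        long t2 = System.nanoTime();
        System.out.println("row-major    sum=" + a + " ns=" + (t1 - t0));
        System.out.println("column-major sum=" + b + " ns=" + (t2 - t1));
    }
}
```

Both traversals compute the same sum; the row-major walk reads consecutive elements of each row, which is the access pattern that benefits from whole lines being loaded at once.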
2 cache consistency
Because the CPU and main memory differ greatly in speed, multiple levels of cache are placed between them to bridge the gap and improve performance. Each CPU has L1, L2 or even L3 caches, and a multi-core computer has multiple CPUs, so there are multiple caches, and the data held in them may become inconsistent. A memory model addresses this problem: it defines how the reads and writes of a multithreaded program behave in a shared-memory system. These rules regulate memory reads and writes so that instructions execute correctly.
In fact, the Java memory model tells us that Java can guarantee certain constraints, such as when one thread's writes to a shared variable become visible to other threads, through the keywords "synchronized" and "volatile".
With these guarantees we can write thread-safe Java programs, and the JDK shields us from many low-level details.
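A minimal sketch of the two keywords in use is given below. The class VolatileStopFlag and its members are hypothetical names chosen for illustration; the 50 ms sleep and the 2 second join timeout are arbitrary values.

```java
final class VolatileStopFlag {
    // volatile: a write by one thread is guaranteed to become visible to others.
    private volatile boolean running = true;

    // Guarded by the synchronized methods below, so increments are never lost.
    private int counter;

    synchronized void increment() {
        counter++;
    }

    synchronized int counter() {
        return counter;
    }

    /** Returns true if the worker observed the cleared flag and terminated. */
    static boolean workerStopsAfterFlagCleared() {
        VolatileStopFlag state = new VolatileStopFlag();
        Thread worker = new Thread(() -> {
            while (state.running) {   // re-reads the volatile field on every pass
                state.increment();
            }
        });
        worker.start();
        try {
            Thread.sleep(50);         // let the worker spin for a moment
            state.running = false;    // volatile write: the worker must eventually see it
            worker.join(2000);        // wait (bounded) for the worker to exit
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + workerStopsAfterFlagCleared());
    }
}
```

If the field were not declared volatile, the Java memory model would give no guarantee that the worker ever sees the write made by the main thread.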
In other words, across the many compiler optimizations and micro-architecture platforms, the authors of the Java language specification define a virtual concept, the memory model, and hand it to Java programmers so that they can write thread-safe programs against it. Compiler and JVM implementers, in turn, follow the constraints in the Java language specification to deliver the thread safety that programmers expect on each platform.
So how is the cache inconsistency problem solved on the various micro-architecture platforms? This is a problem that every CPU manufacturer must solve. Many schemes have been proposed; broadly speaking, there are two: locking the bus, and using a cache consistency protocol.
2.1 concept of bus
First of all, both solutions involve the concept of a bus. What is a bus? The bus is the medium of communication between processor and main memory, and between processor and processor. There are two basic interconnect architectures: SMP (symmetric multiprocessing) and NUMA (non-uniform memory access).
SMP and NUMA
The SMP architecture is very common because it is the easiest to build, and many small servers use it. Processors and memory are interconnected by a bus, and each has a bus control unit responsible for sending and listening to broadcasts on the bus. At any moment, however, only one processor (or memory controller) can broadcast on the bus, while all processors can listen. It is easy to see that the bus is the bottleneck of the SMP architecture.
In the NUMA architecture, a series of nodes is interconnected through a point-to-point network, like a small Internet. Each node contains one or more processors and a local memory. The local memory of one node is visible to other nodes, and the local memories of all nodes together form a global memory that is shared by all processors. NUMA's local memory is therefore shared rather than private. The drawbacks of NUMA, compared with SMP, are that the network needs a more complex protocol than a bus does, and that a processor accesses the memory of its own node faster than that of other nodes. NUMA scales well, so many medium and large servers use it.
For application programmers, the most important thing to understand is that the bus is a critical shared resource whose use directly affects program performance.
2.2 Bus locking
In early CPUs, the cache inconsistency problem was solved by asserting the LOCK# signal on the bus. Because a CPU communicates with other components over the bus, locking the bus blocks other CPUs from accessing those components (such as memory), so that only one CPU can use the memory holding the variable. While the LOCK# signal is asserted, other CPUs must wait until the locked code has finished executing before they can read the variable from memory and operate on it. This solves the cache inconsistency problem.
However, while the bus is locked, other CPUs cannot access memory at all, which is inefficient. Hence the second solution: a cache consistency protocol.
2.3 cache consistency protocol
The consistency requirement means that when a word in a cache is modified, its copy in main memory (and in the other levels of the hierarchy) must also be updated, either immediately or eventually, so that every later reference to that word returns the correct content.
In modern multiprocessor systems, each processor has its own cache, and copies of the same main memory block can be held in different caches at the same time. If processors were allowed to modify their own caches independently, inconsistency would occur. There are software and hardware methods to solve this problem. The hardware method dynamically detects inconsistent conditions and handles them in time, so the cache can be used with high efficiency. It is also transparent to programmers and system software developers, which reduces the burden of software development, so it is widely used.
The best-known hardware method is the MESI protocol used by Intel processors, which ensures that the copies of shared variables held in each cache are consistent. MESI is a snooping (bus-listening) protocol that uses the write-invalidate approach. It requires each cache line to carry two status bits that record which of four states the line is in: modified (M), exclusive (E), shared (S) or invalid (I). The state determines how reads and writes to the line behave. The four states are defined as:
M (modified): the line is valid only in this cache and has been changed, so it differs from main memory.
E (exclusive): the line is valid only in this cache and matches main memory.
S (shared): the line may also be held in other caches and matches main memory.
I (invalid): the line holds no valid data.
The MESI protocol is suited to multiprocessor systems that use a bus as the interconnect. Each cache controller is responsible not only for responding to its own CPU's memory reads and writes (including read/write hits and misses), but also for listening to other CPUs' memory read/write activity on the bus (including snoop read hits and snoop write hits) and updating its own cache accordingly. To maintain cache consistency, all of this processing must follow the MESI state transition rules.
Bus monitoring and state transition of MESI
Each of the four vertices of the diagram is one of the four states, and the edges leaving it give the bus monitoring and state transition rules for that state.
It can be seen from the above analysis that although each cache controller monitors the system bus at all times, only read misses, write misses and write hits on shared lines need to be monitored. The bus monitoring logic is not complex, and the additional bus traffic is small. The MESI protocol effectively ensures that a main memory block has at most one dirty copy across the caches, and that the dirty copy is written back in time, so that accesses to cache and main memory remain correct.
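The state transitions of a single cache line can be written down as a small function. This is a simplified sketch of the textbook MESI rules rather than any particular CPU's implementation: the class MesiSketch, the event names and the otherCachesHoldLine parameter are all assumptions introduced here, and bus transactions and write-backs appear only as comments.

```java
final class MesiSketch {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    /** What happens to the line: an access by the local CPU, or one snooped on the bus. */
    enum Event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE }

    /**
     * Returns the next state of one cache line.
     * otherCachesHoldLine only matters for a local read miss.
     */
    static State next(State current, Event event, boolean otherCachesHoldLine) {
        switch (event) {
            case LOCAL_READ:
                if (current != State.INVALID) {
                    return current; // read hit: M, E and S are unchanged
                }
                // Read miss: the line is fetched over the bus.
                return otherCachesHoldLine ? State.SHARED : State.EXCLUSIVE;
            case LOCAL_WRITE:
                // From S or I, other copies are invalidated over the bus first;
                // from E or M no bus traffic is needed.
                return State.MODIFIED;
            case REMOTE_READ:
                // A modified line is written back before it is shared.
                return current == State.INVALID ? State.INVALID : State.SHARED;
            case REMOTE_WRITE:
                // Another CPU takes ownership; a modified line is written back first.
                return State.INVALID;
            default:
                throw new IllegalArgumentException("unknown event: " + event);
        }
    }

    public static void main(String[] args) {
        State s = State.INVALID;
        s = next(s, Event.LOCAL_READ, false);
        System.out.println("after local read miss: " + s);
        s = next(s, Event.LOCAL_WRITE, false);
        System.out.println("after local write:     " + s);
        s = next(s, Event.REMOTE_READ, true);
        System.out.println("after remote read:     " + s);
        s = next(s, Event.REMOTE_WRITE, true);
        System.out.println("after remote write:    " + s);
    }
}
```

The main method walks one line through invalid, exclusive, modified, shared and back to invalid, which touches all four states.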
However, it is worth noting that two operations in the traditional MESI protocol are expensive. One is marking a cache line as invalid; the other is writing new data to a cache line that is currently in the invalid state. CPUs therefore reduce the latency of these operations with two components, the store buffer and the invalidate queue, as shown in the figure below:
CPU uses store buffer and invalidate queue components to reduce the delay of such operations
So the MESI protocol guarantees that the caches become consistent, but it does not guarantee that they do so instantly, which can lead to a stale (dirty) read for a very short time.
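The short window in which another CPU can read a stale value can be illustrated with a single-threaded toy model of a store buffer. It is a deliberate simplification and not how hardware is programmed: the classes StoreBufferSketch, Memory and Core are hypothetical, and the explicit drain() call stands in for the moment the hardware commits buffered stores.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class StoreBufferSketch {
    /** Stands for the coherent view of one shared variable x. */
    static final class Memory {
        int x;
    }

    /** A toy core that queues its writes in a private store buffer. */
    static final class Core {
        private final Memory memory;
        private final Deque<Integer> storeBuffer = new ArrayDeque<>();

        Core(Memory memory) {
            this.memory = memory;
        }

        /** The write is buffered and not yet visible to other cores. */
        void store(int value) {
            storeBuffer.addLast(value);
        }

        /** A core sees its own newest pending store; otherwise it reads shared memory. */
        int load() {
            return storeBuffer.isEmpty() ? memory.x : storeBuffer.peekLast();
        }

        /** Commits all pending stores in order, making them visible to other cores. */
        void drain() {
            while (!storeBuffer.isEmpty()) {
                memory.x = storeBuffer.pollFirst();
            }
        }
    }

    /** Returns {B's read before the drain, A's own read, B's read after the drain}. */
    static int[] run() {
        Memory memory = new Memory();
        Core a = new Core(memory);
        Core b = new Core(memory);

        a.store(42);
        int staleReadByB = b.load(); // B still sees the old value
        int ownReadByA = a.load();   // A already sees its own write
        a.drain();
        int freshReadByB = b.load(); // now B sees the new value
        return new int[] { staleReadByB, ownReadByA, freshReadByB };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println("B before drain=" + r[0] + ", A=" + r[1] + ", B after drain=" + r[2]);
    }
}
```

In this model core B reads the old value until core A's buffered store is committed, which mirrors the brief inconsistency described above.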
In fact, cache consistency is not used in every case. For example, if the data being operated on cannot be cached inside the CPU, or if it spans multiple cache lines (so its state cannot be tracked), the processor falls back to bus locking. In addition, a CPU that does not support cache locking can only use bus locking, such as the Intel486, Pentium and older CPUs.