How CPUs implement atomic operations

Time:2021-9-9

CPUs before the 586 implemented atomic operations through the LOCK prefix, which locks the bus. Starting with the 686, the CPU provides cache coherence, which is the basis of multiprocessing and atomic operations.

 

1. Granularity of storage

Storage is organized at cache-line granularity. A cache line is usually 64 bytes or larger (32 bytes in the early days). Cache lines are then grouped into small sets, and each set is managed with LRU or another replacement policy.

 

2. Protocol

Cache coherence (CC) is generally implemented with the MESI protocol and its later variants, such as Intel's MESIF protocol and AMD's MOESI protocol.

Take the MESI protocol as an example:

Modified: this core holds the cache line exclusively; it has been modified but not yet written back to main memory

Exclusive: this core holds the cache line exclusively, and it is consistent with main memory

Shared: other cores may also hold the cache line, and it is consistent with main memory

Invalid: the cache line is unavailable (its contents are not valid)
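The four states can be sketched as a small state machine. The Go code below is a textbook-style simplification, not a model of any particular CPU: the event names, the `Next` function, and the `othersHaveCopy` flag are hypothetical names introduced for illustration, and bus messages and write-backs are left out.

```go
package main

import "fmt"

// State is one of the four MESI cache-line states.
type State int

const (
	Invalid State = iota
	Shared
	Exclusive
	Modified
)

func (s State) String() string {
	return [...]string{"Invalid", "Shared", "Exclusive", "Modified"}[s]
}

// Event is a hypothetical, simplified set of things that can happen to a line,
// seen from the point of view of one core's cache.
type Event int

const (
	LocalRead   Event = iota // this core reads the line
	LocalWrite               // this core writes the line
	RemoteRead               // another core reads the line
	RemoteWrite              // another core writes the line
)

// Next returns the state of this core's copy after an event.
// othersHaveCopy only matters for a read miss (LocalRead from Invalid).
func Next(s State, e Event, othersHaveCopy bool) State {
	switch e {
	case LocalRead:
		if s == Invalid {
			if othersHaveCopy {
				return Shared
			}
			return Exclusive
		}
		return s // read hit: state unchanged
	case LocalWrite:
		// Writing requires ownership; other copies are invalidated first.
		return Modified
	case RemoteRead:
		if s == Invalid {
			return Invalid
		}
		// A Modified line is written back before it is shared.
		return Shared
	case RemoteWrite:
		return Invalid
	}
	return s
}

func main() {
	s := Invalid
	for _, e := range []Event{LocalRead, LocalWrite, RemoteRead, RemoteWrite} {
		s = Next(s, e, false)
		fmt.Println(s)
	}
}
```

Note that a local write always ends in Modified, which is why holding the line in M or E is what later gives an atomic instruction its exclusivity.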

 

3. Communication

Given a protocol, the cores need to communicate in order to implement it, i.e. to keep the stored state in sync. There are two kinds of communication: broadcast / snooping, and directory-based.

With broadcast / snooping, as the name suggests, every change of state is broadcast to the other cores so that each can maintain the state of its cache lines. Obviously, this wastes a lot of traffic and is hard to scale: with many CPU cores, the bus becomes an obvious bottleneck.

A directory-based scheme notifies only the specific cores that are affected, which avoids broadcasting. Consider an extreme case, however: if many cores are accessing the same cache line, the notifications amount to a broadcast anyway. Therefore, in multi-threaded programming, having threads share the same cache line is not a good choice.
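One common way to keep threads off each other's cache lines is to pad per-thread data out to the line size. The Go sketch below assumes a 64-byte cache line (an assumption, as the real size depends on the CPU); `paddedCounter` and `countPadded` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"unsafe"
)

// cacheLineSize is an assumption; 64 bytes is typical but not universal.
const cacheLineSize = 64

// paddedCounter occupies a full cache line's worth of bytes, so the n fields
// of neighbouring counters in a slice are 64 bytes apart and can never fall
// into the same 64-byte line.
type paddedCounter struct {
	n int64
	_ [cacheLineSize - 8]byte
}

// countPadded gives each worker its own padded counter and returns the sum.
func countPadded(workers, iters int) int64 {
	counters := make([]paddedCounter, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(c *paddedCounter) {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				atomic.AddInt64(&c.n, 1)
			}
		}(&counters[w])
	}
	wg.Wait()

	var total int64
	for i := range counters {
		total += counters[i].n
	}
	return total
}

func main() {
	fmt.Println(unsafe.Sizeof(paddedCounter{})) // size of one padded counter in bytes
	fmt.Println(countPadded(4, 100000))
}
```

Without the padding field, several counters would sit in the same line, and every increment by one core would invalidate the line in the others even though no data is logically shared.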

 

With the above, let’s consider the implementation of atomic operation:

 

1. Atomic load / store

Because the CPU manages the cache in units of cache lines, a load or store that falls within a single cache line is effectively atomic. For example, when an 8-byte object is loaded or stored, the upper 4 bytes and the lower 4 bytes cannot be handled separately (which would produce a torn value made of two different writes).

However, this alone is not enough. A core does not immediately write its changes to a cache line back to main memory, so other cores may still see an old value; a reader therefore needs a fence to observe the latest value. A writer must hold write permission on the line, i.e. the M or E state, and a line in either state holds the latest value. (A value that was read earlier is not necessarily still the latest, though, so a plain read followed by a write can overwrite a newer value with a stale one.)
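At the language level this usually surfaces as atomic load / store functions. The Go sketch below uses `sync/atomic` to check that a reader never observes a torn 8-byte value: the writer always stores a word whose upper and lower halves are equal, so any mismatch would indicate a tear. The function name `tornReads` is hypothetical, and the check illustrates the guarantee rather than proving it.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// tornReads runs one writer and one reader over a shared 64-bit word and
// returns how many times the reader saw mismatched upper and lower halves.
func tornReads(iters int) int {
	var word uint64
	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < iters; i++ {
			h := uint64(uint32(i))
			// Both 32-bit halves carry the same value.
			atomic.StoreUint64(&word, h<<32|h)
		}
	}()

	torn := 0
	for i := 0; i < iters; i++ {
		v := atomic.LoadUint64(&word)
		if uint32(v>>32) != uint32(v) {
			torn++
		}
	}
	wg.Wait()
	return torn
}

func main() {
	fmt.Println("torn reads:", tornReads(1000000))
}
```

Go's atomic operations also provide the ordering that the fence discussion above refers to, so no separate fence call appears in the code.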

2. FetchAndAdd

This is slightly more complex than load and store because it is actually a compound (read-modify-write) operation. With the M and E states in mind, however, it is easy to understand:

lock(CacheLine)
v := load(obj)
v += add
store(obj, v)
release(CacheLine)

On x86, this is the xadd instruction (used with the LOCK prefix).
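From a program's point of view, fetch-and-add hands every caller a distinct old value, which is what makes it useful for counters and ticket numbers. The Go sketch below builds it on `atomic.AddInt64`, which returns the new value, so the old one is recovered by subtracting the delta. `fetchAndAdd` and `uniqueTickets` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fetchAndAdd atomically adds delta to *addr and returns the previous value.
func fetchAndAdd(addr *int64, delta int64) int64 {
	return atomic.AddInt64(addr, delta) - delta
}

// uniqueTickets lets several goroutines draw tickets concurrently and reports
// whether every ticket number was handed out exactly once.
func uniqueTickets(workers, perWorker int) bool {
	var next int64
	seen := make([]int32, workers*perWorker)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				t := fetchAndAdd(&next, 1)
				atomic.AddInt32(&seen[t], 1)
			}
		}()
	}
	wg.Wait()

	for _, c := range seen {
		if c != 1 {
			return false
		}
	}
	return next == int64(workers*perWorker)
}

func main() {
	fmt.Println(uniqueTickets(8, 10000))
}
```

If the load, add, and store were done as three separate plain operations, two goroutines could read the same old value and hand out the same ticket twice.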

3. CompareAndSwap

CompareAndSwap follows the same pattern:

lock(CacheLine)
v := load(obj)
if v == expected {
  store(obj, new_value)
}
release(CacheLine)

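On x86, this is the cmpxchg instruction with the LOCK prefix (xchg, by contrast, swaps unconditionally).

Here, lock and release mean taking exclusive ownership of the cache line and then giving it up.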
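In software, CompareAndSwap is typically used in a retry loop: read the current value, compute a new one, and publish it only if nobody changed the value in between. The Go sketch below uses `atomic.CompareAndSwapInt64` to maintain a running maximum; `storeMax` and `maxOf` are hypothetical helpers, and `maxOf` assumes a non-empty slice.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// storeMax raises *addr to v if v is larger, retrying when another goroutine
// changes *addr between the load and the compare-and-swap.
func storeMax(addr *int64, v int64) {
	for {
		old := atomic.LoadInt64(addr)
		if v <= old {
			return // the current value is already at least v
		}
		if atomic.CompareAndSwapInt64(addr, old, v) {
			return // we won the race and published v
		}
		// CAS failed: *addr changed under us, so reload and try again.
	}
}

// maxOf computes the maximum of a non-empty slice with one goroutine per element.
func maxOf(values []int64) int64 {
	m := values[0]
	var wg sync.WaitGroup
	for _, v := range values {
		wg.Add(1)
		go func(v int64) {
			defer wg.Done()
			storeMax(&m, v)
		}(v)
	}
	wg.Wait()
	return m
}

func main() {
	fmt.Println(maxOf([]int64{3, 9, 4, 7}))
}
```

The failed-CAS branch is the software counterpart of the stale-read problem mentioned earlier: the value that was loaded is no longer the latest, so the store is refused instead of overwriting the newer value.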

 

There is little material that explains exactly how atomic operations are implemented, perhaps because the topic leans so heavily toward hardware. However, given protocols such as MESI, we can guess most of the implementation inside the CPU. Fortunately, we found two sources: "Fundamentals of Parallel Multicore Architecture" and "Understanding the Implementation and Application of Modern Servers from Kunpeng 920". The memory model chapter of the Kunpeng 920 book reads as follows:

The logic of atomic instructions is not complex in software, but the cost is very high in the microarchitecture. If we regard the CPU and memory as independent entities on a bus, then for a CPU to perform a CAS instruction, it must first read a value from memory and, at the same time, set a flag on the memory controller so that other CPUs cannot write. Only after it has done the comparison and decided whether to write a value back can other CPUs write.

Different microarchitectures optimize this behavior in different ways. On Kunpeng 920, atomic instruction requests are queued at the L3 cache so that the semantics required by atomic instructions are maintained across the multiple steps of an atomic operation. The queuing itself has a cost, so do not casually use atomic variables where atomicity is not needed; they are not free.

"Fundamentals of Parallel Multicore Architecture" reads as follows:

Fortunately, the cache coherence protocol provides the basis for guaranteeing atomicity. For example, when an atomic instruction is encountered, the protocol knows that atomicity must be guaranteed. It first obtains "exclusive ownership" of the memory location M (by invalidating the copies in other caches of the block containing M). Once exclusive ownership is obtained, the protocol ensures that only one processor can access the block; any other processor that tries to access it at that point takes a cache miss. The atomic instruction can then execute. For the duration of the atomic instruction, other processors must not be allowed to "steal" the block. For example, if another processor requests to read or write the block, the block would be "stolen" (e.g. the block is flushed and its state is downgraded to Invalid). Exposing the block before the atomic instruction completes would break the atomicity of the instruction.

References:

1) Fundamentals of Parallel Multicore Architecture

2) Understanding the Implementation and Application of Modern Servers from Kunpeng 920
