Introduction to Concurrent Programming


This introduction to concurrent programming is a summary from the Distributed Computing: Concurrent Programming series. Welcome to follow the WeChat official account: The Technological Path of a Bear.


With the rapid improvement of hardware performance and the arrival of the big-data era, concurrent programming has become an increasingly important part of software development. Simply defined, if the logical control flows of execution units overlap in time, they are concurrent. The main driver of the renaissance of concurrent programming is the so-called "multi-core crisis". Chip performance is still improving as Moore's Law predicts, but instead of faster CPUs, computers are moving toward more cores. As Herb Sutter put it, "the free lunch is over": to make code run faster, relying solely on faster hardware is no longer enough. Parallel and distributed computing are central to modern applications; we need multiple cores or multiple machines to speed applications up or to run them at scale.

Concurrent programming is a very broad topic. Downward it depends on operating systems and storage; upward it touches distributed systems, microservices, and so on; and it takes concrete form in fields such as Java concurrent programming, Go concurrent programming, and JavaScript asynchronous programming. Cloud computing promises unlimited scalability in all dimensions (memory, computing, storage, etc.), and concurrent programming and its related theory are also the foundation for building large-scale distributed applications.


This section mainly discusses the theory of concurrent programming. You can refer to series such as Java Concurrent Programming and Go Concurrent Programming to understand concurrent programming practice in specific languages, and to material on microservices in practice or relational database theory to understand the application of concurrent programming in real systems.

Concurrent and Parallel

Concurrency means a program can have multiple tasks initiated and underway at the same time; it refers to the logical structure of the program. Parallelism means a concurrent program actually executes on hardware that supports parallel execution; it refers to the running state of the program. In other words, concurrent programs are all programs capable of concurrent behavior, which is a relatively broad concept, and parallel programs are only a subset of them. Concurrency is a necessary condition for parallelism, but not a sufficient one. Concurrency is simply a more natural expression of the nature of a problem; its purpose is to simplify code logic, not to make the program run faster. For a program to actually run faster, it must be a concurrent program executing in parallel on multiple cores.

In short, concurrency is the ability to deal with many things at once; parallelism is the ability to do many things at once.


Concurrency is a concept in the problem domain – a program needs to be designed to handle multiple simultaneous (or almost simultaneous) events; a concurrent program contains multiple logically independent execution blocks that can be executed independently in parallel or serially. Parallelism is a concept in the method domain, which speeds up problem solving by executing multiple parts of the problem in parallel. A parallel program often solves problems much faster than a serial program because it can perform multiple parts of the whole task at the same time. Parallel programs may have multiple independent execution blocks or only one.

Concretely, Redis is a good example for distinguishing concurrency from parallelism. Redis itself is a single-threaded database, yet it provides concurrent IO services through IO multiplexing and an event loop. Multicore parallelism inherently carries substantial synchronization costs, especially when locks or semaphores are involved; Redis therefore uses a single-threaded event loop to guarantee a series of atomic operations, achieving nearly zero-cost synchronization even under high concurrency. To quote Rob Pike's description again:

A single-threaded program can definitely provide concurrency at the IO level by using an IO (de)multiplexing mechanism and an event loop (which is what Redis does).
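As a minimal illustration of this idea (a sketch only, not Redis's actual implementation; the EventLoopDemo class and its handlers are hypothetical), a single thread can drain events from many logical clients off one queue, handling them one at a time with no locks at all:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Supplier;

// A minimal single-threaded event loop: events from many logical clients
// are drained from one queue and handled one at a time, so no
// synchronization is needed, yet all clients are served "concurrently".
public class EventLoopDemo {
    public static List<String> run(List<Supplier<String>> events) {
        Queue<Supplier<String>> queue = new ArrayDeque<>(events);
        List<String> log = new ArrayList<>();
        while (!queue.isEmpty()) {          // the event loop: one thread
            Supplier<String> handler = queue.poll();
            log.add(handler.get());         // each handler runs to completion
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> log = run(List.of(
            () -> "client1:GET key",
            () -> "client2:SET key val",
            () -> "client1:GET key"
        ));
        System.out.println(log);
    }
}
```

Because every handler runs to completion before the next begins, each one is trivially atomic with respect to the others; this is the "almost zero-cost synchronization" described above.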

Thread-level concurrency

Concurrent execution has been supported by computer systems since the emergence of time-sharing in the early 1960s. Traditionally, this concurrent execution was only simulated, implemented by a single computer switching among the processes it was executing; this configuration is called a uniprocessor system. Since the 1980s, multiprocessor systems (systems with multiple processors controlled by a single operating system kernel) have adopted technologies such as multicore processors and hyperthreading to achieve true parallelism. A multicore processor integrates multiple CPUs onto a single integrated circuit chip.


Hyperthreading, sometimes called simultaneous multithreading, is a technique that allows a single CPU to execute multiple control flows. It duplicates some of the CPU's hardware, such as the program counter and register file, while sharing single copies of other hardware, such as the units that perform floating-point arithmetic. A conventional processor needs roughly 20,000 clock cycles to switch between threads, whereas a hyperthreaded processor can decide which thread to execute on a cycle-by-cycle basis. This lets the CPU make better use of its processing resources: for example, if one thread must wait for data to be loaded into the cache, the CPU can continue executing another thread.

Instruction level concurrency

At a lower level of abstraction, the property that modern processors can execute multiple instructions simultaneously is called instruction-level parallelism. Each instruction actually takes much longer from start to finish, about 20 cycles or more, but the processor uses many clever techniques to keep up to 100 instructions in flight at once. In a pipeline, the activity needed to execute an instruction is divided into distinct steps, and the processor's hardware is organized as a series of stages, each performing one step. The stages can operate in parallel, working on different parts of different instructions. Even a fairly simple hardware design can achieve an execution rate close to one instruction per clock cycle; a processor that can sustain a rate faster than one instruction per cycle is called a superscalar processor.

Single instruction, multiple data

At the lowest level, many modern processors have special hardware that allows a single instruction to produce multiple operations that can be performed in parallel. This is called single-instruction, multiple-data (SIMD) parallelism. For example, recent Intel and AMD processors have instructions that add four pairs of single-precision floating-point numbers (C data type float) in parallel.

Memory model

As mentioned earlier, modern computers usually have two or more CPUs, some with multiple cores; they allow multiple threads to run simultaneously, each CPU running one of them during a given time slice. In the section on storage management, we introduced the different types of storage in a computer system.


Each CPU contains multiple registers, which are essentially the CPU's own memory; the CPU performs operations on registers much faster than on main memory. Each CPU may also have a cache layer, which is much faster to access than main memory, though slower than the registers. The computer also contains main memory (RAM), which all CPUs can access; main memory is generally much larger than the CPU caches, but also slower. When a CPU needs data from main memory, it reads part of that data into its cache, and may even read part of the cache into its internal registers, and then operates on it. When the CPU needs to write results back, it writes the data in the register out to the cache, and at some point flushes the data from the cache back to main memory. Reads and writes between cache and main memory do not have to transfer everything at once; they operate on only part of the data.


Problems in concurrent programming usually arise from three sources: visibility problems caused by caching, atomicity problems caused by thread switching, and ordering problems caused by compiler optimization. Taking the Java virtual machine as an example, each thread has its own thread stack (call stack), which changes as the thread's code executes. The thread stack holds the local variables of every method currently being executed, and each thread can only access its own stack: local variables are visible only to the thread that created the stack. Even if two threads execute the same piece of code, each creates its own local variables on its own stack, so every thread has its own private copies. All local variables of primitive types live on the thread stack and are invisible to other threads; a thread can pass a copy of a primitive value to another thread, but cannot share the variable itself. Objects created by any thread, by contrast, are stored on the heap.
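To make the point concrete (a sketch; the StackLocalsDemo class is hypothetical), two threads can run exactly the same code, and each still gets its own copy of every local variable on its own stack:

```java
// Two threads execute the same Runnable; each gets its own copy of the
// local variable `local` on its own thread stack, so neither can disturb
// the other's count. The shared results[] array lives on the heap.
public class StackLocalsDemo {
    public static int[] runBoth() {
        int[] results = new int[2];
        Runnable task = () -> {
            int local = 0;                      // lives on this thread's stack
            for (int i = 0; i < 100_000; i++) local++;
            int idx = Thread.currentThread().getName().equals("t0") ? 0 : 1;
            results[idx] = local;               // publish the result on the heap
        };
        Thread t0 = new Thread(task, "t0");
        Thread t1 = new Thread(task, "t1");
        t0.start(); t1.start();
        try {
            t0.join(); t1.join();               // join() makes results[] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }

    public static void main(String[] args) {
        int[] r = runBoth();
        System.out.println(r[0] + " " + r[1]);  // each thread counted independently
    }
}
```

Note that the heap-allocated results array is shared, which is why the heap, not the stack, is where the visibility and atomicity problems discussed next arise.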


So-called visibility means that when one thread modifies a shared variable, another thread can see the modification immediately. In the single-core era, all threads operated on the data of a single CPU's cache, so a write by one thread was necessarily visible to another: for example, if thread B read variable V after thread A updated it, B was guaranteed to see the latest value. In the multi-core era, each CPU has its own cache while shared variables live in main memory. A thread running on one CPU reads a shared variable into that CPU's cache and modifies the value there; because the CPU does not immediately flush the cached data back to main memory, the modification is invisible to a thread running on another CPU. In effect, each thread ends up with its own copy of the shared variable, stored in its CPU's cache.


The most classic case of the visibility problem is concurrent addition. The following two threads update the count field of the same object at the same time. Each first reads count = 0 into its own CPU cache and executes count += 1; afterwards the value in each cache is 1, and when written back to memory the result is 1, not the 2 we expected. Because each CPU cache holds its own copy of count, both threads keep computing from their cached value, so with 1000 increments per thread the final value of count ends up less than 2000.

Thread th1 = new Thread(() -> {
    // Each thread performs addition on the same object's count field
    for (int i = 0; i < 1000; i++) count += 1;
});
Thread th2 = new Thread(() -> {
    for (int i = 0; i < 1000; i++) count += 1;
});
th1.start(); th2.start();
th1.join(); th2.join();

In Java, if multiple threads share an object without proper use of volatile declarations or thread synchronization, then after one thread updates the shared object, another thread may not see the latest value. When a shared variable is declared volatile, every modification is guaranteed to be flushed to main memory immediately, and other threads will read the new value from memory. Visibility is also guaranteed by synchronized and Lock: they ensure that only one thread at a time holds the lock and executes the synchronized code, and that changes to variables are flushed to main memory before the lock is released.
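A small sketch of the volatile guarantee (the VolatileFlagDemo class is hypothetical): a reader thread spins on a volatile flag, and the writer's update becomes visible to it promptly. Without volatile, the reader could in principle spin on its cached copy forever.

```java
// A sketch of using volatile for visibility: the reader thread observes
// the writer's update to `stop` because the volatile write is flushed to
// main memory and the volatile read bypasses the stale cached copy.
public class VolatileFlagDemo {
    private static volatile boolean stop = false;

    public static boolean runOnce() {
        Thread reader = new Thread(() -> {
            while (!stop) { /* spin until the write becomes visible */ }
        });
        reader.start();
        stop = true;                 // volatile write: published immediately
        try {
            reader.join(5_000);      // should return almost instantly
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !reader.isAlive();    // true iff the reader saw the update
    }

    public static void main(String[] args) {
        System.out.println(runOnce());
    }
}
```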


So-called atomicity means that one or more operations cannot be interrupted during CPU execution. What the CPU guarantees to be atomic are CPU instructions, not the operators of a high-level language. Some statements that look atomic in a programming language often turn into multiple operations when compiled to assembly:


# The C statement i++, when compiled to assembly, becomes:
# read the current value of variable i into a temporary register
movl i(%rip), %eax
# increment the temporary register
addl $1, %eax
# write the new value in eax back to memory
movl %eax, i(%rip)

We can clearly see that the C code is a single statement, yet compiling it yields three assembly steps (ignoring compiler optimization; in fact the optimizer may combine these three instructions into one). In other words, only simple reads and assignments are atomic (and only assignment of a constant to a variable; assignment between variables is not atomic). One way to solve the synchronization problem at the level of atomic operations is to rely on primitives supported by the processor, which combine the three instructions into one that executes as a single uninterruptible instruction, undisturbed by concurrent threads. However, the processor is not obliged to provide atomic operations for arbitrary code fragments. In particular, critical sections can be large or even of unknown size, making processor-level atomic support unnecessary or infeasible; in such cases we usually rely on locks to guarantee atomicity.

In Java, reads and assignments of variables of primitive types are atomic operations, i.e. they cannot be interrupted: they either execute completely or not at all. The Java memory model only guarantees that these basic reads and assignments are atomic; to make a wider range of operations atomic, we can use synchronized and Lock. Since synchronized and Lock guarantee that only one thread executes the protected code block at any time, there is no atomicity problem within it.
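A sketch of restoring atomicity to the earlier count += 1 example (the AtomicCounterDemo class is hypothetical): java.util.concurrent.atomic.AtomicInteger uses a processor-level compare-and-swap primitive, so each increment is a single uninterruptible read-modify-write.

```java
import java.util.concurrent.atomic.AtomicInteger;

// With AtomicInteger, two threads incrementing 1000 times each reliably
// produce 2000, unlike the plain count += 1 version shown earlier.
public class AtomicCounterDemo {
    public static int atomicCount(int threads, int perThread) {
        AtomicInteger count = new AtomicInteger();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    count.incrementAndGet();   // one atomic read-modify-write
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return count.get();
    }

    public static void main(String[] args) {
        System.out.println(atomicCount(2, 1000));   // reliably 2000
    }
}
```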


Ordering, as the name implies, means that a program executes in the order of its code. Code reordering refers to the compiler's optimization of user code to improve execution efficiency, under the premise that the result of execution must be the same before and after the optimization.

For example:

int a = 1, b = 2, c = 3;
void test() {
    a = b + 1;
    b = c + 1;
    c = a + b;
}

The assembly for the test function body generated by GCC with the compilation parameter -O0 is as follows:

movl b(%rip), %eax
addl $1, %eax
movl %eax, a(%rip)
movl c(%rip), %eax
addl $1, %eax
movl %eax, b(%rip)
movl a(%rip), %edx
movl b(%rip), %eax
addl %edx, %eax
movl %eax, c(%rip)

Compilation parameters: -O3

movl b(%rip), %eax    # read b into eax
leal 1(%rax), %edx    # write b + 1 into edx
movl c(%rip), %eax    # read c into eax
movl %edx, a(%rip)    # write edx to a
addl $1, %eax         # eax += 1
movl %eax, b(%rip)    # write eax to b
addl %edx, %eax       # eax += edx
movl %eax, c(%rip)    # write eax to c

The classic problem associated with ordering in Java is the singleton pattern. For example, we use a static method to obtain the instance of an object, with synchronized locking to ensure that only one thread triggers creation while the other threads simply get the existing instance:

if (instance == null) {
    synchronized(Singleton.class) {
        if (instance == null)
            instance = new Singleton();
    }
}
return instance;

However, although we expect object creation to proceed as (1) allocate memory, (2) initialize the object, and (3) assign the object reference to the member variable, in practice the optimized code may assign the reference first and initialize the object afterwards. Suppose thread A executes getInstance() first, and a thread switch occurs right after the reference assignment but before initialization. If thread B then calls getInstance(), it will find instance != null at the first check and return instance directly, even though instance has not yet been initialized; accessing a member of instance at this point may trigger a null pointer exception.
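The standard fix is to declare the instance field volatile, which forbids reordering the reference assignment before the constructor finishes. A complete sketch (the Singleton class here is illustrative):

```java
// Double-checked locking done safely: `volatile` prevents the reference
// assignment from being reordered before object initialization, so no
// thread can ever observe a non-null but uninitialized instance.
public class Singleton {
    private static volatile Singleton instance;

    private Singleton() { }

    public static Singleton getInstance() {
        if (instance == null) {                    // first check, no lock
            synchronized (Singleton.class) {
                if (instance == null) {            // second check, under the lock
                    instance = new Singleton();    // volatile write publishes safely
                }
            }
        }
        return instance;
    }

    public static void main(String[] args) {
        System.out.println(getInstance() == getInstance());  // same instance
    }
}
```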

Memory barrier

Multiple processors access shared main memory simultaneously, each reading and writing through its own cache; once data is updated, it must eventually be synchronized to main memory (though the processor is not required to update main memory immediately after a cache update). In this situation, code and instruction reordering, together with the delayed write-back of cached results, changes the order in which shared variables are modified and makes program behavior unpredictable. To address this, processors provide a set of machine instructions that enforce instruction ordering: they tell the processor to complete all pending load and store instructions before continuing. Similarly, the compiler can be told not to reorder the instructions around a given point. These ordering instructions are called memory barriers. The concrete guarantees offered at the programming language level are what a memory model defines.

POSIX, C++, and Java each have their own shared memory model. They are similar in principle but differ slightly in detail. The memory model here does not refer to memory layout; it refers specifically to protecting read and write operations, amid the optimized interactions between main memory, caches, CPUs, write buffers, registers, and the compiler, so that read/write order is preserved. These complicating factors can be generalized into two aspects, reordering and caching: code reordering, instruction reordering, and the CPU cache. Simply put, a memory barrier does two things: prevent reordering and update the cache.

C++11 provides a set of user APIs, std::memory_order, to direct the processor's read/write ordering; Java uses happens-before rules to hide the low-level details and guide the JVM to insert barrier instructions during code generation. A memory barrier can also indicate that the instructions around it must not be optimized at compile time; this is called a compiler barrier and is equivalent to a lightweight memory barrier. Its work is equally important, because it constrains compiler optimization at compile time. Since barrier implementations are somewhat involved, we use a set of abstract, imaginary instructions to describe how memory barriers work, abstracting processor instructions into the macros MB_R, MB_W, and MB:

  • MB_R represents a read memory barrier: reads issued before the barrier cannot be reordered past it.
  • MB_W represents a write memory barrier: writes issued before the barrier cannot be reordered past it.
  • MB represents a full read-write memory barrier: no instruction before the barrier can be reordered past it.
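For concreteness, Java 9+ exposes explicit fences as static methods on java.lang.invoke.VarHandle that correspond roughly to these abstract macros. A sketch (the FenceDemo class is hypothetical; real code usually relies on volatile or java.util.concurrent instead of raw fences):

```java
import java.lang.invoke.VarHandle;

// Mapping the abstract barrier macros onto Java 9+ explicit fences.
public class FenceDemo {
    static int data;
    static boolean ready;

    static void writer() {
        data = 42;                  // ordinary write
        VarHandle.releaseFence();   // ~MB_W: the write to data cannot be
        ready = true;               //   reordered after the write to ready
    }

    static int reader() {
        boolean r = ready;          // ordinary read
        VarHandle.acquireFence();   // ~MB_R: the read of data cannot be
        return r ? data : -1;       //   reordered before the read of ready
    }

    public static void main(String[] args) {
        writer();
        System.out.println(reader());   // 42 when run on a single thread
    }
}
```

If the writer and reader run on different threads, the fences guarantee that a reader observing ready == true also observes data == 42.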

These barrier instructions are effective even on single-core processors: although a uniprocessor involves no data synchronization between multiple processors, instruction reordering and caching still affect the correct synchronization of data. Instruction reordering is very low-level and its effects vary widely; in particular, support for memory barriers differs between architectures, and on architectures that do not reorder instructions, barrier instructions may not be needed at all. How these barrier instructions are used concretely is handled by the platform, compiler, or virtual machine; we only need to use the APIs built on top of them (the various concurrency keywords, locks, reentrant primitives, and so on), which are described in detail in the next section. The purpose here is to help better understand how memory barriers work.

Memory barriers are of great significance; they are the key to correct concurrency. By placing memory barriers correctly, we can ensure that instructions execute in the desired order. Note that memory barriers should apply only to the instructions that need synchronization, possibly including a few surrounding instructions; if every instruction were synchronized, most of the optimizations in current processor designs would become meaningless.

Java Memory Model (JMM)

The Java memory model focuses on describing how threads in Java interact with memory, along with the order of code execution within a single thread, and provides a set of basic principles of concurrent semantics. The earliest Java memory model was proposed in 1995 to solve the problem of thread interaction and synchronization across different processors and operating systems, specifying and guaranteeing deterministic behavior of Java programs across different memory architectures, CPUs, and operating systems. Before Java 5, the JMM was imperfect: multithreaded programs would often read strange data from shared memory; for example, one thread could not see values written to shared variables by other threads, or, because of instruction reordering, one thread might observe bizarre orderings of another thread's operations.

The Java memory model has some inherent "ordering", i.e. ordering that is guaranteed without any extra means, commonly described by the happens-before principle. If the execution order of two operations cannot be deduced from the happens-before principle, their ordering is not guaranteed, and the virtual machine may reorder them at will.

The Java memory model guarantees that changes made by one thread are visible to other threads when the operations are related by happens-before:

  • Code within a thread executes in program order; this is called the Program Order Rule.
  • For a given lock, an unlock operation happens-before every subsequent lock operation on the same lock; this is called the Monitor Lock Rule.
  • A write to a volatile variable happens-before every subsequent read of that variable; this is called the Volatile Variable Rule.
  • Any action within a thread happens after the thread's start() call; this is called the Thread Start Rule.
  • All operations of a thread happen before another thread detects that thread's termination; this is called the Thread Termination Rule.
  • The completion of an object's construction happens-before the start of its finalizer; this is called the Object Finalizer Rule.

Regarding the program order rule, the execution of a piece of code appears ordered within a single thread. Note that although the rule says "operations written earlier happen before operations written later", this refers to the order the program appears to execute in, because the virtual machine may still reorder the code at the instruction level. Even with reordering, the final result is consistent with sequential execution, since only instructions with no data dependence may be reordered. Therefore, within a single thread, program execution appears to proceed in order. Bear in mind that this rule only guarantees the correctness of execution results within a single thread; it cannot guarantee correctness across multiple threads.
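The rules above compose. A small sketch of the Thread Start and Thread Termination rules working together (the StartRuleDemo class and its field names are hypothetical): a plain, non-volatile write made before start() is guaranteed visible in the started thread, and join() makes the child's writes visible back in the parent.

```java
// Thread Start Rule: the write to `config` before start() happens-before
// everything in the child thread. Thread Termination Rule: the child's
// write to seen[0] happens-before the return from join().
public class StartRuleDemo {
    static int config;                       // plain field, not volatile

    public static int observeInChild() {
        final int[] seen = new int[1];
        config = 7;                          // write happens-before start()
        Thread t = new Thread(() -> seen[0] = config);  // guaranteed to see 7
        t.start();
        try {
            t.join();                        // join() happens-before the read below
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(observeInChild());   // 7
    }
}
```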

Processes, threads and coroutines

In a system without an operating system, programs execute sequentially: one program must finish before the next is allowed to run. In a multiprogramming environment, multiple programs are allowed to execute concurrently, and the two modes of execution differ significantly. It is precisely this property of concurrent program execution that led to the introduction of the process concept into operating systems. The process is the basic unit of resource allocation, and the thread is the basic unit of scheduling.

Early operating systems scheduled the CPU in terms of processes. Different processes did not share a memory space, so switching tasks meant the process had to switch its memory mappings. All threads created by a process share one memory space, so the cost of switching tasks between threads is very low. Modern operating systems schedule in terms of the lighter-weight thread, and when we say "task switching" today we mean "thread switching".

Process

A process is the operating system's abstraction of a running program. A system can run multiple processes simultaneously, and each process appears to be using the hardware exclusively. So-called concurrent execution means that the instructions of one process are interleaved with the instructions of another. In both single-core and multi-core systems, a single CPU can appear to execute multiple processes concurrently by switching the processor among them; the mechanism by which the operating system implements this interleaved execution is called context switching.

The operating system keeps track of all the state information the process needs to run. This state, the context, includes much information, such as the current values of the PC and register file and the contents of main memory. At any given moment, a uniprocessor system can execute code for only one process. When the operating system decides to transfer control from the current process to a new one, a context switch occurs: it saves the context of the current process, restores the context of the new process, and passes control to the new process, which resumes where it left off.


In the section on virtual memory management, we introduced how it provides each process with the illusion of using main memory exclusively. Each process sees a consistent view of memory called the virtual address space. The top region of the virtual address space is reserved for the code and data of the operating system, which is the same for all processes; the lower regions of the address space hold the code and data defined by the user process.


  • Program code and data. For all processes, code starts at the same fixed address and is initialized directly from the contents of the executable object file.
  • Heap. The code and data areas are followed immediately by the runtime heap. Unlike the code and data areas, which are fixed in size once the process starts, the heap can expand and contract dynamically at runtime as C standard library functions such as malloc and free are called.
  • Shared libraries. Near the middle of the address space is an area holding the code and data of shared libraries such as the C standard library and the math library.
  • Stack. Located at the top of the user's virtual address space, the stack is used by the compiler to implement function calls. Like the heap, the user stack can expand and contract dynamically during program execution.
  • Kernel virtual memory. The kernel always resides in memory and is part of the operating system. The region at the top of the address space is reserved for the kernel; applications are not allowed to read or write this region or to call functions defined in kernel code directly.

Thread

In modern systems, a process can actually consist of multiple execution units called threads, each running in the context of the process and sharing the same code and global data. Processes are completely independent of one another, whereas threads are interdependent: in a multi-process environment, the termination of any process does not affect the others, but in a multithreaded environment, when the parent thread terminates, all of its child threads are forced to terminate as well (and lose their resources). The termination of any child thread does not affect other threads, unless that child thread calls the exit() system call: if any thread calls exit(), all threads die together. A multithreaded program has at least one main thread, the thread running the main function; it is in effect the process of the whole program, and all other threads are its children. We usually refer to this initial thread of a multithreaded process as the main thread.

The environment shared by threads includes the process code segment, the process's public data, open file descriptors, signal handlers, the current working directory, and the process user ID and group ID, among others; with this shared data, threads can communicate with each other easily. Threads have much in common with one another, yet each also has its own private state, and this is what makes concurrency work:

  • Thread ID: each thread has its own thread ID, unique within the process; the process uses it to identify threads.
  • Register set: because threads run concurrently, each thread has its own flow of control. When switching from one thread to another, the state of the original thread's register set must be saved so that it can be restored when the thread is switched back in.
  • Thread stack: a private stack is necessary for a thread to run independently. Thread functions can call other functions, nested arbitrarily deep, so each thread must have its own function-call stack for calls to execute normally, unaffected by other threads.
  • Error return code: since many threads of the same process run concurrently, one thread may set the error value after a system call and, before it handles the error, the scheduler may run another thread that overwrites the value. Different threads therefore need their own error return code variable.
  • Signal mask: because each thread is interested in different signals, the thread's signal mask is managed by the thread itself, while all threads share the same signal handlers.
  • Thread priority: since threads need to be scheduled like processes, they must have a parameter usable for scheduling, namely the thread's priority.


Threads in Linux

Before Linux 2.6, the kernel had no concept of threads: threads were implemented and managed on top of processes, simulated via lightweight processes, with one user thread corresponding to one kernel lightweight process. The biggest feature of this model was that thread scheduling was done by the kernel, while other thread operations (synchronization, cancellation, etc.) were performed outside the kernel by the LinuxThreads library. To become fully compatible with the POSIX standard, Linux 2.6 improved the kernel by introducing the concept of a thread group (threads are still represented by lightweight processes). With this concept, a group of threads can be organized as one process, but the kernel does not add special scheduling algorithms or special data structures to represent threads; instead, a thread is simply treated as a process (conceptually a thread) that shares certain resources with other processes. The main implementation change was adding the TGID field to task_struct, the field that represents the thread group ID. On the user side, the NPTL thread library replaced LinuxThreads, still adopting the 1:1 model for scheduling.

A process is created by calling the fork system call, pid_t fork(void), while a thread is created by calling the clone system call, int clone(int (*fn)(void *), void *child_stack, int flags, void *arg, ...). Compared with a standard fork(), the overhead of creating a thread is very small: the kernel does not need to separately copy the process's memory space, file descriptors, and so on. This saves a great deal of CPU time, making thread creation ten to a hundred times faster than process creation, so threads can be used in large numbers without much worry about exhausting CPU or memory. Whether fork, vfork, or kthread_create is used, do_fork is ultimately called, and do_fork allocates the resources a process needs according to its different arguments.

Thread pool

The size of the thread pool depends on the characteristics of the tasks performed and the environment in which the program runs. The size should be configurable (read from a configuration file) or set based on the number of available CPUs (e.g. Java's Runtime.availableProcessors()). In the formulas below, Ncpu denotes the number of available CPUs, Nthreads the number of thread pool worker threads, Ucpu the target CPU utilization (0 ≤ Ucpu ≤ 1), W the resource wait time, C the task compute time, Rtotal the total amount of a limited resource, and Rper the amount of that resource required by each task.

  • For compute-intensive tasks (pure CPU computation that does not depend on blocking resources such as external interface calls or on limited resources such as connection pools), the thread pool size can be set to: Nthreads = Ncpu + 1
  • If the tasks include external interface calls or other blocking operations in addition to CPU computation, the thread pool size can be set to Nthreads = Ncpu × Ucpu × (1 + W / C). The longer the IO wait time is relative to the compute time, the larger W / C becomes and the more worker threads are needed; with too few threads, the task queue expands rapidly.
  • If tasks depend on limited resources such as memory, file handles, or database connections, the maximum thread pool size is bounded by Nthreads ≤ Rtotal / Rper.
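The sizing rules above can be sketched in Go (the function name computePoolSize is illustrative, not a standard API; Go's runtime.NumCPU() plays the role of Java's Runtime.availableProcessors()):

```go
package main

import (
	"fmt"
	"math"
	"runtime"
)

// computePoolSize applies Nthreads = Ncpu * Ucpu * (1 + W/C): the more time
// tasks spend waiting relative to computing, the more threads are needed to
// keep the CPUs at the target utilization.
func computePoolSize(ncpu int, ucpu, waitToComputeRatio float64) int {
	n := float64(ncpu) * ucpu * (1 + waitToComputeRatio)
	return int(math.Ceil(n))
}

func main() {
	ncpu := runtime.NumCPU()

	// Compute-bound tasks: Ncpu + 1.
	fmt.Println("compute-bound:", ncpu+1)

	// Mixed tasks: target 80% utilization, tasks wait 3x as long as they compute.
	fmt.Println("mixed:", computePoolSize(ncpu, 0.8, 3.0))
}
```

With 4 CPUs, 80% target utilization, and W/C = 3, the formula yields 4 × 0.8 × 4 ≈ 13 worker threads.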

Coroutine

A coroutine is a lightweight user-mode thread; the most accurate name would be user-space thread. It goes by different names in different fields, such as fiber or green thread. The operating system kernel knows nothing about coroutines: coroutine scheduling is controlled entirely by the application and is independent of the operating system's scheduling. A thread may contain one or more coroutines, each with its own register context and stack. When a coroutine is switched out, its register context and stack are saved; when it is switched back in, the previously saved register context and stack are restored.

For example, the go keyword in Golang is actually responsible for starting such a fiber and running the func's logic on it. All of this happens in user mode rather than kernel mode, which means there is no context-switch overhead. Common coroutine implementations include Go's goroutines, node-fibers, and Java's Quasar.

The Go stack is dynamically sized, growing and shrinking with the amount of data stored. Each new goroutine starts with a stack of only about 4 KB. At 4 KB per stack, roughly 260,000 goroutines fit in 1 GB of RAM, a huge improvement over the roughly 1 MB per thread in Java. Golang implements its own scheduler, allowing many goroutines to run on the same OS thread. Even if Go performed the same context switching as the kernel, it would avoid switching into ring 0 and back, saving a great deal of time. But this is only a paper analysis; to support millions of goroutines, Go needs to do more sophisticated things.
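The cheapness of goroutine creation is easy to demonstrate; the count below is illustrative, chosen to be far beyond what OS threads could comfortably handle:

```go
package main

import (
	"fmt"
	"sync"
)

// spawn launches n goroutines; each starts with only a few KB of stack, so
// launching a hundred thousand is routine, whereas a hundred thousand OS
// threads would exhaust memory on most machines.
func spawn(n int) int {
	var wg sync.WaitGroup
	results := make(chan int, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			results <- id
		}(i)
	}
	wg.Wait()
	close(results)
	count := 0
	for range results {
		count++
	}
	return count
}

func main() {
	fmt.Println(spawn(100000)) // all 100,000 goroutines complete quickly
}
```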

Another optimization is needed to support truly large concurrency: scheduling threads only when you know they can do useful work. If you run a large number of threads, only a few of them are actually doing useful work at any moment. Go achieves this by integrating channels with the scheduler: if a goroutine is waiting on an empty channel, the scheduler sees this and does not run it. Go goes a step further by multiplexing mostly-idle goroutines onto a small set of operating system threads. In this way active goroutines (expected to be far fewer) are scheduled on the same thread, while millions of dormant goroutines are parked separately. This helps reduce latency.
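The channel-scheduler integration can be seen in a minimal pipeline: the receiving goroutine is parked, not busy-waiting, whenever its input channel is empty (runPipeline is an illustrative helper, not a library function):

```go
package main

import "fmt"

// pipeline doubles each value it receives. A goroutine receiving from an
// empty channel is parked by the runtime scheduler: it consumes no CPU
// until a sender makes it runnable again.
func pipeline(in <-chan int, out chan<- int) {
	for v := range in { // parked here whenever `in` is empty
		out <- v * 2
	}
	close(out)
}

// runPipeline feeds values into the pipeline and collects the results.
func runPipeline(values []int) []int {
	in := make(chan int)
	out := make(chan int)
	go pipeline(in, out)
	go func() {
		for _, v := range values {
			in <- v
		}
		close(in)
	}()
	var res []int
	for v := range out {
		res = append(res, v)
	}
	return res
}

func main() {
	fmt.Println(runPipeline([]int{1, 2, 3})) // [2 4 6]
}
```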

Unless Java adds language features that the scheduler can observe, intelligent scheduling of this kind cannot be supported. However, you can build a runtime scheduler in user space that is aware of when a thread can perform its work. This is the basis for frameworks like Akka, which can support millions of actors.

concurrency control

When dealing with multi-threaded programs, surprising things often happen. Assigning to a variable on the heap or stack may produce unexpected results in later execution: memory is accessed illegally, and its contents are changed. Threads in a process share the heap, while each thread maintains its own stack. On platforms such as Windows, different threads use the same heap by default, so synchronization protection is applied when allocating memory with C's malloc (or Windows's GlobalAlloc). Without such protection, a race condition arises when two threads perform memory operations at the same time, which can corrupt the heap's memory management: for example, the same block of memory being allocated to two threads, or pointer errors in the free list.

The most common synchronization primitives for processes/threads are mutexes, rdlocks, condition variables, semaphores, etc.; on Windows, critical sections and events are also commonly used. In summary, synchronization essentially solves two problems, atomicity and visibility/consistency, and its basic means are lock-based. It can therefore be divided into three aspects: instruction serialization / critical resource management / locks; data consistency / data visibility; and transactions / atomic operations. Under concurrency control we consider thread cooperation, mutual exclusion and locks, concurrent containers, and so on.

Thread communication

In concurrency control, we consider both communication between threads (the mechanism by which threads exchange information) and synchronization (read-write waiting, race conditions, etc.). In imperative programming there are two communication mechanisms between threads: shared memory and message passing. Java is a typical shared-memory language, while Go advocates sharing memory by communicating rather than communicating by sharing memory.

In the concurrent model of shared memory, the common state of programs is shared among threads, and the common state in memory is implicitly communicated between threads. In the concurrent model of messaging, there is no common state between threads, and threads must explicitly communicate by sending messages. Synchronization refers to the mechanism used by programs to control the relative sequence of operations between different threads. In the shared memory concurrency model, synchronization is explicit. Programmers must explicitly specify that a method or piece of code needs to be executed mutually exclusively between threads. In the concurrent model of message delivery, synchronization is implicit because the message must be sent before it is received.
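Both models can be sketched in a few lines of Go; the helper names sumShared and sumMessages are illustrative. Note how synchronization is explicit (a mutex) in the first and implicit (a value is received only after it has been sent) in the second:

```go
package main

import (
	"fmt"
	"sync"
)

// Shared-memory style: state (total) is shared; synchronization is explicit.
func sumShared(nums []int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	total := 0
	for _, n := range nums {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			mu.Lock() // explicit mutual exclusion around the shared state
			total += n
			mu.Unlock()
		}(n)
	}
	wg.Wait()
	return total
}

// Message-passing style: no shared state; synchronization is implicit,
// since a message must be sent before it can be received.
func sumMessages(nums []int) int {
	ch := make(chan int)
	for _, n := range nums {
		go func(n int) { ch <- n }(n)
	}
	total := 0
	for range nums {
		total += <-ch
	}
	return total
}

func main() {
	nums := []int{1, 2, 3, 4, 5}
	fmt.Println(sumShared(nums), sumMessages(nums)) // 15 15
}
```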

Common thread communication methods are as follows:

  • Pipe: A pipe is a half-duplex communication mechanism. Data can only flow in one direction, and pipes can only be used between related processes, usually a parent and its child.
  • Message Queue: A message queue is a linked list of messages stored in the kernel and identified by a message queue identifier. Message queues overcome the drawbacks of signals (which carry little information) and pipes (which carry only unformatted byte streams and have limited buffer sizes).
  • Semaphore: A semaphore is a counter that can be used to control access to a shared resource by multiple processes. It is often used as a locking mechanism to prevent one process from accessing a shared resource while another process is accessing it, and so serves mainly as a means of synchronization between processes and between threads within a process.
  • Shared Memory: Shared memory maps a region of memory so that it can be accessed by other processes. The region is created by one process but can be accessed by many. Shared memory is the fastest IPC mechanism and was designed specifically to address the inefficiency of the other inter-process communication modes. It is often used together with other mechanisms, such as semaphores, to achieve synchronization as well as communication between processes.
  • Socket: A socket is also an inter-process communication mechanism; unlike the others, it can be used for communication between processes on different hosts.

Lock and Mutual Exclusion

Mutual exclusion means that a resource can be accessed by only one visitor at a time; access is unique and exclusive. However, mutual exclusion cannot constrain the order in which visitors access the resource: access is unordered. Synchronization refers to visitors accessing the resource in an orderly fashion through additional mechanisms, (in most cases) on top of mutual exclusion. In most cases synchronization already implies mutual exclusion, especially where all writes to the resource must be exclusive; in a few cases, multiple visitors may be allowed to access the resource simultaneously.

Critical resources

A critical resource is a resource that only one process may access at a time; multiple processes can only access it mutually exclusively. Access to critical resources requires synchronization, for example with semaphores, which are a convenient and effective mechanism for process synchronization. But semaphores require every process that accesses the critical resource to perform wait and signal operations. Scattering a large number of synchronization operations across processes in this way not only complicates system management but can also lead to deadlock through improper use. The monitor was introduced to solve such problems.

All kinds of software and hardware resources managed by the operating system can have their characteristics abstractly described by a data structure, i.e. the resource is represented by a small amount of information plus the operations performed on it, ignoring its internal structure and implementation details. A shared data structure abstractly represents a shared resource in the system, and the operations on that shared data structure are defined as a group of procedures, such as requesting and releasing the resource. Applying for, releasing, and otherwise operating on the shared resource is done through this group of procedures, which can also, depending on the state of the resource, accept or block a process's access, ensuring that only one process uses the shared resource at a time. In this way all access to the shared resource is managed uniformly, and mutually exclusive access to the critical resource is achieved.

A monitor is a resource-management module of the operating system consisting of the data structure representing a shared resource and the group of procedures that operate on that data structure. Monitors are called by processes that request and release the critical resource. A monitor defines a data structure and the set of operations that concurrent processes can perform on it, which can both synchronize processes and change the data inside the monitor.
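The monitor idea can be sketched in user space with a mutex plus condition variables. BoundedBuffer below is an illustrative example, not an OS facility: a shared data structure (the slice) plus the only operations (Put/Get) through which callers touch it, with callers blocked while the resource is unavailable:

```go
package main

import (
	"fmt"
	"sync"
)

// BoundedBuffer is a monitor in the textbook sense: callers never touch the
// slice directly; Put and Get run under mutual exclusion, and condition
// variables block callers when the buffer is full or empty.
type BoundedBuffer struct {
	mu       sync.Mutex
	notFull  *sync.Cond
	notEmpty *sync.Cond
	items    []int
	cap      int
}

func NewBoundedBuffer(capacity int) *BoundedBuffer {
	b := &BoundedBuffer{cap: capacity}
	b.notFull = sync.NewCond(&b.mu)
	b.notEmpty = sync.NewCond(&b.mu)
	return b
}

func (b *BoundedBuffer) Put(v int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.items) == b.cap { // block while the resource is exhausted
		b.notFull.Wait()
	}
	b.items = append(b.items, v)
	b.notEmpty.Signal()
}

func (b *BoundedBuffer) Get() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.items) == 0 { // block while there is nothing to consume
		b.notEmpty.Wait()
	}
	v := b.items[0]
	b.items = b.items[1:]
	b.notFull.Signal()
	return v
}

func main() {
	b := NewBoundedBuffer(2)
	go func() {
		for i := 1; i <= 5; i++ {
			b.Put(i) // blocks whenever the buffer holds 2 items
		}
	}()
	for i := 0; i < 5; i++ {
		fmt.Println(b.Get())
	}
}
```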

Pessimistic Locking

Pessimistic Concurrency Control (PCC) is a concurrency control method. It can prevent a transaction from modifying data in a way that affects other users. If a transaction executes operations in which a lock is applied to a row of data, then only when the lock is released by the transaction can other transactions execute operations that conflict with the lock. Pessimistic concurrency control is mainly used in an environment where data contention is intense, and where the cost of using locks to protect data is lower than that of rolling back transactions when concurrent conflicts occur.

In programming languages, pessimistic locks may have the following drawbacks:

  • In multi-threaded competition, locking and unlocking will lead to more context switching and scheduling delays, causing performance problems.
  • A thread holding a lock causes all other threads that need it to hang.
  • If a high priority thread waits for a low priority thread to release the lock, it will cause priority inversion, causing performance risk.

Pessimistic locks in databases mainly have the following problems: in most cases pessimistic locking relies on the database's lock mechanism to guarantee the greatest degree of exclusivity. If locks are held too long, other users cannot access the data for a long time, hurting the program's concurrency. It also imposes a large performance overhead on the database, which for long transactions in particular is often unbearable. For example, in a financial system, when an operator reads a user's data and modifies it on the basis of what was read (such as changing the account balance), under a pessimistic lock the database records stay locked for the entire operation (from the moment the operator reads the data, through the modification, to submitting the result, perhaps even including the time the operator takes a coffee break). You can imagine the consequences in the face of hundreds or thousands of concurrent requests.

Mutex/Exclusive Locks

A mutual exclusion lock (mutex) locks the mutex object itself. It is similar to a spin lock; the only difference is that a thread that fails to acquire the lock goes to sleep until the lock becomes available. After the first thread acquires the lock, the other contenders sleep until they are notified and compete for the lock again.

The mutex is one of the most commonly used locks in concurrent systems and is supported by POSIX, C++11, Java, and others. POSIX locking is fairly conventional; the C++ and Java locking styles are more interesting. In C++ an AutoLock (common in open-source projects such as Chromium) can be used, working much like the auto_ptr smart pointer; in C++11 this was standardized as std::lock_guard and std::unique_lock. Java uses the synchronized keyword (as a block or method modifier) in a very flexible way. Both make skillful use of their language's features to achieve very elegant locking, and both also support the traditional POSIX-style locking mode.
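Go's idiom is closest in spirit to the RAII style: acquire the lock, then defer the unlock so it runs on every return path. The Account type here is an illustrative example, not a library API:

```go
package main

import (
	"fmt"
	"sync"
)

type Account struct {
	mu      sync.Mutex
	balance int
}

// Withdraw uses the lock / defer-unlock idiom, Go's analogue of C++'s
// std::lock_guard or Java's synchronized block: the unlock runs on every
// return path, including panics.
func (a *Account) Withdraw(amount int) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.balance < amount {
		return false
	}
	a.balance -= amount
	return true
}

func main() {
	a := &Account{balance: 100}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			a.Withdraw(10)
		}()
	}
	wg.Wait()
	fmt.Println(a.balance) // 0: no withdrawal was lost or duplicated
}
```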

Reentrant lock

Also known as a recursive lock, a reentrant lock can be acquired again by the thread that already holds it. A lock that does not allow a thread to re-acquire a lock it holds but has not yet released is called non-recursive or non-reentrant. A reentrant lock checks whether the acquiring thread already holds it; if so, it increments a hold counter (0 means the lock is not held by any thread, or has been released). C++11 supports both kinds: the recursive lock std::recursive_mutex and the non-recursive std::mutex. Java's two mutex implementations and its read-write lock implementation all support reentrancy. POSIX uses the notion of reentrant functions to ensure thread safety, with the granularity being the call rather than the thread.

Read-write lock

A read-write lock supports two modes: locking in write mode is exclusive, just like a mutex, but a lock held in read mode can be shared by multiple reading threads. That is, writes are mutually exclusive while reads take a shared lock, so it is also called a shared-exclusive lock. A common mistake is to assume data needs locking only when written; in fact even read operations need lock protection, otherwise the read mode of a read-write lock would be meaningless.
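Go exposes this as sync.RWMutex; the Cache type below is an illustrative example of the shared-read, exclusive-write pattern. Note that the Get path also takes a (shared) lock, as the paragraph above insists:

```go
package main

import (
	"fmt"
	"sync"
)

// Cache guards a map with a read-write lock: many readers may hold RLock
// concurrently, while Lock (write mode) is exclusive.
type Cache struct {
	mu   sync.RWMutex
	data map[string]int
}

func NewCache() *Cache { return &Cache{data: make(map[string]int)} }

func (c *Cache) Get(k string) (int, bool) {
	c.mu.RLock() // shared: readers do not block each other
	defer c.mu.RUnlock()
	v, ok := c.data[k]
	return v, ok
}

func (c *Cache) Set(k string, v int) {
	c.mu.Lock() // exclusive: blocks both readers and writers
	defer c.mu.Unlock()
	c.data[k] = v
}

func main() {
	c := NewCache()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.Set("hits", i)
			c.Get("hits")
		}(i)
	}
	wg.Wait()
	v, ok := c.Get("hits")
	fmt.Println(ok, v >= 0 && v < 100)
}
```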

Optimistic Locking

Compared with pessimistic locking, Optimistic Locking adopts a more relaxed mechanism. It assumes that in general data will not conflict, so conflicts are only formally detected when data is submitted for update; if a conflict is found, an error is returned and the user decides what to do. This already spells out the implementation: there are two main steps, conflict detection and data update, and Compare and Swap is one typical way to implement them.


CAS is an optimistic locking technique. When multiple threads try to update the same variable with CAS simultaneously, only one succeeds; the others fail, but a failed thread is not suspended: it is told it lost the race and may retry. A CAS operation has three operands: the memory location (V), the expected original value (A), and the new value (B). If the value at the memory location matches the expected original value, the processor atomically updates the location to the new value; otherwise it does nothing. In either case it returns the value the location held before the CAS instruction. CAS effectively says: "I think location V should contain value A; if it does, put B there; otherwise, leave it unchanged and tell me its current value." This is the same principle as the optimistic lock's conflict check + data update.

Optimistic locking is not a panacea. Optimistic concurrency control assumes that data races between transactions are relatively rare, so it proceeds as directly as possible, deferring any locking until commit time, thereby avoiding both locks and deadlocks. But doing this naively can produce unexpected results: for example, when two transactions each read a row from the database, modify it, and write it back, problems arise.

  • Optimistic locking can only guarantee atomic operations on a single shared variable; a spin loop, for example, can only guarantee the atomicity of one variable. With several variables, optimistic locking becomes inadequate, whereas a mutex solves the problem easily regardless of the number of objects or their granularity.
  • Long spins can lead to high overhead. If CAS is unsuccessful for a long time and spins all the time, it will bring a lot of overhead to CPU.
  • ABA problem.

The core idea of CAS is to judge whether a memory value has been changed by comparing it with an expected value. But this judgment is not airtight: if the value was originally A, was changed by another thread to B, and then changed back to A, CAS concludes the value never changed, even though it was modified by other threads in the meantime; in scenarios where the computation depends on that intermediate history, this matters. The solution is to introduce a version number that is incremented on every update. Some optimistic lock implementations use such a version to solve the ABA problem: every modification carries a version number, and the modification is applied (and the version incremented) only if the supplied version matches the data's current version; otherwise it fails. Because the version number increases on every operation and never decreases, the ABA problem cannot occur.

Spin lock

The spin lock is the most common lock in the Linux kernel, used to synchronize data across multiple processors. "Spin" here means busy-waiting: if a thread (here, a kernel thread) already holds a spin lock and another thread tries to acquire it, the second thread waits in a loop, or spin-waits, until the lock becomes available. Clearly such a lock must not be held by one thread for long, or the other threads will keep spinning and wasting processor time. Spin locks therefore have a very narrow scope of use: only short-term locking is appropriate.

In fact, an alternative is to let the waiting thread sleep until the lock is available, eliminating the busy-wait. That sounds strictly better, but it is not suitable here: it requires swapping the waiting thread out and swapping it back in when the lock becomes available, at the cost of two context switches. For short hold times this costs more than simply spinning (and spinning is also easier to implement). Note also that a thread which already holds a spin lock must not try to acquire it again, or it will deadlock, a point on which some other operating systems behave differently.

Spin locks are similar to mutexes, except that a spin lock does not put the caller to sleep: if the lock is held by another execution unit, the caller loops, repeatedly checking whether the holder has released it; hence the name "spin". Its purpose is to arbitrate mutually exclusive use of a resource. Because spin locks do not put callers to sleep, they are much more efficient than mutexes for short waits. Despite this efficiency, they have drawbacks:

  • Spin lock occupies CPU all the time. It always runs – spin without lock, so it occupies CPU. If it can’t get lock in a very short time, it will undoubtedly reduce CPU efficiency.
  • Spin locks can cause deadlock: acquiring one recursively deadlocks, and calling certain other functions while holding one, such as copy_to_user(), copy_from_user(), or kmalloc(), may also lead to deadlock.

The spin lock is best suited to cases where the holder keeps the lock for a short time. Precisely because hold times are usually very short, spinning beats sleeping, and the spin lock is far more efficient than a mutex. Semaphores and read-write semaphores suit long hold times; they may put the caller to sleep and so can only be used in process context, whereas spin locks suit very short hold times and can be used in any context. If the protected shared resource is accessed only in process context, protecting it with a semaphore is very appropriate; if each access is very short, a spin lock also works. But if the shared resource must be accessed in interrupt context (including the bottom half, i.e. softirqs and tasklets, and the top half, i.e. the interrupt handler), a spin lock must be used. Preemption is disabled while a spin lock is held, while semaphores and read-write semaphores allow preemption during the hold. Spin locks are only truly needed on preemptible kernels or SMP (multiprocessor) systems; on a single CPU with a non-preemptible kernel, all spin lock operations compile away to nothing. A final special note: spin locks cannot be used recursively.


To achieve serializability while avoiding the various problems of lock mechanisms, we can adopt a lock-free transaction mechanism based on Multi-Version Concurrency Control (MVCC). Lock-based concurrency control is generally called the pessimistic mechanism, and MVCC the optimistic one. The lock mechanism is preventive: reads can block writes and writes can block reads, and when lock granularity is large and hold times are long, concurrency suffers. MVCC checks after the fact: reads do not block writes and writes do not block reads; conflicts are checked only at commit time. Because there are no locks, reads and writes do not block each other, greatly improving concurrency. Source-code version control is a good analogy for MVCC: everyone can read and modify local code freely without blocking each other, and the version control system checks for conflicts and prompts a merge only on commit. Oracle, PostgreSQL, and MySQL all now support MVCC-based concurrency, though the concrete implementations differ.

A simple implementation of MVCC is a conditional update based on the Compare-and-Swap (CAS) idea. A normal update takes only a keyValueSet'; a conditional update adds to this a set of update conditions, conditionSet { … data[keyx] = valuex, … }. That is, the data is updated to keyValueSet' only if the current data satisfies the update conditions; otherwise an error message is returned. This forms the Try / Conditional Update / (Try again) processing pattern:

Take the common example of modifying a user's account information, and assume the account table in the database has a version field, currently 1, and a balance field, currently 100.

  • Operator A reads it out at this time (version = 1) and deducts 50 (100-50) from its account balance.
  • During operator A’s operation, operator B also reads in the user information (version = 1) and deducts 20 (100-20) from its account balance.
  • Operator A completes the modification work and submits the data version number plus 1 (version = 2) and balance = 50 after account deduction to the database update. At this time, because the submitted data version is larger than the current version of the database record, the data is updated, and the database record version is updated to 2.
  • Operator B completes the operation and tries to commit the data (balance = 80) with the incremented version number (version = 2), but on comparison with the database record it turns out that the version submitted by operator B is 2 while the current version of the record is also 2, violating the optimistic locking rule that an update may proceed only if the submitted version is greater than the record's current version. Operator B's commit is therefore rejected. This prevents operator B from overwriting operator A's result with a modification based on the stale version = 1 data.
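The steps above can be sketched as a conditional update; the Account type and UpdateBalance method are illustrative, standing in for a SQL statement like "UPDATE account SET balance = ?, version = version + 1 WHERE id = ? AND version = ?":

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Account mimics a database row with a version column.
type Account struct {
	mu      sync.Mutex
	version int
	balance int
}

var ErrConflict = errors.New("version conflict: reread and retry")

// UpdateBalance is an optimistic, conditional update: it succeeds only if
// the caller's version matches the row's current version, then bumps the
// version, so a commit based on stale data is rejected.
func (a *Account) UpdateBalance(readVersion, newBalance int) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if readVersion != a.version {
		return ErrConflict // someone else committed first
	}
	a.balance = newBalance
	a.version++
	return nil
}

func main() {
	acct := &Account{version: 1, balance: 100}

	// Operators A and B both read version 1, balance 100.
	// A deducts 50 and commits first; the row moves to version 2.
	fmt.Println(acct.UpdateBalance(1, 50)) // <nil>

	// B deducts 20 against the stale read; the commit is rejected.
	fmt.Println(acct.UpdateBalance(1, 80)) // version conflict: reread and retry

	fmt.Println(acct.balance, acct.version) // 50 2
}
```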

As can be seen from the above examples, optimistic locking mechanism avoids the overhead of database locking in long transactions (neither operator A nor operator B locks database data in the process of operation), and greatly improves the overall performance of the system under large concurrency. It should be noted that the optimistic locking mechanism is often based on the data storage logic in the system, so it also has some limitations. For example, in the above example, because the optimistic locking mechanism is implemented in our system, the user balance update operation from the external system is not controlled by our system, so it may cause dirty data to be updated to the database.

Concurrent IO

The concept of IO, understood literally, is input and output. From the top to the bottom of the operating system there is IO between every level: the CPU has IO, memory has IO, the VMM has IO, and the underlying disk has IO too; this is IO in the broad sense. Generally an upper-level IO may generate multiple disk IOs; that is, upper-level IO is sparse while lower-level IO is dense. Disk IO, as the name implies, is input to and output from the disk: input means writing data to the disk, output means reading data from it.

So-called concurrent IO means that within a time slice, if a process performs an IO operation such as reading a file, it can mark itself "sleeping" and yield the CPU. When the file has been read into memory, the operating system wakes the sleeping process, which then has the chance to regain the CPU. The process releases the CPU while waiting for IO so that the CPU can do other work in the meantime, raising CPU utilization; moreover, if another process also reads a file, that read is queued, and as soon as the disk driver finishes one process's read it immediately starts the next queued operation, so IO utilization rises as well.

IO type

Unix has five built-in IO models: blocking IO, non-blocking IO, IO multiplexing, signal-driven IO, and asynchronous IO. From the application's point of view, IO types can be divided into:

  • Large/small IO: This value refers to the number of consecutive read-out sectors given in the controller instructions. If the number is large, such as 64, 128 and so on, we can think of it as large IO; if it is small, such as 4, 8, we will think of it as small IO. In fact, there is no clear boundary between large IO and small IO.
  • Continuous/Random IO: Continuous IO means the starting sector address of this IO is exactly contiguous with, or not far from, the ending sector address of the previous IO; if the gap is large, it is considered random IO. Continuous IO is more efficient than random IO because during continuous IO the head hardly needs to seek, or the seek time is very short; with many random IOs, the head seeks constantly and efficiency drops sharply.
  • Sequential/concurrent IO: Conceptually, concurrent IO is to issue an IO instruction to one disk without waiting for it to respond, and then to issue an IO instruction to another disk. For striped RAID (LUN), its IO operations are concurrent, such as RAID 0 + 1 (1 + 0), raid 5, etc. On the contrary, it is sequential IO.

In traditional network server construction, IO modes are classified along the Blocking/Non-Blocking and Synchronous/Asynchronous axes. Blocking and synchronous are similar; the difference between NIO and async is that NIO emphasizes polling while async emphasizes notification. For example, in a typical single-process, single-threaded socket server, a blocking call must complete before the next socket connection can be serviced. With a non-blocking socket, if no data is ready the server application gets a special "would block" error from the kernel instead of blocking until the client's request data arrives.


Generally speaking, on a Linux system you can call select or epoll to traverse all descriptors that are ready for reading or writing. For asynchronous sockets (such as sockets on Windows or the socket model implemented in .NET), the server application tells the IO framework to read data from a socket, and the framework automatically invokes your callback after the data has been read (that is, it notifies the application that the data itself is ready). Taking the Reactor and Proactor models of IO multiplexing as an example: the non-blocking model requires the application itself to perform the IO, while in the asynchronous model the kernel or framework reads the data into a buffer and the application reads directly from that buffer.

  • Synchronization blocking: In this way, after an IO operation is initiated, the user process must wait for the completion of the IO operation. Only when the IO operation is actually completed, the user process can run.
  • Synchronized non-blocking: In this way, the user process can return to do other things after initiating an IO operation, but the user process needs to ask whether the IO operation is ready from time to time, which requires the user process to keep asking, thus introducing unnecessary waste of CPU resources.
  • Asynchronous non-blocking: In this mode the user process only needs to initiate an IO operation and return immediately. When the IO operation truly completes, the application is notified of its completion; the user process then only needs to process the data, with no actual IO read/write to perform, because the real IO read/write has already been completed inside the kernel.

In concurrent IO, a common benchmark is the so-called C10K problem: 10,000 clients connect to a single server and keep their TCP connections open; the clients send requests from time to time, and the server must process each request and return the result promptly.

IO multiplexing

IO multiplexing is a mechanism for monitoring multiple descriptors at once: as soon as a descriptor becomes ready (usually readable or writable), the program is notified so it can perform the corresponding read or write. select, poll and epoll are all IO multiplexing mechanisms. It is worth noting that epoll only works for IO sources whose reads and writes can block, such as pipes or sockets; ordinary file descriptors return file contents immediately, so epoll and similar functions do not apply to regular file reads and writes.

First, let's look at readable and writable events. A socket-readable event occurs when any of the following conditions holds:

  • The number of data bytes in the socket's receive buffer is greater than or equal to the receive buffer's low-water mark.
  • The read half of the socket is closed (i.e. a FIN has been received); a read on such a socket returns 0 (i.e. EOF).
  • The socket is a listening socket and its completed-connection queue is not empty.
  • The socket has an error pending; a read operation on such a socket will return -1.

A socket-writable event occurs when any of the following conditions holds:

  • The number of available space bytes in the socket's send buffer is greater than or equal to the send buffer's low-water mark.
  • The write half of the socket is closed; continuing to write to such a socket generates the SIGPIPE signal.
  • A non-blocking connect has completed, whether the connection succeeded or failed.
  • The socket has an error pending; a write operation on such a socket will return -1.

select, poll and epoll are all essentially synchronous IO: once a read/write event is ready, the application itself must perform the read or write, and that read/write step blocks. Asynchronous IO, by contrast, does not require the application to perform the read or write; the asynchronous IO implementation is responsible for copying the data from the kernel into user space. select itself is a stateless polling mechanism: every call copies the FD set from user space into kernel space, an overhead that becomes significant with many fds. epoll instead handles connections via event callbacks; the number of descriptors it can maintain is effectively unlimited, and its performance does not degrade as the number of descriptors grows.

| Method | Quantity limitation | Connection processing | Memory operation |
| ------ | ------------------- | --------------------- | ---------------- |
| select | The number of descriptors is limited by FD_SETSIZE in the kernel (1024 by default); recompiling the kernel can change FD_SETSIZE, but this does not improve performance. | Each call to select linearly scans the state of all descriptors; after select returns, the user must also scan the fd_set array linearly to find the ready descriptors (O(n)). | Every call to select copies the FD descriptors and related information between user space and kernel space. |
| poll | Stores FDs in a pollfd structure, breaking through select's descriptor-count limit. | Scanning similar to select. | The pollfd array must be copied into kernel space, and each FD's status is then scanned in turn; overall complexity is still O(n), so under heavy concurrency server performance drops quickly. |
| epoll | The FD list for sockets is kept in a kernel structure with no hard size limit (4k by default). | Based on kernel-provided callbacks: when a socket becomes active, the kernel invokes that socket's callback directly, with no need to traverse and poll all descriptors. | epoll uses shared memory instead of memory copies when passing messages between kernel and user space, which also makes it more efficient than poll and select. |