System knowledge behind goroutine (Reprint)


It has been three years since the Go language went from birth to popularity. Most of its pioneers come from a web development background, and some popular introductory books have appeared. However, people with a system development background always find these books vague on the details. There are also some widely circulated articles on the Internet, but they all contain, to a greater or lesser degree, technical descriptions that are inconsistent with the facts. I hope this article can introduce the system knowledge behind goroutine to web developers who lack a system programming background.

1. Operating system and runtime
2. Concurrency and parallelism
3. Thread scheduling
4. Concurrent programming framework
5. goroutine

1. Operating system and runtime

For ordinary computer users, it is enough to understand that applications run on top of the operating system. As developers, we also need to understand how the programs we write run on the operating system and how the operating system provides services to applications, so that we can tell which services are provided by the operating system and which are provided by the runtime of the language we use.

Besides internal modules such as memory management, file management, process management and peripheral management, the operating system also provides many external interfaces for applications. These interfaces are the so-called “system calls”. Since the DOS era, system calls have been provided in the form of software interrupts, the famous INT 21h: the program puts the number of the function it wants into the AH register, places the parameters in other specified registers, and executes INT 21h. After the interrupt returns, the program reads the return value from a specified register (usually AL). This practice did not change until the Pentium II, that is, the P6 generation. For example, Windows provided system calls through INT 2Eh and Linux through INT 80h, although the registers involved were wider than before and there might be an extra level of jump-table lookup. Later, Intel and AMD provided the more efficient SYSENTER/SYSEXIT and SYSCALL/SYSRET instructions respectively to replace the interrupt mechanism; they bypass the time-consuming privilege-level checks and register pushes, and switch directly from the ring 3 code segment to the ring 0 code segment.

What functions do system calls provide? Searching for the operating system name together with the corresponding interrupt number yields a complete list for Windows or for Linux. This list is the communication protocol between the operating system and applications. Anything beyond this protocol has to be implemented in our own code. For memory management, for example, the operating system only provides process-level memory segment management, such as the virtual memory function family on Windows or brk on Linux; the operating system does not care how the application allocates memory for new objects or how it does garbage collection, so these have to be implemented by the application itself. If a feature beyond this protocol cannot be implemented by ourselves, we say that the operating system does not support it. For example, Linux did not support multithreading before 2.6: no matter how we simulated it in a program, we could not create multiple scheduling units that ran simultaneously and complied with the POSIX 1003.1c semantics.

However, we do not need to issue interrupts or SYSCALL instructions when writing programs, because the operating system provides a layer of encapsulation. On Windows this is ntdll.dll, often called the Native API. We not only do not need to call INT 2Eh or SYSCALL directly; strictly speaking we cannot, because Windows does not publish its calling convention, so using INT 2Eh or SYSCALL directly gives no guarantee of future compatibility. Linux has no such problem: the list of system calls is public, and Linus attaches great importance to compatibility and does not change it. glibc even provides syscall(2) specifically so that users can invoke system calls directly by number. However, to avoid the trouble caused by version mismatches between glibc and the kernel, and to make some calls more efficient (such as __NR_gettimeofday), Linux still wraps some system calls in a layer of its own, the vDSO (which went by a different name in earlier kernels).

However, our programs rarely call ntdll or the vDSO directly. They go through yet another layer of encapsulation, which handles parameter preparation, return value conversion, error handling and error code translation. This layer is the runtime of the language we use. For C, it is glibc on Linux and kernel32 (or MSVCRT) on Windows. For other languages, Java for example, it is the JRE, and the runtimes of these “other languages” usually end up calling glibc or kernel32.

The term “runtime” actually covers not only the library files linked with the compiled executable, but also the runtime environment of scripting or bytecode-interpreted languages, such as Python, the CLR for C#, and the JRE for Java.

Encapsulating system calls is only a small part of what a runtime does. A runtime usually also provides functions that need no operating system support, such as string processing, mathematical calculation and common data structure containers. At the same time, it provides easier and higher-level wrappers around functions that the operating system does support, such as buffered and formatted I/O and thread pools.

Therefore, when we say “XXX language adds XXX function”, there are usually several possibilities:
1. New semantics or syntax are supported, making it easier for us to describe and solve problems. For example, Java generics, annotations and lambda expressions.
2. New tools or class libraries are provided, reducing the amount of code we have to write. For example, argparse in Python 2.7.
3. System calls are wrapped better and more comprehensively, enabling us to do things that were impossible or difficult in this language environment before. For example, Java NIO.

However, no language, including its runtime and running environment, can create functionality that the operating system does not support, and Go is no exception. No matter how dazzling its feature descriptions look, other languages must be able to do the same; Go simply provides more convenient and clearer semantics and support, which improves development efficiency.

2. Concurrency and parallelism

Concurrency refers to the logical structure of a program. A non-concurrent program runs straight through from beginning to end: it has only one logical control flow, that is, it is a sequential program, and at any moment it is at exactly one position in that control flow. If a program has multiple independent logical control flows, that is, it can deal with several things “at the same time”, we say that the program is concurrent. “At the same time” here does not necessarily mean the same instant on the clock (that would be a running state, not a logical structure); it means that if these logical control flows are drawn on a timing diagram, they may overlap on the time axis.

Parallelism refers to the running state of a program. If a program is being processed by multiple CPU pipelines at the same instant, we say that the program is running in parallel. (Strictly speaking, we cannot say that a program is “parallel”, because “parallel” describes not the program itself but its running state. This article does not insist on such precision: when we say “parallel” below, we mean “running in parallel”.) Obviously, parallelism requires hardware support.

And it’s not hard to understand:

1. Concurrency is a necessary condition for parallelism. If a program itself is not concurrent, that is, there is only one logical control flow, it is impossible for us to let it be processed in parallel.

2. Concurrency is not a sufficient condition for parallelism. If a concurrent program is processed by only one CPU pipeline (through time-sharing), it is not parallel.

3. Concurrency is simply a way of expressing a program that is closer to the nature of the actual problem. The original purpose of concurrency is to simplify code logic, not to make the program run faster.

These paragraphs are slightly abstract, so let us make the concepts concrete with the simplest possible example. The simplest HelloWorld written in C is non-concurrent. If we create multiple threads and print one HelloWorld in each thread, the program is concurrent. If that program runs on an old-fashioned single-core CPU, the concurrent program is not parallel. If we run it on a multi-core or multi-CPU machine with an operating system that supports multitasking, the concurrent program is parallel.
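The same idea can be sketched in Go, with goroutines standing in for the threads of the C example. This is only an illustration under stated assumptions: the function name `greet` and the message text are made up for this sketch. The program is concurrent by structure; whether the control flows actually run in parallel depends on how many logical processors the runtime is allowed to use.

```go
package main

import (
	"fmt"
	"sync"
)

// greet starts n goroutines. Each goroutine is an independent logical
// control flow that produces one greeting. (greet is a hypothetical helper
// written for this sketch.)
func greet(n int) []string {
	out := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each goroutine writes only its own slot, so no lock is needed.
			out[i] = fmt.Sprintf("Hello, World from flow %d", i)
		}(i)
	}
	wg.Wait() // wait until every control flow has finished
	return out
}

func main() {
	for _, line := range greet(3) {
		fmt.Println(line)
	}
}
```

The results are stored by index, so the printed order is fixed even though the order in which the goroutines actually execute is not.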

There is also a slightly more complex example that better illustrates that concurrency is not necessarily parallel and that concurrency is not about efficiency: the prime sieve sieve.go from the Go language examples. We start a code fragment for each factor, from small to large. If the number currently being checked is divisible by the current factor, it is not a prime. If not, the number is passed to the code fragment of the next factor; if even the last factor does not divide it, the number is a prime, and we start another code fragment to check larger numbers. This matches the logic by which we compute primes, and the code fragment for every factor is the same, so the program is very concise. But it cannot be parallelized, because each fragment depends on the result and output of the previous fragment.
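A minimal sketch of such a sieve is shown below. It follows the well-known structure of the Go example, but it is a reconstruction for illustration rather than a verbatim copy of sieve.go; in particular, the wrapper `firstPrimes` is an assumption added here so that the result can be inspected.

```go
package main

import "fmt"

// generate sends 2, 3, 4, ... into ch.
func generate(ch chan<- int) {
	for i := 2; ; i++ {
		ch <- i
	}
}

// filter is the code fragment for one factor: it copies numbers from in to
// out and drops those divisible by prime.
func filter(in <-chan int, out chan<- int, prime int) {
	for {
		i := <-in
		if i%prime != 0 {
			out <- i
		}
	}
}

// firstPrimes chains one filter per prime found so far and returns the
// first n primes. (Hypothetical wrapper; the goroutines are left running,
// which is acceptable only in a short demo.)
func firstPrimes(n int) []int {
	ch := make(chan int)
	go generate(ch)
	primes := make([]int, 0, n)
	for i := 0; i < n; i++ {
		prime := <-ch // the first number that survives every filter is prime
		primes = append(primes, prime)
		next := make(chan int)
		go filter(ch, next, prime)
		ch = next
	}
	return primes
}

func main() {
	fmt.Println(firstPrimes(10)) // [2 3 5 7 11 13 17 19 23 29]
}
```

Every filter waits for the output of the one before it, which is exactly the dependency that prevents the pipeline from being processed in parallel in any useful way.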

Concurrency can be achieved in the following ways:

1. Explicitly define and trigger multiple code fragments, that is, logical control flows, which are scheduled by the application or the operating system. They may be independent of one another, or they may depend on each other and need to interact. The prime calculation mentioned above is an example, and it is in fact also a classic producer-consumer problem: of two logical control flows A and B, A produces output, and whenever there is output, B takes it for processing. Threads are only one of the means of achieving concurrency; the runtime or the application itself has a variety of other means, which are the main content of a later section.

2. Implicitly register multiple code fragments so that the corresponding fragment is executed when a system event occurs, that is, the event-driven approach. For example, a port or pipe receives data (in the case of I/O multiplexing), or a process receives a signal.

Parallelism can be achieved at four levels:

1. Multiple machines. Naturally, we have multiple CPU pipelines, such as MapReduce task in Hadoop cluster.

2. Multiple CPUs. Whether it’s real multiple CPUs, multi-core or hyper threading, in short, we have multiple CPU pipelines.

3. Instruction-level parallelism (ILP) within a single CPU core. Through sophisticated manufacturing processes, instruction decoding, branch prediction and out-of-order execution, today’s CPUs can execute multiple instructions in a single clock cycle. Therefore, even non-concurrent programs may be executed in parallel.

4. Single instruction, multiple data (SIMD). In order to process multimedia data, current CPU instruction sets support a single instruction operating on multiple data items.

Among them, 1 involves distributed processing, including data distribution and task synchronization, and is based on the network. 3 and 4 are usually considered by compiler and CPU developers. The parallelism we mentioned here mainly aims at the second kind: multi-core CPU parallelism in a single machine.

Rob Pike, one of the authors of the Go language, has also written specifically on the issue of concurrency and parallelism.

The figure in CMU’s well-known textbook Computer Systems: A Programmer’s Perspective also illustrates this very intuitively and clearly (the figure is not reproduced here).

3. Thread scheduling

The previous section covered the concepts of concurrency and parallelism, and threads are the most intuitive implementation of concurrency. This section discusses how the operating system makes multiple threads execute concurrently, which of course means in parallel when there are multiple CPUs. We will not discuss processes: a process means an “isolated execution environment”, not a “separate execution sequence”.

We first need to understand how the IA-32 CPU controls instruction flow, in order to understand how to switch between multiple instruction sequences (i.e. logical control flows). The CPU determines the position of the next instruction from the value of the CS:EIP register pair, but it does not allow EIP to be changed directly with a MOV instruction; a jump must be made through the JMP family of instructions, CALL/RET, or the INT interrupt instruction. When switching between instruction sequences, in addition to changing EIP, we also need to ensure that all the registers the code may use, especially the stack pointer SS:ESP and the EFLAGS flags, are restored to the state they had when the target instruction sequence last reached this position.

Threads are a service that the operating system provides to applications. Through system calls, an application can ask the operating system to start a thread, and the operating system is then responsible for scheduling and switching it. Let us first consider a single single-core CPU. The operating system kernel and the application actually share the same CPU. While EIP is in the application’s code segment, the kernel has no control. The kernel is not a process or a thread; it is simply code in memory that runs with ring 0 code segment privilege, and control passes to it only when an interrupt occurs or the application makes a system call. Inside the kernel, all code lives in the same address space. In order to serve different threads, the kernel builds a kernel stack for each thread, and this is the key to thread switching. Usually the kernel schedules the threads of the whole system on the clock interrupt or just before a system call returns (for performance reasons, usually only before infrequently used system calls return): it computes the remaining time slice of the current thread and, if a switch is needed, computes priorities in the queue of “runnable” threads. After selecting the target thread, it saves the running environment of the current thread and restores that of the target thread. The most important steps are switching the stack pointer ESP and then pointing EIP at the instruction where the target thread was last taken off the CPU.

The Linux kernel plays a trick when implementing thread switching. It does not JMP directly; instead it first switches ESP to the kernel stack of the target thread, pushes the target thread’s code address onto that stack, and then JMPs to __switch_to(), which is equivalent to forging a CALL __switch_to() instruction. When __switch_to() ends with a RET instruction, the target thread’s code address on the stack is loaded into EIP, and the CPU starts executing the target thread’s code, which is actually the place where it last stopped, at the point where the switch_to macro expands.

A few additional points: (1) Although IA-32 provides the TSS (Task State Segment) in an attempt to simplify thread scheduling for the operating system, it is inefficient and not a general standard, which hurts portability, so mainstream operating systems do not use the TSS for this purpose. Strictly speaking, the TSS is still used, because only through the TSS can the stack be switched to the kernel stack pointer SS0:ESP0; its other features are not used at all. (2) When a thread enters the kernel from user mode, the relevant registers and the EIP of the user-mode code segment have already been saved once, so there is not much left to save and restore during the kernel-mode thread switch described above. (3) The scheduling described above is preemptive. The kernel and its hardware drivers also call schedule() voluntarily while waiting for external resources to become available, and user-mode code can voluntarily trigger scheduling and give up the CPU through the sched_yield() system call.

Nowadays an ordinary PC or server usually has multiple CPUs (physical packages), each CPU has multiple processor cores, and each core can support multiple hardware threads, that is, logical processors. Each logical processor has its own complete set of registers, including CS:EIP and SS:ESP, so from the perspective of the operating system and applications, each logical processor is a separate pipeline. With multiple processors, the principle and process of thread switching are basically the same as with a single processor. There is only one copy of the kernel code. When a clock interrupt or system call occurs on some CPU, that CPU’s CS:EIP and control return to the kernel, and the kernel switches threads according to the result of its scheduling policy. At this point, however, if our program implements concurrency with threads, the operating system can make our program run in parallel on multiple CPUs.

Two points need to be added here. (1) In a multi-core scenario, the cores are not completely equivalent. For example, two hyper-threads on the same core share the L1/L2 cache, and on NUMA systems the latency of each core accessing different regions of memory differs. Thread scheduling in multi-core scenarios therefore introduces “scheduling domains”, but this does not affect our understanding of the thread switching mechanism. (2) In a multi-core scenario, which CPU receives an interrupt? Software interrupts (including division by zero, page faults and the INT instruction) naturally occur on the CPU that triggers them. Hardware interrupts fall into two cases: interrupts generated by each CPU itself, such as the clock, which are handled by that CPU, and external interrupts such as I/O, which can be routed to a chosen CPU through the APIC. Because the scheduler can only control the current CPU, if I/O interrupts are not distributed evenly, I/O-related threads can only run on some of the CPUs, resulting in uneven CPU load and reduced efficiency of the whole system.

4. Concurrent programming framework

The above roughly describes how a program that uses multithreading to implement concurrency is scheduled by the operating system and executed in parallel (when there are multiple logical processors). It also shows that scheduling and switching code fragments, or logical control flows, is not mysterious. In theory, we could do without the operating system and its threads: define multiple fragments within the code segment of our own program, and then schedule and switch between them inside our own program.

For ease of description, we next refer to “code snippets” as “tasks”.

The approach is similar to the kernel’s implementation, except that we do not need to consider interrupts and system calls. Our program is then essentially a loop, and this loop is itself the scheduler, schedule(). We maintain a task list and, according to a policy we define (first in first out, or by priority), select one task from the list each time, restore the values of its registers, and JMP to the place where the task was last suspended. Everything that needs to be saved can be stored in the task list as attributes of the task.
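The loop described above can be sketched as follows. Go does not let us save registers and JMP by hand, so this sketch makes a simplifying assumption: each task is a closure whose captured variables play the role of the saved execution state, and calling `step` stands in for "restore registers and jump to where the task was suspended". All names (`task`, `counter`, `schedule`, `runDemo`) are hypothetical and chosen for this illustration only.

```go
package main

import "fmt"

// task is one resumable unit of work. step runs the task until its next
// voluntary yield and reports whether the task has finished.
type task struct {
	name string
	step func() (done bool)
}

// counter returns a task that needs n steps to finish. The captured
// variable i is the task's saved state between two scheduling rounds.
func counter(name string, n int, log *[]string) *task {
	i := 0
	return &task{name: name, step: func() bool {
		i++
		*log = append(*log, fmt.Sprintf("%s:%d", name, i))
		return i >= n
	}}
}

// schedule is the scheduler loop: take the task at the head of the FIFO
// list, resume it, and put it back at the tail unless it has finished.
func schedule(tasks []*task) {
	for len(tasks) > 0 {
		t := tasks[0]
		tasks = tasks[1:]
		if !t.step() {
			tasks = append(tasks, t)
		}
	}
}

// runDemo interleaves a two-step task A with a three-step task B.
func runDemo() []string {
	var log []string
	schedule([]*task{counter("A", 2, &log), counter("B", 3, &log)})
	return log
}

func main() {
	fmt.Println(runDemo()) // [A:1 B:1 A:2 B:2 B:3]
}
```

Note that a task only gives control back when its `step` function returns, which is the cooperative behaviour discussed next.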

It seems very simple, but we still need to solve several problems:

(1) When we run in user mode, there is no mechanism such as interrupts or system calls to interrupt code execution. So once our schedule() code hands control to a task, when does the next scheduling happen? The answer is that it does not. Only when a task voluntarily calls schedule() do we get a chance to schedule. Tasks here therefore cannot, like threads, rely on kernel scheduling and run without a care; they must call schedule() explicitly, which is the so-called cooperative scheduling. (We could imitate the kernel’s clock interrupt and regain control by registering a signal handler, but the problem is that the signal handler is called by the kernel. When it finishes, the kernel regains control, then returns to user mode and continues along the code path that was interrupted when the signal occurred, so we cannot switch tasks inside the signal handler.)

(2) The stack. As with kernel thread scheduling, we need to allocate a separate stack for each task, store its stack information in the task’s attributes, and save or restore the current SS:ESP on each task switch. The space for a task stack can be allocated on the current thread’s stack or on the heap, but the heap is usually the better choice: there is almost no limit on its size or on the total number of tasks, the stack can grow dynamically (GCC has split stacks, but they are too complex), and it is easier to move tasks to other threads.

At this point we know how to construct a concurrent programming framework. But how do we make tasks execute in parallel on multiple logical processors? Only the kernel has the privilege to schedule CPUs, so we still have to create threads through system calls to achieve parallelism. When running multiple tasks on multiple threads, we also need to consider several issues:

(1) If a task initiates a system call, such as waiting for IO for a long time, the current thread will be put into the queue waiting for scheduling by the kernel. Won’t it make other tasks have no chance to execute?

With a single thread, we have only one solution: use non-blocking I/O system calls, give up the CPU, and then poll all descriptors together in schedule(); when data arrives, switch back to the task that corresponds to that fd. A less efficient approach skips the unified polling and lets each task retry its I/O in a non-blocking manner whenever it is its turn to run, until data is available.

If we use multithreading to construct our whole program, we can encapsulate the interface of system call. When a task enters the system call, we will leave the current thread to it (temporarily) and start a new thread to deal with other tasks.

(2) Task synchronization. For example, in the example of producers and consumers mentioned in the previous section, how can consumers wait when the data has not been produced, and trigger consumers to continue execution when the data is available?

With a single thread, we can define a structure whose fields store the data being exchanged, its current availability state, and the IDs of the two tasks responsible for reading and writing it. Our concurrent programming framework then provides read and write methods for tasks to call. In read, we loop to check whether the data is available; if it is not, we call schedule() to give up the CPU and wait. In write, we store the data in the structure, change its state to available, and return. In schedule(), we check the availability state; if data is available, we activate the task that needs to read it. That task continues its loop, finds the data available, reads it, changes the state back to unavailable, and returns. The simplified logic of the code is as follows:

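#include <stdbool.h>

void schedule(void);  /* provided by the framework: switch to another task */

struct chan {
    bool ready;  /* true while data holds a value that has not been read */
    int data;
};

int read(struct chan *c) {
    while (1) {
        if (c->ready) {
            c->ready = false;
            return c->data;
        } else {
            schedule();  /* no data yet: yield and check again later */
        }
    }
}

void write(struct chan *c, int i) {
    while (1) {
        if (c->ready) {
            schedule();  /* previous value not consumed yet: yield */
        } else {
            c->data = i;
            c->ready = true;
            schedule();  /* optional */
            return;
        }
    }
}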

Obviously, if it is multithreaded, we need to protect the access to the data in this structure through the synchronization mechanism provided by thread library or system call.
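As a rough illustration of that multithreaded case, the sketch below protects the same ready/data pair with a mutex and a condition variable from Go’s sync package, so that reader and writer may run on different threads. The type name `slot` and the helper `sumThroughSlot` are assumptions made for this example; they are not how the Go runtime implements its own channels.

```go
package main

import (
	"fmt"
	"sync"
)

// slot is a one-element mailbox: the ready/data pair from the structure
// above, guarded by a mutex so it can be shared between threads.
type slot struct {
	mu    sync.Mutex
	cond  *sync.Cond
	ready bool
	data  int
}

func newSlot() *slot {
	s := &slot{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// write blocks until the previous value has been consumed, then stores v.
func (s *slot) write(v int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.ready {
		s.cond.Wait() // releases the mutex while waiting
	}
	s.data, s.ready = v, true
	s.cond.Broadcast() // wake up a waiting reader
}

// read blocks until a value is available, then consumes and returns it.
func (s *slot) read() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	for !s.ready {
		s.cond.Wait()
	}
	s.ready = false
	s.cond.Broadcast() // wake up a waiting writer
	return s.data
}

// sumThroughSlot sends 1..n from a producer goroutine and adds them up in
// the calling goroutine, which acts as the consumer.
func sumThroughSlot(n int) int {
	s := newSlot()
	go func() {
		for i := 1; i <= n; i++ {
			s.write(i)
		}
	}()
	total := 0
	for i := 0; i < n; i++ {
		total += s.read()
	}
	return total
}

func main() {
	fmt.Println(sumThroughSlot(5)) // 15
}
```

Waiting on the condition variable replaces the explicit call to schedule(): the blocked side is parked until the other side signals a state change.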

The above is the design consideration of the most simplified concurrency framework. The concurrency framework encountered in our actual development work may be different due to different languages and runtime, and may have different choices in function and ease of use, but the underlying principles are the same.

For example, the getcontext/setcontext/swapcontext family of library functions in glibc can conveniently be used to save and restore task execution state, and Windows provides the Fiber family of SDK APIs. Neither is a system call. Although the man pages of getcontext and setcontext are in section 2, that is only a historical leftover from SVR4; the implementation is in glibc rather than in the kernel. CreateFiber is provided by kernel32, and there is no corresponding NtCreateFiber in ntdll.

In other languages, what we have been calling a “task” is more often called a coroutine. For example, the most commonly used implementation in C++ is Boost.Coroutine. Java is more troublesome because of its bytecode interpretation layer, but there are JVM patches that support coroutines, as well as projects that rewrite bytecode dynamically to support them. The generators and yield of PHP and Python are in fact coroutine support, on top of which more general coroutine interfaces and scheduling can be built. There are also languages such as Erlang that support coroutines natively; I am not familiar with them, so I will not comment.

Since saving and restoring task execution status requires access to CPU registers, the relevant runtime will also list the supported CPUs.

At the operating system level, it seems that only OS X and iOS provide coroutines together with their parallel scheduling, in the form of Grand Central Dispatch, and most of its functionality is likewise implemented in the runtime.

5. goroutine

Through goroutines, the Go language provides the clearest and most direct support for concurrent programming of all the languages (that I know of) so far. The Go documentation also describes their features very comprehensively, perhaps even more than comprehensively. Here, based on the system knowledge introduced above, we list the characteristics of goroutines as a summary:

(1) Goroutines are a feature of the Go language runtime, not something provided by the operating system. Goroutines are not implemented with threads. For details, see pkg/runtime/proc.c in the Go source code.

(2) A goroutine is a piece of code, a function entry point, and a stack allocated for it on the heap. It is therefore very cheap, and we can easily create tens of thousands of goroutines, but they are not scheduled by the operating system.

(3) Apart from threads blocked in system calls, the Go runtime starts at most $GOMAXPROCS threads to run goroutines.

(4) Goroutines are scheduled cooperatively. If a goroutine runs for a long time without synchronizing by reading from or writing to a channel, it needs to call Gosched() voluntarily to give up the CPU.

(5) As with coroutines in all other concurrency frameworks, the so-called “lock-free” advantage of goroutines only holds with a single thread. If $GOMAXPROCS > 1 and the coroutines need to communicate, the Go runtime takes care of locking to protect the data, which is why examples such as sieve.go are slower with multiple CPUs and multiple threads.

(6) In essence, the requests handled by server-side programs such as web servers are a parallel processing problem: each request is basically independent of the others, and there is almost no data exchange between them. This is not a concurrent programming model, and a concurrent programming framework only reduces the complexity of expressing it; it does not fundamentally improve processing efficiency. Perhaps because “concurrent connections” and “concurrent programming” both use the word “concurrent” in English, it is easy to arrive at the misunderstanding that “concurrent programming frameworks and coroutines can efficiently handle a large number of concurrent connections”.

(7) The Go runtime encapsulates asynchronous I/O, so it is possible to write servers that appear to handle a great deal of concurrency. However, even if we make full use of multi-core parallelism by adjusting $GOMAXPROCS, the result is not as efficient as an I/O event-driven design with thread pools partitioned appropriately by transaction type. In terms of response time, cooperative scheduling is a serious weakness.

(8) The greatest value of goroutines is that they implement the mapping between concurrent coroutines and actual parallel threads, together with its dynamic scaling. As the runtime continues to develop and improve, performance will keep getting better, especially as CPUs gain more and more cores. One day we will give up that small performance difference for the sake of simpler, more maintainable code.
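Points (3) and (4) can be tried out with a small sketch that uses runtime.GOMAXPROCS and runtime.Gosched from the standard library. The helper names `sumWithYield` and `runWorkers` and the yield interval of 100 iterations are arbitrary choices for this illustration. Note also that current Go releases can preempt long-running goroutines on their own, so the explicit yield here shows the mechanism described in point (4) rather than something every program must do.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// sumWithYield adds 1..n and calls runtime.Gosched() every `every`
// iterations so that other goroutines get a chance to run.
func sumWithYield(n, every int) int {
	total := 0
	for i := 1; i <= n; i++ {
		total += i
		if i%every == 0 {
			runtime.Gosched() // voluntarily give up the processor
		}
	}
	return total
}

// runWorkers limits the runtime to `procs` threads executing Go code,
// starts `workers` goroutines, and returns each worker's result.
func runWorkers(procs, workers, n int) []int {
	prev := runtime.GOMAXPROCS(procs) // returns the previous setting
	defer runtime.GOMAXPROCS(prev)    // restore it afterwards

	results := make([]int, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			results[w] = sumWithYield(n, 100) // each worker owns one slot
		}(w)
	}
	wg.Wait()
	return results
}

func main() {
	// With procs = 1 the three goroutines are concurrent but not parallel.
	fmt.Println(runWorkers(1, 3, 1000)) // [500500 500500 500500]
}
```

Raising the first argument of `runWorkers` lets the same goroutines run in parallel on several threads without any change to the task code, which is the mapping that point (8) refers to.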

This work is published under a CC license; reprints must credit the author and link to this article.
