Write an OS kernel from scratch – locks and multithreading synchronization

Time: 2021-09-17

Series catalog

Multithreading contention

Last time we finally got multiple threads running and built a preliminary task scheduler, so the kernel has at last begun to look like an operating system. With multiple threads running, the next steps are to enter user mode to run threads and to establish the concept of a process, loading user executable programs.

Before that, however, an important and dangerous problem arrives together with multithreading: thread contention and synchronization. From your user-mode multithreaded programming experience you probably already understand the problems and causes of races between threads, as well as the concept and use of locks. This article discusses locks and their implementation from the kernel's point of view. Note that locking is a huge and complex subject, and the use and implementation of locks differ in many ways between the kernel and user mode (though they have much in common). This article reflects only my own limited understanding and implementation; discussion and corrections are welcome.

Locks

I don't think it is necessary to explain here the data problems caused by races between threads. After starting and running threads in the previous two articles, we should clearly realize that an interrupt can happen at any time, on any instruction, so any non-atomic operation can lead to a data race between threads.

In fact, our current kernel already has many places that need a lock to protect access to shared data structures, for example:

  • the bitmap of physical frames allocated by the page fault handler, which obviously needs protection;
  • kheap, from which all threads allocate memory;
  • the various task queues in the scheduler, such as ready_tasks;
  • ……

Most programming languages that support multithreading provide lock-related concepts and tools. As a bare-bones kernel project, we have to implement them ourselves.

Locking is a complex subject. Beyond safety, the quality of a lock's design, implementation, and usage greatly affects system performance. A bad lock design or bad usage can lead to unreasonable thread scheduling, waste large amounts of CPU time, and reduce system throughput.

Next, starting from the underlying principles of locks, we discuss several common kinds of locks, their implementation, and their use cases.

Atomic instructions

Logically, the implementation of a lock is simple:

if (lock_hold == 0) {
  // Not locked, I get the lock!
  lock_hold = 1;
} else {
  // Already locked :(
}

Here lock_hold stores whether the lock is currently held, with value true / false. Whoever tries to take the lock first checks whether it is 0. If it is 0, the lock is not held by anyone else, so the caller takes the lock and sets it to 1, marking it locked so that no one else can take it.

However, the above implementation is wrong: the if check and the following assignment lock_hold = 1 are two steps that are not atomic. Two threads may both see lock_hold as 0 (before either has had time to set lock_hold = 1), both pass the if check, and both take the lock and enter.

The core problem here is that the check and the modification of the lock are not atomic, i.e. they are not completed in a single instruction, so the interleaved operations of two threads can cause a data race.
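To make the bad interleaving concrete, here is a sequential, single-threaded simulation of it (all names are hypothetical, for illustration only): both "threads" run the check step before either runs the set step, so both pass the check.

```c
/* Sequential simulation of the bad interleaving: both threads execute the
 * "test" step before either executes the "set" step. Hypothetical names. */
static int lock_hold = 0;

static int naive_test(void) { return lock_hold == 0; }  /* step 1: check  */
static void naive_set(void) { lock_hold = 1; }          /* step 2: modify */

/* Returns 1 if both threads passed the check, i.e. both "acquired" the lock. */
int simulate_bad_interleaving(void) {
  int t1_sees_free = naive_test();  /* thread 1 checks: lock looks free       */
  int t2_sees_free = naive_test();  /* interrupt! thread 2 checks: still free */
  naive_set();                      /* thread 1 sets lock_hold = 1            */
  naive_set();                      /* thread 2 sets it too -- data race      */
  return t1_sees_free && t2_sees_free;
}
```

Under this schedule both threads believe they own the lock, which is exactly the race described above.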

Therefore, the lowest-level implementation of any lock must rest on an atomic instruction: a single instruction that completes both the test and the change of the data, guaranteeing that only one thread can pass while all others are shut out. For example:

uint32 compare_and_exchange(volatile uint32* dst,
                            uint32 src);

It has to be implemented in assembly:

compare_and_exchange:
  mov edx, [esp + 4]
  mov ecx, [esp + 8]
  mov eax, 0
  lock cmpxchg [edx], ecx
  ret

cmpxchg is a compare-and-exchange instruction. It compares the first operand with the value of eax:

  • if they are equal, it loads the second operand into the first operand;
  • if not, it assigns the value of the first operand to eax.

(The lock prefix before cmpxchg ensures exclusive access to memory while the instruction executes on a multi-core CPU, and makes the result visible to the other cores. This involves the cache coherence of multi-core CPUs, which you can skip for now; for the single-core CPU used in our project experiments, the lock prefix is not required.)

In effect, this instruction merges the check and modify operations into a single atomic one. For our lock we use the operand dst to mark whether the lock is held, and compare it with eax = 0:

  • if they are equal, this is the first case: 0 means the lock is free, so 1 is assigned to dst, which means we got the lock and locked it; the return value is eax = 0;
  • if not, dst is already 1 and the lock is held by someone else; this is the second case: the value dst = 1 is assigned to eax, and the return value is eax, now modified to 1.

It is used like this:
volatile uint32 lock_hold = 0;

void acquire_lock() {
    if (compare_and_exchange(&lock_hold, 1) == 0) {
        // Get lock!
    } else {
        // NOT get lock.
    }
}
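For readers who want to play with this semantics outside the kernel, here is a portable stand-in for the assembly compare_and_exchange, sketched with the GCC/Clang `__sync_val_compare_and_swap` builtin (the function names here are hypothetical, not part of the kernel):

```c
typedef unsigned int uint32;

static volatile uint32 lock_hold = 0;

/* Atomically: if lock_hold == 0, set it to 1. Returns the OLD value,
 * so 0 means "we got the lock" and 1 means "already held". */
uint32 try_acquire(void) {
  return __sync_val_compare_and_swap(&lock_hold, 0, 1);
}

void release(void) {
  lock_hold = 0;  /* a plain aligned 32-bit store is atomic on x86 */
}
```

The first try_acquire on a free lock returns 0 (success); any further attempt returns 1 until release is called.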

Besides the cmpxchg instruction, another implementation uses the xchg instruction, which I personally find easier to understand:

atomic_exchange:
  mov ecx, [esp + 4]
  mov eax, [esp + 8]
  xchg [ecx], eax
  ret

The xchg instruction takes two operands and exchanges their values. The atomic_exchange function then returns the value of the second operand after the exchange, which is in fact the old value of the first operand before the exchange.

How do we use atomic_exchange to implement a lock? With much the same code:

volatile uint32 lock_hold = 0;

void acquire_lock() {
    if (atomic_exchange(&lock_hold, 1) == 0) {
        // Get lock!
    } else {
        // NOT get lock.
    }
}

Whoever tries to take the lock always exchanges the value 1 (locked) with lock_hold, so atomic_exchange always returns the old value of lock_hold. Only when that old value is 0 does the check above pass, which means the lock was not previously held and has now been successfully taken.

As you can see, a single instruction completes the whole operation on lock_hold. The interesting thing is that it changes first and checks afterwards, but this does not affect its correctness.
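The exchange-then-check pattern can also be sketched portably with the GCC/Clang `__atomic_exchange_n` builtin, a stand-in for the xchg-based assembly helper (function names are hypothetical):

```c
typedef unsigned int uint32;

static volatile uint32 lock_hold = 0;

/* Stand-in for the xchg-based atomic_exchange: atomically stores `val`
 * into *dst and returns the old value. */
uint32 atomic_exchange_c(volatile uint32* dst, uint32 val) {
  return __atomic_exchange_n(dst, val, __ATOMIC_SEQ_CST);
}

/* Change first, check afterwards: we always write 1, but only the caller
 * that gets back the old value 0 actually acquired the lock. */
int acquire(void) {
  return atomic_exchange_c(&lock_hold, 1) == 0;
}
```

Writing 1 into an already-locked lock_hold is harmless: the old value returned (1) tells the caller the acquire failed.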

Spin lock

The above discussed the underlying implementation of a lock's acquire operation, which is only the tip of the iceberg of lock-related problems. The real complexity of locks lies in what happens after an acquire fails. This is also an important way to classify locks, and it greatly affects a lock's performance and use cases.

The simplest kind, discussed first, is the spinlock. It simply retries again and again after a failed acquire, until it succeeds:

#define LOCKED_YES  1
#define LOCKED_NO   0

void spin_lock() {
  while (atomic_exchange(&lock_hold, LOCKED_YES) != LOCKED_NO) {}
}

This is busy waiting: the current thread keeps hold of the CPU and keeps trying to acquire. Simple and crude.

First of all, we need to be clear that a spinlock makes no sense on a single-core CPU. On a single-core CPU only one thread executes at a time; if the lock is not obtained, spinning idly is pointless, because the thread holding the lock cannot possibly release it while we occupy the CPU.

On a multi-core CPU, however, a spinlock has its place. If an acquire fails, retrying for a while may be enough for the holder to release the lock, because the holder is likely running on another core at that moment, inside its critical section.


This suits scenarios where the critical section is very small and lock contention is not fierce, because then the spin wait is probably short. If, instead, the thread gave up the CPU whenever it could not get the lock, it might pay an even higher price, as discussed in detail later.

However, if the critical section is large, or lock contention is fierce, a spinlock is inappropriate even on a multi-core CPU: endlessly spinning in place wastes a great deal of CPU time.
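The article shows spin_lock but never its release side. For completeness, here is a sketch of the full pair, using the GCC `__atomic` builtins as stand-ins for the assembly helper (the spinlock_t type and function names are hypothetical):

```c
#define LOCKED_YES 1
#define LOCKED_NO  0

typedef struct {
  volatile unsigned int hold;  /* LOCKED_NO (0) or LOCKED_YES (1) */
} spinlock_t;

void spinlock_init(spinlock_t* lk) { lk->hold = LOCKED_NO; }

void spinlock_lock(spinlock_t* lk) {
  /* Busy-wait: keep exchanging 1 in until the old value comes back 0. */
  while (__atomic_exchange_n(&lk->hold, LOCKED_YES, __ATOMIC_ACQUIRE)
         != LOCKED_NO) {}
}

void spinlock_unlock(spinlock_t* lk) {
  /* A release-ordered store of 0 lets a spinner's next exchange succeed. */
  __atomic_store_n(&lk->hold, LOCKED_NO, __ATOMIC_RELEASE);
}
```

Unlocking is just an atomic store of 0; no exchange is needed, because only the current holder may call unlock.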

yield lock

As said above, a spinlock makes no sense on a single-core CPU, but our kernel happens to run on a single-core CPU in the emulator, so we need a lightweight lock similar to a spinlock, which I will tentatively call a yield lock.

As the name suggests, a yield lock actively gives up the CPU after a failed acquire. In other words: I can't get the lock for now, so I'll take a break and let other threads run first; after they have had their turn, I'll come back and try again.

Its behavior is still essentially a spin, but unlike spinning in place it wastes no CPU time: it immediately hands the CPU to others, so the thread holding the lock gets a chance to run, and by the time the next time slice comes around the lock has probably been released.


void yield_lock() {
  while (atomic_exchange(&lock_hold, LOCKED_YES) != LOCKED_NO) {
    schedule_thread_yield();
  }
}

Note that schedule_thread_yield must stay inside the while loop: even if the holder releases the lock, that does not mean the current thread will get it afterwards, since there may be other competitors. After coming back from the yield, it must compete to acquire the lock again.

Like a spinlock, a yield lock suits cases where the critical section is small and contention is not fierce. Otherwise many threads will yield again and again in vain, which also wastes CPU resources.

Blocking lock

Both locks above are non-blocking locks: when an acquire fails, the thread does not block but keeps retrying, either immediately or after a while. In essence both are retries. But when the critical section is large, or contention is fierce, retrying over and over is likely futile and wastes CPU resources.

To solve this problem there is the blocking lock. It maintains a queue internally: a thread that cannot get the lock adds itself to the queue, goes to sleep, and gives up the CPU. While sleeping it will not be scheduled to run, i.e. it enters the blocked state. When the thread holding the lock releases it, it takes a thread out of the queue and wakes it up again.

For example, we define the following blocking lock, named mutex:

typedef struct mutex {
  volatile uint32 hold;
  linked_list_t waiting_task_queue;
  yieldlock_t ydlock;
} mutex_t;

The implementation of locking:

void mutex_lock(mutex_t* mp) {
  yieldlock_lock(&mp->ydlock);
  while (atomic_exchange(&mp->hold, LOCKED_YES) != LOCKED_NO) {
    // Add current thread to wait queue.
    thread_node_t* thread_node = get_crt_thread_node();
    linked_list_append(&mp->waiting_task_queue, thread_node);
    
    // Mark this task status TASK_WAITING so that
    // it will not be put into ready_tasks queue
    // by scheduler.
    schedule_mark_thread_block();
    
    yieldlock_unlock(&mp->ydlock);
    schedule_thread_yield();

    // Waken up, and try acquire lock again.
    yieldlock_lock(&mp->ydlock);
  }
  yieldlock_unlock(&mp->ydlock);
}

The lock implementation here is already complicated; it is in fact the standard implementation of conditional wait. A conditional wait means waiting, in a blocking manner, for a desired condition to become true. The condition awaited here is: the lock has been released and I can try to take it.

After a failed attempt to take the lock, the current thread adds itself to the mutex's waiting_task_queue, marks itself TASK_WAITING, and then gives up the CPU. Giving up the CPU here uses the same schedule_thread_yield function as the yield lock above, but the two cases are essentially different:

  • with a yield lock, the thread is still put into the ready_tasks queue after yielding and will still be scheduled by the scheduler;
  • here, the thread marks itself TASK_WAITING, so the schedule_thread_yield implementation does not add it to the ready_tasks queue; it actually enters the blocked state and will not be scheduled again until the thread holding the lock next calls unlock, which takes it out of the mutex's waiting_task_queue, wakes it up, and puts it back into the ready_tasks queue.

The corresponding unlock implementation:
void mutex_unlock(mutex_t* mp) {
  yieldlock_lock(&mp->ydlock);
  mp->hold = LOCKED_NO;

  if (mp->waiting_task_queue.size > 0) {
    // Wake up a waiting thread from queue.
    thread_node_t* head = mp->waiting_task_queue.head;
    linked_list_remove(&mp->waiting_task_queue, head);
    // Put waken up thread back to ready_tasks queue.
    add_thread_node_to_schedule(head);
  }
  yieldlock_unlock(&mp->ydlock);
}

There is a key element in the lock and unlock code above: the yieldlock defined inside the mutex. This looks strange, since the essential function of a mutex is to be a lock, yet its internal data needs another lock to protect it. Isn't that locks all the way down?

In terms of implementation, a mutex is already a complex lock: it maintains an internal waiting queue, which obviously needs protection, hence the apparent paradox above. The key point is that the two layers of locks differ essentially in type and purpose:

  • the mutex is a heavyweight lock provided for external use; its purpose and the object it protects are not known in advance, and generally its critical section is relatively large and highly contended;
  • the inner yield lock is a lightweight lock whose purpose and protected object are fixed: it protects the mutex's internal operations, a critical section that can be kept very small, so introducing this lock is both necessary and reasonable.

The price of the inner yield lock is that it introduces new contention, making contention on the mutex as a whole a bit fiercer. But this extra cost is inherent to the design and use of the mutex, and in a sense it can be tolerated and ignored: the external critical section a mutex protects is generally assumed to be large compared with the region protected by its inner yield lock.

Kernel and user-mode locks

The above covered the principles and implementation of several locks and their different use cases. A very important distinguishing principle is the size of the critical section and the intensity of contention, which together essentially reflect how easy (or likely) it is for a thread to get the lock on each attempt. Based on this, we can divide usage into two cases:

  • if the critical section is small and contention is not fierce, use a spin-style lock (spinlock or yield lock), which is non-blocking;
  • if the critical section is large or contention is fierce, use a blocking lock.

However, the above only discusses locks in kernel mode, and even there the choice of lock involves far more than this: it also depends on where the lock is used, for example in interrupt or exception context, and many other considerations that bring restrictions and differences. I will try to write a separate discussion of these issues when I have time; consider this a placeholder.

In user mode, lock usage differs greatly from the kernel. One of the most discussed questions is the choice between a blocking lock and a spinlock. As mentioned above, blocking locks are usually used when the critical section is large or contention is fierce, since they avoid large amounts of idle spinning and thus seem to save CPU. However, a user-mode thread must trap into the kernel to enter blocking sleep, which is expensive, and may not be as cost-effective as spinning in place (given a multi-core CPU). The considerations here are therefore very different from lock usage in the kernel.

Summary

This article discussed the principles and implementation of locks. Limited by my own level, it is only my superficial understanding; I hope it is helpful to readers, and discussion and corrections are welcome. In this project, performance is not a consideration for the time being; for simplicity and safety, I use the yield lock as the main lock in the kernel.