Exploring the Mysteries of CPU Scheduling

Date: 2022-01-25

Summary: Starting from the most basic scheduling algorithm, this article analyzes the principles of the mainstream scheduling algorithms one by one, exploring the mysteries of CPU scheduling.

This article is shared from the Huawei Cloud community post "Exploring the Scheduling Principles of the CPU", by Yuan Runzi.

Preface

Software engineers tend to treat the OS (operating system) as a highly trusted housekeeper: we run our programs on it, but rarely look into how it actually works. Indeed, as a general-purpose software system, the OS performs well enough in most scenarios. Still, there are special scenarios where we need to tune the OS so that the business system completes its tasks more efficiently. That requires a deep understanding of how the OS works: we should not only be able to call on the housekeeper, but also know how to make the housekeeper do better.

An OS is an enormous software system. This article explores just the tip of the iceberg: the principles of CPU scheduling.

When CPU scheduling comes up, many people's first reaction is time-slice-based scheduling: each process gets a time slice of CPU time, and when the slice is used up, it yields the CPU to another process. As for the deeper questions of how the OS decides that a time slice is used up and how it switches to another process, few people seem to know.

In fact, time-slice-based scheduling is only one of many CPU scheduling algorithms. This article starts with the most basic scheduling algorithms and analyzes the principles of the mainstream ones one by one, exploring the mysteries of CPU scheduling.

CPU context switching

Before exploring the principles of CPU scheduling, let's first understand CPU context switching, which is the foundation of CPU scheduling.

Today's OSes almost all support running far more tasks "simultaneously" than there are CPUs, allocating the CPUs to the tasks in turn. This requires the OS to know where a task's state should be loaded from and where execution should resume after loading. That information is stored in the CPU registers; in particular, the address of the next instruction to execute is kept in a special register called the program counter (PC). We call this register state the CPU context, also known as the hardware context.

When the OS switches the running task, the action of saving the previous task's context and loading the next task's context into the CPU registers is called a CPU context switch.

The CPU context is part of the process context. The process context we often speak of consists of two parts:

  • User-level context: the process's runtime stack, data segment, code segment, and other information.
  • System-level context: process identification information, the saved execution state of the process (the CPU context), process control information, and so on.

This raises two questions: (1) how is the CPU context of the previous task saved? (2) when is a context switch performed?

Question 1: how is the CPU context of the previous task saved?

The CPU context is saved in the process's kernel space. When the OS allocates virtual memory to a process, it sets aside a kernel space that only kernel code can access. Before switching the CPU context, the OS saves the current general-purpose registers, the PC, and other state of the running process into that process's kernel space; on the next switch back, it takes this state out and reloads it into the CPU registers to resume the task.
(Figure: saving and restoring the CPU context via the process's kernel space)

Question 2: when is a context switch performed?

To switch task contexts, the OS must occupy the CPU to execute the switching logic. But while a user program is running, the CPU is occupied by that program, which means the OS is not running at that moment and naturally cannot perform a context switch. There are two solutions to this problem: the cooperative strategy and the preemptive strategy.

The cooperative strategy relies on the user program yielding the CPU voluntarily, for example by making a system call or triggering an exception such as division by zero. This strategy is unreliable: if a program never yields the CPU, or even spins in a malicious infinite loop, it occupies the CPU forever, and the only remedy is to restart the system.

The preemptive strategy relies on the hardware's timer-interrupt mechanism. During initialization, the OS registers an interrupt handler with the hardware. When the timer interrupt fires, the hardware hands control of the CPU to the OS, and the OS can switch the CPU context in the interrupt callback.
(Figure: a timer interrupt handing the CPU back to the OS, which then switches the context)
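To make the preemptive mechanism concrete, here is a minimal sketch that uses a Unix signal as a stand-in for the hardware timer interrupt. This is an analogy only: real context switching happens inside the kernel, and the sketch assumes a Unix-like system running Python.

```python
import signal

def timer_handler(signum, frame):
    # In a real OS, this is the moment the kernel regains the CPU and
    # may decide to switch to another task.
    print("timer interrupt: the OS is back in control")

signal.signal(signal.SIGALRM, timer_handler)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)  # fire every 10ms

busy = 0
for _ in range(10_000_000):  # a "user program" hogging the CPU
    busy += 1

signal.setitimer(signal.ITIMER_REAL, 0, 0)  # cancel the timer
```

Even though the loop never yields, the handler keeps running every 10ms, which is exactly the property preemptive scheduling depends on.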

Scheduling metrics

The quality of a CPU scheduling algorithm is generally measured by the following two metrics:

  • Turnaround time: the time from a task's arrival to its completion, i.e. T_turnaround = T_completion − T_arrival.
  • Response time: the time from a task's arrival to the first time it is scheduled, i.e. T_response = T_firstrun − T_arrival.

The two metrics are somewhat opposed: optimizing for average turnaround time tends to hurt average response time, and vice versa. Which one to target depends on the task type. A program-compilation task wants a small turnaround time so that compilation finishes as soon as possible; a user-interactive task wants a small response time so that the user experience is not affected.
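To make the two definitions concrete, here is a tiny sketch (the function names are illustrative, not from any real scheduler API):

```python
def turnaround_time(arrival: float, completion: float) -> float:
    # T_turnaround = T_completion - T_arrival
    return completion - arrival

def response_time(arrival: float, first_run: float) -> float:
    # T_response = T_firstrun - T_arrival
    return first_run - arrival

# A task arrives at t = 0, first runs at t = 10, completes at t = 110:
assert turnaround_time(0, 110) == 110
assert response_time(0, 10) == 10
```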

Workload assumptions

The workload on an OS (i.e. the running characteristics of its tasks) is ever-changing. To better understand the principles of the various CPU scheduling algorithms, we first make the following assumptions about the workload:

  • Assumption 1: all tasks have the same running time.
  • Assumption 2: all tasks arrive at the same time.
  • Assumption 3: once started, a task runs until it completes.
  • Assumption 4: all tasks use only the CPU (no I/O operations, for example).
  • Assumption 5: the running time of every task is known in advance.

With the preparations done, let's enter the wonderful world of CPU scheduling algorithms.

FIFO: first in, first out

The FIFO (first in, first out) scheduling algorithm is known for its simplicity and ease of implementation: it schedules the first task to arrive and runs it to completion, then schedules the next task, and so on. If multiple tasks arrive at the same time, one is picked at random.

Under our assumed workload, FIFO works well. Suppose three tasks A, B, and C satisfy all the workload assumptions, each runs for 10s, and all arrive at t = 0:
(Figure: FIFO schedule of A, B, and C, each running for 10s back to back)

According to FIFO, A, B, and C complete at 10s, 20s, and 30s respectively, for an average turnaround time of 20s ((10 + 20 + 30) / 3). The result is very good.
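As a sanity check, here is a minimal FIFO simulation (a sketch under the assumptions above; the task tuples are illustrative):

```python
def fifo(tasks):
    """Run tasks to completion in arrival order.

    tasks: list of (name, arrival, runtime), sorted by arrival.
    Returns {name: completion_time}.
    """
    clock, completion = 0, {}
    for name, arrival, runtime in tasks:
        clock = max(clock, arrival) + runtime  # wait for arrival, then run
        completion[name] = clock
    return completion

done = fifo([("A", 0, 10), ("B", 0, 10), ("C", 0, 10)])
print((done["A"] + done["B"] + done["C"]) / 3)  # 20.0, all arrivals at t = 0
```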

Reality, however, is always cruel. If assumption 1 is broken, say A's running time becomes 100s while B's and C's remain 10s, the schedule becomes:
(Figure: FIFO schedule with A running 100s; B and C wait behind A)

Because A runs for so long, B and C cannot be scheduled for a long time, and the average turnaround time deteriorates to 110s ((100 + 110 + 120) / 3).

So the FIFO scheduling strategy is prone to task starvation when task running times differ widely!

The problem can be solved by scheduling the short tasks B and C first, which is exactly the idea behind the SJF scheduling algorithm.

SJF: shortest job first

SJF (shortest job first) selects, among the tasks that arrived at the same time, the one with the shortest running time, then the next shortest, and so on.

Scheduling the previous section's workload with SJF brings the average turnaround time down to 50s ((10 + 20 + 120) / 3), a more than twofold improvement over FIFO's 110s.
(Figure: SJF schedule; B and C run first, then A)
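Under assumption 2 (simultaneous arrivals), SJF is simply FIFO applied to the tasks sorted by running time. A sketch:

```python
def sjf(tasks):
    # With simultaneous arrivals, SJF = sort by runtime, then run in order.
    clock, completion = 0, {}
    for name, arrival, runtime in sorted(tasks, key=lambda t: t[2]):
        clock = max(clock, arrival) + runtime
        completion[name] = clock
    return completion

done = sjf([("A", 0, 100), ("B", 0, 10), ("C", 0, 10)])
print((done["A"] + done["B"] + done["C"]) / 3)  # 50.0, all arrivals at t = 0
```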

Now let's break assumption 2: A arrives at t = 0, while B and C arrive at t = 10. The schedule becomes:
(Figure: SJF schedule with B and C arriving at t = 10, after A has started)

Because B and C arrive after A, they must wait until A finishes even though A runs for a long time. The average turnaround time deteriorates to 103.33s ((100 + (110 − 10) + (120 − 10)) / 3), and the task starvation problem appears again!

STCF: shortest time-to-completion first

To solve SJF's starvation problem, we need to break assumption 3, i.e. allow tasks to be interrupted while running. If B and C are scheduled as soon as they arrive, the problem is solved. This is preemptive scheduling, whose mechanism was described in the CPU context switching section: when the interrupt timer fires, the OS performs the context switch between tasks A and B.

Grafting preemptive scheduling onto the cooperative SJF algorithm yields the STCF algorithm (shortest time-to-completion first): whenever a task arrives whose remaining running time is shorter than the current task's, the current task is interrupted and the shorter task is scheduled first.

Scheduling the same workload with STCF looks like this; the average turnaround time is optimized to 50s ((120 + (20 − 10) + (30 − 10)) / 3), and the starvation problem is solved again!
(Figure: STCF schedule; A is preempted when B and C arrive at t = 10)
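A compact unit-time STCF simulation (a sketch: each tick, run the arrived task with the least remaining time, preempting whatever ran before):

```python
def stcf(tasks):
    """Preemptive shortest-time-to-completion-first, one tick at a time.

    tasks: list of (name, arrival, runtime). Returns {name: completion}.
    """
    arrival = {name: arr for name, arr, _ in tasks}
    remaining = {name: run for name, _, run in tasks}
    completion, clock = {}, 0
    while remaining:
        ready = [n for n in remaining if arrival[n] <= clock]
        if not ready:
            clock += 1  # nothing has arrived yet: CPU idles this tick
            continue
        name = min(ready, key=lambda n: remaining[n])  # least remaining time
        remaining[name] -= 1
        clock += 1
        if remaining[name] == 0:
            completion[name] = clock
            del remaining[name]
    return completion

done = stcf([("A", 0, 100), ("B", 10, 10), ("C", 10, 10)])
print((done["A"] + (done["B"] - 10) + (done["C"] - 10)) / 3)  # 50.0
```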

So far we have only looked at the turnaround-time metric. How do FIFO, SJF, and STCF fare on response time?

Suppose tasks A, B, and C all arrive at t = 0 and each runs for 5s. The three algorithms then schedule them as shown below, and the average response time is 5s ((0 + (5 − 0) + (10 − 0)) / 3):
(Figure: response times of A, B, and C under FIFO/SJF/STCF)

Worse, the average response time grows with the tasks' running time, which is disastrous for interactive tasks and seriously hurts the user experience. The root cause is that when tasks arrive together and each runs to completion, the last task must wait for all the others to finish before it gets scheduled at all.

To optimize response time, the familiar time-slice-based scheduling emerged.

RR: round-robin scheduling with time slices

RR (round robin) assigns each task a time slice. When a task's time slice is used up, the scheduler interrupts it and switches to the next task, and so on.

Note that the length of the time slice must be an integer multiple of the timer-interrupt interval. For example, if the timer interrupt fires every 2ms, a task's time slice can be set to 2ms, 4ms, 6ms, and so on. Otherwise, if a task's time slice expires when no interrupt occurs, the OS has no chance to switch tasks.

Now schedule with RR, giving A, B, and C a 1s time slice each. The schedule is shown below, and the average response time is 1s ((0 + (1 − 0) + (2 − 0)) / 3):
(Figure: RR schedule with a 1s time slice for A, B, and C)

RR's principle shows that the smaller the time slice, the smaller the average response time. But as the time slice shrinks, the number of task switches grows, and so does the context-switching overhead. Choosing the time slice size is therefore a trade-off: we cannot blindly chase response time while ignoring the cost of CPU context switches.
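A round-robin sketch with a configurable quantum makes the trade-off visible (unit-time simulation; all tasks arrive at t = 0):

```python
from collections import deque

def rr(tasks, quantum):
    """Round robin; all tasks arrive at t = 0.

    tasks: list of (name, runtime). Returns {name: (first_run, completion)}.
    """
    queue = deque(tasks)
    clock, first_run, completion = 0, {}, {}
    while queue:
        name, remaining = queue.popleft()
        first_run.setdefault(name, clock)
        ran = min(quantum, remaining)
        clock += ran
        if remaining > ran:
            queue.append((name, remaining - ran))  # back of the queue
        else:
            completion[name] = clock
    return {n: (first_run[n], completion[n]) for n, _ in tasks}

stats = rr([("A", 5), ("B", 5), ("C", 5)], quantum=1)
print(sum(f for f, _ in stats.values()) / 3)  # 1.0  average response time
print(sum(c for _, c in stats.values()) / 3)  # 14.0 average turnaround time
```

With quantum=1 this reproduces the 1s average response time above; raising the quantum trades response time back for turnaround time.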

The cost of a CPU context switch is more than saving and restoring registers. As a program runs, it gradually builds up state in the CPU caches, the TLB, the branch predictor, and other hardware. Every task switch means warming those caches up again, which is expensive.

Moreover, RR's average turnaround time is 14s (((13 − 0) + (14 − 0) + (15 − 0)) / 3), much worse than the 10s (((5 − 0) + (10 − 0) + (15 − 0)) / 3) achieved by FIFO, SJF, and STCF. This again confirms that turnaround time and response time are somewhat opposed: to optimize turnaround time, prefer SJF or STCF; to optimize response time, prefer RR.

The impact of I/O on scheduling

So far we have not considered I/O at all. When a task triggers an I/O operation, it does not occupy the CPU; it blocks until the I/O completes. Now let's break assumption 4. Tasks A and B arrive at t = 0 and each needs 50ms of running time, but A blocks on a 10ms I/O operation after every 10ms of running, while B performs no I/O.

With STCF, the schedule looks like this:
(Figure: STCF schedule of A and B; the CPU idles during A's I/O)

As the figure shows, scheduling A and B takes 140ms in total, more than the 100ms of actual running time of A and B combined. During A's I/O the scheduler never switched to B, leaving the CPU idle!

To fix this, just use the RR algorithm with a 10ms time slice for A and B: when A blocks on I/O, B gets scheduled, and when B's time slice runs out, A has just returned from its I/O, and so on. The total schedule time is optimized to 100ms.
(Figure: RR schedule with 10ms time slices; B runs while A blocks on I/O)

This schedule relies on assumption 5: the scheduler knows in advance how long A and B run, how long the I/O lasts, and so on, and can therefore use the CPU fully. Reality is far messier: the I/O blocking time differs from one operation to the next, and the scheduler cannot know A's and B's running profiles precisely. When assumption 5 is broken too, how do we build a scheduler that maximizes CPU utilization while still scheduling sensibly?

Next we introduce MLFQ, a CPU scheduling algorithm that still performs well when all the workload assumptions are broken, and that many modern operating systems have adopted.

MLFQ: multi-level feedback queue

The goals of the MLFQ (multi-level feedback queue) scheduling algorithm are:

  1. Optimize turnaround time.
  2. Reduce the response time of interactive tasks and improve the user experience.

From the previous analysis we know that to optimize turnaround time we should schedule short-running tasks first (as SJF and STCF do), and to optimize response time we should use RR-style time-slice scheduling. The two goals seem contradictory: driving response time down appears bound to push turnaround time up.

For MLFQ, two problems must be solved:

  1. How can it balance turnaround time and response time without knowing tasks' running information (running time, I/O behavior, etc.) in advance?
  2. How can it learn from past scheduling so as to make better decisions in the future?

Setting task priorities

The most distinctive difference between MLFQ and the algorithms introduced so far is that it maintains multiple priority queues, holding tasks of different priorities, governed by two rules:

  • Rule 1: if priority(A) > priority(B), schedule A.
  • Rule 2: if priority(A) = priority(B), schedule A and B with the RR algorithm.

Changing priorities

MLFQ must also adjust task priorities over time. Otherwise, by rules 1 and 2, a low-priority task C would get no chance to run until the higher-priority tasks A and B finished, giving C a very long response time. The following priority-change rules are therefore added:

  • Rule 3: when a new task arrives, put it in the highest-priority queue.
  • Rule 4a: if task A runs through a whole time slice without voluntarily yielding the CPU (for I/O, say), its priority is lowered by one level.
  • Rule 4b: if task A voluntarily yields the CPU before its time slice runs out, its priority stays the same.

Rule 3 ensures that every newly arrived task gets a chance to be scheduled, avoiding task starvation.

Rules 4a and 4b reflect the fact that most interactive tasks run only briefly and yield the CPU frequently; keeping their priority unchanged protects their response time. CPU-intensive tasks, by contrast, rarely care about response time, so their priority can be lowered.

Under these rules, when a long-running task A arrives, the schedule looks like this:
(Figure: MLFQ schedule of long-running task A, whose priority drops level by level)

If a short-running task B arrives when A has run to t = 100, the schedule becomes:
(Figure: MLFQ schedule; short task B arrives at t = 100 and runs first)

This shows that MLFQ has STCF's advantage: short-running tasks get scheduled first, shortening turnaround time.

If instead an interactive task C arrives when A has run to t = 100, the schedule becomes:
(Figure: MLFQ schedule; interactive task C keeps its high priority across I/O)

When a task blocks, MLFQ selects another task to run according to priority, avoiding an idle CPU. So in the figure above, whenever task C blocks on I/O, task A receives the time slice, and when C returns from I/O, A is suspended again, and so on. Furthermore, because C voluntarily yields the CPU within its time slice, C's priority stays unchanged, which effectively improves the experience of interactive tasks.

Starvation of CPU-intensive tasks

So far MLFQ seems to achieve both good turnaround time and good response time for interactive tasks. Is it really perfect?

Consider the following scenario: when task A has run to t = 100, interactive tasks C and D arrive at the same time. The schedule becomes:
(Figure: MLFQ schedule; interactive tasks C and D monopolize the CPU while A starves)

Clearly, when the system has many interactive tasks, CPU-intensive tasks starve!

To solve this problem, the following rule can be added:

  • Rule 5: after the system has run for some period S, move all tasks to the highest-priority queue (priority boost).

With this rule added, and S set to 50ms, the schedule becomes the following, and the starvation problem is solved!

(Figure: MLFQ with a 50ms priority boost; A no longer starves)

Malicious task problem

Consider a malicious task E. To hog the CPU, E deliberately starts an I/O operation when 1% of its time slice remains and returns from it shortly afterwards. By rule 4b, E stays in the original highest-priority queue, so it keeps getting scheduled:

(Figure: malicious task E exploiting rule 4b to monopolize the CPU)

To defeat this trick, rule 4 must be adjusted as follows:

  • Rule 4: allocate each priority level a time-slice quota; when a task uses up its quota at a given priority, its priority is lowered by one level.

With the new rule 4 applied, the same workload is scheduled as follows, and malicious task E can no longer monopolize the CPU.

(Figure: MLFQ with the new rule 4; E's priority drops once its quota is used up)

That completes the basic principles of MLFQ. To finish, let's summarize its five key rules:

  • Rule 1: if priority(A) > priority(B), schedule A.
  • Rule 2: if priority(A) = priority(B), schedule A and B with the RR algorithm.
  • Rule 3: when a new task arrives, put it in the highest-priority queue.
  • Rule 4: allocate each priority level a time-slice quota; when a task uses up its quota at a given priority, its priority is lowered by one level.
  • Rule 5: after the system has run for some period S, move all tasks to the highest-priority queue (priority boost).
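Below is a minimal sketch of the five rules as a unit-time simulation. The queue count, per-level quota, and boost period S are illustrative parameters, not values from any real OS:

```python
from collections import deque

NUM_QUEUES = 3    # illustrative: three priority levels, 0 = highest
QUOTA = 4         # rule 4: time-slice quota per priority level, in ticks
BOOST_EVERY = 50  # rule 5: priority-boost period S, in ticks

class Task:
    def __init__(self, name, runtime):
        self.name, self.remaining = name, runtime
        self.priority = 0  # rule 3: new tasks enter the highest queue
        self.used = 0      # ticks consumed at the current priority

def mlfq(tasks, ticks):
    queues = [deque() for _ in range(NUM_QUEUES)]
    for t in tasks:
        queues[0].append(t)                      # rule 3
    for clock in range(ticks):
        if clock > 0 and clock % BOOST_EVERY == 0:
            for q in queues[1:]:                 # rule 5: boost everyone
                while q:
                    t = q.popleft()
                    t.priority = t.used = 0
                    queues[0].append(t)
        task = None
        for q in queues:                         # rules 1 and 2: take the
            if q:                                # highest non-empty queue,
                task = q.popleft()               # RR order within it
                break
        if task is None:
            continue                             # CPU idle this tick
        task.remaining -= 1
        task.used += 1
        if task.remaining == 0:
            print(f"{task.name} finished at t = {clock + 1}")
        elif task.used >= QUOTA and task.priority < NUM_QUEUES - 1:
            task.priority += 1                   # rule 4: quota exhausted
            task.used = 0
            queues[task.priority].append(task)
        else:
            queues[task.priority].append(task)

mlfq([Task("A", 30), Task("B", 8)], ticks=60)
```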

Now let's return to the two questions raised at the beginning of this section:

1. How does MLFQ balance turnaround time and response time without knowing tasks' running information (running time, I/O behavior, etc.) in advance?

When it is not known in advance whether a task is long-running or short-running, MLFQ first assumes it is a short-running task. If the assumption is right, the task completes quickly, and both turnaround time and response time benefit. If it is wrong, the task's priority sinks step by step, giving other short-running tasks more scheduling opportunities.

2. How does MLFQ learn from past scheduling to make better decisions in the future?

MLFQ judges whether a task is interactive mainly by whether it voluntarily yields the CPU. If it does, the task keeps its current priority, which preserves its scheduling precedence and keeps interactive tasks responsive.

Of course, MLFQ is not a perfect scheduling algorithm either. Its most troublesome aspect is parameter setting: the number of priority queues, the time slice length at each level, the priority-boost interval S, and so on. There are no universally right values; they can only be tuned to the workload.

For example, the time slice of low-priority queues can be set longer: low-priority tasks are usually CPU-intensive and care little about response time, and a longer time slice reduces context-switch overhead.

CFS: Linux's completely fair scheduler

In this section we look at the scheduler we deal with most in practice: CFS (the completely fair scheduler) of Linux. Unlike the MLFQ of the previous section, CFS does not aim at optimizing turnaround time or response time; its goal is to divide the CPU fairly among tasks.

Of course, CFS also lets users/administrators set process priorities, deciding which processes deserve more scheduling time.

Basic principles

Most scheduling algorithms are based on fixed time slices. CFS takes a different route: an accounting-based approach built on what is called virtual runtime (vruntime).

CFS maintains a vruntime value for each task and accumulates it every time the task is scheduled: if task A runs a 5ms time slice, it is updated as vruntime += 5ms. At the next scheduling decision, CFS selects the task with the smallest vruntime. For example:
(Figure: CFS picks the task with the smallest vruntime at each scheduling point)

When should CFS switch tasks? Switching more often makes scheduling fairer, but costs more in context switches. CFS therefore provides a configurable parameter, sched_latency, that lets the user influence when switches happen: the time slice given to each task is time_slice = sched_latency / n (where n is the current number of tasks). This guarantees that within each sched_latency period, every task gets an equal share of the CPU.

For example, with sched_latency set to 48ms and four current tasks A, B, C, and D, each task's time slice is 12ms; after C and D finish, A's and B's time slices are updated to 24ms:
(Figure: CFS time slices; 12ms each for four tasks, 24ms each after C and D finish)

It follows that with sched_latency fixed, the time slice allocated to each task shrinks as the number of tasks in the system grows, and the overhead of task switching rises. To keep that overhead in check, CFS provides another configurable parameter, min_granularity, the minimum time slice of a task. With sched_latency set to 48ms and min_granularity set to 6ms, even if there are currently 12 tasks, each task's time slice is 6ms rather than 4ms.
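The slice computation then amounts to the following sketch (the parameter names mirror the two knobs just described; the real kernel logic has more detail):

```python
def time_slice(sched_latency: float, min_granularity: float, n_tasks: int) -> float:
    # Split sched_latency evenly, but never drop below min_granularity.
    return max(sched_latency / n_tasks, min_granularity)

assert time_slice(48, 6, 4) == 12   # four tasks: 12ms each
assert time_slice(48, 6, 12) == 6   # twelve tasks: clamped to 6ms, not 4ms
```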

Assigning weights to tasks

Sometimes we want an important business process to receive more time slices while other, less important processes receive fewer. But under the basic principle of the previous section, CFS divides the CPU equally among tasks. Is there a way to do this?

Yes: assign weights to tasks, so that tasks with higher weights get more CPU!

With the weight mechanism added, task k's time slice is computed as:
time_slice_k = (weight_k / Σ weight_i) × sched_latency

For example, with sched_latency still set to 48ms and two tasks A and B, where A's weight is 1024 and B's is 3072, the formula above gives A a 12ms time slice and B a 36ms time slice.

As the previous section showed, CFS always schedules the task with the smallest vruntime, updating vruntime += runtime after each run. So changing only the time-slice formula has no effect; the vruntime update rule must be adjusted as well:
vruntime_i += (weight_0 / weight_i) × runtime_i, where weight_0 is the default task weight

Continuing the example, assuming neither A nor B performs I/O and vruntime is updated this way, the schedule looks like the following: task B receives more CPU than task A.
(Figure: weighted CFS schedule; B receives three times as much CPU as A)
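Putting the pieces together, here is a toy weighted-CFS loop (a sketch: a heap stands in for the kernel's red-black tree, covered in the next section, and all tasks are assumed to be CPU-bound):

```python
import heapq

WEIGHT_0 = 1024  # default task weight (nice 0 in Linux terms)

def cfs(tasks, sched_latency=48, total_time=192):
    # tasks: list of (name, weight); all CPU-bound, never sleeping.
    total_weight = sum(w for _, w in tasks)
    heap = [(0.0, name, w) for name, w in tasks]  # (vruntime, name, weight)
    heapq.heapify(heap)
    ran = {name: 0.0 for name, _ in tasks}
    clock = 0.0
    while clock < total_time:
        vruntime, name, w = heapq.heappop(heap)    # smallest vruntime wins
        slice_ = sched_latency * w / total_weight  # weighted time slice
        ran[name] += slice_
        clock += slice_
        vruntime += slice_ * WEIGHT_0 / w          # weighted vruntime update
        heapq.heappush(heap, (vruntime, name, w))
    return ran

print(cfs([("A", 1024), ("B", 3072)]))  # B accumulates ~3x A's CPU time
```

Running it, B accumulates three times as much CPU time as A, matching the 36ms vs 12ms slices computed above.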

Using a red-black tree to speed up vruntime lookups

Every time CFS switches tasks, it selects the task with the smallest vruntime, so it needs a data structure that stores every runnable task together with its vruntime.

The most intuitive choice is an ordered list sorted by vruntime: when switching tasks, CFS just takes the head of the list, with O(1) time complexity. For example, with 10 tasks, vruntime might be kept in the sorted list [1, 5, 9, 10, 14, 17, 18, 21, 22, 24]. But every time a task is inserted or deleted, the cost is O(n), growing linearly with the number of tasks!

To balance lookup, insertion, and deletion, CFS stores tasks and their vruntime in a red-black tree. All three operations then cost O(log n) and no longer degrade linearly as the number of tasks grows, which greatly improves efficiency.
(Figure: runnable tasks organized in a red-black tree keyed by vruntime)

In addition, to improve storage efficiency, CFS keeps only runnable tasks in the red-black tree.

Dealing with I/O and sleep

The strategy of always scheduling the task with the smallest vruntime can itself cause starvation. Consider two tasks A and B with 1s time slices. At first, A and B share the CPU, running in turn. Then, after some scheduling rounds, B goes to sleep, say for 10s. When B wakes up, vruntime_B is about 10s smaller than vruntime_A, so for the next 10s B is scheduled continuously while task A starves.
(Figure: after sleeping for 10s, B's vruntime lags A's and B monopolizes the CPU)

To solve this, CFS stipulates that when a task returns from sleep or I/O, its vruntime is set to the minimum vruntime currently in the red-black tree. In the example above, after B wakes from sleep, vruntime_B is set to 11, so task A won't starve.

This method still has a flaw: if a task sleeps only very briefly, it is nevertheless scheduled first when it wakes, which is unfair to the other tasks.
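One mitigation is to clamp rather than overwrite, taking the larger of the task's own vruntime and the tree's minimum. This sketch shows the idea (the actual Linux code is more elaborate, for example also granting sleepers a small compensation):

```python
def place_on_wakeup(task_vruntime: float, min_vruntime: float) -> float:
    # A long sleeper is pulled up to the tree's minimum (no monopoly);
    # a short sleeper keeps its own, larger vruntime and gains nothing.
    return max(task_vruntime, min_vruntime)

assert place_on_wakeup(1.0, 11.0) == 11.0   # long sleep: pulled up
assert place_on_wakeup(12.0, 11.0) == 12.0  # short sleep: no head start
```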

Final thoughts

This article has spent a long time explaining the principles of several common CPU scheduling algorithms. Each has its strengths and weaknesses; there is no perfect scheduling strategy. In practice we must choose an algorithm suited to the actual workload, configure reasonable scheduling parameters, and weigh turnaround time against response time and task fairness against switching overhead. All of this bears out the famous line from Fundamentals of Software Architecture: everything in software architecture is a trade-off.

The scheduling algorithms in this article were analyzed for a single-core processor. Scheduling on multi-core processors is much more complex: it must also consider, for example, shared-data synchronization between processors and cache affinity. At heart, though, it is still built on the basic scheduling algorithms described here.

References

  1. Operating Systems: Three Easy Pieces, Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau
  2. Fundamentals of Computer Systems (III): Exceptions, Interrupts and I/O

