How to implement an efficient lock-free thread pool in Linux C programming


Most thread pool implementations rely on locks, typically a pthread_mutex_t paired with a pthread_cond_t condition variable. As we all know, locking has a noticeable impact on program performance; although pthread_mutex_t is heavily optimized for lock acquisition and release, a thread pool can in fact be implemented lock-free.

1. Common thread pool implementation principle

[Figure: a common thread pool, with a single work queue shared by the main thread and all worker threads]

As shown in the figure above, the work queue is shared by the main thread and the worker threads. The main thread pushes tasks into the queue, and the worker threads pop tasks from it for execution. Every operation on the shared queue must be carried out under the protection of a mutex. When the main thread enqueues a task and detects that the number of pending tasks is smaller than the number of worker threads, it must signal the condition variable to wake any worker that may be waiting. The mutex and condition variable are needed in a few other places as well, which are not repeated here.
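To make the conventional design concrete, here is a minimal sketch of the mutex-plus-condition-variable worker loop described above. The names (task_t, pool_submit, worker, demo_task) are illustrative, not taken from the program this article describes; the queue is a simple linked list rather than a bounded queue.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical task type: a function pointer plus its argument. */
typedef struct task {
    void (*fn)(void *);
    void *arg;
    struct task *next;
} task_t;

static task_t *queue_head = NULL;   /* shared work queue */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;
static int shutting_down = 0;

/* Main thread: push a task under the lock and wake one sleeping worker. */
void pool_submit(void (*fn)(void *), void *arg)
{
    task_t *t = malloc(sizeof(*t));
    t->fn = fn;
    t->arg = arg;
    pthread_mutex_lock(&qlock);
    t->next = queue_head;
    queue_head = t;
    pthread_cond_signal(&qcond);    /* wake a waiting worker, if any */
    pthread_mutex_unlock(&qlock);
}

/* Worker thread: sleep on the condition variable until work arrives. */
void *worker(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (queue_head == NULL && !shutting_down)
            pthread_cond_wait(&qcond, &qlock);
        if (queue_head == NULL && shutting_down) {
            pthread_mutex_unlock(&qlock);
            return NULL;
        }
        task_t *t = queue_head;
        queue_head = t->next;
        pthread_mutex_unlock(&qlock);
        t->fn(t->arg);              /* run the task outside the lock */
        free(t);
    }
}

/* Demo task for testing: bump a counter (only one worker, so no race). */
int demo_counter = 0;
void demo_task(void *arg) { (void)arg; demo_counter++; }
```

Note that every submit and every dequeue pays for a lock round-trip, which is exactly the cost the lock-free design below removes.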

2. Implementation principle of thread pool without lock

[Figure: a lock-free thread pool, with one ring work queue per worker thread]

To go lock-free, we must eliminate contention on shared resources, so the single shared work queue is split into one work queue per worker thread. The remaining contention, between the main thread enqueuing a task and a worker thread dequeuing one, is avoided by using a ring queue. With the lock gone, only the condition variable is left. A condition variable solves the problem of notifying a thread when a condition is met, and a signal, being another inter-thread communication mechanism, can replace it:

sigemptyset(&signal_mask);
sigaddset(&signal_mask, SIGUSR1);
rc = pthread_sigmask(SIG_BLOCK, &signal_mask, &oldmask); /* save the old mask for restoring */
if (rc != 0) {
    debug(TPOOL_ERROR, "SIG_BLOCK failed");
    return -1;
}
...
while (!condition) {
    rc = sigwait(&signal_mask, &sig);  /* sig (an int) receives the delivered signal number */
    if (rc != 0) {
        debug(TPOOL_ERROR, "sigwait failed");
        return -1;
    }
}
rc = pthread_sigmask(SIG_SETMASK, &oldmask, NULL);
if (rc != 0) {
    debug(TPOOL_ERROR, "SIG_SETMASK failed");
    return -1;
}



3. Implementation of thread pool without lock

In the lock-free thread pool, the main differences from a common thread pool are: signals in place of condition variables, the task scheduling algorithm, and task migration when the number of threads grows or shrinks. In addition, the ring queue implementation follows the kfifo implementation in the Linux kernel.

(1) Signals and conditional variables

The main difference between a signal and a condition variable is that signaling a condition variable with no thread waiting on it is simply ignored, whereas delivering a signal with no handler installed terminates the receiving thread, and by default the whole program. Therefore the signal handler must be installed before the pool creates its threads: signal dispositions are process-wide, and newly created threads inherit the creating thread's signal setup. pthread_kill is used to send the signal to a specific thread; to avoid clashing with other signals, this program uses SIGUSR1.

(2) Task scheduling algorithm

In a common thread pool, task scheduling is effectively left to thread scheduling at the operating system level. Here, for load balancing, the main thread must apply a task scheduling algorithm to decide which worker's queue receives each task. This program implements two algorithms: round-robin, which assigns work to the workers in turn, and least-load, which picks the worker with the fewest pending tasks.
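The two policies are simple enough to sketch directly. The types and names below (worker_t, pick_round_robin, pick_least_load) are illustrative assumptions, not the program's actual identifiers; only the queued-task count of each worker matters for the decision.

```c
/* Hypothetical per-worker view: only the pending-task count is needed. */
typedef struct {
    unsigned int pending;   /* tasks currently queued on this worker */
} worker_t;

/* Round-robin: hand out workers 0, 1, ..., n-1, 0, 1, ... in order.
   `cursor` persists between calls (e.g. a field of the pool). */
unsigned int pick_round_robin(unsigned int *cursor, unsigned int nthreads)
{
    unsigned int idx = *cursor;
    *cursor = (*cursor + 1) % nthreads;
    return idx;
}

/* Least-load: scan all workers and choose the one with the fewest
   pending tasks (first one wins on ties). */
unsigned int pick_least_load(const worker_t *w, unsigned int nthreads)
{
    unsigned int best = 0;
    for (unsigned int i = 1; i < nthreads; i++)
        if (w[i].pending < w[best].pending)
            best = i;
    return best;
}
```

Round-robin is O(1) per task but ignores uneven task durations; least-load is O(n) per task but adapts when some tasks run longer than others.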

(3) Task migration

When the number of threads is increased or decreased dynamically, existing tasks must be migrated, again for load balancing. The load balancing algorithm is based on averaging the workload: count the total number of pending tasks at that moment, divide it evenly among the threads to find how many tasks each worker should gain or lose, then walk the workers from first to last, pairing threads that must give up work with threads that must take on work so that the moves cancel out. Any tasks still left over are distributed in turn. Enqueuing migrated work involves no race, because enqueuing is always performed by the main thread alone; dequeuing work for migration does race, because the worker thread may be executing tasks from the same queue at the same time, so the dequeue index must be updated with an atomic operation. The main idea is prefetching:

do {
    work = NULL;
    if (thread_queue_len(thread) <= 0)  /* also atomic */
        break;
    tmp = thread->out;
    /* prefetch work */
    work = &thread->work_queue[queue_offset(tmp)];
} while (!__sync_bool_compare_and_swap(&thread->out, tmp, tmp + 1));
if (work) {
    /* ... migrate the prefetched task ... */
}

After a dynamic reduction of threads, the tasks left unfinished on the removed thread only need to be redistributed by the main thread to the queues of the surviving workers according to the task scheduling algorithm; the race above does not arise there. Of course, the load balancing algorithm can be applied at the same time as an optimization.

(4) Ring queue

The ring queue in the source code follows the kfifo implementation in the Linux kernel, as shown in the figure below:

[Figure: kfifo-style ring queue, with the in and out counters mapped onto buffer indices]

The length of the queue is an integral power of 2. The out and in counters are of type unsigned int and increase monotonically, wrapping around naturally when they overflow; out never overtakes in. Mapping out and in onto FIFO array subscripts (by masking with size - 1) gives the positions of the queue elements, which are exactly the elements between out and in.