To build a high-performance queue, you have to know the underlying fundamentals!

Time: 2021-3-5

Preface

This article is part of the album http://dwz.win/HjK; click through to unlock more data structure and algorithm knowledge.

Hello, I’m brother Tong.

In the previous section, we learned how to rewrite recursion as non-recursion, and the main data structure used there was the stack.

Stacks and queues are the most basic data structures after arrays and linked lists. They are useful in many scenarios, as we will see later.

Today, I would like to introduce how to build a high-performance queue in Java and the underlying knowledge we need to master.

Students learning other languages can also use this as a reference for building high-performance queues in their own language.

Queue

A queue is a first-in, first-out (FIFO) data structure, much like a queue in real life: first come, first served.


We have already covered implementing simple queues with arrays and linked lists, so that will not be repeated here. Interested students can click the following link:

Review four basic data structures: array, linked list, queue and stack

Today we are going to learn how to implement a high-performance queue.

When we say high-performance queue, we of course mean a queue that works well in a highly concurrent environment. "Well" here mainly means two things: concurrency safety and good performance.

Concurrency-safe queues

In Java, the JDK already ships with a number of concurrency-safe queues:

Queue                 | Boundedness        | Lock      | Data structure
ArrayBlockingQueue    | Bounded            | Lock      | Array
LinkedBlockingQueue   | Optionally bounded | Lock      | Linked list
ConcurrentLinkedQueue | Unbounded          | Lock-free | Linked list
SynchronousQueue      | Unbounded          | Lock-free | Queue or stack
LinkedTransferQueue   | Unbounded          | Lock-free | Linked list
PriorityBlockingQueue | Unbounded          | Lock      | Heap
DelayQueue            | Unbounded          | Lock      | Heap
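
For reference, here is a minimal sketch of one of these bounded, lock-based queues in action; the class and thread names are just for illustration, not from the original article:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        //Bounded queue with capacity 4: put() blocks when full, take() blocks when empty
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(4);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    queue.put(i); //blocks if the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        for (int i = 0; i < 10; i++) {
            System.out.println("took " + queue.take()); //blocks if the queue is empty
        }
        producer.join();
    }
}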

For the source-code analysis of these queues, take this shortcut: The end of Java Concurrent collection

To sum up, the main data structures used to implement concurrency-safe queues are arrays, linked lists and heaps. Heaps are mainly used for priority queues, which are not general-purpose, so we will set them aside for now.

In terms of boundedness, only ArrayBlockingQueue and LinkedBlockingQueue can serve as bounded queues; the rest are unbounded.

In terms of locking, ArrayBlockingQueue and LinkedBlockingQueue both take the locking approach; the others rely on CAS, a lock-free technique.

From a safety point of view, we generally choose a bounded queue, to prevent a producer that runs too fast from overflowing memory.

From a performance point of view, we generally prefer a lock-free approach, to reduce the cost of thread context switches.

From the JVM's point of view, we generally choose an array-based implementation, because a linked list constantly allocates and discards nodes, which triggers frequent garbage collection, yet another performance cost.

Therefore, the best combination is: array + bounded + lock-free.

The JDK does not provide such a queue, so many open source frameworks implement their own high-performance queues, for example the Disruptor, or JCTools, which Netty uses.

High-performance queue

We will not discuss any specific framework here; instead we introduce the general techniques behind high-performance queues and implement one ourselves.

Circular array

From the discussion above, we know that the only data structure suitable for a high-performance queue is the array, and an array-backed queue must be a circular (ring) array.

A ring array is usually realized with two pointers, putIndex and takeIndex (or writeIndex and readIndex): one for writing and one for reading.


When the write pointer reaches the end of the array, it wraps around to the beginning, but it must not overtake the read pointer. Likewise, when the read pointer reaches the end of the array, it wraps around to the beginning, but it must not read data that has not been written yet.


When the write pointer and the read pointer coincide, we cannot tell whether the queue is full or empty, so a size field is generally added to disambiguate.


Therefore, the data structure of a ring-array queue generally looks like this:

public class ArrayQueue<T> {
    private T[] array;
    private long writeIndex;
    private long readIndex;
    private long size;
}
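
Before worrying about concurrency, here is a minimal single-threaded sketch (the class name is illustrative, not from the original) showing how the two pointers and size cooperate, wrapping around with a modulo:

public class SimpleArrayQueue<T> {
    private final T[] array;
    private long writeIndex;
    private long readIndex;
    private long size;

    @SuppressWarnings("unchecked")
    public SimpleArrayQueue(int capacity) {
        this.array = (T[]) new Object[capacity];
    }

    public boolean put(T t) {
        if (size == array.length) {
            return false; //full
        }
        array[(int) (writeIndex % array.length)] = t; //wrap around at the end
        writeIndex++;
        size++;
        return true;
    }

    public T take() {
        if (size == 0) {
            return null; //empty
        }
        T t = array[(int) (readIndex % array.length)];
        readIndex++;
        size--;
        return t;
    }
}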

In a single-threaded world this works fine, but in a multi-threaded environment it suffers from a serious false sharing problem.

False sharing

What is sharing?

A computer has many levels of storage. The one we deal with most is main memory. In addition, the CPU has three levels of cache: L1, L2 and L3. L1 is closest to the CPU and therefore also the smallest; L2 is slightly larger than L1; L3 is the largest and can be shared by multiple cores at the same time. When the CPU fetches data, it first looks in the L1 cache; on a miss it looks in L2, then in L3, and finally in main memory. The farther from the CPU core, the longer the access takes. So, for data that is operated on frequently, keeping it in L1 as much as possible can greatly improve performance.


Cache line

Data is not pulled into these caches one value at a time, but in batches. Each batch is called a cache line and is usually 64 bytes.


So, every time the CPU fetches data from memory, it fetches the adjacent data along with it (to fill 64 bytes). Take a long array as an example: when the CPU loads one long from the array, it loads the next seven longs into the cache line as well.


This speeds things up considerably, because when we are processing the element at index 0, we will very likely process the element at index 1 the next moment, and fetching it straight from the cache is much faster.
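
A rough way to observe this effect is the illustrative micro-benchmark below (not from the original; timings vary by machine and this is not a rigorous benchmark):

public class CacheLineDemo {
    public static void main(String[] args) {
        //8M longs = 64 MB, far larger than any CPU cache
        long[] data = new long[8 * 1024 * 1024];

        //Sequential access: every long in each 64-byte cache line is used
        long start = System.nanoTime();
        for (int i = 0; i < data.length; i++) {
            data[i]++;
        }
        System.out.println("sequential: " + (System.nanoTime() - start) / 1_000_000 + " ms");

        //Strided access: only one long per cache line is touched,
        //so this does 1/8 of the work but loads just as many cache lines
        start = System.nanoTime();
        for (int i = 0; i < data.length; i += 8) {
            data[i]++;
        }
        System.out.println("strided:    " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}

The strided loop performs only one eighth of the increments, yet it typically takes a comparable amount of time, because both loops pull the same number of cache lines from memory.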

However, this brings a new problem: false sharing.

False sharing

Imagine two threads (on two CPUs) processing this array at the same time, each with the data cached. One CPU is adding 1 to array[0] while the other is adding 1 to array[1]. When the caches are written back to main memory, whose cache line wins? (Write-back also happens one cache line at a time.) To keep the result correct, the two cache lines have to be "locked": one CPU modifies the data and writes it back to main memory first, and only then can the other CPU read the line, modify it, and write it back in turn. This inevitably costs performance, and the phenomenon is called false sharing. The "locking" here is implemented with memory barriers, which we will not go into further.
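
To see the cost concretely, here is an illustrative experiment (not from the original; exact numbers depend on hardware, and slot alignment relative to cache-line boundaries is not guaranteed, but the contrast is usually visible). Two threads hammer adjacent longs in the same cache line versus longs spaced a cache line apart:

import java.util.concurrent.atomic.AtomicLongArray;

public class FalseSharingDemo {
    static final AtomicLongArray shared = new AtomicLongArray(16);
    static final long ITERATIONS = 20_000_000L;

    public static void main(String[] args) throws InterruptedException {
        //Slots 0 and 1 sit in the same 64-byte cache line; slots 0 and 8 do not (8 longs = 64 bytes)
        System.out.println("adjacent slots: " + run(0, 1) + " ms");
        System.out.println("padded slots:   " + run(0, 8) + " ms");
    }

    static long run(int indexA, int indexB) throws InterruptedException {
        Thread a = new Thread(() -> { for (long i = 0; i < ITERATIONS; i++) shared.getAndIncrement(indexA); });
        Thread b = new Thread(() -> { for (long i = 0; i < ITERATIONS; i++) shared.getAndIncrement(indexB); });
        long start = System.nanoTime();
        a.start(); b.start();
        a.join(); b.join();
        return (System.nanoTime() - start) / 1_000_000;
    }
}

On most machines the "adjacent" run is noticeably slower, even though both runs do exactly the same amount of work.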

So, how do we solve the false sharing problem?

Take the ring-array queue as an example: writeIndex, readIndex and size are padded so that each of them occupies its own cache line.


In practice, we only need to insert 7 long fields between writeIndex and readIndex to separate them, and the same goes for readIndex and size.


This eliminates false sharing between writeIndex and readIndex. Since writeIndex and readIndex are always updated from two different threads, the performance gain from eliminating their false sharing is significant.

With multiple producers, writeIndex is bound to be contended. So how do we update writeIndex in a thread-friendly way, so that when one producer modifies it, the change is immediately visible to the other producers?

The first thing that comes to mind is volatile, but volatile alone is not enough: volatile only guarantees visibility and ordering, not atomicity. So we also need an atomic CAS instruction. Who provides CAS? AtomicInteger and AtomicLong both offer it, so can we just use them? Not quite: look closely and you will find that, under the hood, they all end up calling Unsafe.
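
For intuition, this is roughly what a CAS retry loop looks like with AtomicLong; it is the same pattern we will write by hand with Unsafe below (the class and method names here are illustrative):

import java.util.concurrent.atomic.AtomicLong;

public class CasLoopDemo {
    private final AtomicLong writeIndex = new AtomicLong();

    //Atomically claim the next slot: retry until our compareAndSet wins
    public long claimSlot() {
        long current;
        do {
            current = writeIndex.get(); //volatile read: always sees the latest value
        } while (!writeIndex.compareAndSet(current, current + 1)); //CAS: succeeds only if unchanged
        return current; //the slot this thread now owns
    }
}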

OK, now it is time for the ultimate low-level weapon to take the stage: Unsafe.

Unsafe

Unsafe provides not only CAS instructions but also many other low-level facilities, such as operating on direct memory, modifying the values of private fields, instantiating a class without calling its constructor, parking and unparking threads, memory-barrier methods, and so on.

For more about Unsafe, see this article: Analysis of Java magic class Unsafe

Of course, to build a high-performance queue, what we mainly need are Unsafe's CAS instruction and its memory-barrier methods:

//Atomic CAS instruction: atomically set the long at offset var2 of var1 to var6 if it currently equals var4
public final native boolean compareAndSwapLong(Object var1, long var2, long var4, long var6);
//Volatile read: equivalent to the variable being declared volatile
public native long getLongVolatile(Object var1, long var2);
//Ordered (lazy) write: the change is not written back to main memory immediately, so another thread may not see it right away
public native void putOrderedLong(Object var1, long var2, long var4);

Well, that covers the underlying knowledge. Time to show the real technique: handwriting a high-performance queue.

Handwriting a high-performance queue

Let's assume a scenario where there are multiple producers but only one consumer, which is a classic scenario in Netty. How do we implement such a queue?

Straight to the code:

/**
 * Multi-producer, single-consumer queue
 *
 * @param <T> the type of elements in the queue
 */
public class MpscArrayQueue<T> {

    long p01, p02, p03, p04, p05, p06, p07;
    //Where elements are stored
    private T[] array;
    long p1, p2, p3, p4, p5, p6, p7;
    //Write pointer, multiple producers, so declare volatile
    private volatile long writeIndex;
    long p11, p12, p13, p14, p15, p16, p17;
    //Read pointer, only one consumer, so do not declare volatile
    private long readIndex;
    long p21, p22, p23, p24, p25, p26, p27;
    //The number of elements, which can be modified by both producers and consumers, is declared volatile
    private volatile long size;
    long p31, p32, p33, p34, p35, p36, p37;

    //Unsafe variable
    private static final Unsafe UNSAFE;
    //Array base offset
    private static final long ARRAY_BASE_OFFSET;
    //Array element offset
    private static final long ARRAY_ELEMENT_SHIFT;
    //Offset of writeindex
    private static final long WRITE_INDEX_OFFSET;
    //The offset of readindex
    private static final long READ_INDEX_OFFSET;
    //Offset of size
    private static final long SIZE_OFFSET;

    static {
        Field f = null;
        try {
            //Get instance of unsafe
            f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);

            //Calculate array base offset
            ARRAY_BASE_OFFSET = UNSAFE.arrayBaseOffset(Object[].class);
            //Calculates the offset of elements in an array
            //In a 64 bit system, a compressed pointer takes up 4 bytes, while a non compressed pointer takes up 8 bytes
            int scale = UNSAFE.arrayIndexScale(Object[].class);
            if (4 == scale) {
                ARRAY_ELEMENT_SHIFT = 2;
            } else if (8 == scale) {
                ARRAY_ELEMENT_SHIFT = 3;
            } else {
                throw new IllegalStateException("Unknown pointer size: " + scale);
            }

            //Calculate the offset of writeindex
            WRITE_INDEX_OFFSET = UNSAFE
                    .objectFieldOffset(MpscArrayQueue.class.getDeclaredField("writeIndex"));
            //Calculate the offset of readindex
            READ_INDEX_OFFSET = UNSAFE
                    .objectFieldOffset(MpscArrayQueue.class.getDeclaredField("readIndex"));
            //Calculate the offset of size
            SIZE_OFFSET = UNSAFE
                    .objectFieldOffset(MpscArrayQueue.class.getDeclaredField("size"));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    //Construction method
    public MpscArrayQueue(int capacity) {
        //Round capacity up to the next power of two, so that index & (length - 1) works
        capacity = 1 << (32 - Integer.numberOfLeadingZeros(capacity - 1));
        //Instantiate array
        this.array = (T[]) new Object[capacity];
    }

    //Production elements
    public boolean put(T t) {
        if (t == null) {
            return false;
        }
        long size;
        //First reserve a slot by atomically incrementing size.
        //This caps the number of claimed slots at the capacity,
        //so concurrent producers can never overwrite unconsumed elements.
        do {
            //Re-read size on every pass of the loop
            size = this.size;
            //When the queue is full, return directly
            if (size >= this.array.length) {
                return false;
            }
            //If the CAS fails, another producer won; go around again
        } while (!UNSAFE.compareAndSwapLong(this, SIZE_OFFSET, size, size + 1));

        //A slot is reserved; now atomically claim its index
        long writeIndex;
        do {
            writeIndex = this.writeIndex;
        } while (!UNSAFE.compareAndSwapLong(this, WRITE_INDEX_OFFSET, writeIndex, writeIndex + 1));

        //Publish the element at the claimed position.
        //Ordered (lazy) store: cheaper than a volatile write;
        //the consumer reads the slot with volatile semantics
        long eleOffset = calcElementOffset(writeIndex, this.array.length - 1);
        UNSAFE.putOrderedObject(this.array, eleOffset, t);

        return true;
    }

    //Consumption elements
    public T take() {
        long size = this.size;
        //If size is 0, the queue is empty
        if (size <= 0) {
            return null;
        }
        //size > 0: at least one slot has been reserved by a producer
        //There is only one consumer, so readIndex needs no CAS
        long readIndex = this.readIndex;
        //Calculate the offset of the element at the read pointer
        long offset = calcElementOffset(readIndex, this.array.length - 1);
        //Volatile read, so the producer's element store is visible here
        T e = (T) UNSAFE.getObjectVolatile(this.array, offset);
        if (e == null) {
            //A producer reserved this slot but its write is not visible yet; try again later
            return null;
        }
        //Clear the slot so a stale element is never read after wrap-around
        UNSAFE.putOrderedObject(this.array, offset, null);

        //Advance the read pointer (an ordered store is enough for a single consumer)
        UNSAFE.putOrderedLong(this, READ_INDEX_OFFSET, readIndex + 1);
        //Release the slot: CAS size down (producers CAS it up concurrently)
        do {
            size = this.size;
        } while (!UNSAFE.compareAndSwapLong(this, SIZE_OFFSET, size, size - 1));

        return e;
    }

    private long calcElementOffset(long index, long mask) {
        //index & mask is equivalent to taking the remainder: when the index runs past the end of the array, it wraps back to the beginning
        return ARRAY_BASE_OFFSET + ((index & mask) << ARRAY_ELEMENT_SHIFT);
    }

}
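
As a quick smoke test (illustrative only; it assumes the MpscArrayQueue class above is on the classpath), here is a run with several producers and a single consumer:

public class MpscArrayQueueDemo {
    public static void main(String[] args) {
        MpscArrayQueue<Integer> queue = new MpscArrayQueue<>(1024);
        int producers = 4;
        int perProducer = 10_000;

        //Several producers compete to put elements
        for (int p = 0; p < producers; p++) {
            new Thread(() -> {
                for (int i = 0; i < perProducer; i++) {
                    while (!queue.put(i)) {
                        //Queue full: spin and retry
                    }
                }
            }).start();
        }

        //A single consumer drains everything
        int taken = 0;
        while (taken < producers * perProducer) {
            if (queue.take() != null) {
                taken++;
            }
        }
        System.out.println("consumed " + taken + " elements");
    }
}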

Didn't follow all of it? That's normal. Read it a few more times, and you will be able to talk about it at length in your next interview.

Here we eliminated false sharing by inserting 7 long variables between each pair of hot fields. You may see some open source frameworks do this via inheritance instead, or pad with 15 longs. In addition, JDK 8 provides the @Contended annotation to eliminate false sharing.
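
For example, a minimal sketch of the annotation approach (on JDK 8 the annotation lives in sun.misc and only takes effect for non-JDK classes when the JVM is started with -XX:-RestrictContended; in JDK 9+ it moved to the jdk.internal.vm.annotation package):

import sun.misc.Contended;

public class PaddedIndexes {
    //The JVM pads each @Contended field so that
    //writeIndex and readIndex never share a cache line
    @Contended
    volatile long writeIndex;

    @Contended
    volatile long readIndex;
}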

In fact, there is still room for optimization in this example. For instance, can we get rid of the size field altogether? How would the queue work without it?
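
One possible direction, as a hedged sketch (names illustrative): since both indexes only ever grow, their difference already tells you how many elements are in flight, so the separate size field and its contended CAS can be dropped.

public class IndexOnlyChecks {
    //Both indexes grow monotonically; their difference is the element count
    static boolean isFull(long writeIndex, long readIndex, int capacity) {
        return writeIndex - readIndex >= capacity; //producers stop when a full lap ahead
    }

    static boolean isEmpty(long writeIndex, long readIndex) {
        return writeIndex - readIndex <= 0; //consumer stops when it has caught up
    }
}

The subtle parts then move onto the indexes themselves: producers must re-check readIndex on every CAS attempt, and the consumer must detect not-yet-published slots, which is roughly how JCTools-style queues do it.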

Postscript

In this section, we learned how to build a high-performance queue in Java and picked up a good deal of underlying knowledge along the way. It is no exaggeration to say that, with these fundamentals, you could talk with an interviewer about queues alone for an hour.

In addition, I recently received feedback from some students asking: hash, hash table, hash function, are they related? How? Why does Object have a hashCode() method? And how is it related to the equals() method?

In the next section, we will look at everything about hashing. Want to get the latest posts in time? Hurry up and follow me!

Follow the public account "tongge read the source code" to unlock more knowledge of source code, fundamentals and architecture.