Introduction to Redis data structures, part 3 – Integer Set

Time:2021-6-13

This article is licensed under the “Attribution 4.0 International (CC BY 4.0)” license. You are welcome to reprint or modify it, provided you credit the source.

Author: nicksxs

Created on: January 10, 2020

Link to this article: Introduction to redis data structure part 2 – skip list

In Redis, a set actually has two underlying representations. When the number of elements is small and every element is an integer, intset is used as the underlying data structure. Otherwise, dict is used. Let’s look at the code first.

typedef struct intset {
    // Encoding: the byte width of each element
    uint32_t encoding;
    // The number of elements contained in the set
    uint32_t length;
    // The array that stores the elements
    int8_t contents[];
} intset;

/* Note that these encodings are ordered, so:
 * INTSET_ENC_INT16 < INTSET_ENC_INT32 < INTSET_ENC_INT64. */
#define INTSET_ENC_INT16 (sizeof(int16_t))
#define INTSET_ENC_INT32 (sizeof(int32_t))
#define INTSET_ENC_INT64 (sizeof(int64_t))

At first glance, two questions come up: why does an integer need an encoding, and how can an int8_t array hold a large integer? Let’s work through them step by step. The encoding field is the width of each integer in the set: 16, 32, or 64 bits. The macros defined below the struct are the possible values of encoding: INTSET_ENC_INT16 means each element is stored in 2 bytes, INTSET_ENC_INT32 means 4 bytes, and INTSET_ENC_INT64 means 8 bytes. An integer stored in an intset can therefore occupy at most 64 bits. length is simply the number of elements in the set. The strangest part is contents: it is declared as an int8_t array, yet even the smallest element takes 16 bits. This confused me a little when I read the code and “Redis Design and Implementation”; later I found that it is a rather ingenious usage. In my own words: 8, 16, 32, and 64 are all powers of 2, each double the previous one, and 8 bits is exactly one byte. So contents is not an int8_t array in the usual sense but a flexible array of raw bytes, and the width of each element is given by encoding rather than by the declared type. Here is the definition from Wikipedia:

Flexible array members were introduced in the C99 standard of the C programming language (in particular, in section §6.7.2.1, item 16, page 103). It is a member of a struct, which is an array without a given dimension. It must be the last member of such a struct and it must be accompanied by at least one other member, as in the following example:

struct vectord {
    size_t len;
    double arr[]; // the flexible array member must be last
};
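
To connect the flexible array member back to intset, here is a minimal sketch of how a header plus a raw byte array could be allocated in one block and then read back with the width given by encoding. The names demo_intset, demo_new, demo_set, and demo_get are hypothetical and are not taken from the Redis source. The sketch also uses the host byte order, whereas Redis stores values as little endian.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the intset header shown earlier. */
typedef struct demo_intset {
    uint32_t encoding;  /* bytes per element: 2, 4 or 8 */
    uint32_t length;    /* number of elements */
    int8_t contents[];  /* flexible array member: raw bytes, no fixed size */
} demo_intset;

/* Allocate header and payload in one block: n elements of `encoding` bytes. */
demo_intset *demo_new(uint32_t encoding, uint32_t n) {
    assert(encoding == 2 || encoding == 4 || encoding == 8);
    size_t payload = (size_t)encoding * n;
    demo_intset *is = malloc(sizeof(*is) + payload);
    if (is == NULL) return NULL;
    is->encoding = encoding;
    is->length = n;
    memset(is->contents, 0, payload);
    return is;
}

/* Store `value` in slot `pos`, narrowing it to the width of the encoding. */
void demo_set(demo_intset *is, uint32_t pos, int64_t value) {
    int8_t *slot = is->contents + (size_t)pos * is->encoding;
    if (is->encoding == 8) {
        memcpy(slot, &value, 8);
    } else if (is->encoding == 4) {
        int32_t v32 = (int32_t)value;
        memcpy(slot, &v32, 4);
    } else {
        int16_t v16 = (int16_t)value;
        memcpy(slot, &v16, 2);
    }
}

/* Read slot `pos` back, widening it to int64_t. */
int64_t demo_get(const demo_intset *is, uint32_t pos) {
    const int8_t *slot = is->contents + (size_t)pos * is->encoding;
    if (is->encoding == 8) {
        int64_t v64;
        memcpy(&v64, slot, 8);
        return v64;
    }
    if (is->encoding == 4) {
        int32_t v32;
        memcpy(&v32, slot, 4);
        return v32;
    }
    int16_t v16;
    memcpy(&v16, slot, 2);
    return v16;
}
```

The int8_t type only provides byte-level addressing; the slot offset is always pos multiplied by encoding, which is why a single declaration can serve all three widths.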

When an intset is initialized, the contents array takes up no space; memory is allocated later, as elements are added. That raises a question: three possible encoding values are given, so can the encoding be changed at will? Obviously not. First, the data in an intset is kept in sorted order, partly so that binary search can be used. Second, as larger values are stored there is an upgrade process, as shown in the figure below.
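
As an illustration of why keeping the elements sorted helps, here is a small binary search sketch. It works on a plain int64_t array rather than on the packed contents bytes, which is a simplification, and the name sorted_search is hypothetical. When the value is missing, it reports the index where the value would have to be inserted to keep the array sorted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Search a sorted array for `value`.
 * Returns 1 and sets *pos to its index when found.
 * Returns 0 and sets *pos to the insertion index otherwise. */
int sorted_search(const int64_t *a, size_t len, int64_t value, size_t *pos) {
    assert(pos != NULL);
    size_t lo = 0, hi = len;            /* half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < value) {
            lo = mid + 1;               /* value lies to the right of mid */
        } else {
            hi = mid;                   /* value is at mid or to the left */
        }
    }
    *pos = lo;
    return lo < len && a[lo] == value;
}
```

The same call answers both "is it already in the set?" and "where does it go?", so a membership check and a sorted insert can share one lookup.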
[Figure: Introduction to redis data structure part 3 - Integer Set]
A newly created intset consists only of the header, 8 bytes in total: encoding = 2 and length = 0, both of type uint32_t, each taking 4 bytes. After 15 and 5 are added, both of which are small enough to be represented in 2 bytes, encoding is unchanged and is still 2, the default INTSET_ENC_INT16. When 32768 is added, it can no longer be represented in two bytes (two bytes cover the range -2^15 to 2^15 - 1, and 32768 equals 2^15, which is out of range). Encoding must therefore be upgraded to INTSET_ENC_INT32 (value 4), so that each element is represented by 4 bytes. Throughout the insertions, the intset keeps its elements in ascending order. Like ziplist, intset is stored in little-endian mode (see the Wikipedia entry on Endianness). For example, in the figure above, after all the data has been added to the intset, the four bytes of the encoding field should be interpreted as 0x00000004, and the fourth element should be interpreted as 0x00008000 = 32768.
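
The two rules in this walkthrough, picking the smallest width that fits a value and laying bytes out in little-endian order, can be sketched as follows. The names demo_value_encoding and demo_store_le are hypothetical; this is an illustration under those assumptions, not the Redis implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest element width, in bytes, that can hold `v`: 2, 4 or 8. */
uint32_t demo_value_encoding(int64_t v) {
    if (v < INT32_MIN || v > INT32_MAX) return 8;
    if (v < INT16_MIN || v > INT16_MAX) return 4;
    return 2;
}

/* Write the low `width` bytes of `v` into `out`, least significant byte
 * first, independent of the host byte order. */
void demo_store_le(uint8_t *out, int64_t v, unsigned width) {
    assert(width == 2 || width == 4 || width == 8);
    uint64_t bits = (uint64_t)v;
    for (unsigned i = 0; i < width; i++) {
        out[i] = (uint8_t)(bits >> (8 * i));
    }
}
```

With these helpers, 15 and 5 map to width 2 while 32768 maps to width 4, which is the point at which the whole set has to be upgraded. Storing 32768 with width 4 produces the bytes 00 80 00 00 in memory, which read back as 0x00008000.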