Redis hash collisions and incremental rehashing: a source code analysis

Time:2021-2-27

Redis data DB

Redis is a database. What data structure does each database use internally?

The following is from the official Redis source code (5.0):

/* Redis database representation. There are multiple databases, identified by integers from 0 (the default database) up to the configured maximum. The database number is the "id" field in the structure. */
typedef struct redisDb {
    dict *dict;                 /* the keyspace of this database (a dictionary) */
    dict *expires;              /* timeouts of the keys that have a timeout set */
    dict *blocking_keys;        /* keys with clients waiting for data (BLPOP) */
    dict *ready_keys;           /* blocked keys that received a PUSH */
    dict *watched_keys;         /* WATCHed keys for MULTI/EXEC CAS */
    int id;                     /* database ID */
    long long avg_ttl;          /* average TTL, for statistical purposes only */
    list *defrag_later;         /* list of key names to defragment one by one, gradually */
} redisDb;

As we can see, the main data of a Redis database is stored in dictionaries (dict).

The Redis dict dictionary

Source code: https://github.com/redis/redi…

Here is the definition of the dict dictionary:

/* Dictionary type definition */
typedef struct dict {
    dictType *type;          /* type-specific functions for this dictionary */
    void *privdata;          /* private data */
    dictht ht[2];            /* array of two hash tables */
    long rehashidx;          /* -1 means no rehash is in progress; otherwise it is the index of the next ht[0] bucket to migrate */
    unsigned long iterators; /* number of iterators currently running */
} dict;

The actual data lives in the array of two hash tables, ht. Let’s take a look at the dictht hash table structure.

/* Hash table type definition */
typedef struct dictht {
    dictEntry **table;      /* the hash table: an array of buckets, each pointing to a chain of entries */
    unsigned long size;     /* hash table size, that is, the number of buckets in the array */
    unsigned long sizemask; /* size mask, always equal to size-1; used to compute the bucket index */
    unsigned long used;     /* number of entries stored, that is, the number of key-value pairs */
} dictht;
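
To make the role of sizemask concrete, here is a minimal sketch of the index calculation. The helper name bucket_index is hypothetical and does not appear in dict.c. It assumes the table size is a power of two, which is what makes masking with size-1 equivalent to taking the hash modulo size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of dict.c): maps a hash to a bucket index
 * for a table whose size is a power of two, using size-1 as the mask. */
unsigned long bucket_index(uint64_t hash, unsigned long size)
{
    /* The mask trick is only valid for non-zero power-of-two sizes. */
    assert(size != 0 && (size & (size - 1)) == 0);
    unsigned long sizemask = size - 1;
    return (unsigned long)(hash & sizemask);
}
```

For example, with 8 buckets a hash of 13 (binary 1101) is masked with 7 (binary 0111) and lands in bucket 5, the same result as 13 % 8.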

More importantly, each entry in the hash table holds one key-value pair that we store.

/* Dictionary entry type definition */
typedef struct dictEntry {
    /* Untyped pointer to the key */
    void *key;
    /* The value is a union: a pointer, an unsigned 64-bit integer, a signed 64-bit integer, or a double */
    union {
        /* Pointer to the value */
        void *val;
        /* Unsigned 64-bit integer */
        uint64_t u64;
        /* Signed 64-bit integer */
        int64_t s64;
        /* Floating point number */
        double d;
    } v;
    /* Next entry in the same bucket; the entries form a linked list used to handle hash collisions */
    struct dictEntry *next;
} dictEntry;
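
A short sketch of why the value is a union: small numeric values can be stored directly in the entry instead of behind a pointer. The type demo_value and the two helper functions are hypothetical names used only for illustration; they copy the shape of the union shown above.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative copy of the value union from dictEntry.
 * demo_value, make_signed and make_double are hypothetical names. */
typedef union demo_value {
    void    *val;
    uint64_t u64;
    int64_t  s64;
    double   d;
} demo_value;

/* All members share the same storage, so the union is as large as its
 * largest member (assumes pointers are no wider than 64 bits). */
static_assert(sizeof(demo_value) == sizeof(uint64_t), "value fits in 8 bytes");

/* Store a signed integer inline: no separate allocation is needed. */
demo_value make_signed(int64_t n)
{
    demo_value v;
    v.s64 = n;
    return v;
}

/* Store a double inline in the same way. */
demo_value make_double(double x)
{
    demo_value v;
    v.d = x;
    return v;
}
```

The caller has to remember which member was written, because the union itself carries no type tag.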

How Redis handles hash collisions

We saw above that a dict contains two hash tables. Why two?

The answer is that the second table is needed while the hash table is being expanded. Let’s see how the source code explains this.

      int dictRehash(dict *d, int n);

Source code location: https://github.com/redis/redi…

First we need to find out at which step the expansion is triggered. It happens during an add operation, so we start from the add function, dictAdd.

/* Add an element to the target hash table */
int dictAdd(dict *d, void *key, void *val)
{
    /* Add the key to the dictionary */
    dictEntry *entry = dictAddRaw(d,key,NULL);

    if (!entry) return DICT_ERR;
    /* Then set the value of the new entry */
    dictSetVal(d, entry, val);
    return DICT_OK;
}

Next we go into dictAddRaw, which resolves hash collisions by chaining entries in a linked list.

/* Low level add or find:
 * This function adds the entry but, instead of setting a value, returns the dictEntry structure to the user, who can then fill in the value field as needed.
 *
 * This function is also directly exposed to the user API. It is mainly used to store non-pointer values in the hash entry, for example:
 *
 * entry = dictAddRaw(dict,mykey,NULL);
 * if (entry != NULL) dictSetSignedIntegerVal(entry,1000);
 *
 * Return values:
 *
 * If the key already exists, NULL is returned, and "*existing" is populated with the existing entry if existing is not NULL.
 * If the key was added, the hash entry is returned to be manipulated by the caller.
 */
dictEntry *dictAddRaw(dict *d, void *key, dictEntry **existing)
{
    long index;
    dictEntry *entry;
    dictht *ht;
    /* If a rehash is in progress, perform one incremental step with _dictRehashStep; each step migrates one bucket, and repeated calls eventually complete the rehash */
    if (dictIsRehashing(d)) _dictRehashStep(d);

    /* Get the index of the new element, or -1 if the key already exists. Important: _dictKeyIndex is also where the table is expanded and a rehash is started when needed */
    if ((index = _dictKeyIndex(d, key, dictHashKey(d,key), existing)) == -1)
        return NULL;

    /* Resolve hash collisions by chaining. Allocate the memory and store the new entry. The entry is inserted at the head of the chain, on the assumption that in a database system recently added entries are more likely to be accessed frequently */
    /* While a rehash is in progress, insert into the second table ht[1]; otherwise use ht[0] */
    ht = dictIsRehashing(d) ? &d->ht[1] : &d->ht[0];
    /* Allocate memory for the new entry */
    entry = zmalloc(sizeof(*entry));
    /* Link the new entry in front of the current head of the bucket (NULL if the bucket is empty) */
    entry->next = ht->table[index];
    /* The bucket now points to the new entry, which becomes the head of the chain */
    ht->table[index] = entry;
    ht->used++;

    /*Set the key of this hash element*/
    dictSetKey(d, entry, key);
    return entry;
}
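
The head insertion used by dictAddRaw can be sketched in isolation. This is a simplified illustration, not the dict.c code: node and the two functions are hypothetical names, keys are plain integers, and the standard malloc is used in place of zmalloc.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical minimal node, standing in for dictEntry. */
typedef struct node {
    int key;
    struct node *next;
} node;

/* Insert a key at the head of the chain stored in table[index].
 * Returns the new node, or NULL if the allocation fails. */
node *chain_push_front(node **table, unsigned long index, int key)
{
    assert(table != NULL);
    node *n = malloc(sizeof(*n));
    if (n == NULL) return NULL;
    n->key = key;
    n->next = table[index];  /* new node points at the old head (may be NULL) */
    table[index] = n;        /* the bucket now starts with the new node */
    return n;
}

/* Free every node in every bucket of a table with `size` buckets. */
void chain_free_all(node **table, unsigned long size)
{
    for (unsigned long i = 0; i < size; i++) {
        node *n = table[i];
        while (n != NULL) {
            node *next = n->next;
            free(n);
            n = next;
        }
        table[i] = NULL;
    }
}
```

Inserting at the head takes constant time regardless of chain length, and two keys that collide in the same bucket simply end up in the same list, most recent first.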

Having seen how hash collisions are resolved, let’s look at expansion. First, dictIsRehashing, which tells us whether a rehash is in progress.

The rehash process during expansion

/* If rehashidx is not -1, a rehash is in progress */
#define dictIsRehashing(d) ((d)->rehashidx != -1)

So where is rehashidx modified? The answer is in the index calculation:

/* Returns the index of a free slot that can be populated with a hash entry for the given "key". If the key already exists, -1 is returned and the optional output parameter "existing" may be filled.
 * Note that if a rehash is in progress, the index is always returned in the context of the second (new) hash table, ht[1]. */
static long _dictKeyIndex(dict *d, const void *key, uint64_t hash, dictEntry **existing)
{
    unsigned long idx, table;
    dictEntry *he;
    if (existing) *existing = NULL;

    /* Expand the hash table if needed; return -1 if the expansion fails (this is the entry point of the rehash mechanism) */
    if (_dictExpandIfNeeded(d) == DICT_ERR)
        return -1;
    /* Search both hash tables: during a rehash the key may already be in ht[1] */
    for (table = 0; table <= 1; table++) {
        /* Mask the hash with size-1 to get the bucket index in this table */
        idx = hash & d->ht[table].sizemask;
        /* Walk the chain in that bucket; if the key is already present, return -1 */
        he = d->ht[table].table[idx];
        while(he) {
            if (key==he->key || dictCompareKeys(d, key, he->key)) {
                if (existing) *existing = he;
                return -1;
            }
            he = he->next;
        }
        /* Not rehashing: only ht[0] is searched, so idx refers to ht[0]. While rehashing, the loop continues and idx ends up referring to ht[1] */
        if (!dictIsRehashing(d)) break;
    }
    return idx;
}
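
Because each table has its own sizemask, the same hash can map to different buckets in ht[0] and ht[1]. As a rough illustration, assuming the new table is exactly twice the size of the old one (dictExpand may pick a larger power of two), an entry either stays at its old index or moves to old index + old size. The helper index_after_doubling is a hypothetical name, not a dict.c function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: given a hash and the old table size (a power of
 * two), return the bucket index in a table of twice that size. */
unsigned long index_after_doubling(uint64_t hash, unsigned long old_size)
{
    assert(old_size != 0 && (old_size & (old_size - 1)) == 0);
    unsigned long new_mask = old_size * 2 - 1;  /* sizemask of the new table */
    return (unsigned long)(hash & new_mask);
}
```

With an old size of 4, the hashes 5 and 9 both sit in bucket 1. After doubling, 5 moves to bucket 5 (1 + 4) while 9 stays in bucket 1, so a chain of colliding entries tends to be split between two buckets.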

This is where Redis decides whether the table needs to be expanded:

/*If necessary, expand the hash table*/
static int _dictExpandIfNeeded(dict *d)
{
    /* An incremental rehash is already in progress: return */
    if (dictIsRehashing(d)) return DICT_OK;

    /* If the hash table is empty, expand it to the initial size (DICT_HT_INITIAL_SIZE, 4) */
    if (d->ht[0].size == 0) return dictExpand(d, DICT_HT_INITIAL_SIZE);

    /* If we reached the 1:1 ratio of entries to buckets and resizing is allowed (the global dict_can_resize setting), or resizing should be avoided but the entries/buckets ratio is over the "safe" threshold (dict_force_resize_ratio), expand the table to twice the number of entries */
    /* Put simply: once the number of entries reaches the number of buckets, the table is normally expanded to roughly double its capacity */
    if (d->ht[0].used >= d->ht[0].size &&
        (dict_can_resize ||
         d->ht[0].used/d->ht[0].size > dict_force_resize_ratio))
    {
        return dictExpand(d, d->ht[0].used*2);
    }
    return DICT_OK;
}
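
The expansion condition can be isolated as a pure function, which makes it easy to try out a few load factors. This is a sketch under assumptions: should_expand is a hypothetical name, the dict fields and globals are passed as parameters, the check for an in-progress rehash is left out, and the force ratio of 5 used in the examples is an assumed value for dict_force_resize_ratio.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-alone version of the decision made in
 * _dictExpandIfNeeded. `used` is the number of entries, `size` the number
 * of buckets, `can_resize` stands in for dict_can_resize and `force_ratio`
 * for dict_force_resize_ratio. */
bool should_expand(unsigned long used, unsigned long size,
                   bool can_resize, unsigned long force_ratio)
{
    assert(force_ratio > 0);
    if (size == 0) return true;      /* empty table: create the initial one */
    if (used < size) return false;   /* fewer entries than buckets */
    /* Integer division, as in the original expression used/size. */
    return can_resize || used / size > force_ratio;
}
```

With resizing allowed, 4 entries in 4 buckets already trigger an expansion. With resizing disallowed and a force ratio of 5, 20 entries in 4 buckets do not (20/4 is 5, which is not greater than 5), but 24 entries do.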

dictExpand initializes the second hash table; during the subsequent rehash, all entries are moved into it.

/*Expand or create hash table*/
int dictExpand(dict *d, unsigned long size)
{
    /* The size is invalid if a rehash is already in progress, or if it is smaller than the number of elements already in the hash table: return DICT_ERR */
    if (dictIsRehashing(d) || d->ht[0].used > size)
        return DICT_ERR;

    dictht n; /* the new hash table */
    unsigned long realsize = _dictNextPower(size);

    /* Resizing to the same table size is pointless: return DICT_ERR */
    if (realsize == d->ht[0].size) return DICT_ERR;

    /* Allocate the new hash table and initialize all pointers to NULL */
    n.size = realsize;
    n.sizemask = realsize-1;
    /* Allocate zeroed memory for the bucket array */
    n.table = zcalloc(realsize*sizeof(dictEntry*));
    n.used = 0;

    /* Is this the first initialization? If so, it is not really a rehash; we just set the first hash table so that it can accept keys. */
    if (d->ht[0].table == NULL) {
        d->ht[0] = n;
        return DICT_OK;
    }

    /* Prepare the second hash table for incremental rehashing: install it as ht[1] and set rehashidx to 0 to start the rehash */
    d->ht[1] = n;
    d->rehashidx = 0;
    return DICT_OK;
}
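
dictExpand rounds the requested size with _dictNextPower, whose body is not shown above. The following is a sketch of what such a function can look like, assuming it returns the smallest power of two that is at least the requested size and never less than the initial size of 4. next_power and INITIAL_SIZE are hypothetical names.

```c
#include <assert.h>
#include <limits.h>

#define INITIAL_SIZE 4UL  /* assumed to mirror DICT_HT_INITIAL_SIZE */

/* Hypothetical re-implementation: smallest power of two >= size,
 * never below INITIAL_SIZE. */
unsigned long next_power(unsigned long size)
{
    unsigned long i = INITIAL_SIZE;
    /* Guard against overflow of the doubling loop for huge requests. */
    if (size >= LONG_MAX) return LONG_MAX + 1UL;
    while (i < size) i *= 2;
    assert((i & (i - 1)) == 0);  /* the result is always a power of two */
    return i;
}
```

Under this sketch a request for 8 buckets yields 8 and a request for 10 yields 16. Keeping the size a power of two is what allows sizemask to be size-1.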

Once the dictionary is in the rehashing state, new entries are written only to the second hash table, ht[1]. Until the rehash finishes, existing keys may be in either table, so operations on existing keys have to consider both ht[0] and ht[1].

/* Performs n steps of incremental rehashing on dictionary d. Each step moves one bucket, with every entry chained in it, from ht[0] to ht[1]; for example, if there are no hash collisions, moving 100 entries takes n = 100. Returns 1 if there are still entries to move, 0 otherwise. */
int dictRehash(dict *d, int n) {
    int empty_visits = n*10; /* maximum number of empty buckets to visit */
    /* Nothing to do if no rehash is in progress */
    if (!dictIsRehashing(d)) return 0;
    /* Migrate up to n buckets, starting from bucket rehashidx of ht[0] */
    while(n-- && d->ht[0].used != 0) {
        dictEntry *de, *nextde;

        /* Note that rehashidx cannot overflow, because ht[0].used != 0 guarantees there are more elements */
        assert(d->ht[0].size > (unsigned long)d->rehashidx);
        /* Skip empty buckets. rehashidx starts at 0 and only moves forward, so the whole of ht[0] is eventually traversed */
        while(d->ht[0].table[d->rehashidx] == NULL) {
            d->rehashidx++;
            /* After visiting n*10 empty buckets, give up for this call and return 1 (rehash not finished) */
            if (--empty_visits == 0) return 1;
        }
        /* Head of the chain stored in this bucket of ht[0] */
        de = d->ht[0].table[d->rehashidx];
        /* Move every entry in this bucket from ht[0] to ht[1] */
        while(de) {
            uint64_t h;

            nextde = de->next;
            /*Gets the index in the new hash table*/
            h = dictHashKey(d, de->key) & d->ht[1].sizemask;
            de->next = d->ht[1].table[h];
            d->ht[1].table[h] = de;
            d->ht[0].used--;
            d->ht[1].used++;
            de = nextde;
        }
        /* The bucket has been emptied: clear it and advance to the next one */
        d->ht[0].table[d->rehashidx] = NULL;
        d->rehashidx++;
    }

    /* Check whether the whole of ht[0] has been rehashed. If so, all data has moved to ht[1]: free the old table, make ht[1] the new ht[0], reset ht[1], and mark the rehash as finished */
    if (d->ht[0].used == 0) {
        zfree(d->ht[0].table);
        d->ht[0] = d->ht[1];
        _dictReset(&d->ht[1]);
        d->rehashidx = -1;
        return 0;
    }

    /* More to rehash: ht[0] still holds entries, so return 1 to tell the caller that further calls are needed */
    return 1;
}
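
To see the whole mechanism in one place, here is a toy model of incremental rehashing that can be compiled on its own. It is a sketch, not the dict.c implementation: all names (tnode, ttable, tdict and the tdict_ functions) are hypothetical, keys are integers whose hash is the key itself, allocation failures simply abort through assert, and there is no duplicate check or automatic expansion.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct tnode { unsigned long key; struct tnode *next; } tnode;
typedef struct ttable { tnode **buckets; unsigned long size, used; } ttable;
typedef struct tdict { ttable ht[2]; long rehashidx; } tdict;

/* Allocate a table with `size` empty buckets (size must be a power of two). */
ttable ttable_new(unsigned long size)
{
    ttable t = { calloc(size, sizeof(tnode *)), size, 0 };
    assert(t.buckets != NULL);
    return t;
}

void tdict_init(tdict *d, unsigned long size)
{
    d->ht[0] = ttable_new(size);
    d->ht[1] = (ttable){ NULL, 0, 0 };
    d->rehashidx = -1;
}

/* Begin an incremental rehash into a new table of `new_size` buckets. */
void tdict_start_rehash(tdict *d, unsigned long new_size)
{
    assert(d->rehashidx == -1);
    d->ht[1] = ttable_new(new_size);
    d->rehashidx = 0;
}

/* Insert at the head of a chain: into ht[1] while rehashing, else ht[0]. */
void tdict_add(tdict *d, unsigned long key)
{
    ttable *t = (d->rehashidx != -1) ? &d->ht[1] : &d->ht[0];
    tnode *n = malloc(sizeof(*n));
    assert(n != NULL);
    n->key = key;
    n->next = t->buckets[key & (t->size - 1)];
    t->buckets[key & (t->size - 1)] = n;
    t->used++;
}

/* Look in ht[0], and also in ht[1] while a rehash is in progress. */
int tdict_contains(const tdict *d, unsigned long key)
{
    for (int i = 0; i <= 1; i++) {
        const ttable *t = &d->ht[i];
        if (t->size != 0)
            for (tnode *n = t->buckets[key & (t->size - 1)]; n; n = n->next)
                if (n->key == key) return 1;
        if (d->rehashidx == -1) break;
    }
    return 0;
}

/* Move up to n buckets from ht[0] to ht[1]. Returns 1 if entries remain
 * in ht[0], 0 once the rehash has finished or none was in progress. */
int tdict_rehash(tdict *d, int n)
{
    int empty_visits = n * 10;
    if (d->rehashidx == -1) return 0;
    while (n-- && d->ht[0].used != 0) {
        while (d->ht[0].buckets[d->rehashidx] == NULL) {
            d->rehashidx++;
            if (--empty_visits == 0) return 1;
        }
        tnode *e = d->ht[0].buckets[d->rehashidx];
        while (e) {
            tnode *next = e->next;
            unsigned long h = e->key & (d->ht[1].size - 1);
            e->next = d->ht[1].buckets[h];
            d->ht[1].buckets[h] = e;
            d->ht[0].used--;
            d->ht[1].used++;
            e = next;
        }
        d->ht[0].buckets[d->rehashidx++] = NULL;
    }
    if (d->ht[0].used == 0) {   /* finished: ht[1] becomes the new ht[0] */
        free(d->ht[0].buckets);
        d->ht[0] = d->ht[1];
        d->ht[1] = (ttable){ NULL, 0, 0 };
        d->rehashidx = -1;
        return 0;
    }
    return 1;
}

/* Release every node and both bucket arrays. */
void tdict_free(tdict *d)
{
    for (int i = 0; i <= 1; i++) {
        for (unsigned long b = 0; b < d->ht[i].size; b++)
            for (tnode *n = d->ht[i].buckets[b], *nx; n; n = nx) {
                nx = n->next;
                free(n);
            }
        free(d->ht[i].buckets);
    }
}
```

A typical sequence is tdict_init, a few tdict_add calls, tdict_start_rehash with a larger size, and then repeated tdict_rehash(d, 1) calls until it returns 0. Between those calls the dictionary stays fully usable, which is the point of doing the migration one bucket at a time.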

Rehashing is also driven by a periodic background task.

In the server code: https://github.com/redis/redi…

/* Periodic database maintenance task (excerpt) */
void databasesCron(void) {
    /* ... */
    /* Rehash */
    if (server.activerehashing) {
        for (j = 0; j < dbs_per_call; j++) {
            /* Rehash this database incrementally */
            int work_done = incrementallyRehash(rehash_db);
            if (work_done) {
                break;
            } else {
                /* If this db didn't need rehash, we'll try the next one. */
                rehash_db++;
                rehash_db %= server.dbnum;
            }
        }
    }
    /* ... */
}


/* Rehash one database for 1 ms per call */
int incrementallyRehash(int dbid) {
    /* Keys dictionary */
    if (dictIsRehashing(server.db[dbid].dict)) {
        dictRehashMilliseconds(server.db[dbid].dict,1);
        return 1; /* already used our millisecond for this loop... */
    }
    /* Expires */
    if (dictIsRehashing(server.db[dbid].expires)) {
        dictRehashMilliseconds(server.db[dbid].expires,1);
        return 1; /* already used our millisecond for this loop... */
    }
    return 0;
}
  
/* Rehash for ms + "delta" milliseconds. Delta is larger than 0 and smaller than 1 in most cases; the exact upper bound depends on the running time of dictRehash(d,100) */
int dictRehashMilliseconds(dict *d, int ms) {
    if (d->iterators > 0) return 0;
    /* Record the start time */
    long long start = timeInMilliseconds();
    /* Number of rehash steps requested so far */
    int rehashes = 0;
    /* Rehash in batches of 100 steps (buckets) */
    while(dictRehash(d,100)) {
        rehashes += 100;
        /* Stop once more than ms milliseconds (for example, 1 ms) have elapsed since the start */
        if (timeInMilliseconds()-start > ms) break;
    }
    return rehashes;
}
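
The same time-boxing pattern can be shown independently of Redis. In this sketch, run_for_ms, batch_fn, now_ms and countdown_step are hypothetical names; the clock uses POSIX gettimeofday, and the work to be done is abstracted behind a callback that returns nonzero while work remains.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in milliseconds (POSIX gettimeofday). */
long long now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return ((long long)tv.tv_sec * 1000) + (tv.tv_usec / 1000);
}

/* Hypothetical callback type: performs one batch of work and returns
 * nonzero while work remains. */
typedef int (*batch_fn)(void *ctx);

/* Run batches until the work is done or more than `ms` milliseconds have
 * elapsed. Returns the number of batches that reported remaining work. */
int run_for_ms(batch_fn step, void *ctx, int ms)
{
    assert(step != NULL);
    long long start = now_ms();
    int batches = 0;
    while (step(ctx)) {
        batches++;
        /* The clock is checked only between batches, so the overshoot is
         * bounded by the duration of a single batch. */
        if (now_ms() - start > ms) break;
    }
    return batches;
}

/* Example batch: counts an int down; work remains while it is above zero. */
int countdown_step(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) (*remaining)--;
    return *remaining > 0;
}
```

Checking the clock once per batch rather than once per bucket keeps the cost of reading the time small compared to the work done, at the price of a slightly fuzzy deadline.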