**Reading Guide:** This article starts from traditional hash table designs and their solutions, and leads step by step to a new design idea: instead of avoiding hash collisions as much as possible, use a suitable collision probability to optimize computation and storage efficiency. The new hash table design shows that parallel processing with SIMD instructions can effectively raise the tolerance for hash collisions, speed up queries, and help the hash table reach extreme storage compression.

**1 Background**

A hash table is a data structure with excellent lookup performance and is widely used in computer systems. Although the theoretical lookup time complexity of a hash table is O(1), different implementations still differ hugely in performance, so engineers have never stopped exploring better hash data structures.

**1.1 Core of hash table design**

In computer science theory, a hash table is a data structure that maps a key to the storage location of its value through a hash function. Hash table design centers on two questions:

- How to improve the efficiency of mapping key to value storage location?
- How to reduce the space overhead of storing data structures?

Since storage space overhead is also a core control point at design time, with limited space the mapping performed by the hash function has a very high probability of sending different keys to the same storage location. This is a **hash collision**. Most hash table designs differ mainly in how they handle hash collisions.

When a hash collision occurs, there are three common solutions: open addressing, chaining, and rehashing. Below, however, we introduce two interesting and less common solutions, which lead to our new implementation: the **B16 hash table.**

**2 Avoiding hash collisions**

Handling hash collisions in a traditional hash table adds extra branch jumps and memory accesses, which degrades the efficiency of pipelined CPU instruction processing. So people naturally ask: can hash collisions be avoided entirely? A function with this property exists: the perfect hash function.

The design of a perfect hash function is often ingenious. For example, the BDZ perfect hash function provided by the cmph library (http://cmph.sourceforge.net/) uses the mathematical concept of an **acyclic random 3-partite hypergraph**. BDZ randomly maps each key to a hyperedge of a 3-partite hypergraph through three different hash functions. If the hypergraph passes the acyclicity check, each key is mapped to a vertex of the hypergraph, and the storage subscript corresponding to the key is then obtained through a carefully designed auxiliary array whose length equals the number of vertices of the hypergraph.

The perfect hash function sounds elegant, but it also has some real defects:

- A perfect hash function can only be used on a finite, predetermined set: all possible keys must belong to a known set, and it cannot handle keys it has never seen;
- Constructing a perfect hash function has a certain complexity, and construction may fail with some probability;
- A perfect hash function is different from a hash function in cryptography. It is often not a simple **mathematical function** but a **functional routine** composed of **data structure + algorithm**, so it also has storage space overhead, access overhead and additional branch jump overhead.

However, in specific scenarios, such as read-only scenarios and scenarios where the key set is fixed (for example, the Chinese character set), a perfect hash function may achieve very good performance.

**3 Exploiting hash collisions**

Even without a perfect hash function, many hash tables deliberately control the probability of hash collisions. The simplest way is to control the space overhead of the hash table through the load factor, so that the bucket array keeps enough holes to accommodate new keys. The load factor acts like a hyperparameter that controls the efficiency of the hash table. Generally speaking, the smaller the load factor, the more space is wasted and the better the hash table performs.

However, in recent years some new techniques have shown us another way of dealing with hash collisions: **making full use of them.**

**3.1 SIMD instructions**

SIMD is short for single instruction, multiple data. Such instructions allow a single instruction to operate on multiple data items. For example, the GPUs that have become so popular in recent years accelerate neural network computation through very large-scale SIMD compute engines.

Current mainstream CPUs already have rich SIMD instruction sets. For example, most of the x86 CPUs we have access to already support the SSE4.2 and AVX instruction sets, and ARM CPUs have the NEON instruction set. However, most programs outside scientific computing still make little use of SIMD instructions.

**3.2 F14 hash table**

Facebook's F14 hash table, open-sourced in the folly library, has a very delicate design: it maps keys to blocks, and then uses SIMD instructions to filter within the block. Because the number of blocks is smaller than the number of buckets in a traditional design, this is equivalent to deliberately adding hash collisions, which are then resolved inside the block with SIMD instructions.

The specific approach is as follows:

- Two hash codes are computed for each key through the hash function: *H1* and *H2*. *H1* determines the block the key maps to; *H2* has only 8 bits and is used for filtering within the block;
- Each block can store up to 14 elements, and the block header has 16 bytes. The first 14 bytes of the header store the *H2* values of the 14 elements. The 15th byte is a control byte, which mainly records how many elements in this block overflowed from the previous block. The 16th byte is an overflow counter, which mainly records how many more elements would have been placed in this block had it had enough space;
- When inserting, if the 14 positions in the block the key maps to are not yet all occupied, the element is inserted directly. When the block is full, the overflow counter is incremented and insertion is attempted in the next block;
- When querying, *H1* and *H2* are computed from the key being looked up. *H1* modulo the number of blocks determines the block. The block header is read first, and a SIMD instruction compares *H2* with the *H2* values of the 14 elements in parallel. If an equal *H2* exists, the keys are compared to determine the final result; otherwise, the 16th byte of the block header decides whether the next block needs to be checked.
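The header comparison in the last step can be sketched as follows. This is a minimal illustration, not folly's actual code: the `BlockHeader` layout and the `match_tags` name are hypothetical, empty slots are not modeled, and a scalar loop is used when SSE2 is not available.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical 16-byte block header: 14 tag bytes (the H2 values),
// followed by the control byte and the overflow counter.
struct BlockHeader {
    std::uint8_t tags[14];
    std::uint8_t control;
    std::uint8_t overflow;
};
static_assert(sizeof(BlockHeader) == 16, "header must be exactly 16 bytes");

// Returns a bitmask in which bit i is set when tags[i] == h2.
// Bits 14 and 15 (control byte and overflow counter) are always cleared.
inline unsigned match_tags(const BlockHeader& header, std::uint8_t h2) {
#if defined(__SSE2__)
    __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&header));
    __m128i needle = _mm_set1_epi8(static_cast<char>(h2));  // H2 in all 16 lanes
    unsigned mask = static_cast<unsigned>(
        _mm_movemask_epi8(_mm_cmpeq_epi8(bytes, needle)));
#else
    unsigned mask = 0;  // portable fallback: compare the tags one by one
    for (int i = 0; i < 14; ++i)
        if (header.tags[i] == h2) mask |= 1u << i;
#endif
    return mask & 0x3FFFu;
}
```

Only the slots whose bit is set in the returned mask need a full key comparison.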

To make full use of the parallelism of SIMD instructions, F14 uses the 8-bit hash value *H2* within the block, because one 128-bit-wide SIMD instruction can compare up to sixteen 8-bit integers in parallel. Although the theoretical collision probability of an 8-bit hash value is 1/256, this also means that with probability 255/256 the cost of a full key-by-key comparison is avoided, so the hash table can tolerate a higher collision probability.
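As a rough check on this argument, assume the 8-bit tags of unrelated keys are uniform and independent (an idealization, not a measured property). For a lookup of a key that is absent from a full block of 14 elements, this gives:

```latex
P(\text{no tag matches}) = \left(\frac{255}{256}\right)^{14} \approx 0.947,
\qquad
E[\text{false tag matches}] = \frac{14}{256} \approx 0.055
```

Under that assumption, roughly 95% of such lookups touch no key at all within the block.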

**4 B16 hash table**

Setting aside the design inside the block, F14 is essentially an open-addressing hash table. The 15th and 16th bytes of each block header are used for the open-addressing control strategy, leaving only 14 bytes for hash codes, which is why it is named F14.

So we considered whether blocks could be organized from another angle, using chaining. Because the control information is omitted, each block can hold 16 elements, so we named it B16.

**4.1 B16 hash data structure**

*△ B16 hash table data structure (3 element example)*

The figure above shows the data structure of the B16 hash table, drawn with 3 elements per block. The green part in the middle is the usual bucket array, which stores the head pointer of each bucket's chunk chain. Compared with F14, each chunk on the right has no control bytes and gains a next pointer to the following chunk.

B16 also computes two hash codes for each key through the hash function: *H1* and *H2*. For example, the two hash codes of "lemon" are 0x24eb and 0x24. Using the high bits of *H1* as *H2* is generally good enough.

When inserting, the bucket of the key is computed from *H1*; for example, the bucket of "lemon" is 0x24eb mod 3 = 1. Then the first free slot in the chunk chain of bucket 1 is found, and the key's *H2* and the element are written into that chunk. When the chunk chain does not exist or is full, a new chunk is created for the chain to hold the inserted element.
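The "lemon" arithmetic can be reproduced with a small sketch. It assumes a 16-bit hash code purely to match the example; the function names `h2_of` and `bucket_of` are hypothetical, and a real implementation would likely use a wider hash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption for illustration: H1 is a 16-bit hash code, as in the "lemon" example.
// H2 is taken from the high 8 bits of H1.
inline std::uint8_t h2_of(std::uint16_t h1) {
    return static_cast<std::uint8_t>(h1 >> 8);
}

// Bucket index: H1 modulo the number of buckets.
inline std::size_t bucket_of(std::uint16_t h1, std::size_t num_buckets) {
    assert(num_buckets > 0);
    return h1 % num_buckets;
}
```

With `h1 = 0x24eb` and 3 buckets, `h2_of` yields 0x24 and `bucket_of` yields 1, matching the values in the text.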

When searching, first find the corresponding bucket's chain through *H1*, and then compare *H2* against each chunk using SIMD instructions. The 16-byte header of each chunk is loaded into a 128-bit register, which then contains sixteen 8-bit values *H2′*. *H2* is broadcast into another 128-bit register, and all 16 comparisons are performed simultaneously with one SIMD instruction. If all of them differ, the next chunk is compared; if an equal *H2′* exists, the key of the corresponding element is checked against the searched key. This continues until the whole chain has been traversed or the element is found.
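A minimal sketch of this lookup is shown below. The `Chunk` layout, the `count` field (used here to mark how many slots are occupied) and the function names are assumptions of this sketch, not the authors' implementation; a scalar loop stands in when SSE2 is not available.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical chunk: 16 tags (H2 values), 16 key/value slots and a next pointer.
// `count` is an assumption of this sketch: the number of slots in use.
struct Chunk {
    std::uint8_t tags[16] = {};
    std::uint64_t keys[16] = {};
    std::uint64_t values[16] = {};
    std::uint8_t count = 0;
    Chunk* next = nullptr;
};

// Bit i of the result is set when tags[i] == h2.
inline unsigned match16(const std::uint8_t* tags, std::uint8_t h2) {
#if defined(__SSE2__)
    __m128i t = _mm_loadu_si128(reinterpret_cast<const __m128i*>(tags));
    __m128i n = _mm_set1_epi8(static_cast<char>(h2));  // broadcast H2 to 16 lanes
    return static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(t, n)));
#else
    unsigned mask = 0;  // portable fallback
    for (int i = 0; i < 16; ++i)
        if (tags[i] == h2) mask |= 1u << i;
    return mask;
#endif
}

// Walks one bucket's chunk chain. Returns a pointer to the value, or nullptr.
inline const std::uint64_t* find(const Chunk* head, std::uint64_t key,
                                 std::uint8_t h2) {
    for (const Chunk* c = head; c != nullptr; c = c->next) {
        assert(c->count <= 16);
        // Keep only matches that fall inside the occupied slots.
        unsigned mask = match16(c->tags, h2) & ((1u << c->count) - 1u);
        for (unsigned i = 0; mask != 0; ++i, mask >>= 1) {
            if ((mask & 1u) && c->keys[i] == key) return &c->values[i];
        }
    }
    return nullptr;
}
```

Keys are compared only for slots whose tag already matched, so most chunks are rejected by the single 16-lane comparison.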

When deleting, first find the corresponding element, then overwrite it with the element at the end of the chunk chain.
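Deletion by overwriting with the chain's last element might look like the sketch below. The `Chunk` layout with its `count` field and the `erase` name are hypothetical; tags are compared with a plain loop for brevity, and releasing a tail chunk that becomes empty is left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical chunk layout; `count` (slots in use) is an assumption of this sketch.
struct Chunk {
    std::uint8_t tags[16] = {};
    std::uint64_t keys[16] = {};
    std::uint64_t values[16] = {};
    std::uint8_t count = 0;
    Chunk* next = nullptr;
};

// Removes `key` from the chain starting at `head`. Returns true if it was found.
// A tail chunk that becomes empty stays linked; freeing it is omitted here.
inline bool erase(Chunk* head, std::uint64_t key, std::uint8_t h2) {
    Chunk* hit = nullptr;   // chunk holding the element to delete
    unsigned slot = 0;      // its slot inside `hit`
    Chunk* last = nullptr;  // last chunk in the chain that still holds elements
    for (Chunk* c = head; c != nullptr; c = c->next) {
        assert(c->count <= 16);
        if (c->count > 0) last = c;
        for (unsigned i = 0; hit == nullptr && i < c->count; ++i) {
            if (c->tags[i] == h2 && c->keys[i] == key) {
                hit = c;
                slot = i;
            }
        }
    }
    if (hit == nullptr) return false;
    // Overwrite the deleted slot with the chain's last element, then shrink.
    unsigned tail = last->count - 1u;
    hit->tags[slot] = last->tags[tail];
    hit->keys[slot] = last->keys[tail];
    hit->values[slot] = last->values[tail];
    --last->count;
    return true;
}
```

Moving the last element into the freed slot keeps every chunk except the final one full, so no holes appear in the middle of a chain.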

Of course, the number of elements in each block of the B16 hash table can be adjusted flexibly according to the width of the SIMD instruction. For example, with 256-bit-wide instructions a block of 32 elements can be chosen. However, hash table performance depends not only on the lookup algorithm but also on the speed and contiguity of memory access. Keeping the block size within 16 makes full use of the x86 CPU cache line in most cases, and is the better choice.

In an ordinary chained hash table, each node of a chain holds only one element. With B16's chunked chaining, each node holds 16 elements, which would leave a lot of holes. To reduce the number of holes as much as possible, we must increase the probability of hash collisions, that is, shrink the bucket array as much as possible. Through experiments we found that the overall performance of B16 is best when the load factor is between 11 and 13. In effect, this moves the holes that originally existed in the bucket array into the chunk chains, and it also saves the next-pointer overhead that every node of an ordinary chain carries.

**4.2 B16Compact hash data structure**

*△ B16Compact hash table data structure (3 element example)*

B16Compact compresses the hash table structure to the extreme.

First, it omits the next pointer in each chunk, merges all chunks into one array, and fills all chunk holes. For example, the chain of bucket[1] in [figure 1] originally has four elements, including banana and lemon, of which the first two are moved into chunk[0] in [figure 2]. By analogy, every chunk in the chunk array is full except the last one.

Then it omits the pointer to the chunk chain in each bucket, and keeps only the array subscript of the chunk that holds the first element of the original chain. For example, the first element of the chain of bucket[1] in [figure 1] was moved into chunk[0] in [figure 2], so the new bucket[1] stores only the subscript 0.

Finally, one extra tail bucket is added to record the subscript of the last chunk in the chunk array.

After this processing, the elements of each bucket's original chain are still contiguous in the new data structure. Each bucket still points to the first chunk containing its elements, and the last chunk containing its elements can be determined from the subscript stored in the next bucket. The difference is that a chunk may now contain elements from several buckets' chains. Although more chunks may need to be searched, each chunk can be filtered quickly with SIMD instructions, so the impact on overall lookup performance is relatively small.

This read-only hash table supports only lookup, and the lookup process is much the same as before. Taking lemon as an example, first find the corresponding bucket 1 through *H1* = 0x24eb, and obtain the starting chunk subscript 0 and the ending chunk subscript 1 of the chain corresponding to that bucket. Use the same algorithm as B16 to search for lemon in chunk[0], and then continue with chunk[1], where the corresponding element is found.
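The bucket-to-chunk-range lookup can be sketched as below. The `CompactTable` layout (flat arrays in which element `i` belongs to chunk `i / 16`) and the `find` signature are assumptions made for illustration; the tag filter is written as a scalar loop where a real implementation would use one SIMD comparison per chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical read-only layout: element i lives in chunk i / 16.
struct CompactTable {
    std::vector<std::uint8_t> tags;     // H2 of each element
    std::vector<std::uint64_t> keys;
    std::vector<std::uint64_t> values;
    // buckets[b] is the subscript of the first chunk holding elements of bucket b.
    // The extra final entry (the tail bucket) is the subscript of the last chunk.
    std::vector<std::uint32_t> buckets;
};

// Searches chunks buckets[b] .. buckets[b + 1], both inclusive.
inline const std::uint64_t* find(const CompactTable& t, std::uint64_t h1,
                                 std::uint8_t h2, std::uint64_t key) {
    assert(t.buckets.size() >= 2);
    std::size_t b = static_cast<std::size_t>(h1 % (t.buckets.size() - 1));
    std::size_t begin = static_cast<std::size_t>(t.buckets[b]) * 16;
    std::size_t end = std::min(
        (static_cast<std::size_t>(t.buckets[b + 1]) + 1) * 16, t.keys.size());
    // Scalar tag filter; each group of 16 tags could be one SIMD comparison.
    for (std::size_t i = begin; i < end; ++i) {
        if (t.tags[i] == h2 && t.keys[i] == key) return &t.values[i];
    }
    return nullptr;
}
```

Because the end chunk is taken from the next bucket's start, a lookup may scan a chunk that also holds another bucket's elements; the tag and key comparisons reject those.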

The theoretical additional storage overhead of B16Compact can be calculated by the following formula:

Where n is the number of elements of the read-only hash table.

When n is 1 million and the load factor is 13, the theoretical additional storage overhead of the B16Compact hash table is 9.23 bits/key, that is, the additional overhead of storing each key is only about one byte. This is almost comparable to some perfect hash functions, and construction can never fail.
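One expression that reproduces the 9.23 bits/key figure is sketched below. It is a reconstruction under stated assumptions, not necessarily the authors' formula: each key costs 8 bits for its *H2*, and each of the n/α buckets (α being the load factor) stores one chunk subscript of ⌈log₂(n/16)⌉ bits.

```latex
\text{overhead}(n) \approx 8 + \frac{\lceil \log_2 (n/16) \rceil}{\alpha}
\ \text{bits/key},
\qquad
n = 10^6,\ \alpha = 13:\quad 8 + \frac{16}{13} \approx 9.23
```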

**5 Experimental data**

**5.1 Experimental setup**

In the experiments, the key and value types of the hash tables are both uint64_t. The input array of key-value pairs is generated in advance by a random number generator. Each hash table is initialized with the number of elements, so no rehash is needed during insertion.

- Insertion performance: the total time to insert N elements divided by N, in ns/key;
- Query performance: measured with 200,000 random key queries (all hits) plus 200,000 random value queries (possible misses). The total time is divided by 400,000, in ns/key;
- Storage space: the total allocated space divided by the number of elements in the hash table, in bytes/key. For the total allocated space, F14 and B16 provide interface functions that return it directly; for unordered_map it is obtained by the following formula:

The folly library is compiled with -mavx -O2, and its load factor uses the default parameter. B16 is compiled with -mavx -O2, and its load factor is set to 13. unordered_map uses the version shipped with the Ubuntu system, and its load factor uses the default parameter.

The test server is a 4-core, 8 GB CentOS 7u5 virtual machine with an Intel(R) Xeon(R) Gold 6148 @ 2.40GHz CPU, and the programs are compiled and run in an Ubuntu 20.04.1 LTS Docker container.

**5.2 Experimental data**

*△ Insertion performance comparison*

The lines in the figure above show the insertion performance of unordered_map, F14ValueMap and B16ValueMap. The bars show the storage overhead of the different hash tables.

The storage overhead of the B16 hash table is significantly lower than that of unordered_map, while its insertion performance is still significantly better than that of unordered_map.

Because the F14 hash table dynamically adjusts its load factor with a different strategy, the storage overhead of F14 fluctuates across hash table sizes, but the storage overhead of B16 is still better than that of F14 overall. The insertion performance of B16 is better than F14 below 1 million keys, but worse than F14 at 10 million keys, possibly because the memory access locality of B16's chains is worse than that of F14 when the amount of data is large.

*△ Lookup performance comparison*

The lines in the figure above show the lookup performance of unordered_map, F14ValueMap, B16ValueMap and B16Compact. The bars show the storage overhead of the different hash tables.

The storage overhead of the B16 and B16Compact hash tables is significantly lower than that of unordered_map, while their lookup performance is still significantly better than that of unordered_map.

The lookup comparison between B16 and F14 is similar to the insertion comparison: B16 is significantly better than F14 below 1 million keys, but slightly worse than F14 at 10 million keys.

The performance of the B16Compact hash table is worth noting. Because the key and value types in the experiment are both uint64_t, storing a key-value pair already requires 16 bytes. The B16Compact hash table stores hash tables of different sizes at a stable 17.31 bytes/key, which means the hash structure itself costs only 1.31 bytes per key. The theoretical overhead of 9.23 bits/key is not reached because our bucket array is not compressed to the limit with bit packing (which might hurt performance) and instead uses uint32_t.

**6 Summary**

Inspired by F14, we designed the B16 hash table, whose data structure is easier to understand and whose insert, delete and lookup logic is simpler. Experiments show that in some scenarios B16 has better storage overhead and performance than F14.

The new hash table design shows that parallel processing with SIMD instructions can effectively raise the tolerance for hash collisions, speed up queries, and help the hash table reach extreme storage compression. This shifts the design idea of hash tables from avoiding hash collisions as much as possible to using a suitable collision probability to optimize computation and storage efficiency.

Original link:https://mp.weixin.qq.com/s/oeuExiW3DYQnBG8HvDgBAg
