“Judging whether a value is in a huge set” (hereinafter: set membership testing) is a common data-processing problem. Traditionally, if a certain false positive rate is acceptable, the Bloom filter is the first choice, but now we have a better option: the cuckoo filter.
Recently, our business needed a filter. After some research, we found that the cuckoo filter is more cost-effective than the Bloom filter in our scenario.
To settle the final technology selection, I read the original paper. When I then decided to use the cuckoo filter, I found there was almost no comprehensive Go implementation. The several high-star implementations on GitHub have some defects and do not maximize space utilization. So, referring to the original paper and its C++ reference implementation, I ported and optimized a Go library; the details are below.
The code is here; stars, usage, contributions and bug reports are welcome: github.com/linvon/cuckoofilter
Cuckoo filter
There are already many introductory articles on the cuckoo filter, so this section does not repeat them and only covers the main points needed for what follows.
For more details, please refer to the original paper, or check out my Chinese translation.
What is a cuckoo filter?
It is a filter based on cuckoo hashing. In essence, it is a cuckoo hash table that stores fingerprints (short hashes) of the inserted items.
If you know the Bloom filter, you know its principle: map each item to several positions in a bit array with multiple hash functions, set those bits on insert, and check them on query.
The cuckoo filter instead hashes each item to a fingerprint of a fixed number of bits and stores that fingerprint in an array; a query checks whether the fingerprint is present in the item's candidate buckets.
Why cuckoo filter?
Both store hash values, so essentially both store many small hashes. Why is the cuckoo filter better?

First, the cuckoo hash table is more compact, so it saves more space.

Second, a query in a Bloom filter computes several different hash functions, while a cuckoo filter needs only one hash, so queries are very efficient.

Third, the cuckoo filter supports deletion, while the Bloom filter does not.
What are the disadvantages, compared with the Bloom filter?
 The cuckoo filter uses an alternate candidate-bucket scheme: the candidate bucket and the primary bucket can each be derived from the other by XORing the bucket index with a hash of the stored fingerprint. This correspondence requires the number of buckets to be a power of two.
 When the Bloom filter inserts, it computes the hashes and writes the bits directly. In the cuckoo filter, the computed position may already hold a fingerprint; the stored item must then be kicked to its candidate bucket. As the buckets fill up, collisions become more and more likely and each insertion gets more expensive, so the cuckoo filter's insertion performance is poor compared with the Bloom filter's.
 Inserting duplicate elements: duplicates do not affect the Bloom filter, which simply sets the already-set bits again. The cuckoo filter kicks out existing values, so there is an upper limit on how many times the same element can be inserted.
 Deletion in the cuckoo filter is imperfect. Given the duplicate-insertion limit above: deletion is clean only if each value was inserted exactly once. Deleting an element that was never inserted may remove another item's identical fingerprint by mistake, with the same probability as the false positive rate; if an element was inserted several times, each delete removes only one copy, so you must know how many times it was inserted to delete it completely, or keep running the deletion until it fails.
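To make the kick mechanism above concrete, here is a minimal illustrative sketch of cuckoo-filter insertion and lookup. Everything here is an assumption for illustration, not the library's actual API: the names (`insert`, `lookup`, `altIndex`, …), the choice of FNV as the hash, the 8-bit fingerprints and the table sizes.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

const (
	numBuckets = 1 << 8 // must be a power of two for the XOR trick
	bucketSize = 4      // b fingerprints per bucket
	maxKicks   = 500    // give up and declare the filter full after this
)

// table[i] holds up to bucketSize 8-bit fingerprints; 0 marks an empty slot.
var table [numBuckets][bucketSize]byte

func fingerprint(data []byte) byte {
	h := fnv.New32a()
	h.Write(data)
	return byte(h.Sum32()%255) + 1 // never 0, so 0 can mean "empty"
}

func primaryIndex(data []byte) uint32 {
	h := fnv.New64a()
	h.Write(data)
	return uint32(h.Sum64()) % numBuckets
}

// altIndex XORs the index with a hash of the fingerprint; applying it
// twice yields the original index, so either bucket finds its partner.
func altIndex(i uint32, fp byte) uint32 {
	h := fnv.New32a()
	h.Write([]byte{fp})
	return (i ^ h.Sum32()) % numBuckets
}

func insert(data []byte) bool {
	fp := fingerprint(data)
	i := primaryIndex(data)
	for n := 0; n < maxKicks; n++ {
		for j, v := range table[i] {
			if v == 0 {
				table[i][j] = fp
				return true
			}
		}
		// Bucket full: evict a random resident to its alternate bucket.
		j := rand.Intn(bucketSize)
		fp, table[i][j] = table[i][j], fp
		i = altIndex(i, fp)
	}
	return false // too many kicks: the filter is considered full
}

func lookup(data []byte) bool {
	fp := fingerprint(data)
	i1 := primaryIndex(data)
	for _, i := range []uint32{i1, altIndex(i1, fp)} {
		for _, v := range table[i] {
			if v == fp {
				return true
			}
		}
	}
	return false
}

func main() {
	insert([]byte("linvon"))
	fmt.Println(lookup([]byte("linvon")))
}
```

The eviction loop is exactly what makes insertion slow near full load: every kick may displace another resident and trigger yet another kick.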
With the advantages and disadvantages listed, let's sum up. For set membership testing, most scenarios are read-heavy and write-light, and duplicate insertion is meaningless. Although the cuckoo filter's deletion is imperfect, it is better than none, and the filter has better query and storage efficiency. In most cases it is the more cost-effective choice.
Practical Study
Implementation details
First, the structure of a cuckoo filter: it consists of many buckets, and each bucket stores fingerprints of the inserted items, each a fixed number of bits.
The filter has n buckets, with n calculated from the number of items to be stored. A hash function determines which bucket an item goes into, and each additional hash function yields another candidate bucket for the item. When an insertion collides with an occupied slot, the currently stored fingerprint is kicked into its candidate bucket. In theory, more hash functions mean higher space utilization, but in practice k = 2 hash functions can already achieve 98% utilization.
Each bucket stores multiple fingerprints, depending on the bucket size. Different items' fingerprints may map to the same bucket. The larger the bucket, the higher the space utilization; but the more fingerprints each query must scan in a bucket, the higher the false positive probability. The fingerprint size must then be increased to lower the collision rate and maintain the target false positive rate.
The paper identifies several parameters needed to build a cuckoo filter:
 Number of hash functions (k): 2 is enough
 Bucket size (b): how many fingerprints each bucket stores
 Fingerprint size (f): how many bits of the key's hash each fingerprint stores
In Chapter 5, the author shows how to choose the most suitable construction parameters based on experimental data, leading to the following conclusions:
 The filter cannot be filled 100%; there is a maximum load factor α. The storage amortized to each item is therefore $f/\alpha$ bits.
 With the total filter size fixed, a larger bucket gives a higher load factor, i.e. higher space utilization. However, the more fingerprints per bucket, the more collisions during query, so maintaining the same false positive rate requires a larger fingerprint.
On this theoretical basis, the relevant experimental data are as follows:
 With k = 2 hash functions, bucket size b = 1 (i.e. a plain cuckoo hash table) gives a load factor α of only 50%, but b = 2, 4 or 8 raises it to 84%, 95% and 98% respectively.
 To guarantee a false positive rate r, we need $2b/2^f \leq r$, so the fingerprint size is roughly $f \geq \log_2(2b/r) = \log_2(1/r) + \log_2(2b)$, and the amortized cost per item is $C \leq [\log_2(1/r) + \log_2(2b)]/\alpha$.
 Experimentally, when r > 0.002, two entries per bucket are slightly better than four; when 0.00001 < r ≤ 0.002, four entries per bucket minimize space.
 With semi-sorting buckets, one bit of storage per item can be saved, but this only works on filters with b = 4.
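As a quick worked example of the bound above, the minimum fingerprint size for a target false positive rate can be computed directly. `fingerprintSize` is a hypothetical helper name for illustration, not part of any library:

```go
package main

import (
	"fmt"
	"math"
)

// fingerprintSize returns the smallest f (in bits) satisfying
// 2b/2^f <= r, i.e. f >= log2(2b/r), per the paper's bound.
func fingerprintSize(b int, r float64) int {
	return int(math.Ceil(math.Log2(2 * float64(b) / r)))
}

func main() {
	fmt.Println(fingerprintSize(4, 0.01))  // b = 4, r = 1%   -> 10 bits
	fmt.Println(fingerprintSize(4, 0.002)) // b = 4, r = 0.2% -> 12 bits
}
```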
With this, we can determine how to choose parameters when building our cuckoo filter:
First, use two hash functions, which achieve sufficient space utilization. Then pick the bucket size b from the target false positive rate. The choice of b is not absolute: even with r > 0.002 you may use b = 4 so that semi-sorting buckets can be enabled. Finally, compute f from b to reach the target false positive rate, and all the filter parameters are determined.
Comparing the above with the Bloom filter's $1.44\log_2(1/r)$ bits per item, the cuckoo filter uses less space when r < 0.03 with semi-sorting enabled; without semi-sorting this degrades to r < 0.003.
Some advanced explanations
Optimization of hash algorithm
Although two hash functions are specified, in practice one hash algorithm is enough, because of the alternate-bucket calculation mentioned in the paper: the second bucket index is the XOR of the first index with a hash of the fingerprint stored there. And if computing the fingerprint's hash and the location's hash separately seems wasteful, a single 64-bit hash suffices: the high 32 bits determine the location and the low 32 bits determine the fingerprint.
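A sketch of this single-hash scheme, under illustrative assumptions (FNV as the hash, hypothetical function names, a table of 2^16 buckets, 8-bit fingerprints):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numBuckets = 1 << 16 // a power of two, so the XOR stays in range

// indexAndFingerprint derives both values from one 64-bit hash:
// the high 32 bits choose the primary bucket, the low 32 bits the
// fingerprint (truncated to 8 bits here, and never zero).
func indexAndFingerprint(data []byte) (uint32, byte) {
	h := fnv.New64a()
	h.Write(data)
	v := h.Sum64()
	return uint32(v>>32) % numBuckets, byte(uint32(v)%255) + 1
}

// altIndex XORs the bucket index with a hash of the fingerprint.
// Applying it twice returns the original index, so either candidate
// bucket can recover its partner without rehashing the item.
func altIndex(i uint32, fp byte) uint32 {
	h := fnv.New32a()
	h.Write([]byte{fp})
	return (i ^ h.Sum32()) % numBuckets
}

func main() {
	i1, fp := indexAndFingerprint([]byte("example"))
	i2 := altIndex(i1, fp)
	fmt.Println(altIndex(i2, fp) == i1) // the XOR relation is symmetric: true
}
```

The fingerprint is hashed before XORing so that the two candidate buckets are spread more uniformly than the raw fingerprint bits would allow.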
Why can semi-sorting buckets only be used when b = 4?
The essence of semi-sorting is to take a fixed four bits of each fingerprint, each representable as one hex digit, so the four-bit chunks of a bucket's b fingerprints form b hex digits. If the digits within a bucket are kept in sorted order, all possible sorted combinations can be enumerated in order, and a bucket can be stored as the index of its combination, from which the actual stored values are recovered.
We can count the number of such combinations with the following function:
// getNum recursively counts the non-decreasing b-tuples of f-bit
// values, i.e. the number of distinct sorted bucket contents.
func getNum(base, k, b, f int, cnt *int) {
	for i := base; i < 1<<f; i++ {
		if k+1 < b {
			getNum(i, k+1, b, f, cnt)
		} else {
			*cnt++
		}
	}
}
// getNextPow2 rounds n up to the next power of two.
func getNextPow2(n uint64) uint {
	n--
	n |= n >> 1
	n |= n >> 2
	n |= n >> 4
	n |= n >> 8
	n |= n >> 16
	n |= n >> 32
	n++
	return uint(n)
}
func getNumOfKindAndBit(b, f int) {
	cnt := 0
	getNum(0, 0, b, f, &cnt)
	// The number of index bits needed is the log2 of the next power of two.
	fmt.Printf("Num of kinds: %v, Num of needed bits: %v\n",
		cnt, math.Log2(float64(getNextPow2(uint64(cnt)))))
}
When b = 4, there are 3876 combinations, fewer than 4096, so every combination index can be stored in 12 bits. Storing the four 4-bit chunks directly would need 4×4 = 16 bits. This saves 4 bits per bucket, i.e. one bit per fingerprint.
It turns out that when b = 2, semi-sorting needs the same number of bits as direct storage, so it is pointless; and if b is too large, the index that must be stored grows rapidly and query performance suffers greatly. Therefore b = 4 is the most cost-effective choice.
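The 3876 figure is easy to verify by brute force: count the sorted (order-ignored) ways to fill a bucket. `countSortedBuckets` is an illustrative helper, equivalent in spirit to the recursive counting routine above:

```go
package main

import "fmt"

// countSortedBuckets counts the non-decreasing b-tuples over 2^f
// values, i.e. the distinct contents of a semi-sorted bucket.
func countSortedBuckets(b, f int) int {
	var rec func(min, depth int) int
	rec = func(min, depth int) int {
		if depth == b {
			return 1 // one complete sorted tuple
		}
		total := 0
		for v := min; v < 1<<f; v++ { // keep values non-decreasing
			total += rec(v, depth+1)
		}
		return total
	}
	return rec(0, 0)
}

func main() {
	fmt.Println(countSortedBuckets(4, 4)) // 3876, fits in 12 bits (< 4096)
	fmt.Println(countSortedBuckets(2, 4)) // 136, still needs 8 bits: no saving
}
```

The b = 2 result illustrates the point above: 136 combinations still need 8 bits of index, exactly what the two 4-bit chunks occupy directly.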
In addition, four bits are chosen for the encoded fingerprint chunk because they map exactly to one hex digit, which is convenient for storage.
Parameter selection when using semi-sorting
When using semi-sorting, ensure that $\lceil b(f-1)/8 \rceil < \lceil bf/8 \rceil$; otherwise the space occupied with semi-sorting is the same as without it.
Filter size selection
The total number of buckets in the filter must be a power of two, so when setting the filter size, try to satisfy $size/\alpha \approx (<) 2^n$, where size is the amount of data the filter should store. If necessary, choose a smaller filter and use several of them to reach the target capacity.
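A small sketch of that sizing rule, with hypothetical names; `alpha` is the load factor for the chosen bucket size (0.95 for b = 4):

```go
package main

import (
	"fmt"
	"math"
)

// numBucketsFor rounds the required bucket count up to the next power
// of two: storing n items at load factor alpha with b slots per bucket
// needs about n/(alpha*b) buckets.
func numBucketsFor(n uint64, b uint64, alpha float64) uint64 {
	need := uint64(math.Ceil(float64(n) / (alpha * float64(b))))
	size := uint64(1)
	for size < need {
		size <<= 1
	}
	return size
}

func main() {
	// One million items, b = 4, alpha = 0.95: ~263158 buckets needed,
	// which rounds up to 2^19 = 524288.
	fmt.Println(numBucketsFor(1000000, 4, 0.95))
}
```

Because this round-up can nearly double the bucket count, it is sometimes cheaper to split the data across several smaller filters, as suggested above.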
Golang implementation
This part mainly introduces the Go library.
After reading the Go implementations of the cuckoo filter on GitHub, I found the existing ones have some shortcomings:
 Most libraries fix b and f, which fixes the false positive rate as well; adaptability is poor
 All libraries store f in whole bytes, adjustable only in multiples of 8, which makes tuning the false positive rate inconvenient
 No library implements semi-sorting buckets, which gives up much of the cuckoo filter's advantage over the Bloom filter
Because our scenario needs better space efficiency and a configurable false positive rate, I ported the original paper's C++ implementation and made some optimizations, mainly:

Support for adjustable parameters

Support for semi-sorting buckets

Space compressed into a compact bit array, storing fingerprints bit by bit

Support for binary serialization
This work adopts a CC license. Reprints must credit the author and link to this article.