This is why’s 86th original article

In an earlier article, I wrote about the Bloom filter:

Alas, the typesetting back then was terrible beyond words.

In fact, writing articles is the same as writing code.

Sometimes I see a piece of eye-scorching code and want to curse: who on earth wrote this?

Then it turns out the author was me.

I couldn't believe it, so I opened the git commit log to check.

Sure enough, I had typed out and committed that code myself a few months earlier.

So I changed it silently.

In cases like this I comfort myself: it's fine, it's actually a good sign. It means I'm making progress.

All right, get down to business.

In that article, I said I didn't know the internal workings of the Bloom filter.

Actually, I was just too lazy to write it up. It's really not complicated, what's there not to understand?

# Bloom filter

The Bloom filter shines in the right scenario: one involving a massive amount of data, such as a flash sale. I haven't used it in production myself, but its place there is well established.

It also comes up constantly in interviews: detecting duplicates in huge data sets, preventing cache penetration, and so on.

First, a real-world case of the Bloom filter in Tencent's short-video products:

`https://toutiao.io/posts/mtrvsx/preview`

So how does bloom filter meet the above requirements?

First, the Bloom filter does not store the raw data, because its only job is to tell you whether an element exists; you never need to ask it which elements it holds.

Of course, if we kept every element in a container, we could check membership directly.

But that means storing every element that has ever appeared, which at scale consumes an enormous amount of space.

How does the bloom filter know whether an element exists without storing elements?

It’s actually very simple: a long array plus several hash algorithms.

The diagram above shows three different hash functions and an array of length 10. Each slot of the array holds a single bit, either 0 or 1, initialized to 0.

Suppose an element [why] now passes through the Bloom filter.

First, [why] is run through the three hash functions, producing three different numbers.

Each hash function guarantees the number falls between 0 and 9, so it never exceeds the array bounds.

We assume that the calculation results are as follows:

- Hash1(why)=1
- Hash2(why)=4
- Hash3(why)=8

This corresponds to the picture:

Now, if another element [why] arrives, the hash functions again produce 1, 4, and 8, and all three positions are found to be 1. This means the element has most likely been seen before.

Note the wording: most likely. In other words, there is a certain false-positive rate.

Let’s store another element [Jay].

- Hash1(jay)=0
- Hash2(jay)=5
- Hash3(jay)=8

At this point, we combine the two elements to have the following picture:

The position where the subscript is 8 is special. Both elements point to it.

This picture looks a little uncomfortable. Let me beautify it:

Well, now this array becomes like this:

Now tell me: looking only at this array, can you tell that why and jay were ever in this filter?

You can't, and neither can the filter itself.

Now, suppose another element [Leslie] comes. After three hash algorithms, the calculation results are as follows:

- Hash1(Leslie)=0
- Hash2(Leslie)=4
- Hash3(Leslie)=5

From the elements above, we know positions 0, 4, and 5 are all 1 at this point.

So the Bloom filter concludes this element has probably appeared before and tells the caller: [Leslie] has been here.

But what about the actual situation?

But we know perfectly well that [Leslie] was never here.

This is a false positive.

It is what I meant earlier: **an element the Bloom filter says exists may not actually exist.**

Conversely, if any of an element's hashed positions holds a 0, that element definitely does not exist.
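The behavior described so far can be sketched in a few lines of Python. This is a toy illustration of my own, not code from the article or any library: the three hash functions are simulated by salting a single MD5, and all names are invented.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=10, k=3):
        self.size = size
        self.bits = [0] * size  # the bit array, all zeros at the start
        self.k = k              # number of hash functions

    def _positions(self, item):
        # Simulate k independent hash functions by salting one hash;
        # the modulo keeps every index inside the array.
        for seed in range(self.k):
            digest = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # All positions 1 -> "most likely seen"; any 0 -> definitely never seen.
        return all(self.bits[p] == 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("why")
bf.add("jay")
print(bf.might_contain("why"))  # True (no false negatives)
```

Note that `might_contain` can also return True for an element that was never added; that is exactly the false positive discussed above.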

However, it has a fatal disadvantage that it does not support deletion.

Why?

To delete [why], you would set positions 1, 4, and 8 back to 0.

But remember, [Jay] also points at position 8.

If deleting [why] zeroes position 8, haven't we effectively deleted [Jay] too?

And why is the lack of deletion fatal?

Think about it: the Bloom filter is built for huge data volumes. As time passes, more and more positions in the array become 1, the false-positive rate climbs, and eventually the filter has to be rebuilt.

Therefore, in the example of Tencent cited at the beginning of the article, there is such a sentence:

Besides deletion, the Bloom filter has another problem: poor query performance.

In real deployments the filter's array is very long, and the indexes produced by the different hash functions can be far apart in memory. Large jumps mean non-contiguous access, and non-contiguous access means a low CPU cache-line hit rate.

Treat this part as interview boilerplate and just commit it to memory.

The goose passing overhead leaves its cry; the traveler crossing snow leaves footprints. That is the Bloom filter: traces of the elements, not the elements themselves.

If you want to play bloom filter, you can visit this website:

`https://www.jasondavies.com/bloomfilter/`

Insert on the left and query on the right:

What if you want the bloom filter to support deletion?

There is a variant called the counting Bloom filter.

It replaces each bit of the array with a counter, expanding one bit of space into a full counter.

At the cost of several times more storage, it adds a delete operation to the Bloom filter.

This is also a solution.
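As a sketch of the idea (again a toy illustration of my own, with invented names, not production code): each bit becomes a counter that is incremented on insert and decremented on delete.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, size=1000, k=3):
        self.size = size
        self.counters = [0] * size  # a counter per slot instead of one bit
        self.k = k

    def _positions(self, item):
        # k pseudo-independent hash functions via a salted MD5.
        for seed in range(self.k):
            digest = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.counters[p] += 1  # increment instead of setting to 1

    def remove(self, item):
        # Only safe for items that really were added earlier.
        for p in self._positions(item):
            self.counters[p] -= 1

    def might_contain(self, item):
        return all(self.counters[p] > 0 for p in self._positions(item))

cbf = CountingBloomFilter()
cbf.add("why")
cbf.add("jay")      # any shared slot is simply counted twice
cbf.remove("why")   # removing why no longer wipes out jay's trace
print(cbf.might_contain("jay"))  # True
```

A shared slot ends up with count 2, so removing one element decrements it to 1 instead of erasing the other element's trace, which is precisely what the plain bit array could not do.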

But there is a better solution, the cuckoo filter.

There is also a mathematical derivation of the Bloom filter's false-positive rate. It's complicated and dry, so I won't cover it here; if you're interested, have a look:

`http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html`

# Cuckoo hash

The cuckoo filter first appeared in a 2014 paper, “Cuckoo Filter: Practically Better Than Bloom”:

`https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf`

Before discussing the cuckoo filter, though, we need a quick primer on cuckoo hashing.

It's the paper's key concept; the term appears 52 times in the text.

Cuckoo hashing itself first appeared in this 2001 paper:

`https://www.cs.tau.ac.il/~shanir/advanced-seminar-data-structures-2009/bib/pagh01cuckoo.pdf`

I mainly look at this place of the paper:

Its working principle is summarized as follows:

It has two hash tables, T1 and T2.

Two hash functions, denoted H1 and H2.

When inserting a new element, first compute its position in T1 using H1. If that slot is empty, put the element there.

If it's occupied, compute the position in T2 using H2. If that slot is empty, put the element there.

If that one is occupied too, kick out whatever element currently sits there and put the new element in its place.

(You can also pick which of the two positions to evict from at random; either way, some element gets kicked out.)

What about the evicted element?

No problem: it has an alternate position of its own and moves there.

The pseudo code in the paper is as follows:

It doesn’t matter if you don’t understand. Let’s draw a schematic diagram:

The picture above says something like this:

I want to insert element X. The two hash functions place it at position 2 of T1 and position 1 of T2.

Both positions are occupied, so X randomly kicks Y out of position 2 of T1.

Y's other position is occupied by element Z.

So Y mercilessly kicks Z out.

Z finds its spare position still empty (even though that slot is also element V's spare) and hurries into place.

So after X is inserted, the picture becomes this:

The above figure actually comes from the paper:

This nesting-doll approach works, but there is always the risk of a kick-out loop that leaves X with nowhere to go.

Take (b) in the figure above, for example.

When that happens, cuckoo hashing has hit its limit: it's time to expand the table or switch hash functions.

Now look back at the pseudocode and the meaning of maxloop becomes clear.

maxloop is a threshold that caps how many rounds of mutual eviction are allowed.

As I understand it, cuckoo hashing is essentially one more strategy for resolving hash collisions.

If you want to get started, you can visit this website:

`http://www.lkozma.net/cuckoo_hashing_visualization/`

If the insertion still isn't done after 16 kicks (the maxloop), it tells you a rehash and an expansion of the array are needed:

That’s what cuckoo hash is all about.
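That insertion loop can be sketched in a few lines of Python. This is my own toy version of the random-eviction variant described above, with stand-in hash functions built on Python's `hash`; nothing here comes from the paper's code.

```python
import random

MAXLOOP = 16  # give up after this many kick-outs and ask for a rehash

def h1(x, n): return hash(("t1", x)) % n
def h2(x, n): return hash(("t2", x)) % n

def cuckoo_insert(t1, t2, x):
    """Try to insert x into tables t1/t2; False means a rehash is needed."""
    n = len(t1)
    for _ in range(MAXLOOP):
        p1, p2 = h1(x, n), h2(x, n)
        if t1[p1] is None:      # primary slot free
            t1[p1] = x
            return True
        if t2[p2] is None:      # secondary slot free
            t2[p2] = x
            return True
        # Both occupied: evict a random victim and re-home it next round.
        if random.random() < 0.5:
            t1[p1], x = x, t1[p1]
        else:
            t2[p2], x = x, t2[p2]
    return False  # kick-out loop: expand the tables or change hash functions

t1, t2 = [None] * 8, [None] * 8
for item in ["X", "Y", "Z", "V"]:
    cuckoo_insert(t1, t2, item)
```

Each round either places the carried element or swaps it with a victim, so the chain of evictions in the diagram above falls out of the loop naturally.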

Next, we look at the cuckoo filter.

# Cuckoo filter

The cuckoo filter paper, “Cuckoo Filter: Practically Better Than Bloom,” has a passage like this on its first page.

I'm a cuckoo filter, and I'm just a bit better than you.

It opens by poking straight at the rival's weakness:

A major limitation of the standard Bloom filter is that it cannot delete existing data. You can use a variant such as the counting Bloom filter, but then the space balloons by 3 to 4 times, and so on and so forth.

And I’m different:

This paper will prove that supporting deletion does not require higher overhead in space or performance than standard Bloom filters.

The cuckoo filter is a practical data structure that provides four advantages:

- 1. It supports dynamically adding and deleting elements.
- 2. It offers higher lookup performance than the traditional Bloom filter, even when nearly full (e.g. at 95% space utilization).
- 3. It is easier to implement than alternatives such as the quotient filter (another filter variant).
- 4. When the required error rate is below 3%, it uses less space than a Bloom filter in many practical applications.

The cuckoo filter's API is the usual trio: insert, query, delete.

Insertion is the most important, so let's look at it:

You can glance at that part of the paper; it's fine if it doesn't click yet, I'll break it down shortly.

In the insertion pseudocode you can see the shadow of cuckoo hashing, because the cuckoo filter is built on top of it.

So where is the biggest change?

Essentially, the change is in the hash functions.

I stared at it wide-eyed and thought: how can there be such a slick trick?

First, recall cuckoo hashing: it stores the original value of the inserted element. An element x goes through both hash functions; if the array length is L, the two positions are:

- p1 = hash1(x) % L
- p2 = hash2(x) % L

What is the calculation position of cuckoo filter?

- h1(x) = hash(x),
- h2(x) = h1(x) ⊕ hash(x’s fingerprint).

Notice that when computing h2 (the second position), we hash the fingerprint of x.

We'll get to what a “fingerprint” is shortly; focus on the position calculation first.

The XOR in the formula guarantees an important property: position h2 can be computed from position h1 and the fingerprint stored at h1.

In other words, as long as we know where an element sits (h1) and the fingerprint stored there, we can derive its alternate position (h2).

**Thanks to the XOR, the two positions are dual to each other.**

As long as hash(x's fingerprint) ≠ 0, we have h2 ≠ h1, which guarantees an element can never fall into the dead loop of kicking itself out.

One more question: why hash the fingerprint before XORing, instead of XORing the fingerprint directly?

The paper argues by counterexample: if an 8-bit fingerprint were XORed in directly, the computed alternate position could be at most 256 slots away from the current one.

Why? The paper explains:

Because XORing an 8-bit fingerprint only flips the low 8 bits of the current position h1(x); the high bits never change.

Even if all 8 low bits flip, the alternate position is, as I just said, at most 256 slots away.

Hashing the fingerprint first lets an evicted element relocate to an entirely different bucket of the table, reducing collisions and improving table utilization.

There's one more issue with this hash function; did you spot it?

It never takes the index modulo the array length, so how does it guarantee the computed index falls inside the array?

This brings us to another constraint of the cuckoo filter.

It requires the array length to be a power of two.

A power of two in binary is a 1 followed by n zeros, so length minus one is a mask of all ones.

The upside of this constraint is that the XOR result, masked this way, is guaranteed to land inside the array.
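Both properties, the h1/h2 duality and the power-of-two masking, are easy to check in code. A sketch of my own follows; `fp_hash` is a stand-in for whatever fingerprint hash a real implementation would use.

```python
import hashlib

TABLE_LEN = 256        # must be a power of two
MASK = TABLE_LEN - 1   # 0b11111111: masking keeps any index in range

def fp_hash(fingerprint: int) -> int:
    # Hash of the fingerprint, standing in for the paper's hash function.
    return int(hashlib.md5(str(fingerprint).encode()).hexdigest(), 16)

def alt_index(index: int, fingerprint: int) -> int:
    # h2 = h1 XOR hash(fingerprint); the mask keeps the result in range.
    return (index ^ fp_hash(fingerprint)) & MASK

h1 = 42
fp = 0xAB
h2 = alt_index(h1, fp)
assert 0 <= h2 < TABLE_LEN      # always lands inside the table
assert alt_index(h2, fp) == h1  # applying it again returns to h1
```

Because XOR is its own inverse, the same function computes h2 from h1 and h1 from h2, which is exactly the duality the filter relies on when evicting fingerprints.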

The downside of this constraint:

- Cuckoo filter: I support deletion.
- Bloom filter: I don't have to be a power of two in length.
- Cuckoo filter: my lookup performance beats yours.
- Bloom filter: I don't have to be a power of two in length.
- Cuckoo filter: my space utilization is high too.
- Bloom filter: I don't have to be a power of two in length.
- Cuckoo filter: damn it, I'm done talking to you!

Next, the “fingerprint.”

Here is the first place “fingerprint” appears in the paper.

A “fingerprint” is simply a hash of the inserted element, truncated to just a few bits.

What the cuckoo filter stores is the element's fingerprint.

To query, it checks whether the matching fingerprint is present at either candidate position:

To delete, it simply erases the fingerprint from its position:

Because elements are reduced to hashes, different elements will inevitably share a fingerprint, so false positives still occur.

The original data is never stored: some accuracy is sacrificed in exchange for fingerprints of only a few bits, which is exactly where the space efficiency comes from.

Speaking of space utilization: what is cuckoo hashing's?

Even in the ideal case, i.e. before any hash conflict occurs, its maximum space utilization is only 50%.

No conflicts means at least half of the slots sit empty.

In addition to storing only “fingerprints”, how can cuckoo filter improve its space utilization?

Look at what the paper says:

Parts (a) and (b) are straightforward: still two hash functions, but instead of two separate tables, the data lives in a single one-dimensional array. The core is still the mutual kicking, so I won't repeat it.

The key is (c): the array is extended from one dimension to two.

Each index now holds a bucket of four slots.

That one small change lifts space utilization from 50% straight to 98%:

Tell me, is that scary or what?

The first point in the screenshot above states a fact:

With two hash functions, if each index can hold only one element, space utilization is 50%.

But if each index can hold 2, 4, or 8 elements, utilization soars to 84%, 95%, and 98% respectively.
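Putting the pieces together (fingerprints, XOR-dual buckets, four slots per index), here is a compact toy cuckoo filter in Python. It's my own sketch under the assumptions above, with invented names and MD5 as a stand-in hash, not the paper's reference code.

```python
import hashlib
import random

BUCKET_SIZE = 4   # variant (c): four slots per index
MAXLOOP = 500     # kick-out budget before declaring the table full

def _h(data: str) -> int:
    return int(hashlib.md5(data.encode()).hexdigest(), 16)

class CuckooFilter:
    def __init__(self, num_buckets=256):   # must be a power of two
        self.mask = num_buckets - 1
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item) -> int:
        return _h(f"fp:{item}") % 255 + 1  # 8-bit fingerprint, never 0

    def _index(self, item) -> int:
        return _h(f"ix:{item}") & self.mask

    def _alt(self, index, fp) -> int:
        # The dual position: i2 = i1 XOR hash(fingerprint).
        return (index ^ _h(str(fp))) & self.mask

    def insert(self, item) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < BUCKET_SIZE:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: start kicking fingerprints around.
        i = random.choice((i1, i2))
        for _ in range(MAXLOOP):
            slot = random.randrange(BUCKET_SIZE)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt(i, fp)  # the victim's other bucket
            if len(self.buckets[i]) < BUCKET_SIZE:
                self.buckets[i].append(fp)
                return True
        return False  # considered full: time to rebuild bigger

    def contains(self, item) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt(i1, fp)]

    def delete(self, item) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False

cf = CuckooFilter()
cf.insert("why")
print(cf.contains("why"))  # True
cf.delete("why")
print(cf.contains("why"))  # False
```

Because an evicted fingerprint always moves to its XOR-dual bucket, every stored fingerprint stays in one of its two candidate buckets, which is what makes both lookup and delete a two-bucket check.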

With that, we've covered how the cuckoo filter optimizes cuckoo hashing and why those optimizations work.

Everything seems so perfect.

Every metric beats the Bloom filter, and above all it supports deletion.

But is it really so good?

When I read this passage in Section 6 of the paper, I fell silent:

Limit on duplicates: for the cuckoo filter to support deletion, it must be possible to tell how many times an item was inserted, and the same item cannot be inserted more than kb times, where k is the number of hash functions and b is the bucket size (how many elements fit under one index).

For example, with two hash functions and four slots per index, the same element can be inserted at most 2 × 4 = 8 times.

For example:

[why] has already been inserted 8 times. Inserting it a ninth time sets off endless kick-outs until maxloop is reached, and false is returned.

How to avoid this problem?

We could maintain a side table recording how many times each element has been inserted.

The logic is simple, but it doesn't scale: think about how much storage that table would need for a large data set.

Painful to even contemplate.

If you want the cuckoo filter's delete, this is the pain you have to live with.

Finally, let’s take a look at the comparison diagram of various types of filters:

As for the mathematical derivation of the error rates, I'll spare you: it hurts the eyes and costs hair.

# Off-key rambling

Do you know why it's called the “cuckoo” filter?

In Chinese, the cuckoo goes by two names: dujuan and bugu.

The Compendium of Materia Medica records that the cuckoo “cannot build a nest, but lives in another's nest to raise its young.” This describes the cuckoo's brood parasitism: birds that do not build their own nests but lay eggs in the nests of other birds, leaving the host to incubate and raise the chicks. It includes interspecific brood parasitism (parasite and host are different species) and intraspecific brood parasitism (parasite and host are the same species). Of the more than 10,000 bird species, over 100 practice brood parasitism, the cuckoo being the most typical.

In other words, it lays its eggs in other birds' nests and gets those birds to hatch its chickens. Oh, no, its chicks.

Once the young cuckoo hatches, it pushes the host's own eggs out of the nest so that the mother bird can focus on feeding it alone.

My God, it’s cruel.

But isn't that “pushing out of the nest” exactly the algorithm described above?

Our algorithm is a bit kinder, though: the eggs pushed out, that is, the evicted elements, get placed somewhere else.

When I was looking this up, I was startled to learn that the bugu bird and the dujuan are one and the same: the cuckoo.

The cuckoo appears in many classical poems. One I especially love is “Jinse” (“The Ornate Zither”) by the Tang poet Li Shangyin:

The ornate zither, for no reason, has fifty strings; each string, each bridge, recalls a youthful year.

Master Zhuang, in his dawn dream, was beguiled by a butterfly; Emperor Wang entrusted his spring heart to the cuckoo.

On the vast sea, under a bright moon, pearls shed tears; at Blue Field, in the warm sun, jade gives off mist.

This feeling might have been left to memory, but even then it was already beyond grasp.

Since ancient times, the debate over whether this poem mourns his late wife or laments the poet himself has never stopped.

But does it matter?

It doesn’t matter to me.

What matters is that at the right moment, in the right atmosphere, when recalling the past, we can fittingly say: “this feeling might have been left to memory, but even then it was already beyond grasp.”

Rather than: hey, thinking back now, I really regret not cherishing so many things.

Oh, by the way.

I also found one thing when I wrote the article.

The Bloom filter was proposed in 1970 by a man named Burton Howard Bloom.

While writing this article, I wanted to see what this great man looks like.

But something magical happened: I searched high and low, inside and outside the wall, and found not a single photo of him.

My search ended at this page:

`https://www.quora.com/Where-can-one-find-a-photo-and-biographical-details-for-Burton-Howard-Bloom-inventor-of-the-Bloom-filter`

The question was asked nine years ago, in 2012:

Indeed, no photo of Burton Howard Bloom can be found on the internet.

What a magical, low-key master.

Perhaps he is a beauty whose face could topple cities and kingdoms.

# Last words (a plea for follows)

If you spot anything wrong, point it out in the comments and I'll fix it.

Thank you for reading. I insist on writing original content. Welcome, and thank you for following.

**I am why, a programmer who mainly writes code, often writes articles, and occasionally makes videos.**

And once more: welcome to follow me.