[Practical problem] - Bloom filters for cache penetration (1)

Time: 2021-4-19

As mentioned earlier, one way to guard against cache penetration is a Bloom filter, which filters out elements that are definitely not in the set. (Cache penetration refers to requests for data that exists in neither the cache nor the database. For example, an order number can never be -1, but a user sends a large number of requests for -1. Because the data does not exist, it is never cached, and every one of those requests goes straight through to the database.)

What is a bloom filter?

The Bloom filter was proposed by Burton Howard Bloom in 1970. It consists of a very long binary vector (a bit array) and a series of random hash mapping functions; put simply, it stores information about the data in an array of bits. It is used to determine whether an element is in a set. Its advantages are fast queries and a small memory footprint. Its disadvantages are a certain false positive rate, and the fact that removing one element can affect others.

In other words, when an element is added to the set, k hash functions map it to k positions in the bit array, and those bits are set to 1.

Why a bloom filter?

In general, when we want to check whether an element exists, the first thing that comes to mind is an array. But an array is slow for this: to determine that an element is not in the array, we have to traverse all the elements every time, and after deleting an element we have to shift the remaining elements forward.

A better option is a hash table. With a hash table, each element is hashed by a single hash function to one slot and stored there.

Although this structure meets most requirements, it has two defects:

  • There is only one hash function, so the chance that two elements hash to the same slot (a hash collision) is relatively high. Collisions can be resolved by separate chaining (hanging a linked list off each slot), but that can increase the time complexity of operations.
  • The hash table has to store the elements themselves (or references to them). With hundreds of millions of records, keeping them all in a hash table is not recommended.

To address these defects, we can use multiple hash functions to reduce collisions (note: collisions cannot be avoided, only reduced) and store only one bit per hash position instead of the element itself. This reduces both collisions and storage space.

Suppose there are three hash functions. Each element is then hashed by all three functions to three positions, and those three bits are set to 1.

Suppose Zhang San has been added and another Zhang San arrives. It hashes to the same three positions, all of whose bits are already 1, so we can say that Zhang San already exists.

Can the filter misjudge? Yes. For example, suppose only Zhang San, Li Si, Wang Wu, and Cai Ba have been added, and their hashes have set a number of bits to 1.

Then Chen Liu arrives. Unfortunately, the three positions his hash functions map to have all already been set to 1 by other elements. The filter concludes that Chen Liu already exists, although he was never added.

This situation is a false positive, and a Bloom filter will inevitably produce some. But it has one useful guarantee: an element that the Bloom filter judges to exist may not actually exist, but an element that it judges not to exist definitely does not exist. A "does not exist" verdict means that at least one of the element's bits is still 0, which could not be the case if the element had been added.
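
The Chen Liu case can be reproduced with a toy bit array. This is only a sketch: the bit positions are written by hand for illustration (a real filter would compute them with hash functions), and the name Zhao Qi is a hypothetical extra element added to show a definite "not present" answer.

```python
# Hypothetical bit positions for each name; a real filter would derive
# them from k hash functions instead of a hand-written table.
POSITIONS = {
    "Zhang San": (1, 4, 7),
    "Li Si": (2, 4, 9),
    "Wang Wu": (0, 7, 11),
    "Cai Ba": (3, 9, 11),
    "Chen Liu": (2, 7, 11),  # every position is also used by someone else
    "Zhao Qi": (2, 5, 11),   # position 5 is used by nobody else
}

M = 12  # length of the toy bit array


def build(names):
    """Return a bit array with the bits of every given name set to 1."""
    bits = [0] * M
    for name in names:
        for pos in POSITIONS[name]:
            bits[pos] = 1
    return bits


def might_contain(bits, name):
    """True if all of the name's bits are 1 (possibly a false positive)."""
    return all(bits[pos] for pos in POSITIONS[name])


bits = build(["Zhang San", "Li Si", "Wang Wu", "Cai Ba"])
print(might_contain(bits, "Zhang San"))  # True: it was inserted
print(might_contain(bits, "Chen Liu"))   # True: never inserted, a false positive
print(might_contain(bits, "Zhao Qi"))    # False: bit 5 is 0, definitely absent
```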

The same sharing of bits is why deletion is a problem. Suppose an element is removed from the set and we set its mapped bits to 0 to "delete" it. Other elements may map to some of the same bits, so their bits are cleared too and they would wrongly be reported as absent. That is why elements cannot be removed from a Bloom filter.
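
A small sketch of why clearing bits is unsafe. The positions are again hand-picked assumptions, chosen so that the two names share bit 4:

```python
# Hypothetical positions: both names share bit 4.
POSITIONS = {
    "Zhang San": (1, 4, 7),
    "Li Si": (2, 4, 9),
}

bits = [0] * 12
for name in POSITIONS:
    for pos in POSITIONS[name]:
        bits[pos] = 1


def might_contain(name):
    """True if every bit the name maps to is still 1."""
    return all(bits[pos] for pos in POSITIONS[name])


before = might_contain("Li Si")

# Naive "delete": clear every bit that Zhang San maps to.
for pos in POSITIONS["Zhang San"]:
    bits[pos] = 0

after = might_contain("Li Si")
print(before, after)  # True False: Li Si was never removed, but bit 4 is now 0
```

After the naive delete, the filter gives a false negative for Li Si, which breaks the one guarantee the structure is supposed to provide.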

Specific steps

Add element:

    1. Hash the element with each of the hash functions to get multiple hash values.
    2. Take each hash value modulo the length of the bit array to get an index into the bit array.
    3. Set the bit at each index to 1 (if it is not already 1).

To determine whether an element exists:

    1. Hash the element with each of the hash functions to get multiple hash values.
    2. Take each hash value modulo the length of the bit array to get an index into the bit array.
    3. If the bits at all of these indexes are 1, the element may exist. If any of them is 0, the element definitely does not exist.
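
The add and query steps above can be sketched as follows. This is a minimal illustration rather than a production implementation: the class name `BloomFilter`, the method names, and the choice of deriving k hash values from one SHA-256 digest (double hashing) are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits packed into bytes

    def _indexes(self, item: str):
        # Assumption: derive k hash values from one SHA-256 digest using
        # double hashing, h_i = h1 + i * h2. Any k independent hashes would do.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, so never zero
        for i in range(self.k):
            yield (h1 + i * h2) % self.m  # step 2: hash value modulo m

    def add(self, item: str) -> None:
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)  # step 3: set the bit to 1

    def might_contain(self, item: str) -> bool:
        # True only if every one of the k bits is 1.
        return all(
            self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(item)
        )


bf = BloomFilter(m=1024, k=3)
for name in ["Zhang San", "Li Si", "Wang Wu"]:
    bf.add(name)

print(bf.might_contain("Zhang San"))  # True: an added element is always found
print(bf.might_contain("Chen Liu"))   # usually False; True would be a false positive
```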

Derivation of the false positive rate

Fortunately, the error rate of a Bloom filter can be predicted. From the analysis above, we also know that it is closely related to the size of the bit array and the number of hash functions.

Suppose the bit array has m bits and we have k hash functions in total. Each hash function sets exactly one of the m bits, so the probability that a particular bit is not set by one hash function is:
$$1-\frac{1}{m}$$

After all k hash functions have been applied to one element, the probability that the bit is still not set to 1 is:
$$(1-\frac{1}{m})^k$$

If we insert n elements, that is, hash n * k times, the probability that the bit is still not set to 1 is:
$$(1-\frac{1}{m})^{kn}$$

The probability that the bit is 1 is:
$$1-(1-\frac{1}{m})^{kn}$$

To test whether an element is in the set, all k bits that its hash functions map to must be 1. So the probability that an element does not exist, yet all k of its bits have already been set to 1 (a false positive), is:
$${(1-(1-\frac{1}{m})^{kn})}^{k}\approx {(1-e^{-kn/m})}^k $$
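
To get a feel for the formula, the sketch below evaluates both the exact expression and the exponential approximation. The values of m and n are illustrative assumptions, not figures from this article.

```python
import math


def false_positive_exact(m: int, n: int, k: int) -> float:
    """(1 - (1 - 1/m)^(k*n))^k"""
    return (1 - (1 - 1 / m) ** (k * n)) ** k


def false_positive_approx(m: int, n: int, k: int) -> float:
    """(1 - e^(-k*n/m))^k"""
    return (1 - math.exp(-k * n / m)) ** k


m, n = 10_000, 1_000  # illustrative values: 10 bits per element
for k in (1, 3, 7, 15):
    print(k, false_positive_exact(m, n, k), false_positive_approx(m, n, k))
```

With these values the two expressions agree closely, and the rate drops from k = 1 to k = 7 but rises again at k = 15: adding hash functions helps only up to a point, because each extra function also fills the bit array faster.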

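Roughly speaking, the probability decreases as the bit array size m grows and increases as the number of inserted elements n grows. The number of hash functions k cuts both ways: more hash functions make it harder for all k bits to match by chance, but they also fill the bit array faster, so for a given m and n the rate is lowest around k = (m/n) ln 2.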

Finally, given the expected error rate P and the expected number of elements n, we can roughly calculate the required length m of the Bloom filter's bit array:
$$m=-\frac{n\ln P}{(\ln 2)^2}$$
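
As a worked example of the sizing formula, here is a small sketch. The function name `bloom_size` and the sample inputs (one million elements, 1% error rate) are assumptions for illustration. The hash count uses the standard optimum k = (m/n) ln 2, which is not derived in this article.

```python
import math


def bloom_size(n: int, p: float) -> tuple:
    """Return (m, k): bit-array length and hash count for n items at error rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))  # standard optimum k = (m/n) ln 2
    return m, k


m, k = bloom_size(n=1_000_000, p=0.01)
print(m, k, f"{m / 8 / 1024 / 1024:.2f} MiB")
```

For these inputs the result is roughly 9.6 million bits (a little over 1 MiB) with 7 hash functions, far less than storing a million keys in a hash table.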

That is the general way to estimate the error rate. It also tells us how to size the bit array in practice: choose it from the data volume of the business and the error rate we can tolerate.

Uses of Bloom filters

Besides filtering malicious requests (for example, from crawlers) as described above, a Bloom filter can also deduplicate URLs, filter duplicate records in massive data sets, filter out IDs that do not exist in the database, and so on.
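
The sketch below shows where such a filter could sit in a read path for the cache penetration case. Everything in it is a stand-in chosen for illustration: `TinyBloom`, `get_order`, the in-memory `database` and `cache` dictionaries, and the salted SHA-256 hashing are assumptions, not the API of any real cache or database client.

```python
import hashlib


class TinyBloom:
    """Compact Bloom filter sketch backing the lookup below."""

    def __init__(self, m: int = 1 << 16, k: int = 4) -> None:
        self.m, self.k, self.bits = m, k, 0  # a Python int used as the bit array

    def _indexes(self, key: str):
        for i in range(self.k):
            # Assumption: salt SHA-256 with i to simulate k hash functions.
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for idx in self._indexes(key):
            self.bits |= 1 << idx

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> idx) & 1 for idx in self._indexes(key))


# Hypothetical stand-ins: in production these would be the database and a cache.
database = {"1001": "order A", "1002": "order B"}
cache = {}
db_hits = 0

bloom = TinyBloom()
for order_id in database:  # preload every valid ID into the filter
    bloom.add(order_id)


def get_order(order_id: str):
    global db_hits
    if not bloom.might_contain(order_id):
        return None  # definitely absent: reject before touching cache or DB
    if order_id in cache:
        return cache[order_id]
    db_hits += 1  # only reached for IDs the filter could not rule out
    value = database.get(order_id)
    if value is not None:
        cache[order_id] = value
    return value


for _ in range(1000):
    get_order("-1")  # the invalid order number from the cache penetration example
print(get_order("1001"), db_hits)
```

Requests for IDs that the filter rules out never reach the cache or the database. A false positive would still fall through to the database, which then returns nothing.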

However, even with a Bloom filter we cannot completely avoid or solve cache penetration, because false positives still get through to the database. It is an optimization that filters out most invalid requests, not a complete fix.

Many key-value databases also use Bloom filters to speed up queries, because checking entries one by one is too slow.

[notes]
GitHub repository: https://github.com/Damaer/cod…
Notes: https://damaer.github.io/code…

[about the author]
Qin Huai, author of the official account 【Qinhuai grocery store】. The road of technology is not about a single moment; progress may be slow, but it never stops. Personal writing directions: Java source code analysis, JDBC, MyBatis, Spring, Redis, distributed systems, Jianzhi Offer, LeetCode, and so on. I write every article carefully, dislike clickbait titles and flashiness, and mostly write series. I cannot guarantee that everything I write is completely correct, but I do guarantee that it comes from practice or from looking things up. Corrections of omissions or mistakes are welcome.

My time on weekdays is limited, so I can only study and write in the evenings and on weekends. Follow me and let's grow together~