Principle and application scenario of Bloom filter

Time:2021-12-12

1. What is a Bloom filter

The Bloom filter was proposed by Burton H. Bloom in 1970. It consists of a long bit vector and a series of hash (random mapping) functions, and it is used to test whether an element is in a set. Its advantage is that it is far more space- and time-efficient than storing the elements themselves. Its disadvantages are a certain false positive rate and the difficulty of deleting elements.

  • It does not store the data itself, so the original data cannot be retrieved from a Bloom filter.
  • A positive answer may be wrong: an element that was never added may still be reported as present (a false positive).
  • A negative answer is always right: if the filter reports that an element is absent, it is definitely absent.
  • Elements can only be added, not removed.

2. Principle of the Bloom filter

First, the concept of a hash function: a function that converts data of any size into data of a fixed size. The converted data is called a hash value or hash code.
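
As a small illustration of this definition, the sketch below maps strings of any length to an index in a fixed range. The class name `HashDemo`, the method `indexFor` and the use of `String.hashCode()` are assumptions chosen for brevity; a real Bloom filter would normally use stronger hash functions.

```java
class HashDemo {

    // Maps input of any length to an index in the fixed range [0, size).
    static int indexFor(String data, int size) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(data.hashCode(), size);
    }

    public static void main(String[] args) {
        int size = 16;
        System.out.println(indexFor("a", size));
        System.out.println(indexFor("a much longer piece of input data", size));
    }
}
```

The same input always yields the same index, while different inputs may share one. That sharing is what later makes false positives possible.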

The underlying structure of a Bloom filter is a bit array. When an element a is added, several values are computed from it by several hash functions. Suppose three hash functions produce three hash values x, y and z. Each of x, y and z is mapped to a position in the bit array, and the bit at that position is set to 1. The element has then been added.

In the following figure, blue, red and purple mark three elements that have been added to the bit array, with their corresponding bits set to 1. The element w has not been added, but the bits at all of its hash positions are already 1, so the Bloom filter reports that w exists. This is a false positive.

[Figure: bottom layer of a Bloom filter]
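
The add and lookup steps described in this section can be sketched with `java.util.BitSet`. This is a minimal sketch, not a production implementation: the class name `SimpleBloomFilter`, the FNV-1a-style helper and the double hashing scheme `h1 + i * h2` used to derive the hash positions are all assumptions made for illustration.

```java
import java.util.BitSet;

class SimpleBloomFilter {

    private final BitSet bits;
    private final int size;      // number of bits in the array
    private final int hashCount; // number of hash positions per element

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // FNV-1a-style hash over the characters; serves as the second base hash.
    private static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }

    // Position of the i-th hash, derived from two base hashes (double hashing).
    private int position(String item, int i) {
        return Math.floorMod(item.hashCode() + i * fnv1a(item), size);
    }

    // Sets the bit at every hash position of the element.
    void put(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(item, i));
        }
    }

    // false: definitely never added. true: probably added (may be a false positive).
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
        filter.put("blue");
        filter.put("red");
        filter.put("purple");
        // "red" was added, so all of its bits are set.
        System.out.println(filter.mightContain("red"));
        // "w" was never added; the answer is false unless all of its bits collide.
        System.out.println(filter.mightContain("w"));
    }
}
```

With a deliberately tiny array, for example `new SimpleBloomFilter(1, 3)`, every position collapses to bit 0, so after a single `put` any element w is reported as present. This is the false positive case in its most extreme form.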

3. Application scenarios

A Bloom filter suits scenarios with a large amount of data where a certain error rate is acceptable. For example:

  • Duplicate URL detection in crawlers
    Before a URL is crawled, the Bloom filter is checked to see whether the URL has been seen. If not, the URL is added to the filter and crawled; if so, it is skipped. A false positive here means that a URL has not been crawled but the Bloom filter says it has, so a few URLs are missed during the crawl, which is generally acceptable.

  • Cache penetration
    The keys of all existing data are added to a Bloom filter. When a request arrives, it is let through if the requested data exists. If the data does not exist, the request is blocked in most cases and let through only with a small probability (a false positive). Most malicious requests for nonexistent data are therefore blocked.
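
Both scenarios reduce to a membership check before an expensive step. The sketch below is a simplified illustration under stated assumptions: `TinyBloom`, `markIfNew`, `lookup` and the in-memory `Map` standing in for a database are hypothetical names, and the hashing is deliberately simple.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal filter with three hash positions per element.
class TinyBloom {
    private final BitSet bits;
    private final int size;

    TinyBloom(int size) {
        this.bits = new BitSet(size);
        this.size = size;
    }

    private int pos(String s, int i) {
        int h1 = s.hashCode();
        int h2 = Integer.reverse(h1) | 1; // illustrative second hash, forced odd
        return Math.floorMod(h1 + i * h2, size);
    }

    void put(String s) {
        for (int i = 0; i < 3; i++) {
            bits.set(pos(s, i));
        }
    }

    boolean mightContain(String s) {
        for (int i = 0; i < 3; i++) {
            if (!bits.get(pos(s, i))) {
                return false;
            }
        }
        return true;
    }
}

class ScenarioDemo {

    // Crawler: returns true if the URL should be crawled, and records it as seen.
    static boolean markIfNew(TinyBloom seen, String url) {
        if (seen.mightContain(url)) {
            return false; // seen before, or a false positive: the URL is skipped
        }
        seen.put(url);
        return true;
    }

    // Cache penetration guard: only keys the filter may contain reach the database.
    static String lookup(TinyBloom knownKeys, Map<String, String> database, String key) {
        if (!knownKeys.mightContain(key)) {
            return null; // definitely absent: the database is not queried
        }
        return database.get(key); // present, or a rare false positive resolved by the database
    }

    public static void main(String[] args) {
        TinyBloom seen = new TinyBloom(1 << 16);
        System.out.println(markIfNew(seen, "https://example.com/a")); // first visit
        System.out.println(markIfNew(seen, "https://example.com/a")); // repeat visit

        Map<String, String> database = new HashMap<>();
        database.put("user:1", "Alice");
        TinyBloom knownKeys = new TinyBloom(1 << 16);
        knownKeys.put("user:1");
        System.out.println(lookup(knownKeys, database, "user:1"));
    }
}
```

In the crawler case a false positive costs one skipped URL. In the cache case it costs one unnecessary database query, which still returns the correct "not found" answer.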

4. Java implementation of Bloom filter

Guava provides an implementation, BloomFilter, in the com.google.common.hash package.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

public class BloomFilterTest {

    public static void main(String[] args) {
        // Expect about 200,000 insertions with a target false positive probability of 1E-7.
        BloomFilter<CharSequence> bloomFilter =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 200000, 1E-7);

        bloomFilter.put("test");

        // true means "possibly present"; false means "definitely absent".
        boolean contain = bloomFilter.mightContain("test");

        if (contain) {
            System.out.println("contain test");
        }
    }

}
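
The two numeric arguments to `BloomFilter.create` are the expected number of insertions n and the desired false positive probability p. The standard Bloom filter sizing formulas, m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, give a feel for what such settings cost. The sketch below evaluates them for the values used above; `BloomSizing` and its methods are hypothetical names, and the code is not taken from Guava.

```java
class BloomSizing {

    // Optimal number of bits m for n expected insertions and false positive probability p.
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions k for m bits and n expected insertions.
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 200000;
        double p = 1E-7;
        long m = optimalBits(n, p);
        int k = optimalHashes(n, m);
        System.out.println("bits = " + m + ", hash functions = " + k);
    }
}
```

For n = 200000 and p = 1E-7 this comes to roughly 6.7 million bits (under 1 MB) and 23 hash functions.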
