Don’t know about Bloom filters? I’ll give you a whole article!

Time:2021-8-26

The two scenarios of massive data processing and cache penetration let me know about bloom filter. I consulted some materials to understand it, but many ready-made materials did not meet my needs, so I decided to summarize an article about bloom filter. I hope that through this article, more people can understand bloom filter and actually use it!

Below, we will introduce bloom filter in several aspects:

  1. What is a bloom filter?
  2. Introduction to the principle of Bloom filter.
  3. Bloom filter usage scenario.
  4. The bloom filter is implemented manually through Java programming.
  5. Use the bloom filter in Google’s open source guava.
  6. Bloom filter in redis.

1. What is a bloom filter?

First, we need to understand the concept of Bloom filter.

Bloom filter was proposed by an elder brother called bloom in 1970. We can think of it as a data structure composed of binary vectors (or bit arrays) and a series of random mapping functions (hash functions). Compared with the commonly used data structures such as list, map and set, it occupies less space and is more efficient, but the disadvantage is that the returned results are probabilistic rather than very accurate. In theory, the more elements added to the set, the greater the possibility of false positives. Moreover, the data stored in the bloom filter is not easy to delete.
Don't know about Bloom filters? I'll give you a whole article!

Each element in the bit group occupies only 1 bit, and each element can only be 0 or 1. In this way, applying for a bit group of 100W elements only occupies a space of 1000000bit / 8 = 125000 byte = 125000 / 1024 KB ≈ 122kb.

Summary:A man named bloom proposed a data structure to retrieve whether elements are in a given large set. This data structure is efficient and has good performance, but its disadvantage is that it has a certain error recognition rate and deletion difficulty. Moreover, in theory, the more elements added to the set, the greater the possibility of false positives.

2. Introduction to the principle of Bloom filter

When an element is added to the bloom filter, the following operations will be performed:

  1. Use the hash function in the bloom filter to calculate the element value to get the hash value (several hash functions get several hash values).
  2. According to the obtained hash value, set the value of the corresponding subscript to 1 in the bit group.

When we need to judge whether an element exists in the bloom filter, we will perform the following operations:

  1. Perform the same hash calculation on the given element again;
  2. After obtaining the value, judge whether each element in the digit group is 1. If the value is 1, it indicates that the value is in the bloom filter. If there is a value that is not 1, it indicates that the element is not in the bloom filter.

Take a simple example:

Don't know about Bloom filters? I'll give you a whole article!

As shown in the figure, when the string storage is to be added to the bloom filter, the string first generates different hash values by multiple hash functions, and then the elements in the following table of the corresponding bit group are set to 1 (when the bit array is initialized, all positions are 0). When the same string is stored for the second time, because the previous corresponding position has been set to 1, it is easy to know that this value already exists (de duplication is very convenient).

If we need to judge whether a string is in the bloom filter, we only need to perform the same hash calculation on the given string again. After obtaining the value, we can judge whether each element in the bit group is 1. If the value is 1, it means that the value is in the bloom filter. If there is a value that is not 1, it means that the element is not in the bloom filter.

Different strings may have the same hash position. In this case, we can appropriately increase the size of the bit group or adjust our hash function.

To sum up, we can conclude that:Bloom filter says that if an element exists, it will be misjudged with a small probability. The bloom filter says that if an element is not present, the element must not be present.

3. Usage scenario of Bloom filter

  1. Judge whether a given data exists: for example, judge whether a number exists in a digital set containing a large number of numbers (the digital set is large, more than 500 million!) Prevent cache penetration (judge whether the requested data is effective and avoid directly bypassing the cache request database), email spam filtering, blacklist function, etc.
  2. De duplication: for example, when climbing a given URL, de duplication of the URL that has been crawled.

4. Manually implement bloom filter through Java programming

We have described the principle of Bloom filter above. After knowing the principle of Bloom filter, you can manually implement one by yourself.

If you want to implement one manually, you need to:

  1. Save data in a set of bits of appropriate size
  2. Several different hash functions
  3. Implementation of the method of adding elements to the bit group (Bloom filter)
  4. Implementation of the method to determine whether a given element exists in a bit array (Bloom filter).

Here is a code that I think is fairly good (improved by referring to the existing code on the Internet, which is applicable to all types of objects):

import java.util.BitSet;

public class MyBloomFilter {

    /**
     *Size of digit group
     */
    private static final int DEFAULT_SIZE = 2 << 24;
    /**
     *Six different hash functions can be created from this array
     */
    private static final int[] SEEDS = new int[]{3, 13, 46, 71, 91, 134};

    /**
     *Digit group. Elements in an array can only be 0 or 1
     */
    private BitSet bits = new BitSet(DEFAULT_SIZE);

    /**
     *An array of classes containing hash functions
     */
    private SimpleHash[] func = new SimpleHash[SEEDS.length];

    /**
     *Initialize an array of multiple classes containing hash functions. The hash functions in each class are different
     */
    public MyBloomFilter() {
        //Initialize multiple different hash functions
        for (int i = 0; i < SEEDS.length; i++) {
            func[i] = new SimpleHash(DEFAULT_SIZE, SEEDS[i]);
        }
    }

    /**
     *Add element to digit group
     */
    public void add(Object value) {
        for (SimpleHash f : func) {
            bits.set(f.hash(value), true);
        }
    }

    /**
     *Determines whether the specified element exists in the bit array
     */
    public boolean contains(Object value) {
        boolean ret = true;
        for (SimpleHash f : func) {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }

    /**
     *Static inner class. For hash operation!
     */
    public static class SimpleHash {

        private int cap;
        private int seed;

        public SimpleHash(int cap, int seed) {
            this.cap = cap;
            this.seed = seed;
        }

        /**
         *Calculate hash value
         */
        public int hash(Object value) {
            int h;
            return (value == null) ? 0 : Math.abs(seed * (cap - 1) & ((h = value.hashCode()) ^ (h >>> 16)));
        }

    }
}

Test:

        String value1 = "https://javaguide.cn/";
        String value2 = "https://github.com/Snailclimb";
        MyBloomFilter filter = new MyBloomFilter();
        System.out.println(filter.contains(value1));
        System.out.println(filter.contains(value2));
        filter.add(value1);
        filter.add(value2);
        System.out.println(filter.contains(value1));
        System.out.println(filter.contains(value2));

Output:

false
false
true
true

Test:

        Integer value1 = 13423;
        Integer value2 = 22131;
        MyBloomFilter filter = new MyBloomFilter();
        System.out.println(filter.contains(value1));
        System.out.println(filter.contains(value2));
        filter.add(value1);
        filter.add(value2);
        System.out.println(filter.contains(value1));
        System.out.println(filter.contains(value2));

Output:

false
false
true
true

5. Use the bloom filter in Google’s open source guava

The purpose of our own implementation is to make ourselves understand the principle of Bloom filter. The implementation of Bloom filter in guava is quite authoritative, so we don’t need to manually implement a bloom filter in the actual project.

First, we need to introduce guava’s dependency into the project:

        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>28.0-jre</version>
        </dependency>

The actual use is as follows:

We have created a bloom filter that can store up to 1500 integers, and we can tolerate a false positive probability of 0.01%

//Create a bloom filter object
        BloomFilter<Integer> filter = BloomFilter.create(
                Funnels.integerFunnel(),
                1500,
                0.01);
        //Determine whether the specified element exists
        System.out.println(filter.mightContain(1));
        System.out.println(filter.mightContain(2));
        //Add element to bloom filter
        filter.put(1);
        filter.put(2);
        System.out.println(filter.mightContain(1));
        System.out.println(filter.mightContain(2));

In our example, whenmightContain()Method returntrueWe can 99% determine that the element is in the filter when the filter returnsfalseWhen, we can be 100% sure that the element does not exist in the filter.

The implementation of the bloom filter provided by guava is still very good (you can see its source code implementation for details), but it has a major defect that it can only be used on a single machine (in addition, capacity expansion is not easy). Now the Internet is generally distributed. To solve this problem, we need to use the bloom filter in redis.

6. Bloom filter in redis

[](https://github.com/Snailclimb…

Redis v4.0 has the function of module (module / plug-in). Redis modules enable redis to use external modules to expand its functions. Bloom filter is one of the modules. For details, please refer to the official introduction of redis to redis modules: https://redis.io/modules

In addition, the official website recommends a redisbloom as the module of redisbloom filter. The address is: https://github.com/RedisBloom/RedisBloom. Others include:

  • Redis Lua scaling bloom filter (implemented by Lua script): https://github.com/erikdubbelboer/redis-lua-scaling-bloom-filter
  • Pyrebloom (quick redis bloom filter in Python): https://github.com/seomoz/pyreBloom
  • The latest video tutorial and learning route on Java basics in 2020
  • ……

Redisbloom provides client support in multiple languages, including python, Java, JavaScript and PHP.

6.2 installation using docker

If we need to experience the bloom filter in redis, it’s very simple. Just use docker! We search directly on Googledocker redis bloomfilterThen we found the answer we wanted in the first search result excluding advertising (this is my usual way to solve the problem. Share it). Specific address: https://hub.docker.com/r/redislabs/rebloom/ (the introduction was very detailed).

The specific operations are as follows:

➜  ~ docker run -p 6379:6379 --name redis-redisbloom redislabs/rebloom:latest
➜  ~ docker exec -it redis-redisbloom bash
[email protected]:/data# redis-cli
127.0.0.1:6379> 

6.3 list of common commands

Note: key: the name of the bloom filter, item: the added element.

  1. BF.ADD : adds an element to the bloom filter, which is created if it does not already exist. Format:BF.ADD {key} {item}
  2. BF.MADD : add one or more elements to the bloom filter and create a filter that does not yet exist. How this command operatesBF.ADDIt is the same, except that it allows multiple inputs and returns multiple values. Format:BF.MADD {key} {item} [item ...]
  3. BF.EXISTS : determines whether the element exists in the bloom filter. Format:BF.EXISTS {key} {item}
  4. BF.MEXISTS: determines whether one or more elements have a format in the bloom filter:BF.MEXISTS {key} {item} [item ...]

In addition,BF.RESERVEThe command needs to be introduced separately:

The format of this command is as follows:

BF.RESERVE {key} {error_rate} {capacity} [EXPANSION expansion]

The following is a brief introduction to the specific meaning of each parameter:

  1. Key: name of Bloom filter
  2. error_ Rate: expected probability of false positives. This should be a decimal value between 0 and 1. For example, for the expected false positive rate of 0.1% (1 in 1000), error_ Rate should be set to 0.001. The closer this number is to zero, the greater the memory consumption per project and the higher the CPU utilization per operation.
  3. Capacity: the capacity of the filter. When the number of elements actually stored exceeds this value, the performance will begin to decline. The actual degradation will depend on how far the limit is exceeded. As the number of filter elements increases exponentially, the performance will decrease linearly.

Optional parameters:

  • Expansion: if a new sub filter is created, its size will be the size of the current filter multiplied byexpansion。 The default extension value is 2. This means that each subsequent sub filter will be twice as large as the previous sub filter.

6.4 actual use

127.0.0.1:6379> BF.ADD myFilter java
(integer) 1
127.0.0.1:6379> BF.ADD myFilter javaguide
(integer) 1
127.0.0.1:6379> BF.EXISTS myFilter java
(integer) 1
127.0.0.1:6379> BF.EXISTS myFilter javaguide
(integer) 1
127.0.0.1:6379> BF.EXISTS myFilter github
(integer) 0