Redis bloom filter

Time:2022-3-18

I. Introduction to bloom filter

A Bloom filter is a set-like data structure proposed by Burton Bloom in 1970. It consists of a long binary vector and a series of hash (random mapping) functions. It can be used to test whether an element is in a set, but the answer is probabilistic: as more data is added, false positives become more likely. A Bloom filter never misses an element that was actually added, so the only possible error is that an element not in the filter is reported as present. This property is very useful in some scenarios.

Bloom filter usage scenarios

The main use of a Bloom filter is de-duplication over large amounts of data, so it is often used in the following scenarios:

  • Recommendation systems, such as product and news recommendation, to skip news or products that have already been recommended;
  • Web crawlers, to de-duplicate URLs and avoid crawling the same address twice;
  • Spam filtering over large volumes of email, to decide whether a mailbox address is a known spam sender.

II. Introduction to the principle of Bloom filter

A Bloom filter is built on a long binary vector, which can be regarded as a bit array; each position stores 0 or 1 and is 0 by default.

Add data

When adding an element key to the Bloom filter, several hash functions are applied to the key and the bit at each resulting position is set to 1. Different keys usually set different positions, which is how keys are told apart. However, when the amount of data is large, different keys can end up mapping to the same positions under all of the hash functions, so errors (false positives) will occur.

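To check an element, the Bloom filter's contains method hashes it in the same way and tests whether every corresponding position is 1; if any position is 0, the element was never added. This check is what gives the de-duplication effect.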
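The add and contains steps described above can be sketched in plain Java. This is only an illustrative sketch and not the implementation used by Guava or Redisson: the class name SimpleBloomFilter, the FNV-1a based hashing and the double-hashing scheme for deriving positions are all assumptions chosen to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal Bloom filter sketch: a bit array plus k positions derived per key. */
final class SimpleBloomFilter {
    private final BitSet bits;     // the binary vector, all 0 by default
    private final int size;        // number of bits
    private final int hashCount;   // number of hash positions per key

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Sets every position derived from the key to 1. */
    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    /** Returns false if the key was definitely never added, true if it may have been. */
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing (an assumption of this sketch): position_i = (h1 + i * h2) mod size.
    private int position(String key, int i) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193);
        return Math.floorMod(h1 + i * h2, size);
    }

    // FNV-1a style hash with a configurable starting value.
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return hash;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("h1");
        filter.add("h2");
        System.out.println("h1 may exist: " + filter.mightContain("h1"));
        System.out.println("h4 may exist: " + filter.mightContain("h4"));
    }
}
```

A key that was added always reports true, because add has set every bit that mightContain checks. A key that was never added normally reports false, but can report true once enough bits have been set by other keys.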

III. Implementation

3.1 Implementation with the Guava library

Add the Guava dependency

<dependency>  
 <groupId>com.google.guava</groupId>  
 <artifactId>guava</artifactId>  
 <version>29.0-jre</version>  
 </dependency>

Implementation:

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import org.junit.Test;

@Test
public void contextLoads() {
    // Expected number of elements: 10,000
    int total = 10000;
    // Create the filter (default false positive probability)
    BloomFilter<CharSequence> bf =
            BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), total);
    // Put 10,000 keys into the filter
    for (int i = 0; i < total; i++) {
        bf.put("knowledge seeker" + i);
    }
    // Count how many keys the filter reports as possibly present
    int count = 0;
    // Query 20,000 keys; only the first 10,000 were actually added
    int addCount = 2 * total;
    for (int i = 0; i < addCount; i++) {
        if (bf.mightContain("knowledge seeker" + i)) {
            count++;
        }
    }
    System.out.println("Matching quantity " + count);
}

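The matching quantity printed to the console is slightly greater than 10,000: all 10,000 inserted keys are recognized as duplicates, and a small number of keys that were never inserted are misjudged as existing.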

Matching quantity 10294

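Lowering the false positive probability (fpp) gives more accurate filtering

294/20,000 = 0.0147 over all queries (measured against only the 10,000 keys that were never inserted, the rate is 294/10,000 ≈ 0.0294, in line with the default)

fpp defaults to 0.03; here it is set to 0.0147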

BloomFilter<CharSequence> bf =  
 BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), total,0.0147d);

The matching results are as follows

Matching quantity 10143
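Lowering fpp is not free: the filter needs more bits and more hash functions. The sketch below uses the standard textbook sizing formulas, m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions, to estimate the cost for the two fpp values used above. The class name BloomSizing and its method names are hypothetical, and the numbers are estimates from the formulas rather than values read out of Guava.

```java
/** Textbook Bloom filter sizing formulas (a sketch, not a library's internals). */
final class BloomSizing {

    /** Number of bits: m = -n * ln(p) / (ln 2)^2, rounded up. */
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Number of hash functions: k = (m / n) * ln 2, rounded, at least 1. */
    static int optimalHashCount(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 10_000; // expected number of elements, as in the test above
        for (double p : new double[] {0.03, 0.0147}) {
            long m = optimalBits(n, p);
            int k = optimalHashCount(n, m);
            System.out.printf("fpp=%.4f -> bits=%d (%.1f KiB), hashes=%d%n",
                    p, m, m / 8.0 / 1024, k);
        }
    }
}
```

For 10,000 elements this works out to roughly 73,000 bits with 5 hash functions at fpp 0.03, and roughly 88,000 bits with 6 hash functions at fpp 0.0147, so halving the error rate costs about 20% more memory.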

3.2 Redis implementation

Bloom filter commands require Redis 4.0 or later, where they are provided by a module (RedisBloom)

With Redis, elements are added with the BF.ADD command and checked with BF.EXISTS. With a small amount of data, no false positives are visible:

127.0.0.1:6379> bf.add codeding h1  
(integer) 1  
127.0.0.1:6379> bf.add codeding h2  
(integer) 1  
127.0.0.1:6379> bf.add codeding h3  
(integer) 1  
127.0.0.1:6379> bf.exists codeding h1  
(integer) 1  
127.0.0.1:6379> bf.exists codeding h2  
(integer) 1  
127.0.0.1:6379> bf.exists codeding h3  
(integer) 1  
127.0.0.1:6379> bf.exists codeding h4  
(integer) 0

Implementation using Redisson, version 3.x or later

<dependency>  
 <groupId>org.redisson</groupId>  
 <artifactId>redisson</artifactId>  
 <version>3.11.4</version>  
</dependency>

For a Spring Boot project:

 <dependency>  
 <groupId>org.redisson</groupId>  
 <artifactId>redisson-spring-boot-starter</artifactId>  
 <version>3.10.1</version>  
 </dependency>

Implementation code

import org.junit.Test;
import org.redisson.Redisson;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

@Test
public void test2() {
    Config config = new Config();
    config.useSingleServer().setAddress("redis://ip:port");
    config.useSingleServer().setPassword("password");
    // Create the Redisson client
    RedissonClient redisson = Redisson.create(config);
    // Get the Bloom filter
    RBloomFilter<String> bloomFilter = redisson.getBloomFilter("userList");
    // Initialize the Bloom filter: 100 expected elements, 3% error rate
    long total = 100L;
    bloomFilter.tryInit(total, 0.03);
    // Add 100 keys to the filter
    for (int i = 0; i < total; i++) {
        bloomFilter.add("zszxz" + i);
    }
    int count = 0;
    // Query 200 keys; only the first 100 were actually added
    long addCount = 2 * total;
    for (int i = 0; i < addCount; i++) {
        if (bloomFilter.contains("zszxz" + i)) {
            count++;
        }
    }
    // Matching quantity 108
    System.out.println("Matching quantity " + count);
}

The number of matches is 108, 8 more than the 100 keys that were added; these 8 are false positives. More accurate filtering can again be achieved by lowering the error rate.
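The match counts in these tests mix true matches with false positives, so the error rate is easier to read when it is computed over the never-inserted keys only. The following sketch does that arithmetic for the three results reported in this article. The class and method names are hypothetical, and the calculation assumes every inserted key matched, which a Bloom filter guarantees.

```java
import java.util.function.Predicate;

/** Helpers for reading false positive rates out of a de-duplication test (a sketch). */
final class FalsePositiveCheck {

    /**
     * Rate from raw counts: (matched - inserted) / (queried - inserted).
     * Assumes all inserted keys matched, since a Bloom filter has no false negatives.
     */
    static double rateFromCounts(int matched, int inserted, int queried) {
        return (double) (matched - inserted) / (queried - inserted);
    }

    /**
     * Measures the rate directly by querying only keys prefix+inserted .. prefix+(queried-1),
     * none of which were added to the filter behind the given membership test.
     */
    static double measure(Predicate<String> mightContain, String prefix, int inserted, int queried) {
        int falsePositives = 0;
        for (int i = inserted; i < queried; i++) {
            if (mightContain.test(prefix + i)) {
                falsePositives++;
            }
        }
        return (double) falsePositives / (queried - inserted);
    }

    public static void main(String[] args) {
        // Figures reported in the article
        System.out.printf("Guava, default fpp: %.4f%n", rateFromCounts(10294, 10000, 20000));
        System.out.printf("Guava, fpp 0.0147:  %.4f%n", rateFromCounts(10143, 10000, 20000));
        System.out.printf("Redisson, fpp 0.03: %.4f%n", rateFromCounts(108, 100, 200));
    }
}
```

Read this way, the two Guava runs come out at 0.0294 and 0.0143, close to the configured values, while the Redisson run comes out at 0.08. With only 100 non-member queries that last figure is a very small sample, so it should not be read as a precise estimate.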
