Practical tips: use a Bloom filter to build an efficient cache!

Date: 2022-1-12

Preface

This article describes how to use a Bloom filter to build an efficient cache. Here an in-memory array backs the filter; if you need to serve high-concurrency lookups, replace the array with a Redis database.

Creating the Bloom filter

The Bloom cache is created as follows:

1. First, define the cache bit array. Its length determines the maximum amount of data the cache can represent.

2. Hash the string to get its hash code.

3. Use the hash code as the seed of a pseudo-random number generator (Random) to generate a non-negative integer x smaller than the array length.

4. Use x as an index into the cache array and set array[x] to true.

Querying the Bloom filter

To look up a string, run the first three steps above to compute its array indexes, then read the values at those indexes from the Bloom cache. If they are all true, the entry is (probably) cached, and you can use the string to fetch the data from the real data cache. If any of them is false, the entry is definitely not cached, so fetch the data from the database instead.
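As a rough illustration of this read path, the sketch below assumes a bloom field holding the BloomFilter implemented in the next section; GetFromCache and GetFromDatabase are hypothetical placeholders for the real data cache and the database:

public string GetData(string key)
{
    if (!bloom.IsExist(key))
    {
        // The filter says the key was never added, so skip the cache and go to the database.
        return GetFromDatabase(key);
    }
    // The filter can report a false positive, so the real cache may still miss.
    return GetFromCache(key) ?? GetFromDatabase(key);
}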

Code implementation

First, create a WinForms project named bloomtest.

Then implement the Bloom filter with the following code:

using System;
using System.Collections;

public class BloomFilter
{
    // Bloom cache bit array
    public BitArray BloomCache;
    // Length of the Bloom cache array
    public Int64 BloomCacheLength { get; }
    // Number of hash operations per entry
    public Int64 HashCount { get; }

    /// <summary>
    /// Creates a Bloom filter.
    /// </summary>
    /// <param name="bloomCacheLength">Length of the Bloom cache array, 20000 by default</param>
    /// <param name="hashCount">Number of hash operations, 3 by default</param>
    public BloomFilter(int bloomCacheLength = 20000, int hashCount = 3)
    {
        BloomCache = new BitArray(bloomCacheLength);
        BloomCacheLength = bloomCacheLength;
        HashCount = hashCount;
    }

    public void Add(string str)
    {
        // The string's hash code seeds the random generator, so the same string
        // always yields the same sequence of indexes.
        var hashCode = str.GetHashCode();
        Random random = new Random(hashCode);
        for (int i = 0; i < HashCount; i++)
        {
            // Next(max) is exclusive of max, so this covers indexes 0..BloomCacheLength-1.
            var x = random.Next((int)BloomCacheLength);
            BloomCache[x] = true;
        }
    }

    public bool IsExist(string str)
    {
        var hashCode = str.GetHashCode();
        Random random = new Random(hashCode);
        for (int i = 0; i < HashCount; i++)
        {
            // If any of the indexes is still false, the string was never added.
            if (!BloomCache[random.Next((int)BloomCacheLength)])
            {
                return false;
            }
        }
        return true;
    }

    // False positive probability for a given number of inserted items
    public double GetFalsePositiveProbability(double setSize)
    {
        // (1 - e^(-k * n / m)) ^ k
        return Math.Pow(1 - Math.Exp(-HashCount * setSize / BloomCacheLength), HashCount);
    }

    // Optimal number of hash operations for a given number of inserted items,
    // i.e. the best value of hashCount
    public int OptimalNumberOfHashes(int setSize)
    {
        // k = (m / n) * ln 2; cast to double to avoid integer division
        return (int)Math.Ceiling(((double)BloomCacheLength / setSize) * Math.Log(2.0));
    }
}

Then write the code that uses the Bloom filter, as follows:

using System;
using System.Diagnostics;
using System.Windows.Forms;

public partial class Form1 : Form
{
    BloomFilter bloom = new BloomFilter(20000, 3);
    int setSize = 2000;

    public Form1()
    {
        InitializeComponent();
        // Populate the Bloom cache array
        for (int i = 0; i < setSize; i++)
        {
            bloom.Add("kiba" + i);
        }
    }

    private void btnSearch_Click(object sender, EventArgs e)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        string con = tbCon.Text.Trim();
        var ret = bloom.IsExist(con);
        sw.Stop();
        lblRet.Text = $@"Result: {ret}{Environment.NewLine}Time consuming (ticks): {sw.ElapsedTicks}{Environment.NewLine}Error probability: {bloom.GetFalsePositiveProbability(setSize)}{Environment.NewLine}Optimal number of hashes: {bloom.OptimalNumberOfHashes(setSize)}";
    }
}

Test result

Run the project and click the button to query the data.

As the screenshot shows, the lookup hits successfully. In a real project, a hit means we can go on to query the real cache.

Error probability

The Bloom cache can give false positives: if two different pieces of data happen to produce the same hash values, and therefore the same indexes, a lookup will keep reporting a hit for data that was never added.

The error probability can be calculated from the number of hash operations, the length of the Bloom cache array, and the number of inserted items.
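Plugging this article's numbers into that formula (k = 3 hash operations, n = 2000 inserted items, m = 20000 bits) gives roughly a 1.7% false positive rate. A quick check with the GetFalsePositiveProbability method defined above:

var bloom = new BloomFilter(20000, 3);
// (1 - e^(-k * n / m)) ^ k = (1 - e^(-3 * 2000 / 20000)) ^ 3 ≈ 0.0174
double p = bloom.GetFalsePositiveProbability(2000);
Console.WriteLine(p); // about 0.017, i.e. roughly a 1.7% chance of a false positive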

Optimal number of hashes

The Bloom cache performs several hash operations per entry, that is, it saves several cache indexes for each string; this implementation creates three by default.

In our code, 2000 items are inserted into the Bloom cache array. The calculation shows that the optimal number of hash operations is 7: when 2000 items are inserted and the Bloom cache array has length 20000, the best value of hashCount is 7.
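For reference, a quick check of that number using the OptimalNumberOfHashes method defined above (the formula is k = (m / n) * ln 2):

var bloom = new BloomFilter(20000, 3);
// k = (m / n) * ln 2 = (20000 / 2000) * 0.693... ≈ 6.93, rounded up by Ceiling
int k = bloom.OptimalNumberOfHashes(2000);
Console.WriteLine(k); // prints 7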

Application scenarios

The Bloom cache can be applied in many scenarios, such as duplicate URL detection, blacklist checks, spam filtering, and so on.

For example, before crawling a website, a crawler can first check the Bloom filter to see whether a URL has already been crawled, and only then decide whether to issue an HTTP request.
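A rough sketch of that check, reusing the filter from this article; Crawl is a hypothetical placeholder for the code that issues the real HTTP request:

// Only crawl URLs the filter has not seen before.
if (!bloom.IsExist(url))
{
    bloom.Add(url);
    Crawl(url); // hypothetical placeholder for the real HTTP request
}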

About cache penetration, cache breakdown and cache avalanche

Cache penetration

Cache penetration means the requested data exists in neither the cache nor the database, yet users keep issuing such requests, which puts heavy pressure on both the cache and the database.

Solution: add stricter and more effective validation so that these requests are blocked before they reach the query. For keys that exist in neither the cache nor the database, write a placeholder entry into the cache with a short expiration time, so that repeated requests for the same key do not keep reaching the database.
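A rough sketch of that second idea, caching a placeholder for missing keys; cacheGet, cacheSet, and QueryDatabase are hypothetical helpers standing in for the real cache and database calls:

string value = cacheGet(key);
if (value == null)
{
    value = QueryDatabase(key); // hypothetical database lookup
    if (value == null)
    {
        // The key exists in neither the cache nor the database: cache a short-lived
        // placeholder so repeated requests for it stop reaching the database.
        cacheSet(key, "NOT_FOUND", TimeSpan.FromSeconds(60));
    }
    else
    {
        cacheSet(key, value, TimeSpan.FromMinutes(10));
    }
}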

Cache breakdown

Cache breakdown means that a piece of cached data has just expired when a large number of requests for it suddenly arrive, so those requests are all sent directly to the database.

Solution: mark hot data and give it a special expiration policy. For ordinary data, extend its validity period when it is accessed, for example by a fixed amount of time each time it is requested.

Cache avalanche

Cache avalanche is similar to cache breakdown. The difference is that cache breakdown involves a single piece of data whose requests go directly to the database, while an avalanche involves many pieces of data whose requests hit the database at the same time.

Solution: deploy the cache database in a distributed way.

Epilogue

Because of false positives, the Bloom cache cannot be used in scenarios that require locating data with 100% accuracy. However, in scenarios that can tolerate occasional errors, the Bloom cache greatly improves performance.

—————————————————————————————————-

This concludes the introduction to Bloom filters.

The code has been uploaded to GitHub; you are welcome to download it.

GitHub address: https://github.com/kiba518/BloomFilter_Kiba

—————————————————————————————————-

Note: This article is original. Please contact the author for authorization and credit the source when reprinting in any form!
If you think this article is good, please click the [Recommend] button below. Thank you.

https://www.cnblogs.com/kiba/p/14767430.html