Bloom filter based on PHP + redis

Time:2021-4-17

Redis provides the SETBIT and GETBIT commands, which makes it a natural fit for implementing a Bloom filter. Redis also has a Bloom filter plug-in, but here we implement the filter ourselves with PHP + Redis.

First, define a class that collects a set of hash functions. You do not need to use all of them; in practice, three 32-bit hash values are enough. The exact number depends on the total number of bits in your bit array and the number of items you need to store: for m bits and n items, the optimal number of hash functions is k = (m / n) ln 2.
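
To make the sizing rule concrete, here is a minimal sketch based on the standard Bloom filter formulas k = (m / n) ln 2 and p ≈ (1 - e^(-kn/m))^k. The helper names `optimalHashCount` and `falsePositiveRate` are hypothetical and not part of the classes in this article; m = 2^32 and n = 2^30 are example values, and 64-bit PHP is assumed.

```php
<?php
// Hypothetical helpers (not from the original article) for sizing a Bloom filter.

/** Optimal number of hash functions for $m bits and $n expected items: k = (m / n) * ln 2. */
function optimalHashCount(int $m, int $n): int
{
    return max(1, (int) round(($m / $n) * log(2)));
}

/** Approximate false-positive rate: p = (1 - e^(-k * n / m))^k. */
function falsePositiveRate(int $m, int $n, int $k): float
{
    return pow(1 - exp(-$k * $n / $m), $k);
}

$m = 1 << 32; // total number of bits (needs 64-bit PHP)
$n = 1 << 30; // expected number of stored items
$k = optimalHashCount($m, $n);
$p = falsePositiveRate($m, $n, $k);

printf("k = %d, p = %.4f\n", $k, $p);
```

With these example values the ratio m / n is 4, so k rounds to 3 and the estimated false-positive rate at full load is roughly 15%. If that is too high for your use case, store fewer items in the same bitmap.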

class BloomFilterHash
{
    /**
     * A bitwise hash function written by Justin Sobel.
     */
    public function JSHash($string, $len = null)
    {
        $hash = 1315423911;
        $len || $len = strlen($string);
        for ($i=0; $i<$len; $i++) {
            $hash ^= (($hash << 5) + ord($string[$i]) + ($hash >> 2));
        }
        return ($hash % 0xFFFFFFFF) & 0xFFFFFFFF;
    }

    /**
     * This hash algorithm is based on work by Peter J. Weinberger of AT&T Bell Laboratories.
     * The book "Compilers: Principles, Techniques, and Tools" by Aho, Sethi, and Ullman recommends the hashing method used in this algorithm.
     */
    public function PJWHash($string, $len = null)
    {
        $bitsInUnsignedInt = 4 * 8; //(unsigned int)(sizeof(unsigned int)* 8);
        $threeQuarters = ($bitsInUnsignedInt * 3) / 4;
        $oneEighth = $bitsInUnsignedInt / 8;
        $highBits = 0xFFFFFFFF << (int) ($bitsInUnsignedInt - $oneEighth);
        $hash = 0;
        $test = 0;
        $len || $len = strlen($string);
        for($i=0; $i<$len; $i++) {
            $hash = ($hash << (int) ($oneEighth)) + ord($string[$i]);
            // Fold the high bits back in on every character, as in the classic PJW algorithm
            $test = $hash & $highBits;
            if ($test != 0) {
                $hash = (($hash ^ ($test >> (int)($threeQuarters))) & (~$highBits));
            }
        }
        return ($hash % 0xFFFFFFFF) & 0xFFFFFFFF;
    }

    /**
     * Similar to the PJW hash function, but tuned for 32-bit processors. It is widely used on UNIX-based systems.
     */
    public function ELFHash($string, $len = null)
    {
        $hash = 0;
        $len || $len = strlen($string);
        for ($i=0; $i<$len; $i++) {
            $hash = ($hash << 4) + ord($string[$i]);
            $x = $hash & 0xF0000000;
            if ($x != 0) {
                $hash ^= ($x >> 24);
            }
            $hash &= ~$x;
        }
        return ($hash % 0xFFFFFFFF) & 0xFFFFFFFF;
    }

    /**
     * This hash function comes from Brian Kernighan and Dennis Ritchie's book "The C Programming Language".
     * It is a simple hash function that uses an unusual set of possible seeds, all following the pattern 31, 131, 1313, and so on. It is very similar to the DJB hash function.
     */
    public function BKDRHash($string, $len = null)
    {
        $seed = 131;  # 31 131 1313 13131 131313 etc..
        $hash = 0;
        $len || $len = strlen($string);
        for ($i=0; $i<$len; $i++) {
            $hash = (int) (($hash * $seed) + ord($string[$i]));
        }
        return ($hash % 0xFFFFFFFF) & 0xFFFFFFFF;
    }

    /**
     * This is the algorithm used in the open source sdbm project.
     * It appears to give a good overall distribution on many different data sets, and it works well when the most significant bits of the elements in a data set vary widely.
     */
    public function SDBMHash($string, $len = null)
    {
        $hash = 0;
        $len || $len = strlen($string);
        for ($i=0; $i<$len; $i++) {
            $hash = (int) (ord($string[$i]) + ($hash << 6) + ($hash << 16) - $hash);
        }
        return ($hash % 0xFFFFFFFF) & 0xFFFFFFFF;
    }

    /**
     * This algorithm, by Professor Daniel J. Bernstein, first appeared in the Usenet newsgroup comp.lang.c.
     * It is one of the most efficient hash functions ever published.
     */
    public function DJBHash($string, $len = null)
    {
        $hash = 5381;
        $len || $len = strlen($string);
        for ($i=0; $i<$len; $i++) {
            $hash = (int) (($hash << 5) + $hash) + ord($string[$i]);
        }
        return ($hash % 0xFFFFFFFF) & 0xFFFFFFFF;
    }

    /**
     * The algorithm proposed by Donald E. Knuth in "The Art of Computer Programming", Volume 3 (Sorting and Searching), section 6.4.
     */
    public function DEKHash($string, $len = null)
    {
        $len || $len = strlen($string);
        $hash = $len;
        for ($i=0; $i<$len; $i++) {
            $hash = (($hash << 5) ^ ($hash >> 27)) ^ ord($string[$i]);
        }
        return ($hash % 0xFFFFFFFF) & 0xFFFFFFFF;
    }

    /**
     *Reference http://www.isthe.com/chongo/tech/comp/fnv/
     */
    public function FNVHash($string, $len = null)
    {
        $prime = 16777619; // 32-bit prime 2 ^ 24 + 2 ^ 8 + 0x93 = 16777619
        $hash = 2166136261; // 32-bit offset
        $len || $len = strlen($string);
        for ($i=0; $i<$len; $i++) {
            $hash = (int) ($hash * $prime) % 0xFFFFFFFF;
            $hash ^= ord($string[$i]);
        }
        return ($hash % 0xFFFFFFFF) & 0xFFFFFFFF;
    }
}

The next step is to connect to Redis and operate on the bitmap.

/**
 *Bloom filter using redis
 */
abstract class BloomFilterRedis
{
    /**
     * Subclasses must define the bucket name (the Redis key that holds the bitmap)
     */
    protected $bucket;

    /**
     * Subclasses must list the names of the BloomFilterHash methods to use
     */
    protected $hashFunction;

    protected $Hash;

    protected $Redis;

    public function __construct($config, $id)
    {
        if (!$this->bucket || !$this->hashFunction) {
            throw new Exception("bucket and hashFunction need to be defined", 1);
        }
        $this->Hash = new BloomFilterHash;
        $this->Redis = new YourRedis; // placeholder: assume this gives you a connected Redis client
    }

    /**
     * Add an item to the set
     */
    public function add($string)
    {
        $pipe = $this->Redis->multi();
        foreach ($this->hashFunction as $function) {
            $hash = $this->Hash->$function($string);
            $pipe->setBit($this->bucket, $hash, 1);
        }
        return $pipe->exec();
    }

    /**
     * Check whether an item exists. If it has been added, this always returns true. If it has not been added, there is some probability that it is wrongly reported as present.
     */
    public function exists($string)
    {
        $pipe = $this->Redis->multi();
        $len = strlen($string);
        foreach ($this->hashFunction as $function) {
            $hash = $this->Hash->$function($string, $len);
            $pipe = $pipe->getBit($this->bucket, $hash);
        }
        $res = $pipe->exec();
        foreach ($res as $bit) {
            if ($bit == 0) {
                return false;
            }
        }
        return true;
    }

}
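
The add and exists logic above can be tried without a Redis server. The following is a minimal sketch in which a sparse PHP array stands in for the Redis bitmap. The class `InMemoryBloomFilter` and the closures are hypothetical, and the two hash closures are variants of BKDR and DJB that mask to 32 bits on every step (an assumption made here to avoid integer overflow on 64-bit PHP), so they are not byte-for-byte the functions listed earlier.

```php
<?php
// In-memory stand-in for the Redis bitmap; all names here are hypothetical.
class InMemoryBloomFilter
{
    /** @var array<int, bool> sparse bit set keyed by bit offset */
    private $bits = [];

    /** @var callable[] hash functions mapping a string to a bit offset */
    private $hashers;

    public function __construct(array $hashers)
    {
        $this->hashers = $hashers;
    }

    public function add(string $item): void
    {
        foreach ($this->hashers as $hasher) {
            $this->bits[$hasher($item)] = true; // plays the role of SETBIT
        }
    }

    public function exists(string $item): bool
    {
        foreach ($this->hashers as $hasher) {
            if (!isset($this->bits[$hasher($item)])) { // plays the role of GETBIT
                return false; // one unset bit proves the item was never added
            }
        }
        return true; // every bit is set: the item was probably added
    }
}

// BKDR variant, masked to 32 bits on each step.
$bkdr = function (string $s): int {
    $hash = 0;
    for ($i = 0, $len = strlen($s); $i < $len; $i++) {
        $hash = ($hash * 131 + ord($s[$i])) & 0xFFFFFFFF;
    }
    return $hash;
};

// DJB variant, masked to 32 bits on each step.
$djb = function (string $s): int {
    $hash = 5381;
    for ($i = 0, $len = strlen($s); $i < $len; $i++) {
        $hash = (($hash << 5) + $hash + ord($s[$i])) & 0xFFFFFFFF;
    }
    return $hash;
};

$filter = new InMemoryBloomFilter([$bkdr, $djb]);
$filter->add('a');

var_dump($filter->exists('a')); // an added item is always reported as present
var_dump($filter->exists('b')); // 'b' maps to different bits than 'a' here
```

The early return in `exists` is the property that matters: a Bloom filter can report an item that was never added, but it never misses one that was.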

The class above is abstract; to use it, extend it for each specific use case. For example, here is a filter that detects duplicate content.

/**
 * Duplicate content filter
 * The Bloom filter has 2^32 bits in total and is sized for about 2^30 items, for which the optimal number of hash functions is 3
 * The three hash functions used are
 * BKDR, SDBM, JSHash
 *
 * Note that once the amount of stored data reaches 2^30, the error rate rises sharply. You should therefore check regularly whether more than 50% of the bits in the filter are set to 1, and clear the filter if they are
 */
class FilteRepeatedComments extends BloomFilterRedis
{
    /**
     *Represents a filter for determining duplicate content
     * @var string
     */
    protected $bucket = 'rptc';

    protected $hashFunction = array('BKDRHash', 'SDBMHash', 'JSHash');
}
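
The 50% check mentioned in the class comment can be sketched as a small helper. This is only an illustration: `shouldResetFilter` is a hypothetical function, and the bit count is hard-coded here, whereas with a live connection it would come from the Redis BITCOUNT command (exposed as `bitCount` in phpredis). 64-bit PHP is assumed.

```php
<?php
// Hypothetical helper, not part of the original classes.

/**
 * Decide whether a Bloom filter bitmap is saturated.
 *
 * @param int   $setBits   number of bits set to 1 (for example, the result of BITCOUNT)
 * @param int   $totalBits total size of the bitmap in bits
 * @param float $threshold fill ratio above which the filter should be cleared
 */
function shouldResetFilter(int $setBits, int $totalBits, float $threshold = 0.5): bool
{
    return ($setBits / $totalBits) > $threshold;
}

$totalBits = 1 << 32; // 2^32 bits, matching the filter above

// With a real connection the count would come from Redis, for example:
//   $setBits = $redis->bitCount('rptc'); // $redis is an assumed phpredis client
$setBits = 3 * (1 << 29); // example value: 37.5% of the bits are set

var_dump(shouldResetFilter($setBits, $totalBits));
```

BITCOUNT scans the whole string, and a 2^32-bit bitmap occupies 512 MB, so it is probably best to run this check on a schedule and not on every request.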