Consistent hash algorithm for PHP

Time:2021-5-4

Basic scenario

Suppose you have n cache servers (hereinafter "caches"). How do you map an object to one of the n caches? You will probably use the following general method: compute the hash value of the object and take the remainder modulo n, which spreads the objects evenly across the n caches.

Remainder method: hash(object) % n

This works while everything runs normally, but consider the following two situations:

1. Cache server m goes down (this must be expected in practice), so every object mapped to cache m becomes unavailable. Cache m has to be removed from the set of caches. The number of caches is now n - 1, and the mapping formula becomes hash(object) % (n - 1).

2. Traffic grows and a cache has to be added. The number of caches is now n + 1, and the mapping formula becomes hash(object) % (n + 1).

What do 1 and 2 mean? In both cases most objects now map to a different cache, so almost all cached entries become invalid at once. For the servers this is a disaster: a flood of requests goes straight through to the backend servers.
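The effect can be checked with a short experiment. This is only a sketch: crc32() stands in for hash(), and the key names and the cache counts 4 and 5 are made up for illustration.

```php
<?php
// Count how many of 1000 keys change cache when n grows from 4 to 5
// under the remainder method hash(object) % n.
// Assumption: crc32() plays the role of hash(); key names are invented.

function cacheIndex(string $key, int $n): int
{
    // crc32() returns a non-negative integer on 64-bit PHP builds.
    return crc32($key) % $n;
}

$total = 1000;
$moved = 0;
for ($i = 0; $i < $total; $i++) {
    $key = 'object' . $i;
    if (cacheIndex($key, 4) !== cacheIndex($key, 5)) {
        $moved++;
    }
}

printf("%d of %d keys map to a different cache\n", $moved, $total);
```

For uniformly distributed hash values, a key keeps its cache only when hash % 4 equals hash % 5, which holds for 4 out of every 20 consecutive values, so roughly 80% of the keys are expected to move.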

Now consider a third problem. As hardware becomes more and more powerful, you may want the nodes added later to do more of the work. The hash algorithm above clearly cannot do that either.

Is there a way to change this situation? That is what consistent hashing is for.

Hash algorithm and monotonicity

One measure of a hash algorithm is monotonicity, which is defined as follows:

Monotonicity means that if some content has already been assigned to buffers by the hash, and a new buffer is then added to the system, the hash result must guarantee that previously assigned content either stays where it is or moves to the new buffer. It must never move to a different buffer of the old buffer set.

It is easy to see that the simple algorithm above, hash(object) % n, does not satisfy the monotonicity requirement.

The principle of consistent hashing

Consistent hashing is a hash algorithm. In short, when a cache is removed or added, it changes the existing key-to-cache mapping as little as possible, and so satisfies the monotonicity requirement as far as possible.

  • 1. Circular hash space

    A typical hash algorithm maps a value to a 32-bit key, that is, into the numeric space 0 to 2^32 - 1. We can picture this space as a ring whose start (0) and end (2^32 - 1) are joined, as shown in Figure 1 below.

Figure 1 circular hash space

  • 2. Map the contents (objects) to hash space

    Next, consider four objects, object1 to object4. The key values computed for them by the hash function are distributed on the ring as shown in Figure 2.
    hash(object1) = key1;
    … …
    hash(object4) = key4;

Figure 2 distribution of the key values of the four objects on the ring

  • 3. Mapping servers (nodes) to hash space

    The basic idea of consistent hashing is to map the objects and the caches into the same hash value space, using the same hash algorithm.

    Suppose there are three servers (nodes) A, B and C. The mapping result is shown in Figure 3: they are arranged in the hash space according to their hash values.

    Typically, the server's IP address plus port number, or its machine name, is used as the hash input.

    hash(cache A) = key A;
    … …
    hash(cache C) = key C;

    Figure 3 distribution of the caches and objects in the hash space
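Steps 1 to 3 can be sketched in a few lines of PHP. The choice of crc32() as the hash function and the three server addresses are assumptions made for illustration only; the article does not prescribe either.

```php
<?php
// Place objects and caches on the same 32-bit ring with one hash function.
// Assumptions: crc32() as the hash; the addresses below are hypothetical.

function ringPosition(string $name): int
{
    // On 64-bit PHP this is an integer in the range 0 .. 2^32 - 1.
    return crc32($name);
}

$caches  = ['10.0.0.1:11211', '10.0.0.2:11211', '10.0.0.3:11211'];
$objects = ['object1', 'object2', 'object3', 'object4'];

$ring = [];
foreach ($caches as $cache) {
    $ring[ringPosition($cache)] = $cache;
}
ksort($ring); // arrange the caches by their position on the ring

foreach ($ring as $position => $cache) {
    printf("%10d  cache   %s\n", $position, $cache);
}
foreach ($objects as $object) {
    printf("%10d  object  %s\n", ringPosition($object), $object);
}
```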

  • 4. Map the objects to the cache servers

    Now that the cache and the object have been mapped to the hash value space by the same hash algorithm, the next thing to consider is how to map the object to the cache.

    In this ring space, start from the key value of the object and walk clockwise until you meet a cache; the object is stored in that cache. Because the hash values of the object and of the caches are fixed, that cache is unique and well defined. This gives us the mapping from objects to caches.

    Continuing the example above, object1 is stored in cache A by this method, object2 and object3 go to cache C, and object4 goes to cache B.
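The clockwise rule can be sketched on a toy ring. The positions below are hand-picked small integers, not real hash values, chosen so that the result matches the example; the function name lookup is also just an illustrative choice.

```php
<?php
// Clockwise lookup on a toy ring.
// Assumption: positions are hand-picked for illustration, not hash output.

/**
 * Return the first cache at or after $key when walking clockwise, wrapping
 * around to the cache with the smallest position if none is found.
 * $ring maps position => cache name and must be sorted by position.
 */
function lookup(array $ring, int $key): string
{
    foreach ($ring as $position => $cache) {
        if ($position >= $key) {
            return $cache;
        }
    }
    return reset($ring); // walked past the end of the ring: wrap around
}

$ring    = [20 => 'cache B', 50 => 'cache C', 70 => 'cache A'];
$objects = ['object1' => 60, 'object2' => 30, 'object3' => 40, 'object4' => 10];

foreach ($objects as $name => $key) {
    echo $name, ' -> ', lookup($ring, $key), "\n";
}
// object1 -> cache A
// object2 -> cache C
// object3 -> cache C
// object4 -> cache B
```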

  • 5. Examine changes to the caches

    As mentioned earlier, the biggest problem with the hash-then-remainder method is that it does not satisfy monotonicity: when the set of caches changes, cached entries become invalid, which hits the backend servers hard. Now let's analyze the consistent hashing algorithm.

    • 5.1 Removing a cache

      Suppose cache B goes down. Under the mapping rule above, the only objects affected are those that were mapped to cache B, that is, the objects lying between cache B and the previous cache met when walking counterclockwise from it (cache A).

      So only object4 has to change: it is remapped to the next cache clockwise, cache C; see Figure 4.

      Figure 4 mapping after cache B is removed
    • 5.2 Adding a cache

      Now consider adding a new cache D, and suppose cache D lands between object2 and object3 in the ring hash space. The only objects affected are those met when walking counterclockwise from cache D until the next cache (cache B). They are some of the objects originally mapped to cache C, and they are simply remapped to cache D.

      So only object2 has to change: it is remapped to cache D; see Figure 5.

Figure 5 mapping after cache D is added
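Both cases in step 5 can be replayed on a toy ring. As a sketch, the positions are hand-picked small integers rather than real hash values, and the helper names lookup and mapAll are hypothetical.

```php
<?php
// Replay 5.1 (remove cache B) and 5.2 (add cache D) on a toy ring.
// Assumption: positions are hand-picked for illustration, not hash output.

function lookup(array $ring, int $key): string
{
    foreach ($ring as $position => $cache) {
        if ($position >= $key) {
            return $cache; // first cache met when walking clockwise
        }
    }
    return reset($ring); // wrap around to the start of the ring
}

function mapAll(array $ring, array $objects): array
{
    ksort($ring); // lookup() expects the ring sorted by position
    $mapping = [];
    foreach ($objects as $name => $key) {
        $mapping[$name] = lookup($ring, $key);
    }
    return $mapping;
}

$ring    = [20 => 'cache B', 50 => 'cache C', 70 => 'cache A'];
$objects = ['object1' => 60, 'object2' => 30, 'object3' => 40, 'object4' => 10];
$before  = mapAll($ring, $objects);

// 5.1: cache B goes down.
$withoutB = $ring;
unset($withoutB[20]);
$changedByRemoval = array_diff_assoc(mapAll($withoutB, $objects), $before);

// 5.2: cache D is added between object2 (30) and object3 (40).
$withD = $ring;
$withD[35] = 'cache D';
$changedByAddition = array_diff_assoc(mapAll($withD, $objects), $before);

print_r($changedByRemoval);  // only object4, now on cache C
print_r($changedByAddition); // only object2, now on cache D
```

In both cases exactly one object is remapped; every other object keeps its cache, which is the monotonicity property the remainder method lacks.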

Virtual node

Another measure of a hash algorithm is balance, which is defined as follows:

Balance

Balance means that the hash results are spread across all buffers as evenly as possible, so that all of the buffer space is used.

A hash algorithm does not guarantee absolute balance. When there are few caches, objects may not be mapped to them evenly. In the example above, if only cache A and cache C are deployed, then of the four objects cache A stores only object1, while cache C stores object2, object3 and object4. The distribution is very uneven.

To solve this problem, consistent hashing introduces the concept of a "virtual node", which can be defined as follows:

A virtual node is a replica of an actual node in the hash space. An actual node corresponds to several virtual nodes, and the corresponding number is also called the number of copies. Virtual nodes are arranged by hash value in the hash space.

Take the case where only cache A and cache C are deployed. As Figure 4 shows, the objects are distributed unevenly. Now introduce virtual nodes and set the number of copies to 2, which gives four virtual nodes in total: cache A1 and cache A2 represent cache A, and cache C1 and cache C2 represent cache C. Assume an ideal placement, as in Figure 6.


Figure 6 mapping relationship after introducing “virtual node”

In this case, the mapping between the objects and the virtual nodes is as follows:

object1 -> cache A2; object2 -> cache A1; object3 -> cache C1; object4 -> cache C2;

Therefore object1 and object2 are both mapped to cache A, while object3 and object4 are mapped to cache C. The balance is greatly improved.

After the introduction of “virtual node”, the mapping relationship is transformed from {object → node} to {object → virtual node}. The mapping relationship when querying the cache where the object is located is shown in Figure 7.


Figure 7 querying the cache that holds an object
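The two-step lookup (object to virtual node, then virtual node to actual cache) can be sketched on a toy ring. The positions are hand-picked to reproduce the idealized placement described above, not real hash values, and the names lookup and $owner are hypothetical.

```php
<?php
// Compare the load with and without virtual nodes on a toy ring.
// Assumption: positions are hand-picked for illustration, not hash output.

function lookup(array $ring, int $key): string
{
    foreach ($ring as $position => $node) {
        if ($position >= $key) {
            return $node; // first node met when walking clockwise
        }
    }
    return reset($ring); // wrap around to the start of the ring
}

$objects = ['object1' => 60, 'object2' => 30, 'object3' => 40, 'object4' => 10];

// Without virtual nodes: only cache C and cache A are on the ring.
$plainRing = [50 => 'cache C', 70 => 'cache A'];

// With two virtual nodes per cache, in an idealized placement.
$virtualRing = [20 => 'cache C2', 35 => 'cache A1', 50 => 'cache C1', 70 => 'cache A2'];
$owner = [
    'cache A1' => 'cache A', 'cache A2' => 'cache A',
    'cache C1' => 'cache C', 'cache C2' => 'cache C',
];

$plainLoad = [];
$virtualLoad = [];
foreach ($objects as $name => $key) {
    $plain = lookup($plainRing, $key);
    $actual = $owner[lookup($virtualRing, $key)]; // virtual node -> actual cache
    $plainLoad[$plain] = ($plainLoad[$plain] ?? 0) + 1;
    $virtualLoad[$actual] = ($virtualLoad[$actual] ?? 0) + 1;
}

print_r($plainLoad);   // cache A holds 1 object, cache C holds 3
print_r($virtualLoad); // cache A holds 2 objects, cache C holds 2
```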

The hash of a virtual node can be computed from the IP address of its actual node plus a numeric suffix. For example, suppose the IP address of cache A is 202.168.14.241.

Before virtual nodes are introduced, the hash value of cache A is computed as:

Hash("202.168.14.241");

After virtual nodes are introduced, the hash values of cache A1 and cache A2 are computed as:

Hash("202.168.14.241#1"); // cache A1

Hash("202.168.14.241#2"); // cache A2
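The suffix scheme might be coded as follows. This is a sketch only: crc32() is an assumed stand-in for Hash(), and the function name virtualNodePositions is hypothetical.

```php
<?php
// Derive the ring positions of one cache's virtual nodes by hashing
// "<ip>#<index>". Assumption: crc32() plays the role of Hash().

function virtualNodePositions(string $ip, int $replicas): array
{
    $positions = [];
    for ($i = 1; $i <= $replicas; $i++) {
        $virtualKey = $ip . '#' . $i; // e.g. "202.168.14.241#1"
        $positions[$virtualKey] = crc32($virtualKey);
    }
    return $positions;
}

$positions = virtualNodePositions('202.168.14.241', 2);
foreach ($positions as $virtualKey => $position) {
    printf("%s => %d\n", $virtualKey, $position);
}
```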

PHP implementation
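A minimal sketch of how the ideas above could be put together in PHP 8. It is not the article's own listing: the class name ConsistentHash, its method names, the default of 64 copies per node and the use of crc32() are all assumptions made for illustration.

```php
<?php
// Consistent hashing with virtual nodes (illustrative sketch).
// Assumptions: hypothetical class/method names, crc32() as the hash.

final class ConsistentHash
{
    /** @var array<int, string> ring position => actual node */
    private array $ring = [];
    /** @var int[] ring positions in ascending order */
    private array $positions = [];

    public function __construct(private int $replicas = 64)
    {
    }

    public function addNode(string $node): void
    {
        for ($i = 1; $i <= $this->replicas; $i++) {
            $this->ring[crc32($node . '#' . $i)] = $node; // one virtual node
        }
        $this->refresh();
    }

    public function removeNode(string $node): void
    {
        for ($i = 1; $i <= $this->replicas; $i++) {
            $position = crc32($node . '#' . $i);
            // Only drop positions this node still owns (guards against collisions).
            if (($this->ring[$position] ?? null) === $node) {
                unset($this->ring[$position]);
            }
        }
        $this->refresh();
    }

    public function lookup(string $key): ?string
    {
        if ($this->positions === []) {
            return null; // no nodes on the ring
        }
        $hash = crc32($key);
        // Binary search for the first ring position >= $hash.
        $lo = 0;
        $hi = count($this->positions);
        while ($lo < $hi) {
            $mid = intdiv($lo + $hi, 2);
            if ($this->positions[$mid] < $hash) {
                $lo = $mid + 1;
            } else {
                $hi = $mid;
            }
        }
        // Past the last position: wrap around to the first one.
        return $this->ring[$this->positions[$lo % count($this->positions)]];
    }

    private function refresh(): void
    {
        ksort($this->ring);
        $this->positions = array_keys($this->ring);
    }
}

$hash = new ConsistentHash(64);
foreach (['cache A', 'cache B', 'cache C'] as $node) {
    $hash->addNode($node);
}

$before = [];
for ($i = 0; $i < 1000; $i++) {
    $before['object' . $i] = $hash->lookup('object' . $i);
}

$hash->removeNode('cache B');
$movedFromOthers = 0;
foreach ($before as $key => $node) {
    if ($node !== 'cache B' && $hash->lookup($key) !== $node) {
        $movedFromOthers++;
    }
}
// Always 0: only keys that were on the removed cache are remapped.
echo "keys moved that were not on cache B: {$movedFromOthers}\n";
```

Keeping the positions in a sorted array lets each lookup use a binary search instead of scanning the whole ring, at the cost of re-sorting whenever a node is added or removed.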


This work is licensed under a CC agreement. Reprints must credit the author and link to this article.
