[Distributed] Load balancing 02: a detailed explanation of the principle of the consistent hash algorithm

Time:2021-11-30

Load balancing series

01 – Load balancing basics

02 – Principle of the consistent hash algorithm

03 – Java implementation of the consistent hash algorithm

04 – Java implementation of load balancing algorithms

Concept

Consistent hashing is a special kind of hash algorithm.

With consistent hashing, changing the number of slots (the size) of the hash table requires remapping only K / N keys on average, where K is the number of keys and N is the number of slots.

In a traditional hash table, by contrast, adding or removing a slot requires almost all keys to be remapped.

What is it used for?

Many distributed middleware systems need to rebalance data when nodes are added or removed.

With consistent hashing, most of this rebalancing can be avoided.

Business scenario

Suppose there are 10 million data items and 100 storage nodes. Design an algorithm that distributes the items across these nodes sensibly.

Strong hash

Because a single server cannot carry the load, a distributed architecture is used. The initial algorithm is hash() mod n, where hash() is usually computed from the user ID and n is the number of nodes.

This method is easy to implement and meets basic operational requirements. Its disadvantage is that the system cannot recover automatically when a single node fails, and for the same reason nodes cannot be added dynamically.
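As a minimal sketch of the hash() mod n scheme, the snippet below maps a user ID to a node index. The helper name `pick_node` is hypothetical, and the use of MD5 is an assumption borrowed from the simulation later in this article.

```python
from hashlib import md5
from struct import unpack_from


def pick_node(user_id, n):
    """Return the index of the node that owns user_id among n nodes."""
    digest = md5(str(user_id).encode()).digest()
    # Interpret the first 4 bytes of the digest as an unsigned 32-bit integer
    h = unpack_from(">I", digest)[0]
    return h % n


# The same user always lands on the same node while n stays fixed ...
owner_before = pick_node(42, 100)
# ... but the owner is recomputed from scratch as soon as n changes.
owner_after = pick_node(42, 101)
```

Because n appears directly in the formula, the mapping has no way to survive a node failure or a newly added node.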

Schematic diagram

The ordinary hash algorithm maps each data item to a node by taking the item's hash modulo the number of nodes. The core calculation is as follows:

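from hashlib import md5
from struct import unpack_from

ITEMS, NODES = 10_000_000, 100
node_stat = [0] * NODES
for item in range(ITEMS):
    k = md5(str(item).encode()).digest()
    h = unpack_from(">I", k)[0]
    # Map to a node by taking the remainder
    n = h % NODES
    node_stat[n] += 1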

Output

Ave: 100000
Max: 100695 (0.69%)
Min: 99073 (0.93%)

The results show that the ordinary hash algorithm spreads the data items evenly across the nodes: the most and least loaded nodes each deviate from the average by less than 1%.
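The percentages in the output are consistent with each extreme being measured as a deviation from the average load. The following sketch shows one way such a summary could be computed. The function name `summarize` and the three-node sample are hypothetical.

```python
def summarize(node_stat):
    """Return the average load plus the max/min loads and their deviation from the average in percent."""
    ave = sum(node_stat) / len(node_stat)
    hi, lo = max(node_stat), min(node_stat)
    over = (hi - ave) / ave * 100   # how far the busiest node is above average
    under = (ave - lo) / ave * 100  # how far the idlest node is below average
    return ave, hi, over, lo, under


ave, hi, over, lo, under = summarize([90, 100, 110])
print(f"Ave: {ave:.0f}")             # Ave: 100
print(f"Max: {hi} ({over:.2f}%)")    # Max: 110 (10.00%)
print(f"Min: {lo} ({under:.2f}%)")   # Min: 90 (10.00%)
```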

The distribution is uniform mainly because the hash function (MD5 in this implementation) spreads its output values randomly.

Shortcoming

There is a problem, however. Because the algorithm takes the remainder by the number of nodes, it depends strongly on that number.

When the number of nodes changes, the node assigned to each item changes dramatically, and the cost is that data must be migrated every time the node count changes, which is clearly unacceptable for a storage product. Let’s observe how data items move after a node is added:

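from hashlib import md5
from struct import unpack_from

ITEMS, NODES, NEW_NODES = 10_000_000, 100, 101
change = 0
for item in range(ITEMS):
    k = md5(str(item).encode()).digest()
    h = unpack_from(">I", k)[0]
    # Original mapping result
    n = h % NODES
    # Mapping result after one node is added
    n_new = h % NEW_NODES
    if n_new != n:
        change += 1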

Output

Change: 9900989 (99.01%)

In other words, with 100 nodes, adding a single node forces about 99% of the existing data to move.
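The 99% figure can be checked without hashing 10 million items. Assuming hash values are spread uniformly, it is enough to count over one full period of the two remainders. The helper `moved_fraction` below is a hypothetical name used only for this sketch.

```python
def moved_fraction(n_old, n_new):
    """Fraction of hash values whose node index changes between n_old and n_new nodes."""
    # h % n_old and h % n_new repeat together after n_old * n_new values
    period = n_old * n_new
    moved = sum(1 for h in range(period) if h % n_old != h % n_new)
    return moved / period


print(f"{moved_fraction(100, 101):.2%}")  # 99.01%
```

Only the values 0 to 99 out of every 10100 keep the same remainder, so roughly 1 item in 101 stays where it was.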

This is clearly unacceptable. Having identified the problem with the ordinary hash algorithm, how can it be improved?

This is where the consistent hash algorithm comes in.

Weak hash

To eliminate the single point of failure, use hash() mod (n/m).

With this scheme every user has m candidate servers, and the client can pick one of them at random.

Because users on different servers need to interact with each other, every server needs to know exactly where each user is.

The user's location is therefore stored in memcached. When one server fails, the client can switch automatically to the corresponding backup. Because the backup does not hold the user’s session before the switch, the client has to log in again.
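The following is a rough sketch of the hash() mod (n/m) idea. The article does not say how the m candidates are laid out, so the grouping used here (group g owns the consecutive servers g*m to g*m + m - 1) is an assumption, as are the helper names `candidates` and `pick_server`.

```python
import random
from hashlib import md5
from struct import unpack_from


def candidates(user_id, n, m):
    """Return the m server indices that may serve user_id (n servers, n divisible by m)."""
    h = unpack_from(">I", md5(str(user_id).encode()).digest())[0]
    group = h % (n // m)  # hash() mod (n/m) selects a group
    # Assumed layout: group g owns the consecutive servers g*m .. g*m + m - 1
    return [group * m + i for i in range(m)]


def pick_server(user_id, n, m, failed=()):
    """Randomly pick a live candidate; raise if the whole group is down."""
    alive = [s for s in candidates(user_id, n, m) if s not in failed]
    if not alive:
        raise RuntimeError("all candidate servers for this user have failed")
    return random.choice(alive)


servers = candidates("alice", n=100, m=2)
primary = pick_server("alice", 100, 2)
# If the primary fails, the client falls back to the other candidate
backup = pick_server("alice", 100, 2, failed={primary})
```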

  • Benefit

Its advantage over strong hashing is that it removes the single point of failure.

  • Shortcoming

However, it has the following problems: the load is unbalanced, especially after one server fails, when the remaining one comes under too much pressure; nodes cannot be added or removed dynamically; and when a node fails, the client has to log in again.

Consistent hash algorithm

The consistent hash algorithm proposes four properties for judging the quality of a hash algorithm in a dynamic cache environment:

Balance

Balance means that hash results are spread across all buffers as evenly as possible, so that all buffer space is used. Many hash algorithms satisfy this condition.

Monotonicity

Monotonicity concerns what happens when content has already been assigned to buffers by hashing and a new buffer is then added to the system. The hash result should ensure that previously assigned content maps either to its original buffer or to the new buffer, and never to a different buffer in the old buffer set.

Dispersion (spread)

In a distributed environment, a terminal may not see all the buffers, but only some of them.

When terminals map content to buffers by hashing, different terminals may see different sets of buffers and so compute inconsistent results. The same content then ends up mapped to different buffers by different terminals.

This situation should obviously be avoided, because storing the same content in several buffers reduces the storage efficiency of the system. Dispersion is defined as the severity of this situation. A good hash algorithm should avoid such inconsistency as far as possible, that is, keep the dispersion as low as possible.

Load

The load problem looks at dispersion from another angle. Since different terminals may map the same content to different buffers, a particular buffer may likewise have different content mapped to it by different users.

Like dispersion, this situation should be avoided, so a good hash algorithm should keep the load on each buffer as low as possible.

The ordinary hash algorithm (also known as hard hash) assigns content to machines by a simple modulus. This gives satisfactory results while the cache environment stays unchanged, but when the cache environment changes dynamically, this static scheme clearly fails the monotonicity requirement: when a machine is added or removed, almost all stored content is rehashed to other buffers.

Code implementation

Implementation logic

There are many concrete implementations of the consistent hash algorithm, including the Chord algorithm and the Kad (Kademlia) algorithm. These implementations are relatively complex.

This article introduces the basic principle of a consistent hash implementation that is widely circulated on the Internet. Interested readers can find more detail through the links in the reference material or elsewhere on the Internet.

The basic principle is to map both machine nodes and keys onto a ring covering 0 to 2^32, using the same hash algorithm for both.

When a request to write to the cache arrives, compute the hash value hash(K) of the key K. If that value coincides exactly with the hash value of a machine node, write directly to that node. Otherwise, search clockwise for the next node and write to it. If the search passes 2^32 without finding a node, it continues from 0, because the structure is a ring.

As shown in Figure 1:

[Figure 1: hash ring with machine nodes A, B, C and D, and key K located between A and B]

In Figure 1, the hash value of key K lies between A and B, so K is handled by node B.
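The lookup rule (exact match, otherwise the next node clockwise, wrapping around at the end of the ring) can be sketched with a sorted list and a binary search. The ring positions below are hypothetical values chosen by hand; in a real ring they would come from hashing each node.

```python
from bisect import bisect_left

RING_SIZE = 2 ** 32
# Hypothetical ring positions for nodes A to D
positions = [1000, 2000, 3000, 4000]
owners = {1000: "A", 2000: "B", 3000: "C", 4000: "D"}


def lookup(key_hash):
    """Return the node at key_hash, or the first node clockwise from it."""
    i = bisect_left(positions, key_hash % RING_SIZE)
    # Past the last node: wrap around to the first node on the ring
    return owners[positions[i % len(positions)]]


print(lookup(1500))  # between A and B -> B
print(lookup(2000))  # exactly on B -> B
print(lookup(4500))  # past D -> wraps around to A
```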

In addition, when mapping machines onto the ring, a physical node can be mapped to multiple virtual nodes according to its processing capacity.

With consistent hashing, adding a new machine affects the storage of only one existing machine.

For example, if the hash of a newly added node H falls between B and C, some of the data originally handled by C moves to H, while all other nodes are unaffected. The algorithm therefore shows good monotonicity.

If a machine is removed, for example node C, the data originally handled by C moves to node D, while the other nodes are unaffected.
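The effect of adding and removing a node can be checked on a toy ring. The positions below are hypothetical and chosen by hand so the result is easy to follow. The sketch collects every (old owner, new owner) pair for keys whose owner changes.

```python
from bisect import bisect_left


def build(owners):
    """owners maps ring position -> node name; returns a lookup function."""
    positions = sorted(owners)

    def lookup(key_hash):
        # First node clockwise from key_hash, wrapping around at the end
        i = bisect_left(positions, key_hash) % len(positions)
        return owners[positions[i]]

    return lookup


base = {1000: "A", 2000: "B", 3000: "C", 4000: "D"}
before = build(base)
after_add = build({**base, 2500: "H"})  # H joins between B and C
after_del = build({p: n for p, n in base.items() if n != "C"})  # C leaves

keys = range(5000)
moved_add = {(before(k), after_add(k)) for k in keys if before(k) != after_add(k)}
moved_del = {(before(k), after_del(k)) for k in keys if before(k) != after_del(k)}
print(moved_add)  # {('C', 'H')}
print(moved_del)  # {('C', 'D')}
```

In both cases the only keys that move are those that belonged to C, which is the monotonicity property in action.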

Because the same hash algorithm is used for both machine nodes and buffer content, dispersion and load are also reduced.

Introducing virtual nodes also greatly improves balance.
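Virtual nodes can also reflect processing capacity, as mentioned earlier. A minimal sketch, assuming a hypothetical `build_ring` helper in which each unit of weight contributes a fixed number of ring points (the node names, weights and `vnodes_per_weight` value are illustrative only):

```python
from hashlib import md5
from struct import unpack_from


def _hash(value):
    # First 4 bytes of the MD5 digest as an unsigned 32-bit integer
    return unpack_from(">I", md5(str(value).encode()).digest())[0]


def build_ring(weights, vnodes_per_weight=100):
    """weights maps node name -> relative capacity; returns (sorted ring, owner map)."""
    hash2node = {}
    for node, weight in weights.items():
        for v in range(weight * vnodes_per_weight):
            # The "#" separator keeps different (node, index) pairs from sharing a string
            hash2node[_hash(f"{node}#{v}")] = node
    return sorted(hash2node), hash2node


ring, hash2node = build_ring({"small": 1, "large": 2})
points = list(hash2node.values())
print(points.count("small"), points.count("large"))
```

A node with twice the weight owns about twice as many points on the ring, and therefore receives about twice as many keys.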

Implementation code


Core code

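from bisect import bisect_left
from hashlib import md5
from struct import unpack_from
ITEMS, NODES = 10_000_000, 100
node_stat, ring, hash2node = [0] * NODES, [], {}
def _hash(value):
    # Assumed helper: first 4 bytes of the MD5 digest as an unsigned 32-bit integer
    return unpack_from(">I", md5(str(value).encode()).digest())[0]
for n in range(NODES):
    h = _hash(n)
    ring.append(h)
    hash2node[h] = n
ring.sort()  # sort once, after every node has been placed on the ring
for item in range(ITEMS):
    h = _hash(item)
    # First node clockwise from h; the modulo wraps past the last node to index 0
    n = bisect_left(ring, h) % NODES
    node_stat[hash2node[ring[n]]] += 1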

Uniformity

Although the consistent hash algorithm solves the data migration problem caused by node changes, let's go back and check how evenly it distributes the data items:

Ave: 100000
Max: 596413 (496.41%)
Min: 103 (99.90%)

The result is very poor: the distribution is highly uneven.

Why does the consistent hash algorithm distribute data so unevenly?

In the first simulation, the ordinary hash algorithm “scattered” the 10 million data items and distributed them evenly.

So why does the distribution become uneven once consistent hashing is introduced?

The hash value of each data item has not changed. What has changed is the rule that decides which node a given hash value falls on.


The main reason is that the 100 nodes are not evenly spaced on the ring after hashing, so the intervals that the nodes actually own on the ring differ in size.
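The uneven intervals can be measured directly, without placing any data items. The sketch below computes the share of the ring owned by each of 100 nodes. The `_hash` helper (first 4 bytes of MD5) is an assumption modelled on the first simulation in this article, so the exact figures need not match the output shown above.

```python
from hashlib import md5
from struct import unpack_from

NODES = 100
RING_SIZE = 2 ** 32


def _hash(value):
    # Assumed helper: first 4 bytes of the MD5 digest as an unsigned 32-bit integer
    return unpack_from(">I", md5(str(value).encode()).digest())[0]


ring = sorted(_hash(n) for n in range(NODES))
# A node owns the arc from its predecessor (exclusive) up to itself (inclusive).
# For i == 0 the predecessor is ring[-1], and the modulo handles the wrap-around.
arcs = [(ring[i] - ring[i - 1]) % RING_SIZE for i in range(NODES)]
shares = [arc / RING_SIZE for arc in arcs]

print(f"ideal share:    {1 / NODES:.4f}")
print(f"largest share:  {max(shares):.4f}")
print(f"smallest share: {min(shares):.6f}")
```

A node's expected load is proportional to its share of the ring, so a large gap between the largest and smallest share translates directly into uneven data distribution.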

Improvement: virtual nodes

When nodes are hashed, their values do not fall evenly on the ring, so the ranges the nodes govern are not uniform and the data distribution ends up uneven. The remedy is to place each physical node on the ring many times, as a set of virtual nodes.


Implementation code

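from bisect import bisect_left
from hashlib import md5
from struct import unpack_from

ITEMS, NODES, VNODES = 10_000_000, 100, 1000  # VNODES is an assumed value
node_stat, ring, hash2node = [0] * NODES, [], {}

def _hash(value):
    # Assumed helper: first 4 bytes of the MD5 digest as an unsigned 32-bit integer
    return unpack_from(">I", md5(str(value).encode()).digest())[0]

for n in range(NODES):
    for v in range(VNODES):
        # The separator keeps (1, 11) and (11, 1) from hashing the same string "111"
        h = _hash(f"{n}-{v}")
        # Construct the ring
        ring.append(h)
        # Record the node that owns this hash
        hash2node[h] = n
ring.sort()
for item in range(ITEMS):
    h = _hash(item)
    # Find the first virtual node clockwise from h
    n = bisect_left(ring, h) % (NODES * VNODES)
    node_stat[hash2node[ring[n]]] += 1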

Add node

By adding virtual nodes, the “jurisdiction” of each node on the ring becomes more uniform.

This keeps the change in data distribution as small as possible when nodes change, while also keeping the data distribution uniform.

In other words, the uniformity of jurisdiction is strengthened by increasing the “number of nodes”.
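The effect of virtual nodes can be seen on a small scale. The sketch below compares the busiest node's load with 1 and with 200 ring points per node. The sizes (10 nodes, 20,000 items) are hypothetical and chosen only so that the example runs quickly, and `max_load` is an illustrative helper name.

```python
from bisect import bisect_left
from collections import Counter
from hashlib import md5
from struct import unpack_from


def _hash(value):
    # First 4 bytes of the MD5 digest as an unsigned 32-bit integer
    return unpack_from(">I", md5(str(value).encode()).digest())[0]


def max_load(nodes, vnodes, items):
    """Place items on a ring with `vnodes` points per node; return the busiest node's count."""
    hash2node = {_hash(f"{n}-{v}"): n for n in range(nodes) for v in range(vnodes)}
    ring = sorted(hash2node)
    load = Counter()
    for item in range(items):
        # First ring point clockwise from the item's hash, wrapping at the end
        i = bisect_left(ring, _hash(item)) % len(ring)
        load[hash2node[ring[i]]] += 1
    return max(load.values())


# The ideal load is 20_000 / 10 = 2000 items per node
plain = max_load(nodes=10, vnodes=1, items=20_000)
virtual = max_load(nodes=10, vnodes=200, items=20_000)
print(plain, virtual)
```

With many points per node, each node's total share of the ring is the sum of many small arcs, so the shares average out and the busiest node moves much closer to the ideal load.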

Now observe how the data moves after a node is added:

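# ring2, hash2node2 and NODES2 describe a second ring, built the same way after a node is added
change = 0
for item in range(ITEMS):
    h = _hash(str(item))
    n = bisect_left(ring, h) % (NODES * VNODES)
    n2 = bisect_left(ring2, h) % (NODES2 * VNODES)
    if hash2node[ring[n]] != hash2node2[ring2[n2]]:
        change += 1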

Another improvement

However, relying on a large number of virtual nodes increases the space needed to store the virtual node information.

The Swift component of OpenStack takes a different approach to the uneven distribution problem. It changes the data distribution algorithm so that the ring space is divided evenly into a linear space of partitions, which guarantees a uniform distribution.
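One way to divide the ring evenly is to keep only the top bits of the hash, so that every partition covers exactly the same number of hash values. The sketch below assumes a 32-bit hash and 2^7 partitions. The constant values and the helper name `partition_of` are assumptions for illustration, not Swift's actual code.

```python
LOG_NODE = 7               # assumed: 2**7 = 128 partitions
PARTITION = 32 - LOG_NODE  # number of low bits discarded from a 32-bit hash


def partition_of(h):
    """Map a 32-bit hash to its partition by keeping only the top LOG_NODE bits."""
    return h >> PARTITION


print(partition_of(0x00000000))  # 0
print(partition_of(0x80000000))  # 64
print(partition_of(0xFFFFFFFF))  # 127
```

Every partition spans 2^25 consecutive hash values, so no partition is larger than another, unlike the arcs between hashed node positions.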


Core code

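from bisect import bisect_left
from hashlib import md5
from struct import unpack_from

ITEMS, NODES, LOG_NODE = 10_000_000, 100, 7  # LOG_NODE is an assumed value
PARTITION = 32 - LOG_NODE  # assumed: low bits dropped from a 32-bit hash
node_stat, ring, part2node = [0] * NODES, [], {}
def _hash(value):
    return unpack_from(">I", md5(str(value).encode()).digest())[0]
for part in range(2 ** LOG_NODE):
    ring.append(part)
    part2node[part] = part % NODES
for item in range(ITEMS):
    h = _hash(item) >> PARTITION
    part = bisect_left(ring, h)
    node_stat[part2node[part]] += 1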

The data distribution is close to ideal. If the number of nodes exactly equals the number of partitions, the distribution is perfectly even in theory.
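The condition about node and partition counts can be made concrete by counting how many partitions each node receives under the part % NODES assignment. The sketch assumes 2^7 = 128 partitions, which is an illustrative value, and `partitions_per_node` is a hypothetical helper name.

```python
from collections import Counter

LOG_NODE = 7  # assumed: 128 partitions


def partitions_per_node(nodes):
    """Count how many partitions each node receives under part % nodes."""
    return Counter(part % nodes for part in range(2 ** LOG_NODE))


uneven = partitions_per_node(100)
even = partitions_per_node(128)
print(Counter(uneven.values()))  # 72 nodes hold 1 partition, 28 nodes hold 2
print(Counter(even.values()))    # all 128 nodes hold exactly 1 partition
```

With 100 nodes and 128 partitions, 28 nodes carry twice as much as the rest, so the partitions are even but the per-node load is not. Only when the counts match does every node receive the same number of partitions.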

Add node

Now observe the proportion of data that moves after a node is added:

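from bisect import bisect_left
from hashlib import md5
from struct import unpack_from

ITEMS, NODES, NODES2, LOG_NODE = 10_000_000, 100, 101, 7  # NODES2 and LOG_NODE are assumed values
PARTITION = 32 - LOG_NODE  # assumed: low bits dropped from a 32-bit hash
ring, part2node, part2node2 = [], {}, {}

def _hash(value):
    return unpack_from(">I", md5(str(value).encode()).digest())[0]

for part in range(2 ** LOG_NODE):
    ring.append(part)
    part2node[part] = part % NODES
    part2node2[part] = part % NODES2
change = 0
for item in range(ITEMS):
    h = _hash(item) >> PARTITION
    # Both layouts share the same partitions, so one lookup is enough
    p = bisect_left(ring, h)
    n = part2node[p]
    n2 = part2node2[p]
    if n2 != n:
        change += 1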

Summary

This section described the principle of consistent hashing in detail. Later, we will learn how to implement it as a tool in Java.

Reference material

https://blog.csdn.net/lihao21…

https://yikun.github.io/2016/…

http://afghl.github.io/2016/0…

https://zh.wikipedia.org/wiki…

https://blog.csdn.net/sunxinh…

http://blog.huanghao.me/?p=14

  • code implementation

Implementation of consistent hashing in C language

  • chord

http://101.96.10.64/db.cs.duk…

https://github.com/ChuanXia/C…

https://en.wikipedia.org/wiki…

http://www.yeolar.com/note/20…

https://github.com/ChuanXia/C…

https://github.com/netharis/C…

https://github.com/TitasNandi…

  • kademlia

http://www.yeolar.com/note/20…

https://en.wikipedia.org/wiki…