Consistent hash algorithm
Balance means that hash results are spread across all buffers as evenly as possible, so that all buffer space is used. Many hash algorithms satisfy this condition.
Monotonicity means that if some content has already been assigned to buffers through the hash and new buffers are then added to the system, the hash result should guarantee that previously assigned content either stays in its original buffer or moves to a new buffer. It must not be remapped to a different buffer in the old buffer set.
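As a contrast that the text does not discuss, the sketch below checks monotonicity for a naive "hash mod N" placement. The helper names, the MD5-based hash, and the content names are assumptions made for illustration only. It counts keys that move from one old buffer to another old buffer when a fifth buffer is added, which is exactly what monotonicity forbids.

```python
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic 32-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")

def modulo_buffer(key: str, n_buffers: int) -> int:
    """Naive placement: buffer index = hash mod number of buffers."""
    return stable_hash(key) % n_buffers

keys = [f"content-{i}" for i in range(1000)]    # hypothetical content names
before = {k: modulo_buffer(k, 4) for k in keys}  # four buffers: 0..3
after = {k: modulo_buffer(k, 5) for k in keys}   # one new buffer: index 4

# Monotonicity allows a key to stay put or to move to the NEW buffer only.
violations = [k for k in keys if after[k] != before[k] and after[k] != 4]
print(f"{len(violations)} of {len(keys)} keys moved between old buffers")
```

A large share of the keys ends up in a different old buffer, so this placement is not monotonic.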
In a distributed environment, a terminal may not see all the buffers, only some of them. When terminals map content to buffers through the hash, different terminals may see different sets of buffers, so their hash results can disagree and the same content ends up mapped to different buffers by different terminals. This should clearly be avoided, because storing the same content in several buffers reduces the storage efficiency of the system. Dispersion is defined as the severity of this situation. A good hash algorithm should avoid such inconsistencies as far as possible, that is, keep dispersion as low as possible.
The load problem looks at dispersion from another angle. Since different terminals may map the same content to different buffers, a given buffer may also have different content mapped to it by different terminals. As with dispersion, this should be avoided, so a good hash algorithm should keep the load on each buffer as low as possible.
2. The whole space is organized in a clockwise direction, from 0 to 2^32 - 1, and 0 and 2^32 - 1 coincide at the zero position, which closes the ring. The next step is to hash each server with the hash function. Specifically, you can use the IP address or host name of the server as the key to hash, so that each machine gets a position on the hash ring. Here, it is assumed that the positions of the four servers in the ring space after hashing their IP addresses are as follows:
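A minimal sketch of this step, assuming MD5 truncated to 32 bits as the hash function and four made-up IP addresses (neither is specified in the text):

```python
import hashlib

RING_SIZE = 2**32  # positions run from 0 to 2^32 - 1

def ring_position(key: str) -> int:
    """Map a key to a ring position by taking the first 4 bytes of its MD5 digest."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # always below 2^32

# Hypothetical server IPs, used only for illustration.
servers = {
    "node 1": "192.168.0.1",
    "node 2": "192.168.0.2",
    "node 3": "192.168.0.3",
    "node 4": "192.168.0.4",
}

# The ring is the list of (position, server) pairs sorted by position.
ring = sorted((ring_position(ip), name) for name, ip in servers.items())
for pos, name in ring:
    print(f"{name} sits at position {pos}")
```

Any hash that spreads keys evenly over the 32-bit range would do here; MD5 is used only because it is deterministic and available in the standard library.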
3. Next, data is located on a server as follows: the same hash function is used to compute the hash value of the data key, which determines the position of the data on the ring. From this position, “walk” clockwise along the ring. The first server encountered is the server the data belongs to.
For example, we have four data objects: A, B, C, and D. After hash calculation, their positions in the ring space are as follows:
4. According to the consistent hash algorithm, data A is assigned to node 1, data B to node 2, data C to node 3, and data D to node 4.
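The clockwise walk can be sketched with a binary search over the sorted server positions. The positions below are picked by hand so that the result matches the assignment described above; they are assumptions, not values from the original figure.

```python
from bisect import bisect_left

RING_SIZE = 2**32

# Hypothetical ring positions, chosen by hand for illustration.
server_ring = [
    (1_000_000_000, "node 1"),
    (2_000_000_000, "node 2"),
    (3_000_000_000, "node 3"),
    (4_000_000_000, "node 4"),
]
positions = [pos for pos, _ in server_ring]

def locate(data_pos: int) -> str:
    """Walk clockwise from data_pos and return the first server at or after it."""
    i = bisect_left(positions, data_pos)
    # Past the last server, wrap around the ring to the first one.
    return server_ring[i % len(server_ring)][1]

# Hypothetical positions of the four data objects.
data_positions = {
    "A": 500_000_000,
    "B": 1_500_000_000,
    "C": 2_500_000_000,
    "D": 3_500_000_000,
}
placement = {name: locate(pos) for name, pos in data_positions.items()}
print(placement)
```

A data position beyond node 4, for example 4_200_000_000, wraps around and lands on node 1, which is what makes the space a ring rather than a line.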
The following analyzes the fault tolerance and scalability of the consistent hash algorithm. Suppose that node 3 goes down. Data A, B and D are not affected; only data C is relocated to node 4. In general, in the consistent hash algorithm, if a server becomes unavailable, the only data affected is the data between this server and the previous server in the ring space (that is, the first server encountered when walking counterclockwise). Other data is not affected.
Consider another case: a server, node X, is added to the system, as shown in the following figure:
5. At this time, objects A, B and D are not affected. Only object C needs to be relocated to the new node X. In general, in the consistent hash algorithm, if a server is added, the only data affected is the data between the new server and the previous server in the ring space (that is, the first server encountered when walking counterclockwise). Other data is not affected.
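Both cases, removal and addition, can be checked with a small ring class. This is a sketch under assumptions: the class name `HashRing`, the MD5-based `ring_position` helper, and the object names are all hypothetical, and servers are hashed by name rather than by IP.

```python
import hashlib
from bisect import bisect_left

def ring_position(key: str) -> int:
    """Assumed hash: first 4 bytes of MD5, giving a value in 0..2^32-1."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")

class HashRing:
    def __init__(self, servers=()):
        self._owner = {}      # ring position -> server name
        self._positions = []  # sorted ring positions
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        self._owner[ring_position(server)] = server
        self._positions = sorted(self._owner)

    def remove(self, server: str) -> None:
        del self._owner[ring_position(server)]
        self._positions = sorted(self._owner)

    def locate(self, key: str) -> str:
        """First server clockwise from the key's position, wrapping at the end."""
        i = bisect_left(self._positions, ring_position(key))
        return self._owner[self._positions[i % len(self._positions)]]

keys = [f"object-{i}" for i in range(2000)]  # hypothetical data keys
ring = HashRing(["node 1", "node 2", "node 3", "node 4"])
original = {k: ring.locate(k) for k in keys}

# Case 1: node 3 goes down. Only keys that lived on node 3 should move.
ring.remove("node 3")
after_removal = {k: ring.locate(k) for k in keys}
moved_on_removal = [k for k in keys if after_removal[k] != original[k]]

# Case 2: restore node 3, then add node X. Moved keys should all land on node X.
ring.add("node 3")
ring.add("node X")
after_addition = {k: ring.locate(k) for k in keys}
moved_on_addition = [k for k in keys if after_addition[k] != original[k]]

print(len(moved_on_removal), "keys moved on removal")
print(len(moved_on_addition), "keys moved on addition")
```

Re-sorting the positions on every change keeps the sketch short; a production implementation would more likely maintain a sorted structure incrementally.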
In summary, when nodes are added or removed, the consistent hash algorithm only needs to relocate a small part of the data in the ring space, which gives it good fault tolerance and scalability. However, when there are too few service nodes, the ring segments owned by the nodes can be very uneven, which easily causes data skew.
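For example, suppose there are only two servers in the system and they sit close together on the ring. Most of the data is then located on one server and very little on the other. To address this, the consistent hash algorithm uses virtual nodes: several hashes are computed for each server, and the server is placed at each of the resulting positions. This can be implemented by appending a number to the server IP or host name.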
For example, in the above case, three virtual nodes can be computed for each server by hashing suffixed names such as “node 1#1”, “node 1#2”, “node 1#3”, “node 2#1”, “node 2#2” and “node 2#3” (the “#n” suffix is just one possible format), forming six virtual nodes. Data is located in the same way as before, with one extra step that maps the virtual node back to its real server.
This solves the problem of data skew when there are few service nodes. In practical applications, the number of virtual nodes is usually set to 32 or more, so even a few service nodes can achieve a relatively uniform data distribution.
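The effect of virtual nodes can be observed with a short experiment. The sketch below assumes an MD5-based hash, a "#" separator between server name and virtual node index, and made-up object keys. It counts how many keys each of two servers receives for different numbers of virtual nodes.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

def ring_position(key: str) -> int:
    """Assumed hash: first 4 bytes of MD5, giving a value in 0..2^32-1."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")

def build_ring(servers, vnodes):
    """Place `vnodes` virtual nodes per server by hashing "<server>#<i>"."""
    ring = {}
    for server in servers:
        for i in range(vnodes):
            # Each virtual node position maps back to its real server.
            ring[ring_position(f"{server}#{i}")] = server
    return ring

def load(servers, vnodes, keys):
    """Count how many keys each real server receives."""
    ring = build_ring(servers, vnodes)
    positions = sorted(ring)
    counts = Counter({server: 0 for server in servers})
    for key in keys:
        i = bisect_left(positions, ring_position(key)) % len(positions)
        counts[ring[positions[i]]] += 1
    return counts

servers = ["node 1", "node 2"]
keys = [f"object-{i}" for i in range(10000)]  # hypothetical data keys
for vnodes in (1, 3, 32, 200):
    print(vnodes, "virtual nodes per server:", dict(load(servers, vnodes, keys)))
```

With only one or a few positions per server the split depends heavily on where those positions happen to fall, while with many virtual nodes the two counts tend to move toward an even split.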