With the growth of the Internet, users generate ever more data, and enterprises face the problem of storing it at massive scale. The mainstream distributed big-data file systems on the market slice data and scatter the slices across all nodes of a cluster. This article introduces the DHT (distributed hash table) and shows how it achieves distributed, discrete storage of data.
1、 DHT (distributed hash table)
2、 Technical background
In the early days of the Internet, data was usually stored on a single server, and it grew slowly at first, so demand could be met by increasing the storage capacity of a single machine. As the Internet became widespread, the number of users and the amount of data they generated and accessed grew exponentially, beyond what a single machine could hold. There was therefore an urgent need for multiple servers to work together to store more data.
3、 Traditional hash
The traditional approach disperses data with the hash function hash(x) = x mod s, where s is the number of nodes in the cluster. While the cluster size is stable, this spreads data well across the nodes. But when the cluster is scaled out or in, s changes, which invalidates the placement of most historical data and lowers the hit rate. Restoring the hit rate then requires a large amount of data migration, which hurts performance.
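The migration cost of changing s can be sketched with a small experiment (an illustrative program with hypothetical integer keys, not the article's code): count how many keys keep their old placement when the cluster grows from 4 to 5 nodes.

```java
// Illustrative sketch (hypothetical keys): count how many of 1,000,000 keys
// still map to the same node after the cluster grows from s = 4 to s = 5
// under the traditional scheme hash(x) = x mod s.
public class ModHashDemo {
    static int locate(int key, int nodeCount) {
        return key % nodeCount; // traditional hash: node index = key mod s
    }

    public static void main(String[] args) {
        int total = 1_000_000;
        int hits = 0;
        for (int key = 0; key < total; key++) {
            if (locate(key, 4) == locate(key, 5)) {
                hits++; // this key would still be found on its old node
            }
        }
        // Only 20% of the keys keep their placement; the other 80% of the
        // historical data would miss and have to be migrated.
        System.out.printf("hit rate after scaling 4 -> 5: %.2f%%%n",
                100.0 * hits / total);
    }
}
```

Here only one key in five survives the resize, which is why the traditional scheme forces mass migration on every cluster change.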
4、 A simple DHT
A distributed hash table constructs a ring of length 2^32 (the size of the IPv4 address space) and hashes nodes onto the ring. Different hash algorithms produce different hash values; this article uses the FNV hash algorithm to compute them.
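The FNV1_32 variant commonly used in consistent-hashing write-ups can be sketched as follows (an illustrative implementation with extra avalanche mixing; the article's exact code may differ):

```java
// A common Java rendering of the FNV1_32 hash with extra mixing steps,
// clamped to [0, 2^32) so every value lands on the ring. Illustrative
// sketch; the article's exact implementation may differ.
public class FnvHash {
    public static long fnv1_32(String key) {
        final int FNV_PRIME = 16777619;
        long hash = 2166136261L; // FNV offset basis
        for (int i = 0; i < key.length(); i++) {
            hash = (hash ^ key.charAt(i)) * FNV_PRIME;
        }
        // Extra mixing commonly added to improve dispersion.
        hash += hash << 13;
        hash ^= hash >>> 7;
        hash += hash << 3;
        hash ^= hash >>> 17;
        hash += hash << 5;
        return hash & 0xFFFFFFFFL; // position on the 2^32 ring
    }

    public static void main(String[] args) {
        System.out.println(fnv1_32("node1")); // a position on the ring
    }
}
```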
As shown in the figure, a storage node is placed on the DHT ring according to its hash value. The interval between a node and its predecessor is that node's data partition: any data whose hash value falls in the interval is stored on that node.
Data is then hashed onto the DHT ring with the same hash algorithm. Once a data item lands on the ring, the nearest node in the clockwise direction is chosen as its storage node. As shown below, data ObjectA falls to node NodeA and data ObjectB falls to node NodeB.
The source code for initializing DHT is as follows:
First, a map is defined to hold the cluster node metadata, i.e. the physical nodes that have joined the DHT ring. Then a sorted map named vnodes is defined to represent the DHT ring itself, storing each node's position on it. With these two structures we have a simple DHT ring; cluster nodes join it through the addPhysicalNode method, which computes the node's hash value and stores it in vnodes.
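The structures just described can be sketched like this (a minimal illustration; apart from addPhysicalNode and vnodes, which the article names, the class and field names are assumptions):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of the ring described above. addPhysicalNode and vnodes
// follow the article's naming; everything else is illustrative.
public class SimpleDht {
    // Metadata of the physical nodes that have joined the ring.
    private final Map<String, String> physicalNodes = new HashMap<>();
    // The DHT ring: ring position -> node name, kept sorted (clockwise order).
    private final TreeMap<Long, String> vnodes = new TreeMap<>();

    public void addPhysicalNode(String nodeName) {
        physicalNodes.put(nodeName, nodeName);
        vnodes.put(hash(nodeName), nodeName); // place the node on the ring
    }

    public int ringSize() {
        return vnodes.size();
    }

    // FNV1_32 hash with extra mixing, clamped to the [0, 2^32) ring.
    static long hash(String key) {
        final int FNV_PRIME = 16777619;
        long h = 2166136261L;
        for (int i = 0; i < key.length(); i++) {
            h = (h ^ key.charAt(i)) * FNV_PRIME;
        }
        h += h << 13; h ^= h >>> 7; h += h << 3; h ^= h >>> 17; h += h << 5;
        return h & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        SimpleDht dht = new SimpleDht();
        for (int i = 1; i <= 4; i++) {
            dht.addPhysicalNode("node" + i); // initialize four storage nodes
        }
        System.out.println("positions on the ring: " + dht.ringSize());
    }
}
```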
Initialize four storage nodes.
Then 100 pieces of data are inserted through the countNodeValue method. When writing each item, its hash value is used to find the nearest node clockwise on the DHT ring, and the data is written to that node.
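A countNodeValue-style insertion loop can be sketched as follows (illustrative; the "key0"-style keys and the per-node counters are assumptions):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of countNodeValue: hash each data item onto the ring and charge it
// to the nearest node clockwise. Keys ("key0", "key1", ...) are hypothetical.
public class CountNodeValueDemo {
    // FNV1_32 hash with extra mixing, clamped to the [0, 2^32) ring.
    static long hash(String key) {
        final int FNV_PRIME = 16777619;
        long h = 2166136261L;
        for (int i = 0; i < key.length(); i++) {
            h = (h ^ key.charAt(i)) * FNV_PRIME;
        }
        h += h << 13; h ^= h >>> 7; h += h << 3; h ^= h >>> 17; h += h << 5;
        return h & 0xFFFFFFFFL;
    }

    // Nearest node clockwise: the smallest ring position >= h, wrapping around.
    static String locate(TreeMap<Long, String> ring, long h) {
        Map.Entry<Long, String> e = ring.ceilingEntry(h);
        return e != null ? e.getValue() : ring.firstEntry().getValue();
    }

    static Map<String, Integer> distribute(int count, String... nodes) {
        TreeMap<Long, String> ring = new TreeMap<>();
        for (String n : nodes) ring.put(hash(n), n);
        Map<String, Integer> load = new HashMap<>();
        for (int i = 0; i < count; i++) {
            load.merge(locate(ring, hash("key" + i)), 1, Integer::sum);
        }
        return load;
    }

    public static void main(String[] args) {
        // Insert 100 items into a 4-node ring and print the distribution.
        System.out.println(distribute(100, "node1", "node2", "node3", "node4"));
    }
}
```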
After inserting 100 pieces of data, the distribution across the nodes is as follows. The four nodes are far from uniform; in fact only one node received any data (which is also partly related to the particular data that was written).
After inserting 1 million pieces of data, the distribution is as follows. Although every node now holds data, the skew is still severe: about 99% of requests would be handled by node3, leaving one node overloaded while the other three sit idle.
What causes this? Looking at the hash value of each node on the DHT ring, it is easy to see that the nodes are spaced unevenly. When inserted data searches clockwise for a node, it almost always finds node3 first, so the data is written to node3. Uneven node spacing lets some nodes cover far more of the ring, producing unbalanced data.
Having covered the basic principle of a simple DHT ring, consider another question: if a simple DHT ring still suffers from data skew, is it actually any better than the traditional hash method for distributing data?
As mentioned earlier, with the traditional hash scheme a node failure forces a large amount of data across the whole cluster to be migrated, which hurts cluster performance. Can a DHT solve this problem?
Using the 1 million pieces of data allocated above, we simulate the failure of node4. As the figure below shows, only the data on node4 is migrated, and it moves entirely to node1; node2 and node3 see no migration at all. A node failure therefore affects far less data than in the traditional scheme.
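The failure scenario can be sketched as follows (illustrative program with hypothetical keys): after removing one node from the ring, every remapped key turns out to have belonged to the failed node, so only its partition moves.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the failure scenario: when a node leaves the ring, only the keys
// it owned are remapped, and they all move to its clockwise successor.
public class NodeFailureDemo {
    // FNV1_32 hash with extra mixing, clamped to the [0, 2^32) ring.
    static long hash(String key) {
        final int FNV_PRIME = 16777619;
        long h = 2166136261L;
        for (int i = 0; i < key.length(); i++) {
            h = (h ^ key.charAt(i)) * FNV_PRIME;
        }
        h += h << 13; h ^= h >>> 7; h += h << 3; h ^= h >>> 17; h += h << 5;
        return h & 0xFFFFFFFFL;
    }

    static String locate(TreeMap<Long, String> ring, long h) {
        Map.Entry<Long, String> e = ring.ceilingEntry(h);
        return e != null ? e.getValue() : ring.firstEntry().getValue();
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        for (String n : new String[]{"node1", "node2", "node3", "node4"}) {
            ring.put(hash(n), n);
        }
        TreeMap<Long, String> after = new TreeMap<>(ring);
        after.remove(hash("node4")); // node4 fails and leaves the ring

        int moved = 0;
        for (int i = 0; i < 100_000; i++) {
            long h = hash("key" + i);
            String before = locate(ring, h);
            String now = locate(after, h);
            if (!before.equals(now)) {
                moved++;
                if (!before.equals("node4")) { // only node4's keys may move
                    throw new IllegalStateException("unexpected migration");
                }
            }
        }
        System.out.println("keys migrated after node4 failure: " + moved);
    }
}
```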
5、 Improvements to DHT
1. Virtual nodes
How can we solve the problem of data skew?
（1） Adding more cluster nodes
The simplest and most direct approach is to add cluster nodes: hashing more nodes onto the DHT ring makes the nodes more evenly distributed, keeping the intervals between them as balanced as possible. Below is the data distribution with 10 nodes and with 20 nodes in the cluster.
Clearly, adding nodes cannot fundamentally solve data skew, and it raises the cluster's equipment and maintenance costs. It also introduces a serious problem: if node20 fails, all of node20's data migrates to the next node, which in turn skews the cluster's data. The node holding more data then handles more I/O requests, easily becoming a hot spot and a performance bottleneck that drags down the whole cluster.
（2） Introducing virtual nodes
To solve the problem of data skew, the concept of the virtual node is introduced. A virtual node is a logical replica of a real node. As shown in the figure, node NodeA is hashed three times, forming virtual nodes NodeA1, NodeA2 and NodeA3. When NodeA fails, the data that pointed to NodeA is redirected to NodeB and NodeC.
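Virtual nodes can be sketched by hashing each physical node onto the ring several times under derived names (the "#i" suffix scheme and helper names below are assumptions, not the article's code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of virtual nodes: each physical node is hashed onto the ring k
// times under derived names such as "node1#0". Suffix scheme is illustrative.
public class VirtualNodeDemo {
    // FNV1_32 hash with extra mixing, clamped to the [0, 2^32) ring.
    static long hash(String key) {
        final int FNV_PRIME = 16777619;
        long h = 2166136261L;
        for (int i = 0; i < key.length(); i++) {
            h = (h ^ key.charAt(i)) * FNV_PRIME;
        }
        h += h << 13; h ^= h >>> 7; h += h << 3; h ^= h >>> 17; h += h << 5;
        return h & 0xFFFFFFFFL;
    }

    static TreeMap<Long, String> buildRing(String[] nodes, int virtualCopies) {
        TreeMap<Long, String> ring = new TreeMap<>();
        for (String node : nodes) {
            for (int i = 0; i < virtualCopies; i++) {
                // Each virtual node maps back to its physical node.
                ring.put(hash(node + "#" + i), node);
            }
        }
        return ring;
    }

    static String locate(TreeMap<Long, String> ring, long h) {
        Map.Entry<Long, String> e = ring.ceilingEntry(h);
        return e != null ? e.getValue() : ring.firstEntry().getValue();
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring =
                buildRing(new String[]{"node1", "node2", "node3", "node4"}, 100);
        // Spread 1,000,000 hypothetical keys over the physical nodes.
        Map<String, Integer> load = new HashMap<>();
        for (int i = 0; i < 1_000_000; i++) {
            load.merge(locate(ring, hash("key" + i)), 1, Integer::sum);
        }
        System.out.println(load); // per-physical-node distribution
    }
}
```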
When the number of virtual nodes is 100, the data is already spread across every node. With enough virtual nodes, the data becomes balanced.
Data distribution when the number of virtual nodes is 10,000:
Data distribution when the number of virtual nodes is 1 million:
When node3 fails, its data is distributed evenly to the other nodes, without introducing new data skew.
2. Load boundary factor
Is everything perfect now? We initialize a four-node DHT ring, set the number of virtual nodes to 100, insert 100 pieces of data, and print the DHT ring's metadata:
Even with virtual nodes, the nodes still cannot be hashed onto the DHT ring in a perfectly balanced way, so node2 is overloaded while another node sits idle. Consider an extreme case: if all the data hashes into interval A, and only NodeA serves that interval, data skew still occurs. To solve this, we introduce the concept of the load boundary factor. With four nodes deployed on the DHT ring and 100 pieces of data to insert, each node's capacity is bounded at 100 / 4 + 1 = 26 on average. When a node reaches its bound during data mapping, the data is mapped to the next node instead. The code implementation follows.
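The load boundary factor can be sketched as a capacity check during mapping: a node that has reached its bound passes the data to the next node clockwise (method and variable names below are illustrative assumptions):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the load boundary factor: each node may hold at most
// total/nodes + 1 items; a full node passes data to the next one clockwise.
public class LoadBoundDemo {
    // FNV1_32 hash with extra mixing, clamped to the [0, 2^32) ring.
    static long hash(String key) {
        final int FNV_PRIME = 16777619;
        long h = 2166136261L;
        for (int i = 0; i < key.length(); i++) {
            h = (h ^ key.charAt(i)) * FNV_PRIME;
        }
        h += h << 13; h ^= h >>> 7; h += h << 3; h ^= h >>> 17; h += h << 5;
        return h & 0xFFFFFFFFL;
    }

    static String locateWithBound(TreeMap<Long, String> ring,
                                  Map<String, Integer> load,
                                  int capacity, long h) {
        Long pos = ring.ceilingKey(h);
        if (pos == null) pos = ring.firstKey(); // wrap around the ring
        // Walk clockwise until a node with spare capacity is found.
        for (int i = 0; i < ring.size(); i++) {
            String node = ring.get(pos);
            if (load.getOrDefault(node, 0) < capacity) {
                load.merge(node, 1, Integer::sum);
                return node;
            }
            pos = ring.higherKey(pos);
            if (pos == null) pos = ring.firstKey();
        }
        throw new IllegalStateException("all nodes are at capacity");
    }

    static Map<String, Integer> distribute(int count, String... nodes) {
        TreeMap<Long, String> ring = new TreeMap<>();
        for (String n : nodes) ring.put(hash(n), n);
        int capacity = count / nodes.length + 1; // e.g. 100 / 4 + 1 = 26
        Map<String, Integer> load = new HashMap<>();
        for (int i = 0; i < count; i++) {
            locateWithBound(ring, load, capacity, hash("key" + i));
        }
        return load;
    }

    public static void main(String[] args) {
        System.out.println(distribute(100, "node1", "node2", "node3", "node4"));
    }
}
```

Because the bound is count/nodes + 1, no node can take more than its fair share plus one, so overflow from a hot interval spills to the next node instead of piling up.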
When the load boundary factor switch is turned on:
After the load boundary factor switch is turned on, the data is well balanced.
6、 Thoughts on DHT
The above is only a simple DHT with simplified data. Every read and write must consult the DHT ring, so how can its read/write performance be improved? How can its reliability be improved? When a node fails, how should its data be migrated to a new node? How should data be backed up, and how do we ensure replicas are not concentrated on one node? This article only introduces the basic ideas of DHT; the many further challenges of production environments are not covered here.
As you can see, a DHT offers an approach to load balancing: by exploiting the properties of hash algorithms, it spreads data or business requests across the nodes of a cluster, improving the fault tolerance of the system.
vivo User operation development team