In 2011, CSDN suffered a database leak: the site was attacked by hackers, and the registered email addresses and plaintext passwords of more than 6 million users were exposed. Many netizens were angry that CSDN had stored user passwords in plaintext. If you were a CSDN engineer, how would you store data as sensitive as user passwords? Is hashing them with MD5 enough? To answer these questions, we first need to understand hash algorithms.
Hash algorithms have a long history, and the industry has produced many famous ones, such as MD5 and SHA. In everyday development we almost always use ready-made implementations, so in this article we will focus on how to apply hash algorithms to solve practical problems.
What is a hash algorithm?
First, a note on terminology: in Chinese, the English word “hash” has two common translations, 哈希 and 散列, which is why a hash table and a hash algorithm each go by two Chinese names; they mean exactly the same thing. So what is a hash algorithm? The definition and principle are simple enough to summarize in one sentence: a hash algorithm is a rule that maps a binary string of arbitrary length to a binary string of fixed length, and the fixed-length string produced from the original data is its hash value. Designing an excellent hash algorithm, however, is not easy. It needs to meet several requirements:
- The original data cannot be derived in reverse from the hash value (which is why a hash algorithm is also called a one-way hash algorithm);
- It is very sensitive to the input data: even if only one bit of the original data changes, the resulting hash value is completely different;
- The probability of hash collisions is very small: for different original data, the probability of producing the same hash value is very low;
- The algorithm must be efficient to execute, so that even long texts can be hashed quickly.
These definitions and requirements are rather theoretical, so let's make them concrete with MD5. If we compute MD5 hashes for the strings "let me talk about hash algorithm today" and "jiajia", we get two seemingly random strings (an MD5 hash is 128 bits long; for readability it is shown here in hexadecimal). Notice that no matter how long or short the input is, the MD5 hash has the same length, and the result looks like a pile of random digits with no discernible pattern.
MD5("let me talk about hash algorithm today") = bb4767201ad42c74e650c1b6c03d78fa
MD5("jiajia") = cd611a31ea969b908932d44d126d195b
Now look at two very similar inputs, "I'll talk about hash algorithm today!" and "I'll talk about hash algorithm today", which differ only by an exclamation mark. Computing their MD5 hashes, you will find that although the inputs differ by a single character, the hash values are completely different.
MD5("I'll talk about hash algorithm today!") = 425f0d5a917188d2c3dc85b5e4f2cb
MD5("I'll talk about hash algorithm today") = a1fb91ac128e6aa37fe42c663971ac3d
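You can reproduce this avalanche behavior with Python's standard `hashlib` module. A minimal sketch (the digests it prints differ from the values shown above, which were computed over the original Chinese text):

```python
import hashlib

def md5_hex(text: str) -> str:
    """Return the MD5 digest of a UTF-8 string as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

a = md5_hex("I'll talk about hash algorithm today!")
b = md5_hex("I'll talk about hash algorithm today")

# The inputs differ by a single character, yet the digests share no
# visible structure -- and both are exactly 32 hex digits long.
print(a)
print(b)
print(len(a), len(b))  # 32 32
```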
It is also very hard to go the other way: given the hash value "a1fb91ac128e6aa37fe42c663971ac3d" from the example above, it is very difficult to recover the original text "I'll talk about hash algorithm today".

Finally, the input to a hash algorithm can be anything, including very long text. If computing the hash of a long input took a long time, the algorithm would remain a theoretical curiosity and never make it into real software. In practice, hashing this article of more than 4,000 Chinese characters with MD5 takes less than 1 ms.

Hash algorithms have a wide range of applications. The seven most common are secure encryption, unique identification, data verification, hash functions, load balancing, data sharding, and distributed storage.
Application 1: secure encryption
When it comes to applications of hash algorithms, the first that comes to mind is probably secure encryption. The hash algorithms most commonly used for this purpose are MD5 (Message-Digest algorithm 5) and SHA (Secure Hash Algorithm). Besides these, there are of course many other cryptographic algorithms, such as DES (Data Encryption Standard) and AES (Advanced Encryption Standard).

Of the four requirements listed earlier, two matter most for a cryptographic hash: first, it must be hard to derive the original data from the hash value; second, the probability of a hash collision must be very small. The first point is easy to understand: the purpose of hashing passwords and similar data is to prevent the original data from leaking, so irreversibility is the most basic requirement. Let me focus on the second point.

In fact, no hash algorithm can avoid collisions entirely; we can only make them extremely unlikely. Why? This follows from a basic result in combinatorics, the pigeonhole principle (also known as the drawer principle): if 11 pigeons roost in 10 pigeonholes, at least one pigeonhole must hold more than one pigeon.

With that in mind, why can't a hash algorithm achieve zero collisions? The hash values it produces have a fixed, finite length. MD5, for example, always produces a 128-bit string, which can represent at most 2^128 distinct values, while the data we may want to hash is unlimited. By the pigeonhole principle, if we hash 2^128 + 1 distinct inputs, at least two of them must share a hash value.
From this you should also see that, in general, the longer the hash value, the lower the probability of a hash collision.
In fact, concrete MD5 collisions have been found: researchers have published pairs of different strings whose MD5 hashes are identical.
Even so, a hash algorithm that admits collisions is still hard to exploit, because the range of hash values is huge and the collision probability is tiny. MD5 has 2^128 possible hash values, an astronomical number, so the probability that two inputs collide is on the order of 1/2^128. If you were given an MD5 hash and tried to find another input with the same hash by brute-force enumeration, the time required would also be astronomical. So even though collisions exist in theory, cracking a hash is very difficult within limited time and resources.

Note, too, that no encryption is absolutely secure: the more complex and harder-to-crack an algorithm is, the longer it takes to compute. For example, SHA-256 is more complex and more secure than SHA-1, and correspondingly slower. Cryptographers keep searching for algorithms that are both fast and hard to break, and in real development we likewise have to weigh cracking difficulty against computation time when choosing an encryption algorithm.
Application 2: unique identification
Suppose we want to check whether a given picture already exists in a huge picture library. We cannot simply compare metadata such as the file name, because two files can share a name but have different contents, or have different names but identical contents. So how do we search?

Any file is ultimately a string of binary data, so the brute-force approach is to compare the binary contents of the candidate picture with those of every picture in the library, one by one. But pictures range from tens of KB to several MB, so their binary strings are very long and comparing them all is extremely slow. Is there a faster way?

We can give each picture a unique identifier, a kind of information digest. For example, take 100 bytes from the beginning of the picture's binary data, 100 bytes from the middle, and 100 bytes from the end, concatenate these 300 bytes, and run them through a hash algorithm such as MD5. The resulting hash string serves as the picture's unique identifier, and comparing identifiers instead of whole files saves a great deal of work.

To improve efficiency further, we can store each picture's unique identifier, together with the path of its file, in a hash table. To check whether a picture is in the library, we first compute its identifier and look it up in the hash table. If the identifier is absent, the picture is not in the library. If it is present, we fetch the stored file path, load that picture, and compare it with the candidate to see whether the two are exactly the same.
If they are identical, the picture already exists; if they differ, then even though the two pictures share the same unique identifier, they are not the same picture.
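The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a production fingerprinting scheme; the function and variable names are made up for this example:

```python
import hashlib
import os

def picture_fingerprint(path: str) -> str:
    """Sample 100 bytes from the start, middle, and end of the file,
    concatenate the 300 bytes, and take their MD5 digest."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(100)
        f.seek(max(size // 2 - 50, 0))
        middle = f.read(100)
        f.seek(max(size - 100, 0))
        tail = f.read(100)
    return hashlib.md5(head + middle + tail).hexdigest()

# The in-memory index described in the text: fingerprint -> file path.
gallery: dict = {}

def add_picture(path: str) -> None:
    gallery[picture_fingerprint(path)] = path

def maybe_in_gallery(path: str) -> bool:
    """A hit only means 'probably present'; a full byte-for-byte
    comparison is still needed to rule out a fingerprint collision."""
    return picture_fingerprint(path) in gallery
```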
Application 3: data verification
You have probably used BitTorrent download software such as eMule. BT downloads are based on the P2P protocol: a 2 GB movie is downloaded in parallel from multiple machines, with the file split into many blocks (say 100 blocks of about 20 MB each) that are reassembled into the complete file once all of them have arrived.

Network transmission is not secure, however. A downloaded block may have been maliciously modified by the host machine, or corrupted by an error during transfer, leaving it incomplete. If we cannot detect such tampering or download errors, the merged movie will be unplayable and might even infect the computer. So how do we verify that each file block is safe, correct, and complete?

The full BT protocol is complex and uses several verification mechanisms; I'll describe one of them. Using a hash algorithm, we compute the hash of each of the 100 file blocks and store these hashes in the seed file. As mentioned earlier, hash algorithms are extremely sensitive to their input: change a file block even slightly and its hash changes completely. So after a block is downloaded, we compute its hash with the same algorithm and compare it with the hash saved in the seed file. If they differ, the block is incomplete or has been tampered with, and must be downloaded again from another host.
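The verification step above can be sketched as follows. This is a toy simulation of the idea, not the real BT protocol (which uses SHA-1 piece hashes, not MD5); the block contents are made up:

```python
import hashlib

def block_hashes(blocks: list) -> list:
    """The hashes stored in the 'seed file': one MD5 digest per block."""
    return [hashlib.md5(b).hexdigest() for b in blocks]

def verify_block(index: int, data: bytes, seed_hashes: list) -> bool:
    """Re-hash a downloaded block and compare with the seed file's value.
    A mismatch means the block is incomplete or was tampered with."""
    return hashlib.md5(data).hexdigest() == seed_hashes[index]

# Hypothetical usage: a 3-block "file" where block 1 was corrupted in transit.
original = [b"block-0", b"block-1", b"block-2"]
seed = block_hashes(original)
downloaded = [b"block-0", b"block-X", b"block-2"]
bad = [i for i, d in enumerate(downloaded) if not verify_block(i, d, seed)]
print(bad)  # [1] -- only this block needs to be downloaded again
```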
Application 4: hash function
The hash function is itself an application of hash algorithms. The hash function is the key to designing a hash table: it directly determines the probability of collisions and the table's performance. Compared with the other applications, however, a hash function's demands on collision resistance are much lower. Occasional collisions are fine as long as they are not severe, since they can be resolved with open addressing or chaining. Nor does a hash function care whether the computed value can be reversed. What it cares about is whether the hashed values are evenly distributed, that is, whether a set of keys spreads uniformly across the slots. In addition, the execution speed of the hash function directly affects hash table performance, so the hash algorithms used here are generally simple and optimized for speed.
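A classic example of such a simple, fast hash function is djb2, shown here as a sketch (djb2 is not mentioned in the text; it is one well-known choice among many):

```python
def djb2(key: str, num_slots: int) -> int:
    """djb2, a classic simple string hash often used for hash tables:
    cheap to compute, and spreads typical keys fairly evenly."""
    h = 5381
    for ch in key:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF  # keep it a 32-bit value
    return h % num_slots

# Distribute some keys over 13 slots and inspect the spread.
keys = [f"user-{i}" for i in range(1000)]
counts = [0] * 13
for k in keys:
    counts[djb2(k, 13)] += 1
print(counts)  # roughly 77 per slot if the distribution is even
```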
Application 5: load balancing
There are many load-balancing algorithms, such as round-robin, random, and weighted round-robin. How would you implement a session-sticky load balancer, one that routes all requests within the same client session to the same server? The most direct approach is to maintain a mapping table from client IP address (or session ID) to server number. For each request, look up the server number in the table and forward the request to that server. This is simple and intuitive, but it has several drawbacks:
- If there are many clients, the mapping table may grow very large, wasting memory;
- When clients go offline or come online, or servers are added or removed, entries become stale, so maintaining the table is costly.
A hash algorithm solves these problems neatly: compute the hash of the client's IP address or session ID, then take it modulo the size of the server list; the result is the server number to route to. This way, every request from the same IP is routed to the same back-end server.
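The routing rule above fits in a few lines. A minimal sketch, with hypothetical server addresses (MD5 is used here only as a convenient, stable hash):

```python
import hashlib

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical back ends

def route(client_ip: str, servers: list) -> str:
    """Session-sticky routing: hash the client IP and take it modulo the
    number of servers, so the same client always lands on the same server."""
    h = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

# The same client is routed to the same server on every request.
print(route("203.0.113.7", SERVERS) == route("203.0.113.7", SERVERS))  # True
```

Note that this simple modulo scheme breaks down when the server list changes, which is exactly the problem the consistent hashing discussion at the end of this article addresses.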
Application 6: data sharding
Hash algorithms can also be used for data sharding. Here are two examples.
- How do we count the number of occurrences of each search keyword?
Suppose we have a 1 TB log file recording users' search keywords, and we want to count quickly how many times each keyword was searched. What should we do?

There are two difficulties. First, the log is too large to fit in one machine's memory. Second, processing such a huge volume of data on a single machine would take far too long. We can address both by sharding the data and processing the shards on multiple machines in parallel.

The idea is as follows. To speed things up, we use n machines. We read the search keywords from the log one by one, compute each keyword's hash value with a hash function, and take it modulo n; the result is the number of the machine that keyword is assigned to. Keywords with the same hash value, and therefore all occurrences of the same keyword, land on the same machine. Each machine counts the occurrences of the keywords it receives, and combining the per-machine counts gives the final result. This processing pattern is, in essence, the basic design idea behind MapReduce.
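The sharding-and-counting scheme can be simulated on a single machine. A sketch with a made-up log and a hypothetical cluster size (MD5 is used instead of Python's built-in `hash`, which is randomized between processes):

```python
import hashlib
from collections import Counter

N_MACHINES = 4  # hypothetical cluster size

def machine_for(keyword: str) -> int:
    """Shard by hash: every occurrence of a keyword goes to one machine."""
    h = int(hashlib.md5(keyword.encode("utf-8")).hexdigest(), 16)
    return h % N_MACHINES

# Simulate the "map" phase: split the log across machines by keyword hash.
log = ["apple", "book", "apple", "cat", "book", "apple"]
shards = [[] for _ in range(N_MACHINES)]
for kw in log:
    shards[machine_for(kw)].append(kw)

# Each machine counts its own shard; merging is trivial, since no
# keyword appears in more than one shard.
total = Counter()
for shard in shards:
    total.update(Counter(shard))
print(total)  # Counter({'apple': 3, 'book': 2, 'cat': 1})
```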
- How do we quickly determine whether a picture is in the gallery?
We discussed earlier how to decide quickly whether a picture is in the gallery by building a hash table of unique identifiers. But suppose the gallery contains 100 million pictures. Building that hash table on a single machine is clearly infeasible: a single machine's memory is limited, and a hash table for 100 million pictures far exceeds it.
Again we can shard the data and process it on multiple machines. We prepare n machines, each maintaining the hash table for only part of the pictures. Each time we read a picture from the library, we compute its unique identifier, take it modulo the number of machines n to get the machine number, and send the identifier and the picture's path to that machine, which inserts them into its hash table.
To determine whether a picture is in the library, we compute its unique identifier with the same hash algorithm and take it modulo n. If the result is k, we look the identifier up in the hash table built on machine number k.
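The build-and-lookup flow can be sketched by simulating each machine's hash table as a dictionary. The function names and the machine count are made up for this example:

```python
import hashlib
from typing import Optional

N_MACHINES = 3  # hypothetical number of index machines

def index_machine(unique_id: str) -> int:
    """Machine number k whose hash table holds this picture's entry."""
    return int(hashlib.md5(unique_id.encode()).hexdigest(), 16) % N_MACHINES

# One hash table (unique ID -> file path) per machine, simulated as dicts.
tables = [{} for _ in range(N_MACHINES)]

def add_picture_entry(unique_id: str, path: str) -> None:
    """Send the (ID, path) pair to its machine and insert it there."""
    tables[index_machine(unique_id)][unique_id] = path

def lookup_picture(unique_id: str) -> Optional[str]:
    """Consult only machine k's table, exactly as described above."""
    return tables[index_machine(unique_id)].get(unique_id)
```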
Now, let’s estimate how many machines it takes to build a hash table for these 100 million images.
Each entry in the hash table holds two pieces of information: the hash value and the path of the picture file. Suppose we use MD5, so the hash value is 128 bits, or 16 bytes. Assume file paths are at most 256 bytes, with an average length of 128 bytes. If we resolve collisions by chaining, we also need to store a pointer, another 8 bytes. So each entry in the hash table occupies about 152 bytes (a rough estimate, not an exact figure).
Assuming each machine has 2 GB of memory and the hash table's load factor is 0.75, one machine can index about 10 million pictures (2 GB × 0.75 / 152 bytes). To index 100 million pictures, we therefore need about ten machines. In engineering, this kind of estimation matters: it gives us an advance sense of the resources and budget required and helps us evaluate whether a solution is feasible.
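The estimate is easy to verify with back-of-the-envelope arithmetic, using only the figures from the text:

```python
# Back-of-the-envelope check of the estimate above.
entry_bytes = 16 + 128 + 8   # MD5 digest + average path + chaining pointer = 152
memory = 2 * 1024**3         # 2 GB of memory per machine
load_factor = 0.75           # fraction of the table we allow to fill

per_machine = int(memory * load_factor / entry_bytes)
machines = 100_000_000 / per_machine
print(per_machine)  # about 10.6 million entries per machine
print(machines)     # about 9.4, i.e. on the order of ten machines
```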
In fact, this sharding approach applies to massive-data problems in general: distributed, multi-machine processing lets us break through the limits of a single machine's memory, CPU, and other resources.
Application 7: distributed storage
Today's Internet faces massive data and massive numbers of users. To improve read and write throughput, data is generally stored in a distributed fashion, as in a distributed cache.
When the amount of data to cache is large, one cache machine is certainly not enough, so we need to spread the data across multiple machines.
How do we decide which data goes on which machine? We can borrow the sharding idea above: compute the data's hash value and take it modulo the number of machines; the result is the number of the cache machine that should store it. But suppose the data grows and the original 10 machines can no longer cope, so we scale out to 11. Now there is trouble, because all the mappings change. Originally data was taken modulo 10, so the data with key 13, for example, was stored on machine 13 % 10 = 3; with 11 machines, 13 % 11 = 2, and the same key now maps to machine 2.
As a result, nearly all data would have to be rehashed and moved to the right machine, instantly invalidating the cached data. Every request would miss the cache and hit the database directly, potentially triggering an avalanche effect that crushes the database.
What we need is a scheme that avoids moving large amounts of data when a machine is added. This is where consistent hashing comes in. Suppose we have k machines and the hash values of the data fall in the range [0, MAX]. We divide the whole range into m intervals (with m much greater than k), and each machine is responsible for m / k of them. When a new machine joins, we move only the data of a few intervals from existing machines to the new one. That way we neither rehash nor move all the data, and the amount of data on each machine stays balanced. That is the basic idea of consistent hashing; in practice it is implemented more elegantly with the help of a virtual ring and virtual nodes.
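The ring-with-virtual-nodes idea can be sketched in a few dozen lines. This is a minimal illustration, not a production implementation; the machine names, vnode count, and 32-bit ring size are arbitrary choices for the example:

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a key onto the hash ring [0, 2**32)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHash:
    """Minimal consistent hashing with virtual nodes: each physical
    machine owns many points on the ring, and a key is served by the
    first point clockwise from the key's own hash."""

    def __init__(self, machines: list, vnodes: int = 100):
        self.ring = sorted(
            (ring_hash(f"{m}#{i}"), m) for m in machines for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        i = bisect.bisect_right(self.points, ring_hash(key)) % len(self.points)
        return self.ring[i][1]

# Adding a machine only remaps the keys that fall into the new machine's
# ring segments; most keys keep their old assignment.
old = ConsistentHash(["m0", "m1", "m2"])
new = ConsistentHash(["m0", "m1", "m2", "m3"])
keys = [f"key-{i}" for i in range(10000)]
moved = sum(old.lookup(k) != new.lookup(k) for k in keys)
print(moved / len(keys))  # roughly 1/4 of the keys move, not all of them
```

Compare this with simple modulo sharding, where going from 3 to 4 machines would remap about 3/4 of all keys.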