Even on Qixi we still need to study: hash!

Time:2021-2-23


Preface

This article is part of the album http://dwz.win/HjK. Click through to unlock more on data structures and algorithms.

Hello, I’m brother Tong.

In the last section, we learned how to build a high-performance queue in Java, which involved a lot of low-level knowledge. I wonder how much of it you picked up?!

In this section, I want to work through everything about hashing with you: hashes, hash functions, and hash tables.

How are these three tangled up with each other?

Why does the Object class need a hashCode() method? What does it have to do with the equals() method?

How do you write a high-performance hash table?

Can the red-black tree in Java's HashMap be replaced by another data structure?

What is hash?

Hashing means turning an input of any length into a fixed-length output by means of some algorithm. The output is called a hash value, or hash code. The algorithm is called a hash algorithm, or hash function. The process itself is usually called hashing, or hash calculation. Chinese has several different translations of the word "hash".


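Since the output has a fixed length, the set of possible inputs is unbounded while the set of possible outputs is finite. It is therefore inevitable that some different inputs produce the same output. This is also why a hash algorithm is generally irreversible: a single output cannot tell you which of its many possible inputs produced it.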
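To make this concrete, here is a small sketch. The class and method names are made up for illustration. It uses the JDK's MessageDigest to show that inputs of very different lengths give a digest of the same length, and String.hashCode() to show two different inputs that share one hash code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class FixedLengthDemo {
    // Returns the length in bytes of the MD5 digest of the given text.
    static int md5Length(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return md.digest(text.getBytes(StandardCharsets.UTF_8)).length;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Any input length, same output length.
        System.out.println(md5Length(""));                                 // 16
        System.out.println(md5Length("a much, much longer piece of text")); // 16

        // Two different inputs, one hash code: 65*31 + 97 == 66*31 + 66 == 2112.
        System.out.println("Aa".hashCode());                        // 2112
        System.out.println("BB".hashCode());                        // 2112
    }
}
```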

So, what are hash algorithms used for?

Uses of hash algorithms

A hash algorithm is a general notion, more an idea than a specific formula. Any algorithm that satisfies the definition above can be called a hash algorithm.

Generally speaking, it has the following uses:

  1. Password storage: store a salted hash of the password instead of the password itself. MD5 + salt is the classic example, although strictly speaking this is hashing rather than encryption, and MD5 is now considered too weak for this job;
  2. Fast lookup: for example, a hash table lets you find elements quickly;
  3. Digital signatures: for example, signing calls between systems so that tampering with the data can be detected;
  4. File verification: for example, when you download a Tencent game, the site usually publishes an MD5 value. After the installation package is downloaded, you can calculate its MD5 value and compare it with the official one to know whether the file was damaged or tampered with during the download;
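The file verification case can be sketched in a few lines. This is only an illustration: the class name and the helper methods md5Hex and matches are assumptions, and the bytes would normally come from the downloaded file rather than from a string.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ChecksumDemo {
    // Computes the MD5 digest of the data and renders it as lowercase hex.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // True if the downloaded bytes hash to the published MD5 value.
    static boolean matches(byte[] downloaded, String publishedMd5) {
        return md5Hex(downloaded).equalsIgnoreCase(publishedMd5);
    }

    public static void main(String[] args) {
        byte[] downloaded = "hello".getBytes();
        System.out.println(md5Hex(downloaded));
        System.out.println(matches(downloaded, "5d41402abc4b2a76b9719d911017c592"));
    }
}
```

If even one byte of the download differs, the computed value will, for all practical purposes, no longer match the published one.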

Speaking of hash algorithms and hash functions: in Java, Object, the parent class of all classes, has a hash function of its own, namely the hashCode() method. Why does such a method need to be defined in the Object class?

Strictly speaking, there are some differences between a hash algorithm and a hash function. I trust you can tell them apart from context.

Let's take a look at the comment in the JDK source code.

It says that the method returns a hash code value for the object, and that it exists to better support hash tables such as HashMap. To put it simply, this method is there for hash tables like HashMap.

// The default is identity-based: the Javadoc describes it as typically derived from the object's internal address
public native int hashCode();

At this point, we have to mention another method in the Object class: equals().

// By default, the two references are compared directly
public boolean equals(Object obj) {
    return (this == obj);
}

How are hashCode() and equals() entangled?

Generally speaking, hashCode() can be regarded as a weak comparison. It goes back to the essence of hashing, which maps different inputs to fixed-length outputs. The following situations can then occur:

  1. If the inputs are the same, the outputs must be the same;
  2. If the inputs are different, the outputs may be the same or different;
  3. If the outputs are the same, the inputs may be the same or different;
  4. If the outputs are different, the inputs must be different;

equals() is the method that strictly compares whether two objects are equal. Therefore, if equals() returns true for two objects, their hashCode() values must be equal. What happens if they are not?

If equals() returns true but the hashCode() values differ, then when you use these two objects as keys of a HashMap, they will probably land in different slots. Two equal keys then end up inserted into the same HashMap, which is not allowed. This is the reason why hashCode() must be overridden whenever equals() is overridden.
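Here is a minimal sketch of that failure mode. BrokenKey and GoodKey are hypothetical classes invented for this example; the only difference between them is whether hashCode() is overridden along with equals().

```java
import java.util.HashSet;
import java.util.Set;

final class EqualsHashCodeDemo {
    // Overrides equals() only; hashCode() is still Object's identity-based default.
    static final class BrokenKey {
        final String id;
        BrokenKey(String id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof BrokenKey && ((BrokenKey) o).id.equals(id);
        }
    }

    // Overrides both, so equal keys always produce equal hash codes.
    static final class GoodKey {
        final String id;
        GoodKey(String id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof GoodKey && ((GoodKey) o).id.equals(id);
        }
        @Override public int hashCode() { return id.hashCode(); }
    }

    // Adds two equal GoodKey objects to a HashSet and returns the resulting size.
    static int countGoodKeys() {
        Set<GoodKey> set = new HashSet<>();
        set.add(new GoodKey("123"));
        set.add(new GoodKey("123"));
        return set.size();
    }

    // Same experiment with BrokenKey.
    static int countBrokenKeys() {
        Set<BrokenKey> set = new HashSet<>();
        set.add(new BrokenKey("123"));
        set.add(new BrokenKey("123"));
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(countGoodKeys());   // 1
        // Almost always 2: the two equal keys get unrelated identity hash codes.
        System.out.println(countBrokenKeys());
    }
}
```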

Take the String class as an example. We all know that its equals() method compares the contents of two strings, not their addresses. Here is its equals() method:

public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        int n = value.length;
        if (n == anotherString.value.length) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                    return false;
                i++;
            }
            return true;
        }
    }
    return false;
}

Therefore, the following two String objects are equal according to equals(), even though they are two separate objects in memory:

String a = new String("123");
String b = new String("123");
System.out.println(a.equals(b)); // true
System.out.println(a == b); // false

If String did not override hashCode(), a and b would almost certainly return different hash codes, which would get badly in the way of how often we use String as the key of a HashMap. So String overrides hashCode() as well:

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

This algorithm is also very simple. Written as a formula, it is s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], where n is the length of the string and ^ means exponentiation.
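As a quick check, the formula can be evaluated term by term and compared with what String.hashCode() returns. The class and method names are made up for this sketch. All arithmetic is done in int, so overflow wraps around in the same way as in the loop above.

```java
final class PolynomialHashDemo {
    // Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] in int arithmetic.
    static int byFormula(String s) {
        int n = s.length();
        int sum = 0;
        for (int i = 0; i < n; i++) {
            int power = 1;
            for (int k = 0; k < n - 1 - i; k++) {
                power *= 31; // 31^(n-1-i), wrapping on overflow
            }
            sum += s.charAt(i) * power;
        }
        return sum;
    }

    public static void main(String[] args) {
        // '1' = 49, '2' = 50, '3' = 51: 49*961 + 50*31 + 51 = 48690
        System.out.println(byFormula("123"));   // 48690
        System.out.println("123".hashCode());   // 48690
    }
}
```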

Well, since hash tables are mentioned many times, let’s see how hash tables evolve step by step.

Evolution of the hash table

Array

Before we talk about hash tables, let's look at the ancestor of all data structures: the array.

The array is simple enough that I won't dwell on it. Everyone understands it.

Array indexes generally start from 0, and elements are stored one after another. Looking up a given element works the same way: you can only check the elements one by one from the beginning (or from the end).

For example, if the array holds 3, 5, 4, 1 in that order, finding the element 4 takes three comparisons from the beginning.

Early hash tables

As mentioned above, the disadvantage of an array is that you can only search from the beginning or from the end until you find a match. Its average time complexity is O(n).

So, is there any way to find elements quickly while still using an array?

Clever programmers came up with one: run the element's value through a hash function and use the result to decide where the element goes in the array. That way, the time complexity of a lookup can be reduced to O(1).

For example, take four elements: 3, 5, 4, and 1. Before putting them into the array, we use the hash function to compute a position for each and place it exactly there, instead of storing the elements one after another as in a plain array, where the position comes from the insertion order rather than from the element's value.

If the array we allocate has length 8, we can define the hash function as hash(x) = x % 8, and the elements end up at positions 3, 5, 4, and 1 respectively.

Now, to find the element 4, we first compute its hash value, hash(4) = 4 % 8 = 4, and then directly return the element at position 4.
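This early hash table is small enough to sketch in full. The class below is only an illustration with invented names. It assumes non-negative int elements and does nothing at all about collisions yet.

```java
import java.util.Arrays;

final class ModuloTableDemo {
    static final int CAPACITY = 8;

    // hash(x) = x % 8, assuming x is non-negative.
    static int hash(int x) {
        return x % CAPACITY;
    }

    // Places every element at the index given by its hash value.
    static Integer[] build(int... elements) {
        Integer[] table = new Integer[CAPACITY];
        for (int e : elements) {
            table[hash(e)] = e; // no collision handling in this early version
        }
        return table;
    }

    // One hash computation and one array access: O(1).
    static boolean contains(Integer[] table, int x) {
        Integer stored = table[hash(x)];
        return stored != null && stored == x;
    }

    public static void main(String[] args) {
        Integer[] table = build(3, 5, 4, 1);
        System.out.println(Arrays.toString(table)); // [null, 1, null, 3, 4, 5, null, null]
        System.out.println(contains(table, 4));     // true
        System.out.println(contains(table, 12));    // false
    }
}
```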

Evolved hash tables

It all seems perfect. But now an element 13 arrives and wants to be inserted into the hash table. Its hash value is hash(13) = 13 % 8 = 5. What? Its computed position is also 5, but position 5 is already occupied. What should we do?

This is a hash collision.

Why do hash collisions happen?

Because the array we allocated has a finite length. If we map an unbounded set of numbers onto a finite array, collisions will happen sooner or later, that is, several elements get mapped to the same position.

Well, since hash collisions exist, we have to resolve them. There's no way around it!

How?

Linear probing

Since position 5 already has an owner, element 13 gives in: it moves one slot further and settles at position 6. This is linear probing. When there is a collision, move along one slot at a time until you find an empty one.

But then a new element 12 arrives, and its hash value is hash(12) = 12 % 8 = 4. What? This one has to move along three times, all the way to position 7, before it finds a free slot, which makes insertion very inefficient. Lookup suffers in the same way: first you go to position 4, find that it is not the element you are looking for, and keep moving along until you reach position 7.
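The walk-through above can be reproduced with a small sketch. The class is hypothetical and makes some simplifying assumptions: elements are non-negative ints, there is no deletion, and duplicates are not checked.

```java
final class LinearProbingDemo {
    static final int CAPACITY = 8;
    final Integer[] slots = new Integer[CAPACITY];

    // Inserts x and returns the index it ended up in, or -1 if the table is full.
    int insert(int x) {
        int home = x % CAPACITY;
        for (int step = 0; step < CAPACITY; step++) {
            int idx = (home + step) % CAPACITY; // wrap around at the end of the array
            if (slots[idx] == null) {
                slots[idx] = x;
                return idx;
            }
        }
        return -1;
    }

    // Follows the same probe sequence; an empty slot means x is not present.
    int indexOf(int x) {
        int home = x % CAPACITY;
        for (int step = 0; step < CAPACITY; step++) {
            int idx = (home + step) % CAPACITY;
            if (slots[idx] == null) {
                return -1;
            }
            if (slots[idx] == x) {
                return idx;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        LinearProbingDemo table = new LinearProbingDemo();
        for (int x : new int[] {3, 5, 4, 1}) {
            table.insert(x);
        }
        System.out.println(table.insert(13)); // 6: position 5 is taken, move along once
        System.out.println(table.insert(12)); // 7: positions 4, 5 and 6 are taken
        System.out.println(table.indexOf(12)); // 7
    }
}
```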

Quadratic probing

Linear probing has a big drawback: colliding elements tend to pile up next to each other. For example, with 12 sitting at position 7, a new element 14 collides at position 6 and again at position 7, runs off the end of the array, and wraps around to position 0. The colliding elements cluster together, which is bad for lookups and just as bad for inserting new elements.

At this point, another clever programmer proposed a new idea: quadratic probing. When there is a collision, instead of checking the slots one by one, I take the original hash value plus i squared, with i going through 1, 2, 3, ... in turn, until I find an empty slot.

Take the example above again and insert element 12: position 4 is taken, 4 + 1^2 = 5 is taken, and (4 + 2^2) % 8 = 0 is free, so 12 lands at position 0.

In this way, we can quickly find an empty slot for the new element, and the colliding elements no longer pile up in one place.

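However, here comes a new element, 20. Where do you put it?

It turns out it can't be placed anywhere: hash(20) = 4, and because i^2 % 8 can only be 0, 1, or 4, the probes keep landing on positions 4, 5, and 0, all of which are taken.

It is known that once more than half of the slots in the hash table are filled, quadratic probing may fail to find a position for a new element.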
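The same experiment with quadratic probing looks like this. Again this is only a sketch with invented names, assuming non-negative ints. Capping the number of attempts at the table size is an assumption of this example; it is enough here because i^2 % 8 repeats with a period of 4.

```java
final class QuadraticProbingDemo {
    static final int CAPACITY = 8;
    final Integer[] slots = new Integer[CAPACITY];

    // Probes hash(x) + i^2 for i = 0, 1, 2, ... and returns the index used,
    // or -1 if no free slot was found within CAPACITY attempts.
    int insert(int x) {
        int home = x % CAPACITY;
        for (int i = 0; i < CAPACITY; i++) {
            int idx = (home + i * i) % CAPACITY;
            if (slots[idx] == null) {
                slots[idx] = x;
                return idx;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        QuadraticProbingDemo table = new QuadraticProbingDemo();
        for (int x : new int[] {3, 5, 4, 1}) {
            table.insert(x);
        }
        System.out.println(table.insert(13)); // 6: 5 is taken, 5 + 1 is free
        System.out.println(table.insert(12)); // 0: 4 and 5 are taken, (4 + 4) % 8 is free
        // -1: the probes only ever reach 4, 5 and 0, although 2 and 7 are still empty
        System.out.println(table.insert(20));
    }
}
```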

Therefore, a new concept is introduced: capacity expansion.

What is capacity expansion?

When the number of stored elements reaches x% of the total capacity, the table needs to be expanded. This x% is called the expansion factor (Java's HashMap calls it the load factor).

Obviously, the larger the expansion factor the better, because it means the hash table uses its space more fully.

So, sadly, quadratic probing cannot meet our goal. Its expansion factor is too small, only 0.5, so half of the space is wasted.

Now it's time for the veteran programmers to show their ingenuity. After a 996 brainstorming session, they came up with a new way to implement a hash table: the linked list method.

Linked list method

It's all about resolving collisions, right? When there is a collision, I don't go looking for another slot in the array. Instead, I use a linked list to chain together all the elements that map to the same array index. This way the space is fully used, ha ha ha~~

Hey, hey, hey, perfect.

Is it really perfect? I'm a hacker, and I keep inserting elements x with x % 8 = 4. You will find that almost all the elements end up in the same linked list. Ha ha, in the end your hash table degenerates into a linked list, and both insertion and lookup become O(n).
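Both the linked list method and the attack on it fit in one short sketch. The class is hypothetical and assumes non-negative int elements; the attacker's keys are simply 4, 12, 20, and so on.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class ChainedTableDemo {
    // One linked list per array position.
    final List<LinkedList<Integer>> buckets = new ArrayList<>();

    ChainedTableDemo(int capacity) {
        for (int i = 0; i < capacity; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    // Appends x to the chain at its hash position unless it is already there.
    void insert(int x) {
        LinkedList<Integer> chain = buckets.get(x % buckets.size());
        if (!chain.contains(x)) {
            chain.add(x);
        }
    }

    // Walks the chain at the hash position: O(chain length).
    boolean contains(int x) {
        return buckets.get(x % buckets.size()).contains(x);
    }

    int chainLength(int index) {
        return buckets.get(index).size();
    }

    public static void main(String[] args) {
        ChainedTableDemo table = new ChainedTableDemo(8);
        // The attacker only inserts keys with x % 8 == 4.
        for (int k = 0; k < 100; k++) {
            table.insert(4 + 8 * k);
        }
        System.out.println(table.chainLength(4)); // 100: everything is in one list
        System.out.println(table.chainLength(5)); // 0
    }
}
```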

Of course, there is a way to deal with this too. Remember the expansion factor?

For example, set the expansion factor to 1. When the number of elements reaches 8, the table doubles in size to 16. Half of the elements stay at position 4 and the other half move to position 12, which relieves some of the pressure on the hash table.
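The effect of doubling can be checked by counting where the attacker's keys fall before and after. This is a sketch with made-up helper names; the keys are again 4, 12, 20, and so on.

```java
final class ResizeSplitDemo {
    // The first `count` keys with key % 8 == 4: 4, 12, 20, ...
    static int[] attackKeys(int count) {
        int[] keys = new int[count];
        for (int k = 0; k < count; k++) {
            keys[k] = 4 + 8 * k;
        }
        return keys;
    }

    // Counts how many keys fall into each position of a table with the given capacity.
    static int[] distribution(int[] keys, int capacity) {
        int[] counts = new int[capacity];
        for (int key : keys) {
            counts[key % capacity]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] keys = attackKeys(8);
        System.out.println(distribution(keys, 8)[4]);   // 8: all in one list
        int[] after = distribution(keys, 16);
        System.out.println(after[4]);                   // 4: half stay at position 4
        System.out.println(after[12]);                  // 4: half move to position 12
    }
}
```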

However, it is still not perfect: one linked list has merely become two.

The clever programmers started a 9127 brainstorming session and finally came up with a new structure: the linked list tree method.

Linked list tree method

Expansion does solve part of the problem: as long as the number of elements is fairly small, overall search and insertion efficiency won't be too bad.

However, the hackers keep attacking and the number of elements keeps growing. Once it grows beyond a certain point, search and insertion are bound to become very slow.

So, let's think about it differently. Since the linked list is the slow part, I'll upgrade it. When a linked list gets long, how about turning it into a red-black tree?

Well, I think that works. Let's do it.

Not bad at all. Mom no longer has to worry about me being attacked by hackers. A lookup in a red-black tree takes O(log n), which is much faster than in a linked list.
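The upgrade step can be sketched with the JDK's own collections: TreeSet is backed by TreeMap, which the JDK documents as a red-black tree. The class below is an illustration, not how java.util.HashMap is actually written. The threshold of 8 is an assumption of this sketch, and HashMap's real rules for converting a bucket are more involved.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedList;
import java.util.List;
import java.util.TreeSet;

final class TreeifyingTableDemo {
    // Assumed threshold for this sketch: longer chains are turned into trees.
    static final int TREEIFY_THRESHOLD = 8;

    // Each bucket starts as a linked list and may later become a TreeSet.
    final List<Collection<Integer>> buckets = new ArrayList<>();

    TreeifyingTableDemo(int capacity) {
        for (int i = 0; i < capacity; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    void insert(int x) {
        int idx = x % buckets.size();
        Collection<Integer> bucket = buckets.get(idx);
        if (bucket.contains(x)) {
            return;
        }
        bucket.add(x);
        // Upgrade a long linked list to a red-black tree (TreeSet).
        if (bucket instanceof LinkedList && bucket.size() > TREEIFY_THRESHOLD) {
            buckets.set(idx, new TreeSet<Integer>(bucket));
        }
    }

    // O(n) while the bucket is a list, O(log n) once it is a tree.
    boolean contains(int x) {
        return buckets.get(x % buckets.size()).contains(x);
    }

    boolean isTree(int index) {
        return buckets.get(index) instanceof TreeSet;
    }

    public static void main(String[] args) {
        TreeifyingTableDemo table = new TreeifyingTableDemo(8);
        for (int k = 0; k < 20; k++) {
            table.insert(4 + 8 * k); // the attacker's keys again
        }
        System.out.println(table.isTree(4));    // true
        System.out.println(table.contains(60)); // true
    }
}
```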

So, is that the end?

Not so fast. Every expansion still has to move about half of the elements, and one tree gets split into two. Is that really good?

Life is hard for programmers. After a 12127 brainstorming session, they finally came up with something new: consistent hashing.

Consistent hash

Consistent hashing is used mostly in distributed systems. For example, suppose a redis cluster is deployed with four nodes. We define the full range of hash values as 0 ~ 2^32 and place a quarter of the elements on each node.

This is just an example. The real redis cluster works on a similar principle, but the specific numbers are different.

Now suppose you need to add a node to redis, say node5, between node3 and node4. You only need to move the elements that lie between node3 and node5 from node4 to node5. All the other elements stay where they are.

In this way, expansion is faster, relatively few elements are affected, and most requests hardly notice anything.
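A hash ring like this can be sketched with a TreeMap. Everything here is an assumption made for illustration: the class name, the node positions, and the rule that a key belongs to the first node at or after its position on the ring, wrapping around at the end. It is not how redis itself is implemented.

```java
import java.util.Map;
import java.util.TreeMap;

final class ConsistentHashDemo {
    // Ring position -> node name; positions live in [0, 2^32).
    final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String name, long position) {
        ring.put(position, name);
    }

    // Hypothetical helper: maps a string key to a ring position.
    static long positionOf(String key) {
        return key.hashCode() & 0xFFFFFFFFL;
    }

    // A key belongs to the first node at or after its position, wrapping around.
    String nodeFor(long keyPosition) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(keyPosition);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        ConsistentHashDemo cluster = new ConsistentHashDemo();
        cluster.addNode("node1", 1L << 30);        // 1/4 of the ring
        cluster.addNode("node2", 2L << 30);        // 2/4
        cluster.addNode("node3", 3L << 30);        // 3/4
        cluster.addNode("node4", (4L << 30) - 1);  // end of the ring

        long movedKey = 13L << 28;  // between node3 and the future node5
        long stableKey = 15L << 28; // between the future node5 and node4
        System.out.println(cluster.nodeFor(movedKey));  // node4
        System.out.println(cluster.nodeFor(stableKey)); // node4

        cluster.addNode("node5", 7L << 29); // halfway between node3 and node4
        System.out.println(cluster.nodeFor(movedKey));  // node5
        System.out.println(cluster.nodeFor(stableKey)); // node4
        System.out.println(cluster.nodeFor(100));       // node1, unaffected
    }
}
```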


OK, that's all for the evolution of the hash table. Did you get it?

Postscript

In this section, we learned about hashes, hash functions, and hash tables. In Java, the ultimate form of HashMap is array + linked list + red-black tree.

It is said that this red-black tree can be replaced by other data structures, such as a skip list. Do you believe it?

In the next section, we'll talk about the skip list and use it to rewrite HashMap. To get the latest updates, come and follow me!

Follow the public account "tongge read the source code" to unlock more knowledge of source code, fundamentals, and architecture.