Everything about hashing!

Time:2020-11-26

Preface

This article belongs to the series at http://dwz.win/HjK ; follow the link for more on data structures and algorithms.

Hello, this is tongge.

In the last section, we learned how to build a high-performance queue in Java, which involved a lot of low-level knowledge. How much of it stuck?

In this section, I want to walk you through everything about hashing again: hashes, hash functions, and hash tables.

How are these three related?

Why does the Object class need a hashCode() method? What does it have to do with the equals() method?

How do you write a high-performance hash table?

Can the red-black trees in Java’s HashMap be replaced by other data structures?

What is hashing?

Hashing means transforming an input of any length into a fixed-length output. The output is called the hash value, or hash code. The algorithm is called a hash algorithm, or hash function. The process is generally called hashing, or computing a hash. Chinese has several different translations for the word “hash”.

Since the output has a fixed length, the set of possible inputs is infinite while the set of possible outputs is finite, so different inputs will inevitably produce the same output at some point. For the same reason, a hash algorithm is generally irreversible: the input cannot be recovered from the output alone.

So what is hashing used for?

Uses of hash algorithms

“Hash algorithm” is a general notion, or an idea, rather than a fixed formula. Any algorithm that satisfies the definition above can be called a hash algorithm.

Generally speaking, it has the following uses:

  1. Password protection: for example, a password is stored as MD5 + salt rather than in plain text;
  2. Fast lookup: for example, a hash table lets you find an element quickly;
  3. Digital signatures: for example, signing the calls between systems makes it possible to detect tampered data;
  4. File verification: for example, when you download a Tencent game, an MD5 value is usually published alongside it. After downloading the installation package, you compute its MD5 value and compare it with the official one to learn whether the file was damaged or tampered with during the download.
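The file verification use case can be sketched with the JDK's java.security.MessageDigest. This is only an illustration: the class name Md5Check, the helper md5Hex, and the sample bytes are assumptions made for this example, and a real check would read the downloaded file instead of a short string.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5Check {
    // Returns the MD5 digest of the given bytes as a 32-character lowercase hex string.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support MD5.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the downloaded file contents (hypothetical sample data).
        byte[] downloaded = "abc".getBytes(StandardCharsets.UTF_8);
        // Stand-in for the MD5 value published by the vendor.
        String published = "900150983cd24fb0d6963f7d28e17f72";
        System.out.println(md5Hex(downloaded).equals(published)); // true
    }
}
```

Whatever the length of the input, the result is always 32 hex characters, which is the fixed-length output property described earlier.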

Speaking of hash algorithms, or hash functions: in Java, Object, the parent class of all classes, has a hash function of its own, the hashCode() method. Why should such a method be defined in the Object class?

Strictly speaking, a hash algorithm and a hash function are not quite the same thing, but I trust you can tell them apart from context.

Let’s take a look at the comment in the JDK source code:

Returns a hash code value for the object. This method is supported for the benefit of hash tables such as those provided by HashMap. In short, this method exists for HashMap and other hash tables.

// By default, the value returned is derived from the object's internal address

At this point, we have to mention another method in the Object class: equals().

// By default, equals() directly compares whether the two references point to the same address

How are hashCode() and equals() related?

Generally speaking, hashCode() can be regarded as a weak comparison. It goes back to the essence of hashing: mapping arbitrary inputs to fixed-length outputs. The following situations can then occur:

  1. If the input is the same, the output must be the same;
  2. With different inputs, the outputs may be the same or different;
  3. If the output is the same, the input may be the same or different;
  4. Different output means different input;

equals(), by contrast, is the method that strictly compares whether two objects are equal. Therefore, if equals() returns true for two objects, their hashCode() values must be equal. What if they are not?

If equals() returns true but the hashCode() values differ, try using the two objects as keys of a HashMap. They may land in different slots, so two equal keys end up in the same HashMap, which is not allowed. This is why hashCode() must be overridden whenever equals() is overridden.
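A minimal sketch of this problem, using a HashSet (which is backed by a HashMap). The classes BrokenKey and GoodKey are hypothetical and exist only for this illustration.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class ContractDemo {
    // Hypothetical key that overrides equals() but NOT hashCode().
    static final class BrokenKey {
        final String id;
        BrokenKey(String id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof BrokenKey && ((BrokenKey) o).id.equals(id);
        }
    }

    // The same key with hashCode() overridden consistently with equals().
    static final class GoodKey {
        final String id;
        GoodKey(String id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof GoodKey && ((GoodKey) o).id.equals(id);
        }
        @Override public int hashCode() { return Objects.hash(id); }
    }

    static int brokenSetSize() {
        Set<BrokenKey> set = new HashSet<>();
        set.add(new BrokenKey("k"));
        set.add(new BrokenKey("k")); // equal to the first key, but hashed independently
        return set.size();
    }

    static int goodSetSize() {
        Set<GoodKey> set = new HashSet<>();
        set.add(new GoodKey("k"));
        set.add(new GoodKey("k")); // recognized as a duplicate
        return set.size();
    }

    public static void main(String[] args) {
        // Typically 2: the two equal keys get unrelated default hash codes.
        System.out.println(brokenSetSize());
        System.out.println(goodSetSize()); // 1
    }
}
```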

Take the String class. We all know that its equals() method compares the contents of two strings, not their addresses. Here is its equals() method:

public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        int n = value.length;
        if (n == anotherString.value.length) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                    return false;
                i++;
            }
            return true;
        }
    }
    return false;
}

Therefore, the following two String objects are equal according to equals(), but their memory addresses are different:

String a = new String("123");
String b = new String("123");
System.out.println(a.equals(b)); // true
System.out.println(a == b); // false

If hashCode() were not overridden, a and b would return different hash codes, which would cause serious trouble given how often String is used as a HashMap key. That is why String overrides hashCode():

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

The algorithm is simple. It computes s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], where n is the length of the string and ^ denotes exponentiation.
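To see the formula at work, here is a small sketch that recomputes the polynomial by hand and compares it with the built-in result. The class PolyHash and the method polyHash are names invented for this example.

```java
class PolyHash {
    // Evaluates s[0]*31^(n-1) + ... + s[n-1] with Horner's rule, in int arithmetic
    // (overflow simply wraps around, as it does in String.hashCode()).
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // 'a' = 97, 'b' = 98, 'c' = 99, so 97*31^2 + 98*31 + 99 = 96354
        System.out.println(polyHash("abc"));   // 96354
        System.out.println("abc".hashCode());  // 96354
    }
}
```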

Since hash tables have come up many times, let’s take a look at how they evolved step by step.

Evolution of hash table

Array

Before we talk about hash tables, let’s take a look at arrays, the ancestor of data structures.

Arrays are simple, so I won’t say much about them. Everyone understands them, as shown in the figure below.

Array subscripts generally start from 0, and elements are stored one after another. Finding a given element works the same way: you can only scan from the beginning (or from the end), one element at a time.

For example, to find the element 4, you have to compare three times, starting from the beginning.

Early hash tables

As mentioned above, in an array we can only search from the beginning or from the end until an element matches. The average time complexity is O(n).

So, is there any way to find elements quickly using arrays?

Clever programmers came up with a method: use a hash function to compute a value from the element, and use that value to determine the element’s position in the array.

For example, there are four elements: 3, 5, 4, and 1. Before they are put into the array, a hash function computes the position of each one, and the element is placed exactly there, instead of being stored in sequence as in a plain array (where an element is located by its index rather than by its value).

If the array has length 8, we can define the hash function hash(x) = x % 8, and the elements end up as shown in the following figure:

Now, to look for the element 4, we compute hash(4) = 4 % 8 = 4 and directly return the element at position 4.
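The early hash table described above can be sketched in a few lines. The class SimpleHashTable and its methods are illustrative names, the sketch assumes non-negative integers, and it deliberately ignores the case where two elements map to the same position.

```java
class SimpleHashTable {
    static final int CAPACITY = 8;

    // hash(x) = x % 8; only meaningful for non-negative x in this sketch.
    static int hash(int x) {
        return x % CAPACITY;
    }

    // Places every element at the position computed by the hash function.
    static Integer[] build(int... elements) {
        Integer[] slots = new Integer[CAPACITY];
        for (int e : elements) {
            slots[hash(e)] = e; // no collision handling yet
        }
        return slots;
    }

    // A lookup inspects exactly one position instead of scanning the array.
    static boolean contains(Integer[] slots, int x) {
        Integer stored = slots[hash(x)];
        return stored != null && stored == x;
    }

    public static void main(String[] args) {
        Integer[] slots = build(3, 5, 4, 1);
        System.out.println(hash(4));            // 4
        System.out.println(contains(slots, 4)); // true
        System.out.println(contains(slots, 2)); // false
    }
}
```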

Evolving hash table

Everything looks perfect, until the element 13 arrives. Its position is hash(13) = 13 % 8 = 5. What? Position 5 is already occupied by another element. What should we do?

This is a hash collision.

Why do hash collisions occur?

Because the array has a finite length. When an unbounded set of numbers is mapped onto a finite array, collisions are bound to happen sooner or later: multiple elements are mapped to the same position.

Well, since there are hash collisions, we have to resolve them.

How?

Linear probing

Position 5 is taken, so I move back by one and go to position 6. This is linear probing: on a collision, move to the next position in turn until an empty one is found.

However, a new element 12 arrives, with hash(12) = 12 % 8 = 4. What now? It has to move back three times, to position 7, before it finds a free position. This makes insertion inefficient, and lookup suffers in the same way: first go to position 4, find that it is not the element we are looking for, then keep moving back until position 7.
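The linear detection (linear probing) walk described above can be sketched as follows. LinearProbingTable is a hypothetical class written for this example; it assumes non-negative integers and supports no deletion.

```java
class LinearProbingTable {
    private final Integer[] slots;

    LinearProbingTable(int capacity) {
        slots = new Integer[capacity];
    }

    // Inserts a non-negative x; returns the position used, or -1 if the table is full.
    int insert(int x) {
        int home = x % slots.length;
        for (int step = 0; step < slots.length; step++) {
            int idx = (home + step) % slots.length; // wrap around at the end
            if (slots[idx] == null) {
                slots[idx] = x;
                return idx;
            }
        }
        return -1;
    }

    // Follows the same probe sequence; reaching an empty slot means x is absent.
    int indexOf(int x) {
        int home = x % slots.length;
        for (int step = 0; step < slots.length; step++) {
            int idx = (home + step) % slots.length;
            if (slots[idx] == null) return -1;
            if (slots[idx] == x) return idx;
        }
        return -1;
    }

    public static void main(String[] args) {
        LinearProbingTable table = new LinearProbingTable(8);
        for (int x : new int[] {3, 5, 4, 1}) {
            table.insert(x);
        }
        System.out.println(table.insert(13));  // 6: position 5 is taken
        System.out.println(table.insert(12));  // 7: positions 4, 5 and 6 are taken
        System.out.println(table.indexOf(12)); // 7, found after the same three moves
    }
}
```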

Quadratic probing

Linear probing has a big drawback: colliding elements tend to pile up together. For example, 12 was placed at position 7. If another element 14 arrives and collides, it runs off the end of the array and wraps around to position 0. The colliding elements form a cluster, which hurts both lookup and insertion.

At this point a clever programmer proposed a new idea: quadratic probing. On a collision, I do not try the next position. Instead I take the original hash value plus i squared, with i = 1, 2, 3, ..., until an empty position is found.

Take the example above and insert element 12. The process looks like this:

This way, an empty position for a new element is found quickly, and colliding elements no longer pile up.

However, a new element 20 arrives. Where does it go?

It turns out it cannot be placed anywhere.

Research shows that in a hash table using quadratic probing, once more than half of the positions are occupied, a new element may fail to find a position.
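The failure for element 20 can be reproduced with a short sketch of the quadratic (secondary detection) scheme. QuadraticProbingTable is a hypothetical class for this illustration and assumes non-negative integers.

```java
class QuadraticProbingTable {
    private final Integer[] slots;

    QuadraticProbingTable(int capacity) {
        slots = new Integer[capacity];
    }

    // Probes home, home + 1, home + 4, home + 9, ... (mod capacity).
    // Returns the position used, or -1 if no probed position is free.
    int insert(int x) {
        int n = slots.length;
        int home = x % n;
        // i*i mod n repeats with period n, so n probes cover every reachable position.
        for (int i = 0; i < n; i++) {
            int idx = (home + i * i) % n;
            if (slots[idx] == null) {
                slots[idx] = x;
                return idx;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        QuadraticProbingTable table = new QuadraticProbingTable(8);
        for (int x : new int[] {3, 5, 4, 1}) {
            table.insert(x);
        }
        System.out.println(table.insert(13)); // 6: 5 is taken, 5 + 1 is free
        System.out.println(table.insert(12)); // 0: 4 and 5 are taken, (4 + 4) % 8 = 0 is free
        System.out.println(table.insert(20)); // -1: only positions 4, 5 and 0 are ever probed
    }
}
```

Positions 2 and 7 are still empty when 20 is rejected, but the probe sequence for a home position of 4 never reaches them.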

Hence a new concept: expansion.

What is expansion?

When the stored elements reach x% of the total capacity, the capacity needs to be expanded. This x% is called the expansion factor (more commonly known as the load factor).

Obviously, the larger the expansion factor, the better, because it means higher space utilization of the hash table.

Unfortunately, quadratic probing cannot meet this goal. Its expansion factor is too small, only 0.5, so half of the space is wasted.

Time for the programmers to show how clever they are. After a round of 996 brainstorming, they came up with a new way to implement a hash table: the linked list method (also known as separate chaining).

Linked list method

It’s all about resolving collisions, after all! On a collision, I don’t look for another position in the array. Instead I use a linked list to connect all the elements that map to the same array subscript. This way the space is fully used, ha ha ha~~

Hey, perfect.

Is it really perfect? Suppose I’m a hacker, and I keep inserting elements with x % 8 = 4. Almost all the elements end up in the same linked list. The result is that your hash table degenerates into a linked list, and the efficiency of lookup and insertion becomes O(n).
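A minimal sketch of the linked list method and of the attack just described. ChainedTable is a hypothetical class for this example; it assumes non-negative integers and never resizes.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedTable {
    private final List<LinkedList<Integer>> buckets = new ArrayList<>();

    ChainedTable(int capacity) {
        for (int i = 0; i < capacity; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    // Appends x to the list at its home position unless it is already there.
    void insert(int x) {
        LinkedList<Integer> bucket = buckets.get(x % buckets.size());
        if (!bucket.contains(x)) {
            bucket.add(x);
        }
    }

    // A lookup walks one list only, so its cost grows with that list's length.
    boolean contains(int x) {
        return buckets.get(x % buckets.size()).contains(x);
    }

    int bucketSize(int index) {
        return buckets.get(index).size();
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(8);
        for (int i = 0; i < 100; i++) {
            table.insert(8 * i + 4); // every key satisfies x % 8 == 4
        }
        System.out.println(table.bucketSize(4)); // 100: all keys share one list
        System.out.println(table.bucketSize(5)); // 0
    }
}
```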

Of course there is a remedy for this. What else is the expansion factor for?

For example, with an expansion factor of 1, the capacity is doubled when the number of elements reaches 8. Half of the elements then stay at position 4 and the other half move to position 12, which relieves the pressure on the hash table.
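The effect of doubling can be checked with a small sketch. ResizingTable is a hypothetical class for this illustration; the initial capacity of 8 and the factor of 1 come from the example above, and the keys are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.List;

class ResizingTable {
    private static final double EXPANSION_FACTOR = 1.0; // value used in the example above
    private List<List<Integer>> buckets = newBuckets(8);
    private int size = 0;

    private static List<List<Integer>> newBuckets(int capacity) {
        List<List<Integer>> fresh = new ArrayList<>();
        for (int i = 0; i < capacity; i++) {
            fresh.add(new ArrayList<>());
        }
        return fresh;
    }

    void insert(int x) {
        buckets.get(x % buckets.size()).add(x); // duplicates are not checked in this sketch
        size++;
        if (size >= buckets.size() * EXPANSION_FACTOR) {
            resize();
        }
    }

    // Doubles the capacity and re-places every element using the new modulus.
    private void resize() {
        List<List<Integer>> old = buckets;
        buckets = newBuckets(old.size() * 2);
        for (List<Integer> bucket : old) {
            for (int x : bucket) {
                buckets.get(x % buckets.size()).add(x);
            }
        }
    }

    int capacity() { return buckets.size(); }

    int bucketSize(int index) { return buckets.get(index).size(); }

    public static void main(String[] args) {
        ResizingTable table = new ResizingTable();
        for (int i = 0; i < 8; i++) {
            table.insert(8 * i + 4); // 4, 12, 20, ..., 60
        }
        System.out.println(table.capacity());     // 16
        System.out.println(table.bucketSize(4));  // 4 (keys 4, 20, 36, 52)
        System.out.println(table.bucketSize(12)); // 4 (keys 12, 28, 44, 60)
    }
}
```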

However, this is still not ideal: one list has merely become two.

The clever programmers started a 9127 brainstorming session and finally came up with a new structure: the linked list + tree method.

Linked list + tree method

The expansion described above does solve part of the problem: while the number of elements is relatively small, the lists stay short, so overall lookup and insertion are not too slow.

However, the hackers keep attacking and the number of elements keeps growing. Beyond a certain point, lookup and insertion become very slow again.

So let’s change the approach. Since a long linked list is slow, let’s upgrade it: how about turning the list into a red-black tree once it gets long?

Sounds good. Let’s do it.

Now that’s good. Mom no longer has to worry about me being attacked by hackers. Lookup in a red-black tree is O(log n), much faster than the O(n) of a linked list.
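The idea can be sketched by letting a long bucket switch from a LinkedList to a java.util.TreeSet, which the JDK implements on top of a red-black tree (TreeMap). This is not how HashMap is actually coded: TreeifyingTable and the threshold of 8 are assumptions chosen for this illustration, and the keys are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedList;
import java.util.List;
import java.util.TreeSet;

class TreeifyingTable {
    static final int TREEIFY_THRESHOLD = 8; // assumption for this sketch
    private final List<Collection<Integer>> buckets = new ArrayList<>();

    TreeifyingTable(int capacity) {
        for (int i = 0; i < capacity; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    void insert(int x) {
        int idx = x % buckets.size();
        Collection<Integer> bucket = buckets.get(idx);
        if (bucket.contains(x)) {
            return;
        }
        bucket.add(x);
        // Once a list grows long, replace it with a red-black tree based set.
        if (bucket instanceof LinkedList && bucket.size() >= TREEIFY_THRESHOLD) {
            buckets.set(idx, new TreeSet<>(bucket));
        }
    }

    // O(n) while the bucket is a list, O(log n) once it is a tree.
    boolean contains(int x) {
        return buckets.get(x % buckets.size()).contains(x);
    }

    boolean isTree(int index) {
        return buckets.get(index) instanceof TreeSet;
    }

    public static void main(String[] args) {
        TreeifyingTable table = new TreeifyingTable(8);
        for (int i = 0; i < 8; i++) {
            table.insert(8 * i + 4); // all keys collide at position 4
        }
        System.out.println(table.isTree(4));   // true
        System.out.println(table.isTree(5));   // false
        System.out.println(table.contains(60)); // true
    }
}
```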

So, is that the end of it?

Not so fast. Every expansion still has to move half of the elements, and one tree gets split into two. Is that really acceptable?

Programmers have it hard. After a 12127 brainstorming session, they finally came up with something new: consistent hashing.

Consistent hashing

For example, suppose a distributed system has four nodes and the hash values range over a 32-bit space. Each node is then responsible for one quarter of the hash values, so that the whole hash space is spread across all the nodes in the system.

This is only an example. A real Redis cluster works on a similar principle, but with different concrete values.

Now suppose you need to add a node to Redis, say node5, between node3 and node4. You only need to move the elements that fall between node3 and node5 from node4 to node5. All other elements stay where they are.

This speeds up expansion, few elements are affected, and most requests barely notice the change.
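A minimal sketch of such a ring, built on java.util.TreeMap. HashRing, the node positions, and the rule that a key belongs to the first node at or after its position are assumptions made for this example, not the actual Redis implementation.

```java
import java.util.Map;
import java.util.TreeMap;

class HashRing {
    // Maps a position on the ring to the name of the node placed there.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String name, long position) {
        ring.put(position, name);
    }

    // The owner is the first node at or after the key's position, wrapping around.
    // Assumes the ring contains at least one node.
    String ownerOf(long keyPosition) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(keyPosition);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        // Four nodes, each covering a quarter of the space 0 .. 2^32 (hypothetical positions).
        ring.addNode("node1", 1L << 30);
        ring.addNode("node2", 2L << 30);
        ring.addNode("node3", 3L << 30);
        ring.addNode("node4", 4L << 30);

        long keyA = (3L << 30) + 100;             // just after node3
        long keyB = (3L << 30) + (1L << 29) + 1;  // closer to node4
        System.out.println(ring.ownerOf(keyA));   // node4
        System.out.println(ring.ownerOf(keyB));   // node4

        // Add node5 halfway between node3 and node4.
        ring.addNode("node5", (3L << 30) + (1L << 29));
        System.out.println(ring.ownerOf(keyA));   // node5: only this range moved
        System.out.println(ring.ownerOf(keyB));   // node4: unchanged
        System.out.println(ring.ownerOf(5L));     // node1: unchanged
    }
}
```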

Well, that’s all for the evolution history of hash table. Did you get it?

Postscript

In this section, we relearned hashes, hash functions, and hash tables. In Java, the final form of HashMap is array + linked list + red-black tree.

It is said that the red-black tree could be replaced with other data structures, such as a skip list. What do you think?

In the next section, we’ll talk about the skip list and use it to rewrite HashMap. Follow me for the latest updates!

Follow the official account “tongge read the source code” to unlock more knowledge of source code, fundamentals, and architecture.