Basic knowledge of hash tables

Time: 2020-9-17

As a PHP programmer, the data structure you reach for most often is the array. Arrays can solve most problems, but it is worth knowing the other structures as well.

Data structures can be divided into three broad categories:

Linear structures (arrays, linked lists, stacks, queues), trees, and graphs.

Let's first look at the complexity of a few simple data structures:

  1. Array: a linear structure that stores data in a contiguous block of memory. Looking up an element by index takes O(1) time; searching for a given value requires traversing the array and comparing each element with the target, which takes O(n). For a sorted array, the search can be improved with binary search, interpolation search, Fibonacci search, and similar methods (a small binary-search sketch follows this list). Ordinary insertions and deletions involve shifting array elements, so their average complexity is O(n).
  2. Linked list: also a linear structure. Insertions and deletions (once the target position has been found) only require updating the references between nodes, so they take O(1); searching requires walking the list node by node, which is O(n).
  3. Binary tree: a kind of tree. For a reasonably balanced ordered binary tree, insertion, search, and deletion all have an average complexity of O(log n).
  4. Hash table: compared with the structures above, insertion, deletion, and lookup in a hash table are all very fast. Ignoring hash collisions, a single computation locates the element, so the time complexity is O(1). Next, let's see how the hash table achieves this remarkable constant-time O(1) behavior.
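As an aside on the sorted-array lookup mentioned in item 1, here is a minimal PHP sketch of binary search; the `binarySearch` name and the sample data are just for illustration.

```php
<?php
// Binary search on a sorted array: O(log n) lookups, versus O(n) for a
// linear scan. $haystack is assumed to be sorted in ascending order.
function binarySearch(array $haystack, int $needle): int
{
    $low = 0;
    $high = count($haystack) - 1;
    while ($low <= $high) {
        $mid = intdiv($low + $high, 2);
        if ($haystack[$mid] === $needle) {
            return $mid;            // found: return the index
        }
        if ($haystack[$mid] < $needle) {
            $low = $mid + 1;        // search the right half
        } else {
            $high = $mid - 1;       // search the left half
        }
    }
    return -1;                      // not found
}

var_dump(binarySearch([1, 3, 5, 8, 13, 21], 8));  // int(3)
var_dump(binarySearch([1, 3, 5, 8, 13, 21], 4));  // int(-1)
```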
Overview of hash tables

There are only two physical storage structures for data: sequential storage and linked (chained) storage. Structures such as stacks, queues, trees, and graphs are abstractions at the logical level that are mapped onto memory in one of these two physical forms. An array is a sequential storage structure, and the backbone of a hash table is an array.

First, let's fix a few terms that will be used below. The essence of a hash table is an array. Each element of that array is called a bucket, and key-value pairs are stored in the buckets. f is the hash function.

The process of storing a key-value pair in a hash table is as follows:

  1. Use the hash function to compute the key's hash value H (an integer): H = f(key).
  2. If there are n buckets, the key-value pair is placed in bucket H % n (a small sketch of these two steps follows this list).
  3. If that bucket already holds another key-value pair, the conflict is resolved with open addressing or with chaining (the zipper method).
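To make steps 1 and 2 concrete, here is a minimal PHP sketch. The `bucketIndex` name is made up for this example, and `crc32()` merely stands in for the hash function f; real hash tables use other hash functions.

```php
<?php
// Step 1: compute an integer hash value from the key.
// Step 2: map that hash into one of $bucketCount buckets with the modulo operator.
function bucketIndex(string $key, int $bucketCount): int
{
    $hash = crc32($key);            // step 1: H = f(key), an integer
    return $hash % $bucketCount;    // step 2: the pair goes into bucket H % n
}

echo bucketIndex('user:42', 16), PHP_EOL;   // the same key always maps to the same bucket
echo bucketIndex('user:43', 16), PHP_EOL;   // a different key may land in a different (or the same!) bucket
```

Step 3, collision handling, is covered in the next sections.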
Hash Collisions

What if two different keys produce the same hash value? In other words, we hash a key, go to insert the pair, and find that the target bucket is already occupied by another element. This is called a hash conflict, also known as a hash collision.

The design of the hash function matters a lot. A good hash function is cheap to compute and distributes the resulting addresses evenly. But we also have to accept that the array is a fixed-length, contiguous block of memory, and no hash function, however good, can guarantee that the computed addresses never collide. So how do we resolve collisions? The common solutions are open addressing (on a conflict, keep probing for the next unoccupied slot) and chaining (the zipper method).
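As a rough illustration of open addressing, here is a PHP sketch of linear probing. The `probeInsert` name and the fixed-size `$buckets` array are made up for this example, and `crc32()` again only stands in for the hash function.

```php
<?php
// Linear probing (one form of open addressing): if the target bucket is
// taken, step forward slot by slot until an empty bucket is found.
// $buckets is a fixed-size array whose slots hold [key, value] pairs or null.
function probeInsert(array &$buckets, string $key, $value): bool
{
    $n = count($buckets);
    $index = crc32($key) % $n;
    for ($i = 0; $i < $n; $i++) {
        $slot = ($index + $i) % $n;             // wrap around the end of the array
        if ($buckets[$slot] === null || $buckets[$slot][0] === $key) {
            $buckets[$slot] = [$key, $value];   // free slot, or same key: store/overwrite
            return true;
        }
    }
    return false;                               // table is completely full
}

$buckets = array_fill(0, 8, null);
probeInsert($buckets, 'a', 1);
probeInsert($buckets, 'b', 2);
```

The zipper method is described next.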
Zipper method (chaining)
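The idea of the zipper method is that every bucket holds a list (a "chain") of key-value pairs, so colliding keys simply end up in the same bucket. The following PHP sketch illustrates this; `ChainedHashTable` and its methods are made up for this example, and `crc32()` again stands in for the hash function.

```php
<?php
// Minimal chained hash table: each bucket is an array of [key, value] pairs.
class ChainedHashTable
{
    private $buckets;
    private $bucketCount;

    public function __construct(int $bucketCount = 8)
    {
        $this->bucketCount = $bucketCount;
        $this->buckets = array_fill(0, $bucketCount, []);
    }

    public function put(string $key, $value): void
    {
        $index = crc32($key) % $this->bucketCount;
        foreach ($this->buckets[$index] as $i => $pair) {
            if ($pair[0] === $key) {                  // key already present: overwrite its value
                $this->buckets[$index][$i][1] = $value;
                return;
            }
        }
        $this->buckets[$index][] = [$key, $value];    // new key: append to this bucket's chain
    }

    public function get(string $key)
    {
        $index = crc32($key) % $this->bucketCount;
        foreach ($this->buckets[$index] as $pair) {   // walk the chain in the target bucket
            if ($pair[0] === $key) {
                return $pair[1];
            }
        }
        return null;                                  // key not found
    }
}

$table = new ChainedHashTable();
$table->put('name', 'php');
echo $table->get('name'), PHP_EOL;   // php
```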

Load factor

The load factor measures how full (or empty) the hash table is, and to some extent it also reflects query efficiency. It is calculated as follows:

Load factor = number of key-value pairs / number of buckets

The larger the load factor, the fuller the hash table, the more likely collisions become, and the lower the performance. Therefore, in general, when the load factor exceeds some constant (perhaps 1, or 0.75, and so on), the hash table is automatically grown.
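As a rough illustration of the formula and the threshold check, assuming the chained layout from the sketch above (the 0.75 threshold is just an example):

```php
<?php
// Load factor for a chained table: total pairs divided by bucket count.
function loadFactor(array $buckets): float
{
    $pairs = array_sum(array_map('count', $buckets));   // total key-value pairs
    return $pairs / count($buckets);                     // pairs / buckets
}

$buckets = [[['a', 1], ['e', 5]], [['b', 2]], [], []];   // 3 pairs in 4 buckets
echo loadFactor($buckets), PHP_EOL;                      // 0.75
if (loadFactor($buckets) > 0.75) {
    // grow the table and rehash (see the sketch in the next paragraph)
}
```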

When the hash table grows, a new bucket array twice the original size is allocated. Even though each key's hash value is unchanged, taking it modulo the new bucket count generally gives a different result, so the storage location of every key-value pair may change. This process is known as rehashing.
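A sketch of the rehash step, continuing the same chained layout (again, `crc32()` only stands in for the real hash function):

```php
<?php
// Rehash: allocate twice as many buckets and re-insert every pair,
// because H % n changes when n changes.
function rehash(array $buckets): array
{
    $newCount = count($buckets) * 2;                 // twice the original number of buckets
    $newBuckets = array_fill(0, $newCount, []);
    foreach ($buckets as $chain) {
        foreach ($chain as [$key, $value]) {
            $index = crc32($key) % $newCount;        // hash unchanged, but the modulo result changes
            $newBuckets[$index][] = [$key, $value];
        }
    }
    return $newBuckets;
}
```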

However, growing the table cannot always fix a high load factor. Suppose every key has the same hash value: even after growing, they all still land in the same bucket. The load factor goes down, but the length of the list actually stored in that bucket does not change, so the query performance of the hash table does not improve.

This reveals two problems with hash tables:

  1. If the hash table holds many buckets, growing it requires rehashing and moving the data, which has a significant performance cost.
  2. If the hash function is badly designed, the hash table can degenerate into a linear list in the worst case, with extremely poor performance. A sensible hash function is therefore very important.

Tip: different languages handle rehashing in different ways.