Basic Data Structures Explained, Part 2: Hash Search

Time:2022-5-23

This article introduces how to perform lookups with a hash table.

1: Basic concepts of hash table

1. Hash function

A hash function maps each key in the lookup table to the address where that key is stored, written hash(key) = addr. A hash function may map two or more different keys to the same address; this is called a collision (or conflict), and the colliding keys are called synonyms. Because collisions cannot be avoided entirely, a method for handling them must be designed along with the hash function.
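
To make the idea of synonyms concrete, here is a minimal sketch. The function name `simple_hash` and the table size 7 are assumptions chosen for illustration; they do not come from the article.

```python
def simple_hash(key: int, table_size: int = 7) -> int:
    """Map an integer key to an address in range(table_size)."""
    return key % table_size


# 10 and 17 are different keys that land on the same address,
# so they are synonyms under this hash function.
addr_a = simple_hash(10)
addr_b = simple_hash(17)
print(addr_a, addr_b)  # 3 3
```

Any function that maps a large key space onto a smaller address space will produce such pairs, which is why a collision-handling method is always needed.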

2. Hash table

A hash table is a data structure that is accessed directly by key: it establishes a direct mapping between keys and storage addresses.
Under ideal conditions (no collisions), the time complexity of a hash table lookup is O(1), independent of the number of elements in the table.
A hash table is also commonly called a hash map.

2: Construction method of hash function

When constructing a hash function, pay attention to the following points:
(1) The domain of the hash function must include all keys, and its range depends on the size (address range) of the hash table.
(2) The addresses computed by the hash function should be distributed with equal probability over the whole address space, to minimize collisions.
(3) The hash function should be as simple as possible, so that addresses can be computed quickly.
Common hash functions are described below:

1. Direct addressing method

Take the value of a linear function of the key directly as the hash address: H(key) = a * key + b, where a and b are constants.
Advantages: the simplest method, and it produces no collisions.
Disadvantages: it only works well when the keys are distributed more or less continuously. If the distribution is discontinuous, many slots stay vacant and storage space is wasted.
Suitable for lookup tables that are small and whose keys are continuous.
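
A small sketch of direct addressing follows. The keys (the years 2000 to 2009), the constants a = 1 and b = -2000, and the name `direct_address` are all assumptions made up for this example.

```python
def direct_address(key: int, a: int = 1, b: int = -2000) -> int:
    """Hash address as a linear function of the key: H(key) = a * key + b."""
    return a * key + b


# The keys are the consecutive years 2000..2009, so ten slots suffice
# and no slot is wasted.
table = [None] * 10
for year in range(2000, 2010):
    table[direct_address(year)] = f"record for {year}"

print(table[direct_address(2005)])  # record for 2005
```

If the keys were instead scattered (say 2000, 2500, 9000), the same function would need a table of thousands of mostly empty slots, which is the waste described above.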

2. Division and remainder method

Assume the length of the hash table is m. Choose a prime number p ≤ m that is as close to m as possible, then compute the address of each key with the formula H(key) = key % p.
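
The division and remainder method can be sketched as below. The helper names `largest_prime_at_most` and `division_hash` are hypothetical, and the trial-division primality check is just one simple way to find p.

```python
def largest_prime_at_most(m: int) -> int:
    """Return the largest prime p with p <= m (requires m >= 2)."""
    def is_prime(n: int) -> bool:
        if n < 2:
            return False
        i = 2
        while i * i <= n:
            if n % i == 0:
                return False
            i += 1
        return True

    for candidate in range(m, 1, -1):
        if is_prime(candidate):
            return candidate
    raise ValueError("m must be at least 2")


def division_hash(key: int, m: int) -> int:
    """H(key) = key % p, where p is the largest prime not exceeding m."""
    return key % largest_prime_at_most(m)


print(largest_prime_at_most(16))  # 13
print(division_hash(100, 16))     # 9
```

In practice p would be computed once when the table is created rather than on every call; it is recomputed here only to keep the sketch short.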

3. Digital analysis

If the keys are numbers with many digits (such as mobile phone numbers) and some digit positions follow the same pattern across all keys, extract the remaining positions, whose digits vary, and use them as the hash address, so that the keys are spread over the positions of the hash table.
Suitable for lookup tables whose keys have many digits.
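
As an illustration of digital analysis, the sketch below uses made-up 11-digit phone numbers that share their first seven digits, so only the last four positions are kept. The function name `digit_analysis_hash` and the chosen positions are assumptions for this example.

```python
def digit_analysis_hash(key: str, positions=(7, 8, 9, 10)) -> int:
    """Build the address from the digits at the chosen positions of the key."""
    return int("".join(key[i] for i in positions))


# Hypothetical phone numbers: the first seven digits are identical,
# so they carry no information and are ignored.
phone_numbers = ["13800001234", "13800005678", "13800009012"]
for number in phone_numbers:
    print(number, "->", digit_analysis_hash(number))
```

Which positions to keep has to be decided by inspecting the actual key set, which is why this method only applies when the keys are known in advance.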

4. Square middle method

Take the middle digits of the square of the key as the hash address. The address obtained this way depends on every digit of the key, so the resulting addresses are fairly evenly distributed.
Suitable when the keys are not spread over a very large range.
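
A possible sketch of the square middle method is shown below. The name `mid_square_hash` and the choice of three middle digits are assumptions; how many digits to keep depends on the table size.

```python
def mid_square_hash(key: int, digits: int = 3) -> int:
    """Take `digits` digits from the middle of key squared."""
    squared = str(key * key)
    # Left-pad with zeros so that small squares still have enough digits.
    squared = squared.zfill(digits)
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])


# 1234 * 1234 = 1522756, and its middle three digits are "227".
print(mid_square_hash(1234))  # 227
```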

3: Methods of dealing with conflicts

A hash table cannot avoid collisions entirely, and there are many ways to handle them. The two most commonly used are introduced here: the open addressing method and the zipper method (separate chaining).

1. Open addressing method

The open addressing method is also called closed hashing. When a collision occurs and the hash table is not full, there must be an empty position somewhere in the table, so the key can be stored in the “next” empty position after the one where the collision occurred.
When a key collides with another key, a probing technique is used to form a probe sequence in the hash table, and the positions along that sequence are examined one after another. The key is inserted into the first empty cell encountered. The basic formula is:
H(i) = (H(key) + d_i) mod m
where H(key) is the hash function, H(i) is the address examined on the i-th probe, d_i is the increment sequence, and m is the table length. Depending on how d_i is chosen, open addressing is divided into several probing methods, including linear probing, square (quadratic) probing, and rehashing (double hashing).

(1) Linear probing method

Step through the increment sequence 1, 2, ..., m - 1 to test the next storage address, i.e. d_i = i. If the position found is empty, insert the key there; otherwise try the next increment.
Disadvantages: elements cluster together, which reduces search efficiency.
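
The following is a minimal sketch of insertion and search with linear probing. It assumes integer keys, H(key) = key % m, and `None` as the marker for an empty slot; the function names are hypothetical.

```python
EMPTY = None


def linear_probe_insert(table: list, key: int) -> int:
    """Insert key using H(key) = key % m and d_i = i; return the slot used."""
    m = len(table)
    home = key % m
    for i in range(m):              # i = 0 is the home slot itself
        slot = (home + i) % m
        if table[slot] is EMPTY or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


def linear_probe_search(table: list, key: int) -> int:
    """Return the slot holding key, or -1 if key is absent."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i) % m
        if table[slot] is EMPTY:
            return -1               # an empty slot ends the probe sequence
        if table[slot] == key:
            return slot
    return -1


table = [EMPTY] * 7
for k in (10, 17, 24, 3):
    linear_probe_insert(table, k)
print(table)  # [None, None, None, 10, 17, 24, 3]
```

All four keys hash to address 3, so they pile up in slots 3 to 6. This run of occupied slots is the clustering mentioned above: any later key that hashes into the run has to walk to its end.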

(2) Square probing method

Step through the increment sequence 1, -1, 4, -4, ..., k^2, -k^2, with k ≤ m / 2, to test the next storage address. This is also known as the quadratic probing method.
Advantages: avoids the “accumulation” (clustering) problem of linear probing.
Disadvantages: not every cell of the hash table can be probed, but at least half of them can.
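
To see which cells the square increments actually reach, the sketch below lists the probe addresses for a given home address. The function name `quadratic_probe_sequence` is an assumption, and the home address itself (increment 0) is included as the first entry.

```python
def quadratic_probe_sequence(home: int, m: int) -> list:
    """Addresses tried for increments 0, 1, -1, 4, -4, ..., k*k, -k*k with k <= m // 2."""
    sequence = [home % m]
    for k in range(1, m // 2 + 1):
        sequence.append((home + k * k) % m)
        sequence.append((home - k * k) % m)
    return sequence


print(quadratic_probe_sequence(3, 7))  # [3, 4, 2, 0, 6, 5, 1]
```

With m = 7 the sequence happens to reach every cell, but that is not guaranteed in general: `quadratic_probe_sequence(0, 8)` only ever visits cells 0, 1, 4 and 7, which is exactly half of the table.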

(3) Rehashing method

The increment is d_i = i * H2(key), where H2(key) is a second hash function. This is also known as the double hashing method. The sequence of increments is H2(key), 2*H2(key), 3*H2(key), and so on. H2(key) must not be 0 for any key, and the probe sequence should be able to reach every storage cell of the hash table. The specific form is:
H(i) = (H(key) + i * H2(key)) mod m
The initial probe position is H(0) = H(key) % m. Here i is the number of collisions so far, initially 0. After at most m - 1 probes, the sequence has visited every position in the table and returns to H(0); this holds when H2(key) and m have no common factor, for example when m is prime.
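
A short sketch of the double hashing probe sequence follows. The particular second hash, H2(key) = 1 + key % (m - 1), is an assumption picked because it is never 0; the article does not prescribe a specific H2.

```python
def h1(key: int, m: int) -> int:
    """Primary hash: the home address."""
    return key % m


def h2(key: int, m: int) -> int:
    """Second hash; always in 1..m-1, so it is never 0."""
    return 1 + key % (m - 1)


def double_hash_sequence(key: int, m: int) -> list:
    """Addresses H(i) = (H(key) + i * H2(key)) mod m for i = 0 .. m-1."""
    return [(h1(key, m) + i * h2(key, m)) % m for i in range(m)]


print(double_hash_sequence(10, 7))  # [3, 1, 6, 4, 2, 0, 5]
```

Because m = 7 is prime, every step size from 1 to 6 is coprime with it, so the sequence visits all seven cells before repeating.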

(4) Precautions

When deleting from a table that uses open addressing, an element cannot simply be physically removed, because its address may lie on the search path of one of its synonyms. Physically removing it would break that search path, so the slot can only be marked with a deletion flag. After many deletions, however, many positions are flagged and go unused, so the flagged elements can be physically removed as needed (for example, by rebuilding the table).
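
The deletion flag can be sketched as follows, assuming linear probing with H(key) = key % m. The class name `OpenAddressTable` and the two sentinel objects `EMPTY` and `DELETED` are hypothetical names for this example.

```python
EMPTY = object()     # slot has never been used
DELETED = object()   # slot held a key that was later removed (deletion flag)


class OpenAddressTable:
    def __init__(self, m: int):
        self.slots = [EMPTY] * m

    def _probe(self, key: int):
        """Yield the linear probe sequence for key."""
        m = len(self.slots)
        for i in range(m):
            yield (key % m + i) % m

    def contains(self, key: int) -> bool:
        for slot in self._probe(key):
            if self.slots[slot] is EMPTY:
                return False          # a never-used slot ends the search path
            if self.slots[slot] is not DELETED and self.slots[slot] == key:
                return True
        return False

    def insert(self, key: int) -> None:
        if self.contains(key):
            return
        for slot in self._probe(key):
            # Flagged slots may be reused by later insertions.
            if self.slots[slot] is EMPTY or self.slots[slot] is DELETED:
                self.slots[slot] = key
                return
        raise RuntimeError("hash table is full")

    def delete(self, key: int) -> None:
        for slot in self._probe(key):
            if self.slots[slot] is EMPTY:
                return
            if self.slots[slot] is not DELETED and self.slots[slot] == key:
                self.slots[slot] = DELETED   # mark only; do not empty the slot
                return


t = OpenAddressTable(7)
for k in (10, 17, 24):
    t.insert(k)
t.delete(17)
print(t.contains(24))  # True: the flag keeps the path 3 -> 4 -> 5 intact
```

Had slot 4 been reset to `EMPTY` instead, the search for 24 would have stopped there and wrongly reported the key as missing.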

2. Zipper method

The zipper method is also called the chain address method (separate chaining). Keys (synonyms) with the same hash address are placed in the same singly linked list, called a synonym list. If there are m hash addresses, there are m linked lists, and a pointer array T[0..m-1] stores the head pointer of each list. Every record whose hash address is i is inserted as a node into the singly linked list whose head pointer is T[i]. Each component of T is initially a null pointer.
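
The structure described above can be sketched as below, assuming integer keys and H(key) = key % m. The class names `Node` and `ChainedHashTable` are hypothetical, and new nodes are inserted at the head of each list, which is one common choice rather than the only one.

```python
class Node:
    def __init__(self, key: int, next_node=None):
        self.key = key
        self.next = next_node


class ChainedHashTable:
    def __init__(self, m: int):
        # T[0..m-1]: every head pointer starts out as a null pointer.
        self.heads = [None] * m

    def _address(self, key: int) -> int:
        return key % len(self.heads)

    def contains(self, key: int) -> bool:
        node = self.heads[self._address(key)]
        while node is not None:
            if node.key == key:
                return True
            node = node.next
        return False

    def insert(self, key: int) -> None:
        if not self.contains(key):
            i = self._address(key)
            self.heads[i] = Node(key, self.heads[i])  # insert at the head of list i

    def chain(self, i: int) -> list:
        """Return the keys stored in the synonym list for address i."""
        keys, node = [], self.heads[i]
        while node is not None:
            keys.append(node.key)
            node = node.next
        return keys


t = ChainedHashTable(7)
for k in (10, 17, 24, 5):
    t.insert(k)
print(t.chain(3))  # [24, 17, 10]
print(t.chain(5))  # [5]
```

The synonyms 10, 17 and 24 share list 3, while a lookup for any other address never touches that list.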
The average search length for successful and unsuccessful searches with the zipper method is as follows: