##### Contents

- 1. Theoretical knowledge
- 1.1. Hash algorithm
- 1.2. Hash bucket algorithm
- 1.3. Conflict resolution algorithm
- 2. Dictionary implementation
- 2.1. Entry structure
- 2.2. Other key private variables
- 2.3. Dictionary – add operation
- 2.4. Dictionary – find operation
- 2.5. Dictionary – remove operation
- 2.6. Dictionary – resize operation (capacity expansion)
- 2.6.1. Trigger conditions for capacity expansion
- 2.6.2. How to expand capacity
- 2.7. Dictionary – add operation
- 2.8. Collection version control

## 1. Theoretical knowledge

The implementation of Dictionary rests on two key algorithms: a hash algorithm, and a conflict resolution algorithm that deals with hash collisions.

### 1.1. Hash algorithm

A hash algorithm is a digest algorithm: it maps binary data of arbitrary length to a shorter binary value. The well-known MD5 algorithm is one example, and it can produce a digest for any data. A function that implements a hash algorithm is called a hash function. A hash function has the following characteristics.

- Hashing the same data always produces the same result.
`HashFunc(key1) == HashFunc(key1)`

- Hashing different data may also produce the same result (hashing can produce collisions).
`key1 != key2` does not rule out `HashFunc(key1) == HashFunc(key2)`

- Hashing is irreversible: the original data cannot be recovered from the hash code.
`key1 => hashCode`, however `hashCode =\=> key1`

The following figure is a simple illustration of a hash function: data of any length is mapped to a shorter data set through `HashFunc`.

The following figure illustrates a hash collision: after hashing, `Sandra Dee` and `John Smith` both land on `02`, which produces a collision.
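The three properties can be sketched with a deliberately weak toy function. `ToyHash` below is a hypothetical helper made up for illustration (it is not part of .NET and not what Dictionary uses): it sums the character codes of a string and reduces the sum modulo 16.

```csharp
using System;

public static class ToyHashDemo
{
    // Hypothetical toy hash: sum of the UTF-16 code units, reduced modulo 16.
    public static int ToyHash(string key)
    {
        int sum = 0;
        foreach (char c in key)
        {
            sum += c;
        }
        return sum % 16;
    }
}
```

`ToyHashDemo.ToyHash("ab")` always returns the same value (deterministic), `ToyHashDemo.ToyHash("ba")` returns that same value as well because it contains the same characters (a collision), and since many strings share each of the 16 possible results, the input cannot be recovered from the output.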

Common algorithms for constructing hash functions are as follows:

1. Direct addressing method: take the key itself, or a linear function of the key, as the hash address. That is, `H(key) = key` or `H(key) = a * key + b`, where a and b are constants (such hash functions are called "self functions").

2. Digit analysis method: analyze a group of data, such as the dates of birth of a group of employees. The first few digits of the dates of birth are roughly the same, so using them would make conflicts very likely. The last few digits, which represent the month and the day, differ a lot, so forming the hash address from those digits significantly reduces the probability of conflict. Digit analysis therefore means finding the pattern in the numbers and using the most varied digits to construct hash addresses with a low collision probability.

3. Mid-square method: square the key and take the middle digits of the result as the hash address.

4. Folding method: cut the key into several parts with the same number of digits (the last part may have fewer), then take the sum of these parts, discarding the carry, as the hash address.

5. Random number method: choose a random function and take the random value of the key as the hash address. It is often used when keys have different lengths.

6. Division-remainder method: take the remainder of the key divided by a number p that is not greater than the hash table length m as the hash address. That is, `H(key) = key mod p`, with `p <= m`. The remainder can be taken of the key directly, or after a folding or squaring operation. The choice of p is very important. It is usually a prime, or m itself; a poor choice of p makes collisions likely.
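Two of the methods above, division-remainder and folding, can be sketched as follows. This is only an illustration under stated assumptions: keys are non-negative integers, the folding groups are three decimal digits wide, and the class and method names are made up for this example.

```csharp
public static class HashConstruction
{
    // Division-remainder method: H(key) = key mod p, where p <= table length m.
    public static int DivisionRemainder(int key, int p)
    {
        return key % p;
    }

    // Folding method: split the key into groups of three decimal digits,
    // add the groups together, and discard the carry beyond three digits.
    public static int Fold(long key)
    {
        long sum = 0;
        while (key > 0)
        {
            sum += key % 1000; // take the lowest three digits
            key /= 1000;       // move on to the next group
        }
        return (int)(sum % 1000); // drop the carry
    }
}
```

For example, `DivisionRemainder(100, 13)` gives 9, and `Fold(123456789)` adds 789 + 456 + 123 = 1368 and keeps 368.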

### 1.2. Hash bucket algorithm

Hash algorithms naturally lead to hash tables. A hash function quickly turns a key into a hash code, and the hash code can then be mapped directly to the value. However, the range of possible hash codes is generally very large, often 2^32 values or more, so it is impractical to reserve a mapping slot for every possible hash code.

For this reason the hash codes are mapped in segments, and each segment is called a bucket. The most common bucket scheme simply takes the remainder of the hash code.

Suppose the hash code can take 2^32 values and is mapped onto 8 buckets. Then `bucketIndex = HashFunc(key1) % 8` determines which bucket a hash code is mapped to.

As you can see, mapping through buckets increases hash conflicts, because different hash codes can share a bucket.
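A minimal sketch of the bucket mapping, assuming 8 buckets as in the text. The helper name `BucketIndex` is made up for this example; clearing the sign bit with `& 0x7FFFFFFF` mirrors what the Dictionary source shown later in this article does before taking the remainder.

```csharp
public static class BucketDemo
{
    // Maps a hash code to a bucket index in the range [0, bucketCount).
    public static int BucketIndex(int hashCode, int bucketCount)
    {
        // Clear the sign bit first so the remainder is never negative.
        return (hashCode & 0x7FFFFFFF) % bucketCount;
    }
}
```

With 8 buckets, the distinct hash codes 6, 14 and 22 all map to bucket 6, which shows how bucketing adds conflicts on top of genuine hash collisions.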

### 1.3. Conflict resolution algorithm

Conflicts are inevitable for any hash algorithm, so how they are handled is a key point. Common conflict resolution algorithms include chaining (also called the zipper method, which is what Dictionary uses), open addressing, re-hashing, and a common overflow area. This article only introduces chaining and re-hashing; readers interested in the other algorithms can consult the references at the end of the article.

1. Chaining (zipper method): build a singly linked list of the conflicting elements and store the address of its head in the corresponding bucket of the hash table. After locating the bucket, you find the element by traversing the singly linked list.

2. Re-hash method: as the name suggests, the key is hashed again with other hash functions until a non-conflicting position is found.

The figure below illustrates chaining: the conflict is resolved by building a singly linked list at the colliding position.
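The chaining idea can be sketched with ordinary linked nodes. This is a simplified illustration, not how Dictionary stores its chains (Dictionary uses array indices instead of node references, as section 2 shows); the class `ChainedTable` and its members are hypothetical names.

```csharp
public sealed class ChainedTable
{
    private sealed class Node
    {
        public string Key;
        public string Value;
        public Node Next;
    }

    // Each bucket stores the head of a singly linked list, or null if empty.
    private readonly Node[] buckets;

    public ChainedTable(int bucketCount)
    {
        buckets = new Node[bucketCount];
    }

    private int IndexOf(string key)
    {
        return (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
    }

    public void Put(string key, string value)
    {
        int index = IndexOf(key);
        for (Node n = buckets[index]; n != null; n = n.Next)
        {
            if (n.Key == key) { n.Value = value; return; } // replace existing value
        }
        // Prepend the new node so it becomes the head of the bucket's list.
        buckets[index] = new Node { Key = key, Value = value, Next = buckets[index] };
    }

    public bool TryGet(string key, out string value)
    {
        for (Node n = buckets[IndexOf(key)]; n != null; n = n.Next)
        {
            if (n.Key == key) { value = n.Value; return true; }
        }
        value = null;
        return false;
    }
}
```

Constructing the table with a single bucket (`new ChainedTable(1)`) forces every key to collide, yet both `Put` and `TryGet` still work because they walk the list.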

## 2. Dictionary implementation

The implementation is analyzed against the source code, specifically the .NET Framework 4.7 version. Source code address: link

This chapter introduces the key types and fields in Dictionary, then follows the code through insertion, deletion and capacity expansion. By the end, the design principle should be clear.

### 2.1. Entry structure

First, we introduce the Entry structure, defined in the code below. It is the smallest unit of data in a Dictionary: each call to `Add(Key, Value)` wraps the pair in one of these structures.

```
private struct Entry {
    public int hashCode; // Lower 31 bits of the hash code (sign bit cleared); -1 if the entry is unused
    public int next;     // Index of the next entry in the chain; -1 if there is no next entry
    public TKey key;     // Key of the element
    public TValue value; // Value of the element
}
```

### 2.2. Other key private variables

In addition to the entry structure, there are several key private variables. Their definitions and explanations are shown in the following code.

```
private int[] buckets;                    // Hash buckets
private Entry[] entries;                  // Entry array that holds the elements
private int count;                        // Index of the next never-used slot in entries
private int version;                      // Current version; guards against changes to the collection during iteration
private int freeList;                     // Index in entries of a deleted (free) entry
private int freeCount;                    // Number of deleted entries, i.e. how many free slots exist
private IEqualityComparer<TKey> comparer; // Comparer
private KeyCollection keys;               // Collection of keys
private ValueCollection values;           // Collection of values
```

In the code above, note `buckets` and `entries`: these two arrays are the key to the implementation of Dictionary.

### 2.3. Dictionary – add operation

After the analysis above, it may still not be clear why the structure is designed this way. Let's walk through the add process of Dictionary to get a feel for it.

First, we describe the data structure of a dictionary as a diagram that shows only the key points: a structure with 4 buckets and 4 entries.

Suppose we then perform an add operation, `dictionary.Add("a","b")`, where `key = "a"` and `value = "b"`.

1. Compute the hash code of the key. We assume the hash value of "a" is 6 (`GetHashCode("a") = 6`).

2. Take the remainder of the hash code to find the bucket it falls in. The bucket array length (`buckets.Length`) is 4, so `6 % 4` lands in the bucket at `index` 2, that is, `buckets[2]`.

3. Setting one other case aside for now, the `hashCode`, `key` and `value` are stored in `entries[count]`, because the `count` position is free; then `count++` points to the next free location. In the figure above the first position, index = 0, is free, so the data is stored in `entries[0]`.

4. Assign the index of the `Entry`, `entryIndex`, to the corresponding slot in `buckets`. Step 3 stored the entry in `entries[0]`, so `buckets[2]=0`.

5. Finally `version++`: the collection has changed, so the version is incremented. Only adding, replacing and deleting elements update the version.

Steps 1 to 5 are simplified for ease of understanding. The real code differs slightly, and the differences are covered in the later add operation section (2.7).

After the add operation completes, the data structure is updated as shown in the figure below.

This is the ideal case: each bucket holds a single hash code and there is no collision. In practice collisions happen often, so how does the Dictionary class resolve them?

We perform another add operation, `dictionary.Add("c","d")`, and assume `GetHashCode("c")=6`, so `6 % 4 = 2` and the bucket `index` is again 2. Steps 1 to 3 work as before; after they execute, the data structure is as shown in the figure below.

If we continued with step 4 as described, `buckets[2] = 1` would overwrite the original `buckets[2]=>entries[0]` link and that relationship would be lost, which is not what we want. This is where the `next` field of the entry plays its part.

If `buckets[index]` already refers to another element, the following two statements make the new element's `entry.next` point to the previous element and make `buckets[index]` point to the new element, forming a singly linked list.

```
entries[index].next = buckets[targetBucket];
...
buckets[targetBucket] = index;
```

In fact, step 4 always performs this operation without checking whether other elements exist: the initial value of every bucket in `buckets` is -1, so no problem arises.

After the steps above, the data structure is updated as shown in the figure below.
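The two adds walked through above can be replayed with a small simulation. This is a sketch under the article's assumptions, not the real implementation: the hash code is passed in explicitly so the assumed value 6 can be used for both keys, the arrays have the fixed size 4, and duplicate checks, resizing and the free list are left out. `AddWalkthrough` and its member names are made up for this example.

```csharp
public sealed class AddWalkthrough
{
    public struct Entry
    {
        public int HashCode;
        public int Next;
        public string Key;
        public string Value;
    }

    // Every bucket starts at -1, meaning "empty".
    public readonly int[] Buckets = { -1, -1, -1, -1 };
    public readonly Entry[] Entries = new Entry[4];
    public int Count;

    public void Add(string key, string value, int hashCode)
    {
        int targetBucket = hashCode % Buckets.Length;
        int index = Count++;                         // next free slot in Entries
        Entries[index].HashCode = hashCode;
        Entries[index].Next = Buckets[targetBucket]; // old head, or -1 for an empty bucket
        Entries[index].Key = key;
        Entries[index].Value = value;
        Buckets[targetBucket] = index;               // new entry becomes the head
    }
}
```

After `Add("a", "b", 6)` and `Add("c", "d", 6)`, `Buckets[2]` is 1, `Entries[1].Next` is 0 and `Entries[0].Next` is -1, which is the chain `buckets[2] => entries[1] => entries[0]` described in the text.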

### 2.4. Dictionary – find operation

To make the lookup easier to demonstrate, we add one more element, `dictionary.Add("e","f")`, with `GetHashCode("e") = 7` and `7 % buckets.Length = 3`. The data structure is now as follows.

Suppose we now execute `dictionary.GetValueOrDefault("a")`. The following steps are performed.

1. Get the hash code of the key and compute the bucket position. As assumed earlier, the `hashCode` of "a" is 6, so the result is `targetBucket=2`.

2. Through `buckets[2]=1`, find `entries[1]` and compare the keys. If they are equal, return the `entryIndex`; if not, continue with `entries[next]` until an element with an equal key is found or `next == -1`. Here `entries[1]` holds "c", so the search follows its `next` to `entries[0]`, finds the element with `key == "a"`, and returns `entryIndex=0`.

3. If `entryIndex >= 0`, return the value of the corresponding `entries[entryIndex]` element; otherwise return `default(TValue)`. Here `entries[0].value` is returned directly.

The whole lookup process is shown in the figure below.

The lookup code is excerpted below.

```
// Find the position of the entry for a key
private int FindEntry(TKey key) {
    if (key == null) {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.key);
    }
    if (buckets != null) {
        int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF; // Get the hash code, ignoring the sign bit
        // int i = buckets[hashCode % buckets.Length] finds the bucket and yields the entry's index in entries
        // i >= 0; i = entries[i].next walks the singly linked list
        for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = entries[i].next) {
            // Found: return the index
            if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key)) return i;
        }
    }
    return -1;
}
...
internal TValue GetValueOrDefault(TKey key) {
    int i = FindEntry(key);
    // i >= 0 means the element was found, so return its value directly
    // Otherwise return the default value of the type
    if (i >= 0) {
        return entries[i].value;
    }
    return default(TValue);
}
```

### 2.5. Dictionary – remove operation

Adding and searching have been covered. Next comes removal from a dictionary, using the same data structure as before.

The first steps of a removal are similar to a lookup: find the position of the element, then delete it.

Suppose we now execute `dictionary.Remove("a")`. The hash result is the same as above, and most of the steps resemble the lookup, so let's look directly at the excerpted code below.

```
public bool Remove(TKey key) {
    if (key == null) {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.key);
    }
    if (buckets != null) {
        // 1. Get the hash code from the key
        int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
        // 2. Get the bucket position by taking the remainder
        int bucket = hashCode % buckets.Length;
        // last is the index of the previous entry in the chain; -1 means the current entry is the head
        int last = -1;
        // 3. Traverse the singly linked list of the bucket
        for (int i = buckets[bucket]; i >= 0; last = i, i = entries[i].next) {
            if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key)) {
                // 4. last < 0 means the element is the head of the chain, so the bucket can point directly at entries[i].next
                if (last < 0) {
                    buckets[bucket] = entries[i].next;
                }
                else {
                    // 4.1 Otherwise the element has a predecessor; link the predecessor to the successor so the list is not broken
                    entries[last].next = entries[i].next;
                }
                // 5. Reset the data in the entry
                entries[i].hashCode = -1;
                // 5.1 Chain the entry onto the free list
                entries[i].next = freeList;
                entries[i].key = default(TKey);
                entries[i].value = default(TValue);
                // 6. Key step: freeList now points at this entry, so the next added element is stored here first
                freeList = i;
                freeCount++;
                // 7. Version number + 1
                version++;
                return true;
            }
        }
    }
    return false;
}
```

After the code above executes, the data structure is updated as shown in the figure below. Note that the values of `version`, `freeList` and `freeCount` have all been updated.

### 2.6. Dictionary – resize operation (capacity expansion)

Attentive readers may ask after seeing the add operation: `buckets` and `entries` are just two arrays, so what happens when they are full? That is the job of the resize operation introduced next, which expands the capacity of `buckets` and `entries`.

#### 2.6.1. Trigger conditions for capacity expansion

First, we need to know when capacity expansion occurs. The first case is that the arrays are full and there is no room to store new elements, as shown in the figure below.

As discussed above, hashing inevitably produces conflicts, and Dictionary resolves them with chaining (the zipper method). But consider the situation in the figure below.

All the elements happen to land in `buckets[3]`. Lookup then degrades to O(n) time and search performance drops. So the second case is that too many collisions occur in the dictionary, which seriously hurts performance and also triggers the expansion operation.

In .NET Framework 4.7 the collision count threshold is currently set to 100.

```
public const int HashCollisionThreshold = 100;
```

#### 2.6.2. How to expand capacity

To show this clearly, the following data structure is simulated: a dictionary of size 2 with an assumed collision threshold of 2. A hash collision expansion is now triggered.

Expansion begins.

1. Allocate buckets and entries twice the current size.

2. Copy the existing elements to the new entries.

After these two steps, the new data structure is as follows.

3. If this is a hash collision expansion, recompute the hash values with the new hash function.

As mentioned above, this is a hash collision expansion, so the hash values must be recomputed with a new hash function. The new hash function does not necessarily resolve the collisions, and the result may even be worse: as in the figure below, the elements can still land in the same `bucket`.

4. For each element of entries, `bucket = newEntries[i].hashCode % newSize` determines its position in the new buckets.

5. **Rebuild the hash chain:** `newEntries[i].next = newBuckets[bucket]; newBuckets[bucket] = i;`

Because `buckets` has also grown to twice the size, the `bucket` that each `hashCode` belongs to must be recomputed; finally, the singly linked hash chains are rebuilt.

This completes the expansion operation. If the expansion was triggered by reaching the hash collision threshold, the result may be worse after expansion. Note also that in the `Insert` code shown in section 2.7, the collision-triggered call is `Resize(entries.Length, true)`, which keeps the array length unchanged and only recomputes the hash codes; the doubling described here belongs to the simplified simulation.

In the JDK, `HashMap` converts a bucket's singly linked list into a red-black tree when there are too many collisions, to improve lookup performance. .NET Framework currently has no such optimization. There are similar optimizations in .NET Core, and some of its collection implementations may be shared in a later article.

Each capacity expansion has to traverse all elements, which affects performance. It is therefore best to pass an estimated initial size when creating a dictionary instance.

```
private void Resize(int newSize, bool forceNewHashCodes) {
    Contract.Assert(newSize >= entries.Length);
    // 1. Allocate the new buckets and entries
    int[] newBuckets = new int[newSize];
    for (int i = 0; i < newBuckets.Length; i++) newBuckets[i] = -1;
    Entry[] newEntries = new Entry[newSize];
    // 2. Copy the elements in entries to the new entries
    Array.Copy(entries, 0, newEntries, 0, count);
    // 3. If this is a hash collision expansion, recompute the hash values with the new hash function
    if (forceNewHashCodes) {
        for (int i = 0; i < count; i++) {
            if (newEntries[i].hashCode != -1) {
                newEntries[i].hashCode = (comparer.GetHashCode(newEntries[i].key) & 0x7FFFFFFF);
            }
        }
    }
    // 4. Determine the new bucket position
    // 5. Rebuild the singly linked hash chains
    for (int i = 0; i < count; i++) {
        if (newEntries[i].hashCode >= 0) {
            int bucket = newEntries[i].hashCode % newSize;
            newEntries[i].next = newBuckets[bucket];
            newBuckets[bucket] = i;
        }
    }
    buckets = newBuckets;
    entries = newEntries;
}
```
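Since every resize walks all elements, the advice above about an initial size can be followed with the `Dictionary<TKey, TValue>(int capacity)` constructor. The sketch below assumes the number of items is known ahead of time; `CapacityDemo` and `BuildLookup` are made-up names for illustration.

```csharp
using System.Collections.Generic;

public static class CapacityDemo
{
    public static Dictionary<int, string> BuildLookup(int expectedCount)
    {
        // Passing the expected size up front lets the dictionary allocate
        // its internal arrays once instead of growing them repeatedly.
        var lookup = new Dictionary<int, string>(expectedCount);
        for (int i = 0; i < expectedCount; i++)
        {
            lookup.Add(i, "item " + i);
        }
        return lookup;
    }
}
```

The capacity is only a hint about the initial allocation: the dictionary still grows if more elements are added than estimated.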

### 2.7. Dictionary – add operation

In the earlier add steps we mentioned one other case that was set aside: elements that have been deleted. Recall step 3:

> Setting one other case aside for now, the `hashCode`, `key` and `value` are stored in `entries[count]`, because the `count` position is free; then `count++` points to the next free location. In the figure above the first position, index = 0, is free, so the data is stored in `entries[0]`.

Because `count` only moves forward by self-increment to the next free `entry` in `entries[]`, deleting an element leaves a free `entry` at a position before `count`. If that slot were never reused, a lot of space would be wasted.

That is why the remove operation records `freeList` and `freeCount`: to make use of the freed space. In fact, the add operation prefers the free `entry` position at `freeList`. The code is excerpted below.

```
private void Insert(TKey key, TValue value, bool add) {
    if (key == null) {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.key);
    }
    if (buckets == null) Initialize(0);
    // Get the hash code from the key
    int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
    // Compute the target bucket index
    int targetBucket = hashCode % buckets.Length;
    // Number of collisions
    int collisionCount = 0;
    for (int i = buckets[targetBucket]; i >= 0; i = entries[i].next) {
        if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key)) {
            // If this is an add operation and an equal key already exists, throw an exception
            if (add) {
                ThrowHelper.ThrowArgumentException(ExceptionResource.Argument_AddingDuplicate);
            }
            // Otherwise it is an indexer assignment such as dictionary["foo"] = "foo":
            // overwrite the value, increment the version, and exit
            entries[i].value = value;
            version++;
            return;
        }
        // Every entry walked in the chain counts as one collision
        collisionCount++;
    }
    int index;
    // If there is a deleted element, reuse its free slot
    if (freeCount > 0) {
        index = freeList;
        freeList = entries[index].next;
        freeCount--;
    }
    else {
        // If entries is full, trigger capacity expansion
        if (count == entries.Length)
        {
            Resize();
            targetBucket = hashCode % buckets.Length;
        }
        index = count;
        count++;
    }
    // Fill in the entry
    entries[index].hashCode = hashCode;
    entries[index].next = buckets[targetBucket];
    entries[index].key = key;
    entries[index].value = value;
    buckets[targetBucket] = index;
    // Version number++
    version++;
    // If the number of collisions exceeds the threshold, trigger a hash collision expansion
    if (collisionCount > HashHelpers.HashCollisionThreshold && HashHelpers.IsWellKnownEqualityComparer(comparer))
    {
        comparer = (IEqualityComparer<TKey>) HashHelpers.GetRandomizedEqualityComparer(comparer);
        Resize(entries.Length, true);
    }
}
```

That is the complete add code. It is quite simple, isn't it?
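The interplay between remove and add through `freeList` can be sketched with a stripped-down simulation. It follows the same index bookkeeping as the excerpts in this article but is not the real class: hash codes are passed in explicitly, values, duplicate checks, versioning and resizing are omitted, and `FreeListSketch` is a made-up name.

```csharp
using System;

public sealed class FreeListSketch
{
    private struct Entry { public int HashCode; public int Next; public string Key; }

    private readonly int[] buckets;
    private readonly Entry[] entries;
    private int count;
    private int freeList = -1;
    private int freeCount;

    public FreeListSketch(int size)
    {
        buckets = new int[size];
        for (int i = 0; i < size; i++) buckets[i] = -1;
        entries = new Entry[size];
    }

    // Returns the index in entries where the key was stored.
    public int Add(string key, int hashCode)
    {
        int index;
        if (freeCount > 0)
        {
            index = freeList;               // reuse the most recently freed slot
            freeList = entries[index].Next; // pop it off the free list
            freeCount--;
        }
        else
        {
            if (count == entries.Length) throw new InvalidOperationException("Full; resizing is omitted in this sketch.");
            index = count++;
        }
        int bucket = hashCode % buckets.Length;
        entries[index].HashCode = hashCode;
        entries[index].Next = buckets[bucket];
        entries[index].Key = key;
        buckets[bucket] = index;
        return index;
    }

    public bool Remove(string key, int hashCode)
    {
        int bucket = hashCode % buckets.Length;
        int last = -1;
        for (int i = buckets[bucket]; i >= 0; last = i, i = entries[i].Next)
        {
            if (entries[i].HashCode == hashCode && entries[i].Key == key)
            {
                if (last < 0) buckets[bucket] = entries[i].Next;
                else entries[last].Next = entries[i].Next;
                entries[i].HashCode = -1;
                entries[i].Next = freeList; // push the slot onto the free list
                entries[i].Key = null;
                freeList = i;
                freeCount++;
                return true;
            }
        }
        return false;
    }
}
```

With a size of 4, adding "a" (hash 6), "c" (hash 6) and "e" (hash 7) fills slots 0, 1 and 2. After `Remove("a", 6)`, the next `Add` returns 0 again instead of 3, because the freed slot is preferred over `count`.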

### 2.8. Collection version control

As mentioned above, the `version` variable is incremented (`version++`) on every add, modify or delete. So what is the point of this `version`?

First, let's look at a piece of code. It creates a dictionary instance, traverses it with `foreach`, and inside the `foreach` block calls `dic.Remove(kv.Key)` to delete elements.

The result is a `System.InvalidOperationException:"Collection was modified..."` exception, because the collection is not allowed to change during iteration. Deleting elements directly while traversing in Java can lead to strange problems, so .NET uses `version` to implement version control.

So how is version control implemented during iteration? Let's take a look at the source code.

When the iterator is initialized, it records the `dictionary.version` number. Each iteration step then checks whether the version number still matches, and throws an exception if it does not.

This avoids many strange problems caused by modifying the collection during iteration.
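The version check described above can be sketched with a small self-contained collection. This is an illustration of the technique, not the Dictionary enumerator itself: `VersionedBag` is a made-up class that stamps a version on every change and compares it on each iteration step.

```csharp
using System;
using System.Collections.Generic;

public sealed class VersionedBag
{
    private readonly List<int> items = new List<int>();
    private int version;

    public void Add(int item)
    {
        items.Add(item);
        version++; // every change bumps the version
    }

    public bool Remove(int item)
    {
        bool removed = items.Remove(item);
        if (removed) version++;
        return removed;
    }

    public IEnumerable<int> Enumerate()
    {
        int snapshot = version; // recorded when iteration starts
        int i = 0;
        while (true)
        {
            // Check before every step, as the text describes.
            if (snapshot != version)
            {
                throw new InvalidOperationException("Collection was modified; enumeration operation may not execute.");
            }
            if (i >= items.Count) yield break;
            yield return items[i++];
        }
    }
}
```

Calling `Remove` inside a `foreach` over `Enumerate()` changes `version`, so the very next iteration step throws `InvalidOperationException` instead of continuing over a collection whose layout has changed.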
