Analyzing the implementation of the C# Dictionary

Time:2021-11-4
Contents
  • 1. Theoretical knowledge
    • 1.1. Hash algorithm
    • 1.2. Hash bucket algorithm
    • 1.3. Conflict resolution algorithm
  • 2. Dictionary implementation
    • 2.1. Entry structure
    • 2.2. Other key private variables
    • 2.3. Dictionary – add operation
    • 2.4. Dictionary – find operation
    • 2.5. Dictionary – remove operation
    • 2.6. Dictionary – resize operation (capacity expansion)
      • 2.6.1. Trigger conditions for capacity expansion
      • 2.6.2. How to expand capacity
    • 2.7. Dictionary – add operation
    • 2.8. Collection version control

      1. Theoretical knowledge

      Two algorithms are key to how Dictionary works: a hash algorithm, and a conflict resolution algorithm for handling hash collisions.

      1.1. Hash algorithm

      A hash algorithm is a digest algorithm: it maps binary data of arbitrary length to a shorter, fixed-length value. The well-known MD5 algorithm is one example; it can produce a digest for any data. A function that implements a hash algorithm is called a hash function. Hash functions have the following characteristics.

      • Hashing the same data always gives the same result: HashFunc(key1) == HashFunc(key1).
      • Hashing different data may also give the same result (a hash collision): key1 != key2 does not rule out HashFunc(key1) == HashFunc(key2).
      • Hashing is irreversible; the original data cannot be recovered from the hash code: key1 => hashCode, but hashCode =/=> key1.
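The three characteristics can be seen with a deliberately weak hash function. The sketch below is only an illustration and is not how .NET hashes strings; `ToyHash.Compute` is a hypothetical helper that simply sums the character codes of its input.

```csharp
using System;

public static class ToyHash
{
    // Hypothetical hash function: sums the UTF-16 code units of the input.
    public static int Compute(string s)
    {
        int sum = 0;
        foreach (char c in s) sum += c;
        return sum;
    }
}

public static class Program
{
    public static void Main()
    {
        // The same input always gives the same result.
        Console.WriteLine(ToyHash.Compute("ab") == ToyHash.Compute("ab")); // True
        // Different inputs can collide: 'a' + 'b' equals 'b' + 'a'.
        Console.WriteLine(ToyHash.Compute("ab") == ToyHash.Compute("ba")); // True
        // The hash code alone cannot tell us whether the input was "ab" or "ba".
        Console.WriteLine(ToyHash.Compute("ab")); // 195
    }
}
```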

      The following figure is a simple illustration of a hash function: data of any length is mapped to a shorter value by HashFunc.

      The following figure illustrates a hash collision: after hashing, Sandra Dee and John Smith both land on position 02, so the two keys collide.

      Common algorithms for constructing hash functions are as follows:

      1. Direct addressing method: use the key itself, or a linear function of the key, as the hash address. That is, H(key) = key or H(key) = a · key + b, where a and b are constants (such a hash function is called a self function).

      2. Digit analysis method: analyze a set of data, for example the dates of birth of a group of employees. The leading digits of the dates are largely the same, so using them would make collisions very likely, whereas the trailing digits, which represent the month and day, vary widely. Building the hash address from those trailing digits significantly reduces the probability of collision. Digit analysis therefore means finding the regularities in the numbers and using the most varied digits to construct hash addresses with a low collision probability.

      3. Mid-square method: square the key and take the middle digits of the result as the hash address.

      4. Folding method: split the key into several parts with the same number of digits (the last part may have fewer), then take the sum of these parts, discarding the carry, as the hash address.

      5. Random number method: choose a random function and take the random value for the key as the hash address. It is often used when keys differ in length.

      6. Division-remainder method: take the remainder of the key divided by a number p that is not greater than the hash table length m. That is, H(key) = key mod p, p <= m. The modulus can be applied to the key directly, or after folding or squaring. The choice of p matters: it is usually a prime or m itself, and a poor choice of p makes collisions likely.
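Three of the methods above can be sketched in a few lines. This is a minimal illustration for non-negative integer keys written in decimal; the class `HashConstruction`, its method names, and the chosen digit widths (two middle digits, groups of three) are assumptions made for the example.

```csharp
using System;

public static class HashConstruction
{
    // Division-remainder method: H(key) = key mod p.
    public static int DivisionRemainder(int key, int p) => key % p;

    // Mid-square method (one possible variant): square the key,
    // drop the last two decimal digits and keep the next two.
    public static int MidSquare(int key)
    {
        long squared = (long)key * key;
        return (int)(squared / 100 % 100);
    }

    // Folding method: split the key into groups of three decimal digits,
    // add the groups, and discard the carry beyond three digits.
    public static int Fold(long key)
    {
        long sum = 0;
        while (key > 0)
        {
            sum += key % 1000;
            key /= 1000;
        }
        return (int)(sum % 1000);
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(HashConstruction.DivisionRemainder(1234, 7)); // 2
        Console.WriteLine(HashConstruction.MidSquare(1234));            // 27  (1234^2 = 1522756)
        Console.WriteLine(HashConstruction.Fold(123456789));            // 368 (789 + 456 + 123 = 1368)
    }
}
```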

      1.2. Hash bucket algorithm

      Hash algorithms naturally bring hash tables to mind. A key can be turned into a hash code quickly by the hash function, and the hash code can then be mapped directly to the value. However, the range of hash codes is generally very large, often 2^32 possible values, so it is impractical to reserve a slot for every possible hash code.

      Because of this, the generated hash codes are mapped in segments, and each segment is called a bucket. The most common approach is to take the remainder of the hash code directly.

      Suppose the hash code can take 2^32 values and these are divided among 8 buckets. Then bucketIndex = HashFunc(key1) % 8 determines which bucket a hash code maps to.

      As you can see, mapping hash codes to buckets makes hash collisions more likely.
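As a small sketch of that mapping: two different hash codes can share a bucket even though the hash codes themselves do not collide. The helper `Buckets.IndexFor` is hypothetical; clearing the sign bit with `& 0x7FFFFFFF` mirrors what the Dictionary source shown later does before taking the remainder.

```csharp
using System;

public static class Buckets
{
    // Maps a hash code to a bucket index. The sign bit is cleared first
    // so that the remainder is never negative.
    public static int IndexFor(int hashCode, int bucketCount)
        => (hashCode & 0x7FFFFFFF) % bucketCount;
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(Buckets.IndexFor(6, 8));  // 6
        Console.WriteLine(Buckets.IndexFor(14, 8)); // 6, a different hash code in the same bucket
        Console.WriteLine(Buckets.IndexFor(-1, 8)); // 7, negative hash codes still map to a valid index
    }
}
```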

      1.3. Conflict resolution algorithm

      Any hash algorithm will inevitably produce collisions, so how they are handled is key. Common conflict resolution algorithms include the zipper method (separate chaining, which is what Dictionary uses), open addressing, rehashing, and the common overflow area method. This article only introduces the zipper method and rehashing; readers interested in the other algorithms can refer to the references at the end of the article.

      1. Zipper method: build a singly linked list of the conflicting elements and store its head pointer in the corresponding bucket of the hash table. After locating the bucket, an element is found by traversing the linked list.

      2. Rehashing: as the name suggests, the key is hashed again with other hash functions until a position without a conflict is found.

      The figure below illustrates the zipper method: the conflict is resolved by building a singly linked list at the conflicting position.

      2. Dictionary implementation

      The Dictionary implementation is analyzed here mainly against its source code, currently the .NET Framework 4.7 version. Source code address: link

      This chapter introduces several key types and fields in Dictionary, then follows the code through insertion, deletion, and capacity expansion, which should make its design clear.

      2.1. Entry structure

      First, the Entry structure, defined in the code below. It is the smallest unit of data in a Dictionary: the arguments of each call to Add(key, value) are wrapped in such a structure.

      private struct Entry {
          public int hashCode;    // Lower 31 bits of the hash code (sign bit cleared); -1 if the entry is unused
          public int next;        // Index of the next entry in the chain; -1 if there is none
          public TKey key;        // The key of the element
          public TValue value;    // The value of the element
      }

      2.2. Other key private variables

      In addition to the Entry structure, there are several key private variables. Their definitions and explanations are shown in the following code.

      private int[] buckets;                      // Hash buckets: each holds the index of the first entry in its chain, or -1
      private Entry[] entries;                    // Entry array that stores the elements
      private int count;                          // Index of the next never-used position in entries
      private int version;                        // Current version, used to detect changes to the collection during iteration
      private int freeList;                       // Index in entries of the most recently deleted (free) entry
      private int freeCount;                      // Number of deleted entries, i.e. free positions available for reuse
      private IEqualityComparer<TKey> comparer;   // Comparer used for hashing and key equality
      private KeyCollection keys;                 // Collection of keys
      private ValueCollection values;             // Collection of values

      Note that the two arrays buckets and entries in the code above are the key to the Dictionary implementation.

      2.3. Dictionary – add operation

      After the analysis above, you may still not quite see why it is designed this way. Let's walk through Dictionary's add process to get a feel for it.

      First, we describe the data structure of a Dictionary as a diagram, drawing only the key parts: a structure with a bucket size of 4 and an entries size of 4.

      Now suppose we perform an add operation, dictionary.Add("a","b"), where key = "a" and value = "b".

      1. Compute the hash code of the key. We assume the hash value of "a" is 6 (GetHashCode("a") = 6).

      2. Take the remainder of the hash code to find the bucket it falls into. The bucket count (buckets.Length) is currently 4, so 6 % 4 = 2 and the key lands in the bucket at index 2, that is, buckets[2].

      3. Setting aside one other case for now, hashCode, key, and value are stored in entries[count], because position count is free; count++ then advances to the next free position. In the figure above the first position, index = 0, is free, so the data is stored in entries[0].

      4. Assign the Entry's index in entries to the corresponding bucket in buckets. Step 3 stored the data in entries[0], so buckets[2] = 0.

      5. Finally version++: the collection has changed, so the version is incremented. Only adding, replacing, and deleting elements update the version.

      Steps 1 to 5 above are simplified for ease of understanding. The real code differs slightly, as covered in the later section that returns to the add operation.

      After completing the above add operation, the data structure is updated to the form shown in the figure below.

      This is the ideal case: each bucket holds only one hash code and there is no collision. In practice collisions happen often, so how does the Dictionary class resolve them?

      We continue with another add operation, dictionary.Add("c","d"), assuming GetHashCode("c") = 6, so 6 % 4 = 2 and the bucket index is again 2. Steps 1 to 3 work as before; after they run, the data structure is as shown in the figure below.

      If we carried on with step 4 as described, then buckets[2] = 1, and the original relationship buckets[2] => entries[0] would be lost, which we do not want. This is where the next field in Entry comes into play.

      If buckets[index] already refers to another element, the following two statements make the new element's entry.next point to the previous element and make buckets[index] point to the new element, forming a singly linked list.

      
      entries[index].next = buckets[targetBucket];
      ...
      buckets[targetBucket] = index;

      In fact, step 4 always performs these two assignments without checking whether other elements exist: every bucket in buckets is initialized to -1, so this causes no problems.

      After the above steps, the data structure will be updated to the following figure.
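The two adds can be replayed with a stripped-down model of the buckets and entries arrays. This is a simplified sketch rather than the framework code: `ChainDemo` is a hypothetical class with no duplicate check, free list, or resize, and its `Hash` method is hard-coded to return the values assumed in the walkthrough.

```csharp
using System;

public class ChainDemo
{
    public struct Entry
    {
        public int HashCode;
        public int Next;
        public string Key;
        public string Value;
    }

    public readonly int[] Buckets = { -1, -1, -1, -1 };
    public readonly Entry[] Entries = new Entry[4];
    public int Count;

    // Hypothetical hash matching the walkthrough: "a" and "c" hash to 6, anything else to 7.
    public static int Hash(string key) => (key == "a" || key == "c") ? 6 : 7;

    // Simplified add: no duplicate check, no free list, no resize.
    public void Add(string key, string value)
    {
        int hashCode = Hash(key);
        int target = hashCode % Buckets.Length;
        int index = Count++;
        Entries[index].HashCode = hashCode;
        Entries[index].Next = Buckets[target]; // previous chain head, or -1
        Entries[index].Key = key;
        Entries[index].Value = value;
        Buckets[target] = index;               // the new entry becomes the chain head
    }
}

public static class Program
{
    public static void Main()
    {
        var d = new ChainDemo();
        d.Add("a", "b");
        d.Add("c", "d");
        Console.WriteLine(d.Buckets[2]);      // 1:  the bucket points at the newest entry
        Console.WriteLine(d.Entries[1].Next); // 0:  "c" links back to "a"
        Console.WriteLine(d.Entries[0].Next); // -1: "a" is the end of the chain
    }
}
```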

      2.4. Dictionary – find operation

      To make the lookup easier to demonstrate, we add one more element, dictionary.Add("e","f"), with GetHashCode("e") = 7 and 7 % buckets.Length = 3. The data structure is then as follows.

      Suppose we now execute dictionary.GetValueOrDefault("a"). The following steps are performed.

      1. Get the hash code of the key and compute the bucket position. As mentioned earlier, the hashCode of "a" is 6, so targetBucket = 2.

      2. Follow buckets[2] = 1 to entries[1] and compare the keys. If they are equal, return the entry index; if not, continue with entries[next] until an element with an equal key is found or next == -1. Here the walk reaches the element with key == "a" and returns entryIndex = 0.

      3. If entryIndex >= 0, return the corresponding entries[entryIndex] element; otherwise return default(TValue). Here entries[0].value is returned directly.

      The whole search process is shown in the figure below.

      The relevant lookup code is excerpted below.

      //Find the location of the entry element
      private int FindEntry(TKey key) {
          if( key == null) {
              ThrowHelper.ThrowArgumentNullException(ExceptionArgument.key);
          }
      
          if (buckets != null) {
              int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF; // Get the hash code, clearing the sign bit
              // int i = buckets[hashCode % buckets.Length] finds the bucket and yields the position of its first entry in entries
              // i >= 0; i = entries[i].next walks the singly linked list
              for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = entries[i].next) {
                  // Found: return the index
                  if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key)) return i;
              }
          }
          return -1;
      }
      ...
      internal TValue GetValueOrDefault(TKey key) {
          int i = FindEntry(key);
          //Greater than or equal to 0 means that the element position is found and directly returns value
          //Otherwise, the default value of this type is returned
          if (i >= 0) {
              return entries[i].value;
          }
          return default(TValue);
      }
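The lookup compares both the hash code and the key, which is why colliding keys are still told apart. This can be observed from the outside with the public API. In the sketch below, `ConstantHashComparer` is a hypothetical comparer that forces every key into one chain; because GetValueOrDefault is declared internal in the excerpt above, the example uses the public indexer, TryGetValue, and ContainsKey instead.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer that gives every key the same hash code,
// so all keys end up in a single bucket chain.
public sealed class ConstantHashComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);
    public int GetHashCode(string obj) => 6;
}

public static class Program
{
    public static void Main()
    {
        var dic = new Dictionary<string, string>(new ConstantHashComparer());
        dic.Add("a", "b");
        dic.Add("c", "d");

        // Both keys collide, yet each lookup returns the right value
        // because the chain walk also checks key equality.
        Console.WriteLine(dic["a"]);                                          // b
        Console.WriteLine(dic.TryGetValue("c", out var v) ? v : "(missing)"); // d
        Console.WriteLine(dic.ContainsKey("e"));                              // False
    }
}
```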

      2.5. Dictionary – remove operation

      Having covered adding and finding, we now look at how Dictionary deletes an element, using the same data structure as before.

      The first steps of a delete are similar to a lookup: the element's position must be found before it can be removed.

      We now execute dictionary.Remove("a"); the hash results are the same as above. Since most steps resemble a lookup, let's look directly at the excerpted code below.

      public bool Remove(TKey key) {
          if(key == null) {
              ThrowHelper.ThrowArgumentNullException(ExceptionArgument.key);
          }
      
          if (buckets != null) {
              //1. Get the hash code from the key
              int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
              //2. Get the bucket position by taking the remainder
              int bucket = hashCode % buckets.Length;
              //last is the index of the previous entry in the chain; -1 means the current entry is the head of the chain
              int last = -1;
              //3. Traverse the singly linked list of this bucket
              for (int i = buckets[bucket]; i >= 0; last = i, i = entries[i].next) {
                  if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key)) {
                      //4. Element found. last < 0 means it is the head of the chain, so the bucket simply points to entries[i].next
                      if (last < 0) {
                          buckets[bucket] = entries[i].next;
                      }
                      else {
                          //4.1 Otherwise the element is in the middle or at the end of the chain: link its predecessor to its successor so the list is not broken
                          entries[last].next = entries[i].next;
                      }
                      //5. Reset the data in the Entry structure
                      entries[i].hashCode = -1;
                      //5.1 Push the entry onto the freeList singly linked list
                      entries[i].next = freeList;
                      entries[i].key = default(TKey);
                      entries[i].value = default(TValue);
                      //6. Key step: freeList now points at this entry, so the next added element will be stored here first
                      freeList = i;
                      freeCount++;
                      //7. Version number + 1
                      version++;
                      return true;
                  }
              }
          }
          return false;
      }

      After executing the above code, the data structure is updated as shown in the figure below. Note that the values of version, freeList, and freeCount have all been updated.
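The free list bookkeeping in steps 5.1 and 6 is easier to follow in isolation. The sketch below models only slot allocation, under the assumption that released slots are chained through a next array exactly as the Remove code chains them through entries[i].next; `SlotAllocator` and its members are hypothetical names, and no keys or values are stored.

```csharp
using System;

public class SlotAllocator
{
    private readonly int[] next;
    private int count;          // next never-used slot
    private int freeList = -1;  // most recently released slot, or -1
    private int freeCount;      // number of released slots awaiting reuse

    public SlotAllocator(int capacity) { next = new int[capacity]; }

    // Mirrors the slot choice made when adding: reuse a released slot first.
    public int Allocate()
    {
        if (freeCount > 0)
        {
            int index = freeList;
            freeList = next[index]; // pop the head of the free list
            freeCount--;
            return index;
        }
        if (count == next.Length)
            throw new InvalidOperationException("Full; a real dictionary would resize here.");
        return count++;
    }

    // Mirrors the bookkeeping done when removing: push the slot onto the free list.
    public void Release(int index)
    {
        next[index] = freeList;
        freeList = index;
        freeCount++;
    }
}

public static class Program
{
    public static void Main()
    {
        var slots = new SlotAllocator(4);
        slots.Allocate(); // 0
        slots.Allocate(); // 1
        slots.Allocate(); // 2
        slots.Release(0);
        slots.Release(2);
        Console.WriteLine(slots.Allocate()); // 2: the most recently released slot is reused first
        Console.WriteLine(slots.Allocate()); // 0
        Console.WriteLine(slots.Allocate()); // 3: free list empty, so take the next unused slot
    }
}
```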

      2.6. Dictionary – resize operation (capacity expansion)

      Attentive readers may ask after seeing the add operation: buckets and entries are just two arrays, so what happens when they are full? That is the job of the resize operation introduced next, which expands the capacity of buckets and entries.

      2.6.1. Trigger conditions for capacity expansion

      First we need to know when capacity expansion occurs. The first case is that the array is full and there is no room to store new elements, as shown in the figure below.

      As we know from the above, hashing inevitably produces collisions, and Dictionary resolves them with the zipper method. But consider the situation in the figure below.

      All the elements happen to land in buckets[3]. Lookups then degrade to O(n) time complexity and search performance drops. So the second case is that too many collisions occur in the Dictionary, which seriously hurts performance and also triggers the capacity expansion operation.

      At present, the collision count threshold set in .NET Framework 4.7 is 100.

      
      public const int HashCollisionThreshold = 100;

      2.6.2. How to expand capacity

      To make the process clear, the following data structure is simulated: a dictionary of size 2 with an assumed collision threshold of 2. A hash collision expansion is now triggered.

      Start capacity expansion.

      1. Allocate buckets and entries twice the current size

      2. Copy the existing elements to the new entries

      After completing the above two steps, the new data structure is as follows.

      3. If it is a hash collision expansion, use the new hash function to recalculate the hash values

      As noted above, this is a hash collision expansion, so a new hash function is used to compute the hash values. The new hash function does not necessarily solve the collision problem, and the result may even be worse: as in the figure below, the elements may still fall into the same bucket.

      4. For each element of entries, bucket = newEntries[i].hashCode % newSize determines its position in the new buckets

      5. Rebuild the hash chain: newEntries[i].next = buckets[bucket]; buckets[bucket] = i;

      Because buckets has also doubled in size, the bucket that each hashCode falls into must be determined again; finally, the hash linked lists are rebuilt.

      This completes the expansion operation. If the expansion was triggered by reaching the hash collision threshold, the result after expansion may be worse. Note also that in the Insert code shown later, the collision-triggered call is Resize(entries.Length, true), which keeps the current size and only recomputes the hash codes.

      In the JDK, when a HashMap bucket has too many collisions, its singly linked list is converted into a red-black tree to improve search performance. The .NET Framework currently has no such optimization; .NET Core is said to have similar optimizations, and a future article may cover some of the .NET Core collection implementations.

      Each capacity expansion has to traverse all elements, which affects performance. It is therefore best to pass an estimated initial size when creating a Dictionary instance.

      private void Resize(int newSize, bool forceNewHashCodes) {
          Contract.Assert(newSize >= entries.Length);
          //1. Apply for new buckets and entries
          int[] newBuckets = new int[newSize];
          for (int i = 0; i < newBuckets.Length; i++) newBuckets[i] = -1;
          Entry[] newEntries = new Entry[newSize];
          //2. Copy the elements in the entries to the new entries
          Array.Copy(entries, 0, newEntries, 0, count);
          //3. If it is a hash collision expansion, use the new hashcode function to recalculate the hash value
          if(forceNewHashCodes) {
              for (int i = 0; i < count; i++) {
                  if(newEntries[i].hashCode != -1) {
                      newEntries[i].hashCode = (comparer.GetHashCode(newEntries[i].key) & 0x7FFFFFFF);
                  }
              }
          }
          //4. Determine the new bucket position of each entry
          //5. Rebuild the hash singly linked lists
          for (int i = 0; i < count; i++) {
              if (newEntries[i].hashCode >= 0) {
                  int bucket = newEntries[i].hashCode % newSize;
                  newEntries[i].next = newBuckets[bucket];
                  newBuckets[bucket] = i;
              }
          }
          buckets = newBuckets;
          entries = newEntries;
      }
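Steps 4 and 5 can be tried on their own. The sketch below keeps only the hash codes and rebuilds the bucket heads and next links for a given table size; `ResizeDemo.Rebuild` is a hypothetical helper, and the sizes 4 and 8 follow the simplified doubling used in this walkthrough.

```csharp
using System;

public static class ResizeDemo
{
    // Recomputes bucket heads and chain links for the given hash codes,
    // in the same way as the final loop of Resize.
    public static (int[] Buckets, int[] Next) Rebuild(int[] hashCodes, int newSize)
    {
        var buckets = new int[newSize];
        for (int i = 0; i < buckets.Length; i++) buckets[i] = -1;

        var next = new int[hashCodes.Length];
        for (int i = 0; i < hashCodes.Length; i++)
        {
            int bucket = hashCodes[i] % newSize;
            next[i] = buckets[bucket]; // link to the previous head of this bucket
            buckets[bucket] = i;       // this entry becomes the new head
        }
        return (buckets, next);
    }
}

public static class Program
{
    public static void Main()
    {
        int[] hashCodes = { 6, 10, 7 };

        var small = ResizeDemo.Rebuild(hashCodes, 4);
        Console.WriteLine(string.Join(",", small.Buckets)); // -1,-1,1,2
        Console.WriteLine(string.Join(",", small.Next));    // -1,0,-1  (6 and 10 share bucket 2)

        var large = ResizeDemo.Rebuild(hashCodes, 8);
        Console.WriteLine(string.Join(",", large.Buckets)); // -1,-1,1,-1,-1,-1,0,2
        Console.WriteLine(string.Join(",", large.Next));    // -1,-1,-1 (no chains left)
    }
}
```

With more buckets the hash codes 6 and 10 no longer share a chain, which is the benefit of an ordinary capacity expansion; keys with identical hash codes would still collide at any size.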

      2.7. Dictionary – add operation

      In the earlier add steps, we mentioned that there is one other case: elements may have been deleted.

      Step 3 of the add walkthrough said: setting aside one other case, hashCode, key, and value are stored in entries[count], because position count is free; count++ then advances to the next free position.

      Because count only ever advances to the next free entry in entries[], deleting an element leaves a free entry at a position before count; if such entries were not reused, a lot of space would be wasted.

      That is why the remove operation records freeList and freeCount: to reuse the deleted space. The add operation in fact takes a free entry from freeList first. The excerpted code is as follows.

      private void Insert(TKey key, TValue value, bool add){
          
          if( key == null ) {
              ThrowHelper.ThrowArgumentNullException(ExceptionArgument.key);
          }
      
          if (buckets == null) Initialize(0);
          // Get the hash code from the key
          int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
          // Compute the target bucket index
          int targetBucket = hashCode % buckets.Length;
          // Number of collisions
          int collisionCount = 0;
          for (int i = buckets[targetBucket]; i >= 0; i = entries[i].next) {
              if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key)) {
                  // For an Add call, finding an equal key means a duplicate, so an exception is thrown
                  if (add) {
                      ThrowHelper.ThrowArgumentException(ExceptionResource.Argument_AddingDuplicate);
                  }
                  // Otherwise this is an indexer assignment such as dictionary["foo"] = "foo":
                  // overwrite the value, increment version, and return
                  entries[i].value = value;
                  version++;
                  return;
              }
              // Every entry traversed counts as one collision
              collisionCount++;
          }
          int index;
          //If there is a deleted element, put the element in the free position of the deleted element
          if (freeCount > 0) {
              index = freeList;
              freeList = entries[index].next;
              freeCount--;
          }
          else {
              //If the current entries are full, the capacity expansion is triggered
              if (count == entries.Length)
              {
                  Resize();
                  targetBucket = hashCode % buckets.Length;
              }
              index = count;
              count++;
          }
      
          //Assign value to entry
          entries[index].hashCode = hashCode;
          entries[index].next = buckets[targetBucket];
          entries[index].key = key;
          entries[index].value = value;
          buckets[targetBucket] = index;
          //Version number++
          version++;
      
          //If the number of collisions is greater than the set maximum number of collisions, the hash collision expansion is triggered
          if(collisionCount > HashHelpers.HashCollisionThreshold && HashHelpers.IsWellKnownEqualityComparer(comparer)) 
          {
              comparer = (IEqualityComparer<TKey>) HashHelpers.GetRandomizedEqualityComparer(comparer);
              Resize(entries.Length, true);
          }
      }

      The above is the complete add code. Fairly simple, isn't it?

      2.8. Collection version control

      The version variable has come up several times above: every add, modify, or delete performs version++. So what is version for?

      First, let's look at a piece of code. It instantiates a Dictionary, iterates over it with foreach, and inside the foreach block calls dic.Remove(kv.Key) to delete elements.

      The result is a System.InvalidOperationException: "Collection was modified..." being thrown: the collection is not allowed to change during iteration. Modifying a collection while iterating over it can cause strange problems, so .NET uses version to implement version control.
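A minimal sketch of such a loop is shown below. It calls Add rather than Remove inside the foreach: on .NET Framework both invalidate the enumerator, but on .NET Core 3.0 and later Remove and Clear are documented as not invalidating enumerators, so Add keeps the example's behavior the same across runtimes. The method name `ThrowsWhenModified` is made up for the example.

```csharp
using System;
using System.Collections.Generic;

public static class Program
{
    // Returns true if modifying the dictionary inside foreach raises an exception.
    public static bool ThrowsWhenModified()
    {
        var dic = new Dictionary<string, string> { ["a"] = "b", ["c"] = "d" };
        try
        {
            foreach (var kv in dic)
            {
                // Changes the collection (and its version) in the middle of the iteration.
                dic.Add(kv.Key + "!", kv.Value);
            }
        }
        catch (InvalidOperationException)
        {
            return true;
        }
        return false;
    }

    public static void Main()
    {
        Console.WriteLine(ThrowsWhenModified()); // True
    }
}
```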

      So how is version control implemented during iteration? Let's take a look at the source code.

      When the enumerator is initialized, it records dictionary.version; on each iteration step it checks whether the version number is still the same, and throws an exception if it is not.

      This avoids many strange problems caused by modifying the collection while iterating over it.
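The same idea can be sketched on a much smaller collection. `VersionedList` below is a hypothetical class, not the Dictionary enumerator itself: it bumps a version counter on every change, and its iterator compares the counter it captured at the start with the current one before each step.

```csharp
using System;
using System.Collections.Generic;

public class VersionedList
{
    private readonly List<int> items = new List<int>();
    private int version;

    public void Add(int item)
    {
        items.Add(item);
        version++; // every modification bumps the version
    }

    public IEnumerable<int> Enumerate()
    {
        int startVersion = version; // captured when the iteration starts
        int i = 0;
        while (true)
        {
            // Check before every step, as the description above requires.
            if (startVersion != version)
                throw new InvalidOperationException("Collection was modified.");
            if (i >= items.Count) yield break;
            yield return items[i++];
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var list = new VersionedList();
        list.Add(1);
        list.Add(2);
        try
        {
            foreach (int item in list.Enumerate())
            {
                list.Add(item * 10); // modification during iteration
            }
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine(ex.Message); // Collection was modified.
        }
    }
}
```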
