Java internals series – how does HashSet ensure that elements are not repeated?

Time: 2020-12-6

Interviewer: Can you briefly explain the difference between List and Set?

Xiaohan:

  • List preserves insertion order and allows duplicate elements. An ArrayList is backed by a contiguous array, so indexed reads are fast while inserting or deleting in the middle is slow;
  • Set does not allow duplicate elements, and a HashSet makes no ordering guarantee. Because it is hash-based, adding, removing, and membership checks are all fast (see the sketch below);
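
As a quick sketch of the duplicate-handling difference Xiaohan describes (the class name and values here are made up purely for illustration):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ListVsSetDemo {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        list.add("java");
        list.add("java");                 // a List keeps both copies
        System.out.println(list);         // [java, java]

        Set<String> set = new HashSet<>();
        set.add("java");
        set.add("java");                  // a HashSet silently drops the duplicate
        System.out.println(set);          // [java]
    }
}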

Interviewer: How does HashSet ensure that its elements are not repeated?

Xiaohan: three minutes…


To spare Xiaohan the embarrassment of knowing the what but not the why, let's work through the question above.

Dear readers, read on.

We all know that a HashSet does not allow duplicate elements. But do you know how it actually guarantees this?

Let’s look at the source code

public class HashSet<E>
    extends AbstractSet<E>
    implements Set<E>, Cloneable, java.io.Serializable
{
    static final long serialVersionUID = -5024744406713321676L;

    // the backing HashMap: the Set's elements are stored as its keys
    private transient HashMap<E,Object> map;

    // dummy value shared by every key in the backing map
    private static final Object PRESENT = new Object();

    public HashSet() {
        map = new HashMap<>();
    }

    
    public HashSet(Collection<? extends E> c) {
        map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16));
        addAll(c);
    }

    
    public HashSet(int initialCapacity, float loadFactor) {
        map = new HashMap<>(initialCapacity, loadFactor);
    }
}

At first glance at the code: well, well. Doesn't new HashSet() simply maintain a HashMap internally? If we keep following the code, the trick should become clear!

Ladies and gentlemen, let’s move on

public boolean add(E e) {
    // put() returns the previous value for the key, or null if the key was absent,
    // so add() returns true only when the element was not already in the set
    return map.put(e, PRESENT)==null;
}

What? Isn't this just a HashMap operation? Hold on, let me reason this through.

Keys in a HashMap cannot be repeated, so HashSet simply rides on the map's unique-key guarantee to reject duplicate elements. Clever.
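
A minimal sketch of that idea, assuming nothing beyond the standard collections API: HashSet.add() reports whether the element was new, and underneath, HashMap.put() returns the previous value for a key, or null if the key was absent (the class name is only for this example).

import java.util.HashMap;
import java.util.HashSet;

public class NoDuplicateDemo {
    public static void main(String[] args) {
        HashSet<String> set = new HashSet<>();
        System.out.println(set.add("java"));        // true  -> element was new
        System.out.println(set.add("java"));        // false -> duplicate, set unchanged
        System.out.println(set.size());             // 1

        // the same mechanism with a plain HashMap:
        HashMap<String, Object> map = new HashMap<>();
        System.out.println(map.put("java", "v1"));  // null -> key was absent
        System.out.println(map.put("java", "v2"));  // v1   -> key already existed
    }
}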

Indeed, HashSet relies on exactly this property of HashMap to keep its elements unique. But let's dig deeper: how does HashMap ensure that keys are not repeated?

So this article is not so much about how HashSet avoids duplicate elements as about how HashMap avoids duplicate keys. Here is HashMap.putVal(), which put() delegates to:

final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;
            
        //1. If the bucket is empty, insert the new node directly
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);
        else {
            Node<K,V> e; K k;
            //2. The bucket is occupied: check whether the key is a duplicate
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                e = p;
            else if (p instanceof TreeNode)
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            else {
                for (int binCount = 0; ; ++binCount) {
                    if ((e = p.next) == null) {
                        p.next = newNode(hash, key, value, null);
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            treeifyBin(tab, hash);
                        break;
                    }
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    p = e;
                }
            }
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }
        ++modCount;
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }

In the code above, focus on the two marked sections, 1 and 2.

The first section:

if ((p = tab[i = (n - 1) & hash]) == null)

This line uses the hash to compute the element's bucket index, (n - 1) & hash, and then checks whether that slot already holds a node. If the slot is empty, the new node is inserted directly, and put() eventually returns null.
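
A small worked sketch of that index calculation, using a made-up hash value: with a table of length n = 16, (n - 1) & hash simply keeps the low four bits of the hash. (In JDK 8, HashMap first spreads the key's hashCode() with h ^ (h >>> 16) before this step.)

public class BucketIndexDemo {
    public static void main(String[] args) {
        int n = 16;                      // table length (always a power of two)
        int hash = 0b1011_0110;          // example hash value: 182

        // (n - 1) = 0b1111, so the & keeps only the low 4 bits of the hash
        int index = (n - 1) & hash;      // 0b0110 = 6
        System.out.println(index);       // 6
    }
}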

The second section:

if (p.hash == hash &&
    ((k = p.key) == key || (key != null && key.equals(k))))
    e = p;

If the computed slot already contains a node, the hash and equals() are used together to decide whether the key is a duplicate. If it is, the existing node is reused and its old value is returned, so no new entry is added.

From this second section we can see that duplicates are detected with hashCode() and equals(). So if we store our own objects in a HashSet, we must override both hashCode() and equals().

Now it is clear why equals() (together with hashCode()) must be overridden. Otherwise we end up with the puzzling situation: the two objects have exactly the same field values, yet the set keeps both of them!
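
To make that concrete, here is a minimal sketch; the Person class and its field values are invented for this example. Without overriding hashCode() and equals(), HashSet falls back to Object's identity-based versions and keeps both "equal" objects; with the overrides, the second add() is rejected.

import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class Person {
    private final String name;
    private final int age;

    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Without these two overrides, HashSet falls back to Object's
    // identity-based hashCode()/equals(), so "equal" Persons are both stored.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        Person p = (Person) o;
        return age == p.age && Objects.equals(name, p.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, age);
    }

    public static void main(String[] args) {
        Set<Person> set = new HashSet<>();
        set.add(new Person("Xiaohan", 25));
        set.add(new Person("Xiaohan", 25));  // duplicate by value
        System.out.println(set.size());      // 1 with the overrides; 2 without them
    }
}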