Java HashSet worst case lookup time complexity

If hashtables/maps with closed hashing are worst-case O(n), are HashSets also going to require O(n) time for lookup, or is it constant time?

When looking up an element in a HashMap, it performs an O(1) calculation to find the right bucket, and then iterates over the items there serially until it finds the one that is equal to the requested key, or until all the items have been checked.
In the worst case scenario, all the items in the map have the same hash code and are therefore stored in the same bucket. In this case, you'll need to iterate over all of them serially, which would be an O(n) operation.
A HashSet is just a HashMap where you don't care about the values, only the keys - under the hood, it's a HashMap where all the values are a dummy Object.
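The bucket scan described above can be sketched with a tiny chained set that counts how many equals() comparisons a lookup makes. This is an illustration only, not the JDK code: the class ChainedSetDemo, the key type CollidingKey and the helper comparisonsToFindLast are all hypothetical names, and each bucket is a plain list.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal chained hash set for illustration (NOT the JDK implementation).
class ChainedSetDemo {
    private final List<List<Object>> buckets = new ArrayList<>();
    private long comparisons = 0; // equals() calls made by contains()

    ChainedSetDemo(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // O(1): map the hash code to a bucket.
    private List<Object> bucketFor(Object o) {
        return buckets.get(Math.floorMod(o.hashCode(), buckets.size()));
    }

    boolean add(Object o) {
        List<Object> bucket = bucketFor(o);
        if (bucket.contains(o)) {
            return false;
        }
        bucket.add(o);
        return true;
    }

    // Serial scan of a single bucket, counting each comparison.
    boolean contains(Object o) {
        for (Object candidate : bucketFor(o)) {
            comparisons++;
            if (candidate.equals(o)) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical key whose hash code is the same for every instance.
    static final class CollidingKey {
        final int id;
        CollidingKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object other) {
            return other instanceof CollidingKey && ((CollidingKey) other).id == id;
        }
    }

    // Adds n keys, then looks up the last one added and reports the comparisons needed.
    static long comparisonsToFindLast(int n, boolean collide) {
        ChainedSetDemo set = new ChainedSetDemo(1024);
        Object last = null;
        for (int i = 0; i < n; i++) {
            last = collide ? new CollidingKey(i) : Integer.valueOf(i);
            set.add(last);
        }
        set.contains(last);
        return set.comparisons;
    }

    public static void main(String[] args) {
        System.out.println("well spread keys: " + comparisonsToFindLast(1000, false)); // 1
        System.out.println("colliding keys:   " + comparisonsToFindLast(1000, true));  // 1000
    }
}
```

With 1000 well spread Integer keys the lookup touches one element; with 1000 colliding keys it walks the whole bucket, which is the O(n) case.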

If you look at the implementation of a HashSet (e.g. from OpenJDK 8: https://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashSet.java), you can see that it's actually just built on top of a HashMap. Relevant code snippet here:
public class HashSet<E>
    extends AbstractSet<E>
    implements Set<E>, Cloneable, java.io.Serializable
{
    private transient HashMap<E,Object> map;

    // Dummy value to associate with an Object in the backing Map
    private static final Object PRESENT = new Object();

    /**
     * Constructs a new, empty set; the backing <tt>HashMap</tt> instance has
     * default initial capacity (16) and load factor (0.75).
     */
    public HashSet() {
        map = new HashMap<>();
    }

    public boolean add(E e) {
        return map.put(e, PRESENT)==null;
    }

    // ... (remaining members omitted)
}
The HashSet attempts to slightly optimize the memory usage by creating a single static empty Object value named PRESENT and just using that as the value part of every key/value entry into the HashMap.
So whatever the performance implications are of using a HashMap, a HashSet will have more or less the same ones since it's literally using a HashMap under the covers.
To directly answer your question: in the worst case, yes. Just as the worst case complexity of a HashMap is O(n), so too the worst case complexity of a HashSet is O(n).
It is worth noting that, unless you have a really bad hash function or are using a hashtable of a ridiculously small size, you're very unlikely to see the worst case performance in practice. You'd have to have every element hash to the exact same bucket in the hashtable so the performance would essentially degrade to a linked list traversal (assuming a hashtable using chaining for collision handling, which the Java ones do).

Worst case is O(N) as mentioned, average and amortized run time is constant.
From GeeksForGeeks:
The underlying data structure for HashSet is hashtable. So amortize (average or usual case) time complexity for add, remove and look-up (contains method) operation of HashSet takes O(1) time.

I see a lot of people saying the worst case is O(n). This is because the old HashSet implementation used a linked list to handle collisions in the same bucket. However, that is not the whole answer.
In Java 8, such a linked list is replaced by a balanced binary tree when the number of collisions in a bucket grows. This improves the worst-case performance from O(n) to O(log n) for lookups.
You can check additional details here.
http://openjdk.java.net/jeps/180
https://www.nagarro.com/en/blog/post/24/performance-improvement-for-hashmap-in-java-8
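A rough way to observe this on a Java 8 or later runtime is to count comparisons for a single contains() call when every key collides. This is a sketch under assumptions: TreeBinDemo and Key are hypothetical names, the key type implements Comparable so the tree can order colliding keys, and the exact count depends on the JDK version.

```java
import java.util.HashSet;
import java.util.Set;

class TreeBinDemo {
    static long comparisons = 0; // equals() + compareTo() calls

    // Hypothetical key type: every instance has the same hash code.
    static final class Key implements Comparable<Key> {
        final int id;
        Key(int id) { this.id = id; }
        @Override public int hashCode() { return 1; }
        @Override public boolean equals(Object o) {
            comparisons++;
            return o instanceof Key && ((Key) o).id == id;
        }
        @Override public int compareTo(Key other) {
            comparisons++;
            return Integer.compare(id, other.id);
        }
    }

    // Fills a HashSet with n colliding keys, then counts the comparisons
    // made by one successful contains() call.
    static long comparisonsForOneLookup(int n) {
        Set<Key> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            set.add(new Key(i));
        }
        comparisons = 0;
        if (!set.contains(new Key(n / 2))) {
            throw new IllegalStateException("key should be present");
        }
        return comparisons;
    }

    public static void main(String[] args) {
        int n = 10_000;
        System.out.println(n + " colliding keys, comparisons for one contains(): "
                + comparisonsForOneLookup(n));
    }
}
```

On a runtime with tree bins the printed count should be a few dozen rather than thousands; on Java 7 or earlier it would grow linearly with n.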

Related

What is the time complexity of a HashMap implemented using dynamic array?

They say HashMap's put and get operations have a constant time complexity, is it still going to be O(1) if it is implemented with a dynamic array?
ex:
import java.util.ArrayList;
import java.util.LinkedList;

public class HashMap<K, V> {
    private static class Entry<K, V> {
        private K key;
        private V value;

        public Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    // Bucket table; the field name is illustrative.
    private ArrayList<LinkedList<Entry<K, V>>> buckets = new ArrayList<>();

    // the rest of the implementation
    // ...
}

HashMap already uses a dynamic array:
    /**
     * The table, initialized on first use, and resized as
     * necessary. When allocated, length is always a power of two.
     * (We also tolerate length zero in some operations to allow
     * bootstrapping mechanics that are currently not needed.)
     */
    transient Node<K,V>[] table;
Using an ArrayList instead of a manually resized array does not increase the time complexity.

The issue for puts is how often you're going to need to extend the ArrayList.
This is probably little different in overhead to extending a plain array: you have to allocate a new one and rehash.
Note you'll need to know the intended ArrayList size in order to compute the hash index (as hash code % array size), so you should allocate the ArrayList with that capacity initially and then populate with nulls - since the array elements don't exist until added to the list.
Similarly for when you rehash.
You can of course do it wrong: you could compute a size for use in computing the hash index, and then not extend the ArrayList accordingly. Then you'd suffer from arbitrary need to extend the ArrayList whenever you had an index higher than one you'd seen before, which may require reallocation internal to the ArrayList.
In short: there's no significant performance penalty for using an ArrayList if you do it in a reasonable way, but no particular benefit either.
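To make the "reasonable way" concrete, here is a minimal sketch of a hash map whose table is an ArrayList that is allocated at its full size, pre-filled with nulls, and replaced by a larger one on rehash. The class ArrayListHashMap, its helper names, the initial capacity of 16 and the 0.75 load factor are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.Objects;

class ArrayListHashMap<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private static final double LOAD_FACTOR = 0.75; // assumed, mirrors java.util.HashMap
    private ArrayList<LinkedList<Entry<K, V>>> buckets = newTable(16);
    private int size = 0;

    // Allocate the list at its final size and fill it with nulls so that
    // every index in [0, capacity) is valid for get/set immediately.
    private static <K, V> ArrayList<LinkedList<Entry<K, V>>> newTable(int capacity) {
        ArrayList<LinkedList<Entry<K, V>>> table = new ArrayList<>(capacity);
        for (int i = 0; i < capacity; i++) {
            table.add(null);
        }
        return table;
    }

    private static int indexFor(Object key, int capacity) {
        return Math.floorMod(Objects.hashCode(key), capacity);
    }

    public V get(K key) {
        LinkedList<Entry<K, V>> bucket = buckets.get(indexFor(key, buckets.size()));
        if (bucket != null) {
            for (Entry<K, V> e : bucket) {
                if (Objects.equals(e.key, key)) {
                    return e.value;
                }
            }
        }
        return null;
    }

    public V put(K key, V value) {
        int index = indexFor(key, buckets.size());
        LinkedList<Entry<K, V>> bucket = buckets.get(index);
        if (bucket == null) {
            bucket = new LinkedList<>();
            buckets.set(index, bucket);
        }
        for (Entry<K, V> e : bucket) {
            if (Objects.equals(e.key, key)) {
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        bucket.add(new Entry<>(key, value));
        if (++size > LOAD_FACTOR * buckets.size()) {
            resize();
        }
        return null;
    }

    // Rehash into a brand-new, twice as large, pre-filled table.
    private void resize() {
        ArrayList<LinkedList<Entry<K, V>>> old = buckets;
        buckets = newTable(old.size() * 2);
        for (LinkedList<Entry<K, V>> bucket : old) {
            if (bucket == null) {
                continue;
            }
            for (Entry<K, V> e : bucket) {
                int index = indexFor(e.key, buckets.size());
                if (buckets.get(index) == null) {
                    buckets.set(index, new LinkedList<>());
                }
                buckets.get(index).add(e);
            }
        }
    }

    public int size() { return size; }

    int capacity() { return buckets.size(); }
}
```

Note that the ArrayList never grows on its own here: its size only changes when a whole new table is built, which is why it offers no advantage over a plain array.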

What is the time complexity of a HashMap implemented using dynamic array?
The short answer is: It depends on how you actually implement the put and get methods, and the rehashing. However, assuming that you have gotten it right, then the complexity would be the same as with classic arrays.
Note that a typical hash table implementation will not benefit from using a dynamic array (aka a List).
In between resizes, a hash table has an array whose size is fixed. Entries are added and removed from buckets, but the number of buckets and their positions in the array do not change. Since the code is not changing the array size, there is no benefit in using dynamic arrays in between the resizes.
When the hash table resizes, it needs to create a new array (typically about twice the size of the current one). Then it recomputes the entry -> bucket mappings and redistributes the entries. Note that a new array is required, since it is not feasible to redistribute the entries "in place". Secondly, the size of the new array will be known (and fixed) when it is allocated. So again there is no benefit here from using a dynamic array.
Add to this that for all primitive operations on an array, the equivalent operations for a dynamic array (i.e. ArrayList [1]) have a small performance overhead. So there will be a small performance hit from using a dynamic array in a hash table implementation.
Finally, you need to be careful when talking about complexity of hash table implementations. The average complexity of HashMap.put (and similar) is O(1) amortized.
A single put operation may be O(N) if it triggers a resize.
If the hash function is pathological, all operations can be O(N).
If you choose an inappropriate load factor, performance will suffer.
If you implement an incorrect resizing policy then performance will suffer.
(Amortized means averaged over all similar operations on the table. For example, N insertions with different keys into an empty table is O(N) ... or O(1) amortized per insertion.)
In short: the complexity of hash tables is ... complex.
1 - a Vector will be slightly worse than an ArrayList, but LinkedList would actually make the complexity for get and put O(N) instead of O(1). I'll leave you to figure out the details.
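The amortized argument can be checked with a small count of how many entries are moved by resizes over N insertions. This is a simplified model, not HashMap itself: AmortizedResizeDemo is a hypothetical name, the table starts at capacity 1 and doubles only when it is completely full (a load factor of 1 rather than 0.75).

```java
class AmortizedResizeDemo {
    // Counts how many entries are rehashed by doubling resizes during n insertions.
    static long entriesMoved(int n) {
        long moved = 0;
        int capacity = 1;
        int size = 0;
        for (int i = 0; i < n; i++) {
            if (size == capacity) {
                // Table is full: this single insertion costs O(size) for the rehash.
                moved += size;
                capacity *= 2;
            }
            size++;
        }
        return moved;
    }

    public static void main(String[] args) {
        int n = 1000;
        long moved = entriesMoved(n);
        // 1 + 2 + 4 + ... + 512 = 1023 moves in total, i.e. fewer than 2 * n.
        System.out.println(n + " insertions moved " + moved + " entries in total");
    }
}
```

An individual insertion that triggers a resize is expensive, but the total rehashing work stays below 2N, so the cost per insertion is O(1) amortized.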

What is increased cost of TreeSet vs LinkedHashSet and TreeMap over LinkedHashMap?

LinkedHashSet - This implementation spares its clients from the unspecified, generally chaotic ordering provided by HashSet, without incurring the increased cost associated with TreeSet.
Same is said about LinkedHashMap vs TreeMap
What is this increased cost (LinkedHashMap vs TreeMap) exactly?
Does that mean that TreeSet needs more memory per element? LinkedHashSet needs more memory for two additional links, but TreeSet needs additional memory to store Map.Entry pair of elements (because implicitly based on TreeMap), besides LinkedHashSet is based on HashMap which also has Map.Entry pair of elements overhead...
So the difference is how fast a new element is added (in case of TreeSet it takes longer due to some "sorting").
What are other significant increased costs?

TreeSet/TreeMap have a higher time complexity for operations such as add(), contains() (for TreeSet), put(), containsKey() (for TreeMap), etc., since they require logarithmic time to locate elements in the tree (or add elements to the tree), while LinkedHashSet/LinkedHashMap require expected constant time for those operations.
In terms of memory requirements, there's a very small difference:
TreeMap entries hold key, value, 3 Entry references (left, right, parent) and a boolean.
LinkedHashMap entries hold key, value, 3 Entry references (next, before, after) and an int.

When iterating a HashSet, the iteration order is generally the order of the hash of the object, which is generally not too useful if you want a predictable order.
If sane ordering is important you would generally need to use a TreeSet which iterates in sorted order but at a cost because maintaining the sorted order adds to the complexity of the process.
A LinkedHashSet can be used as a middle-ground solution to the seemingly insane ordering of a HashSet by ensuring that the iteration order is at least consistent by using the insertion order.
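The difference in iteration order is easy to see by adding the same elements to each kind of set. A minimal sketch; IterationOrderDemo and its order helper are hypothetical names, and the sample strings are arbitrary.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class IterationOrderDemo {
    // Adds the same four elements to the given set and returns its iteration order.
    static List<String> order(Set<String> set) {
        Collections.addAll(set, "pear", "apple", "fig", "banana");
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        // HashSet: order is unspecified and may differ between JDK versions.
        System.out.println("HashSet:       " + order(new HashSet<>()));
        // LinkedHashSet: insertion order -> [pear, apple, fig, banana]
        System.out.println("LinkedHashSet: " + order(new LinkedHashSet<>()));
        // TreeSet: natural (sorted) order -> [apple, banana, fig, pear]
        System.out.println("TreeSet:       " + order(new TreeSet<>()));
    }
}
```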

Why does Hashmap Internally use LinkedList instead of Arraylist

Why does Hashmap internally use a LinkedList instead of an Arraylist when two objects are placed in the same bucket in the hash table?

Why does HashMap internally use a LinkedList instead of an ArrayList, when two objects are placed into the same bucket in the hash table?
Actually, it doesn't use either (!).
It actually uses a singly linked list implemented by chaining the hash table entries. (By contrast, a LinkedList is doubly linked, and it requires a separate Node object for each element in the list.)
So why am I nitpicking here? Because it is actually important ... because it means that the normal trade-off between LinkedList and ArrayList does not apply.
The normal trade-off is:
ArrayList uses less space, but insertion and removal of a selected element is O(N) in the worst case.
LinkedList uses more space, but insertion and removal of a selected element [1] is O(1).
However, in the case of the private singly linked list formed by chaining together HashMap entry nodes, the space overhead is one reference (same as ArrayList), the cost of inserting a node is O(1) (same as LinkedList), and the cost of removing a selected node is also O(1) (same as LinkedList).
Relying solely on "big O" for this analysis is dubious, but when you look at the actual code, it is clear that what HashMap does beats ArrayList on performance for deletion and insertion, and is comparable for lookup. (This ignores memory locality effects.) It also uses less memory for the chaining than if either ArrayList or LinkedList were used, considering that there are already internal entry objects to hold the key / value pairs.
But it gets even more complicated. In Java 8, they overhauled the HashMap internal data structures. In the current implementation, once a hash chain exceeds a certain length threshold, the implementation switches to a balanced binary tree representation, which pays off when the key type implements Comparable.
1 - That is the insertion / deletion is O(1) if you have found the insertion / removal point. For example, if you are using the insert and remove methods on a LinkedList object's ListIterator.
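The idea of chaining the entries themselves can be sketched for a single bucket as follows. This is a simplified illustration and not the HashMap source: EntryChainDemo and Node are hypothetical names, and keys are plain Strings to keep it short.

```java
class EntryChainDemo {
    // The entry object doubles as the list node: the chain costs one extra reference.
    static final class Node {
        final String key;
        int value;
        Node next;
        Node(String key, int value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private Node head; // a single bucket of the table

    // O(1): link the new entry in front of the chain.
    void addFirst(String key, int value) {
        head = new Node(key, value, head);
    }

    // Finding the node is a scan, but unlinking it once found is O(1).
    boolean remove(String key) {
        Node prev = null;
        for (Node n = head; n != null; prev = n, n = n.next) {
            if (n.key.equals(key)) {
                if (prev == null) {
                    head = n.next;
                } else {
                    prev.next = n.next;
                }
                return true;
            }
        }
        return false;
    }

    // Keys in chain order, comma separated.
    String keys() {
        StringBuilder sb = new StringBuilder();
        for (Node n = head; n != null; n = n.next) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(n.key);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        EntryChainDemo bucket = new EntryChainDemo();
        bucket.addFirst("a", 1);
        bucket.addFirst("b", 2);
        bucket.addFirst("c", 3);
        bucket.remove("b");
        System.out.println(bucket.keys()); // c,a
    }
}
```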

This basically boils down to the complexities of ArrayList and LinkedList.
Insertion in a LinkedList (when order is not important) is O(1): just add at the start.
Insertion in an ArrayList (when order is not important) is amortized O(1) when appending at the end, but a single append can cost O(n) when the backing array has to be resized.
Removal is O(n) in a LinkedList: traverse and adjust pointers.
Removal in an ArrayList is also O(n), and costs more in practice: traverse to the element, then shift the elements that follow it.
Contains will be O(n) in either case.
When using a HashMap we expect O(1) operations for add, remove and contains. Using an ArrayList for the buckets would add resizing and element-shifting costs to the add and remove operations.

Short Answer: Java uses either a linked list or a balanced tree (whichever suits the number of colliding entries in a bucket).
Long Answer
Although a sorted ArrayList looks like the obvious way to go, there are some practical benefits to using a linked list instead.
We need to remember that the linked chain is used only when keys collide.
But by the definition of a good hash function, collisions should be rare.
In the rare cases of collisions we have to choose between a sorted ArrayList and a linked list.
If we compare a sorted ArrayList and a linked list, there are some clear trade-offs:
Insertion and deletion: a sorted ArrayList takes O(n), while a linked list takes constant O(1) once the position is known.
Retrieval: a sorted ArrayList takes O(log n) and a linked list takes O(n).
Now it's clear that linked lists are better than sorted ArrayLists for insertion and deletion, but worse for retrieval.
If there are few collisions, a sorted ArrayList brings less value (but more overhead).
But when collisions are more frequent and the list of collided elements becomes large (over a certain threshold), Java 8 and later change the collision data structure from a linked list to a balanced tree, which gives the O(log n) retrieval that a sorted ArrayList would.

contain operation on hashmap dependant on size of hashmap?

As per my understanding of HashMap:
Question 1:
If hashCode() returns a unique hash code for each key, the time to determine whether an object is contained in the HashMap is constant and does not depend on the size of the HashMap.
Question 2:
If hashCode() returns the same hash code for every key, but equals() returns false, the time to determine whether an object is contained in the HashMap depends on the size of the HashMap.
Is that true?

It is generally considered that HashMap lookups take only O(1) time. This is the average time for a lookup. In the worst case it can be O(n) as well, for example when all keys land in one bucket and a linked list is used for the buckets. This can be avoided if self-balancing trees are used for the buckets, which reduces the worst case to O(log n) time.

If we have an appropriately written hash function, then yes, retrieval comes out to be O(1).
Think of it this way: if your hash function is written appropriately, so that elements are distributed evenly across buckets, then the time to search for an element is proportional to the size of its bucket. Now, if the bucket size stays constant, and the number of buckets or memory size is not a constraint, then you will be able to retrieve the element in constant time.
Regarding your second question: yes, if your hash function returns the same hash code for every key, then the retrieval time will be proportional to the size of the HashMap, i.e. O(n).

How is the implementation of LinkedHashMap different from HashMap?

If LinkedHashMap's time complexity is same as HashMap's complexity why do we need HashMap? What are all the extra overhead LinkedHashMap has when compared to HashMap in Java?

LinkedHashMap will take more memory. Each entry in a normal HashMap just has the key and the value. Each LinkedHashMap entry has those references and references to the next and previous entries. There's also a little bit more housekeeping to do, although that's usually irrelevant.

If LinkedHashMap's time complexity is same as HashMap's complexity why do we need HashMap?
You should not confuse complexity with performance. Two algorithms can have the same complexity, yet one can consistently perform better than the other.
Remember that "f(N) is O(N)" means that there are constants C and N0 such that:
f(N) <= C*N for all N >= N0
The complexity says nothing about how small or large the C values are. For two different algorithms, the constant C will most likely be different.
(And remember that big-O complexity is about the behavior / performance as N gets very large. It tells you nothing about the behavior / performance for smaller N values.)
Having said that:
The difference in performance between HashMap and LinkedHashMap operations in equivalent use-cases is relatively small.
A LinkedHashMap uses more memory. For example, the Java 11 implementation has two additional reference fields in each map entry to represent the before/after list. On a 64 bit platform without compressed OOPs the extra overhead is 16 bytes per entry.
Relatively small differences in performance and/or memory usage can actually matter a lot to people with performance or memory critical applications [1].
1 - ... and also to people who obsess about these things unnecessarily.

LinkedHashMap additionally maintains a doubly-linked list running through all of its entries, which provides a reproducible order. This linked list defines the iteration ordering, which is normally the order in which keys were inserted into the map (insertion-order).
HashMap doesn't have these extra costs (runtime, space) and should be preferred over LinkedHashMap when you don't care about insertion order.

LinkedHashMap is a useful data structure when you need to know the insertion order of keys to the Map. One suitable use case is for the implementation of an LRU cache. Due to order maintenance of the LinkedHashMap, the data structure needs additional memory compared to HashMap. In case insertion order is not a requirement, you should always go for the HashMap.

There is another major difference between HashMap and LinkedHashMap: iteration is more efficient in the case of LinkedHashMap.
As the elements in a LinkedHashMap are connected with each other, iteration requires time proportional to the size of the map, regardless of its capacity.
But in the case of HashMap, as there is no linked list through the entries, iteration has to walk the whole table and requires time proportional to its capacity plus its size.

HashMap does not maintain insertion order, hence it does not maintain any doubly linked list.
The most salient feature of LinkedHashMap is that it maintains the insertion order of key-value pairs. LinkedHashMap uses a doubly linked list to do so.
An entry of LinkedHashMap looks like this:
static class Entry<K, V> {
    K key;
    V value;
    Entry<K,V> next;          // next entry in the same bucket
    Entry<K,V> before, after; // for maintaining insertion order

    public Entry(K key, V value, Entry<K,V> next) {
        this.key = key;
        this.value = value;
        this.next = next;
    }
}
By using before and after, we keep track of each newly added entry in the LinkedHashMap, which helps us maintain insertion order.
before refers to the previous entry and after refers to the next entry in the LinkedHashMap.
For diagrams and step by step explanation please refer http://www.javamadesoeasy.com/2015/02/linkedhashmap-custom-implementation.html

LinkedHashMap extends HashMap, which means it uses the existing implementation of HashMap to store keys and values in a Node (Entry object). In addition, it maintains a separate doubly linked list to record the insertion order in which keys were entered.
It looks like this :
header node <---> node 1 <---> node 2 <---> node 3 <----> node 4 <---> header node.
So the extra overhead is maintaining this doubly linked list on insertion and deletion.
Benefit is : Iteration order is guaranteed to be insertion order, which is not in HashMap.

Re-sizing is supposed to be faster as it iterates through its double-linked list to transfer the contents into a new table array.
containsValue() is overridden to take advantage of the faster iterator.
LinkedHashMap can also be used to create a LRU cache. A special LinkedHashMap(capacity, loadFactor, accessOrderBoolean) constructor is provided to create a linked hash map whose order of iteration is the order in which its entries were last accessed, from least-recently accessed to most-recently. In this case, merely querying the map with get() is a structural modification.
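A minimal LRU cache along these lines can be sketched by combining the access-order constructor with removeEldestEntry(). The class name LruCache and the maxEntries parameter are assumptions for the example; the constructor and the removeEldestEntry hook are part of java.util.LinkedHashMap.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // Third argument true = access order instead of insertion order.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by put(); returning true evicts the least recently accessed entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // "a" becomes the most recently used entry
        cache.put("c", 3); // evicts "b", the least recently used entry
        System.out.println(cache.keySet()); // [a, c]
    }
}
```

Extending LinkedHashMap keeps the sketch short; a production cache would more likely wrap the map so that callers cannot bypass the size limit.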
