I found out that for quite big collections (over 1000 entries), the method A.removeAll(B) is way faster on a HashSet than on an ArrayList.
Do you have an idea of how these methods are implemented and how this could explain that difference?
A set (and thus a HashSet as well) contains at most one occurrence of a given element B, and since HashSet uses hashes it is quite efficient to locate and remove that element. The overall complexity should thus be O(1) for removing all (that is, at most one) occurrences of B.
A list can contain any number of occurrences of B, at any positions, so removing all of them has to check every element. The overall complexity is O(n), since every element has to be checked to see whether it equals B.
Edit:
If B represents a collection/array, i.e. a set of multiple elements, you'd multiply the above complexities by the size m of B, so you'll get O(m) for HashSet and O(n * m) for lists.
Edit 2:
Note that if you have a sorted list the complexity might be reduced to O(log(n)) or O(log(n) * m). For that to work the code removing the actual elements would have to know the list is sorted though and since ArrayList is not guaranteed to be sorted it can't make that optimization.
Basically, the reason for both is the time complexity that these specific implementations try to achieve for their respective operations.
The time complexity of the ArrayList remove method is O(n - index) (source: When to use LinkedList over ArrayList?),
while the remove method of HashSet offers constant time complexity O(1) (source: Hashset vs Treeset).
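To see the difference concretely, here is a rough, single-run timing sketch (not a proper benchmark; the class name and sizes are made up for illustration):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RemoveAllDemo {
    public static void main(String[] args) {
        int n = 100_000;
        List<Integer> list = new ArrayList<>();
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            list.add(i);
            set.add(i);
        }
        // B: every other element, stored in a plain ArrayList.
        List<Integer> b = new ArrayList<>();
        for (int i = 0; i < n; i += 2) {
            b.add(i);
        }

        long t0 = System.nanoTime();
        list.removeAll(b);   // checks b.contains(e) for every list element: roughly O(n * m)
        long t1 = System.nanoTime();
        set.removeAll(b);    // hashes and removes each element of b: roughly O(m)
        long t2 = System.nanoTime();

        System.out.printf("ArrayList.removeAll: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("HashSet.removeAll:   %d ms%n", (t2 - t1) / 1_000_000);
    }
}

You should see the ArrayList version take dramatically longer, because every one of its n elements triggers a linear scan of B.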
Related
I have two java HashSets, as below:
HashSet<Integer> H1 = new HashSet<>();
HashSet<Vector> H2 = new HashSet<>();
As the number of elements in each HashSet grows larger (and assuming some of the elements are unique, some aren't), does the time complexity of adding elements to the second HashSet (of Vectors) change, relative to the first HashSet (of Integers)? Or does the fact that the HashSet contains Vectors not affect the time complexity?
I understand the time complexity of .add() for the first HashSet is (in general) O(1), but could someone clarify for H2?
Also, if it were instead, say, a TreeSet of Vectors, how would the time complexity of .add() change from a TreeSet of ints in that case?
Your question makes some false assumptions
HashSet<Integer> H1 = new HashSet<>();
HashSet<Vector> H2 = new HashSet<>();
Vector is a legacy synchronized class from Java 1.0. You're probably better off using ArrayList. Vector is also a mutable class, meaning that it may not work properly in a HashSet if the objects it contains are changing.
As the number of elements in each HashSet grows larger (and assuming
some of the elements are unique, some aren't)
There cannot be non-unique elements in a Set
As the number of elements in each HashSet grows larger (and assuming
some of the elements are unique, some aren't), does the time
complexity of adding elements to the second HashSet (of Vectors)
change, relative to the first HashSet (of Integers)? Or does the fact
that the HashSet contains Vectors not affect the time complexity?
I understand the time complexity of .add() for the first HashSet is (in
general) O(1), but could someone clarify for H2?
Basically, larger lists (Vectors) will change the time complexity - not of the HashSet itself, but of hashCode() and equals(), which take time proportional to the Vector's length. Also be aware that adding or removing elements from a List after it has been added as a key in a HashMap / element of a HashSet will generally break the Map or Set.
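To illustrate the second point, here is a small sketch showing how mutating a Vector after it has been added to a HashSet breaks lookups, because its hashCode() changes:

import java.util.HashSet;
import java.util.Set;
import java.util.Vector;

public class MutableKeyDemo {
    public static void main(String[] args) {
        Set<Vector<Integer>> set = new HashSet<>();

        Vector<Integer> v = new Vector<>();
        v.add(1);
        v.add(2);
        set.add(v);                            // hashCode() is computed over the current contents

        System.out.println(set.contains(v));   // true

        v.add(3);                              // mutating the element changes its hashCode()

        System.out.println(set.contains(v));   // typically false: the set probes the wrong bucket
        System.out.println(set.size());        // still 1 - the "lost" element is still in there
    }
}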
Also, if it were instead, say, a TreeSet of Vectors, how would the
time complexity of .add() change from a TreeSet of ints in that case?
You can't do this directly, as Vector does not implement Comparable; a TreeSet of Vectors would need an explicit Comparator.
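If you really wanted a TreeSet of Vectors, you would have to supply your own Comparator. A minimal sketch (the size-based comparator here is an arbitrary example and is not consistent with equals, so it is for illustration only):

import java.util.Comparator;
import java.util.TreeSet;
import java.util.Vector;

public class VectorTreeSetDemo {
    public static void main(String[] args) {
        // Without a Comparator, the first add() would throw a ClassCastException,
        // because Vector does not implement Comparable.
        TreeSet<Vector<Integer>> bySize =
                new TreeSet<>(Comparator.comparingInt((Vector<Integer> v) -> v.size()));

        Vector<Integer> a = new Vector<>();
        a.add(1);
        Vector<Integer> b = new Vector<>();
        b.add(1);
        b.add(2);

        bySize.add(b);
        bySize.add(a);

        // Each add() costs O(log n) comparisons; here each comparison is O(1),
        // but a content-based comparator would add the cost of comparing the elements.
        System.out.println(bySize.first().size());   // 1
    }
}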
I haven't found much information about the time complexity of Collections.rotate(list, k). Am I correct in assuming the following?
For ArrayList: O(n), where n is the size of the list
For LinkedList and ArrayDeque: O(k)
Thanks in advance.
IMHO the documentation of Collections.rotate() describes the algorithm nicely:
If the specified list is small or implements the RandomAccess interface, this implementation exchanges the first element into the location it should go, and then repeatedly exchanges the displaced element into the location it should go until a displaced element is swapped into the first element. If necessary, the process is repeated on the second and successive elements, until the rotation is complete.
According to this part of the description, the time complexity is O(N) for lists that implement the RandomAccess interface (because for such lists element access is O(1) and the algorithm needs to access about N elements), and O(N^2) for lists where element access is O(N) (for example LinkedList) - although this path is only used when such a list is small.
If the specified list is large and doesn't implement the RandomAccess interface, this implementation breaks the list into two sublist views around index -distance mod size. Then the reverse(List) method is invoked on each sublist view, and finally it is invoked on the entire list.
This part of the description is for larger lists that do not implement the RandomAccess interface (for example LinkedList). The description mentions 3 calls to Collections.reverse() and the documentation for Collections.reverse() says:
This method runs in linear time.
That means O(N) for each call to Collections.reverse() - and 3*N differs from N only by a constant factor, which means that the whole Collections.rotate() operation also runs in O(N).
Note that ArrayDeque does not implement the List interface and you therefore cannot call Collections.rotate() with an ArrayDeque as argument.
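For completeness, a quick usage sketch of Collections.rotate() on both kinds of list:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class RotateDemo {
    public static void main(String[] args) {
        // ArrayList implements RandomAccess, so rotate() uses the cycle-following swap algorithm.
        List<Integer> arrayList = new ArrayList<>(Arrays.asList(1, 2, 3, 4, 5));
        Collections.rotate(arrayList, 2);
        System.out.println(arrayList);   // [4, 5, 1, 2, 3]

        // LinkedList has no RandomAccess; large instances are rotated via three reverse() calls.
        // Either way the whole rotation is O(n).
        List<Integer> linkedList = new LinkedList<>(Arrays.asList(1, 2, 3, 4, 5));
        Collections.rotate(linkedList, -1);
        System.out.println(linkedList);  // [2, 3, 4, 5, 1]
    }
}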
Why does HashMap internally use a LinkedList instead of an ArrayList when two objects are placed in the same bucket in the hash table?
Why does HashMap internally use a LinkedList instead of an ArrayList, when two objects are placed into the same bucket in the hash table?
Actually, it doesn't use either (!).
It actually uses a singly linked list implemented by chaining the hash table entries. (By contrast, a LinkedList is doubly linked, and it requires a separate Node object for each element in the list.)
So why am I nitpicking here? Because it is actually important ... because it means that the normal trade-off between LinkedList and ArrayList does not apply.
The normal trade-off is:
ArrayList uses less space, but insertion and removal of a selected element is O(N) in the worst case.
LinkedList uses more space, but insertion and removal of a selected element1 is O(1).
However, in the case of the private singly linked list formed by chaining together HashMap entry nodes, the space overhead is one reference (same as ArrayList), the cost of inserting a node is O(1) (same as LinkedList), and the cost of removing a selected node is also O(1) (same as LinkedList).
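To make the chaining concrete, here is a simplified sketch of the kind of entry node such a chained hash table uses (illustrative only - this is not the actual java.util.HashMap source):

public class BucketSketch {

    static class Entry<K, V> {
        final int hash;       // cached hash of the key
        final K key;
        V value;
        Entry<K, V> next;     // singly linked chain of entries sharing a bucket

        Entry(int hash, K key, V value, Entry<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next; // O(1) insertion: the new entry is linked in front of the chain
        }
    }

    public static void main(String[] args) {
        // Two entries colliding in the same bucket: the newer one is simply chained in front.
        Entry<String, Integer> first = new Entry<>("a".hashCode(), "a", 1, null);
        Entry<String, Integer> head  = new Entry<>("b".hashCode(), "b", 2, first);

        // Lookup walks the chain, comparing hashes and keys - O(chain length).
        for (Entry<String, Integer> e = head; e != null; e = e.next) {
            System.out.println(e.key + " -> " + e.value);
        }
    }
}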
Relying solely on "big O" for this analysis is dubious, but when you look at the actual code, it is clear that what HashMap does beats ArrayList on performance for deletion and insertion, and is comparable for lookup. (This ignores memory locality effects.) And it also uses less memory for the chaining than if either ArrayList or LinkedList were used ... considering that there are already internal entry objects to hold the key / value pairs.
But it gets even more complicated. In Java 8, they overhauled the HashMap internal data structures. In the current implementation, once a hash chain exceeds a certain length threshold, the implementation switches to using a binary tree representation if the key type implements Comparable.
1 - That is the insertion / deletion is O(1) if you have found the insertion / removal point. For example, if you are using the insert and remove methods on a LinkedList object's ListIterator.
This basically boils down to the complexities of ArrayList and LinkedList.
Insertion in a LinkedList (when order is not important) is O(1): just link the new node in at the head.
Insertion at the end of an ArrayList is amortized O(1), but the occasional resize of the backing array costs O(N).
Removal in a LinkedList is O(n): traverse to the element, then adjust the links (the unlinking itself is O(1)).
Removal in an ArrayList is also O(n): traverse to the element, then shift the remaining elements.
contains will be O(n) in either case.
When using a HashMap we expect O(1) operations for add, remove and contains. Using an ArrayList for the buckets would incur a higher cost for the add and remove operations.
Short Answer : Java uses either a linked chain of entries or, since Java 8, a balanced tree (whichever it finds appropriate for the data).
Long Answer
Although a sorted ArrayList looks like the obvious way to go, there are some practical benefits to using a LinkedList instead.
We need to remember that the LinkedList chain is used only when keys collide, i.e. when different keys land in the same bucket.
But by the definition of a good hash function, collisions should be rare.
In the rare cases of collisions, we have to choose between a sorted ArrayList and a LinkedList.
If we compare sorted ArrayList and LinkedList there are some clear trade-offs
Insertion and Deletion : Sorted ArrayList takes O(n), but LinkedList takes constant O(1)
Retrieval : Sorted ArrayList takes O(log n) and LinkedList takes O(n).
Now, it's clear that LinkedLists are better than sorted ArrayLists for insertion and deletion, but worse for retrieval.
When there are few collisions, a sorted ArrayList brings little value (but more overhead).
But when collisions are more frequent and the list of collided elements grows large (over a certain threshold), Java changes the collision data structure from a linked list to a balanced tree (since Java 8).
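For illustration, here is a sketch of what insertion into a hypothetical sorted-ArrayList bucket would look like, compared with the O(1) head insertion of a linked chain (the helper method is made up for this example):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedBucketSketch {
    // Hypothetical insert into a sorted-ArrayList bucket:
    // O(log n) to find the slot, but O(n) to shift elements and make room.
    static void insertSorted(List<Integer> bucket, int key) {
        int pos = Collections.binarySearch(bucket, key);
        if (pos < 0) {
            pos = -pos - 1;          // binarySearch encodes the insertion point as -(point) - 1
        }
        bucket.add(pos, key);        // shifting the tail of the backing array is the O(n) part
    }

    public static void main(String[] args) {
        List<Integer> bucket = new ArrayList<>(Arrays.asList(10, 20, 30));
        insertSorted(bucket, 25);
        System.out.println(bucket);  // [10, 20, 25, 30]
        // A linked chain would instead link the new node in at the head in O(1),
        // at the price of an O(n) linear scan on lookup.
    }
}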
HashMap uses a LinkedList when two different keys produce the same hashCode. But I was wondering what makes LinkedList a better candidate here over other implementations of List. Why not ArrayList, given that ArrayList uses an array internally and arrays have faster iteration compared to a LinkedList?
Collisions in hash maps are an exception, rather than a rule. When your hash function is reasonably good, as it should be, there should be very few collisions.
If we used ArrayList for the buckets, with most lists being empty or having exactly one element, this would be a rather big waste of resources. With array lists allocating multiple members upfront, you would end up paying forward for multiple collisions that you may not have in the future.
Moreover, removing from array lists is cheap only when the last element gets deleted. When the first one gets deleted, you end up paying for the move of all elements.
Linked lists are free from these problems. Insertion is O(1), deletion is O(1), and they use exactly as many nodes as you insert. The memory overhead of the next/prior links is not too big a price to pay for this convenience.
The problem with an ArrayList is that you can't remove an element quickly: you have to move all the elements after the one you remove.
With a LinkedList, removing an element is merely changing a reference from one node to the new next one, skipping the removed one.
The difference is huge. When you want to have a list and be able to remove elements quickly, don't use an ArrayList; the usual choice is a linked list.
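A rough timing sketch (a single run, not a rigorous benchmark) that removes every other element through an Iterator shows how large that difference gets:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

public class RemoveCostSketch {
    static long removeEvens(List<Integer> list) {
        long start = System.nanoTime();
        Iterator<Integer> it = list.iterator();
        while (it.hasNext()) {
            if (it.next() % 2 == 0) {
                it.remove();   // LinkedList: relink two references; ArrayList: shift the whole tail
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int n = 100_000;
        List<Integer> arrayList = new ArrayList<>();
        List<Integer> linkedList = new LinkedList<>();
        for (int i = 0; i < n; i++) {
            arrayList.add(i);
            linkedList.add(i);
        }
        System.out.println("ArrayList:  " + removeEvens(arrayList) + " ms");
        System.out.println("LinkedList: " + removeEvens(linkedList) + " ms");
    }
}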
Why not ArrayList, given that ArrayList uses an array internally and arrays have faster iteration compared to a LinkedList?
And ArrayList is much slower to modify. So they made a judgement call and went with LinkedList.
Is the contains method on TreeSet (since it is already sorted by default) faster than, say, on a HashSet?
The reason I ask is that Collections.binarySearch is quite fast if the List is sorted, so I am thinking that maybe the contains method for TreeSet might be the same.
From the javadoc of TreeSet:
This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains).
From the javadoc of HashSet:
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.
So the answer is no.
Looking at the implementation (Oracle JDK 1.7), TreeSet.contains (resp. HashSet.contains) relies on the TreeMap.containsKey (resp. HashMap.containsKey) method. In the HashMap, containsKey loops over a single hash bucket (which possibly contains only one item), whereas in the TreeMap it walks down the tree from node to node, using the compareTo method. If your item is the largest or the smallest, this search path is the longest and can take significantly more time.
Finally, I just ran a quick test (yes, I know, not very reliable) with a tree containing 1M integers, looking for one of the 2 largest, which forces the TreeSet to follow its longest search path. HashSet is quicker by a factor of 50.
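For the curious, here is a sketch of that kind of quick test (a single JVM run, so take the numbers with a grain of salt; a JMH benchmark would be more trustworthy):

import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class ContainsBenchSketch {
    public static void main(String[] args) {
        int n = 1_000_000;
        Set<Integer> hashSet = new HashSet<>();
        Set<Integer> treeSet = new TreeSet<>();
        for (int i = 0; i < n; i++) {
            hashSet.add(i);
            treeSet.add(i);
        }

        Integer probe = n - 1;      // one of the largest elements: longest search path in the tree
        int rounds = 1_000_000;
        boolean sink = false;       // keep the JIT from discarding the lookups

        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink ^= treeSet.contains(probe);   // O(log n) comparisons per lookup
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink ^= hashSet.contains(probe);   // expected O(1): one hash, one bucket
        }
        long t2 = System.nanoTime();

        System.out.println(sink);
        System.out.printf("TreeSet.contains: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("HashSet.contains: %d ms%n", (t2 - t1) / 1_000_000);
    }
}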