I have two Java HashSets, as below:
HashSet<Integer> H1 = new HashSet<>();
HashSet<Vector> H2 = new HashSet<>();
As the number of elements in each HashSet grows larger (and assuming some of the elements are unique, some aren't), does the time complexity of adding elements to the second HashSet (of Vectors) change relative to the first HashSet (of Integers)? Or does the fact that the HashSet contains Vectors not affect the time complexity?
I understand the time complexity of .add() for the first HashSet is (in general) O(1), but could someone clarify for H2?
Also, if it were instead, say, a TreeSet of Vectors, how would the time complexity of .add() change from a TreeSet of Integers in that case?
Your question makes some false assumptions
HashSet<Integer> H1 = new HashSet<>();
HashSet<Vector> H2 = new HashSet<>();
Vector is a legacy synchronized class from Java 1.0. You're probably better off using ArrayList. Vector is also a mutable class, meaning that it may not work properly in a HashSet if the objects it contains are changed after insertion.
As the number of elements in each HashSet grows larger (and assuming some of the elements are unique, some aren't)
There cannot be non-unique elements in a Set; a duplicate add() is simply rejected.
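You can see this directly from the return value of add() (a quick check, e.g. in jshell):

import java.util.HashSet;
import java.util.Set;

Set<Integer> s = new HashSet<>();
System.out.println(s.add(1)); // true: 1 was added
System.out.println(s.add(1)); // false: duplicate, the set is unchanged
System.out.println(s.size()); // prints 1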
As the number of elements in each HashSet grows larger (and assuming some of the elements are unique, some aren't), does the time complexity of adding elements to the second HashSet (of Vectors) change relative to the first HashSet (of Integers)? Or does the fact that the HashSet contains Vectors not affect the time complexity?
I understand the time complexity of .add() for the first HashSet is (in general) O(1), but could someone clarify for H2?
Basically, larger lists will change the time complexity - not of the HashSet itself, but of hashCode() and equals(). Also be aware that adding or removing elements from a List after it has been added as a key in a HashMap / HashSet will generally break the Map.
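Both points can be seen in a few lines (a sketch; AbstractList.hashCode() visits every element, so hashing a k-element list costs O(k)):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ListKeyPitfall {
    public static void main(String[] args) {
        Set<List<Integer>> set = new HashSet<>();
        List<Integer> list = new ArrayList<>(List.of(1, 2, 3));

        set.add(list);                           // hashCode() visits all 3 elements: O(k)
        System.out.println(set.contains(list));  // true

        list.add(4);                             // mutating the element changes its hash
        System.out.println(set.contains(list));  // false: lookup now probes the wrong bucket
    }
}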
Also, if it were instead, say, a TreeSet of Vectors, how would the time complexity of .add() change from a TreeSet of Integers in that case?
You can't do this as-is, because Vector does not implement Comparable.
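If you really wanted a TreeSet of Vectors, you would have to supply a Comparator yourself. A hedged sketch (ordering by size is an arbitrary choice for illustration):

import java.util.Comparator;
import java.util.TreeSet;
import java.util.Vector;

public class TreeSetOfVectors {
    public static void main(String[] args) {
        // A plain new TreeSet<Vector<Integer>>() would throw ClassCastException
        // on the first add(), because Vector does not implement Comparable.
        // Supplying a Comparator works; note that equal-size vectors would then
        // count as duplicates under this particular ordering.
        Comparator<Vector<Integer>> bySize = Comparator.comparingInt(Vector::size);
        TreeSet<Vector<Integer>> set = new TreeSet<>(bySize);

        Vector<Integer> v = new Vector<>();
        v.add(42);
        set.add(v);                     // O(log n) comparisons per add
        System.out.println(set.size()); // prints 1
    }
}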
Related
Ok, this might be a super stupid question, but
I'm a bit baffled at the moment and eager to hear what you can tell me about this.
I had an ArrayList with about 5 million longs added. These longs are calculated hashes for primary keys (concatenated Strings) out of a big CSV file.
Now I wanted to check for uniqueness, looping through the list like this:
for (int i = 0; i < hashArrayList.size(); i++) {
    long refValue = hashArrayList.get(i);
    for (int j = i + 1; j < hashArrayList.size(); j++) {
        if (refValue == hashArrayList.get(j)) {
            // --> UNIQUENESS VIOLATION, now EXPLODE!!
        }
    }
}
This way it takes HOURS.
Now about the HashSet, which doesn't allow duplicates by itself.
A hashSet.addAll(hashArrayList) takes 4 seconds, while eliminating/not adding duplicates for this list with 5 million elements.
How does it do that?
And: Is my ArrayList-looping so stupid?
You are doing a totally different comparison.
With an ArrayList, you have a nested for loop which makes it O(n^2).
But with a HashSet, you are not doing any explicit looping, just adding n elements to it, which is O(n). Internally, a HashSet uses a HashMap whose keys are the individual elements of the list and whose value is a single static Object.
Source code for HashSet (Java 8)
public HashSet(Collection<? extends E> c) {
    map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16));
    addAll(c);
}
addAll calls add
public boolean add(E e) {
    return map.put(e, PRESENT) == null;
}
So ultimately it all comes down to inserting an object (here, a Long) into a HashMap, which provides constant-time performance.¹
¹ From the javadoc of HashMap (emphasis mine):
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets
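To see the O(n) approach in isolation, here is a rough, self-contained sketch (the 5 million longs are generated randomly here as a stand-in for the question's CSV hashes; exact timings will vary by machine):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class DedupDemo {
    public static void main(String[] args) {
        List<Long> hashArrayList = new ArrayList<>();
        Random rnd = new Random(42);
        for (int i = 0; i < 5_000_000; i++) {
            hashArrayList.add((long) rnd.nextInt()); // stand-in for the real hashes
        }

        // One pass over the list; each add() is expected O(1).
        Set<Long> seen = new HashSet<>(hashArrayList.size() * 2);
        for (long hash : hashArrayList) {
            if (!seen.add(hash)) {
                System.out.println("UNIQUENESS VIOLATION: " + hash);
            }
        }
    }
}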
A hash-based collection doesn't need to loop over all its elements to check whether it already contains an element with the same key.
Imagine you have 1,000 objects X. In your case, you loop through the whole list every time you add something.
A hash-based collection calculates the hash of the object, looks whether there are other elements with the same hash, and then just needs to check whether one of them is equal to the new element. If you have a good hash function that returns a unique hash for unique elements, you just have to calculate a number.
Of course, if you just say "I am lazy and I override my hashCode method with return 1", then you would have the same number of comparisons, in addition to the hash-collection overhead.
Example: Imagine you have the following HashSet:
HashSet: [[obj1], [null], [null], [null], [obj2, obj3, obj4]]
As you see, the basic structure (can be) like this: an array containing other data structures with the actual entries. Now, if you put an obj5 into the HashSet, it will call obj5.hashCode(). Based on this, it will calculate the outer index for this object. Let's say it's 4:
HashSet: [[obj1], [null], [null], [null], [obj2, obj3, obj4]]
                                          ^ obj5
Now we have three other objects in the same bucket. Yes, we need a loop here to check whether any of them is equal to the new obj5, but if you have a bigger HashSet with millions of entries, a comparison with a few elements is much faster than comparing with all the elements. This is the advantage of a hash-based collection.
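To make the "loop only within one bucket" idea concrete, here is a toy sketch of my own (not the real java.util.HashSet code, which also resizes, spreads hash bits, and so on):

import java.util.ArrayList;
import java.util.List;

public class ToyHashSet {
    private final List<List<Object>> buckets = new ArrayList<>();

    ToyHashSet(int capacity) {
        for (int i = 0; i < capacity; i++) buckets.add(new ArrayList<>());
    }

    boolean add(Object o) {
        int index = (o.hashCode() & 0x7fffffff) % buckets.size();
        List<Object> bucket = buckets.get(index);
        for (Object existing : bucket) {          // loop over this one bucket only
            if (existing.equals(o)) return false; // duplicate: reject it
        }
        bucket.add(o);
        return true;
    }

    public static void main(String[] args) {
        ToyHashSet set = new ToyHashSet(16);
        System.out.println(set.add("obj5")); // true: bucket scanned, no match
        System.out.println(set.add("obj5")); // false: found in its bucket
    }
}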
Hashmap internal working
Moreover, you are using a loop inside a loop, making the complexity O(n^2), which is less efficient than what a HashMap does.
I found out that for quite big arrays (over 1000 entries), the method A.removeAll(B) is way faster on a HashSet than on an ArrayList.
Do you have an idea of how these methods are implemented and how this could explain that difference?
A set (and thus HashSet as well) contains at most one occurrence of B, and since HashSet uses hashes, it is quite efficient to locate and remove that element. The overall complexity should thus be O(1) for removing all (that is, the one) B.
A list can contain any number of B in any position, so removing all B has to check all elements. The overall complexity is O(n), since every element has to be checked to see whether it is a B.
Edit:
If B represents a collection/array, i.e. a set of multiple elements, you'd multiply the above complexities by the size m of B, so you get O(m) for a HashSet and O(n * m) for lists.
Edit 2:
Note that if you have a sorted list, the complexity might be reduced to O(log(n)) or O(log(n) * m). For that to work, the code removing the actual elements would have to know the list is sorted, and since an ArrayList is not guaranteed to be sorted, it can't make that optimization.
Basically, the reason for both is the time complexity that these specific implementations try to achieve for their respective operations.
The time complexity of the ArrayList remove method is O(n - index) (source: When to use LinkedList over ArrayList?), while the remove method of HashSet offers constant time complexity, O(1) (source: Hashset vs Treeset).
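A quick, unscientific way to see the difference yourself (the sizes and timings here are arbitrary; the receiver type is what matters):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RemoveAllDemo {
    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) data.add(i);
        List<Integer> toRemove = new ArrayList<>(data.subList(0, 10_000));

        List<Integer> list = new ArrayList<>(data);
        Set<Integer> set = new HashSet<>(data);

        long t0 = System.nanoTime();
        list.removeAll(toRemove);  // checks toRemove.contains(e) per element: ~O(n * m)
        long t1 = System.nanoTime();
        set.removeAll(toRemove);   // hashes straight to each element: ~O(m)
        long t2 = System.nanoTime();

        System.out.printf("ArrayList: %,d us, HashSet: %,d us%n",
                (t1 - t0) / 1_000, (t2 - t1) / 1_000);
    }
}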
Why does HashMap internally use a LinkedList instead of an ArrayList when two objects are placed in the same bucket in the hash table?
Actually, it doesn't use either (!).
It actually uses a singly linked list implemented by chaining the hash table entries. (By contrast, a LinkedList is doubly linked, and it requires a separate Node object for each element in the list.)
So why am I nitpicking here? Because it is actually important ... because it means that the normal trade-off between LinkedList and ArrayList does not apply.
The normal trade-off is:
ArrayList uses less space, but insertion and removal of a selected element is O(N) in the worst case.
LinkedList uses more space, but insertion and removal of a selected element¹ is O(1).
However, in the case of the private singly linked list formed by chaining together HashMap entry nodes, the space overhead is one reference (same as ArrayList), the cost of inserting a node is O(1) (same as LinkedList), and the cost of removing a selected node is also O(1) (same as LinkedList).
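For reference, the chained entry in the Java 8 HashMap source looks roughly like this (an abbreviated excerpt; the real nested class also implements Map.Entry and has accessor methods):

static class Node<K,V> {
    final int hash;
    final K key;
    V value;
    Node<K,V> next;   // the singly linked chain: one extra reference per entry

    Node(int hash, K key, V value, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
}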
Relying solely on "big O" for this analysis is dubious, but when you look at the actual code, it is clear that what HashMap does beats ArrayList on performance for deletion and insertion, and is comparable for lookup. (This ignores memory locality effects.) And it also uses less memory for the chaining than either an ArrayList or a LinkedList would, considering that there are already internal entry objects to hold the key / value pairs.
But it gets even more complicated. In Java 8, they overhauled the HashMap internal data structures. In the current implementation, once a hash chain exceeds a certain length threshold, the implementation switches to using a binary tree representation if the key type implements Comparable.
¹ That is, the insertion / deletion is O(1) if you have found the insertion / removal point; for example, if you are using the insert and remove methods on a LinkedList object's ListIterator.
This basically boils down to complexities of ArrayList and LinkedList.
Insertion into a LinkedList (when order is not important) is O(1): just link a node in at the head.
Insertion into an ArrayList at the end is amortized O(1), but there is occasional resizing (array copy) overhead.
Removal from a LinkedList is O(n): traverse to the element and adjust pointers.
Removal from an ArrayList is likewise O(n): traverse to the element and shift the remaining elements.
Contains is O(n) in either case.
When using a HashMap, we expect O(1) operations for add, remove, and contains. Using an ArrayList for the buckets, we would incur a higher cost for the add and remove operations.
Short Answer: Java uses a LinkedList chain, switching to a balanced tree when it finds that appropriate for the data (Java 8+).
Long Answer
Although a sorted ArrayList looks like the obvious way to go, there are some practical benefits to using a LinkedList instead.
We need to remember that the LinkedList chain is used only when there is a collision of keys.
But by the definition of a good hash function, collisions should be rare.
In the rare case of collisions, we have to choose between a sorted ArrayList and a LinkedList.
If we compare a sorted ArrayList and a LinkedList, there are some clear trade-offs:
Insertion and deletion: a sorted ArrayList takes O(n), but a LinkedList takes constant O(1).
Retrieval: a sorted ArrayList takes O(log n), but a LinkedList takes O(n).
Now, it's clear that a LinkedList is better than a sorted ArrayList for insertion and deletion, but worse for retrieval.
If there are few collisions, a sorted ArrayList brings little value (but more overhead).
But when collisions are more frequent and the list of collided elements grows large (over a certain threshold), Java changes the collision data structure from a LinkedList to a balanced red-black tree (as of Java 8).
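The retrieval trade-off in code (a small sketch; Collections.binarySearch requires the list to already be sorted):

import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class BucketTradeOff {
    public static void main(String[] args) {
        List<Integer> sorted = new ArrayList<>(List.of(1, 3, 5, 7, 9));
        int idx = Collections.binarySearch(sorted, 7); // O(log n) on a sorted ArrayList
        System.out.println(idx);                       // prints 3

        LinkedList<Integer> chain = new LinkedList<>(List.of(1, 3, 5, 7, 9));
        System.out.println(chain.contains(7));         // O(n): walks node by node
    }
}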
HashMap uses a LinkedList when two different keys produce the same hashCode. But I was wondering what makes a LinkedList a better candidate here than other implementations of List. Why not an ArrayList, since ArrayList uses an array internally and arrays have faster iteration than a LinkedList?
Collisions in hash maps are an exception, rather than a rule. When your hash function is reasonably good, as it should be, there should be very few collisions.
If we used an ArrayList for the buckets, with most lists being empty or having exactly one element, this would be a rather big waste of resources. Since array lists allocate multiple slots up front, you would end up paying in advance for multiple collisions that you may never have.
Moreover, removing from an array list is cheap only when the last element is deleted. When the first one is deleted, you pay for moving all the remaining elements.
Linked lists are free from these problems. Insertion is O(1), deletion is O(1), and they use exactly as many nodes as you insert. The memory overhead of the next/prior links is not too big a price to pay for this convenience.
The problem with an ArrayList is that you can't remove an element quickly: you have to move all the elements after the one you remove.
With a LinkedList, removing an element is merely changing a reference from one node to the new next one, skipping the removed one.
The difference is huge. When you want to have a list and be able to remove elements quickly, don't use an ArrayList; the usual choice is a linked list.
Why not an ArrayList, since ArrayList uses an array internally and arrays have faster iteration than a LinkedList?
And an ArrayList is much slower to modify. So they made a judgment call and went with a linked list.
From the Java docs of the LinkedHashSet (LHS) class:
Iteration over a LinkedHashSet requires time proportional to the size
of the set, regardless of its capacity. Iteration over a HashSet is
likely to be more expensive, requiring time proportional to its
capacity.
My question is: why does iteration time over an LHS have no bearing on the capacity of the set?
Because a LinkedHashSet internally comprises both a linked list and a hash set. When iterating, you iterate over the (doubly) linked list, not over the hash table.
Create a regular HashSet with a capacity of about a million buckets (new HashSet(1024 * 1024)), add 1 element, and try to iterate. Though the HashSet has only 1 element, the iterator has to go over all ~1M buckets of the underlying hash table. But if it were a LinkedHashSet, the iterator would not go over the hash table (which is used only for get() and contains()) but would go through the linked list (a parallel structure), and there is only one element in it.
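A hedged way to observe this yourself (timings are illustrative only; exact numbers depend on the JVM and machine):

import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class IterationDemo {
    public static void main(String[] args) {
        // Both sets get a huge capacity but only a single element.
        Set<Integer> hashSet = new HashSet<>(1024 * 1024);
        Set<Integer> linkedHashSet = new LinkedHashSet<>(1024 * 1024);
        hashSet.add(42);
        linkedHashSet.add(42);

        long t0 = System.nanoTime();
        for (int x : hashSet) { }          // must scan ~1M buckets to find 1 element
        long t1 = System.nanoTime();
        for (int x : linkedHashSet) { }    // follows the entry links: 1 step
        long t2 = System.nanoTime();

        System.out.println("HashSet iteration:       " + (t1 - t0) + " ns");
        System.out.println("LinkedHashSet iteration: " + (t2 - t1) + " ns");
    }
}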
Iterating over a HashSet, you (pretty much) need to iterate over all the buckets that could contain elements and skip the empty ones, which requires additional time. Briefly: there is some overhead associated with sorting out the empty buckets.
The nature of linked collections is that every element points to the next one. So you start with the first one, pull the next without much trouble, and so on; this way you easily iterate over them all.