What's the magic behind a HashSet finding duplicates so incredibly fast? - java

Ok this might be a superstupid question, but
I'm a bit baffled atm and eager to hear what you can tell me about this.
I had an ArrayList with about 5 million longs added. These longs are calculated hashes for primary keys (concatenated Strings) out of a big csv file.
Now I wanted to check for uniqueness, and was looping through the list like this:
for (int i = 0; i < hashArrayList.size(); i++)
{
    long refValue = hashArrayList.get(i);
    for (int j = i + 1; j < hashArrayList.size(); j++)
    {
        if (refValue == hashArrayList.get(j))
        {
            // --> UNIQUENESS VIOLATION, now EXPLODE!!
        }
    }
}
This way it takes HOURS.
Now, about the HashSet, which doesn't allow duplicates by itself:
A hashSet.addAll(hashArrayList) takes 4 seconds, while eliminating/not adding duplicates, for this list with 5 million elements!
How does it do that?
And: Is my ArrayList-looping so stupid?

You are doing a totally different comparison.
With an ArrayList, you have a nested for loop which makes it O(n^2).
But with a HashSet, you are not doing any looping; you are just adding n elements to it, which is O(n). Internally, a HashSet uses a HashMap whose keys are the individual elements of the list and whose values are a shared static Object.
Source code for HashSet (Java 8)
public HashSet(Collection<? extends E> c) {
    map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16));
    addAll(c);
}
addAll calls add
public boolean add(E e) {
    return map.put(e, PRESENT) == null;
}
So, ultimately it all comes down to inserting an object (here a long) into a HashMap, which provides constant-time performance [1].
[1] From the javadoc of HashMap (emphasis mine):
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets
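
Applied to the question, here is a minimal sketch of the HashSet-based uniqueness check (the variable name hashArrayList is taken from the question; the sample values are made up):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniquenessCheck {
    public static void main(String[] args) {
        // In practice this would hold the ~5 million hashes from the CSV file.
        List<Long> hashArrayList = new ArrayList<>(Arrays.asList(42L, 7L, 42L));

        // Pre-size the set to avoid rehashing; each add is expected O(1),
        // so the whole pass is O(n).
        Set<Long> seen = new HashSet<>((int) (hashArrayList.size() / 0.75f) + 1);
        for (long hash : hashArrayList) {
            if (!seen.add(hash)) { // add returns false if the value was already present
                System.out.println("UNIQUENESS VIOLATION for " + hash);
            }
        }
    }
}

Unlike addAll, checking the return value of add also tells you which value is duplicated.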

A hash-based collection doesn't need looping to check whether there are elements with the same key.
Imagine you have 1,000 objects X. In your case, you loop through the list every time you add something.
A hash-based collection calculates the hash of the object, looks inside to see whether there are other elements with the same hash, and then just needs to check whether one of them is equal to the new element. If you have a good hash function that returns a unique hash for unique elements, you would just have to calculate a number.
Of course, if you just say "I am lazy and I override my hashCode method with return 1", then you would have the same number of comparisons as before, plus the overhead of the hash collection.
Example: Imagine you have the following HashSet:
HashSet: [[obj1], [null], [null], [null], [obj2, obj3, obj4]]
As you see, the basic structure (can be) like this: An array containing other data structures with the actual entries. Now, if you put an obj5 into the HashSet, it will call obj5.hashCode(). Based on this, it will calculate the outer index of this obj. Let's say it's 4:
HashSet: [[obj1], [null], [null], [null], [obj2, obj3, obj4]]
^ obj5
Now we have three other objects with the same index. Yes, we need a loop here to check whether any of them is equal to the new obj5, but in a bigger HashSet with millions of entries, a comparison with a few colliding elements is much faster than comparing with all the elements. This is the advantage of a hash-based collection.
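
To make the "lazy hashCode" point above concrete, here is a small sketch (LazyKey is a made-up class for illustration):

import java.util.HashSet;
import java.util.Set;

public class DegenerateHashDemo {
    // Every instance hashes to the same bucket, so the hash table degenerates.
    static final class LazyKey {
        final long value;
        LazyKey(long value) { this.value = value; }

        @Override
        public boolean equals(Object o) {
            return o instanceof LazyKey && ((LazyKey) o).value == value;
        }

        @Override
        public int hashCode() { return 1; } // "I am lazy": all elements collide
    }

    public static void main(String[] args) {
        Set<LazyKey> set = new HashSet<>();
        // All entries land in one bucket. In older JDKs every add scans that bucket,
        // approaching the O(n^2) of the nested loop; since Java 8, large buckets are
        // converted to trees, which softens but does not remove the penalty.
        for (long i = 0; i < 100_000; i++) {
            set.add(new LazyKey(i));
        }
        System.out.println(set.size()); // 100000, but built far slower than with a good hash
    }
}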

Hashmap internal working
Moreover, you are using a loop inside a loop, making the complexity O(n^2), which is less efficient than what HashMap uses.

Related

Time Complexity of .add() in HashSet of Vectors in Java

I have two java HashSets, as below:
HashSet<Integer> H1 = new HashSet<>();
HashSet<Vector> H2 = new HashSet<>();
As the number of elements in each HashSet grows larger (and assuming some of the elements are unique, some aren't), does the time complexity of adding elements to the second HashSet (of Vectors) change, relative to the first HashSet (of Integers)? Or does the fact that the HashSet contains Vectors not affect the time complexity?
I understand the time complexity of .add() for the first HashSet is (in general) O(1), but could someone clarify for H2?
Also, if it were instead, say, a TreeSet of Vectors, how would the time complexity of .add() change from a TreeSet of ints in that case?
Your question makes some false assumptions
HashSet<Integer> H1 = new HashSet<>();
HashSet<Vector> H2 = new HashSet<>();
Vector is a legacy synchronized class from Java 1.0; you're probably better off using ArrayList. Vector is also a mutable class, meaning that it may not work properly in a HashSet if the objects it contains are changing.
As the number of elements in each HashSet grows larger (and assuming some of the elements are unique, some aren't)
There cannot be non-unique elements in a Set
As the number of elements in each HashSet grows larger (and assuming some of the elements are unique, some aren't), does the time complexity of adding elements to the second HashSet (of Vectors) change, relative to the first HashSet (of Integers)? Or does the fact that the HashSet contains Vectors not affect the time complexity?
I understand the time complexity of .add() for the first HashSet is (in general) O(1), but could someone clarify for H2?
Basically, larger lists will change the time complexity - not of the HashSet itself, but of hashCode() and equals(). Also be aware that adding or removing elements from a List after it is used as a key in a HashMap / HashSet will generally break the Map, as the sketch below shows.
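
A short sketch of how mutating an element breaks a hash-based collection (the values are made up):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MutatedKeyDemo {
    public static void main(String[] args) {
        List<Integer> key = new ArrayList<>(Arrays.asList(1, 2, 3));
        Set<List<Integer>> set = new HashSet<>();
        set.add(key); // stored in the bucket for hashCode([1, 2, 3])

        key.add(4); // mutation changes the list's hashCode

        // Typically false: the lookup probes the bucket for hashCode([1, 2, 3, 4]),
        // but the entry is still filed under the old hash.
        System.out.println(set.contains(key));
    }
}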
Also, if it were instead, say, a TreeSet of Vectors, how would the time complexity of .add() change from a TreeSet of ints in that case?
You can't do this as written, since Vector does not implement Comparable; you would have to supply a Comparator to the TreeSet.

Does the ArrayList's contains() method work faster if the ArrayList is ordered?

I suspect it doesn't. If I want to use the fact that the list is ordered, should I implement my own contains() method, using binary search, for example? Are there any methods that assume that the list is ordered?
This question is different to the possible duplicate because the other question doesn't ask about the contains() method.
No, because ArrayList is backed by an array and contains() internally calls the indexOf(Object o) method, which searches sequentially. Thus sorting is not relevant to it. Here's the source code:
/**
 * Returns the index of the first occurrence of the specified element
 * in this list, or -1 if this list does not contain the element.
 * More formally, returns the lowest index <tt>i</tt> such that
 * <tt>(o==null ? get(i)==null : o.equals(get(i)))</tt>,
 * or -1 if there is no such index.
 */
public int indexOf(Object o) {
    if (o == null) {
        for (int i = 0; i < size; i++)
            if (elementData[i]==null)
                return i;
    } else {
        for (int i = 0; i < size; i++)
            if (o.equals(elementData[i]))
                return i;
    }
    return -1;
}
Use binary search from Collections to search in an ordered ArrayList:
Collections.<T>binarySearch(List<T> list, T key)
ArrayList.contains will treat this as a normal list, so it takes the same amount of time as for any unordered list, that is O(n), whereas the complexity of binary search is O(log n) in the worst case.
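
A minimal usage sketch (the list contents are made up):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BinarySearchDemo {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>(Arrays.asList(5, 1, 4, 2, 3));
        Collections.sort(list); // binarySearch requires ascending order

        int idx = Collections.binarySearch(list, 4); // O(log n)
        System.out.println(idx >= 0 ? "found at index " + idx : "not present");
    }
}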
No. contains uses indexOf:
public boolean contains(Object var1) {
    return this.indexOf(var1) >= 0;
}
and indexOf just simply iterates over the internal array:
for (var2 = 0; var2 < this.size; ++var2) {
    if (var1.equals(this.elementData[var2])) {
        return var2;
    }
}
Collections.binarySearch is what you're looking for:
Searches the specified list for the specified object using the binary
search algorithm. The list must be sorted into ascending order
according to the natural ordering of its elements (as by the
sort(List) method) prior to making this call. If it is not sorted, the
results are undefined.
Emphasis mine
Also consider using a SortedSet such as a TreeSet which will provide stronger guarantees that the elements are kept in the correct order, unlike a List which must rely on caller contracts (as highlighted above)
Does the ArrayList's contains() method work faster if the ArrayList is ordered?
It doesn't. The implementation of ArrayList does not know if the list is ordered or not. Since it doesn't know, it cannot optimize in the case when it is ordered. (And an examination of the source code bears this out.)
Could a (hypothetical) array-based-list implementation know? I think "No" for the following reasons:
Without either a Comparator or a requirement that elements implement Comparable, the concept of ordering is ill-defined.
The cost of checking that a list is ordered is O(N). The cost of incrementally checking that a list is still ordered is O(1) ... but still one or two calls to compare on each update operation. That is a significant overhead ... for a general purpose data structure to incur in the hope of optimizing (just) one operation in the API.
But that's OK. If you (the programmer) are able to ensure (ideally by efficient algorithmic means) that a list is always ordered, then you can use Collections.binarySearch ... with zero additional checking overhead in update operations.
Just to keep it simple:
If you have an array [5,4,3,2,1] and you sort it to [1,2,3,4,5], a linear search will finish faster if you look for 1, but it will take longer to find 5. Consequently, from the mathematical point of view, even if you sort an array, searching for an item with a linear scan will still require looping from 1 up to, in the worst case, n.
Sorting may still help for your problem, say you receive unordered timestamps, but if:
- your array is not too small,
- you want to avoid the additional cost of sorting on each new entry in the array,
- you just want to find an object quickly, and
- you know the object properties you are searching for,
then you can create a KeyObject containing the properties you are looking for, implement equals() and hashCode() for it, and store your items in a Map. Using Map.containsKey(new KeyObject(prop1, prop2)) would in any case be faster than looping over the array. If you do not have the real object, you can always create a fake KeyObject, filled with the properties you expect, to check the Map, as sketched below.
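
A sketch of that approach (KeyObject, prop1 and prop2 are illustrative names, not an existing API):

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class KeyLookupDemo {
    static final class KeyObject {
        final String prop1;
        final String prop2;

        KeyObject(String prop1, String prop2) {
            this.prop1 = prop1;
            this.prop2 = prop2;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof KeyObject)) return false;
            KeyObject k = (KeyObject) o;
            return prop1.equals(k.prop1) && prop2.equals(k.prop2);
        }

        @Override
        public int hashCode() {
            return Objects.hash(prop1, prop2);
        }
    }

    public static void main(String[] args) {
        Map<KeyObject, String> index = new HashMap<>();
        index.put(new KeyObject("2017-01-01T00:00", "sensorA"), "some payload");

        // Expected O(1): a "fake" key built from the expected properties finds the entry.
        System.out.println(index.containsKey(new KeyObject("2017-01-01T00:00", "sensorA")));
    }
}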

Using the Java 8 Streams API, can sorted() be relied upon when calling Collectors.toSet()?

This is the implementation of the java.util.stream.Collectors class's toSet() method:
public static <T>
Collector<T, ?, Set<T>> toSet() {
    return new CollectorImpl<>((Supplier<Set<T>>) HashSet::new, Set::add,
                               (left, right) -> { left.addAll(right); return left; },
                               CH_UNORDERED_ID);
}
As we can see, it uses a HashSet and calls add. From the HashSet documentation, "It makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time."
In the following code, a List of String is streamed, sorted and collected into a Set:
public static void main(String[] args) {
    Set<String> strings = Arrays.asList("c", "a", "b")
            .stream()
            .sorted()
            .collect(Collectors.toSet());
    System.out.println(strings.getClass());
    System.out.println(strings);
}
This provides the output:
class java.util.HashSet
[a, b, c]
The output is sorted. What I think is happening here is that although the contract provided by the HashSet documentation specifies that ordering is not something it provides, the implementation happens to add in order. I suppose this could change in future versions / vary between JVMs and that a wiser approach would be to do something like Collectors.toCollection(TreeSet::new).
Can sorted() be relied upon when calling Collectors.toSet()?
Additionally, what exactly does "it does not guarantee that the order will remain constant over time" mean? (I suppose add, remove, the resizing of the underlying array?)
To answer that question, you have to know a bit about how HashSet is implemented. As the name suggests, a HashSet is implemented using a hash table. Basically, a hash table is an array that is indexed by element hashes. A hash function (in Java, an object's hash is calculated by object.hashCode()) is basically a function that meets a few criteria:
it is (relatively) quick to compute for a given element
two objects that .equals() each other have identical hashes
there is a low probability that different items have the same hash
So, when you meet a HashSet that is "sorted" (which is understood as "the iterator preserves the natural order of elements"), this is due to a couple of coincidences:
the natural order of elements respects the natural order of their hashCodes
the hash table is small enough not to have collisions (two elements with the same hash code)
If you look into the String class's hashCode() method, you will see that for one-letter strings, the hash code corresponds to the Unicode index (codepoint) of the letter - so in this specific case, as long as the hash table is small enough, the elements will be sorted. However, this is a huge coincidence and
will not hold for any other sort order,
will not hold for classes whose hashCodes do not follow their natural ordering, and
will not hold for hash tables with collisions,
and moreover, this has nothing to do with the fact that sorted() was called on the stream - it's simply due to the way hashCode() is implemented and therefore the ordering of the hash table. Therefore, the simple answer to the question is "no".
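
If the sorted order actually has to survive collection, collect into a collection that guarantees ordering, as the question itself suggests. A minimal sketch:

import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class SortedCollectDemo {
    public static void main(String[] args) {
        // A TreeSet sorts by contract, so the order is guaranteed rather than coincidental.
        Set<String> strings = Arrays.asList("c", "a", "b").stream()
                .collect(Collectors.toCollection(TreeSet::new));
        System.out.println(strings); // [a, b, c]
    }
}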
The answer is no. Once you have added the items to a Set, you cannot rely on any order. From the JDK source code (HashSet.java):
/**
 * Returns an iterator over the elements in this set. The elements
 * are returned in no particular order.
 *
 * @return an Iterator over the elements in this set
 * @see ConcurrentModificationException
 */
public Iterator<E> iterator() {
    return map.keySet().iterator();
}
Now, in previous versions of the JDK, even though an order wasn't guaranteed, you'd usually get the items in either the order of creation of the objects or the order of invocation of hashCode() on the objects. As @Holgar mentions in the comments below, in HotSpot it's the latter. And you can't even count on that, since there are exceptions to this as well: the sequential number is not the only ingredient in the hashCode generator.
I recently heard a talk by Stuart Marks (the guy who's responsible for a rewrite of a major part of Collections in Java 9), and he said that they've added randomization to the iteration order of Sets (created by the new set factories) in Java 9. If you want to hear the session, the part where he talks about sets starts here - good talk, highly recommended, by the way!
So even if you used to count on the iteration order of Sets, once you move to Java 9 you should stop doing so.
All that said, if you need order you should consider using a LinkedHashSet or a SortedSet such as TreeSet.

How to detect duplicate Lists in Map<String,List<String>>

I have a Map of the form Map<String,List<String>>. The key is a document number, the List a list of terms that match some criteria and were found in the document.
In order to detect duplicate documents I would like to know if any two of the List<String> have exactly the same elements (this includes duplicate values).
The List<String> is sorted, so I can loop over the map and first check List.size(). For any two lists that are the same size I would then have to compare the two lists with List.equals().
The Map and associated lists will never be very large, so even though this brute-force approach will not scale well, it will suffice. But I was wondering if there is a better way - a way that does not involve so much explicit looping, and a way that will not produce a combinatorial explosion if the Map and/or Lists get a lot larger.
In the end all I need is a yes/no answer to the question: are any of the lists identical?
You can add the lists to a set data structure one by one. Happily the add method will tell you if an equal list is already present in the set:
HashSet<List<String>> set = new HashSet<List<String>>();
for (List<String> list : yourMap.values()) {
    if (!set.add(list)) {
        System.out.println("Found a duplicate!");
        break;
    }
}
This algorithm will detect whether there is a duplicate list in O(N) time, where N is the total number of characters in the lists of strings. This is quite a bit better than comparing every pair of lists, since for n lists there are n(n-1)/2 pairs to compare.
Use Map.containsValue(). Won't be more efficient than what you describe, but code will be cleaner. Link -> http://docs.oracle.com/javase/7/docs/api/java/util/Map.html#containsValue%28java.lang.Object%29
Also, depending on WHY exactly you're doing this, might be worth looking into this interface -> http://google-collections.googlecode.com/svn/trunk/javadoc/com/google/common/collect/BiMap.html
Not sure if it's a better way, but a cleaner way would be to create an object that wraps one of your Lists and implements hashCode() and equals() as you describe above, and change your map to contain instances of this class instead of the Lists directly.
You could then use a HashSet to efficiently discover which lists are equal. Or you can add the values collection of the map to a HashSet and compare the size of the HashSet to the size of the Map.
From the JavaDoc of 'List.equals(Object o)':
Compares the specified object with this list for equality. Returns
true if and only if the specified object is also a list, both lists
have the same size, and all corresponding pairs of elements in the two
lists are equal. (Two elements e1 and e2 are equal if (e1==null ?
e2==null : e1.equals(e2)).) In other words, two lists are defined to
be equal if they contain the same elements in the same order. This
definition ensures that the equals method works properly across
different implementations of the List interface.
This leads me to believe that it is doing the same thing you are proposing: Check to make sure both sides are a List, then compare the sizes, then check each pair. I wouldn't re-invent the wheel there.
You could use hashCode() instead, but the JavaDoc there seems to indicate it's looping as well:
Returns the hash code value for this list. The hash code of a list is
defined to be the result of the following calculation:
int hashCode = 1;
Iterator<E> i = list.iterator();
while (i.hasNext()) {
    E obj = i.next();
    hashCode = 31*hashCode + (obj==null ? 0 : obj.hashCode());
}
So, I don't think you are saving any time. You could, however, write a custom List that calculates the hash as items are put in. Then you avoid the cost of looping whenever the hash is needed; see the sketch below.
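
Here is one way such a list could look; a sketch under the assumption that elements are only ever appended and never mutated, removed or replaced (otherwise the cached hash goes stale):

import java.util.ArrayList;

// Hypothetical append-only list that maintains List.hashCode() incrementally.
class HashCachingList<E> extends ArrayList<E> {
    private int cachedHash = 1;

    @Override
    public boolean add(E e) {
        // Same formula as List.hashCode(), applied one element at a time on append.
        cachedHash = 31 * cachedHash + (e == null ? 0 : e.hashCode());
        return super.add(e);
    }

    @Override
    public int hashCode() {
        return cachedHash; // O(1) instead of looping over all elements
    }
}

Note that equals() (inherited from AbstractList) still compares element by element, but a HashSet only calls it after the O(1) hashes already match.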

Contains on TreeSet versus another Set

Is the contains method on TreeSet (Since it is already sorted per default) faster than say HashSet?
The reason I ask is that Collections.binarySearch is quite fast if the List is sorted, so I am thinking that maybe the contains method for TreeSet might be the same.
From the javadoc of TreeSet:
This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains).
From the javadoc of HashSet:
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.
So the answer is no.
Looking at the implementation (JDK 1.7, Oracle), TreeSet.contains (resp. HashSet) relies on the TreeMap.containsKey (resp. HashMap) method. containsKey loops over one hash bucket in a HashMap (which possibly contains only one item), whereas in a TreeMap it walks from node to node down the tree, using the compareTo method. If your item is the largest or the smallest, this can take significantly more time.
Finally, I just ran a quick test (yes I know, not very reliable) with a tree containing 1M integers, looking for one of the two largest, which forces the TreeSet to walk one of its longest paths. HashSet was quicker by a factor of 50.
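
A rough re-creation of that kind of test (not a rigorous benchmark: no JIT warm-up, single run):

import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class ContainsTimingDemo {
    public static void main(String[] args) {
        Set<Integer> tree = new TreeSet<>();
        Set<Integer> hash = new HashSet<>();
        for (int i = 0; i < 1_000_000; i++) {
            tree.add(i);
            hash.add(i);
        }

        long t0 = System.nanoTime();
        for (int r = 0; r < 1_000; r++) tree.contains(999_999); // O(log n) root-to-leaf walk
        long t1 = System.nanoTime();
        for (int r = 0; r < 1_000; r++) hash.contains(999_999); // expected O(1) bucket probe
        long t2 = System.nanoTime();

        System.out.println("TreeSet: " + (t1 - t0) / 1_000 + " us");
        System.out.println("HashSet: " + (t2 - t1) / 1_000 + " us");
    }
}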
