How to detect duplicate Lists in Map<String,List<String>> - java

I have a Map of the form Map<String,List<String>>. The key is a document number, the List a list of terms that match some criteria and were found in the document.
In order to detect duplicate documents I would like to know if any two of the List<String> have exactly the same elements (this includes duplicate values).
The List<String> is sorted so I can loop over the map and first check List.size(). For any two lists
that are same size I would then have to compare the two lists with List.equals().
The Map and associated lists will never be very large, so even though this brute force approach will not scale well it
will suffice. But I was wondering if there is a better way. A way that does not involve so much
explicit looping and a way that will not produce a combinatorial explosion if the Map and/or Lists get a lot larger.
In the end all I need is a yes/no answer to the question: are any of the lists identical?

You can add the lists to a set data structure one by one. Happily the add method will tell you if an equal list is already present in the set:
HashSet<List<String>> set = new HashSet<List<String>>();
for (List<String> list : yourMap.values()) {
    if (!set.add(list)) {
        System.out.println("Found a duplicate!");
        break;
    }
}
This algorithm finds out whether there is a duplicate list in expected O(N) time, where N is the total number of characters in the lists of strings (hashing a list has to read each of its strings once). This is quite a bit better than comparing every pair of lists, as for n lists there are n(n-1)/2 pairs to compare.

Use Map.containsValue(). Won't be more efficient than what you describe, but code will be cleaner. Link -> http://docs.oracle.com/javase/7/docs/api/java/util/Map.html#containsValue%28java.lang.Object%29
Also, depending on WHY exactly you're doing this, might be worth looking into this interface -> http://google-collections.googlecode.com/svn/trunk/javadoc/com/google/common/collect/BiMap.html

Not sure if it's a better way, but a cleaner way would be to create an object that implements Comparable and which holds one of your Lists. You could implement hashCode() and equals() as you describe above and change your map to contain instances of this class instead of the Lists directly.
You could then use HashSet to efficiently discover which lists are equal. Or you can add the values collection of the map to the HashSet and compare the size of the hashset to the size of the Map.
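The size comparison mentioned above could look roughly like the following. This is only a sketch: the class and method names are made up for illustration, and it relies on List.equals() and List.hashCode() comparing elements in order, which holds for the standard List implementations.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DuplicateListCheck {
    // True if at least two keys map to equal lists (same elements, same order).
    static boolean hasDuplicateLists(Map<String, List<String>> docs) {
        // Equal lists collapse into a single entry of the set.
        Set<List<String>> distinct = new HashSet<>(docs.values());
        return distinct.size() < docs.size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("doc1", Arrays.asList("alpha", "beta"));
        docs.put("doc2", Arrays.asList("alpha", "gamma"));
        docs.put("doc3", Arrays.asList("alpha", "beta"));
        System.out.println(hasDuplicateLists(docs)); // prints true
    }
}
```

Unlike the loop that stops at the first failed add(), this version always hashes every list, so it is shorter to write but cannot exit early.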

From the JavaDoc of 'List.equals(Object o)':
Compares the specified object with this list for equality. Returns
true if and only if the specified object is also a list, both lists
have the same size, and all corresponding pairs of elements in the two
lists are equal. (Two elements e1 and e2 are equal if (e1==null ?
e2==null : e1.equals(e2)).) In other words, two lists are defined to
be equal if they contain the same elements in the same order. This
definition ensures that the equals method works properly across
different implementations of the List interface.
This leads me to believe that it is doing the same thing you are proposing: Check to make sure both sides are a List, then compare the sizes, then check each pair. I wouldn't re-invent the wheel there.
You could use hashCode() instead, but the JavaDoc there seems to indicate it's looping as well:
Returns the hash code value for this list. The hash code of a list is
defined to be the result of the following calculation:
int hashCode = 1;
Iterator<E> i = list.iterator();
while (i.hasNext()) {
E obj = i.next();
hashCode = 31*hashCode + (obj==null ? 0 : obj.hashCode());
}
So, I don't think you are saving any time. You could, however, write a custom List that calculates the hash as items are put in. Then you negate the cost of doing looping.
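A custom list along those lines might look like the sketch below. HashCachingList is a hypothetical name, and the sketch assumes an append-only list: AbstractList rejects set(), remove() and positional add() by default, which keeps the cached value consistent.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

// Hypothetical append-only list that keeps its hash code up to date on every add.
final class HashCachingList extends AbstractList<String> {
    private final List<String> items = new ArrayList<>();
    private int hash = 1; // same starting value as in the List.hashCode() formula

    @Override
    public boolean add(String s) {
        // Same recurrence as the Javadoc formula, applied one element at a time.
        hash = 31 * hash + (s == null ? 0 : s.hashCode());
        return items.add(s);
    }

    @Override
    public String get(int index) {
        return items.get(index);
    }

    @Override
    public int size() {
        return items.size();
    }

    @Override
    public int hashCode() {
        return hash; // no loop over the elements here
    }
}
```

The looping has not disappeared, it has only moved into add(). equals() is inherited from AbstractList and still walks both lists when the hash codes match.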

Related

Using a HashMap for search capabilites

I need to use a data structure that can provide a low search time, but don't have a need to store key, value pairs.
All I need is to check if an element exists in a collection or not.
I'm thinking of inserting all the values from an array into a hashmap (with the key and value being the same) and then perform search operations on this.
Any alternatives or is this reasonable?
If you don't want to maintain key-value pairs, consider using java.util.HashSet
I assume your main use case would be adding elements to it and then calling 'contains' which has O(1) complexity
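As a rough illustration of that use case, the sketch below copies an array into a HashSet once and then calls contains(). The class name, method name and sample values are invented for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class MembershipLookup {
    // Copies the array into a HashSet once; later lookups do not scan the array.
    static Set<String> index(String[] values) {
        return new HashSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        Set<String> known = index(new String[] {"red", "green", "blue"});
        System.out.println(known.contains("green"));  // prints true
        System.out.println(known.contains("purple")); // prints false
    }
}
```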
Why do you need a HashMap for this? A plain List would also work, and there are several implementations:
ArrayList, LinkedList
You specify the type of object you want to store in the List with a type parameter, for example
LinkedList<String> stores String values. Keep in mind that contains() on a list is a linear scan.
or as the comments suggested you can use a HashSet
HashSet<String> hashSet = new HashSet<>();
hashSet.add("Item");
You can go with HashSet. Its contains(Object o) method does what you want: it returns true if the element is present and false otherwise.
You can use a Bloom filter. It is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter); the more elements that are added to the set, the larger the probability of false positives.
Cool article on Bloom Filters.
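To make the "possibly in set" versus "definitely not in set" behaviour concrete, here is a toy Bloom filter. It is a sketch, not a tuned implementation: the class name, the bit-array size, the number of hash positions and the way the positions are derived from hashCode() are all arbitrary choices for illustration.

```java
import java.util.BitSet;

// Toy Bloom filter for illustration; sizes and hash mixing are arbitrary choices.
final class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    TinyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from hashCode() and a second value taken from its high bits.
    private int position(Object o, int i) {
        int h1 = o.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(Object o) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(o, i));
        }
    }

    // false means "definitely not added"; true means "possibly added".
    boolean mightContain(Object o) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(o, i))) {
                return false;
            }
        }
        return true;
    }
}
```

The filter stores only bits, never the elements themselves, which is why removal is not possible and why a true answer can only mean "possibly".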

What's the magic behind a Hashset finding duplicates that incredibly fast?

Ok this might be a superstupid question, but
I'm a bit baffled atm and eager to hear what you can tell me about this.
I had an ArrayList with about 5 million longs added. These longs are calculated hashes for primary keys (concatenated Strings) out of a big csv file.
Now I wanted to check for uniqueness by looping through the list like this:
for (int i = 0; i < hashArrayList.size(); i++)
{
    long refValue = hashArrayList.get(i);
    for (int j = i + 1; j < hashArrayList.size(); j++)
    {
        if (refValue == hashArrayList.get(j))
            throw new IllegalStateException("UNIQUENESS VIOLATION"); // now EXPLODE!!
    }
}
This way it takes HOURS.
Now about the Hashset, which doesn't allow duplicates by itself.
A hashset.addAll(hashArrayList) takes 4 seconds, while eliminating (not adding) the duplicates in this list of 5 million elements!
How does it do that?
And: Is my ArrayList-looping so stupid?
You are doing a totally different comparison.
With an ArrayList, you have a nested for loop which makes it O(n^2).
But with a HashSet, you are not doing any looping, but just adding n elements to it which is O(n). Internally a HashSet uses a HashMap whose key is the individual elements of the list and value is a static Object.
Source code for HashSet (Java 8)
public HashSet(Collection<? extends E> c) {
    map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16));
    addAll(c);
}
addAll calls add:
public boolean add(E e) {
    return map.put(e, PRESENT)==null;
}
So, ultimately it all comes down to inserting an object (here a Long) into a HashMap, which provides constant-time performance [1].
[1] From the Javadoc of HashMap (emphasis mine):
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets.
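Applied to the uniqueness check from the question, the same idea could be sketched as follows. The class and method names are invented; the only library behaviour relied on is that Set.add() returns false when the element is already present.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UniquenessCheck {
    // Returns the first value that is seen a second time, or null if all values are unique.
    static Long firstDuplicate(List<Long> hashes) {
        Set<Long> seen = new HashSet<>();
        for (Long h : hashes) {
            if (!seen.add(h)) { // add returns false when h is already in the set
                return h;
            }
        }
        return null;
    }
}
```

This makes a single pass over the list, so each element costs one hash lookup instead of a scan over the rest of the list.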
A hash-based collection doesn't need looping to check whether there are elements with the same key.
Imagine you have 1.000 objects X. In your case, you loop through the list every time you add something.
A hash-based collection calculates the hash of the object, looks inside whether there are other elements with the same hash and then just needs to check whether one of them is equal to the new element. If you have a good hash function that returns a unique hash for unique elements, you would just have to calculate a number.
Of course, if you just say "I am lazy and I override my hashCode method with return 1", then you would have the same amount of comparisons additional to the hash collection overhead.
Example: Imagine you have the following HashSet:
HashSet: [[obj1], [null], [null], [null], [obj2, obj3, obj4]]
As you see, the basic structure (can be) like this: An array containing other data structures with the actual entries. Now, if you put an obj5 into the HashSet, it will call obj5.hashCode(). Based on this, it will calculate the outer index of this obj. Let's say it's 4:
HashSet: [[obj1], [null], [null], [null], [obj2, obj3, obj4]]
                                          ^ obj5
Now we have three other objects with the same index. Yes, we need a loop here to check whether any of them is equal to the new obj5, but if you have a bigger HashSet with millions of entries, a comparison with a few elements is much faster than a comparison with all the elements. This is the advantage of a hash-based collection.
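The bucket structure described above can be sketched as a toy class. This is an illustration of the idea only, with a fixed number of buckets and no resizing; it is not how java.util.HashSet is actually implemented, and the name ToyHashSet is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Toy hash set: an outer list of buckets, each bucket a small list of entries.
final class ToyHashSet {
    private final List<List<Object>> buckets = new ArrayList<>();

    ToyHashSet(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // The hash code selects one bucket; only that bucket is ever searched.
    private List<Object> bucketFor(Object o) {
        return buckets.get(Math.floorMod(o.hashCode(), buckets.size()));
    }

    boolean add(Object o) {
        List<Object> bucket = bucketFor(o);
        if (bucket.contains(o)) { // short loop over the few entries in this bucket
            return false;
        }
        bucket.add(o);
        return true;
    }

    boolean contains(Object o) {
        return bucketFor(o).contains(o);
    }
}
```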
Hashmap internal working
Moreover, you are using a loop inside a loop, which makes the complexity O(n^2); that is far less efficient than what a HashMap does.

Java data structure to allow boolean flags on objects and sorting?

I wish to have a set of Objects and booleans that mark each object as "visited" or not. Naturally I thought of a Map that will tell me whether an object has already been visited. But I want the objects to be sorted too, so that I can ask "Which is the 'smallest' object visited?" at any time. That calculation shouldn't be too expensive, at most O(n) on that data structure.
In my very specific case I'm asking about Date object, but it's irrelevant.
Objects can be added to that data structure at any moment, and will be entered with 'false' values.
Use a SortedSet. When an object is visited, add it to the set. To find out if an object was visited, just use set.contains(). To find the smallest object:
T smallest = set.isEmpty() ? null : set.iterator().next();
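Wrapped into a small helper, the SortedSet approach might look like this. The class VisitedTracker and its method names are hypothetical; the sketch assumes the elements are Comparable, as java.util.Date is.

```java
import java.util.SortedSet;
import java.util.TreeSet;

final class VisitedTracker<T extends Comparable<T>> {
    // Only visited items are stored; anything not in the set counts as unvisited.
    private final SortedSet<T> visited = new TreeSet<>();

    void markVisited(T item) {
        visited.add(item);
    }

    boolean isVisited(T item) {
        return visited.contains(item);
    }

    // Smallest visited item, or null if nothing has been visited yet.
    T smallestVisited() {
        return visited.isEmpty() ? null : visited.first();
    }
}
```

Objects that have not been visited yet need no entry at all, which covers the requirement that new objects start out as 'false'.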
You could use a map of <Boolean, TreeSet<Object>>, where you keep all of the visited objects in the set mapped to true and vice versa (assuming you're not dealing with duplicate objects). Insertion into a TreeSet runs in O(log n) time, and to get the "smallest" object visited you would use first(), which also takes at most O(log n) time.
What you need is Guava's TreeMultiset. Please read about Multiset; the TreeMultiset implementation maintains the ordering of its elements. You can write a custom Comparator, for example one that puts the most frequently visited object first.
https://code.google.com/p/guava-libraries/wiki/NewCollectionTypesExplained
If you use it you won't have structures like
Collection<Something, Something>
and also sorting would be out of the box after implementing a Comparator.

Does a set have the same elements as a list in Java?

I have an ArrayList<SomeObject> in Java which contains some <SomeObject> multiple times.
I also have a Set<SomeObject>, which contains some elements one time only. The elements are uniquely distinguishable only by their name (String SomeObject.Name).
How can I check whether the list has exactly the same elements as the set, but possibly multiple times?
Thanks
There are several collections libraries to do this. For example commons-collection: https://commons.apache.org/proper/commons-collections/apidocs/org/apache/commons/collections4/CollectionUtils.html#isEqualCollection-java.util.Collection-java.util.Collection-
eg. CollectionUtils.isEqualCollection(myList, mySet)
If you have to write it yourself, no libraries, then just check that each contains all the elements of the other:
`mySet.containsAll(myList) && myList.containsAll(mySet)`
It seems like the simplest one-line solution to this is probably
return new HashSet<SomeObject>(list).equals(set);
...which just identifies the unique elements of list and verifies that that matches set exactly.
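Since the question says the elements are distinguished only by their name, the one-liner depends on equals() and hashCode() being based on that name. A minimal sketch, using a hypothetical Named class as a stand-in for SomeObject:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for SomeObject: identity is defined by the name alone.
final class Named {
    final String name;

    Named(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Named && Objects.equals(name, ((Named) o).name);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(name);
    }
}

final class SameElements {
    // True if the list, ignoring repeats, holds exactly the elements of the set.
    static boolean sameElements(List<Named> list, Set<Named> set) {
        return new HashSet<>(list).equals(set);
    }
}
```

If SomeObject does not override equals() and hashCode(), two instances with the same name would count as different elements and the comparison would fail.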
You could convert the ArrayList to a set by using HashSet:
HashSet<SomeObject> listSet = new HashSet<SomeObject>(arrayList);
To check whether the ArrayList initially had more elements (that is, repeats), just compare the listSet and arrayList size() results:
boolean sameSize = listSet.size() == arrayList.size();
Then you can get the intersection of the two sets (the elements they have in common):
boolean changed = listSet.retainAll(set1);
If changed is false and listSet.size() == set1.size(), then they had the same elements: nothing in the list was missing from set1, and nothing in set1 was missing from the list. To check whether the arrayList had repeating elements initially, check the value of the boolean from before: if sameSize is false, then it did; true means that it didn't.
You have a couple of options:
add all your List elements to a Set and check if the size is the same;
create a new List based on your Set and check if they are equal.
All Java collections have the method boolean containsAll(Collection<?> c), so if we want to check whether two collections contain the same elements (ignoring how often each one occurs), we can write collection1.containsAll(collection2) && collection2.containsAll(collection1), which returns true if collection1 and collection2 contain the same elements.
Create a HashMap from String to Integer that counts how many times each element occurs. Iterate through the list: create an entry set to one for a new element, and add one to the count if the element already exists.
Map<String, Integer> hash = new HashMap<String, Integer>();
for (String s : list) {
    if (hash.containsKey(s))
        hash.put(s, hash.get(s) + 1);
    else
        hash.put(s, 1);
}

How to avoid duplicate strings in Java?

I want to be able to add specific words from a text into a vector. Now the problem is I want to avoid adding duplicate strings. The first thing that comes to my mind is to compare all strings before adding them, but as the number of entries grows, this becomes a really inefficient solution. The only "time efficient" solution that I can think of is the unordered_multimap container that was included in C++11. I couldn't find a Java equivalent of it. I was thinking of adding the strings to the map and at the end just copying all entries to the vector; that way it would be a lot more efficient than the first solution. Now I wonder whether there is any Java library that does what I want. If not, is there any equivalent of the C++ unordered_multimap container in Java that I couldn't find?
You can use a Set<String> collection. It does not allow duplicates. You can then choose an implementation:
1) HashSet if you do not care about the order of elements (Strings).
2) LinkedHashSet if you want to keep the elements in the inserting order.
3) TreeSet if you want the elements to be sorted.
For example:
Set<String> mySet = new TreeSet<String>();
mySet.add("a_String");
...
Vector is "old-fashioned" in Java. You had better avoid it.
You can use a set (java.util.Set):
Set<String> i_dont_allow_duplicates = new HashSet<String>();
i_dont_allow_duplicates.add(my_string);
i_dont_allow_duplicates.add(my_string); // won't add 'my_string' this time.
HashSet will do the job most efficiently, and if you want to keep insertion order then you can use LinkedHashSet.
Use a Set. A HashSet will do fine if you do not need to preserve order. A LinkedHashSet works if you need that.
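The difference between the order-preserving and the sorted variants can be seen in a small sketch like this one. The class name, method names and sample words are made up for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class UniqueWords {
    // Keeps the first occurrence of each word, in the order the words appeared.
    static Set<String> inInsertionOrder(List<String> words) {
        return new LinkedHashSet<>(words);
    }

    // Keeps each word once, sorted by String's natural ordering.
    static Set<String> sorted(List<String> words) {
        return new TreeSet<>(words);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("pear", "apple", "pear", "fig", "apple");
        System.out.println(inInsertionOrder(words)); // prints [pear, apple, fig]
        System.out.println(sorted(words));           // prints [apple, fig, pear]
    }
}
```

If a List or Vector is needed at the end, the set can be copied into one with new ArrayList<>(set).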
You should consider using a Set:
A collection that contains no duplicate elements. More formally, sets
contain no pair of elements e1 and e2 such that e1.equals(e2), and at
most one null element. As implied by its name, this interface models
the mathematical set abstraction.
HashSet should be good for your use:
HashSet class implements the Set interface, backed by a hash table
(actually a HashMap instance). It makes no guarantees as to the
iteration order of the set; in particular, it does not guarantee that
the order will remain constant over time. This class permits the null
element.
So simply define a Set like this and use it appropriately:
Set<String> myStringSet = new HashSet<String>();
Set<String> set = new HashSet<String>();
The general contract of hashCode is:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified.
This integer need not remain consistent from one execution of an application to another execution of the same application.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hashtables.
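For a hash-based Set to treat two objects as duplicates, equals() and hashCode() have to agree as the contract above describes. A minimal sketch with a hypothetical key type that treats words as equal regardless of case:

```java
import java.util.Locale;

// Hypothetical key type: two words are equal when they match ignoring case.
final class CaseInsensitiveWord {
    private final String lower;

    CaseInsensitiveWord(String word) {
        this.lower = word.toLowerCase(Locale.ROOT);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CaseInsensitiveWord
                && lower.equals(((CaseInsensitiveWord) o).lower);
    }

    // Derived from the same field as equals, so equal objects share a hash code.
    @Override
    public int hashCode() {
        return lower.hashCode();
    }
}
```

With this class, adding new CaseInsensitiveWord("Java") and then new CaseInsensitiveWord("JAVA") to a HashSet leaves one element in the set, because the second add() finds an equal object in the same bucket.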
