How to avoid duplicate strings in Java?

How to avoid duplicate strings in Java? - java

I want to be able to add specific words from a text into a vector. Now the problem is I want to avoid adding duplicate strings. The first thing that comes to my mind is to compare all strings before adding them, as the amount of entries grow, this becomes really inefficient solution. The only "time efficient" solution that I can think of is unordered_multimap container that has included in C++11. I couldn't find a Java equivalent of it. I was thinking to add strings to the map and at the end just copying all entries to the vector, in that way it would be a lot more efficient than the first solution. Now I wonder whether there is any Java library that does what I want? If not is there any C++ unordered_multimap container equivalent in Java that I couldn't find?

You can use a Set<String> Collection. It does not allow duplicates. You can choose then as implementantion:
1) HashSet if you do not care about the order of elements (Strings).
2) LinkedHashSet if you want to keep the elements in the inserting order.
3) TreeSet if you want the elements to be sorted.
For example:
Set<String> mySet = new TreeSet<String>();
mySet.add("a_String");
...
Vector is "old-fashioned" in Java. You had better avoid it.

You can use a set (java.util.Set):
Set<String> i_dont_allow_duplicates = new HashSet<String>();
i_dont_allow_duplicates.add(my_string);
i_dont_allow_duplicates.add(my_string); // wont add 'my_string' this time.
HashSet will do the job most effeciently and if you want to keep insertion order then you can use LinkedHashSet.

Use a Set. A HashSet will do fine if you do not need to preserve order. A LinkedHashSet works if you need that.

You should consider using a Set:
A collection that contains no duplicate elements. More formally, sets
contain no pair of elements e1 and e2 such that e1.equals(e2), and at
most one null element. As implied by its name, this interface models
the mathematical set abstraction.
HashSet should be good for your use:
HashSet class implements the Set interface, backed by a hash table
(actually a HashMap instance). It makes no guarantees as to the
iteration order of the set; in particular, it does not guarantee that
the order will remain constant over time. This class permits the null
element.
So simply define a Set like this and use it appropriately:
Set<String> myStringSet = new HashSet<String>();

Set<String> set = new HashSet<String>();
The general contract of hashCode is:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified.
This integer need not remain consistent from one execution of an application to another execution of the same application.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hashtables.

Related

Why collections sort Method cannot be used on set interface

Sorry for the bad subject,just wanted to know,
Why cant we implement sort method on set any specific reasons?

Set is meant to have no order, says the math theory. Not saying that anything forbids a sorted set, but the purpose here is to represent math set concept as is.
However, there also exists a sorted set interface: SortedSet

If you just take an arbitrary Java Set, there is no API that allows you to modify the order of the set. The Set interface has no methods for modifying order.

If you wanted to, you could write a collection class that implemented the Set interface, but had an underlying order like a List. That would be something that you could sort.
I'm not sure why you'd ever do such a thing; and as far as I know, there is no such class in the JDK or any of the commonly used libraries. I just don't see what the point of such a class would be. If you want something to remember an order, use a List.

The sort() method can not be applied on the Set instance and even Set Interface itself doesn't have any sort() method on it, unlike List Interface.
It's because when we normally add the value in any Set Instance so it follows the concept of generation of hashcode and assigning the elements based on their hashcode for better performance and for no duplicacy.
Set numberSet = new HashSet<>();
numberSet.add(10);
numberSet.add(100);
numberSet.add(11);
numberSet.add(1111);
numberSet.add(121);
now numberSet will add these values based on their hashcodes and retrieval of values will also be done based on the same prinicple.
Let's say if somehow we have a sort method on Set, then let's say we have sorted the set and values get arranged according to the ordering context which we want.
Then the values will be rearranged in the Set Data Structure (means now the contract which is being shared by Set that all the objects which have the same hashcode will be saved at a particular position that will get broken because of reordering and now the Set no longer can perform the operations in O(1) and can't check for duplicacy too according to it's underlying implementation ).
If you want to hold the Set sorted values there is a way:
We can open up a stream and then call sort() or sort(Comparator(? extends T)) can call any of the methods after opening the stream.
For calling these methods your class should implement Comparable Interface
sorted() -> internally uses compareTo method.
sorted(Comparator) -> uses your comparator lambda
Stream stream = numberSet.stream().sorted((obj1, obj2) -> obj1.getRollNumber() - obj2.getRollNumber());
now, these sort methods will return the sorted values in a Stream which is sorted.
But if you try to do stream..collect(Collectors.toSet());
try to collect the values in the set again and think now the set holds the sorted values. Set will not hold values in a sorted manner as it will hold the values in the set adding contract manner (using hashcode generation and all that again, values will be added in the same order as it was earlier).

How to detect duplicate Lists in Map<String,List<String>>

I have a Map of the form Map<String,List<String>>. The key is a document number, the List a list of terms that match some criteria and were found in the document.
In order to detect duplicate documents I would like to know if any two of the List<String> have exactly the same elements (this includes duplicate values).
The List<String> is sorted so I can loop over the map and first check List.size(). For any two lists
that are same size I would then have to compare the two lists with List.equals().
The Map and associated lists will never be very large, so even though this brute force approach will not scale well it
will suffice. But I was wondering if there is a better way. A way that does not involve so much
explicit looping and a way that will not produce an combinatorial explosion if the Map and/or Lists get a lot larger.
In the end all I need is a yes/no answer to the question: are any of the lists identical?

You can add the lists to a set data structure one by one. Happily the add method will tell you if an equal list is already present in the set:
HashSet<List<String>> set = new HashSet<List<String>>();
for (List<String> list : yourMap.values()) {
if (!set.add(list)) {
System.out.println("Found a duplicate!");
break;
}
}
This algorithm will find if there is a duplicate list in O(N) time, where N is the total number of characters in the lists of strings. This is quite a bit better than comparing every pair of lists, as for n lists there are n(n-1)/2 pairs to compare.

Use Map.containsValue(). Won't be more efficient than what you describe, but code will be cleaner. Link -> http://docs.oracle.com/javase/7/docs/api/java/util/Map.html#containsValue%28java.lang.Object%29
Also, depending on WHY exactly you're doing this, might be worth looking into this interface -> http://google-collections.googlecode.com/svn/trunk/javadoc/com/google/common/collect/BiMap.html

Not sure if it's a better way, but a cleaner way would be to create an object that implements Comparable and which holds one of your List. You could implement hashcode() and equals() as you describe above and change your map to contain instances of this class instead of the Lists directly.
You could then use HashSet to efficiently discover which lists are equal. Or you can add the values collection of the map to the HashSet and compare the size of the hashset to the size of the Map.

From the JavaDoc of 'List.equals(Object o)':
Compares the specified object with this list for equality. Returns
true if and only if the specified object is also a list, both lists
have the same size, and all corresponding pairs of elements in the two
lists are equal. (Two elements e1 and e2 are equal if (e1==null ?
e2==null : e1.equals(e2)).) In other words, two lists are defined to
be equal if they contain the same elements in the same order. This
definition ensures that the equals method works properly across
different implementations of the List interface.
This leads me to believe that it is doing the same thing you are proposing: Check to make sure both sides are a List, then compare the sizes, then check each pair. I wouldn't re-invent the wheel there.
You could use hashCode() instead, but the JavaDoc there seems to indicate it's looping as well:
Returns the hash code value for this list. The hash code of a list is
defined to be the result of the following calculation:
int hashCode = 1;
Iterator<E> i = list.iterator();
while (i.hasNext()) {
E obj = i.next();
hashCode = 31*hashCode + (obj==null ? 0 : obj.hashCode());
}
So, I don't think you are saving any time. You could, however, write a custom List that calculates the hash as items are put in. Then you negate the cost of doing looping.

Setting key type in HashMap, how?

hi
I want to create a HashMap (java) that stores Expression, a little object i've created.
How do I choose what type of key to use? What's the difference for me between integer and String? I guess i just don't fully understand the idea behind HashMap so i'm not sure what keys to use.
Thanks!

Java HashMap relies on two things:
the hashCode() method, which returns an integer that is generated from the key and used inside the map
the equals(..) method, which should be consistent to the hash calculated, this means that if two keys has the same hashcode than it is desiderable that they are the same element.
The specific requirements, taken from Java API doc are the following:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hashtables.
If you don't provide any kind of specific implementation, then the memory reference of the object is used as the hashcode. This is usually good in most situations but if you have for example:
Expression e1 = new Expression(2,4,PLUS);
Expression e2 = new Expression(2,4,PLUS);
(I don't actually know what you need to place inside your hashmap so I'm just guessing)
Then, since they are two different object although with same parameters, they will have different hashcodes. This could be or not be a problem for your specific situation.
In case it isn't just use the hasmap without caring about these details, if it is you will need to provide a better way to compute the hashcode and equality of your Expression class.
You could do it in a recursive way (by computing the hashcode as a result of the hashcodes of children) or in a naive way (maybe computing the hashcode over a toString() representation).
Finally, if you are planning to use just simple types as keys (like you said integers or strings) just don't worry, there's no difference. In both cases two different items will have the same hashcode. Some examples:
assert(new String("hello").hashCode() == new String("hello").hashCode());
int x = 123;
assert(new Integer(x).hashCode() == new Integer(123).hashCode());
Mind that the example with strings is not true in general, like I explained you before, it is just because the hashcode method of strings computes the value according to the content of the string itself.

The key is what you use to identify objects. You might have a situation where you want to identify numbers by their name.
Map<String,Integer> numbersByName = new HashMap<String,Integer>();
numbersByName.put("one",Integer.valueOf(1));
numbersByName.put("two",Integer.valueOf(2));
numbersByName.put("three",Integer.valueOf(3));
... etc
Then later you can get them out by doing
Integer three = numbersByName.get("three");
Or you might have a need to go the other way. If you know you're going to have integer values, and want the names, you can map integers to strings
Map<String,Integer> numbersByValue = new HashMap<String,Integer>();
numbersByValue.put(Integer.valueOf(1),"one");
numbersByValue.put(Integer.valueOf(2),"two");
numbersByValue.put(Integer.valueOf(3),"three");
... etc
And get it out
String three = numbersByValue.get(Integer.valueOf(3));

Keys and their associated values are both objects. When you get something from a HashMap, you have to cast it to the actual type of object it represents (we can do this because all objects in Java inherit the Object class). So, if your keys are strings and your values are Integers, you would do something like:
Integer myValue = (Integer)myMap.get("myKey");
However, you can use Java generics to tell the compiler that you're only going to be using Strings and Integers:
HashMap<String,Integer> myMap = new HashMap<String,Integer>();
See http://download.oracle.com/javase/1.4.2/docs/api/java/util/HashMap.html for more details on HashMap.

If you do not want to look up the expressions, why do you want them to store in a map?
But if you want to, then the key is that item you use for lookup.

Why HashSet internally implemented as HashMap [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Why does HashSet implementation in Sun Java use HashMap as its backing?
I know what a hashset and hashmap is - pretty well versed with them.
There is 1 thing which really puzzled me.
Example:
Set <String> testing= new HashSet <String>();
Now if you debug it using eclipse right after the above statements, under debugger variables tab, you will noticed that the set 'testing' internally is implemented as a hashmap.
Why does it need a hashmap since there is no key,value pair involved in sets collection

It's an implementation detail. The HashMap is actually used as the backing store for the HashSet. From the docs:
This class implements the Set interface, backed by a hash table (actually a HashMap instance). It makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time. This class permits the null element.
(emphasis mine)

The answer is right in the API docs
"This class implements the Set interface, backed by a hash table (actually a HashMap instance). It makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time. This class permits the null element.
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets. Iterating over this set requires time proportional to the sum of the HashSet instance's size (the number of elements) plus the "capacity" of the backing HashMap instance (the number of buckets). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important."
So you don't even need the debugger to know this.
In answer to your question: it is an implementation detail. It doesn't need to use a HashMap, but it is probably just good code re-use. If you think about it, in this case the only difference is that a Set has different semantics from a Map. Namely, maps have a get(key) method, and Sets do not. Sets do not allow duplicates, Maps allow duplicate values, but they must be under different keys.
It is probably really easy to use a HashMap as the backing of a HashSet, because all you would have to do would be to use hashCode (defined on all objects) on the value you are putting in the Set to determine if a dupe, i.e., it is probably just doing something like
backingHashMap.put(toInsert.hashCode(), toInsert);
to insert items into the Set.

In most cases the Set is implemented as wrapper for the keySet() of a Map. This avoids duplicate implementations. If you look at the source you will see how it does this.
You might find the method Collections.newSetFromMap() which can be used to wrap ConcurrentHashMap for example.

The very first sentence of the class's Javadoc states that it is backed by a HashMap:
This class implements the Set interface, backed by a hash table (actually a HashMap instance).
If you'll look at the source code of HashSet you'll see that what it stores in the map is as the key is the entry you are using, and the value is a mere marker Object (named PRESENT).
Why is it backed by a HashMap? Because this is the simplest way to store a set of items in a (conceptual) hashtable and there is no need for HashSet to re-invent an implementation of a hashtable data structure.

It's just a matter of convenience that the standard Java class library implements HashSet using a HashMap -- they only need to implement one data structure and then HashSet stores its data in a HashMap with the actual set objects as the key and a dummy value (typically Boolean.TRUE) as the value.

HashMap has already all the functionality that HashSet requires. There would be no sense to duplicate the same algorithms.

it allows you to easily and quickly determine whether an object is already in the set or not.

Duplicate values in the Set collection?

Is it possible to allow duplicate values in the Set collection?
Is there any way to make the elements unique and have some copies of them?
Is there any functions for Set collection for having duplicate values in it?

Ever considered using a java.util.List instead?
Otherwise I would recommend a Multiset from Google Guava (the successor to Google Collections, which this answer originally recommended -ed.).

The very definition of a Set disallows duplicates. I think perhaps you want to use another data structure, like a List, which will allow dups.
Is there any way to make the elements unique and have some copies of them?
If for some reason you really do need to store duplicates in a set, you'll either need to wrap them in some kind of holder object, or else override equals() and hashCode() of your model objects so that they do not evaluate as equivalent (and even that will fail if you are trying to store references to the same physical object multiple times).
I think you need to re-evaluate what you are trying to accomplish here, or at least explain it more clearly to us.

From the javadocs:
"sets contain no pair of elements e1
and e2 such that e1.equals(e2), and at
most one null element"
So if your objects were to override .equals() so that it would return different values for whatever objects you intend on storing, then you could store them separately in a Set (you should also override hashcode() as well).
However, the very definition of a Set in Java is,
"A collection that contains no
duplicate elements. "
So you're really better off using a List or something else here. Perhaps a Map, if you'd like to store duplicate values based on different keys.

Sun's view on "bags" (AKA multisets):
We are extremely sympathetic to the desire for type-safe collections. Rather than adding a "band-aid" to the framework that enforces type-safety in an ad hoc fashion, the framework has been designed to mesh with all of the parameterized-types proposals currently being discussed. In the event that parameterized types are added to the language, the entire collections framework will support compile-time type-safe usage, with no need for explicit casts. Unfortunately, this won't happen in the the 1.2 release. In the meantime, people who desire runtime type safety can implement their own gating functions in "wrapper" collections surrounding JDK collections.
(source; note it is old and possibly obsolete -ed.)
Apart from Google's collections API, you can use Apache Commons Collections.
Apache Commons Collections:
http://commons.apache.org/collections/
Javadoc for Bag

I don't believe that you can have duplicate values within a set. A set is defined as a collection of unique values. You may be better off using an ArrayList.

These sound like interview questions, so I'll answer them like interview questions...
Is it possible to allow duplicate values in the Set collection?
Yes, but it requires that the person implementing the Set violate the design contract upon which Set is built. Basically, I could write a class that extends Set and doesn't enforce Set's promises.
In addition, other violations are possible. I could use a Set implementation that relies upon Java's hashCode() contract. Then if I provided an Object that violates Java's hashcode contract, I might be able to place two objects into the set which are equal, but yeild different hashcodes (because they might not be checked in equality against each other due to being in different hash bucket chains.
Is there any way to make the elements unique and have some copies of them?
It basically depends on how you define uniqueness. If an object's uniqueness is determined by its value, then one can have multiple copies of the same unique object; however, if the object's uniqueness is determined by its instance, then by definition it would not be possible to have multiple copies of the same object. You could however have multiple references to them.
Is there any functions for Set collection for having duplicate values in it?
The Set interface doesn't have any functions for detecting / reporting duplicates; however, it is based on the Collections interface, which has to support the List interface, so it is possible to pass duplicates into a Set; however, a properly implemented Set will just ignore the duplicates, and present one copy of every element determined to be unique.

I don't think so. The only way would be to use a List. You can also trick with function equals(), hashcode() or compareTo() but it is going to be ankward.

NO chance.... you can not have duplicate values in SET interface...
If you want duplicates then you can try Array-List

As mentioned choose the right collection for the task and likely a List will be what you need. Messing with the equals(), hashcode() or compareTo() to break identity is generally a bad idea simply to wedge an instance into the wrong collection to start with. Worse yet it may break code in other areas of the application that depend on these methods producing valid comparison results and be very difficult to debug or track down such errors.

This question was asked to me also in an interview. I think the answer is, ofcourse Set will not allow duplicate elements and instead ArrayList or other collections should be used for the same, however overriding equals() for the type of the object being stored in the set will allow you to manipulate on the comparison logic. And hence you may be able to store duplicate elements in the Set. Its more of a hack, which would allow non-unique elements in the Set and ofcourse is not recommended in production level code.

You can do so by overriding hashcode as given below:
public class Test
{
static int a=0;
#Override
public int hashCode()
{
a++;
return a;
}
public static void main(String[] args)
{
Set<Test> s=new HashSet<Test>();
Test t1=new Test();
Test t2=t1;
s.add(t1);
s.add(t2);
System.out.println(s);
System.out.println("--Done--");
}
}

Well, In this case we are trying to break the purpose of specific collection. If we want to allow duplicate records simply use list or multimap.

Set will store unique values and if you wants to store duplicate values then for list,but still if you want duplicate values in set then create set of ArrayList so that you can put duplicate elements into it.
Set<ArrayList> s = new HashSet<ArrayList>();
ArrayList<String> arr = new ArrayList<String>();
arr.add("First");
arr.add("Second");
arr.add("Third");
arr.add("Fourth");
arr.add("First");
s.add(arr);

You can use Tree Map instead :
Key can be used as element you wish to store
and Value will be the frequency of input element.
The insertion and removal will require custom handling.
Insertion : Check if the map already contains the element , if yes then increment its frequency. O(log N)
Removal : if the element's frequency is 1 then remove it , else decrease frequency by 1. O(log N)
More details can be found in the java docs of tree map
Overall time complexity will remain same as TreeSet O(log N) but worse than a HashSet O(1)
firstEntry() -> provides smallest element entry, Time Complexity : O(Log N)
lastEntry() -> provides greatest element entry, Time Complexity : O(Log N)

public class SET {
public static void main(String[] args) {
Set set=new HashSet();
set.add(new AB(10, "pawan#email"));
set.add(new AB(10, "pawan#email"));
set.add(new AB(10, "pawan#email"));
Iterator it=set.iterator();
while(it.hasNext()){
Object o=it.next();
System.out.println(o);
}
}
}
public class AB{
int id;
String email;
public AB() {
System.out.println("DC");
}
AB(int id,String email){
this.id=id;
this.email=email;
}
#Override public String toString() {
// TODO Auto-generated method stub return ""+id+"\t"+email;}
}
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.