Possible Duplicate:
Why does HashSet implementation in Sun Java use HashMap as its backing?
I know what a HashSet and a HashMap are - I'm pretty well versed with both. But there is one thing that really puzzles me.
Example:
Set<String> testing = new HashSet<String>();
Now if you debug this in Eclipse right after the above statement, you will notice under the debugger's Variables tab that the set 'testing' is internally implemented as a HashMap.
Why does it need a HashMap when there is no key-value pair involved in a Set?
It's an implementation detail. The HashMap is actually used as the backing store for the HashSet. From the docs:
This class implements the Set interface, backed by a hash table (actually a HashMap instance). It makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time. This class permits the null element.
(emphasis mine)
The answer is right in the API docs
"This class implements the Set interface, backed by a hash table (actually a HashMap instance). It makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time. This class permits the null element.
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets. Iterating over this set requires time proportional to the sum of the HashSet instance's size (the number of elements) plus the "capacity" of the backing HashMap instance (the number of buckets). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important."
So you don't even need the debugger to know this.
In answer to your question: it is an implementation detail. It doesn't need to use a HashMap, but it is probably just good code reuse. If you think about it, the only difference is that a Set has different semantics from a Map: maps have a get(key) method and Sets do not, and Sets do not allow duplicates, while Maps allow duplicate values as long as they are stored under different keys.
It is really easy to use a HashMap as the backing of a HashSet: each value you put into the Set becomes a key in the backing map, paired with a shared marker object, so the map's own hashCode()/equals() machinery detects duplicates. Inserting into the Set boils down to something like
backingHashMap.put(toInsert, PRESENT);
where PRESENT is a shared dummy value.
In most cases the Set is implemented as a wrapper for the keySet() of a Map. This avoids duplicating the implementation. If you look at the source, you will see how it does this.
You might also find the method Collections.newSetFromMap() useful, which can be used to wrap a ConcurrentHashMap, for example.
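As a minimal sketch of that last point, Collections.newSetFromMap() turns any empty map with Boolean values into a Set with the map's characteristics (the data here is purely illustrative):

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentSetDemo {
    // Builds a thread-safe Set by wrapping an empty ConcurrentHashMap;
    // the map must be empty when handed to newSetFromMap().
    static Set<String> buildSet() {
        Set<String> set =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
        set.add("a");
        set.add("a"); // duplicate, ignored
        return set;
    }

    public static void main(String[] args) {
        System.out.println(buildSet().size()); // prints 1
    }
}
```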
The very first sentence of the class's Javadoc states that it is backed by a HashMap:
This class implements the Set interface, backed by a hash table (actually a HashMap instance).
If you look at the source code of HashSet, you will see that the key it stores in the map is the element you are adding, while the value is a mere marker Object (named PRESENT).
Why is it backed by a HashMap? Because this is the simplest way to store a set of items in a (conceptual) hashtable and there is no need for HashSet to re-invent an implementation of a hashtable data structure.
It's just a matter of convenience that the standard Java class library implements HashSet using a HashMap -- they only need to implement one data structure, and then HashSet stores its data in a HashMap with the actual set elements as keys and a shared dummy object as the value.
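To make the delegation concrete, here is a hedged sketch of a set backed by a map in the spirit of HashSet; the class and field names are illustrative, not the actual JDK source:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a set backed by a map, in the spirit of HashSet.
public class SimpleHashSet<E> {
    // Shared dummy value stored against every key, mirroring HashSet's PRESENT.
    private static final Object PRESENT = new Object();
    private final Map<E, Object> map = new HashMap<>();

    public boolean add(E e) {
        // put() returns null if the key was absent, so add() reports
        // whether the element was newly inserted.
        return map.put(e, PRESENT) == null;
    }

    public boolean contains(E e) {
        return map.containsKey(e);
    }

    public boolean remove(E e) {
        return map.remove(e) == PRESENT;
    }

    public int size() {
        return map.size();
    }
}
```

Every set operation is a one-line delegation, which is exactly why reusing HashMap is attractive.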
HashMap has already all the functionality that HashSet requires. There would be no sense to duplicate the same algorithms.
It allows you to easily and quickly determine whether an object is already in the set or not.
I have seen answers touching this topic, but I wanted to find out whether anybody has an example where the keySet() and values() views derived from a HashMap would have the keys and values in different orders. I understand the entries themselves have no defined order in a HashMap, but could the lists of keys and values be out of order with respect to each other?
Here is a short snippet for clarification:
public class App {
    public static void main(String[] args) {
        Map<String, String> stateCapitols = new HashMap<>();
        stateCapitols.put("AL", "Montgomery");
        stateCapitols.put("AK", "Juneau");
        stateCapitols.put("CO", "Denver");
        stateCapitols.put("FL", "Tallahassee");
        stateCapitols.put("Indiana", "Indianapolis");
        stateCapitols.keySet().stream().forEach(System.out::println);
        System.out.println();
        stateCapitols.values().stream().forEach(System.out::println);
    }
}
Would there be any way AL might appear in the same position as Denver (or any other value) in the above example?
Let's look again at the Java SE API language regarding iteration order in the Map contract:
Some map implementations, like the TreeMap class, make specific guarantees as to their order; others, like the HashMap class, do not.
And HashMap:
This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.
Since it is explicitly stated that HashMap iterators do not have an order, it cannot be assumed that the iteration will be stable even between calls to the same method, let alone between calls to the different methods keySet() and values().
Helpfully, Map has a method entrySet() that does exactly what you need: it iterates over map contents in a way that pairs up keys and values. It is the method to use any time you need to rely on both parts of the pair.
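A minimal sketch of iterating entrySet() so each key stays paired with its value (the sample data is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class EntrySetDemo {
    static Map<String, String> capitals() {
        Map<String, String> m = new HashMap<>();
        m.put("AL", "Montgomery");
        m.put("CO", "Denver");
        return m;
    }

    public static void main(String[] args) {
        // Each Map.Entry carries a key and its value together, so the
        // pairing survives whatever iteration order HashMap chooses.
        for (Map.Entry<String, String> e : capitals().entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```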
With changes to Java licensing now in effect, people and organizations that had assumed they would probably always use Oracle's implementation of Java are now looking to alternative implementations. Relying on the unwritten details of a single implementation is extremely dangerous, even more so now than before Oracle's licensing and pricing changes.
HashMap makes no guarantees about its iteration order. It is possible in principle (that is, it is allowed by the specification) for the iteration order of the keys to change from one iteration to the next, even if the map's contents have not changed, or for the iteration order of the keys to differ from the iteration order of the corresponding values.
This is stated in the HashMap specification:
This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.
In practice, the iteration order of HashMap is stable from one iteration to the next, and even from one JVM invocation to the next, if the HashMap is initialized and populated exactly the same way. However, it is unwise for an application to depend on this. Creating a HashMap with a different initial size or load factor can influence the iteration order, even if the map is populated with the same contents. The HashMap implementation does change from time to time, and this also affects iteration order. Such changes occur even in patch or bugfix releases of the JDK. Unfortunately, history has shown that applications break when the iteration order changes. Therefore, robust applications should strive to avoid making any dependencies on HashMap iteration order.
This is pretty hard to do in practice. I'm aware of one (non public) version of the JDK that has a testing mode that randomizes iteration order of HashMaps. That potentially helps flush out such dependencies.
If you need to correlate keys and values of a HashMap while iterating it, get the entrySet() of the HashMap and iterate that. It provides map entries (key-value pairs) so the relationship between keys and values is preserved.
Alternative Map implementations in the JDK provide well-defined iteration ordering. TreeMap and ConcurrentSkipListMap order their entries based on a provided comparison method. LinkedHashMap provides iteration order based on insertion order. (It also provides a mode where iteration is by access order, which is sometimes useful, but whose behavior is often surprising.)
Note that the unmodifiable collections introduced in Java 9 (Set.of, Map.of, etc.) provide randomized iteration order. The order will differ from one run of the JVM to the next. This should help applications avoid making inadvertent dependencies on iteration order.
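Of the ordered alternatives mentioned above, LinkedHashMap is the simplest to demonstrate; this short sketch (hypothetical data) shows its insertion-order guarantee:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LinkedHashMapOrderDemo {
    // Returns the keys in iteration order; LinkedHashMap guarantees this
    // matches insertion order.
    static List<String> orderedKeys() {
        Map<String, Integer> ordered = new LinkedHashMap<>();
        ordered.put("first", 1);
        ordered.put("second", 2);
        ordered.put("third", 3);
        return new ArrayList<>(ordered.keySet());
    }

    public static void main(String[] args) {
        System.out.println(orderedKeys()); // prints [first, second, third]
    }
}
```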
My issue is that I need a HashMap which, when hashMap.get(key) is called, returns a reference to an internal LinkedList, not simply the single value that corresponds to the key.
From what I've gathered, a LinkedHashMap uses a doubly-linked list within each map entry for collision handling. What I want, however, is a reference to the overarching LinkedList that holds all values mapped to a key (each object that shares a LinkedList also shares a particular feature I'm very interested in, due to my overridden hashCode() function).
Put differently, I want to avoid any list traversal built into the map class and just get a reference to the list itself to work with.
I want this reference to be returned, while still being able to append new values to the end of the LinkedLists via linkedHashMap.put(key, value).
Any pointers would be appreciated.
LinkedHashMap just keeps its entries in a defined iteration order (a doubly-linked list runs through all of its entries). It has nothing to do with how collisions are handled.
For what you've described, I think you'll have to implement things yourself. You're basically making a Map<KeyType, List<EntryType>>, with a "put" function that appends to the associated list. It isn't too much code.
I probably wouldn't make it actually extend Map, though, because what you've described doesn't really match the interface for that.
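A minimal sketch of the suggested approach (the class and method names are hypothetical): put() appends to a per-key list created on demand, and get() returns a reference to that internal list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A tiny multimap: put() appends to the list for a key, and get()
// returns a reference to the internal list itself. Illustrative only.
public class ListValuedMap<K, V> {
    private final Map<K, List<V>> map = new HashMap<>();

    public void put(K key, V value) {
        // Create the list lazily on first use of a key, then append.
        map.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    public List<V> get(K key) {
        // Returns the internal list (or null if absent), as the question asks.
        return map.get(key);
    }
}
```

Because get() hands out the internal list, callers can mutate it directly, which is exactly what the question asks for (and why this should not pretend to implement java.util.Map).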
As we all know, in the Sun (Oracle) JDK, HashSet is implemented on top of a HashMap, to reuse that complicated algorithm and data structure.
But is it possible to implement a MyHashMap using java.util.HashSet as its backing store?
If possible, how? If not, why?
Please note that this question is only a discussion of coding skill, not applicable for production scenarios.
Trove bases its Map on its Set implementation. However, Set is missing one critical method: get().
Without a get(Element) method on HashSet, you cannot perform a lookup, which is a key function of a Map (pardon the pun). The only option Set offers is contains(), which could be hacked to perform a get(), but it would not be ideal.
You could have:
a Set where each entry holds both a key and a value;
entries defined as equal when their keys are equal;
a hacked equals() method so that when there is a match, a "put" updates the value portion of the entry and a "get" copies the value portion out.
Set could have been designed to be extended as Map, but it wasn't and it wouldn't be a good idea to use HashSet or the existing Set implementations to create a Map.
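To make the limitation concrete, here is a hedged sketch (all names hypothetical) of a map stored in a HashSet of key-value entries whose equals()/hashCode() use only the key. Because Set has no get(), lookup degrades to a linear scan:

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class SetBackedMap<K, V> {
    // An entry whose identity is defined by the key alone, so the set
    // treats two entries with the same key as duplicates.
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof Entry && Objects.equals(key, ((Entry<?, ?>) o).key);
        }
        @Override public int hashCode() { return Objects.hashCode(key); }
    }

    private final Set<Entry<K, V>> entries = new HashSet<>();

    public void put(K key, V value) {
        Entry<K, V> e = new Entry<>(key, value);
        entries.remove(e); // drop any existing entry for this key
        entries.add(e);
    }

    public V get(K key) {
        // Set has no get(), so we must scan: O(n) instead of O(1),
        // even though the set internally hashed the entry on insert.
        for (Entry<K, V> e : entries) {
            if (Objects.equals(e.key, key)) {
                return e.value;
            }
        }
        return null;
    }
}
```

The scan in get() is the crux: the backing HashSet knows exactly which bucket the entry lives in, but its public API gives us no way to retrieve the stored element, only a boolean contains().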
The question is really about objects that change dynamically while in a collection. Does the contains() method compare each of the objects individually every time, or does it do something cleverer?
If you have 10,000 entries in a collection, I would expect it to work a bit more cleverly, but I'm not sure. And if not, is there a way to optimise it by adding a hook that tells the collection to update the hash codes of the objects that have changed?
Additional Question:
Thanks for the answers below... Can I also ask what happens in the case of ArrayList? I could not find anything in the documentation that says not to put mutable objects in an ArrayList. Does that mean the search algorithm simply goes through and compares against the hashCode of each object?
They hash the object and look it up by its hash code. If it is there, it will compare the objects themselves. This is because two or more objects that have the same hash might not be the same object.
Since Java's hash collections use buckets (chaining), they have to look at all the objects in the bucket. These objects are kept in a linked list (not a java.util.LinkedList, but a custom internal list).
This is generally very efficient, and the HashSet.contains() method is amortized O(1) (constant time).
Java's docs have an answer to the second part of your question:
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
A HashSet computes the hash code of an element when it's added to the set. It stores this in a way which makes it very efficient to find all elements with the same hash code.
Then when you call contains(), it simply has to compute the hash code of the value you're looking for, and find all elements in the set with the same hash code. There may be multiple elements as hash codes aren't unique, but there are likely to be far fewer elements with matching hash codes than there are elements within the set itself. Each matching element is then checked with equals until either a match is found or we've run out of candidates.
EDIT: To answer the second part, which somehow I'd missed on first reading, you won't be able to find the element again. You mustn't change an element used as a key in a hash table or an element in a hash set in any equality-affecting manner, or you will basically break things.
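A minimal demonstration of that breakage (the Box class is hypothetical): once an element's hash-relevant state changes, the set keeps looking in the old bucket and can no longer find it.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class MutableKeyDemo {
    // A mutable class whose equals()/hashCode() depend on a mutable field.
    static final class Box {
        int value;
        Box(int value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof Box && ((Box) o).value == value;
        }
        @Override public int hashCode() { return Objects.hash(value); }
    }

    // Returns true when the set can no longer find the element after
    // its hash-relevant state changed (observed on OpenJDK; the spec
    // simply calls the behavior unspecified).
    static boolean elementIsLost() {
        Set<Box> set = new HashSet<>();
        Box box = new Box(1);
        set.add(box);  // stored in the bucket for the hash of value 1
        box.value = 2; // hash code changes while the element is in the set
        return !set.contains(box) && set.size() == 1; // stranded element
    }

    public static void main(String[] args) {
        System.out.println(elementIsLost());
    }
}
```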
The simple answer is — no, nothing clever happens. If you expect an object's state to change in a way that affects its hashCode() and equals(...) behavior, then you must not store it in a HashSet, nor any other Set. To quote from http://download.oracle.com/javase/6/docs/api/java/util/Set.html:
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
A HashSet uses a HashMap under the hood. Therefore, the contains operation uses the hashCode() method in the object to check if it's present in the hash table implemented by HashMap.
I'm looking for a constantly sorted list in Java which can also be used to retrieve an object very quickly. PriorityQueue works great for the "constantly sorted" requirement, and HashMap works great for fast retrieval by key, but I need both in the same collection. At one point I wrote my own, but it does not implement the collections interfaces (so it can't be used as a drop-in replacement for a java.util.List, etc.), and I'd rather stick to standard Java classes if possible.
Is there such a collection out there? Right now I'm using two collections, a PriorityQueue and a HashMap, both containing the same objects. I use the priority queue to traverse the first part of the list in sorted order and the hashmap for fast retrieval by key (I need to do both operations interchangeably), but I'm hoping for a more elegant solution...
Edit: I should add that the list needs to be sorted by a different comparator than what is used for retrieval by key; the list is sorted by a long value, while key retrieval uses a String.
Since you're already using HashMap, that implies that you have unique keys. Assuming that you want to order by those keys, TreeMap is your answer.
It sounds like what you're talking about is a collection with an automatically-maintained index.
Try looking at GlazedLists which use "list pipelines" to efficiently propagate changes -- their SortedList class should do the job.
edit: missed your retrieval-by-key requirement. That can be accomplished with GlazedLists.syncEventListToMap and GlazedLists.syncEventListToMultimap -- syncEventListToMap works if there are no duplicate keys, and syncEventListToMultimap works if there are duplicate keys. The nice part about this approach is that you can create multiple maps based on different indices.
If you want to use TreeMaps for indices -- which may give you better performance -- you need to keep your TreeMaps privately encapsulated within a custom class of your choosing, that exposes the interfaces/methods you want, and create accessors/mutators for that class to keep the indices in sync with the collection. Be sure to deal with concurrency issues (via synchronized methods or locks or whatever) if you access the collection from multiple threads.
edit: finally, if fast traversal of the items in sorted order is important, consider using ConcurrentSkipListMap instead of TreeMap -- not for its concurrency, but for its fast traversal. Skip lists are linked lists with multiple levels of linkage: one level that traverses all items, the next that traverses every K items on average (for a given constant K), the next that traverses every K² items on average, and so on.
TreeMap
http://download.oracle.com/javase/6/docs/api/java/util/TreeMap.html
Go with a TreeSet.
A NavigableSet implementation based on a TreeMap. The elements are ordered using their natural ordering, or by a Comparator provided at set creation time, depending on which constructor is used.
This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains).
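A minimal sketch of a TreeSet kept constantly sorted by a Comparator supplied at construction time (the data and ordering are purely illustrative):

```java
import java.util.Comparator;
import java.util.TreeSet;

public class SortedSetDemo {
    // Orders by string length, then alphabetically, to show a custom
    // Comparator passed to the constructor.
    static TreeSet<String> build() {
        TreeSet<String> sorted = new TreeSet<>(
                Comparator.comparingInt(String::length)
                          .thenComparing(Comparator.<String>naturalOrder()));
        sorted.add("pear");
        sorted.add("fig");
        sorted.add("banana");
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(build().first()); // prints fig
        System.out.println(build().last());  // prints banana
    }
}
```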
I haven't tested this so I might be wrong, so consider this just an attempt.
Use a TreeMap, and wrap the key of this map in an object with two attributes: the String you use as the key in the HashMap, and the long you use to maintain the sort order in the PriorityQueue. For this object, override the equals() and hashCode() methods using the String, and implement the Comparable interface using the long.
Why don't you encapsulate your solution in a class that implements Collection or Map?
That way you could simply delegate each retrieval method to the collection that suits it best. Just make sure that calls to write methods (add/remove/put) are forwarded to both collections, and remember indirect accesses like iterator.remove(). Most of these methods are optional to implement, but you have to deactivate the ones you don't support (Collections.unmodifiableXXX will help here in most cases).
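A hedged sketch of that delegation idea (all names are hypothetical, and it assumes sort keys are unique): a small class that keeps a HashMap for O(1) lookup by String key and a TreeMap for traversal sorted by a long sort key, with writes forwarded to both.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Keeps two synchronized indices over the same values: a HashMap for
// lookup by String key, and a TreeMap for traversal sorted by a long
// sort key. All writes must go through put()/remove() to stay in sync.
public class DualIndex<V> {
    private final Map<String, V> byKey = new HashMap<>();
    private final TreeMap<Long, V> bySortKey = new TreeMap<>();
    private final Map<String, Long> keyToSortKey = new HashMap<>();

    public void put(String key, long sortKey, V value) {
        remove(key); // drop any stale entry for this key first
        byKey.put(key, value);
        bySortKey.put(sortKey, value);
        keyToSortKey.put(key, sortKey);
    }

    public V getByKey(String key) {
        return byKey.get(key);
    }

    public Collection<V> sortedValues() {
        // TreeMap.values() iterates in ascending sort-key order.
        return bySortKey.values();
    }

    public void remove(String key) {
        Long sortKey = keyToSortKey.remove(key);
        if (sortKey != null) {
            bySortKey.remove(sortKey);
        }
        byKey.remove(key);
    }
}
```

A class like this can then grow toward the Collection or Map interface as needed, with unsupported mutators throwing UnsupportedOperationException.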