Why is it like Hash set internally only used Hash map ? Is it something related with performance?
A HashSet can be thought of as a special case of a HashMap in which you don't actually care about the type of values, only whether a value is associated with a particular key.
So, it made sense to just implement one on top of the other.
A HashMap is a good choice if your key type has a good hash function.
Similarly, TreeSet is implemented using TreeMap, which can be effective if your keys are ordered/comparable.
You can implement the Set interface in many other ways, but these are the typical ones.
Nah, it's just convenient, and not actually any less efficient on most VMs. So Java -- at least in some implementations -- doesn't bother doing anything fancier.
Since HashMap and HashSet are basicly using the same algorithm, it is simpler not to implement it twice, and therefore it isn't surprising that a few, if not all, JVM implementation do it.
It is also doable with LinkedHashMap/Set, TreeMap/Set and others.
More generally, it is possible to create any Set implementation from any Map implementation by choosing the value as being the same as the key, or be a constant.
The loss in memory storage is negligible.
By the way, the JDK provides a Collections.newSetFromMap method which does exactly this: it converts a Map<E,Boolean> into a Set<E> by mapping all keys to Boolean.TRUE. When there is no corresponding Set implementation to a Map one, this utility method is very useful, for example for ConcurrentHashMap.
The opposite, creating a Map implementation from a Set one, is also doable, though it's slightly more difficult.
Related
I need a Map sized 1 and began to wonder what would be best, a TreeMap or a HashMap?
My thought is that TreeMap would be better, as initialising and adding a value to a HashMap will cause it to create a table of 15 entries, whereas I believe TreeMapis a red-black tree implementation, which at size one will simply have a root node.
With that said, I suppose it depends on the hashCode / compareTo for the key of the HashMap / TreeMap respectively.
Ultimately, I suppose it really doesn't matter in terms of performance, I'm thinking in terms of best practice. I guess the best performance would come from a custom one entry Map implementation but that is just a bit ridiculous.
The canonical way of doing this is to use Collections.singletonMap()
Nice and simple, provided that you also require (or at least, don't mind) immutability.
And yes, internally it is implemented as a custom single-node Map.
As a complete aside, you can create a HashMap with a single bucket if in the constructor you specify a capacity of 1 and a loadFactor that is greater than the number of elements you want to put in. But memory-wise that would still be a bit of a waste as you'd have the overhead of the Entry array, the Entry object and all the other fields HashMap has (like load factor, size, resize treshold).
If you are only looking for a key to value structure you can just use SimpleEntry<K,V> class, this is basically an implementation of Map.Entry<K,V>
We know that EnumSet and EnumMap are faster than HashSet/HashMap due to the power of bit manipulation. But are we actually harnessing the true power of EnumSet/EnumMap when it really matters? If we have a set of millions of record and we want to find out if some object is present in that set or not, can we take advantage of EnumSet's speed?
I checked around but haven't found anything discussing this. Everywhere the usual stuff is found i.e. because EnumSet and EnumMap uses a predefined set of keys lookups on small collections are very fast. I know enums are compile-time constants but can we have best of both worlds - an EnumSet-like data structure without needing enums as keys?
Interesting insight; the short answer is no, but your question is exploring some good data-structure design concepts which I'll try to discuss.
First, lets talk about HashMap (HashSet uses a HashMap internally, so they share most behavior); a hash-based data structure is powerful because it is fast and general. It's fast (i.e. approximately O(1)) because we can find the key we're looking for with a very small number of computations. Roughly, we have an array of lists of keys, convert the key to an integer index into that array, then look through the associated list for the key. As the mapping gets bigger, the backing array is repeatedly resized to hold more lists. Assuming the lists are evenly distributed, this lookup is very fast. And because this works for any generic object (that has a proper .hashcode() and .equals()) it's useful for just about any application.
Enums have several interesting properties, but for the purpose of efficient lookup we only care about two of them - they're generally small, and they have a fixed number of values. Because of this, we can do better than HashMap; specifically, we can map every possible value to a unique integer, meaning we don't need to compute a hash, and we don't need to worry about hashes colliding. So EnumMap simply stores an array of the same size as the enum and looks up directly into it:
// From Java 7's EnumMap
public V get(Object key) {
return (isValidKey(key) ?
unmaskNull(vals[((Enum)key).ordinal()]) : null);
}
Stripping away some of the required Map sanity checks, it's simply:
return vals[key.ordinal()];
Note that this is conceptually no different than a standard HashMap, it's simply avoiding a few computations. EnumSet is a little more clever, using the bits in one or more longs to represent array indices, but functionally it's no different than the EnumMap case - we allocate enough slots to cover all possible enum values and can use their integer .ordinal() rather than compute a hash.
So how much faster than HashMap is EnumMap? It's clearly faster, but in truth it's not that much faster. HashMap is already a very efficient data structure, so any optimization on it will only yield marginally better results. In particular, both HashMap and EnumMap are asymptotically the same speed (O(1)), meaning as they get bigger, they behave equally well. This is the primary reason why there isn't a more general data-structure like EnumMap - because it wouldn't be worth the effort relative to HashMap.
The second reason we don't want a more general "FiniteKeysMap" is it would make our lives as users more complicated, which would be worthwhile if it were a marked speed increase, but since it's not would just be a hassle. We would have to define an interface (and probably also a factory pattern) for any type that could be a key in this map. The interface would need to provide a guarantee that every unique instance returns a unique hashcode in the range [0-n), and also provide a way for the map to get n and potentially all n elements. Those last two operations would be better as static methods, but since we can't define static methods in an interface, they'd have to either be passed in directly to every map we create, or a separate factory object with this information would have to exist and be passed to the map/set at construction. Because enums are part of the language, they get all of those benefits for free, meaning there's no such cost for end-user programmers to need to take advantage of.
Furthermore, it would be very easy to make mistakes with this interface; say you have a type that has exactly 100,000 unique values. Should it implement our interface? It could. But you'd likely actually be shooting yourself in the foot. This would eat up a lot of unnecessary memory, since our FiniteKeysMap would allocate a new 100,000 length array to represent even an empty map. Generally speaking, that sort of wasted space is not worth the marginal improvement such a data structure would provide.
In short, while your idea is possible, it is not practical. HashMap is so efficient that attempting to create a separate data structure for a very limited number of cases would add far more complexity than value.
For the specific case of faster .contains() checks, you might like Bloom Filters. It's a set-like data structure that very efficiently stores very large sets, with the condition that it may sometimes incorrectly say an element is in the set when it is not (but not the other way around - if it says an element isn't in the set, it's definitely not). Guava provides a nice BloomFilter implementation.
Assuming one needs to store a list of items, but it can be stored in any variable type; what would be the most efficient type, if used mostly for matching?
To clarify, a list of items needs to be contained, but the form it's contained in doesn't matter (enum, list, hashmap, Arraylist, etc..)
This list of items would be matched against on a regular basis, but not edited. What would the most efficient storage method be, assuming you only need to write to the list once, but could be matching multiple times per second?
Note: No multi-threading
A HashSet (and HashMap) offers O(1) complexity. Also note that you should create a large enough HashSet with small loadfactor which means that after a hashcode check the elements in the result bucket will also be found very quickly (in a bucket there is a sequential search). Optimally each bucket should contain 1 element at the most.
You can read more about the concept of capacity and load factor in the Javadoc of HashMap.
An even faster solution would be if the number of items is no more than 64 is to create an Enum for them and use EnumSet or EnumMap which stores the elements in a long and uses simple and very fast bit operations to test if an element is in the set or map (a contains operation is just a simple bitmask test).
If you choose to go with the HashSet and not with the Enum approach, know that HashSet uses the hashCode() and equals() methods of the elements. You might consider overriding them to provide a faster implementation knowing the internals of the items you wish to store.
A trivial optimization of overriding the hashCode() can be for example to cache a once computed hash code in the item itself if it doesn't change (and subsequent calls to hashCode() should just return the cached value).
From your description it seems that order doesn't matter. If this is so, use a Set. Java's standard implementation is the HashSet.
Most efficient for repeated lookup would almost certainly be an EnumSet
... Enum sets are represented internally as bit vectors. This representation is extremely compact and efficient. The space and time performance of this class should be good enough to allow its use as a high-quality, typesafe alternative to traditional int-based "bit flags." Even bulk operations (such as containsAll and retainAll) should run very quickly if their argument is also an enum set.
...
Implementation note: All basic operations execute in constant time. They are likely (though not guaranteed) to be much faster than their HashSet counterparts. Even bulk operations execute in constant time if their argument is also an enum set.
As we all known, in Sun(Oracle) JDK, HashSet is implemented backed by a HashMap, to reuse the complicated algorithm and data structure.
But, is it possible to implement a MyHashMap using java.util.HashSet as its back?
If possible, how? If not, why?
Please note that this question is only a discussion of coding skill, not applicable for production scenarios.
Trove bases it's Map on it's Set implementation. However, it has one critical method which is missing from Set which is a get() method.
Without a get(Element) method, HashSet you cannot perform a lookup which is a key function of a Map. (pardon the pun) The only option Set has is a contains which could be hacked to perform a get() but it would not be ideal.
You can have;
a Set where the Entry is a key and a value.
you define entries as being equal when the keys are the same.
you hack the equals() method so when there is a match, that on a "put" the value portion of an entry is updated, and on a "get" the value portion is copied.
Set could have been designed to be extended as Map, but it wasn't and it wouldn't be a good idea to use HashSet or the existing Set implementations to create a Map.
There are some cases that the key objects used in map do not override hashCode() and equals() from Object, for examples, use a socket Connection or java.lang.Class as keys.
Is there any potential defect to use these objects as keys in a HashMap?
Should I use IdentityHashMap in these cases?
If equals() and hashCode() are not overridden on key objects, HashMap and IdentityHashMap should have the same semantics. The default equals() implementation uses reference semantics, and the default hashCode() is the system identity hash code of the object.
This is only harmful in cases where different instances of an object can be considered logically equal. For example, you would not want to use IdentityHashMap if your keys were:
new Integer(1)
and
new Integer(1)
Since these are technically different instances of the Integer class. (You should really be using Integer.valueOf(1), but that's getting off-topic.)
Class as keys should be okay, except in very special circumstances (for example, the hibernate ORM library generates subclasses of your classes at runtime in order to implement a proxy.) As a developer I would be skeptical of code which stores Connection objects in a Map as keys (maybe you should be using a connection pool, if you are managing database connections?). Whether or not they will work depends on the implementation (since Connection is just an interface).
Also, it's important to note that HashMap expects the equals() and hashCode() determination to remain constant. In particular, if you implement some custom hashCode() which uses mutable fields on the key object, changing a key field may make the key get 'lost' in the wrong hashtable bucket of the HashMap. In these cases, you may be able to use IdentityHashMap (depending on the object and your particular use case), or you might just need a different equals()/hashCode() implementation.
From a mobile code security point of view, there are situations where using IdentityHashMap or similar becomes necessary. Malicious implementations of non-final key classes can override hashCode and equals to be malicious. They can, for instance, claim equality to different instances, acquire a reference to other instances they are compared to, etc. I suggest breaking with standard practice by staying safe and using IdentityHashMap where you want identity semantics. There rarely is a good reason to changing the meaning of equality in a subclass where the superclass is already being compared. I guess the most likely scenario is a broken, non-symmetric proxy.
The implementation of IdentityHashMap is quite different than HashMap. It uses linear probing rather than Entry objects as links in a chain. This leads to a slight reduction in the number of objects, although a pretty small difference in total memory use. I don't have any good performance statistics I can cite. There used to be a performance difference between using (non-overridden) Object.hashCode and System.identityHashCode, but that got cleared up a few years ago.
In situation you describes, the behaviors of HashMap and IdentityHashMap is identical.
On the contrast to this, if keys overrides equals() and hashCode(), behaviors of two maps are different.
see java.util.IdentityHashMap's javadoc below.
This class implements the Map interface with a hash table, using reference-equality in place of object-equality when comparing keys (and values). In other words, in an IdentityHashMap, two keys k1 and k2 are considered equal if and only if (k1==k2). (In normal Map implementations (like HashMap) two keys k1 and k2 are considered equal if and only if (k1==null ? k2==null : k1.equals(k2)).)
In summary, my answer is that:
Is there any potential defect to use these objects as keys in a HashMap?
--> No
Should I use IdentityHashMap in these cases? --> No
While there's no theoretical problem, you should avoid IdentityHashMap unless you have an explicit reason to use it. It provides no appreciable performance or other benefit in the general case, and when you inevitably start introducing objects into the map that do override equals() and hashCode(), you'll end up with subtle, hard-to-diagnose bugs.
If you think you need IdentityHashMap for performance reasons, use a profiler to confirm that suspicion before you make the switch. My guess is you'll find plenty of other opportunities for optimization that are both safer and make a bigger difference.
As far as I know, the only problem with a hashmap with bad keys, would be with very big hashmaps- your keys could be very bad and you get o(n) retrieval time, instead of o(1). If it does break anything else, I would be interested to hear about it though :)