Could anyone please tell me what the important use cases of IdentityHashMap are?
Whenever you want your keys to be compared not by equals() but by ==, you would use an IdentityHashMap. This can be very useful if you're doing a lot of reference handling, but it is limited to very special cases.
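For instance, here is a minimal sketch (imports from java.util assumed; new String is used deliberately to obtain two distinct but equal instances):

String a = new String("key");
String b = new String("key");           // equal to a, but a different object

Map<String, Integer> byEquals   = new HashMap<>();
Map<String, Integer> byIdentity = new IdentityHashMap<>();

byEquals.put(a, 1);
byEquals.put(b, 2);                     // overwrites: a.equals(b) is true
byIdentity.put(a, 1);
byIdentity.put(b, 2);                   // separate entry: a != b

System.out.println(byEquals.size());    // 1
System.out.println(byIdentity.size());  // 2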
The documentation says:

A typical use of this class is topology-preserving object graph transformations, such as serialization or deep-copying. To perform such a transformation, a program must maintain a "node table" that keeps track of all the object references that have already been processed. The node table must not equate distinct objects even if they happen to be equal. Another typical use of this class is to maintain proxy objects. For example, a debugging facility might wish to maintain a proxy object for each object in the program being debugged.
One case where you can use IdentityHashMap is when your keys are Class objects. This can be about 33% faster than HashMap for gets, and it probably uses less memory too.
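For example, a tiny per-Class registry might look like this (a sketch with invented handler strings; this is safe because Class objects are canonical within a class loader, but measure the speedup for your own workload):

// Class objects are unique per class loader, so == comparison is safe here:
Map<Class<?>, String> handlers = new IdentityHashMap<Class<?>, String>();
handlers.put(String.class, "string handler");
handlers.put(Integer.class, "integer handler");

System.out.println(handlers.get(String.class)); // string handler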
You can also use IdentityHashMap as a general-purpose map if you can make sure that the objects you use as keys are equal if and only if their references are equal.
To what gain? It will be faster and use less memory than implementations like HashMap or TreeMap.
Actually, there are quite a lot of cases where this holds. For example:
Enums. Although for enums there is an even better alternative: EnumMap.
Class objects. They are also comparable by reference.
Interned Strings. Either by specifying them as literals or calling String.intern() on them.
Cached instances. Some classes provide caching of their instances. For example quoting from the javadoc of Integer.valueOf(int):
This method will always cache values in the range -128 to 127, inclusive...
Certain libraries/frameworks will manage exactly one instance of certain types, for example Spring beans.
Singleton types. If you use instances of types that are built with the Singleton pattern, you can also be sure that (at most) one instance of each exists, and therefore a reference equality test qualifies as an equality test.
Any other type where you explicitly take care of using only the same references for accessing values that were used to put values into the map.
To demonstrate the last point:
Map<Object, String> m = new IdentityHashMap<>();

// Any keys; we keep their references
Object[] keys = { "strkey", new Object(), new Integer(1234567) };
for (int i = 0; i < keys.length; i++)
    m.put(keys[i], "Key #" + i);

// We query values from the map by the same references:
for (Object key : keys)
    System.out.println(key + ": " + m.get(key));
Output will be, as expected (because we used the same Object references to query values from the map):
strkey: Key #0
java.lang.Object@1c29bfd: Key #1
1234567: Key #2
HashMap creates an Entry object every time you add a key, which can put a lot of stress on the GC when you've got lots of objects. In a HashMap with 1,000 objects or more, you'll end up using a good portion of your CPU just having the GC clean up entries (in situations like pathfinding or other one-shot collections that are created and then cleaned up). IdentityHashMap doesn't have this problem, so it will end up being significantly faster.
See a benchmark here: http://www.javagaming.org/index.php/topic,21395.0/topicseen.html
This is from my practical experience:
IdentityHashMap has a much smaller memory footprint than HashMap for large cardinalities.
One important case is where you are dealing with reference types (as opposed to values) and you really want the correct result. Malicious objects can override hashCode and equals to get up to all sorts of mischief. Unfortunately, it's not used as often as it should be. If the interface types you are dealing with don't override hashCode and equals, you should typically go for IdentityHashMap.
Related
I was working on some algorithmic problems when I came across this, and it seemed interesting to me. If I have two lists (so two different objects) with the same values, the hashcode is the same. After some reading, I understand that this is how it should behave. For example:
List<String> lst1 = new LinkedList<>(Arrays.asList("str1", "str2"));
List<String> lst2 = new LinkedList<>(Arrays.asList("str1", "str2"));
System.out.println(lst1.hashCode() + " " + lst2.hashCode());
Result: 2640541 2640541
My purpose would be to differentiate between lst1 and lst2 in a list for example.
Is there a structure (like a HashSet for example) that takes into consideration the actual object and not only the values inside the object when calculating the hashcode for something?
Yes, you can use Java's java.util.IdentityHashMap, or Guava's identity hash set (Sets.newIdentityHashSet()).
The hash codes of the two lists must be equal, because the lists are equal. But the identity map and set above are based on the identity of the list objects, not their hash.
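For example, with the JDK alone (a sketch using lst1 and lst2 from the question; Guava's Sets.newIdentityHashSet() is the equivalent shortcut):

// An identity-based Set built on IdentityHashMap:
Set<List<String>> identitySet =
        Collections.newSetFromMap(new IdentityHashMap<List<String>, Boolean>());
identitySet.add(lst1);
identitySet.add(lst2);
System.out.println(identitySet.size()); // 2, even though lst1.equals(lst2)

Set<List<String>> plainSet = new HashSet<>(); // equals()-based, for comparison
plainSet.add(lst1);
plainSet.add(lst2);
System.out.println(plainSet.size());    // 1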
If I have two lists (so two different objects), with the same values, the hashcode is the same. After some reading, I understand that this is how it should behave.
Yes, this is part of the specification of java.util.List.
Is there a structure (like a HashSet for example) that takes into consideration the actual object and not only the values inside the object when calculating the hashcode for something?
My purpose would be to differentiate between lst1 and lst2 in a list for example
It is unclear what "in a list" means here. For example, Collection.contains() and List.equals() are defined in terms of members' equals() methods, and likewise the behavior of List.remove(Object). Although distinct objects, your two Lists will compare equal to each other, so those methods will not distinguish between them, neither directly nor as members of another list. You can always compare them for reference equality (==), however, to determine that they are not the same object despite being equals() to each other.
As far as a collection that takes members' object identity into account, you could consider java.util.IdentityHashMap. Two such maps whose keys and associated values are pairwise equals() to each other but not identical will not compare equals() to each other. Such maps will typically have different hash codes from each other, though that cannot be guaranteed. Note well, however, the warnings throughout the documentation of IdentityHashMap that although it implements the Map API, many of the behavioral details are intentionally inconsistent with the requirements of that interface.
Note also that
most of the above is relevant only for collections whose members are of a type that overrides equals() and hashCode(). The implementations of or inherited from Object differentiate between objects on a reference-equality basis, so the ordinary collections classes have no surprises for you there.
identical string literals are not required to represent distinct objects, so the lst1 and lst2 in your example code may in fact contain identical elements, in the reference equality sense.
Not generally in collections, because you generally want two collections with all the same items to be equal (which is why they implement it like this: equals returns true and the hash codes are the same).
You can subclass a list and have it not do that; it would just not be widely useful and would cause a lot of confusion if other programmers read your code. In that case, you'd want equals to return the result of == and hashCode to return an identity-based hash (which is what the implementations inherited from Object already do).
So I am working on a data structure whose operations tend to generate lots of different instances of a certain type. The datasets can be large enough to potentially yield millions of such objects. It is crucial that I "memoize" them, since there are recurring patterns in these calculations.
Typically the way memoization is done is that you simply have a set (say, a HashMap with its keys as its values) of all instances ever created. Any time an operation would return a result, instead the set is searched for an existing, identical object. If the object is found, it is returned instead, and the result we looked up with instantly becomes garbage.
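A minimal sketch of that canonicalizing lookup, with a hypothetical memoized type Foo:

Map<Foo, Foo> canonical = new HashMap<Foo, Foo>();

Foo memoize(Foo candidate) {
    Foo existing = canonical.get(candidate);  // hashes and compares by content
    if (existing != null)
        return existing;                      // candidate instantly becomes garbage
    canonical.put(candidate, candidate);      // note: hashes the key a second time
    return candidate;
}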
Now, this is straight-forward enough. However, HashMap is not a perfect fit for this use case. There are two points I will address here:
Lack of support for custom equality and hashing function
Lack of support for "equivalent key" lookup.
The first has already been discussed (question 5453226), though not sufficiently for my purposes; solutions in the form of "just wrap your key types with another object" are a non-starter. If the key is a relatively small object (e.g. an array or a string of small size) the overhead cost of these wrappers can be nearly 2X.
To illustrate the second point, let's say the data type I'd like to memoize is a (typically small) int[]. Suppose I have performed an operation and it has yielded a result, but not exactly as an int[]; rather, it consists of an int[] x and, separately, a length int length. What I would like to do now is to search the set for an array that is equal (as a sequence of ints) to the data in x[0..length-1]. In principle, there should be no problem accomplishing this, provided the user supplies equality and hashing predicates that match the ones used for the key type itself. In principle, the "compatible key" (x and length here) doesn't even have to be materialized as an object.
The lack of "compatible key" lookups may cause the program to create temporary key objects that are likely to be thrown away afterwards.
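To make that concrete, here is a sketch of the wrapper workaround; IntKey is a hypothetical wrapper, needed because int[] itself hashes and compares by identity:

final class IntKey {
    final int[] data;
    IntKey(int[] data) { this.data = data; }
    @Override public int hashCode() { return java.util.Arrays.hashCode(data); }
    @Override public boolean equals(Object o) {
        return o instanceof IntKey && java.util.Arrays.equals(data, ((IntKey) o).data);
    }
}

Map<IntKey, IntKey> table = new HashMap<IntKey, IntKey>();
// Looking up x[0..length-1] forces a copy plus a wrapper; both become garbage on a hit:
IntKey probe = new IntKey(java.util.Arrays.copyOf(x, length));
IntKey existing = table.get(probe);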
Problem 1 prevents me from using something like int[] as a key type in a Map, since it doesn't have the hashing and equality functions that I want. Further, I actually want to use shallow / reference-only comparisons and hashing on these objects once I'm past the memoization step, since then x is an identical object to y iff x == y. Here we have the same key type, but different desired equality/hashing behavior.
In C++, the functionality in (1) is offered out of the box by most hash-table based containers. (2) is given by some container types, for example the associative container templates in the Boost library.
A third function I'd be interested in (less important) is the insert_check/insert_commit idea, where we first check if a matching key exist, and we also get some implementation-defined marker back (e.g. bucket index). Then if we do create a new key and want to insert it, we give back the marker and it's inserted to the right place in the data structure. There is no reason why the hash of the same key should be computed twice.
If the compiler (or JIT) is clever enough to avoid double lookup in this scenario, that is a preferable solution - it's easier not to have to worry about it. I just never actually tested if it is.
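For what it's worth, on Java 8+ the Map API itself offers a single-lookup variant of this pattern (not the bucket-marker version described above, but it avoids the obvious double hash at the call site), reusing the hypothetical canonical map and Foo from the sketch earlier:

// One traversal: returns the existing value, or inserts candidate mapped to itself.
Foo memoized = canonical.computeIfAbsent(candidate, k -> k);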
Are there any libraries known to offer any of these capabilities in Java? Do I have to roll my own? Am I missing something?
So I read about HashMap. At one point it was noted:
"Immutability also allows caching the hashcode of different keys which makes the overall retrieval process very fast and suggest that String and various wrapper classes (e.g., Integer) provided by Java Collection API are very good HashMap keys."
I don't quite understand... why?
String#hashCode:
private int hash;
...
public int hashCode() {
    int h = hash;
    if (h == 0 && count > 0) {
        int off = offset;
        char val[] = value;
        int len = count;

        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}
Since the contents of a String never change, the makers of the class chose to cache the hash after it had been calculated once. This way, time is not wasted recalculating the same value.
Quoting the linked blog entry:
final object with proper equals () and hashcode () implementation would act as perfect Java HashMap keys and improve performance of Java hashMap by reducing collision.
I fail to see how both final and equals() have anything to do with hash collisions. This sentence raises my suspicion about the credibility of the article. It seems to be a collection of dogmatic Java "wisdoms".
Immutability also allows caching there hashcode of different keys which makes overall retrieval process very fast and suggest that String and various wrapper classes e.g Integer provided by Java Collection API are very good HashMap keys.
I see two possible interpretations of this sentence, both of which are wrong:
HashMap caches hash codes of immutable objects. This is not correct: the map has no way to find out whether an object is "immutable".
Immutability is required for an object to cache its own hash code. Ideally, an object's hash value should always just rely on non-mutating state of the object, otherwise the object couldn't be sensibly used as a key. So in this case, too, the author fails to make a point: If we assume that our object is not changing its state, we also don't have to recompute the hash value every time, even if our object is mutable!
Example
So if we are really crazy and actually decide to use a List as a key for a HashMap and make the hash value dependent on the contents, rather than the identity of the list, we could just decide to invalidate the cached hash value on every modification, thus limiting the number of hash computations to the number of modifications to the list.
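A sketch of that idea (hypothetical, imports from java.util assumed; a real version would have to intercept every mutator, not just add):

class HashCachingList<E> extends ArrayList<E> {
    private int cachedHash;                 // 0 doubles as "not yet computed", as in String

    @Override public boolean add(E e) {
        cachedHash = 0;                     // invalidate the cache on modification
        return super.add(e);
    }

    @Override public int hashCode() {
        int h = cachedHash;
        if (h == 0) {
            h = super.hashCode();           // content-based, per the List contract
            cachedHash = h;
        }
        return h;
    }
}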
It's very simple. Since an immutable object doesn't change over time, it only needs to perform the calculation of the hash code once. Calculating it again will yield the same value. Therefore it is common to calculate the hash code in the constructor (or lazily) and store it in a field. The hashcode function then returns just the value of the field, which is indeed very fast.
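For example, a sketch of the eager variant (the String code quoted earlier does the same thing lazily); Point is an invented example class:

final class Point {
    private final int x;
    private final int y;
    private final int hash;                 // computed once, in the constructor

    Point(int x, int y) {
        this.x = x;
        this.y = y;
        this.hash = 31 * x + y;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return p.x == x && p.y == y;
    }

    @Override public int hashCode() {
        return hash;                        // just a field read
    }
}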
Basically, immutability in Java is achieved by making the class non-extendable and by having all operations on the object leave its state unchanged. Look at an operation of String like replace(): it does not change the state of the object you are manipulating; instead, it gives you a new String object containing the replaced string. So if you use such objects as keys, the state never changes, and hence the hash code remains unchanged as well, which makes caching the hash code effective for retrieval performance.
Think of the hashmap as a big array of numbered boxes. The number is the hashcode, and the boxes are ordered by number.
Now if the object can't change, the hash function will always reproduce the same value, so the object will always stay in its box.
Now suppose a changeable object. It is changed after being added to the hash, so now it is sitting in the wrong box, like a Mrs. Jones who happened to marry a Mister Doe and is now named Doe too, but is still listed as Jones in many registers.
Immutable classes are unmodifiable; that's why they are used as keys in a Map.
For example:
StringBuilder key1 = new StringBuilder("K1");
StringBuilder key2 = new StringBuilder("K2");

Map<StringBuilder, String> map = new HashMap<>();
map.put(key1, "Hello");
map.put(key2, "World");

key1.append("00");
System.out.println(map); // This line prints: {K100=Hello, K2=World}
You can see that the logical key K1 (an object of the mutable class StringBuilder) inserted into the map is gone after an inadvertent change to it. (Note that because StringBuilder does not override hashCode()/equals(), the entry here is still reachable through the key1 reference; with a key class whose hashCode() depends on its mutable contents, the entry would land in the wrong bucket and become unfindable altogether.) This can't happen if you use immutable classes as keys for the Map family members.
Hash tables will only work if the hash code of an object can never change while it is stored in the table. This implies that the hash code cannot take into account any aspect of the object which could change while it's in the table. If the most interesting aspects of an object are mutable, that implies that either:
The hash code will have to ignore most of the interesting aspects of the object, thus causing many hash collisions, or...
The code which owns the hash table will have to ensure that the objects therein are not exposed to anything that might change them while they are stored in the hash table.
If Java hash tables allowed clients to supply an EqualityComparer (the way .NET dictionaries do), code which knows that certain aspects of the objects in a hash table won't unexpectedly change could use a hash code which took those aspects into account, but the only way to accomplish that in Java would be to wrap each item stored in the hashcode in a wrapper. Such wrapping may not be the most evil thing in the world, however, since the wrapper would be able to cache hash values in a way which an EqualityComparer could not, and could also cache further equality-related information [e.g. if the things being stored were nested collections, it might be worthwhile to compute multiple hash codes, and confirm that all hash codes match before doing any detailed inspection of the elements].
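Such a wrapper might look like this (a sketch; the Comparer interface is invented here, loosely mirroring .NET's EqualityComparer):

interface Comparer<T> {
    boolean equate(T a, T b);
    int hash(T t);
}

final class Keyed<T> {
    final T value;
    private final Comparer<T> cmp;
    private final int hash;                 // cached once, at wrap time

    Keyed(T value, Comparer<T> cmp) {
        this.value = value;
        this.cmp = cmp;
        this.hash = cmp.hash(value);
    }

    @SuppressWarnings("unchecked")
    @Override public boolean equals(Object o) {
        return o instanceof Keyed && cmp.equate(value, ((Keyed<T>) o).value);
    }

    @Override public int hashCode() { return hash; }
}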
In one of my Java 6 projects I have an array of LinkedHashMap instances as input to a method which has to iterate through all keys (i.e. through the union of the key sets of all maps) and work with the associated values. Not all keys exist in all maps and the method should not go through each key more than once or alter the input maps.
My current implementation looks like this:
Set<Object> keyset = new HashSet<Object>();

for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
        if (keyset.add(key)) {
            ...
        }
    }
}
The HashSet instance ensures that no key will be acted upon more than once.
Unfortunately this part of the code is rather critical performance-wise, as it is called very frequently. In fact, according to the profiler over 10% of the CPU time is spent in the HashSet.add() method.
I am trying to optimise this code as much as possible. The use of LinkedHashMap with its more efficient iterators (in comparison to the plain HashMap) was a significant boost, but I was hoping to reduce what is essentially book-keeping time to a minimum.
Putting all the keys in the HashSet beforehand, by using addAll(), proved to be less efficient due to the cost of calling HashSet.contains() afterwards.
At the moment I am looking at whether I can use a bitmap (well, a boolean[] to be exact) to avoid the HashSet completely, but it may not be possible at all, depending on my key range.
Is there a more efficient way to do this? Preferably something that will not pose restrictions on the keys?
EDIT:
A few clarifications and comments:
I do need all the values from the maps - I cannot drop any of them.
I also need to know which map each value came from. The missing part (...) in my code would be something like this:
for (Map<Object, Object> m : input) {
    Object v = m.get(key);
    // Do something with v
}
A simple example to get an idea of what I need to do with the maps would be to print all maps in parallel like this:
Key  Map0  Map1  Map2
F    1     null  2
B    2     3     null
C    null  null  5
...
That's not what I am actually doing, but you should get the idea.
The input maps are extremely variable. In fact, each call of this method uses a different set of them. Therefore I would not gain anything by caching the union of their keys.
My keys are all String instances. They are sort-of-interned on the heap using a separate HashMap, since they are pretty repetitive, therefore their hash code is already cached and most hash validations (when the HashMap implementation is checking whether two keys are actually equal, after their hash codes match) boil down to an identity comparison (==). The profiler confirms that only 0.5% of the CPU time is spent on String.equals() and String.hashCode().
EDIT 2:
Based on the suggestions in the answers, I made a few tests, profiling and benchmarking along the way. I ended up with roughly a 7% increase in performance. What I did:
I set the initial capacity of the HashSet to double the collective size of all input maps. This gained me something in the region of 1-2%, by eliminating most (all?) resize() calls in the HashSet.
I used Map.entrySet() for the map I am currently iterating. I had originally avoided this approach due to the additional code and the fear that the extra checks and Map.Entry getter method calls would outweigh any advantages. It turned out that the overall code was slightly faster.
I am sure that some people will start screaming at me, but here it is: Raw types. More specifically I used the raw form of HashSet in the code above. Since I was already using Object as its content type, I do not lose any type safety. The cost of that useless checkcast operation when calling HashSet.add() was apparently important enough to produce a 4% increase in performance when removed. Why the JVM insists on checking casts to Object is beyond me...
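Putting those three changes together, the optimized loop would look roughly like this (a sketch; totalSize stands for the collective size of the input maps):

// Raw, presized HashSet: no checkcast on add(), and (almost) no resize() calls.
Set keyset = new HashSet(totalSize * 2);           // unchecked warning, by design

for (Map<Object, Object> map : input) {
    for (Map.Entry<Object, Object> entry : map.entrySet()) {
        if (keyset.add(entry.getKey())) {
            // ... use entry.getValue() here and fetch the rest from the other maps
        }
    }
}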
I can't provide a replacement for your approach, but here are a few suggestions to (slightly) optimize the existing code.
Consider initializing the hash set with an appropriate capacity (the sum of the sizes of all input maps). This avoids/reduces resizing of the set during add operations.
Consider not using keySet() followed by get(). Iterating entrySet() instead avoids one lookup per key and should be faster.
Have a look at the implementations of equals() and hashCode(): if they are "expensive", then they have a negative impact on the add method.
How you avoid using a HashSet depends on what you are doing.
I would only calculate the union once each time the input is changed. This should be relatively rare compared with the number of lookups.
// on an update:
Map<Key, Value> union = new LinkedHashMap<Key, Value>();
for (Map<Key, Value> map : input)
    union.putAll(map);

// on a lookup:
Value value = union.get(key);

// process each key once:
for (Map.Entry<Key, Value> entry : union.entrySet()) {
    // do something.
}
Option A is to use the .values() method and iterate through it. But I suppose you had already thought of that.
If the code is called so often, then it might be worth creating additional structures (depending on how often the data is changed). Create a new HashMap: every key in any of your hashmaps becomes a key in this one, and its value is the list of the HashMaps in which that key appears.
This will help if the data is somewhat static (relative to the frequency of queries), so that the overhead of managing the structure stays relatively small, and if the key space is not very dense (keys do not repeat themselves a lot across different HashMaps), as it will save a lot of unneeded contains() calls.
Of course, if you are mixing data structures it is better if you encapsulate everything in your own data structure.
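A sketch of such an auxiliary structure (names invented, Java 6 style to match the question):

// key -> the maps in which that key appears
Map<Object, List<Map<Object, Object>>> index =
        new HashMap<Object, List<Map<Object, Object>>>();

void register(Map<Object, Object> map) {
    for (Object key : map.keySet()) {
        List<Map<Object, Object>> maps = index.get(key);
        if (maps == null) {
            maps = new ArrayList<Map<Object, Object>>();
            index.put(key, maps);
        }
        maps.add(map);
    }
}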
You could take a look at Guava's Sets.union() http://guava-libraries.googlecode.com/svn/tags/release04/javadoc/com/google/common/collect/Sets.html#union(java.util.Set,%20java.util.Set)
There are some cases where the key objects used in a map do not override hashCode() and equals() from Object, for example, when using a socket Connection or java.lang.Class as keys.
Is there any potential defect to use these objects as keys in a HashMap?
Should I use IdentityHashMap in these cases?
If equals() and hashCode() are not overridden on key objects, HashMap and IdentityHashMap should have the same semantics. The default equals() implementation uses reference semantics, and the default hashCode() is the system identity hash code of the object.
This is only harmful in cases where different instances of an object can be considered logically equal. For example, you would not want to use IdentityHashMap if your keys were:
new Integer(1)
and
new Integer(1)
Since these are technically different instances of the Integer class. (You should really be using Integer.valueOf(1), but that's getting off-topic.)
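A quick demonstration of the difference (new Integer(int) is used deliberately here to obtain distinct instances; it is deprecated in modern Java):

Map<Integer, String> hm  = new HashMap<Integer, String>();
Map<Integer, String> ihm = new IdentityHashMap<Integer, String>();

hm.put(new Integer(1), "a");
hm.put(new Integer(1), "b");    // replaces "a": the keys are equals()
ihm.put(new Integer(1), "a");
ihm.put(new Integer(1), "b");   // adds a second entry: the keys are != each other

System.out.println(hm.size());  // 1
System.out.println(ihm.size()); // 2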
Class as keys should be okay, except in very special circumstances (for example, the Hibernate ORM library generates subclasses of your classes at runtime in order to implement a proxy). As a developer I would be skeptical of code which stores Connection objects in a Map as keys (maybe you should be using a connection pool, if you are managing database connections?). Whether or not they will work depends on the implementation (since Connection is just an interface).
Also, it's important to note that HashMap expects the equals() and hashCode() determination to remain constant. In particular, if you implement some custom hashCode() which uses mutable fields on the key object, changing a key field may make the key get 'lost' in the wrong hashtable bucket of the HashMap. In these cases, you may be able to use IdentityHashMap (depending on the object and your particular use case), or you might just need a different equals()/hashCode() implementation.
From a mobile code security point of view, there are situations where using IdentityHashMap or similar becomes necessary. Malicious implementations of non-final key classes can override hashCode and equals to be malicious. They can, for instance, claim equality to different instances, acquire a reference to other instances they are compared to, etc. I suggest breaking with standard practice by staying safe and using IdentityHashMap where you want identity semantics. There rarely is a good reason to change the meaning of equality in a subclass where the superclass is already being compared. I guess the most likely scenario is a broken, non-symmetric proxy.
The implementation of IdentityHashMap is quite different from that of HashMap. It uses linear probing rather than Entry objects that serve as links in a chain. This leads to a slight reduction in the number of objects, although a pretty small difference in total memory use. I don't have any good performance statistics I can cite. There used to be a performance difference between using (non-overridden) Object.hashCode and System.identityHashCode, but that was cleared up a few years ago.
In the situation you describe, the behavior of HashMap and IdentityHashMap is identical.
In contrast, if the keys override equals() and hashCode(), the behavior of the two maps differs.
See java.util.IdentityHashMap's javadoc:
This class implements the Map interface with a hash table, using reference-equality in place of object-equality when comparing keys (and values). In other words, in an IdentityHashMap, two keys k1 and k2 are considered equal if and only if (k1==k2). (In normal Map implementations (like HashMap) two keys k1 and k2 are considered equal if and only if (k1==null ? k2==null : k1.equals(k2)).)
In summary, my answer is that:
Is there any potential defect to use these objects as keys in a HashMap?
--> No
Should I use IdentityHashMap in these cases? --> No
While there's no theoretical problem, you should avoid IdentityHashMap unless you have an explicit reason to use it. It provides no appreciable performance or other benefit in the general case, and when you inevitably start introducing objects into the map that do override equals() and hashCode(), you'll end up with subtle, hard-to-diagnose bugs.
If you think you need IdentityHashMap for performance reasons, use a profiler to confirm that suspicion before you make the switch. My guess is you'll find plenty of other opportunities for optimization that are both safer and make a bigger difference.
As far as I know, the only problem with a hashmap with bad keys would show up with very big hashmaps: your keys could be so bad that you get O(n) retrieval time instead of O(1). If it does break anything else, I would be interested to hear about it though :)