So I read about HashMap. At one point it was noted:
"Immutability also allows caching the hashcode of different keys which makes the overall retrieval process very fast and suggest that String and various wrapper classes (e.g., Integer) provided by Java Collection API are very good HashMap keys."
I don't quite understand... why?
String#hashCode:
private int hash;
...

public int hashCode() {
    int h = hash;
    if (h == 0 && count > 0) {
        int off = offset;
        char val[] = value;
        int len = count;

        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}
Since the contents of a String never change, the makers of the class chose to cache the hash after it had been calculated once. This way, time is not wasted recalculating the same value.
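The same lazy-caching trick works for any immutable class. Here is a sketch (ImmutablePoint is a made-up example, not from the post) that mirrors what String does:

```java
// Sketch of lazy hash caching in a hypothetical immutable class, mimicking
// String: the hash field starts at 0 ("not yet computed") and is filled in
// at most once. The benign race on the field is the same trick String uses.
final class ImmutablePoint {
    private final int x;
    private final int y;
    private int hash; // 0 means "not yet computed"

    ImmutablePoint(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ImmutablePoint)) return false;
        ImmutablePoint p = (ImmutablePoint) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {          // compute only on the first call
            h = 31 * x + y;
            hash = h;
        }
        return h;              // every later call is just a field read
    }
}
```

Because the fields are final, the cached value can never go stale.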
Quoting the linked blog entry:
final object with proper equals () and hashcode () implementation would act as perfect Java HashMap keys and improve performance of Java hashMap by reducing collision.
I fail to see how both final and equals() have anything to do with hash collisions. This sentence raises my suspicion about the credibility of the article. It seems to be a collection of dogmatic Java "wisdoms".
Immutability also allows caching there hashcode of different keys which makes overall retrieval process very fast and suggest that String and various wrapper classes e.g Integer provided by Java Collection API are very good HashMap keys.
I see two possible interpretations of this sentence, both of which are wrong:
HashMap caches hash codes of immutable objects. This is not correct: the map has no way to find out whether an object is "immutable".
Immutability is required for an object to cache its own hash code. Ideally, an object's hash value should always just rely on non-mutating state of the object, otherwise the object couldn't be sensibly used as a key. So in this case, too, the author fails to make a point: If we assume that our object is not changing its state, we also don't have to recompute the hash value every time, even if our object is mutable!
Example
So if we are really crazy and actually decide to use a List as a key for a HashMap, making the hash value depend on the contents rather than on the identity of the list, we could simply invalidate the cached hash value on every modification. This limits the number of hash computations to the number of modifications to the list.
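A minimal sketch of that idea (the wrapper class HashCachingList and its method names are hypothetical, not from the original post): every mutation clears the cached hash, so hashCode() recomputes at most once per modification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical list wrapper that caches its content-based hash and
// invalidates the cache on every modification.
class HashCachingList<E> {
    private final List<E> data = new ArrayList<>();
    private Integer cachedHash; // null means "cache invalid"

    void add(E e)    { data.add(e);    cachedHash = null; } // mutation invalidates
    void remove(E e) { data.remove(e); cachedHash = null; }

    @Override
    public int hashCode() {
        if (cachedHash == null) {
            cachedHash = data.hashCode(); // recomputed at most once per modification
        }
        return cachedHash;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof HashCachingList
                && data.equals(((HashCachingList<?>) o).data);
    }
}
```

The usual warning still applies: such a key must not be mutated while it sits in a map.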
It's very simple. Since an immutable object doesn't change over time, it only needs to perform the calculation of the hash code once. Calculating it again will yield the same value. Therefore it is common to calculate the hash code in the constructor (or lazily) and store it in a field. The hashcode function then returns just the value of the field, which is indeed very fast.
Basically, immutability is achieved in Java by making the class non-extendable and ensuring that no operation changes the object's state. Consider String's operations, such as replace(): it does not modify the state of the object you are manipulating; instead it returns a new String object with the replacement applied. So if you use such objects as keys, the state never changes, and hence the hash code remains unchanged as well. Caching the hash code is therefore effective for retrieval performance.
Think of the HashMap as a big array of numbered boxes. The number is the hash code, and the boxes are ordered by number.
Now, if the object can't change, the hash function will always reproduce the same value. Therefore the object will always stay in its box.
Now suppose a changeable object. It is changed after being added to the map, so now it is sitting in the wrong box: like a Mrs. Jones who happened to marry Mister Doe and is now named Doe too, but is still listed as Jones in many registers.
Immutable classes are unmodifiable, that's why those are used as keys in a Map.
For example:
StringBuilder key1=new StringBuilder("K1");
StringBuilder key2=new StringBuilder("K2");
Map<StringBuilder, String> map = new HashMap<>();
map.put(key1, "Hello");
map.put(key2, "World");
key1.append("00");
System.out.println(map); // This line prints - {K100=Hello, K2=World}
You see, the key K1 (an object of the mutable class StringBuilder) that was inserted into the map has changed out from under the map due to an inadvertent modification. This won't happen if you use immutable classes as keys for the Map family of collections.
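One caveat worth noting: StringBuilder inherits the identity-based hashCode from Object, so map.get(key1) in the snippet above would actually still succeed for the same reference. A key with a value-based hashCode, such as an ArrayList, shows the entry genuinely becoming unreachable. A sketch (not from the original answer):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MutableKeyDemo {
    public static void main(String[] args) {
        List<Integer> key = new ArrayList<>(Arrays.asList(1, 2));
        Map<List<Integer>, String> map = new HashMap<>();
        map.put(key, "Hello");
        System.out.println(map.containsKey(key)); // prints "true"

        key.add(3); // mutating the key changes its value-based hashCode

        // The entry is still in the map, but the lookup now probes the
        // wrong bucket, so the entry can no longer be found by its key:
        System.out.println(map.containsKey(key)); // prints "false"
        System.out.println(map.size());           // prints "1"
    }
}
```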
Hash tables will only work if the hash code of an object can never change while it is stored in the table. This implies that the hash code cannot take into account any aspect of the object which could change while it's in the table. If the most interesting aspects of an object are mutable, that implies that either:
The hash code will have to ignore most of the interesting aspects of the object, thus causing many hash collisions, or...
The code which owns the hash table will have to ensure that the objects therein are not exposed to anything that might change them while they are stored in the hash table.
If Java hash tables allowed clients to supply an EqualityComparer (the way .NET dictionaries do), code which knows that certain aspects of the objects in a hash table won't unexpectedly change could use a hash code that takes those aspects into account. The only way to accomplish that in Java is to wrap each item stored in the hash table in a wrapper. Such wrapping may not be the most evil thing in the world, however, since the wrapper can cache hash values in a way an EqualityComparer could not, and can also cache further equality-related information (e.g. if the things being stored were nested collections, it might be worthwhile to compute multiple hash codes and confirm that all of them match before doing any detailed inspection of the elements).
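A sketch of such a wrapper (all names here, HashKey included, are hypothetical): it pairs an item with externally supplied hash and equality functions, standing in for .NET's EqualityComparer, and caches the computed hash value once at wrapping time.

```java
import java.util.function.BiPredicate;
import java.util.function.ToIntFunction;

// Hypothetical wrapper: external hash/equality functions play the role of
// an EqualityComparer, and the hash is computed once and cached.
final class HashKey<T> {
    private final T item;
    private final BiPredicate<T, T> equality;
    private final int cachedHash; // computed once, on wrapping

    HashKey(T item, ToIntFunction<T> hasher, BiPredicate<T, T> equality) {
        this.item = item;
        this.equality = equality;
        this.cachedHash = hasher.applyAsInt(item);
    }

    @Override
    public int hashCode() { return cachedHash; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof HashKey)) return false;
        @SuppressWarnings("unchecked")
        HashKey<T> other = (HashKey<T>) o;
        return equality.test(item, other.item);
    }
}
```

Usage would look like `new HashKey<>(someList, List::hashCode, List::equals)`; the caller is responsible for not mutating the wrapped item afterwards.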
Related
In Java, HashMap and Hashtable both implement the Map interface and store key/value pairs using a hash function and an array/linked-list implementation. In C, a hash table can also be implemented with array/linked-list functionality, but there is no built-in concept of key/value pairs like Map.
So my question is: is a hash table implementation in C similar to Hashtable in Java, or is it closer to HashSet in Java (apart from the unique-elements-only condition)?
Both semantics (Hashtable and HashSet) can be implemented in C, but neither comes in the standard C library. You can find many different hash table implementations on the Internet, each with its own advantages and drawbacks. Implementing this yourself may prove difficult, as there are many traps and pitfalls.
I previously used BSD's red-black tree implementation. It's relatively easy to use once you start to understand how it works.
The really great thing about it is that you only have to copy one header file and then include that where it's needed; there is no need to link to libraries.
It has functionality similar to HashSet's: you can find elements by key with the RB_FIND() macro, enumerate elements with RB_FOREACH(), insert new ones with RB_INSERT(), and so on.
You can find more info in its man page or in the source code itself.
The difference (in Java) between a Hashtable and a HashSet lies in which object is used to calculate the hash value. In a HashSet the key is the stored instance itself, and hashCode() is applied to the complete instance (Object provides both the hashCode() and equals(Object) methods). With an external key, equals(Object) and hashCode() come from the separate key instance instead of from the stored data value. For that reason the two structures are closely related: a map can be viewed as a set of entries, each exposing an implementation of the Map.Entry<K,V> interface. (In the JDK the derivation actually runs the other way around: HashSet is implemented on top of HashMap.)
Implementing a hash table in C is not too difficult, but you first need to understand what the key is (if external), the differences between the key and the value, the difference between calculating a hash code and comparing for equality, how you will distinguish the key from the value, and how you will manage keys and hashes internally in order to handle collisions.
I recently started an implementation of a hash table in C (not yet finished). My hash_table constructor stores in the instance record a pointer to an equality-comparison routine (much as Java relies on an object's equals(Object) method); this lets you detect collisions (two entries with the same hash that compare as different). It also stores the hash function applied to keys. In my implementation I will probably store the full hash value returned by the hash function (before fitting it to the table's size), so that the table can grow and the elements can be re-placed in the new table without recalculating all the hashes.
I don't know if these hints are of help to you, but that's my two cents. :)
My implementation uses a table of prime numbers to select the capacity (approximately doubling the size at each step) and resizes the table when the number of collisions becomes unacceptable (whatever that means to you; I have no clear idea yet). Resizing is a time-consuming operation, but it happens rarely, so it must be specified carefully; it is also something Java's Hashtable does. If you don't plan to grow your hash table, or plan to grow it manually, the implementation is easier: just add the number of entries to the constructor's parameter list, and perhaps provide a grow_hash_table(new_cap) method.
Imagine the following problem:
// Class PhoneNumber implements hashCode() and equals()
PhoneNumber obj = new PhoneNumber("mgm", "089/358680");
System.out.println("Hashcode: " + obj.hashCode()); // prints "1476725853"

// Add the PhoneNumber object to a HashSet
Set<PhoneNumber> set = new HashSet<>();
set.add(obj);

// Modify the object after it has been inserted
obj.setNumber("089/358680-0");

// The modification causes a different hash value
System.out.println("New hashcode: " + obj.hashCode()); // prints "7130851"

// ... Later, or in another class, code such as the following
// operates on the Set:

// Unexpected result!
// Output: obj is set member: FALSE
System.out.println("obj is set member: " + set.contains(obj));
If I've got a class and I want all my fields to be editable while still being able to use a set / hashCode, would it be a good idea to create an artificial, unmodifiable field that is set at creation of the object, for example the current time in ms? With that field I could base the hash code upon it and would still be able to edit all the "real" fields. Would this be a good idea?
I strongly believe you are presenting a bad use case: if you need to modify an object in a Set, you should definitely remove the old one and re-add the modified one (or use another java.util.Collection). Taking from your example:
Set<PhoneNumber> set = new HashSet<>();
set.add(obj);
// Modify object after it has been inserted
set.remove(obj);
obj.setNumber("089/358680-0");
set.add(obj);
The whole purpose of hashCode is to create buckets of similar objects to reduce the search space; therefore it should be immutable but still useful to you. If you use an artificial field, how do you find the object in your set later on? How do you retrieve that artificial field, given that you have no persistent storage of any kind? (The id in a database is an exception to my reservations about artificial fields, IMHO.)
To explain the meaning of
The whole purpose of hashCode is to create a bucket of similar
objects to reduce the search space
have a look at this sample code: http://ideone.com/MJ2MQT. I created two objects that (wrongly) share the same hash code, then added both to a set; as expected, the set contains both of them, because the hash code is only used to locate the elements that collide, and then the equals method is called to resolve the collision. Collisions (different objects returning the same hash code) are unavoidable, and the goal of a properly designed hash code function is to reduce them as much as possible.
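Since the ideone link may rot, here is an equivalent sketch (the class names are mine, not from the linked code): two distinct objects deliberately share a hash code, yet the set keeps both, because equals() resolves the collision.

```java
import java.util.HashSet;
import java.util.Set;

public class CollisionDemo {
    // A deliberately bad hashCode: every instance lands in the same bucket.
    static final class CollidingKey {
        private final String name;
        CollidingKey(String name) { this.name = name; }

        @Override public int hashCode() { return 42; } // same for everyone

        @Override public boolean equals(Object o) {
            return o instanceof CollidingKey
                    && name.equals(((CollidingKey) o).name);
        }
    }

    public static void main(String[] args) {
        Set<CollidingKey> set = new HashSet<>();
        set.add(new CollidingKey("a"));
        set.add(new CollidingKey("b"));
        // Both survive: the colliding hash only picks the bucket,
        // equals() tells the two objects apart within it.
        System.out.println(set.size()); // prints "2"
    }
}
```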
Storing mutable objects in a hash set, or using them as keys in a hash map, is definitely not a good idea, precisely for the reason that you illustrate in your code.
On the other hand, defining an artificial number that serves as an ID of an object defeats the purpose of having a hash code in the first place, because it does not help you find an object that is equal to a given object by limiting the search to objects with identical hash codes.
In fact, your solution is not different from constructing a Map<Integer,PhoneNumber> from an "artificial hash code" to your mutable PhoneNumber object. If finding objects by association is what you need, HashMap from an artificial ID to the mutable object is the way to go.
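A minimal sketch of that approach, with PhoneNumber reduced to a hypothetical stand-in: the map hashes the immutable Integer ID, so the mutable value can change freely without breaking retrieval.

```java
import java.util.HashMap;
import java.util.Map;

public class ArtificialIdDemo {
    // Minimal stand-in for the PhoneNumber class from the question.
    static final class PhoneNumber {
        String number;
        PhoneNumber(String number) { this.number = number; }
    }

    public static void main(String[] args) {
        // The immutable Integer key is hashed, not the mutable value state.
        Map<Integer, PhoneNumber> byId = new HashMap<>();
        byId.put(1, new PhoneNumber("089/358680"));

        byId.get(1).number = "089/358680-0"; // mutate the value freely...
        // ...the Integer key still finds the entry:
        System.out.println(byId.get(1).number); // prints "089/358680-0"
    }
}
```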
It usually makes sense to have a unique identifier for your data objects, especially if you are persisting them in some database. It will allow you to have an easy implementation of equals and hashCode, which will only depend on this single identifier.
I'm not sure the current time in ms. will be the best choice, but you should definitely generate some unique ID.
I understand that keys in a HashMap need to be immutable, or at least ensure that their hash code (hashCode()) will not change or conflict with another object with different state.
However, do the values stored in HashMap need to be the same as above*? Why or why not?
* The idea is to be able to mutate the values (such as calling setters on it) without affecting the ability to retrieve them later from the HashMap using immutable keys. If values are mutated, can that break their association to the keys?
My question is mainly concerning Java, however, answers for other languages are welcome.
No. In general, the characteristics of a hash map data structure do not depend upon the value. Having keys be immutable is important because the underlying data structures are built using the hash of the key at the time of insertion. Those structures are designed to provide certain properties (relatively fast lookup, fast insertion and removal, etc.), all based on this hash. If the hash were to change, those properties would be invalidated. If you need to "modify" a key, the general approach is to remove the entry under the old key and re-insert it under the new key.
The values of a HashMap do not need to be immutable. The map generally does not care what happens to the value (unless it is a ConcurrentHashMap), only where in memory that value is located.
In the case of ConcurrentHashMap, the mutability of the values is not affected, but it would be overly broad to say that the map does not "care" what happens to them. Even though the map supports concurrent updates, the values it points to can be manipulated with no effect on the immutable keys.
RE: If values are mutated, can that break their association to the keys?
No.
A Map returns an object reference given a key. That key will always point to the same object reference. Changing the object in some way (e.g. changing its instance variables) will not affect the ability to retrieve it.
Not really. Having a mutable key will cause significant issues, but what values you put in does not really matter. If the value is a mutable object and someone modifies it, then of course the stored value reflects the change as well (but that has nothing to do with HashMap).
No, values do not need to be immutable, but immutability can be very good practice. This of course depends on your use case.
Here is a use case where immutability was important. I recently ran into a bug because of this: an entry was put in a cache (backed by a HashMap); later, this entry was retrieved and altered. Because the value was mutable (i.e. allowed changes), the next retrieval of the entry still carried the edits made by the previous retriever. This was a problem because in my use case the cached data was not supposed to change.
Consider this example:
class Foo {
    int a;
    public Foo(int a) { this.a = a; }
    public void setA(int x) { this.a = x; }
}

Map<String, Foo> data = getFooMap(); // assume this returns an empty map
Foo foo = new Foo(17);
data.put("entry1", foo);

Foo entry1 = data.get("entry1");
System.out.println(entry1.a); // prints "17"
entry1.setA(18);
...
Foo sameEntry = data.get("entry1");
System.out.println(sameEntry.a); // prints "18"
Why do unique hash codes make hash-based collections work faster? And why should a hashCode not be mutable?
I read about it here, but didn't understand it, so I read some other resources and ended up with this question.
Thanks.
Hashcodes don't have to be unique, but they work better if distinct objects have distinct hashcodes.
A common use for hashcodes is for storing and looking objects in data structures like HashMap. These collections store objects in "buckets" and the hashcode of the object being stored is used to determine which bucket it's stored in. This speeds up retrieval. When looking for an object, instead of having to look through all of the objects, the HashMap uses the hashcode to determine which bucket to look in, and it looks only in that bucket.
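As a concrete illustration of the bucket choice, here is a sketch mirroring what OpenJDK's HashMap does internally (an implementation detail, not part of the API contract, and subject to change): the hash is "spread" and then masked with the power-of-two table length.

```java
public class BucketIndex {
    // Fold the high bits of the hash into the low bits, as OpenJDK's
    // HashMap does, so that small tables still use the whole hash.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // tableLength is always a power of two, so (tableLength - 1) is a
    // bit mask that keeps only the low bits of the spread hash.
    static int bucketIndex(Object key, int tableLength) {
        return (tableLength - 1) & spread(key.hashCode());
    }

    public static void main(String[] args) {
        // Stable across calls while the key's hashCode is stable:
        System.out.println(bucketIndex("hello", 16));
    }
}
```

This is why a changed hashCode sends the lookup to a different bucket than the one the entry was stored in.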
You asked about mutability. I think what you're asking about is the requirement that an object stored in a HashMap not be mutated while it's in the map, or preferably that the object be immutable. The reason is that, in general, mutating an object will change its hashcode. If an object were stored in a HashMap, its hashcode would be used to determine which bucket it gets stored in. If that object is mutated, its hashcode would change. If the object were looked up at this point, a different hashcode would result. This might point HashMap to the wrong bucket, and as a result the object might not be found even though it was previously stored in that HashMap.
Hash codes are not required to be unique; they just need to have a very low likelihood of collisions.
As to hash codes being immutable, that is required only if an object is going to be used as a key in a HashMap. The hash code tells the HashMap where to do its initial probe into the bucket array. If a key's hash code were to change, then the map would no longer look in the correct bucket and would be unable to find the entry.
hashcode() is basically a function that converts an object into a number. In the case of hash based collections, this number is used to help lookup the object. If this number changes, it means the hash based collection may be storing the object incorrectly, and can no longer retrieve it.
Uniqueness of hash values allows a more even distribution of objects within the internals of the collection, which improves the performance. If everything hashed to the same value (worst case), performance may degrade.
The wikipedia article on hash tables provides a good read that may help explain some of this as well.
It has to do with the way items are stored in a hash table. A hash table will use the element's hash code to store and retrieve it. It's somewhat complicated to fully explain here but you can learn about it by reading this section: http://www.brpreiss.com/books/opus5/html/page206.html#SECTION009100000000000000000
Why is searching by hashing faster?
Let's say you have some unique objects as values, with Strings as their keys. Each key should be unique, so that when a key is searched for, you find the relevant object it holds as its value.
Now let's say you have 1000 such key/value pairs and you want to search for a particular key and retrieve its value. Without hashing, you would need to compare your key against all the entries in the table until you find it.
But with hashing, you hash the key and put the corresponding object into a certain bucket on insertion. When you later search for a particular key, the search key is hashed and its hash value determined, and you can go straight to that bucket and pick out your object without searching through all the entries.
hashCode is a tricky method. It is supposed to provide a shorthand to equality (which is what maps and sets care about). If many objects in your map share the same hashcode, the map will have to check equals frequently - which is generally much more expensive.
Check the javadoc for equals: that method is very tricky to get right even for immutable objects, and using a mutable object as a map key is just asking for trouble (since the object is stored under its "old" hashcode).
As long as you are working with collections whose elements you retrieve by index (0, 1, 2, ..., collection.size()-1), you don't need hash codes. However, if we are talking about associative collections like maps, or simply asking whether a collection contains some element, then we are talking about potentially expensive operations.
A hash code is like a digest of the provided object. Comparing hash codes at the binary level is much cheaper than comparing objects property by property (more than one operation, for sure). A hash code should behave like a fingerprint: ideally one value per entity, and immutable.
The basic idea of hashing is that if one is looking in a collection for an object whose hash code differs from that of 99% of the objects in that collection, one only need examine the 1% of objects whose hash code matches. If the hashcode differs from that of 99.9% of the objects in the collection, one only need examine 0.1% of the objects. In many cases, even if a collection has a million objects, a typical object's hash code will only match a very tiny fraction of them (in many cases, less than a dozen). Thus, a single hash computation may eliminate the need for nearly a million comparisons.
Note that it's not necessary for hash values to be absolutely unique, but performance may be very bad if too many instances share the same hash code. Note also that what matters for performance is not the total number of distinct hash values, but the extent to which they're "clumped". Searching for an object in a collection of a million items, half of which share one hash value while the remaining items each have a distinct value, will require examining about 250,000 items on average. By contrast, if there were 100,000 different hash values, each shared by ten items, searching for an object would require examining only about five items.
You could define a customized class extending HashMap and override the methods (get, put, remove, containsKey, containsValue) to compare keys and values only with the equals method, then add some constructors. Overriding the hashCode method correctly is very hard.
I hope this helps anybody who wants a simple way to use a HashMap.
I'm using Java. I want to know whether any algorithm is available that will give me a unique and stable hash code every time I run the application, so that hash code collisions are avoided.
I know that for equal objects the JVM returns the same hash code, and for different objects it may return the same or different hash codes. But I want some logic that will help generate a unique hash code for every object.
Unique means that the hash code of one object should not collide with any other object's hash code; same means that when I run the application multiple times, it should return the same hash code it returned previously.
The default hash code function in Java may return different hash codes on each JVM invocation, because it is free to use the memory address of the object, mangle it, and return it.
Relying on that is, however, not good coding practice, since objects which are equal should always return the same hash code! Please read about the hashCode contract to learn more. Most classes in Java already have a hashCode implementation that returns the same value on each JVM invocation.
To make it simple: All your data holding objects which might be stored in some collection should have an equals and hashcode implemention. If you code with Eclipse or any other reasonable IDE, you can use a wizard that creates the functions automatically.
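For reference, such a wizard typically generates something along these lines, usually built on java.util.Objects (the Person class and its fields are illustrative):

```java
import java.util.Objects;

// Roughly what an IDE's equals/hashCode wizard produces for a simple
// data-holding class: equals compares all fields, hashCode combines them.
final class Person {
    private final String name;
    private final int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        Person p = (Person) o;
        return age == p.age && Objects.equals(name, p.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, age); // same value on every JVM invocation
    }
}
```

Because both methods are derived from the same fields, equal objects are guaranteed to get equal hash codes, which is exactly what the contract demands.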
And while we are at it: It is IMHO good practice to also implement the Comparable<T> interface, so you can use the objects within SortedSets and TreeMaps, too.
While we are at it: if others are going to use your objects, don't forget Serializable and Cloneable.
Unique means that hashcode of one object should not collide with any other object's hashcode. Same means when I run the application multiple times, it should return me the same hash code whatever it returned me previously.
It is impossible to meet these requirements for a number of reasons:
It is not possible to guarantee that hash codes are unique. Whatever you do in your class's hashCode method, some other class's hashCode method may give a value for one of its instances that is the same as the hash code of one of your instances.
It is impossible to guarantee that hashcodes are unique across application runs even just for instances of your class.
The second requires justification. The way to create a unique hashcode is to do something like this:
static HashSet<Integer> usedCodes = ...
static IdentityHashMap<YourClass, Integer> codeMap = ...

public int hashCode() {
    Integer code = codeMap.get(this);
    if (code == null) {
        code = // generate a value-based hash code for 'this'
        while (usedCodes.contains(code)) {
            code = rehash(code);
        }
        usedCodes.add(code);
        codeMap.put(this, code);
    }
    return code;
}
This gives the hashcodes with the desired uniqueness property, but the sameness property is not guaranteed ... unless the application always generates / accesses the hashcodes for all objects in the same order.
The only way to get this to work would be to persist the usedCode and codeMap data structures in a suitable form. Even (just) storing the unique hashcodes as part of the persisted objects is not sufficient, because there is a risk that the application may reissue a hashcode to a newly created object before reading the existing object that has the hashcode.
Finally, it should be noted that you have to be careful with using identity hashcodes anywhere in the solution. Identity hashcodes are not unique across different runs of an application. Indeed, if there are differences in any inputs, or if there is any non-determinism, it is highly likely that a given object will have a different identity hashcode value each time you run the application.
FOLLOW UP
Suppose you are storing millions of URLs in a database. While retrieving these URLs, I want to generate a unique hash code that will make searching faster.
You need to store the hashcodes in a separate column of the table. But given the constraints discussed above, I don't see how this is going to make search faster. Basically you have to search the database for the URL in order to work out its unique hashcode.
I think you are better off using hashcodes that are not unique with a small probability. If you use a good enough "cryptographic" hashing function and a large enough hash size you can (in theory) make the probability of collision arbitrarily small ... but not zero.
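A sketch of that idea using SHA-256 via java.security.MessageDigest (the helper name stableHash is mine, not from the answer): the derived value is the same on every run, and while collisions remain theoretically possible, they are astronomically unlikely.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StableUrlHash {
    // Derive a stable 64-bit value from a URL by truncating its SHA-256
    // digest. Unlike identity hash codes, this never changes across runs.
    static long stableHash(String url) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(url.getBytes(StandardCharsets.UTF_8));
        long h = 0;
        for (int i = 0; i < 8; i++) { // take the first 8 bytes of the digest
            h = (h << 8) | (digest[i] & 0xFF);
        }
        return h;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(stableHash("http://example.com/page1"));
    }
}
```

Stored in a separate indexed column, such a value can speed up lookups, at the cost of having to handle the (vanishingly rare) collision case with a full comparison.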
Based on my understanding of your question...
If it is your custom object, then you can override the hashCode method (along with equals) to get a consistent hash code based on the instance variables of your class. You can even return a constant hash code; it will still satisfy the hashCode contract.
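To illustrate the constant-hash-code point (the class names here are mine): a constant value satisfies the contract, since equal objects trivially get equal hash codes, but it funnels every key into one bucket, so lookups degrade to a linear (or, in modern JDKs, tree) scan of that bucket.

```java
import java.util.HashMap;
import java.util.Map;

public class ConstantHashDemo {
    static final class ConstantHash {
        private final String id;
        ConstantHash(String id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof ConstantHash && id.equals(((ConstantHash) o).id);
        }

        // Legal per the contract (equal objects -> equal hash codes),
        // but every instance lands in the same bucket: slow in practice.
        @Override
        public int hashCode() { return 1; }
    }

    public static void main(String[] args) {
        Map<ConstantHash, String> map = new HashMap<>();
        map.put(new ConstantHash("a"), "first");
        map.put(new ConstantHash("b"), "second");
        // Lookups still work correctly; equals() does all the work.
        System.out.println(map.get(new ConstantHash("b"))); // prints "second"
    }
}
```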