How do I search a certain bucket in a hashing solution to find a key? I am having trouble figuring out how to see if my key is already in a given bucket number. I don't understand how to read buckets in an array.
I am writing my own hash data structure using buckets, not using Java's.
Once you have found the bucket where the item should be, based on its hashCode, you then have to look for the item in question among all the objects in that bucket. These objects all hashed to the same bucket (though their hashCodes are not necessarily identical), so you have to actually compare them with the .equals method to see whether the item you are looking for is there.
How you manage this group of items that all share the same bucket is up to you. You might have a list, or a sub array, or any data structure that holds a collection of objects.
In fact, you don't necessarily need to hold them all in the same bucket at all. There are schemes called open addressing where items that collide 'spill' out of the target bucket and occupy successive buckets in the top-level array.
Without knowing your exact data structure I can't be more specific. But basically you use hashCode to get you to the top-level bucket, then you use equals to find the object within the group of objects that share that bucket.
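As a hedged illustration of that two-step lookup, here is a minimal sketch of a bucket-based map. The names (BucketMap, Node, buckets) are invented for this example, and resizing and null-key handling are omitted:

// A minimal sketch, not the JDK's code: hashCode picks the bucket,
// equals finds the key inside it. Resizing and null keys are omitted.
class BucketMap<K, V> {

    private static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;
        Node(K key, V value, Node<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] buckets = new Node[16];

    public void put(K key, V value) {
        int i = indexFor(key);
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) { n.value = value; return; } // key found: replace value
        }
        buckets[i] = new Node<>(key, value, buckets[i]); // not found: prepend a new node
    }

    public V get(K key) {
        int i = indexFor(key);                      // hashCode picks the bucket...
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;  // ...equals finds the key inside it
        }
        return null; // not present in this bucket
    }

    private int indexFor(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length; // mask the sign bit so the index stays non-negative
    }
}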
According to this question (how-does-a-hashmap-work-in-java) and this:
Many key-value pairs can be stored in the same bucket (after the bucket index is computed from the hash), and when we call get(key) it walks over the linked list and tests each entry using the equals method.
That doesn't sound very optimized to me; doesn't it compare the hashCodes in the linked list before using equals?
If the answer is NO:
it means most of the time a bucket contains only one node; could you explain why? Because, by this logic, many different keys could end up with the same bucket index.
how does the implementation ensure a good distribution of keys? This probably means the bucket table size is relative to the number of keys.
And even if the bucket table size equals the number of keys, how does the HashMap's hash function ensure a good distribution of keys? Isn't it a random distribution?
Could we have more details?
The implementation is open source, so I would encourage you to just read the code for any specific questions. But here's the general idea:
The primary responsibility for good hashCode distribution lies with the keys' class, not with the HashMap. If the keys have a hashCode() method with bad distribution (for instance, return 0;), then the HashMap will perform badly.
HashMap does do a bit of "re-hashing" to ensure a slightly better distribution, but not much (see HashMap::hash, sketched below).
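For reference, in OpenJDK 8 that re-hash looks roughly like this; it XORs the high 16 bits of the key's hashCode into the low 16 bits, because the bucket index is taken from the low bits only:

// From OpenJDK 8's java.util.HashMap (roughly): spread the high bits
// of the hash downward so they can influence the bucket index.
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}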
On the get side of things, a couple of checks are made on each element in the bucket (which, yes, is implemented as a linked list)
First, the HashMap checks the element's hashCode with the incoming key's hashCode. This is because this operation is quick, and the element's hashCode was cached at put time. This guards against elements that have different hashCodes (and are thus unequal, by the contract of hashCode and equals established by Object) but happen to fall into the same bucket (remember, bucket indexes are basically hashCode % buckets.length)
If that succeeds, then, the HashMap checks equals explicitly to ensure they're really equal. Remember that equality implies the same hashCode, but the same hashCode does not require equality (and can't, since some classes have a potentially unbounded number of different values -- like String -- but there are only a finite number of possible hashCode values)
The reason for the double-checking of both hashCode and equals is to be both fast and correct. Consider two keys that have a different hashCode, but end up in the same HashMap bucket. For instance, if key A has hashCode=7 and B has hashCode=14, and there are 7 buckets, then they'll both end up in bucket 0 (7 % 7 == 0, and 14 % 7 == 0). Checking the hashCodes there is a quick way of seeing that A and B are unequal. If you find that the hashCodes are equal, then you make sure it's not just a hashCode collision by calling equals. This is just an optimization, really; it's not required by the general hash map algorithm.
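Put together, the search inside a bucket looks roughly like the sketch below. This is a simplification, not the exact JDK code; it assumes a Node that caches the hash computed at put time:

// Simplified sketch of the per-node check (null keys omitted for
// brevity): the cheap int comparison runs first, and equals runs
// only when the hashes already match.
class Node<K, V> {
    final int hash;   // cached at put time
    final K key;
    V value;
    Node<K, V> next;
    Node(int hash, K key, V value, Node<K, V> next) {
        this.hash = hash; this.key = key; this.value = value; this.next = next;
    }

    static <K, V> V findInBucket(Node<K, V> head, int hash, K key) {
        for (Node<K, V> n = head; n != null; n = n.next) {
            if (n.hash == hash && (n.key == key || key.equals(n.key))) {
                return n.value; // same hash AND equal: this is the entry
            }
        }
        return null; // only hash collisions here, or the key is absent
    }
}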
To avoid having to make multiple comparisons in linked lists, the number of buckets in a HashMap is generally kept large enough that most buckets contain only one item. By default the java.util.HashMap tries to maintain enough buckets that the number of items is only 75% of the number of buckets.
Some of the buckets may still contain more than one item - what's called a "hash collision" - and other buckets will be empty. But on average, most buckets with items in them will contain only one item.
The equals() method will always be used at least once to determine if the key is an exact match. Note that the equals() method is usually at least as fast as the hashCode() method.
A good distribution of keys is maintained by a good hashCode() implementation; the HashMap can do little to affect this. A good hashCode() method is one where the returned hash has as random a relationship to the value of the object as possible.
For an example of a bad hashing function, once upon a time, the String.hashCode() method only depended on the start of the string. The problem was that sometimes you want to store a bunch of strings in a HashMap that all start the same - for example, the URLs to all the pages on a single web site - resulting in an inordinately high proportion of hash collisions. I believe String.hashCode() was later modified to fix this.
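To illustrate that failure mode, here is a hypothetical prefix-only hash (badHash is invented for this example, not the historical implementation): any two strings sharing their first few characters collide.

// Hypothetical prefix-only hash, for illustration: any two strings
// sharing their first 8 characters collide, e.g. every URL starting
// with "https://" gets the same prefix contribution.
static int badHash(String s) {
    int h = 0;
    for (int i = 0; i < Math.min(s.length(), 8); i++) {
        h = 31 * h + s.charAt(i); // characters past index 7 never contribute
    }
    return h;
}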
doesn't it compare hashCodes of the linked list instead of using
equals
It's not required. The hashCode is used to determine the bucket number, for both put and get operations. Once you have found the bucket via the hashCode and found a linked list there, you know you need to iterate over it and check for equality to find the exact key. So there is no need for a hashCode comparison at that point.
That's why the hashCode should be as unique as it can be, so that it is best for lookup.
it means most of the time the bucket contains only 1 node
No. It depends on the uniqueness of the hashCode. If two key objects have the same hashCode but are not equal, then the bucket will contain two nodes.
When we pass key and value objects to the put() method of a Java HashMap, the implementation calls the hashCode method on the key object and applies the returned hashCode to its own hash function to find a bucket location for storing the Entry object. An important point to mention is that HashMap stores both the key and the value object as a Map.Entry in the bucket, which is essential for understanding the retrieval logic.
While retrieving the value for a key, if the hashCode is the same as some other key's, the bucket location will be the same and a collision occurs in the HashMap. Since HashMap uses a linked list to store entries, this entry (a Map.Entry object comprising key and value) is stored in the linked list.
A good distribution of the keys will depend on the implementation of the hashCode method. This implementation must obey the general contract for hashCode (a minimal example follows this list):
If two objects are equal by the equals() method, then the hashCodes returned by their hashCode() methods must be the same.
Whenever the hashCode() method is invoked on the same object more than once within a single execution of an application, it must return the same integer, provided no information or fields used in equals and hashCode are modified. This integer is not required to be the same across multiple executions of the application, though.
If two objects are not equal by the equals() method, it is not required that their hashCodes be different. Though it's always good practice to return different hashCodes for unequal objects: distinct hashCodes for distinct objects can improve the performance of a HashMap or Hashtable by reducing collisions.
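As a minimal sketch of a class honoring that contract (Point is just an example type): equals and hashCode are derived from the same fields, so equal objects always share a hash code.

import java.util.Objects;

// Example of the equals/hashCode contract: both methods are derived
// from the same fields, so equal Points always share a hash code.
final class Point {
    private final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y); // same fields as equals
    }
}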
You can visit this GitHub repository: https://github.com/devashish234073/alternate-hash-map-implementation-Java/blob/master/README.md
You can understand the working of HashMap through a basic implementation and examples there; the README.md explains it all.
Including some portion of the example here:
Suppose I have to store the following key-value pairs.
(key1,val1)
(key2,val2)
(key3,val3)
(....,....)
(key99999,val99999)
Suppose our hash algorithm produces values only between 0 and 5.
So first we create a rack with 6 buckets numbered 0 to 5.
Storing:
To store (keyN,valN):
1. Get the hash of 'keyN'.
2. Suppose we got 2.
3. Store (keyN, valN) in rack 2.
Searching:
For searching keyN:
1. Get the hash of keyN.
2. Let's say we get 2.
3. Traverse rack 2, find the key, and return the value.
Thus, for N keys stored linearly, it would take N comparisons to find the last element; but with a hash map whose hash algorithm generates 25 equally dispersed values, we only need about N/25 comparisons.
What is the reason for making hashCodes unique so that hash-based collections work faster? And what is the deal with not making hashCode mutable?
I read it here but didn't understand, so I read on some other resources and ended up with this question.
Thanks.
Hashcodes don't have to be unique, but they work better if distinct objects have distinct hashcodes.
A common use for hashcodes is for storing and looking up objects in data structures like HashMap. These collections store objects in "buckets" and the hashcode of the object being stored is used to determine which bucket it's stored in. This speeds up retrieval. When looking for an object, instead of having to look through all of the objects, the HashMap uses the hashcode to determine which bucket to look in, and it looks only in that bucket.
You asked about mutability. I think what you're asking about is the requirement that an object stored in a HashMap not be mutated while it's in the map, or preferably that the object be immutable. The reason is that, in general, mutating an object will change its hashcode. If an object were stored in a HashMap, its hashcode would be used to determine which bucket it gets stored in. If that object is mutated, its hashcode would change. If the object were looked up at this point, a different hashcode would result. This might point HashMap to the wrong bucket, and as a result the object might not be found even though it was previously stored in that HashMap.
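A short sketch of that failure mode; MutableKey here is a hypothetical class whose hashCode depends on a mutable field:

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Demonstrates losing an entry by mutating its key after insertion.
class MutableKey {
    int id;
    MutableKey(int id) { this.id = id; }

    @Override
    public boolean equals(Object o) {
        return o instanceof MutableKey && ((MutableKey) o).id == id;
    }

    @Override
    public int hashCode() { return Objects.hash(id); }

    public static void main(String[] args) {
        Map<MutableKey, String> map = new HashMap<>();
        MutableKey key = new MutableKey(1);
        map.put(key, "value");
        key.id = 2;                               // mutate the key in place
        System.out.println(map.containsKey(key)); // false: the map probes the wrong bucket
    }
}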
Hash codes are not required to be unique, they just have a very low likelihood of collisions.
As to hash codes being immutable, that is required only if an object is going to be used as a key in a HashMap. The hash code tells the HashMap where to do its initial probe into the bucket array. If a key's hash code were to change, then the map would no longer look in the correct bucket and would be unable to find the entry.
hashcode() is basically a function that converts an object into a number. In the case of hash based collections, this number is used to help lookup the object. If this number changes, it means the hash based collection may be storing the object incorrectly, and can no longer retrieve it.
Uniqueness of hash values allows a more even distribution of objects within the internals of the collection, which improves the performance. If everything hashed to the same value (worst case), performance may degrade.
The wikipedia article on hash tables provides a good read that may help explain some of this as well.
It has to do with the way items are stored in a hash table. A hash table will use the element's hash code to store and retrieve it. It's somewhat complicated to fully explain here but you can learn about it by reading this section: http://www.brpreiss.com/books/opus5/html/page206.html#SECTION009100000000000000000
Why is searching by hashing faster?
Let's say you have some unique objects as values, and you have Strings as their keys. Each key should be unique, so that when the key is searched for, you find the relevant object it holds as its value.
Now let's say you have 1000 such key-value pairs, and you want to search for a particular key and retrieve its value. Without hashing, you would need to compare your key with all the entries in the table to find it.
But with hashing, you hash your key and put the corresponding object into a certain bucket on insertion. When you later want to search for a particular key, that key is hashed and its hash value determined. Then you can go straight to that hash bucket and pick your object, without having to search through all the key entries.
hashCode is a tricky method. It is supposed to provide a shorthand to equality (which is what maps and sets care about). If many objects in your map share the same hashcode, the map will have to check equals frequently - which is generally much more expensive.
Check the javadoc for equals; that method is very tricky to get right even for immutable objects, and using a mutable object as a map key is just asking for trouble (since the object is stored under its "old" hashCode)
As long as you are working with collections whose elements you retrieve by index (0, 1, 2, ..., collection.size()-1), you don't need a hashCode. However, if we are talking about associative collections like maps, or simply asking a collection whether it contains some element, then we are talking about expensive operations.
A hashCode is like a digest of the provided object. It is compact and, ideally, close to unique. Hashcodes are generally used for cheap integer-level comparisons: comparing the hashCode of every collection member at that level is far less expensive than comparing every object by its properties (more than one operation, for sure). A hashCode needs to be like a fingerprint: one entity, one value, and an immutable hashCode.
The basic idea of hashing is that if one is looking in a collection for an object whose hash code differs from that of 99% of the objects in that collection, one only need examine the 1% of objects whose hash code matches. If the hashcode differs from that of 99.9% of the objects in the collection, one only need examine 0.1% of the objects. In many cases, even if a collection has a million objects, a typical object's hash code will only match a very tiny fraction of them (in many cases, less than a dozen). Thus, a single hash computation may eliminate the need for nearly a million comparisons.
Note that it's not necessary for hash values to be absolutely unique, but performance may be very bad if too many instances share the same hash code. Note that what's important for performance is not the total number of distinct hash values, but rather the extent to which they're "clumped". Searching for an object in a collection of a million things in which half the items share one hash value and each remaining item has a different value will require examining on average about 250,000 items. By contrast, if there were 100,000 different hash values, each shared by ten items, searching for an object would require examining about five.
You can define a customized class extending HashMap, then override the methods (get, put, remove, containsKey, containsValue) so that they compare keys and values only with the equals method, and add some constructors. Overriding the hashCode method correctly is very hard.
I hope I have helped everybody who wants to use easily a hashmap.
I understand that returning the same value for each object is inefficient, but is it the most efficient approach to return distinct values for distinct instances?
If each object gets a different hashCode value then isn't this just like storing them in an ArrayList?
hashCode must be consistent with equals; that's the number one priority. If no two objects are equal, then distinct hashCodes for distinct instances would be desirable. Bear in mind that if your object has more than 32 bits of state, it is theoretically impossible to provide a perfectly collision-free hashCode.
No, it's not actually.
Assuming your objects are going to be stored into a HashMap (or Set... doesn't matter, we'll use HashMap here for simplicity), you want your hashCode method to return a result in a way that distributes the objects as evenly as possible.
Hashcode should be unique for Objects that are not equal, although you can't guarantee this will always be true.
On the other hand, if a.equals(b) is true, then a.hashCode() == b.hashCode(). This is known as the Object Contract.
Besides this, there are performance issues also. Each time two different objects have the same hashCode, they're mapped to the same position in the HashMap (aka, they collide). This means that the HashMap implementation has to handle this collision, which is much more complex than simply storing and retrieving an entry.
There are also plenty of algorithms that rely on the fact that values are distributed evenly across a Map, and the performance deteriorates rapidly when the number of collisions increase (some algorithms assume a perfect hash function, meaning that no collisions ever occur, no two different values get mapped to the same position on the Map).
Good examples of this are probabilistic algorithms and data-structures such as Bloom Filters (to use an example that appears to be in fashion these days).
You want hashCode() as varied as possible to avoid collisions. If there are no collisions, each key or element will be stored in the underlying array on its own. (A bit like an ArrayList)
The problem is that even if the hashCodes are different you can still get collisions. This happens because you don't have a bucket for every possible hashCode, and the value has to be reduced to a smaller range. E.g. if you have 16 buckets, the range is 0 to 15. How it does this is complicated (a sketch follows), but I am sure you can see that even if all the hashCodes are different, they can still result in a collision (though it's unlikely)
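For what it's worth, when the table length is a power of two (as in java.util.HashMap), reducing the hash to a bucket index is just a bit mask; a sketch:

// Sketch: reduce a 32-bit hash to a bucket index when the table
// length is a power of two. E.g. length 16 keeps only the low 4 bits.
static int indexFor(int hash, int tableLength) {
    return (tableLength - 1) & hash;
}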
It is also a concern for denial-of-service attacks. Normally strings have a low collision rate; however, you can deliberately construct strings which have the same hashCode. This question gives a list of Strings with a hashCode of 0: Why doesn't String's hashCode() cache 0?
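For example, distinct strings can legitimately share a hash code, which is exactly what such attacks exploit; this is real, verifiable behavior of String.hashCode():

// "Aa" and "BB" hash identically (31*'A'+'a' == 31*'B'+'B' == 2112),
// and any chain of "Aa"/"BB" blocks collides with every other chain.
public class Collisions {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode());                        // 2112
        System.out.println("BB".hashCode());                        // 2112
        System.out.println("AaAa".hashCode() == "BBBB".hashCode()); // true
        System.out.println("AaBB".hashCode() == "BBAa".hashCode()); // true
    }
}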
The hashCode() method isn't suited for placing objects in an ArrayList.
Although it does return the same value for the same object every time, two objects could quite possibly have the same hashcode.
Therefore the hashCode method is used on the key Object when storing items in for example a HashMap.
The HashMap class's major data structure is this:
Entry[] table;
It's important to note that the Entry class (which is a static package protected class that implements Map.Entry) is actually a linked list style structure.
When you try to put an element, first the key's hashcode is computed and then transformed into a bucket number. The "bucket" is the index into the above array.
Once you find the bucket, a linear search is done inside of that bucket for the exact key (if you don't believe me, look at the HashMap code). If it is found, the value is replaced. If not, the key/value pair is appended to the end of that bucket.
For this reason, hashcode() values need not be unique; however, the more unique and evenly distributed they are, the better your odds of having the values evenly distributed among the buckets. If your hashcode() method returned the same value for all instances, they'd all end up in the same bucket, rendering your get() method one long linear search, yielding O(N).
The more distributed the values are, the smaller the buckets, and thus the smaller the linear-search component. Unique, well-distributed values yield constant-time O(1) lookup (a sketch of the opposite, worst case follows).
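A quick sketch of that worst case, using a hypothetical key class whose hashCode is constant; every entry lands in one bucket, so get degrades toward a linear scan:

import java.util.HashMap;
import java.util.Map;

// Hypothetical worst case: a constant hashCode puts every key into
// the same bucket, so every lookup searches one overcrowded bucket.
class BadKey {
    final int id;
    BadKey(int id) { this.id = id; }

    @Override
    public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }

    @Override
    public int hashCode() { return 42; } // legal, but terrible for HashMap

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            map.put(new BadKey(i), i);
        }
        // Correct, but each lookup must search the single shared bucket.
        System.out.println(map.get(new BadKey(999))); // 999
    }
}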
Apart from the fact that HashSet does not allow duplicate values, what is the difference between HashMap and HashSet?
I mean implementation wise? It's a little bit vague because both use hash tables to store values.
HashSet is a set, e.g. {1,2,3,4,5}
HashMap is a key -> value (key to value) map, e.g. {a -> 1, b -> 2, c -> 2, d -> 1}
Notice in my example above that in the HashMap there must not be duplicate keys, but it may have duplicate values.
In the HashSet, there must be no duplicate elements.
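A small usage sketch of the two side by side (SetVsMap is just an example class name):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SetVsMap {
    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        set.add("a");
        set.add("a");               // duplicate element: set is unchanged
        System.out.println(set);    // [a]

        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 1);            // duplicate value: allowed
        map.put("a", 2);            // duplicate key: old value replaced
        System.out.println(map);    // {a=2, b=1}
    }
}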
They are entirely different constructs. A HashMap is an implementation of Map. A Map maps keys to values. The key look up occurs using the hash.
On the other hand, a HashSet is an implementation of Set. A Set is designed to match the mathematical model of a set. A HashSet does use a HashMap to back its implementation, as you noted. However, it implements an entirely different interface.
When you are looking for what will be the best Collection for your purposes, this Tutorial is a good starting place. If you truly want to know what's going on, there's a book for that, too.
HashSet
The HashSet class implements the Set interface.
In a HashSet, we store objects (elements or values). E.g. a HashSet of String elements could depict a set such as {“Hello”, “Hi”, “Bye”, “Run”}.
HashSet does not allow duplicate elements, which means you cannot store duplicate values in it.
HashSet permits a single null value.
HashSet is not synchronized, which means it is not suitable for thread-safe operations unless synchronized explicitly. [similarity]
Complexities: add O(1), contains O(1), next O(h/n), where h is the table capacity and n is the number of elements.
HashMap
The HashMap class implements the Map interface.
HashMap is used for storing key & value pairs. In short, it maintains the mapping of key & value. (The HashMap class is roughly equivalent to Hashtable, except that it is unsynchronized and permits nulls.) This is how you could represent HashMap elements if it has Integer keys and String values: {1->”Hello”, 2->”Hi”, 3->”Bye”, 4->”Run”}.
HashMap does not allow duplicate keys; however, it allows duplicate values.
HashMap permits a single null key and any number of null values.
HashMap is not synchronized, which means it is not suitable for thread-safe operations unless synchronized explicitly. [similarity]
Complexities: get O(1), containsKey O(1), next O(h/n), where h is the table capacity and n is the number of entries.
Please refer to this article for more information.
It's really a shame that both their names start with Hash. That's the least important part of them. The important parts come after the Hash - the Set and Map, as others have pointed out. What they are, respectively, are a Set - an unordered collection - and a Map - a collection with keyed access. They happen to be implemented with hashes - that's where the names come from - but their essence is hidden behind that part of their names.
Don't be confused by their names; they are deeply different things.
HashSet internally uses a HashMap. If you look at the internal implementation, the values inserted into a HashSet are stored as keys in the HashMap, and the value is a dummy instance of the Object class.
The differences between HashMap and HashSet are:
HashMap contains key-value pairs, and each value can be accessed by its key, whereas a HashSet has to be iterated every time, as there is no get method.
HashMap implements the Map interface and allows one null key and multiple null values, whereas HashSet implements the Set interface, allows only one null value, and allows no duplicate values. (Remember: one null key is allowed in a HashMap, hence one null value in a HashSet, as HashSet uses a HashMap internally.)
HashSet and HashMap do not maintain the order of insertion while iterating.
HashSet allows us to store objects in the set, whereas HashMap allows us to store objects by key and value. Every stored object has a key.
As the names imply, a HashMap is an associative Map (mapping from a key to a value), a HashSet is just a Set.
Differences between HashSet and HashMap in Java
1) The first and most significant difference between HashMap and HashSet is that HashMap is an implementation of the Map interface while HashSet is an implementation of the Set interface, which means HashMap is a key-value based data structure and HashSet guarantees uniqueness by not allowing duplicates. In reality, HashSet is a wrapper around HashMap in Java; if you look at the code of the add(E e) method of HashSet.java, you will see the following code:
public boolean add(E e) {
    return map.put(e, PRESENT) == null;
}
where the element is put into the map as the key, and the value is a final dummy object named PRESENT.
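For context, the backing map and that dummy value are declared in HashSet.java roughly as:

// From HashSet.java (roughly): the backing map, plus the shared
// dummy Object used as the value for every entry.
private transient HashMap<E, Object> map;
private static final Object PRESENT = new Object();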
2) The second difference between HashMap and HashSet is that we use the add() method to put elements into a Set, but the put() method to insert a key and value into a HashMap in Java.
3) HashSet allows only one null element, but HashMap allows one null key plus multiple null values.
That's all on the differences between HashSet and HashMap in Java. In summary, HashSet and HashMap are two different types of Collection, one being a Set and the other a Map.
Differences between HashSet and HashMap in Java
HashSet internally uses a HashMap to store objects. When the add(String) method is called, it calls the HashMap's put(key, value) method with key = the String object and value = a new dummy Object. It thus maintains no duplicates, because the keys are nothing but the value objects themselves.
Objects stored as keys in a HashSet/HashMap should honor the hashCode and equals contract.
Keys used to access/store value objects in a HashMap should be immutable, because if a key is modified after insertion, the value object can no longer be located and get returns null.
A HashMap is to add, get, remove, ... objects indexed by a custom key of any type.
A HashSet is for adding elements, removing elements, and checking whether elements are present, by comparing their hashes (and then their equality).
So a HashMap maps custom keys to values, while a HashSet only keeps track of which elements it contains, using their hashes.
A HashSet uses a HashMap internally to store its entries. Each entry in the internal HashMap is keyed by a single Object, so all entries hash into the same bucket. I don't recall what the internal HashMap uses to store its values, but it doesn't really matter since that internal container will never contain duplicate values.
EDIT: To address Matthew's comment, he's right; I had it backwards. The internal HashMap is keyed with the Objects that make up the Set elements. The values of the HashMap are an Object that's just simply stored in the HashMap buckets.
Differences:
With respect to hierarchy:
HashSet implements Set.
HashMap implements Map and stores a mapping of keys and values.
A use of HashSet and HashMap with respect to a database would help you understand the significance of each.
HashSet: generally used for storing unique collection objects.
E.g. it might be used as the implementation class for storing the many-to-one relationship between class Item and class Bid, where an Item has many Bids.
HashMap: used to map a key to a value. The value may be null, or any Object, or a list of Objects (which is an object in itself).
A HashSet is implemented in terms of a HashMap. It's a mapping between the key and a PRESENT object.
HashMap is a Map implementation, allowing duplicate values but not duplicate keys. Adding an object requires a key-value pair. A null key and null values are allowed. E.g.:
{The->3,world->5,is->2,nice->4}
HashSet is a Set implementation, which does not allow duplicates. If you try to add a duplicate object via a call to the public boolean add(Object o) method, the set remains unchanged and the call returns false. E.g.:
[The,world,is,nice]
Basically, in a HashMap the user has to provide both key and value, whereas in a HashSet you provide only the value; the element itself serves as the key internally. So, having both a key and a value, a HashSet can be stored as a HashMap internally.
HashSet and HashMap both store entries; the difference is that in a HashMap you specify the key yourself, while in a HashSet the element itself acts as the key.
HashMaps allow one null key and null values. They are not synchronized, which increases efficiency. If it is required, you can make them synchronized using Collections.synchronizedMap().
Hashtables don't allow null keys and are synchronized.
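A minimal sketch of the Collections.synchronizedMap wrapping mentioned above (SyncDemo is just an example class name):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SyncDemo {
    public static void main(String[] args) {
        // Every call on syncMap synchronizes on the wrapper object.
        Map<String, Integer> syncMap = Collections.synchronizedMap(new HashMap<>());
        syncMap.put("a", 1);
        System.out.println(syncMap.get("a")); // 1
    }
}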
The main difference between them you can find as follows:
HashSet
It does not allow duplicate elements.
It is not synchronized, so it will have better performance.
It allows a single null element.
HashSet can be used when you want to maintain a unique list.
HashSet implements the Set interface and is backed by a hash table (actually a HashMap instance).
HashSet stores objects.
HashSet doesn’t allow duplicate elements, but a null value is allowed.
This interface doesn’t guarantee that order will remain constant over time.
HashMap
It allows duplicate values, but it does not allow duplicate keys.
It is not synchronized, so this will have better performance.
HashMap does not maintain insertion order.
The order is defined by the Hash function.
It is not thread-safe.
It allows one null key and as many null values as you like.
HashMap is a Hash table-based implementation of the Map interface.
HashMap stores objects as key-value pairs.
HashMap does not allow duplicate keys, but a null key and null values are allowed.
Ordering of the elements is not guaranteed over time.
EDIT - this answer isn't correct. I'm leaving it here in case other people have a similar idea. b.roth and justkt have the correct answers above.
--- original ---
you pretty much answered your own question - a HashSet doesn't allow duplicate values. It would be trivial to build a HashSet using a backing HashMap (with just a check to see if the value already exists). I guess the various Java implementations either do that, or implement some custom code to do it more efficiently.
HashMap is an implementation of the Map interface.
HashSet is an implementation of the Set interface.
HashMap stores data in the form of key-value pairs.
HashSet stores only objects.
The put method is used to add an element to a map.
The add method is used to add an element to a set.
In a HashMap, the hashCode value is calculated using the key object.
In a HashSet, the member object itself is used for calculating the hashCode value, which can be the same for two objects, so the equals() method is used to check for equality; if it returns false, the two objects are different.
HashMap is often said to be faster than HashSet, because a unique key is used to access its objects, while HashSet is correspondingly slower.