This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
The fundamentals of Hash tables?
I am trying to implement a simple hash table, probably on top of plain Java arrays. But first I will need some kind of associative array, won't I? What might a simple hash table implementation look like? It should still be able to do add/delete/get in O(1).
A hash table basically takes an input key, hashes it with a function to find a bucket ID, and then uses that bucket ID to either store or retrieve the data associated with that key.
In other words, for your case, you just have to provide a hashing function on your data that will give you a bucket ID, which serves as your array index.
Perhaps the simplest (and most naive) would be exclusive ORing together all the characters of your key then doing a modulus operation to bring it to the desired range. For example, say you have a structure containing:
Name
Address
Phone
All sorts of other details
You could generate a hash as follows:
set hashval to zero
for each character in Name:
    hashval = hashval xor character
hashval = hashval mod 256
This would give you a bucket ID of between 0 and 255 inclusive.
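For what it's worth, here is a minimal Java sketch of that pseudocode. The class and method names (XorHash, bucketId) are made up for illustration, and it assumes the key is the Name field held as a String.

```java
class XorHash {
    // Hypothetical helper: XOR every character of the key together,
    // then reduce the result to the range 0..255.
    static int bucketId(String name) {
        int hashval = 0;
        for (int i = 0; i < name.length(); i++) {
            hashval ^= name.charAt(i);
        }
        // chars are never negative, so hashval is non-negative and % is safe here
        return hashval % 256;
    }

    public static void main(String[] args) {
        System.out.println(bucketId("ab")); // 'a' (97) xor 'b' (98) = 3
    }
}
```

Note that XOR ignores character order, so "ab" and "ba" land in the same bucket. That is one reason this scheme is called naive.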
Just keep in mind that a bucket may contain more than one item, so you can't just store a single item at each array index. Each bucket will need to be a structure that can hold multiple items (such as a linked list or even another hash table).
Read any textbook on data structures and algorithms, or just the Wikipedia "Hash table" entry.
The implementation which comes packaged with your JDK is pretty good for self study (though I admit in no way minimalistic). Have a look at it here.
So I'm a bit confused about this one.
If Hashtables use separate chaining (or linear probing), why won't the following print out both values?
Hashtable<Character, Integer> map = new Hashtable<>();
map.put('h', 0);
map.put('h', 1);
System.out.println(map.remove('h')); // outputs 1
System.out.println(map.get('h')); // outputs null
I'm trying to understand why, given 2 identical keys, the hashtable won't use separate chaining in order to store both values. Did I understand this somewhat incorrectly or has Java just not implemented collision handling in their hashtable class?
Another question that might tie together would be, how does a hashtable using linear probing, given a key, know which value is the one we are looking for?
Thanks in advance!
I'm trying to understand why, given 2 identical keys, the hashtable won't use separate chaining in order to store both values.
The specification for Map (i.e. the javadoc) says that only one value is stored for each key. So that's what the Hashtable and HashMap implementations do.
Certainly, separate chaining doesn't stop someone from implementing a hash table with that property. The pseudo-code for put(key, value) on a hash table with separate chaining is broadly as follows:
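You can see the "one value per key" rule directly in the return value of put, which hands back the value it replaced (or null if there was none). A small sketch; the class and method names are just for illustration:

```java
import java.util.Hashtable;

class PutReplacesDemo {
    // Puts two values under the same key and reports what put() returned each time.
    static String run() {
        Hashtable<Character, Integer> map = new Hashtable<>();
        Integer first = map.put('h', 0);  // no previous mapping, so null
        Integer second = map.put('h', 1); // returns the replaced value, 0
        return first + " " + second + " " + map.size();
    }

    public static void main(String[] args) {
        System.out.println(run()); // null 0 1
    }
}
```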
Compute the hash value for the key.
Compute an index in the array of hash chains from the hash value. (The computation is index = hash % array.length or something similar.)
Search the hash chain at the computed index for an entry that matches the key.
If you found the entry for the key on the chain, update the value in the entry.
If you didn't find the entry, create an entry and add it to the chain.
If you repeat that for the same key, you will compute the same hash value, search the same chain, and find the same entry. You then update it, and there is still only one entry for that key ... as required by the specification.
In short, the above algorithm has no problem meeting the Map.put API requirements.
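The steps above can be sketched in Java roughly like this. It is a toy, not how java.util.HashMap is actually written: ChainedTable and its members are made-up names, there is no resizing, and null keys are not handled. Math.floorMod stands in for the "or something similar" in the index computation, because a plain % can go negative in Java.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedTable<K, V> {
    // One entry per key; the value is mutable so put() can update it in place.
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final List<LinkedList<Entry<K, V>>> chains = new ArrayList<>();
    private int size = 0;

    ChainedTable(int capacity) {
        for (int i = 0; i < capacity; i++) {
            chains.add(new LinkedList<>());
        }
    }

    // Steps 1 and 2: compute the hash, then the index of the chain.
    private LinkedList<Entry<K, V>> chainFor(K key) {
        return chains.get(Math.floorMod(key.hashCode(), chains.size()));
    }

    void put(K key, V value) {
        LinkedList<Entry<K, V>> chain = chainFor(key);
        for (Entry<K, V> e : chain) {        // step 3: search the chain
            if (e.key.equals(key)) {
                e.value = value;             // step 4: found, so update
                return;
            }
        }
        chain.add(new Entry<>(key, value));  // step 5: not found, so add
        size++;
    }

    V get(K key) {
        for (Entry<K, V> e : chainFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    int size() { return size; }
}
```

Putting 'h' twice into this table leaves size() at 1 and get('h') returns the second value, which matches what the question observed with Hashtable.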
I think you are misunderstanding how hash tables work. Imagine I am looking for someone with an id of 227828. Say I have 1000 such people. I can search all 1000 and eventually find that ID and the person to whom it belongs.
But if their ids are used as keys in a hash table it is easier. Using the id as the key, say the hash function returns 0 for an even id and 1 for an odd id. Then all I have to do is find the box that contains even ids. Ideally I would then only have to search through 500 entries to find the key, i.e. the id, and return the value associated with it.
But hash functions are more sophisticated and there are many such boxes or buckets. And the appropriate box or bucket can be identified and then be searched for the proper key. And then return its associated value.
I want to implement a HashMap data structure, but I can't quite figure out what to do with underlying array structure.
If I'm not mistaken, in a HashMap each key is hashed to an integer that is used as the array index. Search time is O(1) because the element is accessed directly.
Let's say K is the key and V is the value. We can create an array of size n of type V, and each value will reside at the index produced by hash(K). However, hash(K) doesn't produce consecutive indices, and Java's arrays aren't sparse. To solve this we could create a very large array to hold the elements, but that would be inefficient: it would hold a lot of null elements.
One solution is to store the elements consecutively, but then finding an element means searching the entire array and checking each element's key, which takes linear time. I want to achieve direct access.
Thanks in advance.
Borrowed from the Wikipedia article on hash tables, you can use a smaller array for underlying storage by taking the hash modulo the array size like so:
hash = hashfunc(key)
index = hash % array_size
This is a good solution because you can keep the underlying array relatively dense, you don't have to modify the hash function, and it does not affect the time complexity.
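One Java-specific wrinkle, in case it helps: hashCode() may be negative, and % keeps the sign of its left operand, so a plain hash % array_size can produce a negative index. A small sketch (BucketIndex and indexFor are made-up names) that uses Math.floorMod instead:

```java
class BucketIndex {
    // Maps any key to an index in 0..arraySize-1, even for negative hash codes.
    static int indexFor(Object key, int arraySize) {
        return Math.floorMod(key.hashCode(), arraySize);
    }

    public static void main(String[] args) {
        // An Integer's hashCode() is its own value, which makes the effect easy to see.
        System.out.println(-7 % 64);          // -7, not usable as an array index
        System.out.println(indexFor(-7, 64)); // 57
    }
}
```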
You can look at the source code for all your problems.
To save you from going through a fairly large code base: the solution to your problem is to use modulo. You can choose n = 64, then store an element x with h(x) mod 64 = 2 at index 2 of the array, and so on.
If two elements have the same hash modulo n, then you store them together in the same bucket (usually a linked list; since Java 8, HashMap converts large buckets into a balanced tree). Another solution would be to increase n.
This question already has answers here:
How does a hash table work?
Closed 8 years ago.
If a hash table is an array of linked lists and the hash code is the index into the array, then why doesn't a hash table maintain insertion order?
To be brief, a hash table (dictionary) does not maintain a total order of insertions because it doesn't need to. The abstract data type supports amortized O(1) insertions, deletions, and searches, but does not support enumeration, and does not impose any order on the elements in the key set.
HashTable implements a dictionary, and total order of insertions is not retained because insertions with different hash values map to different chains. In the case of a dictionary implementation using chaining, keys with colliding hashes are stored in a linked list (as stated in the question), and are indeed maintained in order of insertion. There exist many other (faster) implementations of dictionaries that do not have this property. Please see these lecture notes for a discussion of open addressing (a dictionary implementation pattern that does not retain insertion order of colliding elements). Open addressing with double hashing, in particular, does not impose any order on the elements stored in the key set.
Because it's designed that way. Try LinkedHashSet instead.
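A quick sketch of what LinkedHashSet gives you (InsertionOrderDemo and insertionOrder are made-up names). It iterates in the order elements were first added, whereas a plain HashSet makes no promise about order:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class InsertionOrderDemo {
    // Adds the items to a LinkedHashSet and returns them in iteration order.
    static List<String> insertionOrder(String... items) {
        Set<String> set = new LinkedHashSet<>();
        for (String s : items) {
            set.add(s); // re-adding an existing element does not move it
        }
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        System.out.println(insertionOrder("banana", "apple", "cherry", "apple"));
        // [banana, apple, cherry]
    }
}
```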
Consider an int array variable x[]. The variable x holds a reference to the starting address. When the array is accessed with index 2, that is x[2], its memory location is calculated as
address of x[2] = starting address + index * size of int,
i.e. address of x[2] = x + 2*4.
But in the case of a HashMap, how is the memory address mapped internally?
From reading many previous posts I gathered that HashMap uses a linked list to store the key-value pairs. But if that is the case, then to find a key it generates a hash code, then checks for an equal hash code in the list and retrieves the value.
That would be O(n) complexity.
If I am wrong in the above observation, please correct me. I am a beginner.
Thank you
The traditional implementation of a HashMap uses a function to generate a hash from the key, then uses that hash to access the value directly. Think of it as generating something that will translate into an array index. It does not look through the hashmap comparing element hashes to the generated hash; it generates the hash, and uses the hash to access an element directly.
What I think you're talking about is the case where two keys in the HashMap generate the SAME hash. THEN it uses a list of those, and has to look through them to determine which one it wants. But that's not O(n) where n is the number of elements in the HashMap, just O(m) where m is the number of elements with the same hash. Clearly the name of the game is to find a hash function where the generated hash is unique for all the elements, as much as is feasible.
--- edit to expand on the explanation ---
In your post, you state:
By reading many previous posts I observed that HashMap uses a linked list to store the key value list.
This is wrong for the overall HashMap. For a HashMap to work reasonably, there must be a way to use the key to calculate a way to access the corresponding element directly, not by searching through all the values in the HashMap.
A "perfect" hash calculation would translate each possible key into a hash value that was not calculated for any other key. This is usually not feasible, and so it is usually possible that two different keys will produce the same result from the hash calculation. In this case, the HashMap implementation could use a linked list of values, and would need to look through all such values to find the one that it was looking for. But this number is FAR less than the number of values in the overall HashMap.
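If you want to see a collision for yourself, "Aa" and "BB" are a well-known pair of Java strings with the same hashCode() (both 2112). A small sketch (CollisionDemo and build are made-up names) showing that HashMap still keeps the two keys apart:

```java
import java.util.HashMap;
import java.util.Map;

class CollisionDemo {
    // Stores two keys whose hash codes collide.
    static Map<String, Integer> build() {
        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);
        map.put("BB", 2);
        return map;
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        Map<String, Integer> map = build();
        // Both entries survive; equals() tells them apart inside the shared bucket.
        System.out.println(map.get("Aa") + " " + map.get("BB")); // 1 2
    }
}
```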
You can make a hash where strings are the keys, and in which the first character of the string is converted to a number which is then used as an array index. As long as your strings all have different first characters, then accessing the value is a simple calc plus an array access -- O(1). Or you could add together all the character values of the index string and take the last two (or three) digits, and THAT would be your hash calc. As long as the addition produced unique values for each index string, you don't ever have to look through a list; again, O(1).
And, in fact, as long as the hash calculation is approximately perfect, the lookup is still O(1) overall, because the limited number of times you have to look through a short list does not alter the overall efficiency.
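Here is a rough sketch of the "add up the characters and keep the last two digits" idea from above, just to make it concrete. DigitSumHash and hash are made-up names, and this is only an illustration, not a hash function you would want in practice:

```java
class DigitSumHash {
    // Sums the character values of the key and keeps the last two decimal digits,
    // giving an array index in 0..99.
    static int hash(String key) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(i);
        }
        return sum % 100;
    }

    public static void main(String[] args) {
        System.out.println(hash("ab")); // 97 + 98 = 195, so 95
        System.out.println(hash("ba")); // same characters, same sum: also 95
    }
}
```

The second line shows where it breaks down: any two keys made of the same characters collide, so you would then need the short list described earlier.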
To avoid any confusion, I am reframing my question based on my research on hashing algorithms.
Problem statement
I have multiple text files containing variable-length data records. I need to find out whether there are duplicate records in the input. Each of the text files could contain millions of data records.
Since I cannot load all the data in memory at once, I plan to create a hash of the key fields in the record when it is processed. Processing a record would mean validating, filtering and transforming it. After processing all the records in all the text files, they are merged to create one view of the whole input (either a text file or a database table).
Finding duplicates would be much faster if a hash value is generated for all the records. If there are collisions of hash values, only those records would need to be checked for equality (assuming the hashing algorithm is deterministic).
Question - What hash algorithms should I consider for such input i.e. variable length data?
Short Answer
Don't do it. Use the Java map. You can find details here:
http://docs.oracle.com/javase/6/docs/api/java/util/Map.html
Long Answer
You can create a perfect hashing function by treating your string as a number in base-N, where N is the number of possible values any character can take on. The problem here is memory. Hashing functions are meant to be used with arrays, which means you'll need an array large enough to handle the results of your hash, and that is impractical.
For instance, take a modest example of a 10 character key. Let's be even more modest and assume they are guaranteed to consist solely of lower-case letters. That gives you 26 possibilities for each character, and 10 characters. This means the possible combinations are:
26 ^ 10 = 141,167,095,653,376
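To make the base-N idea and that number concrete, here is a small sketch assuming keys made only of the letters 'a' to 'z'. Base26Hash, perfectHash and keySpace are made-up names:

```java
class Base26Hash {
    // Treats a lower-case key as a base-26 number: 'a' = 0 ... 'z' = 25.
    // Fits in a long for keys of up to 13 characters; longer keys would overflow.
    static long perfectHash(String key) {
        long value = 0;
        for (int i = 0; i < key.length(); i++) {
            value = value * 26 + (key.charAt(i) - 'a');
        }
        return value;
    }

    // Number of distinct keys of the given length, i.e. 26^length.
    static long keySpace(int length) {
        long n = 1;
        for (int i = 0; i < length; i++) {
            n *= 26;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(keySpace(10));              // 141167095653376
        System.out.println(perfectHash("zzzzzzzzzz")); // 141167095653375, the largest index
    }
}
```

Every key gets its own index, but the array would need that many slots. Even at one bit per slot that is about 17.6 trillion bytes.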
If you look up hashing algorithms, one of the first things they include is collision detection because they acknowledge that collisions are a fact of life.
Now you say you are not loading keys in memory, yet why are you using a hash then? The point of a hash is to give you a mapping onto an array index. Perhaps you're better off using another mechanism.
Possible Solutions
If you are concerned about memory, get some statistics on the duplicates in your file. If you only store a flag to indicate the occurrence of a particular key in the hash, and you have many duplicates, you may be able to just use Java's map. Java's map handles collisions, so that won't keep you from detecting unique keys. You can rest assured that if A[x] is found, that means x is in A, even if x's hash collided with a previous hash.
Next, you could try some utilities to pull out duplicates. Since they would have been written specifically for the purpose, they should be able to handle a large amount of data.
Finally, you could try putting your entries into a database and using that to handle duplicates. This may seem like overkill, but databases are optimized for dealing with very large numbers of records.
This is an extension to the Map idea. Before resorting to this I would check that it cannot be done by simply building a HashSet representing all the strings at once. Remember you can use a 64-bit JVM and set a large heap size.
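The simple version I would try first looks something like this. It is only a sketch, assuming the records (or their key fields) fit on the heap; InMemoryDuplicates and findDuplicates are made-up names. It relies on Set.add returning false when the element is already present:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class InMemoryDuplicates {
    // Returns every record that has already been seen earlier in the input.
    static List<String> findDuplicates(Iterable<String> records) {
        Set<String> seen = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        for (String record : records) {
            if (!seen.add(record)) { // add() is false if the record was already there
                duplicates.add(record);
            }
        }
        return duplicates;
    }
}
```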
Define a class StringLocation that contains the data you would need to do a random access to a string on disk, for example a reference to a RandomAccessFile and an offset within the file. If you cannot keep all the files open at once, open and close them as needed, caching the RandomAccessFile for the most used files.
Create a HashMap<Integer,List<StringLocation>>.
Start reading the strings. For each string, convert to lower case and obtain its hashCode(), hash, in Integer form. If there is an entry in the Map with hash as key, compare the new string to each string represented in the existing value, doing random file access to get to the already processed strings. Use the String equalsIgnoreCase. If there is a match, you have a duplicate. If there is no match, append a new StringLocation, representing the current string, to the List.
This requires at most two strings to be in memory at a time, the one you are currently processing and a previously processed string with the same hashCode() result to which you are comparing it.
You can further reduce the number of times you have to re-read a string for an equals check by using MessageDigest to generate, for the lower case string, a wide checksum with low risk of collisions, and saving it in the StringLocation object. During a comparison, return false if the checksums do not match, without re-reading the strings.
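A sketch of that last step, assuming SHA-256 as the wide checksum (the choice of algorithm is mine; any digest your JVM provides would do). WideChecksum, checksum and mightBeEqual are made-up names:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

class WideChecksum {
    // Digest of the lower-cased string, to be stored alongside the file offset.
    static byte[] checksum(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return md.digest(s.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // If this returns false the strings differ, so there is no need to re-read them from disk.
    static boolean mightBeEqual(byte[] a, byte[] b) {
        return MessageDigest.isEqual(a, b);
    }
}
```

A true result still deserves the full equalsIgnoreCase check if you want certainty, since lower-casing and equalsIgnoreCase can disagree on a few unusual characters.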