I am building a simple versioned key-value store, where I need the ability
to access records by specifying a (key, version) pair. More recent versions store
a pointer to their previous version (i.e., they store the index into the hashmap).
To keep the size of the records small, I hash the (key, version) pair to a Long,
which I then store and use as an index.
My current implementation appends the version to the key (keys are restricted to
alphabetical letters) and uses the native hashCode() function.
Before each put, I test for collisions (i.e., does an entry already exist, and if so,
does it have the same (key, version) pair), and I observe collisions frequently. Is this realistic?
My space is on the order of a million entries. My initial assumption was that this would make collisions very rare, but I was wrong.
Do you have an alternative solution (one which keeps the hash to 64 bits; 2^64 is a lot greater than 1M)? I would like, as much as possible, to avoid the size overhead of SHA/MD5.
Keys are randomly generated strings of length 16 chars
Versions are longs and will span the range 0 to 100000
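For illustration only (not the asker's code): appending the version to the key and calling String.hashCode() yields only a 32-bit value, and for about a million entries the birthday estimate n^2 / 2^33 predicts roughly a hundred colliding pairs, so frequent collisions are exactly what you should expect. A genuine 64-bit hash keeps the stored value a Long while making collisions vanishingly unlikely; one possibility is FNV-1a with its standard 64-bit constants. A minimal sketch, assuming plain ASCII keys:

    // Sketch: 64-bit FNV-1a over the key characters, then the version bytes.
    // The constants are the standard FNV-1a 64-bit offset basis and prime.
    public final class KeyVersionHash {

        private static final long FNV_OFFSET = 0xcbf29ce484222325L;
        private static final long FNV_PRIME  = 0x100000001b3L;

        static long hash64(String key, long version) {
            long h = FNV_OFFSET;
            for (int i = 0; i < key.length(); i++) {      // mix in each key character
                h ^= key.charAt(i);
                h *= FNV_PRIME;
            }
            for (int shift = 0; shift < 64; shift += 8) { // mix in the 8 version bytes
                h ^= (version >>> shift) & 0xffL;
                h *= FNV_PRIME;
            }
            return h;
        }

        public static void main(String[] args) {
            System.out.println(hash64("ABCDEFGHIJKLMNOP", 42L));
        }
    }

With 64 bits of output, the same birthday estimate for a million entries gives an expected number of colliding pairs around 3 x 10^-8, so collision checks on put would almost never trigger.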
I'm looking at this website that lists Big O complexities for various operations. For Dynamic Arrays, the removal complexity is O(n), while for Hash Tables it's O(1).
For Dynamic Arrays like ArrayLists to be O(n), that must refer to removing some value from the middle and then shifting every later element over by one to keep the block of data contiguous. Because if we're just deleting the value stored at index k and not shifting, it's O(1).
But in Hash Tables with linear probing, deletion is the same thing: you run your key through the hash function, go to the Dynamic Array holding your data, and delete the value stored in it.
So why do Hash Tables get O(1) credit while Dynamic Arrays get O(n)?
This is explained here. The key is that the number of values per Dynamic Array is kept under a constant value.
Edit: As Dukeling pointed out, my answer explains why a Hash Table with separate chaining has O(1) removal complexity. I should add that, on the website you were looking at, Hash Tables are credited with O(1) removal complexity because they analyse a Hash Table with separate chaining and not linear probing.
The point of hash tables is that they keep close to the best case, where the best case means a single entry per bucket. Clearly, you have no trouble accepting that to remove the sole entry from a bucket takes O(1) time.
When there are many hash conflicts, you certainly need to do a lot of shifting when using linear probing.
But the stated complexity for hash tables is under the assumption of simple uniform hashing, meaning that it assumes there will be a minimal number of hash conflicts.
When that assumption holds, we only need to delete some value and shift either no values or a small (essentially constant) number of values.
When you talk about the complexity of an algorithm, you actually need to discuss a concrete implementation.
There is no Java class called a "Hash Table" (obviously!) or "HashTable".
There are Java classes called HashMap and Hashtable, and these do indeed have O(1) deletion.
But they don't work the way that you seem to think (all?) hash tables work. Specifically, HashMap and Hashtable are organized as an array of pointers to "chains".
This means that deletion consists of finding the appropriate chain, and then traversing the chain to find the entry to remove. The first step is constant time (including the time to calculate the hash code). The second step is proportional to the length of the hash chain. But assuming that the hash function is good, the average length of the hash chain is a small constant. Hence the total time for deletion is O(1) on average.
The reason that the hash chains are short on average is that the HashMap and Hashtable classes automatically resize the main hash array when the "load factor" (the ratio of the number of entries to the array size) exceeds a predetermined value. Assuming that the hash function distributes the (actual) keys pretty evenly, you will find that the chains are roughly the same length. Assuming that the array size is proportional to the total number of entries, the actual load factor will be the average hash chain length.
This reasoning breaks down if the hash function does not distribute the keys evenly. This leads to a situation where you get a lot of hash collisions. Indeed, the worst-case behaviour is when all keys have the same hash value, and they all end up on a single hash chain with all N entries. In that case, deletion involves searching a chain with N entries ... and that makes it O(N).
It turns out that the same reasoning can be applied to other forms of hash table, including those where the entries are stored in the hash array itself and collisions are handled by rehashing or scanning for a free slot. (Once again, the "trick" is to expand the hash table when the load factor gets too high.)
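To make the separate-chaining case described above concrete, here is a minimal sketch of removal from a chained table (illustrative names, fixed table size, no resizing; not the actual HashMap source). Finding the bucket is O(1); walking the chain costs O(chain length), which the resizing policy keeps at a small constant on average.

    import java.util.Iterator;
    import java.util.LinkedList;

    // Sketch of a hash table with separate chaining, showing O(1)-on-average removal.
    class ChainedMap<K, V> {

        private static class Entry<K, V> {
            final K key;
            V value;
            Entry(K key, V value) { this.key = key; this.value = value; }
        }

        @SuppressWarnings("unchecked")
        private final LinkedList<Entry<K, V>>[] buckets = new LinkedList[16];

        private int indexFor(K key) {
            return (key.hashCode() & 0x7fffffff) % buckets.length;   // O(1): hash + modulo
        }

        void put(K key, V value) {
            int i = indexFor(key);
            if (buckets[i] == null) buckets[i] = new LinkedList<>();
            buckets[i].add(new Entry<>(key, value));  // a real map also replaces duplicates and resizes
        }

        V remove(K key) {
            LinkedList<Entry<K, V>> chain = buckets[indexFor(key)];  // O(1) bucket lookup
            if (chain == null) return null;
            for (Iterator<Entry<K, V>> it = chain.iterator(); it.hasNext(); ) {
                Entry<K, V> e = it.next();                           // O(chain length) scan
                if (e.key.equals(key)) {
                    it.remove();
                    return e.value;
                }
            }
            return null;
        }
    }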
To avoid any confusion, I am reframing my question based on my research on hashing algorithms.
Problem statement
I have multiple text files containing variable-length data records. I need to find out if there are duplicate records in the input. Each of the text files could have millions of data records.
Since I cannot load all the data in memory at once, I plan to create a hash of the key fields in the record when it is processed. Processing a record would mean validating, filtering and transforming it. After processing all the records in all the text files, they are merged to create one view of the whole input (either a text file or a database table).
Finding duplicates would be much faster if a hash value is generated for all the records. If there are collisions of hash values, only those records need to be checked for equality (assuming the hashing algorithm is deterministic).
Question: what hash algorithms should I consider for such input, i.e. variable-length data?
Short Answer
Don't do it. Use the Java map. You can find details here:
http://docs.oracle.com/javase/6/docs/api/java/util/Map.html
Long Answer
You can create a perfect hashing function by treating your string as a number in base N, where N is the number of possible values any character can take on. The problem here is memory. Hashing functions are meant to be used with arrays, which means you'll need an array large enough to handle the results of your hash, and that is impractical.
For instance, take a modest example of a 10 character key. Let's be even more modest and assume they are guaranteed to consist solely of lower-case letters. That gives you 26 possibilities for each character, and 10 characters. This means the possible combinations are:
26 ^ 10 = 141,167,095,653,376
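A minimal sketch of that base-26 encoding (assuming 10-character, lower-case keys as in this example, so the value fits comfortably in a long). The result is collision-free, but using it directly as an array index would need about 1.4 * 10^14 slots, which is exactly the memory problem:

    // Sketch: encode a lower-case string as a base-26 number (a "perfect" index).
    class Base26 {
        static long perfectIndex(String key) {
            long index = 0;
            for (int i = 0; i < key.length(); i++) {
                int digit = key.charAt(i) - 'a';   // each character is one base-26 digit
                if (digit < 0 || digit > 25) {
                    throw new IllegalArgumentException("lower-case letters only");
                }
                index = index * 26 + digit;
            }
            return index;
        }
    }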
If you look up hashing algorithms, one of the first things they include is collision detection because they acknowledge that collisions are a fact of life.
Now, you say you are not loading keys into memory, so why are you using a hash at all? The point of a hash is to give you a mapping onto an array index. Perhaps you're better off using another mechanism.
Possible Solutions
If you are concerned about memory, get some statistics on the duplicates in your file. If you only store a flag to indicate the occurrence of a particular key in the hash, and you have many duplicates, you may be able to just use Java's map. Java's map handles collisions, so that won't keep you from detecting unique keys. You can rest assured that if A[x] is found, that means x is in A, even if x's hash collided with a previous hash.
Next, you could try some utilities to pull out duplicates. Since they would have been written specifically for the purpose, they should be able to handle a large amount of data.
Finally, you could try putting your entries into a database and using that to handle duplicates. This may seem like overkill, but databases are optimized for dealing with very large numbers of records.
This is an extension to the Map idea. Before resorting to this I would check that it cannot be done by simply building a HashSet representing all the strings at once. Remember you can use a 64-bit JVM and set a large heap size.
Define a class StringLocation that contains the data you would need to do a random access to a string on disk - for example, a reference to a RandomAccessFile and an offset within the file. If you cannot keep all the files open at once, open and close as needed, caching the RandomAccessFile for the most used files.
Create a HashMap<Integer,List<StringLocation>>.
Start reading the strings. For each string, convert it to lower case and obtain its hashCode() result, hash, as an Integer. If there is an entry in the Map with hash as the key, compare the new string to each string represented in the existing value, doing random file access to get at the already-processed strings; use String.equalsIgnoreCase(). If there is a match, you have a duplicate. If there is no match, append a new StringLocation, representing the current string, to the List.
This requires at most two strings to be in memory at a time, the one you are currently processing and a previously processed string with the same hashCode() result to which you are comparing it.
You can further reduce the number of times you have to re-read a string for an equals check by using MessageDigest to generate, for the lower case string, a wide checksum with low risk of collisions, and saving it in the StringLocation object. During a comparison, return false if the checksums do not match, without re-reading the strings.
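A rough sketch of this approach (illustrative names; it assumes each record's byte offset and length are known when it is read, and it omits the MessageDigest checksum optimisation from the last paragraph):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // StringLocation remembers where a string lives on disk; only the current string
    // and one previously seen string with the same hashCode() are in memory at once.
    class DuplicateScanner {

        static class StringLocation {
            final RandomAccessFile file;
            final long offset;
            final int length;                            // length in bytes on disk

            StringLocation(RandomAccessFile file, long offset, int length) {
                this.file = file;
                this.offset = offset;
                this.length = length;
            }

            String read() throws IOException {           // random access back to the string
                byte[] buf = new byte[length];
                file.seek(offset);
                file.readFully(buf);
                return new String(buf, StandardCharsets.UTF_8);
            }
        }

        private final Map<Integer, List<StringLocation>> seen = new HashMap<>();

        /** Returns true if an equal (ignoring case) string was already recorded. */
        boolean isDuplicate(String current, StringLocation whereItLives) throws IOException {
            int hash = current.toLowerCase().hashCode();
            List<StringLocation> candidates = seen.get(hash);
            if (candidates == null) {
                candidates = new ArrayList<>();
                seen.put(hash, candidates);
            } else {
                for (StringLocation loc : candidates) {
                    if (current.equalsIgnoreCase(loc.read())) {  // re-read only on a hash match
                        return true;
                    }
                }
            }
            candidates.add(whereItLives);
            return false;
        }
    }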
I have written an algorithm that implements a hash map to solve a problem. I am wondering if anybody can give me some kind of general formula for calculating the average number of hops to find an entry? Just part of my report :)
I have created my own hash code function, and I am trying to measure the quality of it.
By "hops" I mean:
For collision handling: if two or more elements' hashCodes map to the same index in the hash table, I built a "linked list" at that index. So if there are 4 elements that map to an index 'i' in the hash table, then index 'i' contains a linked list of 4 elements. "Hops" in this sense means "walking" or "hopping" through that linked list.
Essentially, there is another data structure at each index of the map.
To be completely explicit, the number of 'hops' along the list in a hashtable which uses lists to handle collisions is identical to the number of hash collisions in the table, which is the number of times hash(item) % table_size evaluates to the same value for the data provided. For hash tables which use the spare slots in the table, colliding items which have been removed from the table also contribute.
For example, if your table size were to increase in whole powers of two but your hash function only had differences in the higher bits, then you would have many collisions in the table even though your external hash has no collisions in its outputs. One technique (IIRC the one used in Sun's implementation) is to use prime numbers as the table size, another is to use a bit-mixing function to process the provided hash function's output before taking the lowest n-bits as the index.
So the number of collisions depends on the spread of values of the provided hash function found in your data (if they all collide, then the table implementation can't do anything), on the choice of table size for a given load factor, and on how the output of the provided hash is converted to a table index.
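As an illustration of the bit-mixing idea mentioned above (a sketch in the spirit of what java.util.HashMap does, not the exact library code): fold the high half of the supplied hashCode into the low half before the low bits are used, so hashes that differ only in their high bits still land in different buckets.

    // Sketch: pre-mix the supplied hashCode before taking the lowest n bits.
    class Mix {
        static int indexFor(Object key, int tableSize) {   // tableSize assumed a power of two
            int h = key.hashCode();
            h ^= (h >>> 16);                               // spread high bits into low bits
            return h & (tableSize - 1);                    // take the lowest n bits as the index
        }
    }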
The performance will depend on the quality of the hash function as well as the distribution of the data. Pick a large representative data set and measure the performance.
Take a sample input set S, calculate the hash value for every element in S, and insert each calculated value into a set H. |S| / |H| is the average number of keys per distinct hash value, i.e. the amount of collision you should expect. This depends on the quality of your own hash function.
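A sketch of that measurement, assuming String keys and hashCode() as the function under test:

    import java.util.Collection;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch: compute |S| / |H| for a sample of keys. Exactly 1.0 means no collisions.
    class HashQuality {
        static double averageKeysPerHash(Collection<String> sample) {
            Set<Integer> hashes = new HashSet<>();
            for (String s : sample) {
                hashes.add(s.hashCode());                  // H = distinct hash values seen
            }
            return (double) sample.size() / hashes.size(); // |S| / |H|
        }
    }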
I am calculating my own hashCode, and I am trying to measure the quality of it.
What you need to do is forget about the hash table, and simply analyze the distribution of hash values across the range of the int type. Ideally you want hash values to be distributed uniformly. Any significant peaks represent potential problems.
The other thing you need to take into account is the distribution of the keys used in your actual application. For instance, the hash function may hash "similar" keys in a way that doesn't give much dispersion. If your application then uses lots of similar keys you will end up with lots of collisions.
If you try to calculate / estimate / measure the number of "hops", you run into the effect of things like the initial HashMap size, the order of key insertion, the effect of resizing and so on.
See the documentation of the Java HashMap:
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets.
In other words, it depends on the quality of the hash function implemented for the items you are storing in it.
I have read about hashtables and open addressing. If you want to insert the keys 18, 32, 44 into a hashtable of size 13:
18 gets index 5 (18 modulus 13 = 5)
32 gets index 6 (32 modulus 13 = 6)
44 gets index 5 (44 modulus 13 = 5)
You'll get a collision because there is already something at index 5.
If you use linear probing you'll do hashfunction = (key + i) modulus N, where i = 0, 1, 2, ... until you find an empty place in the hashtable. Then 44 will be inserted at index 7.
What if you delete 32, and then you want to delete 44? You start by looking at hashfunction(44) = 5 - that is not 44; then hashfunction(44 + 1) = 6 - that is empty. Then you might think that 44 is gone. How do you mark a place in the hashtable to indicate that, although it does not currently contain a key, it is not really empty, and that you should keep looking for 44 at the next index?
If you then need to insert another key at index 6 then the key just overwrites the "mark" in the hashtable.
What could you use to mark an index - saying a key has been here but has since been deleted - so that you continue to look at the next index? You can't just write null or 0, because then either you'd think the key has been deleted (null) or that a key with value 0 has overwritten 44.
One way to handle hash tables using open addressing is to use state marks: EMPTY, OCCUPIED and DELETED. Note that there's an important distinction between EMPTY, which means the position has never been used and DELETED, which means it was used but got deleted.
When a value gets removed, the slot is marked as DELETED, not EMPTY. When you try to retrieve a value, you'll probe until you find a slot that's marked EMPTY; i.e., you consider DELETED slots to be the same as OCCUPIED. Note that insertion can ignore this distinction - you can insert into a DELETED or EMPTY slot.
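A minimal sketch of that scheme with linear probing (illustrative names, integer keys, fixed capacity, no resizing; it assumes the table never completely fills):

    import java.util.Arrays;

    // Sketch of open addressing with linear probing and EMPTY/OCCUPIED/DELETED marks.
    class ProbingTable {
        private enum State { EMPTY, OCCUPIED, DELETED }

        private final int[] keys;
        private final State[] states;

        ProbingTable(int capacity) {
            keys = new int[capacity];
            states = new State[capacity];
            Arrays.fill(states, State.EMPTY);
        }

        void insert(int key) {
            int i = Math.floorMod(key, keys.length);
            while (states[i] == State.OCCUPIED) {          // insertion may reuse DELETED slots
                i = (i + 1) % keys.length;
            }
            keys[i] = key;
            states[i] = State.OCCUPIED;
        }

        boolean contains(int key) {
            int i = Math.floorMod(key, keys.length);
            while (states[i] != State.EMPTY) {             // DELETED does NOT stop the probe
                if (states[i] == State.OCCUPIED && keys[i] == key) return true;
                i = (i + 1) % keys.length;
            }
            return false;                                  // probing stops only at EMPTY
        }

        void remove(int key) {
            int i = Math.floorMod(key, keys.length);
            while (states[i] != State.EMPTY) {
                if (states[i] == State.OCCUPIED && keys[i] == key) {
                    states[i] = State.DELETED;             // tombstone: reusable, keeps probes going
                    return;
                }
                i = (i + 1) % keys.length;
            }
        }
    }

With the example above (table size 13, keys 18, 32, 44): removing 32 leaves a DELETED mark at index 6, so a later lookup of 44 probes index 5 (18), passes the tombstone at 6, and finds 44 at index 7.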
The question is tagged Java, which is a bit misleading because Java (or at least Oracle's implementation of it) does not use open addressing. Open addressing gets especially problematic when the load factor gets high, because hash collisions then occur much more often: lookup performance drops off dramatically once the load factor gets near the 0.7 mark. Most hashtables get resized once their load factor passes a certain constant. Java, for example, doubles the size of its HashMap when the load factor gets past 0.75.
It seems like you are trying to implement your own hash table (in contrast to using the Hashtable or HashMap included in java), so it's more a data structure question than a java question.
That being said, implementing a hash table with open addressing (such as linear probing) is not very efficient when it comes to removing elements. The normal solution is to "pull up" all elements that are in the wrong slot so there won't be any spaces in the probing.
There is some pseudo code describing this quite well at wikipedia:
http://en.wikipedia.org/wiki/Open_addressing
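A rough Java rendering of that "pull up" deletion, in the spirit of the Wikipedia pseudocode (a sketch only: integer keys, 0 reserved as the empty marker, fixed capacity, assumes the table never fills; no tombstones are needed because entries are shifted back into the hole):

    // Sketch of linear probing where delete shifts displaced entries back into the hole.
    class ShiftingTable {
        private final int[] slots;                         // 0 means empty

        ShiftingTable(int capacity) { slots = new int[capacity]; }

        private int home(int key) { return Math.floorMod(key, slots.length); }

        void insert(int key) {
            int i = home(key);
            while (slots[i] != 0) i = (i + 1) % slots.length;
            slots[i] = key;
        }

        void remove(int key) {
            int i = home(key);
            while (slots[i] != 0 && slots[i] != key) i = (i + 1) % slots.length;
            if (slots[i] == 0) return;                     // key not present
            slots[i] = 0;                                  // open a hole, then pull entries up
            int j = i;
            while (true) {
                j = (j + 1) % slots.length;
                if (slots[j] == 0) return;                 // reached a real gap: done
                int k = home(slots[j]);
                // If k lies cyclically in (i, j], the entry at j is still reachable from
                // its home slot without passing the hole, so it can stay where it is.
                boolean reachable = (i <= j) ? (i < k && k <= j) : (i < k || k <= j);
                if (!reachable) {                          // otherwise move it into the hole
                    slots[i] = slots[j];
                    slots[j] = 0;
                    i = j;
                }
            }
        }
    }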
The hash table buckets aren't limited to storing a single value, so if two objects hash to the same location in the table they will both be stored. The collision only means that lookup will be slightly slower, because when looking for a value whose key hashes to a particular location, it will need to check each entry there to see if it matches.
It sounds like you are describing a hash table where you only store a single entry at each index. The only way I can think to do that is to add a field to the structure storing the value that indicates whether that position had a collision. Then, when doing a lookup, you'd check the key; if it was a match, you have the value. If not, you would check whether there was a collision and then check the next position. On a removal you'd have to leave the collision marker but delete the value and key.
If you use a hash table which uses this approach (which none of the built-in hash collections do), you need to traverse all the later keys to see if they need to be moved up (to avoid holes). Some might be for the same hash value and some might be collisions for unrelated hash codes. If you do this you are not left with any holes. For a hash map which is not too full, this shouldn't create much overhead.
Say I have a population of key-value pairs which I plan to store in a hash table. The population is fixed and will never change. What optimizations are available to me to make the hash table as fast as possible? Which optimizations should I concentrate on? This is assuming I have a lot of space. There will be a reasonable number of pairs (say no more than 100,000).
EDIT: I want to optimize look up. I don't care how long it takes to build.
I would make sure that your keys hash to unique values. This will ensure that every lookup will be constant time, and thus as fast as possible.
Since you can never have more than 100,000 keys, it is entirely possible to have 100,000 hash values.
Also, make sure that you use the constructor that takes an int to specify the initial capacity (set it to 100,000) and a float to set the load factor (use 1). Doing this requires that you have a perfect hash function for your keys, but it will result in the fastest possible lookup, in the least amount of memory.
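For reference, a sketch of that construction, using HashMap(int initialCapacity, float loadFactor); HashMap rounds the requested capacity up to the next power of two internally, so in practice it will not resize for 100,000 entries:

    import java.util.HashMap;
    import java.util.Map;

    class Presized {
        // Pre-size the map so it never needs to resize for up to 100,000 entries.
        static final Map<String, String> TABLE = new HashMap<>(100_000, 1.0f);
    }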
In general, to optimize a hash table, you want to minimize collisions in the determination of your hash, so your buckets won't contain more than one item and the hash-search will return immediately.
Most of the time, that means you should measure the output of your hash function on the problem space, so I guess I'd recommend looking into that.
Ensure there are no collisions. If there are no collisions, you are guaranteed O(1) constant look-up time. The next optimization would then be the look-up.
Use a profiler to optimize piece by piece. It's hard to do that without one.
If it's possible to make a hash table large enough that there are no collisions at all, that is ideal, since your insertions and lookups will then be done in constant time.
But if that is not possible, try to choose a hash function such that your keys get distributed uniformly across the hash table.
Perfect hashing algorithms deal with the problem, but may not scale to 100k objects. I found a Java MPH package, but haven't tried it.
If the population is known at compile time, then the optimal solution is to use a minimal perfect hash function (MPH). The Wikipedia page on this subject links to several Java tools that can generate these.
The optimization must be done in the hashCode method of the key class. The thing to keep in mind is to implement this method so as to avoid collisions.
Getting the perfect hashing algorithm to give totally unique values to 100K objects is likely to be close to impossible. Consider the birthday paradox. The date on which people are born can be considered a perfect hashing algorithm yet if you have more than 23 people you are more than likely to have a collision, and that is in a table of 365 dates.
So how big a table will you need to have no collisions in 100K?
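As a rough answer to that question, using the standard birthday approximation (and assuming the hash behaves like a uniformly random function over a table of n slots):

    P(\text{no collision among } m \text{ keys}) \approx e^{-m(m-1)/(2n)}
    \quad\Rightarrow\quad
    n \gtrsim \frac{m^2}{2\ln 2} \approx 7.2 \times 10^{9} \text{ for } m = 100{,}000

So even a 50/50 chance of avoiding every collision among 100K random hashes needs a table with billions of slots, which is the point being made here.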
If your keys are strings, your optimal strategy is a tree, not binary but n-branch at each character. If the keys are lower-case only it is easier still, as you need just 26 branches whenever you create a node.
We start with 26 branches at the root. Follow the first character, say f.
f might have a value associated with it. And it may have sub-trees. Look up the subtree for o. This leads to more subtrees; then look up the next o. (You knew where that was leading!) If this doesn't have a value associated with it, or we hit a null sub-tree on the way, we know the value is not found.
You can optimise the space on the tree where you hit a point of uniqueness. Say you have a key january and it becomes unique at the 4th character. At this point where you assign the value you also store the actual string associated with it. In our example there may be one value associated with foo but the key it relates to may be food, not foo.
I think google search engines use a technique similar to this.
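A minimal sketch of that kind of 26-way character trie (assuming lower-case keys; the "store the full key once the prefix becomes unique" optimisation described above is left out):

    // Sketch: a 26-way character trie mapping lower-case string keys to values.
    class Trie<V> {
        private static final int BRANCHES = 26;                    // 'a' .. 'z'

        @SuppressWarnings("unchecked")
        private final Trie<V>[] children = new Trie[BRANCHES];
        private V value;                                           // value for a key ending here

        void put(String key, V val) {
            Trie<V> node = this;
            for (int i = 0; i < key.length(); i++) {
                int c = key.charAt(i) - 'a';                       // one branch per letter
                if (node.children[c] == null) node.children[c] = new Trie<>();
                node = node.children[c];
            }
            node.value = val;
        }

        V get(String key) {
            Trie<V> node = this;
            for (int i = 0; i < key.length(); i++) {
                int c = key.charAt(i) - 'a';
                if (node.children[c] == null) return null;         // null sub-tree: not found
                node = node.children[c];
            }
            return node.value;
        }
    }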
The key question is what your key is. (No pun intended.) As others have pointed out, the goal is to minimize the number of hash collisions. If you can get the number of hash collisions to zero, i.e. your hash function generates a unique value for every key that is actually passed to it, you will have a perfect hash.
Note that in Java, a hash function really has two steps: first the key is run through the hashCode function for its class, then we calculate an index into the hash table by taking this value modulo the size of the hash table.
I think that people discussing the perfect hash function tend to forget that second step. Even if you wrote a hashCode function that generated a unique value for every key passed to it, you could still get an absolutely terrible hash if this value modulo the hash table size is not unique. For example, say you have 100 keys and your hashCode function returns the values 1, 1001, 2001, 3001, 4001, 5001, ... 99001. If your hash table has 100,000 slots, this would be a perfect hash. Every key gets its own slot. But if it has 1000 slots, they all hash to the same slot. It would be the worst possible hash.
So consider constructing a good hash function. Take the extreme cases. Suppose that your key is a date. You know that the dates will all be in January of the same year. Then using the day of the month as the hash value should be as good as it's going to get: everything will hash to a unique integer in a small range. On the other hand, if your dates were all the first of the month for many years and many months, taking the day of the month would be a terrible hash, as every actual key would map to "1".
My point being that if you really want to optimize your hash, you need to know the nature of your data. What is the actual range of values that you will get?