I have written an algorithm that implements a hash map to solve a problem. I am wondering if anybody can give me some kind of general formula for calculating the average number of hops to find an entry? Just part of my report :)
I have created my own hash code function, and I am trying to measure the quality of it.
By "hops" I mean:
For collision handling: if the hash codes of two or more elements map to the same index in the hash table, I build a "linked list" at that index. So if there are 4 elements that are mapped to an index 'i' in the hash table, then the index 'i' contains a linked list of 4 elements. "Hops" in this sense is "walking" or "hopping" through that linked list.
Essentially, there is another data structure at each index of the map.
To be completely explicit, the number of 'hops' along the list in a hashtable which uses lists to handle collisions is identical to the number of hash collisions in the table, which will be the number of times hash(item) % size of table evaluates to the same value for the data provided. For hash tables which use the spare slots in the table, colliding items which have been removed from the table also contribute.
For example, if your table size were to increase in whole powers of two but your hash function only had differences in the higher bits, then you would have many collisions in the table even though your external hash has no collisions in its outputs. One technique (IIRC the one used in Sun's implementation) is to use prime numbers as the table size, another is to use a bit-mixing function to process the provided hash function's output before taking the lowest n-bits as the index.
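A small sketch of the situation described above, using made-up hash codes that differ only in their upper 16 bits and a power-of-two table of 16 slots. The mix step is the XOR-shift that, as far as I know, OpenJDK's HashMap applies to hashCode() before masking; the class and method names are only for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class BitMixDemo {
    /** Folds the high 16 bits of h into its low 16 bits. */
    static int mix(int h) {
        return h ^ (h >>> 16);
    }

    /** Counts how many distinct slots 16 sample hash codes occupy in a power-of-two table. */
    static int distinctIndexes(int tableSize, boolean mixed) {
        Set<Integer> indexes = new HashSet<>();
        for (int k = 0; k < 16; k++) {
            int h = k << 16;                      // sample hash: only high bits differ
            int used = mixed ? mix(h) : h;
            indexes.add(used & (tableSize - 1));  // lowest n bits, valid for power-of-two sizes
        }
        return indexes.size();
    }
}
```

Without mixing, all 16 sample values land in one slot of a 16-slot table; with mixing they spread over all 16 slots.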
So the number of collisions depends on the spread of values of the provided hash function found in your data ( if they all collide, then the table implementation can't do anything ), on the choice of table size for a given load factor, and how the output of the provided hash is converted to a table index.
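If you want a number for the report, one option is to measure it directly. The sketch below assumes "hops" means the number of list nodes examined in a successful lookup, and assumes the index is hashCode modulo the table size; HopCounter and averageHops are hypothetical names, not part of any library.

```java
import java.util.List;

class HopCounter {
    /** Average number of chain nodes examined per successful lookup. */
    static double averageHops(List<?> keys, int tableSize) {
        if (keys.isEmpty()) {
            return 0.0;
        }
        int[] chainLength = new int[tableSize];
        for (Object key : keys) {
            // floorMod keeps the index non-negative even for negative hash codes.
            chainLength[Math.floorMod(key.hashCode(), tableSize)]++;
        }
        long totalHops = 0;
        for (int len : chainLength) {
            // Reaching the 1st, 2nd, ..., len-th node of a chain costs 1 + 2 + ... + len.
            totalHops += (long) len * (len + 1) / 2;
        }
        return (double) totalHops / keys.size();
    }
}
```

For comparison, the usual textbook estimate for chaining under ideal uniform hashing is roughly 1 + a/2 nodes per successful lookup, where a is the load factor (entries divided by table size). A measured value well above that suggests the hash function or the index mapping is clustering your keys.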
The performance will depend on the quality of the hash function as well as the distribution of the data. Pick a large representative data set and measure the performance.
Take a sample input set S, calculate the hash value of every element in S, and insert each calculated value into a set H. |S| / |H| is then the average number of elements that share each distinct hash value, which tells you how many collisions to expect. This depends on the quality of your own hash function.
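The |S| / |H| measurement could look something like this. It only looks at raw hashCode() values, not at table indexes, and the names are made up for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HashSpread {
    /** Returns |S| / |H|: elements per distinct hash code (1.0 means no duplicates). */
    static double elementsPerDistinctHash(List<?> sample) {
        if (sample.isEmpty()) {
            throw new IllegalArgumentException("sample must not be empty");
        }
        Set<Integer> distinctHashes = new HashSet<>();
        for (Object element : sample) {
            distinctHashes.add(element.hashCode());
        }
        return (double) sample.size() / distinctHashes.size();
    }
}
```

For instance, the strings "Aa" and "BB" have the same String.hashCode(), so a sample of "Aa", "BB" and "C" gives 3 / 2 = 1.5.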
I am calculating my own hashCode, and I am trying to measure the quality of it.
What you need to do is forget about the hash table, and simply analyze the distribution of hash values across the range of the int type. Ideally you want hash values to be distributed uniformly. Any significant peaks represent potential problems.
The other thing you need to take into account is the distribution of the keys used in your actual application. For instance, the hash function may hash "similar" keys in a way that doesn't give much dispersion. If your application then uses lots of similar keys you will end up with lots of collisions.
If you try to calculate / estimate / measure the number of "hops", you run into the effect of things like the initial HashMap size, the order of key insertion, the effect of resizing and so on.
See the documentation of the Java HashMap:
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets.
In other words, it depends on the quality of the hash function implemented for the items you are storing in it.
I am making a custom implementation of a HashMap, without using the HashMap data structure, as part of an assignment. Currently I have the choice of working with two 1D arrays or using a 2D array to store my keys and values. I want to be able to check if a key exists and return the corresponding value in O(1) time complexity (assignment requirement), but I am assuming it is without the use of containsKey().
Also, when inserting key and value pairs into my arrays, I am confused, because logically it should not be O(1): there would occasionally be a collision, and I would have to recalculate the index. So why is the assignment requirement for insertion O(1)?
A lot of questions in there, let me give it a try.
I want to be able to check if a key exists and return the
corresponding value in O(1) time complexity (assignment requirement)
but i am assuming it is without the use of containsKey().
That actually doesn't make a difference. O(1) means the execution time is bounded by a constant that does not depend on the size of the input; it does not mean a single operation is used. If your containsKey() and put() implementations are both O(1), then so is your solution that uses both of them exactly once.
Also, when inserting key and value pairs to my arrays, i am confused
because it should not be O(1) logically, since there would
occasionally be cases where there is collision and i have to
recalculate the index, so why is the assignment requirement for
insertion O(1)?
O(1) is the expected cost when hash collisions are rare. The worst case is O(n), which happens if every key generates the same hash code. So when a hash map's lookup or insertion performance is quoted as O(1), that assumes a hashCode implementation that spreads the keys well.
Finally, when it comes to data structures, the usual approach is to use a single array, where the array items are linked list nodes. The array offsets correspond to hashCode() % array size (there are much more advanced formulas than this, but this is a good starting point). In case of a hash collision, you will have to navigate the linked list nodes until you find the correct entry.
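A minimal sketch of that layout, assuming a fixed capacity, non-null keys and no resizing. ChainedMap is a made-up name and this is not how java.util.HashMap is written, just the bare idea.

```java
class ChainedMap<K, V> {
    /** One linked list node; colliding keys are chained through next. */
    private static final class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;

        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node<K, V>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedMap(int capacity) {
        buckets = (Node<K, V>[]) new Node[capacity];
    }

    private int indexFor(K key) {
        // floorMod avoids negative indexes when hashCode() is negative.
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    V get(K key) {
        for (Node<K, V> n = buckets[indexFor(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                return n.value;
            }
        }
        return null; // key not present
    }

    void put(K key, V value) {
        int i = indexFor(key);
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                n.value = value; // overwrite existing entry
                return;
            }
        }
        buckets[i] = new Node<>(key, value, buckets[i]); // prepend to the chain
    }
}
```

Both get and put walk at most one chain, so they stay close to constant time as long as the chains stay short.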
You're correct in that hash table insert is not guaranteed to be O(1) due to collisions. If you use the open addressing strategy to deal with collisions, the process to insert an item is going to take time proportional to 1/(1-a) where a is the proportion of how much of the table capacity has been used. As the table fills up, a goes to 1, and the time to insert grows without bound.
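To put numbers on the 1/(1-a) estimate (reading it as the expected number of probes per insert under idealized uniform probing, which is an assumption about the probing scheme):

```latex
\frac{1}{1-a}:\qquad
a = 0.5 \Rightarrow 2,\qquad
a = 0.75 \Rightarrow 4,\qquad
a = 0.9 \Rightarrow 10,\qquad
a = 0.99 \Rightarrow 100
```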
The secret to keeping time complexity as O(1) is making sure that there's always room in the table. That way a never grows too big. That's why you have to resize the table when it starts to run out of capacity.
Problem: resizing a table with N items takes O(N) time.
Solution: increase the capacity exponentially, for example double it every time you need to resize. This way the table has to be resized very rarely. The cost of the occasional resize operations is "amortized" over a large number of insertions, and that's why people say that hash table insertion has "amortized O(1)" time complexity.
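A quick way to convince yourself of the amortized claim is to count how many items get moved by resizes. This sketch simplifies things by resizing only when the table is completely full (rather than at 70-80%) and it only counts copies; it does not store anything. The names are hypothetical.

```java
class DoublingCost {
    /** Counts element copies caused by resizes while inserting n items, doubling when full. */
    static long copiesForInserts(int n, int initialCapacity) {
        long copies = 0;
        int capacity = initialCapacity;
        int size = 0;
        for (int i = 0; i < n; i++) {
            if (size == capacity) {
                copies += size;  // a resize moves every existing item
                capacity *= 2;   // grow by a constant factor
            }
            size++;
        }
        return copies;
    }
}
```

Starting from capacity 1, inserting 1000 items causes resizes at sizes 1, 2, 4, ..., 512, for a total of 1023 copies. That is fewer than 2 copies per insertion on average, which is where the "amortized O(1)" comes from.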
TLDR: Make sure you increase the table capacity when it's getting full, maybe 70-80% utilization. When you increase the capacity, make sure that it's by a constant factor, for example doubling it.
I can't figure out how to implement a special hash table.
The idea would be that the hash table gives an approximate
match. So a perfect hash table (such as found in java.util)
just gives a map, such that:
Hashtable h = new Hashtable();
...
x = h.get(y);
If x is the result of applying the map h to the argument y,
i.e. basically in mathematics it would be a function
namely x = h(y). Now for the approximate match, what about a
data structure that gives me quickly:
x = h(k) where k=max { z<=y | h(z)!=null }
The problem is k can be very far away from the given y. For example
y could be 2000, and the next occupied slot k could be 1000. Some
linear search would be costly, the data structure should do the job
more quickly.
I know how to do it with a tree(*), but can something with a hash
also work? Or maybe combine some tree and hash properties in the
sought-after data structure? Some data structure that tends toward O(1) access?
Bye
(*) You can use a tree ordered by y, and find something next below or equal y.
This is known as Spatial hashing. Keep in mind it has to be tailored for your specific domain.
It can be used when the hash tells you something about the logical arrangement of objects, that is, when |hash(a)-hash(b)| < |hash(a)-hash(c)| means b is closer/more similar to a than c is.
Then the basic idea is that you divide the space into buckets (e.g. drop the least significant digits of the hash -- the naive approach) and your spatial hash is this bucket ID. You have to take care of the edge cases, when the objects are very near to each other, while being on the boundary of the buckets (e.g. h(1999) = 1 but h(2000)=2). You can solve this by two overlapping hashes and having two separate hash maps for them and querying both of them, or looking to the neighboring buckets etc...
As I said in the beginning, this has to be thought through very well.
The tree (e.g. KD-tree for higher dimensions) isn't so demanding in the design phase and is generally a more convenient approach to nearest neighbor(s) querying.
The specific formula you give suggests you want a set that can retrieve the greatest item less than a given input.
One simple approach to achieving that would be to keep a sorted list of the items, and perform a binary search to locate the position in the list at which the given element would be inserted. If the element at that position equals the input, return it; otherwise return the element just before that position.
As always, any set can be converted into a map by using a pair object to wrap the key-value pair, or by maintaining a parallel data structure for the values.
For an array-based approach, the runtime will be O(log n) for retrieval and O(n) for insertion of a single element. If 'add all' sorts the added elements and then merges them, it can be O(n log n).
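A sketch of the array-based lookup, assuming int keys held in an already sorted array. FloorLookup is a made-up name; Arrays.binarySearch is the standard java.util method, which returns -(insertion point) - 1 when the key is absent.

```java
import java.util.Arrays;

class FloorLookup {
    /** Returns the greatest element <= y in the sorted array, or null if there is none. */
    static Integer floor(int[] sorted, int y) {
        int pos = Arrays.binarySearch(sorted, y);
        if (pos >= 0) {
            return sorted[pos];            // exact match
        }
        int insertionPoint = -pos - 1;     // index of the first element greater than y
        if (insertionPoint == 0) {
            return null;                   // every element is greater than y
        }
        return sorted[insertionPoint - 1]; // the element just below y
    }
}
```

With occupied keys 1000, 1500 and 3000, a query for 2000 returns 1500. If you need insertion to be O(log n) as well, java.util.TreeMap offers floorKey and floorEntry, which do the same lookup on a balanced tree.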
It's not possible (see note 1) to have a constant-time algorithm that can answer what the first element less than a given element is using a hashing approach. A good hashing algorithm spreads out similar (but non-equal) items, to avoid having many similar items fall into the same bucket and destroy the desired constant-time retrieval behavior. This means the elements of a hash set (or map) are very deliberately not even remotely close to sorted order; they are as close to randomly distributed as we could achieve while using an efficient repeatable hashing algorithm.
1. Of course, proving that it's not possible is difficult, since one can't easily prove that there isn't a simple repeatable constant-time request that will reliably convince an oracle (or God, if God were that easy to manipulate) to give you the answer to the question you want, but it seems unlikely.
I'm looking at this website that lists Big O complexities for various operations. For Dynamic Arrays, the removal complexity is O(n), while for Hash Tables it's O(1).
For Dynamic Arrays like ArrayLists to be O(n), that must mean the operation of removing some value from the center and then shifting each index over one to keep the block of data contiguous. Because if we're just deleting the value stored at index k and not shifting, it's O(1).
But in Hash Tables with linear probing, deletion is the same thing, you just run your value through the Hash function, go to the Dynamic Array holding your data, and delete the value stored in it.
So why do Hash Tables get O(1) credit while Dynamic Arrays get O(n)?
This is explained here. The key is that the number of values per Dynamic Array is kept under a constant value.
Edit: As Dukeling pointed out, my answer explains why a Hash Table with separate chaining has O(1) removal complexity. I should add that, on the website you were looking at, Hash Tables are credited with O(1) removal complexity because they analyse a Hash Table with separate chaining and not linear probing.
The point of hash tables is that they keep close to the best case, where the best case means a single entry per bucket. Clearly, you have no trouble accepting that to remove the sole entry from a bucket takes O(1) time.
When there are many hash conflicts, you certainly need to do a lot of shifting when using linear probing.
But the complexity for hash tables is given under the assumption of Simple Uniform Hashing, meaning that it assumes that there will be a minimal number of hash conflicts.
When this happens, we only need to delete some value and shift either no values or a small (essentially constant) number of values.
When you talk about the complexity of an algorithm, you actually need to discuss a concrete implementation.
There is no Java class called a "Hash Table" (obviously!) or "HashTable".
There are Java classes called HashMap and Hashtable, and these do indeed have O(1) deletion.
But they don't work the way that you seem to think (all?) hash tables work. Specifically, HashMap and Hashtable are organized as an array of pointers to "chains".
This means that deletion consists of finding the appropriate chain, and then traversing the chain to find the entry to remove. The first step is constant time (including the time to calculate the hash code). The second step is proportional to the length of the hash chains. But assuming that the hash function is good, the average length of the hash chain is a small constant. Hence the total time for deletion is O(1) on average.
The reason that the hash chains are short on average is that the HashMap and Hashtable classes automatically resize the main hash array when the "load factor" (the ratio of the number of entries to the array size) exceeds a predetermined value. Assuming that the hash function distributes the (actual) keys pretty evenly, you will find that the chains are roughly the same length. Assuming that the array size is proportional to the total number of entries, the actual load factor will be the average hash chain length.
This reasoning breaks down if the hash function does not distribute the keys evenly. This leads to a situation where you get a lot of hash collisions. Indeed, the worst-case behaviour is when all keys have the same hash value, and they all end up on a single hash chain with all N entries. In that case, deletion involves searching a chain with N entries ... and that makes it O(N).
It turns out that the same reasoning can be applied to other forms of hash table, including those where the entries are stored in the hash array itself, and collisions are handled by rehashing or scanning. (Once again, the "trick" is to expand the hash table when the load factor gets too high.)
Let's say we had a stream of data coming in to us (of unknown range and distribution) and we wanted to store the last X values in a hash table which provides O(1) access. How would we do this?
For simplicity, let's say the data is a stream of numbers of unknown range and distribution.
In order to map these numbers to an element of an array we would need a hash function that takes account of the data range and distribution.
I guess we'd either estimate this range upfront or maintain some statistics on the data coming in and adjust the hash function accordingly.
Also we'd need a way of rejigging the array once the X threshold is met.
Any thoughts or ideas for doing this as fast as possible?
Dynamic Perfect Hashing seems to be pretty close. Combine this with an array growing strategy and you're good to go: http://en.wikipedia.org/wiki/Dynamic_perfect_hashing
Say I have a population of key-value pairs which I plan to store in a hash table. The population is fixed and will never change. What optimizations are available to me to make the hash table as fast as possible? Which optimizations should I concentrate on? This is assuming I have a lot of space. There will be a reasonable number of pairs (say no more than 100,000).
EDIT: I want to optimize look up. I don't care how long it takes to build.
I would make sure that your keys hash to unique values. This will ensure that every lookup will be constant time, and thus, as fast as possible.
Since you can never have more than 100,000 keys, it is entirely possible to have 100,000 hash values.
Also, make sure that you use the constructor that takes an int to specify the initial capacity (Set it to 100,000), and a float to set the load factor. (Use 1) Also, doing this requires that you have a perfect hash function for your keys. But, this will result in the fastest possible lookup, in the least amount of memory.
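For reference, the constructor call would look roughly like this. The String keys and Integer values are placeholders for whatever your real key and value types are, and PresizedMap is a made-up helper.

```java
import java.util.HashMap;
import java.util.Map;

class PresizedMap {
    /** Builds a map sized up front so that it never has to resize while being filled. */
    static Map<String, Integer> build(String[] keys) {
        // Initial capacity 100,000 and load factor 1, as suggested above.
        // OpenJDK's HashMap rounds the capacity up to a power of two (131,072 here).
        Map<String, Integer> map = new HashMap<>(100_000, 1.0f);
        for (int i = 0; i < keys.length; i++) {
            map.put(keys[i], i); // placeholder value: the key's position
        }
        return map;
    }
}
```

Pre-sizing only avoids resizes during the build. Whether lookups are collision-free still depends entirely on the keys' hashCode().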
In general, to optimize a hash table, you want to minimize collisions in the determination of your hash, so your buckets won't contain more than one item and the hash-search will return immediately.
Most of the time, that means that you should measure the output of your hash function on the problem space. So I guess I'd recommend looking into that.
Ensure there are no collisions. If there are no collisions, you are guaranteed O(1) constant look-up time. The next optimization would then be the look-up.
Use a profiler to optimize piece by piece. It's hard to do without one.
If it's possible to make a large hash table such that there are no collisions at all, that would be ideal, since your insertions and lookups will be done in constant time.
But if that is not possible, try to choose a hash function such that your keys get distributed uniformly across the hash table.
Perfect hashing algorithms deal with the problem, but may not scale to 100k objects. I found a Java MPH package, but haven't tried it.
If the population is known at compile time, then the optimal solution is to use a minimal perfect hash function (MPH). The Wikipedia page on this subject links to several Java tools that can generate these.
The optimization must be done in the hashCode method of the key class. The thing to have in mind is to implement this method to avoid collisions.
Getting a hashing algorithm to give totally unique values to 100K objects by chance is likely to be close to impossible. Consider the birthday paradox. The date on which people are born can be considered a very good hashing algorithm, yet if you have 23 or more people you are more likely than not to have a collision, and that is in a table of 365 dates.
So how big a table will you need to have no collisions in 100K?
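To get a feel for the numbers, here is a sketch that computes the collision probability directly. It assumes every key lands in a uniformly random slot, which is the birthday-paradox model and not a statement about any particular hash function; the names are made up.

```java
class BirthdayBound {
    /** Probability that n keys dropped uniformly at random into m slots collide at least once. */
    static double collisionProbability(int n, long m) {
        if (n > m) {
            return 1.0; // more keys than slots: a collision is certain
        }
        double noCollision = 1.0;
        for (int i = 0; i < n; i++) {
            // The (i+1)-th key must avoid the i slots that are already taken.
            noCollision *= (double) (m - i) / m;
        }
        return 1.0 - noCollision;
    }
}
```

collisionProbability(23, 365) comes out just above 0.5, matching the birthday paradox. For 100,000 keys, even a table with 2^32 slots gives roughly 0.69 under this model, so collision-free lookup for a fixed key set has to come from a hash function constructed for those keys rather than from table size alone.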
If your keys are strings, your optimal strategy is a tree, not binary but n-branch at each character. If the keys are lower-case only it is easier still as you need just 26 whenever you create a branch.
We start with 26 keys. Follow the first character, say f
f might have a value associated with it. And it may have sub-trees. Look up a subtree of o. This leads to more subtrees then look up the next o. (You knew where that was leading!). If this doesn't have a value associated with it, or we hit a null sub-tree on the way, we know the value is not found.
You can optimise the space on the tree where you hit a point of uniqueness. Say you have a key january and it becomes unique at the 4th character. At this point where you assign the value you also store the actual string associated with it. In our example there may be one value associated with foo but the key it relates to may be food, not foo.
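A bare-bones sketch of the 26-way tree, assuming keys consist only of the lower-case letters a-z. It leaves out the space optimisation described above (storing the remaining string at the point of uniqueness), and Trie is a name chosen for the example.

```java
class Trie<V> {
    private static final class Node<V> {
        @SuppressWarnings("unchecked")
        final Node<V>[] children = (Node<V>[]) new Node[26]; // one branch per letter a-z
        V value;                                              // null if no key ends here
    }

    private final Node<V> root = new Node<>();

    void put(String key, V value) {
        Node<V> node = root;
        for (int i = 0; i < key.length(); i++) {
            int branch = key.charAt(i) - 'a';
            if (node.children[branch] == null) {
                node.children[branch] = new Node<>();
            }
            node = node.children[branch];
        }
        node.value = value;
    }

    V get(String key) {
        Node<V> node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children[key.charAt(i) - 'a'];
            if (node == null) {
                return null; // hit a null sub-tree: key not found
            }
        }
        return node.value;   // may be null if only longer keys pass through here
    }
}
```

Lookup cost depends on the length of the key, not on the number of stored entries.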
I think Google's search engine uses a technique similar to this.
The key question is what your key is. (No pun intended.) As others have pointed out, the goal is to minimize the number of hash collisions. If you can get the number of hash collisions to zero, i.e. your hash function generates a unique value for every key that is actually passed to it, you will have a perfect hash.
Note that in Java, a hash function really has two steps: First the key is run through the hashCode function for its class. Then we calculate an index value into the hash table by taking this value modulo the size of the hash table.
I think that people discussing the perfect hash function tend to forget that second step. Even if you wrote a hashCode function that generated a unique value for every key passed to it, you could still get an absolutely terrible hash if this value modulo the hash table size is not unique. For example, say you have 100 keys and your hashCode function returns the values 1, 1001, 2001, 3001, 4001, 5001, ... 99001. If your hash table has 100,000 slots, this would be a perfect hash. Every key gets its own slot. But if it has 1000 slots, they all hash to the same slot. It would be the worst possible hash.
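The example above is easy to check. This sketch counts how many distinct slots the hash codes 1, 1001, 2001, ..., 99001 occupy for a given table size, using plain modulo as the second step (real implementations may mix bits first); SlotSpread is a made-up name.

```java
import java.util.HashSet;
import java.util.Set;

class SlotSpread {
    /** Number of distinct slots used by the hash codes 1, 1001, 2001, ..., 99001. */
    static int distinctSlots(int tableSize) {
        Set<Integer> slots = new HashSet<>();
        for (int k = 0; k < 100; k++) {
            int hashCode = 1 + 1000 * k;     // unique for every key
            slots.add(hashCode % tableSize); // second step: hash code -> table index
        }
        return slots.size();
    }
}
```

With 100,000 slots the 100 keys use 100 distinct slots; with 1000 slots they all share slot 1. A prime table size such as 1009 spreads them over 100 slots again.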
So consider constructing a good hash function. Take the extreme cases. Suppose that your key is a date. You know that the dates will all be in January of the same year. Then using the day of the month as the hash value should be as good as it's going to get: everything will hash to a unique integer in a small range. On the other hand, if your dates were all the first of the month for many years and many months, taking the day of the month would be a terrible hash, as every actual key would map to "1".
My point being that if you really want to optimize your hash, you need to know the nature of your data. What is the actual range of values that you will get?