Choosing a suitable table size for a hash table - Java

If I have a key set of 1000, what is a suitable size for my Hash table, and how is that determined?

It depends on the load factor (the "percent full" point where the table will increase its size and re-distribute its elements). If you know you have exactly 1000 entries, and that number will never change, you can just set the load factor to 1.0 and the initial size to 1000 for maximum efficiency. If you weren't sure of the exact size, you could leave the load factor at its default of 0.75 and set your initial size to 1334 (expected size/LF) for really good performance, at a cost of extra memory.
You can use the following constructor to set the load factor:
Hashtable(int initialCapacity, float loadFactor)
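For example, sizing for roughly 1000 entries at the default 0.75 load factor might look like this (a minimal sketch; the class name and values are just for illustration):

import java.util.Hashtable;

public class SizingExample {
    public static void main(String[] args) {
        // Expecting ~1000 entries at the default 0.75 load factor:
        // 1000 / 0.75 ≈ 1334, so the table should never need to rehash.
        Hashtable<String, Integer> table = new Hashtable<>(1334, 0.75f);
        table.put("example", 42);
        System.out.println(table.get("example"));
    }
}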

You need to factor in the hash function as well.
One rule of thumb is to make the table size about double the expected number of entries, so that there is room to expand and, hopefully, the number of collisions stays small.
Another rule of thumb is to assume that you are doing some sort of modulo-related hashing; in that case, round your table size up to the next prime number and use that prime as the modulo value.
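As a rough sketch of that rule of thumb (the helper and class names are made up; it uses java.math.BigInteger.nextProbablePrime() to round a requested capacity up to a prime):

import java.math.BigInteger;

public class PrimeCapacity {
    // Illustrative helper: round a requested capacity up to a prime,
    // which can then be used as the modulo when computing bucket indexes.
    static int nextPrimeCapacity(int desired) {
        return BigInteger.valueOf(desired - 1L).nextProbablePrime().intValueExact();
    }

    public static void main(String[] args) {
        int capacity = nextPrimeCapacity(2000);                  // 2003
        int index = Math.floorMod("someKey".hashCode(), capacity);
        System.out.println(capacity + " " + index);
    }
}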
What kind of things are you hashing? More detail should generate better advice.

There's some discussion of these factors in the documentation for Hashtable.

Let it grow. With a key set of this size, the automatic handling is fine. Beyond that, 2 x size + 1 is a simple formula. Prime numbers are also reasonably good, but as soon as your data set reaches a certain size, the hash implementation may decide to rehash and grow the table anyway.
Your keys drive the effectiveness, and they are hopefully distinct enough.
Bottom line: ask the sizing question when you actually have a problem, such as memory use or slow performance; other than that: do not worry!

Twice the key set size is good.
You don't have a big key set.
Don't get bogged down in difficult discussions about your Hashtable implementation; just go for 2000.

I'd like to reiterate what https://stackoverflow.com/users/33229/wwwflickrcomphotosrene-germany said above. 1000 doesn't seem like a very big hash table to me. I've been using a lot of hashtables around that size in Java without seeing much in the way of performance problems. And I hardly ever muck about with the size or load factor.
If you've run a profiler on your code and determined that the hashtable is your problem, then by all means start tweaking. Otherwise, I wouldn't assume you've got a problem until you're sure.
After all, in most code, the performance problem isn't where you think it is. I try not to anticipate.

Related

Implementing a fixed-size hash map

I need to implement a fixed-size hash map optimized for memory and speed, but am unclear as to what this means: Does it mean that the number of buckets I can have in my hash map is fixed? Or that I cannot dynamically allocate memory and expand the size of my hash map to resolve collisions via linked lists? If the answer to the latter question is yes, the first collision resolution approach that comes to mind is linear probing--can anyone comment on other more memory and speed efficient methods, or point me towards any resources to get started? Thanks!
Without seeing specific requirements it's hard to interpret the meaning of "fixed-size hash map optimized for memory and speed," so I'll focus on the "optimized for memory and speed" aspect.
Memory
Memory efficiency is hard to give advice on, particularly if the hash map really does have a "fixed" size. In general, open addressing can be more memory efficient because the key and value alone can be stored without needing pointers to next and/or previous linked-list nodes. If your hash map is allowed to resize, you'll want to pick a collision resolution strategy that allows a larger load (elements/capacity) before resizing. 1/2 is a common load factor used by many hash map implementations, but it means at least 2x the necessary memory is always used. The collision resolution strategy generally needs to balance speed and memory efficiency, tuned in particular for your actual requirements/use case.
Speed
From a real-world perspective, particularly for smaller hash maps or those with trivially sized keys, the most important aspect of optimizing speed is likely reducing cache misses. That means putting as much of the information needed to perform operations into contiguous memory as possible.
My advice would be to use open addressing instead of chaining for collision resolution. This allows more of your memory to be contiguous and should save at least one cache miss per key comparison. Open addressing requires some kind of probing, but compared to the cost of fetching each link of a linked list from memory, looping over several array elements and comparing keys should be faster. See here for a benchmark of C++ std::vector vs std::list; the takeaway is that for most operations a normal contiguous array is faster due to spatial locality, despite its algorithmic complexity.
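For illustration, a minimal open-addressing map in Java with linear probing might look like this (the class is hypothetical: no deletion, no resizing, and the capacity must be a power of two kept below full load):

// Minimal open-addressing map with linear probing (illustrative only:
// no deletion, no resizing, capacity must be a power of two, and the
// number of entries must stay below the capacity).
public class ProbingMap<K, V> {
    private final Object[] keys;
    private final Object[] values;
    private final int mask;

    public ProbingMap(int powerOfTwoCapacity) {
        keys = new Object[powerOfTwoCapacity];
        values = new Object[powerOfTwoCapacity];
        mask = powerOfTwoCapacity - 1;
    }

    public void put(K key, V value) {
        int i = key.hashCode() & mask;
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) & mask;          // linear probe: try the next slot
        }
        keys[i] = key;
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    public V get(K key) {
        int i = key.hashCode() & mask;
        while (keys[i] != null) {
            if (keys[i].equals(key)) {
                return (V) values[i];
            }
            i = (i + 1) & mask;          // keep probing until an empty slot
        }
        return null;
    }
}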
In terms of types of probing, linear probing has an issue of clustering. As collisions occur, the adjacent elements are consumed, which causes more and more collisions in the same section of the array; this becomes especially important when the table is nearly full. This can be addressed with re-hashing, Robin Hood hashing (as you are probing to insert, if you reach an element closer to its ideal slot than the element being inserted, swap the two and try to insert that element instead; a much better description can be seen here), etc. Quadratic probing doesn't have the same clustering problem as linear probing, but it has its own limitation: not every array location can be reached from every other array location, so depending on size, typically only half the array can be filled before it would need to be resized.
The size of the array is also something that affects performance. The most common sizings are power-of-2 sizes and prime number sizes. Java: A "prime" number or a "power of two" as HashMap size?
Arguments exist for both, but mostly performance will depend on usage; specifically, power-of-two sizes are usually very bad with hashes that are sequential, but the conversion of a hash to an array index can be done with a single AND operation vs a comparatively expensive mod.
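To make that index computation concrete (a small sketch; the class and values are illustrative):

public class IndexDemo {
    public static void main(String[] args) {
        int hash = "someKey".hashCode();

        // Power-of-two capacity: a single AND with (capacity - 1) keeps the low bits.
        int capacityPow2 = 1024;
        int indexPow2 = hash & (capacityPow2 - 1);

        // Prime capacity: a modulo operation (Math.floorMod avoids negative indexes).
        int capacityPrime = 1021;
        int indexPrime = Math.floorMod(hash, capacityPrime);

        System.out.println(indexPow2 + " " + indexPrime);
    }
}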
Notably, google_dense_hash is a very good hash map, easily outperforming the C++ standard library variant in almost every use case; it uses open addressing, a power-of-2 resizing convention, and quadratic probing. Malte Skarupke wrote an excellent hash table that beats google_dense_hash in many cases, including lookup. His implementation uses Robin Hood hashing and linear probing with a probe-length limit. It's very well described in a blog post, along with excellent benchmarking against other hash tables and a description of the performance gains.

Why is the hash table resized by doubling it?

Checking in Java and googling online for hashtable code examples, it seems that the table is resized by doubling it.
But most textbooks say that the best size for the table is a prime number.
So my question is:
Is the approach of doubling used because:
1. It is easy to implement, or
2. Finding a prime number is too inefficient (though I think that finding the next prime by going over n += 2 and testing for primality using modulo is O(log log N), which is cheap)?
3. Or is this my misunderstanding, and only certain hashtable variants require a prime table size?
Update:
The way presented in textbooks, using a prime number, is required for certain properties to work (e.g. quadratic probing needs a prime-sized table to prove that, if the table is not full, item X will be inserted).
The link posted as a duplicate asks generally about increasing by any amount, e.g. 25% or to the next prime, and the accepted answer states that we double in order to keep the resizing operation "rare", so that we can guarantee amortized time.
This does not answer the question of having a table size that is prime and, when resizing, using a prime that is even greater than double the current size. So the idea is to keep the properties of the prime size while taking the resizing overhead into account.
Q: But most textbooks say that the best size for the table is a prime number.
Regarding size primality:
The primality of the size depends on the collision resolution algorithm you choose. Some algorithms require a prime table size (double hashing, quadratic hashing); others don't and can benefit from a power-of-2 table size, because it allows very cheap modulo operations. However, when the closest "available table sizes" differ by a factor of 2, the memory usage of the hash table might be unreliable. So, even when using linear hashing or separate chaining, you can choose a non-power-of-2 size. In that case, in turn, it's worth choosing a prime size in particular, because:
If you pick a prime table size (either because the algorithm requires it, or because you are not satisfied with the memory usage unreliability implied by a power-of-2 size), the table slot computation (modulo by table size) can be combined with the hashing. See this answer for more.
The point that a power-of-2 table size is undesirable when the hash function distribution is bad (from the answer by Neil Coffey) is impractical, because even if you have a bad hash function, avalanching it and still using a power-of-2 size would be faster than switching to a prime table size: a single integral division is still slower on modern CPUs than the several multiplications and shift operations required by good avalanching functions, e.g. from MurmurHash3.
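For reference, the 32-bit MurmurHash3 finalizer (often called fmix32) is only a handful of shifts and multiplications; a Java transcription might look like this (the wrapper class is just for illustration):

public final class Avalanche {
    // 32-bit MurmurHash3 finalizer ("fmix32"), transcribed to Java.
    // Applying it to a weak hash code scatters the bits before they are
    // masked with a power-of-two table size.
    static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
}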
Q: Also, to be honest, I got a bit lost on whether you actually recommend primes or not. It seems that it depends on the hash table variant and the quality of the hash function?
Quality of the hash function doesn't matter: you can always "improve" a hash function with MurmurHash3 avalanching, which is cheaper than switching from a power-of-2 table size to a prime one, see above.
I recommend choosing a prime size, with the QHash or quadratic hashing algorithm (they aren't the same), only when you need precise control over the hash table load factor and predictably high actual loads. With a power-of-2 table size, the minimum resize factor is 2, and generally we cannot guarantee that the hash table will have an actual load factor any higher than 0.5. See this answer.
Otherwise, I recommend going with a power-of-2-sized hash table with linear probing.
Q: Is the approach of doubling used because it is easy to implement?
Basically, in many cases, yes.
See this large answer regarding load factors:
Load factor is not an essential part of the hash table data structure -- it is a way to define rules of behaviour for a dynamic system (a growing/shrinking hash table is a dynamic system).
Moreover, in my opinion, in 95% of modern hash table cases this approach is oversimplified, and the dynamic system behaves suboptimally.
What is doubling? It's just the simplest resizing strategy. The strategy could be arbitrarily complex, performing optimally in your use cases. It could consider the present hash table size, growth intensity (how many get operations were done since the previous resize), etc. Nobody forbids you from implementing such custom resizing logic.
Q: Is finding a prime number too inefficient (but I think that finding the next prime by going over n += 2 and testing for primality using modulo is O(log log N), which is cheap)?
It is good practice to precompute some subset of prime hash table sizes, and choose between them using binary search at runtime. See the list of double hash capacities and its explanation, and the QHash capacities. Alternatively, a direct lookup can be used, which is very fast.
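A minimal sketch of that idea (the prime list here is an illustrative roughly-doubling sequence, not the specific capacity lists linked above):

import java.util.Arrays;

public class PrimeCapacities {
    // Roughly doubling sequence of primes (illustrative only).
    private static final int[] CAPACITIES = {
        53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593
    };

    // Smallest precomputed prime capacity >= required, found by binary search
    // (clamped to the largest entry if the request exceeds the table).
    static int capacityFor(int required) {
        int i = Arrays.binarySearch(CAPACITIES, required);
        if (i < 0) {
            i = -i - 1;                  // insertion point = first larger prime
        }
        return CAPACITIES[Math.min(i, CAPACITIES.length - 1)];
    }

    public static void main(String[] args) {
        System.out.println(capacityFor(1000));   // prints 1543
    }
}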
Q: Or this is my misunderstanding and only certain hashtable variants only require prime table size?
Yes, only certain types require it; see above.
Java HashMap (java.util.HashMap) chains bucket collisions in a linked list (or, as of JDK8, a tree, depending on the size and overfilling of bins).
Consequently theories about secondary probing functions don't apply.
It seems the message "use prime sizes for hash tables" has become detached over the years from the circumstances in which it applies...
Using powers of two has the advantage (as noted by other answers) that reducing the hash value to a table index can be achieved by a bit-mask. Integer division is relatively expensive, and in high-performance situations this can help.
I'm going to observe that "redistributing the collision chains when rehashing is a cinch for tables that are a power of two going to a power of two".
Notice that when using powers of two, rehashing to twice the size "splits" each bucket between two buckets based on the "next" bit of the hash code.
That is, if the hash table had 256 buckets and so used the lowest 8 bits of the hash value, rehashing splits each collision chain based on the 9th bit: an entry either remains in the same bucket B (9th bit is 0) or goes to bucket B+256 (9th bit is 1). Such splitting can preserve/take advantage of the bucket-handling approach. For example, java.util.HashMap keeps small buckets sorted in reverse order of insertion and then splits them into two sub-structures obeying that order. It keeps big buckets in a binary tree sorted by hash code and similarly splits the tree to preserve that order.
NB: These tricks weren't implemented until JDK8.
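In code, the split decision during a doubling resize looks roughly like this (a sketch of the idea, not the actual JDK source; the names are made up):

public final class SplitDemo {
    // When a power-of-two table doubles from oldCapacity to 2 * oldCapacity,
    // each entry either stays at its old index or moves exactly oldCapacity
    // slots up, decided by the single "next" bit of its hash.
    static int newIndex(int hash, int oldCapacity) {
        int oldIndex = hash & (oldCapacity - 1);
        return (hash & oldCapacity) == 0
                ? oldIndex                  // next bit is 0: stays in bucket B
                : oldIndex + oldCapacity;   // next bit is 1: moves to bucket B + oldCapacity
    }

    public static void main(String[] args) {
        System.out.println(newIndex(0x1FF, 256));   // 9th bit set: 255 + 256 = 511
        System.out.println(newIndex(0x0FF, 256));   // 9th bit clear: stays at 255
    }
}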
(I am pretty sure) java.util.HashMap only sizes up (never down), but halving a hash table would have efficiencies similar to doubling it.
One 'downside' of this strategy is that Object implementers aren't explicitly required to make sure the low order bits of hash-codes are well distributed.
A perfectly valid hash-code could be well distributed overall but poorly distributed in its low order bits. So an object obeying the general contract for hashCode() might still tank when actually used in a HashMap!
java.util.HashMap mitigates this by applying an additional hash "spread" on top of the provided hashCode() implementation. That "spread" is really quick and crude (it XORs the 16 high bits with the low bits).
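That spread is essentially the following one-liner (shown here as a standalone sketch, not the exact JDK source):

public final class Spread {
    // Essentially what java.util.HashMap does in JDK8: XOR the high 16 bits
    // of the hash code into the low 16 bits, so the low-order bits used by
    // the power-of-two mask also reflect the high-order bits.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }
}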
Object implementers should be aware (if not already) that bias in their hash code (or the lack thereof) can have a significant effect on the performance of data structures that use hashes.
For the record I've based this analysis on this copy of the source:
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java

Are BitSet operations really faster than sorted set operations?

I am looking around for the best algorithms for bitset operations like intersection and union, and have found a lot of links and similar questions.
Eg: Similar Question on Stack-Overflow
One thing, however, that I am trying to understand is where bit sets stand in this. E.g., Lucene uses BitSet operations to provide high-performing set operations, especially because it can work at a lower level.
However, it looks to me like the bit set will perform slower and slower as the maximum number of elements increases and the set becomes sparse; say the set has ~10 elements while the maximum possible element is 2 billion, because that calls for a lot of unnecessary matching. What do you suggest?
Bit Sets indeed make sense for dense sets, i.e. covering a significant fraction of the domain, as they represent every possible element. The space and running time requirements are O(D) [D = domain size = 2 billion !].
Sorted Set operations represent only the elements in the given set and will have an O(E) behavior [E = number of elements = 10], much more appropriate.
Bit sets are fast rather than efficient: their asymptotic cost is higher for sparse sets, but their hidden constant is much smaller. They are blazingly fast for small domains (say D <= 1024), as they can process 32/64 elements in a single CPU instruction.
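For a concrete feel in Java (a small illustrative comparison, not a benchmark):

import java.util.BitSet;
import java.util.TreeSet;

public class SetOps {
    public static void main(String[] args) {
        // BitSet stores one bit per possible element, so intersection
        // processes 64 elements per machine word over the used range.
        BitSet a = new BitSet();
        BitSet b = new BitSet();
        a.set(3); a.set(10); a.set(500);
        b.set(10); b.set(500); b.set(9999);
        a.and(b);                                  // a is now {10, 500}
        System.out.println(a);

        // A sorted set only ever touches the elements it actually holds,
        // which is what matters when the set is sparse over a huge domain.
        TreeSet<Integer> s = new TreeSet<>();
        TreeSet<Integer> t = new TreeSet<>();
        s.add(3); s.add(10); s.add(500);
        t.add(10); t.add(500); t.add(9999);
        s.retainAll(t);                            // s is now [10, 500]
        System.out.println(s);
    }
}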
For sparse bitsets you can greatly improve performance (and reduce memory usage) using sparse bitmaps where you divide your data into chunks as opposed to storing everything under a single key.
When using bitmaps for analytics, you have a limited number of users active at any given time (e.g. day) and sparse bitmaps use this fact to their advantage.
Shameless plug: http://github.com/bilus/redis-bitops (if you're using Ruby but there are also performance notes there).

Hash Table Size Setting

Do you always have to know the size of the array for a Hashtable prior to creating the array?
No, you don't. A quality implementation (Hashtable/HashMap) will resize itself automatically as the number of elements increases.
If you are talking about your own implementation, the answer depends on whether the hash table is capable of increasing the number of buckets as its size grows.
If you are worried about the performance implications of the resizing, the correct approach is to profile this in the context of your overall application.
No, in fact it is bad to have it fixed at a certain value.
For more info you can start here with Wikipedia.

How big should my hashmap be?

I do not know in advance how many elements are going to be stored in my HashMap. So how big should the capacity of my HashMap be? What factors should I take into consideration here? I want to minimize rehashing as much as possible, since it is really expensive.
You want to have a good tradeoff between space requirement and speed (which is reduced if many collisions happen, which becomes more likely if you reduce the space allocation).
You can define a load factor, the default is probably fine.
But what you also want to avoid is having to rebuild and extend the hash table as it grows. So you want to size it with the maximum capacity up front. Unfortunately, for that, you need to know roughly how much you are going to put into it.
If you can afford to waste a little memory, and at least have a reasonable upper bound for how large it can get, you can use that as the initial capacity. It will never rehash if you stay below that capacity. The memory requirement is linear in the capacity (maybe someone has numbers). Keep in mind that with the default load factor of 0.75, you need to set your capacity a bit higher than the number of elements, as the table is extended when it is already 75% full.
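A quick sketch of pre-sizing a HashMap from an expected upper bound, assuming the default 0.75 load factor (the numbers are just for illustration):

import java.util.HashMap;
import java.util.Map;

public class Presized {
    public static void main(String[] args) {
        int expectedMax = 10_000;                       // reasonable upper bound
        float loadFactor = 0.75f;

        // Capacity large enough that the threshold (capacity * loadFactor)
        // is not crossed, so the map should never rehash below expectedMax.
        int initialCapacity = (int) Math.ceil(expectedMax / loadFactor);

        Map<String, Integer> map = new HashMap<>(initialCapacity, loadFactor);
        map.put("example", 1);
        System.out.println(map.size());
    }
}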
If you really have no idea, just use the defaults. Not because they are perfect in your case, but because you don't have any basis for alternative settings.
The good news is that even if you set suboptimal values, it will still work fine, just waste a bit of memory and/or CPU cycles.
The documentation gives the minimum information you need to be able to make a reasonable decision. Read the introduction. I don't know what factors you should take into consideration, because you haven't given details about the nature of your application, the expected load, and so on. My best advice at this stage: leave it at the default of 16, then do load testing (think about the app from the user's point of view) and you'll be able to figure out roughly how much capacity you need initially.
