I have a general question about when we should use a hash table instead of, say, an AVL tree. I remember my lecturer saying that if the data size is something like 2^10 or 2^20, an AVL tree is acceptable, because using a hash table incurs hashing overhead, etc.
So I wonder: in practice, is there a general rule regarding data size that tells us when to choose a hash table over an AVL tree? Is a hash table always the first choice when dealing with data sizes larger than 2^20?
Hash tables are 'wasteful' memory-wise, as the backing table is normally larger than the number of entries. Trees don't have this problem, but they do have the problem that lookups (and most other operations) are O(log n). So yes, you are correct that for small data sets a tree may be better, depending on how much you care about memory efficiency.
There's no general rule regarding data size - it depends on the specifics of the implementations you are comparing and what you want to optimize for (memory or CPU). The Javadocs provide some insight into the performance of the implementations provided by Java:
http://docs.oracle.com/javase/7/docs/api/java/util/TreeMap.html
http://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html
Beyond that writing some benchmarks and comparing different implementations will give you more insight.
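As a starting point for such a benchmark, a minimal sketch comparing the two Java implementations could look like this (the class name MapComparison is mine):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapComparison {
    public static void main(String[] args) {
        Map<Integer, String> hash = new HashMap<>();
        TreeMap<Integer, String> tree = new TreeMap<>();
        for (int i = 0; i < (1 << 10); i++) { // 2^10 entries
            hash.put(i, "v" + i);
            tree.put(i, "v" + i);
        }
        // Same Map API: HashMap lookups are expected O(1) (plus hashing cost),
        // TreeMap lookups are O(log n) but the keys stay sorted.
        System.out.println(hash.get(512));   // v512
        System.out.println(tree.firstKey()); // 0
    }
}
```

Wrapping the lookup loops in System.nanoTime() calls would turn this into a crude timing comparison, though a harness such as JMH gives far more reliable numbers.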
Related
As I have studied the HashSet class, it uses the concept of a fill ratio (the load factor), which says that when the HashSet is filled up to this limit, a larger HashSet is created and the values are copied into it. Why don't we let the HashSet fill up completely with objects before creating a new one? Why was this separate concept introduced for HashSet?
Both ArrayList and Vector are accessed by positional index, so that there are no conflicts and access is always O(1).
A hash-based data structure is accessed by a hashed value, which can collide and degrade into access to a second-level "overflow" data structure (list or tree). If you have no such collisions, access is O(1), but if you have many collisions, it can be significantly worse. You can control this a bit by allocating more memory (so that there are more buckets and hopefully fewer collisions).
As a result, there is no need to grow an ArrayList to a capacity more than you need to store all elements, but it does make sense to "waste" a bit (or a lot) in the case of a HashSet. The parameter is exposed to allow the programmer to choose what works best for her application.
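As a concrete illustration, Java exposes exactly this trade-off through HashSet's constructor; the capacity and load factor below are illustrative values, not recommendations:

```java
import java.util.HashSet;

public class LoadFactorDemo {
    public static void main(String[] args) {
        // Capacity 2048 with load factor 0.75f: the backing table is resized
        // once size exceeds 2048 * 0.75 = 1536. Sizing it up front avoids
        // intermediate rehashes while inserting the 1000 elements below.
        HashSet<String> set = new HashSet<>(2048, 0.75f);
        for (int i = 0; i < 1000; i++) {
            set.add("key" + i);
        }
        System.out.println(set.size());            // 1000
        System.out.println(set.contains("key42")); // true
    }
}
```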
As Jonny Henly has described, it is because of the way data is stored.
ArrayList is a linear data structure, while HashSet is not. In a HashSet, data is stored in an underlying array based on hash codes. In a way, the performance of a HashSet is tied to how many buckets are filled and how well data is distributed among those buckets. Once the ratio of stored entries to buckets exceeds a certain threshold (called the load factor), re-hashing is done.
HashSet is primarily used to ensure that the basic operations (such as adding, fetching, modifying and deleting) are performed in constant time regardless of the number of entries being stored in the HashSet.
Though a well-designed hash function can achieve this, designing one might take time. So if performance is a critical requirement for the application, we can use the load factor to help keep the operations constant-time. The load factor and the hash function work together toward the same goal: keeping collisions rare.
I agree that this may not be a perfect explanation, but I hope it does bring some clarity on the subject.
I need to implement a fixed-size hash map optimized for memory and speed, but am unclear as to what this means: Does it mean that the number of buckets I can have in my hash map is fixed? Or that I cannot dynamically allocate memory and expand the size of my hash map to resolve collisions via linked lists? If the answer to the latter question is yes, the first collision resolution approach that comes to mind is linear probing--can anyone comment on other more memory and speed efficient methods, or point me towards any resources to get started? Thanks!
Without seeing specific requirements it's hard to interpret the meaning of "fixed-size hash map optimized for memory and speed," so I'll focus on the "optimized for memory and speed" aspect.
Memory
Memory efficiency is hard to give advice on, particularly if the hash map actually has a "fixed" size. In general, open addressing can be more memory efficient because the key and value alone can be stored without needing pointers to next and/or previous linked-list nodes. If your hash map is allowed to resize, you'll want to pick a collision resolution strategy that allows a higher load (elements/capacity) before resizing. 1/2 is a common load factor used by many hash map implementations, but it means at least 2x the necessary memory is always used. The collision resolution strategy generally needs to balance speed against memory efficiency, tuned for your actual requirements/use case.
Speed
From a real-world perspective, particularly for smaller hash maps or those with trivially sized keys, the most important aspect of optimizing speed is likely reducing cache misses. That means putting as much of the information needed to perform operations into contiguous memory as possible.
My advice would be to use open addressing instead of chaining for collision resolution. This allows more of your memory to be contiguous and saves at a minimum one cache miss per key comparison. Open addressing requires some kind of probing, but compared to the cost of fetching each link of a linked list from memory, looping over several contiguous array elements checking keys should be faster. See here for a benchmark of C++ std::vector vs std::list; the takeaway is that for most operations a plain contiguous array is faster due to spatial locality, despite its algorithmic complexity.
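For illustration, a minimal fixed-capacity open-addressing map with linear probing might look like the sketch below (int keys only, no deletion or resize handling, not production code; the class name ProbingMap is mine). The capacity is a power of two so the index computation is a single AND:

```java
// Minimal fixed-capacity open-addressing map with linear probing.
public class ProbingMap {
    private final int[] keys;
    private final int[] values;
    private final boolean[] used;
    private final int mask;

    public ProbingMap(int capacityPowerOfTwo) {
        keys = new int[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        used = new boolean[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    public void put(int key, int value) {
        // Caller must keep the load below capacity, or this loop never ends.
        int i = key & mask;
        while (used[i] && keys[i] != key) i = (i + 1) & mask; // linear probe
        used[i] = true;
        keys[i] = key;
        values[i] = value;
    }

    public Integer get(int key) {
        int i = key & mask;
        while (used[i]) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & mask;
        }
        return null; // hit an empty slot: key is absent
    }

    public static void main(String[] args) {
        ProbingMap m = new ProbingMap(16);
        m.put(3, 30);
        m.put(19, 190);          // 19 & 15 == 3: collides, probes to slot 4
        System.out.println(m.get(3));  // 30
        System.out.println(m.get(19)); // 190
        System.out.println(m.get(7));  // null
    }
}
```

All entries live in three flat arrays, which is what makes the probe loop cache-friendly compared with chasing linked-list pointers.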
In terms of types of probing, linear probing has an issue of clustering: as collisions occur, the adjacent elements are consumed, which causes more and more collisions in the same section of the array. This becomes exceptionally important when the table is nearly full. It can be mitigated with re-hashing, Robin Hood hashing (as you probe to insert, if you reach an element closer to its ideal slot than the element being inserted, swap the two and try to insert that element instead; a much better description can be seen here), etc. Quadratic probing doesn't have the same clustering problem, but it has its own limitation: not every array location can be reached from every other array location, so depending on the size, typically only half the array can be filled before it needs to be resized.
The size of the array is also something that affects performance. The most common sizings are powers of 2 and prime numbers. Java: A "prime" number or a "power of two" as HashMap size?
Arguments exist for both, but mostly performance will depend on usage. Specifically, power-of-2 sizes are usually very bad with hashes that are sequential, but the conversion of hash to array index can be done with a single AND operation, versus a comparatively expensive mod.
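The index-computation point can be checked directly; the hash value below is arbitrary:

```java
public class IndexMapping {
    public static void main(String[] args) {
        int capacity = 1 << 16;             // power-of-2 table size
        int hash = 123456789;               // arbitrary non-negative hash
        // With a power-of-2 capacity, modulo reduces to a single cheap AND:
        int byMod  = hash % capacity;
        int byMask = hash & (capacity - 1);
        System.out.println(byMod == byMask); // true
        // Note: for a negative hash, Java's % can go negative while the
        // mask still yields a valid index - another reason tables mask.
    }
}
```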
Notably, google_dense_hash is a very good hash map, easily outperforming the C++ standard library variant in almost every use case; it uses open addressing, a power-of-2 resizing convention, and quadratic probing. Malte Skarupke wrote an excellent hash table that beats google_dense_hash in many cases, including lookup. His implementation uses Robin Hood hashing and linear probing with a probe-length limit. It's very well described in a blog post, along with excellent benchmarks against other hash tables and a description of the performance gains.
Checking in Java and googling online for hashtable code examples, it seems that resizing of the table is done by doubling it.
But most textbooks say that the best size for the table is a prime number.
So my question is:
Is the approach of doubling because:
It is easy to implement, or
Is finding a prime number too inefficient (but I think that finding the next prime by going over n += 2 and testing for primality using modulo is O(log log N), which is cheap)
Or is this my misunderstanding, and only certain hashtable variants require a prime table size?
Update:
The way presented in textbooks, using a prime number, is required for certain properties to work (e.g. quadratic probing needs a prime-sized table to prove that, while the table is less than half full, an item X can always be inserted).
The link posted as duplicate asks generally about increasing by any number e.g. 25% or next prime and the answer accepted states that we double in order to keep the resizing operation "rare" so we can guarantee amortized time.
This does not answer the question of keeping the table size prime and resizing to a prime that is greater than double the size. So the idea is to keep the properties of the prime size while taking the resizing overhead into account.
Q: But most textbooks say that the best size for the table is a prime number.
Regarding size primality:
As for the primality of the size: it depends on the collision resolution algorithm you choose. Some algorithms require a prime table size (double hashing, quadratic hashing); others don't, and they can benefit from a power-of-2 table size, because it allows very cheap modulo operations. However, when the closest "available table sizes" differ by a factor of 2, the memory usage of the hash table might be unreliable. So, even using linear hashing or separate chaining, you can choose a non-power-of-2 size. In this case, in turn, it's worth choosing a prime size in particular, because:
If you pick prime table size (either because algorithm requires this, or because you are not satisfied with memory usage unreliability implied by power-of-2 size), table slot computation (modulo by table size) could be combined with hashing. See this answer for more.
The point that a power-of-2 table size is undesirable when the hash function distribution is bad (from the answer by Neil Coffey) is impractical, because even if you have a bad hash function, avalanching it and still using a power-of-2 size would be faster than switching to a prime table size: a single integral division is still slower on modern CPUs than the several multiplications and shift operations required by good avalanching functions, e.g. from MurmurHash3.
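For reference, the avalanching step being referred to is tiny; this is the 64-bit finalizer from MurmurHash3 - a handful of multiplies, XORs, and shifts, with no division:

```java
public class Avalanche {
    // MurmurHash3's fmix64 finalizer: spreads entropy into all bits of h.
    static long fmix64(long h) {
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    public static void main(String[] args) {
        int mask = (1 << 4) - 1; // 16-slot power-of-2 table
        // Sequential keys no longer land in sequential slots after mixing,
        // so masking the low bits remains safe even for patterned inputs.
        for (long k = 1; k <= 4; k++) {
            System.out.println(fmix64(k) & mask);
        }
    }
}
```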
Q: Also to be honest I got lost a bit on if you actually recommend primes or not. Seems that it depends on the hash table variant and the quality of the hash function?
The quality of the hash function doesn't matter; you can always "improve" a hash function with MurmurHash3 avalanching, which is cheaper than switching from a power-of-2 table size to a prime one - see above.
I recommend choosing a prime size, with the QHash or quadratic hash algorithm (they aren't the same), only when you need precise control over the hash table load factor and predictably high actual loads. With a power-of-2 table size, the minimum resize factor is 2, and generally we cannot guarantee the hash table will have an actual load factor higher than 0.5. See this answer.
Otherwise, I recommend to go with power-of-2 sized hash table with linear probing.
Q: Is the approach of doubling because:
It is easy to implement, or
Basically, in many cases, yes.
See this large answer regarding load factors:
Load factor is not an essential part of the hash table data structure - it is a way to define rules of behaviour for a dynamic system (a growing/shrinking hash table is a dynamic system).
Moreover, in my opinion, in 95% of modern hash table cases this approach is oversimplified; such dynamic systems behave suboptimally.
What is doubling? It's just the simplest resizing strategy. The strategy could be arbitrarily complex, performing optimally for your use cases. It could consider the present hash table size, growth intensity (how many get operations were done since the previous resize), etc. Nobody forbids you from implementing such custom resizing logic.
Q: Is finding a prime number too inefficient (but I think that finding the next prime going over n+=2 and testing for primality using modulo is O(loglogN) which is cheap)
It is good practice to precompute some subset of prime hash table sizes, and to choose between them using binary search at runtime. See the list of double hash capacities and the explanation, and the QHash capacities. Or even use a direct lookup, which is very fast.
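A sketch of that lookup, with a hypothetical shortened list of roughly doubling primes (real implementations precompute a much longer table):

```java
import java.util.Arrays;

public class PrimeSizes {
    // Hypothetical precomputed prime capacities, each roughly double the last.
    static final int[] PRIMES = {11, 23, 47, 97, 197, 397, 797, 1597, 3203, 6421};

    // Returns the smallest precomputed prime >= needed (clamped to the table).
    static int nextCapacity(int needed) {
        int i = Arrays.binarySearch(PRIMES, needed);
        if (i < 0) i = -i - 1; // not found: binarySearch encodes the insertion point
        return PRIMES[Math.min(i, PRIMES.length - 1)];
    }

    public static void main(String[] args) {
        System.out.println(nextCapacity(100)); // 197
        System.out.println(nextCapacity(797)); // 797
    }
}
```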
Q: Or this is my misunderstanding and only certain hashtable variants only require prime table size?
Yes, only certain types require it; see above.
Java HashMap (java.util.HashMap) chains bucket collisions in a linked list (or [as of JDK8] tree depending on the size and overfilling of bins).
Consequently theories about secondary probing functions don't apply.
It seems the message 'use primes sizes for hash tables' has become detached from the circumstances it applies over the years...
Using powers of two has the advantage (as noted in other answers) that reducing the hash value to a table index can be achieved with a bit mask. Integer division is relatively expensive, and in high-performance situations this can help.
I'm going to observe that "redistributing the collision chains when rehashing is a cinch for tables that are a power of two going to a power of two".
Notice that when using powers of two, rehashing to twice the size 'splits' each bucket between two buckets based on the 'next' bit of the hash code.
That is, if the hash table had 256 buckets, and so used the lowest 8 bits of the hash value, rehashing splits each collision chain based on the 9th bit: an entry either remains in the same bucket B (9th bit is 0) or moves to bucket B+256 (9th bit is 1). Such splitting can preserve/take advantage of the bucket-handling approach. For example, java.util.HashMap keeps small buckets sorted in reverse order of insertion and then splits them into two sub-structures obeying that order. It keeps big buckets in a binary tree sorted by hash code and similarly splits the tree to preserve that order.
NB: These tricks weren't implemented until JDK8.
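The split decision can be verified in a few lines; the hash value below is arbitrary:

```java
public class BucketSplit {
    public static void main(String[] args) {
        int oldCap = 256, newCap = 512;
        int hash = 0x1A3;                     // arbitrary hash with the 9th bit set
        int oldBucket = hash & (oldCap - 1);  // index from the low 8 bits
        int newBucket = hash & (newCap - 1);  // index from the low 9 bits
        // After doubling, the entry stays in oldBucket or moves to
        // oldBucket + oldCap, decided solely by the single bit (hash & oldCap).
        System.out.println(oldBucket);              // 163
        System.out.println(newBucket);              // 419 (= 163 + 256)
        System.out.println((hash & oldCap) != 0);   // true: it moves
    }
}
```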
(I am pretty sure) java.util.HashMap only sizes up (never down), but there would be similar efficiencies in halving a hash table as in doubling it.
One 'downside' of this strategy is that Object implementers aren't explicitly required to make sure the low order bits of hash-codes are well distributed.
A perfectly valid hash-code could be well distributed overall but poorly distributed in its low order bits. So an object obeying the general contract for hashCode() might still tank when actually used in a HashMap!
java.util.HashMap mitigates this by applying an additional 'spread' on top of the provided hashCode() implementation. That 'spread' is quick and crude: it XORs the 16 high bits into the low bits.
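A sketch of that spread step and its effect on a hash code whose entropy sits entirely in the high bits (the constant below is an arbitrary example):

```java
public class Spread {
    // The JDK 8 HashMap spread: XOR the high 16 bits into the low 16,
    // so a power-of-2 mask still sees the upper bits of the hash code.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int h = 0xABCD0000;      // all entropy in the high bits
        int mask = (1 << 4) - 1; // 16-bucket table
        System.out.println(h & mask);         // 0 - every such key collides
        System.out.println(spread(h) & mask); // 13 - high bits now matter
    }
}
```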
Object implementers should be aware (if not already) that bias in their hash codes (or lack thereof) can have a significant effect on the performance of data structures that use hashes.
For the record I've based this analysis on this copy of the source:
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java
Problem
I'm trying to normalize columns in very large raw, de-normalized CSV tables. Column values are short strings (10-100 bytes). I'm trying to find a faster solution than my current approach(es).
Example
input.csv
john,london
jean,paris
bill,london
Is converted to the following files:
input.normalized.csv
1,1
2,2
3,1
input.col1.csv
1,john
2,jean
3,bill
input.col2.csv
1,london
2,paris
I've currently got two approaches to normalizing these datasets.
Current Approaches
Single pass in-memory
A single-pass approach, storing column value -> normalized_id mappings in an associative array (a Java HashMap in my case). This will run out of memory at some point, but it's fast when it can store everything in memory. A simple way of lowering memory usage would be to do a single pass per column.
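A minimal sketch of this single-pass approach on the example column (class and variable names are mine):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Normalize {
    public static void main(String[] args) {
        String[] col2 = {"london", "paris", "london"};
        // Single pass, in memory: each distinct value gets an id on first sight.
        // LinkedHashMap keeps first-seen order for writing out the id table.
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (String v : col2) {
            ids.computeIfAbsent(v, k -> ids.size() + 1);
        }
        for (String v : col2) {
            System.out.println(ids.get(v)); // normalized column: 1, 2, 1
        }
        System.out.println(ids); // id table to write out as input.col2.csv
    }
}
```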
Multipass sorting
A multipass approach based on sorting. Column values get their line number attached and are then sorted (in a memory-efficient, merge-sort manner). For example, the column values london,paris,london have line numbers attached and are then sorted: london;1,london;3,paris;2 .
I can now keep a single "unique value counter" and simply compare each value with the previous one (e.g. london == london, so do not increment the counter). At the end, I have (unique_id, line number) pairs that I can sort by line number to reconstruct the normalized column. Columns can then be merged in a single pass.
This approach can be done in very limited memory, depending on the memory usage of the sorting algorithm applied. The good news is that this approach is easy to implement in something like hadoop, utilising its distributed sorting step.
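A toy in-memory version of the sort-based pass, with Arrays.sort standing in for the external merge sort (class and variable names are mine):

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortNormalize {
    public static void main(String[] args) {
        String[] col = {"london", "paris", "london"};
        // Attach line numbers by sorting an index array by value.
        Integer[] lines = {0, 1, 2};
        Arrays.sort(lines, Comparator.comparing((Integer i) -> col[i]));
        // Single scan: bump the unique-value counter only when the value changes.
        int[] ids = new int[col.length];
        int counter = 0;
        String prev = null;
        for (int line : lines) {
            if (!col[line].equals(prev)) {
                counter++;
                prev = col[line];
            }
            ids[line] = counter; // indexing by line restores the original order
        }
        System.out.println(Arrays.toString(ids)); // [1, 2, 1]
    }
}
```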
MY QUESTION
The multipass approach is painfully slow compared to a single-pass approach (or a single-pass-per-column approach). So I'm wondering what the best way to optimize that approach would be, or if someone could suggest alternative approaches?
I reckon I'm looking for a (distributed) key-value store of some kind, that has as low memory usage as possible.
It seems to me that using Trove would be a good, simple alternative to using Java HashMaps, but I'd like something that can handle the distribution of keys for me.
Redis would probably be a good bet, but I'm not impressed by its memory usage per key-value pair.
Do you know the rough order of magnitude of the input columns? If you do, and you don't need to preserve the original input file order, then you can just use a sufficiently large hash function to avoid collisions for the input keys.
If you insist on having a dense consecutive key space, then you've already covered the two primary choices. You could certainly try redis, I've seen it used for 10s of millions of key-value pairs, but it is probably not going to scale beyond that. You could also try memcached. It might have a slightly lower memory overhead than redis, but I would definitely experiment with both, since they are fairly similar for this particular usage. You don't actually need Redis's advanced data structures.
If you need more key-values than you can store in memory on a single machine, you could fall back to something like BDB or Kyoto cabinet, but eventually this step is going to become the bottleneck to your processing. The other red flag is if you can fit an entire column in memory on a single machine, then why are you using Hadoop?
Honestly, relying on a dense ordered primary key is one of the first things that gets thrown out in a NoSQL DB as it assumes a single coordinated master. If you can allow for even some gaps, then you can do something similar to a vector clock.
One final alternative, would be to use a map-reduce job to collect all the duplicate values up by key and then assign a unique value using some external transactional DB counter. However, the map-reduce job is essentially a multi-pass approach, so it may be worse. The main advantage is that you will be getting some IO parallelism. (Although the id assignment is still a serial transaction.)
In an interview recently, I was asked about how HashMap works in Java and I was able to explain it well and explain that in the worst case the HashMap may degenerate into a list due to chaining. I was asked to figure out a way to improve this performance but I was unable to do that during the interview. The interviewer asked me to look up "Trove".
I believe he was pointing to this page. I have read the description provided on that page but still can't figure out how it overcomes the limitations of the java.util.HashMap.
Even a hint would be appreciated. Thanks!!
The key phrase there is open addressing. Instead of hashing to an array of buckets, all the entries are in one big array. When you add an element, if the space for it is already in use you just move down the array to find a free space.
As long as the array is kept sufficiently bigger than the number of entries and the hash function is well distributed it's possible to keep average lookup times small. And by having one array you can get better performance - it's more cache friendly.
However it still has worst-case linear behaviour if (say) every key hashes to the same value, so it doesn't avoid that issue.
It seems to me from the Trove page that there are two main differences that improve performance.
The first is the use of open addressing (http://en.wikipedia.org/wiki/Hash_table#Open_addressing). This doesn't avoid the collision issue, but it does mean that there's no need to create "Entry" objects for every item that goes in the map.
The second important difference is being able to provide your own hash function, which differs from the one provided by the class of the keys. So you could provide a much faster hash function if it made sense to do so.
One advantage of Trove is that it avoids object creation, especially for primitives.
For big hash tables on an embedded Java device this can be advantageous due to lower memory consumption.
The other advantage I saw is the use of custom hash codes/functions without the need to override hashCode(). For a specific data set, and for an expert in writing hash functions, this can be an advantage.