Do you always have to know the size of the array for a Hashtable prior to creating the array?
No, you don't. A quality implementation (Hashtable/HashMap) will resize itself automatically as the number of elements increases.
If you are talking about your own implementation, the answer depends on whether the hash table is capable of increasing the number of buckets as its size grows.
If you are worried about the performance implications of the resizing, the correct approach is to profile this in the context of your overall application.
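As a minimal sketch (class and variable names are just for illustration), a map created with no size at all simply grows on its own as entries are added:

import java.util.HashMap;
import java.util.Map;

public class GrowDemo {
    public static void main(String[] args) {
        // No capacity given: the map starts at its default size
        // and resizes itself automatically as entries are added.
        Map<Integer, String> map = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            map.put(i, "value-" + i);
        }
        System.out.println(map.size()); // 100000
    }
}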
No; in fact, it is bad to fix it to a certain value.
For more info, you can start with the Wikipedia article on hash tables.
I need to implement a fixed-size hash map optimized for memory and speed, but am unclear as to what this means: Does it mean that the number of buckets I can have in my hash map is fixed? Or that I cannot dynamically allocate memory and expand the size of my hash map to resolve collisions via linked lists? If the answer to the latter question is yes, the first collision resolution approach that comes to mind is linear probing--can anyone comment on other more memory and speed efficient methods, or point me towards any resources to get started? Thanks!
Without seeing specific requirements it's hard to interpret the meaning of "fixed-size hash map optimized for memory and speed," so I'll focus on the "optimized for memory and speed" aspect.
Memory
It is hard to give general memory-efficiency advice, particularly if the hash map really is a "fixed" size. In general, open addressing can be more memory efficient because the key and value alone can be stored, without pointers to next and/or previous linked-list nodes. If your hash map is allowed to resize, you'll want to pick a collision resolution strategy that allows a higher load (elements/capacity) before resizing. 1/2 is a common load factor used by many hash map implementations, but it means at least 2x the necessary memory is always used. The collision resolution strategy generally needs to balance speed against memory efficiency, tuned in particular for your actual requirements/use case.
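To make the memory comparison concrete, here is a hedged sketch (all type and field names are invented for this example) of the per-entry cost of chaining versus the parallel-array layout typical of open addressing:

// Chaining: every entry is a heap object with a header plus a
// "next" reference, on top of the key and value references.
class ChainedEntry<K, V> {
    final K key;
    final V value;
    ChainedEntry<K, V> next; // extra pointer per entry
    ChainedEntry(K key, V value) { this.key = key; this.value = value; }
}

// Open addressing: keys and values live in two flat arrays, with
// no per-entry node objects and no link pointers at all.
class OpenAddressedSketch {
    Object[] keys;   // keys[i] == null marks an empty slot
    Object[] values; // values[i] belongs to keys[i]
    int size;
    OpenAddressedSketch(int capacity) {
        keys = new Object[capacity];
        values = new Object[capacity];
    }
}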
Speed
From a real-world perspective, particularly for smaller hash maps or those with trivially sized keys, the most important aspect of optimizing speed will likely be reducing cache misses. That means putting as much of the information needed to perform operations into contiguous memory as possible.
My advice would be to use open addressing instead of chaining for collision resolution. This allows more of your memory to be contiguous and should save at least one cache miss per key comparison. Open addressing requires some kind of probing, but compared to the cost of fetching each link of a linked list from memory, looping over several contiguous array elements to compare keys should be faster. See here for a benchmark of C++ std::vector vs. std::list; the takeaway is that for most operations a plain contiguous array is faster, due to spatial locality, despite its algorithmic complexity on paper.
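As a rough Java analogue of that comparison (a naive timing loop, not a rigorous benchmark; boxed Integers only partially capture the locality effect, and the class name is invented for the example):

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class TraversalDemo {
    public static void main(String[] args) {
        int n = 1_000_000;
        List<Integer> array = new ArrayList<>();
        List<Integer> linked = new LinkedList<>();
        for (int i = 0; i < n; i++) { array.add(i); linked.add(i); }

        long t0 = System.nanoTime();
        long sumA = 0;
        for (int x : array) sumA += x;   // contiguous backing array
        long t1 = System.nanoTime();
        long sumL = 0;
        for (int x : linked) sumL += x;  // pointer-chasing per element
        long t2 = System.nanoTime();

        System.out.printf("ArrayList: %d ms, LinkedList: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}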
In terms of types of probing, linear probing suffers from clustering. As collisions occur, adjacent elements are consumed, which causes more and more collisions in the same section of the array; this becomes especially important when the table is nearly full. This can be addressed with re-hashing, Robin Hood hashing (as you probe to insert, if you reach an element closer to its ideal slot than the element being inserted, swap the two and continue inserting the displaced element instead; a much better description can be seen here), etc. Quadratic probing doesn't have linear probing's clustering problem, but it has its own limitation: not every array location can be reached from every other array location, so, depending on size, typically only half the array can be filled before it needs to be resized.
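For illustration, here is a hedged sketch of Robin Hood insertion with linear probing over primitive int keys (all names are invented for this example; a real table would also need deletion, a load-factor check, and resizing before the table fills):

class RobinHoodSketch {
    static final int EMPTY = Integer.MIN_VALUE; // reserved: not storable as a key
    final int[] keys;
    final int[] values;
    final int mask; // capacity must be a power of two

    RobinHoodSketch(int capacityPow2) {
        keys = new int[capacityPow2];
        values = new int[capacityPow2];
        java.util.Arrays.fill(keys, EMPTY);
        mask = capacityPow2 - 1;
    }

    static int hash(int key) {
        key *= 0x9E3779B9;        // cheap integer mix, just for the sketch
        return key ^ (key >>> 16);
    }

    // How far a slot is from the key's ideal slot, wrapping around.
    int displacement(int key, int slot) {
        return (slot - (hash(key) & mask)) & mask;
    }

    // Assumes the table never fills completely; a real implementation
    // must track load and resize before that happens.
    void put(int key, int value) {
        int slot = hash(key) & mask;
        int dist = 0; // our distance from our ideal slot so far
        while (true) {
            if (keys[slot] == EMPTY) {   // free slot: claim it
                keys[slot] = key;
                values[slot] = value;
                return;
            }
            if (keys[slot] == key) {     // same key: overwrite value
                values[slot] = value;
                return;
            }
            // Robin Hood rule: if the occupant is closer to its ideal
            // slot than we are to ours, evict it and carry it onward.
            int occupantDist = displacement(keys[slot], slot);
            if (occupantDist < dist) {
                int k = keys[slot], v = values[slot];
                keys[slot] = key;
                values[slot] = value;
                key = k;
                value = v;
                dist = occupantDist;
            }
            slot = (slot + 1) & mask;
            dist++;
        }
    }
}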
The size of the array is also something that affects performance. The most common sizings are powers of two and prime numbers (see Java: A "prime" number or a "power of two" as HashMap size?).
Arguments exist for both, but mostly performance will depend on usage. Specifically, power-of-two sizes are usually very bad with hashes that are sequential, but the conversion of hash to array index can be done with a single AND operation vs. a comparatively expensive modulo.
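A small sketch of the two index computations (the method and class names are invented for the example):

public class IndexDemo {
    // Power-of-two capacity: a single AND on the low bits. Fast, but
    // it ignores the high bits, so low-entropy or sequential hashes
    // cluster unless the hash is mixed first (java.util.HashMap does
    // exactly such a mixing step).
    static int indexPow2(int hash, int capacityPow2) {
        return hash & (capacityPow2 - 1);
    }

    // Prime capacity: a comparatively expensive modulo, but it folds
    // in all bits of the hash. floorMod keeps negative hashes in range.
    static int indexPrime(int hash, int capacityPrime) {
        return Math.floorMod(hash, capacityPrime);
    }

    public static void main(String[] args) {
        System.out.println(indexPow2("apple".hashCode(), 16));
        System.out.println(indexPrime("apple".hashCode(), 17));
    }
}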
Notably, google_dense_hash is a very good hash map, easily outperforming the C++ standard library variant in almost every use case; it uses open addressing, a power-of-two resizing convention, and quadratic probing. Malte Skarupke wrote an excellent hash table that beats google_dense_hash in many cases, including lookup. His implementation uses Robin Hood hashing and linear probing with a probe-length limit, and it is very well described in a blog post, along with excellent benchmarks against other hash tables and a description of the performance gains.
I do not know in advance how many elements are going to be stored in my HashMap. So how big should the capacity of my HashMap be? What factors should I take into consideration here? I want to minimize the rehashing process as much as possible, since it is really expensive.
You want to have a good tradeoff between space requirement and speed (which is reduced if many collisions happen, which becomes more likely if you reduce the space allocation).
You can define a load factor; the default is probably fine.
But what you also want to avoid is having to rebuild and extend the hash table as it grows. So you want to size it with the maximum capacity up front. Unfortunately, for that, you need to know roughly how much you are going to put into it.
If you can afford to waste a little memory, and at least have a reasonable upper bound for how large it can get, you can use that as the initial capacity. It will never rehash if you stay below that capacity. The memory requirement is linear to the capacity (maybe someone has numbers). Keep in mind that with a default load factor of 0.75, you need to set your capacity a bit higher than the number of elements, as it will extend the table when it is already 75% full.
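As a hedged sketch of that arithmetic (the bound of 10,000 is an example value):

import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    public static void main(String[] args) {
        int expectedMax = 10_000;   // assumed upper bound for this example
        float loadFactor = 0.75f;   // HashMap's default

        // Request enough capacity that the element count never crosses
        // the load-factor threshold, so no rehash happens below the bound.
        int initialCapacity = (int) (expectedMax / loadFactor) + 1;
        Map<String, String> map = new HashMap<>(initialCapacity, loadFactor);
    }
}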
If you really have no idea, just use the defaults. Not because they are perfect in your case, but because you don't have any basis for alternative settings.
The good news is that even if you set suboptimal values, it will still work fine, just waste a bit of memory and/or CPU cycles.
The documentation gives the minimum information you need to make a reasonable decision. Read the introduction. I don't know what factors you should take into consideration, because you have not given details about the nature of your application, the expected load, and so on. My best advice at this stage: leave it at the default of 16, then do load testing (think about the app from the user's point of view) and you'll be able to figure out roughly how much capacity you need initially.
I usually do e.g.
HashMap<String,String> dictionary = new HashMap<String,String>();
I started to think about it, and as far as I know a HashMap is implemented under the hood via a hash table.
The objects are stored in the table using a hash to find where they should be stored in the table.
Does the fact that I do not set a size on the construction of the dictionary make the performance decrease?
I.e. what would be the size of the hash table during construction? Would it need to allocate new memory for the table as elements increase?
Or am I confused about the concept here?
Are the default capacity and load adequate or should I be spending time for the actual numbers?
The nice thing about Java is that it is open-source, so you can pull up the source code, which answers a number of questions:
No, there is no relationship between HashMap and Hashtable. HashMap derives from AbstractMap and does not internally use a Hashtable to manage its data.
Whether or not omitting an explicit size will decrease performance will depend upon your usage model (or more specifically, how many things you put into the map). The map will automatically double in size every time a certain threshold is hit (0.75 * <current map capacity>), and the doubling operation is expensive. So if you know approximately how many elements will be going into the map, you can specify a size and prevent it from ever needing to allocate additional space.
The default capacity of the map, if none is specified using the constructor, is 16. So it will double its capacity to 32 when the 13th element is added to the map (16 * 0.75 = 12, so that insertion crosses the threshold), then again on the 25th, and so on.
Yes, it needs to allocate new memory when the capacity increases. And it's a fairly costly operation (see the resize() and transfer() functions).
Unrelated to your question but still worth noting, I would recommend declaring/instantiating your map like:
Map<String,String> dictionary = new HashMap<String,String>();
...and of course, if you happen to know how many elements will be placed in the map, you should specify that as well.
Does the fact that I do not set a size on the construction of the dictionary make the performance decrease?
Depends on how much you're going to store in the HashMap and how your code will use it afterward. If you can give it a ballpark figure up front, it might be faster, but: "it's very important not to set the initial capacity too high [...] if iteration performance is important" [1] because iteration time is proportional to the capacity.
Doing this in non-performance-critical pieces of code would be considered premature optimization. If you're going to outsmart the JDK authors, make sure you have measurements that show that your optimization matters.
what would be the size of the hash table during construction?
According to the API docs, 16.
Would it need to allocate new memory for the table as elements increase?
Yes. Every time it's fuller than the load factor (default = .75), it reallocates.
Are the default capacity and load adequate
Only you can tell. Profile your program to see whether it's spending too much time in HashMap.put. If it's not, don't bother.
HashMap will automatically increase its size if it needs to. The best way to initialize it is to anticipate roughly how many elements you might need; if the figure is large, set the capacity to a number that won't require constant resizing. Furthermore, if you read the Javadoc for HashMap, you will see that the default size is 16 and the load factor is 0.75, which means that once the map is 75% full it will automatically resize. So if you expect to hold 1 million elements, you naturally want a larger size than the default one.
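For instance, a minimal sketch for the 1-million case (1,000,000 / 0.75 is about 1,333,334, so anything at or above that avoids resizing):

import java.util.HashMap;
import java.util.Map;

public class MillionDemo {
    public static void main(String[] args) {
        // Sized so a million entries stay under the 0.75 threshold,
        // avoiding every intermediate resize from the default of 16.
        Map<String, String> big = new HashMap<>(1_400_000);
    }
}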
I would declare it as interface Map first of all.
Map<String,String> dictionary = new HashMap<String,String>();
Does the fact that I do not set a size on the construction of the dictionary make the performance decrease?
Yes, initial capacity should be set for better performance.
Would it need to allocate new memory for the table as elements increase?
Yes, the load factor also affects performance.
More detail in docs
As stated here, the default initial capacity is 16 and the default load factor is 0.75. You can change either one with the various constructors, depending on your usage (though the defaults are generally good for general purposes).
So I am implementing my own hashtable in Java, since the built-in Hashtable has ridiculous memory overhead per entry. I'm making an open-addressed table with a variant of quadratic hashing, which is backed internally by two arrays, one for keys and one for values. I don't have the ability to resize, though. The obvious way to do it is to create larger arrays and then hash all of the (key, value) pairs into the new arrays from the old ones. This falls apart, though, when my old arrays take up over 50% of my current memory, since I can't fit both the old and new arrays in memory at the same time. Is there any way to resize my hashtable in this situation?
Edit: the info I got for current hashtable memory overheads is from here: How much memory does a Hashtable use?
Also, for my current application, my values are ints, so rather than store references to Integers, I have an array of ints as my values.
The simple answer is "no, there is no way to extend the length of an existing array". That said, you could add extra complexity to your hashtable and use arrays-of-arrays (or specialize and hard-code support for just two arrays).
You could partition your hash table. e.g. you could have 2-16 partitions based on 1-4 bits in the hashCode. This would allow you to resize a portion of the hash table at a time.
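A hedged sketch of that idea, using HashMap as a stand-in for the custom open-addressed table (the class and constants are invented for this example): a few hash bits pick a partition, and each partition resizes independently, so a resize only ever duplicates a sixteenth of the total memory.

import java.util.HashMap;
import java.util.Map;

public class PartitionedMap<K, V> {
    private static final int PARTITIONS = 16; // 4 bits of the hash
    private final Map<K, V>[] parts;

    @SuppressWarnings("unchecked")
    public PartitionedMap() {
        parts = (Map<K, V>[]) new Map[PARTITIONS];
        for (int i = 0; i < PARTITIONS; i++) {
            parts[i] = new HashMap<>(); // each one grows on its own
        }
    }

    private Map<K, V> partFor(Object key) {
        int h = key.hashCode();
        h ^= (h >>> 16);                 // mix before selecting bits
        return parts[h & (PARTITIONS - 1)];
    }

    public V put(K key, V value) { return partFor(key).put(key, value); }
    public V get(Object key)     { return partFor(key).get(key); }
}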
If you have a single hash table which is a large percentage of your memory size, you have a serious design issue IMHO. Are you using a mobile device? What is your memory restriction? Have you looked at using Trove4j, which doesn't use entry objects either?
Maybe a solution for the problem is:
- Create a list to store the content of the matrix (setting up a list for each row, then freeing the memory of the array in question one by one, if possible);
- Create the new matrix;
- Fill the matrix with the values stored in the list (removing an element from the list right after copying its info from it).
This can be easier if the matrix elements are pointers to the elements themselves.
This is a very theoretical approach to the problem, but I hope it helps.
If I have a key set of 1000, what is a suitable size for my Hash table, and how is that determined?
It depends on the load factor (the "percent full" point where the table will increase its size and re-distribute its elements). If you know you have exactly 1000 entries, and that number will never change, you can just set the load factor to 1.0 and the initial size to 1000 for maximum efficiency. If you weren't sure of the exact size, you could leave the load factor at its default of 0.75 and set your initial size to 1334 (expected size/LF) for really good performance, at a cost of extra memory.
You can use the following constructor to set the load factor:
Hashtable(int initialCapacity, float loadFactor)
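For example, under the assumptions above (the variable names are just for the example):

import java.util.Hashtable;

public class SizingDemo {
    public static void main(String[] args) {
        // Exactly 1000 keys, never changing: no headroom needed.
        Hashtable<String, Integer> exact = new Hashtable<>(1000, 1.0f);

        // Roughly 1000 keys: default load factor with headroom
        // (1000 / 0.75 is about 1334), so it should never rehash.
        Hashtable<String, Integer> roomy = new Hashtable<>(1334, 0.75f);
    }
}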
You need to factor in the hash function as well.
One rule of thumb suggests making the table size about double, so that there is room to expand and the number of collisions hopefully stays small.
Another rule of thumb is to assume that you are doing some sort of modulo related hashing, then round your table size up to the next largest prime number, and use that prime number as the modulo value.
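A hedged sketch of that rule of thumb (trial division is plenty fast at table-sized numbers; the names are invented for the example):

public class PrimeSize {
    // Round a requested table size up to the next prime, for use as
    // the modulo value in a mod-based hash scheme.
    static int nextPrime(int n) {
        if (n <= 2) return 2;
        if (n % 2 == 0) n++;
        while (!isPrime(n)) n += 2;
        return n;
    }

    static boolean isPrime(int n) { // n is odd and >= 3 when called above
        for (int i = 3; (long) i * i <= n; i += 2) {
            if (n % i == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(nextPrime(2000)); // prints 2003
    }
}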
What kind of things are you hashing? More detail should generate better advice.
There's some discussion of these factors in the documentation for Hashtable
Let it grow. At this size, the automatic handling is fine. Other than that, 2 x size + 1 is a simple formula. Prime numbers are also kind of good, but as soon as your data set reaches a certain size, the hash implementation may decide to rehash and grow the table.
Your keys are driving the effectiveness and are hopefully distinct enough.
Bottom line: ask the size question when you have actual problems, such as memory pressure or slow performance; other than that, do not worry!
Twice is good.
You don't have a big keyset.
Don't bother about difficult discussions about your HashTable implementation, and go for 2000.
I'd like to reiterate what https://stackoverflow.com/users/33229/wwwflickrcomphotosrene-germany said above. 1000 doesn't seem like a very big hash table to me. I've been using a lot of hashtables around that size in Java without seeing much in the way of performance problems, and I hardly ever muck about with the size or load factor.
If you've run a profiler on your code and determined that the hashtable is your problem, then by all means start tweaking. Otherwise, I wouldn't assume you've got a problem until you're sure.
After all, in most code, the performance problem isn't where you think it is. I try not to anticipate.