I do not know in advance how many elements are going to be stored in my Hashmap . So how big should the capacity of my HashMap be ? What factors should I take into consideration here ? I want to minimize the rehashing process as much as possible since it is really expensive.
You want to have a good tradeoff between space requirement and speed (which is reduced if many collisions happen, which becomes more likely if you reduce the space allocation).
You can define a load factor, the default is probably fine.
But what you also want to avoid is having to rebuild and extend the hash table as it grows. So you want to size it with the maximum capacity up front. Unfortunately, for that, you need to know roughly how much you are going to put into it.
If you can afford to waste a little memory, and at least have a reasonable upper bound for how large it can get, you can use that as the initial capacity. It will never rehash if you stay below that capacity. The memory requirement is linear to the capacity (maybe someone has numbers). Keep in mind that with a default load factor of 0.75, you need to set your capacity a bit higher than the number of elements, as it will extend the table when it is already 75% full.
If you really have no idea, just use the defaults. Not because they are perfect in your case, but because you don't have any basis for alternative settings.
The good news is that even if you set suboptimal values, it will still work fine, just waste a bit of memory and/or CPU cycles.
The documentation gives the minimum necessary information you need to be able to make a reasonable decision. Read the introduction. I don't know factors you should take into consideration because you have not given details about the nature of your application, the expected load,... My best advice at this stage, let it stay at the default of 16, then do a load testing ( think about the app on the user point of view ) and you'll be able to figure out just roughly how much capacity you need initially.
Related
I'm working on a project that requires that I store (potentially) millions of key-value mapping, and make (potentially) the 100s of queries a second. There are some checks I can do around the data I'm working with, but it will only reduce the load by a bit. In addition, I will be making (potentially) 100s of put/removes a second, so my question is: Is there a map sufficient for this task? Is there any way I might optimize the map? Is there something faster that would work for storing key-value mappings?
Some additional information;
- The key will be a point in 3d spaces, I feel like this means I could use arrays, but the arrays would have to be massive
- The value must be an object
Any help would be greatly appreciated!
Back of envelope estimates help in getting to terms with this sort of thing. If you have millions of entries in a map, lets say 32M, and a key is a 3d point (so 3 ints->3*4B->12 bytes) ->12B * 32M = 324MB. You didn't mention the size of the value but assuming you have a similarly sized value lets double that figure. This is Java, so assuming a 64bit platform with Compressed OOPs which is default and what most people are on, you pay an extra 12B of object header per Object. So: 32M * 2 * 24B = 1536MB.
Now if you use a HashMap each entry requires an extra HashMap.Node, in Java8 on the platform above you are looking at 32B per Node (use OpenJDK JOL to find out object sizes). Which brings us to 2560MB. Also throw in the cost of the HashMap array, with 32M entries you are looking at a table with 64M entries (because the array size is a power of 2 and you need some slack beyond your entries), so that's an extra 256MB. All together lets round it up to 3GB?
Most servers these days have quite large amounts of memory (10s to 100s of GB) and adding an extra 3GB to the JVM live set should not scare you. You might consider it disappointing that the overhead exceeds the data in your case, but this is not your emotional well being, it's a question of will it work ;-)
Now that you've loaded up the data, you are mutating it at a rate of 100s of inserts/deletes per second, lets say 1024, reusing above quantities we can sum it up with: 1024 * (24*2 + 32) = 70KB. Churning 70KB of garbage per second is small change for many applications, and not something you necessarily need to sweat about. To put it in context, a JVM will contend with collecting many 100s of MB of Young Generation in a matter of 10s of milliseconds these days.
So, in summary, if all you need is to load the data and query/mutate it along the lines you describe you might just find that a modern server can easily contend with a vanilla solution. I'd recommend you give that a go, maybe prototype with some representative data set, and see how it works out. If you have an issue you can always find more exotic/efficient solutions.
Do you always have to know the size of the array for a Hashtable prior to creating the array?
No, you don't. A quality implementation (Hashtable/HashMap) will resize itself automatically as the number of elements increases.
If you are talking about your own implementation, the answer depends on whether the hash table is capable of increasing the number of buckets as its size grows.
If you are worried about the performance implications of the resizing, the correct approach is to profile this in the context of your overall application.
No, in fact is bad to have it fixed to a certain value.
For more info you can start here with Wikipedia.
I usually do e.g.
HashMap<String,String> dictionary = new HashMap<String,String>();
I started to think about it, and as far as I know a HashMap is implemented under the hood via a hash table.
The objects are stored in the table using a hash to find where they should be stored in the table.
Does the fact that I do not set a size on the construction of the dictionary makes the performace decrease?
I.e. what would be the size of the hash table during construction? Would it need to allocate new memory for the table as elements increase?
Or I am confused on the concept here?
Are the default capacity and load adequate or should I be spending time for the actual numbers?
The nice thing about Java is that it is open-source, so you can pull up the source code, which answers a number of questions:
No, there is no relationship between HashMap and HashTable. HashMap derives from AbstractMap, and does not internally use a HashTable for managing data.
Whether or not omitting an explicit size will decrease performance will depend upon your usage model (or more specifically, how many things you put into the map). The map will automatically double in size every time a certain threshold is hit (0.75 * <current map capacity>), and the doubling operation is expensive. So if you know approximately how many elements will be going into the map, you can specify a size and prevent it from ever needing to allocate additional space.
The default capacity of the map, if none is specified using the constructor, is 16. So it will double its capacity to 32 when the 12th element is added to the map. And then again on the 24th, and so on.
Yes, it needs to allocate new memory when the capacity increases. And it's a fairly costly operation (see the resize() and transfer() functions).
Unrelated to your question but still worth noting, I would recommend declaring/instantiating your map like:
Map<String,String> dictionary = new HashMap<String,String>();
...and of course, if you happen to know how many elements will be placed in the map, you should specify that as well.
Does the fact that I do not set a size on the construction of the dictionary makes the performace decrease?
Depends on how much you're going to store in the HashMap and how your code will use it afterward. If you can give it a ballpark figure up front, it might be faster, but: "it's very important not to set the initial capacity too high [...] if iteration performance is important" 1 because iteration time is proportional to the capacity.
Doing this in non-performance-critical pieces of code would be considered premature optimization. If you're going to outsmart the JDK authors, make sure you have measurements that show that your optimization matters.
what would be the size of the hash table during construction?
According to the API docs, 16.
Would it need to allocate new memory for the table as elements increase?
Yes. Every time it's fuller than the load factor (default = .75), it reallocates.
Are the default capacity and load adequate
Only you can tell. Profile your program to see whether it's spending too much time in HashMap.put. If it's not, don't bother.
Hashmap would automatically increase the size if it needs to. The best way to initialize is if you have some sort of anticipating how much elements you might needs and if the figure is large just set it to a number which would not require constant resizing. Furthermore if you read the JavaDoc for Hashmap you would see that the default size is 16 and load factor is 0.75 which means that once the hashmap is 75% full it will automatically resize. So if you expect to hold 1million elements it is natural you want a larger size than the default one
I would declare it as interface Map first of all.
Map<String,String> dictionary = new HashMap<String,String>();
Does the fact that I do not set a size on the construction of the
dictionary makes the performace decrease?
Yes, initial capacity should be set for better performance.
Would it need to allocate new memory for the table as elements
increase
Yes, load factor also effects performance.
More detail in docs
As stated here, the default initial capacity is 16 and the default load factor is 0.75. You can change either one with different c'tors, and this depends on your usage (though these are generally good for general purposes).
I have a HashMap that contains 12 million entries. It maps String values to Long values. Each string is about ten characters long. Is it possible to calculate how much memory this map will need in RAM?
You can guess, and you can make an educated guess by looking at the size of each item that goes into the map, but your best educated guess will be wrong.
JVMs have other structure used to track references and hold type information for the classes. That will add a fixed, yet unknown amount of memory to an accurate (if you can come up with an accurate input estimate) estimate.
As only some of the memory is memory directly holding the data, and some of the memory is memory used as overhead to hold the data, you need to profile your memory consumption and based your estimates on projections of the memory "growth" when using smaller maps.
Note that profiling a JVM is a tricky task, as it is optimizing memory usage in a manner that will present varying results depending on how long the JVM is running, the activity of the Map, etc. You need to do statistical sampling of the input in a variety of conditions; but, odds are good you will eventually be able to put your finger on a reasonable number. More importantly, you will also be able to say "Well it might peak up at around this number temporarily, but should settle down to this on average". Temporal changes to memory are often overlooked in static analysis.
The JVM would know because it has to allocate and manage memory but it doesn’t tell you. So, short of going native, no, there is no way to know how much memory is actually used by your objects.
A profiler will tell you how much memory is being used by the program, and which objects are using what. You might be able to find your objects' memory usage.
VisualVM is included in Java6, and will give you this information.
It's worth mentioning, though, this is not necessarily going to give you a memory 'requirement', just a view of how much memory it is using at that point in time.
If I have a key set of 1000, what is a suitable size for my Hash table, and how is that determined?
It depends on the load factor (the "percent full" point where the table will increase its size and re-distribute its elements). If you know you have exactly 1000 entries, and that number will never change, you can just set the load factor to 1.0 and the initial size to 1000 for maximum efficiency. If you weren't sure of the exact size, you could leave the load factor at its default of 0.75 and set your initial size to 1334 (expected size/LF) for really good performance, at a cost of extra memory.
You can use the following constructor to set the load factor:
Hashtable(int initialCapacity, float loadFactor)
You need to factor in the hash function as well.
one rule of thumb suggests make the table size about double, so that there is room to expand, and hopefully keep the number of collisions small.
Another rule of thumb is to assume that you are doing some sort of modulo related hashing, then round your table size up to the next largest prime number, and use that prime number as the modulo value.
What kind of things are you hashing? More detail should generate better advice.
There's some discussion of these factors in the documentation for Hashtable
Let it grow. With this size, the automatic handling is fine. Other than that, 2 x size + 1 is a simple formula. Prime numbers are also kind of good, but as soon as your data set reaches a certain size, the hash implementation might decide to rehash and grow the table.
Your keys are driving the effectiveness and are hopefully distinct enough.
Bottom line: Ask the size question when you have problems such as size or slow performance, other than that: Do not worry!
Twice is good.
You don't have a big keyset.
Don't bother about difficult discussions about your HashTable implementation, and go for 2000.
I'd like to reiterate what https://stackoverflow.com/users/33229/wwwflickrcomphotosrene-germany said above. 1000 doesn't seem like a very big hash to me. I've been using a lot of hashtables about that size in java without seeing much in the way of performance problems. And I hardly ever muck about with the size or load factor.
If you've run a profiler on your code and determined that the hashtable is your problem, then by all means start tweaking. Otherwise, I wouldn't assume you've got a problem until you're sure.
After all, in most code, the performance problem isn't where you think it is. I try not to anticipate.