I usually do e.g.
HashMap<String,String> dictionary = new HashMap<String,String>();
I started to think about it, and as far as I know a HashMap is implemented under the hood via a hash table.
The objects are stored in the table using a hash to find where they should be stored in the table.
Does the fact that I do not set a size on the construction of the dictionary make the performance decrease?
I.e. what would be the size of the hash table during construction? Would it need to allocate new memory for the table as elements increase?
Or am I confused about the concept here?
Are the default capacity and load factor adequate, or should I spend time working out the actual numbers?
The nice thing about Java is that it is open-source, so you can pull up the source code, which answers a number of questions:
No, there is no relationship between HashMap and Hashtable as classes. HashMap derives from AbstractMap and does not internally use a Hashtable to manage its data (although it is, of course, itself implemented as a hash table: an array of buckets indexed by hash).
Whether or not omitting an explicit size will decrease performance will depend upon your usage model (or more specifically, how many things you put into the map). The map will automatically double in size every time a certain threshold is hit (0.75 * <current map capacity>), and the doubling operation is expensive. So if you know approximately how many elements will be going into the map, you can specify a size and prevent it from ever needing to allocate additional space.
The default capacity of the map, if none is specified in the constructor, is 16. So it will double its capacity to 32 once the number of entries crosses the threshold of 12 (0.75 × 16), then to 64 once it crosses 24, and so on.
Yes, it needs to allocate new memory when the capacity increases. And it's a fairly costly operation (see the resize() and transfer() functions).
Unrelated to your question but still worth noting, I would recommend declaring/instantiating your map like:
Map<String,String> dictionary = new HashMap<String,String>();
...and of course, if you happen to know how many elements will be placed in the map, you should specify that as well.
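A rough sketch of both suggestions together (the expectedElements figure is a made-up estimate; dividing by the 0.75 load factor keeps the resize threshold above the expected count, so the map should never need to grow):

int expectedElements = 1_000; // hypothetical estimate of the final element count
Map<String, String> dictionary = new HashMap<>((int) (expectedElements / 0.75f) + 1);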
Does the fact that I do not set a size on the construction of the dictionary make the performance decrease?
Depends on how much you're going to store in the HashMap and how your code will use it afterward. If you can give it a ballpark figure up front, it might be faster, but: "it's very important not to set the initial capacity too high [...] if iteration performance is important" (from the HashMap Javadoc), because iteration time is proportional to the capacity.
Doing this in non-performance-critical pieces of code would be considered premature optimization. If you're going to outsmart the JDK authors, make sure you have measurements that show that your optimization matters.
what would be the size of the hash table during construction?
According to the API docs, 16.
Would it need to allocate new memory for the table as elements increase?
Yes. Every time it gets fuller than the load factor allows (default = 0.75), it reallocates and rehashes.
Are the default capacity and load factor adequate
Only you can tell. Profile your program to see whether it's spending too much time in HashMap.put. If it's not, don't bother.
HashMap will automatically increase its size if it needs to. The best way to initialize it is to anticipate roughly how many elements you might need; if that figure is large, just set the initial capacity to a number that would not require constant resizing. Furthermore, if you read the JavaDoc for HashMap, you will see that the default capacity is 16 and the load factor is 0.75, which means that once the map is 75% full it will automatically resize. So if you expect to hold one million elements, you naturally want a larger initial capacity than the default one.
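For example (the one-million figure is just the estimate from above; with the default constructor the table would otherwise double from 16 all the way up to 2,097,152, roughly 17 rehashes):

// pre-sized so that ~1,000,000 entries stay below the 0.75 threshold
Map<String, Integer> counts = new HashMap<>((int) (1_000_000 / 0.75) + 1);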
I would declare it using the Map interface, first of all.
Map<String,String> dictionary = new HashMap<String,String>();
Does the fact that I do not set a size on the construction of the dictionary make the performance decrease?
Yes, initial capacity should be set for better performance.
Would it need to allocate new memory for the table as elements increase?
Yes, the load factor also affects performance.
More detail in the docs.
As stated here, the default initial capacity is 16 and the default load factor is 0.75. You can change either one with the different constructors, and the right choice depends on your usage (though the defaults are generally fine for general-purpose use).
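For reference, the three relevant constructors look like this (the 64 and 0.9f values are arbitrary examples, not recommendations):

Map<String, String> a = new HashMap<>();          // default: capacity 16, load factor 0.75
Map<String, String> b = new HashMap<>(64);        // custom initial capacity, default load factor
Map<String, String> c = new HashMap<>(64, 0.9f);  // custom capacity and load factor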
I have a project that is handling a large amount of data that is being written to an Excel file. I store this data in a static HashMap in the form Map<List<String>, Integer>, where the size of the list is only ever 3. The number of entries in the Map, however, can range anywhere from 0 to 11,300.
The flow of this project is:
Load Map up with entries
Iterate Map and do stuff
Clear map for next set of entries
What I recently found out about HashMap, though, is how it resizes when the load-factor threshold is breached. So not only is my Map constantly resizing at dramatic lengths, but it could very well have about 20,000 empty entries by the time I clear the largest set of entries.
So I'm trying to micro-optimize this thing and I'm stuck with a dilemma of how to do this. My two thoughts are to:
Set the initial capacity of the HashMap to a value that would allow it to resize at most once
Reinitialize the HashMap with the average size that is expected for each new entry set to limit re-sizing and allow the garbage collector to do some clean up
My intuition tells me option two might be the more reasonable one, but that could still mean a lot of resizing depending on the next entry set. Option one, on the other hand, limits resizing to a one-time operation but leaves me with literally thousands of null entries.
Is one of my two proposed solutions better than the other, is there not much difference in memory improvement between the two, or is there some other solution I have overlooked (that does not involve changing the data structure)?
EDIT: Just for some context, I'm wanting to do this because occasionally the project runs out of heap memory and I'm trying to determine how much of an impact this gigantic map is or could be.
EDIT2: Just to clarify, the size of the Map itself is the larger value. The size of the key (i.e. the list) is ONLY ever 3.
The question and accepted response here were so wrong that I had to reply.
I have a project that is handling a large amount of data that is being written to an Excel file. I store this data in a static HashMap in the form Map<List<String>, Integer>, where the size of the list is only ever 3. The number of entries in the Map, however, can range anywhere from 0 to 11,300.
Please don't take me wrong, but this is tiny! Don't even bother to optimize something like this! I quickly made a test: filling a HashMap with 11,300 elements takes less than a dozen milliseconds.
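A minimal sketch of that kind of quick check (the key shape mirrors the question's Map<List<String>, Integer>; the class name and exact timings are just illustrative):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FillTest {
    public static void main(String[] args) {
        long start = System.nanoTime();
        Map<List<String>, Integer> map = new HashMap<>();
        for (int i = 0; i < 11_300; i++) {
            // three-element keys, as described in the question
            map.put(Arrays.asList("a" + i, "b" + i, "c" + i), i);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(map.size() + " entries filled in ~" + elapsedMs + " ms");
    }
}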
What I recently found out about HashMap, though, is how it resizes when the load-factor threshold is breached. So not only is my Map constantly resizing at dramatic lengths, but it could very well have about 20,000 empty entries by the time I clear the largest set of entries.
...just to be clear: empty entries consume almost no space; they are just empty pointers, 8 bytes per slot on 64-bit machines or 4 bytes per slot on 32-bit. Even 20,000 empty slots amount to roughly 160 KB, which is negligible here.
Reinitialize the HashMap with the average size that is expected for each new entry set to limit re-sizing and allow the garbage collector to do some clean up.
It's not the average "size" of the entries; it's the average number of entries to be expected.
EDIT: Just for some context, I'm wanting to do this because occasionally the project runs out of heap memory and I'm trying to determine how much of an impact this gigantic map is or could be.
It's unlikely to be the map. Use a profiler! You can store millions of elements without a fuss.
The accepted answer is bad
You could change these values at initialisation, e.g. a capacity of 11,300 and a load factor of 1, meaning the map will not increase in size until your maximum has been met, which in your case, as I understand it, will be never.
This is not good advice. Using the same capacity as the expected number of items inserted and a load factor of one, you are bound to get a really large number of hash collisions. This will be a performance disaster.
Conclusion
If you don't know how stuff works, don't try to micro-optimize.
I did some research and ended up on this page: How does a HashMap work in Java.
The second-to-last heading has to do with resizing overhead, stating that the defaults for a HashMap are a capacity of 16 and a load factor of 0.75.
You could change these values at initialisation, e.g. a capacity of 11,300 and a load factor of 1, meaning the map will not increase in size until your maximum has been met, which in your case, as I understand it, will be never.
I did a quick experiment, using this code:
import java.util.HashMap;
import java.util.Map;

public class MapMemoryTest {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> map = new HashMap<>(11000000, 1f);
        // Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < 11000000; i++) {
            map.put(i + "", i);
        }
        System.out.println(map.size());
        Thread.sleep(9000); // keep the process alive long enough to read Task Manager
    }
}
I swapped the two Map initialisations and then checked the memory the process consumes in Task Manager.
With the initial size and load factor set, it uses ~1.45 GB of memory.
Without the values set, it uses ~1.87 GB of memory.
Re-initialising the Map every time, instead of clearing it, so that a potentially smaller Map can take its place, will be slower, but you would possibly free up more memory temporarily.
You could also do both: re-initialise and set the initial size and load factor properties, should you know the number of List objects for each cycle.
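A sketch of that combined approach (the field name dataMap and the per-cycle estimate are placeholders, and note the earlier reply's warning about a load factor of 1):

// instead of dataMap.clear(), drop the old map and pre-size the next one
int expectedEntries = 11_300;                 // hypothetical estimate for the next cycle
dataMap = new HashMap<>(expectedEntries, 1f); // the sizing suggested above; see the earlier caveat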
The article also suggests that the Java 8 HashMap, though potentially faster, could also have more memory overhead than the Java 7 one. It might be worth compiling and running the program on both versions to see which gives the better memory result. It would be interesting if nothing else.
While studying the HashSet class, I learned that it uses the concept of a fill ratio (load factor), which says that if the HashSet is filled up to this limit, a larger HashSet is created and the values are copied into it. Why don't we let the HashSet fill up completely with objects and only then create a new HashSet? Why was this extra concept introduced for HashSet?
Both ArrayList and Vector are accessed by positional index, so that there are no conflicts and access is always O(1).
A hash-based data structure is accessed by a hashed value, which can collide and degrade into access to a second-level "overflow" data structure (list or tree). If you have no such collisions, access is O(1), but if you have many collisions, it can be significantly worse. You can control this a bit by allocating more memory (so that there are more buckets and hopefully fewer collisions).
As a result, there is no need to grow an ArrayList to a capacity more than you need to store all elements, but it does make sense to "waste" a bit (or a lot) in the case of a HashSet. The parameter is exposed to allow the programmer to choose what works best for her application.
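For instance (the numbers are purely illustrative), trading a little memory for fewer collisions could look like this:

// more buckets up front and a lower load factor: fewer collisions, more empty slots
Set<String> names = new HashSet<>(1_024, 0.5f);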
As Jonny Henly has described, it is because of the way the data is stored.
ArrayList is a linear data structure, while HashSet is not. In a HashSet, data is stored in an underlying array based on hash codes. In a way, the performance of a HashSet is linked to how many buckets are filled and how well the data is distributed among those buckets. Once the distribution of data goes beyond a certain level (called the load factor), re-hashing is done.
HashSet is primarily used to ensure that the basic operations (such as adding, fetching, modifying and deleting) are performed in constant time regardless of the number of entries being stored in the HashSet.
Though a well-designed hash function can achieve this, designing one might take time. So if performance is a critical requirement for the application, we can use the load factor to help keep the operations constant-time as well. In that sense, I think the load factor and the hash function complement each other.
I agree that this may not be a perfect explanation, but I hope it does bring some clarity on the subject.
Do you always have to know the size of the array for a Hashtable prior to creating the array?
No, you don't. A quality implementation (Hashtable/HashMap) will resize itself automatically as the number of elements increases.
If you are talking about your own implementation, the answer depends on whether the hash table is capable of increasing the number of buckets as its size grows.
If you are worried about the performance implications of the resizing, the correct approach is to profile this in the context of your overall application.
No; in fact, it is bad to have it fixed at a certain value.
For more info you can start here with Wikipedia.
I do not know in advance how many elements are going to be stored in my HashMap. So how big should the capacity of my HashMap be? What factors should I take into consideration here? I want to minimize the rehashing process as much as possible since it is really expensive.
You want to have a good tradeoff between space requirement and speed (which is reduced if many collisions happen, which becomes more likely if you reduce the space allocation).
You can define a load factor; the default is probably fine.
But what you also want to avoid is having to rebuild and extend the hash table as it grows. So you want to size it with the maximum capacity up front. Unfortunately, for that, you need to know roughly how much you are going to put into it.
If you can afford to waste a little memory, and at least have a reasonable upper bound for how large it can get, you can use that as the initial capacity. It will never rehash if you stay below that capacity. The memory requirement is linear to the capacity (maybe someone has numbers). Keep in mind that with a default load factor of 0.75, you need to set your capacity a bit higher than the number of elements, as it will extend the table when it is already 75% full.
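A small sketch of that sizing calculation, assuming upperBound is the reasonable upper bound mentioned above:

int upperBound = 50_000;   // assumed upper bound on the number of elements
float loadFactor = 0.75f;
Map<String, String> map = new HashMap<>((int) Math.ceil(upperBound / loadFactor), loadFactor);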
If you really have no idea, just use the defaults. Not because they are perfect in your case, but because you don't have any basis for alternative settings.
The good news is that even if you set suboptimal values, it will still work fine, just waste a bit of memory and/or CPU cycles.
The documentation gives the minimum information you need to be able to make a reasonable decision; read the introduction. I can't tell you which factors to take into consideration because you have not given details about the nature of your application, the expected load, and so on. My best advice at this stage: leave it at the default of 16, then do load testing (think about the app from the user's point of view) and you'll be able to figure out roughly how much capacity you need initially.
So I am implementing my own hash table in Java, since the built-in Hashtable has ridiculous memory overhead per entry. I'm making an open-addressed table with a variant of quadratic hashing, backed internally by two arrays, one for keys and one for values. I don't have the ability to resize, though. The obvious way to do it is to create larger arrays and then re-hash all of the (key, value) pairs into the new arrays from the old ones. This falls apart, though, when my old arrays take up over 50% of my current memory, since I can't fit both the old and new arrays in memory at the same time. Is there any way to resize my hash table in this situation?
Edit: the info I got on current hashtable memory overheads is from here: How much memory does a Hashtable use?
Also, for my current application, my values are ints, so rather than store references to Integers, I have an array of ints as my values.
The simple answer is "no, there is no way to extend the length of an existing array". That said, you could add extra complexity to your hashtable and use arrays-of-arrays (or specialize and hard-code support for just two arrays).
You could partition your hash table, e.g. into 2-16 partitions based on 1-4 bits of the hashCode. This would allow you to resize one portion of the hash table at a time (see the sketch below).
If you have a single hash table which is a large percentage of your memory size, you have a serious design issue IMHO. Are you using a mobile device? What is your memory restriction? Have you looked at using Trove4j, which doesn't use entry objects either?
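To make the partitioning idea concrete, here is a rough sketch (not the asker's actual code: it uses linear rather than quadratic probing for brevity, and all names and sizes are illustrative). Each of the 16 sub-tables grows independently, so a resize only ever copies about 1/16 of the data, which keeps the peak "old array plus new array" footprint much smaller:

public class PartitionedIntMap {
    private static final int PARTITIONS = 16;
    private final Partition[] parts = new Partition[PARTITIONS];

    public PartitionedIntMap() {
        for (int i = 0; i < PARTITIONS; i++) parts[i] = new Partition();
    }

    public void put(Object key, int value) { part(key).put(key, value); }
    public Integer get(Object key)         { return part(key).get(key); }

    // 4 bits of the (mixed) hash select one of the 16 sub-tables
    private Partition part(Object key) {
        int h = key.hashCode();
        h ^= (h >>> 16);
        return parts[h & (PARTITIONS - 1)];
    }

    // One open-addressed sub-table: parallel key/value arrays, linear probing
    private static final class Partition {
        private Object[] keys = new Object[16];
        private int[] vals = new int[16];
        private int size;

        void put(Object key, int value) {
            if ((size + 1) * 4 > keys.length * 3) resize(); // keep load <= 0.75
            int i = slotFor(key, keys);
            if (keys[i] == null) size++;
            keys[i] = key;
            vals[i] = value;
        }

        Integer get(Object key) {
            int i = slotFor(key, keys);
            return keys[i] == null ? null : vals[i];
        }

        // first slot that is empty or already holds an equal key
        private int slotFor(Object key, Object[] table) {
            int mask = table.length - 1;
            int i = (key.hashCode() & 0x7fffffff) & mask;
            while (table[i] != null && !table[i].equals(key)) {
                i = (i + 1) & mask; // linear probe; a quadratic step would slot in here
            }
            return i;
        }

        // doubles only this partition, so at most ~1/16 of all data is copied at once
        private void resize() {
            Object[] oldKeys = keys;
            int[] oldVals = vals;
            keys = new Object[oldKeys.length * 2];
            vals = new int[oldVals.length * 2];
            for (int i = 0; i < oldKeys.length; i++) {
                if (oldKeys[i] != null) {
                    int j = slotFor(oldKeys[i], keys);
                    keys[j] = oldKeys[i];
                    vals[j] = oldVals[i];
                }
            }
        }
    }

    public static void main(String[] args) {
        PartitionedIntMap m = new PartitionedIntMap();
        for (int i = 0; i < 100_000; i++) m.put("k" + i, i);
        System.out.println(m.get("k42")); // prints 42
    }
}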
Maybe a solution for the problem is:
-> Create a list to store the content of the matrix (one list per row, then free the memory of the array in question, if possible, one row at a time);
-> Create the new matrix;
-> Fill the matrix with the values stored in the list (removing each element from the list right after copying its info).
This can be easier if the matrix elements are pointers to the elements themselves.
This is a very theoretical approach to the problem, but I hope it helps.