I have a project that is handling a large amount of data that is being written to an Excel file. I store this data in a static HashMap in the form Map<List<String>, Integer>, where the size of the list is only ever 3. The number of entries in the Map, however, can range anywhere from 0 to 11,300.
The flow of this project is:
Load Map up with entries
Iterate Map and do stuff
Clear map for next set of entries
What I recently found out about HashMap, though, is how it re-sizes when its size threshold is breached. So not only is my Map re-sizing constantly, and at dramatic lengths, but it could very well have about 20,000 empty buckets by the time I clear the largest set of entries.
So I'm trying to micro-optimize this thing and I'm stuck with a dilemma of how to do this. My two thoughts are to:
Set the initial capacity of the HashMap to a value large enough that it will only ever re-size once at most
Reinitialize the HashMap with the average size expected for each new entry set, to limit re-sizing and allow the garbage collector to do some clean-up
My intuition tells me option two might be the more reasonable one, but it could still mean a lot of re-sizing depending on the next entry set. Option one, on the other hand, limits re-sizing to a one-time operation but leaves me with literally thousands of null entries.
Is one of my two proposed solutions better than the other, is there not much difference in memory improvement between the two, or is there some other solution I have overlooked (that does not involve changing the data structure)?
EDIT: Just for some context, I'm wanting to do this because occasionally the project runs out of heap memory and I'm trying to determine how much of an impact this gigantic map is having or could have.
EDIT2: Just to clarify, it is the Map itself that gets large; each key (i.e. the list) ONLY ever has 3 elements.
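Concretely, the two options I'm weighing look roughly like this (just a sketch; 11,300 is the largest set I've seen so far, and averageEntries and resetFor are placeholders I'd still have to fill in):

// Option 1: size the map once so it should never resize for up to ~11,300 entries
// (11,300 / 0.75 is about 15,067, which HashMap rounds up to a 16,384-slot table).
private static Map<List<String>, Integer> data = new HashMap<>(15067);

// Option 2: instead of data.clear(), re-create the map before each cycle at the
// average expected size and let the old one be garbage collected.
private static void resetFor(int averageEntries) {
    data = new HashMap<>((int) (averageEntries / 0.75f) + 1);
}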
The question and the accepted response here were so wrong that I had to reply.
I have a project that is handling a large amount of data that is being written to an Excel file. I store this data in a static HashMap in the form Map<List<String>, Integer>, where the size of the list is only ever 3. The number of entries in the Map, however, can range anywhere from 0 to 11,300.
Please don't take this the wrong way, but this is tiny!!! Don't even bother to optimize something like this! I quickly made a test: filling a HashMap with 11,300 elements takes less than a dozen milliseconds.
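The test was something along these lines (a rough sketch rather than a proper benchmark; no JIT warm-up, and the key shape just mirrors the Map<List<String>, Integer> from the question):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FillTest {
    public static void main(String[] args) {
        Map<List<String>, Integer> map = new HashMap<>();
        long start = System.nanoTime();
        for (int i = 0; i < 11300; i++) {
            // three-element String keys, as described in the question
            map.put(Arrays.asList("a" + i, "b" + i, "c" + i), i);
        }
        long elapsedMs = (System.nanoTime() - start) / 1000000;
        System.out.println(map.size() + " entries filled in ~" + elapsedMs + " ms");
    }
}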
What I recently found out about HashMap, though, is how it re-sizes when its size threshold is breached. So not only is my Map re-sizing constantly, and at dramatic lengths, but it could very well have about 20,000 empty buckets by the time I clear the largest set of entries.
...just to be clear: empty buckets consume almost no space, they are just null pointers in the table. 8 bytes per slot on 64-bit machines, or 4 bytes per slot on 32-bit. The 20,000 empty buckets you mention come to about 160 KB at the very most (20,000 slots * 8 bytes), which is still negligible.
Reinitialize the HashMap with the average size that is expected for each new entry set to limit re-sizing and allow the garbage collector to do some clean up.
It's not the average "size" of the entries, it's the average number of entries to be expected.
EDIT: Just for some context, I'm wanting to do this because occasionally the project runs out of heap memory and I'm trying to determine how much of an impact this gigantic map is having or could have.
It's unlikely to be the map. Use a profiler! You can store millions of elements without a fuss.
The accepted answer is bad
You could change these values on initialisation, so a size of 11300 and a load factor of 1, meaning the map will not increase in size until your maximum has been met, which in your case, as I understand it, will be never.
This is not good advice. If you use the same capacity as the expected number of inserted items together with a load factor of 1, you are bound to get a really large number of hash collisions. This will be a performance disaster.
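If you really feel you must pre-size, derive the capacity from the expected entry count and keep the default load factor, rather than forcing the load factor to 1 (a sketch; 11,300 is just the figure from the question):

int expectedEntries = 11300;   // figure from the question
Map<List<String>, Integer> map = new HashMap<>((int) (expectedEntries / 0.75f) + 1);
// The capacity is derived from expected size / load factor, so the default
// load factor of 0.75 (and its normal collision behaviour) is kept.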
Conclusion
If you don't know how stuff works, don't try to micro-optimize.
I did some research and ended up on this page: How does a HashMap work in Java
The second-to-last heading has to do with resizing overhead, stating that the defaults for a HashMap are a size of 16 and a load factor of 0.75.
You could change these values on initialisation, so a size of 11300 and a load factor of 1, meaning the map will not increase in size until your maximum has been met, which in your case, as I understand it, will be never.
I did a quick experiment, using this code:
public static void main(String[] args) throws Exception {
    // Pre-sized map: 11,000,000 initial capacity with a load factor of 1,
    // so it never resizes during the fill below.
    Map<String, Integer> map = new HashMap<>(11000000, 1);
    // Map<String, Integer> map = new HashMap<>();   // default sizing, for comparison
    for (int i = 0; i < 11000000; i++) {
        map.put(i + "", i);
    }
    System.out.println(map.size());
    Thread.sleep(9000);   // keep the JVM alive so memory can be inspected in Task Manager
}
I swapped between the two Map initialisations and then checked the memory consumed in Task Manager.
With the initial size and load factor set, it uses ~1.45GB of memory.
Without the values set, it uses ~1.87GB of memory.
Re-initialising the Map every time, instead of clearing it so a potentially smaller Map can take its place, will be slower, but it could temporarily free up more memory.
You could also do both: re-initialise the Map to set the initial size and the load factor properties, should you know the number of List keys for each cycle.
The article also suggests that the Java 8 HashMap, though potentially faster, could also potentially have more memory overhead than in Java 7. It might be worth trying to compile the program in both versions and see which provides an improved memory solution. Would be interesting if nothing else.
Related
I keep getting java.lang.OutOfMemoryError: GC overhead limit exceeded while loading rows from a Hibernate query.
I've tried increasing the memory a few times and it just continues to happen. I noticed in my logs that it appears to be pointing to a method where I'm using a TreeMap. I'm wondering if I'm using this incorrectly causing the out of memory issues.
public List<Item> getProducts() {
    List<ProductListing> productListings = session.createCriteria(ProductListing.class)
            .createAlias("productConfiguration", "productConfiguration")
            .add(Restrictions.eq("productConfiguration.category", category))
            .add(Restrictions.eq("active", true))
            .add(Restrictions.eq("purchased", true))
            .list();
    Map<String, Item> items = new TreeMap<>();
    productListings.stream().forEach((productListing) -> {
        Item item = productListing.getItem();
        items.put(item.getName(), item);
    });
    return new ArrayList<>(items.values());
}
Is it safe to pass the values into the arraylist?
Do I need to set the arraylist size?
I'm just wondering if I'm doing something terribly wrong. It looks correct, but the memory exceptions say otherwise.
I call this "loading the world" - the name getProducts() (with no args) is a bit of a code smell - as you're not restricting the result set size, and for all we know your Item objects could be large with lots of eagerly-loaded dependencies (not to mention all the backing Hibernate objects on the heap).
Another big problem is that you're expensively adding your freshly loaded entities to a TreeMap, which calls compareTo() on the keys as it sorts them (not hashCode(), since TreeMap is a sorted map), only to throw away the keys and copy the values into a newly-allocated ArrayList.
Leaving aside the lack of pre-sizing of the ArrayList (correct, this is not ideal, though it should merely be slow), why the TreeMap stage? If there is aggregation you need to do, why not get the database to do it much more efficiently (e.g. via a GROUP BY on the name) using an index, rather than pulling everything into a map for much slower re-processing? At the very least, by getting only unique Items back, you could skip the map stage and copy straight into your list, as sketched below (there are even lighter-weight possibilities too, depending on your needs).
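For illustration, if the query itself can be made to return one row per unique item name (that uniqueness is an assumption here; the exact restriction or GROUP BY depends on your mapping), the map stage disappears entirely:

// Sketch only: assumes the query guarantees one ProductListing per unique item
// name (e.g. via an appropriate restriction or a GROUP BY done in the database),
// so no in-memory de-duplication is needed.
public List<Item> getProducts() {
    List<ProductListing> productListings = session.createCriteria(ProductListing.class)
            .createAlias("productConfiguration", "productConfiguration")
            .add(Restrictions.eq("productConfiguration.category", category))
            .add(Restrictions.eq("active", true))
            .add(Restrictions.eq("purchased", true))
            .list();

    List<Item> items = new ArrayList<>(productListings.size());   // pre-sized copy
    for (ProductListing listing : productListings) {
        items.add(listing.getItem());
    }
    return items;
}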
I strongly suggest testing with a proper Profiler. Specifically, it may help you determine the effect of eager-loading on heap size. Potentially just turning that off may make the problem much more manageable.
However, you also need to consider your code's clients: who actually needs all of those Items? Most likely nobody.
I'm working on a project that requires me to store (potentially) millions of key-value mappings and make (potentially) hundreds of queries a second. There are some checks I can do on the data I'm working with, but they will only reduce the load a little. In addition, I will be making (potentially) hundreds of puts/removes a second, so my question is: is there a map sufficient for this task? Is there any way I might optimize the map? Is there something faster that would work for storing key-value mappings?
Some additional information:
- The key will be a point in 3d space; I feel like this means I could use arrays, but the arrays would have to be massive
- The value must be an object
Any help would be greatly appreciated!
Back-of-envelope estimates help in coming to grips with this sort of thing. If you have millions of entries in a map, let's say 32M, and a key is a 3d point (so 3 ints -> 3 * 4B -> 12 bytes), then 12B * 32M = 384MB. You didn't mention the size of the value, but assuming you have a similarly sized value, let's double that figure. This is Java, so assuming a 64-bit platform with compressed oops (which is the default and what most people are on), you pay an extra 12B of object header per object. So: 32M * 2 * 24B = 1536MB.
Now if you use a HashMap, each entry requires an extra HashMap.Node; in Java 8 on the platform above you are looking at 32B per Node (use OpenJDK JOL to find out object sizes). That brings us to 2560MB. Also throw in the cost of the HashMap table array: with 32M entries you are looking at a table with 64M slots (because the array size is a power of 2 and you need some slack beyond your entries), so that's an extra 256MB. All together, let's round it up to 3GB?
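If you want to verify these per-object figures on your own JVM rather than take them on faith, a small sketch using the OpenJDK JOL library (this assumes org.openjdk.jol:jol-core is on the classpath, and uses boxed Integers purely as stand-ins for your real keys and values):

import java.util.HashMap;
import java.util.Map;
import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.info.GraphLayout;

public class MapFootprint {
    public static void main(String[] args) {
        Map<Integer, Integer> map = new HashMap<>();
        for (int i = 0; i < 100000; i++) {
            map.put(i, i);
        }
        // Layout of a single entry (a HashMap.Node under the hood).
        Object oneEntry = map.entrySet().iterator().next();
        System.out.println(ClassLayout.parseInstance(oneEntry).toPrintable());
        // Total retained size of the map, its table array and all entries.
        System.out.println(GraphLayout.parseInstance(map).totalSize() + " bytes");
    }
}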
Most servers these days have quite large amounts of memory (tens to hundreds of GB), and adding an extra 3GB to the JVM live set should not scare you. You might consider it disappointing that the overhead exceeds the data in your case, but this is not about your emotional well-being, it's a question of whether it will work ;-)
Now that you've loaded up the data, you are mutating it at a rate of hundreds of inserts/deletes per second, let's say 1024. Reusing the quantities above, we can sum it up as: 1024 * (24 * 2 + 32) = 80KB per second. Churning 80KB of garbage per second is small change for many applications, and not something you necessarily need to sweat about. To put it in context, a JVM these days will collect many hundreds of MB of young generation in a matter of tens of milliseconds.
So, in summary, if all you need is to load the data and query/mutate it along the lines you describe you might just find that a modern server can easily contend with a vanilla solution. I'd recommend you give that a go, maybe prototype with some representative data set, and see how it works out. If you have an issue you can always find more exotic/efficient solutions.
I have an Android application that iterates through an array of thousands of integers and it uses them as key values to access pairs of integers (let us call them id's) in order to make calculations with them. It needs to do it as fast as possible and in the end, it returns a result which is crucial to the application.
I tried loading a HashMap into memory for fast access to those numbers, but it resulted in an OOM exception. I also tried writing those ids to a RandomAccessFile and storing their offsets in the file in another HashMap, but it was way too slow. Also, the new HashMap that only stores the offsets still occupies a lot of memory.
Now I am considering SQLite but I am not sure if it will be any faster. Are there any structures or libraries that could help me with that?
EDIT: The number of keys is more than 20 million, whereas I only need to access thousands of them. I do not know which ones I will access beforehand, because that changes with user input.
You could use Trove's TIntLongHashMap to map primitive ints to primitive longs (which store the ints of your value pair). This saves you the object overhead of a plain vanilla Map, which forces you to use wrapper types.
EDIT
Since your update states you have more than 20 million mappings, there will likely be more space-efficient structures than a hash map. An approach to partition your keys into buckets, combined with some sub-key compression will likely save you half the memory over even the most efficient hash map implementation.
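A minimal sketch of the TIntLongHashMap approach mentioned above (this assumes Trove 3, where the class lives in gnu.trove.map.hash; packing the two ids into one long is just one convenient encoding):

import gnu.trove.map.hash.TIntLongHashMap;

// Maps a primitive int key to a pair of int ids packed into one long,
// so neither keys nor values need wrapper objects.
public class PackedPairMap {
    private final TIntLongHashMap map = new TIntLongHashMap();

    public void put(int key, int firstId, int secondId) {
        map.put(key, ((long) firstId << 32) | (secondId & 0xFFFFFFFFL));
    }

    public int firstId(int key)  { return (int) (map.get(key) >>> 32); }
    public int secondId(int key) { return (int) map.get(key); }
}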
SQLite is an embedded relational database, which uses indexes. I would bet it is much faster than using RandomAccessFile. You can give it a try.
My suggestion is to rearrange the keys into buckets. What I mean is: identify (more or less) the distribution of your keys, then create files corresponding to each range of keys (the point is that every file must contain only as many integers as can fit in memory and no more than that). Then, when you search for a key, you just read the whole corresponding file into memory and look for it there.
For example, assuming the distribution of the keys is uniform, store the 500k values corresponding to the 0-500k key range in one file, the 500k values corresponding to the 500k-1M range in the next, and so on...
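A rough sketch of the lookup side of this (the bucket size and the keys_<n>.bin file name are made up; they are just placeholders for however you lay the files out):

// Assumes roughly uniformly distributed non-negative int keys, with one file
// per range of BUCKET_SIZE keys.
static final int BUCKET_SIZE = 500000;

static String bucketFileFor(int key) {
    return "keys_" + (key / BUCKET_SIZE) + ".bin";   // 0-499,999 -> keys_0.bin, and so on
}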
EDIT: if you did try this approach and it still went slowly, I still have some tricks up my sleeve:
First make sure that your division is actually close to equal between all the buckets.
Try to make the buckets smaller, by making more buckets.
The idea behind dividing correctly into buckets by ranges is that when you search for a key, you go to the corresponding range bucket and the key is either in it or not in the whole collection at all, so there is no point in concurrently reading another bucket.
I have never done that, because I'm not sure concurrency helps with I/O, but it may be helpful to read the whole file with 2 threads, one from top to bottom and the other from bottom to top, until they meet (or something like that).
While you read the whole bucket into memory, split it into 3-4 ArrayLists and run 3-4 worker threads to search for your key, one per array; the search should finish much faster that way.
I've got an existing ArrayList which will be filtered according to certain criteria. I'm using Apache's CollectionUtils.select(Collection, Predicate, Collection) for filtering.
The second collection which is passed to this method will be populated with the relevant objects. Is it wiser to create this new collection with
List newList = new ArrayList();
or with
List newList = new ArrayList(listToBeFiltered.size());
?
In the first case, the List will be upsized if the initial capacity is reached, while in the second case sometimes a way too big List will be created.
Which way is the better one? And please correct me if I've made any mistakes in my explanation.
That normally depends on the final size and the size of the collection to be filtered.
Resizing an ArrayList is normally done by growing the backing array by about half its current size (1.5x in the OpenJDK implementation, not quite doubling) and copying the contents over. Thus, with a huge final size, a couple of resize operations might be needed.
On the other hand, a really large initial size might eat up quite a lot of memory and might trigger garbage collection earlier, but the list would have to be really big for that.
You might try and profile both options but for standard sizes I'd prefer specifying a sensible initial size.
If you have any intuition on the selectivity of your filtering then you could size the list slightly larger than its expected size. If the selectivity is typically 20% then you could set the final result to (say) 25%.
List newList = new ArrayList((int) (0.25 * listToBeFiltered.size()));
In most cases, the 'upsizing' will be more expensive as it incurs a penalty each time it needs to make more room. For adding lots of entries, this will result in tiny delays during execution. Allocating a large size to begin with only incurs that penalty once, assuming you don't later add more items. Note, however, that dynamic arrays / lists / containers generally have a granularity that gives a reasonable capacity before they have to re-allocate memory, so for a small number of items you might not spot any difference.
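To put a rough number on that penalty (this assumes the OpenJDK ArrayList, where the default capacity is 10 and the backing array grows by about 1.5x; neither detail is guaranteed by the List contract):

// Counts how many times the backing array would have to grow (and be copied)
// when n elements are added to a default-sized ArrayList.
static int growthsNeeded(int n) {
    int capacity = 10;
    int growths = 0;
    while (capacity < n) {
        capacity += capacity >> 1;   // oldCapacity + oldCapacity / 2
        growths++;
    }
    return growths;
}
// growthsNeeded(1000000) == 29, i.e. about 29 array copies that a pre-sized list avoids.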
It depends; you have a chance of wasting some space either way.
But intuitively, if you think the resulting list will be much smaller than the input, I suggest you use List newList = new ArrayList();
The initial capacity of an ArrayList is 10, and the backing array grows by about 1.5x (in the OpenJDK implementation) when it is full.
I usually do e.g.
HashMap<String,String> dictionary = new HashMap<String,String>();
I started to think about it, and as far as I know a HashMap is implemented under the hood via a hash table.
The objects are stored in the table using a hash to find where they should be stored in the table.
Does the fact that I do not set a size at the construction of the dictionary make the performance decrease?
I.e. what would be the size of the hash table during construction? Would it need to allocate new memory for the table as elements increase?
Or am I confused about the concept here?
Are the default capacity and load factor adequate, or should I be spending time working out the actual numbers?
The nice thing about Java is that it is open-source, so you can pull up the source code, which answers a number of questions:
No, there is no relationship between HashMap and the legacy Hashtable class. HashMap derives from AbstractMap and does not internally use a Hashtable for managing its data.
Whether or not omitting an explicit size will decrease performance will depend upon your usage model (or more specifically, how many things you put into the map). The map will automatically double in size every time a certain threshold is hit (0.75 * <current map capacity>), and the doubling operation is expensive. So if you know approximately how many elements will be going into the map, you can specify a size and prevent it from ever needing to allocate additional space.
The default capacity of the map, if none is specified in the constructor, is 16. It will double its capacity to 32 once the number of entries exceeds the threshold of 12 (0.75 * 16), then again once it exceeds 24, and so on (see the small progression sketch after this list).
Yes, it needs to allocate new memory when the capacity increases. And it's a fairly costly operation (see the resize() and transfer() functions).
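A small sketch of the capacity/threshold progression described in the list above (0.75 is the default load factor; the exact element at which the resize happens is an implementation detail):

int capacity = 16;                   // default initial capacity
for (int step = 0; step < 5; step++) {
    int threshold = (int) (capacity * 0.75f);
    System.out.println("capacity=" + capacity + ", resizes once size exceeds " + threshold);
    capacity *= 2;                   // capacity doubles on each resize
}
// capacity=16, resizes once size exceeds 12
// capacity=32, resizes once size exceeds 24
// capacity=64, resizes once size exceeds 48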
Unrelated to your question but still worth noting, I would recommend declaring/instantiating your map like:
Map<String,String> dictionary = new HashMap<String,String>();
...and of course, if you happen to know how many elements will be placed in the map, you should specify that as well.
Does the fact that I do not set a size at the construction of the dictionary make the performance decrease?
Depends on how much you're going to store in the HashMap and how your code will use it afterward. If you can give it a ballpark figure up front, it might be faster, but: "it's very important not to set the initial capacity too high [...] if iteration performance is important" (from the HashMap Javadoc), because iteration time is proportional to the capacity.
Doing this in non-performance-critical pieces of code would be considered premature optimization. If you're going to outsmart the JDK authors, make sure you have measurements that show that your optimization matters.
what would be the size of the hash table during construction?
According to the API docs, 16.
Would it need to allocate new memory for the table as elements increase?
Yes. Every time the number of entries exceeds the capacity times the load factor (default = 0.75), it reallocates.
Are the default capacity and load adequate
Only you can tell. Profile your program to see whether it's spending too much time in HashMap.put. If it's not, don't bother.
A HashMap will automatically increase its size if it needs to. The best way to initialize it is to anticipate roughly how many elements you might need; if the figure is large, just set the capacity to a number that will not require constant resizing. Furthermore, if you read the JavaDoc for HashMap you will see that the default size is 16 and the load factor is 0.75, which means that once the HashMap is 75% full it will automatically resize. So if you expect to hold 1 million elements, it is natural to want a larger size than the default one.
First of all, I would declare it using the Map interface.
Map<String,String> dictionary = new HashMap<String,String>();
Does the fact that I do not set a size at the construction of the dictionary make the performance decrease?
Yes, initial capacity should be set for better performance.
Would it need to allocate new memory for the table as elements
increase
Yes, the load factor also affects performance.
More detail in the docs.
As stated here, the default initial capacity is 16 and the default load factor is 0.75. You can change either one with different constructors, and the right choice depends on your usage (though the defaults are generally fine for general-purpose use).