I'm using this statement
//some code
int a[][]=new int[5000000][5000000];
//some code
and running it with command
java -mx512m Test
It is giving OutOFMemoryError: Java Heap space indicating the line number of the mentioned statement in the stacktrace
How do i solve this problem
Edit:
I'm trying to solve a practice problem on codechef
You may need to consider an approach to your problem which requires less memory.
From Google Calculator (assuming a 64bit integer size):
(5 000 000^2) * 64 bits = 186 264.515 gigabytes
I've faced same problem with Eclips concern Java heap
and the solution was to revise the mx512m into mx4096m or mx2048m (expand the maximum allowed limit of memory)
so in your case try the command
java -mx4096m Test which will allow java to use 4 GB of your ram
I think the data structure you are looking for is a sparse matrix. Store your elements along with their coordinates in a map data structure (eg. Map<Integer,Map<Integer,Integer>> for a 2d sparse array) and just assume anything not in the map is zero.
Well, that's 25 trillion ints, each of which takes 4 bytes, so 100 trillion bytes overall. Easiest solution is to buy ~90 terabytes of RAM and a 64 bit OS.
Seriously though, the correct solution is probably to allocate a more reasonable data structure that can store the data more efficiently, assuming that you don't actually need to load 90 terabytes of data into RAM at once. Perhaps if you post more about the problem, we can give a better answer?
Amongst the other answers, you have a typo in your command.
It should be: java -Xmx512M Test
You are using too much memory. Use less or have more. Each int element is 32-bits. And your memory limit of 512MB is much less than you think it is.
for int[5000000][5000000] you need 100000000000000B or 100000GB. So you only can wait about 100 years when something like that is going to be possible.
You need a more enlightened data structure. Unless you actually NEED 25 trillion integers, which is doubtful.
That's a matrix with 5 million columns and 5 million rows. Is that really what you wanted?
Related
I'm working on a project that requires that I store (potentially) millions of key-value mapping, and make (potentially) the 100s of queries a second. There are some checks I can do around the data I'm working with, but it will only reduce the load by a bit. In addition, I will be making (potentially) 100s of put/removes a second, so my question is: Is there a map sufficient for this task? Is there any way I might optimize the map? Is there something faster that would work for storing key-value mappings?
Some additional information;
- The key will be a point in 3d spaces, I feel like this means I could use arrays, but the arrays would have to be massive
- The value must be an object
Any help would be greatly appreciated!
Back of envelope estimates help in getting to terms with this sort of thing. If you have millions of entries in a map, lets say 32M, and a key is a 3d point (so 3 ints->3*4B->12 bytes) ->12B * 32M = 324MB. You didn't mention the size of the value but assuming you have a similarly sized value lets double that figure. This is Java, so assuming a 64bit platform with Compressed OOPs which is default and what most people are on, you pay an extra 12B of object header per Object. So: 32M * 2 * 24B = 1536MB.
Now if you use a HashMap each entry requires an extra HashMap.Node, in Java8 on the platform above you are looking at 32B per Node (use OpenJDK JOL to find out object sizes). Which brings us to 2560MB. Also throw in the cost of the HashMap array, with 32M entries you are looking at a table with 64M entries (because the array size is a power of 2 and you need some slack beyond your entries), so that's an extra 256MB. All together lets round it up to 3GB?
Most servers these days have quite large amounts of memory (10s to 100s of GB) and adding an extra 3GB to the JVM live set should not scare you. You might consider it disappointing that the overhead exceeds the data in your case, but this is not your emotional well being, it's a question of will it work ;-)
Now that you've loaded up the data, you are mutating it at a rate of 100s of inserts/deletes per second, lets say 1024, reusing above quantities we can sum it up with: 1024 * (24*2 + 32) = 70KB. Churning 70KB of garbage per second is small change for many applications, and not something you necessarily need to sweat about. To put it in context, a JVM will contend with collecting many 100s of MB of Young Generation in a matter of 10s of milliseconds these days.
So, in summary, if all you need is to load the data and query/mutate it along the lines you describe you might just find that a modern server can easily contend with a vanilla solution. I'd recommend you give that a go, maybe prototype with some representative data set, and see how it works out. If you have an issue you can always find more exotic/efficient solutions.
Sorry if this has been asked before (though I can't really find a solution).
I'm not really too good at programming, but anyways, I am crawling a bunch of websites and storing information about them on a server. I need a java program to process vector coordinates associated with each of the documents (about a billion or so documents with a grant total of 500,000 numbers, plus or minus, associated with each of the documents). I need to calculate the singular value decomposition of that whole matrix.
Now Java, obviously, can't handle as big of a matrix as that to my knowledge. If i try making a relatively small array (about 44 million big) then I will get a heap error. I use eclipse, and so I tried changing the -xmx value to 1024m (it won't go any higher for some reason even though I have a computer with 8gb of ram).
What solution is there to this? Another way of retrieving the data I need? Calculating the SVD in a different way? Using a different programming language to do this?
EDIT: Just for right now, pretend there are a billion entries with 3 words associated with each. I am setting the Xmx and Xms correctly (from run configurations in eclipse -> this is the equivalent to running java -XmsXXXX -XmxXXXX ...... in command prompt)
The Java heap space can be set with the -Xmx (note the initial capital X) option and it can certainly reach far more than 1 GB, provided you are using an 64-bit JVM and the corresponding physical memory is available. You should try something along the lines of:
java -Xmx6144m ...
That said, you need to reconsider your design. There is a significant space cost associated with each object, with a typical minimum somewhere around 12 to 16 bytes per object, depending on your JVM. For example, a String has an overhead of about 36-40 bytes...
Even with a single object per document with no book-keeping overhead (impossible!), you just do not have the memory for 1 billion (1,000,000,000) documents. Even for a single int per document you need about 4 GB.
You should re-design your application to make use of any sparseness in the matrix, and possibly to make use of disk-based storage when possible. Having everything in memory is nice, but not always possible...
Are you using a 32 bit JVM? These cannot have more than 2 GB of Heap, I never managed to allocate more than 1.5 GB. Instead, use a 64 bit JVM, as these can allocate much more Heap.
Or you could apply some math to it and use divide and conquer strategy. This means, split the problem into little problems to get to the same result.
Don't know much about SVD but maybe this page can be helpful:
http://www.netlib.org/lapack/lug/node32.html
-Xms and -Xmx are different. The one containg s is the starting heap space and the one with x is the maximum heap space.
so
java -Xms512 -Xmx1024
would give you 512 to start with
As other people have said though you may need to break your problem down to get this to work. Are you using 32 or 64 bit java?
For data of that size, you should not plan to store it all in memory. The most common scheme to externalize this kind of data is to store it all in a database and structure your program around database queries.
Just for right now, pretend there are a billion entries with 3 words associated with each.
If you have one billion entries you need 1 billion times the size of each entry. If you mean 3 x int as words that's 12 GB at least just for the data. If you meant words as String, you would enumerate the words as there is only about 100K words in English and it would take the same amount of space.
Given 16 GB cost a few hundred dollars, I would suggest buying more memory.
I'm using SOLR-3.4, spatial filtering with the schema having LatLonType (subType=tdouble). I have an index of about 20M places. My basic problem is that if I do bbox filter with cache=true, the performance is reasonably good (~40-50 QPS, about 100-150ms latency), but a big downside is crazy fast old gen heap growth ultimately leading to major collections every 30-40 minutes (on a very large heap, 25GB). And at that point performance is beyond unacceptable. On the other hand I can turn off caching for bbox filters, but then my latency and QPS drops (the latency goes down from 100ms => 500ms). The NumericRangeQuery javadoc talks about the great performance you can get (sub 100 ms) but now I wonder if that was with filterCache enabled, and nobody bothered to look at the heap growth that results. I feel like this is sort of a catch-22 since neither configuration is really acceptable.
I'm open to any ideas. My last idea (untried) is to use geo hash (and pray that it either performs better with cache=false, or has more manageable heap growth if cache=true).
EDIT:
Precision step: default (8 for double I think)
System memory: 32GB (EC2 M2 2XL)
JVM: 24GB
Index size: 11 GB
EDIT2:
A tdouble with precisionStep of 8 means that your doubles will be splitted in sequences of 8 bits. If all your latitudes and longitudes only differ by the last sequence of 8 bits, then tdouble would have the same performance has a normal double on a range query. This is why I suggested to test a precisionStep of 4.
Question: what does this actually mean for a double value?
Having a profile of Solr while responding to your spatial queries would be of great help to understand what is slow, see hprof for example.
Still, here are a few ideas on how you could (perhaps) improve latency.
First you could try to test what happens when decreasing the precisionStep (try 4 for example). If the latitudes and longitudes are too close of each other and the precisionStep is too high, Lucene cannot take advantage of having several indexed values.
You could also try to give a little bit less memory to the JVM in order to give the OS cache more chances to cache frequently accessed index files.
Then, if it is still not fast enough, you could try to extend replace TrieDoubleField as a sub field by a field type that would use a frange query for the getRangeQuery method. This would reduce the number of disk access while computing the range at the cost of a higher memory usage. (I have never tested it, it might provide horrible performance as well.)
I come to know that the maximum size of a method in java is 64k. And if it exceeds, we'll get a compiler warning like "Code too large to compile". So can we call this a drawback of java with this small amount of memory.
Can we increase this size limit or is it really possible to increase ?
Any more idea regarding this method size ?
In my experience the 64KB limit is only a problem for generated code. esp. when intiialising large arrays (which is done in code)
In well structured code, each method is a manageable length and is much smaller than this limit. Large pieces of data, to be loaded into arrays, can be read from a non Java files like a text or binary file.
EDIT:
It is worth nothing that the JIT won't compile methods larger than 8 K. This means the code runs slower and can impact the GC times (as it is less efficient to search the call stack of a thread with methods which are not compiled esp big ones)
If possible you want to limit your methods to 8 K rather than 64 K.
64k is quite a lot, if you exceed it you may think about reorganizing you code.
In my project I met this constraint once in generated sources. Solved by just splitting one method to several.
If your method is longer than 50 lines including inside comments - split it. In this case you will never reach any limitation (even if one exists).
I personally saw 1000 lines long methods (written by criminals that call themselves programmers :) ) but did not see such kind of limitation.
I need simple opinion from all Guru!
I developed a program to do some matrix calculations. It work all fine with
small matrix. However when I start calculating BIG thousands column row matrix. It
kills the speed.
I was thinking to do processing on each row and write the result in a file then free the
memory and start processing 2nd row and write in a file, so and so forth.
Will it help in improving speed? I have to make big changes to implement this change. Thats
why I need your opinion. What do you think?
Thanks
P.S: I know about colt and Jama matrix. I can not use these packages due to company
rules.
Edited
In my program I am storing all the matrix in 2 dimensional array and if matrix is small it is fine. However, when it has thousands column and rows. Then storing all this matrix in memory for calculation cause performance issues. Matrix contains floating values. For processing I read all the matrix store in memory then start calculation. After calculating I write the result in a file.
Is memory really your bottleneck? Because if it isn't, then stopping to write things out to a file is always going to be much, much slower than the alternative. It sounds like you are probably experiencing some limitation of your algorithm.
Perhaps you should consider optimising the algorithm first.
And as I always say for all performance issue - asking people is one thing, but there is no substitute for trying it! Opinions don't matter if the real-world performance is measurable.
I suggest using profiling tools and timing statements in your code to work out exactly where your performance problem is before your start making changes.
You could spend a long time 'fixing' something that isn't the problem. I suspect that the file IO you suggest would actually slow your code down.
If your code effectively has a loop nested within another loop to process each element then you will see your speed drop away quickly as you increase the size of the matrix. If so, an area to look at would be processing your data in parallel, allowing your code to take advantage of multiple CPUs/cores.
Consider a more efficient implementation of a sparse matrix data structure and not a multidimensional array (if you are using one now)
You need to remember that perfoming an NxN multipled by an NxN takes 2xN^3 calculations. Even so it shouldn't take hours. You should get an inprovement by transposing the second matrix (about 30%) but it really shouldn't be taking hours.
So as you 2x N you increase the time by 8x. Worse than that a matrix which fit into your cache is very fast but mroe than a few MB and they have to come from main memory which slows down your operations by another 2-5x.
Putting the data on disk will really slow down your calaculation, I only suggest you do this if you martix doesn't fit in memory, but it will make it 10x - 100x slower so buying a little more memory is a good idea. (In your case your matrixies should be small enough to fit into memory)
I tried Jama, which is a very basic library which use two dimensional arrays instead of one and on 4 year old labtop took 7 minutes. You should be able to get half this time by just using the latest hardware and with multiple threads cut this below one minute.
EDIT: Using a Xeon X5570, Jama multiplied two 5000x5000 matrices in 156 seconds. Using a parallel implementation I wrote, cut this time to 27 seconds.
Use the profiler in jvisualvm in the JDK to identify where the time is spent.
I would do some simple experiments to identify how your algoritm scales, because it sounds like you might use one that has a higher runtime complexity than you think. If it runs in N^3 (which is common if you want to multiply a list with an array) then doubling the input size will eight-double the run time.