Java overall arrays limit?

I am implementing a program that has about 2,000,000 (2 million) arrays, each of size 16,512 (128 x 129) integers. I only need to use 200 arrays at a time (roughly 13 MB), but I wonder if I can expand the program to have more than 2 million (say 200 million) arrays while still only using 200 at a time. So what is the limit on creating more and more arrays, as long as I never use more than 200 at a time?

I highly doubt you can keep all of that in memory at once, unless you're running on a 64-bit machine with a lot of RAM and a very generous heap.
Let's calculate the memory you'd need to hold all of your data:
2,000,000 * 128 * 129 * 4 / 1024 / 1024 / 1024 ≈ 123 GB (a Java int is 4 bytes).
You'll need additional RAM for the JVM, the rest of your program, and the operating system.
Sounds like a poorly conceived solution to me.
If you mean "I only have 200 arrays in memory at a time" you can certainly do that, but you'll have to move the rest out to secondary storage or a relational database. Query for them, use them, GC them. It might not be the best solution, but it's hard to tell based on the little you've posted.
Update:
Does "trigger" mean "database trigger"?
Yes, you can store them on disk. I can't guarantee that it'll perform well. Your hard drive can certainly handle the ~123 GB for 2 million arrays; scaling to 200 million arrays puts you in the terabyte range, so you'd need correspondingly more (or bigger) disks.
Just remember that you have to think about how you'll manage RAM. GC thrashing might be a problem. A good caching solution might be your friend here. Don't write one yourself.
What happens if that hard drive fails and you lose all that data? Do you back it up? Can your app afford to be down if the disk fails? Think about those scenarios, too. Good luck.

As long as you increase the max heap size to make sure your application doesn't run out of memory, you should be fine.

As long as you don't keep references to arrays you no longer need, there is no hard limit. Old arrays will automatically get garbage collected, so you can keep allocating and abandoning arrays pretty much ad infinitum.
There is, of course, a limit on how many arrays you can keep around at any given time. This is limited by the amount of memory available to the JVM.
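A minimal sketch of that idea, assuming the arrays can be reloaded from some backing store on demand (the loader below is just a placeholder): keep at most 200 arrays strongly reachable and let evicted ones become garbage.

import java.util.LinkedHashMap;
import java.util.Map;

public class ArrayWindow {
    private static final int MAX_LIVE_ARRAYS = 200;

    // An access-ordered LinkedHashMap doubles as a tiny LRU cache: once the 201st
    // array is loaded, the least recently used one is dropped and becomes
    // eligible for garbage collection.
    private final Map<Integer, int[][]> cache =
            new LinkedHashMap<Integer, int[][]>(MAX_LIVE_ARRAYS, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, int[][]> eldest) {
                    return size() > MAX_LIVE_ARRAYS;
                }
            };

    public int[][] get(int id) {
        return cache.computeIfAbsent(id, this::loadArray);
    }

    private int[][] loadArray(int id) {
        // Placeholder: in reality this would read the array from disk or a database.
        return new int[128][129];
    }
}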

Related

Java slower with big heap

I have a Java program that operates on a (large) graph. Thus, it uses a significant amount of heap space (~50 GB, which is about 25% of the physical memory on the host machine). At one point, the program (repeatedly) picks one node from the graph and does some computation with it. For some nodes, this computation takes much longer than anticipated (30-60 minutes, instead of an expected few seconds). In order to profile these operations to find out what takes so much time, I have created a test program that creates only a very small part of the large graph and then runs the same operation on one of the nodes that took very long to compute in the original program. Thus, the test program obviously only uses very little heap space compared to the original program.
It turns out that an operation that took 48 minutes in the original program can be done in 9 seconds in the test program. This really confuses me. The first thought might be that the larger program spends a lot of time on garbage collection. So I turned on the verbose mode of the VM's garbage collector. According to that, no full garbage collections are performed during the 48 minutes, and only about 20 collections in the young generation, which each take less than 1 second.
So my question is: what else could there be that explains such a huge difference in timing? I don't know much about how Java internally organizes the heap. Is there something that takes significantly longer for a large heap with a large number of live objects? Could it be that object allocation takes much longer in such a setting, because it takes longer to find an adequate place in the heap? Or does the VM do any internal reorganization of the heap that might take a lot of time (besides garbage collection, obviously)?
I am using Oracle JDK 1.7, if that's of any importance.
While bigger memory can mean bigger problems, I'd say there's nothing (except the GC, which you've excluded) that could stretch 9 seconds into 48 minutes (a factor of 320).
A big heap can make spatial locality worse, but I don't think it matters here. I disagree with Tim's answer w.r.t. "having to leave the cache for everything".
There's also the TLB, which is a cache for virtual address translation and which could cause some problems with very large memory. But again, not a factor of 320.
I don't think there's anything in the JVM which could cause such problems.
The only reason I can imagine is that you have some swap space that gets used - despite the fact that you have enough physical memory. Even slight swapping can be the cause of a huge slowdown. Make sure it's off (and possibly check swappiness).
Even when things are in memory, you have multiple levels of data caching on modern CPUs. Every time you have to leave a cache level to fetch data, access gets slower. Having 50 GB of RAM could well mean that it is having to leave the cache for everything.
The symptoms and differences you describe are just massive though, and I don't see something as simple as cache misses making that much difference.
The best advice I can give you is to run a profiler against it both when it's running slow and when it's running fast, and compare the difference.
You need solid numbers and timings. "In this environment doing X took Y time". From that you can start narrowing things down.
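As a concrete starting point, something like the following could be dropped into both the slow and the fast environment to get comparable numbers (heavyComputation is just an illustrative stand-in for the real graph operation):

public final class OpTimer {
    // Runs an operation and returns the wall-clock time it took in milliseconds.
    static long timeMillis(Runnable op) {
        long start = System.nanoTime();
        op.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long ms = timeMillis(OpTimer::heavyComputation);
        System.out.println("operation took " + ms + " ms");
    }

    private static void heavyComputation() {
        // Stand-in for the real node computation being profiled.
        double x = 0;
        for (int i = 0; i < 50_000_000; i++) x += Math.sqrt(i);
        if (x < 0) System.out.println("unreachable: " + x); // keep the loop from being optimised away
    }
}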

Are there performance issues from using large numbers of objects in Java

I am currently working on a system where performance is an important consideration. It is going to be used for processing large quantities of data (instances of some object types number in the millions) with non-trivial algorithms (think integer programming problems, etc.). At the moment I have a working solution which creates all these data points as objects.
Is there any performance increase to be gained by treating them as arrays, for example? Are there any best practices for working with large numbers of objects in Java (or should it be avoided)?
I suggest you start by using a commercial CPU and memory profiler. This will give you a good idea of where your bottlenecks are.
Reducing garbage and making your memory layout more compact helps most once you have optimised the code to the point that your profilers cannot suggest anything further.
You might like to consider which data structures fit your CPU caches better, as this can improve performance by up to 2-5x. E.g. your L3 cache might be 8 MB and more than 5x faster than main memory; the more you can condense your working set to fit into it, the better.
BTW Your L1 cache is 32 KB and ~10x faster again.
This all assumes that the time to perform a GC doesn't bother you. If you create enough objects you can see multi-second, even multi-minute GC stop-the-world pauses.
Arrays and ArrayLists have similar performance, although arrays are faster (up to 25%, depending on what you do with them). Where you can find a significant performance gain is by avoiding boxed primitives for calculations, in which case the only solution is to use an array of primitives.
Apart from that, creating many short-lived objects incurs little performance cost, apart from the fact that GC will run more often (but the cost of running a minor GC depends on the number of reachable objects, not on unreachable ones).
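A small illustration of the boxing point above (a sketch, not a benchmark): the same sum computed over an ArrayList of Integer and over an int[]; the primitive array avoids one Integer object plus an unboxing step per element.

import java.util.ArrayList;
import java.util.List;

public class BoxedVsPrimitive {
    public static void main(String[] args) {
        int n = 10_000_000;

        List<Integer> boxed = new ArrayList<>(n);
        for (int i = 0; i < n; i++) boxed.add(i);      // each element is a separate Integer object

        int[] primitives = new int[n];
        for (int i = 0; i < n; i++) primitives[i] = i; // contiguous 4-byte ints, no object headers

        long sumBoxed = 0;
        for (int v : boxed) sumBoxed += v;             // unboxes on every iteration

        long sumPrimitive = 0;
        for (int v : primitives) sumPrimitive += v;

        System.out.println(sumBoxed == sumPrimitive);  // true
    }
}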
Premature optimization is evil. As Richard says in the comments, write your code, see if it's slow, then improve it. If you have suspicions, write an example that simulates high load. The time spent up front to determine this is worth it.
But as for your question...
Yes, creating objects is more expensive than using primitives, and they also occupy more heap space (memory). Also, if you use objects for only a short time, the garbage collector will have to run more often, which will eat some CPU.
Again, only worry about this if you really need speed improvement.
Prototype key parts of your algorithms, test them in isolation, find the slowest, improve, repeat. Stay single-threaded for as long as possible, but always make a note of what can be done in parallel.
In the end your bottleneck may be one of the following:
CPU, because of the algorithm's computational complexity => try finding a better algorithm (or run on multiple CPUs in parallel if you are just slightly below the target; if you are far below it, parallel processing won't help)
CPU, because of excessive GC => profile memory, use low/zero-GC collections (trove4j etc.), arrays of primitive types, or even direct memory buffers from NIO (see the sketch after this list), and experiment
Memory => optimize data proximity (use chunked arrays matching cache sizes, etc.)
Contention on concurrent objects => revert to a single-threaded design, try lock-free synchronization primitives, etc.
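For the "direct memory buffers from NIO" option, a minimal sketch (names are illustrative) of an off-heap int store that the garbage collector never has to scan element by element:

import java.nio.ByteBuffer;

public class OffHeapInts {
    private final ByteBuffer buf;

    public OffHeapInts(int count) {
        // Allocated outside the Java heap; only the small ByteBuffer wrapper is a heap object.
        buf = ByteBuffer.allocateDirect(count * Integer.BYTES);
    }

    public void set(int index, int value) {
        buf.putInt(index * Integer.BYTES, value);
    }

    public int get(int index) {
        return buf.getInt(index * Integer.BYTES);
    }
}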

Java Heap Hard Drive

I have been working on a Java program that generates fractal orbits for quite some time now. Much like photographs, the larger the image, the better it will be when scaled down. The program uses a 2D object (Point) array, which is written to when a point's value is calculated. That is to say, the Point is stored at its corresponding indices, i.e.:
Point p = new Point(25,30);
histogram[25][30] = p;
Of course, this is edited for simplicity. I could just write the point values to a CSV, and apply them to the raster later, but using similar methods has yielded undesirable results. I tried for quite some time because I enjoyed being able to make larger images with the space freed by not having this array. It just won't work. For clarity I'd like to add that the Point object also stores color data.
The next problem is the WriteableRaster, which will have the same dimensions as the array. Combined the two take up a great deal of memory. I have come to accept this, after trying to change the way it is done several times, each with lower quality results.
After trying to optimize for memory and time, I've come to the conclusion that I'm really limited by RAM. This is what I would like to change. I am aware of the -Xmx switch (set to 10GB). Is there any way to use Windows' virtual memory to store the raster and/or the array? I am well aware of the significant performance hit this will cause, but short of lowering quality there really doesn't seem to be much choice.
The OS is already turning hard drive space into RAM for you and every other process -- no magic needed. This will be more of a performance disaster than you think; it will be so slow as to effectively not work.
Are you looking for memory-mapped files?
http://docs.oracle.com/javase/6/docs/api/java/nio/MappedByteBuffer.html
If this really has to be done in memory, I would bet that you could dramatically lower your memory usage with some optimization. For example, your Point object is mostly overhead and not data. Count up the bytes needed for the reference and the object header, compared to two ints.
You could reduce the overhead to nothing with two big parallel int arrays for your x and y coordinates. Of course you'd have to encapsulate this for access in your code, but it could halve your memory usage for this data structure. Having millions fewer objects also speeds up GC runs.
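A rough sketch of that structure-of-arrays idea (the class and field names here are made up), with a packed ARGB int standing in for the Point's colour data:

public class PointStore {
    private final int[] xs;    // x coordinates
    private final int[] ys;    // y coordinates
    private final int[] argb;  // packed colour per point, instead of a Color object
    private int size;

    public PointStore(int capacity) {
        xs = new int[capacity];
        ys = new int[capacity];
        argb = new int[capacity];
    }

    // Returns an index the caller keeps instead of an object reference.
    public int add(int x, int y, int colour) {
        xs[size] = x;
        ys[size] = y;
        argb[size] = colour;
        return size++;
    }

    public int x(int i)      { return xs[i]; }
    public int y(int i)      { return ys[i]; }
    public int colour(int i) { return argb[i]; }
}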
Instead of putting a WritableRaster in memory, consider writing out the image file in some simple image format directly, yourself. BMP can be very simple. Then perhaps using an external tool to efficiently convert it.
Try -XX:+UseCompressedOops to reduce object overhead too. Also try -XX:NewRatio=20 or higher to make the JVM reserve almost all its heap for long-lived objects. This can actually let you use more heap.
It is not recommended to configure your JVM memory parameters (-Xmx) so as to make the operating system allocate from its swap space. Apparently the garbage collection mechanism needs random access to heap memory, and if it doesn't get it, the program will thrash for a long time and possibly lock up. Please check the answer already given to my question (last paragraph):
does large value for -Xmx postpone Garbage Collection

What is the best practice to grow a very large binary file rapidly?

My Java application deals with large binary data files using memory-mapped files (MappedByteBuffer, FileChannel and RandomAccessFile). It often needs to grow the binary file - my current approach is to re-map the file with a larger region.
It works; however, there are two problems:
Grow takes more and more time as the file becomes larger.
If the grow is conducted very rapidly (e.g. in a while(true) loop), the JVM will hang forever after the re-map operation has been done about 30,000+ times.
What are the alternative approaches, and what is the best way to do this?
Also I cannot figure out why the second problem occurs. Please also suggest your opinion on that problem.
Thank you!
Current code (Clojure) for growing the file, if it helps:
(set! data (.map ^FileChannel data-fc FileChannel$MapMode/READ_WRITE
0 (+ (.limit ^MappedByteBuffer data) (+ DOC-HDR room))))
You probably want to grow your file in larger chunks. Use a doubling each time you remap, like a dynamic array, so that the cost for growing is an amortized constant.
I don't know why the remap hangs after 30,000 times; that seems odd. But you should be able to get away with far fewer than 30,000 remaps if you use the scheme I suggest.
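A hedged sketch of that doubling scheme in plain Java NIO (the class and its API are made up for illustration; it assumes mapped sizes stay under the ~2 GB per-mapping limit mentioned in the next answer):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class GrowableMappedFile implements AutoCloseable {
    private final FileChannel channel;
    private MappedByteBuffer buffer;
    private long mappedSize;

    public GrowableMappedFile(String path, long initialSize) throws IOException {
        channel = new RandomAccessFile(path, "rw").getChannel();
        mappedSize = initialSize;
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, mappedSize);
    }

    // Remap only when the required size exceeds the current mapping, and at least
    // double each time, so the number of remaps grows logarithmically with file size.
    public MappedByteBuffer ensureCapacity(long required) throws IOException {
        if (required > mappedSize) {
            long newSize = Math.max(required, mappedSize * 2);
            buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, newSize);
            mappedSize = newSize;
        }
        return buffer;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}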
The JVM doesn't clean up memory mappings even if you call the cleaner explicitly. Thank you @EJP for the correction.
If you create 32,000 of these, they could all be in existence at once. BTW, I suspect you might be hitting some 15-bit limit.
The only solution for this is: don't create so many mappings. You can map an entire 4 TB disk with fewer than 4K mappings.
I wouldn't create a mapping smaller than 16 to 128 MB if you know the usage will grow, and I would consider going up to 1 GB per mapping. The reason you can do this with little cost is that main memory and disk space are not allocated until you actually use the pages, i.e. main memory usage grows 4 KB at a time.
The only reason I wouldn't create a 2 GB mapping is that Java doesn't support it due to an Integer.MAX_VALUE size limit :( If you need 2 GB or more you have to create multiple mappings.
Unless you can afford exponential growth of the file, like doubling or some other constant multiplier, you need to consider whether you really need a MappedByteBuffer at all, given their limitations (inability to grow the file, no GC, etc.). I personally would either revisit the problem or use a RandomAccessFile in "rw" mode, probably with a virtual-array layer over the top of it.

How to estimate whether a given task would have enough memory to run in Java

I am developing an application that allows users to set the maximum data set size they want me to run their algorithm against.
It has become apparent that array sizes around 20,000,000 cause an 'out of memory' error. Because I am invoking this via reflection, there is not really a great deal I can do about this.
I was just wondering, is there any way I can check / calculate what the maximum array size could be based on the users heap space settings and therefore validate user entry before running the application?
If not, are there any better solutions?
Use Case:
The user provides a data size they want to run their algorithm against; we generate a scale of numbers to test it against, up to the limit they provided.
We record the time it takes to run and measure the values (in order to work out the O-notation).
We need to somehow limit the users' input so as not to exceed the heap and get this error. Ideally we want to measure n^2 algorithms on array sizes as big as we can (which could mean runtimes of days), so we really don't want a run going for 2 days and then failing, as it would have been a waste of time.
You can use the result of Runtime.freeMemory() to estimate the amount of available memory. However, it might be that actually a lot of memory is occupied by unreachable objects, which will be reclaimed by GC soon. So you might actually be able to use more memory than this. You can try invoking the GC before, but this is not guaranteed to do anything.
The second difficulty is to estimate the amount of memory needed for a number given by the user. While it is easy to calculate the size of an ArrayList with so many entries, this might not be all. For example, which objects are stored in this list? I would expect that there is at least one object per entry, so you need to add this memory too. Calculating the size of an arbitrary Java object is much more difficult (and in practice only possible if you know the data structures and algorithms behind the objects). And then there might be a lot of temporary objects created during the run of the algorithm (for example boxed primitives, iterators, StringBuilders etc.).
Third, even if the available memory is theoretically sufficient for running a given task, it might be practically insufficient. Java programs can get very slow if the heap is repeatedly filled with objects, then some are freed, some new ones are created and so on, due to a large amount of Garbage Collection.
So in practice, what you want to achieve is very difficult and probably next to impossible. I suggest just try running the algorithm and catch the OutOfMemoryError.
Usually, catching errors is something you should not do, but this seems like an occasion where it's OK (I do this in some similar cases). You should make sure that as soon as the OutOfMemoryError is thrown, some memory becomes reclaimable for GC. This is usually not a problem: as the algorithm aborts, the call stack is unwound and some (hopefully a lot of) objects are no longer reachable. In your case, you should probably ensure that the large list is among the objects which immediately become unreachable in the case of an OOM. Then you have a good chance of being able to continue your application after the error.
However, note that this is not a guarantee. For example, if you have multiple threads working and consuming memory in parallel, the other threads might as well receive an OutOfMemoryError and not be able to cope with this. Also the algorithm needs to support the fact that it might get interrupted at any arbitrary point. So it should make sure that the necessary cleanup actions are executed nevertheless (and of course you are in trouble if those need a lot of memory!).
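Putting both suggestions together, a rough sketch (the sizing below assumes the benchmark data is a plain int[]; real per-entry object costs would have to be added on top):

public class MemoryGuard {
    // Rough estimate of the heap still available to the JVM: max heap minus what is
    // currently in use (used = total - free). GC may free more later.
    static long approxAvailableHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
    }

    public static void main(String[] args) {
        int requested = 20_000_000;                      // user-supplied data set size
        long needed = (long) requested * Integer.BYTES;  // payload of an int[] only
        if (needed > approxAvailableHeap() / 2) {        // leave headroom; purely a heuristic
            System.out.println("Requested size is unlikely to fit; refusing to run.");
            return;
        }
        try {
            int[] data = new int[requested];
            // ... run the user's algorithm against data ...
            System.out.println("allocated " + data.length + " ints");
        } catch (OutOfMemoryError oom) {
            // The array goes out of scope here, so it is immediately reclaimable.
            System.out.println("Allocation failed despite the estimate: " + oom);
        }
    }
}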
