Abusing Java Arrays? - java

I'm developing a software package which makes heavy use of arrays (ArrayLists). Instructions to be process are put into an array queue to be processed, then when used, deleted from the array. Same with drawing on a plot, data is placed into an array queue, which is read to plot data, and the oldest data is eventually deleted as new data comes in. We are talking about thousands of instructions over an hour and at any time maybe 200,000 points plotted, continually growing/shrinking the array.
After sometime, the software beings to slow where the instructions are processed slower. Nothing really changes as to what is going on for processing, that is, the system is stable as to what how much data is plotted and what instructions are being process, just working off similar incoming data time after time.
Is there some memory issue going on with the "abuse" of the variable-sized (not a defined size, add/delete as needed) arrays/queues that could be causing eventual slowing?
Is there a better way than the String ArrayList to act as a queue?
Thanks!

Yes, you are most likely using the wrong data structure for the job. An ArrayList is a list with a backing array so get() is fast.
The Java runtime library has a very rich set of data structures so you can get a well-written and debugged with the characteristics you need out of the box. You most likely should be using one or more Queues instead.
My guess is that you forget to null out values in your arraylist so the JVM has to keep all of them around. This is a memory leak.
To confirm, use a profiler to see where your memory and cpu go. Visualvm is a nice standalone. Netbeans include one.

The use of VisualVM helped. It showed a heavy use of a "message" form that I was dumping incoming data to and forgot existed, so it was dealing with a million characters when the sluggishness became apparent, because I never limited its size.

Related

Map structure in Android and dynamic memory allocation

Not so recently I've published a game that is written entirely in Java on Android platform. Currently I'm trying to get as much of the performance as possible. It seems that the problem in my game's case is that I'm using more too often ArrayList container in places where Map could be better suited. To explain myself I did it because I was afraid of dynamic memory allocations that would be triggered behind the scene (Map/Tree structures on Android). Maybe there is some sort of structure on Android/Java platform I don't know about, which will provide me with fast searching results and additionally will not allocate dynamically extra memory when adding new elements?
UPDATE:
For example I'm using an ArrayList structure for holding most of my game's Particles. Of course removing them independently (not sequentially) is a pain in the b**t as the system needs to iterate through the whole container just to remove one entity object (of course in the worst case scenario).
I wouldn't worry about slowdown because of memory allocation unless you specifically find it to be an issue. Memory allocation isn't really the cause of slowdowns in Android games, it's when the GC runs that's usually the problem. Unless you are inserting and deleting from the Map very often, you might not have to worry about the allocations.
Update:
Instead of using a Map, you might want to consider just marking particles as "dead" when you no longer need them and using that flag to skip over them in your update iteration. Store the references to the dead particles in a new deadParticles ArrayList, and just take one out from that list when you need a new one. That way the you have instant access to particles when you need them.
Are you preallocating your element objects, and reusing the empties rather than allocating new ones?

Caching large objects and de/serializing them if needed (Java)

i just came to the point where whether google nor my knowledge bring me forwards.
Think about the following situation: I read in a lot (up to millions) of large objects (up to 500mb each) and sometimes i read in millions of objects with only 500kb, this completely depends on the user of my software. Each object is gonna be processed in a pipeline so they don't need to be all in the memory for all the time, only a reference would be needed to find the objects again on my harddisk after serializing it so that i can deserialize it again. So it's something like a persistent cache for large objects.
so here come my questions:
Is there a solution (any framework) which does exactly what i need? this includes: arbitrary serialization of large objects after determining somehow if the cache is full?
if there isn't: is there a way to somehow intelligent check weather an object should be serialized or not? e.g. checking somehow the memory size? Or something like a listener on a softreference (when it get's released?).
Thanks alot,
Christian
Storing millions of objects sounds like a database application.
Check out Oracle Coherence. There are a few open source alternatives as well but, Coherence is the most feature rich.
EDIT:
Coherence Alternatives

Scaling application that reads large XML files

I have an application which reads large set of XML files (multiple around 20-30) periodically, like once every 10 minutes. Now each XML file can be approximated to at least 40-100 MB in size. Once each XML has read, a map is created out of the file, and then the map is passed across a processor chain (10-15), each processor using the data, performing some filter or writing to database, etc.
Now the application is running in 32 bit JVM. No intention on moving to 64 bit JVM right now. The memory foot-print as expected is very high... nearing the threshold of a 32 bit JVM. For now when we receive large files, we serialize the generated map into disk and run through the processor chain maximum of 3-4 map concurrently as if we try to process all the maps at the same time, it would easily go OutOfMemory. Also garbage collection is pretty high.
I have some ideas but wanted to see if there are some options which people have already tried/evaluated. So what are the options here for scaling this kind of application?
Yea, to parrot #aaray and #MeBigFatGuy, you want to use some event based parser for this, the dom4j mentioned, or SAX or StAX.
As a simple example, that 100MB XML is consuming a minimum of 200MB of RAM if you load it wholesale, as each character is immediately expanded to a 16 bit character.
Next, any tag of elements that you're not using is going to consume extra memory (plus all of the other baggage and bookkeeping of the nodes) and it's all wasted. If you're dealing with numbers, converting the raw string to a long will be a net win if the number is larger than 2 digits.
IF (and this is a BIG IF) you are using a lot of a reasonably small set of Strings, you can save some memory by String.intern()'ing them. This is a canonicalization process that makes sure if the string already exists in the jvm, its shared. The downside of this is that it pollutes your permgen (once interned, always interned). PermGen is pretty finite, but on the other hand it's pretty much immune to GC.
Have you considered being able to run the XML through an external XSLT to remove all of the cruft that you don't want to process before it even enters your JVM? There are several standalone, command line XSL processors that you can use to pre-process the files to something perhaps more sane. It really depends on how much of the data that is coming in you're actually using.
By using an event based XML processing model, the XSLT step is pretty much redundant. But the event based models are all basically awful to use, so perhaps using the XSLT step would let you re-use some of your existing DOM logic (assuming that's what you're doing).
The flatter your internal structures, the cheaper they are in terms of memory. You actually have a little bit of an advantage running a 32b vm, since instance pointers are half the size. But still, when you're talking 1000's or millions of nodes, it all adds up, and quickly.
We had a similar problem processing large XML files (around 400Mb). We greatly reduced the memory footprint of the application using this:
http://dom4j.sourceforge.net/dom4j-1.6.1/faq.html#large-doc
You can insert the contents of each XML file into a temporary DB table and each chain link would fetch the data it needs. You will probably lose performance, but gain scalability.

How to handle huge data in java

right now, i need to load huge data from database into a vector, but when i loaded 38000 rows of data, the program throw out OutOfMemoryError exception.
What can i do to handle this ?
I think there may be some memory leak in my program, good methods to detect it ?thanks
Provide more memory to your JVM (usually using -Xmx/-Xms) or don't load all the data into memory.
For many operations on huge amounts of data there are algorithms which don't need access to all of it at once. One class of such algorithms are divide and conquer algorithms.
If you must have all the data in memory, try caching commonly appearing objects. For example, if you are looking at employee records and they all have a job title, use a HashMap when loading the data and reuse the job titles already found. This can dramatically lower the amount of memory you're using.
Also, before you do anything, use a profiler to see where memory is being wasted, and to check if things that can be garbage collected have no references floating around. Again, String is a common example, since if for example you're using the first 10 chars of a 2000 char string, and you have used substring instead of allocating a new String, what you actually have is a reference to a char[2000] array, with two indices pointing at 0 and 10. Again, a huge memory waster.
You can try increasing the heap size:
java -Xms<initial heap size> -Xmx<maximum heap size>
Default is
java -Xms32m -Xmx128m
Do you really need to have such a large object stored in memory?
Depending of what you have to do with that data you might want to split it in lesser chunks.
Load the data section by section. This will not let you work on all data at the same time, but you won't have to change the memory provided to the JVM.
You could run your code using a profiler to understand how and why the memory is being eaten up. Debug your way through the loop and watch what is being instantiated. There are any number of them; JProfiler, Java Memory Profiler, see the list of profilers here, and so forth.
Maybe optimize your data classes? I've seen a case someone has been using Strings in place of native datatypes such as int or double for every class member that gave an OutOfMemoryError when storing a relatively small amount of data objects in memory. Take a look that you aren't duplicating your objects. And, of course, increase the heap size:
java -Xmx512M (or whatever you deem necessary)
Let your program use more memory or much better rethink the strategy. Do you really need so much data in the memory?
I know you are trying to read the data into vector - otherwise, if you where trying to display them, I would have suggested you use NatTable. It is designed for reading huge amount of data into a table.
I believe it might come in handy for another reader here.
Use a memory mapped file. Memory mapped files can basically grow as big as you want, without hitting the heap. It does require that you encode your data in a decoding-friendly way. (Like, it would make sense to reserve a fixed size for every row in your data, in order to quickly skip a number of rows.)
Preon allows you deal with that easily. It's a framework that aims to do to binary encoded data what Hibernate has done for relational databases, and JAXB/XStream/XmlBeans to XML.

Advice on handling large data volumes

So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once.
Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading.
Should I load everything into memory all at once?
If not, is opening what's a good way of loading the data partially?
What are some Java-relevant efficiency tips?
So then what if the processing requires jumping around in the data for multiple files and multiple buffers? Is constant opening and closing of binary files going to become expensive?
I'm a big fan of 'memory mapped i/o', aka 'direct byte buffers'. In Java they are called Mapped Byte Buffers are are part of java.nio. (Basically, this mechanism uses the OS's virtual memory paging system to 'map' your files and present them programmatically as byte buffers. The OS will manage moving the bytes to/from disk and memory auto-magically and very quickly.
I suggest this approach because a) it works for me, and b) it will let you focus on your algorithm and let the JVM, OS and hardware deal with the performance optimization. All to frequently, they know what is best more so than us lowly programmers. ;)
How would you use MBBs in your context? Just create an MBB for each of your files and read them as you see fit. You will only need to store your results. .
BTW: How much data are you dealing with, in GB? If it is more than 3-4GB, then this won't work for you on a 32-bit machine as the MBB implementation is defendant on the addressable memory space by the platform architecture. A 64-bit machine & OS will take you to 1TB or 128TB of mappable data.
If you are thinking about performance, then know Kirk Pepperdine (a somewhat famous Java performance guru.) He is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details: NIO Performance Tips and other Java performance related things.
You might want to have a look at the entries in the Wide Finder Project (do a google search for "wide finder" java).
The Wide finder involves reading over lots of lines in log files, so look at the Java implementations and see what worked and didn't work there.
You could convert to binary, but then you have 1+ something copies of the data, if you need to keep the original around.
It may be practical to build some kind of index on top of your original ascii data, so that if you need to go through the data again you can do it faster in subsequent times.
To answer your questions in order:
Should I load everything into memory all at once?
Not if don't have to. for some files, you may be able to, but if you're just processing sequentially, just do some kind of buffered read through the things one by one, storing whatever you need along the way.
If not, is opening what's a good way of loading the data partially?
BufferedReaders/etc is simplest, although you could look deeper into FileChannel/etc to use memorymapped I/O to go through windows of the data at a time.
What are some Java-relevant efficiency tips?
That really depends on what you're doing with the data itself!
Without any additional insight into what kind of processing is going on, here are some general thoughts from when I have done similar work.
Write a prototype of your application (maybe even "one to throw away") that performs some arbitrary operation on your data set. See how fast it goes. If the simplest, most naive thing you can think of is acceptably fast, no worries!
If the naive approach does not work, consider pre-processing the data so that subsequent runs will run in an acceptable length of time. You mention having to "jump around" in the data set quite a bit. Is there any way to pre-process that out? Or, one pre-processing step can be to generate even more data - index data - that provides byte-accurate location information about critical, necessary sections of your data set. Then, your main processing run can utilize this information to jump straight to the necessary data.
So, to summarize, my approach would be to try something simple right now and see what the performance looks like. Maybe it will be fine. Otherwise, look into processing the data in multiple steps, saving the most expensive operations for infrequent pre-processing.
Don't "load everything into memory". Just perform file accesses and let the operating system's disk page cache decide when you get to actually pull things directly out of memory.
This depends a lot on the data in the file. Big mainframes have been doing sequential data processing for a long time but they don't normally use random access for the data. They just pull it in a line at a time and process that much before continuing.
For random access it is often best to build objects with caching wrappers which know where in the file the data they need to construct is. When needed they read that data in and construct themselves. This way when memory is tight you can just start killing stuff off without worrying too much about not being able to get it back later.
You really haven't given us enough info to help you. Do you need to load each file in its entiretly in order to process it? Or can you process it line by line?
Loading an entire file at a time is likely to result in poor performance even for files that aren't terribly large. Your best bet is to define a buffer size that works for you and read/process the data a buffer at a time.
I've found Informatica to be an exceptionally useful data processing tool. The good news is that the more recent versions even allow Java transformations. If you're dealing with terabytes of data, it might be time to pony up for the best-of-breed ETL tools.
I'm assuming you want to do something with the results of the processing here, like store it somewhere.
If your numerical data is regularly sampled and you need to do random access consider to store them in a quadtree.
I recommend strongly leveraging Regular Expressions and looking into the "new" IO nio package for faster input. Then it should go as quickly as you can realistically expect Gigabytes of data to go.
If at all possible, get the data into a database. Then you can leverage all the indexing, caching, memory pinning, and other functionality available to you there.
If you need to access the data more than once, load it into a database. Most databases have some sort of bulk loading utility. If the data can all fit in memory, and you don't need to keep it around or access it that often, you can probably write something simple in Perl or your favorite scripting language.

Categories

Resources