How to handle huge data in Java

Right now I need to load a huge amount of data from the database into a Vector, but when I have loaded 38,000 rows of data, the program throws an OutOfMemoryError exception.
What can I do to handle this?
I think there may be a memory leak in my program; what are good methods to detect it? Thanks.

Provide more memory to your JVM (usually using -Xmx/-Xms) or don't load all the data into memory.
For many operations on huge amounts of data there are algorithms which don't need access to all of it at once; one such class is divide-and-conquer algorithms.
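If the rows can be processed one at a time, a forward-only ResultSet with a fetch-size hint avoids materializing everything at once. A minimal sketch, assuming a hypothetical employee table (the connection details and handleRow helper are made up; fetch-size behaviour varies by JDBC driver):
import java.sql.*;

try (Connection con = DriverManager.getConnection(jdbcUrl, user, pass);
     Statement st = con.createStatement()) {
    st.setFetchSize(1000); // hint: fetch rows in batches rather than all at once
    try (ResultSet rs = st.executeQuery("SELECT id, name FROM employee")) {
        while (rs.next()) {
            handleRow(rs.getLong("id"), rs.getString("name")); // process one row, then let it go
        }
    }
}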

If you must have all the data in memory, try caching commonly appearing objects. For example, if you are looking at employee records and they all have a job title, use a HashMap when loading the data and reuse the job titles already found. This can dramatically lower the amount of memory you're using.
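A minimal sketch of that job-title trick (the method and map names are invented here):
import java.util.HashMap;
import java.util.Map;

// One canonical String per distinct job title; duplicate titles share the same instance.
Map<String, String> titleCache = new HashMap<>();

String canonicalTitle(String rawTitle) {
    return titleCache.computeIfAbsent(rawTitle, t -> t); // first occurrence is kept and reused
}
With 38,000 rows and only a handful of distinct titles, this keeps one String per title instead of one per row.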
Also, before you do anything, use a profiler to see where memory is being wasted, and check that objects which could be garbage collected don't still have references floating around. String is a common example: if you only need the first 10 chars of a 2000-char string and take a substring instead of allocating a new String, what you actually hold (on older JVMs) is a reference to the full char[2000] array, with two indices pointing at 0 and 10. Again, a huge memory waster.
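For example (this backing-array sharing applies to JVMs before Java 7u6; later versions copy on substring, and the loader method below is hypothetical):
String record = loadRecord();                       // hypothetical 2000-char string
String bad  = record.substring(0, 10);              // old JVMs: still pins the char[2000]
String good = new String(record.substring(0, 10));  // copies just 10 chars; record can be GC'd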

You can try increasing the heap size:
java -Xms<initial heap size> -Xmx<maximum heap size>
The default depends on the JVM version and machine; for example:
java -Xms32m -Xmx128m

Do you really need to have such a large object stored in memory?
Depending on what you have to do with that data, you might want to split it into smaller chunks.

Load the data section by section. This means you cannot work on all the data at the same time, but you won't have to change the memory provided to the JVM.
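A sketch of section-by-section loading (the paging query and loadPage/process helpers are made up for illustration; exact LIMIT/OFFSET syntax varies by database):
final int pageSize = 5000;
for (int offset = 0; ; offset += pageSize) {
    // e.g. "SELECT ... ORDER BY id LIMIT ? OFFSET ?"
    List<Row> page = loadPage(offset, pageSize);
    if (page.isEmpty()) break;
    process(page); // work on this chunk, then let it become garbage
}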

You could run your code using a profiler to understand how and why the memory is being eaten up. Debug your way through the loop and watch what is being instantiated. There are any number of profilers: JProfiler, Java Memory Profiler (see the list of profilers here), and so forth.

Maybe optimize your data classes? I've seen a case where someone used Strings in place of primitive datatypes such as int or double for every class member, and it produced an OutOfMemoryError when storing a relatively small number of data objects in memory. Also check that you aren't duplicating your objects. And, of course, increase the heap size:
java -Xmx512M (or whatever you deem necessary)

Let your program use more memory or, much better, rethink the strategy. Do you really need so much data in memory?

I know you are trying to read the data into a vector - otherwise, if you were trying to display it, I would have suggested NatTable, which is designed for reading huge amounts of data into a table.
I believe it might come in handy for other readers here.

Use a memory mapped file. Memory mapped files can basically grow as big as you want, without hitting the heap. It does require that you encode your data in a decoding-friendly way. (Like, it would make sense to reserve a fixed size for every row in your data, in order to quickly skip a number of rows.)
Preon allows you deal with that easily. It's a framework that aims to do to binary encoded data what Hibernate has done for relational databases, and JAXB/XStream/XmlBeans to XML.
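A minimal sketch of the fixed-size-row idea with java.nio (the row layout and file name are assumptions for illustration; Preon's own API is higher-level than this):
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

final int ROW_SIZE = 64; // fixed bytes per row, chosen for this example
try (RandomAccessFile file = new RandomAccessFile("data.bin", "rw");
     FileChannel channel = file.getChannel()) {
    MappedByteBuffer buf =
        channel.map(FileChannel.MapMode.READ_WRITE, 0, (long) ROW_SIZE * 38000);
    buf.position(ROW_SIZE * 1234); // jump straight to row 1234 without touching earlier rows
    long id = buf.getLong();       // decode the row's fields from their fixed offsets
    int value = buf.getInt();
}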

Related

Abusing Java Arrays?

I'm developing a software package which makes heavy use of arrays (ArrayLists). Instructions to be processed are put into an array queue, then deleted from the array once used. The same goes for drawing on a plot: data is placed into an array queue, which is read to plot the data, and the oldest data is eventually deleted as new data comes in. We are talking about thousands of instructions over an hour and maybe 200,000 points plotted at any time, with the arrays continually growing/shrinking.
After some time, the software begins to slow, with instructions processed more slowly. Nothing really changes in the processing, that is, the system is stable as to how much data is plotted and which instructions are being processed, just working off similar incoming data time after time.
Is there some memory issue going on with the "abuse" of these variable-sized (not a defined size, add/delete as needed) arrays/queues that could be causing the eventual slowing?
Is there a better way than a String ArrayList to act as a queue?
Thanks!
Yes, you are most likely using the wrong data structure for the job. An ArrayList is a list with a backing array, so get() is fast but removing from the front is O(n).
The Java runtime library has a very rich set of data structures, so you can get a well-written and debugged implementation with the characteristics you need out of the box. You most likely should be using one or more Queues instead.
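For example, an ArrayDeque gives O(1) adds and removes at both ends, with none of the element shifting an ArrayList does when you delete from the front (the String element type here is just what the question mentions):
import java.util.ArrayDeque;
import java.util.Deque;

Deque<String> instructions = new ArrayDeque<>();
instructions.addLast("plot 42,17");      // enqueue new work at the tail
String next = instructions.pollFirst();  // dequeue the oldest entry in O(1), or null if empty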
My guess is that you forgot to null out values in your ArrayList, so the JVM has to keep all of them around. This is a memory leak.
To confirm, use a profiler to see where your memory and CPU go. VisualVM is a nice standalone one; NetBeans includes one as well.
The use of VisualVM helped. It showed heavy use by a "message" form that I was dumping incoming data into and had forgotten existed; it was dealing with a million characters by the time the sluggishness became apparent, because I never limited its size.

How to test how many bytes an object reference use in Java?

I would like to test how many bytes an object reference uses in the Java VM that I'm using. Do you guys know how to test this?
Thanks!
Taking the question literally: on most 32-bit JVMs a reference takes 4 bytes; on 64-bit JVMs a reference takes 8 bytes unless -XX:+UseCompressedOops is in effect, in which case it takes 4 bytes.
I assume you are asking how to tell how much space an Object occupies. You can use Instrumentation (not a simple matter), but this will only give you the shallow size. Java tends to break into many objects something which in C++ might be a single structure, so the shallow size is not as useful.
However, if you have a memory issue, I suggest you use a memory profiler. This will give you the shallow and deep sizes objects use and give you a picture across the whole system. This is often more useful, as you can start with the biggest consumers and optimise those; even if you have been developing Java for ten-plus years, you will only be guessing where the best place to optimise is unless you have hard data.
Another way to get the object size, if you don't want to use a profiler, is to allocate a large array of them and see how much memory is consumed. You have to do this many times to get a good idea of the average size. I would set the young space very high to avoid GCs confusing your results, e.g. -XX:NewSize=1g
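A rough sketch of that allocate-and-measure technique (MyRecord is a stand-in for whatever class you're measuring; the result is only indicative, since GC activity can skew it):
Runtime rt = Runtime.getRuntime();
System.gc(); // only a hint, but helps establish a clean baseline
long before = rt.totalMemory() - rt.freeMemory();
MyRecord[] sample = new MyRecord[1_000_000];
for (int i = 0; i < sample.length; i++)
    sample[i] = new MyRecord();
long after = rt.totalMemory() - rt.freeMemory();
System.out.println("approx. bytes per object: " + (after - before) / (double) sample.length);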
It can differ from JVM to JVM but "Sizeof for Java" says
You might recollect "Java Tip 130: Do You Know Your Data Size?" that described a technique based on creating a large number of identical class instances and carefully measuring the resulting increase in the JVM used heap size. When applicable, this idea works very well, and I will in fact use it to bootstrap the alternate approach in this article.
If you need to be fairly accurate, check out the Instrumentation framework.
This is the one I use. Got to love those 16-byte references!
alphaworks.ibm.heapanalyzer

How can I calculate how much RAM an object uses?

I have a HashMap that contains 12 million entries. It maps String values to Long values. Each string is about ten characters long. Is it possible to calculate how much memory this map will need in RAM?
You can guess, and you can make an educated guess by looking at the size of each item that goes into the map, but your best educated guess will be wrong.
JVMs have other structures used to track references and hold type information for the classes. That will add a fixed yet unknown amount of memory on top of even an accurate input estimate (if you can come up with one).
As only some of the memory directly holds the data, and some is overhead used to hold the data, you need to profile your memory consumption and base your estimates on projections of the memory growth measured with smaller maps.
Note that profiling a JVM is a tricky task, as it optimizes memory usage in a manner that will present varying results depending on how long the JVM has been running, the activity of the Map, etc. You need to do statistical sampling of the input in a variety of conditions, but odds are good you will eventually be able to put your finger on a reasonable number. More importantly, you will also be able to say "Well, it might peak at around this number temporarily, but should settle down to this on average." Temporal changes to memory are often overlooked in static analysis.
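For a very rough sense of scale only (assuming a 64-bit HotSpot JVM with compressed oops; every number below is approximate and JVM-dependent):
HashMap node (header, hash, key/value/next refs):  ~32 bytes
String object (header, hash, ref to char[]):       ~24-32 bytes
char[10] backing array:                            ~40 bytes
Long object:                                       ~16 bytes
                                                   -------------
                                                   ~120-130 bytes per entry, plus the table array itself
12,000,000 entries x ~128 bytes is on the order of 1.5 GB, which is why profiling beats guessing: the overhead dwarfs the ~20 bytes of actual character data per key.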
The JVM would know, because it has to allocate and manage the memory, but it doesn't tell you. So, short of going native, no, there is no way to know how much memory is actually used by your objects.
A profiler will tell you how much memory is being used by the program, and which objects are using what. You might be able to find your objects' memory usage.
VisualVM is included with Java 6 and will give you this information.
It's worth mentioning, though, that this is not necessarily going to give you a memory 'requirement', just a view of how much memory it is using at that point in time.

Keep serialized and compressed Objects in-memory

I'm currently working on a part of an application where "a lot" of data must be selected for further work, and I have the impression that the I/O is the limit rather than the work that follows.
My idea is now to keep all these objects in memory, but serialized and compressed. The question is whether accessing the objects like this would be faster than direct database access, whether it is a good idea or not, and whether it is feasible in terms of memory consumption (i.e. the serialized form uses less memory than the normal object).
EDIT February 2011:
The creation of the objects is the slow part, not the database access itself. Having everything in memory is not possible, and using the ehcache option to "overflow to disk" is actually slower than just getting the data from the database. Standard Java serialization is also unusable; it too is a lot slower. So basically there is nothing I can do about it...
You're basically looking for an in-memory cache or an in-memory data grid. There are plenty of APIs/products for this sort of thing: ehcache/Hibernate cache/GridGain etc. etc.
The compressed serialized form will use less memory if it is a large object. However, for smaller objects, e.g. ones which mostly use primitives, the original object will be much smaller.
I would first check whether you really need to do this. E.g. can you just consume more memory, or restructure your objects so they use less memory?
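A minimal sketch of holding objects serialized and compressed in memory (assumes the objects implement Serializable; whether this actually beats re-reading from the database is exactly what you would need to benchmark, and the question's edit suggests it may not):
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

static byte[] pack(Serializable obj) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
        out.writeObject(obj); // the object now lives only as a compressed byte[]
    }
    return bytes.toByteArray();
}

static Object unpack(byte[] packed) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(
            new GZIPInputStream(new ByteArrayInputStream(packed)))) {
        return in.readObject(); // inflate and deserialize on demand
    }
}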
"I have the impression that the I/O is limiting and not the following work. " -> I would be very sure of this before starting implementing such a thing.
The simpler approach I can suggest you is to use ehcache with the option to store on disk when the size of the cache get too big.
Another completely different approach could be using some doc based nosql db like couchdb to store objects selected "for further work"

How do you make your Java application memory efficient?

How do you optimize the heap size usage of an application that has a lot (millions) of long-lived objects? (big cache, loading lots of records from a db)
Use the right data type
Avoid java.lang.String to represent other data types
Avoid duplicated objects
Use enums if the values are known in advance
Use object pools
String.intern() (good idea?)
Load/keep only the objects you need
I am looking for general programming or Java specific answers. No funky compiler switch.
Edit:
Optimize the memory representation of a POJO that can appear millions of times in the heap.
Use cases
Load a huge CSV file into memory (converted into POJOs)
Use Hibernate to retrieve millions of records from a database
Summary of answers:
Use flyweight pattern
Copy on write
Instead of loading 10M objects with 3 properties each, is it more efficient to have 3 arrays (or another data structure) of size 10M? (Could be a pain to manipulate the data, but if you are really short on memory...)
I suggest you use a memory profiler: see where the memory is being consumed and optimise that. Without quantitative information you could end up changing things which either have no effect or actually make things worse.
You could look at changing the representation of your data, especially if your objects are small.
For example, you could represent a table of data as a series of columns, with object arrays for each column, rather than one object per row. This can save a significant amount of per-object overhead if you don't need to represent an individual row. E.g. a table with 12 columns and 10,000,000 rows could use 12 objects (one per column) rather than 10 million (one per row).
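A sketch of that column-per-array layout (the class and field names are invented; the point is 3 array objects instead of 10 million row objects, with primitives avoiding per-cell boxing):
// Row-oriented: one object (header + padding + fields) per row.
class Trade { long id; double price; int quantity; }

// Column-oriented: three flat arrays for the whole table.
class TradeTable {
    final long[] id;
    final double[] price;
    final int[] quantity;
    TradeTable(int rows) {
        id = new long[rows];
        price = new double[rows];
        quantity = new int[rows];
    }
}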
You don't say what sort of objects you're looking to store, so it's a little difficult to offer detailed advice. However some (not exclusive) approaches, in no particular order, are:
Use a flyweight pattern wherever possible.
Caching to disc. There are numerous cache solutions for Java.
There is some debate as to whether String.intern is a good idea. See here for a question re. String.intern(), and the amount of debate around its suitability.
Make use of soft or weak references to store data that you can recreate/reload on demand (a small sketch follows below). See here for how to use soft references with caching techniques.
Knowing more about the internals and lifetime of the objects you're storing would result in a more detailed answer.
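The soft-reference sketch promised above (Record and loadFromDatabase are hypothetical; the GC is free to clear SoftReferences whenever memory runs low, which is what makes them cache-friendly):
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

Map<Long, SoftReference<Record>> cache = new HashMap<>();

Record get(long key) {
    SoftReference<Record> ref = cache.get(key);
    Record rec = (ref == null) ? null : ref.get();
    if (rec == null) {                           // never cached, or already cleared by the GC
        rec = loadFromDatabase(key);             // hypothetical reload
        cache.put(key, new SoftReference<>(rec));
    }
    return rec;
}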
Ensure good normalization of your object model, don't duplicate values.
Ahem, and, if it's only millions of objects I think I'd just go for a decent 64 bit VM and lots of ram ;)
Normal "profilers" won't help you much, because you need an overview of all your "live" objects. You need heap dump analyzer. I recommend the Eclipse Memory analyzer.
Check for duplicated objects, starting with Strings.
Check whether you can apply patterns like flightweight, copyonwrite, lazy initialization (google will be your friend).
Take a look at this presentation linked from here. It lays out the memory use of common java object and primitives and helps you understand where all the extra memory goes.
Building Memory-efficient Java Applications: Practices and Challenges
You could just store fewer objects in memory. :) Use a cache that spills to disk or use Terracotta to cluster your heap (which is virtual) allowing unused parts to be flushed out of memory and transparently faulted back in.
I want to add something to the point Peter already made (can't comment on his answer): it's always better to use a memory profiler than to go by intuition. 80% of the time it's the routine code we ignore that has the problem in it. Also, collection classes are more prone to memory leaks.
If you have millions of Integers and Floats etc. then see if your algorithms allow for representing the data in arrays of primitives. That means fewer references and lower CPU cost of each garbage collection.
A fancy one: keep most data compressed in RAM, and only expand the current working set. If your data has good locality, that can work nicely.
Use better data structures. The standard collections in Java are rather memory intensive.
[what is a better data structure]
If you take a look at the source for the collections, you'll see that if you restrict yourself in how you access them, you can save space per element.
The way the collections handle growing is no good for large collections: too much copying. For large collections, you need a block-based algorithm, like a B-tree.
Spend some time getting acquainted with and tuning the VM command line options, especially those concerning garbage collection. While this won't change the memory used by your objects, it can have a big impact on performance with memory-intensive apps on machines with a lot of RAM.
Assign null to variables which are no longer used, making the objects they referenced available for garbage collection.
De-reference collections once their usage is over; otherwise the GC won't sweep their contents.
1) Use the right data types wherever possible
class Person {
    int age;
    int status;
}
Here we can use smaller types to save memory when handling Person objects:
class Person {
    short age;
    byte status;
}
2) Instead of returning new ArrayList<>() from a method, you can use Collections.emptyList(), which returns a single shared immutable instance instead of allocating a fresh ArrayList (whose backing array has a default capacity of 10). Note the return type must be List, not ArrayList, for this to compile.
For e.g.
// Instead of this
public List<String> getResults() {
    .....
    if (failedOperation)
        return new ArrayList<>();
}
// Use this
public List<String> getResults() {
    .....
    if (failedOperation)
        return Collections.emptyList();
}
3) Create short-lived objects inside methods instead of holding them in long-lived static fields wherever possible. Objects themselves always live on the heap, but local references go out of scope when the method returns, so the objects become eligible for GC sooner (and the JIT can sometimes eliminate small allocations entirely via escape analysis).
4) Use binary formats like protobuf, Thrift, Avro, or MessagePack instead of JSON or XML to reduce the size of intercommunication payloads.
