Caching large objects and de/serializing them if needed (Java)

I've reached the point where neither Google nor my own knowledge gets me any further.
Consider the following situation: I read in a lot (up to millions) of large objects (up to 500 MB each), and sometimes I read in millions of objects of only 500 KB each; this depends entirely on the user of my software. Each object is processed in a pipeline, so they don't all need to be in memory the whole time. Only a reference would be needed to find an object again on my hard disk after serializing it, so that I can deserialize it later. So it's something like a persistent cache for large objects.
So here are my questions:
Is there a solution (any framework) that does exactly what I need? This includes arbitrary serialization of large objects after somehow determining that the cache is full.
If there isn't: is there a way to intelligently check whether an object should be serialized or not, e.g. by somehow checking the memory size? Or something like a listener on a SoftReference (for when it gets released)?
Thanks a lot,
Christian

Storing millions of objects sounds like a database application.

Check out Oracle Coherence. There are a few open source alternatives as well, but Coherence is the most feature-rich.
EDIT:
Coherence Alternatives
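
If a framework is not an option, the SoftReference idea from the question can be sketched by hand. One caveat: by the time a soft reference is cleared, the object is already gone, so it cannot be serialized at that moment. The sketch below therefore writes every object to disk up front and keeps only a SoftReference in memory, reloading from disk on a miss. The class name `DiskBackedCache`, the `.bin` file layout, and the `dropFromMemory` helper are hypothetical names for illustration, and keys are assumed to be safe to use as file names.

```java
import java.io.*;
import java.lang.ref.SoftReference;
import java.nio.file.*;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: objects always live on disk; memory holds soft references only.
public class DiskBackedCache<V extends Serializable> {
    private final Path dir;
    private final Map<String, SoftReference<V>> inMemory = new ConcurrentHashMap<>();

    public DiskBackedCache(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    // Serialize first, then keep a soft reference the GC may clear under memory pressure.
    public void put(String key, V value) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(fileFor(key))))) {
            out.writeObject(value);
        }
        inMemory.put(key, new SoftReference<>(value));
    }

    // Return the in-memory copy if it survived, otherwise deserialize it again.
    @SuppressWarnings("unchecked")
    public V get(String key) throws IOException, ClassNotFoundException {
        SoftReference<V> ref = inMemory.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value != null) {
            return value;
        }
        Path file = fileFor(key);
        if (!Files.exists(file)) {
            return null; // never stored
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            value = (V) in.readObject();
        }
        inMemory.put(key, new SoftReference<>(value));
        return value;
    }

    // Simulates the GC clearing the reference; handy for testing the reload path.
    void dropFromMemory(String key) {
        inMemory.remove(key);
    }

    // Assumption: keys contain only characters that are valid in file names.
    private Path fileFor(String key) {
        return dir.resolve(key + ".bin");
    }
}
```

The trade-off is that every object is written once even if memory never runs short; in exchange, nothing is lost when the garbage collector reclaims an entry.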

Related

Abusing Java Arrays?

I'm developing a software package which makes heavy use of arrays (ArrayLists). Instructions to be processed are put into an array queue, then deleted from the array once used. The same goes for drawing on a plot: data is placed into an array queue, which is read to plot the data, and the oldest data is eventually deleted as new data comes in. We are talking about thousands of instructions over an hour and maybe 200,000 points plotted at any time, with the array continually growing and shrinking.
After some time, the software begins to slow down and the instructions are processed more slowly. Nothing really changes in the processing itself; that is, the system is stable in terms of how much data is plotted and which instructions are being processed, just working off similar incoming data time after time.
Is there some memory issue going on with the "abuse" of the variable-sized (not a defined size, add/delete as needed) arrays/queues that could be causing the eventual slowdown?
Is there a better way than a String ArrayList to act as a queue?
Thanks!

Yes, you are most likely using the wrong data structure for the job. An ArrayList is a list with a backing array, so get() is fast, but removing from the front has to shift every remaining element.
The Java runtime library has a very rich set of data structures, so you can get a well-written and debugged implementation with the characteristics you need out of the box. You most likely should be using one or more Queues instead.
My guess is that you forget to null out values in your ArrayList, so the JVM has to keep all of them around. This is a memory leak.
To confirm, use a profiler to see where your memory and CPU go. VisualVM is a nice standalone tool. NetBeans includes one.
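
As a rough illustration of the Queue suggestion, here is a minimal sketch of a bounded buffer for plot points built on `ArrayDeque`, which adds at the tail and removes at the head without shifting elements. The class name `BoundedPlotBuffer` and its methods are made up for this example; the original code is not shown in the question.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: keeps at most `capacity` points, dropping the oldest first.
public class BoundedPlotBuffer {
    private final Deque<Double> points = new ArrayDeque<>();
    private final int capacity;

    public BoundedPlotBuffer(int capacity) {
        this.capacity = capacity;
    }

    // Evict the oldest point when full, then append the new one at the tail.
    public void add(double point) {
        if (points.size() == capacity) {
            points.pollFirst();
        }
        points.addLast(point);
    }

    public int size() {
        return points.size();
    }

    // Oldest point still held, or null if the buffer is empty.
    public Double oldest() {
        return points.peekFirst();
    }
}
```

Because the size is capped, the buffer also cannot grow without limit the way an unbounded list can.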

The use of VisualVM helped. It showed a heavy use of a "message" form that I was dumping incoming data to and forgot existed, so it was dealing with a million characters when the sluggishness became apparent, because I never limited its size.

Java Serialization Memory Efficiency

I am coding a game using LibGdx. I created a level editor where you can place objects for the level. All the objects are put into a list and then the whole list is serialized by writing out the list to an object output stream. I then read in the list of objects in my game and copy the list over to my list of objects in the current level being played. On the actual user's device who is playing the game, there will be 20+ serialized level files. They will only be deserialized at this point. Is this an efficient way to do this in regards to memory and performance? Could those files take up a good chunk of memory? I noticed people use xml or Json for what I am doing. Should I be worried about having any issues the way I have done my level loading? Thanks. Let me know if my question isn't clear.

When we were looking for slowdowns in our code we profiled it running and found a huge amount of time being spent on one thing. When we looked it turned out to be a section where an object was being duplicated by serializing it to a string and back. For some reason, the default java implementation of serialization is SLOW.
The other problem with serialization is that it's black box--if your file gets corrupted or you mess up your object enough during an upgrade you can lose everything.
Did you consider ORM and some kind of easy database? There are databases that are compiled into your code (invisible to your user) and it's pretty much just db.save(anyObject)... very easy to use.

For a similar question I did some short research (because I couldn't believe object serialization is slow), and I would recommend you use JSON, as it's faster. Memory usage in terms of RAM will be the same (as soon as your object is deserialized). On disk you might want to zip it.
According to these benchmarks, Jackson is faster than Java serialization:
benchmark
deeplink
Another advantage is the human readability of your JSON files.

How does a B-Tree work in terms of serialisation?

In Java, I know that if you are going to build a B-Tree index on hard disk, you probably should use serialisation where the B-Tree structure has to be written from RAM to HD. My question is: if I later want to query the value of a key from the index, is it possible to deserialise just part of the B-Tree back into RAM, ideally retrieving only the value of a specific key? Fetching the whole index into RAM is a bad design, at least where the B-Tree is larger than the RAM size.
If this is possible, it'd be great if someone could provide some code. How do DBMSs do this, in either Java or C?
Thanks in advance.

you probably should use serialisation where the B-Tree structure has to be written from RAM to HD
Absolutely not. Serialization is the last technique to use when implementing a disk-based B-tree. You have to be able to read individual nodes into memory, add/remove keys, change pointers, etc, and put them back. You also want the file to be readable by other languages. You should define a language-independent representation of a B-tree node. It's not difficult. You don't need anything beyond what RandomAccessFile provides.
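
To make the RandomAccessFile suggestion concrete, here is a minimal sketch of reading and writing individual fixed-size nodes by page number. It is heavily simplified: a node holds only a key count and int keys, with no child pointers or values, and the class name `PageFile`, the 4096-byte page size, and the big-endian layout are all assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

// Hypothetical sketch: each node occupies one fixed-size page at offset pageNo * PAGE_SIZE.
public class PageFile implements AutoCloseable {
    static final int PAGE_SIZE = 4096; // assumed page size

    private final RandomAccessFile file;

    public PageFile(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    // Layout: 4-byte key count, then the keys as 4-byte ints (at most 1023 keys per page).
    public void writePage(long pageNo, int[] keys) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(PAGE_SIZE);
        buf.putInt(keys.length);
        for (int key : keys) {
            buf.putInt(key);
        }
        file.seek(pageNo * PAGE_SIZE);
        file.write(buf.array());
    }

    // Reads exactly one page; the rest of the file is never touched.
    public int[] readPage(long pageNo) throws IOException {
        byte[] raw = new byte[PAGE_SIZE];
        file.seek(pageNo * PAGE_SIZE);
        file.readFully(raw);
        ByteBuffer buf = ByteBuffer.wrap(raw);
        int[] keys = new int[buf.getInt()];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = buf.getInt();
        }
        return keys;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

A lookup would then read the root page, pick the child page number from it, and repeat, so only one page per tree level is loaded.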

You generally split the B-tree into several "pages," each with some of the key-value pairs, etc. Then you only need to load one page into memory at a time.

For inspiration on how RDBMSs do it, it's probably a good idea to check the source code of the embedded Java databases: Derby, HyperSQL, H2, ...
And if those databases solve your problem, I'd rather forget about implementing indices and use their product right away. Because they're embedded, there is no need to set up a server (the RDBMS code is part of the application's classpath), and the memory footprint is modest.
If that is a possibility for you, of course...
If the tree can easily fit into memory, I'd strongly advise to keep it there. The difference in performance will be huge. Not to mention the difficulties to keep changes in sync on disk, reorganizing, etc...
When at some point you need to store it, check out Externalizable instead of regular serialization. Serializing is notoriously slow and extensive, while Externalizable allows you to control each byte being written to disk, not to mention the difference in performance when reading the index back into memory.
If the tree is too big to fit into memory, you'll have to use RandomAccessFile with some kind of memory caching, so that frequently accessed items still come out of memory. But then you'll need to take updates to the index into account: you'll have to flush them to disk at some point.
So, personally, I'd rather not do this from scratch, but rather use the code that's out there. :-)
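
For the Externalizable suggestion above, a minimal sketch could look like this. The class `IndexEntry` and its two fields are hypothetical; the point is only that `writeExternal` and `readExternal` decide exactly which bytes are written and read, and that the class needs a public no-argument constructor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hypothetical index entry: a key and the file offset of its value.
public class IndexEntry implements Externalizable {
    private long key;
    private long offset;

    // Externalizable requires a public no-arg constructor for deserialization.
    public IndexEntry() {
    }

    public IndexEntry(long key, long offset) {
        this.key = key;
        this.offset = offset;
    }

    public long getKey() {
        return key;
    }

    public long getOffset() {
        return offset;
    }

    // Only these 16 bytes of field data are written for the object's state.
    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeLong(key);
        out.writeLong(offset);
    }

    // Fields must be read back in the same order they were written.
    @Override
    public void readExternal(ObjectInput in) throws IOException {
        key = in.readLong();
        offset = in.readLong();
    }

    // Round-trip helpers for the example.
    static byte[] toBytes(IndexEntry entry) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(entry);
        }
        return bytes.toByteArray();
    }

    static IndexEntry fromBytes(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (IndexEntry) in.readObject();
        }
    }
}
```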

Reducing memory footprint while using large XML DOM's in Java

Our application is required to take client data presented in XML format (several files) and parse this into our common XML format (a single file with schema). For this purpose we are using apache's XMLBeans data binding framework. The steps of this process are briefly described below.
First, we take raw java.io.File objects pointing to the client XML files on-disk and load these into a collection. We then iterate over this collection creating a single apache.xmlbeans.XmlObject per file. After all files have been parsed into XmlObjects, we create 4 collections holding the individual objects from the XML documents that we are interested in (to be clear, these are not hand-crafted objects but what I can only describe as 'proxy' objects created by apache's XMLBeans framework). As a final step, we then iterate over these collections to produce our XML document (in memory) and then save this to disk.
For the majority of use cases, this process works fine and can easily run in the JVM when given the '-Xmx1500m' command-line argument. However, issues arise when we are given 'large datasets' by the client. Large in this instance is 123Mb of client XML spread over 7 files. Such datasets result in our in-code collections being populated with almost 40,000 of the aforementioned 'proxy objects'. In these cases the memory usage just goes through the roof. I do not get any OutOfMemoryError; the program just hangs until garbage collection occurs, freeing up a small amount of memory, and then continues, uses up this new space, and the cycle repeats. These parsing sessions currently take 4-5 hours. We are aiming to bring this down to within an hour.
It's important to note that the calculations required to transform client XML into our XML require all of the XML data for cross-referencing. Therefore we cannot implement a sequential parsing model or batch this process into smaller blocks.
What I've tried so far
Instead of holding all 123Mb of client xml in memory, on each request for data, load the files, find the data and release the references to these objects. This does seem to reduce the amount of memory consumed during the process but as you can imagine, the amount of time the constant I/O takes removes the benefit of the reduced memory footprint.
I suspected an issue was that we are holding an XmlObject[] for 123Mb worth of XML files as well as the collections of objects taken from these documents (using XPath queries). To remedy this, I altered the logic so that instead of querying these collections, the documents were queried directly. The idea here was that at no point would there exist 4 massive Lists with 10's of 1000's of objects in them, just the large collection of XmlObjects. This did not seem to make a difference at all and in some cases increased the memory footprint even more.
Clutching at straws now, I considered that the XmlObject we use to build our xml in-memory before writing to disk was growing too large to maintain alongside all the client data. However, doing some sizeOf queries on this object revealed that at its largest, this object is less than 10Kb. After reading into how XmlBeans manages large DOM objects, it seems to use some form of buffered writer and as a result, is managing this object quite well.
So now I am out of ideas. I can't use SAX approaches instead of memory-intensive DOM approaches, as we need 100% of the client data in our app at any one time. I cannot hold off requesting this data until we absolutely need it, as the conversion process requires a lot of looping and the disk I/O time is not worth the saved memory space. And I cannot seem to structure our logic in such a way as to reduce the amount of space the internal Java collections occupy. Am I out of luck here? Must I just accept that if I want to parse 123Mb worth of XML data into our XML format, I cannot do it with the 1500m memory allocation? While 123Mb is a large dataset in our domain, I cannot imagine others have never had to do something similar with Gb's of data at a time.
Other information that may be important
I have used JProbe to try and see if that can tell me anything useful. While I am a profiling noob, I ran through their tutorials for memory leaks and thread locks, understood them and there doesn't appear to be any leaks or bottlenecks in our code. After running the application with a large dataset, we quickly see a 'sawblade' type shape on the memory analysis screen (see attached image) with PS Eden space being taken over with a massive green block of PS Old Gen. This leads me to believe that the issue here is simply sheer amount of space taken up by object collections rather than a leak holding onto unused memory.
I am running on a 64-Bit Windows 7 platform but this will need to run on a 32 Bit environment.

The approach I'd take would be to make two passes over the files, using SAX in both cases.
The first pass would parse the 'cross-reference' data needed in the calculations into custom objects and store them in Maps. If the 'cross-reference' data is large, then look at using a distributed cache (Coherence is the natural fit if you've started with Maps).
The second pass would parse the files, retrieve the 'cross-reference' data to perform calculations as needed, and then write the output XML using the javax.xml.stream APIs.
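
A minimal sketch of the two passes, under invented assumptions: the input is taken to be `<item id=".." ref=".." value=".."/>` elements, and the "calculation" is simply adding an item's value to the value of the item it references. The element and attribute names, the `TwoPassTransform` class, and the `<result>`/`<sum>` output format are all hypothetical; the real client schema is not given in the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TwoPassTransform {

    // Runs one SAX pass over the document with the given handler.
    private static void parse(String xml, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }

    public static String transform(String xml) throws Exception {
        // Pass 1: collect only the cross-reference data (id -> value) into a Map.
        Map<String, Integer> valueById = new HashMap<>();
        parse(xml, new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes a) {
                if ("item".equals(qName)) {
                    valueById.put(a.getValue("id"), Integer.parseInt(a.getValue("value")));
                }
            }
        });

        // Pass 2: stream the input again and write output as each element is seen.
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartElement("result");
        parse(xml, new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes a)
                    throws SAXException {
                if (!"item".equals(qName)) {
                    return;
                }
                // Look up the referenced item's value; treat a missing reference as 0.
                int own = Integer.parseInt(a.getValue("value"));
                int referenced = valueById.getOrDefault(a.getValue("ref"), 0);
                try {
                    writer.writeEmptyElement("sum");
                    writer.writeAttribute("id", a.getValue("id"));
                    writer.writeAttribute("total", String.valueOf(own + referenced));
                } catch (XMLStreamException e) {
                    throw new SAXException(e);
                }
            }
        });
        writer.writeEndElement();
        writer.close();
        return out.toString();
    }
}
```

Only the Map from the first pass stays in memory; neither the input documents nor the output document is ever held as a whole.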

Keep serialized and compressed Objects in-memory

I'm currently working on a part of an application where "a lot" of data must be selected for further work, and I have the impression that the I/O is the limiting factor and not the subsequent work.
My idea now is to keep all these objects in memory, but serialized and compressed. The question is whether accessing the objects like this would be faster than direct database access, and whether it is a good idea or not (and whether it is feasible in terms of memory consumption, i.e. whether the serialized form uses less memory than the normal object).
EDIT February 2011:
The creation of the objects is the slow part, not the database access itself. Keeping everything in memory is not possible, and using the ehcache option to "overflow to disk" is actually slower than just getting the data from the database. Standard Java serialization is also unusable; it is a lot slower as well. So basically there is nothing I can do about it...

You're basically looking for an in-memory cache or an in-memory data grid. There are plenty of APIs/products for this sort of thing: ehcache, Hibernate cache, GridGain, etc.

The compressed serialized form will use less memory if it is a large object. However, for smaller objects, e.g. ones that mostly hold primitives, the original object will be much smaller.
I would first check whether you really need to do this, e.g. can you just consume more memory, or restructure your objects so they use less memory?
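
One way to check this for your own objects is to serialize and compress them and look at the resulting byte count. A minimal sketch using standard Java serialization with GZIP follows; the class name `CompressedBlob` and its two helper methods are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: keeps an object in memory as a compressed byte array.
public class CompressedBlob {

    // Serialize the object and gzip the stream in one go.
    public static byte[] pack(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    // Reverse of pack: gunzip and deserialize.
    public static Object unpack(byte[] packed) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(packed)))) {
            return in.readObject();
        }
    }
}
```

Comparing `pack(obj).length` for a large, repetitive array against a single boxed `Integer` shows both cases: the array shrinks, while the small object grows because of the stream and compression headers. Keep in mind that every access then pays the cost of `unpack`.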

"I have the impression that the I/O is limiting and not the following work." -> I would make very sure of this before starting to implement such a thing.
The simplest approach I can suggest is to use ehcache with the option to store on disk when the cache gets too big.
Another, completely different approach could be to use a document-based NoSQL DB such as CouchDB to store the objects selected "for further work".
