Keep serialized and compressed Objects in-memory - java

I'm currently working on a Part of an Application where "a lot" of data must be selected for further work and I have the impression that the I/O is limiting and not the following work.
My idea is now to have all these objects in memory but serialized an compressed. The question is, if accessing the objects like this would be faster than direct Database access and if it is a good idea or not. (and if it is feasble in terms of memory consumption = serialized form uses less memory than normal object)
EDIT February 2011:
The creation of the objects is the slow part and not the database access itself. Having all in memory is not possible and using ehcache option to "overflow to disk" is actually slower than just getting the data from the database. Standard java serialization is also unusable. it is also a lot slower. So basically nothing I can do about it...

You're basically looking for an in-memory cache or an in-memory datagrid. There are plenty of APIs/products for this sort of thing. ehcache/hibernate chace/gridgain etc etc

The compressed serialized form will use less memory, if it is a large object. However for smaller objects e.g. which use primtives. The original object will be much smaller.
I would first check whether you really need to do this. e.g. Can you just consume more memory? or restructure your objects so they use less memory.

"I have the impression that the I/O is limiting and not the following work. " -> I would be very sure of this before starting implementing such a thing.
The simpler approach I can suggest you is to use ehcache with the option to store on disk when the size of the cache get too big.
Another completely different approach could be using some doc based nosql db like couchdb to store objects selected "for further work"

Related

How B-Tree works in term of serialisation?

In Java, I know that if you are going to build a B-Tree index on Hard Disk, you probably should use serialisation were the B-Tree structure has to be written from RAM to HD. My question is, if later I'd like to query the value of a key out of the index, is it possible to deserialise just part of the B-Tree back to RAM? Ideally, only retrieving the value of a specific key. Fetching the whole index to RAM is a bad design, at least where the B-Tree is larger than the RAM size.
If this is possible, it'd be great if someone provides some code. How DBMSs are doing this, either in Java or C?
Thanks in advance.
you probably should use serialisation were the B-Tree structure has to be written from RAM to HD
Absolutely not. Serialization is the last technique to use when implementing a disk-based B-tree. You have to be able to read individual nodes into memory, add/remove keys, change pointers, etc, and put them back. You also want the file to be readable by other languages. You should define a language-independent representation of a B-tree node. It's not difficult. You don't need anything beyond what RandomAccessFile provides.
You generally split the B-tree into several "pages," each with some of they key-value pairs, etc. Then you only need to load one page into memory at a time.
For inspiration of how rdbms are doing it, it's probably a good idea to check the source code of the embedded Java databases: Derby, HyperSql, H2, ...
And if those databases solve your problem, I'd rather forget about implementing indices and use their product right away. Because they're embedded, there is no need to set up a server. - the rdbms code is part of the application's classpath - and the memory footprint is modest.
IF that is a possibility for you of course...
If the tree can easily fit into memory, I'd strongly advise to keep it there. The difference in performance will be huge. Not to mention the difficulties to keep changes in sync on disk, reorganizing, etc...
When at some point you'll need to store it, check Externalizable instead of the regular serialization. Serializing is notoriously slow and extensive. While Externalizable allows you to control each byte being written to disk. Not to mention the difference in performance when reading the index back into memory.
If the tree is too big to fit into memory, you'll have to use RandomAccessFile with some kind of memory caching. Such that often accessed items come out of memory nonetheless. But then you'll need to take updates to the index into account. You'll have to flush them to disk at some point.
So, personally, I'd rather not do this from scratch. But rather use the code that's out there. :-)

Caching large objects and de/serializing them if needed (Java)

i just came to the point where whether google nor my knowledge bring me forwards.
Think about the following situation: I read in a lot (up to millions) of large objects (up to 500mb each) and sometimes i read in millions of objects with only 500kb, this completely depends on the user of my software. Each object is gonna be processed in a pipeline so they don't need to be all in the memory for all the time, only a reference would be needed to find the objects again on my harddisk after serializing it so that i can deserialize it again. So it's something like a persistent cache for large objects.
so here come my questions:
Is there a solution (any framework) which does exactly what i need? this includes: arbitrary serialization of large objects after determining somehow if the cache is full?
if there isn't: is there a way to somehow intelligent check weather an object should be serialized or not? e.g. checking somehow the memory size? Or something like a listener on a softreference (when it get's released?).
Thanks alot,
Christian
Storing millions of objects sounds like a database application.
Check out Oracle Coherence. There are a few open source alternatives as well but, Coherence is the most feature rich.
EDIT:
Coherence Alternatives

Lightweight way to persist objects in Java

I'm trying to design a lightweight way to store persistent data in Java. I've already got a very efficient way to serialize POJOs to DataOutputStreams (and back), but I'm trying to think of a good way to ensure that changes to the data in the POJOs gets serialized when necessary.
This is for a client-side app where I'm trying to keep the size of the eventual distributable as low as possible, so I'm reluctant to use anything that would pull-in heavy-weight dependencies. Right now my distributable is almost 10MB, and I don't want it to get much bigger.
I've considered DB4O but its too heavy - I need something light. Really its probably more a design pattern I need, rather than a library.
Any ideas?
The 'lightest weight' persistence option will almost surely be simply marking some classes Serializable and reading/writing from some fixed location. Are you trying to accomplish something more complex than this? If so, it's time to bundle hsqldb and use an ORM.
If your users are tech savvy, or you're just worried about initial payload, there are libraries which can pull dependencies at runtime, such as Grape.
If you already have a compact data output format in bytes (which I assume you have if you can persist efficiently to a DataOutputStream) then an efficient and general technique is to use run-length-encoding on the difference between the previous byte array output and the new byte array output.
Points to note:
If the object has not changed, the difference in byte arrays will be an array of zeros and hence will compress very small....
For the first time you serialize the object, consider the previous output to be all zeros so that you communicate a complete set of data
You probably want to be a bit clever when the object has variable-sized substructures....
You can also try zipping the difference rather than RLE - might be more efficient in some cases where you have a large object graph with a lot of changes

How to handle huge data in java

right now, i need to load huge data from database into a vector, but when i loaded 38000 rows of data, the program throw out OutOfMemoryError exception.
What can i do to handle this ?
I think there may be some memory leak in my program, good methods to detect it ?thanks
Provide more memory to your JVM (usually using -Xmx/-Xms) or don't load all the data into memory.
For many operations on huge amounts of data there are algorithms which don't need access to all of it at once. One class of such algorithms are divide and conquer algorithms.
If you must have all the data in memory, try caching commonly appearing objects. For example, if you are looking at employee records and they all have a job title, use a HashMap when loading the data and reuse the job titles already found. This can dramatically lower the amount of memory you're using.
Also, before you do anything, use a profiler to see where memory is being wasted, and to check if things that can be garbage collected have no references floating around. Again, String is a common example, since if for example you're using the first 10 chars of a 2000 char string, and you have used substring instead of allocating a new String, what you actually have is a reference to a char[2000] array, with two indices pointing at 0 and 10. Again, a huge memory waster.
You can try increasing the heap size:
java -Xms<initial heap size> -Xmx<maximum heap size>
Default is
java -Xms32m -Xmx128m
Do you really need to have such a large object stored in memory?
Depending of what you have to do with that data you might want to split it in lesser chunks.
Load the data section by section. This will not let you work on all data at the same time, but you won't have to change the memory provided to the JVM.
You could run your code using a profiler to understand how and why the memory is being eaten up. Debug your way through the loop and watch what is being instantiated. There are any number of them; JProfiler, Java Memory Profiler, see the list of profilers here, and so forth.
Maybe optimize your data classes? I've seen a case someone has been using Strings in place of native datatypes such as int or double for every class member that gave an OutOfMemoryError when storing a relatively small amount of data objects in memory. Take a look that you aren't duplicating your objects. And, of course, increase the heap size:
java -Xmx512M (or whatever you deem necessary)
Let your program use more memory or much better rethink the strategy. Do you really need so much data in the memory?
I know you are trying to read the data into vector - otherwise, if you where trying to display them, I would have suggested you use NatTable. It is designed for reading huge amount of data into a table.
I believe it might come in handy for another reader here.
Use a memory mapped file. Memory mapped files can basically grow as big as you want, without hitting the heap. It does require that you encode your data in a decoding-friendly way. (Like, it would make sense to reserve a fixed size for every row in your data, in order to quickly skip a number of rows.)
Preon allows you deal with that easily. It's a framework that aims to do to binary encoded data what Hibernate has done for relational databases, and JAXB/XStream/XmlBeans to XML.

Advice on handling large data volumes

So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once.
Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading.
Should I load everything into memory all at once?
If not, is opening what's a good way of loading the data partially?
What are some Java-relevant efficiency tips?
So then what if the processing requires jumping around in the data for multiple files and multiple buffers? Is constant opening and closing of binary files going to become expensive?
I'm a big fan of 'memory mapped i/o', aka 'direct byte buffers'. In Java they are called Mapped Byte Buffers are are part of java.nio. (Basically, this mechanism uses the OS's virtual memory paging system to 'map' your files and present them programmatically as byte buffers. The OS will manage moving the bytes to/from disk and memory auto-magically and very quickly.
I suggest this approach because a) it works for me, and b) it will let you focus on your algorithm and let the JVM, OS and hardware deal with the performance optimization. All to frequently, they know what is best more so than us lowly programmers. ;)
How would you use MBBs in your context? Just create an MBB for each of your files and read them as you see fit. You will only need to store your results. .
BTW: How much data are you dealing with, in GB? If it is more than 3-4GB, then this won't work for you on a 32-bit machine as the MBB implementation is defendant on the addressable memory space by the platform architecture. A 64-bit machine & OS will take you to 1TB or 128TB of mappable data.
If you are thinking about performance, then know Kirk Pepperdine (a somewhat famous Java performance guru.) He is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details: NIO Performance Tips and other Java performance related things.
You might want to have a look at the entries in the Wide Finder Project (do a google search for "wide finder" java).
The Wide finder involves reading over lots of lines in log files, so look at the Java implementations and see what worked and didn't work there.
You could convert to binary, but then you have 1+ something copies of the data, if you need to keep the original around.
It may be practical to build some kind of index on top of your original ascii data, so that if you need to go through the data again you can do it faster in subsequent times.
To answer your questions in order:
Should I load everything into memory all at once?
Not if don't have to. for some files, you may be able to, but if you're just processing sequentially, just do some kind of buffered read through the things one by one, storing whatever you need along the way.
If not, is opening what's a good way of loading the data partially?
BufferedReaders/etc is simplest, although you could look deeper into FileChannel/etc to use memorymapped I/O to go through windows of the data at a time.
What are some Java-relevant efficiency tips?
That really depends on what you're doing with the data itself!
Without any additional insight into what kind of processing is going on, here are some general thoughts from when I have done similar work.
Write a prototype of your application (maybe even "one to throw away") that performs some arbitrary operation on your data set. See how fast it goes. If the simplest, most naive thing you can think of is acceptably fast, no worries!
If the naive approach does not work, consider pre-processing the data so that subsequent runs will run in an acceptable length of time. You mention having to "jump around" in the data set quite a bit. Is there any way to pre-process that out? Or, one pre-processing step can be to generate even more data - index data - that provides byte-accurate location information about critical, necessary sections of your data set. Then, your main processing run can utilize this information to jump straight to the necessary data.
So, to summarize, my approach would be to try something simple right now and see what the performance looks like. Maybe it will be fine. Otherwise, look into processing the data in multiple steps, saving the most expensive operations for infrequent pre-processing.
Don't "load everything into memory". Just perform file accesses and let the operating system's disk page cache decide when you get to actually pull things directly out of memory.
This depends a lot on the data in the file. Big mainframes have been doing sequential data processing for a long time but they don't normally use random access for the data. They just pull it in a line at a time and process that much before continuing.
For random access it is often best to build objects with caching wrappers which know where in the file the data they need to construct is. When needed they read that data in and construct themselves. This way when memory is tight you can just start killing stuff off without worrying too much about not being able to get it back later.
You really haven't given us enough info to help you. Do you need to load each file in its entiretly in order to process it? Or can you process it line by line?
Loading an entire file at a time is likely to result in poor performance even for files that aren't terribly large. Your best bet is to define a buffer size that works for you and read/process the data a buffer at a time.
I've found Informatica to be an exceptionally useful data processing tool. The good news is that the more recent versions even allow Java transformations. If you're dealing with terabytes of data, it might be time to pony up for the best-of-breed ETL tools.
I'm assuming you want to do something with the results of the processing here, like store it somewhere.
If your numerical data is regularly sampled and you need to do random access consider to store them in a quadtree.
I recommend strongly leveraging Regular Expressions and looking into the "new" IO nio package for faster input. Then it should go as quickly as you can realistically expect Gigabytes of data to go.
If at all possible, get the data into a database. Then you can leverage all the indexing, caching, memory pinning, and other functionality available to you there.
If you need to access the data more than once, load it into a database. Most databases have some sort of bulk loading utility. If the data can all fit in memory, and you don't need to keep it around or access it that often, you can probably write something simple in Perl or your favorite scripting language.

Categories

Resources