I'm trying to design a lightweight way to store persistent data in Java. I've already got a very efficient way to serialize POJOs to DataOutputStreams (and back), but I'm trying to think of a good way to ensure that changes to the data in the POJOs gets serialized when necessary.
This is for a client-side app where I'm trying to keep the size of the eventual distributable as low as possible, so I'm reluctant to use anything that would pull-in heavy-weight dependencies. Right now my distributable is almost 10MB, and I don't want it to get much bigger.
I've considered DB4O but its too heavy - I need something light. Really its probably more a design pattern I need, rather than a library.
Any ideas?
The 'lightest weight' persistence option will almost surely be simply marking some classes Serializable and reading/writing from some fixed location. Are you trying to accomplish something more complex than this? If so, it's time to bundle hsqldb and use an ORM.
If your users are tech savvy, or you're just worried about initial payload, there are libraries which can pull dependencies at runtime, such as Grape.
If you already have a compact data output format in bytes (which I assume you have if you can persist efficiently to a DataOutputStream) then an efficient and general technique is to use run-length-encoding on the difference between the previous byte array output and the new byte array output.
Points to note:
If the object has not changed, the difference in byte arrays will be an array of zeros and hence will compress very small....
For the first time you serialize the object, consider the previous output to be all zeros so that you communicate a complete set of data
You probably want to be a bit clever when the object has variable-sized substructures....
You can also try zipping the difference rather than RLE - might be more efficient in some cases where you have a large object graph with a lot of changes
Related
I am coding a game using LibGdx. I created a level editor where you can place objects for the level. All the objects are put into a list and then the whole list is serialized by writing out the list to an object output stream. I then read in the list of objects in my game and copy the list over to my list of objects in the current level being played. On the actual user's device who is playing the game, there will be 20+ serialized level files. They will only be deserialized at this point. Is this an efficient way to do this in regards to memory and performance? Could those files take up a good chunk of memory? I noticed people use xml or Json for what I am doing. Should I be worried about having any issues the way I have done my level loading? Thanks. Let me know if my question isn't clear.
When we were looking for slowdowns in our code we profiled it running and found a huge amount of time being spent on one thing. When we looked it turned out to be a section where an object was being duplicated by serializing it to a string and back. For some reason, the default java implementation of serialization is SLOW.
The other problem with serialization is that it's black box--if your file gets corrupted or you mess up your object enough during an upgrade you can lose everything.
Did you consider ORM and some kind of easy database? There are databases that are compiled into your code (invisible to your user) and it's pretty much just db.save(anyObject)... very easy to use.
for a similiar question I did a short research (because I couldn't believe object serialization is slow) and i would recommend you to use JSON, as it's faster. Memory usage in terms of RAM will be the same (as soon as your object is deserialized). On disk you might want to zip it.
according to those benchmarks, jackson is faster than java serialization:
benchmark
deeplink
another advantage is the human readability of your json files.
In Java, I know that if you are going to build a B-Tree index on Hard Disk, you probably should use serialisation were the B-Tree structure has to be written from RAM to HD. My question is, if later I'd like to query the value of a key out of the index, is it possible to deserialise just part of the B-Tree back to RAM? Ideally, only retrieving the value of a specific key. Fetching the whole index to RAM is a bad design, at least where the B-Tree is larger than the RAM size.
If this is possible, it'd be great if someone provides some code. How DBMSs are doing this, either in Java or C?
Thanks in advance.
you probably should use serialisation were the B-Tree structure has to be written from RAM to HD
Absolutely not. Serialization is the last technique to use when implementing a disk-based B-tree. You have to be able to read individual nodes into memory, add/remove keys, change pointers, etc, and put them back. You also want the file to be readable by other languages. You should define a language-independent representation of a B-tree node. It's not difficult. You don't need anything beyond what RandomAccessFile provides.
You generally split the B-tree into several "pages," each with some of they key-value pairs, etc. Then you only need to load one page into memory at a time.
For inspiration of how rdbms are doing it, it's probably a good idea to check the source code of the embedded Java databases: Derby, HyperSql, H2, ...
And if those databases solve your problem, I'd rather forget about implementing indices and use their product right away. Because they're embedded, there is no need to set up a server. - the rdbms code is part of the application's classpath - and the memory footprint is modest.
IF that is a possibility for you of course...
If the tree can easily fit into memory, I'd strongly advise to keep it there. The difference in performance will be huge. Not to mention the difficulties to keep changes in sync on disk, reorganizing, etc...
When at some point you'll need to store it, check Externalizable instead of the regular serialization. Serializing is notoriously slow and extensive. While Externalizable allows you to control each byte being written to disk. Not to mention the difference in performance when reading the index back into memory.
If the tree is too big to fit into memory, you'll have to use RandomAccessFile with some kind of memory caching. Such that often accessed items come out of memory nonetheless. But then you'll need to take updates to the index into account. You'll have to flush them to disk at some point.
So, personally, I'd rather not do this from scratch. But rather use the code that's out there. :-)
I have uploaded a Java Game Server to github. I would like to provide the following functionality to users. When the game state changes, only transmit the delta to the connected game clients, thereby reducing network load.
I have the below idea to do it.... which is pretty dump as far as I can see.
1) Serialize object before modification
2) Serialize object after modification
3) Convert both to String and find diff (not sure how, but sure some libraries will be there to do that)
4) Transmit diff to interested clients.
How are these kind of requirements normally handled in enterprise?
It would be simpler to produce the delta first and serialize that. You don't need serialization at all to produce the delta. You could get a long way with it just using the Bean Introspector on object properties, if your objects are bean-ish enough.
I would use a library like kryo (https://code.google.com/p/kryo/wiki/V1Documentation) or Sqisher java-object-diff from Daniel Bechler. The latter is suitable for Beans, that is, you require get and set methods for each variable. kryo is more flexible and very fast (https://github.com/eishay/jvm-serializers/wiki). Search for kryo and Delta to find more information. To make use of the delta functionality you have to use kryo version 1 and kryonet.
Well as for diff a couple options exists; a few are pointed out here:
How to perform string Diffs in Java?
Another way might be to serialize the object to XML and use a XML diff tool to produce the delta. XML as the advantage of offering a structure where your binary serialized instances won't. However you should make sure to compress your messages to minizime traffic if you use this strategy.
You might want to investigate badiff for this. badiff is a pure-java binary differ with an emphasis on small diffs and parallelization. See the website at http://badiff.org/ .
Since the serialized form of an object is deterministic, your approach should work. Just serialize the object to a ByteArrayInputStream, do your modifications, serialize it to another ByteArrayInputStream, and use badiff to compute the diff.
This is not an appropriate solution for gaming applications where diffs might arrive out of order. In those cases, if you still want to send diffs, consider sending "key" serialized objects every now and then, which you can compute diffs from, such that keys will always be in order.
I'm currently working on a Part of an Application where "a lot" of data must be selected for further work and I have the impression that the I/O is limiting and not the following work.
My idea is now to have all these objects in memory but serialized an compressed. The question is, if accessing the objects like this would be faster than direct Database access and if it is a good idea or not. (and if it is feasble in terms of memory consumption = serialized form uses less memory than normal object)
EDIT February 2011:
The creation of the objects is the slow part and not the database access itself. Having all in memory is not possible and using ehcache option to "overflow to disk" is actually slower than just getting the data from the database. Standard java serialization is also unusable. it is also a lot slower. So basically nothing I can do about it...
You're basically looking for an in-memory cache or an in-memory datagrid. There are plenty of APIs/products for this sort of thing. ehcache/hibernate chace/gridgain etc etc
The compressed serialized form will use less memory, if it is a large object. However for smaller objects e.g. which use primtives. The original object will be much smaller.
I would first check whether you really need to do this. e.g. Can you just consume more memory? or restructure your objects so they use less memory.
"I have the impression that the I/O is limiting and not the following work. " -> I would be very sure of this before starting implementing such a thing.
The simpler approach I can suggest you is to use ehcache with the option to store on disk when the size of the cache get too big.
Another completely different approach could be using some doc based nosql db like couchdb to store objects selected "for further work"
So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once.
Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading.
Should I load everything into memory all at once?
If not, is opening what's a good way of loading the data partially?
What are some Java-relevant efficiency tips?
So then what if the processing requires jumping around in the data for multiple files and multiple buffers? Is constant opening and closing of binary files going to become expensive?
I'm a big fan of 'memory mapped i/o', aka 'direct byte buffers'. In Java they are called Mapped Byte Buffers are are part of java.nio. (Basically, this mechanism uses the OS's virtual memory paging system to 'map' your files and present them programmatically as byte buffers. The OS will manage moving the bytes to/from disk and memory auto-magically and very quickly.
I suggest this approach because a) it works for me, and b) it will let you focus on your algorithm and let the JVM, OS and hardware deal with the performance optimization. All to frequently, they know what is best more so than us lowly programmers. ;)
How would you use MBBs in your context? Just create an MBB for each of your files and read them as you see fit. You will only need to store your results. .
BTW: How much data are you dealing with, in GB? If it is more than 3-4GB, then this won't work for you on a 32-bit machine as the MBB implementation is defendant on the addressable memory space by the platform architecture. A 64-bit machine & OS will take you to 1TB or 128TB of mappable data.
If you are thinking about performance, then know Kirk Pepperdine (a somewhat famous Java performance guru.) He is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details: NIO Performance Tips and other Java performance related things.
You might want to have a look at the entries in the Wide Finder Project (do a google search for "wide finder" java).
The Wide finder involves reading over lots of lines in log files, so look at the Java implementations and see what worked and didn't work there.
You could convert to binary, but then you have 1+ something copies of the data, if you need to keep the original around.
It may be practical to build some kind of index on top of your original ascii data, so that if you need to go through the data again you can do it faster in subsequent times.
To answer your questions in order:
Should I load everything into memory all at once?
Not if don't have to. for some files, you may be able to, but if you're just processing sequentially, just do some kind of buffered read through the things one by one, storing whatever you need along the way.
If not, is opening what's a good way of loading the data partially?
BufferedReaders/etc is simplest, although you could look deeper into FileChannel/etc to use memorymapped I/O to go through windows of the data at a time.
What are some Java-relevant efficiency tips?
That really depends on what you're doing with the data itself!
Without any additional insight into what kind of processing is going on, here are some general thoughts from when I have done similar work.
Write a prototype of your application (maybe even "one to throw away") that performs some arbitrary operation on your data set. See how fast it goes. If the simplest, most naive thing you can think of is acceptably fast, no worries!
If the naive approach does not work, consider pre-processing the data so that subsequent runs will run in an acceptable length of time. You mention having to "jump around" in the data set quite a bit. Is there any way to pre-process that out? Or, one pre-processing step can be to generate even more data - index data - that provides byte-accurate location information about critical, necessary sections of your data set. Then, your main processing run can utilize this information to jump straight to the necessary data.
So, to summarize, my approach would be to try something simple right now and see what the performance looks like. Maybe it will be fine. Otherwise, look into processing the data in multiple steps, saving the most expensive operations for infrequent pre-processing.
Don't "load everything into memory". Just perform file accesses and let the operating system's disk page cache decide when you get to actually pull things directly out of memory.
This depends a lot on the data in the file. Big mainframes have been doing sequential data processing for a long time but they don't normally use random access for the data. They just pull it in a line at a time and process that much before continuing.
For random access it is often best to build objects with caching wrappers which know where in the file the data they need to construct is. When needed they read that data in and construct themselves. This way when memory is tight you can just start killing stuff off without worrying too much about not being able to get it back later.
You really haven't given us enough info to help you. Do you need to load each file in its entiretly in order to process it? Or can you process it line by line?
Loading an entire file at a time is likely to result in poor performance even for files that aren't terribly large. Your best bet is to define a buffer size that works for you and read/process the data a buffer at a time.
I've found Informatica to be an exceptionally useful data processing tool. The good news is that the more recent versions even allow Java transformations. If you're dealing with terabytes of data, it might be time to pony up for the best-of-breed ETL tools.
I'm assuming you want to do something with the results of the processing here, like store it somewhere.
If your numerical data is regularly sampled and you need to do random access consider to store them in a quadtree.
I recommend strongly leveraging Regular Expressions and looking into the "new" IO nio package for faster input. Then it should go as quickly as you can realistically expect Gigabytes of data to go.
If at all possible, get the data into a database. Then you can leverage all the indexing, caching, memory pinning, and other functionality available to you there.
If you need to access the data more than once, load it into a database. Most databases have some sort of bulk loading utility. If the data can all fit in memory, and you don't need to keep it around or access it that often, you can probably write something simple in Perl or your favorite scripting language.