Efficient recall of a delta-based data log in Java


My application has a number of objects in an internal list, and I need to be able to log them (e.g. once a second) and later recreate the state of the list at any time by querying the log file.
The current implementation logs the entire list every second, which is great for retrieval because I can simply load the log file, scan through it until I reach the desired time, and load the stored list.
However, the majority of my objects (~90%) rarely change, so it is wasteful in terms of disk space to continually log them at a set interval.
I am considering switching to a "delta" based log where only the changed objects are logged every second. Unfortunately this means it becomes hard to find the true state of the list at any one recorded time, without "playing back" the entire file to catch those objects that had not changed for a while before the desired recall time.
An alternative could be to store (every second) both the changed objects and the last-changed time for each unchanged object, so that a log reader would know where to look for them. I'm worried I'm reinventing the wheel here though — this must be a problem that has been encountered before.
Existing comparable techniques, I suppose, are those used in version control systems, but I'd like a native object-aware Java solution if possible — running git commit on a binary file once a second seems like it's abusing the intention of a VCS!
So, is there a standard way of solving this problem that I should be aware of? If not, any pitfalls that I might encounter when developing my own solution?
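To make the trade-off concrete, here is a minimal in-memory sketch of a delta log. It is only an illustration under stated assumptions: the class name DeltaLog, the String ids and values, and the snapshotEvery parameter are all hypothetical. The periodic full snapshot is an addition that the question does not mention; it is one common way to bound how far back a reader has to replay.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

/** Hypothetical delta log mapping object id -> value (non-null Strings for brevity). */
class DeltaLog {
    /** One record: either a full snapshot or only the entries changed since the previous record. */
    private static final class Entry {
        final boolean snapshot;
        final Map<String, String> data; // in a delta, a null value marks a removed id

        Entry(boolean snapshot, Map<String, String> data) {
            this.snapshot = snapshot;
            this.data = data;
        }
    }

    private final TreeMap<Long, Entry> entries = new TreeMap<>();
    private final int snapshotEvery; // assumed tuning knob: deltas written between two snapshots
    private Map<String, String> last = new HashMap<>();
    private int sinceSnapshot = 0;

    DeltaLog(int snapshotEvery) {
        this.snapshotEvery = snapshotEvery;
    }

    /** Logs the state at the given time, storing only the differences when possible. */
    void record(long time, Map<String, String> current) {
        if (entries.isEmpty() || sinceSnapshot >= snapshotEvery) {
            entries.put(time, new Entry(true, new HashMap<>(current)));
            sinceSnapshot = 0;
        } else {
            Map<String, String> delta = new HashMap<>();
            for (Map.Entry<String, String> e : current.entrySet()) {
                if (!Objects.equals(e.getValue(), last.get(e.getKey()))) {
                    delta.put(e.getKey(), e.getValue()); // new or changed object
                }
            }
            for (String id : last.keySet()) {
                if (!current.containsKey(id)) {
                    delta.put(id, null); // object was removed
                }
            }
            entries.put(time, new Entry(false, delta));
            sinceSnapshot++;
        }
        last = new HashMap<>(current);
    }

    /** Rebuilds the state as of the given time: nearest earlier snapshot plus the deltas after it. */
    Map<String, String> stateAt(long time) {
        Long start = null;
        for (Map.Entry<Long, Entry> e : entries.headMap(time, true).descendingMap().entrySet()) {
            if (e.getValue().snapshot) {
                start = e.getKey();
                break;
            }
        }
        Map<String, String> state = new HashMap<>();
        if (start == null) {
            return state; // nothing was logged at or before this time
        }
        for (Entry e : entries.subMap(start, true, time, true).values()) {
            if (e.snapshot) {
                state = new HashMap<>(e.data);
            } else {
                for (Map.Entry<String, String> d : e.data.entrySet()) {
                    if (d.getValue() == null) {
                        state.remove(d.getKey());
                    } else {
                        state.put(d.getKey(), d.getValue());
                    }
                }
            }
        }
        return state;
    }
}
```

With snapshotEvery set to, say, 60 and one record per second, a reader would replay at most a minute of deltas, at the cost of one full copy of the list per minute on disk. A real log would write the entries to a file rather than keep them in a TreeMap.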

Related

Abusing Java Arrays?

I'm developing a software package which makes heavy use of arrays (ArrayLists). Instructions to be processed are put into an array queue, then deleted from the array once used. The same goes for drawing on a plot: data is placed into an array queue, which is read to plot the data, and the oldest data is eventually deleted as new data comes in. We are talking about thousands of instructions over an hour and, at any time, maybe 200,000 points plotted, with the array continually growing and shrinking.
After some time, the software begins to slow down and instructions are processed more slowly. Nothing really changes in what is being processed; that is, the system is stable in terms of how much data is plotted and which instructions are being processed, just working off similar incoming data time after time.
Is there some memory issue going on with the "abuse" of the variable-sized (not a defined size, add/delete as needed) arrays/queues that could be causing eventual slowing?
Is there a better way than the String ArrayList to act as a queue?
Thanks!

Yes, you are most likely using the wrong data structure for the job. An ArrayList is a list with a backing array, so get() is fast, but removing an element from the front shifts every remaining element, which makes it a poor queue.
The Java runtime library has a very rich set of data structures, so you can get a well-written and debugged implementation with the characteristics you need out of the box. You most likely should be using one or more Queues instead.
My guess is that you forget to null out values in your ArrayList, so the JVM has to keep all of them around. This is a memory leak.
To confirm, use a profiler to see where your memory and CPU go. VisualVM is a nice standalone one. NetBeans includes one.
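As a rough sketch of what "use a Queue instead" could look like for the plot data, here is a bounded buffer built on java.util.ArrayDeque. The class name PlotBuffer and the capacity parameter are made up for illustration; the point is only that adding at the tail and removing at the head of an ArrayDeque does not shift the remaining elements.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical bounded buffer for plot points: the oldest point is dropped when full. */
class PlotBuffer {
    private final Deque<Double> points = new ArrayDeque<>();
    private final int capacity; // assumed upper bound, e.g. 200_000 points

    PlotBuffer(int capacity) {
        this.capacity = capacity;
    }

    /** Appends a point, evicting the oldest one first if the buffer is full. */
    void add(double point) {
        if (points.size() == capacity) {
            points.pollFirst(); // removes the head without shifting other elements
        }
        points.addLast(point);
    }

    int size() {
        return points.size();
    }

    /** Returns the oldest point still held, or null if the buffer is empty. */
    Double oldest() {
        return points.peekFirst();
    }
}
```

Because the capacity is fixed, the buffer cannot grow without limit, which also rules out the kind of slow accumulation a profiler would otherwise have to find.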

The use of VisualVM helped. It showed a heavy use of a "message" form that I was dumping incoming data to and forgot existed, so it was dealing with a million characters when the sluggishness became apparent, because I never limited its size.

I want to preserve my data during service restart, but my data is not in simple variable name-value or table format. How should I go about this?

I want to preserve data across service restarts. The service uses an ArrayList of ArrayLists of integers, plus some other variables.
Since it is about 40-60 MB, I don't want it to be generated each time the service restarts (it takes a lot of time); I want to generate the data once, and maybe copy it for the next service restart.
How can it be done?
Before suggesting writing the data to a file, please consider how I would go about putting a data structure similar to a multidimensional array (3D or above) into a file; once written, it will likely take significant time to read too.

You can try writing your data after generation to a file. Then on next service restart, you can simply read that from the file.
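One way this could look, as a sketch rather than a recommendation: ArrayList and Integer are both Serializable, so the nested list can be written and read back in one call with the standard ObjectOutputStream and ObjectInputStream. The class name NestedListStore and the file location are assumptions, and how long a 40-60 MB structure takes to load this way is something you would have to measure.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;

/** Hypothetical helper that persists a nested list using Java's built-in serialization. */
class NestedListStore {

    /** Writes the whole structure to the given file, replacing any previous content. */
    static void save(File file, ArrayList<ArrayList<Integer>> data) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(data);
        }
    }

    /** Reads the structure back; the cast is unchecked because generics are erased at runtime. */
    @SuppressWarnings("unchecked")
    static ArrayList<ArrayList<Integer>> load(File file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (ArrayList<ArrayList<Integer>>) in.readObject();
        }
    }
}
```

On restart, the service would call load if the file exists and fall back to regenerating the data (followed by save) if it does not.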

If you need persistent data, then put it into a database:
https://developer.android.com/guide/topics/data/data-storage
or try some object database like http://objectbox.io/

So you're afraid reading from the file would take a long time due to its size and the number and size of the rows (the inner arrays).
I think it might be worth stopping for a minute and asking yourself whether you need all this data at once. Maybe you only need a portion of it at any given time, and there are scenarios in which you don't use some (or maybe most) of the data? If this is likely, I would suggest that you compute the data on demand, when required, and only keep a memory-based cache for future demand in the current session.
Otherwise, if you do need all the data at a given time, you have a trade-off between size on disk and processing time. You can shrink the data using some compression algorithm, but it would be at the expense of processing time. On the other hand, you can just serialize your data object and save it to disk as is: less time, more disk space.
Another solution for your scenario could be to just use a DB and a cursor (Room on top of SQLite). I don't know exactly what you're trying to do, but your arrays can easily be modeled in a DB. Model a single row as you'd like and add the outer index of the array to that model. Then save the models into the DB, potentially making the outer index field the primary key of the table.
Regardless of the things I wrote, think about whether you really need this data persisted on your client; maybe you can store it on the server side? If so, there are other storage and access solutions which are not available on the Android client side.

Thank you all for answering this question.
This is what I have finally settled for:
Instead of using the structure as part of the app, I made this into a tool, which will prepare data to be used with the main app. In doing so, it also removed the concern regarding service restart.
This tool will first read all the strings from the input file(s).
Then it puts all of them into the structure one at a time. (This is the part I was having doubts about, and asked the question about. Since all the data is in the structure here, as soon as the program terminates, this structured data is unusable.)
Now, I prepared another structure for putting this data into a file, and put all this data into the file so that I do not need to read all the input files again and again, but only a few lines.
Then I thought, why spend time "read"ing files while I can hard code it into my app. So, as the final step of this preprocessing tool, I made it into a class which has switch(input){case X: return Y}.
Now I will just have to put this class into the app I wanted to make.
I know this all sounds very abstract, even stretching the concept of abstract; if you want to know details, please let me know. I am also including a link to my "tool". Please visit it and let me know if there would have been some better way.
P.S. There could still be errors in this tool; if you find any, let me know so I can fix them.
P.P.S.
link: Kompressor Tool

Searching strategy to efficiently load data from service or server?

This is not really a language-specific question; it's more of a pattern-related one, but I would like to tag it with some popular languages that I can understand.
I'm not very experienced with efficiently loading data in combination with searching it (especially in a mobile environment).
The strategy I used before is to load everything into local memory and search from there (such as using LINQ in C#).
Another strategy is to reload the data every time a new search is executed. Doing this is of course not efficient, and we may also need to do some more complicated things to sync the newly loaded data with the existing data (already loaded into local memory).
The last strategy I can think of is the hardest one to implement: lazily loading the data together with executing the search. That is, when a search is executed, the returned result should be cached locally. The search should look in local memory first before fetching new results from the service/server. So the result of each search is a combination of the local search and the server search. The purpose here is to reduce the amount of data being reloaded from the server every time a search is run.
Here is what I can think of to implement this kind of strategy:
1. When a search is run, look in local memory first. Finishing this step gives the local result.
2. Before sending the search request to the server, we need to somehow pass along what is already in the (local) result, to exclude it from the server-side result. So the search method may include a list of arguments containing all the item IDs found in the first step.
3. With that search request, the server can exclude the already found results and return only new items to the client.
4. The last step is to merge the two results, local and server, into the final search result before showing it to the user in the UI.
I'm not sure if this is the right approach, but what doesn't feel right to me is step 2. We need to send the list of item IDs found in step 1 to the server, so if we have hundreds or thousands of such IDs, sending them to the server may not be very efficient. Also, the query to exclude such a large number of items may not be efficient either (even using direct SQL or LINQ). I'm still confused at this point.
Finally if you have any better idea and importantly implemented in some production project, please share with me. I don't need any concrete example code, I just need some idea or steps to implement.

Too long for a comment....
Concerning step 2, you know you can run into many problems:
Amount of data
Over time, you may accumulate such a huge amount of data that even the set of their IDs gets bigger than a normal server answer. In the end, you could need to cache not only the server's previous answers on the client, but also the client's state on the server. What you're doing is a sort of synchronization, so look at rsync for inspiration; it's an old but smart Unix tool. Also git push might be inspiring.
Basically, by organizing your IDs into a tree, you can easily synchronize the information (about what the client already knows) between the server and the client. The price may be increasing latency as multiple steps may be needed.
Using the knowledge
It's quite possible that excluding the already known objects from the SQL result could be more expensive than not, especially when you can't easily determine if a to-be-excluded object would be a part of the full answer. Still, you can save bandwidth by post-filtering the data.
Being up to date
If your data change or get deleted, you may find your client keeping obsolete data. The client subscribing to relevant changes is one possibility; associating a (logical) timestamp with your IDs is another one.
Summary
It can get pretty complicated and you should measure before you even try. You may find out that the problem itself is hard enough and that achieving these savings is even harder and the gain limited. You know the root of all evil, right?

I would approach the problem by treating local and remote as two different data sources:
When a search is triggered, it is initiated against both data sources (local in-memory and server).
Most likely the local search will return results first, so display them to the user.
When results are returned from the server, append the non-duplicate results.
Optional: in case server data has changed and some results were removed or changed, update or remove the local results and update the view.
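The merge step described above might be sketched like this. The Item type with an id and a title is hypothetical, as is the choice to let the server's copy replace a local one with the same id (which covers the "changed" half of the optional step; removals would need extra information from the server).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical merge of local and server search results, keyed by item id. */
class SearchMerge {

    /** Minimal stand-in for a search result. */
    static final class Item {
        final String id;
        final String title;

        Item(String id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    /**
     * Keeps local results in the order already shown, replaces any local item
     * that the server also returned with the server's copy, and appends items
     * only the server knows about.
     */
    static List<Item> merge(List<Item> local, List<Item> remote) {
        Map<String, Item> byId = new LinkedHashMap<>();
        for (Item item : local) {
            byId.put(item.id, item);
        }
        for (Item item : remote) {
            byId.put(item.id, item); // an existing key keeps its position, the value is refreshed
        }
        return new ArrayList<>(byId.values());
    }
}
```

Note that this deduplicates on the client, so no list of known IDs has to be sent to the server; the price is that the server may return items the client already has.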

Concurrent Access within a Big InMemory B+Index

I am currently designing around a big in-memory index structure (several gigabytes). The index is actually an R-tree whose leaves are B-trees (don't ask). It supports a special query and pushes it to the logical limit.
Since those nodes are solely search nodes, I ask myself how best to make it parallel.
I know of six solutions so far:
1. Block reads when a write is scheduled. The tree is completely blocked until the last read has finished, then the write is performed, and after the write the tree can again be used for multiple reads. (Reads need no locking.)
2. Clone the nodes to be changed and reuse the existing nodes (including leaves), then switch between both versions by, yet again, briefly stopping reads, switching, and done. Since leaf pointers must also be altered, the leaf pointers might become their own collection, making it possible to switch modifications atomically; changes can be redone on a second version to avoid copying the pointers on each insert.
3. Use independent copies of the index, like double buffering. Update one copy of the index and switch to it. Once no one reads the old index, alter that index in the same way. This way the change can be done without blocking existing reads. If another insert hits the tree within a reasonable amount of time, these changes can also be done.
4. Use a serial shared-nothing architecture so each search thread has its own copy. Since a thread can only alter its tree after a single read is performed, this would also be lock-free and simple. Because reads are spread evenly across the worker threads (each bound to a certain core), the throughput would not be harmed.
5. Use read/write locks for each node about to be written and only block a subtree during a write. This would involve additional operations against the tree, since splitting and merging propagate upwards and therefore require a repass of the insert (expanding locks upwards, towards the parent, would introduce the chance of a deadlock). Since splits and merges are not that frequent if you have a larger page size, this would also be a good way. Actually, my current B-tree implementation uses a similar mechanism: it splits a node and reinserts the value unless no split is needed (which is not optimal but simpler).
6. Use a double buffer for each node, like the shadow cache in databases, where each page is switched between two versions. Every time a node is modified, a copy is modified, and once a read is issued either the old version or the new one is used. Each node gets a version number, and the version closest to the active version (latest change) is chosen. To switch between two versions, one needs only an atomic change to the root information. This way the tree can be altered and used at the same time. This switch can be done at any time, but it must be ensured that no read is using the old version when overwriting it with the new one. This method has the possibility of not interfering with cache locality, in order to link leaves and the like. It also requires twice the amount of memory, since a back buffer must be present, but it saves allocation time and might be good for a high frequency of changes.
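The "independent copies, then switch" idea in the list above can be illustrated with a deliberately simplified sketch. It is not the design from the post: it uses a TreeMap as a stand-in for the index and copies the whole map on every write (plain copy-on-write), which would be far too expensive for a multi-gigabyte structure. The class name SwappableIndex is made up. What it does show is the core mechanism: readers take no lock, and a write becomes visible through a single atomic reference swap.

```java
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical index whose readers never block: writers publish a modified copy atomically. */
class SwappableIndex {
    // The currently published version; it is never mutated after being published.
    private final AtomicReference<TreeMap<Integer, String>> active =
            new AtomicReference<>(new TreeMap<>());
    private final Object writeLock = new Object(); // serializes writers only

    /** Lock-free read against whichever version is published right now. */
    String get(int key) {
        return active.get().get(key);
    }

    /** A stable, read-only view: later writes do not show up in it. */
    SortedMap<Integer, String> snapshot() {
        return Collections.unmodifiableSortedMap(active.get());
    }

    /** Copies the published version, modifies the copy, then swaps it in atomically. */
    void put(int key, String value) {
        synchronized (writeLock) {
            TreeMap<Integer, String> copy = new TreeMap<>(active.get());
            copy.put(key, value);
            active.set(copy); // readers see either the old or the new version, never a mix
        }
    }
}
```

In Java the garbage collector reclaims an old version once the last reader drops it, so there is no need to wait until "no one reads the old index"; a true two-buffer scheme that reuses the old copy would need that wait, or explicit reader counting, back.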
With all those thoughts, what is best? I know it depends, but what is done in the wild? If there are 10 read threads (or even more) being blocked by a single write operation, I guess this is nothing I really want.
Also, how about the L3, L2 and L1 caches, and scenarios with multiple CPUs? Any issues there? The beauty of double buffering is the chance that those reads hitting the old version are still working with the correct cache version.
The variant that creates a fresh copy of a node is not very appealing. So what is found in the wild in today's database landscape?
[update]
On rereading the post, I wonder whether, instead of using write locks for split and merge, it would be better to create replacement nodes. For a split or a merge I need to copy roughly half of the elements around anyway, and those operations are very rare, so copying a node completely and replacing it in the parent node, which is a simple and fast operation, would do the trick. This way the actual blocking of reads would be very limited, and since we create copies anyway, the blocking only happens when the new nodes are swapped in. Since leaves may not be altered during those accesses, it is unimportant, since the information density has not changed. But again, this needs an increment and decrement of a read lock, plus a check for intended write locks, on every access of a node. All of this is overhead, and all of it blocks further reads.
[Update2]
Solution 7. (currently favored)
Currently we favor a double buffer for the internal (non-leaf) nodes and use something similar to row locking.
The logical tables that we try to decompose using this index structure (which is all an index does) result in using set algebra on that information. I noticed that this set algebra is linear (O(m+n) for intersection and union) and gives us the chance to lock each entry that is part of such an operation.
By double buffering the internal nodes (which is not hard to implement, nor does it cost much: less than about 1% memory overhead), we can live problem-free on that issue without blocking too many read operations.
Since we batch modifications in a certain way, it is very rare that a given column is updated, but once it is, it takes more time, since those modifications might go into the thousands for this single entry.
So the goal is to alter the set algebra used so that it simply intersects the columns currently being modified later on. Since only one column is modified at a time, such an operation would only block once. And the write operation has to wait for everyone currently reading the column. And guess what: once a write operation waits, it usually lets another write operation on another column that is not busy take place. We calculate the probability of such a block to be very, very low, so we don't need to care.
The locking mechanism is done using: check for write, check for write intention, add read, check for write again, and proceed with the read. So there is no explicit object locking. We access fixed areas of bytes, and once the structure is clear, everything critical is planned to move into a C++ version to make it somewhat faster (2x, we guess; this takes one person only one or two weeks, especially if you use a Java-to-C++ translator).
The only effect that might now also be important is the caching issue, since this invalidates L1 caches and maybe L2 too. So we plan to collect all modifications on such a table/index and schedule them to run within a timeshare of one or more minutes, evenly distributed so that the system has no performance hiccups.
If you know of anything that helps us please go ahead.

As no one replied, I would like to summarize what we (I) finally did. The structure is now separated. We have an R-tree whose leaves are actually tables. Those tables can even be remote, so we have a means of distribution that is mostly transparent thanks to RMI and proxies.
The rest was easy. The R-tree can advise a table to split, and the result of this split is again a table. The split is done on a single machine and transferred to another if it has to be remote. Merge is almost the same.
This remoting also applies to threads bound to different CPUs, to avoid cache issues.
As for the modification in memory, it is as I already suggested: we duplicated the internal nodes, turned the table 90°, and adapted the set-algebra algorithms to handle locked columns efficiently. The test for a single table is simple and, compared to the thousands of entries per column, not a performance issue after all. Deadlocks are also impossible, since one column is used at a time, so there is only one lock per thread. We are experimenting with processing columns in parallel, which would improve the response time. We are also thinking about binding columns to a given virtual core so there is no locking again, since the column is isolated and, yet again, the modification can be serialized.
This way one can utilize 20 cores and more per CPU and also avoid cache misses.

Writing hundreds of data objects to a Mongo database

I am working on a Minecraft network which has several servers manipulating 'user objects', each of which is just a Mongo document. After a user object is modified, it needs to be written to the database immediately, otherwise it may be overwritten by other servers (which have an older version of the user object), but sometimes hundreds of objects need to be written in a short amount of time (a few seconds). My question is: how can I easily write objects to a MongoDB database without overloading it?
I have been thinking up an idea, but I have no idea if it is relevant:
- Create some sort of queue in another thread; every time a data object needs to be saved into the database, it is put in the queue, and then the 'queue thread' saves the objects one by one at some sort of interval.
Thanks in advance
btw Im using Morphia as framework in Java

"hundreds of objects [...] in a few seconds" doesn't sound like that much. How much can you do at the moment?
The setting most important for the speed of write operations is the WriteConcern. What are you using at the moment and is this the right setting for your project (data safety vs speed)?
If you need to do many write operations at once, you can probably speed up things with bulk operations. They have been added in MongoDB 2.6 and Morphia supports them as well — see this unit test.
I would be very cautious with a queue:
Do you really need it? Depending on your hardware and configuration you should be able to do hundreds or even thousands of write operations per second.
Is async really the best approach for you? The producer of the write operation / message can only assume his change has been applied, but it probably has not and is still waiting in the queue to be written. Is this the intended behaviour?
Does it make your life easier? You need to know another piece of software, which adds many new and most likely unforeseen problems.
If you need to scale your writes, why not use sharding? No additional technology and your code will behave the same with and without it.
You might want to read the following blogpost on why you probably want to avoid queues for this kind of operation in general: http://widgetsandshit.com/teddziuba/2011/02/the-case-against-queues.html
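If you measure and still decide you want the queue from the question, a minimal sketch might combine it with batching, so that each flush can become one bulk operation. Everything here is an assumption for illustration: the class name BatchingWriter, the maxBatch parameter, and above all the sink, which stands in for whatever bulk-write call your driver or Morphia version actually provides. No MongoDB API is used in the sketch itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical write queue that hands objects to a sink in batches. */
class BatchingWriter<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final Consumer<List<T>> sink; // placeholder for one bulk write to the database
    private final int maxBatch;           // assumed cap on objects per bulk write

    BatchingWriter(Consumer<List<T>> sink, int maxBatch) {
        this.sink = sink;
        this.maxBatch = maxBatch;
    }

    /** Called by producers; returns immediately, the object is NOT yet persisted. */
    void submit(T object) {
        queue.add(object);
    }

    /**
     * Drains up to maxBatch queued objects and passes them to the sink in one call.
     * A dedicated writer thread would call this in a loop. Returns the batch size.
     */
    int flushOnce() {
        List<T> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        if (!batch.isEmpty()) {
            sink.accept(batch);
        }
        return batch.size();
    }
}
```

The comment on submit is the catch raised above: between submit and the next flush, other servers can still read, and overwrite with, the stale version of the user object.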
