Java: versioned data structures?

I have a data structure that is pretty simple (basically a structure containing some arrays and single values), but I need to record the history of the data structure so that I can efficiently get the contents of the data structure at any point in time.
Is there a relatively straightforward way to do this?
The best way I can think of would be to encapsulate the whole data structure in something that handles all the mutating operations by storing data in functional data structures, and then, for each mutation operation, caching a copy of the data structure in a Map indexed by time-ordering (e.g. a TreeMap with real time as keys, or a HashMap keyed by a counter of mutation operations, combined with one or more indexes stored in TreeMaps mapping real time / tick count / etc. to mutation operations).
Any suggestions?
Edit: In one case I already have the history as a series of transactions (this is reading items from a data file), so I can replay them, but this takes O(n) steps (n = number of transactions) every time I need to access the data. I'm looking for alternatives.

You are correct. Storing the data in a purely functional data structure is the way to go. Supporting anything moderately complicated using do/undo actions relies on the programmer being aware of all side effects of every operation, which does not scale and breaks encapsulation.

You should use some form of persistent data structure that is immutable and is based on structural sharing (i.e. so that the parts of the data structure which do not change between versions only get stored once).
I created an open source Java library of such data structures here:
http://code.google.com/p/mikeralib/source/browse/#svn/trunk/Mikera/src/mikera/persistent
These were somewhat inspired by Clojure's persistent data structures, which might also be suitable for your purposes (they are also written in Java).
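For illustration only (this is not code from the library above), a minimal persistent cons list shows the structural-sharing idea: prepending allocates one new node and reuses the entire existing list as its tail, so every older version stays valid and cheap to retain.

// Minimal sketch of a persistent (immutable) list with structural sharing.
// Illustrative only; real libraries provide vectors, maps, bulk operations, etc.
final class PList<T> {
    final T head;        // element, unused in the empty sentinel
    final PList<T> tail; // shared with all derived versions, never mutated
    private PList(T head, PList<T> tail) { this.head = head; this.tail = tail; }

    static <T> PList<T> empty() { return new PList<>(null, null); }
    boolean isEmpty() { return tail == null; }

    // O(1): the new version shares this entire list as its tail.
    PList<T> cons(T value) { return new PList<>(value, this); }
}

Here v1 = PList.<String>empty().cons("a") and v2 = v1.cons("b") are two versions; v1 remains fully usable, and all of its nodes are shared by v2.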

If you are only storing a little bit of data and don't have a lot of changes then storing each version is fine.
If you don't need to access the old version of the data too often, I wouldn't cache each one, I'd just make it so you could rebuild to it.
You could do this by saving mutations as transactions and replaying the transactions (with the ability to stop at any point).
So you start with an empty data structure and you might get an "Add" instruction followed by a "Change" and another "add" and then maybe a "Delete". Each of these objects would contain a COPY (not a pointer to the same object) of the thing being added or changed.
You would append each operation to a list while at the same time mutating your collection.
If you find that you need a version at an older timestamp, start with a new empty collection, replay until you hit that timestamp then stop and you have the collection as it would be at that time.
If this was a very long-running application and you often needed to access items near the end, you could write an "Undo" for each add/change/delete operation object and actually mutate the data back and forth.
So imagine you have your data object and this array of mutations, you could easily run up and down the mutation list changing the data object back and forth to any version you want.
You could even maintain multiple data objects: just create a new empty one and run it up the mutation array (think of it as a timeline, where each stored mutation contains a timestamp or version number) until you get it to the timestamp you want. This way you could have "milestones" that you can reach instantly; for instance, if you allocated one milestone per thread, you could make the addMutation method synchronized and this data collection would become thread-safe.
Note that if you actually return the data object you should only return a copy of the data--otherwise the next time you mutated that milestone it would mutate the data object you returned.
Hmm, you could also include "Rollup" functionality--if you ever decide that you will not need access to the tail (the first few transactions) you could apply them to a "Start" structure and then delete them--from then on you copy the start structure to begin from the start rather than always starting with an empty data structure.
Man, this is an awesome pattern--now I want to implement it.
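A bare-bones sketch of that pattern (all names here are hypothetical): each mutation is an object that can apply itself to the collection, the log is the timeline, and any version is rebuilt by replaying a prefix of the log.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of the mutation-log / replay pattern described above.
interface Mutation<S> {
    void apply(S state); // mutate the live collection in place
}

class VersionedState<S> {
    private final List<Mutation<S>> log = new ArrayList<>();
    private final Supplier<S> emptyFactory;
    private final S live;

    VersionedState(Supplier<S> emptyFactory) {
        this.emptyFactory = emptyFactory;
        this.live = emptyFactory.get();
    }

    synchronized void addMutation(Mutation<S> m) {
        m.apply(live); // keep the live copy current
        log.add(m);    // remember it for replay
    }

    // Rebuild the state as it was after the first `version` mutations.
    synchronized S stateAt(int version) {
        S s = emptyFactory.get();
        for (int i = 0; i < Math.min(version, log.size()); i++) {
            log.get(i).apply(s);
        }
        return s;
    }
}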

Either do as you already suggested, or have a base class of some sort with subclasses that represent the different changes. Then get the proper class at run-time by passing the version/timestamp/whatever to a factory that hands you back the right one.

Multi-level undo can be based on a model (ie data structure) and a sequence of actions. Each action supports two operations: "do" and "undo". To perform a change on the model you register a new action and "do" it. This allows you to "walk" back and forth in the history, but the state of the model at a specific index cannot be accessed in constant time.
Maybe something like this would be applicable to your situation?
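As a hedged sketch of that model (interface and class names are invented here): each action carries both directions, and a cursor walks the history back and forth; reaching an arbitrary index still costs a walk, as noted above.

import java.util.ArrayList;
import java.util.List;

// Illustrative multi-level undo; not from any particular library.
interface Action<M> {
    void doIt(M model);
    void undoIt(M model);
}

class History<M> {
    private final List<Action<M>> actions = new ArrayList<>();
    private int cursor = 0; // number of actions currently applied to the model

    void perform(M model, Action<M> a) {
        actions.subList(cursor, actions.size()).clear(); // discard any redo tail
        a.doIt(model);
        actions.add(a);
        cursor++;
    }

    void undo(M model) { if (cursor > 0) actions.get(--cursor).undoIt(model); }
    void redo(M model) { if (cursor < actions.size()) actions.get(cursor++).doIt(model); }
}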

How long will the application be running for?
It seems like you could do what you suggested -- playing the transactions back -- but cache the data structure and list of transactions at particular points in time (every hour or every day?) to ease the pain of having to go through O(n) operations every time you need to rebuild the collection from scratch.
Granted, there is definitely a trade-off between space (that the cache takes up) and the number of operations needed to re-build it, but hopefully you will be able to find a happy medium for it.
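A sketch of that trade-off (hypothetical code, reusing the Mutation interface from the earlier sketch): periodic snapshots live in a TreeMap, and rebuilding a version replays only the transactions after the nearest earlier snapshot.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// Hypothetical checkpoint cache: snapshot every K mutations, replay the remainder.
class CheckpointedHistory<S> {
    private final TreeMap<Integer, S> checkpoints = new TreeMap<>();

    void checkpoint(int version, S snapshot) {
        checkpoints.put(version, snapshot); // snapshot must be a deep copy
    }

    S stateAt(int version, List<Mutation<S>> log,
              Supplier<S> empty, UnaryOperator<S> deepCopy) {
        Map.Entry<Integer, S> cp = checkpoints.floorEntry(version);
        S s = (cp == null) ? empty.get() : deepCopy.apply(cp.getValue());
        for (int i = (cp == null) ? 0 : cp.getKey(); i < version; i++) {
            log.get(i).apply(s);
        }
        return s;
    }
}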

Related

Database or ObjectOutputStream? Object-specific member or actual object for a reference?

I'm working on an application for a pharmacy. Basically this application has a class "item" and another class "selling invoices" which logs selling processes.
So my question: if the pharmacy is expected to have about ten thousand products in stock, and I'm storing these products in a linked list of type Item, and storing the invoices in a linked list as well, then on closing the app I save them using an ObjectOutputStream and reload them on startup. Is that bad practice? Do I have to use a database instead?
My second question is: if I continue using LinkedList and ObjectOutputStream, what is better for performance and memory, storing the actual Item as a field member in the invoice class, or just its ID and then fetching the item by that ID reference when needed?
Thanks in advance.
It is a bad idea to use ObjectOutputStream like that.
Here are some of the reasons:
If your application crashes (or the power fails) before you "save", then all changes are lost.
Saving "all objects" is expensive.
Serialized objects are opaque. It is only practical to look at them from Java code.
Serialized objects are fragile. If your application classes change, you may find that old serialized objects can no longer be read. That's bad enough, but now consider what happens if your client wants to look at pharmacy records from 5 years ago ... from a backup tape.
Serialized objects provide no way of searching ... apart from reading all of the objects one at a time.
Designs which involve reading all objects into memory do not scale. You are liable to run out of memory. Or compromise on your requirements to avoid running out of memory.
By contrast:
A database won't lose any changes that have been committed. Databases are much more resilient to things like application errors and system-level failures.
Committing database changes is not as expensive, because you only write data that has changed.
Typical databases can be viewed, queried, and if necessary repaired using an off-the-shelf database tool.
Changing Java code doesn't break the database. And when your data model changes, there are ways to migrate the database schema and records to match.
Databases have indexes and query languages for implementing efficient search.
Databases scale because the primary copy of the data is on disk, not in memory.
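To contrast with the serialize-everything approach, here is a hedged sketch of the database route, using the embedded H2 database purely as an example (H2 assumed on the classpath; table and column names are made up). Each invoice line is written as it happens, so a crash loses at most the transaction in flight; note the invoice row stores the item's ID rather than a serialized copy of the Item, which also answers the second question.

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Illustrative only: persist each invoice line as it happens via JDBC.
public class InvoiceStore {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:./pharmacy")) {
            con.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS invoice_line(" +
                "invoice_id BIGINT, item_id BIGINT, qty INT, price DECIMAL(10,2))");
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO invoice_line VALUES (?, ?, ?, ?)")) {
                ps.setLong(1, 1L);                          // invoice id
                ps.setLong(2, 42L);                         // item ID, not the whole Item
                ps.setInt(3, 2);                            // quantity
                ps.setBigDecimal(4, new BigDecimal("9.99"));
                ps.executeUpdate();                         // durable once committed
            }
        }
    }
}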

How to store database data with lots of attributes into cache?

Let's say that I have a table with columns TABLE_ID, CUSTOMER_ID, ACCOUNT_NUMBER, PURCHASE_DATE, PRODUCT_CATEGORY, PRODUCT_PRICE.
This table contains all purchases made in some store.
Please don't concentrate on changing the database model (there are obvious improvement possibilities) because this is a made-up example and I can't change the actual database model, which is far from perfect.
The only thing I can change is the code which uses the already existing database model.
Now, I don't want to access the database all the time, so I have to store the data into cache and then read it from there. The problem is, my program has to support all sorts of things:
What is the total value of purchases made by customer X on date Y?
What is the total value of purchases made for products from category X?
Give me a list of total amounts spent grouped by customer_id.
etc.
I have to be able to preserve this hierarchy in my cache.
One possible solution is to have a map inside a map inside a map... etc.
However, that gets messy very quickly, because I need an extra nesting level for every attribute in the table.
Is there a smarter way to do this?
Have you already established that you need a cache? Are you sure the performance of your application requires it? The database itself can optimize queries, have things in memory, etc.
If you're sure you need a cache, you also need to think about cache invalidation: is the data changing from beneath your feet (i.e. is another process changing the data in the database), is the database data immutable, or is your application the only process modifying it?
What do you want your cache to do? Just keep track of queries and results that have been requested, so that the second time a query is run you can return the result from the cache? Or do you want to aggressively pre-calculate some aggregates? Can the cache data fit into your app's memory, or do you want to use ReferenceMaps, for example, that shrink when memory gets tight?
For your actual question, why do you need maps inside maps? You probably should design something that's closer to your business model, and store objects that represent the data in a meaningful way. You could have each query (PurchasesByCustomer, PurchasesByCategory) represented as an object and store them in different maps so you get some type safety. Similarly don't use maps for the result but the actual objects you want.
Sorry, your question is quite vague, but hopefully I've given you some food for thought.
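To make the "objects instead of nested maps" suggestion concrete, here is an invented sketch (class and field names are assumptions, not from the question): one typed map per question the application must answer, updated as rows are loaded, instead of another nesting level per attribute.

import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

// Hypothetical typed caches, one per query the app must support.
class PurchaseCache {
    // "Total value of purchases made by customer X on date Y"
    record CustomerDay(long customerId, LocalDate day) {}
    final Map<CustomerDay, BigDecimal> totalByCustomerDay = new HashMap<>();

    // "Total value of purchases for products from category X"
    final Map<String, BigDecimal> totalByCategory = new HashMap<>();

    // "Total amounts spent grouped by customer_id"
    final Map<Long, BigDecimal> totalByCustomer = new HashMap<>();

    void add(long customerId, LocalDate date, String category, BigDecimal price) {
        totalByCustomerDay.merge(new CustomerDay(customerId, date), price, BigDecimal::add);
        totalByCategory.merge(category, price, BigDecimal::add);
        totalByCustomer.merge(customerId, price, BigDecimal::add);
    }
}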

What is the best way to iterate and process an entire table from database?

I have a table called Token in my database that represents texts tokenized.
Each row has attributes like textblock, sentence and position (for identifying which text the token is from) and logical fields like text, category, chartype, etc.
What I want to do is iterate over all tokens to find patterns and do some operations, for example merging two adjacent tokens whose category is Name into one (and, after this, resetting the positions). I think I will need some kind of list.
What is the best way to do this: with SQL queries to find the patterns, or by iterating over all tokens in the table? I think the queries would get very complex, and maybe iterating as a list would be simpler, but I don't know which way to go (for example, retrieving into a Java list, or using a language that lets me iterate and make changes right in the database).
So that this question doesn't get closed: what I want to know is the most recommended way to do this. I'm using Java, but if another language is better, no problem; I think I will need to use R for some statistical calculations.
Edit: The table is large (millions of rows); loading it entirely into memory is not possible.
If you are working with a small table, or proving out a merge strategy, then just setup a query that finds all of the candidate duplicate lines and dump the relevant columns out to a table. Then view that table in a text editor or spreadsheet to see if your hypothesis about the duplication is correct.
Keep in mind that any time you try to merge two rows into one, you will be deleting data. Worst case is that you might merge ALL of your rows into one. Proceed with caution!
This is an engineering decision to be made, based mostly on the size of the corpus you want to maintain, and the kind of operations you want to perform on them.
If the size gets bigger than "what fits in the editor", you'll need some kind of database. That may or may not be an SQL database. But there is also the code part: if you want to perform non-trivial operations on the data, you might need a real programming language (could be anything: C, Java, Python; anything goes). In that case, the communication with the database will become a bottleneck: you need to generate queries that produce results that fit in the application program's memory. SQL is powerful enough to represent and store N-grams and do some calculations on them, but that is about as far as you are going to get. In any case the database has to be fully normalised, and that will make it more difficult to understand for non-DBAs.
My own toy project, http://sourceforge.net/projects/wakkerbot/ used a hybrid approach:
the data was obtained by a python crawler
the corpus was stored as-is in the database
the actual (modified MegaHal) Markov code stores its own version of the corpus in a (binary) flat file, containing the dictionary, N-grams, and the associated coefficients.
the training and text generation is done by a highly optimised C program
the output was picked up by another python script, and submitted to the target.
[in another life, I would probably have done some more normalisation, and stored N-grams or trees in the database. That would possibly cause the performance to drop to only a few generated sentences per second. It now is about 4000/sec]
My gut feeling is that what you want is more like a "linguistic workbench" than a program that does exactly one task efficiently (like wakkerbot). In any case you'll need to normalise a bit more: store the tokens as {tokennumber,tokentext} and refer to them only by number. Basically, a text is just a table (or array) containing a bunch of token numbers. An N-gram is just a couple of tokennumbers+the corresponding coefficients.
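A tiny sketch of that token-number normalisation (names invented for illustration): a dictionary maps each distinct token text to a stable number, and a text is then just an array of numbers.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical token dictionary: each distinct text gets a stable number.
class TokenDict {
    private final Map<String, Integer> byText = new HashMap<>();
    private final List<String> byNumber = new ArrayList<>();

    int number(String text) {
        Integer n = byText.get(text);
        if (n == null) {
            n = byNumber.size();
            byNumber.add(text);
            byText.put(text, n);
        }
        return n;
    }

    String text(int number) { return byNumber.get(number); }
}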
This is not the most optimized method, but it's a design that allows you to write the code easily (a rough sketch follows below).
Write an entity class that represents a row in your table.
Write a factory method that gives you the entity object for a given row id, i.e. a method that creates an object of the entity class with the values from the specified row.
Write methods that remove and insert a given row object into the table.
Write a row-counting method.
Now you can try to iterate over your table from your Java code. Remember that if you merge two rows, you need to correctly adjust the next index.
This method lets you use little memory, but you will be issuing a lot of queries to create the rows.
The concept is very similar or identical to ORM (Object-Relational Mapping). If you know how to use Hibernate or another ORM, then try those libraries.
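Here is that rough sketch in plain JDBC (table and column names are guesses based on the question): load one row by id, and delete a row after merging, issuing one query per row as described.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical row gateway for the Token table; column names are assumptions.
class TokenRow {
    long id; String text; String category; int position;

    static TokenRow load(Connection con, long id) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT id, text, category, position FROM token WHERE id = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return null;
                TokenRow t = new TokenRow();
                t.id = rs.getLong(1);
                t.text = rs.getString(2);
                t.category = rs.getString(3);
                t.position = rs.getInt(4);
                return t;
            }
        }
    }

    static void delete(Connection con, long id) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "DELETE FROM token WHERE id = ?")) {
            ps.setLong(1, id);
            ps.executeUpdate();
        }
    }
}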
IMO it'd be easier, and likely faster overall, to load everything into Java and do your operations there to avoid continually re-querying the DB.
There are some pretty strong numerical libs for Java and statistics, too; I wouldn't dismiss it out-of-hand until you're sure what you need isn't available (or is too slow).
This sounds like you're designing a text search engine. You should first see if pgsql's full text search engine is right for you.
If you do it without full text search, loading PL/R into PostgreSQL and learning to drive it is likely to be the fastest and most efficient solution. It'll allow you to put all this work into a few well-thought-out lines of R, and do it all in the database where access to the data is closest. The only time to avoid such a plan is when it would make the database server work VERY hard, like holding the dataset in memory and cranking a single CPU core across it. Then it's OK to do it app-side.
Whether you use PL/R or not, access large data sets through a cursor; it's by far the most efficient way to get either single rows or smaller subsets of rows. If you do it with a select with a where clause for each thing you want to process, then you don't have to hold all those rows in memory at once. You can grab and discard parts of result sets while doing things like running averages, etc.
Think about scale here. If you had a 5 TB database, how would you access it to do this the fastest? A poor scaling solution will come back to bite you even if it's only accessing 1% of the data set. And if you're already starting on a pretty big dataset today, it'll just get worse with time.
PL/R: http://www.joeconway.com/plr/
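On the cursor point, a hedged JDBC sketch (connection details invented): with the PostgreSQL driver, turning off auto-commit and setting a fetch size makes the driver stream rows through a cursor instead of buffering the whole result.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative cursor-style scan over a large table without loading it all.
public class TokenScan {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/corpus", "user", "pass")) {
            con.setAutoCommit(false); // required for PostgreSQL to use a cursor
            try (Statement st = con.createStatement()) {
                st.setFetchSize(1000); // stream rows in batches of 1000
                try (ResultSet rs = st.executeQuery(
                        "SELECT id, text, category FROM token ORDER BY position")) {
                    while (rs.next()) {
                        // process one token here; earlier batches can be GC'ed
                    }
                }
            }
        }
    }
}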

java efficient de-duplication

Let's say you have a large text file. Each row contains an email id and some other information (say some product-id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-dup the data (i.e. eliminate duplicates)?
Insane number of rows
Use a MapReduce framework (e.g. Hadoop). This is full-blown distributed computing, so it's overkill unless you have TBs of data. (j/k :) )
Unable to fit all rows in memory
Even the result won't fit: use merge sort, persisting intermediate data to disk. As you merge, you can discard duplicates (probably this sample helps). This can be multi-threaded if you want.
The results will fit: instead of reading everything into memory and then putting it in a HashSet (see below), you can use a line iterator or something similar and keep adding to a HashSet. You can use a ConcurrentHashMap and more than one thread to read files and add to this map. Another multi-threaded option is a ConcurrentSkipListSet. In this case, you implement compareTo() instead of equals()/hashCode() (compareTo() == 0 means duplicate) and keep adding to this SortedSet.
Fits in memory
Design an object that holds your data, implement a good equals()/hashCode() method and put them all in a HashSet.
Or use the methods given above (you probably don't want to persist to disk though).
Oh, and if I were you, I would put a unique constraint on the DB anyway...
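A small sketch of the fits-in-memory option (assuming one "email,productId" record per line, which is an assumption about the file layout): HashSet.add() returns false for a duplicate, so only first occurrences get written to the database.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Illustrative in-memory de-dup keyed on the email id.
public class DeDup {
    public static void main(String[] args) throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(Path.of("data.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String email = line.split(",", 2)[0];
                if (seen.add(email)) {
                    // first occurrence: insert the row into the database here
                }
            }
        }
    }
}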
I will start with the obvious answer. Make a HashMap and put the email id in as the key and the rest of the information in as the value (or make an object to hold all the information). When you get to a new line, check to see if the key exists; if it does, move to the next line. At the end, write out all your SQL statements using the HashMap. I do agree with eqbridges that memory constraints will be important if you have a "gazillion" rows.
You have two options,
do it in Java: you could put together something like a HashSet for testing, adding an email id for each item that comes in if it doesn't already exist in the set.
do it in the database: put a unique constraint on the table, such that dups will not be added to the table. An added bonus to this is that you can repeat the process and remove dups from previous runs.
Take a look at Duke (https://github.com/larsga/Duke), a fast dedupe and record-linkage engine written in Java. It uses Lucene to index the data and reduce the number of comparisons (avoiding the unacceptable Cartesian product comparison). It supports the most common algorithms (edit distance, Jaro-Winkler, etc.) and it is extremely extensible and configurable.
Can you not index the table by email and product ID? Then reading by index should make duplicates of either email or email+prodId readily identified via sequential reads and simply matching the previous record.
Your problem can be solved with an Extract, Transform, Load (ETL) approach:
You load your data in an import schema;
Do every transformation you like on the data;
Then load it into the target database schema.
You can do this manually or use an ETL tool.

How to reduce the total memory hogging by compacting my Objects in Java?

I have a table with around 20 columns, mostly consisting of varchars and decimals. This table has almost 1.5M rows. But a few things are common among them: column1 consists of only 100 distinct strings, column2 has almost 1,000, and column3 almost 500.
Right now, I am storing all these column values in a map with the first 5 columns as the key and the rest of the columns as the data. My task is such that I need to initialize all of this at the start of the task.
What pattern(like Flyweight, etc) or data structure should I use to minimize my Object storage?
Why do I need to pre-load all the data?
Assume the whole data of the table forms a tree, and the victims can be at any leaf, at the trunk, or at the root. So for each entry [this comes from a different place], I need to see if there is any match in the tree.
Interning is not the best option. Garbage collecting from the PermSpace is possible, but not something the VM is optimized for.
You can implement your own CharSequence implementation that is backed by shared char[] arrays.
With a CharSequence implementation you'll be able to implement basic sharing semantics like internalized strings or more complicated ones taking substrings and other projections into account.
A custom CharSequence implementation can also be optimized to perform fewer memory allocations than the String class, which copies char[] around (for safety reasons that are unnecessary if the backing char[] is under your full control). Even new String("..").intern() will instantiate a new String instance (char[] array) that is rapidly garbage collected.
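A hedged sketch of such a CharSequence (minimal and purely illustrative): many values become (offset, length) views over one shared char[], and subSequence() creates another view rather than copying.

// Illustrative CharSequence backed by a shared char[]; no copying on subSequence.
final class SharedChars implements CharSequence {
    private final char[] data; // shared backing array, treated as immutable
    private final int offset, length;

    SharedChars(char[] data, int offset, int length) {
        this.data = data; this.offset = offset; this.length = length;
    }

    @Override public int length() { return length; }
    @Override public char charAt(int index) { return data[offset + index]; }

    @Override public CharSequence subSequence(int start, int end) {
        return new SharedChars(data, offset + start, end - start); // shares data
    }

    @Override public String toString() { return new String(data, offset, length); }
}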
My first question would be: what does your task plan on doing with the data in the table? Preloading a complete table into memory is not always the best approach; for instance, keeping your current setup but loading on demand might be a better solution. And you might want to investigate flushing data that isn't used for a while, i.e. a kind of least-recently-used map.
Could you elaborate what your task tries to achieve with all that data cached in a map?
Is the "victim" identification part of the key or part of the object? If part of the object, how do you select the keys that select the objects that you need? In other words; it sounds like you try to reproduce functionality that the database is very good at.
If your problem is that your table contents does not map easily on a tree-like structure, you could add that information in a way that is useable through the DB interface.
If your data loading process can support it then it isn't too difficult to implement something like String.intern() without the GC permgen side effects.
For any hashable data element, you can simply have a Map<T,T> to look-up preexisting instances. So for String:
Map<String,String> stringCache = new HashMap<String,String>();
...
// Reuse the cached instance if one exists; otherwise cache this one.
String sharedValue = stringCache.putIfAbsent(loadedValue, loadedValue);
if (sharedValue == null) sharedValue = loadedValue;
The process that loads the data from wherever will still be creating temporary strings but these will be rapidly GC'ed. Without knowing more about the specifics of where the data is coming from, it's difficult to comment on whether those temporary objects are necessary... though I have trouble seeing a way around it. They would be reclaimed rapidly during the load process anyway.
