I create a document collection and can put the docid of the second document in the first document, the third in the second, and so on until the last document. This lets me navigate from the first document to the second when the user approves a job, and so on. But I also want to be able to go from the second document back to the first when the user rejects the task, and I have not been able to store the docid of the first document in the second document. Below is the code I am currently using:
Document nextJob = null;
Document thisJob = null;
DocumentCollection col = lookup.getAllDocumentsByKey(ID, true);
if (col != null) {
    Document job = col.getFirstDocument();
    while (job != null) {
        thisJob = job;
        thisJob.replaceItemValue("DocID", thisJob.getUniversalID());
        thisJob.save(true);
        if (nextJob != null) {
            // give the previously processed document a pointer to this one
            nextJob.replaceItemValue("TaskSuccessor", thisJob.getUniversalID());
            nextJob.save(true);
        }
        nextJob = thisJob;
        job = col.getNextDocument(job);
    }
}
To echo Frantisek and others, updating the documents is not best practice. The key to achieving this is to consider a number of questions:
What do you mean by first, next and previous job?
How many jobs are involved?
How are save conflicts going to be minimised / resolved by you / the users?
How are deletions being handled, to ensure referential integrity?
What happens when you need to archive data?
If it's for all users and "next" means next by date created, create a view sorted on date created. It will be quicker to create, will completely negate the issue of save conflicts or deletes, and will not have a significant performance hit unless you're dealing with very large numbers of jobs (in which case you should be considering archiving).
If it's a small number of jobs, store them in a Java Map. But you need to handle deletions. Because you'll be loading the map when the app loads, archiving is not a problem.
If it's next / previous per user, a better method would be to store the order in a per-person document in the database. If replicas are not involved, Note IDs can be used and are shorter. This negates save conflicts, but it may cause problems with large numbers of jobs - you will probably need to create new fields programmatically and also handle deletions.
DonMaro's suggestion fits with a graph database approach of edges (the third documents) between the vertices (the jobs).
In most cases, views will be the easiest and most recommended approach. IBM have included view index enhancements in 9.0.1 FP3 and will allow view indexes to be stored outside the NSF in the next point release.
Even if you're confident that you can build a better indexing system than what is already included in Domino, there are other aspects like save conflicts that need to be handled, and your decision may not allow for future functional requirements like security, deletion, archiving etc.
Well, do still really consider Frantisek Kossuth's comment (UNIDs get changed if you ever have to copy/paste a document back into the database, e.g. from a backup; consider generating unique values with @Unique instead), but apart from that:
Just create a third Document object, "prevJob", and store the previous document in it when/before moving on to the next one.
Then you can access the UNID just as you already do, via prevJob.getUniversalID(), and store it in the document you're currently processing.
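A minimal sketch of that idea, based on the loop from the question (the field name "TaskPredecessor" is only an illustrative choice, not something defined anywhere in the original code):
Document prevJob = null;
Document thisJob = null;
DocumentCollection col = lookup.getAllDocumentsByKey(ID, true);
if (col != null) {
    Document job = col.getFirstDocument();
    while (job != null) {
        thisJob = job;
        thisJob.replaceItemValue("DocID", thisJob.getUniversalID());
        if (prevJob != null) {
            // forward link: the previous document points at this one
            prevJob.replaceItemValue("TaskSuccessor", thisJob.getUniversalID());
            prevJob.save(true);
            // backward link: this document points back at the previous one
            thisJob.replaceItemValue("TaskPredecessor", prevJob.getUniversalID());
        }
        thisJob.save(true);
        prevJob = thisJob;
        job = col.getNextDocument(job);
    }
}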
I have an external service which I'm grabbing a list of items from, and persisting locally a relationship between those items and a user. I feed that external service a name, and get back the associated items with that name. I am choosing to persist them locally because I'd like to keep my own attributes about those external items once they've been discovered by my application. The items themselves are pretty static objects, but the total number of them are unknown to me, and the only time I learn about new ones is if a new user has an association with them on the external service.
When I get a list of them back from the external service, I want to check whether each one already exists in my database and, if so, use that object; if it doesn't, I need to add it so I can set my own attributes and keep the association to my user.
Right now I have the following (pseudocode, since it's broken into service layers etc):
Set<ExternalItem> items = externalService.getItemsForUser(user.name);
for (ExternalItem externalItem : items) {
    Item dbItem = sessionFactory.getCurrentSession().get(Item.class, externalItem.getId());
    if (dbItem == null) {
        // Not in the database, create it.
        dbItem = mapToItem(externalItem);
    }
    user.addItem(dbItem);
}
sessionFactory.getCurrentSession().save(user); // Saves the associated Items also.
This operation is taking around 16 seconds for approximately 500 external items. The remote call accounts for around 1 second of that, and the save is negligible as well. The drain I'm noticing comes from the numerous session.get() calls I'm doing.
Is there a better way to check for an existing Item in my database than this, given that I get a Set back from my external service?
Note: the external item's id is reliably the same as mine, and a single id will always represent the same external item.
I would definitely recommend a native query, as recommended in the comments.
I would not bother to chunk them, though, given the numbers you are talking about. Postgres should be able to handle an IN clause with 500 elements with no problems. I have had programmatically generated queries with many more items than that which performed fine.
This way you also have only one round trip, which, assuming the proper indexes are in place, really should complete in sub-second time.
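As a rough sketch of that single-round-trip version, staying in the same pseudocode style as the question (it assumes a numeric Long id, the Hibernate 5.2+ query API, and reuses the question's mapToItem(); a native SQL query via createNativeQuery would work the same way):
Set<ExternalItem> items = externalService.getItemsForUser(user.name);

// Collect the external ids and fetch every matching Item in one query.
Set<Long> ids = new HashSet<>();
for (ExternalItem externalItem : items) {
    ids.add(externalItem.getId());
}

List<Item> existing = sessionFactory.getCurrentSession()
        .createQuery("from Item i where i.id in (:ids)", Item.class)
        .setParameterList("ids", ids)
        .getResultList();

// Index the persisted items by id for O(1) lookups.
Map<Long, Item> byId = new HashMap<>();
for (Item item : existing) {
    byId.put(item.getId(), item);
}

// Reuse the persisted Item where it exists, otherwise map and create it.
for (ExternalItem externalItem : items) {
    Item dbItem = byId.get(externalItem.getId());
    if (dbItem == null) {
        dbItem = mapToItem(externalItem);
    }
    user.addItem(dbItem);
}
sessionFactory.getCurrentSession().save(user);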
I've never used CouchDB/MongoDB/Couchbase before and am evaluating them for my application. Generally speaking, they seem to be a very interesting technology that I would like to use. However, coming from an RDBMS background, I am hung up on the lack of transactions. At the same time, though, I know there is going to be much less need for transactions than I would have in an RDBMS, given the way the data is organized.
That being said, I have the following requirement and not sure if/how I can use a NoSQL DB.
I have a list of clients
Each client can have multiple files
Each file must be sequentially numbered for that specific client
Given an RDBMS this would be fairly simple: one table for clients, one (or more) for files. In the client table, keep a counter of the last file number, and increment it by one when inserting a new record into the file table. Wrap everything in a transaction and you are assured that there are no inconsistencies. Heck, just to be safe, I could even put a unique constraint on a (clientId, filenumber) index to ensure that the same filenumber is never used twice for a client.
How can I accomplish something similar in MongoDB or CouchDB/Couchbase? Is it even feasible? I keep reading about two-phase commits, but I can't seem to wrap my head around how that works in this kind of case. Is there anything in Spring/Java that provides two-phase commit that would work with these DBs, or does it need to be custom code?
Couchdb is transactional by default. Every document in couchdb contains a _rev key. All updates to a document are performed against this _rev key:-
Get the document.
Send it for update using the _rev property.
If the update succeeds, then you have updated the latest _rev of the document.
If the update fails, the _rev you sent was not the most recent. Repeat steps 1-3.
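A minimal sketch of that get/update/retry cycle against CouchDB's plain HTTP API, using Java's built-in HttpClient (the database name, document id and the field being changed are invented for the example, and a real application would use a JSON library instead of string replacement):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CouchRetryUpdate {
    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final String DOC_URL = "http://localhost:5984/jobs/some_doc_id";

    public static void main(String[] args) throws Exception {
        boolean updated = false;
        while (!updated) {
            // 1. Get the document; the body contains the current _rev.
            HttpResponse<String> get = HTTP.send(
                    HttpRequest.newBuilder(URI.create(DOC_URL)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            String body = get.body(); // e.g. {"_id":"some_doc_id","_rev":"3-abc...","count":1}

            // 2. Change some fields (string replacement only to keep the sketch short).
            String newBody = body.replace("\"count\":1", "\"count\":2");

            // 3. Send the update; the _rev inside the body tells CouchDB which
            //    revision we believe is current.
            HttpResponse<String> put = HTTP.send(
                    HttpRequest.newBuilder(URI.create(DOC_URL))
                            .header("Content-Type", "application/json")
                            .PUT(HttpRequest.BodyPublishers.ofString(newBody))
                            .build(),
                    HttpResponse.BodyHandlers.ofString());

            // 201 means the update was accepted; 409 means someone else updated the
            // document first, so loop and repeat steps 1-3 with the fresh _rev.
            updated = put.statusCode() == 201;
        }
    }
}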
Check out this answer by MrKurt for a more detailed explanation.
The couchdb recipes has a banking example that shows how transactions are done in couchdb.
And there is also this atomic bank transfers article that illustrates transactions in couchdb.
Anyway, the common theme in all of these links is that if you follow the couchdb pattern of updating against a _rev, you can't end up with an inconsistent state in your database.
Heck, just to be safe, I could even put a unique constraint on a (clientId, filenumber) index to ensure that there is never the same filenumber used twice for a client.
All couchdb documents are unique since the _id fields in two documents can't be the same. Check out the view cookbook
This is an easy one: within a CouchDB database, each document must have a unique _id field. If you require unique values in a database, just assign them to a document’s _id field and CouchDB will enforce uniqueness for you.
There’s one caveat, though: in the distributed case, when you are running more than one CouchDB node that accepts write requests, uniqueness can be guaranteed only per node or outside of CouchDB. CouchDB will allow two identical IDs to be written to two different nodes. On replication, CouchDB will detect a conflict and flag the document accordingly.
Edit based on comment
In a case where you want to increment a field in one document based on the successful insert of another document
You could use separate documents in this case. You insert a document, wait for the success response. Then add another document like
{_id:'some_id','count':1}
With this you can set up a map/reduce view that simply counts these documents, and you have an update counter. Instead of updating a single document, you insert a new document to reflect each successful insert.
I always end up with the case where a failed file insert would leave the DB in an inconsistent state especially with another client successfully inserting a file at the same time.
Okay, so I already described how you can do updates over separate documents, but even when updating a single document you can avoid inconsistency if you:
Insert a new file
When couchdb gives a success message -> attempt to update the counter.
Why does this work?
This works because when you try to update the document you must supply a _rev string. You can think of _rev as the local state of your document. Consider this scenario:
You read the document that is to be updated.
You change some fields.
Meanwhile another request has already changed the original document. This means the document now has a new _rev
But you request couchdb to update the document with the stale _rev that you read in step 1.
Couchdb will generate an exception.
You read the document again get the latest _rev and attempt to update it.
So if you do this you will always have to update against the latest revision of the document. I hope this makes things a bit clearer.
Note:
As pointed out by Daniel the _rev rules don't apply to bulk updates.
Yes, you can do the same with MongoDB and Couchbase/CouchDB using the proper approach.
First of all, in MongoDB you have unique indexes, which cover part of the problem:
- http://docs.mongodb.org/manual/tutorial/create-a-unique-index/
You also have a pattern to implement sequences properly:
- http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
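As a rough illustration of that auto-incrementing pattern with the MongoDB Java driver (the collection name "counters", the field "lastFileNumber" and the client id are assumptions made for the sketch):
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.ReturnDocument;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class FileNumberAllocator {
    public static void main(String[] args) {
        MongoCollection<Document> counters = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("demo")
                .getCollection("counters");

        long next = nextFileNumber(counters, "client-42");
        System.out.println("next file number: " + next);
    }

    static long nextFileNumber(MongoCollection<Document> counters, String clientId) {
        Document updated = counters.findOneAndUpdate(
                Filters.eq("_id", clientId),            // one counter document per client
                Updates.inc("lastFileNumber", 1L),      // atomic increment on the server
                new FindOneAndUpdateOptions()
                        .upsert(true)                   // create the counter on first use
                        .returnDocument(ReturnDocument.AFTER));
        return updated.getLong("lastFileNumber");
    }
}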
You have many options to implement cross-document/collection transactions; you can find some good information about this in this blog post:
http://edgystuff.tumblr.com/post/93523827905/how-to-implement-robust-and-scalable-transactions (the 2 phase commit is documented in detail here: http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/ )
Since you are talking about Couchbase, you can find some pattern here too:
http://docs.couchbase.com/couchbase-devguide-2.5/#providing-transactional-logic
I'm relatively new to using caching in larger programs intended for a large number of users. I know what caching is and why it's beneficial in general, and I've started to integrate EHCache into my application, which uses JSP and Spring MVC. In my application the user selects an ID from a drop-down list, and a Java class grabs data from the DB according to the ID picked. First the query is executed and it returns a ResultSet object. At this point I am confused about what to do and feel like I'm missing something.
I know I want the object to go into cache if it's not already in there and if it's already in cache then just continue with the loop. But doing things this way requires me to iterate over the whole returned result set from the DB query, which is obviously not the way things are supposed to be done?
So, would you recommend that I just try to cache the whole result set returned? If I did this I guess I could update the list in the cache if the DB table is updated with a new record? Any suggestions on how to proceed and correctly put into ecache what is returned from the DB?
I know I'm throwing out a lot of questions and I certainly appreciate it if someone could offer some help! Here is a snippet of my code so you see what I mean.
rs = sta.executeQuery(QUERYBRANCHES + specifier);
while (rs.next())
{
    // For each set of fields retrieved, use those to create a Branch object.
    //String brName = rs.getString("NAME");
    String compareID = rs.getString("ID");
    String fixedRegID = rs.getString("REGIONID").replace("0", "").trim();

    // Check whether the branch is already in the cache. If it is not, create
    // the new object and add it to the cache. If the branch is in the cache, continue.
    if (!cacheManager.isInMemory(compareID))
    {
        Branch branch =
            new Branch(fixedRegID, rs.getString("ID"), rs.getString("NAME"), rs.getString("ADDR1"),
                       rs.getString("CITY"), rs.getString("ST"), rs.getString("ZIP"));
        cacheManager.addBranch(rs.getString("ID"), branch);
    }
    else
    {
        continue;
    }
}
retData = cacheManager.getAllBranches();
But doing things this way requires me to iterate over the whole
returned result set from the DB query, which is obviously not the way
things are supposed to be done?
You need to iterate in order to fetch the results.
To avoid iterating over all elements, you need to exclude the already-cached values from the select.
What I mean is: add an exclusion clause to your select for the values you don't want - in this case the values already cached (NOT IN, <>, etc.). This will reduce the iteration time.
Otherwise, yes, I'm afraid you will have to iterate over everything returned if your SQL filter is not complete.
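For illustration, a sketch of that exclusion idea against the question's query (it assumes a hypothetical cacheManager.getAllBranchIds() helper, an existing JDBC Connection con, that QUERYBRANCHES + specifier already contains a WHERE clause, and imports of java.sql.PreparedStatement and java.util.StringJoiner):
List<String> cachedIds = cacheManager.getAllBranchIds(); // hypothetical helper

String sql = QUERYBRANCHES + specifier;
if (!cachedIds.isEmpty()) {
    // Build one "?" placeholder per cached ID so the query stays parameterised.
    StringJoiner placeholders = new StringJoiner(", ", " AND ID NOT IN (", ")");
    for (int i = 0; i < cachedIds.size(); i++) {
        placeholders.add("?");
    }
    sql += placeholders.toString();
}

PreparedStatement stmt = con.prepareStatement(sql);
for (int i = 0; i < cachedIds.size(); i++) {
    stmt.setString(i + 1, cachedIds.get(i));
}
rs = stmt.executeQuery(); // only rows that are not yet cached come back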
So, would you recommend that I just try to cache the whole result set
returned? If I did this I guess I could update the list in the cache
if the DB table is updated with a new record? Any suggestions on how
to proceed and correctly put into ecache what is returned from the DB?
You should not cache highly dynamic business information.
What I recommend is that you use database indexes, which will dramatically increase your performance, and get your values from there. Use plain native SQL if needed.
If you are going to work with a lot of users you will need a lot of memory to keep all those objects in memory.
As you start scaling horizontally, cache management is going to be a challenge this way.
If you can, only cache values that won't change or that change very rarely, such as values loaded at application start-up or application parameters.
If you really need to cache business information, please let us know the specifics: hardware, platform, database, landscape, peak access load, etc.
Is there an efficient way to create a copy of table structure+data in HBase, in the same cluster? Obviously the destination table would have a different name. What I've found so far:
The CopyTable job, which has been described as a tool for copying data between different HBase clusters. I think it would support intra-cluster operation, but I have no knowledge of whether it has been designed to handle that scenario efficiently.
Use the export+import jobs. Doing that sounds like a hack, but since I'm new to HBase maybe it is a real solution?
Some of you might be asking why I'm trying to do this. My scenario is that I have millions of objects I need access to, in a "snapshot" state if you will. There is a batch process that runs daily which updates many of these objects. If any step in that batch process fails, I need to be able to "roll back" to the original state. Not only that, during the batch process I need to be able to serve requests to the original state.
Therefore the current flow is that I duplicate the original table to a working copy, continue to serve requests using the original table while I update the working copy. If the batch process completes successfully I notify all my services to use the new table, otherwise I just discard the new table.
This has worked fine using BDB but I'm in a whole new world of really large data now so I might be taking the wrong approach. If anyone has any suggestions of patterns I should be using instead, they are more than welcome. :-)
All data in HBase has a certain timestamp. You can do reads (Gets and Scans) with a parameter indicating that you want the latest version of the data as of a given timestamp. One thing you could do is serve your read requests using this parameter, pointed at a time just before the batch process begins. Once the batch completes, bump your read timestamp up to the current state.
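A minimal sketch of such an "as of" read with the HBase Java client (the table, row key and column names are invented; the important part is recording a timestamp before the batch starts and capping the Get's time range with it):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SnapshotRead {
    public static void main(String[] args) throws Exception {
        long batchStartTs = System.currentTimeMillis(); // record this before the batch runs

        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("objects"))) {

            Get get = new Get(Bytes.toBytes("row-123"));
            get.setTimeRange(0, batchStartTs); // only versions written before the batch are visible
            Result result = table.get(get);

            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("state"));
            System.out.println(value == null ? "no value" : Bytes.toString(value));
        }
    }
}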
A couple things to be careful of, if you take this approach:
HBase tables are configured to store the most recent N versions of a given cell. If you overwrite the data in the cell with N newer values, then you will lose the older value during the next compaction. (You can also configure them with a TTL to expire cells, but that doesn't quite sound like it matches your case.)
Similarly, if you delete the data as part of your process, then you won't be able to read it after the next compaction.
So, if you don't issue deletes as part of your batch process, and you don't write more versions of the same data that already exists in your table than you've configured it to save, you can keep serving old requests out of the same table that you're updating. This effectively gives you a snapshot.
Let's say you have a large text file. Each row contains an email id and some other information (say, some product-id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-dup the data (i.e. eliminate duplicates)?
Insane number of rows
Use a Map/Reduce framework (e.g. Hadoop). This is full-blown distributed computing, so it's overkill unless you have TBs of data. ( j/k :) )
Unable to fit all rows in memory
Even the result won't fit: use merge sort, persisting intermediate data to disk. As you merge, you can discard duplicates (this sample probably helps). This can be multi-threaded if you want.
The results will fit: instead of reading everything into memory and then putting it in a HashSet (see below), you can use a line iterator or something similar and keep adding to the HashSet. You can use a ConcurrentHashMap and more than one thread to read files and add to this map. Another multi-threaded option is a ConcurrentSkipListSet; in that case you implement compareTo() instead of equals()/hashCode() (compareTo() == 0 means duplicate) and keep adding to the SortedSet.
Fits in memory
Design an object that holds your data, implement a good equals()/hashCode() method and put them all in a HashSet.
Or use the methods given above (you probably don't want to persist to disk though).
Oh, and if I were you, I would put the unique constraint on the DB anyway...
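A tiny sketch of the in-memory case, assuming a duplicate simply means "same email id" (the field names are made up from the question's description):
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class Row {
    final String emailId;
    final String productId;

    Row(String emailId, String productId) {
        this.emailId = emailId;
        this.productId = productId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Row)) return false;
        return emailId.equals(((Row) o).emailId);   // duplicate = same email id
    }

    @Override
    public int hashCode() {
        return Objects.hash(emailId);
    }

    public static void main(String[] args) {
        Set<Row> unique = new HashSet<>();
        unique.add(new Row("a@example.com", "p1"));
        unique.add(new Row("a@example.com", "p2"));  // dropped: same email id
        unique.add(new Row("b@example.com", "p3"));
        System.out.println(unique.size());           // prints 2
    }
}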
I will start with the obvious answer: make a HashMap, put the email id in as the key and the rest of the information into the value (or make an object to hold all the information). When you get to a new line, check whether the key exists; if it does, move to the next line. At the end, write out all your SQL statements using the HashMap. I agree with eqbridges that memory constraints will be important if you have a "gazillion" rows.
You have two options:
do it in Java: you could put together something like a HashSet for testing - adding the email id for each item that comes in if it doesn't already exist in the set.
do it in the database: put a unique constraint on the table, so that dups will not be added to the table. An added bonus is that you can repeat the process and remove dups from previous runs. (See the sketch just below.)
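A sketch of that database option over plain JDBC (the table layout, the in-memory H2 URL and the decision to simply skip a duplicate row are all assumptions for the example; the exact exception raised for a duplicate key depends on the driver):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class InsertIgnoringDuplicates {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:demo");
             Statement ddl = con.createStatement()) {
            // The unique constraint is what actually enforces the de-duplication.
            ddl.execute("CREATE TABLE dedup_rows (email_id VARCHAR(255) PRIMARY KEY, product_id VARCHAR(64))");

            insert(con, "a@example.com", "p1");
            insert(con, "a@example.com", "p2"); // rejected by the constraint and skipped
            insert(con, "b@example.com", "p3");
        }
    }

    static void insert(Connection con, String email, String productId) {
        try (PreparedStatement ps =
                     con.prepareStatement("INSERT INTO dedup_rows (email_id, product_id) VALUES (?, ?)")) {
            ps.setString(1, email);
            ps.setString(2, productId);
            ps.executeUpdate();
        } catch (SQLException duplicate) {
            // Assume a duplicate key and skip the row; real code should inspect the
            // SQLState / error code before swallowing the exception.
        }
    }
}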
Take a look at Duke (https://github.com/larsga/Duke), a fast dedupe and record-linkage engine written in Java. It uses Lucene to index and reduce the number of comparisons (avoiding an unacceptable Cartesian-product comparison). It supports the most common algorithms (edit distance, Jaro-Winkler, etc.) and is extremely extensible and configurable.
Can you not index the table by email and product ID? Then, reading via the index, duplicates of either email or email+prodId can be readily identified by sequential reads, simply comparing each record with the previous one.
Your problem can be solved with an Extract, Transform, Load (ETL) approach:
You load your data into an import schema;
Do every transformation you like on the data;
Then load it into the target database schema.
You can do this manually or use an ETL tool.