We are developing an adapter component that sits between two document-processing systems.
Brief Document Lifecycle Description
The system A sends documents to the adapter one by one.
The adapter collects those documents, preforms validation and transforms the documents to a format that is understandable by the system B. Transformed documents are stored in a database.
Every 5 minutes transformed documents are taken from a database, packed to a package and sent to the system B.
Document Processing Details
It is possible that for a document an update could come to the adapter. This update looks like an ordinary document with the same identifier as an original one, but several fields differ. When such an update comes there are the following requirements:
If an original document is sent to the system B, skip the update;
If an original document is not sent to the system B, send only the update.
As it was described above there are two phases of the document processing: validation-transformation-persisting documents one by one and document aggregation-sending.
Each document has a dedicated database record. When an update comes, a corresponding record is updated (no separate record creation). When a document arrives it has the 'New' status, after being validated and transformed it gains the 'Ready for Sending' status. After being sent as a part of a package - 'Sent' status.
Problem Statement
There is a possibility of the following scenario:
a document arrives to the adapter - grabs the 'New' status;
the document is validated and transformed successfully - 'Ready for Sending'
a collection of documents, containing the described document, is taken from a database and the package transformation process started - all documents are still in the 'Ready for Sending' status;
An update arrives for one of the documents in a package, and a corresponding transaction commits - this document record is at the 'Ready for Sending' status, but several fields changed (so do a Hibernate-#Version-marked column).
A package is formed and at the sending phase the adapter tries to set the 'Sent' status for all the documents in the package, but fails with the OptimisticLock exception, which means that the whole package must be reprocessed.
Our adapter has a rate of 250 messages per second and usually a package contains at average 20000 transformed documents and the likelihood of facing the described issue is rather high. It is forbidden to forms and send several packages of lower size, also it is forbidden to loose documents or send duplicates.
Optimistic locking is crucial at the adapter because it grabs and processes documents in several parallel threads, thus an original document and an update could be processed at parallel...
Solution Ideas
We are thinking about introducing a new status like 'On Package Forming' just right before forming a package and setting up this status for each document in a separate transaction: if some documents will raise an OptimisticLock exception, we will just reprocess this document. This solutions looks rather slow.
Could you please suggest a solution for this case? Maybe we have to change the approach completely?
Related
Let's consider the following situation - there are two fields in "article" document - content(string) and views(int). The views field is not indexed. The views field contains information how many times this article was read.
From official doc:
We also said that documents are immutable: they cannot be changed,
only replaced. The update API must obey the same rules. Externally, it
appears as though we are partially updating a document in place.
Internally, however, the update API simply manages the same
retrieve-change-reindex process that we have already described.
But what if we do particial update of not indexed field - will elasticsearch reindex the entire document? For example - I want to update views every time someone reads some article. If entire document is reindexed I can't do real time update (as it's too heavy operation). So I will have to work with delay, for example updates all articles the visitors have read every 3-5-10 minutes. Or I understand something wrong?
But what if we do particial update of not indexed field - will elasticsearch reindex the entire document?
Yes, whilst the views field is not indexed individually it is part of the _source field. The _source field contains the original JSON you sent to Elasticsearch when you indexed the document and is returned in the results if there is a match on the document during a search. The _source field is indexed with the document in Lucene. In your update script you are changing the _source field so the whole document will be re-indexed.
Could you then evaluate the following strategy. Every time someone reads the article I send update to elastic. However refresh_interval I set to 30 seconds. Will this strategy be normal if during 30 second interval about 1000 users have read one article?
You are still indexing the 1000 documents, 1 document will be indexed as the current document, 999 documents will be indexed marked as deleted and removed from the index during the next Lucene merge.
I've stucked with one question in my understanding of ElasticSearch indexing process. I've already read this article, which says, that inverted-index stores all tokens of all documents and it is immutable. So, to update it we must remove it and reindexing all data to have all document searchable.
But I've read about partial updating the documents (automaticaly marking them to "deleted" and inserting+indexing new one). But in those article where no mention about reindexing all previous data.
So, I do not understand properly next: when I update the document (text document with 100 000 words) and already have in storage some other indexed document - is it true that I will have on every UPDATE or INSERT operation reindexing process of all my documents?
Basicly I rely on default ElasticSearch settings (5 primary shards with one replica per shard and 2 nodes in cluster)
You can just have a document updated (that is reindexed, which is basically the same as removing from index and adding it again), see: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/update-doc.html This will take care of the whole index, so you won't need to reindex every other document.
I'm not sure what you mean by "save" operation, you may want to clarify it with an example.
As of the time required to update a document of 100K words, I suggest you to try it out.
In my project I am generating a report. This involves huge data transmission from the DB.
The logic is like user will give certain criteria,based on which first we will fetch parent items from db.There may be 100000 parent items.Not only this after getting this items we are gathering child items of this parent items and there detailed details. All to gather this parent and child information we are putting in one response xml.
It is fine for small records. But for huge records it is taking more time. We are using a tool as a back end system.Which stores the records.It has its own query set so query optimization did not work.All we have to do it with java.
Can any one from the team give some idea how to optimize this.
Not really a true answer, but too long for a comment
You must benchmark the different steps:
database - time a select extracting all the records (parents + childs) directly on the database (assuming it is a simple database)
network - time a transfert of the approximate size of the whole records.
processing - store the result on a local file and time the processing reading from local file (you must also time a copy of the file to know the time used to read from disk
Multithreading will only help if the bottleneck is processing.
I've never used CouchDB/MongoDB/Couchbase before and am evaluating them for my application. Generally speaking, they seem to be a very interesting technology that I would like to use. However, coming from an RDBMS background, I am hung up on the lack of transactions. But at the same time, I know that there is going to be much less a need for transactions as I would have in an RDBMS given the way data is organized.
That being said, I have the following requirement and not sure if/how I can use a NoSQL DB.
I have a list of clients
Each client can have multiple files
Each file must be sequentially number for that specific client
Given an RDBMS this would be fairly simple. One table for client, one (or more) for files. In the client table, keep a counter of last filenumber, and increment by one when inserting a new record into the file table. Wrap everything in a transaction and you are assured that there are inconsistencies. Heck, just to be safe, I could even put a unique constraint on a (clientId, filenumber) index to ensure that there is never the same filenumber used twice for a client.
How can I accomplish something similar in MongoDB or CouchDB/base? Is it even feasible? I keep reading about two-phase commits, but I can't seem to wrap my head around how that works in this kind of instance. Is there anything in Spring/Java that provides two-phase commit that would work with these DBs, or does it need to be custom code?
Couchdb is transactional by default. Every document in couchdb contains a _rev key. All updates to a document are performed against this _rev key:-
Get the document.
Send it for update using the _rev property.
If update succeeds then you have updated the latest _rev of the document
If the update fails the document was not recent. Repeat steps 1-3.
Check out this answer by MrKurt for a more detailed explanation.
The couchdb recipies has a banking example that show how transactions are done in couchdb.
And there is also this atomic bank transfers article that illustrate transactions in couchdb.
Anyway the common theme in all of these links is that if you follow the couchdb pattern of updating against a _rev you can't have an inconsistent state in your database.
Heck, just to be safe, I could even put a unique constraint on a (clientId, filenumber) index to ensure that there is never the same filenumber used twice for a client.
All couchdb documents are unique since the _id fields in two documents can't be the same. Check out the view cookbook
This is an easy one: within a CouchDB database, each document must have a unique _id field. If you require unique values in a database, just assign them to a document’s _id field and CouchDB will enforce uniqueness for you.
There’s one caveat, though: in the distributed case, when you are running more than one CouchDB node that accepts write requests, uniqueness can be guaranteed only per node or outside of CouchDB. CouchDB will allow two identical IDs to be written to two different nodes. On replication, CouchDB will detect a conflict and flag the document accordingly.
Edit based on comment
In a case where you want to increment a field in one document based on the successful insert of another document
You could use separate documents in this case. You insert a document, wait for the success response. Then add another document like
{_id:'some_id','count':1}
With this you can set up a map reduce view that simply counts the results of these documents and you have an update counter. All you are doing is instead of updating a single document for updates you are inserting a new document to reflect a successful insert.
I always end up with the case where a failed file insert would leave the DB in an inconsistent state especially with another client successfully inserting a file at the same time.
Okay so I already described how you can do updates over separate documents but even when updating a single document you can avoid inconsistency if you :
Insert a new file
When couchdb gives a success message -> attempt to update the counter.
Why this works?
This works because because when you try to update the update document you must supply a _rev string. You can think of _rev as a local state for your document. Consider this scenario:-
You read the document that is to be updated.
You change some fields.
Meanwhile another request has already changed the original document. This means the document now has a new _rev
But You request couchdb to update the document with a _rev that is stale that you read in step #1.
Couchdb will generate an exception.
You read the document again get the latest _rev and attempt to update it.
So if you do this you will always have to update against the latest revision of the document. I hope this makes things a bit clearer.
Note:
As pointed out by Daniel the _rev rules don't apply to bulk updates.
Yes you can do the same with MongoDB, and Couchbase/CouchDB using proper approach.
First of all in MongoDB you have unique index, this will help you to ensure a part of the problem:
- http://docs.mongodb.org/manual/tutorial/create-a-unique-index/
You also have some pattern to implement sequence properly:
- http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
You have many options to implement a cross document/collection transactions, you can find some good information about this on this blog post:
http://edgystuff.tumblr.com/post/93523827905/how-to-implement-robust-and-scalable-transactions (the 2 phase commit is documented in detail here: http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/ )
Since you are talking about Couchbase, you can find some pattern here too:
http://docs.couchbase.com/couchbase-devguide-2.5/#providing-transactional-logic
I have a requirement in which I need to capture data changes (not auditing) and life cycle states on inventory.
Technology:
Jave, Oracle, Hibernate + JPA
For the data changes, we have been given a list of data elements that are to be monitored. If the element changes we are to notify a given 3rd party vendor. What I want to do is make this a generic service that we can provide to any of our current and future 3rd party vendors.
We don't care who made the change or what the new value is just that it changed.
The thought is that the data layer of our application would use annotation on each of the data elements. If that data element changed, then it would place a message into a queue. The message bean would then read the queue and make an entry in a table.
Table to look something like the following:
Table Name: ATL_CHANGE_TRACKER
Key columns
INVENTORY_ID Inventory Id of the vehicle
SALEEVENT_ITEM_ID SaleEvent item of the vehicle
FIELD_CHANGED_ID Id of the field that got changed or action. Link to subscription
UPDATE_DTM Indicates the date time when change occured.
For a given inventory, we could have up to 200 entries in this table (monitoring 200 fields across many tables).
Then a daemon for the given 3rd party would then read from this table based on the fields that it has subscribed to (could be all the fields). It would then read what every table it is required to to create the message to be sent to the 3rd party. Decouple the provider of the data and the user of the data.
Identify the list of fields/actions that are available
Table Name: ATL_FIELD_ACTION
Key columns
ID
NAME Name of the field/action - Example Color,Make
REC_CRE_TIME_STAMP
REC_CRE_USER_ID
LAST_UPDATE_USER_ID
LAST_UPDATE_TIME_STAMP
Subscription table, if 3rd Party company xyz is interested in 60 fields, the 60 fields will be mapped to this table.
ATL_FIELD_ACTION_SUBSCRIPTION
Key columns
ATL_FIELD_ACTION_ ID ID of the atl_field_action table
CONSUMER 3rd Party Name
FUNCTION Name of the 3rd Party Transmission that it is used for
STATUS
REC_CRE_TIME_STAMP
REC_CRE_USER_ID
LAST_UPDATE_USER_ID
LAST_UPDATE_TIME_STAMP
The second part is that there will be actions on the life cycle of the inventory which will need to be recored also. In this case, when the state of the inventory changes a message will be placed on the same queue and that entry will be entered in the same table.
Again, the daemon will have subscribed to these states and will collect the ones it is interested in.
The goal here is to not have the business tier/data tier care who wants the data - just that it needs to provide it so those interested can get it.
Wonder if anyone has done something like this - any gotchas - off the shelf - open source solutions to do this.
For a high-level discussion on the topic, I would suggest reading this article by Martin Fowler.
Its sounds like you have write-once, read-many type of data, it might produce large volumes of data, and the data is different for different clients. If you ask me, it sounds like this may be a good place to make use of either a NOSQL database or hack your Oracle database to act as a NOSQL database. See here for a discussion on how someone did this with MySQL.
Otherwise, you may look at creating an "immutable" database table and have Hibernate write new records every time it does an update as described here.
Couple things.
First, you get to do all of this work yourself. The JPA/Hibernate lifecycle listeners, while they have an event for when an update occurs, you aren't passed the "old" object and the "new" object. So, you're going to have to keep track of what fields change using some other method.
Second, again with lifecycle listeners, be careful inside of them, as the transaction state is a bit murky. At least on Glassfish/EclipseLink, I've had "strange" problems using either the JPA or JMS from a lifecycle listener. Just weird behavior. We went to a non-transactional queue to capture all of our information that we track from the lifecycle events.
If having the change data committed on its own transaction is acceptable, then there is value is pushing the data on to a faster, internal queue (which can feed a listener that posts it to an MDB). This just gets the auditing "out of band" with your transaction, give you better transaction throughput. But if you need to have the change information committed with the same transaction, this won't work. For example, you could put something on the queue and then the transaction may be rolled back (for whatever) reason, leaving the change on the queue showing it happened, when it in fact failed. That's a potential issue with this.
But if you're posting a lot of audit information, then this can be a concern.
If the auditing information has a short life span (with respect to the rest of the data), then you should probably make an effort to cull the audit tables, they can get pretty large.
Also, if practical, don't disregard the use of DB triggers for this. They can be quite efficient and effective at this process.