ElasticSearch: creating new inverted-index after every update - java

I'm stuck on one question in my understanding of the Elasticsearch indexing process. I've already read this article, which says that the inverted index stores all tokens of all documents and is immutable. So, to update it, we must remove it and reindex all data to keep every document searchable.
But I've also read about partially updating documents (automatically marking the old version as "deleted" and inserting+indexing the new one), and that article made no mention of reindexing all previous data.
So I don't properly understand the following: when I update a document (a text document with 100,000 words) and already have some other indexed documents in storage, is it true that every UPDATE or INSERT operation will trigger reindexing of all my documents?
Basically, I rely on the default Elasticsearch settings (5 primary shards with one replica per shard and 2 nodes in the cluster).

You can just have the document updated (that is, reindexed, which is basically the same as removing it from the index and adding it again), see: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/update-doc.html Elasticsearch takes care of this within the index, so you won't need to reindex every other document.
I'm not sure what you mean by "save" operation, you may want to clarify it with an example.
As for the time required to update a document of 100K words, I suggest you try it out.
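For illustration, here is a minimal sketch of such a partial update over the REST API from plain Java; the index/type/id and field names are made up, and a local node on port 9200 is assumed:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PartialUpdateSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical index/type/id; only the changed field is sent.
        // Elasticsearch merges the "doc" fragment into _source and reindexes
        // that one document - nothing else in the index is touched.
        URL url = new URL("http://localhost:9200/myindex/mytype/1/_update");
        String body = "{\"doc\":{\"title\":\"updated title\"}}";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }
        System.out.println("HTTP " + conn.getResponseCode()); // 200 on success
    }
}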

Related

How to perform Upsert in solr [duplicate]

I've been attempting to do the equivalent of an UPSERT (insert, or update if it already exists) in Solr. I only know what does not work, and the Solr/Lucene documentation I have read has not been helpful. Here's what I have tried:
curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"1","name":{"set":"steve"}}]'
{"responseHeader":{"status":409,"QTime":2},"error":{"msg":"Document not found for update. id=1","code":409}}
I do up to 50 updates in one request, and a request may contain the same id with exclusive fields (title_en and title_es, for example). If there were a way of querying whether or not a list of ids exists, I could split the data and perform separate insert and update commands... This would be an acceptable alternative, but is there already a handler that does this? I would like to avoid writing any in-house routines at this point.
Thanks.
With Solr 4.0 you can do a partial update of those documents, sending just the fields that have changed while keeping the rest of the document the same. The id must match.
Solr does not support UPSERT mechanics out of the box. You can create a record or you can update a record, and the syntax is different.
And if you update a record, you must make sure all your other pre-inserted fields are stored (not just indexed). Under the covers, an update creates a completely new record that is pre-populated with the previously stored values. But that functionality is buried very deep (probably in Lucene itself).
Have you looked at DataImportHandler? You reverse the control flow (start from Solr), but it does have support for checking which records need to be updated and which records need to be created.
Or you can just run a Solr query like http://solr.example.com:8983/solr/select?q=id%3A(ID1+ID2+ID3)&fl=id&wt=csv where you ask Solr to look for your ID records and return only the IDs of the records it does find. Then you could post-process that to segment your updates and inserts.
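To make the partial ("atomic") update concrete, here is a minimal SolrJ sketch against Solr 4.x; the core URL matches the question's curl command, the field names are just examples, and the caveat above about all other fields being stored still applies:

import java.util.Collections;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        // Atomic update: only the "name" field is changed. The id must already
        // exist, and every other field must be stored so Solr can rebuild
        // the document behind the scenes. A prior query with fl=id (as in the
        // answer above) can tell you which ids need an insert instead.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("name", Collections.singletonMap("set", "steve"));

        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}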

Partial update on field that is not indexed

Let's consider the following situation: there are two fields in the "article" document - content (string) and views (int). The views field is not indexed; it stores how many times the article has been read.
From the official docs:
We also said that documents are immutable: they cannot be changed,
only replaced. The update API must obey the same rules. Externally, it
appears as though we are partially updating a document in place.
Internally, however, the update API simply manages the same
retrieve-change-reindex process that we have already described.
But what if we do a partial update of a non-indexed field - will Elasticsearch reindex the entire document? For example, I want to update views every time someone reads an article. If the entire document is reindexed, I can't do real-time updates (it's too heavy an operation), so I would have to work with a delay, for example updating all the articles visitors have read every 3-5-10 minutes. Or am I misunderstanding something?
But what if we do a partial update of a non-indexed field - will Elasticsearch reindex the entire document?
Yes. While the views field is not indexed individually, it is part of the _source field. The _source field contains the original JSON you sent to Elasticsearch when you indexed the document and is returned in the results if there is a match on the document during a search. The _source field is indexed with the document in Lucene. In your update script you are changing the _source field, so the whole document will be reindexed.
Could you then evaluate the following strategy: every time someone reads the article, I send an update to Elasticsearch, but I set refresh_interval to 30 seconds. Will this strategy hold up if about 1000 users read one article during a 30-second interval?
You are still indexing the 1000 documents: 1 will be indexed as the current document, and the other 999 will be marked as deleted and removed from the index during the next Lucene merge.
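For reference, a counter bump like that is usually sent as a scripted partial update. Below is a minimal sketch over the REST API; the index/type/id names are hypothetical, dynamic scripting must be allowed on older Elasticsearch versions, and internally the whole document is still rewritten, as described above:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class IncrementViewsSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical article document; retry_on_conflict retries the update
        // if a concurrent update changed the document's version in between.
        URL url = new URL(
            "http://localhost:9200/articles/article/42/_update?retry_on_conflict=3");
        String body = "{\"script\":\"ctx._source.views += 1\"}";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }
        System.out.println("HTTP " + conn.getResponseCode()); // 200 on success
    }
}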

CouchDB/Couchbase/MongoDB transaction emulation?

I've never used CouchDB/MongoDB/Couchbase before and am evaluating them for my application. Generally speaking, they seem to be a very interesting technology that I would like to use. However, coming from an RDBMS background, I am hung up on the lack of transactions. At the same time, I know there will be much less of a need for transactions than in an RDBMS, given the way the data is organized.
That being said, I have the following requirement and am not sure if/how I can use a NoSQL DB.
I have a list of clients
Each client can have multiple files
Each file must be sequentially numbered for that specific client
Given an RDBMS this would be fairly simple: one table for clients, one (or more) for files. In the client table, keep a counter of the last file number, and increment it by one when inserting a new record into the file table. Wrap everything in a transaction and you are assured that there are no inconsistencies. Heck, just to be safe, I could even put a unique constraint on a (clientId, filenumber) index to ensure that the same filenumber is never used twice for a client.
How can I accomplish something similar in MongoDB or CouchDB/Couchbase? Is it even feasible? I keep reading about two-phase commits, but I can't seem to wrap my head around how that works in this kind of case. Is there anything in Spring/Java that provides two-phase commit that would work with these DBs, or does it need to be custom code?
CouchDB is transactional at the document level by default. Every document in CouchDB contains a _rev key, and all updates to a document are performed against this _rev key (a sketch of the retry loop follows the steps below):
Get the document.
Send it for update using the _rev property.
If the update succeeds, you have updated the latest _rev of the document.
If the update fails, your copy of the document was stale. Repeat steps 1-3.
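Here is that retry loop as a minimal sketch against CouchDB's HTTP API; the database name, document id, and the toy field change are all made up, and a 409 response means the _rev was stale, so the loop re-reads and retries:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

public class RevRetrySketch {

    // Read the document as a raw JSON string (it includes the current _rev).
    static String get(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (Scanner s = new Scanner(conn.getInputStream(), "UTF-8")) {
            return s.useDelimiter("\\A").next();
        }
    }

    // PUT the updated JSON back and return the HTTP status code.
    static int put(String url, String json) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes("UTF-8"));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) throws Exception {
        String docUrl = "http://localhost:5984/clients/client-1"; // hypothetical doc

        while (true) {
            String doc = get(docUrl);                 // 1. get the document
            // Toy modification for the sketch; real code would parse the JSON,
            // change a field, and keep _id and _rev intact.
            String updated = doc.replace("\"files\":1", "\"files\":2");
            int status = put(docUrl, updated);        // 2. send it back with that _rev
            if (status == 201 || status == 202) {
                break;                                // 3. accepted against the latest _rev
            }
            if (status != 409) {
                throw new IllegalStateException("HTTP " + status);
            }
            // 409 Conflict: someone else updated the doc first -> repeat steps 1-3
        }
    }
}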
Check out this answer by MrKurt for a more detailed explanation.
The CouchDB recipes chapter has a banking example that shows how transactions are done in CouchDB.
And there is also this atomic bank transfers article that illustrates transactions in CouchDB.
Anyway, the common theme in all of these links is that if you follow the CouchDB pattern of updating against a _rev, you can't end up with an inconsistent state in your database.
Heck, just to be safe, I could even put a unique constraint on a (clientId, filenumber) index to ensure that the same filenumber is never used twice for a client.
All CouchDB documents are unique, since the _id fields of two documents can't be the same. Check out the view cookbook:
This is an easy one: within a CouchDB database, each document must have a unique _id field. If you require unique values in a database, just assign them to a document’s _id field and CouchDB will enforce uniqueness for you.
There’s one caveat, though: in the distributed case, when you are running more than one CouchDB node that accepts write requests, uniqueness can be guaranteed only per node or outside of CouchDB. CouchDB will allow two identical IDs to be written to two different nodes. On replication, CouchDB will detect a conflict and flag the document accordingly.
Edit based on comment
In a case where you want to increment a field in one document based on the successful insert of another document
You could use separate documents in this case. You insert a document and wait for the success response. Then add another document like:
{_id:'some_id','count':1}
With this you can set up a map/reduce view that simply counts these documents, and you have your counter. Instead of updating a single document, you insert a new document to reflect each successful insert.
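As a sketch of that counting view, the design document below (database, design-document, and field names are assumptions) maps each counter document to its count and lets the built-in _sum reduce add them up:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CounterViewSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical design document: map emits the count of every counter
        // document, the built-in _sum reduce adds them up.
        String designDoc =
            "{\"views\":{\"count\":{"
          + "\"map\":\"function(doc){ if (doc.count) { emit(doc._id, doc.count); } }\","
          + "\"reduce\":\"_sum\"}}}";

        URL url = new URL("http://localhost:5984/mydb/_design/counters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(designDoc.getBytes("UTF-8"));
        }
        System.out.println("HTTP " + conn.getResponseCode()); // 201 when created

        // Query GET /mydb/_design/counters/_view/count to read the running total.
    }
}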
I always end up with the case where a failed file insert would leave the DB in an inconsistent state especially with another client successfully inserting a file at the same time.
Okay, so I already described how you can do updates with separate documents, but even when updating a single document you can avoid inconsistency if you:
Insert a new file
When CouchDB gives a success message, attempt to update the counter.
Why does this work?
This works because when you try to update the counter document, you must supply a _rev string. You can think of _rev as the local state of your document. Consider this scenario:
You read the document that is to be updated.
You change some fields.
Meanwhile, another request has already changed the original document. This means the document now has a new _rev.
But you ask CouchDB to update the document with the stale _rev that you read in step 1.
CouchDB will reject the update with a conflict.
You read the document again, get the latest _rev, and attempt the update again.
So if you do this, you will always be updating against the latest revision of the document. I hope this makes things a bit clearer.
Note:
As pointed out by Daniel, the _rev rules don't apply to bulk updates.
Yes, you can do the same with MongoDB and Couchbase/CouchDB using the proper approach.
First of all, in MongoDB you have unique indexes, which will help you solve part of the problem:
- http://docs.mongodb.org/manual/tutorial/create-a-unique-index/
You also have a pattern for implementing sequences properly:
- http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
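Putting the two together, here is a rough sketch with the older 2.x MongoDB Java driver: a unique index on (clientId, filenumber) plus the counter pattern from the auto-incrementing-field tutorial. The database, collection, and field names are assumptions, not anything prescribed by the docs:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class ClientFileNumbersSketch {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);
        DB db = mongo.getDB("filestore");            // hypothetical database
        DBCollection files = db.getCollection("files");
        DBCollection counters = db.getCollection("counters");

        // Unique index: the same (clientId, filenumber) pair can never appear twice.
        files.ensureIndex(new BasicDBObject("clientId", 1).append("filenumber", 1),
                          new BasicDBObject("unique", true));

        // Counter pattern: atomically increment and fetch the next number for a client.
        DBObject next = counters.findAndModify(
                new BasicDBObject("_id", "client42"),                   // query
                null, null, false,                                      // fields, sort, remove
                new BasicDBObject("$inc", new BasicDBObject("seq", 1)), // update
                true,                                                   // return the new document
                true);                                                  // upsert if missing
        int filenumber = ((Number) next.get("seq")).intValue();

        // Insert the file with the reserved number; the unique index is the safety net.
        files.insert(new BasicDBObject("clientId", "client42")
                .append("filenumber", filenumber)
                .append("name", "report.pdf"));

        mongo.close();
    }
}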
You have many options for implementing cross-document/collection transactions; you can find some good information about this in this blog post:
http://edgystuff.tumblr.com/post/93523827905/how-to-implement-robust-and-scalable-transactions (the two-phase commit is documented in detail here: http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/)
Since you are talking about Couchbase, you can find some patterns here too:
http://docs.couchbase.com/couchbase-devguide-2.5/#providing-transactional-logic

MongoDB retreiving last inserted document [closed]

I'm new to MongoDB and trying to figure out some solutions for basic requirements.
Here is the scenario:
I have an application which saves info to MongoDB through a scheduler. Each time the scheduler runs, I need to fetch the last inserted document from the DB to get the last info id, which I should send with my request to get the next set of info. Also, when I save an object, I need to check whether a specific value for a field is already in the DB; if it is, I just need to update a count in that document, otherwise I save a new one.
My Questions are:
I'm using the MongoDB Java driver. Is it better to generate the object id myself, or to use what is generated by the driver or MongoDB itself?
What is the best way to retrieve the last inserted document to find the last processed info id?
If I do a find for the specific value in a field before each insert, I'm worried about the performance of the application, since it is supposed to have thousands of records. What is the best way to do this validation?
I saw in some places that they talk about two writes when doing inserts: one to write the document to the collection, the other to write it to another collection as "last updated", to keep track of the last inserted entry. Is this a good way? We don't normally do this with relational databases.
Can I expect the generated object ids to be unique for every record inserted in the future too? (I have a doubt about whether they can repeat or not.)
Thanks.
Can I expect the generated object ids to be unique for every record inserted in the future too? (I have a doubt about whether they can repeat or not.)
I'm using the MongoDB Java driver. Is it better to generate the object id myself, or to use what is generated by the driver or MongoDB itself?
If you don't provide an object id for the document, Mongo will automatically generate one for you. All documents must have an _id. Here is the relevant reference.
The relevant part is
ObjectId is a 12-byte BSON type, constructed using:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
http://docs.mongodb.org/manual/reference/object-id/
I guess this is more than enough randomization (although a poor hash choice, since dates/times increase monotonically).
If I do a find for the specific value in a field before each insert, I'm worried about the performance of the application, since it is supposed to have thousands of records. What is the best way to do this validation?
Index the field first (there are no columns in MongoDB; two documents can have different fields).
db.collection_name.ensureIndex({fieldName : 1})
What is the best way to retrieve the last inserted document to find the last processed info id?
Fun fact: we don't need the info id field if we use it once and then delete the document, because the _id field is date-wise sorted. But if the document is regularly updated, then we need to modify it atomically with the find-and-modify operation.
http://api.mongodb.org/java/2.6/com/mongodb/DBCollection.html#findAndModify%28com.mongodb.DBObject,%20com.mongodb.DBObject,%20com.mongodb.DBObject,%20boolean,%20com.mongodb.DBObject,%20boolean,%20boolean%29
Now you have updated the date of the last inserted/modified document. Make sure this field is indexed. Now, using the above link, look at the sort parameter and populate it with:
new BasicDBObject("infoId", -1) // -1 is for descending order
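For illustration, here is a rough sketch with the 2.x Java driver that sorts on an indexed infoId field and atomically marks the newest document in the same findAndModify call; the database, collection, and field names are assumptions:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class LastInsertedSketch {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);
        DB db = mongo.getDB("scheduler");            // hypothetical database
        DBCollection infos = db.getCollection("infos");

        // Make sure the field we sort on is indexed.
        infos.ensureIndex(new BasicDBObject("infoId", 1));

        // Atomically grab the newest document (sorted by infoId descending)
        // and stamp it as processed in the same operation.
        DBObject last = infos.findAndModify(
                new BasicDBObject(),                                      // query: any document
                null,                                                     // fields
                new BasicDBObject("infoId", -1),                          // sort: newest first
                false,                                                    // do not remove
                new BasicDBObject("$set", new BasicDBObject("processed", true)),
                true,                                                     // return the modified doc
                false);                                                   // no upsert

        if (last != null) {
            System.out.println("last infoId = " + last.get("infoId"));
        }
        mongo.close();
    }
}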
I saw in some places that they talk about two writes when doing inserts: one to write the document to the collection, the other to write it to another collection as "last updated", to keep track of the last inserted entry. Is this a good way? We don't normally do this with relational databases.
Terrible idea! Welcome to Mongo. You can do better than this.

Updating lucene index - the most popular way

I am updating a Lucene index once a day. My general strategy is:
Find all objects in the DB that were modified since the last index generation.
Create a new tmp-index for these objects (the old index is still available).
Remove all newly indexed documents (they are in the tmp-index) from the old index using IndexWriter.deleteDocuments(Term).
Merge the old index and the tmp-index using IndexWriter.addIndexes(...).
I found this in the Lucene wiki: "There is no direct update procedure in Lucene..."
I also found this in the Lucene 4.1.0 docs: "A document can be updated with updateDocument..."
I have tried IndexWriter.updateDocument(Term, Document), but then, when performing a search with a filter, I get an NPE from one of my methods, which does not happen when I update the index as described in steps 1-4. Has anyone had a similar problem? How do you update your index?
What I do is basically this:
I keep persistent IndexReader/Readers; they keep the state they had when they were created.
I delete and re-create all the documents again. I think I just do a deleteAll() and then recreate them (addDocument()).
I commit, which will activate all those changes.
I drop all the IndexReaders that I have, so the next time the system requests a Reader, it will be created and stored for subsequent requests.
updateDocument is basically a delete/create, AFAIK.
You might want to use a SearcherManager to get new IndexSearchers as you update the index with an IndexWriter. I see no need for a temporary index.
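Here is a minimal sketch of that combination on Lucene 4.x: updateDocument replaces by a term (an "id" field is assumed here, since the question doesn't show its schema), and a SearcherManager hands out fresh IndexSearchers after each commit, so no temporary index is needed:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class UpdateAndSearchSketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_41,
                        new StandardAnalyzer(Version.LUCENE_41)));
        // SearcherManager tied to the writer; null = no custom SearcherFactory.
        SearcherManager manager = new SearcherManager(writer, true, null);

        // updateDocument = delete all docs matching the term, then add the new one.
        Document doc = new Document();
        doc.add(new StringField("id", "42", Field.Store.YES));
        doc.add(new TextField("body", "updated content", Field.Store.YES));
        writer.updateDocument(new Term("id", "42"), doc);
        writer.commit();

        // Pick up the committed changes and search with a fresh IndexSearcher.
        manager.maybeRefresh();
        IndexSearcher searcher = manager.acquire();
        try {
            int hits = searcher.search(new TermQuery(new Term("id", "42")), 1).totalHits;
            System.out.println("hits: " + hits);
        } finally {
            manager.release(searcher);
        }

        manager.close();
        writer.close();
    }
}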
