I have been working with Lucene for about a year, and today I stumbled on something strange.
I was updating my index using the usual Lucene mechanism of fetching the document, deleting the old document, and then re-indexing it.
So:
1. Fetched the document to update from the Lucene index and kept this doc in a list.
2. Removed the document from the index.
3. Updated some of the fields on the doc from the list and then re-indexed it.
But I found that the updated document, once indexed, had duplicate values for its original fields.
For example, suppose there was a field id:1. I did not update this field, changed other content in the document, and then indexed the doc.
I found that id:1 now appeared twice in the same document. Worse, every further re-index of the same document created yet another copy of the field within that single document.
How should I get rid of this duplication?
I got this working by changing how I build the document that is re-indexed: from the document I fetched from the index, I took out all the fields, created a fresh new document, added those fields to it, and then re-indexed this new document, which was indexed properly without any duplication.
I was not able to find the exact cause, but the document fetched from the index carries an internal docId, and re-indexing that same object seems to trigger some internal duplication, which must have caused the problem.
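For reference, here is a minimal sketch of that rebuild step, assuming a recent Lucene version, that all fields are stored string fields, and that the field names ("id", "content") match your schema:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexableField;

    // Copy the stored fields of a document fetched via IndexSearcher.doc(docId)
    // into a brand-new Document, swapping in the updated content.
    static Document rebuild(Document stored, String newContent) {
        Document fresh = new Document();
        for (IndexableField f : stored.getFields()) {
            if (f.name().equals("content")) continue; // this field is replaced below
            // Lucene does not store field types, so they must be recreated;
            // StringField here is an assumption about your schema.
            fresh.add(new StringField(f.name(), f.stringValue(), Field.Store.YES));
        }
        fresh.add(new TextField("content", newContent, Field.Store.YES));
        return fresh;
    }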
Let's say I have a collection called root.
Can I create a document with its subcollection in one call?
I mean if I do:
db.Collection("root").document("doc1").Collection("sub").document("doc11").Set(data)
Then will that create the structure in one shot?
To be honest, I gave this a try, and doc1 showed up with an italicized name, which I thought happened only for deleted docs.
The code you shared doesn't create an actual document. It merely "reserves" the ID for a document in root and then creates a subcollection under it with an actual doc11 document.
Seeing a document name in italics in the Firestore console indicates that there is no physical document at that location, but that there is data under the location. This is most typical when you've deleted a document that previously existed, but your code is another way of accomplishing the same thing.
There is no way to create two documents in one call, although you can create multiple documents in a single transaction or batch write if you want.
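If you want doc1 to physically exist as well, a batched write is the usual approach. A minimal sketch with the Firestore Java Admin SDK (the parent payload is a placeholder of my own, not something the API requires):

    import com.google.cloud.firestore.DocumentReference;
    import com.google.cloud.firestore.Firestore;
    import com.google.cloud.firestore.WriteBatch;
    import java.util.Map;

    // Create root/doc1 and root/doc1/sub/doc11 in a single batched write,
    // so doc1 becomes a real document instead of a reserved ID.
    static void createBoth(Firestore db, Map<String, Object> data) throws Exception {
        WriteBatch batch = db.batch();
        DocumentReference parent = db.collection("root").document("doc1");
        DocumentReference child = parent.collection("sub").document("doc11");
        batch.set(parent, Map.of("placeholder", true)); // hypothetical parent payload
        batch.set(child, data);
        batch.commit().get(); // wait until the whole batch is committed
    }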
After overwriting a document with an updated version, its score changes, even if the query doesn't target any of the changed fields.
I've found out that overwriting a document causes the old one to be flagged as deleted and a new one to be created (Elasticsearch Reference [1.7] » Document APIs » Update API). As a consequence, maxDocs increases by one and both documents are counted for IDF (inverse document frequency). Is there any workaround for this issue?
Let's consider the following situation: an "article" document has two fields, content (string) and views (int). The views field is not indexed; it records how many times the article has been read.
From the official docs:
We also said that documents are immutable: they cannot be changed,
only replaced. The update API must obey the same rules. Externally, it
appears as though we are partially updating a document in place.
Internally, however, the update API simply manages the same
retrieve-change-reindex process that we have already described.
But what if we do a partial update of a non-indexed field - will Elasticsearch reindex the entire document? For example, I want to update views every time someone reads an article. If the entire document is reindexed, I can't update in real time (it's too heavy an operation), so I would have to work with a delay, for example updating all articles the visitors have read every 3-5-10 minutes. Or am I misunderstanding something?
But what if we do a partial update of a non-indexed field - will Elasticsearch reindex the entire document?
Yes. Whilst the views field is not indexed individually, it is part of the _source field. The _source field contains the original JSON you sent to Elasticsearch when you indexed the document, and it is returned in the results if the document matches a search. The _source field is stored with the document in Lucene. Your update script changes the _source field, so the whole document will be re-indexed.
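As an illustration, a scripted partial update might look like this with the Java high-level REST client (a newer client than the 1.7 reference quoted above; the index name "articles" and the client setup are assumptions of the example):

    import java.io.IOException;
    import org.elasticsearch.action.update.UpdateRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.script.Script;

    // Increment the non-indexed "views" counter. Only _source changes,
    // but Elasticsearch still re-indexes the whole document internally.
    static void incrementViews(RestHighLevelClient client, String articleId) throws IOException {
        UpdateRequest request = new UpdateRequest("articles", articleId)
                .script(new Script("ctx._source.views += 1"));
        client.update(request, RequestOptions.DEFAULT);
    }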
Could you then evaluate the following strategy: every time someone reads the article, I send an update to Elasticsearch, but I set refresh_interval to 30 seconds. Will this strategy hold up if about 1000 users read one article during a 30-second interval?
You are still indexing the 1000 documents: 1 document will be indexed as the current document, and the other 999 will be indexed, marked as deleted, and removed from the index during the next Lucene merge.
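For completeness, setting the refresh interval with the same Java client could look like this (again, the index name is an assumption):

    import java.io.IOException;
    import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.common.settings.Settings;

    // Make updates visible to search only every 30 seconds, trading
    // freshness for fewer, cheaper refreshes under heavy update load.
    static void relaxRefresh(RestHighLevelClient client) throws IOException {
        UpdateSettingsRequest request = new UpdateSettingsRequest("articles")
                .settings(Settings.builder().put("index.refresh_interval", "30s"));
        client.indices().putSettings(request, RequestOptions.DEFAULT);
    }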
I'm stuck on one question in my understanding of the Elasticsearch indexing process. I've already read this article, which says that the inverted index stores all tokens of all documents and is immutable, so to update it we must remove it and reindex all the data to keep every document searchable.
But I've also read about partially updating documents (automatically marking the old one as "deleted" and inserting+indexing a new one), and that article makes no mention of reindexing all the previous data.
So here is what I don't properly understand: when I update a document (a text document with 100,000 words) and already have some other indexed documents in storage, is it true that every UPDATE or INSERT operation triggers a reindex of all my documents?
Basically, I rely on the default Elasticsearch settings (5 primary shards with one replica per shard, and 2 nodes in the cluster).
You can just have that one document updated (that is, reindexed, which is basically the same as removing it from the index and adding it again); see: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/update-doc.html This will take care of the whole index, so you won't need to reindex every other document.
I'm not sure what you mean by a "save" operation; you may want to clarify it with an example.
As for the time required to update a document of 100K words, I suggest you try it out and measure.
I'm trying to update an indexed document in Lucene by searching for the document, extracting the indexed document's fields, then deleting the document and creating a new one.
Is there another, more efficient way to do such an update?
There isn't. The best you can get is IndexWriter.updateDocument(Term term, Iterable<? extends IndexableField> document), but even this deletes the old document and adds the new one under the hood.
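A minimal sketch of that call, assuming a recent Lucene version, an "id" field that uniquely identifies the document, and an index at a path of my choosing:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class UpdateExample {
        public static void main(String[] args) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/tmp/index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("id", "42", Field.Store.YES));
                doc.add(new TextField("content", "updated text", Field.Store.YES));
                // Atomically deletes any document whose "id" term is "42"
                // and adds the replacement document in a single call.
                writer.updateDocument(new Term("id", "42"), doc);
            }
        }
    }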