I'm trying to update an indexed document in Lucene by searching for the document, extracting its indexed fields, deleting it, and then creating a new one.
Is there another, more effective way to do such an update?
There isn't. The best you can get is IndexWriter.updateDocument(Term term, Iterable<? extends IndexableField> document), but even this deletes the old document and adds the new one again.
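A minimal sketch of that call, assuming a recent Lucene version (the index path and field names here are made up):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.FSDirectory;

    public class UpdateExample {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/tmp/index")),   // made-up path
                    new IndexWriterConfig(new StandardAnalyzer()));

            Document doc = new Document();
            doc.add(new StringField("id", "42", Field.Store.YES));          // exact-match key
            doc.add(new TextField("content", "the new text", Field.Store.YES));

            // Deletes whatever matches the term, then adds doc - a single call,
            // but internally still a delete followed by an add.
            writer.updateDocument(new Term("id", "42"), doc);
            writer.commit();
            writer.close();
        }
    }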
Let's say I have a collection called root.
Can I create document with its subcollection in one call?
I mean if I do:
db.Collection("root").document("doc1").Collection("sub").document("doc11").Set(data)
Then will that create the structure in one shot?
To be honest, I gave this a try and doc1 got an italicized heading, which I thought was only for deleted docs.
The code you shared doesn't create an actual document. It merely "reserves" the ID for a document in root and then creates a sub collection under it with an actual doc11 document.
Seeing a document name in italics in the Firestore console indicates that there is no physical document at that location, but that there is data under it. This is most typical when you've deleted a document that previously existed, but your code is another way of accomplishing the same thing.
There is no way to create two documents in one call, although you can create multiple documents in a single transaction or batch write if you want.
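For example, with the server-side Java client, a batch write could create both documents atomically. This is a sketch; the field contents are assumptions, not required values:

    import com.google.api.core.ApiFuture;
    import com.google.cloud.firestore.*;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    Firestore db = FirestoreOptions.getDefaultInstance().getService();
    WriteBatch batch = db.batch();

    DocumentReference parent = db.collection("root").document("doc1");
    DocumentReference child  = parent.collection("sub").document("doc11");

    Map<String, Object> parentData = new HashMap<>();
    parentData.put("created", true);            // gives doc1 real data, so no italics
    Map<String, Object> childData = new HashMap<>();
    childData.put("name", "doc11");

    batch.set(parent, parentData);
    batch.set(child, childData);

    // Both writes succeed or fail together.
    ApiFuture<List<WriteResult>> result = batch.commit();
    result.get();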
I'm stuck on one question in my understanding of the Elasticsearch indexing process. I've already read this article, which says that the inverted index stores all tokens of all documents and is immutable. So, to update it we must remove it and reindex all the data to keep every document searchable.
But I've also read about partially updating documents (automatically marking the old version as "deleted" and inserting+indexing a new one). Those articles made no mention of reindexing all the previous data.
So I don't properly understand the following: when I update a document (a text document with 100,000 words) and already have other indexed documents in storage, is it true that every UPDATE or INSERT operation triggers a reindexing of all my documents?
Basically I rely on the default Elasticsearch settings (5 primary shards with one replica per shard and 2 nodes in the cluster).
You can just have a single document updated (that is, reindexed, which is basically the same as removing it from the index and adding it again), see: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/update-doc.html Elasticsearch takes care of the index maintenance internally, so you won't need to reindex every other document.
I'm not sure what you mean by "save" operation, you may want to clarify it with an example.
As for the time required to update a document of 100K words, I suggest you try it out.
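For reference, a hedged sketch of a partial update in Java against the plain REST API (the index name, document id, and field are made up; the /_update path shown is the Elasticsearch 7+ form, older versions used /index/type/id/_update):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    URL url = new URL("http://localhost:9200/myindex/_update/1");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);

    // Only the changed field is sent; Elasticsearch fetches the stored source,
    // merges the change, and reindexes this one document - nothing else.
    String body = "{\"doc\": {\"title\": \"updated title\"}}";
    try (OutputStream os = conn.getOutputStream()) {
        os.write(body.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("HTTP " + conn.getResponseCode());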
I would like to know how the Search API stores Documents internally. Does it store a Document in the Datastore with a "Document" kind, or something else? Also, where are indexes stored? In memcache?
Documents and indexes are stored in HR Datastore
Documents and indexes are saved in a separate persistent store optimized for search operations.
The Document class represents documents. Each document has a document identifier and a list of fields.
It's all in Google's documentation
I have been working with Lucene for about a year, and today I suddenly discovered something weird about it.
I was updating my index using the normal Lucene mechanism of fetching the document, deleting the old document, and then reindexing it.
So:
1. Fetched the document to update from the Lucene index and kept this doc in a list.
2. Removed the document from the index.
3. Updated some fields of the doc from the list and then re-indexed the document.
But then I found that the updated document that was indexed had duplicate values for the original document's fields.
Suppose there was a field id:1; I didn't update this field, only updated other content in the document, and then indexed the doc.
I found that id:1 appeared twice in the same document. And if I re-index the same document again, the field gets created that many times within the single document.
How should I get rid of this duplication?
I had to make a modification to the document being re-indexed: from the document I fetched from the index I took out all the fields, created a fresh new document, added those fields to it, and then re-indexed this new document, which got indexed properly without any duplication.
I was not able to find the exact cause, but the document fetched from the index carries a docId, and because of this, when it is re-indexed, some internal duplication might be taking place, which must have caused the problem.
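A sketch of the fix described above, assuming an "id" key field and a "content" field being replaced (the names are illustrative, and searcher, scoreDoc, writer, and updatedContent come from the surrounding code):

    // The document fetched from the index only carries its stored fields,
    // and their original field types are lost, so re-add each one explicitly.
    Document oldDoc = searcher.doc(scoreDoc.doc);
    Document freshDoc = new Document();
    for (IndexableField f : oldDoc.getFields()) {
        if ("content".equals(f.name())) continue;       // skip the field we replace
        freshDoc.add(new StringField(f.name(), f.stringValue(), Field.Store.YES));
    }
    freshDoc.add(new TextField("content", updatedContent, Field.Store.YES));

    // Replace the old document instead of adding a second copy of it.
    writer.updateDocument(new Term("id", freshDoc.get("id")), freshDoc);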
I'd like to implement a filter/search feature in my application using Lucene.
Querying a Lucene index gives me a Hits instance, which is nothing more than a list of Documents matching my criteria.
Since I generate the indexed Documents from my objects, what is the best way to find the original object related to a specific Lucene Document?
A better description of my situation:
Three model classes for now: Folder (can have other Folders or Lists as children), List (can have Tasks as children) and Task (can have other Tasks as children). They are all DefaultMutableTreeNode subclasses. I'll add the Tag entity in the future.
Each Task has a text, a start date, a due date, and some boolean flags.
They are displayed in a JTree.
The whole tree is saved in an XML file.
I'd like to do things like these:
1. Search Tasks with Google-like queries.
2. Find all Tasks that start today.
3. Filter Tasks by Tag.
You can't, not with vanilla Lucene. You said yourself that you converted your objects into Documents and then stored the Documents in Lucene; how would you imagine that process would be reversible?
If you want to store and retrieve your own objects in Lucene, I strongly recommend that you use Compass instead. Compass is to Lucene what Hibernate is to JDBC - you define a mapping between your objects and Lucene documents, Compass takes care of the conversion.
Add a "stored" field that contains an object identifier. For each hit, lookup the original object via the identifier.
Without knowing more context, it's hard to be more specific.
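One possible shape for that suggestion, sketched in Java (getId() and getText() are hypothetical accessors on your Task class, and modern Lucene returns TopDocs rather than Hits):

    // Index time: store only the identifier; keep the real objects in a map.
    Map<String, Task> tasksById = new HashMap<>();
    tasksById.put(task.getId(), task);

    Document doc = new Document();
    doc.add(new StringField("id", task.getId(), Field.Store.YES));   // stored key
    doc.add(new TextField("text", task.getText(), Field.Store.NO));  // searchable only
    writer.addDocument(doc);

    // Search time: resolve each hit back to the original object.
    TopDocs top = searcher.search(query, 10);
    for (ScoreDoc sd : top.scoreDocs) {
        Task original = tasksById.get(searcher.doc(sd.doc).get("id"));
    }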