I am using Lucene 4.6. I created a Lucene IndexWriter (in CREATE mode), added documents, and committed (without closing the writer). Then I ran a search query and stored the results. I added more documents to the IndexWriter, committed, and closed it, then ran a search query again. The results contained both the new data and the old data; the old data was still present in the index. How can I delete all the data from the index? Is there a way to delete all the documents in one go?
It would be better if you could provide a code snippet, but it seems the issue is whether you are using OpenMode.CREATE or OpenMode.CREATE_OR_APPEND: with OpenMode.CREATE, each time you create the IndexWriter object the old data is overwritten, not appended.
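For reference, a minimal sketch of setting the open mode on Lucene 4.x; analyzer and directory are assumed to exist already:

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.Version;

    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);           // wipes the existing index on open
    // config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND); // keeps existing documents
    IndexWriter writer = new IndexWriter(directory, config);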
Also, make sure you are using the latest version; the current release is 4.9.0.
The IndexWriter.deleteAll() method will delete all the documents in the index, and you can reuse the same IndexWriter to build an index over new documents, run a search query, and close it later when you are done.
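Something along these lines, where writer is the IndexWriter you already have open (newDoc stands in for whatever document you build next):

    writer.deleteAll();        // marks every document in the index as deleted
    writer.commit();           // makes the deletion visible to new readers/searchers
    // the same writer can now be reused to index fresh documents
    writer.addDocument(newDoc);
    writer.commit();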
I've been attempting to do the equivalent of an UPSERT (insert, or update if the document already exists) in Solr. I only know what does not work, and the Solr/Lucene documentation I have read has not been helpful. Here's what I have tried:
curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"1","name":{"set":"steve"}}]'
{"responseHeader":{"status":409,"QTime":2},"error":{"msg":"Document not found for update. id=1","code":409}}
I do up to 50 updates in one request, and a request may contain the same id with mutually exclusive fields (title_en and title_es, for example). If there were a way of querying whether a list of IDs exists, I could split the data and perform separate insert and update commands... This would be an acceptable alternative, but is there already a handler that does this? I would like to avoid writing any in-house routines at this point.
Thanks.
With Solr 4.0 you can do a partial update of all those documents, sending just the fields that have changed while keeping the rest of the document the same. The id should match.
Solr does not support UPSERT mechanics out of the box. You can create a record or you can update a record, and the syntax is different for each.
And if you update a record, you must make sure all your other previously inserted fields are stored (not just indexed). Under the covers, an update creates a completely new record pre-populated with the previously stored values. But that functionality is buried very deep (probably in Lucene itself).
Have you looked at DataImportHandler? You reverse the control flow (start from Solr), but it does have support for checking which records need to be updated and which records need to be created.
Or you can just run a Solr query like http://solr.example.com:8983/solr/select?q=id%3A(ID1+ID2+ID3)&fl=id&wt=csv, where you ask Solr to look for your ID records and return only the IDs of the records it does find. Then you can post-process the result to segment your updates and inserts.
My current search application uses Lucene for the indexing process. If any documents change, I believe we can start re-indexing from the beginning. Is this correct?
If so, then all documents have to be re-indexed each time new ones are added, which would not be workable with a very large amount of content, about 40 million full-text documents.
That's why I am specifically asking: using Lucene, is there any way to index only the documents that have changed, so as to avoid a full re-index?
Any suggestions appreciated.
Thank you.
You only need to re-index changed documents; there is no need to re-index everything. IndexWriter has deleteDocuments, which can remove documents by query or term. You can then re-insert the changed document with addDocument and commit to make the change appear atomic.
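A rough sketch of that delete-then-reinsert cycle, assuming each document carries a unique "id" field (writer and changedDoc are assumed to exist):

    // remove the stale copy by its application-level ID, not by Lucene's internal doc ID
    writer.deleteDocuments(new Term("id", "42"));
    // add the changed document back
    writer.addDocument(changedDoc);
    // a single commit makes both operations visible together
    writer.commit();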
Also bear in mind that Lucene is just a library: it has no idea what kind of external entities are passed in for indexing or how and when they change. You, as the developer, are responsible for that.
I am updating my Lucene index once a day. My general strategy is:
Find all objects in the DB that were modified since the last index generation.
Create a new tmp-index for these objects (the old index is still available).
Remove all newly indexed documents (the ones now in the tmp-index) from the old index using IndexWriter.deleteDocuments(Term).
Merge the old index and the tmp-index using IndexWriter.addIndexes(...).
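Steps 3 and 4 in code might look roughly like this (mainWriter on the old index, tmpDirectory holding the tmp-index, and changedIds are all assumptions for illustration):

    // step 3: remove the re-indexed objects from the old index
    for (String id : changedIds) {
        mainWriter.deleteDocuments(new Term("id", id));
    }
    // step 4: merge the tmp-index into the old index
    mainWriter.addIndexes(tmpDirectory);
    mainWriter.commit();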
I have found this in the Lucene wiki: "There is no direct update procedure in Lucene..."
I have also found this in the Lucene 4.1.0 docs: "A document can be updated with updateDocument..."
I have tried IndexWriter.updateDocument(Term, Document), but when I then perform a search with a filter I get an NPE from one of my methods, which does not happen when I update the index as described in steps 1-4. Has anyone had a similar problem? How do you update your index?
What I do is basically this:
I keep a persistent IndexReader/Readers; these keep the state they had when they were created.
I start deleting and re-creating all the documents. I think I just do a deleteAll() and then re-create them (addDocument()).
I commit, which will activate all those changes.
I drop all the IndexReaders that I have, so the next time the system requests a reader it will be created and stored for subsequent requests.
The updateDocument is basically a delete/create, afaik.
You might want to use a SearcherManager to get new IndexSearchers as you update the index with an IndexWriter. I see no need for a temporary index.
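A minimal SearcherManager sketch for Lucene 4.x (the writer and query are assumed to exist):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.SearcherFactory;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.search.TopDocs;

    // share one manager; true = apply deletes when refreshing
    SearcherManager manager = new SearcherManager(writer, true, new SearcherFactory());

    // after each writer.commit() (or on a timer), pick up the latest changes
    manager.maybeRefresh();

    IndexSearcher searcher = manager.acquire();
    try {
        TopDocs hits = searcher.search(query, 10);
    } finally {
        manager.release(searcher); // always release what you acquire
    }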
When doing research on deleting documents in Lucene, I kept being told to use the IndexReader delete() method, passing in the document ID. Now that I actually need to do this, it appears that Lucene no longer supports this method, and I have had very little luck finding the current way to do it.
Any ideas?
Deletions can now be done through IndexWriter:
http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/IndexWriter.html
Doc IDs are internal to Lucene, and really should never be used. They are liable to change without warning, among other issues.
How are you getting the doc IDs? Presumably through a query? Then just delete based on that query. Alternatively, if you have your own unique ID field, you can do writer.deleteDocuments(new Term("MyIDField", "ID to delete"));
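For example, assuming your ID field really is unique, either form works on an IndexWriter:

    // delete by term on your own unique ID field
    writer.deleteDocuments(new Term("MyIDField", "ID to delete"));
    // or delete everything matching an arbitrary query
    writer.deleteDocuments(new TermQuery(new Term("MyIDField", "ID to delete")));
    writer.commit();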
In my project we use Lucene 2.4.1 for full-text search. It is a J2EE project and the IndexSearcher is created once. In the background, the index is refreshed every couple of minutes (when the content changes). Users can search the index through a search mechanism on the page.
The problem is, the results returned by Lucene seem to be cached somehow.
This is the scenario I noticed:
I start the application and search for 'keyword' - 6 results are returned,
the index is refreshed; using Luke I see that there are now 8 results for the query 'keyword',
I search again using the application - again, 6 results are returned.
I analyzed our configuration and haven't found any caching anywhere. I have debugged the search, and there is no caching in our code; searcher.search returns 6 results.
Does Lucene cache results internally somehow? What properties etc. should I check?
To see changes made by IndexWriters against an index for which you have an open IndexReader, be sure to call IndexReader.reopen() to pick up the latest changes.
Make sure also that your IndexWriter is committing the changes, either through an explicit commit(), a close(), or having autoCommit set to true.
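A sketch of the reader side, using the reopen() API available in the 2.4.x line (the shared reader and searcher fields are assumed):

    IndexReader newReader = reader.reopen();  // cheap if nothing has changed
    if (newReader != reader) {
        reader.close();                       // release the stale reader
        reader = newReader;
        searcher = new IndexSearcher(reader); // rebuild the shared searcher
    }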
In versions prior to 2.9.0, Lucene automatically cached the results of queries. In later releases there is no caching unless you wrap your query in a QueryFilter and then wrap the result in a CachingWrapperFilter. You could consider switching to a release >= 2.9.0 if reopening the index becomes a problem.
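For completeness, opt-in caching in the newer releases looks roughly like this (sketched with the 3.x class names, where QueryWrapperFilter plays the role of QueryFilter; criteriaQuery and mainQuery are assumptions):

    // cache the filter's matching documents per index reader
    Filter cached = new CachingWrapperFilter(new QueryWrapperFilter(criteriaQuery));
    TopDocs hits = searcher.search(mainQuery, cached, 10);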
One more note: for the IndexReader to see documents updated by other threads in real time, the "read-only" parameter has to be false when the IndexReader is initialized. Otherwise, the reopen() method will not work.