When researching how to delete documents in Lucene, I kept being pointed to IndexReader's delete() method, passing in the document ID. Now that I actually need to do this, it appears that Lucene no longer supports this method, and I have had very little luck finding the current way to do it.
Any ideas?
Deletions can now be done with IndexWriter:
http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/index/IndexWriter.html
Doc IDs are internal to Lucene, and really should never be used. They are liable to change without warning, among other issues.
How are you getting the doc IDs? Presumably through a query? Then just delete based on that query. Alternatively, if you have your own unique ID field, you can do writer.deleteDocuments(new Term("MyIDField", "ID to delete"));
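For the delete-by-query variant, a minimal sketch (the field name and value are made up, and writer is assumed to be an already-open IndexWriter):

Query q = new TermQuery(new Term("MyIDField", "ID to delete")); // build whatever query matches the docs to remove
writer.deleteDocuments(q);  // deleteDocuments accepts a Query as well as a Term
writer.commit();            // make the deletions visible to newly opened readers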
I am using Lucene 4.6. I created a Lucene IndexWriter (in CREATE mode), added documents and committed (but didn't close it). Then I ran a search query and stored the results. I then added more documents to the index writer, committed and closed it, and ran a search query again. It returned results with both the new data and the old data; the old data was still present in the index. How can I delete all the data from the index? Is there any way to delete all the documents at once?
It would be better if you could provide a code snippet, but it seems the issue is that you are using OpenMode.CREATE instead of OpenMode.CREATE_OR_APPEND. In that case, each time you create the IndexWriter object the old data is overwritten, not appended.
Also, make sure you are using the latest version. The current one is v4.9.0.
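As a rough sketch of the difference (Lucene 4.x style; the directory path and analyzer are placeholders):

Directory dir = FSDirectory.open(new File("/path/to/index"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND); // CREATE would wipe the existing index instead
IndexWriter writer = new IndexWriter(dir, config);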
The IndexWriter.deleteAll() method will delete all the documents in the index, and you can reuse the same IndexWriter to build the index on new documents, run a search query, and close it later when you are done.
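A minimal sketch, assuming writer is the IndexWriter you already have open:

writer.deleteAll();   // marks every document in the index as deleted
writer.commit();      // the index is now empty
// the same writer can keep going: writer.addDocument(newDoc); writer.commit(); ...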
My current search application uses Lucene for its indexing process. If any documents change, I believe we can start re-indexing from the beginning. Is this correct?
If so, all documents would have to be re-indexed each time new ones are added, which is not practical with a very large amount of content, about 40 million full-text documents.
That's why I am specifically asking: using Lucene, is there any way to index only the documents that have changed, so as to avoid a full re-index?
Any suggestions appreciated. Thank you.
You only need to reindex changed documents; there is no need to reindex everything. IndexWriter has deleteDocuments, which can remove documents by query or term. Then you can reinsert the changed document with addDocument and commit to make this appear atomic.
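A minimal sketch of that update cycle (the "id" field name and variables are made up; writer is assumed to be an open IndexWriter and doc the rebuilt Document for the changed entity):

writer.deleteDocuments(new Term("id", changedId)); // remove the stale copy via your own unique-ID field
writer.addDocument(doc);                           // re-insert the changed document
writer.commit();                                   // readers opened after this point see the change
// IndexWriter.updateDocument(new Term("id", changedId), doc) does the same delete+add in one call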
Also bear in mind that Lucene is just a library and has no idea what kind of external entities are passed for indexing and how/when they change - you, as a developer, are responsible for this.
I am updating my Lucene index once a day. My strategy in general is:
1. Find all objects in the DB that were modified since the last index generation.
2. Create a new tmp-index for these objects (the old index is still available).
3. Remove all newly indexed documents (they are in the tmp-index) from the old index using IndexWriter.deleteDocuments(Term).
4. Merge the old index and the tmp-index using IndexWriter.addIndexes(...), roughly as sketched below.
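Roughly, steps 3-4 look like this (variable names are placeholders; mainWriter is an IndexWriter open on the old index and tmpDir is the Directory holding the tmp-index):

for (String id : modifiedIds) {
    mainWriter.deleteDocuments(new Term("id", id)); // step 3: drop the stale versions from the old index
}
mainWriter.addIndexes(tmpDir);                      // step 4: merge the tmp-index into the old index
mainWriter.commit();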
I have found this in the Lucene wiki: there is no direct update procedure in Lucene...
I have also found this in the Lucene 4.1.0 docs: a document can be updated with updateDocument...
I have tried IndexWriter.updateDocument(Term, Document), but then, when performing a search with a filter, I get an NPE from one of my methods, which does not happen when I update the index as described in steps 1-4. Has anyone had a similar problem? How do you update your index?
What I do is basically this:
1. I keep a persistent IndexReader/Readers; this keeps the state it had when it was created.
2. I start to delete and create all documents once again. I think I just do a deleteAll() and then recreate them (addDocument()).
3. I commit, which activates all those changes.
4. I drop all the IndexReaders I have, so the next time the system requests a reader, it is created and stored for subsequent requests.
The updateDocument is basically a delete/create, afaik.
You might want to use a SearcherManager to get new IndexSearchers as you update the index with an IndexWriter. I see no need for using a temporary index.
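A rough sketch of that pattern (names are placeholders; writer is the long-lived IndexWriter):

SearcherManager manager = new SearcherManager(writer, true, null); // follows the writer's changes

// update path: rewrite the changed document in place, then refresh
writer.updateDocument(new Term("id", changedId), rebuiltDoc);
writer.commit();
manager.maybeRefresh(); // searchers acquired from now on see the update

// search path: always borrow and return searchers through the manager
IndexSearcher searcher = manager.acquire();
try {
    // searcher.search(...);
} finally {
    manager.release(searcher);
}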
I'm building a WebService (CXF with Spring and JPA) to search a read-only database table, i.e., a table in a database for which I only have read permissions; I must not change anything there.
I need to implement a full-text search on some fields of this table, and querying it directly is too slow (it is a books database, with title, author and keywords for each book), so I need to build an index for it.
I'm trying to understand whether Hibernate Search can help me with that, and how I could go about it.
I think it can't, because the documentation says it builds the index when the entities are updated (which never happens in my WebService). But I'm new to all of its terminology, so I may be misinterpreting things.
What would be a good path to go about this?
What should I study first to understand better what I need to do?
Thanks in advance!
I think Hibernate Search can be a perfect fit for your problem, because it allows you to build and keep the index on the file system. This way the database stays untouched. Given that you are already using JPA, it should be extremely easy to enable Hibernate Search. You basically just need to get the Hibernate Search jar file and add it to your project, then annotate the entities you want indexed with @Indexed and the fields you want to index with @Field. Of course that's very simplified, but the online documentation should help you out there. There is a getting started section which explains the basics. Once you get this, you can dive deeper into different analyzers etc. - http://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/
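As a very rough sketch (the Book entity and its fields are made up for illustration; @Indexed and @Field come from org.hibernate.search.annotations, @Entity and @Id from javax.persistence):

@Entity
@Indexed              // tells Hibernate Search to maintain an index for this entity
public class Book {

    @Id
    private Long id;

    @Field            // makes the property available for full-text search
    private String title;

    @Field
    private String author;

    @Field
    private String keywords;

    // getters/setters omitted
}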
I'm trying to understand whether Hibernate Search can help me with that, and how I could go about it. I think it can't, because the documentation says it builds the index when the entities are updated (which never happens in my WebService). But I'm new to all of its terminology, so I may be misinterpreting things.
Well, Search will help you via automatic index updates to keep your index and database in sync for a read/write application. However, Search also has a programmatic API for creating an initial index or manually rebuilding the index whenever you see fit. There are several ways to do that, but it can be as simple as:
EntityManager em = entityManagerFactory.createEntityManager();
FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(em);
fullTextEntityManager.createIndexer().startAndWait(); // the MassIndexer reads the mapped entities from the database and (re)builds the index
Again, the documentation has more examples. I recommend you follow the getting started examples and follow up with concrete questions/problems you encounter.
I'm integrating search functionality into a desktop application, using vanilla Lucene to do so. The application handles potentially thousands of POJOs, each with its own set of key/value(s) properties. When mapping models between my application and Lucene, I originally thought of assigning each POJO a Document and adding the properties as Fields. This approach works great as far as indexing and searching go, but the main downside is that whenever a POJO changes one of its properties I have to reindex ALL the properties again, even the ones that didn't change, in order to update the index. I have been thinking of changing my approach and instead creating a Document per property, assigning the same ID to all the Documents from the same POJO. This way, when a POJO property changes, I only update its corresponding Document without reindexing all the other unchanged properties. I think the graph DB Neo4j follows a similar approach when it comes to indexing, but I'm not completely sure. Could anyone comment on the possible impact on performance, querying, etc.?
It depends fundamentally on what you want to return as a Document in a search result.
But indexing is pretty cheap. Does a changed POJO really have so many properties that reindexing them all is a major problem?
If you only search one field in every search request, splitting one POJO into several documents will speed up reindexing. But it will cause another problem if you search on multiple fields: a single POJO may appear many times in the results.
Actually, I agree with EJP: building the index is very fast on a small dataset.