In my project we use Lucene 2.4.1 for full-text search. It is a J2EE project and the IndexSearcher is created once. In the background, the index is refreshed every couple of minutes (when the content changes). Users can search the index through a search mechanism on the page.
The problem is, the results returned by Lucene seem to be cached somehow.
This is the scenario I noticed:
I start the application and search for 'keyword' - 6 results are returned,
the index is refreshed; using Luke I see that there are now 8 results for the query 'keyword',
I search again using the application - again 6 results are returned.
I analyzed our configuration and haven't found any caching anywhere. I have debugged the search, and there is no caching in our code; searcher.search returns 6 results.
Does Lucene cache results internally somehow? What properties etc. should I check?
To see changes made by IndexWriters against an index for which you have an open IndexReader, be sure to call IndexReader.reopen() to pick up the latest changes.
Also make sure that your IndexWriter is committing the changes, either through an explicit commit(), a close(), or by having autoCommit set to true.
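As a minimal sketch of that commit-then-reopen cycle (Lucene 2.4.x API; the writer, reader, and searcher variables are assumed names from your setup):

    writer.commit(); // or writer.close(); uncommitted changes stay invisible to readers

    IndexReader newReader = reader.reopen();
    if (newReader != reader) {  // reopen() returns a new reader only if the index changed
        reader.close();         // release the stale reader
        reader = newReader;
        searcher = new IndexSearcher(reader); // a searcher sees its reader's point-in-time view
    }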
With versions prior to 2.9.0, Lucene automatically cached the results of queries. With later releases there is no caching unless you wrap your query in a QueryFilter and then wrap the result in a CachingWrapperFilter. You could consider switching to a release >= 2.9.0 if reopening the index becomes a problem.
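If you do want that opt-in caching on the later releases, a hedged sketch of the wrapping just described (searcher, mainQuery, and filterQuery are assumed names):

    Filter cached = new CachingWrapperFilter(new QueryFilter(filterQuery));
    TopDocs hits = searcher.search(mainQuery, cached, 10);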
One more note: for an IndexReader to see documents updated by other threads in real time, the read-only parameter has to be false when the IndexReader is initialized; otherwise the reopen() method will not work.
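For example (Lucene 2.4 API; directory is an assumed name):

    IndexReader reader = IndexReader.open(directory, false); // readOnly = false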
We currently have a process that can be summarized as follows:
Insert a list of entity A from a batch load process.
Update the status of those entities after a specified date has passed.
We use Hibernate Search to index some of the properties of entity A. However, we also have a requirement that we don't index the entity until the status has been updated.
Currently, we check at indexing time with an EntityIndexingInterceptor whether or not to exclude the entity based on its status.
The problem is that we don't index the status field itself - so when it changes, Hibernate's transparent mechanism of adding the entity to the index isn't triggered, and it never gets added.
Is there a better way to force Hibernate to add the entity to the index without indexing the status field itself? We currently rebuild the index nightly, which is usually OK, but that still leaves a window where an entity may not be searchable until the next rebuild.
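For reference, the interceptor check described above looks roughly like this; EntityA, its getStatus() accessor, and the Status.PROCESSED value are assumed names, not taken from the actual code:

    import org.hibernate.search.indexes.interceptor.EntityIndexingInterceptor;
    import org.hibernate.search.indexes.interceptor.IndexingOverride;

    public class StatusInterceptor implements EntityIndexingInterceptor<EntityA> {
        @Override
        public IndexingOverride onAdd(EntityA entity) {
            // skip indexing until the status has been updated
            return entity.getStatus() == Status.PROCESSED
                    ? IndexingOverride.APPLY_DEFAULT : IndexingOverride.SKIP;
        }
        @Override
        public IndexingOverride onUpdate(EntityA entity) {
            return entity.getStatus() == Status.PROCESSED
                    ? IndexingOverride.UPDATE : IndexingOverride.REMOVE;
        }
        @Override
        public IndexingOverride onDelete(EntityA entity) {
            return IndexingOverride.APPLY_DEFAULT;
        }
        @Override
        public IndexingOverride onCollectionUpdate(EntityA entity) {
            return onUpdate(entity);
        }
    }

The interceptor is registered on the entity with @Indexed(interceptor = StatusInterceptor.class).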
Which version of Hibernate Search are you using? Dirty checking optimization should be automatically disabled when using interceptors:
Dirty checking optimization is disabled when interceptors are used. Dirty checking optimization does check what has changed in an entity and only triggers an index update if indexed properties are changed. The reason is simple, your interceptor might depend on a non indexed property which would be ignored by this optimization.
(From the documentation)
If it isn't, please report the issue with a reproducer, or at least mention the exact version of Hibernate Search/Hibernate ORM you are using: JIRA, test case templates
Until we fix this (if it's actually a bug), you can always disable the dirty checking optimization explicitly:
hibernate.search.enable_dirty_check = false
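A minimal sketch of setting it programmatically (the persistence-unit name "myPU" is hypothetical; the property can equally go into persistence.xml or hibernate.properties):

    Map<String, Object> props = new HashMap<String, Object>();
    props.put("hibernate.search.enable_dirty_check", "false");
    EntityManagerFactory emf = Persistence.createEntityManagerFactory("myPU", props);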
I've been working with LSI and GSI in DynamoDB, but I guess I'm missing something.
I created an index so that I can always query for the latest results by attributes other than the partition key, and without reading entire items, only the attributes that really matter. But with the GSI, at some point my query returns data that is not up to date; I understand this is due to the eventual consistency described in the docs (correct me if I'm wrong).
And what about the LSI? Even using ConsistentRead, at some point my data is not queried correctly and the results are not up to date. From the docs I thought LSIs were updated synchronously with their table, and that with the ConsistentRead property set I'd always get the latest results, but this is not happening.
I'm using a REST endpoint (API Gateway) to perform inserts into my DynamoDB table (I apply some processing before the insertion), so I've been wondering if this has something to do with it. Maybe the code (currently Java) or DynamoDB is slow to update, and since everything in my endpoint seems to work fine, I send the next request too fast; or maybe I have to wait a little longer before interacting with the table because the index is still being updated. But I have already tested waiting longer and I receive the same wrong results. I'm a bit lost here.
This is the code I'm using to query the index:
    import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
    import com.amazonaws.services.dynamodbv2.document.utils.NameMap;
    import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

    QuerySpec spec = new QuerySpec()
            .withKeyConditionExpression("#c = :v_attrib1 and #e = :v_attrib2")
            .withNameMap(new NameMap()
                    .with("#c", "attrib1")
                    .with("#e", "attrib2"))
            .withValueMap(new ValueMap()
                    .withString(":v_attrib1", attrib1Value)
                    .withString(":v_attrib2", attrib2Value))
            .withMaxResultSize(1)         // to only bring the latest one
            .withConsistentRead(true)     // is this wrong?
            .withScanIndexForward(false); // what about this one?
I don't know if the Maven library version could interfere, but in any case the version I'm using is 1.11.76 (I know there are a lot of newer versions; if this is the problem, we'll update it).
Thank you all in advance.
After searching for quite some time and running some more tests, I finally figured out that the problem was not in the DynamoDB indexes (they work as expected) but in the Lambda functions.
The fact that I was sending a lot of requests, one right after another, was not giving the indexes time to stay up to date: the Lambda functions execute asynchronously (I should have known), so the requests received by the database were not ordered and my data was not updated properly. We therefore changed our implementation to use atomic counters, which keep the data correct no matter the number or the order of the requests.
See: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
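A minimal sketch of such an atomic counter update with the Document API (the table name "myTable", key attribute "id", and counter attribute "itemCount" are assumed names):

    Table table = dynamoDB.getTable("myTable");
    UpdateItemSpec update = new UpdateItemSpec()
            .withPrimaryKey("id", itemId)
            .withUpdateExpression("SET itemCount = itemCount + :incr")
            .withValueMap(new ValueMap().withNumber(":incr", 1));
    table.updateItem(update); // applied atomically, regardless of request order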
I am using Lucene 4.6. I created a Lucene IndexWriter (in CREATE mode), added documents, and committed it (didn't close it). Then I ran a search query and stored the results. I added more documents to the index writer, committed it, and closed it. When I ran a search query on it, the results contained both the new and the old data; the old data was still present in the index. Is there a way to delete all the data from the index, i.e., to delete all the documents at once?
It would be better if you could provide us a code snippet, but it seems the issue is that you are using OpenMode.CREATE instead of OpenMode.CREATE_OR_APPEND. With CREATE, each time you create the IndexWriter object the old data is overwritten, not appended.
Also, make sure you are using the latest version; the current one is v4.9.0.
The IndexWriter.deleteAll() method deletes all the documents in the index. You can then reuse the same IndexWriter to build the index for the new documents, run a search query, and close it later when you are done.
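A minimal sketch (Lucene 4.6 API; dir, analyzer, and doc are assumed names):

    IndexWriter writer = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_46, analyzer));
    writer.deleteAll();      // drops every document in the index
    writer.commit();         // makes the deletion visible to newly opened readers
    writer.addDocument(doc); // reuse the same writer for the new documents
    writer.commit();
    writer.close();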
My current search application uses Lucene for the indexing process. If any documents change, I believe we have to start re-indexing from the beginning. Is this correct?
If so, then all documents have to be re-indexed each time new ones are added, which is not feasible for a very large body of content, about 40 million full-text documents.
That's why I am specifically asking: with Lucene, is there any way to index only the documents that have changed, so as to avoid the full re-indexing?
Any suggestions appreciated. Thank you.
You only need to reindex changed documents; there is no need to reindex everything. IndexWriter has deleteDocuments, which can remove documents by query or term. Then you can reinsert the changed document with addDocument and commit to make this appear atomic.
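A minimal sketch of that delete-then-add cycle, assuming each document carries a unique "id" field (writer, changedId, and buildDocument are assumed names):

    Term idTerm = new Term("id", changedId);
    writer.deleteDocuments(idTerm);               // remove the stale copy
    writer.addDocument(buildDocument(changedId)); // reinsert the new version
    writer.commit();                              // publish the change
    // equivalent one-call form: writer.updateDocument(idTerm, newDoc);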
Also bear in mind that Lucene is just a library and has no idea what kind of external entities are passed for indexing or how and when they change - you, as the developer, are responsible for that.
I am updating a Lucene index once a day. My strategy in general is:
1. Find all objects in the DB that were modified since the last index generation.
2. Create a new tmp-index for these objects (the old index is still available).
3. Remove all newly indexed documents (they are in the tmp-index) from the old index using IndexWriter.deleteDocuments(Term).
4. Merge the old index and the tmp-index using IndexWriter.addIndexes(...), as sketched below.
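A minimal sketch of steps 3-4 (Lucene 4.1 API; mainDir, tmpDir, analyzer, modifiedIds, and the "id" field are assumed names):

    IndexWriter main = new IndexWriter(mainDir,
            new IndexWriterConfig(Version.LUCENE_41, analyzer));
    for (String id : modifiedIds) {
        main.deleteDocuments(new Term("id", id)); // step 3: drop the stale documents
    }
    main.addIndexes(tmpDir); // step 4: merge the tmp-index into the old index
    main.commit();
    main.close();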
I found this in the Lucene wiki: There is no direct update procedure in Lucene...
I also found this in the Lucene 4.1.0 docs: A document can be updated with updateDocument...
I have tried IndexWriter.updateDocument(Term, Document), but then, when performing a search with a filter, I get an NPE from one of my methods, which does not happen when I update the index as described in steps 1-4. Has anyone had a similar problem? How do you update your index?
What I do is basically this:
I keep a persistent IndexReader/Readers; these keep the state they had when they were created.
I delete and recreate all documents once again: I think I just do a deleteAll() and then recreate them (addDocument()).
I commit, which will activate all those changes.
I drop all IndexReaders that I have, so the next time the system requests a reader, it will be created and stored for subsequent requests.
updateDocument() is basically a delete/create, afaik.
You might want to use a SearcherManager to get new IndexSearchers as you update the index with an IndexWriter. I see no need for using a temporary index.
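A minimal sketch of that pattern (Lucene 4.x API; writer, query, docId, and newDoc are assumed names):

    SearcherManager manager = new SearcherManager(writer, true, new SearcherFactory());

    // search side: always acquire and release
    IndexSearcher searcher = manager.acquire();
    try {
        TopDocs hits = searcher.search(query, 10);
    } finally {
        manager.release(searcher);
    }

    // update side: write, commit, then refresh the manager
    writer.updateDocument(new Term("id", docId), newDoc);
    writer.commit();
    manager.maybeRefresh(); // subsequent acquire() calls see the changes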