I am using Lucene for search in one of my projects. It's running as a separate service on a port: whenever a query comes in, a request is sent to this server and it returns a map of results.
My problem is that it stops working after some time. It works fine for a day or so, but then it stops returning results (i.e. the service is still running, but it returns 0 results). To get it working again, I have to restart the service, and then it works fine again.
Please suggest some solution. I'll be happy to provide more info if needed.
Thanks.
If I had to guess at an easy mistake that could cause such behavior: maybe you're opening a bunch of IndexWriters or IndexReaders as time goes on and not closing them correctly, and so running out of the file descriptors available on your server. See if 'lsof' shows a lot of open descriptors on '.cfs', '.fdx' and/or '.fdt' files ('ulimit -n' can be used to see the maximum).
One thing to note about the IndexSearcher, which I've seen cause problems:
Closing the searcher may not close the underlying reader. If you pass a reader into the searcher, it won't be closed when you close the searcher (since in that case, it may be in use by other objects).
An example of this:
// Assume I have an IndexWriter named indexWriter, which I reuse.
IndexSearcher searcher = new IndexSearcher(IndexReader.open(indexWriter, true));
// ... use the searcher ...
searcher.close();
// The searcher is closed, but the underlying IndexReader remains open.
Now this has accumulated an unclosed reader and left some index file descriptors open. Repeat this pattern enough times, and over time the service will stop responding.
That's one example of such an error, anyway.
It can be fixed by simply closing the reader when you close the searcher, something like searcher.getIndexReader().close(). Better solutions exist, though, such as reusing the reader and refreshing it when the index contents change.
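For illustration, here's a minimal sketch of that fix, assuming the same reused IndexWriter as in the snippet above and the pre-4.0 Lucene API it uses (where IndexSearcher still has a close() method):

IndexReader reader = IndexReader.open(indexWriter, true); // NRT reader from the shared writer
IndexSearcher searcher = new IndexSearcher(reader);
try {
    // ... run queries with searcher ...
} finally {
    searcher.close();
    reader.close(); // explicitly close the underlying reader as well
}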
Don't know if this is the exact problem you are having, but might be worth noting.
In some automated tests, I am trying to delete and immediately recreate an index at the start of every test, using Elasticsearch's high-level REST client (version 6.4), as follows:
DeleteIndexRequest deleteIndexRequest = new DeleteIndexRequest(indexName);
deleteIndexRequest.indicesOptions(IndicesOptions.lenientExpandOpen());
client.indices().delete(deleteIndexRequest, RequestOptions.DEFAULT);
CreateIndexRequest createIndexRequest = new CreateIndexRequest(indexName);
createIndexRequest.mapping("_doc", "{...}", XContentType.JSON);
client.indices().create(createIndexRequest, RequestOptions.DEFAULT);
The problem I have is that, intermittently, my tests fail at the point of creating the index, with an error:
{"error": {"root_cause":[{"type":"resource_already_exists_exception","reason":"index [(index-name)/(UUID)] already exists, ...,}] "status":400}
The more tests I run, the more likely I am to see the error, which seems to be a strong indicator that it's a race condition - presumably when I try to recreate the index, the previous delete operation hasn't always completed.
This is backed up by the fact that if I put a breakpoint immediately after the delete operation and manually run a curl request to look at the index I tried to delete, I find that it's still there some of the time; on those occasions the error above appears if I continue the test.
I've tried asserting the isAcknowledged() method of the response to the delete operation, but that always returns true, even in cases when the error occurs.
I've also tried doing an exists() check before the create operation. Interestingly in that case if I run the tests without breakpoints, the exists() check always returns false (i.e. that the index doesn't exist) even in cases where the error will then occur, but if I put a breakpoint in before the create operation, then the exists() check returns true in cases where the error will happen.
I'm at a bit of a loss. As far as I understand, my requests should be synchronous, and from a comment on this question, this should mean that the delete() operation only returns when the index has definitely been deleted.
I suspect a key part of the problem might be that these tests are running on a cluster of 3 nodes. In setting up the client, I'm only addressing one of the nodes:
client = new RestHighLevelClient(RestClient.builder(new HttpHost("example.com", 9200, "https")));
but I can see that each operation is being replicated to the other two nodes.
When I stop at a breakpoint before the create operation, in cases where the index is not deleted, I can see that it's not deleted on any of the nodes, and it seems not to matter how long I wait; it never gets deleted.
Is there some way I can reliably determine whether the index has been deleted before I create it? Or perhaps something I need to do before I attempt the delete operation, to guarantee that it will succeed?
There are quite a few things to think about here. For one, I'd test everything with curl or some other REST client before doing anything in code. It might just help you conceptually, but that's just my opinion.
This is one thing you should consider:
"If an external versioning variant is used, the delete operation automatically creates an index if it has not been created before (check out the create index API for manually creating an index)."
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete.html
That could explain why exists() returns false: if the external versioning variant is used, the delete operation would actually create an index with the same name prior to deleting it.
You mentioned that you are working with a three-node cluster. Something you can try:
"When making delete requests, you can set the wait_for_active_shards parameter to require a minimum number of shard copies to be active before starting to process the delete request." Here is a super detailed explanation which is certainly worth reading: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-wait-for-active-shards
I suggest you try:
curl -X DELETE 127.0.0.1:9200/fooindex?wait_for_active_shards=3
You said you have 3 nodes in your cluster, so this means that: "...indexing operation will require 3 active shard copies before proceeding, a requirement which should be met because there are 3 active nodes in the cluster, each one holding a copy of the shard."
This check is probably not 100% watertight, since according to the docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-wait-for-active-shards
"It is important to note that this setting greatly reduces the chances of the write operation not writing to the requisite number of shard copies, but it does not completely eliminate the possibility, because this check occurs before the write operation commences. Once the write operation is underway, it is still possible for replication to fail on any number of shard copies but still succeed on the primary. The _shards section of the write operation’s response reveals the number of shard copies on which replication succeeded/failed." so perhaps use this parameter, but have your code check the response to see if any operations failed.
Something else you can also try (though I can't seem to find good documentation to back this up) is the following, which should be able to tell you if the cluster isn't ready to accept deletes:
curl -X DELETE 127.0.0.1:9200/index?wait_for_completion=true
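If none of that helps, a pragmatic workaround in test code is to poll until the index is really gone before recreating it. Below is a rough sketch against the same 6.4 high-level client as in the question; the 10-second budget and 100 ms poll interval are arbitrary choices, and since the question notes that exists() can already report false before the failure, this narrows the window rather than guaranteeing a fix:

private void recreateIndex(RestHighLevelClient client, String indexName) throws IOException, InterruptedException {
    DeleteIndexRequest deleteIndexRequest = new DeleteIndexRequest(indexName);
    deleteIndexRequest.indicesOptions(IndicesOptions.lenientExpandOpen());
    client.indices().delete(deleteIndexRequest, RequestOptions.DEFAULT);

    // Poll until the cluster no longer reports the index (or give up after 10 seconds).
    long deadline = System.currentTimeMillis() + 10_000;
    while (client.indices().exists(new GetIndexRequest().indices(indexName), RequestOptions.DEFAULT)) {
        if (System.currentTimeMillis() > deadline) {
            throw new IllegalStateException("index " + indexName + " was not deleted in time");
        }
        Thread.sleep(100);
    }

    CreateIndexRequest createIndexRequest = new CreateIndexRequest(indexName);
    createIndexRequest.mapping("_doc", "{...}", XContentType.JSON); // same mapping placeholder as in the question
    client.indices().create(createIndexRequest, RequestOptions.DEFAULT);
}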
I have a Java application that uses Lucene (the latest version, 5.2.1 as of this writing) in "near real-time" mode; it has one network connection to receive requests to index documents, and another connection for search requests.
I'm testing with a corpus of pretty large documents (several megabytes of plain text) and several versions of each field with different analyzers. Since one of them is a phonetic analyzer with the Beider-Morse filter, indexing some documents can take quite a bit of time (over a minute in some cases). Most of this time is spent in the call to IndexWriter.addDocument(doc);
My problem is that while a document is being indexed, searches get blocked, and they aren't processed until the indexing operation finishes. Having the search blocked for more than a couple seconds is unacceptable.
Before each search, I do the following:
DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, false);
if (newReader != null)
{
reader = newReader;
searcher = new IndexSearcher(reader);
}
I guess this is what causes the problem. However, it is the only way to get the most recent changes when I do a search. I'd like to maintain this behaviour in general, but if the search would block, I wouldn't mind using a slightly older version of the index.
Is there any way to fix this?
Among other options, consider always keeping an IndexWriter open and performing commits on it as you need.
Then ask that writer for index readers (not the directory) and refresh them as needed. Or simply use a SearcherManager, which will not only refresh searchers for you, but also maintain a pool of readers and manage references to them, in order to avoid reopening when the index contents haven't changed.
See more here.
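For illustration, a minimal sketch of the SearcherManager approach, assuming an already-open IndexWriter named writer and some previously built Query named query (Lucene 5.x API):

// Build the manager from the writer (not the directory) to get near-real-time searchers.
SearcherManager searcherManager = new SearcherManager(writer, true, new SearcherFactory());

// Indexing side: after adding/updating documents, ask for a refresh. Searches that are
// already running keep using the snapshot they acquired, so they are not blocked.
searcherManager.maybeRefresh();

// Search side: acquire a searcher per query and always release it.
IndexSearcher searcher = searcherManager.acquire();
try {
    TopDocs hits = searcher.search(query, 10);
    // ... process hits ...
} finally {
    searcherManager.release(searcher);
}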
When I use Google Reader, I've found that sometimes a website doesn't support RSS, but somehow Google Reader produces a feed for it and shows it. I want to know how Google Reader does this. Any programming-language solution, or just the theory, will be OK.
I'm not going to pretend I know how Google Reader does it, but here's a simple hint:
When a browser loads a page for the first time, it keeps a copy in its cache.
The next time the page needs to be loaded, the browser first checks whether the page has changed since it was last loaded. If it hasn't, it simply loads the version from the cache; otherwise, it fetches the page again.
This mechanism is done, as far as I know, using the HEAD HTTP operation and the Last-Modified header.
This should be your starting point, as it lets you rapidly find out whether new content has been published.
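As a rough sketch of that check in plain java.net (the URL is illustrative and loadLastSeenTimestamp() is a hypothetical helper that returns the value saved from a previous check):

URL url = new URL("https://example.com/some-page");   // illustrative URL
long lastSeen = loadLastSeenTimestamp();               // hypothetical helper: timestamp from the previous check

HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("HEAD");                         // ask for headers only, no body
long lastModified = conn.getLastModified();            // 0 if the Last-Modified header is absent
conn.disconnect();

if (lastModified == 0 || lastModified > lastSeen) {
    // the page may have changed (or doesn't report Last-Modified): refetch it and diff the content
} else {
    // nothing new since the last check
}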
The next step would be to use some clever algorithms to determine what the change was, whether it is relevant enough to be considered new content, and how to present it.
I'm using Lucene 4.0 API in order to implement search in my application.
The navigation flow is as follows:
The user creates a new article. A new Document is then added to the index using IndexWriter.addDocument().
After the addition, the SearcherManager.maybeRefresh() method is called. The SearcherManager is built from the IndexWriter in order to have access to NRT (near-real-time) search.
Just after creation, the user decides to add a new tag to his article. This is when IndexWriter.updateDocument() is called. Considering that at step 2 I asked for a refresh, I would expect the searcher to find the added document. However, it is not found.
Is this the normal behaviour? Is there a way to make sure that the searcher finds the document (other than commit)?
I am guessing that your newly created document is still being held in memory. Lucene doesn't apply the changes immediately; it keeps some documents in memory, because I/O operations take time and resources, and it is good practice to write only once the buffer is full. But since you would like to see and change the document immediately, try flushing the buffer first (IndexWriter.flush()). This should write to disk. Only after this, try (maybe-)refreshing.
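For what it's worth, one more thing to double-check (a sketch only; the names searcherManager, writer, articleId, updatedDoc and the "id" field are assumptions about the question's setup): the update at step 3 happens after the refresh at step 2, so a refresh is needed again after updateDocument() before the change becomes visible to newly acquired searchers.

writer.updateDocument(new Term("id", articleId), updatedDoc); // step 3: re-index the article with its new tag
searcherManager.maybeRefresh();                               // refresh again, after the update
IndexSearcher searcher = searcherManager.acquire();
try {
    TopDocs hits = searcher.search(new TermQuery(new Term("id", articleId)), 1);
    // hits should now reflect the updated document
} finally {
    searcherManager.release(searcher);
}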
We are validating XML files and depending on the result of the validation we have to move the file into a different folder.
When the XML is valid the validator returns a value and we can move the file without a problem. Same thing happens when the XML is not valid according to the schema.
If however the XML is not well formed the validator throws an exception and when we try to move the file, it fails. We believe there is still a handle in the memory somewhere that keeps hold of the file. We tried putting System.gc() before moving the file and that sorted the problem but we can't have System.gc() as a solution.
The code looks like this. We have a File object from which we create a StreamSource. The StreamSource is then passed to the validator. When the XML is not well formed it throws a SAXException. In the exception handling we use the .renameTo() method to move the file.
sc = new StreamSource(xmlFile);
validator.validate(sc);
In the catch we tried
validator.reset();
validator=null;
sc=null;
but still .renameTo() is not able to move the file. If we put System.gc() in the catch, the move will succeed.
Can someone enlighten me on how to sort this out without System.gc()?
We use JAXP and saxon-9.1.0.8 as the parser.
Many thanks
Try creating a FileInputStream and passing that into the StreamSource, then close the FileInputStream when you're done. By passing in a File, you have lost control of how and when the file handle is closed.
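A minimal sketch of that idea (variable names follow the question; the catch/finally structure is just illustrative):

FileInputStream in = new FileInputStream(xmlFile);
try {
    validator.validate(new StreamSource(in));
} catch (SAXException e) {
    // not well formed / not valid: remember the outcome, the file is moved below
} finally {
    in.close(); // the handle is released here, so renameTo() can move the file
}
// now move xmlFile to the appropriate folder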
When you set sc = null, you are indicating to the garbage collector that the StreamSource is no longer being used and can be collected. Streams close themselves in their finalize() method, so if they are garbage collected, they will be closed, and therefore the file can be moved on a Windows system (you will not have this problem on a Unix system).
To solve the problem without manually invoking the GC, simply call sc.getInputStream().close() before sc = null. This is good practice anyway.
A common pattern is to do a try .. finally block around any file handle usage, eg.
try {
    sc = new StreamSource(xmlFile);
    validator.validate(sc); // check stuff
} finally {
    if (sc != null && sc.getInputStream() != null) {
        sc.getInputStream().close();
    }
}
// move the file to the appropriate place
In Java 7, you can instead use the new try-with-resources block.
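For instance (a sketch, wrapping a FileInputStream since StreamSource itself is not AutoCloseable):

try (FileInputStream in = new FileInputStream(xmlFile)) {
    validator.validate(new StreamSource(in));
} catch (SAXException e) {
    // not well formed: fall through, the stream is still closed automatically
}
// move xmlFile to the appropriate place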
Try sc.getInputStream().close() in the catch
All three answers already given are right: you must close the underlying stream, either with a direct call via the StreamSource, by getting the stream and closing it, or by creating the stream yourself and closing it.
However, I've seen this happen under Windows for at least three years: even if you close the stream (really, every stream), trying to move or delete the file will throw an exception ... unless ... you explicitly call System.gc().
However, since System.gc() does not force the JVM to actually execute a round of garbage collection, and since even if it did the JVM is not required to collect every possible garbage object, you have no real way of being sure that the file can be deleted "now".
I don't have a clear explanation; I can only imagine that the Windows implementation of java.io somehow caches the file handle and does not release it until the handle gets garbage collected.
It has been reported, though I haven't confirmed it, that java.nio is not subject to this behavior, because it has more low-level control over file descriptors.
A solution I've used in the past, though it's quite a hack (a rough sketch follows below), was to:
Put files to delete on a "list"
Have a background thread check that list periodically, call System.gc() and try to delete those files.
Remove from the list the files you managed to delete, and keep the ones that could not be deleted yet.
Usually the "lag" is on the order of a few milliseconds, with the occasional file surviving a bit longer.
It could be a good idea to also call deleteOnExit on those files, so that if the JVM terminates before your thread has finished cleaning some files up, the JVM will try to delete them. However, deleteOnExit had its own bug at the time that prevented exactly this kind of removal, so I didn't. Maybe it's resolved today and you can trust deleteOnExit.
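A rough sketch of that hack (Java 8+ syntax; the class name, interval and details are made up):

import java.io.File;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the deferred-delete hack described above; names and interval are arbitrary.
public class DeferredFileDeleter {
    private final Set<File> pending = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public DeferredFileDeleter() {
        // Periodically nudge the GC and retry the deletions that failed earlier.
        scheduler.scheduleWithFixedDelay(this::sweep, 100, 100, TimeUnit.MILLISECONDS);
    }

    /** Queue a file whose deletion failed because a handle is apparently still held. */
    public void deleteLater(File file) {
        file.deleteOnExit(); // best effort if the JVM exits before we succeed
        pending.add(file);
    }

    private void sweep() {
        if (pending.isEmpty()) {
            return;
        }
        System.gc(); // the hack: give finalizers a chance to release the handles
        pending.removeIf(f -> !f.exists() || f.delete());
    }
}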
This is the JRE bug I find most annoying and stupid, and I cannot believe it still exists, but unfortunately I hit it just a month ago on Windows Vista with the latest JRE installed.
Pretty old, but some people may still find this question.
I was using Oracle Java 1.8.0_77.
The problem occurs on Windows, not on Linux.
A StreamSource instantiated with a File seems to allocate and release the file resource automatically when processed by a validator or transformer (getInputStream() returns null).
On Windows, moving a file into the place of the source file (which would overwrite the source file) after processing is not possible.
Solution/Workaround: Move the file using
Files.move(from.toPath(), to.toPath(), REPLACE_EXISTING, ATOMIC_MOVE);
The use of ATOMIC_MOVE here is the critical point. Whatever the reason is, it has something to do with the annoying file-locking behavior of Windows.
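For reference, a self-contained version of that workaround (the paths are illustrative; REPLACE_EXISTING and ATOMIC_MOVE are static imports from java.nio.file.StandardCopyOption):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import static java.nio.file.StandardCopyOption.ATOMIC_MOVE;
import static java.nio.file.StandardCopyOption.REPLACE_EXISTING;

public class MoveAfterValidation {
    public static void main(String[] args) throws IOException {
        File from = new File("incoming/input.xml"); // illustrative source (the rejected XML)
        File to = new File("failed/input.xml");     // illustrative target folder
        // ATOMIC_MOVE is what avoids the Windows locking issue described above.
        Files.move(from.toPath(), to.toPath(), REPLACE_EXISTING, ATOMIC_MOVE);
    }
}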