Lucene blocks while searching and indexing at the same time - java

I have a Java application that uses Lucene (latest version, 5.2.1 as of this writing) in "near realtime" mode; it has one network connection to receive requests to index documents, and another connection for search requests.
I'm testing with a corpus of fairly large documents (several megabytes of plain text) and several versions of each field with different analyzers. Because one of them is a phonetic analyzer with the Beider-Morse filter, indexing some documents can take quite a while (over a minute in some cases). Most of this time is spent in the call to IndexWriter.addDocument(doc).
My problem is that while a document is being indexed, searches are blocked and aren't processed until the indexing operation finishes. Having searches blocked for more than a couple of seconds is unacceptable.
Before each search, I do the following:
DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, false);
if (newReader != null)
{
    reader = newReader;
    searcher = new IndexSearcher(reader);
}
I guess this is what causes the problem. However, it is the only way to get the most recent changes when I do a search. I'd like to keep this behaviour in general, but if a search would otherwise block, I wouldn't mind using a slightly older version of the index.
Is there any way to fix this?

Among other options, consider always keeping an IndexWriter open and performing commits on it as needed.
Then you should open index readers from it (not from the directory) and refresh them as needed. Or simply use a SearcherManager, which will not only refresh searchers for you but will also maintain a pool of readers and manage references to them, in order to avoid reopening when the index contents haven't changed.
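A minimal sketch of that approach, assuming the application keeps a single IndexWriter named writer open; doc and query stand in for the application's own objects:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;

// Create once, from the writer (not the directory), so NRT changes become visible.
SearcherManager searcherManager = new SearcherManager(writer, true, null);

// Indexing side: add the document, then request a refresh.
writer.addDocument(doc);
searcherManager.maybeRefresh();

// Search side: acquire whatever searcher is current, use it, and always release it.
IndexSearcher searcher = searcherManager.acquire();
try {
    TopDocs hits = searcher.search(query, 10);
    // ... process hits ...
} finally {
    searcherManager.release(searcher);
}

With this pattern, searches use the searcher that was current at acquire() time and simply see the index as of the last successful refresh, instead of waiting for a long addDocument() call to finish.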

Toggling WAL usage within same VM?

I'm one of the developers of the Hawk model indexing tool. Our tool indexes XMI models into graphs in order to speed up later queries, and it needs to toggle back and forth between "batch insert" and "transactional update" modes. "batch insert" is used the first time we notice a new file in a directory, and from then on we use "transactional update" mode to keep its graph in sync.
Our recently added OrientDB 2.1.4 backend uses the getTx()/getNoTx() methods in OrientGraphFactory to get the appropriate OrientGraph/OrientGraphNoTx instances. However, we aren't getting very good throughput compared to Neo4j: indexing set0.xmi takes 90s with OrientDB when placing the WAL in a Linux ramdisk, while it takes 22s with our Neo4j backend under the same conditions (machine + OS + JDK). We're using these additional settings to try to reduce times (a configuration sketch follows the list):
Increased WAL cache size to 10000
Disable sync on page flush
Save only dirty objects
Use massive insert intent
Disable transactional log
Disable MVCC
Disable validation
Use lightweight edges when possible
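A rough sketch of how most of the settings above could be applied programmatically; the OGlobalConfiguration keys below are assumptions based on the 2.x series and should be checked against the release in use, and db stands for the document database instance:

import com.orientechnologies.orient.core.config.OGlobalConfiguration;
import com.orientechnologies.orient.core.intent.OIntentMassiveInsert;

// Global engine settings; key names may differ between OrientDB releases.
OGlobalConfiguration.WAL_CACHE_SIZE.setValue(10000);          // larger WAL cache
OGlobalConfiguration.TX_USE_LOG.setValue(false);              // no transactional log
OGlobalConfiguration.MVCC.setValue(false);                    // no MVCC
OGlobalConfiguration.OBJECT_SAVE_ONLY_DIRTY.setValue(true);   // save only dirty objects

// Per-database hint for bulk loading.
db.declareIntent(new OIntentMassiveInsert());

// Sync-on-page-flush, validation and lightweight edges are configured at the
// graph/database level and are omitted from this sketch.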
We've thought of disabling the WAL when entering "batch insert" mode, but there doesn't seem to be an easy way to toggle that on and off. It appears it can only be set once at program startup and that's it. We've tried explicitly closing the underlying storage so the USE_WAL flag will be read once more while reopening the storage, but that only results in NullPointerExceptions and other random errors.
Any pointers on how we could toggle the WAL, or improve performance beyond that would be greatly appreciated.
Update: We've switched to using the raw document API and marking dirty nodes/edges ourselves and we're now hitting 55 seconds, but the WAL problem still persists. Also tried 2.2.0-beta, but it actually took longer.
We solved this ourselves; leaving this here in case it helps someone :-). We're down to 30 seconds after many internal improvements in our backend (still using the raw document API) and switching to OrientDB 2.0.15, and we found a way to toggle the write-ahead log ourselves. This works for us (db is our ODatabaseDocumentTx instance):
private void reopenWithWALSetTo(final boolean useWAL) {
    // Close the underlying storage first (two-arg version), then the database.
    db.getStorage().close(true, false);
    db.close();

    // USE_WAL is only read when the storage is (re)opened.
    OGlobalConfiguration.USE_WAL.setValue(useWAL);

    db = new ODatabaseDocumentTx(dbURL);
    db.open("admin", "admin");
}
I was being silly and thought I had to close the DB first and then the storage, but it turns out that wasn't the case :-). It's necessary to use the two-arg version of the storage's close method (reached through db.getStorage() above), as the no-arg version basically does nothing for the OAbstractPaginatedStorage implementation used through plocal:// URLs.
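For completeness, a sketch of how this helper fits the batch/transactional mode switching described in the question (the surrounding bulk-load logic is application-specific):

// Entering "batch insert" mode for a newly discovered file: WAL off.
reopenWithWALSetTo(false);
// ... bulk-load the file's model into the graph ...

// Back to "transactional update" mode for incremental changes: WAL on.
reopenWithWALSetTo(true);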

Unable to complete the HTTP request when using SpreadSheet API

I am developing a Google App Engine application which reads and edits a big spreadsheet with around 150 columns and 500 rows. Besides the specific size (it may vary), I am looking for a way to improve performance, since most of the time I get a 500 Internal Server Error (as you can see below).
java.lang.RuntimeException: Unable to complete the HTTP request Caused
by: java.net.SocketTimeoutException: Timeout while fetching URL:
https://spreadsheets.google.com/feeds/worksheets/xxxxxxxxxxxxxxxxxxxxxxx/private/full
In the code snippet below you can see how I read my SpreadSheet and which line throws the exception.
for (SpreadsheetEntry entry : spreadsheets) {
    if (entry.getTitle().getPlainText().compareTo(spreadsheetname) == 0) {
        spreadsheet = entry;
    }
}
WorksheetFeed worksheetFeed = service.getFeed(spreadsheet.getWorksheetFeedUrl(), WorksheetFeed.class);
List<WorksheetEntry> worksheets = worksheetFeed.getEntries();
WorksheetEntry worksheet = worksheets.get(0);
URL listFeedUrl = worksheet.getListFeedUrl();
// The following line is the one that throws the exception
ListFeed listFeed = service.getFeed(listFeedUrl, ListFeed.class);
for (ListEntry row : listFeed.getEntries()) {
    String content = row.getCustomElements().getValue("rowname");
    String content2 = row.getCustomElements().getValue("rowname2");
}
I already improved performance by using structured queries. Basically, I apply filters within the URL, which allows me to retrieve only the few rows I need. Note that I still sometimes get the above error no matter what.
URL listFeedUrl = new URI(worksheet.getListFeedUrl().toString() + "?sq=rowname=" + URLEncoder.encode("\"" + filter + "\"")).toURL();
My problem, however, is slightly different. First of all, there are times when I must read ALL rows but only a FEW columns (around 5). I still need to find a way to achieve that; I know there is another parameter, "tq", which allows selecting columns, but it requires the letter notation (such as A, B, AA), and I'd like to use column names instead.
Most importantly, I need to get rid of the 500 Internal Server Error. Since it sounds like a timeout problem, I'd like to increase that value to a reasonable amount of time; my users can wait a few seconds. It also seems completely random: when it works, the page loads in around 2-3 seconds; when it doesn't, I get a 500 Internal Server Error, which is going to be really frustrating for the end user.
Any ideas? I couldn't find anything in the App Engine settings. The only idea I have so far is to split the spreadsheet into multiple spreadsheets (or worksheets) in order to read fewer columns. However, if there's an option that allows me to increase the timeout, that would be awesome.
EDIT: I was looking around on the Internet and I may have found something that can help me. I just found out the service object offers a setConnectTimeout method; testing it right away.
// Set timeout
int timeout = 60000;
service.setConnectTimeout(timeout);
Time Out
I use a 10-second timeout with a retry. It works OK for me.
Sheet size
I have used it with 80,000 cells at a time. It works fine, and I have not seen the retry fail. I am using CellFeed, not ListFeed.
Yes, it does not like large sheets: small sheets of 1,000 cells or so are much faster. Even if I only write to part of the sheet, small sheets are much faster. (It feels like it recalculates whole sheets, since it does not look to be down to data volume, but I am not sure.)
Exponential backoff
Zig suggests exponential backoff. I would be interested in numbers: what timeout values and failure rates people get with exponential backoff, and also the impact of sheet size.
I suspect that starting with a 3-second timeout and doubling it with every retry might work, but I have not tested it.
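A minimal sketch of that retry pattern around the call that fails in the question, reusing its service and listFeedUrl objects; the attempt limit and starting timeout are illustrative:

import java.io.IOException;
import com.google.gdata.data.spreadsheet.ListFeed;
import com.google.gdata.util.ServiceException;

int timeoutMs = 3000;                                  // start with a 3-second timeout
ListFeed listFeed = null;
for (int attempt = 0; attempt < 5 && listFeed == null; attempt++) {
    try {
        service.setConnectTimeout(timeoutMs);
        listFeed = service.getFeed(listFeedUrl, ListFeed.class);
    } catch (IOException | ServiceException e) {
        timeoutMs *= 2;                                // back off: double the timeout and retry
    }
}
if (listFeed == null) {
    // Still failing after the last attempt: surface the error to the caller.
}

This follows the doubling suggestion above; adding a short Thread.sleep between attempts would make it a more conventional backoff, as long as the total stays within App Engine's request deadline.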
The real problem is that you shouldn't use a spreadsheet for this. It will throw many errors, including rate limits, if you attempt to make heavy use of it.
At a minimum you will need to use exponential backoff to retry errors, but it will still be slow. Doing a query by URL is not efficient either.
The solution is to dump the spreadsheet into the datastore, then do your queries from there. Since you also edit the spreadsheet, it's not that easy to keep it in sync with your datastore data. A general solution requires task queues to correctly handle the timeouts and the large amount of data (cells).

Lucene search stops working

I am using Lucene for search in one of my projects. It runs as a separate service on a port: whenever a query comes in, a request is sent to this server and it returns a map of results.
My problem is that it stops working after some time. It works fine for a day or so, but then it stops returning results (i.e. the service is running but it returns 0 results). To get it working again, I have to restart the service, after which it works fine again.
Please suggest a solution. I'll be happy to provide more info if needed.
Thanks.
If I were to guess at an easy mistake that could cause such behavior: maybe you're opening a bunch of IndexWriters or IndexReaders as time goes on and not closing them correctly, thus running out of the file descriptors available on your server. See whether 'lsof' shows a lot of open descriptors on '.cfs', '.fdx' and/or '.fdt' files ('ulimit -n' can be used to see the maximum).
One thing to note about the IndexSearcher, which I've seen cause problems:
Closing the searcher may not close the underlying reader. If you pass a reader into the searcher, it won't be closed when you close the searcher (since in that case, it may be in use by other objects).
An example of this:
// Assume I have an IndexWriter named indexwriter, which I reuse.
IndexSearcher searcher = new IndexSearcher(IndexReader.open(indexwriter, true));
// Use the searcher
searcher.close();
// We close the searcher, but the underlying reader remains open.
This has now accumulated an unclosed reader and left some index file descriptors open. If this pattern is used enough times, the service will eventually stop responding.
That's one example of such an error, anyway.
It can be fixed simply by closing the reader when you close the searcher, something like searcher.getIndexReader().close(). Better solutions can be found, though: reusing the reader and refreshing it when the index contents change, for instance.
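A rough sketch of that reader-reuse idea in the 3.x API used above; reader is assumed to be a long-lived field holding an NRT reader originally opened from indexwriter:

// Reopen the shared reader only if the index has changed since it was opened.
IndexReader newReader = IndexReader.openIfChanged(reader, indexwriter, true);
if (newReader != null) {
    reader.close();                 // release the old reader's file handles
    reader = newReader;
}
IndexSearcher searcher = new IndexSearcher(reader);
// ... run the query; don't close the shared reader here ...

In a multi-threaded service, closing the old reader should be reference-counted (or delegated to SearcherManager) so that in-flight searches aren't cut off.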
Don't know if this is the exact problem you are having, but might be worth noting.

SearcherManager maybeRefresh method not happening

I'm using Lucene 4.0 API in order to implement search in my application.
The navigation flow is as follows:
The user creates a new article. A new Document is then added to the index using IndexWriter.addDocument().
After the addition, the SearcherManager.maybeRefresh() method is called. The SearcherManager is built from the writer in order to have access to NRT search.
Just after the creation, the user decides to add a new tag to his article. This is when Writer.updateDocument() is called. Considering that at step 2 I asked for a refresh, I would expect the searcher to find the added document. However, it is not found.
Is this the expected behaviour? Is there a way to make sure that the searcher finds the document (other than commit)?
I am guessing that your newly created document is kept in memory. Lucene doesn't apply the changes immediately; it keeps some documents in memory, because the I/O operations take time and resources. It is good practice to write only once the buffer is full. But since you would like to view and change the document immediately, try flushing the buffer first (IndexWriter.flush()). This should write to disk. Only after this, try (maybe)refreshing.
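Independently of flushing, a sketch of one thing worth checking in this flow: maybeRefresh() returns immediately when another thread is already refreshing, so the calling thread may acquire a searcher before that refresh has completed, while maybeRefreshBlocking(), where available, waits until the refresh has actually happened. The names below come from the question, except the illustrative "id" field and variables:

writer.updateDocument(new Term("id", articleId), updatedDoc);  // "id" and articleId are illustrative
searcherManager.maybeRefreshBlocking();                        // wait for the refresh to finish

IndexSearcher searcher = searcherManager.acquire();
try {
    // The updated document should be visible here without a commit.
} finally {
    searcherManager.release(searcher);
}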

best way to update RAMDirectory

At inconsistent intervals, specific documents in a Lucene index need to be updated. The updates could be hourly or every few minutes. Currently I have a process that runs and looks for changes, and if changes have happened, it (in Lucene 3.5 fashion) removes the document and then re-adds it to the RAMDirectory.
Is there a better way to manage a Lucene index of documents that are constantly transforming? Is RAMDirectory the best choice?
The code I use for "updating" the index:
Term idTerm = new Term("uid", row.getKey());
getWriter().deleteDocuments(idTerm);
getWriter().commit();
// do some fun stuff creating a new doc with the changes
getWriter().addDocument(doc);
Lucene has recently added two very useful helper classes for handling frequently-changing indexes:
SearcherManager,
NRTManager.
You can read more about them at Mike McCandless' blog.
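For reference, a sketch of how the delete/commit/add sequence from the question can be collapsed with IndexWriter.updateDocument, which deletes by term and adds the new document in one call; the uid field and getWriter() helper are taken from the question:

// Replaces any document whose "uid" matches, then adds the new version.
getWriter().updateDocument(new Term("uid", row.getKey()), doc);
// Commit, or refresh an NRT reader / SearcherManager, when the change should become searchable.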
