I'm using Lucene 4.0 API in order to implement search in my application.
The navigation flow is as follows:
The user creates a new article. A new Document is then added to the index using IndexWriter.addDocument().
After the addition, SearcherManager.maybeRefresh() is called. The SearcherManager is built from the Writer in order to have access to NRT (near-real-time) search.
Just after the creation, the user decides to add a new tag to his article. This is when Writer.updateDocument() is called. Considering that at step 2 I asked for a refresh, I would expect the Searcher to find the added document. However, it is not found.
Is this the expected behaviour? Is there a way to make sure that the Searcher finds the Document (other than calling commit)?
I am guessing that your newly created document is still being held in memory. Lucene doesn't apply changes immediately; it buffers documents in memory, because I/O operations take time and resources, and it is good practice to write only once the buffer is full. But since you would like to view and change the document immediately, try flushing the buffer first (IndexWriter.flush()). This should write the buffered documents to disk. Only after that, try (maybe)refreshing.
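For reference, a minimal sketch of the flow above (Lucene 4.x), assuming the SearcherManager was built over the writer with new SearcherManager(writer, true, new SearcherFactory()). One thing worth double-checking is that the searcher is acquired from the manager after maybeRefresh(); a searcher acquired earlier keeps its old point-in-time view of the index:

// Assumes: IndexWriter writer; SearcherManager searcherManager; Document doc; String articleId.
writer.addDocument(doc);              // step 1: index the new article
searcherManager.maybeRefresh();       // step 2: make the change visible to NRT searchers

IndexSearcher searcher = searcherManager.acquire();   // acquire AFTER the refresh
try {
    TopDocs hits = searcher.search(new TermQuery(new Term("id", articleId)), 1);
    // the newly added document should show up here
} finally {
    searcherManager.release(searcher);                 // always release what you acquire
}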
So I have been using Java Batch Processing for some time now. My jobs were either import/export jobs which chunked from a reader to a writer, or I would write Batchlets that would do some more complex processing. Since I am beginning to hit memory limits I need to rethink the architecture.
So I want to better leverage the chunked Reader/Processor/Writer pattern, but I feel unsure how to distribute the work over the three components. During processing it becomes clear whether zero, one or several other records have to be written.
The reader is quite clear: It reads the data to be processed from the DB. But I am unsure how to write the records back to the database. I see these options:
Have the processor itself store the variable amount of data in the DB.
Have the processor send a variable amount of data to the writer, which would then perform the writing.
Place the entire logic into the writer.
Which way would be the best for this kind of task?
Looking at https://www.ibm.com/support/pages/system/files/inline-files/WP102706_WLB_JSR352.002.pdf, especially the chapters Chunk/The Processor and Chunk/The Writer, it becomes obvious that this is up to me.
The processor can return an object, and the writer will have to understand and write this object. So for the above case where the processor has zero, one or many items to write per input record, it should simply return a list. This list can contain zero, one or several elements. The writer has to understand the list and write its elements to the database.
Since the logic is divided this way, the code is still pluggable and can easily be extended or maintained.
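As a rough sketch of that split (JSR 352 chunk API; InputRecord and OutputRecord stand in for your own domain types):

import java.util.ArrayList;
import java.util.List;
import javax.batch.api.chunk.ItemProcessor;
import javax.inject.Named;

// InputRecord and OutputRecord are placeholders for your own domain types.
@Named
public class ExpandingProcessor implements ItemProcessor {
    @Override
    public Object processItem(Object item) throws Exception {
        InputRecord in = (InputRecord) item;        // the record delivered by the reader
        List<OutputRecord> out = new ArrayList<>();
        // business logic: add zero, one or many OutputRecords derived from 'in'
        return out;                                 // the writer receives this list for this input item
    }
}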
Addendum: Since in this case both the reader and the writer connect to the same database, I ran into the problem that the commit of each chunk also invalidated the reader's connection. The solution was to use a non-JTA datasource for the reader.
Typically, an item processor processes an input item passed from an item reader, and the processing result can be null or a domain object, so it isn't directly suited to cases where the processing result splits into multiple objects. Even in your case, I would assume that producing multiple objects in one processing iteration is not the common case, so I would suggest using a list (or any collection type) as the type of the processed object only when necessary. In the more common cases, the item processor will still return null (to skip the current item) or a single domain object.
When the item writer iterates through the accumulated items, it can check whether an item is a collection and, if so, write out all contained elements. A plain domain object is just written as usual.
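A sketch of such a writer, again using the standard chunk API; the insert() helper is just a placeholder for your JDBC/JPA code:

import java.util.Collection;
import java.util.List;
import javax.batch.api.chunk.AbstractItemWriter;
import javax.inject.Named;

@Named
public class FlatteningItemWriter extends AbstractItemWriter {
    @Override
    public void writeItems(List<Object> items) throws Exception {
        // Items the processor returned as null never reach the writer; the runtime skips them.
        for (Object item : items) {
            if (item instanceof Collection) {
                for (Object element : (Collection<?>) item) {
                    insert(element);               // one row per contained element
                }
            } else {
                insert(item);                      // plain domain object: write as usual
            }
        }
    }

    private void insert(Object row) {
        // placeholder: JDBC or JPA insert of a single row
    }
}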
Using a non-JTA datasource for the reader is fine. I think you would want to keep the reader connection open from start to end in order to keep reading from the result set. In an item writer, the connection is typically acquired at the beginning of the write operation and closed when the chunk transaction commits or rolls back.
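To illustrate the reader side, here is a sketch of a JDBC reader that holds one connection from open() to close(); the JNDI name and query are made up, and the datasource must be non-JTA so the chunk commits don't invalidate the connection:

import java.io.Serializable;
import java.sql.Connection;
import java.sql.ResultSet;
import javax.annotation.Resource;
import javax.batch.api.chunk.AbstractItemReader;
import javax.inject.Named;
import javax.sql.DataSource;

@Named
public class SourceTableReader extends AbstractItemReader {
    @Resource(lookup = "java:comp/env/jdbc/sourceNonJta")   // hypothetical non-JTA datasource
    private DataSource dataSource;

    private Connection connection;
    private ResultSet resultSet;

    @Override
    public void open(Serializable checkpoint) throws Exception {
        connection = dataSource.getConnection();             // stays open across chunk commits
        resultSet = connection.createStatement()
                              .executeQuery("SELECT id, payload FROM source_table");
    }

    @Override
    public Object readItem() throws Exception {
        if (!resultSet.next()) {
            return null;                                      // null ends the chunk loop
        }
        return new Object[] { resultSet.getLong("id"), resultSet.getString("payload") };
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}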
Some resources that may be of help:
Jakarta Batch API,
jberet-support JdbcItemReader,
jberet-support JdbcItemWriter
I have a java application that uses Lucene (latest version, 5.2.1 as of this writing) in "near realtime" mode; it has one network connection to receive requests to index documents, and another connection for search requests.
I'm testing with a corpus of pretty large documents (several megabytes of plain text) and several versions of each field with different analyzers. Since one of them is a phonetic analyzer with the Beider-Morse filter, indexing some documents can take quite a bit of time (over a minute in some cases). Most of this time is spent in the call to IndexWriter.addDocument(doc);
My problem is that while a document is being indexed, searches get blocked, and they aren't processed until the indexing operation finishes. Having the search blocked for more than a couple seconds is unacceptable.
Before each search, I do the following:
DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, false);
if (newReader != null)
{
reader = newReader;
searcher = new IndexSearcher(reader);
}
I guess this is what causes the problem. However, it is the only way to get the most recent changes when I do a search. I'd like to keep this behaviour in general, but if the search would otherwise block, I wouldn't mind using a slightly older version of the index.
Is there any way to fix this?
Among other options, consider always having an IndexWriter open and performing commits on it as you need.
Then you should obtain index readers from it (not from the directory) and refresh them as needed. Or simply use a SearcherManager, which will not only refresh searchers for you, but will also maintain a pool of readers and manage references to them, in order to avoid reopening if the index contents haven't changed.
See more here.
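As a rough sketch of the SearcherManager variant (Lucene 5.x): refreshing from a small background task is one way to keep searchers fresh without making the search path pay the reopen cost. The one-second interval and variable names below are only illustrative.

// Assumes: IndexWriter writer; Query query.
SearcherManager manager = new SearcherManager(writer, true, new SearcherFactory());

// Refresh periodically in the background; this is a no-op when nothing has changed.
ScheduledExecutorService refresher = Executors.newSingleThreadScheduledExecutor();
refresher.scheduleWithFixedDelay(() -> {
    try {
        manager.maybeRefresh();
    } catch (IOException e) {
        // log and continue
    }
}, 0, 1, TimeUnit.SECONDS);

// Search path: use whatever searcher is current, never reopen here.
IndexSearcher searcher = manager.acquire();
try {
    TopDocs hits = searcher.search(query, 10);
    // ...
} finally {
    manager.release(searcher);
}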
When I use Google Reader, I have found that sometimes a website doesn't support RSS, but somehow Google Reader produces a feed for it and shows it. I want to know how Google Reader does this. Any programming-language solution or just the theory would be fine.
I am not going to pretend I know how Google Reader does it, but here's a simple hint:
When a browser loads a page for the first time, it keeps a copy in its cache.
The next time the page needs to be loaded, the browser first checks whether the page has changed since it was last loaded. If it hasn't, the browser simply loads the version from the cache; otherwise, it refetches the page.
This mechanism relies, as far as I know, on the HEAD HTTP operation and the Last-Modified header.
This should be your starting point, as it lets you rapidly find out whether some new content has been published.
The next step would be to use some clever algorithms to determine what the change was, whether it is relevant enough to be considered new content, and how to present it.
Reference
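In Java, that freshness check can be expressed as a conditional request; the URL and timestamp handling below are only illustrative:

// Assumes: String pageUrl; long lastSeen (the Last-Modified value stored on the previous fetch).
HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
conn.setRequestMethod("HEAD");                 // headers only, no body
conn.setIfModifiedSince(lastSeen);             // sends an If-Modified-Since header

if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
    // 304: nothing changed, keep using the cached copy
} else {
    long lastModified = conn.getLastModified(); // 0 if the server sent no Last-Modified header
    // Refetch the page with a normal GET, diff it against the cached copy,
    // and remember lastModified for the next check.
}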
I am using lucene for search in one of my projects. It's running as a separate service on a port. Whenever a query comes, a request is sent to this server and it returns a map of results.
My problem is that it stops working after some time. It works fine for a day or so, but after that it stops returning results (i.e. the service is still running, but it returns 0 results). To get it working again, I have to restart the service, and then it works fine again.
Please suggest some solution. I'll be happy to provide more info if needed.
Thanks.
Were I to guess at an easy mistake that could cause such behavior: maybe you're opening more and more IndexWriters or IndexReaders as time goes on and not closing them correctly, thus running out of available file descriptors on your server. See if 'lsof' shows a lot of open descriptors for '.cfs', '.fdx' and/or '.fdt' files ('ulimit -n' can be used to see the maximum).
One thing to note about the IndexSearcher, which I've seen cause problems:
Closing the searcher may not close the underlying reader. If you pass a reader into the searcher, it won't be closed when you close the searcher (since in that case, it may be in use by other objects).
An example of this:
//Assume I have an IndexWriter named indexwriter, which I reuse.
IndexSearcher searcher = new IndexSearcher(IndexReader.open(indexwriter, true));
//Use the searcher
searcher.close();
//We close the searcher, but the underlying reader remains open.
Now this has accumulated an unclosed reader and left some index file descriptors open. If this pattern is used enough times, the service will eventually stop responding.
That's one example of such an error, anyway.
It can be fixed by simply closing the reader when you close the searcher, something like: searcher.getIndexReader().close(). Better solutions could be found, though, such as reusing the reader and refreshing it when the index contents change.
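A sketch of the reuse-and-refresh alternative, sticking to the 3.x-style API used in the snippet above:

// One long-lived reader and searcher, shared across requests.
IndexReader reader = IndexReader.open(indexwriter, true);
IndexSearcher searcher = new IndexSearcher(reader);

// Before a search: refresh only if the index actually changed.
IndexReader newReader = IndexReader.openIfChanged(reader);
if (newReader != null) {
    reader.close();                   // release the old reader's file descriptors
    reader = newReader;
    searcher = new IndexSearcher(reader);
}
// No per-request open/close, so the number of open descriptors stays bounded.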
Don't know if this is the exact problem you are having, but might be worth noting.
First, this may be a stupid question, but I'm hoping someone will tell me so, and why. I also apologize if my explanation of what/why is lacking.
I am using a servlet to upload a HUGE (247 MB) file, which is pipe (|) delimited. I grab about 5 of the 20 fields, create an object, and add it to a list. Once this is done, I pass the list to an OpenJPA transactional method called persistList().
This would be okay, except for the size of the file. It's taking forever, so I'm looking for a way to improve it. One idea I had was to use a BlockingQueue in conjunction with the persist/persistList method in a new thread. Unfortunately, my Java concurrency skills are a bit weak.
Does what I want to do make sense? If so, has anyone done anything like it before?
Servlets should respond to requests within a short amount of time. In this case, persisting the file contents needs to be an asynchronous job, so:
The servlet should respond with some text about the upload job, expected time to complete or something like that.
The uploaded content should be written to some temp space in binary form rather than kept entirely in memory. This is the usual way the multipart-post libraries do their work.
You should have a separate service that blocks on a queue of pending jobs. Once it gets a job, it processes it (see the sketch after this list).
The 'job' is simply some handle to the temporary file that was written when the upload happened... and any metadata like who uploaded it, job id, etc.
The persisting service needs to insert a large number of rows while making the operation appear 'atomic': either model the intermediate state as part of the table model(s), or write to temp spaces.
If you are writing to temp tables, and then copying all the content to the live table, remember to have enough log space and temp space at the database level.
If you have a full J2EE stack, consider modelling the job queue as a JMS queue, so recovery makes sense. Once again, remember to have proper XA boundaries, so all the row persists fall within an outer transaction.
Finally, consider also having a status check API and/or UI, where you can determine the state of any particular upload job: Pending/Processing/Completed.
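Since you mentioned a BlockingQueue, here is a bare-bones sketch of that producer/consumer hand-off; UploadJob, parse() and persistList() are placeholders for your own types and logic:

import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal handle to a pending job (placeholder type).
class UploadJob {
    final Path tempFile;
    final String uploader;
    UploadJob(Path tempFile, String uploader) { this.tempFile = tempFile; this.uploader = uploader; }
}

// Worker that drains the queue and persists one job at a time.
class PersistWorker implements Runnable {
    private final BlockingQueue<UploadJob> jobs;
    PersistWorker(BlockingQueue<UploadJob> jobs) { this.jobs = jobs; }

    @Override
    public void run() {
        try {
            while (true) {
                UploadJob job = jobs.take();        // blocks until the servlet enqueues a job
                persistList(parse(job.tempFile));   // parse the temp file, then persist
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();     // allow clean shutdown
        }
    }

    private List<Object> parse(Path file) { /* read the 5 needed fields per line */ return Collections.emptyList(); }
    private void persistList(List<Object> rows) { /* delegate to your OpenJPA transactional method */ }
}

// In the servlet (created once, e.g. in init()):
//   BlockingQueue<UploadJob> jobs = new LinkedBlockingQueue<>();
//   new Thread(new PersistWorker(jobs), "persist-worker").start();
// Per request: write the upload to a temp file, enqueue it, and respond immediately:
//   jobs.offer(new UploadJob(tempFilePath, userName));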