best way to update RAMDirectory - java

At irregular intervals, specific documents in a Lucene index need to be updated; the updates could come hourly or every few minutes. Currently I have a process that runs and looks for changes, and if changes have happened, it (in Lucene 3.5 fashion) removes the document and then re-adds it to the RAMDirectory.
Is there a better way to manage a Lucene index of documents that are constantly transforming? Is RAMDirectory the best choice?
The code I use for "updating" the index:
Term idTerm = new Term("uid", row.getKey());
getWriter().deleteDocuments(idTerm);
getWriter().commit();
// do some fun stuff creating a new doc with the changes
getWriter().addDocument(doc);

Lucene recently added two very useful helper classes to handle frequently changing indexes:
SearcherManager,
NRTManager.
You can read more about them at Mike McCandless' blog.
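For the update itself, note that IndexWriter.updateDocument(Term, Document) performs the delete and the add as a single atomic operation, so the intermediate commit() in the code above is unnecessary. A minimal sketch, reusing the question's getWriter() helper and a hypothetical buildDocument(row) that rebuilds the changed document:

// updateDocument atomically deletes any document matching the term and adds the new one
Term idTerm = new Term("uid", row.getKey());
Document doc = buildDocument(row); // hypothetical helper: rebuild the doc with the changes
getWriter().updateDocument(idTerm, doc);

// With a SearcherManager built from the writer (new in Lucene 3.5), searchers can be
// refreshed cheaply instead of committing after every change:
searcherManager.maybeReopen();                     // renamed maybeRefresh() in Lucene 4.x
IndexSearcher searcher = searcherManager.acquire();
try {
    // run queries against a consistent point-in-time view of the index
} finally {
    searcherManager.release(searcher);
}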

Related

Full text search by summaries

Is it possible to create a summary of a large document using an out-of-the-box search engine like Lucene, Solr, or Sphinx, and search for the documents most relevant to a query?
I don't need to search inside the document or create a snippet. Just get 5 documents best matching the query.
Update. More specifically, I don't want the engine to keep the whole document, but only its "summary" (you may call it index information or a TF-IDF representation).
To answer your updated question: Lucene/Solr fits your needs. For the "summary", you have the option of not storing the original text by specifying:
org.apache.lucene.document.Field.Store.NO
By saving the "summary" as an org.apache.lucene.document.TextField, the summary will be indexed and tokenized. The index will hold the TF-IDF information for you to search.
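A minimal sketch of such a field (assuming the Lucene 4.x+ TextField API): the summary is tokenized and indexed so it can be searched and scored, but its raw text is never kept in the index:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

Document doc = new Document();
// indexed and tokenized for search/scoring, but Store.NO means the original
// text cannot be retrieved from the index afterwards
doc.add(new TextField("summary", summaryText, Field.Store.NO));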
Basically, if you want a summarization feature, there are plenty of ways to do it, for example TextRank (there is a detailed article on Wikipedia, and plenty of implementations are available in NLTK and elsewhere). However, summarization will not help you with the querying; you will still need to index the result somewhere.
I think you could achieve something like this using a feature called More Like This. It exists in Lucene, Solr, and Elasticsearch. The idea behind it is that if you send a query (the raw text of a document), the search engine will find the most suitable documents by extracting the most relevant words from it (which reminds me of summarization) and then looking in the inverted index for the top N similar documents. It will not discard the text, though; it performs a "like" operation based on TF-IDF metrics.
References for MLT in Elasticsearch, Lucene, Solr
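A minimal sketch of MLT in plain Lucene, assuming an already-open IndexReader/IndexSearcher and a known internal document id; the field name "summary" is illustrative:

MoreLikeThis mlt = new MoreLikeThis(reader);      // org.apache.lucene.queries.mlt.MoreLikeThis
mlt.setAnalyzer(analyzer);                        // must match the analyzer used at index time
mlt.setFieldNames(new String[] { "summary" });    // fields to mine for "interesting" terms
Query likeQuery = mlt.like(docId);                // builds a query from the doc's top TF-IDF terms
TopDocs similar = searcher.search(likeQuery, 5);  // the 5 best-matching documents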
What you are looking for seems quite standard:
Apache Lucene [1], if you are looking for a library
Apache Solr or Elasticsearch, if you are looking for a production-ready enterprise search server.
A Lucene search engine works [2] by building an inverted index of each field in your document (plus a set of additional data structures required by other features).
What you apparently don't want to do is store the content of a field, which means taking the text content and storing it in full (compressed) in the index, to be retrieved later.
In Lucene and Solr this is a matter of configuration.
Summarisation is a completely different NLP task and is probably not what you need.
Cheers
[1] http://lucene.apache.org/index.html
[2] https://sease.io/2015/07/26/exploring-solr-internals-the-lucene-inverted-index/

How to manage a crawler URL frontier?

Guys
I have the following code to add visited links in my crawler.
After extracting links, I have a for loop that loops through each individual href tag.
After I have visited a link and opened it, I add the URL to the visited-links collection variable, defined as follows:
private final Collection<String> urlFrontier = Collections.synchronizedSet(new HashSet<String>());
The crawler implementation is multithreaded. Suppose I have visited 100,000 URLs; if I don't terminate the crawler, the set will grow day by day and will eventually create memory issues. What options do I have to refresh the variable without creating inconsistency across threads?
Thanks in advance!
If your crawlers are any good, managing the crawl frontier quickly becomes difficult, slow and error-prone.
Luckily, you don't need to write this yourself; just write your crawlers to consume the URL Frontier API and plug in an implementation that suits you.
See https://github.com/crawler-commons/url-frontier
The most practical approach for modern crawling systems is to use a NoSQL database.
This solution is notably slower than a HashSet, which is why you can layer on caching strategies such as Redis, or even Bloom filters.
Given the specific nature of URLs, I'd also recommend a trie data structure, which gives you many options for manipulating and searching by URL string. (A discussion of Java implementations can be found in this Stack Overflow topic.)
As for the question, I would recommend using Redis to replace the Collection. It is an in-memory data structure store that is very fast at inserting and retrieving data and supports all the standard data structures; in your case a Set, where you can check the existence of a key with the SISMEMBER command.
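A minimal sketch using the Jedis client (an assumption; any Redis client exposes the same commands). SADD reports whether the member was new, so checking and marking a URL as visited becomes one atomic call:

Jedis jedis = new Jedis("localhost", 6379);   // redis.clients.jedis.Jedis
                                              // note: share a JedisPool, not one Jedis, across threads

// SADD returns 1 if the URL was newly added, 0 if it was already in the set,
// so this doubles as an atomic "check and mark visited" across threads and processes
boolean firstVisit = jedis.sadd("visited-urls", url) == 1;

// a plain membership test is also available (the SISMEMBER command)
boolean alreadySeen = jedis.sismember("visited-urls", url);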
Apache Nutch is also good to explore.

Lucene blocks while searching and indexing at the same time

I have a java application that uses Lucene (latest version, 5.2.1 as of this writing) in "near realtime" mode; it has one network connection to receive requests to index documents, and another connection for search requests.
I'm testing with a corpus of fairly large documents (several megabytes of plain text) and several versions of each field with different analyzers. Because one of them is a phonetic analyzer with the Beider-Morse filter, indexing some documents can take quite a while (over a minute in some cases). Most of this time is spent in the call to IndexWriter.addDocument(doc).
My problem is that while a document is being indexed, searches get blocked and are not processed until the indexing operation finishes. Having search blocked for more than a couple of seconds is unacceptable.
Before each search, I do the following:
DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, false);
if (newReader != null)
{
reader = newReader;
searcher = new IndexSearcher(reader);
}
I guess this is what causes the problem. However, it is the only way to get the most recent changes when I do a search. I'd like to maintain this behaviour in general, but if the search would otherwise block, I wouldn't mind using a slightly older version of the index.
Is there any way to fix this?
Among other options, consider always keeping an IndexWriter open and performing commits on it as you need.
Then you should open index readers against it (not against the directory) and refresh them as needed. Or simply use a SearcherManager, which will not only refresh searchers for you but will also maintain a pool of readers and manage references to them, in order to avoid reopening if the index contents haven't changed.
See more here.
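A minimal sketch for Lucene 5.x, assuming one long-lived IndexWriter. SearcherManager hands out pooled, reference-counted searchers, so a slow addDocument() call does not block searches; they simply run against the last refreshed view:

// build once, from the writer (not the directory), to get near-real-time readers
SearcherManager manager = new SearcherManager(writer, true, new SearcherFactory());

// search path: never blocks on indexing, uses the last refreshed point-in-time view
IndexSearcher searcher = manager.acquire();
try {
    TopDocs hits = searcher.search(query, 10);
    // ... process hits ...
} finally {
    manager.release(searcher);   // the searcher must not be used after release
}

// refresh path: call periodically (e.g. from a scheduled background thread);
// if another thread is already refreshing, this returns immediately
manager.maybeRefresh();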

SearcherManager maybeRefresh method not happening

I'm using Lucene 4.0 API in order to implement search in my application.
The navigation flow is as follows:
The user creates a new article. A new Document is then added to the index using IndexWriter.addDocument().
After the addition, the SearcherManager.maybeRefresh() method is called. The SearcherManager is built from the Writer in order to have access to NRT (near-real-time) search.
Just after the creation, the user decides to add a new tag to his article. This is when Writer.updateDocument() is called. Considering that at step 2 I asked for a refresh, I would expect the Searcher to find the added document. However, it is not found.
Is this the common behaviour? Is there a way to make sure that the Searcher finds the Document (other than commit)?
I am guessing that your newly created document is being kept in memory. Lucene doesn't make changes visible immediately; it keeps some documents in memory because I/O operations take time and resources, and it is good practice to write only once the buffer is full. But since you would like to view and change the document immediately, try flushing the buffer first (IndexWriter.flush()); this should write the changes to disk. Only after that, try the (maybe) refresh.
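Independent of buffering, a SearcherManager only exposes changes made before its last successful refresh, so the refresh also has to happen after the updateDocument() call of step 3. A minimal sketch, assuming the manager was built from the writer as described in the question:

// step 3: update the document (add the new tag) ...
writer.updateDocument(new Term("id", articleId), updatedDoc);

// ... then refresh again so the update becomes visible to new searchers
searcherManager.maybeRefresh();

IndexSearcher searcher = searcherManager.acquire();
try {
    // the updated document is now searchable, without a commit
} finally {
    searcherManager.release(searcher);
}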

What is the best Java text indexing library for Google App Engine?

At the moment I know that Compass can handle this work, but indexing with Compass looks pretty expensive. Are there any lighter alternatives?
To be honest, I don't know if Lucene will be lighter than Compass in terms of indexing (why would it be, doesn't Compass use Lucene for that?).
Anyway, because you asked for alternatives, there is GAELucene. I'm quoting its announcement below:
Enlightened by the discussion "Can I run Lucene in google app engine?", I implemented a Google-datastore-based Lucene component, GAELucene, which can help you to run search applications on Google App Engine.
The main classes of GAELucene include:
GAEDirectory - a read-only Directory based on the Google datastore.
GAEFile - stands for an index file; the file's byte content is split into multiple GAEFileContent segments.
GAEFileContent - stands for a segment of an index file.
GAECategory - the identifier of different indices.
GAEIndexInput - a memory-resident IndexInput implementation like RAMInputStream.
GAEIndexReader - a wrapper for IndexReader, cached in GAEIndexReaderPool.
GAEIndexReaderPool - a pool for GAEIndexReader.
The following code snippet demonstrates the use of GAELucene to do searching:
Query queryObject = parserQuery(request);
GAEIndexReaderPool readerPool = GAEIndexReaderPool.getInstance();
GAEIndexReader indexReader = readerPool.borrowReader(INDEX_CATEGORY_DEMO);
IndexSearcher searcher = new IndexSearcher(indexReader);
Hits hits = searcher.search(queryObject);
readerPool.returnReader(indexReader);
I warmly recommend reading the whole discussion on Nabble; it is very informative.
Just in case, regarding Compass, Shay Banon wrote a blog entry detailing how to use Compass in App Engine here: http://www.kimchy.org/searchable-google-appengine-with-compass/
Apache Lucene is the de-facto choice for full text indexing in Java. To be honest, I don't know if Lucene will be lighter than Compass in terms of indexing, but it looks like Compass Core contains "An implementation of Lucene Directory to store the index within a database (using Jdbc). It is separated from Compass code base and can be used with pure Lucene applications." plus tons of other stuff. You could try to separate just the Lucene component, thereby stripping away several libs and making it more lightweight. Either that, or ditch Compass altogether and use pure, unadorned Lucene.
For Google App Engine, the only indexing library I've seen is appengine-search, with a description of how to use it on this page. I haven't tried it out though.
I've used Lucene (which Compass is based on) and found it to work great with comparatively low expense. The indexing is a task that you can schedule at times that work for your app.
Some alternative indexing projects are mentioned in this SO thread, including Xapian and Minion. I haven't checked either of these out, though, since Lucene did everything I needed very well.
The Google App Engine internal search seems better, and even has support for synonyms:
https://developers.google.com/appengine/docs/java/search/
If you want to run Lucene on GAE you might also have a look at LuGAEne, an implementation of Lucene's Directory for GAE.
Usage is actually pretty simple: just replace one of Lucene's standard directories with GaeDirectory:
Directory directory = new GaeDirectory("MyIndex");
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, analyzer);
IndexWriter writer = new IndexWriter(directory, config);
...
gaelucene seems to be in "maintenance mode" (no commits since Sep 2009), and lucene-appengine does not (yet) work when you're using Objectify version 4 in your application.
Disclaimer: I'm the author of LuGAEne.
