I am implementing a search feature for a news website. Users submit news articles containing a title and text, and these articles are currently inserted directly into a database. I have heard that full-text searching inside a database containing very long text would not be efficient.
So I tried using Lucene for indexing and searching. I am able to index the full database with it and also to search the content, but I am not sure if I am using the best approach.
Here is my indexer class:
public class LuceneIndexer {

    public static void indexNews(News p, IndexWriter indexWriter) throws IOException {
        Document doc = new Document();
        doc.add(new Field("id", p.getNewsId(), Field.Store.YES, Field.Index.NO));
        doc.add(new Field("title", p.getTitle(), Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("text", p.getNewsRawText(), Field.Store.YES, Field.Index.UN_TOKENIZED));
        String fullSearchableText = p.getTitle() + " " + p.getNewsRawText();
        doc.add(new Field("content", fullSearchableText, Field.Store.NO, Field.Index.TOKENIZED));
        indexWriter.addDocument(doc);
    }

    public static void rebuildIndexes() {
        try {
            System.out.println("started indexing");
            IndexWriter w = getIndexWriter();
            ArrayList<News> n = new GetNewsInfo().getLastPosts(0);
            for (News news : n) {
                indexNews(news, w);
            }
            closeIndexWriter(w);
            System.out.println("indexing done");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static IndexWriter getIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
        return new IndexWriter(GlobalData.LUCENE_INDEX_STOREAGE, new StandardAnalyzer(), true);
    }

    public static void closeIndexWriter(IndexWriter w) throws CorruptIndexException, IOException {
        w.close();
    }
}
Is the above code efficient?
I think I should add a document to the index when it is submitted by a user, instead of indexing the full database again.
Do I need to create a new IndexWriter every time an article is submitted?
Is it efficient to open and close an IndexWriter very frequently?
You are right that you don't need to re-add every document to the index; you only need to add new ones, and the rest will remain in the index.
But then you do need to create a new IndexWriter every time. If you prefer, you can use a service or something similar which keeps an IndexWriter alive, but the opening and closing does not take much time. If you do reuse an IndexWriter, make sure that you call indexWriter.commit() after each add.
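If you go the reuse route, here is a minimal sketch of what that could look like, using the same pre-4.0 Lucene API as the question (IndexWriter.commit() is available from 2.4 onward) and reusing the question's LuceneIndexer, News and GlobalData classes; the service class and method names are illustrative, not a prescribed design:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class NewsIndexService {

    private final IndexWriter writer;

    public NewsIndexService() throws IOException {
        // false = append to the existing index instead of recreating it
        writer = new IndexWriter(GlobalData.LUCENE_INDEX_STOREAGE, new StandardAnalyzer(), false);
    }

    public synchronized void onNewsSubmitted(News news) throws IOException {
        LuceneIndexer.indexNews(news, writer);
        writer.commit(); // make the new article visible to searchers opened afterwards
    }

    public void shutdown() throws IOException {
        writer.close();
    }
}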
Do I need to create a new IndexWriter every time an article is submitted?
No.
Is it efficient to open and close an IndexWriter very frequently?
Definitely not! You should read the guidelines for indexing here.
I am trying to index text and Word files and to search for content in these files. It works when I search for a specific string, but when I try to search with a regular expression, it no longer works. Below I will list the crucial code for explanation.
The index function:
// FileBean is the class contains the file path,
// file content, file lastModified information
public void indexDoc(IndexWriter writer, FileBean t) throws Exception {
Document doc = new Document();
System.out.println(t.getPath());
doc.add(new StringField(LuceneConstants.PATH, t.getPath(), Field.Store.YES));
doc.add(new LongPoint(LuceneConstants.MODIFIED, t.getModified()));
doc.add(new TextField(LuceneConstants.CONTENT, t.getContent(), Field.Store.NO));
if (writer.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE){
writer.addDocument(doc);
} else{
writer.updateDocument(new Term(LuceneConstants.PATH, t.getPath()), doc);
}
}
I am using QueryParser to build the query; the query is a regex query such as '\d{16}' for a 16-digit number.
The search function:
public static TopDocs getResults(IndexSearcher searcher, Query query) throws IOException {
TopDocs docs = searcher.search(query, 10);
return docs;
}
TopDocs' totalHits is 0, which is not what I expect; it seems that no documents are matched, even though the indexed content should satisfy the given regular expression.
I tried googling it but still have not found a working solution. Can anyone suggest why totalHits is 0? Thanks.
Try taking away the '+', so it would be '\d{16}'.
OMG, I finally found the reason, although I don't know the deeper cause: if I use '[0-9]' instead of '\d', it works!
If anyone could explain this, that would be wonderful!
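A likely explanation (not confirmed by the original poster): Lucene's RegexpQuery is backed by Lucene's own automaton-based RegExp syntax, which in many versions does not recognize Perl-style shorthand classes such as \d, so the character class [0-9] is needed instead. A sketch of building the working query directly, assuming the IndexSearcher and the LuceneConstants.CONTENT field from the question:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.TopDocs;

public final class RegexSearchExample {
    // Finds documents whose content contains a term consisting of exactly 16 digits.
    public static TopDocs findSixteenDigitNumbers(IndexSearcher searcher) throws java.io.IOException {
        Query query = new RegexpQuery(new Term(LuceneConstants.CONTENT, "[0-9]{16}"));
        return searcher.search(query, 10);
    }
}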
We have a program, which runs continually, does various things, and changes some records in our database. Those records are indexed using Lucene. So each time we change an entity we do something like:
open db transaction, open Lucene IndexWriter
make the changes to the db in the transaction, and update that entity in Lucene by using indexWriter.deleteDocuments(..) then indexWriter.addDocument(..).
If all went well, commit the db transaction and commit the IndexWriter.
This works fine, but over time, the indexWriter.commit() takes more and more time. Initially it takes about 0.5 seconds but after a few hundred such transactions it takes more than 3 seconds. I don't doubt it would take even longer if the script ran longer.
My solution so far has been to comment out the indexWriter.addDocument(..) and indexWriter.commit() calls and to recreate the entire index every now and again, by first using indexWriter.deleteAll() and then re-adding all documents within one Lucene transaction/IndexWriter (about 250k documents in about 14 seconds). But this obviously goes against the transactional approach offered by the database and Lucene, which keeps the two in sync and keeps database updates visible to users of our tools who search using Lucene.
It seems strange that I can add 250k documents in 14 seconds, but adding 1 document takes 3 seconds. What am I doing wrong, how can I improve the situation?
What you are doing wrong is assuming that Lucene's built-in transactional capabilities have performance and guarantees comparable to a typical relational database, when they really don't. More specifically in your case, a commit syncs all index files with the disk, making commit times proportional to index size. That is why over time your indexWriter.commit() takes more and more time. The Javadoc for IndexWriter.commit() even warns that:
This may be a costly operation, so you should test the cost in your
application and do it only when really necessary.
Can you imagine database documentation telling you to avoid doing commits?
Since your main goal seems to be to keep database updates visible through Lucene searches in a timely manner, to improve the situation, do the following:
Have indexWriter.deleteDocuments(..) and indexWriter.addDocument(..) trigger after a successful database commit, instead of before
Perform indexWriter.commit() periodically instead of every transaction, just to make sure your changes are eventually written to disk
Use a SearcherManager for searching and invoke maybeRefresh() periodically to see updated documents within a reasonable time frame
The following is an example program which demonstrates how document updates can be retrieved by periodically performing maybeRefresh(). It builds an index of 100000 documents, uses a ScheduledExecutorService to set up periodic invocations of commit() and maybeRefresh(), prompts you to update a single document, then repeatedly searches until the update is visible. All resources are properly cleaned up on program termination. Note that the controlling factor for when the update becomes visible is when maybeRefresh() is invoked, not commit().
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.concurrent.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
public class LucenePeriodicCommitRefreshExample {
ScheduledExecutorService scheduledExecutor;
MyIndexer indexer;
MySearcher searcher;
void init() throws IOException {
scheduledExecutor = Executors.newScheduledThreadPool(3);
indexer = new MyIndexer();
indexer.init();
searcher = new MySearcher(indexer.indexWriter);
searcher.init();
}
void destroy() throws IOException {
searcher.destroy();
indexer.destroy();
scheduledExecutor.shutdown();
}
class MyIndexer {
IndexWriter indexWriter;
Future commitFuture;
void init() throws IOException {
indexWriter = new IndexWriter(FSDirectory.open(Paths.get("C:\\Temp\\lucene-example")), new IndexWriterConfig(new StandardAnalyzer()));
indexWriter.deleteAll();
for (int i = 1; i <= 100000; i++) {
add(String.valueOf(i), "whatever " + i);
}
indexWriter.commit();
commitFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
try {
indexWriter.commit();
} catch (IOException e) {
e.printStackTrace();
}
}, 5, 5, TimeUnit.MINUTES);
}
void add(String id, String text) throws IOException {
Document doc = new Document();
doc.add(new StringField("id", id, Field.Store.YES));
doc.add(new StringField("text", text, Field.Store.YES));
indexWriter.addDocument(doc);
}
void update(String id, String text) throws IOException {
indexWriter.deleteDocuments(new Term("id", id));
add(id, text);
}
void destroy() throws IOException {
commitFuture.cancel(false);
indexWriter.close();
}
}
class MySearcher {
IndexWriter indexWriter;
SearcherManager searcherManager;
Future maybeRefreshFuture;
public MySearcher(IndexWriter indexWriter) {
this.indexWriter = indexWriter;
}
void init() throws IOException {
searcherManager = new SearcherManager(indexWriter, true, null);
maybeRefreshFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
try {
searcherManager.maybeRefresh();
} catch (IOException e) {
e.printStackTrace();
}
}, 0, 5, TimeUnit.SECONDS);
}
String findText(String id) throws IOException {
IndexSearcher searcher = null;
try {
searcher = searcherManager.acquire();
TopDocs topDocs = searcher.search(new TermQuery(new Term("id", id)), 1);
return searcher.doc(topDocs.scoreDocs[0].doc).getField("text").stringValue();
} finally {
if (searcher != null) {
searcherManager.release(searcher);
}
}
}
void destroy() throws IOException {
maybeRefreshFuture.cancel(false);
searcherManager.close();
}
}
public static void main(String[] args) throws IOException {
LucenePeriodicCommitRefreshExample example = new LucenePeriodicCommitRefreshExample();
example.init();
Runtime.getRuntime().addShutdownHook(new Thread() {
@Override
public void run() {
try {
example.destroy();
} catch (IOException e) {
e.printStackTrace();
}
}
});
try (Scanner scanner = new Scanner(System.in)) {
System.out.print("Enter a document id to update (from 1 to 100000): ");
String id = scanner.nextLine();
System.out.print("Enter what you want the document text to be: ");
String text = scanner.nextLine();
example.indexer.update(id, text);
long startTime = System.nanoTime();
String foundText;
do {
foundText = example.searcher.findText(id);
} while (!text.equals(foundText));
long elapsedTimeMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTime);
System.out.format("it took %d milliseconds for the searcher to see that document %s is now '%s'\n", elapsedTimeMillis, id, text);
} catch (Exception e) {
e.printStackTrace();
} finally {
System.exit(0);
}
}
}
This example was successfully tested using Lucene 5.3.1 and JDK 1.8.0_66.
My first approach: do not commit that often. When you delete and re-add a document you will probably trigger a merge, and merges are somewhat slow.
If you use a near-real-time IndexReader you can still search as before (it does not show deleted documents), but you do not pay the commit penalty. You can always commit later to make sure the file system is in sync with your index, and you can do this while using your index, so you do not have to block all other operations.
See also this interesting blog post (and do read the other posts as well, they provide great information).
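A minimal sketch of that near-real-time approach, assuming Lucene 5.x as in the example above (these DirectoryReader overloads changed in later major versions); the wrapper class is illustrative:

import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

public class NrtSearchExample {

    private final IndexWriter writer;
    private DirectoryReader reader;
    private IndexSearcher searcher;

    public NrtSearchExample(IndexWriter writer) throws IOException {
        this.writer = writer;
        // The NRT reader sees uncommitted adds/deletes without paying for IndexWriter.commit().
        this.reader = DirectoryReader.open(writer, true); // true = apply deletes
        this.searcher = new IndexSearcher(reader);
    }

    // Call periodically (or before searching): openIfChanged returns null if nothing
    // changed, otherwise a fresh reader; the stale one must be closed.
    public synchronized IndexSearcher refreshAndGetSearcher() throws IOException {
        DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);
        if (newReader != null) {
            reader.close();
            reader = newReader;
            searcher = new IndexSearcher(newReader);
        }
        return searcher;
    }

    public synchronized void close() throws IOException {
        reader.close();
    }
}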
I am using Apache Lucene to index HTML files. I store the path of each HTML file in the Lucene index. The index is being written, and I have checked it in Luke.
But when I search for the path of a file, a very large number of documents is returned. I want it to match the exact path as it was stored in the Lucene index.
I am using the following code for index creation:
try{
File indexDir = new File("d:/abc/");
IndexWriter indexWriter = new IndexWriter(
FSDirectory.open(indexDir),
new SimpleAnalyzer(),
true,
IndexWriter.MaxFieldLength.LIMITED);
indexWriter.setUseCompoundFile(false);
Document doc= new Document();
String path=f.getCanonicalPath();
doc.add(new Field("fpath",path,
Field.Store.YES,Field.Index.ANALYZED));
indexWriter.addDocument(doc);
indexWriter.optimize();
indexWriter.close();
}
catch(Exception ex )
{
ex.printStackTrace();
}
Following is the code for searching the file path:
File indexDir = new File("d:/abc/");
int maxhits = 10000000;
int len = 0;
try {
Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(directory, true);
QueryParser parser = new QueryParser(Version.LUCENE_36,"fpath", new SimpleAnalyzer());
Query query = parser.parse(path);
query.setBoost((float) 1.5);
TopDocs topDocs = searcher.search(query, maxhits);
ScoreDoc[] hits = topDocs.scoreDocs;
len = hits.length;
JOptionPane.showMessageDialog(null,"items found"+len);
}
catch(Exception ex)
{
ex.printStackTrace();
}
It shows the number of documents found as the total number of documents in the index, while the searched path exists only once.
You are analyzing the path, which will split it into separate terms. The root path term (like catalog in /catalog/products/versions) likely occurs in all documents, so any search that includes catalog without forcing all terms to be mandatory will return all documents.
You need a search query like (using the example above):
+catalog +products +versions
to force all terms to be present.
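One way to get that behavior without typing the '+' signs by hand is to make AND the default operator on the QueryParser from the question. This is a sketch against the Lucene 3.6 API used above, not the answer author's code; it slots into the search snippet in place of the existing parser setup:

QueryParser parser = new QueryParser(Version.LUCENE_36, "fpath", new SimpleAnalyzer());
parser.setDefaultOperator(QueryParser.AND_OPERATOR); // every parsed term becomes mandatory
Query query = parser.parse(QueryParser.escape(path)); // escape characters such as ':' in Windows paths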
Note that this gets more complicated if the same set of terms can occur in different orders, like:
/catalog/products/versions
/versions/catalog/products/SKUs
In that case, you need to use a different Lucene tokenizer than the tokenizer in the Standard Analyzer.
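If the goal is strictly an exact-path lookup, another option (a sketch, not part of the original answer) is to index the path a second time without analysis and query that field with a TermQuery, so the match is against the stored value as a whole and term order never matters; the field name fpath_exact is illustrative:

// At index time (Lucene 3.x API): keep the raw path as a single, un-analyzed term.
doc.add(new Field("fpath_exact", path, Field.Store.YES, Field.Index.NOT_ANALYZED));

// At search time: an exact match on that single term.
Query exactQuery = new TermQuery(new Term("fpath_exact", path));
TopDocs topDocs = searcher.search(exactQuery, 10);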
I am trying to implement the simplest possible Lucene search. I followed this as my starting point.
I can understand the sample code:
public static void indexHotel(Hotel hotel) throws IOException {
IndexWriter writer = (IndexWriter) getIndexWriter(false);
Document doc = new Document();
doc.add(new Field("id", hotel.getId(), Field.Store.YES,
Field.Index.NO));
doc.add(new Field("name", hotel.getName(), Field.Store.YES,
Field.Index.TOKENIZED));
doc.add(new Field("city", hotel.getCity(), Field.Store.YES,
Field.Index.UN_TOKENIZED));
doc.add(new Field("description", hotel.getDescription(),
Field.Store.YES,
Field.Index.TOKENIZED));
String fullSearchableText = hotel.getName() + " " + hotel.getCity() + " " + hotel.getDescription();
doc.add(new Field("content", fullSearchableText,
Field.Store.NO,
Field.Index.TOKENIZED));
writer.addDocument(doc);
}
The thing I can't work out in this code is what getIndexWriter(false) does. This method is nowhere mentioned in the post I followed. Moreover, in another code block:
public void rebuildIndexes() throws IOException {
//
// Erase existing index
//
getIndexWriter(true);
//
// Index all hotel entries
//
Hotel[] hotels = HotelDatabase.getHotels();
for(Hotel hotel: hotels) {
indexHotel(hotel);
}
//
// Don’t forget to close the index writer when done
//
closeIndexWriter();
}
some undefined methods are used.
This is a bit confusing for a beginner like me.
I want to create only one index. I think getIndexWriter(true) and closeIndexWriter() are utility methods just to obtain an IndexWriter, but I can't work out what the true in getIndexWriter(true) is used for.
By following some other posts I've become even more confused about how the IndexWriter should be created.
Can somebody please put me on the right path if I am doing anything wrong?
Well, depending on where your index lives (RAM or file system) you can open different IndexWriters.
Assuming that you are trying to write the index to the file system, you should have something like this:
public static final Version luceneVersion = Version.LUCENE_40;
IndexWriter getIndexWriter() throws IOException {
Directory indexDir = FSDirectory.open(new File( INDEX_PATH ));
IndexWriterConfig luceneConfig = new IndexWriterConfig(
luceneVersion, new StandardAnalyzer(luceneVersion));
return(new IndexWriter(indexDir, luceneConfig));
}
Note the analyzer class StandardAnalyzer; you should choose the analyzer depending on the application's requirements. I reckon StandardAnalyzer is good enough for what you want to do.
The input argument probably asks whether a new writer (and therefore a new index) should be created.
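Since the tutorial's own code uses the pre-4.0 field API (Field.Index.TOKENIZED), its helper most likely maps that boolean onto IndexWriter's create flag: true wipes and recreates the index (which is what rebuildIndexes() wants), false appends to the existing one (which is what indexHotel() wants). A guess at what those helpers could look like; the class name and index path are illustrative, not the tutorial's actual code:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class HotelIndexUtil {

    private static final String INDEX_PATH = "hotel-index"; // illustrative location
    private static IndexWriter indexWriter;

    // create = true erases any existing index and starts fresh;
    // create = false opens the existing index for appending.
    public static IndexWriter getIndexWriter(boolean create) throws IOException {
        if (indexWriter == null) {
            indexWriter = new IndexWriter(INDEX_PATH, new StandardAnalyzer(), create);
        }
        return indexWriter;
    }

    public static void closeIndexWriter() throws IOException {
        if (indexWriter != null) {
            indexWriter.close();
            indexWriter = null;
        }
    }
}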
We use Lucene to index some internal documents. Sometimes we need to remove documents. These documents have a unique id and are represented by a class DocItem as follows (ALL THE CODE IS A SIMPLIFIED VERSION WITH ONLY THE SIGNIFICANT (I hope) PARTS):
public final class DocItem {
public static final String fID = "id";
public static final String fTITLE = "title";
private Document doc = new Document();
private Field id = new Field(fID, "", Field.Store.YES, Field.Index.ANALYZED);
private Field title = new Field(fTITLE, "", Field.Store.YES, Field.Index.ANALYZED);
public DocItem() {
doc.add(id);
doc.add(title);
}
... getters & setters
public Document getDoc() {
return doc;
}
}
So, to index a document, a new DocItem is created and passed to an indexer class as follows:
public static void index(DocItem docitem) {
File file = new File("indexdir");
Directory dir= new SimpleFSDirectory(file);
IndexWriter idxWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);
idxWriter.addDocument(docitem.getDoc());
idxWriter.close();
}
We created an auxiliary method to iterate over the index directory:
public static void listAll() {
File file = new File("indexdir");
Directory dir = new SimpleFSDirectory(file);
IndexReader reader = IndexReader.open(dir);
for (int i = 0; i < reader.maxDoc(); i++) {
Document doc = reader.document(i);
System.out.println(doc.get(DocItem.fID));
}
}
Running the listAll, we can see that our docs are being indexed properly. At least, we can see the id and other attributes.
We retrieve the document using IndexSearcher as follows:
public static DocItem search(String id) {
File file = new File("indexdir");
Directory dir = new SimpleFSDirectory(file);
IndexSearcher searcher = new IndexSearcher(dir, true);
Query q = new QueryParser(Version.LUCENE_30, DocItem.fID, new StandardAnalyzer(Version.LUCENE_30)).parse(id);
TopDocs td = searcher.search(q, 1);
ScoreDoc[] hits = td.scoreDocs;
searcher.close();
return hits[0];
}
So after retrieving it, we are trying to delete it with:
public static void Delete(DocItem docitem) {
File file = new File("indexdir");
Directory dir= new SimpleFSDirectory(file);
IndexWriter idxWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);
idxWriter.deleteDocuments(new Term(DocItem.fID, docitem.getId()));
idxWriter.commit();
idxWriter.close();
}
The problem is that it doesn't work: the document is never deleted. If I run listAll() after the deletion, the document is still there. We tried to use IndexReader instead, with no luck.
Judging by this post and this post, we think that we are using it correctly.
What we are doing wrong? Any advice? We are using lucene 3.0.3 and java 1.6.0_24.
TIA,
Bob
I would suggest using IndexReader's deleteDocuments; it returns the number of documents deleted, which will help you narrow down whether the deletions are happening in the first place.
The advantage of this over the IndexWriter method is that it returns the total number of documents deleted; if there are none, it returns 0.
Also see How do I delete documents from the index? and this post.
Edit: I also noticed that you open the IndexReader in read-only mode. Can you change the IndexReader.open() call in listAll() to pass false as the second parameter? This allows read-write access; the read-only reader is perhaps the source of the error.
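A sketch of that suggestion against the question's Lucene 3.0.3 setup; the reader must be opened with readOnly = false for deletions to be allowed, and closing the reader commits them. The helper name is illustrative, not the answer author's code:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.SimpleFSDirectory;

public static int deleteById(String id) throws Exception {
    File file = new File("indexdir");
    SimpleFSDirectory dir = new SimpleFSDirectory(file);
    IndexReader reader = IndexReader.open(dir, false); // readOnly = false, otherwise deletes throw
    try {
        // Returns how many documents matched the term and were marked as deleted.
        return reader.deleteDocuments(new Term(DocItem.fID, id));
    } finally {
        reader.close(); // flushes the deletions to the index directory
    }
}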
I call IndexWriterConfig#setMaxBufferedDeleteTerms(1) during IndexWriter instantiation/configuration, and all delete operations go to disk immediately. Maybe it's not correct design-wise, but it solves the problem explained here.
public static void Delete(DocItem docitem) {
File file = new File("indexdir");
Directory dir= new SimpleFSDirectory(file);
IndexWriter idxWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);
idxWriter.deleteDocuments(new Term(DocItem.fID, docitem.getId()));
idxWriter.commit();
idxWriter.close();
}
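Note that the snippet above still uses the Lucene 3.0-style constructor; in that API the equivalent of the IndexWriterConfig call is the instance method IndexWriter.setMaxBufferedDeleteTerms, which could be slotted in right after constructing the writer (a sketch, not the answer author's exact code):

IndexWriter idxWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
        IndexWriter.MaxFieldLength.UNLIMITED);
// Flush buffered delete terms after every single delete, so deletions reach
// the index files without waiting for a larger buffer to fill up.
idxWriter.setMaxBufferedDeleteTerms(1);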