I am trying to implement the simplest Lucene search. I followed this as my starting point, and I can understand the sample code:
public static void indexHotel(Hotel hotel) throws IOException {
    IndexWriter writer = (IndexWriter) getIndexWriter(false);
    Document doc = new Document();
    doc.add(new Field("id", hotel.getId(), Field.Store.YES,
            Field.Index.NO));
    doc.add(new Field("name", hotel.getName(), Field.Store.YES,
            Field.Index.TOKENIZED));
    doc.add(new Field("city", hotel.getCity(), Field.Store.YES,
            Field.Index.UN_TOKENIZED));
    doc.add(new Field("description", hotel.getDescription(),
            Field.Store.YES, Field.Index.TOKENIZED));
    String fullSearchableText = hotel.getName() + " "
            + hotel.getCity() + " " + hotel.getDescription();
    doc.add(new Field("content", fullSearchableText,
            Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
}
The thing I can't get in this code is what getIndexWriter(false) does; this method is nowhere defined in the post I followed. Moreover, in another code block:
public void rebuildIndexes() throws IOException {
    // Erase the existing index
    getIndexWriter(true);

    // Index all hotel entries
    Hotel[] hotels = HotelDatabase.getHotels();
    for (Hotel hotel : hotels) {
        indexHotel(hotel);
    }

    // Don't forget to close the index writer when done
    closeIndexWriter();
}
some undefined methods are used, which is a bit confusing for a beginner like me.
I want to create only one index. I think getIndexWriter(true) and closeIndexWriter() are utility methods that just obtain and release an IndexWriter, but I can't guess what the true in getIndexWriter(true) is used for.
Following some other posts has only made me more confused about how an IndexWriter should be created.
Can somebody please put me on the right path if I am doing anything wrong?
Well, depending on where your index lives (RAM or the file system), you can open different IndexWriters.
Assuming that you are trying to write the index to the file system, you should have something like this:
public static final Version luceneVersion = Version.LUCENE_40;

IndexWriter getIndexWriter() throws IOException {
    Directory indexDir = FSDirectory.open(new File(INDEX_PATH));
    IndexWriterConfig luceneConfig = new IndexWriterConfig(
            luceneVersion, new StandardAnalyzer(luceneVersion));
    return new IndexWriter(indexDir, luceneConfig);
}
Note the analyzer class StandardAnalyzer; you should choose the analyzer depending on the application requirements. I reckon StandardAnalyzer is good enough for what you want to do.
As for the input argument: it most likely asks whether a new index should be created, erasing anything that is already there.
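To make that guess concrete: a sketch of what the tutorial's helper methods probably look like, using the modern IndexWriterConfig.OpenMode API. The INDEX_PATH constant and the cached writer field are my assumptions, not code from the original post:

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexWriterUtil {
    public static final Version luceneVersion = Version.LUCENE_40;
    private static final String INDEX_PATH = "indexes"; // assumed location
    private static IndexWriter writer; // cached so both helpers share one instance

    // create == true  -> erase any existing index and start from scratch
    // create == false -> append to / reuse the existing index
    public static IndexWriter getIndexWriter(boolean create) throws IOException {
        if (writer == null) {
            Directory dir = FSDirectory.open(new File(INDEX_PATH));
            IndexWriterConfig config = new IndexWriterConfig(
                    luceneVersion, new StandardAnalyzer(luceneVersion));
            config.setOpenMode(create
                    ? IndexWriterConfig.OpenMode.CREATE
                    : IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            writer = new IndexWriter(dir, config);
        }
        return writer;
    }

    public static void closeIndexWriter() throws IOException {
        if (writer != null) {
            writer.close();
            writer = null;
        }
    }
}
```

So rebuildIndexes() calls getIndexWriter(true) once to wipe the old index, and the subsequent indexHotel(...) calls with getIndexWriter(false) reuse that same writer.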
Related
I am trying to index text and Word files and to search for content in them. It works when I search for a specific string, but when I try to search with a regular expression it no longer works. In the following I will list the crucial code.
The index function:
// FileBean is the class that holds the file path,
// file content, and lastModified information
public void indexDoc(IndexWriter writer, FileBean t) throws Exception {
    Document doc = new Document();
    System.out.println(t.getPath());
    doc.add(new StringField(LuceneConstants.PATH, t.getPath(), Field.Store.YES));
    doc.add(new LongPoint(LuceneConstants.MODIFIED, t.getModified()));
    doc.add(new TextField(LuceneConstants.CONTENT, t.getContent(), Field.Store.NO));
    if (writer.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE) {
        writer.addDocument(doc);
    } else {
        writer.updateDocument(new Term(LuceneConstants.PATH, t.getPath()), doc);
    }
}
I am using QueryParser to build the query; it will be a regexp query like '\d{16}' for a 16-digit number.
The search function:
public static TopDocs getResults(IndexSearcher searcher, Query query) throws IOException {
    TopDocs docs = searcher.search(query, 10);
    return docs;
}
TopDocs's totalHits is 0, which is not what I expect; it seems as if no file is being matched, even though the content should match the given regular expression.
I tried googling it but still have not found a solution. Can anyone suggest why totalHits is returning 0? Thanks.
Try taking away the '+', so it would be '\d{16}'.
OMG, I finally found the reason, though I don't know the deeper cause: if I use '[0-9]' instead of '\d', it works!
If anyone could explain this, that would be wonderful!
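A likely explanation: Lucene's RegexpQuery (which is what QueryParser's /.../ syntax produces) does not use java.util.regex. It uses Lucene's own automaton-based RegExp syntax, and in many Lucene versions that syntax has no \d shorthand, so character classes must be spelled out as [0-9]. The pattern also has to match an entire indexed term, not a substring of the document. A minimal sketch, where the field name "content" stands in for LuceneConstants.CONTENT:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;

public class RegexpExample {
    // Builds a query matching terms that are exactly 16 digits.
    // Lucene regexps use the automaton syntax, not java.util.regex,
    // so write [0-9] rather than \d; the pattern must match a whole term.
    public static Query sixteenDigitQuery() {
        return new RegexpQuery(new Term("content", "[0-9]{16}"));
    }
}
```

If the 16-digit number is glued to other characters in the text, the analyzer may not have produced it as a standalone term, in which case no regexp will find it; check the actual terms with a tool like Luke.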
First time posting; long-time reader. I apologize ahead of time if this was already asked here (I'm new to Lucene as well!). I've done a lot of research and wasn't able to find a good explanation/example for my question.
First of all, I've used IKVM.NET to convert Lucene 4.9 from Java so I can include it in my .NET application. I chose to do this so I could use the most recent version of Lucene. No issues there.
I am trying to create a basic example to start learning Lucene and to apply it to my app. I've done countless Google searches and read lots of articles, Apache's website, etc. My code mostly follows the example here: http://www.lucenetutorial.com/lucene-in-5-minutes.html
My question is, I don't believe I want to use RAMDirectory, right? Since I will be indexing a database and allowing users to search it via the website, I opted for FSDirectory because I didn't think it should all be stored in memory.
When the IndexWriter is created it creates new files each time (.cfe, .cfs, .si, segments.gen, write.lock, etc.). It seems to me you would create these files once and then use them until the index needs to be rebuilt?
So how do I create an IndexWriter without recreating the index files?
Code:
StandardAnalyzer analyzer;
Directory directory;

protected void Page_Load(object sender, EventArgs e)
{
    var version = org.apache.lucene.util.Version.LUCENE_CURRENT;
    analyzer = new StandardAnalyzer(version);
    if (directory == null)
    {
        directory = FSDirectory.open(new java.io.File(
            HttpContext.Current.Request.PhysicalApplicationPath + "/indexes"));
    }
    IndexWriterConfig config = new IndexWriterConfig(version, analyzer);
    // I found setting the open mode will overwrite the files,
    // but it still creates new ones each time
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    IndexWriter w = new IndexWriter(directory, config);
    addDoc(w, "test", "1234");
    addDoc(w, "test1", "1234");
    addDoc(w, "test2", "1234");
    addDoc(w, "test3", "1234");
    w.close();
}
private static void addDoc(IndexWriter w, String _keyword, String _keywordid)
{
    Document doc = new Document();
    doc.add(new TextField("Keyword", _keyword, Field.Store.YES));
    doc.add(new StringField("KeywordID", _keywordid, Field.Store.YES));
    w.addDocument(doc);
}
protected void searchButton_Click(object sender, EventArgs e)
{
    String results = "";
    String querystr = searchTextBox.Text.ToString();
    Query q = new QueryParser(org.apache.lucene.util.Version.LUCENE_4_0,
        "Keyword", analyzer).parse(querystr);
    int hitsPerPage = 100;
    DirectoryReader reader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    if (hits.Length == 0)
    {
        label.Text = "Nothing was found.";
    }
    else
    {
        for (int i = 0; i < hits.Length; ++i)
        {
            int docID = hits[i].doc;
            Document d = searcher.doc(docID);
            results += "<br />" + (i + 1) + ". " + d.get("KeywordID") + "\t"
                + d.get("Keyword") + " Hit Score: "
                + hits[i].score.ToString() + "<br />";
        }
        label.Text = results;
    }
    // close the reader on both paths, not only when there are hits
    reader.close();
}
Yes, RAMDirectory is great for quick, on-the-fly tests and tutorials, but in production you will usually want to store your index on the file system through an FSDirectory.
The reason it's rewriting the index every time you open the writer is that you are setting the OpenMode to IndexWriterConfig.OpenMode.CREATE. CREATE means you want to remove any existing index at that location, and start from scratch. You probably want IndexWriterConfig.OpenMode.CREATE_OR_APPEND, which will open an existing index if one is found.
One minor note:
You shouldn't use LUCENE_CURRENT (it's deprecated); use a concrete version instead. You are also using LUCENE_4_0 in your QueryParser. Neither of these will probably cause any major problems, but it's good to be consistent anyway.
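Put together, the writer setup could look like this in the Java API (identical through IKVM; the index path parameter is mine, not from your code):

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PersistentWriter {
    public static IndexWriter openWriter(String path) throws IOException {
        // Pin one concrete version and use it everywhere
        Version version = Version.LUCENE_4_9;
        Directory directory = FSDirectory.open(new File(path));
        IndexWriterConfig config = new IndexWriterConfig(
                version, new StandardAnalyzer(version));
        // CREATE_OR_APPEND opens the existing index if one is found,
        // instead of wiping it on every page load
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        return new IndexWriter(directory, config);
    }
}
```

With this open mode, the segment files are created once and reused; new segments are only added as you index more documents.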
When we use RAMDirectory, it loads the whole index, or large parts of it, into "memory" that is really virtual memory. As physical memory is limited, the operating system may of course decide to swap out our large RAMDirectory, so RAMDirectory is not a good way to optimize index loading times.
On the other hand, if we don't use RAMDirectory to buffer our index and use NIOFSDirectory or SimpleFSDirectory instead, we pay another price: our code has to make many syscalls to the OS kernel to copy blocks of data between the disk (or filesystem cache) and buffers residing in the Java heap, and this happens on every search request, over and over again.
MMapDirectory resolves all of the above by using virtual memory and a kernel feature called "mmap" to access the disk files.
Check this link also.
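Opening the index with mmap is just a different Directory implementation; a small sketch (the index path is a placeholder):

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class MMapExample {
    // Returns a Directory backed by mmap: the OS page cache does the
    // buffering, so searches avoid repeated copy syscalls and the index
    // is not duplicated on the Java heap.
    public static Directory openMMap(String indexPath) throws IOException {
        return new MMapDirectory(new File(indexPath));
    }
}
```

Pass the returned Directory to DirectoryReader.open as usual. Note that on 64-bit JVMs, FSDirectory.open already selects MMapDirectory automatically.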
I am using Apache Lucene to index HTML files, storing the path of each file in the Lucene index. The index is being stored; I have checked it in Luke.
But when I search for the path of a file, it returns a very high number of documents. I want it to match exactly the path that was stored in the index.
I am using the following code for index creation:
try {
    File indexDir = new File("d:/abc/");
    IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open(indexDir),
            new SimpleAnalyzer(),
            true,
            IndexWriter.MaxFieldLength.LIMITED);
    indexWriter.setUseCompoundFile(false);
    Document doc = new Document();
    String path = f.getCanonicalPath();
    doc.add(new Field("fpath", path,
            Field.Store.YES, Field.Index.ANALYZED));
    indexWriter.addDocument(doc);
    indexWriter.optimize();
    indexWriter.close();
} catch (Exception ex) {
    ex.printStackTrace();
}
The following is the code for searching the file path:
File indexDir = new File("d:/abc/");
int maxhits = 10000000;
int len = 0;
try {
    Directory directory = FSDirectory.open(indexDir);
    IndexSearcher searcher = new IndexSearcher(directory, true);
    QueryParser parser = new QueryParser(Version.LUCENE_36, "fpath", new SimpleAnalyzer());
    Query query = parser.parse(path);
    query.setBoost((float) 1.5);
    TopDocs topDocs = searcher.search(query, maxhits);
    ScoreDoc[] hits = topDocs.scoreDocs;
    len = hits.length;
    JOptionPane.showMessageDialog(null, "items found: " + len);
} catch (Exception ex) {
    ex.printStackTrace();
}
It shows the number of documents found as the total number of documents in the index, while the searched path exists only once.
You are analyzing the path, which will split it into separate terms. The root path term (like catalog in /catalog/products/versions) likely occurs in all documents, so any search that includes catalog without forcing all terms to be mandatory will return all documents.
You need a search query like (using the example above):
+catalog +products +versions
to force all terms to be present.
Note that this gets more complicated if the same set of terms can occur in different orders, like:
/catalog/products/versions
/versions/catalog/products/SKUs
In that case, you need to use a different Lucene tokenizer than the tokenizer in the Standard Analyzer.
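An alternative that sidesteps the ordering problem entirely: index the path as a single un-analyzed term and match it with a TermQuery, which bypasses the analyzer and requires an exact match. A sketch against the question's Lucene 3.x API (the field name "fpath" comes from the question; the helper names are mine):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ExactPathMatch {
    // Index the path as one un-analyzed term instead of
    // Field.Index.ANALYZED, so it is never split on '/'
    public static Document pathDocument(String path) {
        Document doc = new Document();
        doc.add(new Field("fpath", path,
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

    // Query with a TermQuery: no analysis, exact term match only
    public static Query exactPathQuery(String path) {
        return new TermQuery(new Term("fpath", path));
    }
}
```

This returns exactly the documents whose stored path equals the query string, regardless of how many documents share path components.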
I am implementing a search feature for a news website. On that website, users submit news articles containing a title and text; currently these articles are inserted directly into a database. I heard that full-text searching inside a database containing long text would not be efficient.
So I tried using Lucene for indexing and searching. I am able to index the full database with it and to search the content, but I am not sure I am using the best approach.
Here is my indexer class:
public class LuceneIndexer {

    public static void indexNews(News p, IndexWriter indexWriter) throws IOException {
        Document doc = new Document();
        doc.add(new Field("id", p.getNewsId(), Field.Store.YES, Field.Index.NO));
        doc.add(new Field("title", p.getTitle(), Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("text", p.getNewsRawText(), Field.Store.YES, Field.Index.UN_TOKENIZED));
        String fullSearchableText = p.getTitle() + " " + p.getNewsRawText();
        doc.add(new Field("content", fullSearchableText, Field.Store.NO, Field.Index.TOKENIZED));
        indexWriter.addDocument(doc);
    }

    public static void rebuildIndexes() {
        try {
            System.out.println("started indexing");
            IndexWriter w = getIndexWriter();
            ArrayList<News> n = new GetNewsInfo().getLastPosts(0);
            for (News news : n) {
                indexNews(news, w);
            }
            closeIndexWriter(w);
            System.out.println("indexing done");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static IndexWriter getIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
        return new IndexWriter(GlobalData.LUCENE_INDEX_STOREAGE, new StandardAnalyzer(), true);
    }

    public static void closeIndexWriter(IndexWriter w) throws CorruptIndexException, IOException {
        w.close();
    }
}
Is the above code efficient?
I think I should add a document to the index when it is submitted by a user, instead of re-indexing the full database each time.
Do I need to create a new IndexWriter every time an article is submitted?
Is it efficient to open and close an IndexWriter very frequently?
You are right that you don't need to re-add every document to the index; you only need to add new ones, and the rest will remain in the index.
But then you do need to create a new IndexWriter every time, unless you keep one alive. If you prefer, you can use a service or something that keeps an IndexWriter open, though opening and closing one does not take much time. If you do reuse an IndexWriter, make sure you call indexWriter.commit() after each add.
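If you do keep one writer alive, the per-submission path can be as small as this sketch (the class and method names are mine, not from the question):

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class NewsIndexer {
    private final IndexWriter writer;

    // One long-lived writer, opened once at application startup
    public NewsIndexer(IndexWriter writer) {
        this.writer = writer;
    }

    // Called whenever a user submits an article: add just that one
    // document and commit so that reopened searchers can see it
    public void addArticle(Document doc) throws IOException {
        writer.addDocument(doc);
        writer.commit();
    }
}
```

The full rebuildIndexes() then only needs to run when the index is lost or its schema changes, not on every submission.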
Do i need to create new IndexWriter every time an article is submitted?
No
is that efficient to open and close an IndexWriter very frequently?
Definitely not! You should read the guidelines for indexing here.
I am trying to set up Lucene to process some documents stored in the database. I started with this HelloWorld sample. However, the index that is created is not persisted anywhere and needs to be re-created each time the program is run. Is there a way to save the index that Lucene creates so that the documents do not need to be loaded into it each time the program starts up?
public class HelloLucene {
    public static void main(String[] args) throws IOException, ParseException {
        // 0. Specify the analyzer for tokenizing text.
        //    The same analyzer should be used for indexing and searching.
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

        // 1. Create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, "Lucene in Action");
        addDoc(w, "Lucene for Dummies");
        addDoc(w, "Managing Gigabytes");
        addDoc(w, "The Art of Computer Science");
        w.close();

        // 2. Query
        String querystr = args.length > 0 ? args[0] : "lucene";
        // The "title" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_35, "title", analyzer).parse(querystr);

        // 3. Search
        int hitsPerPage = 10;
        IndexReader reader = IndexReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. Display results
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("title"));
        }

        // The searcher can only be closed when there
        // is no more need to access the documents.
        searcher.close();
    }

    private static void addDoc(IndexWriter w, String value) throws IOException {
        Document doc = new Document();
        doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
    }
}
You're creating the index in RAM:
Directory index = new RAMDirectory();
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/store/RAMDirectory.html
IIRC, you just need to switch that to one of the filesystem based Directory implementations.
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/store/Directory.html
If you want to keep using RAMDirectory during searching (due to performance benefits) but don't want the index to be built from scratch every time, you can first create your index using a file system based directory like NIOFSDirectory (don't use if you're on windows). And then come search time, open a copy of the original directory using the constructor RAMDirectory(Directory dir).
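For the 3.5 sample above, the only change needed for persistence is the Directory line; a sketch (the index path is an assumption):

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class PersistentIndex {
    // Build/open the index on disk so it survives restarts;
    // this replaces "new RAMDirectory()" in HelloLucene
    public static Directory onDisk(String path) throws IOException {
        return FSDirectory.open(new File(path));
    }

    // Optionally copy an existing on-disk index into RAM for
    // faster searching, as the answer above suggests
    public static Directory inRam(Directory diskDir) throws IOException {
        return new RAMDirectory(diskDir);
    }
}
```

With Directory index = PersistentIndex.onDisk("hello-index"), the documents are indexed once and found again on the next run without re-adding them; inRam(...) requires that the on-disk index already exists.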