I am using Apache Lucene to index HTML files, and I am storing the path of each HTML file in the Lucene index. The index is being written correctly; I have checked it in Luke.
But when I search for a file path, the number of documents returned is far too high. I want it to match the exact path as it was stored in the index.
I am using the following code for index creation:
try {
    File indexDir = new File("d:/abc/");
    IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open(indexDir),
            new SimpleAnalyzer(),
            true,
            IndexWriter.MaxFieldLength.LIMITED);
    indexWriter.setUseCompoundFile(false);

    Document doc = new Document();
    String path = f.getCanonicalPath(); // f is the HTML file being indexed
    doc.add(new Field("fpath", path,
            Field.Store.YES, Field.Index.ANALYZED));
    indexWriter.addDocument(doc);

    indexWriter.optimize();
    indexWriter.close();
} catch (Exception ex) {
    ex.printStackTrace();
}
The following is the code for searching the file path:
File indexDir = new File("d:/abc/");
int maxhits = 10000000;
int len = 0;
try {
    Directory directory = FSDirectory.open(indexDir);
    IndexSearcher searcher = new IndexSearcher(directory, true);
    QueryParser parser = new QueryParser(Version.LUCENE_36, "fpath", new SimpleAnalyzer());
    Query query = parser.parse(path); // path is the file path being searched for
    query.setBoost((float) 1.5);
    TopDocs topDocs = searcher.search(query, maxhits);
    ScoreDoc[] hits = topDocs.scoreDocs;
    len = hits.length;
    JOptionPane.showMessageDialog(null, "items found: " + len);
} catch (Exception ex) {
    ex.printStackTrace();
}
It shows the number of documents found as the total number of documents, even though the searched path exists only once.
You are analyzing the path, which will split it into separate terms. The root path term (like catalog in /catalog/products/versions) likely occurs in all documents, so any search that includes catalog without forcing all terms to be mandatory will return all documents.
You need a search query like (using the example above):
+catalog +products +versions
to force all terms to be present.
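With QueryParser you do not have to type the + signs yourself; a small sketch against the Lucene 3.6 API used in the question is to make AND the default operator, so every parsed term becomes required:

    QueryParser parser = new QueryParser(Version.LUCENE_36, "fpath", new SimpleAnalyzer());
    parser.setDefaultOperator(QueryParser.AND_OPERATOR); // every term is now mandatory (+term)
    Query query = parser.parse(path);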
Note that this gets more complicated if the same set of terms can occur in different orders, like:
/catalog/products/versions
/versions/catalog/products/SKUs
In that case, you need to use a different Lucene tokenizer than the tokenizer in the Standard Analyzer.
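Alternatively, if what you really want is an exact match on the whole path, a simpler option (a sketch, assuming you can re-index) is to keep the path as a single un-analyzed term and look it up with a TermQuery, so no tokenization happens on either side:

    // at index time: keep the path as one token
    doc.add(new Field("fpath", path, Field.Store.YES, Field.Index.NOT_ANALYZED));

    // at search time: match the exact stored value, bypassing the analyzer
    Query query = new TermQuery(new Term("fpath", path));
    TopDocs topDocs = searcher.search(query, maxhits);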
I am working on a task where I need to search text using Lucene. The requirement is to reuse the segment files (.si, .cfe and .cfs) already created by another application.
I am able to read those files, but when I search for text, no results are returned.
The search code is:
public void searchText(String indexPath, String searchString) {
    try {
        Analyzer analyzer = new StandardAnalyzer();
        File indexDirectory = new File(indexPath);
        Directory directory = FSDirectory.open(indexDirectory.toPath());
        IndexReader directoryReader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(directoryReader);

        // Parse a simple query against the "requiredtext" field:
        QueryParser parser = new QueryParser("requiredtext", analyzer);
        Query query = parser.parse(searchString);
        System.out.println(query);

        // Iterate through the results:
        ScoreDoc[] hits = searcher.search(query, 10).scoreDocs;
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = searcher.doc(hits[i].doc);
        }

        analyzer.close();
        directoryReader.close();
        directory.close();
    }
    catch (Exception ex) {
        System.out.println("Exception - " + ex.getMessage());
    }
}
I am using Lucene version 8.11.1 with Java 8.
The question is: is it possible in Lucene to read/find/search text in index files that were written by another application? If it is, please provide pointers on how.
Atul
I found the issue and fixed it.
I was looking for data in the field "requiredtext", but the indexer never stored data for that field: while indexing, it did not set the TextField.Store.YES property for it, and that is why I could not get the data back for the field I was looking for.
I did get the data for another field for which that property was set.
And my question was: is it possible to search data in index files created by another application? The answer is yes; #andrewJames's answer helps to prove it.
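For anyone in the same spot, a quick way to see which fields actually exist in an index produced by another application is to dump the field infos and the stored fields of one document. This is a small diagnostic sketch against the Lucene 8.x API, reusing the indexPath argument from the question:

    try (Directory dir = FSDirectory.open(java.nio.file.Paths.get(indexPath));
         IndexReader reader = DirectoryReader.open(dir)) {

        // list every field each segment knows about, with its index options
        for (LeafReaderContext leaf : reader.leaves()) {
            for (FieldInfo fi : leaf.reader().getFieldInfos()) {
                System.out.println(fi.name + " indexOptions=" + fi.getIndexOptions());
            }
        }

        // show which fields were actually stored for the first document
        if (reader.maxDoc() > 0) {
            for (IndexableField f : reader.document(0).getFields()) {
                System.out.println("stored field: " + f.name());
            }
        }
    }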
I am trying to index text and Word files and also search some content in these files. It works when I search for a specific string, but when I try to use a regular expression to search, it does not work any more. Below I will list the crucial code.
The index function:
// FileBean is the class that contains the file path,
// file content and lastModified information
public void indexDoc(IndexWriter writer, FileBean t) throws Exception {
    Document doc = new Document();
    System.out.println(t.getPath());
    doc.add(new StringField(LuceneConstants.PATH, t.getPath(), Field.Store.YES));
    doc.add(new LongPoint(LuceneConstants.MODIFIED, t.getModified()));
    doc.add(new TextField(LuceneConstants.CONTENT, t.getContent(), Field.Store.NO));
    if (writer.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE) {
        writer.addDocument(doc);
    } else {
        writer.updateDocument(new Term(LuceneConstants.PATH, t.getPath()), doc);
    }
}
I am using QueryParser to build the query; the query is a regular-expression query such as '\d{16}' for a 16-digit number.
The search function:
public static TopDocs getResults(IndexSearcher searcher, Query query) throws IOException {
    TopDocs docs = searcher.search(query, 10);
    return docs;
}
TopDocs' totalHits is 0, which is not what I expect; it seems that no documents match, even though the content should satisfy the given regular expression.
I tried googling it but still have not found a solution. Can anyone suggest why totalHits is 0? Thanks.
Try taking away the '+', so it would be '\d{16}'.
OMG, I finally found the reason, though I don't know the deeper cause: if I use '[0-9]' instead of '\d', it works!
If anyone could explain this, that would be wonderful!
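A likely explanation (this is about how Lucene's regex support works in general, not something stated in the original post): RegexpQuery, and the /regex/ syntax in QueryParser, use Lucene's own automaton-based RegExp grammar rather than java.util.regex. In that grammar a backslash simply escapes the next character, so '\d' matches a literal d instead of a digit (some newer releases may accept Perl-style classes, but it cannot be relied on), while a character class like '[0-9]' is always understood. A minimal sketch, assuming the LuceneConstants.CONTENT field from the question:

    // RegexpQuery matches individual indexed terms against Lucene's regex grammar,
    // so [0-9]{16} finds a 16-digit token while \d{16} does not.
    Query query = new RegexpQuery(new Term(LuceneConstants.CONTENT, "[0-9]{16}"));
    TopDocs docs = searcher.search(query, 10);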
First time posting; long time reader. I apologize ahead of time if this has already been asked here (I'm new to Lucene as well!). I've done a lot of research and wasn't able to find a good explanation/example for my question.
First of all, I've used IKVM.NET to convert Lucene 4.9 from Java so I can include it in my .NET application. I chose to do this so I could use the most recent version of Lucene. No issues there.
I am trying to create a basic example to start learning Lucene and to apply it to my app. I've done countless Google searches and read lots of articles, Apache's website, etc. My code mostly follows the example here: http://www.lucenetutorial.com/lucene-in-5-minutes.html
My question is: I don't believe I want to use RAMDirectory, right? Since I will be indexing a database and allowing users to search it via the website, I opted for FSDirectory because I didn't think it should all be stored in memory.
When the IndexWriter is created, it creates new files each time (.cfe, .cfs, .si, segments.gen, write.lock, etc.). It seems to me you would create these files once and then use them until the index needs to be rebuilt?
So how do I create an IndexWriter without recreating the index files?
Code:
StandardAnalyzer analyzer;
Directory directory;

protected void Page_Load(object sender, EventArgs e)
{
    var version = org.apache.lucene.util.Version.LUCENE_CURRENT;
    analyzer = new StandardAnalyzer(version);
    if (directory == null)
    {
        directory = FSDirectory.open(new java.io.File(HttpContext.Current.Request.PhysicalApplicationPath + "/indexes"));
    }
    IndexWriterConfig config = new IndexWriterConfig(version, analyzer);
    // I found setting the open mode will overwrite the files but still creates new each time
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    IndexWriter w = new IndexWriter(directory, config);
    addDoc(w, "test", "1234");
    addDoc(w, "test1", "1234");
    addDoc(w, "test2", "1234");
    addDoc(w, "test3", "1234");
    w.close();
}

private static void addDoc(IndexWriter w, String _keyword, String _keywordid)
{
    Document doc = new Document();
    doc.add(new TextField("Keyword", _keyword, Field.Store.YES));
    doc.add(new StringField("KeywordID", _keywordid, Field.Store.YES));
    w.addDocument(doc);
}
protected void searchButton_Click(object sender, EventArgs e)
{
    String results = "";
    String querystr = searchTextBox.Text.ToString();
    Query q = new QueryParser(org.apache.lucene.util.Version.LUCENE_4_0, "Keyword", analyzer).parse(querystr);
    int hitsPerPage = 100;
    DirectoryReader reader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    if (hits.Length == 0)
    {
        label.Text = "Nothing was found.";
    }
    else
    {
        for (int i = 0; i < hits.Length; ++i)
        {
            int docID = hits[i].doc;
            Document d = searcher.doc(docID);
            results += "<br />" + (i + 1) + ". " + d.get("KeywordID") + "\t" + d.get("Keyword") + " Hit Score: " + hits[i].score.ToString() + "<br />";
        }
        label.Text = results;
    }
    reader.close(); // close the reader in both cases, not only when hits were found
}
Yes, RAMDirectory is great for quick, on-the-fly tests and tutorials, but in production you will usually want to store your index on the file system through an FSDirectory.
The reason it's rewriting the index every time you open the writer is that you are setting the OpenMode to IndexWriterConfig.OpenMode.CREATE. CREATE means you want to remove any existing index at that location, and start from scratch. You probably want IndexWriterConfig.OpenMode.CREATE_OR_APPEND, which will open an existing index if one is found.
One minor note:
You shouldn't use LUCENE_CURRENT (it is deprecated); use a real version instead. You are also using LUCENE_4_0 in your QueryParser. Neither of these will probably cause major problems, but it's good to be consistent anyway.
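For example, a sketch of the writer setup with CREATE_OR_APPEND and an explicit version (the same Java calls apply through the IKVM projection; the constants assume Lucene 4.9):

    // reuse the existing index if one is present, otherwise create it
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9,
            new StandardAnalyzer(Version.LUCENE_4_9));
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    IndexWriter w = new IndexWriter(directory, config);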
When we use RAMDirectory, it loads the whole index or large parts of it into "memory", that is, virtual memory. As physical memory is limited, the operating system may of course decide to swap out our large RAMDirectory. So RAMDirectory is not a good way to optimize index loading times.
On the other hand, if we don't use RAMDirectory to buffer our index and use NIOFSDirectory or SimpleFSDirectory instead, we pay another price: our code has to make a lot of syscalls to the OS kernel to copy blocks of data between the disk or filesystem cache and our buffers on the Java heap. This has to be done on every search request, over and over again.
To address both issues, MMapDirectory uses virtual memory and a kernel feature called "mmap" to access the index files.
Check this link also.
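On Lucene 4.x, FSDirectory.open already chooses MMapDirectory on most 64-bit platforms, but you can also ask for it explicitly. A minimal sketch (the index path is an assumption):

    // memory-map the index files instead of copying blocks through the Java heap
    Directory directory = new MMapDirectory(new java.io.File("/path/to/indexes"));
    DirectoryReader reader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);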
I am trying to implement the simplest Lucene search. I followed this as my starting point.
I can understand the sample code:
public static void indexHotel(Hotel hotel) throws IOException {
    IndexWriter writer = (IndexWriter) getIndexWriter(false);
    Document doc = new Document();
    doc.add(new Field("id", hotel.getId(), Field.Store.YES,
            Field.Index.NO));
    doc.add(new Field("name", hotel.getName(), Field.Store.YES,
            Field.Index.TOKENIZED));
    doc.add(new Field("city", hotel.getCity(), Field.Store.YES,
            Field.Index.UN_TOKENIZED));
    doc.add(new Field("description", hotel.getDescription(),
            Field.Store.YES,
            Field.Index.TOKENIZED));
    String fullSearchableText = hotel.getName() + " " + hotel.getCity() + " " + hotel.getDescription();
    doc.add(new Field("content", fullSearchableText,
            Field.Store.NO,
            Field.Index.TOKENIZED));
    writer.addDocument(doc);
}
The thing I can't get in this code is what getIndexWriter(false) does; this method is nowhere mentioned in the post I followed. Moreover, in another code block:
public void rebuildIndexes() throws IOException {
    //
    // Erase existing index
    //
    getIndexWriter(true);
    //
    // Index all hotel entries
    //
    Hotel[] hotels = HotelDatabase.getHotels();
    for (Hotel hotel : hotels) {
        indexHotel(hotel);
    }
    //
    // Don't forget to close the index writer when done
    //
    closeIndexWriter();
}
some undefined methods are used, which is a bit confusing for a beginner like me.
I want to create only one index. I think getIndexWriter(true) and closeIndexWriter() are utility methods just to obtain and close an IndexWriter, but I can't work out what the true in getIndexWriter(true) is used for.
Following some other posts has left me more confused about how the IndexWriter should be created.
Can somebody please put me on the right path if I am doing anything wrong?
Well, depending on where your index lives (RAM or file system) you can open different IndexWriters.
Assuming you are trying to write the index to the file system, you should have something like this:
public static final Version luceneVersion = Version.LUCENE_40;

IndexWriter getIndexWriter() throws IOException {
    Directory indexDir = FSDirectory.open(new File(INDEX_PATH));
    IndexWriterConfig luceneConfig = new IndexWriterConfig(
            luceneVersion, new StandardAnalyzer(luceneVersion));
    return new IndexWriter(indexDir, luceneConfig);
}
Note the analyzer class StandardAnalyzer; you should choose the analyzer depending on the application requirements. I reckon StandardAnalyzer is good enough for what you want to do.
The boolean argument most likely indicates whether a new, empty index should be created, erasing any existing one.
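As a sketch of what those helpers might look like (this is a guess at the tutorial's utility methods, not their actual code), the boolean maps naturally onto IndexWriterConfig.OpenMode:

    private static IndexWriter indexWriter;

    // create == true  -> erase any existing index and start from scratch
    // create == false -> open the existing index (or create one if none exists)
    public static IndexWriter getIndexWriter(boolean create) throws IOException {
        if (indexWriter == null) {
            Directory indexDir = FSDirectory.open(new File(INDEX_PATH));
            IndexWriterConfig config = new IndexWriterConfig(
                    luceneVersion, new StandardAnalyzer(luceneVersion));
            config.setOpenMode(create ? IndexWriterConfig.OpenMode.CREATE
                                      : IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            indexWriter = new IndexWriter(indexDir, config);
        }
        return indexWriter;
    }

    public static void closeIndexWriter() throws IOException {
        if (indexWriter != null) {
            indexWriter.close();
            indexWriter = null;
        }
    }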
I am trying to set up Lucene to process some documents stored in the database. I started with this HelloWorld sample. However, the index that is created is not persisted anywhere and needs to be re-created each time the program is run. Is there a way to save the index that Lucene creates so that the documents do not need to be loaded into it each time the program starts up?
public class HelloLucene {
    public static void main(String[] args) throws IOException, ParseException {
        // 0. Specify the analyzer for tokenizing text.
        //    The same analyzer should be used for indexing and searching
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

        // 1. create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, "Lucene in Action");
        addDoc(w, "Lucene for Dummies");
        addDoc(w, "Managing Gigabytes");
        addDoc(w, "The Art of Computer Science");
        w.close();

        // 2. query
        String querystr = args.length > 0 ? args[0] : "lucene";
        // the "title" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_35, "title", analyzer).parse(querystr);

        // 3. search
        int hitsPerPage = 10;
        IndexReader reader = IndexReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. display results
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("title"));
        }

        // searcher can only be closed when there
        // is no need to access the documents any more.
        searcher.close();
    }

    private static void addDoc(IndexWriter w, String value) throws IOException {
        Document doc = new Document();
        doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
    }
}
You're creating the index in RAM:
Directory index = new RAMDirectory();
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/store/RAMDirectory.html
IIRC, you just need to switch that to one of the filesystem based Directory implementations.
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/store/Directory.html
If you want to keep using RAMDirectory during searching (due to performance benefits) but don't want the index to be built from scratch every time, you can first create your index using a file system based directory like NIOFSDirectory (don't use if you're on windows). And then come search time, open a copy of the original directory using the constructor RAMDirectory(Directory dir).
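A minimal sketch of both steps against the Lucene 3.5 API used in the question (the index path is an assumption):

    // build (or update) the index on disk instead of in RAM
    Directory index = FSDirectory.open(new File("/path/to/index"));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
    IndexWriter w = new IndexWriter(index, config);
    // ... addDoc(...) calls, only when the index actually needs (re)building ...
    w.close();

    // at search time, optionally load a copy of the on-disk index into RAM
    Directory ramCopy = new RAMDirectory(index);
    IndexReader reader = IndexReader.open(ramCopy);
    IndexSearcher searcher = new IndexSearcher(reader);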