Lucene - Creating an Index using FSDirectory - java

First time posting; long time reader. I apologize ahead of time if this was already asked here (I'm new to Lucene as well!). I've done a lot of research and wasn't able to find a good explanation/example for my question.
First of all, I've used IKVM.NET to convert Lucene 4.9 from Java so I could include it in my .NET application. I chose this route so I could use the most recent version of Lucene. No issues there.
I am trying to create a basic example to start learning Lucene and to apply it to my app. I've done countless Google searches and read lots of articles, Apache's website, etc. My code mostly follows the example here: http://www.lucenetutorial.com/lucene-in-5-minutes.html
My question is, I don't believe I want to use RAMDirectory, right? Since I will be indexing a database and allowing users to search it via the website, I opted for FSDirectory because I didn't think the index should all be stored in memory.
When the IndexWriter is created, it creates new files each time (.cfe, .cfs, .si, segments.gen, write.lock, etc.). It seems to me you would create these files once and then use them until the index needs to be rebuilt?
So how do I create an IndexWriter without recreating the index files?
Code:
StandardAnalyzer analyzer;
Directory directory;

protected void Page_Load(object sender, EventArgs e)
{
    var version = org.apache.lucene.util.Version.LUCENE_CURRENT;
    analyzer = new StandardAnalyzer(version);
    if (directory == null)
    {
        directory = FSDirectory.open(new java.io.File(HttpContext.Current.Request.PhysicalApplicationPath + "/indexes"));
    }
    IndexWriterConfig config = new IndexWriterConfig(version, analyzer);
    // I found setting the open mode will overwrite the files, but it still creates new ones each time
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    IndexWriter w = new IndexWriter(directory, config);
    addDoc(w, "test", "1234");
    addDoc(w, "test1", "1234");
    addDoc(w, "test2", "1234");
    addDoc(w, "test3", "1234");
    w.close();
}

private static void addDoc(IndexWriter w, String _keyword, String _keywordid)
{
    Document doc = new Document();
    doc.add(new TextField("Keyword", _keyword, Field.Store.YES));
    doc.add(new StringField("KeywordID", _keywordid, Field.Store.YES));
    w.addDocument(doc);
}
protected void searchButton_Click(object sender, EventArgs e)
{
    String results = "";
    String querystr = searchTextBox.Text.ToString();
    Query q = new QueryParser(org.apache.lucene.util.Version.LUCENE_4_0, "Keyword", analyzer).parse(querystr);
    int hitsPerPage = 100;
    DirectoryReader reader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    if (hits.Length == 0)
    {
        label.Text = "Nothing was found.";
    }
    else
    {
        for (int i = 0; i < hits.Length; ++i)
        {
            int docID = hits[i].doc;
            Document d = searcher.doc(docID);
            results += "<br />" + (i + 1) + ". " + d.get("KeywordID") + "\t" + d.get("Keyword") + " Hit Score: " + hits[i].score.ToString() + "<br />";
        }
        label.Text = results;
    }
    reader.close();
}

Yes, RAMDirectory is great for quick, on-the-fly tests and tutorials, but in production you will usually want to store your index on the file system through an FSDirectory.
The reason it's rewriting the index every time you open the writer is that you are setting the OpenMode to IndexWriterConfig.OpenMode.CREATE. CREATE means you want to remove any existing index at that location, and start from scratch. You probably want IndexWriterConfig.OpenMode.CREATE_OR_APPEND, which will open an existing index if one is found.
One minor note:
You shouldn't use LUCENE_CURRENT (it's deprecated); use a concrete version instead. You are also using LUCENE_4_0 in your QueryParser. Neither of these is likely to cause major problems, but it's good to be consistent anyway.
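Putting both fixes together, a minimal sketch (written against the Java Lucene 4.9 API, which is what the IKVM-compiled calls map to):
// Pin a concrete version instead of the deprecated LUCENE_CURRENT.
org.apache.lucene.util.Version version = org.apache.lucene.util.Version.LUCENE_4_9;
IndexWriterConfig config = new IndexWriterConfig(version, new StandardAnalyzer(version));
// CREATE_OR_APPEND opens an existing index if one is found, instead of wiping it.
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
IndexWriter w = new IndexWriter(directory, config);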

When we use RAMDirectory, it loads the whole index (or large parts of it) into "memory", which is really virtual memory. As physical memory is limited, the operating system may of course decide to swap out our large RAMDirectory. So RAMDirectory is not a good way to optimize index loading times.
On the other hand, if we don't use RAMDirectory to buffer our index and instead use NIOFSDirectory or SimpleFSDirectory, we pay another price: our code has to do a lot of syscalls to the O/S kernel to copy blocks of data between the disk (or filesystem cache) and our buffers residing in the Java heap. This has to be done on every search request, over and over again.
MMapDirectory resolves all of the above issues by using virtual memory and a kernel feature called "mmap" to access the disk files.
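A minimal sketch of opening one (Lucene 4.x API; the path is hypothetical):
// Explicitly memory-map the index files:
Directory dir = new MMapDirectory(new java.io.File("/path/to/indexes"));
// Or let Lucene pick the best implementation for the platform,
// which is MMapDirectory on most 64-bit JREs:
Directory best = FSDirectory.open(new java.io.File("/path/to/indexes"));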

Related

Lucene to use different files to search text

I am working on a task where I need to search text using Lucene. The requirement is to use the segment, .si, .cfe and .cfs files already created by another application.
I am able to get those files, but when searching the text it doesn't show me any results.
The code for the search is:
public void searchText(String indexPath, String searchString) {
    try {
        Analyzer analyzer = new StandardAnalyzer();
        File indexDirectory = new File(indexPath);
        Directory directory = FSDirectory.open(indexDirectory.toPath());
        IndexReader directoryReader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(directoryReader);
        QueryParser parser = new QueryParser("requiredtext", analyzer);
        Query query = parser.parse(searchString);
        System.out.println(query);
        // Run the query and take the top 10 hits:
        ScoreDoc[] hits = searcher.search(query, 10).scoreDocs;
        // Iterate through the results:
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = searcher.doc(hits[i].doc);
        }
        analyzer.close();
        directoryReader.close();
        directory.close();
    }
    catch (Exception ex) {
        System.out.println("Exception - " + ex.getMessage());
    }
}
I am using Lucene 8.11.1 with Java 8.
The question is: is it possible in Lucene to read/find/search text from index files that were written by another application? If it is, please provide pointers on how.
Atul
I found the issue and fixed it.
I was looking for the data in the field "requiredtext", but the indexer didn't store the data for that field: at indexing time the property Field.Store.YES was never set on it, and that is the reason I got no data back for the field I was looking for.
I did get the data for another field for which that property was set.
And my question was: is it possible to search data in files created by another application? The answer is yes. #andrewJames's answer helps to prove it.
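For reference, a minimal sketch of the difference on the indexing side (the field names and values are illustrative):
Document doc = new Document();
// Indexed AND stored: the text can be retrieved from search hits.
doc.add(new TextField("requiredtext", "some text to search", Field.Store.YES));
// Indexed only: still searchable, but doc.get("otherfield") returns null at search time.
doc.add(new TextField("otherfield", "other text", Field.Store.NO));
writer.addDocument(doc);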

Change order pages of PDF document in iTextSharp

I'm trying to reorder the pages of my PDF document, but I can't and I don't know why.
I read several articles about changing page order; they are in Java (iText) and I had a few problems with them (exampl1, exampl2, example3). This example is in C#, but it uses a different method (exampl4).
I want to take my TOC on page 12 and put it on page 2. After page 12 I have other content. This is my template for changing the order of pages:
String.Format("1, {0}, 2-{1}, {2}-{3}", toc, toc - 1, toc + 1, n)
This is my method for changing the order of pages:
public void ChangePageOrder(string path)
{
    MemoryStream baos = new MemoryStream();
    PdfReader sourcePDFReader = new PdfReader(path);
    int toc = 12;
    int n = sourcePDFReader.NumberOfPages;
    sourcePDFReader.SelectPages(String.Format("1, {0}, 2-{1}, {2}-{3}", toc, toc - 1, toc + 1, n));
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
    {
        PdfStamper stamper = new PdfStamper(sourcePDFReader, fs);
        stamper.Close();
    }
}
Here is the call to the method:
...
doc.Close();
ChangePageOrder(filePath);
What am I doing wrong?
Thank you.
Your code can't work because you are using path to create the PdfReader as well as to create the FileStream. You probably get an error such as "The file is in use" or "The file can't be accessed".
This is explained here:
StackOverflow: How to update a PDF without creating a new PDF?
Official web site:
How to update a PDF without creating a new PDF?
You create a MemoryStream named baos, but you aren't using that object anywhere. One way to solve your problem is to replace the FileStream you used when you first created your PDF with that MemoryStream, and then use the bytes stored in the memory stream to create a PdfReader instance. In that case, PdfStamper won't be writing to a file that is in use.
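A minimal sketch of that approach, in Java iText terms since that's what the examples you found use (iTextSharp mirrors these calls; the output file name is illustrative):
// 1. Create the original document in memory instead of on disk.
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Document doc = new Document();
PdfWriter.getInstance(doc, baos);
doc.open();
// ... add content, including the TOC that ends up on page 12 ...
doc.close();

// 2. No file is locked now: the reader works on the in-memory bytes.
PdfReader reader = new PdfReader(baos.toByteArray());
int toc = 12;
int n = reader.getNumberOfPages();
// Note: %s placeholders are Java's String.format syntax; in C#,
// String.Format needs {0}-style placeholders instead.
reader.selectPages(String.format("1,%s,2-%s,%s-%s", toc, toc - 1, toc + 1, n));

// 3. Write the reordered pages to the real output file.
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("my_story_reordered.pdf"));
stamper.close();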
Another option would be to use a different path. For instance: first you write the document to a file named my_story_unordered.pdf (created by PdfWriter), then you write the document to a file named my_story_reordered.pdf (created by PdfStamper).
It's also possible to create the final document in one go. In that case, you need to switch to linear mode. There's an example in my book "iText in Action - Second Edition" that shows how to do this: MovieHistory1
In the C# port of this example, you have:
writer.SetLinearPageMode();
In normal circumstances, iText will create a page tree with branches and leaves. As soon as a branch has more than 10 leaves, a new branch is created. With setLinearPageMode(), you tell iText not to do this: the complete page tree will consist of one branch with nothing but leaves (no extra branches). This is bad from the point of view of performance when viewing the document, but it's acceptable if the number of pages in your document is limited.
Once you've switched to page mode, you can reorder the pages like this:
document.NewPage();
// get the total number of pages that needs to be reordered
int total = writer.ReorderPages(null);
// change the order
int[] order = new int[total];
for (int i = 0; i < total; i++) {
order[i] = i + toc;
if (order[i] > total) {
order[i] -= total;
}
}
// apply the new order
writer.ReorderPages(order);
Summarized: if your document doesn't have many pages, use the ReorderPages method. If your document has many pages, use the method you've been experimenting with, but do it correctly. Don't try to write to the file that you are still trying to read.
Without going into details about what you should do: you can loop through all the pages of a PDF and copy them into a new document, in whatever order you need. Put your reordering logic inside the for loop.
PdfReader reader = new PdfReader(sourcePDFpath);
Document sourceDocument = new Document(reader.GetPageSizeWithRotation(startpage));
PdfCopy pdfCopyProvider = new PdfCopy(sourceDocument,
    new System.IO.FileStream(outputPDFpath, System.IO.FileMode.Create));
sourceDocument.Open();
for (int i = startpage; i <= endpage; i++)
{
    PdfImportedPage importedPage = pdfCopyProvider.GetImportedPage(reader, i);
    pdfCopyProvider.AddPage(importedPage);
}
sourceDocument.Close();
reader.Close();

Optimizing indexing lucene 5.2.1

I have developed my own indexer in Lucene 5.2.1. I am trying to index a 1.5 GB file, and I need to do some non-trivial computation during indexing on every single document in the collection.
The problem is that the indexing takes almost 20 minutes! I have followed this very helpful wiki, but it is still way too slow. I have tried increasing the Eclipse heap space and the Java VM memory, but it seems to be more a matter of hard disk than virtual memory (I am using a laptop with 6 GB of RAM and an ordinary hard disk).
I have read this discussion, which suggests using RAMDirectory or mounting a RAM disk. The problem with a RAM disk is persisting the index to my filesystem (I don't want to lose indexes after a reboot). The problem with RAMDirectory is that, according to the API docs, I should not use it because my index is more than "several hundred megabytes":
Warning: This class is not intended to work with huge indexes. Everything beyond several hundred megabytes will waste resources (GC cycles), because it uses an internal buffer size of 1024 bytes, producing millions of byte[1024] arrays. This class is optimized for small memory-resident indexes. It also has bad concurrency on multithreaded environments.
Here you can find my code:
public class ReviewIndexer {

    private JSONParser parser;
    private PerFieldAnalyzerWrapper reviewAnalyzer;
    private IndexWriterConfig iwConfig;
    private IndexWriter indexWriter;

    public ReviewIndexer() throws IOException {
        parser = new JSONParser();
        reviewAnalyzer = new ReviewWrapper().getPFAWrapper();
        iwConfig = new IndexWriterConfig(reviewAnalyzer);
        // change RAM buffer size to speed things up
        // https://wiki.apache.org/lucene-java/ImproveIndexingSpeed
        iwConfig.setRAMBufferSizeMB(2048);
        // little speed increase
        iwConfig.setUseCompoundFile(false);
        //iwConfig.setMaxThreadStates(24);
        // Set to overwrite the existing index
        indexWriter = new IndexWriter(FileUtils.openDirectory("review_index"), iwConfig);
    }

    /**
     * Indexes every review.
     * @param file_path the path of the yelp_academic_dataset_review.json file
     * @throws IOException
     * @return true if everything goes fine.
     */
    public boolean indexReviews(String file_path) throws IOException {
        BufferedReader br;
        try {
            // open the file
            br = new BufferedReader(new FileReader(file_path));
            String line;
            // define fields
            StringField type = new StringField("type", "", Store.YES);
            String reviewtext = "";
            TextField text = new TextField("text", "", Store.YES);
            StringField business_id = new StringField("business_id", "", Store.YES);
            StringField user_id = new StringField("user_id", "", Store.YES);
            LongField stars = new LongField("stars", 0, LanguageUtils.LONG_FIELD_TYPE_STORED_SORTED);
            LongField date = new LongField("date", 0, LanguageUtils.LONG_FIELD_TYPE_STORED_SORTED);
            StringField votes = new StringField("votes", "", Store.YES);
            Date reviewDate;
            JSONObject jsonVotes;
            try {
                indexWriter.deleteAll();
                // scan the file line by line
                // TO-DO: split in chunks and use parallel computation
                while ((line = br.readLine()) != null) {
                    try {
                        JSONObject jsonline = (JSONObject) parser.parse(line);
                        Document review = new Document();
                        // add values to fields
                        type.setStringValue((String) jsonline.get("type"));
                        business_id.setStringValue((String) jsonline.get("business_id"));
                        user_id.setStringValue((String) jsonline.get("user_id"));
                        stars.setLongValue((long) jsonline.get("stars"));
                        reviewtext = (String) jsonline.get("text");
                        // non-trivial function being calculated here
                        text.setStringValue(reviewtext);
                        reviewDate = DateTools.stringToDate((String) jsonline.get("date"));
                        date.setLongValue(reviewDate.getTime());
                        jsonVotes = (JSONObject) jsonline.get("votes");
                        votes.setStringValue(jsonVotes.toJSONString());
                        // add fields to document
                        review.add(type);
                        review.add(business_id);
                        review.add(user_id);
                        review.add(stars);
                        review.add(text);
                        review.add(date);
                        review.add(votes);
                        // write the document to the index
                        indexWriter.addDocument(review);
                    } catch (ParseException | java.text.ParseException e) {
                        e.printStackTrace();
                        br.close();
                        return false;
                    }
                } // end of while
            } catch (IOException e) {
                e.printStackTrace();
                br.close();
                return false;
            }
            // close buffered reader and commit changes
            br.close();
            indexWriter.commit();
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
            return false;
        }
        System.out.println("Done.");
        return true;
    }

    public void close() throws IOException {
        indexWriter.close();
    }
}
What is the best thing to do, then? Should I build a RAM disk and then copy the index to the file system once it is done, or should I use RAMDirectory anyway, or maybe something else? Many thanks.
Lucene claims 150 GB/hour on modern hardware; that is with 20 indexing threads on a 24-core machine.
You have 1 thread, so expect about 150/20 = 7.5 GB/hour. You will probably see that 1 core is working at 100% and the rest are only working when merging segments.
You should use multiple indexing threads to speed things up. See for example the luceneutil Indexer.java for inspiration.
As you have a laptop, I suspect you have either 4 or 8 cores, so multi-threading should be able to give your indexing a nice boost.
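A minimal sketch of what that could look like (the pre-split chunks and the buildDocument helper are hypothetical; the point is that a single IndexWriter is safe to share between threads, while Document/Field instances are not):
ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
for (final List<String> chunk : chunks) { // pre-split input lines into chunks
    pool.submit(new Runnable() {
        public void run() {
            for (String jsonLine : chunk) {
                try {
                    // each thread builds its own Document and Field instances
                    Document doc = buildDocument(jsonLine); // parsing + the non-trivial calculation
                    indexWriter.addDocument(doc);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);
indexWriter.commit();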
You can try setMaxThreadStates in IndexWriterConfig:
iwConfig.setMaxThreadStates(50);

apache lucene indexing and searching on the filepath

I am using Apache Lucene to index HTML files, storing the path of each HTML file in the Lucene index. The index is created fine; I have checked it in Luke.
But when I search for the path of a file, it returns a very high number of documents. I want it to match the exact path as it was stored in the index.
I am using the following code for index creation:
try {
    File indexDir = new File("d:/abc/");
    IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open(indexDir),
            new SimpleAnalyzer(),
            true,
            IndexWriter.MaxFieldLength.LIMITED);
    indexWriter.setUseCompoundFile(false);
    Document doc = new Document();
    String path = f.getCanonicalPath();
    doc.add(new Field("fpath", path,
            Field.Store.YES, Field.Index.ANALYZED));
    indexWriter.addDocument(doc);
    indexWriter.optimize();
    indexWriter.close();
}
catch (Exception ex) {
    ex.printStackTrace();
}
The following is the code for searching the file path:
File indexDir = new File("d:/abc/");
int maxhits = 10000000;
int len = 0;
try {
    Directory directory = FSDirectory.open(indexDir);
    IndexSearcher searcher = new IndexSearcher(directory, true);
    QueryParser parser = new QueryParser(Version.LUCENE_36, "fpath", new SimpleAnalyzer());
    Query query = parser.parse(path);
    query.setBoost((float) 1.5);
    TopDocs topDocs = searcher.search(query, maxhits);
    ScoreDoc[] hits = topDocs.scoreDocs;
    len = hits.length;
    JOptionPane.showMessageDialog(null, "items found" + len);
}
catch (Exception ex) {
    ex.printStackTrace();
}
It shows the number of documents found as the total number of documents, even though the searched path exists in only one document.
You are analyzing the path, which will split it into separate terms. The root path term (like catalog in /catalog/products/versions) likely occurs in all documents, so any search that includes catalog without forcing all terms to be mandatory will return all documents.
You need a search query like (using the example above):
+catalog +products +versions
to force all terms to be present.
Note that this gets more complicated if the same set of terms can occur in different orders, like:
/catalog/products/versions
/versions/catalog/products/SKUs
In that case, you need to use a different Lucene tokenizer than the tokenizer in the Standard Analyzer.
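If exact-path lookup is all you need, another option (a sketch against the 3.6-era API used in the question) is to skip analysis entirely: index the path un-analyzed and look it up with a TermQuery. Alternatively, keep the analyzed field and make the parser treat every term as mandatory:
// Option 1: exact match. Index the path as a single un-analyzed term...
doc.add(new Field("fpath", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
// ...and search it with a TermQuery, which bypasses the analyzer:
Query exact = new TermQuery(new Term("fpath", path));

// Option 2: keep the analyzed field, but AND all terms by default:
QueryParser parser = new QueryParser(Version.LUCENE_36, "fpath", new SimpleAnalyzer());
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
Query allTerms = parser.parse(QueryParser.escape(path)); // escape ':' and other special characters
Note that Option 2 still has the ordering caveat described above; Option 1 does not.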

How to persist the Lucene document index so that the documents do not need to be loaded into it each time the program starts up?

I am trying to set up Lucene to process some documents stored in the database. I started with this HelloWorld sample. However, the index that is created is not persisted anywhere and needs to be re-created each time the program is run. Is there a way to save the index that Lucene creates so that the documents do not need to be loaded into it each time the program starts up?
public class HelloLucene {
    public static void main(String[] args) throws IOException, ParseException {
        // 0. Specify the analyzer for tokenizing text.
        //    The same analyzer should be used for indexing and searching.
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

        // 1. Create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, "Lucene in Action");
        addDoc(w, "Lucene for Dummies");
        addDoc(w, "Managing Gigabytes");
        addDoc(w, "The Art of Computer Science");
        w.close();

        // 2. Query
        String querystr = args.length > 0 ? args[0] : "lucene";
        // The "title" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_35, "title", analyzer).parse(querystr);

        // 3. Search
        int hitsPerPage = 10;
        IndexReader reader = IndexReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. Display results
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("title"));
        }

        // The searcher can only be closed when there
        // is no need to access the documents any more.
        searcher.close();
    }

    private static void addDoc(IndexWriter w, String value) throws IOException {
        Document doc = new Document();
        doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
    }
}
You're creating the index in RAM:
Directory index = new RAMDirectory();
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/store/RAMDirectory.html
IIRC, you just need to switch that to one of the filesystem based Directory implementations.
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/store/Directory.html
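For example, the one-line change in the sample above (the directory path is hypothetical):
// Persist the index on disk instead of in RAM:
Directory index = FSDirectory.open(new File("lucene_index"));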
If you want to keep using RAMDirectory while searching (because of its performance benefits) but don't want the index to be built from scratch every time, you can first create your index on disk with a file-system-based directory like NIOFSDirectory (don't use it on Windows). Then, at search time, open a copy of the original directory using the constructor RAMDirectory(Directory dir).
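A sketch of that pattern, again against the 3.x API and with a hypothetical path:
// Build (or reuse) the persistent index on disk...
Directory fsDir = FSDirectory.open(new File("lucene_index"));
// ...then copy it into memory for faster searching:
Directory ramDir = new RAMDirectory(fsDir);
IndexReader reader = IndexReader.open(ramDir);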
