NGramIndex building and querying - Java

I tried to implement an index-based text search with Lucene 4.3.1. The code is below. I created the index with an NGramTokenizer because I would like to find search results that are too far away for a FuzzyQuery.
I have two problems with my solution. The first is that I don't understand why it finds some things but not others. E.g. if I search for "Buter", "utter" or "Bute" it finds "Butter", but if I search for "Btter" there is no result. Is there an error in my implementation? What should I do differently?
I would also like it to always give me (e.g.) 10 results for each query. Is this even achievable with my code, or what would I need to change to get these 10 results?
Here's the code:
public LuceneIndex() throws IOException{
File dir = new File(indexDirectoryPath);
index = FSDirectory.open(dir);
analyzer = new NGramAnalyzer();
config = new IndexWriterConfig(luceneVersion, analyzer);
indexWriter = new IndexWriter(index, config);
// note: DirectoryReader is a point-in-time snapshot; this reader will not
// see documents added by makeIndex() after the constructor runs
reader = DirectoryReader.open(FSDirectory.open(dir));
searcher = new IndexSearcher(reader);
queryParser = new QueryParser(luceneVersion, "label", new NGramAnalyzer());
}
/**
* building the index
* @param graph
* @throws IOException
*/
public void makeIndex(MyGraph graph) throws IOException {
FieldType fieldType = new FieldType();
fieldType.setTokenized(true);
//read the items that should be indexed
ArrayList<String> DbList = Helper.readListFromFileDb(indexFilePath);
for (String word : DbList) {
Document doc = new Document();
doc.add(new TextField("label", word, Field.Store.YES));
indexWriter.addDocument(doc);
}
indexWriter.close();
}
public void searchIndexWithQueryParser(String searchString, int numberOfResults) throws IOException, ParseException {
System.out.println("Searching for '" + searchString + "' using QueryParser");
Query query = queryParser.parse(searchString);
System.out.println(query.toString());
TopDocs results = searcher.search(query, numberOfResults);
ScoreDoc[] hits = results.scoreDocs;
//just to see some output... (loop over all hits so an empty result
//set does not throw ArrayIndexOutOfBoundsException)
for (ScoreDoc hit : hits) {
Document doc = searcher.doc(hit.doc);
String label = doc.get("label");
System.out.println(label);
}
}
Edit: Code for the NGramAnalyzer
public class NGramAnalyzer extends Analyzer {
int minGram = 2;
int maxGram = 2;
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
Tokenizer source = new NGramTokenizer(reader, minGram, maxGram);
CharArraySet charArraySet = StopFilter.makeStopSet(Version.LUCENE_43,
FoodProductBlackList.blackList, true);
TokenStream filter = new StopFilter(Version.LUCENE_43, source, charArraySet);
return new TokenStreamComponents(source, filter);
}
}
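Since the analyzer above emits character bigrams (minGram = maxGram = 2), it can help to print which bigrams each query string shares with the indexed term. The sketch below is plain Java with no Lucene dependency; the `bigrams` helper is a hypothetical stand-in that mimics what NGramTokenizer(reader, 2, 2) produces, not Lucene's actual tokenizer:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class BigramOverlap {
    // Character bigrams, mimicking NGramTokenizer(reader, 2, 2)
    static Set<String> bigrams(String s) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    public static void main(String[] args) {
        Set<String> indexed = bigrams("Butter"); // [Bu, ut, tt, te, er]
        for (String q : new String[] {"Buter", "utter", "Bute", "Btter"}) {
            Set<String> overlap = new LinkedHashSet<>(bigrams(q));
            overlap.retainAll(indexed);
            System.out.println(q + " shares " + overlap.size()
                + " bigram(s) with Butter: " + overlap);
        }
    }
}
```

For example, "Buter" shares four bigrams with "Butter" while "Btter" shares only three, so which queries still match depends on scoring, the query parser's default operator, and the StopFilter above.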

Related

Lucene suggest: "is not a SuggestField" exception when using a CompletionQuery

I am trying to implement search suggestions for my app. Actually I need a kind of "multi-term prefix query" and I was trying to use a PrefixCompletionQuery. The problem is that an IllegalArgumentException is thrown when "search" or "suggest" methods are called from a SuggestIndexSearcher object.
I wrote a sample code to reproduce the problem:
public static void main(String[] args) throws IOException {
RAMDirectory dir = new RAMDirectory(); //just for this experiment
Analyzer analyzer = new CompletionAnalyzer(new StandardAnalyzer());
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
var doc = new Document();
doc.add(new SuggestField("suggest", "Hi everybody!",4));
writer.addDocument(doc);
doc = new Document();
doc.add(new SuggestField("suggest", "nice to meet you",4));
writer.addDocument(doc);
writer.commit(); // maybe redundant
writer.close();
var reader = DirectoryReader.open(dir);
var searcher = new SuggestIndexSearcher(reader);
var query = new PrefixCompletionQuery(analyzer, new Term("suggest", "everyb"));
TopDocs results = searcher.search(query, 5);
for (var res : results.scoreDocs) {
System.out.println(reader.document(res.doc).get("id"));
}
}
And this is what I get:
Exception in thread "main" java.lang.IllegalArgumentException: suggest is not a SuggestField
at org.apache.lucene.search.suggest.document.CompletionWeight.bulkScorer(CompletionWeight.java:86)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:658)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:574)
at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:421)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:432)
at experiments.main.main(main.java:67) // TopDocs results = searcher.search(query, 5);
To be as complete as possible: the project depends on lucene-core 8.8.2 and lucene-suggest 8.8.2.
Where am I wrong?
I think you have to change the postings format of your suggest field by adding a custom codec to your IndexWriter.
For example, something like this:
RAMDirectory dir = new RAMDirectory();
Analyzer analyzer = new CompletionAnalyzer(new StandardAnalyzer());
IndexWriterConfig config = new IndexWriterConfig(analyzer);
Codec codec = new Lucene87Codec() {
@Override
public PostingsFormat getPostingsFormatForField(String field) {
if (field.equals("suggest")) {
return new Completion84PostingsFormat();
}
return super.getPostingsFormatForField(field);
}
};
config.setCodec(codec);
IndexWriter indexWriter = new IndexWriter(dir, config);

TermQuery not giving expected result as QueryParser - Lucene 7.4.0

I am indexing 10 text documents using StandardAnalyzer.
public static void indexDoc(final IndexWriter writer, Path filePath, long timstamp)
{
try (InputStream iStream = Files.newInputStream(filePath))
{
Document doc = new Document();
Field pathField = new StringField("path",filePath.toString(),Field.Store.YES);
Field flagField = new TextField("ashish","i am stored",Field.Store.YES);
LongPoint last_modi = new LongPoint("last_modified",timstamp);
Field content = new TextField("content",new BufferedReader(new InputStreamReader(iStream,StandardCharsets.UTF_8)));
doc.add(pathField);
doc.add(last_modi);
doc.add(content);
doc.add(flagField);
if(writer.getConfig().getOpenMode()==OpenMode.CREATE)
{
System.out.println("Adding "+filePath.toString());
writer.addDocument(doc);
}
} catch (IOException e) {
e.printStackTrace();
}
}
Above is the code snippet used to index a document.
For testing purposes, I am searching a field called 'ashish'.
When I use QueryParser, Lucene gives the search results as expected.
public static void main(String[] args) throws Exception
{
String index = "E:\\Lucene\\Index";
String field = "ashish";
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser(field, analyzer);
String line = "i am stored";
Query query = parser.parse(line);
// Query q = new TermQuery(new Term("ashish","i am stored"));
System.out.println("Searching for: " + query.toString());
TopDocs results = searcher.search(query, 5 * hitsPerPage);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = Math.toIntExact(results.totalHits);
System.out.println(numTotalHits + " total matching documents");
for(int i=0;i<numTotalHits;i++)
{
Document doc = searcher.doc(hits[i].doc);
String path = doc.get("path");
String content = doc.get("ashish");
System.out.println(path+"\n"+content);
}
}
The above code demonstrates the use of QueryParser to retrieve the desired field, which works properly. It hits all 10 documents, as I am storing this field for all 10 documents. All good here.
However, when I use the TermQuery API, I don't get the desired result.
Here is the code change that I made for TermQuery.
public static void main(String[] args) throws Exception
{
String index = "E:\\Lucene\\Index";
String field = "ashish";
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();
// QueryParser parser = new QueryParser(field, analyzer);
String line = "i am stored";
// Query query = parser.parse(line);
Query q = new TermQuery(new Term("ashish","i am stored"));
System.out.println("Searching for: " + q.toString());
TopDocs results = searcher.search(q, 5 * hitsPerPage);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = Math.toIntExact(results.totalHits);
System.out.println(numTotalHits + " total matching documents");
for(int i=0;i<numTotalHits;i++)
{
Document doc = searcher.doc(hits[i].doc);
String path = doc.get("path");
String content = doc.get("ashish");
System.out.println(path+"\n"+content);
System.out.println("----------------------------------------------------------------------------------");
}
}
I am also attaching the screenshot of the TermQuery API execution.
I did some research on Stack Overflow itself (for example, Lucene TermQuery and QueryParser) but did not find any practical solution; also, the Lucene versions in those examples were very old.
I would appreciate some help.
Thanks in advance!
I found the answer to my question in this post:
link that explains how TermQuery works
TermQuery searches for the entire string as-is. This behavior will give you improper results, because while indexing, the data is often tokenized.
In the posted code, I was passing the entire search string to TermQuery, like
Query q = new TermQuery(new Term("ashish","i am stored"));
Now in the above case, Lucene looks for "i am stored" as-is, which it will never find, because during indexing this string was tokenized.
Instead, I searched like this: Query q = new TermQuery(new Term("ashish","stored"));
The above query gave me the expected results.
thanks,
Ashish
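To see why the exact phrase never matches as a single term, it helps to look at roughly what StandardAnalyzer emits for the indexed value. The sketch below is a simplified plain-Java approximation of that tokenization (real StandardAnalyzer has many more rules), not Lucene's actual implementation:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizeDemo {
    // Rough approximation of StandardAnalyzer for simple ASCII text:
    // lowercase, then split on anything that is not a letter or digit.
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().split("[^\\p{L}\\p{N}]+"));
    }

    public static void main(String[] args) {
        // The field value "i am stored" is indexed as three separate terms,
        // so a TermQuery for the single term "i am stored" can never match,
        // while a TermQuery for "stored" can.
        System.out.println(tokens("i am stored")); // [i, am, stored]
    }
}
```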
The real problem is that your query string is not getting analyzed here. So use the same analyzer as was used while indexing the document, and try the code below to analyze the query string and then search.
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("ashish", analyzer);
Query query = new TermQuery(new Term("ashish", "i am stored"));
query = parser.parse(query.toString());
ScoreDoc[] hits = searcher.search(query, 5).scoreDocs;

How to give web-inf folder path to fsdirectory for full text search using apache lucene

I am very new to Apache Lucene. Here I am trying to develop a sample full-text search application which searches in HTML files with a given input query; if the given string is found in any file, then indexes are created.
My results JSP page looks like this:
If I click on any hyperlink, that HTML file should open in a new tab, but I am getting a blank page.
This is my application folder structure.
This is my code:
protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
{
String query=request.getParameter("squery");
String path1= "C:/POC/indexs/";
System.out.println(path1);
String path2 = request.getContextPath()+"/WEB-INF/Html-Files";
File indexDir = new File(path1);
int hits = 100;
FilesTextFinder createIndex=new FilesTextFinder();
File dataDir = new File(path2);
String suffix = "html";
try
{
boolean iscreated=isIndexCreated(indexDir, query, hits);
if(!iscreated)
{
System.out.println("no indexs found...");
int numIndex = createIndex.index(indexDir, dataDir, suffix);
System.out.println("Total Indexs: "+numIndex);
}
searchIndex(indexDir, query, hits);
RequestDispatcher rd=request.getRequestDispatcher("/Results.jsp");
request.setAttribute("results", values);
request.setAttribute("query", query);
rd.forward(request, response);
}
catch (Exception e)
{
e.printStackTrace();
}
}
private void searchIndex(File indexDir, String queryStr, int maxHits) throws Exception
{
Directory directory = FSDirectory.open(indexDir);
DirectoryReader dreader=DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(dreader);
QueryParser parser = new QueryParser("contents",new SimpleAnalyzer());
Query query = parser.parse(queryStr);
TopDocs topDocs = searcher.search(query, maxHits);
ScoreDoc[] hits = topDocs.scoreDocs;
int count=hits.length;
values=new HashMap<String, String>();
for (int i = 0; i<count; i++)
{
int docId = hits[i].doc;
Document d = searcher.doc(docId);
String cmpt_path=d.get("filename");
int indx=cmpt_path.lastIndexOf("\\");
String name=cmpt_path.substring(indx+1,cmpt_path.length());
if(name.length()>40)
{
name=name.substring(0, 40);
}
System.out.println("name: "+name);
values.put(name, cmpt_path);
}
System.out.println("Found " + hits.length);
}
private boolean isIndexCreated(File indexDir, String queryStr, int maxHits) throws Exception
{
Directory directory = FSDirectory.open(indexDir);
System.out.println("---------"+indexDir);
DirectoryReader dreader=DirectoryReader.open(directory);//here i am getting error
IndexSearcher searcher = new IndexSearcher(dreader);
QueryParser parser = new QueryParser("contents",new SimpleAnalyzer());
Query query = parser.parse(queryStr);
TopDocs topDocs = searcher.search(query, maxHits);
ScoreDoc[] hits = topDocs.scoreDocs;
int count=hits.length;
System.out.println("called..."+count);
directory.close();
if(count>0)
return true;
else
return false;
}
}
How can I pass the folder path that contains my HTML files, i.e. WEB-INF/Html-Files, to the FSDirectory class? And when I click on any hyperlink in my results page, how can the corresponding HTML file be opened in a new browser tab?
Thanks in advance...

Why doesn't Lucene find any documents with this code?

I am working on this piece of code which adds a single document to a Lucene (4.7) index and then tries to find it by querying a term that definitely exists in the document. But indexSearcher doesn't return any documents. What is wrong with my code? Thank you for your comments and feedback.
String indexDir = "/home/richard/luc_index_03";
try {
Directory directory = new SimpleFSDirectory(new File(
indexDir));
Analyzer analyzer = new SimpleAnalyzer(
Version.LUCENE_47);
IndexWriterConfig conf = new IndexWriterConfig(
Version.LUCENE_47, analyzer);
conf.setOpenMode(OpenMode.CREATE_OR_APPEND);
conf.setRAMBufferSizeMB(256.0);
IndexWriter indexWriter = new IndexWriter(
directory, conf);
Document doc = new Document();
String title="New York is an awesome city to live!";
doc.add(new StringField("title", title, StringField.Store.YES));
indexWriter.addDocument(doc);
indexWriter.commit();
indexWriter.close();
directory.close();
IndexReader reader = DirectoryReader
.open(FSDirectory.open(new File(
indexDir)));
IndexSearcher indexSearcher = new IndexSearcher(
reader);
String field="title";
SimpleQueryParser qParser = new SimpleQueryParser(analyzer, field);
String queryText="New York" ;
Query query = qParser.parse(queryText);
int hitsPerPage = 100;
TopDocs results = indexSearcher.search(query, 5 * hitsPerPage);
System.out.println("number of results: "+results.totalHits);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = results.totalHits;
for (ScoreDoc scoreDoc:hits){
Document docC = indexSearcher.doc(scoreDoc.doc);
String path = docC.get("path");
String titleC = docC.get("title");
String ne = docC.get("ne");
System.out.println(path+"\n"+titleC+"\n"+ne);
System.out.println("---*****----");
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
After running I just get
number of results: 0
This is because you use StringField. From the javadoc:
A field that is indexed but not tokenized: the entire String value is indexed as a single token.
Just use TextField instead and you should be ok.
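To picture the mismatch without running Lucene: with StringField the whole value becomes one indexed term, while SimpleQueryParser plus SimpleAnalyzer turn "New York" into lowercased per-word terms that can never equal it. A minimal plain-Java illustration (the query-term values here are an assumption about what SimpleAnalyzer's lowercase word splitting produces):

```java
public class StringFieldMismatch {
    public static void main(String[] args) {
        // With StringField, the entire value is indexed as ONE term, verbatim:
        String indexedTerm = "New York is an awesome city to live!";
        // The analyzed query "New York" becomes two lowercased word terms,
        // neither of which equals the single indexed term:
        String[] queryTerms = {"new", "york"};
        for (String t : queryTerms) {
            System.out.println("\"" + t + "\" matches indexed term? "
                + t.equals(indexedTerm)); // false for both
        }
    }
}
```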

Can Lucene return search results with line number?

I want to implement "Find in Files" similar to the one in IDEs, using Lucene. Basically I want to search in source code files like .c, .cpp, .h, .cs and .xml. I tried the demo shown on the Apache website. It returns the list of files, but without line numbers or the number of occurrences in each file. I am sure there should be some way to get these.
Is there any way to get those details?
Can you please share the link to the demo shown on the Apache website?
Here I show you how to get the term frequency of a term given set of documents:
public static void main(final String[] args) throws CorruptIndexException,
LockObtainFailedException, IOException {
// Create the index
final Directory directory = new RAMDirectory();
final Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
final IndexWriterConfig config = new IndexWriterConfig(
Version.LUCENE_36, analyzer);
final IndexWriter writer = new IndexWriter(directory, config);
// addDoc(writer, field, text);
addDoc(writer, "title", "foo");
addDoc(writer, "title", "buz qux");
addDoc(writer, "title", "foo foo bar");
// Search
final IndexReader reader = IndexReader.open(writer, false);
final IndexSearcher searcher = new IndexSearcher(reader);
final Term term = new Term("title", "foo");
final Query query = new TermQuery(term);
System.out.println("Query: " + query.toString() + "\n");
final int limitShow = 3;
final TopDocs td = searcher.search(query, limitShow);
final ScoreDoc[] hits = td.scoreDocs;
// Take IDs and frequencies
final int[] docIDs = new int[td.totalHits];
for (int i = 0; i < td.totalHits; i++) {
docIDs[i] = hits[i].doc;
}
final Map<Integer, Integer> id2freq = getFrequencies(reader, term,
docIDs);
// Show results
for (int i = 0; i < td.totalHits; i++) {
final int docNum = hits[i].doc;
final Document doc = searcher.doc(docNum);
System.out.println("\tposition " + i);
System.out.println("Title: " + doc.get("title"));
final int freq = id2freq.get(docNum);
System.out.println("Occurrences of \"" + term.text() + "\" in \""
+ term.field() + "\" = " + freq);
System.out.println("--------------------------------\n");
}
searcher.close();
reader.close();
writer.close();
}
Here we add the documents to the index:
private static void addDoc(final IndexWriter w, final String field,
final String text) throws CorruptIndexException, IOException {
final Document doc = new Document();
// note: the field is added twice here, which doubles the term
// frequencies reported in the output below
doc.add(new Field(field, text, Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field(field, text, Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
}
This is an example of how to get the number of occurrences of a term in a doc:
public static Map<Integer, Integer> getFrequencies(
final IndexReader reader, final Term term, final int[] docIDs)
throws CorruptIndexException, IOException {
final Map<Integer, Integer> id2freq = new HashMap<Integer, Integer>();
final TermDocs tds = reader.termDocs(term);
if (tds != null) {
for (final int docID : docIDs) {
// Skip to the next docID
tds.skipTo(docID);
// Get its term frequency
id2freq.put(docID, tds.freq());
}
}
return id2freq;
}
If you put all together and you run it you will obtain this output:
Query: title:foo
position 0
Title: foo
Occurrences of "foo" in "title" = 2
--------------------------------
position 1
Title: foo foo bar
Occurrences of "foo" in "title" = 4
--------------------------------
I tried many forums, with zero response. Finally I got an idea from @Luca Mastrostefano's answer for how to get the line number details.
The Lucene searcher returns the file names, and I think that is sufficient to get the line numbers. The Lucene index does not store the original file content line by line, so it is impossible to get the line number directly from the index. Hence, I assume the only way is to take that path, read the file, and find the line numbers.
public static void PrintLines(string filepath,string key)
{
int counter = 1;
string line;
// Read the file and display it line by line.
System.IO.StreamReader file = new System.IO.StreamReader(filepath);
while ((line = file.ReadLine()) != null)
{
if (line.Contains(key))
{
Console.WriteLine("\t"+counter.ToString() + ": " + line);
}
counter++;
}
file.Close();
}
Call this function with the file path returned by the Lucene searcher.
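Since the rest of this thread is Java, here is the same idea as a Java sketch (a hypothetical LineFinder class, not from the original answer): read the file whose path the searcher returned and print the 1-based numbers of the lines containing the key.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineFinder {
    // Print the 1-based line number and text of every line in the file
    // that contains the search key.
    public static void printLines(String filePath, String key) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
            String line;
            int counter = 1;
            while ((line = reader.readLine()) != null) {
                if (line.contains(key)) {
                    System.out.println("\t" + counter + ": " + line);
                }
                counter++;
            }
        }
    }
}
```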

Categories

Resources