Lucene query fuzzyquery

I want to apologize first for my poor English. I'm new to Lucene and I didn't really understand the query documentation. I indexed some docs and wrote this query code, but it's not working:
Term t = new Term("description", "history");
Query q = new FuzzyQuery(t, 2);
int hitsPerPage = 100;
Path indexPath = Paths.get("C:\\Users\\Win 7\\Desktop\\projet_ri\\index");
Directory directory = FSDirectory.open(indexPath);
DirectoryReader reader = DirectoryReader.open(directory);
IndexSearcher iSearcher = new IndexSearcher(reader);
TopDocs topdocs = iSearcher.search(q, hitsPerPage);
ScoreDoc[] resultsList = topdocs.scoreDocs;
System.out.println("Tab size: "+resultsList.length); // This prints Tab size: 0
for (int i = 0; i < resultsList.length; i++) {
    Document book = iSearcher.doc(resultsList[i].doc);
    String description = book.getField("description").stringValue();
    System.out.println(description);
}
The program isn't even entering the loop. I checked the resultsList array and it prints that the size is zero.
Can someone help me correct my code or give me an example of a working query?

You skipped the QueryParser step for your query.
The QueryParser needs the same Analyzer that you used for indexing. This is really important; otherwise the result set may differ from what you expect. Your sequence should be something like this:
open the index
create an IndexSearcher
create a QueryParser with the Analyzer that was used for indexing
create the query string from the given search terms
parse the query string with the QueryParser
search
close everything!
See basic lucene tutorial: https://www.tutorialspoint.com/lucene/lucene_search_operation.htm
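As background for the FuzzyQuery(t, 2) call in the question: the second argument is the maximum number of edits allowed between the query term and an indexed term. The sketch below is plain Java, not Lucene code, and only illustrates that idea with a classic Levenshtein distance; the class and method names are made up for this example. Lucene's own matching is implemented differently and, by default, also counts a swap of two adjacent characters as a single edit, which this sketch does not.

```java
final class EditDistanceSketch {

    // Classic Levenshtein distance (insert, delete, substitute),
    // computed with two rolling rows of the dynamic-programming table.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical helper: would an indexed term be close enough to the query term?
    static boolean withinMaxEdits(String queryTerm, String indexedTerm, int maxEdits) {
        return distance(queryTerm, indexedTerm) <= maxEdits;
    }

    public static void main(String[] args) {
        String queryTerm = "history";
        for (String indexedTerm : new String[] {"history", "histor", "story", "histories"}) {
            System.out.println(indexedTerm + " -> " + distance(queryTerm, indexedTerm)
                    + " edits, within 2: " + withinMaxEdits(queryTerm, indexedTerm, 2));
        }
    }
}
```

With a limit of 2, "histor" and "story" are within range of "history", while "histories" (3 edits) is not. The comparison is made against the terms as they were written to the index, so the analyzer used at indexing time still matters.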

Lucene: how can I add the parameter InOrder=true in QueryParser?

I have a text of file:
war force
force war
I split the text and save the words in TextWord:
TextWord[0]: war
TextWord[1]: force
TextWord[2]: force
TextWord[3]: war
I want to find only "war force", but my search also finds "force war".
I want the search to follow 2 rules:
Keep word order. (If my query string is "war force", only indexes 0 and 1 should be found. A match on "force war" would be wrong.);
Slop = 0 (There must be no words between "war" and "force": "war force" is correct, but "war SOMEWORD force" would be wrong.)
I tried this:
Query query = parser.parse(" \"war force\"~0x ");
Query query = parser.parse(" \"war force\"~0 ");
Query query = parser.parse("war AND force");
Query query = parser.parse("war force");
But these queries do not give the desired result. How can I do this?
My code:
Analyzer customAnalyzer = CustomAnalyzer.builder()
.withTokenizer("standard")
.build();
QueryParser parser = new QueryParser("tags", customAnalyzer);
Query query = parser.parse("\"war force\" AND NOT \"force war\"");
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs docs = searcher.search(query, 10);
System.out.println(" ");
FastVectorHighlighter highlighter = new FastVectorHighlighter();
FieldQuery fieldQuery = highlighter.getFieldQuery(query);
FieldTermStack stack = new FieldTermStack(reader, 0, "tags", fieldQuery);
TermInfo myTermInfo = stack.pop();
while(myTermInfo != null){
System.out.println("word[" + myTermInfo.getPosition() + "]: " + myTermInfo.getText());
myTermInfo = stack.pop();
}
My output:
word[0]: war
word[1]: force
word[4]: force
word[5]: war
The result I need:
word[0]: war
word[1]: force
I read the documentation. If a query such as "Word1 Word2" has no operator between the words, the OR operator is used by default. This means that the query war force is equivalent to the query force war, so both are found: 1) "war force"; 2) "force war". I don't know how to get only "war force" as the result.
What should I do? Am I missing something?
And if I use the highlighter, I get this result:
?<b>war</b> <b>force</b> bookcase bookcase1
force war
My code with highlighter:
Analyzer customAnalyzer = CustomAnalyzer.builder()
.withTokenizer("standard")
.build();
//... Above, create documents with two fields, one with term vectors (tv) and one without (notv)
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("tags", customAnalyzer);
Query query = parser.parse(" \"war force\"~0 ");
//Query query = parser.parse("*Case");
//Query query = new PrefixQuery(new Term("tags", "book")); // Search for words that start with the string "book", e.g. "bookcase"
TopDocs hits = searcher.search(query, 10);
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("<b>", "</b>");
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
for (int i = 0; i < hits.scoreDocs.length; i++) {
    int id = hits.scoreDocs[i].doc;
    Document doc = searcher.doc(id);
    String text = doc.get("tags");
    TokenStream tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "tags", customAnalyzer);
    TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, true, 100);//highlighter.getBestFragments(tokenStream, text, 3, "...");
    for (int j = 0; j < frag.length; j++) {
        if ((frag[j] != null) && (frag[j].getScore() > 0)) {
            System.out.println((frag[j].toString()));
        }
    }
    System.out.println("finish test");
}
But if I use the highlighter, I don't get the positions of the found words.

To exclude a term or phrase, you can use the - operator (the "prohibit" operator):
"war force" -"force war"
So, in Java, this would be:
Query query = parser.parse("\"war force\" -\"force war\"");
You can also use AND NOT:
"war force" AND NOT "force war"
You can see more details in the classic query parser syntax documentation.
Update
The question has changed a lot since you first asked it!
Now there are 2 new problems:
Your query appears to be retrieving documents that it should not retrieve.
You cannot get the positions of matched terms.
Problem 1
I cannot recreate this problem. Let's assume I have 2 documents in my index:
Doc 1: State WEAPONRY war force word1 And force war Book WEAPONRY
Doc 2: State WEAPONRY war force 123 War WORD1 Force And war Book WEAPONRY
When I use the following query:
"war force" AND NOT "force war"
I find Doc 2, but not Doc 1 - which is correct.
I don't know why you are seeing incorrect/unexpected results. I guess it may be because your index contains unexpected data or may be using an unexpected indexing approach. There is nothing in the question which helps to explain this.
Problem 2
Now, your question contains two examples of using highlighters:
the fast vector highlighter
the standard highlighter
However, neither of your code fragments reports the positions of matched tokens. To do that you can use the approach shown in this answer:
Lucene how can I get position of found query?
When I use that approach, and use the same data and query as shown above, I get the following results:
Found term: war
Position: 3
Found term: force
Position: 4
And, again, this is correct: The matched terms are the 3rd and 4th words in the found document.
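The two rules from the question (keep word order, slop 0) can be pictured without Lucene at all. The following is a minimal plain-Java sketch, assuming simple whitespace tokenization and case-sensitive comparison; the class and method names are hypothetical, and this is not how Lucene evaluates phrase queries internally. It reports 1-based word numbers so that the result lines up with the "3rd and 4th words" above.

```java
import java.util.ArrayList;
import java.util.List;

final class PhrasePositionSketch {

    // Returns the 1-based word numbers at which `first` is immediately
    // followed by `second` (in that order, with no words in between).
    static List<Integer> inOrderAdjacent(String text, String first, String second) {
        String[] words = text.trim().split("\\s+");
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            if (words[i].equals(first) && words[i + 1].equals(second)) {
                starts.add(i + 1);
            }
        }
        return starts;
    }

    // Mirrors the intent of: "war force" AND NOT "force war"
    static boolean matchesPhraseButNotReverse(String text) {
        return !inOrderAdjacent(text, "war", "force").isEmpty()
                && inOrderAdjacent(text, "force", "war").isEmpty();
    }

    public static void main(String[] args) {
        String doc1 = "State WEAPONRY war force word1 And force war Book WEAPONRY";
        String doc2 = "State WEAPONRY war force 123 War WORD1 Force And war Book WEAPONRY";
        for (String doc : new String[] {doc1, doc2}) {
            System.out.println("matches: " + matchesPhraseButNotReverse(doc)
                    + ", \"war force\" starts at word(s) " + inOrderAdjacent(doc, "war", "force"));
        }
    }
}
```

For the two sample documents above, Doc 1 is rejected because "force war" also occurs in it, and Doc 2 is accepted with the phrase starting at word 3.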

Lucene MultiFieldQueryParser does not work

I do not understand why this query does not work.
I need to find a document by two fields, both of them IDs. The document should match only if both values match: ID1 AND ID2.
But I get an empty result.
query = MultiFieldQueryParser.parse(new String[]{id1, id2},
new String[]{"ID1", "ID2"},
new SimpleAnalyzer());
TopDocs topDocs = searcher.search(query, 1);
Document doc = searcher.doc(topDocs.scoreDocs[0].doc);
The index works 100%. Verified by other requests.
Thanks for the help.

Since you only want to perform an AND intersection between two separate queries -- and not really do a MultiFieldQuery (where you search for the same value in multiple fields), a slightly modified version of what is shown in Lucene OR search using Boolean Query should work:
BooleanQuery bothQuery = new BooleanQuery();
// field, value
TermQuery idQuery1 = new TermQuery(new Term("ID1", id1));
TermQuery idQuery2 = new TermQuery(new Term("ID2", id2));
bothQuery.add(new BooleanClause(idQuery1, BooleanClause.Occur.MUST));
bothQuery.add(new BooleanClause(idQuery2, BooleanClause.Occur.MUST));
TopDocs topDocs = searcher.search(bothQuery, 1);
Document doc = searcher.doc(topDocs.scoreDocs[0].doc);

Thanks to MatsLindh for the above answer. I managed to solve a similar problem for a school assignment thanks to you.
Bear in mind that the sample code is outdated. For Lucene 8.9 (my case), you should do this instead:
Query query = new BooleanQuery.Builder()
.add(query1, BooleanClause.Occur.MUST)
.add(query2, BooleanClause.Occur.MUST)
.build();
TopDocs topDocs = searcher.search(query, 1);
Document doc = searcher.doc(topDocs.scoreDocs[0].doc);
query1 and query2 in the code above can be any Query objects, including TermQuery objects.
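Conceptually, two MUST clauses keep only the documents that appear in the results of both sub-queries. The sketch below illustrates that intersection in plain Java on two sorted arrays of document IDs; it is an illustration with made-up names and data, not Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class MustIntersectionSketch {

    // Two-pointer intersection of two ascending arrays of document IDs.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> both = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                both.add(a[i]); // the document satisfies both clauses
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return both;
    }

    public static void main(String[] args) {
        int[] docsMatchingId1 = {1, 4, 7, 9};  // hypothetical hits for the ID1 clause
        int[] docsMatchingId2 = {2, 4, 9, 12}; // hypothetical hits for the ID2 clause
        System.out.println(intersect(docsMatchingId1, docsMatchingId2));
    }
}
```

If the intersection is empty, topDocs.scoreDocs has length 0, so scoreDocs[0] should only be read after checking the length.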

Java Lucene - different results for BooleanQuery and QueryParser Query for same Lucene Query Language

I have observed an odd behaviour, but I don't see what I am doing wrong.
Using multiple BooleanQuery objects, I created the following query:
+(-(Request.zipCode:18055 Request.zipCode:33333 Request.zipCode:99999) +Request.zipCode:[* TO *]) *:*
...this is what I get via toString.
Update: this is how I created the part of the BooleanQuery that is responsible for the snippet +Request.zipCode:[* TO *])
Query fieldOccursQuery = new TermQuery(new Term(queryFieldName, "[* TO *]"));
I created exactly the same query (as far as I understand) via QueryParser like this:
String querystr = "+(-(Request.zipCode:18055 Request.zipCode:33333 Request.zipCode:99999) +Request.zipCode:[* TO *]) *:*";
Query query = new QueryParser(Version.LUCENE_46, "title", LuceneServiceI.analyzer).parse(querystr);
I processed both of them the same way like this:
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
int max = reader.maxDoc();
TopScoreDocCollector collector = TopScoreDocCollector.create(max > 0 ? max : 1, true);
searcher.search(query, collector);
....
ScoreDoc[] hits = collector.topDocs().scoreDocs;
Map<Integer, Document> docMap = new TreeMap<Integer, Document>();
for (int i = 0; i < hits.length; i++) {
    docMap.put(hits[i].doc, searcher.doc(hits[i].doc));
}
Different results
On an index like: stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<Request.zipCode:04103>
The query built via QueryParser returns one document, as expected.
The query built via BooleanQuery does not return the 1 expected document.
Questions
Is it possible for two identical queries to deliver different results? Do I need to set certain attributes on my BooleanQuery, etc.?
How can I get the same desired result with the BooleanQuery?
I could not find anything about differences, except with regard to performance (http://www.gossamer-threads.com/lists/lucene/java-user/144374)

I found the solution to my problem.
Instead of creating this for the BooleanQuery:
Query fieldOccursQuery = new TermQuery(new Term(queryFieldName, "[* TO *]"));
I used this:
ConstantScoreQuery constantScoreQuery = new ConstantScoreQuery(new FieldValueFilter(queryFieldName));
query.add(constantScoreQuery, Occur.MUST);
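Now my query looks different, but I only get documents that have a value in the queryFieldName field.
The issue seems to be that my first attempt passed the range syntax "[* TO *]" to a TermQuery, which looks it up as a literal term instead of parsing it the way QueryParser does. See also:
Find all Lucene documents having a certain field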

Lucene Query Scoring Based on Multiple Matched Columns

I am using Lucene to search a contacts directory containing general contact information for a database of people, such as first name, last name, phone number, address, etc. This question pertains specifically to searching by first and last name. Here is how I am indexing the names.
document.add(new Field("firstName", contact.getFirstName(), Field.Store.NO, Field.Index.NOT_ANALYZED));
document.add(new Field("lastName", contact.getLastName(), Field.Store.NO, Field.Index.NOT_ANALYZED));
I am searching the index like this:
IndexReader indexReader = IndexReader.open(FSDirectory.open(directory));
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
int hitsPerPage = indexSearcher.maxDoc();
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
String[] fields = {"id", "firstName", "lastName", "phoneNumber", "email", "address", "website"};
BooleanQuery booleanQuery = new BooleanQuery();
String[] terms = queryString.split(" ");
for (String term : terms) {
    for (String field : fields) {
        booleanQuery.add(new FuzzyQuery(new Term(field, term)), BooleanClause.Occur.SHOULD);
    }
}
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
indexSearcher.search(booleanQuery, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
The reason I am using a boolean query as opposed to a MultiFieldQuery is that it allows me to get results when a field is not an exact match. Basically, I split the query string by whitespace and then add a term for each of those keywords on each field in the index. I'm new to Lucene so I really have no idea if this is the optimal way to do this, but so far it's been working OK for me.
The only hiccup I'm having is that when searching by full name, the results are not returned in the right order.
Index has 2 records, John Doe and John Smith.
When I search for John Doe my results will look like:
1) John Smith
2) John Doe
If I type John Smith it will reverse and display John Doe first. Why is it not returning the exact match as the first result?

If you are going to search for all terms across all fields, why not index the entire text as part of another field? And then you can issue a query like
/*
\ is for escaping "
*/
String searchCriteria = "all:\"John Doe\"^3 OR all:(John Doe)";
IndexSearcher indexSearcher = new IndexSearcher(indexDirectory);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
QueryParser parser = new QueryParser(Version.LUCENE_35, "all", analyzer);
Query query = parser.parse(searchCriteria);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
indexSearcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
However, if you want to continue with your current design, you can try http://lucene.apache.org/java/3_5_0/api/all/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, int) to find out why one document is scored higher than another.

Using boolean queries and a for loop turned out to be a proper way of searching the index in my situation. The results were being reversed due to the way I was parsing and displaying them on the client side so it was a completely unrelated issue.

Lucene docFreq returning 0

I'm using Lucene 3.1 to index some documents.
When I use IndexSearcher.search(), I successfully get back results for queries.
However, when I use IndexSearcher.docFreq(), I get back 0 for a term. Can anyone offer some insight?
Also, why is there both an IndexSearcher.docFreq() and IndexReader.docFreq()? I have tried both, and both give me 0.
Here is my code:
IndexReader indexReader = IndexReader.open(dir);
IndexSearcher searcher = new IndexSearcher(indexReader);
...
String searchTermString = "foobar";
String field = "body";
Term term = new Term(field, searchTermString);
int numDocs = searcher.docFreq(term);
and then I get numDocs=0, even though when I use IndexSearcher.search() with the same search term string, I get back hits.

Try converting your term completely to lower case letters.
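One likely reason lowercasing helps: docFreq looks up the term exactly as given, without running it through an analyzer, while many analyzers lowercase tokens at indexing time. The plain-Java sketch below imitates this with a tiny in-memory term dictionary; the class, its methods, and the sample texts are invented for illustration and are not part of Lucene.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class DocFreqSketch {

    // term -> IDs of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // "Indexing": lowercase the text and split it on non-word characters,
    // roughly what a lowercasing analyzer would do.
    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Exact lookup with no analysis applied to the term.
    int docFreq(String term) {
        Set<Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    public static void main(String[] args) {
        DocFreqSketch index = new DocFreqSketch();
        index.addDocument(0, "Foobar is here");
        index.addDocument(1, "more FOOBAR text");
        System.out.println("docFreq(Foobar) = " + index.docFreq("Foobar"));
        System.out.println("docFreq(foobar) = " + index.docFreq("foobar"));
    }
}
```

In this sketch the mixed-case lookup finds nothing, while the lowercase lookup finds both documents, because only lowercase terms were ever stored.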

Create a TermQuery from the same Term that you pass to searcher.docFreq(term). Use this TermQuery for searching and check whether it yields any results. It should. If this TermQuery doesn't return any results, something is amiss in how the query is created in step 1 of the search in the question.

Are you adding your Fields with the Field.TermVector.YES option enabled?
Document doc = new Document();
doc.add(new Field("value", documentContents, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));

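Use TermEnum:
Term term = new Term(field, searchTermString);
TermEnum termEnum = indexReader.terms(term); // "enum" is a reserved word in Java, so use another name
// terms(term) positions the enumeration at the first term >= the given one, so confirm an exact match
int numDocs = term.equals(termEnum.term()) ? termEnum.docFreq() : 0;
termEnum.close();
And you don't need the IndexSearcher.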
