Lucene: QueryParser vs PhraseQuery or TermQuery - java

What are the advantages of using PhraseQuery or TermQuery directly instead of QueryParser? It seems to me that QueryParser can replace either of them.
For example, if I want to search for an exact phrase, I can do:
String searchString = "\"word1 word2\"";
QueryParser queryParser = new QueryParser(Version.LUCENE_46,"content", analyzer);
Query query = queryParser.parse(searchString);
Or, if I want to search for two terms, I can do:
String searchString = "word1* AND word2*";
QueryParser queryParser = new QueryParser(Version.LUCENE_46,"content", analyzer);
Query query = queryParser.parse(searchString);
Currently, I am only using queryparser and it is working for me, but is this the correct way of using Lucene?

The main disadvantage of not using QueryParser is the following (it applies especially when using Solr/Elasticsearch):
When you create a TermQuery directly, something like this:
Query q = new TermQuery(new Term("text", "keyword"));
the problem is that you need to apply analyzers/filters manually. Say the user types KeyWord: if you pass it straight into a TermQuery and lowercasing was applied at indexing time, you will not find anything. Lowercasing alone is simple, but do you really want to reimplement stemming, n-gramming, and so on in your own code rather than rely on the existing functionality of analyzers/filters?

Related

Apache Lucene 6 QueryParser range query is not working with IntPoint

I'm using Lucene 6 new IntPoint and I want to do some range search
Using IntPoint.newRangeQuery the search works and the correct documents are returned, however when I'm using QueryParser (classic) or the new StandardQueryParser nothing is returned.
// This works
Query query = IntPoint.newRangeQuery("duration",1,20);
System.out.println(query);
//This doesn't work
QueryParser parser = new QueryParser("name", analyzer);
Query query = parser.parse("duration:[1 TO 20]");
System.out.println(query);
//This doesn't work
StandardQueryParser queryParserHelper = new StandardQueryParser();
Query query = queryParserHelper.parse("timestamp:[1 TO 20]", "timestamp");
System.out.println(query);
// In all 3 cases it prints: timestamp:[1 TO 20]
Is this a bug or am I missing something?
It's not a bug, and I wouldn't say you are missing anything, really. QueryParser doesn't have any support for IntPoint fields, or any other numeric (PointValues) field types. Range queries in QueryParser syntax will always generate a TermRangeQuery, which searches that field in lexicographic order in the inverted index, and that will not work for PointValues fields. Generating these queries with IntPoint.newRangeQuery and similar methods is the correct thing to do.
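To make the "lexicographic order" point concrete, here is a minimal plain-Java sketch. It does not use Lucene at all and says nothing about how Lucene stores point values; the class and method names are hypothetical. It only shows why a range evaluated on string order disagrees with the same range evaluated on integers:

```java
public class LexicographicRangeDemo {

    // True if value lies in [lower, upper] when compared as strings,
    // i.e. character by character, as a lexicographic term range would.
    public static boolean inLexicographicRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    // True if value lies in [lower, upper] when compared as integers.
    public static boolean inNumericRange(int value, int lower, int upper) {
        return value >= lower && value <= upper;
    }

    public static void main(String[] args) {
        for (String v : new String[] {"5", "15", "100"}) {
            System.out.println(v
                    + " lexicographic=" + inLexicographicRange(v, "1", "20")
                    + " numeric=" + inNumericRange(Integer.parseInt(v), 1, 20));
        }
    }
}
```

Compared as strings, "5" falls outside [1 TO 20] while "100" falls inside it, which is the opposite of the numeric result in both cases.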

lucene BooleanQuery.Builder Build doesn't Work

Hello guys, I have a question :)
I create a BooleanQuery like this:
BooleanQuery.Builder qry = new BooleanQuery.Builder();
qry.add(new TermQuery(new Term("Name", "Anna")), BooleanClause.Occur.SHOULD);
And if I do a search like this now:
TopDocs docs = searcher.search(qry.build(), hitsPerPage);
it gets zero results. But if I use this code:
TopDocs docs = searcher.search(parser.parse(qry.build().toString()), hitsPerPage);
then I get the right results. Can you explain why I have to parse it again?
I am using version 5.5.0 and Name is a TextField.
A TextField runs your data through an analyzer and will likely produce the term "anna" (lowercase). A TermQuery does not run anything through an analyzer, so it searches for "Anna" (uppercase) and this does not match. Create the TermQuery with the lowercased term and you should see results: new TermQuery(new Term("Name", "anna")).
The BooleanQuery has nothing to do with this; in fact, this particular query would rewrite itself to the underlying TermQuery, as that is its only subquery.
The parser takes the string "Name:Anna" (produced by the TermQuery), runs it through the analyzer and gives you a "Name:anna" TermQuery. That's why it works if you run the query through the parser: it involves the necessary analysis step.
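The mismatch described above can be reproduced without Lucene. The following is a toy model only: analyze and index are hypothetical stand-ins for an analyzer and an indexed TextField, not Lucene APIs, and the "analyzer" here merely lowercases and splits on whitespace:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class AnalyzedLookupDemo {

    // Toy stand-in for an analyzer: lowercase the text and split on whitespace.
    public static String[] analyze(String text) {
        return text.trim().toLowerCase(Locale.ROOT).split("\\s+");
    }

    // Toy stand-in for indexing an analyzed field: keep the set of analyzed terms.
    public static Set<String> index(String fieldValue) {
        return new HashSet<>(Arrays.asList(analyze(fieldValue)));
    }

    public static void main(String[] args) {
        Set<String> nameTerms = index("Anna");

        // Exact lookup of the text as typed, as an un-analyzed term would do.
        System.out.println("raw lookup matches: " + nameTerms.contains("Anna"));

        // Lookup after the same analysis that was applied at index time.
        System.out.println("analyzed lookup matches: " + nameTerms.contains(analyze("Anna")[0]));
    }
}
```

The raw lookup fails and the analyzed lookup succeeds, because only the lowercased form was ever stored.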

Lucene query object and search

I am upgrading from Lucene 3.6 to 5.3.0, but the search doesn't want to take my parameters when using 5.3.0.
This works in 3.6:
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));
SimpleAnalyzer analyzer = new SimpleAnalyzer(Version.LUCENE_36);
QueryParser parser = new QueryParser(Version.LUCENE_36, "contents",
analyzer);
TopDocs topDocs = null;
Query query = parser.parse(queryString);
topDocs = searcher.search(query, 1000);
But in 5.3, the compiler is asking me to use SrndQuery, but I still get an error on the searcher.search method:
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(directory));
Analyzer analyzer = new SimpleAnalyzer();
QueryParser parser = new QueryParser();
TopDocs topDocs = null;
SrndQuery query = QueryParser.parse(queryString);
topDocs = searcher.search(query, 1000); // Error: The method search(Query, int) in the type IndexSearcher is not applicable for the arguments (SrndQuery, int)
Not sure what I am doing wrong here. Any ideas?
P.S. I am upgrading because I am not able to get Highlighted text from some PDFs I recently indexed.
It bears stating that you are using the Surround query parser, rather than the standard query parser (if you are intending to use the standard parser, you are importing the wrong one).
The problem you are running into is that a SrndQuery isn't really a Lucene query, so you can't just pass it to the searcher and get results. You need to transform it into a Lucene query before you can search with it. This is done via the SrndQuery.makeLuceneQueryField method. You'll need to create a BasicQueryFactory to pass into it, but they are easy to construct:
SrndQuery query = QueryParser.parse(queryString);
BasicQueryFactory factory = new BasicQueryFactory(1000 /*maxBasicQueries*/);
Query luceneQuery = query.makeLuceneQueryField("myDefaultField", factory);
topDocs = searcher.search(luceneQuery, 1000);
Somewhat tangential note: I wondered whether you should keep the BasicQueryFactory around rather than creating a new one for every search, but that appears to be unnecessary. There is definitely nothing expensive going on in the constructor, and it looks like Solr's SurroundQParserPlugin constructs a new one for each query it parses, so doing that should be fine.

Lucene: Multiple words in a single term

Let's say I have a doc like
stringfield:123456
textfield:name website stackoverflow
and if I build a query in the following manner
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
QueryParser luceneQueryParser = new QueryParser(Version.LUCENE_42, "", analyzer);
Query luceneSearchQuery = luceneQueryParser.parse("textfield:\"name website\"");
it will return the doc as expected, but if I build my query using Lucene QueryAPI
PhraseQuery firstNameQuery = new PhraseQuery();
firstNameQuery.add(new Term("textfield","name website"));
it will not give me any results; I have to tokenize "name website" and add each token to the PhraseQuery.
Is there any default way in the query API to tokenize the way the parser does when parsing a query string?
Sure, I can do that myself, but why reinvent the wheel if it's already implemented?
You are adding the entire query as a single term to your PhraseQuery. You are on the right track, but when tokenized, that will not be a single term, but rather two. That is, your index has the terms name, website, and stackoverflow, but your query has only one term, "name website", which matches none of those.
The correct way to use a PhraseQuery is to add each term to it separately:
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("textfield", "name"));
phrase.add(new Term("textfield", "website"));
When you write:
luceneQueryParser.parse("textfield:\"name website\"");
Lucene will tokenize the string "name website" and get two terms.
When you write:
new Term("textfield","name website")
Lucene will not tokenize the string "name website"; instead it uses the whole string as a single term.
That explains the result you described. Note that when you index the document, the field textfield must be indexed and tokenized.
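The difference between one untokenized term and one term per token can be sketched in plain Java. This is a toy model, not Lucene code: tokenize is a hypothetical stand-in for an analyzer (lowercase plus whitespace split), and matchesPhrase simply checks that the query terms appear consecutively and in order among the document's tokens:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PhraseTokensDemo {

    // Toy stand-in for an analyzer: lowercase the text and split on whitespace.
    public static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // True if the query terms occur in the document tokens consecutively and in order.
    public static boolean matchesPhrase(List<String> docTokens, List<String> queryTerms) {
        return Collections.indexOfSubList(docTokens, queryTerms) >= 0;
    }

    public static void main(String[] args) {
        List<String> doc = tokenize("name website stackoverflow");

        // One term holding the whole string: no single indexed token equals it.
        System.out.println(matchesPhrase(doc, Collections.singletonList("name website")));

        // One term per token, in the same order as in the document.
        System.out.println(matchesPhrase(doc, tokenize("name website")));
    }
}
```

The first check prints false and the second prints true, mirroring the behaviour of the two PhraseQuery constructions discussed above.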

Lucene: delete from index, based on multiple fields

I need to delete a document from a Lucene search index. The standard approach:
indexReader.deleteDocuments(new Term("field_name", "field value"));
Won't do the trick: I need to perform the deletion based on multiple fields. I need something like this:
(pseudo code)
TermAggregator terms = new TermAggregator();
terms.add(new Term("field_name1", "field value 1"));
terms.add(new Term("field_name2", "field value 2"));
indexReader.deleteDocuments(terms.toTerm());
Are there any constructs for that?
IndexWriter has methods that allow more powerful deleting, such as IndexWriter.deleteDocuments(Query). You can build a BooleanQuery that is the conjunction of the terms identifying the documents you wish to delete, and use that.
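What "conjunction" means for the delete can be illustrated with a toy in-memory model in plain Java. Nothing here is a Lucene API: documents are plain maps, and doc, sampleDocs and deleteMatchingAll are hypothetical helpers. A document is removed only when every required field has the required value:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConjunctiveDeleteDemo {

    // Builds a toy document (or a set of required values) with the two fields from the question.
    public static Map<String, String> doc(String value1, String value2) {
        Map<String, String> fields = new HashMap<>();
        fields.put("field_name1", value1);
        fields.put("field_name2", value2);
        return fields;
    }

    // Three toy documents: one matching both fields, two matching only one field each.
    public static List<Map<String, String>> sampleDocs() {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("field value 1", "field value 2"));
        docs.add(doc("field value 1", "something else"));
        docs.add(doc("something else", "field value 2"));
        return docs;
    }

    // Removes the documents in which every required field has the required value,
    // and returns how many documents were removed.
    public static int deleteMatchingAll(List<Map<String, String>> docs, Map<String, String> required) {
        int before = docs.size();
        docs.removeIf(d -> required.entrySet().stream()
                .allMatch(e -> e.getValue().equals(d.get(e.getKey()))));
        return before - docs.size();
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = sampleDocs();
        int removed = deleteMatchingAll(docs, doc("field value 1", "field value 2"));
        System.out.println("removed=" + removed + " remaining=" + docs.size());
    }
}
```

Only the first sample document is removed; the two that match a single field survive, which is the behaviour a conjunction of required clauses is meant to give.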
Choice of Analyzer
First of all, be careful about which analyzer you are using. I was stumped for a while, only to realise that the StandardAnalyzer filters out common words like 'the' and 'a'. This is a problem when your field has the value 'A'. You might want to consider the KeywordAnalyzer:
See this post about the analyzer.
// Create an analyzer:
// NOTE: We want the keyword analyzer so that it doesn't strip or alter any terms:
// In our example, the Standard Analyzer removes the term 'A' because it is a common English word.
// https://stackoverflow.com/a/9071806/231860
KeywordAnalyzer analyzer = new KeywordAnalyzer();
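The effect described here can be mimicked in plain Java. This is a toy model under stated assumptions: the stop-word list is a made-up three-word sample, and analyzeWithStopWords and analyzeAsKeyword are hypothetical stand-ins, not the real analyzers. It shows how a value of 'A' vanishes under stop-word filtering but survives intact when the whole value is kept as one token:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordDemo {

    // Illustrative stop-word list only; a real analyzer's list may differ.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // Toy stand-in for a stop-word-removing analyzer:
    // lowercase, split on whitespace, drop stop words.
    public static List<String> analyzeWithStopWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Toy stand-in for keyword-style analysis: the whole value is one unmodified token.
    public static List<String> analyzeAsKeyword(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        System.out.println("stop-word analysis of 'A': " + analyzeWithStopWords("A"));
        System.out.println("keyword analysis of 'A': " + analyzeAsKeyword("A"));
    }
}
```

With the stop-word filter the value 'A' produces no tokens at all, so nothing is left to match on; the keyword-style analysis keeps the single token A.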
Query Parser
Next, you can either create your query using the QueryParser:
See this post around overriding the default operator.
// Create a query parser without a default field in this example (the first argument):
QueryParser queryParser = new QueryParser("", analyzer);
// Optionally, set the default operator to be AND (we leave it the default OR):
// https://stackoverflow.com/a/9084178/231860
// queryParser.setDefaultOperator(QueryParser.Operator.AND);
// Parse the query:
Query multiTermQuery = queryParser.parse("field_name1:\"field value 1\" AND field_name2:\"field value 2\"");
Query API
Or you can achieve the same by constructing the query yourself using their API:
See this tutorial around creating the BooleanQuery.
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(new TermQuery(new Term("field_name1", "field value 1")), BooleanClause.Occur.MUST);
multiTermQuery.add(new TermQuery(new Term("field_name2", "field value 2")), BooleanClause.Occur.MUST);
Numeric Field Queries (Int etc...)
When the key fields are numeric, you can't use a TermQuery, but instead must use a NumericRangeQuery.
See the answer to this question.
// NOTE: For IntFields, we need NumericRangeQueries:
// https://stackoverflow.com/a/14076439/231860
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(NumericRangeQuery.newIntRange("field_name1", 1, 1, true, true), BooleanClause.Occur.MUST);
multiTermQuery.add(NumericRangeQuery.newIntRange("field_name2", 2, 2, true, true), BooleanClause.Occur.MUST);
Delete the Documents that Match the Query
Then we finally pass the query to the writer to delete documents that match the query:
See the answer to this question.
// Remove the document by using a multi key query:
// http://www.avajava.com/tutorials/lessons/how-do-i-combine-queries-with-a-boolean-query.html
writer.deleteDocuments(multiTermQuery);
