Apache Lucene 6 QueryParser range query is not working with IntPoint

I'm using the new IntPoint field type in Lucene 6 and I want to run range searches.
Using IntPoint.newRangeQuery, the search works and the correct documents are returned. However, when I use QueryParser (classic) or the new StandardQueryParser, nothing is returned.
// This works
Query pointQuery = IntPoint.newRangeQuery("duration", 1, 20);
System.out.println(pointQuery);
// This doesn't work
QueryParser parser = new QueryParser("name", analyzer);
Query classicQuery = parser.parse("duration:[1 TO 20]");
System.out.println(classicQuery);
// This doesn't work
StandardQueryParser queryParserHelper = new StandardQueryParser();
Query standardQuery = queryParserHelper.parse("timestamp:[1 TO 20]", "timestamp");
System.out.println(standardQuery);
// In all 3 cases it prints: timestamp:[1 TO 20]
Is this a bug or am I missing something?

It's not a bug, and I wouldn't say you are missing anything, really. QueryParser doesn't have any support for IntPoint fields, or any other numeric (PointValues) field types. Range queries in QueryParser syntax always generate a TermRangeQuery, which searches the field's terms in the inverted index by lexicographic order; that does not work for PointValues fields. Generating these queries with IntPoint.newRangeQuery and similar methods is the correct thing to do.
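To see why a lexicographic range check goes wrong for numbers, here is a minimal plain-Java sketch with no Lucene involved. The class and method names are hypothetical, and inTermRange only imitates a string-ordered comparison; it is not how TermRangeQuery is implemented. Note also that IntPoint values are not written to the inverted index as terms at all, so against a real IntPoint field the term-based query simply finds nothing.

```java
final class TermRangeDemo {
    // Imitates a term-style range check: bounds and value are compared as strings.
    static boolean inTermRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    // Numeric range check, the behaviour a point range query is meant to give.
    static boolean inNumericRange(int value, int lower, int upper) {
        return value >= lower && value <= upper;
    }

    public static void main(String[] args) {
        for (int v : new int[] {5, 15, 100}) {
            System.out.println(v
                    + " term=" + inTermRange(String.valueOf(v), "1", "20")
                    + " numeric=" + inNumericRange(v, 1, 20));
        }
    }
}
```

In this sketch, 5 falls outside the string range ["1", "20"] while 100 falls inside it, which is the opposite of the numeric result in both cases.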

Related

Lucene query modification

I have a requirement to modify the values in a Lucene query that is given as a string.
I take the Lucene query as input from the user interface and pass it to Elasticsearch.
For example:
Input: name:"abc" and age:26
Expected output: name:"abcmodified" and userage:26
How do I parse and modify a string-formatted Lucene query in Java?

Have you tried looking into org.apache.lucene.queryparser.classic.QueryParser? It has functionality to return a Lucene Query Object from an input string. For example:
String rawQuery = "name:abc AND age:26";
QueryParser parser = new QueryParser(Version.LUCENE_45, null, new WhitespaceAnalyzer(Version.LUCENE_45));
BooleanQuery query = (BooleanQuery) parser.parse(rawQuery);
query.clauses().get(0).setQuery(new TermQuery(new Term("name", "abcmodified")));
query.clauses().get(1).setQuery(new TermQuery(new Term("userage", "26")));
System.out.println(query);
This will print +name:abcmodified +userage:26, which is essentially what you want. Obviously, you can do smarter processing with a recursive method that traverses the query based on the query type (Boolean, Prefix, Term, Fuzzy, etc.).
Hope this helps!

Apache Lucene createWeight() for wildcard query

I'm using Apache Lucene 6.6.0 and I'm trying to extract terms from the search query. The current version of the code looks like this:
Query parsedQuery = new AnalyzingQueryParser("", analyzer).parse(query);
Weight weight = parsedQuery.createWeight(searcher, false);
Set<Term> terms = new HashSet<>();
weight.extractTerms(terms);
It works fine for the most part, but recently I noticed that it doesn't support queries with wildcards (i.e. the * sign). If the query contains wildcards, I get an exception:
java.lang.UnsupportedOperationException: Query id:123*456 does not implement createWeight
    at org.apache.lucene.search.Query.createWeight(Query.java:66)
    at org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:751)
    at org.apache.lucene.search.BooleanWeight.<init>(BooleanWeight.java:60)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:225)
So is there a way to use createWeight() with wildcarded queries? Or maybe there's another way to extract search terms from query without createWeight()?

Long story short, it is necessary to rewrite the query, for example, as follows:
final AnalyzingQueryParser analyzingQueryParser = new AnalyzingQueryParser("", analyzer);
// TODO: The rewrite method can be overridden.
// analyzingQueryParser.setMultiTermRewriteMethod(MultiTermQuery.CONSTANT_SCORE_BOOLEAN_REWRITE);
Query parsedQuery = analyzingQueryParser.parse(query);
// Here parsedQuery is an instance of the org.apache.lucene.search.WildcardQuery class.
parsedQuery = parsedQuery.rewrite(reader);
// Here parsedQuery is an instance of the org.apache.lucene.search.MultiTermQueryConstantScoreWrapper class.
final Weight weight = parsedQuery.createWeight(searcher, false);
final Set<Term> terms = new HashSet<>();
weight.extractTerms(terms);
Please refer to the thread:
Nabble: Lucene - Java Users - How to get the terms matching a WildCardQuery in Lucene 6.2?
Mail archive: How to get the terms matching a WildCardQuery in Lucene 6.2?
for further details.
It seems the mentioned Stack Overflow question is this one: How to get matches from a wildcard Query in Lucene 6.2.

Lucene: queryparser vs phrasequery or termquery

What are the advantages of not using QueryParser and using PhraseQuery or TermQuery instead? It seems to me you can use QueryParser to replace any of those.
For example, if I want to search for an exact phrase, I can do:
String searchString = "\"word1 word2\"";
QueryParser queryParser = new QueryParser(Version.LUCENE_46,"content", analyzer);
Query query = queryParser.parse(searchString);
Or, if I want to search for 2 terms, I can do:
String searchString = "word1* AND word2*";
QueryParser queryParser = new QueryParser(Version.LUCENE_46,"content", analyzer);
Query query = queryParser.parse(searchString);
Currently, I am only using QueryParser and it is working for me, but is this the correct way of using Lucene?

The main disadvantage of not using QueryParser is the following (it's especially the case when using Solr/Elasticsearch):
When you create a TermQuery yourself, something like this:
Query q = new TermQuery(new Term("text", "keyword"));
the problem is that you need to apply analyzers/filters manually. Say the user types KeyWord: if you pass that straight into a TermQuery, you will not find anything if lowercasing was applied at indexing time. Lowercasing is simple, of course, but do you really want to reimplement stemming, n-gramming, etc. in your own code rather than rely on the existing functionality of analyzers/filters?
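The mismatch can be illustrated with a toy model in plain Java. This is only a sketch under stated assumptions: the class, the analyze method and the map-based index are all hypothetical stand-ins, not Lucene APIs. Here analyze plays the role of an index-time analyzer that splits on whitespace and lowercases, and lookup plays the role of an exact term match.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class AnalysisMismatchDemo {
    // Stand-in for an index-time analyzer: split on whitespace, lowercase each token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Toy inverted index: analyzed term -> ids of the documents containing it.
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : analyze(docs.get(docId))) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // Exact term lookup: the term is used as given, with no analysis applied.
    static Set<Integer> lookup(Map<String, Set<Integer>> index, String term) {
        return index.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index =
                buildIndex(Collections.singletonList("Lucene KeyWord search"));
        // Raw user input misses, because the index only holds lowercased terms.
        System.out.println(lookup(index, "KeyWord"));
        // Running the input through the same analysis first finds document 0.
        System.out.println(lookup(index, analyze("KeyWord").get(0)));
    }
}
```

A query parser configured with the index-time analyzer performs the second kind of lookup for you, which is the point being made above.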

Lucene and Multifield query

I have an archive of university theses and publications indexed (with BM25 similarity) in Lucene (Java version). Some documents are in English and some in Italian, so I have duplicated fields such as pdf and pdf_en, or titolo and titolo_en. When a document is Italian I fill the Italian fields; otherwise I fill the English fields.
Now I have a BooleanQuery built with MultiFieldQueryParser. This is my code:
String[] fieldsGEN={"url","autori","lingua","settore","pdfurl"};
String[] fieldsITA={"titolo","tipologia","abstract","pdf"};
String[] fieldsENG={"titolo_en","tipologia_en", "abstract_en","pdf_en"};
MultiFieldQueryParser parserGEN = new MultiFieldQueryParser(version, fieldsGEN, analyzerIT);
MultiFieldQueryParser parserITA = new MultiFieldQueryParser(version, fieldsITA, analyzerIT);
MultiFieldQueryParser parserENG = new MultiFieldQueryParser(version, fieldsENG, analyzerENG);
parserGEN.setDefaultOperator(QueryParser.Operator.OR);
parserITA.setDefaultOperator(QueryParser.Operator.OR);
parserENG.setDefaultOperator(QueryParser.Operator.OR);
Query query4 =parserGEN.parse(ricerca.ricerca);
bq.add(query4, Occur.SHOULD);
Query query2 =parserITA.parse(ricerca.ricerca);
bq.add(query2, Occur.SHOULD);
Query query3 =parserENG.parse(ricerca.ricerca);
bq.add(query3, Occur.SHOULD);
If I search for "anna" (the name of an author), the 3 queries are:
Query: [titolo:anna tipologia:anna abstract:anna pdf:anna]
Query: [titolo_en:anna tipologia_en:anna abstract_en:anna pdf_en:anna]
Query: [url:anna autori:anna lingua:anna settore:anna pdfurl:anna]
and I also get authors without the name anna, even if they appear in the last positions (about 3 documents of 21, on 1000 indexed). I suppose it finds them in other fields.
Do you think the query is well constructed? Can it be improved, and how? How does a search engine like Google handle multi-field search?
Is there a better way to deal with multi-language fields?
Thanks,
Neptune.

Unless you have both translations for all documents, I would create 2 indexes, one for each language, using the same field names in each index. You would then use a MultiReader to run the search queries across both.
The problem with this approach is words that are spelled the same in both languages but have different meanings in English and Italian. Apart from those words, I think this architecture will be easier to understand, and its results easier to interpret.

Lucene: delete from index, based on multiple fields

I need to delete a document from a Lucene search index. The standard approach:
indexReader.deleteDocuments(new Term("field_name", "field value"));
won't do the trick: I need to perform the deletion based on multiple fields. I need something like this:
(pseudo code)
TermAggregator terms = new TermAggregator();
terms.add(new Term("field_name1", "field value 1"));
terms.add(new Term("field_name2", "field value 2"));
indexReader.deleteDocuments(terms.toTerm());
Are there any constructs for that?

IndexWriter has methods that allow more powerful deleting, such as IndexWriter.deleteDocuments(Query). You can build a BooleanQuery with the conjunction of terms you wish to delete, and use that.

Choice of Analyzer
First of all, watch out for which analyzer you are using. I was stumped for a while, only to realise that the StandardAnalyzer filters out common words like 'the' and 'a'. This is a problem when your field has the value 'A'. You might want to consider the KeywordAnalyzer:
See this post around the analyzer.
// Create an analyzer:
// NOTE: We want the keyword analyzer so that it doesn't strip or alter any terms:
// In our example, the Standard Analyzer removes the term 'A' because it is a common English word.
// https://stackoverflow.com/a/9071806/231860
KeywordAnalyzer analyzer = new KeywordAnalyzer();
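The stop-word effect described above can be imitated in plain Java. This is a hedged sketch, not Lucene code: the class and method names are hypothetical, and the three-word stop set is an illustrative assumption rather than the actual list any Lucene analyzer ships with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordDemo {
    // Illustrative stop set only (an assumption, not Lucene's real list).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // Stand-in for a stop-filtering analyzer: lowercase, split, drop stop words.
    static List<String> stopFilteringAnalyze(String value) {
        List<String> tokens = new ArrayList<>();
        for (String token : value.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stand-in for a keyword-style analyzer: the whole value is one unchanged token.
    static List<String> keywordAnalyze(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        // The field value "A" yields no tokens at all, so nothing can match it.
        System.out.println(stopFilteringAnalyze("A"));
        // The keyword-style version keeps the value intact as a single token.
        System.out.println(keywordAnalyze("A"));
    }
}
```

With the stop-filtering version, a key field whose value is 'A' produces no searchable token, which is why a delete keyed on that value would match nothing.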
Query Parser
Next, you can either create your query using the QueryParser:
See this post around overriding the default operator.
// Create a query parser without a default field in this example (the first argument):
QueryParser queryParser = new QueryParser("", analyzer);
// Optionally, set the default operator to be AND (we leave it the default OR):
// https://stackoverflow.com/a/9084178/231860
// queryParser.setDefaultOperator(QueryParser.Operator.AND);
// Parse the query:
Query multiTermQuery = queryParser.parse("field_name1:\"field value 1\" AND field_name2:\"field value 2\"");
Query API
Or you can achieve the same by constructing the query yourself using their API:
See this tutorial around creating the BooleanQuery.
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(new TermQuery(new Term("field_name1", "field value 1")), BooleanClause.Occur.MUST);
multiTermQuery.add(new TermQuery(new Term("field_name2", "field value 2")), BooleanClause.Occur.MUST);
Numeric Field Queries (Int etc...)
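When the key fields are numeric (IntField and similar), a TermQuery won't match; use a NumericRangeQuery instead. For Lucene 6+ point fields such as IntPoint, the equivalents are IntPoint.newExactQuery and IntPoint.newRangeQuery.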
See the answer to this question.
// NOTE: For IntFields, we need NumericRangeQueries:
// https://stackoverflow.com/a/14076439/231860
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(NumericRangeQuery.newIntRange("field_name1", 1, 1, true, true), BooleanClause.Occur.MUST);
multiTermQuery.add(NumericRangeQuery.newIntRange("field_name2", 2, 2, true, true), BooleanClause.Occur.MUST);
Delete the Documents that Match the Query
Then we finally pass the query to the writer to delete documents that match the query:
See the answer to this question.
// Remove the document by using a multi key query:
// http://www.avajava.com/tutorials/lessons/how-do-i-combine-queries-with-a-boolean-query.html
writer.deleteDocuments(multiTermQuery);
