Lucene query modification

Lucene query modification - java

I have a requirement where I want to modify string formatted lucene query values.
I am taking lucene query as input from user interface and passing it to elastic.
For e.g.
Input : name:"abc" and age:26
Output expected: name: "abcmodified" and userage:26
How do I parse and modify string formatted lucene query in java?

Have you tried looking into org.apache.lucene.queryparser.classic.QueryParser? It has functionality to return a Lucene Query Object from an input string. For example:
String rawQuery = "name:abc AND age:26";
QueryParser parser = new QueryParser(Version.LUCENE_45, null, new WhitespaceAnalyzer(Version.LUCENE_45));
BooleanQuery query = (BooleanQuery) praser.parse(rawQuery);
query.clauses().get(0).setQuery(new TermQuery(new Term("name", "abcmodified")));
query.clauses().get(1).setQuery(new TermQuery(new Term("userage", "26")));
System.out.println(query);
Will print +name:abcmodified +userage:26, which is essentially what you want. Obviously you can have smarter processing using a recursive method that traverses the query based on the query type (Boolean, Prefix, Term, Fuzzy etc...)
Hope this helps!

Related

How can I automatically convert all Lucene TermQuery objects to PrefixQuery?

I'm using QueryParser with a StandardAnalyzer to parse a queryString. With this setup, if I search for "key short", it will not match the text "keyboard shortcut".
I think it's because the queryString "key short" gets parsed as BooleanQuery(TermQuery("key"), TermQuery("short")). If I wanted it to match "keyboard shortcut", I'd have to search for "key* short*". I'd like the QueryParser to do this for me automatically, ie produce: BooleanQuery(PrefixQuery("key"), PrefixQuery("short")) when given the queryString "key short".
Is this the right approach? If so, how should I go about doing this?

I never found a 'proper' solution to this, so I implemented a hack that appends wildcards to individual words in the raw query and then feeds that to the analyzer:
private static final Pattern QUERY_WORD_PATTERN = Pattern.compile("(?<= |^)(?!AND|OR)(\\w+)(?= |$)");
...
String processedQuery = String.format("%s OR %s",
QUERY_WORD_PATTERN.matcher(queryString).replaceAll("$1*"),
queryString);
Query query = new QueryParser(CONTENTS_FIELD, analyzer).parse(processedQuery);

Apache Lucene 6 QueryParser range query is not working with IntPoint

I'm using Lucene 6 new IntPoint and I want to do some range search
Using IntPoint.newRangeQuery the search works and the correct documents are returned, however when I'm using QueryParser (classic) or the new StandardQueryParser nothing is returned.
// This works
Query query = IntPoint.newRangeQuery("duration",1,20);
System.out.println(query);
//This doesn't work
QueryParser parser = new QueryParser("name", analyzer);
Query query = parser.parse("duration:[1 TO 20]");
System.out.println(query);
//This doesn't work
StandardQueryParser queryParserHelper = new StandardQueryParser();
Query query = queryParserHelper.parse("timestamp:[1 TO 20]", "timestamp");
System.out.println(query);
// In all 3 cases it prints: timestamp:[1 TO 20]
Is this a bug or am I missing something?

It's not a bug, and I wouldn't say you are missing anything, really. QueryParser doesn't have any support for IntPoint fields, or any other numeric (PointValues) field types. Range queries in QueryParser syntax will always generate a TermRangeQuery, which will search for that field based on lexicographic order in the inverted index, which will not be work for searching PointValues fields. Generating these using IntPoint.newRangeQuery and similar methods is the correct thing to do.

MongoDB Text Index using Java Driver

Using the MongoDB Java API, I have not been able to successfully locate a full example using text search. The code I am using is this:
DBCollection coll;
String searchString = "Test String";
coll.createIndex(new BasicDBObject ("blogcomments", "text"));
DBObject q = start("blogcomments").text(searchString).get();
The name of my collection that I am performing the search on is blogcomments. creatIndex() is the replacement method for the deprecated method ensureIndex(). I have seen examples for how to use the createIndex(), but not how to execute actual searches with the Java API. Is this the correct way to go about doing this?

That's not quite right. Queries that use indexes of type "text" can not specify a field name at query time. Instead, the field names to include in the index are specified at index creation time. See the documentation for examples. Your query will look like this:
DBObject q = QueryBuilder.start().text(searchString).get();

Lucene: Multiple words in a single term

Let's say I have a docs like
stringfield:123456
textfield:name website stackoverflow
and If I build a query in the following manner
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
QueryParser luceneQueryParser = new QueryParser(Version.LUCENE_42, "", analyzer);
Query luceneSearchQuery = luceneQueryParser.parse("textfield:\"name website\"");
it will return the doc as expected, but if I build my query using Lucene QueryAPI
PhraseQuery firstNameQuery = new PhraseQuery();
firstNameQuery.add(new Term("textfield","name website"));
it will not give me any result, i will have to tokenize "name website" and add each token in phrasequery.
Is there any default way in QueryAPI to tokenize as it does while parsing a String Query.
Sure I can do that myself but reinvent the wheel if it's already implemented.

You are adding the entire query as a single term to your PhraseQuery. You are on the right track, but when tokenized, that will not be a single term, but rather two. That is, your index has the terms name, website, and stackoverflow, but your query only has one term, which matches none of those name website.
The correct way to use a PhraseQuery, is to add each term to the PhraseQuery separately.
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("textfield", "name"));
phrase.add(new Term("textfield", "website"));

When you:
luceneQueryParser.parse("textfield:\"name website\"");
Lucene will tokenize the string "name website", and get 2 terms.
When you:
new Term("textfield","name website")
Lucene will not tokenize the string "name website", instead use the whole as a term.
As the result what you said, when you index the document, the field textfield MUST be Indexed and Tokenized.

Lucene: delete from index, based on multiple fields

I need to perform deletion of the document from lucene search index. Standard approach :
indexReader.deleteDocuments(new Term("field_name", "field value"));
Won't do the trick: I need to perform the deletion based on multiple fields. I need something like this:
(pseudo code)
TermAggregator terms = new TermAggregator();
terms.add(new Term("field_name1", "field value 1"));
terms.add(new Term("field_name2", "field value 2"));
indexReader.deleteDocuments(terms.toTerm());
Is there any constructs for that?

IndexWriter has methods that allow more powerful deleting, such as IndexWriter.deleteDocuments(Query). You can build a BooleanQuery with the conjunction of terms you wish to delete, and use that.

Choice of Analyzer
First of all, watch out which analyzer you are using. I was stumped for a while only to realise that the StandardAnalyzer filters out common words like 'the' and 'a'. This is a problem when your field has the value 'A'. You might want to consider the KeywordAnalyzer:
See this post around the analyzer.
// Create an analyzer:
// NOTE: We want the keyword analyzer so that it doesn't strip or alter any terms:
// In our example, the Standard Analyzer removes the term 'A' because it is a common English word.
// https://stackoverflow.com/a/9071806/231860
KeywordAnalyzer analyzer = new KeywordAnalyzer();
Query Parser
Next, you can either create your query using the QueryParser:
See this post around overriding the default operator.
// Create a query parser without a default field in this example (the first argument):
QueryParser queryParser = new QueryParser("", analyzer);
// Optionally, set the default operator to be AND (we leave it the default OR):
// https://stackoverflow.com/a/9084178/231860
// queryParser.setDefaultOperator(QueryParser.Operator.AND);
// Parse the query:
Query multiTermQuery = queryParser.parse("field_name1:\"field value 1\" AND field_name2:\"field value 2\"");
Query API
Or you can achieve the same by constructing the query yourself using their API:
See this tutorial around creating the BooleanQuery.
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(new TermQuery(new Term("field_name1", "field value 1")), BooleanClause.Occur.MUST);
multiTermQuery.add(new TermQuery(new Term("field_name2", "field value 2")), BooleanClause.Occur.MUST);
Numeric Field Queries (Int etc...)
When the key fields are numeric, you can't use a TermQuery, but instead must use a NumericRangeQuery.
See the answer to this question.
// NOTE: For IntFields, we need NumericRangeQueries:
// https://stackoverflow.com/a/14076439/231860
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(NumericRangeQuery.newIntRange("field_name1", 1, 1, true, true), BooleanClause.Occur.MUST);
multiTermQuery.add(NumericRangeQuery.newIntRange("field_name2", 2, 2, true, true), BooleanClause.Occur.MUST);
Delete the Documents that Match the Query
Then we finally pass the query to the writer to delete documents that match the query:
See the answer to this question.
// Remove the document by using a multi key query:
// http://www.avajava.com/tutorials/lessons/how-do-i-combine-queries-with-a-boolean-query.html
writer.deleteDocuments(multiTermQuery);

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Lucene query modification - java

Related

How can I automatically convert all Lucene TermQuery objects to PrefixQuery?

Apache Lucene 6 QueryParser range query is not working with IntPoint

MongoDB Text Index using Java Driver

Lucene: Multiple words in a single term

Lucene: delete from index, based on multiple fields

Categories

Resources