Lucene: Searching multiple fields with default operator = AND

To allow users to search across multiple fields with Lucene 3.5 I currently create and add a QueryParser for each field to be searched to a DisjunctionMaxQuery. This works great when using OR as the default operator but I now want to change the default operator to AND to get more accurate (and fewer) results.
The problem is that queryParser.setDefaultOperator(QueryParser.AND_OPERATOR) misses many documents, since all terms must then appear together in at least one field.
For example, consider the following data for a document: title field = "Programming Languages", body field = "Java, C++, PHP". If a user searched for Java Programming, this document would not be included in the results, since neither the title nor the body field contains all the terms in the query, although combined they do. I would want this document returned for the above query but not for the query HTML Programming.
I've considered a catchall field but I have a few problems with it. First, users frequently include per field terms in their queries (author:bill) which is not possible with a catchall field. Also, I highlight certain fields with FastVectorHighlighter which requires them to be indexed and stored. So by adding a catchall field I would have to index most of the same data twice which is time and space consuming.
Any ideas?

Guess I should have done a little more research. Turns out MultiFieldQueryParser provides the exact functionality I was looking for. For whatever reason I was creating a QueryParser for each field I wanted to search like this:
String[] fields = {"title", "body", "subject", "author"};
QueryParser[] parsers = new QueryParser[fields.length];
for(int i = 0; i < parsers.length; i++)
{
parsers[i] = new QueryParser(Version.LUCENE_35, fields[i], analyzer);
parsers[i].setDefaultOperator(QueryParser.AND_OPERATOR);
}
This would result in a query like this:
(+title:java +title:programming) | (+body:java +body:programming)
...which is not what I was looking for. Now I create a single MultiFieldQueryParser like this:
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_35, new String[]{"title", "body", "subject"}, analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
This gives me the query I was looking for:
+(title:java body:java) +(title:programming body:programming)
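The difference between the two query shapes can be checked without Lucene. The following is only a toy model in plain Java, assuming each field has already been analyzed into a set of lowercase tokens; the class and method names are made up for illustration and are not part of the Lucene API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the two query shapes; it does not call Lucene.
final class MultiFieldAndDemo {

    // Shape 1: (+title:t1 +title:t2) | (+body:t1 +body:t2)
    // Some single field must contain every term.
    static boolean allTermsInOneField(Map<String, Set<String>> doc, List<String> terms) {
        for (Set<String> fieldTokens : doc.values()) {
            if (fieldTokens.containsAll(terms)) {
                return true;
            }
        }
        return false;
    }

    // Shape 2: +(title:t1 body:t1) +(title:t2 body:t2)
    // Every term must appear in at least one field.
    static boolean eachTermInAnyField(Map<String, Set<String>> doc, List<String> terms) {
        for (String term : terms) {
            boolean found = false;
            for (Set<String> fieldTokens : doc.values()) {
                if (fieldTokens.contains(term)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    // The example document from the question, pre-tokenized (an assumption).
    static Map<String, Set<String>> sampleDoc() {
        Map<String, Set<String>> doc = new LinkedHashMap<>();
        doc.put("title", new HashSet<>(Arrays.asList("programming", "languages")));
        doc.put("body", new HashSet<>(Arrays.asList("java", "c++", "php")));
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> doc = sampleDoc();
        List<String> javaProgramming = Arrays.asList("java", "programming");
        List<String> htmlProgramming = Arrays.asList("html", "programming");
        System.out.println(allTermsInOneField(doc, javaProgramming)); // false
        System.out.println(eachTermInAnyField(doc, javaProgramming)); // true
        System.out.println(eachTermInAnyField(doc, htmlProgramming)); // false
    }
}
```

In this model the per-field parsers reject the "Java Programming" query, while the MultiFieldQueryParser shape accepts it and still rejects "HTML Programming".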
Thanks to @seeta and @femtoRgon for the help!

Perhaps what you need is a combination of Boolean queries that capture the different combinations of fields and terms. In your given example, the query could be -
(title:Java AND body:programming) OR (title:programming AND body:Java).
I don't know if there's an existing Query class that generates this automatically for you, but I think that's what should be the ultimate query that's run on the index.

If you want to be able to search multiple fields with the same set of terms, then the query from your comment:
((title:java title:programming) | (body:java body:programming))~0.2
may not be the best implementation.
You're effectively getting either the score from the title, or the score from the body for the combined set of terms. The case where you hit java in the title and programming in the body would be given approx. equal weight to a hit on java in the body and no hit on programming.
I think a better structured query would be:
(title:java body:java)~0.2 (title:programming body:programming)~0.2
This makes more sense to me, since you want the dismax queries to limit score growing on multiple queries of the same term (in different fields), but you do want scoring to grow for hits on different terms, I believe.
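To make the scoring argument concrete, here is a rough arithmetic sketch in plain Java. It assumes a dismax clause scores as the maximum sub-score plus the tie-breaker times the sum of the other sub-scores, and that a plain boolean clause simply sums its sub-scores; coord and norm factors are ignored, and the per-field term scores of 1.0 are invented for illustration.

```java
// Arithmetic model of the two query structures; it does not call Lucene.
final class DisMaxScoringDemo {

    // Dismax: best sub-score plus tieBreaker times the remaining sub-scores.
    static double disMax(double tieBreaker, double... scores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : scores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    // ((title:a title:b) | (body:a body:b))~tie, with scores[field][term].
    static double fieldFirst(double tie, double[][] scores) {
        double[] perField = new double[scores.length];
        for (int f = 0; f < scores.length; f++) {
            for (double termScore : scores[f]) {
                perField[f] += termScore;
            }
        }
        return disMax(tie, perField);
    }

    // (title:a body:a)~tie (title:b body:b)~tie, with scores[field][term].
    static double termFirst(double tie, double[][] scores) {
        double total = 0.0;
        for (int t = 0; t < scores[0].length; t++) {
            double[] perTerm = new double[scores.length];
            for (int f = 0; f < scores.length; f++) {
                perTerm[f] = scores[f][t];
            }
            total += disMax(tie, perTerm);
        }
        return total;
    }

    public static void main(String[] args) {
        // Rows: title, body. Columns: java, programming.
        double[][] bothTerms = {{1.0, 0.0}, {0.0, 1.0}}; // java in title, programming in body
        double[][] oneTerm = {{0.0, 0.0}, {1.0, 0.0}};   // java in body only
        System.out.println(fieldFirst(0.2, bothTerms) + " vs " + fieldFirst(0.2, oneTerm));
        System.out.println(termFirst(0.2, bothTerms) + " vs " + termFirst(0.2, oneTerm));
    }
}
```

Under these assumptions the field-first structure scores the two documents 1.2 and 1.0, while the term-first structure scores them 2.0 and 1.0, which is the separation described above.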
If that sort of query structure gets you better score results, limiting results to a certain minimum score (a percentage of the max score returned, rather than a simple hard-coded value) may be adequate to prevent too-weak results from being seen.
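A relative cutoff of that kind could look roughly like the sketch below. It works on a plain array of scores that is assumed to be sorted best-first (as search results usually are); the class name and the 25% fraction are illustrative choices, not recommendations.

```java
import java.util.ArrayList;
import java.util.List;

// Keeps only hits whose score is at least a given fraction of the best score.
final class RelativeScoreCutoff {

    static List<Float> keepAbove(float[] scoresDescending, float fraction) {
        List<Float> kept = new ArrayList<>();
        if (scoresDescending.length == 0) {
            return kept;
        }
        // The first entry is assumed to be the maximum score.
        float cutoff = scoresDescending[0] * fraction;
        for (float score : scoresDescending) {
            if (score >= cutoff) {
                kept.add(score);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        float[] scores = {2.0f, 1.2f, 0.5f, 0.1f};
        // Cutoff is 25% of the top score, so the 0.1 hit is dropped.
        System.out.println(keepAbove(scores, 0.25f));
    }
}
```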
I also still wouldn't count out indexing an all field. It's an implementation I've used before, while indexing BOTH the specific field and the catchall field, thus allowing both general querying and specific single-field queries. Index storage tends to be pretty lean for unstored terms, and it will generally help performance, if you find yourself having to create big, complicated queries to make up for not having it.
If you really want to be sure that it takes minimal storage, you can even turn off TermVectors for that field:
new Field(name, value, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO);
Although I don't know how much of a difference that would really make.

Related

Lucene searching over stored and unstored Fields concurrently

I'm working with Lucene 7.4 and have indexed a sample of txt files.
I have some Fields that have been stored, such as path and filename,
and a content Field, which was unstored before passing the doc to the IndexWriter.
Consequently my content Field contains the processed (e.g. tokenized, stemmed) content data of the file, my filename Field contains the unprocessed filename, the entire String.
try (InputStream stream = Files.newInputStream(file)) {
// create empty document
Document doc = new Document();
// add the last modification time field
Field lastModField = new StoredField(LuceneConstants.LAST_MODIFICATION_TIME, Files.getAttribute(file, "lastModifiedTime", LinkOption.NOFOLLOW_LINKS).toString());
doc.add(lastModField);
// add the path Field
Field pathField = new StringField(LuceneConstants.FILE_PATH, file.toString(), Field.Store.YES);
doc.add(pathField);
// add the name Field
doc.add(new StringField(LuceneConstants.FILE_NAME, file.getFileName().toString(), Field.Store.YES));
// add the content
doc.add(new TextField(LuceneConstants.CONTENTS, new BufferedReader(new InputStreamReader(stream))));
System.out.println("adding " + file);
writer.addDocument(doc);
}
Now, as far as I understand, I have to use 2 QueryParsers, since I need to use 2 different Analyzers for searching over both fields, one for each.
I can't figure out how to combine them.
What I want is a TopDocs wherein the results are ordered by a relevance score that is some combination of the 2 relevance scores from the search over the filename Field and the search over the content Field.
Does Lucene 7.4 provide you with the means for an easy solution to this?
PS: This is my first post in a long time, if not ever. Please remark any formatting or content issues.
EDIT:
Analyzer used for indexing content Field and for searching content Field:
Analyzer myTxtAnalyzer = CustomAnalyzer.builder()
.withTokenizer("standard")
.addTokenFilter("lowercase")
.addTokenFilter("stop")
.addTokenFilter("porterstem")
.build();
And I'm using the KeywordAnalyzer to search over the filename Field, which, to reiterate, is stored, so not analyzed.
My program is supposed to index files and search over that index, retrieving
a list of the most relevant documents. If a searchString, which may contain whitespaces, exactly matches the fileName,
I'd like that to heavily impact my search results.
I'm a computer science student, and this is my first project with Lucene.
If there are no functions available, it's all good. What I'm asking for is not a requirement for my task. I'm just pondering and I feel like this is something there might already exist a simple solution for. But I can't seem to find it, if it exists.
EDIT 2:
I had a misconception about what happens when using Field.Store.YES/NO.
My problem has nothing to do with it.
The String wasn't tokenized, because it was in a StringField.
I assumed it was because it was stored.
However, my question remains.
Is there a way to search over tokenized and untokenized Fields concurrently?

As @andrewjames mentions, you don't need to use multiple analyzers in your example because only the TextField gets analyzed; the StringFields do not. If you had a situation where you did need different analyzers for different fields, Lucene can accommodate that. To do so, you use a PerFieldAnalyzerWrapper, which lets you specify a default Analyzer and then as many field-specific analyzers as you like (passed to PerFieldAnalyzerWrapper as a map from field name to Analyzer). Then, when analyzing the doc, it will use the field-specific analyzer if one was specified and, if not, the default analyzer you specified for the PerFieldAnalyzerWrapper.
Whether using a single analyzer or using multiple via PerFieldAnalyzerWrapper, you only need one QueryParser and you will pass that parser either the one analyzer or the PerFieldAnalyzerWrapper which is an analyzer that wraps several analyzers.
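The lookup rule described above (field-specific analyzer if present, otherwise the default) can be sketched in plain Java. This is only a model of the idea, not Lucene's PerFieldAnalyzerWrapper: the "analyzers" are simple functions, and the field names "filename" and "contents" are assumed for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Model of per-field analysis: one default analyzer plus per-field overrides.
final class PerFieldAnalysisModel {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField;

    PerFieldAnalysisModel(Function<String, List<String>> defaultAnalyzer,
                          Map<String, Function<String, List<String>>> perField) {
        this.defaultAnalyzer = defaultAnalyzer;
        this.perField = perField;
    }

    // Use the field-specific analyzer if one was registered, else the default.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Stand-in for a tokenizing analyzer: lowercase and split on non-word characters.
    static List<String> lowercaseWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    // Stand-in for keyword-style handling: the whole input is a single token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    static PerFieldAnalysisModel sample() {
        Map<String, Function<String, List<String>>> overrides = new HashMap<>();
        overrides.put("filename", PerFieldAnalysisModel::keyword);
        return new PerFieldAnalysisModel(PerFieldAnalysisModel::lowercaseWords, overrides);
    }

    public static void main(String[] args) {
        PerFieldAnalysisModel model = sample();
        System.out.println(model.analyze("filename", "My Notes.txt")); // [My Notes.txt]
        System.out.println(model.analyze("contents", "Hello, World")); // [hello, world]
    }
}
```

The untokenized field keeps the whole string as one term, which is why an exact filename match has to supply the entire name, whitespace included.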
The fact that some of your fields are stored and some are not stored has no impact on searching them. The only thing that matters for the search is that the field is indexed, and both StringFields and TextFields are always indexed.
You mention the following:
And I'm using the KeywordAnalyzer to search over the filename Field, which, to reiterate, is stored, so not analyzed.
Whether a field is stored or not has nothing to do with whether it's analyzed. For the filename field your code is using a StringField with Field.Store.YES. Because it's a StringField it will be indexed BUT not analyzed, and because you specified to store the field it will be stored. So since the field is NOT analyzed, it won't be using the KeywordAnalyzer or any other analyzer :-)
Is there a way to search over tokenized and untokenized Fields concurrently?
The real issue here isn't about searching tokenized and untokenized fields concurrently; it's really just about searching multiple fields concurrently. The fact that one is tokenized and one is not is of no consequence for Lucene. To search multiple fields at once you can use a BooleanQuery: you add multiple queries to it, one for each field, and specify an AND (i.e. MUST) or an OR (i.e. SHOULD) relationship between the subqueries.
I hope this helps clear things up for you.

Subtle difference when searching multi value fields in Solr

I have a very simple question but I don't understand exactly why it happens and what the difference is.
Take a simple Solr search on a multi value field:
field_name:ABC AND DEF
field_name:(ABC AND DEF)
They return quite different results. I understand the brackets are for grouping but I don't understand the difference. It seems quite subtle.
Many thanks.

The first query isn't doing what you think it's doing.
field_name:ABC AND DEF
This is parsed as:
field_name:ABC AND <default search field>:DEF
This is different from your second example, which is parsed as:
field_name:ABC AND field_name:DEF
In the first example the second part of your query is made against whatever field is defined as the default search field in your index (or in the query itself, if you've set df).
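The parsing rule can be mimicked with a few lines of plain Java. This is a deliberately tiny model that only understands whitespace-separated terms, AND/OR, and a single field:( ... ) group; it is not Solr's parser, and the default field name "text" is an assumption.

```java
// Toy model of how unqualified terms pick up a field; not Solr's real parser.
final class DefaultFieldModel {

    static String qualify(String query, String defaultField) {
        StringBuilder out = new StringBuilder();
        String groupField = null; // field of the currently open field:( ... ) group
        for (String token : query.trim().split("\\s+")) {
            String field = groupField != null ? groupField : defaultField;
            String term = token;
            int colon = token.indexOf(':');
            if (colon > 0) {
                field = token.substring(0, colon);
                term = token.substring(colon + 1);
                if (term.startsWith("(")) {
                    groupField = field; // following terms inherit this field
                    term = term.substring(1);
                }
            }
            if (term.endsWith(")")) {
                term = term.substring(0, term.length() - 1);
                groupField = null; // group closed
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            if (term.equals("AND") || term.equals("OR")) {
                out.append(term);
            } else {
                out.append(field).append(':').append(term);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(qualify("field_name:ABC AND DEF", "text"));
        // field_name:ABC AND text:DEF
        System.out.println(qualify("field_name:(ABC AND DEF)", "text"));
        // field_name:ABC AND field_name:DEF
    }
}
```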

JPQL createQuery vs Entity object loop

I am working on some inherited code and I am not used to the entity framework. I'm trying to figure out why a previous programmer coded things the way they did, sometimes mixing and matching different ways of querying data.
Deal d = _em.find(Deal.class, dealid);
List<DealOptions> dos = d.getDealOptions();
for(DealOptions o : dos) {
if(o.price == "100") {
//found the 1 item i wanted
}
}
And then sometimes i see this:
Query q = _em.createQuery("select count(o.id) from DealOptions o where o.price = 100 and o.deal.dealid = :dealid");
//set parameters, get results, then check the result and do whatever
I understand what both pieces of code do, and I understand that given a large dataset, the second way is more efficient. However, given only a few records, is there any reason not to do a query vs just letting the entity do the join and looping over your recordset?

Some reasons never to use the first approach regardless of the number of records:
It is more verbose
The intention is less clear, since there is more clutter
The performance is worse, probably starting to degrade with the very first entities
The performance of the first approach will degrade much more with each added entity than with the second approach
It is unexpected - most experienced developers would not do it - so it needs more cognitive effort for other developers to understand. They would assume you were doing it for a compelling reason and would look for that reason without finding one.

Lucene Solr using complex filters

I am currently having a problem with specifying filters for Lucene/Solr. Every solution I come up with breaks other solutions. Let me start with an example. Assume that we have the following 5 documents:
doc1 = [type:Car, sold:false, owner:John]
doc2 = [type:Bike, productID:1, owner:Brian]
doc3 = [type:Car, sold:true, owner:Mike]
doc4 = [type:Bike, productID:2, owner:Josh]
doc5 = [type:Car, sold:false, owner:John]
So I need to construct the following filter queries:
Give me all documents of type:Car that have sold:false only, and if a document is of a type other than Car, include it in the result. So basically I want docs 1, 2, 4, 5; the only document I don't want is doc3 because it has sold:true. To put it more precisely:
for each document d in solr/lucene
if d.type == Car {
if d.sold == false, then add to result
else ignore
}
else {
add to result
}
return result
Filter in all documents that are of (type:Car and sold:false) or (type:Bike and productID:1). So for this I will get 1,2,5.
Get all documents that if the type:Car then get only with sold:false, otherwise get me documents from owners John, Brian, Josh. So for this query I should get 1, 2, 4, 5.
Note: You don't know all the types in the documents. Here it is obvious because of the small number of documents.
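As a reference for what each filter is expected to return, the three rules above can be written as plain Java predicates over the five sample documents. This does not involve Solr at all; the Doc class and the names FILTER_1 to FILTER_3 are made up for this sketch, and a missing field is modelled as null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Reference implementation of the three filters over the sample documents.
final class FilterReference {

    static final class Doc {
        final int id;
        final String type;
        final Boolean sold;      // null when the document has no "sold" field
        final Integer productId; // null when the document has no "productID" field
        final String owner;

        Doc(int id, String type, Boolean sold, Integer productId, String owner) {
            this.id = id;
            this.type = type;
            this.sold = sold;
            this.productId = productId;
            this.owner = owner;
        }
    }

    static final Set<String> OWNERS = new HashSet<>(Arrays.asList("John", "Brian", "Josh"));

    // 1: cars only if unsold; every other type passes.
    static final Predicate<Doc> FILTER_1 =
            d -> !"Car".equals(d.type) || Boolean.FALSE.equals(d.sold);

    // 2: (Car and unsold) or (Bike and productID 1).
    static final Predicate<Doc> FILTER_2 =
            d -> ("Car".equals(d.type) && Boolean.FALSE.equals(d.sold))
                    || ("Bike".equals(d.type) && Integer.valueOf(1).equals(d.productId));

    // 3: cars only if unsold; other types only for the listed owners.
    static final Predicate<Doc> FILTER_3 =
            d -> "Car".equals(d.type) ? Boolean.FALSE.equals(d.sold) : OWNERS.contains(d.owner);

    static List<Doc> sampleDocs() {
        return Arrays.asList(
                new Doc(1, "Car", false, null, "John"),
                new Doc(2, "Bike", null, 1, "Brian"),
                new Doc(3, "Car", true, null, "Mike"),
                new Doc(4, "Bike", null, 2, "Josh"),
                new Doc(5, "Car", false, null, "John"));
    }

    static List<Integer> matchingIds(Predicate<Doc> filter) {
        List<Integer> ids = new ArrayList<>();
        for (Doc d : sampleDocs()) {
            if (filter.test(d)) {
                ids.add(d.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(matchingIds(FILTER_1)); // [1, 2, 4, 5]
        System.out.println(matchingIds(FILTER_2)); // [1, 2, 5]
        System.out.println(matchingIds(FILTER_3)); // [1, 2, 4, 5]
    }
}
```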
So my solutions were:
(-type:Car) OR ((type:Car) AND (sold:false)). This works fine and as expected.
((-type:Car) OR ((type:Car) AND (sold:false))) AND ((-type:Bike) OR ((type:Bike) AND (productID:1))). This solution does not work.
((owner:John) OR (owner:Brian) OR (owner:Josh)) AND ((-type:Car) OR ((type:Car) AND (sold:false))). This does not work, but I can make it work if I do this: ((owner:John) OR (owner:Brian) OR (owner:Josh)) AND ((version:* OR (-type:Car)) OR ((type:Car) AND (sold:false))). I don't understand how this works, because logically it should work, but Solr/Lucene somehow does something.

Okay, to get anything but a sold car, you could use -(type:Car sold:true).
This can be incorporated into the other queries, but you'll need to be careful with lonely negative queries like this. Lucene doesn't handle them well, generally speaking, and Solr has some odd gotchas as well. Particularly, A -B reads more like "get all A but forbid B" rather than "get all A and anything but B". Similar problem with A or -B, see this question for more.
To get around that, you'll need to surround the negative with an extra set of parentheses, to ensure it is understood by Solr to be a standalone negative query, like: (-(type:Car AND sold:true))
So:
-(type:Car AND sold:true) (This doesn't get the result you stated, but as per my comment, I don't really understand your stated results)
(type:Bike AND productID:1) (-(type:Car AND sold:true)) (You actually wrote this in the description of the problem!)
(-(type:Car AND sold:false)) owner:(John Brian Josh)
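The trouble with lonely negative clauses can be pictured with a small set-based model. It assumes, as a simplification, that a boolean clause first collects the documents matched by its positive sub-clauses and then removes those matched by its negative ones, so a clause with nothing positive starts from an empty set. That would also explain why adding a match-everything term such as version:* next to -type:Car, as in the question, makes the nested clause behave. The names below are illustrative and nothing here calls Lucene or Solr.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Set-based model of a boolean clause: union of positives minus negatives.
final class PureNegativeModel {

    static Set<Integer> evaluate(List<Set<Integer>> positive, List<Set<Integer>> negative) {
        Set<Integer> result = new TreeSet<>();
        for (Set<Integer> matches : positive) {
            result.addAll(matches);
        }
        for (Set<Integer> matches : negative) {
            result.removeAll(matches);
        }
        return result;
    }

    static Set<Integer> ids(Integer... values) {
        return new TreeSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        Set<Integer> allDocs = ids(1, 2, 3, 4, 5);
        Set<Integer> cars = ids(1, 3, 5);
        List<Set<Integer>> none = new ArrayList<>();

        // "-type:Car" on its own: nothing to subtract from.
        System.out.println(evaluate(none, Collections.singletonList(cars))); // []
        // "version:* -type:Car": subtract the cars from everything.
        System.out.println(evaluate(Collections.singletonList(allDocs),
                Collections.singletonList(cars))); // [2, 4]
    }
}
```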

My advice is to use programmatic Lucene (that is, directly in Java using the Java Lucene API) rather than issuing text queries which will be interpreted. This will give you much more fine-grained control.
What you're going to want to do is construct a Lucene Filter Object using the QueryWrapperFilter API. A QueryWrapperFilter is a filter which takes a Lucene Query, and filters out any documents which do not match that query.
In order to use QueryWrapperFilter, you'll need to construct a Query which matches the terms you're interested in. The best way to do this is to use TermQuery:
TermQuery tq = new TermQuery(new Term("fieldname", "value"));
As you might have guessed, you'll want to replace "fieldname" with the name of a field, and "value" with a desired value. For example, from your example in the OP, you might want to do something like new Term("type", "Car").
This only matches a single term. You're going to need multiple TermQueries, and a way to combine them to create a single, larger query. The best way to do this is with BooleanQuery:
BooleanQuery bq = new BooleanQuery();
bq.add(tq, BooleanClause.Occur.MUST);
You can call bq.add as many times as you want - once for each TermQuery that you have. The second argument specifies how strict the query is. It can specify that a sub-query MUST appear, SHOULD appear, or MUST_NOT appear (these are the three values of the BooleanClause.Occur enum).
After you've added each of the sub-queries, this BooleanQuery represents the full query which will match only the documents you ask for. However, it's still not a filter. We now need to feed it to QueryWrapperFilter, which will give us back a filter object:
QueryWrapperFilter qwf = new QueryWrapperFilter(bq);
That should do it. Then if you want to run queries over only the documents allowed through by that filter, you just take your new query (call it q) and your filter, and create a FilteredQuery:
FilteredQuery fq = new FilteredQuery(q, qwf);

Fuzzy Queries in Lucene

I am using Lucene in Java and indexing a table in our database based on company name. After indexing I wish to do a fuzzy match (Levenshtein distance) on a value we wish to insert into the database. The reason is that we do not want to enter duplicates because of spelling errors.
For example if I have the company name "Widget Makers XYZ" I don't want to insert "Widget Maker XYZ".
From what I've read, Lucene's fuzzy match algorithm should give me a number between 0 and 1. I want to do some testing and then determine an adequate threshold for deciding what is valid or invalid.
The problem is I am stuck, and after searching what seems like everywhere on the internet, need the StackOverflow community's help.
Like I said I have indexed the database on company name, and then have the following code:
IndexSearcher searcher = new IndexSearcher(directory);
new QueryParser(Version.LUCENE_30, "company", analyzer);
Query fuzzy_query = new FuzzyQuery(new Term("company", "Center"));
I encounter the problem afterwards, basically I do not know how to get the fuzzy match value. I know the code must look something like the following, however no collectors seem to fit my needs. (As you can see right now I am only able to count the number of matches, which is useless to me)
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(fuzzy_query, collector);
System.out.println("\ncollector.getTotalHits() = " + collector.getTotalHits());
Also I am unable to use the ComplexPhraseQueryParser class which is shown in the Lucene documentation. I am doing:
import org.apache.lucene.queryParser.*;
Does anybody have an idea as to why its inaccessible or what I am doing wrong? Apologies for the length of the question.

You do not need Lucene to get the score. Take a look at the SimMetrics library; it is exceedingly simple to use. Just add the jar and use it thus:
Levenshtein ld = new Levenshtein();
float sim = ld.getSimilarity(string1, string2);
Also do note, depending on the type of data (i.e. longer strings, # whitespaces etc.), you might want to look at other algorithms such as Jaro-Winkler, Smith-Waterman etc.
You could use the above to decide whether to collapse fuzzy duplicate strings into one "master" string and then index.
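For readers who would rather not add a dependency, a similarity score of this kind can be computed in a few lines of plain Java. The sketch below uses the standard dynamic-programming edit distance and normalizes it by the longer string's length; that normalization is a common convention assumed here, not necessarily the exact formula used by any particular library or by Lucene's FuzzyQuery.

```java
// Levenshtein edit distance and a 0..1 similarity derived from it.
final class LevenshteinSimilarity {

    // Minimum number of single-character insertions, deletions and substitutions.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1.0 means identical strings; values near 0 mean almost nothing in common.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        String existing = "Widget Makers XYZ";
        String candidate = "Widget Maker XYZ";
        System.out.println(distance(existing, candidate));   // 1
        System.out.println(similarity(existing, candidate)); // about 0.94
    }
}
```

A candidate name could then be rejected as a likely duplicate when its similarity to any existing name exceeds a threshold chosen through testing.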

You can get the match values with:
TopDocs topDocs = collector.topDocs();
for(ScoreDoc scoreDoc : topDocs.scoreDocs) {
System.out.println(scoreDoc.score);
}
