Subtle difference when searching multi value fields in Solr

Subtle difference when searching multi value fields in Solr - java

I have a very simple question but I don't understand exactly why it happens and what the difference is.
Take a simple Solr search on a multi value field:
field_name:ABC AND DEF
field_name:(ABC AND DEF)
They return quite different results. I understand the brackets are for grouping but I don't understand the difference. It seems quite subtle.
Many thanks.

The first query isn't doing what you think it's doing.
field_name:ABC AND DEF
This is parsed as:
field_name:ABC AND <default search field>:DEF
This is different from your second example, which is parsed as:
field_name:ABC AND field_name:DEF
In the first example the second part of your query is made against whatever field is defined as the default search field in your index (or in the query itself, if you've set df).

Related

Retrieving the "match of value" in an ADD Diff computed by EMF Compare

I am fairly new to EMF and have recently started using EMF Compare to compute the differences between two models. For the moment, those differences are simply printed to the console, and I try to retrieve all the useful information from them.
When I print an ADD Diff corresponding for example to the add of an eAttribute, it looks like this:
UNRESOLVED LEFT ADD org.eclipse.emf.compare.internal.spec.ReferenceChangeSpec{
reference=EClass.eStructuralFeatures,
value=EAttribute#7e8dcdaanom,
parentMatch=org.eclipse.emf.compare.internal.spec.MatchSpec{
left=EClass#5cdd09b1SystemOfAirport,
right=EClass#8c11eeeSystemOfAirport,
origin=<null>,
#differences=2,
#submatches=5
},
match of value=org.eclipse.emf.compare.internal.spec.MatchSpec{
left=EAttribute#7e8dcdaanom,
right=<null>,
origin=<null>,
#differences=1,
#submatches=0
}
}
What I would like to retrieve is the MatchSpec corresponding to the match of value attribute of my ReferenceChangeSpec. However, I can't seem to find the corresponding getter in the ReferenceChangeSpec documentation. I have tried looking into the GitHub code for Diffs and especially Diff.toString(), but it hasn't brought me any further, that's why I am asking for your help.

After asking this question on the EMF Compare forum, I have been able to find the solution to my problem.
The trick is to use Comparison.getMatch(EObject). So for a Difference d of kind ADD you want to retrieve the match of value:
Match matchOfValue = comparison.getMatch(((ReferenceChangeSpec) difference).getValue());

Lucene Solr using complex filters

I am currently having a problem with specifying filters for Lucene/Solr. Every solution I come up with breaks other solutions. Let me start with an example. Assume that we have the following 5 documents:
doc1 = [type:Car, sold:false, owner:John]
doc2 = [type:Bike, productID:1, owner:Brian]
doc3 = [type:Car, sold:true, owner:Mike]
doc4 = [type:Bike, productID:2, owner:Josh]
doc5 = [type:Car, sold:false, owner:John]
So I need to construct the following filter queries:
Give me all documents of type:Car which has sold:false only and if it is a type that is different that Car, include in the result. So basically I want docs 1, 2, 4, 5 the only document I don't want is doc3 because it is has sold:true. To put it more precisely:
for each document d in solr/lucene
if d.type == Car {
if d.sold == false, then add to result
else ignore
}
else {
add to result
}
return result
Filter in all documents that are of (type:Car and sold:false) or (type:Bike and productID:1). So for this I will get 1,2,5.
Get all documents that if the type:Car then get only with sold:false, otherwise get me documents from owners John, Brian, Josh. So for this query I should get 1, 2, 4, 5.
Note: You don't know all the types in the documents. Here it is obvious because of the small number of documents.
So my solutions were:
(-type:Car) OR ((type:Car) AND (sold:false). This works fine and as expected.
((-type:Car) OR ((type:Car) AND (sold:false)) AND ((-type:Bike) OR ((type:Bike) AND (productID:1))). This solution does not work.
((owner:John) OR (owner:Brian) OR (owner:Josh)) AND ((-type:Car) OR ((type:Car) AND (sold:false)). This does not work, I can make it work if I do I do this: ((owner:John) OR (owner:Brian) OR (owner:Josh)) AND ((version:* OR (-type:Car)) OR ((type:Car) AND (sold:false)). I don't understand how this works, because logically it should work, but Solr/Lucene somehow does something.

Okay, to get anything but a sold car, you could use -(type:Car sold:true).
This can be incorporated into the other queries, but you'll need to be careful with lonely negative queries like this. Lucene doesn't handle them well, generally speaking, and Solr has some odd gotchas as well. Particularly, A -B reads more like "get all A but forbid B" rather than "get all A and anything but B". Similar problem with A or -B, see this question for more.
To get around that, you'll need to surround the negative with an extra set of parentheses, to ensure it is understood by Solr to be a standalone negative query, like: (-(type:Car AND sold:true))
So:
-(type:Car AND sold:true) (This doesn't get the result you stated, but as per my comment, I don't really understand your stated results)
(type:Bike AND productID:1) (-(type:Car AND sold:true)) (You actually wrote this in the description of the problem!)
(-(type:Car AND sold:false)) owner:(John Brian Josh)

My advice is to use programmatic Lucene (that is, directly in Java using the Java Lucene API) rather than issuing text queries which will be interpreted. This will give you much more fine-grained control.
What you're going to want to do is construct a Lucene Filter Object using the QueryWrapperFilter API. A QueryWrapperFilter is a filter which takes a Lucene Query, and filters out any documents which do not match that query.
In order to use QueryWrapperFilter, you'll need to construct a Query which matches the terms you're interested in. The best way to do this is to use TermQuery:
TermQuery tq = new TermQuery(new Term("fieldname", "value"));
As you might have guessed, you'll want to replace "fieldname" with the name of a field, and "value" with a desired value. For example, from your example in the OP, you might want to do something like new Term("type", "Car").
This only matches a single term. You're going to need multiple TermQueries, and a way to combine them to create a single, larger query. The best way to do this is with BooleanQuery:
BooleanQuery bq = new BooleanQuery();
bq.add(tq, BooleanQuery.Occur.MUST);
You can call bq.add as many times as you want - once for each TermQuery that you have. The second argument specifies how strict the query is. It can specify that a sub-query MUST appear, SHOULD appear, or should NOT appear (these are the three values of the BooleanQuery.Occur enum).
After you've added each of the sub-queries, this BooleanQuery represents the full query which will match only the documents you ask for. However, it's still not a filter. We now need to feed it to QueryWrapperFilter, which will give us back a filter object:
QueryWrapperFilter qwf = new QueryWrapperFilter(bq);
That should do it. Then if you want to run queries over only the documents allowed through by that filter, you just take your new query (call it q) and your filter, and create a FilteredQuery:
FilteredQuery fq = new FilteredQuery(q, qwf);

Lucene: Searching multiple fields with default operator = AND

To allow users to search across multiple fields with Lucene 3.5 I currently create and add a QueryParser for each field to be searched to a DisjunctionMaxQuery. This works great when using OR as the default operator but I now want to change the default operator to AND to get more accurate (and fewer) results.
Problem is, queryParser.setDefaultOperator(QueryParser.AND_OPERATOR) misses many documents since all terms must be in atleast 1 field.
For example, consider the following data for a document: title field = "Programming Languages", body field = "Java, C++, PHP". If a user were to search for Java Programming this particular document would not be included in the results since the title nor the body field contains all terms in the query although combined they do. I would want this document returned for the above query but not for the query HTML Programming.
I've considered a catchall field but I have a few problems with it. First, users frequently include per field terms in their queries (author:bill) which is not possible with a catchall field. Also, I highlight certain fields with FastVectorHighlighter which requires them to be indexed and stored. So by adding a catchall field I would have to index most of the same data twice which is time and space consuming.
Any ideas?

Guess I should have done a little more research. Turns out MultiFieldQueryParser provides the exact functionality I was looking for. For whatever reason I was creating a QueryParser for each field I wanted to search like this:
String[] fields = {"title", "body", "subject", "author"};
QueryParser[] parsers = new QueryParser[fields.length];
for(int i = 0; i < parsers.length; i++)
{
parsers[i] = new QueryParser(Version.LUCENE_35, fields[i], analyzer);
parsers[i].setDefaultOperator(QueryParser.AND_OPERATOR);
}
This would result in a query like this:
(+title:java +title:programming) | (+body:java +body:programming)
...which is not what I was looking. Now I create a single MultiFieldQueryParser like this:
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_35, new String[]{"title", "body", "subject"}, analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
This gives me the query I was looking for:
+(title:java body:java) +(title:programming body:programming)
Thanks to #seeta and #femtoRgon for the help!

Perhaps what you need is a combination of Boolean queries that capture the different combinations of fields and terms. In your given example, the query could be -
(title:Java AND body:programming) OR (title:programming AND body:Java).
I don't know if there's an existing Query class that generates this automatically for you, but I think that's what should be the ultimate query that's run on the index.

You want to be able to search multiple fields with the same set of terms, then the question from your comment:
((title:java title:programming) | (body:java body:programming))~0.2
May not be the best implementation.
You're effectively getting either the score from the title, or the score from the body for the combined set of terms. The case where you hit java in the title and programming in the body would be given approx. equal weight to a hit on java in the body and no hit on programming.
I think a better structured query would be:
(title:java body:java)~0.2 (title:programming body:programming)~0.2
This makes more sense to me, since you want the dismax queries to limit score growing on multiple queries of the same term (in different fields), but you do want scoring to grow for hits on different terms, I believe.
If that sort of query structure gets you better score results, limiting results to a certain minimum score (a percentage of the max score returned, rather than a simple hard-coded value) may be adequate to prevent too-weak results from being seen.
I also still wouldn't count out indexing an all field. It's an implementation I've used before, while indexing BOTH the specific field and the catchall field, thus allowing both general querying and specific single-field queries. Index storage tends to be pretty lean for unstored terms, and it will generally help performance, if you find yourself having to create big, complicated queries to make up for not having it.
If you really want to be sure that it takes minimal storage, you can even turn off TermVectors for that field:
new Field(name, value, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO);
Although I don't know how much of a difference that would really make.

Which is the best choice to indexing a Boolean value in lucene?

Indexing a Boolean value(true/false) in lucene(not need to store)
I want to get more disk space usage and higher search performance
doc.add(new Field("boolean","true",Field.Store.NO,Field.Index.NOT_ANALYZED_NO_NORMS));
//or
doc.add(new Field("boolean","1",Field.Store.NO,Field.Index.NOT_ANALYZED_NO_NORMS));
//or
doc.add(new NumericField("boolean",Integer.MAX_VALUE,Field.Store.NO,true).setIntValue(1));
Which should I choose? Or any other better way?
thanks a lot

An interesting question!
I don't think the third option (NumericField) is a good choice for a boolean field. I can't think of any use case for this.
The Lucene search index (leaving to one side stored data, which you aren't using anyway) is stored as an inverted index
Leaving your first and second options as (theoretically) identical
If I was faced with this, I think I would choose option one ("true" and "false" terms), if it influences the final decision.
Your choice of NOT_ANALYZED_NO_NORMS looks good, I think.

Lucene jumps through an elaborate set of hoops to make NumericField searchable by NumericRangeQuery, so definitely avoid it an all cases where your values don't represent quantities. For example, even if you index an integer, but only as a unique ID, you would still want to use a plain String field. Using "true"/"false" is the most natural way to index a boolean, while using "1"/"0" gives just a slight advantage by avoiding the possibility of case mismatch or typo. I'd say this advantage is not worth much and go for true/false.

Use Solr (a flavour of lucene) - it indexes all basic java types natively.
I've used it and it rocks.

Lucene TermFrequenciesVector

what do I obtain if I call IndexReader.getTermFrequenciesVector(...) on an index created with TermVector.YES option?

The documentation already answers this, as Xodorap notes in a comment.
The TermFreqVector object returned can retrieve which terms (words produced by your analyzer) a field contains and how many times each of those terms exists within that field.
You can cast the returned TermFreqVector to the interface TermPositionVector if you index the field using TermVector.WITH_OFFSETS, TermVector.WITH_POSITIONS or TermVector.WITH_POSITIONS_OFFSETS. This gives you access to GetTermPositions with allow you to check where in the field the term exists, and GetOffsets which allows you to check where in the original content the term originated from. The later allows, combined with Store.YES, highlighting of matching terms in a search query.
There are different contributed highlighters available under Contrib area found at the Lucene homepage.

Or you can implement proximity or first occurrence type score contributions. Which highlighting won't help you with at all.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.