Scoring difference between seemingly equivalent Solr queries

As I understand Solr's scoring function, the following two queries should be equivalent.
Namely, score(q1, d) = score(q2, d) for each document d in the corpus.
Query 1: evolution OR selection OR germline OR dna OR rna OR mitochondria
Query 2: (evolution OR selection OR germline) OR (dna OR rna OR mitochondria)
The queries are obviously logically equivalent (they both return the same set of documents). Also, both queries consist of the same 6 terms, and each term has a boost of 1 in both queries. Hence each term is supposed to have the same contribution to the total score (same TF, same IDF, same boost).
In spite of that, the queries don't give the same scores.
In general, a conjunction of terms (a OR b OR c OR d) is not the same as a conjunction of queries ((a OR b) OR (c OR d)). What is the semantic difference between the two types of queries? What is causing them to result in different scorings?
The reason I'm asking is that I'm building a custom request handler in which I construct the second type of query (conjunction of queries) while I might actually need to construct the first type of query (conjunction of terms). In other words, this is what I'm doing:
Query q1 = ... //conjunction of terms evolution, selection, germline
Query q2 = ... //conjunction of terms dna, rna, mitochondria
BooleanQuery conjunctionOfQueries = new BooleanQuery();
conjunctionOfQueries.add(q1, BooleanClause.Occur.SHOULD);
conjunctionOfQueries.add(q2, BooleanClause.Occur.SHOULD);
while maybe I should actually do:
List<String> terms = ... //extract all 6 terms from q1 and q2
List<TermQuery> termQueries = ... //create a new TermQuery from each term in terms
BooleanQuery conjunctionOfTerms = new BooleanQuery();
for (TermQuery t : termQueries) {
    conjunctionOfTerms.add(t, BooleanClause.Occur.SHOULD);
}

I've followed femtoRgon's advice to check the debug element of the score calculation. What I've found is that the calculations are indeed mathematically equivalent. The only difference is that in the conjunction-of-queries calculation we store intermediate results. More precisely, we store the contribution to the sum of each sub-query in a variable. Apparently, stopping in order to store intermediate results has an effect of accumulating a numerical error: Each time we store the intermediate result we're losing some accuracy. Since the actual queries in the application are quite big (not like the trivial example query), there's plenty of accuracy to be lost, and the accumulated error sometimes even changes the ranking order of the returned documents.
So the conjunction-of-terms query is expected to give a slightly better ranking than the conjunction-of-queries query, because the conjunction-of-queries query accumulates a greater numerical error.
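The effect described above can be reproduced outside Solr. The following is a minimal sketch, not Lucene's scoring code: the class name and the input values are made up, and the values are deliberately extreme so that a single rounding step becomes visible. It only shows that adding the same float terms in a different grouping can give a different total; which grouping lands closer to the exact value depends on the data.

```java
// Toy illustration only: the values are chosen to exaggerate the rounding effect.
class FloatSumOrder {
    // Add every term into one running float total, like a flat list of clauses.
    static float flatSum(float[] terms) {
        float total = 0f;
        for (float t : terms) {
            total += t;
        }
        return total;
    }

    // Sum each group into its own float first, then add the group totals,
    // like a query whose clauses are themselves sub-queries.
    static float groupedSum(float[][] groups) {
        float total = 0f;
        for (float[] group : groups) {
            total += flatSum(group);
        }
        return total;
    }

    public static void main(String[] args) {
        float[] flat = {16777216f, 1f, 1f};          // 16777216 = 2^24
        float[][] grouped = {{16777216f}, {1f, 1f}}; // same terms, nested
        System.out.println("flat    = " + flatSum(flat));
        System.out.println("grouped = " + groupedSum(grouped));
    }
}
```

In the flat sum each 1f is lost when it is added to 2^24, because a float cannot represent 2^24 + 1; in the grouped sum the two small terms are combined first and survive as 2.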

Related

Relation between TopDocs.totalHits and parameter 'n' of Indexsearcher.search

I would like to find the total number of hits for a query using a Lucene index (version 4.3.1).
I understand that I have to use one of the search methods of https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/IndexSearcher.html#search(org.apache.lucene.search.Query,%20int)
public TopDocs search(Query query,
int n) - Finds the top n hits for query.
In the TopDocs, I can see a totalHits field
https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/TopDocs.html#totalHits
But I am not able to understand the impact of the parameter 'n' of search() on TopDocs.totalHits.
For example: if I set n = 1000, will TopDocs.totalHits always be <= n?
In one of my runs I passed n = 1, but in that search TopDocs.totalHits was 29.
Can somebody please shed some light on this?
If I set n = 1000, will TopDocs.totalHits always be <= n?
No. With "n" you define how many results you want returned (the length of TopDocs.scoreDocs is at most n). TopDocs.totalHits reflects the number of documents that actually matched the query, so it can be larger than n, as in your run with n = 1 and totalHits = 29.
Usually it's not useful to retrieve all matching documents, because that may lead to performance issues. In addition, users are often not interested in all results; that's where paging or filtering comes in.
If you want to process all results, you need to work with a Collector and this search method:
public void search(Query query, Collector results)
Depending on your Collector, you can get all search results, the number of hits, or the scores of those hits.
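To make the relation between n and the hit count concrete, here is a toy model in plain Java. It is not Lucene code and does not use Lucene's classes; the class name TopNWithTotal and its fields are made up for illustration. It keeps only the n best scores while still counting every matching document, which is why a search with n = 1 can report 29 total hits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Toy model (not Lucene code): keeps the n best scores while counting every match.
class TopNWithTotal {
    final int totalHits;          // number of matching documents, independent of n
    final List<Float> topScores;  // at most n scores, best first

    TopNWithTotal(float[] matchingScores, int n) {
        // Min-heap: the weakest retained score sits on top and is evicted first.
        PriorityQueue<Float> best = new PriorityQueue<>();
        int seen = 0;
        for (float score : matchingScores) {
            seen++;                 // every match is counted...
            best.add(score);
            if (best.size() > n) {
                best.poll();        // ...but only n of them are retained
            }
        }
        totalHits = seen;
        topScores = new ArrayList<>(best);
        topScores.sort(Collections.reverseOrder());
    }

    public static void main(String[] args) {
        float[] scores = new float[29];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = i + 1;
        }
        TopNWithTotal result = new TopNWithTotal(scores, 1);
        System.out.println(result.totalHits + " total hits, "
                + result.topScores.size() + " returned");
    }
}
```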

ElasticSearch query a specific term, not other terms

When I query for a term (standard-analyzer), I get a list of results sorted on score. Which is good. But when calling:
QueryBuilders.termQuery(fieldname, word);
I get a mixture of:
word
some word
WORD
word and such
These come in no particular order, since they all score the same because they all contain word. Since the number of results varies between 0 and roughly 1M, I need the most exact matches first (or the others filtered out).
I tried adding a filter based on the ES regex filter, but it looks like these are not being processed:
FilterBuilders.regexQuery(fieldname, "~"+word).flag(RegexpFlag.ALL);
FilterBuilders.regexQuery(fieldname, "^((?!" + word + ").)*$").flag(RegexpFlag.ALL); // and this
FilterBuilders.regexQuery(fieldname, "^\\(\\(\\?!" + word + "\\)\\.\\)*$").flag(RegexpFlag.ALL); // or
I've also tried QueryBuilders.boostingQuery, which I also failed to get working. Besides, I came across some comments saying that negative querying does not work.
So basically, I'm looking for a query that matches a particular term while filtering out (or negatively boosting) the results that contain other words.
If possible I'd like to stay away from scripting for now (bad experiences).
So the query is: must/should not contain a word different from word.
In fact, the easiest set of queries is:
final int fetchAmount = 100; // number of items to return
final FilterBuilder filterBuilder = FilterBuilders.termFilter(fieldname, word);
final QueryBuilder combinedQuery = QueryBuilders.termQuery(fieldname, word);
final QueryBuilder queryBuilder = QueryBuilders.filteredQuery(combinedQuery, filterBuilder);
final SearchResponse builder = CLIENT.prepareSearch(index_name).setQuery(queryBuilder).setExplain(true)
.setTypes(type_name).setSize(fetchAmount).setSearchType(SearchType.QUERY_THEN_FETCH).execute().actionGet();
The FilterBuilder cheaply discards the values that don't contain word. Using the same query (TermQuery) for the QueryBuilder is what produces scores. Take the score (SearchHit.score()) of the first hit, then keep reading hits until you find one whose score < firstScore.
The problem described in the question occurs when QueryBuilders.matchAllQuery() is used for the QueryBuilder instead of a TermQuery. The same set of results is returned in that case, but no scoring (hence no sorting) is applied.
Keep setSize relatively low for speed. If the last item of a page is still of interest, call the above query again, adding setFrom(fetchAmount) so that the second query starts where the first one stopped, like:
final int xthQueryCalledTime = 1; // if using a loop
final SearchResponse builder = CLIENT.prepareSearch(index_name).setQuery(queryBuilder).setExplain(true)
.setTypes(type_name).setSize(fetchAmount).setSearchType(SearchType.QUERY_THEN_FETCH).setFrom(fetchAmount * xthQueryCalledTime).execute().actionGet();
Repeat until done.
P.S. Don't use scroll! It will mix up the score ordering. From the JavaDoc on SearchType.SCAN:
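The stopping rule can be sketched in plain Java as follows. This is only an illustration of the loop, not Elasticsearch client code: fetchPage is a hypothetical stand-in for the prepareSearch(...).setFrom(from).setSize(size) call above, and it is assumed to return the hit scores of one page, best first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

// Sketch of the paging loop; fetchPage is a hypothetical stand-in for the search call.
class TopScorePager {
    // fetchPage.apply(from, size) returns the scores of one page, best first.
    static List<Float> collectTopScored(
            BiFunction<Integer, Integer, List<Float>> fetchPage, int pageSize) {
        List<Float> kept = new ArrayList<>();
        Float firstScore = null;
        for (int from = 0; ; from += pageSize) {
            List<Float> page = fetchPage.apply(from, pageSize);
            for (float score : page) {
                if (firstScore == null) {
                    firstScore = score;      // score of the very first hit
                }
                if (score < firstScore) {
                    return kept;             // first weaker hit: stop paging
                }
                kept.add(score);
            }
            if (page.size() < pageSize) {
                return kept;                 // short page: no more results
            }
        }
    }

    public static void main(String[] args) {
        // In-memory result list standing in for the index.
        List<Float> all = Arrays.asList(2f, 2f, 2f, 1f, 1f);
        List<Float> top = collectTopScored(
                (from, size) -> all.subList(Math.min(from, all.size()),
                                            Math.min(from + size, all.size())),
                2);
        System.out.println(top.size() + " hits share the top score");
    }
}
```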
Performs scanning of the results which executes the search without any sorting. It will automatically start scrolling the result set

JPQL createQuery vs Entity object loop

I am working on some inherited code and I am not used to the entity framework. I'm trying to figure out why a previous programmer coded things the way they did, sometimes mixing and matching different ways of querying data.
Deal d = _em.find(Deal.class, dealid);
List<DealOptions> dos = d.getDealOptions();
for (DealOptions o : dos) {
    if ("100".equals(o.price)) { // compare string content, not object references
        // found the 1 item I wanted
    }
}
And then sometimes I see this:
Query q = _em.createQuery("select count(o.id) from DealOptions o where o.price = 100 and o.deal.dealid = :dealid");
// set parameters, get results, then check the result and do whatever
I understand what both pieces of code do, and I understand that given a large dataset, the second way is more efficient. However, given only a few records, is there any reason not to do a query vs just letting the entity do the join and looping over your recordset?
Some reasons never to use the first approach regardless of the number of records:
It is more verbose
The intention is less clear, since there is more clutter
The performance is worse, probably starting to degrade with the very first entities
The performance of the first approach will degrade much more with each added entity than with the second approach
It is unexpected - most experienced developers would not do it - so it needs more cognitive effort for other developers to understand. They would assume you were doing it for a compelling reason and would look for that reason without finding one.

Lucene: Searching multiple fields with default operator = AND

To allow users to search across multiple fields with Lucene 3.5 I currently create and add a QueryParser for each field to be searched to a DisjunctionMaxQuery. This works great when using OR as the default operator but I now want to change the default operator to AND to get more accurate (and fewer) results.
Problem is, queryParser.setDefaultOperator(QueryParser.AND_OPERATOR) misses many documents, since all terms must be in at least 1 field.
For example, consider the following data for a document: title field = "Programming Languages", body field = "Java, C++, PHP". If a user were to search for Java Programming, this particular document would not be included in the results, since neither the title nor the body field contains all terms in the query, although combined they do. I would want this document returned for the above query but not for the query HTML Programming.
I've considered a catchall field but I have a few problems with it. First, users frequently include per field terms in their queries (author:bill) which is not possible with a catchall field. Also, I highlight certain fields with FastVectorHighlighter which requires them to be indexed and stored. So by adding a catchall field I would have to index most of the same data twice which is time and space consuming.
Any ideas?
Guess I should have done a little more research. Turns out MultiFieldQueryParser provides the exact functionality I was looking for. For whatever reason I was creating a QueryParser for each field I wanted to search like this:
String[] fields = {"title", "body", "subject", "author"};
QueryParser[] parsers = new QueryParser[fields.length];
for (int i = 0; i < parsers.length; i++) {
    parsers[i] = new QueryParser(Version.LUCENE_35, fields[i], analyzer);
    parsers[i].setDefaultOperator(QueryParser.AND_OPERATOR);
}
This would result in a query like this:
(+title:java +title:programming) | (+body:java +body:programming)
...which is not what I was looking for. Now I create a single MultiFieldQueryParser like this:
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_35, new String[]{"title", "body", "subject"}, analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
This gives me the query I was looking for:
+(title:java body:java) +(title:programming body:programming)
Thanks to @seeta and @femtoRgon for the help!
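The difference between the two query shapes can be checked with a toy matcher in plain Java. This is not Lucene code and ignores analysis and scoring; the class and method names are made up, and a document is modelled simply as a map from field name to its set of lowercase terms, using the example document from the question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy matcher (not Lucene code): a document maps each field name to its set of terms.
class MultiFieldMatch {
    // (+f1:t1 +f1:t2) | (+f2:t1 +f2:t2): some single field must contain every term.
    static boolean perFieldAnd(Map<String, Set<String>> doc, String... terms) {
        for (Set<String> fieldTerms : doc.values()) {
            if (fieldTerms.containsAll(Arrays.asList(terms))) {
                return true;
            }
        }
        return false;
    }

    // +(f1:t1 f2:t1) +(f1:t2 f2:t2): every term must appear in at least one field.
    static boolean perTermAnd(Map<String, Set<String>> doc, String... terms) {
        for (String term : terms) {
            boolean found = false;
            for (Set<String> fieldTerms : doc.values()) {
                if (fieldTerms.contains(term)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    // The example document from the question, already lowercased and tokenized.
    static Map<String, Set<String>> exampleDoc() {
        Map<String, Set<String>> doc = new LinkedHashMap<>();
        doc.put("title", new HashSet<>(Arrays.asList("programming", "languages")));
        doc.put("body", new HashSet<>(Arrays.asList("java", "c++", "php")));
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> doc = exampleDoc();
        System.out.println(perFieldAnd(doc, "java", "programming")); // false
        System.out.println(perTermAnd(doc, "java", "programming"));  // true
        System.out.println(perTermAnd(doc, "html", "programming"));  // false
    }
}
```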
Perhaps what you need is a combination of Boolean queries that capture the different combinations of fields and terms. In your given example, the query could be -
(title:Java AND body:programming) OR (title:programming AND body:Java).
I don't know if there's an existing Query class that generates this automatically for you, but I think that's what should be the ultimate query that's run on the index.
If you want to be able to search multiple fields with the same set of terms, then the query from your comment:
((title:java title:programming) | (body:java body:programming))~0.2
May not be the best implementation.
You're effectively getting either the score from the title, or the score from the body for the combined set of terms. The case where you hit java in the title and programming in the body would be given approx. equal weight to a hit on java in the body and no hit on programming.
I think a better structured query would be:
(title:java body:java)~0.2 (title:programming body:programming)~0.2
This makes more sense to me, since you want the dismax queries to limit score growing on multiple queries of the same term (in different fields), but you do want scoring to grow for hits on different terms, I believe.
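A small numeric sketch may help compare the two structures. It assumes the usual dismax rule (the best sub-score plus the tie-breaker times the sum of the other sub-scores) and uses made-up per-clause scores of 1.0 for every matching term; it is plain Java, not Lucene code, and it ignores coord and normalization factors.

```java
// Toy scoring sketch (not Lucene code); scores[field][term] holds made-up clause scores.
class DisMaxSketch {
    // Dismax: best sub-score plus tieBreaker times the sum of the remaining ones.
    static double disMax(double tieBreaker, double... scores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : scores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    // ((f1:t1 f1:t2) | (f2:t1 f2:t2))~tie: sum terms per field, then dismax over fields.
    static double disMaxOverFields(double tie, double[][] scores) {
        double[] perField = new double[scores.length];
        for (int f = 0; f < scores.length; f++) {
            for (double s : scores[f]) {
                perField[f] += s;
            }
        }
        return disMax(tie, perField);
    }

    // (f1:t1 | f2:t1)~tie (f1:t2 | f2:t2)~tie: dismax over fields per term, then sum.
    static double sumOfPerTermDisMax(double tie, double[][] scores) {
        double total = 0.0;
        for (int t = 0; t < scores[0].length; t++) {
            double[] perField = new double[scores.length];
            for (int f = 0; f < scores.length; f++) {
                perField[f] = scores[f][t];
            }
            total += disMax(tie, perField);
        }
        return total;
    }

    public static void main(String[] args) {
        // Rows: title, body. Columns: java, programming.
        double[][] bothTerms = {{0, 1}, {1, 0}}; // programming in title, java in body
        double[][] javaOnly = {{0, 0}, {1, 0}};  // java in body, programming nowhere
        System.out.println(disMaxOverFields(0.2, bothTerms) + " vs " + disMaxOverFields(0.2, javaOnly));
        System.out.println(sumOfPerTermDisMax(0.2, bothTerms) + " vs " + sumOfPerTermDisMax(0.2, javaOnly));
    }
}
```

With these toy numbers the first structure scores the two documents about 1.2 and 1.0, while the second scores them 2.0 and 1.0, so the document matching both terms stands out much more clearly.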
If that sort of query structure gets you better score results, limiting results to a certain minimum score (a percentage of the max score returned, rather than a simple hard-coded value) may be adequate to prevent too-weak results from being seen.
I also still wouldn't count out indexing an all field. It's an implementation I've used before, while indexing BOTH the specific field and the catchall field, thus allowing both general querying and specific single-field queries. Index storage tends to be pretty lean for unstored terms, and it will generally help performance, if you find yourself having to create big, complicated queries to make up for not having it.
If you really want to be sure that it takes minimal storage, you can even turn off TermVectors for that field:
new Field(name, value, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO);
Although I don't know how much of a difference that would really make.

what is the difference between - and NOT operator in Lucene?

In the query syntax of Lucene it is said the following:
The NOT operator excludes documents that contain the term after NOT.
...
The "-" or prohibit operator excludes documents that contain
the term after the "-" symbol
I think the difference is that the - operator can be used alone, which is not the case for NOT. Is that it?
There is a very subtle difference. Take a look at this long thread on "Getting a Better Understanding of Lucene's Search Operators" which should hopefully answer your question.
Quick answer:
There is no difference between the behavior of the - (prohibit) operator and the NOT operator. The documentation does not make this especially clear, I think.
NOT is a synonym for -, here.
This can be demonstrated with some tests, summarized below.
See also the extract at the end of this answer for a summary which does a great job of distilling various points about the Lucene classic query parser.
Probably the most important point to take away is that the AND, OR, and NOT operators are not the same as "traditional" boolean operators. They are subtly different. This is because Lucene's classic query parser is only partially reliant on boolean operations - specifically, whether a document should receive a score or not. Beyond that, operators can be used in distinctly "non-boolean" ways, to affect how documents are scored relative to each other.
This makes sense, given Lucene's purpose of showing results in order of relevance.
Test inputs:
I am using:
Lucene 8.9.0
the StandardAnalyzer
a TextField named "body"
the classic query parser
the default query parser operator (A B means "A or B")
the following 6 test documents:
apples
oranges
apples oranges
bananas
apples bananas
oranges bananas
See here for the official "classic query parser" syntax documentation.
First test case: A -B
My paraphrase: "documents which contain A but cannot contain B"
The following query strings...
apples -oranges
apples NOT oranges
apples OR -oranges
apples OR NOT oranges
...are all parsed to the same query, using org.apache.lucene.queryparser.classic.QueryParser. That query is:
body:apples -body:oranges
They therefore all generate the same hits:
doc = 0; score = 0.3648143
field = apples
doc = 4; score = 0.2772589
field = apples bananas
Second test case: -X
The following query strings...
-apples
NOT apples
-anything
NOT anything
...are all parsed to the following 2 queries:
-body:apples
-body:anything
These queries always generate no hits, regardless of the data in the source documents.
This may be counterintuitive - especially -anything.
In the first case, the single clause -body:apples forces all documents containing apples to be given a score of zero. But now there are no more clauses in the query - and therefore there is no additional information which can be used to calculate any scores for the remaining documents. They therefore all stay at their initial state of "unscored". Therefore, no documents can be returned.
In the second case -body:anything, the overall logic is the same. After removing all the documents containing anything from scoring consideration (even if that means removing no documents at all), there is still no more information in the query which can be used for scoring purposes.
Third test case: A AND -B
The following query strings...
apples AND -oranges
apples AND NOT oranges
...are both parsed to the same query:
+body:apples -body:oranges
This is very similar to the first test case - and actually returns the same hits with the same score. This specific case is not significant when investigating the differences between - and NOT, since it gives the same results as test case 1.
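The matching behaviour seen in the first two test cases can be mimicked with a toy model in plain Java. This is not Lucene code: it has no scoring and no required (+) clauses, and the class name is made up. It only encodes the rule that a document needs at least one matching optional clause to be selected, while prohibited clauses can only remove documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of optional and prohibited clauses (not Lucene code; no scoring).
class ClauseModel {
    // The six test documents, indexed by doc id.
    static final String[] DOCS = {
        "apples", "oranges", "apples oranges",
        "bananas", "apples bananas", "oranges bananas"
    };

    // A document is a hit if it matches at least one optional term
    // and none of the prohibited terms.
    static List<Integer> search(List<String> optional, List<String> prohibited) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < DOCS.length; id++) {
            List<String> words = Arrays.asList(DOCS[id].split(" "));
            boolean selected = false;
            for (String term : optional) {
                if (words.contains(term)) {
                    selected = true;
                }
            }
            boolean excluded = false;
            for (String term : prohibited) {
                if (words.contains(term)) {
                    excluded = true;
                }
            }
            if (selected && !excluded) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // body:apples -body:oranges
        System.out.println(search(Arrays.asList("apples"), Arrays.asList("oranges")));
        // -body:apples (nothing selects any document)
        System.out.println(search(Collections.<String>emptyList(), Arrays.asList("apples")));
    }
}
```

The first call returns doc ids 0 and 4, the same documents as in the first test case; the second returns an empty list, because a query made only of prohibited clauses has nothing that selects a document in the first place.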
Digression: A more interesting test case would be A B versus +A B, where there is a difference in results and scoring (+A forces A to be required). But that is outside the scope of this question.
More Background
Looking at the e-mail thread referred to in another answer, here is a copy of the most relevant section, reproduced here for reference:
begin copied section
In a nutshell...
Lucene's QueryParser class does not parse boolean expressions -- it
might look like it, but it does not.
Lucene's BooleanQuery clause does not model Boolean Queries ... it
models aggregate queries.
the most native way to represent the options available in a lucene
"BooleanQuery" as a string is with the +/- prefixes, where...
+foo ... means foo is a required clause and docs must match it
-foo ... means foo is prohibited clause and docs must not match it
foo ... means foo is an optional clause and docs that match it will
get score benefits for doing so.
in an attempt to make things easier for people who have
simple needs, QueryParser "fakes" that it parses boolean expressions
by interpreting A AND B as +A +B; A OR B as A B and NOT A as
-A
if you change the default operator on QueryParser to be AND then
things get more complicated, mainly because then QueryParser treats
A B the same as +A +B
you should avoid thinking in terms of AND, OR, and NOT ... think in
terms of OPTIONAL, REQUIRED, and PROHIBITED ... your life will be much
easier: documentation will make more sense, conversations on the email
list will be more synergistastic, wine will be sweeter, and food will
taste better.
end copied section
A long time ago I read this somewhere; it is similar to your guess:
The NOT operator cannot be used with just one term. For example, the following search will return no results:
NOT "jakarta apache"
whereas the "-" or prohibit operator excludes documents that contain the term after the "-" symbol.
Hope this is useful.
