I would like to find the total number of hits for a query using a Lucene index (version 4.3.1).
I understand that I have to use one of the search methods of https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/IndexSearcher.html#search(org.apache.lucene.search.Query,%20int)
public TopDocs search(Query query, int n) - Finds the top n hits for query.
In the TopDocs, I can see a totalHits field
https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/TopDocs.html#totalHits
But I am not able to understand the impact of the parameter 'n' of search() on TopDocs.totalHits.
For example: if I set n = 1000, will TopDocs.totalHits be <= n?
In one of my runs I passed n = 1, but in that search TopDocs.totalHits was 29.
Can somebody please shed some light on this?
If I set n = 1000, will TopDocs.totalHits be <= n?
Not necessarily. With "n" you define how many results you are interested in (the size of TopDocs.scoreDocs), while TopDocs.totalHits reflects the total number of hits actually found for the query. That is why you saw totalHits = 29 even with n = 1.
Usually it's not useful to retrieve all documents, because that can lead to performance issues. In addition, users are usually not interested in all results; that's where paging or filtering comes in.
If you want to retrieve all results, you need to work with a Collector and this search method:
public void search(Query query, Collector results)
Depending on your Collector, you can get all search results, the number of hits, or the scores of those hits.
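For example, if all you need is the total count, a TotalHitCountCollector can be used with the search(Query, Collector) method above. A minimal sketch, assuming an existing IndexReader named reader (the helper class and names are just for illustration):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TotalHitCountCollector;

public class HitCounter {
    // Counts every document matching the query without collecting the documents themselves.
    public static int countHits(IndexReader reader, Query query) throws IOException {
        IndexSearcher searcher = new IndexSearcher(reader);
        TotalHitCountCollector collector = new TotalHitCountCollector();
        searcher.search(query, collector); // the search(Query, Collector) variant
        return collector.getTotalHits();   // total number of hits, independent of any "n"
    }
}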
Related
I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only a few milliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How can I make this faster?
One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not), would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({created_date: 1});
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient (presuming you have an index on created_date) query
db.myCollection.find({created_date : { $gt : max_created_date_from_last_result } }).limit(100).sort({created_date: 1});
From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more CPU intensive, and possibly I/O bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself a question: how often do you need the 40,000th page? Also see this article.
I found it performant to combine the two concepts together (both a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
Know the starting value, page size and the page you want to jump to
Project + skip + limit the value you should start from
Find + limit the page's results
It looks roughly like this if I want to get page 5432 with 16 records per page (in JavaScript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;
let retval = await db.collection(...).find().sort({ "_id": 1 }).project({ "_id": 1 }).skip(skip_size).limit(1).toArray();
let start_id = retval[0]._id; // the projection returns _id (an ObjectID), not "id"
retval = await db.collection(...).find({ "_id": { "$gte": start_id } }).sort({ "_id": 1 }).project(...).limit(page_size).toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still shows a large number for totalDocsExamined, but because of the projection on an index it's extremely fast (essentially, the data blobs are never examined). Then, with the value for the start of the page in hand, you can fetch the next page very quickly.
I combined two answers.
The problem is that when you use skip and limit without a sort, the query just paginates in the order the documents were written to the collection, so the engine first needs to build a temporary ordering. It is better to use the existing _id index: sort by _id. Then it is very quick even with large collections, for example:
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it would be:
$skip  = 4000000; // offset of the first document to return
$limit = 1;       // page size
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
    'sort'  => ['_id' => 1],
    'limit' => $limit,
    'skip'  => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options);
$get = $manager->executeQuery("namedb.namecollection", $query);
I'm going to suggest a more radical approach. Combine skip/limit (really as an edge case) with sort-range-based buckets, and base the pages not on a fixed number of documents but on a range of time (or whatever your sort key is). So you have top-level pages that each cover a range of time, and you have sub-pages within that range if you need skip/limit, but I suspect the buckets can be made small enough not to need skip/limit at all. By using the sort index, this avoids the cursor traversing the entire collection to reach the final page.
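A rough sketch of that idea with the MongoDB Java driver (the field name created_date, the bucket boundaries, and the database/collection names are assumptions for illustration, not code from this thread):

import java.util.Date;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class BucketPagination {
    public static void main(String[] args) {
        MongoCollection<Document> coll = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("mydb").getCollection("myCollection");

        // Top-level "page" = one time bucket; sub-page = small skip/limit inside the bucket.
        Date bucketStart = new Date(1700000000000L); // assumed bucket boundaries
        Date bucketEnd   = new Date(1700086400000L);
        int subPage = 2, pageSize = 100;

        for (Document d : coll.find(Filters.and(
                        Filters.gte("created_date", bucketStart),
                        Filters.lt("created_date", bucketEnd)))
                .sort(Sorts.ascending("created_date"))
                .skip(subPage * pageSize) // only skips within the small bucket
                .limit(pageSize)) {
            System.out.println(d.toJson());
        }
    }
}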
My collection has around 1.3M documents (not that big) and is properly indexed, but it still takes a big performance hit from this issue.
After reading the other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to SQL's auto-increment value, instead of a time-based value.
The problem is with skip; there is no way around it: if you use skip, you are bound to hit this issue once your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with a time-based value, because you can't calculate where to jump based on time, so skipping is the only option in that case.
On the other hand, by assigning a counting number to each document, the write performance takes a hit, because all documents must be inserted sequentially. This is fine for my use case, but I know the solution is not for everyone.
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward by arbitrary page number, not just one at a time.)
Plus, it is also hard if you are dealing with deletes, but it is still possible because MongoDB supports $inc with a negative value for batch updating. Luckily I don't have to deal with deletion in the app I am maintaining.
Just writing this down as a note to my future self. It is probably too much hassle to fix this issue in the application I am currently dealing with, but next time I'll build a better one if I encounter a similar situation.
If you have MongoDB's default _id, which is an ObjectId, use it instead. This is probably the most viable option for most projects anyway.
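A sketch of the jump with the MongoDB Java driver, assuming each document carries an indexed counting field named seq starting at 0 (the field name and the database/collection names are assumptions for illustration):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class SeqPagination {
    public static void main(String[] args) {
        MongoCollection<Document> coll = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("mydb").getCollection("myCollection");

        long page = 40000, pageSize = 20;
        long firstSeq = page * pageSize; // computable because seq is a counting integer

        // Jump straight to the page via the index on seq; no skip() involved.
        for (Document d : coll.find(Filters.gte("seq", firstSeq))
                .sort(Sorts.ascending("seq"))
                .limit((int) pageSize)) {
            System.out.println(d.toJson());
        }
    }
}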
As stated in the official MongoDB docs:
The skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents, typically yielding better performance as the offset grows compared to using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
  let endValue = null;
  db.students.find( { _id: { $lt: startValue } } )
    .sort( { _id: -1 } )
    .limit( nPerPage )
    .forEach( student => {
      print( student.name );
      endValue = student._id;
    } );
  return endValue;
}
Ascending order example here.
If you know the ID of the element from which you want to continue:
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a little genius solution which works like a charm.
For faster pagination, don't use the skip() function. Use limit() and find(), where you query on the last id of the preceding page.
Here is an example where I'm querying over tons of documents using spring boot:
Long totalElements = mongockTemplate.count(new Query(), "product");
int page = 0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; // this is the last id of the preceding page
for (int i = 0; i < (totalElements / pageSize); i++) {
    page += 1;
    Aggregation aggregation = Aggregation.newAggregation(
        Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
        Aggregation.sort(Sort.Direction.ASC, "_id"),
        new CustomAggregationOperation(queryOffersByProduct),
        Aggregation.limit((long) pageSize)
    );
    List<ProductGroupedOfferDTO> productGroupedOfferDTOS =
        mongockTemplate.aggregate(aggregation, "product", ProductGroupedOfferDTO.class).getMappedResults();
    lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size() - 1).getId();
}
I'm having trouble understanding "limits" in Google's Search API.
The docs show this example:
// Build the SortOptions with 2 sort keys
SortOptions sortOptions =
    SortOptions.newBuilder()
        .addSortExpression(
            SortExpression.newBuilder()
                .setExpression("price")
                .setDirection(SortExpression.SortDirection.DESCENDING)
                .setDefaultValueNumeric(0))
        .addSortExpression(
            SortExpression.newBuilder()
                .setExpression("brand")
                .setDirection(SortExpression.SortDirection.DESCENDING)
                .setDefaultValue(""))
        .setLimit(1000)
        .build();

// Build the QueryOptions
QueryOptions options =
    QueryOptions.newBuilder()
        .setLimit(25)
        .setFieldsToReturn("model", "price", "description")
        .setSortOptions(sortOptions)
        .build();
Limit for SortOptions is described as:
Maximum number of objects to score and/or sort. Cannot be more than 10,000. Default: 1,000
Limit for QueryOptions is described as:
The maximum number of documents to return in the results. Default: 20, Max: 1000
I personally want as many results as possible returned with pagination. Note: I am using a cursor.
Does this mean that if I want to use QueryOptions I am limited to 1000 results even though SortOptions could return 10000 results?
Or will all documents be returned with only the first 1000 sorted?
I am worried that once I get to the end of 1000 documents with my cursor, no more will be returned even though there are more than 1000 documents.
You will always get only the first 1,000 documents from the sorted set, because QueryOptions and SortOptions (and their limits) control different stages of the result-extraction flow.
SortOptions.limit tells the index engine to use a limited set of documents when preparing the result set, while QueryOptions.limit applies when you receive documents from the index.
For example, imagine a similar situation in an RDBMS: you need to create a complicated request, so you create a sorted view with a top/limit. In this case:
SortOptions.limit - used when you create the view;
QueryOptions.limit - used when you select data from the view.
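If it helps, here is a rough sketch of paging past the per-request limit with a cursor (Java App Engine Search API; the index name and query string are assumptions for illustration). Note that the sorted order is still only guaranteed for the documents covered by SortOptions.limit:

import com.google.appengine.api.search.Cursor;
import com.google.appengine.api.search.Index;
import com.google.appengine.api.search.IndexSpec;
import com.google.appengine.api.search.Query;
import com.google.appengine.api.search.QueryOptions;
import com.google.appengine.api.search.Results;
import com.google.appengine.api.search.ScoredDocument;
import com.google.appengine.api.search.SearchServiceFactory;

public class CursorPaging {
    public static void fetchAll(String queryString) {
        Index index = SearchServiceFactory.getSearchService()
                .getIndex(IndexSpec.newBuilder().setName("products").build()); // assumed index name
        Cursor cursor = Cursor.newBuilder().build(); // request a per-query cursor
        do {
            QueryOptions options = QueryOptions.newBuilder()
                    .setLimit(25)       // documents returned per request
                    .setCursor(cursor)
                    .build();
            Query query = Query.newBuilder().setOptions(options).build(queryString);
            Results<ScoredDocument> results = index.search(query);
            for (ScoredDocument doc : results) {
                // process doc ...
            }
            cursor = results.getCursor(); // null once there is nothing more to page through
        } while (cursor != null);
    }
}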
When I query for a term (standard analyzer), I get a list of results sorted by score, which is good. But when calling:
QueryBuilders.termQuery(fieldname, word);
I get a mixture of:
word
some word
WORD
word and such
In no particular order, since they all score the same because they all contain word. Since the number of results varies between 0 and around 1M, I need the most exact matches first (or the others filtered out).
I tried adding an ES regex filter, but it looks like these are not being processed:
FilterBuilders.regexQuery(fieldname, "~"+word).flag(RegexpFlag.ALL);
FilterBuilders.regexQuery(fieldname, "^((?!" + word + ").)*$").flag(RegexpFlag.ALL); // and this
FilterBuilders.regexQuery(fieldname, "^\\(\\(\\?!" + word + "\\)\\.\\)*$").flag(RegexpFlag.ALL); // or
I've also tried QueryBuilders.boostingQuery, which I also seem to fail with; besides, I came across some comments that the negative querying does not work.
So basically, I'm looking for a query that queries for a particular term, while filtering out or negatively boosting the results that contain other words.
If possible, I'd want to stay away from scripting for now (bad experiences).
So the query: must/should not contain a word different from word.
In fact, the easiest set of queries is:
final int fetchAmount = 100; // number of items to return
final FilterBuilder filterBuilder = FilterBuilders.termFilter(fieldname, word);
final QueryBuilder combinedQuery = QueryBuilders.termQuery(fieldname, word);
final QueryBuilder queryBuilder = QueryBuilders.filteredQuery(combinedQuery, filterBuilder);
final SearchResponse builder = CLIENT.prepareSearch(index_name).setQuery(queryBuilder).setExplain(true)
.setTypes(type_name).setSize(fetchAmount).setSearchType(SearchType.QUERY_THEN_FETCH).execute().actionGet();
Use the FilterBuilder to cheaply discard the values that don't contain word. Using the same term query for the QueryBuilder adds the scoring mechanism. Take the score (SearchHit.score()) from the first hit, then continue until one is found for which the score < firstScore.
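A small sketch of that "continue until the score drops" step, reusing the builder response from the snippet above (ES 1.x Java API; assumes imports of org.elasticsearch.search.SearchHit and java.util):

// Hedged sketch: keep only the hits that score as high as the first (most exact) hit.
final SearchHit[] hits = builder.getHits().getHits();
final List<SearchHit> exactMatches = new ArrayList<SearchHit>();
if (hits.length > 0) {
    final float firstScore = hits[0].getScore();
    for (final SearchHit hit : hits) {
        if (hit.getScore() < firstScore) {
            break; // hits come back in descending score order, so the rest are less exact
        }
        exactMatches.add(hit);
    }
}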
The problem, as described in the question, occurs when QueryBuilders.matchAllQuery() is used for the QueryBuilder instead of the TermQuery. The same set of results is returned in that case, but no scoring (and hence no sorting) mechanism is applied.
Keep setSize relatively low for speed. When the last item is still of interest, call the above query again, but add setFrom(fetchAmount) so that the second query starts where the first one stopped, like:
final int xthQueryCalledTime = 1; // if using a loop
final SearchResponse builder = CLIENT.prepareSearch(index_name).setQuery(queryBuilder).setExplain(true)
.setTypes(type_name).setSize(fetchAmount).setSearchType(SearchType.QUERY_THEN_FETCH).setFrom(fetchAmount * xthQueryCalledTime).execute().actionGet();
Do until done.
PS: Don't use scroll! This will mix up the score ordering. From the JavaDoc on SearchType.SCAN:
Performs scanning of the results which executes the search without any sorting. It will automatically start scrolling the result set
As I understand Solr's scoring function, the following two queries should be equivalent.
Namely, score(q1, d) = score(q2, d) for each document d in the corpus.
Query 1: evolution OR selection OR germline OR dna OR rna OR mitochondria
Query 2: (evolution OR selection OR germline) OR (dna OR rna OR mitochondria)
The queries are obviously logically equivalent (they both return the same set of documents). Also, both queries consist of the same 6 terms, and each term has a boost of 1 in both queries. Hence each term is supposed to have the same contribution to the total score (same TF, same IDF, same boost).
In spite of that, the queries don't give the same scores.
In general, a conjunction of terms (a OR b OR c OR d) is not the same as a conjunction of queries ((a OR b) OR (c OR d)). What is the semantic difference between the two types of queries? What is causing them to result in different scorings?
The reason I'm asking is that I'm building a custom request handler in which I construct the second type of query (conjunction of queries) while I might actually need to construct the first type of query (conjunction of terms). In other words, this is what I'm doing:
Query q1 = ... //conjunction of terms evolution, selection, germline
Query q2 = ... //conjunction of terms dna, rna, mitochondria
BooleanQuery conjunctionOfQueries = new BooleanQuery();
conjunctionOfQueries.add(q1, BooleanClause.Occur.SHOULD);
conjunctionOfQueries.add(q2, BooleanClause.Occur.SHOULD);
while maybe I should actually do:
List<String> terms = ... //extract all 6 terms from q1 and q2
List<TermQuery> termQueries = ... //create a new TermQuery from each term in terms
BooleanQuery conjunctionOfTerms = new BooleanQuery();
for (TermQuery t : termQueries) {
    conjunctionOfTerms.add(t, BooleanClause.Occur.SHOULD);
}
I've followed femtoRgon's advice to check the debug element of the score calculation. What I found is that the calculations are indeed mathematically equivalent. The only difference is that in the conjunction-of-queries calculation we store intermediate results: more precisely, we store each sub-query's contribution to the sum in a variable. Apparently, stopping to store intermediate results accumulates numerical error: each time we store an intermediate result, we lose some accuracy. Since the actual queries in the application are quite big (not like the trivial example query), there is plenty of accuracy to lose, and the accumulated error sometimes even changes the ranking order of the returned documents.
So the conjunction-of-terms query is expected to give a slightly better ranking than the conjunction-of-queries query, because the conjunction-of-queries query accumulates a greater numerical error.
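As a tiny, self-contained illustration of the effect (plain Java floats, not Solr code), grouping the same additions differently can change the rounded result:

public class FloatGrouping {
    public static void main(String[] args) {
        float big = 1e8f; // the float spacing (ulp) at this magnitude is 8, so small addends can vanish
        float small = 3f;

        float flat = (big + small) + small;    // each +3 rounds back down to 1.0E8
        float grouped = big + (small + small); // 3 + 3 = 6 survives, then rounds up to 1.00000008E8

        System.out.println(flat);            // 1.0E8
        System.out.println(grouped);         // 1.00000008E8
        System.out.println(flat == grouped); // false
    }
}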
I have updated this question (I found the last version unclear; if you want to refer to it, check the revision history). The current answers so far do not work because I failed to explain my question clearly (sorry, second attempt).
Goal:
Trying to take a set of numbers (positive or negative, thus needing bounds to limit the growth of a specific variable) and find the linear combinations of them that can be used to reach a specific sum. For example, to reach a sum of 10 using [2, 4, 5] we get:
5*2 + 0*4 + 0*5 = 10
3*2 + 1*4 + 0*5 = 10
1*2 + 2*4 + 0*5 = 10
0*2 + 0*4 + 2*5 = 10
How can I create an algorithm that is scalable to a large number of variables and target sums? I can write the code on my own if an algorithm is given, but if there's a library available I'm fine with any library, though I'd prefer Java.
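To make the goal concrete, here is a small self-contained Java sketch (my own illustration, assuming positive values and non-negative, bounded coefficients as in the example) that enumerates the combinations for 10 over [2, 4, 5] by plain recursion; it is exponential, so it shows the goal rather than a scalable solution:

import java.util.ArrayDeque;
import java.util.Deque;

public class EnumerateCombinations {

    // Prints every coefficient vector c with c[0]*xs[0] + ... + c[n-1]*xs[n-1] == target.
    static void enumerate(int[] xs, int idx, int remaining, Deque<Integer> coeffs) {
        if (idx == xs.length) {
            if (remaining == 0) {
                System.out.println(coeffs); // e.g. [5, 0, 0] for 5*2 + 0*4 + 0*5 = 10
            }
            return;
        }
        // try every coefficient for xs[idx] that does not overshoot the target
        for (int c = 0; c * xs[idx] <= remaining; c++) {
            coeffs.addLast(c);
            enumerate(xs, idx + 1, remaining - c * xs[idx], coeffs);
            coeffs.removeLast();
        }
    }

    public static void main(String[] args) {
        enumerate(new int[] {2, 4, 5}, 0, 10, new ArrayDeque<Integer>());
    }
}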
One idea would be to break out of the loop once you set T[z][i] to true, since you are only modifying T[z][i] here, and once it becomes true it won't ever be modified again.
for i = 1 to k
    for z = 0 to sum:
        for j = z-x_i to 0:
            if(T[j][i-1]):
                T[z][i]=true;
                break;
EDIT 2: Additionally, if I am getting it right, T[z][i] depends on the array T[z-x_i..0][i-1], and T[z+1][i] depends on T[z+1-x_i..0][i-1]. So once you know whether T[z][i] is true, you only need to check one additional element (T[z+1-x_i][i-1]) to know whether T[z+1][i] will be true.
Let's say you represent whether T[z][i] was updated by a variable changed. Then you can simply say that T[z][i] = changed && T[z-1][i]. So you should be done in two loops instead of three, which should make it much faster.
Now, to scale it: since T[z,i] depends only on T[z-1,i] and T[z-1-x_i,i-1], to populate T[z,i] you do not need to wait until the whole (i-1)-th column is populated. You can start working on T[z,i] as soon as the required values are populated. I can't implement it without knowing the details, but you can try this approach.
I take it this is something like unbounded knapsack? You can dispense with the loop over c entirely.
for i = 1 to k
    for z = 0 to sum
        T[z][i] = z >= x_i cand (T[z - x_i][i - 1] or T[z - x_i][i])
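For reference, a hedged Java sketch of this kind of reachability table for the unbounded case (each x_i usable any number of times); this is the standard formulation, which also carries over T[z][i-1] for the "use zero copies of x_i" case:

public class ReachableSums {

    // T[z][i] == true  <=>  sum z is reachable using x_1..x_i, each any number of times.
    static boolean[][] buildTable(int[] xs, int sum) {
        int k = xs.length;
        boolean[][] T = new boolean[sum + 1][k + 1];
        T[0][0] = true; // the empty combination reaches 0
        for (int i = 1; i <= k; i++) {
            for (int z = 0; z <= sum; z++) {
                T[z][i] = T[z][i - 1]                               // use zero copies of x_i
                        || (z >= xs[i - 1] && T[z - xs[i - 1]][i]); // use at least one copy
            }
        }
        return T;
    }

    public static void main(String[] args) {
        boolean[][] T = buildTable(new int[] {2, 4, 5}, 10);
        System.out.println(T[10][3]); // true: 10 is reachable from {2, 4, 5}
    }
}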
Based on the original example data you gave (a linear combination of terms) and your answer to my question in the comments section (there are bounds), would a brute-force approach not work?
c0*x0 + c1*x1 + c2*x2 + ... + cn*xn = SUM
I'm guessing I'm missing something important but here it is anyway:
Brute Force Divide and Conquer:
main controller generates coefficients for say, half of the terms (or however many may make sense)
it then sends each partial set of fixed coefficients to a work queue
a worker picks up a partial set of fixed coefficients and proceeds to brute force its own way through the remaining combinations
it doesn't use much memory at all as it works sequentially on each valid set of coefficients
could be optimized to ignore equivalent combinations and probably many other ways
Pseudocode for Multiprocessing
class Controller
    work_queue = Queue
    solution_queue = Queue
    solution_sets = []
    create x number of workers with access to work_queue and solution_queue

    # say, for 2000 terms:
    for partial_set in coefficient_generator(start_term=0, end_term=999):
        if worker_available(): # generate just in time
            push partial_set onto work_queue
    while solution_queue:
        add any solutions to solution_sets
        # there is an efficient way to do this type of polling but I forget

class Worker
    while true: # actually stops when a stop-work token is received
        get partial_set from the work_queue
        for remaining_set in coefficient_generator(start_term=1000, end_term=1999):
            combine the two sets (partial_set.extend(remaining_set))
            if is_solution(full_set):
                push full_set onto the solution_queue