Grouping results based on fields - Lucene - java

I'm using Lucene 4.10.4. I want to take "n results" from each of 20 different fields efficiently, without searching 20 times. If I search with a BooleanQuery, I can get all the results in a single search, but then I want to group them by field. Is there a grouping concept in Lucene?

Yes there is:
http://lucene.apache.org/core/4_10_4/grouping/org/apache/lucene/search/grouping/package-summary.html
In newer versions, however, it only works on DocValues, so you would have to index the field again as a DocValues field to be able to group on it. (In 4.10 it may still work with the FieldCache, but I'm not familiar with that.)
You can use GroupingSearch or maybe BlockGroupingCollector to have multiple elements per group and specify how results are ordered within a group.
You have to include the lucene-grouping dependency to use it.

Related

How to do a multi query search in Lucene 7.4.0?

I have two queries, one of them boosted, and I want to combine them into one new query. I understand that in older versions of Lucene you could do this with BooleanQuery using add. But in the version I'm using, that method no longer seems to exist. So how do I do it now?
To add queries to a 'BooleanQuery', I now have to use its 'Builder', and each clause needs an Occur value. So if I want to add queries, the code line should be something like new BooleanQuery.Builder().add(query1, BooleanClause.Occur.SHOULD).add(query2, BooleanClause.Occur.SHOULD).build() (SHOULD is only an example; use MUST if both queries have to match).
Yes, I should've read the migration guide first.
Also, if I want to combine a boosted query and a normal one, I could just concatenate the query strings.

Spring Data Paging over combined Result of several Queries

We are using Spring Boot 2 with Spring Data and its PagingAndSortingRepository feature. This works well for single queries, but in one case we have to make three different queries and implement pagination for the combined result.
What is the best way to do it?
Here's what I have tried:
1) Write a UNION or JOIN query of sorts that already returns the combined result as a Page or Slice. However, this query takes almost 10 times as long as running three separate queries and doing the aggregation in Java. We are talking complex computations here (PostGIS backend).
2) Manually construct the pages/slices by using the existing SliceImpl or PageImpl classes. This works fine for the initial request, but fails on the second request, when the user says something like: give me page 1 (page size == 10 items). The first page (page 0) may have had 4 items from the first query and 6 of 12 total items from the second query. Asking for page 1 gives me then 0 results from the first query and 2 (instead of 6) from the second, while filling up the rest from the third query. So clearly, this cannot work from a logical point of view.
Any other ideas?
Edit: we are planning to add Hibernate Search and caching, which might solve this problem externally by making option 1) fast enough. My question was meant to ask for an "internal" solution, i.e. some code I can write today, until we have the external solution in place.
As you described in point 2, unless you always do a left join between the queries, nothing guarantees that what the first query returns is enough to fill a page of 10 valid elements.
Implementing logic that keeps fetching elements until the page is complete is certainly more expensive than a single query, especially as the page number grows.
I think you have to combine all your queries into a single query.
One solution in this case could be to create a materialized view in your database and apply simpler filters to it.
A caching framework can help too.
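If the three queries do have to stay separate for now, the page arithmetic from option 2 in the question can at least be made consistent by working with one global offset over the concatenated results. The following is only a sketch under assumptions: it presumes that the total row count of each query is known (for example from count queries), that each query returns rows in a stable order, and that each query can be given an arbitrary offset and limit. Spring Data's PageRequest only expresses offsets that are multiples of the page size, so the offsets would have to go through a custom Pageable or plain LIMIT/OFFSET. The names CombinedPaging, Window and windowsFor are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: map one page request onto several queries whose results are concatenated in a fixed order. */
class CombinedPaging {

    /** Offset and limit to pass to one source query (hypothetical helper type). */
    static final class Window {
        final int source;  // index of the query in concatenation order
        final long offset; // rows to skip inside that query
        final int limit;   // rows to fetch from that query

        Window(int source, long offset, int limit) {
            this.source = source;
            this.offset = offset;
            this.limit = limit;
        }

        @Override
        public String toString() {
            return "query " + source + ": offset=" + offset + ", limit=" + limit;
        }
    }

    /**
     * @param totals   total row count of each query, in concatenation order
     * @param page     zero-based page number
     * @param pageSize rows per page
     */
    static List<Window> windowsFor(long[] totals, int page, int pageSize) {
        List<Window> windows = new ArrayList<>();
        long skip = (long) page * pageSize; // global rows that come before the requested page
        int remaining = pageSize;           // rows still needed to fill the page
        for (int i = 0; i < totals.length && remaining > 0; i++) {
            if (skip >= totals[i]) {
                skip -= totals[i];          // this query lies entirely before the page
                continue;
            }
            int take = (int) Math.min(remaining, totals[i] - skip);
            windows.add(new Window(i, skip, take));
            remaining -= take;
            skip = 0;                       // later queries are read from their start
        }
        return windows;
    }

    public static void main(String[] args) {
        long[] totals = {4, 12, 30}; // the counts from the question, plus an assumed third total
        // Page 0: 4 rows from query 0, then 6 rows from query 1.
        System.out.println(windowsFor(totals, 0, 10));
        // Page 1: the remaining 6 rows of query 1 (offset 6), then 4 rows from query 2.
        System.out.println(windowsFor(totals, 1, 10));
    }
}
```

With the numbers from the question, page 1 asks the second query for its remaining 6 rows and fills the rest from the third. This only fixes the bookkeeping; it still costs the count queries on top of the data queries, so the single-query or materialized-view route may remain the better long-term answer.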

Hibernate Search Result Ranking

I am using Hibernate Search along with Lucene to implement full-text search on my database. I want to know whether a Hibernate Search query or a Lucene query returns the top-ranked, most relevant results. The documentation says:
Apache Lucene provides a very flexible and powerful way to sort
results. While the default sorting (by relevance) is appropriate most
of the time
Link: http://docs.jboss.org/hibernate/search/4.2/reference/en-US/html_single/#search-query
Section: 5.1.3.3. Sorting
But I am very confused by the results, as they are always ordered by the IDs of the objects. I just need the top 100 most relevant records.
See Customizing Lucene's scoring formula
Sorting by relevance is affected by your Analyzer choices. If you are getting results in primary-key order, they most likely all have the same score. That is normally very unlikely, so my guess is that you're not enabling tokenization on any searched field.
Make sure you're tokenizing the fields used in the Query and they are using an appropriate Analyzer. To pick an appropriate one you'll have to experiment a bit as it depends on the language (if it's natural language) or on what kind of data you're indexing.
To actually debug the sort order applied by Relevance sort, see usage of Projections in the Hibernate Search documentation: both FullTextQuery.SCORE and FullTextQuery.EXPLANATION can be very useful to understand what's going on.
A handy utility for quickly experimenting with the effect of different Analyzers is org.hibernate.search.util.AnalyzerUtils. You can either write unit tests creating the Analyzer instance yourself or you can retrieve the analyzers by name using org.hibernate.search.engine.SearchFactory.getAnalyzer(String) or the base one used for a specific indexed entity by entity type: org.hibernate.search.engine.SearchFactory.getAnalyzer(Class).

Performance of search in java list vs on database records using hibernate

I have a situation where I need to do some comparisons and result filtering that is not simple. What I want is something like Lucene's search, but I have to develop it myself. It is not my decision; I would have gone with Lucene.
What I will do is:
Find the element by a full-word match on a certain field; if there is none, check whether the field starts with the term, then check whether it just contains it.
Every field has a weight according to the kind of match (full -> begins -> contains) and its priority to me.
After one field has matched, I will also check the weights of the other fields to compute a final total row weight.
Then I will return a Map of the rows and their weights.
Now I realize that this is not easily done in Hibernate's HQL, meaning I would have to run multiple queries to do it.
So my question is: should I do it in Java, meaning retrieve all records and do my calculations to find my target, or should I do it in Hibernate by executing multiple queries? Which is better in terms of performance and speed?
Unfortunately, I think the right answer is "it depends": how many words, what data structure, whether the data fits in memory, how often you have to do the search, etc.
I am inclined to think that a database is a better solution, even if Hibernate is not part of it. You might need to learn how to write better SQL. Perhaps the dynamic SQL that Hibernate generates for you isn't sufficient. Proper JOINs and indexing might make this perform nicely.
There might be a third way to consider: Lucene and indexing. I'd need to know more about your problem to decide.
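For comparison, the in-memory variant described in the question could look roughly like the sketch below. It is not a recommendation over the database route, just an illustration of the full -> begins -> contains weighting. Everything concrete in it is an assumption: the weight values 3/2/1, the field names "name" and "title", case-insensitive matching, treating a "full" match as the whole field value being equal to the term, and identifying rows by their index in the input list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Sketch: score rows in memory by match kind (full > begins > contains) times field priority. */
class WeightedMatch {
    // Illustrative weights; the question does not give concrete numbers.
    static final int FULL = 3, BEGINS = 2, CONTAINS = 1, NONE = 0;

    /** Weight of one field value against the search term (case-insensitive by assumption). */
    static int matchWeight(String value, String term) {
        if (value == null) return NONE;
        String v = value.toLowerCase(Locale.ROOT);
        String t = term.toLowerCase(Locale.ROOT);
        if (v.equals(t)) return FULL;
        if (v.startsWith(t)) return BEGINS;
        if (v.contains(t)) return CONTAINS;
        return NONE;
    }

    /** Total row weight: sum over all prioritized fields of match weight times field priority. */
    static int rowWeight(Map<String, String> row, Map<String, Integer> priorities, String term) {
        int total = 0;
        for (Map.Entry<String, Integer> field : priorities.entrySet()) {
            total += matchWeight(row.get(field.getKey()), term) * field.getValue();
        }
        return total;
    }

    /** Returns row index -> weight for all matching rows, highest weight first. */
    static Map<Integer, Integer> rank(List<Map<String, String>> rows,
                                      Map<String, Integer> priorities, String term) {
        List<int[]> scored = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            int weight = rowWeight(rows.get(i), priorities, term);
            if (weight > 0) scored.add(new int[] {i, weight});
        }
        scored.sort((a, b) -> Integer.compare(b[1], a[1])); // stable sort, descending by weight
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (int[] s : scored) result.put(s[0], s[1]);
        return result;
    }

    /** Builds a row with two hypothetical fields. */
    static Map<String, String> row(String name, String title) {
        Map<String, String> r = new HashMap<>();
        r.put("name", name);
        r.put("title", title);
        return r;
    }

    public static void main(String[] args) {
        Map<String, Integer> priorities = new LinkedHashMap<>();
        priorities.put("name", 2);  // assumed priority
        priorities.put("title", 1); // assumed priority
        List<Map<String, String>> rows = Arrays.asList(
                row("java", "java lucene search"),
                row("javascript", "intro"),
                row("c", "learn java"));
        System.out.println(rank(rows, priorities, "java")); // prints {0=8, 1=4, 2=1}
    }
}
```

This touches every row on every search, so it only scales while the whole table fits comfortably in memory, which is exactly the "it depends" from the answer above.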

How can I group Lucene's results?

My application indexes discussion threads. Each entry in the discussion is indexed as a separate Lucene document with a common_id field which can be used to group search hits into one discussion.
Currently, when a search is performed, if a thread has 3 entries, then 3 separate hits are returned. Even though this is correct, from the user's point of view the same entry is appearing in the search multiple times.
Is there a way to tell Lucene to group its search results by the common_id field before returning them?
I believe what you are asking for is Field Collapsing, which is a feature of Solr (and I believe Elasticsearch as well).
If you want to roll your own, one possible way to do this is:
Add a "series id" field to each document that is a member of a series. You will have to ensure that this gets incremented for every new series.
Make an initial query to Lucene, and get a hit list.
For each hit, check to see if it has a series id; if it does, make another query by the series id in order to retrieve all the members of the series.
An alternative is to store the ids of all the series members in a field inside each member's document.
There is nothing built into Lucene that collapses results based on a field. You will need to implement that yourself.
However, they've recently built this feature into Solr.
See http://www.lucidimagination.com/blog/2010/09/16/2446/
Since version 3.2, Lucene supports grouping search results based on a field.
http://lucene.apache.org/core/4_1_0/grouping/org/apache/lucene/search/grouping/package-summary.html
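For readers on a version without the grouping module, the collapsing step itself is small enough to do by hand after the search. The sketch below is plain Java, not the Lucene API: Hit is a hypothetical stand-in for whatever the application reads from each search hit (the common_id value, the document id and the score), and the input list is assumed to be sorted best-first already, as a relevance-sorted search would return it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: collapse a ranked hit list so that each common_id appears only once. */
class CollapseByField {

    /** Hypothetical stand-in for a search hit; not a Lucene class. */
    static final class Hit {
        final String commonId; // value of the common_id field
        final int docId;
        final float score;

        Hit(String commonId, int docId, float score) {
            this.commonId = commonId;
            this.docId = docId;
            this.score = score;
        }
    }

    /** Keeps the first (best-ranked) hit per commonId and preserves the ranking order. */
    static List<Hit> collapse(List<Hit> rankedHits) {
        Map<String, Hit> bestPerThread = new LinkedHashMap<>();
        for (Hit hit : rankedHits) {
            bestPerThread.putIfAbsent(hit.commonId, hit); // later hits of the same thread are dropped
        }
        return new ArrayList<>(bestPerThread.values());
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("t1", 7, 2.5f),
                new Hit("t2", 3, 2.1f),
                new Hit("t1", 8, 1.9f),
                new Hit("t1", 9, 0.7f),
                new Hit("t2", 4, 0.5f));
        for (Hit h : collapse(hits)) {
            System.out.println(h.commonId + " -> doc " + h.docId);
        }
        // prints "t1 -> doc 7" and then "t2 -> doc 3"
    }
}
```

The catch with collapsing after the fact is that a request for the top N hits can yield fewer than N threads, so the application has to over-fetch or re-query. Grouping during collection, as the grouping module does, avoids that.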
