Lucene 6 - How to influence ranking with numeric value? - java

I am new to Lucene, so apologies for any unclear wording. I am working on an author search engine. The search query is the author name. The default search results are good - they return the names that match the most. However, we want to rank the results by author popularity as well, a blend of both the default similarity and a numeric value representing the circulations their titles have. The problem with the default results is it returns authors nobody is interested in, and while I can rank by circulation alone, the top result is generally not a great match in terms of name. I have been looking for days for a solution for this.
This is how I am building my index:
IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(INDEX_LOCATION)),
new IndexWriterConfig(new StandardAnalyzer()));
writer.deleteAll();
for (Contributor contributor : contributors) {
Document doc = new Document();
doc.add(new TextField("name", contributor.getName(), Field.Store.YES));
doc.add(new StoredField("contribId", contributor.getContribId()));
doc.add(new NumericDocValuesField("sum", sum));
writer.addDocument(doc);
}
writer.close();
The name is the field we want to search on, and the sum is the field we want to weight our search results with (but still taking into account the best match for the author name). I'm not sure if adding the sum to the document is the correct thing to do in this situation. I know that there will need to be some experimentation to figure out how to best blend the weighting of the two factors, but my problem is I don't know how to do it in the first place.
Any examples I've been able to find are either pre-Lucene 4 or don't seem to work. I thought this was what I was looking for, but it doesn't seem to work. Help appreciated!

As demonstrated in the blog post you linked, you could use a CustomScoreQuery; this would give you a lot of flexibility and influence over the scoring process, but it is also a bit overkill. Another possibility is to use a FunctionScoreQuery; since they behave differently, I will explain both.
Using a FunctionScoreQuery
A FunctionScoreQuery can modify a score based on a field.
Let's say you usually perform a search like this:
Query q = .... // pass the user input to the QueryParser or similar
TopDocs hits = searcher.search(q, 10); // Get 10 results
Then you can modify the query in between like this:
Query q = .....
// Note that a Float field would work better.
DoubleValuesSource boostByField = DoubleValuesSource.fromLongField("sum");
// Create a query, based on the old query and the boost
FunctionScoreQuery modifiedQuery = new FunctionScoreQuery(q, boostByField);
// Search as usual
TopDocs hits = searcher.search(modifiedQuery, 10);
This modifies the score based on the value of the field. Sadly, however, there isn't a way to control the influence of the DoubleValuesSource (other than scaling the values during indexing), at least none that I know of.
To have more control, consider using the CustomScoreQuery.
Using a CustomScoreQuery
Using this kind of query will allow you to modify a score of each result any way you like. In this context we will use it to alter the score based on a field in the index. First, you will have to store your value during indexing:
doc.add(new StoredField("sum", sum));
Then we will have to create our very own query class:
private static class MyScoreQuery extends CustomScoreQuery {
public MyScoreQuery(Query subQuery) {
super(subQuery);
}
// The CustomScoreProvider is what actually alters the score
private class MyScoreProvider extends CustomScoreProvider {
private LeafReader reader;
private Set<String> fieldsToLoad;
public MyScoreProvider(LeafReaderContext context) {
super(context);
reader = context.reader();
// We create a HashSet which contains the name of the field
// which we need. This allows us to retrieve the document
// with only this field loaded, which is a lot faster.
fieldsToLoad = new HashSet<>();
fieldsToLoad.add("sum");
}
@Override
public float customScore(int doc_id, float currentScore, float valSrcScore) throws IOException {
// Get the result document from the index
Document doc = reader.document(doc_id, fieldsToLoad);
// Get boost value from index
IndexableField field = doc.getField("sum");
Number number = field.numericValue();
// This is just an example on how to alter the current score
// based on the value of "sum". You will have to experiment
// here.
float influence = 0.01f;
float boost = number.floatValue() * influence;
// Return the new score for this result, based on the
// original lucene score.
return currentScore + boost;
}
}
// Make sure that our CustomScoreProvider is being used.
@Override
public CustomScoreProvider getCustomScoreProvider(LeafReaderContext context) {
return new MyScoreProvider(context);
}
}
Now you can use your new Query class to modify an existing query, similar to the FunctionScoreQuery:
Query q = .....
// Create a query, based on the old query and the boost
MyScoreQuery modifiedQuery = new MyScoreQuery(q);
// Search as usual
TopDocs hits = searcher.search(modifiedQuery, 10);
Final remarks
Using a CustomScoreQuery, you can influence the scoring process in all kinds of ways. Remember however that the method customScore is called for each search result - so don't perform any expensive computations there, as this would severely slow down the search process.
I've created a small gist with a full working example of the CustomScoreQuery here: https://gist.github.com/philippludwig/14e0d9b527a6522511ae79823adef73a
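To get a feel for the blending before wiring it into Lucene, here is a minimal plain-Java sketch of two possible formulas. The class ScoreBlend and both method names are hypothetical and not part of Lucene; linear mirrors the "currentScore + boost" line in customScore above, and logDampened is one assumed alternative that keeps large circulation counts from swamping the name match.

```java
// Hypothetical helper for experimenting with score blending; not a Lucene API.
final class ScoreBlend {
    private ScoreBlend() {}

    /** Additive blend as in customScore above: text score plus a scaled raw count. */
    static float linear(float textScore, long circulation, float influence) {
        return textScore + circulation * influence;
    }

    /** Log-dampened variant: large counts still help, but with diminishing returns. */
    static float logDampened(float textScore, long circulation, float influence) {
        return textScore + influence * (float) Math.log1p(circulation);
    }
}
```

For example, take a good name match (text score 5.0, 10 circulations) and a poor one (text score 1.0, 100,000 circulations). The linear blend with influence 0.01 ranks the poor match far higher (about 1001 versus 5.1), while the log-dampened blend with influence 0.2 keeps the good match ahead. Either formula is cheap enough to run per result.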

Related

Page<> vs Slice<> when to use which?

I've read in Spring Jpa Data documentation about two different types of objects when you 'page' your dynamic queries made out of repositories.
Page and Slice
Page<User> findByLastname(String lastname, Pageable pageable);
Slice<User> findByLastname(String lastname, Pageable pageable);
So, I've tried to find some articles or anything else covering the main differences and typical usages of both, how performance changes, and how sorting affects both types of queries.
Does anyone have this kind of knowledge, articles, or some other good source of information?
Page extends Slice and knows the total number of elements and pages available by triggering a count query. From the Spring Data JPA documentation:
A Page knows about the total number of elements and pages available. It does so by the infrastructure triggering a count query to calculate the overall number. As this might be expensive depending on the store used, Slice can be used as return instead. A Slice only knows about whether there’s a next Slice available which might be just sufficient when walking through a larger result set.
The main difference between Slice and Page is that the latter provides full pagination details for the rows matching the query: the total number of records (getTotalElements()), the total number of pages (getTotalPages()), and next-page availability (hasNext()). The former only provides next-page availability (hasNext()). Slice gives significant performance benefits when you deal with a very large and growing table, because it avoids the count query.
Let's dig deeper into the technical implementation of both variants.
Page
static class PagedExecution extends JpaQueryExecution {
@Override
protected Object doExecute(final AbstractJpaQuery repositoryQuery, JpaParametersParameterAccessor accessor) {
Query query = repositoryQuery.createQuery(accessor);
return PageableExecutionUtils.getPage(query.getResultList(), accessor.getPageable(),
() -> count(repositoryQuery, accessor));
}
private long count(AbstractJpaQuery repositoryQuery, JpaParametersParameterAccessor accessor) {
List<?> totals = repositoryQuery.createCountQuery(accessor).getResultList();
return (totals.size() == 1 ? CONVERSION_SERVICE.convert(totals.get(0), Long.class) : totals.size());
}
}
As the snippet above shows, the PagedExecution#doExecute method calls PagedExecution#count under the hood to get the total number of records satisfying the condition.
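Once the count is known, the remaining Page details are simple arithmetic. The following is a small sketch of that arithmetic, not Spring Data's actual code; the class PageMath and its methods are hypothetical names, and page numbers are assumed to be zero-based.

```java
// Hypothetical illustration of what a total count makes possible; not Spring Data code.
final class PageMath {
    private PageMath() {}

    /** Number of pages needed to show totalElements rows, pageSize rows per page. */
    static int totalPages(long totalElements, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        // Integer ceiling division.
        return (int) ((totalElements + pageSize - 1) / pageSize);
    }

    /** True if a page exists after the given zero-based page number. */
    static boolean hasNext(int pageNumber, long totalElements, int pageSize) {
        return pageNumber + 1 < totalPages(totalElements, pageSize);
    }
}
```

With 101 matching rows and a page size of 20 this yields 6 pages, which is the kind of information a UI with numbered page links needs.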
Slice
static class SlicedExecution extends JpaQueryExecution {
@Override
protected Object doExecute(AbstractJpaQuery query, JpaParametersParameterAccessor accessor) {
Pageable pageable = accessor.getPageable();
Query createQuery = query.createQuery(accessor);
int pageSize = 0;
if (pageable.isPaged()) {
pageSize = pageable.getPageSize();
createQuery.setMaxResults(pageSize + 1);
}
List<Object> resultList = createQuery.getResultList();
boolean hasNext = pageable.isPaged() && resultList.size() > pageSize;
return new SliceImpl<>(hasNext ? resultList.subList(0, pageSize) : resultList, pageable, hasNext);
}
}
As the snippet above shows, to find out whether another set of results is present (for hasNext()), the SlicedExecution#doExecute method always fetches one extra element (createQuery.setMaxResults(pageSize + 1)) and drops it from the returned content when it is present (hasNext ? resultList.subList(0, pageSize) : resultList).
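The "fetch one extra row" idea can be tried out without a database. This is a minimal sketch in which an in-memory list stands in for the table; SliceSketch, its nested Slice holder, and fetch are hypothetical names, not the Spring Data types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the pageSize + 1 technique; not the Spring Data classes.
final class SliceSketch {
    private SliceSketch() {}

    /** Result holder: the visible rows plus a flag telling whether more rows exist. */
    static final class Slice<T> {
        final List<T> content;
        final boolean hasNext;

        Slice(List<T> content, boolean hasNext) {
            this.content = content;
            this.hasNext = hasNext;
        }
    }

    /** Reads at most pageSize + 1 rows starting at offset and hides the extra one. */
    static <T> Slice<T> fetch(List<T> table, int offset, int pageSize) {
        int from = Math.min(offset, table.size());
        // Emulates setMaxResults(pageSize + 1): ask for one row more than we show.
        int to = (int) Math.min((long) from + pageSize + 1, table.size());
        List<T> rows = new ArrayList<>(table.subList(from, to));
        boolean hasNext = rows.size() > pageSize;
        return new Slice<>(hasNext ? rows.subList(0, pageSize) : rows, hasNext);
    }
}
```

No count query is involved: the presence of the extra row is the only signal used to decide whether a next slice exists.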
Application:
Page
Use when the UI needs to know about all the results at the initial stage of the search/query itself, with page numbers to traverse (e.g., a bank statement with page numbers)
Slice
Use when the UI does not need to show all the results at the initial stage of the search/query itself, but intends to show more records on scrolling or on a next-button click (e.g., a Facebook feed search)

How to boost document when match is found at the beginning of the text in lucene

I want to know how this is possible. Suppose I'm searching for ka; then Karthik should score higher than Aakash. How do I boost those documents?
I've already tried this.
I'm trying to use SpanFirstQuery as shown below, but it's not working. I'm using Lucene 4.0.
//queryString is searchText. e.g ka
//NAME, ORGANIZATION_NAME and ORGANIZATION_POSITION are indexed field names.
Map<String, Analyzer> searchAnalyzers = new HashMap<String, Analyzer>();
searchAnalyzers.put(NAME, new KeywordAnalyzer());
searchAnalyzers.put(ORGANIZATION_NAME, new KeywordAnalyzer());
searchAnalyzers.put(ORGANIZATION_POSITION, new KeywordAnalyzer());
PerFieldAnalyzerWrapper perFieldAnalyzerWrapper = new PerFieldAnalyzerWrapper(new KeywordAnalyzer(), searchAnalyzers);
MultiFieldQueryParser multiFieldQueryParser = new MultiFieldQueryParser(Version.LUCENE_40, mSearchFields, perFieldAnalyzerWrapper); // mSearchFields is an array of fields
multiFieldQueryParser.setDefaultOperator(QueryParser.Operator.AND);
Query query = (Utils.isEmpty(queryString)) ? new MatchAllDocsQuery() : multiFieldQueryParser.parse(QueryParser.escape(queryString)); //queryString is text to be searched
Term term = new Term(NAME, queryString);
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 5);
spanFirstQuery.setBoost(5.0f);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(spanFirstQuery, BooleanClause.Occur.SHOULD);
booleanQuery.add(query, BooleanClause.Occur.MUST);
indexSearcher.search(booleanQuery, 100);
My thoughts on why SpanFirstQuery is a bad idea: it looks a lot like a workaround that will likely perform poorly (also, I'm not sure how to make it work in the first place), and it requires you to store positions (additional space), which isn't really needed otherwise.
Proposed solution:
Caveat: this is an experimental solution, probably not production ready, and it will still require some work.
I used WildcardQuery as a basis for this functionality, since it supports queries like *ka*, which is what you want. The second problem is scoring: since WildcardQuery is a subclass of MultiTermQuery, you can specify a custom rewrite method:
public class BoostPrefixScoringRewrite extends ScoringRewrite<BooleanQuery.Builder> {
private final String text;
public BoostPrefixScoringRewrite(String text) {
// todo should be handled more carefully, since wildcard query supports other than * symbols
this.text = text.replace("*", "");
}
@Override
protected BooleanQuery.Builder getTopLevelBuilder() {
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.setDisableCoord(true);
return builder;
}
@Override
protected Query build(BooleanQuery.Builder builder) {
return builder.build();
}
@Override
protected void addClause(BooleanQuery.Builder topLevel, Term term, int docCount,
float boost, TermContext states) {
final TermQuery tq = new TermQuery(term, states);
if (term.text().startsWith(this.text)) {
// experiment with the boost value
topLevel.add(new BoostQuery(tq, 100f), BooleanClause.Occur.SHOULD);
} else {
topLevel.add(new BoostQuery(tq, boost), BooleanClause.Occur.SHOULD);
}
}
@Override
protected void checkMaxClauseCount(int count) {
if (count > BooleanQuery.getMaxClauseCount())
throw new BooleanQuery.TooManyClauses();
}
}
Pay attention to the boost value: it is currently hardcoded to 100, which should be enough to always place terms starting with the search text on top. Another thing to watch: if your term list is broad, the Boolean rewrite can throw a TooManyClauses exception; in that case you will need a workaround, either increasing the clause limit or rewriting the query differently.
For the full test, take a look here: https://raw.githubusercontent.com/MysterionRise/information-retrieval-adventure/master/lucene5/src/main/java/org/mystic/BoostBeginningWithTest.java
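The effect of the rewrite on the ranking can be illustrated without Lucene. The sketch below is a plain-Java approximation under stated assumptions: PrefixBoostSketch and its methods are hypothetical names, matching is case-insensitive substring matching (standing in for *ka*), and non-prefix matches get a flat score of 1 instead of a real Lucene score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of "boost terms that start with the search text".
final class PrefixBoostSketch {
    static final float PREFIX_BOOST = 100f; // same hardcoded value as in the rewrite above

    private PrefixBoostSketch() {}

    /** 0 for no match, PREFIX_BOOST for a prefix match, 1 for any other substring match. */
    static float score(String term, String text) {
        String t = term.toLowerCase(Locale.ROOT);
        String q = text.toLowerCase(Locale.ROOT);
        if (!t.contains(q)) {
            return 0f;
        }
        return t.startsWith(q) ? PREFIX_BOOST : 1f;
    }

    /** Returns the matching terms, highest score first. */
    static List<String> rank(List<String> terms, String text) {
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            if (score(term, text) > 0f) {
                matches.add(term);
            }
        }
        matches.sort(Comparator.comparingDouble((String term) -> score(term, text)).reversed());
        return matches;
    }
}
```

Ranking the terms Aakash, Karthik and Ravi for the text ka puts Karthik first and Aakash second, and drops Ravi because it does not contain ka.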

Hibernate row count on Criteria with already set Projection

For a grid component I have in my web applications I have a "GridModel" class which gets passed a Criteria.
The GridModel class has a method to get the results for a specific page by adding setFirstResult(...) and setMaxResults(...) to the Criteria.
But I also need the total count of rows for the Criteria, so I have the following method:
public int getAvailableRows() {
Criteria c = criteriaProvider.getCriteria();
c.setProjection(Projections.rowCount());
return((Long)c.uniqueResult()).intValue();
}
This worked perfectly, but now I have a grid that requires a Criteria that already uses setProjection() in combination with setResultTransformer(). It seems that the getAvailableRows() method above overrides the setProjection() of the original Criteria creating wrong results.
Can I wrap a count Criteria around the original Criteria instead somehow? Or how would I solve this?
I've had a similar experience when trying to use the Projections.rowCount() in conjunction with a groupBy expression. I was able to circumvent things in a slightly 'hacky' manner by:
Remembering the previous projection and result transformer
Setting the projection on the Criteria to be a modified version (see below)
Performing the row count DB hit
Restoring the previous projection and transformer so the Criteria can be used for actual result retrieval if needed
final Projection originalProjection = criteriaImpl.getProjection();
final ResultTransformer originalResultTransformer =
criteriaImpl.getResultTransformer();
final Projection rowCountProjection;
// If we identify that we have a function with a group by clause
// we need to handle it in a special fashion
if ( originalProjection != null && originalProjection.isGrouped() )
{
final CriteriaQueryTranslator criteriaQueryTranslator =
new CriteriaQueryTranslator(
(SessionFactoryImplementor)mySessionFactory,
criteriaImpl,
criteriaImpl.getEntityOrClassName(),
CriteriaQueryTranslator.ROOT_SQL_ALIAS );
rowCountProjection = Projections.projectionList()
.add( Projections.rowCount() )
.add( Projections.sqlGroupProjection(
// This looks stupid but is seemingly required to ensure we have a valid query
"count(count(1))",
criteriaQueryTranslator.getGroupBy(),
new String[]{}, new Type[]{} ) );
}
else
{
rowCountProjection = Projections.rowCount();
}
// Get total count of elements by setting a count projection
final Long rowCount =
(Long)criteria.setProjection( rowCountProjection ).uniqueResult();
A few caveats here:
This still won't give the expected results if you give it a criteria with a single sum projection, as that is not considered an isGrouped() projection: it will overwrite the sum with a count. I don't consider this an issue, because getting the row count for an expression of that nature probably doesn't make sense.
When I was dealing with this, I wrote some unit tests to make sure the row count was as expected without projections, with property-based projections, and with group-by projections. However, I've written this from memory, so I can't guarantee that small kinks won't need ironing out.
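The snippet above does not show the restore step from the list. One way to make sure it always happens is a try/finally block. The sketch below shows only that pattern: FakeCriteria is a hypothetical stand-in with a single String field, not the Hibernate Criteria type, and withProjection is an invented helper name.

```java
import java.util.function.Function;

// Hypothetical illustration of remember / override / run / restore; not Hibernate code.
final class ProjectionSwapSketch {
    private ProjectionSwapSketch() {}

    /** Stand-in for a Criteria: holds only a mutable projection description. */
    static final class FakeCriteria {
        String projection;

        FakeCriteria(String projection) {
            this.projection = projection;
        }
    }

    /** Runs action with a temporary projection and restores the original afterwards. */
    static <R> R withProjection(FakeCriteria criteria, String temporary,
                                Function<FakeCriteria, R> action) {
        String original = criteria.projection;   // 1. remember
        criteria.projection = temporary;         // 2. override
        try {
            return action.apply(criteria);       // 3. run the count
        } finally {
            criteria.projection = original;      // 4. restore, even if the count throws
        }
    }
}
```

With the real Criteria, the remembered state would be the projection and the result transformer, as described in the steps above.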

How do I call this object to return all strings it finds?

I have the following code that defines a getParts method to find a given Part Name and Part Number in the system. Note that this code comes from our system's API, so if no one can help I'll just delete this question. I figured someone could potentially see a solution or help me along the way.
<%! private QueryResult getParts( String name, String number )
throws WTException, WTPropertyVetoException {
Class cname = wt.part.WTPart.class;
QuerySpec qs = new QuerySpec(cname);
QueryResult qr = null;
qs.appendWhere
(new SearchCondition(cname,
"master>name",
SearchCondition.EQUAL,
name,
false));
qs.appendAnd();
qs.appendWhere
(new SearchCondition(cname,
"master>number",
SearchCondition.EQUAL,
number,
false));
qr = PersistenceHelper.manager.find(qs);
System.out.println("...found: " + qr.size());
return qr;
}
%>
But I would like to allow the user more flexibility in finding these parts. So I set up conditional statements to check for a radio button. This allows them to search by part name and part number, find all, or search using a wildcard. However, I'm having trouble implementing the two latter options.
To attempt to accomplish the above, I have written the below code:
<%
String partName = request.getParameter("nameInput");
String partNumber = request.getParameter("numberInput");
String searchMethod = request.getParameter("selection");
//out.print(searchMethod);
QueryResult myResult = new QueryResult();
if(searchMethod.equals("search"))
myResult = getParts(partName, partNumber);
else if(searchMethod.equals("all"))
{
//Should I write a new function and do this?
//myResult = getAllParts();
//or is there a way I could use a for each loop to accomplish this?
}
//else if(searchMethod.equals("wildcard"))
//get parts matching %wildcard%
while(myResult.hasMoreElements())
{
out.print(myResult.nextElement().toString());
}
%>
Basically, it accepts user input and checks what type of search they would like to perform. Is there an easy way to pass all the values into the myResult object? And likewise for the wildcard search? Like I said before, it may be futile trying to help without access to the API, but hopefully it isn't.
Thanks!
You can (and should) reuse the function, but in order to do so, you will need a part name and number (as those are its input parameters). So for the multi-result options you will need to get a list/collection of part names and numbers, feed them individually to the function, and then collect the results in the format that is most appropriate for your needs.
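A minimal sketch of that idea follows, with the caveat that it knows nothing about the actual API: PartSearchSketch and findAll are hypothetical names, and the lookup parameter stands in for a function like getParts that takes a name and a number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical illustration; the lookup function stands in for getParts(name, number).
final class PartSearchSketch {
    private PartSearchSketch() {}

    /** Calls lookup once per {name, number} pair and collects all results in order. */
    static <T> List<T> findAll(List<String[]> nameNumberPairs,
                               BiFunction<String, String, List<T>> lookup) {
        List<T> collected = new ArrayList<>();
        for (String[] pair : nameNumberPairs) {
            collected.addAll(lookup.apply(pair[0], pair[1]));
        }
        return collected;
    }
}
```

Where the list of name and number pairs comes from for the "all" and wildcard cases depends on the system's API and is left open here.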

Hibernate memory management

I have an application that uses hibernate. At one part I am trying to retrieve documents. Each document has an account number. The model looks something like this:
private Long _id;
private String _acct;
private String _message;
private String _document;
private String _doctype;
private Date _review_date;
I then retrieve the documents with a document service. A portion of the code is here:
public List<Doc_table> getDocuments(int hours_, int dummyFlag_,List<String> accts) {
List<Doc_table> documents = new ArrayList<Doc_table>();
Session session = null;
Criteria criteria = null;
try {
// Lets create a previous Date by subtracting the number of
// subtractHours_ passed.
session = HibernateUtil.getSession();
session.beginTransaction();
if (accts == null) {
Calendar cutoffTime = Calendar.getInstance();
cutoffTime.add(Calendar.HOUR_OF_DAY, hours_);
criteria = session.createCriteria(Doc_table.class).add(
Restrictions.gt("dbcreate_date", cutoffTime.getTime()))
.add(Restrictions.eq("dummyflag", dummyFlag_));
} else
{ criteria = session.createCriteria(Doc_table.class).add(Restrictions.in("acct", accts));
}
documents = criteria.list();
for (int x = 0; x < documents.size(); x++) {
Doc_table document = documents.get(x);
......... more stuff here
}
This works great if I'm retrieving a small number of documents. But when the document size is large I get a heap space error, probably because the documents take up a lot of space and when you retrieve several thousand of them, bad things happen.
All I really want to do is retrieve each document that fits my criteria, grab the account number and return a list of account numbers (a far smaller object than a list of objects). If this were jdbc, I would know exactly what to do.
But in this case I'm stumped. I guess I'm looking for a way to bring back just the account numbers of the Doc_table objects.
Or alternatively, some way where I can retrieve documents one at a time from the database using hibernate that fit my criteria (instead of bringing back the whole List of objects which uses too much memory).
There are several ways to deal with the problem:
Loading the docs in batches of a smaller size
(The way you noticed) not querying for the Document, but only for the account numbers:
List accts = session.createQuery("SELECT d._acct FROM Doc d WHERE ...").list();
or
List<String> accts = session.createCriteria(Doc.class).
setProjection(Projections.property("_acct")).
list();
If there is a special field in your Document class that contains the huge document byte data, you could map this field as a lazy-loaded field.
Create a second (read-only) entity class that contains only the fields you need and map it to the same table
Instead of fetching all documents, i.e. all records, at once, try to limit the rows being fetched. Also, consider a strategy where you store documents temporarily as flat files and fetch them later or delete them after use. Though it is a somewhat longer process, it is an efficient way of handling and delivering documents from the database.
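The batching option can be sketched independently of Hibernate. In the example below, BatchSketch, Doc and collectAccounts are hypothetical names, and fetchBatch is an assumed function taking an offset and a limit; with a Criteria it would correspond to setFirstResult and setMaxResults. Only the account numbers are kept, so at most one batch of full documents is referenced at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical illustration of batch-wise loading; not Hibernate code.
final class BatchSketch {
    private BatchSketch() {}

    /** Stand-in for the mapped entity: a small key plus a potentially huge payload. */
    static final class Doc {
        final String acct;
        final String document;

        Doc(String acct, String document) {
            this.acct = acct;
            this.document = document;
        }
    }

    /** Walks the data in batches and keeps only the account numbers. */
    static List<String> collectAccounts(BiFunction<Integer, Integer, List<Doc>> fetchBatch,
                                        int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> accounts = new ArrayList<>();
        int offset = 0;
        while (true) {
            List<Doc> batch = fetchBatch.apply(offset, batchSize);
            for (Doc doc : batch) {
                accounts.add(doc.acct);
            }
            if (batch.size() < batchSize) {
                break; // a short (or empty) batch means there is nothing left
            }
            offset += batch.size();
        }
        return accounts;
    }
}
```

In a real Hibernate session the loaded entities would also need to be evicted or the session cleared between batches, otherwise the session cache still holds on to them.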
