Fuzzy Queries in Lucene - java

I am using Lucene in Java and indexing a table in our database by company name. After indexing I wish to do a fuzzy match (Levenshtein distance) against a value we want to insert into the database. The reason is that we do not want to enter duplicates caused by spelling errors.
For example if I have the company name "Widget Makers XYZ" I don't want to insert "Widget Maker XYZ".
From what I've read, Lucene's fuzzy match algorithm should give me a number between 0 and 1. I want to do some testing and then determine an adequate threshold for deciding what is valid or invalid.
The problem is I am stuck, and after searching what seems like everywhere on the internet, I need the Stack Overflow community's help.
Like I said, I have indexed the database on company name, and then have the following code:
IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser(Version.LUCENE_30, "company", analyzer); // note: this parser is never actually used for the FuzzyQuery below
Query fuzzy_query = new FuzzyQuery(new Term("company", "Center"));
I encounter the problem afterwards: basically, I do not know how to get the fuzzy match value. I know the code must look something like the following, however no collectors seem to fit my needs. (As you can see, right now I am only able to count the number of matches, which is useless to me.)
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(fuzzy_query, collector);
System.out.println("\ncollector.getTotalHits() = " + collector.getTotalHits());
Also I am unable to use the ComplexPhraseQueryParser class which is shown in the Lucene documentation. I am doing:
import org.apache.lucene.queryParser.*;
Does anybody have an idea as to why it's inaccessible, or what I am doing wrong? Apologies for the length of the question.

You do not need Lucene to get the score. Take a look at the Simmetrics library; it is exceedingly simple to use. Just add the jar and use it thus:
Levenstein ld = new Levenstein();
float sim = ld.getSimilarity(string1, string2);
Also do note that, depending on the type of data (e.g. longer strings, number of whitespaces, etc.), you might want to look at other algorithms such as Jaro-Winkler, Smith-Waterman, etc.
You could use the above to decide whether to collapse fuzzy duplicate strings into one "master" string, and then index.
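For instance, a minimal sketch of that pre-insert check (the 0.8 threshold, the class name and the existingNames list are my own illustration, not part of the original answer):
import uk.ac.shef.wit.simmetrics.similaritymetrics.Levenstein;
import java.util.List;

public class DupeCheck {
    private static final float THRESHOLD = 0.8f; // hypothetical cutoff; tune it against known duplicates

    // Returns the existing name to collapse into, or null if the candidate looks new.
    static String findFuzzyDuplicate(String candidate, List<String> existingNames) {
        Levenstein ld = new Levenstein();
        for (String existing : existingNames) {
            // getSimilarity returns a normalized score in [0, 1]
            if (ld.getSimilarity(candidate, existing) >= THRESHOLD) {
                return existing; // treat the candidate as a spelling variant of this name
            }
        }
        return null; // no close match; safe to insert
    }
}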

You can get the match values with:
TopDocs topDocs = collector.topDocs();
for(ScoreDoc scoreDoc : topDocs.scoreDocs) {
System.out.println(scoreDoc.score);
}
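Putting the pieces together, a minimal end-to-end sketch against the Lucene 3.0 API used in the question (the 0.7f minimum similarity is an illustrative assumption, not a recommended value):
IndexSearcher searcher = new IndexSearcher(directory);
// In Lucene 3.x the second FuzzyQuery argument is the minimum similarity (0..1);
// terms further away than this are not considered at all
Query fuzzyQuery = new FuzzyQuery(new Term("company", "Center"), 0.7f);
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(fuzzyQuery, collector);
for (ScoreDoc scoreDoc : collector.topDocs().scoreDocs) {
    Document doc = searcher.doc(scoreDoc.doc);
    // Note: scoreDoc.score is a relevance score, not the raw Levenshtein
    // similarity, so calibrate your accept/reject cutoff empirically
    System.out.println(doc.get("company") + " -> " + scoreDoc.score);
}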

Related

Iterate over large collection in mongo [duplicate]

I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only a few milliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How can I make this better?
One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not) would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({created_date:true});
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient (presuming you have an index on created_date) query
db.myCollection.find({created_date : { $gt : max_created_date_from_last_result } }).limit(100).sort({created_date:true});
From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself: how often do you need the 40000th page? Also see this article.
I found it performant to combine the two concepts (a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is that you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
Know the starting value, page size and the page you want to jump to
Project + skip + limit the value you should start from
Find + limit the page's results
It looks roughly like this if I want to get page 5432 with 16 records per page (in JavaScript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;
let retval = await db.collection(...).find().sort({ "_id": 1 }).project({ "_id": 1 }).skip(skip_size).limit(1).toArray();
let start_id = retval[0]._id;
retval = await db.collection(...).find({ "_id": { "$gte": new mongo.ObjectID(start_id) } }).sort({ "_id": 1 }).project(...).limit(page_size).toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still shows a large number for totalDocsExamined, but because of the projection on an index it's extremely fast (essentially, the data blobs are never examined). Then, with the value for the start of the page in hand, you can fetch the next page very quickly.
I combined two answers.
The problem is that when you use skip and limit without sort, the pagination just follows the table's insertion order, so the engine needs to build a temporary index first. It is better to use the ready-made _id index: sort by _id. Then it is very quick even with large tables, like:
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it would be:
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
'sort' => array('_id' => 1),
'limit' => $limit,
'skip' => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options );
$get = $manager->executeQuery("namedb.namecollection", $query);
I'm going to suggest a more radical approach. Combine skip/limit (as an edge case really) with sort-range-based buckets, and base the pages not on a fixed number of documents but on a range of time (or whatever your sort is). So you have top-level pages that are each a range of time, and you have sub-pages within that range of time if you need to skip/limit, but I suspect the buckets can be made small enough not to need skip/limit at all. By using the sort index this avoids the cursor traversing the entire inventory to reach the final page.
My collection has around 1.3M documents (not that big), properly indexed, but still takes a big performance hit by the issue.
After reading other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to SQL's auto-increment value, instead of a time-based value.
The problem is with skip; there is no way around it: if you use skip, you are bound to hit the issue when your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with a time-based value, because you can't calculate where to jump based on time, so skipping is the only option in that case.
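As a sketch of that jump (using the MongoDB Java driver for illustration; the seq counter field, database name and page numbers are my own assumptions):
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Sorts;
import org.bson.Document;
import static com.mongodb.client.model.Filters.gte;

public class CounterPaging {
    public static void main(String[] args) {
        MongoCollection<Document> coll = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("mydb").getCollection("myCollection");
        int page = 5432, pageSize = 16;
        // With an index on seq this is a range seek straight to the page start:
        // no skip, no scan of the preceding documents
        for (Document d : coll.find(gte("seq", (long) page * pageSize))
                              .sort(Sorts.ascending("seq"))
                              .limit(pageSize)) {
            System.out.println(d.toJson());
        }
    }
}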
On the other hand, assigning a counting number to each document hurts write performance, because all documents must be inserted sequentially. This is fine for my use case, but I know the solution is not for everyone.
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward by arbitrary page number, not just one at a time.)
Plus, it is also hard if you are dealing with deletes, but still possible because MongoDB supports $inc with a negative value for batch updates. Luckily I don't have to deal with deletion in the app I am maintaining.
Just writing this down as a note to my future self. It is probably too much hassle to fix this issue in the current application I am dealing with, but next time I'll build a better one if I encounter a similar situation.
If you have MongoDB's default _id, which is an ObjectId, use it instead. This is probably the most viable option for most projects anyway.
As stated in the official mongo docs:
The skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents, typically yielding better performance as the offset grows compared to using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
    let endValue = null;
    db.students.find( { _id: { $lt: startValue } } )
        .sort( { _id: -1 } )
        .limit( nPerPage )
        .forEach( student => {
            print( student.name );
            endValue = student._id;
        } );
    return endValue;
}
The ascending-order version is analogous: flip the comparison to $gt and the sort to { _id: 1 }.
If you know the ID of the element from which you want to continue:
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a little genius solution which works like a charm.
For faster pagination don't use the skip() function. Use limit() and find(), and query on the last id of the preceding page.
Here is an example where I'm querying over tons of documents using Spring Boot:
Long totalElements = mongockTemplate.count(new Query(), "product");
int page = 0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; // this is the last id of the preceding page
for (int i = 0; i < (totalElements / pageSize); i++) {
    page += 1;
    Aggregation aggregation = Aggregation.newAggregation(
        Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
        Aggregation.sort(Sort.Direction.ASC, "_id"),
        new CustomAggregationOperation(queryOffersByProduct),
        Aggregation.limit((long) pageSize)
    );
    List<ProductGroupedOfferDTO> productGroupedOfferDTOS =
        mongockTemplate.aggregate(aggregation, "product", ProductGroupedOfferDTO.class).getMappedResults();
    lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size() - 1).getId();
}

My method to extract fuzzy matched terms is crashing when the number of possible matches is large, how can I prevent this?

Using Lucene 8.6.3; see the end of this post for how to run an example.
I need to get a list of indexed words which match a fuzzy search expression, and my code is fine for small numbers of matches but throws an exception when the fuzzy term potentially matches too many words. It's possible that I am using the wrong technique, or that my implementation needs some defensive code somewhere.
The query is read as a text query:
text:sle~1
and it returns the correct number of matches from the index, through this code:
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Map<String, Analyzer> analyzerPerField = new HashMap<String, Analyzer>();
analyzerPerField.put("version", new VersionAnalyzer());
PerFieldAnalyzerWrapper analyzerWrapper = new PerFieldAnalyzerWrapper(new TechAnalyzer(), analyzerPerField);
QueryParser parser = new QueryParser(field, analyzerWrapper);
parser.setDefaultOperator(QueryParserBase.AND_OPERATOR);
Query query = parser.parse(line); // line read earlier from input stream

int max = reader.numDocs();
TopDocs results = searcher.search(query, max);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = Math.toIntExact(results.totalHits.value);
HashSet<String> matchedWords = new HashSet<String>();
int end = Math.min(numTotalHits, max);
for (int i = 0; i < end; i++) {
    saveMatchedWords(query, searcher, hits[i].doc, matchedWords, false);
}
The method which captures the matching fuzzy terms is here:
private static void saveMatchedWords(Query query, IndexSearcher searcher, int docId,
                                     HashSet<String> matchedWords, boolean fuzzy) throws IOException {
    boolean inFuzzy = fuzzy;
    if (query instanceof FuzzyQuery)
        inFuzzy = true;
    if (query instanceof TermQuery) {
        if (inFuzzy && searcher.explain(query, docId).isMatch())
            matchedWords.add(((TermQuery) query).getTerm().toString().split(":")[1]);
    }
    else if (query instanceof BooleanQuery) {
        for (BooleanClause clause : (BooleanQuery) query) {
            saveMatchedWords(clause.getQuery(), searcher, docId, matchedWords, inFuzzy);
        }
    }
    else if (query instanceof MultiTermQuery) {
        ((MultiTermQuery) query).setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_REWRITE);
        saveMatchedWords(query.rewrite(searcher.getIndexReader()), searcher, docId, matchedWords, inFuzzy);
    }
    else if (query instanceof BoostQuery) {
        saveMatchedWords(((BoostQuery) query).getQuery(), searcher, docId, matchedWords, inFuzzy);
    }
}
When I run fuzzy searches like sle~1 the query returns 165 matching documents, and the matchedWords set contains 13 words (in this case, "sale, see, sole, ...").
The problem comes when the fuzzy term matches too many words. I am not sure what the cutoff is, but the largest set of matchedWords I have seen without an exception contained 78 words.
In this example, if we change the fuzzy term to sle~ (i.e. a distance of 2) then the query executes and returns 422 matching documents, but the saveMatchedWords call throws an exception:
Exception in thread "main" java.lang.IllegalArgumentException: boost must be a positive float, got -1.0
at org.apache.lucene.search.BoostQuery.<init>(BoostQuery.java:44)
at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:69)
at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:54)
at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:117)
at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:313)
at myDemo.SearchIndex.saveMatchedWords(SearchIndex.java:262)
at myDemo.SearchIndex.saveMatchedWords(SearchIndex.java:247)
I think the error it encounters is misleading, because if it were a problem in how I am extracting the terms, then it wouldn't only happen when the number of terms climbs to 80 or more. More likely something somewhere is overflowing, but I'm not sure what.
The example I've given is a bit artificial because searching for such a short term with fuzzy distance 2 is practically useless, but it has revealed that my code is not nearly robust enough as it stands.
Can anybody offer some pointers here that might help?
Cheers
T
Edit:
I have set up a small Ant application which illustrates this behaviour.
Download the zipped archive and unzip it into a directory called "mydemo".
(That download is approx 5MB because it includes the three Lucene libraries I have used. If you prefer to use your own Lucene, then there is a smaller zip (230KB), but you will have to modify the build appropriately).
I'm assuming you have Ant, but if not, it should be straightforward to figure out what to do from looking at the targets in build.xml.
From the "mydemo" folder, type "ant index" which builds the index from the supplied data files (which are just long lists of words).
Next type "ant search". This queries the index with the search text "sle~1", and it finds 8 matching files, listing their paths in data/query.out, with 18 fuzzy matched words which are listed in data/query.out.matches.
Now edit the file data/query.in, remove the final 1, and save the file. The search text is now "sle~", which will match far more words. In the mydemo folder, type "ant search" again.
You should see that the query now matches 12 files, but the code which stores the matched words throws the exception.
Edit 2:
This has been an interesting exercise... I have found that the exception is happening because the QueryParser has (correctly) parsed the input into a FuzzyQuery, and the query.rewrite() called in my saveMatchedWords method has attempted to rewrite this as an N-term Boolean query, where N is not the number of matches in the document but the number of matches in the entire index. For a distance=2 fuzzy match this is potentially a very large number indeed.
I tested BooleanQuery.setMaxClauseCount(num) with various values to see if increasing the limit fixed the problem, but even with values greater than the total number of indexed terms I still got the exception. So something is overflowing, but probably not the maximum number of Boolean clauses.
However, because this code is iterating over the total number of potential matched terms once for every single matched document, it really isn't a suitable technique for acquiring the fuzzy matching terms, even though it seems to work when the numbers it is dealing with are relatively small. If it doesn't scale, it isn't going to meet the need.
Back to the drawing board.
Edit 3:
I have now removed the download links (as the files are no longer available at the url I provided).
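A possible way forward, sketched against the Lucene 8.x API (untested, and the field handling is simplified): instead of rewriting the query once per matching document, enumerate the fuzzy-matching terms a single time, directly from the term dictionary, by intersecting it with a Levenshtein automaton. This mirrors what FuzzyQuery does internally, without ever materializing a giant Boolean query:
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CompiledAutomaton;
import org.apache.lucene.util.automaton.LevenshteinAutomata;

static Set<String> fuzzyMatchedTerms(IndexReader reader, String field,
                                     String text, int maxEdits) throws IOException {
    Set<String> matched = new HashSet<>();
    Terms terms = MultiTerms.getTerms(reader, field);
    if (terms == null) return matched;
    // Automaton accepting every string within maxEdits (at most 2) of text
    Automaton a = new LevenshteinAutomata(text, true).toAutomaton(maxEdits);
    TermsEnum te = terms.intersect(new CompiledAutomaton(a), null);
    for (BytesRef term = te.next(); term != null; term = te.next()) {
        matched.add(term.utf8ToString());
    }
    return matched;
}
This walks the term dictionary once for the whole index rather than once per hit, so its cost no longer grows with the number of matching documents.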

Lucene Solr using complex filters

I am currently having a problem specifying filters for Lucene/Solr. Every solution I come up with breaks other solutions. Let me start with an example. Assume that we have the following 5 documents:
doc1 = [type:Car, sold:false, owner:John]
doc2 = [type:Bike, productID:1, owner:Brian]
doc3 = [type:Car, sold:true, owner:Mike]
doc4 = [type:Bike, productID:2, owner:Josh]
doc5 = [type:Car, sold:false, owner:John]
So I need to construct the following filter queries:
Give me all documents of type:Car which have sold:false only, and if a document's type is different from Car, include it in the result. So basically I want docs 1, 2, 4, 5; the only document I don't want is doc3, because it has sold:true. To put it more precisely:
for each document d in solr/lucene
    if d.type == Car {
        if d.sold == false, then add to result
        else ignore
    }
    else {
        add to result
    }
return result
Filter in all documents that are of (type:Car and sold:false) or (type:Bike and productID:1). So for this I will get 1,2,5.
Get all documents that if the type:Car then get only with sold:false, otherwise get me documents from owners John, Brian, Josh. So for this query I should get 1, 2, 4, 5.
Note: You don't know all the types in the documents. Here it is obvious because of the small number of documents.
So my solutions were:
(-type:Car) OR ((type:Car) AND (sold:false)). This works fine and as expected.
((-type:Car) OR ((type:Car) AND (sold:false))) AND ((-type:Bike) OR ((type:Bike) AND (productID:1))). This solution does not work.
((owner:John) OR (owner:Brian) OR (owner:Josh)) AND ((-type:Car) OR ((type:Car) AND (sold:false))). This does not work. I can make it work if I do this: ((owner:John) OR (owner:Brian) OR (owner:Josh)) AND ((version:* OR (-type:Car)) OR ((type:Car) AND (sold:false))). I don't understand how this works, because logically it should work, but Solr/Lucene somehow does something different.
Okay, to get anything but a sold car, you could use -(type:Car sold:true).
This can be incorporated into the other queries, but you'll need to be careful with lonely negative queries like this. Lucene doesn't handle them well, generally speaking, and Solr has some odd gotchas as well. In particular, A -B reads more like "get all A but forbid B" rather than "get all A and anything but B". Similar problem with A OR -B; see this question for more.
To get around that, you'll need to surround the negative with an extra set of parentheses, to ensure it is understood by Solr to be a standalone negative query, like: (-(type:Car AND sold:true))
So:
-(type:Car AND sold:true) (This doesn't get the result you stated, but as per my comment, I don't really understand your stated results)
(type:Bike AND productID:1) (-(type:Car AND sold:true)) (You actually wrote this in the description of the problem!)
(-(type:Car AND sold:true)) owner:(John Brian Josh)
My advice is to use programmatic Lucene (that is, directly in Java using the Java Lucene API) rather than issuing text queries which will be interpreted. This will give you much more fine-grained control.
What you're going to want to do is construct a Lucene Filter Object using the QueryWrapperFilter API. A QueryWrapperFilter is a filter which takes a Lucene Query, and filters out any documents which do not match that query.
In order to use QueryWrapperFilter, you'll need to construct a Query which matches the terms you're interested in. The best way to do this is to use TermQuery:
TermQuery tq = new TermQuery(new Term("fieldname", "value"));
As you might have guessed, you'll want to replace "fieldname" with the name of a field, and "value" with a desired value. For example, from your example in the OP, you might want to do something like new Term("type", "Car").
This only matches a single term. You're going to need multiple TermQueries, and a way to combine them to create a single, larger query. The best way to do this is with BooleanQuery:
BooleanQuery bq = new BooleanQuery();
bq.add(tq, BooleanClause.Occur.MUST);
You can call bq.add as many times as you want - once for each TermQuery that you have. The second argument specifies how strict the query is: a sub-query MUST appear, SHOULD appear, or MUST_NOT appear (these are the three values of the BooleanClause.Occur enum).
After you've added each of the sub-queries, this BooleanQuery represents the full query which will match only the documents you ask for. However, it's still not a filter. We now need to feed it to QueryWrapperFilter, which will give us back a filter object:
QueryWrapperFilter qwf = new QueryWrapperFilter(bq);
That should do it. Then if you want to run queries over only the documents allowed through by that filter, you just take your new query (call it q) and your filter, and create a FilteredQuery:
FilteredQuery fq = new FilteredQuery(q, qwf);
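Tying those pieces together for the first filter in the question ("anything but a sold car"), a rough sketch against the same 3.x-era API (purely illustrative; the field and value spellings must match what your analyzer actually indexed):
// Sub-query matching the documents to exclude: type:Car AND sold:true
BooleanQuery soldCars = new BooleanQuery();
soldCars.add(new TermQuery(new Term("type", "Car")), BooleanClause.Occur.MUST);
soldCars.add(new TermQuery(new Term("sold", "true")), BooleanClause.Occur.MUST);

// A purely negative clause matches nothing on its own, so pair the
// exclusion with a MatchAllDocsQuery
BooleanQuery allButSoldCars = new BooleanQuery();
allButSoldCars.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
allButSoldCars.add(soldCars, BooleanClause.Occur.MUST_NOT);

QueryWrapperFilter filter = new QueryWrapperFilter(allButSoldCars);
FilteredQuery fq = new FilteredQuery(q, filter); // q is your main query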

Handling large search queries on relatively small index documents in Lucene

I'm working on a project where we index relatively small documents/sentences, and we want to search these indexes using large documents as queries. Here is a relatively simple example:
I'm indexing this document:
docId : 1
text: "back to black"
And I want to query using the following input:
"Released on 25 July 1980, Back in Black was the first AC/DC album recorded without former lead singer Bon Scott, who died on 19 February at the age of 33, and was dedicated to him."
What is the best approach for this in Lucene? For simple examples, where the text I want to find is exactly the input query, I get better results using my own analyzer + a PhraseQuery than using QueryParser.parse(QueryParser.escape(...my large input...)), which ends up creating a big Boolean/Term query.
But I can't use a PhraseQuery approach for a real-world example; I think I have to use a word n-gram approach like the ShingleAnalyzerWrapper, but as my input documents can be quite large, the combinatorics will become hard to handle...
In other words, I'm stuck and any idea would be greatly appreciated :)
P.S. I didn't mention it, but one of the annoying things with indexing small documents is that, due to the "norms" value (a float) being encoded on only 1 byte, all 3-4 word sentences get the same norm value, so searching for sentences like "A B C" makes results "A B C" and "A B C D" show up with the same score.
Thanks!
I don't know how many sentences you have, but you may want to invert the problem: store your sentences as queries, index incoming documents in a transient in-memory index, and run all your queries against it to find the matching ones.
(Note: this is how Elasticsearch's percolator works.)
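A rough sketch of that inversion using Lucene's MemoryIndex (the field name, analyzer and sentenceQueries map are placeholders, not a definitive implementation):
import java.util.Map;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

// sentenceQueries: one pre-built Query per stored sentence, e.g. a PhraseQuery per sentence id
MemoryIndex mi = new MemoryIndex();
mi.addField("text", largeIncomingDocument, analyzer);
for (Map.Entry<Integer, Query> entry : sentenceQueries.entrySet()) {
    float score = mi.search(entry.getValue()); // returns 0.0f when the query does not match
    if (score > 0.0f) {
        System.out.println("sentence " + entry.getKey() + " matched, score=" + score);
    }
}
mi.reset(); // MemoryIndex holds a single document; reset it before the next incoming document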
Edit (2013-06-21):
If you have a very large number of sentences, it might still be better to store the sentences in an index. But instead of using phrase queries, you could try indexing with Lucene's ShingleFilter. At query time, your approach of building the query manually instead of using QueryParser is the right one, but if you index shingles, you can build a pure boolean query where each clause matches a shingle, instead of a phrase query. A sketch of that follows.
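Sketched against the 4.x-era API (the field name, shingle sizes and base analyzer are illustrative):
// Index time: wrap your analyzer so 2- and 3-word shingles are stored as single terms
Analyzer shingleAnalyzer = new ShingleAnalyzerWrapper(baseAnalyzer, 2, 3);

// Query time: tokenize the large input with the same analyzer and OR the shingles together
BooleanQuery bq = new BooleanQuery();
TokenStream ts = shingleAnalyzer.tokenStream("text", new StringReader(largeInput));
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    bq.add(new TermQuery(new Term("text", termAtt.toString())), BooleanClause.Occur.SHOULD);
}
ts.end();
ts.close();
For very large inputs you may also have to raise BooleanQuery.setMaxClauseCount, or deduplicate the shingles before adding clauses.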

Lucene: Searching multiple fields with default operator = AND

To allow users to search across multiple fields with Lucene 3.5, I currently create and add a QueryParser for each field to be searched to a DisjunctionMaxQuery. This works great when using OR as the default operator, but I now want to change the default operator to AND to get more accurate (and fewer) results.
Problem is, queryParser.setDefaultOperator(QueryParser.AND_OPERATOR) misses many documents, since all the terms must then appear together in at least one field.
For example, consider the following data for a document: title field = "Programming Languages", body field = "Java, C++, PHP". If a user were to search for Java Programming, this particular document would not be included in the results, since neither the title nor the body field contains all terms in the query, although combined they do. I would want this document returned for the above query, but not for the query HTML Programming.
I've considered a catchall field, but I have a few problems with it. First, users frequently include per-field terms in their queries (author:bill), which is not possible with a catchall field. Also, I highlight certain fields with FastVectorHighlighter, which requires them to be indexed and stored. So by adding a catchall field I would have to index most of the same data twice, which is time and space consuming.
Any ideas?
Guess I should have done a little more research. It turns out MultiFieldQueryParser provides the exact functionality I was looking for. For whatever reason, I was creating a QueryParser for each field I wanted to search, like this:
String[] fields = {"title", "body", "subject", "author"};
QueryParser[] parsers = new QueryParser[fields.length];
for (int i = 0; i < parsers.length; i++) {
    parsers[i] = new QueryParser(Version.LUCENE_35, fields[i], analyzer);
    parsers[i].setDefaultOperator(QueryParser.AND_OPERATOR);
}
This would result in a query like this:
(+title:java +title:programming) | (+body:java +body:programming)
...which is not what I was looking for. Now I create a single MultiFieldQueryParser like this:
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_35, new String[]{"title", "body", "subject"}, analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
This gives me the query I was looking for:
+(title:java body:java) +(title:programming body:programming)
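For reference, a rough sketch of the parser in use (the search string is illustrative, and parse() throws ParseException):
Query query = parser.parse("java programming");
System.out.println(query); // +(title:java body:java subject:java) +(title:programming body:programming subject:programming)
TopDocs topDocs = searcher.search(query, 10);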
Thanks to @seeta and @femtoRgon for the help!
Perhaps what you need is a combination of Boolean queries that capture the different combinations of fields and terms. In your given example, the query could be:
(title:Java AND body:programming) OR (title:programming AND body:Java).
I don't know if there's an existing Query class that generates this automatically for you, but I think that's what should be the ultimate query that's run on the index.
You want to be able to search multiple fields with the same set of terms; then the query from your comment:
((title:java title:programming) | (body:java body:programming))~0.2
May not be the best implementation.
You're effectively getting either the score from the title, or the score from the body for the combined set of terms. The case where you hit java in the title and programming in the body would be given approx. equal weight to a hit on java in the body and no hit on programming.
I think a better structured query would be:
(title:java body:java)~0.2 (title:programming body:programming)~0.2
This makes more sense to me, since you want the dismax queries to limit the score growing on multiple hits of the same term (in different fields), but you do want the score to grow for hits on different terms, I believe.
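Built programmatically rather than via the parser, that structure would look roughly like this in Lucene 3.5 (terms hard-coded for illustration):
// One dismax per term, spanning both fields; 0.2f is the tie-breaker multiplier
DisjunctionMaxQuery javaDmq = new DisjunctionMaxQuery(0.2f);
javaDmq.add(new TermQuery(new Term("title", "java")));
javaDmq.add(new TermQuery(new Term("body", "java")));

DisjunctionMaxQuery programmingDmq = new DisjunctionMaxQuery(0.2f);
programmingDmq.add(new TermQuery(new Term("title", "programming")));
programmingDmq.add(new TermQuery(new Term("body", "programming")));

BooleanQuery query = new BooleanQuery();
query.add(javaDmq, BooleanClause.Occur.MUST);        // MUST gives the AND-across-terms semantics
query.add(programmingDmq, BooleanClause.Occur.MUST);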
If that sort of query structure gets you better score results, limiting results to a certain minimum score (a percentage of the max score returned, rather than a simple hard-coded value) may be adequate to prevent too-weak results from being seen.
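A minimal sketch of that relative cutoff (the 0.5f fraction is an arbitrary placeholder to tune):
TopDocs topDocs = searcher.search(query, 50);
float minScore = 0.5f * topDocs.getMaxScore(); // keep hits within 50% of the best score
for (ScoreDoc sd : topDocs.scoreDocs) {
    if (sd.score >= minScore) {
        // accept this hit
    }
}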
I also still wouldn't count out indexing an all field. It's an implementation I've used before, indexing BOTH the specific fields and the catchall field, thus allowing both general querying and specific single-field queries. Index storage tends to be pretty lean for unstored terms, and it will generally help performance if you find yourself having to create big, complicated queries to make up for not having it.
If you really want to be sure that it takes minimal storage, you can even turn off TermVectors for that field:
new Field(name, value, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO);
Although I don't know how much of a difference that would really make.
