Hibernate Search fuzzy more than 2 - Java

I have a Java backend with Hibernate, Lucene, and Hibernate Search. Now I want to do a fuzzy query, BUT instead of 0, 1, or 2, I want to allow more "differences" between the query and the expected result (to compensate, for example, for misspellings in long words). Is there any way to achieve this? The maximum number of allowed differences will later be calculated from the length of the query.
What I want this for is an autocomplete search with correction of wrong letters. This autocomplete should only search for missing characters BEHIND the given query, not in front of it. If an entry has additional characters in front of the query, those should be counted as differences.
Examples:
Maximum allowed different characters in this example is 2.
fooo should match:
- fooo (no difference)
- fooobar (only characters added -> autocomplete)
- fouubar (characters added and misspelled -> autocomplete and spelling correction)
fooo should NOT match:
- barfooo (we only allow additional characters behind the query, but this example is less important)
- fuuu (more than 2 differences)
This is my current code for the full-text query:
FullTextEntityManager fullTextEntityManager = this.sqlService.getFullTextEntityManager();
QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory()
        .buildQueryBuilder()
        .forEntity(MY_CLASS.class)
        .overridesForField("name", "foo")
        .get();
Query query = queryBuilder.keyword()
        .fuzzy()
        .withEditDistanceUpTo(2)
        .onField("name")
        .matching("QUERY_TO_MATCH")
        .createQuery();
FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery(query, MY_CLASS.class);
List<MY_CLASS> results = fullTextQuery.getResultList();
Notes:
1. I use org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory for indexing, but that should not make any difference.
2. This uses a custom framework which is not open source. You can ignore the sqlService; it only provides the FullTextEntityManager and handles everything around Hibernate that does not require custom code each time.
3. This code already works, but only with withEditDistanceUpTo(2), which means a maximum of 2 "differences" between QUERY_TO_MATCH and the matching entry in the database or index. Missing characters also count as differences.
4. withEditDistanceUpTo(2) does not accept values greater than 2.
Does anyone have any ideas to achieve that?

I am not aware of any solution where you would specify an exact number of changes that are allowed.
That approach has serious drawbacks, anyway: what does it mean to match "foo" with up to 3 changes? Just match anything? As you can see, a solution that works with varying term lengths might be better.
One solution is to index n-grams. I'm not talking about edge-ngrams, like you already do, but actual ngrams extracted from the whole term, not just the edges. So when indexing 2-grams of foooo, you would index:
fo
oo (occurring multiple times)
And when querying, the term fouuu would be transformed to:
fo
ou
uu
... and it would match the indexed document, since they have at least one term in common (fo).
Obviously there are some drawbacks. With 2-grams, the term fuuuu wouldn't match foooo, but the term barfooo would, because they have a 2-gram in common. So you would get false positives. The longer the grams, the less likely you are to get false positives, but the less fuzzy your search will be.
You can make these false positives go away by relying on scoring and on a sort by score to place the best matches first in the result list. For example, you could configure the ngram filter to preserve the original term, so that fooo will be transformed to [fooo, fo, oo] instead of just [fo, oo]; an exact search for fooo will then score a document containing fooo higher than a document containing barfooo (since there are more matches). You could also set up multiple separate fields: one without ngrams, one with 3-grams, and one with 2-grams, and build a boolean query with one should clause per field: the more clauses are matched, the higher the score will be, and the higher the document will appear in the hits (see the sketch below).
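Here is a minimal sketch of such a boolean query using the Hibernate Search DSL from the question; the field names "name", "name_3gram", and "name_2gram" are assumptions standing in for the fields indexed without ngrams, with 3-grams, and with 2-grams:

Query exact = queryBuilder.keyword().onField("name").matching("fooo").createQuery();
Query threeGrams = queryBuilder.keyword().onField("name_3gram").matching("fooo").createQuery();
Query twoGrams = queryBuilder.keyword().onField("name_2gram").matching("fooo").createQuery();
// The more of these should clauses a document matches, the higher its score.
Query combined = queryBuilder.bool()
        .should(exact)
        .should(threeGrams)
        .should(twoGrams)
        .createQuery();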
Also, I'd argue that fooo and similar are really artificial examples and you're unlikely to have these terms in a real-world dataset; you should try whatever solution you come up with against a real dataset and see if it works well enough. If you want fuzzy search, you'll have to accept some false positives: the question is not whether they exist, but whether they are rare enough that users can still easily find what they are looking for.
In order to use ngrams, apply the n-gram filter using org.apache.lucene.analysis.ngram.NGramFilterFactory. Apply it both when indexing and when querying. Use the parameters minGramSize/maxGramSize to configure the size of ngrams, and keepShortTerm (true/false) to control whether to preserve the original term or not.
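For example, with Hibernate Search 5 annotations, an analyzer definition along these lines could apply the filter (a sketch; the analyzer name and gram sizes are illustrative assumptions):

import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.ngram.NGramFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.annotations.*;

// Sketch of an @AnalyzerDef applying NGramFilterFactory after lowercasing.
@AnalyzerDef(name = "ngramAnalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = NGramFilterFactory.class, params = {
            @Parameter(name = "minGramSize", value = "2"),
            @Parameter(name = "maxGramSize", value = "3")
        })
    })
public class MY_CLASS { /* ... */ }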
You may keep the edge-ngram filter or not; test whether it improves the relevance of your results. I suspect it may improve the relevance slightly if you use keepShortTerm = true. In any case, make sure to apply the edge-ngram filter before the ngram filter.

Ok, my friend and I found a solution.
We found a request in Lucene's issue tracker for the same feature, and we implemented a solution based on it:
There is a SlowFuzzyQuery in the Lucene sandbox module. It is slower (obviously), but it supports an edit distance greater than 2.
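A minimal sketch, assuming the Lucene 4.x sandbox module (org.apache.lucene.sandbox.queries.SlowFuzzyQuery), where a similarity value >= 1 is interpreted as a raw Levenshtein edit distance:

import org.apache.lucene.index.Term;
import org.apache.lucene.sandbox.queries.SlowFuzzyQuery;
import org.apache.lucene.search.Query;

// minimumSimilarity >= 1 is treated as an absolute edit distance,
// so this allows up to 3 edits between the query and indexed terms.
Query query = new SlowFuzzyQuery(new Term("name", "QUERY_TO_MATCH"), 3f);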

Related

Lucene BooleanQuery with multiple FuzzyQuery is too slow

A document holds a company's employee data, with multiple fields such as empName, empId, departmentId, etc.
Using a custom analyzer, I have indexed around 4 million documents.
The search query contains a list of employee names, and I know that all employees in the list belong to the same department. There are multiple departments in the company.
So I want to do a fuzzy search on all the employee names under a given department ID.
For this I am using a boolean query which looks like:
Query termQuery = new TermQuery(new Term("departmentId", "1234"));
BooleanQuery.Builder bld = new BooleanQuery.Builder();
for (String str : employeeNameList) {
    bld.add(new FuzzyQuery(new Term("name", str)), BooleanClause.Occur.SHOULD);
}
BooleanQuery bq = bld.build();
BooleanQuery finalBooleanQuery = new BooleanQuery.Builder()
        .add(termQuery, BooleanClause.Occur.MUST)
        .add(bq, BooleanClause.Occur.MUST)
        .build();
Now I pass finalBooleanQuery to the search method of IndexSearcher and get the results.
The problem is that it takes too much time: when the size of employeeNameList is more than 50, a search takes around 500 ms.
How can I reduce the time from 500 ms to 50 ms?
Is there any other solution for this problem?
If you take a look at the other constructors for FuzzyQuery, you'll see some easy ways to improve performance. Each additional argument is there to reduce the amount of work the FuzzyQuery does, and so to improve performance.
First, and most important:
Prefix length: I strongly recommend setting this to a non-zero value. This is how many characters at the beginning of the term will not be subject to fuzzy matching. So, if searching for "abc" with a prefix of 1, "abb" and "acc" would be matched, but not "bbc". This allows Lucene to use the index when attempting to find matching terms, instead of having to scan the whole term dictionary. You will likely see the largest performance improvement here. Many seem to find 2 to be a good balance point between performance and meeting search demands.
The rest of the available arguments can also help (see the sketch after this list):
maxEdits - 2 is the default and the maximum. Setting this to 1 will match less and, as such, work faster.
maxExpansions - Under the hood, this query finds terms that match the fuzzy parameters, then performs a search for those terms. If you are searching for short terms, especially, this list of matching terms could turn out to be very long. Setting maxExpansions will prevent these extremely long lists of matches from occurring. Default is 50.
transpositions - Whether swapping two characters is an allowed edit. Default is true. Basically, this is the difference between Levenshtein and Damerau-Levenshtein. false means less work and fewer matches, so it will perform better. I don't know if the difference will be that big, though.
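A sketch of the loop from the question using the full FuzzyQuery constructor; the concrete values are illustrative, not recommendations:

// FuzzyQuery(Term term, int maxEdits, int prefixLength, int maxExpansions, boolean transpositions)
// Reuses bld and employeeNameList from the question's code.
for (String str : employeeNameList) {
    bld.add(new FuzzyQuery(new Term("name", str), 2, 2, 50, false),
            BooleanClause.Occur.SHOULD);
}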

Fuzzy String matching of Strings in Java

I have a very large list of Strings stored in a NoSQL DB. The incoming query is a string, and I want to check whether this string is in the list or not. For an exact match this is very simple: the NoSQL DB can use the string as the primary key, and I just check whether there is any record with that string as the primary key. But I need to check for fuzzy matches as well.
One approach is to traverse every string in the list and check the Levenshtein distance of the input string against each of them, but this has O(n) complexity, and the size of the list is very large (10 million) and may increase further. This approach would result in high latency for my solution.
Is there a better way to solve this problem?
Fuzzy matching is complicated for the reasons you have discovered. Calculating a distance metric for every combination of search term against database term is impractical for performance reasons.
The solution to this is usually to use an n-gram index. This can either be used standalone to give a result, or as a filter to cut down the size of possible results so that you have fewer distance scores to calculate.
So basically, if you have the word "stack", you break it into n-grams (commonly trigrams) such as "sta", "tac", and "ack" (sometimes also shorter edge grams such as "st" and "ck"). You index those in your database against the database row. You then do the same for the input and look for the database rows that have matching n-grams.
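A minimal sketch of trigram extraction in Java; the class and method names are illustrative, not from any particular library:

import java.util.HashSet;
import java.util.Set;

class Trigrams {
    // Collect every 3-character window of the term.
    static Set<String> of(String term) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= term.length(); i++) {
            grams.add(term.substring(i, i + 3));
        }
        return grams;
    }
}
// Trigrams.of("stack") -> ["sta", "tac", "ack"]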
This is all complicated, and your best option is to use an existing implementation such as Lucene/Solr which will do the n-gram stuff for you. I haven't used it myself as I work with proprietary solutions, but there is a stackoverflow question that might be related:
Return only results that match enough NGrams with Solr
Some databases seem to implement n-gram matching. Here is a link to a Sybase page that provides some discussion of that:
Sybase n-gram text index
Unfortunately, a full discussion of n-grams would be a long post and I don't have time. It is probably discussed elsewhere on Stack Overflow and other sites; I suggest Googling the term and reading up on it.
First of all, if searching is what you're doing, then you should use a search engine (Elasticsearch is pretty much the default). They are good at this, and you are not re-inventing wheels.
Second, the technique you are looking for is called stemming. Along with the original string, save a normalized string in your DB, and normalize the search query with the same mechanism. That way you will get much better search results. This is, obviously, one of the techniques a search engine uses under the hood.
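A small illustrative sketch of such normalization; the exact steps chosen here (lowercasing, stripping diacritics, trimming) are assumptions, and real stemming would use a proper stemmer such as the ones in Lucene's analyzers:

import java.text.Normalizer;
import java.util.Locale;

// Apply the same function to both the stored string and the search query.
static String normalize(String s) {
    String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
    return decomposed.replaceAll("\\p{M}", "")   // strip diacritics
            .toLowerCase(Locale.ROOT)
            .trim();
}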
Could Solr (or Lucene) be a suitable solution for you?
Lucene supports fuzzy searches based on the Levenshtein distance (edit distance) algorithm. To do a fuzzy search, use the tilde ("~") symbol at the end of a single-word term. For example, to search for a term similar in spelling to "roam", use the fuzzy search:
roam~
This search will find terms like foam and roams.
Starting with Lucene 1.9, an additional (optional) parameter can specify the required similarity. The value is between 0 and 1; with a value closer to 1, only terms with a higher similarity will be matched. For example:
roam~0.8
https://lucene.apache.org/core/2_9_4/queryparsersyntax.html

Fuzzy Matching Duplicates in Java

I have a List<String[]> of customer records in Java (from a database). I know from manually eyeballing the data that 25%+ are duplicates.
The duplicates are far from exact though. Sometimes they have different zips, but the same name and address. Other times the address is missing completely, etc...
After a day of research, I'm still really stumped as to how to even begin to attack this problem.
What are the "terms" that I should be googling for that describe this area (from a solve-this-in-Java perspective)? And I don't suppose there is a fuzzymatch.jar out there that makes it all just too easy?
I've done similar systems before for matching place information and people information. These are complex objects with many features and figuring out whether two different objects describe the same place or person is tricky. The way to do it is to break it down to the essentials.
Here's a few things that you can do:
0) If this is a one-off, load the data into OpenRefine and fix things interactively. At best this solves your problem; at minimum it will show you where your possible matches are.
1) There are several ways you can compare strings. Basically they differ in how reliably they avoid false positives (matching when they shouldn't) and false negatives (missing matches they should find). String equals will not produce false positives but will miss a lot of potential matches due to slight variations. Levenshtein with a small factor is slightly better. Ngrams produce a lot of matches, but many of them will be false positives. There are a few more algorithms; take a look at e.g. the OpenRefine code to find various ways of comparing and clustering strings. Lucene implements a lot of this stuff in its analyzer framework but is a bit of a beast to work with if you are not very familiar with its design.
2) Separate the process of comparing stuff from the process of deciding whether you have a match. What I did in the past was qualify my comparisons using a simple numeric score: e.g. this field matched exactly (100), that field was a partial match (75), and that field did not match at all (0). The resulting vector of qualified comparisons, e.g. (100, 75, 0, 25), can be compared to a reference vector that defines your perfect or partial match criteria. For example, if first name, last name, and street match, the two records are the same regardless of the rest of the fields. Or if phone numbers and last names match, that's a valid match too. You can encode such perfect matches as a vector and then simply compare it with your comparison vectors to determine whether it was a match, not a match, or a partial match. This is sort of a manual version of what machine learning does, which is to extract vectors of features and then build up a probability model of which vectors mean what from reference data. Doing it manually can work for simple problems; see the sketch after this list.
3) Build up a reference data set with test cases that you know to match or not match, and evaluate your algorithm against that reference set. That way you will know when you are improving things or making things worse as you tweak e.g. the factor that goes into Levenshtein or whatever.
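A minimal sketch of the qualified-comparison idea from point 2; the assumed column layout and the match rules are illustrative, not a recipe:

// Each field comparison yields a qualified score; hand-written rules then
// decide whether the score vector counts as a duplicate.
class MatchScorer {
    static int fieldScore(String a, String b) {
        if (a == null || b == null) return 25;             // missing data: weak signal
        if (a.equalsIgnoreCase(b)) return 100;             // exact match
        if (a.regionMatches(true, 0, b, 0, 3)) return 75;  // partial match (shared prefix)
        return 0;                                          // no match
    }

    static boolean isDuplicate(String[] x, String[] y) {
        // Assumed column layout: 0 = last name, 1 = street, 2 = phone.
        int name = fieldScore(x[0], y[0]);
        int street = fieldScore(x[1], y[1]);
        int phone = fieldScore(x[2], y[2]);
        // Example reference rules: name + street, or name + phone.
        return (name == 100 && street == 100) || (name == 100 && phone == 100);
    }
}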
Jilles' answer is great and comes from experience. I've also had to work on cleaning up large messy tables and sadly didn't know much about my options at that time (I ended up using Excel and a lot of autofilters). Wish I'd known about OpenRefine.
But if you get to the point where you have to write custom code to do this, I want to make a suggestion as to how: the columns are always the same, right? For instance, the first String is always the key, the second is the first name, the sixth is the ZIP code, the tenth is the fax number, etc.?
Assuming there's not an unreasonable number of fields, I would start with a custom record type which has each DB field as a member, rather than a position in an array. Something like:
class CustomerRow {
    public final String id;
    public final String firstName;
    // ...

    public CustomerRow(String[] data) {
        id = data[0];
        firstName = data[1]; // assuming the first name is the second column
        // ...
    }
}
You could also include some validation code in the constructor, if you knew there to be garbage values you always want to filter out.
(Note that you're basically doing what an ORM would do automatically, but getting started with one would probably be more work than just writing the Record type.)
Then you'd implement some Comparator<CustomerRow>s which only look at particular fields, define equality in fuzzy terms (that's where the edit distance algorithms would come in handy), or do special sorts.
Java uses a stable sort for objects, so to sort by e.g. name, then address, then key, you would just run each sort, applying your comparators in reverse order (see the sketch below).
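A sketch of that trick, reusing the CustomerRow type above; the address field is an assumed addition:

import java.util.Comparator;
import java.util.List;

// Given a List<CustomerRow> rows: each sort is stable, so earlier orderings
// survive as tie-breakers. Apply the least significant key first.
rows.sort(Comparator.comparing((CustomerRow r) -> r.id));
rows.sort(Comparator.comparing((CustomerRow r) -> r.address));
rows.sort(Comparator.comparing((CustomerRow r) -> r.firstName));
// With java.util.Comparator this can also be written in one pass:
// rows.sort(Comparator.comparing((CustomerRow r) -> r.firstName)
//         .thenComparing(r -> r.address).thenComparing(r -> r.id));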
Also if you have access to the actual database, and it's a real relational database, I'd recommend doing some of your searches as queries where possible. And if you need to go back and forth between your Java objects and the DB, then using an ORM may end up being a good option.

Search for terms in the index which are a prefix of the search term or vice versa (!)

I would like Lucene to find a document containing the term "bahnhofstr" if I search for "bahnhofstrasse", i.e., I don't only want to find documents containing terms of which my search term is a prefix, but also documents that contain terms that are themselves a prefix of my search term...
How would I go about this?
If I understand you correctly, and your search string is an exact string, you can set queryParser.setAllowLeadingWildcard(true); in Lucene to allow leading-wildcard searches (which may or may not be slow; I have seen them be reasonably fast, but in a case where there were only 60,000+ Lucene documents).
Your example query syntax could look something like:
*bahnhofstr bahnhofstr*
or possibly (have not tested this) just:
*bahnhofstr*
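A sketch of the programmatic equivalent, assuming the classic QueryParser API and a standard analyzer:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// Leading wildcards are rejected by default and must be enabled explicitly.
QueryParser parser = new QueryParser("firstname", new StandardAnalyzer());
parser.setAllowLeadingWildcard(true);
Query q = parser.parse("*bahnhofstr bahnhofstr*");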
I think a fuzzy query might be most helpful for you. This will score terms based on their Levenshtein distance from your query. Without a minimum similarity specified, it will effectively match every available term. This can make it less than performant, but it does accomplish what you are looking for.
A fuzzy query is signalled by the ~ character, such as:
firstname:bahnhofstr~
Or with a minimum similarity (a number between 0 and 1, 0 being the loosest with no minimum)
firstname:bahnhofstr~0.4
Or if you are constructing your own queries, use the FuzzyQuery
This isn't quite exactly what you specified, but it is the easiest way to get close.
As far as exactly what you are looking for, I don't know of a simple Lucene call to accomplish it. I would probably just split the term into a series of term queries, which you could represent in a query string something like:
firstname:b
firstname:ba
firstname:bah
firstname:bahn
firstname:bahnh
firstname:bahnho
firstname:bahnhof
firstname:bahnhofs
firstname:bahnhofst
firstname:bahnhofstr*
I wouldn't actually generate a query string for it myself, by the way. I'd just construct the TermQuery and PrefixQuery objects myself.
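A sketch of that construction; the field name and term follow the example above:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// One TermQuery per prefix of the search term, plus a PrefixQuery for the
// term itself, all as SHOULD clauses.
String term = "bahnhofstr";
BooleanQuery.Builder bld = new BooleanQuery.Builder();
for (int len = 1; len < term.length(); len++) {
    bld.add(new TermQuery(new Term("firstname", term.substring(0, len))),
            BooleanClause.Occur.SHOULD);
}
bld.add(new PrefixQuery(new Term("firstname", term)), BooleanClause.Occur.SHOULD);
Query ladder = bld.build();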
Scoring would be a bit warped, and I'd probably boost longer queries more highly to get better ordering out of it, but that's the method that comes to mind to accomplish exactly what you're looking for fairly easily. A DisjunctionMaxQuery would help you use something like this alongside other terms and get more reasonable scoring.
Hopefully a fuzzy query works well for you though. Seems a much nicer solution.
Another option, if you have a lot of need for queries of this nature, might be to tokenize fields into n-grams when indexing (see NGramTokenizer), which would allow you to effectively use an NGramPhraseQuery to achieve the results you want.

Lucene query parsing behaviour - joining query parts with AND

Let's say we have a Lucene index with a few documents indexed using StopAnalyzer.ENGLISH_STOP_WORDS_SET. A user issues two queries:
foo:bar
baz:"there is"
Let's assume that the first query yields some results because there are documents matching that query.
The second query yields 0 results. The reason is that when baz:"there is" is parsed, it ends up as a void query, since both there and is are stopwords (technically speaking, it is converted to an empty BooleanQuery with no clauses). So far so good.
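A sketch reproducing that behaviour with the Lucene 3.x-era APIs mentioned in the question:

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// Both "there" and "is" are stopwords, so the phrase collapses into an
// empty BooleanQuery with no clauses.
QueryParser parser = new QueryParser(Version.LUCENE_30, "baz",
        new StopAnalyzer(Version.LUCENE_30));
Query q = parser.parse("\"there is\"");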
However, any of the following combined queries
+foo:bar +baz:"there is"
foo:bar AND baz:"there is"
behave exactly the same way as the query +foo:bar, that is, they bring back some results, despite the second AND part yielding no results on its own.
One might argue that when ANDing, both conditions have to be met, but they aren't.
It seems contradictory as an atomic query component has different impact on the overall query depending on the context. Is there any logical explanation for this? Can this be addressed in any way, preferably without writing own QueryAnalyzer? Can this be classified as a Lucene bug?
If this makes any difference, observed behaviour happens under Lucene v3.0.2.
This question was also posted on Lucene Java users mailing list, no answers came so far.
I would suggest not using the StopAnalyzer if you want to be able to search for phrases like "there is". StopAnalyzer is essentially a lossy optimization method and unless you are indexing huge text documents it's probably not worth it.
I think it is perfectly fine. You can imagine the result of an empty query being the whole document collection; however, this result is omitted for practical reasons. So basically you're ANDing with a superset, not an empty set.
Edit: You can think of it this way: additional keywords refine the result set. This makes the most sense when you take prefix search into account. The shorter your prefix is, the more matches there are. The most extreme case would be the empty query matching the whole document collection.
Erick Erickson from the Lucene mailing list answered one part of this question:
But imagine the impact of what you're requesting. If all stop words get removed, then no query would ever match yours. Which would be very counter-intuitive IMO. Your users have no clue that you've removed stopwords, so they'll sit there saying "Look, I KNOW that "bar" was in foo and I KNOW that "there is" was in baz, why the heck didn't this cursed system find my doc?"
So it looks like the only sensible way is to cease using stopwords or reduce the stopword set.
