I have a very large list of Strings stored in a NoSQL DB. The incoming query is a string, and I want to check whether this string is in the list or not. For an exact match this is very simple: the NoSQL DB can use the string as the primary key, and I just check whether a record with that key exists. But I need to check for fuzzy matches as well.
One approach is to traverse every String in the list and compute the Levenshtein distance between the input and each entry, but that is O(n) per query, and the list is very large (10 million entries) and may grow further. This approach would result in higher latency for my solution.
Is there a better way to solve this problem?
Fuzzy matching is complicated for the reasons you have discovered. Calculating a distance metric for every combination of search term against database term is impractical for performance reasons.
The solution to this is usually to use an n-gram index. This can either be used standalone to give a result, or as a filter to cut down the size of possible results so that you have fewer distance scores to calculate.
So basically, if you have the word "stack" you break it into n-grams (commonly trigrams), giving "sta", "tac" and "ack". You index those in your database against the database row. You then do the same for the input and look for the database rows that have matching n-grams.
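For illustration, extracting the trigrams of a term is just a sliding window over the characters; a minimal plain-Java sketch (not tied to any particular library):

import java.util.ArrayList;
import java.util.List;

// Break a term into character trigrams, e.g. "stack" -> [sta, tac, ack].
static List<String> trigrams(String term) {
    List<String> grams = new ArrayList<>();
    for (int i = 0; i + 3 <= term.length(); i++) {
        grams.add(term.substring(i, i + 3));
    }
    return grams;
}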
This is all complicated, and your best option is to use an existing implementation such as Lucene/Solr which will do the n-gram stuff for you. I haven't used it myself as I work with proprietary solutions, but there is a stackoverflow question that might be related:
Return only results that match enough NGrams with Solr
Some databases seem to implement n-gram matching. Here is a link to a Sybase page that provides some discussion of that:
Sybase n-gram text index
Unfortunately, a full discussion of n-grams would make for a long post and I don't have time. It is probably discussed elsewhere on Stack Overflow and other sites; I suggest Googling the term and reading up on it.
First of all, if searching is what you're doing, then you should use a search engine (Elasticsearch is pretty much the default). They are good at this and you avoid reinventing the wheel.
Second, the technique you are looking for is called stemming. Along with the original String, save a normalized string in your DB. Normalize the search query with the same mechanism. That way you will get much better search results. Obviously, this is one of the techniques a search engine uses under the hood.
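As a rough illustration (the normalize method here is my own and only one of many possible normalizations), the same function can be applied to the stored strings at index time and to the query at search time:

import java.text.Normalizer;

// Illustrative normalization: lowercase, strip diacritics, trim whitespace.
static String normalize(String s) {
    String folded = Normalizer.normalize(s, Normalizer.Form.NFD)
            .replaceAll("\\p{M}", ""); // remove combining marks (accents)
    return folded.toLowerCase().trim();
}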
Solr (or Lucene) could be a suitable solution for you.
Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. For example to search for a term similar in spelling to "roam" use the fuzzy search:
roam~
This search will find terms like foam and roams.
Starting with Lucene 1.9, an additional (optional) parameter can specify the required similarity. The value is between 0 and 1; with a value closer to 1, only terms with a higher similarity will be matched. For example:
roam~0.8
https://lucene.apache.org/core/2_9_4/queryparsersyntax.html
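If you build queries programmatically instead of going through the query parser, the rough equivalent is a FuzzyQuery (sketched here against the older Lucene API, where the second argument is a minimum similarity float; in Lucene 4+ the constructor takes a maximum edit distance instead):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

// Roughly equivalent to the query string "roam~0.8" on a field named "contents".
FuzzyQuery fuzzy = new FuzzyQuery(new Term("contents", "roam"), 0.8f);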
Related
I have a Java backend with Hibernate, Lucene and Hibernate Search. Now I want to do a fuzzy query, BUT instead of 0, 1, or 2, I want to allow more "differences" between the query and the expected result (to compensate, for example, for misspellings in long words). Is there any way to achieve this? The maximum number of allowed differences will later be calculated from the length of the query.
What I want this for is an autocomplete search with correction of wrong letters. This autocomplete should only search for missing characters BEHIND the given query, not in front of it. If characters in front of the query are missing compared to the entry, they should be counted as differences.
Examples:
Maximum allowed different characters in this example is 2.
fooo should match
fooo (no difference)
fooobar (only characters added -> autocomplete)
fouubar (characters added and misspelled -> autocomplete and spelling correction)
fooo should NOT match
barfooo (we only allow additional characters behind the query, but this example is less important)
fuuu (more than 2 differences)
This is my current code for the search query:
FullTextEntityManager fullTextEntityManager = this.sqlService.getFullTextEntityManager();
QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(MY_CLASS.class).overridesForField("name", "foo").get();
Query query = queryBuilder.keyword().fuzzy().withEditDistanceUpTo(2).onField("name").matching("QUERY_TO_MATCH").createQuery();
FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery(query, MY_CLASS.class);
List<MY_CLASS> results = fullTextQuery.getResultList();
Notes:
1. I use org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory for indexing, but that should not make any difference.
2. This is using a custom framework, which is not open source. You can just ignore the sqlService, it only provides the FullTextEntityManager and handles all things around hibernate, which do not require custom code each time.
3. This code already works, but only with withEditDistanceUpTo(2), which means a maximum of 2 "differences" between QUERY_TO_MATCH and the matching entry in the database or index. Missing characters also count as differences.
4. withEditDistanceUpTo(2) does not accept values greater than 2.
Does anyone have any ideas to achieve that?
I am not aware of any solution where you would specify an exact number of changes that are allowed.
That approach has serious drawbacks, anyway: what does it mean to match "foo" with up to 3 changes? Just match anything? As you can see, a solution that works with varying term lengths might be better.
One solution is to index n-grams. I'm not talking about edge-ngrams, like you already do, but actual ngrams extracted from the whole term, not just the edges. So when indexing 2-grams of foooo, you would index:
fo
oo (occurring multiple times)
And when querying, the term fouuu would be transformed to:
fo
ou
uu
... and it would match the indexed document, since they have at least one term in common (fo).
Obviously there are some drawbacks. With 2-grams, the term fuuuu wouldn't match foooo, but the term barfooo would, because they have a 2-gram in common. So you would get false positives. The longer the grams, the less likely you are to get false positives, but the less fuzzy your search will be.
You can make these false positives go away by relying on scoring and on a sort by score to place the best matches first in the result list. For example, you could configure the ngram filter to preserve the original term, so that fooo will be transformed to [fooo, fo, oo] instead of just [fo, oo], and thus an exact search of fooo will have a better score for a document containing fooo than for a document containing barfooo (since there are more matches). You could also set up multiple separate fields: one without ngrams, one with 3-grams, one with 2-grams, and build a boolean query with one should clause per field: the more clauses are matched, the higher the score will be, and the higher you will find the document in the hits.
Also, I'd argue that fooo and similar are really artificial examples and you're unlikely to have these terms in a real-world dataset; you should try whatever solution you come up with against a real dataset and see if it works well enough. If you want fuzzy search, you'll have to accept some false positives: the question is not whether they exist, but whether they are rare enough that users can still easily find what they are looking for.
In order to use ngrams, apply the n-gram filter using org.apache.lucene.analysis.ngram.NGramFilterFactory. Apply it both when indexing and when querying. Use the parameters minGramSize/maxGramSize to configure the size of ngrams, and keepShortTerm (true/false) to control whether to preserve the original term or not.
You may keep the edge-ngram filter or not; see whether it improves the relevance of your results. I suspect it may improve the relevance slightly if you use keepShortTerm = true. In any case, make sure to apply the edge-ngram filter before the ngram filter.
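For reference, an analyzer definition along those lines in Hibernate Search 5 style annotations might look roughly like this (the analyzer name and the exact gram sizes are illustrative; check the parameter names against the Lucene/Hibernate Search versions you actually use):

import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.ngram.NGramFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.Parameter;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

// Illustrative analyzer: standard tokenizer, lowercasing, then 2- and 3-grams.
@AnalyzerDef(name = "ngram_analyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = NGramFilterFactory.class, params = {
            @Parameter(name = "minGramSize", value = "2"),
            @Parameter(name = "maxGramSize", value = "3")
        })
    })
public class MyEntity { /* apply with @Analyzer(definition = "ngram_analyzer") on the field */ }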
Ok, my friend and I found a solution.
We found a feature request in the Lucene changelog asking for the same thing, and we implemented a solution:
There is a SlowFuzzyQuery in the Lucene sandbox module. It is slower (obviously) but supports an edit distance greater than 2.
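As a sketch of how it can be wired up (assuming the lucene-sandbox artifact is on the classpath; if I remember correctly, a second argument of 1 or more is interpreted as an allowed edit distance rather than a similarity):

import org.apache.lucene.index.Term;
import org.apache.lucene.sandbox.queries.SlowFuzzyQuery;

// Allow up to 4 edits between the query term and indexed terms (much slower than FuzzyQuery).
SlowFuzzyQuery query = new SlowFuzzyQuery(new Term("name", "QUERY_TO_MATCH"), 4f);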
Given an arbitrary string s, I would like a method to quickly retrieve all strings S ⊆ M from a large set of strings M (where |M| > 1 million), where all strings of S have edit distance < t (some threshold) from s.
At worst, S may be empty if no strings in M match this criteria, and at best, S = {s} (an exact match). For any case in between, I completely expect that S may be quite large.
In general, I expect to have the maximum edit distance threshold fixed (e.g., 2), and need to perform this operation very many times over arbitrary strings s, thus the need for an efficient method, as naively iterating and testing all strings would be too expensive.
While I have used edit distance as an example metric, I would like to use other metrics as well, such as the Jaccard index.
Can anyone make a suggestion about an existing Java implementation which can achieve this, or point me to the right algorithms and data structures for solving this problem?
UPDATE #1
I have since learned that metric trees are precisely the kind of structure I am after: they exploit the distance metric to organise subsets of strings in M based on their distances from each other. Vantage-point trees, BK-trees and other similar metric tree data structures and algorithms seem ideal for this kind of problem. Now, to find easy-to-use implementations in Java...
UPDATE #2
Using a combination of this bk-tree and this Levenshtein distance implementation, I'm successfully able to retrieve subsets against arbitrary strings from a set (M) of one million strings with retrieval times of around 10ms.
BK-trees are designed for exactly this case. They work with any metric distance, such as Levenshtein distance or Jaccard distance.
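As a rough sketch of the idea (illustrative, not tuned): each node stores a string, and each child is keyed by its distance to the parent; by the triangle inequality a range search only needs to descend into children whose key is within the threshold of the query's distance to the node.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal BK-tree over Levenshtein distance.
class BkTree {
    private String root;
    private final Map<Integer, BkTree> children = new HashMap<>();

    void add(String s) {
        if (root == null) { root = s; return; }
        int d = levenshtein(s, root);
        children.computeIfAbsent(d, k -> new BkTree()).add(s);
    }

    // Collect all stored strings within 'threshold' edits of 'query'.
    void search(String query, int threshold, List<String> out) {
        if (root == null) return;
        int d = levenshtein(query, root);
        if (d <= threshold) out.add(root);
        // Only children keyed in [d - threshold, d + threshold] can contain matches.
        for (int key = d - threshold; key <= d + threshold; key++) {
            BkTree child = children.get(key);
            if (child != null) child.search(query, threshold, out);
        }
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1], curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}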
Although I never tried it myself, it might be worth looking at a Levenshtein Automaton. I once bookmarked this article, which looks rather elaborate and provides several code snippets:
Damn Cool Algorithms: Levenshtein Automata
As already mentioned by H W, you will not be able to avoid checking each word in your dictionary. However, the automaton will speed up calculating the distance. Combine this with an efficient data structure for your dictionary (e.g. a trie, as mentioned in the Wikipedia article), and you might be able to accelerate your current approach.
We have a problem where we want to do substring searching over a large number [1MM - 10MM] of strings ("model numbers"), quickly identifying any model number that contains the given substring. Model numbers are short strings such as:
ABB1924DEW
WTW9400PDQB
GLEW1874
The goal is simple: given a substring, quickly find all the model numbers that match the substring. For example (on the universe of above model numbers), if we searched on the string "EW" the function would return GLEW1874 and ABB1924DEW (because both contain the substring EW).
The data structure also needs to be able to support quick searches for model numbers that start with a given substring and/or end with a given substring. For example, we need to be able to quickly do searches like WTW...B (which would match WTW9400PDQB because it starts with WTW and ends with B).
What I am looking for is an in-memory data structure that does these searches very efficiently. Ideally, there would also be a nice (simple) implementation in Java already done somewhere that we could use. Simple (and fast) is better than uber complicated and slightly faster. The naive algorithm (just loop over all part numbers doing a substring search on each) is too slow for our purposes; we are looking for something much faster (preprocessing is OK).
So, what is the textbook data structure/algorithm for this problem?
What you need is a suffix tree. I don't know of a Java library to recommend, so you might have to implement one yourself.
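If you do roll something yourself, a simpler (more memory-hungry) relative of a suffix tree that can be good enough for short model numbers is a sorted list of all suffixes: a substring of a model number is always a prefix of one of its suffixes, so a binary search over the sorted suffixes finds every match. A rough sketch (class and method names are my own); the starts-with/ends-with case could be handled by also keeping a second index built from the reversed strings:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: index every suffix of every model number, then binary-search for the query.
class SuffixIndex {
    private final List<String[]> suffixes = new ArrayList<>(); // [suffix, original model number]

    SuffixIndex(List<String> modelNumbers) {
        for (String m : modelNumbers)
            for (int i = 0; i < m.length(); i++)
                suffixes.add(new String[] { m.substring(i), m });
        suffixes.sort((a, b) -> a[0].compareTo(b[0]));
    }

    // All model numbers containing 'query' as a substring.
    Set<String> containing(String query) {
        Set<String> result = new LinkedHashSet<>();
        int lo = 0, hi = suffixes.size();
        while (lo < hi) { // lower bound: first suffix >= query
            int mid = (lo + hi) >>> 1;
            if (suffixes.get(mid)[0].compareTo(query) < 0) lo = mid + 1; else hi = mid;
        }
        for (int i = lo; i < suffixes.size() && suffixes.get(i)[0].startsWith(query); i++)
            result.add(suffixes.get(i)[1]);
        return result;
    }
}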
I would like for Lucene to find a document containing a term "bahnhofstr" if I search for "bahnhofstrasse", i.e., I don't only want to find documents containing terms of which my search term is a prefix but also documents that contain terms that are themselves a prefix of my search term...
How would I go about this?
If I understand you correctly, and your search string is an exact string, you can set queryParser.setAllowLeadingWildcard(true); in Lucene to allow for leading-wildcard searches (which may or may not be slow -- I have seen them reasonably fast but in a case where there were only 60,000+ Lucene documents).
Your example query syntax could look something like:
*bahnhofstr bahnhofstr*
or possibly (have not tested this) just:
*bahnhofstr*
I think a fuzzy query might be most helpful for you. This will score terms based on the Levenshtein distance from your query. Without a minimum similarity specified, it will effectively match every term available. This can make it less than performant, but does accomplish what you are looking for.
A fuzzy query is signalled by the ~ character, such as:
firstname:bahnhofstr~
Or with a minimum similarity (a number between 0 and 1, 0 being the loosest with no minimum)
firstname:bahnhofstr~0.4
Or if you are constructing your own queries, use the FuzzyQuery
This isn't exactly what you specified, but it is the easiest way to get close.
As far as exactly what you are looking for, I don't know of a simple Lucene call to accomplish it. I would probably just split the term into a series of term queries, which you could represent in a query string something like:
firstname:b
firstname:ba
firstname:bah
firstname:bahn
firstname:bahnh
firstname:bahnho
firstname:bahnhof
firstname:bahnhofs
firstname:bahnhofst
firstname:bahnhofstr*
I wouldn't actually generate a query string for it myself, by the way. I'd just construct the TermQuery and PrefixQuery objects myself.
Scoring would be a bit warped, and I'd probably boost longer queries more heavily to get better ordering out of it, but that's the method that comes to mind to accomplish exactly what you're looking for fairly easily. A DisjunctionMaxQuery would help you use something like this with other terms and acquire more reasonable scoring.
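A sketch of that construction (using the pre-5.0 style of building a BooleanQuery directly; the field name and boost values are illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TermQuery;

// One SHOULD clause per prefix of the search term, boosted by prefix length,
// plus a PrefixQuery for the full term.
String field = "firstname";
String term = "bahnhofstr";
BooleanQuery query = new BooleanQuery();
for (int len = 1; len < term.length(); len++) {
    TermQuery tq = new TermQuery(new Term(field, term.substring(0, len)));
    tq.setBoost(len); // longer prefixes count more
    query.add(tq, BooleanClause.Occur.SHOULD);
}
PrefixQuery pq = new PrefixQuery(new Term(field, term));
pq.setBoost(term.length());
query.add(pq, BooleanClause.Occur.SHOULD);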
Hopefully a fuzzy query works well for you though. Seems a much nicer solution.
Another option, if you have a lot of need for queries of this nature, might be to tokenize fields into n-grams when indexing (see NGramTokenizer), which would allow you to effectively use an NGramPhraseQuery to achieve the results you want.
Hello Stack Overflow people. I'd like some suggestions regarding the following problem. I am using Java.
I have an array #1 with a number of Strings. For example, two of the strings might be: "An apple fell on Newton's head" and "Apples grow on trees".
On the other side, I have another array #2 with terms like (Fruits => Apple, Orange, Peach; Items => Pen, Book; ...). I'd call this array my "dictionary".
By comparing items from one array to the other, I need to see in which "category" the items from #1 fall into from #2. E.g. Both from #1 would fall under "Fruits".
My most important consideration is speed. I need to do those operations fast. A structure allowing constant time retrieval would be good.
I considered a HashSet with the contains() method, but it doesn't allow substring matches. I also tried running a regex like (apple|orange|peach|...etc) with the case-insensitive flag on, but I read that it will not be fast when the terms increase in number (a minimum of 200 is to be expected). Finally, I searched, and am considering using an ArrayList with indexOf(), but I don't know about its performance. I also need to know which of the terms actually matched, so in this case it would be "Apple".
Please provide your views, ideas and suggestions on this problem.
I saw the Aho-Corasick algorithm, but the keywords/terms are very likely to change often, so I don't think I can use it. Oh, I'm no expert in text mining and maths, so please elaborate on complex concepts.
Thank you, Stack Overflow people, for your time! :)
If you use a multimap from Google Collections, they have a function to invert the map (so you can start with a map like {"Fruits" => [Apple]} and produce a map with {"Apple" => ["Fruits"]}). You can then look up a word and find its list of categories in one call to the map.
I would expect I'd want to split the strings myself and look up the words in the map one at a time, so that I could do stemming (adjusting for different word endings) and stopword filtering. Using the map should get good lookup times, plus it's easy to try out.
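A sketch of the inversion with Guava (the successor to Google Collections); the categories and terms are just the ones from the question:

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;
import com.google.common.collect.Multimaps;
import java.util.Arrays;

// Build {"Fruits" => [Apple, Orange, Peach], "Items" => [Pen, Book]} and invert it.
Multimap<String, String> categories = ArrayListMultimap.create();
categories.putAll("Fruits", Arrays.asList("Apple", "Orange", "Peach"));
categories.putAll("Items", Arrays.asList("Pen", "Book"));

Multimap<String, String> termToCategories =
        Multimaps.invertFrom(categories, ArrayListMultimap.<String, String>create());
// termToCategories.get("Apple") -> [Fruits]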
Would a suffix tree or similar data structure work for your application? It offers O(m) string lookup, where m is the length of the search string, after an O(n²) -- or better with some trickery -- initial setup, and, with some extra effort, you can associate arbitrary data, such as a reference to a category, with complete words in your dictionary. If you don't want to code it yourself, I believe the BioJava library includes an implementation.
You can also add strings to a suffix tree after initial setup, although the cost will still be around O(n²). That's probably not a big deal if you're adding short words.
If you have only 200 terms to look for, regexps might actually work for you. Of course the regular expression is large, but if you compile it once and just use this compiled Pattern, the lookup time is probably linear in the combined length of all the strings in array #1, and I don't see how you can hope to do better than that.
So the algorithm would be: concatenate the words of array #2 which you want to look for into a regular expression, compile it, and then find the matches in array #1.
(Regular expressions are compiled into a state machine - that is, on each character of the string it just does a table lookup for the next state. If the regular expression is complicated you might have backtracking that increases the time, but your regular expression has a very simple structure.)
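Sketched in Java (the dictionary terms are just the ones from the question; Pattern.quote guards against metacharacters in the terms):

import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Compile one alternation over all dictionary terms, then scan each sentence once.
List<String> terms = Arrays.asList("Apple", "Orange", "Peach", "Pen", "Book");
Pattern pattern = Pattern.compile(
        terms.stream().map(Pattern::quote).collect(Collectors.joining("|")),
        Pattern.CASE_INSENSITIVE);

Matcher m = pattern.matcher("An apple fell on Newton's head");
while (m.find()) {
    System.out.println("matched term: " + m.group()); // prints "apple"
}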