I have a List<String[]> of customer records in Java (from a database). I know from manually eyeballing the data that 25%+ are duplicates.
The duplicates are far from exact though. Sometimes they have different zips, but the same name and address. Other times the address is missing completely, etc...
After a day of research, I'm still stumped as to how to even begin to attack this problem.
What are the "terms" I should be googling for that describe this area (from a solve-this-in-Java perspective)? And I don't suppose there is a fuzzymatch.jar out there that makes it all just too easy?
I've done similar systems before for matching place information and people information. These are complex objects with many features and figuring out whether two different objects describe the same place or person is tricky. The way to do it is to break it down to the essentials.
Here's a few things that you can do:
0) If this is a one-off, load the data into OpenRefine and fix things interactively. At best this solves your problem outright; at minimum it will show you where your possible matches are.
1) There are several ways you can compare strings. They basically differ in how prone they are to false positives and false negatives: a false positive is a match that shouldn't have happened, a false negative is a true match that was missed. Exact string equality produces no false positives but misses a lot of potential matches due to slight variations. Levenshtein distance with a small threshold is slightly better. N-grams produce a lot of matches, but many of them will be false. There are a few more algorithms; take a look at e.g. the OpenRefine code to find various ways of comparing and clustering strings. Lucene implements a lot of this stuff in its analyzer framework, but it is a bit of a beast to work with if you are not very familiar with its design.
2) Separate the process of comparing stuff from the process of deciding whether you have a match. What I did in the past was qualify my comparisons using a simple numeric score: e.g. this field matched exactly (100), that field was a partial match (75), and that field did not match at all (0). The resulting vector of qualified comparisons, e.g. (100, 75, 0, 25), can be compared to a reference vector that defines your perfect or partial match criteria. For example, if first name, last name, and street match, the two records are the same regardless of the rest of the fields. Or if phone numbers and last names match, that's a valid match too. You can encode such perfect matches as a vector and then simply compare it with your comparison vectors to determine whether it was a match, not a match, or a partial match (see the sketch after this list). This is a manual version of what machine learning does, which is to extract vectors of features and then build up a probability model of which vectors mean what from reference data. Doing it manually can work for simple problems.
3) Build up a reference data set with test cases that you know to match or not match, and evaluate your algorithm against that reference set. That way you will know whether you are improving things or making them worse when you tweak e.g. the threshold that goes into Levenshtein or whatever.
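To make point 2 concrete, here is a minimal sketch of the qualified-comparison idea. The field order, thresholds, and match rule are illustrative only, and the Levenshtein call comes from Apache Commons Lang (which is also mentioned later in this thread):

import org.apache.commons.lang.StringUtils;

// Sketch: score each field pair, collect the scores into a vector, then
// apply a hand-written rule that decides what counts as a match.
public class RecordMatcher {

    // 100 = exact, 75 = close (small edit distance), 0 = no match / missing.
    static int scoreField(String a, String b) {
        if (a == null || b == null || a.isEmpty() || b.isEmpty()) return 0;
        if (a.equalsIgnoreCase(b)) return 100;
        int distance = StringUtils.getLevenshteinDistance(a.toLowerCase(), b.toLowerCase());
        return distance <= 2 ? 75 : 0;
    }

    // Example rule: first name, last name and street must all be at least "close".
    static boolean isMatch(int[] scores) {
        return scores[0] >= 75 && scores[1] >= 75 && scores[2] >= 75;
    }

    public static void main(String[] args) {
        String[] a = {"John", "Smith", "12 Main St", "90210"};
        String[] b = {"Jon",  "Smith", "12 Main St", ""};
        int[] scores = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            scores[i] = scoreField(a[i], b[i]);
        }
        System.out.println(isMatch(scores)); // true: names and street are close enough
    }
}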
Jilles' answer is great and comes from experience. I've also had to work on cleaning up large messy tables and sadly didn't know much about my options at that time (I ended up using Excel and a lot of autofilters). Wish I'd known about OpenRefine.
But if you get to the point where you have to write custom code to do this, I want to make a suggestion as to how: The columns are always the same, right? For instance, the first String is always the key, the second is the First name, the sixth is the ZIP code, tenth is the fax number, etc.?
Assuming there's not an unreasonable number of fields, I would start with a custom Record type which has each DB field as a member rather than a position in an array. Something like:
class CustomerRow {
    public final String id;
    public final String firstName;
    // ...

    public CustomerRow(String[] data) {
        id = data[0];
        firstName = data[1];
        // ...
    }
}
You could also include some validation code in the constructor, if you knew there to be garbage values you always want to filter out.
(Note that you're basically doing what an ORM would do automatically, but getting started with one would probably be more work than just writing the Record type.)
Then you'd implement some Comparator<CustomerRow>s which only look at particular fields, or define equality in fuzzy terms (that's where the edit distance algorithms come in handy), or do special sorts.
Java uses a stable sort for objects, so to sort by e.g. name, then address, then key, you would just do each sort, but choose your comparators in the reverse order.
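A rough illustration of that trick, assuming the CustomerRow sketch above also has an address field (the field names here are placeholders):

import java.util.Comparator;
import java.util.List;

class StableSortDemo {
    // Sort by name, then address, then key: run the sorts in reverse order of
    // priority and rely on List.sort being stable, so earlier orderings
    // survive as tie-breakers.
    static void sortByNameThenAddressThenKey(List<CustomerRow> rows) {
        rows.sort(Comparator.comparing((CustomerRow r) -> r.id));
        rows.sort(Comparator.comparing((CustomerRow r) -> r.address));
        rows.sort(Comparator.comparing((CustomerRow r) -> r.firstName));
    }
}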
Also if you have access to the actual database, and it's a real relational database, I'd recommend doing some of your searches as queries where possible. And if you need to go back and forth between your Java objects and the DB, then using an ORM may end up being a good option.
Related
I have a Java backend with Hibernate, Lucene and Hibernate Search. Now I want to do a fuzzy query, BUT instead of 0, 1, or 2, I want to allow more "differences" between the query and the expected result (to compensate, for example, for misspellings in long words). Is there any way to achieve this? The maximum number of allowed differences will later be calculated from the length of the query.
What I want this for is an autocomplete search with correction of wrong letters. This autocomplete should only search for missing characters BEHIND the given query, not in front of it. If characters in front of the query are missing compared to the entry, they should be counted as differences.
Examples:
Maximum allowed different characters in this example is 2.
fooo should match
fooo (no difference)
fooobar (only characters added -> autocomplete)
fouubar (characters added and misspelled -> autocomplete and spelling correction)
fooo should NOT match
barfooo (we only allow additional characters behind the query, but this example is less important)
fuuu (more than 2 differences)
This is my current code for the full-text query:
FullTextEntityManager fullTextEntityManager = this.sqlService.getFullTextEntityManager();
QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(MY_CLASS.class).overridesForField("name", "foo").get();
Query query = queryBuilder.keyword().fuzzy().withEditDistanceUpTo(2).onField("name").matching("QUERY_TO_MATCH").createQuery();
FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery(query, MY_CLASS.class);
List<MY_CLASS> results = fullTextQuery.getResultList();
Notes:
1. I use org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory for indexing, but that should not make any difference here.
2. This is using a custom framework, which is not open source. You can just ignore the sqlService, it only provides the FullTextEntityManager and handles all things around hibernate, which do not require custom code each time.
3. This code does already work, but only with withEditDistanceUpTo(2), which means maximum 2 "differences" between QUERY_TO_MATCH and the matching entry in the database or index. Missing characters also count as differences.
4. withEditDistanceUpTo(2) does not accept values greater than 2.
Does anyone have any ideas to achieve that?
I am not aware of any solution where you would specify an exact number of changes that are allowed.
That approach has serious drawbacks, anyway: what does it mean to match "foo" with up to 3 changes? Just match anything? As you can see, a solution that works with varying term lengths might be better.
One solution is to index n-grams. I'm not talking about edge-ngrams, like you already do, but actual ngrams extracted from the whole term, not just the edges. So when indexing 2-grams of foooo, you would index:
fo
oo (occurring multiple times)
And when querying, the term fouuu would be transformed to:
fo
ou
uu
... and it would match the indexed document, since they have at least one term in common (fo).
Obviously there are some drawbacks. With 2-grams, the term fuuuu wouldn't match foooo, but the term barfooo would, because they have a 2-gram in common. So you would get false positives. The longer the grams, the less likely you are to get false positives, but the less fuzzy your search will be.
You can make these false positives go away by relying on scoring and on a sort by score to place the best matches first in the result list. For example, you could configure the ngram filter to preserve the original term, so that fooo will be transformed to [fooo, fo, oo] instead of just [fo, oo], and thus an exact search of fooo will have a better score for a document containing fooo than for a document containing barfooo (since there are more matches). You could also set up multiple separate fields: one without ngrams, one with 3-grams, one with 2-grams, and build a boolean query with one should clause per field: the more clauses are matched, the higher the score will be, and the higher you will find the document in the hits.
Also, I'd argue that fooo and similar are really artificial examples and you're unlikely to have these terms in a real-world dataset; you should try whatever solution you come up with against a real dataset and see if it works well enough. If you want fuzzy search, you'll have to accept some false positives: the question is not whether they exist, but whether they are rare enough that users can still easily find what they are looking for.
In order to use ngrams, apply the n-gram filter using org.apache.lucene.analysis.ngram.NGramFilterFactory. Apply it both when indexing and when querying. Use the parameters minGramSize/maxGramSize to configure the size of ngrams, and keepShortTerm (true/false) to control whether to preserve the original term or not.
You may keep the edge-ngram filter or not; see whether it improves the relevance of your results. I suspect it may improve relevance slightly if you use keepShortTerm = true. In any case, make sure to apply the edge-ngram filter before the ngram filter.
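If you go the n-gram route with Hibernate Search 5's annotation-based configuration, the analyzer definition could look roughly like this. The analyzer name, gram sizes, and the entity are placeholders; check the available NGramFilterFactory parameters against your Lucene version:

import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.ngram.NGramFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.Parameter;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

// Placed on an indexed entity; the field then references the analyzer by name,
// e.g. @Field(analyzer = @Analyzer(definition = "ngram_analyzer")).
@AnalyzerDef(name = "ngram_analyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = NGramFilterFactory.class, params = {
            @Parameter(name = "minGramSize", value = "2"),
            @Parameter(name = "maxGramSize", value = "3") }) })
public class MyEntity {
    // ... fields ...
}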
Ok, my friend and I found a solution.
We found a request in the Lucene changelog asking for the same feature, and implemented a solution based on it:
There is a SlowFuzzyQuery in Lucene's sandbox module. It is slower (obviously) but supports an edit distance greater than 2.
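A minimal sketch of how that might plug into the existing setup. The constructor shown here (a Term plus a similarity value, where values >= 1 are treated as an absolute edit distance) is how I recall the Lucene 4.x sandbox API, so verify it against your Lucene version:

import org.apache.lucene.index.Term;
import org.apache.lucene.sandbox.queries.SlowFuzzyQuery;
import org.apache.lucene.search.Query;

// Allow up to 4 edits on the "name" field instead of FuzzyQuery's limit of 2.
Query query = new SlowFuzzyQuery(new Term("name", "QUERY_TO_MATCH"), 4f);

// Then hand it to Hibernate Search exactly as in the question:
FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery(query, MY_CLASS.class);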
I'm comparing song titles, mostly in Latin script (although not always). My aim is an algorithm that gives a high score if the two song titles seem to be the same title, and a very low score if they have nothing in common.
I already had Java code for this using Lucene and a RAMDirectory; however, using Lucene simply to compare two strings is too heavyweight and consequently too slow. I've now moved to https://github.com/nickmancol/simmetrics, which has many nice algorithms for comparing two strings:
https://github.com/nickmancol/simmetrics/tree/master/src/main/java/uk/ac/shef/wit/simmetrics/similaritymetrics
BlockDistance
ChapmanLengthDeviation
ChapmanMatchingSoundex
ChapmanMeanLength
ChapmanOrderedNameCompoundSimilarity
CosineSimilarity
DiceSimilarity
EuclideanDistance
InterfaceStringMetric
JaccardSimilarity
Jaro
JaroWinkler
Levenshtein
MatchingCoefficient
MongeElkan
NeedlemanWunch
OverlapCoefficient
QGramsDistance
SmithWaterman
SmithWatermanGotoh
SmithWatermanGotohWindowedAffine
Soundex
but I'm not well versed in these algorithms. What would be a good choice?
I think Lucene uses CosineSimilarity in some form, so that is my starting point but I think there might be something better.
Specifically, the algorithm should work on short strings and should understand the concept of words, i.e. spaces should be treated specially. Good matching of Latin script is most important, but good matching of other scripts such as Korean and Chinese is relevant as well; I expect those would need a different algorithm because of the way they treat spaces.
They're all good. They work on different properties of strings and have different matching properties. What works best for you depends on what you need.
I'm using JaccardSimilarity to match names. I chose JaccardSimilarity because it was reasonably fast and, for short strings, excelled at matching names with common typos while quickly degrading the score for anything else. It gives extra weight to spaces and is insensitive to word order. I needed this behavior because the impact of a false positive was much higher than that of a false negative, spaces could be typos (but not often), and word order was not that important.
Note that this was done in combination with a simplifier that removes diacritics and a mapper that maps the remaining characters to the a-z range. This is passed through a normalizer that standardizes all word separator symbols to a single space. Finally, the names are parsed to pick out initials, prefixes, infixes and suffixes. This is because names have a structure and format to them that is rather resistant to plain string comparison.
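For reference, calling the library is straightforward. A minimal sketch against the simmetrics version linked above (package path taken from that repository, method name as I recall it, sample names invented):

import uk.ac.shef.wit.simmetrics.similaritymetrics.JaccardSimilarity;

public class NameScore {
    public static void main(String[] args) {
        JaccardSimilarity jaccard = new JaccardSimilarity();
        // Returns a score in [0, 1]; higher means more similar.
        float score = jaccard.getSimilarity("john smith", "jon smith");
        System.out.println(score);
    }
}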
To make your choice, you need to make a list of the criteria you want and then look for an algorithm that satisfies those criteria. You can also make a reasonably large test set and run all the algorithms on it to see what the trade-offs are with respect to time, true positives, false positives, false negatives and true negatives, the classes of errors your system should handle, etc.
If you are still unsure of your choice, you can also setup your system to switch the exact comparison algorithms at run time. This allows you to do an A-B test and see which algorithm works best in practice.
TLDR; which algorithm you want depends on what you need, if you don't know what you need make sure you can change it later on and run tests on the fly.
You likely need to solve a string-to-string correction problem. The Levenshtein distance algorithm is implemented in many languages. Before running it I'd remove all spaces from the strings, because they don't contain much information but may influence the measured difference between two strings. For string search, prefix trees are also useful; you can have a look in this direction as well, for example here or here. This was already discussed on SO. If spaces are that significant in your case, just assign a greater weight to them.
Each algorithm is going to focus on a similar, but slightly different aspect of the two strings. Honestly, it depends entirely on what you are trying to accomplish. You say that the algorithm needs to understand words, but should it also understand interactions between those words? If not, you can just break up each string according to spaces, and compare each word in the first string to each word in the second. If they share a word, the commonality factor of the two strings would need to increase.
In this way, you could create your own algorithm that focused only on what you were concerned with. If you want to test another algorithm that someone else made, you can find examples online and run your data through to see how accurate the estimated commonality is with each.
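A rough sketch of that word-overlap idea, splitting on whitespace and counting shared words; the scoring (shared words over total distinct words) is invented here for illustration:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class WordOverlap {
    // Fraction of distinct words shared between the two titles.
    static double commonality(String a, String b) {
        Set<String> wordsA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> wordsB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> shared = new HashSet<>(wordsA);
        shared.retainAll(wordsB);
        Set<String> union = new HashSet<>(wordsA);
        union.addAll(wordsB);
        return union.isEmpty() ? 0.0 : (double) shared.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(commonality("Let It Be", "let it be (remastered)")); // 0.75
    }
}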
I think http://jtmt.sourceforge.net/ would be a good place to start.
Interesting. Have you thought about a radix sort?
http://en.wikipedia.org/wiki/Radix_sort
The concept behind radix sort is that it is a non-comparative integer sorting algorithm that sorts data with integer keys by grouping keys by their individual digits. If you convert your string into an array of characters, each character is an integer of no more than 3 digits, so k = 3 (the maximum number of digits) and n = the number of strings to compare. This sorts on the first digits of all your strings. Then you have another factor, s = the length of the longest string. Your worst case for sorting would be 3*n*s and the best case would be (3 + n) * s. Check out some radix sort examples for strings here:
http://algs4.cs.princeton.edu/51radix/LSD.java.html
http://users.cis.fiu.edu/~weiss/dsaajava3/code/RadixSort.java
Did you take a look at the levenshtein distance ?
int org.apache.commons.lang.StringUtils.getLevenshteinDistance(String s, String t)
Find the Levenshtein distance between two Strings.
This is the number of changes needed to change one String into
another, where each change is a single character modification
(deletion, insertion or substitution).
The previous implementation of the Levenshtein distance algorithm was
from http://www.merriampark.com/ld.htm
Chas Emerick has written an implementation in Java, which avoids an
OutOfMemoryError which can occur when my Java implementation is used
with very large strings. This implementation of the Levenshtein
distance algorithm is from http://www.merriampark.com/ldjava.htm
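For completeness, a quick usage sketch with Commons Lang (the sample strings are just illustrative):

import org.apache.commons.lang.StringUtils;

public class LevenshteinDemo {
    public static void main(String[] args) {
        // 3 edits: substitute k->s, substitute e->i, append g.
        System.out.println(StringUtils.getLevenshteinDistance("kitten", "sitting")); // 3
        System.out.println(StringUtils.getLevenshteinDistance("Yellow Submarine",
                                                              "Yellow Submarin"));  // 1
    }
}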
Anyway, I'm curious to know what you choose in this case!
Suppose I have a large list (around 10,000 entries) of string triples as such:
car noun yes
dog noun no
effect noun yes
effect verb no
Suppose I am presented with a string double - for example, (effect, verb) - and I need to quickly look in the list to see if the pair appears and, if it does, whether its value is yes or no. (For this example the double does appear and the value is "no".)
What is the best data structure in Java to store the list and the most efficient way to perform the search? I am running hundreds of thousands of these searches so speed is of the essence.
Thanks!
You might consider using a HashMap<YourDouble, String>. Searches will be O(1).
You could either create an object, YourDouble which holds the first two values, or else append one to the other -- if values will still be unique -- and use HashMap<String, String>.
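A minimal sketch of the first option. The key class name YourDouble is just the placeholder from the answer; the equals/hashCode pair is what makes the HashMap lookup work:

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class TripleLookup {
    // Immutable key holding the word and its part of speech.
    static final class YourDouble {
        final String word, pos;
        YourDouble(String word, String pos) { this.word = word; this.pos = pos; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof YourDouble)) return false;
            YourDouble other = (YourDouble) o;
            return word.equals(other.word) && pos.equals(other.pos);
        }
        @Override public int hashCode() { return Objects.hash(word, pos); }
    }

    public static void main(String[] args) {
        Map<YourDouble, String> table = new HashMap<>();
        table.put(new YourDouble("effect", "noun"), "yes");
        table.put(new YourDouble("effect", "verb"), "no");
        System.out.println(table.get(new YourDouble("effect", "verb"))); // no
    }
}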
I would create a HashMultimap for each type of search you want, e.g. "all three", "each pair", and "each single field". When you build the list, populate all the different maps, then you can fetch from whichever map is appropriate for your query.
(The downside is that you'll need a type for at least each arity, e.g. use just String for the "single field" maps, but a Pair for the two-field maps, and a Triple for the three-field map.)
You could use a HashMap where the key is the concatenation of the first two strings, the ones which you'll use for lookups, and the value is a Boolean, representing the yes and no strings.
Alternatively, it seems the words in the second column would be fewer, since they represent categories. You could have a HashMap<String, HashMap<String, Boolean>> where you first index by e.g. "noun", "verb" etc. and then you index by e.g. "car", "dog", "effect", to get to your boolean. This would probably be more space-efficient.
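A small sketch of that nested-map variant, using the data from the question:

import java.util.HashMap;
import java.util.Map;

public class NestedLookup {
    public static void main(String[] args) {
        Map<String, Map<String, Boolean>> byPos = new HashMap<>();
        byPos.computeIfAbsent("noun", k -> new HashMap<>()).put("effect", true);
        byPos.computeIfAbsent("verb", k -> new HashMap<>()).put("effect", false);

        // Lookup for (effect, verb): first by part of speech, then by word.
        Map<String, Boolean> verbs = byPos.get("verb");
        Boolean value = (verbs == null) ? null : verbs.get("effect");
        System.out.println(value); // false
    }
}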
10k doesn't seem that large to me. Have you tried a DB?
The place to look for information like this is the Semantic Web. A number of projects work on Triple Stores of just this type. There's a list at the bottom of the Triple Store page of implementations.
As far as Java is concerned, your algorithms are almost certainly going to be language independent, and if you find a good algorithm implemented in C its Java port will also be fast.
Also, what does your data set look like? Are there a lot of two-field matches, such that subject and verb are often the same? How many matches are you expecting to get? MapReduce will work well for finding one match in 10k, but won't work as well for a query that returns 8k of 10k where the query can't be easily partitioned.
There's a query language made just for this problem too: SPARQL. The bigdata blog has some good insights, though again 10k doesn't seem that large.
I have a collection of strings that I want to filter. They'll be in this pattern:
xxx_xxx_xxx_xxx
so always a sequence of letters or numbers separated by three underscores. The max length of each string will be 60 characters. I might have a few million of these in my collection.
What data structure could I use to efficiently do something like this:
Get all strings starts with: "abc_123_456"
Get all strings starts with: "def_999_888"
etc..
for example, I could do this:
List<String> matched = new ArrayList<String>();
for (String it : strings) {
if (it.startsWith(match)) {
matched.add(it);
}
}
but that would take a long time if my collection is on the order of millions of strings, and worse yet if the number of matched strings is also high.
The high-level problem is that I want to answer the following question for an app I'm writing: "which of my friends have recommended product A for product B?". I could store this information in a sql table and run the following statement:
select recommender from recs where username='me' and prodIdA='a' and prodIdB='b';
I'm curious if something custom in java/C/C++ could run faster, using encoded flat strings like I have above:
myusername_prodIdA_prodIdB_recommenderusername
The idea being that you could do a starts-with operation on the whole collection of encoded strings to get your answer.
I know trying to implement a custom solution like this is most likely not usable in a production environment, so some SQL DB would be better; just curious though.
Thanks
To do that in Java, you can use a Trie structure.
That being said, I don't think it's a good idea. Dumping "a few million" records into memory won't always work.
That's what databases are for; with the right design and proper indexing you can have very good performance with the DB alone.
I think you are looking for a SortedMap.
"headMap(K toKey)
Returns a view of the portion of this map whose keys are strictly less than toKey."
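As a sketch of how a sorted structure gives you a prefix query (using a TreeSet here; the '\uffff' upper bound assumes your keys never contain that character):

import java.util.NavigableSet;
import java.util.TreeSet;

public class PrefixSearch {
    public static void main(String[] args) {
        NavigableSet<String> keys = new TreeSet<>();
        keys.add("abc_123_456_alice");
        keys.add("abc_123_456_bob");
        keys.add("def_999_888_carol");

        String prefix = "abc_123_456";
        // Every string starting with the prefix falls inside this sorted range.
        for (String s : keys.subSet(prefix, true, prefix + '\uffff', false)) {
            System.out.println(s); // abc_123_456_alice, abc_123_456_bob
        }
    }
}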
I know trying to implement a custom solution like this is most likely not usable in a production environment, so some sql db would be better, just curious though
If only for the sake of curiosity, you can put all existing distinct "myusername_prodIdA_prodIdB" combinations in a hashtable, and for each combination store a list of relevant results.
So the structure would look like Map<String, List<String>> and be used like hash.get("def_999_888"). Lookup is constant time (O(1)).
You can get rid of inner list and optimize it in many ways, but this is the idea.
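A minimal sketch of building that index from the encoded flat strings. The key format and sample data follow the question; the splitting logic assumes no underscores inside the individual fields:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecIndex {
    public static void main(String[] args) {
        List<String> encoded = Arrays.asList(
                "me_a_b_alice",
                "me_a_b_bob",
                "me_a_c_carol");

        // Key = "username_prodIdA_prodIdB", value = list of recommenders.
        Map<String, List<String>> index = new HashMap<>();
        for (String s : encoded) {
            int cut = s.lastIndexOf('_');
            index.computeIfAbsent(s.substring(0, cut), k -> new ArrayList<>())
                 .add(s.substring(cut + 1));
        }

        System.out.println(index.get("me_a_b")); // [alice, bob]
    }
}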
The first thing that comes to mind for me is pre-processing the strings into some sort of data structure so that they could be searched for efficiently. If you're going to be calling the search function many times, I think it'd be good for you to put all of the strings into a hash table for a constant-time look up. It'd take more processing power to construct your array of strings, but it'd trivialize the task of searching for them.
Hello Stack Overflow people. I'd like some suggestions regarding the following problem. I am using Java.
I have an array #1 with a number of Strings. For example, two of the strings might be: "An apple fell on Newton's head" and "Apples grow on trees".
On the other side, I have another array #2 with terms like (Fruits => Apple, Orange, Peach; Items => Pen, Book; ...). I'd call this array my "dictionary".
By comparing items from one array to the other, I need to see in which "category" the items from #1 fall into from #2. E.g. Both from #1 would fall under "Fruits".
My most important consideration is speed. I need to do those operations fast. A structure allowing constant time retrieval would be good.
I considered a HashSet with the contains() method, but it doesn't allow substring matches. I also tried running a regex like (apple|orange|peach|...) with the case-insensitive flag on, but I read that it will not be fast when the terms increase in number (a minimum of 200 is to be expected). Finally, I searched and am considering an ArrayList with indexOf(), but I don't know about its performance. I also need to know which of the terms actually matched, so in this case it would be "Apple".
Please provide your views, ideas and suggestions on this problem.
I saw the Aho-Corasick algorithm, but the keywords/terms are very likely to change often, so I don't think I can use it. Oh, and I'm no expert in text mining and maths, so please elaborate on complex concepts.
Thank you, Stack Overflow people, for your time! :)
If you use a multimap from Google Collections, it has a function to invert the map (so you can start with a map like {"Fruits" => [Apple]} and produce a map like {"Apple" => ["Fruits"]}). Then you can look up a word and find its list of categories in one call to the map.
I would expect I'd want to split the strings myself and lookup the words in the map one at a time, so that I could do stemming (adjusting for different word endings) and stopword-filtering. Using the map should get good lookup times, plus it's easy to try out.
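A small sketch with Guava (the successor of Google Collections); Multimaps.invertFrom does the inversion described above, and the dictionary data is taken from the question:

import com.google.common.collect.HashMultimap;
import com.google.common.collect.Multimap;
import com.google.common.collect.Multimaps;

public class CategoryLookup {
    public static void main(String[] args) {
        Multimap<String, String> byCategory = HashMultimap.create();
        byCategory.put("Fruits", "apple");
        byCategory.put("Fruits", "orange");
        byCategory.put("Items", "pen");

        // Invert to word -> categories, then look up each word of a sentence.
        Multimap<String, String> byWord =
                Multimaps.invertFrom(byCategory, HashMultimap.<String, String>create());
        for (String word : "an apple fell on newton's head".split("\\s+")) {
            if (byWord.containsKey(word)) {
                System.out.println(word + " -> " + byWord.get(word)); // apple -> [Fruits]
            }
        }
    }
}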
Would a suffix tree or similar data structure work for your application? It offers O(m) string lookup, where m is the length of the search string, after an O(n²) -- or better with some trickery -- initial setup, and, with some extra effort, you can associate arbitrary data, such as a reference to a category, with complete words in your dictionary. If you don't want to code it yourself, I believe the BioJava library includes an implementation.
You can also add strings to a suffix tree after the initial setup, although the cost will still be around O(n²). That's probably not a big deal if you're adding short words.
If you have only 200 terms to look for, regexps might actually work for you. Of course the regular expression is large, but if you compile it once and just reuse the compiled Pattern, the lookup time is probably linear in the combined length of all the strings in array#1, and I don't see how you could hope to do better than that.
So the algorithm would be: concatenate the words of array#2 which you want to look for into a regular expression, compile it, and then find the matches in array#1.
(Regular expressions are compiled into a state machine - that is on each character of the string it just does a table lookup for the next state. If the regular expression is complicated you might have backtracking that increases the time, but your regular expression has a very simple structure.)
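A rough sketch of that approach; the dictionary and sentences come from the question, and Pattern.quote guards against special characters in the terms:

import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class DictionaryMatcher {
    public static void main(String[] args) {
        List<String> fruits = Arrays.asList("apple", "orange", "peach");

        // One alternation over all dictionary terms, compiled once and reused.
        Pattern pattern = Pattern.compile(
                fruits.stream().map(Pattern::quote).collect(Collectors.joining("|")),
                Pattern.CASE_INSENSITIVE);

        for (String sentence : new String[] {
                "An apple fell on Newton's head", "Apples grow on trees"}) {
            Matcher m = pattern.matcher(sentence);
            while (m.find()) {
                // m.group() tells you which dictionary term actually matched.
                System.out.println(sentence + " -> Fruits (matched '" + m.group() + "')");
            }
        }
    }
}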