Given two names that have variations in the way they are represented, is there any API/tool/algorithm that can give a score of how similar/different the names are?
Tim O' Reilly is one input and T Reilly is another. The score returned for these two should be lower than the one returned for Tim O' Reilly and Tim Reilly.
I am looking for such score calculation mechanisms. A few challenges the algorithm should be able to handle are:
1) The first names and last names could be swapped when a name is given as input
2) There might be initials in place of names
3) One of the names may not have the last name while the other may have both first name and last name.
... and other variations that are common in name representations.
Two libraries that include a handful of distance measures for name similarity are:
SimPack
SecondString
No single method covers all the cases you mention, but for 1) and 3), feature and set similarity measures (Jaccard or TF-IDF, for instance) work well. For 2), besides Soundex (as mentioned by @houman001), you may consider Levenshtein or Jaro. Experiment with some examples from your use case and combine them.
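To make the "combine" idea concrete, here is a minimal, self-contained Java sketch: a token-set (Jaccard) score blended with a character-level (Levenshtein) score. The class name, the tokenization, and the 50/50 weighting are all made up for illustration and come from neither SimPack nor SecondString; tune them against your own examples.

import java.util.*;

public class NameSimilaritySketch {

    // Token-set (Jaccard) similarity over whitespace-separated name parts.
    static double jaccard(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        ta.retainAll(tb);                     // ta now holds the intersection
        return union.isEmpty() ? 0.0 : (double) ta.size() / union.size();
    }

    // Plain Levenshtein edit distance between two strings.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // Combined score in [0, 1]; the equal weighting is a placeholder to tune per use case.
    static double score(String a, String b) {
        double editSim = 1.0 - (double) levenshtein(a.toLowerCase(), b.toLowerCase())
                               / Math.max(a.length(), b.length());
        return 0.5 * jaccard(a, b) + 0.5 * editSim;
    }

    public static void main(String[] args) {
        System.out.println(score("Tim O' Reilly", "Tim Reilly")); // expect a higher score
        System.out.println(score("Tim O' Reilly", "T Reilly"));   // expect a lower score
    }
}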
For the "API/tool/algorithm that can give a score of how similar/different the names are" part, I can give you a hint:
There are a few heuristic libraries that search engines use, but there is also an encoding called Soundex that computes a code from a word. Words that differ only slightly (because they sound alike) end up with the same Soundex code. There are some Java implementations around as well.
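If you want to try Soundex from Java without writing it yourself, Apache Commons Codec ships an implementation. A rough usage sketch (the codes in the comments are what I would expect, so verify against your version):

import org.apache.commons.codec.language.Soundex;

public class SoundexDemo {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        // Names that sound alike should map to the same 4-character code.
        System.out.println(soundex.encode("Reilly")); // e.g. R400
        System.out.println(soundex.encode("Riley"));  // e.g. R400
        System.out.println(soundex.encode("Smith"));  // a different code, e.g. S530
    }
}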
For the other points you mention about names, look at contact-management libraries/utilities and expect to do some coding yourself, as those requirements are pretty specific.
I am a complete Drools noob. I have been tasked with implementing a set of rules which, in the absence of nested rules, seems very complex to me. The problem is as follows:
I have a fact called Person with attributes age, gender, income, height, weight and a few others. A person may be classified as level_1, level_2, ..., level_n based on the values of the attributes. For example,
when age < a and any value for other attributes, then classification = level_1
when gender == female and any value for other attributes, then classification = level_2
when age < a and gender == female and any value for other attributes, then classification = level_10
...
So, in any rule any arbitrary combination of attributes may be used. Can anyone help me in expressing this?
The second part of the problem is that the levels are ordered, and if a person satisfies more than one rule, the highest level is chosen. The only way I can think of ordering the levels is to order the rules themselves using salience, so rules resulting in higher levels will have higher salience. Is there a more elegant way of doing this?
I found a similar question here but that seems to deal with only 1 rule and the OP is probably more familiar with Drools than I am because I have not understood the solution. That talks about introducing a separate control fact but I didn't get how that works.
EDIT:
I would eventually have to create a template and supply the data using a csv. It probably does not matter for this problem, but, if it helps in any way...
The problem of assigning a discrete value to facts of a type based on attribute combinations is what I call "classification problem" (cf. Design Patterns in Production Systems). The simple approach is to write one rule for each discrete value, with constraints separating the attribute value cleanly from all other attribute sets. Note that statements such as
when attribute value age < a and any value for other attributes then classify as level 1
are misleading and must not be used to derive rules, because, evidently, this isn't a correct requirement since we have
when age < a && gender == female (...) then classify as level 10
and this contradicts the former requirement, correctly written as
when age < a && gender == male then classify as level 1
Likewise, the specification for level 2 must also be completed (and it'll become evident that there is no country for old men). With this approach, a classification based on n attributes with just 2 intervals each results in 2^n rules. If the number of resulting levels is anywhere near this number, this approach is best. For implementation, a decision table is suitable.
If major subsets of the n-dimensional space should fall into the same class, a more economical solution should be used. If, for instance, all women should fall into the same level, a rule selecting all women can be written and given highest precedence; the remaining rules will have to deal with n-1 dimensions. Thus, the simplest scenario would require just n rules, one for each dimension.
It is also possible to describe other intervals in an n-dimensional space, providing the full set of values for all dimensions with each interval. Using the appropriate subset of values for each interval avoids the necessity of ordering the rules (using salience) and ensures that really all cases are handled. Of course, a "fall-through" rule firing with low priority if the level hasn't been set is only prudent.
I'm comparing song titles, mostly in Latin script (although not always). My aim is an algorithm that gives a high score if the two song titles appear to be the same title and a very low score if they have nothing in common.
I already had Java code that did this using Lucene and a RAMDirectory; however, using Lucene simply to compare two strings is too heavyweight and consequently too slow. I've now moved to https://github.com/nickmancol/simmetrics which has many nice algorithms for comparing two strings:
https://github.com/nickmancol/simmetrics/tree/master/src/main/java/uk/ac/shef/wit/simmetrics/similaritymetrics
BlockDistance
ChapmanLengthDeviation
ChapmanMatchingSoundex
ChapmanMeanLength
ChapmanOrderedNameCompoundSimilarity
CosineSimilarity
DiceSimilarity
EuclideanDistance
InterfaceStringMetric
JaccardSimilarity
Jaro
JaroWinkler
Levenshtein
MatchingCoefficient
MongeElkan
NeedlemanWunch
OverlapCoefficient
QGramsDistance
SmithWaterman
SmithWatermanGotoh
SmithWatermanGotohWindowedAffine
Soundex
but I'm not well versed in these algorithms. What would be a good choice?
I think Lucene uses CosineSimilarity in some form, so that is my starting point but I think there might be something better.
Specifically, the algorithm should work on short strings and should understand the concept of words, i.e. spaces should be treated specially. Good matching of Latin script is most important, but good matching of other scripts such as Korean and Chinese is relevant as well, though I expect those would need a different algorithm because of the way they treat spaces.
They're all good. They work on different properties of strings and have different matching properties. What works best for you depends on what you need.
I'm using JaccardSimilarity to match names. I chose JaccardSimilarity because it was reasonably fast and, for short strings, excelled at matching names with common typos while quickly degrading the score for anything else. It gives extra weight to spaces and is insensitive to word order. I needed this behavior because the impact of a false positive was much higher than that of a false negative, spaces could be typos but not often, and word order was not that important.
Note that this was done in combination with a simplifier that removes diacritics and a mapper that maps the remaining characters to the a-z range. This is passed through a normalizer that standardizes all word-separator symbols to a single space. Finally, the names are parsed to pick out initials, prefixes, infixes, and suffixes, because names have a structure and format to them that is rather resistant to plain string comparison.
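As a rough illustration of that cleanup pipeline (the exact rules in my system differ; the regexes below are just one plausible way to do it), using only the JDK:

import java.text.Normalizer;
import java.util.Locale;

public class NameNormalizerSketch {

    // Strip diacritics, lowercase, unify word separators, keep only a-z and spaces.
    static String normalize(String name) {
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        String noMarks = decomposed.replaceAll("\\p{M}", "");        // drop combining marks
        String lower = noMarks.toLowerCase(Locale.ROOT);
        String unified = lower.replaceAll("[\\s\\-_.,']+", " ");     // one separator: space
        return unified.replaceAll("[^a-z ]", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("José  García-Núñez")); // "jose garcia nunez"
        System.out.println(normalize("O'Reilly, Tim"));      // "o reilly tim"
    }
}

The normalized form is then what the tokenizer and JaccardSimilarity get to see.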
To make your choice, you need to make a list of the criteria you want and then look for an algorithm that satisfies those criteria. You can also build a reasonably large test set and run all the algorithms on it to see what the trade-offs are with respect to time, true positives, false positives, false negatives, true negatives, the classes of errors your system should handle, etc.
If you are still unsure of your choice, you can also set up your system to switch the exact comparison algorithm at run time. This allows you to do an A/B test and see which algorithm works best in practice.
TL;DR: which algorithm you want depends on what you need; if you don't know what you need, make sure you can change it later and run tests on the fly.
You likely need to solve a string-to-string correction problem. The Levenshtein distance algorithm is implemented in many languages. Before running it I'd remove all spaces from the strings, because they don't carry much information but may influence the measured difference between two strings. For string search, prefix trees are also useful; you can have a look in this direction as well, for example here or here. This was already discussed on SO. If spaces are that significant in your case, just assign a greater weight to them.
Each algorithm is going to focus on a similar, but slightly different aspect of the two strings. Honestly, it depends entirely on what you are trying to accomplish. You say that the algorithm needs to understand words, but should it also understand interactions between those words? If not, you can just break up each string according to spaces, and compare each word in the first string to each word in the second. If they share a word, the commonality factor of the two strings would need to increase.
In this way, you could create your own algorithm that focused only on what you were concerned with. If you want to test another algorithm that someone else made, you can find examples online and run your data through to see how accurate the estimated commonality is with each.
I think http://jtmt.sourceforge.net/ would be a good place to start.
Interesting. Have you thought about a radix sort?
http://en.wikipedia.org/wiki/Radix_sort
The concept behind radix sort is that it is a non-comparative integer sorting algorithm that sorts data with integer keys by grouping keys by their individual digits. If you convert your string into an array of characters, each character will be a number no greater than 3 digits, so your k = 3 (maximum number of digits) and your n = the number of strings to compare. This will sort on the first digits of all your strings. Then there is another factor, s = the length of the longest string. Your worst case for sorting would be 3*n*s and the best case would be (3 + n) * s. Check out some radix sort examples for strings here:
http://algs4.cs.princeton.edu/51radix/LSD.java.html
http://users.cis.fiu.edu/~weiss/dsaajava3/code/RadixSort.java
Did you take a look at the Levenshtein distance?
int org.apache.commons.lang.StringUtils.getLevenshteinDistance(String s, String t)
Find the Levenshtein distance between two Strings.
This is the number of changes needed to change one String into
another, where each change is a single character modification
(deletion, insertion or substitution).
The previous implementation of the Levenshtein distance algorithm was
from http://www.merriampark.com/ld.htm
Chas Emerick has written an implementation in Java, which avoids an
OutOfMemoryError which can occur when my Java implementation is used
with very large strings. This implementation of the Levenshtein
distance algorithm is from http://www.merriampark.com/ldjava.htm
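If you already have Commons Lang on the classpath, using it is a one-liner. A quick sketch (the titles are made up; note that in newer versions this method is deprecated in commons-lang3 in favor of commons-text, so check your version):

import org.apache.commons.lang.StringUtils;

public class TitleDistanceDemo {
    public static void main(String[] args) {
        // Number of single-character edits needed to turn one title into the other.
        int d = StringUtils.getLevenshteinDistance("My Heart Will Go On",
                                                   "My Heart Will Go On (Remastered)");
        System.out.println(d);
    }
}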
Anyway, I'm curious to know what you choose in this case!
I have a List<String[]> of customer records in Java (from a database). I know from manually eyeballing the data that 25%+ are duplicates.
The duplicates are far from exact though. Sometimes they have different zips, but the same name and address. Other times the address is missing completely, etc...
After a day of research, I'm still really stumped as to how to even begin to attack this problem.
What are the "terms" I should be googling for that describe this area (from a solve-this-in-Java perspective)? And I don't suppose there is a fuzzymatch.jar out there that makes it all just too easy?
I've done similar systems before for matching place information and people information. These are complex objects with many features and figuring out whether two different objects describe the same place or person is tricky. The way to do it is to break it down to the essentials.
Here's a few things that you can do:
0) If this is a one-off, load the data into OpenRefine and fix things interactively. At best this solves your problem; at worst it will show you where your possible matches are.
1) There are several ways you can compare strings. Basically they differ in how reliably they avoid false positives and false negatives. A false positive is when two strings match although they shouldn't; a false negative is when they should match but don't. String equals will not produce false positives but will miss a lot of potential matches due to slight variations. Levenshtein with a small distance threshold is slightly better. N-grams produce a lot of matches, but many of them will be false. There are a few more algorithms; take a look at e.g. the OpenRefine code to find various ways of comparing and clustering strings. Lucene implements a lot of this in its analyzer framework but is a bit of a beast to work with if you are not very familiar with its design.
2) Separate the process of comparing things from the process of deciding whether you have a match. What I did in the past was qualify my comparisons using a simple numeric score, e.g. this field matched exactly (100), that field was a partial match (75), and that field did not match at all (0). The resulting vector of qualified comparisons, e.g. (100, 75, 0, 25), can be compared to a reference vector that defines your perfect or partial match criteria. For example, if first name, last name, and street match, the two records are the same regardless of the rest of the fields. Or if phone numbers and last names match, that's a valid match too. You can encode such perfect matches as a vector and then simply compare it with your comparison vectors to determine whether it was a match, not a match, or a partial match. This is a manual version of what machine learning does, which is to extract vectors of features and then build up a probability model of which vectors mean what from reference data. Doing it manually can work for simple problems.
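A loose sketch of that "qualified comparison vector" idea; the field layout, the prefix-based partial match, and the thresholds are all invented for illustration:

import java.util.Arrays;

public class MatchVectorSketch {

    // Score one field pair: 100 exact, 75 close (same 3-letter prefix, purely illustrative), 0 otherwise.
    static int fieldScore(String a, String b) {
        if (a == null || b == null) return 0;
        if (a.equalsIgnoreCase(b)) return 100;
        if (a.length() >= 3 && b.length() >= 3
                && a.substring(0, 3).equalsIgnoreCase(b.substring(0, 3))) return 75;
        return 0;
    }

    // Compare two records field by field; the field order (first name, last name, street, phone) is assumed.
    static int[] compare(String[] r1, String[] r2) {
        int[] scores = new int[r1.length];
        for (int i = 0; i < r1.length; i++) scores[i] = fieldScore(r1[i], r2[i]);
        return scores;
    }

    // A record pair matches a reference rule if every field scores at least what the rule requires.
    static boolean satisfies(int[] scores, int[] rule) {
        for (int i = 0; i < rule.length; i++) if (scores[i] < rule[i]) return false;
        return true;
    }

    public static void main(String[] args) {
        String[] a = {"Tim", "O'Reilly", "1 Main St", "555-1234"};
        String[] b = {"Timothy", "O'Reilly", "1 Main St", null};
        int[] scores = compare(a, b);
        // Rule: first name may be a partial match, but last name and street must be exact.
        int[] rule = {75, 100, 100, 0};
        System.out.println(Arrays.toString(scores) + " -> match: " + satisfies(scores, rule));
    }
}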
3) Build up a reference data set with test cases that you know match or don't match, and evaluate your algorithm against that reference set. That way you will know whether you are improving things or making things worse when you tweak, e.g., the threshold that goes into Levenshtein or whatever.
Jilles' answer is great and comes from experience. I've also had to work on cleaning up large messy tables and sadly didn't know much about my options at that time (I ended up using Excel and a lot of autofilters). Wish I'd known about OpenRefine.
But if you get to the point where you have to write custom code to do this, I want to make a suggestion as to how: The columns are always the same, right? For instance, the first String is always the key, the second is the First name, the sixth is the ZIP code, tenth is the fax number, etc.?
Assuming there's not an unreasonable number of fields, I would start with a custom Record type which has each DB field as a member rather than a position in an array. Something like
class CustomerRow {
    public final String id;
    public final String firstName;
    // ...
    public CustomerRow(String[] data) {
        id = data[0];
        firstName = data[1];
        // ...
    }
}
You could also include some validation code in the constructor, if you knew there to be garbage values you always want to filter out.
(Note that you're basically doing what an ORM would do automatically, but getting started with one would probably be more work than just writing the Record type.)
Then you'd implement some Comparator<CustomerRow>s which only look at particular fields, or define equality in fuzzy terms (there's where the edit distance algorithms would come in handy), or do special sorts.
Java uses a stable sort for objects, so to sort by e.g. name, then address, then key, you would just do each sort, but choose your comparators in the reverse order.
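For example (a minimal stand-in for the CustomerRow sketched above, with an assumed address field), sorting by name, then address, then key via three stable sorts applied in reverse order of significance:

import java.util.*;

public class SortDemo {
    // Minimal stand-in for the CustomerRow above; the address field is assumed here.
    static class CustomerRow {
        final String id, firstName, address;
        CustomerRow(String id, String firstName, String address) {
            this.id = id; this.firstName = firstName; this.address = address;
        }
    }

    public static void main(String[] args) {
        List<CustomerRow> rows = new ArrayList<>(Arrays.asList(
                new CustomerRow("2", "Ann", "1 Main St"),
                new CustomerRow("1", "Ann", "1 Main St"),
                new CustomerRow("3", "Bob", "9 Oak Ave")));

        // Collections.sort is stable, so sorting by the least significant key first
        // and the most significant key last yields "name, then address, then key".
        Collections.sort(rows, Comparator.comparing((CustomerRow r) -> r.id));
        Collections.sort(rows, Comparator.comparing((CustomerRow r) -> r.address));
        Collections.sort(rows, Comparator.comparing((CustomerRow r) -> r.firstName));

        for (CustomerRow r : rows)
            System.out.println(r.firstName + " " + r.address + " " + r.id);
    }
}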
Also if you have access to the actual database, and it's a real relational database, I'd recommend doing some of your searches as queries where possible. And if you need to go back and forth between your Java objects and the DB, then using an ORM may end up being a good option.
I have responses from users to multiple choice questions, e.g. (roughly):
Married/Single
Male/Female
American/Latin American/European/Asian/African
What I want is to estimate similarity by aggregating all responses into a single field which can be compared across users in the database - rather than running queries against each column.
So, for example, some responses might look like:
Married-Female-American
Single-Female-European
But I don't want to store a massive text object to represent all of the possible concatenated responses since there are maybe 50 of them.
So, is there some way to represent a set of responses more concisely, using a Java library method of some kind?
In other words, this method would take Married-Female-American and generate a code, say of abc while Single-Female-European would generate a code of, say, def?
This way if I want to find out if two users are Married-Female-Americans I can simply query a single column for the code abc.
Well, if it was a multiple choice question, you have choices enumerated. That is, numbered. Why not use 1-1-2 and 23-1-75 then? Even if you have 50 answers, it's still manageable.
Now if you happen to need the similarity, aggregating is the last thing you want. What you want is a simple array of ids of the answers given and a function defining a distance between two answer arrays. Do not use Strings, do not aggregate. Leave clean nice vectors, and all the ML libraries will be at your service.
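A small sketch of what that might look like: an int array of answer ids per user and a distance that simply counts differing answers (Hamming-style); per-question weights would be an obvious refinement:

public class AnswerDistanceSketch {

    // Count of questions answered differently; assumes both users answered all questions in the same order.
    static int distance(int[] a, int[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++)
            if (a[i] != b[i]) d++;
        return d;
    }

    public static void main(String[] args) {
        // Answer ids per question, e.g. {marital status, gender, region}.
        int[] marriedFemaleAmerican = {1, 2, 1};
        int[] singleFemaleEuropean  = {2, 2, 3};
        System.out.println(distance(marriedFemaleAmerican, singleFemaleEuropean)); // 2
    }
}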
To quote a Java ML library, try http://www.cs.waikato.ac.nz/~ml/weka/
Update: One more thing you may want to try is locality sensitive hashing. I don't think it's a good idea in your case, but your question looks like a request for it. Give it a try.
Do you have a finite number of options (multiple-choice seems to imply this)?
It is a common technique for performance to go from strings to a numerical data set, by essentially indexing the available strings. As long as you only need identity, this is perfect. Comparing an integer is much faster than comparing a string, and they usually take less memory, too.
A character is essentially an integer in 0-255, so you can of course use this.
So just define an alphabet:
a Married
b Single
c Male
d Female
e American
f Latin American
g European
h Asian
i African
You can in fact use this even when you have more than 256 words, if they are positional (and no single question has more than 256 choices). You would then use
a Q1: Married
b Q1: Single
a Q2: Male
b Q2: Female
a Q3: American
b Q3: Latin American
c Q3: European
d Q3: Asian
e Q3: African
Your examples would then be encoded as either (variant 1) ade and bdg or (variant 2) aba and bbc. The string would then have a fixed length of 50 (if you have 50 questions) and can be stored very efficiently.
For comparing answers, just access the nth character of the string. Maybe your database allows for indexed substring queries, too. As you can see in the above example, both strings agree only on the second character, just as the answers agreed.
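A compact sketch of variant 2 in Java, encoding the answer index for each question as a single character and comparing position by position (the question/answer numbering is taken from the example above):

public class AnswerCodeSketch {

    // Encode one answer index per question as a single character ('a' = first choice).
    static String encode(int[] answerIndexes) {
        StringBuilder sb = new StringBuilder(answerIndexes.length);
        for (int idx : answerIndexes) sb.append((char) ('a' + idx));
        return sb.toString();
    }

    // Number of questions answered identically, read straight off the fixed positions.
    static int agreements(String code1, String code2) {
        int same = 0;
        for (int i = 0; i < code1.length(); i++)
            if (code1.charAt(i) == code2.charAt(i)) same++;
        return same;
    }

    public static void main(String[] args) {
        // Q1: 0=Married, 1=Single   Q2: 0=Male, 1=Female   Q3: 0=American, 1=Latin American, 2=European, ...
        String marriedFemaleAmerican = encode(new int[]{0, 1, 0}); // "aba"
        String singleFemaleEuropean  = encode(new int[]{1, 1, 2}); // "bbc"
        System.out.println(agreements(marriedFemaleAmerican, singleFemaleEuropean)); // 1
    }
}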
I'm doing a Java project where I have to make a text similarity program. I want it to take 2 text documents, then compare them with each other and get the similarity of it. How similar they are to each other.
Later I'll add an existing database that can find synonyms for words, and go through the text to see if one of the document writers just changed words to synonyms while the text is otherwise exactly the same. The same goes for moving paragraphs up or down.
Yes, as if it were a plagiarism program...
I want to hear from you what kinds of algorithms you would recommend.
I've found Levenshtein and cosine similarity by looking here and in other places. Both of them seem to be mentioned a lot. Hamming distance is another one my teacher told me about.
I got some questions related to those since I'm not really getting Wikipedia. Could someone explain those things to me?
Levenshtein: this algorithm changes one word into another by substitution, addition, and deletion of characters, and measures how close the two words are. But how can that be used on a whole text file? I can see how it can be used on a word, but not on a sentence or a whole text document from one to another.
Cosine: it's a measure of similarity between two vectors, obtained by measuring the cosine of the angle between them. What I don't understand here is how two texts can become two vectors, and what happens to the words/sentences in them?
Hamming: this distance seems to work better than Levenshtein, but only on strings of equal length. How can it matter when two documents, and even the sentences in them, aren't strings of equal length?
Wikipedia should make sense of this, but it doesn't to me. I'm sorry if the questions sound too stupid, but it's dragging me down and I think there are people here who are quite capable of explaining it so even newcomers to this field can get it.
Thanks for your time.
Levenshtein: in theory you could use it for a whole text file, but it's really not very suitable for the task. It's really intended for single words or (at most) a short phrase.
Cosine: You start by simply counting the unique words in each document. The answers to a previous question cover the computation once you've done that.
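A compact sketch of that count-then-cosine step (no stop-word removal or stemming here; see the notes below for why you'd normally add both):

import java.util.*;

public class CosineSketch {

    // Term-frequency vector: word -> count.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    // Cosine of the angle between the two term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(wordCounts("the cat sat on the mat"),
                                  wordCounts("the cat sat on a mat")));
    }
}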
I've never used Hamming distance for this purpose, so I can't say much about it.
I would add TF-IDF (Term Frequency * Inverse Document Frequency) to the list. It's fairly similar to cosine distance, but 1) tends to do a better job on shorter documents, and 2) does a better job of taking into account which words are extremely common in the entire corpus rather than just the ones that happen to be common to the two particular documents.
One final note: for any of these to produce useful results, you nearly always need to screen out stop words before you compute the degree of similarity (though TF-IDF seems to do better than the others if you skip this). At least in my experience, it's extremely helpful to stem the words (remove suffixes) as well. When I've done it, I used the Porter stemmer algorithm.
For your purposes, you probably want to use what I've dubbed an inverted thesaurus, which lets you look up a word, and for each word substitute a single canonical word for that meaning. I tried this on one project, and didn't find it as useful as expected, but it sounds like for your project it would probably be considerably more useful.
The basic idea of comparing the similarity of two documents, which is a topic in information retrieval, is to extract some fingerprint and judge whether the documents share information based on that fingerprint.
As a hint, Winnowing: Local Algorithms for Document Fingerprinting may be a good choice and a good starting point for your problem.
Consider the example on wikipedia for Levenshtein distance:
For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
1. kitten → sitten (substitution of 's' for 'k')
2. sitten → sittin (substitution of 'i' for 'e')
3. sittin → sitting (insertion of 'g' at the end).
Now, replace "kitten" with "text from first paper", and "sitting" with "text from second paper".
Paper[] papers = getPapers();
for (int i = 0; i < papers.length - 1; i++) {
    for (int j = i + 1; j < papers.length; j++) {
        Paper first = papers[i];
        Paper second = papers[j];
        int dist = compareSimilarities(first.text, second.text);
        System.out.println(first.name + "'s paper compares to " + second.name
                + "'s paper with a similarity score of " + dist);
    }
}
Compare those results and peg the kids with the lowest distance scores.
In your compareSimilarities method, you could use any or all of the comparison algorithms. Another one you could incorporate into the formula is "longest common substring" (which is a good method of finding plagiarism), as in the sketch below.
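For the longest-common-substring part, a minimal dynamic-programming sketch (compareSimilarities above is left for you to define; this could be one ingredient of it):

public class LongestCommonSubstring {

    // Length of the longest run of characters that appears in both strings.
    static int longestCommonSubstring(String a, String b) {
        int[][] len = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    best = Math.max(best, len[i][j]);
                }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestCommonSubstring("plagiarism detector", "copy detector")); // 9 (" detector")
    }
}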