I'm looking for a tool that compares two text strings and returns an indicator of their similarity (e.g. 95%). It needs to run on a platform that supports Java libraries.
My best guess is that I need some fuzzy logic comparison tool that would do the fuzzy match and then return the similarity level.
I've seen some posts here related to fuzzy search, but I need the exact opposite: I don't want to set some parameters and have similar entries returned. Instead, I have the entries on hand and need the similarity parameter derived from them.
Can you advise me on that? Many thanks
Apache Commons Lang's StringUtils has a method that computes the Levenshtein distance.
http://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/StringUtils.html
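Levenshtein distance is the "edit distance" between two strings: the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other, so a smaller value means more similar strings. I'm not sure whether this counts as "fuzzy", though.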
Example:
int distance = StringUtils.getLevenshteinDistance("cat", "hat");
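Since the question asks for a percentage rather than a raw edit count, one possible way to get there is to normalize the distance by the length of the longer string. The sketch below is only an illustration: the class name SimilarityScore and the normalization are assumptions, not part of Commons Lang, and the distance is computed by hand so the example has no dependencies.

```java
final class SimilarityScore {

    /** Classic dynamic-programming Levenshtein distance, keeping only two rows. */
    static int levenshtein(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1.0 for identical strings, 0.0 when every position differs. */
    static double similarity(CharSequence a, CharSequence b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) levenshtein(a, b) / longest;
    }

    public static void main(String[] args) {
        // "cat" -> "hat" needs one substitution out of three characters.
        System.out.printf("%.0f%%%n", 100 * similarity("cat", "hat"));
    }
}
```

Other normalizations (for example dividing by the sum of both lengths) are equally defensible; which one fits depends on how the percentage will be interpreted.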
There is now a library that does exactly that
https://github.com/intuit/fuzzy-matcher
My requirement is to be able to match two strings that are similar but not an exact match.
For example, given the following strings
First Name
Last Name
LName
FName
The output should be First Name, FName and Last Name, LName, as they are logical matches. Are there any libraries that I could use to do this? I am using Java to achieve this functionality.
Thanks
Raam
You could use Apache Commons StringUtils...
http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#getLevenshteinDistance(java.lang.CharSequence,%20java.lang.CharSequence)
But it's worth noting that this may not be the best algorithm for the specific use-case in the question - I recommend reading some of the other answers here for more ideas.
According to the example you gave, you should use a modified Levenshtein distance where the penalty for adding spaces is small and the penalty for mismatched characters is larger. This will handle matching abbreviations to the strings that were abbreviated quite well. However that's assuming that you're mainly dealing with aligning abbreviations to corresponding longer versions of the strings. You should elaborate more exactly what kind of matchings you want to perform (e.g. more examples, or some kind of high-level description) if you want a more detailed and pointed answer about what methods you can/should use.
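A minimal sketch of such a modified Levenshtein distance follows. The class name WeightedEditDistance and the three cost constants are illustrative assumptions chosen only so that the question's examples come out as expected; real data would need tuning.

```java
import java.util.List;

final class WeightedEditDistance {
    // Hypothetical weights: spaces are nearly free, other gaps cost 1, mismatches cost more.
    static final double SPACE_GAP = 0.1;
    static final double GAP = 1.0;
    static final double MISMATCH = 1.5;

    private static double gapCost(char c) {
        return c == ' ' ? SPACE_GAP : GAP;
    }

    /** Edit distance where the cost of inserting or deleting a character depends on the character. */
    static double distance(String a, String b) {
        double[][] d = new double[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            d[i][0] = d[i - 1][0] + gapCost(a.charAt(i - 1));
        }
        for (int j = 1; j <= b.length(); j++) {
            d[0][j] = d[0][j - 1] + gapCost(b.charAt(j - 1));
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                char ca = a.charAt(i - 1);
                char cb = b.charAt(j - 1);
                double substitute = d[i - 1][j - 1] + (ca == cb ? 0.0 : MISMATCH);
                double delete = d[i - 1][j] + gapCost(ca);
                double insert = d[i][j - 1] + gapCost(cb);
                d[i][j] = Math.min(substitute, Math.min(delete, insert));
            }
        }
        return d[a.length()][b.length()];
    }

    /** Returns the candidate with the smallest weighted distance to target, or null if there are none. */
    static String closest(String target, List<String> candidates) {
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (String candidate : candidates) {
            double current = distance(target, candidate);
            if (current < bestDistance) {
                bestDistance = current;
                best = candidate;
            }
        }
        return best;
    }
}
```

With these weights, FName is closer to First Name than to Last Name, because expanding the abbreviation only needs insertions while the other pairing also needs an F/L mismatch. The comparison is case-sensitive; lower-casing both inputs first is a reasonable variation.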
StringUtils works well for this, as CupawnTae already said. Below is a simple example I came across on Stack Overflow:
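import java.util.Collection;
import org.apache.commons.lang3.StringUtils;

// Returns the element whose toString() has the smallest Levenshtein distance
// to target.toString(), or null if the collection is empty.
public static Object getTheClosestMatch(Collection<?> collection, Object target) {
    int distance = Integer.MAX_VALUE;
    Object closest = null;
    for (Object compareObject : collection) {
        int currentDistance = StringUtils.getLevenshteinDistance(compareObject.toString(), target.toString());
        if (currentDistance < distance) {
            distance = currentDistance;
            closest = compareObject;
        }
    }
    return closest;
}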
An answer to a really similar question to yours can be found here.
Also, Wikipedia has an article on Approximate String Matching that can be found here. If the first link isn't what you're looking for, I would suggest reading the Wikipedia article and digging through its sources to find what you need.
Sorry I can't personally be of more help to you, but I really hope that these resources can help you find what you're looking for!
Spell-check algorithms use a variant of the Levenshtein distance algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance. I implemented it in class for a project and it was fairly simple to do. If you don't want to implement it yourself, you can use the name to search for existing libraries.
Is there an algorithm that, given two strings, yields the degree of equality between them, applying metrics that can be provided externally? For example, the two strings "Plant code" and "PlantCode" could be 0.8 equal, "Plant code" and "Plant" could be 0.6 equal, "Truck no" and "shipment details" could be 0.6 equal (using an externally provided synonyms dictionary). The numbers are made up, but I hope they get the point across. Does such an algorithm exist? I'd prefer it to come as a library, rather than having to implement it on my own. Any help would be greatly appreciated. Thanks.
Try the Simmetrics library. It provides a wide range of similarity metrics.
Maybe the google-diff-match-patch library can help: this library implements Myers' diff algorithm, which is generally considered to be the best general-purpose diff.
There's also the Levenshtein distance algorithm and its example Java implementation. It does not let you supply external metrics, though.
I'm currently working on a Java application where I need to implement a system for building BPF expressions. I also need to implement a mechanism for detecting equivalent BPF expressions.
Building the expression is not too hard. I can build a syntax tree using the Interpreter design pattern and implement the toString for getting the BPF syntax.
However, detecting if two expressions are equivalent is much harder. A simple example would be the following:
A: src port 1024 and dst port 1024
B: dst port 1024 and src port 1024
In order to detect that A and B are equivalent, I probably need to transform each expression into a "normalized" form before comparing them. This would be easy for the above example; however, it gets harder when working with a combination of nested AND, OR and NOT operations.
Does anyone know how I should best approach this problem?
One way to compare boolean expressions may be to convert both to the disjunctive normal form (DNF), and compare the DNF. Here, the variables would be Berkeley Packet Filter tokens, and the same token (e.g. port 80) appearing anywhere in either of the two expressions would need to be assigned the same variable name.
There is an interesting-looking applet at http://www.izyt.com/BooleanLogic/applet.php - sadly I can't give it a try right now due to Java problems in my browser.
I'm pretty sure detecting equivalent expressions is either an NP-hard or NP-complete problem, even for boolean-only expressions. That means that to do it perfectly, you basically have to build complete tables of all possible combinations of inputs and their results, then compare the tables.
Maybe BPF expressions are limited in some way that changes that? I don't know, so I'm assuming not.
If your problems are small, that may not be a problem. I do exactly that as part of a decision-tree designing algorithm.
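For small expressions, the table comparison described above could be sketched as follows. This is only an illustration under stated assumptions: each distinct BPF token is treated as an independent boolean variable (which ignores relationships between tokens and so can report false negatives), and the class name TruthTableEquivalence and the use of Predicate<boolean[]> to stand in for an expression are hypothetical choices.

```java
import java.util.function.Predicate;

final class TruthTableEquivalence {

    /**
     * Returns true if both expressions give the same result for every one of the
     * 2^n assignments of n boolean variables. The cost is exponential in n.
     */
    static boolean equivalent(int n, Predicate<boolean[]> a, Predicate<boolean[]> b) {
        if (n < 0 || n > 30) {
            throw new IllegalArgumentException("variable count must be between 0 and 30");
        }
        boolean[] vars = new boolean[n];
        for (long mask = 0; mask < (1L << n); mask++) {
            // Bit i of mask is the value of variable i in this row of the table.
            for (int i = 0; i < n; i++) {
                vars[i] = ((mask >> i) & 1L) == 1L;
            }
            if (a.test(vars) != b.test(vars)) {
                return false; // found a row where the tables differ
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // vars[0] = "src port 1024", vars[1] = "dst port 1024"
        Predicate<boolean[]> exprA = v -> v[0] && v[1];
        Predicate<boolean[]> exprB = v -> v[1] && v[0];
        System.out.println(equivalent(2, exprA, exprB));
    }
}
```

The same variable index must be used for the same token in both expressions, otherwise the comparison is meaningless.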
Alternatively, don't try to be perfect. Allow some false negatives (cases which are equivalent, but which you won't detect).
A simple approach may be to do a variant of the normal expression evaluation, but producing an alternative representation of the expression rather than the result. Impose an ordering on commutative operators. Apply some obvious simplifications during the evaluation. Replace a rich operator set with a minimal set of primitive operators, e.g. using De Morgan's laws to eliminate OR operators.
This alternative representation forms a canonical representation for all members of a set of equivalent expressions. It should be an equivalence class in the sense that you always find the same canonical form for any member of that set. But that's only the set-theory/abstract-algebra sense of an equivalence class - it doesn't mean that all equivalent expressions are in the same equivalence class.
For efficient dictionary lookups, you can use hashes or comparisons based on that canonical representation.
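A rough sketch of such a canonical representation is shown below. All class and method names (CanonicalForm, Atom, Not, Nary, canonical) are made up for the illustration. It only flattens nested AND/OR nodes of the same kind, sorts and de-duplicates their operands, and cancels double negation, so it will miss many equivalences (it does not apply De Morgan's laws or distribute AND over OR, for example).

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class CanonicalForm {
    enum Op { AND, OR }

    abstract static class Expr {
        /** A string that is identical for all expressions this sketch can recognize as equivalent. */
        abstract String canonical();
    }

    static final class Atom extends Expr {
        final String token; // e.g. "src port 1024"
        Atom(String token) { this.token = token; }
        @Override String canonical() { return token; }
    }

    static final class Not extends Expr {
        final Expr operand;
        Not(Expr operand) { this.operand = operand; }
        @Override String canonical() {
            if (operand instanceof Not) {
                return ((Not) operand).operand.canonical(); // not(not(x)) == x
            }
            return "not(" + operand.canonical() + ")";
        }
    }

    static final class Nary extends Expr {
        final Op op;
        final List<Expr> operands;
        Nary(Op op, Expr... operands) {
            this.op = op;
            this.operands = Arrays.asList(operands);
        }
        @Override String canonical() {
            // TreeSet sorts the operands (commutativity) and drops duplicates (x and x == x).
            TreeSet<String> parts = new TreeSet<>();
            collect(this, parts);
            if (parts.size() == 1) {
                return parts.first();
            }
            return (op == Op.AND ? "and" : "or") + "(" + String.join(",", parts) + ")";
        }
        private void collect(Expr e, TreeSet<String> parts) {
            if (e instanceof Nary && ((Nary) e).op == op) {
                for (Expr child : ((Nary) e).operands) {
                    collect(child, parts); // flatten and(a, and(b, c)) into and(a, b, c)
                }
            } else {
                parts.add(e.canonical());
            }
        }
    }
}
```

Two expressions are then compared by calling canonical() on each and comparing the strings, which also gives a natural hash key for dictionary lookups.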
I'd definitely go with syntax normalization. That is, like aix suggested, transform the booleans using DNF and reorder the abstract syntax tree such that the lexically smallest arguments are on the left-hand side. Normalize all comparisons to < and <=. Then, two equivalent expressions should have equivalent syntax trees.
Specifically I am converting a python script into a java helper method. Here is a snippet (slightly modified for simplicity).
# hash of values
vals = {}
vals['a'] = 'a'
vals['b'] = 'b'
vals['1'] = 1
output = sys.stdout
file = open(filename).read()
print >>output, file % vals,
So in the file there are %(a), %(b), %(1) etc. that I want replaced with the values stored under those hash keys. I perused the API but couldn't find anything. Did I miss it, or does something like this not exist in the Java API?
You can't do this directly without some additional templating library. I recommend StringTemplate. Very lightweight, easy to use, and very optimized and robust.
I doubt you'll find a pure Java solution that'll do exactly what you want out of the box.
With this in mind, the best answer depends on the complexity and variety of Python formatting strings that appear in your file:
If they're simple and not varied, the easiest way might be to code something up yourself.
If the opposite is true, one way to get the result you want with little work is by embedding Jython into your Java program. This will enable you to use Python's string formatting operator (%) directly. What's more, you'll be able to give it a Java Map as if it were a Python dictionary (vals in your code).
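For the simple case, a hand-rolled version could look roughly like this. It is only a sketch under the assumption that every placeholder has the form %(name)s; the class name NamedFormat is made up, and other Python conversions such as %(n)d, width specifiers or the %% escape are not handled.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedFormat {
    // Matches %(key)s and captures the key; anything else in the template is left untouched.
    private static final Pattern PLACEHOLDER = Pattern.compile("%\\((\\w+)\\)s");

    /** Replaces each %(key)s in the template with String.valueOf(vals.get(key)). */
    static String format(String template, Map<String, ?> vals) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group(1);
            if (!vals.containsKey(key)) {
                throw new IllegalArgumentException("No value for key: " + key);
            }
            // quoteReplacement keeps '$' and '\' in the value from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(vals.get(key))));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Reading the file into a String first and passing a Map built like the vals dictionary from the question gives behavior close to file % vals for this restricted placeholder syntax.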
I'm looking for a simple Java lib/src to highlight differences between two Strings, case-sensitive. HTML output would be great, but I would be happy to get the indexes of the diffs, something like:
diff("abcd","aacd")
> [2,2]
diff("maniac", "brainiac")
> ["man","brain"] or [0,3] or something like that
The idea is to highlight typos and the like in a Swing program, since the input should follow strict conventions.
Apache Commons Lang has a class called StringUtils with two methods, difference and indexOfDifference, which should fulfill your needs.
http://commons.apache.org/lang/
Check it out
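To make the index idea concrete without a dependency, here is a small hand-written sketch. The indexOfDifference helper is meant to behave like the Commons Lang method of the same name (first differing index, or -1 if the strings are equal), while changedRange and the class name DiffSpan are hypothetical additions that also trim the common suffix, so the differing span can be highlighted in each string.

```java
final class DiffSpan {

    /** Index of the first position where a and b differ, or -1 if they are equal. */
    static int indexOfDifference(String a, String b) {
        int shortest = Math.min(a.length(), b.length());
        for (int i = 0; i < shortest; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : shortest;
    }

    /**
     * Returns {start, endInA, endInB}: after stripping the common prefix and the
     * common suffix, a.substring(start, endInA) was replaced by b.substring(start, endInB).
     * Both spans are empty when the strings are equal.
     */
    static int[] changedRange(String a, String b) {
        int first = indexOfDifference(a, b);
        int prefix = first < 0 ? a.length() : first;
        int maxSuffix = Math.min(a.length(), b.length()) - prefix;
        int suffix = 0;
        while (suffix < maxSuffix
                && a.charAt(a.length() - 1 - suffix) == b.charAt(b.length() - 1 - suffix)) {
            suffix++;
        }
        return new int[] { prefix, a.length() - suffix, b.length() - suffix };
    }

    public static void main(String[] args) {
        int[] r = changedRange("maniac", "brainiac");
        System.out.println("maniac".substring(r[0], r[1]) + " -> " + "brainiac".substring(r[0], r[2]));
    }
}
```

This only finds a single changed span per pair of strings; for several separate edits within one string, an LCS-based diff is the better fit.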
The java-diff project might also be useful.
This is an implementation of the longest common subsequences (LCS) algorithm for Java. The Diff#diff() method returns a list of Difference objects, each of which describes an addition, a deletion, or a change between the two collections.