Algorithm / Library for measuring degree of equality of strings

Is there an algorithm that, given two strings, yields the degree of equality between them, applying metrics that can be provided externally? For example, the two strings "Plant code" and "PlantCode" could be 0.8 equal, "Plant code" and "Plant" could be 0.6 equal, "Truck no" and "shipment details" could be 0.6 equal (using an externally provided synonyms dictionary). The numbers are made up, but I hope they get the point across. Does such an algorithm exist? I'd prefer it to come as a library, rather than having to implement it on my own. Any help would be greatly appreciated. Thanks.

Try the Simmetrics library. It provides a large number of similarity metrics.

Maybe the google-diff-match-patch library can help: this library implements Myers' diff algorithm, which is generally considered to be the best general-purpose diff.

There's also the Levenshtein distance algorithm and its example Java implementation. It does not let you provide external metrics, though.
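If the missing piece is only the external metric, the textbook dynamic-programming form of Levenshtein distance can take the substitution cost as a parameter. The following is a minimal sketch under that assumption, not the API of any library mentioned here: the class name WeightedEditDistance, the CharCost interface and the 0.25 case-mismatch cost are all made up for illustration. Insertions and deletions keep a fixed cost of 1.

```java
/** Sketch: Levenshtein-style edit distance with a caller-supplied substitution cost. */
final class WeightedEditDistance {

    /** Hypothetical hook for the "externally provided" metric. */
    interface CharCost {
        double cost(char a, char b);
    }

    /** Classic unit cost: 0 for equal characters, 1 otherwise. */
    static final CharCost UNIT = (a, b) -> a == b ? 0.0 : 1.0;

    /** Illustrative custom metric: a pure case difference costs only 0.25. */
    static final CharCost CASE_TOLERANT = (a, b) -> {
        if (a == b) return 0.0;
        return Character.toLowerCase(a) == Character.toLowerCase(b) ? 0.25 : 1.0;
    };

    /** Two-row dynamic programming; insert and delete always cost 1. */
    static double distance(String s, String t, CharCost subCost) {
        double[] prev = new double[t.length() + 1];
        double[] curr = new double[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                double sub = prev[j - 1] + subCost.cost(s.charAt(i - 1), t.charAt(j - 1));
                double del = prev[j] + 1;
                double ins = curr[j - 1] + 1;
                curr[j] = Math.min(sub, Math.min(del, ins));
            }
            double[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    /** Maps the distance to [0, 1] by dividing by the longer length. */
    static double similarity(String s, String t, CharCost subCost) {
        int max = Math.max(s.length(), t.length());
        return max == 0 ? 1.0 : 1.0 - distance(s, t, subCost) / max;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Plant code", "PlantCode", UNIT));
        System.out.println(similarity("Plant code", "PlantCode", CASE_TOLERANT));
    }
}
```

A synonyms dictionary would not fit into a per-character cost like this; it would have to be applied at the word level before or instead of the character comparison.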

Related

Java library for fuzzy comparing text strings

I'm looking for a tool that would compare two text strings and return a result being in fact the indicator of their similarity (e.g. 95%). It needs to be implemented on a platform supporting Java libraries.
My best guess is that I need some fuzzy logic comparison tool that would do the fuzzy match and then return the similarity level.
I've seen some posts here related to fuzzy search, but I need the exact opposite: I don't want to set some parameters and have similar entries returned. Instead, I have the entries on hand and need the similarity parameter derived from them...
Can you advise me on that? Many thanks

Apache's StringUtils has a method that computes the Levenshtein distance.
http://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/StringUtils.html
Levenshtein distance is an algorithm that measures similarity as an "edit distance". I'm not sure whether this counts as "fuzzy", though.
Example:
int distance = StringUtils.getLevenshteinDistance("cat", "hat");

There is now a library that does exactly that
https://github.com/intuit/fuzzy-matcher

Java Double Comparison [duplicate]

Are there any java libraries for doing double comparison?
e.g.
public static boolean greaterThanOrEqual(double a, double b, double epsilon) {
    return a - b > -epsilon;
}
Every project I start I end up re-implementing this and copy-pasting code and test.
NB: a good example of why it's better to use 3rd party JARs is that IBM recommends the following:
"If you don't know the scale of the underlying measurements, using the
test "abs(a/b - 1) < epsilon" is likely to be more robust than simply
comparing the difference"
I doubt many people would have thought of this, and it illustrates that even simple code can be sub-optimal.
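For reference, the relative test quoted above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not code from IBM or from any library: the class name RelativeCompare and both method names are hypothetical, and the check scales epsilon by the larger magnitude instead of dividing a by b, so that b == 0 does not cause a division by zero.

```java
/** Sketch: comparisons with a tolerance relative to the operands' magnitude. */
final class RelativeCompare {

    /** True when a and b differ by at most epsilon times the larger magnitude. */
    static boolean nearlyEqual(double a, double b, double epsilon) {
        if (Double.isNaN(a) || Double.isNaN(b)) return false;
        if (a == b) return true; // exact matches, including equal infinities
        if (Double.isInfinite(a) || Double.isInfinite(b)) return false;
        double scale = Math.max(Math.abs(a), Math.abs(b));
        return Math.abs(a - b) <= epsilon * scale;
    }

    /** Relative-tolerance counterpart of the method in the question. */
    static boolean greaterThanOrEqual(double a, double b, double epsilon) {
        return a > b || nearlyEqual(a, b, epsilon);
    }

    public static void main(String[] args) {
        System.out.println(nearlyEqual(0.1 + 0.2, 0.3, 1e-12));
        System.out.println(nearlyEqual(1e-10, 2e-10, 1e-9));
        System.out.println(greaterThanOrEqual(1.0, 1.0 + 1e-15, 1e-9));
    }
}
```

One caveat of any purely relative test: a value very close to zero is never "nearly equal" to exactly zero, so code that compares against zero usually needs an additional absolute tolerance.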

Guava has DoubleMath.fuzzyCompare().

The standard Java library has no methods that handle this problem. I suggest you follow Joachim's link and use that library, which should suit your needs well. My own suggestion, though, would be to create a utils library where you collect frequently used methods such as the one in your question. For different implementations of your problem, consider looking at this:
Java double comparison epsilon
Feel free to ask about anything that is still unclear.

You should abstain from any library that uses the naive "maximum absolute difference" approach (like Guava). As detailed in Bruce Dawson's excellent article Comparing Floating Point Numbers, 2012 edition, it is highly error-prone, as it only works for a very limited range of values. A much more robust approach is to use relative differences or ULPs for approximate comparisons.
The only library I know of that implements a correct approximate comparison algorithm is Apache Commons Math.
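To make the ULP idea concrete, here is a minimal sketch built only on Math.ulp from the standard library. It is not the Commons Math implementation: the class name UlpCompare and the method name equalsWithinUlps are hypothetical, and the check approximates the ULP distance by comparing the absolute difference against a multiple of the larger of the two operands' ULP sizes.

```java
/** Sketch: approximate equality measured in units in the last place (ULPs). */
final class UlpCompare {

    /** True when a and b are at most roughly maxUlps representable steps apart. */
    static boolean equalsWithinUlps(double a, double b, int maxUlps) {
        if (Double.isNaN(a) || Double.isNaN(b)) return false;
        if (a == b) return true; // also covers +0.0 vs -0.0 and equal infinities
        if (Double.isInfinite(a) || Double.isInfinite(b)) return false;
        // Math.ulp(x) is the gap between x and the next larger-magnitude double.
        double ulp = Math.max(Math.ulp(a), Math.ulp(b));
        return Math.abs(a - b) <= maxUlps * ulp;
    }

    public static void main(String[] args) {
        System.out.println(equalsWithinUlps(0.1 + 0.2, 0.3, 1));
        System.out.println(equalsWithinUlps(1e14, 1e14 + 1, 4));
    }
}
```

Because the tolerance follows the spacing of doubles at the operands' magnitude, the same maxUlps works for values near 1e-10 and near 1e14, which a fixed absolute epsilon cannot do.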

Random Number Generation within range with different distribution in Java

I want to generate random numbers in different ranges, for example a range of 10^14, in Java, with different distributions such as log, normal, binomial, etc. Is there a particular library for this? I found discussions of Colt and the Uncommons Maths library. But is it safe enough to generate values as int and then multiply by the corresponding range suffix? What is the best practice here?

Apache Commons Math has a RandomDataImpl class that does nextBinomial, nextExponential and some other types (above my head unfortunately).
Hopefully that gets you everything you need. You might need to check some of the other classes in the library.
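For simple cases the standard library may already be enough. The sketch below uses only java.util.Random; the class name RangeSampling, its method names and the clamping policy are assumptions made for illustration. It draws a double and scales it, which is fine for a bound of 10^14 because that is below 2^53, whereas multiplying a generated int by a fixed factor would only ever hit multiples of that factor.

```java
import java.util.Random;

/** Sketch: sampling longs in [0, bound) under a few distributions. */
final class RangeSampling {

    static final long RANGE = 100_000_000_000_000L; // 10^14

    /** Uniform value in [0, bound); adequate while bound stays below 2^53. */
    static long uniform(Random rng, long bound) {
        return (long) (rng.nextDouble() * bound);
    }

    /** Normal value with the given mean and standard deviation, clamped to [0, bound). */
    static long normal(Random rng, double mean, double stdDev, long bound) {
        double x = mean + stdDev * rng.nextGaussian();
        return (long) Math.max(0, Math.min(bound - 1, x));
    }

    /** Exponential value with the given mean via inverse transform, clamped to [0, bound). */
    static long exponential(Random rng, double mean, long bound) {
        double u = rng.nextDouble();            // u is in [0, 1)
        double x = -mean * Math.log(1.0 - u);   // 1 - u is in (0, 1], so the log is finite
        return (long) Math.min(bound - 1, x);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        System.out.println(uniform(rng, RANGE));
        System.out.println(normal(rng, RANGE / 2.0, RANGE / 10.0, RANGE));
        System.out.println(exponential(rng, RANGE / 20.0, RANGE));
    }
}
```

Clamping distorts the tails slightly; if that matters, rejecting out-of-range samples and drawing again is the usual alternative. A binomial sampler is not shown here.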

comparing "the likes" smartly

Suppose you need to perform some kind of comparison between 2 files. You only need to do it when it makes sense; in other words, you wouldn't want to compare a JSON file with a properties file, or a .txt file with a .jar file.
Additionally suppose that you have a mechanism in place to sort all of these things out and what it comes down to now is the actual file name. You would want to compare "myFile.txt" with "myFile.txt", but not with "somethingElse.txt". The goal is to be as close to "apples to apples" rules as possible.
So here we are, on one side you have "myFile.txt" and on another side you have "_myFile.txt", "_m_y_f_i_l_e.txt" and "somethingReallyClever.txt".
Task is to pick the closest name to later compare. Unfortunately, identical name is not found.
Looking at the character composition, it is not hard to figure out what the relationship is. My algo says:
_myFile.txt to _m_y_f_i_l_e.txt 0.312
_myFile.txt to somethingReallyClever.txt 0.16
So _m_y_f_i_l_e.txt is closer to _myFile.txt than somethingReallyClever.txt is. Fantastic. But it also says that it is only 2 times closer, whereas in reality we can look at the 2 files and would never think to compare somethingReallyClever.txt with _myFile.txt.
Why?
What logic would you suggest I apply to not only figure out likelihood by having chars in the same place, but also test whether the determined weight makes sense?
In my example, somethingReallyClever.txt should have had a weight of 0.0.
I hope I am being clear.
Please share your experience and thoughts on this.
(Whatever approach you suggest should not depend on the number of characters the filename consists of.)

Possibly helpful previous question which highlights several possible algorithms:
Word comparison algorithm
These algorithms are based on how many changes would be needed to get from one string to the other - where a change is adding a character, deleting a character, or replacing a character.
Certainly any sensible metric here should have a low score as meaning close (think distance between the two strings) and larger scores as meaning not so close.

Sounds like you want the Levenshtein distance, perhaps modified by preconverting both words to the same case and normalizing spaces (e.g. replace all spaces and underscores with empty string)
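A minimal sketch of that suggestion, under a few assumptions: the class name FileNameSimilarity is made up, the score is normalized by the longer name so it does not depend on length, and a hypothetical cutoff (0.5 in the example) turns weak matches into 0.0, as the question asks for the unrelated name.

```java
import java.util.Locale;

/** Sketch: Levenshtein similarity on normalized file names, with a cutoff. */
final class FileNameSimilarity {

    /** Lower-cases the name and drops whitespace and underscores. */
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[\\s_]+", "");
    }

    /** Plain Levenshtein distance: unit-cost insert, delete and substitute. */
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int sub = prev[j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(sub, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    /** Similarity in [0, 1]; anything below the cutoff is reported as 0.0. */
    static double score(String a, String b, double cutoff) {
        String x = normalize(a);
        String y = normalize(b);
        int max = Math.max(x.length(), y.length());
        if (max == 0) return 1.0;
        double similarity = 1.0 - (double) levenshtein(x, y) / max;
        return similarity < cutoff ? 0.0 : similarity;
    }

    public static void main(String[] args) {
        System.out.println(score("_myFile.txt", "_m_y_f_i_l_e.txt", 0.5));
        System.out.println(score("_myFile.txt", "somethingReallyClever.txt", 0.5));
    }
}
```

The cutoff value is a judgment call; it would need tuning against real file names.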

Java simple String diff util

I'm looking for a simple Java lib/src to highlight differences between two Strings, case-sensitive. HTML output would be great, but I would be happy to get the indexes of the diffs, something like:
diff("abcd","aacd")
> [2,2]
diff("maniac", "brainiac")
> ["man","brain"] or [0,3] or something like that
The idea is to highlight typos and the like in a Swing program, since the input should follow strict conventions.

Apache Commons Lang has a class called StringUtils with both difference and indexOfDifference methods, which should fulfil your needs.
http://commons.apache.org/lang/
Check it out

The java-diff project might also be useful.
This is an implementation of the longest common subsequences (LCS) algorithm for Java. The Diff#diff() method returns a list of Difference objects, each of which describes an addition, a deletion, or a change between the two collections.
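If a single differing region is all that needs highlighting, a dependency-free sketch can strip the common prefix and suffix and report what is left. This is neither an LCS diff nor the API of any library named above; the class name SimpleDiff and the shape of the returned array are assumptions for illustration, and the indexes are 0-based.

```java
/** Sketch: locate the single region where two strings differ. */
final class SimpleDiff {

    /**
     * Returns {start, endInA, endInB}: the strings agree before start and
     * after the end indexes, so a.substring(start, endInA) and
     * b.substring(start, endInB) are the differing parts.
     */
    static int[] diffRange(String a, String b) {
        int minLen = Math.min(a.length(), b.length());
        int prefix = 0;
        while (prefix < minLen && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        int suffix = 0;
        // Never let the common suffix overlap the common prefix.
        while (suffix < minLen - prefix
                && a.charAt(a.length() - 1 - suffix) == b.charAt(b.length() - 1 - suffix)) {
            suffix++;
        }
        return new int[] { prefix, a.length() - suffix, b.length() - suffix };
    }

    public static void main(String[] args) {
        String a = "maniac";
        String b = "brainiac";
        int[] r = diffRange(a, b);
        System.out.println(a.substring(r[0], r[1]) + " vs " + b.substring(r[0], r[2]));
    }
}
```

Strings with several separate typos would come back as one wide region; splitting those apart is where an LCS-based diff becomes worthwhile.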
