Unexpected results from Metaphone algorithm

I am using phonetic matching for different words in Java. I used Soundex, but it is too crude. I switched to Metaphone and found it was better. However, when I tested it rigorously, I found weird behaviour. I want to ask whether that is the way Metaphone works or whether I am using it the wrong way. The following example runs fine:
Metaphone meta = new Metaphone();
if (meta.isMetaphoneEqual("cricket","criket")) System.out.println("Match 1");
if (meta.isMetaphoneEqual("cricket","criketgame")) System.out.println("Match 2");
This prints:
Match 1
Match 2
Now "cricket" does sound like "criket", but how come "cricket" and "criketgame" are the same? If someone could explain this, it would be a great help.

Your usage is slightly incorrect. Printing the default maximum code length and the encoded strings shows that the default is 4, which truncates the code for the longer "criketgame":
System.out.println(meta.getMaxCodeLen());
System.out.println(meta.encode("cricket"));
System.out.println(meta.encode("criket"));
System.out.println(meta.encode("criketgame"));
Output (note "criketgame" is truncated from "KRKTKM" to "KRKT", which matches "cricket"):
4
KRKT
KRKT
KRKT
Solution: Set the maximum code length to something appropriate for your application and the expected input. For example:
meta.setMaxCodeLen(8);
System.out.println(meta.encode("cricket"));
System.out.println(meta.encode("criket"));
System.out.println(meta.encode("criketgame"));
Now outputs:
KRKT
KRKT
KRKTKM
And now your original test gives the expected results:
Metaphone meta = new Metaphone();
meta.setMaxCodeLen(8);
System.out.println(meta.isMetaphoneEqual("cricket","criket"));
System.out.println(meta.isMetaphoneEqual("cricket","criketgame"));
Printing:
true
false
As an aside, you may also want to experiment with DoubleMetaphone, which is an improved version of the algorithm.
By the way, note the caveat from the documentation regarding thread-safety:
The instance field maxCodeLen is mutable but is not volatile, and accesses are not synchronized. If an instance of the class is shared between threads, the caller needs to ensure that suitable synchronization is used to ensure safe publication of the value between threads, and must not invoke setMaxCodeLen(int) after initial setup.
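The truncation effect described in this answer can be reproduced without the library. The sketch below is only an illustration, assuming the comparison behaves as if each code were cut to the maximum length; `CodeTruncation`, `truncate` and `equalAfterTruncation` are hypothetical names, and the codes are the ones printed above, not values computed here.

```java
final class CodeTruncation {
    // Cut a phonetic code to at most maxLen characters (hypothetical helper).
    static String truncate(String code, int maxLen) {
        return code.length() <= maxLen ? code : code.substring(0, maxLen);
    }

    // Two codes compare equal if they agree once both are cut to maxLen.
    static boolean equalAfterTruncation(String a, String b, int maxLen) {
        return truncate(a, maxLen).equals(truncate(b, maxLen));
    }

    public static void main(String[] args) {
        // "KRKT" and "KRKTKM" are the codes shown above for "cricket" and "criketgame".
        System.out.println(equalAfterTruncation("KRKT", "KRKTKM", 4)); // true
        System.out.println(equalAfterTruncation("KRKT", "KRKTKM", 8)); // false
    }
}
```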

Related

Compare Code Submissions with Previous Submissions?

Users submit code (mainly java) on my site to solve simple programming challenges, but sending the code to a server to compile and execute it can sometimes take more than 10 seconds.
To speed up this process, I plan to first check the submissions database to see if equivalent code has been submitted before. I realize this will cause Random methods to always return the same result, but that doesn't matter much. Is there any other potential problem that could be caused by not running the code?
To find matches, I remove comments and whitespace when comparing code. However, the same code can still be written in different ways, such as with different variable names. Is there a way to compare code that will find more equivalent code?

You could store a SHA1 hash of the code to compare with a previous submission. You are right that different variable names would give different hashes. Try running the code through a minifier or obfuscator. That way, variable cat and dog will both end up like a1, then you could see if they are unique. The only other way would be to actually compile it into bytecode, but then it's too late.
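As a rough sketch of the hashing idea, the code below strips comments and whitespace and then computes a SHA-1 digest with the JDK's `MessageDigest`. The class and method names are made up for illustration, and the comment stripping is deliberately crude: it assumes comment markers never appear inside string literals.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SubmissionFingerprint {
    // Crude normalization: drop /* */ and // comments, then all whitespace.
    // Assumption: comment markers never occur inside string literals.
    static String normalize(String source) {
        String noBlock = source.replaceAll("(?s)/\\*.*?\\*/", "");
        String noLine = noBlock.replaceAll("//[^\\n]*", "");
        return noLine.replaceAll("\\s+", "");
    }

    // Hex-encoded SHA-1 of the normalized source, usable as a lookup key.
    static String sha1Hex(String source) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] hash = digest.digest(normalize(source).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

Two submissions that differ only in comments or spacing then map to the same key, while a renamed variable still produces a different one.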
Instead of analyzing the source code, why not speed up the compilation? Try having a servlet container always running with a custom ClassLoader, and use the JDK tools.jar to compile on the fly. You could even submit the code via AJAX REST and get the results back the same way.
Consider how Eclipse compiles your files in the background.
Also, consider how http://ideone.com implements their online compiler.
FYI It is a big security risk to allow random code execution. You have to be very careful about hackers.

Variable names:
You can write code to match variable names in one file with the variable names in the other, then you can replace both sets with a consistent variable name.
File 1:
var1 += this(var1 - 1);
File 2:
sum += this(sum - 1);
After you read File 1, you look for what variable name File 2 is using in the place of sum, then make the variable names the same across both files.
*Note, if variables are used in similar ways you may get incorrect substitutions. This is most likely when variables are being declared. To help mitigate this, you can start searching for variable names at the bottom of the file and work up.
Short hands:
Force {} and () braces into each if/else/for/while/etc...
rewrite operations like "i+=..." as "i=i+..."
Functions:
In cases where function order doesn't matter, you can make sure functions are equivalent and then ignore them.
Operator precedence:
"3 + (2 * 4)" is usually equivalent to "2 * 4 + 3"
A way around this could be by determining the precedence of each operation and then matching it to an operation of the same precedence in the other set of code. Once a set of operations have been matched, you can replace them with a variable to represent them.
Ex.
(2+4) * 3 + (2+6) * 5 == someotherequation
//substitute most precedent: (2+4) and (2+6) for a and b
... a * 3 + b * 5
//substitute most precedent: (a*3) and (b*5) for c and d
... c + d
//substitute most precedent....
These are just a couple ways I could think of. If you do it this way, it'll end up being quite a big project... especially if you're working with multiple languages.
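The variable-name step could be approximated by renaming identifiers in order of first appearance, as in this minimal sketch. It is an assumption-heavy shortcut rather than a real parser: `IdentifierNormalizer` is a hypothetical name, the keyword list is incomplete, and method names, type names and words inside string literals are renamed too.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdentifierNormalizer {
    // Partial keyword list, enough for the example; a real tool needs the full set.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "this", "int", "return", "if", "else", "for", "while", "new", "void"));
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Replace each non-keyword identifier with v0, v1, ... by first appearance.
    static String normalize(String source) {
        Map<String, String> names = new HashMap<>();
        StringBuilder out = new StringBuilder();
        Matcher m = IDENTIFIER.matcher(source);
        int last = 0;
        while (m.find()) {
            out.append(source, last, m.start());
            String word = m.group();
            if (KEYWORDS.contains(word)) {
                out.append(word);
            } else {
                String canonical = names.get(word);
                if (canonical == null) {
                    canonical = "v" + names.size();
                    names.put(word, canonical);
                }
                out.append(canonical);
            }
            last = m.end();
        }
        out.append(source, last, source.length());
        return out.toString();
    }
}
```

With this, the two example lines from File 1 and File 2 normalize to the same text, so they can be compared or hashed directly.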

comparing "the likes" smartly

Suppose you need to perform some kind of comparison amongst 2 files. You only need to do it when it makes sense, in other words, you wouldn't want to compare JSON file with Property file or .txt file with .jar file
Additionally suppose that you have a mechanism in place to sort all of these things out and what it comes down to now is the actual file name. You would want to compare "myFile.txt" with "myFile.txt", but not with "somethingElse.txt". The goal is to be as close to "apples to apples" rules as possible.
So here we are, on one side you have "myFile.txt" and on another side you have "_myFile.txt", "_m_y_f_i_l_e.txt" and "somethingReallyClever.txt".
Task is to pick the closest name to later compare. Unfortunately, identical name is not found.
Looking at the character composition, it is not hard to figure out what the relationship is. My algo says:
_myFile.txt to _m_y_f_i_l_e.txt 0.312
_myFile.txt to somethingReallyClever.txt 0.16
So _m_y_f_i_l_e.txt is closer to _myFile.txt than somethingReallyClever.txt is. Fantastic. But it also says that it is only 2 times closer, whereas in reality we can look at the two files and would never think to compare somethingReallyClever.txt with _myFile.txt.
Why?
What logic would you suggest I apply to not only figure out likelihood from having characters in the same place, but also test whether the determined weight makes sense?
In my example, somethingReallyClever.txt should have had a weight of 0.0
I hope I am being clear.
Please share your experience and thoughts on this.
(whatever approach you suggest should not depend on number of characters filename consists out of)

Possibly helpful previous question which highlights several possible algorithms:
Word comparison algorithm
These algorithms are based on how many changes would be needed to get from one string to the other - where a change is adding a character, deleting a character, or replacing a character.
Certainly any sensible metric here should have a low score as meaning close (think distance between the two strings) and larger scores as meaning not so close.

Sounds like you want the Levenshtein distance, perhaps modified by preconverting both words to the same case and normalizing spaces (e.g. replace all spaces and underscores with empty string)
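A minimal sketch of that suggestion: normalize both names (lower case, underscores and spaces removed), then compute the standard dynamic-programming Levenshtein distance. `FileNameDistance` and its method names are illustrative, not from any library.

```java
import java.util.Locale;

final class FileNameDistance {
    // Lower-case the name and drop underscores and spaces before comparing.
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replace("_", "").replace(" ", "");
    }

    // Classic Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    static int distance(String a, String b) {
        return levenshtein(normalize(a), normalize(b));
    }
}
```

After normalization, `_myFile.txt` and `_m_y_f_i_l_e.txt` are at distance 0, while `somethingReallyClever.txt` stays far away. Dividing the distance by the length of the longer normalized name is one possible way to make the score independent of name length.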

Java's String.replace() vs. String.replaceFirst() vs. homebrew

I have a class that is doing a lot of text processing. For each string, which is anywhere from 100->2000 characters long, I am performing 30 different string replacements.
Example:
String modified;
for (int i = 0; i < num_strings; i++) {
    modified = runReplacements(strs[i]);
    // do stuff
}
public String runReplacements(String str) {
    str = str.replace("foo", "bar");
    str = str.replace("baz", "beef");
    ....
    return str;
}
'foo', 'baz', and all other "targets" are only expected to appear once and are string literals (no need for an actual regex).
As you can imagine, I am concerned about performance :)
Given this,
replaceFirst() seems a bad choice because it won't use Pattern.LITERAL and will do extra processing that isn't required.
replace() seems a bad choice because it will traverse the entire string looking for multiple instances to be replaced.
Additionally, since my replacement texts are the same every time, it seems to make sense for me to write my own code; otherwise String.replaceFirst() or String.replace() will be doing a Pattern.compile every single time in the background. Thinking that I should write my own code, this is my thought:
Perform a Pattern.compile() only once for each literal replacement desired (no need to recompile every single time) (i.e. p1 - p30)
Then do the following for each pX: p1.matcher(str).replaceFirst(Matcher.quoteReplacement("desiredReplacement"));
This way I abandon ship on the first replacement (instead of traversing the entire string), and I am using literal vs. regex, and I am not doing a re-compile every single iteration.
So, which is the best for performance?

So, which is the best for performance?
Measure it! ;-)
ETA: Since a two word answer sounds irretrievably snarky, I'll elaborate slightly. "Measure it and tell us..." since there may be some general rule of thumb about the performance of the various approaches you cite (good ones, all) but I'm not aware of it. And as a couple of the comments on this answer have mentioned, even so, the different approaches have a high likelihood of being swamped by the application environment. So, measure it in vivo and focus on this if it's a real issue. (And let us know how it goes...)

First, run and profile your entire application with a simple match/replace. This may show you that:
your application already runs fast enough, or
your application is spending most of its time doing something else, so optimizing the match/replace code is not worthwhile.
Assuming that you've determined that match/replace is a bottleneck, write yourself a little benchmarking application that allows you to test the performance and correctness of your candidate algorithms on representative input data. It's also a good idea to include "edge case" input data that is likely to cause problems; e.g. for the substitutions in your example, input data containing the sequence "bazoo" could be an edge case. On the performance side, make sure that you avoid the traps of Java micro-benchmarking; e.g. JVM warmup effects.
Next implement some simple alternatives and try them out. Is one of them good enough? Done!
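One simple alternative worth putting into such a benchmark is a literal replace-first built on `String.indexOf`, which involves no regex at all and stops at the first hit. This is a sketch with a hypothetical name, `LiteralReplace.replaceFirstLiteral`, and whether it is actually faster for your data is something only the measurement can tell.

```java
final class LiteralReplace {
    // Replace only the first literal occurrence of target; no regex involved.
    static String replaceFirstLiteral(String text, String target, String replacement) {
        int at = text.indexOf(target);
        if (at < 0) {
            return text; // nothing to replace, return the input unchanged
        }
        return new StringBuilder(text.length() - target.length() + replacement.length())
                .append(text, 0, at)
                .append(replacement)
                .append(text, at + target.length(), text.length())
                .toString();
    }
}
```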
In addition to your ideas, you could try concatenating the search terms into a single regex (e.g. "(foo|baz)" ), use Matcher.find(int) to find each occurrence, use a HashMap to lookup the replacement strings and a StringBuilder to build the output String from input string substrings and replacements. (OK, this is not entirely trivial, and it depends on Pattern/Matcher handling alternates efficiently ... which I'm not sure is the case. But that's why you should compare the candidates carefully.)
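The single-regex idea might look roughly like the sketch below. It uses plain `Matcher.find()` in a loop instead of `find(int)`, quotes each search term so it is treated literally, and compiles the pattern once. `MultiReplacer` is an illustrative name. Note that, unlike chained `replace` calls, it never rescans text it has already replaced, and it replaces every occurrence.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class MultiReplacer {
    private final Map<String, String> replacements;
    private final Pattern pattern;

    MultiReplacer(Map<String, String> replacements) {
        if (replacements.isEmpty()) {
            throw new IllegalArgumentException("at least one replacement is required");
        }
        this.replacements = new LinkedHashMap<>(replacements);
        // Build "\Qfoo\E|\Qbaz\E" once; Pattern.quote keeps every term literal.
        String alternation = this.replacements.keySet().stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        this.pattern = Pattern.compile(alternation);
    }

    // One pass over the input: copy unmatched text, look up each match in the map.
    String apply(String input) {
        Matcher m = pattern.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append(replacements.get(m.group()));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }
}
```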
In the (IMO unlikely) event that a simple alternative doesn't cut it, this wikipedia page has some leads which may help you to implement your own efficient match/replacer.

Isn't it frustrating when you ask a question and get a bunch of advice telling you to do a whole lot of work and figure it out for yourself?!
I say use replaceAll();
(I have no idea if it is, indeed, the most efficient, I just don't want you to feel like you wasted your money on this question and got nothing.)
[edit]
PS. After that, you might want to measure it.
[edit 2]
PPS. (and tell us what you found)

Best way to test CRC logic?

How can I verify two CRC implementations will generate the same checksums?
I'm looking for an exhaustive implementation evaluating methodology specific to CRC.

You can separate the problem into edge cases and random samples.
Edge cases. There are two variables to the CRC input: the number of bytes and the value of each byte. So create arrays of length 0, 1, and MAX_BYTES, with values ranging from 0 to MAX_BYTE_VALUE. The edge case suite is something you will most likely want to keep within a JUnit suite.
Random samples. Using the ranges above, run CRC on randomly generated arrays of bytes in a loop. The longer you let the loop run, the more you exhaust the inputs. If you are low on computing power, consider deploying the test to EC2.
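Both parts can be sketched in a few lines. The example assumes the two implementations under test are the JDK's `java.util.zip.CRC32` and a bit-by-bit CRC-32 written here as a stand-in for "your" implementation; `CrcComparison`, `firstMismatch` and the chosen fill values are illustrative choices, not a complete edge-case suite.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.zip.CRC32;

final class CrcComparison {
    // Reference implementation: the JDK's CRC-32.
    static long jdkCrc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Candidate: bit-by-bit CRC-32 (reflected polynomial 0xEDB88320,
    // initial value and final XOR both 0xFFFFFFFF).
    static long bitwiseCrc32(byte[] data) {
        int crc = 0xFFFFFFFF;
        for (byte b : data) {
            crc ^= (b & 0xFF);
            for (int bit = 0; bit < 8; bit++) {
                crc = (crc & 1) != 0 ? (crc >>> 1) ^ 0xEDB88320 : crc >>> 1;
            }
        }
        return (crc ^ 0xFFFFFFFF) & 0xFFFFFFFFL;
    }

    // Returns the first input on which the two disagree, or null if none was found.
    static byte[] firstMismatch(int maxBytes, int randomRounds, long seed) {
        // Edge cases: lengths 0, 1 and maxBytes, filled with boundary byte values.
        for (int length : new int[] {0, 1, maxBytes}) {
            for (int fill : new int[] {0x00, 0x7F, 0x80, 0xFF}) {
                byte[] data = new byte[length];
                Arrays.fill(data, (byte) fill);
                if (jdkCrc32(data) != bitwiseCrc32(data)) return data;
            }
        }
        // Random samples: random lengths and contents, reproducible via the seed.
        Random random = new Random(seed);
        for (int round = 0; round < randomRounds; round++) {
            byte[] data = new byte[random.nextInt(maxBytes + 1)];
            random.nextBytes(data);
            if (jdkCrc32(data) != bitwiseCrc32(data)) return data;
        }
        return null;
    }
}
```

Returning the offending input rather than a boolean makes a failure easy to reproduce in a dedicated unit test.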

Create several unit tests with the same input that will compare the output of both implementations against each other.

One nice property of CRCs is that for a given set of parameters (polynomial, reflection, initial state, etc.) you will get a constant value when you recompute the CRC over the original dataset + the original CRC. These constants are documented for common CRCs but you can just blindly generate them using two different random data sets and check that they are the same:
implementation 1: crc(rand_data_1 + crc(rand_data_1)) -> constant_1
implementation 2: crc(rand_data_2 + crc(rand_data_2)) -> constant_2
assert constant_1 == constant_2
You can use the same method within an implementation to get a warm fuzzy feeling about its correctness. If your implementation works with arbitrary polynomials, you can have the unittest exhaustively check every possible polynomial using this method without needing to know what the constants are.
This technique is powerful but it would also be wise to add an independent test that verifies the result based on known input for the pathological case where your CRC implementations both produce bad results that happen to get by the constant equivalence check.
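The constant check can be sketched for CRC-32 as follows. The byte order used when appending the checksum is an assumption that depends on the CRC parameters; for the reflected CRC-32 computed by `java.util.zip.CRC32`, the checksum is appended least significant byte first. `CrcResidue` and its methods are illustrative names.

```java
import java.util.Arrays;
import java.util.function.ToLongFunction;
import java.util.zip.CRC32;

final class CrcResidue {
    // One CRC-32 implementation to plug in; others can be passed the same way.
    static long zlibCrc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Computes crc(data + crc(data)), appending the 32-bit CRC little-endian.
    static long residue(ToLongFunction<byte[]> crc, byte[] data) {
        long value = crc.applyAsLong(data);
        byte[] framed = Arrays.copyOf(data, data.length + 4);
        for (int i = 0; i < 4; i++) {
            framed[data.length + i] = (byte) (value >>> (8 * i));
        }
        return crc.applyAsLong(framed);
    }
}
```

Calling `residue` with each implementation on different random data sets and asserting that all results are equal gives the equivalence check described above.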

First, if it is a standard CRC implementation, you should be able to find known values somewhere on the net.
Second, you could generate some number of payloads, run each CRC implementation on the payloads, and check that the CRC values match.

By writing a unit test for each that takes the same input and verifies the result against the expected output.

Why should I use Hamcrest-Matcher and assertThat() instead of traditional assertXXX()-Methods

When I look at the examples in the Assert class JavaDoc
assertThat("Help! Integers don't work", 0, is(1)); // fails:
// failure message:
// Help! Integers don't work
// expected: is <1>
// got value: <0>
assertThat("Zero is one", 0, is(not(1))); // passes
I don't see a big advantage over, let's say, assertEquals( 0, 1 ).
It's nice maybe for the messages if the constructs get more complicated but do you see more advantages? Readability?

There's no big advantage for those cases where an assertFoo exists that exactly matches your intent. In those cases they behave almost the same.
But when you come to checks that are somewhat more complex, then the advantage becomes more visible:
val foo = List.of("someValue");
assertTrue(foo.contains("someValue") && foo.contains("anotherValue"));
Expected: is <true>
but: was <false>
vs.
val foo = List.of("someValue");
assertThat(foo, containsInAnyOrder("someValue", "anotherValue"));
Expected: iterable with items ["someValue", "anotherValue"] in any order
but: no item matches: "anotherValue" in ["someValue"]
One can discuss which one of those is easier to read, but once the assert fails, you'll get a good error message from assertThat, but only a very minimal amount of information from assertTrue.

The JUnit release notes for version 4.4 (where it was introduced) state four advantages:
More readable and typeable: this syntax allows you to think in terms of subject, verb, object (assert "x is 3") rather than assertEquals, which uses verb, object, subject (assert "equals 3 x")
Combinations: any matcher statement s can be negated (not(s)), combined (either(s).or(t)), mapped to a collection (each(s)), or used in custom combinations (afterFiveSeconds(s))
Readable failure messages. (...)
Custom Matchers. By implementing the Matcher interface yourself, you can get all of the above benefits for your own custom assertions.
More detailed argumentation from the guy who created the new syntax : here.
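To see why matchers give readable failure messages and combine so easily, it can help to look at a toy version of the idea. The sketch below is not Hamcrest's API: `MiniMatchers`, `Check`, and the message layout are hypothetical stand-ins that only mimic the principle of pairing a test with its own description.

```java
import java.util.function.Predicate;

final class MiniMatchers {
    // Hypothetical stand-in for a matcher: a test plus a description of it.
    static final class Check<T> {
        final String description;
        final Predicate<T> test;

        Check(String description, Predicate<T> test) {
            this.description = description;
            this.test = test;
        }

        // Combination: negate a check while keeping a readable description.
        Check<T> negate() {
            return new Check<T>("not " + description, test.negate());
        }
    }

    static Check<Integer> greaterThan(int bound) {
        return new Check<Integer>("a value greater than <" + bound + ">", v -> v > bound);
    }

    // On failure, the message is built from the check's own description.
    static <T> void assertThat(T actual, Check<T> check) {
        if (!check.test.test(actual)) {
            throw new AssertionError(
                    "Expected: " + check.description + "\n     but: was <" + actual + ">");
        }
    }
}
```

Because the description travels with the test, `assertThat(0, greaterThan(1))` can report what was expected and what was found, which a bare `assertTrue(0 > 1)` cannot.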

Basically for increasing the readability of the code.
Besides hamcrest you can also use the fest assertions.
They have a few advantages over hamcrest such as:
they are more readable
(assertEquals(123, actual); // reads "assert equals 123 is actual" vs
assertThat(actual).isEqualTo(123); // reads "assert that actual is equal to 123")
they are discoverable (you can make autocompletion work with any IDE).
Some examples
import static org.fest.assertions.api.Assertions.*;
// common assertions
assertThat(yoda).isInstanceOf(Jedi.class);
assertThat(frodo.getName()).isEqualTo("Frodo");
assertThat(frodo).isNotEqualTo(sauron);
assertThat(frodo).isIn(fellowshipOfTheRing);
assertThat(sauron).isNotIn(fellowshipOfTheRing);
// String specific assertions
assertThat(frodo.getName()).startsWith("Fro").endsWith("do")
.isEqualToIgnoringCase("frodo");
// collection specific assertions
assertThat(fellowshipOfTheRing).hasSize(9)
.contains(frodo, sam)
.excludes(sauron);
// map specific assertions (One ring and elves ring bearers initialized before)
assertThat(ringBearers).hasSize(4)
.includes(entry(Ring.oneRing, frodo), entry(Ring.nenya, galadriel))
.excludes(entry(Ring.oneRing, aragorn));
October 17th, 2016 Update
Fest is not active anymore, use AssertJ instead.

A very basic justification is that it is hard to mess up the new syntax.
Suppose that a particular value, foo, should be 1 after a test.
assertEquals(1, foo);
--OR--
assertThat(foo, is(1));
With the first approach, it is very easy to forget the correct order, and type it backwards. Then rather than saying that the test failed because it expected 1 and got 2, the message is backwards. Not a problem when the test passes, but can lead to confusion when the test fails.
With the second version, it is almost impossible to make this mistake.

Example:
assertThat(5 , allOf(greaterThan(1),lessThan(3)));
// java.lang.AssertionError:
// Expected: (a value greater than <1> and a value less than <3>)
// got: <5>
assertTrue("Number not between 1 and 3!", 1 < 5 && 5 < 3);
// java.lang.AssertionError: Number not between 1 and 3!
you can make your tests more particular
you get a more detailed Exception, if tests fail
easier to read the Test
btw: you can write Text in assertXXX too...

assertThat(frodo.getName()).isEqualTo("Frodo");
It is close to natural language.
Easier to read, easier to analyze.
Programmers spend more time analyzing code than writing new code, so if code is easy to analyze, the developer should be more productive.
P.S.
Code should read like a well-written book.
Self-documenting code.

There are advantages to assertThat over assertEquals:
1) more readable
2) more information on failure
3) compile time errors - rather than run time errors
4) flexibility with writing test conditions
5) portable - if you are using hamcrest - you can use jUnit or TestNG as the underlying framework.
