Cannot understand the idea/usefulness of edit distance between strings - java

I am reading about the problem of Edit Distance between 2 strings.
It can be solved with dynamic programming using the edit distance recurrence. What I cannot understand is its usefulness.
First of all, how is this any different from knowing the longest common subsequence of 2 strings?
If the idea is to pick the string with the smallest edit distance, you might as well pick the one with the maximum LCS. Right?
Additionally, when we actually code the replacement, the code would be similar to the following:
if (a.length == b.length) {
    for (int i = 0; i < a.length; i++) {
        a[i] = b[i];
    }
} else {
    a = new char[b.length];
    for (int i = 0; i < a.length; i++) {
        a[i] = b[i];
    }
}
I mean, just replace the characters. Is there any difference between simply doing the assignment, and first checking whether the characters are the same and only assigning if they differ? Aren't both constant-time operations at runtime?
What am I misunderstanding with this problem?

Edit Distance and LCS are related by a simple formula if no substitution is allowed in editing (or if substitution is twice as expensive as insertion or deletion):
ed(x,y) = x.length + y.length - 2*lcs(x,y).length
If substitution is a separate unit-cost operation, then ED can be less than that. This is important in practice, since we want a way to create shorter diff files. Not just asymptotically bounded up to a constant factor, but actually the smallest possible ones.
Edit: shorter diff files are probably not a concern here; they won't be substantially shorter if we do not allow substitution. There are more interesting applications, like ranking correction suggestions in a spell checker (this is based on a comment by #nhahtdh below).
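For a concrete example (mine, not part of the original answer): take x = "cat" and y = "car". Their LCS is "ca", so without substitution ed(x,y) = 3 + 3 - 2*2 = 2 (delete 't', insert 'r'), while with unit-cost substitution ed(x,y) = 1 (replace 't' with 'r').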

Edit distance is quite different from LCS.
The edit distance is the minimal number of edit operations you need to transform one string into the other. A very popular example is the Levenshtein distance, which has the edit operations:
insert one character
delete one character
replace one character
All these operations are weighted with a cost of 1.
But there are a lot of other operations and cost functions possible.
E.g. you could also allow the operation: swap two adjacent characters.
It is used, for example, to align DNA sequences (or protein sequences).
If the strings have lengths n and m, the complexity is:
time: O(n*m)
space: O(min(n,m))
It can become worse for complex cost functions.
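As an illustration (a minimal sketch, not from the original answer), here is the standard O(n*m)-time DP for the unit-cost Levenshtein distance, keeping only two rows so the extra space stays small (pass the shorter string as t to get O(min(n,m))):

static int levenshtein(String s, String t) {
    int n = s.length(), m = t.length();
    int[] prev = new int[m + 1];
    int[] curr = new int[m + 1];
    for (int j = 0; j <= m; j++) prev[j] = j;          // edits from the empty prefix of s
    for (int i = 1; i <= n; i++) {
        curr[0] = i;                                   // delete the first i characters of s
        for (int j = 1; j <= m; j++) {
            int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insert
                                        prev[j] + 1),      // delete
                               prev[j - 1] + cost);        // replace (or match)
        }
        int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[m];
}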

Related

How to find two numbers whose sum is a given number in a sorted array in O(n)?

public static void findNumber(int number) {
    int[] sortedArray = { 1, 5, 6, 8, 9 };
    for (int i = 0; i < sortedArray.length; i++) {
        for (int j = i + 1; j < sortedArray.length; j++) {
            if (sortedArray[i] + sortedArray[j] == number) {
                System.out.println(sortedArray[i] + "::" + sortedArray[j]);
                return;
            }
        }
    }
}
Using this code I am able to find the pair, but its complexity is O(N^2). I have to do this with O(N) complexity, i.e. using only one for loop, a hash map, or something similar, in Java.
I remember watching the official Google video about this problem. Although it is not demonstrated in Java, it is explained step by step in different variations of the problem. You should definitely check it out:
How to: Work at Google — Example Coding/Engineering Interview
As explained in the Google video that Alexander G is linking to, use two array indexes. Initialize one to the first element (0) and the other to the last element (sortedArray.length - 1). In a loop, check the sum of the two elements at the two indexes. If the sum is the number you were looking for, you’re done. If it’s too high, you need to find a smaller number at one of the indexes; move the right index one step to the left (since the array is sorted, this is the right way). If on the other hand, the sum you got was too low, move the left index to the right to obtain a higher first addend. When the two indexes meet, if you still haven’t found the sum you were looking for, there isn’t any. At this point you have been n - 1 times through the loop, so the algorithm runs in O(n).
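A minimal sketch of that two-index approach in Java (my own illustration, assuming the array is sorted in ascending order):

public static void findNumber(int[] sortedArray, int number) {
    int left = 0, right = sortedArray.length - 1;
    while (left < right) {
        int sum = sortedArray[left] + sortedArray[right];
        if (sum == number) {
            System.out.println(sortedArray[left] + "::" + sortedArray[right]);
            return;
        } else if (sum > number) {
            right--;   // sum too high: need a smaller right addend
        } else {
            left++;    // sum too low: need a larger left addend
        }
    }
    System.out.println("No pair sums to " + number);
}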
We ought to first check the precondition, that the array is really sorted. This too can be done in O(n), so doing it doesn’t break any requirements.
The algorithm may need refinement if you are required to find all possible pairs of numbers that yield the desired sum rather than just one pair.
Is this answer superfluous when the video link has already said it? For one thing, my explanation is shorter, so if it suffices, you’re fine. Most importantly, if the video is removed or just moved to another URL, my answer will still be here.
With number fixed, for any chosen x in the array you just have to check whether number - x is in the array (note that you can also bound x). Using binary search, this will not give you O(n), but O(n log n).
Maybe this will give a hint towards linear time: if you take a_i and a_j (j > i), sum them and compare against number, then if the result is greater the next interesting tests are with a_(i-1) or a_(j-1), and if the result is lower the next interesting tests are with a_(i+1) or a_(j+1).

Fastest algorithm to find frequencies of each element of an array of reals?

The problem is to find frequencies of each element of an array of reals.
double[] a = new double[n];
int[] freq = new int[n];
I have come up with two solution:
First solution O(n^2):
for (int i = 0; i < a.length; i++) {
    if (freq[i] != -1) {
        freq[i] = 1;                  // count the element itself
        for (int j = i + 1; j < a.length; j++) {
            if (a[i] == a[j]) {
                freq[i]++;
                freq[j] = -1;         // mark the duplicate as already counted
            }
        }
    }
}
Second solution O(nlogn):
quickSort(a, 0, a.length - 1);
int j = 0;
freq[j] = 1;
for (int i = 0; i < a.length - 1; i++) {
    if (a[i] == a[i + 1]) {
        freq[j]++;
    } else {
        j = i + 1;
        freq[j] = 1;
    }
}
Is there any faster algorithm for this problem (O(n) maybe)?
Thank you in advance for any help you can provide.
Let me start by saying that checking for identity of doubles is not a good practice. For more details see: What every programmer should know about floating points.
You should use more robust double comparisons.
Now, that we are done with that, let's face your problem.
You are dealing with a variation of the element distinctness problem, here with floating-point numbers.
Generally speaking, under the algebraic computation tree model, one cannot do better than Omega(n log n) (references in this thread: https://stackoverflow.com/a/7055544/572670).
However, if you are going to stick with exact double equality checks (please don't), you can use a stronger model and a hash table to achieve an O(n) solution: maintain a hash-table based histogram (implemented as a HashMap<Double, Integer>) of the elements, and when you are done, scan the histogram to read off each element's frequency.
(Please don't do it)
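For what it's worth, a minimal sketch of that histogram over the array a from the question (again relying on exact double equality, which is discouraged):

Map<Double, Integer> freq = new HashMap<>();   // java.util.Map / java.util.HashMap
for (double x : a) {
    freq.merge(x, 1, Integer::sum);            // one pass: O(n) expected time
}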
There is a more complex way to achieve O(n) time based on hashing, even when dealing with floating points. It is based on adding each element to multiple entries of the hash table and assuming the hash function takes a range of elements [x - delta/2, x + delta/2) to the same hash value (so it hashes in chunks: [x1, x2) -> h1, [x2, x3) -> h2, [x3, x4) -> h3, ...). You can then create a hash table where an element x is hashed to 3 values: x - (3/4)*delta, x, and x + (3/4)*delta.
This guarantees that when an equal value is checked later, it will match in at least one of the 3 places where you put the element.
This is significantly more complex to implement, but it should work. A variant of this can be found in Cracking the Coding Interview, mathematical question 6. (Just make sure you look at edition 5; the answer in edition 4 is wrong and was fixed in the newer edition.)
As another side note, you don't need to implement your own sort. Use Arrays.sort()
If your doubles have already been rounded appropriately and you are confident there isn't an error to worry about, you can use a hash map like
Map<Double, Long> freqCount = DoubleStream.of(reals).boxed()
.collect(Collectors.groupingBy(d -> d, Collectors.counting()));
This uses quite a bit of memory, but is O(n).
The alternative is to use the following as a first pass
NavigableMap<Double, Long> freqCount = DoubleStream.of(reals).boxed()
.collect(Collectors.groupingBy(d -> d, TreeMap::new, Collectors.counting()));
This will count all the values which are exactly the same, and you can use a grouping strategy to combine double values which are almost the same, but should be considered equal for your purposes. This is O(N log N)
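One possible grouping strategy (my assumption, not part of the answer) is to snap each value to a grid of your chosen precision before counting:

double precision = 1e-9;   // tolerance is an assumption; pick whatever fits your data
Map<Double, Long> grouped = DoubleStream.of(reals).boxed()
        .collect(Collectors.groupingBy(
                d -> Math.round(d / precision) * precision,   // snap to grid
                TreeMap::new,
                Collectors.counting()));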
Using a trie would perform in pretty much linear time, because insertions are going to be extremely fast (proportional to the length of your number's representation).
Sorting and counting is definitely way too slow if all you need is the frequencies. Your friend is the trie: https://en.wikipedia.org/wiki/Trie
If you were using a trie, you would convert each number into a String (simple enough in Java). The complexity of an insertion into a trie varies slightly based on the implementation, but in general it will be proportional to the length of the String.
If you need an implementation of a trie, I suggest looking into Robert Sedgewick's implementation from his Algorithms course here:
http://algs4.cs.princeton.edu/52trie/TrieST.java.html
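A minimal sketch of that idea using the linked TrieST class (assuming it is on your classpath), keyed by the string form of each value in the array a:

TrieST<Integer> counts = new TrieST<>();       // algs4's TrieST, assumed available
for (double x : a) {
    String key = Double.toString(x);           // trie keys are strings
    Integer c = counts.get(key);
    counts.put(key, c == null ? 1 : c + 1);
}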

java - Speed up recursive function

I am working on this spellchecker and one of the methods I use to suggest corrections to the user is inserting multiple characters into the word. This allows words like exmpl to be corrected to example.
Here's the actual code:
public static Set<String> multiInsert(String word, int startIndex) {
    Set<String> result = new HashSet<>();
    // List of characters to insert
    String correction = "azertyuiopqsdfghjklmwxcvbnùûüéèçàêë";
    for (int i = startIndex; i <= word.length(); i++) {
        for (int j = i + 1; j <= word.length(); j++) {
            for (int k = 0; k < correction.length(); k++) {
                String newWord = word.substring(0, j) + correction.charAt(k) + word.substring(j);
                result.addAll(multiInsert(newWord, startIndex + 2));
                if (dico.contains(newWord)) result.add(newWord);
            }
        }
    }
    return result;
}
The problem with this function is that it takes A LOT of time to process especially when the word is long or when I have too many words to correct. Is there any better way to implement this function or optimize it?
What's making it slow is you're testing for strings that are not in the dictionary.
There are lots more possible misspellings than there are words in the dictionary.
You need to be guided by the dictionary.
This is the general spelling correction problem.
I've programmed it several times.
In a nutshell, the method is to store the dictionary as a trie, and do bounded depth-first walk of the trie.
At each step, you keep track of the distance between the word in the trie and the original word.
Whenever that distance exceeds the bound, you prune the search.
So you do it in cycles, increasing the bound each time.
First you do it with a bound of 0, so it will only find an exact match.
That is equivalent to ordinary trie search.
If that did not yield a match, do the walk again with a bound of 1.
That will find all dictionary words that are distance 1 from the original word.
If that didn't yield any, increase the bound to 2, and so on.
(What constitutes an increment of distance is any transformation you choose, like insertion, deletion, replacement, or more general re-writings.)
The performance is bounded by the true distance times the dictionary size.
Short of that, it is exponential in the true distance. Since each walk costs a factor times the previous walk, the time is dominated by the final walk, so the prior walks do not add much time.
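To make that concrete, here is a minimal sketch (my own illustration, not the answerer's code) of a dictionary trie searched with a bounded edit distance: each step of the walk computes one new row of the Levenshtein DP table against the original word, and a branch is pruned as soon as every entry in its row exceeds the bound.

import java.util.*;

class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    String word;   // non-null if a dictionary word ends here
}

class SpellSuggester {
    private final TrieNode root = new TrieNode();

    void add(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.word = word;
    }

    // All dictionary words within 'bound' edits of 'target'.
    Set<String> suggest(String target, int bound) {
        Set<String> results = new HashSet<>();
        int[] firstRow = new int[target.length() + 1];
        for (int i = 0; i < firstRow.length; i++) firstRow[i] = i;   // deletions only
        for (Map.Entry<Character, TrieNode> e : root.children.entrySet()) {
            walk(e.getValue(), e.getKey(), target, firstRow, bound, results);
        }
        return results;
    }

    private void walk(TrieNode node, char c, String target,
                      int[] prevRow, int bound, Set<String> results) {
        int[] row = new int[prevRow.length];
        row[0] = prevRow[0] + 1;
        int best = row[0];
        for (int j = 1; j < row.length; j++) {
            int insert = row[j - 1] + 1;
            int delete = prevRow[j] + 1;
            int replace = prevRow[j - 1] + (target.charAt(j - 1) == c ? 0 : 1);
            row[j] = Math.min(insert, Math.min(delete, replace));
            best = Math.min(best, row[j]);
        }
        if (node.word != null && row[row.length - 1] <= bound) {
            results.add(node.word);
        }
        if (best <= bound) {   // prune: no extension can get back under the bound
            for (Map.Entry<Character, TrieNode> e : node.children.entrySet()) {
                walk(e.getValue(), e.getKey(), target, row, bound, results);
            }
        }
    }
}

You would call suggest with a bound of 0, then 1, then 2, and so on, until it returns something, exactly as described above.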
There is an advantage to organizing the dictionary as a trie, since a trie is just a particular form of Finite State Machine.
You can add sub-machines to it to handle common prefixes and suffixes, without massively expanding the dictionary.
Consider these words: nation, national, nationalism, nationalistic, nationalistical, nationalisticalism, ...
Such words may not be common, but they are not impossible.
A suffix trie handles them easily.
Similarly prefixes like pre-, post-, un-, de-, in-, etc.
You can have a look at Jazzy which is a Java spellchecker API.
You might also want to consider Fuzzy String Matching.

The most efficient way to search for an array of strings in another string

I have a large array of strings that looks something like this:
String temp[] = new String[200000];
I have another String, let's call it bigtext. What I need to do is iterate through each entry of temp, checking to see if that entry is found in bigtext and then do some work based on it. So, the skeletal code looks something like this:
for (int x = 0; x < temp.length; x++) {
    if (bigtext.indexOf(temp[x]) > -1) {
        // do some stuff
    } else continue;
}
Because there are so many entries in temp and there are many instances of bigtext as well, I want to do this in the most efficient way. I am wondering if what I've outlined is the most efficient way to iterate through this search or if there are better ways to do this.
Thanks,
Elliott
I think you're looking for an algorithm like Rabin-Karp or Aho–Corasick which are designed to search in parallel for a large number of sub-strings in a text.
Note that your current complexity is O(|S1|*n), where |S1| is the length of bigtext and n is the number of elements in your array, since each search is actually O(|S1|).
By building a suffix tree from bigtext, and iterating on elements in the array, you could bring this complexity down to O(|S1| + |S2|*n), where |S2| is the length of the longest string in the array. Assuming |S2| << |S1|, it could be much faster!
Building a suffix tree is O(|S1|), and each search is O(|S2|). You don't have to go through bigtext to find it, just through the relevant piece of the suffix tree. Since it is done n times, you get a total of O(|S1| + n*|S2|), which is asymptotically better than the naive implementation.
If you have additional information about temp, you can maybe improve the iteration.
You can also reduce the time spent, if you parallelize the iteration.
Efficiency depends heavily on what is valuable to you.
Are you willing to increase memory for reduced time? Are you willing to increase time for efficient handling of large data sets? Are you willing to increase contention for CPU cores? Are you willing to do pre-processing (perhaps one or more forms of indexing) to reduce the lookup time in a critical section?
With your posting, you indicate the entire portion you want made more efficient, but that means you have excluded any portion of the code or system where the trade-off can be made. This forces one to imagine what you care about and what you don't care about. Odds are excellent that all the posted answers are both correct and incorrect depending on one's point of view.
An alternative approach would be to tokenize the text - let's say split by common punctuation. Then put these tokens into a Set and find the intersection with the main container.
Instead of an array, hold the words in a Set too. The intersection can be calculated by simply doing
bigTextSet.retainAll(mainWordsSet);
What remains will be the words that occur in bigText that are in your "dictionary".
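A rough sketch of that idea (the split pattern is my assumption; adjust it to whatever punctuation your text contains):

Set<String> bigTextSet = new HashSet<>(
        Arrays.asList(bigtext.split("[\\s\\p{Punct}]+")));   // tokenize bigtext (java.util.*)
Set<String> mainWordsSet = new HashSet<>(Arrays.asList(temp));
bigTextSet.retainAll(mainWordsSet);   // keep only words that are also in temp
// bigTextSet now holds the entries of temp that occur in bigtext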
Use a search algorithm like Boyer-Moore. Google "Boyer-Moore" and you will find lots of links which explain how it works. For instance, there is a Java example.
I'm afraid it's not efficient at all in any case!
To pick the right algorithm, you need to provide some answers:
What can be computed off-line? That is, is bigText known in advance? I guess temp is not, from its name.
Are you actually searching for words? If so, index them. Bloom filter can help, too.
If you need a bit of fuzziness, maybe stemming or soundex can do the job?
Sticking to a strict inclusion test, you might build a trie from your temp array. It would prevent searching for the same sub-string several times.
That is a very efficient approach. You can improve it slightly by only evaluating temp.length once
for(int x = 0, len = temp.length; x < len; x++)
Although you don't provide enough detail of your program, it's quite possible you can find a more efficient approach by redesigning your program.

Text similarity Algorithms

I'm doing a Java project where I have to make a text similarity program. I want it to take 2 text documents, compare them with each other, and measure how similar they are.
I'll later add an existing database which can find synonyms for words and go through the text to see if one of the text document writers just changed the words to synonyms while the text is otherwise exactly the same. Same thing with moving paragraphs up or down.
Yes, as if it were a plagiarism program...
I want to hear from you what kind of algorithms you would recommend.
I've found Levenshtein and cosine similarity by looking here and in other places. Both of them seem to be mentioned a lot. Hamming distance is another one my teacher told me about.
I've got some questions related to those, since I'm not really getting Wikipedia. Could someone explain those things to me?
Levenshtein: this algorithm changes a word by substitution, addition, and deletion of characters and sees how close it is to the other word in the text document. But how can that be used on a whole text file? I can see how it can be used on a word, but not on a sentence or a whole text document compared to another.
Cosine: it's a measure of similarity between two vectors, obtained by measuring the cosine of the angle between them. What I don't understand here is how two texts can become 2 vectors, and what happens to the words/sentences in them?
Hamming: this distance seems to work better than Levenshtein, but it only works on strings of equal length. How can it be important when 2 documents, and even the sentences in them, aren't strings of equal length?
Wikipedia should make sense, but it doesn't to me. I'm sorry if the questions sound too stupid, but this is dragging me down and I think there are people here who are quite capable of explaining it so that even newcomers to this field can get it.
Thanks for your time.
Levenshtein: in theory you could use it for a whole text file, but it's really not very suitable for the task. It's really intended for single words or (at most) a short phrase.
Cosine: You start by simply counting the unique words in each document. The answers to a previous question cover the computation once you've done that.
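A bare-bones sketch of that (my own illustration; for real use you would strip stop words and stem, as discussed further down):

import java.util.HashMap;
import java.util.Map;

// Cosine similarity over simple word counts; tokenization here is a deliberately
// crude assumption (lowercase, split on non-letters).
static double cosineSimilarity(String doc1, String doc2) {
    Map<String, Integer> v1 = wordCounts(doc1);
    Map<String, Integer> v2 = wordCounts(doc2);
    double dot = 0;
    for (Map.Entry<String, Integer> e : v1.entrySet()) {
        dot += e.getValue() * v2.getOrDefault(e.getKey(), 0);
    }
    double norm1 = 0, norm2 = 0;
    for (int c : v1.values()) norm1 += (double) c * c;
    for (int c : v2.values()) norm2 += (double) c * c;
    if (norm1 == 0 || norm2 == 0) return 0;
    return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
}

static Map<String, Integer> wordCounts(String text) {
    Map<String, Integer> counts = new HashMap<>();
    for (String w : text.toLowerCase().split("[^a-z]+")) {
        if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
    }
    return counts;
}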
I've never used Hamming distance for this purpose, so I can't say much about it.
I would add TFIDF (Term Frequency * Inverse Document Frequency) to the list. It's fairly similar to cosine distance, but 1) tends to do a better job on shorter documents, and 2) does a better job of taking into account which words are extremely common in an entire corpus rather than just the ones that happen to be common to two particular documents.
One final note: for any of these to produce useful results, you nearly always need to screen out stop words before you try to compute the degree of similarity (though TFIDF seems to do better than the others if you skip this). At least in my experience, it's extremely helpful to stem the words (remove suffixes) as well. When I've done it, I used Porter's stemming algorithm.
For your purposes, you probably want to use what I've dubbed an inverted thesaurus, which lets you look up a word, and for each word substitute a single canonical word for that meaning. I tried this on one project, and didn't find it as useful as expected, but it sounds like for your project it would probably be considerably more useful.
The basic idea of comparing the similarity between two documents, which is a topic in information retrieval, is to extract some fingerprint and judge whether they share information based on that fingerprint.
Just as a hint, the paper Winnowing: Local Algorithms for Document Fingerprinting may be a good choice and a good starting point for your problem.
Consider the example on wikipedia for Levenshtein distance:
For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
1. kitten → sitten (substitution of 's' for 'k')
2. sitten → sittin (substitution of 'i' for 'e')
3. sittin → sitting (insertion of 'g' at the end).
Now, replace "kitten" with "text from first paper", and "sitting" with "text from second paper".
Paper[] papers = getPapers();
for (int i = 0; i < papers.length - 1; i++) {
    for (int j = i + 1; j < papers.length; j++) {
        Paper first = papers[i];
        Paper second = papers[j];
        int dist = compareSimilarities(first.text, second.text);
        System.out.println(first.name + "'s paper compares to " + second.name
                + "'s paper with a similarity score of " + dist);
    }
}
Compare those results and peg the kids with the lowest distance scores.
In your compareSimilarities method, you could use any or all of the comparison algorithms. Another one you could incorporate into the formula is "longest common substring" (which is a good method of finding plagiarism).
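If you want to try that last one, here is a minimal sketch of the classic O(n*m) DP for the length of the longest common substring (my illustration, not part of the answer):

static int longestCommonSubstring(String a, String b) {
    int best = 0;
    int[] prev = new int[b.length() + 1];
    for (int i = 1; i <= a.length(); i++) {
        int[] curr = new int[b.length() + 1];
        for (int j = 1; j <= b.length(); j++) {
            if (a.charAt(i - 1) == b.charAt(j - 1)) {
                curr[j] = prev[j - 1] + 1;   // extend the common run ending here
                best = Math.max(best, curr[j]);
            }                                // else curr[j] stays 0
        }
        prev = curr;
    }
    return best;
}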
