Modifying Levenshtein Distance algorithm to not calculate all distances


I'm working on a fuzzy search implementation and, as part of it, we're using Apache's StringUtils.getLevenshteinDistance. At the moment, we're aiming for a specific maximum average response time for our fuzzy search. After various enhancements and some profiling, the place where the most time is spent is calculating the Levenshtein distance. It takes up roughly 80-90% of the total time on search strings of three letters or more.
Now, I know there are some limitations to what can be done here, but I've read in previous SO questions and on the Wikipedia page for LD that if one is willing to limit the threshold to a set maximum distance, that could help curb the time spent on the algorithm. I'm not sure how to do this exactly.
If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.[3]
Below you will see the original LD code from StringUtils. After that is my modification. I'm basically trying to calculate only the distances within a set width of the i,j diagonal (so, in my example, two diagonals above and below the i,j diagonal). However, this can't be correct as I've done it. For example, on the highest diagonal, it's always going to choose the cell value directly above, which will be 0. If anyone could show me how to make this work as I've described, or offer some general advice on how to do so, it would be greatly appreciated.
public static int getLevenshteinDistance(String s, String t) {
if (s == null || t == null) {
throw new IllegalArgumentException("Strings must not be null");
}
int n = s.length(); // length of s
int m = t.length(); // length of t
if (n == 0) {
return m;
} else if (m == 0) {
return n;
}
if (n > m) {
// swap the input strings to consume less memory
String tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}
int p[] = new int[n+1]; //'previous' cost array, horizontally
int d[] = new int[n+1]; // cost array, horizontally
int _d[]; //placeholder to assist in swapping p and d
// indexes into strings s and t
int i; // iterates through s
int j; // iterates through t
char t_j; // jth character of t
int cost; // cost
for (i = 0; i<=n; i++) {
p[i] = i;
}
for (j = 1; j<=m; j++) {
t_j = t.charAt(j-1);
d[0] = j;
for (i=1; i<=n; i++) {
cost = s.charAt(i-1)==t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
// our last action in the above loop was to switch d and p, so p now
// actually has the most recent cost counts
return p[n];
}
My modifications (only to the for loops):
for (j = 1; j<=m; j++) {
t_j = t.charAt(j-1);
d[0] = j;
int k = Math.max(j-2, 1);
for (i = k; i <= Math.min(j+2, n); i++) {
cost = s.charAt(i-1)==t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}

The issue with implementing the window is dealing with the value to the left of the first entry and above the last entry in each row.
One way is to start the values you initially fill in at 1 instead of 0, then just ignore any 0s that you encounter. You'll have to subtract 1 from your final answer.
Another way is to fill the entries left of first and above last with high values so the minimum check will never pick them. That's the way I chose when I had to implement it the other day:
public static int levenshtein(String s, String t, int threshold) {
int slen = s.length();
int tlen = t.length();
// swap so the smaller string is t; this reduces the memory usage
// of our buffers
if(tlen > slen) {
String stmp = s;
s = t;
t = stmp;
int itmp = slen;
slen = tlen;
tlen = itmp;
}
// p is the previous and d is the current distance array; dtmp is used in swaps
int[] p = new int[tlen + 1];
int[] d = new int[tlen + 1];
int[] dtmp;
// the values necessary for our threshold are written; the ones after
// must be filled with large integers since the tailing member of the threshold
// window in the bottom array will run min across them
int n = 0;
for(; n < Math.min(p.length, threshold + 1); ++n)
p[n] = n;
Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
Arrays.fill(d, Integer.MAX_VALUE);
// this is the core of the Levenshtein edit distance algorithm
// instead of actually building the matrix, two arrays are swapped back and forth
// the threshold limits the amount of entries that need to be computed if we're
// looking for a match within a set distance
for(int row = 1; row < s.length()+1; ++row) {
char schar = s.charAt(row-1);
d[0] = row;
// set up our threshold window
int min = Math.max(1, row - threshold);
int max = Math.min(d.length, row + threshold + 1);
// since we're reusing arrays, we need to be sure to wipe the value left of the
// starting index; we don't have to worry about the value above the ending index
// as the arrays were initially filled with large integers and we progress to the right
if(min > 1)
d[min-1] = Integer.MAX_VALUE;
for(int col = min; col < max; ++col) {
if(schar == t.charAt(col-1))
d[col] = p[col-1];
else
// min of: diagonal, left, up
d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
}
// swap our arrays
dtmp = p;
p = d;
d = dtmp;
}
if(p[tlen] == Integer.MAX_VALUE)
return -1;
return p[tlen];
}

I've written before about Levenshtein automata, which are one way to do this sort of check in O(n) time. The source code samples are in Python, but the explanations should be helpful, and the referenced papers provide more details.

According to "Gusfield, Dan (1997). Algorithms on strings, trees, and sequences: computer science and computational biology" (page 264) you should ignore zeros.

Here someone answers a very similar question:
Cite:
I've done it a number of times. The way I do it is with a recursive depth-first tree-walk of the game tree of possible changes. There is a budget k of changes, that I use to prune the tree. With that routine in hand, first I run it with k=0, then k=1, then k=2 until I either get a hit or I don't want to go any higher.
char* a = /* string 1 */;
char* b = /* string 2 */;
int na = strlen(a);
int nb = strlen(b);
bool walk(int ia, int ib, int k){
/* if the budget is exhausted, prune the search */
if (k < 0) return false;
/* if at end of both strings we have a match */
if (ia == na && ib == nb) return true;
/* if the first characters match, continue walking with no reduction in budget */
if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
/* if the first characters don't match, assume there is a 1-character replacement */
if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
/* try assuming there is an extra character in a */
if (ia < na && walk(ia+1, ib, k-1)) return true;
/* try assuming there is an extra character in b */
if (ib < nb && walk(ia, ib+1, k-1)) return true;
/* if none of those worked, I give up */
return false;
}
just the main part, more code in the original

I used the original code and placed this just before the end of the j for loop:
if (p[n] > s.length() + 5)
break;
The +5 is arbitrary, but for our purposes, if the distance is the query length plus five (or whatever number we settle upon), it doesn't really matter what is returned because we consider the match simply too different. It does cut down on things a bit. Still, I'm pretty sure this isn't the idea the Wikipedia statement was talking about, if anyone understands that better.
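A somewhat tighter variant of the same early-exit idea, as a rough sketch: the smallest value in a row of the table can never decrease from one row to the next, so once every entry of the current row exceeds the cutoff, the final distance must exceed it too. The class name EarlyExitLevenshtein, the method distanceOrGiveUp and the maxDistance parameter are made up for illustration; this is not the StringUtils code and not the diagonal-stripe technique from the Wikipedia quote.

```java
class EarlyExitLevenshtein {
    // Returns the Levenshtein distance if it is <= maxDistance, otherwise -1.
    // Hypothetical helper: full-row DP with a row-minimum cutoff.
    static int distanceOrGiveUp(String s, String t, int maxDistance) {
        int n = s.length();
        int m = t.length();
        int[] p = new int[n + 1]; // previous row
        int[] d = new int[n + 1]; // current row
        for (int i = 0; i <= n; i++) {
            p[i] = i;
        }
        for (int j = 1; j <= m; j++) {
            d[0] = j;
            int rowMin = d[0];
            for (int i = 1; i <= n; i++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
                rowMin = Math.min(rowMin, d[i]);
            }
            // Every later cell is built from this row by adding non-negative
            // costs, so the result cannot drop below rowMin.
            if (rowMin > maxDistance) {
                return -1;
            }
            int[] tmp = p;
            p = d;
            d = tmp;
        }
        return p[n] <= maxDistance ? p[n] : -1;
    }
}
```

This still fills whole rows, so it only saves time on pairs that are clearly too different; the stripe approach in the other answers is what reduces the work per row.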

Apache Commons Lang 3.4 has this implementation:
/**
* <p>Find the Levenshtein distance between two Strings if it's less than or equal to a given
* threshold.</p>
*
* <p>This is the number of changes needed to change one String into
* another, where each change is a single character modification (deletion,
* insertion or substitution).</p>
*
* <p>This implementation follows from Algorithms on Strings, Trees and Sequences by Dan Gusfield
* and Chas Emerick's implementation of the Levenshtein distance algorithm from
* http://www.merriampark.com/ld.htm</p>
*
* <pre>
* StringUtils.getLevenshteinDistance(null, *, *) = IllegalArgumentException
* StringUtils.getLevenshteinDistance(*, null, *) = IllegalArgumentException
* StringUtils.getLevenshteinDistance(*, *, -1) = IllegalArgumentException
* StringUtils.getLevenshteinDistance("","", 0) = 0
* StringUtils.getLevenshteinDistance("aaapppp", "", 8) = 7
* StringUtils.getLevenshteinDistance("aaapppp", "", 7) = 7
* StringUtils.getLevenshteinDistance("aaapppp", "", 6)) = -1
* StringUtils.getLevenshteinDistance("elephant", "hippo", 7) = 7
* StringUtils.getLevenshteinDistance("elephant", "hippo", 6) = -1
* StringUtils.getLevenshteinDistance("hippo", "elephant", 7) = 7
* StringUtils.getLevenshteinDistance("hippo", "elephant", 6) = -1
* </pre>
*
* @param s the first String, must not be null
* @param t the second String, must not be null
* @param threshold the target threshold, must not be negative
* @return result distance, or {@code -1} if the distance would be greater than the threshold
* @throws IllegalArgumentException if either String input {@code null} or negative threshold
*/
public static int getLevenshteinDistance(CharSequence s, CharSequence t, final int threshold) {
if (s == null || t == null) {
throw new IllegalArgumentException("Strings must not be null");
}
if (threshold < 0) {
throw new IllegalArgumentException("Threshold must not be negative");
}
/*
This implementation only computes the distance if it's less than or equal to the
threshold value, returning -1 if it's greater. The advantage is performance: unbounded
distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only
computing a diagonal stripe of width 2k + 1 of the cost table.
It is also possible to use this to compute the unbounded Levenshtein distance by starting
the threshold at 1 and doubling each time until the distance is found; this is O(dm), where
d is the distance.
One subtlety comes from needing to ignore entries on the border of our stripe
eg.
p[] = |#|#|#|*
d[] = *|#|#|#|
We must ignore the entry to the left of the leftmost member
We must ignore the entry above the rightmost member
Another subtlety comes from our stripe running off the matrix if the strings aren't
of the same size. Since string s is always swapped to be the shorter of the two,
the stripe will always run off to the upper right instead of the lower left of the matrix.
As a concrete example, suppose s is of length 5, t is of length 7, and our threshold is 1.
In this case we're going to walk a stripe of length 3. The matrix would look like so:
1 2 3 4 5
1 |#|#| | | |
2 |#|#|#| | |
3 | |#|#|#| |
4 | | |#|#|#|
5 | | | |#|#|
6 | | | | |#|
7 | | | | | |
Note how the stripe leads off the table as there is no possible way to turn a string of length 5
into one of length 7 in edit distance of 1.
Additionally, this implementation decreases memory usage by using two
single-dimensional arrays and swapping them back and forth instead of allocating
an entire n by m matrix. This requires a few minor changes, such as immediately returning
when it's detected that the stripe has run off the matrix and initially filling the arrays with
large values so that entries we don't compute are ignored.
See Algorithms on Strings, Trees and Sequences by Dan Gusfield for some discussion.
*/
int n = s.length(); // length of s
int m = t.length(); // length of t
// if one string is empty, the edit distance is necessarily the length of the other
if (n == 0) {
return m <= threshold ? m : -1;
} else if (m == 0) {
return n <= threshold ? n : -1;
}
if (n > m) {
// swap the two strings to consume less memory
final CharSequence tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}
int p[] = new int[n + 1]; // 'previous' cost array, horizontally
int d[] = new int[n + 1]; // cost array, horizontally
int _d[]; // placeholder to assist in swapping p and d
// fill in starting table values
final int boundary = Math.min(n, threshold) + 1;
for (int i = 0; i < boundary; i++) {
p[i] = i;
}
// these fills ensure that the value above the rightmost entry of our
// stripe will be ignored in following loop iterations
Arrays.fill(p, boundary, p.length, Integer.MAX_VALUE);
Arrays.fill(d, Integer.MAX_VALUE);
// iterates through t
for (int j = 1; j <= m; j++) {
final char t_j = t.charAt(j - 1); // jth character of t
d[0] = j;
// compute stripe indices, constrain to array size
final int min = Math.max(1, j - threshold);
final int max = (j > Integer.MAX_VALUE - threshold) ? n : Math.min(n, j + threshold);
// the stripe may lead off of the table if s and t are of different sizes
if (min > max) {
return -1;
}
// ignore entry left of leftmost
if (min > 1) {
d[min - 1] = Integer.MAX_VALUE;
}
// iterates through [min, max] in s
for (int i = min; i <= max; i++) {
if (s.charAt(i - 1) == t_j) {
// diagonally left and up
d[i] = p[i - 1];
} else {
// 1 + minimum of cell to the left, to the top, diagonally left and up
d[i] = 1 + Math.min(Math.min(d[i - 1], p[i]), p[i - 1]);
}
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
// if p[n] is greater than the threshold, there's no guarantee on it being the correct
// distance
if (p[n] <= threshold) {
return p[n];
}
return -1;
}
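The implementation comment above mentions that a bounded routine can also yield the unbounded distance by starting the threshold at 1 and doubling it until a result comes back. A minimal sketch of that idea follows. The class name DoublingLevenshtein and the compact bounded helper are illustrative stand-ins written for this example, not part of Commons Lang.

```java
import java.util.Arrays;

class DoublingLevenshtein {
    // "Infinity" that can still have 1 added to it without overflowing.
    static final int INF = Integer.MAX_VALUE / 2;

    // Banded distance: returns the distance if it is <= k, otherwise -1.
    static int bounded(String s, String t, int k) {
        int n = s.length();
        int m = t.length();
        if (Math.abs(n - m) > k) {
            return -1; // the stripe would run off the table
        }
        int[] prev = new int[n + 1];
        int[] cur = new int[n + 1];
        Arrays.fill(prev, INF);
        Arrays.fill(cur, INF);
        for (int i = 0; i <= Math.min(n, k); i++) {
            prev[i] = i;
        }
        for (int j = 1; j <= m; j++) {
            int lo = Math.max(1, j - k);
            int hi = Math.min(n, j + k);
            cur[0] = j;
            if (lo > 1) {
                cur[lo - 1] = INF; // ignore the stale entry left of the stripe
            }
            for (int i = lo; i <= hi; i++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                cur[i] = Math.min(prev[i - 1] + cost, Math.min(prev[i], cur[i - 1]) + 1);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[n] <= k ? prev[n] : -1;
    }

    // Unbounded distance via threshold doubling: 1, 2, 4, ...
    static int distance(String s, String t) {
        for (int k = 1; ; k *= 2) {
            int d = bounded(s, t, k);
            if (d >= 0) {
                return d; // found within the current threshold
            }
        }
    }
}
```

The loop terminates because the distance never exceeds the length of the longer string, so the threshold eventually covers it.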

Related

Simple algorithm but running out of memory

Problem
A sequence of positive rational numbers is defined as follows:
An infinite full binary tree labeled by positive rational numbers is
defined by:
The label of the root is 1/1
The left child of label p/q is p/(p+q)
The right child of label p/q is (p+q)/q
The top of the tree is shown in the following figure:
The sequence is defined by doing a level order (breadth first)
traversal of the tree (indicated by the light dashed line). So that:
F(1)=1/1,F(2)=1/2,F(3)=2/1,F(4)=1/3,F(5)=3/2,F(6)=2/3,…
Write a program which finds the value of n for which F(n) is p/q for
inputs p and q.
Input
The first line of input contains a single integer P, (1≤P≤1000), which
is the number of data sets that follow. Each data set should be
processed identically and independently. Each data set consists of a
single line of input. It contains the data set number, K, a single
space, the numerator, p, a forward slash (/) and the denominator, q,
of the desired fraction.
Output
For each data set there is a single line of output. It contains the
data set number, K, followed by a single space which is then followed
by the value of n for which F(n) is p/q. Inputs will be chosen so n
will fit in a 32-bit integer.
Source to question
My approach
I create the heap and planned to iterate over it until I find the element(s) in question, but I ran out of memory so I'm pretty sure I'm supposed to do it without creating the heap at all?
Code
public ARationalSequenceTwo() {
Kattio io = new Kattio(System.in, System.out);
StringBuilder sb = new StringBuilder(10000);
int iter = io.getInt();
// create heap
int parent;
Node[] heap = new Node[Integer.MAX_VALUE];
int counter = 1;
heap[0] = new Node(1, 1);
while (counter < Integer.MAX_VALUE) {
parent = (counter - 1) / 2;
// left node
heap[counter++] = new Node(heap[parent].numerator, heap[parent].numerator + heap[parent].denominator);
// right node
heap[counter++] = new Node(heap[parent].numerator + heap[parent].denominator, heap[parent].denominator);
}
// find Node
int dataSet;
String word;
int numerator;
int denominator;
for (int i = 0; i < iter; i++) {
dataSet = io.getInt();
word = io.getWord();
numerator = Integer.parseInt(word.split("/")[0]);
denominator = Integer.parseInt(word.split("/")[1]);
for (int j = 0; j < Integer.MAX_VALUE; j++) {
Node node = heap[j];
if (node.numerator == numerator && node.denominator == denominator) {
sb.append(dataSet).append(" ").append(j).append("\n");
}
}
}
System.out.println(sb);
io.close();
}

Let's consider a node n = a/b. If n is a left child of its parent, then n = p/(p+q), where the parent is p/q, i.e.
a = p,
b = p + q
so
p = a,
q = b - a
If n is a right child of its parent, then n = (p+q)/q:
a = p + q
b = q
so
p = a - b
q = b
So, given for example 3/5, is it a left child or a right child? If it were a left child, then its parent would be 3/(5-3) = 3/2. If it were a right child, its parent would be (3-5)/5 = -2/5. As this is not positive, n is clearly a left child.
So, generalizing:
given a rational n, we can find the path to the root as follows:
ArrayList<Boolean> lefts = new ArrayList<>();
while (nNum != nDen) {
if (nNum < nDen) {
//it's a left child
nDen = nDen - nNum;
lefts.add(true);
} else {
nNum = nNum - nDen;
lefts.add(false);
}
}
Now that we have the path in the array, how do we translate it in the final result? Let's observe that
if the value given was 1/1, then the array is empty, and we should return 1
Every time we go from level n to level n+1, we add 2^n to the result. For example, going from level 0 to level 1 we add 1 (the root). going from level 1 to level 2 we add all two nodes of level 1, which are 2, etc.
We're left with the last piece, which is adding the number of nodes to the left of our node in its own level, plus one for the node itself. How many nodes are to the left? If you label each arc going left with 0 and each arc going right with 1, the path from the root spells, in binary, the number of nodes to the left of our node in its level. For example, 3/5 is the left child of 3/2, which is the right child of 1/2, which is the left child of 1/1. The array (filled from the node up to the root) holds true, false, true; read from the root down, the path is left, right, left, i.e. 010 in binary. The final result is 2^0 + 2^1 + 2^2 + 010 + 1 = 1 + 2 + 4 + 2 + 1 = 10.
Finally, note that sum(2^i) is 2^(i+1) - 1. so, we can finally write the code for the second part:
int s = (1 << lefts.size()) - 1; // 2^0 + ... + 2^(size-1) = 2^size - 1
int k = 0;
for (int i = lefts.size() - 1; i >= 0; i--) {
if (!lefts.get(i)) {
// a right turn contributes a 1 bit; index 0 is the deepest edge
k += 1 << i;
}
}
return s + k + 1;
A full program taking in input num and den:
import java.util.ArrayList;
public class Z {
public static int func(int num, int den) {
ArrayList<Boolean> lefts = new ArrayList<>();
while (num != den) {
if (num < den) {
//it's a left child
den = den - num;
lefts.add(true);
} else {
num = num - den;
lefts.add(false);
}
}
int s = (1 << lefts.size()) - 1; // 2^(i+1) - 1
int k = 0;
for (int i = lefts.size() - 1; i >= 0; i--) {
if (!lefts.get(i)) {
k += 1 << i;
}
}
return s + k + 1;
}
public static void main(String[] args) {
System.out.println(func(Integer.parseInt(args[0]),
Integer.parseInt(args[1])));
}
}

Given a number p/q you can see whether it's a left or right child of its parent by considering whether p > q or p < q. And one can repeat that process all the way up the tree back to the root.
That gives a relatively simple recursive solution. In pseudocode:
T(p, q) =
1 if p == q == 1
2 * T(p, q-p) if p < q
2 * T(p-q, q) + 1 if p > q
This in theory could cause a stack overflow, because it runs in O(p+q) time and space. For example, T(1000000, 1) will require 1 million recursive calls. But it's given in the question that T(p, q) < 2**31, so the depth of the tree can be at most 32, and this solution works just fine.
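A direct Java transcription of that pseudocode, as a sketch. The class name RationalIndex and method name indexOf are made up for the example, and it assumes p/q is given in lowest terms, so that the walk up the tree ends at 1/1.

```java
class RationalIndex {
    // Returns n such that F(n) = p/q, following the recurrence T(p, q) above.
    // Assumes p and q are positive and coprime.
    static long indexOf(long p, long q) {
        if (p == q) {
            return 1; // reached the root 1/1
        }
        if (p < q) {
            return 2 * indexOf(p, q - p); // p/q is a left child of p/(q-p)
        }
        return 2 * indexOf(p - q, q) + 1; // p/q is a right child of (p-q)/q
    }

    public static void main(String[] args) {
        System.out.println(indexOf(3, 5)); // prints 10
    }
}
```

Doubling on the way back up mirrors the usual heap numbering, where the children of node n are 2n and 2n+1.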

Restaurant Maximum Profit using Dynamic Programming

It's an assignment task. I have spent 2 days trying to come up with a solution but still have a lot of confusion, so I need to get a few points clear. Following is the problem:
Yuckdonald’s is considering opening a series of restaurant along QVH. n possible locations are along a straight line and the distances of these locations from the start of QVH are in miles and in increasing order m1, m2, ...., mn. The constraints are as follows:
1. At each location, Yuckdonald may open one restaurant and expected profit from opening a restaurant at location i is given as pi
2. Any two restaurants should be at least k miles apart, where k is a positive integer
My solution:
public class RestaurantProblem {
int[] Profit;
int[] P;
int[] L;
int k;
public RestaurantProblem(int[] L , int[] P, int k) {
this.L = L;
this.P = P;
this.k = k;
Profit = new int[L.length];
}
public int compute(int i){
if(i==0)
return 0;
Profit[i]= P[i]+(L[i]-L[i-1]< k ? 0:compute(i-1));//if condition satisfies then adding previous otherwise zero
if (Profit[i]<compute(i-1)){
Profit[i] = compute(i-1);
}
return Profit[i];
}
public static void main(String args[]){
int[] m = {0,5,10,15,19,25,28,29};
int[] p = {0,10,4,61,21,13,19,15};
int k = 5;
RestaurantProblem rp = new RestaurantProblem(m, p ,k);
rp.compute(m.length-1);
for(int n : rp.Profit)
System.out.println(n);
}
}
This solution gives me 88; however, if I exclude the restaurant at 25 (profit 13) and include the restaurant at 28 (profit 19), I can have a maximum of 94...
Please point out if I am wrong, or how I can achieve this if it's true.

I was able to identify 2 mistakes:
You are not actually using dynamic programming; you are just storing the results in a data structure, which wouldn't be that bad for performance if the program worked the way you have written it and if you made only 1 recursive call.
However, you make at least 2 recursive calls. Therefore the program runs in Ω(2^n) instead of O(n).
Dynamic programming usually works like this (pseudocode):
calculate(input) {
if (value already calculated for input)
return previously calculated value
else
calculate and store value for input and return result
}
You could do this by initializing the array elements to -1 (or 0 if all profits are positive):
Profit = new int[L.length];
Arrays.fill(Profit, -1); // no need to do this, if you are using 0
public int compute(int i) {
if (Profit[i] >= 0) { // modify the check, if you're using 0 for non-calculated values
// reuse already calculated value
return Profit[i];
}
...
You assume the previous restaurant can only be built at the previous position
Profit[i] = P[i] + (L[i]-L[i-1]< k ? 0 : compute(i-1));
^
Just ignores all positions before i-1
Instead you should use the profit for the last position that is at least k miles away.
Example
k = 3
L 1 2 3 ... 100
P 5 5 5 ... 5
here L[i] - L[i-1] < k is true for all i and therefore the result will just be P[99] = 5 but it should be 34 * 5 = 170.
int[] lastPos;
public RestaurantProblem(int[] L, int[] P, int k) {
this.L = L;
this.P = P;
this.k = k;
Profit = new int[L.length];
lastPos = new int[L.length];
Arrays.fill(lastPos, -2);
Arrays.fill(Profit, -1);
}
public int computeLastPos(int i) {
if (i < 0) {
return -1;
}
if (lastPos[i] >= -1) {
return lastPos[i];
}
int max = L[i] - k;
int lastLastPos = computeLastPos(i - 1), temp;
while ((temp = lastLastPos + 1) < i && L[temp] <= max) {
lastLastPos++;
}
return lastPos[i] = lastLastPos;
}
public int compute(int i) {
if (i < 0) {
// no restaurants can be build before pos 0
return 0;
}
if (Profit[i] >= 0) { // modify the check, if you're using 0 for non-calculated values
// reuse already calculated value
return Profit[i];
}
int profitNoRestaurant = compute(i - 1);
if (P[i] <= 0) {
// no profit can be gained by building this restaurant
return Profit[i] = profitNoRestaurant;
}
return Profit[i] = Math.max(profitNoRestaurant, P[i] + compute(computeLastPos(i)));
}
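For comparison, the same recurrence can be filled in bottom-up without recursion. This is only a sketch under the assumption that the locations are sorted in increasing order, as the problem states; the class name RestaurantDp and method name maxProfit are invented for the example.

```java
class RestaurantDp {
    // best[i] = maximum profit using only the first i locations.
    static int maxProfit(int[] loc, int[] profit, int k) {
        int n = loc.length;
        int[] best = new int[n + 1];
        int j = 0; // number of locations that lie at least k miles before location i
        for (int i = 0; i < n; i++) {
            // loc is increasing, so j only ever moves forward
            while (j < i && loc[j] <= loc[i] - k) {
                j++;
            }
            int skip = best[i];               // do not build at location i
            int build = profit[i] + best[j];  // build here, plus the best compatible prefix
            best[i + 1] = Math.max(skip, build);
        }
        return best[n];
    }

    public static void main(String[] args) {
        int[] m = {0, 5, 10, 15, 19, 25, 28, 29};
        int[] p = {0, 10, 4, 61, 21, 13, 19, 15};
        System.out.println(maxProfit(m, p, 5)); // prints 94
    }
}
```

With the data from the question this gives 94, the value the asker expected, and the pointer j plays the role of computeLastPos.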

To my understanding, the problem can be modelled with a two-dimensional state space, which I don't find in the presented implementation. For each (i,j) in {0,...,n-1} × {0,...,n-1} let
profit(i,j) := the maximum profit attainable for selecting locations
from {0,...,i} where the farthest location selected is
no further than at position j
(or minus infinity if no such solution exists)
and note the recurrence relation
profit(i,j) = max{ p[i] + profit(i-1,lastpos(i)),
profit(i-1,j)
}
where lastpos(i) is the location which is farthest from the start, but no closer than k to position i; the first case above corresponds to selecting location i into the solution, while the second case corresponds to omitting location i from the solution. The overall solution can be obtained by evaluating profit(n-1,n-1); the evaluation can be done either recursively or by filling a two-dimensional array in a bottom-up manner and returning its contents at (n-1,n-1).

Compare two strings without Apache StringUtils

Hi, I am working on a voice command project. I want to receive the user's voice input first, then check it for matches, and then do something according to the command. For this, I found a way to match the strings using org.apache.commons.lang3.StringUtils, but I ran into a lot of trouble with it. For example, I face problems when I try to import Apache's external library into Android Studio.
So my question is: is there any other way to compare the user's voice data with my specific commands without using Apache's StringUtils methods? Please help if you can.

Take the source right from the library (Obviously follow the requirements of the Apache license)
https://commons.apache.org/proper/commons-lang/apidocs/src-html/org/apache/commons/lang3/StringUtils.html
Line 6865
/**
* <p>Find the Levenshtein distance between two Strings.</p>
*
* <p>This is the number of changes needed to change one String into
* another, where each change is a single character modification (deletion,
* insertion or substitution).</p>
*
* <p>The previous implementation of the Levenshtein distance algorithm
* was from http://www.merriampark.com/ld.htm</p>
*
* <p>Chas Emerick has written an implementation in Java, which avoids an OutOfMemoryError
* which can occur when my Java implementation is used with very large strings.<br>
* This implementation of the Levenshtein distance algorithm
* is from http://www.merriampark.com/ldjava.htm</p>
*
* <pre>
* StringUtils.getLevenshteinDistance(null, *) = IllegalArgumentException
* StringUtils.getLevenshteinDistance(*, null) = IllegalArgumentException
* StringUtils.getLevenshteinDistance("","") = 0
* StringUtils.getLevenshteinDistance("","a") = 1
* StringUtils.getLevenshteinDistance("aaapppp", "") = 7
* StringUtils.getLevenshteinDistance("frog", "fog") = 1
* StringUtils.getLevenshteinDistance("fly", "ant") = 3
* StringUtils.getLevenshteinDistance("elephant", "hippo") = 7
* StringUtils.getLevenshteinDistance("hippo", "elephant") = 7
* StringUtils.getLevenshteinDistance("hippo", "zzzzzzzz") = 8
* StringUtils.getLevenshteinDistance("hello", "hallo") = 1
* </pre>
*
* @param s the first String, must not be null
* @param t the second String, must not be null
* @return result distance
* @throws IllegalArgumentException if either String input {@code null}
* @since 3.0 Changed signature from getLevenshteinDistance(String, String) to
* getLevenshteinDistance(CharSequence, CharSequence)
public static int getLevenshteinDistance(CharSequence s, CharSequence t) {
if (s == null || t == null) {
throw new IllegalArgumentException("Strings must not be null");
}
/*
The difference between this impl. and the previous is that, rather
than creating and retaining a matrix of size s.length() + 1 by t.length() + 1,
we maintain two single-dimensional arrays of length s.length() + 1. The first, d,
is the 'current working' distance array that maintains the newest distance cost
counts as we iterate through the characters of String s. Each time we increment
the index of String t we are comparing, d is copied to p, the second int[]. Doing so
allows us to retain the previous cost counts as required by the algorithm (taking
the minimum of the cost count to the left, up one, and diagonally up and to the left
of the current cost count being calculated). (Note that the arrays aren't really
copied anymore, just switched...this is clearly much better than cloning an array
or doing a System.arraycopy() each time through the outer loop.)
Effectively, the difference between the two implementations is this one does not
cause an out of memory condition when calculating the LD over two very large strings.
*/
int n = s.length(); // length of s
int m = t.length(); // length of t
if (n == 0) {
return m;
} else if (m == 0) {
return n;
}
if (n > m) {
// swap the input strings to consume less memory
final CharSequence tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}
int p[] = new int[n + 1]; //'previous' cost array, horizontally
int d[] = new int[n + 1]; // cost array, horizontally
int _d[]; //placeholder to assist in swapping p and d
// indexes into strings s and t
int i; // iterates through s
int j; // iterates through t
char t_j; // jth character of t
int cost; // cost
for (i = 0; i <= n; i++) {
p[i] = i;
}
for (j = 1; j <= m; j++) {
t_j = t.charAt(j - 1);
d[0] = j;
for (i = 1; i <= n; i++) {
cost = s.charAt(i - 1) == t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
// our last action in the above loop was to switch d and p, so p now
// actually has the most recent cost counts
return p[n];
}

There are many string functions you can use to compare strings, for example
if (result.equals("hello")) {
doSomething();
}
compares two strings,
if (result.startsWith("search for")) {
doSomething();
}
checks the beginning of the result, and
if (result.matches("yes|sure")) {
doSomething();
}
checks the result against a regular expression.
You can find all that in a Java textbook. See for example
https://docs.oracle.com/javase/tutorial/java/data/comparestrings.html
If you want to use Levenshtein distance you can insert the following function in your code:
public int LevenshteinDistance (String s0, String s1) {
int len0 = s0.length() + 1;
int len1 = s1.length() + 1;
// the array of distances
int[] cost = new int[len0];
int[] newcost = new int[len0];
// initial cost of skipping prefix in String s0
for (int i = 0; i < len0; i++) cost[i] = i;
// dynamically computing the array of distances
// transformation cost for each letter in s1
for (int j = 1; j < len1; j++) {
// initial cost of skipping prefix in String s1
newcost[0] = j;
// transformation cost for each letter in s0
for(int i = 1; i < len0; i++) {
// matching current letters in both strings
int match = (s0.charAt(i - 1) == s1.charAt(j - 1)) ? 0 : 1;
// computing cost for each transformation
int cost_replace = cost[i - 1] + match;
int cost_insert = cost[i] + 1;
int cost_delete = newcost[i - 1] + 1;
// keep minimum cost
newcost[i] = Math.min(Math.min(cost_insert, cost_delete), cost_replace);
}
// swap cost/newcost arrays
int[] swap = cost; cost = newcost; newcost = swap;
}
// the distance is the cost for transforming all letters in both strings
return cost[len0 - 1];
}
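To tie this back to voice commands, one possible way to use such a distance function is to pick the known command closest to what was heard and reject anything too far off. The sketch below is an assumption about how that might look, not an established recipe: the class CommandMatcher, the method closest, the maxDistance cutoff and the sample commands are all invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

class CommandMatcher {
    // Plain two-row Levenshtein distance, same idea as the function above.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // Returns the command closest to what was heard, or null if none is
    // within maxDistance edits (comparison ignores case).
    static String closest(String heard, List<String> commands, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String command : commands) {
            int d = distance(heard.toLowerCase(), command.toLowerCase());
            if (d < bestDistance) {
                bestDistance = d;
                best = command;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> commands = Arrays.asList("open camera", "call home", "stop");
        System.out.println(closest("open camra", commands, 2)); // prints open camera
    }
}
```

A small cutoff keeps unrelated phrases from being mapped onto a command; the right value depends on how long the commands are.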

Bit mask generation to minimize number of 1

In order to explore some solutions, I need to generate all possibilities. I'm doing it by using bit masking, like this:
for (long i = 0; i < 1L << NB; i++) {
System.out.println(Long.toBinaryString(i));
if(checkSolution(i)) {
this.add(i); // add i to solutions
}
}
this.getBest(); // get the solution with lowest number of 1
This allows me to explore (if NB=3):
000
001
010
011
100
101
110
111
My problem is that the best solution is the one with the lowest number of 1.
So, in order to stop the search as soon as I found a solution, I would like to have a different order and produce something like this:
000
001
010
100
011
101
110
111
That would make the search a lot faster since I could stop as soon as I get the first solution. But I don't know how I can change my loop to get this output...
PS: NB is undefined...

The idea is to turn your loop into two nested loops; the outer loop sets the number of 1's, and the inner loop iterates through every combination of binary numbers with N 1's. Thus, your loop becomes:
for (long i = 1; i < (1L << NB); i = (i << 1) | 1) {
long j = i;
do {
System.out.println(Long.toBinaryString(j));
if(checkSolution(j)) {
this.add(j); // add j to solutions
}
j = next_of_n(j);
} while (j < (1L << NB));
}
next_of_n() is defined as:
long next_of_n(long j) {
long smallest, ripple, new_smallest, ones;
if (j == 0)
return j;
smallest = (j & -j);
ripple = j + smallest;
new_smallest = (ripple & -ripple);
ones = ((new_smallest / smallest) >> 1) - 1;
return (ripple | ones);
}
The algorithm behind next_of_n() is described in C: A Reference Manual, 5th edition, section 7.6, while showing an example of a SET implementation using bitwise operations. It may be a little hard to understand the code at first, but here's what the book says about it:
This code exploits many unusual properties of unsigned arithmetic. As
an illustration:
if x == 001011001111000, then
smallest == 000000000001000
ripple == 001011010000000
new_smallest == 000000010000000
ones == 000000000000111
the returned value == 001011010000111
The overall idea is that you find the rightmost contiguous group of
1-bits. Of that group, you slide the leftmost 1-bit to the left one
place, and slide all the others back to the extreme right. (This code
was adapted from HAKMEM.)
I can provide a deeper explanation if you still don't get it. Note that the algorithm assumes two's complement, and that all arithmetic should ideally take place on unsigned integers, mainly because of the right shift operation. I'm not a huge Java guy; I tested this in C with unsigned long and it worked pretty well. Java has no unsigned long, but as long as NB is small enough that every intermediate value stays positive (NB of 62 or less), the signed arithmetic gives the same results and there should be no problem.
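For readers who want a Java version, here is a minimal sketch of the same idea. It is not the code from the book or from the answer above: it uses the common shift-based form of the same trick (often called Gosper's hack) with Java's unsigned shift `>>>`, and the names `WeightOrder`, `nextSameWeight` and `masksByWeight` are made up for illustration. It assumes NB is at most 62 so that no intermediate value reaches the sign bit, and it collects the masks in a list only to make the order visible.

```java
import java.util.ArrayList;
import java.util.List;

final class WeightOrder {
    // Next larger value with the same number of 1 bits.
    // Assumes 0 < v < 2^62, so v + smallest cannot overflow into the sign bit.
    static long nextSameWeight(long v) {
        long smallest = v & -v;            // lowest set bit
        long ripple = v + smallest;        // carry through the lowest run of 1s
        // Remaining 1s of that run, moved to the far right.
        long ones = ((v ^ ripple) >>> 2) >>> Long.numberOfTrailingZeros(v);
        return ripple | ones;
    }

    // All nb-bit masks, ordered by increasing number of 1 bits (0 first).
    static List<Long> masksByWeight(int nb) {
        if (nb < 0 || nb > 62) {
            throw new IllegalArgumentException("nb must be in 0..62");
        }
        List<Long> out = new ArrayList<>();
        out.add(0L);                       // the only mask with zero 1 bits
        long limit = 1L << nb;
        for (int weight = 1; weight <= nb; weight++) {
            // (1L << weight) - 1 is the smallest value with 'weight' 1 bits.
            for (long v = (1L << weight) - 1; v < limit; v = nextSameWeight(v)) {
                out.add(v);
            }
        }
        return out;
    }
}
```

In an actual search you would call checkSolution(v) inside the inner loop and return on the first hit instead of building a list, since every mask visited later has at least as many 1 bits.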

This is an iterator that iterates bit patterns of the same cardinality.
import java.math.BigInteger;
import java.util.Iterator;
/**
* Iterates all bit patterns containing the specified number of bits.
*
* See "Compute the lexicographically next bit permutation"
* http://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
*
* @author OldCurmudgeon
*/
public class BitPattern implements Iterable<BigInteger> {
// Useful stuff.
private static final BigInteger ONE = BigInteger.ONE;
private static final BigInteger TWO = ONE.add(ONE);
// How many bits to work with.
private final int bits;
// Value to stop at. 2^max_bits.
private final BigInteger stop;
// Should we invert the output.
private final boolean not;
// All patterns of that many bits up to the specified number of bits - inverting if required.
public BitPattern(int bits, int max, boolean not) {
this.bits = bits;
this.stop = TWO.pow(max);
this.not = not;
}
// All patterns of that many bits up to the specified number of bits.
public BitPattern(int bits, int max) {
this(bits, max, false);
}
@Override
public Iterator<BigInteger> iterator() {
return new BitPatternIterator();
}
/*
* From the link:
*
* Suppose we have a pattern of N bits set to 1 in an integer and
* we want the next permutation of N 1 bits in a lexicographical sense.
*
* For example, if N is 3 and the bit pattern is 00010011, the next patterns would be
* 00010101, 00010110, 00011001,
* 00011010, 00011100, 00100011,
* and so forth.
*
* The following is a fast way to compute the next permutation.
*/
private class BitPatternIterator implements Iterator<BigInteger> {
// Next to deliver - initially 2^n - 1
BigInteger next = TWO.pow(bits).subtract(ONE);
// The last one we delivered.
BigInteger last;
@Override
public boolean hasNext() {
if (next == null) {
// Next one!
// t gets v's least significant 0 bits set to 1
// unsigned int t = v | (v - 1);
BigInteger t = last.or(last.subtract(BigInteger.ONE));
// Silly optimisation.
BigInteger notT = t.not();
// Next set to 1 the most significant bit to change,
// set to 0 the least significant ones, and add the necessary 1 bits.
// w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
// The __builtin_ctz(v) GNU C compiler intrinsic for x86 CPUs returns the number of trailing zeros.
next = t.add(ONE).or(notT.and(notT.negate()).subtract(ONE).shiftRight(last.getLowestSetBit() + 1));
if (next.compareTo(stop) >= 0) {
// Dont go there.
next = null;
}
}
return next != null;
}
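@Override
public BigInteger next() {
// Honour the Iterator contract instead of failing with a NullPointerException.
if (!hasNext()) {
throw new java.util.NoSuchElementException();
}
last = next;
next = null;
return not ? last.not() : last;
}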
@Override
public void remove() {
throw new UnsupportedOperationException("Not supported.");
}
@Override
public String toString () {
return next != null ? next.toString(2) : last != null ? last.toString(2): "";
}
}
public static void main(String[] args) {
System.out.println("BitPattern(3, 10)");
for (BigInteger i : new BitPattern(3, 10)) {
System.out.println(i.toString(2));
}
}
}

Loop over the number of ones, say n. For each n, start with 2^n - 1, which is the smallest integer containing exactly n ones, and test it. To get the next candidate, use the algorithm from "Hamming weight based indexing" (it's C code, but it should not be too hard to translate to Java).

Here's some code I put together some time ago to do this. Call the combinadic method with the number of digits you want, the number of set bits you want, and the position in the sequence.
// n = digits, k = weight, m = position.
public static BigInteger combinadic(int n, int k, BigInteger m) {
BigInteger out = BigInteger.ZERO;
for (; n > 0; n--) {
BigInteger y = nChooseK(n - 1, k);
if (m.compareTo(y) >= 0) {
m = m.subtract(y);
out = out.setBit(n - 1);
k -= 1;
}
}
return out;
}
// Algorithm borrowed (and tweaked) from: http://stackoverflow.com/a/15302448/823393
public static BigInteger nChooseK(int n, int k) {
if (k > n) {
return BigInteger.ZERO;
}
if (k <= 0 || k == n) {
return BigInteger.ONE;
}
// ( n * ( nChooseK(n-1,k-1) ) ) / k;
return BigInteger.valueOf(n).multiply(nChooseK(n - 1, k - 1)).divide(BigInteger.valueOf(k));
}
public void test() {
System.out.println("Hello");
BigInteger m = BigInteger.ZERO;
for ( int i = 1; i < 10; i++ ) {
BigInteger c = combinadic(5, 2, m);
System.out.println("c["+m+"] = "+c.toString(2));
m = m.add(BigInteger.ONE);
}
}
Not sure how it matches up in efficiency with the other posts.

re-arranging letters to produce a palindrome

I wrote this simple function (in Java) to re-arrange letters in a given String to produce a palindrome, or if it is not possible, just prints -1 and returns.
I can't figure out why this is not working: it does not pass the automated grading scripts, yet every case I could think of passes when I test it myself.
Could anyone please provide some insights on this? Thanks!
/**
* Pseudo-code:
* (0) Count the occurrences of each character.
* (0') Also remember how many groups of characters have an odd number of members.
* (1) If the number remembered above is greater than 1
* (meaning there is more than one group of characters with an odd
* number of members),
* print -1 and return (no palindrome!)
* (2) Else, for each group of characters
* - if the number of members is odd, save the character to a var called 'left'
* - put a char from the group at the current position, and
* another one at position [len - cur - 1].
* (3) If the variable 'left' is defined, put it in the middle of the string
*
* @param wd the word whose letters are to be rearranged
*/
private static void findPalin(String wd)
{
if (wd.isEmpty())
{
// Empty String is a palindrome itself!
System.out.println("");
return;
}
HashMap<Character, Integer> stats = new HashMap<Character, Integer>();
int len = wd.length();
int oddC = 0;
for (int n = 0; n < len; ++n)
{
Integer prv = stats.put(wd.charAt(n), 1);
if (prv != null)
{
if (prv % 2 == 0)
++oddC;
else
--oddC;
stats.put(wd.charAt(n), ++prv);
}
else
++oddC;
}
if (oddC > 1)
System.out.println(-1);
else
{
int pos = 0;
char ch[] = new char[len];
char left = '\0';
for (char theChar : stats.keySet())
{
Integer c = stats.get(theChar);
if (c % 2 != 0)
{
left = theChar;
--c;
}
while (c > 1)
{
ch[len - pos - 1] = ch[pos] = theChar;
++pos;
--c;
}
}
if (left != '\0')
ch[(len - 1) / 2] = left;
for (char tp : ch)
System.out.print(tp);
System.out.println();
}
}
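One thing that may explain the failing grader: in the loop above, each iteration places two letters (one at each end) but decrements c by only one, so a character that occurs four or more times claims more positions than it owns. With "aaaabbcc", for example, the 'a' group fills three pairs, and the 'c' pair then overwrites the 'b' pair. Whether this is what the grading script trips over is a guess. The sketch below follows the same pseudo-code but builds the left half explicitly; the names `PalindromeSketch` and `rearrange` are invented for illustration, and it returns null where the original prints -1.

```java
import java.util.Map;
import java.util.TreeMap;

final class PalindromeSketch {
    // Returns one palindromic rearrangement of wd, or null if none exists.
    static String rearrange(String wd) {
        // TreeMap only to make the output order deterministic.
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : wd.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        StringBuilder half = new StringBuilder();
        String middle = "";
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            int c = e.getValue();
            if (c % 2 != 0) {
                if (!middle.isEmpty()) {
                    return null;           // a second odd group: no palindrome
                }
                middle = String.valueOf(e.getKey());
            }
            // Each pair uses two letters, so only c / 2 go into the left half.
            for (int p = 0; p < c / 2; p++) {
                half.append(e.getKey());
            }
        }
        return half + middle + new StringBuilder(half).reverse();
    }
}
```

In the original code the equivalent minimal change would be to subtract 2 from c per placed pair instead of 1.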

Modifying Levenshtein Distance algorithm to not calculate all distances - java

I've written before about Levenshtein automata, which are one way to do this sort of check in O(n) time. The source code samples in that write-up are in Python, but the explanations should be helpful, and the referenced papers provide more details.

According to "Gusfield, Dan (1997). Algorithms on strings, trees, and sequences: computer science and computational biology" (page 264) you should ignore zeros.
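The diagonal-stripe idea quoted in the question (only cells with |i - j| <= k can matter) can be sketched roughly as follows. This is a minimal illustration, not Apache's implementation and not the asker's modified code: the class name `BoundedLevenshtein`, the method `distance`, and the convention of returning -1 when the distance exceeds k are all assumptions made for the example. Cells outside the stripe are treated as k + 1, meaning "too far", which also addresses the problem of the top diagonal reading a stale 0 from the cell above.

```java
import java.util.Arrays;

final class BoundedLevenshtein {
    // Returns the edit distance between s and t if it is <= k, otherwise -1.
    static int distance(String s, String t, int k) {
        int n = s.length(), m = t.length();
        if (k < 0 || Math.abs(n - m) > k) {
            return -1;                      // length gap alone exceeds k
        }
        final int far = k + 1;              // stands for any value above k
        int[] prev = new int[m + 1];
        int[] cur = new int[m + 1];
        Arrays.fill(cur, far);
        for (int j = 0; j <= m; j++) {
            prev[j] = Math.min(j, far);     // row 0: j insertions
        }
        for (int i = 1; i <= n; i++) {
            int lo = Math.max(1, i - k);    // stripe limits for this row
            int hi = Math.min(m, i + k);
            // Cell just left of the stripe: column 0 if lo == 1, else outside.
            cur[lo - 1] = (lo == 1) ? Math.min(i, far) : far;
            for (int j = lo; j <= hi; j++) {
                int sub = prev[j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1);
                int del = prev[j] + 1;
                int ins = cur[j - 1] + 1;
                cur[j] = Math.min(far, Math.min(sub, Math.min(del, ins)));
            }
            if (hi < m) {
                cur[hi + 1] = far;          // cell just right of the stripe
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[m] <= k ? prev[m] : -1;
    }
}
```

Each row touches at most 2k + 1 cells, so the main loop does O(k * n) work after the O(m) setup of the first row.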
