Douglas-Peucker point count tolerance - java

I am trying to implement the Douglas-Peucker algorithm with a point count tolerance; that is, I want to specify, for example, 50% compression. I found this variant at http://psimpl.sourceforge.net/douglas-peucker.html under Douglas-Peucker N, but I am not sure how it works. Is there a Java implementation of it, or a good specification of this version of the algorithm?
What I don't understand from the psimpl explanation is what happens after we choose the first point for the simplification. Do we split the edge into two new edges, rank all points, and choose the best point from both edges?

DP searches the polyline for the farthest vertex from the baseline. If this vertex is farther than the tolerance, the polyline is split there and the procedure applied recursively.
Unfortunately, there is no direct relation between this distance tolerance and the number of points that are kept. Usually the "compression" is better than 50%, so you may try to continue the recursion deeper. But achieving a good balance of point density looks challenging.
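One way to drive the split by a point budget instead of a distance is to keep the candidate sub-polylines in a priority queue and always refine the one whose farthest vertex deviates most. The following is only a sketch under that assumption, not the psimpl implementation: the names DouglasPeuckerN, Span, simplify and distanceToSegment are hypothetical, the points are passed as parallel coordinate arrays, and the distance is measured to the segment rather than to the infinite line.

```java
import java.util.PriorityQueue;

/** Sketch of a point-count-driven Douglas-Peucker variant (all names are hypothetical). */
final class DouglasPeuckerN {

    /** A sub-polyline [first, last] together with its farthest interior vertex. */
    private static final class Span {
        final int first, last, farthest;
        final double dist;

        Span(int first, int last, int farthest, double dist) {
            this.first = first;
            this.last = last;
            this.farthest = farthest;
            this.dist = dist;
        }
    }

    /** Returns keep[i] == true for surviving vertices; the two end points are always kept. */
    static boolean[] simplify(double[] xs, double[] ys, int targetCount) {
        int n = xs.length;
        boolean[] keep = new boolean[n];
        if (n == 0) return keep;
        keep[0] = true;
        keep[n - 1] = true;
        int kept = (n == 1) ? 1 : 2;

        // Max-heap: the span whose farthest vertex deviates most is refined first.
        PriorityQueue<Span> queue =
                new PriorityQueue<>((a, b) -> Double.compare(b.dist, a.dist));
        push(queue, xs, ys, 0, n - 1);

        while (kept < targetCount && !queue.isEmpty()) {
            Span s = queue.poll();
            keep[s.farthest] = true;          // this vertex joins the simplification
            kept++;
            push(queue, xs, ys, s.first, s.farthest);   // left half
            push(queue, xs, ys, s.farthest, s.last);    // right half
        }
        return keep;
    }

    /** Finds the farthest interior vertex of [first, last] and queues the span if it has one. */
    private static void push(PriorityQueue<Span> queue, double[] xs, double[] ys,
                             int first, int last) {
        int best = -1;
        double bestDist = -1.0;
        for (int i = first + 1; i < last; i++) {
            double d = distanceToSegment(xs[i], ys[i], xs[first], ys[first], xs[last], ys[last]);
            if (d > bestDist) {
                bestDist = d;
                best = i;
            }
        }
        if (best >= 0) queue.add(new Span(first, last, best, bestDist));
    }

    /** Distance from (px, py) to the segment (ax, ay)-(bx, by). */
    static double distanceToSegment(double px, double py,
                                    double ax, double ay, double bx, double by) {
        double dx = bx - ax, dy = by - ay;
        double len2 = dx * dx + dy * dy;
        double t = (len2 == 0.0) ? 0.0 : ((px - ax) * dx + (py - ay) * dy) / len2;
        t = Math.max(0.0, Math.min(1.0, t));  // clamp the projection onto the segment
        return Math.hypot(px - (ax + t * dx), py - (ay + t * dy));
    }
}
```

For 50% compression one would call simplify(xs, ys, xs.length / 2). So, in this reading, yes: after a point is chosen its span is split in two, both halves are ranked by their own farthest point, and the best candidate over all spans is taken next.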

Combine the Douglas-Peucker algorithm with iteration, and use the number of remaining points as the stopping criterion.
Here is my algorithm; the array 'points' stores the points of the trajectory and the double 'd' is the threshold.
public static Point[] divi(Point[] points,double d)
{
System.out.println("threshold"+d);
System.out.println("the nth divi iteration");
int i = 0;
Point[] p1 = new Point[points.length];
for (i = 0;i<points.length;i++)
p1[i] = points[i];
compress(p1, 0,p1.length - 1,d); //first compression
int size = 0;
for (Point p : p1) //trajectory after compression
if(p != null)
size ++;
System.out.println("size of points"+size);
if(size<=200 && size>=100)
return p1;
else if(size>200)
return divi(p1,d + d/2.0);
else
return divi(points,d/2.0);
}
public static void compress(Point[] points,int m, int n,double D)
{
System.out.println("threshold"+D);
System.out.println("startIndex"+m);
System.out.println("endIndex"+n);
while (points[m] == null)
m ++;
Point from = points[m];
while(points[n] == null)
n--;
Point to = points[n];
double A = (from.x() - to.x()) /(from.y() - to.y());
/**
* Coefficients of the general-form equation of the line through the start and end points
*/
double B = -1;
double C = from.x() - A *from.y();
double d = 0;
double dmax = 0;
if (n == m + 1)
return;
List<Double> distance = new ArrayList<Double>();
for (int i = m + 1; i < n; i++) {
if (points[i] ==null)
{
distance.add(0.0);
continue;
}
else
{
Point p = points[i];
d = Math.abs(A * (p.y()) + B * (p.x()) + C) / Math.sqrt(Math.pow(A, 2) + Math.pow(B, 2));
distance.add(d);
}
}
dmax= Collections.max(distance);
if (dmax < D)
for(int i = n-1;i > m;i--)
points[i] = null;
else
{
int middle = distance.indexOf(dmax) + m + 1;
compress(points,m, middle,D);
compress(points,middle, n,D);
}
}

Related

How can I make my BFS algorithm run faster?

So I have a function that looks at a grid (2D array) and finds all the paths from the starting point to the end point. So far the algorithm works as intended and I get the values that I'm looking for.
The problem is that it takes forever. It can run over a 100 x 100 grid no problem, but once I get to a 10000 x 10000 grid, it'll take about 10 min to give back an answer, where I'm looking for maybe 1 min at most.
Here's what it looks like right now:
public void BFS(Point s, Point e){
/**
* North, South, East, West coordinates
*/
int[] x = {0,0,1,-1};
int[] y = {1,-1,0,0};
LinkedList<Point> queue = new LinkedList<>();
queue.add(s);
/**
* 2D int[][] grid that stores the distances of each point on the grid
* from the start
*/
int[][] dist = new int[numRow][numCol];
for(int[] a : dist){
Arrays.fill(a,-1);
}
/**
* "obstacles" is an array of Points that contain the (x, y) coordinates of obstacles on the grid
* designated as a -2, which the BFS algorithm will avoid.
*/
for(Point ob : obstacles){
dist[ob.x][ob.y] = -2;
}
// Start point
dist[s.x][s.y] = 0;
/**
* Loops over dist[][] from starting point, changing each [x][y] coordinate to the int
* value that is the distance from S.
*/
while(!queue.isEmpty()){
Point p = queue.removeFirst();
for(int i = 0; i < 4; i++){
int a = p.x + x[i];
int b = p.y + y[i];
if(a >= 0 && b >= 0 && a < numRow && b < numCol && dist[a][b] == -1){
dist[a][b] = 1 + dist[p.x][p.y];
Point tempPoint = new Point(a, b);
if(!queue.contains(tempPoint)){
queue.add(tempPoint);
}
}
}
}
/**
* Works backwards to find all shortest path points between S and E, and adds each
* point to an array called "validPaths"
*/
queue.add(e);
while(!queue.isEmpty()){
Point p = queue.removeFirst();
// Checks grid space (above, below, left, and right) from Point p
for(int i = 0; i < 4; i++){
int curX = p.x + x[i];
int curY = p.y + y[i];
// Index Out of Bounds check
if(curX >= 0 && curY >= 0 && !(curX == start.x && curY == start.y) && curX < numRow && curY < numCol){
if(dist[curX][curY] < dist[p.x][p.y] && dist[curX][curY] != -2){ // -2 is an obstacle
Point tempPoint = new Point(curX, curY);
if(!validPaths.contains(tempPoint)){
validPaths.add(tempPoint);
}
if(!queue.contains(tempPoint)){
queue.add(tempPoint);
}
}
}
}
}
So again, while it works, it's really slow. I'm trying to get a O(n + m), but I believe that it might be running in O(n^2).
Does anyone know any good ideas to make this faster?
A clear reason for the observed inefficiency is the pair of checks !validPaths.contains(tempPoint) and !queue.contains(tempPoint), which are both O(n). You should be striving for an O(1) membership test, which can be accomplished with a data structure such as a hash set or simply a bitset.
As it stands now, your implementation is clearly O(n^2) because of these comparisons.
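As a rough illustration of the O(1) membership idea, the sketch below lets the distance grid itself record what has been enqueued and uses a boolean grid in place of validPaths.contains. It is a sketch under assumptions, not a drop-in replacement: the class GridBfs, both method names and the boolean[][] obstacle grid are hypothetical, and cells are passed as row/column indices instead of Point objects.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

/** Sketch: grid BFS where membership tests are array lookups (names are hypothetical). */
final class GridBfs {
    private static final int[] DR = {0, 0, 1, -1};
    private static final int[] DC = {1, -1, 0, 0};

    /** Returns BFS distances from the start cell; -1 marks blocked or unreachable cells. */
    static int[][] distances(boolean[][] blocked, int startRow, int startCol) {
        int rows = blocked.length, cols = blocked[0].length;
        int[][] dist = new int[rows][cols];
        for (int[] row : dist) Arrays.fill(row, -1);

        ArrayDeque<int[]> queue = new ArrayDeque<>();
        dist[startRow][startCol] = 0;
        queue.add(new int[] {startRow, startCol});
        while (!queue.isEmpty()) {
            int[] cur = queue.poll();
            for (int k = 0; k < 4; k++) {
                int r = cur[0] + DR[k], c = cur[1] + DC[k];
                if (r < 0 || c < 0 || r >= rows || c >= cols) continue;
                // dist != -1 already means "seen", so no queue.contains() is needed.
                if (blocked[r][c] || dist[r][c] != -1) continue;
                dist[r][c] = dist[cur[0]][cur[1]] + 1;
                queue.add(new int[] {r, c});
            }
        }
        return dist;
    }

    /** Marks every cell that lies on at least one shortest path to the end cell. */
    static boolean[][] onShortestPaths(int[][] dist, int endRow, int endCol) {
        int rows = dist.length, cols = dist[0].length;
        boolean[][] onPath = new boolean[rows][cols];
        if (dist[endRow][endCol] < 0) return onPath;   // end is unreachable

        ArrayDeque<int[]> queue = new ArrayDeque<>();
        onPath[endRow][endCol] = true;
        queue.add(new int[] {endRow, endCol});
        while (!queue.isEmpty()) {
            int[] cur = queue.poll();
            for (int k = 0; k < 4; k++) {
                int r = cur[0] + DR[k], c = cur[1] + DC[k];
                if (r < 0 || c < 0 || r >= rows || c >= cols) continue;
                if (dist[r][c] < 0 || onPath[r][c]) continue;  // blocked, unreachable or seen
                if (dist[r][c] != dist[cur[0]][cur[1]] - 1) continue;  // not one step closer
                onPath[r][c] = true;   // O(1) replacement for validPaths.contains(...)
                queue.add(new int[] {r, c});
            }
        }
        return onPath;
    }
}
```

Each cell is enqueued at most once in either pass, so the total work is proportional to the number of cells.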

Newton's Method for finding Complex Roots in Java

I got a project in my Java class which I'm having trouble with.
The project is basically marking coordinates on the screen, making a (complex) polynomial out of them, then solving the polynomial with Newton's method using random guesses and drawing the path of the guesses on the screen.
I don't have a problem with any of the drawing, marking, etc.
But for some reason, my Newton's method algorithm randomly misses roots. Sometimes it hits none of them, sometimes it misses one or two. I've been changing stuff up for hours now but I couldn't really come up with a solution.
When a root is missed, usually the value I get in the array is either converging to infinity or negative infinity (very high numbers)
Any help would be really appreciated.
// Polynomial evaluation method.
public Complex evalPoly(Complex complexArray[], Complex guess) {
Complex result = new Complex(0, 0);
for (int i = 0; i < complexArray.length; i++) {
result = result.gaussMult(guess).addComplex(complexArray[complexArray.length - i - 1]);
}
return result;
}
// Polynomial differentiation method.
public Complex[] diff(Complex[] comp) {
Complex[] result = new Complex[comp.length - 1];
for (int j = 0; j < result.length; j++) {
result[j] = new Complex(0, 0);
}
for (int i = 0; i < result.length - 1; i++) {
result[i].real = comp[i + 1].real * (i + 1);
result[i].imaginary = comp[i + 1].imaginary * (i + 1);
}
return result;
}
// Method which eliminates some of the things that I don't want to go into the array
public boolean rootCheck2(Complex[] comps, Complex comp) {
double accLim = 0.01;
if (comp.real == Double.NaN)
return false;
if (comp.real == Double.NEGATIVE_INFINITY || comp.real == Double.POSITIVE_INFINITY)
return false;
if (comp.imaginary == Double.NaN)
return false;
if (comp.imaginary == Double.NEGATIVE_INFINITY || comp.imaginary == Double.POSITIVE_INFINITY)
return false;
for (int i = 0; i < comps.length; i++) {
if (Math.abs(comp.real - comps[i].real) < accLim && Math.abs(comp.imaginary - comps[i].imaginary) < accLim)
return false;
}
return true;
}
// Method which finds (or attempts to find) all of the roots
public Complex[] addUnique2(Complex[] poly, Bitmap bitmapx, Paint paint, Canvas canvasx) {
Complex[] rootsC = new Complex[poly.length - 1];
int iterCount = 0;
int iteLim = 20000;
for (int i = 0; i < rootsC.length; i++) {
rootsC[i] = new Complex(0, 0);
}
while (iterCount < iteLim && MainActivity.a < rootsC.length) {
double guess = -492 + 984 * rand.nextDouble();
double guess2 = -718 + 1436 * rand.nextDouble();
if (rootCheck2(rootsC, findRoot2(poly, new Complex(guess, guess2), bitmapx, paint, canvasx))) {
rootsC[MainActivity.a] = findRoot2(poly, new Complex(guess, guess2), bitmapx, paint, canvasx);
MainActivity.a = MainActivity.a + 1;
}
iterCount = iterCount + 1;
}
return rootsC;
}
// Method which finds a single root of the complex polynomial.
public Complex findRoot2(Complex[] comp, Complex guess, Bitmap bitmapx, Paint paint, Canvas canvasx) {
int iterCount = 0;
double accLim = 0.001;
int itLim = 20000;
Complex[] diffedComplex = diff(comp);
while (Math.abs(evalPoly(comp, guess).real) >= accLim && Math.abs(evalPoly(comp, guess).imaginary) >= accLim) {
if (iterCount >= itLim) {
return new Complex(Double.NaN, Double.NaN);
}
if (evalPoly(diffedComplex, guess).real == 0 || evalPoly(diffedComplex, guess).imaginary == 0) {
return new Complex(Double.NaN, Double.NaN);
}
iterCount = iterCount + 1;
guess.real = guess.subtractComplex(evalPoly(comp, guess).divideComplex(evalPoly(diffedComplex, guess))).real;
guess.imaginary = guess.subtractComplex(evalPoly(comp, guess).divideComplex(evalPoly(diffedComplex, guess))).imaginary;
drawCircles((float) guess.real, (float) guess.imaginary, paint, canvasx, bitmapx);
}
return guess;
}
// Drawing method
void drawCircles(float x, float y, Paint paint, Canvas canvasx, Bitmap bitmapx) {
canvasx.drawCircle(x + 492, shiftBackY(y), 5, paint);
coordPlane.setAdjustViewBounds(false);
coordPlane.setImageBitmap(bitmapx);
}
}
Error 1
The lines
guess.real = guess.subtractComplex(evalPoly(comp, guess).divideComplex(evalPoly(diffedComplex, guess))).real;
guess.imaginary = guess.subtractComplex(evalPoly(comp, guess).divideComplex(evalPoly(diffedComplex, guess))).imaginary;
first introduce a needless complication and second introduce an error that makes it deviate from Newton's method. The guess used in the second line is different from the guess used in the first line since the real part has changed.
Why not use a single complex assignment, as in the evaluation procedure:
guess = guess.subtractComplex(evalPoly(comp, guess).divideComplex(evalPoly(diffedComplex, guess)));
Error 2 (Update)
In the computation of the differentiated polynomial, you are missing the highest degree term in
for (int i = 0; i < result.length - 1; i++) {
result[i].real = comp[i + 1].real * (i + 1);
result[i].imaginary = comp[i + 1].imaginary * (i + 1);
It should be either i < result.length or i < comp.length - 1. Using the wrong derivative will of course lead to unpredictable results in the iteration.
On root bounds and initial values
To each polynomial you can assign an outer root bound such as
R = 1+max(abs(c[0:N-1]))/abs(c[N])
Using 3*N points, random or equidistant, on or close to this circle should increase the probability of reaching each of the roots.
But the usual way to find all of the roots is to use polynomial deflation, that is, splitting off the linear factors corresponding to the root approximations already found. Then a couple of additional Newton steps using the full polynomial restores maximal accuracy.
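The root bound and the 3*N starting points mentioned above can be sketched as follows. This is an illustration under assumptions: the class RootBound and its methods are hypothetical, the coefficients are passed as parallel real and imaginary arrays with index k holding the coefficient of z^k (the same ordering diff() relies on), and the points are placed equidistantly on the circle. The bound itself is the classical one often attributed to Cauchy.

```java
/** Sketch: outer root bound and starting points for Newton's method (names are hypothetical). */
final class RootBound {

    /**
     * Computes R = 1 + max(|c[0]|, ..., |c[N-1]|) / |c[N]|.
     * re[k] and im[k] hold the real and imaginary part of the coefficient of z^k.
     */
    static double outerRootBound(double[] re, double[] im) {
        int n = re.length - 1;                       // degree N
        double lead = Math.hypot(re[n], im[n]);      // |c[N]|
        if (n < 1 || lead == 0.0) {
            throw new IllegalArgumentException("need degree >= 1 and a non-zero leading coefficient");
        }
        double max = 0.0;
        for (int k = 0; k < n; k++) {
            max = Math.max(max, Math.hypot(re[k], im[k]));
        }
        return 1.0 + max / lead;
    }

    /** Returns 3*degree equidistant points {real, imaginary} on the circle of the given radius. */
    static double[][] startingPoints(double radius, int degree) {
        int count = 3 * degree;
        double[][] points = new double[count][2];
        for (int j = 0; j < count; j++) {
            double angle = 2.0 * Math.PI * j / count;
            points[j][0] = radius * Math.cos(angle);
            points[j][1] = radius * Math.sin(angle);
        }
        return points;
    }
}
```

For z^2 - 3z + 2 (roots 1 and 2) the coefficient arrays are {2, -3, 1} and {0, 0, 0}, giving R = 1 + 3/1 = 4, so both roots lie inside the circle used for the initial guesses.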
Newton fractals
Each root has a basin or domain of attraction with fractal boundaries between the domains. In rebuilding a similar situation to the one used in
I computed a Newton fractal showing that the attraction to two of the roots and ignorance of the other two is a feature of the mathematics behind it, not an error in implementing the Newton method.
Different shades of the same color belong to the domain of the same root where brightness corresponds to the number of steps used to reach the white areas around the roots.

Closest pair of points using sweep line algorithm in Java

First off; I'm doing this as an assignment for school, and that's why I'm using the sweep line algorithm. I'm basing it off of the pseudocode given by my teacher.
I've done an implementation of my own using TreeMap instead of a balanced binary search tree, which I was told would provide the same functionality. (Don't know if this is true, though?)
However I don't get the proper end result, and I really have no idea why. I've been staring myself blind.
Below is the part of my code that performs the actual computation. I've omitted the creation of the points-list, and other unimportant stuff.
count = 0;
TreeMap<Double, Point> tree = new TreeMap<Double, Point>();
double dist = Double.POSITIVE_INFINITY;
// Sorts points on x-axis
Collections.sort(points);
// Gets left-most point
Point q = points.get(count++);
for (Point p : points) {
while (q.getX() < p.getX() - dist) {
tree.remove(q.getY());
q = points.get(count++);
}
NavigableSet<Double> keys = tree.navigableKeySet();
// Look at the 4 points above 'p'
int i = 1;
Iterator<Double> iterHi = keys.tailSet(p.getY()).iterator();
while (i <= 4 && iterHi.hasNext()) {
double tmp = p.distanceTo(tree.get(iterHi.next()));
if (tmp < dist) {
dist = tmp;
pClosest = p;
qClosest = q;
}
i++;
}
// Look at the 4 points below 'p'
i = 1;
Iterator<Double> iterLo = keys.headSet(p.getY()).iterator();
while (i <= 4 && iterLo.hasNext()) {
double tmp = q.distanceTo(tree.get(iterLo.next()));
if (tmp < dist) {
dist = tmp;
pClosest = p;
qClosest = q;
}
i++;
}
tree.put(p.getY(), p);
}
double finalDist = pClosest.distanceTo(qClosest);
Edit: The pseudocode can be found here: http://pastebin.com/i0XbPp1a . It's based on notes taken from what my teacher wrote on the whiteboard.
Regarding results:
Using the following points (X, Y):
(0, 2) - (6, 67) - (43, 71) - (39, 107) - (189, 140)
I should get ~36, but I'm getting ~65.
I have found several bugs in your code (there may be others):
What if several points have the same y coordinate? A TreeMap can hold only one point for each y value. Is that what you want?
When you look at points below and above the current one, you compute the distance to the point for iterHi.next(): double tmp = p.distanceTo(tree.get(iterHi.next()));, but then assign q to qClosest. That is not correct (the point for iterHi.next() and q are not the same point).
In the second inner loop, you compute the distance from q to the element of the set: double tmp = q.distanceTo(tree.get(iterLo.next()));. It should be p instead.
I would also recommend maintaining a TreeSet of Point instead of using a TreeMap (the points should be compared by their y coordinate, of course).
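A minimal sketch of the TreeSet idea, under assumptions: the class ClosestPairSweep and its method are hypothetical, points are plain {x, y} arrays rather than the Point class from the question, and the set is ordered by y with x as a tie-breaker so that points sharing a y coordinate can coexist. Instead of looking at a fixed number of neighbours, it scans every active point whose y lies within the current best distance of p.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeSet;

/** Sketch: closest-pair sweep line over {x, y} arrays (names are hypothetical). */
final class ClosestPairSweep {
    private static final Comparator<double[]> BY_X_THEN_Y =
            (a, b) -> a[0] != b[0] ? Double.compare(a[0], b[0]) : Double.compare(a[1], b[1]);
    private static final Comparator<double[]> BY_Y_THEN_X =
            (a, b) -> a[1] != b[1] ? Double.compare(a[1], b[1]) : Double.compare(a[0], b[0]);

    /** Returns the smallest pairwise distance, or +infinity for fewer than two points. */
    static double closestDistance(double[][] points) {
        double[][] sorted = points.clone();
        Arrays.sort(sorted, BY_X_THEN_Y);

        TreeSet<double[]> active = new TreeSet<>(BY_Y_THEN_X);  // points inside the strip
        double best = Double.POSITIVE_INFINITY;
        int left = 0;                                           // oldest point still active

        for (double[] p : sorted) {
            // Drop points that are more than 'best' to the left of the sweep line.
            while (sorted[left][0] < p[0] - best) {
                active.remove(sorted[left]);
                left++;
            }
            // Only points with y in [p.y - best, p.y + best] can improve the result.
            double[] lo = {Double.NEGATIVE_INFINITY, p[1] - best};
            double[] hi = {Double.POSITIVE_INFINITY, p[1] + best};
            for (double[] q : active.subSet(lo, true, hi, true)) {
                // The distance is always measured from p to the candidate q.
                best = Math.min(best, Math.hypot(p[0] - q[0], p[1] - q[1]));
            }
            active.add(p);
        }
        return best;
    }
}
```

With the five points from the question, this returns the distance between (43, 71) and (39, 107), which is sqrt(1312), roughly 36.2.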

How to write Extended Euclidean Algorithm code wise in Java?

I have a question that requires a bit of understanding of the Euclidean algorithm. The problem is simple: two int values, "First" and "Second", are given by the user via Scanner.
Then we need to find their greatest common divisor. The process goes as explained below:
Now assume that the First number is 42 and the Second is 30, as given by the user.
int x, y;
(x * First) + (y * Second) = gcd(First, Second); // x ? y ?
To Find GCD you may use: gcd(First, Second); Code is below:
public static int gcd(int a, int b)
{
if(a == 0 || b == 0) return a+b; // base case
return gcd(b,a%b);
}
Sample Input: First: 24 Second: 48 and Output should be x: (-3) and y: 2
Sample Input: First: 42 Second: 30 and Output should be x: (-2) and y: 3
Sample Input: First: 35 Second: 05 and Output should be x: (0) and y: 1
(x * First) + (y * Second) = gcd(First, Second); // How can we find x and y ?
I would very much appreciate it if you could show a solution as Java code. Thanks for checking!
The Extended Euclidean Algorithm is described in this Wikipedia article. The basic algorithm is stated like this (it looks better in the Wikipedia article):
More precisely, the standard Euclidean algorithm with a and b as input consists of computing a sequence q_1, ..., q_k of quotients and a sequence r_0, ..., r_{k+1} of remainders such that
r_0 = a
r_1 = b
...
r_{i+1} = r_{i-1} - q_i r_i and 0 ≤ r_{i+1} < |r_i|
...
It is the main property of Euclidean division that the inequalities on the right define uniquely r_{i+1} from r_{i-1} and r_i.
The computation stops when one reaches a remainder r_{k+1} which is zero; the greatest common divisor is then the last non zero remainder r_k.
The extended Euclidean algorithm proceeds similarly, but adds two other sequences defined by
s_0 = 1, s_1 = 0
t_0 = 0, t_1 = 1
...
s_{i+1} = s_{i-1} - q_i s_i
t_{i+1} = t_{i-1} - q_i t_i
This should be easy to implement in Java, but the mathematical way it's expressed may make it hard to understand. I'll try to break it down.
Note that this is probably going to be easier to implement in a loop than recursively.
In the standard Euclidean algorithm, you compute r_{i+1} in terms of r_{i-1} and r_i. This means that you have to save the two previous versions of r. This part of the formula:
r_{i+1} = r_{i-1} - q_i r_i and 0 ≤ r_{i+1} < |r_i|
just means that r_{i+1} will be the remainder when r_{i-1} is divided by r_i. q_i is the quotient, which you don't use in the standard Euclidean algorithm, but you do use in the extended one. So Java code to perform the standard Euclidean algorithm (i.e. compute the GCD) might look like:
prevPrevR = a;
prevR = b;
while ([something]) {
nextR = prevPrevR % prevR;
quotient = prevPrevR / prevR; // not used in the standard algorithm
prevPrevR = prevR;
prevR = nextR;
}
Thus, at any point, prevPrevR will be essentially r_{i-1}, and prevR will be r_i. The algorithm computes the next r, r_{i+1}, then shifts everything, which in essence increments i by 1.
The extended Euclidean algorithm will be done the same way, saving two s values prevPrevS and prevS, and two t values prevPrevT and prevT. I'll let you work out the details.
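For readers who want to check their own attempt, here is one possible way to fill in those details. It is a sketch rather than the only correct form: the class ExtendedEuclid and the method extendedGcd are hypothetical names, the loop condition (stop once the current remainder is zero) is an assumption consistent with the description above, and non-negative inputs are assumed.

```java
/** Sketch: iterative extended Euclidean algorithm (names are hypothetical). */
final class ExtendedEuclid {

    /** Returns {g, x, y} such that a*x + b*y == g, where g is gcd(a, b). */
    static long[] extendedGcd(long a, long b) {
        long prevPrevR = a, prevR = b;   // r_{i-1}, r_i
        long prevPrevS = 1, prevS = 0;   // s_{i-1}, s_i
        long prevPrevT = 0, prevT = 1;   // t_{i-1}, t_i

        while (prevR != 0) {
            long quotient = prevPrevR / prevR;

            long nextR = prevPrevR - quotient * prevR;   // same as prevPrevR % prevR
            long nextS = prevPrevS - quotient * prevS;
            long nextT = prevPrevT - quotient * prevT;

            // Shift everything, which in essence increments i by 1.
            prevPrevR = prevR;
            prevR = nextR;
            prevPrevS = prevS;
            prevS = nextS;
            prevPrevT = prevT;
            prevT = nextT;
        }
        // The last non-zero remainder is the gcd; its s and t are the coefficients.
        return new long[] {prevPrevR, prevPrevS, prevPrevT};
    }
}
```

For First = 42 and Second = 30 this yields g = 6, x = -2, y = 3, matching the sample in the question. The coefficients are not unique, so for 24 and 48 it returns x = 1, y = 0 rather than the -3 and 2 listed above; both satisfy the equation.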
Thanks for helping me out, ajb. I solved it after digging into your answer. For people who would like to see it in code:
import java.util.Scanner;
public class Main
{
public static void main (String args[])
{
System.out.println("How many times would you like to try?");
// The annotation has to sit on the declaration it applies to
@SuppressWarnings("resource")
Scanner read = new Scanner(System.in);
int len = read.nextInt();
for(int w = 0; w < len; w++)
{
System.out.print("Please give the numbers separated by a space: ");
read.nextLine();
long tmp = read.nextLong();
long m = read.nextLong();
long n;
if (m < tmp) {
n = m;
m = tmp;
}
else {
n = tmp;
}
long[] l1 = {m, 1, 0};
long[] l2 = {n, 0, 1};
long[] l3 = new long[3];
while (l1[0]-l2[0]*(l1[0]/l2[0]) > 0) {
for (int j=0;j<3;j++) l3[j] = l2[j];
long q = l1[0]/l2[0];
for (int i = 0; i < 3; i++) {
l2[i] = (l1[i]-l2[i]*q);
}
for (int k=0;k<3;k++) l1[k] = l3[k];
}
System.out.printf("%d %d %d",l2[1],l2[2],l2[0]); // first two Bezouts identity Last One gcd
}
}
}
Here is the code that I came up with, if anyone is still looking. It is in C#, but I am sure it is similar in Java. Enjoy.
static void Main(string[] args)
{
List<long> U = new List<long>();
List<long> V = new List<long>();
List<long> W = new List<long>();
long a, b, d, x, y;
Console.Write("Enter value for a: ");
string firstInput = Console.ReadLine();
long.TryParse(firstInput, out a);
Console.Write("Enter value for b: ");
string secondInput = Console.ReadLine();
long.TryParse(secondInput, out b);
long temp;
//Make sure that a > b
if(a < b)
{
temp = a;
a = b;
b = temp;
}
//Initialise List U
U.Add(a);
U.Add(1);
U.Add(0);
//Initialise List V
V.Add(b);
V.Add(0);
V.Add(1);
while(V[0] > 0)
{
decimal difference = U[0] / V[0];
var roundedDown = Math.Floor(difference);
long rounded = Convert.ToInt64(roundedDown);
for (int i = 0; i < 3; i++)
W.Add(U[i] - rounded * V[i]);
U.Clear();
for (int i = 0; i < 3; i++)
U.Add(V[i]);
V.Clear();
for (int i = 0; i < 3; i++)
V.Add(W[i]);
W.Clear();
}
d = U[0];
x = U[1];
y = U[2];
Console.WriteLine("\nd = {0}, x = {1}, y = {2}", d, x, y);
//Check Equation
Console.WriteLine("\nEquation check: d = ax + by\n");
Console.WriteLine("\t{0} = {1}({2}) + {3}({4})", d, a, x, b, y);
Console.WriteLine("\t{0} = {1} + {2}", d, a*x, b*y);
Console.WriteLine("\t{0} = {1}", d, (a * x) + (b * y));
if (d == (a * x) + (b * y))
Console.WriteLine("\t***Equation is satisfied!***");
else
Console.WriteLine("\tEquation is NOT satisfied!");
}
}
}

Modifying Levenshtein Distance algorithm to not calculate all distances

I'm working on a fuzzy search implementation and as part of the implementation, we're using Apache's StringUtils.getLevenshteinDistance. At the moment, we're going for a specific maximum average response time for our fuzzy search. After various enhancements and with some profiling, the place where the most time is spent is calculating the Levenshtein distance. It takes up roughly 80-90% of the total time on search strings of three letters or more.
Now, I know there are some limitations to what can be done here, but I've read in previous SO questions and on the Wikipedia page for LD that if one is willing to limit the threshold to a set maximum distance, that could help curb the time spent on the algorithm. I'm not sure how to do this exactly.
If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.[3]
Below you will see the original LD code from StringUtils. After that is my modification. I'm trying to basically calculate the distances of a set length from the i,j diagonal (so, in my example, two diagonals above and below the i,j diagonal). However, this can't be correct as I've done it. For example, on the highest diagonal, it's always going to choose the cell value directly above, which will be 0. If anyone could show me how to make this functional as I've described, or give some general advice on how to make it so, it would be greatly appreciated.
public static int getLevenshteinDistance(String s, String t) {
if (s == null || t == null) {
throw new IllegalArgumentException("Strings must not be null");
}
int n = s.length(); // length of s
int m = t.length(); // length of t
if (n == 0) {
return m;
} else if (m == 0) {
return n;
}
if (n > m) {
// swap the input strings to consume less memory
String tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}
int p[] = new int[n+1]; //'previous' cost array, horizontally
int d[] = new int[n+1]; // cost array, horizontally
int _d[]; //placeholder to assist in swapping p and d
// indexes into strings s and t
int i; // iterates through s
int j; // iterates through t
char t_j; // jth character of t
int cost; // cost
for (i = 0; i<=n; i++) {
p[i] = i;
}
for (j = 1; j<=m; j++) {
t_j = t.charAt(j-1);
d[0] = j;
for (i=1; i<=n; i++) {
cost = s.charAt(i-1)==t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
// our last action in the above loop was to switch d and p, so p now
// actually has the most recent cost counts
return p[n];
}
My modifications (only to the for loops):
for (j = 1; j<=m; j++) {
t_j = t.charAt(j-1);
d[0] = j;
int k = Math.max(j-2, 1);
for (i = k; i <= Math.min(j+2, n); i++) {
cost = s.charAt(i-1)==t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
The issue with implementing the window is dealing with the value to the left of the first entry and above the last entry in each row.
One way is to start the values you initially fill in at 1 instead of 0, then just ignore any 0s that you encounter. You'll have to subtract 1 from your final answer.
Another way is to fill the entries left of first and above last with high values so the minimum check will never pick them. That's the way I chose when I had to implement it the other day:
public static int levenshtein(String s, String t, int threshold) {
int slen = s.length();
int tlen = t.length();
// swap so the smaller string is t; this reduces the memory usage
// of our buffers
if(tlen > slen) {
String stmp = s;
s = t;
t = stmp;
int itmp = slen;
slen = tlen;
tlen = itmp;
}
// p is the previous and d is the current distance array; dtmp is used in swaps
int[] p = new int[tlen + 1];
int[] d = new int[tlen + 1];
int[] dtmp;
// the values necessary for our threshold are written; the ones after
// must be filled with large integers since the tailing member of the threshold
// window in the bottom array will run min across them
int n = 0;
for(; n < Math.min(p.length, threshold + 1); ++n)
p[n] = n;
Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
Arrays.fill(d, Integer.MAX_VALUE);
// this is the core of the Levenshtein edit distance algorithm
// instead of actually building the matrix, two arrays are swapped back and forth
// the threshold limits the amount of entries that need to be computed if we're
// looking for a match within a set distance
for(int row = 1; row < s.length()+1; ++row) {
char schar = s.charAt(row-1);
d[0] = row;
// set up our threshold window
int min = Math.max(1, row - threshold);
int max = Math.min(d.length, row + threshold + 1);
// since we're reusing arrays, we need to be sure to wipe the value left of the
// starting index; we don't have to worry about the value above the ending index
// as the arrays were initially filled with large integers and we progress to the right
if(min > 1)
d[min-1] = Integer.MAX_VALUE;
for(int col = min; col < max; ++col) {
if(schar == t.charAt(col-1))
d[col] = p[col-1];
else
// min of: diagonal, left, up
d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
}
// swap our arrays
dtmp = p;
p = d;
d = dtmp;
}
if(p[tlen] == Integer.MAX_VALUE)
return -1;
return p[tlen];
}
I've written before about Levenshtein automata, which are one way to do this sort of check in O(n) time, here. The source code samples are in Python, but the explanations should be helpful, and the referenced papers provide more details.
According to "Gusfield, Dan (1997). Algorithms on strings, trees, and sequences: computer science and computational biology" (page 264) you should ignore zeros.
Here someone answers a very similar question:
Cite:
I've done it a number of times. The way I do it is with a recursive depth-first tree-walk of the game tree of possible changes. There is a budget k of changes, that I use to prune the tree. With that routine in hand, first I run it with k=0, then k=1, then k=2 until I either get a hit or I don't want to go any higher.
char* a = /* string 1 */;
char* b = /* string 2 */;
int na = strlen(a);
int nb = strlen(b);
bool walk(int ia, int ib, int k){
/* if the budget is exhausted, prune the search */
if (k < 0) return false;
/* if at end of both strings we have a match */
if (ia == na && ib == nb) return true;
/* if the first characters match, continue walking with no reduction in budget */
if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
/* if the first characters don't match, assume there is a 1-character replacement */
if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
/* try assuming there is an extra character in a */
if (ia < na && walk(ia+1, ib, k-1)) return true;
/* try assuming there is an extra character in b */
if (ib < nb && walk(ia, ib+1, k-1)) return true;
/* if none of those worked, I give up */
return false;
}
(This is just the main part; there is more code in the original.)
I used the original code and placed this just before the end of the j for loop:
if (p[n] > s.length() + 5)
break;
The +5 is arbitrary, but for our purposes, if the distance is the query length plus five (or whatever number we settle upon), it doesn't really matter what is returned because we consider the match to be simply too different. It does cut down on things a bit. Still, I'm pretty sure this isn't the idea that the Wiki statement was talking about, if anyone understands that better.
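A cut-off in a similar spirit, but with a guarantee, is to track the minimum of each row: values along any path through the cost table never decrease, so once every entry in a row exceeds the threshold the final distance must exceed it too. The sketch below assumes a non-negative threshold; the class BoundedLevenshtein and the method distanceWithin are hypothetical names, not part of StringUtils. It still fills complete rows, so it is not the diagonal-stripe idea from the Wikipedia quote and stays O(nm) in the worst case.

```java
/** Sketch: two-row Levenshtein with a row-minimum early exit (names are hypothetical). */
final class BoundedLevenshtein {

    /** Returns the distance if it is at most threshold, otherwise -1. */
    static int distanceWithin(String s, String t, int threshold) {
        int n = s.length();
        int m = t.length();
        // The distance is at least the difference in lengths.
        if (Math.abs(n - m) > threshold) return -1;

        int[] prev = new int[n + 1];   // previous row of the cost table
        int[] cur = new int[n + 1];    // row being filled
        for (int i = 0; i <= n; i++) prev[i] = i;

        for (int j = 1; j <= m; j++) {
            char tj = t.charAt(j - 1);
            cur[0] = j;
            int rowMin = cur[0];
            for (int i = 1; i <= n; i++) {
                int cost = (s.charAt(i - 1) == tj) ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                cur[i] = Math.min(Math.min(cur[i - 1] + 1, prev[i] + 1), prev[i - 1] + cost);
                rowMin = Math.min(rowMin, cur[i]);
            }
            // No later row can hold a smaller value than this row's minimum.
            if (rowMin > threshold) return -1;

            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[n] <= threshold ? prev[n] : -1;
    }
}
```

In a fuzzy search, a candidate can then be rejected as soon as distanceWithin returns -1, for example distanceWithin("kitten", "sitting", 2).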
Apache Commons Lang 3.4 has this implementation:
/**
* <p>Find the Levenshtein distance between two Strings if it's less than or equal to a given
* threshold.</p>
*
* <p>This is the number of changes needed to change one String into
* another, where each change is a single character modification (deletion,
* insertion or substitution).</p>
*
* <p>This implementation follows from Algorithms on Strings, Trees and Sequences by Dan Gusfield
* and Chas Emerick's implementation of the Levenshtein distance algorithm from
* http://www.merriampark.com/ld.htm</p>
*
* <pre>
* StringUtils.getLevenshteinDistance(null, *, *) = IllegalArgumentException
* StringUtils.getLevenshteinDistance(*, null, *) = IllegalArgumentException
* StringUtils.getLevenshteinDistance(*, *, -1) = IllegalArgumentException
* StringUtils.getLevenshteinDistance("","", 0) = 0
* StringUtils.getLevenshteinDistance("aaapppp", "", 8) = 7
* StringUtils.getLevenshteinDistance("aaapppp", "", 7) = 7
* StringUtils.getLevenshteinDistance("aaapppp", "", 6)) = -1
* StringUtils.getLevenshteinDistance("elephant", "hippo", 7) = 7
* StringUtils.getLevenshteinDistance("elephant", "hippo", 6) = -1
* StringUtils.getLevenshteinDistance("hippo", "elephant", 7) = 7
* StringUtils.getLevenshteinDistance("hippo", "elephant", 6) = -1
* </pre>
*
* @param s the first String, must not be null
* @param t the second String, must not be null
* @param threshold the target threshold, must not be negative
* @return result distance, or {@code -1} if the distance would be greater than the threshold
* @throws IllegalArgumentException if either String input {@code null} or negative threshold
*/
public static int getLevenshteinDistance(CharSequence s, CharSequence t, final int threshold) {
if (s == null || t == null) {
throw new IllegalArgumentException("Strings must not be null");
}
if (threshold < 0) {
throw new IllegalArgumentException("Threshold must not be negative");
}
/*
This implementation only computes the distance if it's less than or equal to the
threshold value, returning -1 if it's greater. The advantage is performance: unbounded
distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only
computing a diagonal stripe of width 2k + 1 of the cost table.
It is also possible to use this to compute the unbounded Levenshtein distance by starting
the threshold at 1 and doubling each time until the distance is found; this is O(dm), where
d is the distance.
One subtlety comes from needing to ignore entries on the border of our stripe
eg.
p[] = |#|#|#|*
d[] = *|#|#|#|
We must ignore the entry to the left of the leftmost member
We must ignore the entry above the rightmost member
Another subtlety comes from our stripe running off the matrix if the strings aren't
of the same size. Since string s is always swapped to be the shorter of the two,
the stripe will always run off to the upper right instead of the lower left of the matrix.
As a concrete example, suppose s is of length 5, t is of length 7, and our threshold is 1.
In this case we're going to walk a stripe of length 3. The matrix would look like so:
1 2 3 4 5
1 |#|#| | | |
2 |#|#|#| | |
3 | |#|#|#| |
4 | | |#|#|#|
5 | | | |#|#|
6 | | | | |#|
7 | | | | | |
Note how the stripe leads off the table as there is no possible way to turn a string of length 5
into one of length 7 in edit distance of 1.
Additionally, this implementation decreases memory usage by using two
single-dimensional arrays and swapping them back and forth instead of allocating
an entire n by m matrix. This requires a few minor changes, such as immediately returning
when it's detected that the stripe has run off the matrix and initially filling the arrays with
large values so that entries we don't compute are ignored.
See Algorithms on Strings, Trees and Sequences by Dan Gusfield for some discussion.
*/
int n = s.length(); // length of s
int m = t.length(); // length of t
// if one string is empty, the edit distance is necessarily the length of the other
if (n == 0) {
return m <= threshold ? m : -1;
} else if (m == 0) {
return n <= threshold ? n : -1;
}
if (n > m) {
// swap the two strings to consume less memory
final CharSequence tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}
int p[] = new int[n + 1]; // 'previous' cost array, horizontally
int d[] = new int[n + 1]; // cost array, horizontally
int _d[]; // placeholder to assist in swapping p and d
// fill in starting table values
final int boundary = Math.min(n, threshold) + 1;
for (int i = 0; i < boundary; i++) {
p[i] = i;
}
// these fills ensure that the value above the rightmost entry of our
// stripe will be ignored in following loop iterations
Arrays.fill(p, boundary, p.length, Integer.MAX_VALUE);
Arrays.fill(d, Integer.MAX_VALUE);
// iterates through t
for (int j = 1; j <= m; j++) {
final char t_j = t.charAt(j - 1); // jth character of t
d[0] = j;
// compute stripe indices, constrain to array size
final int min = Math.max(1, j - threshold);
final int max = (j > Integer.MAX_VALUE - threshold) ? n : Math.min(n, j + threshold);
// the stripe may lead off of the table if s and t are of different sizes
if (min > max) {
return -1;
}
// ignore entry left of leftmost
if (min > 1) {
d[min - 1] = Integer.MAX_VALUE;
}
// iterates through [min, max] in s
for (int i = min; i <= max; i++) {
if (s.charAt(i - 1) == t_j) {
// diagonally left and up
d[i] = p[i - 1];
} else {
// 1 + minimum of cell to the left, to the top, diagonally left and up
d[i] = 1 + Math.min(Math.min(d[i - 1], p[i]), p[i - 1]);
}
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
// if p[n] is greater than the threshold, there's no guarantee on it being the correct
// distance
if (p[n] <= threshold) {
return p[n];
}
return -1;
}
