Verifying correctness of FFT algorithm

Today I wrote an algorithm to compute the Fast Fourier Transform from a given array of points representing a discrete function. Now I'm trying to test it to see if it is working. I've tried about a dozen different input sets, and they seem to match up with examples I've found online. However, for my final test, I gave it the input of cos(i / 2), with i from 0 to 31, and I've gotten 3 different results based on which solver I use. My solution seems to be the least accurate:
Does this indicate a problem with my algorithm, or is it simply a result of the relatively small data set?
My code is below, in case it helps:
/**
* Slices the original array, starting with start, grabbing every stride elements.
* For example, slice(A, 3, 4, 5) would return elements 3, 8, 13, and 18 from array A.
* @param array The array to be sliced
* @param start The starting index
* @param newLength The length of the final array
* @param stride The spacing between elements to be selected
* @return A sliced copy of the input array
*/
public double[] slice(double[] array, int start, int newLength, int stride) {
double[] newArray = new double[newLength];
int count = 0;
for (int i = start; count < newLength && i < array.length; i += stride) {
newArray[count++] = array[i];
}
return newArray;
}
/**
* Calculates the fast fourier transform of the given function. The parameters are updated with the calculated values
* To ignore all imaginary output, leave imaginary null
* @param real An array representing the real part of a discrete-time function
* @param imaginary An array representing the imaginary part of a discrete-time function
* Pre: If imaginary is not null, the two arrays must be the same length, which must be a power of 2
*/
public void fft(double[] real, double[] imaginary) throws IllegalArgumentException {
if (real == null) {
throw new NullPointerException("Real array cannot be null");
}
int N = real.length;
// Make sure the length is a power of 2
if ((Math.log(N) / Math.log(2)) % 1 != 0) {
throw new IllegalArgumentException("The array length must be a power of 2");
}
if (imaginary != null && imaginary.length != N) {
throw new IllegalArgumentException("The two arrays must be the same length");
}
if (N == 1) {
return;
}
double[] even_re = slice(real, 0, N/2, 2);
double[] odd_re = slice(real, 1, N/2, 2);
double[] even_im = null;
double[] odd_im = null;
if (imaginary != null) {
even_im = slice(imaginary, 0, N/2, 2);
odd_im = slice(imaginary, 1, N/2, 2);
}
fft(even_re, even_im);
fft(odd_re, odd_im);
// F[k] = real[k] + imaginary[k]
// even odd
// F[k] = E[k] + O[k] * e^(-i*2*pi*k/N)
// F[k + N/2] = E[k] - O[k] * e^(-i*2*pi*k/N)
// Split complex arrays into component arrays:
// E[k] = er[k] + i*ei[k]
// O[k] = or[k] + i*oi[k]
// e^ix = cos(x) + i*sin(x)
// Let x = -2*pi*k/N
// F[k] = er[k] + i*ei[k] + (or[k] + i*oi[k])(cos(x) + i*sin(x))
// = er[k] + i*ei[k] + or[k]cos(x) + i*or[k]sin(x) + i*oi[k]cos(x) - oi[k]sin(x)
// = (er[k] + or[k]cos(x) - oi[k]sin(x)) + i*(ei[k] + or[k]sin(x) + oi[k]cos(x))
// { real } { imaginary }
// F[k + N/2] = (er[k] - or[k]cos(x) + oi[k]sin(x)) + i*(ei[k] - or[k]sin(x) - oi[k]cos(x))
// { real } { imaginary }
// Ignoring all imaginary parts (oi = 0):
// F[k] = er[k] + or[k]cos(x)
// F[k + N/2] = er[k] - or[k]cos(x)
for (int k = 0; k < N/2; ++k) {
double t = odd_re[k] * Math.cos(-2 * Math.PI * k/N);
real[k] = even_re[k] + t;
real[k + N/2] = even_re[k] - t;
if (imaginary != null) {
t = odd_im[k] * Math.sin(-2 * Math.PI * k/N);
real[k] -= t;
real[k + N/2] += t;
double t1 = odd_re[k] * Math.sin(-2 * Math.PI * k/N);
double t2 = odd_im[k] * Math.cos(-2 * Math.PI * k/N);
imaginary[k] = even_im[k] + t1 + t2;
imaginary[k + N/2] = even_im[k] - t1 - t2;
}
}
}

Validation
Look here: slow DFT, iDFT. At the end are my slow implementations of DFT and iDFT; they are tested and correct, and I have used them in the past to validate fast implementations.
Your code
The recursion stop is wrong (you forgot to set the return element). Mine looks like this:
if (n<=1) { if (n==1) { dst[0]=src[0]*2.0; dst[1]=src[1]*2.0; } return; }
So when your N==1, set the output element to Re=2.0*real[0], Im=2.0*imaginary[0] before returning. Also, I am a bit lost in your complex math (t, t1, t2) and too lazy to analyze it.
Just to be sure, here is my fast implementation. It needs too many things from the class hierarchy, so it will be of no use to you other than for visual comparison with your code.
My fast implementation (cc means complex output and input):
//---------------------------------------------------------------------------
void transform::DFFTcc(double *dst,double *src,int n)
{
if (n>N) init(n);
if (n<=1) { if (n==1) { dst[0]=src[0]*2.0; dst[1]=src[1]*2.0; } return; }
int i,j,n2=n>>1,q,dq=+N/n,mq=N-1;
// reorder even,odd (butterfly)
for (j=0,i=0;i<n+n;) { dst[j]=src[i]; i++; j++; dst[j]=src[i]; i+=3; j++; }
for ( i=2;i<n+n;) { dst[j]=src[i]; i++; j++; dst[j]=src[i]; i+=3; j++; }
// recursion
DFFTcc(src ,dst ,n2); // even
DFFTcc(src+n,dst+n,n2); // odd
// reorder and weight back (butterfly)
double a0,a1,b0,b1,a,b;
for (q=0,i=0,j=n;i<n;i+=2,j+=2,q=(q+dq)&mq)
{
a0=src[j ]; a1=+_cos[q];
b0=src[j+1]; b1=+_sin[q];
a=(a0*a1)-(b0*b1);
b=(a0*b1)+(a1*b0);
a0=src[i ]; a1=a;
b0=src[i+1]; b1=b;
dst[i ]=(a0+a1)*0.5;
dst[i+1]=(b0+b1)*0.5;
dst[j ]=(a0-a1)*0.5;
dst[j+1]=(b0-b1)*0.5;
}
}
//---------------------------------------------------------------------------
dst[] and src[] must not overlap, so you cannot transform an array into itself.
_cos and _sin are precomputed tables of cos and sin values, computed by the init() function like this:
double a,da; int i;
da=2.0*M_PI/double(N);
for (a=0.0,i=0;i<N;i++,a+=da) { _cos[i]=cos(a); _sin[i]=sin(a); }
N is a power of 2 (the zero-padded size of the data set; the last n from the init(n) call).
Just to be complete, here is my slow complex-to-complex version:
//---------------------------------------------------------------------------
void transform::DFTcc(double *dst,double *src,int n)
{
int i,j;
double a,b,a0,a1,_n,b0,b1,q,qq,dq;
dq=+2.0*M_PI/double(n); _n=2.0/double(n);
for (q=0.0,j=0;j<n;j++,q+=dq)
{
a=0.0; b=0.0;
for (qq=0.0,i=0;i<n;i++,qq+=q)
{
a0=src[i+i ]; a1=+cos(qq);
b0=src[i+i+1]; b1=+sin(qq);
a+=(a0*a1)-(b0*b1);
b+=(a0*b1)+(a1*b0);
}
dst[j+j ]=a*_n;
dst[j+j+1]=b*_n;
}
}
//---------------------------------------------------------------------------
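For the Java code in the question, the same kind of check can be sketched as a direct O(N^2) transform to compare the FFT output against. This is only a minimal sketch, not the answerer's code: the class and method names are hypothetical, and it assumes the unnormalized forward convention X[k] = sum of x[t] * e^(-2*pi*i*k*t/N) used in the question's comments (no 2/n scaling, unlike the C++ version above).

```java
final class NaiveDft {
    /**
     * Direct O(N^2) DFT: X[k] = sum over t of x[t] * e^(-2*pi*i*k*t/N).
     * Returns a two-row array: row 0 holds the real parts, row 1 the imaginary parts.
     */
    static double[][] dft(double[] real, double[] imag) {
        int n = real.length;
        double[] outRe = new double[n];
        double[] outIm = new double[n];
        for (int k = 0; k < n; k++) {
            for (int t = 0; t < n; t++) {
                double angle = -2.0 * Math.PI * k * t / n;
                double c = Math.cos(angle);
                double s = Math.sin(angle);
                // (re + i*im) * (c + i*s)
                outRe[k] += real[t] * c - imag[t] * s;
                outIm[k] += real[t] * s + imag[t] * c;
            }
        }
        return new double[][] { outRe, outIm };
    }

    /** Largest absolute difference between two equally long arrays. */
    static double maxAbsDiff(double[] a, double[] b) {
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    public static void main(String[] args) {
        // Same test signal as in the question: cos(i / 2) for i = 0..31
        double[] re = new double[32];
        double[] im = new double[32];
        for (int i = 0; i < re.length; i++) {
            re[i] = Math.cos(i / 2.0);
        }
        double[][] reference = dft(re, im);
        for (int k = 0; k < re.length; k++) {
            System.out.printf("%2d  %10.5f  %10.5f%n", k, reference[0][k], reference[1][k]);
        }
    }
}
```

To use it, run fft(real, imaginary) on copies of the same input with a non-null imaginary array and compare both parts with maxAbsDiff; differences near floating-point rounding level suggest the two agree, while large differences point at the fast version.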

I'd use something authoritative like Wolfram Alpha to verify.
If I evaluate cos(i/2) for 0 <= i < 32, I get this array:
[1,0.878,0.540,0.071,-0.416,-0.801,-0.990,-0.936,-0.654,-0.211,0.284,0.709,0.960,0.977,0.754,0.347,-0.146,-0.602,-0.911,-0.997,-0.839,-0.476,0.004,0.483,0.844,0.998,0.907,0.595,0.137,-0.355,-0.760,-0.978]
If I give that as input to Wolfram Alpha's FFT function I get this result.
The plot that I get looks symmetric, which makes sense. The plot looks nothing like any of the ones you supplied.

Related

Random strings with given length unit testing

I have a program that generates pseudo-random strings (only lowercase letters, uppercase letters and digits are allowed).
/**
*
* @return - returns a random digit (from 0 to 9)
*
*/
int randomDigits() {
return (int) (Math.random() * 10);
}
/**
*
* @return - returns a random lowercase (from "a" to "z")
*/
char randomLowerCase() {
return (char) ('a' + Math.random() * 26);
}
/**
*
* @return - returns a random uppercase (from "A" to "Z")
*/
char randomUpperCase() {
return (char) ('A' + Math.random() * 26);
}
/**
*
* @return - returns a random number between 1 and 3.
*
*/
char randomChoice() {
return (char) ((char) (Math.random() * 3) + 1);
}
/**
*
* @param length
* - the length of the random string.
* @return - returns a combined random string. All elements are built side
* by side.
*
* We use the randomChoice method to get randomly upper, lower and
* digits.
*/
public String stringBuilder(int length) {
StringBuilder result = new StringBuilder();
int len = length;
for (int i = 0; i < len; i++) {
int ch = randomChoice();
if (ch == 1) {
result.append(randomDigits());
}
if (ch == 2) {
result.append(randomLowerCase());
}
if (ch == 3) {
result.append(randomUpperCase());
}
}
return result.toString();
}
How can I write a test for this code? I tried to test the range of the digits (from 0 to 9):
int minRange = 0;
int maxRange = 0;
for (int i = 0; i < 100000; i++) {
int result = item.randomDigits();
if (result == 52) {
minRange++;
} else {
if (result == 19) {
maxRange++;
}
}
}
LOGGER.info("The min range in the digit is 0, and in the test appeared: {}", minRange);
LOGGER.info("The max range in the digit is 9, and in the test appeared: {}", maxRange);
But I can't figure out how to test the lowercase or uppercase methods.

Testing code which uses any randomness is tricky. There are two approaches you can take:
Your test can have sufficient iterations that it has a good chance to show any errors in your logic. For many cases iterating 1000 or 1000000 times and checking consistency of the answers is reasonable. This is your only option if you are also looking to check some required distribution across a range.
This might look something like:
for (int i = 0; i < 1000000; i++)
assertTrue(isValid(new RandomVal()));
If you want to check that all your characters appear at least once:
assertEquals(26 * 2 + 10, IntStream.range(0, 1000000)
.mapToObj(n -> stringBuilder(6))
.flatMapToInt(String::chars)
.boxed()
.collect(Collectors.toSet())
.size());
This uses Java 8 and essentially adds every character (as an integer) to the set and then checks how large it is afterwards.
Use a mocking framework (such as Mockito) to check that the result is the expected one for specific outputs from whatever you are using to generate randomness. This is the best way to test that you get the correct result at boundary conditions (i.e. the generator returning results at each end of its range).
This might look something like:
Random mockRandom = mock(Random.class);
when(mockRandom.nextFloat()).thenReturn(0.0f);
assertTrue(isValid(new RandomVal(mockRandom)));
when(mockRandom.nextFloat()).thenReturn(Math.nextDown(1.0f));
assertTrue(isValid(new RandomVal(mockRandom)));
For completeness it's worth doing both of these.
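Both ideas can also be combined without a mocking framework by passing a seeded java.util.Random into the generator, which makes every run repeatable. The sketch below is an illustration under assumptions, not the code from the question: randomString, isAllowed and distinctChars are hypothetical names, and the generator draws from Random.nextInt instead of Math.random().

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

final class RandomStringCheck {
    /** Hypothetical variant of the generator that receives its source of randomness. */
    static String randomString(Random rng, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            switch (rng.nextInt(3)) {
                case 0:
                    sb.append((char) ('0' + rng.nextInt(10))); // digit
                    break;
                case 1:
                    sb.append((char) ('a' + rng.nextInt(26))); // lowercase letter
                    break;
                default:
                    sb.append((char) ('A' + rng.nextInt(26))); // uppercase letter
                    break;
            }
        }
        return sb.toString();
    }

    /** True only for the characters the generator is allowed to produce. */
    static boolean isAllowed(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    /** Collects the distinct characters of a string. */
    static Set<Character> distinctChars(String s) {
        Set<Character> seen = new HashSet<>();
        for (int i = 0; i < s.length(); i++) {
            seen.add(s.charAt(i));
        }
        return seen;
    }

    public static void main(String[] args) {
        String sample = randomString(new Random(42L), 100000);
        boolean allAllowed = true;
        for (int i = 0; i < sample.length(); i++) {
            allAllowed &= isAllowed(sample.charAt(i));
        }
        System.out.println("all characters allowed: " + allAllowed);
        System.out.println("distinct characters seen: " + distinctChars(sample).size());
    }
}
```

In a unit test, the two printed values would become assertions: every character passes isAllowed, and the number of distinct characters equals 62 (10 digits plus 2 * 26 letters).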

If I understand your problem correctly, the solution is simple:
int minRange = 999999; //improbable big value
int maxRange = -999999; //improbable low value
for (int i = 0; i < 100000; i++) {
int result = item.randomDigits();
minRange = Math.min(result, minRange);
maxRange = Math.max(result, maxRange);
}
Please try it.
If you don't want to use the Math library, you can of course do it without it:
for (int i = 0; i < 100000; i++) {
int result = item.randomDigits();
if (result < minRange) {
minRange = result;
}
if (result > maxRange) {
maxRange = result;
}
}

Compare two strings without Apache StringUtils

Hi, I am working on a voice command project. I want to receive the user's voice input first, then check it for matches, and then do something according to the command. For this, I found a way to match the strings using org.apache.commons.lang3.StringUtils, but I ran into a lot of trouble with it. For example, I have problems importing Apache's external library into Android Studio.
So my question is: is there any other way to compare the user's voice data with my specific command without using Apache's StringUtils methods? Please help if you can.

Take the source right from the library (Obviously follow the requirements of the Apache license)
https://commons.apache.org/proper/commons-lang/apidocs/src-html/org/apache/commons/lang3/StringUtils.html
Line 6865
/**
* <p>Find the Levenshtein distance between two Strings.</p>
*
* <p>This is the number of changes needed to change one String into
* another, where each change is a single character modification (deletion,
* insertion or substitution).</p>
*
* <p>The previous implementation of the Levenshtein distance algorithm
* was from http://www.merriampark.com/ld.htm</p>
*
* <p>Chas Emerick has written an implementation in Java, which avoids an OutOfMemoryError
* which can occur when my Java implementation is used with very large strings.<br>
* This implementation of the Levenshtein distance algorithm
* is from http://www.merriampark.com/ldjava.htm</p>
*
* <pre>
* StringUtils.getLevenshteinDistance(null, *) = IllegalArgumentException
* StringUtils.getLevenshteinDistance(*, null) = IllegalArgumentException
* StringUtils.getLevenshteinDistance("","") = 0
* StringUtils.getLevenshteinDistance("","a") = 1
* StringUtils.getLevenshteinDistance("aaapppp", "") = 7
* StringUtils.getLevenshteinDistance("frog", "fog") = 1
* StringUtils.getLevenshteinDistance("fly", "ant") = 3
* StringUtils.getLevenshteinDistance("elephant", "hippo") = 7
* StringUtils.getLevenshteinDistance("hippo", "elephant") = 7
* StringUtils.getLevenshteinDistance("hippo", "zzzzzzzz") = 8
* StringUtils.getLevenshteinDistance("hello", "hallo") = 1
* </pre>
*
* @param s the first String, must not be null
* @param t the second String, must not be null
* @return result distance
* @throws IllegalArgumentException if either String input {@code null}
* @since 3.0 Changed signature from getLevenshteinDistance(String, String) to
* getLevenshteinDistance(CharSequence, CharSequence)
*/
public static int getLevenshteinDistance(CharSequence s, CharSequence t) {
if (s == null || t == null) {
throw new IllegalArgumentException("Strings must not be null");
}
/*
The difference between this impl. and the previous is that, rather
than creating and retaining a matrix of size s.length() + 1 by t.length() + 1,
we maintain two single-dimensional arrays of length s.length() + 1. The first, d,
is the 'current working' distance array that maintains the newest distance cost
counts as we iterate through the characters of String s. Each time we increment
the index of String t we are comparing, d is copied to p, the second int[]. Doing so
allows us to retain the previous cost counts as required by the algorithm (taking
the minimum of the cost count to the left, up one, and diagonally up and to the left
of the current cost count being calculated). (Note that the arrays aren't really
copied anymore, just switched...this is clearly much better than cloning an array
or doing a System.arraycopy() each time through the outer loop.)
Effectively, the difference between the two implementations is this one does not
cause an out of memory condition when calculating the LD over two very large strings.
*/
int n = s.length(); // length of s
int m = t.length(); // length of t
if (n == 0) {
return m;
} else if (m == 0) {
return n;
}
if (n > m) {
// swap the input strings to consume less memory
final CharSequence tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}
int p[] = new int[n + 1]; //'previous' cost array, horizontally
int d[] = new int[n + 1]; // cost array, horizontally
int _d[]; //placeholder to assist in swapping p and d
// indexes into strings s and t
int i; // iterates through s
int j; // iterates through t
char t_j; // jth character of t
int cost; // cost
for (i = 0; i <= n; i++) {
p[i] = i;
}
for (j = 1; j <= m; j++) {
t_j = t.charAt(j - 1);
d[0] = j;
for (i = 1; i <= n; i++) {
cost = s.charAt(i - 1) == t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
// our last action in the above loop was to switch d and p, so p now
// actually has the most recent cost counts
return p[n];
}

There are many string functions you can use to compare strings, for example
if (result.equals("hello")) {
doSomething();
}
compares two strings.
if (result.startsWith("search for")) {
doSomething();
}
checks the beginning of the result.
if (result.matches("yes|sure")) {
doSomething();
}
checks the result against a regular expression.
You can find all that in a Java textbook. See for example
https://docs.oracle.com/javase/tutorial/java/data/comparestrings.html
If you want to use Levenshtein distance you can insert the following function in your code:
public int LevenshteinDistance (String s0, String s1) {
int len0 = s0.length() + 1;
int len1 = s1.length() + 1;
// the array of distances
int[] cost = new int[len0];
int[] newcost = new int[len0];
// initial cost of skipping prefix in String s0
for (int i = 0; i < len0; i++) cost[i] = i;
// dynamically computing the array of distances
// transformation cost for each letter in s1
for (int j = 1; j < len1; j++) {
// initial cost of skipping prefix in String s1
newcost[0] = j;
// transformation cost for each letter in s0
for(int i = 1; i < len0; i++) {
// matching current letters in both strings
int match = (s0.charAt(i - 1) == s1.charAt(j - 1)) ? 0 : 1;
// computing cost for each transformation
int cost_replace = cost[i - 1] + match;
int cost_insert = cost[i] + 1;
int cost_delete = newcost[i - 1] + 1;
// keep minimum cost
newcost[i] = Math.min(Math.min(cost_insert, cost_delete), cost_replace);
}
// swap cost/newcost arrays
int[] swap = cost; cost = newcost; newcost = swap;
}
// the distance is the cost for transforming all letters in both strings
return cost[len0 - 1];
}
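For the voice-command use case, a distance function like this is typically applied to every known command, keeping the closest one if it is near enough. The following is a minimal sketch under assumptions: the command list, the maxDistance cut-off and the names closest and distance are illustrative, and the comparison lowercases both strings, which the question does not ask for.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ClosestCommand {
    /** Plain two-row Levenshtein distance (insert, delete and substitute each cost 1). */
    static int distance(String a, String b) {
        int[] prev = new int[a.length() + 1];
        int[] curr = new int[a.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            prev[i] = i;
        }
        for (int j = 1; j <= b.length(); j++) {
            curr[0] = j;
            for (int i = 1; i <= a.length(); i++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[i] = Math.min(Math.min(curr[i - 1] + 1, prev[i] + 1), prev[i - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[a.length()];
    }

    /** Returns the command closest to the spoken text, or null if none is within maxDistance. */
    static String closest(String spoken, List<String> commands, int maxDistance) {
        String input = spoken.toLowerCase(Locale.ROOT).trim();
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String command : commands) {
            int d = distance(input, command.toLowerCase(Locale.ROOT));
            if (d < bestDistance) {
                bestDistance = d;
                best = command;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> commands = Arrays.asList("open door", "close door", "lights on");
        System.out.println(closest("Opn door", commands, 2)); // prints "open door"
    }
}
```

A small cut-off keeps unrelated phrases from triggering a command; the right value depends on how noisy the speech recognizer's output is.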

Finding a "correct" value that is unknown prior to a calculation

The problem is as follows: I have a number, which can be large or small, and I need to tweak this number and put it through a calculation. The result of the calculation has to match a certain value to at least the fifth decimal place.
So I need to write a method that takes this starting value and increases or decreases it, depending on the current result, until I get the correct result. I have made some attempts with no success.
Here is an example that won't work at all, but it hints at what I mean (this is just a small-scale test case):
public class Test {
public static void main(String[]args)
{
double ran = 100 + (int)(Math.random() * 100000.999999999);
int count = 0;
double tmpPay = 3666.545;
double top = tmpPay;
double low = 0;
while ( tmpPay != ran )
{
if ( tmpPay > ran)
{
if( low == 0)
{
tmpPay = top / 2;
top = tmpPay;
}
else
{
tmpPay = tmpPay + ((top - low) / 2);
top = tmpPay;
}
}
if (tmpPay < ran)
{
tmpPay = top * 1.5;
low = top;
top = tmpPay;
}
}
System.out.println(" VAlue of RAN: " +ran + "----VALUE OF tmpPay: " + tmpPay + "---------- COUNTER: " + count);
}
Example 2, maybe a clearer description. This is my current solution:
guessingValue = firstImput;
while (amortization > tmpPV)
{
guessingValue -= (decimal)1;
//guessingVlue -- > blackbox
amortization = blackboxResults;
}
while (amortization < tmpPV)
{
guessingValue += (decimal)0.00001;
//guessingVlue -- > blackbox
amortization = blackboxResults;
}
}

As I already mentioned in the comment above, you should not compare doubles using the built-in equality operators. This is the main reason why your code is not working. The second problem is that the else clause uses tmpPay = tmpPay + ((top - low) / 2); where it should be tmpPay = tmpPay - ((top - low) / 2);
The complete fixed code is below:
public class Test {
private static final double EPSILON = 0.00001;
public static boolean isEqual( double a, double b){
return (Math.abs(a - b) < EPSILON);
}
public static void main(String[]args)
{
double ran = 100 + (int)(Math.random() * 100000.999999999);
int count = 0;
double tmpPay = 3666.545;
double top = tmpPay;
double low = 0;
while ( !isEqual(tmpPay, ran))
{
if ( tmpPay > ran)
{
if( isEqual(low, 0.0))
{
tmpPay = top / 2;
top = tmpPay;
}
else
{
tmpPay = tmpPay - ((top - low) / 2);
top = tmpPay;
}
}
if (tmpPay < ran)
{
tmpPay = top * 1.5;
low = top;
top = tmpPay;
}
System.out.println("RAN:"+ran+" tmpPay:"+tmpPay+" top:"+top+" low:"+low+" counter:"+count);
count++;
}
System.out.println(" VAlue of RAN: " +ran + "----VALUE OF tmpPay: " + tmpPay + "---------- COUNTER: " + count);
}
}

One way would be to define your problem as a local optimization task and use a local optimizer (for example Brent's method or the Nelder-Mead simplex from Apache Commons Math).
Your goal function here would be the distance between the desired value and what you get from your black box.

If I understand correctly, you have a function g(x) and a value K, you want to find x0 such that g(x0) = K.
This is equivalent to finding the roots of the function f(x) = g(x) - K, because f(x0) = g(x0) - K = K - K = 0.
A simple algorithm would be Newton's method.
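Newton's method needs the derivative of f, which a black-box calculation usually does not provide. As a derivative-free alternative, the sketch below uses bisection instead (a different technique from the one named above). It assumes g is continuous and that the caller can supply two starting values lo and hi whose results lie on opposite sides of K; the names solve, g and tol are illustrative.

```java
import java.util.function.DoubleUnaryOperator;

final class BisectionSolver {
    /**
     * Searches between lo and hi for an x with |g(x) - target| <= tol
     * by repeatedly halving the bracket.
     */
    static double solve(DoubleUnaryOperator g, double target, double lo, double hi, double tol) {
        double fLo = g.applyAsDouble(lo) - target;
        double fHi = g.applyAsDouble(hi) - target;
        if (Math.abs(fLo) <= tol) return lo;
        if (Math.abs(fHi) <= tol) return hi;
        if ((fLo < 0) == (fHi < 0)) {
            throw new IllegalArgumentException("g(lo) and g(hi) must lie on opposite sides of the target");
        }
        double mid = lo + (hi - lo) / 2.0;
        // Stop when the midpoint coincides with an end point: the bracket cannot shrink further.
        while (mid != lo && mid != hi) {
            double fMid = g.applyAsDouble(mid) - target;
            if (Math.abs(fMid) <= tol) {
                return mid;
            }
            // Keep the half whose end points still straddle the target.
            if ((fMid < 0) == (fLo < 0)) {
                lo = mid; // g(lo) keeps its original sign
            } else {
                hi = mid;
            }
            mid = lo + (hi - lo) / 2.0;
        }
        return mid;
    }

    public static void main(String[] args) {
        // Stand-in for the black box: find x with x * x == 2
        double x = solve(v -> v * v, 2.0, 0.0, 2.0, 1e-5);
        System.out.println("x = " + x + ", g(x) = " + (x * x));
    }
}
```

The tolerance here is applied to the result of the calculation rather than to x, which matches the requirement that the result agree with the target to the fifth decimal place.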

If you run the program, it will easily end up in an infinite loop, since the while condition (an exact comparison of double values) will hardly ever be satisfied.
For example, take these two values:
double value1 = 3666.545;
double value2 = 3666.54500001;
value1 == value2 is false.
Even values this close are not equal.
You should define a tolerance for the deviation instead.
For example, if |value1 - value2| < 0.005, exit the while loop and print the random number information.

How to write a program that reads a sequence of integers into an array and that computes the alternating sum of all elements in the array?

Write a program that reads a sequence of integers into an array and that computes the alternating sum of all elements in the array. For example, if the program is executed with the input data
1 4 9 16 9 7 4 9 11
then it computes
1 - 4 + 9 - 16 + 9 - 7 + 4 - 9 + 11 = - 2
I have below code so far:
import java.util.Arrays;
/**
This class computes the alternating sum
of a set of data values.
*/
public class DataSet
{
private double[] data;
private int dataSize;
/**
Constructs an empty data set.
*/
public DataSet()
{
final int DATA_LENGTH = 100;
data = new double[DATA_LENGTH];
dataSize = 0;
}
/**
Adds a data value to the data set.
@param x a data value
*/
public void add(double x)
{
if (dataSize == data.length)
data = Arrays.copyOf(data, 2 * data.length);
data[dataSize] = x;
dataSize++;
}
/**
Gets the alternating sum of the added data.
@return sum the sum of the alternating data or 0 if no data has been added
*/
public double alternatingSum()
{
. . .
}
}
I have to use the following class as the tester class:
/**
This program calculates an alternating sum.
*/
public class AlternatingSumTester
{
public static void main(String[] args)
{
DataSet data = new DataSet();
data.add(1);
data.add(4);
data.add(9);
data.add(16);
data.add(9);
data.add(7);
data.add(4);
data.add(9);
data.add(11);
double sum = data.alternatingSum();
System.out.println("Alternating Sum: " + sum);
System.out.println("Expected: -2.0");
}
}

I implemented the method alternatingSum for you:
public double alternatingSum() {
double alternatingSum = 0;
if(data != null && dataSize > 0) {
for(int i = 0; i < dataSize; i = i + 2) {
alternatingSum += data[i];
}
for(int i = 1; i < dataSize; i = i + 2) {
alternatingSum -= data[i];
}
}
return alternatingSum;
}

I would use this simple logic to achieve the goal. First add up all the elements in odd positions of the array (1st, 3rd, 5th, ...). Then add up all the elements in even positions. Now subtract the second sum from the first and you will get your answer. Hope this helps.

I would solve this using a for loop and a boolean flag:
set flag to true
set sum to zero
for all elements in array
    if flag is set
        add to sum
    else
        subtract from sum
    toggle flag
When the loop is done, you have your sum.
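In Java, the flag approach might look like the sketch below. The flag starts out true so that the first element is added, and it is flipped after every element. The class name and the (data, size) signature are illustrative; in the DataSet class from the question the same loop would run over the data and dataSize fields.

```java
final class AlternatingSumSketch {
    /** Computes data[0] - data[1] + data[2] - ... over the first size elements. */
    static double alternatingSum(double[] data, int size) {
        double sum = 0.0;
        boolean add = true; // true: add the next element, false: subtract it
        for (int i = 0; i < size; i++) {
            if (add) {
                sum += data[i];
            } else {
                sum -= data[i];
            }
            add = !add; // alternate the sign for the next element
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] data = {1, 4, 9, 16, 9, 7, 4, 9, 11};
        System.out.println(alternatingSum(data, data.length)); // prints -2.0
    }
}
```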

int[] a = {50, 60, 60, 45, 70};
int sum = IntStream.range(0, a.length).filter(i -> i % 2 == 0).map(i -> a[i]).sum()
- IntStream.range(0, a.length).filter(i -> i % 2 == 1).map(i -> a[i]).sum();
System.out.println("Sum= " + sum);
I did this using streams: first I used an IntStream over the indexes, filtered it down to the even indexes, mapped those to their values and summed them. Then I did the same for the odd indexes and subtracted the second sum from the first.

If you have 4 numbers, for example a[] = {1, 3, 5, 6}, there are several possible sign sequences:
operation:
+ + +
+ + -
+ - +
+ - -
- + +
- + -
- - +
- - -
In your case, "operation" will only be + - +.
Use an array with these symbols and calculate your result:
int k=a[0];
for(int i = 1; i<= 3; i++){
if(operation[i-1]=="+".charAt(0))
{k=k+a[i];}
etc...
}
It's not hard :)
good luck.

Modifying Levenshtein Distance algorithm to not calculate all distances

I'm working on a fuzzy search implementation and as part of the implementation, we're using Apache's StringUtils.getLevenshteinDistance. At the moment, we're going for a specific maximum average response time for our fuzzy search. After various enhancements and with some profiling, the place where the most time is spent is calculating the Levenshtein distance. It takes up roughly 80-90% of the total time on search strings of three letters or more.
Now, I know there are some limitations to what can be done here, but I've read in previous SO questions and on the Wikipedia page for LD that if one is willing to limit the threshold to a set maximum distance, that could help curb the time spent on the algorithm. I'm not sure how to do this exactly, though.
If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.[3]
Below you will see the original LD code from StringUtils. After that is my modification. I'm basically trying to calculate the distances within a set length of the i,j diagonal (so, in my example, two diagonals above and below the i,j diagonal). However, this can't be correct as I've done it. For example, on the highest diagonal, it's always going to choose the cell value directly above, which will be 0. If anyone could show me how to make this functional as I've described, or give some general advice on how to do so, it would be greatly appreciated.
public static int getLevenshteinDistance(String s, String t) {
if (s == null || t == null) {
throw new IllegalArgumentException("Strings must not be null");
}
int n = s.length(); // length of s
int m = t.length(); // length of t
if (n == 0) {
return m;
} else if (m == 0) {
return n;
}
if (n > m) {
// swap the input strings to consume less memory
String tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}
int p[] = new int[n+1]; //'previous' cost array, horizontally
int d[] = new int[n+1]; // cost array, horizontally
int _d[]; //placeholder to assist in swapping p and d
// indexes into strings s and t
int i; // iterates through s
int j; // iterates through t
char t_j; // jth character of t
int cost; // cost
for (i = 0; i<=n; i++) {
p[i] = i;
}
for (j = 1; j<=m; j++) {
t_j = t.charAt(j-1);
d[0] = j;
for (i=1; i<=n; i++) {
cost = s.charAt(i-1)==t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
// our last action in the above loop was to switch d and p, so p now
// actually has the most recent cost counts
return p[n];
}
My modifications (only to the for loops):
for (j = 1; j<=m; j++) {
t_j = t.charAt(j-1);
d[0] = j;
int k = Math.max(j-2, 1);
for (i = k; i <= Math.min(j+2, n); i++) {
cost = s.charAt(i-1)==t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}

The issue with implementing the window is dealing with the value to the left of the first entry and above the last entry in each row.
One way is to start the values you initially fill in at 1 instead of 0, then just ignore any 0s that you encounter. You'll have to subtract 1 from your final answer.
Another way is to fill the entries left of first and above last with high values so the minimum check will never pick them. That's the way I chose when I had to implement it the other day:
public static int levenshtein(String s, String t, int threshold) {
int slen = s.length();
int tlen = t.length();
// swap so the smaller string is t; this reduces the memory usage
// of our buffers
if(tlen > slen) {
String stmp = s;
s = t;
t = stmp;
int itmp = slen;
slen = tlen;
tlen = itmp;
}
// p is the previous and d is the current distance array; dtmp is used in swaps
int[] p = new int[tlen + 1];
int[] d = new int[tlen + 1];
int[] dtmp;
// the values necessary for our threshold are written; the ones after
// must be filled with large integers since the tailing member of the threshold
// window in the bottom array will run min across them
int n = 0;
for(; n < Math.min(p.length, threshold + 1); ++n)
p[n] = n;
Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
Arrays.fill(d, Integer.MAX_VALUE);
// this is the core of the Levenshtein edit distance algorithm
// instead of actually building the matrix, two arrays are swapped back and forth
// the threshold limits the amount of entries that need to be computed if we're
// looking for a match within a set distance
for(int row = 1; row < s.length()+1; ++row) {
char schar = s.charAt(row-1);
d[0] = row;
// set up our threshold window
int min = Math.max(1, row - threshold);
int max = Math.min(d.length, row + threshold + 1);
// since we're reusing arrays, we need to be sure to wipe the value left of the
// starting index; we don't have to worry about the value above the ending index
// as the arrays were initially filled with large integers and we progress to the right
if(min > 1)
d[min-1] = Integer.MAX_VALUE;
for(int col = min; col < max; ++col) {
if(schar == t.charAt(col-1))
d[col] = p[col-1];
else
// min of: diagonal, left, up
d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
}
// swap our arrays
dtmp = p;
p = d;
d = dtmp;
}
if(p[tlen] == Integer.MAX_VALUE)
return -1;
return p[tlen];
}

I've written before about Levenshtein automata, which are one way to do this sort of check in O(n) time. The source code samples are in Python, but the explanations should be helpful, and the referenced papers provide more details.

According to "Gusfield, Dan (1997). Algorithms on strings, trees, and sequences: computer science and computational biology" (page 264) you should ignore zeros.

Here someone answers a very similar question:
Cite:
I've done it a number of times. The way I do it is with a recursive depth-first tree-walk of the game tree of possible changes. There is a budget k of changes, that I use to prune the tree. With that routine in hand, first I run it with k=0, then k=1, then k=2 until I either get a hit or I don't want to go any higher.
char* a = /* string 1 */;
char* b = /* string 2 */;
int na = strlen(a);
int nb = strlen(b);
bool walk(int ia, int ib, int k){
/* if the budget is exhausted, prune the search */
if (k < 0) return false;
/* if at end of both strings we have a match */
if (ia == na && ib == nb) return true;
/* if the first characters match, continue walking with no reduction in budget */
if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
/* if the first characters don't match, assume there is a 1-character replacement */
if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
/* try assuming there is an extra character in a */
if (ia < na && walk(ia+1, ib, k-1)) return true;
/* try assuming there is an extra character in b */
if (ib < nb && walk(ia, ib+1, k-1)) return true;
/* if none of those worked, I give up */
return false;
}
just the main part, more code in the original

I used the original code and placed this just before the end of the j for loop:
if (p[n] > s.length() + 5)
break;
The +5 is arbitrary, but for our purposes, if the distance is the query length plus five (or whatever number we settle upon), it doesn't really matter what is returned because we consider the match as simply being too different. It does cut down on things a bit. Still, I'm pretty sure this isn't the idea that the Wiki statement was talking about, if anyone understands that better.
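A related cut-off that is easier to reason about is to stop as soon as every entry of the current row exceeds the limit: the smallest value in a row can never decrease from one row to the next, so the final distance cannot come back under the limit. The sketch below illustrates that early exit only; it is not the diagonal-stripe algorithm from the Wikipedia statement, and the class name, the limit parameter and the -1 return convention are assumptions made for the example.

```java
final class BoundedLevenshtein {
    /**
     * Returns the Levenshtein distance between s and t,
     * or -1 as soon as it is certain to be greater than limit.
     */
    static int distance(CharSequence s, CharSequence t, int limit) {
        int n = s.length();
        int m = t.length();
        // The distance is at least the difference in lengths.
        if (Math.abs(n - m) > limit) {
            return -1;
        }
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];
        for (int i = 0; i <= n; i++) {
            prev[i] = i;
        }
        for (int j = 1; j <= m; j++) {
            curr[0] = j;
            int rowMin = curr[0];
            for (int i = 1; i <= n; i++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                // minimum of cell to the left + 1, above + 1, diagonal + cost
                curr[i] = Math.min(Math.min(curr[i - 1] + 1, prev[i] + 1), prev[i - 1] + cost);
                rowMin = Math.min(rowMin, curr[i]);
            }
            // Row minima never decrease, so the result can no longer be <= limit.
            if (rowMin > limit) {
                return -1;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[n] <= limit ? prev[n] : -1;
    }

    public static void main(String[] args) {
        System.out.println(distance("elephant", "hippo", 7)); // prints 7
        System.out.println(distance("elephant", "hippo", 6)); // prints -1
    }
}
```

This still fills complete rows, so it only saves time on pairs that are clearly too different; the stripe approach additionally limits the work done per row.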

Apache Commons Lang 3.4 has this implementation:
/**
* <p>Find the Levenshtein distance between two Strings if it's less than or equal to a given
* threshold.</p>
*
* <p>This is the number of changes needed to change one String into
* another, where each change is a single character modification (deletion,
* insertion or substitution).</p>
*
* <p>This implementation follows from Algorithms on Strings, Trees and Sequences by Dan Gusfield
* and Chas Emerick's implementation of the Levenshtein distance algorithm from
* http://www.merriampark.com/ld.htm</p>
*
* <pre>
* StringUtils.getLevenshteinDistance(null, *, *) = IllegalArgumentException
* StringUtils.getLevenshteinDistance(*, null, *) = IllegalArgumentException
* StringUtils.getLevenshteinDistance(*, *, -1) = IllegalArgumentException
* StringUtils.getLevenshteinDistance("","", 0) = 0
* StringUtils.getLevenshteinDistance("aaapppp", "", 8) = 7
* StringUtils.getLevenshteinDistance("aaapppp", "", 7) = 7
* StringUtils.getLevenshteinDistance("aaapppp", "", 6) = -1
* StringUtils.getLevenshteinDistance("elephant", "hippo", 7) = 7
* StringUtils.getLevenshteinDistance("elephant", "hippo", 6) = -1
* StringUtils.getLevenshteinDistance("hippo", "elephant", 7) = 7
* StringUtils.getLevenshteinDistance("hippo", "elephant", 6) = -1
* </pre>
*
* @param s the first String, must not be null
* @param t the second String, must not be null
* @param threshold the target threshold, must not be negative
* @return result distance, or {@code -1} if the distance would be greater than the threshold
* @throws IllegalArgumentException if either String input {@code null} or negative threshold
*/
public static int getLevenshteinDistance(CharSequence s, CharSequence t, final int threshold) {
if (s == null || t == null) {
throw new IllegalArgumentException("Strings must not be null");
}
if (threshold < 0) {
throw new IllegalArgumentException("Threshold must not be negative");
}
/*
This implementation only computes the distance if it's less than or equal to the
threshold value, returning -1 if it's greater. The advantage is performance: unbounded
distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only
computing a diagonal stripe of width 2k + 1 of the cost table.
It is also possible to use this to compute the unbounded Levenshtein distance by starting
the threshold at 1 and doubling each time until the distance is found; this is O(dm), where
d is the distance.
One subtlety comes from needing to ignore entries on the border of our stripe
eg.
p[] = |#|#|#|*
d[] = *|#|#|#|
We must ignore the entry to the left of the leftmost member
We must ignore the entry above the rightmost member
Another subtlety comes from our stripe running off the matrix if the strings aren't
of the same size. Since string s is always swapped to be the shorter of the two,
the stripe will always run off to the upper right instead of the lower left of the matrix.
As a concrete example, suppose s is of length 5, t is of length 7, and our threshold is 1.
In this case we're going to walk a stripe of length 3. The matrix would look like so:
1 2 3 4 5
1 |#|#| | | |
2 |#|#|#| | |
3 | |#|#|#| |
4 | | |#|#|#|
5 | | | |#|#|
6 | | | | |#|
7 | | | | | |
Note how the stripe leads off the table as there is no possible way to turn a string of length 5
into one of length 7 in edit distance of 1.
Additionally, this implementation decreases memory usage by using two
single-dimensional arrays and swapping them back and forth instead of allocating
an entire n by m matrix. This requires a few minor changes, such as immediately returning
when it's detected that the stripe has run off the matrix and initially filling the arrays with
large values so that entries we don't compute are ignored.
See Algorithms on Strings, Trees and Sequences by Dan Gusfield for some discussion.
*/
int n = s.length(); // length of s
int m = t.length(); // length of t
// if one string is empty, the edit distance is necessarily the length of the other
if (n == 0) {
return m <= threshold ? m : -1;
} else if (m == 0) {
return n <= threshold ? n : -1;
}
if (n > m) {
// swap the two strings to consume less memory
final CharSequence tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}
int p[] = new int[n + 1]; // 'previous' cost array, horizontally
int d[] = new int[n + 1]; // cost array, horizontally
int _d[]; // placeholder to assist in swapping p and d
// fill in starting table values
final int boundary = Math.min(n, threshold) + 1;
for (int i = 0; i < boundary; i++) {
p[i] = i;
}
// these fills ensure that the value above the rightmost entry of our
// stripe will be ignored in following loop iterations
Arrays.fill(p, boundary, p.length, Integer.MAX_VALUE);
Arrays.fill(d, Integer.MAX_VALUE);
// iterates through t
for (int j = 1; j <= m; j++) {
final char t_j = t.charAt(j - 1); // jth character of t
d[0] = j;
// compute stripe indices, constrain to array size
final int min = Math.max(1, j - threshold);
final int max = (j > Integer.MAX_VALUE - threshold) ? n : Math.min(n, j + threshold);
// the stripe may lead off of the table if s and t are of different sizes
if (min > max) {
return -1;
}
// ignore entry left of leftmost
if (min > 1) {
d[min - 1] = Integer.MAX_VALUE;
}
// iterates through [min, max] in s
for (int i = min; i <= max; i++) {
if (s.charAt(i - 1) == t_j) {
// diagonally left and up
d[i] = p[i - 1];
} else {
// 1 + minimum of cell to the left, to the top, diagonally left and up
d[i] = 1 + Math.min(Math.min(d[i - 1], p[i]), p[i - 1]);
}
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
// if p[n] is greater than the threshold, there's no guarantee on it being the correct
// distance
if (p[n] <= threshold) {
return p[n];
}
return -1;
}
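The comment inside that method mentions getting the unbounded distance by starting the threshold at 1 and doubling it until a result comes back. Here is a small sketch of that driver loop. The BoundedDistance interface, the class name, and the simpleBounded helper are all invented for this example; simpleBounded is only a plain two-row stand-in so the snippet runs by itself, and it does not have the striped method's O(km) behaviour.

```java
/** Sketch: recover the unbounded distance by doubling a threshold (names are hypothetical). */
final class DoublingThreshold {

    /** Any routine that returns the distance, or -1 if it exceeds the threshold. */
    interface BoundedDistance {
        int apply(CharSequence s, CharSequence t, int threshold);
    }

    static int unboundedDistance(CharSequence s, CharSequence t, BoundedDistance bounded) {
        // The distance can never exceed the length of the longer string.
        int upper = Math.max(s.length(), t.length());
        int threshold = 1;
        while (true) {
            int capped = Math.min(threshold, upper);
            int d = bounded.apply(s, t, capped);
            if (d >= 0) {
                return d;
            }
            if (capped >= upper) {
                throw new IllegalStateException("bounded routine failed at the maximum threshold");
            }
            // Double the threshold, guarding against int overflow.
            threshold = (threshold > upper / 2) ? upper : threshold * 2;
        }
    }

    /** Stand-in bounded routine: computes the full distance, then applies the threshold. */
    static int simpleBounded(CharSequence s, CharSequence t, int threshold) {
        int n = s.length();
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];
        for (int i = 0; i <= n; i++) {
            prev[i] = i;
        }
        for (int j = 1; j <= t.length(); j++) {
            curr[0] = j;
            for (int i = 1; i <= n; i++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[i] = Math.min(Math.min(curr[i - 1], prev[i]) + 1, prev[i - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[n] <= threshold ? prev[n] : -1;
    }
}
```

With Commons Lang on the classpath, the three-argument StringUtils.getLevenshteinDistance shown above should fit the same interface in place of simpleBounded, since it takes two CharSequences and an int and returns -1 when the threshold is exceeded.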
