So I have a function that looks at a grid (2D array) and finds all the paths from the starting point to the end point. So far the algorithm works as intended and I get the values that I'm looking for.
The problem is that it takes forever. It can handle a 100 x 100 grid no problem, but once I get to a 10,000 x 10,000 grid it takes about 10 minutes to give back an answer, when I'm looking for maybe 1 minute at most.
Here's what it looks like right now:
public void BFS(Point s, Point e){
    /**
     * North, South, East, West coordinates
     */
    int[] x = {0,0,1,-1};
    int[] y = {1,-1,0,0};
    LinkedList<Point> queue = new LinkedList<>();
    queue.add(s);
    /**
     * 2D int[][] grid that stores the distances of each point on the grid
     * from the start
     */
    int[][] dist = new int[numRow][numCol];
    for(int[] a : dist){
        Arrays.fill(a,-1);
    }
    /**
     * "obstacles" is an array of Points that contain the (x, y) coordinates of obstacles
     * on the grid, designated as a -2, which the BFS algorithm will avoid.
     */
    for(Point ob : obstacles){
        dist[ob.x][ob.y] = -2;
    }
    // Start point
    dist[s.x][s.y] = 0;
    /**
     * Loops over dist[][] from the starting point, changing each [x][y] coordinate to the
     * int value that is the distance from s.
     */
    while(!queue.isEmpty()){
        Point p = queue.removeFirst();
        for(int i = 0; i < 4; i++){
            int a = p.x + x[i];
            int b = p.y + y[i];
            if(a >= 0 && b >= 0 && a < numRow && b < numCol && dist[a][b] == -1){
                dist[a][b] = 1 + dist[p.x][p.y];
                Point tempPoint = new Point(a, b);
                if(!queue.contains(tempPoint)){
                    queue.add(tempPoint);
                }
            }
        }
    }
    /**
     * Works backwards to find all shortest-path points between s and e, and adds each
     * point to an array called "validPaths"
     */
    queue.add(e);
    while(!queue.isEmpty()){
        Point p = queue.removeFirst();
        // Checks grid spaces (above, below, left, and right) around Point p
        for(int i = 0; i < 4; i++){
            int curX = p.x + x[i];
            int curY = p.y + y[i];
            // Index Out of Bounds check
            if(curX >= 0 && curY >= 0 && !(curX == s.x && curY == s.y) && curX < numRow && curY < numCol){
                if(dist[curX][curY] < dist[p.x][p.y] && dist[curX][curY] != -2){ // -2 is an obstacle
                    Point tempPoint = new Point(curX, curY);
                    if(!validPaths.contains(tempPoint)){
                        validPaths.add(tempPoint);
                    }
                    if(!queue.contains(tempPoint)){
                        queue.add(tempPoint);
                    }
                }
            }
        }
    }
}
So again, while it works, it's really slow. I'm aiming for O(n + m), but I believe it's currently running in O(n^2).
Does anyone have ideas for making this faster?
A clear reason for the observed inefficiency is the pair of comparisons !validPaths.contains(tempPoint) and !queue.contains(tempPoint), which are both O(n). For these membership tests you should be striving for O(1), which can be accomplished with a dedicated data structure such as a hash set or simply a bitset.
As it stands now, your implementation is clearly O(n^2) because of these comparisons.
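As a minimal sketch of the first part (not your full method: numRow, numCol and the -1/-2 initialization of dist are assumed to match your code), note that dist[][] itself already records whether a point has been seen, so the queue.contains check can be dropped entirely:

import java.awt.Point;
import java.util.ArrayDeque;
import java.util.Deque;

// dist must arrive filled with -1 (unvisited) and -2 (obstacle), as in your method
static void bfsDistances(int numRow, int numCol, Point s, int[][] dist) {
    int[] dx = {0, 0, 1, -1};
    int[] dy = {1, -1, 0, 0};
    Deque<Point> queue = new ArrayDeque<>();
    dist[s.x][s.y] = 0;
    queue.add(s);
    while (!queue.isEmpty()) {
        Point p = queue.removeFirst();
        for (int i = 0; i < 4; i++) {
            int a = p.x + dx[i], b = p.y + dy[i];
            if (a >= 0 && b >= 0 && a < numRow && b < numCol && dist[a][b] == -1) {
                dist[a][b] = dist[p.x][p.y] + 1; // marked as visited here...
                queue.add(new Point(a, b));      // ...so it can never be enqueued twice
            }
        }
    }
}

For the backward pass, declaring validPaths as a HashSet<Point> (java.awt.Point implements equals and hashCode, so it hashes correctly) turns that contains check into an O(1) operation as well.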
I want to find all distinct triplets (a, b, c) in an array such that a + b + c = 0.
I implemented the algorithm in java but I am getting TLE when the input is large (for example 100,000 zeroes, etc).
For 100,000 zeroes, it should output (0, 0, 0) only.
Can someone give some idea about how to speed this up?
Below is the function which I have written. It takes an array as input and returns all unique triplets having the desired property as a list.
public List<List<Integer>> threeSum(int[] nums) {
    Arrays.sort(nums);
    List<List<Integer>> ll = new ArrayList<List<Integer>>();
    for (int i = 0; i < nums.length - 1; i++) {
        int x = nums[i];
        int start = i + 1;
        int end = nums.length - 1;
        int sum = -x;
        while (start < end) {
            int y = nums[start] + nums[end];
            if (y == sum) {
                List<Integer> list = new ArrayList<Integer>();
                list.add(nums[start]);
                list.add(nums[end]);
                list.add(x);
                Collections.sort(list);
                ll.add(list);
            }
            if (y < sum)
                start++;
            else
                end--;
        }
    }
    return ll.stream()
             .distinct()
             .collect(Collectors.toList());
}
I think there is nothing you can do about the time complexity. Two indices must explore the array independently (except for their starting/ending points), while the third can be constrained, as in your algorithm, which means the complexity is O(n²). This dominates both the preliminary sorting of the array, which is O(n·log(n)), and a "demultiplication" step, which is O(n).
I wrote "demultiplication" because a "deduplication" is not desirable: suppose the array is [-1,-1,0,2]. Deduplicating it would eliminate the only solution. But a solution can't contain an integer more than twice, unless it's 0, in which case [0,0,0] is a solution. All integers appearing more than twice, or more than thrice in the case of 0, are redundant and can be eliminated in one pass after sorting and before the main algorithm.
As for the constant factor, it could be improved by limiting the exploration to what makes sense. I would modify your algorithm by making the pair of indices that you currently move inwards until they meet instead start outwards from where they meet, until the lower one hits the major index or the upper one hits the end of the array. The starting point of the scan (actually a starting pair of adjacent indices) can be remembered across scans, adjusted downwards as the major index moves upwards; if it lies outside the current range, the scan can be skipped entirely. Finding the initial starting point is an additional step which, after sorting, could be O(log(n)), but a very simple O(n) version would do just as well.
I have no time now to translate all the above into Java code, sorry. All I can do now is jot down the “demultiplication” code (untested) that goes right after the sorting of the array:
int len = 1;
int last = nums[0];
int count = 1;
for (int i = 1; i < nums.length; i++) {
    int x = nums[i];
    if (x != last) {
        nums[len++] = x;
        last = x;
        count = 1;
    } else if (count < 2 || x == 0 && count < 3) {
        nums[len++] = x;
        count++;
    }
}
// use len instead of nums.length from this point on
The big time component I see is that, for the 100,000 zeroes example, you will be hitting the if (y == sum) block for every single possible case. This appears to be the worst case for performance, since you never get to skip that block.
The largest improvement I can see is to first de-duplicate your input. Unfortunately sets won't work, as we still need to maintain up to three of the same entry. My recommendation is therefore, after your sort, to loop through the input array and, whenever you encounter more than three copies of a number in a row, remove the extras: they are not needed for the problem and just waste time. A sketch of that pass follows.
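A minimal sketch of that truncation pass, assuming nums has already been sorted (keeping three copies of every value is always safe; the previous answer shows a tighter two-copy variant):

int len = 0;
int run = 0;
for (int i = 0; i < nums.length; i++) {
    // length of the current run of equal values
    run = (i > 0 && nums[i] == nums[i - 1]) ? run + 1 : 1;
    // keep at most three copies in a row, drop the extras
    if (run <= 3) {
        nums[len++] = nums[i];
    }
}
// treat len as the effective length of nums from here on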
You could create a List (for example an ArrayList) to store the combinations you have already found. Always store a new value in the format
a,b,c
where a <= b <= c,
so whenever you get a combination which may or may not have been found already, generate a String in the same format and check whether it is present in your List. If it is, don't add it; otherwise add it to your List. Afterwards you can convert the found values back into numeric triplets. If you want to speed this up, you could create a class like:
class XYZ {
    public int x;
    public int y;
    public int z;
    public XYZ(int x, int y, int z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }
    public boolean isMatch(int x, int y, int z) {
        return (this.x == x) &&
               (this.y == y) &&
               (this.z == z);
    }
    public static boolean anyMatch(List<XYZ> list, int x, int y, int z) {
        for (XYZ xyz : list) {
            if (xyz.isMatch(x, y, z)) return true;
        }
        return false;
    }
    public static void addIfNotExists(List<XYZ> list, int x, int y, int z) {
        if (!anyMatch(list, x, y, z)) list.add(new XYZ(x, y, z));
    }
}
and you could use this class for your purpose; just make sure that x <= y <= z, as in the example below.
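A hypothetical usage, sorting a candidate triple before storing it (the names found, a, b and c are placeholders for illustration):

List<XYZ> found = new ArrayList<>();
int[] t = {a, b, c};
Arrays.sort(t);                               // enforce x <= y <= z
XYZ.addIfNotExists(found, t[0], t[1], t[2]);  // ignored if already present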
The filtering of non-unique triplets at the end can be eliminated by using a hash table that stores the triplets in sorted order, so all orderings of a triplet get stored exactly once.
Use a hashmap/hashset instead of an arraylist.
HashSet<List<Integer>> ll = new HashSet<List<Integer>>();
. . .
List<Integer> list = new ArrayList<>(Arrays.asList(a, b, c));
Collections.sort(list);
ll.add(list);
In addition to this, you could also use another lookup table to ensure each repeating item in nums[] is used to calculate triplets only once.
Set<Integer> lookupTable = new HashSet<>();
for (int i = 0; i < nums.length - 1; i++) {
    // we have already found triplets starting from nums[i]
    // e.g. for [-1,-1,0,1] we don't need to calculate
    // the same triplets for the second '-1'
    if (lookupTable.contains(nums[i]))
        continue;
    // mark nums[i] as 'solved'
    lookupTable.add(nums[i]);
    // usual processing here
    int x = nums[i];
Or, since your nums[] array is already sorted, you could just skip repeating items, doing away with the need for another lookup table.
int i = 0;
while (i < nums.length - 1) {
    // we have already found triplets starting from nums[i]
    // e.g. for [-1,-1,0,1] we don't need to calculate
    // the same triplets for the second '-1'
    int x = nums[i];
    // usual processing here
    . . .
    // skip repeating items before moving on
    while (i < nums.length - 1 && nums[i] == x) i++;
}
and then you could just return the hash set as a list at the end, for instance:
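Assuming the HashSet<List<Integer>> ll from the snippet above, the conversion is a one-liner:

return new ArrayList<>(ll);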
The original question has a list of unsorted lists of 2 integers. To simplify the problem, let's just say the input is 2 sorted arrays of integers and an integer target. A value pair can repeat if there is more than 1 solution pair.
For example: [7,8,14], [5,10,14], target: 20
The solution is [14, 5], as 14 from the first array and 5 from the second array sum to 19, which is closest to 20.
My solution was to loop through both arrays from beginning to end, compare against a tracked minimum difference, and update it if the new difference is smaller.
But this is brute force. Is there any better solution?
Most solutions I found online find the target within a single array; is there any similarity between the 2-array target problem and the 1-array one?
One key insight: Given a pair (x, y) whose sum is higher than the target, that sum is closer than the sum of any pair (x, y'), where y' > y. Conversely, if the sum of (x, y) is lower than the target, that sum is closer than the sum of any pair (x', y) where x' < x.
This yields an algorithm in linear time:
Start at the first element of list X and the last element of list Y
Check if it's the best pair so far (if so, remember it)
If that sum is less than the target, move to the next higher element of X. If that sum is greater than the target, move to the next lower element of Y
Loop steps 2 - 3 until you run out of elements in X or Y
In Java:
// Pair here could be e.g. javafx.util.Pair or any simple tuple class
private static Pair<Integer, Integer> findClosestSum(List<Integer> X, List<Integer> Y, int target) {
    double bestDifference = Integer.MAX_VALUE;
    Pair<Integer, Integer> bestPair = null;
    int xIndex = 0;
    int yIndex = Y.size() - 1;
    while (true) {
        double sum = X.get(xIndex) + Y.get(yIndex);
        double difference = Math.abs(sum - target);
        if (difference < bestDifference) {
            bestPair = new Pair<>(X.get(xIndex), Y.get(yIndex));
            bestDifference = difference;
        }
        if (sum > target) {
            yIndex -= 1;
            if (yIndex < 0) {
                return bestPair;
            }
        } else if (sum < target) {
            xIndex += 1;
            if (xIndex == X.size()) {
                return bestPair;
            }
        } else {
            // Perfect match :)
            return bestPair;
        }
    }
}
You can prove this algorithm is exhaustive through the logic in the opening paragraph: for any pair that wasn't visited, there must be a visited pair including one of its two elements whose sum is at least as close to the target.
EDIT: If you only want sums which are less than the target (not those which overshoot), the same logic still applies. In the overshoot case, (x, y') is just as invalid as (x, y), and therefore it cannot be a better candidate sum. In this case, only step 2 needs to be modified, to store the sum only if it's the closest non-exceeding sum so far.
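A quick usage sketch with the question's example (my illustration; Pair as in the code above). Note that the unmodified version keeps the first best pair it encounters, here (7, 14) at distance 1 above the target, whereas the below-target variant from the EDIT would keep (14, 5) instead:

List<Integer> xs = Arrays.asList(7, 8, 14);
List<Integer> ys = Arrays.asList(5, 10, 14);
Pair<Integer, Integer> best = findClosestSum(xs, ys, 20);
// prints 7, 14  (sum 21 is at distance 1 from 20)
System.out.println(best.getKey() + ", " + best.getValue());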
Thank you for your algorithm; I have implemented my version of it. Yes, it does need to be the closest pair below the target, so I have made code changes accordingly. Since the inputs can contain duplicates, I made sure to handle that; there can also be multiple results, so that is handled as well. Let me know if you find any potential optimization. Here is the code:
public static List<List<Integer>> findClosest(int[] x, int[] y, int target){
    List<List<Integer>> result = new ArrayList<List<Integer>>();
    int bestDiff = Integer.MIN_VALUE; // diff is always <= 0 here, so "greater" means closer to the target
    int xIndex = 0;
    int yIndex = y.length - 1;
    // while the x pointer hasn't passed the right end and the y pointer hasn't passed the left end
    while(xIndex < x.length && yIndex >= 0){
        int xValue = x[xIndex];
        int yValue = y[yIndex];
        int diff = xValue + yValue - target;
        // sum greater than target: move the y pointer left, skipping duplicates
        if(diff > 0){
            yIndex--;
            while(yIndex > 0 && yValue == y[yIndex - 1]) yIndex--;
        }else{ // diff == 0 matches the target exactly; diff < 0 means the sum is below it
            // duplicate of the current best result, just add
            if(diff == bestDiff){
                result.add(Arrays.asList(xValue, yValue));
            }
            // found a better pair: clear the list and add the new pair
            else if(diff > bestDiff){
                result.clear();
                result.add(Arrays.asList(xValue, yValue));
                bestDiff = diff;
            }
            xIndex++;
        }
    }
    return result;
}
I'm solving Codility questions as practice and couldn't answer one of the questions. I found the answer on the Internet but I don't get how this algorithm works. Could someone walk me through it step-by-step?
Here is the question:
/*
You are given integers K, M and a non-empty zero-indexed array A consisting of N integers.
Every element of the array is not greater than M.
You should divide this array into K blocks of consecutive elements.
The size of the block is any integer between 0 and N. Every element of the array should belong to some block.
The sum of the block from X to Y equals A[X] + A[X + 1] + ... + A[Y]. The sum of empty block equals 0.
The large sum is the maximal sum of any block.
For example, you are given integers K = 3, M = 5 and array A such that:
A[0] = 2
A[1] = 1
A[2] = 5
A[3] = 1
A[4] = 2
A[5] = 2
A[6] = 2
The array can be divided, for example, into the following blocks:
[2, 1, 5, 1, 2, 2, 2], [], [] with a large sum of 15;
[2], [1, 5, 1, 2], [2, 2] with a large sum of 9;
[2, 1, 5], [], [1, 2, 2, 2] with a large sum of 8;
[2, 1], [5, 1], [2, 2, 2] with a large sum of 6.
The goal is to minimize the large sum. In the above example, 6 is the minimal large sum.
Write a function:
class Solution { public int solution(int K, int M, int[] A); }
that, given integers K, M and a non-empty zero-indexed array A consisting of N integers, returns the minimal large sum.
For example, given K = 3, M = 5 and array A such that:
A[0] = 2
A[1] = 1
A[2] = 5
A[3] = 1
A[4] = 2
A[5] = 2
A[6] = 2
the function should return 6, as explained above. Assume that:
N and K are integers within the range [1..100,000];
M is an integer within the range [0..10,000];
each element of array A is an integer within the range [0..M].
Complexity:
expected worst-case time complexity is O(N*log(N+M));
expected worst-case space complexity is O(1), beyond input storage (not counting the storage required for input arguments).
Elements of input arrays can be modified.
*/
And here is the solution I found with my comments about parts which I don't understand:
public static int solution(int K, int M, int[] A) {
    int lower = max(A); // why is lower the max?
    int upper = sum(A); // why is upper the sum?
    while (true) {
        int mid = (lower + upper) / 2;
        int blocks = calculateBlockCount(A, mid); // don't I have a specified number of blocks? What do blocks do here? Don't get that.
        if (blocks < K) {
            upper = mid - 1;
        } else if (blocks > K) {
            lower = mid + 1;
        } else {
            return upper;
        }
    }
}

private static int calculateBlockCount(int[] array, int maxSum) {
    int count = 0;
    int sum = array[0];
    for (int i = 1; i < array.length; i++) {
        if (sum + array[i] > maxSum) {
            count++;
            sum = array[i];
        } else {
            sum += array[i];
        }
    }
    return count;
}

// returns sum of all elements in an array
private static int sum(int[] input) {
    int sum = 0;
    for (int n : input) {
        sum += n;
    }
    return sum;
}

// returns max value in an array
private static int max(int[] input) {
    int max = -1;
    for (int n : input) {
        if (n > max) {
            max = n;
        }
    }
    return max;
}
So what the code does is use a form of binary search (how binary search works is explained quite nicely here: https://www.topcoder.com/community/data-science/data-science-tutorials/binary-search/; it also uses an example quite similar to your problem), where you search for the minimum sum every block needs to be able to hold. In the example case, you need to divide the array into 3 parts.
When doing a binary search you need to define 2 boundaries, between which you are certain your answer can be found. Here, the lower boundary is the maximum value in the array (lower). For the example this is 5 (this is the case if you divide your array into 7 blocks). The upper boundary (upper) is 15, which is the sum of all the elements in the array (this is the case if you divide the array into 1 block).
Now comes the search part: in solution() you start with your bounds and midpoint (10 for the example).
In calculateBlockCount you count (count++ does that) how many blocks you can make if the sum of a block is at most 10 (your midpoint, maxSum in calculateBlockCount).
For the example's mid of 10 this is 2 blocks; the code returns this (blocks) to solution(). Then it checks whether that is less or more than K, the number of blocks you want. If it's less than K, your midpoint is too high: you're putting too many array elements into each block. If it's more than K, your midpoint is too low: you're putting too few array elements into each block.
After this check, it halves the solution space (upper = mid - 1).
This happens every loop; halving the solution space is what makes it converge quickly.
Now you keep going through the while loop, adjusting mid, until the block count matches your input K.
So to go through it step by step:
Mid = 10: calculateBlockCount returns 2 blocks
Back in solution(): 2 blocks < K, so upper -> mid - 1 = 9, mid -> 7 (lower is 5)
Mid = 7: calculateBlockCount returns 2 blocks
Back in solution(): 2 blocks < K, so upper -> mid - 1 = 6, mid -> 5 (lower is 5; integer division makes it 5)
Mid = 5: calculateBlockCount returns 4 blocks
Back in solution(): 4 blocks > K, so lower -> mid + 1 = 6, mid -> 6 (lower is 6, upper is 6)
Mid = 6: calculateBlockCount returns 3 blocks, which equals K
So the function returns mid = 6.
Hope this helps,
Gl learning to code :)
Edit: when using binary search, a prerequisite is that the solution space is monotonic. That holds here: as the allowed block sum (mid) increases, the number of blocks needed never increases.
Seems like your solution has some problems. I rewrote it as below:
class Solution {
    public int solution(int K, int M, int[] A) {
        // write your code in Java SE 8
        int high = sum(A);
        int low = max(A);
        int mid = 0;
        int smallestSum = 0;
        while (high >= low) {
            mid = (high + low) / 2;
            int numberOfBlock = blockCount(mid, A);
            if (numberOfBlock > K) {
                low = mid + 1;
            } else if (numberOfBlock <= K) {
                smallestSum = mid;
                high = mid - 1;
            }
        }
        return smallestSum;
    }

    public int sum(int[] A) {
        int total = 0;
        for (int i = 0; i < A.length; i++) {
            total += A[i];
        }
        return total;
    }

    public int max(int[] A) {
        int max = 0;
        for (int i = 0; i < A.length; i++) {
            if (max < A[i]) max = A[i];
        }
        return max;
    }

    public int blockCount(int max, int[] A) {
        int current = 0;
        int count = 1;
        for (int i = 0; i < A.length; i++) {
            if (current + A[i] > max) {
                current = A[i];
                count++;
            } else {
                current += A[i];
            }
        }
        return count;
    }
}
This is what helped me; posting it in case anyone else finds it useful.
Think of it as a function: given k (the block count) we get some largeSum.
What is the inverse of this function? It's that given largeSum we get a k. This inverse function is implemented below.
In solution() we keep plugging guesses for largeSum into the inverse function until it returns the k given in the exercise.
To speed up the guessing process, we use binary search.
public class Problem {
    int SLICE_MAX = 100 * 1000 + 1;

    public int solution(int blockCount, int maxElement, int[] array) {
        // maxGuess is determined by looking at what the max possible largeSum could be
        // this happens if all elements are m and the blockCount is 1
        // Math.max is necessary, because blockCount can exceed array.length,
        // but this shouldn't lower maxGuess
        int maxGuess = (Math.max(array.length / blockCount, array.length)) * maxElement;
        int minGuess = 0;
        return helper(blockCount, array, minGuess, maxGuess);
    }

    private int helper(int targetBlockCount, int[] array, int minGuess, int maxGuess) {
        int guess = minGuess + (maxGuess - minGuess) / 2;
        int resultBlockCount = inverseFunction(array, guess);
        // if resultBlockCount == targetBlockCount this is not necessarily the solution
        // as there might be a lower largeSum, which also satisfies resultBlockCount == targetBlockCount
        if (resultBlockCount <= targetBlockCount) {
            if (minGuess == guess) return guess;
            // even if resultBlockCount == targetBlockCount
            // we keep searching for a potentially lower largeSum that also satisfies resultBlockCount == targetBlockCount
            // note that the search range below includes 'guess', as this might in fact be the lowest possible solution
            // but we need to check in case there's a lower one
            return helper(targetBlockCount, array, minGuess, guess);
        } else {
            return helper(targetBlockCount, array, guess + 1, maxGuess);
        }
    }

    // think of it as a function: given k (blockCount) we get some largeSum
    // the inverse of the above function is that given largeSum we get a k
    // in solution() we will keep guessing largeSum using binary search until
    // we hit the k given in the exercise
    int inverseFunction(int[] array, int largeSumGuess) {
        int runningSum = 0;
        int blockCount = 1;
        for (int i = 0; i < array.length; i++) {
            int current = array[i];
            if (current > largeSumGuess) return SLICE_MAX;
            if (runningSum + current <= largeSumGuess) {
                runningSum += current;
            } else {
                runningSum = current;
                blockCount++;
            }
        }
        return blockCount;
    }
}
Starting from anhtuannd's code, I refactored it using Java 8 streams. It is slightly slower. Thanks anhtuannd.
IntSummaryStatistics summary = Arrays.stream(A).summaryStatistics();
long high = summary.getSum();
long low = summary.getMax();
long result = 0;
while (high >= low) {
    long mid = (high + low) / 2;
    AtomicLong blocks = new AtomicLong(1);
    Arrays.stream(A).reduce(0, (acc, val) -> {
        if (acc + val > mid) {
            blocks.incrementAndGet();
            return val;
        } else {
            return acc + val;
        }
    });
    if (blocks.get() > K) {
        low = mid + 1;
    } else if (blocks.get() <= K) {
        result = mid;
        high = mid - 1;
    }
}
return (int) result;
I wrote a 100% solution in python here. The result is here.
Remember: you are searching the set of possible answers, not the array A.
In the example they are searching over possible answers: consider 5 (the largest single element) as the smallest possible max value for a block, and 15 (the sum of [2, 1, 5, 1, 2, 2, 2]) as the largest.
Mid = (5 + 15) // 2 = 10. Slicing out blocks of at most 10 at a time won't create more than 3 blocks in total.
Make 10 - 1 the upper and try again: (5 + 9) // 2 is 7. Slicing out blocks of at most 7 at a time won't create more than 3 blocks in total.
Make 7 - 1 the upper and try again: (5 + 6) // 2 is 5. Slicing out blocks of at most 5 at a time will create more than 3 blocks in total.
Make 5 + 1 the lower and try again: (6 + 6) // 2 is 6. Slicing out blocks of at most 6 at a time won't create more than 3 blocks in total.
Therefore 6 is the lowest limit that can be imposed on the sum of a block while still permitting a split into 3 blocks.
I'm working on a fuzzy search implementation, and as part of it we're using Apache's StringUtils.getLevenshteinDistance. At the moment we're going for a specific maximum average response time for our fuzzy search. After various enhancements and some profiling, the place where the most time is spent is calculating the Levenshtein distance. It takes up roughly 80-90% of the total time on search strings of three letters or more.
Now, I know there are some limitations to what can be done here, but I've read in previous SO questions and in the Wikipedia article on LD that if one is willing to limit the threshold to a set maximum distance, that could help curb the time spent on the algorithm, but I'm not sure how to do this exactly.
If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.[3]
Below you will see the original LD code from StringUtils, followed by my modification. I'm trying to calculate only the distances within a set width around the i,j diagonal (so, in my example, two diagonals above and below it). However, this can't be correct as I've done it. For example, on the highest diagonal it's always going to choose the cell value directly above, which will be 0. If anyone could show me how to make this work as described, or offer some general advice on how to do so, it would be greatly appreciated.
public static int getLevenshteinDistance(String s, String t) {
    if (s == null || t == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }
    int n = s.length(); // length of s
    int m = t.length(); // length of t
    if (n == 0) {
        return m;
    } else if (m == 0) {
        return n;
    }
    if (n > m) {
        // swap the input strings to consume less memory
        String tmp = s;
        s = t;
        t = tmp;
        n = m;
        m = t.length();
    }
    int p[] = new int[n+1]; // 'previous' cost array, horizontally
    int d[] = new int[n+1]; // cost array, horizontally
    int _d[]; // placeholder to assist in swapping p and d
    // indexes into strings s and t
    int i; // iterates through s
    int j; // iterates through t
    char t_j; // jth character of t
    int cost; // cost
    for (i = 0; i <= n; i++) {
        p[i] = i;
    }
    for (j = 1; j <= m; j++) {
        t_j = t.charAt(j-1);
        d[0] = j;
        for (i = 1; i <= n; i++) {
            cost = s.charAt(i-1) == t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
        }
        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }
    // our last action in the above loop was to switch d and p, so p now
    // actually has the most recent cost counts
    return p[n];
}
My modifications (only to the for loops):
for (j = 1; j<=m; j++) {
t_j = t.charAt(j-1);
d[0] = j;
int k = Math.max(j-2, 1);
for (i = k; i <= Math.min(j+2, n); i++) {
cost = s.charAt(i-1)==t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
The issue with implementing the window is dealing with the value to the left of the first entry and above the last entry in each row.
One way is to start the values you initially fill in at 1 instead of 0, then just ignore any 0s that you encounter; you'll have to subtract 1 from your final answer (a sketch of this variant follows).
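A hedged sketch of that first variant (my illustration, not the answer's code): every stored distance is shifted up by 1 so that 0 can mean "not computed"; zeros are skipped in the min, and 1 is subtracted at the end. As with the banded version below, a result above the threshold can still come back and should be treated as "no match":

public static int levenshteinShifted(String s, String t, int threshold) {
    int n = s.length(), m = t.length();
    int[] p = new int[n + 1];
    int[] d = new int[n + 1];
    // row 0, shifted by +1; cells left at 0 are "not computed"
    for (int i = 0; i <= Math.min(n, threshold); i++) p[i] = i + 1;
    for (int j = 1; j <= m; j++) {
        char tj = t.charAt(j - 1);
        java.util.Arrays.fill(d, 0);
        if (j <= threshold) d[0] = j + 1; // column 0 only exists inside the stripe
        int lo = Math.max(1, j - threshold);
        int hi = Math.min(n, j + threshold);
        for (int i = lo; i <= hi; i++) {
            int cost = s.charAt(i - 1) == tj ? 0 : 1;
            int best = Integer.MAX_VALUE;
            if (d[i - 1] != 0) best = Math.min(best, d[i - 1] + 1);    // left
            if (p[i] != 0)     best = Math.min(best, p[i] + 1);        // up
            if (p[i - 1] != 0) best = Math.min(best, p[i - 1] + cost); // diagonal
            d[i] = best; // the diagonal neighbour is always inside the stripe, so best is set
        }
        int[] tmp = p; p = d; d = tmp;
    }
    return p[n] == 0 ? -1 : p[n] - 1; // undo the +1 shift; 0 means the stripe ran off
}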
Another way is to fill the entries left of first and above last with high values so the minimum check will never pick them. That's the way I chose when I had to implement it the other day:
public static int levenshtein(String s, String t, int threshold) {
    int slen = s.length();
    int tlen = t.length();
    // swap so the smaller string is t; this reduces the memory usage
    // of our buffers
    if (tlen > slen) {
        String stmp = s;
        s = t;
        t = stmp;
        int itmp = slen;
        slen = tlen;
        tlen = itmp;
    }
    // p is the previous and d is the current distance array; dtmp is used in swaps
    int[] p = new int[tlen + 1];
    int[] d = new int[tlen + 1];
    int[] dtmp;
    // the values necessary for our threshold are written; the ones after
    // must be filled with large integers since the tailing member of the threshold
    // window in the bottom array will run min across them
    int n = 0;
    for (; n < Math.min(p.length, threshold + 1); ++n)
        p[n] = n;
    Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
    Arrays.fill(d, Integer.MAX_VALUE);
    // this is the core of the Levenshtein edit distance algorithm
    // instead of actually building the matrix, two arrays are swapped back and forth
    // the threshold limits the amount of entries that need to be computed if we're
    // looking for a match within a set distance
    for (int row = 1; row < s.length() + 1; ++row) {
        char schar = s.charAt(row - 1);
        d[0] = row;
        // set up our threshold window
        int min = Math.max(1, row - threshold);
        int max = Math.min(d.length, row + threshold + 1);
        // since we're reusing arrays, we need to be sure to wipe the value left of the
        // starting index; we don't have to worry about the value above the ending index
        // as the arrays were initially filled with large integers and we progress to the right
        if (min > 1)
            d[min - 1] = Integer.MAX_VALUE;
        for (int col = min; col < max; ++col) {
            if (schar == t.charAt(col - 1))
                d[col] = p[col - 1];
            else
                // min of: diagonal, left, up
                d[col] = Math.min(p[col - 1], Math.min(d[col - 1], p[col])) + 1;
        }
        // swap our arrays
        dtmp = p;
        p = d;
        d = dtmp;
    }
    if (p[tlen] == Integer.MAX_VALUE)
        return -1;
    return p[tlen];
}
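A quick sanity check of the above (my examples; note that this version returns -1 only when the final cell is never reached inside the stripe, so results larger than the threshold can still come back and should be rejected by the caller if a strict cutoff is needed):

System.out.println(levenshtein("kitten", "sitting", 3)); // 3
System.out.println(levenshtein("hippo", "elephant", 2)); // -1: the stripe runs off the matrix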
I've written before, here, about Levenshtein automata, which are one way to do this sort of check in O(n) time. The source code samples are in Python, but the explanations should be helpful, and the referenced papers provide more details.
According to "Gusfield, Dan (1997). Algorithms on strings, trees, and sequences: computer science and computational biology" (page 264) you should ignore zeros.
Here someone answers a very similar question:
Cite:
I've done it a number of times. The way I do it is with a recursive depth-first tree-walk of the game tree of possible changes. There is a budget k of changes, that I use to prune the tree. With that routine in hand, first I run it with k=0, then k=1, then k=2 until I either get a hit or I don't want to go any higher.
char* a = /* string 1 */;
char* b = /* string 2 */;
int na = strlen(a);
int nb = strlen(b);

bool walk(int ia, int ib, int k){
    /* if the budget is exhausted, prune the search */
    if (k < 0) return false;
    /* if at the end of both strings we have a match */
    if (ia == na && ib == nb) return true;
    /* if the first characters match, continue walking with no reduction in budget */
    if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
    /* if the first characters don't match, assume there is a 1-character replacement */
    if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
    /* try assuming there is an extra character in a */
    if (ia < na && walk(ia+1, ib, k-1)) return true;
    /* try assuming there is an extra character in b */
    if (ib < nb && walk(ia, ib+1, k-1)) return true;
    /* if none of those worked, I give up */
    return false;
}
just the main part, more code in the original
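For reference, here's a self-contained Java rendition of the same idea (my sketch, not the quoted author's code): a depth-first walk with an edit budget k, retried with k = 0, 1, 2, ... until a match is found or a cap is reached:

static boolean walk(String a, String b, int ia, int ib, int k) {
    if (k < 0) return false;                                // budget exhausted: prune
    if (ia == a.length() && ib == b.length()) return true;  // reached the end of both strings
    boolean in = ia < a.length() && ib < b.length();
    // matching characters: walk on with no reduction in budget
    if (in && a.charAt(ia) == b.charAt(ib) && walk(a, b, ia + 1, ib + 1, k)) return true;
    // mismatch: assume a 1-character replacement
    if (in && a.charAt(ia) != b.charAt(ib) && walk(a, b, ia + 1, ib + 1, k - 1)) return true;
    // assume an extra character in a
    if (ia < a.length() && walk(a, b, ia + 1, ib, k - 1)) return true;
    // assume an extra character in b
    if (ib < b.length() && walk(a, b, ia, ib + 1, k - 1)) return true;
    return false;
}

// iterative deepening: the smallest k for which walk succeeds is the distance
static int boundedDistance(String a, String b, int maxK) {
    for (int k = 0; k <= maxK; k++) {
        if (walk(a, b, 0, 0, k)) return k;
    }
    return -1; // distance exceeds maxK
}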
I used the original code and placed this just before the end of the j for loop:
if (p[n] > s.length() + 5)
    break;
The +5 is arbitrary, but for our purposes, if the distance is the query length plus five (or whatever number we settle upon), it doesn't really matter what is returned, because we consider the match as simply being too different. It does cut things down a bit. Still, I'm pretty sure this isn't the idea the Wiki statement was talking about, if anyone understands that better.
Apache Commons Lang 3.4 has this implementation:
/**
 * <p>Find the Levenshtein distance between two Strings if it's less than or equal to a given
 * threshold.</p>
 *
 * <p>This is the number of changes needed to change one String into
 * another, where each change is a single character modification (deletion,
 * insertion or substitution).</p>
 *
 * <p>This implementation follows from Algorithms on Strings, Trees and Sequences by Dan Gusfield
 * and Chas Emerick's implementation of the Levenshtein distance algorithm from
 * http://www.merriampark.com/ld.htm</p>
 *
 * <pre>
 * StringUtils.getLevenshteinDistance(null, *, *)             = IllegalArgumentException
 * StringUtils.getLevenshteinDistance(*, null, *)             = IllegalArgumentException
 * StringUtils.getLevenshteinDistance(*, *, -1)               = IllegalArgumentException
 * StringUtils.getLevenshteinDistance("","", 0)               = 0
 * StringUtils.getLevenshteinDistance("aaapppp", "", 8)       = 7
 * StringUtils.getLevenshteinDistance("aaapppp", "", 7)       = 7
 * StringUtils.getLevenshteinDistance("aaapppp", "", 6)       = -1
 * StringUtils.getLevenshteinDistance("elephant", "hippo", 7) = 7
 * StringUtils.getLevenshteinDistance("elephant", "hippo", 6) = -1
 * StringUtils.getLevenshteinDistance("hippo", "elephant", 7) = 7
 * StringUtils.getLevenshteinDistance("hippo", "elephant", 6) = -1
 * </pre>
 *
 * @param s the first String, must not be null
 * @param t the second String, must not be null
 * @param threshold the target threshold, must not be negative
 * @return result distance, or {@code -1} if the distance would be greater than the threshold
 * @throws IllegalArgumentException if either String input is {@code null} or the threshold is negative
 */
public static int getLevenshteinDistance(CharSequence s, CharSequence t, final int threshold) {
    if (s == null || t == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }
    if (threshold < 0) {
        throw new IllegalArgumentException("Threshold must not be negative");
    }

    /*
    This implementation only computes the distance if it's less than or equal to the
    threshold value, returning -1 if it's greater. The advantage is performance: unbounded
    distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only
    computing a diagonal stripe of width 2k + 1 of the cost table.
    It is also possible to use this to compute the unbounded Levenshtein distance by starting
    the threshold at 1 and doubling each time until the distance is found; this is O(dm), where
    d is the distance.

    One subtlety comes from needing to ignore entries on the border of our stripe
    eg.
    p[] = |#|#|#|*
    d[] =  *|#|#|#|
    We must ignore the entry to the left of the leftmost member
    We must ignore the entry above the rightmost member

    Another subtlety comes from our stripe running off the matrix if the strings aren't
    of the same size. Since string s is always swapped to be the shorter of the two,
    the stripe will always run off to the upper right instead of the lower left of the matrix.

    As a concrete example, suppose s is of length 5, t is of length 7, and our threshold is 1.
    In this case we're going to walk a stripe of length 3. The matrix would look like so:

       1 2 3 4 5
    1 |#|#| | | |
    2 |#|#|#| | |
    3 | |#|#|#| |
    4 | | |#|#|#|
    5 | | | |#|#|
    6 | | | | |#|
    7 | | | | | |

    Note how the stripe leads off the table as there is no possible way to turn a string of length 5
    into one of length 7 in edit distance of 1.

    Additionally, this implementation decreases memory usage by using two
    single-dimensional arrays and swapping them back and forth instead of allocating
    an entire n by m matrix. This requires a few minor changes, such as immediately returning
    when it's detected that the stripe has run off the matrix and initially filling the arrays with
    large values so that entries we don't compute are ignored.

    See Algorithms on Strings, Trees and Sequences by Dan Gusfield for some discussion.
    */

    int n = s.length(); // length of s
    int m = t.length(); // length of t

    // if one string is empty, the edit distance is necessarily the length of the other
    if (n == 0) {
        return m <= threshold ? m : -1;
    } else if (m == 0) {
        return n <= threshold ? n : -1;
    }

    if (n > m) {
        // swap the two strings to consume less memory
        final CharSequence tmp = s;
        s = t;
        t = tmp;
        n = m;
        m = t.length();
    }

    int p[] = new int[n + 1]; // 'previous' cost array, horizontally
    int d[] = new int[n + 1]; // cost array, horizontally
    int _d[]; // placeholder to assist in swapping p and d

    // fill in starting table values
    final int boundary = Math.min(n, threshold) + 1;
    for (int i = 0; i < boundary; i++) {
        p[i] = i;
    }
    // these fills ensure that the value above the rightmost entry of our
    // stripe will be ignored in following loop iterations
    Arrays.fill(p, boundary, p.length, Integer.MAX_VALUE);
    Arrays.fill(d, Integer.MAX_VALUE);

    // iterates through t
    for (int j = 1; j <= m; j++) {
        final char t_j = t.charAt(j - 1); // jth character of t
        d[0] = j;

        // compute stripe indices, constrain to array size
        final int min = Math.max(1, j - threshold);
        final int max = (j > Integer.MAX_VALUE - threshold) ? n : Math.min(n, j + threshold);

        // the stripe may lead off of the table if s and t are of different sizes
        if (min > max) {
            return -1;
        }

        // ignore entry left of leftmost
        if (min > 1) {
            d[min - 1] = Integer.MAX_VALUE;
        }

        // iterates through [min, max] in s
        for (int i = min; i <= max; i++) {
            if (s.charAt(i - 1) == t_j) {
                // diagonally left and up
                d[i] = p[i - 1];
            } else {
                // 1 + minimum of cell to the left, to the top, diagonally left and up
                d[i] = 1 + Math.min(Math.min(d[i - 1], p[i]), p[i - 1]);
            }
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

    // if p[n] is greater than the threshold, there's no guarantee on it being the correct distance
    if (p[n] <= threshold) {
        return p[n];
    }
    return -1;
}
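The comment block above mentions that the bounded version can also compute the unbounded distance by doubling the threshold until it is met; a minimal sketch of that driver (my code, not part of Commons Lang):

public static int getLevenshteinDistance(CharSequence s, CharSequence t) {
    int threshold = 1;
    while (true) {
        int d = getLevenshteinDistance(s, t, threshold);
        if (d >= 0) {
            return d;       // found within the current bound
        }
        threshold *= 2;     // -1 means the distance exceeds the bound: double and retry
    }
}

Since each attempt costs O(threshold * m) and the thresholds form a geometric series, the total work is O(dm), where d is the final distance.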