Edit Distance solution for Large Strings


I'm trying to solve the edit distance problem. The code I've been using is below.
public static int minDistance(String word1, String word2) {
int len1 = word1.length();
int len2 = word2.length();
// len1+1, len2+1, because finally return dp[len1][len2]
int[][] dp = new int[len1 + 1][len2 + 1];
for (int i = 0; i <= len1; i++) {
dp[i][0] = i;
}
for (int j = 0; j <= len2; j++) {
dp[0][j] = j;
}
//iterate though, and check last char
for (int i = 0; i < len1; i++) {
char c1 = word1.charAt(i);
for (int j = 0; j < len2; j++) {
char c2 = word2.charAt(j);
//if last two chars equal
if (c1 == c2) {
//update dp value for +1 length
dp[i + 1][j + 1] = dp[i][j];
} else {
int replace = dp[i][j] + 1 ;
int insert = dp[i][j + 1] + 1 ;
int delete = dp[i + 1][j] + 1 ;
int min = replace > insert ? insert : replace;
min = delete > min ? min : delete;
dp[i + 1][j + 1] = min;
}
}
}
return dp[len1][len2];
}
It's a DP approach. The problem is that, since it uses a 2D array, we can't use this method for large strings, e.g. string length > 100000.
So is there any way to modify this algorithm to overcome that difficulty?
NOTE:
The above code accurately solves the edit distance problem for small strings (lengths of about 1000 or less).
As you can see, the code uses a Java 2D array, "dp[][]", and we can't initialize a 2D array with that many rows and columns.
Ex: If I need to check 2 strings whose lengths are more than 100000,
int[][] dp = new int[len1 + 1][len2 + 1];
the above will be
int[][] dp = new int[100000][100000];
So it will give a StackOverflowError.
So the above program is only good for short strings.
What I'm asking is: is there any way to solve this problem efficiently in Java for large strings (length > 100000)?

First of all, there is nothing inherently wrong with allocating a 100k x 100k int array in Java. Arrays live on the heap, not the stack, so what you run into is a shortage of heap memory rather than a stack overflow: 10^10 ints at 4 bytes each need roughly 40 GB for the data alone :)
Secondly, as a (very direct) hint:
Note that in your loop, you only ever use 2 rows at a time: row i and row i+1. In fact, you calculate row i+1 from row i. Once you have row i+1, you don't need to store row i anymore.
This neat trick allows you to store only 2 rows at a time, bringing the space complexity down from n^2 to n. Since you stated that this is not homework (even though you're a CS undergrad by your profile...), I'll trust you to come up with the code yourself.
Come to think of it, I recall having this exact problem when I was taking a class for my CS degree...
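For readers who want to check their own attempt against something, here is a minimal sketch of the two-row idea described above. The class name EditDistanceTwoRows and the row names prev and cur are my own choices, not from the question; the recurrence is the same one the question's code uses.

```java
class EditDistanceTwoRows {
    static int minDistance(String word1, String word2) {
        // Use the shorter string for the columns so each row is as small as possible.
        if (word2.length() > word1.length()) {
            String t = word1;
            word1 = word2;
            word2 = t;
        }
        int len1 = word1.length();
        int len2 = word2.length();
        int[] prev = new int[len2 + 1]; // plays the role of dp[i]
        int[] cur = new int[len2 + 1];  // plays the role of dp[i + 1]
        for (int j = 0; j <= len2; j++) {
            prev[j] = j;
        }
        for (int i = 0; i < len1; i++) {
            cur[0] = i + 1;
            char c1 = word1.charAt(i);
            for (int j = 0; j < len2; j++) {
                if (c1 == word2.charAt(j)) {
                    cur[j + 1] = prev[j];
                } else {
                    cur[j + 1] = 1 + Math.min(prev[j], Math.min(prev[j + 1], cur[j]));
                }
            }
            // The row just filled becomes the previous row for the next iteration.
            int[] t = prev;
            prev = cur;
            cur = t;
        }
        return prev[len2];
    }

    public static void main(String[] args) {
        System.out.println(minDistance("kitten", "sitting")); // prints 3
    }
}
```

Note that this only fixes the memory use. The running time is still proportional to len1 * len2, so two strings of length 100000 still cost on the order of 10^10 cell updates.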

Related

Is there any reason why Java would be performing faster than C for sorting int arrays with manual Insertion / Selection / Radix sorts?

Platform: OpenBSD, compiler: gcc, javac (OpenJDK version 17)
I have written a few sorting benchmarks in various languages, and I'm rather surprised by how the Java program performs compared with the C program.
I have implemented exactly the same sorting algorithms in both languages, and the Java program finishes almost twice as fast. Every other language is slower than the C implementation; only Java is faster.
The benchmarks run each sorting algorithm on random arrays of numbers a set number of times.
I am compiling the C program with -O3 and -Ofast, so I cannot apply any more compiler optimizations.
The exact code can be found here, but here is an excerpt from it:
Java:
public static void benchmark(SortingFunction func, int arraySize, int numTimes, String name, BufferedOutputStream bo) throws IOException {
int[][] arrs = new int[numTimes][arraySize];
for (int i = 0; i < numTimes; i ++) {
arrs[i] = genRandArray(arraySize);
}
long start = System.nanoTime();
for (int i = 0; i < numTimes; i ++) {
func.sort(arrs[i]);
}
long end = System.nanoTime();
double time = (double)(end - start) / 1e9;
System.out.println("It took " + time + " seconds to do " + name + " sort " +
numTimes + " times on arrays of size " + arraySize
);
String out = name+","+numTimes+","+arraySize+","+time;
for (char c : out.toCharArray()) {
bo.write(c);
}
bo.write('\n');
}
public static void insertionSort(int[] array) {
for (int i = 1; i < array.length; i ++) {
int temp = array[i];
int j;
for (j = i - 1; j >= 0 && array[j] > temp; j --) {
array[j+1] = array[j];
}
array[j+1] = temp;
}
}
C:
void benchmark(void (*f)(int *, int), int arr_size, int num_times, char *name,
FILE *fp) {
int *arrs[num_times];
struct timeval start, end;
double t;
for (int i = 0; i < num_times; i++) {
arrs[i] = gen_rand_arr(arr_size);
}
gettimeofday(&start, NULL);
for (int i = 0; i < num_times; i++) {
f(arrs[i], arr_size);
}
gettimeofday(&end, NULL);
for (int i = 0; i < num_times; i++) {
free(arrs[i]);
}
t = ((double)(end.tv_sec * 1000000 + end.tv_usec -
(start.tv_sec * 1000000 + start.tv_usec))) /
1000000;
printf("It took %f seconds to do %s sort %d times on arrays of size %d\n", t,
name, num_times, arr_size);
if (fp != NULL) {
fprintf(fp, "%s,%d,%d,%f\n", name, num_times, arr_size, t);
}
}
void insertion_sort(int *arr, int arr_size) {
for (int i = 1; i < arr_size; i++) {
int temp = arr[i];
int j;
for (j = i - 1; j >= 0 && *(arr + j) > temp; j--) {
arr[j + 1] = arr[j];
}
arr[j + 1] = temp;
}
return;
}
Are there some optimizations that Java is making to run faster that somehow change the algorithm? What is going on here?
Any explanations would be appreciated.
Here is a table of results that might help explain the difference:
Java:
name       rep    size   time (s)
Insertion  10000  1200   1.033
Insertion  10000  5000   15.677
Insertion  10000  12000  88.190
Selection  10000  1200   3.118
Selection  10000  5000   48.377
Selection  10000  12000  268.608
Radix      10000  1200   0.385
Radix      10000  5000   1.491
Radix      10000  12000  3.563
Bogo       1      11     1.330
Bogo       1      12     0.572
Bogo       1      13     122.777
C:
name       rep    size   time (s)
Insertion  10000  1200   1.766
Insertion  10000  5000   26.106
Insertion  10000  12000  140.582
Selection  10000  1200   4.011
Selection  10000  5000   60.442
Selection  10000  12000  340.608
Radix      10000  1200   0.430
Radix      10000  5000   1.788
Radix      10000  12000  4.295
Bogo       1      11     1.378
Bogo       1      12     2.296
Bogo       1      13     1586.73
Edit:
I modified the benchmarking function to generate the arrays beforehand. In Java this overflows the heap, and in C it does not make things much faster (around 25%, so Java is still faster).
FWIW, I also changed how I index things in C from *(arr + i) to arr[i].
Here's the implementation for the random array generator in Java and C:
Java:
public static int[] genRandArray(int arraySize) {
int[] ret = new int[arraySize];
Random rand = new Random();
for (int i = 0; i < ret.length; i ++) {
ret[i] = rand.nextInt(100);
}
return ret;
}
C:
int *gen_rand_arr(int arr_size) {
int *arr;
if ((arr = malloc(arr_size * sizeof(int))) == NULL) {
exit(1);
}
for (int i = 0; i < arr_size; i++) {
arr[i] = arc4random_uniform(100);
}
return arr;
}

TL;DR
In general, short snippets like this are not a fair way to compare languages. A lot of factors come into play. Code does not automatically get faster when you write it in C instead of Java. If that were the case, you could just write a Java2C converter. Compiler flags matter a lot, but so does the skill of the programmer.
Longer explanation
I cannot say for sure, but this:
for (j = i - 1; j >= 0 && arr[j] > temp; j--) {
arr[j + 1] = arr[j];
}
may not be very cache friendly, because you're traversing the list backwards. I would try changing the loops so that the outer loop does the backwards traversal instead of the inner loop.
But I'd say that your question is fundamentally flawed. Code does not automatically get a performance boost just because you rewrite it from Java to C. In the same way, C programs do not automatically get faster because you rewrite them in assembly. One could say that C allows you to write faster programs than Java, but in the end, the result depends on the programmer.
One thing that can speed up Java programs is the JIT compiler, which can gather statistics at runtime and optimize the code for the specific conditions right there and then. While it is possible to make a C compiler optimize for a typical workload, it cannot optimize for the current workload.
You said that you used -O3 for the C code. But what target did you use? Did you optimize for your machine or for a generic one? The JIT compiler knows the target it is optimizing for. Try using -march=native.
Are you sure that you're using the same size for int? It's 32 bits in Java, but might be 64 in C. It might speed up the C code if you switch to int32_t instead. But it might also slow it down. (It is very unlikely that this is the cause, but I wanted to mention it as a possibility.)
C usually shines when it comes to very low level stuff.
And if we look in your Java code:
for (int i = 1; i < array.length; i ++) {
int temp = array[i];
In this example, a smart compiler can easily see that array will never be accessed out of bounds. But what if we instead would have something like:
while(<condition>) {
int temp = array[foo()];
where it cannot be determined beforehand that the index stays within the bounds of array? Then Java is forced to check the bounds on every access so that it can throw an exception. The code would be translated to something like:
while(<condition>) {
int i = foo();
if (i < 0 || i >= array.length)
throw new ArrayIndexOutOfBoundsException(i);
int temp = array[i];
This gives safety, but costs performance. C would simply allow you to access out of bounds, which is faster but may cause bugs that are hard to find.
I found a nice question with more info: Why would it ever be possible for Java to be faster than C++?
Apart from that, I can see that you're including the data generation in the benchmark. That's very bad. Generate the data before starting the timer. Like this:
int *arrs[num_times];
for (int i = 0; i < num_times; i++)
arrs[i] = gen_rand_arr(arr_size);
gettimeofday(&start, NULL);
for (int i = 0; i < num_times; i++)
f(arrs[i], arr_size);
gettimeofday(&end, NULL);
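On the Java side, the same two points (generate the data before starting the timer, and remember that the JIT compiles hot code while the program runs) can be sketched as below. This is only an illustration under my own assumptions: the class name SortBenchmarkSketch, the helper names, the seeds and the array counts are all made up, and a serious measurement would use a dedicated harness rather than hand-rolled timing.

```java
import java.util.Random;

class SortBenchmarkSketch {
    // Same insertion sort as in the question.
    static void insertionSort(int[] array) {
        for (int i = 1; i < array.length; i++) {
            int temp = array[i];
            int j;
            for (j = i - 1; j >= 0 && array[j] > temp; j--) {
                array[j + 1] = array[j];
            }
            array[j + 1] = temp;
        }
    }

    // Hypothetical helper: builds all input arrays up front, outside the timed region.
    static int[][] randomArrays(int count, int size, long seed) {
        Random rand = new Random(seed);
        int[][] arrays = new int[count][size];
        for (int[] arr : arrays) {
            for (int i = 0; i < arr.length; i++) {
                arr[i] = rand.nextInt(100);
            }
        }
        return arrays;
    }

    // Times only the sorting; the arrays are sorted in place.
    static double timeSeconds(int[][] arrays) {
        long start = System.nanoTime();
        for (int[] arr : arrays) {
            insertionSort(arr);
        }
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) {
        // Warm-up run on throwaway data so the JIT has a chance to compile the hot loop.
        timeSeconds(randomArrays(200, 1200, 1L));
        // Measured run on fresh data generated before the timer starts.
        double seconds = timeSeconds(randomArrays(1000, 1200, 2L));
        System.out.println("Insertion sort, 1000 arrays of size 1200: " + seconds + " s");
    }
}
```

Comparing the warm-up time with the measured time (scaled by the number of arrays) gives a rough feel for how much of the Java figure depends on the JIT having already kicked in.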

Improving the algorithm for removal of element

Problem
Given a string s and m queries. For each query delete the K-th occurrence of a character x.
For example:
abcdbcaab
5
2 a
1 c
1 d
3 b
2 a
Ans abbc
My approach
I am using a binary indexed tree (BIT) for the update operation.
Code:
for (int i = 0; i < ss.length(); i++) {
char cc = ss.charAt(i);
freq[cc-97] += 1;
if (max < freq[cc-97]) max = freq[cc-97];
dp[cc-97][freq[cc-97]] = i; // Counting the Frequency
}
BIT = new int[27][ss.length()+1];
int[] ans = new int[ss.length()];
int q = in.nextInt();
for (int i = 0; i < q; i++) {
int rmv = in.nextInt();
char c = in.next().charAt(0);
int rr = rmv + value(rmv, BIT[c-97]); // Calculating the original Index Value
ans[dp[c-97][rr]] = Integer.MAX_VALUE;
update(rmv, 1, BIT[c-97], max); // Updating it
}
for (int i = 0; i < ss.length(); i++) {
if (ans[i] != Integer.MAX_VALUE) System.out.print(ss.charAt(i));
}
Time Complexity is O(M log N) where N is length of string ss.
Question
My solution gives me Time Limit Exceeded Error. How can I improve it?
public static void update(int i , int value , int[] arr , int xx){
while(i <= xx){
arr[i ]+= value;
i += (i&-i);
}
}
public static int value(int i , int[] arr){
int ans = 0;
while(i > 0){
ans += arr[i];
i -= (i &- i);
}
return ans ;
}

There are key operations not shown, and odds are that one of them (quite likely the update method) has a different cost than you think. Furthermore your stated complexity is guaranteed to be wrong because at some point you have to scan the string which is at minimum O(N).
But anyways the obviously right strategy here is to go through the queries, separate them by character, and then go through the queries in reverse order to figure out the initial positions of the characters to be suppressed. Then run through the string once, emitting characters only when it fits. This solution, if implemented well, should be doable in O(N + M log(M)).
The challenge is how to represent the deletions efficiently. I'm thinking of some sort of tree of relative offsets so that if you find that the first deletion was 3 a you can efficiently insert it into your tree and move every later deletion after that one. This is where the log(M) bit will be.
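For comparison, here is a sketch of a different technique from the offline, reverse-order strategy described above: answer the queries online, keeping one Fenwick tree per character that marks which occurrences are still alive, and locate the K-th live occurrence by binary descent. It assumes lowercase a-z input and queries that are always valid; the class and method names are hypothetical. The cost is O(N + M log N).

```java
import java.util.ArrayList;
import java.util.List;

class KthOccurrenceRemoval {
    // Fenwick tree over the occurrences of one character; a slot counts 1 while it is alive.
    static final class Fenwick {
        final int n;
        final int[] tree;

        Fenwick(int n) {
            this.n = n;
            this.tree = new int[n + 1];
            // Linear-time build with every slot set to 1.
            for (int i = 1; i <= n; i++) {
                tree[i] += 1;
                int parent = i + (i & -i);
                if (parent <= n) tree[parent] += tree[i];
            }
        }

        void add(int i, int delta) {
            for (; i <= n; i += i & -i) tree[i] += delta;
        }

        // Smallest 1-based slot whose prefix sum reaches k, i.e. the k-th live occurrence.
        int findKth(int k) {
            int pos = 0;
            for (int step = Integer.highestOneBit(n); step > 0; step >>= 1) {
                int next = pos + step;
                if (next <= n && tree[next] < k) {
                    pos = next;
                    k -= tree[next];
                }
            }
            return pos + 1;
        }
    }

    // Query q deletes the ks[q]-th remaining occurrence of cs[q].
    static String remove(String s, int[] ks, char[] cs) {
        List<List<Integer>> positions = new ArrayList<>();
        for (int c = 0; c < 26; c++) positions.add(new ArrayList<>());
        for (int i = 0; i < s.length(); i++) positions.get(s.charAt(i) - 'a').add(i);

        Fenwick[] alive = new Fenwick[26];
        for (int c = 0; c < 26; c++) alive[c] = new Fenwick(positions.get(c).size());

        boolean[] deleted = new boolean[s.length()];
        for (int q = 0; q < ks.length; q++) {
            int c = cs[q] - 'a';
            int slot = alive[c].findKth(ks[q]);
            deleted[positions.get(c).get(slot - 1)] = true;
            alive[c].add(slot, -1);
        }

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (!deleted[i]) out.append(s.charAt(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The example from the question; prints abbc
        System.out.println(remove("abcdbcaab",
                new int[] {2, 1, 1, 3, 2},
                new char[] {'a', 'c', 'd', 'b', 'a'}));
    }
}
```

The trees are sized per character (the number of occurrences of that character), so the total memory is linear in the string length instead of the 27 x (N + 1) table in the question.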

in Java, design linear algorithm that finds contiguous subsequence with highest sum

this is the question, and yes it is homework, so I don't necessarily want anyone to "do it" for me; I just need suggestions: Maximum sum: Design a linear algorithm that finds a contiguous subsequence of at most M in a sequence of N long integers that has the highest sum among all such subsequences. Implement your algorithm, and confirm that the order of growth of its running time is linear.
I think that the best way to design this program would be to use nested for loops, but because the algorithm must be linear, I cannot do that. So, I decided to approach the problem by making separate for loops (instead of nested ones).
However, I'm really not sure where to start. The values will range from -99 to 99 (as per the range of my random number generating program).
This is what I have so far (not much):
public class MaxSum {
public static void main(String[] args){
int M = Integer.parseInt(args[0]);
int N = StdIn.readInt();
long[] a = new long[N];
for (int i = 0; i < N; i++) {
a[i] = StdIn.readLong();}}}
If M were a constant, this wouldn't be so difficult. For example, if M==3:
public class MaxSum2 {
    public static void main(String[] args) {
        int N = StdIn.readInt();            // read size for array
        long[] a = new long[N];             // create array of size N
        for (int i = 0; i < N; i++) {       // read in values and assign them to array indices
            a[i] = StdIn.readLong();
        }
        long p = a[0] + a[1] + a[2];        // start off with first 3 indices
        for (int i = 0; i <= N - 3; i++) {  // if sum of values is greater than p, p becomes that sum
            long s = a[i] + a[i + 1] + a[i + 2];
            if (s >= p) { p = s; }
        }
        for (int i = 0; i <= N - 3; i++) {  // prints the sum of each window that equals p
            long s = a[i] + a[i + 1] + a[i + 2];
            if (s == p) { StdOut.println(s); }
        }
    }
}
If I must, I think MaxSum2 will be acceptable for my lab report (sadly, they don't expect much). However, I'd really like to make a general program, one that takes into consideration the possibility that, say, there could be only one positive value in the array, meaning that adding the others to it would only reduce its value. Or, if M were to equal 5 but the highest sum belongs to a subsequence of length 3, I would want it to print that smaller subsequence that has the actual maximum sum.
I also think that, as a novice programmer, this is something I should learn to do. Oh, and although it would probably be acceptable, I don't think I'm supposed to use stacks or queues because we haven't actually covered them in class yet.

Here is my version, adapted from Petar Minchev's code and with an important addition that allows this program to work for an array of numbers with all negative values.
public class MaxSum4 {
public static void main(String[] args)
{Stopwatch banana = new Stopwatch(); //stopwatch object for runtime data.
long sum = 0;
int currentStart = 0;
long bestSum = 0;
int bestStart = 0;
int bestEnd = 0;
int M = Integer.parseInt(args[0]); // read in highest possible length of
//subsequence from command line argument.
int N = StdIn.readInt(); //read in length of array
long[] a = new long[N];
for (int i = 0; i < N; i++) {//read in values from standard input
a[i] = StdIn.readLong();}//and assign those values to array
long negBuff = a[0];
for (int i = 0; i < N; i++) { //go through values of array to find
//largest sum (bestSum)
sum += a[i]; //and updates values. note bestSum, bestStart,
// and bestEnd updated
if (sum > bestSum) { //only when sum>bestSum
bestSum = sum;
bestStart = currentStart;
bestEnd = i; }
if (sum < 0) { //in case sum<0, skip to next iteration, reseting sum=0
sum = 0; //and update currentStart
currentStart = i + 1;
continue; }
if (i - currentStart + 1 == M) { //checks if sequence length becomes equal
//to M.
do { //updates sum and currentStart
sum -= a[currentStart];
currentStart++;
} while ((currentStart <= i) && (sum < 0 || a[currentStart] < 0)); // bounds check first, so a[currentStart] is never read past i
//if sum or a[currentStart]
} //is less than 0 and currentStart<=i,
} //update sum and currentStart again
if(bestSum==0){ //checks to see if bestSum==0, which is the case if
//all values are negative
for (int i=0;i<N;i++){ //goes through values of array
//to find largest value
if (a[i] >= negBuff) {negBuff=a[i];
bestSum=negBuff; bestStart=i; bestEnd=i;}}}
//updates bestSum, bestStart, and bestEnd
StdOut.print("best subsequence is from a[" + bestStart + "] to a[" + bestEnd + "]: ");
for (int i = bestStart; i<=bestEnd; i++)
{
StdOut.print(a[i]+ " "); //prints sequence
}
StdOut.println();
StdOut.println(banana.elapsedTime());}}//prints elapsed time
also, did this little trace for Petar's code:
trace for a small array
M=2
array: length 5
index value
0 -2
1 2
2 3
3 10
4 1
for the for-loop central to program:
i = 0 sum = 0 + -2 = -2
sum>bestSum? no
sum<0? yes so sum=0, currentStart = 0(i)+1 = 1,
and continue loop with next value of i
i = 1 sum = 0 + 2 = 2
sum>bestSum? yes so bestSum=2 and bestStart=currentStart=1 and bestEnd=1=1
sum<0? no
1(i)-1(currentStart)+1==M? 1-1+1=1 so no
i = 2 sum = 2+3 = 5
sum>bestSum? yes so bestSum=5, bestStart=currentStart=1, and bestEnd=2
sum<0? no
2(i)-1(currentStart)+1=M? 2-1+1=2 so yes:
sum = sum-a[1(currentStart)] =5-2=3. currentStart++=2.
(sum<0 || a[currentStart]<0)? no
i = 3 sum=3+10=13
sum>bestSum? yes so bestSum=13 and bestStart=currentStart=2 and bestEnd=3
sum<0? no
3(i)-2(currentStart)+1=M? 3-2+1=2 so yes:
sum = sum-a[2(currentStart)] =13-3=10. currentStart++=3.
(sum<0 || a[currentStart]<0)? no
i = 4 sum=10+1=11
sum>bestSum? no
sum<0? no
4(i)-3(currentStart)+1==M? yes but changes to sum and currentStart now are
irrelevant as loop terminates
Thanks again! Just wanted to post a final answer and I was slightly proud for catching the all negative thing.

Each element is looked at at most twice (once in the outer loop, and once in the while loop).
O(2N) = O(N)
Explanation: each element is added to the current sum. When the sum goes below zero, it is reset to zero. When we hit M length sequence, we try to remove elements from the beginning, until the sum is > 0 and there are no negative elements in the beginning of it.
By the way, when all elements are < 0 inside the array, you should take only the largest negative number. This is a special edge case which I haven't written below.
Beware of bugs in the below code - it only illustrates the idea. I haven't run it.
int sum = 0;
int currentStart = 0;
int bestSum = 0;
int bestStart = 0;
int bestEnd = 0;
for (int i = 0; i < N; i++) {
sum += a[i];
if (sum > bestSum) {
bestSum = sum;
bestStart = currentStart;
bestEnd = i;
}
if (sum < 0) {
sum = 0;
currentStart = i + 1;
continue;
}
//Our sequence length has become equal to M
if (i - currentStart + 1 == M) {
do {
sum -= a[currentStart];
currentStart++;
} while ((currentStart <= i) && (sum < 0 || a[currentStart] < 0));
}
}
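As an alternative that does not rely on trimming elements from the front of the window, here is a sketch of a different, well-known linear technique: prefix sums combined with a monotonic deque that always exposes the smallest prefix sum among the last M start positions. It uses java.util.ArrayDeque, so it goes beyond what the question's class has covered. The class name BoundedMaxSum is hypothetical, and the sketch assumes N >= 1 and M >= 1. It handles all-negative arrays without a special case, because the best sum starts at Long.MIN_VALUE instead of 0.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BoundedMaxSum {
    // Returns {bestSum, bestStart, bestEnd} (inclusive indices) for a non-empty
    // contiguous run of at most m elements.
    static long[] best(long[] a, int m) {
        int n = a.length;
        long[] prefix = new long[n + 1]; // prefix[i] = a[0] + ... + a[i - 1]
        for (int i = 0; i < n; i++) {
            prefix[i + 1] = prefix[i] + a[i];
        }
        // Candidate start positions, kept so that their prefix sums are strictly increasing.
        Deque<Integer> candidates = new ArrayDeque<>();
        long bestSum = Long.MIN_VALUE;
        int bestStart = -1;
        int bestEnd = -1;
        for (int end = 1; end <= n; end++) {
            int newStart = end - 1;
            // A start with a larger-or-equal prefix sum can never beat the newer one.
            while (!candidates.isEmpty() && prefix[candidates.peekLast()] >= prefix[newStart]) {
                candidates.pollLast();
            }
            candidates.addLast(newStart);
            // Drop starts that would make the run longer than m.
            while (candidates.peekFirst() < end - m) {
                candidates.pollFirst();
            }
            long sum = prefix[end] - prefix[candidates.peekFirst()];
            if (sum > bestSum) {
                bestSum = sum;
                bestStart = candidates.peekFirst();
                bestEnd = end - 1;
            }
        }
        return new long[] {bestSum, bestStart, bestEnd};
    }

    public static void main(String[] args) {
        long[] r = best(new long[] {-2, 2, 3, 10, 1}, 2);
        System.out.println(r[0] + " from a[" + r[1] + "] to a[" + r[2] + "]"); // prints 13 from a[2] to a[3]
    }
}
```

Each index enters and leaves the deque at most once, so the running time is linear in N regardless of M.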

I think what you are looking for is discussed in detail here
Find the subsequence with largest sum of elements in an array
I have explained 2 different solutions to resolve this problem with O(N) - linear time.

Dynamic programming with large inputs

I am trying to solve a classic knapsack problem with a huge capacity of 30,000,000. It works well up to 20,000,000, but then it runs out of memory:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
I have tried dividing all values and the capacity by 1,000,000, but that produces floats and I don't think it is the correct approach. I have also tried making the arrays and the matrix of type long, but that does not help.
Perhaps another data-structure?
Any pointers welcome...
Code:
public class Knapsack {
public static void main(String[] args) {
int N = Integer.parseInt(args[0]); // number of items
int W = Integer.parseInt(args[1]); // maximum weight of knapsack
int[] profit = new int[N+1];
int[] weight = new int[N+1];
// generate random instance, items 1..N
for (int n = 1; n <= N; n++) {
profit[n] = (int) (Math.random() * 1000000);
weight[n] = (int) (Math.random() * W);
}
// opt[n][w] = max profit of packing items 1..n with weight limit w
// sol[n][w] = does opt solution to pack items 1..n with weight limit w include item n?
int[][] opt = new int[N+1][W+1];
boolean[][] sol = new boolean[N+1][W+1];
for (int n = 1; n <= N; n++) {
for (int w = 1; w <= W; w++) {
// don't take item n
int option1 = opt[n-1][w];
// take item n
int option2 = Integer.MIN_VALUE;
if (weight[n] <= w) option2 = profit[n] + opt[n-1][w-weight[n]];
// select better of two options
opt[n][w] = Math.max(option1, option2);
sol[n][w] = (option2 > option1);
}
}
// determine which items to take
boolean[] take = new boolean[N+1];
for (int n = N, w = W; n > 0; n--) {
if (sol[n][w]) { take[n] = true; w = w - weight[n]; }
else { take[n] = false; }
}
// print results
System.out.println("item" + "\t" + "profit" + "\t" + "weight" + "\t" + "take");
for (int n = 1; n <= N; n++) {
System.out.println(n + "\t" + profit[n] + "\t" + weight[n] + "\t" + take[n]);
}
// Copyright © 2000–2011, Robert Sedgewick and Kevin Wayne. Last updated: Wed Feb 9 09:20:16 EST 2011.
}
}

Here are a couple of tricks I've used for things like that.
First, a variant of a sparse matrix. It's not really sparse, but instead of assuming that "non-stored entries" are zero, you assume they're the same as the entry before. This can work in either direction (in the direction of the capacity or in the direction of the items), afaik not (easily) in both directions at the same time. Good trick, but doesn't defeat instances that are huge in both directions.
Secondly, a combination of Dynamic Programming and Branch & Bound. First, use DP with only the "last two rows". That gives you the value of the optimal solution. Then use Branch & Bound to find the subset of items that corresponds to the optimal solution. Sort by value/weight, apply the relaxation value[next_item] * (capacity_left / weight[next_item]) to bound with. Knowing the optimal value ahead of time makes pruning very effective.
The "last two rows" refers to the "previous row" (a slice of the tableau that has the solutions for all items up to i) and the "current row" (the one you're filling right now). It could look something like this, for example (this is C# btw, but should be easy to port):
int[] row0 = new int[capacity + 1], row1 = new int[capacity + 1];
for (int i = 0; i < weights.Length; i++)
{
for (int j = 0; j < row1.Length; j++)
{
int value_without_this_item = row1[j];
if (j >= weights[i])
row0[j] = Math.Max(value_without_this_item,
row1[j - weights[i]] + values[i]);
else
row0[j] = value_without_this_item;
}
// swap rows
int[] t = row1;
row1 = row0;
row0 = t;
}
int optimal_value = row1[capacity];
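A straightforward Java port of the C# snippet above might look like the following sketch. The class name KnapsackTwoRows, the method name optimalValue and the row names prev and cur are my own; the logic is meant to mirror the C# code line for line. Like the original, it returns only the optimal value, not the chosen items.

```java
class KnapsackTwoRows {
    // Returns the best total value for the given capacity using two rows of size capacity + 1.
    static int optimalValue(int[] weights, int[] values, int capacity) {
        int[] prev = new int[capacity + 1]; // solutions using the items seen so far
        int[] cur = new int[capacity + 1];  // row being filled for the current item
        for (int i = 0; i < weights.length; i++) {
            for (int j = 0; j <= capacity; j++) {
                int valueWithoutThisItem = prev[j];
                if (j >= weights[i]) {
                    cur[j] = Math.max(valueWithoutThisItem, prev[j - weights[i]] + values[i]);
                } else {
                    cur[j] = valueWithoutThisItem;
                }
            }
            // Swap rows: the row just filled becomes the previous row.
            int[] t = prev;
            prev = cur;
            cur = t;
        }
        return prev[capacity];
    }

    public static void main(String[] args) {
        int[] weights = {1, 3, 4, 5};
        int[] values = {1, 4, 5, 7};
        System.out.println(optimalValue(weights, values, 7)); // prints 9
    }
}
```

With a capacity of 30,000,000, the two int rows take about 240 MB in total, independent of the number of items, whereas the question's opt and sol matrices grow with N times W.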

Use a recursive method to solve the problem. see http://penguin.ewu.edu/~trolfe/Knapsack01/Knapsack01.html for further information.
Hope it will be of help.

Break your for loops down into method calls.
This will have the effect of making the local variables GC'able once the method itself has completed.
So instead of keeping nested for loops inside the same main method, call a method with the same functionality, which in turn calls a second method. You are effectively breaking the code up into small packets of local variables, which can be collected once they go out of scope.

String Substrings Generation in Java

I am trying to find all substrings within a given string. For a random string like rymis, the substrings would be [i, is, m, mi, mis, r, ry, rym, rymi, rymis, s, y, ym, ymi, ymis]. According to Wikipedia, a string of length n has n * (n + 1) / 2 substrings in total.
Which can be found by doing the following snippet of code:
final Set<String> substring_set = new TreeSet<String>();
final String text = "rymis";
for(int iter = 0; iter < text.length(); iter++)
{
for(int ator = 1; ator <= text.length() - iter; ator++)
{
substring_set.add(text.substring(iter, iter + ator));
}
}
This works for short strings but obviously slows down for longer ones, as the algorithm is near O(n^2).
I have also been reading up on suffix trees, which can do insertions in O(n), and noticed that the same substrings could be obtained by repeatedly inserting the string and removing 1 character from the right until it is empty. That should be about O(1 + … + (n-1) + n), a summation that comes to n(n+1)/2 = (n^2 + n)/2, which again is near O(n^2). There seem to be some suffix trees that can do insertions in log2(n) time, though, which would be a factor better at O(n log2(n)).
Before I delve into suffix trees: is this the correct route to take, is there another algorithm that would be more efficient for this, or is O(n^2) as good as it gets?

I am fairly sure you can't beat O(n^2) for this as has been mentioned in comments to the question.
I was interested in different ways of coding that so I made one quickly, and I decided to post it here.
I don't think the solution I put here is asymptotically faster, but counting the inner and outer loops, there are fewer iterations. There are also fewer duplicate insertions here; in fact, there are none.
String str = "rymis";
ArrayList<String> subs = new ArrayList<String>();
while (str.length() > 0) {
subs.add(str);
for (int i=1;i<str.length();i++) {
subs.add(str.substring(i));
subs.add(str.substring(0,i));
}
str = str.substring(1, Math.max(str.length()-1, 1));
}
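To make the lower bound concrete: a string of length n has n(n+1)/2 substrings when counted by position, so any code that materializes each of them performs at least that many insertions. The sketch below, with a hypothetical class name SubstringCount, counts them for a given string without storing anything, and also adds up their total length, which is n(n+1)(n+2)/6 characters.

```java
class SubstringCount {
    // Returns {number of substrings, total number of characters across all of them},
    // counting by (start, end) position rather than by distinct content.
    static long[] countSubstrings(String text) {
        long count = 0;
        long chars = 0;
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                count++;
                chars += end - start;
            }
        }
        return new long[] {count, chars};
    }

    public static void main(String[] args) {
        long[] r = countSubstrings("rymis");
        System.out.println(r[0] + " substrings, " + r[1] + " characters"); // prints 15 substrings, 35 characters
    }
}
```

For n = 100000 that is about 5 * 10^9 substrings, so the size of the output, rather than the choice of data structure, is what limits any approach that lists them all.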

This is an inverted version of your example, but still O(n^2).
String s = "rymis";
ArrayList<String> al = new ArrayList<String>();
for (int i = 1; i <= s.length(); i++) { // collect substrings of length i
    for (int k = 0; k < s.length(); k++) { // start index for substring of length i
        if (i + k > s.length()) break; // if the substring of length i runs over the end of s, move on
        al.add(s.substring(k, k + i)); // add substring of length i at index k to al
    }
}
Let me see if I can post a recursive example. I started doing a couple recursive tries and came up with this iterative approach using dual sliding windows as a sort of improvement to the above method. I had a recursive example in mind but was having issues reducing the tree size.
String s = "rymis";
ArrayList<String> al = new ArrayList<String>();
for (int i = 1; i < s.length() + 1; i++)
{
    for (int k = 0; k < s.length(); k++)
    {
        int a = k;                      // left bound window 1
        int b = k + i;                  // right bound window 1
        if (b > s.length()) break;      // window 1 would run past the end of s
        int c = s.length() - 1 - k - i; // left bound window 2
        int d = s.length() - 1 - k;     // right bound window 2
        al.add(s.substring(a, b));      // add window 1
        if (a < c) al.add(s.substring(c, d)); // add window 2
    }
}
There was an issue mentioned with using arraylist affecting performance so this next one will be with more basic structures.
String s = "rymis";
StringBuilder sb = new StringBuilder();
for (int i = 1; i < s.length() + 1; i++)
{
    for (int k = 0; k < s.length(); k++)
    {
        int a = k;                      // left bound window 1
        int b = k + i;                  // right bound window 1
        if (b > s.length()) break;      // window 1 would run past the end of s
        int c = s.length() - 1 - k - i; // left bound window 2
        int d = s.length() - 1 - k;     // right bound window 2
        if (sb.length() > 0) sb.append(","); // separator before every entry except the first
        sb.append(s.substring(a, b));   // add window 1
        if (a < c) {
            sb.append(",");
            sb.append(s.substring(c, d)); // add window 2
        }
    }
}
String joined = sb.toString();
String[] sArray = joined.split("\\,");

I am not sure about the exact algorithm but you may look into Ropes:
http://en.wikipedia.org/wiki/Rope_(computer_science)
In summary, ropes are better suited when the data is large and frequently modified.
I believe Rope outperforms String for your problem.
