Suffix array O(NlogN) implementation

I'm looking into the specific O(N log N) implementation of a suffix array found at this link: https://sites.google.com/site/indy256/algo/suffix_array
I understand the core concepts, but I'm having trouble following the implementation in its entirety.
public static int[] suffixArray(CharSequence S) {
    int n = S.length();
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++)
        order[i] = n - 1 - i;
    // stable sort of characters
    Arrays.sort(order, (a, b) -> Character.compare(S.charAt(a), S.charAt(b)));
    int[] sa = new int[n];
    int[] classes = new int[n];
    for (int i = 0; i < n; i++) {
        sa[i] = order[i];
        classes[i] = S.charAt(i);
    }
    // sa[i] - suffix on i'th position after sorting by first len characters
    // classes[i] - equivalence class of the i'th suffix after sorting by first len characters
    for (int len = 1; len < n; len *= 2) {
        int[] c = classes.clone();
        for (int i = 0; i < n; i++) {
            // condition sa[i - 1] + len < n simulates 0-symbol at the end of the string
            // a separate class is created for each suffix followed by simulated 0-symbol
            classes[sa[i]] = i > 0 && c[sa[i - 1]] == c[sa[i]] && sa[i - 1] + len < n && c[sa[i - 1] + len / 2] == c[sa[i] + len / 2] ? classes[sa[i - 1]] : i;
        }
        // Suffixes are already sorted by first len characters
        // Now sort suffixes by first len * 2 characters
        int[] cnt = new int[n];
        for (int i = 0; i < n; i++)
            cnt[i] = i;
        int[] s = sa.clone();
        for (int i = 0; i < n; i++) {
            // s[i] - order of suffixes sorted by first len characters
            // (s[i] - len) - order of suffixes sorted only by second len characters
            int s1 = s[i] - len;
            // sort only suffixes of length > len, others are already sorted
            if (s1 >= 0)
                sa[cnt[classes[s1]]++] = s1;
        }
    }
    return sa;
}
In particular, I don't understand what the cnt[] array is for and how it is used in the final loop.
Any pointers would be helpful.
Thanks.
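One possible reading of cnt[], offered as a sketch rather than as the original author's explanation: after the class-assignment loop, the class id of a suffix is the index in sa where its group starts, so initialising cnt[c] = c makes cnt[c] the next free slot of the group that starts at c. Writing sa[cnt[class]++] while walking the old order is then a stable bucket pass, like the distribution step of counting sort. The class BucketDemo, its method distribute, and the sample data are hypothetical names chosen for this illustration.

```java
import java.util.Arrays;

class BucketDemo {
    // secondaryOrder: items listed in the order of their secondary key.
    // bucketStartOf[item]: index in the output where the item's bucket begins
    // (this plays the role of classes[] in the suffix array code).
    static int[] distribute(int[] secondaryOrder, int[] bucketStartOf) {
        int n = bucketStartOf.length;
        int[] cnt = new int[n];
        for (int i = 0; i < n; i++)
            cnt[i] = i; // cnt[c] = next free slot of the bucket that starts at c
        int[] out = new int[n];
        for (int item : secondaryOrder)
            out[cnt[bucketStartOf[item]]++] = item; // place the item, advance the bucket pointer
        return out;
    }

    public static void main(String[] args) {
        // Items 1 and 3 share the bucket starting at 0, item 4 owns the bucket
        // starting at 2, items 0 and 2 share the bucket starting at 3.
        int[] bucketStartOf = {3, 0, 3, 0, 2};
        int[] secondaryOrder = {2, 3, 4, 0, 1};
        System.out.println(Arrays.toString(distribute(secondaryOrder, bucketStartOf)));
        // prints [3, 1, 4, 2, 0]
    }
}
```

Inside each bucket the items keep the order they had in secondaryOrder, which is what lets one pass refine a sort by the first len characters into a sort by the first 2 * len characters.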

Related

Why don't we redistribute inversely in this implementation of Radix sort of strings

I found this implementation of LSD radix sort for strings:
public static void sort(String[] input, int w) {
    String[] aux = new String[input.length];
    //ascii chars
    int R = 256;
    int n = input.length;
    for(int d = w-1; d >= 0; d--) {
        int[] count = new int[R+1];
        //update the frequency at i+1 index
        for(int i=0; i<n; i++) {
            count[input[i].charAt(d) + 1] ++;
        }
        //transform the frequency into indices
        for(int r=0; r< R; r++) {
            count[r+1] += count[r];
        }
        //redistribute
        for(int i=0; i<n; i++) {
            aux[count[input[i].charAt(d)]++] = input[i];
        }
        for(int i=0; i<n; i++) {
            input[i] = aux[i];
        }
    }
}
But I don't understand two things:
Why do we have count[input[i].charAt(d) + 1] ++; here rather than count[input[i].charAt(d)] ++; ?
Why don't we redistribute the strings in reverse order? I think that is much simpler (my implementation):
public static void sort(String[] arr, int lenStr) {
    int R = 256;
    int len = arr.length;
    String[] arrSorted = new String[len];
    for (int d = lenStr - 1; d >= 0; d--) {
        // frequency count of each character
        int[] count = new int[R + 1];
        for (int i = 0; i < len; i++) {
            count[arr[i].charAt(d)]++;
        }
        for (int i = 1; i < count.length; i++) {
            count[i] += count[i - 1];
        }
        for (int i = len - 1; i >= 0; i--) {
            count[arr[i].charAt(d)]--;
            arrSorted[count[arr[i].charAt(d)]] = arr[i];
        }
        for (int i = 0; i < len; i++) {
            arr[i] = arrSorted[i];
        }
    }
}

I think most of it comes down to personal preference.
Why do we have count[input[i].charAt(d) + 1] ++; here rather than count[input[i].charAt(d)] ++; ?
After the second inner loop, their count[x+1] holds how many times character x and every character before it appear. For example, we might have the initial counts:
count[0] = 0
count[1] = 2
count[2] = 3
Then after the second for loop we will have:
count[0] = 0
count[1] = 2
count[2] = 5
This means that character 0 takes the positions from count[0] up to, but not including, count[1]; character 1 takes the positions from count[1] up to count[2]; and in general, character x takes the positions from count[x] up to count[x+1]. This allows them to do this:
for(int i=0; i<n; i++) {
    aux[count[input[i].charAt(d)]++] = input[i];
}
This is a nice one-liner that ties everything together neatly IMO, because count[x] now means the position at which the next string with character x should be placed in the sorted array.
Your implementation works just as well and can also be turned into a one-liner:
for (int i = len - 1; i >= 0; i--) {
    arrSorted[--count[arr[i].charAt(d)]] = arr[i];
}
If you think it's simpler then use it; I don't see any downsides (assuming you've tested it well enough). Note that iterating from the end, as you do, is what keeps your version stable, and LSD radix sort depends on every pass being stable. It's a pretty complex algorithm, and once you understand one way of doing it, you tend to stick with it. This is just the implementation that stuck, I guess. Simplicity is highly subjective here; personally I think your version is just as complex.
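To check the claim that both variants behave the same, here is a minimal self-contained sketch that runs the forward (count[c + 1]) and the backward (--count[c]) redistribution on the same input. The class name LsdVariants, the method names, and the sample words are made up for this example, and it assumes all strings have length w and only contain characters below 256.

```java
import java.util.Arrays;

class LsdVariants {
    static final int R = 256; // assumed alphabet size (extended ASCII)

    // Forward variant: count[c + 1] is used so that, after the prefix sums,
    // count[c] is the first free position for character c.
    static void sortForward(String[] a, int w) {
        String[] aux = new String[a.length];
        for (int d = w - 1; d >= 0; d--) {
            int[] count = new int[R + 1];
            for (String s : a) count[s.charAt(d) + 1]++;
            for (int r = 0; r < R; r++) count[r + 1] += count[r];
            for (String s : a) aux[count[s.charAt(d)]++] = s;
            System.arraycopy(aux, 0, a, 0, a.length);
        }
    }

    // Backward variant: after the prefix sums, count[c] is one past the last
    // position for character c, so we pre-decrement and scan from the end.
    static void sortBackward(String[] a, int w) {
        String[] aux = new String[a.length];
        for (int d = w - 1; d >= 0; d--) {
            int[] count = new int[R];
            for (String s : a) count[s.charAt(d)]++;
            for (int r = 1; r < R; r++) count[r] += count[r - 1];
            for (int i = a.length - 1; i >= 0; i--) aux[--count[a[i].charAt(d)]] = a[i];
            System.arraycopy(aux, 0, a, 0, a.length);
        }
    }

    public static void main(String[] args) {
        String[] x = {"dab", "cab", "add", "bad", "ace"};
        String[] y = x.clone();
        sortForward(x, 3);
        sortBackward(y, 3);
        System.out.println(Arrays.toString(x)); // prints [ace, add, bad, cab, dab]
        System.out.println(Arrays.equals(x, y)); // prints true
    }
}
```

Scanning the backward variant from index 0 instead would reverse the order of equal keys in every pass and break the sort, so the direction of the last loop is not a free choice in that version.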

Two player coin game : tracing optimal sequence in dynamic programming

Two players take turns choosing one of the outer coins. At the end we calculate the difference between the two players' scores, given that both play optimally. For example, for the list {4,3,2,1},
the optimal sequence would be 4, 3, 2, 1. Then I will score 4+2 = 6 and the opponent 3+1 = 4.
I have developed an algorithm as follows:
My job is to print the scores and also the optimal sequence as indices, so for the array {4,3,2,1} the optimal sequence would be 0,1,2,3.
The running time and memory must not exceed O(n^2).
Therefore I implemented the above algorithm bottom-up: in an i*j table, subproblems are solved one by one until only the main problem remains, which is located at the top right corner (where i = 0 and j = n-1). Calculating the scores works, but I have no idea how to trace the optimal sequence, since only the score of each subproblem is saved and used in the next one, while the sequence that led to the final result is hard to trace back.
I tried creating pairs or a multidimensional ArrayList to record the sequences alongside their corresponding memo[i][j]. That worked, but the memory needed is then greater than n^2, which is not allowed in my task.
So, does anyone have a better idea that does not require that much memory?
Any help would be appreciated, cheers!
My code:
public int maxGain(int[] values) {
    int n = values.length;
    int [][] memo = new int[n][n];
    for (int i = 0; i < n; i++)
        memo[i][i] = values[i];
    for (int i = 0, j = 1; j < n; i++, j++)
        memo[i][j] = Math.max(values[i], values[j]);
    for (int k = 2; k < n; k++) {
        for (int i = 0, j = k; j < n; i++, j++) {
            int a = values[i] + Math.min(memo[i + 2][j], memo[i + 1][j - 1]);
            int b = values[j] + Math.min(memo[i + 1][j - 1], memo[i][j - 2]);
            memo[i][j] = Math.max(a, b);
        }
    }
    return memo[0][n - 1];
}

I guess your question is similar to LeetCode 486 (Predict the Winner), with some minor changes that you would want to make:
Java
class Solution {
    public boolean maxGain(int[] nums) {
        int length = nums.length;
        int[][] dp = new int[length][length];
        for (int i = 0; i < length; i++)
            dp[i][i] = nums[i];
        for (int l = 1; l < length; l++)
            for (int i = 0; i < length - l; i++) {
                int j = i + l;
                dp[i][j] = Math.max(nums[i] - dp[i + 1][j], nums[j] - dp[i][j - 1]);
            }
        return dp[0][length - 1] > -1;
    }
}
Python
class Solution:
    def max_gain(self, nums):
        length = len(nums)
        memo = [[-1 for _ in range(length)] for _ in range(length)]

        # functools.lru_cache(None)
        def f():
            def helper(nums, i, j):
                if i > j:
                    return 0
                if i == j:
                    return nums[i]
                if memo[i][j] != -1:
                    return memo[i][j]
                cur = max(nums[i] + min(helper(nums, i + 2, j), helper(nums, i + 1, j - 1)),
                          nums[j] + min(helper(nums, i, j - 2), helper(nums, i + 1, j - 1)))
                memo[i][j] = cur
                return cur

            score = helper(nums, 0, length - 1)
            total = sum(nums)
            return 2 * score >= total

        return f()
O(N) Memory
The second solution provided in this link may need only O(N) space:
class Solution {
    public boolean maxGain(int[] nums) {
        if (nums == null)
            return true;
        int length = nums.length;
        int[] dp = new int[length];
        for (int i = length - 1; i >= 0; i--) {
            for (int j = i; j < length; j++) {
                if (i == j)
                    dp[i] = nums[i];
                else
                    dp[j] = Math.max(nums[i] - dp[j], nums[j] - dp[j - 1]);
            }
        }
        return dp[length - 1] > -1;
    }
}
Reference
Most optimal solutions are here in the discussion board
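The listings above return only a score or a boolean, while the question asks for the sequence of indices. One way to recover it without storing any sequences, sketched here under the assumption that the full n*n difference table is kept: once dp is filled, start from the whole range and replay at each step the choice that produced dp[i][j]. The class name CoinGameTrace and its method names are made up for this example, and ties are broken in favour of the left coin, which is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;

class CoinGameTrace {
    // dp[i][j] = best achievable (player to move minus opponent) on values[i..j]
    static int[][] buildTable(int[] values) {
        int n = values.length;
        int[][] dp = new int[n][n];
        for (int i = 0; i < n; i++)
            dp[i][i] = values[i];
        for (int len = 1; len < n; len++)
            for (int i = 0; i + len < n; i++) {
                int j = i + len;
                dp[i][j] = Math.max(values[i] - dp[i + 1][j], values[j] - dp[i][j - 1]);
            }
        return dp;
    }

    // Replays the decisions from the finished table: O(n) extra time and
    // O(n) extra memory on top of the O(n^2) table.
    static List<Integer> optimalMoves(int[] values) {
        int[][] dp = buildTable(values);
        List<Integer> moves = new ArrayList<>();
        int i = 0, j = values.length - 1;
        while (i < j) {
            if (values[i] - dp[i + 1][j] >= values[j] - dp[i][j - 1])
                moves.add(i++); // taking the left coin is optimal here
            else
                moves.add(j--); // taking the right coin is optimal here
        }
        if (i == j)
            moves.add(i); // last remaining coin
        return moves;
    }

    public static void main(String[] args) {
        int[] values = {4, 3, 2, 1};
        int diff = buildTable(values)[0][values.length - 1];
        int sum = 0;
        for (int v : values) sum += v;
        System.out.println(optimalMoves(values)); // prints [0, 1, 2, 3]
        System.out.println((sum + diff) / 2 + " vs " + (sum - diff) / 2); // prints 6 vs 4
    }
}
```

The same replay should work on the memo table from the question by comparing the two candidates a and b at each step. It does not work with the O(N) array, because that version overwrites the entries the replay needs.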

Finding the count of common array sequence

I am trying to find the length of the longest sequence of numbers shared by two arrays. Given the following two arrays:
int [] a = {1, 2, 3, 4, 6, 8,};
int [] b = {2, 1, 2, 3, 5, 6,};
The result should be 3, as the longest common sequence between the two is {1, 2, 3}.
The numbers must be adjacent in both arrays for the program to count them.
I have thought about it and written a small beginning, but I am not sure how to continue:
public static int longestSharedSequence(int[] arr, int[] arr2){
int start = 0;
for(int i = 0; i < arr.length; i++){
for(int j = 0; j < arr2.length; j++){
int n = 0;
while(arr[i + n] == arr2[j + n]){
n++;
if(((i + n) >= arr.length) || ((j + n) >= arr2.length)){
break;
}
}
}

That is a very good start that you have. All you need to do is have some way of keeping track of the best n value that you have encountered. So at the start of the method, declare int maxN = 0. Then, after the while loop within the two for loops, check if n (the current matching sequence length) is greater than maxN (the longest matching sequence length encountered so far). If so, update maxN to the value of n.
Since you also want the matching elements to be in sequential order, you will need to check that the elements in the two arrays not only match, but that they are also 1 greater than the previous element in each array.
Putting these together gives the following code:
public static int longestSharedSequence(int[] arr, int[] arr2) {
int maxN = 0;
for (int i = 0; i < arr.length; i++) {
for (int j = 0; j < arr2.length; j++) {
int n = 0;
// Check that elements match and that they are either the
// first element in the sequence that is currently being
// compared or that they are 1 greater than the previous
// element
while (arr[i + n] == arr2[j + n]
&& (n == 0 || arr[i + n] == arr[i + n - 1] + 1)) {
n++;
if (i + n >= arr.length || j + n >= arr2.length) {
break;
}
}
// If we found a longer sequence than the previous longest,
// update maxN
if (n > maxN) {
maxN = n;
}
}
}
return maxN;
}

I didn't think of anything smarter than the path you were already on:
import java.util.Arrays;
import java.util.Random;
public class MaxSeq {
public static void main(String... args) {
int[] a = new int[10000];
int[] b = new int[10000];
final Random r = new Random();
Arrays.parallelSetAll(a, i -> r.nextInt(100));
Arrays.parallelSetAll(b, i -> r.nextInt(100));
System.out.println(longestSharedSequence(a, b));
}
private static int longestSharedSequence(final int[] arr, final int[] arr2) {
int max = 0;
for (int i = 0; i < arr.length; i++) {
for (int j = 0; j < arr2.length; j++) {
int n = 0;
while ((i + n) < arr.length
&& (j + n) < arr2.length
&& arr[i + n] == arr2[j + n]) {
n++;
}
max = Math.max(max, n);
}
}
return max;
}
}
see: https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
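Because the matching elements must be adjacent, this is usually described as the longest common substring problem rather than the subsequence variant. As an alternative to the brute-force loops above, here is a sketch of the standard dynamic-programming approach, which runs in O(n*m) time: the length of the common run ending at a[i-1] and b[j-1] is one more than the run ending at the previous pair when the two elements match, and 0 otherwise. The class and method names are made up for this example.

```java
class LongestCommonRun {
    static int longestCommonRun(int[] a, int[] b) {
        int best = 0;
        // prev[j] = length of the common run ending at a[i-2] and b[j-1]
        int[] prev = new int[b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            int[] cur = new int[b.length + 1];
            for (int j = 1; j <= b.length; j++) {
                if (a[i - 1] == b[j - 1]) {
                    cur[j] = prev[j - 1] + 1; // extend the run ending one step earlier
                    best = Math.max(best, cur[j]);
                }
            }
            prev = cur;
        }
        return best;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 6, 8};
        int[] b = {2, 1, 2, 3, 5, 6};
        System.out.println(longestCommonRun(a, b)); // prints 3
    }
}
```

Only two rows of the table are kept, so the extra memory is proportional to the length of the second array.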

Why does my BigInteger.multiply() show NullPointerException?

I tried to store the factorials from 1 to 1000 in a BigInteger array and to calculate the sum of the digits of each one. Why does this code throw java.lang.NullPointerException? I think everything was initialized correctly.
class Main {
public static void main(String[] args) {
BigInteger[] b = new BigInteger[1010];
int[] ara = new int[1010];
BigInteger c;
b[0] = BigInteger.ONE;
b[1] = BigInteger.ONE;
ara[0] = ara[1] = 1;
String s;
int l, sum;
for (int i = 2; i <= 1001; i++) {
c = b[i - 1];
b[i] = b[i].multiply(c);
s = b[i].toString();
l = s.length();
sum = 0;
for (int j = 0; j < l; j++) {
sum += Character.getNumericValue(s.charAt(j));
}
ara[i] = sum;
}

Problem :
b[i] = b[i].multiply(c);
And look at your b array which you initialised
b[0] = BigInteger.ONE;
b[1] = BigInteger.ONE;
And now look at the for loop
for (int i = 2; i <= 1001; i++) {
c = b[i - 1];
b[i] = b[i].multiply(c);
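Only indexes 0 and 1 have been assigned. Every other element of b is still null, so b[2].multiply(c) throws the NPE on the first iteration.
The loop visits indexes 2 to 1001, but only two elements of the array hold a value. Each b[i] must be assigned before multiply() is called on it.
Solution :
Change your for loop as below and keep everything else the same. Starting b[i] at BigInteger.valueOf(i) avoids the NPE and makes b[i] equal to i!, because b[i - 1] already holds (i - 1)!.
for (int i = 2; i <= 1001; i++) {
    b[i] = BigInteger.valueOf(i);
    c = b[i - 1];
    b[i] = b[i].multiply(c);
    s = b[i].toString();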

The algorithm for factorial is to take some value n and multiply it by n - 1, then n - 2, and so on until the value 1 is reached. Your algorithm doesn't appear to do that (it generates ones). I think you wanted something like
int len = 1010;
BigInteger[] b = new BigInteger[len];
int[] ara = new int[len];
for (int i = 0; i < len; i++) {
// calculate factorial.
b[i] = BigInteger.valueOf(i + 1);
for (int j = i; j > 1; j--) {
b[i] = b[i].multiply(BigInteger.valueOf(j));
}
// now sum digits.
for (char ch : b[i].toString().toCharArray()) {
ara[i] += Character.getNumericValue(ch);
}
}
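Since n! = n * (n - 1)!, each factorial can also be derived from the previous one with a single multiplication instead of restarting the product for every index. The following is a minimal sketch of that idea. The class name FactorialDigitSums and the method digitSums are made up for this example, and sums[i] is defined here as the digit sum of i!.

```java
import java.math.BigInteger;

class FactorialDigitSums {
    // Returns an array where sums[i] is the sum of the decimal digits of i!.
    static int[] digitSums(int max) {
        int[] sums = new int[max + 1];
        BigInteger factorial = BigInteger.ONE; // 0! = 1
        for (int i = 0; i <= max; i++) {
            if (i > 0)
                factorial = factorial.multiply(BigInteger.valueOf(i)); // i! = i * (i - 1)!
            int sum = 0;
            for (char ch : factorial.toString().toCharArray())
                sum += ch - '0';
            sums[i] = sum;
        }
        return sums;
    }

    public static void main(String[] args) {
        int[] sums = digitSums(1000);
        System.out.println(sums[10]);  // 10! = 3628800, prints 27
        System.out.println(sums[100]); // prints 648
    }
}
```

Keeping only the running product also removes the need for a BigInteger array, which is where the null elements in the question came from.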

Suffix Array Implementation Error

I keep getting compiler errors with a suffix array implementation that sorts using Arrays.sort.
I get the following errors:
a cannot be resolved to a variable
Syntax error on token ",", . expected
Syntax error on token "-", -- expected
a cannot be resolved to a variable
b cannot be resolved to a variable
In the following code:
import java.util.*;
public class SuffixArray {
// sort suffixes of S in O(n*log(n))
public static int[] suffixArray(CharSequence S) {
int n = S.length();
Integer[] order = new Integer[n];
for (int i = 0; i < n; i++)
order[i] = n - 1 - i;
// stable sort of characters
Arrays.sort(order, (a, b) -> Character.compare(S.charAt(a), S.charAt(b)));
int[] sa = new int[n];
int[] classes = new int[n];
for (int i = 0; i < n; i++) {
sa[i] = order[i];
classes[i] = S.charAt(i);
}
// sa[i] - suffix on i'th position after sorting by first len characters
// classes[i] - equivalence class of the i'th suffix after sorting by first len characters
for (int len = 1; len < n; len *= 2) {
int[] c = classes.clone();
for (int i = 0; i < n; i++) {
// condition sa[i - 1] + len < n simulates 0-symbol at the end of the string
// a separate class is created for each suffix followed by simulated 0-symbol
classes[sa[i]] = i > 0 && c[sa[i - 1]] == c[sa[i]] && sa[i - 1] + len < n && c[sa[i - 1] + len / 2] == c[sa[i] + len / 2] ? classes[sa[i - 1]] : i;
}
// Suffixes are already sorted by first len characters
// Now sort suffixes by first len * 2 characters
int[] cnt = new int[n];
for (int i = 0; i < n; i++)
cnt[i] = i;
int[] s = sa.clone();
for (int i = 0; i < n; i++) {
// s[i] - order of suffixes sorted by first len characters
// (s[i] - len) - order of suffixes sorted only by second len characters
int s1 = s[i] - len;
// sort only suffixes of length > len, others are already sorted
if (s1 >= 0)
sa[cnt[classes[s1]]++] = s1;
}
}
return sa;
}
// sort rotations of S in O(n*log(n))
public static int[] rotationArray(CharSequence S) {
int n = S.length();
Integer[] order = new Integer[n];
for (int i = 0; i < n; i++)
order[i] = i;
Arrays.sort(order, (a, b) -> Character.compare(S.charAt(a), S.charAt(b)));
int[] sa = new int[n];
int[] classes = new int[n];
for (int i = 0; i < n; i++) {
sa[i] = order[i];
classes[i] = S.charAt(i);
}
for (int len = 1; len < n; len *= 2) {
int[] c = classes.clone();
for (int i = 0; i < n; i++)
classes[sa[i]] = i > 0 && c[sa[i - 1]] == c[sa[i]] && c[(sa[i - 1] + len / 2) % n] == c[(sa[i] + len / 2) % n] ? classes[sa[i - 1]] : i;
int[] cnt = new int[n];
for (int i = 0; i < n; i++)
cnt[i] = i;
int[] s = sa.clone();
for (int i = 0; i < n; i++) {
int s1 = (s[i] - len + n) % n;
sa[cnt[classes[s1]]++] = s1;
}
}
return sa;
}
// longest common prefixes array in O(n)
public static int[] lcp(int[] sa, CharSequence s) {
int n = sa.length;
int[] rank = new int[n];
for (int i = 0; i < n; i++)
rank[sa[i]] = i;
int[] lcp = new int[n - 1];
for (int i = 0, h = 0; i < n; i++) {
if (rank[i] < n - 1) {
for (int j = sa[rank[i] + 1]; Math.max(i, j) + h < s.length() && s.charAt(i + h) == s.charAt(j + h); ++h)
;
lcp[rank[i]] = h;
if (h > 0)
--h;
}
}
return lcp;
}
// Usage example
public static void main(String[] args) {
String s1 = "abcab";
int[] sa1 = suffixArray(s1);
// print suffixes in lexicographic order
for (int p : sa1)
System.out.println(s1.substring(p));
System.out.println("lcp = " + Arrays.toString(lcp(sa1, s1)));
// random test
Random rnd = new Random(1);
for (int step = 0; step < 100000; step++) {
int n = rnd.nextInt(100) + 1;
StringBuilder s = new StringBuilder();
for (int i = 0; i < n; i++)
s.append((char) ('\1' + rnd.nextInt(10)));
int[] sa = suffixArray(s);
int[] ra = rotationArray(s.toString() + '\0');
int[] lcp = lcp(sa, s);
for (int i = 0; i + 1 < n; i++) {
String a = s.substring(sa[i]);
String b = s.substring(sa[i + 1]);
if (a.compareTo(b) >= 0
|| !a.substring(0, lcp[i]).equals(b.substring(0, lcp[i]))
|| (a + " ").charAt(lcp[i]) == (b + " ").charAt(lcp[i])
|| sa[i] != ra[i + 1])
throw new RuntimeException();
}
}
System.out.println("Test passed");
}
}

a cannot be resolved to a variable
Syntax error on token ",", . expected
Syntax error on token "-", -- expected
a cannot be resolved to a variable
b cannot be resolved to a variable
You are getting these errors on this line (which appears twice in the code); they all point at the lambda expression (a, b) -> ... :
Arrays.sort(order, (a, b) -> Character.compare(S.charAt(a), S.charAt(b)));
The most likely reason is that you are not compiling the code with Java 8 or later, or that the project's source level is set lower. Lambda expressions require Java 8.
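If moving to Java 8 is not an option, the lambda can be replaced by an anonymous Comparator, which older compilers accept. This is a sketch of that rewrite for the first sort only; the class name PreJava8Sort and the method sortByChar are made up for this example. It assumes at least Java 7, because Character.compare was added in that release, and the parameter is declared final because anonymous classes before Java 8 can only capture final locals.

```java
import java.util.Arrays;
import java.util.Comparator;

class PreJava8Sort {
    // Same ordering step as in suffixArray, written without a lambda.
    static Integer[] sortByChar(final CharSequence S) {
        int n = S.length();
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++)
            order[i] = n - 1 - i;
        // stable sort of characters
        Arrays.sort(order, new Comparator<Integer>() {
            @Override
            public int compare(Integer a, Integer b) {
                return Character.compare(S.charAt(a), S.charAt(b));
            }
        });
        return order;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortByChar("abcab"))); // prints [3, 0, 4, 1, 2]
    }
}
```

The second Arrays.sort call in rotationArray can be rewritten the same way.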
