String Substrings Generation in Java

I am trying to find all substrings of a given string. For a string like rymis, the substrings would be [i, is, m, mi, mis, r, ry, rym, rymi, rymis, s, y, ym, ymi, ymis]. According to Wikipedia, a string of length n has n * (n + 1) / 2 non-empty substrings in total (counted by position).
They can be found with the following snippet of code:
final Set<String> substring_set = new TreeSet<String>();
final String text = "rymis";
for (int iter = 0; iter < text.length(); iter++)
{
    for (int ator = 1; ator <= text.length() - iter; ator++)
    {
        substring_set.add(text.substring(iter, iter + ator));
    }
}
This works for short strings but obviously slows down for longer ones, as the algorithm is roughly O(n^2).
I have also been reading up on suffix trees, which can do insertions in O(n), and noticed that the same substrings could be obtained by repeatedly inserting the string while removing one character from the right each time, until the string is empty. That should cost about O(1 + … + (n-1) + n), a sum equal to n(n+1)/2 = (n^2 + n)/2, which is again roughly O(n^2). There do seem to be some suffix trees that can do insertions in log2(n) time, which would be better by a factor, giving O(n log2(n)).
Before I delve into suffix trees: is this the correct route to take, is there another algorithm that would be more efficient for this, or is O(n^2) as good as it gets?

I am fairly sure you can't beat O(n^2) for this, as has been mentioned in the comments on the question: the output alone contains n(n+1)/2 substrings.
I was interested in different ways of coding it, so I quickly wrote one and decided to post it here.
I don't think the solution below is asymptotically faster, but the inner and outer loops together run fewer iterations. There are also no duplicate insertions.
String str = "rymis";
ArrayList<String> subs = new ArrayList<String>();
while (str.length() > 0) {
    subs.add(str);
    for (int i = 1; i < str.length(); i++) {
        subs.add(str.substring(i));
        subs.add(str.substring(0, i));
    }
    str = str.substring(1, Math.max(str.length() - 1, 1));
}
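As a rough sanity check on the n(n+1)/2 figure, here is a minimal sketch that lists substrings by position and lets you compare the list size against the formula. The names SubstringCount and allSubstrings are made up for illustration. It also hints at why O(n^2) is a floor for anything that has to output every substring: the list itself has n(n+1)/2 entries.

```java
import java.util.ArrayList;
import java.util.List;

class SubstringCount {
    // Lists every non-empty substring by (start, end) position.
    // Repeated characters produce repeated values, e.g. "aaa" yields "a" three times.
    static List<String> allSubstrings(String text) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                out.add(text.substring(start, end));
            }
        }
        return out;
    }
}
```

Putting the result into a TreeSet, as in the question, removes the repeated values, so the set can be smaller than n(n+1)/2 when the string has repeated characters.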

This is an inverted version of your example, but it is still O(n^2).
String s = "rymis";
ArrayList<String> al = new ArrayList<String>();
for (int i = 1; i <= s.length(); i++) {//collect substrings of length i
    for (int k = 0; k < s.length(); k++) {//start index for substring of length i
        if (i + k > s.length()) break;//if the substring would run past the end of s, move on
        al.add(s.substring(k, k + i));//add the substring of length i at index k to al
    }
}
I tried a couple of recursive versions but had trouble reducing the tree size, so I came up with this iterative approach instead. It uses dual sliding windows as a sort of improvement on the method above.
String s = "rymis";
ArrayList<String> al = new ArrayList<String>();
int n = s.length();
for (int i = 1; i <= n; i++)//window length
{
    for (int k = 0; k <= (n - i) / 2; k++)//windows move inward from both ends
    {
        int a = k;//left bound window 1
        int b = k + i;//right bound window 1 (exclusive)
        int c = n - k - i;//left bound window 2
        int d = n - k;//right bound window 2 (exclusive)
        al.add(s.substring(a, b));//add window 1
        if (a < c) al.add(s.substring(c, d));//add window 2 unless it coincides with window 1
    }
}
An issue was mentioned about ArrayList affecting performance, so this next version uses more basic structures.
String s = "rymis";
StringBuilder sb = new StringBuilder();
int n = s.length();
for (int i = 1; i <= n; i++)//window length
{
    for (int k = 0; k <= (n - i) / 2; k++)
    {
        int a = k;//left bound window 1
        int b = k + i;//right bound window 1 (exclusive)
        int c = n - k - i;//left bound window 2
        int d = n - k;//right bound window 2 (exclusive)
        if (sb.length() > 0) sb.append(",");
        sb.append(s.substring(a, b));//add window 1
        if (a < c) {
            sb.append(",");
            sb.append(s.substring(c, d));//add window 2
        }
    }
}
String joined = sb.toString();
String[] sArray = joined.split(",");

I am not sure about the exact algorithm but you may look into Ropes:
http://en.wikipedia.org/wiki/Rope_(computer_science)
In summary, ropes are better suited when the data is large and frequently modified.
I believe Rope outperforms String for your problem.

Related

Count the number of all possible distinct 3-digit numbers given a String array

Below is my code for counting the number of distinct 3-digit strings. It works correctly, but I would like to improve its time complexity. Can someone help me with this?
input: [1,2,1,4]
output: 12
Thanks.
static int countUnique(String[] arr)
{
    Set<String> s = new TreeSet<>();
    for (int i = 0; i < arr.length; i++)
    {
        for (int j = 0; j < arr.length; j++)
        {
            for (int k = 0; k < arr.length; k++)
            {
                if (i != j && j != k && i != k)
                    s.add((arr[i] + "" + arr[j] + "" + arr[k]));
            }
        }
    }
    return s.size();
}

Here's an O(n) solution:
Iterate over each distinct available digit in turn.
(A) Add 1 if there are at least three instances of it, accounting for the one string made of three of this digit.
(B) If there are at least two instances of it, add 3 times the number of digits already iterated over, accounting for the 3 choose 2 ways to arrange two instances of this digit with one other digit already iterated over.
(C) Add the recorded count of two-digit arrangements of the digits seen so far (see D), accounting for placing a single instance of this digit in the free slot of each of them.
(D) Finally, add to our record of the number of ways to arrange two digits: if there are at least two instances of this digit, add 3 choose 2 = 3, accounting for arranging two instances of this digit on their own. Also add (2 * 3 choose 2 = 6) times the number of digits already iterated over, accounting for the number of ways to arrange this digit with another one already seen.
For example:
1 2 1 4
1 -> D applies, add 3 to the two-digit-arrangements count
     11x, 1x1, x11
2 -> C applies, add 3 to result
     112, 121, 211
     D applies, add 6 to the two-digit-arrangements count (total 9)
     12x, 1x2, x12, 21x, 2x1, x21
4 -> C applies, add 9 to result
Result 12
JavaScript code with random tests, comparing with your brute force approach:
function f(A){
  const counts = {};

  for (let a of A)
    counts[a] = counts[a] ? counts[a] + 1 : 1;

  let result = 0;
  let numTwoDigitArrangements = 0;
  let numSeen = 0;

  for (let d of Object.keys(counts)){
    if (counts[d] > 2)
      result += 1;

    if (counts[d] > 1)
      result += 3 * numSeen;

    result += numTwoDigitArrangements;

    if (counts[d] > 1)
      numTwoDigitArrangements += 3;

    numTwoDigitArrangements += 6 * numSeen;

    numSeen = numSeen + 1;
  }

  return result;
}

function bruteForce(arr){
  const s = new Set();
  for (let i=0; i<arr.length; i++){
    for (let j=0; j<arr.length; j++){
      for (let k=0; k<arr.length; k++){
        if (i != j && j != k && i != k)
          s.add((arr[i] + "" + arr[j]+ "" + arr[k]));
      }
    }
  }
  return s.size;
}

// Random tests
var numTests = 500;
var maxLength = 25;

for (let i=0; i<numTests; i++){
  const n = Math.ceil(Math.random() * maxLength);
  const A = new Array(n);
  for (let j=0; j<n; j++)
    A[j] = Math.floor(Math.random() * 10);

  const _f = f(A);
  const _bruteForce = bruteForce(A);

  if (_f != _bruteForce){
    console.log('Mismatch found:');
    console.log('' + A);
    console.log(`f: ${ _f }`);
    console.log(`brute force: ${ _bruteForce }`);
  }
}

console.log('Done testing.');
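Since the question is in Java, here is a rough Java port of the counting function above, offered as a sketch rather than a drop-in replacement. The class name ThreeDigitCount is made up for illustration; the method keeps the question's name countUnique, and the logic follows steps A to D as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ThreeDigitCount {
    static int countUnique(String[] arr) {
        // Count how many times each distinct digit occurs.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String a : arr) {
            counts.merge(a, 1, Integer::sum);
        }

        int result = 0;
        int twoDigitArrangements = 0; // two placed digits, one free slot
        int seen = 0;                 // distinct digits processed so far

        for (int c : counts.values()) {
            if (c > 2) result += 1;               // (A) ddd
            if (c > 1) result += 3 * seen;        // (B) two of d plus one earlier digit
            result += twoDigitArrangements;       // (C) one d in each free slot
            if (c > 1) twoDigitArrangements += 3; // (D) ddx, dxd, xdd
            twoDigitArrangements += 6 * seen;     // (D) d paired with each earlier digit
            seen++;
        }
        return result;
    }
}
```

For the input from the question, countUnique(new String[] {"1", "2", "1", "4"}) follows the same trace as the worked example and returns 12.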

Another way to solve this is with a backtracking algorithm. Many combination and permutation problems can be solved with backtracking.
Here is some information on backtracking: https://en.wikipedia.org/wiki/Backtracking
Note: This is neither the most optimized solution nor an O(n) one. With k fixed at 3 it visits on the order of n^3 index triples, like the brute-force loops; generating full-length permutations this way would be O(n! * n). There are many opportunities to optimize it.
Java code using backtracking:
int countUniqueOpt(String[] arr) {
    //Set to avoid duplicates
    Set<String> resultList = new HashSet<>();

    backtracking(arr, 3, resultList, new ArrayList<>());

    return resultList.size();
}

void backtracking(String[] arr, int k, Set<String> resultList, List<Integer> indexList) {
    if (indexList.size() == k) {
        String tempString = arr[indexList.get(0)] + arr[indexList.get(1)] + arr[indexList.get(2)];
        resultList.add(tempString);
    } else {
        for (int i = 0; i < arr.length; i++) {
            if (!indexList.contains(i)) {
                indexList.add(i);
                backtracking(arr, k, resultList, indexList);
                indexList.remove(indexList.size() - 1);
            }
        }
    }
}

Permutations of a set of data in Java

I have 10,000 items in a set, and they must be combined into triads.
I need an algorithm that finds each triad efficiently.
For example:
{A,B,C,D,...}
1.AAA
2.AAB
3.AAC
4.AAD
...
all the way to ZZY, ZZZ.
The method I'm using is very inefficient: I created three nested for loops that iterate through an array, which has a run time of O(N^3) and obviously performs terribly. Which kind of algorithm and data structure would be better for this? Thank you.

Function to print all permutations of length k from a set of n characters, with repetition of characters:
static void printKLengthPerm(char[] set, String prefix, int n, int k)
{
    if (k == 0)
    {
        System.out.println(prefix);
        return;
    }

    for (int i = 0; i < n; i++)
    {
        String newPrefix = prefix + set[i];
        printKLengthPerm(set, newPrefix, n, k - 1);
    }
}
Calling the function to print all permutations of length 3 from the set of capital English letters:
char[] set = new char[26];
for (int i = 0; i < 26; i++)
    set[i] = (char) ('A' + i);
int n = set.length;
printKLengthPerm(set, "", n, 3);
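Listing all N^3 triads takes N^3 steps however it is written, so one option is not to materialise them at all and instead compute a triad from its position only when it is needed. The sketch below assumes the AAA, AAB, AAC, ... ordering from the question; TriadIndex and triadAt are hypothetical names, not something from the answer above.

```java
class TriadIndex {
    // Decodes a 0-based position in the AAA, AAB, ... ordering into three
    // item indices, treating the position as a 3-digit number in base n.
    static int[] triadAt(long position, int n) {
        int third = (int) (position % n);
        int second = (int) ((position / n) % n);
        int first = (int) (position / ((long) n * n));
        return new int[] { first, second, third };
    }
}
```

With n = 26, position 0 maps to indices {0, 0, 0} (AAA) and position 1 to {0, 0, 1} (AAB). For 10,000 items the positions run up to 10^12 - 1, which is why the position is a long.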

Efficiently generate 2^n combinations (/w java)

I'm trying to generate all 2^n combinations as efficiently as possible (and save them to an array), like
0001
0010
0011
etc.
Where n could be up to 15.
Here is my code:
public static void main(String args[]) {
    final long startTime = System.nanoTime();
    final int N = 15;
    int m = (int) Math.pow(2, N) - 1;
    int[][] array = new int[m][N];
    int arrLength = array.length;
    for (int i = 0; i < arrLength; i++) {
        String str = String.format("%" + N + "s", Integer.toBinaryString(i + 1)).replace(' ', '0');
        for (int j = 0; j < N; j++) {
            array[i][j] = Character.getNumericValue(str.charAt(j));
        }
    }
    final long duration = System.nanoTime() - startTime;
    double sec = (double) duration / 1000000000.0;
    System.out.println(sec);
}
Any suggestions on how I can do this faster?
As of now, my timer says it takes ~0.1 to ~0.12 seconds.

String processing tends to be slow (it typically requires loops and allocations). Instead, you can shift the bit you are interested in to position 0 and then cut off the higher bits with a bitwise AND with 1.
for (int i = 0; i < arrLength; i++) {
    for (int j = 0; j < N; j++) {
        array[i][j] = (i >> j) & 1;
    }
}
P.S. I have left out adding 1 to i, as I wasn't sure whether this was intended in the original code; it should be straightforward to add if needed.
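One detail to watch: shifting by j puts the least significant bit in column 0, whereas the string-based code in the question puts the most significant bit there. If the original layout matters, a variant along these lines should reproduce it, including the i + 1 offset. BitTable and build are illustrative names, and this is a sketch, not a benchmarked replacement.

```java
class BitTable {
    // Builds rows for the values 1 .. 2^n - 1, most significant bit first,
    // matching the layout of the zero-padded binary strings in the question.
    static int[][] build(int n) {
        int rows = (1 << n) - 1;
        int[][] table = new int[rows][n];
        for (int i = 0; i < rows; i++) {
            int value = i + 1;
            for (int j = 0; j < n; j++) {
                table[i][j] = (value >> (n - 1 - j)) & 1;
            }
        }
        return table;
    }
}
```

With n = 4, the first row is {0, 0, 0, 1} and the third is {0, 0, 1, 1}, in line with the 0001, 0010, 0011 listing in the question.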

My most efficient way would be not to generate them at all, which takes roughly... 0 nanoseconds.
These strings are the textual representations of all integers from 0 to 2^n-1, and enumerating those is no mystery. There is no need to store them (in an array), as the values would be the same as the indexes.
If you have compelling reasons to process them as strings, you can perform the conversion when required, with your own routine or with toBinaryString.
Depending on your application (string lookup, for instance), another option is to convert the given string to its integer value. If the goal is to check the presence or absence of items in a given combination, binary masks will do the job effectively.
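To make the mask idea concrete, here is a small sketch in which a combination is just an int, membership is a single AND, and the text form is produced only on demand. CombinationMask, contains and asBits are hypothetical helper names chosen for this example.

```java
class CombinationMask {
    // True if bit `item` (0 = least significant) is set in `combination`.
    static boolean contains(int combination, int item) {
        return (combination & (1 << item)) != 0;
    }

    // Produces the n-character binary string only when text is really needed.
    static String asBits(int combination, int n) {
        String bits = Integer.toBinaryString(combination);
        StringBuilder sb = new StringBuilder();
        for (int i = bits.length(); i < n; i++) {
            sb.append('0'); // left-pad with zeros
        }
        return sb.append(bits).toString();
    }
}
```

A loop such as for (int c = 0; c < (1 << n); c++) then visits every combination without storing any of them.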

Improving the algorithm for removal of element

Problem
Given a string s and m queries, for each query delete the K-th occurrence of a character x.
For example:
abcdbcaab
5
2 a
1 c
1 d
3 b
2 a
Ans abbc
My approach
I am using a binary indexed tree (BIT) for the update operation.
Code:
for (int i = 0; i < ss.length(); i++) {
    char cc = ss.charAt(i);
    freq[cc-97] += 1;
    if (max < freq[cc-97]) max = freq[cc-97];
    dp[cc-97][freq[cc-97]] = i; // index in ss of this occurrence of cc
}
BIT = new int[27][ss.length()+1];
int[] ans = new int[ss.length()];
int q = in.nextInt();
for (int i = 0; i < q; i++) {
    int rmv = in.nextInt();
    char c = in.next().charAt(0);
    int rr = rmv + value(rmv, BIT[c-97]); // Calculating the original Index Value
    ans[dp[c-97][rr]] = Integer.MAX_VALUE;
    update(rmv, 1, BIT[c-97], max); // Updating it
}
for (int i = 0; i < ss.length(); i++) {
    if (ans[i] != Integer.MAX_VALUE) System.out.print(ss.charAt(i));
}
The time complexity is O(M log N), where N is the length of the string ss.
Question
My solution gives me a Time Limit Exceeded error. How can I improve it?
public static void update(int i, int value, int[] arr, int xx) {
    while (i <= xx) {
        arr[i] += value;
        i += (i & -i);
    }
}

public static int value(int i, int[] arr) {
    int ans = 0;
    while (i > 0) {
        ans += arr[i];
        i -= (i & -i);
    }
    return ans;
}

There are key operations not shown, and odds are that one of them (quite likely the update method) has a different cost than you think. Furthermore, your stated complexity is guaranteed to be wrong, because at some point you have to scan the string, which is at minimum O(N).
In any case, the obviously right strategy here is to go through the queries, separate them by character, and then process them in reverse order to work out the original positions of the characters to be deleted. Then run through the string once, emitting only the characters that were not deleted. This solution, if implemented well, should be doable in O(N + M log(M)).
The challenge is how to represent the deletions efficiently. I'm thinking of some sort of tree of relative offsets, so that if you find that the first deletion was 3 a, you can efficiently insert it into your tree and shift every later deletion that comes after it. This is where the log(M) factor comes from.
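For comparison, here is a sketch of a different and fairly common technique, not the reverse-order strategy described above: keep one Fenwick tree per character over that character's occurrence list, and find the K-th remaining occurrence by binary descent. It processes the queries in their given order in roughly O(N + M log N). All names here (KthOccurrenceRemoval, Fenwick, findKth, remove) are made up for this example, and it assumes lowercase a to z and valid queries.

```java
import java.util.ArrayList;
import java.util.List;

class KthOccurrenceRemoval {
    // Fenwick tree over the occurrences of one character; 1 = still present.
    static final class Fenwick {
        final int[] tree;
        final int size;

        Fenwick(int size) {
            this.size = size;
            this.tree = new int[size + 1];
            // Linear-time build with every slot initially set to 1.
            for (int i = 1; i <= size; i++) {
                tree[i] += 1;
                int parent = i + (i & -i);
                if (parent <= size) tree[parent] += tree[i];
            }
        }

        void add(int i, int delta) {
            for (; i <= size; i += i & -i) tree[i] += delta;
        }

        // Smallest 1-based slot whose prefix sum reaches k (assumes 1 <= k <= remaining).
        int findKth(int k) {
            int pos = 0;
            for (int step = Integer.highestOneBit(size); step > 0; step >>= 1) {
                int next = pos + step;
                if (next <= size && tree[next] < k) {
                    pos = next;
                    k -= tree[next];
                }
            }
            return pos + 1;
        }
    }

    // ks[q] and cs[q] together form query q: delete the ks[q]-th remaining cs[q].
    static String remove(String s, int[] ks, char[] cs) {
        List<List<Integer>> positions = new ArrayList<>();
        for (int c = 0; c < 26; c++) positions.add(new ArrayList<>());
        for (int i = 0; i < s.length(); i++) positions.get(s.charAt(i) - 'a').add(i);

        Fenwick[] trees = new Fenwick[26];
        boolean[] deleted = new boolean[s.length()];
        for (int q = 0; q < ks.length; q++) {
            int c = cs[q] - 'a';
            if (trees[c] == null) trees[c] = new Fenwick(positions.get(c).size());
            int slot = trees[c].findKth(ks[q]);
            trees[c].add(slot, -1);
            deleted[positions.get(c).get(slot - 1)] = true;
        }

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (!deleted[i]) out.append(s.charAt(i));
        }
        return out.toString();
    }
}
```

On the sample from the question (abcdbcaab with the queries 2 a, 1 c, 1 d, 3 b, 2 a) this yields abbc. Collecting the output in a StringBuilder instead of printing one character at a time may also help with the time limit, though that is a guess about where the time goes.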

Edit Distance solution for Large Strings

I'm trying to solve the edit distance problem. The code I've been using is below.
public static int minDistance(String word1, String word2) {
    int len1 = word1.length();
    int len2 = word2.length();
    // len1+1, len2+1, because finally return dp[len1][len2]
    int[][] dp = new int[len1 + 1][len2 + 1];
    for (int i = 0; i <= len1; i++) {
        dp[i][0] = i;
    }
    for (int j = 0; j <= len2; j++) {
        dp[0][j] = j;
    }
    //iterate though, and check last char
    for (int i = 0; i < len1; i++) {
        char c1 = word1.charAt(i);
        for (int j = 0; j < len2; j++) {
            char c2 = word2.charAt(j);
            //if last two chars equal
            if (c1 == c2) {
                //update dp value for +1 length
                dp[i + 1][j + 1] = dp[i][j];
            } else {
                int replace = dp[i][j] + 1;
                int insert = dp[i][j + 1] + 1;
                int delete = dp[i + 1][j] + 1;
                int min = replace > insert ? insert : replace;
                min = delete > min ? min : delete;
                dp[i + 1][j + 1] = min;
            }
        }
    }
    return dp[len1][len2];
}
It's a DP approach. The problem is that, because it uses a 2D array, this method cannot handle large strings, e.g. string length > 100000.
So is there any way to modify this algorithm to overcome that difficulty?
NOTE:
The above code solves the edit distance problem accurately for small strings (lengths of about 1000 or less).
As you can see, the code uses a Java 2D array, dp[][], and we can't allocate a 2D array with that many rows and columns.
Example: if I need to compare two strings whose lengths are more than 100000,
int[][] dp = new int[len1 + 1][len2 + 1];
becomes
int[][] dp = new int[100000][100000];
which gives a stackOverflow error.
So the above program is only good for short strings.
What I'm asking is: is there any way to solve this problem efficiently for large strings (length > 100000) in Java?

First of all, there's no problem in allocating a 100k x 100k int array in Java; you just have to do it on the heap, not the stack (and on a machine with around 80GB of memory :))
Secondly, as a (very direct) hint:
Note that in your loop, you are only ever using 2 rows at a time: row i and row i+1. In fact, you calculate row i+1 from row i. Once you have row i+1, you don't need to store row i anymore.
This neat trick allows you to store only 2 rows at a time, bringing the space complexity down from O(n^2) to O(n). Since you stated that this is not homework (even though you're a CS undergrad by your profile...), I'll trust you to come up with the code yourself.
Come to think of it I recall having this exact problem when I was doing a class in my CS degree...
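For readers who want to compare notes afterwards, the two-row idea can be sketched roughly as follows. It keeps the same recurrence as the question's code but stores only the previous and current rows; the class name EditDistance and the swap that puts the shorter string along the row are choices made for this sketch, not part of the hint.

```java
class EditDistance {
    static int minDistance(String word1, String word2) {
        // Edit distance is symmetric, so keep the shorter string along the row.
        if (word2.length() > word1.length()) {
            String tmp = word1;
            word1 = word2;
            word2 = tmp;
        }
        int len2 = word2.length();
        int[] prev = new int[len2 + 1]; // row i
        int[] curr = new int[len2 + 1]; // row i + 1
        for (int j = 0; j <= len2; j++) {
            prev[j] = j;
        }
        for (int i = 0; i < word1.length(); i++) {
            curr[0] = i + 1;
            char c1 = word1.charAt(i);
            for (int j = 0; j < len2; j++) {
                if (c1 == word2.charAt(j)) {
                    curr[j + 1] = prev[j];
                } else {
                    int replace = prev[j] + 1;
                    int insert = prev[j + 1] + 1;
                    int delete = curr[j] + 1;
                    curr[j + 1] = Math.min(replace, Math.min(insert, delete));
                }
            }
            // The row just filled becomes the previous row for the next pass.
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[len2];
    }
}
```

This only addresses memory. The running time is still proportional to len1 * len2, so two strings of length 100000 mean on the order of 10^10 cell updates.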
