Finding all the common substrings of given two strings

Finding all the common substrings of given two strings - java

I have come across a problem statement to find the all the common sub-strings between the given two sub-strings such a way that in every case you have to print the longest sub-string. The problem statement is as follows:
Write a program to find the common substrings between the two given strings. However, do not include substrings that are contained within longer common substrings.
For example, given the input strings eatsleepnightxyz and eatsleepabcxyz, the results should be:
eatsleep (due to eatsleepnightxyz eatsleepabcxyz)
xyz (due to eatsleepnightxyz eatsleepabcxyz)
a (due to eatsleepnightxyz eatsleepabcxyz)
t (due to eatsleepnightxyz eatsleepabcxyz)
However, the result set should not include e from
eatsleepnightxyz eatsleepabcxyz, because both es are already contained in the eatsleep mentioned above. Nor should you include ea, eat, ats, etc., as those are also all covered by eatsleep.
In this, you don't have to make use of String utility methods like: contains, indexOf, StringTokenizer, split and replace.
My Algorithm is as follows: I am starting with brute force and will switch to more optimized solution when I improve my basic understanding.
For String S1:
Find all the substrings of S1 of all the lengths
While doing so: Check if it is also a substring of
S2.
Attempt to figure out the time complexity of my approach.
Let the two given strings be n1-String and n2-String
The number of substrings of S1 is clearly n1(n1+1)/2.
But we have got to find the average length a substring of S1.
Let’s say it is m. We’ll find m separately.
Time Complexity to check whether an m-String is a substring of an
n-String is O(n*m).
Now, we are checking for each m-String is a substring of S2,
which is an n2-String.
This, as we have seen above, is an O(n2 m) algorithm.
The time required by the overall algorithm then is
Tn=(Number of substrings in S1) * (average substring lengthtime for character comparison procedure)
By performing certain calculations, I came to conclusion that the
time complexity is O(n3 m2)
Now, our job is to find m in terms of n1.
Attempt to find m in terms of n1.
Tn = (n)(1) + (n-1)(2) + (n-2)(3) + ..... + (2)(n-1) + (1)(n)
where Tn is the sum of lengths of all the substrings.
Average will be the division of this sum by the total number of Substrings produced.
This, simply is a summation and division problem whose solution is as follows O(n)
Therefore...
Running time of my algorithm is O(n^5).
With this in mind I wrote the following code:
package pack.common.substrings;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
public class FindCommon2 {
public static final Set<String> commonSubstrings = new LinkedHashSet<String>();
public static void main(String[] args) {
printCommonSubstrings("neerajisgreat", "neerajisnotgreat");
System.out.println(commonSubstrings);
}
public static void printCommonSubstrings(String s1, String s2) {
for (int i = 0; i < s1.length();) {
List<String> list = new ArrayList<String>();
for (int j = i; j < s1.length(); j++) {
String subStr = s1.substring(i, j + 1);
if (isSubstring(subStr, s2)) {
list.add(subStr);
}
}
if (!list.isEmpty()) {
String s = list.get(list.size() - 1);
commonSubstrings.add(s);
i += s.length();
}
}
}
public static boolean isSubstring(String s1, String s2) {
boolean isSubstring = true;
int strLen = s2.length();
int strToCheckLen = s1.length();
if (strToCheckLen > strLen) {
isSubstring = false;
} else {
for (int i = 0; i <= (strLen - strToCheckLen); i++) {
int index = i;
int startingIndex = i;
for (int j = 0; j < strToCheckLen; j++) {
if (!(s1.charAt(j) == s2.charAt(index))) {
break;
} else {
index++;
}
}
if ((index - startingIndex) < strToCheckLen) {
isSubstring = false;
} else {
isSubstring = true;
break;
}
}
}
return isSubstring;
}
}
Explanation for my code:
printCommonSubstrings: Finds all the substrings of S1 and
checks if it is also a substring of
S2.
isSubstring : As the name suggests, it checks if the given string
is a substring of the other string.
Issue: Given the inputs
S1 = “neerajisgreat”;
S2 = “neerajisnotgreat”
S3 = “rajeatneerajisnotgreat”
In case of S1 and S2, the output should be: neerajis and great
but in case of S1 and S3, the output should have been:
neerajis, raj, great, eat but still I am getting neerajis and great as output. I need to figure this out.
How should I design my code?

You would be better off with a proper algorithm for the task rather than a brute-force approach. Wikipedia describes two common solutions to the longest common substring problem: suffix-tree and dynamic-programming.
The dynamic programming solution takes O(n m) time and O(n m) space. This is pretty much a straightforward Java translation of the Wikipedia pseudocode for the longest common substring:
public static Set<String> longestCommonSubstrings(String s, String t) {
int[][] table = new int[s.length()][t.length()];
int longest = 0;
Set<String> result = new HashSet<>();
for (int i = 0; i < s.length(); i++) {
for (int j = 0; j < t.length(); j++) {
if (s.charAt(i) != t.charAt(j)) {
continue;
}
table[i][j] = (i == 0 || j == 0) ? 1
: 1 + table[i - 1][j - 1];
if (table[i][j] > longest) {
longest = table[i][j];
result.clear();
}
if (table[i][j] == longest) {
result.add(s.substring(i - longest + 1, i + 1));
}
}
}
return result;
}
Now, you want all of the common substrings, not just the longest. You can enhance this algorithm to include shorter results. Let's examine the table for the example inputs eatsleepnightxyz and eatsleepabcxyz:
e a t s l e e p a b c x y z
e 1 0 0 0 0 1 1 0 0 0 0 0 0 0
a 0 2 0 0 0 0 0 0 1 0 0 0 0 0
t 0 0 3 0 0 0 0 0 0 0 0 0 0 0
s 0 0 0 4 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 5 0 0 0 0 0 0 0 0 0
e 1 0 0 0 0 6 1 0 0 0 0 0 0 0
e 1 0 0 0 0 1 7 0 0 0 0 0 0 0
p 0 0 0 0 0 0 0 8 0 0 0 0 0 0
n 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i 0 0 0 0 0 0 0 0 0 0 0 0 0 0
g 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t 0 0 1 0 0 0 0 0 0 0 0 0 0 0
x 0 0 0 0 0 0 0 0 0 0 0 1 0 0
y 0 0 0 0 0 0 0 0 0 0 0 0 2 0
z 0 0 0 0 0 0 0 0 0 0 0 0 0 3
The eatsleep result is obvious: that's the 12345678 diagonal streak at the top-left.
The xyz result is the 123 diagonal at the bottom-right.
The a result is indicated by the 1 near the top (second row, ninth column).
The t result is indicated by the 1 near the bottom left.
What about the other 1s at the left, the top, and next to the 6 and 7? Those don't count because they appear within the rectangle formed by the 12345678 diagonal — in other words, they are already covered by eatsleep.
I recommend doing one pass doing nothing but building the table. Then, make a second pass, iterating backwards from the bottom-right, to gather the result set.

Typically this type of substring matching is done with the assistance of a separate data structure called a Trie (pronounced try). The specific variant that best suits this problem is a suffix tree. Your first step should be to take your inputs and build a suffix tree. Then you'll need to use the suffix tree to determine the longest common substring, which is a good exercise.

Related

Transitive closure and Warshall algorithm

My input matrix is following:
0 1 1 0
0 0 1 0
1 0 0 1
0 0 0 0
My code for the Warshall's algorithm is following:
int V = A.length;
for(int k = 0; k < V; k++) {
for(int i = 0; i < V; i++) {
for(int j = 0; j < V; j++) {
A[i][j] = A[i][j] == 1 || A[i][k] == 1 && A[k][j] == 1 ? 1 : 0;
}
}
}
The formula for the transitive closure of a matrix is (matrix)^2 + (matrix). Following the formula, I get this as an answer:
1 1 1 1
1 0 1 1
1 1 1 1
0 0 0 0
Using the piece of code I mentioned before, i get this as an answer:
1 1 1 1
1 1 1 1
1 1 1 1
0 0 0 0
So, my question is following, which one of the answers is right and is the algorithm working the correct way?

The formula for the transitive closure of a matrix is (matrix)^2 +
(matrix). Following the formula, I get this as an answer:
Not exactly, you are looking for the transitive closure of (matrix)^2 + matrix, this is the formula for a single step - not for the entire solution.
This means, you need to apply it again, and then you get in a second iteration:
1 1 1 1
1 1 1 1
1 1 1 1
0 0 0 0
(Since there is a path from 0 to 1)
The third iteration will apply the same matrix, which means this is the transitive closure.
So, from this test case, it seems the algorithm is corret.
That said, the line A[i][j] = A[i][j] == 1 || A[i][k] == 1 && A[k][j] == 1 ? 1 : 0; is quite unreadable, imho - don't be afraid to break it or at least add some parentheses

Java - Adjacency Matrix and DFS

I've been working on a program to implement a DFS in Java (by taking an adjacency matrix as input from a file). Basically, assuming vertices are traveled in numerical order, I would like to print the order that vertices become dead ends, the number of connected components in the graph, the tree edges and the back edges. But I'm not completely there yet. When I run my program, I get the number "1" as output, and nothing more. I've tried debugging certain parts of the DFS class, but I still can't quite figure out where I'm going wrong. Here is my code:
A basic "Driver" class:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
public class Driver {
public static void main(String[] args) throws FileNotFoundException {
Scanner scanner = new Scanner(new File("sample1.txt"));
scanner.useDelimiter("[\\s,]+");
int input = scanner.nextInt();
int[][] adjMatrix = new int[8][8];
for(int i=0; i < input; i++) {
for (int j=0; j < input; j++) {
adjMatrix[i][j] = scanner.nextInt();
}
}
scanner.close();
new DFS(adjMatrix);
}
}
DFS class:
import java.util.Stack;
public class DFS {
Stack<Integer> stack;
int first;
int[][] adjMatrix;
int[] visited = new int[7];
public DFS(int[][] Matrix) {
this.adjMatrix = Matrix;
stack = new Stack<Integer>();
int[] node = {0, 1, 2, 3, 4, 5, 6};
int firstNode = node[0];
depthFirstSearch(firstNode, 7);
}
public void depthFirstSearch(int first,int n){
int v,i;
stack.push(first);
while(!stack.isEmpty()){
v = stack.pop();
if(visited[v]==0) {
System.out.print("\n"+(v+1));
visited[v]=1;
}
for (i=0;i<n;i++){
if((adjMatrix[v][i] == 1) && (visited[i] == 0)){
stack.push(v);
visited[i]=1;
System.out.print(" " + (i+1));
v = i;
}
}
}
}
}
And the matrix from the input file looks like this:
0 1 0 0 1 1 0 0
1 0 0 0 0 1 1 0
0 0 0 1 0 0 1 0
0 0 1 0 0 0 0 1
1 0 0 0 0 1 0 0
1 1 0 0 1 0 0 0
0 1 1 0 0 0 0 1
0 0 0 1 0 0 1 0

Take a look at this part:
int input = scanner.nextInt();
int[][] adjMatrix = new int[8][8];
for(int i=0; i < input; i++) {
for (int j=0; j < input; j++) {
adjMatrix[i][j] = scanner.nextInt();
}
}
First you read a number, input.
Then you read input rows, in each row input columns.
This is your input data:
0 1 0 0 1 1 0 0
1 0 0 0 0 1 1 0
0 0 0 1 0 0 1 0
0 0 1 0 0 0 0 1
1 0 0 0 0 1 0 0
1 1 0 0 1 0 0 0
0 1 1 0 0 0 0 1
0 0 0 1 0 0 1 0
What is the first number, that will be read by scanner.nextInt().
It's 0. So the loop will do nothing.
Prepend the number 8 to your input, that is:
8
0 1 0 0 1 1 0 0
1 0 0 0 0 1 1 0
0 0 0 1 0 0 1 0
0 0 1 0 0 0 0 1
1 0 0 0 0 1 0 0
1 1 0 0 1 0 0 0
0 1 1 0 0 0 0 1
0 0 0 1 0 0 1 0
Btw, it's a good idea to verify that you have correctly read the matrix.
Here's an easy way to do that:
for (int[] row : adjMatrix) {
System.out.println(Arrays.toString(row));
}
There are several other issues in this implementation:
The number 7 appears in a couple of places. It's actually a crucial value in the depth-first-search algorithm, and it's actually incorrect. It should be 8. And it should not be hardcoded, it should be derived from the size of the matrix.
It's not a good practice to do computation in a constructor. The purpose of a constructor is to create an object. The depth-first-logic could be moved to a static utility method, there's nothing in the current code to warrant a dedicated class.
Fixing the above issues, and a few minor ones too, the implementation can be written a bit simpler and cleaner:
public static void dfs(int[][] matrix) {
boolean[] visited = new boolean[matrix.length];
Deque<Integer> stack = new ArrayDeque<>();
stack.push(0);
while (!stack.isEmpty()) {
int v = stack.pop();
if (!visited[v]) {
System.out.print("\n" + (v + 1));
visited[v] = true;
}
for (int i = 0; i < matrix.length; i++) {
if (matrix[v][i] == 1 && !visited[i]) {
visited[i] = true;
stack.push(v);
v = i;
System.out.print(" " + (i + 1));
}
}
}
}

Breadth-first algorithm on adjacency matrix: premature ending of search, returns queue of size 1?

adj is an adjacency matrix:
0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0
1 0 0 0 1 0 1 0 0
0 0 0 1 0 1 0 0 0
0 0 1 0 1 0 0 0 1
0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0
adj maps adjacency for a maze below (to the side is the element #):
S X E | 0 1 2
O O O | 3 4 5
O X O | 6 7 8
I'm using the adjacency matrix to traverse elements in the maze. Here's a Breadth-first-search algorithm to traverse the matrix.
queue Q = new queue();
boolean[] visited = new boolean[];
int num = 9; //number of elements
int begin = 0; //begin point as defined 'S'
int end = 2; //end point as defined 'E'
private queue BFS(int[][] adj, int begin) {
visited[begin] = true;
Q.enqueue(begin);
while (!Q.isEmpty()) {
int element = Q.dequeue();
int temp = element;
while (temp <= num) {
if ((!visited[temp]) && (adj[element][temp] == 1)) {
if (temp == end) {
return Q;
}
Q.enqueue(temp);
visited[temp] = true;
}
temp++;
}
}
return Q;
}
Doing BFS() returns a queue of size() 1. Specifically, the queue only contains begin (in this case, 0). The algorithm should produce a queue of [0, 3, 4, 5, 2]. Inserting a System.out.println("blah") right after the first while() opening shows it only iterates once.
Why does my algorithm stop prematurely? How can I tweak it to get my desired output? My goal is to device an algorithm that finds the shortest path possible between '0' and '2' in this scenario.

You hit the condition temp == end at the first iteration because your temp starts from 0, then it's 1 without inserting element to queue(because it's not adjacent to 0), then it's 2 without inserting, and here you are, under condition temp == end (both equal 2) you return Q that contains only begin as nothing yet has been inserted. So you only have one iteration of outer loop and three iterations of inner loop.
I would suggest following modification
queue Q = new queue();
boolean[] visited = new boolean[];
int num = 9; //number of elements
int begin = 0; //begin point as defined 'S'
int end = 2; //end point as defined 'E'
private void BFS(int[][] adj, int begin) {
visited[begin] = true;
Q.enqueue(begin);
while (!Q.isEmpty()) {
int element = Q.dequeue();
if (element == end) {
return Q;
}
int temp = 0;
while (temp < num) {
if ((!visited[temp]) && (adj[element][temp] == 1)) {
Q.enqueue(temp);
visited[temp] = true;
}
temp++;
}
}
return Q;
}

NumberFormatException in my code

I was solving the Connected Sets problem on Amazon's Interview Street site https://amazon.interviewstreet.com/challenges and my code worked perfectly for the public sample test cases provided by the site, but I'm getting a NumberFormatException for the hidden test cases on line 25. Here is the part of my code that parses the input:
public class Solution
{
static int [][] arr;
static int num = 2;
static int N;
static String output = "";
public static void main(String[] args) throws IOException
{
int T;
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
T = Integer.parseInt(reader.readLine());
int i, j, k;
for(i=0;i<T;i++) //the loop for each of the 'T' test cases
{
N = Integer.parseInt(reader.readLine()); //line 25
arr = new int[N][N];
for(j=0;j<N;j++) //the loops for storing the input 2D array
{
for(k=0;k<N;k++)
{
arr[j][k] = Character.getNumericValue(reader.read());
reader.read();
}
}
I spent a lot of time trying to find what the problem is, but I've been unsuccessful at it. Thanks for your help in advance.
EDIT: The problem statement is given as follows on the site:
Given a 2–d matrix, which has only 1’s and 0’s in it. Find the total number of connected sets in that matrix.
Explanation:
Connected set can be defined as group of cell(s) which has 1 mentioned on it and have at least one other cell in that set with which they share the neighbor relationship. A cell with 1 in it and no surrounding neighbor having 1 in it can be considered as a set with one cell in it. Neighbors can be defined as all the cells adjacent to the given cell in 8 possible directions ( i.e N , W , E , S , NE , NW , SE , SW direction ). A cell is not a neighbor of itself.
Input format:
First line of the input contains T, number of test-cases.
Then follow T testcases. Each testcase has given format.
N [ representing the dimension of the matrix N X N ].
Followed by N lines , with N numbers on each line.
Output format:
For each test case print one line, number of connected component it has.
Sample Input:
4
4
0 0 1 0
1 0 1 0
0 1 0 0
1 1 1 1
4
1 0 0 1
0 0 0 0
0 1 1 0
1 0 0 1
5
1 0 0 1 1
0 0 1 0 0
0 0 0 0 0
1 1 1 1 1
0 0 0 0 0
8
0 0 1 0 0 1 0 0
1 0 0 0 0 0 0 1
0 0 1 0 0 1 0 1
0 1 0 0 0 1 0 0
1 0 0 0 0 0 0 0
0 0 1 1 0 1 1 0
1 0 1 1 0 1 1 0
0 0 0 0 0 0 0 0
Sample output:
1
3
3
9
Constraint:
0 < T < 6
0 < N < 1009
Note that the above sample test cases worked on my code. The hidden test cases gave the exception.

Okay, so I modified my program to incorporate LeosLiterak's suggestion to use trim() and the new code is as follows:
public class Solution
{
static int [][] arr;
static int num = 2;
static int N;
static String output = "";
public static void main(String[] args) throws IOException
{
int T;
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
T = Integer.parseInt(reader.readLine().trim());
int i, j, k;
for(i=0;i<T;i++)
{
N = Integer.parseInt(reader.readLine().trim());
arr = new int[N][N];
for(j=0;j<N;j++)
{
String [] temp = reader.readLine().trim().split(" ");
for(k=0;k<N;k++)
{
arr[j][k] = Integer.parseInt(temp[k]);
}
}
So instead of reading each character in the input matrix and converting it to an integer and storing, I now read the entire line and trim and split the string and convert each substring into an integer and store in my array.

Copied from comments: try to remove all white characters like spaces, tabulators etc. These characters are not prohibited by goal definition but they cannot be parsed to number. You must trim them first.

Check if its number before parse to int.
Character.isNumber()

T = Integer.parseInt(reader.readLine()); This where the problem. when user try to give string
and it convert into integer and run through loop? For eg : if user gives 'R' as char than how the integer conversion can happen and will the loop run further?? No right.

Converting very big integer to bit vector fails

import java.math.BigInteger;
import java.util.ArrayList;
public class Factorial {
public static int[] bitVector(int n) {
ArrayList<Integer> bitList = new ArrayList<Integer>();
BigInteger input = computeFactorial(n);
System.out.println(input);
BigInteger[] result = input.divideAndRemainder(new BigInteger(String.valueOf(2)));
if (result[0].intValue()==0) {return new int[]{result[1].intValue()};}
else {
bitList.add(result[1].intValue());
}
while(result[0].intValue() != 0) {
result = result[0].divideAndRemainder(new BigInteger(String.valueOf(2)));
bitList.add(result[1].intValue());
}
int[] array = new int[bitList.size()];
for (int i=0; i<array.length; i++) {
array[i]=bitList.get(i).intValue();
}
return array;
}
public static BigInteger computeFactorial(int n) {
if (n==0) {
return new BigInteger(String.valueOf(1));
} else {
return new BigInteger(String.valueOf(n)).multiply(computeFactorial(n-1));
}
}
public static void main(String[] args) {
int[] bitVector = bitVector(35);
for (int bit: bitVector)
System.out.print(bit+" ");
System.out.println();
}
}
The code above works fine when the input to bitVector is no bigger than 35. However, when I pass 36 as a parameter to bitVector, all but one bit are gone in the output.
I have potentially ruled out the following causes:
It may have nothing to do with the BigInteger type since it was designed to never overflow.
It may not be related to memory usage of the program, which uses only 380M at runtime.
I print out the value of computeFactorial(36), which looks good.
What on earth is going on there?

So what you are trying to do is
public static String factorialAsBinary(int n) {
BigInteger bi = BigInteger.ONE;
for (; n > 1; n--)
bi = bi.multiply(BigInteger.valueOf(n));
return bi.toString(2);
}
public static void main(String... args) {
String fact36 = factorialAsBinary(36);
for (char ch : fact36.toCharArray())
System.out.print(ch + " ");
System.out.println();
}
which prints
1 0 0 0 1 0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 0 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 0 1 1 1 0 1 0 1 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 1 0 1 1 0 0 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
What on earth is going on there?
Your code is much more complicated than it needs to be which also makes it easier to make mistakes and harder to understand.

result[0].intValue() is wrong:
intValue():
Converts this BigInteger to an int. This conversion is analogous to a narrowing primitive conversion from long to int as defined in the Java Language Specification: if this BigInteger is too big to fit in an int, only the low-order 32 bits are returned. Note that this conversion can lose information about the overall magnitude of the BigInteger value as well as return a result with the opposite sign.
in your case it returns the lowest 32 bits which are zero therefore you return 0 after first devision

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Finding all the common substrings of given two strings - java

Related

Transitive closure and Warshall algorithm

Java - Adjacency Matrix and DFS

Breadth-first algorithm on adjacency matrix: premature ending of search, returns queue of size 1?

NumberFormatException in my code

Converting very big integer to bit vector fails

Categories

Resources