Method to find similar substrings from two strings

I'm using this piece of Java code to find similar strings:
if (str1.indexOf(str2) >= 0 || str2.indexOf(str1) >= 0) .......
but with str1 = "pizzabase" and str2 = "namedpizzaowl" it doesn't work.
How do I find the common substrings, i.e. "pizza"?

Iterate over each position in str1 and check whether the character there exists in str2. If it doesn't, move on to the next position. If it does, grow the substring of str1 that you look for in str2 by one character at a time, and repeat until it no longer matches or you have reached the end of str1.
This finds the longest shared substring starting at each position, but, like bubble sort, it is hardly optimal; it is just a very basic example of how to solve the problem.
Something like this:
import java.util.ArrayList;
import java.util.List;

List<String> matches = new ArrayList<>();
for (int pos = 0; pos < str1.length(); pos++) {
    int len = 0;
    // Grow the candidate while it still fits in str1 and occurs in str2
    while (pos + len < str1.length()
            && str2.contains(str1.substring(pos, pos + len + 1))) {
        len++;
    }
    if (len > 0) {
        matches.add(str1.substring(pos, pos + len));
    }
}
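For comparison, here is a minimal sketch of the usual dynamic-programming way to get one longest common substring. The class and method names (CommonSubstring, longest) are made up for illustration, and the sketch returns only the first longest match it finds rather than every shared substring.

```java
final class CommonSubstring {
    /** Returns one longest substring shared by a and b, or "" if they share nothing. */
    static String longest(String a, String b) {
        // len[i][j] = length of the common run that ends at a[i-1] and b[j-1]
        int[][] len = new int[a.length() + 1][b.length() + 1];
        int best = 0;     // length of the best run seen so far
        int bestEnd = 0;  // end index (exclusive) of that run in a
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > best) {
                        best = len[i][j];
                        bestEnd = i;
                    }
                }
            }
        }
        return a.substring(bestEnd - best, bestEnd);
    }
}
```

Calling CommonSubstring.longest("pizzabase", "namedpizzaowl") should give "pizza". The table costs O(n*m) time and memory, which is fine for short strings.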

If your algorithm says two strings are similar whenever they share a common substring, it will always return true, because the empty string "" is trivially a substring of every string. It also makes more sense to determine the degree of similarity between strings and return a number rather than a boolean.
Levenshtein distance is a good measure of string (or, more generally, sequence) similarity: http://en.wikipedia.org/wiki/Levenshtein_distance
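As a rough illustration of returning a number instead of a boolean, the sketch below computes the Levenshtein distance with two rolling rows. The names EditDistance and levenshtein are hypothetical, and the sketch compares UTF-16 chars, so it does not treat surrogate pairs specially.

```java
final class EditDistance {
    /** Number of single-character insertions, deletions and substitutions needed to turn a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous prefix of a
        int[] curr = new int[b.length() + 1]; // distances for the current prefix of a
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A smaller result means more similar strings; 0 means the strings are equal.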

Related

How to compare characters in two different strings

Let's say I have "jfk" and "jfc". I want to iterate through both strings and find out if and where they differ. I am trying to see if the strings are anagrams. "new door" and "one word" are anagrams. If they are not anagrams, I want the code to tell me by how many characters the strings differ. "jfk" and "jfc" differ by 1. "macd" and "mebc" differ by 2, so they can't be anagrams. If the two strings have different lengths, they can't be anagrams.
I tried iterating through the strings, but that's where I got stuck. I have no idea how to iterate through both strings at the same time and find out whether they differ by certain characters. I only got as far as checking whether both strings are the same length.
static void isAnagram(String s1, String s2) {
    if (s1.length() != s2.length()) {
        System.out.println("Not anagrams");
    } else {
        for (int i = 0; i < s1.length(); i++) {
            for (int j = 0; j < s2.length(); j++) {
                // I know that iterating through both strings like this does not make sense, but I am stuck.
            }
        }
    }
}

Convert the strings to char arrays. Then sort the arrays alphabetically and compare them character by character.

String str = "abc";
char[] chars = str.toCharArray();
You can use this to convert a string to a char array. After that it is easy to loop over the char arrays with a simple if condition and increment a variable that counts how many characters differ.

If you are allowed additional libraries, you should have a look at Google's Guava, especially at com.google.common.collect.Multiset<E> and its implementations. You can put the characters of each string into a Multiset<Character> (not a Multiset<char>; that will not work, as E must be a reference type, not a primitive). Both strings are anagrams if multiset1.equals(multiset2).
As for the two for loops, it seems to me that you only need one loop, using the same counter for both strings.
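The counting idea can also be sketched with the standard library alone. This is one possible reading of "differ by N characters": the number of characters in the first string that have no counterpart in the second. The names AnagramCheck, difference and isAnagram are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class AnagramCheck {
    /** Counts characters of a that have no counterpart in b; assumes equal lengths. */
    static int difference(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have the same length");
        }
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : a.toCharArray()) {
            counts.merge(c, 1, Integer::sum);   // one more of c seen in a
        }
        for (char c : b.toCharArray()) {
            counts.merge(c, -1, Integer::sum);  // one of c accounted for by b
        }
        int diff = 0;
        for (int n : counts.values()) {
            if (n > 0) {
                diff += n; // characters left over from a
            }
        }
        return diff;
    }

    /** Two strings are anagrams when they have equal length and no leftover characters. */
    static boolean isAnagram(String a, String b) {
        return a.length() == b.length() && difference(a, b) == 0;
    }
}
```

With the examples from the question, difference("jfk", "jfc") is 1, difference("macd", "mebc") is 2, and isAnagram("new door", "one word") is true.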

Find out recursive pattern in a string java

In one of my interviews I was asked to write a program on Java strings, and I was unable to answer it. I don't know whether it is a simple program or a complex one. I have searched the internet for it but was unable to find an exact solution. My question is as follows.
Suppose I have a string which contains a recursive pattern, like
String str1 = "abcabcabc";
In the above string the recursive pattern is "abc", because the string consists only of the "abc" pattern repeated.
If I pass this string to a function/method as a parameter, it should return "This string has a recursive pattern." If the string doesn't have any recursive pattern, it should simply return "This string doesn't contain the recursive pattern."
The following cases are possible:
String str1 = "abcabcabc";
// This string contains the recursive pattern 'abc'
String str2 = "abcabcabcabdac";
// This string doesn't contain a recursive pattern
String str3 = "abcddabcddabcddddabc";
// This string contains the recursive patterns 'abc' & 'dd'
Can anybody suggest a solution/algorithm for this? I am struggling with it. What is the best way to handle the different cases, so that I can implement it?

From LeetCode
public boolean repeatedSubstringPattern(String str) {
    int l = str.length();
    // Try every candidate block length, longest first
    for (int i = l / 2; i >= 1; i--) {
        if (l % i == 0) {
            int m = l / i;
            String subS = str.substring(0, i);
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < m; j++) {
                sb.append(subS);
            }
            if (sb.toString().equals(str)) return true;
        }
    }
    return false;
}
The length of the repeating substring must be a divisor of the length of the input string.
Search all possible divisors of str.length(), starting from length / 2.
If i is a divisor of the length, repeat the substring from 0 to i as many times as i fits into str.length().
If the repeated substring equals the input str, return true.

The solution is not in Java. However, the problem looked interesting, so I attempted to solve it in Python. Apologies!
In Python, I wrote some logic that worked (it could be written much better, but I thought the logic would help you).
The script is:
def check(lst):
    return all(x in lst[-1] for x in lst)

s = input("Enter string:: ")
if check(sorted(s.split(s[0])[1:])):
    print("String, {} is recursive".format(s))
else:
    print("String, {} is NOT recursive".format(s))
Output of the script:
[mac] kgowda#blr-mp6xx:~/Desktop/my_work/play$ python dup.py
Enter string:: abcabcabcabdac
String, abcabcabcabdac is NOT recursive
[mac] kgowda#blr-mp6xx:~/Desktop/my_work/play$ python dup.py
Enter string:: abcabcabc
String, abcabcabc is recursive
[mac] kgowda#blr-mp6xx:~/Desktop/my_work/play$ python dup.py
Enter string:: abcddabcddabcddddabc
String, abcddabcddabcddddabc is recursive

This can also be solved using a part of the Knuth–Morris–Pratt algorithm.
The idea is to build a 1-D array with one entry per character of the word. For each index i we check whether there is a prefix that is also a suffix of the word from index 0 to i. The reason is that if we have a common prefix and suffix, we can continue searching from the character after the prefix ends, so we store the corresponding index in the array.
For s = "abcabcabc", the array will be
Index : 0 1 2 3 4 5 6 7 8
String: a b c a b c a b c
KMP   : 0 0 0 1 2 3 4 5 6
For Index = 2, there is no suffix that is also a prefix of the string abc, i.e. up until Index = 2.
For Index = 4, the suffix ab (Index = 3, 4) is the same as the prefix ab (Index = 0, 1), so we set KMP[4] = 2, which is the index of the pattern from which we have to resume searching.
Thus KMP[i] holds the length of the longest proper prefix of s[0..i] that is also a suffix of s[0..i]. This essentially means that a block of length i + 1 - KMP[i] repeats within s[0..i]. Using this information we can find out whether all the substrings of that length are the same.
For Index = 8, we know KMP[8] = 6, which means the block s[3] to s[5] of length 9 - 6 = 3 is equal to the suffix s[6] to s[8]. If this is the only repeating pattern, the whole string consists of repetitions of it.
For a clearer explanation of this algorithm please check this video lecture.
This table can be built in linear time:
#include <string>
#include <vector>
using namespace std;

// kmp[i] is the length of the longest proper prefix of word[0..i]
// that is also a suffix of word[0..i].
vector<int> buildKMPtable(const string& word)
{
    vector<int> kmp(word.size(), 0);
    int j = 0;
    for (int i = 1; i < (int)word.size(); ++i)
    {
        // Fall back to shorter borders until the characters match or j reaches 0
        // (this also avoids reading kmp[-1] when j is 0).
        while (j > 0 && word[j] != word[i])
        {
            j = kmp[j - 1];
        }
        if (word[j] == word[i])
        {
            ++j;
        }
        kmp[i] = j;
    }
    return kmp;
}

bool repeatedSubstringPattern(string s) {
    if (s.empty())
    {
        return false;
    }
    int n = s.size();
    auto kmp = buildKMPtable(s);
    if (kmp[n - 1] == 0) // No prefix equals a suffix ending at the last character of the string
    {
        return false;
    }
    int diff = n - kmp[n - 1]; // Length of the repetitive pattern
    if (n % diff != 0) // Length of the repetitive pattern must divide the size of the string
    {
        return false;
    }
    // Check if that repetitive pattern is the only repetitive pattern.
    string word = s.substr(0, diff);
    int w_size = word.size();
    for (int i = 0; i < w_size; ++i)
    {
        int j = i;
        while (j < n)
        {
            if (word[i] == s[j])
            {
                j += w_size;
            }
            else
            {
                return false;
            }
        }
    }
    return true;
}
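Since the question is about Java, here is a minimal Java sketch of the same prefix-table idea. The names RepeatedPattern, failureTable and isRepeated are hypothetical. Like the code above, it only detects a string made of a single block repeated end to end.

```java
final class RepeatedPattern {
    /** table[i] = length of the longest proper prefix of s[0..i] that is also a suffix of it. */
    static int[] failureTable(String s) {
        int[] table = new int[s.length()];
        int j = 0; // length of the currently matched prefix
        for (int i = 1; i < s.length(); i++) {
            while (j > 0 && s.charAt(i) != s.charAt(j)) {
                j = table[j - 1]; // fall back to the next shorter border
            }
            if (s.charAt(i) == s.charAt(j)) {
                j++;
            }
            table[i] = j;
        }
        return table;
    }

    /** True when s consists of one block repeated at least twice. */
    static boolean isRepeated(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int n = s.length();
        int border = failureTable(s)[n - 1];
        int period = n - border; // candidate block length
        return border > 0 && n % period == 0;
    }
}
```

Under this definition "abcabcabc" is accepted, while "abcabcabcabdac" is rejected. The third string from the question is rejected too, because it mixes two different blocks.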

If you know the 'parts' in advance, then the answer could be Recursive regular expressions, it seems.
So for abcabcabc we need an expression like abc(?R)* where:
abc matches the literal characters
(?R) recurses the pattern
A * to match between zero and unlimited number of times
The third one is a little trickier. See this regex101 link but it looks like:
((abc)|(dd))(?R)*
where we have either 'abc' or 'dd' and there are any number of these.
Otherwise, I don't see how you could determine from just a string that it has some undefined recursive structure like this.
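One caveat for Java: java.util.regex does not support the (?R) recursion construct. For the patterns above, a plain repeated group does the same job. The sketch below assumes the parts "abc" and "dd" are known in advance; the names RepeatedParts and madeOfParts are made up for illustration.

```java
import java.util.regex.Pattern;

final class RepeatedParts {
    // Assumption: the allowed parts ("abc" and "dd") are known up front.
    private static final Pattern PARTS = Pattern.compile("(?:abc|dd)+");

    /** True when the whole string is built only from the known parts. */
    static boolean madeOfParts(String s) {
        return PARTS.matcher(s).matches();
    }
}
```

Here matches() requires the entire input to fit the pattern, so a stray "abd" anywhere in the string makes it fail.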

Longest Common Subsequence Algorithm Explanation

The pseudocode for the longest common subsequence problem is listed below.
longest-common-subsequence(s1, s2):
If the strings begin with the same letter c, the result to return is c
plus the longest common subsequence between the rest of s1 and s2
(that is, s1 and s2 without their first letter). For example, the
longest subsequence between "hollow" and "hello" is an "h" plus the
longest subsequence found between "ollow" and "ello".
Otherwise, if
the strings do not begin with the same letter, return the longer of
the following two: The longest common subsequence between s1 and the
rest of s2 (s2 without its first letter), The longest common
subsequence between the rest of s1 (s1 without its first letter) and
s2. For example, longest-common-subsequence("ollow", "ello") is the
longer of longest-common-subsequence("ollow", "llo") and
longest-common-subsequence("llow", "ello").
The part that I don't get is when the strings do not begin with the same letter, why do we take (s1 without the first letter, s2), (s1, s2 without the first letter). Why do we recursively go through these steps when they do not match? Is it just a set algorithm that is hard to understand? What is the reasoning behind this?

While @yash mahajan already covered everything, I'll just provide another way to think about it.
Go through the two strings and assume you are at position i in string A (of length m) and position j in string B (of length n).
1. If the current characters of both strings are the same:
longest common subsequence so far = longest common subsequence between substring A[0...i-1] and substring B[0...j-1] + 1.
2. If the two characters are different:
longest common subsequence = Max(longest common subsequence between substring A[0...i-1] and string B, longest common subsequence between string A and substring B[0...j-1])
You will have a clearer idea if you read the code.
public class Solution {
    public int longestCommonSubsequence(String A, String B) {
        if (A == null || B == null || A.length() == 0 || B.length() == 0) {
            return 0;
        }
        int m = A.length();
        int n = B.length();
        // commonSubsequenceLength[i][j] = LCS length of the first i chars of A and the first j chars of B
        int[][] commonSubsequenceLength = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (A.charAt(i - 1) == B.charAt(j - 1)) {
                    commonSubsequenceLength[i][j] = commonSubsequenceLength[i - 1][j - 1] + 1;
                } else {
                    commonSubsequenceLength[i][j] = Math.max(commonSubsequenceLength[i][j - 1], commonSubsequenceLength[i - 1][j]);
                }
            }
        }
        return commonSubsequenceLength[m][n];
    }
}

When we know that the first characters of the two strings don't match, it's clear we can't include both of them in our longest subsequence as we do in the first case. The obvious choice seems to be to drop both characters and search the rest of the two strings for our subsequence. But consider this example: "hello" and "ello". If we drop both first characters, we are dropping the first character of our subsequence ("ello"). So we try two cases:
1. We remove the first character of the first string and search against the whole second string.
2. We remove the first character of the second string and search against the whole first string.
And then we take the maximum of those two.
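The two cases can be written down almost literally. The following is a plain recursive sketch without memoization, so it is exponential and only suitable for short strings; the names RecursiveLcs and lcs are hypothetical.

```java
final class RecursiveLcs {
    /** Returns one longest common subsequence of s1 and s2. */
    static String lcs(String s1, String s2) {
        if (s1.isEmpty() || s2.isEmpty()) {
            return ""; // nothing left to match
        }
        if (s1.charAt(0) == s2.charAt(0)) {
            // Same first letter: keep it and recurse on the rest of both strings
            return s1.charAt(0) + lcs(s1.substring(1), s2.substring(1));
        }
        // Different first letters: at most one of them can start the answer,
        // so try dropping each in turn and keep the longer result
        String withoutFirstOfS1 = lcs(s1.substring(1), s2);
        String withoutFirstOfS2 = lcs(s1, s2.substring(1));
        return withoutFirstOfS1.length() >= withoutFirstOfS2.length()
                ? withoutFirstOfS1
                : withoutFirstOfS2;
    }
}
```

For "hello" and "ello", dropping the "h" of the first string keeps the full "ello", which is exactly the case that dropping both first characters would miss.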

To answer your question:
The part that I don't get is when the strings do not begin with the
same letter, why do we take (s1 without the first letter, s2), (s1, s2
without the first letter). Why do we recursively go through these
steps when they do not match? Is it just a set algorithm that is hard
to understand? What is the reasoning behind this?
What is confusing you is the recursive call, I guess. The whole idea is to reduce the problem to a smaller input, in this case one character less at a time.
There are 2 cases for the selected character (first or last):
1. There is a match (for example "h" in "hollow", "hello"): keep the matched character, reduce the input by one character in both strings and recursively call the same function.
2. No match: you have 2 options here. Either the first string or the second one has an extra unwanted character. Hence, make a recursive call for both cases and choose the maximum of them.
Extra details:
This problem has properties of a typical Dynamic Programming (DP) Problem.
1) Optimal Substructure
2) Overlapping Sub-problems
Hope it helps!

Check for double letters with indexOf?

I need to check whether a word contains the same letter more than once.
For example, in the name 'bob' the letter 'b' is at indexes 0 and 2, but indexOf only sees the first index, 0.
What I need is for it to check, then skip over 0, go further down the word and check for more of the same letter. Here is what I have so far.
String wordNow = "bob";
letterGuess = console.next().toUpperCase();
letterIndex = wordNow.indexOf(letterGuess);
System.out.println(letterIndex);
OUTPUT: 0
If anyone has a good, efficient way of doing this, I'm all ears.

You can use String.lastIndexOf for this. Since both functions will return -1 if not found, then to check if there is more than one instance, you can just compare the values
return wordNow.indexOf(letterGuess) != wordNow.lastIndexOf(letterGuess);

There are multiple versions of the method indexOf. One of them takes an index itself! Just read the javadoc for the string class carefully. You see there is even one called "lastIndexOf" which would come in really handy.
You can use that for example to see if there are other occurrences of that char "behind" the first index you found.
In any case: the real answer here is that you should study the documentation of classes extensively.
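A short sketch of the indexOf overload that takes a start index, String.indexOf(int ch, int fromIndex), used to collect every position of a letter. The names Occurrences and indicesOf are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class Occurrences {
    /** Returns every index at which letter occurs in word, in ascending order. */
    static List<Integer> indicesOf(String word, char letter) {
        List<Integer> indices = new ArrayList<>();
        int index = word.indexOf(letter);
        while (index >= 0) {
            indices.add(index);
            // Resume the search just behind the previous hit
            index = word.indexOf(letter, index + 1);
        }
        return indices;
    }
}
```

A letter is repeated when the returned list has more than one element. Note that the search is case-sensitive, so the guess and the word need the same case.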

You can search the substring that follows the matching character, as below:
String wordNow = "bob";
letterGuess = console.next().toUpperCase();
letterIndex = wordNow.indexOf(letterGuess);
System.out.println(letterIndex);
if (letterIndex >= 0) {
    // This index is relative to the substring that starts after the first match
    int secondIndex = wordNow.substring(letterIndex + 1).indexOf(letterGuess);
    System.out.println(secondIndex);
}

The most efficient way is simply to search for the element you are looking for (assuming no order or distribution over the input string).
public boolean isCharacterRepeatedIgnoreCase(String inputString, Character c) {
    int numFound = 0;
    final Character chUpper = Character.toUpperCase(c);
    final String upperCaseString = inputString.toUpperCase();
    for (int i = 0; i < upperCaseString.length(); ++i) {
        if (upperCaseString.charAt(i) == chUpper) {
            numFound++;
        }
        // Stop as soon as a second occurrence is seen
        if (numFound > 1) {
            return true;
        }
    }
    return false;
}
Note that I have not run the above code, so please write proper unit tests if you plan on using it. Also, I have assumed that your character fits into 16 bits. You probably want to do something around String or toUpperCase(int) to handle Unicode, see Oracle.

String manipulation of function names

For this Kata, I am given random function names in the PEP8 format and I have to convert them to camelCase.
(input)get_speed == (output)getSpeed ....
(input)set_distance == (output)setDistance
I have an understanding of one way of doing this, written in pseudo-code:
loop through the word,
    if the letter is an underscore
        then delete the underscore
        then get the next letter and change it to uppercase
    endIf
endLoop
return the resultant word
But I'm unsure of the best way of doing this. Would it be more efficient to create a char array and loop through the elements, and then, when it finds an underscore, delete that element, get the next index and change it to uppercase?
Or would it be better to use recursion:
function camelCase takes a string
    if the length of the string is 0,
        then return the string
    endIf
    if the character is an underscore
        then change to nothing,
        then find the next character and change it to uppercase
        return the string taking away the character
    endIf
    finally return the function taking the first character away
Any thoughts please? I'm looking for a good, efficient way of handling this problem. Thanks :)

I would go with this:
divide the given String by underscore into an array
from the second word until the end, take the first letter and convert it to uppercase
join into one word
This will work in O(n) (it goes through the name 3 times). For the first step, use this function:
str.split("_");
For uppercase, use this:
String newName = str.substring(0, 1).toUpperCase() + str.substring(1);
But make sure you check the size of the string first...
Edited - added implementation
It would look like this:
public String camelCase(String str) {
    if (str == null || str.trim().length() == 0) return str;
    String[] split = str.split("_");
    if (split.length == 0) return ""; // input was only underscores
    StringBuilder newStr = new StringBuilder(split[0]);
    for (int i = 1; i < split.length; i++) {
        if (split[i].isEmpty()) continue; // consecutive underscores give empty parts
        newStr.append(split[i].substring(0, 1).toUpperCase()).append(split[i].substring(1));
    }
    return newStr.toString();
}
for inputs:
"test"
"test_me"
"test_me_twice"
it returns:
"test"
"testMe"
"testMeTwice"

It would be simpler to iterate over the string instead of recursing.
String pep8 = "do_it_again";
StringBuilder camelCase = new StringBuilder();
for (int i = 0, l = pep8.length(); i < l; ++i) {
    if (pep8.charAt(i) == '_' && (i + 1) < l) {
        camelCase.append(Character.toUpperCase(pep8.charAt(++i)));
    } else {
        camelCase.append(pep8.charAt(i));
    }
}
System.out.println(camelCase.toString()); // prints doItAgain

The question you pose is whether to use an iterative or a recursive approach. For this case I'd go for the iterative approach because it's straightforward, easy to understand and doesn't require many resources (only one buffer, no new stack frames, etc.), though that doesn't really matter for this example.
Recursion is good for divide-and-conquer problems, but I don't see that fitting this case well, although it's possible.
An iterative implementation of the algorithm you described could look like the following:
StringBuilder buf = new StringBuilder(input);
for (int i = 0; i < buf.length(); i++) {
    if (buf.charAt(i) == '_') {
        buf.deleteCharAt(i);
        if (i != buf.length()) { // check for EOL
            buf.setCharAt(i, Character.toUpperCase(buf.charAt(i)));
        }
    }
}
return buf.toString();
The check for the EOL is not part of the given algorithm and could be omitted if the input string never ends with '_'.
