Most efficient way to search for unknown patterns in a string?

I am trying to find patterns that:
occur more than once
are more than 1 character long
are not substrings of any other known pattern
without knowing any of the patterns that might occur.
For example:
The string "the boy fell by the bell" would return 'ell', 'the b', 'y '.
The string "the boy fell by the bell, the boy fell by the bell" would return 'the boy fell by the bell'.
Using double for-loops, it can be brute forced very inefficiently:
ArrayList<String> patternsList = new ArrayList<>();
int length = string.length();
for (int i = 0; i < length; i++) {
int limit = (length - i) / 2;
for (int j = limit; j >= 1; j--) {
int candidateEndIndex = i + j;
String candidate = string.substring(i, candidateEndIndex);
if(candidate.length() <= 1) {
continue;
}
if (string.substring(candidateEndIndex).contains(candidate)) {
boolean notASubpattern = true;
for (String pattern : patternsList) {
if (pattern.contains(candidate)) {
notASubpattern = false;
break;
}
}
if (notASubpattern) {
patternsList.add(candidate);
}
}
}
}
However, this is incredibly slow when searching large strings with tons of patterns.

You can build a suffix tree for your string in linear time:
https://en.wikipedia.org/wiki/Suffix_tree
The patterns you are looking for are the strings corresponding to internal nodes that have only leaf children.
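
As a rough illustration of the idea, here is a minimal sketch that finds the longest substring occurring more than once. It uses a naively sorted list of suffixes as a simple stand-in for a suffix tree, so it is not linear time. The class and method names are hypothetical, and overlapping occurrences are counted, which is an assumption the question does not state.

```java
import java.util.Arrays;

class LongestRepeat {
    // Returns the longest substring that occurs at least twice in s
    // (occurrences may overlap), or "" if nothing repeats.
    static String longestRepeated(String s) {
        int n = s.length();
        Integer[] suffixStarts = new Integer[n];
        for (int i = 0; i < n; i++) {
            suffixStarts[i] = i;
        }
        // Naive suffix sort: fine for a demo, far slower than a real suffix tree/array.
        Arrays.sort(suffixStarts, (a, b) -> s.substring(a).compareTo(s.substring(b)));

        String best = "";
        // Any repeated substring is a common prefix of two suffixes, and the
        // longest such prefix always shows up between neighbours in sorted order.
        for (int k = 1; k < n; k++) {
            int a = suffixStarts[k - 1];
            int b = suffixStarts[k];
            int common = 0;
            while (a + common < n && b + common < n
                    && s.charAt(a + common) == s.charAt(b + common)) {
                common++;
            }
            if (common > best.length()) {
                best = s.substring(a, a + common);
            }
        }
        return best;
    }
}
```

For "the boy fell by the bell", LongestRepeat.longestRepeated returns "the b". Listing every pattern that is not contained in a longer one needs the extra filtering that a real suffix tree or suffix array approach provides.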

You could use n-grams to find patterns in a string. Scanning the string for n-grams of one fixed size takes O(n) time. Each time you extract a substring with an n-gram, put it into a hash table together with a count of how many times that substring has been seen. When you have finished scanning, search the hash table for counts greater than 1 to find the recurring patterns in the string.
For example, in the string "the boy fell by the bell, the boy fell by the bell" a 6-gram will find the substring "the boy fell by the bell". The hash table entry for that substring will have a count of 2 because it occurred twice in the string. Varying the number of words in the n-gram will help you discover different patterns in the string.
// Requires: using System.Collections.Generic; using System.Text;
// str is the input string.
Dictionary<string, int> dict = new Dictionary<string, int>();
const int wordsPerNgram = 6;
int count = 0;
// Add entries to the hash table
while (count < str.Length) {
    // copy the next wordsPerNgram words into the substring
    StringBuilder sb = new StringBuilder();
    int wordsLeft = wordsPerNgram;
    while (wordsLeft > 0 && count < str.Length) {
        sb.Append(str[count]);
        if (str[count] == ' ')
            wordsLeft--;
        count++;
    }
    // get rid of the trailing blank (and a trailing comma) in the substring
    string substring = sb.ToString().Trim(' ', ',');
    // Update the dictionary (hash table) with the substring
    if (dict.ContainsKey(substring))
        dict[substring]++; // substring is already in the hash table, so increment the count
    else
        dict[substring] = 1;
}
// Find the most commonly occurring pattern in the string
// by searching the hash table for the greatest count.
int maxCount = 0;
string mostCommonPattern = "";
foreach (KeyValuePair<string, int> pair in dict) {
    if (pair.Value > maxCount) {
        maxCount = pair.Value;
        mostCommonPattern = pair.Key;
    }
}
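
For comparison, here is a minimal Java sketch of the same counting idea. It assumes fixed-length character n-grams taken at every position (overlapping windows) rather than groups of words, and the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class NgramCounter {
    // Counts every substring of length n and keeps only those seen more than once.
    static Map<String, Integer> repeatedNgrams(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        counts.values().removeIf(c -> c < 2); // drop n-grams that never repeat
        return counts;
    }
}
```

Calling NgramCounter.repeatedNgrams("the boy fell by the bell", 5) leaves a single entry, "the b" with a count of 2. Running it for several values of n is one way to vary the pattern length, at the cost of one pass per length.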

I've written this just for fun. I hope I have understood the problem correctly, this is valid and fast enough; if not, please be easy on me :) I might optimize it a little more I guess, if someone finds it useful.
private static IEnumerable<string> getPatterns(string txt)
{
char[] arr = txt.ToArray();
BitArray ba = new BitArray(arr.Length);
for (int shingle = getMaxShingleSize(arr); shingle >= 2; shingle--)
{
char[] arr1 = new char[shingle];
int[] indexes = new int[shingle];
HashSet<int> hs = new HashSet<int>();
Dictionary<int, int[]> dic = new Dictionary<int, int[]>();
for (int i = 0, count = arr.Length - shingle; i <= count; i++)
{
for (int j = 0; j < shingle; j++)
{
int index = i + j;
arr1[j] = arr[index];
indexes[j] = index;
}
int h = getHashCode(arr1);
if (hs.Add(h))
{
int[] indexes1 = new int[indexes.Length];
Buffer.BlockCopy(indexes, 0, indexes1, 0, indexes.Length * sizeof(int));
dic.Add(h, indexes1);
}
else
{
bool exists = false;
foreach (int index in indexes)
if (ba.Get(index))
{
exists = true;
break;
}
if (!exists)
{
int[] indexes1 = dic[h];
if (indexes1 != null)
foreach (int index in indexes1)
if (ba.Get(index))
{
exists = true;
break;
}
}
if (!exists)
{
foreach (int index in indexes)
ba.Set(index, true);
int[] indexes1 = dic[h];
if (indexes1 != null)
foreach (int index in indexes1)
ba.Set(index, true);
dic[h] = null;
yield return new string(arr1);
}
}
}
}
}
private static int getMaxShingleSize(char[] arr)
{
for (int shingle = 2; shingle <= arr.Length / 2 + 1; shingle++)
{
char[] arr1 = new char[shingle];
HashSet<int> hs = new HashSet<int>();
bool noPattern = true;
for (int i = 0, count = arr.Length - shingle; i <= count; i++)
{
for (int j = 0; j < shingle; j++)
arr1[j] = arr[i + j];
int h = getHashCode(arr1);
if (!hs.Add(h))
{
noPattern = false;
break;
}
}
if (noPattern)
return shingle - 1;
}
return -1;
}
private static int getHashCode(char[] arr)
{
unchecked
{
int hash = (int)2166136261;
foreach (char c in arr)
hash = (hash * 16777619) ^ c.GetHashCode();
return hash;
}
}
Edit
My previous code has serious problems. This one is better:
private static IEnumerable<string> getPatterns(string txt)
{
Dictionary<int, int> dicIndexSize = new Dictionary<int, int>();
for (int shingle = 2, count0 = txt.Length / 2 + 1; shingle <= count0; shingle++)
{
Dictionary<string, int> dic = new Dictionary<string, int>();
bool patternExists = false;
for (int i = 0, count = txt.Length - shingle; i <= count; i++)
{
string sub = txt.Substring(i, shingle);
if (!dic.ContainsKey(sub))
dic.Add(sub, i);
else
{
patternExists = true;
int index0 = dic[sub];
if (index0 >= 0)
{
dicIndexSize[index0] = shingle;
dic[sub] = -1;
}
}
}
if (!patternExists)
break;
}
List<int> lst = dicIndexSize.Keys.ToList();
lst.Sort((a, b) => dicIndexSize[b].CompareTo(dicIndexSize[a]));
BitArray ba = new BitArray(txt.Length);
foreach (int i in lst)
{
bool ok = true;
int len = dicIndexSize[i];
for (int j = i, max = i + len; j < max; j++)
{
if (ok) ok = !ba.Get(j);
ba.Set(j, true);
}
if (ok)
yield return txt.Substring(i, len);
}
}
The text of this book took 3.4 s on my computer.

Suffix arrays are the right idea, but there's a non-trivial piece missing, namely, identifying what are known in the literature as "supermaximal repeats". Here's a GitHub repo with working code: https://github.com/eisenstatdavid/commonsub . Suffix array construction uses the SAIS library, vendored in as a submodule. The supermaximal repeats are found using a corrected version of the pseudocode from findsmaxr in "Efficient repeat finding via suffix arrays" (Becher–Deymonnaz–Heiber).
static void FindRepeatedStrings(void) {
// findsmaxr from https://arxiv.org/pdf/1304.0528.pdf
printf("[");
bool needComma = false;
int up = -1;
for (int i = 1; i < Len; i++) {
if (LongCommPre[i - 1] < LongCommPre[i]) {
up = i;
continue;
}
if (LongCommPre[i - 1] == LongCommPre[i] || up < 0) continue;
for (int k = up - 1; k < i; k++) {
if (SufArr[k] == 0) continue;
unsigned char c = Buf[SufArr[k] - 1];
if (Set[c] == i) goto skip;
Set[c] = i;
}
if (needComma) {
printf("\n,");
}
printf("\"");
for (int j = 0; j < LongCommPre[up]; j++) {
unsigned char c = Buf[SufArr[up] + j];
if (iscntrl(c)) {
printf("\\u%.4x", c);
} else if (c == '\"' || c == '\\') {
printf("\\%c", c);
} else {
printf("%c", c);
}
}
printf("\"");
needComma = true;
skip:
up = -1;
}
printf("\n]\n");
}
Here's a sample output on the text of the first paragraph:
Davids-MBP:commonsub eisen$ ./repsub input
["\u000a"
," S"
," as "
," co"
," ide"
," in "
," li"
," n"
," p"
," the "
," us"
," ve"
," w"
,"\""
,"–"
,"("
,")"
,". "
,"0"
,"He"
,"Suffix array"
,"`"
,"a su"
,"at "
,"code"
,"com"
,"ct"
,"do"
,"e f"
,"ec"
,"ed "
,"ei"
,"ent"
,"ere's a "
,"find"
,"her"
,"https://"
,"ib"
,"ie"
,"ing "
,"ion "
,"is"
,"ith"
,"iv"
,"k"
,"mon"
,"na"
,"no"
,"nst"
,"ons"
,"or"
,"pdf"
,"ri"
,"s are "
,"se"
,"sing"
,"sub"
,"supermaximal repeats"
,"te"
,"ti"
,"tr"
,"ub "
,"uffix arrays"
,"via"
,"y, "
]

I would use Knuth–Morris–Pratt algorithm (linear time complexity O(n)) to find substrings. I would try to find the largest substring pattern, remove it from the input string and try to find the second largest and so on. I would do something like this:
string pattern = input.substring(0, input.length / 2);
string toMatchString = input.substring(pattern.length, input.length);
List<string> matches = new List<string>();
while (pattern.length > 0)
{
    // assumes KMP returns the match index, or -1 when there is no match
    int index = KMP(pattern, toMatchString);
    if (index >= 0)
    {
        matches.Add(pattern);
        // remove the matched pattern occurrences from the input string
        // I would do something like this:
        // 0 to pattern.length gets removed
        // check for all occurrences of pattern in toMatchString and remove them
        // get the remaining shrunken input, reassign values for pattern & toMatchString
        // keep looking for the next largest substring
    }
    else
    {
        pattern = input.substring(0, pattern.length - 1);
        toMatchString = input.substring(pattern.length, input.length);
    }
}
Here KMP implements the Knuth–Morris–Pratt algorithm. You can find Java implementations of it on GitHub or at Princeton, or write it yourself.
PS: I don't code in Java, and this is a quick attempt at my first bounty, which is about to close. So please don't give me the stick if I missed something trivial or made a +/-1 error.
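
For readers who want to write the KMP helper themselves, here is a minimal Java sketch. The class name, the method name and the argument order (pattern first, then text) are assumptions chosen to match the call above, and it returns -1 when the pattern does not occur.

```java
class KmpSearch {
    // Returns the index of the first occurrence of pattern in text, or -1.
    static int kmp(String pattern, String text) {
        int m = pattern.length();
        if (m == 0) {
            return 0; // the empty pattern matches at the start
        }
        // fail[i] = length of the longest proper prefix of pattern[0..i]
        // that is also a suffix of pattern[0..i]
        int[] fail = new int[m];
        for (int i = 1, k = 0; i < m; i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        // Scan the text once; k is the number of pattern characters matched so far.
        for (int i = 0, k = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == m) {
                return i - m + 1;
            }
        }
        return -1;
    }
}
```

The search is linear in the combined length of pattern and text, but the outer loop in the answer calls it once per candidate length, so the overall approach is not linear.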

Related

Efficient way to find longest streak of characters in string

This code works fine but I'm looking for a way to optimize it. If you look at the long string, you can see 'l' appears five times consecutively. No other character appears this many times consecutively. So, the output is 5. Now, the problem is this method checks each and every character and even after the max is found, it continues to check the remaining characters. Is there a more efficient way?
public class Main {
public static void main(String[] args) {
System.out.println(longestStreak("KDDiiigllllldddfnnlleeezzeddd"));
}
private static int longestStreak(String str) {
int max = 0;
for (int i = 0; i < str.length(); i++) {
int count = 0;
for (int j = i; j < str.length(); j++) {
if (str.charAt(i) == str.charAt(j)) {
count++;
} else break;
}
if (count > max) max = count;
}
return max;
}
}

We could track the length of the current run in a single pass by remembering the previous character. As an additional optimisation, we keep iterating only while index + maxLength - currentLength < str.length(); once that no longer holds, the remaining characters cannot change the maximum:
private static int longestStreak(String str) {
    if (str.isEmpty()) {
        return 0; // no characters, so no streak
    }
    int maxLength = 0;
    int currentLength = 1;
    char prev = str.charAt(0);
    for (int index = 1; index < str.length() && isMaxCanBeChanged(str, maxLength, currentLength, index); index++) {
        char currentChar = str.charAt(index);
        if (currentChar == prev) {
            currentLength++;
        } else {
            maxLength = Math.max(maxLength, currentLength);
            currentLength = 1;
        }
        prev = currentChar;
    }
    return Math.max(maxLength, currentLength);
}
// true while the current run could still grow past the best run seen so far
private static boolean isMaxCanBeChanged(String str, int max, int currentLength, int index) {
    return index + max - currentLength < str.length();
}

Here is a regex magic solution, which although a bit brute force perhaps gets some brownie points. We can iterate starting with the number of characters in the original input, decreasing by one at a time, trying to do a regex replacement of continuous characters of that length. If the replacement works, then we know we found the longest streak.
String input = "KDDiiigllllldddfnnlleeezzeddd";
for (int i = input.length(); i > 0; --i) {
    // $1$2 keeps the whole run: one character plus its (i - 1) repeats
    String replace = input.replaceAll(".*?(.)(\\1{" + (i - 1) + "}).*", "$1$2");
    if (replace.length() != input.length()) {
        System.out.println("longest streak is: " + replace);
        break; // the first hit is the longest streak
    }
}
This prints:
longest streak is: lllll

Yes there is. C++ code:
string str = "KDDiiigllllldddfnnlleeezzeddd";
int longest_streak = 1, current_streak = 1; char longest_letter = str[0];
for (int i = 1; i < str.size(); ++i) {
if (str[i] == str[i - 1])
current_streak++;
else current_streak = 1;
if (current_streak > longest_streak) {
longest_streak = current_streak;
longest_letter = str[i];
}
}
cout << "The longest streak is: " << longest_streak << " and the character is: " << longest_letter << "\n";
LE: If needed, I can provide the Java code for it, but I think you get the idea.
public class Main {
public static void main(String[] args) {
System.out.println(longestStreak("KDDiiigllllldddfnnlleeezzeddd"));
}
private static int longestStreak(String str) {
int longest_streak = 1, current_streak = 1; char longest_letter = str.charAt(0);
for (int i = 1; i < str.length(); ++i) {
if (str.charAt(i) == str.charAt(i - 1))
current_streak++;
else current_streak = 1;
if (current_streak > longest_streak) {
longest_streak = current_streak;
longest_letter = str.charAt(i);
}
}
return longest_streak;
}
}

The loop could be rewritten a bit smaller, but mainly the condition can be optimized:
i < str.length() - max
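
A minimal sketch of how that bound could fit into a loop that jumps from run to run. The class name is hypothetical, and the variable max is assumed to hold the best streak found so far.

```java
class Streak {
    static int longestStreak(String str) {
        int max = 0;
        int i = 0;
        // A run starting at i has at most str.length() - i characters,
        // so it can only beat max while i < str.length() - max.
        while (i < str.length() - max) {
            int j = i + 1;
            while (j < str.length() && str.charAt(j) == str.charAt(i)) {
                j++;
            }
            max = Math.max(max, j - i); // the run covers indexes i..j-1
            i = j;                      // continue at the start of the next run
        }
        return max;
    }
}
```

For "KDDiiigllllldddfnnlleeezzeddd" this returns 5. Each character is visited at most once, and the scan stops early once the tail is too short to contain a longer run.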

Using a Stream and a collector. This gives every character with the highest total count in the string (total occurrences, not necessarily consecutive ones).
Code:
String lineString = "KDDiiiiiiigllllldddfnnlleeezzeddd";
String[] lineSplit = lineString.split("");
Map<String, Integer> collect = Arrays.stream(lineSplit)
.collect(Collectors.groupingBy(Function.identity(), Collectors.summingInt(e -> 1)));
int maxValueInMap = (Collections.max(collect.values()));
for (Entry<String, Integer> entry : collect.entrySet()) {
if (entry.getValue() == maxValueInMap) {
System.out.printf("Character: %s, Repetition: %d\n", entry.getKey(), entry.getValue());
}
}
Output:
Character: i, Repetition: 7
Character: l, Repetition: 7
P.S. I am not sure how efficient this code is. I just learned Streams.

Finding shortest possible substring that contains a String

This was a question asked in a recent programming interview.
Given a random string S and another string T with unique elements, find the minimum consecutive sub-string of S such that it contains all the elements in T.
Say,
S='adobecodebanc'
T='abc'
Answer='banc'
I've come up with a solution,
public static String completeSubstring(String T, String S){
String minSub = T;
StringBuilder sb = new StringBuilder();
for (int i = 0; i <T.length()-1; i++) {
for (int j = i + 1; j <= T.length() ; j++) {
String sub = T.substring(i,j);
if(stringContains(sub, S)){
if(sub.length() < minSub.length()) minSub = sub;
}
}
}
return minSub;
}
private static boolean stringContains(String t, String s){
//if(t.length() <= s.length()) return false;
int[] arr = new int[256];
for (int i = 0; i <t.length() ; i++) {
char c = t.charAt(i);
arr[c -'a'] = 1;
}
boolean found = true;
for (int i = 0; i <s.length() ; i++) {
char c = s.charAt(i);
if(arr[c - 'a'] != 1){
found = false;
break;
}else continue;
}
return found;
}
This algorithm has O(n^3) complexity, which naturally isn't great. Can someone suggest a better algorithm?

Here's the O(N) solution.
The important thing to note re: complexity is that each unit of work involves incrementing either start or end, they don't decrease, and the algorithm stops before they both get to the end.
public static String findSubString(String s, String t)
{
//algorithm moves a sliding "current substring" through s
//in this map, we keep track of the number of occurrences of
//each target character there are in the current substring
Map<Character,int[]> counts = new HashMap<>();
for (char c : t.toCharArray())
{
counts.put(c,new int[1]);
}
//how many target characters are missing from the current substring
//current substring is initially empty, so all of them
int missing = counts.size();
//don't waste my time
if (missing<1)
{
return "";
}
//best substring found
int bestStart = -1, bestEnd = -1;
//current substring
int start=0, end=0;
while (end<s.length())
{
//expand the current substring at the end
int[] cnt = counts.get(s.charAt(end++));
if (cnt!=null)
{
if (cnt[0]==0)
{
--missing;
}
cnt[0]+=1;
}
//while the current substring is valid, remove characters
//at the start to see if a shorter substring that ends at the
//same place is also valid
while(start<end && missing<=0)
{
//current substring is valid
if (end-start < bestEnd-bestStart || bestEnd<0)
{
bestStart = start;
bestEnd = end;
}
cnt = counts.get(s.charAt(start++));
if (cnt != null)
{
cnt[0]-=1;
if (cnt[0]==0)
{
++missing;
}
}
}
//current substring is no longer valid. we'll add characters
//at the end until we get another valid one
//note that we don't need to add back any start character that
//we just removed, since we already tried the shortest valid string
//that starts at start-1
}
//bestStart stays -1 when no valid substring was found
return (bestStart >= 0 ? s.substring(bestStart, bestEnd) : null);
}

I know that there already is an adequate O(N) complexity answer, but I tried to figure it out on my own without looking it up, just because it's a fun problem to solve and thought I would share. Here's the O(N) solution that I came up with:
public static String completeSubstring(String S, String T){
int min = S.length()+1, index1 = -1, index2 = -1;
ArrayList<ArrayList<Integer>> index = new ArrayList<ArrayList<Integer>>();
HashSet<Character> targetChars = new HashSet<Character>();
for(char c : T.toCharArray()) targetChars.add(c);
//reduce initial sequence to only target chars and keep track of index
//Note that the resultant string does not allow the same char to be consecutive
StringBuilder filterS = new StringBuilder();
for(int i = 0, s = 0 ; i < S.length() ; i++) {
char c = S.charAt(i);
if(targetChars.contains(c)) {
if(s > 0 && filterS.charAt(s-1) == c) {
index.get(s-1).add(i);
} else {
filterS.append(c);
index.add(new ArrayList<Integer>());
index.get(s).add(i);
s++;
}
}
}
//Not necessary to use regex, loops are fine, but for readability sake
String regex = "([abc])((?!\\1)[abc])((?!\\1)(?!\\2)[abc])";
Matcher m = Pattern.compile(regex).matcher(filterS.toString());
for(int i = 0, start = -1, p1, p2, tempMin, charSize = targetChars.size() ; m.find(i) ; i = start+1) {
start = m.start();
ArrayList<Integer> first = index.get(start);
p1 = first.get(first.size()-1);
p2 = index.get(start+charSize-1).get(0);
tempMin = p2-p1;
if(tempMin < min) {
min = tempMin;
index1 = p1;
index2 = p2;
}
}
return S.substring(index1, index2+1);
}
I'm pretty sure the complexity is O(N); please correct me if I'm wrong.

An alternative implementation of the O(N) algorithm proposed by @MattTimmermans, which uses a Map<Integer, Integer> to count occurrences and a Set<Integer> to store the chars from T that are present in the current substring:
public static String completeSubstring(String s, String t) {
Map<Integer, Integer> occ
= t.chars().boxed().collect(Collectors.toMap(c -> c, c -> 0));
Set<Integer> found = new HashSet<>(); // characters from T found in current match
int start = 0; // current match
int bestStart = Integer.MIN_VALUE, bestEnd = -1;
for (int i = 0; i < s.length(); i++) {
int ci = s.charAt(i); // current char
if (!occ.containsKey(ci)) // not from T
continue;
occ.put(ci, occ.get(ci) + 1); // add occurrence
found.add(ci);
for (int j = start; j < i; j++) { // try to reduce current match
int cj = s.charAt(j);
Integer c = occ.get(cj);
if (c != null) {
if (c == 1) { // cannot reduce anymore
start = j;
break;
} else
occ.put(cj, c - 1); // remove occurrence
}
}
if (found.size() == occ.size() // all chars found
&& (i - start < bestEnd - bestStart)) {
bestStart = start;
bestEnd = i;
}
}
return bestStart < 0 ? null : s.substring(bestStart, bestEnd + 1);
}

How to get the count of unmatched character in two strings?

I need to get the count of unmatched characters in two strings. For example:
string 1 "hari", string 2 "malar"
I need to remove the characters that are common to both strings. 'a' and 'r' occur in both, so one occurrence of each is removed from each string. String 1 then contains "hi" and string 2 contains "mla".
Remaining count = 5
I tried the code below. It works fine when no character is repeated within a string, but here 'a' occurs twice in string 2, so my code does not work properly.
for (int i = 0; i < first.length; i++) {
for (int j = 0; j < second.length; j++) {
if(first[i] == second[j])
{
getstrings = new ArrayList<String>();
count=count+1;
Log.d("Matches", "string char that matched "+ first[i] +"==" + second[j]);
}
}
}
int tot=(first.length + second.length) - count;
Here first and second refer to:
char[] first = nameone.toCharArray();
char[] second = nametwo.toCharArray();
This code works fine for string 1 "sri" and string 2 "hari", because no character is repeated within either string. Can you help me solve this?

Here is my solution,
public static void RemoveMatchedCharsInnStrings(String first,String second)
{
for(int i = 0 ;i < first.length() ; i ++)
{
char c = first.charAt(i);
if(second.indexOf(c)!= -1)
{
first = first.replaceAll(""+c, "");
second = second.replaceAll(""+c, "");
}
}
System.out.println(first);
System.out.println(second);
System.out.println(first.length() + second.length());
}
I hope this is what you need. If not, I'll update my answer.

I saw the other answers and thought: There must be a more declarative and composable way of doing this!
There is, but it's far longer...
public static void main(String[] args) {
String first = "hari";
String second = "malar";
Map<Character, Integer> differences = absoluteDifference(characterCountOf(first), characterCountOf(second));
System.out.println(sumOfCounts(differences));
}
public static Map<Character, Integer> characterCountOf(String text) {
Map<Character, Integer> result = new HashMap<Character, Integer>();
for (int i=0; i < text.length(); i++) {
Character c = text.charAt(i);
result.put(c, result.containsKey(c) ? result.get(c) + 1 : 1);
}
return result;
}
public static <K> Set<K> commonKeys(Map<K, ?> first, Map<K, ?> second) {
Set<K> result = new HashSet<K>(first.keySet());
result.addAll(second.keySet());
return result;
}
public static <K> Map<K, Integer> absoluteDifference(Map<K, Integer> first, Map<K, Integer> second) {
Map<K, Integer> result = new HashMap<K, Integer>();
for (K key: commonKeys(first, second)) {
Integer firstCount = first.containsKey(key) ? first.get(key) : 0;
Integer secondCount = second.containsKey(key) ? second.get(key) : 0;
Integer resultCount = Math.max(firstCount, secondCount) - Math.min(firstCount, secondCount);
if (resultCount > 0) result.put(key, resultCount);
}
return result;
}
public static Integer sumOfCounts(Map<?, Integer> map) {
Integer sum = 0;
for (Integer count: map.values()) {
sum += count;
}
return sum;
}
This is the solution I prefer, but it's a lot longer. You've tagged the question with Android, so I didn't use any Java 8 features, which would reduce it a bit (but not as much as I would have hoped for).
However, it produces meaningful intermediate results. But it's still so much longer :-(

Try out this code:
String first = "hari";
String second = "malar";
String tempFirst = "";
String tempSecond = "";
int maxSize = ((first.length() > second.length()) ? (first.length()) : (second.length()));
for (int i = 0; i < maxSize; i++) {
if (i >= second.length()) {
tempFirst += first.charAt(i);
} else if (i >= first.length()) {
tempSecond += second.charAt(i);
} else if (first.charAt(i) != second.charAt(i)) {
tempFirst += first.charAt(i);
tempSecond += second.charAt(i);
}
}
first = tempFirst;
second = tempSecond;

You need to break as soon as a match is found:
public static void main(String[] args) {
String nameone="hari";
String nametwo="malar";
char[] first = nameone.toCharArray();
char[] second = nametwo.toCharArray();
List<String>getstrings=null;
int count=0;
for (int i = 0; i < first.length; i++) {
for (int j = 0; j < second.length; j++) {
if(first[i] == second[j])
{
getstrings = new ArrayList<String>();
count++;
System.out.println("Matches"+ "string char that matched "+ first[i] +"==" + second[j]);
break;
}
}
}
//System.out.println(count);
int tot=(first.length-count )+ (second.length - count);
System.out.println("Remaining after match from both strings:"+tot);
}
prints:
Remaining after match from both strings:5

You are missing two things here.
In the if condition, when the two characters match, you need to increment count by 2, not 1, because you are eliminating a character from both strings.
You need to put a break in the if block, so that each character is only matched against its first occurrence.
I made those two changes in your code as below, and now it prints the result you expected.
for (int i = 0; i < first.length; i++) {
for (int j = 0; j < second.length; j++) {
if(first[i] == second[j])
{
count=count+2;
break;
}
}
}
int tot=(first.length + second.length) - count;
System.out.println("Result = "+tot);

You just need to loop over the two strings. Whenever two characters match, increment the count by 2, then subtract that count from the total length of the two strings.
s = 'hackerhappy'
t = 'hackerrank'
count = 0
for i in range(len(s)):
    for j in range(len(t)):
        if s[i] == t[j]:
            count += 2
            break
char_unmatched = (len(s) + len(t)) - count
char_unmatched contains the number of characters from both strings that are not matched.
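
The nested-loop versions above can match the same character of the second string more than once when the first string repeats a character (for example "aa" against "a"). A minimal Java sketch that avoids this by comparing per-character counts instead; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class Unmatched {
    // Sum over all characters of |occurrences in a - occurrences in b|.
    static int unmatchedCount(String a, String b) {
        Map<Character, Integer> diff = new HashMap<>();
        for (char c : a.toCharArray()) {
            diff.merge(c, 1, Integer::sum);   // +1 for each occurrence in a
        }
        for (char c : b.toCharArray()) {
            diff.merge(c, -1, Integer::sum);  // -1 for each occurrence in b
        }
        int total = 0;
        for (int d : diff.values()) {
            total += Math.abs(d);             // leftovers that found no partner
        }
        return total;
    }
}
```

For "hari" and "malar" this gives 5, matching the expected result in the question, and it runs in time linear in the combined length of the two strings.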

CYK algorithm implementation java

I'm trying to implement the CYK algorithm based on wikipedia pseudocode. When I test the string "a b" for the grammar input:
S->A B
A->a
B->b
It gives me false, and I think it should be true. I have an arraylist called AllGrammar that contains all the rules. For the example above it would contain:
[0]: S->A B
[1]: A->a
[2]: B->b
For the example S->hello and the input string hello it gives me true as it should. More complex tests (more productions) gives me false :S
public static boolean cyk(String entrada) {
int n = entrada.length();
int r = AllGrammar.size();
//Vector<String> startingsymbols = getSymbols(AllGrammar);
String[] ent = entrada.split("\\s");
n = ent.length;
System.out.println("length of entry" + n);
//let P[n,n,r] be an array of booleans. Initialize all elements of P to false.
boolean P[][][] = initialize3DVector(n, r);
//n-> number of words of string entrada,
//r-> number of nonterminal symbols
//This grammar contains the subset Rs which is the set of start symbols
for (int i = 1; i < n; i++) {
for(int j = 0; j < r; j++) {
String[] rule = (String[]) AllGrammar.get(j);
if (rule.length == 2) {
if (rule[1].equals(ent[i])) {
System.out.println("entrou");
System.out.println(rule[1]);
P[i][1][j + 1] = true;
}
}
}
}
for(int i = 2; i < n; i++) {
System.out.println("FIRST:" + i);
for(int j = 1; j < n - i + 1; j++) {
System.out.println("SECOND:" + j);
for(int k = 1; k < i - 1; k++) {
System.out.println("THIRD:" + k);
for(int g = 0; g < r; g++) {
String[] rule = (String[]) AllGrammar.get(g);
if (rule.length > 2) {
int A = returnPos(rule[0]);
int B = returnPos(rule[1]);
int C = returnPos(rule[2]);
System.out.println("A" + A);
System.out.println("B" + B);
System.out.println("C" + C);
if (A!=-1 && B!=-1 && C!=-1) {
if (P[j][k][B] && P[j + k][i - k][C]) {
System.out.println("entrou2");
P[j][i][A] = true;
}
}
}
}
}
}
}
for(int x = 0; x < r; x++) {
if(P[1][n][x]) return true;
}
return false;
}

Compared with the CYK pseudocode:
you have indexing starting at 1, but Java arrays start at 0
the function returnPos() is not defined, and it's not obvious what it does.
It would seem the problems could be fairly basic ones in the use of indexes. If you are new to the language, you might want to get a refresher.
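
For reference, a minimal 0-indexed sketch of CYK in Java. It assumes the grammar is in Chomsky normal form and that each rule is stored as a String[] ({"A", "a"} for A->a, {"S", "A", "B"} for S->A B), which mirrors the rule.length checks in the question. The class name, the start parameter and the numbering of nonterminals are hypothetical and are not the asker's actual helpers.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CykSketch {
    static boolean cyk(String input, List<String[]> rules, String start) {
        if (input.trim().isEmpty()) {
            return false; // CNF without an epsilon rule cannot derive the empty input
        }
        String[] words = input.trim().split("\\s+");
        int n = words.length;
        // Give every left-hand-side nonterminal an index 0..count-1.
        Map<String, Integer> id = new HashMap<>();
        for (String[] rule : rules) {
            id.putIfAbsent(rule[0], id.size());
        }
        // p[len][pos][nt]: the len words starting at pos can be derived from nt.
        boolean[][][] p = new boolean[n + 1][n][id.size()];
        for (int pos = 0; pos < n; pos++) {
            for (String[] rule : rules) {
                if (rule.length == 2 && rule[1].equals(words[pos])) {
                    p[1][pos][id.get(rule[0])] = true;
                }
            }
        }
        for (int len = 2; len <= n; len++) {                 // span length
            for (int pos = 0; pos + len <= n; pos++) {       // span start
                for (int split = 1; split < len; split++) {  // size of the left part
                    for (String[] rule : rules) {
                        if (rule.length != 3) continue;
                        Integer b = id.get(rule[1]);
                        Integer c = id.get(rule[2]);
                        if (b == null || c == null) continue;
                        if (p[split][pos][b] && p[len - split][pos + split][c]) {
                            p[len][pos][id.get(rule[0])] = true;
                        }
                    }
                }
            }
        }
        Integer startId = id.get(start);
        return startId != null && p[n][0][startId];
    }
}
```

With the rules S->A B, A->a, B->b and the start symbol S, the input "a b" is accepted and "b a" is rejected. Note that every loop bound is inclusive where the pseudocode is (len <= n), which is easy to lose when translating from 1-based indexes.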

Finding the intersection between two list of string candidates

I wrote the following Java code, to find the intersection between the prefix and the suffix of a String in Java.
// you can also use imports, for example:
// import java.math.*;
import java.util.*;
class Solution {
public int max_prefix_suffix(String S) {
if (S.length() == 0) {
return 1;
}
// prefix candidates
Vector<String> prefix = new Vector<String>();
// suffix candidates
Vector<String> suffix = new Vector<String>();
// will tell me the difference
Set<String> set = new HashSet<String>();
int size = S.length();
for (int i = 0; i < size; i++) {
String candidate = getPrefix(S, i);
// System.out.println( candidate );
prefix.add(candidate);
}
for (int i = size; i >= 0; i--) {
String candidate = getSuffix(S, i);
// System.out.println( candidate );
suffix.add(candidate);
}
int p = prefix.size();
int s = suffix.size();
for (int i = 0; i < p; i++) {
set.add(prefix.get(i));
}
for (int i = 0; i < s; i++) {
set.add(suffix.get(i));
}
System.out.println("set: " + set.size());
System.out.println("P: " + p + " S: " + s);
int max = (p + s) - set.size();
return max;
}
// codility
// y t i l i d o c
public String getSuffix(String S, int index) {
String suffix = "";
int size = S.length();
for (int i = size - 1; i >= index; i--) {
suffix += S.charAt(i);
}
return suffix;
}
public String getPrefix(String S, int index) {
String prefix = "";
for (int i = 0; i <= index; i++) {
prefix += S.charAt(i);
}
return prefix;
}
public static void main(String[] args) {
Solution sol = new Solution();
String t1 = "";
String t2 = "abbabba";
String t3 = "codility";
System.out.println(sol.max_prefix_suffix(t1));
System.out.println(sol.max_prefix_suffix(t2));
System.out.println(sol.max_prefix_suffix(t3));
System.exit(0);
}
}
Some test cases are:
String t1 = "";
String t2 = "abbabba";
String t3 = "codility";
and the expected values are:
1, 4, 0
My idea was to produce the prefix candidates and push them into a vector, then find the suffix candidates and push them into a vector, finally push both vectors into a Set and then calculate the difference. However, I'm getting 1, 7, and 0. Could someone please help me figure it out what I'm doing wrong?

I'd write your method as follows:
public int max_prefix_suffix(String s) {
final int len = s.length();
if (len == 0) {
return 1; // there's some dispute about this in the comments to your post
}
int count = 0;
for (int i = 1; i <= len; ++i) {
final String prefix = s.substring(0, i);
final String suffix = s.substring(len - i, len);
if (prefix.equals(suffix)) {
++count;
}
}
return count;
}
If you need to compare the prefix to the reverse of the suffix, I'd do it like this:
final String suffix = new StringBuilder(s.substring(len - i, len))
.reverse().toString();

I see that the code by @ted Hop is good.
The question asks for the maximum number of matching characters in a suffix and a prefix of the given string, where both are proper substrings. Hence the entire string is not taken into consideration for this maximum.
For example, in "abbabba" the prefix and the suffix can both be abba (the first 4 chars and the last 4 chars), hence the length is 4.
In codility, the prefixes (c, co, cod, codi, ...) and the suffixes (y, ty, ity, lity, ...) have nothing in common, hence the length here is 0.
By changing the count update from
if (prefix.equals(suffix)) {
++count;
}
to
if (prefix.equals(suffix)) {
count = prefix.length();// or suffix.length()
}
we get the max length, provided the loop stops at i < len so that the whole string is excluded.
But could this be done in O(n)? The built-in String.equals, I believe, takes O(n), and hence the overall complexity is O(n^2).

I would use this code.
public static int max_prefix_suffix(String S)
{
if (S == null)
return -1;
Set<String> set = new HashSet<String>();
int len = S.length();
StringBuilder builder = new StringBuilder();
for (int i = 0; i < len - 1; i++)
{
builder.append(S.charAt(i));
set.add(builder.toString());
}
int max = 0;
for (int i = 1; i < len; i++)
{
String suffix = S.substring(i, len);
if (set.contains(suffix))
{
int suffLen = suffix.length();
if (suffLen > max)
max = suffLen;
}
}
return max;
}

@ravi.zombie
If you only need the length, you just need to change Ted's code as below. Note that each substring comparison is itself O(n), so the loop as a whole is still O(n^2):
int max = 0;
for (int i = 1; i <= len - 1; ++i) {
    final String prefix = s.substring(0, i);
    final String suffix = s.substring(len - i, len);
    if (prefix.equals(suffix) && max < i) {
        max = i;
    }
}
return max;
I also left out the whole-string comparison so that only proper prefixes and suffixes are considered, so this should return 4 and not 7 for the input string abbabba.
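
On the O(n) question raised above: one known way to get the longest proper prefix that is also a suffix in linear time is the prefix function (failure function) from Knuth–Morris–Pratt. A minimal sketch follows; the class and method names are hypothetical, and returning 0 for the empty string is an assumption that differs from the 1 expected in the question.

```java
class Border {
    // Length of the longest proper prefix of s that is also a suffix of s.
    static int maxPrefixSuffix(String s) {
        int n = s.length();
        if (n == 0) {
            return 0; // assumption: the question expects 1 here, which is disputed
        }
        // pi[i] = length of the longest proper prefix of s[0..i]
        // that is also a suffix of s[0..i]
        int[] pi = new int[n];
        for (int i = 1, k = 0; i < n; i++) {
            while (k > 0 && s.charAt(i) != s.charAt(k)) {
                k = pi[k - 1]; // fall back to the next shorter candidate
            }
            if (s.charAt(i) == s.charAt(k)) {
                k++;
            }
            pi[i] = k;
        }
        return pi[n - 1];
    }
}
```

This returns 4 for "abbabba" and 0 for "codility". The inner while loop can only undo increments made earlier, so the total work stays linear in the length of the string.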
