I am trying to determine the algorithmic complexity of this program:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
public class SuffixArray
{
private String[] text;
private int length;
private int[] index;
private String[] suffix;
public SuffixArray(String text)
{
this.text = new String[text.length()];
for (int i = 0; i < text.length(); i++)
{
this.text[i] = text.substring(i, i+1);
}
this.length = text.length();
this.index = new int[length];
for (int i = 0; i < length; i++)
{
index[i] = i;
}
suffix = new String[length];
}
public void createSuffixArray()
{
for(int index = 0; index < length; index++)
{
String text = "";
for (int text_index = index; text_index < length; text_index++)
{
text+=this.text[text_index];
}
suffix[index] = text;
}
int back;
for (int iteration = 1; iteration < length; iteration++)
{
String key = suffix[iteration];
int keyindex = index[iteration];
for (back = iteration - 1; back >= 0; back--)
{
if (suffix[back].compareTo(key) > 0)
{
suffix[back + 1] = suffix[back];
index[back + 1] = index[back];
}
else
{
break;
}
}
suffix[ back + 1 ] = key;
index[back + 1 ] = keyindex;
}
System.out.println("SUFFIX \t INDEX");
for (int iterate = 0; iterate < length; iterate++)
{
System.out.println(suffix[iterate] + "\t" + index[iterate]);
}
}
public static void main(String...arg)throws IOException
{
String text = "";
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
System.out.println("Enter the Text String ");
text = reader.readLine();
SuffixArray suffixarray = new SuffixArray(text);
suffixarray.createSuffixArray();
}
}
I have done some research on Wikipedia about the Big-O notation and have run the code with strings of different sizes. Based on the time it takes to run the code with different-size stings, I feel its complexity might be O(n^2). How can I know for sure?
Any help would be greatly appreciated.
The complexity of your createSuffixArray method is O(n^2) and you can determine that by examining the 3 looping blocks. Also, since in your question you said you've read the Wikipedia article covering the Big-O notation, then focus on its properties. They are the core of how to apply the right logic and compute the complexity of an algorithm.
Within your first block, the innermost loop iterating length times is in turn repeated for length times, yielding a complexity of O(n^2) due to the Product Property of the Big-O notation.
for(int index = 0; index < length; index++)
{
String text = "";
for (int text_index = index; text_index < length; text_index++)
{
text+=this.text[text_index];
}
suffix[index] = text;
}
In your second block, although the innermost loop does not perform length iterations at the beginning, its complexity still tends to O(n), producing again an overall complexity of O(n^2) due to the Product Property.
for (int iteration = 1; iteration < length; iteration++)
{
String key = suffix[iteration];
int keyindex = index[iteration];
for (back = iteration - 1; back >= 0; back--)
{
if (suffix[back].compareTo(key) > 0)
{
suffix[back + 1] = suffix[back];
index[back + 1] = index[back];
}
else
{
break;
}
}
suffix[ back + 1 ] = key;
index[back + 1 ] = keyindex;
}
Your third and last block simply iterates length times, giving us a linear complexity of O(n).
for (int iterate = 0; iterate < length; iterate++)
{
System.out.println(suffix[iterate] + "\t" + index[iterate]);
}
At this point to get the method complexity, we need to gather the 3 sub-complexities we've got and apply to them the Sum Property, which will yield O(n^2).
I am trying to find patterns that:
occur more than once
are more than 1 character long
are not substrings of any other known pattern
without knowing any of the patterns that might occur.
For example:
The string "the boy fell by the bell" would return 'ell', 'the b', 'y '.
The string "the boy fell by the bell, the boy fell by the bell" would return 'the boy fell by the bell'.
Using double for-loops, it can be brute forced very inefficiently:
ArrayList<String> patternsList = new ArrayList<>();
int length = string.length();
for (int i = 0; i < length; i++) {
int limit = (length - i) / 2;
for (int j = limit; j >= 1; j--) {
int candidateEndIndex = i + j;
String candidate = string.substring(i, candidateEndIndex);
if(candidate.length() <= 1) {
continue;
}
if (string.substring(candidateEndIndex).contains(candidate)) {
boolean notASubpattern = true;
for (String pattern : patternsList) {
if (pattern.contains(candidate)) {
notASubpattern = false;
break;
}
}
if (notASubpattern) {
patternsList.add(candidate);
}
}
}
}
However, this is incredibly slow when searching large strings with tons of patterns.
You can build a suffix tree for your string in linear time:
https://en.wikipedia.org/wiki/Suffix_tree
The patterns you are looking for are the strings corresponding to internal nodes that have only leaf children.
You could use n-grams to find patterns in a string. It would take O(n) time to scan the string for n-grams. When you find a substring by using a n-gram, put it into a hash table with a count of how many times that substring was found in the string. When you're done searching for n-grams in the string, search the hash table for counts greater than 1 to find recurring patterns in the string.
For example, in the string "the boy fell by the bell, the boy fell by the bell" using a 6-gram will find the substring "the boy fell by the bell". A hash table entry with that substring will have a count of 2 because it occurred twice in the string. Varying the number of words in the n-gram will help you discover different patterns in the string.
Dictionary<string, int>dict = new Dictionary<string, int>();
int count = 0;
int ngramcount = 6;
string substring = "";
// Add entries to the hash table
while (count < str.length) {
// copy the words into the substring
int i = 0;
substring = "";
while (ngramcount > 0 && count < str.length) {
substring[i] = str[count];
if (str[i] == ' ')
ngramcount--;
i++;
count++;
}
ngramcount = 6;
substring.Trim(); // get rid of the last blank in the substring
// Update the dictionary (hash table) with the substring
if (dict.Contains(substring)) { // substring is already in hash table so increment the count
int hashCount = dict[substring];
hashCount++;
dict[substring] = hashCount;
}
else
dict[substring] = 1;
}
// Find the most commonly occurrring pattern in the string
// by searching the hash table for the greatest count.
int maxCount = 0;
string mostCommonPattern = "";
foreach (KeyValuePair<string, int> pair in dict) {
if (pair.Value > maxCount) {
maxCount = pair.Value;
mostCommonPattern = pair.Key;
}
}
I've written this just for fun. I hope I have understood the problem correctly, this is valid and fast enough; if not, please be easy on me :) I might optimize it a little more I guess, if someone finds it useful.
private static IEnumerable<string> getPatterns(string txt)
{
char[] arr = txt.ToArray();
BitArray ba = new BitArray(arr.Length);
for (int shingle = getMaxShingleSize(arr); shingle >= 2; shingle--)
{
char[] arr1 = new char[shingle];
int[] indexes = new int[shingle];
HashSet<int> hs = new HashSet<int>();
Dictionary<int, int[]> dic = new Dictionary<int, int[]>();
for (int i = 0, count = arr.Length - shingle; i <= count; i++)
{
for (int j = 0; j < shingle; j++)
{
int index = i + j;
arr1[j] = arr[index];
indexes[j] = index;
}
int h = getHashCode(arr1);
if (hs.Add(h))
{
int[] indexes1 = new int[indexes.Length];
Buffer.BlockCopy(indexes, 0, indexes1, 0, indexes.Length * sizeof(int));
dic.Add(h, indexes1);
}
else
{
bool exists = false;
foreach (int index in indexes)
if (ba.Get(index))
{
exists = true;
break;
}
if (!exists)
{
int[] indexes1 = dic[h];
if (indexes1 != null)
foreach (int index in indexes1)
if (ba.Get(index))
{
exists = true;
break;
}
}
if (!exists)
{
foreach (int index in indexes)
ba.Set(index, true);
int[] indexes1 = dic[h];
if (indexes1 != null)
foreach (int index in indexes1)
ba.Set(index, true);
dic[h] = null;
yield return new string(arr1);
}
}
}
}
}
private static int getMaxShingleSize(char[] arr)
{
for (int shingle = 2; shingle <= arr.Length / 2 + 1; shingle++)
{
char[] arr1 = new char[shingle];
HashSet<int> hs = new HashSet<int>();
bool noPattern = true;
for (int i = 0, count = arr.Length - shingle; i <= count; i++)
{
for (int j = 0; j < shingle; j++)
arr1[j] = arr[i + j];
int h = getHashCode(arr1);
if (!hs.Add(h))
{
noPattern = false;
break;
}
}
if (noPattern)
return shingle - 1;
}
return -1;
}
private static int getHashCode(char[] arr)
{
unchecked
{
int hash = (int)2166136261;
foreach (char c in arr)
hash = (hash * 16777619) ^ c.GetHashCode();
return hash;
}
}
Edit
My previous code has serious problems. This one is better:
private static IEnumerable<string> getPatterns(string txt)
{
Dictionary<int, int> dicIndexSize = new Dictionary<int, int>();
for (int shingle = 2, count0 = txt.Length / 2 + 1; shingle <= count0; shingle++)
{
Dictionary<string, int> dic = new Dictionary<string, int>();
bool patternExists = false;
for (int i = 0, count = txt.Length - shingle; i <= count; i++)
{
string sub = txt.Substring(i, shingle);
if (!dic.ContainsKey(sub))
dic.Add(sub, i);
else
{
patternExists = true;
int index0 = dic[sub];
if (index0 >= 0)
{
dicIndexSize[index0] = shingle;
dic[sub] = -1;
}
}
}
if (!patternExists)
break;
}
List<int> lst = dicIndexSize.Keys.ToList();
lst.Sort((a, b) => dicIndexSize[b].CompareTo(dicIndexSize[a]));
BitArray ba = new BitArray(txt.Length);
foreach (int i in lst)
{
bool ok = true;
int len = dicIndexSize[i];
for (int j = i, max = i + len; j < max; j++)
{
if (ok) ok = !ba.Get(j);
ba.Set(j, true);
}
if (ok)
yield return txt.Substring(i, len);
}
}
Text in this book took 3.4sec in my computer.
Suffix arrays are the right idea, but there's a non-trivial piece missing, namely, identifying what are known in the literature as "supermaximal repeats". Here's a GitHub repo with working code: https://github.com/eisenstatdavid/commonsub . Suffix array construction uses the SAIS library, vendored in as a submodule. The supermaximal repeats are found using a corrected version of the pseudocode from findsmaxr in Efficient repeat finding via suffix arrays
(Becher–Deymonnaz–Heiber).
static void FindRepeatedStrings(void) {
// findsmaxr from https://arxiv.org/pdf/1304.0528.pdf
printf("[");
bool needComma = false;
int up = -1;
for (int i = 1; i < Len; i++) {
if (LongCommPre[i - 1] < LongCommPre[i]) {
up = i;
continue;
}
if (LongCommPre[i - 1] == LongCommPre[i] || up < 0) continue;
for (int k = up - 1; k < i; k++) {
if (SufArr[k] == 0) continue;
unsigned char c = Buf[SufArr[k] - 1];
if (Set[c] == i) goto skip;
Set[c] = i;
}
if (needComma) {
printf("\n,");
}
printf("\"");
for (int j = 0; j < LongCommPre[up]; j++) {
unsigned char c = Buf[SufArr[up] + j];
if (iscntrl(c)) {
printf("\\u%.4x", c);
} else if (c == '\"' || c == '\\') {
printf("\\%c", c);
} else {
printf("%c", c);
}
}
printf("\"");
needComma = true;
skip:
up = -1;
}
printf("\n]\n");
}
Here's a sample output on the text of the first paragraph:
Davids-MBP:commonsub eisen$ ./repsub input
["\u000a"
," S"
," as "
," co"
," ide"
," in "
," li"
," n"
," p"
," the "
," us"
," ve"
," w"
,"\""
,"–"
,"("
,")"
,". "
,"0"
,"He"
,"Suffix array"
,"`"
,"a su"
,"at "
,"code"
,"com"
,"ct"
,"do"
,"e f"
,"ec"
,"ed "
,"ei"
,"ent"
,"ere's a "
,"find"
,"her"
,"https://"
,"ib"
,"ie"
,"ing "
,"ion "
,"is"
,"ith"
,"iv"
,"k"
,"mon"
,"na"
,"no"
,"nst"
,"ons"
,"or"
,"pdf"
,"ri"
,"s are "
,"se"
,"sing"
,"sub"
,"supermaximal repeats"
,"te"
,"ti"
,"tr"
,"ub "
,"uffix arrays"
,"via"
,"y, "
]
I would use Knuth–Morris–Pratt algorithm (linear time complexity O(n)) to find substrings. I would try to find the largest substring pattern, remove it from the input string and try to find the second largest and so on. I would do something like this:
string pattern = input.substring(0,lenght/2);
string toMatchString = input.substring(pattern.length, input.lenght - 1);
List<string> matches = new List<string>();
while(pattern.lenght > 0)
{
int index = KMP(pattern, toMatchString);
if(index > 0)
{
matches.Add(pattern);
// remove the matched pattern occurences from the input string
// I would do something like this:
// 0 to pattern.lenght gets removed
// check for all occurences of pattern in toMatchString and remove them
// get the remaing shrinked input, reassign values for pattern & toMatchString
// keep looking for the next largest substring
}
else
{
pattern = input.substring(0, pattern.lenght - 1);
toMatchString = input.substring(pattern.length, input.lenght - 1);
}
}
Where KMP implements Knuth–Morris–Pratt algorithm. You can find the Java implementations of it at Github or Princeton or write it yourself.
PS: I don't code in Java and it is quick try to my first bounty about to close soon. So please don't give me the stick if I missed something trivial or made a +/-1 error.
I came across below program which looks perfect. Per me its time complexity is nlogn where n is the length of String.
n for storing different strings,nlog for sorting, n for comparison. So time complexity is nlogn.
Space complexity is n for storing the storing n substrings
My question is can it be further optimized ?
public class LRS {
// return the longest common prefix of s and t
public static String lcp(String s, String t) {
int n = Math.min(s.length(), t.length());
for (int i = 0; i < n; i++) {
if (s.charAt(i) != t.charAt(i))
return s.substring(0, i);
}
return s.substring(0, n);
}
// return the longest repeated string in s
public static String lrs(String s) {
// form the N suffixes
int n = s.length();
String[] suffixes = new String[n];
for (int i = 0; i < n; i++) {
suffixes[i] = s.substring(i, n);
}
// sort them
Arrays.sort(suffixes);
// find longest repeated substring by comparing adjacent sorted suffixes
String lrs = "";
for (int i = 0; i < n-1; i++) {
String x = lcp(suffixes[i], suffixes[i+1]);
if (x.length() > lrs.length())
lrs = x;
}
return lrs;
}
public static void main(String[] args) {
String s = "MOMOGTGT";
s = s.replaceAll("\\s+", " ");
System.out.println("'" + lrs(s) + "'");
}
}
Have a look at this algorithm given in geeksforgeeks, this might be helpful:
http://www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring/
This was a question asked in a recent programming interview.
Given a random string S and another string T with unique elements, find the minimum consecutive sub-string of S such that it contains all the elements in T.
Say,
S='adobecodebanc'
T='abc'
Answer='banc'
I've come up with a solution,
public static String completeSubstring(String T, String S){
String minSub = T;
StringBuilder sb = new StringBuilder();
for (int i = 0; i <T.length()-1; i++) {
for (int j = i + 1; j <= T.length() ; j++) {
String sub = T.substring(i,j);
if(stringContains(sub, S)){
if(sub.length() < minSub.length()) minSub = sub;
}
}
}
return minSub;
}
private static boolean stringContains(String t, String s){
//if(t.length() <= s.length()) return false;
int[] arr = new int[256];
for (int i = 0; i <t.length() ; i++) {
char c = t.charAt(i);
arr[c -'a'] = 1;
}
boolean found = true;
for (int i = 0; i <s.length() ; i++) {
char c = s.charAt(i);
if(arr[c - 'a'] != 1){
found = false;
break;
}else continue;
}
return found;
}
This algorithm has a O(n3) complexity, which but naturally isn't great. Can someone suggest a better algorithm.
Here's the O(N) solution.
The important thing to note re: complexity is that each unit of work involves incrementing either start or end, they don't decrease, and the algorithm stops before they both get to the end.
public static String findSubString(String s, String t)
{
//algorithm moves a sliding "current substring" through s
//in this map, we keep track of the number of occurrences of
//each target character there are in the current substring
Map<Character,int[]> counts = new HashMap<>();
for (char c : t.toCharArray())
{
counts.put(c,new int[1]);
}
//how many target characters are missing from the current substring
//current substring is initially empty, so all of them
int missing = counts.size();
//don't waste my time
if (missing<1)
{
return "";
}
//best substring found
int bestStart = -1, bestEnd = -1;
//current substring
int start=0, end=0;
while (end<s.length())
{
//expand the current substring at the end
int[] cnt = counts.get(s.charAt(end++));
if (cnt!=null)
{
if (cnt[0]==0)
{
--missing;
}
cnt[0]+=1;
}
//while the current substring is valid, remove characters
//at the start to see if a shorter substring that ends at the
//same place is also valid
while(start<end && missing<=0)
{
//current substring is valid
if (end-start < bestEnd-bestStart || bestEnd<0)
{
bestStart = start;
bestEnd = end;
}
cnt = counts.get(s.charAt(start++));
if (cnt != null)
{
cnt[0]-=1;
if (cnt[0]==0)
{
++missing;
}
}
}
//current substring is no longer valid. we'll add characters
//at the end until we get another valid one
//note that we don't need to add back any start character that
//we just removed, since we already tried the shortest valid string
//that starts at start-1
}
return(bestStart<=bestEnd ? s.substring(bestStart,bestEnd) : null);
}
I know that there already is an adequate O(N) complexity answer, but I tried to figure it out on my own without looking it up, just because it's a fun problem to solve and thought I would share. Here's the O(N) solution that I came up with:
public static String completeSubstring(String S, String T){
int min = S.length()+1, index1 = -1, index2 = -1;
ArrayList<ArrayList<Integer>> index = new ArrayList<ArrayList<Integer>>();
HashSet<Character> targetChars = new HashSet<Character>();
for(char c : T.toCharArray()) targetChars.add(c);
//reduce initial sequence to only target chars and keep track of index
//Note that the resultant string does not allow the same char to be consecutive
StringBuilder filterS = new StringBuilder();
for(int i = 0, s = 0 ; i < S.length() ; i++) {
char c = S.charAt(i);
if(targetChars.contains(c)) {
if(s > 0 && filterS.charAt(s-1) == c) {
index.get(s-1).add(i);
} else {
filterS.append(c);
index.add(new ArrayList<Integer>());
index.get(s).add(i);
s++;
}
}
}
//Not necessary to use regex, loops are fine, but for readability sake
String regex = "([abc])((?!\\1)[abc])((?!\\1)(?!\\2)[abc])";
Matcher m = Pattern.compile(regex).matcher(filterS.toString());
for(int i = 0, start = -1, p1, p2, tempMin, charSize = targetChars.size() ; m.find(i) ; i = start+1) {
start = m.start();
ArrayList<Integer> first = index.get(start);
p1 = first.get(first.size()-1);
p2 = index.get(start+charSize-1).get(0);
tempMin = p2-p1;
if(tempMin < min) {
min = tempMin;
index1 = p1;
index2 = p2;
}
}
return S.substring(index1, index2+1);
}
I'm pretty sure the complexity is O(N), please correct if I'm wrong
Alternative implementation of O(N) algorithm proposed by #MattTimmermans, which uses Map<Integer, Integer> to count occurrences and Set<Integer> to store chars from T that are present in current substring:
public static String completeSubstring(String s, String t) {
Map<Integer, Integer> occ
= t.chars().boxed().collect(Collectors.toMap(c -> c, c -> 0));
Set<Integer> found = new HashSet<>(); // characters from T found in current match
int start = 0; // current match
int bestStart = Integer.MIN_VALUE, bestEnd = -1;
for (int i = 0; i < s.length(); i++) {
int ci = s.charAt(i); // current char
if (!occ.containsKey(ci)) // not from T
continue;
occ.put(ci, occ.get(ci) + 1); // add occurrence
found.add(ci);
for (int j = start; j < i; j++) { // try to reduce current match
int cj = s.charAt(j);
Integer c = occ.get(cj);
if (c != null) {
if (c == 1) { // cannot reduce anymore
start = j;
break;
} else
occ.put(cj, c - 1); // remove occurrence
}
}
if (found.size() == occ.size() // all chars found
&& (i - start < bestEnd - bestStart)) {
bestStart = start;
bestEnd = i;
}
}
return bestStart < 0 ? null : s.substring(bestStart, bestEnd + 1);
}
I have a LinkedList where each node contains a word. I also have a variable that contains 6 randomly generated letters. I have a code the determines all possible letter combinations of those letters. I need to traverse through the linked list and determine the best "match" among the nodes.
Example:
-Letters generated: jghoot
-Linked list contains: cat, dog, cow, loot, hooter, ghlooter (I know ghlooter isn't a word)
The method would return hooter because it shares the most characters and is most similar to it. Any ideas?
I guess you could say I am looking for the word that the generated letters are a substring of.
use a nested for loop, compare each letter of your original string with the one you are comparing it to, and for each match, increase a local int variable. at the end of the loop, compare local int variable to global int variable that holds the "best" match till that one, and if bigger, store local int into global one, and put your found node into global node. at the end you should have a node which matches closest.
something like this
int currBest = 0;
int currBestNode = firstNodeOfLinkedList;
while(blabla)
{
int localBest = 0;
for(i= 0; i < currentNodeWord.length; i++)
{
for(j=0; j < originalWord.length;j++)
{
if(currentNodeWord[i] == originalWord[j])
{
localBest++
}
}
}
if(localBest > currBest)
{
currBest = localBest;
currBestNode = currentNodeStringBeingSearched;
}
}
// here u are out of ur loop, ur currBestNode should be set to the best match found.
hope that helps
If you're considering only character counts, first you need a method to count chars in a word
public int [] getCharCounts(String word) {
int [] result = new int['z' - 'a' + 1];
for(int i = 0; i<word.length(); i++) result[word.charAt(i) - 'a']++;
return result;
}
and then you need to compare character counts of two words
public static int compareCounts(int [] count1, int [] count2) [
int result = 0;
for(int i = 0; i<count1.length; i++) {
result += Math.min(count1[i], count2[i]);
}
return result;
}
public static void main(String[] args) {
String randomWord = "jghoot";
int [] randomWordCharCount = getCharCounts(randomWord);
ArrayList<String> wordList = new ArrayList();
String bestElement = null;
int bestMatch = -1;
for(String word : wordList) {
int [] wordCount = getCharCounts(word);
int cmp = compareCounts(randomWordCharCount, wordCount);
if(cmp > bestMatch) {
bestMatch = cmp;
bestElement = word;
}
}
System.out.println(word);
}
I think this works.
import java.util.*;
import java.io.*;
class LCSLength
{
public static void main(String args[])
{
ArrayList<String> strlist=new ArrayList<String>();
strlist.add(new String("cat"));
strlist.add(new String("cow"));
strlist.add(new String("hooter"));
strlist.add(new String("dog"));
strlist.add(new String("loot"));
String random=new String("jghoot"); //Your String
int maxLength=-1;
String maxString=new String();
for(String s:strlist)
{
int localMax=longestSubstr(s,random);
if(localMax>maxLength)
{
maxLength=localMax;
maxString=s;
}
}
System.out.println(maxString);
}
public static int longestSubstr(String first, String second) {
if (first == null || second == null || first.length() == 0 || second.length() == 0) {
return 0;
}
int maxLen = 0;
int fl = first.length();
int sl = second.length();
int[][] table = new int[fl+1][sl+1];
for(int s=0; s <= sl; s++)
table[0][s] = 0;
for(int f=0; f <= fl; f++)
table[f][0] = 0;
for (int i = 1; i <= fl; i++) {
for (int j = 1; j <= sl; j++) {
if (first.charAt(i-1) == second.charAt(j-1)) {
if (i == 1 || j == 1) {
table[i][j] = 1;
}
else {
table[i][j] = table[i - 1][j - 1] + 1;
}
if (table[i][j] > maxLen) {
maxLen = table[i][j];
}
}
}
}
return maxLen;
}
}
Credits: Wikipedia for longest common substring algorithm.
http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring