Finding shortest possible substring that contains a String

This was a question asked in a recent programming interview.
Given a random string S and another string T with unique elements, find the minimum consecutive sub-string of S such that it contains all the elements in T.
Say,
S='adobecodebanc'
T='abc'
Answer='banc'
I've come up with a solution,
public static String completeSubstring(String T, String S){
String minSub = T;
StringBuilder sb = new StringBuilder();
for (int i = 0; i <T.length()-1; i++) {
for (int j = i + 1; j <= T.length() ; j++) {
String sub = T.substring(i,j);
if(stringContains(sub, S)){
if(sub.length() < minSub.length()) minSub = sub;
}
}
}
return minSub;
}
private static boolean stringContains(String t, String s){
//if(t.length() <= s.length()) return false;
int[] arr = new int[256];
for (int i = 0; i <t.length() ; i++) {
char c = t.charAt(i);
arr[c -'a'] = 1;
}
boolean found = true;
for (int i = 0; i <s.length() ; i++) {
char c = s.charAt(i);
if(arr[c - 'a'] != 1){
found = false;
break;
}else continue;
}
return found;
}
This algorithm has O(n^3) complexity, which naturally isn't great. Can someone suggest a better algorithm?

Here's the O(N) solution.
The important thing to note about the complexity is that each unit of work increments either start or end, neither of them ever decreases, and the loop stops once end reaches the end of s. That bounds the total work at about 2N steps.
public static String findSubString(String s, String t)
{
//algorithm moves a sliding "current substring" through s
//in this map, we keep track of the number of occurrences of
//each target character there are in the current substring
Map<Character,int[]> counts = new HashMap<>();
for (char c : t.toCharArray())
{
counts.put(c,new int[1]);
}
//how many target characters are missing from the current substring
//current substring is initially empty, so all of them
int missing = counts.size();
//don't waste my time
if (missing<1)
{
return "";
}
//best substring found
int bestStart = -1, bestEnd = -1;
//current substring
int start=0, end=0;
while (end<s.length())
{
//expand the current substring at the end
int[] cnt = counts.get(s.charAt(end++));
if (cnt!=null)
{
if (cnt[0]==0)
{
--missing;
}
cnt[0]+=1;
}
//while the current substring is valid, remove characters
//at the start to see if a shorter substring that ends at the
//same place is also valid
while(start<end && missing<=0)
{
//current substring is valid
if (end-start < bestEnd-bestStart || bestEnd<0)
{
bestStart = start;
bestEnd = end;
}
cnt = counts.get(s.charAt(start++));
if (cnt != null)
{
cnt[0]-=1;
if (cnt[0]==0)
{
++missing;
}
}
}
//current substring is no longer valid. we'll add characters
//at the end until we get another valid one
//note that we don't need to add back any start character that
//we just removed, since we already tried the shortest valid string
//that starts at start-1
}
//bestStart stays at -1 when no valid substring exists; return null in that case
return(bestStart>=0 ? s.substring(bestStart,bestEnd) : null);
}
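To make the complexity argument concrete, here is a minimal sketch that runs the same kind of sliding window but only counts how often start or end is incremented. The class and method names (PointerMoves, countMoves) are made up for this illustration and are not part of the answer above; the sketch also assumes, as the question does, that T has unique characters.

```java
import java.util.HashMap;
import java.util.Map;

final class PointerMoves {
    // Hypothetical helper: returns how many times start or end was incremented
    // while sliding the window over s. Assumes t has unique characters.
    static int countMoves(String s, String t) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : t.toCharArray()) counts.put(c, 0);
        int missing = counts.size();
        int start = 0, end = 0, moves = 0;
        while (end < s.length()) {
            char in = s.charAt(end++); // expand at the end
            moves++;
            if (counts.containsKey(in) && counts.merge(in, 1, Integer::sum) == 1) missing--;
            while (start < end && missing == 0) {
                char out = s.charAt(start++); // shrink at the start
                moves++;
                if (counts.containsKey(out) && counts.merge(out, -1, Integer::sum) == 0) missing++;
            }
        }
        return moves;
    }

    public static void main(String[] args) {
        String s = "adobecodebanc";
        // end moves exactly s.length() times and start at most s.length() times
        System.out.println(countMoves(s, "abc") + " moves, bound " + 2 * s.length());
    }
}
```

Since end is incremented exactly N times and start at most N times, the count can never exceed 2N, whatever the input looks like.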

I know that there already is an adequate O(N) complexity answer, but I tried to figure it out on my own without looking it up, just because it's a fun problem to solve and thought I would share. Here's the O(N) solution that I came up with:
public static String completeSubstring(String S, String T){
int min = S.length()+1, index1 = -1, index2 = -1;
ArrayList<ArrayList<Integer>> index = new ArrayList<ArrayList<Integer>>();
HashSet<Character> targetChars = new HashSet<Character>();
for(char c : T.toCharArray()) targetChars.add(c);
//reduce initial sequence to only target chars and keep track of index
//Note that the resultant string does not allow the same char to be consecutive
StringBuilder filterS = new StringBuilder();
for(int i = 0, s = 0 ; i < S.length() ; i++) {
char c = S.charAt(i);
if(targetChars.contains(c)) {
if(s > 0 && filterS.charAt(s-1) == c) {
index.get(s-1).add(i);
} else {
filterS.append(c);
index.add(new ArrayList<Integer>());
index.get(s).add(i);
s++;
}
}
}
//Not necessary to use regex, loops are fine, but for readability sake
String regex = "([abc])((?!\\1)[abc])((?!\\1)(?!\\2)[abc])";
Matcher m = Pattern.compile(regex).matcher(filterS.toString());
for(int i = 0, start = -1, p1, p2, tempMin, charSize = targetChars.size() ; m.find(i) ; i = start+1) {
start = m.start();
ArrayList<Integer> first = index.get(start);
p1 = first.get(first.size()-1);
p2 = index.get(start+charSize-1).get(0);
tempMin = p2-p1;
if(tempMin < min) {
min = tempMin;
index1 = p1;
index2 = p2;
}
}
return S.substring(index1, index2+1);
}
I'm pretty sure the complexity is O(N); please correct me if I'm wrong. Note that the regex above is hard-coded for T = "abc".

Alternative implementation of the O(N) algorithm proposed by @MattTimmermans, which uses Map<Integer, Integer> to count occurrences and Set<Integer> to store chars from T that are present in the current substring:
public static String completeSubstring(String s, String t) {
Map<Integer, Integer> occ
= t.chars().boxed().collect(Collectors.toMap(c -> c, c -> 0));
Set<Integer> found = new HashSet<>(); // characters from T found in current match
int start = 0; // current match
int bestStart = Integer.MIN_VALUE, bestEnd = -1;
for (int i = 0; i < s.length(); i++) {
int ci = s.charAt(i); // current char
if (!occ.containsKey(ci)) // not from T
continue;
occ.put(ci, occ.get(ci) + 1); // add occurrence
found.add(ci);
for (int j = start; j <= i; j++) { // try to reduce current match; j may reach i so that start never goes stale
int cj = s.charAt(j);
Integer c = occ.get(cj);
if (c != null) {
if (c == 1) { // cannot reduce anymore
start = j;
break;
} else
occ.put(cj, c - 1); // remove occurrence
}
}
if (found.size() == occ.size() // all chars found
&& (i - start < bestEnd - bestStart)) {
bestStart = start;
bestEnd = i;
}
}
return bestStart < 0 ? null : s.substring(bestStart, bestEnd + 1);
}

Related

Efficient way to find longest streak of characters in string

This code works fine but I'm looking for a way to optimize it. If you look at the long string, you can see 'l' appears five times consecutively. No other character appears this many times consecutively. So, the output is 5. Now, the problem is this method checks each and every character and even after the max is found, it continues to check the remaining characters. Is there a more efficient way?
public class Main {
public static void main(String[] args) {
System.out.println(longestStreak("KDDiiigllllldddfnnlleeezzeddd"));
}
private static int longestStreak(String str) {
int max = 0;
for (int i = 0; i < str.length(); i++) {
int count = 0;
for (int j = i; j < str.length(); j++) {
if (str.charAt(i) == str.charAt(j)) {
count++;
} else break;
}
if (count > max) max = count;
}
return max;
}
}

We could track the length of the current run in a single pass. As an additional optimisation, the loop continues only while index + maxLenght - currentLenght < str.length(); once that no longer holds, the remaining characters cannot change the maximum:
private static int longestStreak(String str) {
int maxLenght = 0;
int currentLenght = 1;
char prev = str.charAt(0);
for (int index = 1; index < str.length() && isMaxCanBeChanged(str, maxLenght, currentLenght, index); index++) {
char currentChar = str.charAt(index);
if (currentChar == prev) {
currentLenght++;
} else {
maxLenght = Math.max(maxLenght, currentLenght);
currentLenght = 1;
}
prev = currentChar;
}
return Math.max(maxLenght, currentLenght);
}
private static boolean isMaxCanBeChanged(String str, int max, int currentLenght, int index) {
return index + max - currentLenght < str.length();
}

Here is a regex magic solution, which although a bit brute force perhaps gets some brownie points. We can iterate starting with the number of characters in the original input, decreasing by one at a time, trying to do a regex replacement of continuous characters of that length. If the replacement works, then we know we found the longest streak.
String input = "KDDiiigllllldddfnnlleeezzeddd";
for (int i = input.length(); i > 0; --i) {
    String regex = ".*?(.)(\\1{" + (i - 1) + "}).*";
    if (input.matches(regex)) {
        // $1$2 is the whole run: the first character plus its i-1 repeats
        System.out.println("longest streak is: " + input.replaceAll(regex, "$1$2"));
        break; // the first match is the longest one, so stop here
    }
}
This prints:
longest streak is: lllll
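If the repeated replacement feels too brute force, a single regex pass can do the same job. The sketch below is a variation on the idea, not the code from the answer above: it lets the pattern (.)\1* match one run at a time and keeps the longest. The names RegexStreak and longestRun are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexStreak {
    // Returns the first longest run of a repeated character, or "" for empty input.
    static String longestRun(String input) {
        // (.)\1* matches one character followed by all of its immediate repeats
        Matcher m = Pattern.compile("(.)\\1*", Pattern.DOTALL).matcher(input);
        String best = "";
        while (m.find()) {
            if (m.group().length() > best.length()) {
                best = m.group();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println("longest streak is: " + longestRun("KDDiiigllllldddfnnlleeezzeddd"));
        // prints "longest streak is: lllll"
    }
}
```

Each character is consumed by exactly one match, so the scan stays linear in the length of the input.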

Yes there is. C++ code:
string str = "KDDiiigllllldddfnnlleeezzeddd";
int longest_streak = 1, current_streak = 1; char longest_letter = str[0];
for (int i = 1; i < str.size(); ++i) {
if (str[i] == str[i - 1])
current_streak++;
else current_streak = 1;
if (current_streak > longest_streak) {
longest_streak = current_streak;
longest_letter = str[i];
}
}
cout << "The longest streak is: " << longest_streak << " and the character is: " << longest_letter << "\n";
LE: If needed, I can provide the Java code for it, but I think you get the idea.
public class Main {
public static void main(String[] args) {
System.out.println(longestStreak("KDDiiigllllldddfnnlleeezzeddd"));
}
private static int longestStreak(String str) {
int longest_streak = 1, current_streak = 1; char longest_letter = str.charAt(0);
for (int i = 1; i < str.length(); ++i) {
if (str.charAt(i) == str.charAt(i - 1))
current_streak++;
else current_streak = 1;
if (current_streak > longest_streak) {
longest_streak = current_streak;
longest_letter = str.charAt(i);
}
}
return longest_streak;
}
}

The loop could be rewritten a bit smaller, but mainly the condition can be optimized:
i < str.length() - max
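As a rough sketch of how that condition could fit into a complete method: the version below jumps from run to run and stops as soon as the rest of the string is too short to beat the current maximum. This is one possible reading of the suggestion, and the class name StreakScan is only illustrative.

```java
final class StreakScan {
    static int longestStreak(String str) {
        int n = str.length();
        int max = 0;
        int i = 0;
        // A run starting at i can be at most n - i long, so stop once that cannot exceed max.
        while (i < n - max) {
            int j = i + 1;
            while (j < n && str.charAt(j) == str.charAt(i)) {
                j++;
            }
            max = Math.max(max, j - i); // length of the run str[i..j)
            i = j;                      // continue at the start of the next run
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(longestStreak("KDDiiigllllldddfnnlleeezzeddd")); // prints 5
    }
}
```

For the sample input the scan stops after the "zz" run, because the four remaining characters could not produce a streak longer than 5.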

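Using a Stream and a collector. It reports every character with the highest total count. Note that it counts all occurrences of a character in the string, not only consecutive ones.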
Code:
String lineString = "KDDiiiiiiigllllldddfnnlleeezzeddd";
String[] lineSplit = lineString.split("");
Map<String, Integer> collect = Arrays.stream(lineSplit)
.collect(Collectors.groupingBy(Function.identity(), Collectors.summingInt(e -> 1)));
int maxValueInMap = (Collections.max(collect.values()));
for (Entry<String, Integer> entry : collect.entrySet()) {
if (entry.getValue() == maxValueInMap) {
System.out.printf("Character: %s, Repetition: %d\n", entry.getKey(), entry.getValue());
}
}
Output:
Character: i, Repetition: 7
Character: l, Repetition: 7
P.S. I am not sure how efficient this code is. I just learned Streams.

Trying to find the longest palindrome for this input

Given a string which consists of lowercase or uppercase letters, find the length of the longest palindromes that can be built with those letters.
This is case sensitive, for example "Aa" is not considered a palindrome here.
Note:
Assume the length of given string will not exceed 1,010.
Example:
Input: "abccccdd"
Output: 7
Explanation:
One longest palindrome that can be built is "dccaccd", whose length is 7.
My code works for simple inputs such as "abccccdd" and "banana" but fails for "civilwartestingwhetherthatnaptionoranynartionsoconceivedandsodedicatedcanlongendureWeareqmetonagreatbattlefiemldoftzhatwarWehavecometodedicpateaportionofthatfieldasafinalrestingplaceforthosewhoheregavetheirlivesthatthatnationmightliveItisaltogetherfangandproperthatweshoulddothisButinalargersensewecannotdedicatewecannotconsecratewecannothallowthisgroundThebravelmenlivinganddeadwhostruggledherehaveconsecrateditfaraboveourpoorponwertoaddordetractTgheworldadswfilllittlenotlenorlongrememberwhatwesayherebutitcanneverforgetwhattheydidhereItisforusthelivingrathertobededicatedheretotheulnfinishedworkwhichtheywhofoughtherehavethusfarsonoblyadvancedItisratherforustobeherededicatedtothegreattdafskremainingbeforeusthatfromthesehonoreddeadwetakeincreaseddevotiontothatcauseforwhichtheygavethelastpfullmeasureofdevotionthatweherehighlyresolvethatthesedeadshallnothavediedinvainthatthisnationunsderGodshallhaveanewbirthoffreedomandthatgovernmentofthepeoplebythepeopleforthepeopleshallnotperishfromtheearth". I'm not sure how to debug it.
class Solution {
public int longestPalindrome(String s) {
Map<Character, Integer> map = new HashMap<>();
char[] carr = s.toCharArray();
Arrays.sort(carr);
int leftInd = 0;
int rightInd = 0;
for(int i=0; i<carr.length; i++){
if(map.containsKey(carr[i]))
continue;
else
map.put(carr[i], 1);
}
for(int i=0; i<carr.length-1; i++){
for(int j=i+1; j<carr.length; j++){
if(carr[i]==carr[j]){
if(map.get(carr[i])==null)
continue;
carr[j] = Character.MIN_VALUE;
int count = map.get(carr[i]);
map.put(carr[i], count + 1);
}
}
}
int ans = 0;
int[] oddValArr = new int[map.size()];
int oddInd = 0;
for (Map.Entry<Character, Integer> entry : map.entrySet()) {
Character key = entry.getKey();
Integer value = entry.getValue();
if(value % 2 == 0){
ans += value;
}
else{
oddValArr[oddInd] = value;
oddInd++;
}
}
int biggestOddNum = 0;
for(int i=0; i<oddValArr.length; i++){
if(oddValArr[i] > biggestOddNum)
biggestOddNum = oddValArr[i];
}
return ans + biggestOddNum;
}
}
Output
655
Expected
983

Your mistake is that you use only the biggest odd group from oddValArr; every odd group can contribute count - 1 of its letters. For example, if the input is "aaabbb", the biggest palindrome is "abbba": group a has length 3, which is an odd number, and we use 3 - 1 = 2 letters of it.
Also, those nested for loops can be replaced with one for, using Map:
public int longestPalindrome(String s) {
Map<Character, Integer> map = new HashMap<>(); // letter groups
for(int i=0; i<s.length(); i++){
char c = s.charAt(i);
if(map.containsKey(c))
map.put(c, map.get(c) + 1);
else
map.put(c, 1);
}
boolean containsOddGroups = false;
int ans = 0;
for(int count : map.values()){
if(count % 2 == 0) // even group
ans += count;
else{ // odd group
containsOddGroups = true;
ans += count - 1;
}
}
if(!containsOddGroups)
return ans;
else
return ans + 1; // we can place one letter in the center of palindrome
}

You are almost there but have overcomplicated it quite a bit. Here is my solution, made almost entirely by deleting code from yours:
public static int longestPalindrome(String s) {
Map<Character, Integer> map = new HashMap<>();
char[] carr = s.toCharArray();
for (int i = 0; i < carr.length; i++) {
if (map.containsKey(carr[i]))
map.put(carr[i], map.get(carr[i]) + 1);
else
map.put(carr[i], 1);
}
int ans = 0;
int odd = 0;
for (Integer value : map.values()) {
if (value % 2 == 0) {
ans += value;
} else {
ans += value - 1;
odd = 1;
}
}
return ans + odd;
}
Explanation:
the second loop has been removed, together with the sorting - it has been merged into the first loop. There was no need for sorting at all.
then you iterate over the counts of how often a character appeared
if it is even you increase ans as before
if it is odd you can use count - 1 chars of it for the palindrome of even length
if you found at least one odd occurrence you can put that single odd char into the center of the palindrome and increase its length by one
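The steps above can also be sketched without a map. The version below is an illustration under the assumption that the input contains only ASCII characters (the problem statement promises letters); the class name PalindromeLength is invented here. It counts pairs instead of treating even and odd groups separately.

```java
final class PalindromeLength {
    static int longestPalindrome(String s) {
        int[] counts = new int[128]; // assumption: ASCII input only
        for (int i = 0; i < s.length(); i++) {
            counts[s.charAt(i)]++;
        }
        int pairs = 0;
        for (int count : counts) {
            pairs += count / 2; // each pair puts one letter on both sides
        }
        int length = 2 * pairs;
        // If any letter is left over, exactly one of them can sit in the center.
        return length < s.length() ? length + 1 : length;
    }

    public static void main(String[] args) {
        System.out.println(longestPalindrome("abccccdd")); // prints 7
        System.out.println(longestPalindrome("aaabbb"));   // prints 5
    }
}
```

The check length < s.length() is true exactly when at least one character has an odd count, which is the same condition as the odd flag in the code above.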

How could I improve the speed/performance for this problem, Java

I saw this challenge for beginners on https://www.topcoder.com/ and I really wanted to complete it. I've gotten close after many failures, but I'm stuck and don't know what to do anymore. Here is what I mean:
Question:
Read the input one line at a time and output the current line if and only if you have already read at least 1000 lines greater than the current line and at least 1000 lines less than the current line. (Again, greater than and less than are with respect to the ordering defined by String.compareTo().)
My Solution:
public static void doIt(BufferedReader r, PrintWriter w) throws IOException {
SortedSet<String> linesThatHaveBeenRead = new TreeSet<>();
int lessThan =0;
int greaterThan =0;
Iterator<String> itr;
for (String currentLine = r.readLine(); currentLine != null; currentLine = r.readLine()){
itr = linesThatHaveBeenRead.iterator();
while(itr.hasNext()){
String theCurrentLineInTheSet = itr.next();
if(theCurrentLineInTheSet.compareTo(currentLine) == -1)++lessThan;
else if(theCurrentLineInTheSet.compareTo(currentLine) == 1)++greaterThan;
}
if(lessThan >= 1000 && greaterThan >= 1000){
w.println(currentLine);
lessThan = 0;
greaterThan =0;
}
linesThatHaveBeenRead.add(currentLine);
}
}
PROBLEM
I think the problem with my solution is that I'm using nested loops, which makes it a lot slower, but I've tried other ways and none worked. At this point I'm stuck. The whole point of this challenge is to use the most appropriate data structure for the problem.
GOAL:
The goal is to use the most efficient data-structure for this problem.

Let me try to present just an accessible refinement of what to do.
public static void
doIt(java.io.BufferedReader r, java.io.PrintWriter w)
throws java.io.IOException {
feedNonExtremes(r, (line) -> { w.println(line);}, 1000, 1000);
}
/** Read <code>r</code> one line at a time and
* output the current line if and only if there already were<br/>
* at least <code>nHigh</code> lines greater than the current line <br/>
* and at least <code>nLow</code> lines less than the current line.<br/>
* @param r to read lines from
* @param sink to feed lines to
* @param nLow number of lines comparing too small to process
* @param nHigh number of lines comparing too great to process
*/
static void feedNonExtremes(java.io.BufferedReader r,
Consumer<String> sink, int nLow, int nHigh) {
// collect nLow+nHigh lines into firstLowHigh; instantiate
// - a PriorityQueue(firstLowHigh) highest
// - a PriorityQueue(nLow, (a, b) -> b.compareTo(a)) lowest
// remove() nLow elements from highest and insert each into lowest
// for each remaining line
// if greater than the head of highest
// add to highest and remove head
// else if smaller than the head of lowest
// add to lowest and remove head
// else feed to sink
}
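One way the outline above could be turned into running code is sketched below. It is a simplified variant, not the author's exact plan: instead of pre-collecting the first nLow + nHigh lines, it maintains the two heaps independently from the start. It takes an Iterable<String> rather than a BufferedReader to avoid checked exceptions, assumes nLow and nHigh are at least 1, and counts duplicate lines separately. The class name NonExtremes is invented for the example.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.function.Consumer;

final class NonExtremes {
    // Assumes nLow >= 1 and nHigh >= 1.
    static void feedNonExtremes(Iterable<String> lines, Consumer<String> sink,
                                int nLow, int nHigh) {
        // max-heap holding the nLow smallest lines read so far
        PriorityQueue<String> lowest = new PriorityQueue<>(Comparator.reverseOrder());
        // min-heap holding the nHigh greatest lines read so far
        PriorityQueue<String> highest = new PriorityQueue<>();
        for (String line : lines) {
            // the head of lowest is the nLow-th smallest line; if it is below the
            // current line, at least nLow earlier lines are smaller
            boolean enoughLess = lowest.size() == nLow && lowest.peek().compareTo(line) < 0;
            boolean enoughGreater = highest.size() == nHigh && highest.peek().compareTo(line) > 0;
            if (enoughLess && enoughGreater) {
                sink.accept(line);
            }
            lowest.add(line);
            if (lowest.size() > nLow) lowest.poll();    // drop the largest of the small ones
            highest.add(line);
            if (highest.size() > nHigh) highest.poll(); // drop the smallest of the great ones
        }
    }

    public static void main(String[] args) {
        feedNonExtremes(Arrays.asList("a", "z", "m", "b", "y", "n"),
                System.out::println, 2, 2); // prints "n"
    }
}
```

Each line costs O(log n) heap work for heaps of at most 1000 entries, so nothing is ever scanned in full. With a BufferedReader r, the lines could presumably be supplied as () -> r.lines().iterator().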

Made you a little example with Binary search, now in Java code. It will only use Binary search when newLine is within limits of the sorting.
public static void main(String[] args) {
// Create random lines
ArrayList<String> lines = new ArrayList<String>();
Random rn = new Random();
for (int i = 0; i < 50000; i++) {
int lenght = rn.nextInt(100);
char[] newString = new char[lenght];
for (int j = 0; j < lenght; j++) {
newString[j] = (char) rn.nextInt(255);
}
lines.add(new String(newString));
}
// Here starts logic
ArrayList<String> lowerCompared = new ArrayList<String>();
ArrayList<String> higherCompared = new ArrayList<String>();
int lowBoundry = 1000, highBoundry = 1000;
int k = 0;
int firstLimit = Math.min(lowBoundry, highBoundry);
// the first x lines are inserted, in sorted order, into both lists
for (; k < firstLimit; k++) {
int index = Collections.binarySearch(lowerCompared, lines.get(k));
if (index < 0)
index = ~index;
lowerCompared.add(index, lines.get(k));
higherCompared.add(index, lines.get(k));
}
for (; k < lines.size(); k++) {
String newLine = lines.get(k);
boolean lowBS = newLine.compareTo(lowerCompared.get(lowBoundry - 1)) < 0;
boolean highBS = newLine.compareTo(higherCompared.get(0)) > 0;
if (lowerCompared.size() == lowBoundry && higherCompared.size() == highBoundry && !lowBS && !highBS) {
System.out.println("Time to print: " + newLine);
continue;
}
if (lowBS) {
int lowerIndex = Collections.binarySearch(lowerCompared, newLine);
if (lowerIndex < 0)
lowerIndex = ~lowerIndex;
lowerCompared.add(lowerIndex, newLine);
if (lowerCompared.size() > lowBoundry)
lowerCompared.remove(lowBoundry);
}
if (highBS) {
int higherIndex = Collections.binarySearch(higherCompared, newLine);
if (higherIndex < 0)
higherIndex = ~higherIndex;
higherCompared.add(higherIndex, newLine);
if (higherCompared.size() > highBoundry)
higherCompared.remove(0);
}
}
}

You need to implement binary search and also need to handle duplicates.
Here is a code sample that does what you want (it may contain bugs).
public class CheckRead1000 {
public static void main(String[] args) {
// generate strings in reverse order to get the worst case
List<String> aaa = new ArrayList<String>();
for (int i = 50000; i > 0; i--) {
aaa.add("some string 123456789" + i);
}
// fast solution
ArrayList<String> sortedLines = new ArrayList<>();
long st1 = System.currentTimeMillis();
for (String a : aaa) {
checkIfRead1000MoreAndLess(sortedLines, a);
}
System.out.println(System.currentTimeMillis() - st1);
// doIt solution
TreeSet<String> linesThatHaveBeenRead = new TreeSet<>();
long st2 = System.currentTimeMillis();
for (String a : aaa) {
doIt(linesThatHaveBeenRead, a);
}
System.out.println(System.currentTimeMillis() - st2);
}
// solution doIt
public static void doIt(SortedSet<String> linesThatHaveBeenRead, String currentLine) {
int lessThan = 0;
int greaterThan = 0;
Iterator<String> itr = linesThatHaveBeenRead.iterator();
while (itr.hasNext()) {
String theCurrentLineInTheSet = itr.next();
if (theCurrentLineInTheSet.compareTo(currentLine) == -1) ++lessThan;
else if (theCurrentLineInTheSet.compareTo(currentLine) == 1) ++greaterThan;
}
if (lessThan >= 1000 && greaterThan >= 1000) {
// System.out.println(currentLine);
lessThan = 0;
greaterThan = 0;
}
linesThatHaveBeenRead.add(currentLine);
}
// returns whether we have already read at least 1000 strings greater than and 1000 strings less than our string
private static boolean checkIfRead1000MoreAndLess(List<String> sortedLines, String newLine) {
//adding string to list and calculating its index and the last search range
int indexes[] = addNewString(sortedLines, newLine);
int index = indexes[0]; // index of element
int low = indexes[1];
int high = indexes[2];
//we need to check if this string already was in list for instance
// 1,2,3,4,5,5,5,5,5,6,7 for 5 we need to count 'less' as 4 and 'more' is 2
int highIndex = index;
for (int i = highIndex + 1; i < high; i++) {
if (sortedLines.get(i).equals(newLine)) {
highIndex++;
} else {
//no more duplicates
break;
}
}
int lowIndex = index;
for (int i = lowIndex - 1; i > low; i--) {
if (sortedLines.get(i).equals(newLine)) {
lowIndex--;
} else {
//no more duplicates
break;
}
}
// just calculating how many we did read more and less
if (sortedLines.size() - highIndex - 1 >= 1000 && lowIndex >= 1000) { // "at least 1000" on both sides
return true;
}
return false;
}
// simple binary search will insert string and return its index and ranges in sorted list
// first int is index,
// second int is start of range - will be used to find duplicates,
// third int is end of range - will be used to find duplicates,
private static int[] addNewString(List<String> sortedLines, String newLine) {
if (sortedLines.isEmpty()) {
sortedLines.add(newLine);
return new int[]{0, 0, 0};
}
// int index = Integer.MAX_VALUE;
int low = 0;
int high = sortedLines.size() - 1;
int mid = 0;
while (low <= high) {
mid = (low + high) / 2;
if (sortedLines.get(mid).compareTo(newLine) < 0) {
low = mid + 1;
} else if (sortedLines.get(mid).compareTo(newLine) > 0) {
high = mid - 1;
} else if (sortedLines.get(mid).compareTo(newLine) == 0) {
// index = mid;
break;
}
if (low > high) {
mid = low;
}
}
if (mid == sortedLines.size()) {
sortedLines.add(newLine);
} else {
sortedLines.add(mid, newLine);
}
return new int[]{mid, low, high};
}
}

Most efficient way to search for unknown patterns in a string?

I am trying to find patterns that:
occur more than once
are more than 1 character long
are not substrings of any other known pattern
without knowing any of the patterns that might occur.
For example:
The string "the boy fell by the bell" would return 'ell', 'the b', 'y '.
The string "the boy fell by the bell, the boy fell by the bell" would return 'the boy fell by the bell'.
Using double for-loops, it can be brute forced very inefficiently:
ArrayList<String> patternsList = new ArrayList<>();
int length = string.length();
for (int i = 0; i < length; i++) {
int limit = (length - i) / 2;
for (int j = limit; j >= 1; j--) {
int candidateEndIndex = i + j;
String candidate = string.substring(i, candidateEndIndex);
if(candidate.length() <= 1) {
continue;
}
if (string.substring(candidateEndIndex).contains(candidate)) {
boolean notASubpattern = true;
for (String pattern : patternsList) {
if (pattern.contains(candidate)) {
notASubpattern = false;
break;
}
}
if (notASubpattern) {
patternsList.add(candidate);
}
}
}
}
However, this is incredibly slow when searching large strings with tons of patterns.

You can build a suffix tree for your string in linear time:
https://en.wikipedia.org/wiki/Suffix_tree
The patterns you are looking for are the strings corresponding to internal nodes that have only leaf children.
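Building a real suffix tree takes some effort, so here is a much simpler and slower relative of the idea as a sketch: sort all suffixes and compare neighbours. Any common prefix of two adjacent suffixes is a repeated substring, and the longest one is the longest repeat. This is not a suffix tree and not linear time (the naive sort is roughly O(n^2 log n)), and it only finds the single longest repeat, which may overlap itself. The class name LongestRepeat is invented for the example.

```java
import java.util.Arrays;

final class LongestRepeat {
    static String longestRepeatedSubstring(String text) {
        int n = text.length();
        Integer[] starts = new Integer[n];
        for (int i = 0; i < n; i++) starts[i] = i;
        // naive suffix array: sort the start positions by the suffix they begin
        Arrays.sort(starts, (a, b) -> text.substring(a).compareTo(text.substring(b)));
        String best = "";
        for (int k = 1; k < n; k++) {
            int a = starts[k - 1], b = starts[k];
            int len = 0; // length of the common prefix of two neighbouring suffixes
            while (a + len < n && b + len < n && text.charAt(a + len) == text.charAt(b + len)) {
                len++;
            }
            if (len > best.length()) best = text.substring(a, a + len);
        }
        return best;
    }

    public static void main(String[] args) {
        String s = "the boy fell by the bell, the boy fell by the bell";
        System.out.println(longestRepeatedSubstring(s)); // prints "the boy fell by the bell"
    }
}
```

Finding every pattern that is not contained in another one needs more bookkeeping than this, which is where the suffix tree (or a suffix array with an LCP table) earns its keep.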

You could use n-grams to find patterns in a string. It would take O(n) time to scan the string for n-grams. When you find a substring by using a n-gram, put it into a hash table with a count of how many times that substring was found in the string. When you're done searching for n-grams in the string, search the hash table for counts greater than 1 to find recurring patterns in the string.
For example, in the string "the boy fell by the bell, the boy fell by the bell" using a 6-gram will find the substring "the boy fell by the bell". A hash table entry with that substring will have a count of 2 because it occurred twice in the string. Varying the number of words in the n-gram will help you discover different patterns in the string.
Dictionary<string, int> dict = new Dictionary<string, int>();
int count = 0;
int ngramcount = 6;
// Add entries to the hash table
while (count < str.Length) {
    // copy the next ngramcount words into the substring
    StringBuilder sb = new StringBuilder(); // needs using System.Text
    int wordsLeft = ngramcount;
    while (wordsLeft > 0 && count < str.Length) {
        if (str[count] == ' ')
            wordsLeft--;
        sb.Append(str[count]);
        count++;
    }
    string substring = sb.ToString().Trim(); // get rid of the last blank in the substring
    // Update the dictionary (hash table) with the substring
    if (dict.ContainsKey(substring)) // substring is already in hash table so increment the count
        dict[substring]++;
    else
        dict[substring] = 1;
}
// Find the most commonly occurring pattern in the string
// by searching the hash table for the greatest count.
int maxCount = 0;
string mostCommonPattern = "";
foreach (KeyValuePair<string, int> pair in dict) {
    if (pair.Value > maxCount) {
        maxCount = pair.Value;
        mostCommonPattern = pair.Key;
    }
}

I've written this just for fun. I hope I have understood the problem correctly, this is valid and fast enough; if not, please be easy on me :) I might optimize it a little more I guess, if someone finds it useful.
private static IEnumerable<string> getPatterns(string txt)
{
char[] arr = txt.ToArray();
BitArray ba = new BitArray(arr.Length);
for (int shingle = getMaxShingleSize(arr); shingle >= 2; shingle--)
{
char[] arr1 = new char[shingle];
int[] indexes = new int[shingle];
HashSet<int> hs = new HashSet<int>();
Dictionary<int, int[]> dic = new Dictionary<int, int[]>();
for (int i = 0, count = arr.Length - shingle; i <= count; i++)
{
for (int j = 0; j < shingle; j++)
{
int index = i + j;
arr1[j] = arr[index];
indexes[j] = index;
}
int h = getHashCode(arr1);
if (hs.Add(h))
{
int[] indexes1 = new int[indexes.Length];
Buffer.BlockCopy(indexes, 0, indexes1, 0, indexes.Length * sizeof(int));
dic.Add(h, indexes1);
}
else
{
bool exists = false;
foreach (int index in indexes)
if (ba.Get(index))
{
exists = true;
break;
}
if (!exists)
{
int[] indexes1 = dic[h];
if (indexes1 != null)
foreach (int index in indexes1)
if (ba.Get(index))
{
exists = true;
break;
}
}
if (!exists)
{
foreach (int index in indexes)
ba.Set(index, true);
int[] indexes1 = dic[h];
if (indexes1 != null)
foreach (int index in indexes1)
ba.Set(index, true);
dic[h] = null;
yield return new string(arr1);
}
}
}
}
}
private static int getMaxShingleSize(char[] arr)
{
for (int shingle = 2; shingle <= arr.Length / 2 + 1; shingle++)
{
char[] arr1 = new char[shingle];
HashSet<int> hs = new HashSet<int>();
bool noPattern = true;
for (int i = 0, count = arr.Length - shingle; i <= count; i++)
{
for (int j = 0; j < shingle; j++)
arr1[j] = arr[i + j];
int h = getHashCode(arr1);
if (!hs.Add(h))
{
noPattern = false;
break;
}
}
if (noPattern)
return shingle - 1;
}
return -1;
}
private static int getHashCode(char[] arr)
{
unchecked
{
int hash = (int)2166136261;
foreach (char c in arr)
hash = (hash * 16777619) ^ c.GetHashCode();
return hash;
}
}
Edit
My previous code has serious problems. This one is better:
private static IEnumerable<string> getPatterns(string txt)
{
Dictionary<int, int> dicIndexSize = new Dictionary<int, int>();
for (int shingle = 2, count0 = txt.Length / 2 + 1; shingle <= count0; shingle++)
{
Dictionary<string, int> dic = new Dictionary<string, int>();
bool patternExists = false;
for (int i = 0, count = txt.Length - shingle; i <= count; i++)
{
string sub = txt.Substring(i, shingle);
if (!dic.ContainsKey(sub))
dic.Add(sub, i);
else
{
patternExists = true;
int index0 = dic[sub];
if (index0 >= 0)
{
dicIndexSize[index0] = shingle;
dic[sub] = -1;
}
}
}
if (!patternExists)
break;
}
List<int> lst = dicIndexSize.Keys.ToList();
lst.Sort((a, b) => dicIndexSize[b].CompareTo(dicIndexSize[a]));
BitArray ba = new BitArray(txt.Length);
foreach (int i in lst)
{
bool ok = true;
int len = dicIndexSize[i];
for (int j = i, max = i + len; j < max; j++)
{
if (ok) ok = !ba.Get(j);
ba.Set(j, true);
}
if (ok)
yield return txt.Substring(i, len);
}
}
Processing the text of this book took 3.4 sec on my computer.

Suffix arrays are the right idea, but there's a non-trivial piece missing, namely, identifying what are known in the literature as "supermaximal repeats". Here's a GitHub repo with working code: https://github.com/eisenstatdavid/commonsub . Suffix array construction uses the SAIS library, vendored in as a submodule. The supermaximal repeats are found using a corrected version of the pseudocode from findsmaxr in "Efficient repeat finding via suffix arrays" (Becher–Deymonnaz–Heiber).
static void FindRepeatedStrings(void) {
// findsmaxr from https://arxiv.org/pdf/1304.0528.pdf
printf("[");
bool needComma = false;
int up = -1;
for (int i = 1; i < Len; i++) {
if (LongCommPre[i - 1] < LongCommPre[i]) {
up = i;
continue;
}
if (LongCommPre[i - 1] == LongCommPre[i] || up < 0) continue;
for (int k = up - 1; k < i; k++) {
if (SufArr[k] == 0) continue;
unsigned char c = Buf[SufArr[k] - 1];
if (Set[c] == i) goto skip;
Set[c] = i;
}
if (needComma) {
printf("\n,");
}
printf("\"");
for (int j = 0; j < LongCommPre[up]; j++) {
unsigned char c = Buf[SufArr[up] + j];
if (iscntrl(c)) {
printf("\\u%.4x", c);
} else if (c == '\"' || c == '\\') {
printf("\\%c", c);
} else {
printf("%c", c);
}
}
printf("\"");
needComma = true;
skip:
up = -1;
}
printf("\n]\n");
}
Here's a sample output on the text of the first paragraph:
Davids-MBP:commonsub eisen$ ./repsub input
["\u000a"
," S"
," as "
," co"
," ide"
," in "
," li"
," n"
," p"
," the "
," us"
," ve"
," w"
,"\""
,"–"
,"("
,")"
,". "
,"0"
,"He"
,"Suffix array"
,"`"
,"a su"
,"at "
,"code"
,"com"
,"ct"
,"do"
,"e f"
,"ec"
,"ed "
,"ei"
,"ent"
,"ere's a "
,"find"
,"her"
,"https://"
,"ib"
,"ie"
,"ing "
,"ion "
,"is"
,"ith"
,"iv"
,"k"
,"mon"
,"na"
,"no"
,"nst"
,"ons"
,"or"
,"pdf"
,"ri"
,"s are "
,"se"
,"sing"
,"sub"
,"supermaximal repeats"
,"te"
,"ti"
,"tr"
,"ub "
,"uffix arrays"
,"via"
,"y, "
]

I would use Knuth–Morris–Pratt algorithm (linear time complexity O(n)) to find substrings. I would try to find the largest substring pattern, remove it from the input string and try to find the second largest and so on. I would do something like this:
string pattern = input.substring(0, input.length / 2);
string toMatchString = input.substring(pattern.length, input.length - 1);
List<string> matches = new List<string>();
while(pattern.length > 0)
{
    int index = KMP(pattern, toMatchString);
    if(index >= 0) // assuming KMP returns -1 when there is no match, so 0 is a valid hit
    {
        matches.Add(pattern);
        // remove the matched pattern occurrences from the input string
        // I would do something like this:
        // 0 to pattern.length gets removed
        // check for all occurrences of pattern in toMatchString and remove them
        // get the remaining shrunk input, reassign values for pattern & toMatchString
        // keep looking for the next largest substring
    }
    else
    {
        pattern = input.substring(0, pattern.length - 1);
        toMatchString = input.substring(pattern.length, input.length - 1);
    }
}
Where KMP implements Knuth–Morris–Pratt algorithm. You can find the Java implementations of it at Github or Princeton or write it yourself.
PS: I don't code in Java, and this is a quick attempt at my first bounty, which is about to close. So please go easy on me if I missed something trivial or made an off-by-one error.
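For reference, a compact Java sketch of what a KMP helper like the one used above might look like. The class name Kmp and the convention of returning -1 for "not found" are assumptions made for this example. The argument order (pattern first, then text) follows the call in the pseudocode.

```java
final class Kmp {
    // Returns the index of the first occurrence of pattern in text, or -1 if there is none.
    static int indexOf(String pattern, String text) {
        int m = pattern.length();
        if (m == 0) return 0;
        // fail[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix
        int[] fail = new int[m];
        for (int i = 1, k = 0; i < m; i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
            if (pattern.charAt(i) == pattern.charAt(k)) k++;
            fail[i] = k;
        }
        // scan the text once; k is the number of pattern characters currently matched
        for (int i = 0, k = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
            if (text.charAt(i) == pattern.charAt(k)) k++;
            if (k == m) return i - m + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("bell", "the boy fell by the bell")); // prints 20
    }
}
```

The failure table is what keeps the scan linear: after a mismatch the text index never moves backwards, only the number of matched pattern characters shrinks.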

Minimum Window Substring the time complexity of my solution

This is the Minimum Window Substring problem from LeetCode: https://leetcode.com/problems/minimum-window-substring/
I found a solution based on Sliding Window Algorithm, but I cannot figure out the time complexity. Some people said it is O(N), but I think it is not. Please help me, thanks!
public class Solution {
// Minimum Window Algorithm, the algorithm must fit for specific problem, this problem is diff from ...words
// 348ms
public String minWindow(String s, String t) {
int N = s.length(), M = t.length(), count = 0;
String res = "";
if (N < M || M == 0) return res;
int[] lib = new int[256], cur = new int[256]; // assumes every char code is below 256 (extended ASCII)
for (int i = 0; i < M; lib[t.charAt(i++)]++); // count each characters in t
for (int l = 0, r = 0; r < N; r++) {
char c = s.charAt(r);
if (lib[c] != 0) {
cur[c]++;
if (cur[c] <= lib[c]) count++;
if (count == M) {
char tmp = s.charAt(l);
while (lib[tmp] == 0 || cur[tmp] > lib[tmp]) {
cur[tmp]--;
tmp = s.charAt(++l);
}
if (res.length() == 0 || r - l + 1 < res.length())
res = s.substring(l, r + 1);
count--; // should add these three lines for the case cur[c] c is char in s but not the one visited
cur[s.charAt(l)]--;
l++;
}
}
}
return res;
}
}

The outer loop runs N times, once for each position r in s.
The inner while loop does at most O(N) work in total: l only ever increases, so there are at most N iterations that advance it with ++l, plus at most N checks of the while condition that fail immediately.
So the overall complexity is linear, if we don't take s.substring into consideration.
Note that the substring operation should be moved out of the loop: keep only the best index pair and take the substring once at the very end.
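A minimal sketch of that suggestion, written as a common variant of the sliding window rather than a patch of the code in the question: it tracks only the best start index and length, and calls substring once at the end. The class name MinWindow is illustrative, and the count array is sized for any UTF-16 code unit so that no assumption about ASCII is needed.

```java
final class MinWindow {
    static String minWindow(String s, String t) {
        if (t.isEmpty() || s.length() < t.length()) return "";
        // need[c] > 0 means the window still lacks that many copies of c
        int[] need = new int[Character.MAX_VALUE + 1];
        for (int i = 0; i < t.length(); i++) need[t.charAt(i)]++;
        int missing = t.length(); // characters of t not yet covered by the window
        int bestStart = -1, bestLen = Integer.MAX_VALUE;
        for (int l = 0, r = 0; r < s.length(); r++) {
            if (need[s.charAt(r)]-- > 0) missing--;
            while (missing == 0) { // window s[l..r] is valid: record it, then shrink
                if (r - l + 1 < bestLen) {
                    bestStart = l;
                    bestLen = r - l + 1;
                }
                if (++need[s.charAt(l++)] > 0) missing++;
            }
        }
        // the only substring call, made once after the scan
        return bestStart < 0 ? "" : s.substring(bestStart, bestStart + bestLen);
    }

    public static void main(String[] args) {
        System.out.println(minWindow("ADOBECODEBANC", "ABC")); // prints "BANC"
    }
}
```

Both l and r only move forward, so the loop body runs at most 2N times, and the single substring call adds at most O(N) on top.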

Check out my solution:
public class Solution {
public String minWindow(String S, String T) {
Map<Character, Integer> pattern = new HashMap<Character, Integer>();
Map<Character, Integer> cur = new HashMap<Character, Integer>();
Queue<Integer> queue = new LinkedList<Integer>();
int min = Integer.MAX_VALUE;
int begin = 0, end = 0;
// fill in pattern by T
for(int i = 0;i < T.length();i++) addToMap(pattern, T.charAt(i));
// initialize current set
for(int i = 0;i < T.length();i++) cur.put(T.charAt(i), 0);
// go through S to match the pattern by minimum length
for(int i = 0;i < S.length();i++){
if(pattern.containsKey(S.charAt(i))){
queue.add(i);
addToMap(cur, S.charAt(i));
// check if pattern is matched
while(isMatch(pattern, cur)){ /* Important Code! */
if(i - queue.peek() < min){
min = i - queue.peek();
begin = queue.peek();
end = i+1;
}
cur.put(S.charAt(queue.peek()), cur.get(S.charAt(queue.peek()))-1);
queue.poll();
}
}
}
return end > begin?S.substring(begin, end):"";
}
private void addToMap(Map<Character, Integer> map, Character c){
if(map.containsKey(c))
map.put(c, map.get(c)+1);
else
map.put(c,1);
}
private boolean isMatch(Map<Character, Integer> p, Map<Character, Integer> cur){
for(Map.Entry<Character, Integer> entry: p.entrySet())
if(cur.get((char)entry.getKey()) < (int)entry.getValue()) return false;
return true;
}
}
