Find Min Length Substring Containing All Given Strings

Given a large document and a short pattern consisting of a few words
(e.g. W1 W2 W3), find the shortest string that has all the words in any
order (e.g. W2 foo bar dog W1 cat W3 is a valid match).
I structured the "large document" as a list of strings. I believe my solution is O(n log n), but I'm not sure (I'm also not sure whether it's correct). Is there a faster way? Please note that the code below is pseudocoded Java, so it obviously will not compile, but I believe the message is clear:
main(){
    List<String> wordsToCheckFor;
    List<String> allWords;
    int allWordsLength = allWords.length;
    int minStringLength = POS_INFINITY;
    List<String> minString;
    //The idea here is to divide and conquer the string; I will first
    //check the entire string, then the entire string minus the first
    //word, then the entire string minus the first two words, and so on...
    for(int x = 0; x < allWordsLength; x++){
        if(checkString(allWords, wordsToCheckFor) && (allWords.length < minStringLength)){
            minString = allWords;
            minStringLength = allWords.length();
        }
        allWords.remove(0);
    }
    System.out.println(minString);
}
checkString(List<String> allWords, List<String> wordsToCheckFor){
    boolean good = true;
    foreach(String word : wordsToCheckFor){
        if(!allWords.contains(word))
            good = false;
    }
    return good;
}

Your solution has O(n^2) time complexity (in the worst case, each suffix is checked, and each check is O(n), because the List.contains method has linear time complexity). Moreover, it is not correct: the answer is not always a suffix, it can be any substring.
A more efficient solution: iterate over your text word by word and keep track of the last occurrence of each word from the pattern (using, for example, a hash table). Update the answer after each iteration (a candidate substring runs from the minimum last occurrence among all words in the pattern to the current position). This solution has linear time complexity (under the assumption that the number of words in the pattern is a constant).
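The last-occurrence idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name ShortestWindow and the method shortestWindow are made up for this example, the document is assumed to be already split into words, and the pattern is assumed to be a set of distinct words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ShortestWindow {
    /**
     * Returns the shortest run of consecutive words that contains every
     * pattern word, or an empty list if no such run exists.
     */
    static List<String> shortestWindow(List<String> words, Set<String> pattern) {
        // Last index at which each pattern word has been seen so far.
        Map<String, Integer> lastSeen = new HashMap<>();
        int bestStart = -1;
        int bestEnd = -1;
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            if (!pattern.contains(word)) {
                continue;
            }
            lastSeen.put(word, i);
            if (lastSeen.size() < pattern.size()) {
                continue; // not every pattern word has appeared yet
            }
            // The candidate window starts at the oldest "last occurrence".
            int start = i;
            for (int index : lastSeen.values()) {
                start = Math.min(start, index);
            }
            if (bestStart == -1 || i - start < bestEnd - bestStart) {
                bestStart = start;
                bestEnd = i;
            }
        }
        if (bestStart == -1) {
            return new ArrayList<>();
        }
        return new ArrayList<>(words.subList(bestStart, bestEnd + 1));
    }
}
```

The inner loop over lastSeen costs time proportional to the number of pattern words, which is why the overall running time is linear only when the pattern size is treated as a constant.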

Related

Longest Common Subsequence Algorithm Explanation

So the pseudocode for the longest common subsequence problem is listed below.
longest-common-subsequence(s1, s2):
If the strings begin with the same letter c, the result to return is c
plus the longest common subsequence between the rest of s1 and s2
(that is, s1 and s2 without their first letter). For example, the
longest subsequence between "hollow" and "hello" is an "h" plus the
longest subsequence found between "ollow" and "ello".
Otherwise, if
the strings do not begin with the same letter, return the longer of
the following two: The longest common subsequence between s1 and the
rest of s2 (s2 without its first letter), The longest common
subsequence between the rest of s1 (s1 without its first letter) and
s2. For example, longest-common-subsequence("ollow", "ello") is the
longer of longest-common-subsequence("ollow", "llo") and
longest-common-subsequence("llow", "ello").
The part that I don't get is when the strings do not begin with the same letter, why do we take (s1 without the first letter, s2), (s1, s2 without the first letter). Why do we recursively go through these steps when they do not match? Is it just a set algorithm that is hard to understand? What is the reasoning behind this?

While @yash mahajan already covered everything, I'll just provide another way to think about it.
Go through the two strings, and assume you are at position i in string A (of length m) and position j in string B (of length n).
1. If the current characters of the two strings are the same:
longest common subsequence so far = longest common subsequence between substring A[0...i-1] and substring B[0...j-1] + 1.
2. If the two characters are different:
longest common subsequence = Max(longest common subsequence between substring A[0...i-1] and B up to position j, longest common subsequence between A up to position i and substring B[0...j-1])
You will have a clearer idea if you read the code.
public class Solution {
    public int longestCommonSubsequence(String A, String B) {
        if(A == null || B == null || A.length() == 0 || B.length() == 0) {
            return 0;
        }
        int m = A.length();
        int n = B.length();
        int[][] commonSubsequenceLength = new int[m + 1][n + 1];
        for(int i = 1; i <= m; i++) {
            for(int j = 1; j <= n; j++) {
                if(A.charAt(i - 1) == B.charAt(j - 1)) {
                    commonSubsequenceLength[i][j] = commonSubsequenceLength[i - 1][j - 1] + 1;
                } else {
                    commonSubsequenceLength[i][j] = Math.max(commonSubsequenceLength[i][j - 1], commonSubsequenceLength[i - 1][j]);
                }
            }
        }
        return commonSubsequenceLength[m][n];
    }
}

When we know that the first characters of the two strings don't match, it's clear we can't include both of them in our longest subsequence as we do in the first case. The obvious choice would be to discard both characters and search the rest of the two strings for our subsequence. But consider this example: "hello" and "ello". If we discard both first characters, we throw away the "e" that starts the common subsequence "ello". So we consider two cases:
1. We remove the first character of the first string and compare the rest with the whole second string.
2. We remove the first character of the second string and compare the rest with the whole first string.
And then we take the maximum of those two.
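The two cases above map almost line for line onto a plain recursive function. The following is a minimal sketch, not an efficient implementation (it takes exponential time on long inputs); the class name LcsRecursive is made up for this example.

```java
final class LcsRecursive {
    /** Returns one longest common subsequence of s1 and s2. */
    static String lcs(String s1, String s2) {
        if (s1.isEmpty() || s2.isEmpty()) {
            return "";
        }
        if (s1.charAt(0) == s2.charAt(0)) {
            // Matching first letters: keep the letter and recurse on both rests.
            return s1.charAt(0) + lcs(s1.substring(1), s2.substring(1));
        }
        // Mismatch: at most one of the two first letters can be useful,
        // so try dropping each one in turn and keep the longer result.
        String withoutFirstOfS1 = lcs(s1.substring(1), s2);
        String withoutFirstOfS2 = lcs(s1, s2.substring(1));
        return withoutFirstOfS1.length() >= withoutFirstOfS2.length()
                ? withoutFirstOfS1
                : withoutFirstOfS2;
    }
}
```

For "hello" and "ello", the first call is a mismatch ('h' against 'e'); dropping the 'h' of the first string leaves "ello" against "ello", which then matches letter by letter.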

To answer your question:
The part that I don't get is when the strings do not begin with the
same letter, why do we take (s1 without the first letter, s2), (s1, s2
without the first letter). Why do we recursively go through these
steps when they do not match? Is it just a set algorithm that is hard
to understand? What is the reasoning behind this?
What is confusing you is the recursive call, I guess. The whole idea is to reduce the problem to a smaller input, in this case one character less at a time.
You have 2 cases for the selected character (first or last):
1) There is a match (for example "h" in "hollow" and "hello"): simply reduce the input size by 1 character in both strings and recursively call the same function.
2) No match: you have 2 options here. Either the first string has an extra unwanted character, or the second one does. Hence, make a recursive call for both cases and choose the maximum of them.
Extra details:
This problem has properties of a typical Dynamic Programming (DP) Problem.
1) Optimal Substructure
2) Overlapping Sub-problems
Hope it helps!
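To make the "overlapping sub-problems" point concrete, here is a hedged sketch of the same recursion with a memo table, so that each pair of positions is solved only once. The names LcsMemo, lcsLength and solve are invented for this illustration.

```java
import java.util.Arrays;

final class LcsMemo {
    /** Returns the length of the longest common subsequence of a and b. */
    static int lcsLength(String a, String b) {
        int[][] memo = new int[a.length()][b.length()];
        for (int[] row : memo) {
            Arrays.fill(row, -1); // -1 marks "not computed yet"
        }
        return solve(a, b, 0, 0, memo);
    }

    // LCS length of the suffixes a[i..] and b[j..].
    private static int solve(String a, String b, int i, int j, int[][] memo) {
        if (i == a.length() || j == b.length()) {
            return 0;
        }
        if (memo[i][j] != -1) {
            return memo[i][j]; // overlapping sub-problem: reuse the stored answer
        }
        int result;
        if (a.charAt(i) == b.charAt(j)) {
            result = 1 + solve(a, b, i + 1, j + 1, memo);
        } else {
            result = Math.max(solve(a, b, i + 1, j, memo), solve(a, b, i, j + 1, memo));
        }
        memo[i][j] = result;
        return result;
    }
}
```

With the memo table there are at most m * n distinct sub-problems, so the running time drops from exponential to O(m * n).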

Java newbie: cutting a string off?

I'm new to programming (taking a class) and I'm not sure how to accomplish this one task.
"Ignoring case, find the last occurrence of an ‘a’ in the input and remove all of the characters following it. In the case where there are no ‘a’s in the word, remove all but the first two characters (reminder: do not use if statements or loops). At the end of the now truncated word, add a number that is the percentage that the length of the truncated word is of the length of the original word; this percentage should be rounded to the closest integer value."
I'll be fine with the percentage part, but I'm not sure how to do the first part.
How do I remove only after the last occurrence of 'a'?
If there is no 'a' how do I cut it off after the first two letters without using an if statement?
I'm assuming its to be done using string manipulation and various substrings, but I'm not sure how the criteria for the substrings should be made.
Remember, Java newbie! I don't know a lot of fancy coding techniques yet.
Thank you!

String#toLowerCase - converts the String to lower case, so case no longer matters
String#lastIndexOf will tell you where the last occurrence of the specified String occurs, and will return -1 if there is no occurrence; this is important.
String#substring will allow you to generate a new String based on a part of the current String
Math#max, Math#min

Given String input, consider the following as a possible starting point:
int indexOfSmallA = input.lastIndexOf('a');
int indexOfBigA = input.lastIndexOf('A');
int endIndex = Math.max(indexOfSmallA, indexOfBigA);
// if not found, keep the first 2 characters (or the whole input if shorter), else keep everything up to and including the last 'a'
endIndex = (endIndex == -1) ? Math.min(2, input.length()) : endIndex + 1;
String result = input.substring(0, endIndex);

For finding the last occurrence of 'a' or 'A' you can use...
int index = Math.max(str.lastIndexOf('a'),str.lastIndexOf('A'));
index = (index==-1)?Math.min(2,str.length()):index+1;
Once you get the index you can use the following to remove the characters after it...
str.substring(0,index);
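Putting the pieces together, including the percentage step from the assignment, might look like the sketch below. The class and method names are made up, and it assumes the conditional operator (? :) is acceptable under the "no if statements" rule, as both answers above do.

```java
final class TruncateAtLastA {
    /**
     * Cuts the word after its last 'a' or 'A' (or after two characters if it
     * has none) and appends the truncated length as a percentage of the
     * original length, rounded to the nearest integer.
     */
    static String truncate(String word) {
        // Lower-casing first means a single lastIndexOf call covers 'a' and 'A'.
        int lastA = word.toLowerCase().lastIndexOf('a');
        int end = (lastA == -1) ? Math.min(2, word.length()) : lastA + 1;
        String truncated = word.substring(0, end);
        // 100.0 forces floating-point division before rounding.
        long percent = Math.round(100.0 * truncated.length() / word.length());
        return truncated + percent;
    }
}
```

For example, "hello" has no 'a', so it is cut to "he", which is 40% of the original length, giving "he40".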

Comparing parts of Arrays against each other?

I'm really really really not sure what is the best way to approach this. I've gotten as far as I can, but I basically want to scan a user response with an array of words and search for matches so that my AI can tell what mood someone is in based off the words they used. However, I've yet to find a clear or helpful answer. My code is pretty cluttered too because of how many different methods I've tried to use. I either need a way to compare sections of arrays to each other or portions of strings. I've found things for finding a part of an array. Like finding eggs in green eggs and ham, but I've found nothing that finds a section of an array in a section of another array.
public class MoodCompare extends Mood1 {
public static void MoodCompare(String inputMood){
int inputMoodLength = inputMood.length();
int HappyLength = Arrays.toString(Happy).length();
boolean itWorks = false;
String[] inputMoodArray = inputMood.split(" ");
if(Arrays.toString(Happy).contains(Arrays.toString(inputMoodArray)) == true)
System.out.println("Success!");
InputMood is the data the user has input that should have keywords lurking in them to their mood. Happy is an array of the class Mood1 that is being extended. This is only a small piece of the class, much less the program, but it should be all I need to make a valid comparison to complete the class.
If anyone can help me with this, you will save me hours of work. So THANK YOU!!!

Manipulating strings will be nicer when you do not use the relatively primitive arrays, where you have to walk through everything yourself, and so on. A Dutch proverb says: not seeing the wood for the trees.
In this case it seems you check the words of the input against a set of words for some mood.
Let's use Java collections:
Turning an input string into a list of words:
String input = "...";
List<String> sentence = new ArrayList<>(Arrays.asList(input.split("\\W+")));
sentence.remove(""); // Arrays.asList alone returns a fixed-size list, hence the ArrayList copy
\\W+ is a sequence of one or more non-word characters. Note that "word" characters means A-Za-z0-9_.
Now a mood would be a set of unique words:
Set<String> moodWords = new HashSet<>();
Collections.addAll(moodWords, "happy", "wow", "hurray", "great");
Evaluation could be:
int matches = 0;
for (String word : sentence) {
    if (moodWords.contains(word)) {
        ++matches;
    }
}
int percent = sentence.isEmpty() ? 0 : matches * 100 / sentence.size();
System.out.printf("Happiness: %d %%%n", percent);
In Java 8 this is even more compact (count() returns a long, hence the cast):
int matches = (int) sentence.stream().filter(moodWords::contains).count();
Explanation:
The foreach-word-in-sentence takes every word. For every word it checks whether it is contained in moodWords, the set of all mood words.
The percentage is the share of words in the sentence that are mood words. The boundary condition of an empty sentence is handled by the conditional expression ... ? ... : ... so that an empty sentence is given the arbitrary percentage 0%.
The printf format used %d for the integer, %% for the percent sign % (self-escaped) and %n for the line break character(s).
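For reference, the steps above can be combined into one small self-contained class. This is only a sketch: the names MoodScore and percentMatching are invented here, and lower-casing the input is an added assumption so that "Wow" matches "wow".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

final class MoodScore {
    /** Returns the percentage (0-100) of words in the input that are mood words. */
    static int percentMatching(String input, Set<String> moodWords) {
        // Copy into an ArrayList because Arrays.asList returns a fixed-size list.
        List<String> sentence =
                new ArrayList<>(Arrays.asList(input.toLowerCase().split("\\W+")));
        sentence.removeIf(String::isEmpty);
        long matches = sentence.stream().filter(moodWords::contains).count();
        return sentence.isEmpty() ? 0 : (int) (matches * 100 / sentence.size());
    }
}
```

With the mood words "happy", "wow", "hurray" and "great", the input "Wow, what a great day" has two matches out of five words, so the result is 40.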

If I'm understanding your question correctly, you mean something like this?
String words[] = {"green", "eggs", "and", "ham"};
String response = "eggs or ham";
Mood mood = new Mood();
for(String foo : words)
{
    if(response.contains(foo))
    {
        //Check if happy etc...
        if(response.equals("green"))
            mood.sad++;
        ...
    }
}
System.out.println("Success");
...
//CheckMood() etc... other methods.

Try to use tokens.
Every time that the program needs to compare the contents of a row from one array to the other array, just tokenize the contents in parallel and compare them.
Visit the following Java Doc page for further reference: http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html
or even view the following web pages:
http://introcs.cs.princeton.edu/java/72regular/Tokenizer.java.html

Efficient way to search for a set of strings in a string in Java

I have a set of elements of size about 100-200. Let a sample element be X.
Each of the elements is a set of strings (number of strings in such a set is between 1 and 4). X = {s1, s2, s3}
For a given input string (about 100 characters), say P, I want to test whether any of the X is present in the string.
X is present in P iff, for all s belonging to X, s is a substring of P.
The set of elements is available for pre-processing.
I want this to be as fast as possible within Java. Possible approaches which do not fit my requirements:
Checking whether all the strings s are substring of P seems like a costly operation
Because s can be any substring of P (not necessarily a word), I cannot use a hash of words
I cannot directly use regex as s1, s2, s3 can be present in any order and all of the strings need to be present as substring
Right now my approach is to construct a huge regex out of each X with all possible permutations of the order of strings. Because number of elements in X <= 4, this is still feasible. It would be great if somebody can point me to a better (faster/more elegant) approach for the same.
Please note that the set of elements is available for pre-processing and I want the solution in java.

You can use regex directly:
Pattern regex = Pattern.compile(
"^ # Anchor search to start of string\n" +
"(?=.*s1) # Check if string contains s1\n" +
"(?=.*s2) # Check if string contains s2\n" +
"(?=.*s3) # Check if string contains s3",
Pattern.DOTALL | Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
foundMatch = regexMatcher.find();
foundMatch is true if all three substrings are present in the string.
Note that you might need to escape your "needle strings" if they could contain regex metacharacters.
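Since the elements are available for pre-processing, one option is to build and compile such a lookahead pattern once per element, escaping each needle with Pattern.quote. The sketch below assumes this setup; the class name AllOfRegex and its methods are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

final class AllOfRegex {
    /** Builds a pattern that matches only text containing every needle, in any order. */
    static Pattern compileAllOf(List<String> needles) {
        StringBuilder regex = new StringBuilder("^");
        for (String needle : needles) {
            // Pattern.quote makes metacharacters such as '.' match literally.
            regex.append("(?=.*").append(Pattern.quote(needle)).append(")");
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    static boolean containsAll(Pattern compiled, String text) {
        return compiled.matcher(text).find();
    }
}
```

Each element X would be compiled once up front, and the compiled patterns reused for every input string P.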

It sounds like you're prematurely optimising your code before you've discovered whether a particular approach is actually too slow.
The nice property of your set of strings is that the string must contain all elements of X as a substring -- meaning we can fail fast if we find one element of X that is not contained within P. This might turn out to be a better time-saving approach than others, especially if the elements of X are typically longer than a few characters and contain no or only a few repeating characters. For instance, a regex engine need only check 20 characters in a string of length 100 when checking for the presence of a string of length 5 with non-repeating characters (e.g. coast). And since there are 100-200 elements X, you really, really want to fail fast if you can.
My suggestion would be to sort the strings in order of length and check for each string in turn, stopping early if one string is not found.
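A rough sketch of this fail-fast idea with plain String.contains is shown below. The class name FailFastMatcher is invented, and checking the longest strings first is an assumption (the answer only says to sort by length); the intuition is that longer strings are less likely to be present, so they tend to rule an element out sooner.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class FailFastMatcher {
    private final List<List<String>> elements = new ArrayList<>();

    /** Pre-processing step: sort each element's strings once, longest first. */
    FailFastMatcher(Collection<? extends Collection<String>> rawElements) {
        for (Collection<String> raw : rawElements) {
            List<String> sorted = new ArrayList<>(raw);
            sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));
            elements.add(sorted);
        }
    }

    /** Returns true if at least one element has all of its strings inside p. */
    boolean anyPresent(String p) {
        for (List<String> element : elements) {
            if (allPresent(element, p)) {
                return true;
            }
        }
        return false;
    }

    private static boolean allPresent(List<String> element, String p) {
        for (String s : element) {
            if (!p.contains(s)) {
                return false; // fail fast: skip the remaining strings of this element
            }
        }
        return true;
    }
}
```

Measuring this simple version against the regex approach on realistic inputs would show whether anything more elaborate is needed.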

Looks like a perfect case for the Rabin–Karp algorithm:
Rabin–Karp is inferior for single pattern searching to Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior. However, Rabin–Karp is an algorithm of choice for multiple pattern search.

When the preprocessing time doesn't matter, you could create a hash table which maps every one-letter, two-letter, three-letter etc. combination which occurs in at least one string to a list of strings in which it occurs.
The algorithm to index a string would look like this (untested):
Map<String, Set<String>> indexes = new HashMap<String, Set<String>>();
for (int pos = 0; pos < string.length(); pos++) {
    // end is exclusive, so every non-empty substring starting at pos is covered
    for (int end = pos + 1; end <= string.length(); end++) {
        String substring = string.substring(pos, end);
        Set<String> stringsForThisKey = indexes.get(substring);
        if (stringsForThisKey == null) {
            stringsForThisKey = new HashSet<String>();
            indexes.put(substring, stringsForThisKey);
        }
        stringsForThisKey.add(string);
    }
}
Indexing each string that way would be quadratic in the length of the string, but it only needs to be done once for each string.
But the result would be constant-speed access to the list of strings in which a specific string occurs.

You are probably looking for the Aho-Corasick algorithm, which constructs an automaton (trie-like) from the set of strings (the dictionary), and tries to match the input string against the dictionary using this automaton.

You might want to consider using a "Suffix Tree" as well. I haven't used this code, but there is one described here
I have used proprietary implementations (that I no longer even have access to) and they are very fast.

One way is to generate every possible substring and add this to a set. This is pretty inefficient.
Instead you can create all the strings from any point to the end into a NavigableSet and search for the closest match. If the closest match starts with the string you are looking for, you have a substring match.
static class SubstringMatcher {
    final NavigableSet<String> set = new TreeSet<String>();
    SubstringMatcher(Set<String> strings) {
        for (String string : strings) {
            for (int i = 0; i < string.length(); i++)
                set.add(string.substring(i));
        }
        // remove duplicates.
        String last = "";
        for (String string : set.toArray(new String[set.size()])) {
            if (string.startsWith(last))
                set.remove(last);
            last = string;
        }
    }
    public boolean findIn(String s) {
        String s1 = set.ceiling(s);
        return s1 != null && s1.startsWith(s);
    }
}
public static void main(String... args) {
    Set<String> strings = new HashSet<String>();
    strings.add("hello");
    strings.add("there");
    strings.add("old");
    strings.add("world");
    SubstringMatcher sm = new SubstringMatcher(strings);
    System.out.println(sm.set);
    for (String s : "ell,he,ow,lol".split(","))
        System.out.println(s + ": " + sm.findIn(s));
}
prints
[d, ello, ere, hello, here, ld, llo, lo, old, orld, re, rld, there, world]
ell: true
he: true
ow: false
lol: false

Matching Subset in a String

Let's say I have-
String x = "ab";
String y = "xypa";
If I want to see if any subset of x exists in y, what would be the fastest way? Looping is time consuming. In the example above a subset of x is "a" which is found in y.

The answer really depends on many things.
If you just want to find any subset and you're doing this only once, looping is just fine (and the best you can do without using additional storage) and you can stop when you find a single character that matches.
If you have a fixed x and want to use it for matching several strings y, you can do some pre-processing to store the characters in x in a table and use this table to check if each character of y occurs in x or not.
If you want to find the largest subset, then you're looking at a different problem: the longest common subsequence problem.

Well, I'm not sure it's better than looping, but you could use String#matches:
if (y.matches(".*[" + x + "]+.*")) ...
You'd need to escape characters that are special in a regex [] construct, though (like ], -, \, ...).
The above is just an example, if you're doing it more than once, you'll want to use Pattern, Matcher, and the other stuff from the java.util.regex package.

You have to use a for loop or a regex, which is just as expensive as a for loop, because you basically need to convert one of your strings into chars.
boolean isSubset = false;
for(int i = 0; i < x.length(); i++) {
    // String.contains expects a CharSequence, so look up the char with indexOf
    if(y.indexOf(x.charAt(i)) >= 0) {
        isSubset = true;
        break;
    }
}
This uses a for loop.

It looks like this could be a case of the longest common substring problem.

You can generate all subsets of x (e.g., in your example, ab, a, b) and then generate a regexp that would do the matching:
Pattern p = Pattern.compile("(ab|a|b)");
Matcher m = p.matcher(y);
if(m.find()) {
System.err.println(m.group());
}

If both Strings only contain [a-z], then the fastest approach would be to make two bitmaps, each 26 bits long. Mark the bit for every letter contained in each String. Take the AND of the bitmaps: the resulting bits are the letters present in both Strings, i.e. the largest common subset. This is a simple O(n), with n the length of the longer String.
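A minimal sketch of the bitmap idea, assuming lower-case ASCII letters only (other characters are simply ignored here, which is an assumption of this example). The class name LetterMask is made up.

```java
final class LetterMask {
    /** Returns an int whose bit k is set if the letter 'a' + k occurs in s. */
    static int mask(String s) {
        int bits = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 'a' && c <= 'z') {
                bits |= 1 << (c - 'a');
            }
        }
        return bits;
    }

    /** Returns the letters that occur in both strings, in alphabetical order. */
    static String common(String x, String y) {
        int both = mask(x) & mask(y);
        StringBuilder letters = new StringBuilder();
        for (int k = 0; k < 26; k++) {
            if ((both & (1 << k)) != 0) {
                letters.append((char) ('a' + k));
            }
        }
        return letters.toString();
    }
}
```

For x = "ab" and y = "xypa", the AND of the two masks has only the bit for 'a' set, so common returns "a". If only a yes/no answer is needed, testing (mask(x) & mask(y)) != 0 is enough.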
(If you want to cover the whole lot of UTF, bloom filters might be more appropriate. )

Looping is time-consuming, but there's no way to do what you want other than going over the target string repeatedly.
What you can do is optimize by checking the smallest strings first, and work your way up. For example, if the target string doesn't contain abc, it can't possibly contain abcdef.
Other optimizations off the top of my head:
Don't continue to check for a match after a non-matching character is hit, though in Java you can let the computer worry about this.
Don't check to see if something is a match if there aren't enough characters left in the target string for a match to be possible.
If you need speed and have lots of space, you might be able to break the target string up into a fancy data structure like a trie for better results, though I don't have an exact algorithm in mind.
Another storage-is-not-a-problem solution: decompose the target into every possible substring and store the results in a HashSet.

What about this?
package so3935620;
import static org.junit.Assert.*;
import java.util.BitSet;
import org.junit.Test;
public class Main {
    public static boolean overlap(String s1, String s2) {
        BitSet bs = new BitSet();
        for (int i = 0; i < s1.length(); i++) {
            bs.set(s1.charAt(i));
        }
        for (int i = 0; i < s2.length(); i++) {
            if (bs.get(s2.charAt(i))) {
                return true;
            }
        }
        return false;
    }
    @Test
    public void test() {
        assertFalse(overlap("", ""));
        assertTrue(overlap("a", "a"));
        assertFalse(overlap("abcdefg", "ABCDEFG"));
    }
}
And if that version is too slow, you can compute the BitSet depending on s1, save that in some variable and later only loop over s2.
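That precomputed variant could look roughly like this. The class name CharOverlap is hypothetical; the BitSet is built once from s1 in the constructor and reused for every s2.

```java
import java.util.BitSet;

final class CharOverlap {
    // One bit per char value that occurs in the fixed first string.
    private final BitSet chars = new BitSet();

    CharOverlap(String s1) {
        for (int i = 0; i < s1.length(); i++) {
            chars.set(s1.charAt(i));
        }
    }

    /** Returns true if s2 shares at least one character with the fixed string. */
    boolean overlaps(String s2) {
        for (int i = 0; i < s2.length(); i++) {
            if (chars.get(s2.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```

Each call to overlaps then costs time proportional to the length of s2 only.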
