I have a string containing nested repeating patterns, for example:
String pattern1 = "1234";
String pattern2 = "5678";
String patternscombined = "1234|1234|5678|9"; // added | for readability
String pattern = (pattern1 + pattern1 + pattern2 + "9")
+ (pattern1 + pattern1 + pattern2 + "9")
+ (pattern1 + pattern1 + pattern2 + "9");
String result = "1234|1234|5678|9|1234|1234|56";
As you can see in the example above, the result was cut off. If you know the repeating patterns, though, you can predict what should come next.
Now to my question:
How can I predict the next repetitions of this pattern to get a resulting string like:
String predictedresult = "1234|1234|5678|9|1234|1234|5678|9|1234|1234|5678|9";
Patterns will be shorter than 10 characters, and the predicted result will be shorter than 1000 characters.
I only receive the cut-off result string; a pattern recognition program is already implemented and working. In the above example, I would have result, pattern1, pattern2 and patternscombined.
EDIT:
I have found a solution that works for me:
import java.util.Arrays;
public class LRS {
// return the longest common prefix of s and t
public static String lcp(String s, String t) {
int n = Math.min(s.length(), t.length());
for (int i = 0; i < n; i++) {
if (s.charAt(i) != t.charAt(i))
return s.substring(0, i);
}
return s.substring(0, n);
}
// return the longest repeated string in s
public static String lrs(String s) {
// form the N suffixes
int N = s.length();
String[] suffixes = new String[N];
for (int i = 0; i < N; i++) {
suffixes[i] = s.substring(i, N);
}
// sort them
Arrays.sort(suffixes);
// find longest repeated substring by comparing adjacent sorted suffixes
String lrs = "";
for (int i = 0; i < N - 1; i++) {
String x = lcp(suffixes[i], suffixes[i + 1]);
if (x.length() > lrs.length())
lrs = x;
}
return lrs;
}
public static int startingRepeats(final String haystack, final String needle)
{
String s = haystack;
final int len = needle.length();
if(len == 0){
return 0;
}
int count = 0;
while (s.startsWith(needle)) {
count++;
s = s.substring(len);
}
return count;
}
public static String lrscutoff(String s){
String lrs = s;
int length = s.length();
for (int i = length; i > 0; i--) {
String x = lrs(s.substring(0, i));
if (startingRepeats(s, x) < 10 &&
startingRepeats(s, x) > startingRepeats(s, lrs)){
lrs = x;
}
}
return lrs;
}
// demo: repeatedly extract the leading repeated unit of the sample string
// and print the predicted continuation
public static void main(String[] args) {
long time = System.nanoTime();
long timemilis = System.currentTimeMillis();
String s = "12341234567891234123456789123412345";
String repeat = s;
while(repeat.length() > 0){
System.out.println("-------------------------");
String repeat2 = lrscutoff(repeat);
System.out.println("'" + repeat + "'");
int count = startingRepeats(repeat, repeat2);
String rest = repeat.substring(count*repeat2.length());
System.out.println("predicted: (rest ='" + rest + "')" );
while(count > 0){
System.out.print("'" + repeat2 + "' + ");
count--;
}
if(repeat.equals(repeat2)){
System.out.println("''");
break;
}
if(!rest.isEmpty() && repeat2.contains(rest)){
System.out.println("'" + repeat2 + "'");
}else{
System.out.println("'" + rest + "'");
}
repeat = repeat2;
}
System.out.println("Time: (nano+millis):");
System.out.println(System.nanoTime()-time);
System.out.println(System.currentTimeMillis()-timemilis);
}
}
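Since the question states that the combined pattern (patternscombined) is already known, the prediction itself can be much simpler than rediscovering the repetition. The following is only a minimal sketch under that assumption: the class and method names (PatternPredictor, predict) and the targetLength parameter are made up for illustration, and the strings are assumed to contain no pipe characters.

```java
class PatternPredictor {
    /**
     * Extends a truncated string by cycling through a known period.
     * Assumes the truncated string starts at the beginning of a period.
     */
    static String predict(String truncated, String period, int targetLength) {
        if (period.isEmpty()) {
            throw new IllegalArgumentException("period must not be empty");
        }
        // Check that the received string really follows the period.
        for (int i = 0; i < truncated.length(); i++) {
            if (truncated.charAt(i) != period.charAt(i % period.length())) {
                throw new IllegalArgumentException("input deviates from the period at index " + i);
            }
        }
        // Continue the cycle from where the input was cut off.
        StringBuilder sb = new StringBuilder(truncated);
        for (int i = truncated.length(); i < targetLength; i++) {
            sb.append(period.charAt(i % period.length()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(predict("12341234567891234123456", "1234123456789", 39));
    }
}
```

If the input does not start on a period boundary, the offset into the period would have to be found first; that case is not covered here.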
If the string you want to predict always has the numbers separated by pipes (|), you can simply split on the pipe and keep track of the counts in a HashMap. For example:
1234 = 2
1344 = 1
4411 = 5
If not, you have to modify the Longest Repeated Substring algorithm. Since you need all repeated substrings, keep track of every one rather than only the longest. You also have to enforce a minimum substring length and check for overlapping substrings. A web search will turn up plenty of references for this algorithm.
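For the pipe-separated case, the counting step could look roughly like this. It is a sketch only: TokenCounter and countTokens are illustrative names, and a LinkedHashMap is used (instead of a plain HashMap) purely so the tokens keep their order of first appearance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TokenCounter {
    /** Splits on '|' and counts how often each non-empty token occurs. */
    static Map<String, Integer> countTokens(String piped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : piped.split("\\|")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countTokens("1234|1234|5678|9|1234|1234|56"));
    }
}
```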
You seem to need something like an n-gram language model, which is a statistical model that is based on counts of co-occurring events. If you are given some training data, you can derive the probabilities from counts of seen patterns. If not, you can try to specify them manually, but this can get tricky. Once you have such a language model (where the digit patterns correspond to words), you can always predict the next word by picking one with the highest probability given some previous words ("history").
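As a rough illustration of the n-gram idea, the sketch below counts which token follows each history of `order` tokens and predicts the most frequent follower. Everything in it (the NGramPredictor class, its method names, the use of "|" to join a history into a map key) is an assumption made for the example, and it uses raw counts rather than smoothed probabilities.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NGramPredictor {
    private final int order; // number of previous tokens used as history
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    NGramPredictor(int order) {
        this.order = order;
    }

    /** Counts, for every history of `order` tokens, which token came next. */
    void train(List<String> tokens) {
        for (int i = order; i < tokens.size(); i++) {
            String history = String.join("|", tokens.subList(i - order, i));
            counts.computeIfAbsent(history, h -> new HashMap<>())
                  .merge(tokens.get(i), 1, Integer::sum);
        }
    }

    /** Returns the most frequent follower of the last `order` tokens, or null if unseen. */
    String predictNext(List<String> recent) {
        if (recent.size() < order) {
            return null;
        }
        String history = String.join("|", recent.subList(recent.size() - order, recent.size()));
        Map<String, Integer> followers = counts.get(history);
        if (followers == null) {
            return null;
        }
        return Collections.max(followers.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
```

With an order of 2 and training tokens 1234, 1234, 5678, 9, 1234, 1234, 5678, 9, the history (1234, 1234) is always followed by 5678, so that is what the model predicts after it. An order of 1 would not be enough here, because 1234 is followed by both 1234 and 5678.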
Related
I have a string of a random address like
String s = "H.N.-13/1443 laal street near bharath dental lab near thana qutubsher near modern bakery saharanpur uttar pradesh 247001";
I want to split it into an array of strings under two conditions:
each element of the array has a length less than or equal to 20
no element ends awkwardly, in the middle of a word
For example, splitting every 20 characters would produce:
"H.N.-13/1443 laal st"
"reet near bharath de"
"ntal lab near thana"
"qutubsher near moder"
"n bakery saharanpur"
but the correct output would be:
"H.N.-13/1443 laal"
"street near bharath"
"dental lab near"
"thana qutubsher near"
"modern bakery"
"saharanpur"
Notice how each element of the string array has a length less than or equal to 20.
The first (incorrectly split) output above is what my code currently produces:
static String[] split(String s,int max){
int total_lines = s.length () / 24;
if (s.length () % 24 != 0) {
total_lines++;
}
String[] ans = new String[total_lines];
int count = 0;
int j = 0;
for (int i = 0; i < total_lines; i++) {
for (j = 0; j < 20; j++) {
if (ans[count] == null) {
ans[count] = "";
}
if (count > 0) {
if ((20 * count) + j < s.length()) {
ans[count] += s.charAt (20 * count + j);
} else {
break;
}
} else {
ans[count] += s.charAt (j);
}
}
String a = "";
a += ans[count].charAt (0);
if (a.equals (" ")) {
ans[i] = ans[i].substring (0, 0) + "" + ans[i].substring (1);
}
System.out.println (ans[i]);
count++;
}
return ans;
}
public static void main (String[]args) {
String add = "H.N.-13/1663 laal street near bharath dental lab near thana qutubsher near modern bakery";
String city = "saharanpur";
String state = "uttar pradesh";
String zip = "247001";
String s = add + " " + city + " " + state + " " + zip;
String[] ans = split (s, 20);
}
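Find all runs of up to 20 characters that start with a non-space and end at a word boundary, and collect them into a List (Matcher.results() requires Java 9 or later). Because a word boundary also occurs just before the start of the next word, a match can end with a trailing space, so trim each one:
List<String> parts = Pattern.compile("\\S.{1,19}\\b").matcher(s)
.results()
.map(MatchResult::group)
.map(String::trim)
.collect(Collectors.toList());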
The code is not very clear, but at first glance it seems you are building each line character by character, which is why you get the output you see. Instead, go word by word, so that a word is kept whole and overflows to the next string when necessary. A more promising approach would be:
static String[] splitString(String s, int max) {
String[] words = s.split("\\s+");
List<String> out = new ArrayList<>();
int numWords = words.length;
int i = 0;
while (i <numWords) {
int len = 0;
StringBuilder sb = new StringBuilder();
while (i < numWords && len < max) {
int wordLength = words[i].length();
len += (i == numWords-1 ? wordLength : wordLength + 1);//1 for space
if (len <= max) {
sb.append(words[i]+ " ");
i++;
}
}
out.add(sb.toString().trim());
}
return out.toArray(new String[] {});
}
Note: It works on your example input, but it needs tweaking for other cases, such as a single word longer than the limit (20 characters here): as written, the loop never advances past such a word.
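One way the tweak for overlong words might look is sketched below. It is an assumption-laden variant, not the answer's code: the names WordWrap and wrap are invented, and the choice to hard-split a word that exceeds the limit into chunks of exactly max characters is just one possible policy.

```java
import java.util.ArrayList;
import java.util.List;

class WordWrap {
    /** Greedy word wrap; words longer than max are cut into max-sized chunks. */
    static List<String> wrap(String s, int max) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : s.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only happens for blank input
            }
            // Hard-split a word that can never fit on one line.
            while (word.length() > max) {
                if (line.length() > 0) {
                    lines.add(line.toString());
                    line.setLength(0);
                }
                lines.add(word.substring(0, max));
                word = word.substring(max);
            }
            if (line.length() == 0) {
                line.append(word);
            } else if (line.length() + 1 + word.length() <= max) {
                line.append(' ').append(word); // fits, including the separating space
            } else {
                lines.add(line.toString());    // start a new line with this word
                line.setLength(0);
                line.append(word);
            }
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }
}
```

Unlike the loop above, this version does not count a trailing space against the limit, so a line such as "thana qutubsher near" (exactly 20 characters) is kept whole.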
Here's what I have; if someone could give me some idea of what to do, that would be great. I think taking the index and counting how many values occur together would be helpful, but I'm not sure how to implement that. isVowel is a helper method that determines whether a char is a vowel.
public static String doubleVowelsMaybe(String s)
{
int run =0;
String n = "";
for(int i = 0; i< s.length(); ++i)
{
char k = s.charAt(i);
if(isVowel(k))
{
}
if(run == 1)
{
n = n + s.substring(i, i+1) + s.substring(i, i+1);
run=0;
}
else
{
n = n + s.substring(i, i+1);
run= 0;
}
}
return n;
}
Most simple string manipulation tasks like this can be fairly easily done with a regex. This one's a one-liner:
public static String doubleVowelsMaybe(String s) {
return s.replaceAll("(?<![aeiou])([aeiou])(?![aeiou])", "$1$1");
}
The regex works as follows:
(?<![aeiou]) is a negative lookbehind, so it matches only if the character is not preceded by a vowel.
([aeiou]) matches a single vowel, and captures it to group number 1.
(?![aeiou]) is a negative lookahead, so it matches only if the character is not followed by a vowel.
The replacement of $1$1 means two copies of whatever was matched by group number 1, which is the single vowel character.
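For comparison, the same rule can be written as a plain loop, which is closer to what the question was attempting. This is a sketch that assumes lowercase vowels only, like the regex; the class name VowelDoubler and this isVowel implementation are stand-ins, since the asker's helper is not shown.

```java
class VowelDoubler {
    // Stand-in for the asker's helper; lowercase vowels only, like the regex.
    static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    /** Doubles every vowel that has no vowel directly before or after it. */
    static String doubleVowelsMaybe(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out.append(c);
            boolean prevVowel = i > 0 && isVowel(s.charAt(i - 1));
            boolean nextVowel = i + 1 < s.length() && isVowel(s.charAt(i + 1));
            if (isVowel(c) && !prevVowel && !nextVowel) {
                out.append(c); // lone vowel: emit it a second time
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleVowelsMaybe("beautiful")); // beautiifuul
        System.out.println(doubleVowelsMaybe("hello"));     // heelloo
    }
}
```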
import java.util.*;
class Hello {
public static void main(String[] args) {
String abc = "beautiful";
String n = "";
int i = 0;
char[] abcchar = abc.toCharArray();
HashSet<Character> hs = new HashSet<>();
hs.add('a');
hs.add('e');
hs.add('i');
hs.add('o');
hs.add('u');
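while (i < abcchar.length) {
// a lone vowel: the next character is a consonant or the end of the string
if (hs.contains(abcchar[i]) && (i + 1 == abcchar.length || !hs.contains(abcchar[i + 1]))) {
n = n + abc.substring(i, i + 1) + abc.substring(i, i + 1);
} else {
// copy any run of adjacent vowels unchanged (bounds-checked)
while (i < abcchar.length && hs.contains(abcchar[i])) {
n = n + abc.substring(i, i + 1);
i++;
}
// then copy the consonant that follows, if there is one
if (i < abcchar.length) {
n = n + abc.substring(i, i + 1);
}
}
i++;
}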
System.out.print(n);
}
}
So I'm trying to write an algorithm that counts the number of occurrences of some pattern, say "aa", within a string, say "aaabca". The count should be returned as an integer, in this case 2, because the first three characters contain two occurrences of the pattern.
What I have counts the occurrences under the assumption that they do NOT overlap:
import java.util.Scanner;
public class Pattern{
public static void main(String[] args){
Scanner scan = new Scanner(System.in);
System.out.println("Enter the string: ");
String s = scan.nextLine();
String[] splittedInput = s.split(";");
String pattern = splittedInput[0];
String blobs = splittedInput[1];
Pattern p = new Pattern();
p.count(pattern, blobs);
}
public static void count(String pattern, String blobs){
String[] substrings = blobs.split("[|]");
int numOccurences = 0;
int[] instances = new int[substrings.length];
int patternLength = pattern.length();
for (int i = 0; i < instances.length; i++){
int length = substrings[i].length();
String temp = substrings[i];
temp = temp.replaceAll(pattern, "");
int postLength = temp.length();
numOccurences = (length - postLength) / pattern.length();
instances[i] = numOccurences;
numOccurences = 0;
}
int sum = 0;
for (int i = 0; i < instances.length; i++){
System.out.print(instances[i] + "|");
sum += instances[i];
}
System.out.print(sum);
}
}
Any suggestions?
I would personally compare the pattern against each substring of the same length. For example, a run over a single String from your array would look like this:
//Initial values
String blobs = "aaaabcaaa";
String pattern = "aa";
String[] substrings = blobs.split("[|]");
//The code I added, which should be placed inside the loop
int numOccurences = 0;
String str = substrings[0];
for (int k = 0; k <= (str.length() - pattern.length()); k++)
{
if (str.substring(k, k + pattern.length()).equals(pattern))
{
numOccurences++;
}
}
System.out.println(numOccurences);
If you want to run this on each String in your array simply modify String str = substrings[0] to String str = substrings[i] and iterate over the array storing the final numOccurences as you please.
Example Run:
String is aaaabcaaa
Pattern is aa
Output is 5 occurrences
For a single string theStr, where match is the string you're looking for:
int len = theStr.length ();
int start = 0;
int pos;
int count = 0;
while ((start < len) && ((pos = theStr.indexOf (match, start)) >= 0))
{
++count;
start = pos + 1;
}
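Wrapped into a self-contained method, the indexOf loop above might look like this. The names OverlapCounter and countOverlapping are illustrative, and returning 0 for an empty pattern is an assumption (indexOf would otherwise match at every position).

```java
class OverlapCounter {
    /** Counts occurrences of match in text, allowing them to overlap. */
    static int countOverlapping(String text, String match) {
        if (match.isEmpty()) {
            return 0; // assumption: an empty pattern counts as no occurrences
        }
        int count = 0;
        // Restart the search one character after each hit so overlaps are found.
        for (int pos = text.indexOf(match); pos >= 0; pos = text.indexOf(match, pos + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOverlapping("aaabca", "aa"));    // 2
        System.out.println(countOverlapping("aaaabcaaa", "aa")); // 5
    }
}
```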
If you use Java 8 you can count this value in the following way.
Example:
String blobs = "aaabcaaa";
String pattern = "aa";
// test every possible start index; unlike pairing adjacent characters,
// this works for patterns of any length
long count = IntStream.rangeClosed(0, blobs.length() - pattern.length())
.filter(index -> blobs.startsWith(pattern, index))
.count();
System.out.println("Result count: " + count);
Continually taking substrings and using the startsWith method seems to work pretty well.
String pat = "ss";
String str = "kskslsksaaaslsslssskssssllsssss";
int count = 0;
while (str.length() >= pat.length()) {
count += str.startsWith(pat) ? 1 : 0;
str = str.substring(1);
}
System.out.println("count = " + count);
You can also take a similar approach with streams.
long count = IntStream.range(0, str.length()).mapToObj(
n -> str.substring(n)).filter(n -> n.startsWith(pat)).count();
System.out.println("count = " + count);
But in this case I actually prefer the non-stream approach.
I need to fetch a substring that lies between two delimiters, which may be the same or different. The delimiters occur multiple times in the string, so I need to extract the substring that lies between the mth occurrence of delimiter1 and the nth occurrence of delimiter2.
For example:
myString : Ron_CR7_MU^RM^_SAF_34^
What should I do if I need to extract the substring that lies between the 3rd occurrence of '_' and the 3rd occurrence of '^'?
Substring = SAF_34
Or I could look for the substring that lies between the 2nd '^' and the 4th '_', i.e.:
Substring = _SAF
An SQL equivalent would be:
substr(myString, instr(myString, '_',1,3)+1,instr(myString, '^',1,3)-1-instr(myString, '_',1,3))
I would use,
public static int findNth(String text, String toFind, int count) {
int pos = -1;
do {
pos = text.indexOf(toFind, pos+1);
} while(--count > 0 && pos >= 0);
return pos;
}
int from = findNth(text, "_", 3);
int to = findNth(text, "^", 3);
String found = text.substring(from+1, to);
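Put together as a self-contained class, the approach above could be sketched like this. DelimiterSlicer and between are hypothetical names, and returning null when a delimiter is missing or the end delimiter comes before the start is an assumption about the desired behaviour.

```java
class DelimiterSlicer {
    /** Index of the n-th (1-based) occurrence of toFind in text, or -1. */
    static int findNth(String text, String toFind, int n) {
        int pos = -1;
        for (int i = 0; i < n; i++) {
            pos = text.indexOf(toFind, pos + 1);
            if (pos < 0) {
                return -1;
            }
        }
        return pos;
    }

    /** Text between the m-th d1 and the n-th d2, or null if that range does not exist. */
    static String between(String text, String d1, int m, String d2, int n) {
        int from = findNth(text, d1, m);
        int to = findNth(text, d2, n);
        if (from < 0 || to < 0) {
            return null;
        }
        int start = from + d1.length();
        return start <= to ? text.substring(start, to) : null;
    }

    public static void main(String[] args) {
        String s = "Ron_CR7_MU^RM^_SAF_34^";
        System.out.println(between(s, "_", 3, "^", 3)); // SAF_34
        System.out.println(between(s, "^", 2, "_", 4)); // _SAF
    }
}
```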
If you can use a solution without regex, you can find the indexes in your string where the result needs to start and end. Then simply call myString.substring(start, end) to get your result.
The biggest problem is finding start and end. To do it, you can repeat this N (or M) times:
int pos = myString.indexOf(delimiterX);
myString = myString.substring(pos + 1); // work on a copy of myString, and add pos + 1 to a running offset into the original
Hope you get the idea.
You could create a little method that simply hunts for such substrings between delimiters sequentially, using (as noted) String.indexOf(String). You do need to decide whether you want all substrings, overlapping or not (which your question suggests), or only non-overlapping ones. Here is a first attempt at such code:
import java.util.Vector;
public class FindDelimitedStrings {
public static void main(String[] args) {
String[] test = getDelimitedStrings("Ron_CR7_MU'RM'_SAF_34'", "_", "'");
if (test != null) {
for (int i = 0; i < test.length; i++) {
System.out.println(" " + (i + 1) + ". |" + test[i] + "|");
}
}
}
public static String[] getDelimitedStrings(String source,
String leftDelimiter, String rightDelimiter) {
String[] answer = null;
Vector<String> results = new Vector<String>();
if (source == null || leftDelimiter == null || rightDelimiter == null) {
return null;
}
int loc = 0;
int begin = source.indexOf(leftDelimiter, loc);
int end;
while (begin > -1) {
end = source
.indexOf(rightDelimiter, begin + leftDelimiter.length());
if (end > -1) {
results.add(source.substring(begin, end));
// loc = end + rightDelimiter.length(); if strings must be
// returned as pairs
loc = begin + 1;
if (loc < source.length()) {
begin = source.indexOf(leftDelimiter, loc);
} else {
begin = -1;
}
} else {
begin = -1;
}
}
if (results.size() > 0) {
answer = new String[results.size()];
results.toArray(answer);
}
return answer;
}
}
Example: given the sentence
My name is not eugene. my pet name is not eugene.
we have to find the shortest part of the sentence that contains the given words
my and eugene
so the answer is
eugene. my.
Case, special characters and digits can be ignored.
I have pasted my code below, but it gives a wrong answer for some test cases.
Does anyone have an idea what the problem with the code is? I don't have the test cases for which it fails.
import java.io.*;
import java.util.*;
public class ShortestSegment
{
static String[] pas;
static String[] words;
static int k,st,en,fst,fen,match,d;
static boolean found=false;
static int[] loc;
static boolean[] matches ;
public static void main(String s[]) throws IOException
{
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
pas = in.readLine().replaceAll("[^A-Za-z ]", "").split(" ");
k = Integer.parseInt(in.readLine());
words = new String[k];
matches = new boolean[k];
loc = new int[k];
for(int i=0;i<k;i++)
{
words[i] = in.readLine();
}
en = fen = pas.length;
find(0);
if(found==false)
System.out.println("NO SUBSEGMENT FOUND");
else
{
for(int j=fst;j<=fen;j++)
System.out.print(pas[j]+" ");
}
}
private static void find(int min)
{
if(min==pas.length)
return;
for(int i=0;i<k;i++)
{
if(pas[min].equalsIgnoreCase(words[i]))
{
if(matches[i]==false)
{
loc[i]=min;
matches[i] =true;
match++;
}
else
{
loc[i]=min;
}
if(match==k)
{
en=min;
st = min();
found=true;
if((fen-fst)>(en-st))
{
fen=en;
fst=st;
}
match--;
matches[getIdx()]=false;
}
}
}
find(min+1);
}
private static int getIdx()
{
for(int i=0;i<k;i++)
{
if(words[i].equalsIgnoreCase(pas[st]))
return i;
}
return -1;
}
private static int min()
{
int min=loc[0];
for(int i=1;i<loc.length;i++)
if(min>loc[i])
min=loc[i];
return min;
}
}
The code you've given will produce incorrect output for the following input. I'm assuming that word length also matters when you want to 'Find Shortest Part of Sentence containing given words'.
String: 'My firstname is eugene. My fn is eugene.'
Number of search strings: 2
string1: 'my'
string2: 'is'
Your solution is: 'My firstname is'
The correct answer is: 'My fn is'
The problem in your code is that it treats 'firstname' and 'fn' as having the same length. In the comparison (fen-fst)>(en-st) you only check whether the number of words has decreased, not whether the total length of the words has.
The following code (a JUnit test):
@Test
public void testIt() {
final String s = "My name is not eugene. my pet name is not eugene.";
final String tmp = s.toLowerCase().replaceAll("[^a-zA-Z]", " ");//here we need the placeholder (blank)
final String w1 = "my "; // leave a blank at the end to avoid those words e.g. "myself", "myth"..
final String w2 = "eugene ";//same as above
final List<Integer> l1 = getList(tmp, w1); //indexes list
final List<Integer> l2 = getList(tmp, w2);
int min = Integer.MAX_VALUE;
final int[] idx = new int[] { 0, 0 };
//loop to find out the result
for (final int i : l1) {
for (final int j : l2) {
if (Math.abs(j - i) < min) {
min = Math.abs(j - i);
idx[0] = Math.min(i, j);
// end just past the later word; the -1 drops the trailing blank of the search term
idx[1] = j > i ? j + w2.length() - 1 : i + w1.length() - 1;
}
}
}
System.out.println("indexes: " + Arrays.toString(idx));
System.out.println("result: " + s.substring(idx[0], idx[1]));
}
// returns the start index (in input) of every occurrence of search
private List<Integer> getList(final String input, final String search) {
final List<Integer> list = new ArrayList<Integer>();
int x = input.indexOf(search);
while (x >= 0) {
list.add(x);
x = input.indexOf(search, x + search.length());
}
return list;
}
gives the output:
indexes: [15, 25]
result: eugene. my
I think the code with its inline comments is pretty easy to understand. Basically, it plays with index + word length.
Note:
the "Not Found" case is not implemented.
the code just shows the idea and can be optimized, e.g. at least one abs() call could be saved.
Hope it helps.
I think it can be handled another way:
First find a match, then narrow the bounds to that match and search again inside it. It can be coded as follows:
/** Finds the shortest interval between two words.
* @param s the string to be processed
* @param first one of the words
* @param second the other word
*/
public static void getShortestInterval(String s , String first , String second)
{
String situationOne = first + "(.*?)" + second;
String situationTwo = second + "(.*?)" + first;
Pattern patternOne = Pattern.compile(situationOne,Pattern.DOTALL|Pattern.CASE_INSENSITIVE);
Pattern patternTwo = Pattern.compile(situationTwo,Pattern.DOTALL|Pattern.CASE_INSENSITIVE);
List<Integer> result = new ArrayList<Integer>(Arrays.asList(Integer.MAX_VALUE,-1,-1));
/**first , test the first choice*/
Matcher matcherOne = patternOne.matcher(s);
findTheMax(first.length(),matcherOne, result);
/**then , test the second choice*/
Matcher matcherTwo = patternTwo.matcher(s);
findTheMax(second.length(),matcherTwo,result);
if(result.get(0)!=Integer.MAX_VALUE)
{
System.out.println("The shortest length is " + result.get(0));
System.out.println("Which start @ " + result.get(1));
System.out.println("And end @ " + result.get(2));
}else
System.out.println("No matching result is found!");
}
private static void findTheMax(int headLength , Matcher matcher , List<Integer> result)
{
int length = result.get(0);
int startIndex = result.get(1);
int endIndex = result.get(2);
while(matcher.find())
{
int temp = matcher.group(1).length();
int start = matcher.start();
List<Integer> minimize = new ArrayList<Integer>(Arrays.asList(Integer.MAX_VALUE,-1,-1));
System.out.println(matcher.group().substring(headLength));
findTheMax(headLength, matcher.pattern().matcher(matcher.group().substring(headLength)), minimize);
if(minimize.get(0) != Integer.MAX_VALUE)
{
start = start + minimize.get(1) + headLength;
temp = minimize.get(0);
}
if(temp<length)
{
length = temp;
startIndex = start;
endIndex = matcher.end();
}
}
result.set(0, length);
result.set(1, startIndex);
result.set(2, endIndex);
}
Note that this handles both orderings of the two words.
You can use the Knuth-Morris-Pratt algorithm to find the indexes of all occurrences of every given word in your text. Suppose you have a text of length N and M words (w1 ... wM). Using KMP you can build an array:
occur = int[N];
occur[i] = 1, if w1 starts at position i
...
occur[i] = M, if wM starts at position i
occur[i] = 0, if no word from w1...wM starts at position i
Then loop through this array and, from every non-zero position, search forward for the other M-1 words.
This is approximate pseudocode, just to convey the idea. It definitely won't work if you simply transcribe it into Java:
for i = 0 to N-1 {
if occur[i] != 0 {
foundWords = {}
for j = i to N-1 { // searching forward
if occur[j] != 0 and !foundWords.contains(occur[j]) {
foundWords.add(occur[j]);
lastWordInd = j;
if foundWords.containAllWords() break;
}
}
if foundWords.containAllWords() {
foundTextPieceLen = lastWordInd + w[occur[lastWordInd] - 1].length - i;
if foundTextPieceLen < minTextPieceLen {
minTextPieceLen = foundTextPieceLen;
// also remember start and end indexes of text piece
}
}
}
}
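A runnable Java sketch of the occurrence-array idea follows. It makes several simplifying assumptions that are not part of the answer: String.indexOf is used in place of KMP, matching is case-insensitive on raw substrings (not whole words), the search words are non-empty and distinct, and no two of them start at the same position. The names ShortestCover and shortest are invented for the example.

```java
import java.util.HashSet;
import java.util.Set;

class ShortestCover {
    /** Shortest substring of text containing every word, or null if there is none. */
    static String shortest(String text, String... words) {
        String lower = text.toLowerCase();
        int n = lower.length();
        // occur[i] = k + 1 if words[k] starts at position i, otherwise 0
        int[] occur = new int[n];
        for (int k = 0; k < words.length; k++) {
            String w = words[k].toLowerCase();
            for (int p = lower.indexOf(w); p >= 0; p = lower.indexOf(w, p + 1)) {
                occur[p] = k + 1;
            }
        }
        int bestStart = -1;
        int bestEnd = -1;
        for (int i = 0; i < n; i++) {
            if (occur[i] == 0) {
                continue;
            }
            Set<Integer> found = new HashSet<>();
            int end = i; // exclusive end of the segment that starts at i
            for (int j = i; j < n; j++) {
                // Only the first occurrence of each word after i can extend the segment.
                if (occur[j] != 0 && found.add(occur[j])) {
                    end = Math.max(end, j + words[occur[j] - 1].length());
                    if (found.size() == words.length) {
                        if (bestStart < 0 || end - i < bestEnd - bestStart) {
                            bestStart = i;
                            bestEnd = end;
                        }
                        break;
                    }
                }
            }
        }
        return bestStart < 0 ? null : text.substring(bestStart, bestEnd);
    }
}
```

Because the segment length is measured in characters rather than in words, this variant also distinguishes 'My fn is' from 'My firstname is'. The forward scan makes it quadratic in the worst case, which should be acceptable for sentence-sized input.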