Match Substring that contains separators in full string


I wasn't sure how to phrase the question. Long story short, I want to pull both strings (a, b) from the line In: a (b). In almost all cases a=b, but just in case, I've separated them. The problem: both strings can contain any character, including Unicode, white space, punctuation, and parentheses.
1: In: ThisName (ThisName) is in this list
2: In: OtherName (With These) (OtherName (With These)) is in this list
3: In: Really Annoying (Because) Separators (Really Annoying (Because) Separators) is in this list
Line 1, easy: ^\w+:\s(?'a'.+?)\s\((?'b'.+)\) a:ThisName b:ThisName
Line 2, same as before: a:OtherName b:With These) (OtherName (With These)
Line 2, lazy: ^\w+:\s(?'a'.+?)\s\((?'b'.+?)\) a:OtherName b:With These
Line 3, head desk
Is this possible? Perhaps I need to go another route? We know one set of parentheses is required to be there. Perhaps I have to go down a math route: count every opening and closing parenthesis somehow, and use those counts to determine which pair actually contains b?
What I've been playing with: https://regex101.com/r/8YIweJ/2
By the way, if I could change the input formatting, I most definitely would.
Added Question: If that is not possible, does assuming a=b all the time make this any easier? I can't think of how it would.

My comments are embedded in the processInput method.
public static void main(String[] args)
{
String input = "1: In: ThisName (ThisName) is in this list\n" +
"2: In: OtherName (With These) (OtherName (With These)) is in this list\n" +
"3: In: Really Annoying (Because) Separators (Really Annoying (Because) Separators) is in this list\n" +
"4: In: Not the Same (NotTheSame) is in this list\n" +
"5: In: A = (B) (A = (B)) is in this list\n" +
"6: In: A != (B) (A != B) is in this list\n";
for (String line : input.split("\n"))
{
processInput(line);
}
}
public static void processInput(String line)
{
// Parse the relevant part from the input.
Matcher inputPattern = Pattern.compile("(\\d+): In: (.*) is in this list").matcher(line);
if (!inputPattern.matches())
{
System.out.println(line + " is not valid input");
return;
}
String inputNum = inputPattern.group(1);
String aAndB = inputPattern.group(2);
// Check if a = b.
Matcher aEqualsBPattern = Pattern.compile("(.*) \\(\\1\\)").matcher(aAndB);
if (aEqualsBPattern.matches())
{
System.out.println("Input " + inputNum + ":");
System.out.println("a = b = " + aEqualsBPattern.group(1));
System.out.println();
return;
}
// Check if a and b have no parentheses.
Matcher noParenthesesPattern = Pattern.compile("([^()]*) \\(([^()]*)\\)").matcher(aAndB);
if (noParenthesesPattern.matches())
{
System.out.println("Input " + inputNum + ":");
System.out.println("a = " + noParenthesesPattern.group(1));
System.out.println("b = " + noParenthesesPattern.group(2));
System.out.println();
return;
}
// a and b have one or more parentheses in them.
// All you can do now is guess what a and b are.
// There is at least one " (" in the string.
String[] split = aAndB.split(" \\(");
for (int i = 0; i < split.length - 1; i++)
{
System.out.println("Possible Input " + inputNum + ":");
System.out.println("possible a = " + mergeParts(split, 0, i));
System.out.println("possible b = " + mergeParts(split, i + 1, split.length - 1));
System.out.println();
}
}
private static String mergeParts(String[] aAndBParts, int startIndex, int endIndex)
{
StringBuilder s = new StringBuilder(getPart(aAndBParts, startIndex));
for (int j = startIndex + 1; j <= endIndex; j++)
{
s.append(" (");
s.append(getPart(aAndBParts, j));
}
return s.toString();
}
private static String getPart(String[] aAndBParts, int j)
{
if (j != aAndBParts.length - 1)
{
return aAndBParts[j];
}
return aAndBParts[j].substring(0, aAndBParts[j].length() - 1);
}
Executing the above code outputs:
Input 1:
a = b = ThisName
Input 2:
a = b = OtherName (With These)
Input 3:
a = b = Really Annoying (Because) Separators
Input 4:
a = Not the Same
b = NotTheSame
Input 5:
a = b = A = (B)
Possible Input 6:
possible a = A !=
possible b = B) (A != B
Possible Input 6:
possible a = A != (B)
possible b = A != B

What I would do is not use regular expressions for this. Follow this kind of algorithm:
1. Find the first index of (. That should give you your "a" string, if I follow your question.
2. From that index, go through the string character by character using charAt. Count up when you hit a ( and down when you get to a ). Once the counter hits zero, your brackets match and you have the position of the end of your "b" string.
3. It also looks like there could be multiple strings making up "b" (from line 3), so you can just keep iterating over the string per step 2 above, adding the strings to either a list or a StringBuilder as appropriate.

Well, you could parse your text, but not with a regular expression, and with at least one of the following conditions being true:
The parentheses in the B expression are guaranteed to be matched properly. That is, no )) ((, :-), etc.
The A and B are exactly the same. In such a case, even if you have unmatched parentheses inside them, e.g. Hello (-: (Hello (-:), you know that the ( before the second Hello is the "right" one.
If you can't make those guarantees, then you should write an isMatchedParenthesis(String) method, that checks if all parentheses are properly matched. Have a counter, starting from zero, and scan through the string.
For each character in the string:
If current char is (, counter++.
If current char is ), counter--.
If counter is negative, return false
If at the end the counter is positive, return false. Otherwise true.
Test your string with that method. If it works, you can rely on finding the "significant" parenthesis using parenthesis matching. If it returns false, you can try the fallback method that assumes that both strings are the same.
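A minimal sketch of the isMatchedParenthesis check described above. The class name ParenCheck is only a placeholder for illustration; the counter logic follows the listed steps.

```java
class ParenCheck {
    // Returns true when every ')' closes an earlier '(' and nothing is left open.
    static boolean isMatchedParenthesis(String s) {
        int counter = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                counter++;
            } else if (c == ')') {
                counter--;
            }
            if (counter < 0) {
                return false; // a ')' appeared before its '('
            }
        }
        return counter == 0; // a positive counter means an unclosed '('
    }
}
```

A string such as "Hello (-:" fails the check because its ( is never closed, which is the situation where the fallback method applies.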
Find the significant parenthesis when balanced
Find the index of the rightmost ) (use lastIndexOf).
counter=0.
For each character going down from that index to 4 (the character after "In: "):
If it is a ), counter++
If it is a (, counter--.
If counter==0 stop, return current index.
Now you have the index of the significant parenthesis. Your A is the substring between 4 and this index - 1 (remember the space before the (). Your B is from that index+1 to the index of the right ) that you found first.
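The right-to-left scan above could be sketched as follows. It assumes the line starts with "In: " (so a begins at index 4) and that a single space precedes the significant parenthesis; the class and method names are placeholders, and returning null for "no split found" is a choice made for this sketch.

```java
class ParenSplit {
    // Returns {a, b} for a line shaped like "In: a (b)", or null if no split is found.
    static String[] splitBalanced(String line) {
        int close = line.lastIndexOf(')');
        if (close < 0) {
            return null; // no closing parenthesis at all
        }
        int counter = 0;
        for (int i = close; i >= 4; i--) {
            char c = line.charAt(i);
            if (c == ')') {
                counter++;
            } else if (c == '(') {
                counter--;
            }
            if (counter == 0) {
                if (i - 1 < 4) {
                    return null; // no room for a and the separating space
                }
                String a = line.substring(4, i - 1);     // drop the space before '('
                String b = line.substring(i + 1, close); // text between the matching pair
                return new String[] {a, b};
            }
        }
        return null; // the rightmost ')' has no matching '('
    }
}
```

For "In: OtherName (With These) (OtherName (With These))" this yields a and b both equal to "OtherName (With These)".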
Fallback method
Suppose your parentheses are not balanced. Can you do anything?
Make a list of all the indexes of ( in the string.
If the length of the list is even - bad string, report to user.
If the length is odd, take the index of the middle (. Assuming that A and B are the same, they should each have the same number of (, so the one that has the same number of ( to its left and to its right is your candidate.
Extract the A and B as before. If they are not equal - bad string, report to user.
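A sketch of the fallback under the same assumptions (line starts with "In: ", one space before the separating parenthesis, a and b identical). The names are hypothetical, and null stands in for "bad string, report to user".

```java
import java.util.ArrayList;
import java.util.List;

class MiddleParenSplit {
    // Returns {a, b} when the middle '(' separates two equal halves, otherwise null.
    static String[] splitAssumingEqual(String line) {
        List<Integer> opens = new ArrayList<>();
        for (int i = 4; i < line.length(); i++) {
            if (line.charAt(i) == '(') {
                opens.add(i);
            }
        }
        if (opens.size() % 2 == 0) {
            return null; // even count (including zero): bad string
        }
        int open = opens.get(opens.size() / 2); // same number of '(' on each side
        int close = line.lastIndexOf(')');
        if (open - 1 < 4 || close < open) {
            return null;
        }
        String a = line.substring(4, open - 1);
        String b = line.substring(open + 1, close);
        return a.equals(b) ? new String[] {a, b} : null;
    }
}
```

With the unbalanced example "In: Hello (-: (Hello (-:)" there are three opening parentheses, the middle one is the separator, and both halves come out as "Hello (-:".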

Related

"number of segments in a string" not working for a particular input

I have to find "contiguous sequence of non-space characters" in a string.
My output is wrong for the input
Input=", , , , a, eaefa"
My answer comes out as 13 instead of 6, though I have only counted words and not spaces.
class Solution {
public int countSegments(String s)
{
if(s.isEmpty()){
return 0;
}
else
{
int count=0;
String s1[]=s.split(" ");
for(int i=0;i<s1.length;i++)
{
if(s1[i]!=" ")
count++;
}
return count;
}
}
}

Others have suggested using:
s.split("\\s+").length
However, there are complications in using split. Specifically, the above will give incorrect answers for strings with leading spaces. Even if these issues are fixed it's still overly expensive as we're creating count new strings, and an array to hold them.
We can implement countSegments directly by iterating through the string and counting the number of times we go from a non-space character to a space character, or the end of the string:
public static int countSegments(String s)
{
int count = 0;
for(int i=1; i<=s.length(); i++)
{
if((s.charAt(i-1) != ' ') && (i == s.length() || s.charAt(i) == ' ')) count++;
}
return count;
}
Test:
for(String s : new String[] {"", " ", "a", " a", "a ", " a ", ", , , , a, eaefa"})
System.out.format("<%s> : %d%n", s, countSegments(s));
Output:
<> : 0
< > : 0
<a> : 1
< a> : 1
<a > : 1
< a > : 1
<, , , , a, eaefa> : 6

You should use split on multiple spaces, and then you have the segments already divided up for you, so you don't need to make a for-loop or anything.
//The trim is because split gets messed up with leading spaces, as SirRaffleBuffle said
s = s.trim();
if (s.isEmpty()) return 0;
return s.split("\\s+").length;
If you want only sequences of alphanumeric characters, you can try this regex instead: "\\W+"
If you want only sequences of English letters, you can do the same thing but with the regex "[^A-Za-z]+".
Here, it splits on multiple spaces instead of just one.
The way you're currently doing it, you count every single letter that's not a whitespace instead of "contiguous sequences of no-space characters". That's why you're getting 13 instead of 6.
Notice that count is incremented any time it finds something that isn't a space. If you do want to do this with a for-loop, you should have a boolean flag telling you that you've entered a sequence, so that you only increment count when that flag was previously false (you were outside a sequence) and you then find a non-space character.
Also, using != for String comparison is wrong, you should use the equals method.
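The flag-based loop mentioned above might look like this. It is a sketch that treats only the space character ' ' as a separator, and the class name is a placeholder.

```java
class SegmentCounter {
    // Counts contiguous runs of non-space characters.
    static int countSegments(String s) {
        int count = 0;
        boolean inSegment = false; // true while we are inside a run of non-space characters
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ' ') {
                inSegment = false;          // left the current segment, if any
            } else if (!inSegment) {
                inSegment = true;           // first character of a new segment
                count++;
            }
        }
        return count;
    }
}
```

No array or substrings are created, and leading, trailing, or repeated spaces need no special handling.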

You can do it easily by using the regex, \\s+ as follows:
public class Main {
public static void main(String[] args) {
String str = ", , , , a, eaefa";
str = str.trim();// Remove the leading and trailing space
System.out.println(str.isEmpty() ? 0 : str.split("\\s+").length);
}
}
Output:
6
The regex, \\s+ matches on one or more consecutive spaces.
On a side note, you are using != to compare strings, which is not correct. Note that == and != are used to compare the references, not the values.
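A small illustration of the difference, using hypothetical helper names. new String(" ") is used here to force a distinct object, similar to the fresh strings that split returns.

```java
class StringEqualityDemo {
    // Compares object identity: true only if both variables point to the same String object.
    static boolean sameReference(String x, String y) {
        return x == y;
    }

    // Compares contents: true if both strings hold the same characters.
    static boolean sameValue(String x, String y) {
        return x.equals(y);
    }

    public static void main(String[] args) {
        String space = new String(" "); // a distinct object with the same contents as " "
        System.out.println(sameReference(space, " ")); // false
        System.out.println(sameValue(space, " "));     // true
    }
}
```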

Find out recursive pattern in a string java

In one of my interviews I was asked a question about Java strings that I was unable to answer. I don't know whether it is a simple program or a complex one. I have searched the internet for it but was unable to find an exact solution. My question is as follows.
I have supposed one string which contains recursive pattern like,
String str1 = "abcabcabc";
In the above string the recursive pattern is "abc", which is repeated throughout the string; the string contains nothing but the "abc" pattern, repeated.
If I pass this string to a function/method as a parameter, that function/method should return "This string has a recursive pattern." If the string doesn't have any recursive pattern, the function/method should simply return "This string doesn't contain the recursive pattern."
The following are the possible cases:
String str1 = "abcabcabc";
//This string contains recursive pattern 'abc'
String str2 = "abcabcabcabdac";
//This string doesn't contains recursive pattern
String str3 = "abcddabcddabcddddabc";
//This string contains recursive pattern 'abc' & 'dd'
Can anybody suggest a solution/algorithm for this? I am struggling with it. What is the best way to handle these different cases, so that I can implement it?

From LeetCode
public boolean repeatedSubstringPattern(String str) {
int l = str.length();
for(int i=l/2;i>=1;i--) {
if(l%i==0) {
int m = l/i;
String subS = str.substring(0,i);
StringBuilder sb = new StringBuilder();
for(int j=0;j<m;j++) {
sb.append(subS);
}
if(sb.toString().equals(str)) return true;
}
}
return false;
}
The length of the repeating substring must be a divisor of the length of the input string.
Search all possible divisors of str.length(), starting from length/2.
If i is a divisor of the length, repeat the substring from 0 to i as many times as i fits into str.length().
If the repeated substring equals the input str, return true.

The solution is not in Java. However, the problem looked interesting, so I attempted to solve it in Python. Apologies!
In Python, I wrote some logic that worked [it could be written much better, but I thought the logic would help you].
The script is:
# Python 2 (uses raw_input)
def check(lst):
    return all(x in lst[-1] for x in lst)

s = raw_input("Enter string:: ")
if check(sorted(s.split(s[0])[1:])):
    print("String, {} is recursive".format(s))
else:
    print("String, {} is NOT recursive".format(s))
Output of the script:
[mac] kgowda#blr-mp6xx:~/Desktop/my_work/play$ python dup.py
Enter string:: abcabcabcabdac
String, abcabcabcabdac is NOT recursive
[mac] kgowda#blr-mp6xx:~/Desktop/my_work/play$ python dup.py
Enter string:: abcabcabc
String, abcabcabc is recursive
[mac] kgowda#blr-mp6xx:~/Desktop/my_work/play$ python dup.py
Enter string:: abcddabcddabcddddabc
String, abcddabcddabcddddabc is recursive

This can also be solved using a part of the Knuth–Morris–Pratt Algorithm.
The idea is to build a 1-D array with one entry per character of the word. For each index i, we check whether some prefix of the word is also a suffix of the substring from 0 to i. If a prefix and a suffix match, a search can resume from the character after that prefix instead of starting over, so the array entry stores the index at which to resume.
For s="abcabcabc", the array will be
Index : 0 1 2 3 4 5 6 7 8
String: a b c a b c a b c
KMP : 0 0 0 1 2 3 4 5 6
For Index = 2, we see that no suffix of abc (the string up to Index = 2) is also a prefix, so KMP[2] = 0.
For Index = 4, the suffix ab (Index = 3, 4) is the same as the prefix ab (Index = 0, 1), so we set KMP[4] = 2, which is the index of the pattern from which we have to resume searching.
Thus KMP[i] holds the length of the longest prefix of s that is also a proper suffix of the range 0 to i, which is the same as the index just past that prefix. This essentially means that a block of length index + 1 - KMP[index] already occurred earlier in the string. Using this information we can find out whether all the substrings of that length are the same.
For Index = 8, we know KMP[index] = 6, which means there is a block (s[3] to s[5]) of length 9 - 6 = 3 which is equal to the suffix (s[6] to s[8]). If this is the only repetitive pattern, the whole string will follow it.
For a clearer explanation of this algorithm please check this video lecture.
This table can be built in linear time,
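#include <string>
#include <vector>
using namespace std;

vector<int> buildKMPtable(const string& word)
{
vector<int> kmp(word.size());
int j = 0; // length of the prefix matched so far
for(int i = 1; i < word.size(); ++i)
{
// On a mismatch, fall back to shorter prefixes until one fits or none is left.
// (The guard j > 0 avoids reading kmp[-1].)
while(j > 0 && word[j] != word[i])
{
j = kmp[j-1];
}
if(word[j] == word[i])
{
++j;
}
kmp[i] = j;
}
return kmp;
}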
bool repeatedSubstringPattern(string s) {
auto kmp = buildKMPtable(s);
if(kmp[s.size() -1] == 0) // Occurs when the string has no prefix with suffix ending at the last character of the string
{
return false;
}
int diff = s.size() - kmp[s.size() -1]; //Length of the repetitive pattern
if(s.size() % diff != 0) //Length of the repetitive pattern must divide the size of the string
{
return false;
}
// Check if that repetitive pattern is the only repetitive pattern.
string word = s.substr(0, diff);
int w_size = word.size();
for(int i=0; i < w_size; ++i)
{
int j = i;
while(j < s.size())
{
if(word[i] == s[j])
{
j += w_size;
}
else
{
return false;
}
}
}
return true;
}
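Since the question asks for Java, here is a rough Java sketch of the same prefix-table idea. The class and method names are placeholders. It also leaves out the final character-by-character verification loop, on the assumption (a standard property of this table) that a non-zero last entry whose period divides the length already implies the whole string is a repetition; treat that shortcut as something to verify for your own use.

```java
class RepeatedPattern {
    // table[i] = length of the longest proper prefix of word that is also a suffix of word[0..i].
    static int[] buildPrefixTable(String word) {
        int[] table = new int[word.length()];
        int j = 0; // length of the prefix matched so far
        for (int i = 1; i < word.length(); i++) {
            while (j > 0 && word.charAt(j) != word.charAt(i)) {
                j = table[j - 1]; // fall back to the next shorter candidate prefix
            }
            if (word.charAt(j) == word.charAt(i)) {
                j++;
            }
            table[i] = j;
        }
        return table;
    }

    static boolean repeatedSubstringPattern(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int[] table = buildPrefixTable(s);
        int longest = table[s.length() - 1];
        int period = s.length() - longest; // candidate length of the repeating block
        return longest > 0 && s.length() % period == 0;
    }
}
```

For "abcabcabc" the last table entry is 6, the period is 3, and 9 is divisible by 3, so the method returns true.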

If you know the 'parts' in advance, then the answer could be Recursive regular expressions, it seems.
So for abcabcabc we need an expression like abc(?R)* where:
abc matches the literal characters
(?R) recurses the pattern
A * to match between zero and unlimited number of times
The third one is a little trickier. See this regex101 link but it looks like:
((abc)|(dd))(?R)*
where we have either 'abc' or 'dd' and there are any number of these.
Otherwise, I don't see how you could determine from just a string that it has some undefined recursive structure like this.
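One caveat for Java: java.util.regex does not support the (?R) recursion construct. For these two patterns recursion is not actually needed, because abc(?R)* accepts the same strings as (?:abc)+. A sketch with hypothetical names, assuming the parts "abc" and "dd" are known in advance:

```java
import java.util.regex.Pattern;

class KnownPartsMatcher {
    // One or more repetitions of "abc" and nothing else.
    private static final Pattern ABC_ONLY = Pattern.compile("(?:abc)+");
    // One or more parts, each being either "abc" or "dd".
    private static final Pattern ABC_OR_DD = Pattern.compile("(?:abc|dd)+");

    static boolean isAbcRepeated(String s) {
        return ABC_ONLY.matcher(s).matches(); // matches() requires the whole string to match
    }

    static boolean isAbcOrDdRepeated(String s) {
        return ABC_OR_DD.matcher(s).matches();
    }
}
```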

if statement unintentionally prematurely terminated a for loop(Regex)

I'm trying to make sure that the first letters of the forename and surname strings are capitalized. I have some Java code as follows, and for the life of me I do not know why it only works on the first character in the StringBuffer and won't carry out the rest of the loop. I believe this is an error in my regex, which I'm not quite clear on.
I'm 90% sure it's because of the space & colon presents in the original string.
the original string reads as
StringBuffer output = new StringBuffer(forename + ", " + surname);
Java
int length_of_names = Director.getSurname().length() + Director.getForename().length() + 2;
Pattern pattern = Pattern.compile("\\b([A-Z][a-z]*)\\b");
Matcher matcher = pattern.matcher(output.append(Director));
for(int i = 0; i < length_of_names; i++)
{
if (matcher.find() == true)
{
output.setCharAt(i, Character.toUpperCase(output.charAt(i)) );
continue;
}
}
A nice, quick 101 on regex statements and how to compose them would also be well appreciated

Disclaimer: This answer makes lots of assumptions. Purpose of answer is to show problems with code in question, which is relevant even if assumptions are wrong.
Assumptions:
Value of forename is same as returned by Director.getForename().
Value of surname is same as returned by Director.getSurname().
Value of output at time of matcher(...) call is as shown earlier.
Director.toString() is implemented as return surname + ", " + forename;. Exact implementation doesn't matter, but rest of answer assumes this implementation.
For purpose of illustration, forename = "John" and surname = "Doe".
Now, let's go through the code and see what's going on:
StringBuffer output = new StringBuffer(forename + ", " + surname);
Value of output is now "John, Doe" (9 characters).
int length_of_names = Director.getSurname().length() + Director.getForename().length() + 2;
Value of length_of_names calculates to 9.
This could be done better using int length_of_names = output.length().
Pattern pattern = Pattern.compile("\\b([A-Z][a-z]*)\\b");
Matcher matcher = pattern.matcher(output.append(Director));
The string returned by Director.toString() ("Doe, John") is appended to output, resulting in value being "John, DoeDoe, John". That value is given to the matcher.
With that regex pattern, the matcher will find "John", and "John". It will not find "DoeDoe", since that has an uppercase letter in the middle.
Result is that find() returns true twice, and all subsequent calls will return false.
for(int i = 0; i < length_of_names; i++)
{
if (matcher.find() == true)
{
output.setCharAt(i, Character.toUpperCase(output.charAt(i)) );
continue;
}
}
The loop iterates 9 times, with values of i from 0 to 8 (inclusive).
The first two iterations enter the if statement, so the code will uppercase the first two characters in output, resulting in the value "JOhn, DoeDoe, John".
The continue statement has no effect, since the loop continues anyway.
OOPS!!
That is not what code should do. So, to fix it:
Don't append Director to output.
Don't iterate 9 times. Instead, iterate until find() returns false.
Use position of the found text to locate the character to uppercase.
That makes code look like this:
StringBuffer output = new StringBuffer(forename + ", " + surname);
Pattern pattern = Pattern.compile("\\b([A-Z][a-z]*)\\b");
Matcher matcher = pattern.matcher(output);
while (matcher.find()) {
int i = matcher.start();
output.setCharAt(i, Character.toUpperCase(output.charAt(i)));
}
Of course, the code is still totally meaningless, since you matched words starting with uppercase letter, so changing the first letter to uppercase will do nothing at all.
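To make the loop do something useful, the pattern would have to find words that start with a lowercase letter instead. The following is one possible sketch, not the asker's actual class: the names are made up, and it only handles ASCII letters a-z.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameCapitalizer {
    // A lowercase ASCII letter that sits at the start of a word.
    private static final Pattern LOWERCASE_WORD_START = Pattern.compile("\\b[a-z]");

    static String capitalizeWords(String text) {
        StringBuilder output = new StringBuilder(text);
        // Match against the original text; setCharAt keeps the length, so positions stay valid.
        Matcher matcher = LOWERCASE_WORD_START.matcher(text);
        while (matcher.find()) {
            int i = matcher.start();
            output.setCharAt(i, Character.toUpperCase(output.charAt(i)));
        }
        return output.toString();
    }
}
```

With forename "john" and surname "doe", capitalizeWords("john, doe") gives "John, Doe"; words that are already capitalized are left alone.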

Regex does not store the element in the first index

I have a function which takes a String containing a math expression such as 6+9*8 or 4+9 and it evaluates them from left to right (without normal order of operation rules).
I've been stuck with this problem for the past couple of hours and have finally found the culprit, BUT I have no idea why it is doing what it does. When I split the string through regex (.split("\\d") and .split("\\D")), I make it go into 2 arrays: an int[] that contains the numbers involved in the expression and a String[] that contains the operations.
What I've realized is that when I do the following:
String question = "5+9*8";
String[] mathOperations = question.split("\\d");
for(int i = 0; i < mathOperations.length; i++) {
System.out.println("Math Operation at " + i + " is " + mathOperations[i]);
}
it does not put the first operation sign in index 0, rather it puts it in index 1. Why is this?
This is the system.out on the console:
Math Operation at 0 is
Math Operation at 1 is +
Math Operation at 2 is *

Because on position 0 of mathOperations there's an empty String. In other words
mathOperations = {"", "+", "*"};
According to split documentation
The array returned by this method contains each substring of this
string that is terminated by another substring that matches the given
expression or is terminated by the end of the string. ...
Why isn't there an empty string at the end of the array too?
Trailing empty strings are therefore not included in the resulting
array.
More detailed explanation - your regex matched the String like this:
"(5)+(9)*(8)" -> "" + (5) + "+" + (9) + "*" + (8) + ""
but the trailing empty string is discarded as specified by the documentation.
(hope this silly illustration helps)
Also a thing worth noting, the regex you used "\\d", would split following string "55+5" into
["", "", "+"]
That's because you match only a single character, you should probably use "\\d+"
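Putting this together, a left-to-right evaluator could account for the empty first element by reading operators from index 1. This is only a sketch with made-up names; it assumes the expression starts with a digit, contains no spaces or negative numbers, and uses only + - * /.

```java
class LeftToRightEval {
    // Evaluates strictly left to right, ignoring normal operator precedence.
    static int evaluate(String question) {
        String[] numbers = question.split("\\D+"); // "55+5*8" -> ["55", "5", "8"]
        String[] ops = question.split("\\d+");     // "55+5*8" -> ["", "+", "*"]
        int result = Integer.parseInt(numbers[0]);
        for (int i = 1; i < numbers.length; i++) {
            int next = Integer.parseInt(numbers[i]);
            // ops[0] is the empty string before the first number, so ops[i] pairs with numbers[i].
            switch (ops[i]) {
                case "+": result += next; break;
                case "-": result -= next; break;
                case "*": result *= next; break;
                case "/": result /= next; break; // integer division
                default: throw new IllegalArgumentException("Unknown operator: " + ops[i]);
            }
        }
        return result;
    }
}
```

For "6+9*8" this computes (6+9)*8 = 120 rather than the conventional 78.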

You may find the following variation on your program helpful, as one split does the jobs of both of yours...
public class zw {
public static void main(String[] args) {
String question = "85+9*8-900+77";
String[] bits = question.split("\\b");
for (int i = 0; i < bits.length; ++i) System.out.println("[" + bits[i] + "]");
}
}
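and its output (as produced on Java 7; from Java 8 on, split no longer produces the leading empty element for a zero-width match at the start of the string):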
[]
[85]
[+]
[9]
[*]
[8]
[-]
[900]
[+]
[77]
In this program, I used \b as a "zero-width boundary" to do the splitting. No characters were harmed during the split, they all went into the array.
More info here: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
and here: http://www.regular-expressions.info/wordboundaries.html

Java: Next character in String

I have one String generated of random characters that will encrypt another String given by the user by adding the first character from the String with the first character of the given String. It's working fine, but if the user were to enter multiple words with spaces in between, I want to choose the next character of the first String rather than code the space itself. Is that possible? This is what I have:
(random is the coded string and sentenceUpper is string given by user)
public static void encrypt(String sentenceUpper){
String newSentence = "";
for(int i = 0; i < sentenceUpper.length(); i++){
char one = random.charAt(i);
char two = sentenceUpper.charAt(i);
if(one < 'A' || one > 'Z'){
two = sentenceUpper.charAt(1 + i);}
char result = (char)((one + two)%26 + 'A');
newSentence += "" + result;
}
EDIT FOR BETTER EXPLANATION:
I have:
String random = "WFAZYZAZOHS";
I would like to code user input:
String upperCase: "YOU GO";
So, I'm going to take Y + L = U, etc...
to get :
"UTUSEN"
But I see that there's a space in "YOU GO" , So I'd like to change it to:
WFA ZY + YOU GO = UTU SE.
I hope that's better explained.

The simplest way to do this would probably be to use an if statement to run the code in the loop only if the character is not a space. If you don't want to skip the character in the random string, you would need a separate variable to track the current character index in that string.
Example: Put this after defining one and two and put the rest of the loop inside it:
if(two!=' '){
...
}
Then, add the space in the output:
else{
newSentence+=" ";
}
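A fuller sketch of the separate-index variant, in which a space does not consume a character of the random string. The names are placeholders, the key is passed in as a parameter rather than read from a field, and wrapping around when the key runs out is an assumption made here, not something stated in the question.

```java
class SpaceSkippingCipher {
    // key: the random uppercase string (assumed non-empty); sentenceUpper: the user's text.
    static String encrypt(String key, String sentenceUpper) {
        StringBuilder newSentence = new StringBuilder();
        int keyIndex = 0; // advances only when a character is actually encoded
        for (int i = 0; i < sentenceUpper.length(); i++) {
            char two = sentenceUpper.charAt(i);
            if (two != ' ') {
                char one = key.charAt(keyIndex % key.length()); // wrap-around is an assumption
                keyIndex++;
                newSentence.append((char) ((one + two) % 26 + 'A'));
            } else {
                newSentence.append(' '); // keep the space, leave keyIndex untouched
            }
        }
        return newSentence.toString();
    }
}
```

With the key "WFAZYZAZOHS", "YOU GO" encodes to "UTU FM": the first three letters use W, F, A and the last two use Z, Y.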
