Regex - Split String on `:` but not inside if statements

Regex - Split String on `:` but not inside if statements - java

I am unable to figure out a solution myself on this one. Is it even possible in regex?
I have this sample:
...:...:If...Then...:...:Else...:...:...:IfEnd:...
I want to split my string on colons, but not the colons inside my if-statements.
Something like:
...
...
If...Then...:...:Else...:...:...:IfEnd
...
I searched on other questions, but all of them have 1-character delimiters, which you can solve the problem with [^set]. Is this case possible with regex?
I don't even have a half working regex solution because none of what I tried worked. haha.
If you want to know what I'm doing. I'm attempting to make an application to parse some script. The system will separate statements and parse each line individually. But separating the if-statement/while-loop/for-loop is not going to work properly because of obvious reasons. Is my way of thinking the solution not conventional or isn't right in any chance?

Assuming your input doesn't have newlines, you can use:
String str = "...:...:If...Then...:...:Else...:...:...:IfEnd:...";
String[] toks = str.replaceAll("(\\bIf\\b.*?\\bIfEnd\\b):?|:", "$1\n").split("\\n+");
for (String tok: toks) {
System.err.printf("%s%n", tok);
}
Output:
...
...
If...Then...:...:Else...:...:...:IfEnd
...
This regex first matches text from If to EfEnd and captures it in group #1 OR it matches a colon. Back-reference of captured group #1 is used in replacement while adding \n in front of it.
RegEx Demo

Related

Splitting a sentence

I'm trying to split a string: multiple characters such as !!! , ??, ... denote the end of the sentence so I want anything after this to be on a new line e.g. sentence hey.. hello split !!! example me. should be turned into:
hey..
hello split !!!
example me.
What I tried:
String myStr= "hey.. hello split !!! example me.";
String [] split = myStr.split("(?<=\\.{2,})");
This works fine when I have multiple dots but doesn't work for anything else, I can't add exclamation marks to this expression too "(?<=[\\.{2,}!{2,}]). This splits after each dot and exclamation. Is there any way to combine those ?
Ideally I wanted the app to split after a SINGLE dot too (anything that denotes the end of the sentence) but I don't think this is possible in a single pass...Thanks

Just do like this,
String [] split = myStr.split("(?<=([?!.])\\1+)");
oir
String [] split = myStr.split("(?<=([?!.])\\1{1,99})");
It captures the first character from the list [?.!] and expects the same character to be present one or more times. If yes, then the splitting should occur next to this.
or
String[] split = s.split("(?<=\\.{2,}+)|(?<=\\?{2,}+)|(?<=!{2,}+)");
Ideone

Ideally I wanted the app to split after a SINGLE dot too (anything that denotes the end of the sentence)
To do this first you have to lay down as to what cases are you considering as end of sentence. Multiple special symbols are not standard form of ending a sentence (as per my knowledge).
But if you are keeping in mind the nefarious users or some casual mistakes ending up making special symbols look like end of sentence then at least make a list of such cases and then proceed.
For your situation here where you want to split the string on multiple special symbols. Lookbehind won't be of much help because as Wiktor noted
The problem is in the backreference whose length is not known from the start.
So we need to find that zero-width where splitting needs to be done. And following regex does the same.
Regex:
(?<=[.!?])(?=[^.!?]) Regex101 Demo Ideone Demo
(?<=[.!?]) (?=[^.!?]) Regex101 Demo Ideone Demo
Note the space between two assertions in second regex.If you want to consume the preceding space when start next line.
Explanation:
This will split on the zero-width where it's preceded by special and not succeeded by it.
hey..¦ hello split !!!¦ example me. ( ¦ denotes the zero-width)

A look behind, with a negative look to prevent split within the group:
String[] lines = s.split("(?<=[?!.]{2,3})(?![?!.])");
Some test code:
public static void main (String[] args) {
String s = "hey..hello split !!!example me.";
String[] lines = s.split("(?<=[?!.]{2,3})(?![?!.])");
Arrays.stream(lines).forEach(System.out::println);
}
Output:
hey..
hello split !!!
example me.

Java String Regex replacement

Sample Input:
a:b
a.in:b
asds.sdsd:b
a:b___a.sds:bc___ab:bd
Sample Output:
a:replaced
a.in:replaced
asds.sdsd:replaced
a:replaced___a.sds:replaced___ab:replaced
String which comes after : should be replaced with custom function.
I have done the same without Regex. I feel it can be replaced with regex as we are trying to extract string out of specific pattern.
For first three cases, it's simple enough to extract String after :, but I couldn't find a way to deal with third case, unless I split the string ___ and apply the approach for first type of pattern and again concatenate them.

Just replace only the letters with exists next to : with the string replaced.
string.replaceAll("(?<=:)[A-Za-z]+", "replaced");
DEMO
or
If you also want to deal with digits, then add \d inside the char class.
string.replaceAll("(?<=:)[A-Za-z\\d]+", "replaced");

(:)[a-zA-Z]+
You can simply do this with string.replaceAll.Replace by $1replaced.See demo.
https://regex101.com/r/fX3oF6/18

How to extract multi-line text delimited by 2 strings

I've following pattern:
Claims(40)
This is good.
This is good, too.
Description
This is description.
The delimiter strings in this case are:
1st delimiter: "Claims(40)"
2nd delimiter: "Description"
I want to extract text between these delimiters while excluding the delimiters.
Also, in the above text, following rules exist:
1st delimiter starts on the 1st column in the text and it's the only word on the line.
In the first delimiter, opening parenthesis, combination of digits, and closing parenthesis may be absent. However, combination of digits and closing parenthesis exist if does the opening parenthesis.
2nd delimiter starts on the 1st column in the text and it's the only word on the line.
My regular expression:
String regxStr = "^Claims(\\(\\d+\\)?)$(.*?)^Description$";
This doesn't work.
I tried a lot many other regx, but none did work. So finally, I resorted applying brute-force approach with the regex:
String regxStr = "Claims(.*?)Description";
But neither of the regx is working. I am not being able to figure out what's and where the regx is going wrong.
I'm using Matcher class and find() method of Matcher class for further processing.
Please help me.

This captures the text you want, although I'm not totally clear on your requirements for the (40) part. #lovetostrike's answer addresses that.
\bClaims(?:\(\d+\))?\s+(.+?)\s+Description\b
You must activate the DOTALL flag when compiling the pattern:
Pattern.compile(regxStr, Pattern.DOTALL)
Escaped in a Java string:
"\\bClaims(?:\\(\\d+\\))?\\s+(.+?)\\s+Description\\b"

Here's a one-line solution:
String target = input.relaceAll(".*Claims(\\(\\d+\\))?\\s+(.*?)Description.*", "$1");

Also in addition to #aliteralmind answer, Regex isn't a good tool for nested structure, i.e. matching paren pairs. But in your simple case, you can use the OR, '|', operator in your pattern. The outer parens are used to separate the two groups for OR operator, first part with parens, and the second without parens.
(\\(\\d+\\)|\\d+)

Regular expression to select all whitespace that isn't in quotes?

I'm not very good at RegEx, can someone give me a regex (to use in Java) that will select all whitespace that isn't between two quotes? I am trying to remove all such whitespace from a string, so any solution to do so will work.
For example:
(this is a test "sentence for the regex")
should become
(thisisatest"sentence for the regex")

Here's a single regex-replace that works:
\s+(?=([^"]*"[^"]*")*[^"]*$)
which will replace:
(this is a test "sentence for the regex" foo bar)
with:
(thisisatest"sentence for the regex"foobar)
Note that if the quotes can be escaped, the even more verbose regex will do the trick:
\s+(?=((\\[\\"]|[^\\"])*"(\\[\\"]|[^\\"])*")*(\\[\\"]|[^\\"])*$)
which replaces the input:
(this is a test "sentence \"for the regex" foo bar)
with:
(thisisatest"sentence \"for the regex"foobar)
(note that it also works with escaped backspaces: (thisisatest"sentence \\\"for the regex"foobar))
Needless to say (?), this really shouldn't be used to perform such a task: it makes ones eyes bleed, and it performs its task in quadratic time, while a simple linear solution exists.
EDIT
A quick demo:
String text = "(this is a test \"sentence \\\"for the regex\" foo bar)";
String regex = "\\s+(?=((\\\\[\\\\\"]|[^\\\\\"])*\"(\\\\[\\\\\"]|[^\\\\\"])*\")*(\\\\[\\\\\"]|[^\\\\\"])*$)";
System.out.println(text.replaceAll(regex, ""));
// output: (thisisatest"sentence \"for the regex"foobar)

Here is the regex which works for both single & double quotes (assuming that all strings are delimited properly)
\s+(?=(?:[^\'"]*[\'"][^\'"]*[\'"])*[^\'"]*$)
It won't work with the strings which has quotes inside.

This just isn't something regexes are good at. Search-and-replace functions with regexes are always a bit limited, and any sort of nesting/containment at all becomes difficult and/or impossible.
I'd suggest an alternate approach: Split your string on quote characters. Go through the resulting array of strings, and strip the spaces from every other substring (whether you start with the first or second depends on whether you string started with a quote or not). Then join them back together, using quotes as separators. That should produce the results you're looking for.
Hope that helps!
PS: Note that this won't handle nested strings, but since you can't make nested strings with the ASCII double-qutoe character, I'm gonna assume you don't need that behaviour.
PPS: Once you're dealing with your substrings, then it's a good time to use regexes to kill those spaces - no containing quotes to worry about. Just remember to use the /.../g modifier to make sure it's a global replacement and not just the first match.

Groups of whitespace outside of quotes are separated by stuff that's a) not whitespace, or b) inside quotes.
Perhaps something like:
(\s+)([^ "]+|"[^"]*")*
The first part matches a sequence of spaces; the second part matches non-spaces (and non-quotes), or some stuff in quotes, either repeated any number of times. The second part is the separator.
This will give you two groups for each item in the result; just ignore the second element. (We need the parentheses for precidence rather than match grouping there.) Or, you could say, concatenate all the second elements -- though you need to match the first non-space word too, or in this example, make the spaces optional:
StringBuffer b = new StringBuffer();
Pattern p = Pattern.compile("(\\s+)?([^ \"]+|\"[^\"]*\")*");
Matcher m = p.matcher("this is \"a test\"");
while (m.find()) {
if (m.group(2) != null)
b.append(m.group(2));
}
System.out.println(b.toString());
(I haven't done much regex in Java so expect bugs.)
Finally This is how I'd do it if regexes were compulsory. ;-)
As well as Xavier's technique, you could simply do it the way you'd do it in C: just iterate over the input characters, and copy each to the new string if either it's non-space, or you've counted an odd number of quotes up to that point.

If there is only one set of quotes, you can do this:
String s = "(this is a test \"sentence for the regex\") a b c";
Matcher matcher = Pattern.compile("^[^\"]+|[^\"]+$").matcher(s);
while (matcher.find())
{
String group = matcher.group();
s = s.replace(group, group.replaceAll("\\s", ""));
}
System.out.println(s); // (thisisatest"sentence for the regex")abc

This isn't an exact solution, but you can accomplish your goal by doing the following:
STEP 1: Match the two segments
\\(([a-zA-Z ]\*)"([a-zA-Z ]\*)"\\)
STEP 2: remove spaces
temp = $1 replace " " with ""
STEP 3: rebuild your string
(temp"$2")

java regex tricky pattern

I'm stucked for a while with a regex that does me the following:
split my sentences with this: "[\W+]"
but if it finds a word like this: "aaa-aa" (not "aaa - aa" or "aaa--aaa-aa"), the word isnt splitted, but the whole word.
Basically, i want to split a sentece per words, but also considering "aaa-aa" is a word. I'have sucessfully done that by creating two separate functions, one for spliting with \w, and other to find words like "aaa-aa". Finally, i then add both, and subctract each compound word.
For example, the sentence:
"Hello my-name is Richard"
First i collect {Hello, my, name, is, Richard}
then i collect {my-name}
then i add {my-name} to {Hello, my, name, is, Richard}
then i take out {my} and {name} in here {Hello, my, name, is, Richard}.
result: {Hello, my-name, is, Richard}
this approach does what i need, but for parsing large files, this becomes too heavy, because for each sentence there's too many copies needed. So my question is, there is anything i can do to include everything in one pattern? Like:
"split me the text using this pattern "[\W+], but if you find a word like this "aaa-aa", consider it a word and not two words.

If you want to use a split() rather than explicitly matching the words you are interested in, the following should do what you want: [\s-]{2,}|\s To break that down, you first split on two or more whitespaces and/or hyphens - so a single '-' won't match so 'one-two' will be left alone but something like 'one--two', 'one - two' or even 'one - --- - two' will be split into 'one' and 'two'. That still leaves the 'normal' case of a single whitespace - 'one two' - unmatched, so we add an or ('|') followed by a single whitespace (\s). Note that the order of the alternatives is important - RE subexpressions separated by '|' are evaluated left-to-right so we need to put the spaces-and-hyphens alternative first. If we did it the other way around, when presented with something like 'one -two' we'd match on the first whitespace and return 'one', '-two'.
If you want to interactively play around with Java REs I can thoroughly recommend http://myregexp.com/signedJar.html which allows you to edit the RE and see it matching against a sample string as you edit the RE.

Why not to use pattern \\s+? This does exactly what you want without any tricks: splits text by words separated by whitespace.

Your description isn't clear enough, but why not just split it up by spaces?

I am not sure whether this pattern would work, because I don't have developer tools for Java, you might try it though, it uses character class substraction, which is supported only in Java regex as far as I know:
[\W&&[^-]]+
it means match characters if they are [\W] and [^-], that is characters are [\W] and not [-].

Almost the same regular expression as in your previous question:
String sentence = "Hello my-name is Richard";
Pattern pattern = Pattern.compile("(?<!\\w)\\w+(-\\w+)?(?!\\w)");
Matcher matcher = pattern.matcher(sentence);
while (matcher.find()) {
System.out.println(matcher.group());
}
Just added the option (...)? to also match non-hypened words.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Regex - Split String on `:` but not inside if statements - java

Related

Splitting a sentence

Java String Regex replacement

How to extract multi-line text delimited by 2 strings

Regular expression to select all whitespace that isn't in quotes?

java regex tricky pattern

Categories

Resources