Java regex line.split("\\s*//") - java

I came across the following string split, line.split("\\s*//")[0], but can't seem to find documentation on the use of the '/' character in regular expressions.
Here is my code:
String line = "type=rule.composition id=ruleComp";
line = line.split("\\s*//")[0];
Console console = System.console();
System.out.println("This is the line: " + line);
Here is the output:
This is the line: type=rule.composition id=ruleComp
I am wondering what exactly '/' does to the regular expression and was wondering whether anybody would be able to point me to some documentation and/or an answer highlighting what it does?
I also noticed that when I remove the '//' from the regex, the output changes to merely the first character, which I suppose makes sense given that \s* means that the expression splits on zero or more whitespace characters.
This is the line: t
This however raises the question: "what does the '//' add to the regular expression that sees the split occur at the end of the line"?
Any advice would be highly appreciated.
Z

Consider your input text (type=rule.composition id=ruleComp), and your two regexes:
regex 1: \s*//;
regex 2: \s*.
When you call .split() with a regular expression, the regex engine tries to match the pattern (compiled from the string literal passed as the argument) against the input, and one of two things can happen:
the regex cannot match anything (this is what happens with regex 1, because / has no special meaning in a Java regex and the input contains no literal //): the split effectively cannot operate and the 0th element is the whole input text;
the regex can match an empty string (this is what happens with regex 2): in this case, the regex engine notices this and cannot let the situation continue, since otherwise it would result in an endless loop. Therefore it forcefully advances by one character before proceeding.
Hence your results:
with the first regex, nothing is matched;
with the second regex, an empty string is matched; the regex engine chooses to shift one character and considers the "discarded" text (the skipped character) as the 0th element.
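A minimal sketch of both cases, using the input from the question. The class name SplitDemo and the helper firstPart are made up for illustration, and the third call uses a hypothetical line with a trailing // comment to show what the original regex was most likely written for (stripping end-of-line comments).

```java
class SplitDemo {
    // Hypothetical helper: returns the text before the first match of the regex.
    static String firstPart(String line, String regex) {
        return line.split(regex)[0];
    }

    public static void main(String[] args) {
        String line = "type=rule.composition id=ruleComp";

        // No "//" in the input, so nothing matches and the whole line comes back.
        System.out.println(firstPart(line, "\\s*//"));

        // "\\s*" can match the empty string, so the split happens between characters
        // (prints "t" on Java 8 and later).
        System.out.println(firstPart(line, "\\s*"));

        // With a trailing comment, the comment and the whitespace before it are cut off.
        System.out.println(firstPart("id=ruleComp   // a comment", "\\s*//"));
    }
}
```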

Related

Regular expression replace characters by a given match between strings

I am trying to replace a given character by a regular expression match.
For example, given the following string:
If you look at what you have in life, you'll always have more. If you look at what you don't have in life, you'll never have enough
I would like to replace all 't' with a '!' only where the match is between the characters 'ok' and 'fe'.
I get the match between 'ok' and 'fe' with this regular expression:
(?<=ok).*?(?=fe)
And I can only match one character with the following regex:
(?<=ok).*?(t).*?(?=fe)
I tried to transform that regex in the following way but it does not work:
(?<=ok).*?((t).*?)*?(?=fe)
How can I match all 't' between 'ok' and 'fe'?
https://regex101.com/r/ORgseA/1
You can use
String result = text.replaceAll("(?s)(\\G(?!\\A)|ok)((?:(?!ok|fe|t).)*)t(?=(?:(?!ok|fe).)*fe)", "$1$2!");
See the regex demo and the Java demo:
String text = "If you look at what you have in life, you'll always have more. If you look at what you don't have in life, you'll never have enough";
String result = text.replaceAll("(?s)(\\G(?!\\A)|ok)((?:(?!ok|fe|t).)*)t(?=(?:(?!ok|fe).)*fe)", "$1$2!");
System.out.println(result);
// => If you look a! wha! you have in life, you'll always have more. If you look a! wha! you don'! have in life, you'll never have enough
Details:
(?s) - Pattern.DOTALL embedded flag option (to make . match line break chars)
(\G(?!\A)|ok) - Group 1 ($1): ok or the end of the previous successful match
((?:(?!ok|fe|t).)*) - Group 2 ($2): any single char, zero or more occurrences, as many as possible, that does not start an ok, fe or t char sequence
t - a t char
(?=(?:(?!ok|fe).)*fe) - immediately to the right, there must be any single char, zero or more occurrences, as many as possible, that does not start an ok or fe char sequence, and then an fe substring.
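If the single-regex version feels hard to maintain, a two-step alternative may be easier to read: find each span between ok and fe with the lookaround pattern from the question, then replace t inside each span with ordinary string code. This is a sketch of a different technique from the answer above, not a rewrite of it. It assumes Java 9 or later (for Matcher.replaceAll with a function), and the class and method names are made up. Unlike the regex above, a span here starts at the first ok before an fe rather than the last one, so the results differ when ok occurs twice before fe.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenReplacer {
    // Lazily matches the text after "ok" up to the next "fe"; (?s) lets . cross line breaks.
    private static final Pattern SPAN = Pattern.compile("(?s)(?<=ok).*?(?=fe)");

    // Replaces every 't' with '!' inside each matched span and leaves the rest untouched.
    static String replaceBetween(String text) {
        return SPAN.matcher(text).replaceAll(
                m -> Matcher.quoteReplacement(m.group().replace('t', '!')));
    }

    public static void main(String[] args) {
        String text = "If you look at what you have in life, you'll always have more. "
                + "If you look at what you don't have in life, you'll never have enough";
        System.out.println(replaceBetween(text));
    }
}
```

For the sample sentence this produces the same result as the regex demo above.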

Regex: Match group if present otherwise ignore and proceed with other matches

I have been trying to match a regex pattern within the following data:
String:
TestData to 1colon delimiter list has 1 rows.Set...value is: 1 Save Error: 267467374736437-TestInfo send Error
Words to match:
TestData
267467374736437-TestInfo
Regex pattern I'm using:
(.+?\s)?.*(\s\d+-.*?\s)?
The scenario here is that the 2nd match (267467374736437-TestInfo) can be absent from the string to be matched. So I want it to be matched if it exists, and otherwise proceed with the other matches. Because of this, I added the zero-or-one quantifier ? to the group pattern above. But then it ignores the 2nd group altogether.
If i use the below pattern:
(.+?\s)?.*(\s\d+-.*?\s)
It matches just fine, but fails if the string "267467374736437-TestInfo" is missing from the input, as the group does not have the "?" quantifier.
Please help me understand where it is going wrong.
I would rather not use a complex regex, which will be ugly and a maintenance nightmare. Instead, one simple way would be to just split the string and grab the first term, and then use a smart regex to pinpoint the second term.
String input = "TestData to 1colon delimiter list has 1 rows.Set...value is: 1 Save Error: 267467374736437-TestInfo send Error";
String first = input.split(" ")[0];
String second = input.replaceAll(".*Save Error:\\s(\\S+).*", "$1");
Explore the regex:
Regex101
An optional pattern at the end will almost never be matched if a more generic pattern precedes it. In your case, the greedy dot .* grabs the whole rest of the line up to the end, and since the last pattern is optional, the regex engine calls it a day and does not try to accommodate any text for it.
If you had a lazy dot .*?, the only position where it would work is right after the preceding subpattern, which is rarely the case.
Thus, you can only rely on a tempered greedy token:
^(\S+)(?:(?!\d+-\S).)*(\d+-\S+)?
See the regex demo.
Or an unrolled version:
^(\S+)\D*(?:\d(?!\d*-\S)\D*)*(\d+-\S+)?
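A short sketch of how the two patterns above could be used from Java, with the optional second group read back as null when it is absent. The class name, the extract helper and the shortened second input are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {
    // Tempered greedy token version from the answer above.
    static final Pattern TEMPERED =
            Pattern.compile("^(\\S+)(?:(?!\\d+-\\S).)*(\\d+-\\S+)?");
    // Unrolled version from the answer above.
    static final Pattern UNROLLED =
            Pattern.compile("^(\\S+)\\D*(?:\\d(?!\\d*-\\S)\\D*)*(\\d+-\\S+)?");

    // Returns {first word, optional id}; the second entry is null when the id is absent.
    static String[] extract(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        if (!m.find()) {
            return null; // the input does not even start with a word
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String withId = "TestData to 1colon delimiter list has 1 rows.Set...value is: 1 "
                + "Save Error: 267467374736437-TestInfo send Error";
        String withoutId = "TestData to 1colon delimiter list has 1 rows.Set...value is: 1";

        System.out.println(Arrays.toString(extract(TEMPERED, withId)));
        System.out.println(Arrays.toString(extract(TEMPERED, withoutId)));
        System.out.println(Arrays.toString(extract(UNROLLED, withId)));
    }
}
```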

How to extract multi-line text delimited by 2 strings

I have the following text pattern:
Claims(40)
This is good.
This is good, too.
Description
This is description.
The delimiter strings in this case are:
1st delimiter: "Claims(40)"
2nd delimiter: "Description"
I want to extract text between these delimiters while excluding the delimiters.
Also, in the above text, following rules exist:
1st delimiter starts on the 1st column in the text and it's the only word on the line.
In the first delimiter, the opening parenthesis, the digits, and the closing parenthesis may be absent. However, if the opening parenthesis is present, the digits and the closing parenthesis are present too.
2nd delimiter starts on the 1st column in the text and it's the only word on the line.
My regular expression:
String regxStr = "^Claims(\\(\\d+\\)?)$(.*?)^Description$";
This doesn't work.
I tried many other regexes, but none of them worked. So finally, I resorted to a brute-force approach with the regex:
String regxStr = "Claims(.*?)Description";
But neither regex works. I am not able to figure out what is going wrong with them, or where.
I'm using Matcher class and find() method of Matcher class for further processing.
Please help me.
This captures the text you want, although I'm not totally clear on your requirements for the (40) part. @lovetostrike's answer addresses that.
\bClaims(?:\(\d+\))?\s+(.+?)\s+Description\b
You must activate the DOTALL flag when compiling the pattern:
Pattern.compile(regxStr, Pattern.DOTALL)
Escaped in a Java string:
"\\bClaims(?:\\(\\d+\\))?\\s+(.+?)\\s+Description\\b"
Here's a one-line solution:
String target = input.replaceAll("(?s).*Claims(?:\\(\\d+\\))?\\s+(.*?)\\s+Description.*", "$1");
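Since the question mentions using Matcher and find(), here is a rough sketch of the DOTALL pattern above wired into that flow. The class name and the extract helper are made up, and the sample text is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClaimsExtractor {
    // DOTALL lets (.+?) run across line breaks between the two delimiters.
    private static final Pattern CLAIMS = Pattern.compile(
            "\\bClaims(?:\\(\\d+\\))?\\s+(.+?)\\s+Description\\b", Pattern.DOTALL);

    // Returns the text between the delimiters, or null if they are not found.
    static String extract(String input) {
        Matcher m = CLAIMS.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "Claims(40)\n"
                + "This is good.\n"
                + "This is good, too.\n"
                + "Description\n"
                + "This is description.";
        System.out.println(extract(input));
    }
}
```

Without Pattern.DOTALL the dot stops at the first line break, which is why the brute-force pattern in the question finds nothing in multi-line text.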
Also, in addition to @aliteralmind's answer: regex isn't a good tool for nested structures, i.e. matching paren pairs. But in your simple case, you can use the OR operator, '|', in your pattern. The outer parens group the two alternatives for the OR operator: the first part with parens, and the second without.
(\\(\\d+\\)|\\d+)

capture all characters between match character (single or repeated) on string

I'm trying to extract the string preceding a specific character, even when the character is repeated (e.g. underscore '_'):
this_is_my_example_line_0
this_is_my_example_line_1_
this_is_my_example_line_2___
_this_is_my_ _example_line_3_
__this_is_my___example_line_4__
and after running my regex I should get this (the regex should ignore any instances of the matching character in the middle of the string):
this_is_my_example_line_0
this_is_my_example_line_1
this_is_my_example_line_2
this_is_my_ _example_line_3
this_is_my___example_line_4
In other words I'm trying to 'trim' the matched character(s) at the beginning and end of string.
I'm trying to use a Regex in Java to accomplish this, my idea is to capture the group of characters between the special character(s) at the end or beginning of the line.
So far I can only do this successfully for example 3 with this regexp:
/[^_]+|_+(.*)[_$]+|_$+/
[^_]+ not 'underscore' once or more
| OR
_+ underscore once or more
(.*) capture all characters
[_$]+ not 'underscore' once or more followed by end of line
|_$+ OR 'underscore' once or more followed by end of line
I just realized that this excludes the first word of the message in examples 0, 1 and 2, since the string doesn't start with an underscore and it only starts matching after finding an underscore.
Is there an easier way not involving regex?
I don't really care about the first character (although it would be nice); I only need to ignore the repeating character at the end. It looks like (according to this regex tester) just doing this would work: /()_+$/. The empty parentheses match anything before a single or repeated match at the end of the line. Would that be correct?
Thank you!
There are a couple of options here: you could either replace matches of ^_+|_+$ with an empty string, or extract the contents of the first capture group from the match of ^_*(.*?)_*$. Note that if your strings may span multiple lines and you want to perform the replacement on each line, then you will need to use the Pattern.MULTILINE flag for either approach. If your strings may span multiple lines and you only want the replacement to occur at the very beginning and end, don't use Pattern.MULTILINE but use Pattern.DOTALL for the second approach.
For example: http://regexr.com?355ff
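Both options can be sketched in Java roughly as follows, assuming single-line inputs. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnderscoreTrim {
    private static final Pattern INNER = Pattern.compile("^_*(.*?)_*$");

    // Option 1: delete runs of underscores at the very start or very end.
    static String trimByReplace(String s) {
        return s.replaceAll("^_+|_+$", "");
    }

    // Option 2: capture what sits between optional leading and trailing underscores.
    static String trimByCapture(String s) {
        Matcher m = INNER.matcher(s);
        // Falls back to the input if it does not match (e.g. it contains a line break).
        return m.matches() ? m.group(1) : s;
    }

    public static void main(String[] args) {
        String[] lines = {
            "this_is_my_example_line_0",
            "this_is_my_example_line_1_",
            "this_is_my_example_line_2___",
            "_this_is_my_ _example_line_3_",
            "__this_is_my___example_line_4__"
        };
        for (String line : lines) {
            System.out.println(trimByReplace(line) + " | " + trimByCapture(line));
        }
    }
}
```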
How about [^_\n\r](.*[^_\n\r])??
Demo
String data =
    "this_is_my_example_line_0\n" +
    "this_is_my_example_line_1_\n" +
    "this_is_my_example_line_2___\n" +
    "_this_is_my_ _example_line_3_\n" +
    "__this_is_my___example_line_4__";
Pattern p = Pattern.compile("[^_\n\r](.*[^_\n\r])?");
Matcher m = p.matcher(data);
while (m.find()) {
    System.out.println(m.group());
}
output:
this_is_my_example_line_0
this_is_my_example_line_1
this_is_my_example_line_2
this_is_my_ _example_line_3
this_is_my___example_line_4

regular expression to match one or more of char a or just one of char b

I am taking user input through a UI, and I have to validate it. The input text should obey the following condition:
It should either end with one or more whitespace characters OR with just a single '='
I can use
".*[\s=]+"
but it also matches multiple '=', which I don't want.
Please help.
You can use alternation:
(\s+|=)$
This expression means match one or more whitespace character or one equals, at the end of the string. The $ is an anchor which matches the end of the string (as you mentioned you're looking for characters at the end of the string).
(As tchrist correctly pointed out in the comments, $ matches the end of line instead of end of string when in multiline mode. If this is true in your case, and you are indeed looking for the end of the string instead of the end of the line, you can use \Z instead, which matches the end of the string regardless of multiline mode.)
If you want to ensure that there is only one = at the end, you can use a lookaround (in this case, a negative lookbehind, specifically). A lookaround is a zero-width assertion which tells the regex engine that the assertion must pass for the pattern to match, but it does not consume any characters.
(\s+|(?<!=)=)$
In this case, (?<!=) tells the regex engine, the character before the current position cannot be an =. When put into the expression, (?<!=)= means that the = will only match if the previous character is not also a =.
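A small sketch of the lookbehind version as a Java validation check. The class name, the method name and the sample inputs are assumptions for illustration. It uses find() rather than matches() because the pattern only describes the end of the input.

```java
import java.util.regex.Pattern;

class EndingValidator {
    // Trailing whitespace, or a single '=' that is not preceded by another '='.
    private static final Pattern VALID_ENDING = Pattern.compile("(\\s+|(?<!=)=)$");

    static boolean hasValidEnding(String input) {
        return VALID_ENDING.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] samples = { "key=", "key==", "key  ", "key" };
        for (String sample : samples) {
            System.out.println("[" + sample + "] -> " + hasValidEnding(sample));
        }
    }
}
```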
Begin string
Anything not "=" ( to avoid the double "==")
One or more blank spaces OR one "="
End of string
^([^=]*(?:\s+|=))$
Should work :-)
Try this expression:
".*(\\s+|=)"
