Capturing repeating groups - java

First of all, let me warn you that I am new to regex and that my English isn't the best...
I am trying to capture repeating groups, such as the optional headers in the HTTP protocol.
Given a string, I need to get all of its headers (zero or more):
GET /RESOURCE/RES1 H1:value H2:value H3:value
So what I've tried is something like:
GET /RESOURCE/([^/\s]*)(\s[a-zA-Z0-9:/|-]*)+
But all that I get is:
Group 1 = RES1
Group 2 = H3:value
What am I doing wrong?

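Java keeps only the last iteration of a repeated capturing group, so you can't capture each repetition individually. You can get a similar result with the \G anchor, which continues matching where the previous match ended: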
(?:\G(?!\A)|GET /RESOURCE/)(\S+)(?: |$)
Example:
String s = "GET /RESOURCE/RES1 H1:value H2:value H3:value";
Pattern p = Pattern.compile("(?:\\G(?!\\A)|GET /RESOURCE/)(\\S+)(?: |$)");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group(1));
}
Output
RES1
H1:value
H2:value
H3:value

You can wrap the repeated group and its + quantifier in another capturing group, and make the inner group non-capturing:
GET /RESOURCE/([^/\s]*)((?:\s[a-zA-Z0-9:/|-]*)+)
Now capture group 2 will give you the following result (with a leading space):
H1:value H2:value H3:value
You can get the individual headers from it by splitting on whitespace, and then splitting each header on the colon.
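The capture-then-split approach described above could look roughly like this. It is a minimal sketch, not a full HTTP parser: the class name HeaderSplitDemo and the method parseHeaders are hypothetical, and the character class includes 0-9 (an assumption, so that header names such as H1 match).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderSplitDemo {
    // Group 1: resource name. Group 2: the whole run of headers, leading space included.
    private static final Pattern REQUEST =
            Pattern.compile("GET /RESOURCE/([^/\\s]*)((?:\\s[a-zA-Z0-9:/|-]*)+)");

    // Hypothetical helper: returns the headers in order, or an empty map if none are found.
    static Map<String, String> parseHeaders(String request) {
        Map<String, String> headers = new LinkedHashMap<>();
        Matcher m = REQUEST.matcher(request);
        if (!m.find()) {
            return headers;
        }
        // Split the captured run on whitespace, then each header on its first colon.
        for (String pair : m.group(2).trim().split("\\s+")) {
            if (pair.isEmpty()) {
                continue;
            }
            String[] kv = pair.split(":", 2);
            headers.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return headers;
    }

    public static void main(String[] args) {
        System.out.println(parseHeaders("GET /RESOURCE/RES1 H1:value H2:value H3:value"));
    }
}
```

A request without any headers simply yields an empty map, because group 2 requires at least one whitespace-prefixed token.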

Related

Regex match only if text contains something before

Given the following text
KEYWORD This is a test
We want to match the following groups 1:YES 2:YES 3:YES
I want to match with "1:YES", "2:YES" and "3:YES" using
((\d):YES)
If and only if the first word in the complete text is "KEYWORD"
Given this test:
This is a test
We want to match the following groups 1:YES 2:YES 3:YES
No matches should be found
Java (as with most regex engines) doesn't support unbounded length look behinds, however there is a work-around!
String str = "KEYWORD This is a test\n" +
        "We want to match the following groups 1:YES 2:YES 3:YES";
Matcher matcher = Pattern.compile("(?s)(?<=\\AKEYWORD\\b.{1,99999})(\\d+:YES)").matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
Which outputs:
1:YES
2:YES
3:YES
The trick here is the lookbehind (?<=\\AKEYWORD\\b.{1,99999}), which has a large (but bounded) maximum length. (?s) sets the DOTALL flag (dot matches newlines too), and \A matches only at the start of the input, unlike ^, which would also match at the start of each line if the MULTILINE flag were set.
Without resorting to lookbehind tricks in Java, you can capture the \d+:YES\b strings using \G. \G makes a match start where the previous match ended or, on the first attempt, at the beginning of the string, the same as \A.
We need only its first capability:
(?:\AKEYWORD|\G(?!\A))[\s\S]*?(\d+:YES\b)
Breakdown:
(?: Start of non-capturing group
\A Match beginning of subject string
KEYWORD Match keyword
| Or
\G(?!\A) Continue from where previous match ends
) End of NCG
[\s\S]*? Match anything else un-greedily
(\d+:YES\b) Match and capture our desired part
Java code:
Pattern p = Pattern.compile("(?:\\AKEYWORD|\\G(?!\\A))[\\s\\S]*?(\\d+:YES\\b)");
Matcher m = p.matcher(string);
while (m.find()) {
    System.out.println(m.group(1));
}
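To see the gating behaviour on both inputs from the question, the \G-based pattern can be wrapped in a small helper. This is a minimal sketch; the class name KeywordGateDemo and the method findFlags are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordGateDemo {
    // First match must start at \AKEYWORD; later matches continue from the previous one via \G.
    private static final Pattern GATED =
            Pattern.compile("(?:\\AKEYWORD|\\G(?!\\A))[\\s\\S]*?(\\d+:YES\\b)");

    // Hypothetical helper: collects every N:YES token, but only if the text starts with KEYWORD.
    static List<String> findFlags(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = GATED.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String tail = "This is a test\nWe want to match the following groups 1:YES 2:YES 3:YES";
        System.out.println(findFlags("KEYWORD " + tail)); // text starts with KEYWORD
        System.out.println(findFlags(tail));              // text does not start with KEYWORD
    }
}
```

Without the leading KEYWORD, the first alternative never matches and \G(?!\A) is rejected at position 0, so the chain of matches never starts and the list stays empty.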

Regex starts with "ATG" ends with "TAG, TAA or TGA" but does not contain "ATG" and "TAG, TAA or TGA" in between

I'm searching a String for patterns that start with ATG, end with TAG, TAA or TGA, and have a length that is a multiple of 3. ATG may appear only at the beginning, and TAG, TAA or TGA only at the end. That means:
From ATGTTGTGATGT extract ATGTTGTGA
From ATGATGTTGTGATGT extract ATGTTGTGA
Currently I'm using regex (ATG)([ATG]{3})+?(TAG|TAA|TGA).
For ATGATGTTGTGATGT this gets me the wrong result ATGATGTTGTGA.
I've tried:
(^ATG)(!?=.*ATG)([ATG]{3})+?(TAG|TAA|TGA)
(^ATG)(!?=(ATG)+)([ATG]{3})+?(TAG|TAA|TGA)
How to tell it to contain ATG only once in the beginning and no more after that?
You may use
ATG(?:(?!ATG)[ATG]{3})*?(?:TAG|TAA|TGA)
Details
ATG - an ATG substring
(?:(?!ATG)[ATG]{3})*? - a tempered greedy token matching any sequence of 3 chars from the [ATG] character set that is not equal to ATG (that is restricted with the negative lookahead (?!ATG))
(?:TAG|TAA|TGA) - either of the three alternatives defined in the non-capturing group: TAG, TAA or TGA.
Java demo:
String rx = "ATG(?:(?!ATG)[ATG]{3})*?(?:TAG|TAA|TGA)";
String s = "ATGTTGTGATGT, ATGATGTTGTGATGT, ATGATGTTGTGATGT";
Pattern pattern = Pattern.compile(rx);
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}
Result:
ATGTTGTGA
ATGTTGTGA
ATGTTGTGA

Why my regular expression matches but does not capture a group?

I am trying to extract the information from the following string:
//YES: We got a match.
I want to extract the information by defining two groups:
Everything between // and :
Everything after :
The pattern matches correctly but I cannot extract the groups.
String example = "//YES: We got a match.";
String COMMENT_PATTERN = "//(\\w+):(.*)";
Pattern pattern = Pattern.compile(COMMENT_PATTERN);
example.matches(COMMENT_PATTERN); // true
Matcher matcher = pattern.matcher(example);
matcher.group(1); // raises an exception
I tried it as well with named groups:
String COMMENT_PATTERN = "//(?<init>\\w+):(?<rest>.*)";
...
matcher.group("init"); // raises an exception
Why can't my patterns extract the specified groups?
You have to call either find() or matches() on the matcher to make it run the matching process before you can extract groups. The call
example.matches(COMMENT_PATTERN);
creates its own internal Matcher, calls matches() on it, and then discards it. It is equivalent to
Pattern.compile(COMMENT_PATTERN).matcher(example).matches()
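Put together, a corrected version of the question's snippet might look like the sketch below: the match is run on the same Matcher that the groups are read from. The class name GroupExtractDemo and the method splitComment are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupExtractDemo {
    private static final Pattern COMMENT = Pattern.compile("//(?<init>\\w+):(?<rest>.*)");

    // Hypothetical helper: returns {init, rest}, or an empty Optional if the line does not match.
    static Optional<String[]> splitComment(String line) {
        Matcher matcher = COMMENT.matcher(line);
        // matches() must succeed on THIS matcher before group() may be called.
        if (!matcher.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { matcher.group("init"), matcher.group("rest") });
    }

    public static void main(String[] args) {
        String[] parts = splitComment("//YES: We got a match.").orElseThrow(IllegalStateException::new);
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```

Calling group() on a Matcher that has not yet attempted a match throws IllegalStateException, which is the exception seen in the question.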

Change group using regex java

I need help with a regular expression in Java.
I need to replace a group in a string:
Example:
Input:
=sum($var1;2) or =if($result<10;"little";"big") ...
Need Output:
=sum(teste;2) or =if(teste<10;"little";"big") ...
Code I have:
Pattern p = Pattern.compile("(\\.*)(\\$\\w)(\\.*)");
Matcher m = p.matcher(total);
if (m.find()) {
    System.out.println(m.replaceAll("$2teste"));
}
Output I have:
=sum($vtestear1;2)
=if($r testeesultado<5;"maior";"menor")
Why match everything when all you need is to match the variable tokens?
Pattern p = Pattern.compile("\\$[a-z0-9]+\\b");
String result = p.matcher(total).replaceAll("teste");
Note that there is no \b before \$: ( and $ are both non-word characters, so there is no word boundary between them. Change the [a-z0-9] part if your variable names can contain more than lowercase ASCII letters and digits.
Also, you don't need to test for .find() or anything else before calling .replaceAll(): no match means nothing will be replaced.
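As a self-contained illustration of the token-only replacement, here is a minimal sketch run against the question's input. The class name VariableReplaceDemo and the method replaceVariables are hypothetical, and \w+ is used for the variable name as an assumption about what names may contain. The pattern has no \b in front of \$, since ( and $ are both non-word characters and no word boundary exists between them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VariableReplaceDemo {
    // A literal dollar sign followed by one or more word characters.
    private static final Pattern VARIABLE = Pattern.compile("\\$\\w+");

    // Hypothetical helper: replaces every $name token with the given text.
    static String replaceVariables(String formula, String replacement) {
        // quoteReplacement keeps any '$' or '\' in the replacement from being read as a group reference.
        return VARIABLE.matcher(formula).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String total = "=sum($var1;2) or =if($result<10;\"little\";\"big\")";
        System.out.println(replaceVariables(total, "teste"));
    }
}
```

Strings are immutable in Java, so the value returned by replaceAll() has to be used or assigned; the original string is never modified.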

Regex for removing part of a line if it is preceded by some word in Java

There's a properties language bundle file:
label.username=Username:
label.tooltip_html=Please enter your username.</center></html>
label.password=Password:
label.tooltip_html=Please enter your password.</center></html>
How can I match all lines that have both "_html" and "</center></html>", in that order, and replace each of them with the same line minus the ending "</center></html>"? For example, the line:
label.tooltip_html=Please enter your username.</center></html>
should become:
label.tooltip_html=Please enter your username.
Note: I would like to do this replacement using an IDE (IntelliJ IDEA, Eclipse, NetBeans...)
Since you clarified that this regex is to be used in the IDE, I tested this in Eclipse and it works:
FIND:
(_html.*)</center></html>
REPLACE WITH:
$1
Make sure you turn on the Regular expressions switch in the Find/Replace dialog. This will match any string that contains _html.* (where the .* greedily matches any string not containing newlines), followed by </center></html>. It uses (…) brackets to capture what was matched into group 1, and $1 in the replacement substitutes in what group 1 captured.
This effectively removes </center></html> if that string is preceded by _html in that line.
If there can be multiple </center></html> in a line, and they are all to be removed if there's a _html to their left, then the regex will be more complicated, but it can be done in one regex with the \G continuing anchor if absolutely need be.
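One possible shape for the \G variant mentioned above, written in Java rather than in an IDE dialog, is sketched here. It assumes the input is handled one line at a time; the class name MultiStripDemo and the method stripClosers are hypothetical.

```java
import java.util.regex.Pattern;

public class MultiStripDemo {
    // A match starts either at "_html" or exactly where the previous match ended (\G, but not
    // at the very start of the input), then lazily runs up to the next closing pair.
    // Group 1 holds everything except the closing pair itself.
    private static final Pattern CLOSER =
            Pattern.compile("((?:_html|\\G(?!\\A)).*?)</center></html>");

    // Hypothetical helper: removes every </center></html> that has "_html" somewhere to its left.
    static String stripClosers(String line) {
        return CLOSER.matcher(line).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(stripClosers("label.tooltip_html=A</center></html>B</center></html>"));
        System.out.println(stripClosers("label.password=Password:</center></html>"));
    }
}
```

A line with no _html never produces a first match, so \G has nothing to continue from and the line is returned unchanged.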
Variations
Speaking more generally, you can also match things like this:
(delete)this part only(please)
This now creates 2 capturing groups. You can match strings with this pattern and replace with $1$2, and it will effectively delete this part only, but only if it's preceded by delete and followed by please. These subpatterns can be more complicated, of course.
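In Java, the same two-group replacement could be sketched as follows. The class name SandwichDeleteDemo and the method dropMiddle are hypothetical, and the literal words stand in for whatever subpatterns are actually needed.

```java
public class SandwichDeleteDemo {
    // Hypothetical helper: deletes "this part only" when it sits between "delete" and "please".
    static String dropMiddle(String input) {
        // $1 and $2 put the surrounding captures back; the uncaptured middle is dropped.
        return input.replaceAll("(delete)this part only(please)", "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(dropMiddle("deletethis part onlyplease"));
        System.out.println(dropMiddle("keep this part only, thanks"));
    }
}
```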
if (line.contains("_html=")) {
    line = line.replace("</center></html>", "");
}
No regex needed here ;) (Edit: as long as all lines of the properties file are well formed.)
String s = "label.tooltip_html=Please enter your password.</center></html>";
Pattern p = Pattern.compile("(_html.*)</center></html>");
Matcher m = p.matcher(s);
System.out.println(m.replaceAll("$1"));
Try something like this:
Pattern p = Pattern.compile(".*(_html).*</center></html>");
Matcher m = p.matcher(input_line); // get a matcher object
String output = input_line; // stays unchanged unless the whole line matches
if (m.matches()) {
    output = input_line.replace("</center></html>", "");
}
/^(.*)<\/center><\/html>/
captures the
label.tooltip_html=Please enter your username.
part in group 1. Then you can reassemble the line from it.
