Why does my regular expression match but not capture a group? - java

I am trying to extract the information from the following string:
//YES: We got a match.
I want to extract the information by defining two groups:
everything between // and :
everything after the :
The pattern matches correctly but I cannot extract the groups.
String example = "//YES: We got a match.";
String COMMENT_PATTERN = "//(\\w+):(.*)";
Pattern pattern = Pattern.compile(COMMENT_PATTERN);
example.matches(COMMENT_PATTERN); // true
Matcher matcher = pattern.matcher(example);
matcher.group(1); // raises an exception
I tried it as well with named groups:
String COMMENT_PATTERN = "//(?<init>\\w+):(?<rest>.*)";
...
matcher.group("init"); // raises an exception
Why can't my patterns extract the specified groups?

You have to call either find() or matches() on the matcher itself, so that it runs the matching process, before you can extract groups. The call
example.matches(COMMENT_PATTERN);
creates its own internal Matcher, calls matches() and then discards the Matcher - it's equivalent to
Pattern.compile(COMMENT_PATTERN).matcher(example).matches()
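As a minimal sketch of the point above, the following runs matches() on the same Matcher that is later asked for its groups. The class name CommentGroups and the helper parse are hypothetical names chosen for illustration; the pattern is the named-group variant from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentGroups {
    // Pattern from the question, with named groups "init" and "rest".
    private static final Pattern COMMENT =
        Pattern.compile("//(?<init>\\w+):(?<rest>.*)");

    // Hypothetical helper: returns {init, rest}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher matcher = COMMENT.matcher(line);
        // matches() must succeed on THIS matcher before group() may be called.
        if (!matcher.matches()) {
            return null;
        }
        return new String[] { matcher.group("init"), matcher.group("rest") };
    }

    public static void main(String[] args) {
        String[] groups = parse("//YES: We got a match.");
        System.out.println(groups[0]); // YES
        System.out.println(groups[1]); // " We got a match." (leading space kept)
    }
}
```

Note that the second group keeps the space after the colon; call trim() on it if that is not wanted.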

Related

Regex match only if text contains something before

Given the following text
KEYWORD This is a test
We want to match the following groups 1:YES 2:YES 3:YES
I want to match with "1:YES", "2:YES" and "3:YES" using
((\d):YES)
If and only if the first word in the complete text is "KEYWORD"
Given this test:
This is a test
We want to match the following groups 1:YES 2:YES 3:YES
No matches should be found
Java (as with most regex engines) doesn't support unbounded length look behinds, however there is a work-around!
String str = "KEYWORD This is a test\n" +
"We want to match the following groups 1:YES 2:YES 3:YES";
Matcher matcher = Pattern.compile("(?s)(?<=\\AKEYWORD\\b.{1,99999})(\\d+:YES)").matcher(str);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
Which outputs:
1:YES
2:YES
3:YES
The trick here is the lookbehind (?<=\\AKEYWORD\\b.{1,99999}), which has a large (but not unbounded) length. (?s) sets the DOTALL flag (dot matches newlines too), and \A means start of input. \A is used rather than ^ because ^ also matches at the start of every line when the MULTILINE flag is set, whereas \A always matches only at the very beginning of the input.
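The behaviour of the two anchors under the different flags can be checked with a small sketch. The class name AnchorFlags, the helper found, and the sample text are hypothetical; only standard java.util.regex calls are used.

```java
import java.util.regex.Pattern;

public class AnchorFlags {
    // Hypothetical helper: true if the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // KEYWORD starts the second line, not the input.
        String text = "first line\nKEYWORD second line";

        System.out.println(found("^KEYWORD", text));        // false: ^ is start of input by default
        System.out.println(found("(?s)^KEYWORD", text));    // false: DOTALL only changes what . matches
        System.out.println(found("(?m)^KEYWORD", text));    // true: MULTILINE lets ^ match after a newline
        System.out.println(found("(?m)\\AKEYWORD", text));  // false: \A is always start of input
    }
}
```

Using \A keeps the "first word of the complete text" requirement intact even if MULTILINE is added to the pattern later.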
Without tricking lookbehinds in Java, you can capture the \d+:YES\b strings using \G. \G makes a match start where the previous match ended; on the first attempt it matches the beginning of the string, the same as \A.
We need its first capability:
(?:\AKEYWORD|\G(?!\A))[\s\S]*?(\d+:YES\b)
Breakdown:
(?: Start of non-capturing group
\A Match beginning of subject string
KEYWORD Match keyword
| Or
\G(?!\A) Continue from where previous match ends
) End of NCG
[\s\S]*? Match anything else un-greedily
(\d+:YES\b) Match and capture our desired part
Java code:
Pattern p = Pattern.compile("(?:\\AKEYWORD|\\G(?!\\A))[\\s\\S]*?(\\d+:YES\\b)");
Matcher m = p.matcher(string);
while (m.find()) {
System.out.println(m.group(1));
}
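To see both cases from the question side by side, here is a self-contained sketch of the \G approach. The class name KeywordGuard and the helper extract are hypothetical; the pattern is the one from the Java code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordGuard {
    private static final Pattern P =
        Pattern.compile("(?:\\AKEYWORD|\\G(?!\\A))[\\s\\S]*?(\\d+:YES\\b)");

    // Hypothetical helper: collects every captured n:YES token.
    static List<String> extract(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String tail = "We want to match the following groups 1:YES 2:YES 3:YES";
        // Starts with KEYWORD: all three tokens are found.
        System.out.println(extract("KEYWORD This is a test\n" + tail)); // [1:YES, 2:YES, 3:YES]
        // No KEYWORD at the start: the first alternative never matches, so \G never chains.
        System.out.println(extract("This is a test\n" + tail));         // []
    }
}
```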

Capturing repeating groups

First of all, let me warn you that I am new to regex and that my English isn't the best...
I am trying to capture repeating groups, such as the optional headers in the HTTP protocol.
Given a string, I need to get all the headers (none or many):
GET /RESOURCE/RES1 H1:value H2:value H3:value
So what I've tried is something like:
GET /RESOURCE/([^/\s]*)(\s[a-zA-Z:/|-]*)+
But all that I get is:
Group 1 = LS
Group 2 = H3:value
What am I doing wrong?
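Java keeps only the last repetition of a repeated capture group, so you can't capture each repetition individually in a single match. You can get a similar result with the \G anchor, matching one token per find() call: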
(?:\G(?!\A)|GET /RESOURCE/)(\S+)(?: |$)
Example:
String s = "GET /RESOURCE/RES1 H1:value H2:value H3:value";
Pattern p = Pattern.compile("(?:\\G(?!\\A)|GET /RESOURCE/)(\\S+)(?: |$)");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group(1));
}
Output
RES1
H1:value
H2:value
H3:value
Alternatively, you can add another capture group that wraps the last capture group together with its + quantifier. The inner group can then be made non-capturing:
GET /RESOURCE/([^/\s]*)((?:\s[a-zA-Z:/|-]*)+)
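Now capture group 2 gives you the whole run of headers (including the leading whitespace):
H1:value H2:value H3:value
You can get the individual headers by splitting it on whitespace, and then splitting each header on the colon. Note that the character class above contains no digits, so add 0-9 to it if header names such as H1 must match.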
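The split step could look like the sketch below. It is an illustration under assumptions, not the answerer's code: the class name HeaderSplit and the helper headers are hypothetical, and 0-9 has been added to the character class so that header names such as H1 match the sample string.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderSplit {
    // Assumption: 0-9 added to the original character class so "H1" matches.
    private static final Pattern REQUEST =
        Pattern.compile("GET /RESOURCE/([^/\\s]*)((?:\\s[a-zA-Z0-9:/|-]*)+)");

    // Hypothetical helper: header name -> value, in order of appearance.
    static Map<String, String> headers(String request) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = REQUEST.matcher(request);
        if (!m.find()) {
            return result; // no headers present, or not a matching request line
        }
        // Group 2 holds all headers as one string; split on whitespace first.
        for (String header : m.group(2).trim().split("\\s+")) {
            // Then split each header on the first colon only.
            String[] parts = header.split(":", 2);
            if (parts.length == 2) {
                result.put(parts[0], parts[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(headers("GET /RESOURCE/RES1 H1:value H2:value H3:value"));
        // {H1=value, H2=value, H3=value}
    }
}
```

Splitting with a limit of 2 keeps any further colons inside the value.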

Change group using regex java

I need help with regular expressions in Java.
I need to replace a group in a string:
Example:
Input:
=sum($var1;2) or =if($result<10;"little";"big") ...
Need Output:
=sum(teste;2) or =if(teste<10;"little";"big") ...
Code I have:
Pattern p = Pattern.compile("(\\.*)(\\$\\w)(\\.*)");
Matcher m = p.matcher(total);
if (m.find()) {
System.out.println(m.replaceAll("$2teste"));
}
Output I have:
=sum($vtestear1;2)
=if($r testeesultado<5;"maior";"menor")
Why match everything when all you need is to match variable tokens?
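// No \b before the dollar sign: '$' is not a word character, so \b cannot match between '(' and '$'.
Pattern p = Pattern.compile("\\$[a-z0-9]+\\b");
String result = p.matcher(total).replaceAll("teste"); // replaceAll() returns a new string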
Change the [a-z0-9] part if you can have more than lowercase ASCII letters and digits.
Also, you don't need to test with .find() before calling .replaceAll(): no match means nothing will be replaced.
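A runnable sketch of the token replacement, under a few assumptions: the class name VariableReplace and the helper replaceVariables are hypothetical, and the pattern has no \b in front of the dollar sign because '$' is not a word character and a word boundary cannot occur between '(' and '$'. Matcher.quoteReplacement is used so that a replacement containing '$' or '\' is taken literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VariableReplace {
    // A literal '$' followed by lowercase ASCII letters or digits.
    private static final Pattern VARIABLE = Pattern.compile("\\$[a-z0-9]+");

    // Hypothetical helper: replaces every $name token with the given text.
    static String replaceVariables(String formula, String replacement) {
        return VARIABLE.matcher(formula)
                       .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String total = "=sum($var1;2) or =if($result<10;\"little\";\"big\")";
        System.out.println(replaceVariables(total, "teste"));
        // =sum(teste;2) or =if(teste<10;"little";"big")
    }
}
```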

Regex matching in online tester but not in JAVA

I'm trying to extract the text BetClic from this string popup_siteinfo(this, '/click/betclic', '373', 'BetClic', '60€');
I wrote a simple regex that works on Regex Tester but that doesn't work on Java.
Here's the regex
'\d+', '(.*?)'
here's Java output
Exception in thread "main" java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:485)
at javaapplication1.JavaApplication1.main(JavaApplication1.java:74)
Java Result: 1
and here's my code
Pattern pattern = Pattern.compile("'\\d+', '(.*?)'");
Matcher matcher = pattern.matcher(onMouseOver);
System.out.print(matcher.group(1));
where the onMouseOver string is popup_siteinfo(this, '/click/betclic', '373', 'BetClic', '60€');
I'm not an expert with regex, but I'm quite sure that mine isn't wrong at all!
Suggestions?
You need to call find() before group(...):
Pattern pattern = Pattern.compile("'\\d+', '(.*?)'");
Matcher matcher = pattern.matcher(onMouseOver);
if(matcher.find()) {
System.out.print(matcher.group(1));
}
else {
System.out.print("no match");
}
You're calling group(1) without first calling a matching operation (such as find()), which is the cause of the IllegalStateException.
If you only need the captured group for a replacement, the explicit find() isn't needed: replaceAll() performs the matching itself, so you can refer to the group as $1 in the replacement string.
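As a sketch of the replacement route: String.replaceAll runs the match internally, so $1 can be used without any explicit find(). The class name ExtractWithReplaceAll and the helper extractName are hypothetical, and the leading and trailing .* are an addition so that the whole input is replaced by the captured group. The euro sign is written as \u20ac only to keep the source ASCII.

```java
public class ExtractWithReplaceAll {
    // Hypothetical helper: replaces the whole input with capture group 1.
    // If the pattern does not match, the input is returned unchanged.
    static String extractName(String onMouseOver) {
        return onMouseOver.replaceAll(".*'\\d+', '(.*?)'.*", "$1");
    }

    public static void main(String[] args) {
        String onMouseOver =
            "popup_siteinfo(this, '/click/betclic', '373', 'BetClic', '60\u20ac');";
        System.out.println(extractName(onMouseOver)); // BetClic
    }
}
```

Because a non-matching input comes back unchanged, this form suits cases where the input is known to match; otherwise the find()/group(1) version makes the "no match" case explicit.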

Negating a Regular Expression for string replacement

I have the following code that can replace the email address in a String in Java:
addressStr.replaceFirst("([a-zA-Z0-9_\\-\\.]+)#((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})", "")
So, a string with John Smith <john#smith.com> would become John Smith <>. How do I negate it so that it will instead replace all that doesn't match the email address and have the final result as just john#smith.com?
I tried to put in the ^ and ?<= at the front but it doesn't work.
Well, it's not the regex you need to change but the calling code. Your regex matches the e-mail address (in a weird way), and replaceFirst() removes it from the string.
So just use
Pattern regex = Pattern.compile("([a-zA-Z0-9_\\-\\.]+)#((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})");
Matcher regexMatcher = regex.matcher(addressStr);
if (regexMatcher.find()) {
address = regexMatcher.group();
}
The complete Java regex for catching e-mails would be as follows:
"(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])+)\\])"
Take a look at https://www.rfc-editor.org/rfc/rfc2822#section-3.4.1 for more info on this.
It is a bit complicated, but it is valid for all known and valid e-mail formats (yours does not allow addresses like bob+bib#gmail.com, which are valid).
For your problem, as stated multiple times, just use find() (stealing Tim Pietzcker's piece of code):
Pattern regex = Pattern.compile("(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])+)\\])");
Matcher regexMatcher = regex.matcher(addressStr);
foundMatch = regexMatcher.find();
You can try:
Matcher mailMatcher = Pattern.compile(regexp).matcher(addressStr);
String mailId = mailMatcher.find() ? mailMatcher.group() : null; // group() is only valid after a successful find()
Idea here is to get the matched string rather than trying to replace everything else with blank. You can extract the pattern into a field if this operation is repetitive.
Just don't replace.... use match(es) instead.
