java.util.regex.Matcher confused group - java

I'm having trouble getting the right group of a regex match. My code boils down to the following:
Pattern fileNamePattern = Pattern.compile("\\w+_\\w+_\\w+_(\\w+)_(\\d*_\\d*)\\.xml");
Matcher fileNameMatcher = fileNamePattern.matcher("test_test_test_test_20110101_0000.xml");
System.out.println(fileNameMatcher.groupCount());
if (fileNameMatcher.matches()) {
for (int i = 0; i < fileNameMatcher.groupCount(); ++i) {
System.out.println(fileNameMatcher.group(i));
}
}
I expect the output to be:
2
test
20110101_0000
However, it's:
2
test_test_test_test_20110101_0000.xml
test
Does anyone have an explanation?

Group(0) is the whole match, and group(1), group(2), ... are the sub-groups matched by the regular expression.
Why do you expect "test" to be contained in your groups? You didn't define a group to match test (your regex contains only the group \d*_\d*).

Group 0 is the whole match. Real groups start with 1, i.e. you need this:
System.out.println(fileNameMatcher.group(i + 1));

group(0) should be the entire match ("test_test_test_test_20110101_0000.xml");
group(1) should be the sole capture group in your regex ("20110101_0000").
This is what I am getting. I am puzzled as to why you'd be getting a different value for group(1).

Actually, your for loop should include groupCount() itself, so use "<=":
for (int i = 0; i <= fileNameMatcher.groupCount(); ++i) {
System.out.println(fileNameMatcher.group(i));
}
Your output will then be:
2
test_test_test_test_20110101_0000.xml
test
20110101_0000
groupCount() does not count group 0, which matches the whole string.
The first group will be "test", as matched by (\w+), and
the second group will be "20110101_0000", as matched by (\d*_\d*).
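The loop bounds described above can be wrapped in a small helper. The following is a minimal sketch, not the asker's code: the class name GroupDump and the method captures are made up for illustration. It collects groups 1 through groupCount() and leaves out group 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
final class GroupDump {

    // Returns the text of capturing groups 1..groupCount(), skipping group 0 (the whole match).
    static List<String> captures(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        if (matcher.matches()) {
            // Note "<=": groupCount() does not count group 0.
            for (int i = 1; i <= matcher.groupCount(); i++) {
                result.add(matcher.group(i));
            }
        }
        return result; // empty if the input does not match
    }

    public static void main(String[] args) {
        Pattern fileNamePattern =
                Pattern.compile("\\w+_\\w+_\\w+_(\\w+)_(\\d*_\\d*)\\.xml");
        // prints [test, 20110101_0000]
        System.out.println(captures(fileNamePattern, "test_test_test_test_20110101_0000.xml"));
    }
}
```

Starting the loop at 1 and ending at groupCount() inclusive yields exactly the two values the question expected.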

Related

Split String at different lengths in Java

I want to split a string after a certain length.
Let's say we have a string of "message"
123456789
Split like this :
"12" "34" "567" "89"
I thought of splitting them into 2 first using
"(?<=\\G.{2})"
regexp, then joining the last two and splitting again into 3. Is there any way to do it in one go with a single regex? Please help me out.
Use ^(.{2})(.{2})(.{3})(.{2}).* (see it in action on regex101) to capture pieces of the specified lengths, then read each group as a separate String:
String input = "123456789";
List<String> output = new ArrayList<>();
Pattern pattern = Pattern.compile("^(.{2})(.{2})(.{3})(.{2}).*");
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
for (int i = 1; i <= matcher.groupCount(); i++) {
output.add(matcher.group(i));
}
}
System.out.println(output);
NOTE: Capturing groups are numbered from 1, because group 0 is the whole match.
And a magnificent piece of sorcery from @YCF_L in the comments:
String pattern = "^(.{2})(.{2})(.{3})(.{2}).*";
String[] vals = "123456789".replaceAll(pattern, "$1-$2-$3-$4").split("-");
The magic here is that replaceAll() lets you refer to the captured groups in the replacement string: use $n (where n is a digit) to refer to captured subsequences. See this stackoverflow question for a better explanation.
NOTE: this assumes that no input string contains -.
If one might, pick another character that cannot appear in any of
your input strings and use that as the delimiter.
You can test this regex on regex101 with the test string 123456789:
^(\d{2})(\d{2})(\d{3})(\d{2})$
output :
Match 1
Full match 0-9 `123456789`
Group 1. 0-2 `12`
Group 2. 2-4 `34`
Group 3. 4-7 `567`
Group 4. 7-9 `89`
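If the lengths vary, the pattern above can be assembled from a list of lengths instead of being written by hand. This is a rough sketch, assuming non-negative lengths and an input at least as long as their sum; FixedWidthSplit and split are illustrative names that do not appear in the answers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
final class FixedWidthSplit {

    // Builds ^(.{a})(.{b})....* from the given lengths and returns the captured pieces.
    static List<String> split(String input, int... lengths) {
        StringBuilder regex = new StringBuilder("^");
        for (int length : lengths) {
            regex.append("(.{").append(length).append("})");
        }
        regex.append(".*"); // tolerate trailing characters, as in the answer above

        List<String> pieces = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex.toString()).matcher(input);
        if (matcher.matches()) {
            // Groups are numbered from 1; group 0 is the whole match.
            for (int i = 1; i <= matcher.groupCount(); i++) {
                pieces.add(matcher.group(i));
            }
        }
        return pieces; // empty if the input is too short to match
    }

    public static void main(String[] args) {
        System.out.println(split("123456789", 2, 2, 3, 2)); // prints [12, 34, 567, 89]
    }
}
```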

Java regex returns full string instead of capture

Java Code:
String imagesArrayResponse = xmlNode.getChildText("files");
Matcher m = Pattern.compile("path\":\"([^\"]*)").matcher(imagesArrayResponse);
while (m.find()) {
String path = m.group(0);
}
String:
[{"path":"upload\/files\/56727570aaa08922_0.png","dir":"files","name":"56727570aaa08922_0","original_name":"56727570aaa08922_0.png"}{"path":"upload\/files\/56727570aaa08922_0.png","dir":"files","name":"56727570aaa08922_0","original_name":"56727570aaa08922_0.png"}{"path":"upload\/files\/56727570aaa08922_0.png","dir":"files","name":"56727570aaa08922_0","original_name":"56727570aaa08922_0.png"}{"path":"upload\/files\/56727570aaa08922_0.png","dir":"files","name":"56727570aaa08922_0","original_name":"56727570aaa08922_0.png"}]
m.group returns
path":"upload\/files\/56727570aaa08922_0.png"
instead of captured value of path. Where I am wrong?
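See the documentation of the group(int index) method.
When called with 0, it returns the entire match, not just the captured part. Group 1 is the first capturing group.
To avoid this trap, you can use a named group (available since Java 7) and read it with m.group("mynamegroup"). The syntax is: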
"path\":\"(?<mynamegroup>[^\"]*)"
javadoc:
Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group().
m.group(1) will give you the captured text. If the pattern contains more than one pair of capturing parentheses (), the others are m.group(2), m.group(3), ...
By convention, AFAIK, in regex engines the 0th group is always the whole matched string. Capturing groups start at 1.
Check out the grouping options in Matcher.
Matcher m =
    Pattern.compile(
      // <------(0)------>  that's group(0)
      //          <-(1)-->  that's group(1)
        "path\":\"([^\"]*)").matcher(imagesArrayResponse);
Change your code to
while (m.find()) {
String path = m.group(1);
}
And you should be okay. This is also worth checking out: What is a non-capturing group? What does a question mark followed by a colon (?:) mean?
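The named-group suggestion made earlier in this thread can be tried as follows. This is a sketch that assumes Java 7 or later, since named groups are not available before that; the group name path, the class PathExtractor, and the shortened sample input are chosen for the example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
final class PathExtractor {

    // (?<path>...) names the capturing group so it can be read without counting indices.
    private static final Pattern PATH = Pattern.compile("path\":\"(?<path>[^\"]*)");

    static List<String> paths(String json) {
        List<String> result = new ArrayList<>();
        Matcher m = PATH.matcher(json);
        while (m.find()) {
            result.add(m.group("path")); // same value as m.group(1)
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the response string in the question.
        String sample = "[{\"path\":\"upload\\/files\\/a_0.png\",\"dir\":\"files\"}"
                + "{\"path\":\"upload\\/files\\/b_0.png\",\"dir\":\"files\"}]";
        System.out.println(paths(sample));
    }
}
```

Reading the group by name makes it harder to confuse the capture with group 0.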

Second capturing group not capturing

In Java, I've been trying to parse a log file using regex. Below is one line of the log file.
I 20151007 090137 - com.example.Main - Main.doStuff (293): ##identifier (id:21): {};
I need the json string at the end of the line, and the id. Which means I need two capturing groups. So I started coding.
Pattern p = Pattern.compile(
"^I [0-9]{8} [0-9]{6} - com\\.example\\.Main - Main\\.doStuff \\(\\d+\\): ##identifier \\(id:(\\d+)\\): (.*?);$"
);
The (.*?) at the end of the pattern is there to capture the rest of the line while giving back the ; at the very end of the input line.
Matcher m = p.matcher(readAboveLogfileLineToString());
System.err.println(m.matches() + ", " + m.groupCount());
for (int i = 0; i < m.groupCount(); i++) {
System.out.println(m.group(i));
}
However, above code outputs
true, 2
I 20151007 090137 - com.example.Main - Main.doStuff (293): ##identifier (id:21): {};
21
But where's my "rest" group? And why is the entire line a group? I've checked multiple online regex test sites, and it should work: http://www.regexplanet.com/advanced/java/index.html for example sees 3 capturing groups. Maybe it's to do with the fact that I'm currently using jdk 1.6?
The problem is that the groupCount iteration is one of the few cases in Java where you actually need to reach the count value to get all groups.
In this case, you need to iterate to group 2, since group 0 actually represents the whole match.
Just increment your counter as such (notice the <= instead of just <):
for (int i = 0; i <= m.groupCount(); i++) {
The last text printed should be: {}
You can also skip group 0 and start your count at 1 directly, of course.
To summarize, the explicit groups marked in the Pattern with parentheses start from index 1.
See documentation here.
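Since the pattern has exactly two capturing groups, they can also be read directly by index rather than in a loop. Here is a minimal sketch using the log line and pattern from the question; the class name LogLineDemo and the method parse are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
final class LogLineDemo {

    // Same pattern as in the question, split across two literals for readability.
    private static final Pattern LINE = Pattern.compile(
            "^I [0-9]{8} [0-9]{6} - com\\.example\\.Main - Main\\.doStuff \\(\\d+\\): "
            + "##identifier \\(id:(\\d+)\\): (.*?);$");

    // Returns {id, json}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // Group 0 would be the whole line; the two explicit groups are 1 and 2.
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String line = "I 20151007 090137 - com.example.Main - Main.doStuff (293): "
                + "##identifier (id:21): {};";
        String[] parts = parse(line);
        System.out.println("id = " + parts[0]);   // prints id = 21
        System.out.println("json = " + parts[1]); // prints json = {}
    }
}
```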

I can't get the first group of regex pattern in java

I'm trying to get the first group of a regex pattern.
I got this string from a lyric text:
[01:34][01:36]Blablablahh nanana
I'm using this regex pattern to extract [01:34], [01:36] and the text.
Pattern timeLine = Pattern.compile("(\\[\\d\\d:\\d\\d\\])+(.*)");
But when I try to extract the first group [01:34] using group(1), it returns [01:36].
is there something wrong in the regex pattern?
Your problem is here
Pattern.compile("(\\[\\d\\d:\\d\\d\\])+(.*)");
                                      ^
This part of your pattern (\\[\\d\\d:\\d\\d\\])+ will match [01:34][01:36] because of + (which is greedy), but your group 1 can contain only one of [dd:dd] so it will store the last match found.
If you want to find only [01:34], you can correct your pattern by removing the +. But you can also create a simpler pattern
Pattern.compile("^\\[\\d\\d:\\d\\d\\]");
and use it with group(0) which is also called by group().
Pattern timeLine = Pattern.compile("^\\[\\d\\d:\\d\\d\\]");
Matcher m = timeLine.matcher("[01:34][01:36]Blablablahh nanana");
while (m.find()) {
System.out.println(m.group()); // prints [01:34]
}
In case you want to extract both [01:34][01:36], you can just add another pair of parentheses to your current regex, like
Pattern.compile("((\\[\\d\\d:\\d\\d\\])+)(.*)");
This way entire match of (\\[\\d\\d:\\d\\d\\])+ will be in group 1.
You can also achieve it by removing (.*) from your original pattern and reading group 0.
I think you are confused by the repeating match (\\[\\d\\d:\\d\\d\\])+, which returns just the last match as the group value. Try the following and see if it makes more sense to you:
String s = "[01:34][01:36]Blablablahh nanana";
Pattern timeLine = Pattern.compile("(\\[\\d\\d:\\d\\d\\])(\\[\\d\\d:\\d\\d\\])(.+)");
Matcher m = timeLine.matcher(s);
if (m.matches()) {
for (int i = 1; i <= m.groupCount(); i++) {
System.out.printf(" Group %d -> %s\n", i, m.group(i)); // prints each group in turn
}
}
which for me returns:
Group 1 -> [01:34]
Group 2 -> [01:36]
Group 3 -> Blablablahh nanana
I would simply grab the first part using a character class:
String timings = str.replaceAll("([\\[\\]\\d:]+).*", "$1");
And similarly the text:
String text = str.replaceFirst("^[\\[\\]\\d:]+", ""); // strip only the leading tags, not digits or colons inside the lyric
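Because a repeated group only keeps its last iteration, another way to get every timestamp is to match one tag at a time with find(). The sketch below is one possible approach and is not taken from the answers; the names LyricLine and timestamps, and the use of \G to stay within the leading run of tags, are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
final class LyricLine {

    // \G anchors each match at the end of the previous one (or at the start of the input),
    // so only the leading run of [mm:ss] tags is read.
    private static final Pattern TAG = Pattern.compile("\\G\\[(\\d\\d:\\d\\d)\\]");

    static List<String> timestamps(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            result.add(m.group(1)); // the time without the brackets
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [01:34, 01:36]
        System.out.println(timestamps("[01:34][01:36]Blablablahh nanana"));
    }
}
```

Each call to find() produces a fresh match, so group(1) holds a different tag on every iteration instead of being overwritten.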

Regex fails to capture all groups

Using java.util.regex (jdk 1.6), the regular expression 201210(\d{5,5})Test applied to the subject string 20121000002Test only captures group(0) and does not capture group(1) (the pattern 00002) as it should, given the code below:
Pattern p1 = Pattern.compile("201210(\\d{5,5})Test");
Matcher m1 = p1.matcher("20121000002Test");
if(m1.find()){
for(int i = 1; i<m1.groupCount(); i++){
System.out.println("number = "+m1.group(i));
}
}
Curiously, another similar regular expression like 201210(\d{5,5})Test(\d{1,10}) applied to the subject string 20121000002Test0000000099 captures group 0 and 1 but not group 2.
On the contrary, by using JavaScript's RegExp object, the exact same regular expressions applied to the exact same subject strings captures all groups, as one could expect. I checked and re-checked this fact on my own by using these online testers:
http://www.regular-expressions.info/javascriptexample.html
http://www.regextester.com/
Am I doing something wrong here? Or is it that Java's regex library really sucks?
m1.groupCount() returns the number of capturing groups, i.e. 1 in your first case, so the condition i<m1.groupCount() is false from the start and the body of for(int i = 1; i<m1.groupCount(); i++) never runs.
It should be for(int i = 1; i<=m1.groupCount(); i++)
Change the line
for(int i = 1; i<m1.groupCount(); i++){
to
for(int i = 1; i<=m1.groupCount(); i++){ //NOTE THE = ADDED HERE
It now works like a charm!
From java.util.regex.MatchResult.groupCount:
Group zero denotes the entire pattern by convention. It is not included in this count.
So iterate up to and including groupCount(), i.e. over groupCount() + 1 indices if you start at group 0.
the regular expression "201210(\d{5,5})Test" applied to the subject string "20121000002Test" only captures group(0) and does not capture group(1)
Well, I can say I didn't read the manual either, but if you do, this is what it says for Matcher.groupCount():
Returns the number of capturing groups in this matcher's pattern.
Group zero denotes the entire pattern by convention. It is not included in this count.
for (int i = 1; i <= m1.groupCount(); i++) {
                  ↑
             your problem
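To see what groupCount() reports for the two patterns in the question, a small check like the following can help. It is only a sketch; the class name GroupCountDemo and the method lastGroup are made up here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
final class GroupCountDemo {

    // Returns the highest-numbered group of the first match, or null if nothing matches.
    static String lastGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        // groupCount() depends only on the pattern, so it can be called before find().
        int count = m.groupCount();
        return m.find() ? m.group(count) : null;
    }

    public static void main(String[] args) {
        Matcher m1 = Pattern.compile("201210(\\d{5,5})Test").matcher("20121000002Test");
        System.out.println(m1.groupCount()); // prints 1: group 0 is not counted
        System.out.println(lastGroup("201210(\\d{5,5})Test", "20121000002Test")); // prints 00002
        System.out.println(lastGroup("201210(\\d{5,5})Test(\\d{1,10})",
                "20121000002Test0000000099")); // prints 0000000099
    }
}
```

Since groupCount() is also the index of the last group, a loop that stops at i < groupCount() always misses that last group.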
