Second capturing group not capturing - java

In Java, I've been trying to parse a log file using regex. Below is one line of the log file.
I 20151007 090137 - com.example.Main - Main.doStuff (293): ##identifier (id:21): {};
I need the json string at the end of the line, and the id. Which means I need two capturing groups. So I started coding.
Pattern p = Pattern.compile(
"^I [0-9]{8} [0-9]{6} - com\\.example\\.Main - Main\\.doStuff \\(\\d+\\): ##identifier \\(id:(\\d+)\\): (.*?);$"
);
The (.*?) at the end of the pattern is meant to capture the rest of the line while leaving the ; at the very end of the input line outside the group.
Matcher m = p.matcher(readAboveLogfileLineToString());
System.err.println(m.matches() + ", " + m.groupCount());
for (int i = 0; i < m.groupCount(); i++) {
    System.out.println(m.group(i));
}
However, above code outputs
true, 2
I 20151007 090137 - com.example.Main - Main.doStuff (293): ##identifier (id:21): {};
21
But where's my "rest" group? And why is the entire line a group? I've checked multiple online regex test sites, and it should work: http://www.regexplanet.com/advanced/java/index.html for example sees 3 capturing groups. Maybe it's to do with the fact that I'm currently using jdk 1.6?

The problem is that the groupCount iteration is one of the few cases in Java where you actually need to reach the count value to get all groups.
In this case, you need to iterate to group 2, since group 0 actually represents the whole match.
Just increment your counter as such (notice the <= instead of just <):
for (int i = 0; i <= m.groupCount(); i++) {
The last text printed should be: {}
You can also skip group 0 and start your count at 1 directly, of course.
To summarize, the explicit groups marked in the Pattern with parentheses start from index 1.
See documentation here.
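As a minimal sketch of the corrected loop, the class below collects groups 0 through groupCount() inclusive for the log line from the question. The class name GroupLoopDemo and the helper allGroups are made up for illustration; only the pattern and the sample line come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupLoopDemo {
    // Pattern copied from the question: group 1 is the id, group 2 the JSON payload.
    static final Pattern LOG_LINE = Pattern.compile(
        "^I [0-9]{8} [0-9]{6} - com\\.example\\.Main - Main\\.doStuff \\(\\d+\\): ##identifier \\(id:(\\d+)\\): (.*?);$");

    // Returns groups 0..groupCount() inclusive, or an empty list if the line does not match.
    static List<String> allGroups(Pattern p, String line) {
        List<String> groups = new ArrayList<String>();
        Matcher m = p.matcher(line);
        if (m.matches()) {
            // Note the <=: groupCount() does not count group 0.
            for (int i = 0; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        String line = "I 20151007 090137 - com.example.Main - Main.doStuff (293): ##identifier (id:21): {};";
        for (String g : allGroups(LOG_LINE, line)) {
            System.out.println(g);
        }
    }
}
```

Starting the loop at 1 instead of 0 would skip the whole-line match and leave only the id and the JSON string.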

Related

Java: replace() the n pipe delimiter starting at the end with a regex

I'm trying to replace the first "|", but counting from the end of the string:
usr/bin/pipe|pipe|name|28|-rwxr-xr-x|root:root||46711b361edd4512814d9b367ae765f42a71d729708b3f2e162acb8f64592610|
My file name is pipe|pipe|name and I want my regex to return usr/bin/pipe|pipe|name
I began with this regex: .([^\|]*)$ but I don't know how to go further back through the pipes: https://regex101.com/r/ttbiab/3
And in Java:
String strLine = "usr/bin/pipe|pipe|name|28|-rwxr-xr-x|root:root||46711b361edd4512814d9b367ae765f42a71d729708b3f2e162acb8f64592610|";
strLine = strLine.replaceAll(".([^\\|]*)$", "[:124:]");
System.out.println("strLine : " + strLine);
Based on your comments, it looks like you don't know how many pipes will be in the filename, but you do know how many pipes are in the input string that aren't in the filename. In this case, regex may not be the best approach. There are a couple of different ways to do this, but perhaps one of the easiest to understand and maintain would be to split the String, and then recombine it with replacements where applicable:
String input = "usr/bin/pipe|pipe|name|28|-rwxr-xr-x|root:root||46711b361edd4512814d9b367ae765f42a71d729708b3f2e162acb8f64592610|";
String pipeReplacement = ":124:";
int numberOfPipesToKeep = 7;
String[] split = input.split("\\|");
StringBuilder sb = new StringBuilder();
for (int i = 0; i < split.length; i++) {
    sb.append(split[i]);
    if (i < split.length - numberOfPipesToKeep) {
        sb.append(pipeReplacement);
    } else {
        sb.append("|");
    }
}
String output = sb.toString();
System.out.println(output);
The above sample handles any number of pipes, is pretty configurable, and is (in my opinion) a lot easier to understand and debug than trying to use regular expressions.
You can try something like this: [\|]([^\|]*[\|]){5}$. It matches the sixth pipe from the end, together with the five pipe-terminated fields that follow it up to the end of the string.
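One way to use that expression, sketched under the assumption that the file name is everything before the sixth pipe from the end: find the match and cut the string at its start. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNameBeforePipes {
    // A pipe followed by exactly five more "field|" chunks up to the end of the input.
    static final Pattern SIXTH_PIPE_FROM_END = Pattern.compile("[\\|]([^\\|]*[\\|]){5}$");

    // Returns the text before the sixth pipe from the end, or the input unchanged if there is no match.
    static String fileName(String line) {
        Matcher m = SIXTH_PIPE_FROM_END.matcher(line);
        return m.find() ? line.substring(0, m.start()) : line;
    }

    public static void main(String[] args) {
        String strLine = "usr/bin/pipe|pipe|name|28|-rwxr-xr-x|root:root||46711b361edd4512814d9b367ae765f42a71d729708b3f2e162acb8f64592610|";
        System.out.println(fileName(strLine));
    }
}
```

Once the file name is isolated, any pipes inside it can be replaced with a plain String.replace call on that prefix.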
If there is a fixed number of 6 pipes until the end of the string, and you want to select the individual pipes before that to replace them, you could make use of \G to assert the position at the previous match and use a lookahead to assert that what is on the right is 6 times not a pipe followed by a pipe.
(?:([^\|]*)|\G(?!^))\|([^|]*)(?=(?:[^|]*\|){6})
In Java:
String regex = "(?:([^\\|]*)|\\G(?!^))\\|([^|]*)(?=(?:[^|]*\\|){6})";
(?: Non capturing group
([^\|]*) Capture in group 1 matching not a pipe 0+ times using a negated character class
| OR
\G(?!^) Assert position at the end of the previous match, not at the start
) Close non capturing group
\|([^|]*) Match a pipe and capture in group 2 matching not a pipe 0+ times
(?= Positive lookahead, assert that what is on the right is
(?:[^|]*\|){6} 6 times: not a pipe 0+ times, followed by a pipe
) Close lookahead
If you want to replace the pipe with for example #, then use the 2 capturing groups:
$1#$2
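A small sketch of how the pattern and the $1#$2 replacement might be wired together with replaceAll. The class name, method name, and the use of Matcher.quoteReplacement for the replacement text are illustrative choices, not part of the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipeReplaceDemo {
    // Pattern from the answer: matches a pipe only while at least 6 more pipes follow the next field.
    static final Pattern LEADING_PIPE = Pattern.compile(
        "(?:([^\\|]*)|\\G(?!^))\\|([^|]*)(?=(?:[^|]*\\|){6})");

    // Replaces every pipe that belongs to the file name with the given text.
    static String replaceLeadingPipes(String line, String replacement) {
        // quoteReplacement keeps a "$" or "\" in the replacement text literal.
        return LEADING_PIPE.matcher(line)
                .replaceAll("$1" + Matcher.quoteReplacement(replacement) + "$2");
    }

    public static void main(String[] args) {
        String strLine = "usr/bin/pipe|pipe|name|28|-rwxr-xr-x|root:root||46711b361edd4512814d9b367ae765f42a71d729708b3f2e162acb8f64592610|";
        System.out.println(replaceLeadingPipes(strLine, "#"));
    }
}
```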

Regular Expression in Java. Unexpected behaviour

I am trying to match mostly numbers, but depending on the words that follow the expression I need to treat them differently.
I match every number which is not followed by a temperature term like °C or a time specification.
My Regular Expression looks like this:
(((\d+?)(\s*)(\-)(\s*))?(\d+)(\s*))++(?!minuten|Minuten|min|Min|Stunden|stunden|std|Std|°C| °C)
Here is an Example: http://regexr.com?33jeg
While this behavior is what I expected, Java does the following.
For the match 4, the groups are (index: captured text):
0: "4 "
1: "4 "
2: "0 - "
3: "0"
4: " "
5: "-"
6: " "
7: "4"
8: " "
9: "°C"
You need to know that I match every string separately. So the match for the 5 looks like this:
0: "5 "
1: "5 "
2: "null"
3: "null"
4: "null"
5: "null"
6: "null"
7: "5"
8: " "
9: "null"
This is how I'd like the other match to look. This unpleasant behavior only occurs when a "-" appears somewhere in the string before the match.
My Java Code is the following:
public static void adaptPortionDetails(EList<Step> steps, double multiplicator){
    String portionMatcher = "(((\\d+?)(\\s*)(\\-)(\\s*))?(\\d+)(\\s*))++(?!°C|Grad|minuten|Minuten|min|Min|Stunden|stunden|std|Std)";
    for (int i = 0; i < steps.size(); i++) {
        Matcher matcher = Pattern.compile(portionMatcher).matcher(
                steps.get(i).getDescription());
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            printGroups(matcher);
            String newValue1Str;
            if (matcher.group(3) == null){
                newValue1Str = "";
                System.out.println("test");
            }else{
                double newValue1 = Integer.parseInt(matcher.group(3)) * multiplicator;
                newValue1Str = Fraction.getFraction(newValue1).toProperString();
            }
            double newValue2 = Integer.parseInt(matcher.group(7)) * multiplicator;
            String newValue2Str = Fraction.getFraction(newValue2).toProperString();
            matcher.appendReplacement(sb, newValue1Str + "$4$5$6" + newValue2Str + "$8");
        }
        matcher.appendTail(sb);
        steps.get(i).setDescription(sb.toString());
    }
}
Hope you can tell what I'm missing.
This seems to be a bug (or feature?) in Java's implementation. It doesn't seem to reset the captured text for the capturing group when the matching has to be redone from the next index.
This test reveals the discrepancy in behavior between Java regex engine and PHP's PCRE.
Regex: (\d+(-\d+)?){1}+(?!x)
Input: 34 34-43x 78 90
Java result: 3 matches (34, 78, 90). The 2nd capturing group of the 2nd match is -43. The 2nd capturing group captures nothing for 1st and 3rd match.
PHP result: Also the same 3 matches, but the 2nd capturing group captures nothing for all matches. In PHP's PCRE implementation, when the match has to be redone, the captured text of the capturing groups is reset.
This was tested on JRE 6 Update 37 and JRE 7 Update 11.
Same result for this, just to prove the point that captured text is not reset when matching has to be redone:
Regex: a(\d+(-\d+)?){1}+(?!x)
Input: a34 a34-43x a78 a90
PHP result
Some comment about your regex
I think the ++ should be {1}+, since it seems that you want to modify one number or one range of numbers at a time, while making the match possessive to discard unwanted numbers.
Workaround
The first group (the outermost capturing group), which captures everything (one number or a range of numbers), will always be overwritten when a match is found. Hence you can rely on it. You can check whether there is a - in group 1 (with the contains method). If there is, then you can tell that capturing group 2 contains captured text from the current match, and you can use the captured text. If there is not, then you can ignore all the captured text in capturing group 2 and its nested capturing groups.
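The workaround can be sketched with the simplified test regex from this answer rather than the full recipe pattern. The class name, method name, and the output strings are made up for illustration. The point is only that group 2 is read when group 1 contains a "-" and ignored otherwise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RangeOrSingle {
    // Simplified test regex from the answer: one number or one range, possessive, not followed by x.
    static final Pattern NUMBERS = Pattern.compile("(\\d+(-\\d+)?){1}+(?!x)");

    // Describes each match, trusting group 2 only when group 1 really contains a '-'.
    static List<String> describe(String input) {
        List<String> out = new ArrayList<String>();
        Matcher m = NUMBERS.matcher(input);
        while (m.find()) {
            String whole = m.group(1); // always belongs to the current match
            if (whole.contains("-")) {
                out.add("range " + whole + " (tail " + m.group(2) + ")");
            } else {
                // group(2) may still hold text from an abandoned attempt, so it is not read here.
                out.add("single " + whole);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("34 34-43x 78 90"));
        System.out.println(describe("1-2 5"));
    }
}
```

The same check could be applied in adaptPortionDetails before reading group 3, assuming the outer group there is also always overwritten.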

Regex fails to capture all groups

Using java.util.regex (jdk 1.6), the regular expression 201210(\d{5,5})Test applied to the subject string 20121000002Test only captures group(0) and does not capture group(1) (the pattern 00002) as it should, given the code below:
Pattern p1 = Pattern.compile("201210(\\d{5,5})Test");
Matcher m1 = p1.matcher("20121000002Test");
if(m1.find()){
    for(int i = 1; i<m1.groupCount(); i++){
        System.out.println("number = "+m1.group(i));
    }
}
Curiously, another similar regular expression like 201210(\d{5,5})Test(\d{1,10}) applied to the subject string 20121000002Test0000000099 captures group 0 and 1 but not group 2.
On the contrary, by using JavaScript's RegExp object, the exact same regular expressions applied to the exact same subject strings captures all groups, as one could expect. I checked and re-checked this fact on my own by using these online testers:
http://www.regular-expressions.info/javascriptexample.html
http://www.regextester.com/
Am I doing something wrong here? Or is it that Java's regex library really sucks?
m1.groupCount() returns the number of capturing groups, i.e. 1 in your first case, so you never enter the loop for(int i = 1; i<m1.groupCount(); i++)
It should be for(int i = 1; i<=m1.groupCount(); i++)
Change the line
for(int i = 1; i<m1.groupCount(); i++){
to
for(int i = 1; i<=m1.groupCount(); i++){ //NOTE THE = ADDED HERE
It now works like a charm!
From java.util.regex.MatchResult.groupCount:
Group zero denotes the entire pattern by convention. It is not included in this count.
So iterate up to and including groupCount().
the regular expression "201210(\d{5,5})Test" applied to the subject string "20121000002Test" only captures group(0) and does not capture group(1)
Well, I can say I didn't read the manual either, but if you do, it says this for Matcher.groupCount():
Returns the number of capturing groups in this matcher's pattern.
Group zero denotes the entire pattern by convention. It is not included in this count.
for (int i = 1; i <= m1.groupCount(); i++) { // the "<=" (instead of "<") was your problem
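For the two-group pattern from the question, a helper along these lines returns only the explicit capture groups (1 through groupCount()). The class and method names are hypothetical; the regex and subject string are the ones quoted in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupsOnly {
    // Returns groups 1..groupCount() of the first match, or an empty list if nothing matches.
    static List<String> captures(String regex, String input) {
        List<String> result = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find()) {
            // Start at 1 to skip the whole match, and use <= to reach the last group.
            for (int i = 1; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(captures("201210(\\d{5,5})Test(\\d{1,10})", "20121000002Test0000000099"));
    }
}
```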

Java recursive(?) repeated(?) deep(?) pattern matching

I'm trying to get ALL the substrings in the input string that match the given pattern.
For example,
Given string: aaxxbbaxb
Pattern: a[a-z]{0,3}b
(What I actually want to express is: all the patterns that start with a and end with b, but can have up to 2 letters in between them)
Exact results that I want (with their indexes):
aaxxb: index 0~4
axxb: index 1~4
axxbb: index 1~5
axb: index 6~8
But when I run it through the Pattern and Matcher classes using Pattern.compile() and Matcher.find(), it only gives me:
aaxxb : index 0~4
axb : index 6~8
This is the piece of code I used.
Pattern pattern = Pattern.compile("a[a-z]{0,3}b", Pattern.CASE_INSENSITIVE);
Matcher match = pattern.matcher("aaxxbbaxb");
while (match.find()) {
    System.out.println(match.group());
}
How can I retrieve every single piece of string that matches the pattern?
Of course, it doesn't have to use Pattern and Matcher classes, as long as it's efficient :)
(see: All overlapping substrings matching a java regex )
Here is the full solution that I came up with. It can handle zero-width patterns, boundaries, etc. in the original regular expression. It looks through all substrings of the text string and checks whether the regular expression matches only at the specific position by padding the pattern with the appropriate number of wildcards at the beginning and end. It seems to work for the cases I tried -- although I haven't done extensive testing. It is most certainly less efficient than it could be.
public static void allMatches(String text, String regex)
{
    for (int i = 0; i < text.length(); ++i) {
        for (int j = i + 1; j <= text.length(); ++j) {
            String positionSpecificPattern = "((?<=^.{"+i+"})("+regex+")(?=.{"+(text.length() - j)+"}$))";
            Matcher m = Pattern.compile(positionSpecificPattern).matcher(text);
            if (m.find())
            {
                System.out.println("Match found: \"" + (m.group()) + "\" at position [" + i + ", " + j + ")");
            }
        }
    }
}
You are in effect searching for the strings ab, a_b, and a__b in an input string, where _ denotes a non-whitespace character whose value you do not care about.
That's three search targets. The most efficient way I can think of to do this would be to use a search algorithm like the Knuth-Morris-Pratt algorithm, with a few modifications. In effect your pseudocode would be something like:
for i in 0 to sourcestring.length
check sourcestring[i] - is it a? if so, check sourcestring[i+x]
// where x is the index of the search string - 1
if matches then save i to output list
else i = i + searchstring.length
Obviously, if you have a position match you must then check the inner characters of the substring to make sure they are alphabetical.
Run the algorithm 3 times, once for each search term. It will doubtless be much faster than trying to do the search using pattern matching.
Edit: sorry, I didn't read the question properly. If you have to use regex then the above will not work for you.
One thing you could do is:
Create all possible substrings that are 4 characters or longer (good luck with that if your String is large)
Create a new Matcher for each of these substrings
Do a matches() instead of a find()
Calculate the absolute offset from the substring's relative offset and the matcher info
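The substring idea above can be sketched as follows. Instead of creating a new Matcher per substring, this version reuses one Matcher and restricts it with Matcher.region(start, end), which is a substitution made here for brevity. It also tries every substring length rather than a minimum of 4, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingMatches {
    // Tries every substring [start, end) and keeps those the regex matches in full.
    static List<String> allMatches(String text, String regex) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(text);
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                m.region(start, end); // resets the matcher and limits it to this window
                if (m.matches()) {    // the whole window must match, not just a part of it
                    found.add(text.substring(start, end) + ": index " + start + "~" + (end - 1));
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (String s : allMatches("aaxxbbaxb", "a[a-z]{0,3}b")) {
            System.out.println(s);
        }
    }
}
```

This is quadratic in the length of the input, so it only suits short strings. For the sample input it reports the four substrings listed in the question.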

java.util.regex.Matcher confused group

I'm having trouble getting the right group of a regex match. My code boils down to following:
Pattern fileNamePattern = Pattern.compile("\\w+_\\w+_\\w+_(\\w+)_(\\d*_\\d*)\\.xml");
Matcher fileNameMatcher = fileNamePattern.matcher("test_test_test_test_20110101_0000.xml");
System.out.println(fileNameMatcher.groupCount());
if (fileNameMatcher.matches()) {
    for (int i = 0; i < fileNameMatcher.groupCount(); ++i) {
        System.out.println(fileNameMatcher.group(i));
    }
}
I expect the output to be:
2
test
20110101_0000
However its:
2
test_test_test_test_20110101_0000.xml
test
Does anyone have an explanation?
Group(0) is the whole match, and group(1), group(2), ... are the sub-groups matched by the regular expression.
Why do you expect "test" to be contained in your groups? You didn't define a group to match test (your regex contains only the group \d*_\d*).
Group 0 is the whole match. Real groups start with 1, i.e. you need this:
System.out.println(fileNameMatcher.group(i + 1));
group(0) should be the entire match ("test_test_test_test_20110101_0000.xml");
group(1) should be the sole capture group in your regex ("20110101_0000").
This is what I am getting. I am puzzled as to why you'd be getting a different value for group(1).
Actually, your for loop should include groupCount(), using "<=":
for (int i = 0; i <= fileNameMatcher.groupCount(); ++i) {
    System.out.println(fileNameMatcher.group(i));
}
Your output will then be:
2
test_test_test_test_20110101_0000.xml
test
20110101_0000
groupCount() does not count group 0, which matches the whole string.
The first group will be "test" as matched by (\w+), and
the second group will be "20110101_0000" as matched by (\d*_\d*).
