Regex fails to capture all groups

Using java.util.regex (JDK 1.6), the regular expression 201210(\d{5,5})Test applied to the subject string 20121000002Test only captures group(0) and does not capture group(1) (the substring 00002) as it should, given the code below:
Pattern p1 = Pattern.compile("201210(\\d{5,5})Test");
Matcher m1 = p1.matcher("20121000002Test");
if(m1.find()){
for(int i = 1; i<m1.groupCount(); i++){
System.out.println("number = "+m1.group(i));
}
}
Curiously, a similar regular expression, 201210(\d{5,5})Test(\d{1,10}), applied to the subject string 20121000002Test0000000099 captures groups 0 and 1 but not group 2.
By contrast, with JavaScript's RegExp object the exact same regular expressions applied to the exact same subject strings capture all groups, as one would expect. I checked and re-checked this myself using these online testers:
http://www.regular-expressions.info/javascriptexample.html
http://www.regextester.com/
Am I doing something wrong here? Or is it that Java's regex library really sucks?

m1.groupCount() returns the number of capturing groups, i.e. 1 in your first case, so the loop for(int i = 1; i<m1.groupCount(); i++) is never entered.
It should be for(int i = 1; i<=m1.groupCount(); i++)

Change the line
for(int i = 1; i<m1.groupCount(); i++){
to
for(int i = 1; i<=m1.groupCount(); i++){ //NOTE THE = ADDED HERE
It now works like a charm!

From java.util.regex.MatchResult.groupCount:
Group zero denotes the entire pattern by convention. It is not included in this count.
So iterate up to and including groupCount(); counting group zero, that is groupCount() + 1 groups in total.

the regular expression "201210(\d{5,5})Test" applied to the subject string "20121000002Test" only captures group(0) and does not capture group(1)
I admit I didn't read the manual first either, but this is what it says about Matcher.groupCount():
Returns the number of capturing groups in this matcher's pattern.
Group zero denotes the entire pattern by convention. It is not included in this count.

for (int i = 1; i <= m1.groupCount(); i++) {
↑
your problem
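Putting the answers above together, a minimal sketch of the corrected loop might look like this. The class name GroupCountDemo and the helper capturedGroups are made up for illustration; the patterns and subject strings are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Hypothetical helper: returns groups 1..groupCount() of the first match,
    // or an empty list if the pattern does not match at all.
    static List<String> capturedGroups(String regex, String subject) {
        List<String> groups = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(subject);
        if (m.find()) {
            // groupCount() excludes group 0, so the bound must be inclusive.
            for (int i = 1; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints [00002]
        System.out.println(capturedGroups("201210(\\d{5,5})Test", "20121000002Test"));
        // prints [00002, 0000000099]
        System.out.println(capturedGroups("201210(\\d{5,5})Test(\\d{1,10})",
                "20121000002Test0000000099"));
    }
}
```

Starting the loop at 1 skips group 0 (the whole match); starting at 0 with the same inclusive bound would print the whole match first.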

Related

How to split a string and save the 2 characters that I split with?

I am trying to split a given string using Java's split method. The string should be divided at two different characters (+ and -), and I want to keep those characters in the array as well, at the same index as the substring that follows them.
For example:
input : String s = "4x^2+3x-2"
output :
arr[0] = 4x^2
arr[1] = +3x
arr[2] = -2
I know how to get the + or - characters at a separate index between the numbers, but that does not help me.
Any suggestions, please?

You can approach this problem in many ways. I'm sure there are clever and fancy ways to split this expression. I will show you the simplest problem-solving process that can help you.
State the problem you need to solve, the input and output
Problem: Split a math expression into subexpressions at + and - signs
Input: 4x^2+3x-2
Output: 4x^2,+3x,-2
Create a pseudo code with some logic you might think works
Given an expression string
Create an empty list of expressions
Create a subExpression string
For each character in the expression
Check if the character is + or -, then
add the subExpression to the list and create a new subExpression starting with that character
otherwise, append the character to the subExpression
In the end, add the remaining subExpression to the list
Implement the pseudo-code in the programming language of your choice
String expression = "4x^2+3x-2";
List<String> expressions = new ArrayList<>();
StringBuilder subExpression = new StringBuilder();
for (int i = 0; i < expression.length(); i++) {
char character = expression.charAt(i);
if (character == '-' || character == '+') {
expressions.add(subExpression.toString());
subExpression = new StringBuilder(String.valueOf(character));
} else {
subExpression.append(String.valueOf(character));
}
}
expressions.add(subExpression.toString());
System.out.println(expressions);
Output
[4x^2, +3x, -2]
You will end up with an algorithm that works for your problem, and you can then start to improve it.

Try this code:
String s = "4x^2+3x-2";
s = s.replace("+", "#+");
s = s.replace("-", "#-");
String[] ss = s.split("#");
for (int i = 0; i < ss.length; i++) {
Log.e("XOP",ss[i]);
}
This code replaces + and - with #+ and #- respectively and then splits the string with #. That way the + and - operators are not lost in the result.
If you require # as input character then you can use any other Unicode character instead of #.

Try this one:
String s = "4x^2+3x-2";
String[] arr = s.split("[\\+-]");
for(int i=0;i<arr.length;i++){
System.out.println(arr[i]);
}
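Note that splitting on [\\+-] consumes the signs, so the loop above prints 4x^2, 3x and 2. One way to keep each sign with the term that follows it, offered here as a sketch rather than as this answer's own method, is to split on a zero-width lookahead so that nothing is consumed. The class and method names are hypothetical.

```java
import java.util.Arrays;

class SplitKeepingSigns {
    // Splits before every + or - without removing it: (?=[+-]) matches the
    // empty position in front of a sign, so the sign stays in the next piece.
    static String[] splitTerms(String expression) {
        return expression.split("(?=[+-])");
    }

    public static void main(String[] args) {
        // prints [4x^2, +3x, -2]
        System.out.println(Arrays.toString(splitTerms("4x^2+3x-2")));
    }
}
```

This assumes every + or - starts a new term; a sign inside an exponent, as in x^-2, would also cause a split.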

Personally I like it better to have positive matches of patterns, especially if the split pattern itself is empty.
So for instance you could use a Pattern and Matcher like this:
Pattern p = Pattern.compile("(^|[+-])([^+-]*)");
Matcher m = p.matcher("4x^2+3x-2");
while (m.find()) {
System.out.printf("%s or %s %s%n", m.group(), m.group(1), m.group(2));
}
This matches the start of the string or a plus or minus: ^|[+-], followed by any amount of characters that are not a plus or minus: [^+-]*.
Do note that the ^ first matches the start of the string, and is then used to negate a character class when used between brackets. Regular expressions are tricky like that.
Bonus: you can also use the two groups (within the parentheses in the pattern) to match the operators - if any.
All this is presuming that you want to use/test regular expressions; generally things like this require a parser rather than a regular expression.
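A one-liner for those who think this is too complex (Matcher.results() requires Java 9 and var requires Java 10; the alternation is parenthesized so that the first term is not reduced to an empty match at ^):
var expressions = Pattern.compile("(^|[+-])[^+-]*")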
.matcher("4x^2+3x-2")
.results()
.map(r -> r.group())
.collect(Collectors.toList());

Regular Expression to restrict some special characters

I am trying to write a regular expression to restrict some characters. The characters to restrict are based on requirements from various users.
I am trying to use this regex - [(char1|char2|char3|...)$]
Note: Each char will be from requirement.
If the string entered by the user contains any of these characters, I will return true. What I want to know is whether this expression will work for all conditions.
For example - requirement1 = .:, requirement2 = .:&%
I concatenate | between the characters and then generate the regular expression in Java. This works for requirement1 but not for requirement2.
My sample Java code:
String requirement = ":>&%";
String regExp1 = null;
for (int i = 0; i < requirement.length(); i++) {
regExp1 = "[(" + requirement.charAt(i);
if (i - 1 != requirement.length()) {
regExp1.concat("|");
}
}
if (regExp1 != null) {
regExp1.concat(")]$");
}
Pattern p = Pattern.compile(regExp);
Matcher m = p.matcher(arg);
if (m.find())
return true;
else
return false;
How can I generate standard regular expression?

If you want "one of these characters", the brackets are enough. No need for parentheses and pipes.
Something like [.:,] and [.:&%] may work. If you want them one or more times, you have to add + at the end of your regex (i.e. [.:&%]+).
As said in the comments, beware of special characters. A dot matches any character outside brackets but is literal inside them; characters such as ], \, ^ and - still need escaping or careful placement inside a character class.
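A minimal sketch of building such a character class from a requirement string follows. The class RestrictedChars and its methods are hypothetical names, and the escaping rule (a backslash before every ASCII character that is not a letter or digit) is one conservative choice, not the only one. It relies on java.util.regex allowing a backslash before any non-alphabetic character.

```java
import java.util.regex.Pattern;

class RestrictedChars {
    // Builds a character class such as [\.\:\&\%] from the raw characters.
    // Letters and digits are left alone, because a backslash in front of
    // them could form an escape such as \d or a back-reference.
    static Pattern buildPattern(String restricted) {
        StringBuilder sb = new StringBuilder("[");
        for (char c : restricted.toCharArray()) {
            if (c < 128 && !Character.isLetterOrDigit(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return Pattern.compile(sb.append(']').toString());
    }

    // True if the input contains at least one of the restricted characters.
    static boolean containsRestricted(String input, String restricted) {
        if (restricted.isEmpty()) {
            return false; // "[]" is not a valid pattern, and nothing is restricted
        }
        return buildPattern(restricted).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsRestricted("50%", ".:&%")); // prints true
        System.out.println(containsRestricted("abc", ".:&%")); // prints false
    }
}
```

Because find() is used, a single restricted character anywhere in the input is enough; no quantifier or anchor is needed.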

Second capturing group not capturing

In java, I've been trying to parse a log file using regex. Below one line of the log file.
I 20151007 090137 - com.example.Main - Main.doStuff (293): ##identifier (id:21): {};
I need the json string at the end of the line, and the id. Which means I need two capturing groups. So I started coding.
Pattern p = Pattern.compile(
"^I [0-9]{8} [0-9]{6} - com\\.example\\.Main - Main\\.doStuff \\(\\d+\\): ##identifier \\(id:(\\d+)\\): (.*?);$"
);
The (.*?) at the end of the pattern is meant to capture the rest of the line while leaving the ; at the very end of the input line out of the group.
Matcher m = p.matcher(readAboveLogfileLineToString());
System.err.println(m.matches() + ", " + m.groupCount());
for (int i = 0; i < m.groupCount(); i++) {
System.out.println(m.group(i));
}
However, above code outputs
true, 2
I 20151007 090137 - com.example.Main - Main.doStuff (293): ##identifier (id:21): {};
21
But where's my "rest" group? And why is the entire line a group? I've checked multiple online regex test sites, and it should work: http://www.regexplanet.com/advanced/java/index.html for example sees 3 capturing groups. Maybe it's to do with the fact that I'm currently using jdk 1.6?

The problem is that the groupCount iteration is one of the few cases in Java where you actually need to reach the count value to get all groups.
In this case, you need to iterate to group 2, since group 0 actually represents the whole match.
Just increment your counter as such (notice the <= instead of just <):
for (int i = 0; i <= m.groupCount(); i++) {
The last text printed should be: {}
You can also skip group 0 and start your count at 1 directly, of course.
To summarize, the explicit groups marked in the Pattern with parentheses start from index 1.
See the documentation for Matcher.groupCount().
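When the number of groups is known in advance, as it is here, the groups can also be read directly by index instead of looping. The sketch below assumes the log line and pattern from the question; the class LogLineParser and the method parse are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineParser {
    private static final Pattern LINE = Pattern.compile(
        "^I [0-9]{8} [0-9]{6} - com\\.example\\.Main - Main\\.doStuff \\(\\d+\\): "
        + "##identifier \\(id:(\\d+)\\): (.*?);$");

    // Returns {id, json} for a matching line, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // group(0) would be the whole line; the captures start at index 1.
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("I 20151007 090137 - com.example.Main - "
            + "Main.doStuff (293): ##identifier (id:21): {};");
        System.out.println(parts[0] + " " + parts[1]); // prints 21 {}
    }
}
```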

Java recursive(?) repeated(?) deep(?) pattern matching

I'm trying to get ALL the substrings in the input string that match the given pattern.
For example,
Given string: aaxxbbaxb
Pattern: a[a-z]{0,3}b
(What I actually want to express is: all the substrings that start with a and end with b, with up to 2 letters in between them)
Exact results that I want (with their indexes):
aaxxb: index 0~4
axxb: index 1~4
axxbb: index 1~5
axb: index 6~8
But when I run it through the Pattern and Matcher classes using Pattern.compile() and Matcher.find(), it only gives me:
aaxxb : index 0~4
axb : index 6~8
This is the piece of code I used.
Pattern pattern = Pattern.compile("a[a-z]{0,3}b", Pattern.CASE_INSENSITIVE);
Matcher match = pattern.matcher("aaxxbbaxb");
while (match.find()) {
System.out.println(match.group());
}
How can I retrieve every single piece of string that matches the pattern?
Of course, it doesn't have to use Pattern and Matcher classes, as long as it's efficient :)

(see: All overlapping substrings matching a java regex )
Here is the full solution that I came up with. It can handle zero-width patterns, boundaries, etc. in the original regular expression. It looks through all substrings of the text string and checks whether the regular expression matches only at the specific position by padding the pattern with the appropriate number of wildcards at the beginning and end. It seems to work for the cases I tried -- although I haven't done extensive testing. It is most certainly less efficient than it could be.
public static void allMatches(String text, String regex)
{
for (int i = 0; i < text.length(); ++i) {
for (int j = i + 1; j <= text.length(); ++j) {
String positionSpecificPattern = "((?<=^.{"+i+"})("+regex+")(?=.{"+(text.length() - j)+"}$))";
Matcher m = Pattern.compile(positionSpecificPattern).matcher(text);
if (m.find())
{
System.out.println("Match found: \"" + (m.group()) + "\" at position [" + i + ", " + j + ")");
}
}
}
}

You are in effect searching for the strings ab, a_b, and a__b in an input string, where _ denotes a non-whitespace character whose value you do not care about.
That's three search targets. The most efficient way I can think of to do this would be to use a search algorithm like the Knuth-Morris-Pratt algorithm, with a few modifications. In effect your pseudocode would be something like:
for i in 0 to sourcestring.length
check sourcestring[i] - is it a? if so, check sourcestring[i+x]
// where x is the index of the search string - 1
if matches then save i to output list
else i = i + searchstring.length
Obviously, if you have a position match, you must then check the inner characters of the substring to make sure they are alphabetical.
Run the algorithm 3 times, once for each search term. It will doubtless be much faster than trying to do the search using pattern matching.
Edit: sorry, I didn't read the question properly. If you have to use regex, then the above will not work for you.

One thing you could do is:
Create all possible substrings that are long enough to match (2 to 5 characters for this pattern; good luck with that if your String is large)
Create a new Matcher for each of these substrings
Do a matches() instead of a find()
Calculate the absolute offset from the substring's relative offset and the matcher info
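The steps above can be sketched without creating substrings or new Matcher objects by restricting one Matcher to each candidate range with Matcher.region(start, end) and calling matches(). The class and method names are hypothetical. Note that with the default region settings, anchors treat the region edges as the ends of the input and lookarounds cannot see outside the region, so this is a simplification of the lookaround-based answer above, not an equivalent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingMatches {
    // Tries every [start, end) range and keeps those the regex matches in full.
    // This is quadratic in the text length, so it only suits short inputs.
    static List<String> allMatches(String text, String regex) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                // region() resets the matcher and limits it to [start, end).
                if (m.region(start, end).matches()) {
                    found.add(text.substring(start, end) + " [" + start + ", " + end + ")");
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [aaxxb [0, 5), axxb [1, 5), axxbb [1, 6), axb [6, 9)]
        System.out.println(allMatches("aaxxbbaxb", "a[a-z]{0,3}b"));
    }
}
```

The offsets are already absolute because the matcher works on the original string, so the last step of the list is not needed here.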

java.util.regex.Matcher confused group

I'm having trouble getting the right group of a regex match. My code boils down to following:
Pattern fileNamePattern = Pattern.compile("\\w+_\\w+_\\w+_(\\w+)_(\\d*_\\d*)\\.xml");
Matcher fileNameMatcher = fileNamePattern.matcher("test_test_test_test_20110101_0000.xml");
System.out.println(fileNameMatcher.groupCount());
if (fileNameMatcher.matches()) {
for (int i = 0; i < fileNameMatcher.groupCount(); ++i) {
System.out.println(fileNameMatcher.group(i));
}
}
I expect the output to be:
2
test
20110101_0000
However its:
2
test_test_test_test_20110101_0000.xml
test
Does anyone have an explanation?

Group(0) is the whole match, and group(1), group(2), ... are the sub-groups matched by the regular expression.
Why do you expect "test" to be contained in your groups? You didn't define a group to match test (your regex contains only the group \d*_\d*).

Group 0 is the whole match. Real groups start with 1, i.e. you need this:
System.out.println(fileNameMatcher.group(i + 1));

group(0) should be the entire match ("test_test_test_test_20110101_0000.xml");
group(1) should be the sole capture group in your regex ("20110101_0000").
This is what I am getting. I am puzzled as to why you'd be getting a different value for group(1).

Actually, your for loop should include groupCount(), using "<=":
for (int i = 0; i <= fileNameMatcher.groupCount(); ++i) {
System.out.println(fileNameMatcher.group(i));
}
Your output will then be:
2
test_test_test_test_20110101_0000.xml
test
20110101_0000
groupCount() does not count group 0, which matches the whole string.
The first group will be "test", as matched by (\w+), and
the second group will be "20110101_0000", as matched by (\d*_\d*).
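If counting indexes is error-prone, named groups avoid it altogether. This is only a sketch: it assumes Java 7 or later, where (?<name>...) and Matcher.group(String) are available, and the group names prefix and stamp as well as the class FileNameGroups are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNameGroups {
    // Same pattern as in the question, with the two groups given names.
    private static final Pattern FILE_NAME = Pattern.compile(
        "\\w+_\\w+_\\w+_(?<prefix>\\w+)_(?<stamp>\\d*_\\d*)\\.xml");

    // Returns {prefix, stamp}, or null if the file name does not match.
    static String[] parse(String fileName) {
        Matcher m = FILE_NAME.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("prefix"), m.group("stamp") };
    }

    public static void main(String[] args) {
        String[] parts = parse("test_test_test_test_20110101_0000.xml");
        System.out.println(parts[0]); // prints test
        System.out.println(parts[1]); // prints 20110101_0000
    }
}
```

Named groups still have numbers, so group(1) and group(2) keep working alongside the names.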
