Regular Expression in Java. Unexpected behaviour

I am trying to match numbers, but whether a number should match depends on the words that follow it.
I want to match every number that is not followed by a temperature unit such as °C or by a time specification.
My regular expression looks like this:
(((\d+?)(\s*)(\-)(\s*))?(\d+)(\s*))++(?!minuten|Minuten|min|Min|Stunden|stunden|std|Std|°C| °C)
Here is an example: http://regexr.com?33jeg
While the behaviour shown there is what I expected, Java does the following.
These are the groups (index: captured text) for the match on the 4:
0: "4 "   1: "4 "   2: "0 - "   3: "0"   4: " "   5: "-"   6: " "   7: "4"   8: " "   9: "°C"
Note that I match every string separately. The match for the 5 looks like this:
0: "5 "   1: "5 "   2: "null"   3: "null"   4: "null"   5: "null"   6: "null"   7: "5"   8: " "   9: "null"
This is how I'd like the other match to look. The unwanted behaviour only occurs when a "-" appears somewhere in the string before the match.
My Java code is the following:
public static void adaptPortionDetails(EList<Step> steps, double multiplicator) {
    String portionMatcher = "(((\\d+?)(\\s*)(\\-)(\\s*))?(\\d+)(\\s*))++(?!°C|Grad|minuten|Minuten|min|Min|Stunden|stunden|std|Std)";
    for (int i = 0; i < steps.size(); i++) {
        Matcher matcher = Pattern.compile(portionMatcher).matcher(
                steps.get(i).getDescription());
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            printGroups(matcher);
            String newValue1Str;
            if (matcher.group(3) == null) {
                newValue1Str = "";
                System.out.println("test");
            } else {
                double newValue1 = Integer.parseInt(matcher.group(3)) * multiplicator;
                newValue1Str = Fraction.getFraction(newValue1).toProperString();
            }
            double newValue2 = Integer.parseInt(matcher.group(7)) * multiplicator;
            String newValue2Str = Fraction.getFraction(newValue2).toProperString();
            matcher.appendReplacement(sb, newValue1Str + "$4$5$6" + newValue2Str + "$8");
        }
        matcher.appendTail(sb);
        steps.get(i).setDescription(sb.toString());
    }
}
Hope you can tell what I'm missing.

This seems to be a bug (or feature?) in Java's implementation. It doesn't seem to reset the captured text for the capturing group when the matching has to be redone from the next index.
This test reveals the discrepancy in behavior between Java regex engine and PHP's PCRE.
Regex: (\d+(-\d+)?){1}+(?!x)
Input: 34 34-43x 78 90
Java result: 3 matches (34, 78, 90). The 2nd capturing group of the 2nd match is -43. The 2nd capturing group captures nothing for the 1st and 3rd matches.
PHP result: The same 3 matches, but the 2nd capturing group captures nothing for any of them. In PHP's PCRE implementation, the captured text of the capturing groups is reset when the match has to be redone.
I tested this on JRE 6 Update 37 and JRE 7 Update 11.
Same result for this, just to prove the point that captured text is not reset when matching has to be redone:
Regex: a(\d+(-\d+)?){1}+(?!x)
Input: a34 a34-43x a78 a90
PHP result
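The Java side of the first test above can be reproduced with a small harness like the following. This is only a sketch: the class and method names are made up for illustration, and what group(2) prints may depend on the JRE in use, so no expected output is given for it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StaleGroupDemo {
    // Regex from the first test above.
    static final Pattern TEST = Pattern.compile("(\\d+(-\\d+)?){1}+(?!x)");

    // Collects the whole-match text of every match.
    static List<String> wholeMatches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TEST.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        Matcher m = TEST.matcher("34 34-43x 78 90");
        while (m.find()) {
            // group(2) is the value to watch: on an affected JRE it can hold
            // text left over from an earlier, failed attempt.
            System.out.println("match=" + m.group() + "  group(2)=" + m.group(2));
        }
    }
}
```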
Some comment about your regex
I think the ++ should be {1}+, since it seems that you want to modify one number or one range of numbers at a time, while keeping the match possessive to discard unwanted numbers.
Workaround
The first group (the outermost capturing group), which captures everything (one number or a range of numbers), is always overwritten when a match is found. Hence you can rely on it. You can check whether group 1 contains a - (with the contains method). If it does, capturing group 2 holds text from the current match and you can use it. If it does not, ignore all the captured text in capturing group 2 and its nested capturing groups.
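A minimal sketch of this workaround, under a few assumptions: the class and method names are hypothetical, the multiplier is a plain int rather than the Fraction handling used in the question, and the pattern is the one from the question's code with ++ replaced by {1}+ as suggested above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PortionScaler {
    // \u00B0 is the degree sign; written as an escape to avoid source-encoding issues.
    static final Pattern PORTION = Pattern.compile(
            "(((\\d+?)(\\s*)(\\-)(\\s*))?(\\d+)(\\s*)){1}+"
            + "(?!\u00B0C|Grad|minuten|Minuten|min|Min|Stunden|stunden|std|Std)");

    static String scale(String text, int factor) {
        Matcher m = PORTION.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            StringBuilder repl = new StringBuilder();
            // Group 1 is rewritten by every successful match, so it is safe to inspect.
            if (m.group(1).contains("-")) {
                // A range: groups 3 to 6 belong to the current match.
                repl.append(Integer.parseInt(m.group(3)) * factor)
                    .append(m.group(4)).append(m.group(5)).append(m.group(6));
            }
            // Without a "-", groups 2 to 6 may hold stale text and are ignored.
            repl.append(Integer.parseInt(m.group(7)) * factor).append(m.group(8));
            m.appendReplacement(sb, Matcher.quoteReplacement(repl.toString()));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(scale("180 - 200 \u00B0C backen, 4 Eier", 2));
    }
}
```

Building the replacement from the group values and passing it through Matcher.quoteReplacement avoids the "$4$5$6" back-references, which would otherwise pull in whatever those groups happen to contain.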

Related

java regex tell which column not match

Good day,
My Java code is as follows:
Pattern p = Pattern.compile("^[a-zA-Z0-9$&+,:;=\\[\\]{}?##|\\\\'<>._^*()%!/~\"`  -]*$");
String i = "f698fec0-dd89-11e8-b06b-☺";
Matcher tagmatch = p.matcher(i);
System.out.println("tagmatch is " + tagmatch.find());
As expected, the answer is false, because the string contains a ☺ character. However, I would like to show the column that does not match. For this example, it should report column 25 as containing the invalid character.
May I know how can I do this?

You should remove the anchors from your regex and then use the Matcher#end() method to get the position where the match stopped, like this:
String i = "f698fec0-dd89-11e8-b06b-☺";
Pattern p = Pattern.compile("[\\w$&+,:;=\\[\\]{}?##|\\\\'<>.^*()%!/~\"` -]+");
Matcher m = p.matcher(i);
if (m.lookingAt() && i.length() > m.end()) {
System.out.println("Match <" + m.group() + "> failed at: " + m.end());
}
Output:
Match <f698fec0-dd89-11e8-b06b-> failed at: 24
PS: I have used lookingAt() to ensure that we match the pattern starting from the beginning of the region. You can use find() as well to get the next match anywhere or else keep the start anchor in pattern as
"^[\\w$&+,:;=\\[\\]{}?##|\\\\'<>.^*()%!/~\"` -]+"
and use find() to effectively make it behave like the above code with lookingAt().
Read difference between lookingAt() and find()
I have refactored your regex to use \w instead of [a-zA-Z0-9_] and used quantifier + (meaning match 1 or more) instead of * (meaning match 0 or more) to avoid returning success for zero-length matches.
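One case the snippet above does not report is an invalid character in the very first position, because lookingAt() then fails outright. A possible variation, sketched here with hypothetical class and method names, switches the quantifier back to * so that lookingAt() always succeeds (possibly on an empty prefix) and end() is always the index of the first disallowed character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvalidColumnFinder {
    // Same character class as in the answer above, but quantified with *.
    static final Pattern ALLOWED =
            Pattern.compile("[\\w$&+,:;=\\[\\]{}?##|\\\\'<>.^*()%!/~\"` -]*");

    // Returns the 0-based index of the first disallowed character, or -1 if there is none.
    static int firstInvalidIndex(String s) {
        Matcher m = ALLOWED.matcher(s);
        m.lookingAt(); // always succeeds with *, so end() is safe to call
        return m.end() == s.length() ? -1 : m.end();
    }

    public static void main(String[] args) {
        int idx = firstInvalidIndex("f698fec0-dd89-11e8-b06b-\u263A");
        // The column number asked for in the question is the index plus one.
        System.out.println(idx < 0 ? "valid" : "invalid character at column " + (idx + 1));
    }
}
```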

multiple regex pattern matches in a single string groovy

I have a test string like this
08:28:57,990 DEBUG [http-0.0.0.0-18080-33] [tester] [1522412937602-580613] [TestManager] ABCD: loaded 35 test accounts
I want to regex and match "ABCD" and "35" in this string
def regexString = ~ /(\s\d{1,5}[^\d\]\-\:\,\.])|([A-Z]{4}\:)/
............
while (matcher.find()) {
acct = matcher.group(1)
grpName = matcher.group(2)
println ("group : " +grpName + " acct : "+ acct)
}
My Current Output is
group : ABCD: acct : null
group : null acct : 35
But I expected something like this
group : ABCD: acct : 35
Is there any option to match all the patterns in the string before it enters the while() loop, or a better way to implement this?

You may use
String s = "08:28:57,990 DEBUG [http-0.0.0.0-18080-33] [tester] [1522412937602-580613] [TestManager] ABCD: loaded 35 test accounts"
def res = s =~ /\b([A-Z]{4}):[^\]\[\d]*(\d{1,5})\b/
if (res.find()) {
println "${res[0][1]}, ${res[0][2]}"
} else {
println "not found"
}
See the Groovy demo.
The regex - \b([A-Z]{4}):[^\]\[\d]*(\d{1,5})\b - matches a string starting with a whole word consisting of 4 uppercase ASCII letters (captured into Group 1), followed by : and 0+ chars other than [, ] and digits, and then matches and captures into Group 2 a whole number consisting of 1 to 5 digits.
See the regex demo.
In the code, =~ operator makes the regex engine find a partial match (i.e. searches for the pattern anywhere inside the string) and the res variable contains all the match objects that hold a whole match inside res[0][0], Group 1 inside res[0][1] and Group 2 value in res[0][2].
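Since Groovy uses java.util.regex underneath, the same pattern can be sketched in plain Java as follows. The class and method names are hypothetical; the regex is the one explained above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineParser {
    static final Pattern ACCOUNTS =
            Pattern.compile("\\b([A-Z]{4}):[^\\]\\[\\d]*(\\d{1,5})\\b");

    // Returns {group name, account count}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = ACCOUNTS.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String line = "08:28:57,990 DEBUG [http-0.0.0.0-18080-33] [tester] "
                + "[1522412937602-580613] [TestManager] ABCD: loaded 35 test accounts";
        String[] r = parse(line);
        System.out.println(r == null ? "not found" : "group : " + r[0] + " acct : " + r[1]);
    }
}
```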

I believe your issue is with the 'or' in your regex. I think it is essentially matching twice: once for one branch of the alternation and then again for the branch on the other side of the '|', so each match fills only one of the two groups. You need a regex that will match both in one pass. You can reorder the parts so they match in the order they appear in the string:
/([A-Z]{4})\:.*\s(\d{1,5})[^\d\]\-\:\,\.]/
Also notice the change in parentheses so you don't capture more than you need. Currently you are capturing the ':' after the group name and an extra space before the acct. This is assuming the "ABCD" will always come before the "35".
There is also a lot more you can do assuming that all your strings are formatted very similarly:
For example, if there is always a space after the acct number you could simplify it to:
/([A-Z]{4})\:.*\s(\d{1,5})\s/
There's probably a lot more you could do to make sure you're always capturing the correct things, but I'd have to see or know more about the dataset to do so.
Then of course you have to switch the order of the groups in your code:
while (matcher.find()) {
grpName = matcher.group(1)
acct = matcher.group(2)
println ("group : " +grpName + " acct : "+ acct)
}

Parsing array syntax using regex

I think what I am asking is either very trivial or already asked, but I have had a hard time finding answers.
We need to capture the inner number characters between brackets within a given string.
so given the string
StringWithMultiArrayAccess[0][9][4][45][1]
and the regex
^\w*?(\[(\d+)\])+?
I would expect 6 capture groups and access to the inner data.
However, I end up only capturing the last "1" character in capture group 2.
In case it is important, here's my Java JUnit test:
@Test
public void ensureThatJsonHandlerCanHandleNestedArrays(){
String stringWithArr = "StringWithMultiArray[0][0][4][45][1]";
Pattern pattern = Pattern.compile("^\\w*?(\\[(\\d+)\\])+?");
Matcher matcher = pattern.matcher(stringWithArr);
matcher.find();
assertTrue(matcher.matches()); //passes
System.out.println(matcher.group(2)); //prints 1 (matched from last array symbols)
assertEquals("0", matcher.group(2)); //expected but its 1 not zero
assertEquals("45", matcher.group(5)); //only 2 capture groups exist, the whole string and the 1 from the last array brackets
}

A quantified capturing group keeps only the text of its last iteration, which is why group 2 ends up holding "1". In order to capture each number, you need to change your regex so it (a) captures a single number and (b) is not anchored to, and therefore limited by, any other part of the string ("^\w*?" anchors it to the start of the string). Then you can loop through them:
Matcher mtchr = Pattern.compile("\\[(\\d+)\\]").matcher(arrayAsStr);
while(mtchr.find()) {
System.out.print(mtchr.group(1) + " ");
}
Output:
0 9 4 45 1
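If the indices are needed as numbers rather than printed, the same loop can collect them into a list. A small sketch, with hypothetical class and method names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArrayIndexParser {
    // Matches one bracketed index at a time; no anchoring to the variable name.
    static final Pattern INDEX = Pattern.compile("\\[(\\d+)\\]");

    static List<Integer> indices(String expr) {
        List<Integer> out = new ArrayList<>();
        Matcher m = INDEX.matcher(expr);
        while (m.find()) {
            out.add(Integer.parseInt(m.group(1)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(indices("StringWithMultiArrayAccess[0][9][4][45][1]"));
    }
}
```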

(Pattern and Matcher) not discovering all pattern matches

I have this string object, which consists of tags (bounded by [$ and $]) and the rest of the text. I'm trying to isolate all of the tags. Pattern and Matcher recognize the tags properly, except that two of them are combined into one. I don't have any idea why this is happening; probably some internal Pattern/Matcher business.
String docBody = "This is sample text.\r\n[$ FOR i 1 10 1 $]\r\n This is" +
"[$ i $]-th time this message is generated.\r\n[$END$]\r\n" +
"[$ FOR i 0 10 2 $]\r\n sin([$= i $]^2) = [$= i i * #sin \"0.000\"" +
" #decfmt $]" +
"\r\n[$END$] ";
Pattern p = Pattern.compile("(\\[\\$)(.)+(\\$\\])");
Matcher m = p.matcher(docBody);
while(m.find()){
System.out.println(m.group());
}
output:
[$ FOR i 1 10 1 $]
[$ i $]
[$END$]
[$ FOR i 0 10 2 $]
[$= i $]^2) = [$= i i * #sin "0.000" #decfmt $]
[$END$]
As you can see, this part [$= i $]^2) = [$= i i * #sin "0.000" #decfmt $] is not split into these two tags [$= i $] and [$= i i * #sin "0.000" #decfmt $]
Any suggestions why this is happening?

You should use a reluctant quantifier, ".+?", instead of the greedy ".+":
"(\\[\\$).+?(\\$\\])" // Note `?` after `.+`
If you use .+, it matches as much as it can, so within a line it runs up to the last $]. Note that a dot (.) matches everything except a line terminator. With the reluctant quantifier, .+? matches only up to the first $] it encounters.
In your given string you got all those separate matches only because of the \r\n in between, where .+ stops matching. If you remove all those newlines, you will get a single match from the first [$ to the last $].
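The difference between the two quantifiers can be seen on a single line taken from the question's input. The helper below is a hypothetical sketch that simply collects every match of a given regex:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagExtractor {
    // Returns the text of every match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "sin([$= i $]^2) = [$= i i $]";
        // Greedy: one match spanning both tags.
        System.out.println(findAll("\\[\\$.+\\$\\]", line));
        // Reluctant: each tag is matched separately.
        System.out.println(findAll("\\[\\$.+?\\$\\]", line));
    }
}
```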

A good way is to replace the dot with a negated character class, for example:
Pattern p = Pattern.compile("(\\[\\$)([^$]++)(\\$])");
(note that you don't need to escape closing square brackets)
But perhaps you are only interested in the content of the tags:
Pattern p = Pattern.compile("(?<=\\[\\$)[^$]++(?=\\$])");
In this case the content is the whole match.

Capturing groups using If then else regular expression construct in java

I have an input string in the following format
String input = "00IG356001110002005064007000000";
Characters 3-7 is the code.
Characters 8-12 is the amount.
Based on the code in the input string (IG356 in the sample input string), i need to capture the amount(00111 in the sample).
The value in the amount (characters 8-12) should be picked up only for specific codes and the logic is detailed below.
The code should not be SG356. If it is SG356, not a match and exit.
a. If the code is not SG356, check if the codes are IG902 or SG350, in this case capture the amount(00111)
else
b. Check for the 3 numbers in the code (characters 5-7, 356 in this sample). If they are 200,201,356,370. go ahead and capture the amount
I am using the regular expression shown below:
Using positive lookahead and if then else construct.
String regex= ".{2}(?!SG356)((?=IG902|SG350).{5}(.{5}).+|.{2}(?=200|201|356|370).{3}(.{5}).+)";
The regular expression works fine if the code in the input string is IG902 or SG350 (when the 'if' part of the regex is getting matched). but if the 'else' is getting matched, i am unable to capture the amount.
This regular expression is working fine while just checking for a match.
.{2}(?!SG356)((?=IG902|SG350).+|.{2}(?=200|201|356|370).+)
The problem is only while capturing the group.
I am running this in Java. Any help would be greatly appreciated.
The java code i am using is :
public String getTsqlSum(String input, String regex){
String value = null;
Matcher m = Pattern.compile(regex).matcher(input);
System.out.println("Group Count: " + m.groupCount());
if (m.matches()) {
for (int i=0;i<m.groupCount();i++){
System.out.println("For i: " + i +" Value: " + m.group(i));
}
}
return value;
}
public void forumTest(){
//String input = "00IG902001110002005064007000000";
String input = "00IG356001110002005064007000000";
String regex= ".{2}(?!SG356)(?:(?=IG902|SG350).{5}|.{2}(?=200|201|356|370).{3})(.{5}).+";
System.out.println(match(input, regex));
String match = getTsqlSum(input, regex);
System.out.println("Match: " + match);
}

The regular expression works fine if the code in the input string is IG902 or SG350 (when the 'if' part of the regex is getting matched). but if the 'else' is getting matched, i am unable to capture the amount.
You are able to capture the amount; the expression is working fine. But if the second part of the alternation matches (this is not a regex if-then-else), your result is in a different capturing group. You will find it in capturing group 3, not in group 2 as when the first part of the alternation matches.
String regex= ".{2}(?!SG356)((?=IG902|SG350).{5}(.{5}).+|.{2}(?=200|201|356|370).{3}(.{5}).+)";
Group numbers: 1 = the outer group, 2 = the first (.{5}), 3 = the second (.{5})
In a regular expression the capturing groups are numbered by their opening brackets, and this numbering continues across an alternation. Perl has a construct, the branch reset group (?|...), that gives the capturing groups of each alternative the same numbers; PCRE supports it as well, but Java does not.
In Java you need to check after matching which group holds the result.
See my answer here, similar topic
You can change your regex and make the alternation before the capturing group
try this
.{2}(?!SG356)(?:(?=IG902|SG350).{5}|.{2}(?=200|201|356|370).{3})(.{5}).+
You will find your result in both cases in the group 1. (I made the first one a non capturing group using the ?:)
Update after the source was added
Your loop is wrong: group numbering starts at 1, and m.groupCount() does not include group 0, so the loop has to run from 1 to m.groupCount() inclusive. If you want the content of group one, you have to use m.group(1).
In group m.group(0) you will find the whole matched string.
Try this
for (int i=1;i<=m.groupCount();i++){
System.out.println("For i: " + i +" Value: " + m.group(i));
}
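Putting the reworked regex and the group access together, a minimal sketch could look like this. The class and method names are hypothetical; the regex is the one suggested above, with the alternation moved in front of the single capturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmountExtractor {
    static final Pattern CODE_AND_AMOUNT = Pattern.compile(
            ".{2}(?!SG356)(?:(?=IG902|SG350).{5}|.{2}(?=200|201|356|370).{3})(.{5}).+");

    // Returns the amount (characters 8-12) when the code qualifies, otherwise null.
    static String extractAmount(String input) {
        Matcher m = CODE_AND_AMOUNT.matcher(input);
        // Whichever alternative matched, the amount is always in group 1.
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractAmount("00IG356001110002005064007000000"));
        System.out.println(extractAmount("00IG902001110002005064007000000"));
        System.out.println(extractAmount("00SG356001110002005064007000000"));
    }
}
```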
