Capturing groups using If then else regular expression construct in java - java

I have an input string in the following format
String input = "00IG356001110002005064007000000";
Characters 3-7 are the code.
Characters 8-12 are the amount.
Based on the code in the input string (IG356 in the sample input string), I need to capture the amount (00111 in the sample).
The amount (characters 8-12) should be captured only for specific codes, according to the logic below.
The code must not be SG356. If it is SG356, there is no match and processing stops.
a. If the code is not SG356, check whether it is IG902 or SG350; if so, capture the amount (00111)
else
b. Check the three digits in the code (characters 5-7, 356 in this sample). If they are 200, 201, 356 or 370, capture the amount.
I am using the regular expression shown below:
Using positive lookahead and if then else construct.
String regex= ".{2}(?!SG356)((?=IG902|SG350).{5}(.{5}).+|.{2}(?=200|201|356|370).{3}(.{5}).+)";
The regular expression works fine if the code in the input string is IG902 or SG350 (when the 'if' part of the regex is getting matched). but if the 'else' is getting matched, i am unable to capture the amount.
This regular expression is working fine while just checking for a match.
.{2}(?!SG356)((?=IG902|SG350).+|.{2}(?=200|201|356|370).+)
The problem is only while capturing the group.
I am running this in Java. Any help would be greatly appreciated.
The java code i am using is :
public String getTsqlSum(String input, String regex){
    String value = null;
    Matcher m = Pattern.compile(regex).matcher(input);
    System.out.println("Group Count: " + m.groupCount());
    if (m.matches()) {
        for (int i=0;i<m.groupCount();i++){
            System.out.println("For i: " + i +" Value: " + m.group(i));
        }
    }
    return value;
}
public void forumTest(){
    //String input = "00IG902001110002005064007000000";
    String input = "00IG356001110002005064007000000";
    String regex= ".{2}(?!SG356)(?:(?=IG902|SG350).{5}|.{2}(?=200|201|356|370).{3})(.{5}).+";
    System.out.println(match(input, regex));
    String match = getTsqlSum(input, regex);
    System.out.println("Match: " + match);
}

The regular expression works fine if the code in the input string is IG902 or SG350 (when the 'if' part of the regex is getting matched). but if the 'else' is getting matched, i am unable to capture the amount.
You are able to capture the amount; the expression works fine. But when the second part of the alternation matches (this is not a regex if-then-else), the result is in a different capturing group. You will find it in capturing group 3, not in group 2 as you do when the first part of the alternation matches.
String regex= ".{2}(?!SG356)((?=IG902|SG350).{5}(.{5}).+|.{2}(?=200|201|356|370).{3}(.{5}).+)";
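Group numbers, in order of the opening parentheses: group 1 is the outer group around the whole alternation, group 2 is the (.{5}) in the first branch, and group 3 is the (.{5}) in the second branch.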
In a regular expression, capturing groups are numbered by their opening parentheses, and the numbering continues across the branches of an alternation. Perl has a construct, the branch reset group (?|...), that gives the capturing groups in each branch of an alternation the same numbers, but Java's java.util.regex does not support it.
In Java you need to check after the expression in which group you have the result.
See my answer here, similar topic
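As a rough sketch of "check in which group you have the result", the snippet below keeps the original regex from the question and reads group 2 first, falling back to group 3. The class and method names (AlternationGroups, amount) are made up for illustration and are not part of the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class AlternationGroups {
    // Original regex from the question: the amount is group 2 in the
    // first branch and group 3 in the second branch.
    static final Pattern P = Pattern.compile(
        ".{2}(?!SG356)((?=IG902|SG350).{5}(.{5}).+|.{2}(?=200|201|356|370).{3}(.{5}).+)");

    // Returns the captured amount, or null if the input does not match.
    static String amount(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // A group in the branch that did not participate is null.
        return m.group(2) != null ? m.group(2) : m.group(3);
    }
}
```

Calling AlternationGroups.amount("00IG356001110002005064007000000") takes the second branch, so the value comes from group 3.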
You can change your regex and make the alternation before the capturing group
try this
.{2}(?!SG356)(?:(?=IG902|SG350).{5}|.{2}(?=200|201|356|370).{3})(.{5}).+
You will find your result in both cases in the group 1. (I made the first one a non capturing group using the ?:)
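A minimal sketch of how the restructured expression might be used, assuming the same input layout as in the question. The class and method names (AmountExtractor, extractAmount) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class AmountExtractor {
    // The alternation sits in a non-capturing group, so the amount
    // is always capturing group 1, whichever branch matched.
    static final Pattern P = Pattern.compile(
        ".{2}(?!SG356)(?:(?=IG902|SG350).{5}|.{2}(?=200|201|356|370).{3})(.{5}).+");

    // Returns the amount, or null when the code does not qualify.
    static String extractAmount(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group(1) : null;
    }
}
```

With this shape the caller no longer needs to know which branch matched.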
Update after the source was added
Your loop is wrong: capturing groups are numbered starting at 1, so to get the content of group 1 you have to use m.group(1). Because your loop starts at 0 and stops before m.groupCount(), it never reaches the last group.
m.group(0) holds the whole matched string.
Try this
for (int i=1;i<=m.groupCount();i++){
System.out.println("For i: " + i +" Value: " + m.group(i));
}

Related

Java - Regular Expressions Split on character after and before certain words

I'm having trouble figuring out how to grab a certain part of a string using regular expressions in JAVA. Here's my input string:
application.APPLICATION NAME.123456789.status
I need to grab the portion of the string called "APPLICATION NAME". I can't simply split on the period character because APPLICATION NAME may itself include a period. The first word, "application", will always remain the same and the characters after "APPLICATION NAME" will always be numbers.
I've been able to split on period and grab the 1st index but, as I mentioned, APPLICATION NAME may itself include periods, so this is no good. I've also been able to grab the first and second-to-last index of a period, but that seems inefficient and I would like to future-proof by using a regex.
I've googled around for hours and haven't been able to find much guidance. Thanks!
You can use ^application\.(.*)\.\d with find(), or application\.(.*)\.\d.* with matches().
Sample code using find():
private static void test(String input) {
    String regex = "^application\\.(.*)\\.\\d";
    Matcher m = Pattern.compile(regex).matcher(input);
    if (m.find())
        System.out.println(input + ": Found \"" + m.group(1) + "\"");
    else
        System.out.println(input + ": **NOT FOUND**");
}
public static void main(String[] args) {
    test("application.APPLICATION NAME.123456789.status");
    test("application.Other.App.Name.123456789.status");
    test("application.App 55 name.123456789.status");
    test("application.App.55.name.123456789.status");
    test("bad input");
}
Output
application.APPLICATION NAME.123456789.status: Found "APPLICATION NAME"
application.Other.App.Name.123456789.status: Found "Other.App.Name"
application.App 55 name.123456789.status: Found "App 55 name"
application.App.55.name.123456789.status: Found "App.55.name"
bad input: **NOT FOUND**
The above will work as long as "status" doesn't start with a digit.
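The sample above uses find(). For comparison, here is a small sketch of the matches() variant mentioned at the start of this answer, which has to cover the whole input and therefore ends in .* The class and method names (AppNameMatcher, appName) are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class AppNameMatcher {
    // matches() must consume the entire input, hence the trailing .*
    static final Pattern P = Pattern.compile("application\\.(.*)\\.\\d.*");

    // Returns the application name, or null if the input has another shape.
    static String appName(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group(1) : null;
    }
}
```

The same caveat applies: it assumes the part after the number (here "status") does not start with a digit.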
With split(), you could store key.split("\\.") in a String[] s and then join the elements from s[1] to s[s.length-3].
With regexes you can do:
String appName = key.replaceAll("application\\.(.*)\\.\\d+\\.\\w+", "$1");
Why split? Just:
String appName = input.replaceAll(".*?\\.(.*)\\.\\d+\\..*", "$1");
This also correctly handles a dot then digits within the application name, but only works correctly if you know the input is in the expected format.
To handle "bad" input by returning blank if the pattern is not matched, be more strict and add a second alternative (|.*) that always matches, so that the entire input is replaced either way:
String appName = input.replaceAll("^application\\.(.*)\\.\\d+\\.\\w+$|.*", "$1");
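A short sketch of how that last expression behaves on good and bad input. When only the .* alternative matches, group 1 did not participate and $1 contributes nothing, so the result is an empty string. The wrapper class and method (AppNameReplace, appNameOrBlank) are hypothetical.

```java
// Hypothetical wrapper; names are illustrative only.
class AppNameReplace {
    // Either the strict alternative matches the whole input and is
    // replaced by group 1, or the fallback .* matches and is replaced
    // by nothing (group 1 is unset in that case).
    static String appNameOrBlank(String input) {
        return input.replaceAll("^application\\.(.*)\\.\\d+\\.\\w+$|.*", "$1");
    }

    public static void main(String[] args) {
        System.out.println("[" + appNameOrBlank("application.Other.App.Name.123456789.status") + "]");
        System.out.println("[" + appNameOrBlank("bad input") + "]");
    }
}
```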

Java regular expression truncates string

I have the following Java String replaceAll call with a regular expression that replaces variables of the form ${var} with zero:
String s = "( 200828.22 +400000.00 ) / ( 2.00 + ${16!*!8!1} ) + 200828.22 + ${16!*!8!0}";
s = s.replaceAll("\\$\\{.*\\}", "0");
The problem is that the resulting string s is:
"( 200828.22 +400000.00 ) / ( 2.00 + 0"
What's wrong with this code?
Change your regex to
\\$\\{.*?\\}
(note the ? added after the *)
* is greedy: the engine repeats it as many times as it can. So the pattern matches ${, then .* consumes everything up to the end of the string. The engine then backtracks only as far as it needs to, which means it stops at the last } in the string.
For example, if you have the regex
\\{.*\\}
and the string
"{this is} a {test} string"
it'll match as follows:
{ matches the first {
.* matches everything up to and including the final g
the regex then fails to match }, because the string has ended
it backtracks until .* ends at the t of test, where the next } can match, so the overall match is "{this is} a {test}"
To make it non-greedy, add a ? after the *. The quantifier then becomes lazy and stops at the first } it encounters.
As mentioned in the comments, an alternative would be [^}]*. It matches anything that's not } (since it's placed in a character class).
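The three variants discussed here can be compared side by side. This is only a sketch using the string from the question; the class and method names (GreedyVsLazy, greedy, lazy, negated) are invented for the example.

```java
// Hypothetical demo class; names are illustrative only.
class GreedyVsLazy {
    // Greedy: .* runs to the last } in the string.
    static String greedy(String s) {
        return s.replaceAll("\\$\\{.*\\}", "0");
    }

    // Lazy: .*? stops at the first } after each ${
    static String lazy(String s) {
        return s.replaceAll("\\$\\{.*?\\}", "0");
    }

    // Negated character class: [^}]* can never run past a }
    static String negated(String s) {
        return s.replaceAll("\\$\\{[^}]*\\}", "0");
    }

    public static void main(String[] args) {
        String s = "( 200828.22 +400000.00 ) / ( 2.00 + ${16!*!8!1} ) + 200828.22 + ${16!*!8!0}";
        System.out.println(greedy(s));
        System.out.println(lazy(s));
        System.out.println(negated(s));
    }
}
```

For this input the lazy and negated-class versions give the same result; the negated class has the small advantage that it cannot be made to span a } by backtracking.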

Parsing array syntax using regex

I think what I am asking is either very trivial or already asked, but I have had a hard time finding answers.
We need to capture the inner number characters between brackets within a given string.
so given the string
StringWithMultiArrayAccess[0][9][4][45][1]
and the regex
^\w*?(\[(\d+)\])+?
I would expect 6 capture groups and access to the inner data.
However, I end up only capturing the last "1" character in capture group 2.
If it is important, here's my Java JUnit test:
@Test
public void ensureThatJsonHandlerCanHandleNestedArrays(){
String stringWithArr = "StringWithMultiArray[0][0][4][45][1]";
Pattern pattern = Pattern.compile("^\\w*?(\\[(\\d+)\\])+?");
Matcher matcher = pattern.matcher(stringWithArr);
matcher.find();
assertTrue(matcher.matches()); //passes
System.out.println(matcher.group(2)); //prints 1 (matched from last array symbols)
assertEquals("0", matcher.group(2)); //expected but its 1 not zero
assertEquals("45", matcher.group(5)); //only 2 capture groups exist, the whole string and the 1 from the last array brackets
}
In order to capture each number, you need to change your regex so it (a) captures a single number and (b) is not anchored to--and therefore limited by--any other part of the string ("^\w*?" anchors it to the start of the string). Then you can loop through them:
Matcher mtchr = Pattern.compile("\\[(\\d+)\\]").matcher(arrayAsStr);
while(mtchr.find()) {
System.out.print(mtchr.group(1) + " ");
}
Output:
0 9 4 45 1
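The reason the original attempt only saw the last "1" is that a capturing group inside a repetition keeps only the text of its final iteration. The sketch below shows that next to the find() loop, collecting the indices into a list. The class and method names (ArrayIndexParser, indices, lastRepeatedCapture) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class ArrayIndexParser {
    // One find() per bracket pair; group 1 is the number inside it.
    static List<Integer> indices(String s) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile("\\[(\\d+)\\]").matcher(s);
        while (m.find()) {
            result.add(Integer.parseInt(m.group(1)));
        }
        return result;
    }

    // A repeated group: group 2 only remembers its last iteration.
    static String lastRepeatedCapture(String s) {
        Matcher m = Pattern.compile("^\\w*(\\[(\\d+)\\])+").matcher(s);
        return m.matches() ? m.group(2) : null;
    }
}
```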

Matching Urls Inside Strings

I am trying to write a regex that will match urls inside strings of text that may be html-encoded. I am having a considerable amount of trouble with lookaround though. I need something that would correctly match both links in the string below:
some text "http://www.notarealwebsite.com/?q=asdf&searchOrder=1" "http://www.notarealwebsite.com" some other text
A verbose description of what I want would be: "http://" followed by any number of characters that are not spaces, quotes, or the string "&quot[semicolon]" (I don't care about accepting other non-url-safe characters as delimiters)
I have tried a few regexes using lookahead to check for &'s followed by q's followed by u's and so on, but as soon as I put one into the [^...] negation it just completely breaks down and evaluates more like: "http:// followed by any number of characters that are not spaces, quotes, ampersands, q's, u's, o's, t's, or semicolons" which is obviously not what I am looking for.
This will correctly match the &'s at the beginning of the &quot[semicolon]:
&(?=q(?=u(?=o(?=t(?=;)))))
But this does not work:
http://[^ "&(?=q(?=u(?=o(?=t(?=;)))))]*
I know just enough about regexes to get into trouble, and that includes not knowing why this won't work the way I want it to. I understand to some extent positive and negative lookaround, but I don't understand why it breaks down inside the [^...]. Is it possible to do this with regexes? Or am I wasting my time trying to make it work?
If your regex implementation supports it, use a positive look ahead and a backreference with a non-greedy expression in the body.
Here is one with your conditions: (["\s]|&quot;)(http://.*?)(?=\1)
For example, in Python:
import re
p = re.compile(r'(["\s]|&quot;)(https?://.*?)(?=\1)', re.IGNORECASE)
url = "http://test.url/here.php?var1=val&var2=val2"
formatstr = 'text "{0}" more text {0} and more &quot;{0}&quot; test greed&quot;'
data = formatstr.format(url)
for m in p.finditer(data):
    print("Found:", m.group(2))
Produces:
Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2
Or in Java:
@Test
public void testRegex() {
    Pattern p = Pattern.compile("([\"\\s]|&quot;)(https?://.*?)(?=\\1)",
            Pattern.CASE_INSENSITIVE);
    final String URL = "http://test.url/here.php?var1=val&var2=val2";
    final String INPUT = "some text " + URL + " more text + \"" + URL +
            "\" more then &quot;" + URL + "&quot; testing greed &quot;";
    Matcher m = p.matcher(INPUT);
    while (m.find()) {
        System.out.println("Found: " + m.group(2));
    }
}
Produces the same output.
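As for why the lookahead "breaks down" inside [^...]: within a character class, characters such as (, ?, = and ) are ordinary literals, so the class simply excludes each of those characters individually. A different approach from the backreference answer above, offered only as a sketch, is to put a negative lookahead in front of each consumed character. This follows the verbose description in the question fairly literally. The class and method names (UrlFinder, findUrls) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class UrlFinder {
    // After the scheme, consume one character at a time as long as it
    // is not whitespace, not a double quote, and not the start of the
    // literal text &quot;
    static final Pattern URL =
        Pattern.compile("https?://(?:(?!&quot;)[^\\s\"])*");

    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

Unlike the backreference version, this does not require the URL to be delimited by the same token on both sides, and other entities such as &amp; inside the URL are still accepted.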

Regular Expression in Java. Unexpected behaviour

I am trying to match mostly numbers, but depending on the Words which follow the Expression I need to make a difference.
I match every Number which is not followed by a Temperature Term like °C or a Time Specification.
My Regular Expression looks like this:
(((\d+?)(\s*)(\-)(\s*))?(\d+)(\s*))++(?!minuten|Minuten|min|Min|Stunden|stunden|std|Std|°C| °C)
Here is an Example: http://regexr.com?33jeg
While this Behavior is what I expected Java does the Following:
Each index below is the capturing group number for the match of the 4:
0: "4 ", 1: "4 ", 2: "0 - ", 3: "0", 4: " ", 5: "-", 6: " ", 7: "4", 8: " ", 9: "°C"
Note that I match every string separately. So the match for the 5 looks like this:
0: "5 ", 1: "5 ", 2: "null", 3: "null", 4: "null", 5: "null", 6: "null", 7: "5", 8: " ", 9: "null"
This is how I'd like the other match to look. The unwanted behavior only occurs when a "-" appears somewhere in the string before the match.
My Java Code is the following:
public static void adaptPortionDetails(EList<Step> steps, double multiplicator){
String portionMatcher = "(((\\d+?)(\\s*)(\\-)(\\s*))?(\\d+)(\\s*))++(?!°C|Grad|minuten|Minuten|min|Min|Stunden|stunden|std|Std)";
for (int i = 0; i < steps.size(); i++) {
Matcher matcher = Pattern.compile(portionMatcher).matcher(
steps.get(i).getDescription());
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
printGroups(matcher);
String newValue1Str;
if (matcher.group(3) == null){
newValue1Str = "";
System.out.println("test");
}else{
double newValue1 = Integer.parseInt(matcher.group(3)) * multiplicator;
newValue1Str = Fraction.getFraction(newValue1).toProperString();
}
double newValue2 = Integer.parseInt(matcher.group(7)) * multiplicator;
String newValue2Str = Fraction.getFraction(newValue2).toProperString();
matcher.appendReplacement(sb, newValue1Str + "$4$5$6" + newValue2Str + "$8");
}
matcher.appendTail(sb);
steps.get(i).setDescription(sb.toString());
}
}
Hope you can tell what I'm missing.
This seems to be a bug (or feature?) in Java's implementation. It doesn't seem to reset the captured text for the capturing group when the matching has to be redone from the next index.
This test reveals the discrepancy in behavior between Java regex engine and PHP's PCRE.
Regex: (\d+(-\d+)?){1}+(?!x)
Input: 34 34-43x 78 90
Java result: 3 matches (34, 78, 90). The 2nd capturing group of the 2nd match is -43. The 2nd capturing group captures nothing for 1st and 3rd match.
PHP result: Also the same 3 matches, but 2nd capturing group captures nothing for all matches. For PHP's PCRE implementation, when the match has to be redone, the captured text of the capturing groups are reset.
This was tested on JRE 6 Update 37 and JRE 7 Update 11.
Same result for this, just to prove the point that captured text is not reset when matching has to be redone:
Regex: a(\d+(-\d+)?){1}+(?!x)
Input: a34 a34-43x a78 a90
PHP result
Some comment about your regex
I think the ++ should be {1}+, since it seems that you want to modify one number or one range of numbers at a time, while making the match possessive to discard unwanted numbers.
Workaround
The first group (the outermost capturing group), which captures everything (one number or a range of numbers), is always overwritten when a match is found, so you can rely on it. Check whether group 1 contains a - (with the contains method). If it does, capturing group 2 holds text captured from the current match, and you can use it. If it does not, ignore the captured text in capturing group 2 and its nested capturing groups.
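A minimal sketch of this workaround, using the simplified test regex from this answer rather than the full regex from the question. It never reads group 2 unless group 1 contains a "-", so it should behave the same whether or not the runtime leaves stale text in group 2. The class and method names (RangeMatchWorkaround, describe) and the output format are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names and output format are illustrative only.
class RangeMatchWorkaround {
    static final Pattern P = Pattern.compile("(\\d+(-\\d+)?){1}+(?!x)");

    // Describes each match as "<group 1> range=<group 2 or none>".
    static List<String> describe(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            String whole = m.group(1);
            // Trust group 2 only when group 1 shows a range was matched.
            String range = whole.contains("-") ? m.group(2) : "none";
            out.add(whole + " range=" + range);
        }
        return out;
    }
}
```

For the question's regex the same idea would apply to group 3 and the groups nested alongside it: consult them only when group 1 contains the "-".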
