multiple regex pattern matches in a single string groovy - java

I have a test string like this
08:28:57,990 DEBUG [http-0.0.0.0-18080-33] [tester] [1522412937602-580613] [TestManager] ABCD: loaded 35 test accounts
I want to regex and match "ABCD" and "35" in this string
def regexString = ~ /(\s\d{1,5}[^\d\]\-\:\,\.])|([A-Z]{4}\:)/
............
while (matcher.find()) {
acct = matcher.group(1)
grpName = matcher.group(2)
println ("group : " +grpName + " acct : "+ acct)
}
My Current Output is
group : ABCD: acct : null
group : null acct : 35
But I expected something like this
group : ABCD: acct : 35
Is there any option to match all the patterns in the string before it enters the while() loop, or a better way to implement this?

You may use
String s = "08:28:57,990 DEBUG [http-0.0.0.0-18080-33] [tester] [1522412937602-580613] [TestManager] ABCD: loaded 35 test accounts"
def res = s =~ /\b([A-Z]{4}):[^\]\[\d]*(\d{1,5})\b/
if (res.find()) {
println "${res[0][1]}, ${res[0][2]}"
} else {
println "not found"
}
See the Groovy demo.
The regex - \b([A-Z]{4}):[^\]\[\d]*(\d{1,5})\b - matches a string starting with a whole word consisting of 4 uppercase ASCII letters (captured into Group 1), then followed with : and 0+ chars other than [, ] and digits, and then matches and captures into Group 2 a whole number consisting of 1 to 5 digits.
See the regex demo.
In the code, =~ operator makes the regex engine find a partial match (i.e. searches for the pattern anywhere inside the string) and the res variable contains all the match objects that hold a whole match inside res[0][0], Group 1 inside res[0][1] and Group 2 value in res[0][2].
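Since the question is also tagged java, here is a minimal sketch of the same single-pattern approach in plain Java. The class name LogLineParser and the method parse are made up for illustration; only the regex comes from the answer above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineParser {
    // Same pattern as the Groovy answer: group 1 = 4-letter name, group 2 = 1-5 digit number.
    private static final Pattern LINE =
            Pattern.compile("\\b([A-Z]{4}):[^\\]\\[\\d]*(\\d{1,5})\\b");

    /** Returns {name, number} for the first match, or empty if nothing matches. */
    static Optional<String[]> parse(String line) {
        Matcher m = LINE.matcher(line);
        // find() searches anywhere in the string, like Groovy's =~
        return m.find()
                ? Optional.of(new String[] {m.group(1), m.group(2)})
                : Optional.empty();
    }

    public static void main(String[] args) {
        String s = "08:28:57,990 DEBUG [http-0.0.0.0-18080-33] [tester] "
                + "[1522412937602-580613] [TestManager] ABCD: loaded 35 test accounts";
        parse(s).ifPresent(r -> System.out.println("group : " + r[0] + " acct : " + r[1]));
    }
}
```

Both values come out of one find() call because the pattern describes the name and the number as parts of a single match.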

I believe your issue is the 'or' (|) in your regex. With an alternation, each call to find() matches only one side: the first half of the regex on one call, and the second half after the '|' on a later call, so the two groups are never set in the same match. You need a regex that matches both in one pass. You can reorder the groups so they match in the order they appear in the string:
/([A-Z]{4})\:.*\s(\d{1,5})[^\d\]\-\:\,\.]/
Also notice the change in parentheses so you don't capture more than you need - Currently you are capturing the ':' after the group name and an extra space before the acct. This is assuming the "ABCD" will always come before the "35".
There is also a lot more you can do assuming that all your strings are formatted very similarly:
For example, if there is always a space after the acct number you could simplify it to:
/([A-Z]{4})\:.*\s(\d{1,5})\s/
There's probably a lot more you could do to make sure you're always capturing the correct things, but I'd have to see or know more about the dataset to do so.
Then of course you have to switch the order of the groups in your code:
while (matcher.find()) {
grpName = matcher.group(1)
acct = matcher.group(2)
println ("group : " +grpName + " acct : "+ acct)
}
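If the alternation has to stay (for example because the order of the two values is not guaranteed), another option is to remember each value as it is found and print once after the loop. This is only a sketch under that assumption: the class AlternationCollector, the method collect and the simplified pattern are illustrative, not taken from the answers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationCollector {
    // Either a 4-letter name followed by ':' (group 1)
    // or a 1-5 digit number surrounded by whitespace (group 2).
    private static final Pattern EITHER =
            Pattern.compile("([A-Z]{4}):|\\s(\\d{1,5})\\s");

    /** Returns {grpName, acct}; an entry stays null if that side never matched. */
    static String[] collect(String line) {
        String grpName = null;
        String acct = null;
        Matcher m = EITHER.matcher(line);
        while (m.find()) {
            // Each find() sets only one of the two groups, so keep what was seen so far.
            if (m.group(1) != null) grpName = m.group(1);
            if (m.group(2) != null) acct = m.group(2);
        }
        return new String[] {grpName, acct};
    }

    public static void main(String[] args) {
        String s = "08:28:57,990 DEBUG [http-0.0.0.0-18080-33] [tester] "
                + "[1522412937602-580613] [TestManager] ABCD: loaded 35 test accounts";
        String[] r = collect(s);
        System.out.println("group : " + r[0] + " acct : " + r[1]);
    }
}
```

Note that this keeps the last occurrence of each value if a line contains several.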

Related

Regex for finding only single alphabets in a string and ignore consecutive double

I have searched a lot but I am unable to find a regex that selects only single letters and doubles them, while letters that are already doubled remain untouched.
I tried
String str = "yahoo";
str = str.replaceAll("(\\w)\\1+", "$0$0");
But since this (\\w)\\1+ selects all double elements, my output becomes yahoooo. I tried to add negation to it !(\\w)\\1+ but didn't work and output becomes same as input. I have tried
str.replaceAll(".", "$0$0");
But that doubles every character including which are already doubled.
Please help me write a regex that replaces every single character with a double, while characters that are already doubled remain untouched.
Example
abc -> aabbcc
yahoo -> yyaahhoo (o should remain untouched)
opinion -> ooppiinniioonn
aaaaaabc -> aaaaaabbcc
You can match using this regex:
((.)\2+)|(.)
And replace it with:
$1$3$3
RegEx Demo
RegEx Explanation:
((.)\2+): Match a character and capture in group #2 and using \2+ next to it to make sure we match all multiple repeats of captured character. Capture all the repeated characters in group #1
|: OR
(.): Match any character and capture in group #3
Code Demo:
import java.util.List;
class Ideone {
public static void main(String[] args) {
List<String> input = List.of("aaa", "abc", "yahoo",
"opinion", "aaaaaabc");
for (String s: input) {
System.out.println( s + " => " +
s.replaceAll("((.)\\2+)|(.)", "$1$3$3") );
}
}
}
Output:
aaa => aaa
abc => aabbcc
yahoo => yyaahhoo
opinion => ooppiinniioonn
aaaaaabc => aaaaaabbcc
The solution by @anubhava, if viable in Java, is probably the best way to go. For a more brute force approach, we can try a regex iteration approach on the following pattern:
(\\w)\\1+|\\w
This first tries to match a run of two or more identical letters and, failing that, a single letter. For each match, we leave a multi-letter run unchanged and double any single letter. Here is a short Java snippet that does this:
import java.util.*;
import java.util.regex.*;

List<String> inputs = Arrays.asList("abc", "yahoo", "opinion", "aaaaaabc");
String pattern = "(\\w)\\1+|\\w";
Pattern r = Pattern.compile(pattern);
for (String input : inputs) {
    Matcher m = r.matcher(input);
    StringBuffer buffer = new StringBuffer();
    while (m.find()) {
        if (m.group().matches("(\\w)\\1+")) {
            // a run of repeated letters: copy it unchanged
            m.appendReplacement(buffer, m.group());
        }
        else {
            // a single letter: write it twice
            m.appendReplacement(buffer, m.group() + m.group());
        }
    }
    m.appendTail(buffer);
    System.out.println(input + " => " + buffer.toString());
}
This prints:
abc => aabbcc
yahoo => yyaahhoo
opinion => ooppiinniioonn
aaaaaabc => aaaaaabbcc
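On Java 9 or later, the same iteration can be written without the explicit buffer by passing a function to Matcher.replaceAll. This is a sketch, not part of the answer above; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleLetterDoubler {
    // Group 1 is set only when the first alternative (a run of 2+ letters) matched.
    private static final Pattern RUN_OR_SINGLE = Pattern.compile("(\\w)\\1+|\\w");

    static String doubleSingles(String input) {
        return RUN_OR_SINGLE.matcher(input).replaceAll(m -> {
            String text = m.group();
            String out = (m.group(1) != null) ? text : text + text;
            // quoteReplacement keeps '$' and '\' literal in the replacement
            return Matcher.quoteReplacement(out);
        });
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc", "yahoo", "opinion", "aaaaaabc"}) {
            System.out.println(s + " => " + doubleSingles(s));
        }
    }
}
```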
I've got two different understandings of the question.
If the goal is to get an even amount of each word character:
Search for (\w)\1? and replace with $1$1 (regex101 demo).
If only single characters should be doubled and the others left untouched:
Search for (\w)\1?(\1*) and replace with $1$1$2 (regex 101 demo).
Captures a word character \w to $1, optionally matches the same character again. The second variant captures any more of the same character to $2 for attaching in the replacement.
FYI: If using as a Java string remember to escape the pattern. E.g. \1 -> \\1, \w ->\\w, ...
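The two readings only differ for runs of odd length, which a small Java sketch makes visible. The class and method names are hypothetical; the patterns and replacements are the ones given above, escaped for Java.

```java
class EvenOrDouble {
    // Variant 1: pad every run to an even length.
    static String evenCounts(String s) {
        return s.replaceAll("(\\w)\\1?", "$1$1");
    }

    // Variant 2: double single characters, keep longer runs as they are.
    static String doubleSinglesOnly(String s) {
        return s.replaceAll("(\\w)\\1?(\\1*)", "$1$1$2");
    }

    public static void main(String[] args) {
        System.out.println(evenCounts("aaa"));        // aaaa
        System.out.println(doubleSinglesOnly("aaa")); // aaa
        System.out.println(doubleSinglesOnly("yahoo"));
    }
}
```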

regex for not matching alpha plus numeric range

I have the following regex
.{19}_.{3}PDR_.{8}(ABCD|CTNE|PFRE)006[0-9][0-9].{3}_.{6}\.POC
a match is for example
NRM_0157F0680884976_598PDR_T0060000ABCD00619_00_6I1N0T.POC
and would like to negate the (ABCD|CTNE|PFRE)006[0-9][0-9]
portion such that
NRM_0157F0680884976_598PDR_T0060000ABCD00719_00_6I1N0T.POC
is a match but
NRM_0157F0680884976_598PDR_T0060000ABCD007192_00_6I1N0T.POC
or
NRM_0157F0680884976_598PDR_T0060000ABCD0061_00_6I1N0T.POC
is not (the negated part must be 9 chars long just like the non negated part for a total length of 58 chars).
Consider using the following pattern:
\b(?:ABCD|CTNE|PFRE)006[0-9][0-9]\b
Sample Java code:
String input = "Matching value is ABCD00601 but EFG123 is non matching";
Pattern r = Pattern.compile("\\b(?:ABCD|CTNE|PFRE)006[0-9][0-9]\\b");
Matcher m = r.matcher(input);
while (m.find()) {
System.out.println("Found a match: " + m.group());
}
This prints:
Found a match: ABCD00601
I would like to propose this expression
(ABCD|CTNE|PFRE)006\d{1,2}
where \d{1,2} matches any one- or two-digit number,
that is, it would accept any alphanumeric value from ABCD0060~ABCD00699 or CTNE0060~CTNE00699 or PFRE0060~PFRE00699
Edit #1:
As user @Hao Wu mentioned, the above regex would also accept ABCD0060, which is not ideal.
Removing the 1 from the { } fixes this, so that only the
alphanumeric values ABCD00600~ABCD00699, CTNE00600~CTNE00699 or PFRE00600~PFRE00699 are accepted,
so the resulting regex would be
(ABCD|CTNE|PFRE)006\d{2}
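Neither pattern above negates the portion, which is what the question asks for. One possible reading, shown here only as a sketch, is to put the excluded codes into a negative lookahead and then consume exactly 9 arbitrary characters in their place, so the total length stays at 58. The class NegatedCodeMatcher is made up for illustration, and the approach assumes the whole file name is matched with matches().

```java
import java.util.regex.Pattern;

class NegatedCodeMatcher {
    // (?!...) rejects the codes from the original regex; .{9} keeps the field 9 chars long.
    private static final Pattern NEGATED = Pattern.compile(
            ".{19}_.{3}PDR_.{8}(?!(?:ABCD|CTNE|PFRE)006[0-9]{2}).{9}.{3}_.{6}\\.POC");

    static boolean matches(String fileName) {
        return NEGATED.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("NRM_0157F0680884976_598PDR_T0060000ABCD00719_00_6I1N0T.POC"));
        System.out.println(matches("NRM_0157F0680884976_598PDR_T0060000ABCD00619_00_6I1N0T.POC"));
    }
}
```

The too-long and too-short examples from the question fail because the fixed-width parts after the 9 characters no longer line up with the underscores.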

Regex expression is not working correctly

I am trying to find numbers in a string. They are normally formatted like "4.97", but if they are smaller than 1 they appear as .97, .80, etc. I want to find these substrings and replace them so that they start with a 0.
It's working for the string
String str = "Rate is : .97";
Result : "Rate is : 0.97"
But not for the string:
String str = "Rate is : .97 . XXXXXXXXX do you want . to perform another calculation . ";
String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
String pattern = "(.*\\D)(.\\d\\d.*)";
System.out.println(str.matches("(.*\\D)(.\\d\\d.*)"));
str = str.replaceAll(pattern, "$10$2");
Why is this happening?
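In your second example, both .* are greedy. The leading .* consumes as much as possible, so the match ends up at the last place where a character is followed by two digits, and the .* after the last \\d then matches any character except a newline, which consumes the rest of the string. As a result only that one match is replaced.
You might do the replacement without a capturing group using a negative lookbehind (?<!\S) to assert that what is on the left is not a non-whitespace char (that is, the dot is at the start of the string or preceded by whitespace).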
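To see where the original pattern actually splits the second string, the two groups can be printed directly. This is a small diagnostic sketch; the class GreedyGroupsDemo is made up for illustration, the pattern is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyGroupsDemo {
    /** Returns {group 1, group 2} of the question's pattern, or an empty array if no match. */
    static String[] groups(String s) {
        Matcher m = Pattern.compile("(.*\\D)(.\\d\\d.*)").matcher(s);
        return m.find() ? new String[] {m.group(1), m.group(2)} : new String[0];
    }

    public static void main(String[] args) {
        String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
        String[] g = groups(str);
        System.out.println("group 1: [" + g[0] + "]");
        System.out.println("group 2: [" + g[1] + "]");
    }
}
```

Group 2 starts at the space before 87, not at .97, so the 0 from "$10$2" is inserted there.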
(?<!\S)\.[0-9]
In the replacement use a zero followed by the full match.
Regex demo | Java demo
String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
String pattern = "(?<!\\S)\\.[0-9]";
System.out.println(str.replaceAll(pattern, "0$0"));
Output
Rate is : 0.97 . XXXXXXXXX do you want . 87 to perform another calculation .
If there should be a non digit before, you could make use of a positive lookbehind
(?<=\D)\.[0-9]
Regex demo
In Java
String regex = "(?<=\\D)\\.[0-9]";
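The two lookbehinds behave differently at the start of the string and after a letter. A short sketch comparing them (class and method names are illustrative):

```java
class LeadingZeroFixer {
    // Dot must be at the start of the string or preceded by whitespace.
    static String afterWhitespaceOrStart(String s) {
        return s.replaceAll("(?<!\\S)\\.[0-9]", "0$0");
    }

    // Dot must be preceded by some non-digit character.
    static String afterNonDigit(String s) {
        return s.replaceAll("(?<=\\D)\\.[0-9]", "0$0");
    }

    public static void main(String[] args) {
        for (String s : new String[] {".97", "x.97", "4.97", "Rate is : .97"}) {
            System.out.println("[" + s + "] -> [" + afterWhitespaceOrStart(s)
                    + "] / [" + afterNonDigit(s) + "]");
        }
    }
}
```

Which one fits depends on whether input like x.97 should be rewritten.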
It looks like you need to add some lazy matching to your regex.
The ? after a quantifier makes it lazy: it will attempt to match as few times as possible, so in this case the first group stops at the first number and does not go on to the second.
^(.*?\D)(.\d\d.*?)
You can see this regex work here, with a more complete explanation.
I have also added the ^ start-of-string anchor to make sure only one match is created and the pattern is not applied again at the second number.
First of all, your regex pattern seems to be wrong. I think you can just use:
(\D)(\.\d+)
Find a character that is not a digit, followed by a dot and at least one digit.
Second, for replacing, you could use more low-level features, such as:
String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
final Pattern regex = Pattern.compile("(\\D)(\\.\\d+)");
final Matcher m = regex.matcher(str);
if (m.find()) {
str = m.replaceFirst(m.group(1) + "0" + m.group(2));
}
System.out.println(str);
But of course, this works too:
str = str.replaceAll("(\\D)(\\.\\d+)", "$10$2");
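The replacement "$10$2" works because Java reads $10 as group 1 followed by a literal 0 when the pattern has fewer than ten groups. If that looks too fragile, named groups avoid the ambiguity. A minimal sketch, with group names pre and num chosen only for illustration:

```java
class NamedGroupReplacement {
    static String addLeadingZero(String s) {
        // ${pre} and ${num} refer to the named groups, so the literal 0 cannot be misread.
        return s.replaceAll("(?<pre>\\D)(?<num>\\.\\d+)", "${pre}0${num}");
    }

    public static void main(String[] args) {
        String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
        System.out.println(addLeadingZero(str));
    }
}
```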
You can do a positive lookahead so that way you also catch whitespaces between . and the number.
(.(?=.\d)|(\d+))+
Then in your code you can do whatever operation you wish on group 1 and group 2.

regex - How to match elements while ignoring others between quotation marks?

I can't seem to find the regex that suits my needs.
I have a .txt file of this form:
Abc "test" aBC : "Abc aBC"
Brooking "ABC" sadxzc : "I am sad"
asd : "lorem"
a22 : "tactius"
testsa2 : "bruchia"
test : "Abc aBC"
b2 : "Ast2"
From this .txt file I wish to extract everything matching this regex "([a-zA-Z]\w+)", except the ones between the quotation marks.
I want to rename every word (except the words in quotation marks), so I should have for example the following output:
A "test " B : "Abc aBC"
Z "ABC" X : "I am sad"
Test : "lorem"
F : "tactius"
H : "bruchia"
Game : "Abc aBC"
S: "Ast2"
Is this even achievable using a regex? Are there alternatives without using regex?
If quotes are balanced and there is no escaping in the input like \" then you can use this regex to match words outside double quotes:
(?=(?:(?:[^"]*"){2})*[^"]*$)(\b[a-zA-Z]\w+\b)
RegEx Demo
In java it will be:
Pattern p = Pattern.compile("(?=(?:(?:[^\"]*\"){2})*[^\"]*$)(\\b[a-zA-Z]\\w+\\b)");
This regex matches a word only if it is outside double quotes, by using a lookahead to make sure there is an even number of quotes after each matched word.
A simple approach might be to split the string by ", then do the replace using your regex on every odd part (on parts 1, 3, ..., if you start the numbering from 1), and join everything back.
UPD
However, it is also simple to implement manually. Just go along the line and track whether you are inside quotes or not.
insideQuotes = false
result = ""
currentPart = ""
input = input + '"' // so that we do not need to process the last part separately
for ch in input
    if ch == '"'
        if not insideQuotes
            currentPart = replace(currentPart)
        result = result + currentPart + '"'
        currentPart = ""
        insideQuotes = not insideQuotes
    else
        currentPart = currentPart + ch
drop the last symbol of result (it is that quote mark that we have added)
However, think also on whether you will need some more advanced syntax. For example, quote escaping like
word "inside quote \" still inside" outside again
? If yes, then you will need a more advanced parser, or you might think of using some special format.
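The manual tracking idea from this answer could look roughly like this in Java. It is a sketch for balanced, unescaped quotes only; the class QuoteAwareRewriter and the replaceOutside parameter are hypothetical, and the trailing part is handled after the loop instead of appending an extra quote.

```java
import java.util.function.UnaryOperator;

class QuoteAwareRewriter {
    /** Applies replaceOutside to every stretch of text that is not inside double quotes. */
    static String rewrite(String input, UnaryOperator<String> replaceOutside) {
        StringBuilder result = new StringBuilder();
        StringBuilder part = new StringBuilder();
        boolean insideQuotes = false;
        for (char ch : input.toCharArray()) {
            if (ch == '"') {
                // Flush the part collected so far; only unquoted parts are rewritten.
                String text = part.toString();
                result.append(insideQuotes ? text : replaceOutside.apply(text)).append('"');
                part.setLength(0);
                insideQuotes = !insideQuotes;
            } else {
                part.append(ch);
            }
        }
        // Flush whatever follows the last quote.
        String tail = part.toString();
        result.append(insideQuotes ? tail : replaceOutside.apply(tail));
        return result.toString();
    }

    public static void main(String[] args) {
        String line = "Abc \"test\" aBC : \"Abc aBC\"";
        System.out.println(rewrite(line, s -> s.replaceAll("[a-zA-Z]\\w+", "X")));
    }
}
```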
You can’t formulate a “within quotes” condition the way you might think. But you can easily search for unquoted words or quoted strings and take action only for the unquoted words:
Pattern p = Pattern.compile("\"[^\"]*\"|([a-zA-Z]\\w+)");
for(String s: lines) {
Matcher m=p.matcher(s);
while(m.find()) {
if(m.group(1)!=null) {
System.out.println("take action with "+m.group(1));
}
}
}
This utilizes the fact that each search for the next match starts at the end of the previous. So if you find a quoted string ("[^"]*") you don’t take any action and continue searching for other matches. Only if there is no match for a quoted string, the pattern looks for a word (([a-zA-Z]\w+)) and if one is found, the group 1 captures the word (will be non null).
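To actually rename the words rather than just print them, the same pattern can be combined with Matcher.replaceAll and a function (Java 9+). This is only a sketch: the class UnquotedWordRenamer and the lookup map of new names are assumptions, since the question does not say where the new names come from.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnquotedWordRenamer {
    // Quoted strings are matched first and kept; group 1 is set only for unquoted words.
    private static final Pattern QUOTED_OR_WORD =
            Pattern.compile("\"[^\"]*\"|([a-zA-Z]\\w+)");

    static String rename(String line, Map<String, String> newNames) {
        return QUOTED_OR_WORD.matcher(line).replaceAll(m -> {
            String word = m.group(1);
            // Quoted string (word == null): keep as is. Unknown words are also kept.
            String out = (word == null) ? m.group() : newNames.getOrDefault(word, word);
            return Matcher.quoteReplacement(out);
        });
    }

    public static void main(String[] args) {
        Map<String, String> names = Map.of("Abc", "A", "aBC", "B");
        System.out.println(rename("Abc \"test\" aBC : \"Abc aBC\"", names));
    }
}
```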

Capturing groups using If then else regular expression construct in java

I have an input string in the following format
String input = "00IG356001110002005064007000000";
Characters 3-7 is the code.
Characters 8-12 is the amount.
Based on the code in the input string (IG356 in the sample input string), i need to capture the amount(00111 in the sample).
The value in the amount (characters 8-12) should be picked up only for specific codes and the logic is detailed below.
The code should not be SG356. If it is SG356, not a match and exit.
a. If the code is not SG356, check if the codes are IG902 or SG350, in this case capture the amount(00111)
else
b. Check for the 3 numbers in the code (characters 5-7, 356 in this sample). If they are 200,201,356,370. go ahead and capture the amount
I am using the regular expression shown below:
Using positive lookahead and if then else construct.
String regex= ".{2}(?!SG356)((?=IG902|SG350).{5}(.{5}).+|.{2}(?=200|201|356|370).{3}(.{5}).+)";
The regular expression works fine if the code in the input string is IG902 or SG350 (when the 'if' part of the regex is getting matched). but if the 'else' is getting matched, i am unable to capture the amount.
This regular expression is working fine while just checking for a match.
.{2}(?!SG356)((?=IG902|SG350).+|.{2}(?=200|201|356|370).+)
The problem is only while capturing the group.
I am running this in Java. Any help would be greatly appreciated.
The java code i am using is :
public String getTsqlSum(String input, String regex){
String value = null;
Matcher m = Pattern.compile(regex).matcher(input);
System.out.println("Group Count: " + m.groupCount());
if (m.matches()) {
for (int i=0;i<m.groupCount();i++){
System.out.println("For i: " + i +" Value: " + m.group(i));
}
}
return value;
}
public void forumTest(){
//String input = "00IG902001110002005064007000000";
String input = "00IG356001110002005064007000000";
String regex= ".{2}(?!SG356)(?:(?=IG902|SG350).{5}|.{2}(?=200|201|356|370).{3})(.{5}).+";
System.out.println(match(input, regex));
String match = getTsqlSum(input, regex);
System.out.println("Match: " + match);
}
The regular expression works fine if the code in the input string is IG902 or SG350 (when the 'if' part of the regex is getting matched). but if the 'else' is getting matched, i am unable to capture the amount.
You are not unable to capture the amount, the expression is working fine. But if you are in the second part of the alternation (This is not a regex if-then-else) then your result is in a different capturing group. You will find it in the capturing group 3 and not in the second one like when you are matching in the first part of the alternation.
String regex= ".{2}(?!SG356)((?=IG902|SG350).{5}(.{5}).+|.{2}(?=200|201|356|370).{3}(.{5}).+)";
Group numbers: 1 is the outer (...) around the whole alternation, 2 is the first (.{5}), 3 is the second (.{5}).
In a regular expression the capturing groups are numbered by their opening brackets and this continues also in an alternation. In Perl there would be a construct that gives the capturing groups of an alternation the same number (the branch reset group, (?|...)), but I think that's the only flavour that is able to do this.
In Java you need to check after the expression in which group you have the result.
See my answer here, similar topic
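With the original regex left unchanged, checking which group holds the result could be sketched like this. The class AmountFromEitherGroup and the method amount are made-up names; the pattern is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmountFromEitherGroup {
    private static final Pattern P = Pattern.compile(
            ".{2}(?!SG356)((?=IG902|SG350).{5}(.{5}).+|.{2}(?=200|201|356|370).{3}(.{5}).+)");

    /** Returns the amount, or null if the input does not match. */
    static String amount(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // Group 2 is set by the first alternative, group 3 by the second.
        return m.group(2) != null ? m.group(2) : m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(amount("00IG356001110002005064007000000"));
        System.out.println(amount("00IG902001110002005064007000000"));
    }
}
```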
You can change your regex and make the alternation before the capturing group
try this
.{2}(?!SG356)(?:(?=IG902|SG350).{5}|.{2}(?=200|201|356|370).{3})(.{5}).+
You will find your result in both cases in the group 1. (I made the first one a non capturing group using the ?:)
Update after the source was added
Your loop is wrong: the capturing groups start at 1, and groupCount() does not include group 0, so the loop has to run from 1 to groupCount() inclusive. If you want the content of group one, you have to use m.group(1).
In group m.group(0) you will find the whole matched string.
Try this
for (int i=1;i<=m.groupCount();i++){
System.out.println("For i: " + i +" Value: " + m.group(i));
}
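Putting the restructured regex and the group lookup together, a named group makes the intent explicit and removes the need to loop over group numbers at all. This is a sketch: the group name amount and the class AmountExtractor are chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmountExtractor {
    // The alternation is non-capturing, so the amount is the only capturing group.
    private static final Pattern P = Pattern.compile(
            ".{2}(?!SG356)(?:(?=IG902|SG350).{5}|.{2}(?=200|201|356|370).{3})(?<amount>.{5}).+");

    /** Returns the amount, or null if the code is excluded or not recognised. */
    static String amount(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group("amount") : null;
    }

    public static void main(String[] args) {
        System.out.println("Match: " + amount("00IG356001110002005064007000000"));
        System.out.println("Match: " + amount("00SG356001110002005064007000000"));
    }
}
```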
