extract values from string with Regular Expression - java

I have this java code
String msg = "*1*20*11*30*IGNORE*53*40##";
String regex = "\\*1\\*(.*?)\\*11\\*(.*?)\\*(.*?)\\*53\\*(.*?)##";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(msg);
if (matcher.find()) {
for (int i = 0; i < matcher.groupCount(); i++) {
System.out.println(matcher.group((i+1)));
}
}
the output is
20
30
IGNORE
40
How do I have to change the regex so that the string IGNORE is ignored?
I want whatever is written in that position not to be captured by the matcher.
The positions holding 20, 30 and 40 contain the values I need to extract; IGNORE in my case is a protocol-specific counter that I have no use for.

Always ignore the 3rd parameter:
Simply don't create a capture (don't use parentheses).
\\*1\\*(.*?)\\*11\\*(.*?)\\*.*?\\*53\\*(.*?)##
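As a rough illustration of the non-capturing variant above, here is a minimal sketch. The class name SkipThirdField and the method extract are made up for this example; only the pattern comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkipThirdField {
    // Same pattern as in the question, except that the third field has no
    // parentheses: it is still matched, but no longer captured.
    private static final Pattern P =
            Pattern.compile("\\*1\\*(.*?)\\*11\\*(.*?)\\*.*?\\*53\\*(.*?)##");

    // Hypothetical helper: returns the captured values, or an empty list if nothing matches.
    static List<String> extract(String msg) {
        List<String> values = new ArrayList<>();
        Matcher m = P.matcher(msg);
        if (m.find()) {
            // Groups are numbered from 1; groupCount() is now 3 instead of 4.
            for (int i = 1; i <= m.groupCount(); i++) {
                values.add(m.group(i));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(extract("*1*20*11*30*IGNORE*53*40##")); // [20, 30, 40]
    }
}
```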
Ignore independently of position:
You need to capture the IGNORE part just like you're doing, and check in your loop if it needs to be ignored:
String msg = "*1*20*11*30*IGNORE*53*40##";
String regex = "\\*1\\*(.*?)\\*11\\*(.*?)\\*(.*?)\\*53\\*(.*?)##";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(msg);
if (matcher.find()) {
for (int i = 0; i < matcher.groupCount(); i++) {
if (!matcher.group(i+1).equals("IGNORE")) {
System.out.println(matcher.group(i+1));
}
}
}

You can use a tempered greedy token to make sure you do not get a match when IGNORE is in-between the 2nd and 3rd capture groups:
\\*1\\*(.*?)\\*11\\*(.*?)\\*(?:(?!IGNORE).)*\\*53\\*(.*?)##
In this case, the third field (which is no longer captured) cannot contain IGNORE anywhere.
The token is useful when you need to match the closest window between two subpatterns that does not contain some substring.
In case you just do not want the 3rd group to be equal to IGNORE, use a negative look-ahead:
\\*1\\*(.*?)\\*11\\*(.*?)\\*(?!IGNORE\\*)(.*?)\\*53\\*(.*?)##
                            ^^^^^^^^^^^^^
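To make the difference between the two patterns above concrete, here is a small sketch. The class and method names are invented for the example. Note that both patterns reject the whole message rather than skipping the field, which may or may not be what the question is after.

```java
import java.util.regex.Pattern;

class IgnoreChecks {
    // Tempered greedy token: the third field may not contain IGNORE anywhere.
    private static final Pattern TEMPERED = Pattern.compile(
            "\\*1\\*(.*?)\\*11\\*(.*?)\\*(?:(?!IGNORE).)*\\*53\\*(.*?)##");

    // Negative look-ahead: the third field may not be exactly IGNORE.
    private static final Pattern LOOKAHEAD = Pattern.compile(
            "\\*1\\*(.*?)\\*11\\*(.*?)\\*(?!IGNORE\\*)(.*?)\\*53\\*(.*?)##");

    static boolean tempered(String msg) {
        return TEMPERED.matcher(msg).find();
    }

    static boolean lookahead(String msg) {
        return LOOKAHEAD.matcher(msg).find();
    }

    public static void main(String[] args) {
        String exact = "*1*20*11*30*IGNORE*53*40##";
        String embedded = "*1*20*11*30*XIGNOREX*53*40##";
        String plain = "*1*20*11*30*7*53*40##";

        System.out.println(tempered(exact) + " " + lookahead(exact));       // false false
        System.out.println(tempered(embedded) + " " + lookahead(embedded)); // false true
        System.out.println(tempered(plain) + " " + lookahead(plain));       // true true
    }
}
```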

Split the input on * and treat IGNORE as an optional part of the delimiter, having first trimmed off the prefix and suffix:
String[] parts = msg.replaceAll("^\\*\\d\\*|##$","").split("(\\*IGNORE)?\\*\\d+\\*");
Some test code:
String msg = "*1*20*11*30*IGNORE*53*40##";
String[] parts = msg.replaceAll("^\\*\\d\\*|##$","").split("(\\*IGNORE)?\\*\\d+\\*");
System.out.println(Arrays.toString(parts));
Output:
[20, 30, 40]


Regex capturing groups within logical OR

I have a set of strings I need to parse and extract values from. They look like:
/apple/1212d3fe
/cat/23224a2f4
/auto/445478eefd
/somethingelse/1234fded
It should match only apple, cat and auto. The output I expect is:
1212, d3fe
23224, a2f4
445478, eefd
null
I need to come up with a regex capturing groups to do the same. I am able to extract the second part but not the first one. The closest I came up with is:
String r2 = "^/(apple/[0-9]{4}|cat/[0-9]{5}|auto/[0-9]{6})([a-f0-9]{4})$";
System.out.println(r2);
Pattern pattern2 = Pattern.compile(r2);
Matcher matcher2 = pattern2.matcher("/apple/2323efff");
if (matcher2.find()) {
System.out.println(matcher2.group(1));
System.out.println(matcher2.group(2));
}
UPDATED QUESTION:
I have a set of strings I need to parse and extract values from. They look like:
/apple/1212d3fe
/cat/23e24a2f4
/auto/df5478eefd
/somethingelse/1234fded
It should match only apple, cat and auto. The output I expect is everything after the 2nd '/', split as follows: the first 4 characters if 'apple', 5 characters if 'cat' and 6 characters if 'auto', then the rest, like:
1212, d3fe
23e24, a2f4
df5478, eefd
null
I need to come up with a regex capturing groups to do the same. I am able to extract the second part but not the first one. The closest I came up with is:
String r2 = "^/(apple/[0-9]{4}|cat/[0-9]{5}|auto/[0-9]{6})([a-f0-9]{4})$";
System.out.println(r2);
Pattern pattern2 = Pattern.compile(r2);
Matcher matcher2 = pattern2.matcher("/apple/2323efff");
if (matcher2.find()) {
System.out.println(matcher2.group(1));
System.out.println(matcher2.group(2));
}
I can do it without the regex OR(|) but it breaks when I include it. Any help with the right regex?
Updated Answer:
As per your updated question you can use this regex based on lookbehind assertions:
/((?<=apple/).{4}|(?<=cat/).{5}|(?<=auto/).{6})(.+)$
This regex uses 2 capture groups after matching /
In 1st group we have 3 lookbehind conditions with alternations.
(?<=apple/).{4} makes sure that we match 4 characters that have apple/ on the left-hand side. Likewise we match 5-character and 6-character strings that have cat/ and auto/ on the left.
In 2nd capture group we match remaining characters before end of line.
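For reference, the lookbehind pattern above can be tried out in Java roughly like this. It is a sketch only: the class name and the split helper are made up for the example, and the sample paths are the ones from the updated question.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixLengthSplit {
    // Group 1: 4, 5 or 6 characters depending on the preceding keyword.
    // Group 2: whatever is left up to the end of the input.
    private static final Pattern P = Pattern.compile(
            "/((?<=apple/).{4}|(?<=cat/).{5}|(?<=auto/).{6})(.+)$");

    // Hypothetical helper: returns {first, second} or null when the path is not recognised.
    static String[] split(String path) {
        Matcher m = P.matcher(path);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] paths = {
            "/apple/1212d3fe", "/cat/23e24a2f4", "/auto/df5478eefd", "/somethingelse/1234fded"
        };
        for (String path : paths) {
            System.out.println(Arrays.toString(split(path)));
        }
        // [1212, d3fe]
        // [23e24, a2f4]
        // [df5478, eefd]
        // null
    }
}
```

One caveat: the lookbehind only checks the text immediately before the value, so a path such as /pineapple/1212d3fe would also match. Adding the leading slash to each lookbehind, for example (?<=/apple/), is one way to tighten this.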
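You could use the regex /(?:apple|auto|cat)/(\d*)(.*). Note that the keywords must be alternatives inside a group; written as a character class, [apple|auto|cat]+ would match any run of those letters instead.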
If you want the last group to have exactly 4 hex characters you can use this regex:
/(apple|cat|auto)/([0-9a-f]+)([0-9a-f]{4})
Here is a working example:
List<String> strings = Arrays.asList("/apple/1212d3fe", "/cat/23224a2f4", "/auto/445478eefd");
Pattern pattern = Pattern.compile("/(apple|cat|auto)/([0-9a-f]+)([0-9a-f]{4})");
for (String string : strings) {
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
System.out.println(matcher.group(3));
}
}
If you want 4 characters after apple, 5 after cat and 6 after auto, you can split your algorithm into 2 parts:
List<String> strings = Arrays.asList("/apple/1212d3fe", "/cat/23224a2f4", "/auto/445478eefd", "/some/445478eefd");
Pattern firstPattern = Pattern.compile("/(apple|cat|auto)/([0-9a-f]+)");
for (String string : strings) {
Matcher firstMatcher = firstPattern.matcher(string);
if (firstMatcher.find()) {
String first = firstMatcher.group(1);
System.out.println(first);
int length = getLength(first);
Pattern secondPattern = Pattern.compile("([0-9a-f]{" + length + "})([0-9a-f]{4})");
Matcher secondMatcher = secondPattern.matcher(string);
if (secondMatcher.find()) {
System.out.println(secondMatcher.group(1));
System.out.println(secondMatcher.group(2));
}
}
}
private static int getLength(String key) {
switch (key) {
case "apple":
return 4;
case "cat":
return 5;
case "auto":
return 6;
}
throw new IllegalArgumentException("key not allowed");
}

Java regex except combination of symbols

I'm trying to find a substring that may contain any character but must not include the combination "[%".
As examples:
Input: atrololo[%trololo
Output: atrololo
Input: tro[tro%tro[%trololo
Output: tro[tro%tro
I already wrote a regex that takes any symbol except [ or %:
[A-Za-z-0-9\s!-$-&/:-#\\-`\{-~]*
I need to put something like [^("[%")] at the end of my expression, but I can't work out how it should be written.
You can check my regex at
https://www.regex101.com/
Use this as the test string:
sdfasdsdfasa##!55#321!2h/ хf[[[[[sds d
asgfdgsdf[[[%for (int i = 0; i < 5; i++){}%]
[% fo%][%r(int i = 0; i < 5; i++){ %]*[%}%]
[%for(int i = 0; i < 5; i++){%][%=i%][%}%]
[%#n%]<[%# n + m %]*[%#%]>[%#%]
%?s.equals(""TEST"")%]TRUE[%#3%]![%#%][%?%]
Kind regards.
You could use a negative lookahead based regex like below to get the part before the [%
^(?:(?!\[%).)*
(?:(?!\[%).)* matches any character, zero or more times, as long as that character does not start a [% sequence.
String s = "tro[tro%tro[%trololo";
Pattern regex = Pattern.compile("^(?:(?!\\[%).)*");
Matcher matcher = regex.matcher(s);
while(matcher.find()){
System.out.println(matcher.group()); // output : tro[tro%tro
}
OR
A lookahead based regex,
^.*?(?=\[%)
Pattern regex = Pattern.compile("^.*?(?=\\[%)");
OR
You could split the input string based on the regex \[% and get the parts you want.
String s = "tro[tro%tro[%trololo";
String[] part = s.split("\\[%");
System.out.println(part[0]); // output : tro[tro%tro
Using your input/output pairs as the spec:
String input; // the starting string
String output = input.replaceAll("\\[%.*", "");
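The approaches above agree when "[%" is present but differ when it is missing, which may matter depending on the input. A minimal sketch of that difference follows; the class and method names are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BeforeDelimiter {
    // Tempered pattern: consumes characters until a "[%" would start.
    private static final Pattern TEMPERED = Pattern.compile("^(?:(?!\\[%).)*");
    // Lazy match followed by a look-ahead: requires "[%" to be present.
    private static final Pattern LOOKAHEAD = Pattern.compile("^.*?(?=\\[%)");

    static String tempered(String s) {
        Matcher m = TEMPERED.matcher(s);
        return m.find() ? m.group() : null;
    }

    static String lookahead(String s) {
        Matcher m = LOOKAHEAD.matcher(s);
        return m.find() ? m.group() : null;
    }

    static String viaReplace(String s) {
        return s.replaceAll("\\[%.*", "");
    }

    public static void main(String[] args) {
        String with = "tro[tro%tro[%trololo";
        String without = "no delimiter";

        System.out.println(tempered(with));      // tro[tro%tro
        System.out.println(lookahead(with));     // tro[tro%tro
        System.out.println(viaReplace(with));    // tro[tro%tro

        System.out.println(tempered(without));   // no delimiter
        System.out.println(lookahead(without));  // null (no match at all)
        System.out.println(viaReplace(without)); // no delimiter
    }
}
```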

I can't get the first group of regex pattern in java

I'm trying to get the first group of a regex pattern.
I got this string from a lyric text:
[01:34][01:36]Blablablahh nanana
I'm using this regex pattern to extract [01:34], [01:36] and the text.
Pattern timeLine = Pattern.compile("(\\[\\d\\d:\\d\\d\\])+(.*)");
But when I try to extract the first group [01:34] using group(1), it returns [01:36].
Is there something wrong in the regex pattern?
Your problem is here
Pattern.compile("(\\[\\d\\d:\\d\\d\\])+(.*)");
                                      ^
This part of your pattern (\\[\\d\\d:\\d\\d\\])+ will match [01:34][01:36] because of + (which is greedy), but your group 1 can contain only one of [dd:dd] so it will store the last match found.
If you want to find only [01:34] you can correct your pattern by removing +. But you can also create simpler pattern
Pattern.compile("^\\[\\d\\d:\\d\\d\\]");
and use it with group(0) which is also called by group().
Pattern timeLine = Pattern.compile("^\\[\\d\\d:\\d\\d\\]");
Matcher m = timeLine.matcher("[01:34][01:36]Blablablahh nanana");
while (m.find()) {
System.out.println(m.group()); // prints [01:34]
}
In case you want to extract both [01:34][01:36] you can just add another parenthesis to your current regex like
Pattern.compile("((\\[\\d\\d:\\d\\d\\])+)(.*)");
This way entire match of (\\[\\d\\d:\\d\\d\\])+ will be in group 1.
You can also achieve it by removing (.*) from your original pattern and reading group 0.
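If the number of timestamps per line is not known in advance, one possible alternative (not from the answers here, just a sketch) is to loop with find() and the \G anchor, which only lets each match start where the previous one ended. The class name LyricLine and its helpers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LyricLine {
    // \G anchors each match at the end of the previous one (or at the start of
    // the input), so only the leading run of [mm:ss] stamps is collected.
    private static final Pattern STAMP = Pattern.compile("\\G\\[(\\d\\d:\\d\\d)\\]");

    static List<String> stamps(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = STAMP.matcher(line);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    static String text(String line) {
        Matcher m = STAMP.matcher(line);
        int end = 0;
        while (m.find()) {
            end = m.end(); // remember where the last leading stamp stops
        }
        return line.substring(end);
    }

    // The original pattern, for comparison: a repeated group keeps only its last iteration.
    static String lastRepeat(String line) {
        Matcher m = Pattern.compile("(\\[\\d\\d:\\d\\d\\])+(.*)").matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "[01:34][01:36]Blablablahh nanana";
        System.out.println(stamps(line));     // [01:34, 01:36]
        System.out.println(text(line));       // Blablablahh nanana
        System.out.println(lastRepeat(line)); // [01:36]
    }
}
```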
I think you are confused by the repeating match (\\[\\d\\d:\\d\\d\\])+ which returns just the last match as the group value. Try the following and see if it makes more sense to you:
String s = "[01:34][01:36]Blablablahh nanana";
Pattern timeLine = Pattern.compile("(\\[\\d\\d:\\d\\d\\])(\\[\\d\\d:\\d\\d\\])(.+)");
Matcher m = timeLine.matcher(s);
if (m.matches()) {
for (int i = 1; i <= m.groupCount(); i++) {
System.out.printf(" Group %d -> %s\n", i, m.group(i)); // prints each group with its index
}
}
which for me returns:
Group 1 -> [01:34]
Group 2 -> [01:36]
Group 3 -> Blablablahh nanana
I would simply grab the first part using a character class:
String timings = str.replaceAll("([\\[\\]\\d:]+).*", "$1");
And similarly the text (anchored with ^ so that digits or colons inside the lyric itself are left alone):
String text = str.replaceAll("^[\\[\\]\\d:]+", "");

Pattern Matcher Vs String Split, which should I use?

First time posting.
Firstly I know how to use both Pattern Matcher & String Split.
My question is: which is best for me to use in my example, and why?
Or suggestions for better alternatives.
Task:
I need to extract an unknown NOUN between two known regexp in an unknown string.
My Solution:
get the Start and End of the noun (from Regexp 1&2) and substring to extract the noun.
String line = "unknownXoooXNOUNXccccccXunknown";
int goal = 12 ;
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
I need to locate the index position AFTER the first regex.
I need to locate the index position BEFORE the second regex.
A) I can use pattern matcher
Pattern p = Pattern.compile(regexp1);
Matcher m = p.matcher(line);
if (m.find()) {
int afterRegex1 = m.end();
} else {
throw new IllegalArgumentException();
//TODO Exception Management;
}
B) I can use String Split
String[] split = line.split(regexp1,2);
if (split.length != 2) {
throw new UnsupportedOperationException();
//TODO Exception Management;
}
int afterRegex1 = line.indexOf(split[1]);
Which Approach should I use and why?
I don't know which is more efficient on time and memory.
Both are near enough as readable to myself.
I'd do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regex = "Xo+X(.*?)Xc+X";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(line);
if (m.find()) {
String noun = m.group(1);
}
The (.*?) is used to make the inner match on the NOUN reluctant. This protects us from a case where our ending pattern appears again in the unknown portion of the string.
EDIT
This works because the (.*?) defines a capture group. There's only one such group defined in the pattern, so it gets index 1 (the parameter to m.group(1)). These groups are indexed from left to right starting at 1. If the pattern were defined like this
String regex = "(Xo+X)(.*?)(Xc+X)";
Then there would be three capture groups, such that
m.group(1); // yields "XoooX"
m.group(2); // yields "NOUN"
m.group(3); // yields "XccccccX"
There is a group 0, but that matches the whole pattern, and it's equivalent to this
m.group(); // yields "XoooXNOUNXccccccX"
For more information about what you can do with the Matcher, including ways to get the start and end positions of your pattern within the source string, see the Matcher JavaDocs
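Since the question asked for index positions, here is a short sketch of reading them from the same capture group with Matcher.start(int) and Matcher.end(int). The class name and the locate helper are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NounLocator {
    private static final Pattern P = Pattern.compile("Xo+X(.*?)Xc+X");

    // Hypothetical helper: returns {start, end} of group 1, or null if there is no match.
    // The end index is exclusive, as with String.substring.
    static int[] locate(String line) {
        Matcher m = P.matcher(line);
        return m.find() ? new int[] { m.start(1), m.end(1) } : null;
    }

    public static void main(String[] args) {
        String line = "unknownXoooXNOUNXccccccXunknown";
        int[] pos = locate(line);
        System.out.println(pos[0] + ".." + pos[1]);           // 12..16
        System.out.println(line.substring(pos[0], pos[1]));   // NOUN
    }
}
```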
You should use String.split() for readability unless you're in a tight loop.
Per split()'s javadoc, split() does the equivalent of Pattern.compile(), which you can optimize away if you're in a tight loop.
It looks like you want to get a single occurrence. For this, simply do
input.replaceAll(".*Xo+X(.*)Xc+X.*", "$1")
For efficiency, compile the Pattern once and call pattern.matcher(input).replaceAll(...) instead.
In case your input contains line breaks, use Pattern.DOTALL or the embedded (?s) flag.
In case you want to use split, consider using Guava's Splitter. It behaves more sanely and also accepts a Pattern, which is good for speed.
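A minimal sketch of the precompiled variant with DOTALL follows. The class name and the extract helper are hypothetical; only the pattern and the "$1" replacement come from the answer.

```java
import java.util.regex.Pattern;

class NounExtractor {
    // Compiled once and reused; DOTALL lets '.' also match line breaks.
    private static final Pattern P =
            Pattern.compile(".*Xo+X(.*)Xc+X.*", Pattern.DOTALL);

    // Replaces the whole input with the captured noun.
    // If the pattern does not match, the input comes back unchanged.
    static String extract(String input) {
        return P.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(extract("unknownXoooXNOUNXccccccXunknown"));          // NOUN
        System.out.println(extract("first line\nXoooXNOUNXccccccX\nlast line")); // NOUN
        System.out.println(extract("nothing here"));                             // nothing here
    }
}
```

Because an unmatched input is returned as-is, a caller that needs to tell "no noun found" apart from a real result would probably be better served by the Matcher.find() approach.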
If you really need the locations you can do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
Matcher m=Pattern.compile(regexp1).matcher(line);
if(m.find())
{
int start=m.end();
if(m.usePattern(Pattern.compile(regexp2)).find())
{
final int end = m.start();
System.out.println("from "+start+" to "+end+" is "+line.substring(start, end));
}
}
But if you just need the word in between, I recommend the way Ian McLaird has shown.

Extracting a word containing a symbol from a string in Java

The basic idea is that I want to pull out any part of the string with the form "text1.text2". Some examples of the input and output of what I'd like to do would be:
"employee.first_name" ==> "employee.first_name"
"2 * employee.salary AS double_salary" ==> "employee.salary"
Thus far I have just .split(" ") and then found what I needed and .split("."). Is there any cleaner way?
I would go with an actual Pattern and an iterative find, instead of splitting the String.
For instance:
String test = "employee.first_name 2 * ... employee.salary AS double_salary blabla e.s blablabla";
// searching for a number of word characters or punctuation, followed by dot,
// followed by a number of word characters or punctuation
// note also we're avoiding the "..." pitfall
Pattern p = Pattern.compile("[\\w\\p{Punct}&&[^\\.]]+\\.[\\w\\p{Punct}&&[^\\.]]+");
Matcher m = p.matcher(test);
while (m.find()) {
System.out.println(m.group());
}
Output:
employee.first_name
employee.salary
e.s
Note: to simplify the Pattern, you could list only the punctuation allowed in your "."-separated words inside the character classes.
For instance:
Pattern p = Pattern.compile("[\\w_]+\\.[\\w_]+");
This way, foo.bar*2 would be matched as foo.bar
You need to use split to break the string into fragments. Then search for . in each of those fragments using the contains method to get the desired fragments:
Here you go:
public static void main(String args[]) {
String str = "2 * employee.salary AS double_salary";
String arr[] = str.split("\\s");
for (int i = 0; i < arr.length; i++) {
if (arr[i].contains(".")) {
System.out.println(arr[i]);
}
}
}
String mydata = "2 * employee.salary AS double_salary";
Pattern pattern = Pattern.compile("(\\w+\\.\\w+)");
Matcher matcher = pattern.matcher(mydata);
if (matcher.find())
{
System.out.println(matcher.group(1));
}
I'm not an expert in Java, but having used regex in Python and going by internet tutorials, I suggest using r'(\S*)\.(\S*)' as the pattern. I tried it in Python and it worked well on your example.
However, it misbehaves when there are several dots. If you try to match something like first.second.third, this pattern returns ('first.second', 'third') as the matched groups, because the first \S* is greedy and also matches dots.
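If names with more than one dot should come back as a single match, one option in Java (a sketch, not taken from the answers above) is to repeat the ".word" part. The class name DottedNames and the find helper are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DottedNames {
    // One word, followed by one or more ".word" parts.
    private static final Pattern P = Pattern.compile("\\w+(?:\\.\\w+)+");

    static List<String> find(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = P.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(find("2 * employee.salary AS double_salary")); // [employee.salary]
        System.out.println(find("first.second.third"));                   // [first.second.third]
        System.out.println(find("foo.bar*2"));                            // [foo.bar]
        System.out.println(find("a ... b"));                              // []
    }
}
```

Since \w also covers digits, a decimal literal such as 2.5 would match too; whether that matters depends on the expressions being parsed.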
