How can this .NET regex group-capture operation be written in Java?

How can this .NET regex group-capture operation be written in Java? - java

In .NET, if I want to match a sequence of characters against a pattern that describes capturing groups that occur any number of times, I could write something as follows:
String input = "a, bc, def, hijk";
String pattern = "(?<x>[^,]*)(,\\s*(?<y>[^,]*))*";
Match m = Regex.Match(input, pattern);
Console.WriteLine(m.Groups["x"].Value);
//the group "y" occurs 0 or more times per match
foreach (Capture c in m.Groups["y"].Captures)
{
Console.WriteLine(c.Value);
}
This code would print:
a
bc
def
hijk
That seems straightforward, but unfortunately the following Java code doesn't do what the .NET code does. (Which is expected, since java.util.regex doesn't seem to distinguish between groups and captures.)
String input = "a, bc, def, hijk";
Pattern pattern = Pattern.compile("(?<x>[^,]*)(,\\s*(?<y>[^,]*))*");
Matcher m = pattern.matcher(input);
while(m.find())
{
System.out.println(m.group("x"));
System.out.println(m.group("y"));
}
Prints:
a
hijk
null
Can someone please explain how to accomplish the same using Java, without having to re-write the regular expression or use external libraries?

What you want is not possible in java. When the same group has been matched several times, only the last occurrence of that group is saved. For more info read the Pattern docs section Groups and capturing. In java the Matcher/Pattern is used to iterate through a String in "real-time".
Example with repetition:
String input = "a1b2c3";
Pattern pattern = Pattern.compile("(?<x>.\\d)*");
Matcher matcher = pattern.matcher(input);
while(matcher.find())
{
System.out.println(matcher.group("x"));
}
Prints (null because the * matches the empty string too):
c3
null
Without:
String input = "a1b2c3";
Pattern pattern = Pattern.compile("(?<x>.\\d)");
Matcher matcher = pattern.matcher(input);
while(matcher.find())
{
System.out.println(matcher.group("x"));
}
Prints:
a1
b2
c3

You can use Pattern and Matcher classes in Java. It's slightly different. For example following code:
Pattern p = Pattern.compile("(el).*(wo)");
Matcher m = p.matcher("hello world");
while(m.find()) {
for(int i=1; i<=m.groupCount(); ++i) System.out.println(m.group(i));
}
Will print two strings:
el
wo

Related

Extracting a group from matched String in Java using regex

I have a list of String containing values like this:
String [] arr = {"${US.IDX_CA}", "${UK.IDX_IO}", "${NZ.IDX_BO}", "${JP.IDX_TK}", "${US.IDX_MT}", "more-elements-with-completely-different-patterns-which-is-irrelevant"};
I'm trying to extract all the IDX_XX from this list. So from above list, i should have, IDX_CA, IDX_IO, IDX_BO etc using regex in Java
I wrote following code:
Pattern pattern = Pattern.compile("(.*)IDX_(\\w{2})");
for (String s : arr){
Matcher m = pattern.matcher(s);
if (m.matches()){
String extract = m.group(1);
System.out.println(extract);
}
}
But this does not print anything. Can someone please tell me what mistake am i making. Thanks.

Use the following fix:
String [] arr = {"${US.IDX_CA}", "${UK.IDX_IO}", "${NZ.IDX_BO}", "${JP.IDX_TK}", "${US.IDX_MT}", "more-elements-with-completely-different-patterns-which-is-irrelevant"};
Pattern pattern = Pattern.compile("\\bIDX_(\\w{2})\\b");
for (String s : arr){
Matcher m = pattern.matcher(s);
while (m.find()){
System.out.println(m.group(0)); // Get the whole match
System.out.println(m.group(1)); // Get the 2 chars after IDX_
}
}
See the Java demo, output:
IDX_CA
CA
IDX_IO
IO
IDX_BO
BO
IDX_TK
TK
IDX_MT
MT
NOTES:
Use \bIDX_(\w{2})\b pattern that matches IDX_ and 2 word chars in between word boundaries and captures the 2 chars after IDX_ into Group 1
m.matches needs a full string match, so it is replaced with m.find()
if replaced with while in case there are more than 1 match in a string
m.group(0) contains the whole match values
m.group(1) contains the Group 1 values.

multiple regex matches in a string

i have the following text:
bla [string1] bli [string2]
I like to match string1 and string2 with regex in a loop in java.
Howto do ?
my code so far, which only matches the first string1, but not also string 2.
String sRegex="(?<=\\[).*?(?=\\])";
Pattern p = Pattern.compile(sRegex); // create the pattern only once,
Matcher m = p.matcher(sFormula);
if (m.find())
{
String sString1 = m.group(0);
String sString2 = m.group(1); // << no match
}

Your regex is not using any captured groups hence this call with throw exceptions:
m.group(1);
You can use just use:
String sRegex="(?<=\\[)[^]]*(?=\\])";
Pattern p = Pattern.compile(sRegex); // create the pattern only once,
Matcher m = p.matcher(sFormula);
while (m.find()) {
System.out.println( m.group() );
}
Also if should be replaced by while to match multiple times to return all matches.

Your approach is confused. Either write your regex so that it matches two [....] sequences in the one pattern, or call find multiple times. Your current attempt has a regex that "finds" just one [...] sequence.
Try something like this:
Pattern p = Pattern.compile("\\[([^\\]]+)]");
Matcher m = p.matcher(formula);
if (m.find()) {
String string1 = m.group(0);
if (m.find(m.end()) {
String string2 = m.group(0);
}
}
Or generalize using a loop and an array of String for the extracted strings.
(You don't need any fancy look-behind patterns in this case. And ugly "hungarian notation" is frowned in Java, so get out of the habit of using it.)

Extracting a word containing a symbol from a string in Java

The basic idea is that I want to pull out any part of the string with the form "text1.text2". Some examples of the input and output of what I'd like to do would be:
"employee.first_name" ==> "employee.first_name"
"2 * employee.salary AS double_salary" ==> "employee.salary"
Thus far I have just .split(" ") and then found what I needed and .split("."). Is there any cleaner way?

I would go with an actual Pattern and an iterative find, instead of splitting the String.
For instance:
String test = "employee.first_name 2 * ... employee.salary AS double_salary blabla e.s blablabla";
// searching for a number of word characters or puctuation, followed by dot,
// followed by a number of word characters or punctuation
// note also we're avoiding the "..." pitfall
Pattern p = Pattern.compile("[\\w\\p{Punct}&&[^\\.]]+\\.[\\w\\p{Punct}&&[^\\.]]+");
Matcher m = p.matcher(test);
while (m.find()) {
System.out.println(m.group());
}
Output:
employee.first_name
employee.salary
e.s
Note: to simplify the Pattern you could only list the allowed punctuation forming your "."-separated words in the categories
For instance:
Pattern p = Pattern.compile("[\\w_]+\\.[\\w_]+");
This way, foo.bar*2 would be matched as foo.bar

You need to make use of split to break the string into fragments.Then search for . in each of those fragments using contains method, to get the desired fragments:
Here you go:
public static void main(String args[]) {
String str = "2 * employee.salary AS double_salary";
String arr[] = str.split("\\s");
for (int i = 0; i < arr.length; i++) {
if (arr[i].contains(".")) {
System.out.println(arr[i]);
}
}
}

String mydata = "2 * employee.salary AS double_salary";
pattern = Pattern.compile("(\\w+\\.\\w+)");
Matcher matcher = pattern.matcher(mydata);
if (matcher.find())
{
System.out.println(matcher.group(1));
}

I'm not an expert in JAVA, but as I used regex in python and based on internet tutorials, I offer you to use r'(\S*)\.(\S*)' as the pattern. I tried it in python and it worked well in your example.
But if you want to use multiple dots continuously, it has a bug. I mean if you are trying to match something like first.second.third, this pattern identifies ('first.second', 'third') as the matched group and I think it relates to the best match strategy.

Regular Expression strings in Java

I want to use a regular expression that extracts a substring with the following properties in Java:
Beginning of the substring begins with 'WWW'
The end of the substring is a colon ':'
I have some experience in SQL with using the Like clause such as:
Select field1 from A where field2 like '%[A-Z]'
So if I were using SQL I would code:
like '%WWW%:'
How can I start this in Java?

Pattern p = Pattern.compile("WWW.*:");
Matcher m = p.matcher("zxdfefefefWWW837eghdehgfh:djf");
while (m.find()){
System.out.println(m.group());
}

Here's a different example using substring.
public static void main(String[] args) {
String example = "http://www.google.com:80";
String substring = example.substring(example.indexOf("www"), example.lastIndexOf(":"));
System.out.println(substring);
}

If you want to match only word character and ., then you may want to use the regular expression as "WWW[\\w.]+:"
Pattern p = Pattern.compile("WWW[\\w.]+:");
Matcher m = p.matcher("WWW.google.com:hello");
System.out.println(m.find()); //prints true
System.out.println(m.group()); // prints WWW.google.com:
If you want to match any character, then you may want to use the regular expression as "WWW[\\w\\W]+:"
Pattern p = Pattern.compile("WWW[\\w\\W]+:");
Matcher m = p.matcher("WWW.googgle_$#.com:hello");
System.out.println(m.find());
System.out.println(m.group());
Explanation: WWW and : are literals. \\w - any word character i.e. a-z A-Z 0-9. \\W - Any non word character.

If I understood it right
String input = "aWWW:bbbWWWa:WWW:aWWWaaa:WWWa:WWWabc:WWW:";
Pattern p = Pattern.compile("WWW[^(WWW)|^:]*:");
Matcher m = p.matcher(input);
while(m.find()) {
System.out.println(m.group());
}
Output:
WWW:
WWWa:
WWW:
WWWaaa:
WWWa:
WWWabc:
WWW:

Java regex doesn't find numbers

I'm trying to parse some text, but for some strange reason, Java regex doesn't work. For example, I've tried:
Pattern p = Pattern.compile("[A-Z][0-9]*,[0-9]*");
Matcher m = p.matcher("H3,4");
and it simply gives No match found exception, when I try to get the numbers m.group(1) and m.group(2). Am I missing something about how Java regex works?

Yes.
You must actually call matches() or find() on the matcher first.
Your regex must actually contain capturing groups
Example:
Pattern p = Pattern.compile("[A-Z](\\d*),(\\d*)");
matcher m = p.matcher("H3,4");
if (m.matches()) {
// use m.group(1), m.group(2) here
}

You also need the parenthesis to specify what is part of each group. I changed the leading part to be anything that's not a digit, 0 or more times. What's in each group is 1 or more digits. So, not * but + instead.
Pattern p = Pattern.compile("[^0-9]*([0-9]+),([0-9]+)");
Matcher m = p.matcher("H3,4");
if (m.matches())
{
String g1 = m.group(1);
String g2 = m.group(2);
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How can this .NET regex group-capture operation be written in Java? - java

Related

Extracting a group from matched String in Java using regex

multiple regex matches in a string

Extracting a word containing a symbol from a string in Java

Regular Expression strings in Java

Java regex doesn't find numbers

Categories

Resources