Replacing repeatedly occuring groups of an anchored regex in java

Replacing repeatedly occuring groups of an anchored regex in java - java

Using Java 7 and the default RegEx implementatiin in java.util.regex.Pattern, given a regex like this:
^start (m[aei]ddel[0-9] ?)+ tail$
And a string like this:
start maddel1 meddel2 middel3 tail
Is it possible to get an output like this using the anchored regex:
start <match> <match> <match> tail.
I can get every group without anchors like this:
Regex: m[aei]ddel[0-9]
StringBuffer sb = new StringBuffer();
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
matcher.appendReplacement(sb, Matcher.quoteReplacement("<middle>"));
}
The problem is that I'm working on a quite big dataset and being able to anchor the patterns would be a huge performance win.
However when I add the anchors the only API that I can find requires a whole match and accessing the last occurrence of the group. I my case I need to verify that the regex actually matches (i.e. a whole match), but in the replacement step I need to be able to access every group on it's own.
edit I'd like to avoid workarounds like looking for the anchors in a separate step because it would require bigger changes to the code and wrapping it all up in RegExes feels more elegant.

You can use \G for this:
final String regex = "(^start |(?<!^)\\G)m[aei]ddel[0-9] (?=.* tail$)";
final String str = "start maddel1 meddel2 middel3 tail";
String repl = str.replaceAll(regex, "$1<match> ");
//=> start <match> <match> <match> tail
RegEx Demo
\G asserts position at the end of the previous match or the start of the string for the first match.

To do it in one step, you need to use a \G based regex that will do the anchoring. However, you also need a positive lookahead to check if the string ends with the desired pattern.
Here is a regex that should work:
(^start|(?!\A)\G)\s+m[aei]ddel[0-9](?=(?:\s+m[aei]ddel[0-9])*\s+tail$)
See the regex demo
String s = "start maddel1 meddel2 middel3 tail";
String pat = "(^start|(?!\\A)\\G)\\s+(m[aei]ddel[0-9])(?=(?:\\s+m[aei]ddel[0-9])*\\s+tail$)";
System.out.println(s.replaceAll(pat, "$1 <middle>" ));
See the Java online demo
Explanation:
(^start|(?!\A)\G) - match start at the end of string or the end of the previous successful match
\s+ - 1 or more whitespaces
m[aei]ddel[0-9] - m, then either a, e, i, then ddel, then 1 digit
(?=(?:\s+m[aei]ddel[0-9])*\s+tail$) - only if followed with:
(?:\s+m[aei]ddel[0-9])* - zero or more sequences of 1+ whitespaces and middelN pattern
\s+ - 1 or more whitespaces
tail$ - tails substring followed with the end of string.

With the \G anchor, for the find method, you can write it this way:
pat = "\\G(?:(?!\\A) |\\Astart (?=(?:m[aei]ddel[0-9] )+tail\\z))(m\\S+)";
details:
\\G # position after the previous match or at the start of the string
# putting it in factor makes fail the pattern more quickly after the last match
(?:
(?!\\A) [ ] # a space not at the start of the string
# this branch is the first one because it has more chance to succeed
|
\\A start [ ] # "start " at the beginning of the string
(?=(?:m[aei]ddel[0-9] )+tail\\z) # check the string format once and for all
# since this branch will succeed only once
)
( # capture group 1
m\\S+ # the shortest and simplest pattern that matches "m[aei]ddel[0-9]"
# and excludes "tail" (adapt it to your need but keep the same idea)
)
demo

Related

Trying to match possible tags in string by regex

those are my possible inputs:
"#smoke"
"#smoke,#Functional1" (OR condition)
"#smoke,#Functional1,#Functional2" (OR condition)
"#smoke","#Functional1" (AND condition),
"#smoke","~#Functional1" (SKIP condition),
"~#smoke","~#Functional1" (NOT condition)
(Please note, the string input for the regex, stops at the last " character on each line, no space or comma follows it!
The regex I came up with so far is
"((?:[~#]{1}\w*)+),?"
This matches in capturing groups for the samples 1, 4, 5 and 6 but NOT 2 and 3.
I am not sure how to continue tweaking it further, any suggestions?
I would like to capture the preceding boolean meaning of the tag (eg: ~) as well please.
If you have any suggestions to pre-process the string in Java before regex that would make it simpler, I am open to that possibility as well.
Thanks.

It seems that you want to match an optional ~ followed by an # and get iterative matches for group 1. You could make use of the \G anchors, which matches either at the start, or at the end of the previous match.
(?:"(?=.*"$)|\G(?!^))(~?#\w+(?:,~?#\w+)*)"?[,\h]?
Explanation
(?: Non capture group
"(?=.*"$) Match " and assert that the string ends with "
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start
) Close non capture group
( Capture group 1
~?#\w+(?:,~?#\w+)* Match an optional ~, than # and 1+ word characters and repeat 0+ times with a comma prepended
)"? Close group 1 and match an optional "
[,\h] Match either a comma or a horizontal whitespace char.
Regex demo | Java demo
Example code
String regex = "(?:\"(?=.*\"$)|\\G(?!^))(~?#\\w+(?:,~?#\\w+)*)\"?[,\\h]?";
String string = "\"#smoke\"\n"
+ "\"#smoke,#Functional1\"\n"
+ "\"#smoke,#Functional1,#Functional2\"\n"
+ "\"#smoke\",\"#Functional1\"\n"
+ "\"#smoke\",\"~#Functional1\"\n"
+ "\"~#smoke\",\"~#Functional1\"";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
Output
#smoke
#smoke,#Functional1
#smoke,#Functional1,#Functional2
#smoke
#Functional1
#smoke
~#Functional1
~#smoke
~#Functional1
Edit
If there are no consecutive matches, you could also use:
"(~?#\w+(?:,~?#\w+)*)"
Regex demo

regular expression retrieve a portion of a string that not contain a string

I have some strings like the following:
it.mycompany.db.beans.str1.PD_T_CLASS
it.mycompany.db.beans.join.PD_T_CLASS
it.mycompany.db.beans.str2.PD_T_CLASS_1
it.mycompany.db.beans.join.PD_T_CLASS_1
PD_T_CLASS myVar = new PD_T_CLASS();
myVar.setPD_T_CLASS(something);
and I want to select "PD_" part to substitute it with "" (the void string) but only inf the entire line does not contain the string ".join."
what I want to achieve is:
it.mycompany.db.beans.str1.T_CLASS
it.mycompany.db.beans.join.PD_T_CLASS
it.mycompany.db.beans.str2.T_CLASS_1
it.mycompany.db.beans.join.PD_T_CLASS_1
T_CLASS myVar = new T_CLASS();
myVar.setT_CLASS(something);
The substitution is not a problem since I'm using eclipse search tool and will hit replace as soon as it show me the right result.
I have tried:
^((?!\.join\.).)*(PD_)*$ // whole string selected
^((?!\.join\.).)*(\bPD_\b)*$ // whole string selected
I start getting frustrated since I've searched a bit around (the ^((?!join bla bla come from those searches)
Can you help me?

You may use the following regex:
(?m)(?:\G(?!\A)|^(?!.*\.join\.))(.*?)PD_
and replace with
$1
See the regex demo
Details:
(?m) - a Pattern.MULTILINE inline modifier flag that will force ^ to match the beginning of a line rather than a whole string
(?:\G(?!\A)|^(?!.*\.join\.)) - either of the two alternatives:
\G(?!\A) - the end of the previous successful match
| - or
^(?!.*\.join\.) - start of a line that has no .join. text in it (as the (?!.*\.join\.) is a negative lookahead that will fail the match if it matches any 0+ chars other than line break chars (.*) and then .join.)
(.*?) - Capturing group #1 (referred to with the $1 backreference in the replacement pattern): any 0+ chars other than line breaks, as few as possible, up to the first occurrence of ...
PD_ - a literal PD_
The replacement is a $1 backreference to the first capturing group that will restore any text matched before PD_s.

Regex in Java: Capture last {n} words

Hi I am trying to do regex in java, I need to capture the last {n} words. (There may be a variable num of whitespaces between words). Requirement is it has to be done in regex.
So e.g. in
The man is very tall.
For n = 2, I need to capture
very tall.
So I tried
(\S*\s*){2}$
But this does not match in java because the initial words have to be consumed first. So I tried
^(.*)(\S*\s*){2}$
But .* consumes everything, and the last 2 words are ignored.
I have also tried
^\S?\s?(\S*\s*){2}$
Anyone know a way around this please?

You had almost got it in your first attempt.
Just change + to *.
The plus sign means at least one character, because there wasn't any space the match had failed.
On the other hand the asterisk means from zero to more, so it will work.
Look it live here: (?:\S*\s*){2}$
Using replaceAll method, you could try this regex: ((?:\\S*\\s*){2}$)|.

Your regex contains - as you already mention - a greedy subpattern that eats up the whole string and sine (\S*\s*){2} can match an empty string, it matches an empty location at the end of the input string.
Lazy dot matching (changing .* to .*?) won't do the whole job since the capturing group is quantified, and the Matcher.group(1) will be set to the last captured non-whitespaces with optional whitespaces. You need to set the capturing group around the quantified group.
Since you most likely are using Matcher#matches, you can use
String str = "The man is very tall.";
Pattern ptrn = Pattern.compile("(.*?)((?:\\S*\\s*){2})"); // no need for `^`/`$` with matches()
Matcher matcher = ptrn.matcher(str);
if (matcher.matches()) { // Group 2 contains the last 2 "words"
System.out.println(matcher.group(2)); // => very tall.
}
See IDEONE demo

How to match the middle character in a string with regex?

In an odd number length string, how could you match (or capture) the middle character?
Is this possible with PCRE, plain Perl or Java regex flavors?
With .NET regex you could use balancing groups to solve it easily (that could be a good example). By plain Perl regex I mean not using any code constructs like (??{ ... }), with which you could run any code and of course do anything.
The string could be of any odd number length.
For example in the string 12345 you would want to get the 3, the character at the center of the string.
This is a question about the possibilities of modern regex flavors and not about the best algorithm to do that in some other way.

With PCRE and Perl (and probably Java) you could use:
^(?:.(?=.*?(?(1)(?=.\1$))(.\1?$)))*(.)
which would capture the middle character of odd length strings in the 2nd capturing group.
Explained:
^ # beginning of the string
(?: # loop
. # match a single character
(?=
# non-greedy lookahead to towards the end of string
.*?
# if we already have captured the end of the string (skip the first iteration)
(?(1)
# make sure we do not go past the correct position
(?= .\1$ )
)
# capture the end of the string +1 character, adding to \1 every iteration
( .\1?$ )
)
)* # repeat
# the middle character follows, capture it
(.)

Hmm, maybe someone can come up with a pure regex solution, but if not you could always dynamically build the regex like this:
public static void main(String[] args) throws Exception {
String s = "12345";
String regex = String.format(".{%d}3.{%d}", s.length() / 2, s.length() / 2);
Pattern p = Pattern.compile(regex);
System.out.println(p.matcher(s).matches());
}

Simple regex to match strings containing <n> chars

I'm writing this regexp as i need a method to find strings that does not have n dots,
I though that negative look ahead would be the best choice, so far my regexp is:
"^(?!\\.{3})$"
The way i read this is, between start and end of the string, there can be more or less then 3 dots but not 3.
Surprisingly for me this is not matching hello.here.im.greetings
Which instead i would expect to match.
I'm writing in Java so its a Perl like flavor, i'm not escaping the curly braces as its not needed in Java
Any advice?

You're on the right track:
"^(?!(?:[^.]*\\.){3}[^.]*$)"
will work as expected.
Your regex means
^ # Match the start of the string
(?!\\.{3}) # Make sure that there aren't three dots at the current position
$ # Match the end of the string
so it could only ever match the empty string.
My regex means:
^ # Match the start of the string
(?! # Make sure it's impossible to match...
(?: # the following:
[^.]* # any number of characters except dots
\\. # followed by a dot
){3} # exactly three times.
[^.]* # Now match only non-dot characters
$ # until the end of the string.
) # End of lookahead
Use it as follows:
Pattern regex = Pattern.compile("^(?!(?:[^.]*\\.){3}[^.]*$)");
Matcher regexMatcher = regex.matcher(subjectString);
foundMatch = regexMatcher.find();

Your regular expression only matches 'not' three consecutive dots. Your example seems to show you want to 'not' match 3 dots anywhere in the sentence.
Try this: ^(?!(?:.*\\.){3})
Demo+explanation: http://regex101.com/r/bS0qW1
Check out Tims answer instead.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Replacing repeatedly occuring groups of an anchored regex in java - java

Related

Trying to match possible tags in string by regex

regular expression retrieve a portion of a string that not contain a string

Regex in Java: Capture last {n} words

How to match the middle character in a string with regex?

Simple regex to match strings containing <n> chars

Categories

Resources