How to appendReplacement on a Matcher group instead of the whole pattern? - java

I am using a while(matcher.find()) to loop through all of the matches of a Pattern. For each instance or match of that pattern it finds, I want to replace matcher.group(3) with some new text. This text will be different for each one so I am using matcher.appendReplacement() to rebuild the original string with the new changes as it goes through. However, appendReplacement() replaces the entire Pattern instead of just the group.
How can I do this but only modify the third group of the match rather than the entire Pattern?
Here is some example code:
Pattern pattern = Pattern.compile("THE (REGEX) (EXPRESSION) (WITH MULTIPLE) GROUPS");
Matcher matcher = pattern.matcher("THE TEXT TO SEARCH AND MODIFY");
StringBuffer buffer = new StringBuffer();
while(matcher.find()){
matcher.appendReplacement(buffer, processTheGroup(matcher.group(3)));
}
matcher.appendTail(buffer);
but I would like to do something like this (obviously this doesn't work).
...
while(matcher.find()){
matcher.group(3).appendReplacement(buffer, processTheGroup(matcher.group(3)));
}
Something like that, where it only replaces a certain group, not the whole Pattern.
EDIT: changed the regex example to show that not all of the pattern is grouped.

I see this already has an accepted answer, but it is not fully correct. The correct answer appears to be something like this:
m.appendReplacement(buffer, "$1" + process(m.group(2)) + "$3");
This also illustrates that "$" is a special character in the replacement string passed to appendReplacement (as is "\"). Therefore you must make sure that every "$" in the output of your "process()" function is escaped as "\$". Matcher.quoteReplacement(replacementString) will do this for you (thanks @Med).
The previous accepted answer will fail if either groups 1 or 3 happen to contain a "$". You'll end up with "java.lang.IllegalArgumentException: Illegal group reference"
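For illustration, here is a minimal, self-contained sketch of that approach. The pattern, the input string and the process() helper are made up for this example and are not taken from the question; process() deliberately returns a "$" so that the effect of Matcher.quoteReplacement is visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceMiddleGroup {
    // Hypothetical stand-in for the per-match processing step.
    // It returns a '$' on purpose to show why quoting is needed.
    static String process(String s) {
        return "$" + s.toUpperCase();
    }

    static String replaceGroup2(String input) {
        // Assumed example pattern: key=, value, terminating semicolon.
        Pattern p = Pattern.compile("(\\w+=)(\\w+)(;)");
        Matcher m = p.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // $1 and $3 are group references; the processed text is quoted
            // so that any '$' or '\' in it is treated literally.
            m.appendReplacement(sb,
                    "$1" + Matcher.quoteReplacement(process(m.group(2))) + "$3");
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceGroup2("a=one; b=two; tail"));
        // prints: a=$ONE; b=$TWO; tail
    }
}
```

Without the quoteReplacement call, the "$" returned by process() would be read as the start of a group reference and the loop would throw.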

Let's say your entire pattern matches "(prefix)(infix)(suffix)", capturing the 3 parts into groups 1, 2 and 3 respectively. Now let's say you want to replace only group 2 (the infix), leaving the prefix and suffix intact the way they were.
Then what you do is you append what group(1) matched (unaltered), the new replacement for group(2), and what group(3) matched (unaltered), so something like this:
matcher.appendReplacement(
buffer,
matcher.group(1) + processTheGroup(matcher.group(2)) + matcher.group(3)
);
This will still match and replace the entire pattern, but since groups 1 and 3 are left untouched, effectively only the infix is replaced.
You should be able to adapt the same basic technique for your particular scenario.
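As an alternative sketch (a different technique from the one described above, not something the answer proposes), the group's start and end offsets can be used to splice the replacement in directly. This sidesteps the replacement-string syntax entirely, so "$" and "\" need no escaping, and the other groups do not have to be re-emitted. The helper name replaceGroup and the sample pattern are assumptions for this example.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupSplice {
    // Hypothetical helper: rewrites only the given group of every match.
    static String replaceGroup(String input, Pattern p, int group,
                               UnaryOperator<String> fn) {
        Matcher m = p.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the text copied so far
        while (m.find()) {
            if (m.start(group) < 0) {
                continue; // group did not participate in this match
            }
            out.append(input, last, m.start(group)); // text before the group
            out.append(fn.apply(m.group(group)));     // processed group text
            last = m.end(group);
        }
        out.append(input, last, input.length());      // remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(price: )(\\$\\d+)( USD)");
        System.out.println(
                replaceGroup("price: $5 USD, price: $7 USD", p, 2, s -> "[" + s + "]"));
        // prints: price: [$5] USD, price: [$7] USD
    }
}
```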

Related

Matching regex groups with Java

I am trying to split a line with regex by using groups, but it's not working as I expected.
I want to match for example this line:
Ex. #1: temp name(this is the data)
and also this:
Ex. #2: temp name()
I used this regex:
[\s]*temp[\s]+[\s]*([A-Za-z]+)[\s]*[(]\s*(.*)+[)]\s*[{]\s*
which means: grab anything that starts with temp then put in group #1 the "name" then grab whatever inside the bracket and put it in group #2.
However, group #2 is always empty.
This is my code to fetch the data:
Pattern PATTERN = Pattern.compile("[\\s]*temp[\\s]+[\\s]*([A-Za-z]+)[\\s]*[(]\\s*(.*)+[)]\\s*");
Matcher m = PATTERN.matcher("temp name(this is the data)");
m.matches();
String name = m.group(1);
String data = m.group(2); // always empty
What am I doing wrong?
Your pattern doesn't match because it requires an open curly brace at the end, but your input doesn't have one.
Ignoring that small problem, the main problem is the little + after your capture group (.*)+. The plus requires one or more matches of .* and the group returned is the last match of the many. The term .* is greedy, so it consumes everything up to the bracket. The only way to match again is to consume nothing. So the last match of group 2 is blank.
To fix it, remove the + after group 2:
Pattern PATTERN = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
Note also how I removed other unnecessary characters from your regex, e.g. the single-character character classes: [\\s] is identical to \\s. And \\s+\\s* is identical to just \\s+, because + is greedy.
I also removed the trailing curly bracket, which you can restore if your input data actually has it (your question showed input of "temp name(this is the data)", which has no trailing curly bracket).
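To see the difference side by side, the following sketch runs both variants of the pattern (with and without the + after group 2) against the input from the question. The class and method names are made up for this example; the empty result for the quantified variant is the behaviour the question reports.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // Returns group 2 if the whole input matches, otherwise null.
    static String group2(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "temp name(this is the data)";
        // Group 2 is quantified: only its last (empty) iteration is kept.
        String quantified = "\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)+[)]\\s*";
        // Group 2 is matched once and keeps the text between the brackets.
        String plain = "\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*";

        System.out.println("[" + group2(quantified, input) + "]"); // []
        System.out.println("[" + group2(plain, input) + "]");      // [this is the data]
    }
}
```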
Your regex should be this:
Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
You had (.*)+ which means one or more matches of .*. This results in nothing being captured.
Testing:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Example {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
Matcher m = pattern.matcher("temp name(this is the data)");
if(m.matches()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
}
}
}
Output:
name
this is the data
[\s] is equivalent with \s
[\s]+[\s]* is equivalent with \s+
[(] is equivalent with \( (same for [)] and [}])
This would leave your regexp as:
\s*temp\s+([A-Za-z]+)\s*\(\s*(.*)+\)\s*\{\s*
Assuming you actually want to match temp name(...) { (your regexp is looking for a {, while in your question you do not specify that):
(.*)+ is your problem. You're saying: "Match any number (including 0) of characters and put them in a capture group, repeat that at least once".
Regexps are greedy by default (= they consume as much as possible), so the capture group will first contain everything between the two brackets, then the + will try to match the entire group again, and will match it with "" (the empty string) as this fulfils the capture group's pattern. This will leave your capture group empty.
What you want instead is \s*temp\s+([A-Za-z]+)\s*\(\s*(.*)\)\s*\{\s*
The reason you are getting empty groups is because you are creating multiple capture groups every time you put something between (), even if it is nested.
To make a group that doesn't capture, you can designate it as a non-capturing group by using ?:. For example, (?:sometest(this is the value we want)) will return just one group, while (sometest(this is the value we want)) will return 2 groups.
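The group counts can be checked with Matcher.groupCount(), which reports how many capturing groups a pattern defines. A small sketch, with a made-up class name and shortened patterns:

```java
import java.util.regex.Pattern;

public class GroupCountDemo {
    // Number of capturing groups defined by the pattern.
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(countGroups("(sometest(value))"));   // 2
        System.out.println(countGroups("(?:sometest(value))")); // 1
    }
}
```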
For your particular regex, I have refined and simplified it, as you had capture groups you did not need.
Simple solution:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*\\{\\s*
given the input:
Ex. #1: temp name(this is the data) {
Ex. #2: temp name() {
$1 = name, $2 = data
Pay attention to the fact that your regex contains a trailing curly brace. You can modify the regex to match without it and it will result in this:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*
https://regex101.com/r/tD0tO0/1

Regex in Java: Capture last {n} words

Hi I am trying to do regex in java, I need to capture the last {n} words. (There may be a variable num of whitespaces between words). Requirement is it has to be done in regex.
So e.g. in
The man is very tall.
For n = 2, I need to capture
very tall.
So I tried
(\S*\s*){2}$
But this does not match in java because the initial words have to be consumed first. So I tried
^(.*)(\S*\s*){2}$
But .* consumes everything, and the last 2 words are ignored.
I have also tried
^\S?\s?(\S*\s*){2}$
Anyone know a way around this please?
You had almost got it in your first attempt.
Just change + to *.
The plus sign requires at least one character; because there is no whitespace after the last word, the match fails.
The asterisk, on the other hand, means zero or more, so it will work.
See it live here: (?:\S*\s*){2}$
Using replaceAll method, you could try this regex: ((?:\\S*\\s*){2}$)|.
Your regex contains, as you already mention, a greedy subpattern that eats up the whole string, and since (\S*\s*){2} can match an empty string, it matches an empty location at the end of the input string.
Lazy dot matching (changing .* to .*?) won't do the whole job since the capturing group is quantified, and the Matcher.group(1) will be set to the last captured non-whitespaces with optional whitespaces. You need to set the capturing group around the quantified group.
Since you most likely are using Matcher#matches, you can use
String str = "The man is very tall.";
Pattern ptrn = Pattern.compile("(.*?)((?:\\S*\\s*){2})"); // no need for `^`/`$` with matches()
Matcher matcher = ptrn.matcher(str);
if (matcher.matches()) { // Group 2 contains the last 2 "words"
System.out.println(matcher.group(2)); // => very tall.
}
See IDEONE demo
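The same idea can be wrapped in a small helper that takes n as a parameter. This is only a sketch; the method name lastWords is made up, and it assumes the whole input is on a single line (the . in the prefix does not match line terminators by default).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastWords {
    // Returns the last n whitespace-separated "words" of the input,
    // including the whitespace between them, or null if nothing matches.
    static String lastWords(String input, int n) {
        // Lazy prefix, then exactly n repetitions of word + optional spaces.
        Pattern p = Pattern.compile("(.*?)((?:\\S*\\s*){" + n + "})");
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(lastWords("The man is very tall.", 2)); // very tall.
        System.out.println(lastWords("The man is very tall.", 3)); // is very tall.
    }
}
```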

Regex matches with multiple patterns

I'm trying to compile one Java Regex pattern but have that pattern look for 3 different matches. I've learned that I can do that using the pipe (|) but I'm having trouble with the actual syntax of the regex.
I'm looking through XML data and trying to pull out 3 matches. The XML will look something like this:
<Element createdOn="1405358703367" updatedOn="1405358718804" url="http://www.someurl.com" />
The regex I'm trying looks like this so far:
((?<="url": ").*(?=")) | (createdOn="(\d)") | (updatedOn="(\d)")
In the end I need to get everything between the quotes in the XML (i.e. 1405358703367, 1405358718804, and http://www.someurl.com).
I had the URL regex working on its own earlier, but there seems to be no matches being made.
Thanks.
Get the matched group from index 2.
(url|createdOn|updatedOn)="([^"]*)"
Here is sample code:
String string = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String patternString = "(url|createdOn|updatedOn)=\"([^\"]*)\"";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(2));
}
output:
1405358703367
1405358718804
http://www.someurl.com
Java doesn't have a library method that extracts matches, but you only need one line:
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
This works by stripping off leading and trailing input up to/from the first/last quote, then splits on quote-nonquote-quote input, leaving the target matches as an array.
In action:
String input = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
System.out.println(Arrays.toString(matches));
Output:
[1405358703367, 1405358718804, http://www.someurl.com]
The pipe (|) is used to find a match that could be some-pattern OR some-other-pattern OR yet-another-pattern. It's not good at finding all occurrences of several patterns. To do that, if the patterns you're looking for aren't necessarily in a fixed order, you'll need to use a loop.
Here's a code example that starts with the pattern you tried, fixes some problems, and uses a loop to find the patterns:
Pattern p = Pattern.compile("((?<=url=\").*(?=\"))|(createdOn=\"(\\d+)\")|(updatedOn=\"(\\d+)\")");
Matcher m = p.matcher(source);
while (m.find()) {
System.out.println("Found: "+m.group());
System.out.println("Group 1: "+m.group(1));
System.out.println("Group 3: "+m.group(3));
System.out.println("Group 5: "+m.group(5));
}
(Some problems with your original pattern: You put space characters before and after each |, which are treated literally and mean the pattern has to match spaces that aren't there. I added + after \\d because you want to match more than one digit. There were some mistakes, like putting : after url instead of =.)
Now the code uses a loop to find each successive pattern that matches one of the patterns you're looking for. It matches either url=... or createdOn=... or updatedOn=..., but by using a loop we will find all of them. (Note that it doesn't care if it sees a url or a createdOn attribute twice in the source. You'll have to check that yourself.)
The group() method with no parameters will return whatever was matched by the pattern. group(1), group(3), and group(5) return certain subsections of the pattern; the numbers are determined by counting wherever you use ( in the pattern except for (?. So group 1 matches something using url as a lookbehind; group 2 starts with createdOn; group 3 is the sequence of digits following createdOn; group 4 starts with updatedOn, etc. The way the pattern is set up, not all of these will have values, since only one of the three alternatives separated by | will match. The rest will be null. As a result, the output of the above code will display null for two of the groups, and a useful value for the other. If you do things this way, you'll need to test for null to see which value actually got returned.
This would also be a case where named capturing groups could be useful. See http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html. Matcher has a group(name) function that takes a group name as a parameter.
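As a rough sketch of what named groups could look like here: the group names "name" and "value" and the class name are chosen for this example only, and the pattern is a simplified one that captures the attribute name and its quoted value rather than the three-way alternation above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupsDemo {
    // (?<name>...) and (?<value>...) are named capturing groups.
    private static final Pattern ATTR = Pattern.compile(
            "(?<name>url|createdOn|updatedOn)=\"(?<value>[^\"]*)\"");

    // Collects the three attributes in the order they appear.
    static Map<String, String> attributes(String source) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(source);
        while (m.find()) {
            result.put(m.group("name"), m.group("value"));
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "<Element createdOn=\"1405358703367\" "
                + "updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
        System.out.println(attributes(source));
        // {createdOn=1405358703367, updatedOn=1405358718804, url=http://www.someurl.com}
    }
}
```

Looking groups up by name avoids having to count parentheses and test numbered groups for null.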
This is one approach, but there are always multiple approaches to string parsing, and the other answers posted here are valid also. Plus there are already XML parsers to take care of things like this for you.
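For comparison, a minimal sketch of the XML-parser route using the JDK's built-in DOM parser. The class and method names are made up, and it assumes the input is a single well-formed element like the one in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class XmlAttributes {
    // Parses the XML and returns the named attribute of the root element.
    static String attribute(String xml, String name) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return root.getAttribute(name);
        } catch (Exception e) {
            // Wrap parser and I/O errors so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Element createdOn=\"1405358703367\" "
                + "updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
        System.out.println(attribute(xml, "createdOn")); // 1405358703367
        System.out.println(attribute(xml, "url"));       // http://www.someurl.com
    }
}
```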
NOTE: This answer was meant to point out how | works. I don't recommend actually doing things this way, since it's overly complicated. If you're going to look separately for each attribute, it would be simpler just to set up three patterns and look for each one, one at a time. Or use @braj's suggestion in a loop, and perform a later check to make sure the createdOn and updatedOn values are numeric.

How to wrap (surround) java matcher groups with xml?

Using the following value of a text node...
MatcH one MatcHer two MarcH three
How can java matcher.find() be used to create the following output?
<wrap>MatcH</wrap> one MatcHer two <wrap>MarcH</wrap> three
Assuming a java regex that captures all words starting with capital 'M' and ending with a capital 'H'
\bM\w*H\b
Basically, I want to surround anything that matches this regex with wrap tags
String text = "MatcH one MatcHer two MarcH three";
Pattern pattern = Pattern.compile("\\bM\\w*H\\b");
Matcher matcher = pattern.matcher(text);
// replace each time the regex is found
while (matcher.find()) {
text = text.replaceAll(matcher.group(), "<wrap>"
+ matcher.group() + "</wrap>");
}
ReplaceFirst/ReplaceAll is not working for me because it results in the following...
<wrap>MatcH</wrap> one <wrap>MatcH</wrap>er two <wrap>MarcH</wrap> three
Thanks in advance...
The problem is that you call replaceAll inside the loop: the matcher finds MatcH, and replaceAll then replaces every occurrence of the text MatcH, including the one inside MatcHer. Note that the \\b word boundaries are not part of what group() returns, so nothing prevents the replacement from touching MatcHer.
You can put a System.out.println inside the loop to print the output of group and the output of replaceAll to see what happens and why it does what it does.
Simplifying your code to just the below will work: (that's probably "hard-coding match numbers" but I don't really see a problem with that as it stands and I don't see a simpler solution)
String text = "MatcH one MatcHer two MarcH three";
text = text.replaceAll("\\b(M\\w*H)\\b", "<wrap>$1</wrap>");
The above is how regex is supposed to work. If you see that problems may arise in future using something similar to the above, regex may not be the way to go.
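If an explicit matcher.find() loop is preferred, as the question asks, a sketch along these lines should produce the same result. It uses appendReplacement and appendTail so that each match is replaced in place rather than by searching the text again; the class and method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WrapMatches {
    // Wraps every match of the pattern in <wrap>...</wrap> tags.
    static String wrap(String text) {
        Matcher m = Pattern.compile("\\bM\\w*H\\b").matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Quote the replacement so '$' or '\' in a match stay literal.
            m.appendReplacement(sb,
                    Matcher.quoteReplacement("<wrap>" + m.group() + "</wrap>"));
        }
        m.appendTail(sb); // copy the text after the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrap("MatcH one MatcHer two MarcH three"));
        // <wrap>MatcH</wrap> one MatcHer two <wrap>MarcH</wrap> three
    }
}
```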

How can i add multiple match conditions in a regex

I have a String like this : String x = "return function ('ABC','DEF')";
I am using this:
Pattern pattern = Pattern.compile("'(.*?)'");
Matcher matcher = pattern.matcher(formula);
while (matcher.find()) {
System.out.println("------> " + matcher.group());
}
to retrieve strings between single quotes.
My question is: how can i adapt this regex so that it will check for strings between single quotes AND strings like " ,'DEF' " (meaning which start with ,' and end with ')?
You can use this pattern:
'[^']+'|"[^"]+"
To also match empty quoted strings, change + to *.
This pattern should do what you want:
"(?:,\\s*)?'[^']*'"
The ? means the first group will match zero or one times.
I used (?:...) because this is a non-capturing group. It is better to use when you don't need to capture that portion of the match.
Also, I replaced .*? with [^']*, meaning the single-quoted string contains anything that is not a single quote. This is more efficient and less likely to lead to mistakes in your regex than .*?.
(Note: this regex allows there to be space between the comma and the start of the string. At first looking at your example, I thought that was true of your example. But now I see that it is not. Still, that might be useful depending on what your data looks like).
You could use the regex pattern:
Pattern.compile(",?'(.*?)'");
,? means 0 or 1 commas. The ? is greedy, so if there is a comma, it will be included in the match.
So: This will match:
A comma, followed by a string enclosed in single quotes
OR.. only a string enclosed in single quotes
