How to wrap (surround) java matcher groups with xml? - java

Using the following value of a text node...
MatcH one MatcHer two MarcH three
How can java matcher.find() be used to create the following output?
<wrap>MatcH</wrap> one MatcHer two <wrap>MarcH</wrap> three
Assuming a java regex that captures all words starting with capital 'M' and ending with a capital 'H'
\bM\w*H\b
Basically, I want to surround anything that matches this regex with wrap tags
String text = "MatcH one MatcHer two MarcH three";
Pattern pattern = Pattern.compile("\\bM\\w*H\\b");
Matcher matcher = pattern.matcher(text);
// replace each time the regex is found
while (matcher.find()) {
    text = text.replaceAll(matcher.group(), "<wrap>"
            + matcher.group() + "</wrap>");
}
ReplaceFirst/ReplaceAll is not working for me because it results in the following...
<wrap>MatcH</wrap> one <wrap>MatcH</wrap>er two <wrap>MarcH</wrap> three
Thanks in advance...

The problem is that you call replaceAll inside the loop. The matcher finds MatcH, and replaceAll then replaces every occurrence of the text MatcH, including the one at the start of MatcHer. Note that the \\b word boundaries are not part of what group() returns, so nothing prevents MatcHer from being partially replaced.
You can put a System.out.println inside the loop to print the output of group and the output of replaceAll to see what happens and why it does what it does.
Simplifying your code to just the below will work: (that's probably "hard-coding match numbers" but I don't really see a problem with that as it stands and I don't see a simpler solution)
String text = "MatcH one MatcHer two MarcH three";
text = text.replaceAll("\\b(M\\w*H)\\b", "<wrap>$1</wrap>");
The above is how regex is supposed to work. If you see that problems may arise in future using something similar to the above, regex may not be the way to go.
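If you do want to keep the matcher.find() loop from the question, a minimal sketch using Matcher.appendReplacement and Matcher.appendTail looks like the following. The class and method names (WrapMatches, wrap) are made up for illustration; only the regex and the sample text come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WrapMatches {
    // Wraps every whole word that starts with 'M' and ends with 'H' in <wrap> tags.
    static String wrap(String text) {
        Pattern pattern = Pattern.compile("\\bM\\w*H\\b");
        Matcher matcher = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // quoteReplacement keeps any '$' or '\' in the matched text literal.
            String replacement = "<wrap>" + Matcher.quoteReplacement(matcher.group()) + "</wrap>";
            // Copies the text between the previous match and this one, then the replacement.
            matcher.appendReplacement(out, replacement);
        }
        // Copies whatever follows the last match.
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <wrap>MatcH</wrap> one MatcHer two <wrap>MarcH</wrap> three
        System.out.println(wrap("MatcH one MatcHer two MarcH three"));
    }
}
```

Because the replacement happens at the position of each match rather than by searching for the matched text again, MatcHer is left alone.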

Related

Java find value in a string using regex

I'm wondering about the behavior of using the matcher in java.
I have a pattern which I compiled, and when running through the results of the matcher I don't understand why a specific value is missing.
My code:
String str = "star wars";
Pattern p = Pattern.compile("star war|Star War|Starwars|star wars|star wars|pirates of the caribbean|long strage trip|drone|snatched (2017)");
Matcher matcher = p.matcher(str);
while (matcher.find()) {
System.out.println("\nRegex : " + matcher.group());
}
I get hit with "star war" which is right as it is in my pattern.
But I don't get "star wars" as a hit and I don't understand why as it is part of my pattern.
The behavior is expected because alternation in NFA regex is "eager", i.e. the first match wins, and the rest of the alternatives are not even tested against. Also, note that once a regex engine finds a match in a consuming pattern (and yours is a consuming pattern, it is not a zero-width assertion like a lookahead/lookbehind/word boundary/anchor) the index is advanced to the end of the match and the next match is searched for from that position.
So, once your first star war alternative branch matches, there is no way to match star wars as the regex index is before the last s.
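The effect of alternative order can be seen in a small sketch. The class and helper names (AlternationOrder, findAll) are invented for this illustration, and the pattern is shortened to the two alternatives that matter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {
    // Collects every match that find() reports, in order.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // The shorter alternative comes first and wins; the trailing "s" is left over.
        System.out.println(findAll("star war|star wars", "star wars")); // prints [star war]
        // With the longer alternative first, the whole input is matched.
        System.out.println(findAll("star wars|star war", "star wars")); // prints [star wars]
    }
}
```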
Just check whether the string contains each of the strings you are looking for; the simplest approach is a loop:
String str = "star wars";
String[] arr = {"star war","Star War","Starwars","star wars","pirates of the caribbean","long strage trip","drone","snatched (2017)"};
for(String s: arr){
if(str.contains(s))
System.out.println(s);
}
By the way, your regex contains snatched (2017), and it does not match ( and ), it only matches snatched 2017. To match literal parentheses, the ( and ) must be escaped. I also removed a dupe entry for star wars.
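As a rough illustration of the parentheses issue, the sketch below compares the unescaped text with the same text passed through Pattern.quote, which makes every character literal. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class LiteralParens {
    // Unescaped: "(2017)" is a capturing group, so the parentheses match nothing themselves.
    static boolean unescapedMatches(String input) {
        return Pattern.compile("snatched (2017)").matcher(input).matches();
    }

    // Quoted: Pattern.quote turns the whole text into a literal pattern.
    static boolean quotedMatches(String input) {
        return Pattern.compile(Pattern.quote("snatched (2017)")).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(unescapedMatches("snatched 2017"));   // prints true
        System.out.println(unescapedMatches("snatched (2017)")); // prints false
        System.out.println(quotedMatches("snatched (2017)"));    // prints true
    }
}
```

Escaping by hand as "snatched \\(2017\\)" gives the same result as Pattern.quote for this text.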
A better way to build your regex would be like this:
String pattern = "[Ss]tar[\\s]{0,1}[Ww]ar[s]{0,1}";
Breaking down:
[Ss]: it will match either S or s in the first position
\s: matches a whitespace character
{0,1}: the previous character (or set) will be matched from 0 to 1 times
An alternative is:
String pattern = "[Ss]tar[\\s]?[Ww]ar[s]?";
?: the previous character (or set) will be matched once or not at all
For more information, see https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Edit 1: fixed typo (\s -> \\s). Thanks, @eugene.
You want to match the whole input sequence, so you should use Matcher.matches() or add ^ and $:
Pattern p = Pattern.compile("^(star war|Star War|Starwars|star wars|"
+ "star wars|pirates of the caribbean)$");
will print
Regex : star wars
But I agree with @NAMS: Don't build your regex like this.
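For comparison, here is a small sketch of the Matcher.matches() route, which needs no ^ and $ because it only succeeds when the entire input is consumed. The class and method names are made up, and the pattern is reduced to two alternatives.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeInputMatch {
    // Returns the matched text if the pattern matches the entire input, else null.
    static String wholeMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group() : null;
    }

    public static void main(String[] args) {
        // "star war" alone cannot cover the input, so the engine falls back to "star wars".
        System.out.println(wholeMatch("star war|star wars", "star wars"));       // prints star wars
        // Extra text after the title means the whole input no longer matches.
        System.out.println(wholeMatch("star war|star wars", "star wars again")); // prints null
    }
}
```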

Constructing regex pattern to match sentence

I'm trying to write a regex pattern that will match any sentence that begins with one or more tabs and/or spaces.
For example, I want my regex pattern to be able to match " hello there I like regex!"
but I'm scratching my head over how to match the words after "hello". So far I have this:
String REGEX = "(?s)(\\p{Blank}+)([a-z][ ])*";
Pattern PATTERN = Pattern.compile(REGEX);
Matcher m = PATTERN.matcher(" asdsada adf adfah.");
if (m.matches()) {
System.out.println("hurray!");
}
Any help would be appreciated. Thanks.
String regex = "^\\s+[A-Za-z,;'\"\\s]+[.?!]$";
^ means "begins with"
\\s means white space
+ means 1 or more
[A-Za-z,;'"\\s] means any letter, ,, ;, ', ", or whitespace character
$ means "ends with"
An example regex to match sentences by the definition: "A sentence is a series of characters, starting with at least one whitespace character, that ends in one of ., ! or ?" is as follows:
\s+[^.!?]*[.!?]
Note that newline characters will also be included in this match.
A sentence starts with a word boundary (hence \b) and ends with one or more terminators. Thus:
\b[^.!?]+[.!?]+
https://regex101.com/r/7DdyM1/1
This gives pretty accurate results. However, it will not handle fractional numbers. E.g. This sentence will be interpreted as two sentences:
The value of PI is 3.141...
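The limitation is easy to reproduce. The sketch below applies the same pattern with Java's regex engine; the class and method names (SentenceSplit, sentences) are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceSplit {
    // A word boundary, then anything up to the terminators, then one or more terminators.
    private static final Pattern SENTENCE = Pattern.compile("\\b[^.!?]+[.!?]+");

    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sentences("Hello there. How are you?"));
        // prints [Hello there., How are you?]
        System.out.println(sentences("The value of PI is 3.141..."));
        // prints [The value of PI is 3., 141...]
    }
}
```

The decimal point is treated as a terminator, which is why the second input is split in two.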
If you are looking to match all strings starting with whitespace, you can try the regular expression "^\s+.*" (written as "^\\s+.*" in a Java string literal).
This tool could help you to test your regular expression efficiently.
http://www.rubular.com/
Based upon what you desire and asked for, the following will work.
String s = " hello there I like regex!";
Pattern p = Pattern.compile("^\\s+[a-zA-Z\\s]+[.?!]$");
Matcher m = p.matcher(s);
if (m.matches()) {
System.out.println("hurray!");
}
String regex = "(?<=^|(\\.|!|\\?) |\n|\t|\r|\r\n) *\\(?[A-Z][^.!?]*((\\.|!|\\?)(?! |\n|\r|\r\n)[^.!?]*)*(\\.|!|\\?)(?= |\n|\r|\r\n)";
This matches any sentence under the definition 'a sentence starts with a capital letter and ends with a dot'.
The below regex pattern matches sentences in a paragraph.
Pattern pattern = Pattern.compile("\\b[\\w\\p{Space}“”’\\p{Punct}&&[^.?!]]+[.?!]");
Reference: https://devsought.com/regex-pattern-to-match-sentence

Finding substring in RegEx Java

Hello I have a question about RegEx. I am currently trying to find a way to grab a substring of any letter followed by any two numbers such as: d09.
I came up with the RegEx ^[a-z]{1}[0-9]{2}$ and ran it on the string
sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0
However, it never finds r30, the code below shows my approach in Java.
Pattern pattern = Pattern.compile("^[a-z]{1}[0-9]{2}$");
Matcher matcher = pattern.matcher("sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0");
if(matcher.matches())
System.out.println(matcher.group(1));
It never prints anything because the matcher never finds the substring (I checked with the debugger). What am I doing wrong?
There are three errors:
Your expression contains anchors. ^ matches only at the start of the string, and $ only matches at the end. So your regular expression will match "r30" but not "foo_r30_bar". You are searching for a substring so you should remove the anchors.
matches() should be find(): matches() only succeeds when the entire input matches the pattern.
You don't have a group 1 because you have no parentheses in your regular expression. Use group() instead of group(1).
Try this:
Pattern pattern = Pattern.compile("[a-z][0-9]{2}");
Matcher matcher = pattern.matcher("sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0");
if(matcher.find()) {
System.out.println(matcher.group());
}
Matcher Documentation
A matcher is created from a pattern by invoking the pattern's matcher method. Once created, a matcher can be used to perform three different kinds of match operations:
The matches method attempts to match the entire input sequence against the pattern.
The lookingAt method attempts to match the input sequence, starting at the beginning, against the pattern.
The find method scans the input sequence looking for the next subsequence that matches the pattern.
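The three operations can be compared side by side on the input from the question. This is only a sketch; the class and method names are hypothetical, while matches(), lookingAt() and find() are the Matcher methods described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchOperations {
    private static final Pattern LETTER_TWO_DIGITS = Pattern.compile("[a-z][0-9]{2}");

    // True only if the entire input is one lowercase letter followed by two digits.
    static boolean matchesWhole(String input) {
        return LETTER_TWO_DIGITS.matcher(input).matches();
    }

    // True if the input starts with such a sequence; the rest of the input is ignored.
    static boolean matchesPrefix(String input) {
        return LETTER_TWO_DIGITS.matcher(input).lookingAt();
    }

    // Returns the first such sequence anywhere in the input, or null if there is none.
    static String firstHit(String input) {
        Matcher m = LETTER_TWO_DIGITS.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0";
        System.out.println(matchesWhole(input));       // prints false
        System.out.println(matchesPrefix(input));      // prints false
        System.out.println(firstHit(input));           // prints r30
        System.out.println(matchesPrefix("r30.reed")); // prints true
    }
}
```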
It doesn't match because ^ and $ delimit the start and the end of the string. If you want the match to be anywhere, remove them and you will succeed.
Your regex is anchored, as such it will never match unless the whole input matches your regex. Use [a-z][0-9]{2}.
Don't use .matches() but .find(): .matches() is shamefully misnamed and tries to match the whole input.
How about "[a-z][0-9][0-9]"? That should find all of the substrings that you are looking for.
^[a-z]{1}[0-9]{2}$
sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0
As far as I can read this pattern,
it requires one lowercase letter at the start, followed by exactly two digits and then the end of the string, so it only ever matches a string that is exactly 3 word characters long.
If I had more information about your string, I could help further.
EDIT
If you are sure of the number of dots, then
change this line
Matcher matcher = pattern.matcher("sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0");
to
Matcher matcher = pattern.matcher("sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0".split("\\.")[0]);
Note: with my solution you should omit the leading ^ from the pattern.

Find string in between two strings using regular expression

I am using a regular expression for finding string in between two strings
Code:
Pattern pattern = Pattern.compile("EMAIL_BODY_XML_START_NODE"+"(.*)(\\n+)(.*)"+"EMAIL_BODY_XML_END_NODE");
Matcher matcher = pattern.matcher(part);
if (matcher.find()) {
..........
It works fine for plain text, but it breaks when the text contains special characters such as newlines.
You need to compile the pattern such that . matches line terminators as well. To do this you need to use the DOTALL flag.
Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
edit: Sorry, it's been a while since I've had this problem. You'll also have to change the middle of the regex from (.*)(\\n+)(.*) to (.*?). You need the lazy quantifier (*?) if you have multiple EMAIL_BODY_XML_START_NODE elements. Otherwise the regex will match from the start of the first element to the end of the last element instead of producing a separate match for each element, though I'm guessing this is unlikely to be the case for you.
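Putting the DOTALL flag and the lazy quantifier together might look like the sketch below. It assumes the two markers are plain literal strings, as the quoted names in the question suggest, and runs them through Pattern.quote so that no character in them is read as regex syntax. The class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenMarkers {
    // Assumption: the markers are literal text, not variables holding other values.
    static final String START = "EMAIL_BODY_XML_START_NODE";
    static final String END = "EMAIL_BODY_XML_END_NODE";

    // Returns the text between each START/END pair, including any line breaks.
    static List<String> extractBodies(String text) {
        Pattern pattern = Pattern.compile(
                Pattern.quote(START) + "(.*?)" + Pattern.quote(END),
                Pattern.DOTALL); // DOTALL lets '.' match line terminators too
        Matcher matcher = pattern.matcher(text);
        List<String> bodies = new ArrayList<>();
        while (matcher.find()) {
            bodies.add(matcher.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        String text = START + "line one\nline two" + END + " filler " + START + "second" + END;
        // The lazy (.*?) stops at the first END, so two separate bodies are found.
        System.out.println(extractBodies(text).size()); // prints 2
    }
}
```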

How to appendReplacement on a Matcher group instead of the whole pattern?

I am using a while(matcher.find()) to loop through all of the matches of a Pattern. For each instance or match of that pattern it finds, I want to replace matcher.group(3) with some new text. This text will be different for each one so I am using matcher.appendReplacement() to rebuild the original string with the new changes as it goes through. However, appendReplacement() replaces the entire Pattern instead of just the group.
How can I do this but only modify the third group of the match rather than the entire Pattern?
Here is some example code:
Pattern pattern = Pattern.compile("THE (REGEX) (EXPRESSION) (WITH MULTIPLE) GROUPS");
Matcher matcher = pattern.matcher("THE TEXT TO SEARCH AND MODIFY");
StringBuffer buffer = new StringBuffer();
while(matcher.find()){
matcher.appendReplacement(buffer, processTheGroup(matcher.group(3)));
}
but I would like to do something like this (obviously this doesn't work).
...
while(matcher.find()){
matcher.group(3).appendReplacement(buffer, processTheGroup(matcher.group(3)));
}
Something like that, where it only replaces a certain group, not the whole Pattern.
EDIT: changed the regex example to show that not all of the pattern is grouped.
I see this already has an accepted answer, but it is not fully correct. The correct answer appears to be something like this:
m.appendReplacement(buffer, "$1" + process(m.group(2)) + "$3");
This also illustrates that "$" is a special character in .appendReplacement. Therefore you must take care in your "process()" function to replace all "$" with "\$". Matcher.quoteReplacement(replacementString) will do this for you (thanks @Med)
The previous accepted answer will fail if either group 1 or group 3 happens to contain a "$". You'll end up with "java.lang.IllegalArgumentException: Illegal group reference"
Let's say your entire pattern matches "(prefix)(infix)(suffix)", capturing the 3 parts into groups 1, 2 and 3 respectively. Now let's say you want to replace only group 2 (the infix), leaving the prefix and suffix intact the way they were.
Then what you do is you append what group(1) matched (unaltered), the new replacement for group(2), and what group(3) matched (unaltered), so something like this:
matcher.appendReplacement(
buffer,
matcher.group(1) + processTheGroup(matcher.group(2)) + matcher.group(3)
);
This will still match and replace the entire pattern, but since groups 1 and 3 are left untouched, effectively only the infix is replaced.
You should be able to adapt the same basic technique for your particular scenario.
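As a self-contained sketch of this technique, the example below replaces only the middle group and passes the rebuilt text through Matcher.quoteReplacement so that a "$" in any of the three parts is treated literally. The pattern, the sample input, and the body of processTheGroup are all made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceOneGroup {
    // Hypothetical transformation of the infix: prefix the number with a dollar sign.
    static String processTheGroup(String digits) {
        return "$" + digits;
    }

    static String replaceInfix(String input) {
        Pattern pattern = Pattern.compile("(price: )(\\d+)( USD)");
        Matcher matcher = pattern.matcher(input);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            // Groups 1 and 3 are copied unchanged; only group 2 is transformed.
            String rebuilt = matcher.group(1) + processTheGroup(matcher.group(2)) + matcher.group(3);
            // Without quoteReplacement, the '$' here would be read as a group reference.
            matcher.appendReplacement(buffer, Matcher.quoteReplacement(rebuilt));
        }
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        // prints: price: $10 USD, price: $25 USD
        System.out.println(replaceInfix("price: 10 USD, price: 25 USD"));
    }
}
```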
