Find string in between two strings using regular expression


I am using a regular expression to find the string between two other strings.
Code:
Pattern pattern = Pattern.compile("EMAIL_BODY_XML_START_NODE"+"(.*)(\\n+)(.*)"+"EMAIL_BODY_XML_END_NODE");
Matcher matcher = pattern.matcher(part);
if (matcher.find()) {
..........
It works fine for plain text, but it breaks when the text contains special characters such as newlines.

You need to compile the pattern so that . matches line terminators as well. To do this, use the DOTALL flag.
Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
edit: Sorry, it's been a while since I've had this problem. You'll also have to change the middle of the regex from (.*)(\\n+)(.*) to (.*?). You need the lazy quantifier (*?) if you have multiple EMAIL_BODY_XML_START_NODE elements. Otherwise the regex will match from the start of the first element to the end of the last element, rather than producing a separate match for each element. Though I'm guessing this is unlikely to be the case for you.
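A minimal sketch of the fix described above, assuming the two marker names appear as literal text in the input. The class name, method name, and sample input are made up for illustration; Pattern.quote is used only so the markers are never interpreted as regex syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenMarkers {
    // Marker strings taken from the question; treated here as literal text (assumption).
    static final String START = "EMAIL_BODY_XML_START_NODE";
    static final String END = "EMAIL_BODY_XML_END_NODE";

    // DOTALL lets '.' match line terminators; the lazy (.*?) stops at the nearest END marker.
    static final Pattern BODY = Pattern.compile(
            Pattern.quote(START) + "(.*?)" + Pattern.quote(END), Pattern.DOTALL);

    // Returns the text between every START/END pair, in order of appearance.
    static List<String> extractBodies(String part) {
        List<String> bodies = new ArrayList<>();
        Matcher matcher = BODY.matcher(part);
        while (matcher.find()) {
            bodies.add(matcher.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        // Hypothetical input with two elements, the first spanning two lines.
        String part = START + "line one\nline two" + END
                + " filler " + START + "second" + END;
        System.out.println(extractBodies(part));
    }
}
```

With a greedy (.*) instead of (.*?), the same input would produce a single match running from the first start marker to the last end marker.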

Related

Pattern Matching for java using regex

I have a long string that I have to parse for different keywords. For example, I have the String:
"==References== This is a reference ==Further reading== *{{cite book|editor1-last=Lukes|editor1-first=Steven|editor2-last=Carrithers|}} * ==External links=="
And my keywords are
'==References==' '==External links==' '==Further reading=='
I have tried a lot of regex combinations, but I am not able to recover all the strings.
The code I have tried:
Pattern pattern = Pattern.compile("\\=+[A-Za-z]\\=+");
Matcher matcher = pattern.matcher(textBuffer.toString());
while (matcher.find()) {
System.out.println(matcher.group(0));
}

You don't need to escape the = sign. And you should also include a whitespace inside your character class.
Apart from that, you also need a quantifier on your character class to match multiple occurrences. Try with this regex:
Pattern pattern = Pattern.compile("=+[A-Za-z ]+=+");
You can also increase the flexibility to accept any characters in between two =='s, by using .+? (You need reluctant quantifier with . to stop it from matching everything till the last ==) or [^=]+:
Pattern pattern = Pattern.compile("=+[^=]+=+");
If the number of ='s is the same on both sides, then you need to modify your regex to use a capture group and a backreference:
"(=+)[^=]+\\1"

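The patterns from this answer can be tried side by side with a small helper. This is only a sketch: the class and method names are hypothetical, and the second sample string in main is invented to show the backreference variant.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingFinder {
    // Collects every match of the given regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "==References== This is a reference ==Further reading== "
                + "*{{cite book|editor1-last=Lukes|editor1-first=Steven|editor2-last=Carrithers|}} * "
                + "==External links==";
        // Letters and spaces only between the = runs.
        System.out.println(findAll("=+[A-Za-z ]+=+", text));
        // Backreference: the closing run must repeat the opening run.
        System.out.println(findAll("(=+)[^=]+\\1", "==A== ===B==="));
    }
}
```

On the question's sample text, the looser [^=]+ variants also pick up fragments of the cite template, such as =Lukes|editor1-first=, because a single = sign counts as a delimiter too. The character-class version avoids that for this input.
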
How to wrap (surround) java matcher groups with xml?

Using the following value of a text node...
MatcH one MatcHer two MarcH three
How can java matcher.find() be used to create the following output?
<wrap>MatcH</wrap> one MatcHer two <wrap>MarcH</wrap> three
Assuming a java regex that captures all words starting with capital 'M' and ending with a capital 'H'
\bM\w*H\b
Basically, I want to surround anything that matches this regex with wrap tags
String text = "MatcH one MatcHer two MarcH three";
Pattern pattern = Pattern.compile("\\bM\\w*H\\b");
Matcher matcher = pattern.matcher(text);
// replace each time the regex is found
while (matcher.find()) {
text = text.replaceAll(matcher.group(), "<wrap>"
+ matcher.group() + "</wrap>");
}
replaceFirst/replaceAll is not working for me because it results in the following...
<wrap>MatcH</wrap> one <wrap>MatcH</wrap>er two <wrap>MarcH</wrap> three
Thanks in advance...

The problem is not the regex itself but the call to replaceAll inside the loop: the first find() returns MatcH, and replaceAll then replaces every occurrence of the text MatcH in that iteration, including the one inside MatcHer. Note that the \\b doesn't appear in the output of group(), so nothing prevents it from replacing part of MatcHer.
You can put a System.out.println inside the loop to print the output of group and the output of replaceAll to see what happens and why it does what it does.
Simplifying your code to just the below will work: (that's probably "hard-coding match numbers" but I don't really see a problem with that as it stands and I don't see a simpler solution)
String text = "MatcH one MatcHer two MarcH three";
text = text.replaceAll("\\b(M\\w*H)\\b", "<wrap>$1</wrap>");
The above is how regex is supposed to work. If you see that problems may arise in future using something similar to the above, regex may not be the way to go.
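If the loop with matcher.find() has to stay (for example because the replacement needs extra logic per match), one possible alternative is the appendReplacement/appendTail pair on Matcher. The sketch below is an illustration, not the approach from the answer above; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WrapMatches {
    static final Pattern WORD = Pattern.compile("\\bM\\w*H\\b");

    // Builds the output in a single pass, so each replacement applies only
    // to the exact span that find() matched (word boundaries included).
    static String wrap(String text) {
        Matcher matcher = WORD.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = "<wrap>" + matcher.group() + "</wrap>";
            // quoteReplacement guards against '$' or '\' in the matched text.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrap("MatcH one MatcHer two MarcH three"));
        // prints <wrap>MatcH</wrap> one MatcHer two <wrap>MarcH</wrap> three
    }
}
```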

Finding substring in RegEx Java

Hello I have a question about RegEx. I am currently trying to find a way to grab a substring of any letter followed by any two numbers such as: d09.
I came up with the RegEx ^[a-z]{1}[0-9]{2}$ and ran it on the string
sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0
However, it never finds r30, the code below shows my approach in Java.
Pattern pattern = Pattern.compile("^[a-z]{1}[0-9]{2}$");
Matcher matcher = pattern.matcher("sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0");
if(matcher.matches())
System.out.println(matcher.group(1));
It never prints anything because the matcher never finds the substring (I checked by running it through the debugger). What am I doing wrong?

There are three errors:
Your expression contains anchors. ^ matches only at the start of the string, and $ only matches at the end. So your regular expression will match "r30" but not "foo_r30_bar". You are searching for a substring so you should remove the anchors.
You call matches(), which only succeeds if the entire input matches the pattern. Use find() instead.
You don't have a group 1 because you have no parentheses in your regular expression. Use group() instead of group(1).
Try this:
Pattern pattern = Pattern.compile("[a-z][0-9]{2}");
Matcher matcher = pattern.matcher("sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0");
if(matcher.find()) {
System.out.println(matcher.group());
}
Matcher Documentation
A matcher is created from a pattern by invoking the pattern's matcher method. Once created, a matcher can be used to perform three different kinds of match operations:
The matches method attempts to match the entire input sequence against the pattern.
The lookingAt method attempts to match the input sequence, starting at the beginning, against the pattern.
The find method scans the input sequence looking for the next subsequence that matches the pattern.
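The three operations quoted above can be compared on the string from the question. This is a small sketch with hypothetical class and method names; it uses the unanchored pattern from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchModes {
    static final Pattern LETTER_TWO_DIGITS = Pattern.compile("[a-z][0-9]{2}");

    // Returns the first substring that matches, or null if there is none.
    static String firstMatch(String input) {
        Matcher matcher = LETTER_TWO_DIGITS.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0";
        // matches(): the whole input must match the pattern.
        System.out.println(LETTER_TWO_DIGITS.matcher(input).matches());   // false
        // lookingAt(): the match must start at the beginning of the input.
        System.out.println(LETTER_TWO_DIGITS.matcher(input).lookingAt()); // false
        // find(): scans for the next matching subsequence anywhere in the input.
        System.out.println(firstMatch(input));                            // r30
    }
}
```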

It doesn't match because ^ and $ anchor the pattern to the start and the end of the string. If you want it to match anywhere, remove them and you will succeed.

Your regex is anchored, as such it will never match unless the whole input matches your regex. Use [a-z][0-9]{2}.
Don't use .matches() but .find(): .matches() is shamefully misnamed and tries to match the whole input.

How about "[a-z][0-9][0-9]"? That should find all of the substrings that you are looking for.

^[a-z]{1}[0-9]{2}$
sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0
As far as I can read this pattern:
it requires one lowercase letter at the very start of the string, followed by exactly two digits and then the end of the string, meaning the matched string is, and always will be, exactly 3 characters long.
If you give me more details about your string, I may be able to help further.
EDIT
If you are sure about the number of dots, then
change this line
Matcher matcher = pattern.matcher("sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0");
to
Matcher matcher = pattern.matcher("sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0".split("\\.")[0]);
Note:
With my solution you should omit the leading ^ from the pattern.

Matching several URLs in a string using regex

I'm trying to match a URL in a string, using regex from here: Regular expression to match URLs in Java
It works fine with one URL, but when I have two URLs in the string, it only matches the latter.
Here's the code:
Pattern pat = Pattern.compile(".*((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
Matcher matcher = pat.matcher("asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe");
// now matcher.groupCount() == 2, not 4
Edit: stuff I've tried:
// .* removed, now doesn't match anything // Another edit: actually works, see below
Pattern pat = Pattern.compile("((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
// .* made lazy, still only matches one
Pattern pat = Pattern.compile(".*?((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
Any ideas?

It's because .* is greedy. It will just consume as much as possible (the whole string) and then backtrack. I.e. it will throw away one character at a time until the remaining characters can make up a URL. Hence the first URL will already be matched, but not captured. And unfortunately, matches cannot overlap. The fix should be simple. Remove the .* at the beginning of your pattern. Then you can also remove the outer parentheses from your pattern - there is no need to capture anything any more, because the whole match will be the URL you are looking for.
Pattern pat = Pattern.compile("(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", Pattern.DOTALL);
Matcher matcher = pat.matcher("asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe");
while (matcher.find()) {
System.out.println(matcher.group());
}
By the way, matcher.groupCount() does not tell you anything here, because it gives you the number of capturing groups in your pattern and not the number of captures in your target string. That's why your second approach (using .*?) did not help: you still have two capturing groups in the pattern. Before calling find or anything, the matcher does not know how many captures it will find in total.
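To make the difference between groupCount() and the number of matches concrete, here is a short sketch with hypothetical class and method names. It collects every URL with a find() loop and prints the group count separately. DOTALL is left out because the pattern has no '.' outside a character class, so the flag would change nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlFinder {
    // URL pattern from the answer, without the leading .* and the outer group.
    static final Pattern URL = Pattern.compile(
            "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Each successful find() yields one URL; group() is the whole match.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe";
        System.out.println(findUrls(text));
        // groupCount() depends only on the pattern: one capturing group here.
        System.out.println(URL.matcher(text).groupCount());
    }
}
```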

Why does regular expression not match without boundary matcher "Beginning of line"?

There is something I don't understand in Java's regular expressions. I have the following string (and I need the "to Date"):
From Date :01/11/2011 To Date :30/11/2011;;;;;;;;;;;;;
I think that the following regular expression (in Perl) would have matched.
to\\s+date\\s*?:\\s*?([0-9]{2}[\\./][0-9]{2}[\\./][0-9]{2,4})
In Java, this pattern doesn't match. But it does if I add a .+ at the front and at the end.
So this pattern works in Java:
Pattern p = Pattern.compile(".+to\\s+date\\s*?:\\s*?([0-9]{2}[\\./][0-9]{2}[\\./][0-9]{2,4}).+", Pattern.CASE_INSENSITIVE);
What I don't understand: it would be clear to me that the first pattern would not match in Java if I added a ^ (beginning of the line) in front and a $ (end of the line) at the end. That would mean that the pattern has to match the whole line. But without those anchors the first pattern should actually match: why would the pattern care about string data outside its scope if I don't set delimiters in front and at the end? This is not logical to me. In my opinion the first pattern should behave like the "contains" method of the String class. And I think that is how it works in Perl.

In Java, matches() validates the entire string. Your input probably has line breaks in it (which are not matched by .+).
Try this instead:
Pattern p = Pattern.compile(".+to\\s+date\\s*?:\\s*?([0-9]{2}[\\./][0-9]{2}[\\./][0-9]{2,4}).+", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("... \n From Date :01/11/2011 To Date :30/11/2011;;;;;;;;;;;;; \n ...");
System.out.println(m.matches()); // prints false
if(m.find()) {
System.out.println(m.group(1)); // prints 30/11/2011
}
And when using find(), you can drop the .+'s from the pattern:
Pattern.compile("to\\s+date\\s*?:\\s*?([0-9]{2}[./][0-9]{2}[./][0-9]{2,4})", Pattern.CASE_INSENSITIVE);
(no need to escape the . inside a character class, btw)
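As a rough illustration of the "contains"-like behaviour the question expects, the sketch below uses find() with the shortened pattern on the line from the question. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ToDateFinder {
    static final Pattern TO_DATE = Pattern.compile(
            "to\\s+date\\s*?:\\s*?([0-9]{2}[./][0-9]{2}[./][0-9]{2,4})",
            Pattern.CASE_INSENSITIVE);

    // find() searches anywhere in the input, similar in spirit to String.contains.
    static String findToDate(String input) {
        Matcher matcher = TO_DATE.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "From Date :01/11/2011 To Date :30/11/2011;;;;;;;;;;;;;";
        // matches() fails because the pattern does not cover the whole line.
        System.out.println(TO_DATE.matcher(line).matches()); // false
        System.out.println(findToDate(line));                // 30/11/2011
    }
}
```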

I think this answer from a different question also answers your question: Why do regular expressions in Java and Perl act differently?
