Matching several URLs in a string using regex

I'm trying to match a URL in a string, using regex from here: Regular expression to match URLs in Java
It works fine with one URL, but when I have two URLs in the string, it only matches the latter.
Here's the code:
Pattern pat = Pattern.compile(".*((https?|ftp|file)://[-a-zA-Z0-9+&##/%?=~_|!:,.;]*[-a-zA-Z0-9+&##/%=~_|])", Pattern.DOTALL);
Matcher matcher = pat.matcher("asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe");
// now matcher.groupCount() == 2, not 4
Edit: stuff I've tried:
// .* removed, now doesn't match anything // Another edit: actually works, see below
Pattern pat = Pattern.compile("((https?|ftp|file)://[-a-zA-Z0-9+&##/%?=~_|!:,.;]*[-a-zA-Z0-9+&##/%=~_|])", Pattern.DOTALL);
// .* made lazy, still only matches one
Pattern pat = Pattern.compile(".*?((https?|ftp|file)://[-a-zA-Z0-9+&##/%?=~_|!:,.;]*[-a-zA-Z0-9+&##/%=~_|])", Pattern.DOTALL);
Any ideas?

It's because .* is greedy. It consumes as much as possible (the whole string) and then backtracks, i.e. it gives up one character at a time until the remaining characters can make up a URL. Hence the first URL is consumed by .* as part of the match but never captured, and since matches cannot overlap, it cannot be found afterwards. The fix is simple: remove the .* at the beginning of your pattern. Then you can also remove the outer parentheses, because there is no need to capture anything any more: the whole match is the URL you are looking for.
Pattern pat = Pattern.compile("(https?|ftp|file)://[-a-zA-Z0-9+&##/%?=~_|!:,.;]*[-a-zA-Z0-9+&##/%=~_|]", Pattern.DOTALL);
Matcher matcher = pat.matcher("asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe");
while (matcher.find()) {
    System.out.println(matcher.group());
}
By the way, matcher.groupCount() does not tell you anything here, because it gives you the number of capturing groups in your pattern, not the number of matches in your target string. That's why your second approach (using .*?) did not seem to help: you still have two capturing groups in the pattern. Before calling find(), the matcher does not know how many matches it will find in total.
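To make the difference between groupCount() and the number of matches concrete, here is a minimal sketch. The class name UrlFinder and the helper findUrls are made up for illustration; the pattern is the one from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlFinder {
    // Pattern taken from the answer above; it contains exactly one capturing group.
    static final Pattern URL = Pattern.compile(
            "(https?|ftp|file)://[-a-zA-Z0-9+&##/%?=~_|!:,.;]*[-a-zA-Z0-9+&##/%=~_|]");

    // Hypothetical helper: collects every non-overlapping match
    // by calling find() until it returns false.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe";
        // groupCount() describes the pattern, so it is the same for any input.
        System.out.println(URL.matcher(text).groupCount());
        System.out.println(URL.matcher("no url here").groupCount());
        // The number of matches is only known after iterating with find().
        System.out.println(findUrls(text));
    }
}
```

Counting the matches is then just a matter of taking the size of the returned list.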

Related

Regex matches exact string contains word

I want to "catch" the following path to do some action on it:
/root/m/api/users/<user-id-can be any combination of characters and digits>/content
The path must end with content
For example:
/root/m/api/users/acme/content
To do so, I use a regex match to check whether this is the correct path:
private boolean isPathAllow(final String urlToBlock) {
    Matcher matcher = Pattern.compile("^/root/m/api/users/.*/content$").matcher(urlToBlock);
    return matcher.matches();
}
But it returns true even for requests like:
/root/m/api/users/acme/applications/versions/1.0/content
So I must be doing something wrong in the pattern I pass to matches().
How can I make it behave as intended?

I succeeded with:
Matcher matcher = Pattern.compile("^/root/m/api/users/\\w*/content$").matcher(urlToBlock);
or
Matcher matcher = Pattern.compile("^/root/m/api/users/[^/]+/content$").matcher(urlToBlock);
So what are the differences between them (\\w* vs [^/]+)?

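.* is greedy, so it takes everything between users/ and the final /content, including any / characters in between.
Use [^/]+ to match only characters that are not /, which restricts the match to a single path segment between users/ and /content. Making the .* lazy by appending a question mark (.*?) does not help here: the pattern is anchored and matches() must cover the whole input, so the lazy quantifier simply expands until the final /content is reached.
A 'greedy' quantifier tries to match as many tokens as possible. A 'lazy' quantifier matches as few as possible and stops at the first position where the rest of the pattern matches.
In some cases, greedy quantifiers can also be much less efficient, because the regex engine first consumes more (possibly a lot more) input than the intended match and backtracks only after the rest of the pattern fails.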
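The following sketch compares the candidate patterns from this thread on a few paths. The class name PathCheck and the sample user ids (john.doe, an empty id) are made up for illustration. It also shows the difference the question asks about: \\w* accepts only letters, digits and underscores and also accepts an empty segment, while [^/]+ accepts any character except / and requires at least one.

```java
import java.util.regex.Pattern;

class PathCheck {
    static final Pattern GREEDY = Pattern.compile("^/root/m/api/users/.*/content$");
    static final Pattern LAZY = Pattern.compile("^/root/m/api/users/.*?/content$");
    static final Pattern WORD = Pattern.compile("^/root/m/api/users/\\w*/content$");
    static final Pattern SEGMENT = Pattern.compile("^/root/m/api/users/[^/]+/content$");

    // matches() requires the pattern to cover the whole path.
    static boolean allowed(Pattern pattern, String path) {
        return pattern.matcher(path).matches();
    }

    public static void main(String[] args) {
        String shortPath = "/root/m/api/users/acme/content";
        String longPath = "/root/m/api/users/acme/applications/versions/1.0/content";
        String dotted = "/root/m/api/users/john.doe/content"; // made-up user id
        String empty = "/root/m/api/users//content";          // empty user id

        // Greedy and lazy .* both accept the long path, because . also matches '/'.
        System.out.println(allowed(GREEDY, longPath));  // true
        System.out.println(allowed(LAZY, longPath));    // true
        // Both restricted patterns accept the short path and reject the long one.
        System.out.println(allowed(SEGMENT, shortPath)); // true
        System.out.println(allowed(SEGMENT, longPath));  // false
        System.out.println(allowed(WORD, longPath));     // false
        // \\w* rejects the dot but accepts an empty segment; [^/]+ does the opposite.
        System.out.println(allowed(WORD, dotted));      // false
        System.out.println(allowed(SEGMENT, dotted));   // true
        System.out.println(allowed(WORD, empty));       // true
        System.out.println(allowed(SEGMENT, empty));    // false
    }
}
```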

Searching for number after a specific word that does not immediately precede the number

I am trying to use a pattern to search for a Zip Code within a string. I cannot get it to work correctly.
A sample of the inputLine is
What is the weather in 75042?
What I am trying to use for a pattern is
public String getZipcode(String inputLine) {
    Pattern pattern = Pattern.compile(".*weather.*([0-9]+).*");
    Matcher matcher = pattern.matcher(inputLine);
    if (matcher.find()) {
        return matcher.group(1).toString();
    }
    return "Zipcode Not Found.";
}
If I am looking to only get 75042, what do I need to change? This only outputs the last digit in the number, 2. I am terribly confused and I do not completely understand the Javadocs for the Pattern class.

The reason is that the .* before the group is greedy: it matches the leading digits and leaves only the last one for your capturing group, so you have to drop it.
A simpler pattern can be used here: \D+(\d+)\D+, which means
some non-digits \D+, then some digits to capture (\d+), then some non-digits \D+
public String getZipcode(String inputLine) {
    Pattern pattern = Pattern.compile("\\D+(\\d+)\\D+");
    Matcher matcher = pattern.matcher(inputLine);
    if (matcher.find()) {
        return matcher.group(1).toString();
    }
    return "Zipcode Not Found.";
}

The problem is that your middle .* is too greedy and eats away 7504. One easy fix is to add a space before the capturing group: .*weather.* ([0-9]+).* or even use \\s. But the best option is the non-greedy version .*?, so the regexp should be .*weather.*?([0-9]+).*

Spaces are missing in your regex (\s). You can use \s* or \s+ based on your data
Pattern pattern = Pattern.compile("weather\\s*\\w+\\s*(\\d+)");
Matcher matcher = pattern.matcher(inputLine);

Your .*weather.*([0-9]+).* pattern first grabs the whole line with the first .* and backtracks until it finds weather. The second .* then grabs the rest of the line after that word and backtracks again, but only until a single digit can be matched, since one digit already satisfies [0-9]+. That is why only the last digit is stored in capturing group 1. The final .* just consumes the line to its end.
You may solve the issue by just using ".*weather.*?([0-9]+).*" (making the second .* lazy), but since you are using Matcher#find(), you can use a simpler regex:
Pattern pattern = Pattern.compile("weather\\D*(\\d+)");
And after getting a match, retrieve the value with matcher.group(1).
Pattern details
weather - the literal word weather
\\D* - 0+ chars other than digits
(\\d+) - Capturing group 1: one or more digits
See the Java demo:
String inputLine = "What is the weather in 75042?";
Pattern pattern = Pattern.compile("weather\\D*(\\d+)");
Matcher matcher = pattern.matcher(inputLine);
if (matcher.find()) {
    System.out.println(matcher.group(1)); // => 75042
}

I think all you need is \\d+
public String getZipcode(String inputLine) {
    Pattern pattern = Pattern.compile("\\d+");
    Matcher matcher = pattern.matcher(inputLine);
    if (matcher.find()) {
        return matcher.group();
    }
    // One option is to throw an exception if no result is found.
    // NoSuchElementException is unchecked and needs: import java.util.NoSuchElementException;
    throw new NoSuchElementException("Zipcode Not Found.");
}

In regular expressions, quantifiers such as * and + are greedy by default.
Perfectly good solutions have already been suggested.
I'm just adding one that is very close to yours and addresses the problem in a more isolated way:
If you use the regex
".*weather.*?([0-9]+).*" ... instead of ...
".*weather.*([0-9]+).*"
... your solution will work perfectly well. The '?' after the asterisk instructs the regex compiler to treat the asterisk as non-greedy.
Greedy means consuming as many characters as possible (from left to right) while still allowing the remainder of the regex to match.
Non-greedy means consuming as few characters as possible while still allowing the remainder of the regex to match.
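As a small sketch of the behaviour described in this thread, the snippet below runs the greedy pattern, the lazy pattern and the shorter weather\\D*(\\d+) pattern against the sample input. The class name ZipcodeDemo and the helper firstGroup are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZipcodeDemo {
    // Hypothetical helper: returns capturing group 1 of the first match, or null.
    static String firstGroup(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String inputLine = "What is the weather in 75042?";
        // Greedy middle .* leaves only the last digit for the group.
        System.out.println(firstGroup(".*weather.*([0-9]+).*", inputLine));  // 2
        // Lazy middle .*? stops as soon as the digits can start.
        System.out.println(firstGroup(".*weather.*?([0-9]+).*", inputLine)); // 75042
        // With find(), the surrounding .* parts are not needed at all.
        System.out.println(firstGroup("weather\\D*(\\d+)", inputLine));      // 75042
    }
}
```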

Grouping multiple digits prior to a known value

I'm executing this regex code expecting a grouping value of 11, but am getting a 1. Seems like the grouping contains the correct regex for getting one or more digits prior to a known value. I'm sure it is simple, but I cannot seem to figure it out.
String mydata = "P0Y0M0W0DT11H0M0S";
Pattern pattern = Pattern.compile("P.*(\\d+)H.*");
Matcher matcher = pattern.matcher(mydata);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}

Try this
public static void main(String a1[]) {
    String mydata = "P0Y0M0W0DT11H0M0S";
    Pattern pattern = Pattern.compile("P.*?(\\d+)H.*");
    Matcher matcher = pattern.matcher(mydata);
    if (matcher.find()) {
        System.out.println(matcher.group(1));
    }
}
Output
11

The problem is that .* will try to consume/match as much as possible before the next part is checked. Thus in your regex P.*(\d+)H.* the first .* will match 0Y0M0W0DT1 since that's as much as can be matched with the group still being able to match a single digit afterwards.
If you make that quantifier lazy/reluctant (i.e. .*?), it will try to match as little as possible so of the possible matches 0Y0M0W0DT1 and 0Y0M0W0DT it will select the shorter one and leave all the digits for the group to match.
Thus the regex P.*?(\d+)H.* should do what you want.
Additional note: since you're using Matcher#find(), you don't need the catch-all expression .* at the end. Also note that the regex matches any string that contains the character H preceded by at least one digit, with a P somewhere in front of those digits. So if you want to be more restrictive, your regex needs to be tightened.
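To illustrate the additional note, here is a hedged sketch. The input "PIZZA 3H" and the stricter pattern ^P[^T]*T(\\d+)H are assumptions made up for this example (the stricter pattern assumes the hours always follow a T); they are not part of the original question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HoursDemo {
    // Lazy pattern from the answer, without the unnecessary trailing .*
    static final Pattern LOOSE = Pattern.compile("P.*?(\\d+)H");
    // Hypothetical stricter variant: P at the start, then the digits directly after T.
    static final Pattern STRICT = Pattern.compile("^P[^T]*T(\\d+)H");

    // Returns capturing group 1 of the first match, or null if there is none.
    static String hours(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String mydata = "P0Y0M0W0DT11H0M0S";
        System.out.println(hours(LOOSE, mydata));      // 11
        System.out.println(hours(STRICT, mydata));     // 11
        // The loose pattern also accepts unrelated text; the strict one does not.
        System.out.println(hours(LOOSE, "PIZZA 3H"));  // 3
        System.out.println(hours(STRICT, "PIZZA 3H")); // null
    }
}
```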

Finding substring in RegEx Java

Hello I have a question about RegEx. I am currently trying to find a way to grab a substring of any letter followed by any two numbers such as: d09.
I came up with the RegEx ^[a-z]{1}[0-9]{2}$ and ran it on the string
sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0
However, it never finds r30, the code below shows my approach in Java.
Pattern pattern = Pattern.compile("^[a-z]{1}[0-9]{2}$");
Matcher matcher = pattern.matcher("sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0");
if(matcher.matches())
    System.out.println(matcher.group(1));
It never prints anything because the matcher never finds the substring (as I can see when I run it through the debugger). What am I doing wrong?

There are three errors:
Your expression contains anchors. ^ matches only at the start of the string, and $ only matches at the end. So your regular expression will match "r30" but not "foo_r30_bar". You are searching for a substring so you should remove the anchors.
You call matches(), but you should call find().
You don't have a group 1 because you have no parentheses in your regular expression. Use group() instead of group(1).
Try this:
Pattern pattern = Pattern.compile("[a-z][0-9]{2}");
Matcher matcher = pattern.matcher("sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0");
if(matcher.find()) {
    System.out.println(matcher.group());
}
Matcher Documentation
A matcher is created from a pattern by invoking the pattern's matcher method. Once created, a matcher can be used to perform three different kinds of match operations:
The matches method attempts to match the entire input sequence against the pattern.
The lookingAt method attempts to match the input sequence, starting at the beginning, against the pattern.
The find method scans the input sequence looking for the next subsequence that matches the pattern.
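The three operations quoted above can be compared side by side with a short sketch. The class name MatchKinds and its method names are made up for illustration; the pattern and input come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchKinds {
    static final Pattern PATTERN = Pattern.compile("[a-z][0-9]{2}");

    // true only if the pattern covers the entire input
    static boolean wholeInput(String input) {
        return PATTERN.matcher(input).matches();
    }

    // true if the pattern matches at the beginning of the input
    static boolean prefix(String input) {
        return PATTERN.matcher(input).lookingAt();
    }

    // first matching subsequence anywhere in the input, or null
    static String firstOccurrence(String input) {
        Matcher matcher = PATTERN.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0";
        System.out.println(wholeInput(input));      // false
        System.out.println(prefix(input));          // false
        System.out.println(firstOccurrence(input)); // r30
        System.out.println(prefix("r30.reed"));     // true
        System.out.println(wholeInput("r30"));      // true
    }
}
```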

It doesn't match because ^ and $ delimit the start and the end of the string. If you want to match anywhere in the string, remove them and you will succeed.

Your regex is anchored, as such it will never match unless the whole input matches your regex. Use [a-z][0-9]{2}.
Don't use .matches() but .find(): .matches() is shamefully misnamed and tries to match the whole input.

How about "[a-z][0-9][0-9]"? That should find all of the substrings that you are looking for.

^[a-z]{1}[0-9]{2}$
sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0
As far as I can read it, this pattern matches exactly one lowercase letter followed by two digits and nothing else, so the matched string must always be exactly 3 characters long.
Maybe I can help more if I have more data about your string.
EDIT
If you are sure about the number of dots, then change this line
Matcher matcher = pattern.matcher("sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0");
to
Matcher matcher = pattern.matcher("sedfdhajkldsfakdsakvsdfasdfr30.reed.op.1xp0".split("\\.")[0]);
Note: with this solution you should omit the leading ^ from the pattern and use find() rather than matches().

Find string in between two strings using regular expression

I am using a regular expression for finding string in between two strings
Code:
Pattern pattern = Pattern.compile("EMAIL_BODY_XML_START_NODE"+"(.*)(\\n+)(.*)"+"EMAIL_BODY_XML_END_NODE");
Matcher matcher = pattern.matcher(part);
if (matcher.find()) {
..........
It works fine for plain text, but it breaks when the text contains special characters such as newlines.

You need to compile the pattern such that . matches line terminators as well. To do this, use the DOTALL flag.
Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
edit: Sorry, it's been a while since I've had this problem. You'll also have to change the middle regex from (.*)(\\n+)(.*) to (.*?). You need the lazy quantifier (*?) if you have multiple EMAIL_BODY_XML_START_NODE elements. Otherwise the regex will match from the start of the first element to the end of the last element rather than producing a separate match for each element. Though I'm guessing this is unlikely to be the case for you.
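A minimal sketch of this approach, combining DOTALL with a lazy group. The class name BetweenMarkers, the helper extract and the marker strings [START] and [END] are placeholders made up for illustration; Pattern.quote is used so that the markers are treated as literal text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenMarkers {
    // Hypothetical helper: returns the text between each start/end marker pair.
    static List<String> extract(String text, String start, String end) {
        // DOTALL lets . match line terminators; (.*?) stops at the nearest end marker.
        Pattern pattern = Pattern.compile(
                Pattern.quote(start) + "(.*?)" + Pattern.quote(end), Pattern.DOTALL);
        Matcher matcher = pattern.matcher(text);
        List<String> bodies = new ArrayList<>();
        while (matcher.find()) {
            bodies.add(matcher.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        String part = "[START]line one\nline two[END] other text [START]second[END]";
        // Two separate bodies are found, the first one spanning a newline.
        for (String body : extract(part, "[START]", "[END]")) {
            System.out.println("body: " + body);
        }
    }
}
```

With a greedy (.*) in the same position, the two elements in this sample would be returned as one match running from the first start marker to the last end marker.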
