Grouping multiple digits prior to a known value - java

I'm executing this regex code expecting group 1 to be 11, but I am getting 1. It seems like the group contains the correct regex for getting one or more digits prior to a known value. I'm sure it is simple, but I cannot seem to figure it out.
String mydata = "P0Y0M0W0DT11H0M0S";
Pattern pattern = Pattern.compile("P.*(\\d+)H.*");
Matcher matcher = pattern.matcher(mydata);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}

Try this
public static void main(String a1[]) {
    String mydata = "P0Y0M0W0DT11H0M0S";
    Pattern pattern = Pattern.compile("P.*?(\\d+)H.*");
    Matcher matcher = pattern.matcher(mydata);
    if (matcher.find()) {
        System.out.println(matcher.group(1));
    }
}
Output
11

The problem is that .* will try to consume/match as much as possible before the next part is checked. Thus in your regex P.*(\d+)H.* the first .* will match 0Y0M0W0DT1 since that's as much as can be matched with the group still being able to match a single digit afterwards.
If you make that quantifier lazy/reluctant (i.e. .*?), it will try to match as little as possible so of the possible matches 0Y0M0W0DT1 and 0Y0M0W0DT it will select the shorter one and leave all the digits for the group to match.
Thus the regex P.*?(\d+)H.* should do what you want.
Additional note: since you're using Matcher#find(), you don't need the catch-all expression .* at the end. Also note that the pattern matches any string that contains the character H preceded by at least one digit, with a P somewhere in front of those digits. So if you want to be more restrictive, your regex needs to be tightened.
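The difference is easy to check by running the variants side by side. This is only a minimal sketch: the class name GreedyVsReluctant and the helper firstGroup are made up for illustration, and the third pattern, T(\d+)H, is just one possible way to be more restrictive, assuming the hours field always sits directly between T and H.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Hypothetical helper: returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String data = "P0Y0M0W0DT11H0M0S";
        // Greedy .* leaves only the last digit for the group.
        System.out.println(firstGroup("P.*(\\d+)H", data));  // prints 1
        // Reluctant .*? stops as early as it can, so the group gets both digits.
        System.out.println(firstGroup("P.*?(\\d+)H", data)); // prints 11
        // Assumed stricter variant: only digits directly between T and H.
        System.out.println(firstGroup("T(\\d+)H", data));    // prints 11
    }
}
```

Because find() is used, none of these patterns needs a trailing .* to succeed.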

Related

Python and Java regex behavior different when using the same regex

There are several questions about this, but none answered mine. I want to use Pattern and Matcher to find a pattern in a string and then build a list from the matches, one that also includes the rest that did not match.
String a = "125t160f"; // The "t" could be replaced with symbols such as "." or anything so I wish to take one of anything after many digits.
Matcher m = Pattern.compile("(\\.?\\d+\\.)").matcher(a);
System.out.println(m.find());
My current result:
False
My expected result should be in list:
["125t", "160f"] // I understand how to do it in python but not in java. So could anyone assist me in this.
Your pattern should be \d+\D:
String a = "125t160f";
Matcher m = Pattern.compile("\\d+\\D").matcher(a);
while (m.find()) {
    System.out.println(m.group(0));
}
The above regex pattern says to match one or more digits, followed by a single non-digit character. If we couldn't rely on a non-digit character to know when to stop matching, then we would need to know how many digits to consume.
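Since the question asks for a list rather than printed lines, the same loop can collect the matches. A minimal sketch, where the class name DigitChunks and the method collect are illustrative names and not part of any library:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitChunks {
    // One or more digits followed by exactly one non-digit character.
    private static final Pattern CHUNK = Pattern.compile("\\d+\\D");

    // Collects every match of CHUNK, in order of appearance.
    static List<String> collect(String input) {
        List<String> chunks = new ArrayList<>();
        Matcher m = CHUNK.matcher(input);
        while (m.find()) {
            chunks.add(m.group());
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(collect("125t160f")); // prints [125t, 160f]
    }
}
```

A trailing run of digits with nothing after it (for example "125") produces no match, because \D requires one more character.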

Searching for number after a specific word that does not immediately precede the number

I am trying to use a pattern to search for a Zip Code within a string. I cannot get it to work correctly.
A sample of the inputLine is
What is the weather in 75042?
What I am trying to use for a pattern is
public String getZipcode(String inputLine) {
    Pattern pattern = Pattern.compile(".*weather.*([0-9]+).*");
    Matcher matcher = pattern.matcher(inputLine);
    if (matcher.find()) {
        return matcher.group(1).toString();
    }
    return "Zipcode Not Found.";
}
If I want to get only 75042, what do I need to change? As written, this outputs only the last digit of the number, 2. I am terribly confused and I do not completely understand the Javadocs for the Pattern class.
The reason is that the .* matches the leading digits and leaves only one for your capturing group, so you have to get rid of it.
A simpler pattern can be used here: \D+(\d+)\D+, which means
some non-digits \D+, then some digits to capture (\d+), then some non-digits \D+
public String getZipcode(String inputLine) {
    Pattern pattern = Pattern.compile("\\D+(\\d+)\\D+");
    Matcher matcher = pattern.matcher(inputLine);
    if (matcher.find()) {
        return matcher.group(1).toString();
    }
    return "Zipcode Not Found.";
}
The problem is that your middle .* is too greedy and eats away 7504. One easy fix is to add a space before the digits in your regexp: .*weather.* ([0-9]+).* (or use \\s instead of the literal space). But the best option is the non-greedy quantifier .*?, so the regexp should be .*weather.*?([0-9]+).*
Spaces are missing in your regex (\s). You can use \s* or \s+ based on your data
Pattern pattern = Pattern.compile("weather\\s*\\w+\\s*(\\d+)");
Matcher matcher = pattern.matcher(inputLine);
Your .*weather.*([0-9]+).* pattern grabs the whole line with the first .* and then backtracks until it finds weather. If it finds it, the second .* grabs the rest of the line and backtracks again, but only far enough to free the last digit. Since one digit satisfies the [0-9]+ pattern, only that one digit is stored in capturing group 1. The last .* just consumes the line to its end.
You may solve the issue by just using ".*weather.*?([0-9]+).*" (making the second .* lazy), but since you are using Matcher#find(), you can use a simpler regex:
Pattern pattern = Pattern.compile("weather\\D*(\\d+)");
And after getting a match, retrieve the value with matcher.group(1).
Pattern details
weather - the literal word weather
\\D* - 0+ chars other than digits
(\\d+) - Capturing group 1: one or more digits
See the Java demo:
String inputLine = "What is the weather in 75042?";
Pattern pattern = Pattern.compile("weather\\D*(\\d+)");
Matcher matcher = pattern.matcher(inputLine);
if (matcher.find()) {
    System.out.println(matcher.group(1)); // => 75042
}
I think all you need is \\d+
public String getZipcode(String inputLine) throws Exception {
    Pattern pattern = Pattern.compile("\\d+");
    Matcher matcher = pattern.matcher(inputLine);
    if (matcher.find()) {
        return matcher.group();
    }
    // A good practice is to throw an exception if no result is found
    throw new NoSuchElementException("Zipcode Not Found.");
}
In regular expressions, quantifiers that have no upper bound (*, +) are greedy by default.
Perfect solutions have already been suggested.
I'm just adding one that is very close to yours and addresses the problem in a more isolated way:
If you use the regex
".*weather.*?([0-9]+).*" ... instead of ...
".*weather.*([0-9]+).*"
... your solution will work perfectly well. The '?' after the asterisk instructs the regex compiler to treat the asterisk as non-greedy.
Greedy means consuming as many characters as possible (from left to right) while still allowing the remainder of the regex to match.
Non-greedy means consuming as few characters as possible while still allowing the remainder of the regex to match.
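To see what each quantifier actually consumes, the middle .* can be wrapped in a capturing group of its own. The sketch below does that; the extra group, the class name GreedyConsumption, and the helper groups are illustration-only additions and not part of the original pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyConsumption {
    // Hypothetical helper: returns {group 1, group 2} of the first match, or null if none.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String line = "What is the weather in 75042?";

        // Greedy: group 1 swallows as much as it can and leaves one digit for group 2.
        String[] greedy = groups("weather(.*)([0-9]+)", line);
        System.out.println("[" + greedy[0] + "] [" + greedy[1] + "]"); // prints [ in 7504] [2]

        // Non-greedy: group 1 stops at the first digit, so group 2 gets the whole number.
        String[] lazy = groups("weather(.*?)([0-9]+)", line);
        System.out.println("[" + lazy[0] + "] [" + lazy[1] + "]");     // prints [ in ] [75042]
    }
}
```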

A strange regular expression with lookbehind

I wrote a piece of code to fetch the content of a string between ":" (which may be absent) and "#"; the order is guaranteed. For example, from a string like "url:123#my.com" I want to fetch "123", and from "123#my.com" I want to fetch "123" too. I wrote a regular expression to implement this, but it does not work. Here is the first version:
Pattern pattern = Pattern.compile("(?<=:?).*?(?=#)");
Matcher matcher = pattern.matcher("sip:+8610086#dmcw.com");
if (matcher.find()) {
    Log.d("regex", matcher.group());
} else {
    Log.d("regex", "not match");
}
It does not work, because in the first case, "url:123#my.com", it returns "url:123",
which is obviously not what I want.
So I wrote the second version:
Pattern pattern = Pattern.compile("(?<=:??).*?(?=#)");
But it throws an error; somebody said Java does not support variable-length lookbehind.
So I tried the third version:
Pattern pattern = Pattern.compile("(?<=:).*?(?=#)|.*?(?=#)");
Its result is the same as the first version, but shouldn't the first alternative be considered first?
It behaves the same as
Pattern pattern = Pattern.compile(".*?(?=#)|(?<=:).*?(?=#)");
So it is not evaluated left to right! I thought I understood regular expressions, but now I am confused again. Thanks in advance anyway.
Try this (slightly edited, see comments):
String test = "sip:+8610086#dmcw.com";
String test2 = "8610086#dmcw.com";
Pattern pattern = Pattern.compile("(.+?:)?(.+?)(?=#)");
Matcher matcher = pattern.matcher(test);
if (matcher.find()) {
    System.out.println(matcher.group(2));
}
matcher = pattern.matcher(test2);
if (matcher.find()) {
    System.out.println(matcher.group(2));
}
Output:
+8610086
8610086
Let me know if you need explanations on the pattern.
You really don't need any look-aheads or look-behinds here. What you need can be accomplished by using a greedy quantifier and some alternation:
.*(?:^|:)([^#]+)
By default, Java regular expression quantifiers (*, +, {n}, ?) are all greedy (they will match as many characters as possible while still allowing the overall match to succeed). They can be made lazy by adding a question mark after the quantifier, like so: .*?
You will want to output capture group 1 for this expression, outputting capture group 0 will return the entire match.
As you said, you can't use a variable-length lookbehind in Java.
Instead, you can do something like this; you don't need lookbehind or any other lookaround.
Regex: :?([^#:]*)#
In this example (ignore the \n, it is only there because of regex101) you will get what you need in the first group, and you don't have to do anything special. Sometimes the easiest solution is the best.
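The two lookaround-free patterns from the answers above, .*(?:^|:)([^#]+) and :?([^#:]*)#, can be tried against the inputs from the question. This is a rough sketch only: the class name ColonHashExtract and the helper group1 are invented for the example, and the last input (with a colon after the #) is an assumed edge case that does not appear in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColonHashExtract {
    static final String GREEDY_ALTERNATION = ".*(?:^|:)([^#]+)";
    static final String OPTIONAL_COLON = ":?([^#:]*)#";

    // Hypothetical helper: returns capture group 1 of the first match, or null if none.
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] inputs = { "url:123#my.com", "123#my.com", "sip:+8610086#dmcw.com" };
        for (String input : inputs) {
            // Both patterns yield 123, 123 and +8610086 for these inputs.
            System.out.println(group1(GREEDY_ALTERNATION, input)
                    + " / " + group1(OPTIONAL_COLON, input));
        }
        // Assumed edge case: the greedy .* runs to the LAST colon in the string.
        System.out.println(group1(GREEDY_ALTERNATION, "sip:+86#host:5060")); // prints 5060
        System.out.println(group1(OPTIONAL_COLON, "sip:+86#host:5060"));     // prints +86
    }
}
```

If a colon can appear after the #, the second pattern is the safer choice, because it ties the captured text directly to the # that follows it.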

Matching several URLs in a string using regex

I'm trying to match a URL in a string, using regex from here: Regular expression to match URLs in Java
It works fine with one URL, but when I have two URLs in the string, it only matched the latter.
Here's the code:
Pattern pat = Pattern.compile(".*((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
Matcher matcher = pat.matcher("asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe");
// now matcher.groupCount() == 2, not 4
Edit: stuff I've tried:
// .* removed, now doesn't match anything // Another edit: actually works, see below
Pattern pat = Pattern.compile("((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
// .* made lazy, still only matches one
Pattern pat = Pattern.compile(".*?((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
Any ideas?
It's because .* is greedy. It will just consume as much as possible (the whole string) and then backtrack. I.e. it will throw away one character at a time until the remaining characters can make up a URL. Hence the first URL will already be matched, but not captured. And unfortunately, matches cannot overlap. The fix should be simple. Remove the .* at the beginning of your pattern. Then you can also remove the outer parentheses from your pattern - there is no need to capture anything any more, because the whole match will be the URL you are looking for.
Pattern pat = Pattern.compile("(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", Pattern.DOTALL);
Matcher matcher = pat.matcher("asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe");
while (matcher.find()) {
    System.out.println(matcher.group());
}
By the way, matcher.groupCount() does not tell you anything here, because it gives you the number of capturing groups in your pattern and not the number of matches in your target string. That's why your second approach (using .*?) did not help: you still have two capturing groups in the pattern. Before calling find or anything else, the matcher does not know how many matches it will find in total.
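The distinction between groupCount() and the number of matches can be made concrete with a small counter. This is a sketch under assumptions: the URL pattern is deliberately simplified to (https?|ftp|file)://\S+ to keep the example short, and the names GroupCountVsMatches, URL and countMatches are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountVsMatches {
    // Simplified stand-in for the URL pattern from the question (an assumption, not the original).
    static final Pattern URL = Pattern.compile("(https?|ftp|file)://\\S+");

    // Counts matches by calling find() until it fails.
    static int countMatches(String input) {
        Matcher m = URL.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe";
        // groupCount() depends only on the pattern: one pair of capturing parentheses.
        System.out.println(URL.matcher(text).groupCount()); // prints 1
        // The number of URLs has to be found by iterating over the input.
        System.out.println(countMatches(text));             // prints 2
    }
}
```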

Java Regex Behavior

I am trying to apply the below pattern:
Pattern p = Pattern.compile(".*?");
Matcher m = p.matcher("RAJ");
StringBuffer sb = new StringBuffer();
while (m.find()) {
    m.appendReplacement(sb, "L");
}
m.appendTail(sb);
Expected Output : LLL
Actual output : LRLALJL
Does the dot (.) in the above regex match the position between the characters? If not, why do I get the above output?
The .*? matches any number of characters, but as few as necessary to match the whole regex (the ? makes the * reluctant (also known as lazy)). Since there's nothing after that in the regex, this will always match the empty string (a.k.a the place between characters).
If you want at least a single character to be matched try .+?. Note that this is the same as just . if there's nothing else after it in the regex.
You can get it doing this:
String s = "RAJ";
s = s.replaceAll(".","L");
System.out.println(s);
You could also do it with a Matcher and its find method, but replaceAll already accepts a regex.
It is not that . matches between the characters, but that * means 0 or more and the ? means as few as possible.
So "Zero or more things, and as few of them as possible" will always match Zero things, as that is the fewest possible, if it's not followed by something else the expression is looking for.
.{1} would result in an output of LLL as it matches anything once.
The * in your regex .*? means none or more repetitions. If you want to match at least a single character use the regex .+?.
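The behaviour described in these answers can be reproduced by running the same replacement loop with different patterns. A minimal sketch; the class name EmptyMatchDemo and the helper replaceEach are illustrative names only, and the loop mirrors the one in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Replaces every match of regex in input, using the same loop as the question.
    static String replaceEach(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, replacement);
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // .*? matches the empty string at every position, including the end of the input.
        System.out.println(replaceEach(".*?", "RAJ", "L")); // prints LRLALJL
        // .+? must consume at least one character, so each letter is replaced once.
        System.out.println(replaceEach(".+?", "RAJ", "L")); // prints LLL
        // A plain dot behaves the same way here.
        System.out.println(replaceEach(".", "RAJ", "L"));   // prints LLL
    }
}
```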
