Regex matches exact string contains word - java

I want to "catch" the following path so that I can perform some action on it:
/root/m/api/users/<user-id-can be any combination of characters and digits>/content
The path must end with content.
For example:
/root/m/api/users/acme/content
To do so, I need a regex that tells me whether this is the correct path:
private boolean isPathAllow(final String urlToBlock) {
Matcher matcher = Pattern.compile("^/root/m/api/users/.*/content$").matcher(urlToBlock);
return matcher.matches();
}
But it returns true even for requests like:
/root/m/api/users/acme/applications/versions/1.0/content
So I must be doing something wrong in the matching.
How can I make it behave as intended?

I succeeded with:
Matcher matcher = Pattern.compile("^/root/m/api/users/\\w*/content$").matcher(urlToBlock);
or
Matcher matcher = Pattern.compile("^/root/m/api/users/[^/]+/content$").matcher(urlToBlock);
So what are the differences between them (\\w* vs [^/]+)?

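.* is greedy, and . also matches /, so it takes everything between users/ and the final /content, including any intermediate path segments.
Use [^/]+ to match only characters that are not / between users/ and /content. Making the .* lazy by appending a question mark (.*?) does not help here: the pattern is anchored with ^ and $ and matches() must cover the whole input, so the lazy quantifier simply expands until /content$ fits.
A 'greedy' quantifier tries to match as many tokens as possible, then gives some back if the rest of the pattern fails. A 'lazy' quantifier tries to match as few as possible, then expands only as far as needed.
In some cases greedy quantifiers can also be much less efficient, because the regex engine first consumes far more than the actual match and only backtracks after the rest of the pattern fails.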
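To make the difference between the candidate patterns concrete, here is a small sketch that runs each one against the same paths. The class, field and method names (PathMatchDemo, allowed, and so on) are made up for illustration and are not from the question. It also shows where \\w* and [^/]+ diverge: \\w accepts only letters, digits and underscore and, with *, also accepts an empty id, while [^/]+ accepts any character except / and requires at least one.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; all names here are illustrative only.
class PathMatchDemo {
    static final String PREFIX = "^/root/m/api/users/";
    // "." also matches "/", so nested segments are accepted.
    static final Pattern GREEDY_DOT = Pattern.compile(PREFIX + ".*/content$");
    // Lazy variant: still accepts nested segments because the pattern is anchored.
    static final Pattern LAZY_DOT = Pattern.compile(PREFIX + ".*?/content$");
    // Any character except "/", at least one.
    static final Pattern NO_SLASH = Pattern.compile(PREFIX + "[^/]+/content$");
    // Only [A-Za-z0-9_], and zero characters are allowed too.
    static final Pattern WORD_CHARS = Pattern.compile(PREFIX + "\\w*/content$");

    static boolean allowed(Pattern p, String path) {
        return p.matcher(path).matches();
    }

    public static void main(String[] args) {
        String nested = "/root/m/api/users/acme/applications/versions/1.0/content";
        String dotted = "/root/m/api/users/john.doe/content";
        String empty = "/root/m/api/users//content";

        System.out.println(allowed(GREEDY_DOT, nested)); // true
        System.out.println(allowed(LAZY_DOT, nested));   // true
        System.out.println(allowed(NO_SLASH, nested));   // false
        System.out.println(allowed(WORD_CHARS, dotted)); // false: "." is not a word character
        System.out.println(allowed(NO_SLASH, dotted));   // true
        System.out.println(allowed(WORD_CHARS, empty));  // true: \w* accepts an empty id
        System.out.println(allowed(NO_SLASH, empty));    // false: + needs at least one character
    }
}
```

If user ids may contain characters such as . or -, [^/]+ is probably the safer choice of the two.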

Related

Regex For All String Except Certain Characters

I am trying to write a regular expression that matches a certain word that is not preceded by 2 dashes (--) or a slash and a star (/*). I tried some expression but none seem to work.
Below is the text I am testing on
a_func(some_param);
/* a comment initialization */
init;
I am trying to write a regex that matches only the word init on the last line. Everything I have tried so far matches init both in initialization and on the last line. I looked at existing answers and found one that used negative lookahead, but it still matched the init in initialization. Below are the expressions I tried:
(?!\/\*).*(init)
[^(\-\-|\/\*)].*(init)
(?<!\/\*).*(init)
While reading regex101's quick reference, I found the negative lookbehind, which had an example similar to what I need, but I still could not get what I want. Should I look into negative lookbehind further, or is this not the way to achieve it?
My knowledge of regular expressions is not that extensive, so I don't know whether what I want is possible. Is it doable?
Assuming the -- or /* are on the same line as the init, there are some options. As the commenters said, multiline comments will likely require stronger techniques.
The simplest way I know is to actually preprocess the strings to remove the --.*$ and /\*.*$, then look for init (or init\b if you don't want to match initialization):
String input = "if init then foo";
String checkme = input.replaceAll("--.*$", "").replaceAll("/\\*.*$", "");
Pattern pattern = Pattern.compile("init"); // or "init\\b"
Matcher matcher = pattern.matcher(checkme);
System.out.println(matcher.find());
You can also use negative lookbehind as in @olsli's answer.
You can start with:
String input = "/*init";
Pattern pattern = Pattern.compile("^((?!--|\\/\\*).)*init");
Matcher matcher = pattern.matcher(input);
System.out.println(matcher.find());
I have added more braces to separate things out. This should also work, tested it in Regexr and IDEONE
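// (?!...) is the negative lookahead (not (?!=...)); it rejects lines that begin with /* or --.
// A bare . matches any character, whereas [.] matches only a literal dot.
Pattern p = Pattern.compile("^(?!((/\\*)|(--)))(.*?)init.*$", Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
String s = "/* Initialisation";
Matcher m = p.matcher(s);
System.out.println(m.find()); // find() returns a boolean; false here because the line starts with /*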
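Putting the ideas from these answers together, one possible sketch checks each line separately and requires init as a whole word that appears before any -- or /* on that line. The class and method names (InitFinder, linesWithInit) are hypothetical, and like the answers above this assumes the comment marker sits on the same line as the word; multiline comments are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class InitFinder {
    // (?:(?!--|/\*).)*? consumes characters one at a time, but only while
    // neither "--" nor "/*" starts at the current position.
    // \binit\b then requires "init" as a whole word, so "initialization" is skipped.
    static final Pattern INIT_BEFORE_COMMENT =
            Pattern.compile("^(?:(?!--|/\\*).)*?\\binit\\b");

    // Returns the lines in which "init" occurs outside a trailing comment.
    static List<String> linesWithInit(String text) {
        List<String> hits = new ArrayList<>();
        for (String line : text.split("\\R")) { // \R matches any line break (Java 8+)
            if (INIT_BEFORE_COMMENT.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "a_func(some_param);\n/* a comment initialization */\ninit;";
        System.out.println(linesWithInit(text)); // [init;]
    }
}
```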

Grouping multiple digits prior to a known value

I'm executing this regex code expecting a group value of 11, but I am getting 1. It seems like the group contains the correct regex for getting one or more digits prior to a known value. I'm sure it is simple, but I cannot seem to figure it out.
String mydata = "P0Y0M0W0DT11H0M0S";
Pattern pattern = Pattern.compile("P.*(\\d+)H.*");
Matcher matcher = pattern.matcher(mydata);
if (matcher.find()){
System.out.println(matcher.group(1));
}
Try this
public static void main(String a1[]) {
String mydata = "P0Y0M0W0DT11H0M0S";
Pattern pattern = Pattern.compile("P.*?(\\d+)H.*");
Matcher matcher = pattern.matcher(mydata);
if (matcher.find()){
System.out.println(matcher.group(1));
}
}
Output
11
The problem is that .* will try to consume/match as much as possible before the next part is checked. Thus in your regex P.*(\d+)H.* the first .* will match 0Y0M0W0DT1 since that's as much as can be matched with the group still being able to match a single digit afterwards.
If you make that quantifier lazy/reluctant (i.e. .*?), it will try to match as little as possible so of the possible matches 0Y0M0W0DT1 and 0Y0M0W0DT it will select the shorter one and leave all the digits for the group to match.
Thus the regex P.*?(\d+)H.* should do what you want.
Additional note: since you're using Matcher#find() you don't need the catch-all expression .* at the end. The regex would also match any string that contains the character H preceded by at least one digit and a P somewhere in front of those digits. So if you want to be more restrictive, your regex would need to be enhanced.
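The greedy and lazy behaviour described above can be compared side by side with a short sketch. The helper firstGroup and the class name are made up for illustration. The third pattern, T(\\d+)H, is one possible more restrictive variant; it assumes the hours always follow the T separator directly, as they do in the sample string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class DurationHours {
    // Returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String mydata = "P0Y0M0W0DT11H0M0S";
        // Greedy .* leaves only the last digit for the group.
        System.out.println(firstGroup("P.*(\\d+)H.*", mydata));  // 1
        // Lazy .*? stops as early as possible, so the group gets both digits.
        System.out.println(firstGroup("P.*?(\\d+)H.*", mydata)); // 11
        // Assumed stricter variant: digits directly between T and H.
        System.out.println(firstGroup("T(\\d+)H", mydata));      // 11
    }
}
```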

Matching several URLs in a string using regex

I'm trying to match a URL in a string, using regex from here: Regular expression to match URLs in Java
It works fine with one URL, but when I have two URLs in the string, it only matched the latter.
Here's the code:
Pattern pat = Pattern.compile(".*((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
Matcher matcher = pat.matcher("asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe");
// now matcher.groupCount() == 2, not 4
Edit: stuff I've tried:
// .* removed, now doesn't match anything // Another edit: actually works, see below
Pattern pat = Pattern.compile("((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
// .* made lazy, still only matches one
Pattern pat = Pattern.compile(".*?((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
Any ideas?
It's because .* is greedy. It will just consume as much as possible (the whole string) and then backtrack. I.e. it will throw away one character at a time until the remaining characters can make up a URL. Hence the first URL will already be matched, but not captured. And unfortunately, matches cannot overlap. The fix should be simple. Remove the .* at the beginning of your pattern. Then you can also remove the outer parentheses from your pattern - there is no need to capture anything any more, because the whole match will be the URL you are looking for.
Pattern pat = Pattern.compile("(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", Pattern.DOTALL);
Matcher matcher = pat.matcher("asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe");
while (matcher.find()) {
System.out.println(matcher.group());
}
By the way, matcher.groupCount() does not tell you anything here, because it gives you the number of capturing groups in your pattern, not the number of matches in your target string. That's why your second approach (using .*?) did not help: you still have two capturing groups in the pattern. Before you call find, the matcher does not know how many matches it will find in total.
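To see the difference between groupCount() and the number of matches, here is a minimal sketch that collects every URL with a find() loop. The class and method names (UrlFinder, findAll) are hypothetical; the character class is the one from the question, written here with @ in it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class UrlFinder {
    static final Pattern URL = Pattern.compile(
            "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Each successful find() yields one match; group() is the whole matched URL.
    static List<String> findAll(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe";
        // groupCount() depends only on the pattern: one capturing group here.
        System.out.println(URL.matcher(text).groupCount()); // 1
        System.out.println(findAll(text).size());           // 2
        System.out.println(findAll(text));
    }
}
```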

Find string in between two strings using regular expression

I am using a regular expression for finding string in between two strings
Code:
Pattern pattern = Pattern.compile("EMAIL_BODY_XML_START_NODE"+"(.*)(\\n+)(.*)"+"EMAIL_BODY_XML_END_NODE");
Matcher matcher = pattern.matcher(part);
if (matcher.find()) {
..........
It works fine for plain text, but it breaks when the text contains special characters such as newlines.
You need to compile the pattern so that . matches line terminators as well. To do this, use the DOTALL flag.
Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
edit: Sorry, it's been a while since I've had this problem. You'll also have to change the middle of the regex from (.*)(\\n+)(.*) to (.*?). You need the lazy quantifier (*?) if you have multiple EMAIL_BODY_XML_START_NODE elements. Otherwise the regex will match from the start of the first element to the end of the last element rather than producing a separate match for each element. Though I'm guessing this is unlikely to be the case for you.
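As a rough sketch of the combined fix (DOTALL plus a lazy group), the following collects the text between each start/end pair. The class and method names (BodyExtractor, bodies) are hypothetical, and Pattern.quote is used only as a precaution in case the marker strings ever contain regex metacharacters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class BodyExtractor {
    static final String START = "EMAIL_BODY_XML_START_NODE";
    static final String END = "EMAIL_BODY_XML_END_NODE";

    // DOTALL lets "." match line terminators; "(.*?)" stops at the nearest END marker.
    static final Pattern BODY = Pattern.compile(
            Pattern.quote(START) + "(.*?)" + Pattern.quote(END), Pattern.DOTALL);

    // Returns the content of every START...END pair, in order of appearance.
    static List<String> bodies(String part) {
        List<String> result = new ArrayList<>();
        Matcher m = BODY.matcher(part);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String part = START + "line1\nline2" + END + " other text " + START + "second" + END;
        System.out.println(bodies(part).size()); // 2
    }
}
```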

How can I make a Java regex all or nothing?

I'm trying to make a regex all or nothing in the sense that the given word must EXACTLY match the regular expression - if not, a match is not found.
For instance, if my regex is:
^[a-zA-Z][a-zA-Z|0-9|_]*
Then I would want to match:
cat9
cat9_
bob_____
But I would NOT want to match:
cat7-
cat******
rango78&&
I want my regex to be as strict as possible, going for an all or nothing approach. How can I go about doing that?
EDIT: To make my regex absolutely clear, a pattern must start with a letter, followed by any number of numbers, letters, or underscores. Other characters are not permitted. Below is the program in question I am using to test out my regex.
Pattern p = Pattern.compile("^[a-zA-Z][a-zA-Z|0-9|_]*");
Scanner in = new Scanner(System.in);
String result = "";
while(!result.equals("-1")){
result = in.nextLine();
Matcher m = p.matcher(result);
if(m.find())
{
System.out.println(result);
}
}
I think that if you use String.matches(regex), then you will get the effect you are looking for. The documentation says that matches() will return true only if the entire string matches the pattern.
The regex won't match the second example. It's already strict, since * and & are not in the allowed set of characters.
It may match a prefix, but you can avoid this by adding '$' to the end of the regex, which explicitly matches end of input. So try,
^[a-zA-Z][a-zA-Z|0-9|_]*$
This will ensure the match is against the entire input string, and not just a prefix.
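Note that \w is the same as [A-Za-z0-9_] (by default in Java), and that | inside a character class is a literal pipe rather than alternation, so [a-zA-Z|0-9|_] also accepts the | character. You also need to anchor to the end of the string, like so:
Pattern p = Pattern.compile("^[a-zA-Z]\\w*$");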
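The difference between find() on the original pattern and a whole-string match can be checked with a short sketch. The class and method names (IdentifierCheck, isIdentifier, foundByFind) are made up for illustration. With matches(), the ^ and $ anchors are not strictly needed, because matches() already requires the entire input to match.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class IdentifierCheck {
    // A letter followed by any number of letters, digits or underscores.
    static final Pattern ID = Pattern.compile("[a-zA-Z]\\w*");
    // The pattern from the question, with no end anchor.
    static final Pattern ORIGINAL = Pattern.compile("^[a-zA-Z][a-zA-Z|0-9|_]*");

    // All or nothing: the entire string must match.
    static boolean isIdentifier(String s) {
        return ID.matcher(s).matches();
    }

    // Behaviour of the question's code: a matching prefix is enough.
    static boolean foundByFind(String s) {
        return ORIGINAL.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("cat9_")); // true
        System.out.println(isIdentifier("cat7-")); // false
        System.out.println(foundByFind("cat7-"));  // true: the prefix "cat7" matches
    }
}
```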
