Java Regex - Pattern.match fails to deal with lazy match


This is the sample code:
String m_testPattern = "AB.*?";
String m_testMatcherString = "ABCDCDCDCD";
final Pattern pattern = Pattern.compile(m_testPattern);
final Matcher matcher = pattern.matcher(m_testMatcherString);
if (matcher.matches()) {
    // This means the regex matches
    System.out.println("Successful comparison");
} else {
    // match failed
    System.out.println("Comparison failed !!!");
}
Ideally the match operation should fail and print "Comparison failed !!!", but this snippet prints "Successful comparison".
I checked the same input with online regex tools and the result was different. I tried it on http://regexr.com/v1/
There, when I enter AB.*? as the regex and ABCDCDCDCD as the string to be compared, the search stops at AB.
This means the comparison performed there is lazy, not greedy.
Can anyone please explain why the same use case behaves differently with Java's Matcher.matches() method?
My test cases are something like:
1. AB\wCD should match ABZCD and fail on AB2CD
2. AB\w{2}CD should match ABZZCD
3. AB\d{1,3}CD should match AB555CD, AB6CD or AB77CD and fail on ABCD, AB9999CD etc.
4. AB.* should match AB (followed by anything)
5. AB.*? should fail if an input like ABCDCDCD is given for comparison
The first four cases pass when using matcher.matches().
Only the fifth one gives a wrong answer: it also reports success even though it should fail.
Thanks in advance

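matches()
returns true only if the whole string matches the given pattern. To make that possible, the lazy quantifier in AB.*? is expanded until the pattern covers the entire input, so ABCDCDCDCD matches.
find()
tries to find a substring that matches the pattern. This is what the online tool does, which is why it stops at AB.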
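A minimal sketch of the difference, using the pattern and input from the question. The class name LazyMatchDemo and its two helper methods are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyMatchDemo {
    private static final Pattern LAZY = Pattern.compile("AB.*?");

    // matches() succeeds only if the pattern can cover the entire input,
    // so the lazy quantifier is forced to expand up to the end.
    static boolean wholeInputMatches(String input) {
        return LAZY.matcher(input).matches();
    }

    // find() looks for the first matching substring; here the lazy
    // quantifier is free to stop as early as possible.
    static String firstMatch(String input) {
        Matcher m = LAZY.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(wholeInputMatches("ABCDCDCDCD")); // true
        System.out.println(firstMatch("ABCDCDCDCD"));        // AB
    }
}
```

If the fifth test case is meant to accept only the exact text AB, the pattern AB on its own with matches() would express that directly; that reading of the requirement is an assumption.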

Related

how to substring and extract a dynamic content

I have a string that is loaded on the page depending on whether the search succeeds or fails. If the search succeeds, my output table will contain a string like
Search Results - 31 Items Found (Debug Code: 4b50016efc3a1ad93502)
or if my search results fail, then the output table will display a string either:
Search Results - No data found (Debug Code: 4b50016efc3a1ad93502)
Search Results - 0 Items Found (Debug Code: 4b50016efc3a1ad93502)
Search Results - An Exception Occurred (Debug Code: 4b50016efc3a1ad93502)
depending on the input conditions.
I want to extract the Debug Code value and pass it on to another scenario for further validation. I know I can use substring() to extract the Debug Code, but its position in the string is not constant; it varies with the input conditions, although the Debug Code is always at the end.
How can I extract the Debug Code value (e.g. 4b50016efc3a1ad93502) for all scenarios?

You can use regex to capture and return the target code:
String debugCode = output.replaceAll(".*Debug Code: (\\w+).*|.*", "$1");
What's happening here?
The regex .*Debug Code: (\\w+).* matches the entire string, and captures (with brackets) the target text using \\w+, which means "one or more 'word' characters" (letters, digits and the underscore). The replacement string $1 means "group 1", the first captured group. Because the entire input is matched, this operation effectively replaces the whole input with the captured group, so it returns just the target text.
So what happens if the input doesn't have a debug code?
The extra |.* at the end of the regex means "or 'anything'", so the regex will match the entire input even if it doesn't have a debug code. In that case group 1 does not take part in the match and is treated as empty in the replacement, so the operation returns the blank string.
Examples:
String output1 = "Results - 0 Items Found (Debug Code: 4b50016efc3a1ad93502)";
String output2 = "something else";
String code1 = output1.replaceAll(".*Debug Code: (\\w+).*|.*", "$1"); // "4b50016efc3a1ad93502"
String code2 = output2.replaceAll(".*Debug Code: (\\w+).*|.*", "$1"); // ""
You don't have to use "|.*", but if you don't have it, the entire string will be returned when the input doesn't have the 'debug code' format.
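As an alternative sketch, the same capture can be done with an explicit Matcher, which makes the "no debug code" case visible instead of folding it into an empty string. The class name DebugCodeExtractor and the choice of Optional as return type are assumptions made for this example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DebugCodeExtractor {
    // \w+ matches one or more letters, digits or underscores.
    private static final Pattern DEBUG_CODE = Pattern.compile("Debug Code: (\\w+)");

    // Returns the captured code, or an empty Optional when the input
    // does not contain the "Debug Code: ..." format.
    static Optional<String> extract(String output) {
        Matcher m = DEBUG_CODE.matcher(output);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String ok = "Search Results - No data found (Debug Code: 4b50016efc3a1ad93502)";
        System.out.println(extract(ok).orElse(""));              // 4b50016efc3a1ad93502
        System.out.println(extract("something else").isPresent()); // false
    }
}
```

Because find() is used, the pattern does not need the leading and trailing .* that the replaceAll version relies on.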

A Java String is immutable. No matter how you get a String, its contents will not change.
For example:
String s = "the way we were";
String t = s.substring(4, 7); // t = "way"
s = "abcdefghijk";
s now refers to a new String, but t is unchanged.

Java regex to detect semver strings is failing without qualifiers

I am trying to get a Java method to validate whether or not a String argument is a properly-formatted "semver" (semantic versioning) version string.
In my app, semver strings must be of the form:
<major>.<minor>.<patch>-<qualifier>
Where:
<major> is a positive integer (1+)
<minor> and <patch> are both non-negative integers (0+)
<qualifier> is an optional alphanumeric string ([0-9a-zA-Z]+)
Valid examples:
1.2.40
1.0.0-SNAPSHOT
2.0.45-RC
3.10.0
My best attempt thus far:
public boolean isSemVer(String version) {
    Pattern versionPattern = Pattern.compile("^[a-zA-Z-]+\\d+\\.\\d+\\.\\d+");
    Matcher matcher = versionPattern.matcher(version);
    return matcher.matches();
}
Produces false for the first valid example of 1.2.40. Can anyone tell me where I'm going awry and what I need to tweak in my regex to get it to accept my use cases? Thanks in advance!

Your valid strings start with digits and not with letters, so [a-zA-Z-]+ in your pattern already makes the pattern wrong.
Use
^[1-9]\d*\.\d+\.\d+(?:-[a-zA-Z0-9]+)?$
Details
^ - start of string
[1-9]\d* - a digit from 1 to 9 and then 0 or more digits
\.\d+\.\d+ - two occurrences of . and 1+ digits (can be written as (?:\.\d+){2})
(?:-[a-zA-Z0-9]+)? - an optional occurrence of - and 1+ alphanumeric chars ([a-zA-Z0-9] can be written as \p{Alnum})
$ - end of string.
In Java, use it with .matches(), which requires the whole string to match, so the ^ and $ anchors can be dropped:
public boolean isSemVer(String version) {
    Pattern versionPattern = Pattern.compile("[1-9]\\d*\\.\\d+\\.\\d+(?:-[a-zA-Z0-9]+)?");
    Matcher matcher = versionPattern.matcher(version);
    return matcher.matches();
}
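A small self-contained check of that pattern against the examples from the question. The class name SemVerCheck and the extra negative inputs (0.1.0, 1.2, 1.0.0-) are additions for illustration, not part of the original answer.

```java
import java.util.regex.Pattern;

class SemVerCheck {
    // Compiled once and reused; matches() anchors the pattern at both ends.
    private static final Pattern SEMVER =
            Pattern.compile("[1-9]\\d*\\.\\d+\\.\\d+(?:-[a-zA-Z0-9]+)?");

    static boolean isSemVer(String version) {
        return SEMVER.matcher(version).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"1.2.40", "1.0.0-SNAPSHOT", "2.0.45-RC", "3.10.0",
                            "0.1.0", "1.2", "1.0.0-"};
        for (String v : samples) {
            // The first four print true, the last three print false.
            System.out.println(v + " -> " + isSemVer(v));
        }
    }
}
```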

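You can also try the official SemVer regex. It is shown here as a plain regex; in a Java string literal every backslash must be doubled. Note that it is broader than the rules in the question: it also accepts a major version of 0, dotted qualifiers, and +build metadata.
^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$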
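A sketch of how that regex could be written as a Java string literal, with the backslashes doubled and the pattern split across lines for readability. The class name OfficialSemVer and the sample inputs are assumptions for this example.

```java
import java.util.regex.Pattern;

class OfficialSemVer {
    // Same regex as above, escaped for a Java string literal.
    private static final Pattern SEMVER = Pattern.compile(
            "^(0|[1-9]\\d*)\\.(0|[1-9]\\d*)\\.(0|[1-9]\\d*)"
            + "(?:-((?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*)"
            + "(?:\\.(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
            + "(?:\\+([0-9a-zA-Z-]+(?:\\.[0-9a-zA-Z-]+)*))?$");

    static boolean isSemVer(String version) {
        return SEMVER.matcher(version).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSemVer("1.0.0-SNAPSHOT"));        // true
        System.out.println(isSemVer("0.1.0"));                 // true (major 0 is allowed here)
        System.out.println(isSemVer("1.0.0-alpha.1+build.5")); // true
        System.out.println(isSemVer("01.2.3"));                // false (leading zero)
    }
}
```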

JAVA REGEX: Match until the specific character

I have this Java code
String cookies = TextUtils.join(";", LoginActivity.msCookieManager.getCookieStore().getCookies());
Log.d("TheCookies", cookies);
Pattern csrf_pattern = Pattern.compile("csrf_cookie=(.+)(?=;)");
Matcher csrf_matcher = csrf_pattern.matcher(cookies);
while (csrf_matcher.find()) {
    json.put("csrf_key", csrf_matcher.group(1));
    Log.d("CSRF KEY", csrf_matcher.group(1));
}
The String contains something like this:
SessionID=sessiontest;csrf_cookie=e18d027da2fb95e888ebede711f1bc39;ci_session=3f4675b5b56bfd0ba4dae46249de0df7994ee21e
I'm trying to get the csrf_cookie data by using this regular expression:
csrf_cookie=(.+)(?=;)
I expect a result like this in the code:
csrf_matcher.group(1);
e18d027da2fb95e888ebede711f1bc39
Instead I get:
3492f8670f4b09a6b3c3cbdfcc59e512;ci_session=8d823b309a361587fac5d67ad4706359b40d7bd0
What is a possible workaround for this problem?

Here is a one-liner using String#replaceAll:
String input = "SessionID=sessiontest;csrf_cookie=e18d027da2fb95e888ebede711f1bc39;ci_session=3f4675b5b56bfd0ba4dae46249de0df7994ee21e";
String cookie = input.replaceAll(".*csrf_cookie=([^;]*).*", "$1");
System.out.println(cookie);
e18d027da2fb95e888ebede711f1bc39
Note: We could have used a formal regex Pattern and Matcher, and in fact you may want to do so if you need to run this search often in your code, since the pattern can then be compiled once and reused.

You are getting more data than expected because you are using a greedy '+' (it will match as much as it can).
For example, on aaa the pattern a+ could match a, aa, or aaa. The latter is 'preferred' if the pattern is greedy.
So you are matching
csrf_cookie=e18d027da2fb95e888ebede711f1bc39;ci_session=3f4675b5b56bfd0ba4dae46249de0df7994ee21e;
as long as it ends with a ';'. The first ';' is consumed by .+ and the last ';' is found by the positive lookahead.
To make a pattern ungreedy/lazy use +? instead of + (so a+? would match a three times on the string aaa).
So try with:
csrf_cookie=(.+?);
or just match anything that is not a ';'
csrf_cookie=([^;]*);
that way you don't need to make it lazy.
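The three variants can be compared side by side in a short sketch. The cookie string below is a shortened, hypothetical one with an extra cookie at the end so that the greedy behaviour shows up; the class name CookieValueDemo and the helper firstGroup are also made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CookieValueDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String cookies = "SessionID=sessiontest;csrf_cookie=e18d027d;ci_session=3f4675b5;lang=en";

        // Greedy: runs to the last ';' in the input.
        System.out.println(firstGroup("csrf_cookie=(.+)(?=;)", cookies)); // e18d027d;ci_session=3f4675b5
        // Lazy: stops at the first ';'.
        System.out.println(firstGroup("csrf_cookie=(.+?);", cookies));    // e18d027d
        // Negated class: cannot run past a ';' at all.
        System.out.println(firstGroup("csrf_cookie=([^;]*)", cookies));   // e18d027d
    }
}
```

The negated-class variant is written here without the trailing ';', so it would also work when csrf_cookie happens to be the last cookie in the string.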

Java(Apex) RegEx not working?

I am having trouble with a regex in Salesforce Apex. As far as I can see, Apex uses the same regex syntax and logic as Java, so I am addressing this to Java developers as well.
I debugged the String and it is correct: street equals 'str 3 B'.
When using http://www.regexr.com/, the regex '\d \w$' works.
The code:
Matcher hasString = Pattern.compile('\\d \\w$').matcher(street);
if(hasString.matches())
My problem is that hasString.matches() evaluates to false. Can anyone tell me whether I did something wrong? I tried it without the $, with different casing, etc., and I just can't get it to work.
Thanks in advance!

You need to use find() instead of matches() for a partial match, because matches() attempts to match the complete input text.
Matcher hasString = Pattern.compile("\\d \\w$").matcher(street);
if (hasString.find()) {
    // matched
    System.out.println("Start position: " + hasString.start());
}
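A Java sketch of both options for the street value from the question: keeping the pattern and switching to find(), or keeping matches() and letting the pattern cover the leading text with .*. The class and method names are hypothetical, and the code is plain Java rather than Apex.

```java
import java.util.regex.Pattern;

class StreetSuffixDemo {
    private static final Pattern SUFFIX = Pattern.compile("\\d \\w$");

    // matches() needs the pattern to cover the whole input.
    static boolean matchesWhole(String street) {
        return SUFFIX.matcher(street).matches();
    }

    // find() accepts a match anywhere; the $ still pins it to the end.
    static boolean findsSuffix(String street) {
        return SUFFIX.matcher(street).find();
    }

    // Alternative: keep matches() and allow any leading text.
    static boolean matchesWithPrefix(String street) {
        return Pattern.matches(".*\\d \\w", street);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("str 3 B"));      // false
        System.out.println(findsSuffix("str 3 B"));       // true
        System.out.println(matchesWithPrefix("str 3 B")); // true
    }
}
```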

Pattern syntax error

The following regex works in the Find dialog of Eclipse but throws an exception in Java, and I can't work out why:
(?<=(00|\\+))?[\\d]{1}[\\d]*
The syntax error is at runtime when executing:
Pattern.compile("(?<=(00|\\+))?[\\d]{1}[\\d]*")
In the find I used
(?<=(00|\+))?[\d]{1}[\d]*
I want to match phone numbers with or without the + or 00 prefix. But that is not the point, because I get a syntax error at position 13. I don't get the error if I remove the second "?":
Pattern.compile("(?<=(00|\\+))[\\d]{1}[\\d]*")
Please note that instead of 1 I sometimes need to use a greater number; in any case, the question is about the syntax error.

If your data looks like 00ddddd or +ddddd, where d is a digit, and you want only the digits, @Bergi's regex (?<=00|\\+)\\d+ will do the trick. But if your data sometimes lacks the prefix you want to ignore, as in ddddd, then you should probably use a capturing group instead:
String[] data = {"+123456", "00123456", "123456"};
Pattern p = Pattern.compile("(?:00|\\+)?(\\d+)");
Matcher m = null;
for (String s : data) {
    m = p.matcher(s);
    if (m.find())
        System.out.println(m.group(1));
}
Output:
123456
123456
123456
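To see why the lookbehind alone is not enough for unprefixed numbers, the two patterns can be run against the same three inputs. This is a sketch; the class name PhonePrefixDemo and its helper methods are invented for the comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhonePrefixDemo {
    // Digits that are directly preceded by "00" or "+".
    private static final Pattern LOOKBEHIND = Pattern.compile("(?<=00|\\+)\\d+");
    // Optional prefix consumed by a non-capturing group, digits captured in group 1.
    private static final Pattern OPTIONAL_PREFIX = Pattern.compile("(?:00|\\+)?(\\d+)");

    static String viaLookbehind(String s) {
        Matcher m = LOOKBEHIND.matcher(s);
        return m.find() ? m.group() : null;
    }

    static String viaOptionalPrefix(String s) {
        Matcher m = OPTIONAL_PREFIX.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"+123456", "00123456", "123456"}) {
            // The lookbehind yields null for the unprefixed "123456";
            // the optional-prefix version yields 123456 for all three.
            System.out.println(s + " -> " + viaLookbehind(s) + " / " + viaOptionalPrefix(s));
        }
    }
}
```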

Here is an example that works for me:
public static void main(String[] args) {
    Pattern pattern = Pattern.compile("(?<=00|\\+)(\\d+)");
    Matcher matcher = pattern.matcher("+1123456");
    if (matcher.find()) {
        System.out.println(matcher.group(1));
    }
}

You can shorten your regex a lot. A character class is not needed when it contains only one shorthand class; just use \d. And {1} is redundant as well. Also, you can use + for matching "one or more" (it's short for {1,}). Next, the additional grouping in your lookbehind should not be needed.
And last, why is that lookbehind optional (with ?)? Just leave it out if you don't need it. This might even be the source of your pattern syntax error: a lookaround must not be optional.
Try this:
/(?<=00|\+)\d+/
Java:
"(?<=00|\\+)\\d+"
