How to match an maximum length Regex in java - java

public static void main(String[] args) {
Pattern compile = Pattern
.compile("[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+");
Matcher matcher = compile.matcher("i5-2450M");
matcher.find();
System.out.println(matcher.group(0));
}
I assume this should return i5-2450M but it returns i5 actually

The problem is that the first alternation that matches is used.
In this case the 2nd alternation ([A-Za-z][0-9]{1,}, which matches i5) "shadows" any following alternation.
// doesn't match
[0-9]{1,}[A-Za-z]{1,}|
// matches "i5"
[A-Za-z][0-9]{1,}|
// the following are never even checked, because of the previous match
[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}|
[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|
[0-9][0-9\\-]{4,}|
[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]
(Please note, that there are likely serious issues with the regular expression in the post -- for instance, 0---# would be matched by the last rule -- which should be addressed, but are not below due to not being the "fundamental" problem of the alternation behavior.)
To fix this issue, arrange the alternations with the most specific first. In this case it would be putting the 2nd alternation below the other alternation entries. (Also review the other alternations and the interactions; perhaps the entire regular expression can be simplified?)
The use of a simple word boundary (\b) will not work here because - is considered a non-word character. However, depending upon the meaning of the regular expression, anchors ($ and ^) could be used around the alternation: e.g. ^existing_regex$. This doesn't change the behavior of the alternation, but it would cause the initial match of i5 to be backtracked, and thereby causing subsequent alternation entries to be considered, due to not being able to match the end-of-input immediately after the alternation group.
From Java regex alternation operator "|" behavior seems broken:
Java uses an NFA, or regex-directed flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. An alternation is expected to quit as soon as one of the alternatives matches, not hold out for the longest match.
(The accepted answer in this question uses word boundaries.)
From Pattern:
The Pattern engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5.

Try to iterate over the matches (i.e. while matcher(text).find())

Related

Match all minuses that are not at the beginning of the string and that do not follow immediately after an `e` or `E`

And ideally, I want to allow spaces between, say and e and the minus:
(?<!(^|[eE]))\s*-
(the reason \s* is outside the lookbehind is that negative lookbehinds need be a fixed length, which \s* is not)
the logic here makes sense to me: match \s*- unless it is preceded by ^, e or E
this is intended as part of a larger pattern meant to purge e.g. thousands separators from a number string:
[^\d,.\-+eE]|(?<!(^|[eE]))\s*[+\-]|[eE](?!\s*[+\-]?\s*\d+$)|[.,](?=.*[.,])
What this does is (in order), it matches
everything that isn't a number, a comma, a dot, a minus, a plus or an E
all pluses and minuses that aren't at the beginning of the string and that don't follow e or E
all e and E that aren't followed by at least one digit with potentially a plus or minus between the E and the digit
all dots and commas except the last dot or comma
i.e. everything matched by this pattern can be replaced with an empty string.
Now let's try that in Java:
private static final Pattern ALL_NON_NUMERICS_EXCEPT_LEADING_MINUS_AND_E_AND_LAST_DOT_OR_COMMA = Pattern.compile("[^\d,.\-+eE]|(?<!(^|[eE]))\s*[+\-]|[eE](?!\s*[+\\-]?\s*\d+$)|[.,](?=.*[.,])");
and
var intermediate = ALL_NON_NUMERICS_EXCEPT_LEADING_MINUS_AND_E_AND_LAST_DOT_OR_COMMA
.matcher("3 e -9")
.replaceAll("");
But as you can see here, the result of that is 3e9 and not 3e-9 as it should.
So I pasted just the (?<!(^|[eE]))\s*- pattern to regex101 and turns out that the lookbehind is "not fixed", after all.
I do think it's possible this results in a mis-compilation of the pattern.
So how do I actually DO this?
First of all, always test your regexps in an environment that is compatible with the one you will be using your regex in. Thus, select "Java", not "PCRE" at regex101.com.
Next, regex101 supports Java 8 regex flavor, and there has been some progress on Java regex support since then, here is a note on lookbehind patterns in Java:
Java 13 allows you to use the star and plus inside lookbehind, as well as curly braces without an upper limit. But Java 13 still uses the laborious method of matching lookbehind introduced with Java 6. Java 13 also does not correctly handle lookbehind with multiple quantifiers if one of them is unbounded. In some situations you may get an error. In other situations you may get incorrect matches. So for both correctness and performance, we recommend you only use quantifiers with a low upper bound in lookbehind with Java 6 through 13.
See the Java demo:
String pattern = "[^\\d,.+eE-]|(?<!(?:^|[eE])\\s*)[+-]|[eE](?!\\s*(?:[+-]\\s*)?\\d+$)|[.,](?=.*[.,])";
Pattern ALL_NON_NUMERICS_EXCEPT_LEADING_MINUS_AND_E_AND_LAST_DOT_OR_COMMA = Pattern.compile(pattern);
var intermediate = ALL_NON_NUMERICS_EXCEPT_LEADING_MINUS_AND_E_AND_LAST_DOT_OR_COMMA
.matcher("3 e -9")
.replaceAll("");
System.out.println(intermediate);
// => 3e-9
Although (?<!(?:^|[eE])\s*) works here, it is still recommended to only use limiting quantifiers in constrained-width lookbehind patterns, i.e. just make sure the upper bound is reasonable enough, e.g. (?<!(?:^|[eE])\s{0,100}).

Why I got IllegalStateException here? [duplicate]

I have a string. The end is different, such as index.php?test=1&list=UL or index.php?list=UL&more=1. The one thing I'm looking for is &list=.
How can I match it, whether it's in the middle of the string or it's at the end? So far I've got [&|\?]list=.*?([&|$]), but the ([&|$]) part doesn't actually work; I'm trying to use that to match either & or the end of the string, but the end of the string part doesn't work, so this pattern matches the second example but not the first.
Use:
/(&|\?)list=.*?(&|$)/
Note that when you use a bracket expression, every character within it (with some exceptions) is going to be interpreted literally. In other words, [&|$] matches the characters &, |, and $.
In short
Any zero-width assertions inside [...] lose their meaning of a zero-width assertion. [\b] does not match a word boundary (it matches a backspace, or, in POSIX, \ or b), [$] matches a literal $ char, [^] is either an error or, as in ECMAScript regex flavor, any char. Same with \z, \Z, \A anchors.
You may solve the problem using any of the below patterns:
[&?]list=([^&]*)
[&?]list=(.*?)(?=&|$)
[&?]list=(.*?)(?![^&])
If you need to check for the "absolute", unambiguous string end anchor, you need to remember that is various regex flavors, it is expressed with different constructs:
[&?]list=(.*?)(?=&|$) - OK for ECMA regex (JavaScript, default C++ `std::regex`)
[&?]list=(.*?)(?=&|\z) - OK for .NET, Go, Onigmo (Ruby), Perl, PCRE (PHP, base R), Boost, ICU (R `stringr`), Java/Andorid
[&?]list=(.*?)(?=&|\Z) - OK for Python
Matching between a char sequence and a single char or end of string (current scenario)
The .*?([YOUR_SINGLE_CHAR_DELIMITER(S)]|$) pattern (suggested by João Silva) is rather inefficient since the regex engine checks for the patterns that appear to the right of the lazy dot pattern first, and only if they do not match does it "expand" the lazy dot pattern.
In these cases it is recommended to use negated character class (or bracket expression in the POSIX talk):
[&?]list=([^&]*)
See demo. Details
[&?] - a positive character class matching either & or ? (note the relationships between chars/char ranges in a character class are OR relationships)
list= - a substring, char sequence
([^&]*) - Capturing group #1: zero or more (*) chars other than & ([^&]), as many as possible
Checking for the trailing single char delimiter presence without returning it or end of string
Most regex flavors (including JavaScript beginning with ECMAScript 2018) support lookarounds, constructs that only return true or false if there patterns match or not. They are crucial in case consecutive matches that may start and end with the same char are expected (see the original pattern, it may match a string starting and ending with &). Although it is not expected in a query string, it is a common scenario.
In that case, you can use two approaches:
A positive lookahead with an alternation containing positive character class: (?=[SINGLE_CHAR_DELIMITER(S)]|$)
A negative lookahead with just a negative character class: (?![^SINGLE_CHAR_DELIMITER(S)])
The negative lookahead solution is a bit more efficient because it does not contain an alternation group that adds complexity to matching procedure. The OP solution would look like
[&?]list=(.*?)(?=&|$)
or
[&?]list=(.*?)(?![^&])
See this regex demo and another one here.
Certainly, in case the trailing delimiters are multichar sequences, only a positive lookahead solution will work since [^yes] does not negate a sequence of chars, but the chars inside the class (i.e. [^yes] matches any char but y, e and s).

Java Regular Expressions equivalent to PCRE/etc. shorthand `\K`?

Perl RegEx and PCRE (Perl-Compatible RegEx) amongst others have the shorthand \K to discard all matches to the left of it except for capturing groups, but Java doesn't support it, so what's Java's equivalent to it ?
There is no direct equivalent. However, you can always re-write such patterns using capturing groups.
If you have a closer look at \K operator and its limitations, you will see you can replace this pattern with capturing groups.
See rexegg.com \K reference:
In the middle of a pattern, \K says "reset the beginning of the reported match to this point". Anything that was matched before the \K goes unreported, a bit like in a lookbehind.
The key difference between \K and a lookbehind is that in PCRE, a lookbehind does not allow you to use quantifiers: the length of what you look for must be fixed. On the other hand, \K can be dropped anywhere in a pattern, so you are free to have any quantifiers you like before the \K.
However, all this means that the pattern before \K is still a consuming pattern, i.e. the regex engine adds up the matched text to the match value and advances its index while matching the pattern, and \K only drops the matched text from the match keeping the index where it is. This means that \K is no better than capturing groups.
So, a value\s*=\s*\K\d+ PCRE/Onigmo pattern would translate into this Java code:
String s = "Min value = 5000 km";
Matcher m = Pattern.compile("value\\s*=\\s*(\\d+)").matcher(s);
if(m.find()) {
System.out.println(m.group(1));
}
There is an alternative, but that can only be used with smaller, simpler
patterns. A constrained width lookbehind:
Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four characters. Likewise, (?<=A{1,10}) is valid.
So, this will also work:
m = Pattern.compile("(?<=value\\s{0,10}=\\s{0,10})\\d+").matcher(s);
if(m.find()) {
System.out.println(m.group());
}
See the Java demo.

Very slow Regular Expression in Java

Using Java, i want to detect if a line starts with words and separator then "myword", but this regex takes too long. What is incorrect ?
^\s*(\w+(\s|/|&|-)*)*myword
The pattern ^\s*(\w+(\s|/|&|-)*)*myword is not efficient due to the nested quantifier. \w+ requires at least one word character and (\s|/|&|-)* can match zero or more of some characters. When the * is applied to the group and the input string has no separators in between word characters, the expression becomes similar to a (\w+)* pattern that is a classical catastrophical backtracking issue pattern.
Just a small illustration of \w+ and (\w+)* performance:
\w+: (\w+)*
You pattern is even more complicated and invloves more those backtracking steps. To avoid such issues, a pattern should not have optional subpatterns inside quantified groups. That is, create a group with obligatory subpatterns and apply the necessary quantifier to the group.
In this case, you can unroll the group you have as
String rx = "^\\s*(\\w+(?:[\\s/&-]+\\w+)*)[\\s/&-]+myword";
See IDEONE demo
Here, (\w+(\s|/|&|-)*)* is unrolled as (\w+(?:[\s/&-]+\w+)*) (I kept the outer parentheses to produce a capture group #1, you may remove these brackets if you are not interested in them). \w+ matches one or more word characters (so, it is an obligatory subpatter), and the (?:[\s/&-]+\w+)* subpattern matches zero or more (*, thus, this whole group is optional) sequences of one or more characters from the defined character class [\s/&-]+ (so, it is obligatory) followed with one or more word characters \w+.

Java regex alternation operator "|" behavior seems broken

Trying to write a regex matcher for roman numerals. In sed (which I think is considered 'standard' for regex?), if you have multiple options delimited by the alternation operator, it will match the longest. Namely, "I|II|III|IV" will match "IV" for "IV" and "III" for "III"
In Java, the same pattern matches "I" for "IV" and "I" for "III". Turns out Java chooses between alternation matches left-to-right; that is, because "I" appears before "III" in the regex, it matches. If I change the regex to "IV|III|II|I", the behavior is corrected, but this obviously isn't a solution in general.
Is there a way to make Java choose the longest match out of an alternation group, instead of choosing the 'first'?
A code sample for clarity:
public static void main(String[] args)
{
Pattern p = Pattern.compile("six|sixty");
Matcher m = p.matcher("The year was nineteen sixty five.");
if (m.find())
{
System.out.println(m.group());
}
else
{
System.out.println("wtf?");
}
}
This outputs "six"
No, it's behaving correctly. Java uses an NFA, or regex-directed flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. An alternation is expected to quit as soon as one of the alternatives matches, not hold out for the longest match.
You can force it to continue by adding a condition after the alternation that can't be met until the whole token has been consumed. What that condition might be depends on the context; the simplest option would be an anchor ($) or a word boundary (\b).
"\\b(I|II|III|IV)\\b"
EDIT: I should mention that, while grep, sed, awk and others traditionally use text-directed (or DFA) engines, you can also find versions of some of them that use NFA engines, or even hybrids of the two.
I think a pattern that will work is something like
IV|I{1,3}
See the "greedy quantifiers" section at http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
Edit: in response to your comment, I think the general problem is that you keep using alternation when it is not the right thing to use. In your new example, you are trying to match "six" or "sixty"; the right pattern to use is six(ty)?, not six|sixty. In general, if you ever have two members of an alternation group such that one is a prefix of another, you should rewrite the regular expression to eliminate it. Otherwise, you can't really complain that the engine is doing the wrong thing, since the semantics of alternation don't say anything about a longest match.
Edit 2: the literal answer to your question is no, it can't be forced (and my commentary is that you shouldn't ever need this behavior).
Edit 3: thinking more about the subject, it occurred to me that an alternation pattern where one string is the prefix of another is undesirable for another reason; namely, it will be slower unless the underlying automaton is constructed to take prefixes into account (and given that Java picks the first match in the pattern, I would guess that this is not the case).

Categories

Resources