Java Regular Expressions equivalent to PCRE/etc. shorthand `\K`?

Perl regex and PCRE (Perl-Compatible Regular Expressions), among others, support the \K shorthand, which discards everything matched to its left from the reported match (capturing groups excepted). Java doesn't support it, so what is Java's equivalent?

There is no direct equivalent. However, you can always re-write such patterns using capturing groups.
If you have a closer look at \K operator and its limitations, you will see you can replace this pattern with capturing groups.
See rexegg.com \K reference:
In the middle of a pattern, \K says "reset the beginning of the reported match to this point". Anything that was matched before the \K goes unreported, a bit like in a lookbehind.
The key difference between \K and a lookbehind is that in PCRE, a lookbehind does not allow you to use quantifiers: the length of what you look for must be fixed. On the other hand, \K can be dropped anywhere in a pattern, so you are free to have any quantifiers you like before the \K.
However, all this means that the pattern before \K is still a consuming pattern, i.e. the regex engine adds up the matched text to the match value and advances its index while matching the pattern, and \K only drops the matched text from the match keeping the index where it is. This means that \K is no better than capturing groups.
So, a value\s*=\s*\K\d+ PCRE/Onigmo pattern would translate into this Java code:
// requires java.util.regex.Matcher and java.util.regex.Pattern
String s = "Min value = 5000 km";
Matcher m = Pattern.compile("value\\s*=\\s*(\\d+)").matcher(s);
if (m.find()) {
    System.out.println(m.group(1)); // => 5000
}
There is an alternative, but it can only be used with smaller, simpler patterns: a constrained-width lookbehind.
Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four characters. Likewise, (?<=A{1,10}) is valid.
So, this will also work:
m = Pattern.compile("(?<=value\\s{0,10}=\\s{0,10})\\d+").matcher(s);
if (m.find()) {
    System.out.println(m.group()); // => 5000
}
See the Java demo.
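When \K is used in a PCRE substitution (replace only what follows \K), the same capturing-group idea carries over: capture the prefix and put it back in the replacement. The following is a minimal sketch, assuming a substitution use case that the question does not spell out; the class name, the replaceNumber helper and the group name prefix are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeepOutReplaceDemo {
    // Java counterpart of a PCRE substitution based on value\s*=\s*\K\d+:
    // the prefix is captured in a named group and re-inserted by the replacement.
    private static final Pattern NUMBER_AFTER_VALUE =
            Pattern.compile("(?<prefix>value\\s*=\\s*)\\d+");

    // Hypothetical helper: replaces only the digits that follow "value =".
    static String replaceNumber(String input, String replacement) {
        return NUMBER_AFTER_VALUE.matcher(input)
                .replaceAll("${prefix}" + Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceNumber("Min value = 5000 km", "42"));
        // => Min value = 42 km
    }
}
```

A named group is used instead of $1 so that a replacement starting with a digit cannot be misread as part of the group number.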

Related

Match all minuses that are not at the beginning of the string and that do not follow immediately after an `e` or `E`

And ideally, I want to allow spaces between, say, an e and the minus:
(?<!(^|[eE]))\s*-
(the reason \s* is outside the lookbehind is that negative lookbehinds need to be a fixed length, which \s* is not)
the logic here makes sense to me: match \s*- unless it is preceded by ^, e or E
this is intended as part of a larger pattern meant to purge e.g. thousands separators from a number string:
[^\d,.\-+eE]|(?<!(^|[eE]))\s*[+\-]|[eE](?!\s*[+\-]?\s*\d+$)|[.,](?=.*[.,])
What this does is (in order), it matches
everything that isn't a number, a comma, a dot, a minus, a plus or an E
all pluses and minuses that aren't at the beginning of the string and that don't follow e or E
all e and E that aren't followed by at least one digit with potentially a plus or minus between the E and the digit
all dots and commas except the last dot or comma
i.e. everything matched by this pattern can be replaced with an empty string.
Now let's try that in Java:
private static final Pattern ALL_NON_NUMERICS_EXCEPT_LEADING_MINUS_AND_E_AND_LAST_DOT_OR_COMMA = Pattern.compile("[^\\d,.\\-+eE]|(?<!(^|[eE]))\\s*[+\\-]|[eE](?!\\s*[+\\-]?\\s*\\d+$)|[.,](?=.*[.,])");
and
var intermediate = ALL_NON_NUMERICS_EXCEPT_LEADING_MINUS_AND_E_AND_LAST_DOT_OR_COMMA
        .matcher("3 e -9")
        .replaceAll("");
But as you can see here, the result of that is 3e9 and not 3e-9 as it should.
So I pasted just the (?<!(^|[eE]))\s*- pattern into regex101, and it turns out that the lookbehind is "not fixed" after all.
I do think it's possible this results in a mis-compilation of the pattern.
So how do I actually DO this?

First of all, always test your regexps in an environment that is compatible with the one you will be using your regex in. Thus, select "Java", not "PCRE" at regex101.com.
Next, regex101 supports the Java 8 regex flavor, and Java regex support has progressed since then. Here is a note on lookbehind patterns in Java:
Java 13 allows you to use the star and plus inside lookbehind, as well as curly braces without an upper limit. But Java 13 still uses the laborious method of matching lookbehind introduced with Java 6. Java 13 also does not correctly handle lookbehind with multiple quantifiers if one of them is unbounded. In some situations you may get an error. In other situations you may get incorrect matches. So for both correctness and performance, we recommend you only use quantifiers with a low upper bound in lookbehind with Java 6 through 13.
See the Java demo:
String pattern = "[^\\d,.+eE-]|(?<!(?:^|[eE])\\s*)[+-]|[eE](?!\\s*(?:[+-]\\s*)?\\d+$)|[.,](?=.*[.,])";
Pattern ALL_NON_NUMERICS_EXCEPT_LEADING_MINUS_AND_E_AND_LAST_DOT_OR_COMMA = Pattern.compile(pattern);
var intermediate = ALL_NON_NUMERICS_EXCEPT_LEADING_MINUS_AND_E_AND_LAST_DOT_OR_COMMA
        .matcher("3 e -9")
        .replaceAll("");
System.out.println(intermediate);
// => 3e-9
Although (?<!(?:^|[eE])\s*) works here, it is still recommended to only use limiting quantifiers in constrained-width lookbehind patterns, i.e. just make sure the upper bound is reasonable enough, e.g. (?<!(?:^|[eE])\s{0,100}).
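The bounded variant recommended above can be wrapped in a small helper. This is a sketch, not the asker's final code: the class name NumberCleaner, the method clean and the upper bound of 100 are illustrative assumptions.

```java
import java.util.regex.Pattern;

final class NumberCleaner {
    // Same alternatives as the pattern above, but the lookbehind uses
    // \s{0,100} instead of \s*, so its width has an explicit upper bound.
    private static final Pattern JUNK = Pattern.compile(
            "[^\\d,.+eE-]"                        // anything that cannot be part of a number
            + "|(?<!(?:^|[eE])\\s{0,100})[+-]"    // signs not at the start and not after e/E
            + "|[eE](?!\\s*(?:[+-]\\s*)?\\d+$)"   // e/E not followed by a trailing exponent
            + "|[.,](?=.*[.,])");                 // every dot or comma except the last one

    // Removes everything the pattern matches.
    static String clean(String raw) {
        return JUNK.matcher(raw).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("3 e -9"));   // => 3e-9
        System.out.println(clean("-1,234.5")); // => -1234.5
    }
}
```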

Why I got IllegalStateException here? [duplicate]

I have a string. The end is different, such as index.php?test=1&list=UL or index.php?list=UL&more=1. The one thing I'm looking for is &list=.
How can I match it, whether it's in the middle of the string or it's at the end? So far I've got [&|\?]list=.*?([&|$]), but the ([&|$]) part doesn't actually work; I'm trying to use that to match either & or the end of the string, but the end of the string part doesn't work, so this pattern matches the second example but not the first.

Use:
/(&|\?)list=.*?(&|$)/
Note that when you use a bracket expression, every character within it (with some exceptions) is going to be interpreted literally. In other words, [&|$] matches the characters &, |, and $.

In short
Any zero-width assertions inside [...] lose their meaning of a zero-width assertion. [\b] does not match a word boundary (it matches a backspace, or, in POSIX, \ or b), [$] matches a literal $ char, [^] is either an error or, as in ECMAScript regex flavor, any char. Same with \z, \Z, \A anchors.
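A quick way to see this in Java (a small sketch; the class and method names are arbitrary): outside brackets $ is the end anchor, inside brackets it is an ordinary character.

```java
import java.util.regex.Pattern;

final class BracketAnchorDemo {
    // Returns true if regex matches anywhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Outside a character class, $ is an anchor: it matches at the end of "abc".
        System.out.println(found("c$", "abc"));     // => true
        // Inside a character class, $ is a literal: a real '$' must follow 'c'.
        System.out.println(found("c[$]", "abc"));   // => false
        System.out.println(found("c[$]", "abc$"));  // => true
        // Likewise, | inside brackets is just the pipe character.
        System.out.println(found("[&|$]", "a|b"));  // => true
    }
}
```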
You may solve the problem using any of the below patterns:
[&?]list=([^&]*)
[&?]list=(.*?)(?=&|$)
[&?]list=(.*?)(?![^&])
If you need to check for the "absolute", unambiguous string end anchor, you need to remember that in various regex flavors, it is expressed with different constructs:
[&?]list=(.*?)(?=&|$) - OK for ECMA regex (JavaScript, default C++ `std::regex`)
[&?]list=(.*?)(?=&|\z) - OK for .NET, Go, Onigmo (Ruby), Perl, PCRE (PHP, base R), Boost, ICU (R `stringr`), Java/Android
[&?]list=(.*?)(?=&|\Z) - OK for Python
Matching between a char sequence and a single char or end of string (current scenario)
The .*?([YOUR_SINGLE_CHAR_DELIMITER(S)]|$) pattern (suggested by João Silva) is rather inefficient since the regex engine checks for the patterns that appear to the right of the lazy dot pattern first, and only if they do not match does it "expand" the lazy dot pattern.
In these cases it is recommended to use negated character class (or bracket expression in the POSIX talk):
[&?]list=([^&]*)
See demo. Details
[&?] - a positive character class matching either & or ? (note the relationships between chars/char ranges in a character class are OR relationships)
list= - a substring, char sequence
([^&]*) - Capturing group #1: zero or more (*) chars other than & ([^&]), as many as possible
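In Java, the negated character class pattern could be used roughly like this. The class name ListParamExtractor and the method listValue are illustrative only; the two sample strings are the ones from the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ListParamExtractor {
    private static final Pattern LIST_PARAM = Pattern.compile("[&?]list=([^&]*)");

    // Returns the value of the "list" parameter, if the parameter is present.
    static Optional<String> listValue(String url) {
        Matcher m = LIST_PARAM.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(listValue("index.php?test=1&list=UL")); // => Optional[UL]
        System.out.println(listValue("index.php?list=UL&more=1")); // => Optional[UL]
        System.out.println(listValue("index.php?other=1"));        // => Optional.empty
    }
}
```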
Checking for the trailing single char delimiter presence without returning it or end of string
Most regex flavors (including JavaScript beginning with ECMAScript 2018) support lookarounds, constructs that only report whether their patterns match, without consuming any text. They are crucial in case consecutive matches that may start and end with the same char are expected (see the original pattern, it may match a string starting and ending with &). Although it is not expected in a query string, it is a common scenario.
In that case, you can use two approaches:
A positive lookahead with an alternation containing positive character class: (?=[SINGLE_CHAR_DELIMITER(S)]|$)
A negative lookahead with just a negative character class: (?![^SINGLE_CHAR_DELIMITER(S)])
The negative lookahead solution is a bit more efficient because it does not contain an alternation group that adds complexity to matching procedure. The OP solution would look like
[&?]list=(.*?)(?=&|$)
or
[&?]list=(.*?)(?![^&])
See this regex demo and another one here.
Certainly, in case the trailing delimiters are multichar sequences, only a positive lookahead solution will work since [^yes] does not negate a sequence of chars, but the chars inside the class (i.e. [^yes] matches any char but y, e and s).
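The difference shows up once matches are adjacent. The sketch below uses a made-up query string with two list parameters in a row and compares a pattern that consumes the trailing delimiter with the two lookahead variants; the class and method names are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DelimiterLookaheadDemo {
    // Collects the full text of every match of regex in input.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String query = "?list=a&list=b";
        // The trailing & is consumed, so the second parameter has no leading delimiter left.
        System.out.println(allMatches("[&?]list=.*?(?:&|$)", query));  // => [?list=a&]
        // The lookaheads only check the delimiter, so both parameters are found.
        System.out.println(allMatches("[&?]list=.*?(?=&|$)", query));  // => [?list=a, &list=b]
        System.out.println(allMatches("[&?]list=.*?(?![^&])", query)); // => [?list=a, &list=b]
    }
}
```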

Regex - Optimize look behind

I have a regex that finds urls in a source code
(?<!\b(XmlNamespace)\([^\n]{0,1000})"(http|ftp|socket):\/\/(?!www\.google-analytics\.com(\/collect)?)(\:(\d+)?)?("|\/))[\w\d]
But it VERY slow. The main problem is look behind.
I use Java (java.util.regex.Pattern)
Can someone help me please?
UPD:
When I changed {0,1000} to {0,100}, the processing time changed to 50 seconds, but that is not a solution. I believe the lookbehind runs first and the main part only second. So the question is: how do I make "(http|ftp|socket):// run first and the lookbehind only after that?

The regex was tested at RegexPlanet.
NOTE: The (?!www\\.google-analytics\\.com(/collect)?) lookahead does not make much sense since you are consuming : + digits* after //, so your regex might be invalid in general.
Below I focus on how the pattern can be made faster.
The point is that the lookbehind in your pattern is triggered at a location before every character. If you move it after the initial subpattern, and repeat that subpattern at the end of the lookbehind body, it will be triggered only after the initial subpattern has matched.
Fortunately, the Java regex engine can tell that the lookbehind still has a constrained width even with the alternation inside it.
So,
"(http|ftp|socket)://(?<!\bXmlNamespace\(.{0,1000}"(http|ftp|socket)://)(?!www\.google-analytics\.com(/collect)?)(:\d*)?["/]\w
Should work better. Note I removed unnecessary escape symbols (/ is not a special character in Java regex), put the ("|/) alternation into a character class [...]. Also, [\w\d] is the same as \w (it already matches \d).
Java test:
String value1 = "\"http://:2123\"123";
String pattern1 = "\"(http|ftp|socket)://(?<!\\bXmlNamespace\\(.{0,1000}\"(http|ftp|socket)://)(?!www\\.google-analytics\\.com(/collect)?)(:\\d*)?[\"/]\\w";
Pattern ptrn = Pattern.compile(pattern1);
Matcher matcher = ptrn.matcher(value1);
if (matcher.find()) {
    System.out.println("true"); // printed for value1
} else {
    System.out.println("false");
}
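To check that the relocated lookbehind still excludes URLs inside XmlNamespace(...), the same pattern can be run against a second input. This is only a sketch: the class name, the containsUrl helper and the second sample string are made up, and no timing is measured here.

```java
import java.util.regex.Pattern;

final class UrlLookbehindDemo {
    // The pattern from the answer, split across lines for readability.
    private static final Pattern URL = Pattern.compile(
            "\"(http|ftp|socket)://"
            + "(?<!\\bXmlNamespace\\(.{0,1000}\"(http|ftp|socket)://)" // runs only after the scheme matched
            + "(?!www\\.google-analytics\\.com(/collect)?)"
            + "(:\\d*)?[\"/]\\w");

    // Hypothetical helper: true if the source text contains a matching URL.
    static boolean containsUrl(String source) {
        return URL.matcher(source).find();
    }

    public static void main(String[] args) {
        System.out.println(containsUrl("\"http://:2123\"123"));              // => true
        System.out.println(containsUrl("XmlNamespace(\"http://:2123\"123")); // => false
    }
}
```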

Regular expression Java Merge Pattern

I have these three regular expressions. They work individually, but I would like to merge them into a single pattern.
regex1 = [0-9]{16}
regex2 = [0-9]{4}[-][0-9]{4}[-][0-9]{4}[-][0-9]{4}
regex3 = [0-9]{4}[ ][0-9]{4}[ ][0-9]{4}[ ][0-9]{4}
I use this method:
Pattern.compile(regex);
Which is the regex string to merge them?

You can use backreferences:
[0-9]{4}([ -]|)([0-9]{4}\1){2}[0-9]{4}
This will only match if the separators are either all
spaces
hyphens
blank
\1 means "this matches exactly what the first capturing group – expression in parentheses – matched".
Since ([ -]|) is that group, both other separators need to be the same for the pattern to match.
You can simplify it further to:
\d{4}([ -]|)(\d{4}\1){2}\d{4}
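A short Java check of the backreference version, using matches() so that the whole input has to fit. The class and method names and the sample numbers are made up for illustration.

```java
import java.util.regex.Pattern;

final class CardNumberFormat {
    // Group 1 captures the first separator (space, hyphen or nothing);
    // \1 forces the remaining separators to be identical to it.
    private static final Pattern SAME_SEPARATOR =
            Pattern.compile("\\d{4}([ -]|)(\\d{4}\\1){2}\\d{4}");

    static boolean isValidFormat(String s) {
        return SAME_SEPARATOR.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidFormat("1234567890123456"));    // => true
        System.out.println(isValidFormat("1234-5678-9012-3456")); // => true
        System.out.println(isValidFormat("1234 5678 9012 3456")); // => true
        System.out.println(isValidFormat("1234-5678 9012 3456")); // => false (mixed separators)
    }
}
```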

The following should match anything the three patterns match:
regex = [0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}
That is, I'm assuming you are happy with either a hyphen, a space or nothing between the numbers?
Note: this will also match situations where you have any combination of the three, e.g.
0000-0000 00000000
which may not be desired?
Alternatively, if you need to match any of the three individual patterns then just concatenate them with |, as follows:
([0-9]{16})|([0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})|([0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4})
(Your original example appears to have unnecessary square brackets around the space and hyphen)

How to match an maximum length Regex in java

public static void main(String[] args) {
Pattern compile = Pattern
.compile("[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+");
Matcher matcher = compile.matcher("i5-2450M");
matcher.find();
System.out.println(matcher.group(0));
}
I assume this should return i5-2450M but it returns i5 actually

The problem is that the first alternation that matches is used.
In this case the 2nd alternation ([A-Za-z][0-9]{1,}, which matches i5) "shadows" any following alternation.
// doesn't match
[0-9]{1,}[A-Za-z]{1,}|
// matches "i5"
[A-Za-z][0-9]{1,}|
// the following are never even checked, because of the previous match
[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}|
[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|
[0-9][0-9\\-]{4,}|
[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+
(Please note, that there are likely serious issues with the regular expression in the post -- for instance, 0---# would be matched by the last rule -- which should be addressed, but are not below due to not being the "fundamental" problem of the alternation behavior.)
To fix this issue, arrange the alternations with the most specific first. In this case it would be putting the 2nd alternation below the other alternation entries. (Also review the other alternations and the interactions; perhaps the entire regular expression can be simplified?)
The use of a simple word boundary (\b) will not work here because - is considered a non-word character. However, depending upon the meaning of the regular expression, anchors (^ and $) could be used around the whole alternation, wrapped in a group: e.g. ^(?:existing_regex)$. This doesn't change the behavior of the alternation, but it causes the initial match of i5 to be backtracked, because the end of input cannot be matched immediately after the alternation group, so subsequent alternation entries are considered.
From Java regex alternation operator "|" behavior seems broken:
Java uses an NFA, or regex-directed flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. An alternation is expected to quit as soon as one of the alternatives matches, not hold out for the longest match.
(The accepted answer in this question uses word boundaries.)
From Pattern:
The Pattern engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5.
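The effect of ordering and of anchoring can be reproduced with the pattern from the question. In this sketch the class name, the constant names and the firstMatch helper are made up; REORDERED is the original pattern with its second alternative moved to the end.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationOrderDemo {
    static final String ORIGINAL =
            "[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}"
            + "|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}"
            + "|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+";

    // Same alternatives, with [A-Za-z][0-9]{1,} moved to the end.
    static final String REORDERED =
            "[0-9]{1,}[A-Za-z]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}"
            + "|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}"
            + "|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+|[A-Za-z][0-9]{1,}";

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(ORIGINAL, "i5-2450M"));                 // => i5
        System.out.println(firstMatch(REORDERED, "i5-2450M"));                // => i5-2450M
        // Anchoring the grouped alternation forces backtracking past "i5".
        System.out.println(firstMatch("^(?:" + ORIGINAL + ")$", "i5-2450M")); // => i5-2450M
    }
}
```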

Try iterating over the matches: create the Matcher once and loop with while (matcher.find()), reading matcher.group() in each iteration.
