Using Java, I want to detect whether a line starts with words and separators followed by "myword", but this regex takes too long. What is wrong with it?
^\s*(\w+(\s|/|&|-)*)*myword
The pattern ^\s*(\w+(\s|/|&|-)*)*myword is inefficient because of the nested quantifier. \w+ requires at least one word character, and (\s|/|&|-)* can match zero or more separator characters. When the * is applied to the group and the input string has no separators between word characters, the expression behaves like (\w+)*, which is a classic catastrophic backtracking pattern.
Just a small illustration of \w+ and (\w+)* performance: [step-count screenshots for \w+ and (\w+)* not preserved]
Your pattern is even more complicated and involves even more of those backtracking steps. To avoid such issues, a pattern should not have optional subpatterns inside quantified groups. That is, create a group with obligatory subpatterns and apply the necessary quantifier to the group.
In this case, you can unroll the group you have as
String rx = "^\\s*(\\w+(?:[\\s/&-]+\\w+)*)[\\s/&-]+myword";
See IDEONE demo
Here, (\w+(\s|/|&|-)*)* is unrolled as (\w+(?:[\s/&-]+\w+)*) (I kept the outer parentheses to produce capture group #1; you may remove them if you are not interested in the captured value). \w+ matches one or more word characters (so it is an obligatory subpattern), and the (?:[\s/&-]+\w+)* subpattern matches zero or more (*, thus this whole group is optional) sequences of one or more characters from the character class [\s/&-]+ (so it is obligatory) followed by one or more word characters \w+.
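To make the unrolled pattern concrete, here is a minimal Java sketch. The class and method names (UnrolledDemo, startsWithWordsThenMyword) are made up for illustration; only the pattern string comes from the answer. The second input is a long run of word characters without "myword", the kind of input on which the original nested-quantifier pattern backtracks heavily.

```java
import java.util.regex.Pattern;

public class UnrolledDemo {
    // The unrolled pattern from the answer, as a Java string literal.
    static final Pattern UNROLLED =
            Pattern.compile("^\\s*(\\w+(?:[\\s/&-]+\\w+)*)[\\s/&-]+myword");

    // True if the line starts with words and separators followed by "myword".
    static boolean startsWithWordsThenMyword(String line) {
        return UNROLLED.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(startsWithWordsThenMyword("  foo / bar & myword rest")); // true

        // 200 word characters, one separator, no "myword".
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append('a');
        }
        sb.append(" x");
        System.out.println(startsWithWordsThenMyword(sb.toString())); // false
    }
}
```

Note that the unrolled pattern requires at least one word and one separator before "myword", so a line consisting of just "myword" is not matched.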
I have a string. The end is different, such as index.php?test=1&list=UL or index.php?list=UL&more=1. The one thing I'm looking for is &list=.
How can I match it, whether it's in the middle of the string or it's at the end? So far I've got [&|\?]list=.*?([&|$]), but the ([&|$]) part doesn't actually work; I'm trying to use that to match either & or the end of the string, but the end of the string part doesn't work, so this pattern matches the second example but not the first.
Use:
/(&|\?)list=.*?(&|$)/
Note that when you use a bracket expression, every character within it (with some exceptions) is going to be interpreted literally. In other words, [&|$] matches the characters &, |, and $.
In short
Any zero-width assertions inside [...] lose their meaning of a zero-width assertion. [\b] does not match a word boundary (it matches a backspace, or, in POSIX, \ or b), [$] matches a literal $ char, [^] is either an error or, as in ECMAScript regex flavor, any char. Same with \z, \Z, \A anchors.
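A small Java sketch of the point above, with hypothetical names (BracketLiteralDemo, found): outside a character class $ is an anchor, inside one it is a literal dollar sign.

```java
import java.util.regex.Pattern;

public class BracketLiteralDemo {
    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("x$", "x"));      // true:  $ is the end anchor
        System.out.println(found("x[$]", "x"));    // false: [$] needs a literal '$'
        System.out.println(found("x[$]", "x$"));   // true
        System.out.println(found("x[&|$]", "x|")); // true:  '|' is literal inside [...]
    }
}
```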
You may solve the problem using any of the below patterns:
[&?]list=([^&]*)
[&?]list=(.*?)(?=&|$)
[&?]list=(.*?)(?![^&])
If you need to check for the "absolute", unambiguous end of string, remember that different regex flavors express it with different constructs:
[&?]list=(.*?)(?=&|$) - OK for ECMA regex (JavaScript, default C++ `std::regex`)
[&?]list=(.*?)(?=&|\z) - OK for .NET, Go, Onigmo (Ruby), Perl, PCRE (PHP, base R), Boost, ICU (R `stringr`), Java/Android
[&?]list=(.*?)(?=&|\Z) - OK for Python
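For Java specifically, a short sketch of why \z is the unambiguous choice: without MULTILINE, $ in java.util.regex also matches just before a final line terminator, while \z matches only at the very end of the input. The names EndAnchorDemo and found are illustrative.

```java
import java.util.regex.Pattern;

public class EndAnchorDemo {
    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("a$", "a\n"));   // true:  $ tolerates a final line terminator
        System.out.println(found("a\\z", "a\n")); // false: \z is the absolute end of input
        System.out.println(found("a\\z", "a"));   // true
    }
}
```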
Matching between a char sequence and a single char or end of string (current scenario)
The .*?([YOUR_SINGLE_CHAR_DELIMITER(S)]|$) pattern (suggested by João Silva) is rather inefficient since the regex engine checks for the patterns that appear to the right of the lazy dot pattern first, and only if they do not match does it "expand" the lazy dot pattern.
In these cases it is recommended to use negated character class (or bracket expression in the POSIX talk):
[&?]list=([^&]*)
See demo. Details
[&?] - a positive character class matching either & or ? (note the relationships between chars/char ranges in a character class are OR relationships)
list= - a substring, char sequence
([^&]*) - Capturing group #1: zero or more (*) chars other than & ([^&]), as many as possible
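As a rough Java illustration of the negated character class approach, the sketch below extracts the value of the list parameter from the two URLs in the question. ListParamDemo and listValue are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ListParamDemo {
    static final Pattern LIST = Pattern.compile("[&?]list=([^&]*)");

    // Returns the value of the "list" parameter, or null if it is absent.
    static String listValue(String url) {
        Matcher m = LIST.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(listValue("index.php?test=1&list=UL")); // UL
        System.out.println(listValue("index.php?list=UL&more=1")); // UL
        System.out.println(listValue("index.php?test=1"));         // null
    }
}
```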
Checking for the trailing single char delimiter presence without returning it or end of string
Most regex flavors (including JavaScript beginning with ECMAScript 2018) support lookarounds, constructs that only return true or false depending on whether their patterns match. They are crucial when consecutive matches that may start and end with the same char are expected (see the original pattern, it may match a string starting and ending with &). Although that is not expected in a query string, it is a common scenario.
In that case, you can use two approaches:
A positive lookahead with an alternation containing positive character class: (?=[SINGLE_CHAR_DELIMITER(S)]|$)
A negative lookahead with just a negative character class: (?![^SINGLE_CHAR_DELIMITER(S)])
The negative lookahead solution is a bit more efficient because it does not contain an alternation group that adds complexity to matching procedure. The OP solution would look like
[&?]list=(.*?)(?=&|$)
or
[&?]list=(.*?)(?![^&])
See this regex demo and another one here.
Certainly, in case the trailing delimiters are multichar sequences, only a positive lookahead solution will work since [^yes] does not negate a sequence of chars, but the chars inside the class (i.e. [^yes] matches any char but y, e and s).
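The difference between consuming the delimiter and only checking for it can be sketched in Java as follows. The input "?list=A&list=B" is a made-up example with two consecutive matches that share the & delimiter, and the names LookaheadDelimiterDemo and captures are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDelimiterDemo {
    // Collects group 1 of every match of the regex in the input.
    static List<String> captures(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "?list=A&list=B";
        // Consuming the delimiter: the '&' is used up, so the second value is missed.
        System.out.println(captures("[&?]list=(.*?)(&|$)", input));    // [A]
        // Positive lookahead: the '&' is only checked, not consumed.
        System.out.println(captures("[&?]list=(.*?)(?=&|$)", input));  // [A, B]
        // Negative lookahead with a negated character class: same result.
        System.out.println(captures("[&?]list=(.*?)(?![^&])", input)); // [A, B]
    }
}
```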
I'm trying to extract the text after a sequence, but I have multiple sequences. The regex should ideally match the first occurrence of any of these sequences.
my sequences are
PIN, PIN :, PIN IN, PIN IN:, PIN OUT,PIN OUT :
So I came up with the below regex
(PIN)(\sOUT|\sIN)?\:?\s*
It is doing the job except that the regex is also matching strings like
quote lupin in, pippin etc.
My question is how can I strictly select the string that match the pattern being the whole word
note: I tried ^(PIN)(\sOUT|\sON)?\:?\s* but of no use.
I'm new to java, any help is appreciated
It’s always recommended to have the documentation at hand when using regular expressions.
There, under Boundary matchers we find:
\b A word boundary
So you may use the pattern \bPIN(\sOUT|\sIN)?:?\s* to enforce that PIN matches at the beginning of a word only, i.e. stands at the beginning of a string/line or is preceded by non-word characters like space or punctuation. A boundary only matches a position, rather than characters, so if a preceding non-word character makes this a word boundary, the character still is not part of the match.
Note that the first (…) grouping was unnecessary for the literal match PIN, further the colon : has no special meaning and doesn’t need to be escaped.
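A possible Java sketch of the word-boundary fix. Since the question reports matches inside lowercase words such as "lupin in", the sketch assumes the pattern is compiled with Pattern.CASE_INSENSITIVE; that flag, the sample inputs, and the names PinDemo and textAfter are assumptions, not taken from the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PinDemo {
    // Pattern from the question (group around PIN dropped), assumed case-insensitive.
    static final Pattern LOOSE =
            Pattern.compile("PIN(\\sOUT|\\sIN)?:?\\s*", Pattern.CASE_INSENSITIVE);
    // Same pattern with a leading word boundary, as suggested in the answer.
    static final Pattern STRICT =
            Pattern.compile("\\bPIN(\\sOUT|\\sIN)?:?\\s*", Pattern.CASE_INSENSITIVE);

    // Returns the text following the first match, or null if there is no match.
    static String textAfter(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? input.substring(m.end()) : null;
    }

    public static void main(String[] args) {
        System.out.println(textAfter(STRICT, "PIN OUT: 1234")); // 1234
        System.out.println(textAfter(LOOSE, "lupin in 99"));    // 99 (unwanted match inside "lupin")
        System.out.println(textAfter(STRICT, "lupin in 99"));   // null
    }
}
```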
I'm not sure if this is possible using Regex but I'd like to be able to limit the number of underscores allowed based on a different character. This is to limit crazy wildcard queries to a search engine written in Java.
The starting characters would be alphanumeric. But I basically want a match if there are more underscores than preceding characters. So
BA_ would be fine but BA___ would match the regex and would get kicked out of the query parser.
Is that possible using Regex?
Yes, you can do it. This pattern will succeed only if there are fewer underscores than letters (you can adapt it to the characters you want):
^(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]+\\1?$
(As Pshemo noticed, anchors are not needed if you use the matches() method. I wrote them to illustrate that this pattern must be bounded by whatever means, with lookarounds for example.)
negated version:
^(?:[A-Z](?=[A-Z]*(\\1?+_)))*\\1?_*$
The idea is to repeat a capture group that contains a backreference to itself plus an underscore. At each repetition, the capture group grows. ^(?:[A-Z](?=[A-Z]*+(\\1?+_)))*+ will match all letters that have a corresponding underscore. You only need to add [A-Z]+ to be sure that there are more letters, and to finish your pattern with \\1? that contains all the underscores (I make it optional, in case there is no underscore at all).
Note that if you replace [A-Z]+ with [A-Z]{n} in the first pattern, you can set exactly the number of characters difference between letters and underscores.
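A short Java sketch that runs the first pattern through matches(). The class and method names (UnderscoreLimitDemo, hasFewerUnderscores) are illustrative; the pattern string is the one given above, without the anchors since matches() already bounds it.

```java
import java.util.regex.Pattern;

public class UnderscoreLimitDemo {
    // Succeeds only if the trailing underscores are fewer than the leading letters.
    static final Pattern FEWER_UNDERSCORES =
            Pattern.compile("(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]+\\1?");

    static boolean hasFewerUnderscores(String query) {
        return FEWER_UNDERSCORES.matcher(query).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasFewerUnderscores("BA_"));   // true:  2 letters, 1 underscore
        System.out.println(hasFewerUnderscores("BA___")); // false: 2 letters, 3 underscores
        System.out.println(hasFewerUnderscores("ABC"));   // true:  no underscore at all
    }
}
```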
To give a better idea, I will try to describe step by step how it works with the string ABC-- (since it's impossible to put underscores in bold, I use hyphens instead):
In the non-capturing group, the first letter is found
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
let's enter the lookahead (keep in mind that all in the lookahead is only
a check and not a part of the match result.)
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
the first capturing group is encountered for the first time and its content is not
defined. This is why an optional quantifier is used: to avoid making
the lookahead fail. Consequence: \1?+ doesn't match anything new.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
the first hyphen is matched. Once the capture group closed, the first capture
group is now defined and contains one hyphen.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
The lookahead succeeds, let's repeat the non-capturing group.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
The second letter is found
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
We enter the lookahead
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
but now, things are different. The capture group was defined before and
contains a hyphen, which is why \1?+ will match the first hyphen.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
the literal hyphen matches the second hyphen in the string. And now
capture group 1 contains the two hyphens. The lookahead succeeds.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
We repeat one more time the non capturing group.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
In the lookahead. There are no more letters, which is not a problem, since
the * quantifier is used.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
\1?+ now matches two hyphens.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
but there is no more hyphen in the string for the literal hyphen and the regex
engine cannot backtrack since \1?+ has a possessive quantifier.
The lookahead fails. Thus the third repetition of the non-capturing group fails too!
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
ensure that there is at least one more letter.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
We match the end of the string with the backreference to capture group 1 that
contains the two hyphens. Note that the fact that this backreference is optional
allows the string to not have hyphens at all.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
This is the end of the string. The pattern succeeds.
ABC-- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
Note: The use of the possessive quantifier for the non-capturing group is needed to avoid false results. (Where you can observe a strange behavior, that can be useful.)
Example:ABC--- and the pattern: ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$ (without the possessive quantifier)
The non-capturing group is repeated three times and `ABC` are matched:
ABC--- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
Note that at this step the first capturing group contains ---
But after the non capturing group, there is no more letter to match for [A-Z]+
and the regex engine must backtrack.
ABC--- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
Question: How many hyphens are in the capture group now?
Answer: Always three!
If the repeated non-capturing group gives a letter back, the capture group still contains three hyphens (its content as of the last time the regex engine set it). This is counter-intuitive, but logical.
Then the letter C is found:
ABC--- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
And the three hyphens
ABC--- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
The pattern succeeds
ABC--- ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
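The effect described in this note can be checked with a small Java sketch that compares the two variants on ABC---, using hyphens as in the walkthrough. PossessiveDemo and the constant names are hypothetical; the patterns are the ones shown above, without anchors because matches() is used.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // Outer group quantified with *+ (possessive), as in the answer.
    static final Pattern POSSESSIVE =
            Pattern.compile("(?:[A-Z](?=[A-Z]*(\\1?+-)))*+[A-Z]+\\1?");
    // Same pattern with a plain greedy * on the outer group.
    static final Pattern GREEDY =
            Pattern.compile("(?:[A-Z](?=[A-Z]*(\\1?+-)))*[A-Z]+\\1?");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Three letters and three hyphens: this should be rejected.
        System.out.println("possessive: " + matches(POSSESSIVE, "ABC---"));
        // Per the explanation above, the greedy variant backtracks by one letter
        // while the capture group keeps its three hyphens, so it accepts the string.
        System.out.println("greedy:     " + matches(GREEDY, "ABC---"));
    }
}
```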
Robby Pond asked me in comments how to find strings that have more underscores than letters (all that is not an underscore). The best way is obviously to count the numbers of underscores and to compare with the string length. But about a full regex solution, it is not possible to build a pattern for that with Java since the pattern needs to use the recursion feature. For example you can do it with PHP:
$pattern = <<<'EOD'
~
(?(DEFINE)
(?<neutral> (?: _ \g<neutral>?+ [A-Z] | [A-Z] \g<neutral>?+ _ )+ )
)
\A (?: \g<neutral> | _ )+ \z
~x
EOD;
var_dump(preg_match($pattern, '____ABC_DEF___'));
It's not possible in a single regular expression.
i) Logic needs to be implemented to get the number of characters before the underscores (a regular expression should be written to get the word characters before the underscores).
ii) Then validate the result: (number of characters - 1) = number of underscores that follow (a regular expression which returns the run of underscores following the characters).
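If the check is done in plain Java rather than in a regex, counting is enough, as one of the answers above already suggests (count the underscores and compare with the string length). A minimal sketch, assuming "too many" means more underscores than other characters; the names UnderscoreCountDemo and tooManyUnderscores are made up.

```java
public class UnderscoreCountDemo {
    // True when the query contains more underscores than other characters.
    static boolean tooManyUnderscores(String query) {
        int underscores = 0;
        for (int i = 0; i < query.length(); i++) {
            if (query.charAt(i) == '_') {
                underscores++;
            }
        }
        return underscores > query.length() - underscores;
    }

    public static void main(String[] args) {
        System.out.println(tooManyUnderscores("BA_"));            // false
        System.out.println(tooManyUnderscores("BA___"));          // true
        System.out.println(tooManyUnderscores("____ABC_DEF___")); // true: 8 underscores, 6 letters
    }
}
```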
Edit: Dang! I just noticed that you need this for java. Anyways...I leave it here if someone from the .Net world stumbles upon this post.
You can use Balancing Groups if you are using .Net:
^(?:(?<letter>[^_])|(?<-letter>_))*(?(letter)(?=)|(?!))$
The .NET regex engine has the ability to maintain all captured patterns in the captured groups. In other flavors the captured group would always contain the last matched pattern, but in .NET all previous matches are kept in a capture collection for your use. Also, the .NET engine has the ability to push and pop to the stack of the captured groups using the ?<group-name>, ?<-group-name> constructs. These two handy constructs can be utilized to match pairs of parentheses, etc.
In the above regex, the engine starts from the start of the string and tries to match anything other than "_". This of course can be changed to whatever works for you(e.g [A-Z][a-z]). The alternation basically means either match [^\_] or [\_] and doing so either push or pop from the captured group.
The latter part of the regex is a conditional (?(group-name)true|false). It basically says, if the group still exists(more pushes than pops), then do the true section and if not do the false section. The easiest way to make the pattern match is to use an empty positive look ahead: (?=) and the easiest way to make it fail is (?!) which is a negative lookahead.
I have these three regular expressions. They work individually, but I would like to merge them into a single pattern.
regex1 = [0-9]{16}
regex2 = [0-9]{4}[-][0-9]{4}[-][0-9]{4}[-][0-9]{4}
regex3 = [0-9]{4}[ ][0-9]{4}[ ][0-9]{4}[ ][0-9]{4}
I use this method:
Pattern.compile(regex);
Which is the regex string to merge them?
You can use backreferences:
[0-9]{4}([ -]|)([0-9]{4}\1){2}[0-9]{4}
This will only match if the separators are either all
spaces
hyphens
blank
\1 means "this matches exactly what the first capturing group – expression in parentheses – matched".
Since ([ -]|) is that group, both other separators need to be the same for the pattern to match.
You can simplify it further to:
\d{4}([ -]|)(\d{4}\1){2}\d{4}
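A quick Java check of the backreference version, using made-up sample numbers and hypothetical names (CardSeparatorDemo, isValid):

```java
import java.util.regex.Pattern;

public class CardSeparatorDemo {
    // Group 1 captures the first separator (space, hyphen or nothing);
    // \1 forces the remaining separators to be identical.
    static final Pattern CARD =
            Pattern.compile("\\d{4}([ -]|)(\\d{4}\\1){2}\\d{4}");

    static boolean isValid(String s) {
        return CARD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("1234567812345678"));    // true
        System.out.println(isValid("1234-5678-1234-5678")); // true
        System.out.println(isValid("1234 5678 1234 5678")); // true
        System.out.println(isValid("1234-5678 12345678"));  // false: mixed separators
    }
}
```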
The following should match anything the three patterns match:
regex = [0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}
That is, I'm assuming you are happy with either a hyphen, a space or nothing between the numbers?
Note: this will also match situations where you have any combination of the three, e.g.
0000-0000 00000000
which may not be desired?
Alternatively, if you need to match any of the three individual patterns then just concatenate them with |, as follows:
([0-9]{16})|([0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})|([0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4})
(Your original example appears to have unnecessary square brackets around the space and hyphen)
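To compare the two variants from this answer, the following Java sketch runs both against the mixed-separator example. CardVariantsDemo and the constant names are illustrative only.

```java
import java.util.regex.Pattern;

public class CardVariantsDemo {
    // Optional separator between each block: any mix is accepted.
    static final Pattern PERMISSIVE =
            Pattern.compile("[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}");
    // The three original patterns joined with |.
    static final Pattern ALTERNATION = Pattern.compile(
            "([0-9]{16})|([0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})"
            + "|([0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4})");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String mixed = "0000-0000 00000000";
        System.out.println(matches(PERMISSIVE, mixed));                  // true
        System.out.println(matches(ALTERNATION, mixed));                 // false
        System.out.println(matches(ALTERNATION, "0000 0000 0000 0000")); // true
    }
}
```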
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern compile = Pattern
                .compile("[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+");
        Matcher matcher = compile.matcher("i5-2450M");
        matcher.find();
        // group(0) is the text of the whole match
        System.out.println(matcher.group(0));
    }
}
I assume this should return i5-2450M but it returns i5 actually
The problem is that the first alternation that matches is used.
In this case the 2nd alternation ([A-Za-z][0-9]{1,}, which matches i5) "shadows" any following alternation.
// doesn't match
[0-9]{1,}[A-Za-z]{1,}|
// matches "i5"
[A-Za-z][0-9]{1,}|
// the following are never even checked, because of the previous match
[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}|
[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|
[0-9][0-9\\-]{4,}|
[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+
(Please note, that there are likely serious issues with the regular expression in the post -- for instance, 0---# would be matched by the last rule -- which should be addressed, but are not below due to not being the "fundamental" problem of the alternation behavior.)
To fix this issue, arrange the alternations with the most specific first. In this case it would be putting the 2nd alternation below the other alternation entries. (Also review the other alternations and the interactions; perhaps the entire regular expression can be simplified?)
The use of a simple word boundary (\b) will not work here because - is considered a non-word character. However, depending upon the meaning of the regular expression, anchors ($ and ^) could be used around the alternation: e.g. ^existing_regex$. This doesn't change the behavior of the alternation, but it would cause the initial match of i5 to be backtracked, and thereby causing subsequent alternation entries to be considered, due to not being able to match the end-of-input immediately after the alternation group.
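Both options (reordering and anchoring) can be tried with a reduced pattern. The sketch below keeps only the two alternatives that matter for i5-2450M; this simplification and the names AlternationOrderDemo and firstMatch are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrderDemo {
    static final String SHORT_ALT = "[A-Za-z][0-9]+";
    static final String LONG_ALT = "[a-zA-Z][a-zA-Z0-9.\\-_/#]{2,}";

    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "i5-2450M";
        // Original order: the first alternative that matches wins.
        System.out.println(firstMatch(SHORT_ALT + "|" + LONG_ALT, input));             // i5
        // Most specific alternative first.
        System.out.println(firstMatch(LONG_ALT + "|" + SHORT_ALT, input));             // i5-2450M
        // Original order, but anchored: "i5" is backtracked because $ fails after it.
        System.out.println(firstMatch("^(?:" + SHORT_ALT + "|" + LONG_ALT + ")$", input)); // i5-2450M
    }
}
```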
From Java regex alternation operator "|" behavior seems broken:
Java uses an NFA, or regex-directed flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. An alternation is expected to quit as soon as one of the alternatives matches, not hold out for the longest match.
(The accepted answer in this question uses word boundaries.)
From Pattern:
The Pattern engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5.
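Try iterating over the matches: create the matcher once (Matcher m = pattern.matcher(text)) and loop with while (m.find()), reading m.group() on each pass. Calling matcher(text) again inside the loop condition would create a new matcher each time and keep returning the first match.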
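A sketch of that loop in Java, using the pattern from the question unchanged. IterateMatchesDemo and allMatches are hypothetical names. Iterating shows every piece the pattern finds, but it does not change which alternative wins at each position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IterateMatchesDemo {
    // The pattern from the question, unchanged.
    static final Pattern PATTERN = Pattern.compile(
            "[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}"
            + "|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}"
            + "|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+");

    // Collects every successive match in the text.
    static List<String> allMatches(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = PATTERN.matcher(text); // create the matcher once
        while (m.find()) {                 // each call resumes after the previous match
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("i5-2450M")); // [i5, 2450M]
    }
}
```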