What do "greedy" and "lazy" mean in the context of regular expressions? Could someone explain these two terms in an understandable way?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non-newline character and + means one or more) would match only the <em> and the </em>. In reality it is greedy and goes from the first < to the last >, so it matches the whole of <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) prevents this. Adding the ? after the + tells the quantifier to repeat as few times as possible, so the match stops at the first > it comes across.
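As a quick check of the two patterns above, here is a minimal sketch using Python's standard `re` module (the variable names are only illustrative):

```python
import re

html = "<em>Hello World</em>"

# Greedy: .+ grabs as much as it can, so the match runs to the last '>'.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? grows one character at a time and stops at the first '>'.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<em>Hello World</em>']
print(lazy)    # ['<em>', '</em>']
```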
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
'Greedy' means match the longest possible string.
'Lazy' means match the shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
| Greedy quantifier | Lazy quantifier | Description |
|---|---|---|
| * | *? | Star Quantifier: 0 or more |
| + | +? | Plus Quantifier: 1 or more |
| ? | ?? | Optional Quantifier: 0 or 1 |
| {n} | {n}? | Quantifier: exactly n |
| {n,} | {n,}? | Quantifier: n or more |
| {n,m} | {n,m}? | Quantifier: between n and m |
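To see each pair from the table side by side, the following sketch applies them to the token `a` against the string `aaaa`. The test string and the `results` dictionary are assumptions made for illustration; `re.match` anchors every attempt at the start, so only the length of the match changes.

```python
import re

text = "aaaa"

# (greedy, lazy) pairs from the table, applied to the token "a".
pairs = [
    ("a*", "a*?"),
    ("a+", "a+?"),
    ("a?", "a??"),
    ("a{2}", "a{2}?"),
    ("a{2,}", "a{2,}?"),
    ("a{2,3}", "a{2,3}?"),
]

results = {}
for greedy, lazy in pairs:
    # Every pattern here can match at position 0, so match() never returns None.
    results[greedy] = re.match(greedy, text).group()
    results[lazy] = re.match(lazy, text).group()
    print(f"{greedy:8} -> {results[greedy]!r:8} {lazy:8} -> {results[lazy]!r}")
```

Note that {n} and {n}? give the same result: with an exact count there is nothing for laziness to shorten.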
Add a ? to a quantifier to make it ungreedy, i.e. lazy.
Example:
test string: stackoverflow
greedy regular expression: s.*o, output: stackoverflo
lazy regular expression: s.*?o, output: stacko
Greedy means your expression will match as large a group as possible; lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match (a.*?c) will match just the first abc.
As far as I know, quantifiers are greedy by default in most regex engines. Adding a question mark after a quantifier makes it lazy.
As @Andre S mentioned in a comment:
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String money = "100000000999";

        // Greedy: 0* takes every zero that follows "100".
        String greedyRegex = "100(0*)";
        Pattern pattern = Pattern.compile(greedyRegex);
        Matcher matcher = pattern.matcher(money);
        while (matcher.find()) {
            System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
        }

        // Lazy: 0*? takes no zeros, because nothing after it forces it to.
        String lazyRegex = "100(0*?)";
        pattern = Pattern.compile(lazyRegex);
        matcher = pattern.matcher(money);
        while (matcher.find()) {
            System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
        }
    }
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me
Taken From www.regular-expressions.info
Greediness: A greedy quantifier first tries to repeat the token as many times
as possible, and gradually gives up matches as the engine backtracks to find
an overall match.
Laziness: A lazy quantifier first repeats the token as few times as required, and
gradually expands the match as the engine backtracks through the regex to
find an overall match.
From Regular expression
The standard quantifiers in regular
expressions are greedy, meaning they
match as much as they can, only giving
back as necessary to match the
remainder of the regex.
By using a lazy quantifier, the
expression tries the minimal match
first.
Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? directly after the quantifier (here, after the *).
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples
Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS suddenly becomes non-greedy and returns as little as possible, i.e. it uses this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-greedy and only matches $5, so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
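The refund example can be checked with a short Python sketch. The greedy variant of the same pattern is added here only for comparison and is not part of the original answer.

```python
import re

balance = "$50,000"

# Lazy first group: .{2,5}? takes the minimum of 2 characters.
lazy_groups = re.match(r"(.{2,5}?)([0-9]*)", balance).groups()

# Greedy first group: .{2,5} takes the maximum of 5 characters.
greedy_groups = re.match(r"(.{2,5})([0-9]*)", balance).groups()

print(lazy_groups)    # ('$5', '0')
print(greedy_groups)  # ('$50,0', '00')
```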
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything, only as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).
Greedy means it will consume your pattern until there are none of them left and it can look no further.
Lazy will stop as soon as it encounters the first occurrence of the pattern you requested.
One common example that I often encounter is \s*-\s*? of a regex ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is greedy because of the * and will consume as many whitespace characters as possible after the digits before looking for the dash character "-". The second \s*? is lazy because of the *?: it starts by matching no whitespace at all and takes one more whitespace character only when the following [0-9]{7} cannot match yet. Because the digits must follow directly, both forms end up matching the same text here; the difference is only in the order in which the engine tries things.
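A small sketch may help separate what laziness changes from what it does not. The sample text below is made up for illustration. When something must follow the lazy quantifier, the overall match is the same as with the greedy one; the difference only shows when nothing forces the lazy quantifier to expand.

```python
import re

text = "ID 12  -   3456789 end"

# Same overall match: the seven digits force \s*? to expand over the spaces.
greedy_full = re.search(r"[0-9]{2}\s*-\s*[0-9]{7}", text).group()
lazy_full = re.search(r"[0-9]{2}\s*-\s*?[0-9]{7}", text).group()

# With nothing after the quantifier, the lazy form takes no whitespace at all.
greedy_tail = re.search(r"-\s*", text).group()
lazy_tail = re.search(r"-\s*?", text).group()

print(repr(greedy_full), repr(lazy_full))
print(repr(greedy_tail), repr(lazy_tail))
```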
Best shown by example: the string 192.168.1.1 and a greedy regex \b.+\b
You might think this would give you the first octet, but it actually matches the whole string. Why? Because the .+ is greedy, and a greedy match consumes every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Only then does it try the third token (\b), and if that fails it backtracks one character at a time until \b matches.
If the string were a 4 GB text file with 192.168.1.1 at the start, you could easily see how this backtracking would cause an issue.
To make a regex non-greedy (lazy), put a question mark after the greedy quantifier, e.g.
*?
??
+?
What happens now is that token 2 (.+?) matches one character, then the regex tries the next token (\b) before going back to token 2 for another character. So it creeps along gingerly.
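The difference on the IP address example can be reproduced with a minimal Python sketch (variable names are illustrative):

```python
import re

address = "192.168.1.1"

# Greedy: .+ runs to the end of the string, where \b also matches.
greedy = re.search(r"\b.+\b", address).group()

# Lazy: .+? expands one character at a time until \b first matches,
# which is the boundary between "192" and the following dot.
lazy = re.search(r"\b.+?\b", address).group()

print(greedy)  # 192.168.1.1
print(lazy)    # 192
```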
To give extra clarification on laziness, here is an example which may not be intuitive at first look but illustrates the idea of "gradually expands the match" from Suganthan Madhavan Pillai's answer.
input -> some.email#domain.com#
regex -> ^.*?#$
The regex does match this input. At first glance one might say the lazy match (".*?#") will stop at the first #, after which the engine checks that the input string ends ("$"). Following this logic, one would conclude there is no match because the input string doesn't end after the first #.
But as you can see, this is not the case: even though the quantifier is lazy, the engine keeps expanding .*? until it reaches the second #, which gives the MINIMAL match that satisfies the whole pattern.
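A short Python sketch of this case, comparing the pattern with and without the `$` anchor (the unanchored variant is added here only to show where the lazy quantifier would stop if nothing forced it onward):

```python
import re

text = "some.email#domain.com#"

# With $, the lazy .*? must keep expanding until the final '#'.
anchored = re.search(r"^.*?#$", text).group()

# Without $, the lazy .*? stops at the first '#'.
unanchored = re.search(r"^.*?#", text).group()

print(anchored)    # some.email#domain.com#
print(unanchored)  # some.email#
```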
Try to understand the following behavior:
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = "  0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " "
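The same two patterns can be tried in Python as a rough port of the C# snippet. This sketch assumes the three inputs have zero, one and two leading spaces respectively. The point is that a pattern made only of optional parts (r2) can match at the very first position, so the engine never gets as far as the digits you had in mind.

```python
import re

r1 = re.compile(r"\d+.{0,1}\d+")  # needs at least one digit on each side
r2 = re.compile(r"\d*.{0,1}\d*")  # every part is optional

inputs = ["0014.2", " 0014.2", "  0014.2"]  # 0, 1 and 2 leading spaces (assumed)

r1_values = [r1.search(s).group() for s in inputs]
r2_values = [r2.search(s).group() for s in inputs]

for s, v1, v2 in zip(inputs, r1_values, r2_values):
    print(repr(s), repr(v1), repr(v2))
```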
I am looking for a way to match an optional ABC in the following strings.
Both strings should be matched either way, if ABC is there or not:
precedingstringwithundefinedlenghtABCsubsequentstringwithundefinedlength
precedingstringwithundefinedlenghtsubsequentstringwithundefinedlength
I've tried
.*(ABC).*
which doesn't work when ABC is absent, but making ABC optional doesn't work either, as the greedy .* in front consumes the whole string, ABC included:
.*(ABC)?.*
This is NOT a duplicate of e.g. Regex Match all characters between two strings, as I am looking for a certain string in between two arbitrary strings, kind of the other way around.
You can use
.*(ABC).*|.*
This works like this:
.*(ABC).* pattern is searched for first, since it is the leftmost part of an alternation (see "Remember That The Regex Engine Is Eager"), it looks for any zero or more chars other than line break chars as many as possible, then captures ABC into Group 1 and then matches the rest of the line with the right-hand .*
| - or
.* - is searched for if the first alternation part does not match.
Another solution without the need to use alternation:
^(?:.*(ABC))?.*
See this regex demo. Details:
^ - start of string
(?:.*(ABC))? - an optional non-capturing group that matches zero or more chars other than line break chars as many as possible and then captures into Group 1 an ABC char sequence
.* - zero or more chars other than line break chars as many as possible.
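The behaviour of the patterns discussed above can be compared with a small Python sketch. The sample strings are shortened stand-ins for the ones in the question, and `group1` is a hypothetical helper. Python reports a capture group that did not participate in the match as `None`.

```python
import re

with_abc = "precedingABCsubsequent"
without_abc = "precedingsubsequent"

def group1(pattern, text):
    """Return capture group 1 of a match anchored at the start of text."""
    return re.match(pattern, text).group(1)

# From the question: the greedy .* swallows ABC, so the optional group stays empty.
question_result = group1(r".*(ABC)?.*", with_abc)

# Alternation: try the version with ABC first, fall back to plain .*
alt_results = (group1(r".*(ABC).*|.*", with_abc), group1(r".*(ABC).*|.*", without_abc))

# Optional non-capturing group wrapping ".*(ABC)".
opt_results = (group1(r"^(?:.*(ABC))?.*", with_abc), group1(r"^(?:.*(ABC))?.*", without_abc))

print(question_result)  # None
print(alt_results)      # ('ABC', None)
print(opt_results)      # ('ABC', None)
```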
I’ve come up with an answer myself:
Using the OR operator seems to work:
(?:(?:.*(ABC))|.*).*
If there’s a better way, feel free to answer and I will accept it.
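You could use this regex: .*(ABC){0,1}.*. It means any, optional{min,max}, any. It is easier to read. Note, however, that {0,1} is just the explicit form of ?, so the leading greedy .* still consumes ABC and the group stays empty, exactly as with .*(ABC)?.* in the question. I can't say whether your solution or mine is faster.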
Options:
{value} = n-times
{min,} = min to infinity
{min,max} = min to max
.+([ABC])?.+ should do the job
I'm trying to remove the first occurrence of a pattern from a string in Java.
Source string: DUMMY01012016DUMMY01012016
Format is 1-8 alpha-numeric characters followed by a date MMddyyyy followed by any number of alpha-numerics.
What I'm trying to achieve is to remove all leading characters up to and including the first date occurrence. So in the example above I would be left with DUMMY01012016.
Here is a simplified version of what I have tried: ".*\\d{4}(2016|2017|2015)"
That works well until the pattern is matched more than once. So in the example matcher.replaceFirst("") will replace the entire source string and not just the first occurrence.
Any thoughts would be greatly appreciated.
Thanks. Stephan
Your issue is that the * quantifier is greedy. It will cause the preceding sub-pattern to match as many times as possible without causing the overall match to fail (if a match is possible at all). Thus the tail of your pattern .*\d{4}(2016|2017|2015) will match the last occurrence of a date in the string, whereas you want it to match the first.
You can solve this problem by switching to a "reluctant" quantifier instead:
myString.replaceFirst(".*?\\d{4}(2016|2017|2015)", "");
There, *? is a reluctant quantifier: it matches zero or more instances of the preceding sub-pattern, as few as possible to enable an overall match (if an overall match is possible).
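For comparison, here is the same idea as a Python sketch, using `re.sub` with `count=1` as a stand-in for Java's `replaceFirst`. The variable names are illustrative.

```python
import re

source = "DUMMY01012016DUMMY01012016"
date = r"\d{4}(?:2016|2017|2015)"

# Greedy prefix: .* runs to the end and backtracks to the LAST date,
# so the whole string is removed.
greedy_result = re.sub(r".*" + date, "", source, count=1)

# Reluctant prefix: .*? expands only as far as the FIRST date.
lazy_result = re.sub(r".*?" + date, "", source, count=1)

print(repr(greedy_result))  # ''
print(repr(lazy_result))    # 'DUMMY01012016'
```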
This regex should work:
(\w{1,8}?\d{8})(?:\1)
One of your problems is that the .* is greedy. It means that it matches as much as it can at first. Then the regexp engine starts to step back symbol by symbol until a full match has been found.
So, roughly:
Step 1) .* matches the whole DUMMY01012016DUMMY01012016
Step 2) The engine steps back symbol by symbol trying to match the remaining part:
DUMMY01012016DUMMY0101201 -> DUMMY01012016DUMMY010120 -> DUMMY01012016DUMMY01012 -> .. -> DUMMY01012016DUMMY
Step 3) A complete match is found -> DUMMY01012016DUMMY01012016
You can try something like this:
@Test
public void testReplace()
{
    String string = "DUMMY01012016DUMMY01012016";
    String replaced = string.replaceFirst("\\w{1,8}\\d{4}(2016|2017|2015)", "");
    Assert.assertEquals("DUMMY01012016", replaced);
}
To understand the difference between lazy and greedy, you can experiment and make the asterisk lazy by adding a question mark, e.g. .*?\d{4}(2016|2017|2015). Then the engine will do the opposite: it will match lazily at the beginning and step forward character by character.
I don't know why, but when I try to run this program, it looks like it will run forever.
package fjr.test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test3 {
    public static void main(String[] args) {
        String regex = "dssdfsdfdsf wdasdads dadlkn mdsds .";
        Pattern p = Pattern.compile("^([a-zA-Z]+ *)+$");
        Matcher match = p.matcher(regex);
        if (match.matches()) {
            System.out.println("Yess");
        } else {
            System.out.println("No.. ");
        }
        System.out.println("FINISH...");
    }
}
What I need to do is match input that contains a bunch of words separated only by spaces.
Your program has likely encountered what's called catastrophic backtracking.
If you have a bit of time, let's look at how your regex works...
Quick refresher: How regex works: The state machine always reads from left to right, backtracking where necessary.
On the left hand side, we have our pattern:
/^([a-zA-Z]+ *)+$/
And here's the String to match:
dssdfsdfdsf wdasdads dadlkn mdsds .
From the regex101 debugger, your regex took 78540 steps to fail. This is because you used quantifiers that are greedy and not possessive, so they backtrack.
... Long story short, because the input string fails to match, every quantifier within your regex causes extensive backtracking: every character is released from + and then * and then both, and then a group is released from ()+ to backtrack more.
Here are a few solutions you should follow:
Avoid abundant quantifiers!
If you revise your expression, you'll see that the pattern is logically almost the same as:
/^[a-zA-Z]+( +[a-zA-Z]+)*$/
This uses a step of logical induction to reduce the regex above so that it fails far quicker, now at 97 steps! The one difference is that the rewritten pattern no longer accepts trailing spaces after the last word.
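The rewritten pattern can be tried with a Python sketch like the one below. It runs the original pattern only on short inputs, where backtracking is cheap, and runs only the rewritten pattern on the long failing input from the question. The short sample strings are assumptions chosen to show where the two patterns agree and where they differ.

```python
import re

original = re.compile(r"^([a-zA-Z]+ *)+$")
rewritten = re.compile(r"^[a-zA-Z]+( +[a-zA-Z]+)*$")

# The failing input from the question: the rewritten pattern rejects it quickly.
failing = "dssdfsdfdsf wdasdads dadlkn mdsds ."
rewritten_rejects = rewritten.match(failing) is None

# Short inputs, safe to run through both patterns.
samples = ["abc", "abc def", "abc  def ghi", "abc ", "abc.", ""]
comparison = {
    s: (original.match(s) is not None, rewritten.match(s) is not None)
    for s in samples
}

for s, (old, new) in comparison.items():
    print(repr(s), old, new)
```

The two patterns disagree only on "abc ": the original accepts a trailing space, the rewritten one does not.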
Use possessive quantifiers while you can!
As I mentioned, /^([a-zA-Z]+ *)+$/ is evil because it backtracks in a terrible manner. We're in Java, what can we do?
This solution works only because [a-zA-Z] and the space match distinct characters. We can use possessive quantifiers!
/^([a-zA-Z]++ *+)++$/
            ^  ^  ^
Each of these extra "+" signs means "we're not backtracking into this quantifier if the match fails from here". This is an extremely effective solution, and it cuts off any need for backtracking. Whenever you have two distinct groups with a quantifier in between, use them.
Read also:
The Stack Overflow Regex Reference
ReDoS - Wikipedia
Online Demos:
RegEx demo 1
RegEx demo 2
This does terminate, but it does take 10 seconds or so. Some observations:
Removing the fullstop from the end of the test string makes it fast.
Changing the * in the regex to a + (which I believe is actually what you want) makes it fast. I think having the option of 0 characters in that spot expands the state space a lot.
I would use:
^(\w+ +)*\w+$
which means a bunch of (word, space) pairs followed by a word. I ran it against your example and it is fast.
You can use this regex to match all words, with or without spaces.
sample:
Pattern p = Pattern.compile("([a-zA-Z ]+)");
Interesting phenomenon. This is related to the behaviour and performance of greedy quantifiers.
Given your pattern ^([a-zA-Z]+ *)+$ and your input "dssdfsdfdsf wdasdads dadlkn mdsds .", the pattern doesn't match the input. However, the Java regex engine will backtrack through all the possibilities of ^([a-zA-Z]+ *)+ (see the examples below) before it arrives at the no-match result.
(dssdfsdfdsf )(wdasdads )(dadlkn )(mdsds )
(dssdfsdfdsf )(wdasdads )(dadlkn )(mdsd)(s )
(dssdfsdfdsf )(wdasdads )(dadlkn )(mds)(ds )
...
(d)(s)(s)(d)(f)(s)(d)(f)(d)(s)(f )(w)(d)(a)(s)(d)(a)(d)(s )(d)(a)(d)(l)(k)(n )(m)(d)(s)(d)(s )
...
(dssdfsdfdsf )
...
(d)
The engine could backtrack more than 200,000,000 times.
I'm a little curious why the Java regex engine can't optimise this: after the first try it finds that the next character is '.', not the end of input ('$'), so no further backtracking can succeed and it is useless.
Therefore, when we define a loop pattern, we need to pay more attention to the inner loop pattern (e.g. [a-zA-Z]+ * in your example) and make sure it cannot match in so many ways. Otherwise, the number of backtracking steps for the whole loop will be huge.
Another similar example (worse than your case):
Pattern: "(.+)*A"
Input: "abcdefghijk lmnopqrs tuvwxy zzzzz A" -- So quick
Input: "abcdefghijAk lmnopqrs tuvwxy zzzzz " -- Not bad, just wait for a while
Input: "aAbcdefghijk lmnopqrs tuvwxy zzzzz " -- it seems infinite. (Actually it isn't, but I have no patience to wait for it to finish.)
In Java regex I have read about greedy and reluctant quantifiers. They are described as follows:
A reluctant or "non-greedy" quantifier first matches as little as
possible. So the .* matches nothing at first, leaving the entire
string unmatched
In this example
source: yyxxxyxx
pattern: .*xx
the greedy quantifier * produces
0 yyxxxyxx
With the reluctant quantifier *? we get the following:
0 yyxx
4 xyxx
Why is a result of yxx, yxx not possible, even though those are the smallest possible matches?
The regex engine returns the first, leftmost match it finds as a result.
Basically it tries to match the pattern starting from the first character. If it doesn't find a match there, the transmission kicks in and it tries again from the second character, and so on.
If you use a+?b on bab it will first try from the first b. That doesn't work, so we try from the second character.
But in your example it finds a match starting right at the first character. Starting from the second character isn't even considered: we found a match, so we return it.
If you apply a+?b on aab, we try at the first a and find an overall match: end of story, no reason to try anything else.
To sum up: the regex engine goes from left to right, so laziness can only affect where the match ends, not where it starts.
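The leftmost-first rule can be observed directly with a Python sketch that records where each match starts (names are illustrative):

```python
import re

source = "yyxxxyxx"

# Each entry is (start index, matched text).
greedy_matches = [(m.start(), m.group()) for m in re.finditer(r".*xx", source)]
lazy_matches = [(m.start(), m.group()) for m in re.finditer(r".*?xx", source)]

# The lazy a+? still starts at the first 'a' it can; it only ends early.
aab = re.search(r"a+?b", "aab").group()
bab = re.search(r"a+?b", "bab").group()

print(greedy_matches)  # [(0, 'yyxxxyxx')]
print(lazy_matches)    # [(0, 'yyxx'), (4, 'xyxx')]
print(aab, bab)        # aab ab
```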