Regular expression hangs program (100% CPU usage)


Java is hanging with 100% CPU usage when I use the below string as input for a regular expression.
RegEx Used:
Here is the regular expression used for the description field in my application.
^([A-Za-z0-9\\-\\_\\.\\&\\,]+[\\s]*)+
String used for testing:
SaaS Service VLAN from Provider_One
2nd attempt with Didier SPT because the first one he gave me was wrong :-(
It works properly when I split the same string in different combinations. Like "SaaS Service VLAN from Provider_One", "first one he gave me was wrong :-(", etc. Java is hanging only for the above given string.
I also tried optimizing the regex as below.
^([\\w\\-\\.\\&\\,]+[\\s]*)+
Even this does not work.

Another classic case of catastrophic backtracking.
You have nested quantifiers that cause a gigantic number of permutations to be checked when the regex arrives at the : in your input string which is not part of your character class (assuming you're using the .matches() method).
Let's simplify the problem to this regex:
^([^:]+)+$
and this string:
1234:
The regex engine needs to check
1234 # no repetition of the capturing group
123 4 # first repetition of the group: 123; second repetition: 4
12 34 # etc.
12 3 4
1 234
1 23 4
1 2 34
1 2 3 4
...and that's just for four characters. On your sample string, RegexBuddy aborts after 1 million attempts. Java will happily keep on chugging... before finally admitting that none of these combinations allows the following : to match.
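To make that growth concrete, here is a small sketch that enumerates every way of cutting a string into consecutive non-empty pieces, which is what the nested quantifiers allow the engine to try. It is an illustration of the count, not a model of how any particular regex engine orders its attempts; the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class SplitCounter {
    // Returns every way to cut s into consecutive non-empty pieces,
    // with the pieces of one split separated by single spaces.
    static List<String> splits(String s) {
        List<String> out = new ArrayList<>();
        collect(s, 0, "", out);
        return out;
    }

    private static void collect(String s, int start, String prefix, List<String> out) {
        if (start == s.length()) {
            out.add(prefix.trim()); // drop the leading separator
            return;
        }
        // Choose where the next piece ends, then recurse on the remainder.
        for (int end = start + 1; end <= s.length(); end++) {
            collect(s, end, prefix + " " + s.substring(start, end), out);
        }
    }

    public static void main(String[] args) {
        for (String split : splits("1234")) {
            System.out.println(split);
        }
        System.out.println(splits("1234").size()); // 8, i.e. 2^(4-1)
    }
}
```

Each extra character doubles the number of splits, so a run of N characters gives 2^(N-1) of them.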
How can you solve this?
You can forbid the regex from backtracking by using possessive quantifiers:
^([A-Za-z0-9_.&,-]++\\s*+)+
will allow the regex to fail faster. Incidentally, I removed all those unnecessary backslashes.
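A minimal sketch of how this could be used from Java, assuming .matches() is the method being called and that the two lines of the sample input are separated by a newline (both are assumptions; the class and method names are invented for the example):

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // The possessive pattern from above, written as a Java string literal.
    static final Pattern POSSESSIVE =
            Pattern.compile("^([A-Za-z0-9_.&,-]++\\s*+)+");

    // matches() requires the whole input to match the pattern.
    static boolean isValidDescription(String input) {
        return POSSESSIVE.matcher(input).matches();
    }

    public static void main(String[] args) {
        String sample = "SaaS Service VLAN from Provider_One\n"
                + "2nd attempt with Didier SPT because the first one he gave me was wrong :-(";
        // The ':' is not in the character class, so this fails, and it fails quickly.
        System.out.println(isValidDescription(sample)); // false
        System.out.println(isValidDescription("SaaS Service VLAN from Provider_One")); // true
    }
}
```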
Edit:
A few measurements:
On the string "was wrong :-)", it takes RegexBuddy 862 steps to figure out a non-match.
For "me was wrong :-)", it's 1,742 steps.
For "gave me was wrong :-)", 14,014 steps.
For "he gave me was wrong :-)", 28,046 steps.
For "one he gave me was wrong :-)", 112,222 steps.
For "first one he gave me was wrong :-)", >1,000,000 steps.

First, you need to realize that your regexes CANNOT match the supplied input string. The strings contain a number of characters ('<' '>' '/' ':' and ')') that are not "word" characters.
So why is it taking so long?
Basically "catastrophic backtracking". More specifically, the repeating structures of your regex give an exponential number of alternatives for the regex backtracking algorithm to try!
Here's what your regex says:
One or more word characters
Followed by zero or more space characters
Repeat the previous 2 patterns as many times as you like.
The problem is with the "zero or more space characters" part. The first time, the matcher will match everything up to the first unexpected character (i.e. the '<'). Then it will back off a bit and try again with a different alternative ... that involves "zero spaces" before the last letter, then when that fails, it will move the "zero spaces" back one position.
The problem is that for a String with N non-space characters, there are N different places where "zero spaces" can be matched, and that makes 2^N different combinations. That rapidly turns into a HUGE number as N grows, and the end result is hard to distinguish from an infinite loop.

Why are you matching whitespace separately from the other characters? And why are you anchoring the match at the beginning, but not at the end? If you want to make sure the string doesn't start or end with whitespace, you should do something like this:
^[A-Za-z0-9_.&,-]+(?:\s+[A-Za-z0-9_.&,-]+)*$
Now there's only one "path" the regex engine can take through the string. If it runs out of characters that match [A-Za-z0-9_.&,-] before reaching the end, and the next character doesn't match \s, it fails immediately. If it reaches the end while still matching whitespace characters, it fails because it's required to match at least one non-whitespace character after each run of whitespace.
If you want to make sure there's exactly one whitespace character separating the runs of non-whitespace, just remove the quantifier from \s+:
^[A-Za-z0-9_.&,-]+(?:\s[A-Za-z0-9_.&,-]+)*$
If you don't care where the whitespace is in relation to the non-whitespace, just match them all with the same character class:
^[A-Za-z0-9_.&,\s-]+$
I'm assuming you know that your regex won't match the given input because of the : and ( in the smiley, and you just want to know why it takes so long to fail.
And of course, since you're creating the regex in the form of a Java string literal, you would write:
"^[A-Za-z0-9_.&,-]+(?:\\s+[A-Za-z0-9_.&,-]+)*$"
or
"^[A-Za-z0-9_.&,-]+(?:\\s[A-Za-z0-9_.&,-]+)*$"
or
"^[A-Za-z0-9_.&,\\s-]+$"
(I know you had double backslashes in the original question, but that was probably just to get them to display properly, since you weren't using SO's excellent code formatting feature.)
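For illustration, the first of those literals could be wrapped in a small validator like the one below. This is only a sketch: the class name, method name and test strings other than the sample from the question are made up.

```java
import java.util.regex.Pattern;

class DescriptionValidator {
    // Runs of allowed characters separated by runs of whitespace,
    // with no leading or trailing whitespace.
    private static final Pattern DESCRIPTION =
            Pattern.compile("^[A-Za-z0-9_.&,-]+(?:\\s+[A-Za-z0-9_.&,-]+)*$");

    static boolean isValid(String s) {
        return DESCRIPTION.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("SaaS Service VLAN from Provider_One")); // true
        System.out.println(isValid("he gave me was wrong :-("));            // false
        System.out.println(isValid("trailing space "));                     // false
    }
}
```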

Related

Regular expression for allowing only 1 of a set of characters

I am trying to use some regex to validate some input inside of Java code. I have been successful in implementing "basic" regex, but this one seems to be out of my scope of knowledge. I am working through the RexEgg tutorials to learn more.
Here are the conditions that need to be validated:
Field will always have 8 characters
Can be all spaces
Or
Valid characters: a-zA-Z0-9 -!& or a space
Cannot begin with a space
If one of the special characters is used, it can be the only one used
Legal: "B-123---" "AB&& &" "A!!!!!!!"
Illegal: "B-123!!!" "AB&& -" "A-&! "
Has to have at least one alphanumeric character (can't be all special characters, i.e. "!!!!!!!!")
This was my regex before additional validations were added:
^(\s{8}|[A-Za-z\-\!\&][ A-Za-z0-9\-\!\&]{7})$
Then the additional validations for not allowing multiple of the special characters were added, and I am a bit stuck. I have been successful in using a positive lookahead, but I am stuck when trying to use the positive lookbehind. (I think the data before the lookbehind was consumed, but I am speculating, as I am a neophyte with this part of regex.)

Using the or construct (a|b) is a large part of this, and you've begun applying it, so that's a good start.
You've made the rule that it can't start with a digit; nothing in the spec says this. Also, - inside [] has special meaning, so escape it, or make sure it is first or last, because then you don't have to. That gets us to:
^(\s{8}|[A-Za-z0-9!& -]{8})$
Next up is the rule that it has to be all the same special character if used at all. Given that there are only 3 special characters, it could be easier to just explicitly list them all:
^(\s{8}|[A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8})$
Next up: Can't start with a space, and can't be all-special. Confirming the negative (that it ISN'T all special characters) gets complicated; lookahead seems like a better plan here.
^ is regexp-ese for: "Start of line". Note that this doesn't 'consume' a character. 1 is regexp-ese for 'only the exact character '1' will match here, nothing else', but as it matches, it also 'consumes' that character, whereas ^ doesn't do that. 'Start of line' is not a concept that can be consumed.
This notion of 'a match may fail, but if it succeeds, nothing is consumed' isn't limited to ^ and $; you can write your own:
(?=abc) will match if abc would match at this position, but does not consume it. Thus, the regexp ^(?=abc)ab.d$ would match the input string abcd and nothing else. This is called positive lookahead. (It 'looks ahead' and matches if it sees the regular expression in the parens, failing if it does not.)
(?!abc) is negative lookahead. It matches if it DOESN'T see the thing in the parens. (?!abc)a.c will match the input adc but not the input abc.
(?<=abc) is positive lookbehind. It matches if the pattern you provide would match such that the match ends at the position you find yourself.
(?<!abc) is negative lookbehind.
Note that lookahead and lookbehind can be somewhat limited, in that they may not allow variable length patterns. But, fortunately, your requirements make it easy to limit ourselves to fixed size patterns here. Thus, we can introduce: (?![&!-]{8}) as a non-consuming unit in our regexp that will fail the match if we have all-8 special characters.
We can use this trick to fail on starting space too: (?! ) is all we need for that one.
Let's replace \s, which is whitespace, with just ' ', which is the space character (the problem description says 'space', not 'whitespace').
Putting it all together:
^( {8}|(?! )(?![&!-]{8})([A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8}))$
That's:
1. 8 spaces, or...
2. not a space, and not all-8 special character, then,
3. any of the valid chars, any amount of spaces, and any amount of one of the 3 allowed special symbols, as long as we have precisely 8 of them...
4. .. OR the same thing as #3 but with the second of the three special symbols
5. .. OR with the third of the three.
Plug em in at regex101 along with your various examples of 'legal' and 'not legal' and you can play around with it some more.
NB: You can also use backreferences to attempt to solve the 'only one special character is allowed' part of this, but attempting to tackle the 'not all special characters' part seems quite unwieldy if you don't get to use (negative) lookahead.
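If you would rather check it from Java than in regex101, a rough sketch could look like this. The class and method names are invented, and the test strings are the unambiguous examples from the question plus a few of my own:

```java
import java.util.regex.Pattern;

class FieldValidator {
    // The combined pattern from above as a Java string literal.
    private static final Pattern FIELD = Pattern.compile(
            "^( {8}|(?! )(?![&!-]{8})"
            + "([A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8}))$");

    static boolean isValid(String field) {
        return FIELD.matcher(field).matches();
    }

    public static void main(String[] args) {
        String eightSpaces = String.format("%8s", ""); // pads the empty string to width 8
        System.out.println(isValid("B-123---"));   // true: only '-' is used
        System.out.println(isValid("A!!!!!!!"));   // true: only '!' is used
        System.out.println(isValid(eightSpaces));  // true: all spaces
        System.out.println(isValid("B-123!!!"));   // false: mixes '-' and '!'
        System.out.println(isValid("!!!!!!!!"));   // false: all special characters
        System.out.println(isValid(" ABCDEFG"));   // false: starts with a space
    }
}
```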

It's a matter of asserting the right conditions at the start of the regex.
^(?=[ ]*$|(?![ ]))(?!.*([!&-]).*(?!\1)[!&-])[a-zA-Z0-9 !&-]{8}$
see -> https://regex101.com/r/tN5y4P/1
Some discussion:
^ # Begin of text
(?= # Assert, cannot start with a space
[ ]* $ # unless it's all spaces
| (?! [ ] )
)
(?! # Assert, not mixed special chars
.*
( [!&-] ) # (1)
.*
(?! \1 )
[!&-]
)
[a-zA-Z0-9 !&-]{8} # Consume 8 valid characters from within this class
$ # End of text

Match pattern which is not wrapped by some character

I have an input string like this:
one `two three` four five `six` seven
where some parts can be wrapped by grave accent character (`).
I want to match only the parts which are not wrapped by it, i.e. one, four five and seven in the example (skipping two three and six).
I tried to do it using lookarounds ((?<=) and (?=)), but it recognised the four five group in the same way as two three and six. Is it possible to solve this problem using regex only, or do I have to do it programmatically? (I'm using Java 1.8.)

If you are sure that there are no unclosed backticks, you could do this:
((?:\w| )+)(?=(?:[^`]*`[^`]*`)*[^`]*$)
This will match:
"one "
" four five "
" seven"
But it's a little bit expensive: the lookahead that checks whether the number of backticks in the remaining part of the line is divisible by 2 has to scan to the end of the string at every candidate position, which takes O(n^2) time overall.
Note that this works regardless of where the whitespace is. It really counts the backticks; it does not care about their relative position. If you don't need this kind of robustness, @anubhava's answer is certainly more performant.
Demo: regex101.
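In Java that could look roughly like the sketch below. The class and method names are made up, and trimming the matches is an extra step I added so the results come out without the surrounding spaces:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnquotedParts {
    // A run of word characters and spaces that is followed by an even
    // number of backticks up to the end of the input.
    private static final Pattern OUTSIDE_BACKTICKS =
            Pattern.compile("((?:\\w| )+)(?=(?:[^`]*`[^`]*`)*[^`]*$)");

    static List<String> find(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = OUTSIDE_BACKTICKS.matcher(input);
        while (m.find()) {
            String part = m.group(1).trim();
            if (!part.isEmpty()) { // skip matches that were only spaces
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [one, four five, seven]
        System.out.println(find("one `two three` four five `six` seven"));
    }
}
```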

You may use this regex using a lookahead and lookbehind:
(?<!`)\b\w+(?:\s+\w+)*\b(?!`)
RegEx Demo
Explanation:
- (?<!`): Negative Lookbehind to assert that we don't have ` at previous position
- \b\w+(?:\s+\w+)*\b: Match our text surrounded by word boundaries
- (?!`): Negative Lookahead to assert that we don't have ` at next position

I solve issues like this by specifying to exclude closing characters (in your case whitespace) like so:
`[^\s]+`

Password Validation with Regex Java

I am trying to figure out a regex to match a password that contains
one upper case letter.
one number
one special character.
and at least 4 characters of length
the regex that I wrote is
^((?=.*[0-9])(?=.*[A-Z])(?=.*[^A-Za-z0-9])){4,}
however it is not working, and I couldn't figure out why.
So please can someone tell me why this code is not working, where did I mess up, and how to correct this code.

Your regex can be rewritten as
^(
(?=.*[0-9])
(?=.*[A-Z])
(?=.*[^A-Za-z0-9])
){4,}
As you see {4,} applies to group which doesn't let you match any character since look-around is zero-width, which effectively means "4 or more of nothing".
You need to add . before {4,} to let your regex handle "and at least 4 characters of length" point (rest is handled by look-around).
You can remove that capturing group since you don't really need it.
So try with something like
^(?=.*[0-9])(?=.*[A-Z])(?=.*[^A-Za-z0-9]).{4,}
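A quick way to try it from Java, as a sketch (class name, method name and sample passwords are mine):

```java
import java.util.regex.Pattern;

class PasswordCheck {
    // Three zero-width lookaheads check the content rules,
    // then .{4,} consumes the characters and enforces the length.
    private static final Pattern PASSWORD =
            Pattern.compile("^(?=.*[0-9])(?=.*[A-Z])(?=.*[^A-Za-z0-9]).{4,}");

    static boolean isAcceptable(String password) {
        return PASSWORD.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("Test123!")); // true
        System.out.println(isAcceptable("Ab1!"));     // true: exactly 4 characters
        System.out.println(isAcceptable("A1!"));      // false: too short
        System.out.println(isAcceptable("abc1!"));    // false: no upper case letter
    }
}
```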

You could come up with sth. like:
^(?=.*[A-Z])(?=.*\d)(?=.*[!"§$%&/()=?`]).{4,}$
In multiline mode, see a demo on regex101.com.
This approach specifies the special characters directly (which could be extended, obviously).
From the following list, only the entries marked with (match) satisfy these criteria:
test
Test123! (match)
StrongPassword34? (match)
weakone
Tabaluga"12??? (match)
You can still enhance this expression by being more specific and requiring contrary pairs. Just to remind you, the dot-star (.*) brings you down the line and then backtracks eventually. This will almost always require more steps than to directly look for contrary pairs.
Consider the following expression:
^ # bind the expression to the beginning of the string
(?=[^A-Z\n\r]*[A-Z]) # look ahead for sth. that is not A-Z, or newline and require one of A-Z
(?=[^\d\n\r]*\d) # same construct for digits
(?=\w*[^\w\n\r]) # same construct for special chars (\w = _A-Za-z0-9)
.{4,}
$
You'll see a significant reduction in steps, as the regex engine does not have to backtrack every time.
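The commented layout above can be kept as-is in Java by compiling with Pattern.COMMENTS, together with Pattern.MULTILINE so that ^ and $ work per line. A sketch, with invented class and method names and the sample list from above as input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinePasswordFilter {
    // COMMENTS ignores whitespace and '#' comments inside the pattern;
    // MULTILINE makes ^ and $ match at line boundaries.
    private static final Pattern PASSWORD_LINE = Pattern.compile(
            "^                       # start of a line\n"
            + "(?=[^A-Z\\n\\r]*[A-Z])  # an upper case letter somewhere on the line\n"
            + "(?=[^\\d\\n\\r]*\\d)    # a digit somewhere on the line\n"
            + "(?=\\w*[^\\w\\n\\r])    # a non-word character after any word characters\n"
            + ".{4,}                   # at least four characters\n"
            + "$                       # end of the line",
            Pattern.COMMENTS | Pattern.MULTILINE);

    // Returns the lines of text that satisfy all of the rules.
    static List<String> acceptedLines(String text) {
        List<String> accepted = new ArrayList<>();
        Matcher m = PASSWORD_LINE.matcher(text);
        while (m.find()) {
            accepted.add(m.group());
        }
        return accepted;
    }

    public static void main(String[] args) {
        String candidates = "test\nTest123!\nStrongPassword34?\nweakone\nTabaluga\"12???";
        // prints [Test123!, StrongPassword34?, Tabaluga"12???]
        System.out.println(acceptedLines(candidates));
    }
}
```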

Reg Expression Validation on a String

Can I use Reg Expression for the following use case?
I need to write a boolean method which takes a String parameter that should satisfy the following conditions.
20 character length string.
First 9 characters will be a number
Next 2 characters will be alphabets
Next 2 characters will be a number.(1 to 31 or 99)
Next 1 character will be an alphabet
Last 6 characters will be a number.
So far, I have written the code for the first requirement:
[a-zA-Z0-9]{20} - This expression works well for the first case. I don't know how to write a complete regular expression to meet the entire requirement.
Please help.

Yes, it is possible to use regexes for this.
Ignore the "20 characters" part and describe a string created by concatenating 9 digits, 2 letters, 2 digits, 1 letter and another 6 digits.
Start with the string start: ^
Then 9 digits. The \d conveniently describes the character set [0-9], so \d{9} means "nine digits"
Then 2 letters. The \w class is too broad, so stick to [a-zA-Z] for a letter.
Then another two digits. They seem to be from a restricted set, so describe the set with alternation and grouping.
Then another letter and another six digits.
And, finally, you have to end at the end of the string: $
For reference, this regex means "the string is nine letters, then 12-15 or 99, then another letter":
^[a-zA-Z]{9}(1[2-5]|99)[a-zA-Z]$

Read the String JavaDocs, especially the part about String.matches() as well as the documentation about regular expressions in Java.

Your first requirement is already implicit in the remaining ones, so I would just skip it. Then, just write the regex code that matches each part one after the other:
[0-9]{9}[a-zA-Z]{2}...
There is one special consideration for the number that might be 1 to 31. While it is possible to match this in one regex, it would be verbose and difficult to understand. Instead, perform basic matching in the regex and extract this part as a capturing group by putting it into parentheses:
([0-9]{2})
If you use Pattern and Matcher to apply your regex, and your string matches the pattern, you can then easily get at just those two characters, use Integer.parseInt() to convert them to an integer (which is completely safe because you know the two characters are digits), and then check the value normally.
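Put together, the approach might look like the following sketch. The class and method names are invented, and I am assuming the value is always written with two digits (so 1 would appear as 01):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordCodeValidator {
    // The regex only checks the shape; the two-digit part is captured as group 1.
    private static final Pattern SHAPE =
            Pattern.compile("[0-9]{9}[a-zA-Z]{2}([0-9]{2})[a-zA-Z][0-9]{6}");

    static boolean isValid(String code) {
        Matcher m = SHAPE.matcher(code);
        if (!m.matches()) {
            return false;
        }
        // Safe to parse: the group is known to be exactly two ASCII digits.
        int value = Integer.parseInt(m.group(1));
        return (value >= 1 && value <= 31) || value == 99;
    }

    public static void main(String[] args) {
        System.out.println(isValid("123456789AB15C123456")); // true
        System.out.println(isValid("123456789AB99C123456")); // true
        System.out.println(isValid("123456789AB32C123456")); // false: 32 is out of range
        System.out.println(isValid("123456789AB00C123456")); // false: 0 is out of range
    }
}
```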

This regular expression
^[0-9]{9}[a-zA-Z]{2}(0[1-9]|[1-2][0-9]|3[0-1]|99)[a-zA-Z]([0-9]{6})$
takes
9 digits at the start,
followed by 2 letters,
followed by a two-digit number between 01 and 31, or 99,
followed by a letter,
followed by 6 digits.

Regex to detect number within String

I'm confronted with a String:
[something] -number OR number [something]
I want to be able to parse the number. I do not know at which position it occurs. I cannot build a sub-string because there's no obvious separator.
Is there any method how I could extract the number from the String by matching a pattern like
[-]?[0..9]+
, where the minus is optional? The String can contain special characters, which actually drives me crazy defining a regex.

-?\b\d+\b
That's broken down by:
-? (optional minus sign)
\b word boundary
\d+ 1 or more digits
[EDIT 2] - nod to Alan Moore
Unfortunately Java doesn't have verbatim strings, so you'll have to escape the regex above as:
String regex = "-?\\b\\d+\\b";
I'd also recommend a site like http://regexlib.com/RETester.aspx or a program like Expresso to help you test and design your regular expressions
[EDIT] - after some good comments
I haven't done something like .*?(-?\d+).* (from @Voo) because I wasn't sure if you wanted to match the entire string, or just the digits. Both versions should tell you if there are digits in the string, and if you want the actual digits, use the first regex and look at group 0 (the whole match). There are clever ways to name groups or use multiple captures, but that would be a complicated answer to a straightforward question...
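For completeness, a small Java sketch of the first approach. The class and method names are made up, and Integer.parseInt will throw a NumberFormatException if the digits don't fit in an int:

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberExtractor {
    // Optional minus sign, then a run of digits delimited by word boundaries.
    private static final Pattern NUMBER = Pattern.compile("-?\\b\\d+\\b");

    // Returns the first number found in text, or an empty OptionalInt if there is none.
    static OptionalInt firstNumber(String text) {
        Matcher m = NUMBER.matcher(text);
        if (m.find()) {
            return OptionalInt.of(Integer.parseInt(m.group())); // group() is the whole match
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("offset -42 px")); // OptionalInt[-42]
        System.out.println(firstNumber("42 items"));      // OptionalInt[42]
        System.out.println(firstNumber("abc123"));        // OptionalInt.empty
    }
}
```

Note that digits glued to letters, as in abc123, are not matched because there is no word boundary between the letter and the digit.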
