Modifying existing Java regex - java

I have the following regex that validates the allowed characters:
^[a-zA-Z0-9-?\/:;(){}\[\]|`~´.\,'+÷ !##$£%^"&*_<>=àáâäçèéêëìíîïñòóôöùúûüýßÀÁÂÄÇÈÉÊËÌÍÎÏÒÓÔÖÙÚÛÜÑ\\]*$
I need to modify it so that the string being validated:
may not begin with space or “/”
may not contain “//”
may not end with “/”
For the space at the beginning I have adapted it to
^[^\s][a-zA-Z0-9-?\\/:;(){}\\[\\]|`~´.\\,'+÷ !##$£%^\"&*_<>=àáâäçèéêëìíîïñòóôöùúûüýßÀÁÂÄÇÈÉÊËÌÍÎÏÒÓÔÖÙÚÛÜÑ\\\\]*$
Not sure what to do about the other two requirements
For the second one I tried combining it with ^((?!//))*$ in various ways but to no success.

Note that ^((?!\/\/))*$ can only match an empty string, since the lookahead is a non-consuming pattern: the group never advances through the input, so nothing is ever consumed between ^ and $.
[^\s] at the start of your pattern will match any chars other than whitespace chars, even those you did not specify in the character class.
You can use
^(?![\s/])(?!.*//)[a-zA-Z0-9?/:;(){}\[\]|`~´.,'+÷ !##$£%^\"&*_<>=àáâäçèéêëìíîïñòóôöùúûüýßÀÁÂÄÇÈÉÊËÌÍÎÏÒÓÔÖÙÚÛÜÑ\\-]*$(?<!/)
See the regex demo. Details:
^(?![\s/])(?!.*//) - at the start of string, two checks are performed:
(?![\s/]) - no whitespace or / allowed (right at the start)
(?!.*//) - no // allowed anywhere after zero or more chars other than line break chars, as many as possible
(?<!/) is the check after the end of string is hit, and it fails the match if the last char in string is /.
Note that in Java regex declarations, you do not need to escape / since regex delimiter notation is not used, and / itself is not a special regex metacharacter.
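A minimal Java sketch of the lookaround structure described above. To keep it short, it assumes a reduced character class ([a-zA-Z0-9 /._-]) in place of the full one; the class name PathLikeValidator and the method isValid are hypothetical.

```java
import java.util.regex.Pattern;

final class PathLikeValidator {
    // Assumption: a shortened character class stands in for the full one from the answer.
    private static final Pattern VALID = Pattern.compile(
            "^(?![\\s/])"          // must not start with whitespace or '/'
          + "(?!.*//)"             // must not contain "//" anywhere
          + "[a-zA-Z0-9 /._-]*"    // allowed characters only
          + "$(?<!/)");            // must not end with '/'

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("abc/def")); // true
        System.out.println(isValid("/abc"));    // false: starts with '/'
        System.out.println(isValid(" abc"));    // false: starts with a space
        System.out.println(isValid("a//b"));    // false: contains "//"
        System.out.println(isValid("abc/"));    // false: ends with '/'
    }
}
```

Swapping the full character class back in only changes the third line of the pattern; the three lookarounds stay as they are.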

It seems like the following regexp should be enough and more simple: (?!.*//)^[^ /].*[^/]$
So at the beginning you can use negative lookahead to prevent occurrence of // anywhere in the text. Then any character but space and / is accepted at the beginning, then anything can be present (besides // which was excluded by negative lookahead) and anything but / is accepted at the end.
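A quick Java check of this shorter pattern may help when deciding whether it is enough. The sketch below (class name ShortPatternDemo is hypothetical) also shows two consequences of its shape: it no longer restricts the input to the original character set, and it needs at least two characters to match.

```java
import java.util.regex.Pattern;

final class ShortPatternDemo {
    private static final Pattern SHORT = Pattern.compile("(?!.*//)^[^ /].*[^/]$");

    static boolean matches(String s) {
        return SHORT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("ab/cd")); // true
        System.out.println(matches("/ab"));   // false: starts with '/'
        System.out.println(matches("ab/"));   // false: ends with '/'
        System.out.println(matches("a//b"));  // false: contains "//"
        System.out.println(matches("a"));     // false: first and last char are matched separately
        System.out.println(matches("a\tb"));  // true: '.' accepts characters outside the original class
    }
}
```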

Since 95% of the time the special conditions on the space and forward slash
will not occur, it might be better to take those two characters out of your
big class and handle them separately if and when they occur.
The big class can also be condensed to speed things up a bit.
^(?>[a-zA-Z0-9\\!-.:-@\[\]-`{-~£´ÄÖäö÷À-ÂÇ-ÏÑ-ÔÙ-Üß-âç-ïñ-ôù-ý]+|(?:/(?!/|$)|[ ])(?<!^.))*$
https://regex101.com/r/LpCwt6/1
^
(?>
[a-zA-Z0-9\\!-.:-@\[\]-`{-~£´ÄÖäö÷À-ÂÇ-ÏÑ-ÔÙ-Üß-âç-ïñ-ôù-ý]+
| (?:
/
(?! / | $ )
| [ ]
)
(?<! ^ . )
)*
$
And if you want to absorb all the class characters it can get very small.
^(?>[!-.0-~£´ÄÖäö÷À-ÂÇ-ÏÑ-ÔÙ-Üß-âç-ïñ-ôù-ý]+|(?:/(?!/|$)|[ ])(?<!^.))*$
https://regex101.com/r/EYdM5C/1
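The same structure can be tried in Java, which supports both atomic groups and the bounded lookbehind used here. This is only a sketch: it assumes a tiny stand-in class ([a-zA-Z0-9]) instead of the condensed one, and the names AtomicGroupDemo and isValid are hypothetical.

```java
import java.util.regex.Pattern;

final class AtomicGroupDemo {
    // Assumption: [a-zA-Z0-9] replaces the large condensed class for readability.
    private static final Pattern VALID = Pattern.compile(
            "^(?>[a-zA-Z0-9]+"        // runs of ordinary characters
          + "|(?:/(?!/|$)|[ ])"       // a '/' not followed by '/' or end, or a space
          + "(?<!^.)"                 // ...but not as the very first character
          + ")*$");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("abc/def")); // true
        System.out.println(isValid("a b"));     // true
        System.out.println(isValid("/abc"));    // false
        System.out.println(isValid(" abc"));    // false
        System.out.println(isValid("a//b"));    // false
        System.out.println(isValid("abc/"));    // false
    }
}
```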

Related

Why I got IllegalStateException here? [duplicate]

I have a string. The end is different, such as index.php?test=1&list=UL or index.php?list=UL&more=1. The one thing I'm looking for is &list=.
How can I match it, whether it's in the middle of the string or it's at the end? So far I've got [&|\?]list=.*?([&|$]), but the ([&|$]) part doesn't actually work; I'm trying to use that to match either & or the end of the string, but the end of the string part doesn't work, so this pattern matches the second example but not the first.
Use:
/(&|\?)list=.*?(&|$)/
Note that when you use a bracket expression, every character within it (with some exceptions) is going to be interpreted literally. In other words, [&|$] matches the characters &, |, and $.
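The difference is easy to see in Java (used here only for illustration; the class name BracketLiteralDemo and the helper found are hypothetical). Inside brackets, $ is just a dollar sign, so the bracketed version cannot match at the end of the string.

```java
import java.util.regex.Pattern;

final class BracketLiteralDemo {
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String broken = "[&|\\?]list=.*?([&|$])"; // $ is a literal inside [...]
        String fixed = "(&|\\?)list=.*?(&|$)";    // $ is an anchor inside (...)

        System.out.println(found(broken, "index.php?test=1&list=UL")); // false
        System.out.println(found(broken, "index.php?list=UL&more=1")); // true
        System.out.println(found(fixed, "index.php?test=1&list=UL"));  // true
        System.out.println(found("[&|$]", "$"));                       // true: literal dollar sign
    }
}
```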
In short
Any zero-width assertions inside [...] lose their meaning of a zero-width assertion. [\b] does not match a word boundary (it matches a backspace, or, in POSIX, \ or b), [$] matches a literal $ char, [^] is either an error or, as in ECMAScript regex flavor, any char. Same with \z, \Z, \A anchors.
You may solve the problem using any of the below patterns:
[&?]list=([^&]*)
[&?]list=(.*?)(?=&|$)
[&?]list=(.*?)(?![^&])
If you need to check for the "absolute", unambiguous string end anchor, you need to remember that in various regex flavors it is expressed with different constructs:
[&?]list=(.*?)(?=&|$) - OK for ECMA regex (JavaScript, default C++ `std::regex`)
[&?]list=(.*?)(?=&|\z) - OK for .NET, Go, Onigmo (Ruby), Perl, PCRE (PHP, base R), Boost, ICU (R `stringr`), Java/Android
[&?]list=(.*?)(?=&|\Z) - OK for Python
Matching between a char sequence and a single char or end of string (current scenario)
The .*?([YOUR_SINGLE_CHAR_DELIMITER(S)]|$) pattern (suggested by João Silva) is rather inefficient since the regex engine checks for the patterns that appear to the right of the lazy dot pattern first, and only if they do not match does it "expand" the lazy dot pattern.
In these cases it is recommended to use negated character class (or bracket expression in the POSIX talk):
[&?]list=([^&]*)
See demo. Details
[&?] - a positive character class matching either & or ? (note the relationships between chars/char ranges in a character class are OR relationships)
list= - a substring, char sequence
([^&]*) - Capturing group #1: zero or more (*) chars other than & ([^&]), as many as possible
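As an illustration, here is how the negated character class version could be used from Java to pull out the value. The class ListParam and the method listValue are hypothetical names, and returning null when nothing is found is just one possible choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ListParam {
    private static final Pattern LIST = Pattern.compile("[&?]list=([^&]*)");

    // Returns the value of the "list" parameter, or null if it is absent.
    static String listValue(String url) {
        Matcher m = LIST.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(listValue("index.php?test=1&list=UL")); // UL
        System.out.println(listValue("index.php?list=UL&more=1")); // UL
        System.out.println(listValue("index.php?test=1"));         // null
    }
}
```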
Checking for the trailing single char delimiter presence without returning it or end of string
Most regex flavors (including JavaScript beginning with ECMAScript 2018) support lookarounds, constructs that only return true or false depending on whether their patterns match or not. They are crucial in case consecutive matches that may start and end with the same char are expected (see the original pattern, it may match a string starting and ending with &). Although it is not expected in a query string, it is a common scenario.
In that case, you can use two approaches:
A positive lookahead with an alternation containing positive character class: (?=[SINGLE_CHAR_DELIMITER(S)]|$)
A negative lookahead with just a negative character class: (?![^SINGLE_CHAR_DELIMITER(S)])
The negative lookahead solution is a bit more efficient because it does not contain an alternation group that adds complexity to matching procedure. The OP solution would look like
[&?]list=(.*?)(?=&|$)
or
[&?]list=(.*?)(?![^&])
See this regex demo and another one here.
Certainly, in case the trailing delimiters are multichar sequences, only a positive lookahead solution will work since [^yes] does not negate a sequence of chars, but the chars inside the class (i.e. [^yes] matches any char but y, e and s).
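Both lookahead variants can be compared side by side in Java, using \z as the unambiguous end anchor as listed above. This is a small sketch with hypothetical names (ListLookahead, withPositiveLookahead, withNegativeLookahead).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ListLookahead {
    private static final Pattern POSITIVE = Pattern.compile("[&?]list=(.*?)(?=&|\\z)");
    private static final Pattern NEGATIVE = Pattern.compile("[&?]list=(.*?)(?![^&])");

    private static String firstGroup(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    static String withPositiveLookahead(String s) {
        return firstGroup(POSITIVE, s);
    }

    static String withNegativeLookahead(String s) {
        return firstGroup(NEGATIVE, s);
    }

    public static void main(String[] args) {
        String end = "index.php?test=1&list=UL";
        String middle = "index.php?list=UL&more=1";
        System.out.println(withPositiveLookahead(end));    // UL
        System.out.println(withPositiveLookahead(middle)); // UL
        System.out.println(withNegativeLookahead(end));    // UL
        System.out.println(withNegativeLookahead(middle)); // UL
    }
}
```

In both cases the delimiter is only checked, not consumed, so a following match could start at that same & character.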

Regular expression for allowing only 1 of a set of characters

I am trying to use some regex to validate some input inside of Java code. I have been successful in implementing "basic" regex, but this one seems to be out of my scope of knowledge. I am working through RegEgg tutorials to learn more.
Here are the conditions that need to be validated:
Field will always have 8 characters
Can be all spaces
Or
Valid characters: a-zA-Z0-9 -!& or a space
Cannot begin with a space
If one of the special characters is used, it can be the only one used
Legal: "B-123---" "AB&& &" "A!!!!!!!"
Illegal: "B-123!!!" "AB&& -" "A-&! "
Has to have at least one alphanumeric character (Can't be all special characters, i.e. "!!!!!!!!")
This was my regex before additional validations were added:
^(\s{8}|[A-Za-z\-\!\&][ A-Za-z0-9\-\!\&]{7})$
Then the additional validations for not allowing multiple different special characters were added, and I am a bit stuck. I have been successful in using a positive lookahead, but got stuck when trying to use the positive lookbehind (I think the data before the lookbehind was consumed), but I am speculating as I am a neophyte with this part of regex.
Using the or construct (a|b) is a large part of this, and you've begun applying it, so that's a good start.
You've made the rule that it can't start with a digit; nothing in the spec says this. Also, - inside [] has special meaning, so escape it, or make sure it is first or last, because then you don't have to. That gets us to:
^(\s{8}|[A-Za-z0-9-!& -]{8})$
Next up is the rule that it has to be all the same special character if used at all. Given that there are only 3 special characters, it could be easier to just explicitly list them all:
^(\s{8}|[A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8})$
Next up: Can't start with a space, and can't be all-special. Confirming the negative (that it ISNT all-special characters) gets complicated; lookahead seems like a better plan here. This:
^ is regexp-ese for: "Start of line". Note that this doesn't 'consume' a character. 1 is regexpese for 'only the exact character '1' will match here, nothing else', but as it matches, it also 'consumes' that character, whereas ^ doesn't do that. 'start of line' is not a concept that can be consumed.
This notion of 'a match may fail, but if it succeeds, nothing is consumed' isn't limited to ^ and $; you can write your own:
(?=abc) will match if abc would match at this position, but does not consume it. Thus, the regexp ^(?=abc)ab.d$ would match the input string abcd and nothing else. This is called positive lookahead. (it 'looks ahead' and matches if it sees the regular expression in the parens, failing if it does not).
(?!abc) is negative lookahead. It matches if it DOESNT see the thing in the parens. (?!abc)a.c will match the input adc but not the input abc.
(?<=abc) is positive lookbehind. It matches if the pattern you provide would match such that the match ends at the position you find yourself.
(?<!abc) is negative lookbehind.
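The four constructs above can be tried out directly in Java. The snippet is only an illustration; the class name LookaroundDemo and the sample strings for the lookbehind cases are made up for this sketch.

```java
final class LookaroundDemo {
    public static void main(String[] args) {
        // Positive lookahead: "abc" must follow, but nothing is consumed by the check.
        System.out.println("abcd".matches("(?=abc)ab.d"));       // true

        // Negative lookahead: the match fails only where "abc" would follow.
        System.out.println("adc".matches("(?!abc)a.c"));         // true
        System.out.println("abc".matches("(?!abc)a.c"));         // false

        // Positive lookbehind: X is replaced only when "abc" ends right before it.
        System.out.println("abcX".replaceAll("(?<=abc)X", "!")); // abc!

        // Negative lookbehind: X is replaced only when "abc" does NOT precede it.
        System.out.println("abcX".replaceAll("(?<!abc)X", "!")); // abcX
        System.out.println("xyzX".replaceAll("(?<!abc)X", "!")); // xyz!
    }
}
```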
Note that lookahead and lookbehind can be somewhat limited, in that they may not allow variable length patterns. But, fortunately, your requirements make it easy to limit ourselves to fixed size patterns here. Thus, we can introduce: (?![&!-]{8}) as a non-consuming unit in our regexp that will fail the match if we have all-8 special characters.
We can use this trick to fail on starting space too: (?! ) is all we need for that one.
Let's replace \s, which is whitespace, with just a literal space character (the problem description says 'space', not 'whitespace').
Putting it all together:
^( {8}|(?! )(?![&!-]{8})([A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8}))$
That's:
1. 8 spaces, or...
2. not a space, and not all-8 special characters, then,
3. any of the valid chars, any amount of spaces, and any amount of one of the 3 allowed special symbols, as long as we have precisely 8 of them...
4. .. OR the same thing as #3 but with the second of the three special symbols
5. .. OR with the third of the three.
Plug em in at regex101 along with your various examples of 'legal' and 'not legal' and you can play around with it some more.
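For a quick check from Java instead of regex101, something like the following sketch could be used. The class FixedFieldValidator and the method isValid are hypothetical, and the test strings are 8-character variants of the examples from the question (the spacing of the originals is assumed).

```java
import java.util.regex.Pattern;

final class FixedFieldValidator {
    private static final Pattern FIELD = Pattern.compile(
            "^( {8}|(?! )(?![&!-]{8})"
          + "([A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8}))$");

    static boolean isValid(String s) {
        return FIELD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String eightSpaces = String.format("%8s", ""); // pads the empty string to width 8

        System.out.println(isValid(eightSpaces)); // true
        System.out.println(isValid("B-123---"));  // true
        System.out.println(isValid("A!!!!!!!"));  // true
        System.out.println(isValid("AB&&   &"));  // true
        System.out.println(isValid("B-123!!!"));  // false: two different special characters
        System.out.println(isValid("!!!!!!!!"));  // false: all special characters
        System.out.println(isValid(" B-123--"));  // false: starts with a space
    }
}
```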
NB: You can also use backreferences to attempt to solve the 'only one special character is allowed' part of this, but attempting to tackle the 'not all special characters' part seems quite unwieldy if you don't get to use (negative) lookahead.
It's a matter of asserting the right conditions at the start of the regex.
^(?=[ ]*$|(?![ ]))(?!.*([!&-]).*(?!\1)[!&-])[a-zA-Z0-9 !&-]{8}$
see -> https://regex101.com/r/tN5y4P/1
Some discussion:
^ # Begin of text
(?= # Assert, cannot start with a space
[ ]* $ # unless it's all spaces
| (?! [ ] )
)
(?! # Assert, not mixed special chars
.*
( [!&-] ) # (1)
.*
(?! \1 )
[!&-]
)
[a-zA-Z0-9 !&-]{8} # Consume 8 valid characters from within this class
$ # End of text
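The assertion-based pattern can be exercised from Java as well. This is a sketch with hypothetical names (BackrefFieldValidator, isValid); note that the backslash of the backreference has to be doubled in a Java string literal.

```java
import java.util.regex.Pattern;

final class BackrefFieldValidator {
    private static final Pattern FIELD = Pattern.compile(
            "^(?=[ ]*$|(?![ ]))"               // all spaces, or does not start with a space
          + "(?!.*([!&-]).*(?!\\1)[!&-])"      // no two different special characters
          + "[a-zA-Z0-9 !&-]{8}$");            // exactly 8 valid characters

    static boolean isValid(String s) {
        return FIELD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid(String.format("%8s", ""))); // true: 8 spaces
        System.out.println(isValid("B-123---"));               // true
        System.out.println(isValid("A!!!!!!!"));               // true
        System.out.println(isValid("B-123!!!"));               // false: '-' and '!' mixed
        System.out.println(isValid(" B-123--"));               // false: starts with a space
    }
}
```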

Restrict consecutive characters using Java Regex

I need to allow alphanumeric characters , "?","." , "/" and "-" in the given string. But I need to restrict consecutive - only.
For example:
www.google.com/flights-usa should be valid
www.google.com/flights--usa should be invalid
currently I'm using ^[a-zA-Z0-9\\/\\.\\?\\_\\-]+$.
Please suggest me how to restrict consecutive - only.
You may use grouping with quantifiers:
^[a-zA-Z0-9/.?_]+(?:-[a-zA-Z0-9/.?_]+)*$
See the regex demo
Details:
^ - start of string
[a-zA-Z0-9/.?_]+ - 1 or more characters from the set defined in the character class (can be replaced with [\w/.?]+)
(?:-[a-zA-Z0-9/.?_]+)* - zero or more sequences ((?:...)*) of:
- - hyphen
[a-zA-Z0-9/.?_]+ - see above
$ - end of string.
Or use a negative lookahead:
^(?!.*--)[a-zA-Z0-9/.?_-]+$
^^^^^^^^^
See the demo here
Details:
^ - start of string
(?!.*--) - a negative lookahead that will fail the match once the regex engine finds a -- substring after any 0+ chars other than a newline
[a-zA-Z0-9/.?_-]+ - 1 or more chars from the set defined in the character class
$ - end of string.
Note that [a-zA-Z0-9_] = \w if you do not use the Pattern.UNICODE_CHARACTER_CLASS flag. So, the first would look like "^[\\w/.?]+(?:-[\\w/.?]+)*$" and the second as "^(?!.*--)[\\w/.?-]+$".
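Both Java string forms can be compared with a few inputs. In this sketch the names HyphenRules, grouped and lookahead are hypothetical. The last case shows one difference between the two: the grouping version also rejects a leading (or trailing) hyphen, while the lookahead version only rejects the doubled one.

```java
import java.util.regex.Pattern;

final class HyphenRules {
    private static final Pattern GROUPED = Pattern.compile("^[\\w/.?]+(?:-[\\w/.?]+)*$");
    private static final Pattern LOOKAHEAD = Pattern.compile("^(?!.*--)[\\w/.?-]+$");

    static boolean grouped(String s) {
        return GROUPED.matcher(s).matches();
    }

    static boolean lookahead(String s) {
        return LOOKAHEAD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(grouped("www.google.com/flights-usa"));    // true
        System.out.println(lookahead("www.google.com/flights-usa"));  // true
        System.out.println(grouped("www.google.com/flights--usa"));   // false
        System.out.println(lookahead("www.google.com/flights--usa")); // false
        System.out.println(grouped("-usa"));                          // false
        System.out.println(lookahead("-usa"));                        // true
    }
}
```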
One approach is to restrict multiple dashes with negative look-behind on a dash, like this:
^(?:[a-zA-Z0-9\/\.\?\_]|(?<!-)-)+$
The right side of the |, i.e. (?<!-)-, means "a dash, unless preceded by another dash".
Demo.
I'm not sure of the efficiency of this, but I believe this should work.
^([a-zA-Z0-9\/\.\?\_]|\-([^\-]|$))+$
For each character, this regex checks if it can match [a-zA-Z0-9\/\.\?\_], which is everything you included in your regex except the hyphen. If that does not match, it instead tries to match \-([^\-]|$), which matches a hyphen not followed by another hyphen, or a hyphen at the end of the string.
Here's a demo.
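The lookbehind pattern and this last pattern can be compared in Java with a short sketch (class and method names are hypothetical, and the unnecessary escapes are dropped). One thing worth checking with the last pattern: [^-] consumes the character after the hyphen, whatever it is, so a character outside the allowed set can slip through right after a hyphen.

```java
import java.util.regex.Pattern;

final class HyphenAlternatives {
    private static final Pattern LOOKBEHIND =
            Pattern.compile("^(?:[a-zA-Z0-9/.?_]|(?<!-)-)+$");
    private static final Pattern CONSUMING =
            Pattern.compile("^([a-zA-Z0-9/.?_]|-([^-]|$))+$");

    static boolean lookbehind(String s) {
        return LOOKBEHIND.matcher(s).matches();
    }

    static boolean consuming(String s) {
        return CONSUMING.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(lookbehind("www.google.com/flights-usa"));  // true
        System.out.println(consuming("www.google.com/flights-usa"));   // true
        System.out.println(lookbehind("www.google.com/flights--usa")); // false
        System.out.println(consuming("www.google.com/flights--usa"));  // false
        System.out.println(lookbehind("a-!"));                         // false
        System.out.println(consuming("a-!"));                          // true: '!' is eaten by [^-]
    }
}
```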

how to limit the number of "/" in a string

How do I use lookahead assertion to limit by range the number of "/"
I have tried the following
^(?=/{1,3})$
but it doesn't work
The easiest solution is to use a negative lookahead:
^(?!(?:[^/]*/){4})
That basically means the string cannot contain 4 or more slashes.
This assumes you allow other characters between slashes, but a maximum of 3 slashes.
A positive version would be ^(?=[^/]*(?:/[^/]*){0,3}$) or ^[^/]*(?:/[^/]*){0,3}$, without the lookahead.
Of course, the problem is trivial without regular expressions, if possible.
Let's try to break that last one down:
^ - Start of the string.
[^/]* - Some characters that are not slashes (or none)
(?: ) - A logical group. Similar to (), but does not capture the result (we do not need it after validation)
/[^/]* - Slash, followed by non-slash characters.
{0,3} - From 0 to 3 times.
$ - End of the string.
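In Java (assumed here only for illustration), the whole-string pattern, the negative lookahead and the non-regex alternative mentioned above could look like this. The class SlashLimit and its method names are hypothetical.

```java
import java.util.regex.Pattern;

final class SlashLimit {
    private static final Pattern AT_MOST_THREE = Pattern.compile("^[^/]*(?:/[^/]*){0,3}$");
    private static final Pattern NOT_FOUR = Pattern.compile("^(?!(?:[^/]*/){4})");

    // Whole-string match: non-slash runs separated by at most 3 slashes.
    static boolean byPattern(String s) {
        return AT_MOST_THREE.matcher(s).matches();
    }

    // The lookahead consumes nothing, so find() is used instead of matches().
    static boolean byLookahead(String s) {
        return NOT_FOUR.matcher(s).find();
    }

    // Without regular expressions: just count the slashes.
    static long countSlashes(String s) {
        return s.chars().filter(c -> c == '/').count();
    }

    public static void main(String[] args) {
        System.out.println(byPattern("a/b/c/d"));      // true: 3 slashes
        System.out.println(byPattern("a/b/c/d/e"));    // false: 4 slashes
        System.out.println(byLookahead("a/b/c/d"));    // true
        System.out.println(byLookahead("a/b/c/d/e"));  // false
        System.out.println(countSlashes("a/b/c/d/e")); // 4
    }
}
```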
You could try the following (you have to say that there should be no / afterwards):
^(?=/{1,3}([^/]|$))

Regular expression help needed for simple (I think) scenario

Never was good at these things. We are using Checkstyle (for Java) to enforce good coding practices. One new check we would like to add is to identify when a certain API is used (potentially) incorrectly through the use of a regex check.
Incorrect usage examples:
BigDecimal.valueOf(someDouble).setScale(3);
someBigDecimalObject.setScale(6);
Correct usage examples:
BigDecimal.valueOf(someDouble).setScale(3, RoundingMode.HALF_UP);
someBigDecimalObject.setScale(1, RoundingMode.HALF_DOWN);
So, the regular expression I'm looking for is when ".setScale(" appears in the code that "RoundingMode." appears somewhere after it. Or to add more clarity, the regular expression should be true when ".setScale(" appears but "RoundingMode." doesn't.
Thanks in advance
Off the top of my head:
(?x: \.setScale \s* \( (?: (?! \bRoundingMode\b ) [^)] ) * \) )
What you want is to see .setScale(...) where ... doesn't contain RoundingMode. To explain further (assuming the Regex is being processed by Java):
(?x: - introduce "whitespace" mode so that you can pad the regex with spaces for readability
\.setScale \s* \( - look for .setScale with arbitrary spaces then (
Now the fun really begins: At every position until we encounter the trailing ) check whether the word RoundingMode appears:
(?: - start a group containing...
(?! \bRoundingMode\b ) - ...a negative look-ahead assertion for the word RoundingMode, that is, the regex will fail to match if "RoundingMode" is detected; and then
[^)] - match a non ) character
) * - repeat the above as often as required until
\) - the closing ")" is detected.
) - Finally, close the (?x: construct.
Of course, this won't work if there are nested (...) expressions in the method call but you can't solve that with a Java Pattern, you'll need a dodgy Perl regex instead
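To see the pattern at work outside Checkstyle, here is a small Java sketch run against the examples from the question. It uses the same regex written without the (?x:) spacing; the class SetScaleCheck and the method missingRoundingMode are hypothetical names.

```java
import java.util.regex.Pattern;

final class SetScaleCheck {
    // Same pattern as above, compacted: .setScale( ... ) with no "RoundingMode" inside.
    private static final Pattern MISSING_ROUNDING = Pattern.compile(
            "\\.setScale\\s*\\((?:(?!\\bRoundingMode\\b)[^)])*\\)");

    static boolean missingRoundingMode(String line) {
        return MISSING_ROUNDING.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(missingRoundingMode(
                "BigDecimal.valueOf(someDouble).setScale(3);"));                       // true
        System.out.println(missingRoundingMode(
                "someBigDecimalObject.setScale(6);"));                                 // true
        System.out.println(missingRoundingMode(
                "BigDecimal.valueOf(someDouble).setScale(3, RoundingMode.HALF_UP);")); // false
        System.out.println(missingRoundingMode(
                "someBigDecimalObject.setScale(1, RoundingMode.HALF_DOWN);"));         // false
    }
}
```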
You can use the regex:
\.setScale\(\d+,\s*RoundingMode
See it
How about:
\.setScale\(.+RoundingMode
