I am trying to use some regex to validate some input inside of Java code. I have been successful in implementing "basic" regex, but this one seems to be out of my scope of knowledge. I am working through RegEgg tutorials to learn more.
Here are the conditions that need to be validated:
Field will always have 8 characters
Can be all spaces
Or
Valid characters: a-zA-Z0-9 -!& or a space
Cannot begin with a space
If one of the special characters is used, it can be the only one used
Legal: "B-123---" "AB&& &" "A!!!!!!!"
Illegal: "B-123!!!" "AB&& -" "A-&! "
Has to have at least one alphanumeric character (Can't be all special characters ie: "!!!!!!!!"
This was my regex before additional validations were added:
^(\s{8}|[A-Za-z\-\!\&][ A-Za-z0-9\-\!\&]{7})$"
Then the additional validations for now allowing multiple of the special characters, and I am a bit stuck. I have been successful in using a positive lookahead, but stuck when trying to use the positive lookbehind. (I think the data before the lookbehind was consumed), but I am speculating as I am a neophyte with this part of regex.
using the or construct (a|b) is a large part of this, and you've begun applying it, so that's a good start.
You've made the rule that it can't start with a digit; nothing in the spec says this. also, - inside [] has special meaning, so escape it, or make sure it is first or last, because then you don't have to. That gets us to:
^(\s{8}|[A-Za-z0-9-!& -]{8})$
next up is the rule that it has to be all the same special character if used at all. Given that there are only 3 special characters, could be easier to just explicitly list them all:
^(\s{8}|[A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8})$
Next up: Can't start with a space, and can't be all-special. Confirming the negative (that it ISNT all-special characters) gets complicated; lookahead seems like a better plan here. This:
^ is regexp-ese for: "Start of line". Note that this doesn't 'consume' a character. 1 is regexpese for 'only the exact character '1' will match here, nothinge else', but as it matches, it also 'consumes' that character, whereas ^ doesn't do that. 'start of line' is not a concept that can be consumed.
This notion of 'a match may fail, but if it succeeds, nothing is consumed' isn't limited to ^ and $; you can write your own:
(?=abc) will match if abc would match at this position, but does not consume it. Thus, the regexp ^(=abc)ab.d$ would match the input string abcd and nothing else. This is called positive lookahead. (it 'looks ahead' and matches if it sees the regular expression in the parens, failing if it does not).
(?!abc) is negative lookahead. It matches if it DOESNT see the thing in the parens. (?!abc)a.c will match the input adc but not the input abc.
(?<=abc) is positive lookbehind. It matches if the pattern you provide would match such that the match ends at the position you find yourself.
(?<!abc) is negative lookbehind.
Note that lookahead and lookbehind can be somewhat limited, in that they may not allow variable length patterns. But, fortunately, your requirements make it easy to limit ourselves to fixed size patterns here. Thus, we can introduce: (?![&!-]{8}) as a non-consuming unit in our regexp that will fail the match if we have all-8 special characters.
We can use this trick to fail on starting space too: (?! ) is all we need for that one.
Let's replace \s which is whitespace with just which is the space character (the problem description says 'space', not 'whitespace').
Putting it all together:
^( {8}|(?! )(?![&!-]{8})([A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8}))$
Thats:
8 spaces, or...
not a space, and not all-8 special character, then,
any of the valid chars, any amount of spaces, and any amount of one of the 3 allowed special symbols, as long as we have precisely 8 of them...
.. OR the same thing as #3 but with the second of the three special symbols
.. OR with the third of the three.
Plug em in at regex101 along with your various examples of 'legal' and 'not legal' and you can play around with it some more.
NB: You can also use backreferences to attempt to solve the 'only one special character is allowed' part of this, but attempting to tackle the 'not all special characters' part seems quite unwieldy if you don't get to use (negative) lookahead.
Its a matter of asserting the right conditions at the start of the regex.
^(?=[ ]*$|(?![ ]))(?!.*([!&-]).*(?!\1)[!&-])[a-zA-Z0-9 !&-]{8}$
see -> https://regex101.com/r/tN5y4P/1
Some discussion:
^ # Begin of text
(?= # Assert, cannot start with a space
[ ]* $ # unless it's all spaces
| (?! [ ] )
)
(?! # Assert, not mixed special chars
.*
( [!&-] ) # (1)
.*
(?! \1 )
[!&-]
)
[a-zA-Z0-9 !&-]{8} # Consume 8 valid characters from within this class
$ # End of text
I am trying to figure out a regex to match a password that contains
one upper case letter.
one number
one special character.
and at least 4 characters of length
the regex that I wrote is
^((?=.*[0-9])(?=.*[A-Z])(?=.*[^A-Za-z0-9])){4,}
however it is not working, and I couldn't figure out why.
So please can someone tell me why this code is not working, where did I mess up, and how to correct this code.
Your regex can be rewritten as
^(
(?=.*[0-9])
(?=.*[A-Z])
(?=.*[^A-Za-z0-9])
){4,}
As you see {4,} applies to group which doesn't let you match any character since look-around is zero-width, which effectively means "4 or more of nothing".
You need to add . before {4,} to let your regex handle "and at least 4 characters of length" point (rest is handled by look-around).
You can remove that capturing group since you don't really need it.
So try with something like
^(?=.*[0-9])(?=.*[A-Z])(?=.*[^A-Za-z0-9]).{4,}
You could come up with sth. like:
^(?=.*[A-Z])(?=.*\d)(?=.*[!"§$%&/()=?`]).{4,}$
In multiline mode, see a demo on regex101.com.
This approach specifies the special characters directly (which could be extended, obviously).
From the following list only the bold ones would satisfy these criteria:
test
Test123!
StrongPassword34?
weakone
Tabaluga"12???
You can still enhance this expression by being more specific and requiring contrary pairs. Just to remind you, the dot-star (.*) brings you down the line and then backtracks eventually. This will almost always require more steps than to directly look for contrary pairs.
Consider the following expression:
^ # bind the expression to the beginning of the string
(?=[^A-Z\n\r]*[A-Z]) # look ahead for sth. that is not A-Z, or newline and require one of A-Z
(?=[^\d\n\r]*\d) # same construct for digits
(?=\w*[^\w\n\r]) # same construct for special chars (\w = _A-Za-z0-9)
.{4,}
$
You'll see a significant reduction in steps as the regex engine does not have to backtrack everytime.
If I have a set of characters like "abcdefghij" and using this characters, I generate random a password using this characters. A generated password can have, for example, 6 characters. How to validate a password using regex so that tho neighbor characters are not identical and a character does not repeat more that twice?
You could use something like:
/^
(?:(.)
(?!\1) # neighbor characters are not identical
(?!(?>.*?\1){2}) # character does not occur more than twice
)*
\z/x
Perl quoting, the atomic group can be removed if not supported.
In Java regex it could be written like:
^(?:(.)(?!\1|(?:.*?\1){2}))*\z
AFAIK, this cannot be done with a simple regexp (particularly, ensuring that a letter only appears twice at max. You could do a bunch of expressions like
[^a]*(a[^a]*(a[^a]*))
[^b]*(b[^b]*(b[^b]*))
....
and also (matching means the validation failed):
[^a]*aa[^a]*
[^b]*bb[^b]*
but obviously this is not good idea.
The condition that the characters do not repeat together maybe can treated with capturing groups, but I am almost sure the other one cannot be checked with regex.
BTW... why the obsession with regex? Programming these checks are trivial, regex is useful in a set of cases but not every check can be done with regex.
I'm confronted with a String:
[something] -number OR number [something]
I want to be able to cast the number. I do not know at which position is occures. I cannot build a sub-string because there's no obvious separator.
Is there any method how I could extract the number from the String by matching a pattern like
[-]?[0..9]+
, where the minus is optional? The String can contain special characters, which actually drives me crazy defining a regex.
-?\b\d+\b
That's broken down by:
-? (optional minus sign)
\b word boundary
\d+ 1 or more digits
[EDIT 2] - nod to Alan Moore
Unfortuantely Java doesn't have verbatim strings, so you'll have to escape the Regex above as:
String regex = "-?\\b\\d+\\b"
I'd also recommend a site like http://regexlib.com/RETester.aspx or a program like Expresso to help you test and design your regular expressions
[EDIT] - after some good comments
If haven't done something like *?(-?\d+).* (from #Voo) because I wasn't sure if you wanted to match the entire string, or just the digits. Both versions should tell you if there are digits in the string, and if you want the actual digits, use the first regex and look for group[0]. There are clever ways to name groups or multiple captures, but that would be a complicated answer to a straight forward question...
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I am having trouble coming up with a regular expression which would essentially black list certain special characters.
I need to use this to validate data in input fields (in a Java Web app). We want to allow users to enter any digit, letter (we need to include accented characters, ex. French or German) and some special characters such as '-. etc.
How do I blacklist characters such as <>%$ etc?
I would just white list the characters.
^[a-zA-Z0-9äöüÄÖÜ]*$
Building a black list is equally simple with regex but you might need to add much more characters - there are a lot of Chinese symbols in unicode ... ;)
^[^<>%$]*$
The expression [^(many characters here)] just matches any character that is not listed.
To exclude certain characters ( <, >, %, and $), you can make a regular expression like this:
[<>%\$]
This regular expression will match all inputs that have a blacklisted character in them. The brackets define a character class, and the \ is necessary before the dollar sign because dollar sign has a special meaning in regular expressions.
To add more characters to the black list, just insert them between the brackets; order does not matter.
According to some Java documentation for regular expressions, you could use the expression like this:
Pattern p = Pattern.compile("[<>%\$]");
Matcher m = p.matcher(unsafeInputString);
if (m.matches())
{
// Invalid input: reject it, or remove/change the offending characters.
}
else
{
// Valid input.
}
Even in 2009, it seems too many had a very limited idea of what designing for the WORLDWIDE web involved. In 2015, unless designing for a specific country, a blacklist is the only way to accommodate the vast number of characters that may be valid.
The characters to blacklist then need to be chosen according what is illegal for the purpose for which the data is required.
However, sometimes it pays to break down the requirements, and handle each separately. Here look-ahead is your friend. These are sections bounded by (?=) for positive, and (?!) for negative, and effectively become AND blocks, because when the block is processed, if not failed, the regex processor will begin at the start of the text with the next block. Effectively, each look-ahead block will be preceded by the ^, and if its pattern is greedy, include up to the $. Even the ancient VB6/VBA (Office) 5.5 regex engine supports look-ahead.
So, to build up a full regular expression, start with the look-ahead blocks, then add the blacklisted character block before the final $.
For example, to limit the total numbers of characters, say between 3 and 15 inclusive, start with the positive look-ahead block (?=^.{3,15}$). Note that this needed its own ^ and $ to ensure that it covered all the text.
Now, while you might want to allow _ and -, you may not want to start or end with them, so add the two negative look-ahead blocks, (?!^[_-].+) for starts, and (?!.+[_-]$) for ends.
If you don't want multiple _ and -, add a negative look-ahead block of (?!.*[_-]{2,}). This will also exclude _- and -_ sequences.
If there are no more look-ahead blocks, then add the blacklist block before the $, such as [^<>[\]{\}|\\\/^~%# :;,$%?\0-\cZ]+, where the \0-\cZ excludes null and control characters, including NL (\n) and CR (\r). The final + ensures that all the text is greedily included.
Within the Unicode domain, there may well be other code-points or blocks that need to be excluded as well, but certainly a lot less than all the blocks that would have to be included in a whitelist.
The whole regex of all of the above would then be
(?=^.{3,15}$)(?!^[_-].+)(?!.+[_-]$)(?!.*[_-]{2,})[^<>[\]{}|\\\/^~%# :;,$%?\0-\cZ]+$
which you can check out live on https://regex101.com/, for pcre (php), javascript and python regex engines. I don't know where the java regex fits in those, but you may need to modify the regex to cater for its idiosyncrasies.
If you want to include spaces, but not _, just swap them every where in the regex.
The most useful application for this technique is for the pattern attribute for HTML input fields, where a single expression is required, returning a false for failure, thus making the field invalid, allowing input:invalid css to highlight it, and stopping the form being submitted.
I guess it depends what language you are targeting. In general, something like this should work:
[^<>%$]
The "[]" construct defines a character class, which will match any of the listed characters. Putting "^" as the first character negates the match, ie: any character OTHER than one of those listed.
You may need to escape some of the characters within the "[]", depending on what language/regex engine you are using.
The negated set of everything that is not alphanumeric & underscore for ASCII chars:
/[^\W]/g
For email or username validation i've used the following expression that allows 4 standard special characters - _ . #
/^[-.#_a-z0-9]+$/gi
For a strict alphanumeric only expression use:
/^[a-z0-9]+$/gi
Test # RegExr.com
Its usually better to whitelist characters you allow, rather than to blacklist characters you don't allow. both from a security standpoint, and from an ease of implementation standpoint.
If you do go down the blacklist route, here is an example, but be warned, the syntax is not simple.
http://groups.google.com/group/regex/browse_thread/thread/0795c1b958561a07
If you want to whitelist all the accent characters, perhaps using unicode ranges would help? Check out this link.
http://www.regular-expressions.info/unicode.html
Do you really want to blacklist specific characters or rather whitelist the allowed charachters?
I assume that you actually want the latter. This is pretty simple (add any additional symbols to whitelist into the [\-] group):
^(?:\p{L}\p{M}*|[\-])*$
Edit: Optimized the pattern with the input from the comments
Why do you consider regex the best tool for this? If your purpose is to detect whether an illegal character is present in a string, testing each character in a loop will be both simpler and more efficient than constructing a regex.
Here's all the french accented characters:
àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ’ñ
I would google a list of German accented characters. There aren't THAT many. You should be able to get them all.
For URLS I Replace accented URLs with regular letters like so:
string beforeConversion = "àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ’ñ";
string afterConversion = "aAaAaAaAeEeEeEeEiIiIiIoOoOoOuUuUuUcC'n";
for (int i = 0; i < beforeConversion.Length; i++) {
cleaned = Regex.Replace(cleaned, beforeConversion[i].ToString(), afterConversion[i].ToString());
}
There's probably a more efficient way, mind you.
I strongly suspect it's going to be easier to come up with a list of the characters that ARE allowed vs. the ones that aren't -- and once you have that list, the regex syntax becomes quite straightforward. So put me down as another vote for "whitelist".
Use This one
^(?=[a-zA-Z0-9~##$^*()_+=[\]{}|\\,.?: -]*$)(?!.*[<>'"/;`%])