Regular expression for excluding special characters [closed] - java

I am having trouble coming up with a regular expression which would essentially black list certain special characters.
I need to use this to validate data in input fields (in a Java Web app). We want to allow users to enter any digit, letter (we need to include accented characters, ex. French or German) and some special characters such as '-. etc.
How do I blacklist characters such as <>%$ etc?

I would just white list the characters.
^[a-zA-Z0-9äöüÄÖÜ]*$
Building a black list is equally simple with regex but you might need to add many more characters - there are a lot of Chinese symbols in Unicode ... ;)
^[^<>%$]*$
The expression [^(many characters here)] just matches any character that is not listed.

To exclude certain characters ( <, >, %, and $), you can make a regular expression like this:
[<>%\$]
This regular expression will find a match in any input that contains a blacklisted character. The brackets define a character class. The \ before the dollar sign is optional here: the dollar sign has a special meaning elsewhere in regular expressions, but inside a character class it is a literal character.
To add more characters to the black list, just insert them between the brackets; order does not matter.
According to some Java documentation for regular expressions, you could use the expression like this:
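Pattern p = Pattern.compile("[<>%$]"); // "\$" is not a valid Java string escape, and $ is literal inside [...]
Matcher m = p.matcher(unsafeInputString);
if (m.find()) // find() looks for a match anywhere; matches() would require the whole input to match
{
    // Invalid input: reject it, or remove/change the offending characters.
}
else
{
    // Valid input.
}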
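A self-contained sketch of the same blacklist check, for anyone who wants to run it. The class and method names are made up for illustration. It uses find() rather than matches(), because matches() only succeeds when the entire input matches the pattern.

```java
import java.util.regex.Pattern;

class BlacklistCheck {
    // Characters to reject; inside [...] the dollar sign needs no escaping.
    private static final Pattern FORBIDDEN = Pattern.compile("[<>%$]");

    // Returns true if the input contains at least one blacklisted character.
    static boolean containsForbidden(String input) {
        // find() scans for a match anywhere in the input.
        return FORBIDDEN.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsForbidden("Jean-Luc O'Neil")); // false
        System.out.println(containsForbidden("50% off"));         // true
    }
}
```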

Even in 2009, it seems too many had a very limited idea of what designing for the WORLDWIDE web involved. In 2015, unless designing for a specific country, a blacklist is the only way to accommodate the vast number of characters that may be valid.
The characters to blacklist then need to be chosen according to what is illegal for the purpose for which the data is required.
However, sometimes it pays to break down the requirements and handle each separately. Here look-ahead is your friend. Look-ahead blocks are written (?=...) for positive and (?!...) for negative, and they effectively act as AND conditions: a look-ahead consumes no characters, so if it does not fail, the regex engine evaluates the next block from the same position, here the start of the text. In effect, each look-ahead block behaves as if preceded by ^ and, if its pattern is greedy, can extend up to the $. Even the ancient VB6/VBA (Office) 5.5 regex engine supports look-ahead.
So, to build up a full regular expression, start with the look-ahead blocks, then add the blacklisted character block before the final $.
For example, to limit the total numbers of characters, say between 3 and 15 inclusive, start with the positive look-ahead block (?=^.{3,15}$). Note that this needed its own ^ and $ to ensure that it covered all the text.
Now, while you might want to allow _ and -, you may not want to start or end with them, so add the two negative look-ahead blocks, (?!^[_-].+) for starts, and (?!.+[_-]$) for ends.
If you don't want multiple _ and -, add a negative look-ahead block of (?!.*[_-]{2,}). This will also exclude _- and -_ sequences.
If there are no more look-ahead blocks, then add the blacklist block before the $, such as [^<>[\]{\}|\\\/^~%# :;,$%?\0-\cZ]+, where the \0-\cZ excludes null and control characters, including NL (\n) and CR (\r). The final + ensures that all the text is greedily included.
Within the Unicode domain, there may well be other code-points or blocks that need to be excluded as well, but certainly a lot less than all the blocks that would have to be included in a whitelist.
The whole regex of all of the above would then be
(?=^.{3,15}$)(?!^[_-].+)(?!.+[_-]$)(?!.*[_-]{2,})[^<>[\]{}|\\\/^~%# :;,$%?\0-\cZ]+$
which you can check out live on https://regex101.com/, for pcre (php), javascript and python regex engines. I don't know where the java regex fits in those, but you may need to modify the regex to cater for its idiosyncrasies.
If you want to include spaces, but not _, just swap them everywhere in the regex.
The most useful application for this technique is for the pattern attribute for HTML input fields, where a single expression is required, returning a false for failure, thus making the field invalid, allowing input:invalid css to highlight it, and stopping the form being submitted.
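Since the question is about Java, here is a rough adaptation of the combined regex for java.util.regex. This is a sketch under assumptions, not the answer's exact expression: as far as I can tell, Java treats an unescaped [ inside a character class as the start of a nested class and expects octal digits after \0, so the [ is escaped and \0-\cZ is written as \x00-\x1A. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class LookaheadBlacklist {
    // Look-ahead blocks are copied from the answer; only the final
    // character class is adjusted for Java (an assumption, see lead-in).
    private static final Pattern NAME = Pattern.compile(
        "(?=^.{3,15}$)"        // total length between 3 and 15
        + "(?!^[_-].+)"        // must not start with _ or -
        + "(?!.+[_-]$)"        // must not end with _ or -
        + "(?!.*[_-]{2,})"     // no consecutive _ or - characters
        + "[^<>\\[\\]{}|\\\\/^~%# :;,$?\\x00-\\x1A]+$"); // blacklisted characters

    static boolean isValid(String input) {
        return NAME.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("user_name")); // true
        System.out.println(isValid("_user"));     // false
        System.out.println(isValid("user<name")); // false
    }
}
```

Because the final block is a blacklist, letters outside ASCII are accepted without having to be listed.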

I guess it depends what language you are targeting. In general, something like this should work:
[^<>%$]
The "[]" construct defines a character class, which will match any of the listed characters. Putting "^" as the first character negates the match, ie: any character OTHER than one of those listed.
You may need to escape some of the characters within the "[]", depending on what language/regex engine you are using.

The negated set of everything that is not alphanumeric or underscore, for ASCII chars (a double negative, so it is equivalent to \w and matches letters, digits and underscore):
/[^\W]/g
For email or username validation I've used the following expression that allows 4 standard special characters - _ . @
/^[-.@_a-z0-9]+$/gi
For a strict alphanumeric only expression use:
/^[a-z0-9]+$/gi
Test @ RegExr.com

It's usually better to whitelist characters you allow, rather than to blacklist characters you don't allow, both from a security standpoint and from an ease of implementation standpoint.
If you do go down the blacklist route, here is an example, but be warned, the syntax is not simple.
http://groups.google.com/group/regex/browse_thread/thread/0795c1b958561a07
If you want to whitelist all the accent characters, perhaps using unicode ranges would help? Check out this link.
http://www.regular-expressions.info/unicode.html

Do you really want to blacklist specific characters or rather whitelist the allowed characters?
I assume that you actually want the latter. This is pretty simple (add any additional symbols to whitelist into the [\-] group):
^(?:\p{L}\p{M}*|[\-])*$
Edit: Optimized the pattern with the input from the comments
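A minimal Java sketch of this whitelist, assuming the input is checked with matches(). The class and method names are invented for the example. \p{L} matches any Unicode letter and \p{M} any combining mark, so accented letters pass whether they are precomposed or written as a base letter plus a combining accent.

```java
import java.util.regex.Pattern;

class LetterWhitelist {
    // Any letter (optionally followed by combining marks) or a hyphen.
    // Add further allowed symbols inside the [\-] class.
    private static final Pattern ALLOWED =
        Pattern.compile("^(?:\\p{L}\\p{M}*|[\\-])*$");

    static boolean isAllowed(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("Jean-Luc"));      // true
        System.out.println(isAllowed("M\u00fcller"));   // true
        System.out.println(isAllowed("a<b"));           // false
    }
}
```

Note that the trailing * also accepts the empty string; use + instead if empty input should be rejected.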

Why do you consider regex the best tool for this? If your purpose is to detect whether an illegal character is present in a string, testing each character in a loop will be both simpler and more efficient than constructing a regex.
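For comparison, the loop approach could look roughly like this in Java. The forbidden set and the names are placeholders chosen for the example, not part of the original question.

```java
class CharLoopCheck {
    // Hypothetical blacklist; extend as required.
    private static final String FORBIDDEN = "<>%$";

    // Returns true as soon as one forbidden character is found.
    static boolean containsForbidden(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (FORBIDDEN.indexOf(input.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsForbidden("100$")); // true
        System.out.println(containsForbidden("abc"));  // false
    }
}
```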

Here are all the French accented characters:
àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ’ñ
I would google a list of German accented characters. There aren't THAT many. You should be able to get them all.
For URLs, I replace accented characters with regular letters like so (C#; cleaned is assumed to already hold the input string):
string beforeConversion = "àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ’ñ";
string afterConversion = "aAaAaAaAeEeEeEeEiIiIiIoOoOoOuUuUuUcC'n";
for (int i = 0; i < beforeConversion.Length; i++) {
    // Replace each accented character with the letter at the same index.
    cleaned = Regex.Replace(cleaned, beforeConversion[i].ToString(), afterConversion[i].ToString());
}
There's probably a more efficient way, mind you.
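One possibly more efficient route in Java, offered as a sketch rather than a drop-in replacement for the C# above: decompose the text with java.text.Normalizer and strip the combining marks. The class and method names are made up. The accented letters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.text.Normalizer;

class AccentStripper {
    // Decomposes accented letters into base letter + combining mark (NFD),
    // then removes the combining marks.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // a-grave, E-acute, c-cedilla, n-tilde
        System.out.println(stripAccents("\u00e0\u00c9\u00e7\u00f1")); // prints aEcn
    }
}
```

Characters that have no decomposition, such as the typographic apostrophe in the list above, pass through unchanged and would still need an explicit replacement.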

I strongly suspect it's going to be easier to come up with a list of the characters that ARE allowed vs. the ones that aren't -- and once you have that list, the regex syntax becomes quite straightforward. So put me down as another vote for "whitelist".

Use This one
^(?=[a-zA-Z0-9~##$^*()_+=[\]{}|\\,.?: -]*$)(?!.*[<>'"/;`%])

Related

Regex: Allow minimum alphanumeric, dot and - characters. Asterisk allowed anywhere?

My requirement is that:
1. String should contain minimum 4 characters (only alphanumeric, dot and hyphen allowed).
2. Apart from this asterisk is allowed anywhere (start, in-between or end)
3. It should not contain any other characters than mentioned in point 1 and 2 above.
e.g. following are valid strings:
Ab*08
*.6-*N*
following are invalid strings:
****AB-*
GH.*
My regex looks like:
^(.*?[a-zA-Z0-9.\-]){4,}.*$
My basic validations as mentioned in point 1 and 2 are working. But regex allows other special characters like <, >, & etc. How can I modify my regex to achieve this?
You can use
^(?:[*]*[a-zA-Z0-9.-]){4}[*a-zA-Z0-9.-]*$
It checks for 4 valid characters (that might be surrounded by *) and checks that your whole string consists only of your required characters.
Obligatory regex 101
Note: regex101 doesn't fully support java regex syntax. The pattern shown is a PCRE pattern, but all features used are also available in java regex.
Note2: if you use .matches to check your input, you can omit the anchors, as it is already anchored.
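A quick way to try this pattern in Java against the sample strings from the question. This is only a sketch; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class AsteriskPattern {
    // At least 4 alphanumeric/dot/hyphen characters, with * allowed anywhere.
    private static final Pattern VALID =
        Pattern.compile("^(?:[*]*[a-zA-Z0-9.-]){4}[*a-zA-Z0-9.-]*$");

    static boolean isValid(String input) {
        return VALID.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Ab*08"));    // true
        System.out.println(isValid("*.6-*N*"));  // true
        System.out.println(isValid("****AB-*")); // false, only 3 valid characters
        System.out.println(isValid("GH.*"));     // false, only 3 valid characters
    }
}
```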
^(?:\**[\w\d.-]\**){4,}$
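This should do it. A little bit simpler. Note that \w also matches the underscore (and already includes the digits matched by \d), so this version accepts _ as one of the valid characters.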

Why do I get a parsing exception for this regex?

I have the following patterns in a web service constructor:
rUsername = Pattern.compile("[A-z0-9]{4,30}");
rPassword = Pattern.compile("[A-z0-9!##$%&*()=+-\\\\s]{6,30}");
rQuestion = Pattern.compile("[A-z0-9\\\\s?]{5,140}");
rAnswer = Pattern.compile("[A-z0-9\\\\s]{1,30}");
If I only have 2 slashes instead of the 4 there when I deploy my web application I get a parsing exception from Tomcat.
The username one works fine, but I seem to be having issues with the password, question and answer. The password will match "testasdas" but not "test1234", the question will not match anything with a space in it and the answer doesn't seem to match anything.
I want the password to be able to match lowercase and uppercase letters, numbers, spaces and the symbols I threw in there. The question one should be able to match lowercase and uppercase, numbers, spaces and '?', and the answer just uppercase and lowercase letters, spaces and numbers.
EDIT: The patterns have changed to these:
rPassword = Pattern.compile("[A-Za-z0-9!##$%&*()=+\\s-]{6,30}");
rQuestion = Pattern.compile("[A-Za-z0-9\\s?]{5,140}");
rAnswer = Pattern.compile("[A-z0-9\\s]{1,30}");
These are more or less how I want, but as pointed out in an answer I'm being quite restrictive on my password field which probably isn't a good idea. I don't hash anything before I save it because this is a college project nobody will ever use and I know that is a bad idea in the real world but it was not part of the requirements for the project. I do however have to stop SQL injection attacks, which is why everything is so restrictive. The idea was to mainly disallow the use of ' which every SQL attack I know of needs to work, but I don't know how to disallow only that character alone.
Let's look at your second regex:
[A-z0-9!##$%&*()=+-\\\\s]
There are several errors here.
[A-z] is incorrect; you need [A-Za-z] because there are some ASCII characters between Z and a that you probably don't want to match. But that's not the cause of your error.
More problematic is this section:
+-\\\\s
Translated from a Java string into an actual regex, this becomes
+-\\s
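and that means (inside a character class) "Match any character between + and \, or the letter s": the \\ is an escaped backslash, so the s that follows is a literal s, not the whitespace shorthand \s. [+-\\] is a valid range (ASCII 43-92), but it's not what you want.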
But if you now remove the two extra backslashes, your character class becomes
+-\s
and that is a Syntax Error, because there is no ASCII range between + and "any whitespace".
Solution: Use
[A-Za-z0-9!##$%&*()=+\\s-]
or refrain from imposing limits on what characters your users may choose in a password in the first place.
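To see the difference in practice, here is a small sketch. It assumes the behaviour described above, namely that java.util.regex rejects a range whose end is \s. The class and helper names are hypothetical; the character class is the corrected one from this answer, with the hyphen moved to the end so it is a literal.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PasswordClassDemo {
    // Corrected class: the hyphen comes last, so it cannot form a range.
    private static final Pattern PASSWORD =
        Pattern.compile("[A-Za-z0-9!##$%&*()=+\\s-]{6,30}");

    // Returns false if the regex cannot be compiled.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    static boolean isValidPassword(String input) {
        return PASSWORD.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(compiles("[+-\\s]"));          // false: range ending in \s
        System.out.println(compiles("[+\\s-]"));          // true
        System.out.println(isValidPassword("test1234"));  // true
        System.out.println(isValidPassword("it's bad"));  // false: apostrophe not allowed
    }
}
```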
To match uppercase and lowercase letters, the pattern you need is [a-zA-Z].
For your answer field, try changing your pattern to:
[a-zA-Z0-9\\s]{1,30}
For your question field, if it should not accept whitespace, you can use \\S (any non-whitespace character):
[a-zA-Z0-9\\S]{1,250}
I think it should work. You can make the corresponding changes to the remaining two.
EDIT: You can also use \\p{Z} to match Unicode separator characters such as the space; note that, unlike \\s, it does not match tab or newline.
See this link for a good tutorial on using Java regular expressions.

regex repeated character count

If I have a set of characters like "abcdefghij" and I generate a random password from these characters, the generated password can have, for example, 6 characters. How can I validate a password using regex so that two neighboring characters are not identical and no character occurs more than twice?
You could use something like:
/^
(?:(.)
(?!\1) # neighbor characters are not identical
(?!(?>.*?\1){2}) # character does not occur more than twice
)*
\z/x
Perl quoting, the atomic group can be removed if not supported.
In Java regex it could be written like:
^(?:(.)(?!\1|(?:.*?\1){2}))*\z
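A runnable sketch of the Java version, with the backslashes doubled for a string literal. The class and method names are invented for illustration.

```java
import java.util.regex.Pattern;

class RepeatCheck {
    // For each character: the next character must differ (\1), and the
    // same character must not occur two more times later in the string.
    private static final Pattern NO_EXCESS_REPEATS =
        Pattern.compile("^(?:(.)(?!\\1|(?:.*?\\1){2}))*\\z");

    static boolean isValid(String password) {
        return NO_EXCESS_REPEATS.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("abcabc")); // true: each letter occurs twice
        System.out.println(isValid("aabcde")); // false: identical neighbors
        System.out.println(isValid("abacad")); // false: 'a' occurs three times
    }
}
```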
AFAIK, this cannot be done with a simple regexp (particularly, ensuring that a letter only appears twice at most). You could do a bunch of expressions like
[^a]*(a[^a]*(a[^a]*)?)?
[^b]*(b[^b]*(b[^b]*)?)?
....
and also (matching means the validation failed):
[^a]*aa[^a]*
[^b]*bb[^b]*
but obviously this is not a good idea.
The condition that the characters do not repeat together can perhaps be handled with capturing groups, but I am almost sure the other one cannot be checked with regex.
BTW... why the obsession with regex? Programming these checks is trivial; regex is useful in a set of cases, but not every check can be done with regex.

What is a regular expression for control characters?

I'm trying to match a control character in the form \^c where c is any valid character for control characters. I have this regular expression, but it's not currently working: \\[^][#-z]
I think the problem lies with the fact that the caret character (^) is part of the regular expressions parsing engine.
Match an ASCII text string of the form ^X using the pattern \^., nothing more. Match an ASCII text string of the form \^X with the pattern \\\^.. You may wish to constrain that dot to [?@_\[\]^\\], so \\\^[A-Z?@_\[\]^\\]. It’s easier to read as [?\x40-\x5F] for the bracketed character class, hence \\\^[?\x40-\x5F] for a literal BACKSLASH, followed by a literal CIRCUMFLEX, followed by something that turns into one of the valid control characters.
Note that that is the result of printing out the pattern, or what you’d read from a file. It’s what you need to pass to the regex compiler. If you have it as a string literal, you must of course double each of those backslashes: "\\\\\\^[?\\x40-\\x5F]". Yes, it is insane looking, but that is because Java does not support regexes directly as Groovy and Scala (or Perl and Ruby) do. Regex work is always easier without the extra bbaacckksslllllaasshheesssssess. :)
If you had real control characters instead of indirect representations of them, you would use \pC for all literal code points with the property GC=Other, or \p{Cc} for just GC=Control.
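As a sketch of how the doubled-backslash literal behaves in Java (the class and method names are hypothetical), the following checks whether a string is exactly a backslash, a circumflex, and one character from the range \x40-\x5F or a question mark:

```java
import java.util.regex.Pattern;

class ControlNotation {
    // Regex \\\^[?\x40-\x5F] written as a Java string literal.
    private static final Pattern CONTROL =
        Pattern.compile("\\\\\\^[?\\x40-\\x5F]");

    static boolean isControlNotation(String input) {
        return CONTROL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isControlNotation("\\^C")); // true: backslash, ^, C
        System.out.println(isControlNotation("\\^c")); // false: lowercase c is outside the range
        System.out.println(isControlNotation("^C"));   // false: no leading backslash
    }
}
```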
Check this out: http://www.regular-expressions.info/characters.html . You should be able to use \cA to \cZ to find the control characters..

Regular expression to allow a set of characters and disallow others

I want to restrict the users from entering the below special characters in a field:
œçşÇŞ
ğĞščřŠŘŇĚŽĎŤČňěž
ůŮ
İťı
—¿„”*#
Newline
Carriage return
A few more will be added to this list but I will have the complete restricted list eventually.
But the user can enter certain foreign characters like äöüÄÖÜÿï etc. in addition to alphanumeric chars, usual special chars etc.
Is there an easy way to build a regex for doing this? Adding so many chars to the not-allowed list like
[^œçşÇŞ ğĞščřŠŘŇĚŽĎŤČňěž ůŮ İ ť ı — ¿ „ ” * #]+
does not seem to work.
And I do not have the complete list of allowed characters. It would be too long even if I try to get it and would include all chars like:
~`!#$%^&()[]{};':",.
along with certain foreign chars.
You do not mention what "flavor" of regex you are using. Does the following work?
\A[^œçşÇŞ ğĞščřŠŘŇĚŽĎŤČňěž ůŮ İ ť ı — ¿ „ ” * #]+\z
A regular expression can be built to match the incorrect characters, e.g.:
[œçşÇŞ ğĞščřŠŘŇĚŽĎŤČňěž ůŮ İ ť ı]
(I didn't include all the characters; you get the idea!).
If any character matches, it's a fail.
Or, if you need a regular expression that matches valid input, simply add a caret to the front of the brackets like so:
[^œçşÇŞ ğĞščřŠŘŇĚŽĎŤČňěž ůŮ İ ť ı]*
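In Java, the "any match is a fail" variant might be sketched as follows. Only a small subset of the restricted list is included, written as Unicode escapes so the example does not depend on the source file encoding; the class and method names are assumptions made for the example. Newline and carriage return are added with \n and \r, and no spaces are placed inside the brackets, since a space there would make the space character itself restricted.

```java
import java.util.regex.Pattern;

class RestrictedChars {
    // Subset of the restricted list: U+0153, U+00E7, U+015F, U+011F,
    // U+0161, U+2014, U+00BF, plus * and #, newline and carriage return.
    private static final Pattern RESTRICTED = Pattern.compile(
        "[\u0153\u00e7\u015f\u011f\u0161\u2014\u00bf*#\\n\\r]");

    // Returns true if any restricted character occurs in the input.
    static boolean containsRestricted(String input) {
        return RESTRICTED.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsRestricted("gar\u00e7on"));  // true: c-cedilla is restricted
        System.out.println(containsRestricted("M\u00fcller"));  // false: u-umlaut is allowed
    }
}
```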
You COULD use a regular expression for this, but why not just check if any of the disallowed characters are in your string with a builtin method? For example, in the .NET world you could use .Contains().
Personally, I would create a list of allowed characters, then just check that your string doesn't have any characters that aren't in your list. Using a whitelist will ensure that you haven't forgotten any "bad" characters as well.
A few more will be added to this list but I will have the complete restricted list eventually.
And I do not have the complete list of allowed characters (It would be too long even if I try to get it and would include all chars like ~`!#$%^&()[]{};':",.<> along with certain foreign chars)
You will eventually have the list of disallowed characters and probably not the list of allowed characters? You must have either the list of all allowed characters or the list of all disallowed characters. Else you cannot tell if the input is legal. Furthermore, if you have one of the lists, you have the second implicitly if the character set is known. Then just implement the shorter one.
Just guessing, but if you use Unicode, there will probably be many more characters you want to disallow than to allow - think of all the fancy Chinese and Japanese symbols. So I think you should really build a list of allowed characters and use ranges like a-z where possible.
If you really want to build the list of disallowed characters, you will have to build a regular expression like [^œçşÇŞ ğĞščřŠŘŇĚŽĎŤČňěž ůŮ İ ť ı — ¿ „ ” * #]*. Do not forget to escape the characters if required and use ranges if possible.
Adding so many chars in the not allowed list like [^œçşÇŞ ğĞščřŠŘŇĚŽĎŤČňěž ůŮ İ ť ı — ¿ „ ” *#]+ does not seem to work.
There are spaces in your list. Are they in your code, too? If so, the space character itself becomes disallowed, and I am not sure, but this might be the problem.
It would be best to try and match any character that is not allowed by negating the allowed set. For example, if you only wanted to allow 'a' through 'z', you might do the following.
[^a-z]
You cannot possibly know all of the characters that are not allowed, but you presumably know the ones that are allowed. So, build a regular expression like the one above that matches only one character that is not in the allowed set. If you get a match, you'll know that the string contains an invalid character.
If you can, try to use built-in character class escape codes if they're available.
Find them for Perl RE here, look for "Character Classes and other Special Escapes". It may allow you to have a shorter expression like this one.
[^\w\d ..other individual chars.. ]
