Why do I get a parsing exception for this regex? - java

I have the following patterns in a web service constructor:
rUsername = Pattern.compile("[A-z0-9]{4,30}");
rPassword = Pattern.compile("[A-z0-9!##$%&*()=+-\\\\s]{6,30}");
rQuestion = Pattern.compile("[A-z0-9\\\\s?]{5,140}");
rAnswer = Pattern.compile("[A-z0-9\\\\s]{1,30}");
If I use only 2 backslashes instead of the 4 shown here, I get a parsing exception from Tomcat when I deploy my web application.
The username one works fine, but I seem to be having issues with the password, question and answer. The password will match "testasdas" but not "test1234", the question will not match anything with a space in it and the answer doesn't seem to match anything.
I want the password to be able to match lowercase and uppercase letters, numbers, spaces and the symbols I threw in there. The question one should be able to match lowercase and uppercase, numbers, spaces and '?', and the answer just uppercase and lowercase letters, spaces and numbers.
EDIT: The patterns have changed to these:
rPassword = Pattern.compile("[A-Za-z0-9!##$%&*()=+\\s-]{6,30}");
rQuestion = Pattern.compile("[A-Za-z0-9\\s?]{5,140}");
rAnswer = Pattern.compile("[A-z0-9\\s]{1,30}");
These are more or less how I want, but as pointed out in an answer I'm being quite restrictive on my password field which probably isn't a good idea. I don't hash anything before I save it because this is a college project nobody will ever use and I know that is a bad idea in the real world but it was not part of the requirements for the project. I do however have to stop SQL injection attacks, which is why everything is so restrictive. The idea was to mainly disallow the use of ' which every SQL attack I know of needs to work, but I don't know how to disallow only that character alone.

Let's look at your second regex:
[A-z0-9!##$%&*()=+-\\\\s]
There are several errors here.
[A-z] is incorrect; you need [A-Za-z], because there are some ASCII characters between Z and a ([, \, ], ^, _ and `) that you probably don't want to match. But that is not the cause of your error.
More problematic is this section:
+-\\\\s
Translated from a Java string into an actual regex, this becomes
+-\\s
and that means (inside a character class) "Match any character between + and \, or any whitespace". [+-\\] is a valid range (ASCII 43-92), but it's not what you want.
But if you now remove the two extra backslashes, your character class becomes
+-\s
and that is a Syntax Error, because there is no ASCII range between + and "any whitespace".
Solution: Use
[A-Za-z0-9!##$%&*()=+\\s-]
or refrain from imposing limits on what characters your users may choose in a password in the first place.
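The difference between the three spellings can be checked directly in Java. This is a minimal sketch: the class name CharClassRangeDemo and its helper methods are made up for illustration, and only Pattern.compile and PatternSyntaxException come from the JDK.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class CharClassRangeDemo {
    // Corrected password pattern from the answer: the hyphen comes last, so it is a literal.
    static final Pattern PASSWORD =
            Pattern.compile("[A-Za-z0-9!##$%&*()=+\\s-]{6,30}");

    // Returns true if the regex compiles, false if Pattern.compile rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Convenience wrapper: does the whole input match the regex?
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Four backslashes in Java source: the range from '+' to '\' plus a literal 's'.
        System.out.println(compiles("[+-\\\\s]"));
        // Two backslashes in Java source: a range cannot end in the class \s.
        System.out.println(compiles("[+-\\s]"));
        System.out.println(PASSWORD.matcher("test 1234-ab").matches());
    }
}
```

Note that the four-backslash version compiles but does not treat \s as whitespace at all, which is why inputs containing a space were rejected.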

To match both uppercase and lowercase letters, use the character class [a-zA-Z].
Try changing the pattern for your answer to:
[a-zA-Z0-9\\s]{1,30}
If the question should not contain whitespace, you can use \\S, which matches any non-whitespace character (and so already covers letters and digits):
[a-zA-Z0-9\\S]{1,250}
I think that should work. You can make the corresponding changes to the remaining two patterns.
EDIT: You can also use \\p{Z} to match Unicode separator characters such as the space. Unlike \\s, it does not match tabs, line feeds or carriage returns.
See this link for a good tutorial on using Java Regular Expression..

Related

I'm trying to create a regex for a strong password

I'm having some trouble creating a regex for a password. It has the following requirements:
at least 10 characters
at least 1 uppercase letter
at least 1 lowercase letter
at least 1 special character
I currently have this line:
"^(?=(.*[A-Z])+)(?=(.*[a-z])+)(?=(.*[0-9])+)(?=(.*[!##$%^&*()_+=.])+){10,}$"
For a password like Lollypop56# it still returns false.
You forgot a dot (.) after the lookahead for special characters, so {10,} applies to the lookahead instead of to the password characters. It should be:
"^(?=(.*[A-Z])+)(?=(.*[a-z])+)(?=(.*[0-9])+)(?=(.*[!##$%^&*()_+=.])+).{10,}$"
I recommend using Passay (https://www.passay.org/). This library lets you validate a wide range of password policies.
I'd remove the + in the groups inside the lookaheads since those aren't needed and also make those groups non-capturing. I'd also not explicitly specify the characters in the "special" group. Just make it match any of the characters not in the first three groups, [^A-Za-z0-9]. That'll allow ~ as a special character too etc.
Also, the actual matching should be .{10,}, not {10,}.
"^(?=(?:.*[A-Z]))(?=(?:.*[a-z]))(?=(?:.*[0-9]))(?=(?:.*[^A-Za-z0-9])).{10,}$"

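The lookahead-based pattern from the answers above can be exercised in Java as follows. This is a sketch under assumptions: the class and method names are illustrative, and the redundant non-capturing groups have been dropped, which does not change what the pattern accepts.

```java
import java.util.regex.Pattern;

final class StrongPasswordCheck {
    // Four zero-width lookaheads assert the required character kinds;
    // .{10,} then consumes the password and enforces the minimum length.
    private static final Pattern STRONG = Pattern.compile(
            "^(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(?=.*[^A-Za-z0-9]).{10,}$");

    static boolean isStrong(String password) {
        return STRONG.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isStrong("Lollypop56#")); // all four kinds, 11 characters
        System.out.println(isStrong("lollypop56#")); // no uppercase letter
    }
}
```
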
Java regex not matching German "Umlaut" OR underscore

I'm trying to "play around" with some REST APIs and Java code.
As I mainly use German, I already managed to get the Apache HTTP Client to work with UTF-8 encoding to make sure "Umlaute" are handled the right way.
Still I can't get my regex to match my words correctly.
I try to find words/word combinations like "Büro_Licht" from string like ..."type":"Büro_Licht"....
Using the regex ".*?type\":\"(\\w+).*?" returns "B" for me, as it doesn't recognize the "ü" as a word character. That makes sense, as \w is defined as [a-zA-Z_0-9] by default. For strings with no special characters, such as "Office_Light", I get the full value.
So I tried another hint mentioned in a nearly identical question (on which I could not comment, because I lack the reputation points).
Using the regex ".*?type\":\"(\\p{L}+).*?" returns "Büro" for me. But here it cuts off at the underscore, for a reason I don't understand.
Is there a nice way to combine both expressions to get the "full" word including underscores and special characters?
If you have to keep using regex, which is not a great tool for parsing JSON, try the character class [\p{L}_]. In your case it would be:
String regex = ".*?type\":\"[\\p{L}_]+\"";
With on-line example: https://regex101.com/r/57oFD5/2
\p{L} matches any kind of letter from any language
_ matches the character _ literally (case sensitive)
This will get hectic if you need to support other languages, whitespace and various other Unicode code points. For example, do you need to support an arbitrary amount of whitespace around the :? Take a look at this answer on removing emojis; there are many corner cases.
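To actually pull the value out, the character class needs a capturing group. A minimal sketch follows, assuming a hypothetical helper class TypeValueExtractor; it uses find() so the surrounding .*? parts are not needed, and it is still not a substitute for a JSON parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TypeValueExtractor {
    // Captures a run of letters (any language) and underscores after "type":"
    private static final Pattern TYPE =
            Pattern.compile("\"type\":\"([\\p{L}_]+)\"");

    // Returns the captured value, or null when the input has no such field.
    static String extractType(String json) {
        Matcher m = TYPE.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // U+00FC (u with umlaut) is written as an escape to keep this source ASCII-only.
        System.out.println(extractType("{\"id\":7,\"type\":\"B\u00fcro_Licht\"}"));
    }
}
```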

Password Validation with Regex Java

I am trying to figure out a regex to match a password that contains
one upper case letter.
one number
one special character.
and at least 4 characters of length
the regex that I wrote is
^((?=.*[0-9])(?=.*[A-Z])(?=.*[^A-Za-z0-9])){4,}
however it is not working, and I couldn't figure out why.
So please can someone tell me why this code is not working, where did I mess up, and how to correct this code.
Your regex can be rewritten as
^(
(?=.*[0-9])
(?=.*[A-Z])
(?=.*[^A-Za-z0-9])
){4,}
As you see {4,} applies to group which doesn't let you match any character since look-around is zero-width, which effectively means "4 or more of nothing".
You need to add . before {4,} to let your regex handle "and at least 4 characters of length" point (rest is handled by look-around).
You can remove that capturing group since you don't really need it.
So try with something like
^(?=.*[0-9])(?=.*[A-Z])(?=.*[^A-Za-z0-9]).{4,}
You could come up with something like:
^(?=.*[A-Z])(?=.*\d)(?=.*[!"§$%&/()=?`]).{4,}$
In multiline mode, see a demo on regex101.com.
This approach specifies the special characters directly (which could be extended, obviously).
From the following list, only the entries marked (match) satisfy these criteria:
test
Test123! (match)
StrongPassword34? (match)
weakone
Tabaluga"12??? (match)
You can still enhance this expression by being more specific and requiring contrary pairs. Just to remind you, the dot-star (.*) brings you down the line and then backtracks eventually. This will almost always require more steps than to directly look for contrary pairs.
Consider the following expression:
^ # bind the expression to the beginning of the string
(?=[^A-Z\n\r]*[A-Z]) # look ahead for anything that is not A-Z or a newline, then require one of A-Z
(?=[^\d\n\r]*\d) # same construct for digits
(?=\w*[^\w\n\r]) # same construct for special chars (\w = _A-Za-z0-9)
.{4,}
$
You'll see a significant reduction in steps, as the regex engine does not have to backtrack every time.
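In Java, the annotated form can be kept almost verbatim with the Pattern.COMMENTS flag, which ignores whitespace and # comments inside the pattern. This is a sketch only: the class name is an assumption, and it validates a single password with matches() rather than a multi-line list, so multiline mode is not used.

```java
import java.util.regex.Pattern;

final class VerbosePasswordPattern {
    // Each line is one piece of the regex followed by a # comment.
    private static final Pattern VALID = Pattern.compile(String.join("\n",
            "^                      # start of input",
            "(?=[^A-Z\\n\\r]*[A-Z])  # skip non-uppercase characters, then one of A-Z",
            "(?=[^\\d\\n\\r]*\\d)     # skip non-digits, then one digit",
            "(?=\\w*[^\\w\\n\\r])     # skip word characters, then one special character",
            ".{4,}                  # at least four characters in total",
            "$                      # end of input"),
            Pattern.COMMENTS);

    static boolean isValid(String password) {
        return VALID.matcher(password).matches();
    }
}
```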

How to support internationalization for String validation?

In my program I had a regex which ensures that an input string has at least one alphabetic and one numeric character and that its length is between 2 and 10.
Pattern p = Pattern.compile("^(?=.\\d)(?=.[A-Za-z])[A-Za-z0-9]{2,10}$");
Per a new requirement, it needs to support internationalization. How can this be done?
To support internationalization for the messages, I have used a resource bundle with properties files containing the translated text. But I am not sure how to achieve this for string validation.
What you need is Unicode!
Unicode code properties
Pattern p = Pattern.compile("^(?=.*\\p{Nd})(?=.*\\p{L})[\\p{L}\\p{Nd}]{2,10}$");
\p{L} and \p{Nd} are Unicode properties, where
\p{L} is any kind of letter from any language
\p{Nd} is a digit zero through nine in any script except ideographic scripts
For more details on Unicode properties see regular-expressions.info
Pattern.UNICODE_CHARACTER_CLASS
There is also a flag, Pattern.UNICODE_CHARACTER_CLASS (added in Java 7), that enables the Unicode version of the predefined character classes; see my answer here for some more details and links.
You could do something like this
Pattern p = Pattern.compile("^(?=.*\\d)(?=.*[A-Za-z])\\w{2,10}$", Pattern.UNICODE_CHARACTER_CLASS);
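and \w would match all letters and all digits from any language (and of course some connector characters like _). Note that the flag does not change explicit classes: the (?=.*[A-Za-z]) lookahead still requires an ASCII letter, so use \\p{L} there if non-ASCII letters should count.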
Error in your regex
I also changed your regex a bit. Your original lookaheads ((?=.\d)(?=.[A-Za-z])) both inspect the second character and require it to be a digit and a letter at the same time, which can never succeed. My version, with the * quantifier, checks whether they occur anywhere in the string.
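A small sketch of the Unicode-property version in use; the class and method names are illustrative only. One limitation worth knowing: a letter written as a base character plus a separate combining mark is rejected, because \p{M} is not part of the character class.

```java
import java.util.regex.Pattern;

final class UnicodeAlnumValidator {
    // At least one decimal digit and one letter from any script, 2 to 10 characters in total.
    private static final Pattern VALID = Pattern.compile(
            "^(?=.*\\p{Nd})(?=.*\\p{L})[\\p{L}\\p{Nd}]{2,10}$");

    static boolean isValid(String input) {
        return VALID.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("a1"));
        // e with acute accent followed by ARABIC-INDIC DIGIT THREE
        System.out.println(isValid("\u00e9\u0663"));
    }
}
```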
At this point it might be better to define which characters (if any) don't count as alpha characters (like spaces, etc?). Then just make it "at least one numeric and one non-numeric character". But I think the problems you're having with the requirement stem from it being a bit silly.
Is this for a password? Two-character passwords are not secure at all. Some people may want to use passwords longer than ten characters. Is there actually any reason not to allow much longer passwords?
http://xkcd.com/936/ gives a pretty good overview of what makes an actually strong password. Requiring numbers doesn't help much against a modern attacker but makes the user's life harder. Better require a long password.

Regular expression for excluding special characters [closed]

I am having trouble coming up with a regular expression which would essentially black list certain special characters.
I need to use this to validate data in input fields (in a Java Web app). We want to allow users to enter any digit, letter (we need to include accented characters, ex. French or German) and some special characters such as '-. etc.
How do I blacklist characters such as <>%$ etc?
I would just white list the characters.
^[a-zA-Z0-9äöüÄÖÜ]*$
Building a blacklist is equally simple with regex, but you might need to add many more characters - there are a lot of Chinese symbols in Unicode ... ;)
^[^<>%$]*$
The expression [^(many characters here)] just matches any character that is not listed.
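Both variants can be tried side by side in Java. This is a sketch with made-up class and method names; the umlauts in the whitelist are written as Unicode escapes so the source stays ASCII-only.

```java
import java.util.regex.Pattern;

final class ListValidators {
    // Whitelist: only ASCII letters, digits and the six German umlauts may appear.
    private static final Pattern WHITELIST = Pattern.compile(
            "^[a-zA-Z0-9\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc]*$");

    // Blacklist: anything may appear except <, >, % and $.
    private static final Pattern BLACKLIST = Pattern.compile("^[^<>%$]*$");

    static boolean passesWhitelist(String s) {
        return WHITELIST.matcher(s).matches();
    }

    static boolean passesBlacklist(String s) {
        return BLACKLIST.matcher(s).matches();
    }
}
```

A name containing an e with acute accent fails this whitelist but passes the blacklist, which illustrates the trade-off between the two approaches.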
To exclude certain characters ( <, >, %, and $), you can make a regular expression like this:
[<>%\$]
This regular expression matches any single blacklisted character, so searching an input for it tells you whether the input contains one. The brackets define a character class. The \ before the dollar sign is optional here: the dollar sign has a special meaning in regular expressions, but not inside a character class.
To add more characters to the black list, just insert them between the brackets; order does not matter.
According to some Java documentation for regular expressions, you could use the expression like this:
Pattern p = Pattern.compile("[<>%$]"); // "\$" is not a valid escape in a Java string literal, so the $ is left unescaped
Matcher m = p.matcher(unsafeInputString);
if (m.find()) // find() searches anywhere in the input; matches() would require the whole input to match
{
// Invalid input: reject it, or remove/change the offending characters.
}
else
{
// Valid input.
}
Even in 2009, it seems too many had a very limited idea of what designing for the WORLDWIDE web involved. In 2015, unless designing for a specific country, a blacklist is the only way to accommodate the vast number of characters that may be valid.
The characters to blacklist then need to be chosen according what is illegal for the purpose for which the data is required.
However, sometimes it pays to break down the requirements, and handle each separately. Here look-ahead is your friend. These are sections bounded by (?=) for positive, and (?!) for negative, and effectively become AND blocks, because when the block is processed, if not failed, the regex processor will begin at the start of the text with the next block. Effectively, each look-ahead block will be preceded by the ^, and if its pattern is greedy, include up to the $. Even the ancient VB6/VBA (Office) 5.5 regex engine supports look-ahead.
So, to build up a full regular expression, start with the look-ahead blocks, then add the blacklisted character block before the final $.
For example, to limit the total numbers of characters, say between 3 and 15 inclusive, start with the positive look-ahead block (?=^.{3,15}$). Note that this needed its own ^ and $ to ensure that it covered all the text.
Now, while you might want to allow _ and -, you may not want to start or end with them, so add the two negative look-ahead blocks, (?!^[_-].+) for starts, and (?!.+[_-]$) for ends.
If you don't want multiple _ and -, add a negative look-ahead block of (?!.*[_-]{2,}). This will also exclude _- and -_ sequences.
If there are no more look-ahead blocks, then add the blacklist block before the $, such as [^<>[\]{\}|\\\/^~%# :;,$%?\0-\cZ]+, where the \0-\cZ excludes null and control characters, including NL (\n) and CR (\r). The final + ensures that all the text is greedily included.
Within the Unicode domain, there may well be other code-points or blocks that need to be excluded as well, but certainly a lot less than all the blocks that would have to be included in a whitelist.
The whole regex of all of the above would then be
(?=^.{3,15}$)(?!^[_-].+)(?!.+[_-]$)(?!.*[_-]{2,})[^<>[\]{}|\\\/^~%# :;,$%?\0-\cZ]+$
which you can check out live on https://regex101.com/, for the PCRE (PHP), JavaScript and Python regex engines. I don't know where the Java regex engine fits among those, but you may need to modify the regex to cater for its idiosyncrasies.
If you want to include spaces, but not _, just swap them everywhere in the regex.
The most useful application for this technique is for the pattern attribute for HTML input fields, where a single expression is required, returning a false for failure, thus making the field invalid, allowing input:invalid css to highlight it, and stopping the form being submitted.
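For Java specifically, the combined expression needs a few adjustments, because java.util.regex treats an unescaped [ inside a character class as the start of a nested class and does not accept a bare \0. The following is one possible adaptation, not a tested drop-in for every input: the class name is hypothetical, [ is escaped, \0-\cZ is written as \x00-\x1A, and the duplicate % is dropped.

```java
import java.util.regex.Pattern;

final class UsernameRules {
    private static final Pattern VALID = Pattern.compile(
            "(?=^.{3,15}$)"       // total length between 3 and 15
          + "(?!^[_-].+)"         // must not start with _ or -
          + "(?!.+[_-]$)"         // must not end with _ or -
          + "(?!.*[_-]{2,})"      // no two or more of _ or - in a row
          // blacklist block: every character must be outside this set
          + "[^<>\\[\\]{}|\\\\/^~%# :;,$?\\x00-\\x1A]+$");

    static boolean isValid(String input) {
        return VALID.matcher(input).matches();
    }
}
```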
I guess it depends what language you are targeting. In general, something like this should work:
[^<>%$]
The "[]" construct defines a character class, which will match any of the listed characters. Putting "^" as the first character negates the match, ie: any character OTHER than one of those listed.
You may need to escape some of the characters within the "[]", depending on what language/regex engine you are using.
The negated set of everything that is not alphanumeric & underscore for ASCII chars:
/[^\W]/g
For email or username validation I've used the following expression, which allows 4 standard special characters: - _ . #
/^[-.#_a-z0-9]+$/gi
For a strict alphanumeric only expression use:
/^[a-z0-9]+$/gi
Test # RegExr.com
It's usually better to whitelist characters you allow, rather than to blacklist characters you don't allow, both from a security standpoint and from an ease of implementation standpoint.
If you do go down the blacklist route, here is an example, but be warned, the syntax is not simple.
http://groups.google.com/group/regex/browse_thread/thread/0795c1b958561a07
If you want to whitelist all the accent characters, perhaps using unicode ranges would help? Check out this link.
http://www.regular-expressions.info/unicode.html
Do you really want to blacklist specific characters, or rather whitelist the allowed characters?
I assume that you actually want the latter. This is pretty simple (add any additional symbols to whitelist into the [\-] group):
^(?:\p{L}\p{M}*|[\-])*$
Edit: Optimized the pattern with the input from the comments
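A quick Java check of this whitelist pattern, with an illustrative class name. The \p{M}* part is what lets a letter followed by separate combining marks pass, so both the precomposed and the decomposed spelling of an accented letter are accepted.

```java
import java.util.regex.Pattern;

final class NameWhitelist {
    // Any letter, optionally followed by combining marks, or a hyphen; repeated.
    private static final Pattern NAME =
            Pattern.compile("^(?:\\p{L}\\p{M}*|[\\-])*$");

    static boolean isAllowed(String s) {
        return NAME.matcher(s).matches();
    }
}
```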
Why do you consider regex the best tool for this? If your purpose is to detect whether an illegal character is present in a string, testing each character in a loop will be both simpler and more efficient than constructing a regex.
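The loop-based check could look roughly like this. It is a sketch: the class name, the method name and the idea of passing the illegal characters as a string are assumptions made for illustration.

```java
final class IllegalCharScanner {
    // Returns the index of the first character of input that occurs in
    // illegalChars, or -1 if the input contains none of them.
    static int firstIllegalIndex(String input, String illegalChars) {
        for (int i = 0; i < input.length(); i++) {
            if (illegalChars.indexOf(input.charAt(i)) >= 0) {
                return i;
            }
        }
        return -1;
    }
}
```

Returning the index rather than a boolean makes it easy to report where the offending character is.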
Here's all the french accented characters:
àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ’ñ
I would google a list of German accented characters. There aren't THAT many. You should be able to get them all.
For URLs, I replace accented characters with regular letters like so:
string beforeConversion = "àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ’ñ";
string afterConversion = "aAaAaAaAeEeEeEeEiIiIiIoOoOoOuUuUuUcC'n";
for (int i = 0; i < beforeConversion.Length; i++) {
cleaned = Regex.Replace(cleaned, beforeConversion[i].ToString(), afterConversion[i].ToString());
}
There's probably a more efficient way, mind you.
I strongly suspect it's going to be easier to come up with a list of the characters that ARE allowed vs. the ones that aren't -- and once you have that list, the regex syntax becomes quite straightforward. So put me down as another vote for "whitelist".
Use this one:
^(?=[a-zA-Z0-9~##$^*()_+=[\]{}|\\,.?: -]*$)(?!.*[<>'"/;`%])
