Java regex not matching German "Umlaut" OR underscore

I'm trying to "play around" with some REST APIs and Java code.
As I mainly work with German text, I already got the Apache HTTP Client working with UTF-8 encoding so that umlauts are handled correctly.
Still, I can't get my regex to match my words correctly.
I am trying to find words or word combinations like "Büro_Licht" in strings like ..."type":"Büro_Licht"....
Using the regex ".*?type\":\"(\\w+).*?" returns "B" for me, because it doesn't recognize "ü" as a word character. That makes sense, since \w is documented as [a-zA-Z_0-9]. For strings without special characters, such as "Office_Light", I get the full word.
So I tried another hint from a nearly identical question here (which I could not comment on, because I lack the reputation points).
Using the regex ".*?type\":\"(\\p{L}+).*?" returns "Büro" for me. But here it cuts off at the underscore, for a reason I don't understand.
Is there a nice way to combine both expressions to get the full word, including underscores and special characters?

If you have to keep using regex, which is not a great tool for parsing JSON, try the character class [\p{L}_]. In your case it would be:
String regex = ".*?type\":\"[\\p{L}_]+\"";
An online example: https://regex101.com/r/57oFD5/2
\p{L} matches any kind of letter from any language
_ matches the underscore literally; it is not a letter, which is why \p{L} alone stops there
This will get hectic if you need to support other languages, whitespace, and various other Unicode code points. For example, do you need to support an arbitrary amount of whitespace around the colon? Take a look at this answer on removing emojis; there are many corner cases.
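The character-class approach above can be sketched as a small Java program. The class name UmlautTypeExtractor, the helper extract, and the sample JSON string are made up for illustration. The second pattern is an alternative not mentioned in the answer: it relies on the standard Pattern.UNICODE_CHARACTER_CLASS flag, which makes \w Unicode-aware so that it also covers digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class UmlautTypeExtractor {
    // Any Unicode letter or an underscore, captured in group 1.
    static final Pattern LETTERS_OR_UNDERSCORE =
            Pattern.compile("\"type\":\"([\\p{L}_]+)\"");

    // Alternative: let \w follow the Unicode definition of a word character.
    static final Pattern UNICODE_WORD =
            Pattern.compile("\"type\":\"(\\w+)\"", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns the captured type value, or null when the pattern does not match.
    static String extract(Pattern pattern, String json) {
        Matcher m = pattern.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // \u00fc is the letter u with umlaut; the escape avoids source-encoding issues.
        String json = "{\"id\":1,\"type\":\"B\u00fcro_Licht\"}";
        System.out.println(extract(LETTERS_OR_UNDERSCORE, json));
        System.out.println(extract(UNICODE_WORD, json));
    }
}
```

Note that the first pattern rejects values containing digits, while the Unicode-aware \w accepts them; which behaviour is wanted depends on the data.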

Related

Password Validation with Regex Java

I am trying to figure out a regex to match a password that contains:
one upper-case letter
one number
one special character
and is at least 4 characters long
The regex that I wrote is
^((?=.*[0-9])(?=.*[A-Z])(?=.*[^A-Za-z0-9])){4,}
However, it is not working, and I couldn't figure out why.
Can someone tell me why this regex is not working, where I messed up, and how to correct it?

Your regex can be rewritten as
^(
(?=.*[0-9])
(?=.*[A-Z])
(?=.*[^A-Za-z0-9])
){4,}
As you can see, {4,} applies to the group, which contains only look-arounds. Since look-arounds are zero-width, the group never consumes a character, so the quantifier effectively means "4 or more of nothing".
You need to add . before {4,} so that your regex handles the "at least 4 characters" requirement (the rest is handled by the look-arounds).
You can also remove the capturing group, since you don't really need it.
So try something like
^(?=.*[0-9])(?=.*[A-Z])(?=.*[^A-Za-z0-9]).{4,}
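As a rough sketch of how the corrected expression behaves in Java, the snippet below compiles it once and checks whole strings with matches(). The class name PasswordCheck and the method isValid are invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class PasswordCheck {
    // The look-aheads only assert; ".{4,}" is what actually consumes characters.
    private static final Pattern STRONG = Pattern.compile(
            "^(?=.*[0-9])(?=.*[A-Z])(?=.*[^A-Za-z0-9]).{4,}");

    // matches() requires the whole input to match, so no trailing "$" is needed.
    static boolean isValid(String password) {
        return STRONG.matcher(password).matches();
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"Test123!", "weakone", "A1!"}) {
            System.out.println(candidate + " -> " + isValid(candidate));
        }
    }
}
```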

You could come up with something like:
^(?=.*[A-Z])(?=.*\d)(?=.*[!"§$%&/()=?`]).{4,}$
See a demo on regex101.com (in multiline mode).
This approach lists the accepted special characters explicitly (the list could be extended, obviously).
From the following list, only Test123!, StrongPassword34? and Tabaluga"12??? satisfy these criteria:
test
Test123!
StrongPassword34?
weakone
Tabaluga"12???
You can still enhance this expression by being more specific and requiring "contrary pairs", that is, a negated character class followed by the class you are actually looking for. Remember that the dot-star (.*) first runs to the end of the line and then backtracks. This will almost always require more steps than looking directly for contrary pairs.
Consider the following expression:
^ # bind the expression to the beginning of the string
(?=[^A-Z\n\r]*[A-Z]) # look ahead: skip anything that is not A-Z or a line break, then require one of A-Z
(?=[^\d\n\r]*\d) # same construct for digits
(?=\w*[^\w\n\r]) # same construct for special chars (\w = _A-Za-z0-9)
.{4,}
$
You'll see a significant reduction in steps, as the regex engine does not have to backtrack every time.
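In Java, a commented expression like the one above can be kept readable with the Pattern.COMMENTS flag, which ignores whitespace and treats # as the start of a comment. The following is only a sketch; the class name VerbosePasswordPattern and the method isValid are assumptions made for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class VerbosePasswordPattern {
    // Each line ends with a real newline so the regex comment stops there.
    static final Pattern STRONG = Pattern.compile(
            "^                        # start of the string\n"
          + "(?=[^A-Z\\n\\r]*[A-Z])   # an upper-case letter\n"
          + "(?=[^\\d\\n\\r]*\\d)     # a digit\n"
          + "(?=\\w*[^\\w\\n\\r])     # a special (non-word) character\n"
          + ".{4,}                    # at least four characters in total\n"
          + "$",
            Pattern.COMMENTS);

    static boolean isValid(String password) {
        return STRONG.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("StrongPassword34?"));
        System.out.println(isValid("weakone"));
    }
}
```

The doubled backslashes belong to the Java string literal; the regex engine itself sees \n, \d and \w.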

Java RegExp: Capture part after a character but don't replace the character

I am using Java to parse through a JavaScript file. Because the scope differs from what is expected in the environment where I am using it, I am trying to replace every instance of, e.g.,
test = value
with
window.test = value
Previously, I had just been using
writer.append(js.getSource().replaceAll("test", "window.test"));
which obviously isn't generalizable, but for a fixed dataset it was working fine.
However, in the new files I'm supposed to work with, an updated version of the old ones, I now have to deal with
window['test'] = value
and
([[test]])
I don't want to match test in either of those cases, and it seems like those are the only two cases where there's a new format. So my plan was to now do a regex to match anything except ' and [ as the first character. That would be ([^'\[])test; however, I don't actually want to replace the first character - just make sure it's not one of the two I don't want to match.
This was a new situation for me because I haven't worked with replacement with RegExps that much, just pattern matching. So I looked around and found what I thought was the solution, something called "non-capturing groups". The explanation on the Oracle page sounded like what I was looking for, but when I re-wrote my Regular Expression to be (?:[^'\\[])test, it just behaved exactly the same as if I hadn't changed anything - replacing the character preceding test. I looked around StackOverflow, but what I discovered just made me more confident that what I was doing should work.
What am I doing wrong that it's not working as expected? Am I misusing the pattern?

If you include an expression for the preceding character in your regex, that character will be part of what is matched.
The trick is to put what you matched back into the replacement string, so that part is replaced by itself.
Try:
replaceAll("([^'\\[])test", "$1window.test");
The $1 in the replacement string is a back-reference to whatever capturing group 1 matched. In this case, that is the character preceding test.
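A minimal sketch of the back-reference replacement follows, next to an alternative that the answer does not mention: a negative look-behind, which checks the preceding character without consuming it. The class and method names are invented for illustration.

```java
// Hypothetical helper class, for illustration only.
final class WindowPrefixer {
    // Captures the preceding character and writes it back via $1.
    static String withBackReference(String js) {
        return js.replaceAll("([^'\\[])test", "$1window.test");
    }

    // Alternative: assert that the preceding character is neither ' nor [.
    static String withLookBehind(String js) {
        return js.replaceAll("(?<!['\\[])test", "window.test");
    }

    public static void main(String[] args) {
        System.out.println(withBackReference("x = 1; test = value"));
        System.out.println(withLookBehind("test = value"));
        System.out.println(withLookBehind("window['test'] = value"));
    }
}
```

One difference worth knowing: the capturing version needs some character in front of test, so an occurrence at the very start of the input is left alone, whereas the look-behind version also rewrites that one.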

Why not simply match on "(test)(\s*)=(\s*)([\w\d]+)"? That way you only match "test" followed by optional whitespace, an '=' sign, and a value (in this case consisting of letters, digits, and the underscore character). You can then use the groups (the parts between parentheses) to copy the value, and even the whitespace if required, into your new text.

Regex Differences between Java and Ruby

I'm trying to write a regex for:
Strings of characters beginning and ending with a double quote character, that do not contain control characters, and for which the backslash is used to escape the next character.
The paren-star form of comments in Pascal: strings beginning with (* and ending with *) that do not contain *)
I'm trying to write a version in Ruby, then another in Java, but I'm having trouble finding the differences in regex expressions for both. Any help is appreciated!

Here is a good place to start:
specifics for Java (mostly usage of regex in general)
specifics for Ruby (mostly usage of regex in general)
flavor comparison (mostly regex syntax and features)
Mostly, note that in Ruby you write regexes by delimiting them with /, while in Java a regex is an ordinary string literal, so you need to double every backslash (\\ instead of \) for the backslashes to get through to the regex engine. Everything else you should find in the links I gave you above.
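To illustrate the backslash doubling, here is one possible Java sketch for the Pascal comment case. The expression itself is just one way to write it, not something taken from the linked material, and the class name PascalComment is made up. In Ruby the same expression would sit between slashes with single backslashes.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class PascalComment {
    // Regex as the engine sees it:  \(\*(?:[^*]|\*(?!\)))*\*\)
    // Body: any non-star character, or a star that is not followed by ")".
    static final Pattern COMMENT =
            Pattern.compile("\\(\\*(?:[^*]|\\*(?!\\)))*\\*\\)");

    // True when the whole string is a single paren-star comment.
    static boolean isComment(String s) {
        return COMMENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isComment("(* hello *)"));
        System.out.println(isComment("(* a *) b *)"));
    }
}
```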
For the sake of completeness of this answer, I would also like to include Tom's Link to this online regex tester, that supports a multitude of regex flavors.
You should go ahead and give both regexes a go. If you encounter any problems, you are more than welcome to ask a new (specific) question, showing your own attempts.

Why do I get a parsing exception for this regex?

I have the following patterns in a web service constructor:
rUsername = Pattern.compile("[A-z0-9]{4,30}");
rPassword = Pattern.compile("[A-z0-9!##$%&*()=+-\\\\s]{6,30}");
rQuestion = Pattern.compile("[A-z0-9\\\\s?]{5,140}");
rAnswer = Pattern.compile("[A-z0-9\\\\s]{1,30}");
If I only have 2 backslashes there instead of the 4, I get a parsing exception from Tomcat when I deploy my web application.
The username one works fine, but I seem to be having issues with the password, question and answer. The password will match "testasdas" but not "test1234", the question will not match anything with a space in it and the answer doesn't seem to match anything.
I want the password to be able to match lowercase and uppercase letters, numbers, spaces and the symbols I threw in there. The question one should be able to match lowercase and uppercase, numbers, spaces and '?', and the answer just uppercase and lowercase letters, spaces and numbers.
EDIT: The patterns have changed to these:
rPassword = Pattern.compile("[A-Za-z0-9!##$%&*()=+\\s-]{6,30}");
rQuestion = Pattern.compile("[A-Za-z0-9\\s?]{5,140}");
rAnswer = Pattern.compile("[A-z0-9\\s]{1,30}");
These are more or less what I want, but as pointed out in an answer, I'm being quite restrictive on my password field, which probably isn't a good idea. I don't hash anything before I save it, because this is a college project nobody will ever use; I know that is a bad idea in the real world, but it was not part of the requirements for the project. I do, however, have to stop SQL injection attacks, which is why everything is so restrictive. The idea was mainly to disallow the use of ', which every SQL attack I know of needs in order to work, but I don't know how to disallow only that one character.

Let's look at your second regex:
[A-z0-9!##$%&*()=+-\\\\s]
There are several errors here.
[A-z] is incorrect; you need [A-Za-z], because there are some ASCII characters between Z and a ([, \, ], ^, _ and `) that you probably don't want to match. But that is not what causes your error.
More problematic is this section:
+-\\\\s
Translated from a Java string into an actual regex, this becomes
+-\\s
and that means (inside a character class) "match any character between + and \, or the letter s": the \\ is an escaped backslash, so the s that follows is a plain s rather than part of \s. [+-\\] is a valid range (ASCII 43-92), but it's not what you want.
But if you now remove the two extra backslashes, your character class becomes
+-\s
and that is a syntax error, because \s stands for a whole set of characters rather than a single one, so there is no ASCII range between + and "any whitespace".
Solution: Use
[A-Za-z0-9!##$%&*()=+\\s-]
or refrain from imposing limits on what characters your users may choose in a password in the first place.
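The difference between the two character classes can be checked with a short sketch like the one below. It isolates only the problematic tail of the class; the class name CharClassDemo and the helper accepts are assumptions for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class CharClassDemo {
    // Four backslashes in Java source: the engine sees [+-\\s],
    // i.e. the range '+' to '\' plus the literal letter 's'.
    static final Pattern FOUR_BACKSLASHES = Pattern.compile("[+-\\\\s]");

    // Suggested fix: the engine sees [+\s-],
    // i.e. '+', any whitespace, and a literal '-' at the end.
    static final Pattern FIXED = Pattern.compile("[+\\s-]");

    // True when the single character c is accepted by the class.
    static boolean accepts(Pattern p, char c) {
        return p.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        System.out.println("space, original: " + accepts(FOUR_BACKSLASHES, ' '));
        System.out.println("space, fixed:    " + accepts(FIXED, ' '));
    }
}
```

Placing the hyphen last (or first) in a character class is what keeps it from being read as a range operator.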

To match upper-case and lower-case letters, you need the pattern [a-zA-Z].
Try changing your pattern to:
[a-zA-Z0-9\\s]{1,30} for your answer.
For your question, if it should not accept whitespace, you can use \\S (any non-whitespace character):
[a-zA-Z0-9\\S]{1,250}
I think it should work. You can make the corresponding changes to the remaining two.
EDIT: You can also use \\p{Z} to match any Unicode separator character (spaces and line or paragraph separators); unlike \\s, it does not match tabs or newlines.
See this link for a good tutorial on using Java regular expressions.

What is a regular expression for control characters?

I'm trying to match a control character in the form \^c where c is any valid character for control characters. I have this regular expression, but it's not currently working: \\[^][#-z]
I think the problem lies with the fact that the caret character (^) is part of the regular expressions parsing engine.

Match an ASCII text string of the form ^X using the pattern \^., nothing more. Match an ASCII text string of the form \^X with the pattern \\\^.. You may wish to constrain that dot to [A-Z?@_\[\]^\\], so \\\^[A-Z?@_\[\]^\\]. It’s easier to read as [?\x40-\x5F] for the bracketed character class, hence \\\^[?\x40-\x5F] for a literal BACKSLASH, followed by a literal CIRCUMFLEX, followed by something that turns into one of the valid control characters.
Note that this is the pattern as it looks when printed out, or as you’d read it from a file. It’s what you need to pass to the regex compiler. If you have it as a string literal, you must of course double each of those backslashes: "\\\\\\^[?\\x40-\\x5F]". Yes, it is insane looking, but that is because Java does not support regexes directly the way Groovy and Scala (or Perl and Ruby) do. Regex work is always easier without the extra bbaacckksslllllaasshheesssssess. :)
If you had real control characters instead of indirect representations of them, you would use \pC for all literal code points with the property GC=Other, or \p{Cc} for just GC=Control.
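A short sketch of both cases in Java: the caret notation as text, and real control characters via the Cc category. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class ControlNotation {
    // Regex as the engine sees it:  \\\^[?\x40-\x5F]
    // A literal backslash, a literal caret, then '?' or a character from '@' to '_'.
    static final Pattern CARET_FORM = Pattern.compile("\\\\\\^[?\\x40-\\x5F]");

    // Real control characters: Unicode general category Cc.
    static final Pattern REAL_CONTROL = Pattern.compile("\\p{Cc}");

    static boolean isCaretForm(String s) {
        return CARET_FORM.matcher(s).matches();
    }

    static boolean isRealControl(String s) {
        return REAL_CONTROL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isCaretForm("\\^C")); // three characters: backslash, caret, C
        System.out.println(isRealControl("\t")); // a real tab character
    }
}
```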

Check this out: http://www.regular-expressions.info/characters.html. You should be able to use \cA through \cZ to match the control characters.
