Java, Regex : Adding umlaut and other german characters for first-name


I want to validate a first name sent by the user with a regex. I found multiple expressions for first names, but I also want to allow German characters like äöüß and French ones such as à À è È é É ù Ù ì Ì ò Ò ñ Ñ. I tried the regex evaluator suggested by SO here, but that didn't help. Whenever I added an extra square bracket, it would tell me "Your regular expression does not match the subject string." What am I doing wrong?
Current regex pattern:
^([A-Z][a-z]*((\s)))+[A-Z][a-z][äöüß]*$
Thank you.

It is not a good idea to restrict people's names too much. I suggest a rather generic regex:
s.matches("(?U)[\\p{L}\\p{M}\\s'-]+")
This regex matches a string consisting only of one or more Unicode letters, diacritics, whitespace characters, apostrophes or hyphens.
If you need more restrictive checks, such as allowing whitespace only inside the string and never consecutively, use grouping:
"(?U)[\\p{L}\\p{M}'-]+(?:\\s[\\p{L}\\p{M}'-]+)*"

I agree with previous answer from Wiktor.
But to answer why your regex did not work:
Your pattern expects one or more words, each made of a capital letter, zero or more lowercase letters and a whitespace character, followed by a final word. That final word must be exactly one capital letter, one lowercase letter and then zero or more of äöüß. I do not know any name that would fit. All those special characters should be in the same [] as the other letters.
Also, hyphenated names are not allowed.
So your regex could look more like this:
^([A-ZÜÄÖ][a-züäöß]*(\s|-))*[A-ZÜÄÖ][a-züäöß]*$
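To compare the two answers, here is a minimal sketch that runs the generic Unicode pattern and the corrected German-only pattern against a few sample names. The class name NamePatterns and its helper methods are made up for illustration, and the non-ASCII letters are written as Unicode escapes (for example \u00fc for ü and \u00c9 for É) only so that the source compiles regardless of file encoding.

```java
import java.util.regex.Pattern;

class NamePatterns {
    // Generic pattern from the first answer: Unicode letters, combining marks,
    // apostrophes and hyphens, with single whitespace characters between parts.
    static final Pattern GENERIC = Pattern.compile(
            "(?U)[\\p{L}\\p{M}'-]+(?:\\s[\\p{L}\\p{M}'-]+)*");

    // Corrected pattern from the second answer: ASCII letters plus German ones.
    static final Pattern GERMAN = Pattern.compile(
            "^([A-Z\u00dc\u00c4\u00d6][a-z\u00fc\u00e4\u00f6\u00df]*(\\s|-))*"
            + "[A-Z\u00dc\u00c4\u00d6][a-z\u00fc\u00e4\u00f6\u00df]*$");

    static boolean isGenericName(String s) {
        return s != null && GENERIC.matcher(s).matches();
    }

    static boolean isGermanName(String s) {
        return s != null && GERMAN.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Hans-J\u00fcrgen M\u00fcller", // hyphenated German name
            "\u00c9lodie",                  // French capital E with acute
            "Anna  Maria",                  // two consecutive spaces
            "R2D2"                          // contains digits
        };
        for (String s : samples) {
            System.out.println(s + " generic=" + isGenericName(s)
                    + " german=" + isGermanName(s));
        }
    }
}
```

With these samples, both patterns should accept the hyphenated German name and reject the double space and the digits, while only the generic pattern accepts the French name. That difference is the practical argument for \p{L} over a hand-written letter list.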

Related

How to prevent accented characters in an email field in Java using regex?

I have an email field in a form which is currently validated using GenericValidator.isEmail method. But now I need to apply another validation where I need to prevent accented characters being sent to the email address. So I was thinking of using a Regex Pattern Matching approach and I found one in stackoverflow itself
if (Pattern.matches(".*[éèàù].*", input)) {
// your code
}
Problem is I saw only é è à ù characters in the pattern but there are several other accented characters like õ ü ì etc. So is there a way we can match pattern for all types of accented characters?
I needed to match for NL (Dutch), FR(French) and DE(German) language accented characters. I need to check if my email address has any accented character and if it does need to stop execution there and throw an error

It turns out you want to match any letter but an ASCII letter.
I suggest subtracting ASCII letters from the \p{L} pattern that matches any Unicode letter:
Pattern.matches("(?s).*[\\p{L}&&[^A-Za-z]].*", input)
Here,
(?s) - Pattern.DOTALL embedded flag option that makes . match across lines
.* - any zero or more chars, as many as possible
[\\p{L}&&[^A-Za-z]] - any Unicode letter except ASCII letters
.* - any zero or more chars, as many as possible.
Note that it is better to use find(), since it also finds partial matches. There is then no need for (?s).* and .* in the above pattern, which makes it much more efficient with longer strings:
Pattern.compile("[\\p{L}&&[^A-Za-z]]").matcher(input).find()
See this Java demo.
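A minimal sketch of how the find() approach could be wired into the validation the question describes, stopping with an error when a non-ASCII letter is present. The class name EmailAsciiCheck, both method names and the exception message are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class EmailAsciiCheck {
    // Any Unicode letter that is not an ASCII letter (class intersection).
    private static final Pattern NON_ASCII_LETTER =
            Pattern.compile("[\\p{L}&&[^A-Za-z]]");

    // True if the input contains at least one non-ASCII letter anywhere.
    static boolean hasNonAsciiLetter(String input) {
        return NON_ASCII_LETTER.matcher(input).find();
    }

    // Hypothetical guard: reject the address instead of continuing.
    static void requireAsciiLetters(String email) {
        if (hasNonAsciiLetter(email)) {
            throw new IllegalArgumentException(
                    "Email address contains an accented or non-ASCII letter");
        }
    }
}
```

Note that \p{L} does not cover combining marks, so an accent sent in decomposed form (a plain letter followed by a combining character) would pass this check unless the input is normalized first or \p{M} is added to the class.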

How to write a Regular expression to match any non alphabet or number and also matching dot

I need to match any special character in a string. For example, if the string has & % € (), etc. I could have Unicode alphabets such as ä ö å.
But I also want to match a dot ".". For example, for the string "8x8 Inc." it should return true, because it contains a dot.
I have tried a few expressions so far, but none of them worked for me. Please let me know how it can be done. Thanks in advance!

You can use this one:
[^a-zA-Z\d\s] -> basically anything outside the group of all a-z and A-Z characters, digits and whitespace. It will capture all other characters, including special letters such as ä, dots, commas, braces etc.
A simpler version would be [^\w\s], which matches any non-word, non-space character, but it will not match ä ö å

In a Java regex, .* matches any sequence of characters.
If you want to match only a literal dot (.), escape it as \. so that it matches only a dot in the string.
In a Java program you have to write it like this:
String regex="\\.";

Take a look at Unicode character classes. For your example, I think something like "(\\p{IsAlphabetic}|\\d)+" should work
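The suggestions above differ mainly in whether letters such as ä count as special. A minimal sketch of both readings follows; the class name SpecialCharCheck and its methods are made up for illustration, and the Unicode-aware variant is only an assumption about what the asker wants.

```java
import java.util.regex.Pattern;

class SpecialCharCheck {
    // ASCII view: anything except a-z, A-Z, digits and whitespace is special,
    // so accented letters count as special too.
    private static final Pattern ASCII_SPECIAL =
            Pattern.compile("[^a-zA-Z\\d\\s]");

    // Unicode-aware view: alphabetic characters of any script are not special.
    private static final Pattern UNICODE_SPECIAL =
            Pattern.compile("[^\\p{IsAlphabetic}\\d\\s]");

    static boolean hasSpecialAscii(String s) {
        return ASCII_SPECIAL.matcher(s).find();
    }

    static boolean hasSpecialUnicode(String s) {
        return UNICODE_SPECIAL.matcher(s).find();
    }
}
```

Both variants should report the trailing dot in "8x8 Inc." because neither class excludes the dot; they only disagree on input such as ä ö å.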

Regex to include all spanish characters and number

I have a Java app where I need a regex that removes everything except letters and numbers (including Spanish characters such as stressed vowels and ñ/Ñ). It also needs to keep some specific special characters.
I created the following regex, but it also removes the stressed vowels, which is not the idea:
string.replaceAll("[^-_/.,a-zA-Z0-9 ]+","")
I just want to accept those characters, not others like æ or å.

You may use \p{L} instead of a-zA-Z:
string = string.replaceAll("[^-_/.,\\p{L}0-9 ]+","");
The \p{L} matches all Unicode letters regardless of modifiers passed to the regex compile.
See a Java test:
List<String> strs = Arrays.asList("!##Łąka$%^", "Word123-)(=+");
for (String str : strs)
System.out.println("\"" + str.replaceAll("[^-_/.,\\p{L}0-9 ]+","") + "\"");
Output:
"Łąka"
"Word123-"
Pattern details: the [^-_/.,\\p{L}0-9 ]+ pattern matches one or more chars other than -, _, /, ., ,, a Unicode letter, an ASCII digit and a space.
Note that with this solution, you will still remove Unicode digits, like ٠١٢٣٤٥٦٧٨٩.
You may use Mena's suggested \p{Alnum} but with the (?U) embedded flag option to really match all Unicode letters and digits:
string = string.replaceAll("(?U)[^-_/.,\\p{Alnum} ]+","");
To only remove Unicode letters other than common European letters, just add À-ÿ and subtract two non-letters, ×÷, from this range:
string = string.replaceAll("(?U)[^-_/.,A-Za-zÀ-ÿ &&[^×÷]]+","");
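A minimal sketch contrasting the first two replacements above on Spanish text and on non-ASCII digits. The class name Sanitizer and both method names are made up for illustration; ñ is written as \u00f1 and the Arabic-Indic digit three as \u0663 so the source does not depend on file encoding.

```java
class Sanitizer {
    // Keeps - _ / . , space, any Unicode letter and ASCII digits only.
    static String keepLettersAndAsciiDigits(String s) {
        return s.replaceAll("[^-_/.,\\p{L}0-9 ]+", "");
    }

    // Keeps the same punctuation plus Unicode letters and Unicode digits;
    // (?U) switches \p{Alnum} from ASCII-only to its Unicode definition.
    static String keepUnicodeAlnum(String s) {
        return s.replaceAll("(?U)[^-_/.,\\p{Alnum} ]+", "");
    }

    public static void main(String[] args) {
        String spanish = "Ni\u00f1o #1!";
        String arabicDigit = "a\u0663b";
        System.out.println(keepLettersAndAsciiDigits(spanish));
        System.out.println(keepLettersAndAsciiDigits(arabicDigit));
        System.out.println(keepUnicodeAlnum(arabicDigit));
    }
}
```

Both methods should keep the ñ and drop # and !, and only the second one should keep the Arabic-Indic digit. Neither restricts letters to Spanish ones, so æ and å survive as well.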

You could try to include spanish special characters in a character class [ ... ], there are only 7 after all.
I needed only lowercase characters, so instead of [a-z], I used [a-zñáéíóúü] and that worked for me.

You can use the POSIX character class \p{Alnum} to cover alphabetic characters and digits (by default it is ASCII-only in Java; it covers accented characters only when the (?U) flag is set):
"[^-_/.,\\p{Alnum} ]+"
See docs:
\p{Alnum} An alphanumeric character:[\p{Alpha}\p{Digit}]
Note that your replacement currently impacts all alphabetic characters, etc.
If you want to actually negate that custom class (thus replacing everything that's not defined in there), use:
"[^[-_/.,\\p{Alnum} ]]+"
(note the additional square brackets after the ^, otherwise it would be interpreted as literal ^).
Edit
You can further narrow this down to a subset of Latin character blocks by using:
String s = "a1áŁą";
System.out.println(
s.replaceAll("[^[-_/.,\\p{InBASIC_LATIN}\\p{InLATIN_1_SUPPLEMENT}0-9]]+","")
);
Output
Łą
Note that you will still have some non-Spanish characters in the Latin 1 supplement, see here.
If you want to restrict your requirements further, you will likely need to define your own (lengthy) character class with specific Spanish characters.

Java regex matches diacritics for the Latin corresponding characters

I am trying to use a regex such as [ăâîșțĂÂÎȘȚ] to match Romanian alphabet diacritics (ISO 8859-16/Windows-1250). The problem is that the regex also matches a, i, s, t, A, I, S, T (the plain Latin letters corresponding to the diacritics above), and I don't want this.
I didn't try to compare strings character by character because of the performance cost.
Is there any way I can make the regex match exactly these characters?

If your regex exists as literal rendered text, it has already been combined
and should exist as a different code point.
000074 t LATIN SMALL LETTER T
+
000326 ̦ COMBINING COMMA BELOW
=
00021B ț LATIN SMALL LETTER T WITH COMMA BELOW
Just in case, you should use a hex code point to represent them, i.e. \u021B
Is it possible the Java engine could be stripping the combining character off of the regex?
Where x21B becomes x74? Might be that.
Meanwhile if you expect the letters in the source are not rendered, you could
use a regex like \p{Script=Latin}\p{Block=Combining_Diacritical_Marks}
to get those.
updated info :
While searching around for a de facto solution, I came across this Java info
from http://www.regular-expressions.info/unicode.html.
In Java, the regex token \uFFFF only matches the specified
code point, even when you turned on canonical equivalence.
However, the same syntax \uFFFF is also used to insert
Unicode characters into literal strings in the Java source
code. Pattern.compile("\u00E0") will match both the
single-code-point and double-code-point encodings of à,
while Pattern.compile("\\u00E0") matches only the
single-code-point version. Remember that when writing a
regex as a Java string literal, backslashes must be escaped.
The former Java code compiles the regex à, while the latter
compiles \u00E0. Depending on what you're doing, the
difference may be significant.
So, by entering a duality literal inside a class, it looks like Pattern.compile("[à]")
will actually match
000061 a LATIN SMALL LETTER A
or
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
This smacks of the same problem when putting surrogate pairs inside classes.
There is a solution.
Avoid entering those literals inside of a class.
Instead, put them as a series of alternations
(?:à|_|_|_)
Doing this forces it to match either
000061 a LATIN SMALL LETTER A
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
It won't match a on its own, independent of the grave, as you see now.
Note: if you just use "[\\u00E0]", you'd miss the a + grave sequence, which is valid.

I believe this is happening because those characters are being treated as two Unicode code points. I would recommend trying to specifically match the code points using syntax like \uFFFF where FFFF is the code point. The exact syntax will depend on the regex implementation you are using.
Keep in mind that Unicode characters can be encoded as single code points or as multiple, so you'll want to account for that. Example: à encoded as U+0061 U+0300 and also U+00E0.
I hope this helps!

As already mentioned, Unicode offers two alternatives.
'\u0061' 'a' LATIN SMALL LETTER A
'\u0300' ̀ COMBINING GRAVE ACCENT
or
'\u00E0' 'à' LATIN SMALL LETTER A WITH GRAVE
There is a Normalizer that can "normalize" to either form (and deal with ligatures):
String regex = "(?u)[ăâîșțĂÂÎȘȚ]";
regex = Normalizer.normalize(regex, Form.NFC); // Composed form
Pattern pattern = Pattern.compile(regex);
Using "(?u)", or passing the UNICODE_CASE flag to Pattern.compile, might already solve the problem. But using the precomposed Unicode form, without a separate Latin base letter ('a'), will certainly do.
The normalizer should especially be applied on the searched-through string.
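A minimal sketch of the normalization advice above: the searched text is normalized to NFC before matching, so a base letter plus a combining mark is treated like the precomposed letter, while a plain Latin letter never matches. The class name DiacriticMatcher and its method are made up for illustration; the character class holds the same ten Romanian letters as the question, written as escapes for precomposed code points.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class DiacriticMatcher {
    // Precomposed Romanian letters: a-breve, a-circumflex, i-circumflex,
    // s-comma-below, t-comma-below, in lowercase and uppercase.
    private static final Pattern ROMANIAN = Pattern.compile(
            "[\u0103\u00e2\u00ee\u0219\u021b\u0102\u00c2\u00ce\u0218\u021a]");

    static boolean containsRomanianDiacritic(String text) {
        // Compose base letter + combining mark pairs into single code points.
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        return ROMANIAN.matcher(composed).find();
    }
}
```

Because the class contains only precomposed code points, plain a, i, s and t are not matched, and decomposed input is still recognized after the NFC step.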

match whole sentence with regex

I'm trying to match sentences without capital letters with regex in Java:
"Hi this is a test" -> Shouldn't match
"hi thiS is a test" -> Shouldn't match
"hi this is a test" -> Should match
I've tried the following regex, but it also matches my second example ("hi thiS is a test").
[a-z]+
It seems like it's only looking at the first word of the sentence.
Any help?

[a-z]+ will match if your string contains any lowercase letter.
If you want to make sure your string doesn't contain uppercase letters, you could use a negative character class: ^[^A-Z]+$
Be aware that this won't handle accented characters (like É) though.
To make this work, you can use Unicode properties: ^\P{Lu}+$
\P means "not in the given Unicode category", and Lu is the category of uppercase letters that have a lowercase variant.
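A minimal sketch of the Unicode-property version applied to the examples from the question. The class name LowercaseSentence and its method are made up for illustration; note that in a Java string literal the backslash must be doubled.

```java
import java.util.regex.Pattern;

class LowercaseSentence {
    // One or more characters, none of which is an uppercase letter (Lu).
    private static final Pattern NO_UPPERCASE = Pattern.compile("^\\P{Lu}+$");

    static boolean hasNoUppercase(String sentence) {
        return NO_UPPERCASE.matcher(sentence).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasNoUppercase("Hi this is a test"));
        System.out.println(hasNoUppercase("hi thiS is a test"));
        System.out.println(hasNoUppercase("hi this is a test"));
    }
}
```

Only the last sentence should be accepted. Unlike ^[^A-Z]+$, this also rejects a string that starts with an accented capital such as É.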

^[a-z ]+$
Try this. It will validate the right ones.

It's not matching because you haven't included a space in the match pattern, so your regex only matches whole words with no spaces.
Try something like ^[a-z ]+$ instead (notice the space in the square brackets). You can also use \s, which is shorthand for "whitespace characters", but be aware that this also includes things like line feeds and carriage returns.
This pattern does the following:
^ matches the start of a string
[a-z ]+ matches any a-z character or a space, where 1 or more exists.
$ matches the end of the string.

I would actually advise against regex in this case, since you don't seem to employ extended characters.
Instead try to test as following:
myString.equals(myString.toLowerCase());
