Regex to include all Spanish characters and numbers - Java

I have a Java app where I need a regex that removes everything except letters and digits (including Spanish characters such as accented vowels and ñ/Ñ). It also needs to keep some specific special characters.
I created the following regex, but it also removes the accented vowels, which is not what I want:
string.replaceAll("[^-_/.,a-zA-Z0-9 ]+","")
I just want to accept those characters, not others like æ or å.

You may use \p{L} instead of a-zA-Z:
string = string.replaceAll("[^-_/.,\\p{L}0-9 ]+","");
The \p{L} matches all Unicode letters regardless of modifiers passed to the regex compile.
See a Java test:
List<String> strs = Arrays.asList("!##Łąka$%^", "Word123-)(=+");
for (String str : strs)
System.out.println("\"" + str.replaceAll("[^-_/.,\\p{L}0-9 ]+","") + "\"");
Output:
"Łąka"
"Word123-"
Pattern details: the [^-_/.,\\p{L}0-9 ]+ pattern matches one or more chars other than -, _, /, ., ,, a Unicode letter, an ASCII digit and a space.
Note that with this solution, you will still remove Unicode digits, like ٠١٢٣٤٥٦٧٨٩.
You may use Mena's suggested \p{Alnum}, but with the (?U) embedded flag option, to really match all Unicode letters and digits:
string = string.replaceAll("(?U)[^-_/.,\\p{Alnum} ]+","");
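To keep only common Western European letters and remove all other Unicode letters, add the À-ÿ range and subtract the two non-letters it contains, × and ÷ (note that the class below has no 0-9, so add it if digits must be kept too):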
string = string.replaceAll("(?U)[^-_/.,A-Za-zÀ-ÿ &&[^×÷]]+","");
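As a rough comparison of the classes discussed in this answer, the sketch below strips the same input with \p{L} plus ASCII digits, with plain \p{Alnum}, and with \p{Alnum} under (?U). The class and method names are made up for illustration, and the input is written with Unicode escapes only to keep the source ASCII-safe.

```java
final class LetterClassDemo {
    // \p{L} is Unicode-aware on its own, but 0-9 only covers ASCII digits.
    static String keepLettersAsciiDigits(String s) {
        return s.replaceAll("[^\\p{L}0-9]+", "");
    }

    // Without (?U), \p{Alnum} is the ASCII-only POSIX class.
    static String keepAsciiAlnum(String s) {
        return s.replaceAll("[^\\p{Alnum}]+", "");
    }

    // With (?U) (UNICODE_CHARACTER_CLASS), \p{Alnum} covers Unicode letters and digits.
    static String keepUnicodeAlnum(String s) {
        return s.replaceAll("(?U)[^\\p{Alnum}]+", "");
    }

    public static void main(String[] args) {
        // a, n-tilde (U+00F1), 1, Arabic-Indic digit three (U+0663), '!'
        String input = "a\u00f11\u0663!";
        System.out.println(keepLettersAsciiDigits(input)); // keeps a, n-tilde, 1
        System.out.println(keepAsciiAlnum(input));         // keeps a, 1
        System.out.println(keepUnicodeAlnum(input));       // keeps a, n-tilde, 1, Arabic-Indic 3
    }
}
```

Which variant fits depends on whether non-ASCII digits should survive the cleanup.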

You could try to include the Spanish special characters in a character class [ ... ]; there are only 7 after all (14 if you also need the uppercase forms).
I needed only lowercase characters, so instead of [a-z], I used [a-zñáéíóúü] and that worked for me.
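A minimal sketch of this explicit-allowlist idea, extended to uppercase and to the punctuation from the question. The class name, constants and method are hypothetical; the NFC normalization step is an added assumption, so that a letter typed as a base character plus a combining accent is treated like its precomposed form.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class SpanishAllowlist {
    // a-acute, e-acute, i-acute, o-acute, u-acute, u-diaeresis, n-tilde
    private static final String LOWER = "\u00e1\u00e9\u00ed\u00f3\u00fa\u00fc\u00f1";
    // The same seven letters in uppercase
    private static final String UPPER = "\u00c1\u00c9\u00cd\u00d3\u00da\u00dc\u00d1";

    // Anything that is NOT an allowed symbol, ASCII letter, digit, Spanish letter or space
    private static final Pattern DISALLOWED =
            Pattern.compile("[^-_/.,a-zA-Z0-9" + LOWER + UPPER + " ]+");

    static String clean(String s) {
        // Compose "n" + combining tilde into a single n-tilde before filtering
        String composed = Normalizer.normalize(s, Normalizer.Form.NFC);
        return DISALLOWED.matcher(composed).replaceAll("");
    }

    public static void main(String[] args) {
        // "Nino" with n-tilde, followed by ae ligature, a-ring, '#', "42", '!'
        System.out.println(clean("Ni\u00f1o\u00e6\u00e5#42!"));
    }
}
```

Unlike \p{L}, this rejects æ and å, at the cost of having to maintain the list by hand.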

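You can use the POSIX class \p{Alnum} to keep all alphabetic characters and digits. Note that in Java it only covers ASCII by default; to include accented characters, enable UNICODE_CHARACTER_CLASS with the embedded (?U) flag:
"(?U)[^-_/.,\\p{Alnum} ]+"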
See docs:
\p{Alnum} An alphanumeric character:[\p{Alpha}\p{Digit}]
Note that your replacement currently impacts all alphabetic characters, etc.
If you want to actually negate that custom class (thus replacing everything that's not defined in there), put ^ immediately after the opening bracket:
"(?U)[^-_/.,\\p{Alnum} ]+"
(a ^ negates the class only in that first position; anywhere else inside the brackets it is a literal ^. No extra nested brackets are needed.)
Edit
You can narrow this down further to a subset of Latin character blocks by using:
String s = "a1Łą";
System.out.println(
s.replaceAll("[^-_/.,\\p{InBASIC_LATIN}\\p{InLATIN_1_SUPPLEMENT}0-9]+","")
);
Output
a1
Note that the Latin-1 Supplement block still contains some non-Spanish characters.
If you want to restrict your requirements further, you will likely need to define your own (lengthy) character class with specific Spanish characters.
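To illustrate the caveat about the block-based approach, here is a small sketch (hypothetical class and method names) that keeps only the Basic Latin and Latin-1 Supplement blocks. It shows that æ survives because it lives in Latin-1 Supplement, and that ASCII punctuation survives because Basic Latin is the whole ASCII block, not just letters.

```java
import java.util.regex.Pattern;

final class LatinBlockFilter {
    // Matches runs of characters outside the two Unicode blocks
    private static final Pattern OUTSIDE_BLOCKS =
            Pattern.compile("[^\\p{InBASIC_LATIN}\\p{InLATIN_1_SUPPLEMENT}]+");

    static String keepLatin1(String s) {
        return OUTSIDE_BLOCKS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // n-tilde (U+00F1), ae ligature (U+00E6), L with stroke (U+0141), '!'
        String input = "\u00f1\u00e6\u0141!";
        // Only U+0141 (Latin Extended-A) is removed
        System.out.println(keepLatin1(input));
    }
}
```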

Related

How to prevent accented characters in an email field in Java using regex?

I have an email field in a form which is currently validated using the GenericValidator.isEmail method. Now I need to add another validation that prevents accented characters from being submitted in the email address. I was thinking of using a regex pattern-matching approach, and I found one on Stack Overflow:
if (Pattern.matches(".*[éèàù].*", input)) {
// your code
}
The problem is that the pattern only covers é è à ù, but there are several other accented characters like õ ü ì etc. Is there a way to match all types of accented characters?
I need to match accented characters for NL (Dutch), FR (French) and DE (German). I need to check whether the email address contains any accented character and, if it does, stop execution there and throw an error.
It turns out you want to match any letter but an ASCII letter.
I suggest subtracting ASCII letters from the \p{L} pattern that matches any Unicode letter:
Pattern.matches("(?s).*[\\p{L}&&[^A-Za-z]].*", input)
Here,
(?s) - Pattern.DOTALL embedded flag option that makes . match across lines
.* - any zero or more chars, as many as possible
[\\p{L}&&[^A-Za-z]] - any Unicode letter except ASCII letters
.* - any zero or more chars, as many as possible.
Note that it is better to use find(), since it also reports partial matches; there is then no need for (?s).* and .* in the pattern, which makes it much more efficient on longer strings:
Pattern.compile("[\\p{L}&&[^A-Za-z]]").matcher(input).find()
See this Java demo.
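A minimal sketch of the find()-based check wrapped in a reusable helper. The class and method names are assumptions for illustration; the pattern is the one from the answer, compiled once so that it is not rebuilt on every call.

```java
import java.util.regex.Pattern;

final class NonAsciiLetterCheck {
    // Any Unicode letter that is not an ASCII letter
    private static final Pattern NON_ASCII_LETTER =
            Pattern.compile("[\\p{L}&&[^A-Za-z]]");

    static boolean containsNonAsciiLetter(String input) {
        // find() stops at the first offending character; no .* wrapping needed
        return NON_ASCII_LETTER.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonAsciiLetter("user@example.com"));         // false
        // "jurgen" with u-diaeresis (U+00FC)
        System.out.println(containsNonAsciiLetter("j\u00fcrgen@example.com"));  // true
    }
}
```

One caveat: an accent entered as a separate combining mark (for example u followed by U+0308) is a mark rather than a letter, so it would not be caught by \p{L} alone.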

How to write a Regular expression to match any non alphabet or number and also matching dot

I need to match any special character in a string. For example, if the string has & % € (), etc. I could have Unicode alphabets such as ä ö å.
But I also want to match a dot "." For example, if I have a string as "8x8 Inc." . It should return true. Because it has a .
I tried a few expression so far but none of them worked for me. Please let me how it can be done? Thanks in advance!
You can do that one:
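[^a-zA-Z\d\s] -> basically anything outside the group of a-z and A-Z letters, digits and whitespace. It will match all other characters, including special letters such as ä, dots, commas, braces, etc.
A simpler version would be [^\w\s], which matches any non-word, non-whitespace character. Note that in Java \w is ASCII-only by default, so this also matches ä ö å (and never matches _); with the (?U) flag, \w covers Unicode letters and ä ö å are no longer matched.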
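In a Java regex, .* matches any sequence of characters (excluding line terminators unless DOTALL is set).
If you want to match only a literal dot (.), escape it as \. so that it matches only a dot in the string.
In a Java program the backslash itself must be escaped in the string literal, like this: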
String regex="\\.";
Take a look at Unicode character classes. For your example, I think something like "(\\p{IsAlphabetic}|\\d)+" should work
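Putting the Unicode-class suggestion together, one possible sketch of the "contains a special character" test treats anything that is not a Unicode letter, a Unicode number or whitespace as special, so a dot counts but ä ö å do not. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class SpecialCharCheck {
    // Not a letter (\p{L}), not a number (\p{N}), not whitespace
    private static final Pattern SPECIAL = Pattern.compile("[^\\p{L}\\p{N}\\s]");

    static boolean hasSpecialChar(String s) {
        return SPECIAL.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasSpecialChar("8x8 Inc."));                     // true, because of '.'
        // a-diaeresis, o-diaeresis, a-ring followed by plain text
        System.out.println(hasSpecialChar("\u00e4\u00f6\u00e5 abc 123"));   // false
    }
}
```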

match whole sentence with regex

I'm trying to match sentences without capital letters with regex in Java:
"Hi this is a test" -> Shouldn't match
"hi thiS is a test" -> Shouldn't match
"hi this is a test" -> Should match
I've tried the following regex, but it also matches my second example ("hi thiS is a test").
[a-z]+
It seems like it's only looking at the first word of the sentence.
Any help?
[a-z]+ will match if your string contains any lowercase letter.
If you want to make sure your string doesn't contain uppercase letters, you could use a negative character class: ^[^A-Z]+$
Be aware that this won't handle accented characters (like É) though.
To make this work, you can use Unicode properties: ^\P{Lu}+$
\P{...} means "not in the given Unicode category", and Lu is the category of uppercase letters.
^[a-z ]+$
Try this. It will validate the right ones.
It's not matching because you haven't included a space in the match pattern, so your regex only matches whole words with no spaces.
Try something like ^[a-z ]+$ instead (notice the space in the square brackets). You can also use \s, which is shorthand for "whitespace characters", but be aware that this also includes things like line feeds and carriage returns.
This pattern does the following:
^ matches the start of a string
[a-z ]+ matches any a-z character or a space, where 1 or more exists.
$ matches the end of the string.
I would actually advise against regex in this case, since you don't seem to use extended characters.
Instead, try testing as follows:
myString.equals(myString.toLowerCase());
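The two approaches from the answers above can be compared side by side. This is only a sketch with made-up names; Locale.ROOT is an added assumption to keep toLowerCase() independent of the default locale. The two checks differ on the empty string: the regex requires at least one character, while the equals() test accepts it.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class NoUppercaseCheck {
    // One or more characters, none of which is an uppercase letter (category Lu)
    private static final Pattern NO_UPPER = Pattern.compile("\\P{Lu}+");

    static boolean byRegex(String s) {
        return NO_UPPER.matcher(s).matches();
    }

    static boolean byLowerCase(String s) {
        return s.equals(s.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(byRegex("hi this is a test"));  // true
        System.out.println(byRegex("hi thiS is a test"));  // false
        // lowercase e-acute followed by uppercase E-acute (U+00C9)
        System.out.println(byRegex("\u00e9\u00c9"));       // false
        System.out.println(byLowerCase("hi this is a test")); // true
    }
}
```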

How to match all numerical characters and some single characters using regex

How can I match all numbers along with specific characters in a String using regex? I have this so far
if (!s.matches("[0-9]+")) return false;
I don't understand much regex, but this matches all characters from 0-9 and now I need to be able to match other specific characters, for example "/", ":", "$"
You can use this regex by including those symbols in a character class:
s.matches("[0-9$/:]+")
Read more about character class
You can add the other characters that you need to match to the end of the character group, like this:
if (!s.matches("[0-9/:$]+")) return false;
You need to be careful about several things:
If ^ is among the characters, it must not be the first one of the group (or it must be escaped)
If - is among the characters, put it first or last in the group (or escape it), so that it is not read as a range operator
If ] is among the characters, it needs to be escaped for regex and for Java, e.g. [\\]]
If \ is among the characters, it needs to be escaped for regex and for Java, e.g. [\\\\]
Regex (the characters must be wrapped in a character class, otherwise $ acts as an end anchor):
String regex = "[\\d/:$]+";
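The escaping rules listed above can be combined in one class. The sketch below (hypothetical class and method names) allows digits plus / : $ ^ ] \ and -, placing ^ away from the first position, escaping ] and \, and putting - last.

```java
import java.util.regex.Pattern;

final class AllowedSymbols {
    // Regex after Java string unescaping: [0-9/:$^\]\\-]+
    private static final Pattern ALLOWED = Pattern.compile("[0-9/:$^\\]\\\\-]+");

    static boolean isValid(String s) {
        // matches() requires the whole string to consist of allowed characters
        return ALLOWED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("12:30/$"));  // true
        System.out.println(isValid("^-]\\"));    // true: caret, dash, bracket, backslash
        System.out.println(isValid("a1"));       // false: 'a' is not allowed
    }
}
```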

Java Regular Expression: what is " '- "

I came up to a line in java that uses regular expressions.
It needs a user input of Last Name
return lastName.matches( "[a-zA-z]+([ '-][a-zA-Z]+)*" );
I would like to know what is the function of the [ '-].
Also why do we need both a "+" and a "*" at the same time, and the [ '-][a-zA-Z] is in brackets?
Your RE is: [a-zA-z]+([ '-][a-zA-Z]+)*
I'll break it down into its component parts:
[a-zA-Z]+
The string must begin with any letter, a-z or A-Z, repeated one or more times (+).
([ '-][a-zA-Z]+)*
[ '-]
Any single character of <space>, ', or -.
[a-zA-Z]+
Again, any letter, a-z or A-Z, repeated once or more times.
This combination of a separator ([ '-]) followed by letters ([a-zA-Z]+) may then be repeated zero or more times.
Why [ '-]? To allow for hyphenated names, such as Higgs-Boson, names with apostrophes, such as O'Reilly, and names with spaces, such as Van Dyke.
The expression [ '-] means "one of space, ', or -". The position of the dash matters: it must come last (or first, or be escaped); otherwise it would define a range, and [ -'] would also accept every character with a code point between the space and the quote '.
+ means "one or more repetitions"; * means "zero or more repetitions", referring to the term of the regular expression preceding the + or * modifier.
Overall, the expression matches groups of lowercase and uppercase letters separated by spaces, dashes, or single quotes.
It means it can be any of the characters space, ' or - (space, quote, dash).
The - can also be written as \-, since an unescaped - can denote a range, like a-z.
This looks like it is a pattern to match double-barreled (space or hyphen) or I-don't-know-what-to-call-it names like O'Grady... for example:
It would match
counter-terrorism
De'ville
O'Grady
smith-jones
smith and wesson
But it will not match
jones-
O'Learys'
#hashtag
Bob & Sons
The idea is, after the first [A-Za-z]+ consumes all the letters it can, the match will end right there unless the next character is a space, an apostrophe, or a hyphen ([ '-]). If one of those characters is present, it must be followed by at least one more letter.
A lot of people have difficulty with this. They naively write something like [A-Za-z]+[ '-]?[A-Za-z]*, figuring both the separator and the extra chunks of letters are optional. But they're not independently optional; if there is a separator ([ '-]), it must be followed by at least one more letter. Otherwise it would treat strings like R'- j'-' as valid. Your regex doesn't have that problem.
By the way, you've got a typo in your regex: [a-zA-z]. You want to watch out for that, because [A-z] does match all the uppercase and lowercase letters, so it will seem to be working correctly as long as the inputs are valid. But it also matches several non-letter characters whose code points happen to lie between those of Z and a. And very few IDEs or regex tools will catch that error.
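The behavior described in these answers can be checked with a short sketch. It uses the corrected class [a-zA-Z] throughout and a separate helper to show what the [A-z] typo lets through; all class and method names are invented for the example.

```java
import java.util.regex.Pattern;

final class LastNameCheck {
    // Letters, then zero or more groups of (separator + at least one letter)
    private static final Pattern NAME =
            Pattern.compile("[a-zA-Z]+([ '-][a-zA-Z]+)*");

    static boolean isValid(String lastName) {
        return NAME.matcher(lastName).matches();
    }

    // Shows which single characters the mistyped range [A-z] accepts
    static boolean typoRangeAccepts(char c) {
        return String.valueOf(c).matches("[A-z]");
    }

    public static void main(String[] args) {
        System.out.println(isValid("O'Grady"));           // true
        System.out.println(isValid("smith and wesson"));  // true
        System.out.println(isValid("jones-"));            // false: separator without letters
        System.out.println(typoRangeAccepts('_'));        // true: '_' lies between 'Z' and 'a'
    }
}
```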
