Regex to exclude special characters Java - java

I want to write a regex to include: Letters, Digits, and Spaces but I want to exclude special characters like !'^+%&/()=?_-*£#$, etc.
I thought I could use [a-zA-Z] for letters, [0-9] for digits and \s for whitespace characters:
[a-zA-Z0-9\s]
but the string I am trying to clear might have letters like é,ü,ğ,i,ç and so on.
I do not want these letters to be removed.
Is it possible to write such regex?

Yes, it is possible.
\p{L} matches any Unicode letter: a-z as well as letters like é,ü,ğ,i,ç
\d matches a digit (equal to [0-9])
\s matches a space, tab, carriage return, new line, vertical tab or form feed character
[\p{L}\d\s]+ matches one or more characters from that list
Here you can see an example:
https://regex101.com/r/uQmu7a/1
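To strip the unwanted characters in Java, one option is to negate that character class and pass it to a replace call. The following is only a minimal sketch; the class name SpecialCharStripper and the method stripSpecial are made up for illustration:

```java
import java.util.regex.Pattern;

class SpecialCharStripper {
    // Matches runs of anything that is NOT a Unicode letter, an ASCII digit, or whitespace.
    private static final Pattern SPECIAL = Pattern.compile("[^\\p{L}\\d\\s]+");

    // Hypothetical helper: removes every special character from the input.
    static String stripSpecial(String input) {
        return SPECIAL.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // \u00e9 is é, \u00fc is ü; escapes avoid source-encoding problems.
        System.out.println(stripSpecial("caf\u00e9 #1 (g\u00fcn)!"));
    }
}
```

Letters such as é and ü survive because \p{L} covers every Unicode letter, not only a-z.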

If you prefer a non-regex approach, you can use Apache StringUtils.isAlphanumericSpace(String str)

You could go a different way.
Note: both regexes below have to be run with the Unicode character class flag (in Java, Pattern.UNICODE_CHARACTER_CLASS or the embedded flag (?U)).
There are two ways to go.
The first uses alnum and stays within the ASCII and Extended-ASCII range.
Note that ğ (U+011F, LATIN SMALL LETTER G WITH BREVE) is outside
the 0 - FF range in the regex below, so it won't get matched.
(?:\p{Alnum}(?<=[\x{00}-\x{FF}])|\s)+
Explained
(?:
\p{Alnum} # Any alpha numeric Unicode
(?<= [\x{00}-\x{FF}] ) # In the U+0 - U+0FF codepoint range
| # or,
\s # Whitespace
)+
Or, you can go the Latin classes route, using the Latin blocks and script and staying within the alnum range.
(?:[\p{Block=Latin_1_Supplement}\p{Block=Latin_Extended_A}\p{Block=Latin_Extended_Additional}\p{Block=Latin_Extended_B}\p{Block=Latin_Extended_C}\p{Block=Latin_Extended_D}\p{Block=Basic_Latin}\p{Script=Latin}](?<=\p{Alnum})|\s)+
Expanded
(?:
[\p{Block=Latin_1_Supplement}\p{Block=Latin_Extended_A}\p{Block=Latin_Extended_Additional}\p{Block=Latin_Extended_B}\p{Block=Latin_Extended_C}\p{Block=Latin_Extended_D}\p{Block=Basic_Latin}\p{Script=Latin}]
(?<= \p{Alnum} )
|
\s
)+
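As a hedged illustration of the first variant (alnum restricted to the U+0000 - U+00FF range), here is how it might be compiled in Java with the Unicode character class flag. The class and method names are invented for this sketch:

```java
import java.util.regex.Pattern;

class Latin1AlnumCheck {
    // First variant from the answer above. UNICODE_CHARACTER_CLASS makes
    // \p{Alnum} Unicode-aware; the lookbehind then rejects anything above U+00FF.
    private static final Pattern LATIN1_ALNUM_OR_SPACE = Pattern.compile(
            "(?:\\p{Alnum}(?<=[\\x{00}-\\x{FF}])|\\s)+",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: true if the whole input consists of such characters.
    static boolean isLatin1AlnumOrSpace(String input) {
        return LATIN1_ALNUM_OR_SPACE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLatin1AlnumOrSpace("caf\u00e9 42")); // é is U+00E9
        System.out.println(isLatin1AlnumOrSpace("\u011f"));       // ğ is U+011F
    }
}
```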

Related

Regular expression --- username valid expression [duplicate]

http://regexr.com/3ars8
^(?=.*[0-9])(?=.*[A-z])[0-9A-z-]{17}$
Should match "17 alphanumeric chars, hyphens allowed too, must include at least one letter and at least one number"
It'll correctly match:
ABCDF31U100027743
and correctly decline to match:
AB$DF31U100027743
(and almost any other non-alphanumeric char)
but will apparently allow:
AB^DF31U100027743
Because your character class [A-z] matches this symbol.
[A-z] matches [, \, ], ^, _, `, and the English letters.
Actually, it is a common mistake. You should use [a-zA-Z] instead to only allow English letters.
So, this regex (with i option) won't capture your string.
^(?=.*[0-9])(?=.*[a-z])[0-9a-z-]{17}$
In my opinion, it is always safer to use the ignore-case option to avoid such an issue and to shorten the regex.
A regex range is evaluated over character codes; the ASCII printable characters run from the space to the tilde.
The token [A-z] matches every character from A through z in that table, including the punctuation that sits between Z and a. Likewise, the token [ -~] matches everything from SPACE to tilde.
You're allowing A-z (capital 'A' through lower 'z'). You don't say what regex package you're using, but it's not necessarily clear that A-Z and a-z are contiguous; there could be other characters in between. Try this instead:
^(?=.*[0-9])(?=.*[A-Za-z])[0-9A-Za-z-]{17}$
It seems to meet your criteria for me in regexpal.
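The difference between the two character classes can be checked with a small Java sketch. The class name AzRangeDemo and its helpers are assumptions for illustration; the patterns are the ones quoted above:

```java
import java.util.regex.Pattern;

class AzRangeDemo {
    // Original pattern: [A-z] also covers [ \ ] ^ _ ` which sit between 'Z' and 'a'.
    static final Pattern LOOSE =
            Pattern.compile("^(?=.*[0-9])(?=.*[A-z])[0-9A-z-]{17}$");
    // Corrected pattern with explicit upper- and lowercase ranges.
    static final Pattern STRICT =
            Pattern.compile("^(?=.*[0-9])(?=.*[A-Za-z])[0-9A-Za-z-]{17}$");

    static boolean matchesLoose(String s) {
        return LOOSE.matcher(s).matches();
    }

    static boolean matchesStrict(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String withCaret = "AB^DF31U100027743";
        System.out.println(matchesLoose(withCaret));  // accepted by [A-z]
        System.out.println(matchesStrict(withCaret)); // rejected by [A-Za-z]
    }
}
```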

Regex to match comma separated values

I'm new to Regex in Java and I wanted to know how can I build one that only takes a string that consists of one or two comma-separated lists of uppercase letters, separated by a single whitespace.
I would need to filter out strings that start with a comma, that end with a comma or strings that have multiple consecutive commas.
All these would be invalid:
"D,, D"
"D D,,"
"D, ,D"
"D, ,,D"
"D,, ,D"
"D,,"
",,A"
",A"
"A,"
All these would be valid:
"D,D T,F"
"D,D T"
"A,A"
"A"
I used (\s?("[\w\s]*"|\d*)\s?(,,|$)) for consecutive commas but it doesn't do the trick when the comma is at the end or beginning of one of the whitespace-separated substrings, like "D, ,D"
Should I aim to split by whitespace and look for a simpler regex for each of the substrings?
That would be something like this:
^[A-Z](,[A-Z])*( [A-Z](,[A-Z])*)*$
Here is what happens:
We expect a letter, optionally followed by one or more repetitions of a comma immediately followed by another letter.
Then we optionally accept a space followed by the same pattern, and this group can repeat. As written, that permits any number of lists; to allow at most two, replace the final * with ?.
Test: https://regex101.com/r/kzLhtw/1
You could, of course, slightly optimize the regex by making all capturing groups non-capturing: just put ?: immediately behind the (, that is, (?:.
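A quick way to try this pattern against the examples from the question is a small Java check like the one below. It uses the non-capturing form of the regex; the class name CommaListValidator and the method isValid are hypothetical:

```java
import java.util.regex.Pattern;

class CommaListValidator {
    // Non-capturing form of the regex above: letters joined by single commas,
    // with lists separated by a single space.
    private static final Pattern LISTS =
            Pattern.compile("^[A-Z](?:,[A-Z])*(?: [A-Z](?:,[A-Z])*)*$");

    // Hypothetical helper: true if the whole string is a valid sequence of lists.
    static boolean isValid(String s) {
        return LISTS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"D,D T,F", "D,D T", "A,A", "A", "D,, D", "D, ,D", ",A", "A,"};
        for (String sample : samples) {
            System.out.println("\"" + sample + "\" -> " + isValid(sample));
        }
    }
}
```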
You might use
^[A-Z](?: [A-Z])*(?:,[A-Z](?: [A-Z])*){0,2}$
^ Start of string
[A-Z] Match a single char A-Z
(?: [A-Z])* Optionally repeat a space and a single char A-Z
(?: Non capture group
,[A-Z](?: [A-Z])* Match a comma and a char A-Z, then optionally repeat a space and a char A-Z
){0,2} Close the group and repeat 0-2 times
$ End of string
Regex demo
"a string that consists of one or two comma-separated lists of uppercase letters, separated by a single whitespace"
Not sure exactly how to interpret the above, but my reading is: one or two comma-separated lists, where each list may only consist of uppercase characters. In the case of two lists, the two lists are separated by a single space.
You could try:
^(?!.* .* )[A-Z](?:[ ,][A-Z])*$
See the online demo
^ - Start string anchor.
(?!.* .* ) - Negative lookahead to prevent two spaces present.
[A-Z] - A single uppercase alpha-char.
(?: - Open non-capture group:
[ ,] - A comma or space.
[A-Z] - A single uppercase alpha-char.
)* - Close non-capture group and match 0+ times up to;
$ - End string anchor.
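The lookahead-based pattern can be exercised the same way. This is only a sketch, with the class name TwoListValidator and the method isValid chosen for illustration:

```java
import java.util.regex.Pattern;

class TwoListValidator {
    // The negative lookahead rejects any input containing two spaces,
    // which caps the number of space-separated lists at two.
    private static final Pattern AT_MOST_TWO_LISTS =
            Pattern.compile("^(?!.* .* )[A-Z](?:[ ,][A-Z])*$");

    // Hypothetical helper: true if the whole string matches the pattern.
    static boolean isValid(String s) {
        return AT_MOST_TWO_LISTS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("D,D T,F")); // one space, two lists
        System.out.println(isValid("A B C"));   // two spaces, three lists
    }
}
```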

Replacing characters in String using Meta characters or character classes

I am trying to remove all non-alphanumeric characters from a String that contains only lowercase letters.
I am using the replaceAll function and have looked at a few regexes
My reference is from: https://www.vogella.com/tutorials/JavaRegularExpressions/article.html which shows that
\s : A whitespace character, short for [ \t\n\x0b\r\f]
\W : A non-word character [^\w]
I tried the following in Java but the result still contained the spaces and symbols:
lowercased = lowercased.replaceAll("\\W\\s", "");
output:
amanaplanac analp anam a
May I know what is wrong?
Regex \W\s means "a non-word character followed by a whitespace character".
If you want to replace any character that is one of those, use one of these:
\W|\s where | means or
[\W\s] where [ ] is a character class that in this case merges the built-in special character classes \W and \s, because that's what those are.
Of the two, I recommend using the second.
Of course, having \s there is redundant, because \s means whitespace character, and \W means non-word character, and since whitespaces are not word characters, using \W alone is enough.
lowercased = lowercased.replaceAll("\\W+", "");
The regex \W matches any character that is not a digit (0-9), a letter (A-Z and a-z) or an underscore (_), and \s matches whitespace.
Since \W already matches every non-alphanumeric character except the underscore, whitespace included, there is no need to add \s.
Note that with \W alone, underscores are kept along with the alphanumeric characters.
Use the following to remove underscores as well.
lowercased = lowercased.replaceAll("\\W|_", "");
Use | (the or operator), as in \W|\s, since \W and \s are independent cases that you want to replace. And since whitespace characters are not word characters, \W alone is also enough.
lowercased = lowercased.replaceAll("\\W|\\s", "");
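To see how the patterns discussed above differ in practice, here is a small Java sketch. The class and method names are made up for illustration; the regexes are the ones from the answers, with [\W_]+ as an equivalent character-class spelling of \W|_:

```java
class NonWordStripper {
    // "\\W\\s" only matches a non-word character immediately followed by whitespace.
    static String stripPairs(String s) {
        return s.replaceAll("\\W\\s", "");
    }

    // "\\W+" removes every run of non-word characters (whitespace included)
    // but keeps underscores, because '_' counts as a word character.
    static String stripNonWord(String s) {
        return s.replaceAll("\\W+", "");
    }

    // "[\\W_]+" additionally removes underscores.
    static String stripNonAlnum(String s) {
        return s.replaceAll("[\\W_]+", "");
    }

    public static void main(String[] args) {
        String lowercased = "A man, a plan, a canal: Panama".toLowerCase();
        System.out.println(stripPairs(lowercased));
        System.out.println(stripNonWord(lowercased));
    }
}
```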

Regular expression to mask email except the three characters before the domain

I am trying to mask email address in the following different ways.
Mask all characters except the first three and the ones that follow the # symbol.
This expression works fine.
(?<=.{3}).(?=[^#]*?#)
abcdefgh#gmail.com -> abc*****#gmail.com
Mask all characters except last three before # symbol.
Example : abcdefgh#gmail.com -> *****fgh#gmail.com
I am not sure how to check for # and do reverse match.
Can someone throw pointers on this?
Maybe you could do a positive lookahead:
.(?=.*...#)
See the online Demo
. - Any character other than newline.
(?=.*...#) - Positive lookahead for zero or more characters other than newline followed by three characters other than newline and #.
You could use a negated character class [^\s#] matching a non whitespace char except an #. Then assert what is on the right is that negated character class 3 times followed by matching the # sign.
In the replacement use *
[^\s#](?=[^#\s]*[^#\s]{3}#)
[^\s#] Negated character class, match a non whitespace char except #
(?= Positive lookahead, assert what is on the right is
[^#\s]* Match 0+ times a non whitespace char except #
[^#\s]{3} Match 3 times a non whitespace char except #
# Match the #
) Close lookahead
Regex demo
If there can be only a single # in the email address, you could for example make use of a finite quantifier in the positive lookbehind:
(?<=(?<!\S)[^\s#]{0,1000})[^\s#](?=[^#\s]*[^#\s]{3}#[^\s#]+\.[a-z]{2,}(?!\S))
Regex demo
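Both masking variants can be tried in Java with String.replaceAll, using * as the replacement. The sketch below keeps the # separator exactly as written in the question; the class name EmailMasker and its method names are assumptions for illustration:

```java
class EmailMasker {
    // Pattern from the question: masks everything after the first three
    // characters, up to (but not including) the # symbol.
    static String maskAfterFirstThree(String email) {
        return email.replaceAll("(?<=.{3}).(?=[^#]*?#)", "*");
    }

    // Negated-character-class pattern from the answer: masks everything
    // before the # symbol except the last three characters.
    static String maskExceptLastThree(String email) {
        return email.replaceAll("[^\\s#](?=[^#\\s]*[^#\\s]{3}#)", "*");
    }

    public static void main(String[] args) {
        String email = "abcdefgh#gmail.com";
        System.out.println(maskAfterFirstThree(email));
        System.out.println(maskExceptLastThree(email));
    }
}
```

When the part before # has three characters or fewer, the second method leaves the input unchanged because the lookahead can never be satisfied.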

How to write a Regular expression to match any non alphabet or number and also matching dot

I need to match any special character in a string. For example, if the string has & % € (), etc. I could have Unicode alphabets such as ä ö å.
But I also want to match a dot ".". For example, for the string "8x8 Inc." it should return true, because it contains a dot.
I have tried a few expressions so far but none of them worked for me. Please let me know how it can be done. Thanks in advance!
You can do that one:
[^a-zA-Z\d\s] -> basically anything outside the group of all a-Z characters, digits and spaces. It will capture all other characters including special letters ä, dots, commas, braces etc
A simpler version would be [^\w\s] and it would match any non word/space characters but it will not match ä ö å
In a Java regex, .* matches any sequence of characters (line terminators excepted by default).
To match only a literal dot (.), escape it as \. instead.
In a Java program you have to double the backslash:
String regex="\\.";
Take a look at Unicode character classes. For your example, I think something like "(\\p{IsAlphabetic}|\\d)+" should work
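Building on the Unicode class suggestion, one way to detect a special character (the dot included) while leaving letters such as ä ö å alone is to negate the class and search with find(). This is a hedged sketch; the class name SpecialCharDetector and the method hasSpecial are hypothetical:

```java
import java.util.regex.Pattern;

class SpecialCharDetector {
    // Anything that is not alphabetic (Unicode-aware), not an ASCII digit,
    // and not whitespace counts as "special" here; that includes '.'.
    private static final Pattern SPECIAL =
            Pattern.compile("[^\\p{IsAlphabetic}\\d\\s]");

    // Hypothetical helper: true if the input contains at least one special character.
    static boolean hasSpecial(String s) {
        return SPECIAL.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasSpecial("8x8 Inc."));
        System.out.println(hasSpecial("\u00e4\u00f6\u00e5 abc 8")); // ä ö å
    }
}
```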
