Replacing characters in String using Meta characters or character classes - java

I am writing to remove all non-alphanumeric characters in a String with only lowercase letters.
I am using the replaceAll function and have looked at a few regexes
My reference is from: https://www.vogella.com/tutorials/JavaRegularExpressions/article.html which shows that
\s : A whitespace character, short for [ \t\n\x0b\r\f]
\W : A non-word character [^\w]
I tried the folllowing in Java but the results didn't remove the spaces or symbols:
lowercased = lowercased.replaceAll("\\W\\s", "");
output:
amanaplanac analp anam a
May I know what is wrong?

Regex \W\s means "a non-word character followed by a whitespace character".
If you want to replace any character that is one of those, use one of these:
\W|\s where | means or
[\W\s] where [ ] is a character class that in this case merges the built-in special character classes \W and \s, because that's what those are.
Of the two, I recommend using the second.
Of course, having \s there is redundant, because \s means whitespace character, and \W means non-word character, and since whitespaces are not word characters, using \W alone is enough.
lowercased = lowercased.replaceAll("\\W+", "");

Regex \W is meant for matching character's that are not numbers(0-9), alphabets(A-Z and a-z) and underscore (_). And /s is meant for matching space.
As /W already take care for matching non alphanumeric characters (excluding underscore). No need to use \s.
So if you are using \W you are allowing underscore(_) with alphanumeric values.
use the following to exclude underscore as well.
lowercased = lowercased.replaceAll("\\W|_", "");

Use | (or operator) like \W|\s since both \W and \s are independent case for which you want to replace. And since whitespace are not word character you can use \W only.
lowercased = lowercased.replaceAll("\\W|\\s", "");

Related

How to remove all special Characters from a string except - . and space

I have a string where I want to remove all special characters except hyphen , a dot and space.
I am using filename.replaceAll("[^a-zA-Z0-9.-]",""). It is working for . and - but not for space.
What should I add to this to make it work for space as well?
Use either \s or simply a space character as explained in the Pattern class javadoc
\s - A whitespace character: [ \t\n\x0B\f\r]
- Literal space character
You must either escape - character as \- so it won't be interpreted as range expression or make sure it stays as the last regex character. Putting it all together:
filename.replaceAll("[^a-zA-Z0-9\\s.-]", "")
filename.replaceAll("[^a-zA-Z0-9 .-]", "")
You can use this regex [^a-zA-Z0-9\s.-] or [^a-zA-Z0-9 .-]
\s matches whitespace and (space character) matches only space.
So in this case if you want to match whitespaces use this:
filename.replaceAll("[^a-zA-Z0-9\\s.-]", "");
And if you want only match space use this:
filename.replaceAll("[^a-zA-Z0-9 .-]", "");

Regex to exclude special characters Java

I want to write a regex to include: Letters, Digits, and Spaces but I want to exclude special characters like !'^+%&/()=?_-*£#$, etc.
I thought I can use [a-zA-Z] for Letters, [0-9] for Digits and \S for Space characters.
[a-zA-Z0-9\s]
but the string I am trying to clear might have letters like é,ü,ğ,i,ç and so on.
I do not want these letters to be removed.
Is it possible to write such regex?
Yes, it is possible.
\p{L} matches anything that is a Unicode letter a-z and letters like é,ü,ğ,i,ç
\d matches a digit (equal to [0-9])
\s matches a space, tab, carriage return, new line, vertical tab or form feed character
[\p{L}\d\s]+ should match one or more character present in the list
Here you can see an example:
https://regex101.com/r/uQmu7a/1
If you want to do it using non regex way then you can do it using Apache StringUtils.isAlphanumericSpace(String str)
You could go a different way.
Note - these two regex have to be run with the Unicode character class flag option.
There are two ways to go
Using alnum and staying within the Ascii and Extended-Ascii range.
Note that this U+011F ğ LATIN SMALL LETTER G WITH BREVE is outside
the 0 - FF range in the regex below, so that won't get matched.
(?:\p{Alnum}(?<=[\x{00}-\x{FF}])|\s)+
Explained
(?:
\p{Alnum} # Any alpha numeric Unicode
(?<= [\x{00}-\x{FF}] ) # In the U+0 - U+0FF codepoint range
| # or,
\s # Whitespace
)+
Or, you can go the Latin classes route, using Latin block's/script and staying within the alnum range.
(?:[\p{Block=Latin_1_Supplement}\p{Block=Latin_Extended_A}\p{Block=Latin_Extended_Additional}\p{Block=Latin_Extended_B}\p{Block=Latin_Extended_C}\p{Block=Latin_Extended_D}\p{Block=Basic_Latin}\p{Script=Latin}](?<=\p{Alnum})|\s)+
Expanded
(?:
[\p{Block=Latin_1_Supplement}\p{Block=Latin_Extended_A}\p{Block=Latin_Extended_Additional}\p{Block=Latin_Extended_B}\p{Block=Latin_Extended_C}\p{Block=Latin_Extended_D}\p{Block=Basic_Latin}\p{Script=Latin}]
(?<= \p{Alnum} )
|
\s
)+

regex for allowing only certain special characters and also including the alphanumerical characters

I'm struggling with REGEX and require it for a program.
The input require only alphanumerical keys and also (allow only comma,:,space,/,- in special chars)
I have tried = (^[a-zA-Z0-9,:\S/-]*$)
As far as i understand and please correct me if I'm wrong.
a-zA-Z0-9 - The alphanumerical keys.
,: - Comma and colon
\S - Space
/ - I'm not sure how to represent a forward slash thus i escaped it
- - Dash also not sure if it is needed to escape it.
Would be appreciated if this can be corrected and also a explanation of each part.
Thanks in advance.
You can replace a-zA-Z0-9 with just \\w which is short for [a-zA-Z_0-9]. Furthermore, \\S is any character, but not a whitespace, you should use a \\s instead. You don't need to escape /, and even - if it's the first one or the last one, because if it's placed between two characters it could be interpreted as range and you'll have to escape it. So, you can make your regex like ^([\w,:\s/-]*)$
The \S shorthand matches any character except whitespace, just the opposite of what you want. Lowercase \s matches whitespace [\t\v\n\r\f ]. But if you only want spaces, just put a space in the character class.
a hyphen - needs to be escaped inside characters, unless it's the first or last character in the character class, but you could always escape it just to be sure.
Slashes / don't need to be escaped. They're escaped in other languages where you use them as pattern delimiters. ie: /regex/i.
Besides hyphens and shorthands, only backslashes \\ and closing brackets \] need to be escaped.
Remember in java, you always need to use double backslashes (one is interpreted by java, the other by the regex engine).
Regex
pattern = "^[a-zA-Z0-9 ,:/\\-]*$"
Move the Start of Line ^ and End of Line $ outside the group - like
^([a-zA-Z0-9,:\S/-]*)$
That should do it.

Java regex "[.]" vs "."

I'm trying to use some regex in Java and I came across this when debugging my code.
What's the difference between [.] and .?
I was surprised that .at would match "cat" but [.]at wouldn't.
[.] matches a dot (.) literally, while . matches any character except newline (\n) (unless you use DOTALL mode).
You can also use \. ("\\." if you use java string literal) to literally match dot.
The [ and ] are metacharacters that let you define a character class. Anything enclosed in square brackets is interpreted literally. You can include multiple characters as well:
[.=*&^$] // Matches any single character from the list '.','=','*','&','^','$'
There are two specific things you need to know about the [...] syntax:
The ^ symbol at the beginning of the group has a special meaning: it inverts what's matched by the group. For example, [^.] matches any character except a dot .
Dash - in between two characters means any code point between the two. For example, [A-Z] matches any single uppercase letter. You can use dash multiple times - for example, [A-Za-z0-9] means "any single upper- or lower-case letter or a digit".
The two constructs above (^ and -) are common to nearly all regex engines; some engines (such as Java's) define additional syntax specific only to these engines.
regular-expression constructs
. => Any character (may or may not match line terminators)
and to match the dot . use the following
[.] => it will matches a dot
\\. => it will matches a dot
NOTE: The character classes in Java regular expression is defined using the square brackets "[ ]", this subexpression matches a single character from the specified or, set of possible characters.
Example : In string address replaces every "." with "[.]"
public static void main(String[] args) {
String address = "1.1.1.1";
System.out.println(address.replaceAll("[.]","[.]"));
}
if anything is missed please add :)

RegEx: match any non-word and non-digit character except

To match any non-word and non-digit character (special characters) I use this: [\\W\\D]. What should I add if I want to also ignore some concrete characters? Let's say, underscore.
First of all, you must know that \W is equivalent to [^a-zA-Z0-9_]. So, you can change your current regex to:
[\\W]
This will automatically take care of \D.
Now, if you want to ignore some other character, say & (underscore is already exluded in \W), you can use negated character class:
[^\\w&]

Categories

Resources