I have a string from which I want to remove all special characters except a hyphen, a dot, and a space.
I am using filename.replaceAll("[^a-zA-Z0-9.-]",""). It works for . and - but not for spaces.
What should I add to this to make it work for space as well?
Use either \s or simply a space character, as explained in the Pattern class Javadoc:
\s - A whitespace character: [ \t\n\x0B\f\r]
" " - Literal space character
You must either escape the - character as \- so it won't be interpreted as a range, or make sure it stays the last character in the character class. Putting it all together:
filename.replaceAll("[^a-zA-Z0-9\\s.-]", "")
filename.replaceAll("[^a-zA-Z0-9 .-]", "")
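As a minimal sketch of how the two patterns differ (the class name FilenameCleaner and the sample file name are made up for illustration), a tab survives the \s version but not the literal-space version:

```java
class FilenameCleaner {
    // Keeps letters, digits, dot, hyphen, and any whitespace (space, tab, newline, ...).
    static String keepWhitespace(String filename) {
        return filename.replaceAll("[^a-zA-Z0-9\\s.-]", "");
    }

    // Keeps letters, digits, dot, hyphen, and the plain space character only.
    static String keepSpaceOnly(String filename) {
        return filename.replaceAll("[^a-zA-Z0-9 .-]", "");
    }

    public static void main(String[] args) {
        String input = "my file\t(v2)_final-draft.txt";
        // The tab between "file" and "v2" is kept here.
        System.out.println(keepWhitespace(input));
        // Here the tab is removed along with the other special characters.
        System.out.println(keepSpaceOnly(input)); // my filev2final-draft.txt
    }
}
```

In both patterns the hyphen is the last character of the class, so it is taken literally.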
You can use the regex [^a-zA-Z0-9\s.-] or [^a-zA-Z0-9 .-].
\s matches any whitespace character, while a literal space character matches only a space.
So if you want to keep all whitespace, use this:
filename.replaceAll("[^a-zA-Z0-9\\s.-]", "");
And if you want to keep only spaces, use this:
filename.replaceAll("[^a-zA-Z0-9 .-]", "");
I am writing code to remove all non-alphanumeric characters from a String that contains only lowercase letters.
I am using the replaceAll function and have looked at a few regexes.
My reference is https://www.vogella.com/tutorials/JavaRegularExpressions/article.html, which shows that:
\s : A whitespace character, short for [ \t\n\x0b\r\f]
\W : A non-word character [^\w]
I tried the following in Java, but the spaces and symbols were not removed from the result:
lowercased = lowercased.replaceAll("\\W\\s", "");
output:
amanaplanac analp anam a
May I know what is wrong?
Regex \W\s means "a non-word character followed by a whitespace character".
If you want to replace any character that is one of those, use one of these:
\W|\s where | means or
[\W\s] where [ ] is a character class that in this case merges the built-in special character classes \W and \s, because that's what those are.
Of the two, I recommend using the second.
Of course, having \s there is redundant, because \s means whitespace character, and \W means non-word character, and since whitespaces are not word characters, using \W alone is enough.
lowercased = lowercased.replaceAll("\\W+", "");
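To make the difference concrete, here is a small sketch comparing the three patterns on one input. The class name NonWordDemo, the helper strip, and the sample sentence are my own choices, not taken from the question:

```java
class NonWordDemo {
    // Lowercases the text, then deletes every match of the given regex.
    static String strip(String text, String regex) {
        return text.toLowerCase().replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String input = "A man, a plan, a canal: Panama";
        // Only a non-word character FOLLOWED BY whitespace is removed (", " and ": ").
        System.out.println(strip(input, "\\W\\s"));   // a mana plana canalpanama
        // Any single non-word or whitespace character is removed.
        System.out.println(strip(input, "[\\W\\s]")); // amanaplanacanalpanama
        // \W alone already covers whitespace.
        System.out.println(strip(input, "\\W+"));     // amanaplanacanalpanama
    }
}
```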
The regex \W matches characters that are not digits (0-9), letters (A-Z and a-z), or the underscore (_). \s matches whitespace.
Since \W already matches every non-alphanumeric character other than the underscore, including whitespace, there is no need to add \s.
Note that \W does not match the underscore, so using \W alone leaves underscores in the string along with the alphanumeric characters.
Use the following to remove underscores as well:
lowercased = lowercased.replaceAll("\\W|_", "");
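A quick sketch of the underscore behaviour, with a made-up class name and input. The character-class form [\W_] is an equivalent way of writing the alternation:

```java
class UnderscoreDemo {
    static String strip(String text, String regex) {
        return text.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String input = "snake_case 42!";
        // \W treats the underscore as a word character, so it stays.
        System.out.println(strip(input, "\\W+"));     // snake_case42
        // Adding _ as an alternative removes it too.
        System.out.println(strip(input, "\\W|_"));    // snakecase42
        // Same result, written as a single character class.
        System.out.println(strip(input, "[\\W_]+"));  // snakecase42
    }
}
```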
Use | (the or operator), as in \W|\s, since \W and \s are independent cases that you want to replace. And since whitespace characters are not word characters, \W alone is enough.
lowercased = lowercased.replaceAll("\\W|\\s", "");
I want to write a regex to include: Letters, Digits, and Spaces but I want to exclude special characters like !'^+%&/()=?_-*£#$, etc.
I thought I could use [a-zA-Z] for letters, [0-9] for digits, and \s for space characters:
[a-zA-Z0-9\s]
but the string I am trying to clean might contain letters like é,ü,ğ,i,ç and so on.
I do not want these letters to be removed.
Is it possible to write such regex?
Yes, it is possible.
\p{L} matches any Unicode letter: a-z as well as letters like é,ü,ğ,i,ç
\d matches a digit (equal to [0-9])
\s matches a space, tab, carriage return, new line, vertical tab or form feed character
[\p{L}\d\s]+ matches one or more characters from the list
Here you can see an example:
https://regex101.com/r/uQmu7a/1
If you want a non-regex way, you can use Apache Commons Lang's StringUtils.isAlphanumericSpace(String str), which checks whether a string contains only Unicode letters, digits, or spaces.
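In Java, the \p{L} approach could look roughly like the sketch below. The class name UnicodeLetterFilter and the sample text are assumptions for illustration; the sample is written with Unicode escapes so the result does not depend on the source file encoding, and it assumes precomposed characters:

```java
class UnicodeLetterFilter {
    // Removes everything that is not a Unicode letter, an ASCII digit, or whitespace.
    static String clean(String s) {
        return s.replaceAll("[^\\p{L}\\d\\s]", "");
    }

    // ASCII-only variant, shown for comparison: accented letters are lost.
    static String cleanAscii(String s) {
        return s.replaceAll("[^a-zA-Z0-9\\s]", "");
    }

    public static void main(String[] args) {
        // A Turkish word with an apostrophe suffix, then "cafe" with an acute e, then " #42!"
        String input = "\u00C7a\u011Fr\u0131'n\u0131n caf\u00E9 #42!";
        System.out.println(clean(input));      // accented letters kept; ', # and ! removed
        System.out.println(cleanAscii(input)); // arnn caf 42
    }
}
```

Negating the class ([^...]) and replacing with an empty string removes the unwanted characters while leaving the rest in place.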
You could go a different way.
Note: both regexes below have to be run with the Unicode character class flag (Pattern.UNICODE_CHARACTER_CLASS in Java).
There are two ways to go.
Using alnum and staying within the ASCII and Extended-ASCII range.
Note that U+011F ğ LATIN SMALL LETTER G WITH BREVE is outside the 0 - FF range in the regex below, so it won't be matched.
(?:\p{Alnum}(?<=[\x{00}-\x{FF}])|\s)+
Explained
(?:
\p{Alnum} # Any alpha numeric Unicode
(?<= [\x{00}-\x{FF}] ) # In the U+0 - U+0FF codepoint range
| # or,
\s # Whitespace
)+
Or, you can go the Latin classes route, using the Latin blocks and script and staying within the alnum range.
(?:[\p{Block=Latin_1_Supplement}\p{Block=Latin_Extended_A}\p{Block=Latin_Extended_Additional}\p{Block=Latin_Extended_B}\p{Block=Latin_Extended_C}\p{Block=Latin_Extended_D}\p{Block=Basic_Latin}\p{Script=Latin}](?<=\p{Alnum})|\s)+
Expanded
(?:
[\p{Block=Latin_1_Supplement}\p{Block=Latin_Extended_A}\p{Block=Latin_Extended_Additional}\p{Block=Latin_Extended_B}\p{Block=Latin_Extended_C}\p{Block=Latin_Extended_D}\p{Block=Basic_Latin}\p{Script=Latin}]
(?<= \p{Alnum} )
|
\s
)+
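To illustrate why the flag matters, here is a sketch in Java using the first (0 - FF range) regex. The class name and the sample words are my own, and I am assuming Pattern.UNICODE_CHARACTER_CLASS is the flag meant above:

```java
import java.util.regex.Pattern;

class UnicodeAlnumDemo {
    static final String REGEX = "(?:\\p{Alnum}(?<=[\\x{00}-\\x{FF}])|\\s)+";

    // Without the flag, \p{Alnum} only covers ASCII letters and digits.
    static final Pattern ASCII_ONLY = Pattern.compile(REGEX);
    // With the flag, \p{Alnum} covers Unicode letters and digits.
    static final Pattern UNICODE = Pattern.compile(REGEX, Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        String eAcute = "caf\u00E9"; // ends in U+00E9, inside the 0 - FF range
        String gBreve = "da\u011F";  // ends in U+011F, outside the 0 - FF range

        System.out.println(ASCII_ONLY.matcher(eAcute).matches()); // false
        System.out.println(UNICODE.matcher(eAcute).matches());    // true
        System.out.println(UNICODE.matcher(gBreve).matches());    // false
    }
}
```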
(^\s*\d+\)(.*) | ) | (^\s*Q\d+\.\s*(.*))
The above regex does not match Q1. qeqwewqeqeq qerqer
But if I remove the whitespace before and after |
(^\s*\d+\)(.*) | )|(^\s*Q\d+\.\s*(.*))
It matches my string.
What does white space mean? Is it equal to \s? It affects my readability.
Yes, whitespace affects your regex. No, it is not equivalent to \s.
The \s shorthand character class is equivalent to the character class [ \t\n\x0B\f\r], i.e. a character class that matches any whitespace character. So, while your formatting spaces are matched by \s, a literal space is not equivalent to it.
As has been said in the comments, literal whitespace is important in regexes. In fact, I believe it's causing an error in your first alternate (the sub-pattern (^\s*\d+\)(.*) | )).
If I'm reading the intent of that sub-pattern right, it's supposed to match text of the form
2) some_text
But it will:
Only match this text if it's followed by a space
Also match a single literal space
A better way to construct this sub-pattern would be (^\s*\d+\)(.*)), disposing of the end space and the alternation altogether. Furthermore, in order to improve readability, we can do this:
(^\s*(?:Q\d+\.|\d+\))\s*(.*))
Which only alternates on the question number format, rather than the whole pattern.
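A minimal sketch of that combined pattern in Java (the class name QuestionLineParser and the helper textOf are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuestionLineParser {
    // Group 1 is the whole match; group 2 is the text after the question number.
    static final Pattern LINE = Pattern.compile("(^\\s*(?:Q\\d+\\.|\\d+\\))\\s*(.*))");

    // Returns the text after the number, or null if the line has no recognised number.
    static String textOf(String line) {
        Matcher m = LINE.matcher(line);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(textOf("Q1. qeqwewqeqeq qerqer")); // qeqwewqeqeq qerqer
        System.out.println(textOf("  2) some_text"));         // some_text
        System.out.println(textOf("no number here"));         // null
    }
}
```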
The contents of a regex are 100% applicable to the determination of whether or not an input matches. Your imagination does not change regex processing.
The regex "\dignore this part\d" will not match the input "12" but will match the input "1ignore this part2". No matter how much you imagine that the "ignore this part" will be skipped, it is still part of the regular expression.
In your case, the extra spaces are your form of "ignore this part".
Inside a regex pattern, spaces are meaningful atoms that match spaces. If you need to format your pattern with spaces and tabs and newlines - with whitespace that will not be accounted for by the regex engine - you may use the (?x) modifier, or the Pattern.COMMENTS flag.
Then, to match a literal space in a pattern compiled with the (?x) option, you need to escape it with a backslash. Or, you may consider matching any whitespace with \s:
\s A whitespace character: [ \t\n\x0B\f\r]
Note that if you add the (?U) modifier (the Pattern.UNICODE_CHARACTER_CLASS flag), \s will match all Unicode whitespace (like [\p{Zs}\t\r\n]).
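As a sketch of the Pattern.COMMENTS behaviour, the same spaced-out pattern string is compiled below with and without the flag. The class name, the pattern layout, and the "foo bar" example are mine:

```java
import java.util.regex.Pattern;

class CommentsFlagDemo {
    // A pattern formatted with spaces for readability.
    static final String SPACED = "^ \\s* (?: Q\\d+\\. | \\d+\\) ) \\s* (.*)";

    // Without the flag, every space in SPACED must match a literal space.
    static final Pattern LITERAL = Pattern.compile(SPACED);
    // With the flag, the formatting whitespace is ignored.
    static final Pattern FORMATTED = Pattern.compile(SPACED, Pattern.COMMENTS);
    // An escaped space still matches a literal space in COMMENTS mode.
    static final Pattern ESCAPED_SPACE = Pattern.compile("foo \\  bar", Pattern.COMMENTS);

    public static void main(String[] args) {
        String input = "Q1. qeqwewqeqeq qerqer";
        System.out.println(LITERAL.matcher(input).find());              // false
        System.out.println(FORMATTED.matcher(input).find());            // true
        System.out.println(ESCAPED_SPACE.matcher("foo bar").matches()); // true
        System.out.println(ESCAPED_SPACE.matcher("foobar").matches());  // false
    }
}
```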
I'm struggling with regex and need it for a program.
The input should allow only alphanumeric characters plus a few special characters: comma, colon, space, /, and -.
I have tried: (^[a-zA-Z0-9,:\S/-]*$)
This is how I understand it; please correct me if I'm wrong:
a-zA-Z0-9 - The alphanumerical keys.
,: - Comma and colon
\S - Space
/ - I'm not sure how to represent a forward slash, so I escaped it
- - Dash; I'm also not sure whether it needs to be escaped.
I would appreciate a correction and an explanation of each part.
Thanks in advance.
You can replace a-zA-Z0-9 with \w, which is short for [a-zA-Z_0-9]; note that this also allows the underscore. Furthermore, \S matches any character that is not whitespace, so you should use \s instead. You don't need to escape /, nor - when it is the first or last character in the class; if it is placed between two characters it can be interpreted as a range, and then you have to escape it. So you can write your regex as ^([\w,:\s/-]*)$ (in a Java string literal, double the backslashes: \\w and \\s).
The \S shorthand matches any character except whitespace, just the opposite of what you want. Lowercase \s matches whitespace [ \t\n\x0B\f\r]. But if you only want spaces, just put a space in the character class.
A hyphen - needs to be escaped inside a character class unless it's the first or last character in the class, but you could always escape it just to be sure.
Slashes / don't need to be escaped. They're escaped in other languages where you use them as pattern delimiters, e.g. /regex/i.
Besides hyphens and shorthands, only backslashes \\ and closing brackets \] need to be escaped, plus a caret ^ if it is the first character of the class and, in Java, an opening bracket [ since it starts a nested class.
Remember that in Java you always need to use double backslashes in string literals (one is interpreted by Java, the other by the regex engine).
Regex
pattern = "^[a-zA-Z0-9 ,:/\\-]*$"
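Here is a small sketch of that pattern used for validation, next to the original \S version for comparison. The class name AllowedCharsValidator and the sample inputs are invented for illustration:

```java
import java.util.regex.Pattern;

class AllowedCharsValidator {
    // Letters, digits, space, comma, colon, slash, and hyphen only.
    static final Pattern ALLOWED = Pattern.compile("^[a-zA-Z0-9 ,:/\\-]*$");
    // The pattern from the question: \S lets every non-whitespace character through.
    static final Pattern MISTAKEN = Pattern.compile("^[a-zA-Z0-9,:\\S/-]*$");

    static boolean isValid(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Room 12/B, 10:30 - ok")); // true
        System.out.println(isValid("hello!"));                // false
        System.out.println(isValid("a_b"));                   // false

        // With \S the exclamation mark is accepted, but a space is rejected.
        System.out.println(MISTAKEN.matcher("hello!").matches());      // true
        System.out.println(MISTAKEN.matcher("hello world").matches()); // false
    }
}
```

Because the quantifier is *, the empty string is also accepted; use + instead if at least one character is required.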
Move the start-of-line ^ and end-of-line $ anchors outside the group, and use \s rather than \S for whitespace, like
^([a-zA-Z0-9,:\s/-]*)$
That should do it.
I just started learning Java regex today, so excuse me for taking a rather lazy approach to learning what's wrong with my regex. Basically, I'm trying to split a string on the pattern whitespace, -, whitespace (where each whitespace is optional: one or none), and I'm not sure why the pattern isn't compiling. I get an error on the second s in [^(\s?-\s?)] (index 8). If someone could help me out, I'd really appreciate it!
You are placing the pattern you want to split on inside a negated character class, which does the exact opposite of what you expect.
[^(\s?-\s?)] # matches any character except:
# '('
# whitespace (\n, \r, \t, \f, and " ")
# '?'
# '-'
# whitespace (\n, \r, \t, \f, and " ")
# '?'
# ')'
Your syntax is indeed incorrect, but why will it not compile? Inside a character class the hyphen has a special meaning: it denotes a range. You can place a hyphen as the first or last character of the class. In some regular expression implementations, you can also place it directly after a range. If you place the hyphen anywhere else, you need to escape it in order to add it to your class. Here, ?-\s is read as a range from ? to \s, which is not a valid range.
To fix the compile issue, you would simply escape the hyphen; still, this regex does not do what you want.
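The following sketch shows both points: the unescaped version fails to compile, and the escaped version compiles but matches the wrong characters. The class name and the helper compiles are hypothetical:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class HyphenInClassDemo {
    // Returns true if the regex compiles, false if it is rejected as invalid.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The hyphen between ? and \s is read as an (illegal) range.
        System.out.println(compiles("[^(\\s?-\\s?)]"));   // false
        // Escaping the hyphen makes it a literal member of the class.
        System.out.println(compiles("[^(\\s?\\-\\s?)]")); // true

        // The negated class matches every character EXCEPT ( ) ? - and whitespace,
        // so it removes the words and keeps the separator.
        System.out.println("[" + "foo - bar".replaceAll("[^(\\s?\\-\\s?)]", "") + "]"); // [ - ]
    }
}
```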
I'm trying to split a string on white space, -, white space pattern...
Remove the character class and the capturing group from the pattern:
String s = "foo - bar - baz-quz";
String[] parts = s.split("\\s?-\\s?");
System.out.println(Arrays.toString(parts)); //=> [foo, bar, baz, quz]
Here are a few references for learning regular expressions.
Regular-Expressions.info
RexEgg - Tutorial
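\s is itself a shorthand character class. Characters inside [] are treated as individual literal characters (or, in the case of [^], as characters to exclude), so ( ) and ? lose their grouping and quantifier meaning there. \s is allowed inside [], but it cannot carry a quantifier such as ? in that position.
Perhaps you meant to use parentheses instead of square brackets?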