Why does adding whitespace make my regex wrong? - java

(^\s*\d+\)(.*) | ) | (^\s*Q\d+\.\s*(.*))
The above regex is not matching Q1. qeqwewqeqeq qerqer
But if I remove the whitespace before and after |
(^\s*\d+\)(.*) | )|(^\s*Q\d+\.\s*(.*))
It matches my string.
What does white space mean? Is it equal to \s? It affects my readability.

Yes, whitespace affects your regex. No, it is not equivalent to \s.
The \s shorthand character class is equivalent in Java to the character class [ \t\n\x0B\f\r], i.e. a character class that matches any whitespace character. So, while your formatting spaces are matched by \s, a literal space is not equivalent to it.
As has been said in the comments, literal whitespace is significant in regexes. In fact, I believe it's causing an error in your first alternative (the sub-pattern (^\s*\d+\)(.*) | )).
If I'm reading the intent of that sub-pattern right, it's supposed to match text of the form
2) some_text
But it will:
Only match this text if it's followed by a space
Also match a single literal space
A better way to construct this sub-pattern would be (^\s*\d+\)(.*)), disposing of the end space and the alternation altogether. Furthermore, in order to improve readability, we can do this:
(^\s*(?:Q\d+\.|\d+\))\s*(.*))
Which only alternates on the question number format, rather than the whole pattern.
Demo on Regex101
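To see the effect of the formatting spaces concretely, here is a minimal sketch that runs the three patterns discussed in this answer against the sample line from the question. The class and method names are made up for illustration, and it assumes the pattern is applied with Matcher.find(), since the question does not say which method was used.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; all names here are illustrative only.
class SpacedAlternationDemo {
    // The question's pattern, with formatting spaces around the second '|'.
    static final Pattern SPACED =
            Pattern.compile("(^\\s*\\d+\\)(.*) | ) | (^\\s*Q\\d+\\.\\s*(.*))");
    // The same pattern with those two spaces removed.
    static final Pattern UNSPACED =
            Pattern.compile("(^\\s*\\d+\\)(.*) | )|(^\\s*Q\\d+\\.\\s*(.*))");
    // The consolidated pattern suggested in this answer.
    static final Pattern COMBINED =
            Pattern.compile("(^\\s*(?:Q\\d+\\.|\\d+\\))\\s*(.*))");

    // True if the pattern matches anywhere in the input (assumes find() semantics).
    static boolean found(Pattern pattern, String input) {
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        String sample = "Q1. qeqwewqeqeq qerqer";
        System.out.println("spaced:   " + found(SPACED, sample));
        System.out.println("unspaced: " + found(UNSPACED, sample));
        System.out.println("combined: " + found(COMBINED, sample));
        System.out.println("combined: " + found(COMBINED, "2) some_text"));
    }
}
```

With the spaces in place, the second alternative demands a literal space before ^, which cannot occur at the start of the input, so the sample line is rejected.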

The contents of a regex are 100% applicable to the determination of whether or not an input matches. Your imagination does not change regex processing.
The regex "\dignore this part\d" will not match the input "12" but will match the input "1ignore this part2". No matter how much you imagine that the "ignore this part" will be skipped, it is still part of the regular expression.
In your case, the extra spaces are your form of "ignore this part".
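The two claims above are easy to check. This is a small sketch with a made-up class name; it assumes the whole input is tested with Matcher.matches().

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the "ignore this part" example.
class LiteralTextDemo {
    // Every character between the two \d tokens is matched literally, spaces included.
    static final Pattern PATTERN = Pattern.compile("\\dignore this part\\d");

    // True only if the entire input matches the pattern.
    static boolean matchesWhole(String input) {
        return PATTERN.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("12"));
        System.out.println(matchesWhole("1ignore this part2"));
    }
}
```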

Inside a regex pattern, spaces are meaningful atoms that match spaces. If you need to format your pattern with spaces and tabs and newlines - with whitespace that will not be accounted for by the regex engine - you may use the (?x) modifier, or the Pattern.COMMENTS flag.
Then, to match a literal space in a pattern that uses the (?x) option, you need to escape it. Or, you may consider matching any whitespace with \s:
\s A whitespace character: [ \t\n\x0B\f\r]
Note that if you add the (?U) modifier (the Pattern.UNICODE_CHARACTER_CLASS flag), \s will match all Unicode whitespace (like [\p{Zs}\t\r\n]).
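As a rough illustration of Pattern.COMMENTS, the consolidated alternation from the first answer can be laid out over two lines with formatting spaces and # comments. The class name, the helper method, and the exact layout are my own assumptions, not something from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the layout of the pattern is only one possible formatting.
class FreeSpacingDemo {
    // With Pattern.COMMENTS, unescaped whitespace and '#' comments are ignored,
    // so the spaces below are purely for readability.
    static final Pattern READABLE = Pattern.compile(
            "^\\s* (?: Q\\d+\\. | \\d+\\) )   # question number: Q1. or 1)\n"
          + "\\s* (.*)                        # the rest of the line",
            Pattern.COMMENTS);

    // Returns the text after the question number, or null if the line has none.
    static String textAfterNumber(String line) {
        Matcher m = READABLE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(textAfterNumber("Q1. qeqwewqeqeq qerqer"));
        System.out.println(textAfterNumber("2) some_text"));
        System.out.println(textAfterNumber("no number here"));
    }
}
```

Each comment runs to the end of its line, which is why the first string fragment ends with \n.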

Related

How to remove all special Characters from a string except - . and space

I have a string from which I want to remove all special characters except a hyphen, a dot, and a space.
I am using filename.replaceAll("[^a-zA-Z0-9.-]",""). It is working for . and - but not for space.
What should I add to this to make it work for space as well?
Use either \s or simply a space character as explained in the Pattern class javadoc
\s - A whitespace character: [ \t\n\x0B\f\r]
' ' - Literal space character
You must either escape the - character as \- so it won't be interpreted as a range expression, or make sure it stays the last character in the character class. Putting it all together:
filename.replaceAll("[^a-zA-Z0-9\\s.-]", "")
filename.replaceAll("[^a-zA-Z0-9 .-]", "")
You can use this regex [^a-zA-Z0-9\s.-] or [^a-zA-Z0-9 .-]
\s matches any whitespace character, while a literal space character matches only a space.
So if you want to keep all whitespace, use this:
filename.replaceAll("[^a-zA-Z0-9\\s.-]", "");
And if you want to keep only spaces, use this:
filename.replaceAll("[^a-zA-Z0-9 .-]", "");
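The difference between the two character classes shows up as soon as the input contains a tab. A minimal sketch, with hypothetical class and method names and a made-up file name:

```java
// Hypothetical helper class wrapping the two replaceAll calls from the answers.
class FilenameSanitizer {
    // Keeps letters, digits, dot, hyphen, and any whitespace character (\s).
    static String keepWhitespace(String name) {
        return name.replaceAll("[^a-zA-Z0-9\\s.-]", "");
    }

    // Keeps letters, digits, dot, hyphen, and only the plain space character.
    static String keepSpaceOnly(String name) {
        return name.replaceAll("[^a-zA-Z0-9 .-]", "");
    }

    public static void main(String[] args) {
        String name = "my file\t(v2)_final-1.txt!";
        System.out.println(keepWhitespace(name)); // the tab survives
        System.out.println(keepSpaceOnly(name));  // the tab is removed
    }
}
```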

Free space regex option (Pattern.COMMENTS) not working as expected

I'm trying to detect profanity using regex. But I want to detect the word even if they've spaced out the word like "Profa nity". However when using the "(?x)" option it still doesn't want to detect.
I currently got:
(?ix).*Bad Word.*
I've tried using http://www.rubular.com to debug the expression with no luck.
If it helps in any way, it's for a Teamspeak bot where I want to kick the user for having banned words in their name. In the config it refers to http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html where I can't find anything relating to the (?) options.
The bot itself can be found here: https://forum.teamspeak.com/threads/51286-JTS3ServerMod-Multifunction-TS3-Server-Bot-(Idle-Record-Away-Mute-Welcome-)
when using the "(?x)" option it still doesn't want to detect
(?x) is an embedded flag option (also known as an inline modifier/option) that enables the Pattern.COMMENTS option, also known as free-spacing mode, which allows comments inside regular expressions and makes the regex engine ignore all regular whitespace inside the pattern. As per Free-Spacing in Character Classes:
In free-spacing mode, whitespace between regular expression tokens is ignored. Whitespace includes spaces, tabs, and line breaks. Note that only whitespace between tokens is ignored. a b c is the same as abc in free-spacing mode. But \ d and \d are not the same. The former matches d, while the latter matches a digit. \d is a single regex token composed of a backslash and a "d". Breaking up the token with a space gives you an escaped space (which matches a space), and a literal "d".
Likewise, grouping modifiers cannot be broken up. (?>atomic) is the same as (?> ato mic ) and as ( ?>ato mic). They all match the same atomic group. They're not the same as (? >atomic). The latter is a syntax error. The ?> grouping modifier is a single element in the regex syntax, and must stay together. This is true for all such constructs, including lookaround, named groups, etc.
So, to match a single space in a pattern with the (?x) modifier, you need to escape it:
String reg = "(?ix).*Bad\\ Word.*"; // Escaped space matches a space in free spacing mode
String reg = "(?ix).* Bad\\ Word .*"; // More formatting spaces, same pattern
NOTE that in free-spacing mode you CAN'T put the space into a character class to make it meaningful in a Java regex. See below:
Java, however, does not treat a character class as a single token in free-spacing mode. Java does ignore spaces, line breaks, and comments inside character classes. So in Java's free-spacing mode, [abc] is identical to [ a b c ].
Besides, I think you actually wanted to make sure your pattern can match full strings that may contain line breaks. That means you need the (?s) modifier (Pattern.DOTALL):
String reg = "(?is).*Bad Word.*";
Also, to match any whitespace, you may rely on \s:
String reg = "(?ix).*Bad\\sWord.*"; // To only match 1 whitespace
String reg = "(?ix).*Bad\\s+Word.*"; // To account for 1 or more whitespaces
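The behaviour of the unescaped and escaped variants can be compared side by side. This is only a sketch: "Bad Word" is the placeholder from the question, the class and method names are hypothetical, and it assumes the bot tests the whole name with Matcher.matches().

```java
import java.util.regex.Pattern;

// Hypothetical demo class; "Bad Word" stands in for a real banned word.
class BannedNameDemo {
    // In free-spacing mode the unescaped space is dropped: this is really "BadWord".
    static final Pattern UNESCAPED = Pattern.compile("(?ix).*Bad Word.*");
    // The escaped space is literal; the remaining spaces are formatting only.
    static final Pattern ESCAPED = Pattern.compile("(?ix) .* Bad\\ Word .*");
    // \s+ allows any run of whitespace; (?s) lets '.' cross line breaks.
    static final Pattern WHITESPACE = Pattern.compile("(?isx) .* Bad \\s+ Word .*");

    // True if the whole name matches the given pattern.
    static boolean isBanned(Pattern pattern, String name) {
        return pattern.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isBanned(UNESCAPED, "my bad word"));   // space was ignored
        System.out.println(isBanned(ESCAPED, "my bad word"));
        System.out.println(isBanned(WHITESPACE, "line1\nbad \t word"));
    }
}
```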

regex for allowing only certain special characters and also including the alphanumerical characters

I'm struggling with REGEX and require it for a program.
The input should accept only alphanumeric characters, plus these special characters: comma, colon, space, / and -.
I have tried = (^[a-zA-Z0-9,:\S/-]*$)
As far as i understand and please correct me if I'm wrong.
a-zA-Z0-9 - The alphanumerical keys.
,: - Comma and colon
\S - Space
/ - I'm not sure how to represent a forward slash thus i escaped it
- - Dash also not sure if it is needed to escape it.
Would be appreciated if this can be corrected and also a explanation of each part.
Thanks in advance.
You can replace a-zA-Z0-9 with just \\w, which is short for [a-zA-Z_0-9], but note that it also admits the underscore, so keep the explicit ranges if _ must be rejected. Furthermore, \\S matches any character that is not whitespace; you should use \\s instead. You don't need to escape /, nor - when it is the first or last character in the class; placed between two characters it would be interpreted as a range and would have to be escaped. So, you can write your regex as ^([\w,:\s/-]*)$
The \S shorthand matches any character except whitespace, just the opposite of what you want. Lowercase \s matches whitespace [\t\v\n\r\f ]. But if you only want spaces, just put a space in the character class.
A hyphen - needs to be escaped inside character classes, unless it's the first or last character in the character class, but you could always escape it just to be sure.
Slashes / don't need to be escaped. They're escaped in other languages where you use them as pattern delimiters. ie: /regex/i.
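Besides hyphens and shorthands, backslashes \\ and closing brackets \] need to be escaped. In Java, an opening bracket [ (which starts a nested class), a leading ^ and the && operator are also special inside a character class.
Remember that in Java string literals you always need to use double backslashes (one is interpreted by Java, the other by the regex engine).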
Regex
pattern = "^[a-zA-Z0-9 ,:/\\-]*$"
Move the start-of-line ^ and end-of-line $ anchors outside the group, and use lowercase \s instead of \S, like
^([a-zA-Z0-9,:\s/-]*)$
That should do it.
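To check what the character class really admits, the pattern = "^[a-zA-Z0-9 ,:/\\-]*$" line from the answer above can be compared with the question's original attempt. The class name, method name, and sample inputs below are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical validator built around the pattern quoted in the answers above.
class AllowedCharsValidator {
    // Letters, digits, space, comma, colon, slash, and an escaped hyphen.
    static final Pattern ALLOWED = Pattern.compile("^[a-zA-Z0-9 ,:/\\-]*$");
    // The question's attempt: \S admits every non-whitespace character.
    static final Pattern ORIGINAL = Pattern.compile("^[a-zA-Z0-9,:\\S/-]*$");

    // True if the input consists only of allowed characters.
    static boolean isValid(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Room 12, 10:30 - 2024/01/05"));
        System.out.println(isValid("a@b"));
        System.out.println(ORIGINAL.matcher("a@b").matches()); // \S lets '@' through
        System.out.println(ORIGINAL.matcher("a b").matches()); // but rejects the space
    }
}
```

Because the quantifier is *, the empty string is also accepted; use + instead if at least one character is required.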

Regex matching a space a digit and 8 characters

I want to match a string containing,
a space
any number of digits
a space
1-8 characters - (alphanumeric and special characters)
example,
01 Stack
This is what i tried,
\\s\\d+\\s[^.]{1, 8} - i tried here except for .,
Try this, to catch (and restrict to) the punctuation and alphanumerics: \s\d+\s[\p{Punct}\p{Alnum}]{1,8}; wrap it all in ^...$ if you want the begin/end line anchors.
If "any number of digits" means 1 or more digit, then the pattern above is fine. If it means "zero or more digits", then the \d+ needs to become \d*.
As an aside, the pattern [^.] will match anything that's not a period. It includes a bit too much, I think, and excludes a bit too much. So I'm opting for the more specific pattern [\p{Punct}\p{Alnum}].
See documentation here.
Try \\s\\d+\\s[^.]{1,8}? It looks like the only problem here is a superfluous space.
Also, \\S is for everything except whitespace. [^ ] is for everything except a space. . is for any character (except line terminators, by default).
Inside a character class, . loses its special meaning and is a literal period, so [^.] matches any character except a period, spaces included. If you want to match non-space characters, use \\S instead.
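The \p{Punct}\p{Alnum} pattern suggested above can be tried out as follows. This is a sketch under two assumptions: the real input starts with a space (the question's example is shown without one), and the whole string is checked with Matcher.matches(). The class and method names are invented.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the "space, digits, space, 1-8 characters" pattern.
class SpaceDigitsTokenDemo {
    // One whitespace, one or more digits, one whitespace,
    // then 1 to 8 punctuation or alphanumeric characters.
    static final Pattern TOKEN =
            Pattern.compile("\\s\\d+\\s[\\p{Punct}\\p{Alnum}]{1,8}");

    // True if the entire input has the expected shape.
    static boolean isValid(String input) {
        return TOKEN.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid(" 01 Stack"));
        System.out.println(isValid(" 7 a-b_c!"));
        System.out.println(isValid("01 Stack"));        // no leading space
        System.out.println(isValid(" 01 TooLongWord")); // more than 8 characters
    }
}
```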

Help with regex

I'm constructing a regex which will accept at least 1 alpha numerical character and any number of spaces.
Right now I've got...[A-Za-z0-9]+[ \t\r\n]* which I understand to be at least 1 alphanumeric OR at least 1 space. How would I fix this?
EDIT: To answer the comments below I want it to accept strings which contain ATLEAST 1 alphanumeric AND any number of (including no) spaces. Right now it will accept JUST a whitespace.
EDIT2: To clarify, I don't want the any number of whitespace (including 0) to be accepted unless there is at least 1 alphanumeric character
\s*\p{Alnum}[\p{Alnum}\s]*
Your regex, [A-Za-z0-9]+[ \t\r\n]*, requires the string to start with a letter or digit (or, more accurately, it doesn't start matching until it sees one). Adding \s* allows the match to start with whitespace, but you still won't match any alphanumerics after the first whitespace character that follows an alphanumeric (for example, it won't match the xyz in abc xyz). Changing the trailing \s* to [\p{Alnum}\s]* fixes that problem.
On a side note, \p{Alnum} is exactly equivalent to [A-Za-z0-9] in Java, which is not the case in all regex flavors. I used \p{Alnum}, not just because it's shorter, but because it gives more protection from typos like [A-z] (which is syntactically valid, but almost certainly not what the author really meant).
EDIT: Performance should be considered, too. I originally included a + after the first \p{Alnum}, but I realized that wasn't a good idea. If this were part of a longer regex, and the regex didn't match right away, it could end up wasting a lot of time trying to match the same groups of characters with \p{Alnum}+ or [\p{Alnum}\s]*. The leading \s* is okay, though, because \s doesn't match any of the characters that \p{Alnum} matches.
One or more word characters followed by zero or more whitespace characters:
\w+\s*
Hey, try this: ([^\s]+\s*). [^\s] means catch everything that is not whitespace, while \s* means that whitespace is optional (if you really want at least one whitespace character, put + instead of *).
Edit: sorry, mine catches everything, not only alphanumerics (use ([a-zA-Z0-9]+\s*) for alphanumerics).
This should do the trick:
\s*\p{Alnum}+\s*
\p{Alnum} is an alphanumeric character: [\p{Alpha}\p{Digit}]
* says "zero or more times"
+ says "at least one" (not "or" as you seem to believe, or is written |)
| means "or"
\s is a whitespace character: [ \t\n\x0B\f\r]
EDIT: To answer the comments below I want it to accept strings which contain AT LEAST 1 alphanumeric AND any number of (including no) spaces.
The pattern I suggested requires at least one alpha numeric character.
EDIT2: To clarify, I don't want the any number of whitespace (including 0) to be accepted unless there is at least 1 alphanumeric character
The pattern I suggested will not accept a string made up of whitespace characters only.
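The two \p{Alnum} patterns proposed in these answers differ on input such as abc xyz. A small comparison, assuming the whole string is validated with Matcher.matches(); the class and constant names are made up.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing two patterns suggested in the answers.
class AlnumWithSpacesDemo {
    // Optional leading whitespace, one alphanumeric, then any mix of both.
    static final Pattern ANY_MIX =
            Pattern.compile("\\s*\\p{Alnum}[\\p{Alnum}\\s]*");
    // A single alphanumeric run with optional whitespace on either side.
    static final Pattern SINGLE_RUN =
            Pattern.compile("\\s*\\p{Alnum}+\\s*");

    // True if the entire input matches the given pattern.
    static boolean accepts(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String input : new String[] {"abc xyz", "  a  ", "   ", ""}) {
            System.out.println("[" + input + "] any mix: " + accepts(ANY_MIX, input)
                    + ", single run: " + accepts(SINGLE_RUN, input));
        }
    }
}
```

Both reject whitespace-only and empty input; only the first accepts alphanumerics that appear after an inner space.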
