Java regex: [^(\\s?-\\s?)]. PatternSyntaxException on 's'? - java

I just started learning java regex today so excuse me for taking a rather lazy approach to learning what's wrong with my regex. Basically I'm trying to split a string on 'white space', -, 'white space' pattern (where white space is either one or none) and I'm not sure why the pattern isn't compiling. I get an error on the second s in: [^(\s?-\s?)] (index 8). If someone could help me out, I'd really appreciate it!

You are placing the pattern you want to split on inside a negated character class, which does the exact opposite of what you expect.
[^(\s?-\s?)] # matches any character except:
# '('
# whitespace (\n, \r, \t, \f, and " ")
# '?'
# '-'
# whitespace (\n, \r, \t, \f, and " ")
# '?'
# ')'
Your syntax is indeed incorrect, but why will it not compile? Inside a character class the hyphen has a special meaning: it denotes a range. Here the engine reads ?-\s as a range from '?' to \s, and \s is a shorthand class rather than a single character, so it cannot be a range endpoint; that is why the error points at the second s. You can place a hyphen as the first or last character of the class. In some regular expression implementations, you can also place it directly after a range. If you place the hyphen anywhere else, you need to escape it in order to add it to your class.
To fix the compile issue you would simply escape the hyphen, but the regex would still not do what you want.
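A minimal sketch of the compile behaviour described above. The class name HyphenInClassDemo and the helper compiles are made up for this illustration; only Pattern.compile and PatternSyntaxException come from java.util.regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class HyphenInClassDemo {
    // Hypothetical helper: true if the regex compiles, false if Pattern.compile rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The unescaped hyphen between '?' and \s is read as a range and rejected.
        System.out.println(compiles("[^(\\s?-\\s?)]"));   // false
        // With the hyphen escaped the class compiles, but it still matches one character only.
        System.out.println(compiles("[^(\\s?\\-\\s?)]")); // true
    }
}
```

Compiling is only half of the problem: the escaped version is still a negated set of single characters, not the "optional space, hyphen, optional space" sequence.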
I'm trying to split a string on white space, -, white space pattern...
Remove the character class and the capturing group from the pattern:
String s = "foo - bar - baz-quz";
String[] parts = s.split("\\s?-\\s?");
System.out.println(Arrays.toString(parts)); //=> [foo, bar, baz, quz]
Here are a few references for learning regular expressions.
Regular-Expressions.info
RexEgg - Tutorial

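\s is a shorthand character class. Characters inside [] are treated as a set of individual characters (or, with [^], as characters to exclude), so the ?, ( and ) in your pattern lose their special meaning there. \s is allowed inside [], but it cannot be the endpoint of a range such as ?-\s.
Perhaps you meant to use parentheses instead of square brackets?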

Related

How to remove all special Characters from a string except - . and space

I have a string where I want to remove all special characters except hyphen , a dot and space.
I am using filename.replaceAll("[^a-zA-Z0-9.-]",""). It is working for . and - but not for space.
What should I add to this to make it work for space as well?
Use either \s or simply a space character, as explained in the Pattern class javadoc:
\s - A whitespace character: [ \t\n\x0B\f\r]
' ' - Literal space character
You must either escape - character as \- so it won't be interpreted as range expression or make sure it stays as the last regex character. Putting it all together:
filename.replaceAll("[^a-zA-Z0-9\\s.-]", "")
filename.replaceAll("[^a-zA-Z0-9 .-]", "")
You can use this regex [^a-zA-Z0-9\s.-] or [^a-zA-Z0-9 .-]
\s matches any whitespace character, whereas a literal space character matches only a space.
So in this case, if you want to match any whitespace, use this:
filename.replaceAll("[^a-zA-Z0-9\\s.-]", "");
And if you want to match only a space, use this:
filename.replaceAll("[^a-zA-Z0-9 .-]", "");
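To see the difference between the two variants, here is a small sketch. The class name SanitizeFilenameDemo, the two method names and the sample filename are invented for the example; the two regexes are the ones given above.

```java
public class SanitizeFilenameDemo {
    // Keeps letters, digits, '.', '-' and any whitespace character.
    static String keepWhitespace(String filename) {
        return filename.replaceAll("[^a-zA-Z0-9\\s.-]", "");
    }

    // Keeps letters, digits, '.', '-' and the plain space character only.
    static String keepSpaceOnly(String filename) {
        return filename.replaceAll("[^a-zA-Z0-9 .-]", "");
    }

    public static void main(String[] args) {
        // Sample input with a tab, a space and two parentheses.
        String input = "my\tfile (v2)-final.txt";
        System.out.println(keepWhitespace(input)); // the tab survives, the parentheses are removed
        System.out.println(keepSpaceOnly(input));  // the tab is removed as well
    }
}
```

In both classes the hyphen sits last, so it is taken literally and needs no escape.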

Why adding white space makes my regex wrong?

(^\s*\d+\)(.*) | ) | (^\s*Q\d+\.\s*(.*))
The above regex is not matching Q1. qeqwewqeqeq qerqer
But If I remove white space before and after |
(^\s*\d+\)(.*) | )|(^\s*Q\d+\.\s*(.*))
It matches my string.
What does white space mean? Is it equal to \s? It affects my readability.
Yes, whitespace affects your regex. No, it is not equivalent to \s.
The \s shorthand character class is equivalent to the character class [ \t\n\x0B\f\r], i.e. a character class that will match any single whitespace character. So, while your formatting spaces are matched by \s, they are not equivalent to it.
As has been said in the comments, literal whitespace is important in regexes. In fact, I believe it's causing an error in your first alternate (the sub-pattern (^\s*\d+\)(.*) | )).
If I'm reading the intent of that sub-pattern right, it's supposed to match text of the form
2) some_text
But it will only match this text if it's followed by a space, and it will also match a single literal space.
A better way to construct this sub-pattern would be (^\s*\d+\)(.*)), disposing of the end space and the alternation altogether. Furthermore, in order to improve readability, we can do this:
(^\s*(?:Q\d+\.|\d+\))\s*(.*))
Which only alternates on the question number format, rather than the whole pattern.
Demo on Regex101
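A quick way to check the combined pattern in Java is sketched below. The class name QuestionLineDemo and the helper textOf are assumptions for the example; the regex is the one given above, written as a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuestionLineDemo {
    // Optional indent, then "Q<n>." or "<n>)", then the rest of the line in group 2.
    static final Pattern LINE = Pattern.compile("(^\\s*(?:Q\\d+\\.|\\d+\\))\\s*(.*))");

    // Hypothetical helper: the text after the question number, or null if there is no match.
    static String textOf(String line) {
        Matcher m = LINE.matcher(line);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(textOf("Q1. qeqwewqeqeq qerqer")); // qeqwewqeqeq qerqer
        System.out.println(textOf("2) some_text"));           // some_text
        System.out.println(textOf("no number here"));         // null
    }
}
```

Group 1 is the whole match and the (?:...) alternation is non-capturing, so the trailing text is always group 2 regardless of which number format matched.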
The contents of a regex are 100% applicable to the determination of whether or not an input matches. Your imagination does not change regex processing.
The regex "\dignore this part\d" will not match the input "12" but will match the input "1ignore this part2". No matter how much you imagine that "ignore this part" will be skipped, it is still part of the regular expression.
In your case, the extra spaces are your form of "ignore this part".
Inside a regex pattern, spaces are meaningful atoms that match spaces. If you need to format your pattern with spaces and tabs and newlines - with whitespace that will not be accounted for by the regex engine - you may use the (?x) modifier, or the Pattern.COMMENTS flag.
To match a literal space in a pattern that uses the (?x) option, you need to escape the space. Alternatively, consider matching any whitespace with \s:
\s A whitespace character: [ \t\n\x0B\f\r]
Note that if you add the (?U) modifier (the Pattern.UNICODE_CHARACTER_CLASS flag), \s will match all Unicode whitespace (like [\p{Zs}\t\r\n]).
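As a rough sketch of how the flag changes things, the example below compiles the same spaced-out alternation with and without Pattern.COMMENTS. The class and constant names are invented, and the patterns are simplified stand-ins for the one in the question.

```java
import java.util.regex.Pattern;

public class CommentsFlagDemo {
    // Without the flag, the spaces around '|' are part of the pattern.
    static final Pattern LITERAL_SPACES = Pattern.compile("^\\d+\\) | ^Q\\d+\\.");

    // With Pattern.COMMENTS, unescaped whitespace in the pattern is ignored.
    static final Pattern FREE_SPACING = Pattern.compile("^\\d+\\) | ^Q\\d+\\.", Pattern.COMMENTS);

    // In COMMENTS mode a space must be escaped (backslash + space) to be matched literally.
    static final Pattern ESCAPED_SPACE = Pattern.compile("Q\\d+ \\. \\  \\w+", Pattern.COMMENTS);

    public static void main(String[] args) {
        String input = "Q1. qeqwewqeqeq qerqer";
        System.out.println(LITERAL_SPACES.matcher(input).find()); // false
        System.out.println(FREE_SPACING.matcher(input).find());   // true
        System.out.println(ESCAPED_SPACE.matcher(input).find());  // true
    }
}
```

The first pattern fails because its second alternative demands a space before the start-of-input anchor, which can never happen.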

regex for allowing only certain special characters and also including the alphanumerical characters

I'm struggling with REGEX and require it for a program.
The input require only alphanumerical keys and also (allow only comma,:,space,/,- in special chars)
I have tried = (^[a-zA-Z0-9,:\S/-]*$)
As far as i understand and please correct me if I'm wrong.
a-zA-Z0-9 - The alphanumerical keys.
,: - Comma and colon
\S - Space
/ - I'm not sure how to represent a forward slash thus i escaped it
- - Dash also not sure if it is needed to escape it.
Would be appreciated if this can be corrected and also a explanation of each part.
Thanks in advance.
You can replace a-zA-Z0-9 with just \\w, which is short for [a-zA-Z_0-9] (note that it also allows the underscore). Furthermore, \\S matches any character that is not whitespace, so you should use \\s instead. You don't need to escape /, nor - if it is the first or the last character in the class; placed between two characters it would be interpreted as a range, and you would have to escape it. So you can write your regex as ^([\w,:\s/-]*)$
The \S shorthand matches any character except whitespace, just the opposite of what you want. Lowercase \s matches whitespace [ \t\n\x0B\f\r]. But if you only want spaces, just put a space in the character class.
A hyphen - needs to be escaped inside a character class, unless it's the first or last character in the class, but you could always escape it just to be sure.
Slashes / don't need to be escaped. They're escaped in other languages where you use them as pattern delimiters, e.g. /regex/i.
Besides hyphens and shorthands, backslashes \\ and closing brackets \] need to be escaped; in Java an opening bracket [ inside a class needs escaping too, because it would start a nested class.
Remember that in a Java string literal you always need to use double backslashes (one is interpreted by Java, the other by the regex engine).
Regex
pattern = "^[a-zA-Z0-9 ,:/\\-]*$"
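Here is a small sketch of that pattern used for validation. The class name AllowedCharsDemo, the method isValid and the sample inputs are made up for the example.

```java
import java.util.regex.Pattern;

public class AllowedCharsDemo {
    // Letters, digits, space, comma, colon, forward slash and hyphen only.
    static final Pattern ALLOWED = Pattern.compile("^[a-zA-Z0-9 ,:/\\-]*$");

    // Hypothetical helper: true if every character of the input is in the allowed set.
    static boolean isValid(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Room 12, Block A: 3/4-B")); // true
        System.out.println(isValid("user@example.com"));        // false: '@' and '.' are not allowed
        System.out.println(isValid("under_score"));             // false: '_' is not allowed
    }
}
```

Because of the * quantifier an empty string is accepted as well; use + instead if at least one character is required.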
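You can also move the start-of-line ^ and end-of-line $ anchors outside the group, although that does not change what the pattern matches. The \S still has to become \s:
^([a-zA-Z0-9,:\s/-]*)$
That should do it.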

Semi colon separated alphanumeric

I need to validate the below string using regular expression in Java:
String alphanumericList ="[\"State\"; \"districtOne\";\"districtTwo\"]";
I have tried the following:
String pattern="^\\[ (\"[\\w]\")\\s+(?:\\s+;\\s+ (\"[\\w]\")+) \\]$";
String alphanumericList ="[\"State1\"; \"district1\";\"district2\"]";
But the validation fails.
Any help is appreciated.
I'll try and mark the possible issues with your expression (issue numbers above the chars):
     1      4      2      3      1      4    5 1
"^\\[ (\"[\\w]\")\\s+(?:\\s+;\\s+ (\"[\\w]\")+) \\]$"
As you can see, there are at least 5 issues:
1. The spaces in your expression are interpreted literally, i.e. if the input doesn't contain them, it would not match. Most probably you want to remove those spaces.
2. You expect at least one whitespace character after the first group (\\s+), which the input doesn't seem to contain. You probably want to remove that or change the quantifier from + to *.
3. You expect at least one whitespace character before each semicolon. Together with no. 2 this would make at least two after the first group. The solution would be the same as for no. 2.
4. Your expression for the strings between double quotes seems wrong. (\"[\\w]\")+ means "a double quote, a single word character, a double quote", and all of that at least once. Besides that, \w is already a character class, so the brackets around it are not needed here (unless you want to add more classes or characters inside). You probably want (\"\\w+\") instead.
5. In addition to no. 4, your non-capturing group that contains the semicolon ((?:\\s+;\\s+ (\"[\\w]\")+)) doesn't have a quantifier, i.e. it would be expected exactly once. You probably want to put the quantifier + or * after that group.
Another point that's not a direct issue is the capturing group around \"[\\w]\". Since you seem to want to match multiple strings after semicolons you'd only be able to capture one of the matching groups. Hence you'd most probably not be able to do what you intended anyways and thus the group is not necessary.
That said the fixed original expression would look like this:
pattern = "^\\[(\"\\w+\")(?:\\s*;\\s*\"\\w+\")+\\]$"
You are looking for this pattern:
String pattern = "\\[\\s*\"[^\"]*\"\\s*(?:;\\s*\"[^\"]*\"\\s*)*+\\]";
There is no need to add anchors: they are implicit when you use the matches() method, which is the more appropriate method for validation tasks.
pattern details:
\\[ # a literal opening square bracket
\\s* # optional whitespaces
\" # literal quote
[^\"]* # content between quotes: chars that are not a quote (zero or more)
\"
\\s*
(?: # non-capturing group:
; # a literal semi-colon
\\s*
\" # quoted content
[^\"]*
\"
\\s*
)*+ # repeat this group zero or more time (with a possessive quantifier)
\\] # a literal closing square bracket
The possessive quantifier prevents the regex engine from backtracking into the repeated non-capturing group if the closing square bracket is not present. It is a safeguard that avoids unneeded backtracking and makes the pattern fail faster. Note that you can make the other quantifiers before the non-capturing group possessive too, for the same reason. More about possessive quantifiers.
I decided to describe the content between quotes in this way: \"[^\"]*\", but you can be more restrictive, allowing for example only word characters: \"\\w*\", or more general, allowing escaped quotes: \"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*+\"
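A minimal sketch of the pattern used with matches(), assuming a wrapper class QuotedListDemo and a helper isValidList that are not part of the original answer:

```java
import java.util.regex.Pattern;

public class QuotedListDemo {
    // Pattern from the answer; matches() requires it to cover the whole input.
    static final Pattern LIST = Pattern.compile(
        "\\[\\s*\"[^\"]*\"\\s*(?:;\\s*\"[^\"]*\"\\s*)*+\\]");

    // Hypothetical helper: true if the whole string is a bracketed, semicolon-separated list.
    static boolean isValidList(String s) {
        return LIST.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidList("[\"State\"; \"districtOne\";\"districtTwo\"]")); // true
        System.out.println(isValidList("[\"State\"; \"districtOne\""));                  // false: no closing bracket
        System.out.println(isValidList("[\"State\" \"districtOne\"]"));                  // false: missing semicolon
    }
}
```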
Try this
static final String HEAD = "^\\[\\s*";
static final String TAIL = "\\s*\\]$";
static final String SEP = "\\s*;\\s*";
static final String ITEM = "\"[^\"]*\"";
static final String PAT = HEAD + ITEM + "(" + SEP + ITEM + ")*" + TAIL;
Try:
pattern = "^\\[(\"\\w+\";\\s*)*(\"\\w+\")\\]$";

Stop regular expression from matching across lines

I have a regular expression,
end\\s+[a-zA-Z]{1}[a-zA-Z_0-9]
which is supposed to match a line with the specifications
end abcdef123
where abcdef123 must start with a letter, followed by alphanumeric characters.
However currently it is also matching this
foobar barfooend
bar fred bob
It's picking up that end at the end of barfooend and also picking up bar in effect returning end bar as a legitimate result.
I tried
^end\\s+[a-zA-Z]{1}[a-zA-Z_0-9]
but that doesn't seem to work at all. It ends up matching nothing.
It should be fairly simple but I can't seem to nut it out.
\s also includes newline characters. So you either need to specify a character class that contains only the wanted whitespace characters, or exclude the unwanted ones.
Instead of \\s+, use one of these:
[^\\S\r\n] this includes all whitespace but not \r and \n. See end[^\S\r\n]+[a-zA-Z][a-zA-Z_0-9]+ here on Regexr
[ \t] this includes only space and tab. See end[ \t]+[a-zA-Z][a-zA-Z_0-9]+ here on Regexr
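The difference can be sketched in Java as follows. The class name SameLineWhitespaceDemo, the constants and the helper found are invented for the example; the input is the one from the question.

```java
import java.util.regex.Pattern;

public class SameLineWhitespaceDemo {
    // \s+ may consume a line break between "end" and the identifier.
    static final Pattern ANY_WS = Pattern.compile("end\\s+[a-zA-Z][a-zA-Z_0-9]+");

    // [^\S\r\n]+ is "whitespace minus \r and \n", so the match stays on one line.
    static final Pattern SAME_LINE_WS = Pattern.compile("end[^\\S\\r\\n]+[a-zA-Z][a-zA-Z_0-9]+");

    // Hypothetical helper: true if the pattern occurs anywhere in the text.
    static boolean found(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String twoLines = "foobar barfooend\nbar fred bob";
        System.out.println(found(ANY_WS, twoLines));              // true: matches across the line break
        System.out.println(found(SAME_LINE_WS, twoLines));        // false
        System.out.println(found(SAME_LINE_WS, "end abcdef123")); // true
    }
}
```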
You can use \b (word boundary detection) to check a word boundary. In our case we will use it to match the beginning of the word end. It can also be used to match the end of a word.
As @nhahtdh stated in his comment, the {1} is redundant, as [a-zA-Z] already matches exactly one letter in the given range.
Also, your regex does not do what you want because it only matches one alphanumeric character after the first letter. Add a + at the end (for one or more times) or a * (for zero or more times).
This should work:
"\\bend\\s+[a-zA-Z]{1}[a-zA-Z_0-9]*"
Edit: I think \b is better than ^ because the latter only matches at the beginning of the input (or of each line with Pattern.MULTILINE).
For example, take this input: "end azd123 end bfg456". There will be only one match with ^, whereas \b will match both.
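To compare the two anchors on that input, a short sketch follows. The class name WordBoundaryDemo and the helper findAll are assumptions, and the redundant {1} is dropped from the regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Hypothetical helper: collects every match of the regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "end azd123 end bfg456";
        // ^ anchors at the start of the input, so only the first occurrence is found.
        System.out.println(findAll("^end\\s+[a-zA-Z][a-zA-Z_0-9]*", text));   // [end azd123]
        // \b only requires a word boundary before "end", so both occurrences are found.
        System.out.println(findAll("\\bend\\s+[a-zA-Z][a-zA-Z_0-9]*", text)); // [end azd123, end bfg456]
        // "barfooend" has no word boundary before "end", so nothing matches here.
        System.out.println(findAll("\\bend\\s+[a-zA-Z][a-zA-Z_0-9]*", "foobar barfooend\nbar fred bob")); // []
    }
}
```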
Try the regular expression:
end[ ]+[a-zA-Z]\w+
\w is a word character: [a-zA-Z_0-9]
