Why closing square bracket "]" doesn't require escaping in regex?

Consider the array:
new Pattern[] {Pattern.compile("\\["),Pattern.compile("\\]") };
IntelliJ IDEA tells me that the \\ is redundant and suggests replacing it with ], i.e. the result is:
new Pattern[] {Pattern.compile("\\["),Pattern.compile("]") };
Why is the \\ OK in the first Pattern.compile("\\["), but redundant in the second?

The ] symbol is not a special regex operator outside a character class unless there is a corresponding unescaped [ before it. Only special characters require escaping. A [ is special outside a character class because it may mark the start of a character class. Once the Java regular expression engine sees an unescaped [ in the pattern, it knows there must be a ] ahead to close the character class. If there is no opening [ in the expression, the ] is treated as a literal ] symbol, and to the engine it makes no difference whether that ] is escaped or not. So, [abc] will match a, b or c, and \[abc] or \[abc\] will match the literal character sequence [abc].
So, [ should always be escaped when it is meant literally, while ] does not have to be escaped outside a character class.
When used inside a character class, both [ and ] must be escaped inside a Java regular expression as they may form intersection/subtraction patterns, unless the ] appears at the beginning of a character class (e.g. "[a]".replaceAll("[]\\[]", "") returns a).
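The rules above can be checked with a small Java sketch. The class and method names (BracketLiterals, matches, stripBrackets) are made up for illustration; only Pattern and String.replaceAll come from the JDK.

```java
import java.util.regex.Pattern;

class BracketLiterals {
    // Returns true when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Removes every literal [ and ] using a class whose first member is ].
    static String stripBrackets(String input) {
        return input.replaceAll("[]\\[]", "");
    }

    public static void main(String[] args) {
        System.out.println(matches("]", "]"));              // unescaped ] outside a class
        System.out.println(matches("\\]", "]"));            // redundant but legal escape
        System.out.println(matches("\\[abc]", "[abc]"));    // escaped [, bare ]
        System.out.println(matches("\\[abc\\]", "[abc]"));  // both escaped
        System.out.println(matches("[abc]", "b"));          // a real character class
        System.out.println(stripBrackets("[a]"));           // ] first in the class is literal
    }
}
```

Every matches call here should report true, and stripBrackets("[a]") should leave just the letter a.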
Other regex flavors
icu onigmo - In the ICU and Onigmo regex flavors, ] behaves the same as in the Java regex flavor. Languages affected: swift, ruby, r (stringr), kotlin, groovy.
pcre boost .net re2 python posix - In Boost, PCRE and related flavors, ] is not a special char (i.e. it needs no escaping) outside a character class, and is a special char (i.e. it needs escaping) inside a character class, except when it is the first char in the character class. It is not an error to escape it everywhere it is supposed to match a literal ] char. Languages/tools affected: php, perl, c#/vb.net/etc., python, sed, grep, awk, elixir, r (both default base R TRE and PCRE enabled with "perl=TRUE"), tcl, google-sheets.
ecmascript - In ECMAScript flavors, ] is not special outside a character class, while [ is special outside a character class. Inside a character class, ] must ALWAYS be escaped, even if it is the first char in the character class. [ inside a character class is not special, and escaping it there is allowed, also when the regexp is compiled with the /u flag (in JavaScript), because [ is a syntax character. So, be careful here. Languages affected: javascript, dart, c++, vba, google-apps-script (which uses JavaScript).

The ] is considered a metacharacter only when it is used to close a character set [...].
If there is no unclosed and unescaped opening square bracket [ before the ], then ] is a simple literal which doesn't require escaping (but allows it, which is why your IDE gives you a "warning" instead of an error).
The only place where you may want to escape ] is inside a character set, when you want regex to treat it as a simple symbol instead of the metacharacter that closes the character set.
For instance, a regex like "[ab\\]cd]" represents a or b or ] or c or d.
BUT a similar regex can also be written as [a-d]|]. Notice that the last ] is not "special" because there is no open character class before it. So it is considered a literal, a character without special meaning, which means it doesn't require escaping.
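To see that the two spellings accept the same characters, here is a small sketch. The class ClosingBracketForms and its accepted helper are illustrative names, not part of any library.

```java
import java.util.regex.Pattern;

class ClosingBracketForms {
    // Returns the characters of candidates that the regex matches as a whole.
    static String accepted(String regex, String candidates) {
        Pattern pattern = Pattern.compile(regex);
        StringBuilder kept = new StringBuilder();
        for (char c : candidates.toCharArray()) {
            if (pattern.matcher(String.valueOf(c)).matches()) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String candidates = "abcd]e[";
        // ] escaped inside the character class
        System.out.println(accepted("[ab\\]cd]", candidates));
        // ] left bare outside any character class
        System.out.println(accepted("[a-d]|]", candidates));
    }
}
```

Both calls should keep a, b, c, d and ] and reject e and [.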

Related

openapi - regex for not allowing whitespace or hyphen [duplicate]

I tried this, but it doesn't work:
[^\s-]
Any ideas?

[^\s-]
should work and so will
[^-\s]
[] : The char class
^ : Inside the char class ^ is the negator when it appears in the beginning.
\s : short for a white space
- : a literal hyphen. A hyphen is a meta char inside a char class but not when it appears in the beginning or at the end.

It can be done much easier:
\S which equals [^ \t\r\n\v\f]

Which programming language are you using? Maybe you just need to escape the backslash, like "[^\\s-]".

In Java:
String regex = "[^-\\s]";
System.out.println("-".matches(regex)); // prints "false"
System.out.println(" ".matches(regex)); // prints "false"
System.out.println("+".matches(regex)); // prints "true"
The regex [^-\s] works as expected. [^\s-] also works.
See also
Regular expressions and escaping special characters
regular-expressions.info/Character class
Metacharacters Inside Character Classes
The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret.

Note that regex is not one standard, and each language implements its own based on what the library designers felt like. Take for instance the regex standard used by bash, documented here: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05.
If you are having problems with regular expressions not working, it might be good to simplify it, for instance using "[^ -]" if this covers all forms of whitespace in your case.

Try [^- ]. \s will match 5 other characters besides the space (such as tab, newline, form feed, carriage return).

Regex on a directory on Linux and Windows [duplicate]

I am tired of always trying to guess whether I should escape special characters like '()[]{}|' etc. when using many implementations of regexps.
It is different with, for example, Python, sed, grep, awk, Perl, rename, Apache, find and so on.
Is there any rule set which tells when I should, and when I should not, escape special characters? Does it depend on the regexp type, like PCRE, POSIX or extended regexps?

Which characters you must and which you mustn't escape indeed depends on the regex flavor you're working with.
For PCRE, and most other so-called Perl-compatible flavors, escape these outside character classes:
.^$*+?()[{\|
and these inside character classes:
^-]\
For POSIX extended regexes (ERE), escape these outside character classes (same as PCRE):
.^$*+?()[{\|
Escaping any other characters is an error with POSIX ERE.
Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use "clever placement" if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally, e.g.:
[]^-]
In POSIX basic regular expressions (BRE), these are metacharacters that you need to escape to suppress their meaning:
.^$*[\
Escaping parentheses and curly brackets in BREs gives them the special meaning their unescaped versions have in EREs. Some implementations (e.g. GNU) also give special meaning to other characters when escaped, such as \? and \+. Escaping a character other than .^$*(){} is normally an error with BREs.
Inside character classes, BREs follow the same rule as EREs.
If all this makes your head spin, grab a copy of RegexBuddy. On the Create tab, click Insert Token, and then Literal. RegexBuddy will add escapes as needed.
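As a rough illustration of the "outside character classes" list for Perl-compatible flavors, the sketch below backslash-escapes exactly those characters. It assumes Java's java.util.regex behaves like the Perl-style flavors described above for these characters; MetaEscaper and its methods are hypothetical names. In real Java code the built-in Pattern.quote is usually the simpler choice.

```java
import java.util.regex.Pattern;

class MetaEscaper {
    // The characters listed above for PCRE-style flavors: . ^ $ * + ? ( ) [ { \ |
    private static final String META = ".^$*+?()[{\\|";

    // Prefixes each listed metacharacter with a backslash, leaves the rest alone.
    static String escapeLiteral(String text) {
        StringBuilder escaped = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (META.indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    // True when the escaped form of text matches text itself, and only as a literal.
    static boolean matchesItself(String text) {
        return Pattern.matches(escapeLiteral(text), text);
    }

    public static void main(String[] args) {
        System.out.println(escapeLiteral("a.b(c)[d]"));
        System.out.println(matchesItself("1+1=2 (really?) [x]{y}|\\"));
    }
}
```

Note that ] and } are deliberately left unescaped, which matches the list: once the opening [ or { is escaped, the closing character is an ordinary literal.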

Modern RegEx Flavors (PCRE)
Includes C, C++, Delphi, EditPad, Java, JavaScript, Perl, PHP (preg), PostgreSQL, PowerGREP, PowerShell, Python, REALbasic, Real Studio, Ruby, TCL, VB.Net, VBScript, wxWidgets, XML Schema, Xojo, XRegExp. PCRE compatibility may vary
    Anywhere: . ^ $ * + - ? ( ) [ ] { } \ |
Legacy RegEx Flavors (BRE/ERE)
Includes awk, ed, egrep, emacs, GNUlib, grep, PHP (ereg), MySQL, Oracle, R, sed. PCRE support may be enabled in later versions or by using extensions
ERE/awk/egrep/emacs
    Outside a character class: . ^ $ * + ? ( ) [ { } \ |
    Inside a character class: ^ - [ ]
BRE/ed/grep/sed
    Outside a character class: . ^ $ * [ \
    Inside a character class: ^ - [ ]
    For literals, don't escape: + ? ( ) { } |
    For standard regex behavior, escape: \+ \? \( \) \{ \} \|
Notes
If unsure about a specific character, it can be escaped like \xFF
Alphanumeric characters cannot be escaped with a backslash
Arbitrary symbols can be escaped with a backslash in PCRE, but not BRE/ERE (they must only be escaped when required). For PCRE, ] and - only need escaping within a character class, but I kept them in a single list for simplicity
Quoted expression strings must also have the surrounding quote characters escaped, and often with backslashes doubled-up (like "(\")(/)(\\.)" versus /(")(\/)(\.)/ in JavaScript)
Aside from escapes, different regex implementations may support different modifiers, character classes, anchors, quantifiers, and other features. For more details, check out regular-expressions.info, or use regex101.com to test your expressions live

Unfortunately, there really isn't a fixed set of escape codes, since it varies based on the language you are using.
However, keeping a page like the Regular Expression Tools Page or this Regular Expression Cheatsheet can go a long way to help you quickly filter things out.

POSIX recognizes multiple variations on regular expressions - basic regular expressions (BRE) and extended regular expressions (ERE). And even then, there are quirks because of the historical implementations of the utilities standardized by POSIX.
There isn't a simple rule for when to use which notation, or even which notation a given command uses.
Check out Jeff Friedl's Mastering Regular Expressions book.

Unfortunately, the meaning of things like ( and \( are swapped between Emacs style regular expressions and most other styles. So if you try to escape these you may be doing the opposite of what you want.
So you really have to know what style you are trying to quote.

Really, there isn't. There are about a half-zillion different regex syntaxes; they seem to come down to Perl, EMACS/GNU, and AT&T in general, but I'm always getting surprised too.

Sometimes simple escaping is not possible with the characters you've listed. For example, using a backslash to escape a bracket isn't going to work in the left hand side of a substitution string in sed, namely
sed -e 's/foo\(bar/something_else/'
I tend to just use a simple character class definition instead, so the above expression becomes
sed -e 's/foo[(]bar/something_else/'
which I find works for most regexp implementations.
BTW Character classes are pretty vanilla regexp components so they tend to work in most situations where you need escaped characters in regexps.
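The same trick carries over to Java's regex flavor, where a one-character class and a backslash escape both match a literal parenthesis. This is only a sketch; ClassAsEscape and substitute are illustrative names.

```java
class ClassAsEscape {
    // Two ways to match the literal text foo(bar in Java's flavor.
    static final String WITH_CLASS = "foo[(]bar";
    static final String WITH_BACKSLASH = "foo\\(bar";

    // Replaces every match of regex in input with a fixed marker.
    static String substitute(String regex, String input) {
        return input.replaceAll(regex, "something_else");
    }

    public static void main(String[] args) {
        System.out.println(substitute(WITH_CLASS, "x foo(bar y"));
        System.out.println(substitute(WITH_BACKSLASH, "x foo(bar y"));
    }
}
```

The character-class form has the advantage that it reads the same in flavors where \( means "start a group" rather than "literal parenthesis".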
Edit: After the comment below, just thought I'd mention the fact that you also have to consider the difference between finite state automata and non-finite state automata when looking at the behaviour of regexp evaluation.
You might like to look at "the shiny ball book" aka Effective Perl, specifically the chapter on regular expressions, to get a feel for the difference in regexp engine evaluation types.
Not all the world's a PCRE!
Anyway, regexp's are so clunky compared to SNOBOL! Now that was an interesting programming course! Along with the one on Simula.
Ah the joys of studying at UNSW in the late '70's! (-:

https://perldoc.perl.org/perlre.html#Quoting-metacharacters and https://perldoc.perl.org/functions/quotemeta.html
In the official documentation, such characters are called metacharacters. Example of quoting:
my $regex = quotemeta($string);
s/$regex/something/;

For PHP, "it is always safe to precede a non-alphanumeric with "\" to specify that it stands for itself." - http://php.net/manual/en/regexp.reference.escape.php.
Except if it's a " or '. :/
To escape regex pattern variables (or partial variables) in PHP use preg_quote()

To know when and what to escape without trial and error, you need to understand precisely the chain of contexts the string passes through. Trace the string from the point where you write it to its final destination, which is the memory handled by the regexp parsing code.
Be aware of how the string is processed before it reaches memory: it can be a plain string inside the code or a string entered on the command line, and the command line can be an interactive one or one stated inside a shell script file. It can also come from a variable in memory referenced by the code, from a (string) argument that goes through further evaluation, or from a string containing code generated dynamically with any sort of encapsulation...
Each of these contexts assigns special functionality to some characters.
When you want to pass a character literally, without its special function (local to the context), that is when you have to escape it for the next context... which might need some other escape characters, which in turn might need to be escaped in the preceding context(s).
Furthermore, there can be things like character encoding (the most insidious is UTF-8, because it looks like ASCII for common characters but may be interpreted differently even by the terminal, depending on its settings) and the encoding attribute of HTML/XML. It is necessary to understand the process precisely.
E.g. a regexp on a command line starting with perl -npe needs to be transferred to a set of exec system calls connected by pipes through file handles. Each of these exec system calls just has a list of arguments that were separated by (non-escaped) spaces, and possibly pipes (|) and redirections (> N> N>&M), parentheses, interactive expansion of * and ?, $(()) ... All of these are special characters used by the *sh, which might appear to interfere with the characters of the regular expression in the next context, but they are evaluated in order: the command line comes first. The command line is read by a program such as bash/sh/csh/tcsh/zsh. Escaping is simpler inside double or single quotes, but quoting a string on the command line is not required: mostly the space has to be prefixed with a backslash, and leaving out the quotes keeps the expansion functionality of * and ? available, but this is parsed as a different context from text within quotes. Then, when the command line has been evaluated, the regexp obtained in memory (not as written on the command line) receives the same treatment as it would in a source file.
For regexps there is the character-set context within square brackets [ ], and a Perl regular expression can be delimited by a large set of non-alphanumeric characters (e.g. m// or m:/better/for/path: ...).
You have more details about characters in the other answers, which are very specific to the final regexp context. You mention that you find the right regexp escapes by trial and error; that is probably because different contexts have different sets of special characters, which confuses your memory of past attempts (often the backslash is the character used in those different contexts to escape a literal character instead of its function).
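A two-context chain is easy to show in Java, where a string literal is first processed by the compiler and then by the regex engine. The sketch below is only an illustration of that idea; EscapeLayers and splitOnBackslash are made-up names.

```java
class EscapeLayers {
    // Context 1: the compiler turns the four backslashes in this literal into two characters.
    // Context 2: the regex engine reads those two characters as one literal backslash.
    static final String BACKSLASH_REGEX = "\\\\";

    // Splits a Windows-style path on its backslashes.
    static String[] splitOnBackslash(String path) {
        return path.split(BACKSLASH_REGEX);
    }

    public static void main(String[] args) {
        // Length of the regex as the engine sees it, after the compiler's pass.
        System.out.println(BACKSLASH_REGEX.length());
        // The argument below is a path with two single backslashes once compiled.
        System.out.println(String.join(" | ", splitOnBackslash("C:\\temp\\file")));
    }
}
```

If the same regex were typed on a shell command line, the shell's quoting rules would add a third context in front of these two.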

For Ionic (TypeScript) you have to double the backslash in order to escape the characters.
For example (this is to match some special characters):
"^(?=.*[\\]\\[!¡\'=ªº\\-\\_ç@#$%^&*(),;\\.?\":{}|<>\+\\/])"
Pay attention to the ] [ - _ . / characters. They have to be preceded by a double backslash. If you don't do that, you are going to get a type error in your code.

To avoid having to worry about which regex variant and all the bespoke peculiarities, just use this generic function that covers every regex variant other than BRE (unless they have Unicode multi-byte chars that are meta):
jot -s '' -c - 32 126 |
mawk '
function ___(__,_) {
return substr(_="",
gsub("[][!-/_\140:-@{-~]","[&]",__),
gsub("["(_="\\\\")"^]",_ "&",__))__
} ($++NF = ___($!_))^_'
!"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
[!]["][#][$][%][&]['][(][)][*][+][,][-][.][/]
0 1 2 3 4 5 6 7 8 9 [:][;][<][=][>][?]
[@] ABCDEFGHIJKLMNOPQRSTUVWXYZ [[]\\ []]\^ [_]
[`] abcdefghijklmnopqrstuvwxyz [{][|][}][~]
Square brackets are much easier to deal with, since there's no risk of triggering warning messages about "escaping too much", e.g.:
function ____(_) {
return substr("", gsub("[[:punct:]]","\\\\&",_))_
}
\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/ 0123456789\:\;\<\=\>\?
\@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^\_\`abcdefghijklmnopqrstuvwxyz \{\|\}\~
gawk: cmd. line:1: warning: regexp escape sequence `\!' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\"' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\#' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\%' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\&' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\,' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\:' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\;' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\=' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\@' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\_' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\~' is not a known regexp operator

Using Raku (formerly known as Perl_6)
Works (backslash or quote all non-alphanumeric characters except underscore):
~$ raku -e 'say $/ if "#.*?" ~~ m/ \# \. \* \? /; #works fine'
｢#.*?｣
There exist six flavors of Regular Expression languages, according to Damian Conway's pdf/talk "Everything You Know About Regexes Is Wrong". Raku represents a significant (~15 year) re-working of standard Perl(5)/PCRE Regular Expressions.
In those 15 years the Perl_6 / Raku language experts decided that all non-alphanumeric characters (except underscore) shall be reserved as Regex metacharacters even if no present usage exists. To denote non-alphanumeric characters (except underscore) as literals, backslash or escape them.
So the above example prints the $/ match variable if a match to a literal #.*? character sequence is found. Below is what happens if you don't: # is interpreted as the start of a comment, . dot is interpreted as any character (including whitespace), * asterisk is interpreted as a zero-or-more quantifier, and ? question mark is interpreted as either a zero-or-one quantifier or a frugal (i.e. non-greedy) quantifier-modifier (depending on context):
Errors:
~$ raku -e 'say $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!'
===SORRY!===
Regex not terminated.
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
Regex not terminated.
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
Couldn't find terminator / (corresponding / was at line 1)
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
expecting any of:
/
https://docs.raku.org/language/regexes
https://raku.org/

Java - How to set the delimiter to multiple different things using .useDelimiter()?

I wish to use the Scanner class method .useDelimiter() to parse a file; previously I would've used a series of .replaceAll() statements to replace what I wanted the delimiter to be with white space.
Anyway, I'm trying to make a Scanner's delimiter any of the following characters: ., (,),{,},[,],,,! and standard white space. How would I go about doing this?

Scanner uses a regular expression (regex) to describe the delimiter. By default it is \p{javaWhitespace}+, which represents one or more (due to the + operator) whitespace characters.
In regex, to represent a single character from a set of characters we can use a character class [...]. But since [ and ] in regex represent the start and end of a character class, these characters are metacharacters (even inside a character class). To treat them as literals we need to escape them first. We can do it by
adding \ (in string written as "\\") before them,
or by placing them in \Q...\E which represents quote section (where all characters are considered as literals, not metacharacters).
So a regex representing one of the ( ) { } [ ] , ! characters can look like "[\\Q(){}[],!\\E]".
If you want to add support for the standard delimiter, you can combine this regex with \p{javaWhitespace}+ using the OR operator, which is |.
So your code can look like:
yourScanner.useDelimiter("[\\Q(){}[],!\\E]|\\p{javaWhitespace}+");
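Here is a runnable sketch built on that delimiter. One deliberate variation, which is an assumption on my part and not part of the answer: the alternation is wrapped in (?:...)+ so that a run of adjacent delimiters such as ")[" counts as a single delimiter instead of producing empty tokens. DelimiterDemo and tokens are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {
    // Same building blocks as above, grouped and repeated so delimiter runs collapse.
    static final String DELIMITER = "(?:[\\Q(){}[],!\\E]|\\p{javaWhitespace})+";

    // Reads all tokens from the input using the custom delimiter.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("foo(bar)[baz], qux!"));
    }
}
```

For a file, the same useDelimiter call would apply to a Scanner constructed from a File or Path instead of a String.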

Why does the pattern ignore the space inside character class

I am trying to match some codes that are short strings with simple structure:
5 digits
Colon
Some letters
Space or underscore
Some digits.
I want to use the Pattern.COMMENTS option to format my pattern:
String pat = "(?x) ([0-9]{5}) : ([a-zA-Z]+ [_ ] [0-9]+) ";
This pattern works fine at https://regex101.com/r/oW8vQ4/1.
However, in Java, this line:
"31500:STR 200".matches(pat)
yields false.
Why does it return false here? Shouldn't the [_ ] match the space even if Pattern.COMMENTS is enabled, as it is inside a character class?

I think the problem is that you need to escape the space inside the character class. From http://www.regular-expressions.info/freespacing.html:
Java, however, does not treat a character class as a single token in free-spacing mode. Java does ignore whitespace and comments inside character classes. So in Java's free-spacing mode, [abc] is identical to [ a b c ]. To add a space to a character class, you'll have to escape it with a backslash. But even in free-spacing mode, the negating caret must appear immediately after the opening bracket. [ ^ a b c ] matches any of the four characters ^, a, b or c just like [abc^] would. With the negating caret in the proper place, [^ a b c ] matches any character that is not a, b or c.
Give it a try with the pattern - just added \\ before the space... but didn't test this myself.
String pat = "(?x) ([0-9]{5}) : ([a-zA-Z]+ [_\\ ] [0-9]+) ";
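A short sketch that puts the two patterns side by side. The class name FreeSpacingDemo and the constant names are illustrative; the patterns themselves are the ones from the question and from this answer.

```java
class FreeSpacingDemo {
    // (?x) turns on Pattern.COMMENTS; in Java the space inside [_ ] is then ignored.
    static final String UNESCAPED = "(?x) ([0-9]{5}) : ([a-zA-Z]+ [_ ] [0-9]+) ";
    // Escaping the space with a backslash keeps it as a member of the class.
    static final String ESCAPED = "(?x) ([0-9]{5}) : ([a-zA-Z]+ [_\\ ] [0-9]+) ";

    public static void main(String[] args) {
        System.out.println("31500:STR 200".matches(UNESCAPED)); // space is not in the class
        System.out.println("31500:STR 200".matches(ESCAPED));   // space is in the class
        System.out.println("31500:STR_200".matches(UNESCAPED)); // underscore still works
    }
}
```

With the unescaped pattern the class effectively shrinks to [_], which is why only the underscore form of the code still matches.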

Unclosed Character Class Error?

Here is the error:
Exception in thread "main" java.util.regex.PatternSyntaxException: Unclosed character class near index 3
], [
^
at java.util.regex.Pattern.error(Pattern.java:1924)
at java.util.regex.Pattern.clazz(Pattern.java:2493)
at java.util.regex.Pattern.sequence(Pattern.java:2030)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.compile(Pattern.java:1665)
at java.util.regex.Pattern.<init>(Pattern.java:1337)
at java.util.regex.Pattern.compile(Pattern.java:1022)
at java.lang.String.split(String.java:2313)
at java.lang.String.split(String.java:2355)
at testJunior2013.J2.main(J2.java:31)
This is the area of the code that is causing the issues.
String[][] split = new String[1][rows];
split[0] = (Arrays.deepToString(array2d)).split("], ["); //split at the end of an array row
What does this error mean and what needs to be done to fix the code above?

TL;DR
You want:
.split("\\], \\[")
Escape each square bracket twice — once for each context in which you need to strip them from their special meaning: within a Regular Expression first, and within a Java String secondly.
Consider using Pattern#quote when you need your entire pattern to be interpreted literally.
Explanation
String#split works with a Regular Expression but [ and ] are not standard characters, regex-wise: they have a special meaning in that context.
In order to strip them from their special meaning and simply match actual square brackets, they need to be escaped, which is done by preceding each with a backslash — that is, using \[ and \].
However, in a Java String, \ is not a standard character either, and needs to be escaped as well.
Thus, just to split on [, the String used is "\\[" and you are trying to obtain:
.split("\\], \\[")
A sensible alternative
However, in this case, you're not just semantically escaping a few specific characters in a Regular Expression, but actually wishing that your entire pattern be interpreted literally: there's a method to do just that 🙂
Pattern#quote is used to signify that the:
Metacharacters [...] in your pattern will be given no special meaning.
(from the Javadoc linked above)
I recommend, in this case, that you use the following, more sensible and readable:
.split(Pattern.quote("], ["))
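For reference, a self-contained sketch of both working variants next to the failing one. The class SplitRows, its methods and the sample array are hypothetical; Arrays.deepToString, Pattern.quote and PatternSyntaxException are standard JDK APIs.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SplitRows {
    // Splits the deepToString form of a 2D array at the row boundaries, quoting the separator.
    static String[] splitQuoted(int[][] array2d) {
        return Arrays.deepToString(array2d).split(Pattern.quote("], ["));
    }

    // Same split, with each bracket escaped by hand.
    static String[] splitEscaped(int[][] array2d) {
        return Arrays.deepToString(array2d).split("\\], \\[");
    }

    // Reports whether a string is a valid regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        int[][] sample = {{1, 2}, {3, 4}};
        System.out.println(Arrays.toString(splitQuoted(sample)));
        System.out.println(Arrays.toString(splitEscaped(sample)));
        System.out.println(compiles("], ["));      // the unescaped separator from the question
        System.out.println(compiles("\\], \\["));  // the escaped separator
    }
}
```

Note that the outer brackets of the deepToString output stay attached to the first and last row, so they still need trimming if only the numbers are wanted.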

Split receives a regex, and the [ and ] characters have meaning in regex, so you should escape them as \\[ and \\].
The way you are currently doing it, the parser finds a [ that is never closed by a ], so it throws that error.

String.split() takes a regular expression, not a normal string as an argument. In a regular expression, ] and [ are special characters, which need to be preceded by backslashes to be taken literally. Use .split("\\], \\["). (the double backslashes tell Java to interpret the string as "\], \[").

.split("], [")
           ^---start of char class
           end----?
Change it to
.split("], \\[")
           ^---escape the [

Try this:
String stringToSplit = "8579.0,753.34,796.94,\"[784.2389999999999,784.34]\",\"[-4.335912230999999, -4.3603307895,4.0407909059, 4.08669583455]\",[],[],[],0.1744,14.4,3.5527136788e-15,0.330667850653,0.225286999939,Near_Crash";
String [] arraySplitted = stringToSplit.replaceAll("\"","").replaceAll("\\[","").replaceAll("\\]","").trim().split(",");
