Java regex pattern matching vs schema validation

Please consider the regex pattern : .*[a-zA-Z0-9\\-\\_].*.
If I use Java regex pattern matching to match "-", it says it is true.
String regexCostcode1=".*[a-zA-Z0-9\\-\\_].*";
Pattern regex_costcode=Pattern.compile(regexCostcode1);
String test="-";
Matcher m = regex_costcode.matcher(test);
System.out.println(m.matches());
This prints true.
But the same regex fails for "-" in XSD schema validation.
I also checked it at http://regexr.com/, where it fails to match "-".
So why does it match with Java pattern matching?

Keep in mind that in a Java string literal you need 2 backslashes to define one literal backslash, so the regex engine only receives \- and \_. When you type \\ at regexr.com, or in an XML Schema regex, the engine receives 2 literal backslashes, which match one literal backslash in the input string, and the \\-\\ construct inside the character class is a range that matches only a single \.
In XML Schema, you need to define the regex as
<xs:pattern value=".*[a-zA-Z0-9_-].*"/>
Put the - at the end of the character class to be parsed as a literal -. The underscore does not need to be escaped at all, as it is never a special char (it is actually a "word" char).
Actually, I'd advise to use ".*[a-zA-Z0-9_-].*" in Java, too, to avoid any ambiguity.
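A minimal sketch comparing the original Java literal with the suggested form. The class and method names are invented for illustration and do not come from the question.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class HyphenClassDemo {
    // Original literal: the compiler turns each \\ into one backslash, so
    // the regex engine receives .*[a-zA-Z0-9\-\_].* (escaped hyphen and underscore).
    static final Pattern ORIGINAL = Pattern.compile(".*[a-zA-Z0-9\\-\\_].*");

    // Suggested form: hyphen placed last in the class, no backslashes at all.
    static final Pattern SUGGESTED = Pattern.compile(".*[a-zA-Z0-9_-].*");

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"-", "_", "a1", "#"}) {
            System.out.println("'" + s + "' original=" + matches(ORIGINAL, s)
                    + " suggested=" + matches(SUGGESTED, s));
        }
    }
}
```

Both patterns should accept "-" and "_" and reject "#". The second one reads the same way in Java and in an XSD, which is the point of the advice.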

For non-Java regexes you don't need to use double back-slashes. So your regex should be .*[a-zA-Z0-9\\-\\_].* in Java and .*[a-zA-Z0-9\-\_].* in XSD schema validation.
If you input .*[a-zA-Z0-9\\-\\_].* in the site you mentioned, it tells you that \\-\\ is being interpreted as a "range of characters from \ to \" since \\ is just an escaped back-slash.
If you input .*[a-zA-Z0-9\-\_].* it interprets \- as just an escaped hyphen and correctly matches -.
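The "range from \ to \" reading can be reproduced in Java by handing the engine the pattern text with the doubled backslashes intact. This is only a sketch: Java's engine stands in for the XSD validator and the online tester, which is an assumption, and the class and method names are made up.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class BackslashRangeDemo {
    // Four backslashes per pair in source, so the regex engine receives
    // .*[a-zA-Z0-9\\-\\_].* exactly as it would be typed outside Java.
    // Inside the class, \\-\\ is then a range from backslash to backslash.
    static final Pattern AS_TYPED_OUTSIDE_JAVA =
            Pattern.compile(".*[a-zA-Z0-9\\\\-\\\\_].*");

    static boolean matches(String input) {
        return AS_TYPED_OUTSIDE_JAVA.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("-"));   // hyphen is not in the class
        System.out.println(matches("\\"));  // a single backslash is
        System.out.println(matches("_"));   // underscore is
    }
}
```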

Related

Disable string escaping (backslash hell)

I've started using Java regexes and I find I have to write patterns like this (contrived example):
C:\\\\windows\\\\system\\d+
to match
C:\windows\system32
Is there any way to use java regex without insane amounts of backslashes?
Use Pattern.quote(String s).
It treats all meta characters in the passed String as literal characters (but you still must escape backslashes in construction of a String literal). This lets you type \\ instead of \\\\ to denote an actual/literal \ in the regex pattern. But this also means that any other special characters will be interpreted literally as well (such as \d+ in your example).
But in your example, you could use:
Pattern.quote("C:\\windows\\system") + "\\d+";
Test it with this: System.out.println("C:\\windows\\system32".matches(Pattern.quote("C:\\windows\\system") + "\\d+"));
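As a slightly fuller sketch of the same idea, the class below (a hypothetical name, as is the helper method) prints what Pattern.quote produces and checks two inputs. The quoted prefix is wrapped in \Q...\E, while the appended \d+ stays an active regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class QuoteDemo {
    // Literal prefix is quoted; the digit part is still interpreted as regex.
    static final String REGEX = Pattern.quote("C:\\windows\\system") + "\\d+";

    static boolean isSystemDir(String path) {
        return path.matches(REGEX);
    }

    public static void main(String[] args) {
        System.out.println(REGEX);                               // \QC:\windows\system\E\d+
        System.out.println(isSystemDir("C:\\windows\\system32")); // true
        System.out.println(isSystemDir("C:\\windows\\system"));   // false, no digits
    }
}
```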

Underlined backslash IntelliJ

I am using a backslash as an escape character for a serialization format I am working on. I have it as a constant but IntelliJ is underlining it and highlighting it red. On hover it gives no error messages or any information as to why it does not like it.
What is the reason for this and how do I fix it?
IntelliJ is smarter than I am and realised that I was using this character in a regular expression where 2 backslashes would be needed, however, IntelliJ also assumed that my puny mind could find the problem without giving me any information about it.
If it's being used as a regular expression, then the "\" must be escaped.
If you're escaping a "\" as "\\" like traditional regular expressions require, then you also need to add two more \\ for a total of \\\\.
This is because of the way Java interprets "\":
In literal Java strings the backslash is an escape character. The
literal string "\\" is a single backslash. In regular expressions, the
backslash is also an escape character. The regular expression \\
matches a single backslash. This regular expression as a Java string,
becomes "\\\\". That's right: 4 backslashes to match a single one.
The regex \w matches a word character. As a Java string, this is
written as "\\w".
The same backslash-mess occurs when providing replacement strings for
methods like String.replaceAll() as literal Java strings in your Java
code. In the replacement text, a dollar sign must be encoded as \$ and
a backslash as \\ when you want to replace the regex match with an
actual dollar sign or backslash. However, backslashes must also be
escaped in literal Java strings. So a single dollar sign in the
replacement text becomes "\\$" when written as a literal Java string.
The single backslash becomes "\\\\". Right again: 4 backslashes to
insert a single one.
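The quoted rules can be checked with a short sketch. The class and method names are invented for illustration. Matcher.quoteReplacement is a standard-library alternative that is not mentioned in the quote but avoids counting backslashes in the replacement.

```java
import java.util.regex.Matcher;

// Hypothetical demo class; the names are illustrative only.
final class FourBackslashDemo {
    static String toForwardSlashes(String path) {
        // Regex \\ (one literal backslash) written as a Java literal: "\\\\".
        return path.replaceAll("\\\\", "/");
    }

    static String toBackslashes(String path) {
        // Replacement \\ (one literal backslash) written as a Java literal: "\\\\".
        return path.replaceAll("/", "\\\\");
    }

    static String withDollarSign(String text) {
        // Replacement \$ (one literal dollar sign) written as a Java literal: "\\$".
        return text.replaceAll("USD", "\\$");
    }

    static String toBackslashesQuoted(String path) {
        // quoteReplacement escapes \ and $ in the replacement for us.
        return path.replaceAll("/", Matcher.quoteReplacement("\\"));
    }

    public static void main(String[] args) {
        System.out.println(toForwardSlashes("C:\\temp\\x")); // C:/temp/x
        System.out.println(toBackslashes("a/b"));            // a\b
        System.out.println(withDollarSign("USD 5"));         // $ 5
        System.out.println(toBackslashesQuoted("a/b"));      // a\b
    }
}
```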

Java Regex Escape Characters

I'm learning Regex, and running into trouble in the implementation.
I found the RegexTestHarness on the Java Tutorials, and running it, the following string correctly identifies my pattern:
[\d|\s][\d]\.
(My pattern is any double digit, or any single digit preceded by a space, followed by a period.)
That string is obtained by this line in the code:
Pattern pattern =
Pattern.compile(console.readLine("%nEnter your regex: "));
When I try to write a simple class in Eclipse, it tells me the escape sequences are invalid, and won't compile unless I change the string to:
[\\d|\\s][\\d]\\.
In my class I'm using Pattern pattern = Pattern.compile();
When I put this string back into the TestHarness it doesn't find the correct matches.
Can someone tell me which one is correct? Is the difference in some formatting from console.readLine()?
\ is a special character in String literals "...". It is used to escape other special characters, or to create characters like \n \r \t.
To create a \ character in a string literal that can be passed on to the regex engine, you need to escape it by adding another \ before it (just like you do in regex when you need to escape its metacharacters, such as the dot \.). So a String representing \ will look like "\\".
This problem doesn't exist when you are reading data from the user, because what you read is already the literal text: even if the user types \n in the console, it is interpreted as two characters, \ and n.
Also, there is no point in adding | inside a character class [...] unless your intention is to make that class match the | character as well. Remember that [abc] is the same as (a|b|c), so there is no need for | in "[\\d|\\s]".
If you want to represent a backslash in a Java string literal you need to escape it with another backslash, so the string literal "\\s" is two characters, \ and s. This means that to represent the regular expression [\d\s][\d]\. in a Java string literal you would use "[\\d\\s][\\d]\\.".
Note that I also made a slight modification to your regular expression: [\d|\s] will match a digit, whitespace, or the literal | character. You just want [\d\s]. A character class already means "match one of these", so the | is not needed for alternation inside it and loses its special meaning there.
"My pattern is any double digit or single digit preceded by a space, followed by a period."
The correct regex will be:
Pattern pattern = Pattern.compile("(\\s\\d|\\d{2})\\.");
Also, if you are getting a string from user input and want it to be matched literally, then you should call:
Pattern.quote(userInputRegex);
to escape all the regex special characters.
Also, you need double escaping because the first escape is handled by the Java string literal and the second one is passed on to the regex engine.
What is happening is that escape sequences are being evaluated twice: once by Java, and then once by the regex engine.
The result is that you need to escape the escape character when you use a regex escape sequence.
For instance, if you needed a digit, you'd use
"\\d"

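The three candidate patterns discussed above can be compared side by side. This is a sketch with invented class and method names and a made-up sample string that contains a | so the difference shows up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names and the sample text are illustrative only.
final class DigitPeriodDemo {
    // The question's pattern: the | inside the class is a literal character.
    static final Pattern WITH_PIPE = Pattern.compile("[\\d|\\s][\\d]\\.");
    // Same pattern without the stray |.
    static final Pattern CLASS_ONLY = Pattern.compile("[\\d\\s]\\d\\.");
    // Space-then-digit or two digits, followed by a period.
    static final Pattern ALTERNATION = Pattern.compile("(\\s\\d|\\d{2})\\.");

    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "12. |3. 4.";
        System.out.println(findAll(WITH_PIPE, text));   // also picks up "|3."
        System.out.println(findAll(CLASS_ONLY, text));
        System.out.println(findAll(ALTERNATION, text));
    }
}
```

On this sample the first pattern should also report "|3.", while the other two report only "12." and " 4.".
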
How to match \Q and \E in Java regex?

I want to match \Q and \E in a Java regex.
I am writing a program that computes the length of the string matched by a pattern (the program assumes that the regex contains no quantifier other than {some number}, which is why the length of the string is uniquely defined). As a first step, I want to delete all expressions like \Qsome text\E.
But regex like this:
"\\Q\\Q\\E\\Q\\E\\E"
obviously doesn't work.
Use Pattern.quote(...):
String s = "\\Q\\Q\\E\\Q\\E\\E";
String escaped = Pattern.quote(s);
Just escape the backslashes. The sequence \\\\ matches a literal backslash, so to match a literal \Q:
"\\\\Q"
and to match a literal \E:
"\\\\E"
You can make it more readable for a maintainer by making it obvious that each sequence matches a single character using [...] as in:
"[\\\\][Q]"

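Building on the escaped forms above, one possible way to delete the \Q...\E sections from a regex source string is sketched below. The class name, the method name, and the choice of a lazy .*? between the markers are assumptions made for illustration, not something the answers prescribe.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class QuotedSectionDemo {
    // Regex text: \\Q.*?\\E
    // A literal \Q, then as few characters as possible, then a literal \E.
    static final Pattern QUOTED_SECTION = Pattern.compile("\\\\Q.*?\\\\E");

    static String stripQuotedSections(String regexSource) {
        return QUOTED_SECTION.matcher(regexSource).replaceAll("");
    }

    public static void main(String[] args) {
        // The input text is ab\Qsome text\Ecd{3}
        System.out.println(stripQuotedSections("ab\\Qsome text\\Ecd{3}")); // abcd{3}
    }
}
```

Because . does not match line terminators by default, a quoted section that spans several lines would need Pattern.DOTALL.
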
How to undo replace performed by regex?

In java, I have the following regex ([\\(\\)\\/\\=\\:\\|,\\,\\\\]) which is compiled and then used to escape each of the special characters ()/=:|,\ with a backslash as follows escaper.matcher(value).replaceAll("\\\\$1")
So the string "A/C:D/C" would end up as "A\/C\:D\/C"
Later on in the process, I need to undo that replace. That means I need to match on the combination of \(, \), \/ etc. and replace it with the character immediately following the backslash character. A backslash followed by any other character should not be matched, and there could be cases where a special character exists without the preceding backslash, in which case it shouldn't match either.
Since I know all of the cases I could do something like
myString.replaceAll("\\(", "(").replaceAll("\\)", ")").replaceAll("\\/", "/")...
but I wonder if there is a simpler regex that would allow me to perform the replace for all the special characters in a single step.
That seems pretty straightforward. If this were your original code (excess escapes removed):
Pattern escaper = Pattern.compile("([()/=:|,\\\\])");
String escaped = escaper.matcher(original).replaceAll("\\\\$1");
...the opposite would be:
Pattern unescaper = Pattern.compile("\\\\([()/=:|,\\\\])");
String unescaped = unescaper.matcher(escaped).replaceAll("$1");
If you weren't escaping and unescaping backslashes themselves (as you're doing), you would have problems, but this should work fine.
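A round trip with these two patterns can be sketched as follows. The wrapper class and method names are invented; the patterns and replacement strings are the ones from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class; only the two patterns come from the answer.
final class EscapeRoundTrip {
    static final Pattern ESCAPER = Pattern.compile("([()/=:|,\\\\])");
    static final Pattern UNESCAPER = Pattern.compile("\\\\([()/=:|,\\\\])");

    static String escape(String original) {
        // Prefix every special character with one backslash.
        return ESCAPER.matcher(original).replaceAll("\\\\$1");
    }

    static String unescape(String escaped) {
        // Drop the backslash only when it precedes a special character.
        return UNESCAPER.matcher(escaped).replaceAll("$1");
    }

    public static void main(String[] args) {
        String escaped = escape("A/C:D/C");
        System.out.println(escaped);            // A\/C\:D\/C
        System.out.println(unescape(escaped));  // A/C:D/C
        System.out.println(unescape("x\\qy"));  // x\qy, since \q is left alone
    }
}
```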
I don't know the Java regex flavor, but this works with PCRE:
replace \\ followed by ([()/=:|,\\]) by $1
In Perl you can do
$str =~ s#\\([()/=:|,\\])#$1#g;
