How to match \Q and \E in Java regex? - java

I want to match \Q and \E in a Java regex.
I am writing a program which will compute the length of the string, matching to the pattern (this program assumes that there is no any quantifier in regex except {some number}, that's why the length of the string is uniquely defined) and I want at first delete all expressions like \Qsome text\E.
But regex like this:
"\\Q\\Q\\E\\Q\\E\\E"
obviously doesn't work.

Use Pattern.quote(...):
String s = "\\Q\\Q\\E\\Q\\E\\E";
String escaped = Pattern.quote(s);

Just escape the backslashes. The sequence \\\\ matches a literal backslash, so to match a literal \Q:
"\\\\Q"
and to match a literal \E:
"\\\\E"
You can make it more readable for a maintainer by making it obvious that each sequence matches a single character using [...] as in:
"[\\\\][Q]"

Related

Regular expression for matching texts before and after string

I would like to match URL strings which can be specified in the following manner.
xxx.yyy.com (For example, the regular expression should match all strings like 4xxx.yyy.com, xxx4.yyy.com, xxx.yyy.com, 4xxx4.yyy.com, 444xxx666.yyy.com, abcxxxdef.yyy.com etc).
I have tried to use
([a-zA-Z0-9]+$)xxx([a-zA-Z0-9]+$).yyy.com
([a-zA-Z0-9]*)xxx([a-zA-Z0-9]*).yyy.com
But they don't work. Please help me write a correct regular expression. Thanks in advance.
Note: I'm trying to do this in Java.
If you want to make sure there is xxx and you want to allow all non whitespace chars before and after. If you want to match the whole string, you could add anchors at the start and end.
Note to escape the dot to match it literally.
^\S*xxx\S*\.yyy\.com$
^ Start of string
\S*xxx\S* Match xxx between optional non whitespace chars
\.yyy Match .yyy
\.com Match .com
$ End of string
Regex demo
In Java double escape the backslash
String regex = "^\\S*xxx\\S*\\.yyy\\.com$";
Or specify the characters on the left and right that you would allow to match in the character class:
^[0-9A-Za-z!##$%^&*()_+]*xxx[0-9A-Za-z!##$%^&*()_+]*\.yyy\.com$
Regex demo

java regex pattern matching vs schema validation

Please consider the regex pattern : .*[a-zA-Z0-9\\-\\_].*.
If I use Java regex pattern matching to match "-", it says it is true.
String regexCostcode1=".*[a-zA-Z0-9\\-\\_].*";
Pattern regex_costcode=Pattern.compile(regexCostcode1);
String test="-";
Matcher m = regex_costcode.matcher(test);
System.out.println(m.matches());
This prints true.
But same regex fails for "-" in XSD schema validation.
I checked using http://regexr.com/ it fails to match "-".
So why it is matching using Java pattern matching?
Mind that in a Java string literal you need 2 backslashes to define a literal backslash. When you use \\ at the regexr.com, or in XML Schema regex, you use 2 literal backslashes that match a literal backslash in the input string, and the [\\-\\] construct matches a single \.
In XML Schema, you need to define the regex as
<xs:pattern value=".*[a-zA-Z0-9_-].*"/>
Put the - at the end of the character class to be parsed as a literal -. The underscore does not need to be escaped at all, as it is never a special char (it is actually a "word" char).
Actually, I'd advise to use ".*[a-zA-Z0-9_-].*" in Java, too, to avoid any ambiguity.
For non-Java regexes you don't need to use double back-slashes. So your regex should be .*[a-zA-Z0-9\\-\\_].* in Java and .*[a-zA-Z0-9\-\_].* in XSD schema validation.
If you input .*[a-zA-Z0-9\\-\\_].* in the site you mentioned, it tells you that \\-\\ is being interpreted as a "range of characters from \ to \" since \\ is just an escaped back-slash.
If you input .*[a-zA-Z0-9\-\_].* it interprets \- as just an escaped hypen and correctly matches -.

Java Regex Escape Characters

I'm learning Regex, and running into trouble in the implementation.
I found the RegexTestHarness on the Java Tutorials, and running it, the following string correctly identifies my pattern:
[\d|\s][\d]\.
(My pattern is any double digit, or any single digit preceded by a space, followed by a period.)
That string is obtained by this line in the code:
Pattern pattern =
Pattern.compile(console.readLine("%nEnter your regex: "));
When I try to write a simple class in Eclipse, it tells me the escape sequences are invalid, and won't compile unless I change the string to:
[\\d|\\s][\\d]\\.
In my class I'm using`Pattern pattern = Pattern.compile();
When I put this string back into the TestHarness it doesn't find the correct matches.
Can someone tell me which one is correct? Is the difference in some formatting from console.readLine()?
\ is special character in String literals "...". It is used to escape other special characters, or to create characters like \n \r \t.
To create \ character in string literal which can be used in regex engine you need to escape it by adding another \ before it (just like you do in regex when you need to escape its metacharacters like dot \.). So String representing \ will look like "\\".
This problem doesn't exist when you are reading data from user, because you are already reading literals, so even if user will write in console \n it will be interpreted as two characters \ and n.
Also there is no point in adding | inside class character [...] unless your intention is to make that class also match | character, remember that [abc] is the same as (a|b|c) so there is no need for | in "[\\d|\\s]".
If you want to represent a backslash in a Java string literal you need to escape it with another backslash, so the string literal "\\s" is two characters, \ and s. This means that to represent the regular expression [\d\s][\d]\. in a Java string literal you would use "[\\d\\s][\\d]\\.".
Note that I also made a slight modification to your regular expression, [\d|\s] will match a digit, whitespace, or the literal | character. You just want [\d\s]. A character class already means "match one of these", since you don't need the | for alternation within a character class it loses its special meaning.
My pattern is any double digit or single digit preceded by a space, followed by a period.)
Correct regex will be:
Pattern pattern = Pattern.compile("(\\s\\d|\\d{2})\\.");
Also if you're getting regex string from user input then your should call:
Pattern.quote(useInputRegex);
To escape all the regex special characters.
Also you double escaping because 1 escape is handled by String class and 2nd one is passed on to regex engine.
What is happening is that escape sequences are being evaluated twice. Once for java, and then once for your regex.
the result is that you need to escape the escape character, when you use a regex escape sequence.
for instance, if you needed a digit, you'd use
"\\d"

proper way to pattern match with escaped char

Pattern.matches("123$45","123$45") returns false, I presume because of the special $ char.
My suspicion was that escaping the $ would make it pass
e.g. Pattern.matches("123\$45","123\$45")
But this also fails.
What is the proper way to make sure they match?
This is the "canonical" regex which is \$, but here this is a Java string. And in a Java string, a \ is written "\\". Therefore:
"123\\$45"
As to your target string, it just needs to be "123$45".
If the pattern you are looking for is fixed pattern, then manually escape the '$' character so that it isn't treated as a regex metacharacter; i.e.
boolean itMatches = Pattern.matches("123\\$45", "123$45");
The '$' is escaped at the level of the String object using a single backslash. However, since we are expressing this using a String literal, and backslash is the escape character for string literals, we need to (string) escape the (regex) escape character. Hence, we need two backslashes ... here.
If you don't escape the escape, the Java compiler says in effect "I don't recognize "\$" as a valid String literal escape sequence. ERROR!".
On the other hand, if the pattern input or generated, then you can use Pattern.quote() to quote it; i.e.
String literal = "123$45"; // ... or any literal string you want to match.
boolean itMatches = Pattern.matches(Pattern.quote(literal), "123$45");

How to escape a square bracket for Pattern compilation?

I have comma separated list of regular expressions:
.{8},[0-9],[^0-9A-Za-z ],[A-Z],[a-z]
I have done a split on the comma. Now I'm trying to match this regex against a generated password. The problem is that Pattern.compile does not like square brackets that is not escaped.
Can some please give me a simple function that takes a string like so: [0-9] and returns the escaped string \[0-9\].
For some reason, the above answer didn't work for me. For those like me who come after, here is what I found.
I was expecting a single backslash to escape the bracket, however, you must use two if you have the pattern stored in a string. The first backslash escapes the second one into the string, so that what regex sees is \]. Since regex just sees one backslash, it uses it to escape the square bracket.
\\]
In regex, that will match a single closing square bracket.
If you're trying to match a newline, for example though, you'd only use a single backslash. You're using the string escape pattern to insert a newline character into the string. Regex doesn't see \n - it sees the newline character, and matches that. You need two backslashes because it's not a string escape sequence, it's a regex escape sequence.
You can use Pattern.quote(String).
From the docs:
public static String quote​(String s)
Returns a literal pattern String for the specified String.
This method produces a String that can be used to create a Pattern that would match the string s as if it were a literal pattern.
Metacharacters or escape sequences in the input sequence will be given no special meaning.
You can use the \Q and \E special characters...anything between \Q and \E is automatically escaped.
\Q[0-9]\E
Pattern.compile() likes square brackets just fine. If you take the string
".{8},[0-9],[^0-9A-Za-z ],[A-Z],[a-z]"
and split it on commas, you end up with five perfectly valid regexes: the first one matches eight non-line-separator characters, the second matches an ASCII digit, and so on. Unless you really want to match strings like ".{8}" and "[0-9]", I don't see why you would need to escape anything.

Categories

Resources