Java Regex Escape Characters - java

I'm learning Regex, and running into trouble in the implementation.
I found the RegexTestHarness on the Java Tutorials, and running it, the following string correctly identifies my pattern:
[\d|\s][\d]\.
(My pattern is any double digit, or any single digit preceded by a space, followed by a period.)
That string is obtained by this line in the code:
Pattern pattern =
Pattern.compile(console.readLine("%nEnter your regex: "));
When I try to write a simple class in Eclipse, it tells me the escape sequences are invalid, and won't compile unless I change the string to:
[\\d|\\s][\\d]\\.
In my class I'm using Pattern pattern = Pattern.compile();
When I put this string back into the TestHarness it doesn't find the correct matches.
Can someone tell me which one is correct? Is the difference in some formatting from console.readLine()?

\ is a special character in String literals "...". It is used to escape other special characters, or to create characters like \n \r \t.
To create a \ character in a string literal that the regex engine can use, you need to escape it by adding another \ before it (just like you do in regex when you need to escape metacharacters, for example the dot as \.). So a String representing \ will look like "\\".
This problem doesn't exist when you read data from the user, because that text is not a string literal and is never processed for escapes: if the user types \n in the console, it is read as two characters, \ and n.
Also, there is no point in adding | inside a character class [...] unless you intend that class to also match the | character. Remember that [abc] matches the same thing as (a|b|c), so there is no need for | in "[\\d|\\s]".
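To make the difference between a string literal and typed input concrete, here is a minimal sketch. The class and method names are made up for illustration, and the char array stands in for what console.readLine() would hand back if the user typed \d.

```java
// Hypothetical demo: how many characters does a literal really contain?
class LiteralVersusInputDemo {
    // Returns the number of characters stored in the string at run time.
    static int storedLength(String s) {
        return s.length();
    }

    // Simulates a user typing the two keys \ and d; no escape processing happens.
    static String typedBackslashD() {
        return new String(new char[] {'\\', 'd'});
    }

    public static void main(String[] args) {
        System.out.println(storedLength("\\d")); // 2: the characters \ and d
        System.out.println(storedLength("\n"));  // 1: a single newline character
        // The typed text equals the literal written with a doubled backslash.
        System.out.println(typedBackslashD().equals("\\d")); // true
    }
}
```

The doubled backslash exists only in source code; the string the regex engine receives is the same in both cases.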

If you want to represent a backslash in a Java string literal you need to escape it with another backslash, so the string literal "\\s" is two characters, \ and s. This means that to represent the regular expression [\d\s][\d]\. in a Java string literal you would use "[\\d\\s][\\d]\\.".
Note that I also made a slight modification to your regular expression: [\d|\s] will match a digit, whitespace, or the literal | character. You just want [\d\s]. A character class already means "match one of these", so the | is not needed for alternation inside a character class, where it loses its special meaning.
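A short sketch of the point about | inside a character class, using the two patterns discussed above. The class name, helper, and sample inputs are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo: | inside [...] is an ordinary character, not alternation.
class CharClassPipeDemo {
    // The pattern from the question, with the stray | inside the class.
    static final Pattern WITH_PIPE = Pattern.compile("[\\d|\\s][\\d]\\.");
    // The same pattern with the | removed.
    static final Pattern WITHOUT_PIPE = Pattern.compile("[\\d\\s][\\d]\\.");

    // True if the pattern matches anywhere in the input.
    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found(WITH_PIPE, "|7."));    // true: | is accepted by the class
        System.out.println(found(WITHOUT_PIPE, "|7.")); // false
        System.out.println(found(WITHOUT_PIPE, "42.")); // true
        System.out.println(found(WITHOUT_PIPE, " 7.")); // true
    }
}
```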

(My pattern is any double digit, or any single digit preceded by a space, followed by a period.)
The correct regex will be:
Pattern pattern = Pattern.compile("(\\s\\d|\\d{2})\\.");
Also, if you're getting a string from user input and want it matched as literal text rather than interpreted as a regex, you should call:
Pattern.quote(useInputRegex);
to escape all the regex special characters in it.
You also need to double the escaping because one level is consumed by the Java string literal and the second is passed on to the regex engine.
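As a rough check of the alternation-based pattern above, the following sketch runs it against a few sample inputs. The class name and the inputs are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the pattern (\s\d|\d{2})\. written as a Java literal.
class DoubleDigitOrSpacedDigitDemo {
    static final Pattern PATTERN = Pattern.compile("(\\s\\d|\\d{2})\\.");

    // True if the pattern matches anywhere in the input.
    static boolean found(String input) {
        return PATTERN.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("12."));  // true: two digits, then a period
        System.out.println(found(" 5."));  // true: space, digit, period
        System.out.println(found("5."));   // false: lone digit with no space before it
        System.out.println(found("a5."));  // false: preceded by a letter
    }
}
```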

What is happening is that escape sequences are being evaluated twice: once by Java, and then once by the regex engine.
The result is that you need to escape the escape character when you use a regex escape sequence.
For instance, if you needed a digit, you'd use
"\\d"

Related

Java Regular Expression - how to use backslash [duplicate]

I am really confused about how to escape. Sometimes I just need to prepend a single backslash, but sometimes I need to prepend a double backslash, as in "\\.".
Could anyone tell me why?
Also, could anyone explain the difference between
String.split("\t"),
String.split("\\t"),
String.split("\\\t"),
String.split("\\\\t")?
Backslash is a special character in string literals: we can use it to create \n or to escape " as \".
But backslash is also special in the regular expression engine: for instance, we can use it for predefined character classes like \w \d \s.
So if you want to create a string which represents regex/text like \w, you need to write it as "\\w".
If you want to write a regex which represents a literal \, then the text of that regex needs to look like \\, which means the String holding that text needs to be written as "\\\\".
In other words, we need to escape the backslash twice:
- once in regex \\
- and once in string "\\\\".
If you want to pass the regex engine a literal tab character, you don't need to escape the backslash at all. Java understands the "\t" literal as a string containing the tab character, and you can pass such a string to the regex engine without problems.
For our convenience, the regex engine in Java interprets the text \t (also \r and \n) the same way string literals interpret "\t". In other words, we can pass the regex engine text consisting of the \ character and the t character and be sure that it will be interpreted as a tab character.
So code like split("\t") or split("\\t") will split on tab. split("\\\t") also splits on tab: the literal holds a \ followed by a real tab character, and the regex engine treats a backslash before a non-alphabetic character as that character itself.
Code like split("\\\\t") will split not on the tab character, but on a \ character followed by t. This happens because "\\\\", as explained, represents the text \\, which the regex engine sees as an escaped \ (so it is treated as a literal).
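The four split calls from the question can be compared side by side. This is a minimal sketch; the class name and the two sample strings are made up for illustration.

```java
// Hypothetical demo of the four split variants from the question.
class SplitOnTabDemo {
    static final String TABBED = "a\tb";    // three characters: a, TAB, b
    static final String SLASH_T = "a\\tb";  // four characters: a, \, t, b

    // Number of pieces produced by splitting input on the given regex.
    static int pieces(String input, String regex) {
        return input.split(regex).length;
    }

    public static void main(String[] args) {
        System.out.println(pieces(TABBED, "\t"));      // 2: regex is a real tab
        System.out.println(pieces(TABBED, "\\t"));     // 2: regex \t means tab
        System.out.println(pieces(TABBED, "\\\t"));    // 2: regex is \ plus a real tab
        System.out.println(pieces(TABBED, "\\\\t"));   // 1: regex \\t needs a literal \ then t
        System.out.println(pieces(SLASH_T, "\\\\t"));  // 2: the input really contains \ then t
    }
}
```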

How to split a string with double quotes " as the delimiter?

I tried splitting like this-
tableData.split("\\"")
but it does not work.
It seems that you tried to escape it the same way you would escape |, which is "\\|". But the difference between | and " is that
| is a metacharacter in the regex engine (it represents the OR operator)
" is a metacharacter in Java string literals (it represents the start/end of the string)
To escape any String metacharacter (like ") you need to place before it the String metacharacter responsible for escaping, which is \ (see note 1). So to create a String which contains " like this is "quote" you would need to write it as
String s = "this is \"quote\"";
//                  ^^     ^^ these represent " literal, not end of string
The same idea applies if we want to create a \ literal (we need to escape it by placing another \ before it). For instance, if we want to create a string representing c:\foo\bar we need to write it as
String s = "c:\\foo\\bar";
//            ^^   ^^ these will represent \ literal
So as you see, \ is used to escape metacharacters (make them simple literals).
This character is used in the Java language for Strings, but it is also used in the regex engine to escape its metacharacters:
\, ^, $, ., |, ?, *, +, (, ), [, {.
If you would like to create a regex which matches the [ character you need to use the regex \[, but the String representing this regex in Java needs to be written as
String leftBracketRegex = "\\[";
//                         ^^ - Remember what was said earlier?
//                              To create \ literal in String we need to escape it
So to split on [ we need to invoke split("\\[") because the regex representing [ is \[, which needs to be written as "\\[" in Java.
Since " is not a special character in regex but is special in String literals, we need to escape it only in the string literal by writing it as
split("\"");
1) \ is also used to create other characters, such as the line separator \n and the tab \t. It can also be used to create Unicode characters like \uXXXX, where XXXX is the character's code in the Unicode table in hexadecimal form.
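The two cases from this answer can be tried out with a small sketch: one delimiter that is special only in string literals ("), and one that is special only in regex ([). The class name and sample data are hypothetical.

```java
import java.util.Arrays;

// Hypothetical demo: escaping for the string literal versus escaping for regex.
class SplitDelimiterDemo {
    // " needs a string-level escape only; the regex engine sees a plain ".
    static String[] splitOnQuote(String input) {
        return input.split("\"");
    }

    // [ needs a regex-level escape, which costs two backslashes in the literal.
    static String[] splitOnLeftBracket(String input) {
        return input.split("\\[");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnQuote("one\"two\"three"))); // [one, two, three]
        System.out.println(Arrays.toString(splitOnLeftBracket("a[b[c")));     // [a, b, c]
    }
}
```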
You have escaped the \ by putting in \ twice, try
tableData.split("\"")
Why does this happen?
A backslash escapes the following character. Since the next character is another backslash, the second backslash will be escaped, thus the doublequote won't.
Your resulting escaped string is \", where it should really be just ".
Edit:
Also keep in mind that String.split() interprets its parameter as a regular expression, which has several special characters that have to be escaped in the resulting string.
So if you want to split by a . (which is a special regex character), you need to specify it as String.split("\\."). The first backslash escapes the second backslash in the string literal, so the regex engine receives \. and treats the dot literally.
For regex characters you could also just use Pattern.quote() to escape your desired delimiter, but this is beyond the scope of the original question.
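A minimal sketch of the dot case, comparing the unescaped delimiter, the manually escaped one, and Pattern.quote. The class name and the input "a.b.c" are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo: splitting on a dot with and without escaping.
class SplitOnDotDemo {
    // Number of pieces produced by splitting "a.b.c" on the given regex.
    static int pieces(String regex) {
        return "a.b.c".split(regex).length;
    }

    public static void main(String[] args) {
        // Unescaped . matches every character, so only empty strings remain
        // and split drops the trailing empty strings.
        System.out.println(pieces("."));                 // 0
        System.out.println(pieces("\\."));               // 3
        System.out.println(pieces(Pattern.quote(".")));  // 3
    }
}
```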
Try with single backslash \
tableData.split("\"")
Try like this by escaping " with single backslash \ :
tableData.split("\"")
You are not escaping properly. The snippet code will not even compile because of it. The correct way to do it is
tableData.split("\"");
A single backslash will do the trick.
Like this:
tableData.split("\"");
A char literal needs no backslash here, since '"' is a valid char. However, String.split takes a String (a regex), not a char, so
tableData.split('"');
does not compile. Use split("\"") as shown above.

How does string.replaceAll() work?

I am making a program that replaces a certain part of the string.
String x = "hello";
x=x.replaceAll("e","\\\\s");
System.out.println(x);
output: h\sllo
but for
System.out.println("\\s");
output: \s
Why do we need the extra escape characters in the first case?
You need \\ for a single \ character in regex.
But Java string literals also interpret the backslash, so you need to escape each \ for the String; hence you need 2+2=4 backslashes to produce a single \ (2 for the String literal and 2 for the regex engine).
Also note that the 2nd argument to the String#replaceAll method is also interpreted by the regex engine, because it may contain back-references, and that is why the same escaping rules apply to the replacement string.
Your call uses a replacement string that stands for a literal \ followed by a literal s.
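The effect of the replacement-string rules can be seen by comparing a few variants. This is a sketch with hypothetical class and method names; Matcher.quoteReplacement and String.replace are standard library calls that avoid the second level of escaping.

```java
import java.util.regex.Matcher;

// Hypothetical demo: how the replacement argument of replaceAll is interpreted.
class ReplacementEscapeDemo {
    // Four backslashes in source: the replacement text is \\s, which yields \s.
    static String fourBackslashes() {
        return "hello".replaceAll("e", "\\\\s");
    }

    // Two backslashes in source: the replacement text is \s, where \ merely escapes s.
    static String twoBackslashes() {
        return "hello".replaceAll("e", "\\s");
    }

    // quoteReplacement escapes the replacement text so it is taken literally.
    static String quoted() {
        return "hello".replaceAll("e", Matcher.quoteReplacement("\\s"));
    }

    // String.replace treats both arguments as plain text, not as regex.
    static String plainReplace() {
        return "hello".replace("e", "\\s");
    }

    public static void main(String[] args) {
        System.out.println(fourBackslashes()); // h\sllo
        System.out.println(twoBackslashes());  // hsllo
        System.out.println(quoted());          // h\sllo
        System.out.println(plainReplace());    // h\sllo
    }
}
```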

How to undo replace performed by regex?

In java, I have the following regex ([\\(\\)\\/\\=\\:\\|,\\,\\\\]) which is compiled and then used to escape each of the special characters ()/=:|,\ with a backslash as follows escaper.matcher(value).replaceAll("\\\\$1")
So the string "A/C:D/C" would end up as "A\/C\:D\/C"
Later on in the process, I need to undo that replace. That means I need to match on the combination of \(, \), \/ etc. and replace it with the character immediately following the backslash character. A backslash followed by any other character should not be matched, and there could be cases where a special character exists without the preceding backslash, in which case it shouldn't match either.
Since I know all of the cases I could do something like
myString.replaceAll("\\(", "(").replaceAll("\\)", ")").replaceAll("\\/", "/")...
but I wonder if there is a simpler regex that would allow me to perform the replace for all the special characters in a single step.
That seems pretty straightforward. If this were your original code (excess escapes removed):
Pattern escaper = Pattern.compile("([()/=:|,\\\\])");
String escaped = escaper.matcher(original).replaceAll("\\\\$1");
...the opposite would be:
Pattern unescaper = Pattern.compile("\\\\([()/=:|,\\\\])");
String unescaped = unescaper.matcher(escaped).replaceAll("$1");
If you weren't escaping and unescaping backslashes themselves (as you're doing), you would have problems, but this should work fine.
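Putting the two patterns from this answer into a runnable round trip might look like the sketch below. The wrapper class and method names are made up; the patterns and replacement strings are the ones shown above.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the escaper/unescaper patterns from the answer.
class EscapeRoundTripDemo {
    // Matches any one of ( ) / = : | , \ and captures it.
    static final Pattern ESCAPER = Pattern.compile("([()/=:|,\\\\])");
    // Matches a backslash followed by one of the same characters.
    static final Pattern UNESCAPER = Pattern.compile("\\\\([()/=:|,\\\\])");

    // Prefixes every special character with a backslash.
    static String escape(String original) {
        return ESCAPER.matcher(original).replaceAll("\\\\$1");
    }

    // Removes the backslash in front of every special character.
    static String unescape(String escaped) {
        return UNESCAPER.matcher(escaped).replaceAll("$1");
    }

    public static void main(String[] args) {
        String escaped = escape("A/C:D/C");
        System.out.println(escaped);            // A\/C\:D\/C
        System.out.println(unescape(escaped));  // A/C:D/C
        // A backslash before a non-special character is left alone.
        System.out.println(unescape("a\\xb"));  // a\xb
    }
}
```

Because the backslash itself is in the escaped set, a string that already contains backslashes survives the round trip as well.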
I don't know the Java regex flavor, but this works with PCRE:
replace \\ followed by ([()/=:|,\\]) with $1
In Perl you can do
$str =~ s#\\([()/=:|,\\])#$1#g;

How to escape a square bracket for Pattern compilation?

I have comma separated list of regular expressions:
.{8},[0-9],[^0-9A-Za-z ],[A-Z],[a-z]
I have done a split on the comma. Now I'm trying to match each regex against a generated password. The problem is that Pattern.compile does not like square brackets that are not escaped.
Can someone please give me a simple function that takes a string like [0-9] and returns the escaped string \[0-9\]?
For some reason, the above answer didn't work for me. For those like me who come after, here is what I found.
I was expecting a single backslash to escape the bracket, however, you must use two if you have the pattern stored in a string. The first backslash escapes the second one into the string, so that what regex sees is \]. Since regex just sees one backslash, it uses it to escape the square bracket.
\\]
In regex, that will match a single closing square bracket.
If you're trying to match a newline, for example, you'd only use a single backslash. There you're using the string escape sequence to insert a newline character into the string. Regex doesn't see \n; it sees the newline character and matches that. For the bracket you need two backslashes because \] is not a string escape sequence, it's a regex escape sequence.
You can use Pattern.quote(String).
From the docs:
public static String quote(String s)
Returns a literal pattern String for the specified String.
This method produces a String that can be used to create a Pattern that would match the string s as if it were a literal pattern.
Metacharacters or escape sequences in the input sequence will be given no special meaning.
You can use the \Q and \E special characters...anything between \Q and \E is automatically escaped.
\Q[0-9]\E
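A short sketch of what quoting does to [0-9], using Pattern.quote from the documentation quoted above. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo: a quoted pattern matches the text itself, not what it describes.
class QuoteDemo {
    // Wraps the text so that the regex engine treats it literally.
    static String quoted(String text) {
        return Pattern.quote(text);
    }

    public static void main(String[] args) {
        String q = quoted("[0-9]");
        System.out.println(q);                             // \Q[0-9]\E
        System.out.println(Pattern.matches(q, "[0-9]"));   // true: literal text matches
        System.out.println(Pattern.matches(q, "5"));       // false: no longer a digit class
        System.out.println(Pattern.matches("[0-9]", "5")); // true: unquoted, it is a digit class
    }
}
```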
Pattern.compile() likes square brackets just fine. If you take the string
".{8},[0-9],[^0-9A-Za-z ],[A-Z],[a-z]"
and split it on commas, you end up with five perfectly valid regexes: the first one matches eight non-line-separator characters, the second matches an ASCII digit, and so on. Unless you really want to match strings like ".{8}" and "[0-9]", I don't see why you would need to escape anything.
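To illustrate that the five pieces compile as they are, here is a minimal sketch that checks a password against each rule with find(). The class name, method name, and sample passwords are assumptions; it also assumes no individual rule contains a comma (a rule like .{8,12} would be broken by the split).

```java
import java.util.regex.Pattern;

// Hypothetical password check built on the comma-separated rule list.
class PasswordRulesDemo {
    static final String RULES = ".{8},[0-9],[^0-9A-Za-z ],[A-Z],[a-z]";

    // True if every rule matches somewhere in the password.
    static boolean satisfiesAll(String rules, String password) {
        for (String rule : rules.split(",")) {
            // No escaping needed: each piece is already a valid regex.
            if (!Pattern.compile(rule).matcher(password).find()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(satisfiesAll(RULES, "Abcdef1!")); // true
        System.out.println(satisfiesAll(RULES, "abcdefgh")); // false: no digit, symbol, or capital
        System.out.println(satisfiesAll(RULES, "Ab1!"));     // false: shorter than 8 characters
    }
}
```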
