Java regex - remove quotes unless preceded by odd number of backslashes [duplicate] - java

This question already has answers here:
RegEx: Look-behind to avoid odd number of consecutive backslashes
(2 answers)
Closed 4 years ago.
I am using regex to strip quotes from a String value. These String values can contain escaped quotes but also escaped backslash characters.
I do not want to remove escaped quotes, only non-escaped quotes. However, cases where escaped backslash characters precede a non-escaped quote are causing difficulty.
I want results like the following:
"value" -> value
'value' -> value
"\"value\"" -> \"value\" <-- contains escaped quotes
"value\" -> value\"
"value\\" -> value\\ <-- contains escaped backslash before non-escaped quote
"""val"ue\\\""" -> value\\\"
The following regex almost works for me, except that it also strips the backslashes when an even number of them precedes a quote, when I only want to strip the double and single quote characters.
(?<!\\\\)(?:\\\\{2})*[\"']

The problem occurs because your pattern matches those backslashes too, so they are removed along with the quote. To keep them, capture the backslashes and replace the match with the $1 placeholder:
s.replaceAll("((?<!\\\\)(?:\\\\{2})*)[\"']", "$1")
The ((?<!\\\\)(?:\\\\{2})*) is now wrapped in (...) and you may refer to the value captured within this group by using $1 in the replacement pattern.
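To check the capture-group fix against the examples from the question, here is a minimal sketch. The class name QuoteStripper and the method strip are made up for illustration; the pattern and the "$1" replacement are the ones from the answer.

```java
import java.util.regex.Pattern;

class QuoteStripper {
    // Group 1: an even-length run of backslashes (possibly empty) that is not
    // itself preceded by a backslash. The quote that follows it is unescaped.
    private static final Pattern UNESCAPED_QUOTE =
            Pattern.compile("((?<!\\\\)(?:\\\\{2})*)[\"']");

    // Hypothetical helper name, not from the original answer.
    static String strip(String s) {
        // "$1" puts the captured backslashes back, so only the quote is dropped.
        return UNESCAPED_QUOTE.matcher(s).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(strip("\"value\""));           // value
        System.out.println(strip("\"\\\"value\\\"\""));   // \"value\"
        System.out.println(strip("\"value\\\\\""));       // value\\
    }
}
```

Quotes preceded by an odd number of backslashes never match, because the look-behind rejects any starting point that has a backslash directly in front of it.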

Related

Java Regular Expression - how to use backslash [duplicate]

This question already has answers here:
java, regular expression, need to escape backslash in regex
(4 answers)
Closed 6 years ago.
I am really confused about how to escape. Sometimes I just need to prepend a single backslash, but sometimes I need to prepend a double backslash, as in "\\.".
Could anyone tell me why?
Also, could anyone give me an explanation of difference in
String.split("\t"),
String.split("\\t"),
String.split("\\\t"),
String.split("\\\\t")?
The backslash is a special character in string literals: we can use it to write \n or to escape " as \".
But the backslash is also special in the regular expression engine: for instance, we use it for predefined character classes like \w, \d and \s.
So if you want to create a string that represents a regex like \w, you need to write it as "\\w".
If you want to write a regex that matches a literal \, the text of that regex needs to be \\, which means the String literal for that text needs to be written as "\\\\".
In other words we need to escape backslash twice:
- once in regex \\
- and once in string "\\\\".
If you want to pass the regex engine a literal tab, you don't need to escape the backslash at all. Java reads the "\t" literal as a string containing the tab character, and you can pass that string to the regex engine without problems.
For our convenience, the regex engine in Java interprets the text \t (also \r and \n) the same way string literals interpret "\t". In other words, we can pass the regex engine a \ character followed by a t character and be sure it will be interpreted as a tab.
So code like split("\t") or split("\\t") will split on tab. split("\\\t") also splits on tab: the string holds a \ followed by an actual tab character, and the regex engine treats an escaped tab as a plain tab.
Code like split("\\\\t") will split not on a tab character but on a \ character followed by t. This happens because "\\\\", as explained above, represents the text \\, which the regex engine sees as an escaped \ (so it is treated as a literal).
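The four variants from the question can be tried side by side. This is only a small sketch; the class name SplitOnTab and the helper parts are invented for the example.

```java
import java.util.Arrays;
import java.util.List;

class SplitOnTab {
    // Hypothetical helper: splits and returns the pieces as a list.
    static List<String> parts(String input, String regex) {
        return Arrays.asList(input.split(regex));
    }

    public static void main(String[] args) {
        String tabbed = "a\tb";    // characters: a, TAB, b
        String literal = "a\\tb";  // characters: a, backslash, t, b

        // The regex engine receives a raw TAB character.
        System.out.println(parts(tabbed, "\t"));       // [a, b]
        // The regex engine receives backslash + t, which it reads as TAB.
        System.out.println(parts(tabbed, "\\t"));      // [a, b]
        // The regex engine receives backslash + raw TAB: an escaped TAB is still a TAB.
        System.out.println(parts(tabbed, "\\\t"));     // [a, b]
        // The regex engine receives \\t: an escaped backslash followed by the letter t.
        System.out.println(parts(literal, "\\\\t"));   // [a, b]
        // The same regex finds nothing to split on in the tabbed string.
        System.out.println(parts(tabbed, "\\\\t").size()); // 1
    }
}
```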

System.out.println ("\"\"\\\\\"\""); [duplicate]

This question already has answers here:
What is the backslash character (\\)?
(6 answers)
Closed 7 years ago.
Why does this string print only ""\\""? Does the backslash do something to the string? Please explain the function of the backslash. All I know is that it is the escape character, but I don't understand why it does this to strings.
The backslash '\' can be used in a String to add characters that would otherwise be illegal (e.g. " and ') or that have another meaning (e.g. t, b, n, r, f and \). For your particular example:
The first 2 backslashes escape the double quotes, so \"\" is printed as "".
Of the next four backslashes, the first and third each escape the backslash that immediately follows, so \\\\ is printed as \\.
The last 2 backslashes behave like the first 2 and escape the quotes, so \"\" is printed as "".
The backslash is the escape character, used to encode special things like " in your string (which you normally couldn't use, because they'd mark the end of the string). You should read up on "String literals" in the official Java documentation or the book you read to learn Java.
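A quick way to convince yourself is to print the string together with its length. A minimal sketch (the class name EscapeDemo is just for illustration):

```java
class EscapeDemo {
    // The literal is made of six escape sequences: \" \" \\ \\ \" \"
    // Each escape sequence produces exactly one character.
    static final String S = "\"\"\\\\\"\"";

    public static void main(String[] args) {
        System.out.println(S);           // ""\\""
        System.out.println(S.length());  // 6
    }
}
```

Twelve characters in the source literal become six characters in the actual string.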

Underlined backslash IntelliJ

I am using a backslash as an escape character for a serialization format I am working on. I have it as a constant but IntelliJ is underlining it and highlighting it red. On hover it gives no error messages or any information as to why it does not like it.
What is the reason for this and how do I fix it?
IntelliJ is smarter than I am and realised that I was using this character in a regular expression where 2 backslashes would be needed, however, IntelliJ also assumed that my puny mind could find the problem without giving me any information about it.
If it's being used as a regular expression, then the "\" must be escaped.
If you're escaping a "\" as "\\" like traditional regular expressions require, then in Java you also need to add two more backslashes, for a total of "\\\\".
This is because of the way Java interprets "\":
In literal Java strings the backslash is an escape character. The
literal string "\\" is a single backslash. In regular expressions, the
backslash is also an escape character. The regular expression \\
matches a single backslash. This regular expression as a Java string,
becomes "\\\\". That's right: 4 backslashes to match a single one.
The regex \w matches a word character. As a Java string, this is
written as "\\w".
The same backslash-mess occurs when providing replacement strings for
methods like String.replaceAll() as literal Java strings in your Java
code. In the replacement text, a dollar sign must be encoded as \$ and
a backslash as \\ when you want to replace the regex match with an
actual dollar sign or backslash. However, backslashes must also be
escaped in literal Java strings. So a single dollar sign in the
replacement text becomes "\\$" when written as a literal Java string.
The single backslash becomes "\\\\". Right again: 4 backslashes to
insert a single one.
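The quoted rules are easy to try out with String.replaceAll(). The following is a rough sketch; the class and method names are invented for the example, and the inputs are arbitrary.

```java
class BackslashLevels {
    // Regex side: the literal "\\\\" is the regex \\, which matches ONE backslash.
    static String toForwardSlashes(String path) {
        return path.replaceAll("\\\\", "/");
    }

    // Replacement side: the literal "\\$" is the replacement text \$, a plain dollar sign.
    static String insertDollar(String s) {
        return s.replaceAll("USD", "\\$");
    }

    // Replacement side: the literal "\\\\" is the replacement text \\, ONE backslash.
    static String toBackslashes(String path) {
        return path.replaceAll("/", "\\\\");
    }

    public static void main(String[] args) {
        System.out.println(toForwardSlashes("C:\\dir\\file")); // C:/dir/file
        System.out.println(insertDollar("cost: USD"));         // cost: $
        System.out.println(toBackslashes("a/b"));              // a\b
    }
}
```

If the replacement text comes from a variable, java.util.regex.Matcher.quoteReplacement() does this escaping for you.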

How do you match military time? [duplicate]

This question already has answers here:
Invalid escape sequence \d
(2 answers)
Closed 10 years ago.
I'm trying to create a valid Java regex for matching strings representing standard "military time":
String militaryTimeRegex = "^([01]\d|2[0-3]):?([0-5]\d)$";
This gives me a compiler error:
Invalid escape sequence (valid ones are \b \t \n \f \r \" \' \\ )
Where am I going wrong?!?
Make sure you use double backslashes for escaping characters:
String militaryTimeRegex = "^([01]\\d|2[0-3]):?([0-5]\\d)$";
Single backslashes indicate the beginning of an escape sequence. You need to use \\ to get the character as it appears in the String.
To answer your comment, you are currently only matching 19:00. You need to account for the additional :00 at the end of the String in your pattern:
String militaryTimeRegex = "^([01]\\d|2[0-3]):?([0-5]\\d):?([0-5]\\d)$";
In Java, you need to double-escape all the \ characters:
String militaryTimeRegex = "^([01]\\d|2[0-3]):([0-5]\\d):([0-5]\\d)$";
Why? Because \ is the escape character for strings, and if you need a literal \ to appear somewhere inside a string, then you have to escape it, too: \\.
According to the error message, \d is not a valid escape sequence in a Java string. Escape the backslash and write \\d.
Although \d is valid regex syntax, you need to escape the backslash in the Java string:
String militaryTimeRegex = "^([01]\\d|2[0-3]):?([0-5]\\d)$";
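To see the corrected HH:mm pattern in action, here is a minimal sketch. The class name MilitaryTime and the method isValid are made up for this example; the regex is the one from the answers above.

```java
import java.util.regex.Pattern;

class MilitaryTime {
    // "\\d" in the Java literal reaches the regex engine as \d (a digit).
    private static final Pattern HHMM =
            Pattern.compile("^([01]\\d|2[0-3]):?([0-5]\\d)$");

    // Hypothetical helper name, not from the original answers.
    static boolean isValid(String s) {
        return HHMM.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("19:00")); // true
        System.out.println(isValid("1900"));  // true  (the colon is optional)
        System.out.println(isValid("24:00")); // false (hours stop at 23)
        System.out.println(isValid("19:60")); // false (minutes stop at 59)
    }
}
```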

Splitting a string that has escape sequence using regular expression in Java

String to be split
abc:def:ghi\:klm:nop
The string should be split on ":".
"\" is the escape character, so "\:" should not be treated as a separator.
split(":") gives
[abc]
[def]
[ghi\]
[klm]
[nop]
Required output is array of string
[abc]
[def]
[ghi\:klm]
[nop]
How can the \: be ignored?
Use a look-behind assertion:
split("(?<!\\\\):")
This will only match if there is no preceding \. Using double escaping \\\\ is required as one is required for the string declaration and one for the regular expression.
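Applied to the string from the question, the look-behind split can be sketched like this (the class name EscapedColonSplit and the method split are only for illustration):

```java
import java.util.Arrays;
import java.util.List;

class EscapedColonSplit {
    static List<String> split(String input) {
        // Regex text: (?<!\\):  -> a colon not directly preceded by a backslash.
        return Arrays.asList(input.split("(?<!\\\\):"));
    }

    public static void main(String[] args) {
        // Input characters: abc:def:ghi\:klm:nop
        System.out.println(split("abc:def:ghi\\:klm:nop")); // [abc, def, ghi\:klm, nop]
    }
}
```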
Note however that this will not allow you to escape backslashes, in the case that you want to allow a token to end with a backslash. To do that you will have to first replace all double backslashes with
string.replaceAll("\\\\\\\\", ESCAPE_BACKSLASH)
(where ESCAPE_BACKSLASH is a string which will not occur in your input) and then, after splitting using the look-behind assertion, replace the ESCAPE_BACKSLASH string with an unescaped backslash with
token.replaceAll(ESCAPE_BACKSLASH, "\\\\")
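Putting the three steps together might look roughly like the sketch below. The placeholder value "\u0000" (a NUL character) is purely an assumption for this example; use whatever is guaranteed not to occur in your input. The class name PlaceholderSplit is also made up.

```java
import java.util.ArrayList;
import java.util.List;

class PlaceholderSplit {
    // Assumed placeholder: a NUL character that never occurs in the input.
    private static final String ESCAPE_BACKSLASH = "\u0000";

    static List<String> split(String input) {
        // 1. Hide every escaped backslash (two backslashes) behind the placeholder.
        String hidden = input.replaceAll("\\\\\\\\", ESCAPE_BACKSLASH);
        List<String> tokens = new ArrayList<>();
        // 2. Split on colons that are not preceded by a remaining backslash.
        for (String token : hidden.split("(?<!\\\\):")) {
            // 3. Turn each placeholder into a single, unescaped backslash.
            tokens.add(token.replaceAll(ESCAPE_BACKSLASH, "\\\\"));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Input characters: abc\\:def\:ghi
        System.out.println(split("abc\\\\:def\\:ghi")); // [abc\, def\:ghi]
    }
}
```

The first token now ends with a real backslash, while the escaped colon in the second token is left alone.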
Gumbo was right using a look-behind assertion, but in case your string contains the escaped escape character (e.g. \\) right in front of a comma, the split might break. See this example:
test1\,test1,test2\\,test3\\\,test3\\\\,test4
If you do a simple look-behind split for (?<!\\), as Gumbo suggested, the string gets split into only two parts: test1\,test1 and test2\\,test3\\\,test3\\\\,test4. This is because the look-behind checks just one character back for the escape character. The correct behaviour is to split on plain commas and on commas preceded by an even number of escape characters.
To achieve this a slightly more complex (double) look-behind expression is needed:
(?<!(?<![^\\]\\(?:\\{2}){0,10})\\),
Using this more complex regular expression in Java again requires escaping every \ as \\. So this should be a more sophisticated answer to your question:
"any comma separated string".split("(?<!(?<![^\\\\]\\\\(?:\\\\{2}){0,10})\\\\),");
Note: Java does not support infinite repetitions inside of lookbehinds. Therefore only up to 10 repeating double escape characters are checked by using the expression {0,10}. If needed, you can increase this value by adjusting the latter number.
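For reference, here is a small sketch that runs the double look-behind against the example string from this answer. The class name EvenBackslashSplit is invented, and the expression is copied unchanged, including the {0,10} limit.

```java
import java.util.Arrays;
import java.util.List;

class EvenBackslashSplit {
    // Regex text: (?<!(?<![^\\]\\(?:\\{2}){0,10})\\),
    private static final String SEPARATOR =
            "(?<!(?<![^\\\\]\\\\(?:\\\\{2}){0,10})\\\\),";

    static List<String> split(String input) {
        return Arrays.asList(input.split(SEPARATOR));
    }

    public static void main(String[] args) {
        // Input characters: test1\,test1,test2\\,test3\\\,test3\\\\,test4
        String input = "test1\\,test1,test2\\\\,test3\\\\\\,test3\\\\\\\\,test4";
        for (String token : split(input)) {
            System.out.println(token);
        }
    }
}
```

With this input the split should yield four tokens: test1\,test1 then test2\\ then test3\\\,test3\\\\ and finally test4. Note that the inner look-behind needs a non-backslash character in front of the run of backslashes, so a token consisting only of backslashes at the very start of the string is not covered.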
