Why do ";" and "\\;" find the same? - java

I just found Java code like this:
"bla;bla;bla".split("\\;");
It returns:
["bla","bla","bla"] // String array of course
String.split does use regex, but from my research I found that ; is not a special character in regex and doesn't have to be escaped. So I tried replacing it with:
"bla;bla;bla;".split(";");
and it still does the same! So what is happening here? Is Java trying to be nice and ignoring a useless backslash in the regex? I also tried it in Notepad++, and there both patterns find a single semicolon as well.

In the following code:
"bla;bla;bla".split("\\;");
String#split() executes in a regex context. Two backslashes \\ result in a literal backslash, and so you end up splitting on \;, which functionally is the same as just splitting on ;, because semicolon does not need to be escaped.
If you tried the following split, you would not get the result you expect:
"bla;bla;bla".split("\\\\;");
This would correspond, in regex terms, to splitting on a literal backslash followed by a semicolon. Since that separator never appears in your string, you would just get an array whose only element is the input string.
See the answer by @AndyTurner for an explanation of why splitting on \; is allowed in the first place.
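The three cases discussed above can be checked side by side. This is only a minimal sketch; the class name SplitEscapeDemo and the helper splitOn are made up for illustration:

```java
import java.util.Arrays;

class SplitEscapeDemo {
    // Hypothetical helper: splits the sample input on the given regex.
    static String[] splitOn(String regex) {
        return "bla;bla;bla".split(regex);
    }

    public static void main(String[] args) {
        // Regex ;    (no escaping at all)
        System.out.println(Arrays.toString(splitOn(";")));
        // Regex \;   (escaped semicolon, same meaning as a plain ;)
        System.out.println(Arrays.toString(splitOn("\\;")));
        // Regex \\;  (a literal backslash followed by ;), which never occurs in the input
        System.out.println(Arrays.toString(splitOn("\\\\;")));
    }
}
```

The first two calls print [bla, bla, bla]; the third prints [bla;bla;bla], a one-element array.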

From the Javadoc of Pattern (emphasis mine):
The backslash character ('\') serves to introduce escaped constructs
...
It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.
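A small sketch of what that rule means in practice. The class EscapeRuleDemo and the helper compiles are hypothetical names, and the choice of \y as an undefined alphabetic escape is an assumption that holds for the JDK versions I know of:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EscapeRuleDemo {
    // Hypothetical helper: reports whether the regex compiles.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Backslash before a non-alphabetic character: always allowed.
        System.out.println(compiles("\\;"));
        // Backslash before an alphabetic character that is a defined construct (digit class).
        System.out.println(compiles("\\d"));
        // Backslash before an alphabetic character with no construct behind it (assumed: \y).
        System.out.println(compiles("\\y"));
    }
}
```

On current JDKs this should print true, true, false.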

The answers are fine. However, nobody mentioned Pattern.quote()
Java does not have a raw or literal string (e.g. like a @"..." verbatim string in C# or a r"..." raw string in Python). Nonetheless, for regular expressions we have the quote method that returns a literal pattern String for the specified String:
This method produces a String that can be used to create a Pattern
that would match the string s as if it were a literal pattern.
So, if you had used quote to specify your pattern, no split would have happened, as illustrated in the following code sample:
import java.util.regex.Pattern;
class Example
{
public static void main (String[] args) throws java.lang.Exception
{
String sourcestring = "bla;bla;bla";
Pattern re = Pattern.compile(Pattern.quote("\\;"));
String[] parts = re.split(sourcestring);
for(int partsIdx = 0; partsIdx < parts.length; partsIdx++ ){
System.out.println( "[" + partsIdx + "] = " + parts[partsIdx]);
}
}
}
Output:
[0] = bla;bla;bla
Otherwise, it's just an escaped semi-colon in the regex context of the split method as explained by Tim and Andy.
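Where quote is more commonly useful is the opposite situation: the delimiter is a regex metacharacter and you want it taken literally. A minimal sketch, with QuoteSplitDemo and splitLiteral as made-up names:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuoteSplitDemo {
    // Hypothetical helper: splits on the delimiter taken literally, whatever it contains.
    static String[] splitLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitLiteral("bla;bla;bla", ";")));
        // "." and "|" are metacharacters; quoting makes them plain delimiters.
        System.out.println(Arrays.toString(splitLiteral("a.b.c", ".")));
        System.out.println(Arrays.toString(splitLiteral("a|b|c", "|")));
        // The quoted pattern itself is the delimiter wrapped in \Q ... \E.
        System.out.println(Pattern.quote(";"));
    }
}
```

Each split yields three elements, and the last line prints \Q;\E.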

Related

How do i check if string contains char sequence and backslash "\"?

I'm trying to get true in the following test. I have a string with a backslash that for some reason isn't recognized.
String s = "Good news\\ everyone!";
Boolean test = s.matches("(.*)news\\.");
System.out.println(test);
I've tried a lot of variants, but only (.*)news(.*) works. That actually allows any characters after news, though; I need it to match only when news is followed by \.
How can I do that?
Group the elements at the end:(.*)news\\(.*)
You can use this instead :
Boolean test = s.matches("(.*)news\\\\(.*)");
Try something like:
Boolean test = s.matches(".*news\\\\.*");
Here .* means any number of characters, followed by news, followed by a literal backslash (written as four backslashes in the string literal) and then any number of characters after that (can be zero as well).
Your regex, on the other hand, reaches the regex engine as (.*)news\. which means:
(.*) any number of characters
news the literal text news
\. followed by a literal dot
which is not satisfied by the String in your program, "Good news\ everyone!", because it contains no "news." at all.
You are testing for an escaped occurrence of a literal dot: ".".
Refactor your pattern as follows (inferring the last part as you need it for a full match):
String s = "Good news\\ everyone!";
System.out.println(s.matches("(.*)news\\\\.*"));
Output
true
Explanation
The back-slash is used to escape characters and the back-slash itself in Java Strings
In Java Pattern representations, you need to double-escape your back-slashes for representing a literal back-slash ("\\\\"), as double-back-slashes are already used to represent special constructs (e.g. \\p{Punct}), or escape them (e.g. the literal dot \\.).
String.matches will attempt to match the whole String against your pattern, so you need the terminal part of the pattern I've added
you can try this :
String s = "Good news\\ everyone!";
Boolean test = s.matches("(.*)news\\\\(.*)");
System.out.println(test);
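To compare the variants from this thread, here is a minimal sketch (the class BackslashMatchDemo and its method names are invented for illustration). It also shows two alternatives that avoid the leading and trailing .* altogether:

```java
import java.util.regex.Pattern;

class BackslashMatchDemo {
    // Runtime value: Good news\ everyone!  (a single backslash)
    static final String S = "Good news\\ everyone!";

    // matches() must cover the whole string, hence the .* on both sides.
    static boolean wholeMatch() {
        return S.matches(".*news\\\\.*");
    }

    // find() looks for the pattern anywhere, so no padding is needed.
    static boolean partialFind() {
        return Pattern.compile("news\\\\").matcher(S).find();
    }

    // No regex at all: only the string-literal escaping remains.
    static boolean plainContains() {
        return S.contains("news\\");
    }

    // The pattern from the question: news followed by a literal dot.
    static boolean originalAttempt() {
        return S.matches("(.*)news\\.");
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch());
        System.out.println(partialFind());
        System.out.println(plainContains());
        System.out.println(originalAttempt());
    }
}
```

This prints true three times and then false for the original attempt.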

how to ignore newlines for split function

I am splitting the string using ^ char. The String which I am reading, is coming from some external source. This string contains some \n characters.
The string may look like:
Hi hello^There\nhow are\nyou doing^9987678867abc^popup
When I split it as shown below, why is the array length 2 instead of 4?
String[] st = msg[0].split("^");
st.length // gives 2 instead of 4
It looks like split ignores everything after \n.
How can I fix this without replacing \n with some other character?
The string parameter for split is interpreted as a regular expression.
So you have to escape the char and use:
msg[0].split("\\^")
see this answer for more details
Escape the ^ character. Use msg[0].split("\\^") instead.
String.split considers its argument as regular expression. And as ^ has a special meaning when it comes to regular expressions, you need to escape it to use its literal representation.
If you want to split by ^ only, then
String[] st = msg[0].split("\\^");
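If I read your question correctly, you want to split by both ^ and \n, so this would suffice:
String[] st = msg[0].split("\\^|\\\\n");
This assumes that \n literally exists as 2 characters (a backslash followed by n) in the string. A character class such as [\\^\\\\n] would not do here, because it would also split on every single n.
The JDK treats "^" as a regular-expression metacharacter (the start-of-input anchor).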
To avoid this confusion you need to modify the code as below
old code = msg[0].split("^")
new code = msg[0].split("\\^")
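A quick check of the fix on the sample from the question. This sketch assumes the \n in the sample is a real newline character; CaretSplitDemo and its methods are hypothetical names:

```java
import java.util.regex.Pattern;

class CaretSplitDemo {
    // Sample from the question; \n is assumed to be a real newline here.
    static final String MSG = "Hi hello^There\nhow are\nyou doing^9987678867abc^popup";

    // Escape the caret by hand.
    static String[] splitOnCaret(String s) {
        return s.split("\\^");
    }

    // Or let Pattern.quote do the escaping.
    static String[] splitOnCaretQuoted(String s) {
        return s.split(Pattern.quote("^"));
    }

    public static void main(String[] args) {
        System.out.println(splitOnCaret(MSG).length);
        System.out.println(splitOnCaretQuoted(MSG).length);
        // The newlines stay inside the second element untouched.
        System.out.println(splitOnCaret(MSG)[1].contains("\n"));
    }
}
```

Both variants give 4 elements, and the second element still contains its newlines.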

java regex illegal escape character error not occurring from command line arguments [duplicate]

This simple regex program
import java.util.regex.*;
class Regex {
public static void main(String [] args) {
System.out.println(args[0]); // #1
Pattern p = Pattern.compile(args[0]); // #2
Matcher m = p.matcher(args[1]);
boolean b = false;
while(b = m.find()) {
System.out.println(m.start()+" "+m.group());
}
}
}
invoked by java Regex "\d" "sfdd1" compiles and runs fine.
But if the argument in #2 is replaced by a literal, i.e. Pattern p = Pattern.compile("\d");, the compiler reports an illegal escape character error. In #1 I also tried printing the pattern passed on the command line. It prints \d, which means args[0] in #2 is simply \d as well.
So why doesn't that throw any exception? In the end it is a String argument that Pattern.compile() takes; doesn't it detect the illegal escape character then? Can someone please explain this behaviour?
A backslash character in a string literal needs to be escaped (preceded by a backslash). When passed in from the command line the string is not a string literal. The compiler complains because "\d" is not a valid escape sequence (see Escape Sequences for Character and String Literals ).
The \ character is used as an escape character for both Java string literals and regular expressions. This confuses many programmers. When you want to create a String in Java to represent a regular expression that has an escape character then you need to escape the Java escape character.
When passing the string in on the command line the JVM handles this for you and simply creates the String.
What you want is this
Pattern p = Pattern.compile("\\d");
The backslash \ in Java results in an escape in strings. For example, the string "\t" results in a tab character in java. This is also why "\n" produces a newline.
In regular expressions, \d is an escape with respect to the regular expression, not Java. This means in order to get \d in a string literal, you have to type "\\d" in the string. Basically, you have to escape the \ to get the literal value \d, and then when Pattern compiles the regex, it interprets that \d as the digit character class.
This can be confusing, but long story short, you should never have a single \ in a string literal for a regular expression, since even the string literal "\\n" gets parsed properly (the regex engine itself understands \n as a newline).
I'm not entirely sure if I understand the question, but it seems like your problem is that you're treating "\d" as a Java escape sequence, which doesn't exist. To treat it as a regex escape, use "\\d" to escape the Java escape.
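The difference between the source literal and the runtime String can be made visible with a short sketch. LiteralVsRuntimeDemo and firstDigit are invented names; the input sfdd1 is the one from the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralVsRuntimeDemo {
    // The literal "\\d" is two characters at runtime: a backslash and 'd'.
    // That is the same String the program receives when \d is passed as an argument.
    static final String REGEX = "\\d";

    // Returns "start group" for the first digit found, or null if there is none.
    static String firstDigit(String input) {
        Matcher m = Pattern.compile(REGEX).matcher(input);
        return m.find() ? m.start() + " " + m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(REGEX.length());
        System.out.println(REGEX);
        System.out.println(firstDigit("sfdd1"));
    }
}
```

This prints 2, then \d, then 4 1. The compiler only ever sees escape sequences in literals; a String that arrives at runtime is never checked for them.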

Replacing double backslashes with single backslash

I have a string "\\u003c", which belongs to UTF-8 charset. I am unable to decode it to unicode because of the presence of double backslashes. How do i get "\u003c" from "\\u003c"? I am using java.
I tried with,
myString.replace("\\\\", "\\");
but could not achieve what i wanted.
This is my code,
String myString = FileUtils.readFileToString(file);
String a = myString.replace("\\\\", "\\");
byte[] utf8 = a.getBytes();
// Convert from UTF-8 to Unicode
a = new String(utf8, "UTF-8");
System.out.println("Converted string is:"+a);
and content of the file is
\u003c
You can use String#replaceAll:
String str = "\\\\u003c";
str= str.replaceAll("\\\\\\\\", "\\\\");
System.out.println(str);
It looks weird because the first argument is a string defining a regular expression, and \ is a special character both in string literals and in regular expressions. To actually put a \ in our search string, we need to escape it (\\) in the literal. But to actually put a \ in the regular expression, we have to escape it at the regular expression level as well. So to literally get \\ in a string, we need write \\\\ in the string literal; and to get two literal \\ to the regular expression engine, we need to escape those as well, so we end up with \\\\\\\\. That is:
String Literal    String                       Meaning to Regex
--------------    -------------------------    -------------------------
\                 Escape the next character    Would depend on next char
\\                \                            Escape the next character
\\\\              \\                           Literal \
\\\\\\\\          \\\\                         Literal \\
In the replacement parameter, even though it's not a regex, it still treats \ and $ specially — and so we have to escape them in the replacement as well. So to get one backslash in the replacement, we need four in that string literal.
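If counting backslashes by hand feels error-prone, the JDK can do the regex-level escaping for both arguments. A minimal sketch, assuming you want both the target and the replacement taken literally; QuoteReplacementDemo and replaceLiterally are hypothetical names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteReplacementDemo {
    // Hypothetical helper: regex-based replacement where both arguments are meant literally.
    static String replaceLiterally(String input, String target, String replacement) {
        return input.replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String input = "\\\\u003c"; // runtime value: \\u003c
        // Hand-escaped: eight backslashes for the regex, four for the replacement.
        System.out.println(input.replaceAll("\\\\\\\\", "\\\\"));
        // Library-escaped: only the usual string-literal doubling remains.
        System.out.println(replaceLiterally(input, "\\\\", "\\"));
    }
}
```

Both lines print \u003c. Pattern.quote covers the first column-to-third-column step of the table for the search pattern, and Matcher.quoteReplacement does the same for \ and $ in the replacement.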
Not sure if you're still looking for a solution to your problem (since you have an accepted answer) but I will still add my answer as a possible solution to the stated problem:
String str = "\\u003c";
Matcher m = Pattern.compile("(?i)\\\\u([\\da-f]{4})").matcher(str);
if (m.find()) {
String a = String.valueOf((char) Integer.parseInt(m.group(1), 16));
System.out.printf("Unicode String is: [%s]%n", a);
}
OUTPUT:
Unicode String is: [<]
Here is online demo of the above code
Regarding the problem of "replacing double backslashes with single backslashes" or, more generally, "replacing a simple string, containing \, with a different simple string, containing \" (which is not entirely the OP problem, but part of it):
Most of the answers in this thread mention replaceAll, which is a wrong tool for the job here. The easier tool is replace, but confusingly, the OP states that replace("\\\\", "\\") doesn't work for him, that's perhaps why all answers focus on replaceAll.
Important note for people with JavaScript background:
Note that replace(CharSequence, CharSequence) in Java does replace ALL occurrences of a substring - unlike in JavaScript, where it only replaces the first one!
Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence.
On the other hand, replaceAll(String regex, String replacement) treats both parameters as more than regular strings:
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string.
(this is because \ and $ can be used as backreferences to the captured regex groups, hence if you want to use them literally, you need to escape them).
In other words, both first and 2nd params of replace and replaceAll behave differently. For replace you need to double the \ in both params (standard escaping of a backslash in a string literal), whereas in replaceAll, you need to quadruple it! (standard string escape + function-specific escape)
To sum up, for simple replacements, one should stick to replace("\\\\", "\\") (it needs only one escaping, not two).
https://ideone.com/ANeMpw
System.out.println("a\\\\b\\\\c"); // "a\\b\\c"
System.out.println("a\\\\b\\\\c".replaceAll("\\\\\\\\", "\\\\")); // "a\b\c"
//System.out.println("a\\\\b\\\\c".replaceAll("\\\\\\\\", "\\")); // runtime error
System.out.println("a\\\\b\\\\c".replace("\\\\", "\\")); // "a\b\c"
https://www.ideone.com/Fj4RCO
String str = "\\\\u003c";
System.out.println(str); // "\\u003c"
System.out.println(str.replaceAll("\\\\\\\\", "\\\\")); // "\u003c"
System.out.println(str.replace("\\\\", "\\")); // "\u003c"
Another option, capture one of the two slashes and replace both slashes with the captured group:
public static void main(String args[])
{
String str = "C:\\\\";
str= str.replaceAll("(\\\\)\\\\", "$1");
System.out.println(str);
}
Try using,
myString.replaceAll("[\\\\]{2}", "\\\\");
This replaces a double backslash with a single backslash:
public static void main(String args[])
{
String str = "\\\\u003c";
str= str.replaceAll("\\\\\\\\", "\\\\");
System.out.println(str);
}
"\\u003c" does not 'belong to UTF-8 charset' at all. At runtime it is six characters: '\', 'u', '0', '0', '3', and 'c'. The real question here is why the double backslashes are there at all. Or are they really there? Is your problem perhaps something completely different? If the String "\\u003c" is in your source code, there are no double backslashes in it at all at runtime, and whatever your problem may be, it doesn't concern decoding in the presence of double backslashes.

How do I replace all "[", "]" and double quotes in Java

I'm having difficulty using the replaceAll method to replace square brackets and double quotes. Any ideas?
Edit:
So far I've tried:
replaceAll("\[", "some_thing") // returns illegal escape character
replaceAll("[[", "some_thing") // returns Unclosed character class
replaceAll("^[", "some_thing") // returns Unclosed character class
Don't use replaceAll, use replace. The former uses regular expressions, and [] are special characters within a regex.
String replaced = input.replace("]", ""); //etc
The double quote is special in Java so you need to escape it with a single backslash ("\"").
If you want to use a regex you need to escape those characters and put them in a character class. A character class is surrounded by [] and escaping a character is done by preceding it with a backslash \. However, because a backslash is also special in Java, it also needs to be escaped, and so to give the regex engine a backslash you have to use two backslashes (\\[).
In the end it should look like this (if you were to use regex):
String replaced = input.replaceAll("[\\[\\]\"]", "");
The replaceAll method is operating against Regular Expressions. You're probably just wanting to use the "replace" method, which despite its name, does replace all occurrences.
Looking at your edit, you probably want:
someString
.replace("[", "replacement")
.replace("]", "replacement")
.replace("\"", "replacement")
or, use an appropriate regular expression, the approach I'd actually recommend if you're willing to learn regular expressions (see Mark Peter's answer for a working example).
replaceAll() takes a regex so you have to escape special characters. If you don't want all the fancy regex, use replace().
String s = "[h\"i]";
System.out.println( s.replace("[","").replace("]","").replace("\"","") );
With double quotes, you have to escape them like so: "\""
In java:
String resultString = subjectString.replaceAll("[\\[\\]\"]", "");
this will replace []" with nothing.
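The regex route and the plain replace route suggested in this thread can be compared directly. A minimal sketch with made-up names (BracketStripDemo, viaRegex, viaReplace), using the sample string from one of the answers above:

```java
class BracketStripDemo {
    // Regex route: one character class containing [, ] and the double quote.
    static String viaRegex(String s) {
        return s.replaceAll("[\\[\\]\"]", "");
    }

    // Literal route: replace() takes its arguments literally, so only the
    // double quote needs escaping, and only for the Java string literal.
    static String viaReplace(String s) {
        return s.replace("[", "").replace("]", "").replace("\"", "");
    }

    public static void main(String[] args) {
        String s = "[h\"i]"; // runtime value: [h"i]
        System.out.println(viaRegex(s));
        System.out.println(viaReplace(s));
    }
}
```

Both calls print hi.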
Alternatively, if you wished to replace ", [ and ] with different characters (instead of replacing all with empty String) you could use the replaceEachRepeatedly() method in the StringUtils class from Commons Lang.
For example,
String input = "abc\"de[fg]hi\"";
String replaced = StringUtils.replaceEachRepeatedly(input,
new String[]{"[","]","\""},
new String[]{"<open_bracket>","<close_bracket>","<double_quote>"});
System.out.println(replaced);
Prints the following:
abc<double_quote>de<open_bracket>fg<close_bracket>hi<double_quote>
