I have a string "\\u003c", which belongs to the UTF-8 charset. I am unable to decode it to Unicode because of the double backslashes. How do I get "\u003c" from "\\u003c"? I am using Java.
I tried with,
myString.replace("\\\\", "\\");
but could not achieve what i wanted.
This is my code,
String myString = FileUtils.readFileToString(file);
String a = myString.replace("\\\\", "\\");
byte[] utf8 = a.getBytes();
// Convert from UTF-8 to Unicode
a = new String(utf8, "UTF-8");
System.out.println("Converted string is:"+a);
and content of the file is
\u003c
You can use String#replaceAll:
String str = "\\\\u003c";
str= str.replaceAll("\\\\\\\\", "\\\\");
System.out.println(str);
It looks weird because the first argument is a string defining a regular expression, and \ is a special character both in string literals and in regular expressions. To actually put a \ in our search string, we need to escape it (\\) in the literal. But to actually put a \ in the regular expression, we have to escape it at the regular expression level as well. So to literally get \\ in a string, we need to write \\\\ in the string literal; and to get two literal \\ to the regular expression engine, we need to escape those as well, so we end up with \\\\\\\\. That is:
String Literal   String                      Meaning to Regex
---------------  --------------------------  --------------------------
\                Escape the next character   Would depend on next char
\\               \                           Escape the next character
\\\\             \\                          Literal \
\\\\\\\\         \\\\                        Literal \\
In the replacement parameter, even though it's not a regex, it still treats \ and $ specially — and so we have to escape them in the replacement as well. So to get one backslash in the replacement, we need four in that string literal.
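The escaping levels in the table can be checked directly. The following is only a minimal sketch; the class name EscapeLevels and its two helper methods are made up for illustration.

```java
import java.util.regex.Pattern;

class EscapeLevels {
    // Number of characters the string actually holds at runtime.
    static int runtimeLength(String s) {
        return s.length();
    }

    // True if the regex matches the whole input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String one = "\\";     // one backslash at runtime
        String two = "\\\\";   // two backslashes at runtime

        System.out.println(runtimeLength("\\\\"));         // 2
        System.out.println(runtimeLength("\\\\\\\\"));     // 4
        System.out.println(matchesWhole("\\\\", one));     // true
        System.out.println(matchesWhole("\\\\\\\\", two)); // true

        // Four backslashes in the replacement literal produce one backslash.
        System.out.println(two.replaceAll("\\\\\\\\", "\\\\").equals(one)); // true
    }
}
```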
Not sure if you're still looking for a solution to your problem (since you have an accepted answer) but I will still add my answer as a possible solution to the stated problem:
String str = "\\u003c";
Matcher m = Pattern.compile("(?i)\\\\u([\\da-f]{4})").matcher(str);
if (m.find()) {
String a = String.valueOf((char) Integer.parseInt(m.group(1), 16));
System.out.printf("Unicode String is: [%s]%n", a);
}
OUTPUT:
Unicode String is: [<]
Here is online demo of the above code
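The snippet above decodes only the first escape it finds. If the text may contain several escapes, the same pattern can be applied in a loop. This is a rough sketch under that assumption; the class UnicodeUnescape and the method decode are hypothetical names, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescape {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces every such escape in the input with the character it names.
    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded backslash or dollar sign literal.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("a \\u003c b \\u003e c")); // a < b > c
    }
}
```

Matcher.quoteReplacement matters here because a decoded character could itself be \ or $, which appendReplacement would otherwise treat specially.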
Regarding the problem of "replacing double backslashes with single backslashes" or, more generally, "replacing a simple string, containing \, with a different simple string, containing \" (which is not entirely the OP problem, but part of it):
Most of the answers in this thread mention replaceAll, which is the wrong tool for the job here. The easier tool is replace, but confusingly, the OP states that replace("\\\\", "\\") doesn't work for him, which is perhaps why all answers focus on replaceAll.
Important note for people with JavaScript background:
Note that replace(CharSequence, CharSequence) in Java does replace ALL occurrences of a substring - unlike in JavaScript, where it only replaces the first one!
Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence.
On the other hand, replaceAll(String regex, String replacement) -- more docs also here -- is treating both parameters as more than regular strings:
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string.
(this is because $ can be used as a backreference to a captured regex group and \ as an escape character, hence if you want to use them literally, you need to escape them).
In other words, both first and 2nd params of replace and replaceAll behave differently. For replace you need to double the \ in both params (standard escaping of a backslash in a string literal), whereas in replaceAll, you need to quadruple it! (standard string escape + function-specific escape)
To sum up, for simple replacements, one should stick to replace("\\\\", "\\") (it needs only one escaping, not two).
https://ideone.com/ANeMpw
System.out.println("a\\\\b\\\\c"); // "a\\b\\c"
System.out.println("a\\\\b\\\\c".replaceAll("\\\\\\\\", "\\\\")); // "a\b\c"
//System.out.println("a\\\\b\\\\c".replaceAll("\\\\\\\\", "\\")); // runtime error
System.out.println("a\\\\b\\\\c".replace("\\\\", "\\")); // "a\b\c"
https://www.ideone.com/Fj4RCO
String str = "\\\\u003c";
System.out.println(str); // "\\u003c"
System.out.println(str.replaceAll("\\\\\\\\", "\\\\")); // "\u003c"
System.out.println(str.replace("\\\\", "\\")); // "\u003c"
Another option, capture one of the two slashes and replace both slashes with the captured group:
public static void main(String args[])
{
String str = "C:\\\\";
str= str.replaceAll("(\\\\)\\\\", "$1");
System.out.println(str);
}
Try using,
myString.replaceAll("[\\\\]{2}", "\\\\");
This replaces each double backslash with a single backslash.
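public static void main(String args[])
{
// Two backslashes at runtime, followed by u003c
String str = "\\\\u003c";
// The regex matches two backslashes; the replacement yields one
str= str.replaceAll("\\\\\\\\", "\\\\");
System.out.println(str);
}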
"\\u003c" does not 'belong to UTF-8 charset' at all. It is six characters: '\', 'u', '0', '0', '3', and 'c'. The real question here is why are the double backslashes there at all? Or are they really there? And is your problem perhaps something completely different? If the String "\\u003c" is in your source code, there are no double backslashes in it at all at runtime, and whatever your problem may be, it doesn't concern decoding in the presence of double backslashes.
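A quick way to see what a literal really contains at runtime is to print its length and characters. This is a small sketch; the class LiteralVsRuntime and the method describe are illustrative names only.

```java
class LiteralVsRuntime {
    // Returns the length followed by the individual characters, space-separated.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c);
        }
        return s.length() + ": " + sb;
    }

    public static void main(String[] args) {
        System.out.println(describe("\\u003c"));   // 6: \ u 0 0 3 c
        System.out.println(describe("\\\\u003c")); // 7: \ \ u 0 0 3 c
    }
}
```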
I try to explain my problem with a little example.
I implemented version 1 and version 2, but I didn't get the desired result. Which replacement-parameter do I have to use to get the desired result with the replaceAll method ?
Version 1:
String s = "TEST";
s = s.replaceAll("TEST", "TEST\nTEST");
System.out.println(s);
Output:
TEST
TEST
Version 2:
String s = "TEST";
s = s.replaceAll("TEST", "TEST\\nTEST");
System.out.println(s);
Output:
TESTnTEST
Desired Output:
TEST\nTEST
From the javadoc of String#replaceAll(String, String):
Note that backslashes (\) and dollar signs ($) in the replacement
string may cause the results to be different than if it were being
treated as a literal replacement string; see Matcher.replaceAll. Use
Matcher.quoteReplacement(java.lang.String) to suppress the special
meaning of these characters, if desired.
s = s.replaceAll("TEST", Matcher.quoteReplacement("TEST\\nTEST"));
You still need 2 backslashes, as \ is a metachar for string literals.
You can also use 4 backslashes without Matcher.quoteReplacement:
you want one \ in the output
you need to escape it with \, as \ is a metachar for replacement strings: \\
you need to escape both with \, as \ is a metachar for string literals: \\\\
s = s.replaceAll("TEST", "TEST\\\\nTEST");
Don't use replaceAll()!
replaceAll() does a regex search and replace, but your task doesn't need regex. Just use the plain-text version replace(), which also replaces all occurrences.
You need a literal backslash, which is coded as two backslashes in a Java String literal:
String s = "TEST";
s = s.replace("TEST", "TEST\\nTEST");
System.out.println(s);
Output:
TEST\nTEST
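The three approaches from the answers above can be compared side by side. This is just a sketch to confirm they agree; the class LiteralBackslashN and its method names are invented for this example.

```java
import java.util.regex.Matcher;

class LiteralBackslashN {
    // replaceAll with the replacement quoted so the backslash stays literal.
    static String viaQuoteReplacement(String s) {
        return s.replaceAll("TEST", Matcher.quoteReplacement("TEST\\nTEST"));
    }

    // replaceAll with the backslash escaped by hand (four in the literal).
    static String viaFourBackslashes(String s) {
        return s.replaceAll("TEST", "TEST\\\\nTEST");
    }

    // Plain-text replace: only the string-literal escape is needed.
    static String viaReplace(String s) {
        return s.replace("TEST", "TEST\\nTEST");
    }

    public static void main(String[] args) {
        System.out.println(viaQuoteReplacement("TEST")); // TEST\nTEST
        System.out.println(viaFourBackslashes("TEST"));  // TEST\nTEST
        System.out.println(viaReplace("TEST"));          // TEST\nTEST
    }
}
```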
I have html string from file. I need to escape all double quotes. So I do this way:
String content=readFile(file.getAbsolutePath(), StandardCharsets.UTF_8);
content=content.replaceAll("\"","\\\"");
System.out.println(content);
However, the double quotes are not escaped and the string is the same as it was before replaceAll method. When I do
String content=readFile(file.getAbsolutePath(), StandardCharsets.UTF_8);
content=content.replaceAll("\"","^^^");
System.out.println(content);
All double quotes are replaced with ^^^.
Why doesn't content.replaceAll("\"","\\\""); work?
You need to use 4 backslashes to denote one literal backslash in the replacement pattern:
content=content.replaceAll("\"","\\\\\"");
Here, \\\\ means a literal \ and \" means a literal ".
More details at Java String#replaceAll documentation:
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string; see Matcher.replaceAll
And later in Matcher.replaceAll documentation:
Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
Another fun replacement is replacing quotes with dollar sign: the replacement is "\\$". The 2 \s turn into 1 literal \ for the regex engine and it escapes the special character $ used to define backreferences. So, now it is a literal inside the replacement pattern.
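The dollar-sign case can be tried out as follows. This is a hedged sketch; DollarReplacement and its methods are hypothetical names, and the exact exception type for the unescaped case may differ between JDK versions, so the code only checks that some runtime exception is thrown.

```java
class DollarReplacement {
    // Replaces every double quote with a literal dollar sign.
    static String quotesToDollars(String s) {
        return s.replaceAll("\"", "\\$");
    }

    // Returns true if an unescaped "$" replacement fails at runtime.
    static boolean unescapedDollarFails(String s) {
        try {
            s.replaceAll("\"", "$");
            return false;
        } catch (RuntimeException e) {
            // A bare $ is read as the start of a group reference.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(quotesToDollars("say \"hi\""));      // say $hi$
        System.out.println(unescapedDollarFails("say \"hi\"")); // true
    }
}
```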
You need to do :
String content = "some content with \" quotes.";
content = content.replaceAll("\"", "\\\\\"");
Why will this work?
\" represents the " symbol, while you need \".
If you add a \ as a prefix (\\") then you'll have to escape the prefix too, i.e. you'll have a \\\". This will now represent \", where \ is not the escaping character, but the symbol \.
However, replaceAll also treats \ in its replacement string as an escape character, so that backslash has to be escaped once more. Therefore prefixing again with \\ will do fine:
x = x.replaceAll("\"", "\\\\\"");
It took me way too long in Java to discover Pattern.quote and Matcher.quoteReplacement. These will help you achieve what you are trying to do here, which is a simple "find" and "replace", without any regex and escape logic. The Pattern.quote here would not be necessary, but it shows how you can ensure that the "find" part is not interpreted as a regex string:
@Test
public void testEscapeQuotes()
{
String content="some content with \"quotes\".";
content=content.replaceAll(Pattern.quote("\""), Matcher.quoteReplacement("\\\""));
Assert.assertEquals("some content with \\\"quotes\\\".", content);
}
Remember that you can also use the simple .replace method which will also "replaceAll" but will not interpret your parameters as regular expressions:
@Test
public void testEscapeQuotes()
{
String content="some content with \"quotes\".";
content=content.replace("\"", "\\\"");
Assert.assertEquals("some content with \\\"quotes\\\".", content);
}
Much easier with Apache Commons Text:
System.out.println(StringEscapeUtils.escapeJava("\""));
Output:
\"
Honestly, I am surprised by the behaviour, but it seems like you need to double-escape the backslash:
System.out.println("\"Hello world\"".replaceAll("\"", "\\\\\""));
which outputs:
\"Hello world\"
I have small code as shown below
import java.util.Scanner;

public class Testing {
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
String firstString = sc.next();
System.out.println("First String : " + firstString);
String secondString = "text\\";
System.out.println("Second String : " + secondString);
}
}
When I provide input as text\\ I get output as
First String : text\\
Second String : text\
Why am I getting two different strings when the input I provide for the first string is the same as the second string?
Demo at www.ideone.com
The double backslash you type in the console as input at runtime really is two backslashes. You simply typed the ASCII backslash character twice.
The double backslash inside the string literal means only one backslash, because you can't write a single backslash in a string literal. Why? Because the backslash is a special character that is used to "escape" special characters, e.g. tab, newline, backslash, double quote. As you see, the backslash is also one of the characters that needs to be escaped. How do you escape it? With a backslash. So, escaping a backslash is done by putting it after another backslash, which results in two backslashes. This will be compiled into a single backslash.
Why do you have to escape characters? Look at this string: this "is" a string. If you want to write this as a string literal in Java, you might intuitively think that it would look like this:
String str = "this "is" a string";
As you can see, this won't compile. So escape them like this:
String str = "this \"is\" a string";
Right now, the compiler knows that the " doesn't close the string but really means character ", because you escaped it with a backslash.
In Strings, \ is a special character; for example, you can use it like \n to create a newline character. To turn off its special meaning you need to put another \ in front of it, like \\. So in your 2nd case \\ will be interpreted as one \ character.
When you are reading Strings from outside sources (like streams), Java assumes that they are normal characters, because special characters have already been converted to, for example, tabs, newline characters, and so on.
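The difference between typed input and a string literal can be reproduced without a console by feeding the Scanner a fixed string. This is only a sketch; the class InputVsLiteral and the method readToken are made-up names, and the "typed" string stands in for what a user would enter.

```java
import java.util.Scanner;

class InputVsLiteral {
    // Reads the first whitespace-delimited token, as sc.next() would from the console.
    static String readToken(String typed) {
        try (Scanner sc = new Scanner(typed)) {
            return sc.next();
        }
    }

    public static void main(String[] args) {
        // Simulates the user typing six characters: t e x t \ \
        String first = readToken("text\\\\");
        // String literal: the compiler turns the escape into one backslash.
        String second = "text\\";

        System.out.println(first + " has length " + first.length());   // text\\ has length 6
        System.out.println(second + " has length " + second.length()); // text\ has length 5
    }
}
```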
Java uses \ as an escape character in the second string.
EDITED on demand
In the first case, the input takes all the typed characters and wraps them in a String, so all characters are printed (no evaluation: they are printed as they are read).
In the second, the compiler evaluates the String between the double quotes character by character, and the first \ is read as a metacharacter protecting the second one, so it will not be printed.
The sequence of chars a String holds internally must not be confused with the sequence of chars between the double quotes, especially because the backslash has a special meaning:
"\n\r\t\\\0" => { (char)10,(char)13,(char)9,'\\',(char)0 }
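The mapping above can be verified by printing the numeric code of each character in the literal. This is a minimal sketch; EscapeCodes and codes are illustrative names.

```java
import java.util.Arrays;

class EscapeCodes {
    // Returns the numeric code of every char in the string.
    static int[] codes(String s) {
        int[] out = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = s.charAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Five escape sequences in the source, five chars at runtime.
        System.out.println(Arrays.toString(codes("\n\r\t\\\0"))); // [10, 13, 9, 92, 0]
    }
}
```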
The line
System.out.println("\\");
prints a single back-slash (\). And
System.out.println("\\\\");
prints double back-slashes (\\). Understood!
But why in the following code:
class ReplaceTest
{
public static void main(String[] args)
{
String s = "hello.world";
s = s.replaceAll("\\.", "\\\\");
System.out.println(s);
}
}
is the output:
hello\world
instead of
hello\\world
After all, the replaceAll() method is replacing a dot (\\.) with (\\\\).
Can someone please explain this?
When replacing text using regular expressions, you're allowed to use backreferences in the replacement, such as $1 to insert a captured group from the match, and a backslash is used to escape such special characters.
This, however, means that the backslash is a special character, so if you actually want to use a backslash it needs to be escaped.
Which means it needs to actually be escaped twice when using it in a Java string. (First for the string parser, then for the regex parser.)
The javadoc of replaceAll says:
Note that backslashes ( \ ) and dollar signs ($) in the replacement
string may cause the results to be different than if it were being
treated as a literal replacement string; see Matcher.replaceAll. Use
Matcher.quoteReplacement(java.lang.String) to suppress the special
meaning of these characters, if desired.
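Applied to the hello.world example, the rule works out as follows. This is a small sketch; the class DotToBackslash and its methods are names chosen for illustration.

```java
import java.util.regex.Matcher;

class DotToBackslash {
    // Four backslashes in the literal: the replacement yields ONE backslash.
    static String oneBackslash(String s) {
        return s.replaceAll("\\.", "\\\\");
    }

    // Eight backslashes in the literal: the replacement yields TWO backslashes.
    static String twoBackslashes(String s) {
        return s.replaceAll("\\.", "\\\\\\\\");
    }

    // Same result as twoBackslashes, letting quoteReplacement do the second escaping.
    static String twoViaQuote(String s) {
        return s.replaceAll("\\.", Matcher.quoteReplacement("\\\\"));
    }

    public static void main(String[] args) {
        System.out.println(oneBackslash("hello.world"));   // hello\world
        System.out.println(twoBackslashes("hello.world")); // hello\\world
        System.out.println(twoViaQuote("hello.world"));    // hello\\world
    }
}
```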
This is a formatted addendum to my comment
s = s.replaceAll("\\.", Matcher.quoteReplacement("\\"));
is more readable and meaningful than
s = s.replaceAll("\\.", "\\\\");
If you don't need regex for replacing and just need to replace exact strings, escape the regex control characters before replacing:
String trickyString = "$Ha!I'm tricky|.|";
String safeToUseInReplaceAllString = Pattern.quote(trickyString);
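Quoting can be applied to both arguments at once. The following sketch wraps that in a helper; QuoteBothSides and replaceLiteral are hypothetical names, and the replacement text is just an example containing both \ and $.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteBothSides {
    // Literal find-and-replace built on replaceAll: neither argument is interpreted.
    static String replaceLiteral(String input, String target, String replacement) {
        return input.replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String tricky = "$Ha!I'm tricky|.|";
        String input = "say: " + tricky;

        // The replacement contains backslashes and a dollar sign, kept as-is.
        System.out.println(replaceLiteral(input, tricky, "C:\\temp\\$1")); // say: C:\temp\$1
    }
}
```

For plain strings, input.replace(target, replacement) gives the same result with less ceremony.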
The backslash is an escape character in Java Strings, i.e. the backslash has a predefined meaning in Java. You have to use "\\" to define a single backslash. If you want to define \w, then you must use "\\w" in your regex. If you want to use a backslash as a literal, you have to type \\\\, as \ is also an escape character in regular expressions.
I believe in this particular case it would be easier to use replace instead of replaceAll.
Reverend Gonzo has the correct answer when he talks about escaping the character.
Using replaceAll:
s = s.replaceAll("\\.", "\\\\\\\\");
Using replace:
s = s.replace(".", "\\\\");
replace just takes a string to match to, not a regular expression.
I don't like this implementation of regex. We should be able to escape characters with a single '\', not '\\'. But anyway, if you want to get THIS from THIS.Out_Of_That you can do:
String prefix = role.replaceFirst("(\\.).*", "");
So you get prefix = THIS;
My eventual goal is to have a string like
def newline = 'C:\\www\web-app\StudyReports\\test.bat'
but my old line only has one '\'.
I tried different ways of using the following:
def newline = oldline.replaceAll(/\\/,'//')
but that did not compile.
If I were you, I would replace the backslashes with forward slashes:
def newline=oldline.replaceAll(/\\+/, '/')
Both Java and Windows will accept the forward slash as a file separator, and it's a lot easier to work with.
In Java, you'd use the String.replace(CharSequence target, CharSequence replacement), which is NOT regex-based.
You'd write something like:
String after = before.replace("\\", "\\\\");
This doubles up every \ in before.
String path = "1\\2\\\\3\\4";
System.out.println(path);
path = path.replace("\\", "\\\\");
System.out.println(path);
The output of the above is (as seen on ideone.com)
1\2\\3\4
1\\2\\\\3\\4
To match a single backslash in Java or Groovy, you have to enter it 4 times, because both the compiler and the regex engine use the backslash as the escape character. So if you enter "\\\\" as a String in Java, the compiler generates the string containing the two characters \\, which the regex engine interprets as a match for exactly one backslash \.
The replacement string must be escaped twice too, so to produce two backslashes you have to enter 8 backslashes in the replacement string.
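In Java, the 4-and-8 rule looks like this, next to the simpler replace call. This is a sketch for comparison only; DoubleBackslashes and its method names are invented for the example.

```java
class DoubleBackslashes {
    // Regex version: 4 backslashes match one, 8 backslashes emit two.
    static String viaReplaceAll(String s) {
        return s.replaceAll("\\\\", "\\\\\\\\");
    }

    // Plain-text version: only the string-literal escaping is needed.
    static String viaReplace(String s) {
        return s.replace("\\", "\\\\");
    }

    public static void main(String[] args) {
        String path = "C:\\www\\web-app"; // C:\www\web-app at runtime
        System.out.println(viaReplaceAll(path)); // C:\\www\\web-app
        System.out.println(viaReplace(path));    // C:\\www\\web-app
    }
}
```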