I'm trying to escape a string to ensure that special characters are escaped.
Using
StringEscapeUtils.escapeJava("😀") escapes to \\uD83D\\uDE00
StringEscapeUtils.escapeJava("% ! # $ ^ & * ") doesn't escape any of the characters
StringEscapeUtils.escapeJava("£") escapes to \\u00A3
I can understand that emojis contain backslashes and so are escaped, but why is the pound sign being escaped, and how do I stop it from being escaped?
The documentation of StringEscapeUtils.escapeJava() is vague on exactly what "Java String rules" are.
I guess it is referring to the bit in JLS Chapter 3, where it says:
Programs are written in Unicode (§3.1), but lexical translations are provided (§3.2) so that Unicode escapes (§3.3) can be used to include any Unicode character using only ASCII characters.
and
ASCII (ANSI X3.4) is the American Standard Code for Information Interchange. The first 128 characters of the Unicode UTF-16 encoding are the ASCII characters.
So it might mean escaping the string so that it can be written using only ASCII characters.
%, !, #, $, ^, & and * are all ASCII characters. They have values less than 128 (i.e. they are in the 7-bit block).
£ isn't an ASCII character: in ISO8859-1, it is encoded as 163 (0xA3), which is outside the 7-bit ASCII block.
If you open a file with the £ in a string literal, it might be rendered as something else, if that editor doesn't set the character encoding correctly. For example, it could be Ł, if it's interpreted in ISO8859-2.
In order to be unambiguous, the pound sign is therefore escaped.
how do I stop it from being escaped
You can't, using this method; you'd need to find an alternative. The only thing you can do would be to replace the \u00A7s in the string with £ again.
Related
I am reading the 3.3 Unicode Escapes section in the lang spec. (https://docs.oracle.com/javase/specs/jls/se19/html/jls-3.html#jls-3.3)
There is this particular piece of text I am having difficulty understanding:
consider how many backslashes appeared contiguously as raw input characters in the result, back to a non-backslash character or the start of the result. (It is immaterial whether any such backslash arose from an ASCII \ character in the compiler's raw input or from a Unicode escape \u005c in the compiler's raw input.) If this number is even, then the ASCII \ character is eligible to begin a Unicode escape; if the number is odd, then the ASCII \ character is not eligible to begin a Unicode escape.
For example, the raw input "\\u2122=\u2122" results in the eleven characters " \ \ u 2 1 2 2 = ™ " because while the second ASCII \ character in the raw input is not eligible to begin a Unicode escape, the third ASCII \ character is eligible, and \u2122 is the Unicode encoding of the character ™.
but in \\u2122=\u2122, in the left side of =, there are 2 backslashes making it eligible to begin a unicode escape, but the right side of =, there's only 1 backslash.
I understand that this is wrong interpretation because it makes sense that \\u2122 is wrong. And I can count it as being odd number of backslashes if I am ignoring the backslash associated with the unicode escape. But there is this following mention in parentheses:
It is immaterial whether any such backslash arose from an ASCII \ character in the compiler's raw input or from a Unicode escape \u005c in the compiler's raw input.
I want to understand how and where my understanding of this went wrong.
It looks like the text in the first paragraph you quote just swapped "even" and "odd" in the text. Because, just as the example in the same paragraph shows, it is quite the opposite: an odd sequence will imply that at the end there is one non-escaped \ , which then can start a unicode escape sequence.
When matching certain characters (such as line feed), you can use the regex "\\n" or indeed just "\n". For example, the following splits a string into an array of lines:
String[] lines = allContent.split("\\r?\\n");
But the following works just as well:
String[] lines = allContent.split("\r?\n");
My question:
Do the above two work in exactly the same way, or is there any subtle difference? If the latter, can you give an example case where you get different results?
Or is there a difference only in [possible/theoretical] performance?
There is no difference in the current scenario. The usual string escape sequences are formed with the help of a single backslash and then a valid escape char ("\n", "\r", etc.) and regex escape sequences are formed with the help of a literal backslash (that is, a double backslash in the Java string literal) and a valid regex escape char ("\\n", "\\d", etc.).
"\n" (an escape sequence) is a literal LF (newline) and "\\n" is a regex escape sequence that matches an LF symbol.
"\r" (an escape sequence) is a literal CR (carriage return) and "\\r" is a regex escape sequence that matches an CR symbol.
"\t" (an escape sequence) is a literal tab symbol and "\\t" is a regex escape sequence that matches a tab symbol.
See the list in the Java regex docs for the supported list of regex escapes.
However, if you use a Pattern.COMMENTS flag (used to introduce comments and format a pattern nicely, making the regex engine ignore all unescaped whitespace in the pattern), you will need to either use "\\n" or "\\\n" to define a newline (LF) in the Java string literal and "\\r" or "\\\r" to define a carriage return (CR).
See a Java test:
String s = "\n";
System.out.println(s.replaceAll("\n", "LF")); // => LF
System.out.println(s.replaceAll("\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\n", "<LF>"));
// => <LF>
//<LF>
Why is the last one producing <LF>+newline+<LF>? Because "(?x)\n" is equal to "", an empty pattern, and it matches an empty space before the newline and after it.
Yes there are different. The Java Compiler has different behavior for Unicode Escapes in the Java Book The Java Language Specification section 3.3;
The Java programming language specifies a standard way of transforming
a program written in Unicode into ASCII that changes a program into a
form that can be processed by ASCII-based tools. The transformation
involves converting any Unicode escapes in the source text of the
program to ASCII by adding an extra u - for example, \uxxxx becomes
\uuxxxx - while simultaneously converting non- ASCII characters in the
source text to Unicode escapes containing a single u each.
So how this affect the /n vs //n in the Java Doc:
It is therefore necessary to double backslashes in string literals
that represent regular expressions to protect them from interpretation
by the Java bytecode compiler.
An a example of the same doc:
The string literal "\b", for example, matches a single backspace
character when interpreted as a regular expression, while "\b"
matches a word boundary. The string literal "(hello)" is illegal and
leads to a compile-time error; in order to match the string (hello)
the string literal "\(hello\)" must be used.
I have a String like this which is coming in a JSON processing data call\\U007fabc computers when I try to parse it jackson throwsn an exception like this:
org.codehaus.jackson.JsonParseException: Unrecognized character escape 'U' (code 85)
at [Source: java.io.StringReader#1b43c429; line: 1, column: 361]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1292)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:385)
at org.codehaus.jackson.impl.JsonParserMinimalBase._handleUnrecognizedCharacterEscape(JsonParserMinimalBase.java:360)
at org.codehaus.jackson.impl.ReaderBasedParser._decodeEscaped(ReaderBasedParser.java:1064)
at org.codehaus.jackson.impl.ReaderBasedParser._finishString2(ReaderBasedParser.java:785)
at org.codehaus.jackson.impl.ReaderBasedParser._finishString(ReaderBasedParser.java:762)
I think the problem is happening because of \\U007f. It definitely means something in UTF-8. Any idea how we can avoid this issue? Does JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER will help anything here?
Your JSON data is malformed.
JSON uses the \u escape sequence to encode a UTF-16 codeunit.
In this case, your JSON data is trying to escape Unicode codepoint U+007F DELETE (which is an ASCII control character that is not required by the JSON spec to be escaped, but is allowed to be escaped), but is using the \U escape sequence to do so. The JSON spec explicitly states that \u MUST be used:
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F. There are two-character escape sequence representations of some characters.
...
Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point.
...
To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair.
Although not explicitly stated in that last paragraph, the twelve-character sequence for a UTF-16 surrogate pair consists of two six-character sequences that must follow the same escape format as characters in the BMP. This is enforced by the character encoding diagram:
(source: json.org)
There is no \U escape sequence defined. That is what the parser error message is complaining about:
Unrecognized character escape 'U'
Unicode Character U+007F DELETE is probably what you are facing.
This answer states that it shouldnt have been encoded.
However to circumvent, you can refer to this answer on how to strip them off.
Let say I have
String str="hello\" world\\";
when printing str, the output is
hello" world\
even when printing str.length() the output is
13
Is there any way to prove that str value has escape character(s)?
There is no such thing as escape characters at run time.
Escape characters appear only in String literals. For example,
String literal = "Some\nEscape\rSequence\\\"";
At compilation time, the compiler produces a String value with their actual binary representation (UTF-8 iirc). The JVM uses that String value directly.
You wrote
I am thinking that whenever we print a string and the output contains
character such as " and \, then we can conclude that those character,
" and \ was escaped?
This is not true, those characters might have been read from a file or some other InputStream. They were definitely not escaped in a text file.
Yes.
Use the Apache Commons Library, specifically StringEscapeUtils#escapeJava.
jshell> StringEscapeUtils.escapeJava("Newline \n here \u0344 and unicode \f\n\r\t\"\0\13 and more")
$136 ==> "Newline \\n here \\u0344 and unicode \\f\\n\\r\\t\\\"\\u0000\\u000B and more"
This prepends a backslash to each escape sequence and also swaps the variable-width octal sequences for fixed-width Unicode sequences. This means that every escape sequence will consist of "\\" two backslashes, followed by one of {n, b, r, t, f, ", \}, or a 'u' character, plus exactly four hexadecimal [0-F] digits.
If you just want to know whether or not the original String contains escape sequences, search for "\\" in the Apache-fied string. If you want to find the positions of those sequences, it's a bit more involved.
See more at this Gist.
This line compiles fine
Thread t = \u006E\u0065\u0077\u0020\u0054\u0068\u0072\u0065\u0061\u0064\u0028\u0029\u003B
this is the unicode for the text new Thread();
my question is what is the need for accepting unicode characters outside the " " or ' '. we can use unicodes in string literals and character literals. but what is the need for it to be accepted in the actual code itself?
The reason why this works is that the Unicode escape sequence isn't handled by the grammar or the string parsing code but the tokenizer. So the Java grammar never "sees" those escape sequences, it gets a Unicode string.
Which has unfortunate side effects like this code doesn't compile:
// C:\user\...
For most of us, it's a comment. For the tokenizer, it's the illegal unicode sequence ser\.
The reason to do it this way is that you can now use any Unicode character anywhere in the Java source code - Java identifiers are not limited to ASCII!
But the tools to edit Java might not be as good. In 1994, it was pretty hard to find a text editor capable of Unicode. Also, code generators often work better if you stay with ASCII.
JLS specified it
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
This works because the Java Language Specification requires this. See 3.3. Unicode Escapes:
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.
The reason is simple: Java allows full unicode support (even for identifiers!), but sometimes it is not practical to use actual unicode for your source files, in that case you can use escapes.
This also means that unicode escapes are not an artifact of strings in Java, but actually of the compiler: if you have a String (or char) with unicode escapes it will translated at compiletime to the actual character, not at runtime!
The section 3.2. Lexical Translations is also relevant:
A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
A translation of the Unicode stream resulting from step 1 [...]
If the source code is not UTF-8 this feature makes it possible to use Unicode characters in the source code otherwise not available