What can I do with a hex String literal? [duplicate] - java

I'm learning Java, and I'm on a book chapter about hex String literals. It tells me that I can create a hex String literal in this format: "\uxxxx". So I tried this:
char c = '\u0010';
int x = c;
System.out.println(x); // prints 16.
Firstly, why does the following hex String literal cause a compilation error? I was expecting that 'a' in hex would equal 10 in decimal.
char c = '\u000a';
Returns the following error:
..\src\pkgs\main\Main.java:360: error: illegal line end in character literal
char c = '\u000a';
Secondly, because of my novice Java status, I'm currently not able to appreciate what hex String literals are used for. Why would I want to use one? Can someone please provide me with a "real world" example of their use? Thanks a lot.

The compiler gives an error because it translates the Unicode escape \u000a into an actual line feed (LF) character before it parses the character literal.
char A = '\u000A';
therefore becomes...
char A ='
';
which results in a compile-time error. To avoid this error, always use the special escape characters '\n' (line feed) and '\r' (carriage return).
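A minimal sketch of the safe alternatives mentioned above. The class and constant names are made up for illustration; the point is only that '\n', '\r' and a plain integer value all avoid the early Unicode-escape translation.

```java
class LineFeedLiteral {
    // '\n' and '\r' are ordinary escape sequences. They are resolved inside the
    // literal, after the source has already been split into lines.
    static final char LF = '\n';
    static final char CR = '\r';

    // Casting an integer constant is another way to get the same character.
    static final char LF_FROM_INT = (char) 0x0A;

    public static void main(String[] args) {
        System.out.println((int) LF);           // 10
        System.out.println((int) CR);           // 13
        System.out.println(LF == LF_FROM_INT);  // true
    }
}
```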

As noted already, Unicode escapes are actually processed during compilation as a replacement:
Because Unicode escapes are processed very early, it is not correct to write '\u000a' for a character literal whose value is linefeed (LF); the Unicode escape \u000a is transformed into an actual linefeed in translation step 1 (§3.3) and the linefeed becomes a LineTerminator in step 2 (§3.4), and so the character literal is not valid in step 3. Instead, one should use the escape sequence '\n' (§3.10.6). Similarly, it is not correct to write '\u000d' for a character literal whose value is carriage return (CR). Instead, use '\r'.
This aspect of Unicode escapes is not just limited to character literals. For example, the following will print "hello world":
// \u000A System.out.println("hello world");
Another way to get special characters beyond an escape is to use an integer literal:
static final char NUL = 0x0000;
As for their usefulness: for one, without them you would have to copy and paste special characters or type them in with some keyboard combination. The other reason is that certain characters don't have a proper visual representation. Examples of this are null, escape, backspace and delete, and also code point 7, the bell character, which is actually an instruction for the computer to emit a beep when it gets printed.
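As a rough illustration of the "no visual representation" case, here is a sketch with a few control characters declared as named constants. The class name and the choice of constants are my own; the Unicode escapes are safe here because none of these characters is a line terminator, a quote or a backslash.

```java
class ControlChars {
    // Integer literals that fit in a char can be assigned directly.
    static final char NUL = 0x00;
    static final char BEL = 0x07;

    // Unicode escapes for characters that cannot be typed visibly.
    static final char ESC = '\u001B';
    static final char DEL = '\u007F';

    public static void main(String[] args) {
        // Sends an ANSI colour sequence. Whether the text actually
        // changes colour depends on the terminal in use.
        System.out.println(ESC + "[31mred?" + ESC + "[0m");
        System.out.println((int) BEL); // 7
    }
}
```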

A char in Java is 2 bytes (one UTF-16 code unit), so it can hold Unicode characters, not just ASCII.
So if you know a character's Unicode code, you can store it as a hex literal in a char, which lets you use characters from other languages.
For more on the use of hex literals, you can visit this link:
http://voices.yahoo.com/how-print-unicode-characters-java-12507717.html
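A small sketch of that idea, with names picked only for the example. What appears on screen depends on the console encoding and font, so no particular rendering is assumed.

```java
class UnicodeChars {
    static final char YEN = '\u00A5';                  // U+00A5 YEN SIGN
    static final String GREEK = "\u03B1\u03B2\u03B3";  // alpha, beta, gamma

    public static void main(String[] args) {
        System.out.println(YEN);
        System.out.println(GREEK);
        // Each of these characters fits in a single char.
        System.out.println(GREEK.length()); // 3
    }
}
```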

Related

How to convert any kind of white space to a char?

I use String.strip() (Java 11) to remove trailing & leading white spaces from a String. There are 25 different kinds of white spaces in a String. I want to test my code with some of these 25 types of white space.
I have a code example which converts a particular type of white space (e.g. \u2002) into a char and then uses it in a String. When I try to convert another white space type like \u000A to a char, I get a compiler error. Why does this happen and how do I fix it?
public static void main(String...args){
char chr = '\u2002';//No problem.
//Compiler error :
//Intellij IDEA compiler - Illegal escape character in character literal.
//Java compiler - java: illegal line end in character literal.
chr = '\u000a';
String text = chr + "hello world" + chr;
text = text.strip();
System.out.println(text);
}
Are you sure you're not seeing this error instead?
error: illegal line end in character literal
Escape sequences like \u000a are processed very early in the compilation process. The \u000a is being replaced with an actual line feed character (code point 10).
It's as if you wrote this:
chr = '
';
which is why, when I try and compile your code using JDK 11.0.8, I get the "illegal line end" error.
This early conversion is described in the Java Language Specification:
Because Unicode escapes are processed very early, it is not correct to write '\u000a' for a character literal whose value is linefeed (LF); the Unicode escape \u000a is transformed into an actual linefeed in translation step 1 (§3.3) and the linefeed becomes a LineTerminator in step 2 (§3.4), and so the character literal is not valid in step 3. Instead, one should use the escape sequence '\n' (§3.10.6). Similarly, it is not correct to write '\u000d' for a character literal whose value is carriage return (CR). Instead, use '\r'.
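For the testing use case in the question, something along these lines should work. It assumes Java 11 or later for String.strip(), and padAndStrip is a hypothetical helper name. Line feed and carriage return are written as '\n' and '\r'; the other white space characters can stay as Unicode escapes.

```java
class StripWhitespaceDemo {
    // Hypothetical helper: wraps the text in the given white space
    // character on both sides and strips it again.
    static String padAndStrip(char whitespace, String text) {
        return (whitespace + text + whitespace).strip();
    }

    public static void main(String[] args) {
        char[] samples = {
            '\n',      // line feed, written as an escape sequence
            '\r',      // carriage return, written as an escape sequence
            '\t',      // tab
            '\u2002',  // EN SPACE, fine as a Unicode escape
            '\u3000'   // IDEOGRAPHIC SPACE, fine as a Unicode escape
        };
        for (char c : samples) {
            System.out.printf("U+%04X -> [%s]%n", (int) c, padAndStrip(c, "hello world"));
        }
    }
}
```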

Java regex escaped characters

When matching certain characters (such as line feed), you can use the regex "\\n" or indeed just "\n". For example, the following splits a string into an array of lines:
String[] lines = allContent.split("\\r?\\n");
But the following works just as well:
String[] lines = allContent.split("\r?\n");
My question:
Do the above two work in exactly the same way, or is there any subtle difference? If the latter, can you give an example case where you get different results?
Or is there a difference only in [possible/theoretical] performance?
There is no difference in the current scenario. The usual string escape sequences are formed with the help of a single backslash and then a valid escape char ("\n", "\r", etc.) and regex escape sequences are formed with the help of a literal backslash (that is, a double backslash in the Java string literal) and a valid regex escape char ("\\n", "\\d", etc.).
"\n" (an escape sequence) is a literal LF (newline) and "\\n" is a regex escape sequence that matches an LF symbol.
"\r" (an escape sequence) is a literal CR (carriage return) and "\\r" is a regex escape sequence that matches a CR symbol.
"\t" (an escape sequence) is a literal tab symbol and "\\t" is a regex escape sequence that matches a tab symbol.
See the list in the Java regex docs for the supported list of regex escapes.
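A quick check of the two forms from the question, as a sketch with made-up method names. Without the COMMENTS flag both patterns split the sample text the same way.

```java
import java.util.Arrays;

class LineSplitDemo {
    // The regex engine receives backslash-r and backslash-n and
    // interprets them as regex escapes for CR and LF.
    static String[] splitWithRegexEscapes(String text) {
        return text.split("\\r?\\n");
    }

    // javac already turned the escapes into raw CR and LF characters,
    // which the regex engine then matches literally.
    static String[] splitWithLiteralChars(String text) {
        return text.split("\r?\n");
    }

    public static void main(String[] args) {
        String text = "one\r\ntwo\nthree";
        System.out.println(Arrays.toString(splitWithRegexEscapes(text))); // [one, two, three]
        System.out.println(Arrays.toString(splitWithLiteralChars(text))); // [one, two, three]
    }
}
```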
However, if you use a Pattern.COMMENTS flag (used to introduce comments and format a pattern nicely, making the regex engine ignore all unescaped whitespace in the pattern), you will need to either use "\\n" or "\\\n" to define a newline (LF) in the Java string literal and "\\r" or "\\\r" to define a carriage return (CR).
See a Java test:
String s = "\n";
System.out.println(s.replaceAll("\n", "LF")); // => LF
System.out.println(s.replaceAll("\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\n", "<LF>"));
// => <LF>
//<LF>
Why is the last one producing <LF>+newline+<LF>? Because "(?x)\n" is equal to "", an empty pattern, and it matches the empty string before the newline and after it.
Yes, they are different. The Java compiler treats Unicode escapes specially, as described in section 3.3 of The Java Language Specification:
The Java programming language specifies a standard way of transforming
a program written in Unicode into ASCII that changes a program into a
form that can be processed by ASCII-based tools. The transformation
involves converting any Unicode escapes in the source text of the
program to ASCII by adding an extra u - for example, \uxxxx becomes
\uuxxxx - while simultaneously converting non- ASCII characters in the
source text to Unicode escapes containing a single u each.
So how does this affect \n vs \\n? From the Pattern Javadoc:
It is therefore necessary to double backslashes in string literals
that represent regular expressions to protect them from interpretation
by the Java bytecode compiler.
An example from the same doc:
The string literal "\b", for example, matches a single backspace
character when interpreted as a regular expression, while "\\b"
matches a word boundary. The string literal "\(hello\)" is illegal and
leads to a compile-time error; in order to match the string (hello)
the string literal "\\(hello\\)" must be used.
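To make the single versus double backslash distinction concrete, here is a small sketch (class and field names are illustrative only). With one backslash the compiler produces a backspace character; with two, the regex engine sees the word-boundary construct.

```java
import java.util.regex.Pattern;

class BackspaceVsBoundary {
    // One backslash: javac turns this into a one-character string
    // holding a backspace (U+0008), which the regex matches literally.
    static final Pattern BACKSPACE = Pattern.compile("\b");

    // Two backslashes: the regex engine receives backslash + b,
    // which is the word-boundary construct.
    static final Pattern WHOLE_WORD_CAT = Pattern.compile("\\bcat\\b");

    public static void main(String[] args) {
        System.out.println(BACKSPACE.matcher("\b").matches());            // true
        System.out.println(WHOLE_WORD_CAT.matcher("a cat sat").find());   // true
        System.out.println(WHOLE_WORD_CAT.matcher("concatenate").find()); // false
    }
}
```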

Why does Java Language Spec allow this: \uuuu0041? [duplicate]

In Java, I learned that the following syntax can be used for mentioning Unicode characters that are not on the keyboard (eg. non-ASCII characters):
(\u)(u)*(HexDigit)(HexDigit)(HexDigit)(HexDigit)
My question is:
What is the purpose of (u)* in the above syntax?
One use case that I understood which represents Yen symbol in Java is:
char ch = '\u00A5';
Interesting question. Section 3.3 of the JLS says:
UnicodeEscape:
\ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
u
UnicodeMarker u
which translates to \\u+\p{XDigit}{4}
and
If an eligible \ is followed by u, or more than one u, and the last u is not followed by four hexadecimal digits, then a compile-time error occurs.
So you're right, there can be one or more u after the backslash. The reason is given further down:
The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
So this input
\u0020ä
becomes
\uu0020\u00e4
The first uu means here "this was a unicode escape sequence to begin with" while the second u says "An automatic tool converted a non-ASCII character to a unicode escape."
This information is useful when you want to convert back from ASCII to unicode: You can restore as much of the original code as possible.
It means you can add as many u as you want - for example these lines are equivalent:
char ch = '\u00A5';
char ch = '\uuuuu00A5';
char ch = '\uuuuuuuuuuuuuuuuuu00A5';
(and all compile)
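The same point as a compilable sketch, using U+0041 (the letter A) so the result is easy to check. The class name is arbitrary.

```java
class RepeatedUMarker {
    static final char ONE_U = '\u0041';
    static final char MANY_U = '\uuuu0041';

    public static void main(String[] args) {
        System.out.println(ONE_U);           // A
        System.out.println(ONE_U == MANY_U); // true
    }
}
```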
Java supports only the \uXXXX (4 hex chars) notation for Unicode characters in the BMP and doesn't support a \u{YYYYY} (5 hex chars) notation for characters outside the BMP (the 16 other planes). So it's impossible to represent them in a single char constant; you have to write them as a surrogate pair.
For example, if you want to write MATHEMATICAL BOLD CAPITAL A (U+1D400), you can't write "\u{1D400}": it's an illegal Unicode escape sequence in Java. Writing "\u1D400" only gives "\u1D40" + "0", so it will output ᵀ0. You really have to use surrogates in Java, so you have to write "\uD835\uDC00" instead.
But writing surrogates is not handy, so if you want to write them directly from a code point you can use one of those tricks:
String test1 = new String(new int[] { 0x1D400 }, 0, 1);
String test2 = String.valueOf(Character.toChars(0x1D400));
String test3 = Character.toString(0x1D400); // Java 11+
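To see that the code point route and the hand-written surrogate pair give the same string, a sketch like this can help. fromCodePoint is a hypothetical helper name; it only wraps Character.toChars.

```java
class SupplementaryCodePoint {
    static final int BOLD_A = 0x1D400; // MATHEMATICAL BOLD CAPITAL A

    // Builds a String from one code point, using a surrogate pair when needed.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(BOLD_A);
        System.out.println(s.length());                      // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point)
        System.out.println(s.equals("\uD835\uDC00"));        // true
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // D835 DC00
    }
}
```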

converting string of unicode "\u0063" into "c"

I'm doing some cryptanalysis homework and was trying to write code that does a + b = c. My idea was to use unicode: b + (b - a) = c. The problem is that my code returns the unicode value of c, not the String "c", and I can't convert it.
Please can someone explain the difference between the string below called unicode and those called test and test2? Also is there any way I could get the string unicodeOfC to print "c"?
//this calculates the unicode value for c
String unicodeOfC = ("\\u" + Integer.toHexString('b'+('b'-'a') | 0x10000).substring(1));
//this prints \u0063
System.out.println(unicodeOfC);
String test = "\u0063";
//this prints c
System.out.println(test);
//this is false
System.out.println(test.equals(unicodeOfC));
String test2 = "\u0063";
//this is true
System.out.println(test.equals(test2));
There is no difference between test and test2. They are both String literals referring to the same String. This String literal is made up of a unicode escape.
A compiler for the Java programming language ("Java compiler") first
recognizes Unicode escapes in its input, translating the ASCII
characters \u followed by four hexadecimal digits to the UTF-16 code
unit (§3.1) for the indicated hexadecimal value, and passing all other
characters unchanged.
So the compiler will translate this unicode escape and convert it to the corresponding UTF-16 code unit. That is, the unicode escape \u0063 translates to the character c.
In this
String unicodeOfC = ("\\u" + Integer.toHexString('b'+('b'-'a') | 0x10000).substring(1));
the String literal "\\u" (which uses a \ character to escape a \ character) has a runtime value of \u, i.e. the two characters \ and u. The call to toHexString(..) returns "10063", and substring(1) drops the leading 1 to leave "0063". Concatenating the two gives the value assigned to unicodeOfC. So the String value is \u0063, i.e. the 6 characters \, u, 0, 0, 6, and 3.
Also is there any way I could get the string unicodeOfC to print "c"?
Similarly to how you created it, you need to get the numerical part of the unicode escape, parse it as a hexadecimal number, and cast the result to a char. You can then print it out:
String numerical = unicodeOfC.replace("\\u", "");
int val = Integer.parseInt(numerical, 16);
System.out.println((char) val);
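If the runtime string can hold several such escapes mixed with ordinary text, the same parse-and-cast idea can be wrapped in a loop. This is only a sketch: UnicodeEscapeDecoder and decode are hypothetical names, and it handles nothing beyond a backslash, one or more u, and four hex digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeDecoder {
    // A literal backslash, one or more 'u', then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    // Replaces each textual escape in a runtime string with the char it names.
    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and backslash from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("\\u0063"));               // c
        System.out.println(decode("\\u0061\\u0062\\u0063")); // abc
    }
}
```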
I think you're not understanding how string escaping works.
In Java backslash is an escape character that allows you to use characters in strings like newlines \n, tabs \t, or unicode \u0063.
Suppose I am writing code and I need to print a newline. I would do this: System.out.println("\n");
Now let's say I want to show a backslash. System.out.println("\"); will be a compile error, but System.out.println("\\"); will print \.
So your first string is printing the literal backslash character, then the letter u, then the hexadecimal number.

Determine if there is/are escape character(s) in string

Let's say I have
String str="hello\" world\\";
when printing str, the output is
hello" world\
even when printing str.length() the output is
13
Is there any way to prove that str value has escape character(s)?
There is no such thing as escape characters at run time.
Escape characters appear only in String literals. For example,
String literal = "Some\nEscape\rSequence\\\"";
At compilation time, the compiler replaces them with the characters they stand for and stores the resulting String value in the class file (in a modified UTF-8 encoding). The JVM uses that String value directly.
You wrote
I am thinking that whenever we print a string and the output contains
character such as " and \, then we can conclude that those character,
" and \ was escaped?
This is not true, those characters might have been read from a file or some other InputStream. They were definitely not escaped in a text file.
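One way to convince yourself: build the same text without escaping any quote or backslash in a string literal and compare. A sketch with invented names, using the character codes 34 (double quote) and 92 (backslash):

```java
class NoEscapesAtRuntime {
    static final String FROM_LITERAL = "hello\" world\\";

    // Same text, assembled from character codes instead of escape sequences.
    static String fromCodes() {
        return "hello" + (char) 34 + " world" + (char) 92;
    }

    public static void main(String[] args) {
        System.out.println(FROM_LITERAL.length());            // 13
        System.out.println(FROM_LITERAL.equals(fromCodes())); // true
    }
}
```

The two strings are equal, so nothing in the value records how the quote or the backslash got there.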
Yes.
Use the Apache Commons Library, specifically StringEscapeUtils#escapeJava.
jshell> StringEscapeUtils.escapeJava("Newline \n here \u0344 and unicode \f\n\r\t\"\0\13 and more")
$136 ==> "Newline \\n here \\u0344 and unicode \\f\\n\\r\\t\\\"\\u0000\\u000B and more"
This prepends a backslash to each escape sequence and also swaps the variable-width octal sequences for fixed-width Unicode sequences. This means that every escape sequence in the displayed result will consist of "\\" (two backslashes), followed by one of {n, b, r, t, f, ", \}, or by a 'u' character plus exactly four hexadecimal digits [0-9A-F].
If you just want to know whether or not the original String contains escape sequences, search for "\\" in the Apache-fied string. If you want to find the positions of those sequences, it's a bit more involved.
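If pulling in the library is not an option, the same check can be approximated with plain Java. The sketch below is a hypothetical, simplified stand-in (MiniEscaper, escape and needsEscaping are my own names), not the Commons implementation, and it covers only a handful of escapes.

```java
class MiniEscaper {
    // Simplified escaper: handles a few common escapes and writes every
    // other non-printable or non-ASCII char as backslash, u, four hex digits.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                default:
                    if (c < 0x20 || c > 0x7E) {
                        out.append(String.format("\\u%04X", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // True if the text holds anything this sketch would write as an escape.
    static boolean needsEscaping(String s) {
        return !escape(s).equals(s);
    }

    public static void main(String[] args) {
        System.out.println(escape("tab\there"));
        System.out.println(needsEscaping("plain text"));
    }
}
```

As with the library version, this says which characters would need an escape in a literal, not how the string was originally written.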
See more at this Gist.
