In Java, I learned that the following syntax can be used for mentioning Unicode characters that are not on the keyboard (eg. non-ASCII characters):
(\u)(u)*(HexDigit)(HexDigit)(HexDigit)(HexDigit)
My question is:
What is the purpose of (u)* in the above syntax?
One use case that I understood which represents Yen symbol in Java is:
char ch = '\u00A5';
Interesting question. Section 3.3 of the JLS says:
UnicodeEscape:
\ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
u
UnicodeMarker u
which translates to \\u+\p{XDigit}{4}
and
If an eligible \ is followed by u, or more than one u, and the last u is not followed by four hexadecimal digits, then a compile-time error occurs.
So you're right, there can be one or more u after the backslash. The reason is given further down:
The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
So this input
\u0020ä
becomes
\uu0020\u00e4
The first uu means here "this was a unicode escape sequence to begin with" while the second u says "An automatic tool converted a non-ASCII character to a unicode escape."
This information is useful when you want to convert back from ASCII to unicode: You can restore as much of the original code as possible.
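The round trip described in the JLS quote can be sketched as follows. This is only a minimal illustration, not the JDK's own tool: the class name UnicodeRoundTrip and the methods toAscii and toUnicode are hypothetical, the input is assumed to contain only well-formed escapes, and only an "eligible" backslash (one preceded by an even number of backslashes) is treated as the start of an escape.

```java
class UnicodeRoundTrip {

    // Unicode -> ASCII: existing escapes gain one extra 'u',
    // non-ASCII characters become escapes with a single 'u'.
    static String toAscii(String src) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0; // contiguous backslashes directly before position i
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            boolean startsEscape = c == '\\' && backslashes % 2 == 0
                    && i + 1 < src.length() && src.charAt(i + 1) == 'u';
            if (startsEscape) {
                out.append("\\uu"); // the original marker plus one extra 'u'
                i++;                // skip the 'u' that was just copied
                backslashes = 0;
            } else if (c > 127) {
                out.append(String.format("\\u%04x", (int) c));
                backslashes = 0;
            } else {
                out.append(c);
                backslashes = (c == '\\') ? backslashes + 1 : 0;
            }
        }
        return out.toString();
    }

    // ASCII -> Unicode: escapes with several 'u' lose one,
    // escapes with a single 'u' are decoded to the character itself.
    static String toUnicode(String ascii) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0;
        int i = 0;
        while (i < ascii.length()) {
            char c = ascii.charAt(i);
            boolean startsEscape = c == '\\' && backslashes % 2 == 0
                    && i + 1 < ascii.length() && ascii.charAt(i + 1) == 'u';
            if (startsEscape) {
                int j = i + 1;
                while (j < ascii.length() && ascii.charAt(j) == 'u') {
                    j++;
                }
                int uCount = j - (i + 1);
                if (uCount > 1) {
                    out.append('\\');
                    for (int k = 1; k < uCount; k++) {
                        out.append('u');
                    }
                    i = j; // the hex digits are copied as ordinary characters
                } else {
                    // Assumes four valid hex digits follow the marker.
                    out.append((char) Integer.parseInt(ascii.substring(j, j + 4), 16));
                    i = j + 4;
                }
                backslashes = 0;
            } else {
                out.append(c);
                backslashes = (c == '\\') ? backslashes + 1 : 0;
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // An escape typed by the author, followed by a real a-umlaut character.
        String original = "\\u0020\u00e4";
        String ascii = toAscii(original);
        System.out.println(ascii);
        System.out.println(toUnicode(ascii).equals(original));
    }
}
```

Because the author's escape keeps its extra u while the converted character gets a single u, the reverse step can tell the two apart and restore the original text.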
It means you can add as many u as you want - for example these lines are equivalent:
char ch = '\u00A5';
char ch = '\uuuuu00A5';
char ch = '\uuuuuuuuuuuuuuuuuu00A5';
(and all compile)
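A quick way to check the equivalence yourself is to compare the resulting values. This is just a sketch; the class and constant names are made up for illustration.

```java
class MultipleUMarkers {
    static final char ONE_U = '\u00A5';
    static final char MANY_U = '\uuuuu00A5';

    public static void main(String[] args) {
        // Both constants hold the same UTF-16 code unit (the Yen sign, 0xA5).
        System.out.println(ONE_U == MANY_U);
        System.out.println((int) ONE_U);
    }
}
```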
Java supports only the \uXXXX notation (four hex digits), which covers characters in the BMP; it doesn't support a \u{YYYYY} notation (five or more hex digits) for characters outside the BMP (the 16 other planes). A character outside the BMP therefore can't be represented in a single char constant; you have to write it as a surrogate pair.
For example, if you want to write MATHEMATICAL BOLD CAPITAL A (U+1D400), you can't write "\u{1D400}": it's an illegal Unicode escape sequence in Java. Writing "\u1D400" just means "\u1D40" + "0", so it outputs ᵀ0. You really have to use surrogates in Java, so you have to write "\uD835\uDC00" instead.
But writing surrogates is not handy, so if you want to write them directly from a code point you can use one of these tricks:
String test1 = new String(new int[] { 0x1D400 }, 0, 1);
String test2 = String.valueOf(Character.toChars(0x1D400));
String test3 = Character.toString(0x1D400); // needs Java 11 or later
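To convince yourself that the surrogate pair and the code point describe the same string, you can compare them directly. A small sketch, with a hypothetical helper named fromCodePoint:

```java
class SupplementaryDemo {
    // Builds a String from any code point, including ones outside the BMP.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String viaSurrogates = "\uD835\uDC00";
        String viaCodePoint = fromCodePoint(0x1D400);
        System.out.println(viaSurrogates.equals(viaCodePoint));
        // Two char values, but only one code point.
        System.out.println(viaCodePoint.length());
        System.out.println(viaCodePoint.codePointCount(0, viaCodePoint.length()));
    }
}
```

Note that length() counts char values (UTF-16 code units), so it reports 2 here even though the string holds a single character.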
I'm learning Java, and I'm on a book chapter about hex String literals. It tells me that I can create a hex String literal in this format: "\uxxxx". So I tried this:
char c = '\u0010';
int x = c;
System.out.println(x); // prints 16.
Firstly, why does the following hex String literal cause a compilation error? I was expecting that 'a' in hex would equal 10 in decimal.
char c = '\u000a';
Returns the following error:
..\src\pkgs\main\Main.java:360: error: illegal line end in character literal
char c = '\u000a';
Secondly, because of my novice Java status, I'm currently not able to appreciate what hex String literals are used for. Why would I want to use one? Can someone please provide me with a "real world" example of their use? Thanks a lot.
The compiler reports an error because it translates \u000a into an actual line feed (LF) character before it parses the character literal.
char A = '\u000A';
therefore becomes...
char A ='
';
which results in a compile-time error. To avoid this error, always use the special escape characters '\n' (line feed) and '\r' (carriage return).
As noted already, Unicode escapes are actually processed during compilation as a replacement:
Because Unicode escapes are processed very early, it is not correct to write '\u000a' for a character literal whose value is linefeed (LF); the Unicode escape \u000a is transformed into an actual linefeed in translation step 1 (§3.3) and the linefeed becomes a LineTerminator in step 2 (§3.4), and so the character literal is not valid in step 3. Instead, one should use the escape sequence '\n' (§3.10.6). Similarly, it is not correct to write '\u000d' for a character literal whose value is carriage return (CR). Instead, use '\r'.
This aspect of Unicode escapes is not just limited to character literals. For example, the following will print "hello world":
// \u000A System.out.println("hello world");
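The same effect can be observed without relying on console output. In this sketch (class and method names are made up), the escape ends the comment, so the text after it is compiled as a statement:

```java
class HiddenStatement {
    static int counter() {
        int n = 0;
        // the rest of this line is code, not comment: \u000A n++;
        return n;
    }

    public static void main(String[] args) {
        // The increment hidden in the "comment" has been executed.
        System.out.println(counter());
    }
}
```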
Another way to get special characters beyond an escape is to use an integer literal:
static final char NUL = 0x0000;
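For the line feed case specifically, here are a few spellings that do compile and all give the same value. The constant names are just for illustration.

```java
class LineFeedConstants {
    static final char FROM_ESCAPE = '\n';    // escape sequence, handled after tokenizing
    static final char FROM_INT = 0x000A;     // integer literal that fits in a char
    static final char FROM_CAST = (char) 10; // explicit cast from a decimal value
    static final char FROM_OCTAL = '\012';   // octal escape, 012 octal is 10 decimal

    public static void main(String[] args) {
        System.out.println(FROM_ESCAPE == FROM_INT
                && FROM_INT == FROM_CAST
                && FROM_CAST == FROM_OCTAL);
    }
}
```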
As for their usefulness, for one, because otherwise you'd have to copy and paste special characters or type them in with some keyboard combination. The other reason is that certain characters don't have a proper visual representation. Examples of this are null, escape, backspace and delete. Also code point 7, the bell character, which is actually an instruction for the computer to emit a beep when it gets printed.
A char in Java is 2 bytes (one UTF-16 code unit), so a char can hold any Unicode character in the Basic Multilingual Plane.
So if you know a character's Unicode code, you can store it in a char as a hex escape and use characters from other languages.
To understand the use of hex literals, you can visit this link:
http://voices.yahoo.com/how-print-unicode-characters-java-12507717.html
Let's say I have
String str="hello\" world\\";
when printing str, the output is
hello" world\
even when printing str.length() the output is
13
Is there any way to prove that str value has escape character(s)?
There is no such thing as escape characters at run time.
Escape characters appear only in String literals. For example,
String literal = "Some\nEscape\rSequence\\\"";
At compilation time, the compiler replaces each escape sequence with the character it denotes and stores the resulting String value in the class file (in a modified UTF-8 encoding). The JVM uses that String value directly.
You wrote
I am thinking that whenever we print a string and the output contains
character such as " and \, then we can conclude that those character,
" and \ was escaped?
This is not true, those characters might have been read from a file or some other InputStream. They were definitely not escaped in a text file.
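A small sketch makes the point concrete: the same string value can be produced with or without escape sequences in the source, and nothing at run time distinguishes the two. The class and method names are invented for the example.

```java
class RuntimeStringDemo {
    // Written with escape sequences in the source code.
    static String literal() {
        return "hello\" world\\";
    }

    // Built from raw character codes, with no escapes for the quote or backslash.
    static String built() {
        return new StringBuilder()
                .append("hello")
                .append((char) 0x22)  // double quote
                .append(" world")
                .append((char) 0x5C)  // backslash
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(literal().equals(built()));
        System.out.println(literal().length());
    }
}
```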
Yes.
Use the Apache Commons Library, specifically StringEscapeUtils#escapeJava.
jshell> StringEscapeUtils.escapeJava("Newline \n here \u0344 and unicode \f\n\r\t\"\0\13 and more")
$136 ==> "Newline \\n here \\u0344 and unicode \\f\\n\\r\\t\\\"\\u0000\\u000B and more"
This prepends a backslash to each escape sequence and also swaps the variable-width octal sequences for fixed-width Unicode sequences. This means that every escape sequence will consist of "\\" two backslashes, followed by one of {n, b, r, t, f, ", \}, or a 'u' character, plus exactly four hexadecimal [0-9A-F] digits.
If you just want to know whether or not the original String contains escape sequences, search for "\\" in the Apache-fied string. If you want to find the positions of those sequences, it's a bit more involved.
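If you'd rather not pull in a library just for this check, the idea can be approximated by hand. The following is a rough sketch and not Apache's implementation: MiniEscaper, escape and needsEscaping are hypothetical names, and the set of characters treated as "needing an escape" is my own assumption (control characters, anything outside printable ASCII, the double quote and the backslash).

```java
class MiniEscaper {
    // Turns a run-time string back into Java-literal-style text.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                case '\b': sb.append("\\b"); break;
                case '\f': sb.append("\\f"); break;
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                default:
                    if (c < 0x20 || c > 0x7E) {
                        // Fixed-width escape: backslash, 'u', four hex digits.
                        sb.append(String.format("\\u%04X", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // True if writing the string as a literal would require at least one escape.
    static boolean needsEscaping(String s) {
        return escape(s).indexOf('\\') >= 0;
    }

    public static void main(String[] args) {
        System.out.println(escape("hello\" world\\"));
        System.out.println(needsEscaping("plain text"));
    }
}
```

Keep in mind this only tells you that a literal for this value would need escapes, not that the value actually came from an escaped literal.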
See more at this Gist.
This line compiles fine
Thread t = \u006E\u0065\u0077\u0020\u0054\u0068\u0072\u0065\u0061\u0064\u0028\u0029\u003B
This is the Unicode-escaped form of the text new Thread();
My question is: what is the need for accepting Unicode escapes outside " " or ' '? We can use Unicode escapes in string literals and character literals, but why do they need to be accepted in the actual code itself?
The reason why this works is that the Unicode escape sequence isn't handled by the grammar or the string parsing code but the tokenizer. So the Java grammar never "sees" those escape sequences, it gets a Unicode string.
This has unfortunate side effects; for example, this code doesn't compile:
// C:\user\...
For most of us, it's a comment. For the tokenizer, \u starts a Unicode escape, and ser\ is not four hexadecimal digits, so it's an illegal Unicode escape.
The reason to do it this way is that you can now use any Unicode character anywhere in the Java source code - Java identifiers are not limited to ASCII!
But the tools to edit Java might not be as good. In 1994, it was pretty hard to find a text editor capable of Unicode. Also, code generators often work better if you stay with ASCII.
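Here is a small sketch of escapes being used outside of literals, in an identifier and in a keyword. The class and method names are invented for illustration.

```java
class EscapedSource {
    static int demo() {
        int \u0061bc = 41; // the escape is the letter 'a', so this declares abc
        return abc + 1;
    }

    static String keyword() {
        // The next line spells the keyword "return" with escapes only.
        \u0072\u0065\u0074\u0075\u0072\u006E "ok";
    }

    public static void main(String[] args) {
        System.out.println(demo());
        System.out.println(keyword());
    }
}
```

By the time the parser runs, the escapes are already ordinary characters, so it sees a normal declaration and a normal return statement.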
The JLS specifies it:
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
This works because the Java Language Specification requires this. See 3.3. Unicode Escapes:
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.
The reason is simple: Java allows full unicode support (even for identifiers!), but sometimes it is not practical to use actual unicode for your source files, in that case you can use escapes.
This also means that Unicode escapes are not an artifact of strings in Java, but actually of the compiler: if you have a String (or char) with Unicode escapes, it will be translated at compile time to the actual character, not at runtime!
The section 3.2. Lexical Translations is also relevant:
A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
A translation of the Unicode stream resulting from step 1 [...]
If the source file is not saved in a Unicode encoding such as UTF-8, this feature makes it possible to use Unicode characters that would otherwise not be available in the source code.
What regular expression could detect a multi-byte string?
For example, here is the expression to detect a string in English
Pattern p=Pattern.compile("[a-zA-Z/]");
Similarly I want a pattern which has multi bytes like
コメント_1050_固-減価償却費
You may want to have a look at Unicode Support in Java
I think basically you want the Unicode property \p{L}. This would match any code point that has the property "letter".
So your regex could look like this
Pattern p=Pattern.compile("[\\p{L}/]");
I just replaced the character ranges a-zA-Z with \p{L}
Since Java 7 you could also use Pattern.UNICODE_CHARACTER_CLASS
Enables the Unicode version of Predefined character classes and POSIX character classes.
That turns the predefined \w into its Unicode version, meaning it matches all Unicode letters and digits (and connector punctuation such as _).
Your string コメント_1050_固-減価償却費 also contains a hyphen, which \w does not match, so add it to the character class:
Pattern p=Pattern.compile("^[\\w-]+$", Pattern.UNICODE_CHARACTER_CLASS);
This matches any string consisting of letters, digits, _ and -
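To see what the flag changes, compare the same pattern with and without it. This is only a sketch with made-up names; the sample is a shortened stand-in for the string in the question (Japanese letters mixed with ASCII digits, an underscore and a hyphen), written with Unicode escapes so the source stays pure ASCII. Since \w does not cover the hyphen, the pattern lists it explicitly in the character class.

```java
import java.util.regex.Pattern;

class UnicodeWordMatch {
    // Without the flag, \w only covers ASCII letters, digits and underscore.
    static final Pattern ASCII_WORDS = Pattern.compile("^[\\w-]+$");

    // With the flag, \w also covers letters and digits from other scripts.
    static final Pattern UNICODE_WORDS =
            Pattern.compile("^[\\w-]+$", Pattern.UNICODE_CHARACTER_CLASS);

    // Japanese letters mixed with ASCII, in the spirit of the sample string.
    static final String SAMPLE = "\u30B3\u30E1\u30F3\u30C8_1050-\u8CBB";

    public static void main(String[] args) {
        System.out.println(ASCII_WORDS.matcher(SAMPLE).matches());
        System.out.println(UNICODE_WORDS.matcher(SAMPLE).matches());
    }
}
```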
See here for more details
and here on regular-expression.info an overview over the Unicode scripts, properties and blocks.
See here a famous answer from tchrist about the caveats of regex in Java, including an update on what has changed with Java 7 (or will change in Java 8)
If you want to detect whether you have a multi-byte string, you can compare the lengths
if (text.length() != text.getBytes(encoding).length)
This detects that at least one character needs more than one byte in the given encoding.
Essentially, Java regular expressions work on Strings, not arrays of bytes - characters are represented as abstract "character" entities, not as bytes in some specific encoding. This is not completely true since the char type only contains characters from the Basic Multilingual Plane and Unicode chars from outside this range are represented as two char values each, but nonetheless "multibyte" is relative and depends on the encoding.
If what you need is "multibyte in UTF-8", then note that only characters with values 0-127 are single-byte in this encoding. So, the easiest way to check would be to use a loop and check each character - if it's greater than 127, it's more than one byte in UTF-8.
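The loop-based check for UTF-8 could look roughly like this. It's a sketch under the assumption that "multibyte" means "needs more than one byte in UTF-8"; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class MultiByteCheck {
    // True if any char is outside 0-127, i.e. takes more than one byte in UTF-8.
    static boolean hasMultiByteUtf8(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 127) {
                return true;
            }
        }
        return false;
    }

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "abc";
        String accented = "caf\u00e9"; // ends with e-acute
        System.out.println(hasMultiByteUtf8(ascii));
        System.out.println(hasMultiByteUtf8(accented));
        System.out.println(utf8Length(accented));
    }
}
```

Characters outside the BMP are also caught, because both halves of a surrogate pair are char values above 127.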
If you insist on using a regex, you could probably use the character range operator in the regex like this: [\u0080-\uFFFF] (haven't checked and \uFFFF is not really a character but I think the regex engine should accept it).
You will need to use Unicode for elements which are not in the English language. This link should provide you with some information.
There is a nice introduction to Unicode regex here.
I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:
private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");
which, if I'm reading the Pattern documentation correctly, should capture any of the unicode dashes or the ascii dash, when surrounded on both sides by whitespace. I'm using the pattern as follows:
String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);
No joy. For the sample input below, the dash is not detected, and
titleSegmentSeparator.matcher(sectionTitle).find() returns false!
In order to make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows -- each character is followed by the output of (int)char, which should be its Unicode code point, no?
Sample input:
Study Summary (1 of 10) – Competition
S(83)t(116)u(117)d(100)y(121)
(32)S(83)u(117)m(109)m(109)a(97)r(114)y(121)
(32)((40)1(49) (32)o(111)f(102)
(32)1(49)0(48))(41) (32)–(8211)
(32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)
It looks to me like that dash is codepoint 8211, which should be matched by the regex, but it isn't! What's going on here?
You're mixing decimal (8211) and hexadecimal (0x8211).
\x and \u both expect a hexadecimal number, so you need \u2013 to match the en dash (decimal 8211) and \u2014 to match the em dash (decimal 8212), not \u8211 and \u8212. Likewise, use \x2D for the normal hyphen: \x45 is the letter E.
But why not simply use the Unicode property "Dash punctuation"?
As a Java string: "\\s\\p{Pd}\\s"
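Applied to the sample input from the question, the property-based pattern could be used like this. A sketch only: the class name is made up, and the en dash is written as a Unicode escape so the source stays ASCII.

```java
import java.util.regex.Pattern;

class DashSplit {
    // Any character in the "Dash punctuation" category, surrounded by whitespace.
    static final Pattern SEPARATOR = Pattern.compile("\\s\\p{Pd}\\s");

    static String[] split(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        // The dash here is the en dash, decimal 8211, hex 2013.
        String title = "Study Summary (1 of 10) \u2013 Competition";
        for (String part : split(title)) {
            System.out.println(part);
        }
    }
}
```

The same pattern also splits on the ASCII hyphen and the em dash, so there is no need to list the individual code points.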