In Java,
"Carriage return" is represented as '\r'
&
"Line Feed" is represented as '\n'.
But Java does not allow
"Carriage return" as '\u000d'
and
"Line Feed" as '\u000a'.
Why?
Unicode escape sequences are applied earlier in the source transformation than character-literal escape sequences: they are translated very early in the process, before any other lexing happens, including before line breaks are detected. See JLS 3.2 for details.
So when you put \u000a into a Java source file, it will behave exactly as if you'd put an actual line feed in there - causing a line break as far as the rest of the compiler is concerned.
(Personally I think this was a design mistake; I prefer the C# approach of only allowing Unicode escape sequences at very specific points in the code, but that's a different matter.)
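As a consequence of this early translation, a Unicode escape can even terminate a line comment; a minimal sketch (class name and message invented for illustration):
public class EscapeDemo {
    public static void main(String[] args) {
        // The escape on the next line becomes a real line feed before comments are
        // recognised, so the rest of that "comment" is compiled and executed:
        // \u000a System.out.println("printed from inside a comment");
    }
}
Compiling and running this prints the message, even though it appears to live inside a comment.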
Unicode escapes are recognised anywhere in a Java source file, not just inside string literals, and are processed very early in the compiler chain. A \u000d is treated as a literal carriage return, not an escaped one. For example, for the source code
String cr = "\u000d";
what the compiler sees is
String cr = "
";
And this is not legal Java code.
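The character escape sequences '\r' and '\n', by contrast, are handled later, when the literal itself is parsed, so they are the way to write these characters; a minimal sketch:
char cr = '\r';        // carriage return, U+000D
char lf = '\n';        // line feed, U+000A
String crlf = "\r\n";  // both, as a string
// Writing the same char literal with the Unicode escape for U+000D would not compile,
// because the escape becomes a real line break before the literal is parsed.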
Related
When matching certain characters (such as line feed), you can use the regex "\\n" or indeed just "\n". For example, the following splits a string into an array of lines:
String[] lines = allContent.split("\\r?\\n");
But the following works just as well:
String[] lines = allContent.split("\r?\n");
My question:
Do the above two work in exactly the same way, or is there any subtle difference? If the latter, can you give an example case where you get different results?
Or is there a difference only in [possible/theoretical] performance?
There is no difference in the current scenario. The usual Java string escape sequences are formed with a single backslash followed by a valid escape char ("\n", "\r", etc.), while regex escape sequences are formed with a literal backslash (that is, a double backslash in the Java string literal) followed by a valid regex escape char ("\\n", "\\d", etc.).
"\n" (an escape sequence) is a literal LF (newline) and "\\n" is a regex escape sequence that matches an LF symbol.
"\r" (an escape sequence) is a literal CR (carriage return) and "\\r" is a regex escape sequence that matches an CR symbol.
"\t" (an escape sequence) is a literal tab symbol and "\\t" is a regex escape sequence that matches a tab symbol.
See the list in the Java regex docs for the supported list of regex escapes.
However, if you use the Pattern.COMMENTS flag (used to introduce comments and format a pattern nicely, making the regex engine ignore all unescaped whitespace in the pattern), you will need to use either "\\n" or "\\\n" to define a newline (LF) in the Java string literal, and "\\r" or "\\\r" to define a carriage return (CR).
See a Java test:
String s = "\n";
System.out.println(s.replaceAll("\n", "LF")); // => LF
System.out.println(s.replaceAll("\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\n", "<LF>"));
// => <LF>
//<LF>
Why does the last one produce <LF>, a newline, and then <LF>? Because in "(?x)\n" the literal newline is ignored whitespace, so the pattern is effectively empty, and an empty pattern matches the empty string both before and after the newline.
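So, back to the original question: outside of COMMENTS mode the two split calls behave identically. A quick check (the sample content is made up):
String allContent = "first line\r\nsecond line\nthird line";
String[] a = allContent.split("\\r?\\n"); // regex escapes \r and \n
String[] b = allContent.split("\r?\n");   // literal CR and LF characters in the pattern
System.out.println(java.util.Arrays.equals(a, b)); // true
System.out.println(a.length);                      // 3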
Yes, they are different. The Java compiler treats Unicode escapes specially, as described in The Java Language Specification, section 3.3:
The Java programming language specifies a standard way of transforming
a program written in Unicode into ASCII that changes a program into a
form that can be processed by ASCII-based tools. The transformation
involves converting any Unicode escapes in the source text of the
program to ASCII by adding an extra u - for example, \uxxxx becomes
\uuxxxx - while simultaneously converting non-ASCII characters in the
source text to Unicode escapes containing a single u each.
So how does this affect "\n" vs "\\n"? From the Pattern Javadoc:
It is therefore necessary to double backslashes in string literals
that represent regular expressions to protect them from interpretation
by the Java bytecode compiler.
An example from the same doc:
The string literal "\b", for example, matches a single backspace
character when interpreted as a regular expression, while "\b"
matches a word boundary. The string literal "(hello)" is illegal and
leads to a compile-time error; in order to match the string (hello)
the string literal "\(hello\)" must be used.
In Java, I learned that the following syntax can be used for writing Unicode characters that are not on the keyboard (e.g. non-ASCII characters):
(\u)(u)*(HexDigit)(HexDigit)(HexDigit)(HexDigit)
My question is:
What is the purpose of (u)* in the above syntax?
One use case that I understood, which represents the Yen symbol in Java, is:
char ch = '\u00A5';
Interesting question. Section 3.3 of the JLS says:
UnicodeEscape:
\ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
u
UnicodeMarker u
which translates to \\u+\p{XDigit}{4}
and
If an eligible \ is followed by u, or more than one u, and the last u is not followed by four hexadecimal digits, then a compile-time error occurs.
So you're right, there can be one or more u after the backslash. The reason is given further down:
The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
So this input
\u0020ä
becomes
\uu0020\u00e4
The uu in the first escape means "this was a Unicode escape sequence to begin with", while the single u in the second means "an automatic tool converted a non-ASCII character to a Unicode escape."
This information is useful when you want to convert back from ASCII to unicode: You can restore as much of the original code as possible.
It means you can add as many u as you want - for example these lines are equivalent:
char ch = '\u00A5';
char ch = '\uuuuu00A5';
char ch = '\uuuuuuuuuuuuuuuuuu00A5';
(and all compile)
Java supports only the \uXXXX (4 hex chars) notation for Unicode characters in the BMP, but doesn't support the \u{YYYYY} (5 hex chars) notation for characters outside the BMP (the 16 other planes). So it's impossible to represent them in a single char constant; you'll have to write them as a surrogate pair.
For example, if you want to write MATHEMATICAL BOLD CAPITAL A (U+1D400) you can't write "\u{1D400}": it's an illegal Unicode escape sequence in Java. Writing "\u1D400" is only doing "\u1D40" + "0", so it will output ᵀ0. No, you really have to use surrogates in Java, so you have to write "\uD835\uDC00" instead.
But writing surrogates is not handy, so if you want to build them directly from a code point you can use one of these tricks:
String test1 = new String(new int[] { 0x1D400 }, 0, 1);
String test2 = String.valueOf(Character.toChars(0x1D400));
String test3 = Character.toString(0x1D400);
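(The third variant, Character.toString(int), needs Java 11 or later.) A quick sanity check that the surrogate-pair literal and the code-point constructions denote the same string:
String fromSurrogates = "\uD835\uDC00";                        // MATHEMATICAL BOLD CAPITAL A as a surrogate pair
String fromCodePoint  = new String(Character.toChars(0x1D400));
System.out.println(fromSurrogates.equals(fromCodePoint));      // true
System.out.println(fromSurrogates.codePointAt(0) == 0x1D400);  // true
System.out.println(fromSurrogates.length());                   // 2 - two UTF-16 code units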
This line compiles fine
Thread t = \u006E\u0065\u0077\u0020\u0054\u0068\u0072\u0065\u0061\u0064\u0028\u0029\u003B
This is the Unicode for the text new Thread();
My question is: what is the need for accepting Unicode escapes outside of " " or ' '? We can use Unicode escapes in string literals and character literals, but why does the language accept them in the actual code itself?
The reason why this works is that the Unicode escape sequence isn't handled by the grammar or the string parsing code but by the tokenizer. So the Java grammar never "sees" those escape sequences; it only ever sees the already-decoded Unicode characters.
This has unfortunate side effects; for example, this line doesn't compile:
// C:\user\...
For most of us, it's a comment. For the tokenizer, the \u in \user starts a Unicode escape, but what follows ("ser\...") is not four hexadecimal digits, so it's an illegal escape sequence.
The reason to do it this way is that you can now use any Unicode character anywhere in the Java source code - Java identifiers are not limited to ASCII!
But the tools used to edit Java source might not handle Unicode well. In 1994, it was pretty hard to find a text editor capable of Unicode. Also, code generators often work better if you stay with ASCII.
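For example, an identifier can be written either natively or entirely with escapes; both spellings name the same variable after the translation step (a toy sketch):
double \u03c0 = 3.14159;  // declares a variable whose name is the Greek letter pi
System.out.println(π);    // same identifier once the escape is translated: prints 3.14159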
The JLS specifies it:
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
This works because the Java Language Specification requires this. See 3.3. Unicode Escapes:
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.
The reason is simple: Java allows full Unicode support (even for identifiers!), but sometimes it is not practical to use actual Unicode in your source files; in that case you can use escapes.
This also means that Unicode escapes are not an artifact of strings in Java, but actually of the compiler: if you have a String (or char) with Unicode escapes, it will be translated at compile time to the actual character, not at runtime!
The section 3.2. Lexical Translations is also relevant:
A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
A translation of the Unicode stream resulting from step 1 [...]
If the source file encoding is not UTF-8, this feature makes it possible to use Unicode characters that would otherwise not be available in the source code.
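A small illustration that the translation happens at compile time, not at runtime:
System.out.println("\u0041".equals("A"));    // true - both are the identical string constant "A"
System.out.println('\u0041' == 'A');         // true - same char value
// By contrast, the regex engine resolves "\\u0041" itself, at runtime:
System.out.println("A".matches("\\u0041"));  // true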
There are many questions and answers here on StackOverflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However, with Unicode there are many more characters that most people would regard as letters (all the Greek letters, Cyrillic, and many more); Unicode defines many blocks, each of which may contain "letters".
Java's Pattern class defines POSIX classes for things like alpha characters, but those are specified to work only with US-ASCII. The predefined character class \w defines word characters as [a-zA-Z_0-9], which also excludes many letters.
So how do you properly match against Unicode strings? Is there some other library that gets this right?
Here you have a very nice explanation:
http://www.regular-expressions.info/unicode.html
Some hints:
Java and .NET unfortunately do not support \X (yet). Use \P{M}\p{M}* as a substitute. To match any number of graphemes, use (?:\P{M}\p{M}*)+ instead of \X+.
In Java, the regex token \uFFFF only matches the specified code point, even when you turned on canonical equivalence. However, the same syntax \uFFFF is also used to insert Unicode characters into literal strings in the Java source code. Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of à, while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you're doing, the difference may be significant.
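For example, the \P{M}\p{M}* workaround from the first hint can be used to count user-perceived characters (graphemes); a small sketch:
String s = "e\u0301";  // 'e' followed by COMBINING ACUTE ACCENT, displayed as é
System.out.println(s.length());  // 2 - two code units
java.util.regex.Matcher m = java.util.regex.Pattern.compile("\\P{M}\\p{M}*").matcher(s);
int graphemes = 0;
while (m.find()) graphemes++;
System.out.println(graphemes);   // 1 - one grapheme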
Are you talking about Unicode categories, like letters? These are matched by a regex of the form \p{CAT}, where "CAT" is the category code like L for any letter, or a subcategory like Lu for uppercase or Lt for title-case.
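A quick illustration (the sample words are arbitrary):
System.out.println("Ελληνικά".matches("\\p{L}+"));      // true - Greek letters are \p{L}
System.out.println("Ελληνικά".matches("[a-zA-Z]+"));    // false - the ASCII-only class misses them
System.out.println("Äpfel".matches("\\p{Lu}\\p{Ll}+")); // true - one uppercase letter, then lowercase ones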
Quoting from the JavaDoc of java.util.regex.Pattern.
Unicode support
This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.
Unicode escape sequences such as \u2014 in Java source code are processed as described in §3.3 of the Java Language Specification. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
Unicode blocks and categories are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property. Blocks are specified with the prefix In, as in InMongolian. Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and categories can be used both inside and outside of a character class.
The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative. The block names supported by Pattern are the valid block names accepted and defined by UnicodeBlock.forName.
I'm trying to match a control character in the form \^c where c is any valid character for control characters. I have this regular expression, but it's not currently working: \\[^][#-z]
I think the problem lies with the fact that the caret character (^) is part of the regular expressions parsing engine.
Match an ASCII text string of the form ^X using the pattern \^., nothing more. Match an ASCII text string of the form \^X with the pattern \\\^.. You may wish to constrain that dot to [?#_\[\]^\\], so \\\^[A-Z?#_\[\]^\\]. It’s easier to read as [?\x40-\x5F] for the bracketed character class, hence \\\^[?\x40-\x5F] for a literal BACKSLASH, followed by a literal CIRCUMFLEX, followed by something that turns into one of the valid control characters.
Note that that is the result of printing out the pattern, or what you'd read from a file. It's what you need to pass to the regex compiler. If you have it as a string literal, you must of course double each of those backslashes: "\\\\\\^[?\\x40-\\x5F]". Yes, it is insane looking, but that is because Java does not support regexes directly the way Groovy and Scala (or Perl and Ruby) do. Regex work is always easier without the extra bbaacckksslllllaasshheesssssess. :)
If you had real control characters instead of indirect representations of them, you would use \pC for all literal code points with the property GC=Other, or \p{Cc} for just GC=Control.
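A small sketch covering both cases, using the string literal from above (the sample text is made up):
String text = "interrupt with \\^C, not ^D";  // the literal \\^C yields the characters \^C in the text
// Text of the form \^X - the pattern \\\^[?\x40-\x5F] written as a string literal:
java.util.regex.Matcher m =
        java.util.regex.Pattern.compile("\\\\\\^[?\\x40-\\x5F]").matcher(text);
while (m.find()) System.out.println(m.group());  // \^C
// A real control character (general category Cc) would instead be matched with \p{Cc}:
System.out.println("\u0003".matches("\\p{Cc}")); // true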
Check this out: http://www.regular-expressions.info/characters.html. You should be able to use \cA to \cZ to find the control characters.
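For example, \cC is the control character Ctrl-C (U+0003, ETX):
System.out.println("\u0003".matches("\\cC")); // true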