There are many questions and answers here on Stack Overflow that assume a "letter" can be matched in a regex by [a-zA-Z]. However, with Unicode there are many more characters that most people would regard as letters (all the Greek letters, Cyrillic, and many more). Unicode defines many blocks, each of which may contain "letters".
Java defines POSIX classes for things like alpha characters, but these are specified to work only with US-ASCII. The predefined character class \w defines word characters as [a-zA-Z_0-9], which also excludes many letters.
So how do you properly match against Unicode strings? Is there some other library that gets this right?
Here is a very nice explanation:
http://www.regular-expressions.info/unicode.html
Some hints:
Java and .NET unfortunately do not support \X (yet). Use \P{M}\p{M}* as a substitute. To match any number of graphemes, use (?:\P{M}\p{M}*)+ instead of \X+.
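A small sketch of the \P{M}\p{M}* substitute, counting graphemes in a decomposed string (assumes java.util.regex.Pattern and Matcher are imported):
Pattern grapheme = Pattern.compile("\\P{M}\\p{M}*");
Matcher m = grapheme.matcher("a\u0300e\u0301");   // "àé" written as base letters + combining accents
int count = 0;
while (m.find()) count++;
System.out.println(count);   // 2 graphemes, even though the string contains 4 code points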
In Java, the regex token \uFFFF only matches the specified code point, even when you turned on canonical equivalence. However, the same syntax \uFFFF is also used to insert Unicode characters into literal strings in the Java source code. Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of à, while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you're doing, the difference may be significant.
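A minimal sketch of the two levels of escaping (without canonical equivalence turned on):
String composed   = "\u00E0";     // the single code point U+00E0 (à)
String decomposed = "a\u0300";    // a + COMBINING GRAVE ACCENT
System.out.println(composed.matches("\u00E0"));    // true: the regex is the literal character à
System.out.println(composed.matches("\\u00E0"));   // true: the regex parser itself resolves \u00E0
System.out.println(decomposed.matches("\u00E0"));  // false: without CANON_EQ the decomposed form does not match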
Are you talking about Unicode categories, like letters? These are matched by a regex of the form \p{CAT}, where "CAT" is the category code like L for any letter, or a subcategory like Lu for uppercase or Lt for title-case.
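For example (a small sketch):
System.out.println("Ω".matches("\\p{L}"));    // true: any Unicode letter
System.out.println("Ω".matches("\\p{Lu}"));   // true: uppercase letter
System.out.println("ω".matches("\\p{Lu}"));   // false: lowercase
System.out.println("5".matches("\\p{L}"));    // false: a digit, not a letter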
Quoting from the JavaDoc of java.util.regex.Pattern.
Unicode support
This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.
Unicode escape sequences such as \u2014 in Java source code are processed as described in §3.3 of the Java Language Specification. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
Unicode blocks and categories are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property. Blocks are specified with the prefix In, as in InMongolian. Categories may be specified with the optional prefix Is: both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and categories can be used both inside and outside of a character class.
The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative. The block names supported by Pattern are the valid block names accepted and defined by UnicodeBlock.forName.
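A quick illustration of the block and category syntax described above (a small sketch):
System.out.println("α".matches("\\p{InGreek}"));   // true: the Greek block
System.out.println("α".matches("\\p{IsL}"));       // true: the letter category, same as \p{L}
System.out.println("α".matches("\\p{Lu}"));        // false: not an uppercase letter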
Related
I am trying the regex ([[.ch.]]*)c against the test string chchch. According to the spec:
[[.ch.]]*c matches the first to fifth character in the string chchch
When I test it in Java, it indeed matches those characters, but so does [[ch]]*c. Thus I am not sure if the collating symbol is respected. Is it?
TL;DR - No.
The specification you are reading/quoting is the Open Group's SUS (Single UNIX® Specification) version of the regular expression part of IEEE's POSIX (Portable Operating System Interface for uniX) collection of standards. (See https://www.regular-expressions.info/posix.html ¹)
In general, only POSIX-compliant regular expression engines fully support POSIX bracket expressions, which are essentially what other regex flavors call character classes but with a few special features, one being that [. and .] are interpreted as the start and end of a collating sequence when used within the expressions.
Unfortunately, very few regex engines are POSIX-compliant and, in fact, some claiming to implement POSIX regexes just use the regular expression syntax defined by POSIX and don't have full locale support. Thus they don't implement all/any of the bracket expression features/quirks.
Java's regular expressions are in no way POSIX-compliant, as can be seen from this Regular Expression Engine Comparison Chart ². Its regex package implements a "Perl-like" regex engine, missing a few features (e.g. conditional expressions and comments), but including some extra ones (e.g. possessive quantifiers and variable-length, but finite, look-behind assertions).
Neither Perl nor Java support the collation-related bracket delimiters [= and =] (character equivalence), or [. and .] (collating sequence). Perl does support character classes using the POSIX [: and :] delimiters, but Java only supports them using the \p operator (with a few caveats as explained here).
So, what is going on with the regex [[.ch.]]*c in Java? (I'm ignoring the capturing group as it doesn't change the analysis.)
Well, it turns out that Java's regex package supports unions in its character classes. This is achieved by nesting. For example, [set1[set2]] is equivalent to [set3] where the characters in set3 are the union of the characters in set1 and the characters in set2. (As an aside, note that [[set1][set2]] and [[set1]set2] also produce the same result.)
So, [[.ch.]] is simply the character class containing the union of an empty set of characters with the set of characters in the character class [.ch.], so basically it's the same as the character class [.ch.]. This is equivalent to [.ch] (since the second . is redundant) and thus [[.ch.]]*c is the same as [.ch]*c.
Similarly, [[ch]]*c simplifies to [ch]*c.
Finally, since there aren't any . characters in the string chchch, the regexes [.ch]*c and [ch]*c will produce the same result. (Try testing against the string c.hchch to see the difference and prove the above.)
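A quick way to see both points in Java (a minimal sketch; assumes java.util.regex.Pattern and Matcher are imported):
// Both patterns collapse to plain character classes in Java,
// so they give the same answer on "chchch" ...
Matcher a = Pattern.compile("([[.ch.]]*)c").matcher("chchch");
Matcher b = Pattern.compile("([[ch]]*)c").matcher("chchch");
a.find(); b.find();
System.out.println(a.group());  // chchc
System.out.println(b.group());  // chchc
// ... but they differ on "c.hchch", because only the first class also contains '.'
a = Pattern.compile("([[.ch.]]*)c").matcher("c.hchch");
b = Pattern.compile("([[ch]]*)c").matcher("c.hchch");
a.find(); b.find();
System.out.println(a.group());  // c.hchc
System.out.println(b.group());  // c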
Notes:
This is not a very good example for either demonstrating collating sequences or for detecting if they are implemented, as [[.ch.]]*c will match chchc in chchch both when collating sequences are supported (and ch is a valid sequence in the current locale) and when they are not but unions are.
A much better demo/test is to use the regex [[.ch.]] with the test string ch:
Collating sequences are supported if ch is matched.
Any other match means they are not.
They may be supported if an error is returned, as this is what happens if ch is not a valid sequence in the current locale (it's a valid collating sequence in the Czech locale):
If the error specifies that ch is not a valid collating sequence, then they are supported.
If the error returned is that the delimiter/token [. and/or .] is invalid/unsupported, then collating sequences are not supported.
If the error is ambiguous, or for a guaranteed way to check for support, you need to switch to the Czech locale (and confirm that ch is indeed a valid collating sequence) or switch to any other locale that has at least one defined collating sequence which can be used instead of ch.
¹ I am not Jan Goyvaerts, nor am I in any way affiliated with the Regular-Expressions.info site.
² Nor am I CMCDragonkai.
I am trying to use a regex such as [ăâîșțĂÂÎȘȚ] to match Romanian diacritics (ISO 8859-16/Windows-1250). The problem is that the regex also matches a, i, s, t, A, I, S, T (the corresponding plain Latin characters for the diacritics above), and I don't want this.
I didn't try to compare strings character by character because of the performance cost.
Is there any way I can make the regex match exactly these characters?
If your regex exists as literal rendered text, it has already been combined and should exist as a different, single code point.
000074 t LATIN SMALL LETTER T
+
000326 ̦ COMBINING COMMA BELOW
=
00021B ț LATIN SMALL LETTER T WITH COMMA BELOW
Just in case, you should use a hex code point to represent them, i.e. \u021B
Is it possible the Java engine is stripping the combining character off of the regex, so that U+021B becomes U+0074? It might be that.
Meanwhile, if you expect that the letters in the source are not rendered (i.e. not combined into single code points), you could use a regex like \p{Script=Latin}\p{Block=Combining_Diacritical_Marks} to match them.
Updated info:
While searching around for a de facto solution, I came across this Java info from http://www.regular-expressions.info/unicode.html:
In Java, the regex token \uFFFF only matches the specified code point, even when you turned on canonical equivalence. However, the same syntax \uFFFF is also used to insert Unicode characters into literal strings in the Java source code. Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of à, while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you're doing, the difference may be significant.
So, by entering a literal that has two possible encodings inside a class, it looks like Pattern.compile("[à]") will actually match
000061 a LATIN SMALL LETTER A
or
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
This smacks of the same problem when putting surrogate pairs inside classes.
There is a solution: avoid entering those literals inside of a class. Instead, write them as a series of alternations
(?:à|_|_|_)
Doing this forces it to match either
000061 a LATIN SMALL LETTER A
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
It won't match an a independent of the grave like you see now.
Note - if you just use "[\\u00E0]", you'd miss the a + grave form, which is valid.
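A minimal sketch of that alternation idea for a single letter, using à (U+00E0 composed vs a + U+0300 decomposed) as a stand-in for the Romanian diacritics:
Pattern alt = Pattern.compile("(?:\u00E0|a\u0300)");
System.out.println(alt.matcher("\u00E0").find());    // true: composed à
System.out.println(alt.matcher("a\u0300").find());   // true: decomposed à
System.out.println(alt.matcher("a").find());         // false: a plain a alone does not match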
I believe this is happening because those characters are being treated as two Unicode code points. I would recommend trying to specifically match the code points using syntax like \uFFFF where FFFF is the code point. The exact syntax will depend on the regex implementation you are using.
Keep in mind that Unicode characters can be encoded as single code points or as multiple, so you'll want to account for that. Example: à encoded as U+0061 U+0300 and also U+00E0.
I hope this helps!
As already mentioned, in Unicode one has two alternatives:
'\u0061' 'a' LATIN SMALL LETTER A
'\u0300' ̀ COMBINING GRAVE ACCENT
or
'\u00E0' 'à' LATIN SMALL LETTER A WITH GRAVE
There is a Normalizer that can "normalize" to either form (and deal with ligatures):
String regex = "(?u)[ăâîșțĂÂÎȘȚ]";
regex = Normalizer.normalize(regex, Form.NFC); // Composed form
Pattern pattern = Pattern.compile(regex);
Using "(?u)" or a flag with Pattern.compile with UNICODE flag might already solve the problem. But using the Unicode variant without separate latin ('a') will certainly do.
The normalizer should especially be applied to the searched-through string.
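A minimal sketch of that approach, normalizing both the pattern and the input to the composed (NFC) form (the sample input is just for illustration; assumes java.text.Normalizer, Normalizer.Form and java.util.regex.Pattern are imported):
String regex = Normalizer.normalize("[ăâîșțĂÂÎȘȚ]", Form.NFC);
String input = Normalizer.normalize("preț", Form.NFC);   // the input may arrive composed or decomposed
Pattern p = Pattern.compile(regex);
System.out.println(p.matcher(input).find());   // true: matches the ț, but not a plain t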
This line compiles fine
Thread t = \u006E\u0065\u0077\u0020\u0054\u0068\u0072\u0065\u0061\u0064\u0028\u0029\u003B
This is the Unicode escape form of the text new Thread();
My question is: why are Unicode escapes accepted outside of " " or ' '? We can use Unicode escapes in string literals and character literals, but what is the need for them to be accepted in the actual code itself?
The reason why this works is that the Unicode escape sequence isn't handled by the grammar or the string parsing code but the tokenizer. So the Java grammar never "sees" those escape sequences, it gets a Unicode string.
Which has unfortunate side effects, such as this code not compiling:
// C:\user\...
For most of us, it's a comment. For the tokenizer, \user looks like a Unicode escape, but ser is not a sequence of hex digits, so it is an illegal escape sequence and the code does not compile.
The reason to do it this way is that you can now use any Unicode character anywhere in the Java source code - Java identifiers are not limited to ASCII!
But the tools to edit Java might not be as good. In 1994, it was pretty hard to find a text editor capable of Unicode. Also, code generators often work better if you stay with ASCII.
The JLS specifies it:
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
This works because the Java Language Specification requires this. See 3.3. Unicode Escapes:
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.
The reason is simple: Java has full Unicode support (even for identifiers!), but sometimes it is not practical to use actual Unicode characters in your source files; in that case you can use escapes.
This also means that Unicode escapes are not an artifact of strings in Java, but actually of the compiler: if you have a String (or char) with Unicode escapes, it will be translated at compile time to the actual character, not at runtime!
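For instance, because the translation happens before tokenizing, an escape can even appear in an identifier (a small sketch; π is just an example):
double \u03C0 = Math.PI;          // declares a variable actually named π
System.out.println(\u03C0);       // prints 3.141592653589793
System.out.println("\u03C0");     // the same escape inside a string literal prints π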
The section 3.2. Lexical Translations is also relevant:
A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
A translation of the Unicode stream resulting from step 1 [...]
If the source code is not encoded in UTF-8, this feature makes it possible to use Unicode characters that would otherwise not be available in the source code.
What could be the regular expression to detect a multi-byte string?
For example, here is the expression to detect a string in English:
Pattern p=Pattern.compile("[a-zA-Z/]");
Similarly I want a pattern which has multi bytes like
コメント_1050_固-減価償却費
You may want to have a look at Unicode Support in Java
I think basically you want the Unicode property \p{L}. This would match any code point that has the property "letter".
So your regex could look like this
Pattern p=Pattern.compile("[\\p{L}/]");
I just replaced the character ranges a-zA-Z with \p{L}
Since Java 7 you could also use Pattern.UNICODE_CHARACTER_CLASS
Enables the Unicode version of Predefined character classes and POSIX character classes.
That would turn the predefined \w into the Unicode version, means it would match all Unicode letters and digits (and string connecting characters like _)
So to match your string コメント_1050_固-減価償却費, you could use
Pattern p=Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);
This would match any string consisting of letters, digits and _
See here for more details, and here on regular-expressions.info for an overview of the Unicode scripts, properties and blocks.
See here for a famous answer from tchrist about the caveats of regex in Java, including an update on what has changed with Java 7 (or will change in Java 8).
If you want to detect whether you have a multi-byte string, you can compare the lengths:
if (text.length() != text.getBytes(encoding).length)
This will detect that a multi-byte character has been used for any encoding.
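A minimal runnable sketch of that check, assuming UTF-8 as the encoding (the method name is just for illustration; assumes java.nio.charset.StandardCharsets is imported):
static boolean containsMultiByteChar(String text) {
    // In UTF-8, every code point above U+007F encodes to more than one byte,
    // so the byte count exceeds the number of chars.
    return text.getBytes(StandardCharsets.UTF_8).length != text.length();
}
// containsMultiByteChar("hello")         -> false
// containsMultiByteChar("コメント_1050") -> true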
Essentially, Java regular expressions work on Strings, not arrays of bytes - characters are represented as abstract "character" entities, not as bytes in some specific encoding. This is not completely true since the char type only contains characters from the Basic Multilingual Plane and Unicode chars from outside this range are represented as two char values each, but nonetheless "multibyte" is relative and depends on the encoding.
If what you need is "multibyte in UTF-8", then note that only characters with values 0-127 are single-byte in this encoding. So, the easiest way to check would be to use a loop and check each character - if it's greater than 127, it's more than one byte in UTF-8.
If you insist on using a regex, you could probably use the character range operator in the regex like this: [\u0080-\uFFFF] (haven't checked and \uFFFF is not really a character but I think the regex engine should accept it).
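A small sketch of that range check, assuming the goal is "does the string contain any non-ASCII character":
Pattern nonAscii = Pattern.compile("[\u0080-\uFFFF]");
System.out.println(nonAscii.matcher("hello").find());          // false: all characters are ASCII
System.out.println(nonAscii.matcher("コメント_1050").find());  // true: contains characters above U+007F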
You will need to use Unicode for elements which are not in the English language. This link should provide you with some information.
There is a nice introduction to Unicode regex here.
I got into an interesting discussion in a forum where we discussed the naming of variables.
Conventions aside, I noticed that it is legal for a variable to have the name of a Unicode character, for example the following is legal:
int \u1234;
However, if I for example gave it the name #, it produces an error. According to Sun's tutorial it is valid if "beginning with a letter, the dollar sign "$", or the underscore character "_"."
But U+1234 is some Ethiopic character. So what is really defined as a "letter"?
The Unicode standard defines what counts as a letter.
From the Java Language Specification, section 3.8:
Letters and digits may be drawn from the entire Unicode character set, which supports most writing scripts in use in the world today, including the large sets for Chinese, Japanese, and Korean. This allows programmers to use identifiers in their programs that are written in their native languages.
A "Java letter" is a character for which the method Character.isJavaIdentifierStart(int) returns true. A "Java letter-or-digit" is a character for which the method Character.isJavaIdentifierPart(int) returns true.
From the Character documentation for isJavaIdentifierPart:
Determines if the character (Unicode code point) may be part of a Java identifier as other than the first character.
A character may be part of a Java identifier if any of the following are true:
it is a letter
it is a currency symbol (such as '$')
it is a connecting punctuation character (such as '_')
it is a digit
it is a numeric letter (such as a Roman numeral character)
it is a combining mark
it is a non-spacing mark
isIdentifierIgnorable(codePoint) returns true for the character
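A small sketch of checking these rules from code, using the U+1234 character from the question:
System.out.println(Character.isLetter('\u1234'));               // true: an Ethiopic letter
System.out.println(Character.isJavaIdentifierStart('\u1234'));  // true: so int \u1234; compiles
System.out.println(Character.isJavaIdentifierStart('#'));       // false: # cannot start an identifier
System.out.println(Character.isJavaIdentifierPart('#'));        // false: nor appear later in one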
Unicode characters fall into character classes. There's a set of Unicode characters which fall into the class "letter".
Determined by Character.isLetter(c) for Java. But for identifiers, Character.isJavaIdentifierStart(c) and Character.isJavaIdentifierPart(c) are more relevant.
For the relevant Unicode spec, see this.