Regular expression for Multi Bytes string

Regular expression for Multi Bytes string - java

What could be the regular expression to detect a multi byte string.
For example here is the expression to detect a string in english
Pattern p=Pattern.compile("[a-zA-Z/]");
Similarly I want a pattern which has multi bytes like
コメント_1050_固-減価償却費

You may want to have a look at Unicode Support in Java
I think basically you want the Unicode property \p{L}. This would match any code point that has the property "letter".
So your regex could look like this
Pattern p=Pattern.compile("[\\p{L}/]");
I just replaced the character ranges a-zA-Z with \p{L}
Since Java 7 you could also use Pattern.UNICODE_CHARACTER_CLASS
Enables the Unicode version of Predefined character classes and POSIX character classes.
That would turn the predefined \w into the Unicode version, means it would match all Unicode letters and digits (and string connecting characters like _)
So to match your string コメント_1050_固-減価償却費, you could use
Pattern p=Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);
This would match any string consisting of letters, digits and _
See here for more details
and here on regular-expression.info an overview over the Unicode scripts, properties and blocks.
See here a famous answer from tchrist about the caveats of regex in Java, including an updated what has changed with Java 7 (or will be in Java 8)

If you want to detect whether you have a multi-byte strings you cna look at the length
if (text.length() != text.getBytes(encoding).length)
This will detect that a multi-byte character has been used for any encoding.

Essentially, Java regular expressions work on Strings, not arrays of bytes - characters are represented as abstract "character" entities, not as bytes in some specific encoding. This is not completely true since the char type only contains characters from the Basic Multilingual Plane and Unicode chars from outside this range are represented as two char values each, but nonetheless "multibyte" is relative and depends on the encoding.
If what you need is "multibyte in UTF-8", then note that only characters with values 0-127 are single-byte in this encoding. So, the easiest way to check would be to use a loop and check each character - if it's greater than 127, it's more than one byte in UTF-8.
If you insist on using a regex, you could probably use the character range operator in the regex like this: [\u0080-\uFFFF] (haven't checked and \uFFFF is not really a character but I think the regex engine should accept it).

You will need to use Unicode for elements which are not in the English language. This link should provide you with some information.

There is a nice introduction to UniCode regex here.

Related

Java regex matches diacritics for the Latin corresponding characters

I am trying to use a regex such as [ăâîșțĂÂÎȘȚ] to match for Romanian alphabet diacritics(ISO 8859-16/Windows-1250). The problem is that the regex would also match the regex for a,i,s,t,A,I,S,T(the Latin alphabet corresponding characters for the above mentioned diacritics) and I don't want this.
I didn't try to compare strings character by character because of performance time.
Is there anyway I can make the regex match exactly for these characters?

If your regex exists as literal rendered text, it has already been combined
and should exist as a different code point.
000074 t LATIN SMALL LETTER T
+
000326 ̦ COMBINING COMMA BELOW
=
00021B ț LATIN SMALL LETTER T WITH COMMA BELOW
Just incase, you should use a hex codepoint to represent them ie. u\021B
Is it possible the Java engine could be stripping the combining character off of the regex?
Where x21B becomes x74? Might be that.
Meanwhile if you expect the letters in the source are not rendered, you could
use a regex like \p{Script=Latin}\p{Block=Combining_Diacritical_Marks}
to get those.
updated info :
While searching around for a defacto solution, I came across this Java info
from http://www.regular-expressions.info/unicode.html.
In Java, the regex token \uFFFF only matches the specified
code point, even when you turned on canonical equivalence.
However, the same syntax \uFFFF is also used to insert
Unicode characters into literal strings in the Java source
code. Pattern.compile("\u00E0") will match both the
single-code-point and double-code-point encodings of à,
while Pattern.compile("\u00E0") matches only the
single-code-point version. Remember that when writing a
regex as a Java string literal, backslashes must be escaped.
The former Java code compiles the regex à, while the latter
compiles \u00E0. Depending on what you're doing, the
difference may be significant.
So, by entering a duality literal inside a class, it looks like Pattern.compile("[à]")
will actually match
000061 a LATIN SMALL LETTER A
or
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
This smacks of the same problem when putting surrogate pairs inside classes.
There is a solution.
Avoid entering those literals inside of a class.
Instead, put them as a series of alternations
(?:à|_|_|_)
Doing this forces it to match either
000061 a LATIN SMALL LETTER A
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
It won't match a independent of the grave like you see now.
Note - If you just use a "[\\u00E0]" you'd miss the a + grave.
which is valid.

I believe this is happening because those characters are being treated as two Unicode code points. I would recommend trying to specifically match the code points using syntax like \uFFFF where FFFF is the code point. The exact syntax will depend on the regex implementation you are using.
Keep in mind that Unicode characters can be encoded as single code points or as multiple, so you'll want to account for that. Example: à encoded as U+0061 U+0300 and also U+00E0.
I hope this helps!

As already mentioned in Unicode one has two alternatives.
'\u0061' 'a' LATIN SMALL LETTER A
'\u0300' ̀ COMBINING GRAVE ACCENT
or
'\u00E0' 'à' LATIN SMALL LETTER A WITH GRAVE
There is a Normalizer that can "normalize" to either form (and deal with ligatures):
String regex = "(?u)[ăâîșțĂÂÎȘȚ]";
regex = Normalizer.normalize(regex, Form.NFC); // Composed form
Pattern pattern = Pattern.compile(regex);
Using "(?u)" or a flag with Pattern.compile with UNICODE flag might already solve the problem. But using the Unicode variant without separate latin ('a') will certainly do.
The normalizer should especially be applied on the searched-through string.

why is java accepting unicode outside the " " and ' ' also?

This line compiles fine
Thread t = \u006E\u0065\u0077\u0020\u0054\u0068\u0072\u0065\u0061\u0064\u0028\u0029\u003B
this is the unicode for the text new Thread();
my question is what is the need for accepting unicode characters outside the " " or ' '. we can use unicodes in string literals and character literals. but what is the need for it to be accepted in the actual code itself?

The reason why this works is that the Unicode escape sequence isn't handled by the grammar or the string parsing code but the tokenizer. So the Java grammar never "sees" those escape sequences, it gets a Unicode string.
Which has unfortunate side effects like this code doesn't compile:
// C:\user\...
For most of us, it's a comment. For the tokenizer, it's the illegal unicode sequence ser\.
The reason to do it this way is that you can now use any Unicode character anywhere in the Java source code - Java identifiers are not limited to ASCII!
But the tools to edit Java might not be as good. In 1994, it was pretty hard to find a text editor capable of Unicode. Also, code generators often work better if you stay with ASCII.

JLS specified it
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.

This works because the Java Language Specification requires this. See 3.3. Unicode Escapes:
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.
The reason is simple: Java allows full unicode support (even for identifiers!), but sometimes it is not practical to use actual unicode for your source files, in that case you can use escapes.
This also means that unicode escapes are not an artifact of strings in Java, but actually of the compiler: if you have a String (or char) with unicode escapes it will translated at compiletime to the actual character, not at runtime!
The section 3.2. Lexical Translations is also relevant:
A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
A translation of the Unicode stream resulting from step 1 [...]

If the source code is not UTF-8 this feature makes it possible to use Unicode characters in the source code otherwise not available

Matching (e.g.) a Unicode letter with Java regexps

There are many questions and answers here on StackOverflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However with Unicode there are many more characters that most people would regard as a letter (all the Greek letters, Cyrllic .. and many more. Unicode defines many blocks each of which may have "letters".
The Java definition defines Posix classes for things like alpha characters, but that is specified to only work with US-ASCII. The predefined character classes define words to consist of [a-zA-Z_0-9], which also excludes many letters.
So how do you properly match against Unicode strings? Is there some other library that gets this right?

Here you have a very nice explanation:
http://www.regular-expressions.info/unicode.html
Some hints:
Java and .NET unfortunately do not support \X (yet). Use \P{M}\p{M}* as a substitute. To match any number of graphemes, use (?:\P{M}\p{M}*)+ instead of \X+.
In Java, the regex token \uFFFF only matches the specified code point, even when you turned on canonical equivalence. However, the same syntax \uFFFF is also used to insert Unicode characters into literal strings in the Java source code. Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of à, while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you're doing, the difference may be significant.

Are you talking about Unicode categories, like letters? These are matched by a regex of the form \p{CAT}, where "CAT" is the category code like L for any letter, or a subcategory like Lu for uppercase or Lt for title-case.

Quoting from the JavaDoc of java.util.regex.Pattern.
Unicode support
This class is in conformance with
Level 1 of Unicode Technical Standard #18: Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.
Unicode escape sequences such as
\u2014 in Java source code are
processed as described in §3.3 of the
Java Language Specification. Such
escape sequences are also implemented
directly by the regular-expression
parser so that Unicode escapes can be
used in expressions that are read from
files or from the keyboard. Thus the
strings "\u2014" and "\\u2014", while
not equal, compile into the same
pattern, which matches the character
with hexadecimal value 0x2014.
Unicode blocks and categories are
written with the \p and \P constructs
as in Perl. \p{prop} matches if the
input has the property prop, while
\P{prop} does not match if the input
has that property. Blocks are
specified with the prefix In, as in
InMongolian. Categories may be
specified with the optional prefix Is:
Both \p{L} and \p{IsL} denote the
category of Unicode letters. Blocks
and categories can be used both inside
and outside of a character class.
The supported categories are those of
The Unicode Standard in the version
specified by the Character class. The
category names are those defined in
the Standard, both normative and
informative. The block names supported
by Pattern are the valid block names
accepted and defined by
UnicodeBlock.forName.

What is a regular expression for control characters?

I'm trying to match a control character in the form \^c where c is any valid character for control characters. I have this regular expression, but it's not currently working: \\[^][#-z]
I think the problem lies with the fact that the caret character (^) is part of the regular expressions parsing engine.

Match an ASCII text string of the form ^X using the pattern \^., nothing more. Match an ASCII text string of the form \^X with the pattern \\\^.. You may wish to constrain that dot to [?#_\[\]^\\], so \\\^[A-Z?#_\[\]^\\]. It’s easier to read as [?\x40-\x5F] for the bracketed character class, hence \\\^[?\x40-\x5F] for a literal BACKSLASH, followed by a literal CIRCUMFLEX, followed by something that turns into one of the valid control characters.
Note that that is the result of printing out the pattern, or what you’d read from a file. It’s what you need to pass to the regex compiler. If you have it as a string literal, you must of course double each of those backslashes. `\\\\\\^[?\\x40-\\x5F]" Yes, it is insane looking, but that is because Java does not support regexes directly as Groovy and Scala — or Perl and Ruby — do. Regex work is always easier without the extra bbaacckksslllllaasshheesssssess. :)
If you had real control characters instead of indirect representations of them, you would use \pC for all literal code points with the property GC=Other, or \p{Cc} for just GC=Control.

Check this out: http://www.regular-expressions.info/characters.html . You should be able to use \cA to \cZ to find the control characters..

Matching Unicode Dashes in Java Regular Expressions?

I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:
private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");
which, if I'm reading the Pattern documentation correctly, should capture any of the unicode dashes or the ascii dash, when surrounded on both sides by whitespace. I'm using the pattern as follows:
String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);
No joy. For the sample input below, the dash is not detected, and
titleSegmentSeparator.matcher(sectionTitle).find() returns false!
In order to make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows -- each character is followed by the output of (int)char, which should be its' unicode code point, no?
Sample input:
Study Summary (1 of 10) – Competition
S(83)t(116)u(117)d(100)y(121)
(32)S(83)u(117)m(109)m(109)a(97)r(114)y(121)
(32)((40)1(49) (32)o(111)f(102)
(32)1(49)0(48))(41) (32)–(8211)
(32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)
It looks to me like that dash is codepoint 8211, which should be matched by the regex, but it isn't! What's going on here?

You're mixing decimal (8211) and hexadecimal (0x8211).
\x and \u both expect a hexadecimal number, therefore you'd need to use \u2014 to match the em-dash, not \u8211 (and \x2D for the normal hyphen etc.).
But why not simply use the Unicode property "Dash punctuation"?
As a Java string: "\\s\\p{Pd}\\s"

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.