Matching Unicode Dashes in Java Regular Expressions?

Matching Unicode Dashes in Java Regular Expressions? - java

I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:
private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");
which, if I'm reading the Pattern documentation correctly, should capture any of the unicode dashes or the ascii dash, when surrounded on both sides by whitespace. I'm using the pattern as follows:
String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);
No joy. For the sample input below, the dash is not detected, and
titleSegmentSeparator.matcher(sectionTitle).find() returns false!
In order to make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows -- each character is followed by the output of (int)char, which should be its' unicode code point, no?
Sample input:
Study Summary (1 of 10) – Competition
S(83)t(116)u(117)d(100)y(121)
(32)S(83)u(117)m(109)m(109)a(97)r(114)y(121)
(32)((40)1(49) (32)o(111)f(102)
(32)1(49)0(48))(41) (32)–(8211)
(32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)
It looks to me like that dash is codepoint 8211, which should be matched by the regex, but it isn't! What's going on here?

You're mixing decimal (8211) and hexadecimal (0x8211).
\x and \u both expect a hexadecimal number, therefore you'd need to use \u2014 to match the em-dash, not \u8211 (and \x2D for the normal hyphen etc.).
But why not simply use the Unicode property "Dash punctuation"?
As a Java string: "\\s\\p{Pd}\\s"

Related

Java regex escaped characters

When matching certain characters (such as line feed), you can use the regex "\\n" or indeed just "\n". For example, the following splits a string into an array of lines:
String[] lines = allContent.split("\\r?\\n");
But the following works just as well:
String[] lines = allContent.split("\r?\n");
My question:
Do the above two work in exactly the same way, or is there any subtle difference? If the latter, can you give an example case where you get different results?
Or is there a difference only in [possible/theoretical] performance?

There is no difference in the current scenario. The usual string escape sequences are formed with the help of a single backslash and then a valid escape char ("\n", "\r", etc.) and regex escape sequences are formed with the help of a literal backslash (that is, a double backslash in the Java string literal) and a valid regex escape char ("\\n", "\\d", etc.).
"\n" (an escape sequence) is a literal LF (newline) and "\\n" is a regex escape sequence that matches an LF symbol.
"\r" (an escape sequence) is a literal CR (carriage return) and "\\r" is a regex escape sequence that matches an CR symbol.
"\t" (an escape sequence) is a literal tab symbol and "\\t" is a regex escape sequence that matches a tab symbol.
See the list in the Java regex docs for the supported list of regex escapes.
However, if you use a Pattern.COMMENTS flag (used to introduce comments and format a pattern nicely, making the regex engine ignore all unescaped whitespace in the pattern), you will need to either use "\\n" or "\\\n" to define a newline (LF) in the Java string literal and "\\r" or "\\\r" to define a carriage return (CR).
See a Java test:
String s = "\n";
System.out.println(s.replaceAll("\n", "LF")); // => LF
System.out.println(s.replaceAll("\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\n", "<LF>"));
// => <LF>
//<LF>
Why is the last one producing <LF>+newline+<LF>? Because "(?x)\n" is equal to "", an empty pattern, and it matches an empty space before the newline and after it.

Yes there are different. The Java Compiler has different behavior for Unicode Escapes in the Java Book The Java Language Specification section 3.3;
The Java programming language specifies a standard way of transforming
a program written in Unicode into ASCII that changes a program into a
form that can be processed by ASCII-based tools. The transformation
involves converting any Unicode escapes in the source text of the
program to ASCII by adding an extra u - for example, \uxxxx becomes
\uuxxxx - while simultaneously converting non- ASCII characters in the
source text to Unicode escapes containing a single u each.
So how this affect the /n vs //n in the Java Doc:
It is therefore necessary to double backslashes in string literals
that represent regular expressions to protect them from interpretation
by the Java bytecode compiler.
An a example of the same doc:
The string literal "\b", for example, matches a single backspace
character when interpreted as a regular expression, while "\b"
matches a word boundary. The string literal "(hello)" is illegal and
leads to a compile-time error; in order to match the string (hello)
the string literal "\(hello\)" must be used.

Java regex matches diacritics for the Latin corresponding characters

I am trying to use a regex such as [ăâîșțĂÂÎȘȚ] to match for Romanian alphabet diacritics(ISO 8859-16/Windows-1250). The problem is that the regex would also match the regex for a,i,s,t,A,I,S,T(the Latin alphabet corresponding characters for the above mentioned diacritics) and I don't want this.
I didn't try to compare strings character by character because of performance time.
Is there anyway I can make the regex match exactly for these characters?

If your regex exists as literal rendered text, it has already been combined
and should exist as a different code point.
000074 t LATIN SMALL LETTER T
+
000326 ̦ COMBINING COMMA BELOW
=
00021B ț LATIN SMALL LETTER T WITH COMMA BELOW
Just incase, you should use a hex codepoint to represent them ie. u\021B
Is it possible the Java engine could be stripping the combining character off of the regex?
Where x21B becomes x74? Might be that.
Meanwhile if you expect the letters in the source are not rendered, you could
use a regex like \p{Script=Latin}\p{Block=Combining_Diacritical_Marks}
to get those.
updated info :
While searching around for a defacto solution, I came across this Java info
from http://www.regular-expressions.info/unicode.html.
In Java, the regex token \uFFFF only matches the specified
code point, even when you turned on canonical equivalence.
However, the same syntax \uFFFF is also used to insert
Unicode characters into literal strings in the Java source
code. Pattern.compile("\u00E0") will match both the
single-code-point and double-code-point encodings of à,
while Pattern.compile("\u00E0") matches only the
single-code-point version. Remember that when writing a
regex as a Java string literal, backslashes must be escaped.
The former Java code compiles the regex à, while the latter
compiles \u00E0. Depending on what you're doing, the
difference may be significant.
So, by entering a duality literal inside a class, it looks like Pattern.compile("[à]")
will actually match
000061 a LATIN SMALL LETTER A
or
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
This smacks of the same problem when putting surrogate pairs inside classes.
There is a solution.
Avoid entering those literals inside of a class.
Instead, put them as a series of alternations
(?:à|_|_|_)
Doing this forces it to match either
000061 a LATIN SMALL LETTER A
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
It won't match a independent of the grave like you see now.
Note - If you just use a "[\\u00E0]" you'd miss the a + grave.
which is valid.

I believe this is happening because those characters are being treated as two Unicode code points. I would recommend trying to specifically match the code points using syntax like \uFFFF where FFFF is the code point. The exact syntax will depend on the regex implementation you are using.
Keep in mind that Unicode characters can be encoded as single code points or as multiple, so you'll want to account for that. Example: à encoded as U+0061 U+0300 and also U+00E0.
I hope this helps!

As already mentioned in Unicode one has two alternatives.
'\u0061' 'a' LATIN SMALL LETTER A
'\u0300' ̀ COMBINING GRAVE ACCENT
or
'\u00E0' 'à' LATIN SMALL LETTER A WITH GRAVE
There is a Normalizer that can "normalize" to either form (and deal with ligatures):
String regex = "(?u)[ăâîșțĂÂÎȘȚ]";
regex = Normalizer.normalize(regex, Form.NFC); // Composed form
Pattern pattern = Pattern.compile(regex);
Using "(?u)" or a flag with Pattern.compile with UNICODE flag might already solve the problem. But using the Unicode variant without separate latin ('a') will certainly do.
The normalizer should especially be applied on the searched-through string.

Regex: Illegal hexadecimal escape sequence

I am new to Regex..I wrote the following regex to check phone numbers in javascript: ^[0-9\+\-\s\(\)\[\]\x]*$
Now, I try to the same thing in java using the following code:
public class testRegex {
public static void main(String[] args){
String regex="^[0-9\\+\\-\\s\\(\\)\\[\\]\\x]*$";
String phone="98650056";
System.out.println(phone.matches(regex));
}
However, I always get the following error:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal hexadecimal escape sequence near index 21^[0-9\+\-\s\\(\\)\\[\\]\x]*$
Please advise.

Since you are trying to match what I assume is x (as in a phone extension), it needs to be escaped with four backslashes, or not escaped at all; otherwise \x is interpreted as a hexidecimal escape code. Because \x is interpreted as a hex code without the two to four additional required chars it's an error.
[\\x] \x{nn} or {nnnn} (hex code nn to nnnn)
[\\\\x] x (escaped)
[x] x
So the pattern becomes:
String regex="^[-0-9+()\\s\\[\\]x]*$";
Escaped Alternatives:
String regex="^[0-9\\+\\-\\s\\(\\)\\[\\]x]*$";
String regex="^[0-9\\+\\-\\s\\(\\)\\[\\]\\\\x]*$";

You have waaaay too many back slashes!
Firstly, to code a literal backslash in java, you must write two of them.
Secondly, most characters lose their special regex meaning when in a character class.
Thirdly, \x introduces a hex literal - you don't want that.
Write your regex like this:
String regex="^[0-9+\\s()\\[\\]x-]*$";
Note how you don't need to escape the hyphen in a character class when it appears first or last.

unicode regex pattern not working

I am trying to match some unicode charaters sequence:
Pattern pattern = Pattern.compile("\\u05[dDeE][0-9a-fA-F]{2,}");
String text = "\\n \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n <\\/span>\\n<br style=\\";
Matcher match = pattern.matcher(text);
but doing so gives this exception:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 4
\u05[dDeE][0-9a-fA-F]+
^
how can I use still use regex with some regex chars (like "[") to match unicode?
EDIT:
I'm trying to parse some text. The text somewhere has a sequence of Unicode characters, which I know their code range.
Edit2:
I am now using ranges instead : [\\u05d0-\\u05ea]{2,} but still can't match the text above
Edit3:
ok, now it's working, the problem was I used two backslashes instead of one, both in the regex and text.
The solution for this is, assuming I know there will be two chars or more:
[\u05d0-\u05ea]{2,}

Here is what causing the exception:
\\u05[dDeE][0-9a-fA-F]}{2,}
^^^^
The java regular expression parser thinks you are trying to match a Unicode code point using the escape sequence \uNNNN so it is giving an exception, because \u requires four hexadecimal digits after it and there is only two of them, namely 05 so you need to change it to \\u0005 if that is what you actually want.
On the other hand, if you want to match \\u in the target string, then you need to quad escape each backslash \ like this \\\\ so to match \\u you need \\\\\\\\u.
\\\\\\\\u05[dDeE][0-9a-fA-F]}{2,}
Finally, if you want to match those Unicode code points literally in your target string then you need to modify our last expression a bit like this:
(?:\\\\\\\\u05[dDeE][0-9a-fA-F]){2,}
Edit: Since there is only one backslash in your target string then your regular expression should be:
(?:\\\\u05[dDeE][0-9a-fA-F]){2,}
This will match \u05db\u05d3\u05d5\u05e8\u05d2\u05dc in your string
<\/span><\/span><span dir=\"rtl\">\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n <\/span>\n<br style=\"clear : both; font-size : 1px;\">\n<\/div>"}, 200, null, null);
Edit 2: If you want to match literal \u05db\u05d3\u05d5\u05e8\u05d2\u05dc then you can't use a range.
On the other hand, if you want to match Unicode code points between 05d0 and 05df then you can use:
(?:[\\u05d0\\u05df]){2,}

It's not clear what you're trying to do. If your goal is to simplify matching a range of Unicode characters, then you need to realize that the hex digits are completely case insensitive, and so your a-fA-F is redundant, even if you could split character literals. Try this to match all Unicode characters in the range:
[\\u05d0-\\u0eff]

Looks like you have unnecessary \\ in your input string. Following works by replacing your specified unicode character range in regex:
String text = "\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n </span>\n<br style=\\";
System.out.println(text.replaceAll("[\u05d0-\u05ea]{2,}", "###"));
OUTPUT:
###
</span>
Note that in our input text you had \\n and \\u05db etc that I have fixed.

Regular expression for Multi Bytes string

What could be the regular expression to detect a multi byte string.
For example here is the expression to detect a string in english
Pattern p=Pattern.compile("[a-zA-Z/]");
Similarly I want a pattern which has multi bytes like
コメント_1050_固-減価償却費

You may want to have a look at Unicode Support in Java
I think basically you want the Unicode property \p{L}. This would match any code point that has the property "letter".
So your regex could look like this
Pattern p=Pattern.compile("[\\p{L}/]");
I just replaced the character ranges a-zA-Z with \p{L}
Since Java 7 you could also use Pattern.UNICODE_CHARACTER_CLASS
Enables the Unicode version of Predefined character classes and POSIX character classes.
That would turn the predefined \w into the Unicode version, means it would match all Unicode letters and digits (and string connecting characters like _)
So to match your string コメント_1050_固-減価償却費, you could use
Pattern p=Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);
This would match any string consisting of letters, digits and _
See here for more details
and here on regular-expression.info an overview over the Unicode scripts, properties and blocks.
See here a famous answer from tchrist about the caveats of regex in Java, including an updated what has changed with Java 7 (or will be in Java 8)

If you want to detect whether you have a multi-byte strings you cna look at the length
if (text.length() != text.getBytes(encoding).length)
This will detect that a multi-byte character has been used for any encoding.

Essentially, Java regular expressions work on Strings, not arrays of bytes - characters are represented as abstract "character" entities, not as bytes in some specific encoding. This is not completely true since the char type only contains characters from the Basic Multilingual Plane and Unicode chars from outside this range are represented as two char values each, but nonetheless "multibyte" is relative and depends on the encoding.
If what you need is "multibyte in UTF-8", then note that only characters with values 0-127 are single-byte in this encoding. So, the easiest way to check would be to use a loop and check each character - if it's greater than 127, it's more than one byte in UTF-8.
If you insist on using a regex, you could probably use the character range operator in the regex like this: [\u0080-\uFFFF] (haven't checked and \uFFFF is not really a character but I think the regex engine should accept it).

You will need to use Unicode for elements which are not in the English language. This link should provide you with some information.

There is a nice introduction to UniCode regex here.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.