Regex: Illegal hexadecimal escape sequence - java

I am new to regex. I wrote the following regex to check phone numbers in JavaScript: ^[0-9\+\-\s\(\)\[\]\x]*$
Now I am trying to do the same thing in Java using the following code:
public class testRegex {
public static void main(String[] args){
String regex="^[0-9\\+\\-\\s\\(\\)\\[\\]\\x]*$";
String phone="98650056";
System.out.println(phone.matches(regex));
}
}
However, I always get the following error:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal hexadecimal escape sequence near index 21
^[0-9\+\-\s\(\)\[\]\x]*$
Please advise.

Since you are trying to match what I assume is a literal x (as in a phone extension), the simplest fix is not to escape it at all. In a Java string literal, "\\x" becomes the regex \x, which starts a hexadecimal escape. It must be followed by two hex digits (\xhh) or a braced code point (\x{h...h}); here it is followed by ], so it's an error. Four backslashes also compile, but that class then matches a literal backslash as well as x.
[\\x]     \x starts a hex escape (\xhh or \x{h...h}); an error here
[\\\\x]   a literal backslash or x
[x]       x
So the pattern becomes:
String regex="^[-0-9+()\\s\\[\\]x]*$";
Escaped Alternatives:
String regex="^[0-9\\+\\-\\s\\(\\)\\[\\]x]*$";
String regex="^[0-9\\+\\-\\s\\(\\)\\[\\]\\\\x]*$";

You have waaaay too many back slashes!
Firstly, to code a literal backslash in a Java string literal, you must write two of them.
Secondly, most characters lose their special regex meaning when in a character class.
Thirdly, \x introduces a hex literal - you don't want that.
Write your regex like this:
String regex="^[0-9+\\s()\\[\\]x-]*$";
Note how you don't need to escape the hyphen in a character class when it appears first or last.
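As a rough check of the pattern above, here is a minimal sketch. The class and helper names (PhoneRegexDemo, looksLikePhone, compiles) are made up for illustration, and every sample input except 98650056 is invented. It also confirms that the pattern from the question is rejected at compile time of the regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; the names are illustrative only.
class PhoneRegexDemo {
    // Digits, +, whitespace, parentheses, square brackets, x, and a trailing
    // (therefore literal) hyphen. Only the whitespace class and the square
    // brackets need escaping inside the character class.
    static final Pattern PHONE = Pattern.compile("^[0-9+\\s()\\[\\]x-]*$");

    static boolean looksLikePhone(String s) {
        return PHONE.matcher(s).matches();
    }

    // Returns false when the regex is rejected with a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikePhone("98650056"));              // true
        System.out.println(looksLikePhone("+1 (555) 010-9999 x12")); // true
        System.out.println(looksLikePhone("555-ABCD"));              // false
        // The pattern from the question, which ends its class with an escaped x.
        System.out.println(compiles("^[0-9\\+\\-\\s\\(\\)\\[\\]\\x]*$")); // false
    }
}
```

Note that because of the * quantifier, the empty string is accepted too; use + instead if at least one character is required.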

Related

Pattern Syntax Exception in regex constructed with inputted text

I am parsing a .txt file line by line, looking for a target token, using Java's regex engine.
I match each line against:
"(^|.*[\\s])"+token+"([\\s].*|$)"
where token is a string. When:
token="6-7(3-7"
it raises the following exception:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Unclosed group near index 27
(^|.*[\s])6-7(3-7([\s].*|$)
How can I solve this?
You have special characters in your token.
Have a look at Pattern.quote():
public static String quote(String s)
Returns a literal pattern String for the specified String.
This method produces a String that can be used to create a Pattern that would match the string s as if it were a literal pattern.
Metacharacters or escape sequences in the input sequence will be given no special meaning.
This should do the trick for you:
String pattern = "(^|.*[\\s])" + Pattern.quote(token) + "([\\s].*|$)";
No need for doing the string magic yourself! :-)
You should make sure to escape special characters in any plain-text string you use to make regex patterns. Replace "(" with "\(", and similarly for bare backslashes (before any other steps), periods, and all other special characters, at least all those you expect to see in the input. (If it's arbitrary input from users, assume every character will be included.)
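To illustrate the Pattern.quote() suggestion, here is a small sketch using the token from the question. The class name QuoteTokenDemo, the helper names, and the sample lines are assumptions made for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class QuoteTokenDemo {
    // Builds the line-matching pattern from the question, with the token
    // quoted so that characters such as ( are matched literally.
    static Pattern forToken(String token) {
        return Pattern.compile("(^|.*[\\s])" + Pattern.quote(token) + "([\\s].*|$)");
    }

    static boolean lineContainsToken(String line, String token) {
        return forToken(token).matcher(line).matches();
    }

    public static void main(String[] args) {
        String token = "6-7(3-7";
        System.out.println(Pattern.quote(token));                            // \Q6-7(3-7\E
        System.out.println(lineContainsToken("score 6-7(3-7 final", token)); // true
        System.out.println(lineContainsToken("6-7(3-7", token));             // true
        System.out.println(lineContainsToken("score 6-7 final", token));     // false
    }
}
```

Quoting the whole token is usually safer than escaping individual characters by hand, since nothing has to be enumerated.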

Java Regex Escape Characters

I'm learning Regex, and running into trouble in the implementation.
I found the RegexTestHarness on the Java Tutorials, and running it, the following string correctly identifies my pattern:
[\d|\s][\d]\.
(My pattern is any double digit, or any single digit preceded by a space, followed by a period.)
That string is obtained by this line in the code:
Pattern pattern =
Pattern.compile(console.readLine("%nEnter your regex: "));
When I try to write a simple class in Eclipse, it tells me the escape sequences are invalid, and won't compile unless I change the string to:
[\\d|\\s][\\d]\\.
In my class I'm using Pattern pattern = Pattern.compile();
When I put this string back into the TestHarness it doesn't find the correct matches.
Can someone tell me which one is correct? Is the difference in some formatting from console.readLine()?
\ is a special character in string literals "...". It is used to escape other special characters, or to create characters like \n, \r and \t.
To create a \ character in a string literal that can be passed to the regex engine, you need to escape it by adding another \ before it (just as you do in regex when you need to escape metacharacters such as the dot, \.). So a String representing \ will look like "\\".
This problem doesn't exist when you are reading data from the user, because the input is not a string literal; even if the user types \n in the console, it will be read as two characters, \ and n.
Also, there is no point in adding | inside a character class [...] unless your intention is to make that class also match the | character. Remember that [abc] is the same as (a|b|c), so there is no need for | in "[\\d|\\s]".
If you want to represent a backslash in a Java string literal you need to escape it with another backslash, so the string literal "\\s" is two characters, \ and s. This means that to represent the regular expression [\d\s][\d]\. in a Java string literal you would use "[\\d\\s][\\d]\\.".
Note that I also made a slight modification to your regular expression: [\d|\s] will match a digit, whitespace, or the literal | character. You just want [\d\s]. A character class already means "match one of these"; since you don't need | for alternation within a character class, it loses its special meaning there.
"My pattern is any double digit or single digit preceded by a space, followed by a period."
Correct regex will be:
Pattern pattern = Pattern.compile("(\\s\\d|\\d{2})\\.");
Also, if you're getting a literal string (rather than a regex) from user input, then you should call:
Pattern.quote(userInput);
to escape all the regex special characters in it.
You also need double escaping because the first escape is handled by the Java compiler when it reads the string literal, and the second one is passed on to the regex engine.
What is happening is that escape sequences are being evaluated twice: once for Java, and then once for your regex.
The result is that you need to escape the escape character when you use a regex escape sequence.
For instance, if you needed a digit, you'd use
"\\d"
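Here is a short sketch that exercises the corrected pattern (\\s\\d|\\d{2})\\. suggested above, and shows the effect of a stray | inside a character class. The class name NumberedItemDemo and the sample inputs are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class NumberedItemDemo {
    // A whitespace character plus one digit, or two digits, then a literal dot.
    static final Pattern NUMBERED = Pattern.compile("(\\s\\d|\\d{2})\\.");

    // Returns the first match in the input, or null when there is none.
    static String firstMatch(String input) {
        Matcher m = NUMBERED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("step 7. go"));   // " 7." (leading space included)
        System.out.println(firstMatch("step 12. go"));  // "12."
        System.out.println(firstMatch("v7.1"));         // null
        // Inside a character class, | is just another character to match.
        System.out.println(Pattern.matches("[\\d|\\s]", "|")); // true
        System.out.println(Pattern.matches("[\\d\\s]", "|"));  // false
    }
}
```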

unicode regex pattern not working

I am trying to match a sequence of Unicode characters:
Pattern pattern = Pattern.compile("\\u05[dDeE][0-9a-fA-F]{2,}");
String text = "\\n \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n <\\/span>\\n<br style=\\";
Matcher match = pattern.matcher(text);
but doing so gives this exception:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 4
\u05[dDeE][0-9a-fA-F]+
^
How can I still use regex with some regex chars (like "[") to match Unicode?
EDIT:
I'm trying to parse some text. The text somewhere has a sequence of Unicode characters, which I know their code range.
Edit2:
I am now using ranges instead : [\\u05d0-\\u05ea]{2,} but still can't match the text above
Edit3:
ok, now it's working, the problem was I used two backslashes instead of one, both in the regex and text.
The solution for this is, assuming I know there will be two chars or more:
[\u05d0-\u05ea]{2,}
Here is what is causing the exception:
\\u05[dDeE][0-9a-fA-F]{2,}
^^^^
The Java regular expression parser thinks you are trying to match a Unicode code point using the escape sequence \uNNNN, and it throws an exception because \u requires four hexadecimal digits after it and there are only two of them, namely 05. You would need to change it to \\u0005 if that is what you actually want.
On the other hand, if you want to match \\u in the target string, then you need to quadruple each backslash \ like this \\\\, so to match \\u you need \\\\\\\\u:
\\\\\\\\u05[dDeE][0-9a-fA-F]{2,}
Finally, if you want to match those Unicode code points literally in your target string then you need to modify our last expression a bit like this:
(?:\\\\\\\\u05[dDeE][0-9a-fA-F]){2,}
Edit: Since there is only one backslash in your target string then your regular expression should be:
(?:\\\\u05[dDeE][0-9a-fA-F]){2,}
This will match \u05db\u05d3\u05d5\u05e8\u05d2\u05dc in your string
<\/span><\/span><span dir=\"rtl\">\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n <\/span>\n<br style=\"clear : both; font-size : 1px;\">\n<\/div>"}, 200, null, null);
Edit 2: If you want to match literal \u05db\u05d3\u05d5\u05e8\u05d2\u05dc then you can't use a range.
On the other hand, if you want to match Unicode code points between 05d0 and 05df then you can use:
(?:[\\u05d0-\\u05df]){2,}
It's not clear what you're trying to do. If your goal is to simplify matching a range of Unicode characters, then you need to realize that the hex digits are completely case insensitive, and so your a-fA-F is redundant, even if you could split character literals. Try this to match all Unicode characters in the range:
[\\u05d0-\\u05ef]
Looks like you have unnecessary \\ in your input string. Following works by replacing your specified unicode character range in regex:
String text = "\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n </span>\n<br style=\\";
System.out.println(text.replaceAll("[\u05d0-\u05ea]{2,}", "###"));
OUTPUT:
###
</span>
Note that in your input text you had \\n and \\u05db etc., which I have fixed.
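The two situations discussed in this thread (real Hebrew characters in the text versus escape sequences stored as plain backslash-u text) can be sketched side by side. The class name HebrewRangeDemo and the helper names are hypothetical; the patterns are the ones given in the answers above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class HebrewRangeDemo {
    // The doubled backslash reaches the regex engine, which resolves the
    // code points U+05D0 and U+05EA itself.
    static final Pattern HEBREW_RUN = Pattern.compile("[\\u05d0-\\u05ea]{2,}");

    // Matches text that still contains the escapes as plain characters:
    // a literal backslash, the letter u, then four hex digits.
    static final Pattern ESCAPED_RUN =
            Pattern.compile("(?:\\\\u05[dDeE][0-9a-fA-F]){2,}");

    static String maskHebrew(String text) {
        return HEBREW_RUN.matcher(text).replaceAll("###");
    }

    static boolean hasEscapedHebrew(String text) {
        return ESCAPED_RUN.matcher(text).find();
    }

    public static void main(String[] args) {
        // Real characters: the compiler turns each escape into one char.
        String real = "\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n </span>";
        // Escaped text: each sequence stays as six separate characters.
        String escaped = "\\n \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n";

        System.out.println(maskHebrew(real));                    // the Hebrew run becomes ###
        System.out.println(hasEscapedHebrew(escaped));           // true
        System.out.println(hasEscapedHebrew(real));              // false
        System.out.println(maskHebrew(escaped).equals(escaped)); // true, nothing to mask
    }
}
```

Which pattern applies depends entirely on whether the input holds real characters or their escaped spelling, so it is worth printing the input's length or char codes before choosing.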

java regex illegal escape character error not occurring from command line arguments [duplicate]

This simple regex program
import java.util.regex.*;
class Regex {
public static void main(String [] args) {
System.out.println(args[0]); // #1
Pattern p = Pattern.compile(args[0]); // #2
Matcher m = p.matcher(args[1]);
boolean b = false;
while(b = m.find()) {
System.out.println(m.start()+" "+m.group());
}
}
}
invoked by java Regex "\d" "sfdd1" compiles and runs fine.
But if #2 is replaced by Pattern p = Pattern.compile("\d");, it gives a compiler error saying illegal escape character. In #1 I also tried printing the pattern specified in the command line arguments. It prints \d, which means args[0] is simply replaced by \d in #2.
So why doesn't it throw any exception then? In the end it's a string argument that Pattern.compile() is taking; doesn't it detect the illegal escape character then? Can someone please explain this behaviour?
A backslash character in a string literal needs to be escaped (preceded by a backslash). When passed in from the command line the string is not a string literal. The compiler complains because "\d" is not a valid escape sequence (see Escape Sequences for Character and String Literals ).
The \ character is used as an escape character for both Java string literals and regular expressions. This confuses many programmers. When you want to create a String in Java to represent a regular expression that has an escape character then you need to escape the Java escape character.
When passing the string in on the command line the JVM handles this for you and simply creates the String.
What you want is this
Pattern p = Pattern.compile("\\d");
The backslash \ in Java results in an escape in strings. For example, the string "\t" results in a tab character in java. This is also why "\n" produces a newline.
In regular expressions, \d is an escape with respect to the regular expression, not Java. This means that in order to get \d in a string literal, you have to type "\\d". Basically, you have to escape the \ to get the literal value \d, and then when Pattern compiles the regex, it interprets \d as the digit character class.
This can be confusing, but long story short, you should never have a single \ in a string literal for a regular expression since even the string literal "\\n" gets parsed properly.
I'm not entirely sure I understand the question, but it seems like your problem is that you're treating "\d" as a Java escape sequence, which doesn't exist. To treat it as a regex escape, use "\\d" to escape the Java escape.
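To make the difference between a string literal and a run-time string concrete, here is a minimal sketch. The class name LiteralVsArgDemo and the countMatches helper are invented for the example; the fallback to "\\d" when no argument is given is also an assumption, not part of the original program.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class LiteralVsArgDemo {
    // Counts non-overlapping matches of regex in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // In source code the backslash must be doubled; the resulting String
        // still holds just two characters, a backslash and a d.
        String fromLiteral = "\\d";
        // The same two characters built without any string-literal escaping,
        // which is what a command-line argument looks like at run time.
        String fromChars = new String(new char[] {'\\', 'd'});

        System.out.println(fromLiteral.length());          // 2
        System.out.println(fromLiteral.equals(fromChars)); // true

        String regex = args.length > 0 ? args[0] : fromLiteral;
        System.out.println(countMatches(regex, "sfdd1"));  // 1 when regex is \d
    }
}
```

The compiler only inspects escape sequences in source code, so a string arriving through args never goes through that check.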

Matching Unicode Dashes in Java Regular Expressions?

I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:
private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");
which, if I'm reading the Pattern documentation correctly, should capture any of the unicode dashes or the ascii dash, when surrounded on both sides by whitespace. I'm using the pattern as follows:
String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);
No joy. For the sample input below, the dash is not detected, and
titleSegmentSeparator.matcher(sectionTitle).find() returns false!
In order to make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows -- each character is followed by the output of (int)char, which should be its Unicode code point, no?
Sample input:
Study Summary (1 of 10) – Competition
S(83)t(116)u(117)d(100)y(121)
(32)S(83)u(117)m(109)m(109)a(97)r(114)y(121)
(32)((40)1(49) (32)o(111)f(102)
(32)1(49)0(48))(41) (32)–(8211)
(32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)
It looks to me like that dash is codepoint 8211, which should be matched by the regex, but it isn't! What's going on here?
You're mixing decimal (8211) and hexadecimal (0x8211).
\x and \u both expect a hexadecimal number, so you'd need \u2013 to match the en dash in your sample (decimal 8211) and \u2014 to match the em dash (decimal 8212), not \u8211 and \u8212 (and \x2D for the normal hyphen, decimal 45, etc.).
But why not simply use the Unicode property "Dash punctuation"?
As a Java string: "\\s\\p{Pd}\\s"
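A quick sketch of the \p{Pd} suggestion applied to the sample title from the question, next to the original decimal-based pattern for comparison. The class name DashSplitDemo and the helper names are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class DashSplitDemo {
    // Any character in the Unicode "Dash punctuation" category, with
    // whitespace on both sides.
    static final Pattern SEPARATOR = Pattern.compile("\\s\\p{Pd}\\s");

    // The pattern from the question, which treats decimal values as hex.
    static final Pattern ORIGINAL =
            Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");

    static String[] splitTitle(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        // U+2013 is the en dash, decimal 8211 in the debug output.
        String title = "Study Summary (1 of 10) \u2013 Competition";
        System.out.println(Arrays.toString(splitTitle(title)));
        System.out.println(Arrays.toString(splitTitle("foo - bar")));
        System.out.println(ORIGINAL.matcher(title).find()); // false
    }
}
```

Because the separator requires whitespace on both sides, a hyphen inside a word (as in "foo-bar") is left alone.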
