Illegal escape character error in Java regex - java

I've read the manual, and at the end there was an exercise:
Use a backreference to write an expression that will match a person's name only if that person's first name and last name are the same.
I wrote the following program: http://pastebin.com/YkuUuP5M
But when I compile it, I'm getting an error:
PersonName.java:18: illegal escape character
p = Pattern.compile("([A-Z][a-zA-Z]+)\s+\1");
^
If I rewrite line 18 this way:
pattern = Pattern.compile(console.readLine("%nEnter your regex: "));
and type the pattern in the console, then the program works fine. Why can't I use the pattern as in the first version, and is there some way to fix it?

You want to get this text into a string:
([A-Z][a-zA-Z]+)\s+\1
However, \ in a string literal in Java source code is the character used for escaping (e.g. "\t" for tab). Therefore you need to use "\\" in a string literal to end up with a single backslash in the resulting string. So you want:
"([A-Z][a-zA-Z]+)\\s+\\1"
Note that there's nothing regular-expression-specific to this. Any time you want to express a string containing a backslash in a Java string literal, you'll need to escape that backslash. Regular expressions and Windows filenames are just the most common cases for that.
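As a minimal sketch of the corrected literal in use (the class name SameNameDemo and the helper isSameName are made up for illustration and are not from the pastebin program):

```java
import java.util.regex.Pattern;

class SameNameDemo {
    // In source, "\\s" and "\\1" become the regex tokens \s and \1.
    static final Pattern SAME_NAME =
            Pattern.compile("([A-Z][a-zA-Z]+)\\s+\\1");

    // True only when the whole input is a first name followed by the same last name.
    static boolean isSameName(String fullName) {
        return SAME_NAME.matcher(fullName).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSameName("John John"));  // true
        System.out.println(isSameName("John Smith")); // false
    }
}
```

A pattern read from the console needs no doubling because the compiler never sees that text; only string literals in source code go through escape processing.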

Related

Pattern Syntax Exception in regex constructed with inputted text

I am parsing a .txt file line by line, looking for a target token, using the regex engine.
I match each line against:
"(^|.*[\\s])"+token+"([\\s].*|$)"
where token is a string. When:
token="6-7(3-7"
it throws the following exception:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Unclosed group near index 27
(^|.*[\s])6-7(3-7([\s].*|$)
How can I solve this?
You have special characters in your token.
Have a look at Pattern.quote():
public static String quote(String s)
Returns a literal pattern String for the specified String.
This method produces a String that can be used to create a Pattern that would match the string s as if it were a literal pattern.
Metacharacters or escape sequences in the input sequence will be given no special meaning.
This should do the trick for you:
String pattern = "(^|.*[\\s])" + Pattern.quote(token) + "([\\s].*|$)";
No need for doing the string magic yourself! :-)
You should make sure to escape special characters in any plain-text string you use to make regex patterns. Replace "(" with "\(", and similarly for bare backslashes (before any other steps), periods, and all other special characters, at least all those you expect to see in the input. (If it's arbitrary input from users, assume every character will be included.)
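A short sketch of the Pattern.quote() approach, assuming the same surrounding regex as in the question (the class name QuoteTokenDemo and the helper lineContainsToken are hypothetical):

```java
import java.util.regex.Pattern;

class QuoteTokenDemo {
    // Wraps the token in a quoted literal so that "(" and other
    // metacharacters in it carry no special meaning.
    static boolean lineContainsToken(String line, String token) {
        String regex = "(^|.*[\\s])" + Pattern.quote(token) + "([\\s].*|$)";
        return Pattern.matches(regex, line);
    }

    public static void main(String[] args) {
        String token = "6-7(3-7";
        System.out.println(lineContainsToken("score 6-7(3-7 final", token)); // true
        System.out.println(lineContainsToken("score x6-7(3-7 final", token)); // false
    }
}
```

Quoting the token in one call is usually less error-prone than replacing each special character by hand, since no metacharacter can be missed.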

Java Regex - Invalid Escape Sequence w/ one backslash, Delete These Tokens w/ two

I'm attempting to make a program to parse an output in Eclipse, but when I enter the regular expression like so:
Pattern signaturePattern = Pattern.compile("[A-Z0-9_]+[" "]+[A-Za-z0-9\.]+[" "]+[A-Za-z0-9\.]+[" "]+[A-Za-z0-9\.]+[" "]+[A-Za-z0-9\.]+[" "]+");
The compiler gives me an error that says "invalid escape sequence." However, when I do what many answers to this question recommend - that is, to add an extra backslash to the dots - and I enter this instead:
Pattern signaturePattern = Pattern.compile("[A-Z0-9_]+[" "]+[A-Za-z0-9\\.]+[" "]+[A-Za-z0-9\\.]+[" "]+[A-Za-z0-9\\.]+[" "]+[A-Za-z0-9\\.]+[" "]+");
The compiler instead says "Syntax error on tokens, delete these tokens." How can I get it to simply read the regular expression as-is?
You forgot to escape your double quotes, as such (one escape only): \".
Here is your escaped Pattern (both code and Pattern compile, but I'm not guaranteeing it does what you want).
Pattern signaturePattern = Pattern.compile("[A-Z0-9_]+[\" \"]+[A-Za-z0-9\\.]+[\" \"]+[A-Za-z0-9\\.]+[\" \"]+[A-Za-z0-9\\.]+[\" \"]+[A-Za-z0-9\\.]+[\" \"]+");
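To see the two kinds of escaping side by side, here is a small sketch using a shortened version of the pattern (the class name QuoteEscapeDemo and the shortened pattern are illustrative assumptions, not the asker's full expression):

```java
import java.util.regex.Pattern;

class QuoteEscapeDemo {
    // In source, \" yields a quote character inside the string,
    // while \\. yields the two-character regex token for a literal dot.
    static final Pattern GAP_THEN_WORD =
            Pattern.compile("[\" \"]+[A-Za-z0-9\\.]+");

    static boolean matches(String s) {
        return GAP_THEN_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Prints the regex as the engine sees it: [" "]+[A-Za-z0-9\.]+
        System.out.println(GAP_THEN_WORD.pattern());
        System.out.println(matches("\" \"java.util")); // true
    }
}
```

Printing pattern() is a quick way to check what the compiler actually handed to the regex engine.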

unicode regex pattern not working

I am trying to match a sequence of Unicode characters:
Pattern pattern = Pattern.compile("\\u05[dDeE][0-9a-fA-F]{2,}");
String text = "\\n \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n <\\/span>\\n<br style=\\";
Matcher match = pattern.matcher(text);
but doing so gives this exception:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 4
\u05[dDeE][0-9a-fA-F]+
^
How can I still use regex syntax (like "[") when matching Unicode characters?
EDIT:
I'm trying to parse some text. Somewhere in the text there is a sequence of Unicode characters whose code range I know.
Edit2:
I am now using ranges instead : [\\u05d0-\\u05ea]{2,} but still can't match the text above
Edit3:
ok, now it's working, the problem was I used two backslashes instead of one, both in the regex and text.
The solution for this is, assuming I know there will be two chars or more:
[\u05d0-\u05ea]{2,}
Here is what is causing the exception:
\\u05[dDeE][0-9a-fA-F]{2,}
^^^^
The Java regular expression parser thinks you are trying to match a Unicode code point using the escape sequence \uNNNN, and it throws because \u requires four hexadecimal digits after it and there are only two, namely 05. Change it to \\u0005 if that is what you actually want.
On the other hand, if you want to match \\u in the target string, then you need to quad escape each backslash \ like this \\\\ so to match \\u you need \\\\\\\\u.
\\\\\\\\u05[dDeE][0-9a-fA-F]{2,}
Finally, if you want to match those Unicode code points literally in your target string then you need to modify our last expression a bit like this:
(?:\\\\\\\\u05[dDeE][0-9a-fA-F]){2,}
Edit: Since there is only one backslash in your target string then your regular expression should be:
(?:\\\\u05[dDeE][0-9a-fA-F]){2,}
This will match \u05db\u05d3\u05d5\u05e8\u05d2\u05dc in your string
<\/span><\/span><span dir=\"rtl\">\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n <\/span>\n<br style=\"clear : both; font-size : 1px;\">\n<\/div>"}, 200, null, null);
Edit 2: If you want to match literal \u05db\u05d3\u05d5\u05e8\u05d2\u05dc then you can't use a range.
On the other hand, if you want to match Unicode code points between 05d0 and 05df then you can use:
(?:[\\u05d0-\\u05df]){2,}
It's not clear what you're trying to do. If your goal is to simplify matching a range of Unicode characters, then you need to realize that the hex digits are completely case insensitive, so your a-fA-F is redundant, even if you could split character literals. Try this to match all Unicode characters in the range:
[\\u05d0-\\u05ef]
It looks like you have an unnecessary \\ in your input string. The following works, replacing the Unicode character range you specified in the regex:
String text = "\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n </span>\n<br style=\\";
System.out.println(text.replaceAll("[\u05d0-\u05ea]{2,}", "###"));
OUTPUT:
###
</span>
Note that your input text had \\n, \\u05db and so on, which I have fixed.
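The two situations discussed in the answers (escape sequences written out as plain text versus real characters) can be sketched side by side. The class name UnicodeEscapeDemo, the helper firstMatch and the short sample strings are assumptions made for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeDemo {
    // Four backslashes in source become one escaped backslash in the regex,
    // so this matches escape sequences that are written out as plain text.
    static final Pattern ESCAPED_TEXT =
            Pattern.compile("(?:\\\\u05[dDeE][0-9a-fA-F]){2,}");

    // Two backslashes in source hand the escape to the regex engine,
    // which reads it as a code point; this matches real Hebrew letters.
    static final Pattern REAL_CHARS = Pattern.compile("[\\u05d0-\\u05ea]{2,}");

    // Returns the first match of p in s, or null if there is none.
    static String firstMatch(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String escapedText = "x \\u05db\\u05d3 y"; // backslash, u, digits as text
        String realChars = "x \u05db\u05d3 y";     // two actual Hebrew letters
        System.out.println(firstMatch(ESCAPED_TEXT, escapedText).length()); // 12
        System.out.println(firstMatch(REAL_CHARS, realChars).length());     // 2
        System.out.println(firstMatch(REAL_CHARS, escapedText));            // null
    }
}
```

Which pattern is appropriate depends on whether the input holds the characters themselves or their escaped spelling, for example in JSON that has not been decoded yet.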

Replacing backslash by another symbol in escape symbol

I have the string some_text\1\12\3 and need to get some_text.1.12.3, i.e. replace each \ with a dot. The problem is that Java interprets \1 as a single character (an escape sequence), so what I actually need is to replace part of an escape sequence.
It sounds like all you're missing is the knowledge of how to escape the backslash in a Java string literal - which is a matter of doubling the backslash:
String replaced = original.replace('\\', '.');
On the other hand, it's not clear where your text is coming from or going to anyway - the \1 part would only be relevant if it's being processed as part of a text literal. If you're actually trying to create a string of "some_text\1\12\3" in Java source code to start with, you'd want:
String withBackslashes = "some_text\\1\\12\\3";
Note that the actual text of the string that withBackslashes refers to only has three backslashes, not six. It's only the source code that needs them doubling. At that point, the replacement code at the top would replace the backslashes with dots.
This will do the job:
str = str.replace('\\', '.');
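Putting both pieces together, a minimal runnable sketch (the class name BackslashToDotDemo and the helper backslashesToDots are hypothetical names):

```java
class BackslashToDotDemo {
    // '\\' in source is a single backslash character at runtime.
    static String backslashesToDots(String s) {
        return s.replace('\\', '.');
    }

    public static void main(String[] args) {
        // Six backslashes in source, three in the actual string.
        String withBackslashes = "some_text\\1\\12\\3";
        System.out.println(withBackslashes.length());           // 16
        System.out.println(backslashesToDots(withBackslashes)); // some_text.1.12.3
    }
}
```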

Remove escape char ' \' from string in java

I have to remove \ from the string.
My String is "SEPIMOCO EUROPE\119"
I tried replace, indexOf, Pattern but I am not able to remove this \ from this string
String strconst="SEPIMOCO EUROPE\119";
System.out.println(strconst.replace("\\", " ")); // Gives SEPIMOCO EUROPE 9
System.out.println(strconst.replace("\\\\", " ")); // Gives SEPIMOCO EUROPE 9
System.out.println(strconst.indexOf("\\",0)); //Gives -1
Any solutions for this ?
Your string doesn't actually contain a backslash. This part: "\11" is treated as an octal escape sequence (so it's really a tab - U+0009). If you really want a backslash, you need:
String strconst="SEPIMOCO EUROPE\\119";
It's not really clear where you're getting your input data from or what you're trying to achieve, but that explains everything you're seeing at the moment.
You have to distinguish between the string literal, i.e. the thing you write in your source code, enclosed with double quotes, and the string value it represents. When turning the former into the latter, escape sequences are interpreted, causing a difference between these two.
Stripping from string literals
\11 in the literal represents the character with octal value 11, i.e. a tab character, in the actual string value. \11 is equivalent to \t.
There is no way to reliably obtain the escaped version of a string literal. In other words, you cannot know whether the source code contained \11 or \t, because that information isn't present in the class file any more. Therefore, if you wanted to “strip backslashes” from the sequence, you wouldn't know whether 11 or t was the correct replacement.
For this reason, you should try to fix the string literals, either to not include the backslashes if you don't want them at all, or to contain proper backslashes, by escaping them in the literal as well. \\ in a string literal gives a single \ in the string it expresses.
Runtime strings
As your comments on other answers indicate that you're actually receiving this string at runtime, I would expect the string to contain a real backslash instead of a tab character. Unless you employ some fancy input method which parses escape sequences, you will still have the raw backslash. In order to simulate that situation in testing code, you should include a real backslash in your string, i.e. a double backslash \\ in your string literal.
When you have a real backslash in your string, strconst.replace("\\", " ") should do what you want it to do:
String strconst="SEPIMOCO EUROPE\\119";
System.out.println(strconst.replace("\\", " ")); // Gives SEPIMOCO EUROPE 119
Where does your String come from? If you declare it like in the example you will want to add another escaping backslash before the one you have there.
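The difference between the two literals can be checked directly. This is a small sketch; the class name OctalEscapeDemo, its constants and the helper stripBackslashes are made up for illustration:

```java
class OctalEscapeDemo {
    // \11 is an octal escape for the tab character, followed by the digit 9.
    static final String WITH_OCTAL = "SEPIMOCO EUROPE\119";
    // \\ is a real backslash, followed by the digits 119.
    static final String WITH_BACKSLASH = "SEPIMOCO EUROPE\\119";

    static String stripBackslashes(String s) {
        return s.replace("\\", " ");
    }

    public static void main(String[] args) {
        System.out.println(WITH_OCTAL.indexOf('\\'));         // -1
        System.out.println(WITH_OCTAL.indexOf('\t'));         // 15
        System.out.println(WITH_BACKSLASH.indexOf('\\'));     // 15
        System.out.println(stripBackslashes(WITH_BACKSLASH)); // SEPIMOCO EUROPE 119
    }
}
```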
