I am parsing a .txt file line by line, looking for a target token. I use a regex engine.
I match each line against:
"(^|.*[\\s])"+token+"([\\s].*|$)"
where token is a string. When:
token="6-7(3-7"
the following exception is thrown:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Unclosed group near index 27
(^|.*[\s])6-7(3-7([\s].*|$)
How can I solve this?
You have special characters in your token.
Have a look at Pattern.quote():
public static String quote(String s)
Returns a literal pattern String for the specified String.
This method produces a String that can be used to create a Pattern that would match the string s as if it were a literal pattern.
Metacharacters or escape sequences in the input sequence will be given no special meaning.
This should do the trick for you:
String pattern = "(^|.*[\\s])" + Pattern.quote(token) + "([\\s].*|$)";
No need for doing the string magic yourself! :-)
You should make sure to escape special characters in any plain-text string you use to make regex patterns. Replace "(" with "\(", and similarly for bare backslashes (before any other steps), periods, and all other special characters, at least all those you expect to see in the input. (If it's arbitrary input from users, assume every character will be included.)
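As a rough sketch of the Pattern.quote() suggestion applied to the question's pattern (the class and method names here are made up for illustration, and the sample lines are assumptions, not data from the question):

```java
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative, not from the original post.
class TokenLineMatcher {
    // Same pattern as in the question, but the token is quoted so that
    // characters such as '(' are matched literally instead of opening a group.
    static Pattern forToken(String token) {
        return Pattern.compile("(^|.*[\\s])" + Pattern.quote(token) + "([\\s].*|$)");
    }

    // True if the whole line matches, i.e. the token appears delimited by
    // whitespace or by the start/end of the line.
    static boolean lineContainsToken(String line, String token) {
        return forToken(token).matcher(line).matches();
    }

    public static void main(String[] args) {
        String token = "6-7(3-7";
        System.out.println(lineContainsToken("score 6-7(3-7 final", token)); // true
        System.out.println(lineContainsToken("6-7(3-7", token));             // true
        System.out.println(lineContainsToken("score 6-7(3-75", token));      // false
    }
}
```

If many lines are checked against the same token, compiling the Pattern once and reusing it is likely cheaper than rebuilding it per line.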
I have a String input and a String pattern, and either may contain characters that have special meaning in a regex. I want an exact (literal) replacement to take place, with any special meaning ignored. I won't know at compile time how many such special characters are present in either the input string or the input pattern.
Here is the formal problem statement:
Assume input_string is the input, of type String.
Then we have another string, input_pattern, also of type String.
Now I want to perform the following:
String result=input_string.replaceFirst(input_pattern,"replacewithsomethingdoesntmatter");
The replacement should be an exact match, ignoring any special regex meaning of characters in the strings. How can I make that happen?
You can use the Pattern.quote() method to escape characters that have a special meaning in regular expressions:
String pattern = "^(.*)$";
String quotedPattern = Pattern.quote(pattern);
System.out.println(quotedPattern);
This will wrap the pattern in quotation markers (\Q and \E), indicating that the wrapped sequence needs to be matched literally.
Alternatively, you can wrap the pattern in quotation markers manually:
String pattern = "^(.*)$";
String quotedPattern = "\\Q" + pattern + "\\E";
System.out.println(quotedPattern);
The first approach is probably safer, because it will also make accommodations for expressions that already contain quotation markers.
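Applied to the replaceFirst call from the question, this might look like the sketch below. The helper names are hypothetical. It also quotes the replacement with Matcher.quoteReplacement, which the answer above does not mention but which matters because '$' and '\' are special in replacement strings too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative, not from the original post.
class LiteralReplace {
    // Replaces the first literal occurrence of 'literal' in 'input'.
    // Pattern.quote neutralises regex metacharacters in the search text;
    // Matcher.quoteReplacement neutralises '$' and '\' in the replacement.
    static String replaceFirstLiteral(String input, String literal, String replacement) {
        return input.replaceFirst(Pattern.quote(literal), Matcher.quoteReplacement(replacement));
    }

    // True if 'text' is exactly 'literal', compared through a quoted regex.
    static boolean matchesLiterally(String text, String literal) {
        return text.matches(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        // '$', '(' and ')' are all treated as plain characters.
        System.out.println(replaceFirstLiteral("cost: $5 (approx) $5", "$5 (approx)", "$6"));
        // prints: cost: $6 $5

        // A literal that itself contains \E still matches when quoted by Pattern.quote.
        System.out.println(matchesLiterally("a\\Eb", "a\\Eb")); // true
    }
}
```

When no regex features are needed at all, String.replace(CharSequence, CharSequence) is also literal, but it replaces every occurrence rather than only the first.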
I'm trying to understand Pattern.quote using the following code:
String pattern = Pattern.quote("1252343% 8 567 hdfg gf^$545");
System.out.println("Pattern is : "+pattern);
produces the output:
Pattern is : \Q1252343% 8 567 hdfg gf^$545\E
What are \Q and \E here? The documentation says:
Returns a literal pattern String for the specified String.
This method produces a String that can be used to create a Pattern that would match the string s as if it were a literal pattern.
Metacharacters or escape sequences in the input sequence will be given no special meaning.
But Pattern.quote's return type is String and not a compiled Pattern object.
Why is this method required and what are some usage examples?
\Q means "start of literal text" (i.e. regex "open quote")
\E means "end of literal text" (i.e. regex "close quote")
Calling the Pattern.quote() method wraps the string in \Q...\E, which turns the text into a regex literal. For example, Pattern.quote(".*") would match a dot and then an asterisk:
System.out.println("foo".matches(".*")); // true
System.out.println("foo".matches(Pattern.quote(".*"))); // false
System.out.println(".*".matches(Pattern.quote(".*"))); // true
The method's purpose is to spare the programmer from remembering the special terms \Q and \E, and to add a bit of readability to the code; regex is hard enough to read already. Compare:
someString.matches(Pattern.quote(someLiteral));
someString.matches("\\Q" + someLiteral + "\\E");
Referring to the javadoc:
Returns a literal pattern String for the specified String.
This method produces a String that can be used to create a Pattern that would match the string s as if it were a literal pattern.
Metacharacters or escape sequences in the input sequence will be given no special meaning.
The Pattern.quote method quotes part of a regex pattern so that the regex engine interprets it as a string literal.
Say you have some user input in your search program, and you want to search for it with a regex. Because this input may contain regex metacharacters, you can use:
Pattern pattern = Pattern.compile(Pattern.quote(userInput));
This method does not quote a Pattern but, as you point out, wraps a String in regex quotes.
\Q and \E, like all the other constructs, are thoroughly documented on the java.util.regex.Pattern Javadoc page. They mean "begin quote" and "end quote" and demarcate a region where all characters have their literal meaning. The way to use the return value of Pattern.quote is to pass it to Pattern.compile, or to any other method that accepts a pattern string, such as String.split.
If you compile the String returned by Pattern.quote, you'll get a Pattern which matches the literal string that you quoted.
\Q and \E mark the beginning and end of the quoted part of the string.
Regex syntax frequently collides with normal strings. Say I want a regex to search for a certain string that is only known at runtime. How can we be sure that the string has no regex meaning, e.g. ".*.*.*"? We quote it.
This method is used to make the pattern be treated as a sequence of literal characters.
This has the same effect as the Pattern.LITERAL flag.
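A small sketch comparing the two routes mentioned above, quoting the string versus compiling with Pattern.LITERAL, plus the String.split use case. Class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class: names are illustrative only.
class QuoteVersusLiteralFlag {
    // Searches for 'literal' inside 'text' using a quoted pattern string.
    static boolean findQuoted(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    // Same search, but asking the compiler to treat the whole pattern literally.
    static boolean findWithFlag(String text, String literal) {
        return Pattern.compile(literal, Pattern.LITERAL).matcher(text).find();
    }

    // String.split takes a regex, so a delimiter such as "|" must be quoted.
    static String[] splitOnLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        System.out.println(findQuoted("a.*b", ".*"));   // true
        System.out.println(findWithFlag("a.*b", ".*")); // true
        System.out.println(findQuoted("abc", ".*"));    // false: no literal ".*" in "abc"
        System.out.println(Arrays.toString(splitOnLiteral("1|2|3", "|"))); // [1, 2, 3]
    }
}
```

The flag only works when the entire pattern is literal; Pattern.quote is the one to reach for when a literal piece has to be embedded inside a larger regex.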
I am new to regex. I wrote the following regex to check phone numbers in JavaScript: ^[0-9\+\-\s\(\)\[\]\x]*$
Now I am trying to do the same thing in Java using the following code:
public class testRegex {
    public static void main(String[] args) {
        String regex = "^[0-9\\+\\-\\s\\(\\)\\[\\]\\x]*$";
        String phone = "98650056";
        System.out.println(phone.matches(regex));
    }
}
However, I always get the following error:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal hexadecimal escape sequence near index 21
^[0-9\+\-\s\(\)\[\]\x]*$
Please advise.
Since you are presumably trying to match a literal x (as in a phone extension), leave it unescaped. In a regex, \x introduces a hexadecimal escape and must be followed by two hex digits (\xhh) or a braced code point (\x{h...h}); here it is followed by ], which is why compilation fails. Writing four backslashes in the Java string also compiles, but the class then accepts a literal backslash as well as x.
Java string literal    Regex    Matches
"[\\x]"    [\x]    error: incomplete hex escape
"[\\\\x]"    [\\x]    a backslash or x
"[x]"    [x]    x
So the pattern becomes:
String regex="^[-0-9+()\\s\\[\\]x]*$";
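Escaped alternatives (the second also accepts a literal backslash, because \\\\ in the string literal becomes an escaped backslash in the regex):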
String regex="^[0-9\\+\\-\\s\\(\\)\\[\\]x]*$";
String regex="^[0-9\\+\\-\\s\\(\\)\\[\\]\\\\x]*$";
You have waaaay too many back slashes!
Firstly, to code a literal backslash in java, you must write two of them.
Secondly, most characters lose their special regex meaning when in a character class.
Thirdly, \x introduces a hex literal - you don't want that.
Write your regex like this:
String regex="^[0-9+\\s()\\[\\]x-]*$";
Note how you don't need to escape the hyphen in a character class when it appears first or last.
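To check the suggested pattern against the original one, something like the following could be used. The class and method names and the sample inputs are made up for the example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class: names and sample inputs are illustrative only.
class PhoneRegexCheck {
    // The pattern suggested above: digits, '+', whitespace, parentheses,
    // square brackets, 'x', and a trailing (hence literal) hyphen.
    static final String PHONE_REGEX = "^[0-9+\\s()\\[\\]x-]*$";

    static boolean looksLikePhone(String s) {
        return s.matches(PHONE_REGEX);
    }

    // Reports whether a regex string compiles at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikePhone("98650056"));              // true
        System.out.println(looksLikePhone("+1 (555) 010-9999 x12")); // true
        System.out.println(looksLikePhone("call me"));               // false
        // The pattern from the question fails to compile because of \x.
        System.out.println(compiles("^[0-9\\+\\-\\s\\(\\)\\[\\]\\x]*$")); // false
    }
}
```

Because of the * quantifier, the empty string also passes; replacing it with + would require at least one character.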
I've read the manual, and at the end there was an exercise:
Use a backreference to write an expression that will match a person's name only if that person's first name and last name are the same.
I've written the next program http://pastebin.com/YkuUuP5M
But when I compile it, I'm getting an error:
PersonName.java:18: illegal escape character
p = Pattern.compile("([A-Z][a-zA-Z]+)\s+\1");
^
If I rewrite line 18 this way:
pattern = Pattern.compile(console.readLine("%nEnter your regex: "));
and type the pattern in the console, the program works fine. Why can't I use the pattern as in the first version, and is there some way to fix it?
You want to get this text into a string:
([A-Z][a-zA-Z]+)\s+\1
However, \ in a string literal in Java source code is the character used for escaping (e.g. "\t" for tab). Therefore you need to use "\\" in a string literal to end up with a single backslash in the resulting string. So you want:
"([A-Z][a-zA-Z]+)\\s+\\1"
Note that there's nothing regular-expression-specific to this. Any time you want to express a string containing a backslash in a Java string literal, you'll need to escape that backslash. Regular expressions and Windows filenames are just the most common cases for that.
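A minimal sketch of the corrected pattern in use. The class and method names are invented here, and the original program on pastebin is not reproduced.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: not the asker's original PersonName program.
class SameNameCheck {
    // In Java source, "\\s" and "\\1" each yield a single backslash in the
    // string, so the regex engine sees ([A-Z][a-zA-Z]+)\s+\1
    static final Pattern SAME_NAME = Pattern.compile("([A-Z][a-zA-Z]+)\\s+\\1");

    // True if the whole input is a capitalised word, whitespace, and then
    // the same word again (the backreference \1).
    static boolean firstEqualsLast(String name) {
        return SAME_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(firstEqualsLast("Humbert Humbert")); // true
        System.out.println(firstEqualsLast("Humbert Smith"));   // false
        System.out.println("\\s".length());                     // 2: a backslash and 's'
    }
}
```

Reading the pattern from the console works without doubling because the input is never processed as a Java string literal.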