Java regular expression does not match using Apache RE

I have this regular expression string:
^[a-zA-Z0-9\t\s\n\r!$()*,-./:;=?#`][{}_~|]+$
This RE should return true for the following:
!$()*,-./:;=?#`][{}_~|
I'm using Apache's RE class and get false when I run the match function.
I think my regular expression is missing something, perhaps in how it handles special characters.
The question is: what is wrong with my expression? Here is my RE matching function:
public static String runRegularExpression(String string, String regularExpression, int parenthesis)
{
String result = null;
try
{
RE reCmd = new RE(regularExpression);
if (reCmd.match(string))
{
result = reCmd.getParen(parenthesis);
}
}
catch (Exception re)
{
}
return result;
}

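Your regex must not have an unescaped hyphen in the middle of a character class: there it is read as a range operator (here ,-.), so put it last or escape it.
The unescaped ] also ends the character class early, so [ and ] must be escaped to be matched literally.
If you already have \s then there is no need to match \n and \t, since \s matches all whitespace, including space, tab and newline.
[a-zA-Z0-9_] can be shortened to \w.
Backslashes need to be doubled inside a Java string literal.
Try this regex: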
^[\\w\\s\\r!$()*,./:;=?#`{}\\[\\]~|-]+$
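As a rough check of the points above, here is a minimal sketch that runs both patterns against the sample string. It assumes java.util.regex rather than the Apache RE class from the question (the two engines may differ in details), and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    // The question's pattern as a Java string literal: the first ] closes the
    // character class, so what follows is a second class, [{}_~|]+.
    static final String ORIGINAL =
        "^[a-zA-Z0-9\\t\\s\\n\\r!$()*,-./:;=?#`][{}_~|]+$";

    // The answer's pattern: [ and ] escaped, hyphen moved to the end.
    static final String FIXED = "^[\\w\\s\\r!$()*,./:;=?#`{}\\[\\]~|-]+$";

    // Hypothetical helper: true if the anchored pattern accepts the input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String sample = "!$()*,-./:;=?#`][{}_~|";
        System.out.println(matches(ORIGINAL, sample)); // false
        System.out.println(matches(FIXED, sample));    // true
    }
}
```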

Regex that matches the string ÷x% [duplicate]

I've been trying to create a regex that matches the following pattern:
÷x%
here is my code:
String string = "÷x%2%x#3$$#";
String myregex = "all the things I've tried";
string = string.replaceAll(myregex,"÷1x#1$%");
I've tried the following regexes: (÷x%) , [÷][x][%] , [÷]{1}[x]{1}[%]{1}
I am using NetBeans IDE, and running the code throws an exception:
Illegal group reference
However, when I change the value of string to something else, a word for example, no exception is thrown.
Any thoughts? Thanks.
To replace all occurrences of a sub-string you don't need a pattern. You can use String.replace():
String input = "÷x%abc÷x%def÷x%";
String output = input.replace("÷x%", "÷1x#1$%");
System.out.println(output); // ÷1x#1$%abc÷1x#1$%def÷1x#1$%
As per method javadoc:
Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence.
As per the comments in the question, I am hoping that this will shed some light on how the replaceAll works.
As per the JavaDoc, the replaceAll takes in a regular expression as first argument. In your case, the regular expression appears to be sound, so there is no issue there.
The second argument that the replaceAll accepts, is the string that will be used to replace whatever the regular expression matches.
In some cases, you will need to replace the same pattern with the same (hard coded, if you will) string:
String myString = "123abc1344";
myString = myString.replaceAll("\\d+", "number"); // "numberabcnumber"
myString = myString.replaceAll("\\w+", "word"); // the whole string is now word characters, so this leaves "word"
System.out.println(myString); // prints: word
BUT, there are situations where you want to use chunks of what you are replacing in the replacement string itself. This is where the $ comes in:
String myString = "Age:9;Gender:Male";
Let us say that you want to change the format of the string to the following: "I am a {Gender} and I am {Age} years of age.".
In this case, your replacement string needs to extract information from the string to be replaced and inject it in the replacement itself. You do this by using the following:
String myString = "Age:9;Gender:Male";
myString = myString.replaceAll("Age:(\\d+);Gender:(\\w+)", "I am a $2 and I am $1 years of age.");
The above should yield the string that you are after. Notice that I am using $1 and $2 to access regular expression groups. In regular expression language, the 0th group is whatever is matched by the entire regular expression. Each pair of round parentheses denotes another group, which you can access in the replacement string through the $ character.
This is why a literal $ in the replacement string needs to be escaped.
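The group numbering described above can also be seen by reading the groups directly from a Matcher. This is only a sketch; the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    // Hypothetical helper: rebuilds the sentence from the captured groups.
    static String describe(String input) {
        Matcher m = Pattern.compile("Age:(\\d+);Gender:(\\w+)").matcher(input);
        if (!m.matches()) {
            return null; // input does not have the expected shape
        }
        // group(0) is the whole match; group(1) and group(2) are the two
        // parenthesised parts, the same values $1 and $2 refer to.
        return "I am a " + m.group(2) + " and I am " + m.group(1) + " years of age.";
    }

    public static void main(String[] args) {
        System.out.println(describe("Age:9;Gender:Male"));
        // I am a Male and I am 9 years of age.
    }
}
```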
In the replacement string passed to replaceAll you have to escape the $ sign.
If you write $% you would refer to the group %, which does not exist.
You can try:
try {
String string = "÷x%2%x#3$$#";
String myregex = "÷x%";
String replace = "÷1x#1\\$%";
String resultString = string.replaceAll(myregex, replace);
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
} catch (IllegalArgumentException ex) {
// Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
// Non-existent backreference used the replacement text
}
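If hand-escaping each $ is inconvenient, Matcher.quoteReplacement from java.util.regex can do it for the whole replacement string. The sketch below is one possible way to use it; the class and helper names are hypothetical, and ÷ is written as \u00F7 only to keep the source file ASCII.

```java
import java.util.regex.Matcher;

final class QuoteReplacementDemo {
    // Hypothetical helper: treats the replacement as literal text, so $ and \
    // in it are not interpreted as group references or escapes.
    static String replaceLiterally(String input, String regex, String replacement) {
        return input.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    // Returns true if the raw, unquoted replacement is rejected by replaceAll.
    static boolean rawReplacementFails(String input, String regex, String replacement) {
        try {
            input.replaceAll(regex, replacement);
            return false;
        } catch (IllegalArgumentException e) {
            return true; // e.g. "Illegal group reference"
        }
    }

    public static void main(String[] args) {
        String input = "\u00F7x%2%x#3$$#";   // \u00F7 is the division sign
        String regex = "\u00F7x%";
        String replacement = "\u00F71x#1$%";
        System.out.println(rawReplacementFails(input, regex, replacement)); // true
        System.out.println(replaceLiterally(input, regex, replacement));
    }
}
```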

Regular expression to remove HTML tags doesn't match

I have a String like <li><font color='#008000'> [INFO]a random user. and I want to eliminate html tags such as <li> and <font> from this String.
I tried to achieve this with String.replaceAll method in Java but it doesn't work...
public static String removeHTMLTags(String original){
String str = original.replaceAll("^<.+>$", "");
return str;
}
Your regex isn't finding a match because the ^ and $ anchors specify that the very first character in the input string must be < and the very last must be >.
Without those anchors, your regex still won't do what you want, however, because quantifiers (such as .+) are by default greedy.
So if your input string was text1 <a href=foo>bar</a> text2, your transformed output would be text1 text2, because the regex would match everything from the first < to the last >.
So in order to stop at the first >, you should make your quantifier non-greedy: .+?.
Remove the ^ and $ and use a reluctant quantifier with the dotall flag (so dot matches newlines too):
public static String removeHTMLTags(String original){
return original.replaceAll("(?s)<.+?>", "");
}
or use a negated character class (which will match newlines)
public static String removeHTMLTags(String original){
return original.replaceAll("<[^>]+>", "");
}
You're transforming an HTML string that might contain newline characters as well. By default, the dot doesn't match newline characters. You need to use the (?s) (DOTALL) flag with a lazy quantifier and without anchors:
String str = original.replaceAll("(?s)<.+?>", "");
A word of caution, though: using regex to parse or transform HTML can be error prone.
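To compare the variants discussed in these answers side by side, here is a small sketch. The class and method names are made up for the example; the inputs are the ones used in the question and the first answer.

```java
final class TagStripDemo {
    // Greedy: matches from the first < to the last > in the input.
    static String greedy(String s) {
        return s.replaceAll("<.+>", "");
    }

    // Reluctant with DOTALL: stops at the first > after each <.
    static String reluctant(String s) {
        return s.replaceAll("(?s)<.+?>", "");
    }

    // Negated character class: anything except > between the brackets.
    static String negated(String s) {
        return s.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String link = "text1 <a href=foo>bar</a> text2";
        System.out.println("[" + greedy(link) + "]");    // [text1  text2]
        System.out.println("[" + reluctant(link) + "]"); // [text1 bar text2]
        String item = "<li><font color='#008000'> [INFO]a random user.";
        System.out.println("[" + negated(item) + "]");   // [ [INFO]a random user.]
    }
}
```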

Why is this Java regex not working?

I'm trying to match any string consisting of:
Any alphanumeric string of 1+ chars; then
Two periods (".."); then
Any alphanumeric string of 1+ chars
For example:
mydatabase..mytable
anotherDatabase23..table28
etc.
Given the following function:
public boolean isValidDBTableName(String candidate) {
if(candidate.matches("[a-zA-Z0-9]+..[a-zA-Z0-9]+"))
return true;
else
return false;
}
Passing this function the value "mydb..tablename" causes it to return false. Why? Thanks in advance!
As NeplatnyUdaj has pointed out in a comment, your current regex should return true for the input "mydb..tablename".
However, your regex has the problem of over-matching, where it returns true for invalid names such as nodotname.
You need to escape ., since in Java regex, it will match any character except for line separators:
"[a-zA-Z0-9]+\\.\\.[a-zA-Z0-9]+"
In regex, you can escape meta-characters (character with special meaning) with \. To specify \ in string literal, you need to escape it again.
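The over-matching is easy to see by running both patterns on a name without dots. A minimal sketch, with hypothetical class and constant names:

```java
final class DotEscapeDemo {
    // Unescaped dots match any two characters.
    static final String UNESCAPED = "[a-zA-Z0-9]+..[a-zA-Z0-9]+";
    // Escaped dots match only two literal periods.
    static final String ESCAPED = "[a-zA-Z0-9]+\\.\\.[a-zA-Z0-9]+";

    public static void main(String[] args) {
        System.out.println("mydb..tablename".matches(UNESCAPED)); // true
        System.out.println("nodotname".matches(UNESCAPED));       // true (over-match)
        System.out.println("mydb..tablename".matches(ESCAPED));   // true
        System.out.println("nodotname".matches(ESCAPED));         // false
    }
}
```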
You must escape the period in regexes. As a \ must also be escaped, this gives
"[a-zA-Z0-9]+\\.\\.[a-zA-Z0-9]+"
I just tried your regex in Eclipse and it worked. Or at least did not fail. Try stripping whitespace characters.
@Test
public void test()
{
String testString = "mydb..tablename";
Assert.assertTrue("no match", testString.matches("[a-zA-Z0-9]+..[a-zA-Z0-9]+"));
Assert.assertFalse("falsematch", "a.b".matches("[a-zA-Z0-9]+..[a-zA-Z0-9]+"));
}

Escaping special characters in Java Regular Expressions

Is there any method in Java or any open source library for escaping (not quoting) a special character (meta-character), in order to use it as a regular expression?
This would be very handy in dynamically building a regular expression, without having to manually escape each individual character.
For example, consider a simple regex like \d+\.\d+ that matches numbers with a decimal point like 1.2, as well as the following code:
String digit = "d";
String point = ".";
String regex1 = "\\d+\\.\\d+";
String regex2 = Pattern.quote(digit + "+" + point + digit + "+");
Pattern numbers1 = Pattern.compile(regex1);
Pattern numbers2 = Pattern.compile(regex2);
System.out.println("Regex 1: " + regex1);
if (numbers1.matcher("1.2").matches()) {
System.out.println("\tMatch");
} else {
System.out.println("\tNo match");
}
System.out.println("Regex 2: " + regex2);
if (numbers2.matcher("1.2").matches()) {
System.out.println("\tMatch");
} else {
System.out.println("\tNo match");
}
Not surprisingly, the output produced by the above code is:
Regex 1: \d+\.\d+
Match
Regex 2: \Qd+.d+\E
No match
That is, regex1 matches 1.2 but regex2 (which is "dynamically" built) does not (instead, it matches the literal string d+.d+).
So, is there a method that would automatically escape each regex meta-character?
If there were, let's say, a static escape() method in java.util.regex.Pattern, the output of
Pattern.escape('.')
would be the string "\.", but
Pattern.escape(',')
should just produce ",", since it is not a meta-character. Similarly,
Pattern.escape('d')
could produce "\d", since 'd' is used to denote digits (although escaping may not make sense in this case, as 'd' could mean literal 'd', which wouldn't be misunderstood by the regex interpreter to be something else, as would be the case with '.').
Is there any method in Java or any open source library for escaping (not quoting) a special character (meta-character), in order to use it as a regular expression?
If you are looking for a way to create constants that you can use in your regex patterns, then just prepending them with "\\" should work but there is no nice Pattern.escape('.') function to help with this.
So if you are trying to match "\\d" (the string \d instead of a decimal digit) then you would do:
// this will match on \d as opposed to a decimal digit
String matchBackslashD = "\\\\d";
// as opposed to
String matchDecimalDigit = "\\d";
The 4 backslashes in the Java string turn into 2 backslashes in the regex pattern. 2 backslashes in a regex pattern match the backslash itself. Prepending a backslash to any special character turns it into a normal character instead of a special one.
String matchPeriod = "\\.";
String matchPlus = "\\+";
String matchParens = "\\(\\)";
...
In your post you use the Pattern.quote(string) method. This method wraps your pattern between "\\Q" and "\\E" so you can match a string even if it happens to have a special regex character in it (+, ., \\d, etc.)
I wrote this pattern:
Pattern SPECIAL_REGEX_CHARS = Pattern.compile("[{}()\\[\\].+*?^$\\\\|]");
And use it in this method:
String escapeSpecialRegexChars(String str) {
return SPECIAL_REGEX_CHARS.matcher(str).replaceAll("\\\\$0");
}
Then you can use it like this, for example:
Pattern toSafePattern(String text)
{
return Pattern.compile(".*" + escapeSpecialRegexChars(text) + ".*");
}
We needed to do that because, after escaping, we add more regex syntax around the text (the .* parts). If you don't need that, you can simply use \Q and \E:
Pattern toSafePattern(String text)
{
return Pattern.compile(".*\\Q" + text + "\\E.*");
}
The only way the regex matcher knows you are looking for a digit and not the letter d is to escape the letter (\d). To type the regex escape character in java, you need to escape it (so \ becomes \\). So, there's no way around typing double backslashes for special regex chars.
The Pattern.quote(String s) sort of does what you want. However, it leaves a little to be desired; it doesn't actually escape the individual characters, it just wraps the string with \Q...\E.
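Even so, a quoted literal can still be combined with ordinary regex syntax, which covers many "dynamic pattern" cases. The following is a sketch under that assumption; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class QuoteDemo {
    // Hypothetical helper: a pattern that accepts any text containing the
    // given literal. Only the literal is quoted; the .* parts stay regex.
    static Pattern containing(String literal) {
        return Pattern.compile(".*" + Pattern.quote(literal) + ".*");
    }

    public static void main(String[] args) {
        System.out.println(Pattern.quote("1.2")); // \Q1.2\E
        Pattern p = containing("1.2");
        System.out.println(p.matcher("v1.2-beta").matches()); // true
        System.out.println(p.matcher("v1x2-beta").matches()); // false: the dot is literal
    }
}
```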
There is not a method that does exactly what you are looking for, but the good news is that it is actually fairly simple to escape all of the special characters in a Java regular expression:
regex.replaceAll("[\\W]", "\\\\$0")
Why does this work? Well, the documentation for Pattern specifically says that it's permissible to escape non-alphabetic characters that don't necessarily have to be escaped:
It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.
For example, ; is not a special character in a regular expression. However, if you escape it, Pattern will still interpret \; as ;. Here are a few more examples:
> becomes \> which is equivalent to >
[ becomes \[ which is the escaped form of [
8 is still 8.
\) becomes \\\) which is the escaped forms of \ and ) concatenated.
Note: The key is the definition of "non-alphabetic", which in the documentation really means "non-word" characters, or characters outside the character set [a-zA-Z_0-9].
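A quick way to try this approach is to escape a string and then check that the escaped form, compiled as a pattern, matches the original text. This is a sketch with made-up names, tried here only on ASCII input.

```java
import java.util.regex.Pattern;

final class NonWordEscapeDemo {
    // Hypothetical helper: prefixes every non-word character with a backslash.
    static String escapeNonWord(String text) {
        return text.replaceAll("[\\W]", "\\\\$0");
    }

    public static void main(String[] args) {
        System.out.println(escapeNonWord("a.b"));     // a\.b
        System.out.println(escapeNonWord("(hello)")); // \(hello\)
        String original = "1+1=2?";
        // The escaped text, used as a regex, matches the original literally.
        System.out.println(Pattern.matches(escapeNonWord(original), original)); // true
    }
}
```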
Use this utility function, escapeQuotes(), to escape strings placed inside groups and sets of a regular expression.
List of regex literals to escape: <([{\^-=$!|]})?*+.>
public class RegexUtils {
static String escapeChars = "\\.?![]{}()<>*+-=^$|";
public static String escapeQuotes(String str) {
if(str != null && str.length() > 0) {
return str.replaceAll("[\\W]", "\\\\$0"); // \W designates non-word characters
}
return "";
}
}
From the Pattern class documentation: the backslash character ('\') serves to introduce escaped constructs. The string literal "\(hello\)" is illegal and leads to a compile-time error; in order to match the string (hello) the string literal "\\(hello\\)" must be used.
Example: the string to be matched is (hello) and the regex with a group is (\(hello\)). From here you only need to escape the matched string, as shown below.
public static void main(String[] args) {
String matched = "(hello)", regexExpGrup = "(" + escapeQuotes(matched) + ")";
System.out.println("Regex : "+ regexExpGrup); // (\(hello\))
}
Agree with Gray, as you may need your pattern to have both literals (\[, \]) and meta-characters ([, ]). With a utility like this you can escape all the characters first and then add the meta-characters you want to the same pattern.
Use
Pattern p = Pattern.compile("\"");
String s = p.toString() + "yourcontent" + p.toString();
This gives yourcontent as is, wrapped in double quotes.

How to check if the string is a regular expression or not

I have a string. How I can check if the string is a regular expression or contains regular expression or it is a normal string?
The only reliable check you could do is if the String is a syntactically correct regular expression:
boolean isRegex;
try {
Pattern.compile(input);
isRegex = true;
} catch (PatternSyntaxException e) {
isRegex = false;
}
Note, however, that this will result in true even for strings like Hello World and I'm not a regex, because technically they are valid regular expressions.
The only cases where this will return false are strings that are not valid regular expressions, such as [unclosed character class or (unclosed group or +.
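The cases listed above can be checked with a short sketch. The class and method names are invented for the example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class SyntaxCheckDemo {
    // Hypothetical helper: true if the input compiles as a java.util.regex pattern.
    static boolean compiles(String input) {
        try {
            Pattern.compile(input);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("Hello World")); // true: plain text is a valid regex
        System.out.println(compiles("[unclosed"));   // false: unclosed character class
        System.out.println(compiles("(unclosed"));   // false: unclosed group
        System.out.println(compiles("+"));           // false: dangling quantifier
    }
}
```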
This is ugly but will detect simple regular expressions (with the caveat they must be designed for Java i.e. have the relevant back-slash character escaping).
public boolean isRegex(final String str) {
try {
java.util.regex.Pattern.compile(str);
return true;
} catch (java.util.regex.PatternSyntaxException e) {
return false;
}
}
Maybe you'd try to compile that regular expression using regexp package from Apache ( http://jakarta.apache.org/regexp/ ) and, if you get an exception then that's not a valid regexp so you'd say it's a normal string.
boolean validRE = true;
try {
RE re = new RE(stringToCheck);
} catch (RESyntaxException e) {
validRE = false;
}
Obviously, the user would have typed an invalid regexp and you'd be handling it as a normal string.
There is no difference between a 'normal' string and a regular expression. A regular expression is just a normal string which is used as a pattern to match occurrences of the pattern in another string.
As others have pointed out, it is possible that the string might not be a valid regular expression, but I think that is the only check you can do. If it is valid, then there is no way to know whether it is a regular expression or just a normal string, because it will be a regular expression.
It is just a normal string which is interpreted in a specific way by the regex engine.
For example, "blah" is a regular expression which will only match the string "blah" wherever it occurs in another string.
When looked at this way, you can see that a regular expression does not need to contain any of the 'special characters' that do more advanced pattern matching; without them it will only match the literal string in the pattern.
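As a small illustration of that point, a pattern with no special characters still goes through the regex engine and simply finds its own text. The class and method names below are hypothetical.

```java
import java.util.regex.Pattern;

final class PlainPatternDemo {
    // Hypothetical helper: true if the pattern occurs anywhere in the text.
    static boolean occursIn(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(occursIn("blah", "some blah text")); // true
        System.out.println(occursIn("blah", "some bla h text")); // false
    }
}
```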
If anyone just wants to distinguish plain text strings from regular expressions:
static boolean hasSpecialRegexCharacters(String s){
Pattern regexSpecialCharacters = Pattern
.compile("[\\\\\\.\\[\\]\\{\\}\\(\\)\\<\\>\\*\\+\\-\\=\\!\\?\\^\\$\\|]");
return regexSpecialCharacters.matcher(s).find();
}
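/**
* Heuristic: a plain string matches itself when used as a pattern, while a string
* containing regex constructs usually does not. This is not reliable: "." matches
* itself, and a syntactically invalid pattern throws PatternSyntaxException.
*/
public boolean isRegex(final String str) {
return str != null && !str.matches(str);
}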
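The self-match check above is only a heuristic, and a few inputs show where it holds and where it does not. This sketch uses a hypothetical static copy of the method so it can be run on its own.

```java
final class SelfMatchDemo {
    // Static copy of the heuristic, for illustration only.
    static boolean looksLikeRegex(String str) {
        return str != null && !str.matches(str);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeRegex("hello")); // false: matches itself
        System.out.println(looksLikeRegex("a*"));    // true: "a*" is not a run of a's
        System.out.println(looksLikeRegex("\\d"));   // true: backslash-d is not a digit
        System.out.println(looksLikeRegex("."));     // false, although . is a metacharacter
        System.out.println(looksLikeRegex("a.c"));   // false, although it contains a wildcard
    }
}
```

Inputs that are not valid patterns, such as "[abc", make String.matches throw PatternSyntaxException, so a caller would still need to handle that case.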
