I would like to add 2 numbers of backslash to the prefix of a RegEx to overcome dangling metacharacter exception. I am using the following piece of code:
if (pattern != null) {
pattern = pattern.trim();
if (pattern.length() > 0) {
if (pattern.indexOf("+[") >= 0) {
pattern = "\\\\" + pattern;
}
String formatterPattern ="formatter.setPattern(Pattern.compile(\"" + pattern + "\"));";
System.out.println("formatterPattern : " + formatterPattern);
}
}
But, I am getting,
formatter.setPattern(Pattern.compile("\+[0-9]{1,3}-[0-9()+-]{1,30}"));
instead of
formatter.setPattern(Pattern.compile("\\+[0-9]{1,3}-[0-9()+-]{1,30}"));
Even I tried with multiple \\s like, still, it adds only one backslash to the RegEx.
Related
I am currently working on a tool, which helps me to analyze a constantly growing String, that can look like this: String s = "AAAAAAABBCCCDDABQ". What I want to do is to find a sequence of A's and B's, do something and then remove that sequence from the original String.
My code looks like this:
while (someBoolean){
if(Pattern.matches("A+B+", s)) {
//Do stuff
//Remove the found pattern
}
if(Pattern.matches("C+D+", s)) {
//Do other stuff
//Remove the found pattern
}
}
return s;
Also, how I could remove the three sequences, so that s just contains "Q" at the end of the calculation, without and endless loop?
You should use a regex replacement loop, i.e. the methods appendReplacement(StringBuffer sb, String replacement) and appendTail(StringBuffer sb).
To find one of many patterns, use the | regex matcher, and capture each pattern separately.
You can then use group(int group) to get the matched string for each capture group (first group is group 1), which returns null if that group didn't match. For better performance, to simply check whether the group matched, use start(int group), which returns -1 if that group didn't match.
Example:
String s = "AAAAAAABBCCCDDABQ";
StringBuffer buf = new StringBuffer();
Pattern p = Pattern.compile("(A+B+)|(C+D+)");
Matcher m = p.matcher(s);
while (m.find()) {
if (m.start(1) != -1) { // Group 1 found
System.out.println("Found AB: " + m.group(1));
m.appendReplacement(buf, ""); // Replace matched substring with ""
} else if (m.start(2) != -1) { // Group 2 found
System.out.println("Found CD: " + m.group(2));
m.appendReplacement(buf, ""); // Replace matched substring with ""
}
}
m.appendTail(buf);
String remain = buf.toString();
System.out.println("Remain: " + remain);
Output
Found AB: AAAAAAABB
Found CD: CCCDD
Found AB: AB
Remain: Q
This solution assumes that the string always ends in Q.
String s="AAAAAAABBCCCDDABQ";
Pattern abPattern = Pattern.compile("A+B+");
Pattern cdPattern = Pattern.compile("C+D+");
while (s.length() > 1){
Matcher abMatcher = abPattern.matcher(s);
if (abMatcher.find()) {
s = abMatcher.replaceFirst("");
//Do other stuff
}
Matcher cdMatcher = cdPattern.matcher(s);
if (cdMatcher.find()) {
s = cdMatcher.replaceFirst("");
//Do other stuff
}
}
System.out.println(s);
You are probably looking for something like this:
String input = "AAAAAAABBCCCDDABQ";
String result = input;
String[] chars = {"A", "B", "C", "D"}; // chars to replace
for (String ch : chars) {
if (result.contains(ch)) {
String pattern = "[" + ch + "]+";
result = result.replaceAll(pattern, ch);
}
}
System.out.println(input); //"AAAAAAABBCCCDDABQ"
System.out.println(result); //"ABCDABQ"
This basically replace sequence of each character for single one.
If you want to remove the sequence completely, just replace ch to "" in replaceAll method parameters inside if body.
i have this problem:
i have to make a regular expression which take this urls:
http://www.amazon.it/TP-LINK-TL-WR841N-Wireless-300Mbps-Ethernet/dp/B001FWYGJS?ie=UTF8&redirect=true&ref_=s9_simh_gw_p147_d0_i2
http://www.amazon.it/gp/product/B014KMQWU0/
http://www.amazon.it/gp/product/glance/B014KMQWU0/
I need a regular expression which matches the full url until the ASIN of the product (ASIN is a word of 10 capital letters)
I have write this regex but not make what i want:
String regex="http:\\/\\/(?:www\\.|)amazon\\.com\\/(?:gp\\ product|| gp\\ product\\ glance || [^\\/]+\\/dp|dp)\\/([^\\/]{10})";
Pattern pattern=Pattern.compile(regex);
Matcher urlAmazonMatcher = pattern.matcher(url);
while (urlAmazonMatcher.find()) {
System.out.println("PROVA "+urlAmazonMatcher.group(0));
}
This is my solution. Finally it works :D
String regex="(http|www\\.)amazon\\.(com|it|uk|fr|de)\\/(?:gp\\/product|gp\\/product\\/glance|[^\\/]+\\/dp|dp)\\/([^\\/]{10})";
Pattern pattern=Pattern.compile(regex);
Matcher urlAmazonMatcher = pattern.matcher(url);
String toReturn = null;
while (urlAmazonMatcher.find()) {
toReturn=urlAmazonMatcher.group(0);
}
How about
/[^/?]{10}(/$|\?)
This matches 10 characters that are neither / nor ? following a slash if those characters are followed by a final slash or a question mark.
You can get the part that precedes or follows the ASIN using one of the various Matcher functions.
Here is my work from a previous project that was to extract URLs from text:
private Pattern getUriPattern() {
if(uriPattern == null) {
// taken from http://labs.apache.org/webarch/uri/rfc/rfc3986.html
//TODO implement the full URI syntax
String genDelims = "\\:\\/\\?\\#\\[\\]\\#";
String subDelims = "\\!\\$\\&\\'\\*\\+\\,\\;\\=";
String reserved = genDelims + subDelims;
String unreserved = "\\w\\-\\.\\~"; // i.e. ALPHA / DIGIT / "-" / "." / "_" / "~"
String allowed = reserved + unreserved;
// ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
uriPattern = Pattern.compile("((?:[^\\:/\\?\\#]+:)?//[" + allowed + "&&[^\\?\\#]]*(?:\\?([" + allowed + "&&[^\\#]]*))?(?:\\#[" + allowed + "]*)?).*");
}
return uriPattern;
}
You can use the above method as follows:
Matcher uriMatcher =
getUriPattern().matcher(text);
if(uriMatcher.matches()) {
String candidateUriString = uriMatcher.group(1);
try {
new URI(candidateUriString); // check once again if you matched a URL
// your code here
} catch (Exception e) {
// error handling
}
}
This will catch the whole URL, including params. You can then split it up to the first occurence of '?' (if any) and take the first part. Of course, you can rework the regex too.
I want to replace all :variable (word starting with :) with ${variable}$.
For example,
:aks_num with ${aks_num}$
:brn_num with ${brn_num}$
Following is my code, which does not work:
public static void main(String[] argv) throws Exception
{
CharSequence chSeq = "AND ((:aks_num = -1) OR (aks_num = :aks_num AND ((:brn_num = -1) OR (brn_num = :brn_num))))";
// replaceAll also not working
//String s = chSeq.replaceAll(":\\([a-z_]*\\)","\\${ $1 \\}$");
Pattern p = Pattern.compile(":\\([a-z_]*\\)");
Matcher m = p.matcher(chSeq);
if (m.find()) {
System.out.println("Found value: " + m.group(0) );
System.out.println("Found value: " + m.group(1) );
System.out.println("Found value: " + m.group(2) );
} else {
System.out.println("NO MATCH");
}
}
While in shell script the following regex works perfectly:
s/:\([a-z_]*\)/${\1}$/g
:\\([a-z_]*\\) (with escaped parenthesis) means that you want to match expressions like :(aks_num). Obviously, there are no such expression in the input string. That explains why there are no matches.
Instead, if you want to use parenthesis in order to capture some variables, you should not escape the parenthesis.
Example :
CharSequence chSeq = "AND ((:aks_num = -1) OR (aks_num = :aks_num AND ((:brn_num = -1) OR (brn_num = :brn_num))))";
Pattern p = Pattern.compile(":([a-z_]*)");
Matcher m = p.matcher(chSeq);
while (m.find()) {
System.out.println("Found value: " + m.group(0)+". Captured : "+m.group(1));
}
Output:
Found value: :aks_num. Captured : aks_num
Found value: :aks_num. Captured : aks_num
Found value: :brn_num. Captured : brn_num
Found value: :brn_num. Captured : brn_num
CharSequence chSeq = "AND ((:aks_num = -1) OR (aks_num = :aks_num AND ((:brn_num = -1) OR (brn_num = :brn_num))))";
// replaceAll also not working
//String s = chSeq.replaceAll(":\\([a-z_]*\\)","\\${ $1 \\}$");
Pattern p = Pattern.compile(":(\\w+)");
Matcher m = p.matcher(chSeq);
while (m.find()) {
System.out.println("Found value: " + m.group(1) );
}
Ideone Demo
Working fine with replaceAll
Pattern p = Pattern.compile("(:\\w+)");
Matcher m = p.matcher(x);
x = m.replaceAll("\\${$1}\\$");
You don't need to escape the parentheses, so
Pattern.compile(":([a-z_]*)");
should work.
I believe you got confused with the Java's regex syntax that is different from regular sed syntax. You do not need to escape parentheses to make them "special" grouping operators. Vice versa, in Java, when you escape parentheses, they start matching literal ( and ) symbols.
In the replacement pattern, $ must be escaped for the regex engine to replace with literal $ symbols, but you do not need to escape braces there.
So, just use
.replaceAll(":([a-z_]+)", "\\${$1}\\$")
See the IDEONE demo
I suggest the + quantifier because I doubt you need to match a : followed with a space, or digits - any non-letter.
BTW, you do not need any /g flag in Java since replaceAll will replace all matches with the provided replacement pattern.
NOTE: you can further adjust the pattern to match all letters/digits/underscores with ":(\\w+)". Or just alphanumerics/underscore: ":([\\p{Alnum}_]+)".
I am facing a little difficulty with a Syntax highlighter that I've made and is 90% complete. What it does is that it reads in the text from the source of a .java file, detects keywords, comments, etc and writes a (colorful) output in an HTML file. Sample output from it is:
(I couldn't upload a whole html page, so this is a screenshot.) As (I hope) you can see, my program seems to work correctly with keywords, literals and comments (see below) and hence can normally document almost all programs. But it seems to break apart when I store the escape sequence for " i.e. \" inside a String. An error case is shown below:
The string literal highlighting doesn't stop at the end of the literal, but continues until it finds another cue, like a keyword or another literal.
So, the question is how do I disguise/hide/remove this \" from within a String?
The stringFilter method of my program is:
public String stringFilter(String line) {
if (line == null || line.equals("")) {
return "";
}
StringBuffer buf = new StringBuffer();
if (line.indexOf("\"") <= -1) {
return keywordFilter(line);
}
int start = 0;
int startStringIndex = -1;
int endStringIndex = -1;
int tempIndex;
//Keep moving through String characters until we want to stop...
while ((tempIndex = line.indexOf("\"")) > -1 && !isInsideString(line, tempIndex)) {
//We found the beginning of a string
if (startStringIndex == -1) {
startStringIndex = 0;
buf.append( stringFilter(line.substring(start,tempIndex)) );
buf.append("</font>");
buf.append(literal).append("\"");
line = line.substring(tempIndex+1);
}
//Must be at the end
else {
startStringIndex = -1;
endStringIndex = tempIndex;
buf.append(line.substring(0,endStringIndex+1));
buf.append("</font>");
buf.append(normal);
line = line.substring(endStringIndex+1);
}
}
buf.append( keywordFilter(line) );
return buf.toString();
}
EDIT
in response to the first few comments and answers, here's what I tried:
A snippet from htmlFilter(String), but it doesn't work :(
//replace '&' i.e. ampersands with HTML escape sequence for ampersand.
line = line.replaceAll("&", "&");
//line = line.replaceAll(" ", " ");
line = line.replaceAll("" + (char)35, "#");
// replace less-than signs which might be confused
// by HTML as tag angle-brackets;
line = line.replaceAll("<", "<");
// replace greater-than signs which might be confused
// by HTML as tag angle-brackets;
line = line.replaceAll(">", ">");
line = multiLineCommentFilter(line);
//replace the '\\' i.e. escape for backslash with HTML escape sequences.
//fixes a problem when backslashes preceed quotes.
//line = line.replaceAll("\\\"", "\"");
//line = line.replaceAll("" + (char)92 + (char)92, "\\");
return line;
My idea is that when a backslash is met, ignore the next character.
String str = "blah\"blah\\blah\n";
int index = 0;
while (true) {
// find the beginning
while (index < str.length() && str.charAt(index) != '\"')
index++;
int beginIndex = index;
if (index == str.length()) // no string found
break;
index++;
// find the ending
while (index < str.length()) {
if (str.charAt(index) == '\\') {
// escape, ignore the next character
index += 2;
} else if (str.charAt(index) == '\"') {
// end of string found
System.out.println(beginIndex + " " + index);
break;
} else {
// plain content
index++;
}
}
if (index >= str.length())
throw new IllegalArgumentException(
"String literal is not properly closed by a double-quote");
index++;
}
Check for char found at tempIndex-1 it it is \ then don't consider as beginning or ending of string.
String originalLine=line;
if ((tempIndex = originalLine.indexOf("\"", tempIndex + 1)) > -1) {
if (tempIndex==0 || originalLine.charAt(tempIndex - 1) != '\\') {
...
Steps to follow:
First replace all \" with some temp string such as
String tempStr="forward_slash_followed_by_double_quote";
line = line.replaceAll("\\\\\"", tempStr);
//line = line.replaceAll("\\\"", tempStr);
do what ever you are doing
Finally replace that temp string with \"
line = line.replaceAll(tempStr, "\\\\\"");
//line = line.replaceAll(tempStr, "\\\"");
The trouble with finding a quote and then trying to work out whether it's escaped is that it's not enough to simply look at the previous character to see if it's a backslash - consider
String basedir = "C:\\Users\\";
where the \" isn't an escaped quote, but is actually an escaped backslash followed by an unescaped quote. In general a quote preceded by an odd number of backslashes is escaped, one preceded by an even number of backslashes isn't.
A more sensible approach would be to parse through the string one character at a time from left to right rather than trying to jump ahead to quote characters. If you don't want to have to learn a proper parser generator like JavaCC or antlr then you can tackle this case with regular expressions using the \G anchor (to force each subsequent match to start at the end of the previous one with no gaps) - if we assume that str is a substring of your input starting with the character following the opening quote of a string literal then
Pattern p = Pattern.compile("\\G(?:\\\\u[0-9A-Fa-f]{4}|\\\\.|[^\"\\\\])");
StringBuilder buf = new StringBuilder();
Matcher m = p.matcher(str);
while(m.find()) buf.append(m.group());
will leave buf containing the content of the string literal up to but not including the closing quote, and will handle escapes like \", \\ and unicode escapes \uNNNN.
Use double slash "\\"" instead of "\""... Maybe it works...
INPUT
Input can be in any of the form shown below with following mandatory content TXT{Any comma separated strings in any format}
String loginURL = "http://ip:port/path?username=abcd&location={LOCATION}&TXT{UE-IP,UE-Username,UE-Password}&password={PASS}";
String loginURL1 = "http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&TXT{UE-IP,UE-Username,UE-Password}";
String loginURL2 = "http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}&username=abcd&location={LOCATION}&password={PASS}";
String loginURL3 = "http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}";
String loginURL4 = "http://ip:port/path?username=abcd&password={PASS}";
Required Output
1. OutputURL corresponding to loginURL.
String outputURL = "http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}";
String outputURL1 = "http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}";
String outputURL2 = "http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}";
String outputURL3 = "http://ip:port/path?";
String outputURL4 = "http://ip:port/path?username=abcd&password={PASS}";
2. Deleted pattern(if any)
String deletedPatteren = TXT{UE-IP,UE-Username,UE-Password}
My Attempts
String loginURLPattern = TXT+"\\{([\\w-,]*)\\}&*";
System.out.println("1. ");
getListOfTemplates(loginURL, loginURLPattern);
System.out.println();
System.out.println("2. ");
getListOfTemplates(loginURL1, loginURLPattern);
System.out.println();
private static void getListOfTemplates(String inputSequence,String pattern){
System.out.println("Input URL : " + inputSequence);
Matcher templateMatcher = Pattern.compile(pattern).matcher(inputSequence);
if (templateMatcher.find() && templateMatcher.group(1).length() > 0) {
System.out.println(templateMatcher.group(1));
System.out.println("OutputURL : " + templateMatcher.replaceAll(""));
}
}
OUTPUT obtained
1.
Input URL : http://ip:port/path?username=abcd&location={LOCATION}&TXT{UE-IP,UE-Username,UE-Password}&password={PASS}
UE-IP,UE-Username,UE-Password}&password={PASS
OutputURL : http://ip:port/path?username=abcd&location={LOCATION}&
2.
Input URL : http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&TXT{UE-IP,UE-Username,UE-Password}
UE-IP,UE-Username,UE-Password
OutputURL : http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&
DRAWBACK OF ABOVE PATTERN
If i add any String containing character like #,%,# in between TXT{} then my code breaks.
How can i achieve it using java.util.regex library so that user can input any comma separated String between TXT{Any Comma Separated Strings}.
I would recommend using Matcher.appendReplacement:
public static void main(final String[] args) throws Exception {
final String[] loginURLs = {
"http://ip:port/path?username=abcd&location={LOCATION}&TXT{UE-IP,UE-Username,UE-Password}&password={PASS}",
"http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&TXT{UE-IP,UE-Username,UE-Password}",
"http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}&username=abcd&location={LOCATION}&password={PASS}",
"http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}",
"http://ip:port/path?username=abcd&password={PASS}"};
final Pattern patt = Pattern.compile("(\\?)?&?(TXT\\{[^}]++})(&)?");
for (final String loginURL : loginURLs) {
System.out.printf("%1$-10s %2$s%n", "Processing", loginURL);
final StringBuffer sb = new StringBuffer();
final Matcher matcher = patt.matcher(loginURL);
while (matcher.find()) {
final String found = matcher.group(2);
System.out.printf("%1$-10s %2$s%n", "Found", found);
if (matcher.group(1) != null && matcher.group(3) != null) {
matcher.appendReplacement(sb, "$1");
} else {
matcher.appendReplacement(sb, "$3");
}
}
matcher.appendTail(sb);
System.out.printf("%1$-10s %2$s%n%n", "Processed", sb.toString());
}
}
Output:
Processing http://ip:port/path?username=abcd&location={LOCATION}&TXT{UE-IP,UE-Username,UE-Password}&password={PASS}
Found TXT{UE-IP,UE-Username,UE-Password}
Processed http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}
Processing http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&TXT{UE-IP,UE-Username,UE-Password}
Found TXT{UE-IP,UE-Username,UE-Password}
Processed http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}
Processing http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}&username=abcd&location={LOCATION}&password={PASS}
Found TXT{UE-IP,UE-Username,UE-Password}
Processed http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}
Processing http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}
Found TXT{UE-IP,UE-Username,UE-Password}
Processed http://ip:port/path
Processing http://ip:port/path?username=abcd&password={PASS}
Processed http://ip:port/path?username=abcd&password={PASS}
As you rightly point out, there are 3 possible cases:
"?{TEXT}&" -> "?"
"&{TEXT}&" -> "&"
"?{TEXT}" -> ""
So what we need to do is test for those cases in the regex. Here is the pattern:
(\\?)?&?(TXT\\{[^}]++})(&)?
Explanation:
(\\?)? optionally matches and captures a ?
&? optionally captures an &
(TXT\\{[^}]++}) matches and captures TXT, followed by {, followed by one or most not } (possessively), followed by } (closing brackets don't need to be escaped
(&)? optionally matches and captures a &
We have 3 groups:
potentially a ?
the required text
potentially an &
Now when we find a match we need to replace with the appropriate capture of case 1..3
if (matcher.group(1) != null && matcher.group(3) != null) {
matcher.appendReplacement(sb, "$1");
} else {
matcher.appendReplacement(sb, "$3");
}
If groups 1 and 3 are both present:
We must be in case 1; we must replace with "?" which is in group 1 so $1.
Otherwise we are in case 2 or 3:
In case 2 we need to replace with "&" and in 3 with "".
In case 2 group 3 will hold "&" and in case 3 it will hold "" so we can replace with $3 in both these cases.
Here I only capture the TXT{...} part using a match group. This means that although the leading ? or & is replaced it is not in the String found. I you only want the bit between {} then just move the parenthesis.
Note that I reuse the Pattern - you can also reuse the Matcher if performance is a concern. You should always reuse the Pattern as it is (very) expensive to create. Store it in a static final if you can - it's threadsafe, matchers are not. The usual way to do it is to store the Pattern in a static final and then reuse the Matcher in the context of a method.
Also, the use of Matcher.appendReplacement is much more efficient than your current approach as it only needs to process the input once. Your approach parses the string twice.