Regex for String with possible escape characters


I asked this question some time ago (Regular expression that does not contain quote but can contain escaped quote) and got an answer, but I have not been able to make it work in Java.
Basically, I need a regular expression that matches a valid string that begins and ends with a quote and may contain quotes in between, provided they are escaped.
In the code below, I want all three strings to match and print true, but they do not.
What is the correct regex?
Thanks
public static void main(String[] args) {
    String[] arr = new String[]
    {
        "\"tuco\"",
        "\"tuco \" ABC\"",
        "\"tuco \" ABC \" DEF\""
    };
    Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");
    for (String str : arr) {
        Matcher matcher = pattern.matcher(str);
        System.out.println(matcher.matches());
    }
}

The problem is not so much your regex as your test strings. The single backslash before each internal quote in your second and third example strings is consumed when the Java compiler parses the string literal, so the string passed to the regex engine has no backslash before the quote. (Try printing it out.) Here is a tested version of your function which works as expected:
import java.util.regex.*;
public class TEST
{
    public static void main(String[] args) {
        String[] arr = new String[]
        {
            "\"tuco\"",
            "\"tuco \\\" ABC\"",
            "\"tuco \\\" ABC \\\" DEF\""
        };
        //old: Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");
        Pattern pattern = Pattern.compile(
            "# Match double quoted substring allowing escaped chars. \n" +
            "\" # Match opening quote. \n" +
            "( # $1: Quoted substring contents. \n" +
            " [^\"\\\\]* # {normal} Zero or more non-quote, non-\\. \n" +
            " (?: # Begin {(special normal*)*} construct. \n" +
            " \\\\. # {special} Escaped anything. \n" +
            " [^\"\\\\]* # more {normal} non-quote, non-\\. \n" +
            " )* # End {(special normal*)*} construct. \n" +
            ") # End $1: Quoted substring contents. \n" +
            "\" # Match closing quote. ",
            Pattern.DOTALL | Pattern.COMMENTS);
        for (String str : arr) {
            Matcher matcher = pattern.matcher(str);
            System.out.println(matcher.matches());
        }
    }
}
I've replaced your regex with an improved version (taken from MRE3, Mastering Regular Expressions, 3rd edition). Note that this question gets asked a lot. Please see this answer where I compare several functionally equivalent expressions.
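To make the point about literal parsing concrete, here is a minimal sketch that prints the runtime value of two literals and checks each against the regex from the question. The class name EscapeDemo and the helper isQuotedString are made up for this illustration.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Regex from the question: opening quote, then runs of ordinary characters
    // or backslash-escaped characters, then closing quote.
    static final Pattern QUOTED = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");

    // Hypothetical helper: true if the whole string is one valid quoted string.
    static boolean isQuotedString(String s) {
        return QUOTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // \" in a Java literal is just a quote; no backslash survives compilation.
        String noBackslash = "\"tuco \" ABC\"";
        // \\\" in a Java literal is a backslash followed by a quote.
        String withBackslash = "\"tuco \\\" ABC\"";

        System.out.println(noBackslash + " -> " + isQuotedString(noBackslash));
        System.out.println(withBackslash + " -> " + isQuotedString(withBackslash));
    }
}
```

Printing the two strings shows that only the second one actually contains a backslash, which is why only the second one matches.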

Related

Regex to capture the string starting with a specific word or character and ending with either one of two words

I want to capture the string after the last slash and before either the word (; sid=) or the character (?).
Sample data:
sessionId=30a793b1-ed7e-464a-a630; Url=https://www.example.com/mybook/order/newbooking/itemSummary; sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;
sessionId=sfdsdfsd-ba57-4e21-a39f-34; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW targetUrl=https://www.example.com/ mybook/order/newbooking/page1?id=123;
sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; Url=https://www.example.com/mybook/order/newbooking/; sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;
Expecting the below output:
1. itemSummary
2. itemList
3. ''(empty string)
I have built the regex below to capture it, but it is not 100% accurate: it captures some additional text.
Regex
Url=.*\/(.*)(; sid|\?)
Could you please help me improve the regex to get the desired output?
Thanks in advance!

You may use this regex in Java with a greedy match after Url=:
\bUrl=\S+/([^?;/]+)(?=; sid|\?)
RegEx Demo
RegEx Details:
\b: Word boundary
Url=: Match the text Url=
\S+/: Match 1+ non-whitespace characters followed by a /
([^?;/]+): Match 1+ characters that are not ?, ; or /
(?=; sid|\?): Lookahead to assert that we have ; sid or ? ahead
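The lookahead pattern can be tried out with a small sketch like the one below. The class name UrlTail, the helper segment, and the shortened sample lines are assumptions made for illustration. Because ([^?;/]+) needs at least one character, a Url that ends in / produces no match at all; the second pattern, with * instead of +, is an assumed variant (not part of the answer) that returns an empty string in that case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlTail {
    // Pattern from the answer: the captured segment must be non-empty.
    static final Pattern NON_EMPTY =
        Pattern.compile("\\bUrl=\\S+/([^?;/]+)(?=; sid|\\?)");
    // Assumed variant: '*' lets a URL that ends in '/' yield an empty segment.
    static final Pattern MAY_BE_EMPTY =
        Pattern.compile("\\bUrl=\\S+/([^?;/]*)(?=; sid|\\?)");

    // Returns group 1 of the first match, or null when nothing matches.
    static String segment(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened stand-ins for the three sample lines in the question.
        String[] lines = {
            "sessionId=1; Url=https://www.example.com/order/itemSummary; sid=abc; targetUrl=https://www.example.com/page1?id=1;",
            "sessionId=2; Url=https://www.example.com/order/itemList?id=7&type=new; sid=abc targetUrl=https://www.example.com/page1?id=2;",
            "sessionId=3; Url=https://www.example.com/order/; sid=abc; targetUrl=https://www.example.com/page1?id=3;"
        };
        for (String line : lines) {
            System.out.println("'" + segment(NON_EMPTY, line) + "' / '"
                + segment(MAY_BE_EMPTY, line) + "'");
        }
    }
}
```

Note that \b keeps the pattern from matching inside targetUrl=, since there is no word boundary between t and U.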

Alternative solution:
Used regex:
"^Url=.*/(\\w+|)$"
Regex in test bench and context:
public static void main(String[] args) {
    String input1 = "sessionId=30a793b1-ed7e-464a-a630; "
        + "Url=https://www.example.com/mybook/order/newbooking/itemSummary; "
        + "sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; "
        + "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;";
    String input2 = "sessionId=sfdsdfsd-ba57-4e21-a39f-34; "
        + "Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; "
        + "sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW "
        + "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=123;";
    String input3 = "sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; "
        + "Url=https://www.example.com/mybook/order/newbooking/; "
        + "sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; "
        + "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;";
    List<String> inputList = Arrays.asList(input1, input2, input3);
    // Pre-compiled Patterns should not be in loops - that is why they are placed outside the loops
    Pattern replaceWithNewLinePattern = Pattern.compile(";?\\s|\\?");
    Pattern extractWordFromUrlPattern = Pattern.compile("^Url=.*/(\\w+|)$", Pattern.MULTILINE);
    int count = 0;
    for (String input : inputList) {
        String inputWithNewLines = replaceWithNewLinePattern.matcher(input).replaceAll("\n");
        // System.out.println(inputWithNewLines); // Check the change...
        Matcher matcher = extractWordFromUrlPattern.matcher(inputWithNewLines);
        while (matcher.find()) {
            System.out.printf("%d. '%s'%n", ++count, matcher.group(1));
        }
    }
}
Output:
1. 'itemSummary'
2. 'itemList'
3. ''

Need help in regex matching

This may be very simple, but I am extremely new to regex and need to match some patterns in a string and extract the number from it. Below is my code with sample input and required output. I tried to construct the Pattern by referring to https://www.freeformatter.com/java-regex-tester.html, but the match itself returns false.
Pattern pattern = Pattern.compile(".*/(a-b|c-d|e-f)/([0-9])+(#[0-9]?)");
String str = "foo/bar/Samsung-Galaxy/a-b/1"; // need to extract 1.
String str1 = "foo/bar/Samsung-Galaxy/c-d/1#P2";// need to extract 2.
String str2 = "foo.com/Samsung-Galaxy/9090/c-d/69"; // need to extract 69
System.out.println("result " + pattern.matcher(str).matches());
System.out.println("result " + pattern.matcher(str1).matches());
System.out.println("result " + pattern.matcher(str2).matches());
All of the above println calls print false. I am using Java 8; is there any way to match the pattern and extract the digits from the string in a single statement?
It would be great if somebody could point me to how to debug and develop the regex. Please feel free to let me know if something in my question is not clear.

You may use
Pattern pattern = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");
When used with matches(), the pattern above does not require explicit anchors, ^ and $.
Details
.* - any 0+ chars other than line break chars, as many as possible
/ - the rightmost / that is followed with the subsequent subpatterns
(?:a-b|c-d|e-f) - a non-capturing group matching any of the alternatives inside: a-b, c-d or e-f
/ - a / char
[^/]*? - any chars other than /, as few as possible
([0-9]+) - Group 1: one or more digits.
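The remark about matches() not needing anchors can be illustrated by comparing it with find() on the same pattern. This is a minimal sketch; the class name MatchesVsFind and the helpers viaMatches and viaFind are invented for the comparison. matches() must consume the whole input, so the lazy [^/]*? is forced to expand until the digits sit at the very end, whereas find() stops at the first digits after the last matching segment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesVsFind {
    // Pattern from the answer, without explicit ^ and $.
    static final Pattern P = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");

    // matches() requires the entire input to match, as if the pattern were anchored.
    static String viaMatches(String s) {
        Matcher m = P.matcher(s);
        return m.matches() ? m.group(1) : null;
    }

    // find() accepts a match that ends anywhere in the input.
    static String viaFind(String s) {
        Matcher m = P.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "foo/bar/Samsung-Galaxy/c-d/1#P2",
            "foo/bar/Samsung-Galaxy/9090/a-b/69ace"
        };
        for (String s : inputs) {
            System.out.println(s + ": matches=" + viaMatches(s) + ", find=" + viaFind(s));
        }
    }
}
```

For 1#P2 the two calls disagree (2 with matches(), 1 with find()), and for 69ace only find() returns anything, so the choice of method is part of the pattern's meaning.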
Java demo:
List<String> strs = Arrays.asList(
    "foo/bar/Samsung-Galaxy/a-b/1",
    "foo/bar/Samsung-Galaxy/c-d/1#P2",
    "foo.com/Samsung-Galaxy/9090/c-d/69");
Pattern pattern = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");
for (String s : strs) {
    Matcher m = pattern.matcher(s);
    if (m.matches()) {
        System.out.println(s + ": \"" + m.group(1) + "\"");
    }
}
A replacing approach using the same regex with anchors added:
List<String> strs = Arrays.asList(
    "foo/bar/Samsung-Galaxy/a-b/1",
    "foo/bar/Samsung-Galaxy/c-d/1#P2",
    "foo.com/Samsung-Galaxy/9090/c-d/69");
String pattern = "^.*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)$";
for (String s : strs) {
    System.out.println(s + ": \"" + s.replaceFirst(pattern, "$1") + "\"");
}
Output:
foo/bar/Samsung-Galaxy/a-b/1: "1"
foo/bar/Samsung-Galaxy/c-d/1#P2: "2"
foo.com/Samsung-Galaxy/9090/c-d/69: "69"

Because you always match the last number in the string, I would just use replaceAll with the regex .*?(\d+)$:
String regex = ".*?(\\d+)$";
String strResult1 = str.replaceAll(regex, "$1");
System.out.println(!strResult1.isEmpty() ? "result " + strResult1 : "no result");
String strResult2 = str1.replaceAll(regex, "$1");
System.out.println(!strResult2.isEmpty() ? "result " + strResult2 : "no result");
String strResult3 = str2.replaceAll(regex, "$1");
System.out.println(!strResult3.isEmpty() ? "result " + strResult3 : "no result");
Note that if the string does not end with a number, replaceAll leaves it unchanged, so the result is empty only when the input itself is empty.
Outputs
result 1
result 2
result 69

Here is a one-liner using String#replaceAll:
public String getDigits(String input) {
    String number = input.replaceAll(".*/(?:a-b|c-d|e-f)/[^/]*?(\\d+)$", "$1");
    return number.matches("\\d+") ? number : "no match";
}
System.out.println(getDigits("foo.com/Samsung-Galaxy/9090/c-d/69"));
System.out.println(getDigits("foo/bar/Samsung-Galaxy/a-b/some other text/1"));
System.out.println(getDigits("foo/bar/Samsung-Galaxy/9090/a-b/69ace"));
69
no match
no match
This works on the sample inputs you provided. Note that I added logic which will display no match for the case where ending digits could not be matched fitting your pattern. In the case of a non-match, we would typically be left with the original input string, which would not be all digits.

Combined positive lookbehind and lookahead

I want to parse an array from a custom key-value protocol. It looks like this
RESPONSE GAMEINFO OK
NAME: "gamelobby"
PLAYERS: "alice", "bob", "hodor"
FLAGS: 1, 2, 3
In Java the String looks like this (it uses CRLF as the line break):
RESPONSE GAMEINFO OK\\r\\nNAME: \"gamelobby\"\\r\\nPLAYERS: \"alice\", \"bob\", \"hodor\"\\r\\nFLAGS: 1, 2, 3\\r\\n
I want to capture "alice", "bob", "hodor" as-is. So I used this regexp, which was tested in Sublime Text and on regex101.com (keys are case insensitive)
(?<=(?i:PLAYERS): )([A-Za-z0-9\s\.,:;\?!\n"_-]*)(?=\r\n)
When I try to capture the group, I get the next line too:
Pattern p = Pattern.compile("(?<=(?i:"+key+"): )([A-Za-z0-9\\s\\.,:;\\?!\\n\"_-]*)(?=\\r\\n)");
Matcher matcher = p.matcher(message);
matcher.find();
String value = new String();
try {
value = matcher.group(); // = "\"alice\", \"bob\", \"hodor\"\\r\\nFLAGS: 1, 2, 3"
} ...
NOTE: \" or \\\" don't seem to make a difference.
Why is FLAGS: 1, 2, 3 captured up to \\r\\n, instead of the match stopping at the end of the line above? Is combining positive lookbehind and lookahead possible? Which of the lookahead / lookbehind is evaluated first?
EDIT: Definition of the string array is
values = string *("," WSP string)
string = DQUOTE *(ALPHA / DIGIT / WSP / punctuation / "\n") DQUOTE
punctuation = "." / ":" / "," / ";" / "?" / "!" / "-" / "_"

Just write the code according to your grammar. The grammar doesn't seem ambiguous to me, so if you just follow it and compose your regex piece by piece, you are going to be alright:
String WHITESPACE_RE = "[ ]"; // Modify this according to your grammar
String PUNCTUATION_RE = "[.:,;?!_-]";
String STRING_RE = "\"(?:[A-Za-z0-9" + WHITESPACE_RE + PUNCTUATION_RE + "\n])*\"";
String VALUES_RE = STRING_RE + "(?:," + WHITESPACE_RE + STRING_RE + ")*";
String PLAYERS_RE = "PLAYERS:" + WHITESPACE_RE + "(" + VALUES_RE + ")(?=\r\n)";
Currently, \r\n is used to detect the line separator at the end of the PLAYERS entry. Change it to whatever your specification requires.
Caveat
This solution only works for parsing valid input. Parsing invalid input depends on your recovery algorithm and the line separator.
If the line separator allows for \n as well as \r\n, it is hard to recover from an error. For example, if there is a user named ABC\nFLAGS: 1, 2, 3 (allowed according to grammar), but the closing double quote is missing, the list of players will be broken, and you won't be able to tell whether FLAGS: is part of the previous line or a different header.
RESPONSE GAMEINFO OK
NAME: "gamelobby"
PLAYERS: "alice", "bob", "hodor", "ABC
FLAGS: 1, 2, 3
FLAGS: 1, 2, 3
Full example
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class SO28290386 {
    public static void main(String[] args) {
        String WHITESPACE_RE = "[ ]"; // Modify this according to your grammar
        String PUNCTUATION_RE = "[.:,;?!_-]";
        String STRING_RE = "\"(?:[A-Za-z0-9" + WHITESPACE_RE + PUNCTUATION_RE + "\n])*\"";
        String VALUES_RE = STRING_RE + "(?:," + WHITESPACE_RE + STRING_RE + ")*";
        String PLAYERS_RE = "PLAYERS:" + WHITESPACE_RE + "(" + VALUES_RE + ")(?=\r\n)";
        System.out.println(PLAYERS_RE);
        String input = "RESPONSE GAMEINFO OK\r\nNAME: \"gamelobby\"\r\nPLAYERS: \"alice\", \"bob\", \"hodor\", \"new\nline\"\r\nFLAGS: 1, 2, 3\r\n";
        System.out.println("INPUT");
        System.out.println(input);
        Pattern p = Pattern.compile(PLAYERS_RE);
        Matcher m = p.matcher(input);
        while (m.find()) {
            System.out.println(m.group(0));
            System.out.println(m.group(1));
        }
    }
}

You can use a non-greedy multiplier on the bracket expression:
(?<=(?i:PLAYERS): )([A-Za-z0-9\s\.,:;\?!\n"_-]*?)(?=\r\n)
The reason matching does not stop at the \r\n when you use the greedy multiplier * is that the bracket expression contains \s. The definition of \s (according to the documentation of the Pattern class) is [ \t\n\x0B\f\r], so the bracket expression barrels through the CRLF line terminator and everything else in its path until it reaches the end of the whole string, and only then backtracks to the last CRLF to satisfy the look-ahead.
I suppose if you were OK with explicitly preventing lone CRs from being present in the quoted-word list, then another viable solution would be to replace \s with an explicit [\n\t\f ], but I'll leave that up to you.
The non-greedy multiplier *? solution works because when the regex engine hits the first CRLF to satisfy the final look-ahead assertion, it stops matching, even though the bracket expression could gobble it up.
The test code on regex101 fails for the case where the string contains new line since the site doesn't seem to support CRs, so we can't really do a full test there. But in the real regex in the Java code, the look-ahead assertion would require a CRLF to terminate the search, so it would end up matching the whole quoted-word list.
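The difference between the greedy and the non-greedy multiplier can be checked with a short sketch such as the following. The class name GreedyVsLazy and the helper extract are made up here, and the message string is an assumed stand-in for the protocol response with real CR and LF characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Bracket expression from the question; \s inside it also matches \r and \n.
    static final String CHAR_CLASS = "[A-Za-z0-9\\s\\.,:;\\?!\\n\"_-]";

    // Builds the question's pattern with the given quantifier ("*" or "*?")
    // and returns the captured value, or null if nothing matches.
    static String extract(String quantifier, String message) {
        Pattern p = Pattern.compile(
            "(?<=(?i:PLAYERS): )(" + CHAR_CLASS + quantifier + ")(?=\\r\\n)");
        Matcher m = p.matcher(message);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String message = "RESPONSE GAMEINFO OK\r\nNAME: \"gamelobby\"\r\n"
            + "PLAYERS: \"alice\", \"bob\", \"hodor\"\r\nFLAGS: 1, 2, 3\r\n";

        // Greedy: runs to the end of the input, then backs off to the last CRLF.
        System.out.println("greedy: [" + extract("*", message) + "]");
        // Lazy: stops at the first position that is followed by CRLF.
        System.out.println("lazy:   [" + extract("*?", message) + "]");
    }
}
```

The greedy version returns the players list plus the FLAGS line, which is the behavior described in the question; the lazy version returns only the players list.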

How do I escape '+' in pattern matching to highlight keyword?

I'm implementing a keyword highlighter in Java. I'm using java.util.regex.Pattern to highlight (making bold) keyword within String content. The following piece of code is working fine for alphanumeric keywords, but it is not working for some special characters. For example, in String content, I would like to highlight the keyword c++ which has the special character + (plus), but it's not getting highlighted properly. How do I escape + character so that c++ is highlighted?
public static void main(String[] args)
{
    String content = "java,c++,ejb,struts,j2ee,hibernate";
    System.out.println("CONTENT: " + content);
    String highlight = "C++";
    System.out.println("HIGHLIGHT KEYWORD: " + highlight);
    //highlight = highlight.replaceAll(Pattern.quote("+"), "\\\\+");
    java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("\\b" + highlight + "\\b", java.util.regex.Pattern.CASE_INSENSITIVE);
    System.out.println("PATTERN: " + pattern.pattern());
    java.util.regex.Matcher matcher = pattern.matcher(content);
    while (matcher.find()) {
        System.out.println("Match found!!!");
        for (int i = 0; i <= matcher.groupCount(); i++) {
            System.out.println(matcher.group(i));
            content = matcher.replaceAll("<B>" + matcher.group(i) + "</B>");
        }
    }
    System.out.println("RESULT: " + content);
}
Output:
CONTENT: java,c++,ejb,struts,j2ee,hibernate
HIGHLIGHT KEYWORD: C++
PATTERN: \bC++\b
Match found!!!
c
RESULT: java,c++,ejb,struts,j2ee,hibernate
I even tried to escape '+' before calling Pattern.compile like this,
highlight = highlight.replaceAll(Pattern.quote("+"), "\\\\+");
but still I'm not able to get the syntax right. Can somebody help me solve this?

This should do what you need:
Pattern pattern = Pattern.compile(
"\\b"
+ Pattern.quote(highlight)
+ "\\b",
Pattern.CASE_INSENSITIVE);
Update: you are right, the above doesn't work for C++ (\b matches word boundaries and doesn't recognize ++ as a word). We need a more complicated solution:
Pattern pattern = Pattern.compile(
"\\b"
+ Pattern.quote(highlight)
+ "(?![^\\p{Punct}\\s])", // matches if the match is not followed by
// anything other than whitespace or punctuation
Pattern.CASE_INSENSITIVE);
Update in response to comments: it seems that you need more logic in your pattern creation. Here's a helper method to create the pattern for you:
private static final String WORD_BOUNDARY = "\\b";
// edit this to suit your needs:
private static final String ALLOWED = "[^,.!\\-\\s]";
private static final String LOOKAHEAD = "(?!" + ALLOWED + ")";
private static final String LOOKBEHIND = "(?<!" + ALLOWED + ")";
public static Pattern createHighlightPattern(final String highlight) {
    final Pattern pattern = Pattern.compile(
        (Character.isLetterOrDigit(highlight.charAt(0))
            ? WORD_BOUNDARY : LOOKBEHIND)
        + Pattern.quote(highlight)
        + (Character.isLetterOrDigit(highlight.charAt(highlight.length() - 1))
            ? WORD_BOUNDARY : LOOKAHEAD),
        Pattern.CASE_INSENSITIVE);
    return pattern;
}
And here is some test code to check that it works:
private static void testMatch(final String haystack, final String needle) {
    final Matcher matcher = createHighlightPattern(needle).matcher(haystack);
    if (!matcher.find())
        System.out.println("Failed to find pattern " + needle);
    while (matcher.find())
        System.out.println("Found additional match: " + matcher.group() +
            " for pattern " + needle);
}
public static void main(final String[] args) {
    final String testString = "java,c++,hibernate,.net,asp.net,c#,spring";
    testMatch(testString, "java");
    testMatch(testString, "c++");
    testMatch(testString, ".net");
    testMatch(testString, "c#");
}
When I run this method, I don't see any output (which is good :-))

The problem is that the trailing \b word boundary anchor does not match, because + is a non-word character and I assume it is followed by whitespace, which is also a non-word character.
A word boundary \b matches at a change from a word character (a member of \w) to a non-word character (not a member of \w), or vice versa.
Also, if you want to match a + literally you have to escape it. Here you are searching for C++, which means: match at least one C, where ++ is a possessive quantifier that matches one or more C and does not backtrack.
Try changing your pattern to something like this:
java.util.regex.Pattern.compile("\\b" + highlight + "(?=\\s)", java.util.regex.Pattern.CASE_INSENSITIVE);
(?=\s) is a positive lookahead that checks whether whitespace follows your highlight.
Additionally, you will need to escape the + you are searching for.
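The three effects discussed in this thread (the possessive reading of C++, the failing trailing \b, and a lookahead in its place) can be observed side by side with a sketch like this. The class PlusBoundary and its method names are invented for illustration; the lookahead is the (?![^\p{Punct}\s]) form from an earlier answer, and the content string is a shortened version of the one in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlusBoundary {
    // Returns the first case-insensitive match of regex in content, or null.
    static String firstMatch(String regex, String content) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(content);
        return m.find() ? m.group() : null;
    }

    // Keyword concatenated unescaped: "C++" is read as C with a possessive quantifier.
    static String unescaped(String keyword, String content) {
        return firstMatch("\\b" + keyword + "\\b", content);
    }

    // Keyword quoted, but the trailing \b cannot match between '+' and ','.
    static String quotedWithBoundary(String keyword, String content) {
        return firstMatch("\\b" + Pattern.quote(keyword) + "\\b", content);
    }

    // Keyword quoted, trailing \b replaced by a negative lookahead.
    static String quotedWithLookahead(String keyword, String content) {
        return firstMatch("\\b" + Pattern.quote(keyword) + "(?![^\\p{Punct}\\s])", content);
    }

    public static void main(String[] args) {
        String content = "java,c++,ejb";
        System.out.println("unescaped:         " + unescaped("C++", content));
        System.out.println("quoted + \\b:       " + quotedWithBoundary("C++", content));
        System.out.println("quoted + lookahead: " + quotedWithLookahead("C++", content));
    }
}
```

The unescaped version finds only the letter c, which matches the output shown in the question; the quoted version with a trailing \b finds nothing; the lookahead version finds c++.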

All you need is here :
Pattern.compile("\\Q"+highlight+"\\E", java.util.regex.Pattern.CASE_INSENSITIVE);

Assuming your keyword does not begin or end with punctuation, here is a commented regex which uses lookahead and lookbehind to achieve your desired matching behavior:
// Compile regex to match a keyword or keyphrase.
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile(
"(?<=[\\s'\".?!,;:]|^) # Word preceded by ws, quote, punct or BOS.\n" +
// Escape any regex metacharacters in the keyword phrase.
java.util.regex.Pattern.quote(highlight) + " # Keyword to be matched.\n" +
"(?=[\\s'\".?!,;:]|$) # Word followed by ws, quote, punct or EOS.",
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.COMMENTS);
Note that this solution works even if your keyword is a phrase containing spaces.

RegEx To Ignore Text Between Quotes

I have a Regex, which is [\\.|\\;|\\?|\\!][\\s]
This is used to split a string. But I don't want it to split . ; ? ! if it is in quotes.

I'd not use split but Pattern & Matcher instead.
A demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        String text = "start. \"in quotes!\"; foo? \"more \\\" words\"; bar";
        String simpleToken = "[^.;?!\\s\"]+";
        String quotedToken =
            "(?x) # enable inline comments and ignore white spaces in the regex \n" +
            "\" # match a double quote \n" +
            "( # open group 1 \n" +
            " \\\\. # match a backslash followed by any char (other than line breaks) \n" +
            " | # OR \n" +
            " [^\\\\\r\n\"] # any character other than a backslash, line breaks or double quote \n" +
            ") # close group 1 \n" +
            "* # repeat group 1 zero or more times \n" +
            "\" # match a double quote \n";
        String regex = quotedToken + "|" + simpleToken;
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            System.out.println("> " + m.group());
        }
    }
}
which produces:
> start
> "in quotes!"
> foo
> "more \" words"
> bar
As you can see, it can also handle escaped quotes inside quoted tokens.

Here is what I do in order to ignore quotes in matches.
(?:[^\"\']|(?:\".*?\")|(?:\'.*?\'))*? # <-- append the query you wanted to search for - don't use something greedy like .* in the rest of your regex.
To adapt this for your regex, you could do
(?:[^\"\']|(?:\".*?\")|(?:\'.*?\'))*?[.;?!]\s*
