Combined positive lookbehind and lookahead - java

I want to parse an array from a custom key-value protocol. It looks like this
RESPONSE GAMEINFO OK
NAME: "gamelobby"
PLAYERS: "alice", "bob", "hodor"
FLAGS: 1, 2, 3
In Java, the String looks like this (it uses CRLF as the line break):
RESPONSE GAMEINFO OK\\r\\nNAME: \"gamelobby\"\\r\\nPLAYERS: \"alice\", \"bob\", \"hodor\"\\r\\nFLAGS: 1, 2, 3\\r\\n
I want to capture "alice", "bob", "hodor" as-is. So I used this regexp, which was tested in Sublime Text and on regex101.com (keys are case insensitive)
(?<=(?i:PLAYERS): )([A-Za-z0-9\s\.,:;\?!\n"_-]*)(?=\r\n)
This is a screenshot from Sublime Text (note: I left out \r here):
When I try to capture the group, I get the next line too:
Pattern p = Pattern.compile("(?<=(?i:"+key+"): )([A-Za-z0-9\\s\\.,:;\\?!\\n\"_-]*)(?=\\r\\n)");
Matcher matcher = p.matcher(message);
matcher.find();
String value = new String();
try {
value = matcher.group(); // = "\"alice\", \"bob\", \"hodor\"\\r\\nFLAGS: 1, 2, 3"
} ...
NOTE: \" or \\\" don't seem to make a difference.
Why does the match run through FLAGS: 1, 2, 3 up to the final \\r\\n instead of stopping at the \\r\\n on the line above? Can a positive lookbehind and a positive lookahead be combined? Which of the two is evaluated first?
EDIT: Definition of the string array is
values = string*("," WSP string)
string = DQUOTE *(ALPHA / DIGIT / WSP / punctuation / "\n") DQUOTE
punctuation = "." / ":" / "," / ";" / "?" / "!" / "-" / "_"

Just write the code according to your grammar. The grammar doesn't seem ambiguous to me, so if you just follow it and compose your regex piece by piece, you are going to be alright:
String WHITESPACE_RE = "[ ]"; // Modify this according to your grammar
String PUNCTUATION_RE = "[.:,;?!_-]";
String STRING_RE = "\"(?:[A-Za-z0-9" + WHITESPACE_RE + PUNCTUATION_RE + "\n])*\"";
String VALUES_RE = STRING_RE + "(?:," + WHITESPACE_RE + STRING_RE + ")*";
String PLAYERS_RE = "PLAYERS:" + WHITESPACE_RE + "(" + VALUES_RE + ")(?=\r\n)";
Currently, \r\n is used to check for the line separator at the end of the PLAYERS entry. Change it to whatever your specification requires.
Caveat
This solution only works for parsing valid input. Parsing invalid input depends on your recovery algorithm and the line separator.
If the line separator allows for \n as well as \r\n, it is hard to recover from an error. For example, if there is a user named ABC\nFLAGS: 1, 2, 3 (allowed according to grammar), but the closing double quote is missing, the list of players will be broken, and you won't be able to tell whether FLAGS: is part of the previous line or a different header.
RESPONSE GAMEINFO OK
NAME: "gamelobby"
PLAYERS: "alice", "bob", "hodor", "ABC
FLAGS: 1, 2, 3
FLAGS: 1, 2, 3
Full example
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class SO28290386 {
public static void main(String[] args) {
String WHITESPACE_RE = "[ ]"; // Modify this according to your grammar
String PUNCTUATION_RE = "[.:,;?!_-]";
String STRING_RE = "\"(?:[A-Za-z0-9" + WHITESPACE_RE + PUNCTUATION_RE + "\n])*\"";
String VALUES_RE = STRING_RE + "(?:," + WHITESPACE_RE + STRING_RE + ")*";
String PLAYERS_RE = "PLAYERS:" + WHITESPACE_RE + "(" + VALUES_RE + ")(?=\r\n)";
System.out.println(PLAYERS_RE);
String input = "RESPONSE GAMEINFO OK\r\nNAME: \"gamelobby\"\r\nPLAYERS: \"alice\", \"bob\", \"hodor\", \"new\nline\"\r\nFLAGS: 1, 2, 3\r\n";
System.out.println("INPUT");
System.out.println(input);
Pattern p = Pattern.compile(PLAYERS_RE);
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println(m.group(0));
System.out.println(m.group(1));
}
}
}

You can use a non-greedy multiplier on the bracket expression:
(?<=(?i:PLAYERS): )([A-Za-z0-9\s\.,:;\?!\n"_-]*?)(?=\r\n)
The reason matching does not stop at the first \r\n when you use the greedy multiplier * is that the bracket expression contains \s. The definition of \s (according to the documentation of the Pattern class) is [ \t\n\x0B\f\r], so the bracket expression barrels through the CRLF line terminator and everything else in its path until it reaches the end of the whole string. The engine then backtracks only as far as the last \r\n, which is why the FLAGS line ends up in the match.
I suppose if you were OK with explicitly preventing lone CRs from being present in the quoted-word list, then another viable solution would be to replace \s with an explicit [\n\t\f ], but I'll leave that up to you.
The non-greedy multiplier *? solution works because when the regex engine hits the first CRLF to satisfy the final look-ahead assertion, it stops matching, even though the bracket expression could gobble it up.
The test code on regex101 fails for the case where the string contains new line since the site doesn't seem to support CRs, so we can't really do a full test there. But in the real regex in the Java code, the look-ahead assertion would require a CRLF to terminate the search, so it would end up matching the whole quoted-word list.
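The difference between the greedy and the non-greedy multiplier can be checked with a small program. This is a minimal sketch, assuming a shortened two-player message; the class name GreedyVsLazy and the helper firstMatch are made up for illustration and are not part of the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Same bracket expression as in the question; \s also covers \r and \n.
    static final String CLASS = "[A-Za-z0-9\\s.,:;?!\"_-]";
    static final String GREEDY = "(?<=(?i:PLAYERS): )" + CLASS + "*(?=\\r\\n)";
    static final String LAZY = "(?<=(?i:PLAYERS): )" + CLASS + "*?(?=\\r\\n)";

    // Hypothetical helper: returns the first match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "NAME: \"gamelobby\"\r\nPLAYERS: \"alice\", \"bob\"\r\nFLAGS: 1, 2, 3\r\n";
        // Greedy: consumes to the end of the input, then backtracks to the last CRLF.
        System.out.println("[" + firstMatch(GREEDY, input) + "]");
        // Lazy: stops at the first position where the lookahead sees CRLF.
        System.out.println("[" + firstMatch(LAZY, input) + "]");
    }
}
```

With the greedy pattern the match includes the FLAGS line; with the lazy pattern it ends right after the closing quote of the last player.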

Related

Regex to capture the string starting with a specific word or character and ending with either one of the words

Want to capture the string after the last slash and before either a (; sid=) word or a (?) character.
sample data:
sessionId=30a793b1-ed7e-464a-a630; Url=https://www.example.com/mybook/order/newbooking/itemSummary; sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;
sessionId=sfdsdfsd-ba57-4e21-a39f-34; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW targetUrl=https://www.example.com/ mybook/order/newbooking/page1?id=123;
sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; Url=https://www.example.com/mybook/order/newbooking/; sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;
Expecting the below output:
1. itemSummary
2. itemList
3. ''(empty string)
I have built the regex below to capture it, but it is not 100% accurate: it captures some additional text.
Regex
Url=.*\/(.*)(; sid|\?)
Could you please help me to improve the regex to get desired output?
Thanks in advance!
You may use this regex in Java with a greedy match after Url=:
\bUrl=\S+/([^?;/]+)(?=; sid|\?)
RegEx Details:
\b: Word boundary
Url=: Match text Url=
\S+/: Match 1+ non-whitespace characters followed by a /
([^?;/]+): Match 1+ characters that are not ?, ; or /
(?=; sid|\?): Lookahead to assert that we have ; sid or ? ahead
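As a rough check, the regex above can be run against shortened versions of the three sample lines. This is only a sketch: the class name LastSegmentDemo, the helper lastSegment and the shortened inputs are assumptions made for illustration. Because ([^?;/]+) needs at least one character, the third sample (URL ending in /) produces no match with the regex as written; the LENIENT variant, which swaps + for *, is an assumed tweak that returns an empty group instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastSegmentDemo {
    // The regex from the answer: group 1 requires at least one character.
    static final Pattern STRICT = Pattern.compile("\\bUrl=\\S+/([^?;/]+)(?=; sid|\\?)");
    // Assumed variant: * instead of +, so an empty last segment also matches.
    static final Pattern LENIENT = Pattern.compile("\\bUrl=\\S+/([^?;/]*)(?=; sid|\\?)");

    // Hypothetical helper: returns group 1 of the first match, or null if none.
    static String lastSegment(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "sessionId=1; Url=https://www.example.com/mybook/order/newbooking/itemSummary; sid=abc; targetUrl=https://www.example.com/page1?id=122;",
            "sessionId=2; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&type=new; sid=def targetUrl=https://www.example.com/page1?id=123;",
            "sessionId=3; Url=https://www.example.com/mybook/order/newbooking/; sid=ghi; targetUrl=https://www.example.com/page1?id=343;"
        };
        for (String line : lines) {
            System.out.println("strict='" + lastSegment(STRICT, line)
                    + "' lenient='" + lastSegment(LENIENT, line) + "'");
        }
    }
}
```

The word boundary \b keeps the pattern from matching inside targetUrl, since there is no boundary between t and U.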
Alternative solution:
Used regex:
"^Url=.*/(\\w+|)$"
Regex in test bench and context:
public static void main(String[] args) {
String input1 = "sessionId=30a793b1-ed7e-464a-a630; "
+ "Url=https://www.example.com/mybook/order/newbooking/itemSummary; "
+ "sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; "
+ "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;";
String input2 = "sessionId=sfdsdfsd-ba57-4e21-a39f-34; "
+ "Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; "
+ "sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW "
+ "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=123;";
String input3 = "sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; "
+ "Url=https://www.example.com/mybook/order/newbooking/; "
+ "sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; "
+ "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;";
List<String> inputList = Arrays.asList(input1, input2, input3);
// Pre-compiled Patterns should not be in loops - that is why they are placed outside the loops
Pattern replaceWithNewLinePattern = Pattern.compile(";?\\s|\\?");
Pattern extractWordFromUrlPattern = Pattern.compile("^Url=.*/(\\w+|)$", Pattern.MULTILINE);
int count = 0;
for(String input : inputList) {
String inputWithNewLines = replaceWithNewLinePattern.matcher(input).replaceAll("\n");
// System.out.println(inputWithNewLines); // Check the change...
Matcher matcher = extractWordFromUrlPattern.matcher(inputWithNewLines);
while (matcher.find()) {
System.out.printf( "%d. '%s'%n", ++count, matcher.group(1));
}
}
}
Output:
1. 'itemSummary'
2. 'itemList'
3. ''

Need help in regex matching

It may be very simple, but I am extremely new to regex and have a requirement where I need to do some regex matches in a string and extract the number in it. Below is my code with sample i/p and required o/p. I tried to construct the Pattern by referring to https://www.freeformatter.com/java-regex-tester.html, but my regex match itself is returning false.
Pattern pattern = Pattern.compile(".*/(a-b|c-d|e-f)/([0-9])+(#[0-9]?)");
String str = "foo/bar/Samsung-Galaxy/a-b/1"; // need to extract 1.
String str1 = "foo/bar/Samsung-Galaxy/c-d/1#P2";// need to extract 2.
String str2 = "foo.com/Samsung-Galaxy/9090/c-d/69"; // need to extract 69
System.out.println("result " + pattern.matcher(str).matches());
System.out.println("result " + pattern.matcher(str1).matches());
System.out.println("result " + pattern.matcher(str2).matches());
All of the above println calls print false. I am using Java 8. Is there any way to match the pattern and extract the digits from the string in a single statement?
It would be great if somebody could point me to how to debug and develop the regex. Please feel free to let me know if something is not clear in my question.
You may use
Pattern pattern = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");
See the regex demo
When used with matches(), the pattern above does not require explicit anchors, ^ and $.
Details
.* - any 0+ chars other than line break chars, as many as possible
/ - the rightmost / that is followed with the subsequent subpatterns
(?:a-b|c-d|e-f) - a non-capturing group matching any of the alternatives inside: a-b, c-d or e-f
/ - a / char
[^/]*? - any chars other than /, as few as possible
([0-9]+) - Group 1: one or more digits.
Java demo:
List<String> strs = Arrays.asList("foo/bar/Samsung-Galaxy/a-b/1","foo/bar/Samsung-Galaxy/c-d/1#P2","foo.com/Samsung-Galaxy/9090/c-d/69");
Pattern pattern = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");
for (String s : strs) {
Matcher m = pattern.matcher(s);
if (m.matches()) {
System.out.println(s + ": \"" + m.group(1) + "\"");
}
}
A replacing approach using the same regex with anchors added:
List<String> strs = Arrays.asList("foo/bar/Samsung-Galaxy/a-b/1","foo/bar/Samsung-Galaxy/c-d/1#P2","foo.com/Samsung-Galaxy/9090/c-d/69");
String pattern = "^.*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)$";
for (String s : strs) {
System.out.println(s + ": \"" + s.replaceFirst(pattern, "$1") + "\"");
}
See another Java demo.
Output:
foo/bar/Samsung-Galaxy/a-b/1: "1"
foo/bar/Samsung-Galaxy/c-d/1#P2: "2"
foo.com/Samsung-Galaxy/9090/c-d/69: "69"
Because you always match the last number in the string, I would just use replaceAll with this regex: .*?(\d+)$
String regex = ".*?(\\d+)$";
String strResult1 = str.replaceAll(regex, "$1");
System.out.println(!strResult1.isEmpty() ? "result " + strResult1 : "no result");
String strResult2 = str1.replaceAll(regex, "$1");
System.out.println(!strResult2.isEmpty() ? "result " + strResult2 : "no result");
String strResult3 = str2.replaceAll(regex, "$1");
System.out.println(!strResult3.isEmpty() ? "result " + strResult3 : "no result");
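Note that replaceAll returns the input unchanged when the regex does not match, so for a string without trailing digits the result is the original string rather than an empty one. Check that the result consists only of digits if that case can occur.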
Outputs
result 1
result 2
result 69
Here is a one-liner using String#replaceAll:
public String getDigits(String input) {
String number = input.replaceAll(".*/(?:a-b|c-d|e-f)/[^/]*?(\\d+)$", "$1");
return number.matches("\\d+") ? number : "no match";
}
System.out.println(getDigits("foo.com/Samsung-Galaxy/9090/c-d/69"));
System.out.println(getDigits("foo/bar/Samsung-Galaxy/a-b/some other text/1"));
System.out.println(getDigits("foo/bar/Samsung-Galaxy/9090/a-b/69ace"));
Output:
69
no match
no match
This works on the sample inputs you provided. Note that I added logic which will display no match for the case where ending digits could not be matched fitting your pattern. In the case of a non-match, we would typically be left with the original input string, which would not be all digits.

Regex for String with possible escape characters

I asked this question some time ago here, Regular expression that does not contain quote but can contain escaped quote, and got a response, but somehow I am not able to make it work in Java.
Basically I need to write a regular expression that matches a valid string beginning and ending with quotes, which can have quotes in between provided they are escaped.
In the code below, I essentially want to match all three strings and print true, but cannot.
What should be the correct regex?
Thanks
public static void main(String[] args) {
String[] arr = new String[]
{
"\"tuco\"",
"\"tuco \" ABC\"",
"\"tuco \" ABC \" DEF\""
};
Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");
for (String str : arr) {
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.matches());
}
}
The problem is not so much your regex, but rather your test strings. The single backslash before each internal quote in your second and third example strings is consumed when the Java string literal is parsed. The string being passed to the regex engine has no backslash before the quote. (Try printing it out.) Here is a tested version of your function which works as expected:
import java.util.regex.*;
public class TEST
{
public static void main(String[] args) {
String[] arr = new String[]
{
"\"tuco\"",
"\"tuco \\\" ABC\"",
"\"tuco \\\" ABC \\\" DEF\""
};
//old: Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");
Pattern pattern = Pattern.compile(
"# Match double quoted substring allowing escaped chars. \n" +
"\" # Match opening quote. \n" +
"( # $1: Quoted substring contents. \n" +
" [^\"\\\\]* # {normal} Zero or more non-quote, non-\\. \n" +
" (?: # Begin {(special normal*)*} construct. \n" +
" \\\\. # {special} Escaped anything. \n" +
" [^\"\\\\]* # more {normal} non-quote, non-\\. \n" +
" )* # End {(special normal*)*} construct. \n" +
") # End $1: Quoted substring contents. \n" +
"\" # Match closing quote. ",
Pattern.DOTALL | Pattern.COMMENTS);
for (String str : arr) {
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.matches());
}
}
}
I've replaced your regex with an improved version (taken from MRE3, Mastering Regular Expressions, 3rd edition). Note that this question gets asked a lot. Please see this answer where I compare several functionally equivalent expressions.
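The point about the test strings can be verified directly. The following is a small sketch using the regex from the question; the class name EscapedQuoteDemo and the helper isQuoted are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class EscapedQuoteDemo {
    // The regex from the question: a quoted string that may contain backslash escapes.
    static final Pattern QUOTED = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");

    // Hypothetical helper: true if the whole string is one valid quoted string.
    static boolean isQuoted(String s) {
        return QUOTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // In Java source, \" is just a quote: no backslash reaches the regex engine.
        String noBackslash = "\"tuco \" ABC\"";
        // \\\" is a backslash followed by a quote at runtime.
        String withBackslash = "\"tuco \\\" ABC\"";

        System.out.println(noBackslash + " -> " + isQuoted(noBackslash));
        System.out.println(withBackslash + " -> " + isQuoted(withBackslash));
    }
}
```

Printing the strings shows that only the second one actually contains a backslash, and only that one matches.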

Using a JTextField to get a regular expression from a user. How do I make it see \t as a tab instead of a \ followed by a t

JTextField reSource; //contains the regex expression the user wants to search for
String re=reSource.getText();
Pattern p=Pattern.compile(re,myflags); //myflags defined elsewhere in code
Matcher m=p.matcher(src); //src is the text to search and comes from a JTextArea
while (m.find()==true) {
If the user enters \t it finds \t not tab.
If the user enters \\\t it finds \\\t not tab.
If the user enters [\t] or [\\\t] it finds t not tab.
I want it such that if the user enters \t it finds tab. Of course it also needs to work with \n, \r etc...
If re="\t"; is used instead of re=reSource.getText(); with \t in the JTextField then it finds tabs. How do I get it to work with the contents of the JTextField?
Example:
String src = "This\tis\ta\ttest";
System.out.println("src=\"" + src + '"'); // --> prints "This is a test"
String re="\\t";
System.out.println("re=\"" + re + '"'); // --> prints "\t" - as when you use reSource.getText();
Pattern p = Pattern.compile(re);
Matcher m = p.matcher(src);
while (m.find()) {
System.out.println('"' + m.group() + '"');
}
Output:
src="This is a test"
re="\t"
" "
" "
" "
Try this:
re=re.replace("\\t", "\t");
OR
re=re.replace("\\t", "\\\\t");
I think the problem is in understanding that when you type:
String str = "\t";
Then it is actually the same as (with a literal tab character between the quotes):
String str = " ";
But if you type:
String str = "\\t";
Then the System.out.print(str) will be "\t".
Matching \t should work, however, your flags might have a problem.
Here's what works for me:
String src = "A\tBC\tD";
Pattern p=Pattern.compile("\\w\\t\\w"); //simulates the user entering \w\t\w
Matcher m=p.matcher(src);
while (m.find())
{
System.out.println("Match: \"" + m.group(0) + "\"");
}
Output is:
Match: "A B"
Match: "C D"
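One way the flags could cause the reported behaviour is Pattern.LITERAL, which makes the regex engine treat the two characters backslash and t as plain text. This is only a sketch under that assumption, since the question does not show what myflags contains; the class name TabFlagDemo and the helper finds are made up for illustration.

```java
import java.util.regex.Pattern;

public class TabFlagDemo {
    // Hypothetical helper: compiles the user's text with the given flags and searches src.
    static boolean finds(String userRegex, int flags, String src) {
        return Pattern.compile(userRegex, flags).matcher(src).find();
    }

    public static void main(String[] args) {
        // Two characters, backslash and t, as a user would type them into a text field.
        String typed = "\\t";
        String withTab = "A\tB";          // contains a real tab
        String withBackslashT = "A\\tB";  // contains backslash followed by t

        // No flags: the regex engine interprets \t as a tab.
        System.out.println(finds(typed, 0, withTab));
        // LITERAL: the pattern is matched character by character, so no tab is found...
        System.out.println(finds(typed, Pattern.LITERAL, withTab));
        // ...but the literal text backslash-t is.
        System.out.println(finds(typed, Pattern.LITERAL, withBackslashT));
    }
}
```

If the first call finds the tab while the application does not, comparing the flags in use against this sketch may help narrow the problem down.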
My experience is that Java Swing JTextField and JTable GUI controls escape user-entered backslashes by prefixing a backslash.
User types two-character sequence "backslash t", control's getText() method returns a String containing the three-character sequence "backslash backslash t". The SO formatter does its own thing with backslashes in text so here it is as code:
Single backslash: input is 2 char sequence \t and return value is 3 char \\t
For three-character input sequence "backsl backsl t", getText() returns the five-character sequence "backsl backsl backsl backsl t". As code:
Double backslash: input is 3 char sequence \\t and return value is 5 char \\\\t
This basically prevents the backslash from modifying the t to yield a character sequence that becomes a tab when interpreted by something like System.out.println.
Conveniently, and surprisingly to me, the regex processor accepts it either way. A pattern containing a literal tab character (written "\t" in Java source) matches a tab, as does a pattern containing a backslash followed by t (written "\\t" in Java source). Please see the demo code below. The System.out calls show which sequences and patterns contain tabs, and in JDK 1.7 both matches yield true.
package my.text;
/**
* Demonstrate use of tab character in regexes
*/
public class RegexForSo {
public static void main(String [] argv) {
final String sequenceTab="x\ty\tz";
final String patternBsTab = "x\t.*";
final String patternBsBsTab = "x\\t.*";
System.out.println("sequence is >" + sequenceTab + "<");
System.out.println("pattern BsTab is >" + patternBsTab + "<");
System.out.println("pattern BsBsTab is >" + patternBsBsTab + "<");
System.out.println("matched BsTab = " + sequenceTab.matches(patternBsTab));
System.out.println("matched BsBsTab = " + sequenceTab.matches(patternBsBsTab));
}
}
Output on my JDK1.7 system is below, tabs in output might not survive SO formatter :)
sequence is >x y z<
pattern BsTab is >x .*<
pattern BsBsTab is >x\t.*<
matched BsTab = true
matched BsBsTab = true
HTH

What's the regular expression for finding a string between " "

I have a string as below:
"http:172.1." = (10, 1,3);
"http:192.168." = (15, 2,6);
"http:192.168.1.100" = (1, 2,8);
The string inside " " is a Tag and inside () is the value for preceding tag.
What is the regular expression that will return me:
Tag: http:172.1.
Value: 10, 1,3
This regex
"([^\"]*)"\s*=\s*\(([^\)]*)\)*.
returns the text between quotes "" as group 1, and the text in parentheses () as group 2.
NB: when saving this as a string, you will have to escape the quote characters and double the slashes. It becomes unreadable very quickly - like this:
"\"([^\\\"]*)\"\\s*=\\s*\\(([^\\)]*)\\)*."
EDIT: As requested, here's an example use:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
Pattern p = Pattern.compile("\"([^\\\"]*)\"\\s*=\\s*\\(([^\\)]*)\\)*.");
// put p as a class member so it's computed only once...
String stringToMatch = "\"http://123.45\" = (0,1,3)";
// the string to match - hardcoded here, but you will probably read
// this from a file or similar
Matcher m = p.matcher(stringToMatch); // Pattern.matcher() creates the Matcher
if (m.matches()) {
String url = m.group(1); // what's between quotes (groups are read from the Matcher)
String value = m.group(2); // what's between parentheses
System.out.println("url: "+url); // http://123.45
System.out.println("value: "+value); // 0,1,3
}
For more details, see the Sun Tutorial - Regular Expressions.
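To pull every tag and value out of a multi-line input like the one in the question, find() can be called in a loop. This is a minimal sketch, assuming the three sample lines are joined with \n; the class name TagValueDemo and the helper parse are made up for illustration, and the pattern is a simplified variant of the one above that drops the trailing \)*. part.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagValueDemo {
    // Group 1: text between the quotes; group 2: text between the parentheses.
    static final Pattern ENTRY = Pattern.compile("\"([^\"]*)\"\\s*=\\s*\\(([^)]*)\\)");

    // Hypothetical helper: returns one "Tag: ... Value: ..." line per entry found.
    static List<String> parse(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            out.add("Tag: " + m.group(1) + " Value: " + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "\"http:172.1.\" = (10, 1,3);\n"
                + "\"http:192.168.\" = (15, 2,6);\n"
                + "\"http:192.168.1.100\" = (1, 2,8);";
        for (String line : parse(text)) {
            System.out.println(line);
        }
    }
}
```

Using find() rather than matches() lets one compiled Pattern scan the whole input instead of requiring each line to be matched separately.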
