Java - explain this regular expression (",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1) - java

I am splitting the string "foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"" on commas but want to keep the commas inside the quotes. This was answered in the question "Java: splitting a comma-separated string but ignoring commas in quotes", but that answer does not fully explain how the poster arrived at this piece of code:
line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1);
OK so I do understand some of what is going on but there is a bit that is confusing me. I know the first comma is for matching.
Then
(?=
is a forward search.
Then the first part is grouped
([^\"]*\"[^\"]*\").
This is where I get confused. So the first part
[^\"]*
means that beginning of any line with quotes separate tokens zero or more times.
Then comes \". Now is this like opening a quote in string or is it saying match this quote?
Then it repeats the exact same line of code, why?
([^\"]*\"[^\"]*\")
In the second part adds the same code again to explain that it must finish with quotes.
Can someone explain the part I am not getting?

[^\"] is any single character other than ". \" matches ". So basically ([^\"]*\"[^\"]*\") matches a string that contains exactly 2 " and whose last character is ".

I think they do a pretty good job of explaining later in the answer:
[^\"] is match other than quote.
\" is quote.
So this part ([^\"]*\"[^\"]*\") is
[^\"]* match other than quote 0 or more times
\" match quote, yes this is the opening quote
[^\"]* match other than quote 0 or more times
\" match quote, closing quote
They only require the first [^\"]* because they do not start with a quote, their example input is like a="abc",b="d,ef". If you were parsing "abc","d,ef" you wouldn't need it.

Here is your regex: /,(?=([^\"]*\"[^\"]*\")*[^\"]*$)/
Here is the readout from https://regex101.com/
, matches the character , literally
(?=([^\"]*\"[^\"]*\")*[^\"]*$) Positive Lookahead - Assert that the regex below can be matched
1st Capturing group ([^\"]*\"[^\"]*\")*
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
[^\"]* match a single character not present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\" matches the character " literally
\" matches the character " literally
[^\"]* match a single character not present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\" matches the character " literally
\" matches the character " literally
[^\"]* match a single character not present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\" matches the character " literally
$ assert position at end of the string
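To see the lookahead in action, here is a minimal sketch that runs the split from the question on the sample string. The class and method names (QuotedCommaSplit, splitOutsideQuotes) are made up for illustration. It assumes quotes are balanced and never escaped; a comma is only split on when an even number of quotes follows it.

```java
import java.util.Arrays;
import java.util.List;

public class QuotedCommaSplit {
    // A comma followed by zero or more quote pairs and then no more quotes
    // up to the end of the string, i.e. a comma that is outside quotes.
    static final String OUTSIDE_QUOTES_COMMA = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical helper: the -1 limit keeps trailing empty tokens.
    static List<String> splitOutsideQuotes(String line) {
        return Arrays.asList(line.split(OUTSIDE_QUOTES_COMMA, -1));
    }

    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        for (String token : splitOutsideQuotes(line)) {
            System.out.println(token);
        }
    }
}
```

The commas inside "baz,blurb" and "quux,syzygy" are each followed by an odd number of quotes, so the lookahead fails there and those commas stay inside their tokens.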

Related

Regex pattern matching with multiple strings

Forgive me. I am not very familiar with regex patterns.
I have created a regex pattern as below.
String regex = Pattern.quote(value) + ", [NnoneOoff0-9\\-\\+\\/]+|[NnoneOoff0-9\\-\\+\\/]+, "
+ Pattern.quote(value);
This regex pattern is failing with 2 different sets of strings.
value = "207e/160";
Use Case 1 -
When channelStr = "207e/160, 149/80"
Then channelStr.matches(regex), returns "true".
Use Case 2 -
When channelStr = "207e/160, 149/80, 11"
Then channelStr.matches(regex), returns "false".
I am not able to figure out why. As far as I can understand, it may be because of the multiple spaces involved when more than 2 strings are present, separated by commas.
I am not sure what the correct pattern is for more than 2 strings.
Any help will be appreciated.
If you print your pattern, it is:
\Q207e/160\E, [NnoneOoff0-9\-\+\/]+|[NnoneOoff0-9\-\+\/]+, \Q207e/160\E
It consists of an alternation | whose left and right sides each match exactly one mandatory comma.
matches() has to match the whole string, and that is the case for 207e/160, 149/80, so that is a match.
The string 207e/160, 149/80, 11 has 2 commas, so you do get a partial match for the first part of the string, but you don't match the whole string, so matches() returns false.
See the matches in this regex demo.
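The difference between a whole-string match and a partial match can be checked with a small sketch. The class and helper names (MatchesVersusFind, buildRegex, firstPartialMatch) are made up for illustration; the regex itself is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesVersusFind {
    // Builds the pattern from the question for a given value.
    static String buildRegex(String value) {
        return Pattern.quote(value) + ", [NnoneOoff0-9\\-\\+\\/]+|[NnoneOoff0-9\\-\\+\\/]+, "
                + Pattern.quote(value);
    }

    // Returns the first substring the pattern matches, or null if there is none.
    static String firstPartialMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String regex = buildRegex("207e/160");
        String twoValues = "207e/160, 149/80";
        String threeValues = "207e/160, 149/80, 11";

        // matches() requires the whole input to match
        System.out.println(twoValues.matches(regex));
        System.out.println(threeValues.matches(regex));

        // find() is satisfied with a match on part of the input
        System.out.println(firstPartialMatch(regex, threeValues));
    }
}
```

For the three-value string, find() stops after the second value because each side of the alternation allows only one comma, which is why matches() cannot cover the whole input.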
To match all the values, you can use a repeating pattern:
^[NnoeOf0-9+/-]+(?:,\h*[NnoeOf0-9+/-]+)*$
^ Start of string
[NnoeOf0-9+/-]+ Match 1+ of any of the characters listed in the character class
(?: Non capture group
,\h* Match a comma and optional horizontal whitespace chars
[NnoeOf0-9+/-]+ Match 1+ of any of the characters listed in the character class
)* Close the non capture group and optionally repeat it (if there should be at least 1 comma, then the quantifier can be + instead of *)
$ End of string
Regex demo
Example using matches():
String channelStr1 = "207e/160, 149/80";
String channelStr2 = "207e/160, 149/80, 11";
String regex = "^[NnoeOf0-9+/-]+(?:,\\h*[NnoeOf0-9+/-]+)*$";
System.out.println(channelStr1.matches(regex));
System.out.println(channelStr2.matches(regex));
Output
true
true
Note that in the character class you can put - at the end so that it does not have to be escaped, and the + and / do not have to be escaped either.
You can use regex101 to test your RegEx. it has a description of everything that's going on to help with debugging. They have a quick reference section bottom right that you can use to figure out what you can do with examples and stuff.
A few things, you can add literals with \, so \" for a literal double quote.
If you want the pattern to be one or more of something, you would use +. These are called quantifiers and can be applied to groups, tokens, etc. The token for a whitespace character is \s. So, one or more whitespace characters would be \s+.
It's difficult to tell exactly what you're trying to do, but hopefully pointing you to regex101 will help. If you want to provide examples of the current RegEx you have, what you want to match and then the strings you're using to test it I'll be happy to provide you with an example.
^(?:[NnoneOoff0-9\\-\\+\\/]+ *(?:, *(?!$)|$))+$
^ Start
(?: ... ) Non-capturing group that defines an item and its separator. After each item, except the last, the separator (,) must appear. Spaces (one, several, or none) can appear before and after the comma, which is specified with *. This group can appear one or more times to the end of the string, as specified by the + quantifier after the group's closing parenthesis.
Regex101 Test

java regex to capture any number of periods within a string

I am trying to match on any of the following:
$tag:parent.child$
$tag:grand.parent.child$
$tag:great.grand.parent.child$
I have tried a bunch of combos but not sure how to do this without an exp for each one: https://regex101.com/r/cMvx9I/1
\$tag:[a-z]*\.[a-z]*\$
I know this is wrong, but haven't been able to find the right method yet. Help is greatly appreciated.
Your regex was: \$tag:[a-z]*\.[a-z]*\$
You need a repeating group of .name, so use: \$tag:[a-z]+(?:\.[a-z]+)+\$
That assumes there has to be at least 2 names. If only one name is allowed, i.e. no period, then change last + to *.
You can use \$tag:(?:[a-z]+\.)*[a-z]+\$
\$ a literal $
tag: literal tag:
(?:...) a non-capturing group of:
[a-z]+ one or more lower-case letters and
\. a literal dot
* any number of the previous group (including zero of them)
[a-z]+ one or more lower-case letters
\$ a literal $
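A minimal sketch of the pattern above in Java, checked against the three sample tags. The class and method names (TagMatcher, isTag) are made up for illustration, and the sketch assumes names consist only of lower-case ASCII letters, as in the question.

```java
import java.util.regex.Pattern;

public class TagMatcher {
    // $tag: followed by zero or more "name." groups and one final name, then $
    static final Pattern TAG = Pattern.compile("\\$tag:(?:[a-z]+\\.)*[a-z]+\\$");

    // Hypothetical helper: true if the whole input is a tag of that shape.
    static boolean isTag(String input) {
        return TAG.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "$tag:parent.child$",
            "$tag:grand.parent.child$",
            "$tag:great.grand.parent.child$"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isTag(sample));
        }
    }
}
```

Because the repeated group is `[a-z]+\.` with a mandatory name before each dot, inputs with empty segments such as `$tag:parent..child$` are rejected.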
The following pattern will match any periods within a string:
\.
Not sure if this is what you want, but you can make a non-capturing group out of a pattern and then find that a certain number of times:
\$tag:(?:[a-z]+?\.*){1,4}\$
\$tag: - Literal $tag:
(?:[a-z]+?\.*) - Non-capturing group of one or more lower-case letters (shortest match) followed by zero or more literal periods
{1,4} - The non-capturing group appears anywhere between 1 and 4 times (you can change this as needed, or use a simple + if it could be any number of times).
\$ - Literal $
I normally prefer \w instead of [a-z] as it is equivalent to [a-zA-Z0-9_], but using this depends on what you are trying to find.
Hope this helps.

Regex Lookahead and Lookbehind to parse SQL statement

I am trying to parse SQL statements with regex and save their parameters to use later.
Lets say I have this SQL statement:
INSERT INTO tablename (id, name, email) VALUES (#id, #name, #email)
The following regex will work just fine:
(#[0-9a-zA-Z$_]+)
However in this statement I should ignore everything in ' ' or " " and save only first parameter:
UPDATE mytable SET id = #id, name = 'myname#id' WHERE id = 1;
According to this answer https://stackoverflow.com/a/307957 "it's not practical to do it in a single regular expression", but I am still trying to do this.
I tried to add regex lookahead and lookbehind, but it's not working:
(?<!\').*(#[0-9a-zA-Z$_]+).*(?!\')
Is there any way to do it using only one regular expression? Should I use lookahead/lookbehind or something else?
You can use: [\=\(\s]\s*\#[0-9+^a-zA-Z_0-9$_]+\s*[\),]
Explanation:
[\=\(\s] match a single character present in the list below
\= matches the character = literally
\( matches the character ( literally
\s match any white space character [\r\n\t\f ]
\s* match any white space character [\r\n\t\f ]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\# matches the character # literally
[0-9+^a-zA-Z_0-9$_]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
0-9 a single character in the range between 0 and 9
+^ a single character in the list +^ literally
a-z a single character in the range between a and z (case insensitive)
A-Z a single character in the range between A and Z (case insensitive)
_ the literal character _
0-9 a single character in the range between 0 and 9
$_ a single character in the list $_ literally
\s* match any white space character [\r\n\t\f ]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
[\),] match a single character present in the list below
\) matches the character ) literally
, the literal character ,
You can simplify your regex. Note the group you want always to capture is followed with , or ). Being aware of this fact you get this regex:
(#[0-9a-zA-Z$_]+)(?=[,)])
#[0-9a-zA-Z$_]+ is your value
(?=[,)]) checks if the ) or , character follows.
If describing where your string must not appear is complicated, it is better to describe where it must appear instead.
See how it works at Regex101.
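As a rough sketch of the lookahead approach, the following collects the parameters from both statements in the question. The class and method names (ParamExtractor, extract) are made up for illustration. It relies on the assumption stated above that a real parameter is always directly followed by , or ); it does not parse quotes in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParamExtractor {
    // A #name that is immediately followed by ',' or ')'
    static final Pattern PARAM = Pattern.compile("(#[0-9a-zA-Z$_]+)(?=[,)])");

    // Hypothetical helper: returns every parameter found, in order.
    static List<String> extract(String sql) {
        List<String> params = new ArrayList<>();
        Matcher m = PARAM.matcher(sql);
        while (m.find()) {
            params.add(m.group(1));
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "INSERT INTO tablename (id, name, email) VALUES (#id, #name, #email)"));
        System.out.println(extract(
            "UPDATE mytable SET id = #id, name = 'myname#id' WHERE id = 1;"));
    }
}
```

In the UPDATE statement the #id inside 'myname#id' is followed by a quote rather than , or ), so the lookahead rejects it. A parameter followed by whitespace or the end of the statement would also be skipped, which is a limitation of this shortcut.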

Find java comments (multi and single line) using regex

I found the following regex online at http://regexlib.com/
(\/\*(\s*|.*?)*\*\/)|(\/\/.*)
It seems to work well for the following matches:
// Compute the exam average score for the midterm exam
/**
* The HelloWorld program implements an application that
*/
BUT it also tends to match
http://regexr.com/foo.html?q=bar
at least starting at the //
I'm new to regex and a total infant, but I read that if you put a caret at the beginning it forces the match to start at the beginning of the line, however this doesn't seem to work on RegExr.
I'm using the following:
^(\/\*(\s*|.*?)*\*\/)|(\/\/.*)$
The regex you are looking for must allow the comment opener (// or /*) to appear anywhere except inside a token that can itself contain those substrings. If you look at the lexical structure of the Java language, you'll see that the only lexical element that can contain // or /* is the string literal. So, to avoid matching a "comment" that is really inside a string, you have to match every complete string literal that precedes the comment, so that no unterminated string literal is left open before your match.
So the text before your comment must not contain an unterminated string literal: it can contain any number of complete string literals, with any text that doesn't form a string literal in between. A string literal should be matched by the following:
\"()*\"
and the inside of the parenthesis must be filled with something that cannot be a \n, a single ", a single \, and also not a unicode literal \uxxxx that results in a valid " (java forbids to use normal java characters to be encoded as unicode sequences, so this last case doesn't apply) but can be a escaped \\ or a escaped \", so this leads to
\"([^\\\"\n]|\\.)*\"
and this can optionally be repeated any number of times, each occurrence preceded by a character that is not a " (which would begin the next string literal):
([^\\"](\"([^\\\"\n]|\\.)*\")?)*
well, the previous part to our valid string should be matched by this string, and then comes the comment string, it can be any of two forms:
\/\/[^\n]*$
or
\/\*([^\*]|\*[^\/])*\*\/
(this is, a slash, an asterisk (escaped), and any number of things that can be: either something different than a * or * followed by something not a /, to finally reach a */ sequence)
These can be grouped in an alternative group, as in:
(\/\/[^\n]*|\/\*([^\*]|\*[^\/])*\*\/)
finally, our expression shows:
^([^\\"](\"([^\\\"\n]|\\.)*\")?)*(\/\/[^\n]*|\/\*([^\*]|\*[^/])*\*\/)
But be aware that your matched comment does not begin at the start of the match but in the 4th group (at the 4th left parenthesis), and the regexp has to match the string from the beginning; see the demo.
Note
Keep in mind that you are matching not only the comment but also the text before it, so the overall match consists of what precedes the comment plus the comment itself. Also, if you try this regexp on several comments in sequence, it will match only the last one, as we have not covered the case of a /* ... /* .... */ sequence (a comment opener can also be embedded in a comment, but handling this case as well will make you hate regexps forever). The correct way to cope with this problem is to write a lex/flex specification to get the Java tokens, but that is out of scope for this explanation. See a probably valid example here.
You can try this pattern:
(?ms)^[^'"\n]*?(?:(?:"(?:\\.|[^"])*"|'\\?.')[^'"\n]*?)*((?:(?://[^\n]*|/\*.*?\*/)[ \t]*)+)
This captures comments in group 1, but only if the comment is not inside a string. Demo.
Breakdown:
(?ms) multiline flag, makes ^ match at the start of a line
singleline flag makes . match newlines
^ start of line
[^'"\n]*? match anything but " or ' or newline
(?: then, any number strings:
(?:
" start with a quote...
(?: ...followed by any number of...
\\. ...a backslash and the escaped character
| or
[^"] any character other than "
)*
" ...and finally the closing quote
| or...
'\\?.' a single character in single quotes, possibly escaped
)
[^'"\n]*? and everything up to the next string or newline
)*
( finally, capture (any number of) comments:
(?:
(?: either...
//[^\n]* a single line comment
| or
/\*.*?\*/ a multiline comment
)
[ \t]* and any subsequent comments if only separated by whitespace
)+
)
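The pattern broken down above can be tried out with a short sketch like the following. The class and method names (CommentFinder, findComments) are made up for illustration, and the trailing whitespace that group 1 may include is trimmed. This is only a sketch on small inputs, not a replacement for a real lexer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentFinder {
    // Same pattern as above, written as a Java string literal.
    static final Pattern COMMENT = Pattern.compile(
        "(?ms)^[^'\"\\n]*?(?:(?:\"(?:\\\\.|[^\"])*\"|'\\\\?.')[^'\"\\n]*?)*"
        + "((?:(?://[^\\n]*|/\\*.*?\\*/)[ \\t]*)+)");

    // Hypothetical helper: returns the text of group 1 for every match.
    static List<String> findComments(String source) {
        List<String> comments = new ArrayList<>();
        Matcher m = COMMENT.matcher(source);
        while (m.find()) {
            comments.add(m.group(1).trim());
        }
        return comments;
    }

    public static void main(String[] args) {
        String source =
            "String url = \"http://example.com\"; // real comment\n"
            + "int x = 1; /* block */\n";
        for (String comment : findComments(source)) {
            System.out.println(comment);
        }
    }
}
```

The // inside the URL is consumed as part of the string literal before group 1 is attempted, so only the trailing comment on that line is captured.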

Semi colon separated alphanumeric

I need to validate the below string using regular expression in Java:
String alphanumericList ="[\"State\"; \"districtOne\";\"districtTwo\"]";
I have tried the following:
String pattern="^\\[ (\"[\\w]\")\\s+(?:\\s+;\\s+ (\"[\\w]\")+) \\]$";
String alphanumericList ="[\"State1\"; \"district1\";\"district2\"]";
But the validation fails.
Any help is appreciated.
I'll try and mark the possible issues with your expression (issue numbers above the chars):
1 4 2 3 1 4 5 1
"^\\[ (\"[\\w]\")\\s+(?:\\s+;\\s+ (\"[\\w]\")+) \\]$"
As you can see, there are at least 5 issues:
1. The spaces in your expression are interpreted literally, i.e. if the input doesn't contain them, it would not match. Most probably you want to remove those spaces.
2. You expect at least one whitespace character after the first group (\\s+), which the input doesn't seem to contain. You probably want to remove that or change the quantifier from + to *.
3. You expect at least one whitespace character before each semicolon. Together with no. 2 this would make at least two after the first group. The solution would be the same as for no. 2.
4. Your expression for the strings between double quotes seems wrong. (\"[\\w]\")+ means "a double quote, a single word character, a double quote", and all of that at least once. Besides that, \w is already a character class, so the brackets around it are not needed here (unless you want to add more classes or characters inside). You probably want (\"\\w+\") instead.
5. In addition to no. 4, your non-capturing group that contains the semicolon ((?:\\s+;\\s+ (\"[\\w]\")+)) doesn't have a quantifier, i.e. it would be expected exactly once. You probably want to put the quantifier + or * after that group.
Another point that's not a direct issue is the capturing group around \"[\\w]\". Since you seem to want to match multiple strings after semicolons you'd only be able to capture one of the matching groups. Hence you'd most probably not be able to do what you intended anyways and thus the group is not necessary.
That said the fixed original expression would look like this:
pattern = "^\\[(\"\\w+\")(?:\\s*;\\s*\"\\w+\")+\\]$";
You are looking for this pattern:
String pattern = "\\[\\s*\"[^\"]*\"\\s*(?:;\\s*\"[^\"]*\"\\s*)*+\\]";
No need to add anchors since they are implicit when you use the matches() method, which is the more appropriate method for validation tasks.
pattern details:
\\[ # a literal opening square bracket
\\s* # optional whitespaces
\" # literal quote
[^\"]* # content between quotes: chars that are not a quote (zero or more)
\"
\\s*
(?: # non-capturing group:
; # a literal semi-colon
\\s*
\" # quoted content
[^\"]*
\"
\\s*
)*+ # repeat this group zero or more time (with a possessive quantifier)
\\] # a literal closing square bracket
The possessive quantifier prevents the regex engine from backtracking into the repeated non-capturing group if the closing square bracket is not present. It is a safeguard that prevents unneeded backtracking and makes the pattern fail faster. Note that you can also make the other quantifiers before the non-capturing group possessive for the same reason. More about possessive quantifiers.
I decided to describe the content between quotes in this way: \"[^\"]*\", but you can be more restrictive, allowing for example only words characters: \"\\w*\" or more general, allowing escaped quotes: \"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*+\"
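A minimal sketch of this validation with matches(), using the `\"[^\"]*\"` form of the quoted content. The class and method names (QuotedListValidator, isValid) are made up for illustration.

```java
import java.util.regex.Pattern;

public class QuotedListValidator {
    // [ "item" ; "item" ; ... ] with optional whitespace around each item
    static final Pattern LIST = Pattern.compile(
        "\\[\\s*\"[^\"]*\"\\s*(?:;\\s*\"[^\"]*\"\\s*)*+\\]");

    // Hypothetical helper: matches() anchors the pattern at both ends implicitly.
    static boolean isValid(String input) {
        return LIST.matcher(input).matches();
    }

    public static void main(String[] args) {
        String good = "[\"State\"; \"districtOne\";\"districtTwo\"]";
        String missingSeparator = "[\"State\" \"districtOne\"]";
        System.out.println(isValid(good));
        System.out.println(isValid(missingSeparator));
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which String.matches() would do.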
Try this
static final String HEAD = "^\\[\\s*";
static final String TAIL = "\\s*\\]$";
static final String SEP = "\\s*;\\s*";
static final String ITEM = "\"[^\"]*\"";
static final String PAT = HEAD + ITEM + "(" + SEP + ITEM + ")*" + TAIL;
Try:
pattern = "^\\[(\"\\w+\";\\s*)*(\"\\w+\")\\]$";
