Regex Lookahead and Lookbehind to parse SQL statement - java

I am trying to parse SQL statements with regex and save it's parameters to use later.
Lets say I have this SQL statement:
INSERT INTO tablename (id, name, email) VALUES (#id, #name, #email)
The following regex will work just fine:
(#[0-9a-zA-Z$_]+)
However in this statement I should ignore everything in ' ' or " " and save only first parameter:
UPDATE mytable SET id = #id, name = 'myname#id' WHERE id = 1;
According to this answer https://stackoverflow.com/a/307957 "it's not practical to do it in a single regular expression", but I am still trying to do this.
I tried to add Regex Lookahead and Lookbehind, but its not working:
(?<!\').*(#[0-9a-zA-Z$_]+).*(?!\')
Is there any way to do it using only one regular expression? Should I use lookahead/lookbehind or something else?

You can use: [\=\(\s]\s*\#[0-9+^a-zA-Z_0-9$_]+\s*[\),]
Explanation:
[\=\(\s] match a single character present in the list below
\= matches the character = literally
\( matches the character ( literally
\s match any white space character [\r\n\t\f ]
\s* match any white space character [\r\n\t\f ]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\# matches the character # literally
[0-9+^a-zA-Z_0-9$_]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
0-9 a single character in the range between 0 and 9
+^ a single character in the list +^ literally
a-z a single character in the range between a and z (case insensitive)
A-Z a single character in the range between A and Z (case insensitive)
_ the literal character _
0-9 a single character in the range between 0 and 9
$_ a single character in the list $_ literally
\s* match any white space character [\r\n\t\f ]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
[\),] match a single character present in the list below
\) matches the character ) literally
, the literal character ,

You can simplify your regex. Note the group you want always to capture is followed with , or ). Being aware of this fact you get this regex:
(#[0-9a-zA-Z$_]+)(?=[,)])
#[0-9a-zA-Z$_]+ is your value
(?=[,)]) checks if the ) or , character follows.
If the way describing where your string can't be placed is complicated, better look where it must be places instead.
See how it works at Regex101.

Related

Regex to match comma separated values

I'm new to Regex in Java and I wanted to know how can I build one that only takes a string that consists of one or two comma-separated lists of uppercase letters, separated by a single whitespace.
I would need to filter out strings that start with a comma, that end with a comma or strings that have multiple consecutive commas.
All these would be invalid:
"D,, D"
"D D,,"
"D, ,D"
"D, ,,D"
"D,, ,D"
"D,,"
",,A"
",A"
"A,"
All these would be valid:
"D,D T,F"
"D,D T"
"A,A"
"A"
I used (\s?("[\w\s]*"|\d*)\s?(,,|$)) for consecutive commas but it doesn't do the trick when the comma is at the end or beggining of one of the whitespace separated substring like "D, ,D"
Should I aim to split by whitespace and look for a simpler regex for each of the substrings?
That would be something like this:
^[A-Z](,[A-Z])*( [A-Z](,[A-Z])*)*$
What happens here, is the following:
We expect a letter, optionally followed by one or more times a comma-immediately-followed-by-another-letter.
Then we optionally accept a space, and then the abovementioned pattern. And this is repeated.
Test: https://regex101.com/r/kzLhtw/1
You could, of course, slightly optimize the regex by making all capturing groups non-capturing: just put ?: immediately behind the (, that is, (?:.
You might use
^[A-Z](?: [A-Z])*(?:,[A-Z](?: [A-Z])*){0,2}$
^ Start of string
[A-Z] Match a single char A-Z
(?: [A-Z])* Optionally repeat a space and and a single char A-Z
(?: Non capture group
,[A-Z](?: [A-Z])* Match a comma, char A-Z followed by optionally repeat matching a space and a char A-Z
){0,2} Close the group and repeat 0-2 times
$ End of string
Regex demo
"a string that consists of one or two comma-separated lists of uppercase letters, separated by a single whitespace"
Not sure how to exactly interpretate the above, but my reading is: One or two comma-seperated lists where each list may only consist of uppercase characters. In the case of two lists, the two lists are seperated by a single space.
You could try:
^(?!.* .* )[A-Z](?:[ ,][A-Z])*$
See the online demo
^ - Start string anchor.
(?!.* .* ) - Negative lookahead to prevent two spaces present.
[A-Z] - A single uppercase alpha-char.
(?: - Open non-capture group:
[ ,] - A comma or space.
[A-Z] - A single uppercase alpha-char.
)* - Close non-capture group and match 0+ times upt to;
$ - End string anchor.

Regex to validate custom format

I have this format: xx:xx:xx or xx:xx:xx-y, where x can be 0-9 a-f A-F and y can be only 0 or 1.
I come up with this regex: ([0-9A-Fa-f]{2}[:][0-9A-Fa-f]{2}[:][0-9A-Fa-f]{2}|[-][0-1]{1})
(See regexr).
But this matches 0a:0b:0c-3 too, which is not expected.
Is there any way to remove these cases from result?
[:] means a character from the list that contains only :. It is the same as
:. The same for [-] which has the same result as -.
Also, {1} means "the previous piece exactly one time". It does not have any effect, you can remove it altogether.
To match xx:xx:xx or xx:xx:xx-y, the part that matches -y must be optional. The quantifier ? after the optional part mark it as optional.
All in all, your regex should be like this:
[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}(-[01])?
If the regex engine you use can be told to ignore the character case then you can get rid of A-F (or a-f) from all character classes and the regex becomes:
[0-9a-f]{2}:[0-9a-f]{2}:[0-9a-f]{2}(-[01])?
How it works, piece by piece:
[0-9a-f] # any digit or letter from (and including) 'a' to 'f'
{2} # the previous piece exactly 2 times
: # the character ':'
[0-9a-f]
{2}
:
[0-9a-f]
{2}
( # start a group; it does not match anything
- # the character '-'
[01] # any character from the class (i.e. '0' or '1')
) # end of group; the group is needed for the next quantifier
? # the previous piece (i.e. the group) is optional
# it can appear zero or one times
See it in action: https://regexr.com/4rfvr
Update
As #the-fourth-bird mentions in a comment, if the regex must match the entire string then you need to anchor its ends:
^[0-9a-f]{2}:[0-9a-f]{2}:[0-9a-f]{2}(-[01])?$
^ as the first character of a regex matches the beginning of the string, $ as the last character matches the end of the string. This way the regex matches the entire string only (when there aren't other characters before or after the xx:xx:xx or xx:xx:xx-y part).
If you use the regex to find xx:xx:xx or xx:xx:xx-y in a larger string then you don't need to add ^ and $. Of course, you can add only ^ or $ to let the regex match only at the beginning or at the end of the string.
You want
xx:xx:xx or if it is followed by a -, then it must be a 0 or 1 and then it is the end (word boundry).
So you don't want any of these
0a:0b:0c-123
0a:0b:0cd
10a:0b:0c
either.
Then you want "negative lookingahead", so if you match the first part, you don't want it to be followed by a - (the first pattern) and it should end there (word boundary), and if it is followed by a -, then it must be a 0 or 1, and then a word boundary:
/\b([0-9a-f]{2}[:][0-9a-f]{2}[:][0-9a-f]{2}(?!-)\b|\b[0-9a-f]{2}[:][0-9a-f]{2}[:][0-9a-f]{2}-[01]\b)/i
To prevent any digit in front, a word boundary is added to the front as well.
Example: https://regexr.com/4rg42
The following almost worked:
/\b([0-9a-f]{2}[:][0-9a-f]{2}[:][0-9a-f]{2}\b[^-]|\b[0-9a-f]{2}[:][0-9a-f]{2}[:][0-9a-f]{2}-[01]\b)/i
but if it is the end of file and it is 3a:2b:11, then the [^-] will try to match a non - character and it won't match.
Example: https://regexr.com/4rg4q

Java - explain this regular expression (",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1)

I am separating a string "foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"" by commas but want to keep the commas in the quotes. This question was answered in this Java: splitting a comma-separated string but ignoring commas in quotes question but it fails to fully explain how the poster created this piece of code which is:
line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1);
OK so I do understand some of what is going on but there is a bit that is confusing me. I know the first comma is for matching.
Then
(?=
is a forward search.
Then the first part is grouped
([^\"]*\"[^\"]*\").
This where I get confused. So the first part
[^\"]*
means that beginning of any line with quotes separate tokens zero or more times.
Then comes \". Now is this like opening a quote in string or is it saying match this quote?
Then it repeats the exact same line of code, why?
([^\"]*\"[^\"]*\")
In the second part adds the same code again to explain that it must finish with quotes.
Can someone explain the part i am not getting?
[^\"] is any string without ". \" matches ". So basically ([^\"]*\"[^\"]*\") matches a string that contains 2 " and the last character is ".
I think they do a pretty good job of explaining later in the answer:
[^\"] is match other than quote.
\" is quote.
So this part ([^\"]*\"[^\"]*\") is
[^\"]* match other than quote 0 or more times
\" match quote, yes this is the opening quote
[^\"]* match other than quote 0 or more times
\" match quote, closing quote
They only require the first [^\"]* because they do not start with a quote, their example input is like a="abc",b="d,ef". If you were parsing "abc","d,ef" you wouldn't need it.
here is your string /,(?=([^\"]\"[^\"]\")[^\"]$)/
here is the readout from https://regex101.com/
, matches the character , literally
(?=([^\"]*\"[^\"]*\")*[^\"]*$) Positive Lookahead - Assert that the regex below can be matched
1st Capturing group ([^\"]*\"[^\"]*\")*
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
[^\"]* match a single character not present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\" matches the character " literally
\" matches the character " literally
[^\"]* match a single character not present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\" matches the character " literally
\" matches the character " literally
[^\"]* match a single character not present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\" matches the character " literally
$ assert position at end of the string

Regex to avoid some characters and some character combinations

I have to create a text box where user input comment and I have to validate that, this input does not contains below characters(character combination).
:|, &, ; , $ , % , # , ' , " , \' , \" , <> , (), +, CR, LF, \
The list above is comma delimited, so if two characters appear between
a set of commas, it’s the character combination that is potentially
malicious, not the character in isolation
I tried to create regex for this and tried Positive Lookahead also, but not working anything for me. I have gone through some earlier questions also, but not found solution for my query.
I am able to validate single malicious characters but not the combination.
As for characters, that's really simple. You can just specify what type of chars are not allowed in the string by using [^], in your case, [^&;$%#\'\"+\\]
[^&;$%#\'\"+\\]* will match a string that doesn't contain the mentioned symbols.
As for the combinations, regex has negative lookahead for that. Before the engine starts matching something, it can test if there aren't patterns present in the string. Syntax: (?!.*thing1|.*thing2|...) (the .* is needed so that the whole string is checked, not only the next word, so (?!.*:\||.*<>|.*\(\)|.*CR|.*LF)
All together: ^(?!.*:\||.*<>|.*\(\)|.*CR|.*LF)[^&;$%#\'\"+\\]*$

Java Regular Expression: what is " '- "

I came up to a line in java that uses regular expressions.
It needs a user input of Last Name
return lastName.matches( "[a-zA-z]+([ '-][a-zA-Z]+)*" );
I would like to know what is the function of the [ '-].
Also why do we need both a "+" and a "*" at the same time, and the [ '-][a-zA-Z] is in brackets?
Your RE is: [a-zA-z]+([ '-][a-zA-Z]+)*
I'll break it down into its component parts:
[a-zA-Z]+
The string must begin with any letter, a-z or A-Z, repeated one or more times (+).
([ '-][a-zA-Z]+)*
[ '-]
Any single character of <space>, ', or -.
[a-zA-Z]+
Again, any letter, a-z or A-Z, repeated once or more times.
This combination of letters ('- and a-ZA-Z) may then be repeated zero or more times.
Why [ '-]? To allow for hiphenated names, such as Higgs-Boson or names with apostrophes, such as O'Reilly, or names with spaces such as Van Dyke.
The expression [ '-] means "one of ', , or -". The order is very important - the dash must be the last one, otherwise the character class would be considered a range, and other characters with code points between the space and the quote ' would be accepted as well.
+ means "one or more repetitions"; * means "zero or more repetitions", referring to the term of the regular expression preceding the + or * modifier.]
Overall, the expression matches groups of lowercase and uppercase letters separated by spaces, dashes, or single quotes.
it means it can be any of the characters space ' or - ( space, quote dash )
the - can be done as \- as it also can mean a range... like a-z
This looks like it is a pattern to match double-barreled (space or hyphen) or I-don't-know-what-to-call-it names like O'Grady... for example:
It would match
counter-terrorism
De'ville
O'Grady
smith-jones
smith and wesson
But it will not match
jones-
O'Learys'
#hashtag
Bob & Sons
The idea is, after the first [A-Za-z]+ consumes all the letters it can, the match will end right there unless the next character is a space, an apostrophe, or a hyphen ([ '-]). If one of those characters is present, it must be followed by at least one more letter.
A lot of people have difficulty with this. The naively write something like [A-Za-z]+[ '-]?[A-Za-z]*, figuring both the separator and the extra chunks of letters are optional. But they're not independently optional; if there is a separator ([ '-]), it must be followed by at least one more letter. Otherwise it would treat strings like R'- j'-' as valid. Your regex doesn't have that problem.
By the way, you've got a typo in your regex: [a-zA-z]. You want to watch out for that, because [A-z] does match all the uppercase and lowercase letters, so it will seem to be working correctly as long as the inputs are valid. But it also matches several non-letter characters whose code points happen to lie between those of Z and a. And very few IDEs or regex tools will catch that error.

Categories

Resources