Regex capturing sub groups with exact and partial matches

I have the following input string:
$data.store.author_handle.name0_handle[*].some.min()
and this regex:
^\$([a-zA-Z_0-9]+)(.[a-zA-Z_0-9.]+[\[\*0-9\]]*[.a-zA-Z_0-9]*)(.[min\(\)]*$)
With it I get the following groups:
data
.store.author_handle.name0_handle[*].some.min
()
Whereas I want to capture the following:
data
.store.author_handle.name0_handle[*].some
.min()
Please note that the input can take the form
$<literal>.<json path> <aggregator function>
<aggregator function> is optional and can be min/max/avg
<literal> : ([a-zA-Z_0-9]+)
<json path> is any path allowed by https://github.com/jayway/JsonPath

You can use the following regex:
^\$([\w]+)(\..+?)((?:\.(?:min|max|avg)\(\))?$)
Regex breakdown:
^ #start of string
\$ #Match $ literally
( #Start of 1st capturing group
[\w]+ #Match characters in set [A-Za-z0-9_] at least once(you can also use [^.]+)
) #End of 1st capturing group
( #Start of 2nd capturing group
\. #Match . literally
.+? #Match lazily till next condition is met
) #End of 2nd capturing group
( #Start of 3rd capturing group
(?: #Non capturing group
\. #Match . literally
(?: #Non capturing group
min|max|avg #Match any from min,max or avg
)
\(\) #Match () literally
)? #As mentioned, this all can be optional(aggregation part)
$ #End of string(Kept here so that if nothing matches 0 sized string is returned instead of null)
) #End of 3rd capturing group
Or, to accept any aggregation function name:
^\$([\w]+)(\..+?)((?:\.(?:(?:\w+)\(\)))?$)
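As a quick check of the first pattern, here is a minimal Java sketch. The class name JsonPathSplit and the helper split are made up for illustration, and the second input is an assumed example of a path without an aggregator; it is not taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsonPathSplit {
    // Pattern from the answer above: literal, JSON path, optional aggregator.
    private static final Pattern P = Pattern.compile(
            "^\\$([\\w]+)(\\..+?)((?:\\.(?:min|max|avg)\\(\\))?$)");

    // Returns {literal, path, aggregator}, or null when the input does not match.
    // The aggregator is an empty string when the input has none.
    static String[] split(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] inputs = {
            "$data.store.author_handle.name0_handle[*].some.min()",
            "$data.store.book[0].title" // hypothetical input with no aggregator
        };
        for (String in : inputs) {
            System.out.println(String.join(" | ", split(in)));
        }
    }
}
```

Because group 2 is lazy and group 3 carries the end-of-string anchor, the path stops growing as soon as the remainder is either a known aggregator call or nothing at all.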

Related

Java regular expression match two same number

I want to use a regex to match file paths like the ones below:
../90804/90804_0.jpg
../89246/89246_8.jpg
../89247/89247_14.jpg
Currently, I use the code as below to match:
Pattern r = Pattern.compile("^(.*?)[/](\\d+?)[/](\\d+?)[_](\\d+?).jpg$");
Matcher m = r.matcher(file_path);
But I found that it also produces an unexpected match for paths like:
../90804/89246_0.jpg
Is it impossible for a regex to match the same number twice?

You may use a \2 backreference instead of the second \d+ here:
s.matches("(.*?)/(\\d+)/(\\2)_(\\d+)\\.jpg")
See the regex demo. Note that if you use the matches method, you won't need the ^ and $ anchors.
Details
(.*?) - Group 1: any 0+ chars other than line break chars as few as possible
/ - a slash
(\\d+) - Group 2: one or more digits
/ - a slash
(\\2) - Group 3: the same value as in Group 2
_ - an underscore
(\\d+) - Group 4: one or more digits
\\.jpg - .jpg.
Java demo:
Pattern r = Pattern.compile("(.*?)/(\\d+)/(\\2)_(\\d+)\\.jpg");
Matcher m = r.matcher(file_path);
if (m.matches()) {
    System.out.println("Match found");
    System.out.println(m.group(1));
    System.out.println(m.group(2));
    System.out.println(m.group(3));
    System.out.println(m.group(4));
}
Output:
Match found
..
90804
90804
0

You can use this regex with a capture group and back-reference of the same:
(\d+)/\1
RegEx Demo
Equivalent Java regex string will be:
final String regex = "(\\d+)/\\1";
Details:
(\d+): Match 1+ digits and capture it in group #1
/: Match a literal /
\1: Using back-reference #1, match same number as in group #1
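The short pattern is unanchored, so with find() it can succeed on a fragment of a longer number. The sketch below contrasts it with a full-path pattern that uses the same back-reference idea. The class and method names are invented for illustration, and the path ../12345/5_0.jpg is an assumed test case.

```java
import java.util.regex.Pattern;

class BackrefDemo {
    // Whole-path pattern: the directory number must reappear before the underscore.
    private static final Pattern FULL =
            Pattern.compile("(.*?)/(\\d+)/\\2_(\\d+)\\.jpg");
    // Short pattern from the answer above, meant to be used with find().
    private static final Pattern LOOSE = Pattern.compile("(\\d+)/\\1");

    static boolean fullMatch(String path) {
        return FULL.matcher(path).matches();
    }

    static boolean looseFind(String path) {
        return LOOSE.matcher(path).find();
    }

    public static void main(String[] args) {
        String[] paths = {
            "../90804/90804_0.jpg",
            "../90804/89246_0.jpg",
            "../12345/5_0.jpg" // hypothetical path: only the last digit repeats
        };
        for (String p : paths) {
            System.out.println(p + " full=" + fullMatch(p) + " loose=" + looseFind(p));
        }
    }
}
```

For the last path the short pattern finds 5/5 inside 12345/5, so anchoring the number between the slash and the underscore is the safer choice when the whole path has to be validated.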

This regex
^(.*)\/(\d+?)\/(\d+?)_(\d+?)\.jpg$
splits strings such as
../90804/90804_0.jpg
../89246/89246_8.jpg
../89247/89247_14.jpg
into 4 parts.

Reg-ex to match statsD Format

I am using the following reg-ex to match StatsD data format -
^[\w.]+:.+\|.\|#(?:[\w.]+:[^,\n]+(?:,|$))*$
This satisfies any of the following formats -
performance.os.disk:1099511627776|g|#region:us-west-1,datacenter:us-west-1a
or
performance.os.disk:1099511627776|g|#
or
performance.os.disk:1099511627776|g|#region:us-west-1
But I am unable to match it against -
datastore.reads:9876|ms
Any help?
RegEx 101 to try - https://regex101.com/r/H8vQTa/1/

You may use
^[\w.]+:[^|]+\|[^|]+(?:\|#(?:[\w.]+:[^,\n]+(?:,|$))*)?$
See the regex demo
The point is that the . in your pattern matches exactly one character between the two |s. I suggest matching 1 or more characters other than | there, and making the rest optional by wrapping \|#(?:[\w.]+:[^,\n]+(?:,|$))* in an optional non-capturing group, (?:...)?.
Details
^ - start of string
[\w.]+ - 1+ word or . chars
: - a colon
[^|]+ - a negated character class matching 1+ non-| chars
\| - a | char
[^|]+ - 1+ chars other than |
(?:\|#(?:[\w.]+:[^,\n]+(?:,|$))*)? - an optional non-capturing group matching 1 or 0 occurrences of
\|# - |# substring
(?:[\w.]+:[^,\n]+(?:,|$))* - 0 or more consecutive repetitions of
[\w.]+: - 1+ word or . chars and then :
[^,\n]+ - 1+ chars other than LF (I guess it is used for debug purposes here) and ,
(?:,|$) - , or end of string
$ - end of string.
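The pattern can be exercised from Java roughly as follows. This is only a sketch: the class name StatsdFormat and the method isValid are hypothetical, and the sample lines are the ones from the question.

```java
import java.util.regex.Pattern;

class StatsdFormat {
    // Pattern from the answer above, with backslashes doubled for a Java string literal.
    private static final Pattern LINE = Pattern.compile(
            "^[\\w.]+:[^|]+\\|[^|]+(?:\\|#(?:[\\w.]+:[^,\\n]+(?:,|$))*)?$");

    static boolean isValid(String line) {
        return LINE.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] lines = {
            "performance.os.disk:1099511627776|g|#region:us-west-1,datacenter:us-west-1a",
            "performance.os.disk:1099511627776|g|#",
            "performance.os.disk:1099511627776|g|#region:us-west-1",
            "datastore.reads:9876|ms"
        };
        for (String line : lines) {
            System.out.println(isValid(line) + "  " + line);
        }
    }
}
```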

Regex: Match any word that is not the one defined by regex

I want to extract the words between the two bracket "blocks" and also the word in first brackets (RUNNING or STOPPED).
Example (extract the bolded part):
[ **RUNNING** ] **My First Application** [Pid: 4194]
[ **RUNNING** ] **Second app (some data)** [Pid: 5248]
[ **STOPPED** ] **Logger App**
So, as you can see, the [Pid: X] part is optional. I can write the regex as follows:
\[\s+(RUNNING|STOPPED)\s+\]\s+([^\[]+).*
and it will work. But it would fail if the app name contained the '[' character. I tried the following, but it doesn't work:
\[\s+(RUNNING|STOPPED)\s+\]\s+(?!\[Pid)+.*
My idea was to match any words/characters that do not start with "[Pid", but I guess this matches any words that are not followed by "[Pid".
Is there any way to do exactly that: match any word that is not "[Pid", i.e. match everything up to the first occurrence of the "[Pid" substring?

You may use
\[\s+(RUNNING|STOPPED)\s+\]\s+([^\[]*(?:\[(?!Pid:)[^\[]*)*)
See the regex demo
Details:
\[ - a literal [
\s+ - 1+ whitespaces
(RUNNING|STOPPED) - Group 1 capturing either RUNNING or STOPPED
\s+ - 1+ whitespaces
\] - a literal ]
\s+ - 1 or more whitespaces
([^\[]*(?:\[(?!Pid:)[^\[]*)*) - Group 2 capturing:
[^\[]* - zero or more chars other than [
(?:\[(?!Pid:)[^\[]*)* - zero or more sequences of:
\[(?!Pid:) - a [ not followed with Pid:
[^\[]* - zero or more chars other than [.
Java code:
String rx = "\\[\\s+(RUNNING|STOPPED)\\s+\\]\\s+([^\\[]*(?:\\[(?!Pid:)[^\\[]*)*)";
Pattern p = Pattern.compile(rx);
Matcher m = p.matcher("[ RUNNING ] My First Application");
if (m.find()) {
    System.out.println(m.group(1));
    System.out.println(m.group(2));
}
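To see how the pattern behaves on all three kinds of line, here is a slightly fuller sketch. The class name AppStatusParser, the helper parse, and the line with [beta] in the app name are assumptions added for illustration. Group 2 keeps the whitespace that precedes [Pid:, so the sketch trims it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppStatusParser {
    // Pattern from the answer above: status in group 1, app name in group 2.
    private static final Pattern P = Pattern.compile(
            "\\[\\s+(RUNNING|STOPPED)\\s+\\]\\s+([^\\[]*(?:\\[(?!Pid:)[^\\[]*)*)");

    // Returns {status, appName}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = P.matcher(line);
        if (!m.find()) {
            return null;
        }
        // trim() drops the space left between the name and "[Pid:".
        return new String[] { m.group(1), m.group(2).trim() };
    }

    public static void main(String[] args) {
        String[] lines = {
            "[ RUNNING ] My First Application [Pid: 4194]",
            "[ RUNNING ] Second app (some data) [Pid: 5248]",
            "[ STOPPED ] Logger App",
            "[ RUNNING ] App [beta] v2 [Pid: 7]" // hypothetical name containing '['
        };
        for (String line : lines) {
            String[] r = parse(line);
            System.out.println(r[0] + " -> " + r[1]);
        }
    }
}
```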

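You can end the match at either [Pid or the end of the line. Note that the middle group has to be lazy, (.*?); a greedy (.*) would run to the end of the line and swallow the [Pid: ...] part:
\[\s+(RUNNING|STOPPED)\s+\]\s+(.*?)(\[Pid|$)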

You could achieve it with the following pattern, written in free-spacing (verbose) mode:
\[\ (RUNNING|STOPPED)\ \] # RUNNING or STOPPED -> group 1
(.+?) # everything afterwards in the same line, lazily -> group 2
(?:\[Pid:\ (\d+)\]|$) # optional [Pid: n] with the number -> group 3, or end of line
See it working on regex101.com.

Weird password check matching using regex in Java

I'm trying to check a password with the following constraint:
at least 9 characters
at least 1 upper case
at least 1 lower case
at least 1 special character into the following list:
~ ! @ # $ % ^ & * ( ) _ - + = { } [ ] | : ; " ' < > , . ?
no accented letters
Here's the code I wrote:
Pattern pattern = Pattern.compile(
"(?!.*[âêôûÄéÆÇàèÊùÌÍÎÏÐîÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ€£])"
+ "(?=.*\\d)"
+ "(?=.*[a-z])"
+ "(?=.*[A-Z])"
+ "(?=.*[`~!@#$%^&*()_\\-+={}\\[\\]\\\\|:;\"'<>,.?/])"
+ ".{9,}");
Matcher matcher = pattern.matcher(myNewPassword);
if (matcher.matches()) {
//do what you've got to do when you
}
The issue is that some characters, like € or £, do not make the password invalid.
I don't understand why it works that way, since I explicitly exclude € and £ from the authorized characters.

Rather than trying to disallow those non-ASCII characters, why not make your regex accept only ASCII characters, like this:
Pattern pattern = Pattern.compile(
"(?=.*\\d)(?=.*[a-z])(?=.*[A-Z])(?=.*\\p{Punct})\\p{ASCII}{9,}");
Also note the use of \p{Punct} (ASCII punctuation) instead of the big character class; \p{Print} would not enforce a special character, because it also matches letters and digits. I believe that would suffice for you.
Check the Pattern Javadoc for more details.

This just allows printable ASCII. Note that it allows the space character, but you could disallow it by starting the range at \x21 instead.
Edit - I didn't see a number in the requirement, saw it in your regex, but wasn't sure.
# "^(?=.*[A-Z])(?=.*[a-z])(?=.*[`~!@#$%^&*()_\\-+={}\\[\\]|:;\"'<>,.?])[\\x20-\\x7E]{9,}$"
^
(?= .* [A-Z] )
(?= .* [a-z] )
(?= .* [`~!@#$%^&*()_\-+={}\[\]|:;"'<>,.?] )
[\x20-\x7E]{9,}
$
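A minimal sketch of the whitelist approach in Java follows. The class name PasswordCheck and the sample passwords are invented for illustration, and the special-character class is an assumption modeled on the list in the question. The euro sign is written as a Unicode escape so the example does not depend on the source file's encoding.

```java
import java.util.regex.Pattern;

class PasswordCheck {
    // One upper-case letter, one lower-case letter, one special character,
    // and at least 9 characters, all of them printable ASCII (space included).
    private static final Pattern VALID = Pattern.compile(
            "^(?=.*[A-Z])(?=.*[a-z])"
            + "(?=.*[`~!@#$%^&*()_\\-+={}\\[\\]|:;\"'<>,.?])"
            + "[\\x20-\\x7E]{9,}$");

    static boolean isValid(String password) {
        return VALID.matcher(password).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Passw0rd!x",       // satisfies every rule
            "Passw0rd\u20AC!",  // contains a euro sign, outside printable ASCII
            "password!!",       // no upper-case letter
            "Password12",       // no special character
            "Pa!1"              // too short
        };
        for (String s : samples) {
            System.out.println(isValid(s) + "  " + s);
        }
    }
}
```

Because the final character class only admits \x20 to \x7E, any accented letter or currency symbol fails the match without having to be listed explicitly.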

Regular expression - Negative lookahead

Using the following expression:
(?<!XYZ\d{8})(?>REF[A-Z]*)?(\d{3}+)(\d{6}+)(\d{3}+)
I am getting unexpected matches. Please could you explain why the following matches occur:
Input XYZ12345678123456789123 - Matches on 123456781234 - I was expecting it to match only on 123456789123 because it is the only sequence not preceded by (?<!XYZ\d{8})
Weirdly enough, if I use XYZ12345678REF123456789876 as input, it returns a match on 123456789876 but not REF123456789876. It correctly ignored the XYZ12345678, but it didn't pick up the optional REF characters.
Basically, what I want to achieve is to extract a 12-digit identifier from a string that contains two identifiers. The first identifier has the format XYZ\d{8} and the second identifier has the format (?>REF[A-Z]*)?(\d{3}+)(\d{6}+)(\d{3}+)
To avoid a match on the wrong 12 digits in a string such as XYZ12345678123456789123, I want to say: get the twelve digits as long as the digits are not part of an XYZ\d{8} type identifier.
Edit
Here are a couple of examples of what I want to achieve:
XYZ12345678123456789123 match on 123456789123
123456789123 match on 123456789123
XYZ12345678REF123456789123 should match on REF123456789123
12345678912 no match because not 12 digits
REF123456789123 match on REF123456789123
REF12345678912 no match because not 12 digits
XYZ12345678123456789123ABC match on 123456789123
XYZ123456789123 No match
XYZ1234567891234 no match

You were almost there. Change (?<!XYZ\\d{8}) to (?<!XYZ\\d{0,7}). You need to check that your match is not part of the previous identifier XYZ\\d{8}, which means it can't have
XYZ
XYZ1
XYZ12
...
XYZ1234567
before it.
Demo based on your examples
String[] data = {
    "XYZ12345678123456789123", //123456789123
    "123456789123", //123456789123
    "XYZ12345678REF123456789123 ", //REF123456789123
    "12345678912", //no match because not 12 digits
    "REF123456789123", //REF123456789123
    "REF12345678912", //no match because not 12 digits
    "XYZ12345678123456789123ABC", //123456789123
    "XYZ123456789123", //no match
    "XYZ1234567891234", //no match
};
Pattern p = Pattern.compile("(?<!XYZ\\d{0,7})(?>REF[A-Z]*)?(\\d{3}+)(\\d{6}+)(\\d{3}+)");
for (String s : data) {
    System.out.printf("%-30s", s);
    Matcher m = p.matcher(s);
    while (m.find())
        System.out.print("match: " + m.group());
    System.out.println();
}
output:
XYZ12345678123456789123 match: 123456789123
123456789123 match: 123456789123
XYZ12345678REF123456789123 match: REF123456789123
12345678912
REF123456789123 match: REF123456789123
REF12345678912
XYZ12345678123456789123ABC match: 123456789123
XYZ123456789123
XYZ1234567891234

The engine starts looking at the first character in the string.
If the string is "ABCDEF" and the regex is (?<!C)...
Looking at A, it sees there is no C to the left of it.
The assertion being satisfied, it then matches ABC.
Assertions just test the characters around the engine's current position.
They don't force the engine to find C first and match the characters after it.
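The ABCDEF example can be reproduced with a few lines of Java. This is only a sketch; the class name LookbehindScan, the helper findAll, and the second input ABCDEFG are additions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindScan {
    // Collects every match that find() reports, scanning left to right.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // At index 0 nothing precedes 'A', so the lookbehind holds and ABC is taken.
        System.out.println(findAll("(?<!C)...", "ABCDEF"));  // prints [ABC]
        // After ABC the engine resumes at 'D', which is preceded by 'C', so it
        // moves on to 'E' and matches EFG.
        System.out.println(findAll("(?<!C)...", "ABCDEFG")); // prints [ABC, EFG]
    }
}
```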
edit
From your examples you would need something like this, which is anchored.
If not anchored, it could be harder.
Also, Java doesn't have branch reset, so you will have to see which group
cluster matched.
# "^(?:(?:XYZ\\d{8})(\\d{3})(\\d{6})(\\d{3})|(?:REF)(\\d{3})(\\d{6})(\\d{3})|(\\d{3})(\\d{6})(\\d{3}))"
^
(?:
(?: XYZ \d{8} )
( \d{3} ) # (1)
( \d{6} ) # (2)
( \d{3} ) # (3)
|
(?: REF )
( \d{3} ) # (4)
( \d{6} ) # (5)
( \d{3} ) # (6)
|
( \d{3} ) # (7)
( \d{6} ) # (8)
( \d{3} ) # (9)
)
alternative,
# "^(?:(?:XYZ\\d{8})|(?:REF))?(\\d{3})(\\d{6})(\\d{3})"
^
(?:
(?: XYZ \d{8} )
| (?: REF )
)?
( \d{3} ) # (1)
( \d{6} ) # (2)
( \d{3} ) # (3)
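Since Java has no branch reset, one way to use the first anchored pattern is to probe the first group of each cluster and read from the cluster that participated. The sketch below assumes that approach; the class name ClusterGroups and the helper digits are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClusterGroups {
    // Anchored three-way alternation from the answer above (groups 1-3, 4-6, 7-9).
    private static final Pattern P = Pattern.compile(
            "^(?:(?:XYZ\\d{8})(\\d{3})(\\d{6})(\\d{3})"
            + "|(?:REF)(\\d{3})(\\d{6})(\\d{3})"
            + "|(\\d{3})(\\d{6})(\\d{3}))");

    // Returns the 12 digits from whichever cluster matched, or null if none did.
    static String digits(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        // Each cluster starts at group 1, 4, or 7; unused groups are null.
        for (int base = 1; base <= 7; base += 3) {
            if (m.group(base) != null) {
                return m.group(base) + m.group(base + 1) + m.group(base + 2);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(digits("XYZ12345678123456789123"));
        System.out.println(digits("REF123456789123"));
        System.out.println(digits("123456789123"));
        System.out.println(digits("12345678912"));
    }
}
```

The shorter alternative pattern avoids this bookkeeping because its three digit groups sit outside the alternation and are always groups 1 to 3.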

You can check that the match is not part of the previous identifier XYZ\d{8}, which means it can't have
XYZ
XYZ1
XYZ12
...
XYZ1234567
before it.
Also, Java doesn't have branch reset, so you will have to see which group
cluster matched.
I would make the change
(?<!XYZ\\d{8}) to (?<!XYZ\\d{0,7}).
Hope this helps.
