Regular expressions - capturing groups confusion - java

I am reading an Oracle tutorial on regular expressions and am on the topic of capturing groups. The reference is excellent, but beyond the fact that a pair of parentheses represents a group, I am having a lot of difficulty understanding the topic. Here is what confuses me.
What is the significance of counting groups in an expression?
What are non-capturing groups?
Elaborating with examples would be nice.

One usually doesn't count groups other than to know which group has which number. E.g. ([abc])([def](\d+)) has three groups, so I know to refer to them as \1, \2 and \3. Note that group 3 is inside 2. They are numbered from the left by where they begin.
When you search with a regex to find something inside a string (as opposed to matching, where the whole string has to match the pattern), group 0 will give you just the matched text, but not the stuff that was before or after it. Imagine if you will a pair of brackets around your whole regex. It's not part of the total count because it's not really considered a group.
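To make the numbering concrete, here is a small Java sketch using the three-group regex from above. The class name, the helper method and the sample input "xx ae42 yy" are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    // Returns groups 0..groupCount() of the first match, or an empty array if nothing matches.
    static String[] firstMatchGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        // groupCount() does not include group 0, hence the + 1.
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] g = firstMatchGroups("([abc])([def](\\d+))", "xx ae42 yy");
        for (int i = 0; i < g.length; i++) {
            System.out.println("group " + i + " = " + g[i]);
        }
        // group 0 = ae42  (the matched text only, not the surrounding "xx " and " yy")
        // group 1 = a
        // group 2 = e42
        // group 3 = 42    (nested inside group 2)
    }
}
```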
Groups can be used for other things than capturing. E.g. (foo|bar) will match "foo" or "bar". If you're not interested in the contents of a group, you can make it non-capturing (e.g: (?:foo|bar) (varies by dialect)), so as not to "use up" the numbers assigned to groups. But you don't have to, it's just convenient sometimes.
Say I want to find a word that begins and ends in the same letter: \b([a-z])[a-z]*\1\b The \1 will then be the same as whatever the first group captured. Of course it can be used for much more powerful stuff, but I think you'll get the idea.
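A rough Java sketch of that same-letter search; the class name and the sample sentence are only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SameFirstAndLastLetter {
    // \1 must repeat whatever letter group 1 captured at the start of the word.
    private static final Pattern WORD = Pattern.compile("\\b([a-z])[a-z]*\\1\\b");

    static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(find("the level test")); // prints [level, test]
    }
}
```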
(Coming up with relevant examples is certainly the hardest part.)
Edit: I answered when the questions were:
What is the significance of counting groups in an expression?
There is a special group, called as group-0, which means the entire expression. It is not reported by groupCount() method. Why is that?
I don't understand what are non-capturing groups?
Why we need back-references? What is the significance of back-references?

Say you have a string, abcabc, and you want to figure out whether the first part of the string matches the second part. You can do this with a single regex by using capturing groups and backreferences. Here is the regex I would use:
(.+)\1
The way this works is .+ matches any sequence of characters. Because it is in parentheses, it is caught in a group. \1 is a backreference to the 1st capturing group, so it is the equivalent of the text caught by the capturing group. After a bit of backtracking, the capturing group matches the first part of the string, abc. The backreference \1 is now the equivalent of abc, so it matches the second half of the string. The entire string is now matched, so it is confirmed that the first half of the string matches the second half.
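In Java this could look roughly like the sketch below. The class and method names are invented for the example; matches() is used so that the whole string has to fit the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedHalf {
    private static final Pattern DOUBLED = Pattern.compile("(.+)\\1");

    // Returns the repeated part if the whole input is "X followed by X", otherwise null.
    static String repeatedPart(String input) {
        Matcher m = DOUBLED.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(repeatedPart("abcabc")); // prints abc
        System.out.println(repeatedPart("abcabd")); // prints null
    }
}
```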
Another use of backreferences is in replacing. Say you want to replace all {...} with [...], if the text inside { and } is only digits. You can easily do this with capturing groups and backreferences, using the regex
{(\d+)}
And replacing that with [\1].
The regex matches {123} in the string abc {123} 456, and captures 123 in the first capturing group. The backreference \1 is now the equivalent of 123, so replacing {(\d+)} in abc {123} 456 with [\1] results in abc [123] 456.
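Since the question is about Java, here is a sketch of the same replacement there. Two things differ from the generic notation above: Java's replacement string refers to groups as $1 rather than \1, and the braces are escaped in the pattern so that they are taken literally. The class name is made up.

```java
class BracesToBrackets {
    static String convert(String input) {
        // \\{ and \\} are literal braces; $1 in the replacement is the text of group 1.
        return input.replaceAll("\\{(\\d+)\\}", "[$1]");
    }

    public static void main(String[] args) {
        System.out.println(convert("abc {123} 456")); // prints abc [123] 456
        System.out.println(convert("abc {x1} 456"));  // prints abc {x1} 456 (not only digits)
    }
}
```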
The reason non-capturing groups exist is because groups in general have more uses than just capturing. The regex (xyz)+ matches a string that consists entirely of the group, xyz, repeated, such as xyzxyzxyz. A group is needed because xyz+ only matches xy and then z repeated, i.e. xyzzzzz. The problem with using capturing groups is that they are slightly less efficient compared to non-capturing groups, and they take up an index. If you have a complicated regex with a lot of groups in it, but you only need to reference a single one somewhere in the middle, it's a lot better to make the others non-capturing and just reference \1 rather than trying to count all the groups up to the one you want.
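A quick Java sketch of the difference. The helper groupCount(String) is something I wrote for the example; it just asks a Matcher how many capturing groups the pattern defines.

```java
import java.util.regex.Pattern;

class NonCapturingRepeat {
    // Number of capturing groups in a pattern (group 0 is not counted).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println("xyzxyzxyz".matches("(?:xyz)+")); // true: the group repeats as a unit
        System.out.println("xyzzzzz".matches("(?:xyz)+"));   // false
        System.out.println("xyzzzzz".matches("xyz+"));       // true: only the z repeats
        System.out.println(groupCount("(xyz)+"));            // 1
        System.out.println(groupCount("(?:xyz)+"));          // 0: no index is used up
    }
}
```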
I hope this helps!

Can't think of an appropriate example at the moment, but I'm assuming someone might need to know the number of sub matches in the RegEx.
Group 0 is always the entire base match. I'm assuming groupCount() just lets you know how many capture groups you've specified in the expression.
A non-capturing group (?:) would be used to, well, not capture a group. Ex. if you need to test if a string contains one of several words and don't want to capture the word in a new group: (?:hello|hi there) world !== hello|hi there world. The first matches "hello world" or "hi there world" but the second matches "hello" or "hi there world".
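Here is how that difference shows up in Java, as a small sketch with made-up method names. String.matches() requires the whole string to match, which makes the scope of the | easy to see.

```java
class AlternationScope {
    // The group limits the alternation to "hello" or "hi there"; " world" always follows.
    static boolean grouped(String s) {
        return s.matches("(?:hello|hi there) world");
    }

    // Without a group the | splits the whole pattern: "hello" OR "hi there world".
    static boolean ungrouped(String s) {
        return s.matches("hello|hi there world");
    }

    public static void main(String[] args) {
        System.out.println(grouped("hello world"));   // true
        System.out.println(ungrouped("hello world")); // false
        System.out.println(ungrouped("hello"));       // true
    }
}
```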
They can be used as a part of a multitude of powerful reasons, such as testing whether or not a number is prime or composite. :) Or you could simply test to ensure a search parameter isn't repeated, ie. ^(\d)(?!.*\1)\d+$ would ensure the first digit is unique in a string.
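A minimal Java check of that last pattern; the class and method names are invented for illustration.

```java
import java.util.regex.Pattern;

class FirstDigitUnique {
    // (?!.*\1) fails the match if the captured first digit shows up again later.
    private static final Pattern P = Pattern.compile("^(\\d)(?!.*\\1)\\d+$");

    static boolean check(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(check("1234")); // true
        System.out.println(check("1231")); // false: the leading 1 appears again
        System.out.println(check("1223")); // true: only the first digit has to be unique
    }
}
```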

Related

Regex backreference (.) [duplicate]

Wishing to put some order into my knowledge of regular expressions I decided to go through a book about them, Introducing Regular Expressions. And I know it's silly but one of the introductory examples doesn't make sense to me.
(\d)\d\1
Sample text:
123-456-7890
(should capture the first number, 123)
Can anyone explain what is going on in here?
As far as I can figure out, the first \d captures the number 123. The \1 backreferences (marks) the group for later use. The parentheses limit the scope of the group. But what does the second \d do?
Simple explanations, as if to a small child or a golden retriever, are preferred.
\d is just one digit.
This regular expression doesn't match the "123-456-7890" string but it would match "323" (which could be part of a greater string, for example "323-456-7890") :
(\d) : first digit ("3")
\d : another digit ("2")
\1 : first group (which was "3")
Now, if your book claims that (\d)\d\1 should capture "123" in "123-456-7890", then it might contain an error...
(\d)\d\1 step by step:
The first \d matches one digit
And the parentheses () mark this as a capturing group - this is the first one, so the digit is remembered as "group 1"
The second \d says there is another digit
\1 says "here is the value from our previous group 1" - that is the digit that was matched in step 1.
So like dystroy already said: the regex should match a sequence of three digits of which the first and the third are equal.
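If you want to try it yourself, a small Java sketch (class and method names are just for the example) shows both cases:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitBackreference {
    private static final Pattern P = Pattern.compile("(\\d)\\d\\1");

    // Returns the first three-digit run whose first and third digits are equal, or null.
    static String firstMatch(String text) {
        Matcher m = P.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("123-456-7890")); // prints null
        System.out.println(firstMatch("323-456-7890")); // prints 323
    }
}
```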

Regex to match everything except pattern

I am coming from this question. Now what I want is the exact opposite.
I want to match all characters except this pattern:
yearid="[0-9]+"
How do I do that please?
I have tried (?!yearid="[0-9]+") but it refuses to match.
There are actually two ways to do this. You can use [^0-9]+ where the ^ negates the term inside the brackets, or \D+ where \D is any non-digit character.
re.sub(r'yearid="[0-9]+"', '', string_to_fix)
Capture the group like normal, then substitute nothing for it, and return the complete string.
Or, if you want to go the hard way and negate it:
re.sub(r'(.*?)(?:yearid="[0-9]+")(.*)', r'\1\2', string_to_fix)
This first matches everything lazily (.*?), until it finds the yearid="XXXX", matches that as a noncapturing group (?:yearid="[0-9]+"), then matches everything else (.*). Finally, it replaces the original full string with just the 1st and 2nd capture groups, essentially cutting out the section you want.

Regex to replace repeated characters

Can someone give me a Java regex to replace the following.
If I have a word like this "Cooooool", I need to convert this to "Coool" with 3 o's. So that I can distinguish it with the normal word "cool".
Another ex: "happyyyyyy" should be "happyyy"
replaceAll("(.)\\1+","$1"))
I tried this but it removes all the repeating characters leaving only one.
Change your regex like below.
string.replaceAll("((.)\\2{2})\\2+","$1");
( Start of the first capturing group.
(.) captures any character. For this case, you may use [a-z]
\\2 refers to the second capturing group. \\2{2} means it must be repeated exactly two times.
) End of first capturing group. So this would capture the first three repeating characters.
\\2+ repeats the second group one or more times.
I think you might want something like this:
str.replaceAll("([a-zA-Z])\\1\\1+", "$1$1$1");
This will match where a character is repeated 3 or more times and will replace it with the same character, three times.
$1 refers to only one character, because the capturing group surrounds just the single character to match.
\\1\\1+ matches the character only, if it occurs at least three times in a row.
This call is also a lot more readable, than having a huge regex and only using one $1.
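Both suggestions can be compared side by side with a small sketch like this one (the class and method names are made up):

```java
class SqueezeRepeats {
    // Nested-group version: group 1 holds the first three copies, \2+ swallows the rest.
    static String nested(String s) {
        return s.replaceAll("((.)\\2{2})\\2+", "$1");
    }

    // Simpler version: three or more copies of a letter are replaced by exactly three.
    static String simple(String s) {
        return s.replaceAll("([a-zA-Z])\\1\\1+", "$1$1$1");
    }

    public static void main(String[] args) {
        System.out.println(nested("Cooooool"));   // prints Coool
        System.out.println(simple("Cooooool"));   // prints Coool
        System.out.println(simple("happyyyyyy")); // prints happyyy
        System.out.println(simple("cool"));       // prints cool (only two o's, left alone)
    }
}
```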

Regular expression non-greedy but still

I have some larger text which in essence looks like this:
abc12..manycharshere...hi - abc23...manyothercharshere...jk
Obviously there are two items, each starting with "abc", the numbers (12 and 23) are interesting as well as the "hi" and "jk" at the end.
I would like to create a regular expression which allows me to parse out the numbers, but only if the two characters at the end match, i.e. I am looking for the number related to "jk", but the following regular expression matches the whole string and thus returns "12", not "23" even when non-greedy matching the area with the following:
abc([0-9]+).*?jk
Is there a way to construct a regular expression which matches text like the one above, i.e. retrieving "23" for items ending in "jk"?
Basically I would need something like "match abc followed by a number, but only if there is "jk" at the end before another instance of "abc followed by a number appears"
Note: the texts/matches are an abstraction here, the actual text is more complicated, especially the things that can appear as "manyothercharactershere", I simplified to show the underlying problem more clearly.
Use a regex like this. .*abc([0-9]+).*?jk
I think you want something like this,
abc([0-9]+)(?=(?:(?!jk|abc[0-9]).)*jk)
You need to use negative lookahead here to make it work:
abc(?!.*?abc)([0-9]+).*?jk
Here (?!.*?abc) is a negative lookahead that makes sure to match an abc that is NOT followed by another abc, thus making sure the closest string between abc and jk is matched.
Being non-greedy does not change the rule, that the first match is returned. So abc([0-9]+).*?jk will find the first jk after “abcnumber” rather than the last one, but still match the first “abcnumber”.
One way to solve this is to tell that the dot should not match abc([0-9]+):
abc([0-9]+)((?!abc([0-9]+)).)*jk
If it is not important to have the entire pattern being an exact match you can do it simpler:
.*(abc([0-9]+).*?jk)
In this case, it’s group 1 which contains your intended match. The pattern uses a greedy matchall to ensure that the last possible “abcnumber” is matched within the group.
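To see the difference in Java, here is a sketch that runs the original non-greedy pattern and the "dot must not start another abc-number" pattern against the sample text from the question. The class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberBeforeJk {
    // Returns group 1 of the first match of regex in text, or null if there is none.
    static String firstNumber(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "abc12..manycharshere...hi - abc23...manyothercharshere...jk";
        // Non-greedy alone: the match still starts at the first abc.
        System.out.println(firstNumber("abc([0-9]+).*?jk", text)); // prints 12
        // Each character between the number and jk may not begin another abc-number.
        System.out.println(firstNumber("abc([0-9]+)((?!abc([0-9]+)).)*jk", text)); // prints 23
    }
}
```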
Assuming that hyphen separates "items", this regex will capture the numbers from the target item:
abc([0-9]+)[^-]*?jk

Java/Hive regex interpretation

Straightforward question, it's just difficult to google regex syntax...
I'm going through the HortonWorks Hive tutorials (Hive uses same regex as Java), and the following SELECT statement uses regex to pull from what's probably JSON data...
INSERT OVERWRITE TABLE batting
SELECT
regexp_extract(col_value,'^(?:([^,]*)\.?){1}',1) player_id,
regexp_extract(col_value,'^(?:([^,]*)\.?){2}',1) year,
regexp_extract(col_value,'^(?:([^,]*)\.?){9}',1) run
FROM temp_batting;
The data looks like this:
PlayerID,yearID,stint,teamID,lgID,G,G_batting,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,G_old
aardsda01,2004,1,SFN,NL,11,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,11
aardsda01,2006,1,CHN,NL,45,43,2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,45
aardsda01,2007,1,CHA,AL,25,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
And so PlayerID is in column1, year is column2, R (runs) is column 9. How is regexp_extract successfully pulling this data?
I'm new to non-capturing groups, but it looks to me like the entire thing is a non-capturing group. Also, I'm used to seeing {1}, {2}, or {9} in the form [0-9]{9} meaning it matches a 9-digit number. In this case it looks like it's pointing to the 9th match of something, what is this syntax called?
First break apart the regex:
^(?:([^,]*)\.?){n}
^ is the start of a String
(?:...){n} is a non-capturing group repeated n times
([^,]*) is a capturing group around a negated character class, it matches "not ," zero or more times
\.? is an optional (literal) .
So, how does this work?
The non-capturing group is solely there for the numeric quantifier, i.e. it makes the entire pattern in the group repeat n times.
The actual pattern being captured is in the capturing group ([^,]*). I'm not sure why the optional . is there and I don't see any inputs ending with a . in your sample data but I assume there are some.
What happens is that the group is captured n times but only the last capture is stored and this is stored in the first group, i.e. group 1. This is the default in the regexp_extract.
So when the pattern repeats once in the first case we capture the first element on the comma separated array. When the pattern repeats twice in the second example we capture the second element. When the pattern repeats nine times then the ninth element is captured.
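The "only the last capture is kept" behaviour can be tried out in plain Java with a sketch like the following. Note one assumption on my part: the sketch uses an optional comma (,?) as the separator inside the repeated group instead of the \.? shown in the question, so that each repetition consumes one field plus its delimiter. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NthCsvField {
    // Assumption: ",?" replaces the "\.?" from the question's pattern.
    // The group repeats n times; group 1 keeps only what the last repetition captured.
    static String field(String line, int n) {
        Matcher m = Pattern.compile("^(?:([^,]*),?){" + n + "}").matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "aardsda01,2006,1,CHN,NL,45,43,2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,45";
        System.out.println(field(line, 1)); // prints aardsda01
        System.out.println(field(line, 2)); // prints 2006
        System.out.println(field(line, 9)); // prints 0 (the R column)
    }
}
```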
The pattern itself is actually pretty horrible as it allows for a zero length pattern to be repeated, this means that the regex engine can backtrack a lot if there is a non-matching pattern. I imagine this isn't an issue for you but it is generally bad practice.
It would be best to either make the [^,]* possessive by adding a +:
^(?:([^,]*+)\.?){n}
Or make the entire non-capturing group atomic:
^(?>([^,]*)\.?){n}
I believe that a good way to practice and learn regex is on this site: http://www.regexr.com/
Just paste your expression on there, and delete/replace parts of it. It'll all make a bit more sense than trying to decipher a regex by sight.
Another way to do this without using regex is to use split function
select split('aardsda01,2006,1,CHN,NL,45,43,2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,45',',')[0] as player_id,
split('aardsda01,2006,1,CHN,NL,45,43,2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,45',',')[1] as year,
split('aardsda01,2006,1,CHN,NL,45,43,2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,45',',')[8] as runs