Regular expression non-greedy but still - java

I have some larger text which in essence looks like this:
abc12..manycharshere...hi - abc23...manyothercharshere...jk
Obviously there are two items, each starting with "abc", the numbers (12 and 23) are interesting as well as the "hi" and "jk" at the end.
I would like to create a regular expression which allows me to parse out the numbers, but only if the two characters at the end match, i.e. I am looking for the number related to "jk", but the following regular expression matches the whole string and thus returns "12", not "23" even when non-greedy matching the area with the following:
abc([0-9]+).*?jk
Is there a way to construct a regular expression which matches text like the one above, i.e. retrieving "23" for items ending in "jk"?
Basically I would need something like "match abc followed by a number, but only if there is "jk" at the end before another instance of "abc followed by a number appears"
Note: the texts/matches are an abstraction here, the actual text is more complicated, espially the things that can appear as "manyothercharactershere", I simplified to show the underlying problem more clearly.

Use a regex like this. .*abc([0-9]+).*?jk
demo here

I think you want something like this,
abc([0-9]+)(?=(?:(?!jk|abc[0-9]).)*jk)
DEMO

You need to use negative lookahead here to make it work:
abc(?!.*?abc)([0-9]+).*?jk
RegEx Demo
Here (?!.*?abc) is negative lookahead that makes sure to match abc where it is NOT followed by another abc thus making sure closes string between abc and jk is matched.

Being non-greedy does not change the rule, that the first match is returned. So abc([0-9]+).*?jk will find the first jk after “abcnumber” rather than the last one, but still match the first “abcnumber”.
One way to solve this is to tell that the dot should not match abc([0-9]+):
abc([0-9]+)((?!abc([0-9]+)).)*jk
If it is not important to have the entire pattern being an exact match you can do it simpler:
.*(abc([0-9]+).*?jk)
In this case, it’s group 1 which contains your intended match. The pattern uses a greedy matchall to ensure that the last possible “abcnumber” is matched within the group.

Assuming that hyphen separates "items", this regex will capture the numbers from the target item:
abc([0-9]+)[^-]*?jk
See demo

Related

Regex with at least one of two options but in sequence

I am trying to write a regex (for use in a Java Pattern) that will match strings that possibly have a letter that is possibly followed by a space then number, but must have at least one of them. For example, the following strings should be matched:
"a 5"
"b 9"
" 8"
However, it should not match an empty string ("").
Furthermore, I would like to make each of the components part a named capture group.
The following works, but allows the empty string.
"(?<let>\\p{Alpha})?( (?<num>\\p{Digit}))?"
To ensure that there is at least one of them, you can use lookahead (?=\\p{Alpha}| \\p{Digit}) at the beginning:
"(?=\\p{Alpha}| \\p{Digit})(?<let>\\p{Alpha})?( (?<num>\\p{Digit}))?"
In general, to avoid empty strings you can use (?=.).
You can use a negative lookahead to avoid empty input and keep your regex as:
^(?!$)(?<let>\p{L})?(?:\h+(?<num>\p{N}))?$
RegEx Demo
(?!$) is negative lookahead to fail the match for empty strings.
You can solve problem with:
([a-z]? \d)|([a-z] \d?)
You can see this code that covers your test cases in demo here. You can see this code in demo here. This is very basic regular expression knowledge, you should definitely learn more about regular expressions, there are bunch of good tutorials on web (e.g this one).
You can use | for or, then simply repeat "any pattern" to match everything like this.
((?<let>[A-z])|(?<num>\d)\s*)+
That lets you match any number of named patterns in any order.

Why is this regex not matching URLs?

I have the following regex:
^(?=\w+)(-\w+)(?!\.)
Which I'm attempting to match against the following text:
www-test1.examples.com
The regex should match only the -test1 part of the string and only if it is before the first .and after the start of the expression. www can be any string but it should not be matched.
My pattern is not matching the -test1 part. What am I missing?
Java is one of the only languages that support non-fixed-length look-behinds (which basically means you can use quantifiers), so you can technically use the following:
(?<=^\w+)(-\w+)
This will match for -test without capturing the preceding stuff. However, it's generally not advisable to use non-fixed-length look-behinds, as they are not perfect, nor are they very efficient, nor are they portable across other languages. Having said that.. this is a simple pattern, so if you don't care about portability, sure, go for it.
The better solution though is to group what you want to capture, and reference the captured group (in this case, group 1):
^\w+(-\w+)
p.s. - \w will not match a dot, so no need to look ahead for it.
p.p.s. - to answer your question about why your original pattern ^(?=\w+)(-\w+)(?!\.) doesn't match. There are 2 reasons:
1) you start out with a start of string assertion, and then use a lookahead to see if what follows is one or more word chars. But lookaheads are zero-width assertions, meaning no characters are actually consumed in the match, so the pointer doesn't move forward to the next chars after the match. So it sees that "www" matches it, and moves on to the next part of the pattern, but the actual pointer hasn't moved past the start of string. So, it next tries to match your (-\w+) part. Well your string doesn't start with "-" so the pattern fails.
2) (?!\.) is a negative lookahead. Well your example string shows a dot as the very next thing after your "-test" part. So even if #1 didn't fail it, this would fail it.
The problem you're having is the lookahead. In this case, it's inappropriate if you want to capture what's between the - and the first .. The pattern you want is something like this:
(-\w+)(?=\.)
In this case, the contents of capture group 1 will contain the text you want.
Demo on Regex101
Try this:
(?<=www)\-\w+(?=\.)
Demo: https://regex101.com/r/xEpno7/1

Regex to match everything except pattern

I am coming from this question. Now what I want is the exact opposite.
I want to match all chracters except this pattern:
yearid="[0-9]+"
Why do I do that please?
I have tried (?!yearid="[0-9]+") but it refuses to match match.
There are actually two ways to do this. You can use [^0-9]+ where the ^ negates the term inside the brackets, or \D+ where \D is any non-digit character.
re.sub(r'yearid="[0-9]+"', '', string_to_fix)
Capture the group like normal, then substitute nothing for it, and return the complete string.
Or, if you want to go the hard way and negate it:
re.sub(r'(.*?)(?:yearid="[0-9]+")(.*)', '\1\2', string_to_fix)
This first matches everything lazily (.*?), until it finds the yearid="XXXX", matches that as a noncapturing group (?:yearid="[0-9]+"), then matches everything else (.*). Finally, it replaces the original full string with just the 1st and 2nd capture groups, essentially cutting out the section you want.

Regular expressions - capturing groups confusion

I am reading an Oracle tutorial on regular expressions. I am on the topic Capturing groups. Though the reference is excellent, but except that a parenthesis represents a group, I am finding many difficulties in understanding the topic. Here are my confusions.
What is the significance of counting groups in an expression?
What are non-capturing groups?
Elaborating with examples would be nice.
One usually doesn't count groups other than to know which group has which number. E.g. ([abc])([def](\d+)) has three groups, so I know to refer to them as \1, \2 and \3. Note that group 3 is inside 2. They are numbered from the left by where they begin.
When searching with regex to find something in a string, as opposed to matching when you make sure the whole string matches the subject, group 0 will give you just the matched string, but not the stuff that was before or after it. Imagine if you will a pair of brackets around your whole regex. It's not part of the total count because it's not really considered a group.
Groups can be used for other things than capturing. E.g. (foo|bar) will match "foo" or "bar". If you're not interested in the contents of a group, you can make it non-capturing (e.g: (?:foo|bar) (varies by dialect)), so as not to "use up" the numbers assigned to groups. But you don't have to, it's just convenient sometimes.
Say I want to find a word that begins and ends in the same letter: \b([a-z])[a-z]*\1\b The \1 will then be the same as whatever the first group captured. Of course it can be used for much more powerful stuff, but I think you'll get the idea.
(Coming up with relevant examples is certainly the hardest part.)
Edit: I answered when the questions were:
What is the significance of counting groups in an expression?
There is a special group, called as group-0, which means the entire expression. It is not reported by groupCount() method. Why is that?
I don't understand what are non-capturing groups?
Why we need back-references? What is the significance of back-references?
Say you have a string, abcabc, and you want to figure out whether the first part of the string matches the second part. You can do this with a single regex by using capturing groups and backreferences. Here is the regex I would use:
(.+)\1
The way this works is .+ matches any sequence of characters. Because it is in parentheses, it is caught in a group. \1 is a backreference to the 1st capturing group, so it is the equivalent of the text caught by the capturing group. After a bit of backtracking, the capturing group matches the first part of the string, abc. The backreference \1 is now the equivalent of abc, so it matches the second half of the string. The entire string is now matched, so it is confirmed that the first half of the string matches the second half.
Another use of backreferences is in replacing. Say you want to replace all {...} with [...], if the text inside { and } is only digits. You can easily do this with capturing groups and backreferences, using the regex
{(\d+)}
And replacing with that with [\1].
The regex matches {123} in the string abc {123} 456, and captures 123 in the first capturing group. The backreference \1 is now the equivalent of 123, so replacing {(\d+)} in abc {123} 456 with [\1] results in abc [123] 456.
The reason non-capturing groups exist is because groups in general have more uses that just capturing. The regex (xyz)+ matches a string that consists entirely of the group, xyz, repeated, such as xyzxyzxyz. A group is needed because xyz+ only matches xy and then z repeated, i.e. xyzzzzz. The problem with using capturing groups is that they are slightly less efficient compared to non-capturing groups, and they take up an index. If you have a complicated regex with a lot of groups in it, but you only need to reference a single one somewhere in the middle, it's a lot better to just reference \1 rather than trying to count all the groups up to the one you want.
I hope this helps!
Can't think of an appropriate example at the moment, but I'm assuming someone might need to know the number of sub matches in the RegEx.
Group 0 is always the entire base match. I'm assuming groupCount() just lets you know how many capture groups you've specified in the expression.
A non-capturing group (?:) would be used to, well, not capture a group. Ex. if you need to test if a string contains one of several words and don't want to capture the word in a new group: (?:hello|hi there) world !== hello|hi there world. The first matches "hello world" or "hi there world" but the second matches "hello" or "hi there world".
They can be used as a part of a multitude of powerful reasons, such as testing whether or not a number is prime or composite. :) Or you could simply test to ensure a search parameter isn't repeated, ie. ^(\d)(?!.*\1)\d+$ would ensure the first digit is unique in a string.

Regex nth match in equal string

Is it possible to use only a regex (no additional code!) for matching the nth match? For example:
"CAR" - "TRAIN" - "BOAT" - "BICYCLE"
Now I only want to match the BOAT, regex for matching would be "[A-Z]+" however this also matches the first, second and fourth.
Does anyone have a pure regex solution for this? I need this because I can't change the code that uses the regex, but I can provide a regex.
Best regards,
Robin
I think this lookbehind should do it:
(?<=^("[A-Z]+"[\s-]+){2})"[A-Z]+"
It matches a word that comes two words after the start of the string
If I understood you right and you're putting multiple strings into one regexp, string by string, then no, this is not possible.
The regular expression itself does not have memory which lasts longer than matching time,
so if you're matching one thing and after that another thing, there's no information of the first.
(?!(\"[A-Z]\"\s-\s){2})(\"[A-Z]\") - where the {2} means which index you want.
The only problem with it is that it returns every match after the specified index as well. You could perform the match and return the first result.
Tested it using regexpal with your example.

Categories

Resources