Regex match the first occurrence relative to the occurrence of another pattern - java

I have a JSON string from which I need to remove an entire parameter if no replacement value is found for its placeholder.
For example:
{ "A": { |"B": "{0}",| |"C": "{1}",| |"D": "{2}"|}};
In this JSON, if the replacement for placeholder 1 is not found, I would like to remove
|"C": "{1}",|
When I use this regex -
(\|.*)(\{1}",\|)
The previous parameter is also matched -
|"B": "{0}",| |"C": "{1}",|
How do I use a lazy version of the same to get the desired result?
Thanks

Assuming I understand the problem correctly: a lazy (or reluctant) quantifier (i.e. using .*? instead of .*) won't give you what you want, because pattern matching still proceeds from left to right. When the matcher sees the first |, it commits to that starting point and then consumes the smallest number of characters it can, rather than the largest, before it finds {1}. That doesn't help you, because the match still starts from the wrong initial |.
The solution here is to realize that the match can't have any | in it other than the first and last characters. So instead of .*, use a pattern that excludes |:
(\|[^|]*)(\{1}",\|)
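To make the difference concrete, here is a minimal Java sketch that runs both the lazy pattern and the negated-class pattern against the sample input from the question. The class name PlaceholderRemoval and the helper firstMatch are made up for illustration; only the two regexes and the input come from the post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderRemoval {
    // Sample input from the question; the | characters delimit each parameter.
    static final String INPUT =
        "{ \"A\": { |\"B\": \"{0}\",| |\"C\": \"{1}\",| |\"D\": \"{2}\"|}};";

    // Lazy variant of the asker's pattern: still starts at the first |.
    static final String LAZY = "(\\|.*?)(\\{1}\",\\|)";

    // Pattern from the answer: no | allowed between the delimiters.
    static final String NEGATED = "(\\|[^|]*)(\\{1}\",\\|)";

    // Hypothetical helper: returns the first substring matched, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The lazy version still includes the "B" parameter.
        System.out.println(firstMatch(LAZY, INPUT));
        // The negated class matches only the "C" parameter.
        System.out.println(firstMatch(NEGATED, INPUT));
        // Removing the parameter from the JSON.
        System.out.println(INPUT.replaceAll(NEGATED, ""));
    }
}
```

Note that replaceAll leaves the spaces that surrounded the removed parameter in place, so you may want to tidy up whitespace afterwards.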

Related

A regex to match the smallest nested part first

I am quite new to RegEx. I have not much experience. I already searched the internet and tried many things on regex101.com. Nothing seems to work.
This is the pattern:
\\((.*?)\\)
I use it in combination with Java's replaceAll to add a ?: to each (...) provided in a string (the user input).
The user input is used as regular expression as well. But currently I am treating it as a normal String.
Imagine this user input: (Welcome, (StackOverflow|World)|Hello, Dad)
What I want as the result is: (?:Welcome, (?:StackOverflow|World)|Hello, Dad)
But I only get the first ?: : (?:Welcome, (StackOverflow|World)|Hello, Dad)
I think I understand the problem. I guess RegEx scans from left to right and tries to get the smallest match (see .*?). It searches from ( to the next ), and that is (Welcome, (StackOverflow|World).
What could I do to match these nested matches first? I cannot let the user modify their input. I have to find a better regex pattern to match from the smallest possible match to the greatest possible match, and not from the left to the right.
I suggest searching for any unescaped ( (so as not to add ?: after a literal \() that is not followed by ? (to avoid matching lookarounds, non-capturing groups, etc.):
(?<!\\)((?:\\{2})*)\((?!\?)
and replace with $1(?:. See the regex demo.
Java declaration:
String pat = "(?<!\\\\)((?:\\\\{2})*)\\((?!\\?)";
Details:
(?<!\\) - no backslash immediately to the left of the current location
((?:\\{2})*) - Group 1: zero or more pairs of backslashes (i.e. an even number of them, so they escape each other and not the parenthesis)
\( - a literal (...
(?!\?) - that is not immediately followed with a literal ?.
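Here is a small, self-contained Java sketch of that replacement, using the pattern and the $1(?: replacement from above. The class name NonCapturingRewrite and the method toNonCapturing are illustrative names, not part of any library.

```java
class NonCapturingRewrite {
    // The pattern from the answer, written as a Java string literal.
    static final String PAT = "(?<!\\\\)((?:\\\\{2})*)\\((?!\\?)";

    // Hypothetical helper: turns every plain capturing group into (?:...).
    // $1 re-inserts any pairs of backslashes that preceded the parenthesis.
    static String toNonCapturing(String userRegex) {
        return userRegex.replaceAll(PAT, "$1(?:");
    }

    public static void main(String[] args) {
        // Nested groups are both rewritten, because only the opening
        // parenthesis is matched, not the whole (...) span.
        System.out.println(toNonCapturing("(Welcome, (StackOverflow|World)|Hello, Dad)"));
        // An escaped parenthesis and an existing lookahead are left alone.
        System.out.println(toNonCapturing("\\(literal\\) (?=x)"));
    }
}
```

The point of the design is that nesting stops being an issue: the pattern never tries to find the closing parenthesis at all.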

Continue scanning a string until it has found the first/last occurrence of a string

I have this line of text that I want to scan using regex.
axhaweacb
I want to get the text from "a" to "b". This is my current pattern:
pattern = "a.*?b";
The current output is: axhaweacb (it's taking everything in between a and b), but what I want to receive back is "acb".
Why you may ask? The logic/regex I am trying to apply is:
When you find the first occurrence of the "from" regex ("a"), start scanning. If you find another occurrence of the "from" letter without finding the "last" occurrence of a letter - in this case "b", remove the previous string - which is axh so that the string becomes: aweacb. If you find another occurrence of "from" - in this case a, without finding "to" - b. Remove the previous string so that it becomes acb. Then start scanning again. In this case we have found our pattern - a to b, without another "a" in our way.
I know that I can substring the string to begin with, and strip everything up to the last occurrence of "a", but I want to reuse this for different strings as well. In that case, it would always strip everything up to the last occurrence of something, which results in removing a lot of data.
I hope I made my question/problem clear. If not, please tell me and I will do my best to clarify my problem.
Thank you.
The regex engine searches for a match from left to right. When it finds a with a.*?b, it is the first a in your string. Then, the first b found and matched is the last character in your axhaweacb string.
A lazy quantifier extends the match only as far as the nearest position to the right where the subsequent subpattern can match. It does not produce the shortest possible substring, because the starting position is already fixed.
So, what you need is a way to exclude (=fail if found) all occurrences of the leading and trailing subpatterns in between them.
It can be done with the help of a tempered greedy token:
pattern = "a(?:(?!a|b).)*b";
            ^^^^^^^^^^^^^
Here is a demo
You can use this negative lookahead based regex:
a(?:(?![ab]).)*b
(?![ab]) is a negative lookahead that fails if the next character is a or b
(?:(?![ab]).)* matches 0 or more of any character that is not a or b, thus giving us the shortest match between a and b
RegEx Demo
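As a quick check, this Java sketch compares the lazy pattern with the tempered greedy token on the string from the question. The class name TemperedToken and the helper firstMatch are made-up names for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedToken {
    // Hypothetical helper: returns the first substring matched, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "axhaweacb";
        // Lazy: starts at the first a and stops at the first b it can reach.
        System.out.println(firstMatch("a.*?b", input));            // axhaweacb
        // Tempered token: no a or b may appear between the delimiters.
        System.out.println(firstMatch("a(?:(?![ab]).)*b", input)); // acb
    }
}
```

When both delimiters are single characters, as here, a negated character class such as a[^ab]*b should behave the same way, except that [^ab] also matches line breaks while . does not by default. The tempered token is mainly useful when the "from" and "to" parts are longer than one character.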

Regular expression non-greedy but still

I have some larger text which in essence looks like this:
abc12..manycharshere...hi - abc23...manyothercharshere...jk
Obviously there are two items, each starting with "abc", the numbers (12 and 23) are interesting as well as the "hi" and "jk" at the end.
I would like to create a regular expression which allows me to parse out the numbers, but only if the two characters at the end match, i.e. I am looking for the number related to "jk", but the following regular expression matches the whole string and thus returns "12", not "23" even when non-greedy matching the area with the following:
abc([0-9]+).*?jk
Is there a way to construct a regular expression which matches text like the one above, i.e. retrieving "23" for items ending in "jk"?
Basically I would need something like "match abc followed by a number, but only if there is jk at the end before another instance of abc followed by a number appears".
Note: the texts/matches are an abstraction here; the actual text is more complicated, especially the things that can appear as "manyothercharshere". I simplified to show the underlying problem more clearly.
Use a regex like this. .*abc([0-9]+).*?jk
demo here
I think you want something like this,
abc([0-9]+)(?=(?:(?!jk|abc[0-9]).)*jk)
DEMO
You need to use negative lookahead here to make it work:
abc(?!.*?abc)([0-9]+).*?jk
RegEx Demo
Here (?!.*?abc) is a negative lookahead that makes sure abc is matched only where it is NOT followed by another abc, so that the closest string between abc and jk is matched. Note that this means only the last abc in the text can ever match.
Being non-greedy does not change the rule, that the first match is returned. So abc([0-9]+).*?jk will find the first jk after “abcnumber” rather than the last one, but still match the first “abcnumber”.
One way to solve this is to tell that the dot should not match abc([0-9]+):
abc([0-9]+)((?!abc([0-9]+)).)*jk
If it is not important to have the entire pattern being an exact match you can do it simpler:
.*(abc([0-9]+).*?jk)
In this case, it’s group 1 which contains your intended match. The pattern uses a greedy matchall to ensure that the last possible “abcnumber” is matched within the group.
Assuming that hyphen separates "items", this regex will capture the numbers from the target item:
abc([0-9]+)[^-]*?jk
See demo
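To compare the suggestions above side by side, here is a minimal Java sketch that runs several of them against the sample text from the question. The class name LastItemNumber and the helper group are illustrative only; the patterns are copied from the answers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastItemNumber {
    // Sample text from the question.
    static final String TEXT =
        "abc12..manycharshere...hi - abc23...manyothercharshere...jk";

    // Hypothetical helper: returns the requested capture group of the
    // first match, or null if the pattern does not match at all.
    static String group(String regex, int group, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // Original attempt: the leftmost abc wins, so we get 12.
        System.out.println(group("abc([0-9]+).*?jk", 1, TEXT));
        // The dot may not step onto another "abc + number": 23.
        System.out.println(group("abc([0-9]+)((?!abc([0-9]+)).)*jk", 1, TEXT));
        // Greedy prefix pushes the match as far right as possible;
        // the number is now in group 2: 23.
        System.out.println(group(".*(abc([0-9]+).*?jk)", 2, TEXT));
        // Lookahead rejects any abc that is followed by another abc: 23.
        System.out.println(group("abc(?!.*?abc)([0-9]+).*?jk", 1, TEXT));
        // Hyphen-separated items: 23.
        System.out.println(group("abc([0-9]+)[^-]*?jk", 1, TEXT));
    }
}
```

Which one fits best probably depends on the real data: the greedy-prefix and lookahead versions only ever find the last item, while the tempered dot can find a matching item in the middle of the text as well.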

Pattern matcher using Greedy and Reluctant

I have been reading about greedy and reluctant quantifiers in Java regex. They are described as follows:
A reluctant or "non-greedy" quantifier first matches as little as
possible. So the .* matches nothing at first, leaving the entire
string unmatched
In this example
source: yyxxxyxx
pattern: .*xx
With the greedy quantifier *, this produces:
0 yyxxxyxx
With the reluctant quantifier *?, we get the following:
0 yyxx
4 xyxx
Why is a result of yxx, yxx not possible, even though that would be the smallest possible match?
The regex engine returns the first, leftmost match it finds.
Basically it tries to match the pattern starting from the first character. If it doesn't find a match there, the engine bumps along and tries again from the second character, and so on.
If you use a+?b on bab, it will first try from the first b. That doesn't work, so it tries from the second character.
But in your example it finds a match starting right at the first character. Starting from the second character isn't even considered: a match was found, so it is returned.
If you apply a+?b to aab, the engine tries at the first a and finds an overall match: end of story, no reason to try anything else.
To sum up: the regex engine goes from the left to the right, so laziness can only affect the right side length.
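The behaviour described above can be reproduced with a short Java sketch. The class name LeftmostMatch and the helper allMatches are made up for illustration; the patterns and inputs are the ones discussed in the question and answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeftmostMatch {
    // Hypothetical helper: collects every match as "<start> <text>".
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + " " + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Greedy: one match covering the whole string.
        System.out.println(allMatches(".*xx", "yyxxxyxx"));  // [0 yyxxxyxx]
        // Reluctant: each match still starts where the search starts,
        // only its right edge is as short as possible.
        System.out.println(allMatches(".*?xx", "yyxxxyxx")); // [0 yyxx, 4 xyxx]
        // The start position is tried first, so the match is aab, not ab.
        System.out.println(allMatches("a+?b", "aab"));       // [0 aab]
        // No match is possible at index 0, so the engine moves to index 1.
        System.out.println(allMatches("a+?b", "bab"));       // [1 ab]
    }
}
```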

Reason for the behavior of the reluctant quantifier ?? in java regex

I know that the ? is a greedy quantifier and ?? is the reluctant one for it.
When I use it as follows, it always gives me empty output. Is that because it always operates from left to right (first trying zero occurrences, then one occurrence), or is there another reason?
Pattern pattern = Pattern.compile("a??");
Matcher matcher = pattern.matcher("aba");
while (matcher.find()) {
    System.out.println(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
}
Output :
0[]0
1[]1
2[]2
3[]3
Your regex could be explained as follows: "try to match zero characters, and if that fails try to match one 'a' character".
Trying to match zero characters will always succeed, so there is really no purpose for a regex that only contains a single reluctant element.
I'm not sure about the Java implementation but regular-expressions.info states this for ?? :
Makes the preceding item optional. Lazy, so the optional item is excluded in the match if possible. This construct is often excluded from documentation because of its limited use.
Thus you get 4 matches (3 character positions + the empty string at the end), and the optional a is excluded from each of them.
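For comparison, here is a minimal Java sketch that runs the reluctant a?? next to the greedy a?, and also shows a case where a?? is followed by something that forces it to consume the a. The class name ReluctantOptional and the helper allMatches are illustrative; the output format mirrors the one used in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantOptional {
    // Hypothetical helper: collects every match as "<start>[<text>]<end>".
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "[" + m.group() + "]" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // Reluctant: zero characters always succeeds, so a is never taken.
        System.out.println(allMatches("a??", "aba"));  // [0[]0, 1[]1, 2[]2, 3[]3]
        // Greedy: takes the a where there is one, empty match elsewhere.
        System.out.println(allMatches("a?", "aba"));   // [0[a]1, 1[]1, 2[a]3, 3[]3]
        // With a following b, the empty attempt fails at index 0,
        // so the engine backtracks and a?? consumes the a after all.
        System.out.println(allMatches("a??b", "aba")); // [0[ab]2]
    }
}
```

So ?? only makes a visible difference when something after it in the pattern can succeed either way; on its own it will always settle for the empty match.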
