A regex to match the smallest nested part first - java

I am quite new to regex and don't have much experience. I have already searched the internet and tried many things on regex101.com, but nothing seems to work.
This is the pattern:
\\((.*?)\\)
I use it in combination with Java's replaceAll to add a ?: to each (...) in a string (the user input).
The user input is used as regular expression as well. But currently I am treating it as a normal String.
Imagine this user input: (Welcome, (StackOverflow|World)|Hello, Dad)
What I want as the result is: (?:Welcome, (?:StackOverflow|World)|Hello, Dad)
But I only get the first ?: : (?:Welcome, (StackOverflow|World)|Hello, Dad)
I think I understand the problem. I guess the regex engine scans from left to right and tries to get the smallest match (see .*?). It searches from ( to the next ), and that is (Welcome, (StackOverflow|World).
What could I do to match these nested matches first? I cannot let the user modify their input. I have to find a better regex pattern to match from the smallest possible match to the greatest possible match, and not from the left to the right.

I suggest searching for any unescaped ( (so as not to add ?: after a literal \() that is not followed by ? (to avoid matching lookarounds, non-capturing groups, etc.):
(?<!\\)((?:\\{2})*)\((?!\?)
and replace with $1(?:. See the regex demo.
Java declaration:
String pat = "(?<!\\\\)((?:\\\\{2})*)\\((?!\\?)";
Details:
(?<!\\) - no backslash immediately to the left of the current location
((?:\\{2})*) - Group 1: zero or more pairs of backslashes (that is, an even number of them)
\( - a literal (...
(?!\?) - that is not immediately followed with a literal ?.
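To make the suggestion above concrete, here is a minimal sketch of how the pattern might be wired into replaceAll. The class and method names (NonCapturingRewriter, makeNonCapturing) are hypothetical and only serve the illustration.

```java
import java.util.regex.Pattern;

final class NonCapturingRewriter {
    // An unescaped "(" that does not already start a (?...) construct.
    private static final Pattern OPEN_PAREN =
            Pattern.compile("(?<!\\\\)((?:\\\\{2})*)\\((?!\\?)");

    // Hypothetical helper: turns every plain capturing group into a non-capturing one.
    static String makeNonCapturing(String userRegex) {
        // $1 puts back any pairs of backslashes that were consumed before the "(".
        return OPEN_PAREN.matcher(userRegex).replaceAll("$1(?:");
    }

    public static void main(String[] args) {
        System.out.println(makeNonCapturing("(Welcome, (StackOverflow|World)|Hello, Dad)"));
        // Escaped parentheses and existing (?=...) groups are left untouched.
        System.out.println(makeNonCapturing("\\(literal\\) and (?=lookahead)"));
    }
}
```

Because each opening parenthesis is matched on its own, nesting depth does not matter here. Note that this sketch does not treat a ( inside a character class such as [(] specially, so such input would need extra handling.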

Related

Java Regex to replace only part of string (url)

I want to replace only the numeric sections of a string. In most cases it's either a full URL or part of a URL, but it can be just a normal string as well.
/users/12345 becomes /users/XXXXX
/users/234567/summary becomes /users/XXXXXX/summary
/api/v1/summary/5678 becomes /api/v1/summary/XXXX
http://example.com/api/v1/summary/5678/single becomes http://example.com/api/v1/summary/XXXX/single
Notice that I am not replacing 1 from /api/v1
So far, I only have the following, which seems to work in most cases:
input.replaceAll("/[\\d]+$", "/XXXXX").replaceAll("/[\\d]+/", "/XXXXX/");
But this has 2 problems:
The replacement size doesn't match with the original string length.
The replacement character is hardcoded.
Is there a better way to do this?
In Java you can use:
str = str.replaceAll("(/|(?!^)\\G)\\d(?=\\d*(?:/|$))", "$1X");
RegEx Demo
RegEx Details:
\G asserts position at the end of the previous match or the start of the string for the first match.
(/|(?!^)\\G): Match / or end of the previous match (but not at start) in capture group #1
\\d: Match a digit
(?=\\d*(?:/|$)): Ensure that digits are followed by a / or end.
Replacement: $1X: replace it with capture group #1 followed by X
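As a quick check of the details above, the following sketch runs that replaceAll call against the four sample inputs from the question. The class name DigitMaskWithG and the method name mask are made up for the example.

```java
final class DigitMaskWithG {
    // Replaces one digit per match; \G chains the matches so each digit of a
    // purely numeric path segment is replaced by a single X.
    static String mask(String s) {
        return s.replaceAll("(/|(?!^)\\G)\\d(?=\\d*(?:/|$))", "$1X");
    }

    public static void main(String[] args) {
        String[] samples = {
            "/users/12345",
            "/users/234567/summary",
            "/api/v1/summary/5678",
            "http://example.com/api/v1/summary/5678/single"
        };
        for (String sample : samples) {
            System.out.println(sample + " becomes " + mask(sample));
        }
    }
}
```

The 1 in /api/v1 is preserved because it is preceded by v rather than by / or by a previous match.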
I'm not a Java guy, but the idea should be transferable. Just capture a /, the digits and optionally a /, count the length of the second group and put it back again.
So
(/)(\d+)(/?)
becomes
$1XYZ$3
See a demo on regex101.com and this answer for a lambda equivalent to e.g. Python or PHP.
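In Java, the "count the length and put it back" idea could be sketched with an explicit find loop, which plays the role that a replacement callback plays in Python or PHP. This is only one possible rendering: it swaps the three capture groups for lookarounds, (?<=/)\d+(?=/|$), so that adjacent numeric segments such as /1/2 are both handled, and the names DigitMaskWithLoop and mask are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DigitMaskWithLoop {
    // A run of digits that fills a whole path segment: preceded by "/" and
    // followed by "/" or the end of the input.
    private static final Pattern NUMERIC_SEGMENT = Pattern.compile("(?<=/)\\d+(?=/|$)");

    static String mask(String s, char maskChar) {
        Matcher m = NUMERIC_SEGMENT.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Build a replacement with the same length as the matched digits.
            char[] fill = new char[m.end() - m.start()];
            Arrays.fill(fill, maskChar);
            // quoteReplacement keeps characters like '$' literal in the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(new String(fill)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mask("/users/234567/summary", 'X'));
        System.out.println(mask("/api/v1/summary/5678", '*'));
    }
}
```

Passing the mask character as a parameter addresses the second problem from the question (the hardcoded replacement character).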
First of all you need something like this :
String new_s1 = s3.replaceAll("(\\/)(\\d)+(\\/)?", "$1XXXXX$3");

Why is this regex not matching URLs?

I have the following regex:
^(?=\w+)(-\w+)(?!\.)
Which I'm attempting to match against the following text:
www-test1.examples.com
The regex should match only the -test1 part of the string, and only if it comes before the first . and after the start of the expression. www can be any string, but it should not be matched.
My pattern is not matching the -test1 part. What am I missing?
Java is one of the only languages that support non-fixed-length look-behinds (which basically means you can use quantifiers), so you can technically use the following:
(?<=^\w+)(-\w+)
This will match -test1 without consuming the preceding text. However, it's generally not advisable to use non-fixed-length look-behinds, as they are not perfect, nor are they very efficient, nor are they portable across other languages. Having said that, this is a simple pattern, so if you don't care about portability, sure, go for it.
The better solution though is to group what you want to capture, and reference the captured group (in this case, group 1):
^\w+(-\w+)
p.s. - \w will not match a dot, so no need to look ahead for it.
p.p.s. - to answer your question about why your original pattern ^(?=\w+)(-\w+)(?!\.) doesn't match. There are 2 reasons:
1) you start out with a start of string assertion, and then use a lookahead to see if what follows is one or more word chars. But lookaheads are zero-width assertions, meaning no characters are actually consumed in the match, so the pointer doesn't move forward to the next chars after the match. So it sees that "www" matches it, and moves on to the next part of the pattern, but the actual pointer hasn't moved past the start of string. So, it next tries to match your (-\w+) part. Well your string doesn't start with "-" so the pattern fails.
2) (?!\.) is a negative lookahead. Well, your example string shows a dot as the very next thing after your "-test1" part. So even if #1 didn't fail it, this would fail it.
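The difference between the two patterns discussed in this answer can be seen in a small sketch. The helper name firstGroup and the class name SubdomainSuffix are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SubdomainSuffix {
    // Returns capture group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String host = "www-test1.examples.com";
        // Consuming the leading word characters lets the group start at the hyphen.
        System.out.println(firstGroup("^\\w+(-\\w+)", host));
        // The lookahead consumes nothing, so (-\w+) is tried at index 0 and fails.
        System.out.println(firstGroup("^(?=\\w+)(-\\w+)(?!\\.)", host));
    }
}
```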
The problem you're having is the lookahead. In this case, it's inappropriate if you want to capture what's between the - and the first .. The pattern you want is something like this:
(-\w+)(?=\.)
In this case, the contents of capture group 1 will contain the text you want.
Demo on Regex101
Try this:
(?<=www)\-\w+(?=\.)
Demo: https://regex101.com/r/xEpno7/1

Problems with Regex pattern

I'm trying to remove the first occurrence of a pattern from a string in Java.
Source string: DUMMY01012016DUMMY01012016
Format is 1-8 alpha-numeric characters followed by a date MMddyyyy followed by any number of alpha-numerics.
What I'm trying to achieve is to remove all leading characters up to and including the first date occurrence. So in the example above I would be left with DUMMY01012016.
Here is a simplified version of what I have tried: ".*\\d{4}(2016|2017|2015)"
That works well until the pattern is matched more than once. So in the example matcher.replaceFirst("") will replace the entire source string and not just the first occurrence.
Any thoughts would be greatly appreciated.
Thanks. Stephan
Your issue is that the * quantifier is greedy. It will cause the preceding sub-pattern to match as many times as possible without causing the overall match to fail (if a match is possible at all). Thus the tail of your pattern .*\d{4}(2016|2017|2015) will match the last occurrence of a date in the string, whereas you want it to match the first.
You can solve this problem by switching to a "reluctant" quantifier instead:
myString.replaceFirst(".*?\\d{4}(2016|2017|2015)", "");
There, *? is a reluctant quantifier: it matches zero or more instances of the preceding sub-pattern, as few as possible to enable an overall match (if an overall match is possible).
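A small side-by-side sketch of the two quantifiers on the sample string from the question may help. The class and method names are hypothetical.

```java
final class GreedyVsReluctant {
    // Greedy: .* runs to the end and backtracks, so the match ends at the LAST date.
    static String stripGreedy(String s) {
        return s.replaceFirst(".*\\d{4}(2016|2017|2015)", "");
    }

    // Reluctant: .*? grows one character at a time, so the match ends at the FIRST date.
    static String stripReluctant(String s) {
        return s.replaceFirst(".*?\\d{4}(2016|2017|2015)", "");
    }

    public static void main(String[] args) {
        String source = "DUMMY01012016DUMMY01012016";
        System.out.println("greedy:    '" + stripGreedy(source) + "'");
        System.out.println("reluctant: '" + stripReluctant(source) + "'");
    }
}
```

With the greedy version the whole source string is removed, whereas the reluctant version leaves the second item in place.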
This regex should work:
(\w{1,8}?\d{8})(?:\1)
One of your problems is that the .* is greedy. It means that it matches as much as it can at first. Then the regex engine starts to step back character by character until a full match is found.
So, roughly:
Step 1) .* matches the whole DUMMY01012016DUMMY01012016
Step 2) The engine steps back symbol by symbol trying to match the remaining part:
DUMMY01012016DUMMY0101201 -> DUMMY01012016DUMMY010120 -> DUMMY01012016DUMMY01012 -> .. -> DUMMY01012016DUMMY
Step 3) A complete match is found -> DUMMY01012016DUMMY01012016
You can try something like this:
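@Test
public void testReplace()
{
    String string = "DUMMY01012016DUMMY01012016";
    // \w{1,8} backtracks until \d{4} plus one of the listed years can follow,
    // so only the first item is removed.
    String replaced = string.replaceFirst("\\w{1,8}\\d{4}(2016|2017|2015)", "");
    Assert.assertEquals("DUMMY01012016", replaced);
}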
To understand the difference between lazy and greedy, you can experiment and make the asterisk lazy by adding a question mark (?), e.g. .*?\d{4}(2016|2017|2015). Then the engine will do the opposite: it will match lazily at the beginning and step forward character by character.

Regular expression non-greedy but still

I have some larger text which in essence looks like this:
abc12..manycharshere...hi - abc23...manyothercharshere...jk
Obviously there are two items, each starting with "abc", the numbers (12 and 23) are interesting as well as the "hi" and "jk" at the end.
I would like to create a regular expression which allows me to parse out the numbers, but only if the two characters at the end match, i.e. I am looking for the number related to "jk", but the following regular expression matches the whole string and thus returns "12", not "23" even when non-greedy matching the area with the following:
abc([0-9]+).*?jk
Is there a way to construct a regular expression which matches text like the one above, i.e. retrieving "23" for items ending in "jk"?
Basically I would need something like "match abc followed by a number, but only if there is jk at the end, before another instance of abc followed by a number appears".
Note: the texts/matches are an abstraction here; the actual text is more complicated, especially the things that can appear as "manyothercharshere". I simplified to show the underlying problem more clearly.
Use a regex like this. .*abc([0-9]+).*?jk
demo here
I think you want something like this,
abc([0-9]+)(?=(?:(?!jk|abc[0-9]).)*jk)
DEMO
You need to use negative lookahead here to make it work:
abc(?!.*?abc)([0-9]+).*?jk
RegEx Demo
Here (?!.*?abc) is a negative lookahead that makes sure to match an abc that is NOT followed by another abc, thus making sure the closest string between abc and jk is matched.
Being non-greedy does not change the rule, that the first match is returned. So abc([0-9]+).*?jk will find the first jk after “abcnumber” rather than the last one, but still match the first “abcnumber”.
One way to solve this is to tell that the dot should not match abc([0-9]+):
abc([0-9]+)((?!abc([0-9]+)).)*jk
If it is not important to have the entire pattern being an exact match you can do it simpler:
.*(abc([0-9]+).*?jk)
In this case, it’s group 1 which contains your intended match. The pattern uses a greedy matchall to ensure that the last possible “abcnumber” is matched within the group.
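To illustrate the point about the first match being returned, the sketch below compares the original non-greedy pattern with the variant that forbids the dot from crossing another abc followed by a number. The class and method names are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberBeforeJk {
    // Returns capture group 1 of the first match, or null when nothing matches.
    static String firstNumber(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "abc12..manycharshere...hi - abc23...manyothercharshere...jk";
        // Non-greedy alone still starts at the first "abc12".
        System.out.println(firstNumber("abc([0-9]+).*?jk", text));
        // Each consumed character is checked so that it does not start another "abc<number>".
        System.out.println(firstNumber("abc([0-9]+)((?!abc([0-9]+)).)*jk", text));
    }
}
```

On the sample text the first call yields 12 and the second yields 23, since the attempt starting at abc12 cannot reach a jk without crossing abc23.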
Assuming that hyphen separates "items", this regex will capture the numbers from the target item:
abc([0-9]+)[^-]*?jk
See demo

Should capturing parentheses affect a separate negative lookahead?

I am using Java. I have the following text:
"hyst and hy"
Why does (hy)(?![a-z]) return two "hy"s? The idea is to match any "hy" that is not followed by any character between a-z.
If I use hy(?![a-z]) (hy without parentheses) it works (it finds only the second "hy"), but I don't understand why, if I use parentheses (hy) in the regex, it matches the first "hy" in hyst.
When you use a capture group, each match yields two results: the first is the whole match (group 0) and the second is the capture group (group 1). The first hy (in hyst) is never matched.
If you remove the parentheses, you obtain only the text matched by the whole pattern.
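This can be checked with a short sketch that lists where each pattern matches and what the groups contain. The names GroupsVsMatches and matchStarts are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupsVsMatches {
    // Collects the start index of every match of the regex in the input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "hyst and hy";
        Matcher m = Pattern.compile("(hy)(?![a-z])").matcher(text);
        while (m.find()) {
            // One match only; group(0) and group(1) both hold the same "hy".
            System.out.println("at " + m.start() + ": group(0)=" + m.group(0)
                    + ", group(1)=" + m.group(1));
        }
        System.out.println(matchStarts("hy(?![a-z])", text));
    }
}
```

Both patterns match only once, at index 9; the two "hy" values seen with parentheses are group 0 and group 1 of that single match.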
