How can I handle multiple parenthesis in a regex? - java

I have strings of this type:
text (more text)
What I would like to do is to have a regular expression that extracts the "more text" segment of the string. So far I have been using this regular expression:
"^.*\\((.*)\\)$"
Which although it works on many cases, it seems to fail if I have something of the sort:
text (more text (even more text))
What I get is: even more text)
What I would like to get instead is:
more text (even more text) (basically the content of the outermost pair of brackets.)

Besides lazy quantification, another way is:
"^[^(]*\\((.*)\\)$"
In both regexes, there is a explicitly specified left parenthesis ("\\(", with Java String escaping) immediately before the matching group. In the original, there was a .* before that, allowing anything (including other left parentheses). In mine, left parentheses are not allowed here (there is a negated character class), so the explicitly specified left parenthesis in the outermost.

I recommend this (double escaping of the backslash removed, since this is not part of the regex):
^[^(]*\((.*)\)
Matching with your version (^.*\((.*)\)$) occurs like this:
The star matches greedily, so your first .* goes right to the end of the string.
Then it backtracks just as much as necessary so the \( can match - that would be the last opening paren in the string.
Then the next .* goes right to the end of the string again.
Then it backtracks just as much so the \) can match, i.e. to the last closing paren.
When you use [^(]* instead of .*, it can't go past the first opening paren, so the first opening paren (the correct one) in the string will delimit your sub-match.

Try:
"^.*?\\((.*)\\)$"
That should make the first matching less greedy. Greedy means it swallows everything it possibly can while still getting an overall pattern match.
The other suggestion:
"^[^(]*\\((.*)\\)$"
Might be more along the line of what you're looking for though. For this simple example it doesn't really matter so much, but it could if you wanted to expand on the regex, for example by making the part inside the braces optional.

Try this:
"^.*?\\((.*)\\)$"

True regular expressions can't count parentheses; this requires a pushdown automaton. Some regex libraries have extensions to support this, but I don't think Java's does (could be wrong; Java isn't my forté).
BTW, the other answers I've seen so far will work with the example given, but will break with, e.g., text (more text (even more text)) (another bit of text). Changing greediness doesn't make up for the inability to count.

$str =~ /^.*?\((.*)\)/

I think the reason is because you second wildcard is picking up the closing parenthesis. You'll need to exclude it.

Related

Why is this regex not matching URLs?

I have the following regex:
^(?=\w+)(-\w+)(?!\.)
Which I'm attempting to match against the following text:
www-test1.examples.com
The regex should match only the -test1 part of the string and only if it is before the first .and after the start of the expression. www can be any string but it should not be matched.
My pattern is not matching the -test1 part. What am I missing?
Java is one of the only languages that support non-fixed-length look-behinds (which basically means you can use quantifiers), so you can technically use the following:
(?<=^\w+)(-\w+)
This will match for -test without capturing the preceding stuff. However, it's generally not advisable to use non-fixed-length look-behinds, as they are not perfect, nor are they very efficient, nor are they portable across other languages. Having said that.. this is a simple pattern, so if you don't care about portability, sure, go for it.
The better solution though is to group what you want to capture, and reference the captured group (in this case, group 1):
^\w+(-\w+)
p.s. - \w will not match a dot, so no need to look ahead for it.
p.p.s. - to answer your question about why your original pattern ^(?=\w+)(-\w+)(?!\.) doesn't match. There are 2 reasons:
1) you start out with a start of string assertion, and then use a lookahead to see if what follows is one or more word chars. But lookaheads are zero-width assertions, meaning no characters are actually consumed in the match, so the pointer doesn't move forward to the next chars after the match. So it sees that "www" matches it, and moves on to the next part of the pattern, but the actual pointer hasn't moved past the start of string. So, it next tries to match your (-\w+) part. Well your string doesn't start with "-" so the pattern fails.
2) (?!\.) is a negative lookahead. Well your example string shows a dot as the very next thing after your "-test" part. So even if #1 didn't fail it, this would fail it.
The problem you're having is the lookahead. In this case, it's inappropriate if you want to capture what's between the - and the first .. The pattern you want is something like this:
(-\w+)(?=\.)
In this case, the contents of capture group 1 will contain the text you want.
Demo on Regex101
Try this:
(?<=www)\-\w+(?=\.)
Demo: https://regex101.com/r/xEpno7/1

regex with lookbehind weird behavior

I have been trying to resolve this for the past 2 days...
Please help me in understanding why this is happening. My intention is to just select the <HDR> that has a <DTL1 val="92">.....</HDR>
This is my regular expression
(?<=<HDR>).*?<DTL1\sval="3".*?</HDR>
And the input string is:
<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR><HDR><DTL1 val="3"><DTL2 val="4"></HDR>
But this regular expression selects
abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR>
Can anyone please help me?
A regex engine will give you always the leftmost match in a string (even if you use a non-greedy quantifier). This is exactly what you obtain.
So, a solution is to forbid the presence of another <HDR> in the parts described by .*? that is too permissive.
You have two technics to do that, you can replace the .*? with:
(?>[^<]+|<(?!/HDR))*
or with:
(?:(?!</HDR).)*+
Most of the time, the first technic is more performant, but if your string contains an high density of <, the second way can give good results too.
The use of a possessive quantifier or an atomic group can reduce the number of steps to obtain a result in particular when the subpattern fails.
Example:
With the first way:
(?<=<HDR>)(?>[^<]+|<(?!/HDR))*<DTL1\sval="3"(?>[^<]+|<(?!/HDR))*</HDR>
or this variant:
(?<=<HDR>)(?:[^<]+|<(?!/HDR|DTL1))*+<DTL1\sval="3"(?:[^<]+|<(?!/HDR))*+</HDR>
With the second way:
(?<=<HDR>)(?:(?!</HDR).)*<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>
or this variant:
(?<=<HDR>)(?:(?!</HDR|DTL1).)*+<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>
Casimir et Hippolyte already gave you a couple of good solutions. I want to elaborate on a few things.
First, why your regex fails to do what you want: (?<=<HDR>).*? tells it to match any number of characters starting with the first character preceded by <HDR>, until it encounters what follows the non-greedy quantifier (<DTL1...). Well, the first character that's preceded by <HDR> is the first a, so it matches everything starting from there until the fixed string <DTL1\sval="3" is encountered.
Casimir et Hippolyte's solutions are for the generalized case, where the contents of the <HDR> tags can be anything other than nested <HDR>'s. You could also do it with a positive look-ahead:
(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>
However, if the string is guaranteed to be in the structure shown, where the <HDR> tags only contain one or more <DTL1 val="##"> tags, so you know there won't be any closing tags within, you could do it more efficiently by replacing the first .*? with [^/]*:
(?<=<HDR>)[^/]*<DTL1\sval="3".*?</HDR>
A negated character class is more efficient than a zero-width assertion, and if you're using a negated character class, a greedy quantifier becomes more efficient than a lazy one.
Note also that by using a lookbehind to match the opening <HDR>, you're excluding it from the match, but you're including the closing </HDR>. Are you sure that's what you want? You're matching this...
<DTL1 val="3"><DTL2 val="4"></HDR>
...when presumably you want this...
<HDR><DTL1 val="3"><DTL2 val="4"></HDR>
...or this...
<DTL1 val="3"><DTL2 val="4">
So, in the fist case, don't use a lookbehind for the opening tag:
<HDR>(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>
<HDR>[^/]*<DTL1\sval="3".*?</HDR>
In the second case, use a look-ahead for the closing tag:
(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?(?=</HDR>)
(?<=<HDR>)[^/]*<DTL1\sval="3".*?(?=</HDR>)

regex to strip all comments except within parentheses

I have a regex that finds single and multiline comments just fine, but I want to exclude comments that are within parentheses. I think i need a negative lookahead but have been unable to get it to work.
regex:
(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)
samples of what to ignore:
url(data:image/gif;base64,R0lGODlhBgAGAIAAAOrq6v///yH5BAAHAP8ALAAAAAYzqlgoFADs=)
src: url(//:) format('no404')
Any help would be appreciated.
Making sure // is inside a pair of parentheses is not directly possible with a Java regex. But you can take the liberty of checking whether the next parenthesis is a closing one (and reject the match if it is). Of course, that only works if there are no nested parentheses and no parentheses in a comment:
//(?![^()\\r\\n]*\\)).*
would do this.
As for the first part of your regex that matches /*...*/ comments - that's a bit overcomplicated, I think. Since Java doesn't allow nested comments,
/\\*.*?\\*/
would do. You just need to make sure that the dot is allowed to match newlines in this part of the regex:
Pattern regex = Pattern.compile("(?s)/\\*.*?\\*/|(?-s)//(?![^()\r\n]*\\)).*");
You could get away with this:
your_regex(?![^(]*\\))
ie add a negative look ahead to the end to assert that the next bracket character is not a close bracket

How do I make this regex more general, sometimes it works and sometimes it doesn't

I have the following regex that I am using in a java application. Sometimes it works correctly and sometimes it doesn't.
<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->
Sometimes I will have whitespace before/after it, sometimes there will be text. The same goes for the region within the tags.
The main problem is that name=(\".*\")?> sometimes matches more than it is supposed to. I am not sure if that is something that is obvious to solve, simply looking at this code.
XML is not a regular language, nor is HTML or any other language with "nesting" constructs. Don't try to parse it with regular expressions.
Choose an XML parser.
As others have pointed out, the greedy .* (dot-star) that matches the "name" attribute needs to be made non-greedy (.*?) or even better, replaced with a negated character class ([^"]*) so it can't match beyond the closing quotation mark no matter what happens in the rest of the regex. Once you've fixed that, you'll probably find you have the same problem with the other dot-star; you need to make it non-greedy too.
Pattern p = Pattern.compile(
"<!--\\s*<editable\\s+name=\"([^\"]*)\">\\s*-->" +
"(.*?)" +
"<!--\\s*</editable>\\s*-->",
Pattern.DOTALL);
I don't get the significance of your remarks about whitespace. If it's linefeeds and/or carriage returns you're talking about, the DOTALL modifier lets the dot match those--and of course, \s matches them as well.
I wrote this in the form of a Java string literal to avoid confusion about where you need backslashes and how many of them you need. In a "raw" regex, there would be only one backslash in each of the whitespace shorthands (\s*), and the quotation marks wouldn't need to be escaped ("[^"]*").
I would replace that .* with [\w-]* for example if name is an identifier of some sort.
Or [^\"]* so it doesn't capture the end double quote.
Edit:
As mentioned in other post you might consider going for a simple DOM traversal, XPath or XQuery based evaluation process instead of a plain regular expression. But note that you will still need to have regex in the filtering process because you can find the target comments only by testing their body against a regular expression (as I doubt the body is constant judjing from the sample).
Edit 2:
It might be that the leading, trailing or internal whitespaces of the comment body makes your regexp fail. Consider putting \s* in the beginning and at the end, plus \s+ before the attribute-like thing.
<!--\s*<editable\s+name=(\"[^\"]*\")?>\s*-->(.*)<!--\s*</editable>\s*-->
Or when you are filtering on XML based search:
"\\s*<editable\\s+name=(\"[^\"]*\")?>\\s*"
"\\s*</editable>\\s*"
Edit 3: Fixed the escapes twice. Thanks Alan M.
the * multiplier is "greedy" by default, meaning it matches as much as possible, while still matching the pattern successfully.
You can disable this by using *?, so try:
(\".*?\")

regex for that excludes matches within quotes

I'm working on this pretty big re-factoring project and I'm using intellij's find/replace with regexp to help me out.
This is the regexp I'm using:
\b(?<!\.)Units(?![_\w(.])\b
I find that most matches that are not useful for my purpose are the matches that occur with strings within quotes, for example: "units"
I'd like to find a way to have the above expression not match when it finds a matching string that's between quotes...
Thx in advance, this place rocks!
Assuming the quotes are always paired on a given line, you could create matches before and after for an even number of quotes, and make sure the whole line is matched:
^([^"]*("[^"]*")*[^"]*)*\b(?<!\.)Units(?![_\w(.])\b([^"]*("[^"]*")*[^"]*)*$
this works because the fragment
([^"]*("[^"]*")*[^"]*)*
will only match paired quotes. By adding the begin and end line anchors, it forces the quotes on the left and right side of your regex to be an even count.
This won't handle embedded escaped quotes properly, and multiline quoted strings will be trouble.
Intellij uses Java regexes, doesn't it? Try this:
(?m)(?<![\w.])Units(?![\w(.])(?=(?:[^\r\n"\\]++|\\.)*+[^\r\n"\\]*+$)
The first part is your regex after a little cosmetic surgery:
(?<![\w.])Units(?![\w(.])
The \b at the beginning and end were effectively the same as a negative lookbehind and a negative lookahead (respectively) for \w, so I folded them into your existing lookarounds. The new lookahead matches the rest of the line if it contains even number (including zero) of unescaped quotation marks:
(?=(?:[^\r\n"\\]++|\\.)*+[^\r\n"\\]*+$)
That handles pathological cases like the one Welbog pointed out, and unlike Michael's regex it will find multiple occurrences of the text the same line. But it doesn't take comments into account. Is Intellij's find/replace feature intelligent enough to disregard text in comments? Come to think of it, doesn't it have some kind of refactoring support built in?

Categories

Resources