regex for that excludes matches within quotes - java

I'm working on this pretty big re-factoring project and I'm using intellij's find/replace with regexp to help me out.
This is the regexp I'm using:
\b(?<!\.)Units(?![_\w(.])\b
I find that most matches that are not useful for my purpose are the matches that occur with strings within quotes, for example: "units"
I'd like to find a way to have the above expression not match when it finds a matching string that's between quotes...
Thx in advance, this place rocks!

Assuming the quotes are always paired on a given line, you could create matches before and after for an even number of quotes, and make sure the whole line is matched:
^([^"]*("[^"]*")*[^"]*)*\b(?<!\.)Units(?![_\w(.])\b([^"]*("[^"]*")*[^"]*)*$
this works because the fragment
([^"]*("[^"]*")*[^"]*)*
will only match paired quotes. By adding the begin and end line anchors, it forces the quotes on the left and right side of your regex to be an even count.
This won't handle embedded escaped quotes properly, and multiline quoted strings will be trouble.

Intellij uses Java regexes, doesn't it? Try this:
(?m)(?<![\w.])Units(?![\w(.])(?=(?:[^\r\n"\\]++|\\.)*+[^\r\n"\\]*+$)
The first part is your regex after a little cosmetic surgery:
(?<![\w.])Units(?![\w(.])
The \b at the beginning and end were effectively the same as a negative lookbehind and a negative lookahead (respectively) for \w, so I folded them into your existing lookarounds. The new lookahead matches the rest of the line if it contains even number (including zero) of unescaped quotation marks:
(?=(?:[^\r\n"\\]++|\\.)*+[^\r\n"\\]*+$)
That handles pathological cases like the one Welbog pointed out, and unlike Michael's regex it will find multiple occurrences of the text the same line. But it doesn't take comments into account. Is Intellij's find/replace feature intelligent enough to disregard text in comments? Come to think of it, doesn't it have some kind of refactoring support built in?

Related

regex not matched

I have this expression to match in order to find the right file
(123456A)(.*?)(?:\\")' 'parts'}}{{parts.0}}{{parts.1}}
It does not match with 123456A000000022
I suspect that the HTML nature of posting here has hidden some of your potential match string from us, but it clearly won't match.
This is a great tool for building complex regex strings and testing them: https://regex101.com/
From what I see, the non-capturing group requires the string to have a backslash, so the match fails right there. I don't understand the rest of what you were intending with your match string, because there are mismatched quotes and braces. Again, I suspect I'm not seeing what you actually intended.

Regex for finding the text inside parentheses followed by #en : "example"#en [duplicate]

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?
In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print re.findall(r'"(.*?)"', string)
['Foo Bar', 'Another Value']
I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
([""']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, as to not eat the closing quote); \1 match the same quote that was use for opening.
I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
Lets see two efficient ways that deal with escaped quotes. These patterns are not designed to be concise nor aesthetic, but to be efficient.
These ways use the first character discrimination to quickly find quotes in the string without the cost of an alternation. (The idea is to discard quickly characters that are not quotes without to test the two branches of the alternation.)
Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*
Obviously to deal with strings that haven't balanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround to emulate them, to prevent too much backtracking. You can choose too that a quoted part can be an opening quote until the next (non-escaped) quote or the end of the string. In this case there is no need to use possessive quantifiers, you only need to make the last quote optional.
Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
The patterns avoid the use of a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation but with ["'] at the beginning, in factor.
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(note that (?s:...) is a syntactic sugar to switch on the dotall/singleline mode inside the non-capturing group. If this syntax is not supported you can easily switch this mode on for all the pattern or replace the dot with [\s\S])
(The way this pattern is written is totally "hand-driven" and doesn't take account of eventual engine internal optimizations)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'
Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind is looking behind the current character to check for a quote, if found then start from there and then the lookahead is checking the character ahead for a quote and if found stop on that character. The lookbehind group (the ["']) is wrapped in brackets to create a group for whichever quote was found at the start, this is then used at the end lookahead (?=\1) to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.
The RegEx of accepted answer returns the values including their sourrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are RegEx which return only the values between quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
-
All support escaped and nested quotes.
I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string form $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
A very late answer, but like to answer
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1
The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job but I am concerned of its performances (it's not bad but could be better). Mine below it's ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance it cannot capture many strings (if needed I can provide an exhaustive test-case) like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.
This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/
MORE ANSWERS! Here is the solution i used
\"([^\"]*?icon[^\"]*?)\"
TLDR;
replace the word icon with what your looking for in said quotes and voila!
The way this works is it looks for the keyword and doesn't care what else in between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
the regex looks for a quote mark "
then it looks for any possible group of letters thats not "
until it finds icon
and any possible group of letters that is not "
it then looks for a closing "
I liked Axeman's more expansive version, but had some trouble with it (it didn't match for example
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly, so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1
string = "\" foo bar\" \"loloo\""
print re.findall(r'"(.*?)"',string)
just try this out , works like a charm !!!
\ indicates skip character
Unlike Adam's answer, I have a simple but worked one:
(["'])(?:\\\1|.)*?\1
And just add parenthesis if you want to get content in quotes like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.
All the answer above are good.... except they DOES NOT support all the unicode characters! at ECMA Script (Javascript)
If you are a Node users, you might want the the modified version of accepted answer that support all unicode characters :
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.
My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["'])-> Matches to either ' or " and store it in the backreference \1 once the match found
.* -> Greedy approach to continue matching everything zero or more times until it encounters ' or " at end of the string. After encountering such state, regex engine backtrack to previous matching character and here regex is over and will move to next regex.
\1 -> Matches to the character or string that have been matched earlier with the first capture group.
(?![^\s]) -> Negative lookahead to ensure there should not any non space character after the previous match
echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I showed the result string between ><'s for clarity, also using the non-greedy version with this sed command we first throw out the junk before and after that ""'s and then replace this with the part between the ""'s and surround this by ><'s.
From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match, no partial matching could should trigger a hit
e.g. "test" could not match for "test2".
reg = r"""(['"])(%s)\1"""
if re.search(reg%(needle), haystack, re.IGNORECASE):
print "winning..."
Hunter
If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
A supplementary answer for the subset of Microsoft VBA coders only one uses the library Microsoft VBScript Regular Expressions 5.5 and this gives the following code
Sub TestRegularExpression()
Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
Set oRE = New VBScript_RegExp_55.RegExp
oRE.Pattern = """([^""]*)"""
oRE.Global = True
Dim sTest As String
sTest = """Foo Bar"" ""Another Value"" something else"
Debug.Assert oRE.test(sTest)
Dim oMatchCol As VBScript_RegExp_55.MatchCollection
Set oMatchCol = oRE.Execute(sTest)
Debug.Assert oMatchCol.Count = 2
Dim oMatch As Match
For Each oMatch In oMatchCol
Debug.Print oMatch.SubMatches(0)
Next oMatch
End Sub

Regular Expression Match after double space, and before comma

I am trying to match the bolded portion of the below String, which would represent a city.
1795 New Test Dr Test TEst Wildwood, MI 48769-1100
There are two spaces between Dr and Test, the starting portion should happen after those double spaces, and end before the comma.
I feel like I am very close to having this correct but can't quite get it 100%, as it is including the white space characters before Test.
(?=\s{2})[\w+\s]*[^,]
The above is what I have so far, also the many other alternatives did not work either they still include the white space characters I do not want at the beginning.
I feel like I missing something simple, but even after looking many places I cannot seem to find the regex that would match this pattern.
Also I know this can be easily accomplished with split and substrings, but the requirement is a regex unfortunately, as this is for a database driven automation application and the format should be able to change on the fly without requiring a deploy due to code changes.
You need a look-behind for the spaces rather than a look-ahead, as you want the match to start immediately after them. From that point on, you can simply do a greedy match for anything that is not a comma:
(?<=\s{2})[^,]*
The * is greedy and will consume as many characters as it can, ending the match immediately before the comma.
\s actually also matches whitespace other than space, which may or may not be not be what you what.
How about ^.*? ([^,]*).*$. That's a non-greedy match at the beginning of the line ^.*?, followed by two literal spaces , then capturing everything that isn't a comma, then matching everything else to the end of the line.
Be aware, though, that when I copy and paste your example text, it does not contain two spaces. This might be causing you problems, or it's just a transcription issue and your original has the two spaces.

regex with lookbehind weird behavior

I have been trying to resolve this for the past 2 days...
Please help me in understanding why this is happening. My intention is to just select the <HDR> that has a <DTL1 val="92">.....</HDR>
This is my regular expression
(?<=<HDR>).*?<DTL1\sval="3".*?</HDR>
And the input string is:
<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR><HDR><DTL1 val="3"><DTL2 val="4"></HDR>
But this regular expression selects
abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR>
Can anyone please help me?
A regex engine will give you always the leftmost match in a string (even if you use a non-greedy quantifier). This is exactly what you obtain.
So, a solution is to forbid the presence of another <HDR> in the parts described by .*? that is too permissive.
You have two technics to do that, you can replace the .*? with:
(?>[^<]+|<(?!/HDR))*
or with:
(?:(?!</HDR).)*+
Most of the time, the first technic is more performant, but if your string contains an high density of <, the second way can give good results too.
The use of a possessive quantifier or an atomic group can reduce the number of steps to obtain a result in particular when the subpattern fails.
Example:
With the first way:
(?<=<HDR>)(?>[^<]+|<(?!/HDR))*<DTL1\sval="3"(?>[^<]+|<(?!/HDR))*</HDR>
or this variant:
(?<=<HDR>)(?:[^<]+|<(?!/HDR|DTL1))*+<DTL1\sval="3"(?:[^<]+|<(?!/HDR))*+</HDR>
With the second way:
(?<=<HDR>)(?:(?!</HDR).)*<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>
or this variant:
(?<=<HDR>)(?:(?!</HDR|DTL1).)*+<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>
Casimir et Hippolyte already gave you a couple of good solutions. I want to elaborate on a few things.
First, why your regex fails to do what you want: (?<=<HDR>).*? tells it to match any number of characters starting with the first character preceded by <HDR>, until it encounters what follows the non-greedy quantifier (<DTL1...). Well, the first character that's preceded by <HDR> is the first a, so it matches everything starting from there until the fixed string <DTL1\sval="3" is encountered.
Casimir et Hippolyte's solutions are for the generalized case, where the contents of the <HDR> tags can be anything other than nested <HDR>'s. You could also do it with a positive look-ahead:
(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>
However, if the string is guaranteed to be in the structure shown, where the <HDR> tags only contain one or more <DTL1 val="##"> tags, so you know there won't be any closing tags within, you could do it more efficiently by replacing the first .*? with [^/]*:
(?<=<HDR>)[^/]*<DTL1\sval="3".*?</HDR>
A negated character class is more efficient than a zero-width assertion, and if you're using a negated character class, a greedy quantifier becomes more efficient than a lazy one.
Note also that by using a lookbehind to match the opening <HDR>, you're excluding it from the match, but you're including the closing </HDR>. Are you sure that's what you want? You're matching this...
<DTL1 val="3"><DTL2 val="4"></HDR>
...when presumably you want this...
<HDR><DTL1 val="3"><DTL2 val="4"></HDR>
...or this...
<DTL1 val="3"><DTL2 val="4">
So, in the fist case, don't use a lookbehind for the opening tag:
<HDR>(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>
<HDR>[^/]*<DTL1\sval="3".*?</HDR>
In the second case, use a look-ahead for the closing tag:
(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?(?=</HDR>)
(?<=<HDR>)[^/]*<DTL1\sval="3".*?(?=</HDR>)

How do I make this regex more general, sometimes it works and sometimes it doesn't

I have the following regex that I am using in a java application. Sometimes it works correctly and sometimes it doesn't.
<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->
Sometimes I will have whitespace before/after it, sometimes there will be text. The same goes for the region within the tags.
The main problem is that name=(\".*\")?> sometimes matches more than it is supposed to. I am not sure if that is something that is obvious to solve, simply looking at this code.
XML is not a regular language, nor is HTML or any other language with "nesting" constructs. Don't try to parse it with regular expressions.
Choose an XML parser.
As others have pointed out, the greedy .* (dot-star) that matches the "name" attribute needs to be made non-greedy (.*?) or even better, replaced with a negated character class ([^"]*) so it can't match beyond the closing quotation mark no matter what happens in the rest of the regex. Once you've fixed that, you'll probably find you have the same problem with the other dot-star; you need to make it non-greedy too.
Pattern p = Pattern.compile(
"<!--\\s*<editable\\s+name=\"([^\"]*)\">\\s*-->" +
"(.*?)" +
"<!--\\s*</editable>\\s*-->",
Pattern.DOTALL);
I don't get the significance of your remarks about whitespace. If it's linefeeds and/or carriage returns you're talking about, the DOTALL modifier lets the dot match those--and of course, \s matches them as well.
I wrote this in the form of a Java string literal to avoid confusion about where you need backslashes and how many of them you need. In a "raw" regex, there would be only one backslash in each of the whitespace shorthands (\s*), and the quotation marks wouldn't need to be escaped ("[^"]*").
I would replace that .* with [\w-]* for example if name is an identifier of some sort.
Or [^\"]* so it doesn't capture the end double quote.
Edit:
As mentioned in other post you might consider going for a simple DOM traversal, XPath or XQuery based evaluation process instead of a plain regular expression. But note that you will still need to have regex in the filtering process because you can find the target comments only by testing their body against a regular expression (as I doubt the body is constant judjing from the sample).
Edit 2:
It might be that the leading, trailing or internal whitespaces of the comment body makes your regexp fail. Consider putting \s* in the beginning and at the end, plus \s+ before the attribute-like thing.
<!--\s*<editable\s+name=(\"[^\"]*\")?>\s*-->(.*)<!--\s*</editable>\s*-->
Or when you are filtering on XML based search:
"\\s*<editable\\s+name=(\"[^\"]*\")?>\\s*"
"\\s*</editable>\\s*"
Edit 3: Fixed the escapes twice. Thanks Alan M.
the * multiplier is "greedy" by default, meaning it matches as much as possible, while still matching the pattern successfully.
You can disable this by using *?, so try:
(\".*?\")

Categories

Resources