Get node value between curly braces with Java Pattern.compile() [duplicate] - java

https://regex101.com/r/sB9wW6/1
(?:(?<=\s)|^)#(\S+) <-- the problem in positive lookbehind
Working like this on prod: (?:\s|^)#(\S+), but I need a correct start index (without space).
Here is in JS:
var regex = new RegExp(/(?:(?<=\s)|^)#(\S+)/g);
Error parsing regular expression: Invalid regular expression:
/(?:(?<=\s)|^)#(\S+)/
What am I doing wrong?
UPDATE
Ok, no lookbehind in JS :(
But anyways, I need a regex to get the proper start and end index of my match. Without leading space.

Make sure you always select the right regex engine at regex101.com. See an issue that occurred due to using a JS-only compatible regex with [^] construct in Python.
JS regex - at the time of answering this question - did not support lookbehinds. Now, it becomes more and more adopted after its introduction in ECMAScript 2018. You do not really need it here since you can use capturing groups:
var re = /(?:\s|^)#(\S+)/g;
var str = 's #vln1\n#vln2\n';
var res = [];
while ((m = re.exec(str)) !== null) {
res.push(m[1]);
}
console.log(res);
The (?:\s|^)#(\S+) matches a whitespace or the start of string with (?:\s|^), then matches #, and then matches and captures into Group 1 one or more non-whitespace chars with (\S+).
To get the start/end indices, use
var re = /(\s|^)#\S+/g;
var str = 's #vln1\n#vln2\n';
var pos = [];
while ((m = re.exec(str)) !== null) {
pos.push([m.index+m[1].length, m.index+m[0].length]);
}
console.log(pos);
BONUS
My regex works at regex101.com, but not in...
First of all, have you checked the Code Generator link in the Tools pane on the left?
All languages - "Literal string" vs. "String literal" alert - Make sure you test against the same text used in code, literal string, at the regex tester. A common scenario is copy/pasting a string literal value directly into the test string field, with all string escape sequences like \n (line feed char), \r (carriage return), \t (tab char). See Regex_search c++, for example. Mind that they must be replaced with their literal counterparts. So, if you have in Python text = "Text\n\n abc", you must use Text, two line breaks, abc in the regex tester text field. Text.*?abc will never match it although you might think it "works". Yes, . does not always match line break chars, see How do I match any character across multiple lines in a regular expression?
All languages - Backslash alert - Make sure you correctly use a backslash in your string literal, in most languages, in regular string literals, use double backslash, i.e. \d used at regex101.com must written as \\d. In raw string literals, use a single backslash, same as at regex101. Escaping word boundary is very important, since, in many languages (C#, Python, Java, JavaScript, Ruby, etc.), "\b" is used to define a BACKSPACE char, i.e. it is a valid string escape sequence. PHP does not support \b string escape sequence, so "/\b/" = '/\b/' there.
All languages - Default flags - Global and Multiline - Note that by default m and g flags are enabled at regex101.com. So, if you use ^ and $, they will match at the start and end of lines correspondingly. If you need the same behavior in your code check how multiline mode is implemented and either use a specific flag, or - if supported - use an inline (?m) embedded (inline) modifier. The g flag enables multiple occurrence matching, it is often implemented using specific functions/methods. Check your language reference to find the appropriate one.
line-breaks - Line endings at regex101.com are LF only, you can't test strings with CRLF endings, see regex101.com VS myserver - different results. Solutions can be different for each regex library: either use \R (PCRE, Java, Ruby) or some kind of \v (Boost, PCRE), \r?\n, (?:\r\n?|\n)/(?>\r\n?|\n) (good for .NET) or [\r\n]+ in other libraries (see answers for C#, PHP). Another issue related to the fact that you test your regex against a multiline string (not a list of standalone strings/lines) is that your patterns may consume the end of line, \n, char with negated character classes, see an issue like that. \D matched the end of line char, and in order to avoid it, [^\d\n] could be used, or other alternatives.
php - You are dealing with Unicode strings, or want shorthand character classes to match Unicode characters, too (e.g. \w+ to match Стрибижев or Stribiżew, or \s+ to match hard spaces), then you need to use u modifier, see preg_match() returns 0 although regex testers work - To match all occurrences, use preg_match_all, not preg_match with /...pattern.../g, see PHP preg_match to find multiple occurrences and "Unknown modifier 'g' in..." when using preg_match in PHP?- Your regex with inline backreference like \1 refuses to work? Are you using a double quoted string literal? Use a single-quoted one, see Backreference does not work in PHP
phplaravel - Mind you need the regex delimiters around the pattern, see https://stackoverflow.com/questions/22430529
python - Note that re.search, re.match, re.fullmatch, re.findall and re.finditer accept the regex as the first argument, and the input string as the second argument. Not re.findall("test 200 300", r"\d+"), but re.findall(r"\d+", "test 200 300"). If you test at regex101.com, please check the "Code Generator" page. - You used re.match that only searches for a match at the start of the string, use re.search: Regex works fine on Pythex, but not in Python - If the regex contains capturing group(s), re.findall returns a list of captures/capture tuples. Either use non-capturing groups, or re.finditer, or remove redundant capturing groups, see re.findall behaves weird - If you used ^ in the pattern to denote start of a line, not start of the whole string, or used $ to denote the end of a line and not a string, pass re.M or re.MULTILINE flag to re method, see Using ^ to match beginning of line in Python regex
- If you try to match some text across multiple lines, and use re.DOTALL or re.S, or [\s\S]* / [\s\S]*?, and still nothing works, check if you read the file line by line, say, with for line in file:. You must pass the whole file contents as the input to the regex method, see Getting Everything Between Two Characters Across New Lines. - Having trouble adding flags to regex and trying something like pattern = r"/abc/gi"? See How to add modifers to regex in python?
c#, .net - .NET regex does not support possessive quantifiers like ++, *+, ??, {1,10}?, see .NET regex matching digits between optional text with possessive quantifer is not working - When you match against a multiline string and use RegexOptions.Multiline option (or inline (?m) modifier) with an $ anchor in the pattern to match entire lines, and get no match in code, you need to add \r? before $, see .Net regex matching $ with the end of the string and not of line, even with multiline enabled - To get multiple matches, use Regex.Matches, not Regex.Match, see RegEx Match multiple times in string - Similar case as above: splitting a string into paragraphs, by a double line break sequence - C# / Regex Pattern works in online testing, but not at runtime - You should remove regex delimiters, i.e. #"/\d+/" must actually look like #"\d+", see Simple and tested online regex containing regex delimiters does not work in C# code - If you unnecessarily used Regex.Escape to escape all characters in a regular expression (like Regex.Escape(#"\d+\.\d+")) you need to remove Regex.Escape, see Regular Expression working in regex tester, but not in c#
dartflutter - Use raw string literal, RegExp(r"\d"), or double backslashes (RegExp("\\d")) - https://stackoverflow.com/questions/59085824
javascript - Double escape backslashes in a RegExp("\\d"): Why do regex constructors need to be double escaped?
- (Negative) lookbehinds unsupported by most browsers: Regex works on browser but not in Node.js - Strings are immutable, assign the .replace result to a var - The .replace() method does change the string in place - Retrieve all matches with str.match(/pat/g) - Regex101 and Js regex search showing different results or, with RegExp#exec, RegEx to extract all matches from string using RegExp.exec- Replace all pattern matches in string: Why does javascript replace only first instance when using replace?
javascriptangular - Double the backslashes if you define a regex with a string literal, or just use a regex literal notation, see https://stackoverflow.com/questions/56097782
java - Word boundary not working? Make sure you use double backslashes, "\\b", see Regex \b word boundary not works - Getting invalid escape sequence exception? Same thing, double backslashes - Java doesn't work with regex \s, says: invalid escape sequence - No match found is bugging you? Run Matcher.find() / Matcher.matches() - Why does my regex work on RegexPlanet and regex101 but not in my code? - .matches() requires a full string match, use .find(): Java Regex pattern that matches in any online tester but doesn't in Eclipse - Access groups using matcher.group(x): Regex not working in Java while working otherwise - Inside a character class, both [ and ] must be escaped - Using square brackets inside character class in Java regex - You should not run matcher.matches() and matcher.find() consecutively, use only if (matcher.matches()) {...} to check if the pattern matches the whole string and then act accordingly, or use if (matcher.find()) to check if there is a single match or while (matcher.find()) to find multiple matches (or Matcher#results()). See Why does my regex work on RegexPlanet and regex101 but not in my code?
scala - Your regex attempts to match several lines, but you read the file line by line (e.g. use for (line <- fSource.getLines))? Read it into a single variable (see matching new line in Scala regex, when reading from file)
kotlin - You have Regex("/^\\d+$/")? Remove the outer slashes, they are regex delimiter chars that are not part of a pattern. See Find one or more word in string using Regex in Kotlin - You expect a partial string match, but .matchEntire requires a full string match? Use .find, see Regex doesn't match in Kotlin
mongodb - Do not enclose /.../ with single/double quotation marks, see mongodb regex doesn't work
c++ - regex_match requires a full string match, use regex_search to find a partial match - Regex not working as expected with C++ regex_match - regex_search finds the first match only. Use sregex_token_iterator or sregex_iterator to get all matches: see What does std::match_results::size return? - When you read a user-defined string using std::string input; std::cin >> input;, note that cin will only get to the first whitespace, to read the whole line properly, use std::getline(std::cin, input); - C++ Regex to match '+' quantifier - "\d" does not work, you need to use "\\d" or R"(\d)" (a raw string literal) - This regex doesn't work in c++ - Make sure the regex is tested against a literal text, not a string literal, see Regex_search c++
go - Double backslashes or use a raw string literal: Regular expression doesn't work in Go - Go regex does not support lookarounds, select the right option (Go) at regex101.com before testing! Regex expression negated set not working golang
groovy - Return all matches: Regex that works on regex101 does not work in Groovy
r - Double escape backslashes in the string literal: "'\w' is an unrecognized escape" in grep - Use perl=TRUE to PCRE engine ((g)sub/(g)regexpr): Why is this regex using lookbehinds invalid in R?
oracle - Greediness of all quantifiers is set by the first quantifier in the regex, see Regex101 vs Oracle Regex (then, you need to make all the quantifiers as greedy as the first one)] - \b does not work? Oracle regex does not support word boundaries at all, use workarounds as shown in Regex matching works on regex tester but not in oracle
firebase - Double escape backslashes, make sure ^ only appears at the start of the pattern and $ is located only at the end (if any), and note you cannot use more than 9 inline backreferences: Firebase Rules Regex Birthday
firebasegoogle-cloud-firestore - In Firestore security rules, the regular expression needs to be passed as a string, which also means it shouldn't be wrapped in / symbols, i.e. use allow create: if docId.matches("^\\d+$").... See https://stackoverflow.com/questions/63243300
google-data-studio - /pattern/g in REGEXP_REPLACE must contain no / regex delimiters and flags (like g) - see How to use Regex to replace square brackets from date field in Google Data Studio?
google-sheets - If you think REGEXEXTRACT does not return full matches, truncates the results, you should check if you have redundant capturing groups in your regex and remove them, or convert the capturing groups to non-capturing by add ?: after the opening (, see Extract url domain root in Google Sheet
sed - Why does my regular expression work in X but not in Y?
word-boundarypcrephp - [[:<:]] and [[:>:]] do not work in the regex tester, although they are valid constructs in PCRE, see https://stackoverflow.com/questions/48670105
snowflake-cloud-data-platform snowflake-sql - If you are writing a stored procedure, and \\d does not work, you need to double them again and use \\\\d, see REGEX conversion of VARCHAR value to DATE in Snowflake stored procedure using RLIKE not consistent.

Related

Why I got IllegalStateException here? [duplicate]

I have a string. The end is different, such as index.php?test=1&list=UL or index.php?list=UL&more=1. The one thing I'm looking for is &list=.
How can I match it, whether it's in the middle of the string or it's at the end? So far I've got [&|\?]list=.*?([&|$]), but the ([&|$]) part doesn't actually work; I'm trying to use that to match either & or the end of the string, but the end of the string part doesn't work, so this pattern matches the second example but not the first.
Use:
/(&|\?)list=.*?(&|$)/
Note that when you use a bracket expression, every character within it (with some exceptions) is going to be interpreted literally. In other words, [&|$] matches the characters &, |, and $.
In short
Any zero-width assertions inside [...] lose their meaning of a zero-width assertion. [\b] does not match a word boundary (it matches a backspace, or, in POSIX, \ or b), [$] matches a literal $ char, [^] is either an error or, as in ECMAScript regex flavor, any char. Same with \z, \Z, \A anchors.
You may solve the problem using any of the below patterns:
[&?]list=([^&]*)
[&?]list=(.*?)(?=&|$)
[&?]list=(.*?)(?![^&])
If you need to check for the "absolute", unambiguous string end anchor, you need to remember that is various regex flavors, it is expressed with different constructs:
[&?]list=(.*?)(?=&|$) - OK for ECMA regex (JavaScript, default C++ `std::regex`)
[&?]list=(.*?)(?=&|\z) - OK for .NET, Go, Onigmo (Ruby), Perl, PCRE (PHP, base R), Boost, ICU (R `stringr`), Java/Andorid
[&?]list=(.*?)(?=&|\Z) - OK for Python
Matching between a char sequence and a single char or end of string (current scenario)
The .*?([YOUR_SINGLE_CHAR_DELIMITER(S)]|$) pattern (suggested by João Silva) is rather inefficient since the regex engine checks for the patterns that appear to the right of the lazy dot pattern first, and only if they do not match does it "expand" the lazy dot pattern.
In these cases it is recommended to use negated character class (or bracket expression in the POSIX talk):
[&?]list=([^&]*)
See demo. Details
[&?] - a positive character class matching either & or ? (note the relationships between chars/char ranges in a character class are OR relationships)
list= - a substring, char sequence
([^&]*) - Capturing group #1: zero or more (*) chars other than & ([^&]), as many as possible
Checking for the trailing single char delimiter presence without returning it or end of string
Most regex flavors (including JavaScript beginning with ECMAScript 2018) support lookarounds, constructs that only return true or false if there patterns match or not. They are crucial in case consecutive matches that may start and end with the same char are expected (see the original pattern, it may match a string starting and ending with &). Although it is not expected in a query string, it is a common scenario.
In that case, you can use two approaches:
A positive lookahead with an alternation containing positive character class: (?=[SINGLE_CHAR_DELIMITER(S)]|$)
A negative lookahead with just a negative character class: (?![^SINGLE_CHAR_DELIMITER(S)])
The negative lookahead solution is a bit more efficient because it does not contain an alternation group that adds complexity to matching procedure. The OP solution would look like
[&?]list=(.*?)(?=&|$)
or
[&?]list=(.*?)(?![^&])
See this regex demo and another one here.
Certainly, in case the trailing delimiters are multichar sequences, only a positive lookahead solution will work since [^yes] does not negate a sequence of chars, but the chars inside the class (i.e. [^yes] matches any char but y, e and s).

Regex for finding the text inside parentheses followed by #en : "example"#en [duplicate]

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?
In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print re.findall(r'"(.*?)"', string)
['Foo Bar', 'Another Value']
I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
([""']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, as to not eat the closing quote); \1 match the same quote that was use for opening.
I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
Lets see two efficient ways that deal with escaped quotes. These patterns are not designed to be concise nor aesthetic, but to be efficient.
These ways use the first character discrimination to quickly find quotes in the string without the cost of an alternation. (The idea is to discard quickly characters that are not quotes without to test the two branches of the alternation.)
Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*
Obviously to deal with strings that haven't balanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround to emulate them, to prevent too much backtracking. You can choose too that a quoted part can be an opening quote until the next (non-escaped) quote or the end of the string. In this case there is no need to use possessive quantifiers, you only need to make the last quote optional.
Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
The patterns avoid the use of a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation but with ["'] at the beginning, in factor.
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(note that (?s:...) is a syntactic sugar to switch on the dotall/singleline mode inside the non-capturing group. If this syntax is not supported you can easily switch this mode on for all the pattern or replace the dot with [\s\S])
(The way this pattern is written is totally "hand-driven" and doesn't take account of eventual engine internal optimizations)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'
Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind is looking behind the current character to check for a quote, if found then start from there and then the lookahead is checking the character ahead for a quote and if found stop on that character. The lookbehind group (the ["']) is wrapped in brackets to create a group for whichever quote was found at the start, this is then used at the end lookahead (?=\1) to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.
The RegEx of accepted answer returns the values including their sourrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are RegEx which return only the values between quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
-
All support escaped and nested quotes.
I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string form $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
A very late answer, but like to answer
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1
The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job but I am concerned of its performances (it's not bad but could be better). Mine below it's ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance it cannot capture many strings (if needed I can provide an exhaustive test-case) like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.
This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/
MORE ANSWERS! Here is the solution i used
\"([^\"]*?icon[^\"]*?)\"
TLDR;
replace the word icon with what your looking for in said quotes and voila!
The way this works is it looks for the keyword and doesn't care what else in between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
the regex looks for a quote mark "
then it looks for any possible group of letters thats not "
until it finds icon
and any possible group of letters that is not "
it then looks for a closing "
I liked Axeman's more expansive version, but had some trouble with it (it didn't match for example
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly, so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1
string = "\" foo bar\" \"loloo\""
print re.findall(r'"(.*?)"',string)
just try this out , works like a charm !!!
\ indicates skip character
Unlike Adam's answer, I have a simple but worked one:
(["'])(?:\\\1|.)*?\1
And just add parenthesis if you want to get content in quotes like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.
All the answer above are good.... except they DOES NOT support all the unicode characters! at ECMA Script (Javascript)
If you are a Node users, you might want the the modified version of accepted answer that support all unicode characters :
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.
My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["'])-> Matches to either ' or " and store it in the backreference \1 once the match found
.* -> Greedy approach to continue matching everything zero or more times until it encounters ' or " at end of the string. After encountering such state, regex engine backtrack to previous matching character and here regex is over and will move to next regex.
\1 -> Matches to the character or string that have been matched earlier with the first capture group.
(?![^\s]) -> Negative lookahead to ensure there should not any non space character after the previous match
echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I showed the result string between ><'s for clarity, also using the non-greedy version with this sed command we first throw out the junk before and after that ""'s and then replace this with the part between the ""'s and surround this by ><'s.
From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match, no partial matching could should trigger a hit
e.g. "test" could not match for "test2".
reg = r"""(['"])(%s)\1"""
if re.search(reg%(needle), haystack, re.IGNORECASE):
print "winning..."
Hunter
If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
A supplementary answer for the subset of Microsoft VBA coders only one uses the library Microsoft VBScript Regular Expressions 5.5 and this gives the following code
Sub TestRegularExpression()
Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
Set oRE = New VBScript_RegExp_55.RegExp
oRE.Pattern = """([^""]*)"""
oRE.Global = True
Dim sTest As String
sTest = """Foo Bar"" ""Another Value"" something else"
Debug.Assert oRE.test(sTest)
Dim oMatchCol As VBScript_RegExp_55.MatchCollection
Set oMatchCol = oRE.Execute(sTest)
Debug.Assert oMatchCol.Count = 2
Dim oMatch As Match
For Each oMatch In oMatchCol
Debug.Print oMatch.SubMatches(0)
Next oMatch
End Sub

Why replaceAll is not works? [duplicate]

For example, this regex
(.*)<FooBar>
will match:
abcde<FooBar>
But how do I get it to match across multiple lines?
abcde
fghij<FooBar>
Try this:
((.|\n)*)<FooBar>
It basically says "any character or a newline" repeated zero or more times.
It depends on the language, but there should be a modifier that you can add to the regex pattern. In PHP it is:
/(.*)<FooBar>/s
The s at the end causes the dot to match all characters including newlines.
The question is, can the . pattern match any character? The answer varies from engine to engine. The main difference is whether the pattern is used by a POSIX or non-POSIX regex library.
A special note about lua-patterns: they are not considered regular expressions, but . matches any character there, the same as POSIX-based engines.
Another note on matlab and octave: the . matches any character by default (demo): str = "abcde\n fghij<Foobar>"; expression = '(.*)<Foobar>*'; [tokens,matches] = regexp(str,expression,'tokens','match'); (tokens contain a abcde\n fghij item).
Also, in all of boost's regex grammars the dot matches line breaks by default. Boost's ECMAScript grammar allows you to turn this off with regex_constants::no_mod_m (source).
As for oracle (it is POSIX based), use the n option (demo): select regexp_substr('abcde' || chr(10) ||' fghij<Foobar>', '(.*)<Foobar>', 1, 1, 'n', 1) as results from dual
POSIX-based engines:
A mere . already matches line breaks, so there isn't a need to use any modifiers, see bash (demo).
The tcl (demo), postgresql (demo), r (TRE, base R default engine with no perl=TRUE, for base R with perl=TRUE or for stringr/stringi patterns, use the (?s) inline modifier) (demo) also treat . the same way.
However, most POSIX-based tools process input line by line. Hence, . does not match the line breaks just because they are not in scope. Here are some examples how to override this:
sed - There are multiple workarounds. The most precise, but not very safe, is sed 'H;1h;$!d;x; s/\(.*\)><Foobar>/\1/' (H;1h;$!d;x; slurps the file into memory). If whole lines must be included, sed '/start_pattern/,/end_pattern/d' file (removing from start will end with matched lines included) or sed '/start_pattern/,/end_pattern/{{//!d;};}' file (with matching lines excluded) can be considered.
perl - perl -0pe 's/(.*)<FooBar>/$1/gs' <<< "$str" (-0 slurps the whole file into memory, -p prints the file after applying the script given by -e). Note that using -000pe will slurp the file and activate 'paragraph mode' where Perl uses consecutive newlines (\n\n) as the record separator.
gnu-grep - grep -Poz '(?si)abc\K.*?(?=<Foobar>)' file. Here, z enables file slurping, (?s) enables the DOTALL mode for the . pattern, (?i) enables case insensitive mode, \K omits the text matched so far, *? is a lazy quantifier, (?=<Foobar>) matches the location before <Foobar>.
pcregrep - pcregrep -Mi "(?si)abc\K.*?(?=<Foobar>)" file (M enables file slurping here). Note pcregrep is a good solution for macOS grep users.
See demos.
Non-POSIX-based engines:
php - Use the s modifier PCRE_DOTALL modifier: preg_match('~(.*)<Foobar>~s', $s, $m) (demo)
c# - Use RegexOptions.Singleline flag (demo): - var result = Regex.Match(s, #"(.*)<Foobar>", RegexOptions.Singleline).Groups[1].Value;- var result = Regex.Match(s, #"(?s)(.*)<Foobar>").Groups[1].Value;
powershell - Use the (?s) inline option: $s = "abcde`nfghij<FooBar>"; $s -match "(?s)(.*)<Foobar>"; $matches[1]
perl - Use the s modifier (or (?s) inline version at the start) (demo): /(.*)<FooBar>/s
python - Use the re.DOTALL (or re.S) flags or (?s) inline modifier (demo): m = re.search(r"(.*)<FooBar>", s, flags=re.S) (and then if m:, print(m.group(1)))
java - Use Pattern.DOTALL modifier (or inline (?s) flag) (demo): Pattern.compile("(.*)<FooBar>", Pattern.DOTALL)
kotlin - Use RegexOption.DOT_MATCHES_ALL : "(.*)<FooBar>".toRegex(RegexOption.DOT_MATCHES_ALL)
groovy - Use (?s) in-pattern modifier (demo): regex = /(?s)(.*)<FooBar>/
scala - Use (?s) modifier (demo): "(?s)(.*)<Foobar>".r.findAllIn("abcde\n fghij<Foobar>").matchData foreach { m => println(m.group(1)) }
javascript - Use [^] or workarounds [\d\D] / [\w\W] / [\s\S] (demo): s.match(/([\s\S]*)<FooBar>/)[1]
c++ (std::regex) Use [\s\S] or the JavaScript workarounds (demo): regex rex(R"(([\s\S]*)<FooBar>)");
vba vbscript - Use the same approach as in JavaScript, ([\s\S]*)<Foobar>. (NOTE: The MultiLine property of the RegExp object is sometimes erroneously thought to be the option to allow . match across line breaks, while, in fact, it only changes the ^ and $ behavior to match start/end of lines rather than strings, the same as in JavaScript regex)
behavior.)
ruby - Use the /m MULTILINE modifier (demo): s[/(.*)<Foobar>/m, 1]
rtrebase-r - Base R PCRE regexps - use (?s): regmatches(x, regexec("(?s)(.*)<FooBar>",x, perl=TRUE))[[1]][2] (demo)
ricustringrstringi - in stringr/stringi regex funtions that are powered with the ICU regex engine. Also use (?s): stringr::str_match(x, "(?s)(.*)<FooBar>")[,2] (demo)
go - Use the inline modifier (?s) at the start (demo): re: = regexp.MustCompile(`(?s)(.*)<FooBar>`)
swift - Use dotMatchesLineSeparators or (easier) pass the (?s) inline modifier to the pattern: let rx = "(?s)(.*)<Foobar>"
objective-c - The same as Swift. (?s) works the easiest, but here is how the option can be used: NSRegularExpression* regex = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionDotMatchesLineSeparators error:&regexError];
re2, google-apps-script - Use the (?s) modifier (demo): "(?s)(.*)<Foobar>" (in Google Spreadsheets, =REGEXEXTRACT(A2,"(?s)(.*)<Foobar>"))
NOTES ON (?s):
In most non-POSIX engines, the (?s) inline modifier (or embedded flag option) can be used to enforce . to match line breaks.
If placed at the start of the pattern, (?s) changes the bahavior of all . in the pattern. If the (?s) is placed somewhere after the beginning, only those .s will be affected that are located to the right of it unless this is a pattern passed to Python's re. In Python re, regardless of the (?s) location, the whole pattern . is affected. The (?s) effect is stopped using (?-s). A modified group can be used to only affect a specified range of a regex pattern (e.g., Delim1(?s:.*?)\nDelim2.* will make the first .*? match across newlines and the second .* will only match the rest of the line).
POSIX note:
In non-POSIX regex engines, to match any character, [\s\S] / [\d\D] / [\w\W] constructs can be used.
In POSIX, [\s\S] is not matching any character (as in JavaScript or any non-POSIX engine), because regex escape sequences are not supported inside bracket expressions. [\s\S] is parsed as bracket expressions that match a single character, \ or s or S.
If you're using Eclipse search, you can enable the "DOTALL" option to make '.' match any character including line delimiters: just add "(?s)" at the beginning of your search string. Example:
(?s).*<FooBar>
In many regex dialects, /[\S\s]*<Foobar>/ will do just what you want. Source
([\s\S]*)<FooBar>
The dot matches all except newlines (\r\n). So use \s\S, which will match ALL characters.
We can also use
(.*?\n)*?
to match everything including newline without being greedy.
This will make the new line optional
(.*?|\n)*?
In Ruby you can use the 'm' option (multiline):
/YOUR_REGEXP/m
See the Regexp documentation on ruby-doc.org for more information.
"." normally doesn't match line-breaks. Most regex engines allows you to add the S-flag (also called DOTALL and SINGLELINE) to make "." also match newlines.
If that fails, you could do something like [\S\s].
For Eclipse, the following expression worked:
Foo
jadajada Bar"
Regular expression:
Foo[\S\s]{1,10}.*Bar*
Note that (.|\n)* can be less efficient than (for example) [\s\S]* (if your language's regexes support such escapes) and than finding how to specify the modifier that makes . also match newlines. Or you can go with POSIXy alternatives like [[:space:][:^space:]]*.
Use:
/(.*)<FooBar>/s
The s causes dot (.) to match carriage returns.
Use RegexOptions.Singleline. It changes the meaning of . to include newlines.
Regex.Replace(content, searchText, replaceText, RegexOptions.Singleline);
In notepad++ you can use this
<table (.|\r\n)*</table>
It will match the entire table starting from
rows and columns
You can make it greedy, using the following, that way it will match the first, second and so forth tables and not all at once
<table (.|\r\n)*?</table>
In a Java-based regular expression, you can use [\s\S].
This works for me and is the simplest one:
(\X*)<FooBar>
Generally, . doesn't match newlines, so try ((.|\n)*)<foobar>.
In JavaScript you can use [^]* to search for zero to infinite characters, including line breaks.
$("#find_and_replace").click(function() {
var text = $("#textarea").val();
search_term = new RegExp("[^]*<Foobar>", "gi");;
replace_term = "Replacement term";
var new_text = text.replace(search_term, replace_term);
$("#textarea").val(new_text);
});
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<button id="find_and_replace">Find and replace</button>
<br>
<textarea ID="textarea">abcde
fghij<Foobar></textarea>
Solution:
Use pattern modifier sU will get the desired matching in PHP.
Example:
preg_match('/(.*)/sU', $content, $match);
Sources:
Pattern Modifiers
In the context of use within languages, regular expressions act on strings, not lines. So you should be able to use the regex normally, assuming that the input string has multiple lines.
In this case, the given regex will match the entire string, since "<FooBar>" is present. Depending on the specifics of the regex implementation, the $1 value (obtained from the "(.*)") will either be "fghij" or "abcde\nfghij". As others have said, some implementations allow you to control whether the "." will match the newline, giving you the choice.
Line-based regular expression use is usually for command line things like egrep.
Try: .*\n*.*<FooBar> assuming you are also allowing blank newlines. As you are allowing any character including nothing before <FooBar>.
I had the same problem and solved it in probably not the best way but it works. I replaced all line breaks before I did my real match:
mystring = Regex.Replace(mystring, "\r\n", "")
I am manipulating HTML so line breaks don't really matter to me in this case.
I tried all of the suggestions above with no luck. I am using .NET 3.5 FYI.
I wanted to match a particular if block in Java:
...
...
if(isTrue){
doAction();
}
...
...
}
If I use the regExp
if \(isTrue(.|\n)*}
it included the closing brace for the method block, so I used
if \(!isTrue([^}.]|\n)*}
to exclude the closing brace from the wildcard match.
Often we have to modify a substring with a few keywords spread across lines preceding the substring. Consider an XML element:
<TASK>
<UID>21</UID>
<Name>Architectural design</Name>
<PercentComplete>81</PercentComplete>
</TASK>
Suppose we want to modify the 81, to some other value, say 40. First identify .UID.21..UID., then skip all characters including \n till .PercentCompleted.. The regular expression pattern and the replace specification are:
String hw = new String("<TASK>\n <UID>21</UID>\n <Name>Architectural design</Name>\n <PercentComplete>81</PercentComplete>\n</TASK>");
String pattern = new String ("(<UID>21</UID>)((.|\n)*?)(<PercentComplete>)(\\d+)(</PercentComplete>)");
String replaceSpec = new String ("$1$2$440$6");
// Note that the group (<PercentComplete>) is $4 and the group ((.|\n)*?) is $2.
String iw = hw.replaceFirst(pattern, replaceSpec);
System.out.println(iw);
<TASK>
<UID>21</UID>
<Name>Architectural design</Name>
<PercentComplete>40</PercentComplete>
</TASK>
The subgroup (.|\n) is probably the missing group $3. If we make it non-capturing by (?:.|\n) then the $3 is (<PercentComplete>). So the pattern and replaceSpec can also be:
pattern = new String("(<UID>21</UID>)((?:.|\n)*?)(<PercentComplete>)(\\d+)(</PercentComplete>)");
replaceSpec = new String("$1$2$340$5")
and the replacement works correctly as before.
Typically searching for three consecutive lines in PowerShell, it would look like:
$file = Get-Content file.txt -raw
$pattern = 'lineone\r\nlinetwo\r\nlinethree\r\n' # "Windows" text
$pattern = 'lineone\nlinetwo\nlinethree\n' # "Unix" text
$pattern = 'lineone\r?\nlinetwo\r?\nlinethree\r?\n' # Both
$file -match $pattern
# output
True
Bizarrely, this would be Unix text at the prompt, but Windows text in a file:
$pattern = 'lineone
linetwo
linethree
'
Here's a way to print out the line endings:
'lineone
linetwo
linethree
' -replace "`r",'\r' -replace "`n",'\n'
# Output
lineone\nlinetwo\nlinethree\n
Option 1
One way would be to use the s flag (just like the accepted answer):
/(.*)<FooBar>/s
Demo 1
Option 2
A second way would be to use the m (multiline) flag and any of the following patterns:
/([\s\S]*)<FooBar>/m
or
/([\d\D]*)<FooBar>/m
or
/([\w\W]*)<FooBar>/m
Demo 2
RegEx Circuit
jex.im visualizes regular expressions:

String to HTML paragraphs in Java with Regex [duplicate]

I had to match a number followed by itself 14 times. Then I've came to the following regular expression in the regexstor.net/tester:
(\d)\1{14}
Edit
When I paste it in my code, including the backslashes properly:
"(\\d)\\1{14}"
I've replaced the back-reference "\1" by the "$1" which is used to replace matches in Java.
Then I've realized that it doesn't work. When you need to back-reference a match in the REGEX, in Java, you have to use "\N", but when you want to replace it, the operator is "$N".
My question is: why?
$1 is not a back reference in Java's regexes, nor in any other flavor I can think of. You only use $1 when you are replacing something:
String input="A12.3 bla bla my input";
input = StringUtils.replacePattern(
input, "^([A-Z]\\d{2}\\.\\d).*$", "$1");
// ^^^^
There is some misinformation about what a back reference is, including the very place I got that snippet from: simple java regex with backreference does not work.
Java modeled its regex syntax after other existing flavors where the $ was already a meta character. It anchors to the end of the string (or line in multi-line mode).
Similarly, Java uses \1 for back references. Because regexes are strings, it must be escaped: \\1.
From a lexical/syntactic standpoint it is true that $1 could be used unambiguously (as a bonus it would prevent the need for the "evil escaped escape" when using back references).
To match a 1 that comes after the end of a line the regex would need to be $\n1:
this line
1
It just makes more sense to use a familiar syntax instead of changing the rules, most of which came from Perl.
The first version of Perl came out in 1987, which is much earlier than Java, which was released in beta in 1995.
I dug up the man pages for Perl 1, which say:
The bracketing construct (\ ...\ ) may also be used, in which case \<digit> matches the digit'th substring. (Outside of the pattern, always use $ instead of \ in front of the digit. The scope of $<digit> (and $\`, $& and $') extends to the end of the enclosing BLOCK or eval string, or to the next pattern match with subexpressions. The \<digit> notation sometimes works outside the current pattern, but should not be relied upon.) You may have as many parentheses as you wish. If you have more than 9 substrings, the variables $10, $11, ... refer to the corresponding substring. Within the pattern, \10, \11, etc. refer back to substrings if there have been at least that many left parens before the backreference. Otherwise (for backward compatibilty) \10 is the same as \010, a backspace, and \11 the same as \011, a tab. And so on. (\1 through \9 are always backreferences.)
I think the main Problem is not the backreference - which works perfectly fine with \1 in java.
Your Problem is more likely the "overall" escaping of a regex pattern in Java.
If you want to have the pattern
(\d)\1{14}
passed to the regex engine, you first need to escape it cause it's a java-string when you write it:
(\\d)\\1{14}
Voila, works like a charm: goo.gl/BNCx7B (add http://, SO does not allow Url-Shorteners, but tutorialspoint.com has no other option as it seems)
Offline-Example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HelloWorld{
public static void main(String []args){
String test = "555555555555555"; // 5 followed by 5 for 14 times.
String pattern = "(\\d)\\1{14}";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(test);
if (m.find( )) {
System.out.println("Matched!");
}else{
System.out.println("not matched :-(");
}
}
}

What is the pattern for empty string?

I need to validate input: valid variants are either number or empty string. What is the correspondent regular expression?
String pattern = "\d+|<what shoudl be here?>";
UPD: dont suggest "\d*" please, I'm just curious how to tell "empty string" in regexp.
In this particular case, ^\d*$ would work, but generally speaking, to match pattern or an empty string, you can use:
^$|pattern
Explanation
^ and $ are the beginning and end of the string anchors respectively.
| is used to denote alternates, e.g. this|that.
References
regular-expressions.info/Anchors and Alternation
Related questions
Regular expression which matches a pattern, or is an empty string
Note on multiline mode
In the so-called multiline mode (Pattern.MULTILINE/(?m) in Java), the ^ and $ match the beginning and end of the line instead. The anchors for the beginning and end of the string are now \A and \Z respectively.
If you're in multiline mode, then the empty string is matched by \A\Z instead. ^$ would match an empty line within the string.
Examples
Here are some examples to illustrate the above points:
String numbers = "012345";
System.out.println(numbers.replaceAll(".", "<$0>"));
// <0><1><2><3><4><5>
System.out.println(numbers.replaceAll("^.", "<$0>"));
// <0>12345
System.out.println(numbers.replaceAll(".$", "<$0>"));
// 01234<5>
numbers = "012\n345\n678";
System.out.println(numbers.replaceAll("^.", "<$0>"));
// <0>12
// 345
// 678
System.out.println(numbers.replaceAll("(?m)^.", "<$0>"));
// <0>12
// <3>45
// <6>78
System.out.println(numbers.replaceAll("(?m).\\Z", "<$0>"));
// 012
// 345
// 67<8>
Note on Java matches
In Java, matches attempts to match a pattern against the entire string.
This is true for String.matches, Pattern.matches and Matcher.matches.
This means that sometimes, anchors can be omitted for Java matches when they're otherwise necessary for other flavors and/or other Java regex methods.
Related questions
Regex doesn't work in String.matches()
/^\d*$/
Matches 0 or more digits with nothing before or after.
Explanation:
The '^' means start of line. '$' means end of line. '*' matches 0 or more occurences. So the pattern matches an entire line with 0 or more digits.
To explicitly match the empty string, use \A\Z.
You can also often see ^$ which works fine unless the option is set to allow the ^ and $ anchors to match not only at the start or end of the string but also at the start/end of each line. If your input can never contain newlines, then of course ^$ is perfectly OK.
Some regex flavors don't support \A and \Z anchors (especially JavaScript).
If you want to allow "empty" as in "nothing or only whitespace", then go for \A\s*\Z or ^\s*$.
Just as a funny solution, you can do:
\d+|\d{0}
A digit, zero times. Yes, it does work.
There shouldn't be anything wrong with just "\d+|"
One of the way to view at the set of regular language as the closure of the below things:
Special < EMPTY_STRING > is the regular language
Any symbol from alphaphet is the valid regular language
Any concatentation and union of two valid regexps is the regular language
Any union of two valid regular language is the regular language
Any transitive closure of the regexp is the regular language
Concreate regular language is concrete element of this closure.
I didn't find empty symbol in POSIX standard to express regular language idea from step (1).
But it is exist extra thing like question mark there which is by posix definition is the following:
(regexp|< EMPTY_STRING >)
So you can do in the following manner for bash, perl, and python:
echo 9023 | grep -E "(1|90)?23"
perl -e "print 'PASS' if (qq(23) =~ /(1|90)?23/)"
python -c "import re; print bool(re.match('^(1|90)?23$', '23'))"
To make any pattern that matches an entire string optional, i.e. allow a pattern match an empty string, use an optional group:
^(pattern)?$
^^ ^^^
See the regex demo
If the regex engine allows (as in Java), prefer a non-capturing group since its main purpose is to only group subpatterns, not keep the subvalues captured:
^(?:pattern)?$
The ^ will match the start of a string (or \A can be used in many flavors for this), $ will match the end of string (or \z can be used to match the very end in many flavors, and Java, too), and the (....)? will match 1 or 0 (due to the ? quantifier) sequences of the subpatterns inside parentheses.
A Java usage note: when used in matches(), the initial ^ and trailing $ can be omitted and you can use
String pattern = "(?:\d+)?";

Categories

Resources