Regular expression help needed for simple (I think) scenario - java

Never was good at these things. We are using Checkstyle (for Java) to enforce good coding practices. One new check we would like to add is to identify when a certain API is used (potentially) incorrectly through the use of a regex check.
Incorrect usage examples:
BigDecimal.valueOf(someDouble).setScale(3);
someBigDecimalObject.setScale(6);
Correct usage examples:
BigDecimal.valueOf(someDouble).setScale(3, RoundingMode.HALF_UP);
someBigDecimalObject.setScale(1, RoundingMode.HALF_DOWN);
So the regular expression I'm looking for should check that whenever ".setScale(" appears in the code, "RoundingMode." appears somewhere after it. To put it more clearly, the regular expression should match when ".setScale(" appears but "RoundingMode." doesn't.
Thanks in advance

Off the top of my head:
(?x: \.setScale \s* \( (?: (?! \bRoundingMode\b ) [^)] ) * \) )
What you want is to see .setScale(...) where ... doesn't contain RoundingMode. To explain further (assuming the Regex is being processed by Java):
(?x: - introduce "whitespace" mode so that you can pad the regex with spaces for readability
\.setScale \s* \( - look for .setScale with arbitrary spaces then (
Now the fun really begins: At every position until we encounter the trailing ) check whether the word RoundingMode appears:
(?: - start a group containing...
(?! \bRoundingMode\b ) - ...a negative look-ahead assertion for the word RoundingMode, that is, the regex will fail to match if "RoundingMode" is detected; and then
[^)] - match a non ) character
) * - repeat the above as often as required until
\) - the closing ")" is detected.
) - Finally, close the (?x: construct.
Of course, this won't work if there are nested (...) expressions in the method call, but you can't solve that with a Java Pattern; you'd need a dodgy Perl regex instead.
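As a rough check, the pattern can be run against the four examples from the question. This is only a sketch: the class and method names (SetScaleCheck, isSuspicious) are made up for illustration, and the pattern is written in compact form instead of with (?x: ... ).

```java
import java.util.regex.Pattern;

class SetScaleCheck {
    // Same pattern as above, written without free-spacing mode.
    static final Pattern MISSING_ROUNDING_MODE = Pattern.compile(
            "\\.setScale\\s*\\((?:(?!\\bRoundingMode\\b)[^)])*\\)");

    // True when the line contains a setScale(...) call whose argument
    // list does not mention RoundingMode (hypothetical helper).
    static boolean isSuspicious(String line) {
        return MISSING_ROUNDING_MODE.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "BigDecimal.valueOf(someDouble).setScale(3);",
            "someBigDecimalObject.setScale(6);",
            "BigDecimal.valueOf(someDouble).setScale(3, RoundingMode.HALF_UP);",
            "someBigDecimalObject.setScale(1, RoundingMode.HALF_DOWN);"
        };
        for (String s : samples) {
            System.out.println(isSuspicious(s) + "  " + s);
        }
    }
}
```

The first two samples should be flagged and the last two should not. A call with nested parentheses inside setScale(...) is still out of reach, as noted above.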

You can use the regex:
\.setScale\(\d+,\s*RoundingMode

How about:
\.setScale\(.+RoundingMode

Related

Modifying existing Java regex

I have the following regex that validates the allowed characters:
^[a-zA-Z0-9-?\/:;(){}\[\]|`~´.\,'+÷ !##$£%^"&*_<>=àáâäçèéêëìíîïñòóôöùúûüýßÀÁÂÄÇÈÉÊËÌÍÎÏÒÓÔÖÙÚÛÜÑ\\]*$
I need to modify it so that the string being validated:
may not begin with space or “/”
may not contain “//”
may not end with “/”
For the space at the beginning I have adapted it to
^[^\s][a-zA-Z0-9-?\\/:;(){}\\[\\]|`~´.\\,'+÷ !##$£%^\"&*_<>=àáâäçèéêëìíîïñòóôöùúûüýßÀÁÂÄÇÈÉÊËÌÍÎÏÒÓÔÖÙÚÛÜÑ\\\\]*$
Not sure what to do about the other two requirements
For the second one I tried combining it with ^((?!//))*$ in various ways but to no success.
Note that ^((?!\/\/))*$ matches any empty string since the lookahead is a non-consuming pattern and here it always returns true.
[^\s] at the start of your pattern will match any chars other than whitespace chars, even those you did not specify in the character class.
You can use
^(?![\s/])(?!.*//)[a-zA-Z0-9?/:;(){}\[\]|`~´.,'+÷ !##$£%^\"&*_<>=àáâäçèéêëìíîïñòóôöùúûüýßÀÁÂÄÇÈÉÊËÌÍÎÏÒÓÔÖÙÚÛÜÑ\\-]*$(?<!/)
Details:
^(?![\s/])(?!.*//) - at the start of string, two checks are performed:
(?![\s/]) - no whitespace or / allowed (right at the start)
(?!.*//) - no // allowed anywhere after zero or more chars other than line break chars, as many as possible
(?<!/) is the check after the end of string is hit, and it fails the match if the last char in string is /.
Note that in Java regex declarations, you do not need to escape / since regex delimiter notation is not used, and / itself is not a special regex metacharacter.
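To see the three lookaround checks in isolation, here is a small Java sketch. The character class is deliberately shortened to a stand-in (ALLOWED is an assumption, not the full class from the question), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class SlashRules {
    // Shortened stand-in for the long character class (assumption).
    static final String ALLOWED = "[a-zA-Z0-9 /.,:-]";

    // (?![\s/])  : must not start with whitespace or '/'
    // (?!.*//)   : must not contain "//"
    // (?<!/)     : must not end with '/'
    static final Pattern VALID = Pattern.compile(
            "^(?![\\s/])(?!.*//)" + ALLOWED + "*$(?<!/)");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"a/b c", "/abc", " abc", "a//b", "abc/"};
        for (String s : samples) {
            System.out.println("[" + s + "] -> " + isValid(s));
        }
    }
}
```

Only the first sample should pass; each of the others trips exactly one of the three checks.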
It seems like the following regexp should be enough and simpler: (?!.*//)^[^ /].*[^/]$
So at the beginning you can use a negative lookahead to prevent the occurrence of // anywhere in the text. Then any character but space and / is accepted at the beginning, then anything can be present (besides //, which was excluded by the negative lookahead), and anything but / is accepted at the end.
Since 95% of the time the special conditions on the space and forward slash
will not occur, it might be better to take those two characters out of your
big class and handle them separately if and when they occur.
The big class can also be condensed to speed things up a bit.
^(?>[a-zA-Z0-9\\!-.:-@\[\]-`{-~£´ÄÖäö÷À-ÂÇ-ÏÑ-ÔÙ-Üß-âç-ïñ-ôù-ý]+|(?:/(?!/|$)|[ ])(?<!^.))*$
https://regex101.com/r/LpCwt6/1
^
(?>
    [a-zA-Z0-9\\!-.:-@\[\]-`{-~£´ÄÖäö÷À-ÂÇ-ÏÑ-ÔÙ-Üß-âç-ïñ-ôù-ý]+
  | (?:
        /
        (?! / | $ )
      | [ ]
    )
    (?<! ^ . )
)*
$
And if you want to absorb all the class characters it can get very small.
^(?>[!-.0-~£´ÄÖäö÷À-ÂÇ-ÏÑ-ÔÙ-Üß-âç-ïñ-ôù-ý]+|(?:/(?!/|$)|[ ])(?<!^.))*$
https://regex101.com/r/EYdM5C/1

Regular expression for allowing only 1 of a set of characters

I am trying to use some regex to validate some input inside of Java code. I have been successful in implementing "basic" regex, but this one seems to be out of my scope of knowledge. I am working through RegEgg tutorials to learn more.
Here are the conditions that need to be validated:
Field will always have 8 characters
Can be all spaces
Or
Valid characters: a-zA-Z0-9 -!& or a space
Cannot begin with a space
If one of the special characters is used, it can be the only one used
Legal: "B-123---" "AB&& &" "A!!!!!!!"
Illegal: "B-123!!!" "AB&& -" "A-&! "
Has to have at least one alphanumeric character (can't be all special characters, i.e. "!!!!!!!!")
This was my regex before additional validations were added:
^(\s{8}|[A-Za-z\-\!\&][ A-Za-z0-9\-\!\&]{7})$
Then came the additional validations for not allowing multiple different special characters, and I am a bit stuck. I have been successful in using a positive lookahead, but I'm stuck when trying to use the positive lookbehind. (I think the data before the lookbehind was consumed, but I am speculating, as I am a neophyte with this part of regex.)
Using the or construct (a|b) is a large part of this, and you've begun applying it, so that's a good start.
You've made the rule that it can't start with a digit; nothing in the spec says this. Also, - inside [] has special meaning, so escape it, or make sure it is first or last, because then you don't have to. That gets us to:
^(\s{8}|[A-Za-z0-9!& -]{8})$
Next up is the rule that it has to be all the same special character if used at all. Given that there are only 3 special characters, it could be easier to just explicitly list them all:
^(\s{8}|[A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8})$
Next up: Can't start with a space, and can't be all-special. Confirming the negative (that it ISN'T all-special characters) gets complicated; lookahead seems like a better plan here. Some background first:
^ is regexp-ese for "start of line". Note that this doesn't 'consume' a character. 1 is regexp-ese for 'only the exact character 1 will match here, nothing else', but as it matches, it also 'consumes' that character, whereas ^ doesn't do that. 'Start of line' is not a concept that can be consumed.
This notion of 'a match may fail, but if it succeeds, nothing is consumed' isn't limited to ^ and $; you can write your own:
(?=abc) will match if abc would match at this position, but does not consume it. Thus, the regexp ^(?=abc)ab.d$ would match the input string abcd and nothing else. This is called positive lookahead. (It 'looks ahead' and matches if it sees the regular expression in the parens, failing if it does not.)
(?!abc) is negative lookahead. It matches if it DOESNT see the thing in the parens. (?!abc)a.c will match the input adc but not the input abc.
(?<=abc) is positive lookbehind. It matches if the pattern you provide would match such that the match ends at the position you find yourself.
(?<!abc) is negative lookbehind.
Note that lookbehind can be somewhat limited, in that it may not allow variable-length patterns (lookahead has no such restriction). But, fortunately, your requirements make it easy to limit ourselves to fixed-size patterns here. Thus, we can introduce: (?![&!-]{8}) as a non-consuming unit in our regexp that will fail the match if we have all-8 special characters.
We can use this trick to fail on starting space too: (?! ) is all we need for that one.
Let's replace \s, which is whitespace, with just ' ', which is the space character (the problem description says 'space', not 'whitespace').
Putting it all together:
^( {8}|(?! )(?![&!-]{8})([A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8}))$
That's:
1. 8 spaces, or...
2. not a space, and not all-8 special characters, then,
3. any of the valid chars, any amount of spaces, and any amount of one of the 3 allowed special symbols, as long as we have precisely 8 of them...
4. ...OR the same thing as #3 but with the second of the three special symbols
5. ...OR with the third of the three.
Plug em in at regex101 along with your various examples of 'legal' and 'not legal' and you can play around with it some more.
NB: You can also use backreferences to attempt to solve the 'only one special character is allowed' part of this, but attempting to tackle the 'not all special characters' part seems quite unwieldy if you don't get to use (negative) lookahead.
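If you would rather check this from Java than from regex101, a minimal sketch could look like the following. The class and method names are hypothetical, and the sample inputs are my own 8-character strings built to exercise each rule.

```java
import java.util.regex.Pattern;

class EightCharField {
    // The combined pattern from above, unchanged.
    static final Pattern FIELD = Pattern.compile(
            "^( {8}|(?! )(?![&!-]{8})"
            + "([A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8}))$");

    static boolean isValid(String s) {
        return FIELD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "B-123---",   // one kind of special character
            "A!!!!!!!",   // one kind of special character
            "        ",   // all spaces
            "B-123!!!",   // mixes - and !
            "!!!!!!!!",   // all special characters
            " A123456"    // starts with a space
        };
        for (String s : samples) {
            System.out.println("[" + s + "] -> " + isValid(s));
        }
    }
}
```

The first three samples should be accepted and the last three rejected.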
It's a matter of asserting the right conditions at the start of the regex.
^(?=[ ]*$|(?![ ]))(?!.*([!&-]).*(?!\1)[!&-])[a-zA-Z0-9 !&-]{8}$
see -> https://regex101.com/r/tN5y4P/1
Some discussion:
^ # Begin of text
(?= # Assert, cannot start with a space
[ ]* $ # unless it's all spaces
| (?! [ ] )
)
(?! # Assert, not mixed special chars
.*
( [!&-] ) # (1)
.*
(?! \1 )
[!&-]
)
[a-zA-Z0-9 !&-]{8} # Consume 8 valid characters from within this class
$ # End of text

Regex for finding the text inside parentheses followed by #en : "example"#en [duplicate]

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?
In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print(re.findall(r'"(.*?)"', string))
['Foo Bar', 'Another Value']
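The same extraction in Java might look roughly like this; it is a sketch with hypothetical class and method names, using only java.util.regex and reading capture group 1 for each match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedValues {
    // "(.*?)" : a quote, as few characters as possible, then the next quote.
    static final Pattern QUOTED = Pattern.compile("\"(.*?)\"");

    static List<String> extract(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            values.add(m.group(1)); // text between the quotes
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(extract("\"Foo Bar\" \"Another Value\" something else"));
        // prints [Foo Bar, Another Value]
    }
}
```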
I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
([""']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, as to not eat the closing quote); \1 match the same quote that was use for opening.
I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
Let's look at two efficient ways of dealing with escaped quotes. These patterns are not designed to be concise or aesthetic, but to be efficient.
Both use first-character discrimination to quickly find quotes in the string without the cost of an alternation. (The idea is to quickly discard characters that are not quotes without testing both branches of the alternation.)
Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*
To deal with strings that don't have balanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround to emulate them, to prevent too much backtracking. You can also decide that a quoted part runs from an opening quote until the next (non-escaped) quote or the end of the string. In this case there is no need for possessive quantifiers; you only need to make the last quote optional.
Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
The patterns avoid the use of a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation, but with ["'] factored out at the beginning.
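As an illustration of the unrolled-loop content pattern, here is a small Java sketch for double-quoted strings only. It is an assumption-laden example, not one of the per-flavor patterns from this answer: the class and method names are hypothetical, a capture group is added purely to pull out the content, and possessive quantifiers are used on every loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledQuotes {
    // "  [^"\\]*+  (?: \\.  [^"\\]*+ )*+  "
    // Normal characters are consumed in runs; each backslash escape is
    // consumed as a pair, so an escaped quote never ends the string.
    static final Pattern DOUBLE_QUOTED = Pattern.compile(
            "\"([^\"\\\\]*+(?:\\\\.[^\"\\\\]*+)*+)\"", Pattern.DOTALL);

    static List<String> contents(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = DOUBLE_QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group(1)); // raw content, escapes left as-is
        }
        return result;
    }

    public static void main(String[] args) {
        // Input text: say "a \"quoted\" word" and "b"
        System.out.println(contents("say \"a \\\"quoted\\\" word\" and \"b\""));
    }
}
```

With an unbalanced input such as a lone opening quote, the possessive quantifiers let the attempt fail without retrying shorter runs.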
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(note that (?s:...) is a syntactic sugar to switch on the dotall/singleline mode inside the non-capturing group. If this syntax is not supported you can easily switch this mode on for all the pattern or replace the dot with [\s\S])
(The way this pattern is written is totally "hand-driven" and doesn't take account of eventual engine internal optimizations)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'
Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind looks behind the current character to check for a quote; if one is found, the match starts from there. The lookahead then checks the character ahead for a quote and, if one is found, stops on that character. The lookbehind group (the ["']) is wrapped in brackets to create a group for whichever quote was found at the start; this is then used in the end lookahead (?=\1) to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.
The RegEx of the accepted answer returns the values including their surrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are RegEx which return only the values between quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
All support escaped and nested quotes.
I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string from $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
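For reference, the grouped variant (content in $2) could be used from Java along these lines. This is only a sketch; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedContent {
    // (['"])  ((?:(?!\1|\\).|\\.)*)  \1
    // Group 1 is the opening quote, group 2 the content between the quotes.
    static final Pattern QUOTED =
            Pattern.compile("(['\"])((?:(?!\\1|\\\\).|\\\\.)*)\\1");

    static List<String> contents(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Input text: x 'it\'s' y "b"
        System.out.println(contents("x 'it\\'s' y \"b\""));
    }
}
```

Because the closing quote must equal group 1, a single quote inside a double-quoted string (and vice versa) is treated as ordinary content.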
A very late answer, but like to answer
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1
The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job, but I am concerned about its performance (it's not bad, but it could be better). Mine below is ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance it cannot capture many strings (if needed I can provide an exhaustive test-case) like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.
This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/
MORE ANSWERS! Here is the solution I used:
\"([^\"]*?icon[^\"]*?)\"
TL;DR:
Replace the word icon with what you're looking for in said quotes and voilà!
The way this works is that it looks for the keyword and doesn't care what else is between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
the regex looks for a quote mark "
then it looks for any possible group of letters thats not "
until it finds icon
and any possible group of letters that is not "
it then looks for a closing "
I liked Axeman's more expansive version, but had some trouble with it. For example, it didn't match
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly, so I tried to fix it:
# opening quote
(["'])
(
    # repeat (non-greedy, so we don't span multiple strings)
    (?:
        # anything, except not the opening quote, and not
        # a backslash, which are handled separately.
        (?!\1)[^\\]
    |
        # consume any double backslash (unnecessary?)
        (?:\\\\)*
    |
        # Allow backslash to escape characters
        \\.
    )*?
)
# same character as opening quote
\1
import re
string = "\" foo bar\" \"loloo\""
print(re.findall(r'"(.*?)"', string))
Just try this out; it works like a charm!
The \ in the string literal escapes the quote character that follows it.
Unlike Adam's answer, I have a simple one that works:
(["'])(?:\\\1|.)*?\1
And just add parenthesis if you want to get content in quotes like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.
All the answers above are good... except that they do NOT support all Unicode characters in ECMAScript (JavaScript)!
If you are a Node user, you might want this modified version of the accepted answer, which supports all Unicode characters:
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["']) -> Matches either ' or " and stores it in capture group 1 (backreference \1).
.* -> Greedily matches any characters, zero or more times, up to the end of the line; the regex engine then backtracks one character at a time until the rest of the pattern can match.
\1 -> Matches the same quote character that was captured by the first group.
(?![^\s]) -> Negative lookahead ensuring that no non-space character follows the closing quote.
echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I showed the result string between ><'s for clarity, also using the non-greedy version with this sed command we first throw out the junk before and after that ""'s and then replace this with the part between the ""'s and surround this by ><'s.
From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match; no partial match should trigger a hit,
e.g. "test" could not match for "test2".
import re
reg = r"""(['"])(%s)\1"""
if re.search(reg % needle, haystack, re.IGNORECASE):
    print("winning...")
If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
A supplementary answer for the subset of Microsoft VBA coders only: one uses the library Microsoft VBScript Regular Expressions 5.5, which gives the following code:
Sub TestRegularExpression()
    Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
    Set oRE = New VBScript_RegExp_55.RegExp
    oRE.Pattern = """([^""]*)"""
    oRE.Global = True
    Dim sTest As String
    sTest = """Foo Bar"" ""Another Value"" something else"
    Debug.Assert oRE.test(sTest)
    Dim oMatchCol As VBScript_RegExp_55.MatchCollection
    Set oMatchCol = oRE.Execute(sTest)
    Debug.Assert oMatchCol.Count = 2
    Dim oMatch As Match
    For Each oMatch In oMatchCol
        Debug.Print oMatch.SubMatches(0)
    Next oMatch
End Sub

Regular Expression Wildcard Matching

I have a list of about 120 thousand english words (basically every word in the language).
I need a regular expression that would allow searching through these words using wildcards characters, a.k.a. * and ?.
A few examples:
if the user searches for m?st*, it would match for example master or mister or mistery.
if the user searches for *ind (any word ending in ind), it would match wind or bind or blind or grind.
Now, most users (especially the ones who are not familiar with regular expressions) know that ? is a replacement for exactly 1 character, while * is a replacement for 0, 1 or more characters. I absolutely want to build my search feature based on this.
My question is: How do I convert what the user types (m?st* for example) to a regular expression?
I searched the web (obviously including this website) and all I could find were tutorials that tried to teach me too much or questions that were somewhat similar, but not enough as to provide an answer to my own problem.
All I could figure out was that I have to replace ? with .. So m?st* becomes m.st*. However, I have no idea what to replace * with.
Any help would be greatly appreciated. Thank you.
PS: I'm totally new to regular expressions. I know how powerful they can be, but I also know they can be very hard to learn. So I just never took the time to do it...
Unless you want some funny behaviour, I would recommend you use \w instead of .
. matches whitespace and other non-word symbols, which you might not want it to do.
So I would replace ? with \w and replace * with \w*
Also if you want * to match at least one character, replace it with \w+ instead. This would mean that ben* would match bend and bending but not ben - it's up to you, just depends what your requirements are.
Take a look at this library: https://github.com/alenon/JWildcard
It wraps all not wildcard specific parts by regex quotes, so no special chars processing needed:
This wildcard:
"mywil?card*"
will be converted to this regex string:
"\Qmywil\E.\Qcard\E.*"
If you wish to convert wildcard to regex string use:
JWildcard.wildcardToRegex("mywil?card*");
If you wish to check the matching directly you can use this:
JWildcard.matches("mywild*", "mywildcard");
Default wildcard rules are "?" -> ".", "*" -> ".*", but you can change the default behaviour if you wish, by simply defining the new rules.
JWildcard.wildcardToRegex(wildcard, rules, strict);
You can use sources or download it directly using maven or gradle from Bintray JCenter: https://bintray.com/yevdo/jwildcard/jwildcard
Gradle way:
compile 'com.yevdo:jwildcard:1.4'
Maven way:
<dependency>
<groupId>com.yevdo</groupId>
<artifactId>jwildcard</artifactId>
<version>1.4</version>
</dependency>
Replace ? with . and * with .*.
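A minimal Java sketch of that replacement, assuming everything other than ? and * should be treated literally (so the literal runs are wrapped with Pattern.quote). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class WildcardSketch {
    // Builds a regex from a wildcard string: ? -> . and * -> .*
    // All other characters are quoted so they match themselves.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString());
    }

    // matches() requires the whole word to match, so no ^ or $ is needed.
    static boolean matches(String wildcard, String word) {
        return toPattern(wildcard).matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("m?st*", "master")); // true
        System.out.println(matches("*ind", "blind"));   // true
        System.out.println(matches("*ind", "windy"));   // false
    }
}
```

For a 120-thousand-word list it is probably worth compiling the pattern once per search and reusing it for every word, instead of calling matches(wildcard, word) in a loop.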
Here is a way to transform wildcard into regex:
Prepend all special characters ([{\^-=$!|]}).+ with \ - so they are matched as characters and don't make user experience unexpected. Also you could enclose it within \Q (which starts the quote) and \E (which ends it). Also see paragraph about security.
Replace * wildcard with \S*
Replace ? wildcard with \S (exactly one non-space character)
Optionally: prepend pattern with ^ - this will enforce exact match with the beginning.
Optionally: append $ to pattern - this will enforce exact match with the end.
\S stands for a non-space character; the * after it repeats it zero or more times.
Consider using reluctant (non-greedy) quantifiers if you have characters to match after * or +. This can be done by adding ? after * or + like this: \S*? and \S+?
Consider security: the user will send you code to run (because a regex is a kind of code too, and the user's string is used as the regex). You should avoid passing unescaped regex to any other parts of the application and only use it to filter data retrieved by other means. Otherwise, the user can affect the speed of your code by supplying different regexes within the wildcard string, which could be used in DoS attacks.
Example to show execution speeds of similar patterns:
seq 1 50000000 > ~/1
du -sh ~/1
563M
time grep -P '.*' ~/1 &>/dev/null
6.65s
time grep -P '.*.*.*.*.*.*.*.*' ~/1 &>/dev/null
12.55s
time grep -P '.*..*..*..*..*.*' ~/1 &>/dev/null
31.14s
time grep -P '\S*.\S*.\S*.\S*.\S*\S*' ~/1 &>/dev/null
31.27s
I'd suggest against using .* simply because it can match anything, and usually things are separated with spaces.
Replace all '?' characters with '\w'
Replace all '*' characters with '\w*'
The '*' operator repeats the previous item (here '\w', a word character) 0 or more times.
This assumes that none of the words contain '.', '*', and '?'.
This is a good reference
http://www.regular-expressions.info/reference.html
Replace * with .* (the regex equivalent of "0 or more of any character").
. is an expression that matches any one character, as you've discovered. In your hours of searching, you undoubtedly also stumbled across *, which is a repetition operator that when used after an expression matches the preceding expression zero or more times in a row.
So the equivalent to your meaning of * is putting these two together: .*. This then means "any character zero or more times".
See the Regex Tutorial on repetition operators.
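function matchWild(wild, name)
{
    if (wild == '*') return true;
    // Escape backslashes first, so the escapes added below are not doubled.
    wild = wild.replace(/\\/g, '\\\\');
    wild = wild.replace(/\./g, '\\.');
    wild = wild.replace(/\?/g, '.');
    wild = wild.replace(/\//g, '\\/');
    // * matches one or more characters, as few as possible.
    wild = wild.replace(/\*/g, '(.+?)');
    // No ^/$ anchors are added, so this tests for a match anywhere in name.
    var re = new RegExp(wild, 'i');
    return re.test(name);
}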
This is what I use:
String wildcardToRegex(String wildcardString) {
    // The 12 is arbitrary, you may adjust it to fit your needs depending
    // on how many special characters you expect in a single pattern.
    StringBuilder sb = new StringBuilder(wildcardString.length() + 12);
    sb.append('^');
    for (int i = 0; i < wildcardString.length(); ++i) {
        char c = wildcardString.charAt(i);
        if (c == '*') {
            sb.append(".*");
        } else if (c == '?') {
            sb.append('.');
        } else if ("\\.[]{}()+-^$|".indexOf(c) >= 0) {
            sb.append('\\');
            sb.append(c);
        } else {
            sb.append(c);
        }
    }
    sb.append('$');
    return sb.toString();
}
Special character list from https://stackoverflow.com/a/26228852/1808989.

Why doesn't ? work as an optional repetition specifier in this pattern?

I am trying to match inputs like
<foo>
<bar>
#####<foo>
#####<bar>
I tried #{5}?<\w+>, but it does not match <foo> and <bar>.
What's wrong with this pattern, and how can it be fixed?
On ? for optional vs reluctant
The ? metacharacter in Java regex (and some other flavors) can have two very different meanings, depending on where it appears. Immediately following a repetition specifier, ? is a reluctant quantifier instead of "zero-or-one"/"optional" repetition specifier.
Thus, #{5}? does not mean "optionally match 5 #". It in fact says "match 5 # reluctantly". It may not make too much sense to try to match "exactly 5, but as few as possible", but this is in fact what this pattern means.
Grouping to the rescue!
One way to fix this problem is to group the optional pattern as (…)?. Something like this should work for this problem:
(#{5})?<\w+>
Now the ? does not immediately follow a repetition specifier (i.e. *, +, ?, or {…}); it follows a closing bracket used for grouping.
Alternatively, you can also use a non-capturing group (?:…) in this case:
(?:#{5})?<\w+>
This achieves the same grouping effect, but doesn't capture into \1.
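The difference between the two patterns can be checked with a few lines of Java. This is a small sketch; the class and field names are hypothetical.

```java
import java.util.regex.Pattern;

class OptionalPrefix {
    // #{5}? : exactly five '#', reluctantly (still five, never zero)
    static final Pattern RELUCTANT = Pattern.compile("#{5}?<\\w+>");
    // (?:#{5})? : the whole group of five '#' is optional
    static final Pattern OPTIONAL = Pattern.compile("(?:#{5})?<\\w+>");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(RELUCTANT, "<foo>"));      // false
        System.out.println(matches(RELUCTANT, "#####<foo>")); // true
        System.out.println(matches(OPTIONAL, "<foo>"));       // true
        System.out.println(matches(OPTIONAL, "#####<bar>"));  // true
    }
}
```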
References
regular-expressions.info
Question Mark for Optional - yes, but only with proper placement
Brackets for Grouping
Repetition
Flavor comparison
java.util.regex.Pattern: X{n}? : X, exactly n times
Related questions
regex{n,}? == regex{n} ? (absolutely NOT!)
Difference between .*? and .* for regex
Bonus material: What about ??
It's worth noting that you can use ?? to match an optional item reluctantly!
System.out.println("NOMZ".matches("NOMZ??"));
// "true"
System.out.println(
"NOM NOMZ NOMZZ".replaceAll("NOMZ??", "YUM")
); // "YUM YUMZ YUMZZ"
Note that Z?? is an optional Z, but it's matched reluctantly. "NOMZ" in its entirety still matches the pattern NOMZ??, but in replaceAll, NOMZ?? can match only "NOM" and doesn't have to take the optional Z even if it's there.
By contrast, NOMZ? will match the optional Z greedily: if it's there, it'll take it.
System.out.println(
"NOM NOMZ NOMZZ".replaceAll("NOMZ?", "YUM")
); // "YUM YUM YUMZ"
Related questions
method matches not work well
unlike other flavors, Java matches a pattern against the entire String
Place your # match in a subpattern:
(#{5})?<\w+>
