Finding a simple pattern in a string unless escaped - java

I have some code that looks for a simple bold markup:
private Pattern bold = Pattern.compile("\\*[^\\*]*\\*");
If someone uses: this my *bolded* text - my pattern would find "bolded"
I now need a way to use * not in the context of bolding. So I'd like to allow escaping.
E.g. this my \*non-bolded\* text - should not find any pattern.
Is there a simple way I can change my Regex to achieve this?

You need a negative lookbehind here:
(?<!\\)\*[^*]+(?<!\\)\*
In a Java string, this gives (backslash galore):
"(?<!\\\\)\\*[^*]+(?<!\\\\)\\*"
Note: the star (*) has no special meaning within a character class, therefore there is no need to escape it
Note 2: (?<!...) is a negative lookbehind; it is an anchor, which means it finds a position but consumes no text. Literally, it can be translated as: "find a position where there is no preceding text matching regex ...". Other anchors are:
^: find a position where there is no available input before (ie, can only match at the beginning of the input);
$: find a position where there is no available input after (ie, can only match at the end of the input);
(?=...): find a position where the following text matches regex ... (this is called a positive lookahead);
(?!...): find a position where the following text does not match regex ... (this is called a negative lookahead);
(?<=...): find a position where the preceding text matches regex ... (this is a positive lookbehind);
\<: find a position where the preceding input is either nothing or a character which is not a word character, and the following character is a word character (implementation dependent);
\>: find a position where the following input is either nothing or a character which is not a word character, and the preceding character is a word character (implementation dependent);
\b: either \< or \>.
Note 3: JavaScript regexes did not support lookbehinds before ES2018 (older engines still do not); neither do they support \< or \>.
Note 4: with some regex engines, it is possible to alter the meaning of ^ and $ to match positions at the beginning and end of each line instead; in Java, that is Pattern.MULTILINE; in Perl-like regex engines, that is /m.
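As a rough illustration of the lookbehind pattern from this answer, here is a minimal Java sketch run against the two inputs from the question. The class and method names (`BoldLookbehindDemo`, `findBold`) are made up for this example and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoldLookbehindDemo {
    // Regex (?<!\\)\*[^*]+(?<!\\)\* with every backslash doubled for the Java literal.
    private static final Pattern BOLD =
            Pattern.compile("(?<!\\\\)\\*[^*]+(?<!\\\\)\\*");

    // Hypothetical helper: collects every bold span, asterisks included.
    public static List<String> findBold(String text) {
        List<String> spans = new ArrayList<>();
        Matcher m = BOLD.matcher(text);
        while (m.find()) {
            spans.add(m.group());
        }
        return spans;
    }

    public static void main(String[] args) {
        // Unescaped markup: one span is found.
        System.out.println(findBold("this my *bolded* text"));
        // Escaped asterisks: the lookbehind rejects both, so nothing is found.
        System.out.println(findBold("this my \\*non-bolded\\* text"));
    }
}
```

The lookbehind only inspects the character before each asterisk; it consumes nothing, so the reported span still starts at the asterisk itself.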

This negative lookbehind based regex should work for you:
(?<!\\)\*[^*]+\*(?<!\\)
Live Demo: http://www.rubular.com/r/sobKUrkTjP
When translated to Java it will become:
(?<!\\\\)\\*[^*]+\\*(?<!\\\\)

I think the two answers so far are very interesting, but not completely correct. They don't work when the bolded text contains an escaped asterisk (which I assume is the main reason to escape asterisks in the first place).
For example:
My *bold \*text* here, another *bold*, more \* and *here\* and
\* end* more text
Should find three groups:
*bold \*text*
*bold*
*here\* and \* end*
With a little modification, we can do that, with this regular expression:
(?<!\\)\*([^*\\]|\\\*)+\*
can be tested here:
http://www.rubular.com/r/Jeml02HHYJ
Of course, in Java some more escaping is needed:
(?<!\\\\)\\*([^*\\\\]|\\\\\\*)+\\*
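A minimal Java sketch of this pattern applied to the example text above (joined onto one line). The names `EscapedBoldDemo` and `findBold` are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedBoldDemo {
    // The opening asterisk must not be escaped. Inside the span, each step is
    // either a character that is neither * nor backslash, or the two-character
    // sequence backslash + asterisk.
    private static final Pattern BOLD =
            Pattern.compile("(?<!\\\\)\\*([^*\\\\]|\\\\\\*)+\\*");

    // Hypothetical helper: returns every bold span, delimiters included.
    public static List<String> findBold(String text) {
        List<String> spans = new ArrayList<>();
        Matcher m = BOLD.matcher(text);
        while (m.find()) {
            spans.add(m.group());
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "My *bold \\*text* here, another *bold*, "
                + "more \\* and *here\\* and \\* end* more text";
        for (String span : findBold(text)) {
            System.out.println(span);
        }
    }
}
```

Note that a backslash followed by anything other than an asterisk ends the span without a match, because the character class excludes backslashes.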

Related

Java Regex to validate group field pattern example - abc.def.gh1

I am just writing some piece of java code where I need to validate groupId (maven) passed by user.
For example - com.fb.test1.
I have written regex which says string should not start and end with '.' and can have alphanumeric characters delimited by '.'
[^\.][[a-zA-Z0-9]+\\.{0,1}]*[a-zA-Z0-9]$
But this regex is not able to detect consecutive dots, for example com..fb.test. I added {0,1} after the dot to limit it to a single occurrence, but it didn't work.
Any leads would be highly appreciated.
The quantifier {0,1} and the dot should not be inside the character class: there they are literal characters, so repeating the whole class allows any number of dots, as well as the { , } characters.
You can also exclude a dot to the left using a negative lookbehind instead of matching an actual character that is not a dot.
In Java you could write the pattern as
(?<!\\.)[a-zA-Z0-9]+(?:\\.[a-zA-Z0-9]+)+[a-zA-Z0-9]$
Note that the $ makes sure that the match ends at the end of the string.
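For comparison, here is a simplified sketch of the same idea in Java. It is not the exact pattern from this answer: it relies on `Matcher.matches()`, which must consume the whole input, instead of a lookbehind and `$`. The class and method names are hypothetical, and accepting a single segment without any dot (such as `com`) is an assumption.

```java
import java.util.regex.Pattern;

public class GroupIdValidator {
    // One alphanumeric segment, then zero or more ".segment" repetitions.
    // The dot sits outside the character class, so two dots in a row cannot match.
    private static final Pattern GROUP_ID =
            Pattern.compile("[a-zA-Z0-9]+(?:\\.[a-zA-Z0-9]+)*");

    // Hypothetical helper: true if the whole string is a valid group id.
    public static boolean isValid(String groupId) {
        return GROUP_ID.matcher(groupId).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("com.fb.test1"));  // valid
        System.out.println(isValid("com..fb.test"));  // consecutive dots
        System.out.println(isValid(".com.fb"));       // leading dot
        System.out.println(isValid("com.fb."));       // trailing dot
    }
}
```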

Regex for finding the text inside parentheses followed by #en : "example"#en [duplicate]

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?
In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print(re.findall(r'"(.*?)"', string))
['Foo Bar', 'Another Value']
I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
(["']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, as to not eat the closing quote); \1 match the same quote that was used for opening.
I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
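The same negated-character-class pattern can be used from Java; the sketch below extracts capture group 1 for each match. The class and method names (`QuotedValuesDemo`, `extract`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedValuesDemo {
    // "([^"]*)" written as a Java literal: a quote, any run of non-quote
    // characters (captured), then the closing quote.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Hypothetical helper: returns the text between each pair of double quotes.
    public static List<String> extract(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            values.add(m.group(1)); // group 1 excludes the surrounding quotes
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(extract("\"Foo Bar\" \"Another Value\" something else"));
    }
}
```

This version does not treat a backslash as an escape, so it is only suitable when the quoted values cannot contain quotes themselves.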
Let's look at two efficient ways to deal with escaped quotes. These patterns are designed to be efficient rather than concise or pretty.
They use first-character discrimination to find quotes in the string quickly, without the cost of an alternation. (The idea is to quickly discard characters that are not quotes without testing both branches of the alternation.)
The content between quotes is described with an unrolled loop (instead of a repeated alternation), which is also more efficient: [^"\\]*(?:\\.[^"\\]*)*
To deal with strings that have unbalanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround that emulates them, to prevent excessive backtracking. You can also decide that a quoted part runs from an opening quote to the next (non-escaped) quote or to the end of the string. In that case there is no need for possessive quantifiers; you only need to make the last quote optional.
Note: sometimes quotes are escaped not with a backslash but by repeating the quote. In that case the content subpattern looks like this: [^"]*(?:""[^"]*)*
The patterns avoid a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation instead, with ["'] factored out at the beginning.
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(Note that (?s:...) is syntactic sugar that switches on dotall/singleline mode inside the non-capturing group. If this syntax is not supported, you can easily switch this mode on for the whole pattern or replace the dot with [\s\S].)
(The way this pattern is written is totally "hand-driven" and doesn't take possible internal engine optimizations into account.)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'
Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind looks behind the current character to check for a quote; if one is found, the match starts from there. The lookahead then checks the character ahead for a quote and, if one is found, stops on that character. The lookbehind group (the ["']) is wrapped in parentheses to create a group for whichever quote was found at the start; this is then used in the end lookahead (?=\1) to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.
The regex in the accepted answer returns the values including their surrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are regexes which return only the values between quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
All support escaped and nested quotes.
I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string from $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
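A hedged Java sketch of the group-based variant above, reading the content from group 2. The class and method names (`EscapedQuotesDemo`, `extract`) and the sample input are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedQuotesDemo {
    // Group 1: the opening quote. Group 2: the content, where each step is
    // either a character that is neither that quote nor a backslash, or a
    // backslash followed by any character. \1 then requires the same quote.
    private static final Pattern QUOTED =
            Pattern.compile("(['\"])((?:(?!\\1|\\\\).|\\\\.)*)\\1");

    // Hypothetical helper: returns the raw content of each quoted string,
    // with escape sequences left as written in the input.
    public static List<String> extract(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            values.add(m.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        // Input text: say "a \"b\" c" and 'it\'s'
        System.out.println(extract("say \"a \\\"b\\\" c\" and 'it\\'s'"));
    }
}
```

Because the dot does not match line terminators by default in Java, quoted strings that span lines would need `Pattern.DOTALL`.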
A very late answer, but I'd still like to contribute:
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1
The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job, but I am concerned about its performance (it's not bad but could be better). Mine below is ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is: DON'T USE IT!
For instance, it cannot capture many strings (if needed I can provide an exhaustive test case), like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care about both performance and precision, then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.
This version accounts for escaped quotes and controls backtracking:
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/
MORE ANSWERS! Here is the solution I used:
\"([^\"]*?icon[^\"]*?)\"
TL;DR:
Replace the word icon with what you're looking for inside the quotes, and voila!
It works by looking for the keyword without caring what else is between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
the regex looks for a quote mark "
then it looks for any possible group of letters thats not "
until it finds icon
and any possible group of letters that is not "
it then looks for a closing "
I liked Axeman's more expansive version, but had some trouble with it. For example, it didn't match
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly, so I tried to fix it:
# opening quote
(["'])
(
  # repeat (non-greedy, so we don't span multiple strings)
  (?:
    # anything, except not the opening quote, and not
    # a backslash, which are handled separately.
    (?!\1)[^\\]
    |
    # consume any double backslash (unnecessary?)
    (?:\\\\)*
    |
    # Allow backslash to escape characters
    \\.
  )*?
)
# same character as opening quote
\1
import re
string = "\" foo bar\" \"loloo\""
print(re.findall(r'"(.*?)"', string))
Just try this out; it works like a charm!
Inside the Python string literal, \" is an escaped quote character.
Unlike Adam's answer, I have a simple one that works:
(["'])(?:\\\1|.)*?\1
And just add parentheses if you want to get the content in quotes, like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.
All the answers above are good, except that they do NOT support all Unicode characters in ECMAScript (JavaScript)!
If you are a Node user, you might want this modified version of the accepted answer, which supports all Unicode characters:
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.
My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["'])-> Matches to either ' or " and store it in the backreference \1 once the match found
.* -> Greedy: matches everything up to the end of the line, after which the regex engine backtracks character by character until the rest of the pattern (\1 followed by the lookahead) can match.
\1 -> Matches the same character that was matched by the first capture group.
(?![^\s]) -> Negative lookahead to ensure that no non-whitespace character follows the previous match.
echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
The result string is shown between > and < for clarity. The sed command uses [^"]* in place of a non-greedy match: it first throws away the junk before and after each pair of quotes, then replaces the whole match with the part between the quotes, surrounded by > and <.
From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match; no partial match should trigger a hit,
e.g. "test" must not match "test2".
import re
reg = r"""(['"])(%s)\1"""
if re.search(reg % needle, haystack, re.IGNORECASE):
    print("winning...")
Hunter
If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
A supplementary answer for Microsoft VBA coders only: use the library Microsoft VBScript Regular Expressions 5.5, which gives the following code:
Sub TestRegularExpression()
    Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
    Set oRE = New VBScript_RegExp_55.RegExp
    oRE.Pattern = """([^""]*)"""
    oRE.Global = True
    Dim sTest As String
    sTest = """Foo Bar"" ""Another Value"" something else"
    Debug.Assert oRE.test(sTest)
    Dim oMatchCol As VBScript_RegExp_55.MatchCollection
    Set oMatchCol = oRE.Execute(sTest)
    Debug.Assert oMatchCol.Count = 2
    Dim oMatch As Match
    For Each oMatch In oMatchCol
        Debug.Print oMatch.SubMatches(0)
    Next oMatch
End Sub

Regex-How to prevent repeated special characters?

I don't have experience with regular expressions. I need a regular expression which doesn't allow special characters (+-*/& etc.) to be repeated.
The string can contain digits, letters, and special characters.
This should be valid : abc,df
This should be invalid : abc-,df
I would really appreciate it if you could help me! Thanks in advance.
The two solutions presented so far match a string that is not allowed.
But the title is How to prevent..., so I assume that the regex should match the allowed string. This means that the regex should:
match the whole string if it does not contain 2 consecutive special characters,
not match otherwise.
You can achieve this by putting together the following parts:
^ - start of string anchor,
(?!.*[...]{2}) - a negative lookahead for 2 consecutive special characters (marked here as ...), anywhere in the string,
a regex matching the whole (non-empty) string,
$ - end of string anchor.
So the whole regex should be:
^(?!.*[!@#$%^&*()\-_+={}[\]|\\;:'",<.>\/?]{2}).+$
Note that within a char class (between [ and ]) a backslash
escaping the following char should be placed before - (if in
the middle of the sequence), closing square bracket,
a backslash itself and / (regex terminator).
Or if you want to apply the regex to individual words (not the whole
string), then the regex should be:
\b(?!\S*[!@#$%^&*()\-_+={}[\]|\\;:'",<.>\/?]{2})\S+
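A minimal Java sketch of the whole-string approach described in this answer. To keep the literal readable it uses a deliberately small, assumed set of special characters (`+ - * / & ,`) rather than the full list above; the class and method names are hypothetical. Note that in Java an unescaped `[` inside a character class starts a nested class, so the longer list would need `\[` there.

```java
import java.util.regex.Pattern;

public class NoRepeatedSpecialsDemo {
    // ^            start of input
    // (?!.*[..]{2}) fail if two special characters appear back to back anywhere
    // .+$          otherwise accept any non-empty string
    // Assumption: only + - * / & , count as "special" in this sketch.
    private static final Pattern ALLOWED =
            Pattern.compile("^(?!.*[+\\-*/&,]{2}).+$");

    // Hypothetical helper: true if no two special characters are adjacent.
    public static boolean isAllowed(String s) {
        return ALLOWED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("abc,df"));   // single special character
        System.out.println(isAllowed("abc-,df"));  // "-," is two in a row
    }
}
```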
[\,\+\-\*\/\&]{2,} Add more characters in the square bracket if you want.
Demo https://regex101.com/r/CBrldL/2
Use the following regex to match the invalid string.
[^A-Za-z0-9]{2,}
[^\w!\s]{2,} This would be a shortest version to match any two consecutive special characters (ignoring space)
If you want to consider space, please use [^\w]{2,}

Stop regular expression from matching across lines

I have a regular expression,
end\\s+[a-zA-Z]{1}[a-zA-Z_0-9]
which is supposed to match a line with the specifications
end abcdef123
where abcdef123 must start with a letter and subsequent alphanumeric characters.
However currently it is also matching this
foobar barfooend
bar fred bob
It's picking up that end at the end of barfooend and also picking up bar in effect returning end bar as a legitimate result.
I tried
^end\\s+[a-zA-Z]{1}[a-zA-Z_0-9]
but that doesn't seem to work at all. It ends up matching nothing.
It should be fairly simple but I can't seem to nut it out.
\s also includes newline characters. So you either need to specify a character class that contains only the wanted whitespace characters, or exclude the unwanted ones.
Instead of \\s+, use one of these:
[^\\S\r\n] this includes all whitespace except \r and \n. For example: end[^\S\r\n]+[a-zA-Z][a-zA-Z_0-9]+
[ \t] this includes only space and tab. For example: end[ \t]+[a-zA-Z][a-zA-Z_0-9]+
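To make the difference concrete, the following Java sketch compares the question's original pattern with the first alternative above on the two-line input from the question. The class, field, and method names are assumptions for this example.

```java
import java.util.regex.Pattern;

public class SameLineWhitespaceDemo {
    // Original pattern: \s+ also matches the line break after "barfooend".
    public static final Pattern LOOSE =
            Pattern.compile("end\\s+[a-zA-Z]{1}[a-zA-Z_0-9]");

    // [^\S\r\n] means "whitespace, except CR and LF", so the match stays on one line.
    public static final Pattern SAME_LINE =
            Pattern.compile("end[^\\S\\r\\n]+[a-zA-Z][a-zA-Z_0-9]+");

    // Hypothetical helper: true if the pattern occurs anywhere in the text.
    public static boolean found(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String twoLines = "foobar barfooend\nbar fred bob";
        System.out.println(found(LOOSE, twoLines));           // matches across the line break
        System.out.println(found(SAME_LINE, twoLines));       // no match
        System.out.println(found(SAME_LINE, "end abcdef123")); // intended input still matches
    }
}
```

This only addresses the line-break problem; an `end` at the tail of a longer word followed by a space on the same line would still match unless a word boundary is added.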
You can use \b (word boundary detection) to check a word boundary. In our case we will use it to match the beginning of the word end. It can also be used to match the end of a word.
As @nhahtdh stated in his comment, the {1} is redundant, as [a-zA-Z] already matches exactly one letter in the given range.
Also, your regex does not do what you want because it only matches one alphanumeric character after the first letter. Add a + at the end (for one or more times) or * (for zero or more times).
This should work:
"\\bend\\s+[a-zA-Z]{1}[a-zA-Z_0-9]*"
Edit: I think \b is better than ^ because the latter only matches at the beginning of a line.
For example, take this input: "end azd123 end bfg456". With ^ there will be only one match, whereas \b will match both.
Try the regular expression:
end[ ]+[a-zA-Z]\w+
\w is a word character: [a-zA-Z_0-9]

Using regex to match beginning and end of string [Java]

I have a list of files in a folder:
maze1.in.txt
maze2.in.txt
maze3.in.txt
I've used substring to remove the .txt extensions.
How do I use regex to match the front and the back of the file name?
I need it to match "maze" at the front and ".in" at the back, and the middle must be a digit (can be single or double digit).
I've tried the following
if (name.matches("name\\din")) {
//dosomething
}
It doesn't match anything. What is the correct regex expression to use?
I'm a little confused about what exactly you are asking for.
^(maze[0-9]*\.in)$
This will match maze(any number).in
^(maze[0-9]*\.in)\.txt$
This will match maze(any number).in.txt, with the capture group excluding the .txt, so there is no need for substring!
The thing I would be wary about right now is the capture groups... I'm not sure what you are doing with this regex. However, I believe explaining capture groups could benefit you.
A capture group is denoted by (). The text matched by each group is stored separately, so you can retrieve it after matching; it is a way to parse the input.
Example: maze1.in.txt
So if you want to capture the entire line minus .txt, I would use this: ^(maze[0-9]*\.in)\.txt$
However, if I wanted to capture things separately, I would do this: ^(maze)([0-9]*)(\.in)\.txt$ This will exclude .txt but capture maze, the number, and .in in separate groups.
Your original solution doesn't work because string "name" is not in your text. It is "maze".
You can try this
name.matches("maze\\d{1,2}\\.in")
\d{1,2} is used to match a number of one or two digits.
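As a small sketch of how this could look in a complete program: `String.matches` must match the entire string, so no explicit anchors are needed. The class and method names below, and the extra helper that reads the number straight from the full file name, are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MazeNameDemo {
    // Captures the one- or two-digit maze number from the full file name.
    private static final Pattern FULL_NAME =
            Pattern.compile("^maze(\\d{1,2})\\.in\\.txt$");

    // Checks a name whose .txt extension was already stripped.
    public static boolean isMazeInput(String name) {
        return name.matches("maze\\d{1,2}\\.in");
    }

    // Hypothetical helper: returns the maze number, or -1 if the name does not fit.
    public static int mazeNumber(String fileName) {
        Matcher m = FULL_NAME.matcher(fileName);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(isMazeInput("maze1.in"));
        System.out.println(isMazeInput("maze123.in"));
        System.out.println(mazeNumber("maze3.in.txt"));
    }
}
```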
You need regex anchors that tell the regex to
start at the beginning: ^
and signal the end of the string: $
^maze[\d]{1,2}\.in$
or in Java:
name.matches("^maze[\\d]{1,2}\\.in$");
Also, your regex had no dot (.) in it, so it could not match the examples you gave. You need to add \. to the regex to match a literal dot, because . is a special character.
It is always good to think about what you are trying to do in English before you create regular expressions.
You want to match the word maze followed by a digit, followed by a literal period ., followed by another word.
word `\w` matches a word character
digit `\d` matches a single digit
period `\.` matches a literal period
word `\w` matches a word character
putting it all together into a single string you get (keep in mind the double backslash for the Java escape and the pluses to repeat the previous match one or more times):
"\\w+\\d\\.\\w+"
The above is the generic case for any file name in the format xxx1.yyy, if you wanted to match maze and in specifically, you can just add those in as literal strings.
"maze\\d+\\.in"
example: http://ideone.com/rS7tw1
name.matches("^maze[0-9]+\\.in\\.txt$")
