I'm trying to create some general code to ease the usage of regexes, and thinking how to implement the OR function.
The title is pretty accurate (ex1,ex2,ex3 are any regular expressions). Not considering grouping, what's the difference between:
"(ex1)|(ex2)|(ex3)"
and
"[(ex1)(ex2)(ex3)]"
These should both be an OR relation between the named regexes; I might just be missing something. Is one more efficient than the other?
(ex1)|(ex2)|(ex3) matches ex1 (available in group 1), ex2 (available in group 2) or ex3 (available in group 3)
[(ex1)(ex2)(ex3)] matches (, e, x, 1, 2, 3 or )
(ex1)|(ex2)|(ex3)
Here you are capturing ex1, ex2 and ex3.
Here:
[(ex1)(ex2)(ex3)]
( and ) are treated as literal characters because they are enclosed in [ and ] (a character class), so the pattern matches a single character: (, ), e, x, 1, 2 or 3.
Note that it's equivalent to (the order is not important):
[ex123)(]
Important notes on character sets:
The caret (^) and the hyphen (-) can be included unescaped as long as they sit where they have no special meaning. To include a literal hyphen, place it at the very beginning of the character class. To match a literal caret as part of the set, do not put it first:
[^]x] matches anything that's not ] or x, whereas []^x] matches ], ^ or x
[a-z] matches all letters from a to z, whereas [-az] matches -, a or z
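The hyphen and caret rules can be checked quickly in Java. This is only a small sketch; the class name CharClassDemo and the helper full are made up for illustration, and it sticks to the hyphen and caret cases.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True when the whole input is matched by the regex.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(full("[a-z]", "m")); // hyphen between two characters forms a range
        System.out.println(full("[-az]", "m")); // leading hyphen is literal: only -, a or z
        System.out.println(full("[-az]", "-"));
        System.out.println(full("[x^]", "^"));  // caret that is not first is literal
        System.out.println(full("[^x]", "x"));  // leading caret negates the class
    }
}
```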
They're fundamentally different.
(ex1)|(ex2)|(ex3) defines a series of alternating capture groups for the literal text ex1, ex2, and ex3. That is, either ex1, if present, will be captured in the first capture group; or ex2, if present, will be captured in a second capture group; or ex3, if present, will be captured in a third group. (This would be a fairly odd expression, a more likely one would be (ex1|ex2|ex3), which matches and captures either ex1, ex2, or ex3.)
[(ex1)(ex2)(ex3)] defines a character class that will match any of the following characters (just one character): (ex1)23. There are no capture groups, the text within the [] is treated literally.
The Pattern class documentation goes into detail about how patterns work.
In the first regex, (ex1)|(ex2)|(ex3), you have three groups denoted by the parentheses (i.e. ex1, ex2, ex3), so you will get results that match whatever the ex1 regex matches, whatever the ex2 regex matches, or whatever the ex3 regex matches.
Whereas in the second, [(ex1)(ex2)(ex3)], there will be no groups, because you are using [] brackets and the parentheses will be treated as literal characters. So each match is a single character out of (, ), e, x, 1, 2 and 3, not the expression (ex1)(ex2)(ex3).
In the first case, you have 3 groups (1 to 3) each one with a sequence of characters, separated by OR
In the second case, you have a character class containing characters e, x, 1, 2, 3, (, ) and no group
The first case will match either ex1 or ex2 or ex3 and assign either to its relevant group. So, given input "ex1", it matches and will return group 1 equal to "ex1", group 2 and 3 null
Given the same input "ex1" in your second case, it will match the characters one at a time, one per successive match, and each of e, x and 1 in turn will be assigned to group 0, i.e. the whole match
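To see the difference concretely, here is a minimal Java sketch that runs both patterns on the input "ex1". The class and method names (AlternationVsClass, alternationGroups, classMatches) are hypothetical, and ex1, ex2, ex3 are taken as literal text rather than as placeholders for other regexes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationVsClass {
    // Groups 1..3 of the first match of (ex1)|(ex2)|(ex3); groups that did not participate are null.
    static String[] alternationGroups(String input) {
        Matcher m = Pattern.compile("(ex1)|(ex2)|(ex3)").matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    // Every successive match of the character class [(ex1)(ex2)(ex3)].
    static List<String> classMatches(String input) {
        Matcher m = Pattern.compile("[(ex1)(ex2)(ex3)]").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group()); // group 0, always exactly one character
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(alternationGroups("ex1")));
        System.out.println(classMatches("ex1"));
    }
}
```

With the alternation, one match covers the whole input and only group 1 is set; with the character class, the same input produces three separate one-character matches.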
First of all, in regex, [(abc)] means: match one character that is a or b or c or ( or ).
There is no grouping inside a character class (between [...]).
Your other example is a group match, which is a different thing.
"(ex1)|(ex2)|(ex3)"
If ex1 is present, it is captured by group 1; if ex2 is present, it is captured by group 2; and if ex3 is present, it is captured by group 3.
"[(ex1)(ex2)(ex3)]"
This matches a single character from the given character class. It may be ( or e or x or 1 or 2 or 3 or )
Related
I have a string with 5 pieces of data delimited by underscores:
AAA_BBB_CCC_DDD_EEE
I want a different regex for each component.
The regex needs to return just the one component.
For example, the first would return just AAA, the second for BBB, etc.
I am able to parse out AAA with the following:
^([^_]*)?
I see that I can do a look-around like this to find:
(?<=[^_]*_).*
BBB_CCC_DDD_EEE
But the following can not find just BBB
(?<=[^_]*_)[^_]*(?=_)
Mixing lookbehind and lookahead
^([^_]+)? // 1st
(?<=_)[^_]+ // 2nd
(?<=_)[^_]+(?=_[^_]+_[^_]+$) // 3rd
(?<=_)[^_]+(?=_[^_]+$) // 4th
[^_]+$ // 5th
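As a rough check, the five patterns above can be run in Java against the sample string, taking the first match of each. The class name ComponentPatterns and the method component are invented for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ComponentPatterns {
    // The five patterns listed above, in order (1st to 5th component).
    static final String[] PATTERNS = {
        "^([^_]+)?",
        "(?<=_)[^_]+",
        "(?<=_)[^_]+(?=_[^_]+_[^_]+$)",
        "(?<=_)[^_]+(?=_[^_]+$)",
        "[^_]+$"
    };

    // First match of the n-th pattern (0-based), or null when nothing matches.
    static String component(String input, int n) {
        Matcher m = Pattern.compile(PATTERNS[n]).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (int i = 0; i < PATTERNS.length; i++) {
            System.out.println(component("AAA_BBB_CCC_DDD_EEE", i));
        }
    }
}
```

Note that the 2nd pattern relies on taking only the first match; a find() loop would also return the later components.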
This works only if the lengths of the strings between the "_" are known:
1st match
^([^_]+)?
2nd match
(?<=_)\K[^_]+
3rd match
(?<=_[A-Za-z]{3}_)\K[^_]+
4th match
(?<=_[A-Za-z]{3}_[A-Za-z]{3}_)\K[^_]+
5th match
(?<=_[A-Za-z]{3}_[A-Za-z]{3}_[A-Za-z]{3}_)\K[^_]+
Each {3} expresses the length of the string between the "_".
If your string always uses underscores as delimiters, you can use a single regex that captures the value in a capturing group. Repeat the pattern for what comes before it (in this case, one or more non-underscore characters followed by an underscore) with a quantifier such as {3}, which you can change.
The quantifier specifies how many components to skip before capturing your match. For your example string AAA_BBB_CCC_DDD_EEE you could use {0}, {1}, {2}, {3} or {4}.
^(?:[^_\n]+_){3}([0-9A-Za-z]+)(?:_[^_\n]+)*$
That would match:
^ Assert position at start of the line
(?:[^_\n]+_){3} In a non-capturing group (?:, match any character that is not an underscore or a newline one or more times ([^_\n]+), followed by an underscore, and repeat that n times (in this example n is 3)
([0-9A-Za-z]+) Capture your characters in a group using, for example, a character class (or use [^_]+ to match anything that is not an underscore, but that will also match whitespace characters)
(?:_[^_\n]+)* After your captured value, repeat a non-capturing group that matches an underscore followed by one or more characters that are not an underscore or a newline; repeat that group zero or more times to get a full match
$ Assert position at the end of the line
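The pattern above can be parameterised on the quantifier. The following is a sketch in Java, assuming the input is a single line; the class NthComponent and the method nth are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NthComponent {
    // Returns the n-th underscore-delimited component (0-based), or null if the input has no such component.
    static String nth(String input, int n) {
        // Same regex as above, with the {3} quantifier replaced by {n}.
        String regex = "^(?:[^_\\n]+_){" + n + "}([0-9A-Za-z]+)(?:_[^_\\n]+)*$";
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 4; n++) {
            System.out.println(nth("AAA_BBB_CCC_DDD_EEE", n));
        }
    }
}
```

If many lookups are needed, compiling one Pattern per index up front would avoid recompiling on every call.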
How do I combine "or" and "and" in a regex?
We could write (Boo)|(l30o) and list all permutations, which basically defeats the purpose of using a regex. Here "or" is being used.
I want to match B in any form, followed by O in any form twice. Something like [(B)|(l3)][0 o O]{2}. But in this form it also matches (0O.
Matching O twice isn't a problem.
The problem is matching B when some of its forms are a single character and others are multiple characters.
Should match:
Boo
b0o
l300
I3oO
B00
etc.
All words which look like Boo, i.e., b - {B,b,l3,I3,i3} and o - {O, o, 0};
You could try (?:[bB]|[lIi]3)[0Oo]{2}:
(?:...) is a non-capturing group
[...] is a character class, i.e. any character inside it (except - depending on the position) will be assumed to be meant literally (i.e. [iIl] matches i, I or l, while [(B)|(l3)] wouldn't do what you think it does: it matches any of (, B, ), |, l or 3).
| means "or" and matches entire sequences
{...} is a numeric quantifier (i.e. {2} means exactly twice)
You could also use (?i) at the start of your expression to make it case-insensitive, i.e. the expression would then be (?i)(?:b|[li]3)[0o]{2}.
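A short Java sketch comparing the suggested pattern with the character-class attempt from the question. The names BooMatcher, looksLikeBoo and brokenMatches are made up for the example; both methods require the whole string to match.

```java
import java.util.regex.Pattern;

class BooMatcher {
    // Suggested pattern: a B-like prefix (one or two characters) followed by two O-like characters.
    static final Pattern BOO = Pattern.compile("(?:[bB]|[lIi]3)[0Oo]{2}");
    // Attempt from the question: the first bracket is a character class, so it matches a single character.
    static final Pattern BROKEN = Pattern.compile("[(B)|(l3)][0 o O]{2}");

    static boolean looksLikeBoo(String s) {
        return BOO.matcher(s).matches();
    }

    static boolean brokenMatches(String s) {
        return BROKEN.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "Boo", "b0o", "l300", "I3oO", "B00", "(0O" }) {
            System.out.println(s + " -> " + looksLikeBoo(s) + " / " + brokenMatches(s));
        }
    }
}
```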
Can you try the following
(B|b|l3|I3|i3)[0oO]{2}
You can try it online at https://regex101.com/r/gLA6N2/3
(B|b|l3|I3|i3)(O|o|0)+
() is a group
| is an or
+ is a quantifier for {1,} which means 1 or more
I am trying to validate a text field that accepts number like 10.99, 1.99, 1, 10, 21.
\d{0,2}\.\d{1,2}
Above expression is only passing values such as 10.99, 11.99,1.99, but I want something that would satisfy my requirement.
Try this:
^\d{1,2}(\.\d{1,2})?$
^ - Match the start of string
\d{1,2} - Must contain at least 1 and at most 2 digits
(\.\d{1,2}) - When a decimal point occurs, it must be a . followed by at least 1 and at most 2 digits
? - The preceding group can occur zero or one time
$ - Match the end of string
Assuming you don't want to allow edge cases like 00, and want at least 1 and at most 2 decimal places after the point mark:
^(?!00)\d\d?(\.\d\d?)?$
This requires a digit before the decimal point, i.e. ".12" would not match (you would have to enter "0.12", which is best practice).
If you're using String#matches(), you can drop the leading/trailing ^ and $, because that method must match the entire string to return true.
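Both validation patterns can be tried with String#matches(), which is why the anchors are dropped in this sketch. The class PriceValidator and its method names are assumptions made for illustration.

```java
class PriceValidator {
    // One or two digits, optionally followed by a dot and one or two digits.
    static boolean isValid(String s) {
        return s.matches("\\d{1,2}(\\.\\d{1,2})?");
    }

    // Same idea, but a leading "00" is rejected by the negative lookahead.
    static boolean isValidNoDoubleZero(String s) {
        return s.matches("(?!00)\\d\\d?(\\.\\d\\d?)?");
    }

    public static void main(String[] args) {
        for (String s : new String[] { "10.99", "1.99", "1", "10", "21", ".12", "100", "00" }) {
            System.out.println(s + " -> " + isValid(s) + " / " + isValidNoDoubleZero(s));
        }
    }
}
```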
First, \d{0,2} does not fit your requirement, because it also accepts no digits at all before the decimal point. It gives the expected result for your sample values, but an empty integer part is not something you intend to allow, so change it to \d{1,2}.
Now, in regex, ? makes the preceding element optional. You can apply it to individual elements, as below:
\d{1,2}\.?\d{0,2}
or you can apply it to the combined expression, so that the dot and the digits after it are optional together:
\d{1,2}(\.\d{1,2})?
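The two placements of ? do not accept the same inputs. A small Java sketch, with the hypothetical names OptionalPlacement, loose and strict, and assuming the whole string must match:

```java
class OptionalPlacement {
    // ? applied to individual parts: the dot and the fraction digits are optional independently.
    static boolean loose(String s) {
        return s.matches("\\d{1,2}\\.?\\d{0,2}");
    }

    // ? applied to the combined group: the dot and its digits appear together or not at all.
    static boolean strict(String s) {
        return s.matches("\\d{1,2}(\\.\\d{1,2})?");
    }

    public static void main(String[] args) {
        for (String s : new String[] { "10.99", "1", "1.", "1234" }) {
            System.out.println(s + " -> loose=" + loose(s) + ", strict=" + strict(s));
        }
    }
}
```

The loose form lets through a trailing dot and up to four digits without a dot, which the grouped form rejects.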
You can also refer below list for further queries:
abc… Letters
123… Digits
\d Any Digit
\D Any Non-digit character
. Any Character
\. Period
[abc] Only a, b, or c
[^abc] Not a, b, nor c
[a-z] Characters a to z
[0-9] Numbers 0 to 9
\w Any Alphanumeric character
\W Any Non-alphanumeric character
{m} m Repetitions
{m,n} m to n Repetitions
* Zero or more repetitions
+ One or more repetitions
? Optional character
\s Any Whitespace
\S Any Non-whitespace character
^…$ Starts and ends
(…) Capture Group
(a(bc)) Capture Sub-group
(.*) Capture all
(abc|def) Matches abc or def
Useful link : https://regexone.com/
Can you try using this :
(\d{1,2}\.\d{1,2})|(\d{1,2})
Here is a Demo, you can check also simple program
There are two parts, or two groups: one to check the float numbers #.#, #.##, ##.##, ##.# and a second to check the integers #, ##. So we can use the or operator |, i.e. float|integer.
I think patterns of this type are best handled with alternation:
/^\s*([-+]?[0-9]*\.[0-9]+([eE][-+]?[0-9]+)?)$ #float
| # or
^(\d{1,2})$ # 2 digit int/mx
Demo
I have a bunch of characters like this: A B B C D
And I have a few spaces like this: _ _ _
Is there a way to use regular expression to match any string that can be formed by "dragging" the available characters into the empty spaces?
So in the example, these are some valid matches:
A B C
A B B
B C B
D A B
But these are invalid:
A A B // Only one 'A' is available in the set
B B B // Only two 'B's are available in the set
Sorry if it has already been asked before.
vks's solution would work properly, and here it is optimised, with additions to fulfill the "_ _ _" rule:
^(?!(?:[^A]*A){2})(?!(?:[^B]*B){3})(?!(?:[^C]*C){2})(?!(?:[^D]*D){2})(?:[ABCD](?:\s|$)){3}
Here is a regex demo.
Changes from original regex:
Capturing groups are removed since we're in Java (the Java regex implementation spends time recording captured groups during matching).
The anchor ^ is moved in front for readability of the regex.
Regex explanation:
^ Asserts position at the start of the string.
(?! Negative lookahead - Asserts that our position does not match the following, without moving the pointer:
(?:[^A]*A){2} Two "A"s (literal character), with non-"A"s rolled over in an optimal way.
) Closes the group.
(?!(?:[^B]*B){3}) Same as the above group - Asserts that there are not three "B"s in the match.
(?!(?:[^C]*C){2}) Asserts that there are not two "C"s in the match.
(?!(?:[^D]*D){2}) Asserts that there are not two "D"s in the match.
(?: Non-capturing group: Matches the following without capturing.
[ABCD] Any character from the list "A", "B", "C", or "D".
(?:\s|$) A whitespace, or the end of string.
){3} Three times - Must match the sequence exactly three times to fulfill the "_ _ _" rule.
To use the regex:
boolean fulfillsRule(String str) {
    // Needs: import java.util.regex.Pattern; the regex \s is written \\s inside a Java string literal.
    Pattern tripleRule = Pattern.compile("^(?!(?:[^A]*A){2})(?!(?:[^B]*B){3})(?!(?:[^C]*C){2})(?!(?:[^D]*D){2})(?:[ABCD](?:\\s|$)){3}");
    return tripleRule.matcher(str).find();
}
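A self-contained sketch that runs this check against the examples from the question. The class name MultisetRule is made up, and the pattern is compiled once as a constant, which is a small change from the method above.

```java
import java.util.regex.Pattern;

class MultisetRule {
    // At most one A, two Bs, one C, one D; then three letters, each followed by whitespace or the end.
    static final Pattern TRIPLE_RULE = Pattern.compile(
        "^(?!(?:[^A]*A){2})(?!(?:[^B]*B){3})(?!(?:[^C]*C){2})(?!(?:[^D]*D){2})(?:[ABCD](?:\\s|$)){3}");

    static boolean fulfillsRule(String str) {
        return TRIPLE_RULE.matcher(str).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "A B C", "A B B", "B C B", "D A B", "A A B", "B B B" }) {
            System.out.println(s + " -> " + fulfillsRule(s));
        }
    }
}
```

Because the pattern has no end anchor and find() is used, a longer input such as "A B C D" is also accepted; appending $ to the pattern would be one way to tighten that if exactly three letters are required.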
(?!(.*?A){2,})(?!(.*?B){3,})(?!((.*?C){2,}))(?!((.*?D){2,}))^[ABCD]*$
You can use something like this. See the demo:
http://regex101.com/r/uH3fV3/1
Interesting problem, this is my idea:
(?m)^(?!.*([ACD]).*\1)(?!(?>.*?B){3})(?>[A-D] ){2}[A-D]$
Used (?m) MULTILINE modifier where ^ matches line start and $ line end.
Test at regexplanet (click on Java); regex101 (non Java)
If I understood it right, the available character-pot is A,B,B,C,D. A string should be valid, if it contains 0 or 1 of each [ACD] or 0-2 of B in your example. My pattern consists of 3 parts:
(?!.*([ACD]).*\1) Used at line-start ^ a negative lookahead to assure, that [ACD] occurs at most one time, by capturing [ACD] to \1 and checking, it does not occur twice anywhere.
(?!(?>.*?B){3}) Using a negative lookahead, to assure, B occurs at most 2x.
Finally, (?>[A-D] ){2}[A-D]$ determines the total usable character set and enforces the formatting, where each letter must be preceded by a space or the line start, and checks the length.
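Since this pattern uses the MULTILINE flag, it can filter several candidate lines in one pass. A minimal Java sketch, where the class MultilineMultiset and the method validLines are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineMultiset {
    // The pattern from above; \1 is written \\1 inside the Java string literal.
    static final Pattern RULE = Pattern.compile(
        "(?m)^(?!.*([ACD]).*\\1)(?!(?>.*?B){3})(?>[A-D] ){2}[A-D]$");

    // Returns the lines of the text that satisfy the rule, in order.
    static List<String> validLines(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = RULE.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(validLines("A B C\nA A B\nB C B\nB B B\nD A B"));
    }
}
```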
This can be easily modified to other needs. Also see SO Regex FAQ
Directly from this java API (ctrl + f) + "Group name":
The captured input associated with a group is always the subsequence
that the group most recently matched. If a group is evaluated a second
time because of quantification then its previously-captured value, if
any, will be retained if the second evaluation fails. Matching the
string "aba" against the expression (a(b)?)+, for example, leaves
group two set to "b". All captured input is discarded at the beginning
of each match.
I know how capturing groups work and how they work with backreference.
However, I have not got the point of the part of the API documentation quoted above. Could somebody put it in other words?
Thanks in advance.
That quote says that:
If you have used a quantifier - +, *, ? or {m,n}, on your capture group, and your group is matched more than once, then only the last match will be associated with the capture group, and all the previous matches will be overridden.
For example: if you match (a)+ against the string "aaaaaa", your capture group 1 will refer to the last a.
Now consider the case, where you have a nested capture group as in the example shown in your quote:
`(a(b)?)+`
matching this regex against the string "aba", the outer group is evaluated twice within a single match:
"ab" - Capture Group 1 = "ab" (due to the outer parentheses), Capture Group 2 = "b" (due to the inner parentheses)
"a" - Capture Group 1 = "a", Capture Group 2 matches nothing. (This is because the second capture group (b)? is optional, so the iteration still succeeds on the last a.)
So, finally, Capture Group 1 will contain "a", which overrides the earlier capture "ab", and Capture Group 2 will contain "b", which is not overridden.
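The behaviour described above can be observed directly with java.util.regex. In this sketch the class RepeatedGroupDemo and the helper groups are invented names; the helper returns group 0 followed by each capture group after a full match.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Returns group 0..n after a full match, or null when the input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Group 1 keeps its last iteration; group 2 keeps "b" from the first one.
        System.out.println(Arrays.toString(groups("(a(b)?)+", "aba")));
        // Group 1 holds only the last "a" of the six.
        System.out.println(Arrays.toString(groups("(a)+", "aaaaaa")));
    }
}
```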
Named captures or not is irrelevant in this case.
Consider this input text:
foo-bar-baz
and this regex:
[a-z]+(-[a-z]+)*
Now the question is what is captured by group 1?
As the regex progresses through the text, it first matches -bar which is then the contents of group 1; but then it goes on in the text and recognizes -baz which is now the new content of group 1.
Therefore, -bar is "lost": the regex engine has discarded it because further text in the input matched the capturing group. This is what is meant by this:
[t]he captured input associated with a group is always the subsequence that the group most recently matched [emphasis mine]
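The foo-bar-baz case, sketched in Java. The names LastCaptureDemo, lastCapture and allSegments are hypothetical; the second method is one possible way to keep every repetition, by looping with find() over the repeated part alone instead of relying on the quantified group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastCaptureDemo {
    // Group 1 after matching the whole input: only the most recent repetition survives.
    static String lastCapture(String input) {
        Matcher m = Pattern.compile("[a-z]+(-[a-z]+)*").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // Collects every "-word" segment by matching the repeated part on its own.
    static List<String> allSegments(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("-[a-z]+").matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lastCapture("foo-bar-baz"));
        System.out.println(allSegments("foo-bar-baz"));
    }
}
```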