How to match such kind of strings using and in regex? - java

How do I combine "or" and "and" in a single regex?
I could write (Boo)|(l30o) and list every permutation, but that defeats the purpose of using a regex. That approach uses only "or".
I want to match B in any form, followed by O in any form twice. Something like [(B)|(l3)][0 o O]{2}. But in this form it also matches (0O.
Matching O twice isn't a problem.
The problem is B: it has to match either a single character (B, b) or a two-character sequence (l3, I3, i3).
Should match:
Boo
b0o
l300
I3oO
B00
etc.
All words which look like Boo, i.e., b - {B,b,l3,I3,i3} and o - {O, o, 0};

You could try (?:[bB]|[lIi]3)[0Oo]{2}:
(?:...) is a non-capturing group
[...] is a character class, i.e. any character inside it (except - depending on the position) will be assumed to be meant literally (i.e. [iIl] matches i, I or l, while [(B)|(l3)] wouldn't do what you think it does: it matches any one of (, B, ), |, l or 3).
| means "or" and matches entire sequences
{...} is a numeric quantifier (i.e. {2} means exactly twice)
You could also use (?i) at the start of your expression to make it case-insensitive, i.e. the expression would then be (?i)(?:b|[li]3)[0o]{2}.
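As a rough illustration of the pattern above, here is a minimal Java sketch. The class name BooLookalike and the method looksLikeBoo are made up for this example, and it assumes the whole input should match (hence matches() rather than find()).

```java
import java.util.regex.Pattern;

class BooLookalike {
    // Alternation covers the one- or two-character "B";
    // the character class covers each single-character "o".
    static final Pattern BOO = Pattern.compile("(?:[bB]|[lIi]3)[0Oo]{2}");

    // Hypothetical helper: true only if the entire input looks like "Boo".
    static boolean looksLikeBoo(String s) {
        return BOO.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"Boo", "b0o", "l300", "I3oO", "B00", "(0O", "Bo"};
        for (String s : samples) {
            System.out.println(s + " -> " + looksLikeBoo(s));
        }
    }
}
```

The last two samples are rejected: (0O has no valid "B" part, and Bo has only one "o".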

Can you try the following
(B|b|l3|I3|i3)[0oO]{2}
You can try it online at https://regex101.com/r/gLA6N2/3

(B|b|l3|I3|i3)(O|o|0)+
() is a group
| is an or
+ is a quantifier equivalent to {1,}, which means 1 or more (so this also matches Bo and Booo; use {2} instead if exactly two are required)

Related

String.replaceAll() with regex gets messed up

I'm trying to implement the Swedish "Robbers language" in Java. It's basically just replacing each consonant with itself, followed by a "o", followed by itself again. I thought I had it working with this code
str.replaceAll("[bcdfghjklmnpqrstvwxz]+", "$0o$0");
but it fails when there are two or more subsequent consonants, for example
String str = "horse";
It should produce hohororsose, but instead I get hohorsorse. I'm guessing the replacement somehow messes up the matching indexes in the original string. How can I make it work?
str.replaceAll("[bcdfghjklmnpqrstvwxz]", "$0o$0");
Remove the + quantifier: with it, each run of consecutive consonants is matched as a single unit, so $0 refers to the whole run.
// when using a greedy quantifier
horse
h | o | rs | e
hoh | o | rsors | e
A plus sign matches one or more of the preceding character, class, or subpattern. For example a+ matches ab and aaab. But unlike a* and a?, the pattern a+ does not match at the beginning of strings that lack an "a" character.
https://autohotkey.com/docs/misc/RegEx-QuickRef.htm
+ means: Between one and unlimited times, as many times as possible, giving back as needed (greedy)
+? means: Between one and unlimited times, as few times as possible, expanding as needed (lazy)
{1} means: Exactly 1 time (meaningless quantifier)
In your case you don't need a quantifier.
You can experiment with regular expressions online at https://regex101.com/
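To see the difference side by side, here is a small sketch that runs both patterns from the question on the same input. The class and method names are invented for illustration; the regexes and the replacement string are the ones quoted above.

```java
class RobberLanguage {
    // One consonant per match: each consonant is doubled around an "o".
    static String encode(String s) {
        return s.replaceAll("[bcdfghjklmnpqrstvwxz]", "$0o$0");
    }

    // The original attempt: "+" makes a run like "rs" one match, so $0 is "rs".
    static String encodeWithPlus(String s) {
        return s.replaceAll("[bcdfghjklmnpqrstvwxz]+", "$0o$0");
    }

    public static void main(String[] args) {
        System.out.println(encode("horse"));         // hohororsose
        System.out.println(encodeWithPlus("horse")); // hohorsorse
    }
}
```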

Regex ([mb|kb|gb|b|bytes]) does not match 'b' in 'kb' or 'gb' without a + after the braces

I am writing a regular expression that can capture a value and any of mb, kb, gb, bytes that comes after it
The Regex is:
(?<sizevalue>\p{N}+)(?:\s*)(?<sizetype>[mb|kb|gb|b|bytes])
But when given the input "4096 mb", group sizetype matches only 'm' and not 'b'. Adding a '+' quantifier after the brackets makes group sizetype return 'mb'. The pattern was compiled with CASE_INSENSITIVE, so that was not the issue.
This works
(?<sizevalue>\p{N}+)(?:\s*)(?<sizetype>[mb|kb|gb|b|bytes]+)
Shouldn't the first regex match 'mb' completely?
You need to use a capturing or non-capturing group instead of a character class.
[mb|kb|gb|b|bytes] matches only a single character from the given list, i.e. it may match an m or b or | or k, etc. It won't treat mb as a single word, and the | operator inside a character class loses its special meaning and matches only a literal | symbol. It won't do an OR operation.
(?<sizevalue>\p{N}+)(?:\s*)(?<sizetype>(?:mb|kb|gb|b|bytes)\b)
Pattern p = Pattern.compile("(?<sizevalue>\\p{N}+)(?:\\s*)(?<sizetype>(?:mb|kb|gb|b|bytes)\\b)");
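A short sketch comparing the two patterns on a few inputs may help. The class name SizeParser and the helper sizeType are hypothetical; CASE_INSENSITIVE is used because the question says the pattern was compiled that way.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SizeParser {
    // Character class: sizetype can only ever be one character long.
    static final Pattern WITH_CLASS = Pattern.compile(
        "(?<sizevalue>\\p{N}+)(?:\\s*)(?<sizetype>[mb|kb|gb|b|bytes])",
        Pattern.CASE_INSENSITIVE);

    // Non-capturing group with alternation: sizetype is a whole unit name.
    static final Pattern WITH_GROUP = Pattern.compile(
        "(?<sizevalue>\\p{N}+)(?:\\s*)(?<sizetype>(?:mb|kb|gb|b|bytes)\\b)",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the sizetype group, or null if nothing matches.
    static String sizeType(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group("sizetype") : null;
    }

    public static void main(String[] args) {
        System.out.println(sizeType(WITH_CLASS, "4096 mb"));  // m
        System.out.println(sizeType(WITH_GROUP, "4096 mb"));  // mb
        System.out.println(sizeType(WITH_GROUP, "12 bytes")); // bytes
    }
}
```

The trailing \b is what lets "bytes" win over the shorter alternative "b": after matching "b" alone, the word boundary fails, so the engine backtracks and tries "bytes".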

What's the difference between "(ex1)|(ex2)|(ex3)" and "[(ex1)(ex2)(ex3)]"

I'm trying to create some general code to ease the usage of regexes, and thinking how to implement the OR function.
The title is pretty accurate (ex1,ex2,ex3 are any regular expressions). Not considering grouping, what's the difference between:
"(ex1)|(ex2)|(ex3)"
and
"[(ex1)(ex2)(ex3)]"
Both should express an "or" relation between the named regexes, but I might be missing something. Is one more efficient than the other?
(ex1)|(ex2)|(ex3) matches ex1 (available in group 1), ex2 (available in group 2) or ex3 (available in group 3)
[(ex1)(ex2)(ex3)] matches (, e, x, 1, 2, 3 or )
(ex1)|(ex2)|(ex3)
Here you are capturing ex1, ex2 and ex3.
Here:
[(ex1)(ex2)(ex3)]
( and ) are quoted and treated as is since they're enclosed in [ and ] (character classes), it matches (, ), e, x, 1, 2 and 3.
Note that it's equivalent to (the order is not important):
[ex123)(]
Important notes on character sets:
The caret (^) and the hyphen (-) can be included as is. If you want to include hyphen, you should place it in the very beginning of the character class. If you want to match the caret as a part of the character set, you should not put it as the first character:
[^]x] matches anything that's not ] and x where []^x] matches ], ^ or x
[a-z] matches all letters from a to z where [-az] matches -, a and z
They're fundamentally different.
(ex1)|(ex2)|(ex3) defines a series of alternating capture groups for the literal text ex1, ex2, and ex3. That is, either ex1, if present, will be captured in the first capture group; or ex2, if present, will be captured in a second capture group; or ex3, if present, will be captured in a third group. (This would be a fairly odd expression, a more likely one would be (ex1|ex2|ex3), which matches and captures either ex1, ex2, or ex3.)
[(ex1)(ex2)(ex3)] defines a character class that will match any of the following characters (just one character): (ex1)23. There are no capture groups, the text within the [] is treated literally.
The Pattern class documentation goes into detail about how patterns work.
In the first regex: (ex1)|(ex2)|(ex3), you are going to match three groups denoted by the parenthesis (i.e. ex1, ex2, ex3), so you will get results that will match whatever ex1 regex matches, whatever ex2 regex matches and whatever ex3 regex matches.
Whereas in the second, [(ex1)(ex2)(ex3)], there will be no groups, because you are using [] brackets and the parentheses will be treated as plain characters. So you will match any single character that appears between the brackets: (, ), e, x, 1, 2 or 3.
In the first case, you have 3 groups (1 to 3) each one with a sequence of characters, separated by OR
In the second case, you have a character class containing characters e, x, 1, 2, 3, (, ) and no group
The first case will match either ex1 or ex2 or ex3 and assign either to its relevant group. So, given input "ex1", it matches and will return group 1 equal to "ex1", group 2 and 3 null
Given the same input "ex1" in your second case, it will match all characters, one at a time, at each successive match, and each and every character e, x and 1 will be assigned to group 0, i.e. the whole match
First of all, in regex, [(abc)] means match one character: a or b or c or ( or )
There is no grouping happening in a character class (between [...]).
Your other example is a group match, which is a different thing.
"(ex1)|(ex2)|(ex3)"
If ex1 is present, it is captured by group 1; if ex2 is present, it is captured by group 2; and if ex3 is present, it is captured by group 3.
"[(ex1)(ex2)(ex3)]"
This matches a single character from the given character class. It may be ( or e or x or 1 or 2 or 3 or )
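The behaviour described in these answers can be checked with a small sketch. The class and helper names below are invented for illustration, and ex1, ex2, ex3 are taken as literal text rather than as placeholders for arbitrary sub-expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationVsClass {
    // Collects every successive match (group 0) of regex in input.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Returns the given capture group after a full match, or null if that group did not participate.
    static String capturedGroup(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("(ex1)|(ex2)|(ex3)", "ex1")); // [ex1]
        System.out.println(allMatches("[(ex1)(ex2)(ex3)]", "ex1")); // [e, x, 1]
        System.out.println(capturedGroup("(ex1)|(ex2)|(ex3)", "ex2", 1)); // null
        System.out.println(capturedGroup("(ex1)|(ex2)|(ex3)", "ex2", 2)); // ex2
    }
}
```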

RegExp to match string formed with a limited set of characters without reusing any character

I have a bunch of characters like this: A B B C D
And I have a few spaces like this: _ _ _
Is there a way to use regular expression to match any string that can be formed by "dragging" the available characters into the empty spaces?
So in the example, these are some valid matches:
A B C
A B B
B C B
D A B
But these are invalid:
A A B // Only one 'A' is available in the set
B B B // Only two 'B's are available in the set
Sorry if it has already been asked before.
vks's solution would work properly, and here's it optimised with additions to fulfill the "_ _ _" rule:
^(?!(?:[^A]*A){2})(?!(?:[^B]*B){3})(?!(?:[^C]*C){2})(?!(?:[^D]*D){2})(?:[ABCD](?:\s|$)){3}
Here is a regex demo.
Changes from original regex:
Capturing groups are replaced with non-capturing ones, since Java's regex implementation spends time recording captured groups during matching.
The anchor ^ is moved in front for readability of the regex.
Regex explanation:
^ Asserts position at the start of the string.
(?! Negative lookahead - Asserts that our position does not match the following, without moving the pointer:
(?:[^A]*A){2} Two "A"s (literal character), with non-"A"s rolled over in an optimal way.
) Closes the group.
(?!(?:[^B]*B){3}) Same as the above group - Asserts that there are not three "B"s in the match.
(?!(?:[^C]*C){2}) Asserts that there are not two "C"s in the match.
(?!(?:[^D]*D){2}) Asserts that there are not two "D"s in the match.
(?: Non-capturing group: Matches the following without capturing.
[ABCD] Any character from the list "A", "B", "C", or "D".
(?:\s|$) A whitespace, or the end of string.
){3} Three times - Must match the sequence exactly three times to fulfill the "_ _ _" rule.
To use the regex:
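import java.util.regex.Pattern;

boolean fulfillsRule(String str) {
    // Backslashes must be doubled inside a Java string literal: \\s, not \s
    Pattern tripleRule = Pattern.compile("^(?!(?:[^A]*A){2})(?!(?:[^B]*B){3})(?!(?:[^C]*C){2})(?!(?:[^D]*D){2})(?:[ABCD](?:\\s|$)){3}");
    return tripleRule.matcher(str).find();
}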
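For a quick check against the examples from the question, the method can be wrapped in a class and run as below. This is only a sketch: the class name LimitedSetRule is made up, and the pattern is compiled once as a constant rather than on every call.

```java
import java.util.regex.Pattern;

class LimitedSetRule {
    // Same regex as above; note the doubled backslash in \\s.
    static final Pattern TRIPLE_RULE = Pattern.compile(
        "^(?!(?:[^A]*A){2})(?!(?:[^B]*B){3})(?!(?:[^C]*C){2})(?!(?:[^D]*D){2})"
        + "(?:[ABCD](?:\\s|$)){3}");

    static boolean fulfillsRule(String str) {
        return TRIPLE_RULE.matcher(str).find();
    }

    public static void main(String[] args) {
        String[] samples = {"A B C", "A B B", "B C B", "D A B", "A A B", "B B B"};
        for (String s : samples) {
            System.out.println(s + " -> " + fulfillsRule(s));
        }
    }
}
```

The first four samples are accepted; A A B and B B B are rejected by the lookaheads for A and B respectively.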
(?!(.*?A){2,})(?!(.*?B){3,})(?!((.*?C){2,}))(?!((.*?D){2,}))^[ABCD]*$
You can use something like this. See demo.
http://regex101.com/r/uH3fV3/1
Interesting problem, this is my idea:
(?m)^(?!.*([ACD]).*\1)(?!(?>.*?B){3})(?>[A-D] ){2}[A-D]$
Used (?m) MULTILINE modifier where ^ matches line start and $ line end.
Test at regexplanet (click on Java); regex101 (non Java)
If I understood it right, the available character-pot is A,B,B,C,D. A string should be valid, if it contains 0 or 1 of each [ACD] or 0-2 of B in your example. My pattern consists of 3 parts:
(?!.*([ACD]).*\1) Used at line-start ^ a negative lookahead to assure, that [ACD] occurs at most one time, by capturing [ACD] to \1 and checking, it does not occur twice anywhere.
(?!(?>.*?B){3}) Using a negative lookahead, to assure, B occurs at most 2x.
Finally, (?>[A-D] ){2}[A-D]$ determines the total usable character set, enforces the formatting (each letter must be preceded by a space or the line start) and checks the length.
This can be easily modified to other needs. Also see SO Regex FAQ
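Because this pattern uses the MULTILINE flag, it can pick the valid lines out of a multi-line input in one pass. A minimal sketch, assuming one candidate per line separated by \n (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ValidLineFinder {
    // The pattern from this answer, with backslashes doubled for Java.
    static final Pattern VALID_LINE = Pattern.compile(
        "(?m)^(?!.*([ACD]).*\\1)(?!(?>.*?B){3})(?>[A-D] ){2}[A-D]$");

    // Returns every line of the input that satisfies the pattern.
    static List<String> validLines(String text) {
        List<String> lines = new ArrayList<>();
        Matcher m = VALID_LINE.matcher(text);
        while (m.find()) {
            lines.add(m.group());
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "A B C\nA A B\nB C B\nB B B\nD A B";
        System.out.println(validLines(text)); // [A B C, B C B, D A B]
    }
}
```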

Why does regex only match a string when others are present?

I have a white list of HTML end tags (br, b, i, div):-
String whitelist = "([^br|^b|^i|^div])";
String endTagPattern = "(<[ ]*/[ ]*)" + whitelist + "(>?).*?([^>]+>)";
...
html = html.replaceAll(endTagPattern, "[r]");
Which takes my test String and removes the end tags of those not in the white list, in this case replaced by [r] for clarity:-
1. <b>bold</b>, 2. <i>italic</i>, 3. <strong>strong</strong>, 4. <div>div</div>, 5. <script lang='test'>script</script>
1. <b>bold</b>, 2. <i>italic</i>, 3. <strong>strong[r], 4. <div>div</div>, 5. <script lang='test'>script[r]
If I add strong to this white list
String whitelist = "([^br|^b|^i|^div|^strong])";
Not only does it not match the strong end tag, it also stops matching that of the end script tag or any other for that matter.
My question is, why?
The reason for this is that you are using a character class. Inside a character class, the order of characters does not really matter except if you're dealing with character ranges.
So, [^br|^b|^i|^div|^strong] actually will match any character except those:
bridvstrong|^
[Note that | and ^ are there too].
You could have used [^bridvstrong|^] and it would behave the same way.
You might instead look into negative lookaheads.
String whitelist = "([^br|^b|^i|^div])";
Using [] creates a character class. I presume you wrote this so you could use ^ for "not", but a character class is inappropriate here. Inside square brackets, | does not mean "or"; it's just a literal pipe character. And writing div doesn't match the word div; it matches one of the three characters d, i, or v. Negating that means "match anything except d, i, or v".
That whitelist is effectively equivalent to [^bdirv|\^] — it matches a single character that is not b, d, i, r, v, |, or ^.
String whitelist = "(?!br|b|i|div)";
If you want to exclude certain matches, what you want is negative lookahead. Leaving out the square brackets lets you use | the way you intended, as an "or" operator.
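Here is one way the negative-lookahead idea could look in full. This is a sketch, not the asker's original endTagPattern: it uses a simplified, hypothetical end-tag regex, and it assumes the whitelisted tag names are plain letters that need no regex escaping.

```java
class EndTagFilter {
    // Replaces every end tag whose name is NOT in the whitelist with "[r]".
    // Simplified pattern (an assumption): "<", optional spaces, "/", then any
    // tag name that the lookahead does not recognise as whitelisted, then ">".
    static String markDisallowedEndTags(String html, String... whitelist) {
        String allowed = String.join("|", whitelist);
        String endTag = "<\\s*/(?!\\s*(?:" + allowed + ")\\s*>)[^>]*>";
        return html.replaceAll(endTag, "[r]");
    }

    public static void main(String[] args) {
        String html = "1. <b>bold</b>, 2. <i>italic</i>, 3. <strong>strong</strong>, "
            + "4. <div>div</div>, 5. <script lang='test'>script</script>";
        System.out.println(markDisallowedEndTags(html, "br", "b", "i", "div"));
        System.out.println(markDisallowedEndTags(html, "br", "b", "i", "div", "strong"));
    }
}
```

Including \s*> inside the lookahead means a name is excluded only when the whole tag name matches, so whitelisting b does not accidentally whitelist a longer name that merely starts with b.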
