Why does regex only match a string when others are present?

I have a white list of HTML end tags (br, b, i, div):-
String whitelist = "([^br|^b|^i|^div])";
String endTagPattern = "(<[ ]*/[ ]*)" + whitelist + "(>?).*?([^>]+>)";
...
html = html.replaceAll(endTagPattern, "[r]");
This takes my test string and replaces the end tags that are not in the white list (with [r] here, for clarity):-
1. <b>bold</b>, 2. <i>italic</i>, 3. <strong>strong</strong>, 4. <div>div</div>, 5. <script lang='test'>script</script>
1. <b>bold</b>, 2. <i>italic</i>, 3. <strong>strong[r], 4. <div>div</div>, 5. <script lang='test'>script[r]
If I add strong to this white list
String whitelist = "([^br|^b|^i|^div|^strong])";
Not only does it not match the strong end tag, it also stops matching that of the end script tag or any other for that matter.
My question is, why?

The reason for this is that you are using a character class. Inside a character class, the order of characters does not really matter except if you're dealing with character ranges.
So, [^br|^b|^i|^div|^strong] actually will match any character except those:
bridvstrong|^
[Note that | and ^ are there too].
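You could have used [^bridvstrong|^] and it would behave the same way. That is also why adding strong breaks the other tags: s, t, o, n and g join the excluded set, so the character right after </ in </strong> and </script> can no longer match.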
You might instead look into negative lookaheads.
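A minimal sketch to check the equivalence claimed above; the class and method names are made up for illustration. It compares the question's character class (with strong added) against the flattened form for every ASCII character.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
final class CharClassEquivalence {
    // The class from the question (with "strong" added) and the flattened form from the answer.
    static final Pattern ORIGINAL = Pattern.compile("[^br|^b|^i|^div|^strong]");
    static final Pattern FLATTENED = Pattern.compile("[^bridvstrong|^]");

    /** True if both classes give the same verdict for every ASCII character. */
    static boolean agreeOnAscii() {
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (ORIGINAL.matcher(s).matches() != FLATTENED.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeOnAscii());
        // 's' is excluded once "strong" is listed, so the character after "</"
        // in </script> can no longer be matched by the class.
        System.out.println(ORIGINAL.matcher("s").matches());
    }
}
```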

String whitelist = "([^br|^b|^i|^div])";
Using [] creates a character class. I presume you wrote this so you could use ^ for "not", but a character class is inappropriate here. Inside square brackets, | does not mean "or"; it's just a literal pipe character. And writing div doesn't match the word div, it matches one of the three characters, d, i, or v. Negating that means "match anything except d, i, or v".
That whitelist is effectively equivalent to [^bdirv|\^] — it matches a single character that is not b, d, i, r, v, |, or ^.
String whitelist = "(?!br|b|i|div)";
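If you want to exclude certain matches, what you want is a negative lookahead. Leaving out the square brackets lets you use | the way you intended, as an "or" operator. Note that (?!br|b|i|div) only looks at how the tag name starts, so it also spares longer names such as body; add a word boundary, as in (?!(?:br|b|i|div)\b), to compare whole names.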
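The lookahead idea can be sketched as follows. This is one possible pattern, not the asker's original endTagPattern: the class name, the method name and the simplified tail [^>]*> are assumptions, and the tag names are assumed to be plain letters.

```java
// Hypothetical helper class illustrating a whitelist via negative lookahead.
final class EndTagFilter {
    /**
     * Replaces every end tag whose name is not in {@code allowed} with "[r]".
     * Tag names are assumed to be plain letters, so they are not regex-quoted.
     */
    static String markDisallowedEndTags(String html, String... allowed) {
        String names = String.join("|", allowed);
        // The lookahead sits right after "/" and carries its own \s*, so
        // backtracking over optional spaces cannot slip past the whitelist.
        String endTag = "<\\s*/(?!\\s*(?:" + names + ")\\s*>)[^>]*>";
        return html.replaceAll(endTag, "[r]");
    }

    public static void main(String[] args) {
        String html = "1. <b>bold</b>, 2. <i>italic</i>, 3. <strong>strong</strong>, "
                + "4. <div>div</div>, 5. <script lang='test'>script</script>";
        System.out.println(markDisallowedEndTags(html, "br", "b", "i", "div"));
        System.out.println(markDisallowedEndTags(html, "br", "b", "i", "div", "strong"));
    }
}
```

Because the lookahead requires the closing > right after the name, a tag such as </bold> is still replaced even though b is whitelisted.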

Related

Regular expression for exact one character occurrence at any place of the string

I am writing Java code that finds a way out of any maze, and I need a class that checks generated mazes. Only the characters ., #, S and X are allowed.
. and # - one and more occurrences at any place of the string are allowed
S and X - one and not more than one occurrence at any place of the string is allowed.
^[#//.]+$ - regex for the first condition,
But I cannot implement the second one.
The maze input looks like this:
.......S..#.
.....###....
..X.........
. - empty space, # - wall, S - start, X - exit

You can use negative lookahead groups, written like (?!...), to accomplish this, like so:
^(?!.*S.*S)(?!.*X.*X)[SX.#]+$
Demo
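This accepts any set of characters from your set (S, X, ., #) from the start of the string using the ^ and [SX.#]+. But it rejects any string containing 2 Ss ((?!.*S.*S)) or 2 Xs ((?!.*X.*X)).
Note that this checks both of your conditions in one pass, so you don't really need 2 regexes here. As written it still accepts a string with no S or no X; add (?=.*S)(?=.*X) after the ^ if each must appear exactly once. Based on your example maze, though, it looks like your input can span multiple lines. In that case, you need to add \n inside the final character class and enable DOTALL mode ((?s)), because . does not match line breaks by default and the lookaheads would otherwise inspect only the first line.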
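A sketch of a multi-line variant of the pattern above. The class name is hypothetical, and two things are additions to the answer's regex: the (?s) flag, so that . in the lookaheads can cross line breaks, and the positive lookaheads (?=.*S)(?=.*X), which require that S and X each occur at least once.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the pattern extends the one from the answer.
final class MazeValidator {
    // (?s): '.' also matches line breaks, so the lookaheads see the whole maze.
    // (?=.*S)(?=.*X): at least one S and one X (an addition, see lead-in).
    // (?!.*S.*S)(?!.*X.*X): no second S and no second X.
    static final Pattern MAZE = Pattern.compile(
            "(?s)^(?=.*S)(?=.*X)(?!.*S.*S)(?!.*X.*X)[SX.#\\r\\n]+$");

    static boolean isValid(String maze) {
        return MAZE.matcher(maze).matches();
    }

    public static void main(String[] args) {
        String maze = ".......S..#.\n.....###....\n..X.........";
        System.out.println(isValid(maze));
        System.out.println(isValid("S.X\n.X."));
    }
}
```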

If the maze string consists of all the lines, you could use a single lookahead asserting exactly one S, then match exactly one X (or the other way around):
^(?=[^S]*S[^S]*$)[.S#\r\n]*X[.S#\r\n]*$
In Java
String regex = "^(?=[^S]*S[^S]*$)[.S#\\r\\n]*X[.S#\\r\\n]*$";
Explanation
^ Start of string
(?= Positive lookahead, assert what is on the right is a single occurrence of S
[^S]*S Match 0+ times any char without S, match S
[^S]*$ Match 0+ times any char without S, end of string
) Close lookahead
[.S#\r\n]*X Match all accepted chars including a newline without X, then match X
[.S#\r\n]* Match all accepted chars including a newline without X
$ End of string
Regex demo
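The single-lookahead pattern can be exercised with String.matches as in this short sketch; the class name and the sample inputs other than the question's maze are made up for illustration.

```java
// Hypothetical wrapper around the regex from the answer above.
final class SingleLookaheadMaze {
    // Lookahead: exactly one S in the whole input.
    // Main part: allowed characters with exactly one X in between.
    static final String REGEX = "^(?=[^S]*S[^S]*$)[.S#\\r\\n]*X[.S#\\r\\n]*$";

    static boolean isValid(String maze) {
        return maze.matches(REGEX);
    }

    public static void main(String[] args) {
        System.out.println(isValid(".......S..#.\n.....###....\n..X........."));
        System.out.println(isValid("S.X.X"));
    }
}
```

Since [^S] is a negated class, it matches line breaks as well, so no DOTALL flag is needed here.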

How to match such kind of strings using and in regex?

How can I combine "or" and "and" in a regex?
We can write (Boo)|(l30o) and list all permutations, which basically defeats the purpose of using regex. Here only "or" is being used.
I want to match B in any form, followed by O in any form twice. Something like [(B)|(l3)][0 o O]{2}. But in this form, it matches (0O too.
Matching O twice isn't a problem.
The problem is matching B when some of its forms are a single character and others are multiple characters.
Should match:
Boo
b0o
l300
I3oO
B00
etc.
All words which look like Boo, i.e., b - {B,b,l3,I3,i3} and o - {O, o, 0};

You could try (?:[bB]|[lIi]3)[0Oo]{2}:
(?:...) is a non-capturing group
[...] is a character class, i.e. any character inside it (except - depending on the position) will be assumed to be meant literally (i.e. [iIl] matches i, I or l, while [(B)|(l3)] wouldn't do what you think it does: it matches any of (, B, ), |, l or 3).
| means "or" and matches entire sequences
{...} is a numeric quantifier (i.e. {2} means exactly twice)
You could also use (?i) at the start of your expression to make it case-insensitive, i.e. the expression would then be (?i)(?:b|[li]3)[0o]{2}.
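Both expressions from this answer can be tried against the sample words with a sketch like the following; the class and field names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class wrapping the two patterns from the answer.
final class BooMatcher {
    // Explicit character classes for every accepted spelling.
    static final Pattern BOO = Pattern.compile("(?:[bB]|[lIi]3)[0Oo]{2}");
    // Case-insensitive variant; it additionally accepts an upper-case L before the 3.
    static final Pattern BOO_CI = Pattern.compile("(?i)(?:b|[li]3)[0o]{2}");

    static boolean looksLikeBoo(String word) {
        return BOO.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"Boo", "b0o", "l300", "I3oO", "B00", "(0O"}) {
            System.out.println(w + " -> " + looksLikeBoo(w));
        }
    }
}
```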

Can you try the following
(B|b|l3|I3|i3)[0oO]{2}
You can try it online at https://regex101.com/r/gLA6N2/3

(B|b|l3|I3|i3)(O|o|0)+
() is a group
| is an or
+ is a quantifier equivalent to {1,}, which means 1 or more; use {2} instead if there must be exactly two

Validating name string with dashes and singlequotes

I am trying to validate a string with the following specification:
"Non-empty string that contains only letters, dashes, or single quotes"
I'm using String.matches("[a-zA-Z|-|']*") but it's not catching the - characters correctly. For example:
Test          Result  Should Be
===============================
shouldpass    true    true
fail3         false   false
&fail         false   false
pass-pass     false   true
pass'again    true    true
-'-'-pass     false   true
So "pass-pass" and "-'-'-pass" are failing. What am I doing wrong with my regex?

You should use the following regex:
[a-zA-Z'-]+
Your regex allows a literal |, and |-| specifies a range from | to |. The hyphen must be placed at the end or beginning of the character class, or escaped in the middle, if you want to match a literal hyphen. The + quantifier at the end ensures the string is non-empty.
Another alternative is to include all Unicode letters:
[\p{L}'-]+
Java string: "[\\p{L}'-]+".
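As a quick check, the two patterns above can be run against the inputs from the question's table. The class and method names are hypothetical, and the accented sample name is an added example.

```java
// Hypothetical helper class using the two patterns from the answer.
final class NameValidator {
    /** ASCII letters, dashes and single quotes; at least one character. */
    static boolean isValidName(String s) {
        return s.matches("[a-zA-Z'-]+");
    }

    /** Same rule, but any Unicode letter is accepted. */
    static boolean isValidUnicodeName(String s) {
        return s.matches("[\\p{L}'-]+");
    }

    public static void main(String[] args) {
        String[] tests = {"shouldpass", "fail3", "&fail", "pass-pass", "pass'again", "-'-'-pass"};
        for (String t : tests) {
            System.out.println(t + " " + isValidName(t));
        }
        // "Ren\u00e9e" contains an accented letter (added example).
        System.out.println(isValidUnicodeName("Ren\u00e9e"));
    }
}
```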

Possible solution:
[a-zA-Z-']+
Problems with your regex:
If you don't want to accept empty strings, change * to + to accept one or more characters instead of zero or more.
Characters in a character class are implicitly separated by an OR operator. For instance:
the regex [abc] is equivalent to the regex a|b|c.
So the regex engine doesn't need an OR operator there, which means that | is treated as a plain pipe literal:
[a|b] represents a OR | OR b characters
You seem to know that - has a special meaning in a character class, which is to create a range of characters like a-z. This means that |-| is treated by the regex engine as the range of characters between | and | (which effectively is only one character: |), and that looks like the main problem with your regex.
To create a - literal we need to either
escape it: \-
place it where - can't be interpreted as a range. More precisely, we need to place it where it has no characters on both sides that could serve as the left and right range endpoints l-r, for example:
at the start of the character class [- ...] (no left range character)
at the end of the character class [... -] (no right range character)
right after another range, like A-Z-x: Z was already used as the end of the range A-Z, so it can't be reused in a Z-x range.
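The placements listed above can be compared side by side in a small sketch (class and field names are hypothetical). It also runs the question's original class to show that |-| contributes only the pipe character.

```java
// Hypothetical helper class comparing hyphen placements in a character class.
final class HyphenPlacement {
    // Four spellings of the same class; in each, '-' is a literal hyphen.
    static final String[] EQUIVALENT = {
        "[a-zA-Z\\-']+", // escaped
        "[-a-zA-Z']+",   // first in the class
        "[a-zA-Z'-]+",   // last in the class
        "[a-zA-Z-']+"    // directly after a completed range
    };
    // The class from the question: |-| is a range from '|' to '|'.
    static final String BROKEN = "[a-zA-Z|-|']*";

    /** True only if every spelling accepts the input. */
    static boolean allAccept(String s) {
        for (String regex : EQUIVALENT) {
            if (!s.matches(regex)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allAccept("pass-pass"));
        System.out.println("pass-pass".matches(BROKEN));
        System.out.println("a|b".matches(BROKEN));
    }
}
```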

This will work:
[a-zA-Z'-]+
Writing |-| defines a range from | to |; you just want the literal hyphen character.
Tested Here

try {
    if (subjectString.matches("(?i)([a-z'-]+)")) {
        // String matched entirely
    } else {
        // Match attempt failed
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
EXPLANATION:
(?i)([a-z'-]+)
----------
Options: Case insensitive; Exact spacing; Dot doesn't match line breaks; ^$ don't match at line breaks; Default line breaks
Match the regex below and capture its match into backreference number 1 «([a-z'-]+)»
Match a single character present in the list below «[a-z'-]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A character in the range between “a” and “z” (case insensitive) «a-z»
The literal character “'” «'»
The literal character “-” «-»

How to replace strings using java String.replaceAll() excluding some patterns?

I am using String.replaceAll to replace a forward slash / followed or preceded by a space with a comma followed by a space ", " EXCEPT in some patterns (for example n/v and n/d should not be affected)
ALL the following inputs
"nausea/vomiting"
"nausea /vomiting"
"nausea/ vomiting"
"nausea / vomiting"
Should be outputted as
nausea, vomiting
HOWEVER ALL the following inputs
"user have n/v but not other/ complications"
"user have n/d but not other / complications"
Should be outputted as follows
"user have n/v but not other, complications"
"user have n/d but not other, complications"
I have tried
String source= "nausea/vomiting";
String regex= "([^n/v])(\\s*/\\s*)";
source.replaceAll(regex, ", ");
But it cuts the a before / and gives me nause , vomiting
Does anybody know a solution?

Your first capturing group, ([^n/v]), captures any single character that is not the letter n, the letter v, or a slash (/). In this case, it's matching the a at the end of nausea and capturing it to be replaced.
You need to be a bit more clear about what you are and are not replacing here. Do you just want a comma wherever the slash is not part of an abbreviation such as n/v or n/d? You can use lookarounds to express this:
(?=asdf) is a lookahead: it consumes nothing, but when placed at the end it ensures that the text right after the match is asdf; (?!asdf) ensures that it is not. Either way the assertion is zero-width, so the text it inspects is not part of the match and will not be returned or replaced.
Also, do not forget that in Java source you must always double up any backslashes you put in string literals.

[^n/v] is a character class, and means anything except a n, / or a v.
You are probably looking for something like a negative lookbehind:
String regex= "(?<!\\bn)(\\s*/\\s*)";
This will match any of your slash and space combinations that are not preceded by just an n, and works for all your examples. You can read more on lookaround here.
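A short sketch that runs the lookbehind regex above over the inputs from the question; the class and method names are hypothetical.

```java
// Hypothetical helper class wrapping the regex from the answer.
final class SlashReplacer {
    // (?<!\bn): the slash (with its optional spaces) must not directly follow
    // a standalone "n", which leaves n/v and n/d untouched.
    static final String REGEX = "(?<!\\bn)(\\s*/\\s*)";

    static String replaceSlashes(String source) {
        // Strings are immutable, so the result of replaceAll must be used.
        return source.replaceAll(REGEX, ", ");
    }

    public static void main(String[] args) {
        System.out.println(replaceSlashes("nausea / vomiting"));
        System.out.println(replaceSlashes("user have n/v but not other/ complications"));
    }
}
```

Note that the lookbehind only inspects the n, so any abbreviation of the form n/x is left alone, not just n/v and n/d.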

Java String.split() splitting every character instead of given regular expression

I have a string that I want to split into an array:
SEQUENCE： 1A→2B→3C
I tried the following regular expression:
((.*\s)|([\x{2192}]*))
1. \x{2192} is the arrow mark
2. There is a space after the colon, I used that as a reference for matching the first part
and it works in testers(Patterns in OSX)
but it splits the string into this:
[, , 1, A, , 2, B, , 3, C]
How can I achieve the following?:
[1A,2B,3C]
This is the test code:
String str = "SEQUENCE： 1A→2B→3C"; //Note that there's an extra space after the colon
System.out.println(Arrays.toString(str.split("(.*\\s)|([\\x{2192}]*)")));

As noted in Richard Sitze's post, the main problem with the regex is that it should use + rather than *. Additionally, there are further improvements you can make to your regex:
Instead of \\x{2192}, use \u2192. And because it's a single character, you don't need to put it into a character class ([...]); you can just use \u2192+ directly.
Also, because | binds more loosely than .*\\s and \u2192+, you don't need the parentheses there either. So your final expression is simply ".*\\s|\u2192+". Note that the prefix match sits at the very start of the string, so split returns an empty first element before 1A; strip the prefix beforehand or skip that element.

The \u2192* will match 0 or more arrows - which is why you're splitting on every character (splitting on empty string). Try changing * to +.
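One way to get exactly [1A, 2B, 3C] is to remove the prefix first and then split on the arrows alone. This is a sketch under that assumption rather than the approach from either answer; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper class: strip the label, then split on arrows.
final class SequenceSplitter {
    /** Drops everything up to the last whitespace, then splits on runs of arrows. */
    static String[] splitSequence(String s) {
        return s.replaceFirst("^.*\\s", "").split("\u2192+");
    }

    public static void main(String[] args) {
        // Same text as in the question, written with Unicode escapes.
        String str = "SEQUENCE\uFF1A 1A\u21922B\u21923C";
        // Single-regex split: the match at index 0 yields a leading empty element.
        System.out.println(Arrays.toString(str.split(".*\\s|\u2192+")));
        System.out.println(Arrays.toString(splitSequence(str)));
    }
}
```

String.split keeps a leading empty string when a positive-width match occurs at the start of the input, which is why the prefix is removed separately here.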
