Why does Java-regex matches underscore? [duplicate] - java

This question already has answers here:
Java RegEx meta character (.) and ordinary dot?
(9 answers)
Closed 2 years ago.
I was trying to match the URL pattern string.string. for any number of string. using ^([^\\W_]+.)([^\\W_]+.)$ as a first attempt, and it works for matching two consecutive patterns. But then, when I generalize it to ^([^\\W_]+.)+$ stops working and matches the wrong pattern "string.str_ing.".
Do you know what is incorrect with the second version?

You need to escape your . character, else it will match any character including _.
^([^\\W_]+\.?)+$
this can be your generalised regex

With ^([^\\W_]+.)([^\\W_]+.)$ you match any two words with restricted set of characters. Although, you have not escaped the ., it still works as long as the first word is matched first string, then any literal (that's what unescaped . means) and then string again.
In the latter one the unescaped dot (.) is a part of the capturing group occurring at least once (since you use +), therefore it allows any character as a divisor. In other words string.str_ing. is understood as:
string as the 1st word
str as the 2nd word
ing as the 3rd word
... as long as the unescaped dot (.) allows any divisor (both . literally and _).
Escape the dot to make the Regex work as intented (demo):
^([^\\W_]+\.)+$

[^\W] seems a weird choice - it's matching 'not not-a-word-character'. I haven't thought it through, but that sounds like it's equivalent to \w, i.e., matching a word character.
Either way, with ^\W and \w, you're asking to match underscores - which is why it matches the string with the underscore. "Word characters" are uppercase alphabetics, lowercase alphabetics, digits, and underscore.
You probably want [a-z]+ or maybe [A-Za-z0-9]+

Related

Reg Ex strictly match word start with a pattern

I'm trying to extract a text after a sequence. But I have multiple sequences. the regex should ideally match first occurrence of any of these sequences.
my sequences are
PIN, PIN :, PIN IN, PIN IN:, PIN OUT,PIN OUT :
So I came up with the below regex
(PIN)(\sOUT|\sIN)?\:?\s*
It is doing the job except that the regex is also matching strings like
quote lupin in, pippin etc.
My question is how can I strictly select the string that match the pattern being the whole word
note: I tried ^(PIN)(\sOUT|\sON)?\:?\s* but of no use.
I'm new to java, any help is appreciated
It’s always recommended to have the documentation at hand when using regular expressions.
There, under Boundary matchers we find:
\b          A word boundary
So you may use the pattern \bPIN(\sOUT|\sIN)?:?\s* to enforce that PIN matches at the beginning of a word only, i.e. stands at the beginning of a string/line or is preceded by non-word characters like space or punctuation. A boundary only matches a position, rather than characters, so if a preceding non-word character makes this a word boundary, the character still is not part of the match.
Note that the first (…) grouping was unnecessary for the literal match PIN, further the colon : has no special meaning and doesn’t need to be escaped.

Regex-How to prevent repeated special characters?

I don't have an experience on Regular Expressions. I need to a regular expression which doesn't allow to repeat of special characters (+-*/& etc.)
The string can contain digits, alphanumerics, and special characters.
This should be valid : abc,df
This should be invalid : abc-,df
i will be really appreciated if you can help me ! Thanks for advance.
Two solutions presented so far match a string that is not allowed.
But the tilte is How to prevent..., so I assume that the regex
should match the allowed string. It means that the regex should:
match the whole string if it does not contain 2
consecutive special characters,
not match otherwise.
You can achieve this putting together the following parts:
^ - start of string anchor,
(?!.*[...]{2}) - a negative lookahead for 2 consecutive special
characters (marked here as ...), in any place,
a regex matching the whole (non-empty) string,
$ - end of string anchor.
So the whole regex should be:
^(?!.*[!##$%^&*()\-_+={}[\]|\\;:'",<.>\/?]{2}).+$
Note that within a char class (between [ and ]) a backslash
escaping the following char should be placed before - (if in
the middle of the sequence), closing square bracket,
a backslash itself and / (regex terminator).
Or if you want to apply the regex to individual words (not the whole
string), then the regex should be:
\b(?!\S*[!##$%^&*()\-_+={}[\]|\\;:'",<.>\/?]{2})\S+
[\,\+\-\*\/\&]{2,} Add more characters in the square bracket if you want.
Demo https://regex101.com/r/CBrldL/2
Use the following regex to match the invalid string.
[^A-Za-z0-9]{2,}
[^\w!\s]{2,} This would be a shortest version to match any two consecutive special characters (ignoring space)
If you want to consider space, please use [^\w]{2,}

How to Match a String Against Multiple Regex Patterns in Java

I understand how to match a single String against multiple regex patterns using the pipe symbol as explained in some of the answers to this question: Match a string against multiple regex patterns
My question is that when I have the following String:
this_isAnExample of What nav-input a-autoid-9-announce thisIsAnExampleToo
And I use the following regex to extract text:
[A-Z][a-z]*|(?<=_)[A-Za-z-]*
I am expecting to get the following matches:
is
An
Example
What
Is
An
Example
Too
But I actually get is:
isAnExample
What
Is
An
Example
Too
Basically the engine is automatically linking the word An with Example bec it matches the underscore pattern but I want it to treat them as two words (non greedy?) bec according to the other pattern there is another match.
You probably ment the regex to be
[A-Z][a-z]*|(?<=_)[a-z-]*
The first part being lowercase word starting with uppercase letter, or the second: lowercase word preceded by underscore.
The part of your posted regex (?<=_)[A-Za-z-]* matches lower and upper case letters after underscore, i.e. does not stop matching when uppercase letter met, which should be in fact start of another word.
You can use this alternation regex to capture all the lower case text that is wither preceded by _ OR mixed case text:
((?<=_)[a-z][a-z-]*|[A-Z][a-z]*)
RegEx Demo

Java - Escape star in character class in regex [duplicate]

This question already has answers here:
Including a hyphen in a regex character bracket?
(6 answers)
Closed 3 years ago.
I would like to match a line with a regular expression. The line contains two numbers which could be divided by plus, minus or a star (multiplication). However, I am not sure how to escape the star.
line.matches("[0-9]*[+-*][0-9]*");
I tried also line.matches("[0-9]*[+-\\*][0-9]*"); but it does not work either.
Should I put the star into separate group? Why does the escaping \\* not work in this case?
* is not metacharacter in character class ([...]) so you don't need to escape it at all. What you need to escape is - because inside character class it is responsible for creating range of characters like [a-z].
So instead of "[+-*]" which represents all characters placed in Unicode Table between + and * use
"[+\\-*]"
or place - where it can't be used as range indicator
at start of character class [-+*]
at end of character class [+*-]
or right after other range if you have one [a-z-+*]
BTW if you would like to add \ literal to your operators you need to write it in regex as \\ (it is metacharacter used for escaping or to access standard character classes like \w so we also need to escape it). But since \ is also special in String literals (it can be used to represent characters via \n \r \t or \uXXXX), you also need to escape it there as well. So in regex \ needs to be represented as \\ which as string literal is written as "\\\\".
BTW 2: to represent digit instead of [0-9] you can use \d (written in string literal as "\\d" since \ is special there and requires escaping).
BTW 3: if you want to make sure that there will be at least two numbers in string (one on each side of operator) you need to use + instead of * at [0-9]* since + represents one or more occurrence of previous element, while * represents zero or more occurrences.
So your code can look like
line.matches("\\d+[-+*]\\d+");

Match first occurrence of semicolon in string, only if not preceded by '--'

I'm trying to write a regular expression for Java that matches if there is a semicolon that does not have two (or more) leading '-' characters.
I'm only able to get the opposite working: A semicolon that has at least two leading '-' characters.
([\-]{2,}.*?;.*)
But I need something like
([^([\-]{2,})])*?;.*
I'm somehow not able to express 'not at least two - characters'.
Here are some examples I need to evaluate with the expression:
; -- a : should match
-- a ; : should not match
-- ; : should not match
--; : should not match
-;- : should match
---; : should not match
-- semicolon ; : should not match
bla ; bla : should match
bla : should not match (; is mandatory)
-;--; : should match (the first occuring semicolon must not have two or more consecutive leading '-')
It seems that this regex matches what you want
String regex = "[^-]*(-[^-]+)*-?;.*";
DEMO
Explanation: matches will accept string that:
[^-]* can start with non dash characters
(-[^-]+)*-?; is a bit tricky because before we will match ; we need to make sure that each - do not have another - after it so:
(-[^-]+)* each - have at least one non - character after it
-? or - was placed right before ;
;.* if earlier conditions ware fulfilled we can accept ; and any .* characters after it.
More readable version, but probably little slower
((?!--)[^;])*;.*
Explanation:
To make sure that there is ; in string we can use .*;.* in matches.
But we need to add some conditions to characters before first ;.
So to make sure that matched ; will be first one we can write such regex as
[^;]*;.*
which means:
[^;]* zero or more non semicolon characters
; first semicolon
.* zero or more of any characters (actually . can't match line separators like \n or \r)
So now all we need to do is make sure that character matched by [^;] is not part of --. To do so we can use look-around mechanisms for instance:
(?!--)[^;] before matching [^;] (?!--) checks that next two characters are not --, in other words character matched by [^;] can't be first - in series of two --
[^;](?<!--) checks if after matching [^;] regex engine will not be able to find -- if it will backtrack two positions, in other words [^;] can't be last character in series of --.
How about just splitting the string along -- and if there are two or more sub strings, checking if the last one contains a semicolon?
How about using this regex in Java:
[^;]*;(?<!--[^;]{0,999};).*
Only caveat is that it works with up to 999 character length between -- and ;
Java Regex Demo
I think this is what you're looking for:
^(?:(?!--).)*;.*$
In other words, match from the start of the string (^), zero or more characters (.*) followed by a semicolon. But replacing the dot with (?:(?!--).) causes it to match any character unless it's the beginning of a two-hyphen sequence (--).
If performance is an issue, you can exclude the semicolon as well, so it never has to backtrack:
^(?:(?!--|;).)*;.*$
EDIT: I just noticed your comment that the regex should work with the matches() method, so I padded it out with .*. The anchors aren't really necessary, but they do no harm.
You need a negative lookahead!
This regex will match any string which does not contain your original match pattern:
(?!-{2,}.*?;.*).*?;.*
This Regex matches a string which contains a semicolon, but not one occuring after 2 or more dashes.
Example:

Categories

Resources