Stop regular expression from matching across lines - java

I have a regular expression,
end\\s+[a-zA-Z]{1}[a-zA-Z_0-9]
which is supposed to match a line with the specifications
end abcdef123
where abcdef123 must start with a letter and subsequent alphanumeric characters.
However currently it is also matching this
foobar barfooend
bar fred bob
It's picking up that end at the end of barfooend and also picking up bar in effect returning end bar as a legitimate result.
I tried
^end\\s+[a-zA-Z]{1}[a-zA-Z_0-9]
but that doesn't seem to work at all. It ends up matching nothing.
It should be fairly simple but I can't seem to nut it out.

\s also matches newline characters. So you need either a character class that contains only the whitespace you want, or one that excludes the whitespace you don't want.
Instead of \\s+, use one of these:
[^\\S\r\n] this includes all whitespace but not \r and \n. See end[^\S\r\n]+[a-zA-Z][a-zA-Z_0-9]+ here on Regexr
[ \t] this includes only space and tab. See end[ \t]+[a-zA-Z][a-zA-Z_0-9]+ here on Regexr
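To see the difference, here is a minimal sketch that runs the original pattern and both suggested replacements against the input from the question. The class name HorizontalWhitespaceDemo and the helper findAll are made up for this illustration, and the trailing + on the last character class is taken from the patterns above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HorizontalWhitespaceDemo {
    // Collects every substring of input that the given regex finds.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "foobar barfooend\nbar fred bob\nend abcdef123";
        // \s+ also matches the line break, so the match spans two lines.
        System.out.println(findAll("end\\s+[a-zA-Z][a-zA-Z_0-9]+", input));
        // Space and tab only: the match can no longer cross a line break.
        System.out.println(findAll("end[ \\t]+[a-zA-Z][a-zA-Z_0-9]+", input));
        // All whitespace except \r and \n: same result for this input.
        System.out.println(findAll("end[^\\S\\r\\n]+[a-zA-Z][a-zA-Z_0-9]+", input));
    }
}
```

Note that neither replacement stops end from matching inside barfooend when a space follows it on the same line; that needs a boundary or anchor in addition.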

You can use \b (word boundary detection) to check a word boundary. In our case we will use it to match the beginning of the word end. It can also be used to match the end of a word.
As nhahtdh stated in a comment, the {1} is redundant, as [a-zA-Z] already matches exactly one letter in the given range.
Also, your regex does not do what you want because it only matches one alphanumeric character after the first letter. Add a + at the end (for one or more times) or a * (for zero or more times).
This should work:
"\\bend\\s+[a-zA-Z][a-zA-Z_0-9]*"
Edit: I think \b is better than ^ because the latter only matches at the beginning of a line.
For example, take this input: "end azd123 end bfg456". There will be only one match with ^, whereas \b will match both.
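A small sketch of that comparison, using the input from the edit above and the input from the question. The class name WordBoundaryEndDemo and the helper findAll are hypothetical names for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordBoundaryEndDemo {
    // Collects every substring of input that the given regex finds.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String twoOnOneLine = "end azd123 end bfg456";
        // \b lets "end" match wherever a word starts.
        System.out.println(findAll("\\bend\\s+[a-zA-Z][a-zA-Z_0-9]*", twoOnOneLine));
        // ^ (without MULTILINE) only allows a match at the start of the input.
        System.out.println(findAll("^end\\s+[a-zA-Z][a-zA-Z_0-9]*", twoOnOneLine));
        // The "end" inside "barfooend" is not at a word boundary, so nothing is found.
        System.out.println(findAll("\\bend\\s+[a-zA-Z][a-zA-Z_0-9]*", "foobar barfooend\nbar fred bob"));
    }
}
```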

Try the regular expression:
end[ ]+[a-zA-Z]\w+
\w is a word character: [a-zA-Z_0-9]

Related

Regular expression for allowing only 1 of a set of characters

I am trying to use some regex to validate some input inside of Java code. I have been successful in implementing "basic" regex, but this one seems to be out of my scope of knowledge. I am working through RegEgg tutorials to learn more.
Here are the conditions that need to be validated:
Field will always have 8 characters
Can be all spaces
Or
Valid characters: a-zA-Z0-9 -!& or a space
Cannot begin with a space
If one of the special characters is used, it can be the only one used
Legal: "B-123---" "AB&& &" "A!!!!!!!"
Illegal: "B-123!!!" "AB&& -" "A-&! "
Has to have at least one alphanumeric character (can't be all special characters, i.e. "!!!!!!!!")
This was my regex before additional validations were added:
^(\s{8}|[A-Za-z\-\!\&][ A-Za-z0-9\-\!\&]{7})$
Then the additional validation was added for not allowing multiple different special characters, and I am a bit stuck. I have been successful in using a positive lookahead, but I am stuck when trying to use the positive lookbehind. (I think the data before the lookbehind was consumed, but I am speculating, as I am a neophyte with this part of regex.)
Using the or construct (a|b) is a large part of this, and you've begun applying it, so that's a good start.
You've made the rule that it can't start with a digit; nothing in the spec says this. Also, - inside [] has special meaning, so escape it, or make sure it is first or last, because then you don't have to. That gets us to:
^(\s{8}|[A-Za-z0-9!& -]{8})$
Next up is the rule that it has to be all the same special character if used at all. Given that there are only 3 special characters, it could be easier to just explicitly list them all:
^(\s{8}|[A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8})$
Next up: can't start with a space, and can't be all-special. Confirming the negative (that it ISN'T all special characters) gets complicated; lookahead seems like a better plan here. Some background first:
^ is regexp-ese for "start of line". Note that this doesn't 'consume' a character. 1 is regexp-ese for 'only the exact character 1 will match here, nothing else', and when it matches it also 'consumes' that character, whereas ^ doesn't do that. 'Start of line' is not a concept that can be consumed.
This notion of 'a match may fail, but if it succeeds, nothing is consumed' isn't limited to ^ and $; you can write your own:
(?=abc) will match if abc would match at this position, but does not consume it. Thus, the regexp ^(?=abc)ab.d$ would match the input string abcd and nothing else. This is called positive lookahead (it 'looks ahead' and matches if it sees the regular expression in the parens, failing if it does not).
(?!abc) is negative lookahead. It matches if it DOESNT see the thing in the parens. (?!abc)a.c will match the input adc but not the input abc.
(?<=abc) is positive lookbehind. It matches if the pattern you provide would match such that the match ends at the position you find yourself.
(?<!abc) is negative lookbehind.
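The four constructs can be tried out with a short sketch like the following. The class name LookaroundDemo and the helper found are invented for this example, and the lookbehind inputs ("abcd", "xbcd") are illustrative strings that do not come from the question.

```java
import java.util.regex.Pattern;

final class LookaroundDemo {
    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Positive lookahead: abc must be ahead, but is not consumed.
        System.out.println(found("^(?=abc)ab.d$", "abcd"));
        System.out.println(found("^(?=abc)ab.d$", "abxd"));
        // Negative lookahead: abc must NOT be ahead.
        System.out.println(found("(?!abc)a.c", "adc"));
        System.out.println(found("(?!abc)a.c", "abc"));
        // Positive lookbehind: abc must end right before the d.
        System.out.println(found("(?<=abc)d", "abcd"));
        System.out.println(found("(?<=abc)d", "xbcd"));
        // Negative lookbehind: abc must NOT end right before the d.
        System.out.println(found("(?<!abc)d", "abcd"));
        System.out.println(found("(?<!abc)d", "xbcd"));
    }
}
```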
Note that lookahead and lookbehind can be somewhat limited, in that they may not allow variable length patterns. But, fortunately, your requirements make it easy to limit ourselves to fixed size patterns here. Thus, we can introduce: (?![&!-]{8}) as a non-consuming unit in our regexp that will fail the match if we have all-8 special characters.
We can use this trick to fail on starting space too: (?! ) is all we need for that one.
Let's replace \s, which is any whitespace, with just ' ', which is the space character (the problem description says 'space', not 'whitespace').
Putting it all together:
^( {8}|(?! )(?![&!-]{8})([A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8}))$
That's:
1. 8 spaces, or...
2. not starting with a space, and not all 8 special characters, then,
3. any of the valid chars, any amount of spaces, and any amount of one of the 3 allowed special symbols, as long as we have precisely 8 characters in total...
4. .. OR the same thing as #3 but with the second of the three special symbols
5. .. OR with the third of the three.
Plug em in at regex101 along with your various examples of 'legal' and 'not legal' and you can play around with it some more.
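If you would rather check it from Java, a minimal sketch could look like this. The class name FieldValidatorAlternation and the method isValid are hypothetical, and apart from the 8-character examples from the question ("B-123---", "A!!!!!!!", "B-123!!!", "!!!!!!!!") the test strings are made up here.

```java
import java.util.regex.Pattern;

final class FieldValidatorAlternation {
    // The combined pattern from above, written as a Java string literal.
    static final Pattern FIELD = Pattern.compile(
        "^( {8}|(?! )(?![&!-]{8})([A-Za-z0-9 -]{8}|[A-Za-z0-9 !]{8}|[A-Za-z0-9 &]{8}))$");

    // True if the whole value satisfies the pattern.
    static boolean isValid(String value) {
        return FIELD.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("B-123---")); // one kind of special character
        System.out.println(isValid("A!!!!!!!"));
        System.out.println(isValid("        ")); // eight spaces
        System.out.println(isValid("B-123!!!")); // mixes - and !
        System.out.println(isValid("!!!!!!!!")); // all special characters
        System.out.println(isValid(" B-123--")); // starts with a space
    }
}
```

One thing to be aware of: the all-special lookahead only rejects eight special characters in a row, so a value such as "!! !!!!!" (specials and spaces, no letter or digit) still passes this pattern.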
NB: You can also use backreferences to attempt to solve the 'only one special character is allowed' part of this, but attempting to tackle the 'not all special characters' part seems quite unwieldy if you don't get to use (negative) lookahead.
It's a matter of asserting the right conditions at the start of the regex.
^(?=[ ]*$|(?![ ]))(?!.*([!&-]).*(?!\1)[!&-])[a-zA-Z0-9 !&-]{8}$
see -> https://regex101.com/r/tN5y4P/1
Some discussion:
^ # Begin of text
(?= # Assert, cannot start with a space
[ ]* $ # unless it's all spaces
| (?! [ ] )
)
(?! # Assert, not mixed special chars
.*
( [!&-] ) # (1)
.*
(?! \1 )
[!&-]
)
[a-zA-Z0-9 !&-]{8} # Consume 8 valid characters from within this class
$ # End of text
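A minimal Java sketch of this variant follows. The class name FieldValidatorBackref and the method isValid are hypothetical. The compact one-line form of the pattern is used because, as far as I know, Java's Pattern.COMMENTS mode also ignores whitespace inside character classes, which would change the meaning of [ ] in the expanded form.

```java
import java.util.regex.Pattern;

final class FieldValidatorBackref {
    // Compact form of the pattern discussed above.
    static final Pattern FIELD = Pattern.compile(
        "^(?=[ ]*$|(?![ ]))(?!.*([!&-]).*(?!\\1)[!&-])[a-zA-Z0-9 !&-]{8}$");

    // True if the whole value satisfies the pattern.
    static boolean isValid(String value) {
        return FIELD.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("B-123---")); // only - is used
        System.out.println(isValid("        ")); // eight spaces
        System.out.println(isValid("B-123!!!")); // - followed later by !
        System.out.println(isValid("B!123---")); // ! followed later by -
        System.out.println(isValid(" B-123--")); // leading space, not all spaces
    }
}
```

This pattern does not check the "at least one alphanumeric character" rule, so "!!!!!!!!" is accepted by it.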

Reg Ex strictly match word start with a pattern

I'm trying to extract a text after a sequence. But I have multiple sequences. the regex should ideally match first occurrence of any of these sequences.
my sequences are
PIN, PIN :, PIN IN, PIN IN:, PIN OUT,PIN OUT :
So I came up with the below regex
(PIN)(\sOUT|\sIN)?\:?\s*
It is doing the job except that the regex is also matching strings like
quote lupin in, pippin etc.
My question is how can I strictly select the string that match the pattern being the whole word
note: I tried ^(PIN)(\sOUT|\sON)?\:?\s* but of no use.
I'm new to java, any help is appreciated
It’s always recommended to have the documentation at hand when using regular expressions.
There, under Boundary matchers we find:
\b          A word boundary
So you may use the pattern \bPIN(\sOUT|\sIN)?:?\s* to enforce that PIN matches at the beginning of a word only, i.e. stands at the beginning of a string/line or is preceded by non-word characters like space or punctuation. A boundary only matches a position, rather than characters, so if a preceding non-word character makes this a word boundary, the character still is not part of the match.
Note that the first (…) grouping was unnecessary for the literal match PIN, further the colon : has no special meaning and doesn’t need to be escaped.
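As a rough sketch, extracting the text after the prefix could look like this. The class name PinExtractor and the method textAfter are invented for the example. The CASE_INSENSITIVE flag is an assumption: the question does not show its flags, but lowercase words such as lupin could only have matched PIN with case-insensitive matching.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PinExtractor {
    // \b ensures PIN starts a word; CASE_INSENSITIVE is an assumption (see above).
    static final Pattern PREFIX =
        Pattern.compile("\\bPIN(\\sOUT|\\sIN)?:?\\s*", Pattern.CASE_INSENSITIVE);

    // Returns the text following the first matching prefix, if there is one.
    static Optional<String> textAfter(String input) {
        Matcher m = PREFIX.matcher(input);
        return m.find() ? Optional.of(input.substring(m.end())) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(textAfter("PIN OUT: 1234"));
        System.out.println(textAfter("my PIN IN:5678"));
        // "pin" inside "lupin" and "pippin" is not at a word boundary.
        System.out.println(textAfter("quote lupin in 99"));
        System.out.println(textAfter("pippin 5"));
    }
}
```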

match whole sentence with regex

I'm trying to match sentences without capital letters with regex in Java:
"Hi this is a test" -> Shouldn't match
"hi thiS is a test" -> Shouldn't match
"hi this is a test" -> Should match
I've tried the following regex, but it also matches my second example ("hi thiS is a test").
[a-z]+
It seems like it's only looking at the first word of the sentence.
Any help?
[a-z]+ will match if your string contains any lowercase letter.
If you want to make sure your string doesn't contain uppercase letters, you could use a negative character class: ^[^A-Z]+$
Be aware that this won't handle accented characters (like É) though.
To make this work, you can use Unicode properties: ^\P{Lu}+$
\P{...} matches any character that is not in the given Unicode category, and Lu is the category of uppercase letters that have a lowercase variant.
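The difference between the two patterns shows up as soon as an accented capital is involved. Here is a small sketch; the class name NoUppercaseDemo and its two method names are hypothetical, and the French test word is an illustration that is not part of the question.

```java
import java.util.regex.Pattern;

final class NoUppercaseDemo {
    static final Pattern ASCII_ONLY = Pattern.compile("^[^A-Z]+$");
    static final Pattern UNICODE_AWARE = Pattern.compile("^\\P{Lu}+$");

    // True if the text contains no ASCII uppercase letter A-Z.
    static boolean noAsciiUppercase(String text) {
        return ASCII_ONLY.matcher(text).matches();
    }

    // True if the text contains no character in Unicode category Lu.
    static boolean noUnicodeUppercase(String text) {
        return UNICODE_AWARE.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(noAsciiUppercase("hi this is a test"));
        System.out.println(noAsciiUppercase("hi thiS is a test"));
        // \u00C9 is the accented capital E: it slips past [^A-Z] but not past \P{Lu}.
        System.out.println(noAsciiUppercase("\u00C9t\u00E9"));
        System.out.println(noUnicodeUppercase("\u00C9t\u00E9"));
    }
}
```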
^[a-z ]+$
Try this. It will validate the right ones.
It's not matching because you haven't included a space in the match pattern, so your regex is only matching whole words with no spaces.
Try something like ^[a-z ]+$ instead (notice the space in the square brackets). You can also use \s, which is shorthand for 'whitespace characters', but be aware that this also includes things like line feeds and carriage returns.
This pattern does the following:
^ matches the start of a string
[a-z ]+ matches any a-z character or a space, where 1 or more exists.
$ matches the end of the string.
I would actually advise against regex in this case, since you don't seem to employ extended characters.
Instead try to test as following:
myString.equals(myString.toLowerCase());

Finding a simple pattern in a string unless escaped

I have some code that looks for a simple bold markup
private Pattern bold = Pattern.compile("\\*[^\\*]*\\*");
If someone uses: this my *bolded* text - my pattern would find "bolded"
I now need a way to use * not in the context of bolding. So I'd like to allow escaping.
E.g. this my \*non-bolded\* text - should not find any pattern.
Is there a simple way I can change my Regex to achieve this?
You need a negative lookbehind here:
(?<!\\)\*[^*]+(?<!\\)\*
In a Java string, this gives (backslash galore):
"(?<!\\\\)\\*[^*]+(?<!\\\\)\\*"
Note: the star (*) has no special meaning within a character class, therefore there is no need to escape it
Note 2: (?<!...) is a negative lookbehind; it is an anchor, which means it finds a position but consumes no text. Literally, it can be translated as: "find a position where there is no preceding text matching regex ...". Other anchors are:
^: find a position where there is no available input before (ie, can only match at the beginning of the input);
$: find a position where there is no available input after (ie, can only match at the end of the input);
(?=...): find a position where the following text matches regex ... (this is called a positive lookahead);
(?!...): find a position where the following text does not match regex ... (this is called a negative lookahead);
(?<=...): find a position where the preceding text matches regex ... (this is a positive lookbehind);
\<: find a position where the preceding input is either nothing or a character which is not a word character, and the following character is a word character (implementation dependent);
\>: find a position where the following input is either nothing or a character which is not a word character, and the preceding character is a word character (implementation dependent);
\b: either \< or \>.
Note 3: JavaScript regexes did not support lookbehinds before ES2018, and they do not support \< or \>.
Note 4: with some regex engines, it is possible to alter the meaning of ^ and $ to match positions at the beginning and end of each line instead; in Java, that is Pattern.MULTILINE; in Perl-like regex engines, that is /m.
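For illustration, here is a minimal sketch that runs the Java form of the pattern against the two examples from the question. The class name EscapedBoldDemo and the helper findBold are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapedBoldDemo {
    // Both asterisks must not be preceded by a backslash.
    static final Pattern BOLD = Pattern.compile("(?<!\\\\)\\*[^*]+(?<!\\\\)\\*");

    // Returns every bold span found, including its surrounding asterisks.
    static List<String> findBold(String text) {
        List<String> spans = new ArrayList<>();
        Matcher m = BOLD.matcher(text);
        while (m.find()) {
            spans.add(m.group());
        }
        return spans;
    }

    public static void main(String[] args) {
        System.out.println(findBold("this my *bolded* text"));
        // In Java source, each literal backslash in the input is written as \\.
        System.out.println(findBold("this my \\*non-bolded\\* text"));
    }
}
```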
This negative lookbehind based regex should work for you:
(?<!\\)\*[^*]+\*(?<!\\)
Live Demo: http://www.rubular.com/r/sobKUrkTjP
When translated to Java it will become:
(?<!\\\\)\\*[^*]+\\*(?<!\\\\)
I think the two answers until now are very interesting, but not completely correct. They don't work when a bolded text has escaped asterisk inside (I assume this is almost the main reason to escape asterisks).
For example:
My *bold \*text* here, another *bold*, more \* and *here\* and
\* end* more text
Should find three groups:
*bold \*text*
*bold*
*here\* and \* end*
With a little modification, we can do that, with this regular expression:
(?<!\\)\*([^*\\]|\\\*)+\*
can be tested here:
http://www.rubular.com/r/Jeml02HHYJ
Of course, in Java some more escaping is needed:
(?<!\\\\)\\*([^*\\\\]|\\\\\\*)+\\*
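A short sketch to check this against the example text above (written on one line here). The class name EscapedAsteriskInsideBold and the helper findBold are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapedAsteriskInsideBold {
    // Opening * must be unescaped; inside, allow any non-*/non-\ character or an escaped \*.
    static final Pattern BOLD =
        Pattern.compile("(?<!\\\\)\\*([^*\\\\]|\\\\\\*)+\\*");

    // Returns every bold span found, including its surrounding asterisks.
    static List<String> findBold(String text) {
        List<String> spans = new ArrayList<>();
        Matcher m = BOLD.matcher(text);
        while (m.find()) {
            spans.add(m.group());
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "My *bold \\*text* here, another *bold*, more \\* and "
                    + "*here\\* and \\* end* more text";
        for (String span : findBold(text)) {
            System.out.println(span);
        }
    }
}
```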

why does \B works but not \b

Wanted to match a word that ends with # like
hi hello# world#
I tried to use boundary
\b\w+#\b
and it doesn't match. I thought \b is a non word boundary but it doesn't seem so from this case
Surprisingly
\b\w+#\B
matches!
So why does \B work here and not \b? Also, why doesn't \b work in this case?
NOTE:
Yes we can use \b\w+#(?=\s|$) but I want to know why \B works in this case!
Definition of word boundary \b
Defining word boundary in words is imprecise. Let me define the word boundary with look-ahead, look-behind, and the short-hand word character class \w.
A word boundary \b is equivalent to:
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))
Which means:
Right ahead, there is (at least) a character that is a word character, and right behind, we cannot find a word character (either the character is not a word character, or it is the start of the string).
OR
Right behind, there is (at least) a character that is a word character, and right ahead, we cannot find a word character (either the character is not a word character, or it is the end of the string).
(Note how similar this is to the expansion of XOR into conjunction and disjunction)
A non-word boundary \B is equivalent to:
(?:(?<!\w)(?!\w)|(?<=\w)(?=\w))
Which means:
Right ahead and right behind, we cannot find any word character. Note that the empty string is considered a non-word boundary under this definition.
OR
Right ahead and right behind, both sides are word characters. Note that this branch requires 2 characters, i.e. cannot occur at the beginning or the end of a non-empty string.
(Note how similar this is to the expansion of XNOR into conjunction and disjunction).
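These equivalences can be checked empirically for plain ASCII input with a sketch like the one below, which lists the positions where \b and \B match and compares them with the positions of the lookaround expansions. The class name BoundaryExpansionDemo and the helper positions are invented here, and the test string is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundaryExpansionDemo {
    // Lookaround expansions of \b and \B as given above.
    static final String WORD_BOUNDARY = "(?:(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w))";
    static final String NON_WORD_BOUNDARY = "(?:(?<!\\w)(?!\\w)|(?<=\\w)(?=\\w))";

    // Returns the start index of every (zero-width) match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "hi hello# world#";
        System.out.println(positions("\\b", input));
        System.out.println(positions(WORD_BOUNDARY, input));
        System.out.println(positions("\\B", input));
        System.out.println(positions(NON_WORD_BOUNDARY, input));
    }
}
```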
Definition of word character \w
Since the definition of \b and \B depends on the definition of \w [1], you need to consult the documentation of your regex flavor to know exactly what \w matches.
[1] Most regex flavors define \b in terms of \w. Java is an exception: in default mode, \w is ASCII-only while \b is partially Unicode-aware.
In JavaScript, it would be [A-Za-z0-9_] in default mode.
In .NET, \w by default would match [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\p{Nd}\p{Pc}], and it will have the same behaviour as JavaScript if the ECMAScript option is specified. About the characters in the Pc category, you only have to know that space (ASCII 32) is not included.
Answer to the question
With the definition above, answering the question becomes easy:
"hi hello# world#"
In hello#, after # is space (U+0020, in Zs category), which is not a word character, and # is not a word character itself (in Unicode, it is in Po category). Therefore, \B can match here. The branch (?<!\w)(?!\w) is used in this case.
In world#, after # is end of string. Since # is not a word character, and we cannot find any word character ahead (there is nothing there), \B can match the empty string just after #. The branch (?<!\w)(?!\w) is also used in this case.
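The behaviour described above can be reproduced with a minimal sketch. The class name HashBoundaryDemo and the helper findAll are hypothetical, and the extra input "hello#world" is an illustration that is not part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashBoundaryDemo {
    // Collects every substring of input that the given regex finds.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "hi hello# world#";
        // After each # comes a space or the end of input: no word character on either side.
        System.out.println(findAll("\\b\\w+#\\B", input));
        System.out.println(findAll("\\b\\w+#\\b", input));
        // \b after # only matches when a word character follows directly.
        System.out.println(findAll("\\b\\w+#\\b", "hello#world"));
    }
}
```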
Addendum
Alan Moore gives quite a good summary in the comment:
I think the key point to remember is that regexes can't read. That is, they don't deal in words, only in characters. When we say \b matches the beginning or end of a word, we don't mean it identifies a word and then seeks out its endpoints, like a human would. All it can see is the character before the current position and the character after the current position. Thus, \b only indicates that the current position could be a word boundary. It's up to you to make sure the characters on either side are what they should be.
The pound symbol # is not a word character.
\b\w+#\b doesn't match world# because the position after the final # has no word character on either side, so it is not a word boundary.
\b\w+6\b, on the other hand, does match world6, because 6 is a word character and the end of the string right after it is a word boundary.
"Word Characters" are defined by: [A-Za-z0-9_].
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
— http://www.regular-expressions.info/wordboundaries.html
The # and the space are both non-word characters, so the invisible boundary between them is not a word boundary. Therefore \b will not match it and \B will match it.
