Why does \B work but not \b? - Java

I want to match a word that ends with #, like
hi hello# world#
I tried to use a word boundary:
\b\w+#\b
and it doesn't match. I thought \b is a non word boundary, but it doesn't seem so from this case.
Surprisingly,
\b\w+#\B
matches!
So why does \B work here and not \b? And why doesn't \b work in this case?
NOTE:
Yes, we can use \b\w+#(?=\s|$), but I want to know why \B works in this case!

Definition of word boundary \b
Defining a word boundary in words is imprecise. Let me define it with look-ahead, look-behind, and the shorthand word character class \w.
A word boundary \b is equivalent to:
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))
Which means:
Right ahead, there is (at least) a character that is a word character, and right behind, we cannot find a word character (either the character is not a word character, or it is the start of the string).
OR
Right behind, there is (at least) a character that is a word character, and right ahead, we cannot find a word character (either the character is not a word character, or it is the end of the string).
(Note how similar this is to the expansion of XOR into conjunction and disjunction)
A non-word boundary \B is equivalent to:
(?:(?<!\w)(?!\w)|(?<=\w)(?=\w))
Which means:
Right ahead and right behind, we cannot find any word character. Note that the empty string is considered a non-word boundary under this definition.
OR
Right ahead and right behind, both sides are word characters. Note that this branch requires 2 characters, i.e. cannot occur at the beginning or the end of a non-empty string.
(Note how similar this is to the expansion of XNOR into conjunction and disjunction).
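As a rough check of the two expansions above, the following Java sketch lists every position where \b and \B match in the question's sample string and compares them with the lookaround versions. The class and method names (BoundaryPositions, positions) are made up for this illustration, and the input is ASCII-only, so flavor-specific Unicode differences do not come into play.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryPositions {
    // Returns every index at which the (zero-width) pattern matches in the input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "hi hello# world#";
        String wordBoundary = "(?:(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w))";
        String nonWordBoundary = "(?:(?<!\\w)(?!\\w)|(?<=\\w)(?=\\w))";

        System.out.println(positions("\\b", input));            // [0, 2, 3, 8, 10, 15]
        System.out.println(positions(wordBoundary, input));     // same positions as \b
        System.out.println(positions("\\B", input));            // [1, 4, 5, 6, 7, 9, 11, 12, 13, 14, 16]
        System.out.println(positions(nonWordBoundary, input));  // same positions as \B
    }
}
```

Every position from 0 to 16 shows up in exactly one of the two lists, which is what "\B matches wherever \b does not" means in practice.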
Definition of word character \w
Since the definition of \b and \B depends on the definition of \w [1], you need to consult the specific documentation to know exactly what \w matches.
[1] Most regex flavors define \b based on \w. Well, except for Java [Point 9], where in default mode, \w is ASCII-only and \b is partially Unicode-aware.
In JavaScript, it would be [A-Za-z0-9_] in default mode.
In .NET, \w by default would match [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\p{Nd}\p{Pc}], and it has the same behaviour as JavaScript if the ECMAScript option is specified. Of the characters in the Pc category, you only have to know that space (ASCII 32) is not included.
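For Java specifically, a quick way to see what \w accepts is to test single characters with and without the Pattern.UNICODE_CHARACTER_CLASS flag. This is only a sketch: the class name WordCharCheck and the helper isWordChar are invented here, and the sample characters are arbitrary.

```java
import java.util.regex.Pattern;

public class WordCharCheck {
    // True if the single-character string is matched by \w under the given flags.
    static boolean isWordChar(String ch, int flags) {
        return Pattern.compile("\\w", flags).matcher(ch).matches();
    }

    public static void main(String[] args) {
        // "\u00e4" is a-umlaut, written as an escape to avoid source-encoding issues.
        String[] samples = {"a", "7", "_", "#", " ", "\u00e4"};
        for (String s : samples) {
            System.out.println("'" + s + "' default=" + isWordChar(s, 0)
                    + " unicode=" + isWordChar(s, Pattern.UNICODE_CHARACTER_CLASS));
        }
    }
}
```

In default mode only the ASCII letters, digits and underscore pass; the a-umlaut is accepted only once the Unicode flag is set.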
Answer to the question
With the definition above, answering the question becomes easy:
"hi hello# world#"
In hello#, after # is space (U+0020, in Zs category), which is not a word character, and # is not a word character itself (in Unicode, it is in Po category). Therefore, \B can match here. The branch (?<!\w)(?!\w) is used in this case.
In world#, after # is end of string. Since # is not a word character, and we cannot find any word character ahead (there is nothing there), \B can match the empty string just after #. The branch (?<!\w)(?!\w) is also used in this case.
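The behaviour described above can be reproduced with a few lines of Java. This is a minimal sketch; the class HashWordMatch and the helper findAll are names chosen for the example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashWordMatch {
    // Collects the text of every match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "hi hello# world#";
        // After '#' comes a space or the end of input: no word character on either side, so \b fails.
        System.out.println(findAll("\\b\\w+#\\b", input));        // []
        // The same positions satisfy \B.
        System.out.println(findAll("\\b\\w+#\\B", input));        // [hello#, world#]
        // The lookahead from the question states the intent more explicitly.
        System.out.println(findAll("\\b\\w+#(?=\\s|$)", input));  // [hello#, world#]
    }
}
```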
Addendum
Alan Moore gives quite a good summary in the comment:
I think the key point to remember is that regexes can't read. That is, they don't deal in words, only in characters. When we say \b matches the beginning or end of a word, we don't mean it identifies a word and then seeks out its endpoints, like a human would. All it can see is the character before the current position and the character after the current position. Thus, \b only indicates that the current position could be a word boundary. It's up to you to make sure the characters on either side are what they should be.

The pound symbol # is not considered a "word character".
\b\w+#\b doesn't work because \w+# is not considered a word, therefore it will not match world#.
\b\w+6\b on the other hand is, therefore it will match world6.
"Word Characters" are defined by: [A-Za-z0-9_].
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
— http://www.regular-expressions.info/wordboundaries.html

The # and the space are both non-word characters, so the invisible boundary between them is not a word boundary. Therefore \b will not match it and \B will match it.

Related

Reg Ex strictly match word start with a pattern

I'm trying to extract the text after a sequence, but I have multiple sequences. The regex should ideally match the first occurrence of any of these sequences.
My sequences are
PIN, PIN :, PIN IN, PIN IN:, PIN OUT, PIN OUT :
So I came up with the below regex
(PIN)(\sOUT|\sIN)?\:?\s*
It is doing the job except that the regex is also matching strings like
quote lupin in, pippin etc.
My question is: how can I strictly select only the strings that match the pattern as a whole word?
Note: I tried ^(PIN)(\sOUT|\sON)?\:?\s* but to no avail.
I'm new to Java, any help is appreciated.
It’s always recommended to have the documentation at hand when using regular expressions.
There, under Boundary matchers we find:
\b          A word boundary
So you may use the pattern \bPIN(\sOUT|\sIN)?:?\s* to enforce that PIN matches at the beginning of a word only, i.e. stands at the beginning of a string/line or is preceded by non-word characters like space or punctuation. A boundary only matches a position, rather than characters, so if a preceding non-word character makes this a word boundary, the character still is not part of the match.
Note that the first (…) grouping was unnecessary for the literal match PIN; furthermore, the colon : has no special meaning and doesn’t need to be escaped.
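Here is a small Java sketch of how the suggested pattern could be used to pull out the text that follows the sequence. The class PinExtractor and the method textAfterPin are hypothetical names, and the CASE_INSENSITIVE flag is an assumption on my part: the question's counterexamples (lupin, pippin) are lower case, so a case-sensitive pattern would never hit them in the first place.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PinExtractor {
    // \b requires that PIN starts a word; case-insensitivity is an assumption for this demo.
    private static final Pattern PIN =
            Pattern.compile("\\bPIN(\\sOUT|\\sIN)?:?\\s*", Pattern.CASE_INSENSITIVE);

    // Returns the text after the first matching sequence, or null if there is none.
    static String textAfterPin(String input) {
        Matcher m = PIN.matcher(input);
        return m.find() ? input.substring(m.end()) : null;
    }

    public static void main(String[] args) {
        System.out.println(textAfterPin("PIN OUT: 1234"));   // 1234
        System.out.println(textAfterPin("my pin in: 77"));   // 77
        System.out.println(textAfterPin("quote lupin in"));  // null
        System.out.println(textAfterPin("pippin 42"));       // null
    }
}
```

Without the leading \b, the last two inputs would match on the "pin" inside "lupin" and "pippin".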

Use of \b Boundary Matcher In Java

I am reading about boundary matchers in the Oracle documentation. I understand most of it, but I am not able to grasp the \b boundary matcher. Here is the example from the documentation.
To check if a pattern begins and ends on a word boundary (as opposed to a substring within a longer string), just use \b on either side; for example, \bdog\b
Enter your regex: \bdog\b
Enter input string to search: The dog plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.
To match the expression on a non-word boundary, use \B instead:
Enter your regex: \bdog\B
Enter input string to search: The dog plays in the yard.
No match found.
Enter your regex: \bdog\B
Enter input string to search: The doggie plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
In short, I am not able to understand how \b works. Can someone describe its usage and help me understand this example?
Thanks
\b is what you can call an "anchor": it will match a position in the input text.
More specifically, \b will match every position in the input text where:
there is no preceding character and the following character is a word character (any letter or digit, or an underscore);
there is no following character and the preceding character is a word character;
the preceding character is a word character and the following character is not; or
the following character is a word character and the preceding character is not.
For instance, the regex dog\b in the text "my dog eats" will match the position immediately after the g of dog (which is a word character) and before the following space (which is not).
Note that like all anchors, the fact that it matches a position means that it does not consume any input text.
Other anchors are ^, $, lookarounds.
The docs don't seem to explain what exactly a word boundary is. Let me try:
\b matches a position between characters (so it doesn't match any text itself, it just asserts that a certain condition is met at the current position in the string). That condition is defined as:
There either is a character of the character set defined by \w (alphanumerics and underscore) before the current position or after the current position, but not both.
The inverse is true for \B - it matches iff \b doesn't match at the current position.
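To see the rule in action, here is a minimal Java sketch that replays the four cases from the documentation excerpt in the question. The class DogBoundary and the helper describe are names invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DogBoundary {
    // Reports the first match of the regex in the input, or "no match".
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find()) {
            return "found \"" + m.group() + "\" at " + m.start() + "-" + m.end();
        }
        return "no match";
    }

    public static void main(String[] args) {
        String dog = "The dog plays in the yard.";
        String doggie = "The doggie plays in the yard.";
        System.out.println(describe("\\bdog\\b", dog));     // found "dog" at 4-7
        System.out.println(describe("\\bdog\\b", doggie));  // no match: 'g' follows, so no boundary after dog
        System.out.println(describe("\\bdog\\B", dog));     // no match: a space follows, so there is a boundary
        System.out.println(describe("\\bdog\\B", doggie));  // found "dog" at 4-7
    }
}
```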
\b matches the empty string at the beginning or end of a word.
The metacharacter \b is an anchor like the caret and the dollar sign.
It matches at a position that is called a "word boundary". This match is zero-length.
\B is the opposite of \b.
\B matches the empty string not at the beginning or end of a word.
For \b, if there is a 'word' char at one side of \b, there must be a not-'word' char at other side.
For \B, if there is a 'word' char at one side, there must be a 'word' char too at other side. If there is a not-'word' char at one side, there must be a not-'word' char too at other side.
The 'word' char are A-Za-z0-9 and _, others are not-word char for C locale.
Simply speaking, \b matches the position between a \w and \W (as in not \w) character,
and thus is the end or start of a Word. The end/start of String counts as \W here.
The most common \W characters you may find are:
Whitespace
Comma
Fullstop
Special Characters (§,$,%, [...])
Not the underscore (it is a word character)
Anything not ASCII (Umlauts, Cyrillic, Arabic, [...])
\B is just the inverse match of \b
--> It matches the position, that \b does not match (eg. [\w][\w] OR [\W][\W])
You can experiment with java regular expressions here

Stop regular expression from matching across lines

I have a regular expression,
end\\s+[a-zA-Z]{1}[a-zA-Z_0-9]
which is supposed to match a line with the specifications
end abcdef123
where abcdef123 must start with a letter, followed by alphanumeric characters.
However currently it is also matching this
foobar barfooend
bar fred bob
It's picking up that end at the end of barfooend and also picking up bar in effect returning end bar as a legitimate result.
I tried
^end\\s+[a-zA-Z]{1}[a-zA-Z_0-9]
but that doesn't seem to work at all. It ends up matching nothing.
It should be fairly simple but I can't seem to nut it out.
\s also includes newline characters. So you either need to specify a character class that has only the wanted whitespace characters or exclude the unwanted ones.
Use one of these instead of \\s+:
[^\\S\r\n] this includes all whitespace but not \r and \n. See end[^\S\r\n]+[a-zA-Z][a-zA-Z_0-9]+ here on Regexr
[ \t] this includes only space and tab. See end[ \t]+[a-zA-Z][a-zA-Z_0-9]+ here on Regexr
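The difference between the three whitespace classes can be checked with a short Java sketch using the question's two-line input. EndLineMatch and firstMatch are names I made up for the demo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EndLineMatch {
    // Returns the first match of the regex in the input, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String twoLines = "foobar barfooend\nbar fred bob";
        String tail = "[a-zA-Z][a-zA-Z_0-9]+";

        // \s+ happily consumes the newline, so the match spans both lines.
        System.out.println("end\nbar".equals(firstMatch("end\\s+" + tail, twoLines)));  // true
        // Both replacements refuse to cross the line break.
        System.out.println(firstMatch("end[^\\S\\r\\n]+" + tail, twoLines));            // null
        System.out.println(firstMatch("end[ \\t]+" + tail, twoLines));                  // null
        // A genuine single-line occurrence is still found.
        System.out.println(firstMatch("end[ \\t]+" + tail, "end abcdef123"));           // end abcdef123
    }
}
```

Note that these patterns still match an "end" that is the tail of a longer word when a space follows it on the same line (for example "barfooend bar").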
You can use \b (word boundary detection) to check a word boundary. In our case we will use it to match the beginning of the word end. It can also be used to match the end of a word.
As @nhahtdh stated in his comment, the {1} is redundant, as [a-zA-Z] already matches one letter in the given range.
Also your regex does not do what you want because it only matches one alphanumeric character after the first letter. Add a + at the end (for one or more times) or * (for zero or more times).
This should work:
"\\bend\\s+[a-zA-Z]{1}[a-zA-Z_0-9]*"
Edit: I think \b is better than ^ because the latter only matches at the beginning of a line.
For example, take this input: "end azd123 end bfg456". There will be only one match with ^, whereas \b will match both.
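The ^ versus \b comparison can be counted directly in Java. This is only a sketch; AnchorVsBoundary and countMatches are illustrative names, and the redundant {1} has been left out of the patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorVsBoundary {
    // Counts how many times the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "end azd123 end bfg456";
        String tail = "\\s+[a-zA-Z][a-zA-Z_0-9]*";
        System.out.println(countMatches("^end" + tail, input));             // 1
        System.out.println(countMatches("\\bend" + tail, input));           // 2
        // \b also rejects an "end" that is only the tail of a longer word.
        System.out.println(countMatches("\\bend" + tail, "barfooend bar")); // 0
    }
}
```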
Try the regular expression:
end[ ]+[a-zA-Z]\w+
\w is a word character: [a-zA-Z_0-9]

Regex matching a space a digit and 8 characters

I want to match a string containing,
a space
any number of digits
a space
1-8 characters - (alphanumeric and special characters)
example,
01 Stack
This is what I tried:
\\s\\d+\\s[^.]{1, 8} - here I tried anything except .
Try this, to catch (and restrict to) the punctuation and alphanumerics: \s\d+\s[\p{Punct}\p{Alnum}]{1,8}; wrap it all in ^...$ if you want the begin/end line anchors.
If "any number of digits" means 1 or more digit, then the pattern above is fine. If it means "zero or more digits", then the \d+ needs to become \d*.
As an aside, the pattern [^.] will match anything that's not a period. It includes a bit too much, I think, and excludes a bit too much. So I'm opting for the more specific pattern [\p{Punct}\p{Alnum}].
See documentation here.
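A short Java sketch of the suggested pattern, assuming whole-string matching and an input that really does begin with a space. The names SpaceDigitsToken, isValid and compiles are made up for the example. The compiles helper is there so you can check for yourself how your Java version treats the space inside {1, 8}; I have deliberately not stated its result.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SpaceDigitsToken {
    // A space, one or more digits, a space, then 1 to 8 punctuation or alphanumeric characters.
    static final Pattern TOKEN =
            Pattern.compile("\\s\\d+\\s[\\p{Punct}\\p{Alnum}]{1,8}");

    static boolean isValid(String s) {
        return TOKEN.matcher(s).matches();
    }

    // True if the regex compiles, false if it is rejected with a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(" 01 Stack"));              // true
        System.out.println(isValid(" 01 Stackoverflow"));      // false: more than 8 characters
        System.out.println(compiles("\\s\\d+\\s[^.]{1, 8}"));  // the quantifier with the stray space
        // Inside a character class, '.' is a literal period.
        System.out.println(Pattern.matches("[^.]", "a"));      // true
        System.out.println(Pattern.matches("[^.]", "."));      // false
    }
}
```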
Try \\s\\d+\\s[^.]{1,8}? It looks like the only problem here is a superfluous space.
Also, \\S is for everything except whitespace. [^ ] is for everything except a space. . is for everything.
I don't understand the use of [^.]. Outside a character class, . matches "any character", but inside [...] it is a literal period, so [^.] matches any character except a period (spaces included). If the goal is to exclude whitespace, match non-space characters with \\S instead.

Help with regex

I'm constructing a regex which will accept at least 1 alpha numerical character and any number of spaces.
Right now I've got...[A-Za-z0-9]+[ \t\r\n]* which I understand to be at least 1 alphanumeric OR at least 1 space. How would I fix this?
EDIT: To answer the comments below, I want it to accept strings which contain AT LEAST 1 alphanumeric AND any number of (including no) spaces. Right now it will accept JUST a whitespace.
EDIT2: To clarify, I don't want the any number of whitespace (including 0) to be accepted unless there is at least 1 alphanumeric character
\s*\p{Alnum}[\p{Alnum}\s]*
Your regex, [A-Za-z0-9]+[ \t\r\n]*, requires the string to start with a letter or digit (or, more accurately, it doesn't start matching until it sees one). Adding \s* allows the match to start with whitespace, but you still won't match any alphanumerics after the first whitespace character that follows an alphanumeric (for example, it won't match the xyz in abc xyz). Changing the trailing \s* to [\p{Alnum}\s]* fixes that problem.
On a side note, \p{Alnum} is exactly equivalent to [A-Za-z0-9] in Java, which is not the case in all regex flavors. I used \p{Alnum}, not just because it's shorter, but because it gives more protection from typos like [A-z] (which is syntactically valid, but almost certainly not what the author really meant).
EDIT: Performance should be considered, too. I originally included a + after the first \p{Alnum}, but I realized that wasn't a good idea. If this were part of a longer regex, and the regex didn't match right away, it could end up wasting a lot of time trying to match the same groups of characters with \p{Alnum}+ or [\p{Alnum}\s]*. The leading \s* is okay, though, because \s doesn't match any of the characters that \p{Alnum} matches.
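A quick way to compare the two patterns is to run both against a few inputs with whole-string matching (which is my assumption about how the regex is used). AlnumWithSpaces and accepts are names invented for this sketch.

```java
import java.util.regex.Pattern;

public class AlnumWithSpaces {
    static final Pattern ORIGINAL = Pattern.compile("[A-Za-z0-9]+[ \\t\\r\\n]*");
    static final Pattern FIXED = Pattern.compile("\\s*\\p{Alnum}[\\p{Alnum}\\s]*");

    // True if the whole string is matched by the pattern.
    static boolean accepts(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"abc xyz", " abc", "  a  ", "   ", ""};
        for (String s : inputs) {
            System.out.println("\"" + s + "\" original=" + accepts(ORIGINAL, s)
                    + " fixed=" + accepts(FIXED, s));
        }
    }
}
```

The fixed pattern accepts the first three inputs and rejects the whitespace-only and empty strings; the original rejects "abc xyz" and " abc" as whole-string matches.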
One or more word characters followed by zero or more whitespace characters:
\w+\s*
Hey, try this: ([^\s]+\s*). [^\s] means catch everything that is not whitespace, while \s* means that whitespace is optional (if you really want at least one whitespace character, put + instead of *).
Edit: sorry, mine catches everything, not only alphanumerics (put ([a-zA-Z0-9]+\s*) for alphanumerics).
This should do the trick:
\s*\p{Alnum}+\s*
\p{Alnum} is an alphanumeric character: [\p{Alpha}\p{Digit}]
* says "zero or more times"
+ says "at least one" (not "or" as you seem to believe, or is written |)
| means "or"
\s is a whitespace character: [ \t\n\x0B\f\r]
EDIT: To answer the comments below I want it to accept strings which contain AT LEAST 1 alphanumeric AND any number of (including no) spaces.
The pattern I suggested requires at least one alpha numeric character.
EDIT2: To clarify, I don't want the any number of whitespace (including 0) to be accepted unless there is at least 1 alphanumeric character
The pattern I suggested will not accept a string of only whitespace characters.
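Both claims can be checked with a few lines of Java, assuming the pattern is applied to the whole string. AlnumRunCheck and accepts are names chosen for this sketch.

```java
import java.util.regex.Pattern;

public class AlnumRunCheck {
    // Optional whitespace, one run of alphanumerics, optional whitespace.
    static final Pattern PATTERN = Pattern.compile("\\s*\\p{Alnum}+\\s*");

    static boolean accepts(String s) {
        return PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("abc"));      // true
        System.out.println(accepts("  abc  "));  // true
        System.out.println(accepts("   "));      // false: no alphanumeric character
        System.out.println(accepts(""));         // false
        System.out.println(accepts("abc xyz"));  // false: only one alphanumeric run is allowed
    }
}
```

The last case is worth noting: with whole-string matching this pattern does not allow whitespace between two alphanumeric runs.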
