Use of \b Boundary Matcher In Java - java

I am reading about boundary matchers in the Oracle documentation. I understand most of it, but I am not able to grasp the \b boundary matcher. Here is the example from the documentation.
To check if a pattern begins and ends on a word boundary (as opposed to a substring within a longer string), just use \b on either side; for example, \bdog\b
Enter your regex: \bdog\b
Enter input string to search: The dog plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.
To match the expression on a non-word boundary, use \B instead:
Enter your regex: \bdog\B
Enter input string to search: The dog plays in the yard.
No match found.
Enter your regex: \bdog\B
Enter input string to search: The doggie plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
In short, I am not able to understand how \b works. Can someone describe its usage and help me understand this example?
Thanks

\b is what you can call an "anchor": it will match a position in the input text.
More specifically, \b will match every position in the input text where:
there is no preceding character and the following character is a word character (any letter or digit, or an underscore);
there is no following character and the preceding character is a word character;
the preceding character is a word character and the following character is not; or
the following character is a word character and the preceding character is not.
For instance, the regex dog\b in the text "my dog eats" will match the position immediately after the g of dog (which is a word character) and before the following space (which is not).
Note that like all anchors, the fact that it matches a position means that it does not consume any input text.
Other anchors include ^, $, and lookarounds.
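The four cases above can be checked by listing every index at which \b matches. This is only a minimal sketch; the class name BoundaryPositions and the helper positions are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPositions {
    // Returns the start index of every match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // \b is zero-width, so every match is an empty string at some index.
        System.out.println(positions("\\b", "my dog eats")); // [0, 2, 3, 6, 7, 11]

        // dog\b consumes only "dog"; the boundary itself adds no characters.
        Matcher m = Pattern.compile("dog\\b").matcher("my dog eats");
        if (m.find()) {
            System.out.println(m.group() + " " + m.start() + "-" + m.end()); // dog 3-6
        }
    }
}
```

Index 6 is the position after the g of dog and before the space, which is the match described for dog\b above.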

The docs don't seem to explain what exactly a word boundary is. Let me try:
\b matches a position between characters (so it doesn't match any text itself, it just asserts that a certain condition is met at the current position in the string). That condition is defined as:
There either is a character of the character set defined by \w (alphanumerics and underscore) before the current position or after the current position, but not both.
The inverse is true for \B - it matches iff \b doesn't match at the current position.
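The "one side but not both" condition is an exclusive OR, which can be written out by hand. The sketch below assumes ASCII input and the default ASCII meaning of \w; XorBoundary, isWordChar and isBoundary are hypothetical names, not part of java.util.regex.

```java
class XorBoundary {
    // ASCII word character, i.e. what \w matches by default in Java.
    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_';
    }

    // True if position pos (0..s.length()) would be matched by \b.
    static boolean isBoundary(String s, int pos) {
        boolean before = pos > 0 && isWordChar(s.charAt(pos - 1));
        boolean after = pos < s.length() && isWordChar(s.charAt(pos));
        return before ^ after; // exactly one side is a word character
    }

    public static void main(String[] args) {
        String s = "The dog.";
        for (int pos = 0; pos <= s.length(); pos++) {
            // Every position is either a \b position or a \B position.
            System.out.println(pos + ": " + (isBoundary(s, pos) ? "\\b" : "\\B"));
        }
    }
}
```

Treating "no character" as "not a word character" is what makes the start and end of the input behave like the other cases.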

\b matches the empty string at the beginning or end of a word.
The metacharacter \b is an anchor like the caret and the dollar sign.
It matches at a position that is called a "word boundary". This match is zero-length.
\B is the opposite of \b.
\B matches the empty string anywhere that is not the beginning or end of a word.
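The four runs from the documentation example in the question can be reproduced with a few lines of Java. This is a sketch rather than the documentation's own test harness; DogBoundaryDemo and search are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DogBoundaryDemo {
    // Describes the first match of regex in input, or reports that there is none.
    static String search(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find()
            ? "\"" + m.group() + "\" at " + m.start() + "-" + m.end()
            : "No match found";
    }

    public static void main(String[] args) {
        // "dog" followed by a space: a word boundary follows the g.
        System.out.println(search("\\bdog\\b", "The dog plays in the yard."));    // "dog" at 4-7
        // "dog" followed by "gie": no boundary after the first g.
        System.out.println(search("\\bdog\\b", "The doggie plays in the yard.")); // No match found
        // \B demands the absence of a boundary, so the results flip.
        System.out.println(search("\\bdog\\B", "The dog plays in the yard."));    // No match found
        System.out.println(search("\\bdog\\B", "The doggie plays in the yard.")); // "dog" at 4-7
    }
}
```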

For \b, if there is a word character on one side, there must be a non-word character (or the start or end of the input) on the other side.
For \B, both sides must be of the same kind: if there is a word character on one side, there must also be a word character on the other; if there is a non-word character (or the start or end of the input) on one side, the same must hold on the other.
In the C locale, the word characters are A-Z, a-z, 0-9 and _; all others are non-word characters.

Simply speaking, \b matches the position between a \w and a \W (i.e. not \w) character,
and is thus the start or end of a word. The start and end of the string count as \W here.
The most common \W characters you may find are:
Whitespace
Comma
Fullstop
Special Characters (§,$,%, [...])
Not the underscore, though: _ is a word character (\w)
Anything not ASCII (Umlauts, Cyrillic, Arabic, [...])
\B is just the inverse of \b:
it matches every position that \b does not match (e.g. between [\w][\w] or between [\W][\W]).
You can experiment with java regular expressions here
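A consequence of the list above that often surprises people is that the underscore counts as a word character, so it does not create a boundary. The sketch below illustrates this with ASCII input only; WordCharDemo and containsWholeWord are hypothetical names.

```java
import java.util.regex.Pattern;

class WordCharDemo {
    // True if word occurs in input with a word boundary on both sides.
    static boolean containsWholeWord(String word, String input) {
        // Pattern.quote keeps any regex metacharacters in word literal.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWholeWord("dog", "dog, cat"));    // true: comma is \W
        System.out.println(containsWholeWord("dog", "(dog)"));       // true: parentheses are \W
        System.out.println(containsWholeWord("dog", "my_dog_eats")); // false: _ is \w
        System.out.println(containsWholeWord("dog", "hotdog"));      // false: t is \w
    }
}
```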

Related

Reg Ex strictly match word start with a pattern

I'm trying to extract the text that follows a sequence, but there are multiple possible sequences. The regex should ideally match the first occurrence of any of them.
My sequences are
PIN, PIN :, PIN IN, PIN IN:, PIN OUT,PIN OUT :
So I came up with the below regex
(PIN)(\sOUT|\sIN)?\:?\s*
It does the job, except that the regex also matches inside strings like
quote lupin in, pippin etc.
My question is: how can I match the pattern only when it is a whole word?
Note: I tried ^(PIN)(\sOUT|\sON)?\:?\s* but to no avail.
I'm new to Java; any help is appreciated.
It’s always recommended to have the documentation at hand when using regular expressions.
There, under Boundary matchers we find:
\b          A word boundary
So you may use the pattern \bPIN(\sOUT|\sIN)?:?\s* to enforce that PIN matches at the beginning of a word only, i.e. stands at the beginning of a string/line or is preceded by non-word characters like space or punctuation. A boundary only matches a position, rather than characters, so if a preceding non-word character makes this a word boundary, the character still is not part of the match.
Note that the first (…) grouping was unnecessary for the literal match PIN; furthermore, the colon : has no special meaning and doesn’t need to be escaped.
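A possible way to use the suggested pattern for the extraction is sketched below. The CASE_INSENSITIVE flag is an assumption (the question does not say which flags were used) that makes "lupin in" and "pippin" realistic false positives; PinExtractor and textAfterPin are made-up names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PinExtractor {
    // Assumption: case-insensitive matching, so that "pin" inside "lupin" is a candidate.
    private static final Pattern PIN_PREFIX =
        Pattern.compile("\\bPIN(\\sOUT|\\sIN)?:?\\s*", Pattern.CASE_INSENSITIVE);

    // Returns the text after the first PIN sequence, or empty if there is none.
    static Optional<String> textAfterPin(String input) {
        Matcher m = PIN_PREFIX.matcher(input);
        return m.find() ? Optional.of(input.substring(m.end())) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(textAfterPin("PIN OUT: 1234"));    // Optional[1234]
        System.out.println(textAfterPin("PIN: 42"));          // Optional[42]
        System.out.println(textAfterPin("quote lupin in 7")); // Optional.empty
        System.out.println(textAfterPin("pippin 99"));        // Optional.empty
    }
}
```

In "lupin" and "pippin" the letters before "pin" are word characters, so \b cannot match in front of it and the whole pattern fails.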

Regex that matches certain trailings of words with certain length?

How can I write a regex that matches the trailing "e"s of every word, except the trailing "e"s of words with 2 or 3 letters?
Example:
abcdeeee: Full match for "eeee"
more: Full match for "e"
pie: No match
me: No match
Use a lookaround assertion:
e+(?<=\w{4})\b
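This matches a run of e characters only if the four characters immediately before the end of that run (the e's included) are all word characters, so the word must be at least four characters long. The \b makes sure the run ends at a word boundary.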
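The four example words from the question can be run through this pattern in Java as a quick check. This is a sketch; TrailingE and trailingE are illustrative names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingE {
    // e+ grabs the run of e's, the lookbehind checks that four word characters
    // end at that point, and \b requires the run to end the word.
    private static final Pattern TRAILING_E = Pattern.compile("e+(?<=\\w{4})\\b");

    static Optional<String> trailingE(String input) {
        Matcher m = TRAILING_E.matcher(input);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(trailingE("abcdeeee")); // Optional[eeee]
        System.out.println(trailingE("more"));     // Optional[e]
        System.out.println(trailingE("pie"));      // Optional.empty
        System.out.println(trailingE("me"));       // Optional.empty
    }
}
```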

Extract string without last char if vowel

With regular expressions, how can I extract the whole word except the last character if it is a vowel?
Inputs:
ansia
bello
ansid
Expected output for each:
ansi
bell
ansid
This is what I tried, but it only works if I have a single vowel at the end:
^(.*[^aeiou])
Similar to what @Sotirios Delimanolis wrote in his comment, but using word boundaries so that it also works if you have multiple words on a line.
\b(\w+?)[aeiou]?\b
This works in the following way :
1) \b matches the start of a word. This will work for the first word on a line or a word preceded by a non-word character (a word character is any alphanumeric character or an underscore).
2) (\w+?) matches and captures the part of the word you care about.
2a) \w matches any word character.
2b) + makes the \w be matched one or more times
2c) ? makes the + match as few characters as possible. This is important because if there is a vowel at the end of the word we do not want to match it in the capturing group but instead let (3) take care of it.
3) [aeiou]? matches but does not capture a vowel character if one is present
3a) [aeiou] matches a vowel
3b) ? makes the [aeiou] be matched zero or one times
4) \b matches the end of the word. This will work for a word at the end of a line or a word followed by a non-word character.
You said that the tool you are using uses the Java regex implementation and ansid isn't working for you with my regex. I have tested it with pure Java and it seems to be working for me:
Pattern pattern = Pattern.compile("\\b(\\w+?)[aeiou]?\\b");
Matcher matcher = pattern.matcher("ansia ansid cake cat dog");
while (matcher.find()) {
    // group(1) is the word without its optional trailing vowel
    System.out.println(matcher.group(1));
}
prints
ansi
ansid
cak
cat
dog
Try the regex (\b[a-zA-Z]+?(?=[aeiou]\b))|(\b[a-zA-Z]+?[^aeiou]\b). This captures either a word ending in a consonant OR a word ending in a vowel, and omits the vowel at the end.
This pattern worked for me:
^(.*?)(?=[aeiou]$|$)
If the input can contain several words on one line, as pointed out below, use this pattern instead:
\b([a-z]+?)(?=[aeiou]\b|\b)

Stop regular expression from matching across lines

I have a regular expression,
end\\s+[a-zA-Z]{1}[a-zA-Z_0-9]
which is supposed to match a line with the specifications
end abcdef123
where abcdef123 must start with a letter, followed by alphanumeric characters.
However currently it is also matching this
foobar barfooend
bar fred bob
It's picking up that end at the end of barfooend and also picking up bar in effect returning end bar as a legitimate result.
I tried
^end\\s+[a-zA-Z]{1}[a-zA-Z_0-9]
but that doesn't seem to work at all. It ends up matching nothing.
It should be fairly simple but I can't seem to nut it out.
\s also matches newline characters. So you need a character class that either contains only the whitespace characters you want or excludes the ones you don't.
Instead of \\s+, use one of these:
[^\\S\r\n] this includes all whitespace but not \r and \n. See end[^\S\r\n]+[a-zA-Z][a-zA-Z_0-9]+ here on Regexr
[ \t] this includes only space and tab. See end[ \t]+[a-zA-Z][a-zA-Z_0-9]+ here on Regexr
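The difference between \s and the two restricted classes can be seen on the two-line input from the question. A small sketch follows; SameLineWhitespace and found are hypothetical names.

```java
import java.util.regex.Pattern;

class SameLineWhitespace {
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String twoLines = "foobar barfooend\nbar fred bob";

        // \s+ happily crosses the line break between "end" and "bar".
        System.out.println(found("end\\s+[a-zA-Z][a-zA-Z_0-9]+", twoLines));          // true
        // Whitespace except \r and \n: the line break stops the match.
        System.out.println(found("end[^\\S\\r\\n]+[a-zA-Z][a-zA-Z_0-9]+", twoLines)); // false
        // Only space and tab: same result here.
        System.out.println(found("end[ \\t]+[a-zA-Z][a-zA-Z_0-9]+", twoLines));       // false
        // The intended single-line input still matches.
        System.out.println(found("end[ \\t]+[a-zA-Z][a-zA-Z_0-9]+", "end abcdef123")); // true
    }
}
```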
You can use \b (word boundary detection) to check a word boundary. In our case we will use it to match the beginning of the word end. It can also be used to match the end of a word.
As @nhahtdh stated in his comment, the {1} is redundant, as [a-zA-Z] already matches exactly one letter in the given range.
Also your regex does not do what you want because it only matches one alphanumeric character after the first letter. Add a + at the end (for one or more times) or * (for zero or more times).
This should work:
"\\bend\\s+[a-zA-Z]{1}[a-zA-Z_0-9]*"
Edit: I think \b is better than ^ because the latter only matches at the beginning of the input (or of a line in MULTILINE mode).
For example, take this input: "end azd123 end bfg456". There will be only one match with ^, whereas \b matches both.
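The comparison between ^ and \b can be tried by counting matches. This is a minimal sketch with the redundant {1} dropped; CountMatches and count are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountMatches {
    // Counts the non-overlapping matches of regex in input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "end azd123 end bfg456";
        // ^ (without MULTILINE) anchors at the start of the input only.
        System.out.println(count("^end\\s+[a-zA-Z][a-zA-Z_0-9]*", input));   // 1
        // \b allows a match wherever "end" starts a word.
        System.out.println(count("\\bend\\s+[a-zA-Z][a-zA-Z_0-9]*", input)); // 2
        // "end" at the tail of "barfooend" does not start a word, so nothing matches.
        System.out.println(count("\\bend\\s+[a-zA-Z][a-zA-Z_0-9]*",
                                 "foobar barfooend\nbar fred bob"));          // 0
    }
}
```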
Try the regular expression:
end[ ]+[a-zA-Z]\w+
\w is a word character: [a-zA-Z_0-9]

Why does \B work but not \b

Wanted to match a word that ends with # like
hi hello# world#
I tried to use boundary
\b\w+#\b
and it doesn't match. I thought \b was a non-word boundary, but it doesn't seem so in this case.
Surprisingly
\b\w+#\B
matches!
So why does \B work here and not \b? And why doesn't \b work in this case?
NOTE:
Yes we can use \b\w+#(?=\s|$) but I want to know why \B works in this case!
Definition of word boundary \b
Defining a word boundary in words is imprecise. Let me define it with look-ahead, look-behind, and the shorthand word character class \w.
A word boundary \b is equivalent to:
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))
Which means:
Right ahead, there is (at least) a character that is a word character, and right behind, we cannot find a word character (either the character is not a word character, or it is the start of the string).
OR
Right behind, there is (at least) a character that is a word character, and right ahead, we cannot find a word character (either the character is not a word character, or it is the end of the string).
(Note how similar this is to the expansion of XOR into conjunction and disjunction)
A non-word boundary \B is equivalent to:
(?:(?<!\w)(?!\w)|(?<=\w)(?=\w))
Which means:
Right ahead and right behind, we cannot find any word character. Note that the empty string is considered a non-word boundary under this definition.
OR
Right ahead and right behind, both sides are word characters. Note that this branch requires 2 characters, i.e. cannot occur at the beginning or the end of a non-empty string.
(Note how similar this is to the expansion of XNOR into conjunction and disjunction).
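For ASCII input, the two lookaround expansions can be compared with the built-in \b and \B by collecting the match positions of each. This is only a sketch under that ASCII assumption; LookaroundBoundary and positions are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundBoundary {
    static final String WORD_BOUNDARY = "(?:(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w))";
    static final String NON_WORD_BOUNDARY = "(?:(?<!\\w)(?!\\w)|(?<=\\w)(?=\\w))";

    // Returns the index of every (zero-width) match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "hi hello# world#";
        System.out.println(positions("\\b", s));              // [0, 2, 3, 8, 10, 15]
        System.out.println(positions(WORD_BOUNDARY, s));      // [0, 2, 3, 8, 10, 15]
        System.out.println(positions("\\B", s));              // [1, 4, 5, 6, 7, 9, 11, 12, 13, 14, 16]
        System.out.println(positions(NON_WORD_BOUNDARY, s));  // [1, 4, 5, 6, 7, 9, 11, 12, 13, 14, 16]
    }
}
```

Positions 9 and 16 (right after each #) appear in the \B list, not in the \b list.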
Definition of word character \w
Since the definition of \b and \B depends on the definition of \w [1], you need to consult the specific documentation to know exactly what \w matches.
[1] Most of the regex flavors define \b based on \w. Well, except for Java [Point 9], where in default mode, \w is ASCII-only and \b is partially Unicode-aware.
In JavaScript, it would be [A-Za-z0-9_] in default mode.
In .NET, \w by default would match [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\p{Nd}\p{Pc}], and it will have the same behaviour as JavaScript if the ECMAScript option is specified. Of the characters in the Pc category, you only have to know that space (ASCII 32) is not among them.
Answer to the question
With the definition above, answering the question becomes easy:
"hi hello# world#"
In hello#, after # is space (U+0020, in Zs category), which is not a word character, and # is not a word character itself (in Unicode, it is in Po category). Therefore, \B can match here. The branch (?<!\w)(?!\w) is used in this case.
In world#, after # is end of string. Since # is not a word character, and we cannot find any word character ahead (there is nothing there), \B can match the empty string just after #. The branch (?<!\w)(?!\w) is also used in this case.
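The three patterns from the question can be run against the sample input to confirm this reasoning. A short sketch; HashSuffix and matches are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashSuffix {
    // Collects the text of every match of regex in input.
    static List<String> matches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "hi hello# world#";
        // After # comes a space or the end of input: never a word boundary.
        System.out.println(matches("\\b\\w+#\\b", s));         // []
        // \B succeeds exactly where \b fails.
        System.out.println(matches("\\b\\w+#\\B", s));         // [hello#, world#]
        // The lookahead alternative mentioned in the question.
        System.out.println(matches("\\b\\w+#(?=\\s|$)", s));   // [hello#, world#]
    }
}
```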
Addendum
Alan Moore gives quite a good summary in the comment:
I think the key point to remember is that regexes can't read. That is, they don't deal in words, only in characters. When we say \b matches the beginning or end of a word, we don't mean it identifies a word and then seeks out its endpoints, like a human would. All it can see is the character before the current position and the character after the current position. Thus, \b only indicates that the current position could be a word boundary. It's up to you to make sure the characters on either side are what they should be.
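The pound symbol # is not a word character.
\b\w+#\b doesn't work because the final \b would need a word character on exactly one side of the position after #. The # itself is a non-word character, and what follows it (a space or the end of the string) is not a word character either, so there is no word boundary there and world# is not matched.
\b\w+6\b, on the other hand, ends in the word character 6, so it will match world6.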
"Word Characters" are defined by: [A-Za-z0-9_].
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
— http://www.regular-expressions.info/wordboundaries.html
The # and the space are both non-word characters, so the invisible boundary between them is not a word boundary. Therefore \b will not match it and \B will match it.
