Positive lookbehind not behaving correctly (Java)

The code snippet for the positive lookbehind is below:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PositiveLookBehind {
    public static void main(String[] args) {
        String regex = "[a-z](?<=9)";
        String input = "a9es m9x us9s w9es";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);
        System.out.println("===starting====");
        while (matcher.find()) {
            System.out.println("found:" + matcher.group()
                    + " start index:" + matcher.start()
                    + " end index is " + matcher.end());
        }
        System.out.println("===ending=====");
    }
}
I was expecting 4 matches, but to my surprise the output shows no match.
Can anyone point out my mistake?
As far as I understand, the regex here means a letter preceded by the digit 9, which is satisfied at 4 locations.

Problem
Notice that (?<=9) is placed after [a-z]. What does that mean?
Let's consider data like "a9c".
At the start, the regex engine places its "cursor" at the beginning of the string it iterates over, here:
|a9c
^-regex cursor is here
Then the regex engine tries to match each part of the pattern from left to right. So for [a-z](?<=9) it first tries to find a match for [a-z], and only after finding one does it move on to evaluating the (?<=9) part.
So the match for [a-z] happens here:
a9c
*<-- match for `[a-z]`
After that match, the engine moves the cursor here:
a|9c
*^--- regex-engine cursor
^---- match for [a-z]
So now (?<=9) is evaluated (notice the position of the cursor |). (?<=subregex) checks whether the text immediately before the cursor can be matched by subregex. Here the cursor is directly after a, so the look-behind "sees" that a as the text its subexpression has to match. Since a cannot be matched by 9, the evaluation fails.
Solution(s)
You probably wanted to check whether 9 is placed before an acceptable letter. You can modify your regex in several ways to achieve that:
with [a-z](?<=9.) you make the look-behind test the two previous characters
a9c|
^^
9. - `9` matches 9, `.` matches any character (the one directly before the cursor)
or, more simply, use (?<=9)[a-z] to first check for 9 and then look for [a-z], which lets the regex match c when the cursor is at 9|c.
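As a quick check of the explanation above, here is a minimal sketch that runs the original pattern and both suggested variants against the input from the question. The class name LookbehindCheck and the helper findAll are made up for this illustration; they are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this illustration.
class LookbehindCheck {
    // Collects every match of regex in input as "text@startIndex".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group() + "@" + matcher.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "a9es m9x us9s w9es";
        // Look-behind placed after the letter: it inspects the letter itself, so nothing matches.
        System.out.println(findAll("[a-z](?<=9)", input));
        // Look-behind placed before the letter: the letter must directly follow a 9.
        System.out.println(findAll("(?<=9)[a-z]", input));
        // Look-behind after the letter, but two characters wide: the 9 plus the letter.
        System.out.println(findAll("[a-z](?<=9.)", input));
    }
}
```

The first call prints an empty list, while the other two each report the four letters that follow a 9.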

Your current pattern, [a-z](?<=9), means: match a lowercase letter and assert that the position right after the letter is preceded by 9, which is a contradiction.
If you want to match a letter preceded by 9, use (?<=9)[a-z], which means: assert that what precedes is 9 and, if so, match a lowercase letter.

Related

How to match all numbers in a string that are not preceded by an English letter, using a regex in Java

I have a String like
String str = "305556710S or 100596269C OR CN111111111";
I just want to match the tokens in this string that start with digits, optionally ending with English letters,
and then prefix each match with two "?" characters ("??").
I wrote a Pattern like
Pattern pattern = Pattern.compile("^[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    String matchStr = matcher.group();
    System.err.println(matchStr);
}
But it only matches the first token, "305556710S".
If I modify the Pattern to
Pattern pattern = Pattern.compile("[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
it matches "305556710S", "100596269C" and "111111111". But "111111111" is prefixed by the English letters "CN", which is not what I want.
I only want to match "305556710S" and "100596269C" and add "??" before each match. Can somebody help me?
First, you should avoid the ^ in this particular regexp. As you noticed, you can't get more than one result, because "^" means "match at the beginning of the string".
Using \b can be a solution, but you may get invalid results. For example, take the input
305556710S or -100596269C OR CN111111111
The regexp "\\b[0-9]{1,10}[A-Z]{0,}\\b" will match 100596269C here (because the hyphen is not a word character, so there is a word boundary between - and 1).
The following regexp matches exactly what you want: all numbers, that may be followed by some English chars, either at the beginning of the string or after a space, and either followed by a space or at the end of the string.
(?<=^| )[0-9]{1,10}[A-Z]*(?= |$)
Explanations:
(?<=^| ) is a lookbehind. It makes sure that there is either ^ (string start) or a space behind the current position. Note that lookbehinds don't add the matched chars to the result: the space won't be part of the result.
[0-9]{1,10}[A-Z]* matches digits (at least one, up to ten), then zero or more letters.
(?= |$) is a lookahead. It makes sure that there is either a space or $ (end of string) after this match. As with lookbehinds, the chars aren't added to the result and the position remains the same: the space read here, for example, can also be read by the lookbehind of the next match.
Examples: 305556710S or 100596269C OR CN111111111 matches at index 0 [305556710S] and at index 14 [100596269C]; 100596269C123 does not match.
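To tie this back to the original goal of prefixing each match with "??", here is a minimal sketch built on the lookaround pattern above. The class name NumberPrefixer and the method addPrefix are hypothetical names chosen for this example; the CASE_INSENSITIVE flag is carried over from the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this illustration.
class NumberPrefixer {
    // 1 to 10 digits, then optional letters, delimited by spaces or the ends of the string.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<=^| )[0-9]{1,10}[A-Z]*(?= |$)", Pattern.CASE_INSENSITIVE);

    // Inserts "??" in front of every matched token; $0 refers to the whole match.
    static String addPrefix(String input) {
        return TOKEN.matcher(input).replaceAll("??$0");
    }

    public static void main(String[] args) {
        // prints: ??305556710S or ??100596269C OR CN111111111
        System.out.println(addPrefix("305556710S or 100596269C OR CN111111111"));
        // The hyphenated token is left alone, unlike with the \b variant:
        // prints: ??305556710S or -100596269C OR CN111111111
        System.out.println(addPrefix("305556710S or -100596269C OR CN111111111"));
    }
}
```

Because the lookarounds consume nothing, the surrounding spaces stay in place and only the "??" is added.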
I think you need to use word boundaries \b. Try this changed pattern:
"\\b[0-9]{1,10}[A-Z]{0,1}\\b"
This prints out:
305556710S
100596269C
Why it works:
The difference is that only character sequences delimited by a pair of word boundaries are checked. With the earlier pattern, a character sequence from the middle of a word could also match, which is why 11111... from CN1111... matched.
A word boundary also matches at the end of the input string, so a candidate word at the end of the line still gets picked up.
If more than one English letter can come at the end, remove the maximum occurrence limit, 1 in this case:
"\\b[0-9]{1,10}[A-Z]{0,}\\b"

Replace repeated letters in a word with exception

I would like a regex (in Java) that replaces every repeated consonant with a single letter: all repeated consonants except the "nn" of an initial "inn".
Some examples to explain what I mean:
asso > aso
assso > aso
assocco > asoco
innasso > innaso
I found a way to replace all repeated letters with
Pattern.compile("([^aeiou])+\\1").matcher(text).replaceAll("$1")
I found a way to recognize whether a word does not start with "inn":
Pattern.compile("^(?!inn).+").matcher(text).matches()
but I don't know how to merge them, i.e., degeminate all geminate consonants except the initial 'nn' when the word starts with 'inn'.
Can anyone help me? (I would like to solve this with a regex, so that I can apply replaceAll.)
Thank you
I'm not sure why you must do this all with a single regexp, but if you must... try using negative lookbehind:
Pattern.compile("((?<!^i(?=nn))[^aeiou])+\\1")
This gobbledygook, broken down:
(?=X) means: don't consume anything, just check whether X occurs here. If not, it's not a match.
(?<!X) is a negative lookbehind: it doesn't consume any characters, but it fails to match if X occurs immediately before this exact spot. For example, (?<!^i) fails whenever the cursor sits right after an 'i' that is the first character of the text.
(?<!^i(?=nn)) does not consume anything, but it fails at any position where the following holds: immediately before the 'cursor' there is an i, and before that, the start of the string; after the 'cursor' there are 2 n's. If all of that holds, fail. Otherwise do nothing (continue processing).
The rest is then just what you wrote already.
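A minimal sketch of how that pattern could be plugged into replaceAll, using the sample words from the question. The class name Degeminate and its method are hypothetical; the sketch assumes one word per input string, since ^ here means the start of the whole input.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this illustration.
class Degeminate {
    // A consonant (anything but a, e, i, o, u) repeated at least once,
    // unless it is the first n of a leading "inn".
    private static final Pattern GEMINATE =
            Pattern.compile("((?<!^i(?=nn))[^aeiou])+\\1");

    // Collapses each run of a repeated consonant into a single occurrence.
    static String degeminate(String word) {
        return GEMINATE.matcher(word).replaceAll("$1");
    }

    public static void main(String[] args) {
        for (String word : new String[] {"asso", "assso", "assocco", "innasso"}) {
            System.out.println(word + " > " + degeminate(word));
        }
    }
}
```

Note that [^aeiou] also matches digits, spaces and punctuation, so this sketch is only meant for plain lowercase words.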
One option is to capture inn at the start of a word in group 1, using the negative lookbehind (?<!\S) to assert the word start, and otherwise to capture a [^aeiou] character in group 2, followed by one or more repetitions of the backreference to that group.
(?<!\S)(inn)|([^aeiou\r\n])\2+
Explanation
(?<!\S) Negative lookbehind, assert what is on the left is not a non whitespace char
(inn) Capture group 1, match inn
| Or
( Capture group 2
[^aeiou\r\n] Match any char except the listed
)\2+ Close group and repeat 1+ times what was captured in group 2
In the replacement use the 2 capturing groups $1$2
For example
final String regex = "(?<!\\S)(inn)|([^aeiou\\r\\n])\\2+";
final String string = "asso\n"
+ "assso\n"
+ "assocco\n"
+ "innasso";
final String subst = "$1$2";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
final String result = matcher.replaceAll(subst);
System.out.println(result);
Output
aso
aso
asoco
innaso

Masking part of the string with a regex

The idea is to mask a string the way it's done with credit cards. It can be done with this one line of code, and it works. However, I can't find a straightforward explanation of the regex used here.
public class Solution {
    public static void main(String[] args) {
        String t1 = "518798673672531762319871";
        System.out.println(t1.replaceAll(".(?=.{4})", "*"));
    }
}
Output is: ********************9871
Explanation of regex:
.(?=.{4})
.: Match any character
(?=: Start of a lookahead condition
.{4}: that asserts presence of 4 characters
): End of the lookahead condition
In simple words, it matches any character in the input that has at least 4 characters to the right of it.
The replacement is "*", which means each matched character in the input is replaced by a single * character. This replaces every character of the credit card number except the last 4, where the lookahead condition fails the match (since there are no longer 4 characters ahead of the current position).
(?=.{4}) is a positive lookahead. It matches the pattern inside the parentheses (the next 4 characters after the current character) without including it in the main result; the . outside the parentheses is what matches each character to be replaced by *.
Imagine that your regex goes through the input char by char. On the first digit (5) it asks "is there a single char followed by 4 other chars? Yes, OK, replace [the 5] with *".
It repeats this until the 9 (4th from the end), at which point the answer to "are there another 4 characters after this?" becomes "no" and the replacing stops.
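The same idea can be wrapped in a small method where the number of visible trailing characters is a parameter. This is only a sketch; the class name Masker and the parameter name visible are made up for this example.

```java
// Hypothetical helper class, named only for this illustration.
class Masker {
    // Replaces every character that still has at least `visible` characters after it.
    static String mask(String value, int visible) {
        return value.replaceAll(".(?=.{" + visible + "})", "*");
    }

    public static void main(String[] args) {
        // Same input as in the question: everything but the last 4 digits is masked.
        System.out.println(mask("518798673672531762319871", 4));
        // Inputs with 4 or fewer characters never satisfy the lookahead and stay unchanged.
        System.out.println(mask("1234", 4));
    }
}
```

Since . does not match line terminators by default, this sketch assumes single-line input.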

Why does the regex \w*(\s+|$) find 2 matches for "foo" (Java)?

Given the regular expression \w*(\s+|$) and the input "foo", I would expect Java's Matcher.find() to be true just once: \w* would consume foo, and the $ in (\s+|$) should consume the end of the string.
I can't understand why a second find() would also be true with an empty match.
Sample code:
public static void main(String[] args) {
    Pattern p = Pattern.compile("\\w*(\\s+|$)");
    Matcher m = p.matcher("foo");
    while (m.find()) {
        System.out.println("'" + m.group() + "'");
    }
}
Expected (by me) output:
'foo'
Actual output:
'foo'
''
UPDATE
My regex example should have been just \w*$, to simplify the discussion; it produces exactly the same behavior.
So the issue seems to be how zero-length matches are handled.
I found the method Matcher.hitEnd(), which tells you that the last match reached the end of the input, so you know you don't need another Matcher.find():
while (!m.hitEnd() && m.find()) {
    System.out.println("'" + m.group() + "'");
}
The !m.hitEnd() needs to come before the m.find() in order not to miss the last word.
The expression \\w* matches zero or more characters, because you are using the Kleene star.
One quick workaround is to change the expression to \\w+
Edit:
After reading the documentation for Matcher: the find method "starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.". In this case, the first call matched all the characters, so the second call starts at the end of the input, where an empty match is still possible.
Your regex can result in a zero-length match, because \w* can be zero-length, and $ is always zero-length.
For full description of zero-length matches, see "Zero-Length Regex Matches" on http://www.regular-expressions.info.
The most relevant part is in the section named "Advancing After a Zero-Length Regex Match":
If a regex can find zero-length matches at any position in the string, then it will. The regex \d* matches zero or more digits. If the subject string does not contain any digits, then this regex finds a zero-length match at every position in the string. It finds 4 matches in the string abc, one before each of the three letters, and one at the end of the string.
Since your regex first matches foo, it is left at the position after the last o, i.e. at the end of the input. It is done with that round of searching, but that doesn't mean it is done with the overall search.
It just ends the matching for the first iteration and leaves the search position at the end of the input.
On the next iteration it can make a zero-length match, so it will. Of course, after a zero-length match it must advance, otherwise it would stay there forever, and advancing from the last position of the input stops the overall search, which is why there is no third iteration.
To fix the regex, so it doesn't do that, you can use the regex \w*\s+|\w+$, which will match:
Words followed by 1 or more spaces (spaces included in match)
"Nothing" followed by 1 or more spaces
A word at the end of the input
Because neither side of the | can match the empty string, what you experienced cannot happen. However, using \w* means that you will still find matches without any word in them, e.g.
He said: "It's done"
With that input, the regex will match:
"He "
" " the space after the :
"s " match after the '
Unless that's really what you want, you should just change regex to use + instead of *, i.e. \w+(\s+|$)
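The behavior described above can be observed with a small sketch that records each match together with its start and end offsets. The class name ZeroLengthMatches and the helper spans are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this illustration.
class ZeroLengthMatches {
    // Lists every match as 'text'[start,end) so that empty matches are visible.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add("'" + m.group() + "'[" + m.start() + "," + m.end() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // Original regex: 'foo'[0,3) followed by an empty match ''[3,3) at the end.
        System.out.println(spans("\\w*(\\s+|$)", "foo"));
        // With + instead of * the empty match disappears.
        System.out.println(spans("\\w+(\\s+|$)", "foo"));
        // The alternation variant also yields a single match.
        System.out.println(spans("\\w*\\s+|\\w+$", "foo"));
        // \d* on "abc": four empty matches, one per position including the end.
        System.out.println(spans("\\d*", "abc"));
    }
}
```

Printing the offsets makes it easier to see that the second match is not a repeat of foo but an empty match located at the end of the input.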
There are 2 matches, one for foo and one for the empty string at the end, right after foo.
If the match position changes and the regex has the option to match nothing, it will match an extra time.
This only occurs once per match position, which avoids an endless loop.
It really has nothing to do with the end-of-string anchor, other than that the anchor provides the option to match nothing.
You'd get the same with \w* on foo, i.e. 2 matches.

Regex in Java: Capture last {n} words

Hi, I am trying to write a regex in Java. I need to capture the last {n} words. (There may be a variable number of whitespace characters between words.) The requirement is that it has to be done with a regex.
So e.g. in
The man is very tall.
For n = 2, I need to capture
very tall.
So I tried
(\S*\s*){2}$
But this does not match in java because the initial words have to be consumed first. So I tried
^(.*)(\S*\s*){2}$
But .* consumes everything, and the last 2 words are ignored.
I have also tried
^\S?\s?(\S*\s*){2}$
Anyone know a way around this please?
You almost had it in your first attempt.
Just change + to *.
The plus sign means at least one character; because there wasn't any space, the match failed.
The asterisk, on the other hand, means zero or more, so it will work.
The resulting regex: (?:\S*\s*){2}$
Using replaceAll method, you could try this regex: ((?:\\S*\\s*){2}$)|.
Your regex contains, as you already mention, a greedy subpattern that eats up the whole string, and since (\S*\s*){2} can match an empty string, it matches an empty location at the end of the input string.
Lazy dot matching (changing .* to .*?) won't do the whole job since the capturing group is quantified, and the Matcher.group(1) will be set to the last captured non-whitespaces with optional whitespaces. You need to set the capturing group around the quantified group.
Since you most likely are using Matcher#matches, you can use
String str = "The man is very tall.";
Pattern ptrn = Pattern.compile("(.*?)((?:\\S*\\s*){2})"); // no need for `^`/`$` with matches()
Matcher matcher = ptrn.matcher(str);
if (matcher.matches()) { // Group 2 contains the last 2 "words"
    System.out.println(matcher.group(2)); // => very tall.
}
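For a variable n, one possible sketch uses find() with the anchored pattern (?:\S*\s*){n}$ from the earlier answer: find() tries start positions from left to right, and the first one from which n "words" reach the end of the input is the start of the last n words. The class name LastWords and the method lastWords are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this illustration.
class LastWords {
    // Returns the last n whitespace-separated words of text, keeping the
    // whitespace between them exactly as it appears in the input.
    static String lastWords(String text, int n) {
        Matcher m = Pattern.compile("(?:\\S*\\s*){" + n + "}$").matcher(text);
        // The first successful find() is the leftmost start that reaches the end.
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(lastWords("The man is very tall.", 2));     // very tall.
        System.out.println(lastWords("The man is   very    tall.", 2)); // very    tall.
    }
}
```

If the input has fewer than n words, this sketch simply returns the whole input, and trailing whitespace, if any, is included in the result.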
