How does java quantifier "?" work? - java

System.out.println(Pattern.matches("[amn]?", "a"));
This statement returns true.
But
System.out.println(Pattern.matches("[amn]?", "amn"));
System.out.println(Pattern.matches("[amn]?", "adef"));
These statements returns false.
Why ?
My understanding about regex quantifier "?" is this.
Regex: X?
Description: X occurs once or not at all
So the statement "[amn]?" "amn" should return true because a,m,n occurs once.
And similarly in "[amn]?" "adef" a occurs only once and m and n do not occur at all.
Where am I going wrong ?

The regex [amn]? matches any string which consists of either a , m or n and nothing else. Such as "a", which fulfills this condition.
amn and adef, however, start with one of these letters, but continue so that the "once or not at all" rule is not fulfilled.

The first returns true because a is one letter that is either a, m or n.
The others return false because there's not one letter, there's 3 and 4 letters.
Even though your letter group contains 3 letters, it will only check for the existence of one of them.

[amn] is a group consisting of the characters "a", "m" and "n".
[amn]? means "one of the characters from the group [amn] or no character at all".
Pattern.matches tries to match the entire pattern against the entire input string.
If you want the sequence of characters "amn" you could try (amn)?, which should mean "the sequence 'amn' or nothing".

Function matches() matches the entire string against the regular expression, which means it will return true only if the complete string can be matched by the expression, not any sub-sequence. Refer this documentation.
[amn]? means that either a or m or n can exists once or not at all. Only cases for which matches() will return true:
"a"
"m"
"n"
""
All other cases will be given as false.
If you want to find the regular expression in some string, then use find() as shown below.
Pattern p = Pattern.compile("[amn]?");
Matcher mat = p.matcher(""); //pass amn or adef
boolean matches = false;
while(mat.find()){
matches = true;
break;
}
System.out.println(matches);

? means match current regular expression not greedy

Related

Why does the regex \w*(\s+|$) finds 2 matches for "foo" (Java)?

Given the regular expression \w*(\s+|$) and the input "foo" I would expect that a Java Matcher.find() to be true just once: \w* would consume foo, and the $ in (\s+|$) should consume the end of the string.
I can't understand why a second find() would also be true with an emtpy match.
Sample code:
public static void main(String[] args) {
Pattern p = Pattern.compile("\\w*(\\s+|$)");
Matcher m = p.matcher("foo");
while (m.find()) {
System.out.println("'" + m.group() + "'");
}
}
Expected (by me) output:
'foo'
Actual output:
'foo'
''
UPDATE
My regex example should have been just \w*$ in order to simplify the discussion which produces the exact same behavior.
So the thing seems to be how zero-length matches are handled.
I found the method Matcher.hitEnd() which tells you that the last match reached the end of the input, so that you know you don't need another Matcher.find()
while (!m.hitEnd() && m.find()) {
System.out.println("'" + m.group() + "'");
}
The !m.hitEnd() needs to be before the m.find() in order not to miss the last word.
The expresion \\w* matches zero or more characters, because you are using the Kleene operator.
One quick workaround is change the expresion to \\w+
Edit:
After read the documentation for Matcher, the find method "starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.". In this case, on the first call all the characters were matched, so the second call starts at empty.
Your regex can result in a zero-length match, because \w* can be zero-length, and $ is always zero-length.
For full description of zero-length matches, see "Zero-Length Regex Matches" on http://www.regular-expressions.info.
The most relevant part is in the section named "Advancing After a Zero-Length Regex Match":
If a regex can find zero-length matches at any position in the string, then it will. The regex \d* matches zero or more digits. If the subject string does not contain any digits, then this regex finds a zero-length match at every position in the string. It finds 4 matches in the string abc, one before each of the three letters, and one at the end of the string.
Since your regex first matches the foo, it is left at the position after the last o, i.e. at the end of the input, so it is done with that round of searching, but that doesn't mean it is done with the overall search.
It just ends the matching for the first iteration of matching, and leaves the search position at the end of the input.
On the next iteration, it can make a zero-length match, so it will. Of course, after a zero-length match, is must advance, otherwise it'll just stay there forever, and advancing from the last position of the input stops the overall search, which is why there is no third iteration.
To fix the regex, so it doesn't do that, you can use the regex \w*\s+|\w+$, which will match:
Words followed by 1 or more spaces (spaces included in match)
"Nothing" followed by 1 or more spaces
A word at the end of the input
Because neither part of the | can be an empty match, what you experienced cannot happen. However, using \w* means that you will still find matches without any word in it, e.g.
He said: "It's done"
With that input, the regex will match:
"He "
" " the space after the :
"s " match after the '
Unless that's really what you want, you should just change regex to use + instead of *, i.e. \w+(\s+|$)
There are 2 matches, one for the foo and one for the foohere->.
If the match position changes and it has the
option to match nothing, it will match an extra time.
This only occurs once per match position.
This is to avoid an endless loop of infinite un-wisedom.
And, really has nothing to do with the EOS anchor other than it provides
the option to match nothing.
You'd get the same with \w* using foo, i.e. 2 matches.

Please justify the output in Regex Java program

I have came across one Java program in Regex .
Below is the program code :
import java.util.regex.*;
public class Regex_demo01 {
public static void main(String[] args) {
boolean b=true;
Pattern p=Pattern.compile("\\d*");
Matcher m=p.matcher("ab34ef");
while(b=m.find())
{
System.out.println(b);
System.out.println(">"+m.start()+"\t"+m.group()+"<");
}
}
}
Output :
true
>0 <
true
>1 <
true
>2 34<
true
>4 <
true
>5 <
true
>6 <
Doubt : As we all know that The find() method returns true if it gets a match and remembers the start position of the match. If find() returns true, you can call the start() method to get the starting position of the match, and you can call the group() method to get the string that represents the actual bit of source data that was matched.
My question is how come ">6 <" is present is the output when the string indexing is till index 5 ?
Anser is simple. x* matche any count of x even 0.
Replace * to + which matche to 1 or more element that is left to it.
My question is how come >6 < is present is the output when the string indexing is till index 5 ?
That behavior is due to your regex i.e. \\d* which matches 0 or more digits.
As you can see it is showing start position 0 as well when there is no digit at the start.
Similarly 6 is last index +1 because there is an empty match past the last character as well.
You should use \\d+ as your regex.
The star quantifier (*) is defined as "zero or more times". That said, your pattern matches zero digits most of the time.
What you actually want is probably the plus quantifier (+), which means "one or more times".
Source: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Why is there a match at index 6?
RegEx doesn't work on a char-basis, but rather inbetween single chars. When matching an empty string, it will look before and after every character. Duplicate findings are omitted, of course, so an empty string after the first char and before the second char will yield one match instead of two. By default the algorithm is greedy, which means it will match as many characters as possible.
Consider this example:
Input string is 1
RegEx is \\d*
In this case the RegEx engine starts before the first character and tries to match zero, one or more digits. Since it's greedy, it doesn't stop after the empty string it finds at the beginning. It finds a '1' with no digits following. This is the first match. Then it continues the search after the match. It finds an empty string and matches it too, since that equals zero digits.
For RegEx the string '1' looks rather like this:
"" + "1" + ""
The first two units (empty string and the "1") match the pattern, the third, empty string does, too.
In-depth article about this: http://www.regular-expressions.info/zerolength.html

How to negate a vowel condition using Regex in java

I'm trying to construct a Regex for a string which should have these following conditions:
It must contain at least one vowel.
It cannot contain three consecutive vowels or three consecutive consonants.
It cannot contain two consecutive occurrences of the same letter, except for 'ee' or 'oo'.
I'm not able to construct regex for 2nd and 3rd conditions.
e.g:
bower - accepted,
appple - not accepted,
miiixer - not accepted,
hedding - not accepted,
feeding - accepted
Thanks in advance!
Edited:
My code:
Pattern ptn = Pattern.compile("((.*[A-Za-z0-9]*)(.*[aeiou|AEIOU]+)(.*[##$%]).*)(.*[^a]{3}.*)");
Matcher mtch = ptn.matcher("zoggax");
if (mtch.find()) {
return true;
}
else
return false;
The following one should suit your needs:
(?=.*[aeiouy])(?!.*[aeiouy]{3})(?!.*[a-z&&[^aeiouy]]{3})(?!.*([a-z&&[^eo]])\\1).*
In Java:
String regex = "(?=.*[aeiouy])(?!.*[aeiouy]{3})(?!.*[a-z&&[^aeiouy]]{3})(?!.*([a-z&&[^eo]])\\1).*";
System.out.println("bower".matches(regex));
System.out.println("appple".matches(regex));
System.out.println("miiixer".matches(regex));
System.out.println("hedding".matches(regex));
System.out.println("feeding".matches(regex));
Prints:
true
false
false
false
true
Explanation:
(?=.*[aeiouy]): contains at least one vowel
(?!.*[aeiouy]{3}): does not contain 3 consecutive vowels
(?!.*[a-z&&[^aeiouy]]{3}): does not contain 3 consecutive consonants
[a-z&&[^aeiouy]]: any letter between a and z but none of aeiouy
(?!.*([a-z&&[^eo]])\1): does not contain 2 consecutive letters, except e and o
[a-z&&[^eo]]: any letter between a and z, but none of eo
See http://www.regular-expressions.info/charclassintersect.html.
This should work for English under the assumption that 'y' is a non-vowel;
^(?!.*[aeiou]{3})(?!.*[bcdfghjklmnpqrstvwxyz]{3})(?!.*([^eo])\1).*[aeiou]
Explanation:
^ fixes the match to the beginning of the string.
(?!.*[aeiou]{3}) checks that you can not find 3 consecutive vowels at any point after the current position in the string. (Since this is immidiately after the ^ this checks the entire string). It also does not advance the cursor.
Non vowels are tested similarily. This can be done in a prettier way if your regexp flavor supports set subtraction. But I think Java does not do this.
(?!.*([^eo])\1) checks that there are no occurence of a single character capture group, of characters other than e or o, which is followed by a copy of itself. Ie. no character other than e and o is repeated twice.
.*[aeiou] looks for a vowel at some point in the string.
This regexp also assumes that the case-insensitive flag is set. I think this is the default for java but I can be wrong about that.
It also is a regexp that will find a match in a string satisfying your criteria. It will not necesarily match the whole string. - If this is needed add .*$ to the end of the regexp.
If my hunch is correct that you meant to say "three consecutive occurrences of the same letter" (looking at your examples) then you can simply say "e and o may not occur thrice, everything else may not occur twice", like so:
^(?=.*[aeiouy].*)(?!.*([eo])\1\1.*)(?!.*([a-df-np-z])\2.*).*$
Debuggex Demo, Key is that a letter occuring thrice is also occuring twice.

simple java regular expression not working

I have this simple example of a regular expression. But it is not working. I don't know what I am doing wrong:
String name = "abc";
System.out.println(name.matches("[a-zA-Z]"));
it returns false, it should be true.
use :
name.matches("[a-zA-Z]+") // matches more than one character
or name.matches("\\w+") // matches more than one character
name.matches("[a-zA-Z]") // matches exactly one character.
Add + to your regex to match one or more alphabets,
String name = "abc"; System.out.println(name.matches("[a-zA-Z]+"));
Your regex [a-zA-Z] must match a single alphabet, not more than one.
[a-zA-Z] Match a lowercase alphabet from a-z or match an uppercase alphabet from A-Z.
The reason why this evaluates to false is, it tries to match the entrie string (see doc of String.matches()) to the Pattern [A-Za-z] wich only matches a single character. Either use
Pattern.compile("[A-Za-z]").matcher(str).find() to see if a substring matches (will return true in this case), or alter the RegEx to account for multiple Characters. The cleanest way of doing so is
Pattern.compile("^[A-Za-z]+$");
The ^ marks "start of string" and $ marks "end of string". + means "previous token at least once".
If you want to allow the empty String as well, use
Pattern.compile("^[A-Za-z]*$");
instead (* means "match the previous token 0 or more times")
Try with [a-zA-Z]+
[a-zA-Z] indicates:

Question mark in regular expression matching is not working

I am new to regular expressions. I thought that this would return matched succesfully, but it does not. Why is that so?
String myString = "SUB_HEADER5_LABEL";
if (myString.matches(Pattern.quote("SUB_HEADER?_LABEL")))
{
System.out.println("matched succesfully");
}
Pattern.qoute() will create a Pattern that only matches exactly the given String. you need
if (myString.matches("SUB_HEADER\\d_LABEL"))
If you expect the number to exceed 9, add the + quantifier like
if (myString.matches("SUB_HEADER\\d+_LABEL"))
if you want to match a digit where you have ? (wich means one or zero R in your case, because it's a quantifier). you need to replace it with [0-9] or \\d.

Categories

Resources