How to negate a vowel condition using Regex in java

I'm trying to construct a regex for a string that must satisfy the following conditions:
It must contain at least one vowel.
It cannot contain three consecutive vowels or three consecutive consonants.
It cannot contain two consecutive occurrences of the same letter, except for 'ee' or 'oo'.
I'm not able to construct a regex for the 2nd and 3rd conditions.
e.g:
bower - accepted,
appple - not accepted,
miiixer - not accepted,
hedding - not accepted,
feeding - accepted
Thanks in advance!
Edited:
My code:
Pattern ptn = Pattern.compile("((.*[A-Za-z0-9]*)(.*[aeiou|AEIOU]+)(.*[##$%]).*)(.*[^a]{3}.*)");
Matcher mtch = ptn.matcher("zoggax");
if (mtch.find()) {
return true;
}
else
return false;

The following one should suit your needs:
(?=.*[aeiouy])(?!.*[aeiouy]{3})(?!.*[a-z&&[^aeiouy]]{3})(?!.*([a-z&&[^eo]])\1).*
In Java:
String regex = "(?=.*[aeiouy])(?!.*[aeiouy]{3})(?!.*[a-z&&[^aeiouy]]{3})(?!.*([a-z&&[^eo]])\\1).*";
System.out.println("bower".matches(regex));
System.out.println("appple".matches(regex));
System.out.println("miiixer".matches(regex));
System.out.println("hedding".matches(regex));
System.out.println("feeding".matches(regex));
Prints:
true
false
false
false
true
Explanation:
(?=.*[aeiouy]): contains at least one vowel
(?!.*[aeiouy]{3}): does not contain 3 consecutive vowels
(?!.*[a-z&&[^aeiouy]]{3}): does not contain 3 consecutive consonants
[a-z&&[^aeiouy]]: any letter between a and z but none of aeiouy
(?!.*([a-z&&[^eo]])\1): does not contain 2 consecutive occurrences of the same letter, except for e and o
[a-z&&[^eo]]: any letter between a and z, but none of eo
See http://www.regular-expressions.info/charclassintersect.html.

This should work for English under the assumption that 'y' is a non-vowel:
^(?!.*[aeiou]{3})(?!.*[bcdfghjklmnpqrstvwxyz]{3})(?!.*([^eo])\1).*[aeiou]
Explanation:
^ fixes the match to the beginning of the string.
(?!.*[aeiou]{3}) checks that you cannot find 3 consecutive vowels at any point after the current position in the string. Since this comes immediately after the ^, it checks the entire string. Being a lookahead, it does not advance the cursor.
Non-vowels are tested similarly. This can be written more compactly if your regexp flavor supports set subtraction. Java does, through character class intersection such as [a-z&&[^aeiou]].
(?!.*([^eo])\1) checks that there is no occurrence of a single-character capture group, matching a character other than e or o, that is followed by a copy of itself. I.e. no character other than e and o appears twice in a row.
.*[aeiou] looks for a vowel at some point in the string.
This regexp also assumes that the case-insensitive flag is set. That is not the default in Java: pass Pattern.CASE_INSENSITIVE to Pattern.compile, or prefix the pattern with (?i).
It is also a regexp that will find a match in a string satisfying your criteria. It will not necessarily match the whole string. If that is needed (for example with String.matches), add .*$ to the end of the regexp.
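As a rough sketch of how this pattern might be used from Java: the class name VowelRuleCheck and the method isAccepted are made up for illustration, and the case-insensitive flag is passed explicitly. The sample words are the ones from the question.

```java
import java.util.regex.Pattern;

final class VowelRuleCheck {
    // Regex from the answer above; 'y' is treated as a consonant.
    // CASE_INSENSITIVE is passed explicitly because Java patterns
    // are case-sensitive unless told otherwise.
    private static final Pattern RULES = Pattern.compile(
            "^(?!.*[aeiou]{3})(?!.*[bcdfghjklmnpqrstvwxyz]{3})(?!.*([^eo])\\1).*[aeiou]",
            Pattern.CASE_INSENSITIVE);

    // find() is sufficient: the pattern is anchored at the start and the
    // lookaheads scan the rest of the input from there.
    static boolean isAccepted(String word) {
        return RULES.matcher(word).find();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"bower", "appple", "miiixer", "hedding", "feeding"}) {
            System.out.println(w + " -> " + isAccepted(w));
        }
    }
}
```

Using find() instead of String.matches() avoids having to append .*$ to the pattern.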

If my hunch is correct that you meant to say "three consecutive occurrences of the same letter" (looking at your examples) then you can simply say "e and o may not occur thrice, everything else may not occur twice", like so:
^(?=.*[aeiouy].*)(?!.*([eo])\1\1.*)(?!.*([a-df-np-z])\2.*).*$
The key is that a letter occurring thrice is also occurring twice.
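A small sketch for trying this pattern against the question's examples. The class name RepeatRuleCheck and the extra word "strength" are assumptions added for illustration. Note that this pattern only covers the vowel requirement and the repeated-letter rule, so a word with three consecutive consonants still passes.

```java
import java.util.regex.Pattern;

final class RepeatRuleCheck {
    // Pattern from the answer above. Group 1 is ([eo]), group 2 is ([a-df-np-z]).
    // It is case-sensitive and expects lowercase input.
    private static final Pattern RULES = Pattern.compile(
            "^(?=.*[aeiouy].*)(?!.*([eo])\\1\\1.*)(?!.*([a-df-np-z])\\2.*).*$");

    static boolean isAccepted(String word) {
        return RULES.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] words = {"bower", "appple", "miiixer", "hedding", "feeding", "strength"};
        for (String w : words) {
            System.out.println(w + " -> " + isAccepted(w));
        }
    }
}
```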

Related

Why does the regex \w*(\s+|$) finds 2 matches for "foo" (Java)?

Given the regular expression \w*(\s+|$) and the input "foo", I would expect Java's Matcher.find() to be true just once: \w* would consume foo, and the $ in (\s+|$) should consume the end of the string.
I can't understand why a second find() would also be true with an empty match.
Sample code:
public static void main(String[] args) {
    Pattern p = Pattern.compile("\\w*(\\s+|$)");
    Matcher m = p.matcher("foo");
    while (m.find()) {
        System.out.println("'" + m.group() + "'");
    }
}
Expected (by me) output:
'foo'
Actual output:
'foo'
''
UPDATE
To simplify the discussion, my regex example should have been just \w*$, which produces the exact same behavior.
So the thing seems to be how zero-length matches are handled.
I found the method Matcher.hitEnd(), which tells you that the last match reached the end of the input, so that you know you don't need another Matcher.find():
while (!m.hitEnd() && m.find()) {
System.out.println("'" + m.group() + "'");
}
The !m.hitEnd() needs to be before the m.find() in order not to miss the last word.

The expression \\w* matches zero or more characters, because you are using the Kleene star.
One quick workaround is to change the expression to \\w+
Edit:
After reading the documentation for Matcher, the find method "starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.". In this case, the first call matched all the characters, so the second call starts at the end of the input, where only an empty match is possible.

Your regex can result in a zero-length match, because \w* can be zero-length, and $ is always zero-length.
For full description of zero-length matches, see "Zero-Length Regex Matches" on http://www.regular-expressions.info.
The most relevant part is in the section named "Advancing After a Zero-Length Regex Match":
If a regex can find zero-length matches at any position in the string, then it will. The regex \d* matches zero or more digits. If the subject string does not contain any digits, then this regex finds a zero-length match at every position in the string. It finds 4 matches in the string abc, one before each of the three letters, and one at the end of the string.
Since your regex first matches the foo, it is left at the position after the last o, i.e. at the end of the input, so it is done with that round of searching, but that doesn't mean it is done with the overall search.
It just ends the matching for the first iteration of matching, and leaves the search position at the end of the input.
On the next iteration, it can make a zero-length match, so it will. Of course, after a zero-length match, it must advance, otherwise it'll just stay there forever, and advancing from the last position of the input stops the overall search, which is why there is no third iteration.
To fix the regex, so it doesn't do that, you can use the regex \w*\s+|\w+$, which will match:
Words followed by 1 or more spaces (spaces included in match)
"Nothing" followed by 1 or more spaces
A word at the end of the input
Because neither part of the | can be an empty match, what you experienced cannot happen. However, using \w* means that you will still find matches without any word in it, e.g.
He said: "It's done"
With that input, the regex will match:
"He "
" " the space after the :
"s " match after the '
Unless that's really what you want, you should just change regex to use + instead of *, i.e. \w+(\s+|$)
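The behavior described above can be checked with a small sketch. The class name ZeroLengthMatchDemo and the helper allMatches are made up for illustration; the helper simply collects what repeated find() calls report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ZeroLengthMatchDemo {
    // Collects every match reported by repeated find() calls.
    static List<String> allMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Original regex: "foo" plus an empty match at the end of the input.
        System.out.println(allMatches("\\w*(\\s+|$)", "foo"));
        // With + instead of *: only "foo".
        System.out.println(allMatches("\\w+(\\s+|$)", "foo"));
        // The alternation variant on the example sentence: "He ", " ", "s ".
        System.out.println(allMatches("\\w*\\s+|\\w+$", "He said: \"It's done\""));
    }
}
```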

There are 2 matches: one for foo, and one empty match at the position right after foo.
If the match position changes and the regex has the option to match nothing, it will match an extra time.
This only occurs once per match position, which avoids an endless loop.
It really has nothing to do with the end-of-string anchor, other than that the anchor provides the option to match nothing.
You'd get the same with \w* using foo, i.e. 2 matches.

Please justify the output in Regex Java program

I came across a Java program that uses regex.
Below is the program code:
import java.util.regex.*;
public class Regex_demo01 {
    public static void main(String[] args) {
        boolean b = true;
        Pattern p = Pattern.compile("\\d*");
        Matcher m = p.matcher("ab34ef");
        while (b = m.find()) {
            System.out.println(b);
            System.out.println(">" + m.start() + "\t" + m.group() + "<");
        }
    }
}
Output :
true
>0 <
true
>1 <
true
>2 34<
true
>4 <
true
>5 <
true
>6 <
Doubt: As we all know, the find() method returns true if it gets a match and remembers the start position of the match. If find() returns true, you can call the start() method to get the starting position of the match, and you can call the group() method to get the string that represents the actual bit of source data that was matched.
My question is: how come ">6 <" is present in the output when the string's indexes only go up to 5?

The answer is simple: x* matches any number of x, even 0.
Replace * with +, which matches 1 or more of the element to its left.

My question is how come >6 < is present in the output when the string's indexes only go up to 5?
That behavior is due to your regex i.e. \\d* which matches 0 or more digits.
As you can see it is showing start position 0 as well when there is no digit at the start.
Similarly 6 is last index +1 because there is an empty match past the last character as well.
You should use \\d+ as your regex.

The star quantifier (*) is defined as "zero or more times". That said, your pattern matches zero digits most of the time.
What you actually want is probably the plus quantifier (+), which means "one or more times".
Source: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Why is there a match at index 6?
RegEx doesn't work on a per-character basis, but rather on the positions in between single chars. When matching an empty string, it will look before and after every character. Duplicate findings are omitted, of course, so an empty string after the first char and before the second char will yield one match instead of two. By default the algorithm is greedy, which means it will match as many characters as possible.
Consider this example:
Input string is 1
RegEx is \\d*
In this case the RegEx engine starts before the first character and tries to match zero, one or more digits. Since it's greedy, it doesn't stop after the empty string it finds at the beginning. It finds a '1' with no digits following. This is the first match. Then it continues the search after the match. It finds an empty string and matches it too, since that equals zero digits.
For RegEx the string '1' looks rather like this:
"" + "1" + ""
The first two units (empty string and the "1") match the pattern, the third, empty string does, too.
In-depth article about this: http://www.regular-expressions.info/zerolength.html
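To see the positions involved, here is a minimal sketch that records each match together with its start index. The class name DigitMatchDemo and the helper matchesWithStart are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DigitMatchDemo {
    // Returns each match as "start:text" so empty matches stay visible.
    static List<String> matchesWithStart(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + ":" + m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // \d* also matches the empty string, including at index 6 (end of input).
        System.out.println(matchesWithStart("\\d*", "ab34ef"));
        // \d+ needs at least one digit, so only "34" at index 2 is reported.
        System.out.println(matchesWithStart("\\d+", "ab34ef"));
        // The one-character example from the answer above.
        System.out.println(matchesWithStart("\\d*", "1"));
    }
}
```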

Java - Regex to get if word contains all letters of another one

Is it possible to create a Java regex that could determine whether one word (base) contains all letters from another word (the sample from which the regex is created) exactly?
For example
Input: base = 'Subexpressions', sample1 = 'Nubs'
Output: True
Input: base = 'Subexpressions', sample2 = 'Expert'
Output: False
Explanation: base contains all letters from sample1 but doesn't contain 't' from sample2.

For a regex approach, here's how you could do it programmatically.
Make a list of letters: Subexpressions => subexprion
Build a series of lookaheads with the letters, and anchor it at the start: ^(?=.*s)(?=.*u)etc(?=.*n)
Run that regex against the string. It will act as an AND, because the lookaheads check one by one if a letter is present in the target string. If you have a match, bingo.
Of course, running a substring search such as strstr (String.indexOf in Java) once for each character also works.
Note: Some engines may report strangely for a zero-width match, so for safety you can add a single dot after the lookaheads, ensuring that you have at least one char in the match:
^(?=.*s)(?=.*u)etc(?=.*n).
You don't care about the char, just about whether there is a match at all.
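A minimal sketch of building such a regex programmatically in Java. It assumes the lookaheads are generated from the sample word and run against the base word, and that letter case should be ignored; the class and method names are hypothetical. It only checks that each letter is present, not how many times it occurs.

```java
import java.util.regex.Pattern;

final class LookaheadLetterCheck {
    // Builds ^(?=.*x)(?=.*y)... with one lookahead per distinct character of the sample.
    static Pattern lookaheadsFor(String sample) {
        StringBuilder regex = new StringBuilder("^");
        sample.toLowerCase().chars().distinct().forEach(c ->
                regex.append("(?=.*")
                     .append(Pattern.quote(String.valueOf((char) c)))
                     .append(")"));
        // DOTALL lets .* cross line breaks; CASE_INSENSITIVE ignores letter case.
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    // find() succeeds with a zero-width match at index 0 when every lookahead holds.
    static boolean containsAll(String base, String sample) {
        return lookaheadsFor(sample).matcher(base).find();
    }

    public static void main(String[] args) {
        System.out.println(containsAll("Subexpressions", "Nubs"));
        System.out.println(containsAll("Subexpressions", "Expert"));
    }
}
```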

This problem doesn't really need regex. Just use this simple approach:
Set a boolean variable found to true
Iterate sample variable character by character
Check presence of each character in your base string variable
Set found to false if a character is not found and bail out of loop
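The steps above could be sketched in Java roughly as follows. The class and method names are made up for illustration, and lowercasing both words is an assumption based on the 'Nubs' example.

```java
final class LoopLetterCheck {
    static boolean containsAllLetters(String base, String sample) {
        // Assumption: the comparison ignores case, as in the 'Nubs' example.
        String lowerBase = base.toLowerCase();
        boolean found = true;
        for (char c : sample.toLowerCase().toCharArray()) {
            if (lowerBase.indexOf(c) < 0) {
                // This character is missing from base: stop early.
                found = false;
                break;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(containsAllLetters("Subexpressions", "Nubs"));
        System.out.println(containsAllLetters("Subexpressions", "Expert"));
    }
}
```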

Matching only one occurrence of a character from a given set

I need to validate an input string such that validation returns true only if the string contains one of the special characters # # $ %, only one, and one time at the most. Letters and numbers can be anywhere and can be repeated any number of times, but at least one number or letter should be present
For example:
a# : true
#a : true
a#$: false
a#n01 : true
an01 : false
a : false
# : false
I tried
[0-9A-Za-z]*[##%$]{1}[0-9A-Za-z]*
I was hoping this would match one occurrence of any of the special characters. But, no. I need only one occurrence of any one in the set.
I also tried alternation but could not solve it.

Vivek, your regex was really close. Here is the one-line regex you are looking for.
^(?=.*?[0-9a-zA-Z])[0-9a-zA-Z]*[##$%][0-9a-zA-Z]*$
How does it work?
The ^ and $ anchors ensure that whatever we are matching is the whole string, avoiding partial matches with forbidden characters later.
The (?=.*?[0-9a-zA-Z]) lookahead ensures that we have at least one number or letter.
The [0-9a-zA-Z]*[##$%][0-9a-zA-Z]* matches zero or more letters or digits, followed by exactly one character that is either a #, #, $ or %, followed by zero or more letters or digits—ensuring that we have one special character but no more.
Implementation
I am sure you know how to implement this in Java, but to test if the string matches, you could use something like this:
boolean foundMatch = subjectString.matches("^(?=.*?[0-9a-zA-Z])[0-9a-zA-Z]*[##$%][0-9a-zA-Z]*$");
What was wrong with my regex?
Actually, your regex was nearly there. Here is what was missing.
Because you didn't have the ^ and $ anchors, the regex was able to match a subset of the string, for instance a# in a##%%, which means that special characters could appear in the string, but outside of the match. Not what you want: we need to validate the whole string by anchoring it.
You needed something to ensure that at least one letter or digit was present. You could definitely have done it with an alternation, but in this case a lookahead is more compact.
Alternative with Alternation
Since you tried alternations, for the record, here is one way to do it:
^(?:[0-9a-zA-Z]+[##$%][0-9a-zA-Z]*|[0-9a-zA-Z]*[##$%][0-9a-zA-Z]+)$
Let me know if you have any questions.

I hope this answer will be useful for you, if not, it might be for future readers. I am going to make two assumptions here up front: 1) You do not need regex per se, you are programming in Java. 2) You have access to Java 8.
This could be done the following way:
private boolean stringMatchesChars(final String str, final List<Character> characters) {
    return (str.chars()
            .filter(ch -> characters.contains((char)ch))
            .count() == 1);
}
Here I am:
Using as input a String and a List<Character> of the ones that are allowed.
Obtaining an IntStream (consisting of chars) from the String.
Filtering every char to only remain in the stream if they are in the List<Character>.
Returning true only if the count() == 1, that is, of the characters in List<Character>, exactly one is present.
The code can be used as:
String str1 = "a";
String str2 = "a#";
String str3 = "a##a";
String str4 = "a##a";
List<Character> characters = Arrays.asList('#', '#', '$', '%');
System.out.println("stringMatchesChars(str1, characters) = " + stringMatchesChars(str1, characters));
System.out.println("stringMatchesChars(str2, characters) = " + stringMatchesChars(str2, characters));
System.out.println("stringMatchesChars(str3, characters) = " + stringMatchesChars(str3, characters));
System.out.println("stringMatchesChars(str4, characters) = " + stringMatchesChars(str4, characters));
Resulting in false, true, false, false.
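The method above only counts special characters, so a string consisting of a single special character would also pass. A possible extension, sketched under the assumption that the question's other rules (at least one letter or digit, and nothing else in the string) should be enforced too: the class name, the method isValid, and the specials parameter are hypothetical, and "#$%" is just an example value for it.

```java
final class SingleSpecialCheck {
    // 'specials' lists the allowed special characters, e.g. "#$%".
    static boolean isValid(String str, String specials) {
        long specialCount = str.chars()
                .filter(ch -> specials.indexOf(ch) >= 0)
                .count();
        long alnumCount = str.chars()
                .filter(ch -> (ch >= '0' && ch <= '9')
                        || (ch >= 'a' && ch <= 'z')
                        || (ch >= 'A' && ch <= 'Z'))
                .count();
        // Exactly one special character, at least one letter or digit,
        // and no characters outside those two groups.
        return specialCount == 1
                && alnumCount >= 1
                && specialCount + alnumCount == str.length();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a#", "#a", "a#$", "a#n01", "an01", "a", "#"}) {
            System.out.println(s + " : " + isValid(s, "#$%"));
        }
    }
}
```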

loop through one group in regex (Java)

I have a regex containing something like ([A-Za-z]\s)+. It returns one group containing all the letters followed by a space. However, it keeps only the last element in the group. For example, if the text contains a b c d and I print the group match, it returns only the letter (d). This is my program:
while (m.find()) {
L = m.group(1);
System.out.println(L);
}
My question is: why do I only get the letter (d) instead of all the letters? Is it because they are all captured through one group? How can I correct that? How can I iterate through one group, for example, iterate through all matches detected as one group?

The problem with your regex is that it matches all sequences of one character followed by a space.
In your example, group() would return the whole a b c d string. However, when capturing parentheses are inside a repetition (like your +), only the last captured value can be retrieved, hence group(1) returns d.
To fix your issue, just remove the + from your regex. This will make the find() succeed several times, and each time you will get a different match. In that case you might even drop the parentheses and simply use group().
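A short sketch of that fix, assuming the input ends with whitespace so that every letter is followed by a space. The class name GroupLoopDemo and the helper letterTokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupLoopDemo {
    // Without the +, each find() call yields one "letter + whitespace" match.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z]\\s").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Each token still includes its trailing space; trim() removes it.
        for (String token : letterTokens("a b c d ")) {
            System.out.println(token.trim());
        }
    }
}
```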
