loop through one group in regex (Java) - java

I have a regex containing something like ([A-Za-z]\s)+. It matches every letter that is followed by a space, all inside one group. However, the group keeps only the last element. For example, if the text contains a b c d and I print the group, I get only the letter d. This is my program:
while (m.find()) {
    L = m.group(1);
    System.out.println(L);
}
My question is: why do I get only the letter d instead of all the letters? Is it because they are all captured by one group? How can I correct that and iterate through everything that the one group matched?

The problem with your regex is that the + makes a single match consume the whole run of letter-plus-space pairs.
In your example, group() would return the whole a b c d string. However, when capturing parentheses are inside a repetition (like your +), only the last captured value is retained, hence group(1) returns d.
To fix your issue, just remove the + from your regex. find() will then succeed several times, and each time you will get a different match. In that case you can even drop the parentheses and simply use group().
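A minimal sketch of that fix, assuming the input ends with a space so that d also qualifies. The class and method names are made up for illustration, and the \s is moved outside the group so that group(1) holds only the letter (a small variation on the original pattern):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterMatches {
    // Hypothetical helper: collects every letter that is followed by whitespace.
    static List<String> lettersBeforeSpace(String text) {
        // No '+' after the group, so each find() matches exactly one letter.
        Pattern p = Pattern.compile("([A-Za-z])\\s");
        Matcher m = p.matcher(text);
        List<String> letters = new ArrayList<>();
        while (m.find()) {
            letters.add(m.group(1)); // group 1 is just the letter
        }
        return letters;
    }

    public static void main(String[] args) {
        // Trailing space is an assumption about the original input.
        System.out.println(lettersBeforeSpace("a b c d ")); // prints [a, b, c, d]
    }
}
```

Without the trailing space the last letter is not followed by whitespace, so it would not be matched.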

Related

Why does the regex \w*(\s+|$) find 2 matches for "foo" (Java)?

Given the regular expression \w*(\s+|$) and the input "foo", I would expect Java's Matcher.find() to be true just once: \w* would consume foo, and the $ in (\s+|$) would match the end of the string.
I can't understand why a second find() would also be true, with an empty match.
Sample code:
public static void main(String[] args) {
    Pattern p = Pattern.compile("\\w*(\\s+|$)");
    Matcher m = p.matcher("foo");
    while (m.find()) {
        System.out.println("'" + m.group() + "'");
    }
}
Expected (by me) output:
'foo'
Actual output:
'foo'
''
UPDATE
My regex example should have been just \w*$, which simplifies the discussion and produces exactly the same behavior.
So the thing seems to be how zero-length matches are handled.
I found the method Matcher.hitEnd(), which tells you that the last match reached the end of the input, so you know you don't need another Matcher.find():
while (!m.hitEnd() && m.find()) {
    System.out.println("'" + m.group() + "'");
}
The !m.hitEnd() needs to be before the m.find() in order not to miss the last word.
The expression \\w* matches zero or more word characters, because * is the Kleene star.
One quick workaround is to change the expression to \\w+
Edit:
After reading the documentation for Matcher: the find method "starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match." In this case, the first call matched all the characters, so the second call starts at the end of the input, where only an empty match is possible.
Your regex can result in a zero-length match, because \w* can be zero-length, and $ is always zero-length.
For full description of zero-length matches, see "Zero-Length Regex Matches" on http://www.regular-expressions.info.
The most relevant part is in the section named "Advancing After a Zero-Length Regex Match":
If a regex can find zero-length matches at any position in the string, then it will. The regex \d* matches zero or more digits. If the subject string does not contain any digits, then this regex finds a zero-length match at every position in the string. It finds 4 matches in the string abc, one before each of the three letters, and one at the end of the string.
Since your regex first matches the foo, it is left at the position after the last o, i.e. at the end of the input, so it is done with that round of searching, but that doesn't mean it is done with the overall search.
It just ends the matching for the first iteration of matching, and leaves the search position at the end of the input.
On the next iteration, it can make a zero-length match, so it will. Of course, after a zero-length match, it must advance, otherwise it'll just stay there forever, and advancing from the last position of the input stops the overall search, which is why there is no third iteration.
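The behavior described above can be observed directly. This is a small sketch; the class name and the allMatches helper are hypothetical, and each entry records the start index together with the matched text so the empty matches are visible:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthDemo {
    // Hypothetical helper: returns "start:'text'" for every successful find().
    static List<String> allMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.start() + ":'" + m.group() + "'");
        }
        return out;
    }

    public static void main(String[] args) {
        // No digits in "abc": one empty match before each letter and one at the end.
        System.out.println(allMatches("\\d*", "abc")); // prints [0:'', 1:'', 2:'', 3:'']
        // "foo" is matched first, then an empty match at the end of the input.
        System.out.println(allMatches("\\w*$", "foo")); // prints [0:'foo', 3:'']
    }
}
```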
To fix the regex, so it doesn't do that, you can use the regex \w*\s+|\w+$, which will match:
Words followed by 1 or more spaces (spaces included in match)
"Nothing" followed by 1 or more spaces
A word at the end of the input
Because neither part of the | can be an empty match, what you experienced cannot happen. However, using \w* means that you will still find matches without any word in it, e.g.
He said: "It's done"
With that input, the regex will match:
"He "
" " the space after the :
"s " match after the '
Unless that's really what you want, you should just change regex to use + instead of *, i.e. \w+(\s+|$)
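To compare the suggested patterns side by side, here is a hedged sketch (the class name and the matches helper are illustrative, not from the answer) that lists what each one finds:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatchDemo {
    // Hypothetical helper: returns the text of every successful find().
    static List<String> matches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "He said: \"It's done\"";
        // \w* still allows matches that contain no word at all.
        System.out.println(matches("\\w*\\s+|\\w+$", sentence)); // "He ", " ", "s "
        // \w+ requires at least one word character in every match.
        System.out.println(matches("\\w+(\\s+|$)", sentence));   // "He ", "s "
        // With \w+ the extra empty match on "foo" disappears.
        System.out.println(matches("\\w+(\\s+|$)", "foo"));      // "foo"
    }
}
```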
There are 2 matches: one for foo, and one empty match at the position right after it (foo<here>).
If the match position changes and the regex has the option to match nothing, it will match an extra time. This only occurs once per match position, to avoid an endless loop of infinite un-wisdom.
It really has nothing to do with the EOS anchor, other than that the anchor provides the option to match nothing. You'd get the same with \w* on foo, i.e. 2 matches.

Continue scanning a string until it has found the first/last occurrence of a string

I have this line of text that I want to scan using regex.
axhaweacb
I want to get the text from "a" to "b". This is my current pattern:
pattern = "a.*?b";
The current output is: axhaweacb (it's taking everything in between a and b), but what I want to receive back is "acb".
Why you may ask? The logic/regex I am trying to apply is:
When you find the first occurrence of the "from" regex ("a"), start scanning. If you find another occurrence of the "from" letter without finding the "last" occurrence of a letter - in this case "b", remove the previous string - which is axh so that the string becomes: aweacb. If you find another occurrence of "from" - in this case a, without finding "to" - b. Remove the previous string so that it becomes acb. Then start scanning again. In this case we have found our pattern - a to b, without another "a" in our way.
I know that I can substring the string to begin with, and strip everything up to the last occurrence of "a", but I want to reuse this for different strings as well. In that case, it will always strip everything up to the last occurrence of something, which results in removing a lot of data.
I hope I made my question/problem clear. If not, please tell me and I will do my best to clarify my problem.
Thank you.
The regex engine searches for a match from left to right. With a.*?b, the a it finds first is the first a in your string. The first b after that a is the last character of your axhaweacb string, so the match runs to the end.
A lazy quantifier only extends the match up to the nearest occurrence of the subsequent subpattern to the right of where the match started; it does not look for the shortest possible substring.
So, what you need is a way to exclude (i.e. fail if found) all occurrences of the leading and trailing subpatterns in between them.
It can be done with the help of a tempered greedy token:
pattern = "a(?:(?!a|b).)*b";
            ^^^^^^^^^^^^^
You can use this negative lookahead based regex:
a(?:(?![ab]).)*b
(?![ab]) is a negative lookahead that fails if the next character is a or b.
(?:(?![ab]).)* matches 0 or more characters that are neither a nor b, thus giving us the shortest match between a and b.
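As a rough Java sketch of the tempered greedy token on the string from the question (the class and method names are hypothetical), compared with the lazy pattern that was tried first:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedToken {
    // Hypothetical helper: returns the first match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "axhaweacb";
        // Lazy quantifier: starts at the first a and runs to the first b after it.
        System.out.println(firstMatch("a.*?b", input));            // prints axhaweacb
        // Tempered token: no a or b is allowed between the delimiters.
        System.out.println(firstMatch("a(?:(?![ab]).)*b", input)); // prints acb
    }
}
```

The lookahead form is written for single characters here to mirror the answers; it would also accept multi-character "from" and "to" subpatterns inside the lookahead.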

Regex to replace repeated characters

Can someone give me a Java regex to replace the following.
If I have a word like this "Cooooool", I need to convert this to "Coool" with 3 o's. So that I can distinguish it with the normal word "cool".
Another ex: "happyyyyyy" should be "happyyy"
replaceAll("(.)\\1+","$1")
I tried this but it removes all the repeating characters leaving only one.
Change your regex like below.
string.replaceAll("((.)\\2{2})\\2+","$1");
( starts the first capturing group.
(.) captures any character. For this case, you may use [a-z] instead.
\\2 refers to the second capturing group; \\2{2} requires that character to be repeated exactly two more times.
) ends the first capturing group, so it captures the first three repeated characters.
\\2+ matches one or more further repetitions of the second group, which the replacement drops.
I think you might want something like this:
str.replaceAll("([a-zA-Z])\\1\\1+", "$1$1$1");
This will match where a character is repeated 3 or more times and will replace it with the same character, three times.
$1 refers to only one character, because the group encloses just the single character class.
\\1\\1+ matches only if that character occurs at least two more times, i.e. at least three times in a row.
This call is also a lot more readable than a larger regex that uses only one $1.
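Both suggestions can be tried on the words from the question. This is a sketch only; the class and method names are invented for the example:

```java
class RepeatSquash {
    // First answer: keep the first three repeats (group 1), drop the rest.
    static String capAtThree(String s) {
        return s.replaceAll("((.)\\2{2})\\2+", "$1");
    }

    // Second answer: letters only, rewrite any run of 3+ as exactly three.
    static String capLettersAtThree(String s) {
        return s.replaceAll("([a-zA-Z])\\1\\1+", "$1$1$1");
    }

    public static void main(String[] args) {
        System.out.println(capAtThree("Cooooool"));          // prints Coool
        System.out.println(capLettersAtThree("happyyyyyy")); // prints happyyy
        System.out.println(capAtThree("cool"));              // prints cool (unchanged)
    }
}
```

The two differ on non-letters: the first pattern also caps runs of digits or punctuation, while the second leaves them alone.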

How to negate a vowel condition using Regex in java

I'm trying to construct a Regex for a string which should have these following conditions:
It must contain at least one vowel.
It cannot contain three consecutive vowels or three consecutive consonants.
It cannot contain two consecutive occurrences of the same letter, except for 'ee' or 'oo'.
I'm not able to construct regex for 2nd and 3rd conditions.
e.g:
bower - accepted,
appple - not accepted,
miiixer - not accepted,
hedding - not accepted,
feeding - accepted
Thanks in advance!
Edited:
My code:
Pattern ptn = Pattern.compile("((.*[A-Za-z0-9]*)(.*[aeiou|AEIOU]+)(.*[##$%]).*)(.*[^a]{3}.*)");
Matcher mtch = ptn.matcher("zoggax");
if (mtch.find()) {
    return true;
}
else
    return false;
The following one should suit your needs:
(?=.*[aeiouy])(?!.*[aeiouy]{3})(?!.*[a-z&&[^aeiouy]]{3})(?!.*([a-z&&[^eo]])\1).*
In Java:
String regex = "(?=.*[aeiouy])(?!.*[aeiouy]{3})(?!.*[a-z&&[^aeiouy]]{3})(?!.*([a-z&&[^eo]])\\1).*";
System.out.println("bower".matches(regex));
System.out.println("appple".matches(regex));
System.out.println("miiixer".matches(regex));
System.out.println("hedding".matches(regex));
System.out.println("feeding".matches(regex));
Prints:
true
false
false
false
true
Explanation:
(?=.*[aeiouy]): contains at least one vowel
(?!.*[aeiouy]{3}): does not contain 3 consecutive vowels
(?!.*[a-z&&[^aeiouy]]{3}): does not contain 3 consecutive consonants
[a-z&&[^aeiouy]]: any letter between a and z but none of aeiouy
(?!.*([a-z&&[^eo]])\1): does not contain 2 consecutive letters, except e and o
[a-z&&[^eo]]: any letter between a and z, but none of eo
See http://www.regular-expressions.info/charclassintersect.html.
This should work for English under the assumption that 'y' is a non-vowel:
^(?!.*[aeiou]{3})(?!.*[bcdfghjklmnpqrstvwxyz]{3})(?!.*([^eo])\1).*[aeiou]
Explanation:
^ fixes the match to the beginning of the string.
(?!.*[aeiou]{3}) checks that you cannot find 3 consecutive vowels at any point after the current position in the string. (Since this is immediately after the ^, it checks the entire string.) It also does not advance the cursor.
Non-vowels are tested similarly. This can be done in a prettier way if your regexp flavor supports set subtraction; Java does, through character class intersection such as [a-z&&[^aeiou]].
(?!.*([^eo])\1) checks that there is no occurrence of a single-character capture group, of a character other than e or o, which is followed by a copy of itself. I.e. no character other than e and o occurs twice in a row.
.*[aeiou] looks for a vowel at some point in the string.
This regexp also assumes that the case-insensitive flag is set. That is not the default in Java: pass Pattern.CASE_INSENSITIVE to Pattern.compile or prefix the regex with (?i).
It is also a regexp that will find a match in a string satisfying your criteria. It will not necessarily match the whole string. If that is needed, add .*$ to the end of the regexp.
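Java's Pattern is case-sensitive unless a flag is given, so a sketch of this answer in Java would set the flag explicitly. The class and method names below are assumptions for illustration, and find() is used because the regex is anchored only at the start:

```java
import java.util.regex.Pattern;

class VowelRules {
    // Regex from the answer above; 'y' counts as a consonant here.
    private static final Pattern RULES = Pattern.compile(
            "^(?!.*[aeiou]{3})(?!.*[bcdfghjklmnpqrstvwxyz]{3})(?!.*([^eo])\\1).*[aeiou]",
            Pattern.CASE_INSENSITIVE); // must be requested explicitly in Java

    // Hypothetical helper: true if the word satisfies all three conditions.
    static boolean accepted(String word) {
        return RULES.matcher(word).find();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"bower", "appple", "miiixer", "hedding", "feeding"}) {
            System.out.println(w + " -> " + accepted(w));
        }
    }
}
```

For the five sample words this should agree with the accepted/not accepted list in the question.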
If my hunch is correct that you meant to say "three consecutive occurrences of the same letter" (looking at your examples) then you can simply say "e and o may not occur thrice, everything else may not occur twice", like so:
^(?=.*[aeiouy].*)(?!.*([eo])\1\1.*)(?!.*([a-df-np-z])\2.*).*$
The key is that a letter occurring thrice is also occurring twice.

Will this regex always work according to requirements stated below?

Is this regex correct to break a sentence into 3 tokens:
Characters before lowercase letters inside parentheses
Lowercase letters inside parentheses, including the parentheses
Characters after the parenthesized lowercase letters
System.out.println("This is (a) test".matches("^(.*)?\\([a-z]*\\)(.*)?$"));
The string may or may not contain parenthesized lowercase letters, and they may appear anywhere in the sentence. If you see a flaw or a use case I haven't considered, can you provide the corrected regex?
For the example above:
Group1 captures This is
Group2 captures (a)
Group3 captures test
EDIT: How do I change the regex to achieve the following?
If the string is (foo)(bar)(baz), how do I capture group1=empty, group2=(foo) and group3=empty, and find the above pattern three times because there are 3 pairs of parentheses?
Separate from examining the regex, whenever I write a regex, I write a series of unit tests to cover each case. I'd suggest you do the same. Create four tests (at least) using the regex and testing against the strings:
(a) This is test
This is (a) test
This is test (a)
This is a test
That should cover each of the cases you've described. That's much easier and faster than trying to hand analyze the regex for each case.
If you want to ensure that there are characters inside your lowercase parentheses, you should use +, which stands for one or more times:
[a-z]+
The way it is, the leading .* is greedy, so This is (a) (b) test will yield
Group1 captures This is (a)
Group2 captures (b)
Group3 captures test
If Group2 is expected to be (a) you should use a lazy (reluctant) quantifier in Group1, i.e. (.*?)
Suggested test cases:
the empty string (really empty)
foo(bar)baz
(foo)(bar)(baz)
(foo)bar(baz)
foo(bar)(baz)bing
foo(bar)baz(bing)
foo(bar)
(foo)bar
Your regex has a little problem.
You say in your definition that you have 3 groups, when in fact your pattern contains 2.
Using literal parentheses doesn't count as a group, so you'd need to use something like this:
"^(.*)?(\\([a-z]*\\))(.*)?$"
Or if you don't really want the parentheses, just the letters, you can change the order:
"^(.*)?\\(([a-z]*)\\)(.*)?$"
Other than that, it seems to be OK, but keep in mind that the lowercase letters between the parentheses are not mandatory in your pattern.
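A short sketch of the three-group version in Java, using the example sentence from the question. The class and method names are hypothetical, and the method returns null when the sentence has no parenthesized lowercase token:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenTokens {
    // Group 2 now wraps the literal parentheses, giving three real groups.
    private static final Pattern TOKENS =
            Pattern.compile("^(.*)?(\\([a-z]*\\))(.*)?$");

    // Hypothetical helper: returns {before, parenthesized, after}, or null.
    static String[] split(String sentence) {
        Matcher m = TOKENS.matcher(sentence);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        // Surrounding spaces are kept in groups 1 and 3.
        System.out.println(Arrays.toString(split("This is (a) test"))); // prints [This is , (a),  test]
        System.out.println(Arrays.toString(split("This is a test")));   // prints null
    }
}
```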
In Python:
>>> import re
>>> r = re.compile(r'([^()]*)(\([a-z)(]*\))([^()]*)')
>>> r.match('abc(xx)dd').groups()
('abc', '(xx)', 'dd')
>>> r.match('abc(xx)(dd)dd').groups()
('abc', '(xx)(dd)', 'dd')
>>> r.match('(abc)').groups()
('', '(abc)', '')
If you want the first and third groups to contain all characters before and after the parentheses, you must make sure they exclude ( and ) (your .* will also match text that contains parentheses, such as (foo)(bar) in your second example).
So I'd replace .* with [^\\(\\)]*.
Also, if you want to match strings that contain many substrings of the second group (like in your second example), you should have * after the second group.
My result was this:
^([^\\(\\)]*)?(\\([a-z]*\\))*([^\\(\\)]*)?$
This will work for the first example and the second, but the second group will end up storing only the last one found: (bz).
If you want to be able to capture the second group 3 times, as you said for your second example, you could try using while (m.find()) instead of if (m.matches()) (m is a Matcher object), and also change your regex a little, to this:
([^\\(\\)]*)(\\([a-z]*\\))([^\\(\\)]*)
This will capture the second group for every possible match in your string: (foo), (bar), (bz).
Edit:
For some reason that I can't really explain, for me it doesn't find (foo), only the other two. So I wrote a piece of code that tries to apply find() with a parameter, explicitly starting from some position, where the last found group ended:
String regex = "([^\\(\\)]*)(\\([a-z]*\\))([^\\(\\)]*)";
String text = "(foo)(bar)(bz)";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(text);
for (int reg = 0; reg < text.length(); reg += (m.end() - m.start()))
    if (m.find(reg))
        for (int group = 1; group <= m.groupCount(); group++)
            System.out.println("Group " + group + ": " + m.group(group));
This works, and the output is:
Group 1:
Group 2: (foo)
Group 3:
Group 1:
Group 2: (bar)
Group 3:
Group 1:
Group 2: (bz)
Group 3:
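For comparison, a plain while (m.find()) loop is sketched below. It assumes a freshly created Matcher on which no earlier find() or matches() call has been made; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenFinder {
    // Hypothetical helper: collects group 2 of every successive match.
    static List<String> parenthesized(String text) {
        Pattern p = Pattern.compile("([^()]*)(\\([a-z]*\\))([^()]*)");
        Matcher m = p.matcher(text); // fresh matcher, nothing consumed yet
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(parenthesized("(foo)(bar)(bz)")); // prints [(foo), (bar), (bz)]
    }
}
```

Inside a character class the parentheses need no escaping, so [^()] is equivalent to [^\\(\\)] here.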
