Please justify the output in Regex Java program


I came across a Java program that uses a regex.
Here is the code:
import java.util.regex.*;
public class Regex_demo01 {
    public static void main(String[] args) {
        boolean b = true;
        Pattern p = Pattern.compile("\\d*");
        Matcher m = p.matcher("ab34ef");
        while (b = m.find()) {
            System.out.println(b);
            System.out.println(">" + m.start() + "\t" + m.group() + "<");
        }
    }
}
Output :
true
>0 <
true
>1 <
true
>2 34<
true
>4 <
true
>5 <
true
>6 <
Doubt: as we know, the find() method returns true if it gets a match and remembers the start position of the match. If find() returns true, you can call the start() method to get the starting position of the match, and you can call the group() method to get the string that represents the actual bit of source data that was matched.
My question is: how come ">6 <" is present in the output when the string's last index is 5?

The answer is simple: x* matches any number of x, including zero.
Replace * with +, which matches one or more of the element to its left.

My question is how come >6 < is present in the output when the string indexing is till index 5?
That behavior is due to your regex, \\d*, which matches 0 or more digits.
As you can see, it also reports a match at start position 0, even though there is no digit at the start.
Similarly, 6 is the last index + 1, because there is an empty match past the last character as well.
If you only want actual digits, use \\d+ as your regex.

The star quantifier (*) is defined as "zero or more times". That said, your pattern matches zero digits most of the time.
What you actually want is probably the plus quantifier (+), which means "one or more times".
Source: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Why is there a match at index 6?
A regex engine doesn't work on characters so much as on the positions between them. When a pattern can match an empty string, the engine tries it before and after every character. Each position is reported only once, so the empty string after the first character and before the second yields one match, not two. Quantifiers are greedy by default, which means they match as many characters as possible.
Consider this example:
Input string is 1
RegEx is \\d*
In this case the regex engine starts before the first character and tries to match zero or more digits. Since it's greedy, it doesn't stop at the empty string it finds at the beginning: it consumes the '1', finds no further digits, and reports "1" as the first match. Then it continues the search after that match. There it finds an empty string and matches it too, since that counts as zero digits.
To the regex engine, the string '1' looks rather like this:
"" + "1" + ""
The first two units (the empty string and the "1") together form the first match; the third unit, the trailing empty string, is the second match.
In-depth article about this: http://www.regular-expressions.info/zerolength.html
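The difference between \d* and \d+ on the question's input can be checked with a small sketch like the one below. The class name ZeroLengthMatchDemo and the helper findAll are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthMatchDemo {

    // Hypothetical helper: records every match as "start:text",
    // so that empty matches remain visible in the result.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.start() + ":" + m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // \d* may match zero digits, so it also matches at index 6 (end of input).
        System.out.println(findAll("\\d*", "ab34ef")); // [0:, 1:, 2:34, 4:, 5:, 6:]
        // \d+ needs at least one digit, so only the run of digits is found.
        System.out.println(findAll("\\d+", "ab34ef")); // [2:34]
    }
}
```

The empty match at index 6 corresponds to the position after the last character, which is why it disappears as soon as the pattern requires at least one digit.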

Related

Masking part of the string with a regex

The idea is to mask a string like it's done with a credit cards. It can be done with this one line of code. And it works. However I can't find any straightforward explanations of the regex used in this case.
public class Solution {
    public static void main(String[] args) {
        String t1 = "518798673672531762319871";
        System.out.println(t1.replaceAll(".(?=.{4})", "*"));
    }
}
Output is: ********************9871

Explanation of regex:
.(?=.{4})
.: Match any character
(?=: Start of a lookahead condition
.{4}: that asserts presence of 4 characters
): End of the lookahead condition
In simple words it matches any character in input that has 4 characters on right hand side of the current position.
The replacement is "*", which means each matched character in the input is replaced by a single * character. This replaces every character in the credit card number except the last 4, where the lookahead condition fails (since there are no longer 4 characters ahead of the current position).
Read more on look arounds in regex

(?=.{4}) is a positive lookahead. It checks for the pattern inside the parentheses (the next 4 characters after the current one) without including them in the match; only the . outside the parentheses is matched and replaced by *.
Imagine the regex engine going through the input character by character. On the first digit (5) it asks: "is there a single character followed by 4 other characters? Yes, OK, replace [the 5] with *".
It repeats this until the 9 (4th from the end), at which point the answer to "are there another 4 characters after this?" becomes "no" and the replacing stops.
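To see how the lookahead behaves on inputs of different lengths, here is a minimal sketch. The class MaskDemo and the method maskAllButLastFour are hypothetical names; the regex is the one from the question.

```java
class MaskDemo {

    // Hypothetical helper: replaces every character that still has at least
    // four characters to its right, so the last four stay readable.
    static String maskAllButLastFour(String s) {
        return s.replaceAll(".(?=.{4})", "*");
    }

    public static void main(String[] args) {
        System.out.println(maskAllButLastFour("518798673672531762319871")); // ********************9871
        System.out.println(maskAllButLastFour("12345")); // *2345
        // Fewer than five characters: the lookahead never succeeds, nothing is masked.
        System.out.println(maskAllButLastFour("123")); // 123
    }
}
```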

Why does the regex \w*(\s+|$) find 2 matches for "foo" (Java)?

Given the regular expression \w*(\s+|$) and the input "foo", I would expect Java's Matcher.find() to be true just once: \w* would consume foo, and the $ in (\s+|$) should consume the end of the string.
I can't understand why a second find() would also be true with an empty match.
Sample code:
public static void main(String[] args) {
    Pattern p = Pattern.compile("\\w*(\\s+|$)");
    Matcher m = p.matcher("foo");
    while (m.find()) {
        System.out.println("'" + m.group() + "'");
    }
}
Expected (by me) output:
'foo'
Actual output:
'foo'
''
UPDATE
My regex example should have been just \w*$ to simplify the discussion; it produces exactly the same behavior.
So the issue seems to be how zero-length matches are handled.
I found the method Matcher.hitEnd(), which tells you whether the end of the input was hit during the last match operation, so that you know you don't need another Matcher.find():
while (!m.hitEnd() && m.find()) {
System.out.println("'" + m.group() + "'");
}
The !m.hitEnd() needs to be before the m.find() in order not to miss the last word.

The expression \\w* matches zero or more word characters, because you are using the Kleene star.
One quick workaround is to change the expression to \\w+.
Edit:
After reading the documentation for Matcher: the find method "starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.". In this case, the first call matched all the characters, so the second call starts at the end of the input, where only an empty match is possible.

Your regex can result in a zero-length match, because \w* can be zero-length, and $ is always zero-length.
For full description of zero-length matches, see "Zero-Length Regex Matches" on http://www.regular-expressions.info.
The most relevant part is in the section named "Advancing After a Zero-Length Regex Match":
If a regex can find zero-length matches at any position in the string, then it will. The regex \d* matches zero or more digits. If the subject string does not contain any digits, then this regex finds a zero-length match at every position in the string. It finds 4 matches in the string abc, one before each of the three letters, and one at the end of the string.
Since your regex first matches the foo, it is left at the position after the last o, i.e. at the end of the input, so it is done with that round of searching, but that doesn't mean it is done with the overall search.
It just ends the matching for the first iteration of matching, and leaves the search position at the end of the input.
On the next iteration, it can make a zero-length match, so it will. Of course, after a zero-length match, it must advance, otherwise it would just stay there forever, and advancing from the last position of the input stops the overall search, which is why there is no third iteration.
To fix the regex, so it doesn't do that, you can use the regex \w*\s+|\w+$, which will match:
Words followed by 1 or more spaces (spaces included in match)
"Nothing" followed by 1 or more spaces
A word at the end of the input
Because neither part of the | can be an empty match, what you experienced cannot happen. However, using \w* means that you will still find matches without any word in it, e.g.
He said: "It's done"
With that input, the regex will match:
"He "
" " the space after the :
"s " match after the '
Unless that's really what you want, you should just change regex to use + instead of *, i.e. \w+(\s+|$)
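The second, empty match and the effect of the suggested patterns can be reproduced with a short sketch. The class name ZeroLengthAtEndDemo and the helper findAll are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthAtEndDemo {

    // Hypothetical helper: returns the text of every match found by find().
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Original pattern: "foo", then a zero-length match at the end of input.
        System.out.println(findAll("\\w*(\\s+|$)", "foo").size()); // 2
        // With + instead of *, an empty match is no longer possible.
        System.out.println(findAll("\\w+(\\s+|$)", "foo").size()); // 1
        // \w*\s+|\w+$ still matches runs of whitespace without a word in front.
        for (String hit : findAll("\\w*\\s+|\\w+$", "He said: \"It's done\"")) {
            System.out.println("'" + hit + "'"); // 'He ', ' ', 's '
        }
    }
}
```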

There are 2 matches: one for foo, and one for the empty string at the position right after foo.
If the match position changes and the regex has the option to match nothing, it will match an extra time.
This only occurs once per match position, which avoids an endless loop.
It really has nothing to do with the end-of-string anchor, other than that the anchor provides the option to match nothing.
You'd get the same with \w* on foo, i.e. 2 matches.

How to negate a vowel condition using Regex in java

I'm trying to construct a regex for a string that must satisfy the following conditions:
It must contain at least one vowel.
It cannot contain three consecutive vowels or three consecutive consonants.
It cannot contain two consecutive occurrences of the same letter, except for 'ee' or 'oo'.
I'm not able to construct a regex for the 2nd and 3rd conditions.
e.g:
bower - accepted,
appple - not accepted,
miiixer - not accepted,
hedding - not accepted,
feeding - accepted
Thanks in advance!
Edited:
My code:
Pattern ptn = Pattern.compile("((.*[A-Za-z0-9]*)(.*[aeiou|AEIOU]+)(.*[##$%]).*)(.*[^a]{3}.*)");
Matcher mtch = ptn.matcher("zoggax");
if (mtch.find()) {
return true;
}
else
return false;

The following one should suit your needs:
(?=.*[aeiouy])(?!.*[aeiouy]{3})(?!.*[a-z&&[^aeiouy]]{3})(?!.*([a-z&&[^eo]])\1).*
In Java:
String regex = "(?=.*[aeiouy])(?!.*[aeiouy]{3})(?!.*[a-z&&[^aeiouy]]{3})(?!.*([a-z&&[^eo]])\\1).*";
System.out.println("bower".matches(regex));
System.out.println("appple".matches(regex));
System.out.println("miiixer".matches(regex));
System.out.println("hedding".matches(regex));
System.out.println("feeding".matches(regex));
Prints:
true
false
false
false
true
Explanation:
(?=.*[aeiouy]): contains at least one vowel
(?!.*[aeiouy]{3}): does not contain 3 consecutive vowels
(?!.*[a-z&&[^aeiouy]]{3}): does not contain 3 consecutive consonants
[a-z&&[^aeiouy]]: any letter between a and z but none of aeiouy
(?!.*([a-z&&[^eo]])\1): does not contain 2 consecutive letters, except e and o
[a-z&&[^eo]]: any letter between a and z, but none of eo
See http://www.regular-expressions.info/charclassintersect.html.

This should work for English under the assumption that 'y' is a non-vowel:
^(?!.*[aeiou]{3})(?!.*[bcdfghjklmnpqrstvwxyz]{3})(?!.*([^eo])\1).*[aeiou]
Explanation:
^ fixes the match to the beginning of the string.
(?!.*[aeiou]{3}) checks that you cannot find 3 consecutive vowels at any point after the current position in the string. (Since this is immediately after the ^, it checks the entire string.) It also does not advance the cursor.
Non-vowels are tested similarly. This can be written more neatly if your regex flavor supports character class subtraction; Java does, via intersection with a negated class, e.g. [a-z&&[^aeiou]].
(?!.*([^eo])\1) checks that there is no occurrence of a single-character capture group, of characters other than e or o, which is followed by a copy of itself. I.e. no character other than e and o appears twice in a row.
.*[aeiou] looks for a vowel at some point in the string.
This regex also assumes that the case-insensitive flag is set. Java's regex engine is case-sensitive by default, so pass Pattern.CASE_INSENSITIVE or prefix the pattern with (?i).
It is also a regex that will find a match in a string satisfying your criteria. It will not necessarily match the whole string. If that is needed, add .*$ to the end of the regex.
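As a sketch of how this second pattern could be used from Java, the snippet below compiles it with an explicit Pattern.CASE_INSENSITIVE flag (java.util.regex is case-sensitive unless told otherwise) and uses find(), since the pattern is not written to consume the whole string. The class VowelRuleDemo and the method accepts are hypothetical names.

```java
import java.util.regex.Pattern;

class VowelRuleDemo {

    // The regex from this answer; 'y' is treated as a consonant here.
    static final Pattern RULES = Pattern.compile(
            "^(?!.*[aeiou]{3})(?!.*[bcdfghjklmnpqrstvwxyz]{3})(?!.*([^eo])\\1).*[aeiou]",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the word satisfies all three conditions.
    static boolean accepts(String word) {
        return RULES.matcher(word).find();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"bower", "appple", "miiixer", "hedding", "feeding"}) {
            System.out.println(w + " -> " + accepts(w));
        }
    }
}
```

For the five sample words from the question this should print true, false, false, false, true, in that order.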

If my hunch is correct that you meant to say "three consecutive occurrences of the same letter" (looking at your examples) then you can simply say "e and o may not occur thrice, everything else may not occur twice", like so:
^(?=.*[aeiouy].*)(?!.*([eo])\1\1.*)(?!.*([a-df-np-z])\2.*).*$
The key is that a letter occurring thrice is also occurring twice.

Java Regex of String start with number and fixed length

I wrote a regular expression to check that a string has a fixed length, consists only of digits, and starts with a given number, e.g. 123.
This is my expression:
REGEX = "^123\\d+{9}$";
But it fails to check the length of the string. It should accept only strings whose length is 9 and that start with 123.
However, if I pass the string 1234567891 it is accepted as well. What am I doing wrong?

Like already answered here, the simplest way is just removing the +:
^123\\d{9}$
or
^123\\d{6}$
Depending on what you need exactly.
You can also use another, a bit more complicated and generic approach, a negative lookahead:
(?!.{10,})^123\\d+$
Explanation:
(?!.{10,}) is a negative look-ahead ((?= would be a positive look-ahead). It means that if the text after the look-ahead position matches the pattern inside it, the overall match fails. In other words, the regular expression can only match if the pattern in the negative look-ahead doesn't match.
In this case, the string matches only if .{10,} (10 or more characters) doesn't match, so it only matches strings of up to 9 characters.
A positive look-ahead does the opposite: it only matches if the pattern in the look-ahead also matches.
I'm just putting this here for curiosity's sake; it's more complex than what you need for this.
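A quick way to check the look-ahead variant is the sketch below. The class LookaheadLengthDemo and the method startsWith123AtMostNine are hypothetical names. Note that this pattern caps the length at 9 rather than fixing it at exactly 9.

```java
class LookaheadLengthDemo {

    // Hypothetical helper: digits only, starting with 123, at most 9 characters.
    static boolean startsWith123AtMostNine(String input) {
        // (?!.{10,}) rejects any input with 10 or more characters.
        return input.matches("(?!.{10,})^123\\d+$");
    }

    public static void main(String[] args) {
        System.out.println(startsWith123AtMostNine("123456789"));  // true
        System.out.println(startsWith123AtMostNine("1234567891")); // false (10 characters)
        System.out.println(startsWith123AtMostNine("1234"));       // true (shorter is allowed)
    }
}
```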

Try using this one:
^123\\d{6}$
I changed it to 6 because 1, 2, and 3 should probably still count as digits.
Also, I removed the +. With it, the pattern would match 1 or more \ds (i.e. an unbounded number of digits).

Based on your comment below Doorknobs's answer you can do this:
int length = 9;
String prefix = "123"; // or whatever
// No space inside the braces: this builds \d{6}, not \d{ 6}
String regex = "^" + prefix + "\\d{" + (length - prefix.length()) + "}$";
if (input.matches(regex)) {
    // good
} else {
    // bad
}
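Wrapped into a self-contained form, the same idea might look like the sketch below. The class FixedLengthDemo and the method hasPrefixAndLength are hypothetical, and the sketch assumes the prefix consists of digits and is no longer than the total length. Pattern.quote is added as a precaution in case the prefix ever contains regex metacharacters.

```java
import java.util.regex.Pattern;

class FixedLengthDemo {

    // Hypothetical helper: true if input starts with prefix, is followed only by
    // digits, and has exactly totalLength characters overall.
    static boolean hasPrefixAndLength(String input, String prefix, int totalLength) {
        int remaining = totalLength - prefix.length(); // digits still needed after the prefix
        String regex = "^" + Pattern.quote(prefix) + "\\d{" + remaining + "}$";
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(hasPrefixAndLength("123456789", "123", 9));  // true
        System.out.println(hasPrefixAndLength("1234567891", "123", 9)); // false (too long)
        System.out.println(hasPrefixAndLength("12345678", "123", 9));   // false (too short)
    }
}
```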

How to convert "string" to "*s*t*r*i*n*g*"

I need to convert a string like
"string"
to
"*s*t*r*i*n*g*"
What's the regex pattern? Language is Java.

You want to match an empty string, and replace with "*". So, something like this works:
System.out.println("string".replaceAll("", "*"));
// "*s*t*r*i*n*g*"
Or better yet, since the empty string can be matched literally without regex, you can just do:
System.out.println("string".replace("", "*"));
// "*s*t*r*i*n*g*"
Why this works
It's because any string startsWith(""), endsWith(""), and contains(""). Between any two characters in any string, there's an empty string. In fact, there are infinitely many empty strings at these locations.
(And yes, this is true for the empty string itself. That is, the empty string contains itself!)
The regex engine and String.replace automatically advance the index when looking for the next match in these cases, to prevent an infinite loop.
A "real" regex solution
There's no need for this, but it's shown here for educational purpose: something like this also works:
System.out.println("string".replaceAll(".?", "*$0"));
// "*s*t*r*i*n*g*"
This works by matching any character with . and replacing it with * followed by that character, via a backreference to group 0.
To add the asterisk after the last character, we make the . optional with .?. This works because ? is greedy and will always take a character if possible, i.e. everywhere except at the very end of the string, where it matches the empty string.
If the string may contain newline characters, use Pattern.DOTALL/(?s) mode.
References
regular-expressions.info/Dot Matches (Almost) Any Character and Grouping and Backreferences
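The three variants from this answer can be compared side by side with a sketch like this one. The class StarInterleaveDemo and its method names are made up for illustration; the (?s) flag in the last variant is the DOTALL mode mentioned above.

```java
class StarInterleaveDemo {

    // Regex version: the empty pattern matches at every position.
    static String viaEmptyRegex(String s) {
        return s.replaceAll("", "*");
    }

    // Literal version: String.replace with an empty target, no regex involved.
    static String viaLiteralReplace(String s) {
        return s.replace("", "*");
    }

    // Optional-dot version: $0 is the whole match; (?s) lets . match newlines too.
    static String viaOptionalDot(String s) {
        return s.replaceAll("(?s).?", "*$0");
    }

    public static void main(String[] args) {
        System.out.println(viaEmptyRegex("string"));     // *s*t*r*i*n*g*
        System.out.println(viaLiteralReplace("string")); // *s*t*r*i*n*g*
        System.out.println(viaOptionalDot("string"));    // *s*t*r*i*n*g*
    }
}
```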

I think "" is the regex you want.
System.out.println("string".replaceAll("", "*"));
This prints *s*t*r*i*n*g*.

If this is all you're doing, I wouldn't use a regex:
public static String glitzItUp(String text) {
    return insertPeriodically(text, "*", 1);
}
Putting char into a java string for each N characters
public static String insertPeriodically(
        String text, String insert, int period)
{
    StringBuilder builder = new StringBuilder(
            text.length() + insert.length() * (text.length()/period)+1);
    int index = 0;
    while (index <= text.length())
    {
        builder.append(insert);
        builder.append(text.substring(index,
                Math.min(index + period, text.length())));
        index += period;
    }
    return builder.toString();
}
Another benefit (besides simplicity) is that it's about ten times faster than a regex.

Just to be a jerk, I'm going to say use J:
I've spent a school year learning Java and taught myself a bit of J over the summer. If you're doing this for yourself, it's probably most productive to use J, simply because inserting asterisks like this is easily done with one simple verb definition using one loop.
asterisked =: 3 : 0
i =. 0
running_String =. '*'
while. i < #y do.
NB. #y returns tally, or number of items in y: right operand to the verb
running_String =. running_String, (i{y) , '*'
i =. >: i
end.
]running_String
)
This is why I would use J: I know how to do this, and have only studied the language for a couple months loosely. This isn't as succinct as the whole .replaceAll() method, but you can do it yourself quite easily and edit it to your specifications later. Feel free to delete this/ troll this/ get inflamed at my suggestion of J, I really don't care: I'm not advertising it.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.
