Java regex behaving weird [duplicate] - java

I have the below test case,
@Test
public void test_check_pattern_match_caseInSensitive_for_pre_sampling_filename() {
// given
String pattern = "Sample*.*Selection*.*Preliminary";
// when
// then
assertThat(Util.checkPatternMatchCaseInSensitive(pattern, "Sample selectiossn preliminary"), is(false));
assertThat(Util.checkPatternMatchCaseInSensitive(pattern, "sample selection preliminary"), is(true));
}
The Util method is:
public static boolean checkPatternMatchCaseInSensitive(String pattern, String value) {
Pattern p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = p.matcher(value);
if (matcher.find())
return true;
return false;
}
Can someone please explain why the regex Sample*.*Selection*.*Preliminary matches the file name "Sample selectiossn preliminary"?
I expected this test case to pass, but it fails on the first assert.

In a regex, * means zero or more of the preceding element, while . means any single character.
What your expression is looking for is:
1. Exactly Sampl
2. Zero or more e
3. Zero or more of any character
4. Exactly Selectio
5. Zero or more n
6. Zero or more of any character
7. And so on
The problem lies in points 5 and 6: no n is found for point 5 (which is allowed, because zero occurrences still match), and "ssn " is then consumed by point 6.
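The breakdown above can be checked directly. This is only a sketch: the class name StarQuantifierDemo is made up, and the replacement pattern Sample.*Selection.*Preliminary is an assumption about what the question intended (any text between the three words), not something the original post states.

```java
import java.util.regex.Pattern;

class StarQuantifierDemo {
    // Same logic as the question's Util method: case-insensitive, find() anywhere in the input.
    static boolean found(String regex, String value) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(value).find();
    }

    public static void main(String[] args) {
        String original = "Sample*.*Selection*.*Preliminary";
        // Assumed intent: ".*" alone between the words, without the stray "*" after "e" and "n".
        String assumedFix = "Sample.*Selection.*Preliminary";

        System.out.println(found(original, "Sample selectiossn preliminary"));   // true
        System.out.println(found(assumedFix, "Sample selectiossn preliminary")); // false
        System.out.println(found(assumedFix, "sample selection preliminary"));   // true
    }
}
```

With the assumed fix, the literal text "selection" has to appear in the input, so the misspelled file name is rejected.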

Selection* matches "selectio" (with zero n characters).
.* matches "ssn ".
Preliminary matches "preliminary".
The regex n* means zero or more n characters.
The regex . means any single character.
The regex .* means zero or more of any character.

*.*
You have "Selection*.*", which means "Selectio", then any number (including zero) of letter "n", then any number (including zero) of any character.
The match assumes zero matches of "n" matching "", and four matches of any character matching "ssn ".

Related

Why do I always get a false regex pattern match? How can I test and debug it, and at which point does the condition become false?

I have the string below, and I am trying to write a regex pattern for it in Java:
String value = "ABC6072103325000100120190429R070001";
Spaces are added below only to separate the parts of the string:
ABC6 0721 033250001001 20190429 R 07 0001
1st part - ABC6
Length always 4, alphanumeric (A-Z, 0-9)
2nd part - 0721
Length always 4, only digits 0-9 allowed
3rd part - 033250001001
Length always 12, only digits allowed
4th part - 20190429
Format always YYYYMMDD, only digits allowed, length 8
5th part - R
Constant, always R at this position
6th part - 07
Only 2 digits allowed
7th part - 0001
1 to 4 digits allowed
To the best of my knowledge the regex below should work, but every attempt returns false.
String s = "[A-Z0-9]{4}[0-9]{16}[1-9][0-9]{3}[0(1-9)|1(0-2)][0(1-9)|1(0-9)|z(0-9)|3(0-1)](R0)(1-9)0(0-9){1,3}";
Below is my program:
package regextest;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
public static String regex ="[A-Z0-9]{4}[0-9]{16}[1-9][0-9]{3}[0(1-9)|1(0-2)] \r\n[0(1-9)|1(0-9)|2(0-9)|3(0-1)](R0)(1-9)0(0-9){1,3}";
public static void main(String[] args) {
String stringToMatch = "ABC6072103325000100120190429R070001";
boolean isValid = isValidRegex(stringToMatch);
System.out.println("isValid : " + isValid);
}
public static boolean isValidRegex(String stringToMatch) {
boolean isValid =false;
// Create a Pattern object
Pattern r = Pattern.compile(regex);
// Now create a matcher object.
Matcher m = r.matcher(stringToMatch);
if (m.find( )) {
System.out.println("Matched");
isValid = true;
}else {
System.out.println("NO MATCH");
isValid = false;
}
return isValid;
}
}
output - NO MATCH
About your pattern:
In the parts [0(1-9)|1(0-2)] and [0(1-9)|1(0-9)|z(0-9)|3(0-1)], I think you are aiming to use | as an OR, but that does not work inside a character class.
The first part, for example, is equivalent to [()|0-9]: the characters (, ) and | are taken literally, and 0 together with the ranges 1-9 and 0-2 covers every digit. For the same reason, the second part is not suited to matching a date-like format either.
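To see the difference between a character class and a real alternation, here is a small sketch. The class name CharClassDemo and the non-capturing group (?:0[1-9]|1[0-2]) are illustrative assumptions about what the question was aiming for with the month part.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True when the whole input is matched by the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String charClass = "[0(1-9)|1(0-2)]";      // taken from the question
        String alternation = "(?:0[1-9]|1[0-2])";  // a group with a real OR

        System.out.println(fullMatch(charClass, "|"));    // true: "|" is a literal member of the class
        System.out.println(fullMatch(charClass, "("));    // true: so is "("
        System.out.println(fullMatch(charClass, "12"));   // false: a class matches one character only
        System.out.println(fullMatch(alternation, "12")); // true
        System.out.println(fullMatch(alternation, "13")); // false
    }
}
```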
To match the number of digits without the more specific date-like pattern, you could use:
[A-Z\d]{4}\d{4}\d{12}\d{8}R\d{2}\d{4}
In Java
String regex = "[A-Z\\d]{4}\\d{4}\\d{12}\\d{8}R\\d{2}\\d{4}";
You could also use [0-9] instead of \\d
Regex demo
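If the individual parts are needed afterwards, the same pattern can be written with named groups. This is a sketch only: the class name RecordParts and the group names (code, branch, account, date, type, seq) are hypothetical labels chosen for illustration, since the question does not name the fields.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordParts {
    // Same shape as the pattern above, with one named group per field (names are made up).
    static final Pattern RECORD = Pattern.compile(
        "(?<code>[A-Z\\d]{4})(?<branch>\\d{4})(?<account>\\d{12})"
        + "(?<date>\\d{8})R(?<type>\\d{2})(?<seq>\\d{4})");

    // Returns the named field, or null when the whole input does not match.
    static String field(String input, String name) {
        Matcher m = RECORD.matcher(input);
        return m.matches() ? m.group(name) : null;
    }

    public static void main(String[] args) {
        String value = "ABC6072103325000100120190429R070001";
        System.out.println(field(value, "date")); // 20190429
        System.out.println(field(value, "seq"));  // 0001
    }
}
```

Using matches() instead of find() requires the whole string to fit the pattern, so trailing or leading junk is rejected.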
Note
To match a 'date like' pattern YYYYMMDD to narrow down the possible accepted digits, you might use the following regex but that will not validate a date itself.
^\d{4}(?:1[012]|0[1-9])(?:3[01]|[12][0-9]|0[1-9])$
Regex demo
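Since the date-like regex only narrows down the digits, one way to check that the value is a real calendar date is to parse it with java.time after the regex check. The sketch below is one possible approach, not part of the original answer; the class and method names are made up.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class DateCheck {
    // The date-like regex from the note above; it accepts impossible dates such as 20190230.
    static boolean looksLikeDate(String s) {
        return s.matches("^\\d{4}(?:1[012]|0[1-9])(?:3[01]|[12][0-9]|0[1-9])$");
    }

    // Strict parsing rejects dates that do not exist in the calendar.
    static boolean isRealDate(String s) {
        if (!s.matches("\\d{8}")) {
            return false;
        }
        try {
            LocalDate.parse(s, DateTimeFormatter.BASIC_ISO_DATE.withResolverStyle(ResolverStyle.STRICT));
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDate("20190230")); // true
        System.out.println(isRealDate("20190230"));    // false
        System.out.println(isRealDate("20190429"));    // true
    }
}
```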
How to test and debug a regex? Personally, I always use one of the websites that exist for this purpose. For example:
https://regexr.com/
https://regex101.com/
https://www.regextester.com/
Most of them can show you what is wrong with your regex, or even explain how they interpret it.
For your actual situation, this regex should work fine:
[A-Z0-9]{4}[0-9]{16}[1-9][0-9]{3}[0(1-9)|1(0-2)][1-9][0-9]{2}R[0-9]{2}[0-9]{1,4}
Your regex stopped working at the end of the date part:
[0(1-9)|1(0-9)|z(0-9)|3(0-1)]
A character class like this matches only one character, so it never covers the full month and day part of the date, and your regex never finds a match.

Regular expression for a phrase that contains letters and digits, but is not digits only, with a fixed length range

I want a regular expression that accepts the characters a-z and 0-9 but rejects input that is purely numeric (there must be at least one alphabetic character).
For example:
413123123123131
is not allowed, but if there is even one alphabetic character anywhere in the phrase, it is OK.
I tried to define a correct regex for that and ended up with
[0-9]*[a-z].*
but now I am confused about how to define the {x,y} length of the phrase. I want {9,31}, but I cannot put a length block after the last *. I also tried defining a group, but it did not work.
Tested at https://www.debuggex.com/
How can I add it?
What you seek is
String regex = "(?=.{9,31}$)\\p{Alnum}*\\p{Alpha}\\p{Alnum}*";
Use it with String#matches() / Pattern#matches() method to require a full string match:
if (s.matches(regex)) {
return true;
}
Details
^ - implicit in matches() - matches the start of string
(?=.{9,31}$) - a positive lookahead that requires 9 to 31 any chars other than line break chars from the start to end of the string
\\p{Alnum}* - 0 or more alphanumeric chars
\\p{Alpha} - an ASCII letter
\\p{Alnum}* - 0 or more alphanumeric chars
Java demo:
String lines[] = {"413123123123131", "4131231231231a"};
Pattern p = Pattern.compile("(?=.{9,31}$)\\p{Alnum}*\\p{Alpha}\\p{Alnum}*");
for(String line : lines)
{
Matcher m = p.matcher(line);
if(m.matches()) {
System.out.println(line + ": MATCH");
} else {
System.out.println(line + ": NO MATCH");
}
}
Output:
413123123123131: NO MATCH
4131231231231a: MATCH
This might be what you are looking for.
[0-9a-zA-Z]*[a-zA-Z][0-9a-zA-Z]*
To help explain it, think of the middle term as your one required character and the outer terms as any number of alpha numeric characters.
Edit: to restrict the length of the string as a whole, you may have to check that manually after matching, i.e.
if (str.length() >= 9 && str.length() <= 31)
Wiktor provides a solution that handles this inside the regex; please look at his answer for a better pattern.
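Put together, the manual length check and the pattern might look like the sketch below. The class and method names are made up, and the inclusive bounds 9 and 31 are taken from the {9,31} in the question.

```java
class LengthThenPattern {
    // Checks the overall length first, then requires at least one letter among alphanumerics.
    static boolean isValid(String s) {
        return s.length() >= 9 && s.length() <= 31
            && s.matches("[0-9a-zA-Z]*[a-zA-Z][0-9a-zA-Z]*");
    }

    public static void main(String[] args) {
        System.out.println(isValid("413123123123131")); // false: digits only
        System.out.println(isValid("4131231231231a"));  // true
        System.out.println(isValid("kjdfg34"));         // false: shorter than 9
    }
}
```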
Try this Regex:
^(?:(?=[a-z])[a-z0-9]{9,31}|(?=\d.*[a-z])[a-z0-9]{9,31})$
OR a bit shorter form:
^(?:(?=[a-z])|(?=\d.*[a-z]))[a-z0-9]{9,31}$
Demo
Explanation (for the 1st regex):
^ - position before the start of the string
(?=[a-z])[a-z0-9]{9,31} - if the string starts with a letter, match letters and digits, minimum 9 and maximum 31
| - OR
(?=\d.*[a-z])[a-z0-9]{9,31} - if the string starts with a digit followed by a letter somewhere later in the string, match letters and digits, minimum 9 and maximum 31. This also ensures that a string starting with a digit and containing no letter anywhere does not match
$ - position after the last character of the string
OUTPUT:
413123123123131 NO MATCH (no letters)
kjkhsjkf989089054835werewrew65 MATCH
kdfgfd4374985794379857984379857weorjijuiower NO MATCH (length more than 31)
9087erkjfg9080980984590p465467 MATCH
4131231231231a MATCH
kjdfg34 NO MATCH (length less than 9)
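The results listed above can be reproduced in Java with the shorter form of the regex. This is a sketch; the class name LookaheadAlternationDemo is made up.

```java
import java.util.regex.Pattern;

class LookaheadAlternationDemo {
    // The shorter form of the regex, with backslashes doubled for a Java string literal.
    static final Pattern P =
        Pattern.compile("^(?:(?=[a-z])|(?=\\d.*[a-z]))[a-z0-9]{9,31}$");

    static boolean matches(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "413123123123131",
            "kjkhsjkf989089054835werewrew65",
            "kdfgfd4374985794379857984379857weorjijuiower",
            "9087erkjfg9080980984590p465467",
            "4131231231231a",
            "kjdfg34"
        };
        for (String s : inputs) {
            System.out.println(s + " " + (matches(s) ? "MATCH" : "NO MATCH"));
        }
    }
}
```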
Here's the regex:
[a-zA-Z\d]*[a-zA-Z][a-zA-Z\d]*
The trick here is to have something that is not optional. The leading and trailing [a-zA-Z\d] has a * quantifier, so they are optional. But the [a-zA-Z] in the middle there is not optional. The string must have a character that matches [a-zA-Z] in order to be matched.
However, you need to check the length of the string with length() afterwards rather than with the regex. I can't think of a way to do this in regex.
Actually, I think you can do this regexless pretty easily:
private static boolean matches(String input) {
for (int i = 0 ; i < input.length() ; i++) {
if (Character.isLetter(input.charAt(i))) {
return input.length() >= 9 && input.length() <= 31;
}
}
return false;
}

Parsing array syntax using regex

I think what I am asking is either very trivial or already asked, but I have had a hard time finding answers.
We need to capture the inner number characters between brackets within a given string.
so given the string
StringWithMultiArrayAccess[0][9][4][45][1]
and the regex
^\w*?(\[(\d+)\])+?
I would expect 6 capture groups and access to the inner data.
However, I end up only capturing the last "1" character in capture group 2.
In case it matters, here is my Java JUnit test:
@Test
public void ensureThatJsonHandlerCanHandleNestedArrays(){
String stringWithArr = "StringWithMultiArray[0][0][4][45][1]";
Pattern pattern = Pattern.compile("^\\w*?(\\[(\\d+)\\])+?");
Matcher matcher = pattern.matcher(stringWithArr);
matcher.find();
assertTrue(matcher.matches()); //passes
System.out.println(matcher.group(2)); //prints 1 (matched from last array symbols)
assertEquals("0", matcher.group(2)); //expected but its 1 not zero
assertEquals("45", matcher.group(5)); //only 2 capture groups exist, the whole string and the 1 from the last array brackets
}
In order to capture each number, you need to change your regex so it (a) captures a single number and (b) is not anchored to--and therefore limited by--any other part of the string ("^\w*?" anchors it to the start of the string). Then you can loop through them:
Matcher mtchr = Pattern.compile("\\[(\\d+)\\]").matcher(arrayAsStr);
while(mtchr.find()) {
System.out.print(mtchr.group(1) + " ");
}
Output:
0 9 4 45 1
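A self-contained variant of the loop above, collecting the numbers into a list, is sketched below. It also shows why the question's pattern only returned "1": a capturing group inside a repetition keeps only its last iteration. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArrayIndexDemo {
    // Collects every bracketed number by calling find() repeatedly.
    static List<Integer> indices(String s) {
        Matcher m = Pattern.compile("\\[(\\d+)\\]").matcher(s);
        List<Integer> out = new ArrayList<>();
        while (m.find()) {
            out.add(Integer.parseInt(m.group(1)));
        }
        return out;
    }

    // The question's pattern: the repeated group 2 holds only the last index after a full match.
    static String lastIterationOnly(String s) {
        Matcher m = Pattern.compile("^\\w*?(\\[(\\d+)\\])+?").matcher(s);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "StringWithMultiArrayAccess[0][9][4][45][1]";
        System.out.println(indices(input));           // [0, 9, 4, 45, 1]
        System.out.println(lastIterationOnly(input)); // 1
    }
}
```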

Regular expression "\\d?" giving incorrect output [duplicate]

Sample code
Pattern p = Pattern.compile("\\d?");
Matcher m = p.matcher("ab34ef");
boolean b = false;
while (m.find())
{
System.out.print(m.start());// + m.group());
}
Answer: 012456
But the total length of the string is 6. How can m.start() give 6 in the output, when the index starts from 0?
\d? matches zero or one digit, so it also matches the empty string just past the last character of the input, as a zero-width match.
Note that your output is not in fact attained by \d?, but by \d*. You should change either one or the other to make the question self-consistent.
\d? matches zero or one digit, so it matches every digit, but it also matches the empty string at every position where no digit follows, including the end of the input.
Try matching at least one digit:
Pattern p = Pattern.compile("\\d+");
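To compare the three quantifiers side by side, the sketch below concatenates the start index of every match, as the sample code in the question does. The class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroWidthDemo {
    // Concatenates m.start() for each successive find().
    static String starts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append(m.start());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(starts("\\d?", "ab34ef")); // 0123456
        System.out.println(starts("\\d*", "ab34ef")); // 012456
        System.out.println(starts("\\d+", "ab34ef")); // 2
    }
}
```

Index 6 appears for both \d? and \d* because an empty match is still possible at the position after the last character.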

Java RegEx - for an Integer not containing a "."

I need to be able to return signed and unsigned integer constants with no intervening symbols, possibly preceded by + or -. The only allowed digits are 3, 4, and 5.
I can't figure out a way to say that the expression must not contain a period before or after the integer.
This is what I have so far, but if I pass say "34.5 - 43" the string returned will be: "34 5 43".
All that needs to be returned is "43".
public String getInts(String toBeScanned){
String INT = "";
Pattern p = Pattern.compile("\\b[+-]?[3-5]+\\b");
Matcher m = p.matcher(toBeScanned);
if (m.matches() == true){
INT = toBeScanned;
}
else{
m = p.matcher(" " + toBeScanned);
while (m.find()){
INT = INT + m.group() + " ";
}
}
return INT;
}
Any thoughts or pushes in the right direction are appreciated. Is there a way to say it that the first and last character can be [\b and not .]
This is frustrating the heck out of me. Help!
You don't want a word boundary \b here. I think the best is to create your own assertion, try this
(?<![.\d])[+-]?[3-5]+(?![.\d])
See it here on Regexr
(?<![.\d]) is a negative lookbehind assertion; it says that no dot and no digit is allowed immediately before the pattern.
(?![.\d]) is a negative lookahead assertion; it says that no dot and no digit is allowed immediately after the pattern.
Improvement
To avoid matching things like "hf34", we can make it stricter:
(?<![.\w])[+-]?[3-5]+(?![.\w])
See it on Regexr
The word boundary \b
\b matches on a change from a word character to a non word character. A word character is a letter or a digit or a _. That means you will also get problems with your \b before the [+-], because there is no \b between a space/start of the string and a [+-].
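Here is a sketch comparing the word-boundary pattern from the question with the stricter lookaround pattern on the sample input. The class name SignedIntScanner and the second test string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SignedIntScanner {
    // Returns every match of the regex in the input, in order.
    static List<String> scan(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String wordBoundary = "\\b[+-]?[3-5]+\\b";              // from the question
        String lookaround = "(?<![.\\w])[+-]?[3-5]+(?![.\\w])"; // stricter version above

        System.out.println(scan(wordBoundary, "34.5 - 43"));     // [34, 5, 43]
        System.out.println(scan(lookaround, "34.5 - 43"));       // [43]
        System.out.println(scan(lookaround, "+45 3.4 hf34 -5")); // [+45, -5]
    }
}
```

Note that the lookaround version also picks up the sign, which the \b version cannot do when the sign follows a space or the start of the string.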
"\b[+-]?[3-5]+[.][3-5]+\b"
This pattern says that in order to match, there must be at least one number before, and one number after the decimal point.
Is there a way to say it that the first and last character can be [\b and not .]
[^\.\b]
matches \b but not '.'
Is that what you are looking for?
[^\.\b][+-]?[3-5]+[^\.\b]
Will match '43' but not '34.5'
