Regex with ? for a set of words

I want to have a regex for NAME;NAME;NAME and also for NAME;NAME;NAME;NAME where the fourth occurrence of NAME is optional.
I have one regex, (.+);(.+);(.+), which matches the first pattern but not the second. I tried playing with ?, but it's not working out with (.+);(.+);(.+)(;(.+))?
Basically, I want the fourth (.+) to occur zero or one time.

Using .+ matches any character 1 or more times, including ;, so a single group can run across the separators.
If you want to match 3 or 4 groups separated by a ; and not including it, you could use a negated character class [^;]+ with an optional group at the end of the pattern.
^([^;]+);([^;]+);([^;]+)(?:;([^;]+))?$
^ Start of string
([^;]+);([^;]+);([^;]+) Capture group 1, 2 and 3 matching any char except ;
(?: Non capture group
;([^;]+) Match ; and capture any char except ; in group 4
)? Close group and make it optional
$ End of string
Regex demo
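As a minimal sketch of how the pattern above behaves in Java, the following reads the optional fourth group. The class name OptionalFourthGroup, the method fourthField and the placeholder return values are hypothetical names chosen for illustration; the regex itself is the one from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalFourthGroup {
    // Three mandatory fields plus an optional fourth; no field may contain ';'.
    private static final Pattern FIELDS =
            Pattern.compile("^([^;]+);([^;]+);([^;]+)(?:;([^;]+))?$");

    // Returns the fourth field, "<none>" if it is absent, or "<no match>".
    static String fourthField(String input) {
        Matcher m = FIELDS.matcher(input);
        if (!m.matches()) {
            return "<no match>";
        }
        // group(4) is null when the optional group did not take part in the match.
        String fourth = m.group(4);
        return fourth == null ? "<none>" : fourth;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"A;B;C", "A;B;C;D", "A;B", "A;B;C;D;E"}) {
            System.out.println(s + " => " + fourthField(s));
        }
    }
}
```

Note that an optional group that did not participate yields null rather than an empty string, so callers need a null check before using group 4.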
If the parts in between can not contain ; you could also use split and count the number of the parts.
String arr[] = { "NAME;NAME;", "NAME;NAME;NAME", "NAME;NAME;NAME;NAME", "NAME;NAME;NAME;NAME;NAME" };
for (String s : arr) {
    String[] parts = s.split(";");
    if (parts.length == 3 || parts.length == 4) {
        System.out.println(s);
    }
}
Output
NAME;NAME;NAME
NAME;NAME;NAME;NAME
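One caveat with the split approach: String.split with a single argument drops trailing empty strings, so "NAME;NAME;NAME;" splits into 3 parts and would be accepted by the length check. A possible variant, assuming empty fields should be rejected (that requirement is an assumption, not stated in the question), passes a negative limit and checks each part. The class and method names are hypothetical.

```java
class SplitCount {
    // True when the input has 3 or 4 non-empty fields separated by ';'.
    static boolean hasThreeOrFourFields(String input) {
        // A negative limit keeps trailing empty strings, so "A;B;C;" yields 4 parts.
        String[] parts = input.split(";", -1);
        if (parts.length != 3 && parts.length != 4) {
            return false;
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                return false; // reject empty fields such as the one after a trailing ';'
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {
            "NAME;NAME;", "NAME;NAME;NAME", "NAME;NAME;NAME;",
            "NAME;NAME;NAME;NAME", "NAME;NAME;NAME;NAME;NAME"
        };
        for (String s : samples) {
            System.out.println(s + " => " + hasThreeOrFourFields(s));
        }
    }
}
```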

You can use the regex, (.+);\1;\1(?:;\1)?
Demo:
import java.util.stream.Stream;
public class Main {
    public static void main(String args[]) {
        // Test
        Stream.of(
            "NAME;NAME;NAME",
            "NAME;NAME;NAME;NAME",
            "NAME;NAME;NAME;",
            "NAME;NAME;NAMES",
            "NAME;NAME;NAME;NAME;NAME"
        ).forEach(s -> System.out.println(s + " => " + s.matches("(.+);\\1;\\1(?:;\\1)?")));
    }
}
Output:
NAME;NAME;NAME => true
NAME;NAME;NAME;NAME => true
NAME;NAME;NAME; => false
NAME;NAME;NAMES => false
NAME;NAME;NAME;NAME;NAME => false
Explanation of the regex:
\1 matches the same text as most recently matched by the 1st capturing group.
?: makes (?:;\1) a non-capturing group.
? makes the previous token optional

With your shown samples, please try following.
1st solution:
^(?:([^;]*);){2,3}\1$
Online demo for 1st solution
Explanation: Adding detailed explanation for above.
^(?: ##Anchor at the start of the string and open a non-capturing group.
([^;]*); ##Capturing group 1: everything up to the next ;, followed by the ; itself.
){2,3} ##Repeat the non-capturing group 2 to 3 times.
\1$ ##Match the text of capturing group 1 again, then the end of the string.
2nd solution:
^([^;]*)(;)(?:\1\2){1,2}\1$
Online demo for 2nd solution
Explanation: Adding detailed explanation for above.
^([^;]*) ##From the start of the string, capturing group 1 takes everything up to the first ;.
(;) ##Capturing group 2 holds the ; itself.
(?: ##Open a non-capturing group.
\1\2 ##Match the text of group 1 followed by the text of group 2.
){1,2} ##Close the non-capturing group and repeat it 1 to 2 times.
\1$ ##Match the text of group 1 once more, then the end of the string.
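The two patterns can be tried side by side with a short sketch like the one below. The class name, constant names and the extra input "A;B;B" are illustrative additions. That input is included because, in the 1st solution, group 1 is overwritten on each repetition and \1 refers to its most recent capture, so only the last two fields are forced to be equal; the 2nd solution compares every field with the first.

```java
class RepeatedNameCheck {
    // 1st and 2nd solution from the answer above, as Java string literals.
    static final String FIRST = "^(?:([^;]*);){2,3}\\1$";
    static final String SECOND = "^([^;]*)(;)(?:\\1\\2){1,2}\\1$";

    static boolean matchesFirst(String s) {
        return s.matches(FIRST);
    }

    static boolean matchesSecond(String s) {
        return s.matches(SECOND);
    }

    public static void main(String[] args) {
        String[] samples = {
            "NAME;NAME", "NAME;NAME;NAME", "NAME;NAME;NAME;NAME",
            "NAME;NAME;NAME;NAME;NAME", "A;B;B"
        };
        for (String s : samples) {
            System.out.println(s + " => first: " + matchesFirst(s)
                    + ", second: " + matchesSecond(s));
        }
    }
}
```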

You could use lazy quantifier +?. Example:
private static final Pattern pattern = Pattern.compile("((\\w+);?)+?");
public void extractGroups(String input) {
    var matcher = pattern.matcher(input);
    while (matcher.find()) {
        System.out.println(matcher.group(2));
    }
}
Input "FIRST;SECOND;THIRD;FOURTH" gives
FIRST
SECOND
THIRD
FOURTH
Input "FIRST;SECOND;THIRD" gives
FIRST
SECOND
THIRD
Lazy quantifier is used to match the shortest possible String. And if you call it repeatedly in while loop, you'll get all matches.
You should also prefer \\w for matching words, because . also matches the ; symbol.

Related

Regex to match a list of exact strings with some variable characters

I'm looking for a way to match a list of parameters that include some predefined characters and some variable characters using Java's String#matches method. For instance:
Possible Parameter 1: abc;[variable lowercase letters with maybe an underscore]
Possible Parameter 2: cde;[variable lowercase letters with maybe an underscore]
Possible Parameter 3: g;4
Example 1: abc;erga_sd,cde;dfgef,g;4
Example 2: g;4,abc;dsfaweg
Example 3: cde;df_ger
Each of the parameters would be comma-separated but they can come in any order and include 1, 2, and/or 3 (no duplicates)
This is the regex I have so far that partially works:
(abc;[a-z_,]+){0,1}|(cde;[a-z,]+){0,1}|(g;4,){0,1}
The problem is that it also finds something like this valid: abc;dsfg,dfvser where the beginning of the string after the comma does not start with a valid abc; or cde; or g;4

As you said:
The problem is that it also finds something like this valid:
abc;dsfg,dfvser where the beginning of the string after the comma does
not start with a valid abc; or cde; or g;4
Therefore, in a valid entry every element after a comma follows one of the patterns. You can split each input on the delimiter "," and apply the regex to each element, then combine the per-element results to get the result for the whole input line.
Your regex should be:
(abc;[a-z_]+)|(cde;[a-z_]+)|(g;4)
Each element obtained by splitting a valid input line matches one of these three patterns, as described in your post.
Here's the code:
String regex = "(abc;[a-z_]+)|(cde;[a-z_]+)|(g;4)";
boolean finalResult = true;
for (String input: inputList.split(",")) {
    finalResult = finalResult && Pattern.matches(regex,input);
}
System.out.println(finalResult);
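A self-contained sketch of the split-and-validate idea might look like the following. It goes slightly beyond the snippet above in two ways that are assumptions on my part: it passes a negative limit to split so that a trailing comma is rejected, and it tracks the part before ; in a set to enforce the "no duplicates" rule from the question. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

class ParameterListValidator {
    private static final Pattern PARAM =
            Pattern.compile("(abc;[a-z_]+)|(cde;[a-z_]+)|(g;4)");

    static boolean isValid(String line) {
        Set<String> seenKeys = new HashSet<>();
        // A negative limit keeps trailing empty elements, so "g;4," is rejected.
        for (String element : line.split(",", -1)) {
            if (!PARAM.matcher(element).matches()) {
                return false;
            }
            // The key is the text before ';' ("abc", "cde" or "g").
            String key = element.substring(0, element.indexOf(';'));
            if (!seenKeys.add(key)) {
                return false; // the same parameter appeared twice
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {
            "abc;erga_sd,cde;dfgef,g;4", "g;4,abc;dsfaweg", "cde;df_ger",
            "abc;dsfg,dfvser", "abc;a,abc;b", "g;4,"
        };
        for (String s : samples) {
            System.out.println(s + " => " + isValid(s));
        }
    }
}
```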

If you want to use matches, then the whole string has to match.
^(?:(?:abc|cde);[a-z_]+|g;4)(?:,(?:(?:abc|cde);[a-z_]+|g;4))*$
Explanation
^ Start of string
(?: Non capture group
(?:abc|cde);[a-z_]+ match either abc; or cde; and 1+ chars a-z or _
| Or
g;4 Match literally
) Close non capture group
(?: Non capture group
,(?:(?:abc|cde);[a-z_]+|g;4) Match a comma, and repeat the first pattern
)* Close non capture group and optionally repeat
$ End of string
See a regex demo and a Java demo
Example code
String[] strings = {
    "abc;erga_sd,cde;dfgef,g;4",
    "g;4,abc;dsfaweg",
    "cde;df_ger",
    "g;4",
    "abc;dsfg,dfvser"
};
String regex = "^(?:(?:abc|cde);[a-z_]+|g;4)(?:,(?:(?:abc|cde);[a-z_]+|g;4))*$";
Pattern pattern = Pattern.compile(regex);
for (String s : strings) {
    Matcher matcher = pattern.matcher(s);
    if (matcher.matches()) {
        System.out.printf("Match for %s%n", s);
    } else {
        System.out.printf("No match for %s%n", s);
    }
}
Output
Match for abc;erga_sd,cde;dfgef,g;4
Match for g;4,abc;dsfaweg
Match for cde;df_ger
Match for g;4
No match for abc;dsfg,dfvser
If there should not be any duplicate abc;, cde; or g;4, you can rule that out by adding a negative lookahead at the start of the pattern that uses a backreference to detect the same prefix occurring twice.
^(?!.*(abc;|cde;|g;4).*\1)(?:(?:abc|cde);[a-z_]+|g;4)(?:,(?:(?:abc|cde);[a-z_]+|g;4))*$
Regex demo
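The duplicate-rejecting pattern can be exercised from Java with String#matches, as the question asks. This is only a sketch: the class name, the constant and the duplicate test inputs are hypothetical, while the regex is copied from above with the backslash doubled for the string literal.

```java
class NoDuplicateParams {
    // Regex from the answer above, split across lines for readability.
    static final String REGEX =
            "^(?!.*(abc;|cde;|g;4).*\\1)"          // fail if any prefix occurs twice
            + "(?:(?:abc|cde);[a-z_]+|g;4)"         // first parameter
            + "(?:,(?:(?:abc|cde);[a-z_]+|g;4))*$"; // further comma-separated parameters

    static boolean isValid(String line) {
        return line.matches(REGEX);
    }

    public static void main(String[] args) {
        String[] samples = {
            "abc;erga_sd,cde;dfgef,g;4", "g;4,abc;dsfaweg",
            "abc;a,abc;b", "g;4,g;4", "abc;dsfg,dfvser"
        };
        for (String s : samples) {
            System.out.println(s + " => " + isValid(s));
        }
    }
}
```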

Find all matches of ambigous regular expression [duplicate]

In the following code:
public static void main(String[] args) {
    List<String> allMatches = new ArrayList<String>();
    Matcher m = Pattern.compile("\\d+\\D+\\d+").matcher("2abc3abc4abc5");
    while (m.find()) {
        allMatches.add(m.group());
    }
    String[] res = allMatches.toArray(new String[0]);
    System.out.println(Arrays.toString(res));
}
The result is:
[2abc3, 4abc5]
I'd like it to be
[2abc3, 3abc4, 4abc5]
How can it be achieved?

Make the matcher attempt to start its next scan from the latter \d+.
Matcher m = Pattern.compile("\\d+\\D+(\\d+)").matcher("2abc3abc4abc5");
if (m.find()) {
    do {
        allMatches.add(m.group());
    } while (m.find(m.start(1)));
}
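For reference, a runnable version of this restart technique could be sketched as follows. Matcher.find(int) resets the matcher and searches again from the given index, so each new search begins at the trailing number of the previous match. The class and method names are hypothetical, and the second sample input with multi-digit numbers is an added illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingByRestart {
    static List<String> overlappingMatches(String input) {
        List<String> all = new ArrayList<>();
        // Group 1 marks the trailing number, where the next search should begin.
        Matcher m = Pattern.compile("\\d+\\D+(\\d+)").matcher(input);
        boolean found = m.find();
        while (found) {
            all.add(m.group());
            // Restart the search at the start of the trailing number.
            found = m.find(m.start(1));
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(overlappingMatches("2abc3abc4abc5"));
        System.out.println(overlappingMatches("12abc34abc56"));
    }
}
```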

Not sure if this is possible in Java, but in PCRE you could do the following:
(?=(\d+\D+\d+)).
Explanation
The technique is to use a matching group in a lookahead, and then "eat" one character to move forward.
(?= : start of positive lookahead
( : start matching group 1
\d+ : match a digit one or more times
\D+ : match a non-digit character one or more times
\d+ : match a digit one or more times
) : end of group 1
) : end of lookahead
. : match anything, this is to "move forward".
Online demo
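Thanks to Casimir et Hippolyte, it turns out this works in Java too. Written as a Java string literal the pattern becomes "(?=(\\d+\\D+\\d+))." with doubled backslashes, and the result is read from the first capturing group rather than from the whole match.
Tested on www.regexplanet.com.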

The above solution of HamZa works perfectly in Java. If you want to find a specific pattern in a text all you have to do is:
String regex = "\\d+\\D+\\d+";
String updatedRegex = "(?=(" + regex + ")).";
Here regex is the pattern you are looking for. To make its matches overlap, wrap it in "(?=(" at the start and "))." at the end, then read group 1 of each match.
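A small sketch of the lookahead technique in Java follows. One thing to be aware of: because the scan advances one character at a time, an input with multi-digit numbers such as "12abc34" also yields "2abc34". The GUARDED pattern below adds a negative lookbehind (?<!\d) so that a match can only start at the beginning of a number; that guard is my own addition, not part of the answers above, and the class, constant and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingByLookahead {
    // Pattern from the answers above: capture inside a lookahead, then consume one char.
    static final String PLAIN = "(?=(\\d+\\D+\\d+)).";
    // Assumed refinement: only start a match where the previous char is not a digit.
    static final String GUARDED = "(?<!\\d)(?=(\\d+\\D+\\d+)).";

    static List<String> overlappingMatches(String regex, String input) {
        List<String> all = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            all.add(m.group(1)); // the whole match is one character; group 1 has the text
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(overlappingMatches(PLAIN, "2abc3abc4abc5"));
        System.out.println(overlappingMatches(PLAIN, "12abc34"));
        System.out.println(overlappingMatches(GUARDED, "12abc34abc56"));
    }
}
```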

Regex to match a digit not followed by a dot(".")

I have a string
string 1(excluding the quotes) -> "my car number is #8746253 which is actually cool"
conditions - The number 8746253, could be of any length and
- the number can also be immediately followed by an end-of-line.
I want to group-out 8746253 which should not be followed by a dot "."
I have tried,
.*#(\d+)[^.].*
This will get me the number for sure, but it will match even if there is a dot, because [^.] will match the last digit of the number (for example, 3 in the case below).
string 2(excluding the quotes) -> "earth is #8746253.Kms away, which is very far"
I want to match only the string 1 type and not the string 2 types.

To match any number of digits after # that are not followed with a dot, use
(?<=#)\d++(?!\.)
The ++ is a possessive quantifier: the regex engine checks the lookahead (?!\.) only after the last matched digit and won't backtrack into the digits if a dot follows. So the whole match fails if there is a dot after the last digit in a digit chunk.
See the regex demo
To match the whole line and put the digits into capture group #1:
.*#(\d++)(?!\.).*
See this regex demo. Or a version without a lookahead:
^.*#(\d++)(?:[^.\r\n].*)?$
See another demo. In this last version, the digit chunk can only be followed with an optional sequence of a char that is not a ., CR and LF followed with any 0+ chars other than line break chars ((?:[^.\r\n].*)?) and then the end of string ($).
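To see why the possessive quantifier matters, the sketch below compares #(\d++)(?!\.) with a greedy variant #(\d+)(?!\.) on the two strings from the question. The greedy variant is included only for contrast and is not suggested by the answer; the class, constant and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitsNotBeforeDot {
    static final String POSSESSIVE = "#(\\d++)(?!\\.)";
    static final String GREEDY = "#(\\d+)(?!\\.)"; // for contrast only

    // Returns the digits captured in group 1, or null when nothing matches.
    static String firstNumber(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s1 = "my car number is #8746253 which is actually cool";
        String s2 = "earth is #8746253.Kms away, which is very far";
        System.out.println(firstNumber(POSSESSIVE, s1));
        System.out.println(firstNumber(POSSESSIVE, s2));
        // The greedy version gives back the last digit to satisfy the lookahead.
        System.out.println(firstNumber(GREEDY, s2));
        // A number at the very end of the input still matches.
        System.out.println(firstNumber(POSSESSIVE, "id #42"));
    }
}
```

With the greedy quantifier the engine backtracks by one digit, so the second string still produces a (truncated) number; the possessive form rejects it outright.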

This works like you have described
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyRegex {
    public static void main(String[] args) {
        // [^\\.] must consume one character, so a number at the very end of the input is not matched.
        Pattern pattern = Pattern.compile("#(\\d++)[^\\.]");
        Matcher matcher1 = pattern.matcher("my car number is #8746253 which is actually cool");
        if (matcher1.find()) {
            System.out.println(matcher1.group(1));
        }
        Matcher matcher2 = pattern.matcher("earth is #8746253.Kms away, which is very far");
        if (matcher2.find()) {
            System.out.println(matcher2.group(1));
        } else {
            System.out.println("No match found");
        }
    }
}
Outputs:
> 8746253
> No match found

RegEx: Matching n-char long sequence of repeating character

I want to split off the leading part of a text string that might look like this:
(((Hello! --> ((( and Hello!
or
########No? --> ######## and No?
At the beginning I have n-times the same special character, but I want to match the longest possible sequence.
What I have at the moment is this regex:
([^a-zA-Z0-9])\\1+([a-zA-Z].*)
This one would return for the first example
( (only 1 time) and Hello!
and for the second
# and No?
How do I tell the regex that I want the longest possible repetition of the matching character?
I am using RegEx as part of a Java program in case this matters.

I suggest the following solution with 2 regexps: (?s)(\\W)\\1+\\w.* for checking if the string contains same repeating non-word symbols at the start, and if yes, split with a mere (?<=\\W)(?=\\w) pattern (between non-word and a word character), else, just return a list containing the whole string (as if not split):
String ptrn = "(?<=\\W)(?=\\w)";
List<String> strs = Arrays.asList("(((Hello!", "########No?", "$%^&^Hello!");
for (String str : strs) {
    if (str.matches("(?s)(\\W)\\1+\\w.*")) {
        System.out.println(Arrays.toString(str.split(ptrn)));
    } else {
        System.out.println(Arrays.asList(str));
    }
}
See IDEONE demo
Result:
[(((, Hello!]
[########, No?]
[$%^&^Hello!]
Also, your original regex can be modified to fit the requirement like this:
String ptrn = "(?s)((\\W)\\2+)(\\w.*)";
List<String> strs = Arrays.asList("(((Hello!", "########No?", "$%^&^Hello!");
for (String str : strs) {
    Pattern p = Pattern.compile(ptrn);
    Matcher m = p.matcher(str);
    if (m.matches()) {
        System.out.println(Arrays.asList(m.group(1), m.group(3)));
    }
    else {
        System.out.println(Arrays.asList(str));
    }
}
See another IDEONE demo
That regex matches:
(?s) - DOTALL inline modifier (if the string has newline characters, .* will also match them).
((\\W)\\2+) - Group 2 captures a single non-word character; Group 1 captures that character together with its repetitions, i.e. the same character (matched via the backreference \2) 1 or more times.
(\\w.*) - matches and captures into Group 3 a word character and then zero or more characters.

Parsing array syntax using regex

I think what I am asking is either very trivial or already asked, but I have had a hard time finding answers.
We need to capture the inner number characters between brackets within a given string.
so given the string
StringWithMultiArrayAccess[0][9][4][45][1]
and the regex
^\w*?(\[(\d+)\])+?
I would expect 6 capture groups and access to the inner data.
However, I end up only capturing the last "1" character in capture group 2.
If it is important, here's my Java JUnit test:
@Test
public void ensureThatJsonHandlerCanHandleNestedArrays(){
    String stringWithArr = "StringWithMultiArray[0][0][4][45][1]";
    Pattern pattern = Pattern.compile("^\\w*?(\\[(\\d+)\\])+?");
    Matcher matcher = pattern.matcher(stringWithArr);
    matcher.find();
    assertTrue(matcher.matches()); //passes
    System.out.println(matcher.group(2)); //prints 1 (matched from last array symbols)
    assertEquals("0", matcher.group(2)); //expected but its 1 not zero
    assertEquals("45", matcher.group(5)); //only 2 capture groups exist, the whole string and the 1 from the last array brackets
}

In order to capture each number, you need to change your regex so it (a) captures a single number and (b) is not anchored to--and therefore limited by--any other part of the string ("^\w*?" anchors it to the start of the string). Then you can loop through them:
Matcher mtchr = Pattern.compile("\\[(\\d+)\\]").matcher(arrayAsStr);
while(mtchr.find()) {
    System.out.print(mtchr.group(1) + " ");
}
Output:
0 9 4 45 1
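If the indices are needed as values rather than printed, the same loop can collect them into a list. This is a minimal sketch; the class name ArrayIndexExtractor and the method indices are hypothetical, and it assumes each index fits in an int.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArrayIndexExtractor {
    // One bracketed number per match; group 1 is the number without the brackets.
    private static final Pattern INDEX = Pattern.compile("\\[(\\d+)\\]");

    static List<Integer> indices(String expression) {
        List<Integer> result = new ArrayList<>();
        Matcher m = INDEX.matcher(expression);
        while (m.find()) {
            result.add(Integer.parseInt(m.group(1)));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(indices("StringWithMultiArrayAccess[0][9][4][45][1]"));
    }
}
```

A repeated capturing group such as (\[(\d+)\])+ only ever retains its last iteration, which is why the loop over find() is used instead of a single match.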
