Find all matches of ambigous regular expression [duplicate]

Find all matches of ambigous regular expression [duplicate] - java

In the following code:
public static void main(String[] args) {
List<String> allMatches = new ArrayList<String>();
Matcher m = Pattern.compile("\\d+\\D+\\d+").matcher("2abc3abc4abc5");
while (m.find()) {
allMatches.add(m.group());
}
String[] res = allMatches.toArray(new String[0]);
System.out.println(Arrays.toString(res));
}
The result is:
[2abc3, 4abc5]
I'd like it to be
[2abc3, 3abc4, 4abc5]
How can it be achieved?

Make the matcher attempt to start its next scan from the latter \d+.
Matcher m = Pattern.compile("\\d+\\D+(\\d+)").matcher("2abc3abc4abc5");
if (m.find()) {
do {
allMatches.add(m.group());
} while (m.find(m.start(1)));
}

Not sure if this is possible in Java, but in PCRE you could do the following:
(?=(\d+\D+\d+)).
Explanation
The technique is to use a matching group in a lookahead, and then "eat" one character to move forward.
(?= : start of positive lookahead
( : start matching group 1
\d+ : match a digit one or more times
\D+ : match a non-digit character one or more times
\d+ : match a digit one or more times
) : end of group 1
) : end of lookahead
. : match anything, this is to "move forward".
Online demo
Thanks to Casimir et Hippolyte it really seems to work in Java. You just need to add backslashes and display the first capturing group: (?=(\\d+\\D+\\d+))..
Tested on www.regexplanet.com:

The above solution of HamZa works perfectly in Java. If you want to find a specific pattern in a text all you have to do is:
String regex = "\\d+\\D+\\d+";
String updatedRegex = "(?=(" + regex + ")).";
Where the regex is the pattern you are looking for and to be overlapping you need to surround it with (?=(" at the start and ")). at the end.

Related

How to parse string using regex

I'm pretty new to java, trying to find a way to do this better. Potentially using a regex.
String text = test.get(i).toString()
// text looks like this in string form:
// EnumOption[enumId=test,id=machine]
String checker = text.replace("[","").replace("]","").split(",")[1].split("=")[1];
// checker becomes machine
My goal is to parse that text string and just return back machine. Which is what I did in the code above.
But that looks ugly. I was wondering what kinda regex can be used here to make this a little better? Or maybe another suggestion?

Use a regex' lookbehind:
(?<=\bid=)[^],]*
See Regex101.
(?<= ) // Start matching only after what matches inside
\bid= // Match "\bid=" (= word boundary then "id="),
[^],]* // Match and keep the longest sequence without any ']' or ','
In Java, use it like this:
import java.util.regex.*;
class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("(?<=\\bid=)[^],]*");
Matcher matcher = pattern.matcher("EnumOption[enumId=test,id=machine]");
if (matcher.find()) {
System.out.println(matcher.group(0));
}
}
}
This results in
machine

Assuming you’re using the Polarion ALM API, you should use the EnumOption’s getId method instead of deparsing and re-parsing the value via a string:
String id = test.get(i).getId();

Using the replace and split functions don't take the structure of the data into account.
If you want to use a regex, you can just use a capturing group without any lookarounds, where enum can be any value except a ] and comma, and id can be any value except ].
The value of id will be in capture group 1.
\bEnumOption\[enumId=[^=,\]]+,id=([^\]]+)\]
Explanation
\bEnumOption Match EnumOption preceded by a word boundary
\[enumId= Match [enumId=
[^=,\]]+, Match 1+ times any char except = , and ]
id= Match literally
( Capture group 1
[^\]]+ Match 1+ times any char except ]
)\]
Regex demo | Java demo
Pattern pattern = Pattern.compile("\\bEnumOption\\[enumId=[^=,\\]]+,id=([^\\]]+)\\]");
Matcher matcher = pattern.matcher("EnumOption[enumId=test,id=machine]");
if (matcher.find()) {
System.out.println(matcher.group(1));
}
Output
machine
If there can be more comma separated values, you could also only match id making use of negated character classes [^][]* before and after matching id to stay inside the square bracket boundaries.
\bEnumOption\[[^][]*\bid=([^,\]]+)[^][]*\]
In Java
String regex = "\\bEnumOption\\[[^][]*\\bid=([^,\\]]+)[^][]*\\]";
Regex demo

A regex can of course be used, but sometimes is less performant, less readable and more bug-prone.
I would advise you not use any regex that you did not come up with yourself, or at least understand completely.
PS: I think your solution is actually quite readable.
Here's another non-regex version:
String text = "EnumOption[enumId=test,id=machine]";
text = text.substring(text.lastIndexOf('=') + 1);
text = text.substring(0, text.length() - 1);
Not doing you a favor, but the downvote hurt, so here you go:
String input = "EnumOption[enumId=test,id=machine]";
Matcher matcher = Pattern.compile("EnumOption\\[enumId=(.+),id=(.+)\\]").matcher(input);
if(!matcher.matches()) {
throw new RuntimeException("unexpected input: " + input);
}
System.out.println("enumId: " + matcher.group(1));
System.out.println("id: " + matcher.group(2));

java regex add trailing slash

I am trying to redirect the urls to add trailing slash
/news -> /news/
/news?param1=value1 -> /news/?param1=value
/news#anchor?param1=value1 -> /news/#anchor?param1=value1
I need to do it through a regex that identifies only the path and add /. When there are no parameters there is no problem.
^(/[a-z0–9/_\-]*[^/])$ -> $1/
But when there are parameters I am not able to create the regular expression that separates the path from the parameters.
Any ideas?, thanks

Might be just need to extend the end of string past the parameters.
To cover both with and without parameters might be:
^(/[a-z0–9/_-]*(?<!/))([^/]*)$ -> $1/$2
see https://regex101.com/r/Iwl23o/2

You shouldn't match the end of the string with $ and there is no need for [^/] at the end either.
^(/[a-z0–9/_\-]*)
const regex = new RegExp("^(/[a-z0–9/_\-]*)");
console.log("/news".replace(regex, "$1/"));
console.log("/news?param1=value1".replace(regex, "$1/"));
console.log("/news#anchor?param1=value1".replace(regex, "$1/"));

You can use a very simple regex like this:
^([/\w]+)
With this replacement string: $1/
Working demo

The pattern you tried matches only /news because the anchor $ asserts the end of the string.
If you omit the anchor, it would also match the ? and # as you use [^/] which matches any char except a forward slash.
You could repeat 1 or more times matching a forward slash followed by 1 or more times any char listed in the character class to prevent matching ///
In the replacement use the full match and add a a forward slash.
^(?:/[a-z0-9_-]+)+
Regex demo | Java demo
String regex = "^(?:/[a-z0-9_-]+)+";
String string = "/news\n"
+ "/news?param1=value1\n"
+ "/news#anchor?param1=value1";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(string);
String result = matcher.replaceAll("$0/");
System.out.println(result);
Output
/news/
/news/?param1=value1
/news/#anchor?param1=value1
Note that in your regex, the hyphen in this part 0–9 is
https://www.compart.com/en/unicode/U+2013 instead of https://www.compart.com/en/unicode/U+002D

You can do it as follows:
public class Main {
public static void main(final String[] args) {
String[] arr = { "/news", "/news?param1=value1", "/news#anchor?param1=value1" };
for (String s : arr) {
System.out.println(s.replaceFirst("([^\\/\\p{Punct}]+)", "$1/"));
}
}
}
Output:
/news/
/news/?param1=value1
/news/#anchor?param1=value1
Explanation of the regex:
(: Start of capturing group#1
[: Start of character classes
^: None of
\/: A / character
\p{Punct}: A punctuation character.
]: End of character classes
+: One or more times
): End of capturing group#1

Pattern Matcher Java - Finding All Substrings in a String [duplicate]

In the following code:
public static void main(String[] args) {
List<String> allMatches = new ArrayList<String>();
Matcher m = Pattern.compile("\\d+\\D+\\d+").matcher("2abc3abc4abc5");
while (m.find()) {
allMatches.add(m.group());
}
String[] res = allMatches.toArray(new String[0]);
System.out.println(Arrays.toString(res));
}
The result is:
[2abc3, 4abc5]
I'd like it to be
[2abc3, 3abc4, 4abc5]
How can it be achieved?

Make the matcher attempt to start its next scan from the latter \d+.
Matcher m = Pattern.compile("\\d+\\D+(\\d+)").matcher("2abc3abc4abc5");
if (m.find()) {
do {
allMatches.add(m.group());
} while (m.find(m.start(1)));
}

Not sure if this is possible in Java, but in PCRE you could do the following:
(?=(\d+\D+\d+)).
Explanation
The technique is to use a matching group in a lookahead, and then "eat" one character to move forward.
(?= : start of positive lookahead
( : start matching group 1
\d+ : match a digit one or more times
\D+ : match a non-digit character one or more times
\d+ : match a digit one or more times
) : end of group 1
) : end of lookahead
. : match anything, this is to "move forward".
Online demo
Thanks to Casimir et Hippolyte it really seems to work in Java. You just need to add backslashes and display the first capturing group: (?=(\\d+\\D+\\d+))..
Tested on www.regexplanet.com:

The above solution of HamZa works perfectly in Java. If you want to find a specific pattern in a text all you have to do is:
String regex = "\\d+\\D+\\d+";
String updatedRegex = "(?=(" + regex + ")).";
Where the regex is the pattern you are looking for and to be overlapping you need to surround it with (?=(" at the start and ")). at the end.

RegEx: Matching n-char long sequence of repeating character

I want to split of a text string that might look like this:
(((Hello! --> ((( and Hello!
or
########No? --> ######## and No?
At the beginning I have n-times the same special character, but I want to match the longest possible sequence.
What I have at the moment is this regex:
([^a-zA-Z0-9])\\1+([a-zA-Z].*)
This one would return for the first example
( (only 1 time) and Hello!
and for the second
# and No!
How do I tell regEx I want the maximal long repetition of the matching character?
I am using RegEx as part of a Java program in case this matters.

I suggest the following solution with 2 regexps: (?s)(\\W)\\1+\\w.* for checking if the string contains same repeating non-word symbols at the start, and if yes, split with a mere (?<=\\W)(?=\\w) pattern (between non-word and a word character), else, just return a list containing the whole string (as if not split):
String ptrn = "(?<=\\W)(?=\\w)";
List<String> strs = Arrays.asList("(((Hello!", "########No?", "$%^&^Hello!");
for (String str : strs) {
if (str.matches("(?s)(\\W)\\1+\\w.*")) {
System.out.println(Arrays.toString(str.split(ptrn)));
}else { System.out.println(Arrays.asList(str)); }
}
See IDEONE demo
Result:
[(((, Hello!]
[########, No?]
[$%^&^Hello!]
Also, your original regex can be modified to fit the requirement like this:
String ptrn = "(?s)((\\W)\\2+)(\\w.*)";
List<String> strs = Arrays.asList("(((Hello!", "########No?", "$%^&^Hello!");
for (String str : strs) {
Pattern p = Pattern.compile(ptrn);
Matcher m = p.matcher(str);
if (m.matches()) {
System.out.println(Arrays.asList(m.group(1), m.group(3)));
}
else {
System.out.println(Arrays.asList(str));
}
}
See another IDEONE demo
That regex matches:
(?s) - DOTALL inline modifier (if the string has newline characters, .* will also match them).
((\\W)\\2+) - Capture group 1 matching and capturing into Group 2 a non-word character followed by the same character (since a backreference \2 is used) 1 or more times.
(\\w.*) - matches and captures into Group 3 a word character and then one or more characters.

Author and time matching regex

I would to use a regex in my Java program to recognize some feature of my strings.
I've this type of string:
`-Author- has wrote (-hh-:-mm-)
So, for example, I've a string with:
Cecco has wrote (15:12)
and i've to extract author, hh and mm fields. Obviously I've some restriction to consider:
hh and mm must be numbers
author hasn't any restrictions
I've to consider space between "has wrote" and (
How can I can use regex?
EDIT: I attach my snippet:
String mRegex = "(\\s)+ has wrote \\((\\d\\d):(\\d\\d)\\)";
Pattern mPattern = Pattern.compile(mRegex);
String[] str = {
"Cecco CQ has wrote (14:55)", //OK (matched)
"yesterday you has wrote that I'm crazy", //NO (different text)
"Simon has wrote (yesterday)", // NO (yesterday isn't numbers)
"John has wrote (22:32)", //OK
"James has wrote(22:11)", //NO (missed space between has wrote and ()
"Tommy has wrote (xx:ss)" //NO (xx and ss aren't numbers)
};
for(String s : str) {
Matcher mMatcher = mPattern.matcher(s);
while (mMatcher.find()) {
System.out.println(mMatcher.group());
}
}

homework?
Something like:
(.+) has wrote \((\d\d):(\d\d)\)
Should do the trick
() - mark groups to capture (there are three in the above)
.+ - any chars (you said no restrictions)
\d - any digit
\(\) escape the parens as literals instead of a capturing group
use:
Pattern p = Pattern.compile("(.+) has wrote \\((\\d\\d):(\\d\\d)\\)");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
To cope with an optional (HH:mm) at the end you need to start to use some dark regex voodoo:
Pattern p = Pattern.compile("(.+) has wrote\\s?(?:\\((\\d\\d):(\\d\\d)\\))?");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
m = p.matcher("Gareth has wrote");
if( m.matches()){
System.out.println(m.group(1));
// m.group(2) == null since it didn't match anything
}
The new unescaped pattern:
(.+) has wrote\s?(?:\((\d\d):(\d\d)\))?
\s? optionally match a space (there might not be a space at the end if there isn't a (HH:mm) group
(?: ... ) is a none capturing group, i.e. allows use to put ? after it to make is optional
I think #codinghorror has something to say about regex

The easiest way to figure out regular expressions is to use a testing tool before coding.
I use an eclipse plugin from http://www.brosinski.com/regex/
Using this I came up with the following result:
([a-zA-Z]*) has wrote \((\d\d):(\d\d)\)
Cecco has wrote (15:12)
Found 1 match(es):
start=0, end=23
Group(0) = Cecco has wrote (15:12)
Group(1) = Cecco
Group(2) = 15
Group(3) = 12
An excellent turorial on regular expression syntax can be found at http://www.regular-expressions.info/tutorial.html

Well, just in case you didn't know, Matcher has a nice function that can draw out specific groups, or parts of the pattern enclosed by (), Matcher.group(int). Like if I wanted to match for a number between two semicolons like:
:22:
I could use the regex ":(\\d+):" to match one or more digits between two semicolons, and then I can fetch specifically the digits with:
Matcher.group(1)
And then its just a matter of parsing the String into an int. As a note, group numbering starts at 1. group(0) is the whole match, so Matcher.group(0) for the previous example would return :22:
For your case, I think the regex bits you need to consider are
"[A-Za-z]" for alphabet characters (you could probably also safely use "\\w", which matchers alphabet characters, as well as numbers and _).
"\\d" for digits (1,2,3...)
"+" for indicating you want one or more of the previous character or group.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Find all matches of ambigous regular expression [duplicate] - java

Make the matcher attempt to start its next scan from the latter \d+. Matcher m = Pattern.compile("\\d+\\D+(\\d+)").matcher("2abc3abc4abc5"); if (m.find()) { do { allMatches.add(m.group()); } while (m.find(m.start(1))); }

Related

How to parse string using regex

java regex add trailing slash

Pattern Matcher Java - Finding All Substrings in a String [duplicate]

RegEx: Matching n-char long sequence of repeating character

Author and time matching regex

Categories

Resources