Capture group multiple times - java

Lately I have been playing around with regex in Java, and I have run into a problem which is (theoretically) easy to solve, but I was wondering if there is an easier way to do it (yes, I am lazy). The problem is capturing a group multiple times, like this:
public static void main(String[] args) {
Pattern p = Pattern.compile("A (IvI(.*?)IvI)*? A");
Matcher m = p.matcher("A IvI asd IvI IvI qwe IvI A"); //ANY NUMBER of IvI x IvI
//Matcher m = p.matcher("A A");
int loi = 0; //last Occurrence Index
String storage;
while (loi >= 0 && m.find(loi)) {
System.out.println(m.group(1));
if ((storage = m.group(2)) != null) {
System.out.println(storage);
}
//System.out.println(m.group(1));
loi = m.end(1);
}
m.find();
System.out.println("2 opt");
Pattern p2 = Pattern.compile("IvI(.*?)IvI");
Matcher m2 = p2.matcher(m.group(1)); //m.group(1) = "IvI asd IvI IvI qwe IvI"
loi = 0;
while (loi >= 0 && m2.find(loi)) {
if ((storage = m2.group(1)) != null) {
System.out.println(storage);
}
loi = m2.end(0);
}
}
Using ONLY Pattern p, is there any way to get what is inside the IvI's ("asd" and "qwe" in the test string), considering that there could be any number of IvI sections? I want something like what I am trying to do in the first while loop: find the first occurrence of the group, then move the index and search for the next group, and so on.
With the code in that loop, group 2 comes back as asd IvI IvI qwe, not asd and then qwe. I suppose this could be because of the (.*?) part: it is not supposed to be greedy, but it still runs up to qwe, consuming two of the IvI's. I mention this because otherwise I might be able to use the end index of those with the matcher.find(anInt) method, but that does not work either. I don't think there is anything wrong with the regex, since the following code works without consuming the IvI.
public static void main(String[] args) {
Pattern p = Pattern.compile("(.*?)IvI");
Matcher m = p.matcher("bla bla blaIvI");
m.find();
System.out.println(m.group(1));
}
This prints: bla bla bla
THERE IS A SOLUTION I KNOW (but I am lazy remember)
(It is also in the first code block, below the "2 opt" message.)
The solution is dividing it into sub-groups and use another regex where you process only those sub-groups one at a time...
BTW: I did my homework
In this page it mentions
Since a capture group with a quantifier holds on to its number, what value does the engine return when you inspect the group? All engines return the last value captured. For instance, if you match the string A_B_C_D_ with ([A-Z])+, when you inspect the match, Group 1 will be D. With the exception of the .NET engine, all intermediate values are lost. In essence, Group 1 gets overwritten each time its pattern is matched.
But I am still hoping you can give me good news...

No, unfortunately, as your citation already mentions, the java.util.regex regular expression implementation does not support retrieving any previous values of a repeated capturing group after a single match. The only way to get those, as your code illustrates, is by find()ing multiple matches of the repeated part of your regular expression.
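As a minimal sketch of that find() approach (the class and method names are illustrative, not from the question, and the trim() call is my own addition to drop the surrounding spaces), a second pattern covering only the repeated part can be run over the input, collecting group 1 on every match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Matches one "IvI ... IvI" section; group 1 is the text in between.
    private static final Pattern SECTION = Pattern.compile("IvI(.*?)IvI");

    // Collects the content of every IvI section, one find() call per section.
    static List<String> innerValues(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = SECTION.matcher(input);
        while (m.find()) {
            values.add(m.group(1).trim()); // trim() is an assumption about the desired output
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(innerValues("A IvI asd IvI IvI qwe IvI A")); // prints "[asd, qwe]"
    }
}
```

Each find() resumes after the previous match, so there is no need to track the last index by hand.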
I've also been looking at other implementations of regular expressions in Java, for example:
http://www.brics.dk/automaton/
but I could not find any that support it (only the Microsoft .NET engine does). If I understood correctly, implementations of regular expressions based on state machines cannot easily implement this feature. java.util.regex does not use state machines, though.
If anyone knows of a Java regular expression library that supports this behaviour, please share it, because it would be a powerful feature.
p.s. it took me quite a while to understand your question. The title is good, but the body confused me about whether I understood you correctly.

Related

Word not preceded by a regular expression

There are plenty of these questions but they all focus on having a couple of characters.
In a text file I have TXX and txx and I need to find those. But I also have Base64-encoded pictures.
Meaning I have
"picture":"/9j/4AAQSkTXX . . .
Basically TXX, txx can appear randomly in Base64-encoded pictures.
I used the following regular expression:
(?<!"picture":")(?:(\w|\/|\+)+)(TXX|txx)
I also realized it should probably be changed into:
(?<!"picture":")(?:(\d|\w|\/|\+|\=)+)(TXX|txx)
But it reports catastrophic backtracking, and even without the (?:) (non-capturing group) it still doesn't work. Basically it just skips the "picture":" and the first character and matches everything else.
Since I cannot put a regular expression inside the negative look-behind with a quantifier like
(?<!"picture":".+)TXX|txx
How should I form that regular expression so that these pass
"something-txx": "somerandomstring"
value not picture: "some other stringtxxsome string"
But this doesn't
"picture":"txxl5l71JGwnxMXAmJGOt8ZPwN24JNgtZpYHPBQLTViqVatk4ZoZhY+husj7Pgv3ag4NmpJ4CBlXudzydA5c+5QecmgaPz9vLrSbzRa+tNns0GjUfD+NSa5ZHo9KRf2nCWLl7360x2Kx8zA6dquNqubjoElpVRo2Dq0GOmZ8HMycktxxH08veKg84OPlCZvdDqvNxkPhOB0sn5wly+vdgx1Di82KzMxMlAoJQZkSJdGjZ0+UrlCJi/Xysc5GCPETtxxgUAgEAieNoQQLygg/P8K8VLaFCVVez+/SfMmPo74sNyxGz+/0YI8QKBQCAQCP4DPG6MeLrZcQvihFar46L6govdPE69movlMhIPh0NYaRJTtu2e+FQWyPkqDSsLqker0fKJVR0Oe5ap1RqoWD+pfuo7hefhbVJcfA8VlK42ycudJlIlMd1iMrnakePok5BPDyoUSvnhBMsEs9XMQ+PYrDQRqwd0Oj2vh/eVleXj5OMF7BSqhq2YjEa2TQ83nNDrPeHp5YWQEmXg4+vPPeLzIoR4gUAgEAcvvgETxtCiBcI/ifY2Y2aA57eWu7lJBAIBAKBQCB4eP62EC/JYWmoPBnFeieRnGKnk7e3yWTiYjN5fZPYLId5kcV67sHtcLBt+vZG4VzIu93lVe8SqUmsdzpsrDz7jse2tZrs+O/kxc7z5oGE/PtB+XOWs7tCtpB4z9NIkGf9YU3JeSmb0yV422np5AI8eaTXX"
Sample input is on :
http://pastebin.com/5XJVNqGS
(I know pastebin is a bad choice because pastes expire, but I'm having problems pasting that amount of text here, as the page gets stuck.)
And the results should be:
Result1: "some-txx": value
Result2: hereisTXX: "1235"
Result3: "GROUPDATA" : "{DATA1: sample, TXX-value:12312 ,DATA2: sample2}"
I believe you can use a rather useful Java "to-some-extent" variable-width look-behind:
(?<!"picture":"[^"]{0,10000})(?i:txx)
You can adjust the 10000 value in case you have longer Base64-encoded strings.
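Here is a small sketch of how that bounded look-behind could be used from Java. The class name, the method name and the shortened sample strings are made up for illustration; the regex is the one given above.

```java
import java.util.regex.Pattern;

class PictureLookBehindDemo {
    // txx/TXX not preceded by "picture":" plus up to 10000 non-quote characters.
    private static final Pattern TXX_OUTSIDE_PICTURE =
            Pattern.compile("(?<!\"picture\":\"[^\"]{0,10000})(?i:txx)");

    // True if the line contains txx/TXX anywhere outside a "picture" value.
    static boolean containsTxx(String line) {
        return TXX_OUTSIDE_PICTURE.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTxx("\"something-txx\": \"somerandomstring\"")); // prints "true"
        System.out.println(containsTxx("\"picture\":\"/9j/4AAQSkTXXabc\""));        // prints "false"
    }
}
```

The [^"] inside the look-behind keeps it from reaching past the opening quote of the value, so a txx in a later field on the same line is still found.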
Tested on RegexPlanet
In case you have very large images, use a reverse-string trick with a reversed regex (look-aheads can be of undefined variable size):
String rx = "(?i)\"[^\"]*\"\\s*:\\s*\"[^\"]*xxt[^\"]*\"(?![^\"]*\":\"erutcip\")";
Sample Java program on Ideone:
import java.util.regex.*;
class HelloWorld{
public static void main(String []args){
String str = "THE_HUIGE_STRING_THAT_CAUSED_Body is limited to 30000 characters;you entered 53501_ISSUE";
str = new StringBuilder(str).reverse().toString();
String rx = "\"?[^\"]*\"?\\s*\"?[^\"\\n\\r]*(?:xxt|XXT)[^\"\\n\\r]*(?![^\"]*\":\"erutcip\")";
Pattern ptrn = Pattern.compile(rx);
Matcher m = ptrn.matcher(str);
while (m.find()) {
System.out.println(new StringBuilder(m.group(0)).reverse().toString());
}
m = ptrn.matcher(new StringBuilder("\"something-txx\": \"somerandomstring\"").reverse().toString());
while (m.find()) {
System.out.println(new StringBuilder(m.group(0)).reverse().toString());
}
}
}

Java Regex is including new line in match

I'm trying to match a regular expression to textbook definitions that I get from a website.
The definition always has the word with a new line followed by the definition. For example:
Zither
Definition: An instrument of music used in Austria and Germany It has from thirty to forty wires strung across a shallow sounding board which lies horizontally on a table before the performer who uses both hands in playing on it Not to be confounded with the old lute shaped cittern or cithern
In my attempts to get just the word (in this case "Zither") I keep getting the newline character.
I tried both ^(\w+)\s and ^(\S+)\s without much luck. I thought that maybe ^(\S+)$ would work, but that doesn't seem to successfully match the word at all. I've been testing with rubular, http://rubular.com/r/LPEHCnS0ri; which seems to successfully match all my attempts the way I want, despite the fact that Java doesn't.
Here's my snippet
String str = ...; //Here the string is assigned a word and definition taken from the internet like given in the example above.
Pattern rgx = Pattern.compile("^(\\S+)$");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
String result = mtch.group();
terms.add(new SearchTerm(result, System.nanoTime()));
}
This is easily solved by trimming the resulting string, but that seems like it should be unnecessary if I'm already using a regular expression.
All help is greatly appreciated. Thanks in advance!
Try using the Pattern.MULTILINE option
Pattern rgx = Pattern.compile("^(\\S+)$", Pattern.MULTILINE);
This causes the regex to recognise line delimiters in your string, otherwise ^ and $ just match the start and end of the string.
Although it makes no difference for this pattern, the Matcher.group() method returns the entire match, whereas the Matcher.group(int) method returns the match of the particular capture group (...) based on the number you specify. Your pattern specifies one capture group which is what you want captured. If you'd included \s in your Pattern as you wrote you tried, then Matcher.group() would have included that whitespace in its return value.
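To make the effect of the flag concrete, here is a minimal sketch (class and method names are illustrative) that runs the same pattern with and without Pattern.MULTILINE on a shortened version of the sample text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // Returns group 1 of the first match of ^(\S+)$, or null if nothing matches.
    static String firstWord(String text, int flags) {
        Matcher m = Pattern.compile("^(\\S+)$", flags).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Zither\nDefinition: An instrument of music";
        System.out.println(firstWord(text, 0));                 // prints "null"
        System.out.println(firstWord(text, Pattern.MULTILINE)); // prints "Zither"
    }
}
```

Without the flag, $ can only match at the very end of the input (or just before a final line terminator), so the single-word first line never satisfies the pattern.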
With regular expressions, group 0 is always the complete match. In your case you want group 1, not group 0.
So changing mtch.group() to mtch.group(1) should do the trick:
String str = ...; //Here the string is assigned a word and definition taken from the internet like given in the example above.
Pattern rgx = Pattern.compile("^(\\w+)\\s");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
String result = mtch.group(1);
terms.add(new SearchTerm(result, System.nanoTime()));
}
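A short sketch of the difference, assuming the ^(\w+)\s pattern from above (the class and method names are illustrative only): group() carries the trailing newline, while group(1) does not.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupZeroDemo {
    // Returns { whole match, capture group 1 } for the first match, or an empty array.
    static String[] wholeAndGroup(String text) {
        Matcher m = Pattern.compile("^(\\w+)\\s").matcher(text);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args) {
        String[] r = wholeAndGroup("Zither\nDefinition: An instrument of music");
        System.out.println(r[0].length()); // prints "7" ("Zither" plus the newline)
        System.out.println(r[1].length()); // prints "6" ("Zither" only)
    }
}
```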
A late response, but if you are not using Pattern and Matcher, you can use this alternative of DOTALL in your regex string
(?s)[Your Expression]
Basically (?s) also tells dot to match all characters, including line breaks
Detailed information: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
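For illustration, a tiny sketch of what the (?s) prefix changes (the class name and sample strings are made up): with it, the dot also matches the line break between the word and its definition.

```java
import java.util.regex.Pattern;

class DotallDemo {
    // Thin wrapper around Pattern.matches so both variants read the same way.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String input = "Zither\nDefinition";
        System.out.println(matches("Zither.Definition", input));     // prints "false"
        System.out.println(matches("(?s)Zither.Definition", input)); // prints "true"
    }
}
```

Note that this affects only the dot; it does not change how ^ and $ behave, which is what Pattern.MULTILINE or (?m) is for.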
Just replace:
String result = mtch.group();
By:
String result = mtch.group(1);
This will limit your output to the contents of the capturing group (e.g. (\\w+)).
Try the following:
/* The regex pattern: ^(\w+)\r?\n(.*)$ */
private static final Pattern REGEX_PATTERN =
Pattern.compile("^(\\w+)\\r?\\n(.*)$");
public static void main(String[] args) {
String input = "Zither\nDefinition: An instrument of music";
System.out.println(
REGEX_PATTERN.matcher(input).matches()
); // prints "true"
System.out.println(
REGEX_PATTERN.matcher(input).replaceFirst("$1 = $2")
); // prints "Zither = Definition: An instrument of music"
System.out.println(
REGEX_PATTERN.matcher(input).replaceFirst("$1")
); // prints "Zither"
}

Java regex: Repeating capturing groups

An item is a comma delimited list of one or more strings of numbers or characters e.g.
"12"
"abc"
"12,abc,3"
I'm trying to match a bracketed list of zero or more items in Java e.g.
""
"(12)"
"(abc,12)"
"(abc,12),(30,asdf)"
"(qqq,pp),(abc,12),(30,asdf,2),"
which should return the following matching groups respectively for the last example
qqq,pp
abc,12
30,asdf,2
I've come up with the following (incorrect) pattern
\((.+?)\)(?:,\((.+?)\))*
which matches only the following for the last example
qqq,pp
30,asdf,2
Tips? Thanks
That's right. You can't have a "variable" number of capturing groups in a Java regular expression. Your Pattern has two groups:
\((.+?)\)(?:,\((.+?)\))*
  |___|        |___|
 group 1      group 2
Each group will contain the content of the last match for that group. I.e., abc,12 will get overridden by 30,asdf,2.
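A quick sketch that shows the overwriting with the pattern from the question (the class and method names are illustrative): after a single match, group 2 only holds the last repetition.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastValueDemo {
    // Returns { group 1, group 2 } of the first match, or an empty array if there is none.
    static String[] groupsOfFirstMatch(String input) {
        Matcher m = Pattern.compile("\\((.+?)\\)(?:,\\((.+?)\\))*").matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] g = groupsOfFirstMatch("(qqq,pp),(abc,12),(30,asdf,2),");
        System.out.println(g[0]); // prints "qqq,pp"
        System.out.println(g[1]); // prints "30,asdf,2" (abc,12 was overwritten)
    }
}
```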
Related question:
Regular expression with variable number of groups?
The solution is to use one expression (something like \((.+?)\)) and use matcher.find to iterate over the matches.
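A minimal sketch of that loop, using the \((.+?)\) expression suggested above (class and method names are my own):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketedItemsDemo {
    // One bracketed item list; group 1 is the text between the parentheses.
    private static final Pattern ITEM_LIST = Pattern.compile("\\((.+?)\\)");

    // Collects the content of every "(...)" section in order of appearance.
    static List<String> itemLists(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ITEM_LIST.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(itemLists("(qqq,pp),(abc,12),(30,asdf,2),"));
        // prints "[qqq,pp, abc,12, 30,asdf,2]"
    }
}
```

This only extracts the bracketed sections; it does not check that the input as a whole is well-formed (for example, that the sections are separated by commas).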
You can use regular expression like ([^,]+) in loop or just str.split(",") to get all elements at once. This version: str.split("\\s*,\\s*") even allows spaces.
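For example, a small sketch of the split variant (the class name and the sample string with extra spaces are made up):

```java
import java.util.Arrays;

class SplitItemsDemo {
    // Splits a comma-delimited item list, tolerating spaces around the commas.
    static String[] items(String list) {
        return list.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(items("12 , abc,3"))); // prints "[12, abc, 3]"
    }
}
```

Whitespace at the very start or end of the whole string is not removed by this split, so a trim() beforehand may be needed.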
(^|\s+)(\S*)(($|\s+)\2)+ with ignore case option /i
She left LEft leFT now
example here - https://regex101.com/r/FEmXui/2
Match 1
Full match 3-23 ` left LEft leFT LEFT`
Group 1. 3-4 ` `
Group 2. 4-8 `left`
Group 3. 18-23 ` LEFT`
Group 4. 18-19 ` `
Using an ANTLR grammar can solve this problem. This is really beyond the reasonable capabilities of RegExp, although I believe some newer versions of Microsoft's implementation in .Net support this behavior. See this other SO question. If you're stuck with everything but .Net your best option is going to be a parser-generator (you don't have to use ANTLR, that's just my personal preference). Going through the ANTLR4 GitHub page can help get someone started on matching on more complex expressions with things like repeating match groups. Another option that doesn't require a whole lot of new learning is to tokenize the string input that you're wanting to match on and pull out the pieces that you want, but this can prove to be extremely messy and create nightmarish chunks of parsing code that are better-suited to a generated parser.
This may be the solution :
package com.drl.fw.sch;
import java.util.regex.Pattern;
public class AngularJSMatcher extends SimpleStringMatcher {
Matcher delegate;
public AngularJSMatcher(String lookFor){
super(lookFor);
// ng-repeat
int ind = lookFor.indexOf('-');
if(ind >= 0 ){
StringBuilder sb = new StringBuilder();
boolean first = true;
for (String s : lookFor.split("-")){
if(first){
sb.append(s);
first = false;
}else{
if(s.length() >1){
sb.append(s.substring(0,1).toUpperCase());
sb.append(s.substring(1));
}else{
sb.append(s.toUpperCase());
}
}
}
delegate = new SimpleStringMatcher(sb.toString());
}else {
String words[] = lookFor.split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])");
if(words.length > 1 ){
StringBuilder sb = new StringBuilder();
for (int i=0;i < words.length;i++) {
sb.append(words[i].toLowerCase());
if(i < words.length-1) sb.append("-");
}
delegate = new SimpleStringMatcher(sb.toString());
}
}
}
@Override
public boolean match(String in) {
if(super.match(in)) return true;
if(delegate != null && delegate.match(in)) return true;
return false;
}
public static void main(String[] args){
String lookfor="ngRepeatStart";
Matcher matcher = new AngularJSMatcher(lookfor);
System.out.println(matcher.match( "<header ng-repeat-start=\"item in items\">"));
System.out.println(matcher.match( "var ngRepeatStart=\"item in items\">"));
}
}

Author and time matching regex

I would like to use a regex in my Java program to recognize some features of my strings.
I have this type of string:
-Author- has wrote (-hh-:-mm-)
So, for example, I have a string with:
Cecco has wrote (15:12)
and I have to extract the author, hh and mm fields. Obviously there are some restrictions to consider:
hh and mm must be numbers
author has no restrictions
There must be a space between "has wrote" and (
How can I do this with a regex?
EDIT: I attach my snippet:
String mRegex = "(\\s)+ has wrote \\((\\d\\d):(\\d\\d)\\)";
Pattern mPattern = Pattern.compile(mRegex);
String[] str = {
"Cecco CQ has wrote (14:55)", //OK (matched)
"yesterday you has wrote that I'm crazy", //NO (different text)
"Simon has wrote (yesterday)", // NO (yesterday isn't numbers)
"John has wrote (22:32)", //OK
"James has wrote(22:11)", //NO (missed space between has wrote and ()
"Tommy has wrote (xx:ss)" //NO (xx and ss aren't numbers)
};
for(String s : str) {
Matcher mMatcher = mPattern.matcher(s);
while (mMatcher.find()) {
System.out.println(mMatcher.group());
}
}
homework?
Something like:
(.+) has wrote \((\d\d):(\d\d)\)
Should do the trick
() - mark groups to capture (there are three in the above)
.+ - any chars (you said no restrictions)
\d - any digit
\(\) escape the parens as literals instead of a capturing group
use:
Pattern p = Pattern.compile("(.+) has wrote \\((\\d\\d):(\\d\\d)\\)");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
To cope with an optional (HH:mm) at the end you need to start to use some dark regex voodoo:
Pattern p = Pattern.compile("(.+) has wrote\\s?(?:\\((\\d\\d):(\\d\\d)\\))?");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
m = p.matcher("Gareth has wrote");
if( m.matches()){
System.out.println(m.group(1));
// m.group(2) == null since it didn't match anything
}
The new unescaped pattern:
(.+) has wrote\s?(?:\((\d\d):(\d\d)\))?
\s? optionally matches a space (there might not be a space at the end if there isn't a (HH:mm) group)
(?: ... ) is a non-capturing group, i.e. it allows us to put ? after it to make it optional
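As a sanity check, here is a sketch that runs the six sample strings from the question through the first (non-optional) pattern above. The class and method names are illustrative, and using matches() on the whole string is my assumption about the intended behaviour.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AuthorTimeDemo {
    private static final Pattern LINE =
            Pattern.compile("(.+) has wrote \\((\\d\\d):(\\d\\d)\\)");

    // Returns { author, hh, mm } if the whole string fits the format, otherwise null.
    static String[] parse(String s) {
        Matcher m = LINE.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] samples = {
            "Cecco CQ has wrote (14:55)",
            "yesterday you has wrote that I'm crazy",
            "Simon has wrote (yesterday)",
            "John has wrote (22:32)",
            "James has wrote(22:11)",
            "Tommy has wrote (xx:ss)"
        };
        for (String s : samples) {
            // Only the first and fourth samples yield an array; the others print "null".
            System.out.println(s + " -> " + Arrays.toString(parse(s)));
        }
    }
}
```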
I think @codinghorror has something to say about regex
The easiest way to figure out regular expressions is to use a testing tool before coding.
I use an eclipse plugin from http://www.brosinski.com/regex/
Using this I came up with the following result:
([a-zA-Z]*) has wrote \((\d\d):(\d\d)\)
Cecco has wrote (15:12)
Found 1 match(es):
start=0, end=23
Group(0) = Cecco has wrote (15:12)
Group(1) = Cecco
Group(2) = 15
Group(3) = 12
An excellent tutorial on regular expression syntax can be found at http://www.regular-expressions.info/tutorial.html
Well, just in case you didn't know, Matcher has a nice function that can draw out specific groups, or parts of the pattern enclosed by (): Matcher.group(int). For example, if I wanted to match a number between two colons like:
:22:
I could use the regex ":(\\d+):" to match one or more digits between two colons, and then I can fetch specifically the digits with:
Matcher.group(1)
And then it's just a matter of parsing the String into an int. As a note, group numbering starts at 1. group(0) is the whole match, so Matcher.group(0) for the previous example would return :22:
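Put together, a minimal sketch of that :22: example could look like this (the class and method names are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColonNumberDemo {
    private static final Pattern BETWEEN_COLONS = Pattern.compile(":(\\d+):");

    // Finds the first run of digits between two colons and parses it as an int.
    static int extract(String s) {
        Matcher m = BETWEEN_COLONS.matcher(s);
        if (!m.find()) {
            throw new IllegalArgumentException("no :digits: section in " + s);
        }
        return Integer.parseInt(m.group(1)); // group(0) would be ":22:" including the colons
    }

    public static void main(String[] args) {
        System.out.println(extract(":22:")); // prints "22"
    }
}
```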
For your case, I think the regex bits you need to consider are
"[A-Za-z]" for alphabet characters (you could probably also safely use "\\w", which matches alphabet characters, as well as numbers and _).
"\\d" for digits (1,2,3...)
"+" for indicating you want one or more of the previous character or group.

How to create article spinner regex in Java?

Say for example I want to take this phrase:
{{Hello|What's Up|Howdy} {world|planet} |
{Goodbye|Later}
{people|citizens|inhabitants}}
and randomly make it into one of the following:
Hello world
Goodbye people
What's Up world
What's Up planet
Later citizens
etc.
The basic idea is that enclosed within every pair of braces will be an unlimited number of choices separated by "|". The program needs to go through and randomly choose one choice for each set of braces. Keep in mind that braces can be nested endlessly within each other. I found a thread about this and tried to convert it to Java, but it did not work. Here is the python code that supposedly worked:
import re
from random import randint
def select(m):
choices = m.group(1).split('|')
return choices[randint(0, len(choices)-1)]
def spinner(s):
r = re.compile('{([^{}]*)}')
while True:
s, n = r.subn(select, s)
if n == 0: break
return s.strip()
Here is my attempt to convert that Python code to Java.
public String generateSpun(String text){
String spun = new String(text);
Pattern reg = Pattern.compile("{([^{}]*)}");
Matcher matcher = reg.matcher(spun);
while (matcher.find()){
spun = matcher.replaceFirst(select(matcher.group()));
}
return spun;
}
private String select(String m){
String[] choices = m.split("|");
Random random = new Random();
int index = random.nextInt(choices.length - 1);
return choices[index];
}
Unfortunately, when I try to test this by calling
generateAd("{{Hello|What's Up|Howdy} {world|planet} | {Goodbye|Later} {people|citizens|inhabitants}}");
In the main of my program, it gives me an error in the line in generateSpun where Pattern reg is declared, giving me a PatternSyntaxException.
java.util.regex.PatternSyntaxException: Illegal repetition
{([^{}]*)}
Can someone try to create a Java method that will do what I am trying to do?
Here are some of the problems with your current code:
You should reuse your compiled Pattern, instead of Pattern.compile every time
You should reuse your Random, instead of new Random every time
Be aware that String.split is regex-based, so you must split("\\|")
Be aware that curly braces in Java regex must be escaped to match literally, so Pattern.compile("\\{([^{}]*)\\}");
You should query group(1), not group() which defaults to group 0
You're using replaceFirst wrong, look up Matcher.appendReplacement/Tail instead
Random.nextInt(int n) has exclusive upper bound (like many such methods in Java)
The algorithm itself actually does not handle arbitrarily nested braces properly
Note that escaping is done by preceding with \, and as a Java string literal it needs to be doubled (i.e. "\\" contains a single character, the backslash).
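The attachment itself is not reproduced here, so what follows is only my own sketch of how the points above might fit together, not the answerer's code. It mirrors the Python loop by resolving innermost brace groups repeatedly until none remain; the class name Spinner and its methods are hypothetical.

```java
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Spinner {
    // Innermost group: braces with no other braces inside; compiled once and reused.
    private static final Pattern INNERMOST = Pattern.compile("\\{([^{}]*)\\}");
    private final Random random;

    Spinner(Random random) {
        this.random = random; // one Random instance is reused for every choice
    }

    // Picks one of the |-separated choices; limit -1 keeps empty choices such as "a|".
    String select(String group) {
        String[] choices = group.split("\\|", -1);
        return choices[random.nextInt(choices.length)]; // upper bound is exclusive
    }

    // Replaces innermost groups pass by pass until no braces are left.
    String spin(String text) {
        String current = text;
        while (true) {
            Matcher m = INNERMOST.matcher(current);
            StringBuffer sb = new StringBuffer();
            boolean replaced = false;
            while (m.find()) {
                // quoteReplacement guards against $ or \ inside a chosen option
                m.appendReplacement(sb, Matcher.quoteReplacement(select(m.group(1))));
                replaced = true;
            }
            if (!replaced) {
                return current.trim();
            }
            m.appendTail(sb);
            current = sb.toString();
        }
    }

    public static void main(String[] args) {
        Spinner spinner = new Spinner(new Random());
        System.out.println(spinner.spin(
            "{{Hello|What's Up|Howdy} {world|planet} | {Goodbye|Later} {people|citizens|inhabitants}}"));
    }
}
```

One consequence of this design is that every nested group is resolved even when its enclosing alternative ends up discarded, which is wasteful but keeps the code short.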
Attachment
Source code and output with above fix but no major change to algorithm
To fix the regex, add backslashes before the outer { and }. These are meta-characters in Java regexes. However, I don't think that will result in a working program. You are modifying the variable spun after it has been bound to the regex, and I do not think the returned Matcher will reflect the updated value.
I also don't think the python code will work for nested choices. Have you actually tried the python code? You say it "supposedly works", but it would be wise to verify that before you spend a lot of time porting it to Java.
Well, I just created one in PHP and Python; there is a demo at http://spin.developerscrib.com. It's at a very early stage, so it might not work as expected. The source code is on GitHub: https://github.com/razzbee/razzy-spinner
Use this; it works well for me:
Pattern p = Pattern.compile("cat");
Matcher m = p.matcher("one cat two cats in the yard");
StringBuffer sb = new StringBuffer();
while (m.find()) {
m.appendReplacement(sb, "dog");
}
m.appendTail(sb);
System.out.println(sb.toString());
and here
private String select(String m){
String[] choices = m.split("|");
Random random = new Random();
int index = random.nextInt(choices.length - 1);
return choices[index];
}
Instead of m.split("|"), use m.split("\\|").
Otherwise it splits between each and every character,
and use Pattern.compile("\\{([^{}]*)\\}");
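A tiny sketch of the difference between the two split calls (class and method names are illustrative). The exact array produced by the unescaped version differs slightly between Java versions, so only its length is printed here.

```java
import java.util.Arrays;

class SplitPipeDemo {
    // Splits on a literal pipe character.
    static String[] splitChoices(String s) {
        return s.split("\\|");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitChoices("Hello|What's Up|Howdy")));
        // prints "[Hello, What's Up, Howdy]"

        // Unescaped "|" is an alternation of two empty patterns, so it matches between
        // all characters and the result has far more than two elements.
        System.out.println("Hello|Howdy".split("|").length);
    }
}
```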
