Need help in figuring out the right regex pattern - java

I need to extract substrings from a string:
Given string: "< If( ( h == v ) ): { [ < j = (i - f) ;>, < k = (g + t) ;> ] }>"
I need two substrings: "j = (i - f)" and "k = (g + t)".
For this I tried using Pattern with a regex. Here's my code:
Pattern pattern = Pattern.compile("[<*;>]");
Matcher matcher = pattern.matcher(out.get(i).toString());
while (matcher.find())
{
    B2.add(matcher.group());
}
out.get(i).toString() is my input string. B2 is an ArrayList which will contain the two extracted substrings.
But, after running the above code, the output I am getting is : [<, <, ;, >, <, ;, >, >].
My pattern is not working! Your help is very much appreciated.
Thanks in advance!

You can use the expression <([^<]+);>.
This will match things between < and ;>
Pattern pattern = Pattern.compile("<([^<]+);>");
Matcher matcher = pattern.matcher(out.get(i).toString());
while (matcher.find())
{
    B2.add(matcher.group(1));
}
You can see the results on regexplanet: http://fiddle.re/5rty6
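A minimal, self-contained sketch of the answer above, run against the string from the question. The class and method names are made up for illustration, and the `trim()` call is an addition: the captured group still carries the spaces around each statement, while the question asks for the statements without them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AngleBracketExtract {
    // '<', then one or more characters that are not '<', then ";>".
    private static final Pattern STATEMENT = Pattern.compile("<([^<]+);>");

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = STATEMENT.matcher(input);
        while (matcher.find()) {
            // group(1) is the text between '<' and ";>"; trim the padding spaces.
            result.add(matcher.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "< If( ( h == v ) ): { [ < j = (i - f) ;>, < k = (g + t) ;> ] }>";
        System.out.println(extract(input)); // [j = (i - f), k = (g + t)]
    }
}
```

The outermost `<` never produces a match because `[^<]+` cannot cross the next `<`, and there is no `;>` before it.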

Your [ and ] are causing you problems. Those symbols mean "match any one of the characters inside these brackets". If you remove them, you'll get better results. You can also escape your pointy brackets when you do that; Java does not require it, but it is harmless.
The next step will be to capture the groups. You normally use ( and ) for that.
You'll also have to worry about nasty artifacts such as the < at the beginning of the string, which will mess with your regex. To deal with that, exclude < and > from what the group is allowed to contain.
You might end up with
"\<([^<>]*?)\>"
as your regex. Be sure to check the Java documentation and to escape each \ in the string literal, for a final result of
"\\<([^<>]*?)\\>"
If you want to nest other < and > inside your pointy brackets, regex has a lot of trouble with that kind of thing, and maybe you should try a different method.
Here's a sample regex
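To see why the original pattern returned single characters, it may help to run both patterns side by side. This is only a sketch: the helper `findAll` and the class name are made up for illustration, and the second pattern is the one suggested above written without the optional escapes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Collects the requested group of every match of regex in input.
    static List<String> findAll(String regex, int group, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group(group));
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "< If( ( h == v ) ): { [ < j = (i - f) ;>, < k = (g + t) ;> ] }>";
        // A character class matches exactly ONE character out of the set < * ; >
        System.out.println(findAll("[<*;>]", 0, input));
        // -> [<, <, ;, >, <, ;, >, >]
        // Without the brackets, the pattern describes a whole <...> span.
        System.out.println(findAll("<([^<>]*?)>", 1, input));
        // -> two groups, each still containing the padding spaces and the ';'
    }
}
```

Note that the second pattern keeps the trailing `;` inside the group, so it would still need stripping to get exactly the substrings asked for.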

Regex to match/group repeating characters in a string

I need a regular expression that will match groups of characters in a string. Here's an example string:
qwwwwwwwwweeeeerrtyyyyyqqqqwEErTTT
It should match
(match group) "result"
(1) "q"
(2) "wwwwwwwww"
(3) "eeeee"
(4) "rr"
(5) "t"
(6) "yyyyy"
(7) "qqqq"
(8) "w"
(9) "EE"
(10) "r"
(11) "TTT"
after doing some research, this is the best I could come up with
/(.)(\1*)/g
The problem I'm having is that the only way to use the \1 back-reference is to capture the character first. If I could reference the result of a non capturing group I could solve this problem but after researching I don't think it's possible.
How about /((.)(\2*))/g? That way, you match the group as a whole (I'm assuming that that's what you want, and that's what's lacking from the solution you found).
Looks like you need to use a Matcher in a loop:
Pattern p = Pattern.compile("((.)\\2*)");
Matcher m = p.matcher("qwwwwwwwwweeeeerrtyyyyyqqqqwEErTTT");
while (m.find()) {
    System.out.println(m.group(1));
}
Outputs:
q
wwwwwwwww
eeeee
rr
t
yyyyy
qqqq
w
EE
r
TTT
Assuming what @cruncher said as a premise is true: "we want to catch repeating letter groups without knowing beforehand which letter should be repeating" then:
/(a++|b++|c++|d++|e++|f++|g++|h++)/
The above regex captures runs of a repeated letter without hardcoding the order in which they occur; extend the alternation to cover every letter you expect.
The ++ is a possessive quantifier: once it has matched a run, the engine discards the backtracking positions inside that run, so it does not keep state it will never need. (A quantifier cannot be both reluctant and possessive; Java rejects a*?+ as a syntax error.)
Since you did tag java, I'll give an alternative non-regex solution(I believe in requirements being the end product, not the method by which you get there).
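String repeat = "";
char c = '\0'; // an empty char literal ('') does not compile; use a placeholder
for (int i = 0; i < s.length(); i++) { // s is the input string
    if (s.charAt(i) == c) {
        repeat += c; // same character: extend the current run
    } else {
        if (!repeat.isEmpty())
            doSomething(repeat); // add to an array if you want
        c = s.charAt(i); // start a new run
        repeat = "" + c;
    }
}
if (!repeat.isEmpty())
    doSomething(repeat); // flush the last run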
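For comparison, here is a small sketch that puts the back-reference regex and a loop-based variant next to each other and returns the runs as lists. The class and method names are hypothetical, and the loop is a substring-based rewrite of the idea above rather than the answerer's exact code. It assumes the input contains no line terminators, since `.` does not match them by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RunGroups {
    // Group 2 captures one character; \2* extends the match over its repeats.
    private static final Pattern RUN = Pattern.compile("((.)\\2*)");

    static List<String> runsWithRegex(String s) {
        List<String> runs = new ArrayList<>();
        Matcher m = RUN.matcher(s);
        while (m.find()) {
            runs.add(m.group(1));
        }
        return runs;
    }

    static List<String> runsWithLoop(String s) {
        List<String> runs = new ArrayList<>();
        int start = 0; // index where the current run begins
        for (int i = 1; i <= s.length(); i++) {
            // A run ends at the end of the string or when the character changes.
            if (i == s.length() || s.charAt(i) != s.charAt(start)) {
                runs.add(s.substring(start, i));
                start = i;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        String s = "qwwwwwwwwweeeeerrtyyyyyqqqqwEErTTT";
        System.out.println(runsWithRegex(s));
        System.out.println(runsWithLoop(s)); // same list as the regex version
    }
}
```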

How to retain all occurrences of X when using the greedy quantifier X* in a java regexp?

I have a regular expression that I use to find matches of a list of comma-separated words between <> inside a string, like "Hello <a1> sqjsjqk <b1,b2> dsjkfjkdsf <c1,c2,c3> ffsd" in the example.
I want to use capturing groups to retain each word between the angle brackets.
Here is my expression: < (\w+) (?: ,(\w+) )* > (spaces are added for readability but are not part of the pattern)
Parentheses create capturing groups; (?: ) creates a non-capturing group, because I don't want to retain the comma.
Here is my test code:
@Test
public void test() {
    String patternString = "<(\\w+)(?:,(\\w+))*>";
    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher("Hello <a1> sqjsjqk <b1,b2> dsjkfjkdsf <c1,c2,c3> ffsd");
    while(matcher.find()) {
        System.out.println("== Match ==");
        MatchResult matchResult = matcher.toMatchResult();
        for(int i = 0; i < matchResult.groupCount(); i++) {
            System.out.println(" " + matchResult.group(i + 1));
        }
    }
}
This is the output produced:
== Match ==
a1
null
== Match ==
b1
b2
== Match ==
c1
c3
And here is what I wanted:
== Match ==
a1
== Match ==
b1
b2
== Match ==
c1
c2
c3
From this I understand that there are exactly as many groups as there are capturing groups in my expression, but this is not what I want, because I need all the substrings that were recognized as \w+.
Is there any chance to get what I want with a single regex, or should I finish the job with split(","), trim(), etc.?
As far as I know, .NET has the only regex engine out there that can return multiple captures for a single capturing group. So what you are asking for is not possible in Java (at least not the way you asked for).
In your case this problem can however be solved to a certain extent. If you can be sure that there will never be an unmatched closing >, you can make the stuff you want to capture the full match, and require the correct position through a lookahead:
"\\w+(?=(?:,\\w+)*>)"
This can never match "words" outside of <...>, because they cannot get past the opening < to match the closing >. Of course that makes it hard to distinguish between elements from different sets of <...>.
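A short sketch of the lookahead approach, run on the string from the question. The class and method names are made up for illustration. As noted above, the result is one flat list, so the information about which `<...>` set a word came from is lost.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadWords {
    // A word counts only if, after optional ",word" repeats, a '>' follows.
    private static final Pattern WORD = Pattern.compile("\\w+(?=(?:,\\w+)*>)");

    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            result.add(m.group()); // the whole match is the word itself
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "Hello <a1> sqjsjqk <b1,b2> dsjkfjkdsf <c1,c2,c3> ffsd";
        System.out.println(words(input)); // [a1, b1, b2, c1, c2, c3]
    }
}
```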
Alternatively (and I suppose that is even better, because it's safer, and more readable), go for a two-step algorithm. First match
"<([\\w,]*)>"
Then split every result's first capture at ,.
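The two-step algorithm could look roughly like this. The names are hypothetical, and the sketch assumes every `<...>` set contains at least one word (an empty `<>` would yield a single empty string after the split).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepCapture {
    // Step 1: capture everything between '<' and '>' as one group.
    private static final Pattern SET = Pattern.compile("<([\\w,]*)>");

    static List<List<String>> wordSets(String input) {
        List<List<String>> sets = new ArrayList<>();
        Matcher m = SET.matcher(input);
        while (m.find()) {
            // Step 2: split the captured text at the commas.
            sets.add(Arrays.asList(m.group(1).split(",")));
        }
        return sets;
    }

    public static void main(String[] args) {
        String input = "Hello <a1> sqjsjqk <b1,b2> dsjkfjkdsf <c1,c2,c3> ffsd";
        System.out.println(wordSets(input)); // [[a1], [b1, b2], [c1, c2, c3]]
    }
}
```

Unlike the lookahead version, this keeps the words grouped per match, which is what the desired output in the question shows.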

Java recursive(?) repeated(?) deep(?) pattern matching

I'm trying to get ALL the substrings in the input string that match the given pattern.
For example,
Given string: aaxxbbaxb
Pattern: a[a-z]{0,3}b
(What I actually want to express is: all the substrings that start with a and end with b, with up to 3 letters in between)
Exact results that I want (with their indexes):
aaxxb: index 0~4
axxb: index 1~4
axxbb: index 1~5
axb: index 6~8
But when I run it through the Pattern and Matcher classes using Pattern.compile() and Matcher.find(), it only gives me:
aaxxb : index 0~4
axb : index 6~8
This is the piece of code I used.
Pattern pattern = Pattern.compile("a[a-z]{0,3}b", Pattern.CASE_INSENSITIVE);
Matcher match = pattern.matcher("aaxxbbaxb");
while (match.find()) {
    System.out.println(match.group());
}
How can I retrieve every single piece of string that matches the pattern?
Of course, it doesn't have to use Pattern and Matcher classes, as long as it's efficient :)
(see: All overlapping substrings matching a java regex )
Here is the full solution that I came up with. It can handle zero-width patterns, boundaries, etc. in the original regular expression. It looks through all substrings of the text string and checks whether the regular expression matches only at the specific position by padding the pattern with the appropriate number of wildcards at the beginning and end. It seems to work for the cases I tried -- although I haven't done extensive testing. It is most certainly less efficient than it could be.
public static void allMatches(String text, String regex)
{
    for (int i = 0; i < text.length(); ++i) {
        for (int j = i + 1; j <= text.length(); ++j) {
            String positionSpecificPattern = "((?<=^.{"+i+"})("+regex+")(?=.{"+(text.length() - j)+"}$))";
            Matcher m = Pattern.compile(positionSpecificPattern).matcher(text);
            if (m.find())
            {
                System.out.println("Match found: \"" + (m.group()) + "\" at position [" + i + ", " + j + ")");
            }
        }
    }
}
You are in effect searching for the strings ab, a_b, a__b, and a___b in an input string, where _ denotes a letter whose value you do not care about.
That's four search targets. The most efficient way I can think of to do this would be to use a search algorithm like the Knuth-Morris-Pratt algorithm, with a few modifications. In effect your pseudocode would be something like:
for i in 0 to sourcestring.length - searchstring.length
    check sourcestring[i] - is it a? if so, check sourcestring[i + x]
    // where x is the length of the search string - 1
    if that is b then save i to output list
Obviously, if you have a position match you must then check the inner characters of the substring to make sure they are alphabetical.
Run the algorithm four times, once for each search term. It will doubtless be much faster than trying to do the search using pattern matching.
Edit: sorry, I didn't read the question properly. If you have to use regex, the above will not work for you.
One thing you could do is:
Create all possible substrings that are 4 characters or longer (good luck with that if your String is large)
Create a new Matcher for each of these substrings
Call matches() instead of find()
Calculate the absolute offset from the substring's relative offset and the matcher info
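A variation on the steps above, as a sketch only: instead of creating substrings and a new Matcher for each, it reuses a single Matcher and restricts it with `region(start, end)`, so `matches()` has to cover exactly that window and the indexes are already absolute. It checks every window of length 1 or more, which is an assumption made for generality. The class and method names are hypothetical. By default, anchors and lookarounds treat the region edges as the ends of the input, which may or may not be what a given pattern needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingMatches {
    // Returns every substring that matches regex as a whole, with inclusive indexes.
    static List<String> allMatches(String text, String regex) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                m.region(start, end); // limit matching to text[start, end)
                if (m.matches()) {    // the whole region must match
                    found.add(m.group() + " [" + start + "," + (end - 1) + "]");
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (String hit : allMatches("aaxxbbaxb", "a[a-z]{0,3}b")) {
            System.out.println(hit);
        }
        // aaxxb [0,4]
        // axxb [1,4]
        // axxbb [1,5]
        // axb [6,8]
    }
}
```

This is still quadratic in the length of the text, so it is a fit for short inputs rather than an efficient general solution.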

Author and time matching regex

I would like to use a regex in my Java program to recognize some features of my strings.
I have this type of string:
-Author- has wrote (-hh-:-mm-)
So, for example, I have a string with:
Cecco has wrote (15:12)
and I have to extract the author, hh and mm fields. Obviously I have some restrictions to consider:
hh and mm must be numbers
author has no restrictions
there must be a space between "has wrote" and (
How can I use a regex for this?
EDIT: I attach my snippet:
String mRegex = "(\\s)+ has wrote \\((\\d\\d):(\\d\\d)\\)";
Pattern mPattern = Pattern.compile(mRegex);
String[] str = {
    "Cecco CQ has wrote (14:55)", //OK (matched)
    "yesterday you has wrote that I'm crazy", //NO (different text)
    "Simon has wrote (yesterday)", // NO (yesterday isn't numbers)
    "John has wrote (22:32)", //OK
    "James has wrote(22:11)", //NO (missed space between has wrote and ()
    "Tommy has wrote (xx:ss)" //NO (xx and ss aren't numbers)
};
for(String s : str) {
    Matcher mMatcher = mPattern.matcher(s);
    while (mMatcher.find()) {
        System.out.println(mMatcher.group());
    }
}
homework?
Something like:
(.+) has wrote \((\d\d):(\d\d)\)
Should do the trick
() - mark groups to capture (there are three in the above)
.+ - any chars (you said no restrictions)
\d - any digit
\(\) escape the parens as literals instead of a capturing group
use:
Pattern p = Pattern.compile("(.+) has wrote \\((\\d\\d):(\\d\\d)\\)");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
    System.out.println(m.group(1));
    System.out.println(m.group(2));
    System.out.println(m.group(3));
}
To cope with an optional (HH:mm) at the end you need to start to use some dark regex voodoo:
Pattern p = Pattern.compile("(.+) has wrote\\s?(?:\\((\\d\\d):(\\d\\d)\\))?");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
    System.out.println(m.group(1));
    System.out.println(m.group(2));
    System.out.println(m.group(3));
}
m = p.matcher("Gareth has wrote");
if( m.matches()){
    System.out.println(m.group(1));
    // m.group(2) == null since it didn't match anything
}
The new unescaped pattern:
(.+) has wrote\s?(?:\((\d\d):(\d\d)\))?
\s? optionally matches a space (there might not be a space at the end if there isn't a (HH:mm) group)
(?: ... ) is a non-capturing group; it lets us put ? after it to make it optional
I think @codinghorror has something to say about regex
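As a rough check, the first (non-optional) pattern from this answer can be run against the six test strings from the question. The class name, the `parse` helper and its `String[]`-or-`null` return convention are all made up for illustration. Note that `\d\d` only checks for two digits, so a value such as 99:99 would still be accepted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AuthorTimeParser {
    private static final Pattern LINE =
            Pattern.compile("(.+) has wrote \\((\\d\\d):(\\d\\d)\\)");

    // Returns {author, hh, mm}, or null when the whole line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] lines = {
            "Cecco CQ has wrote (14:55)",
            "yesterday you has wrote that I'm crazy",
            "Simon has wrote (yesterday)",
            "John has wrote (22:32)",
            "James has wrote(22:11)",
            "Tommy has wrote (xx:ss)"
        };
        for (String line : lines) {
            String[] parts = parse(line);
            // Only the first and fourth lines are expected to match.
            System.out.println(parts == null
                    ? "no match: " + line
                    : parts[0] + " / " + parts[1] + " / " + parts[2]);
        }
    }
}
```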
The easiest way to figure out regular expressions is to use a testing tool before coding.
I use an eclipse plugin from http://www.brosinski.com/regex/
Using this I came up with the following result:
([a-zA-Z]*) has wrote \((\d\d):(\d\d)\)
Cecco has wrote (15:12)
Found 1 match(es):
start=0, end=23
Group(0) = Cecco has wrote (15:12)
Group(1) = Cecco
Group(2) = 15
Group(3) = 12
An excellent tutorial on regular expression syntax can be found at http://www.regular-expressions.info/tutorial.html
Well, just in case you didn't know, Matcher has a nice method that can pull out specific groups, i.e. the parts of the pattern enclosed by (): Matcher.group(int). For example, if I wanted to match a number between two colons like:
:22:
I could use the regex ":(\\d+):" to match one or more digits between two colons, and then fetch just the digits with:
Matcher.group(1)
And then it's just a matter of parsing the String into an int. As a note, group numbering starts at 1. group(0) is the whole match, so Matcher.group(0) for the previous example would return :22:
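The colon example, written out as a small runnable sketch. The class and method names are hypothetical, and throwing an exception when nothing matches is just one possible choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberDemo {
    private static final Pattern BETWEEN_COLONS = Pattern.compile(":(\\d+):");

    // Returns the first number found between two colons.
    static int numberBetweenColons(String input) {
        Matcher m = BETWEEN_COLONS.matcher(input);
        if (!m.find()) {
            throw new IllegalArgumentException("no :digits: found in " + input);
        }
        // group(0) would be ":22:", group(1) is only the digits.
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        System.out.println(numberBetweenColons(":22:")); // 22
    }
}
```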
For your case, I think the regex bits you need to consider are
"[A-Za-z]" for alphabet characters (you could probably also safely use "\\w", which matches alphabet characters, as well as digits and _).
"\\d" for digits (1,2,3...)
"+" for indicating you want one or more of the previous character or group.

Regex to find commas that aren't inside "( and )"

I need some help to model this regular expression. I think it'll be easier with an example. I need a regular expression that matches a comma, but only if it's not inside this structure: "( )", like this:
,a,b,c,d,"("x","y",z)",e,f,g,
Then the first five and the last four commas should match the expression, the two between xyz and inside the ( ) section shouldn't.
I tried a lot of combinations, but regular expressions are still a little foggy for me.
I want to use it with the split method in Java. The example is short, but the real input can be much longer and have more than one section between "( and )". The split method receives an expression, and any text that matches the expression (in this case the comma) is treated as a separator.
So, I want to do something like this:
String keys[] = row.split(expr);
System.out.println(keys[0]); // print a
System.out.println(keys[1]); // print b
System.out.println(keys[2]); // print c
System.out.println(keys[3]); // print d
System.out.println(keys[4]); // print "("x","y",z)"
System.out.println(keys[5]); // print e
System.out.println(keys[6]); // print f
System.out.println(keys[7]); // print g
Thanks!
You can do this with a negative lookahead. Here's a slightly simplified problem to illustrate the idea:
String text = "a;b;c;d;<x;y;z>;e;f;g;<p;q;r;s>;h;i;j";
String[] parts = text.split(";(?![^<>]*>)");
System.out.println(java.util.Arrays.toString(parts));
// _ _ _ _ _______ _ _ _ _________ _ _ _
// [a, b, c, d, <x;y;z>, e, f, g, <p;q;r;s>, h, i, j]
Note that instead of ,, the delimiter is now ;, and instead of "( and "), the parentheses are simply < and >, but the idea still works.
On the pattern
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
The * repetition specifier can be used to match "zero-or-more times" of the preceding pattern.
The (?!…) is a negative lookahead; it can be used to assert that a certain pattern DOES NOT match, looking ahead (i.e. to the right) of the current position.
The pattern [^<>]*> matches a sequence (possibly empty) of everything except parentheses, finally followed by a parenthesis which is of the closing type.
Putting all of the above together, we get ;(?![^<>]*>), which matches a ;, but only if we can't see the closing parenthesis as the first parenthesis to its right, because witnessing such phenomenon would only mean that the ; is "inside" the parentheses.
This technique, with some modifications, can be adapted to the original problem. Remember to escape regex metacharacters ( and ) as necessary, and of course " as well as \ in a Java string literal must be escaped by preceding with a \.
You can also make the * possessive to try to improve performance, i.e. ;(?![^<>]*+>).
References
regular-expressions.info/Character class, Repetition, Lookarounds, Possessive
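One way the technique might be adapted to the original input is sketched below. It assumes the parentheses are balanced, never nested, and only ever appear as the delimiters of those blocks; the class and method names are made up for illustration.

```java
import java.util.Arrays;

class SplitOutsideParens {
    // A comma, unless the next parenthesis to its right is a closing one.
    private static final String COMMA_OUTSIDE_PARENS = ",(?![^()]*\\))";

    static String[] split(String row) {
        return row.split(COMMA_OUTSIDE_PARENS);
    }

    public static void main(String[] args) {
        String row = ",a,b,c,d,\"(\"x\",\"y\",z)\",e,f,g,";
        String[] keys = split(row);
        System.out.println(Arrays.toString(keys));
        // The row starts with a comma, so keys[0] is the empty string,
        // "a" is keys[1], and the parenthesized block is keys[5].
        // split drops trailing empty strings, so the final comma adds nothing.
    }
}
```

Because of the leading comma, the indexes are shifted by one compared with the printout sketched in the question; skipping `keys[0]` or stripping the leading comma first would line them up.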
Try this one:
(?![^(]*\)),
It worked for me in my testing, grabbed all commas not inside parenthesis.
Edit: Gopi pointed out the need to escape the backslashes in Java:
(?![^(]*\\)),
Edit: Alan Moore pointed out some unnecessary complexity. Fixed.
If the parens are paired correctly and cannot be nested, you can split the text first at parens, then process the chunks.
List<String> result = new ArrayList<String>();
String[] chunks = text.split("[()]");
for (int i = 0; i < chunks.length; i++) {
    if ((i % 2) == 0) {
        String[] atoms = chunks[i].split(",");
        for (int j = 0; j < atoms.length; j++)
            result.add(atoms[j]);
    }
    else
        result.add(chunks[i]);
}
Well,
After some tests I just found an answer that does what I need for now. At the moment, all items inside the "( ... )" block are inside "" too, as in "("a", "b", "c")", so the regex ((?<!\"),)|(,(?!\")) works great for what I want!
But I'm still looking for one that can find the commas even if there are no "" around the inner terms.
Thanks for the help, guys.
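A quick sketch of how that lookaround alternation behaves with `split`. The sample row is invented for illustration: every item inside the block is quoted, there are no spaces after the commas, and no quoted fields sit next to each other outside the block (two adjacent quoted fields such as "p","q" would not be split by this regex).

```java
import java.util.Arrays;

class QuoteAwareSplit {
    // A comma that is not preceded by a quote, or not followed by one.
    // Only a comma with a quote on BOTH sides is left alone.
    private static final String SEPARATOR = "(?<!\"),|,(?!\")";

    static String[] split(String row) {
        return row.split(SEPARATOR);
    }

    public static void main(String[] args) {
        String row = "a,b,\"(\"x\",\"y\",\"z\")\",c";
        System.out.println(Arrays.toString(split(row)));
        // [a, b, "("x","y","z")", c]
    }
}
```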
This should do what you want:
(".*")|([a-z])
I didn't check this in Java, but if you test it with http://www.fileformat.info/tool/regex.htm
the groups $1 and $2 contain the right values, so they match and you should get what you want.
This gets a little trickier if you have values more complex than a-z between the commas.
If I understand split correctly, don't use it; just fill your array with the backreference $0, which holds the values you are looking for.
Maybe a match function would be a better way to go, because you get this really simple regex. The other solutions I've seen so far are very good, no question about that, but they are really complicated, and in two weeks you won't remember what the regex even did.
Inverting the problem often makes it simpler.
I had the same issue. I chose Adam Schmideg's answer and improved it.
I had to deal with these 3 strings, for example:
France (Grenoble, Lyon), Germany (Berlin, Munich)
Italy, Suede, Belgium, Portugal
France, Italy (Torino), Spain (Bercelona, Madrid), Austria
The idea was to have :
France (Grenoble, Lyon)
or Germany (Berlin, Munich)
Italy, Suede, Belgium, Portugal
France, Italy (Torino), Spain (Bercelona, Madrid), Austria
I chose not to use regex because I wanted to be 100% sure of what I was doing and that it would work in any case.
String[] chunks = input.split("[()]");
for (int i = 0; i < chunks.length; i++) {
    if ((i % 2) != 0) {
        chunks[i] = "("+chunks[i].replaceAll(",", ";")+")";
    }
}
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < chunks.length; i++) {
    buffer.append(chunks[i]);
}
String s = buffer.toString();
String[] output = s.split(",");
