Parsing a value using regular expression in java - java

I am trying to read a line and parse a value using regular expression in java. The line that contains the value looks something like this,
...... TESTYY912345 .......
...... TESTXX967890 ........
Basically, it contains 4 letters, then any two ASCII values followed by numeric 9 then (any) digits. And, i want to get the value, 912345 and 967890.
This is what I have so far in regular expression,
... TEST[\x00-\xff]{2}[9]{1} ...
But, this skips the 9 and parse 12345 and 67890. (I want to include 9 as well).
Thanks for your help.

You are pretty close. Capture the entire group (9\\d*) after matching TEST\\p{ASCII}{2}. This way, you'll capture the 9 and the following digits:
String s = "...... TESTYY912345 ......";
Pattern p = Pattern.compile("TEST\\p{ASCII}{2}(9\\d+)");
Matcher m = p.matcher(s);
if (m.find()) {
System.out.println(m.group(1)); // 912345
}

See my comment for a working expression, "TEST.{2}(9\\d*)".
final Pattern pattern = Pattern.compile("TEST.{2}(9\\d*)");
for (final String str : Arrays.asList("...... TESTYY912345 .......",
"...... TESTXX967890 ........")) {
final Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
final int value = Integer.valueOf(matcher.group(1));
System.out.println(value);
}
}
See the result on ideone:
912345
967890
This will match any two characters (except a line terminator) for what is XX and YY in your example, and will take any digits after the 9.

Related

Regex capturing groups within logical OR

I have a set of strings I need to parse and extract values from. They look like:
/apple/1212d3fe
/cat/23224a2f4
/auto/445478eefd
/somethingelse/1234fded
It should match only apple, cat and auto. The output I expect is:
1212, d3fe
23224, a2f4
445478, eefd
null
I need to come up with a regex capturing groups to do the same. I am able to extract the second part but not the first one. The closest I came up with is:
String r2 = "^/(apple/[0-9]{4}|cat/[0-9]{5}|auto/[0-9]{6})([a-f0-9]{4})$";
System.out.println(r2);
Pattern pattern2 = Pattern.compile(r2);
Matcher matcher2 = pattern2.matcher("/apple/2323efff");
if (matcher2.find()) {
System.out.println(matcher2.group(1));
System.out.println(matcher2.group(2));
}
UPDATED QUESTION:
I have a set of strings I need to parse and extract values from. They look like:
/apple/1212d3fe
/cat/23e24a2f4
/auto/df5478eefd
/somethingelse/1234fded
It should match only apple, cat and auto. The output I expect is the everything after the 2nd '/' split as follows: 4 characters if 'apple', 5 characters if 'cat' and 6 characters if 'auto' like:
1212, d3fe
23e24, a2f4
df5478, eefd
null
I need to come up with a regex capturing groups to do the same. I am able to extract the second part but not the first one. The closest I came up with is:
String r2 = "^/(apple/[0-9]{4}|cat/[0-9]{5}|auto/[0-9]{6})([a-f0-9]{4})$";
System.out.println(r2);
Pattern pattern2 = Pattern.compile(r2);
Matcher matcher2 = pattern2.matcher("/apple/2323efff");
if (matcher2.find()) {
System.out.println(matcher2.group(1));
System.out.println(matcher2.group(2));
}
I can do it without the regex OR(|) but it breaks when I include it. Any help with the right regex?
Updated Answer:
As per your updated question you can use this regex based on lookbehind assertions:
/((?<=apple/).{4}|(?<=cat/).{5}|(?<=auto/).{6})(.+)$
RegEx Demo
This regex uses 2 capture groups after matching /
In 1st group we have 3 lookbehind conditions with alternations.
(?<=apple/).{4} makes sure that we match 4 characters that have apple/ on left hand side. Likewise we match 5 and 6 character strings that have cat/ and /auto/.
In 2nd capture group we match remaining characters before end of line.
You could use the regex \/[apple|auto|cat]+\/(\d*)(.*), See here
If you want the last group to have exactly 4 digits you can use this regex:
/(apple|cat|auto)/([0-9a-f]+)([0-9a-f]{4})
Here is a working example:
List<String> strings = Arrays.asList("/apple/1212d3fe", "/cat/23224a2f4", "/auto/445478eefd");
Pattern pattern = Pattern.compile("/(apple|cat|auto)/([0-9a-f]+)([0-9a-f]{4})");
for (String string : strings) {
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
System.out.println(matcher.group(3));
}
}
If you want for digits after apple, 5 after cat and 6 after auto you can split your algorithm in 2 parts:
List<String> strings = Arrays.asList("/apple/1212d3fe", "/cat/23224a2f4", "/auto/445478eefd", "/some/445478eefd");
Pattern firstPattern = Pattern.compile("/(apple|cat|auto)/([0-9a-f]+)");
for (String string : strings) {
Matcher firstMatcher = firstPattern.matcher(string);
if (firstMatcher.find()) {
String first = firstMatcher.group(1);
System.out.println(first);
int length = getLength(first);
Pattern secondPattern = Pattern.compile("([0-9a-f]{" + length + "})([0-9a-f]{4})");
Matcher secondMatcher = secondPattern.matcher(string);
if (secondMatcher.find()) {
System.out.println(secondMatcher.group(1));
System.out.println(secondMatcher.group(2));
}
}
}
private static int getLength(String key) {
switch (key) {
case "apple":
return 4;
case "cat":
return 5;
case "auto":
return 6;
}
throw new IllegalArgumentException("key not allowed");
}

Repeating capture group in a regular expression

What would be the best way to parse the following string in Java using a single regex?
String:
someprefix foo=someval baz=anotherval baz=somethingelse
I need to extract someprefix, someval, anotherval and somethingelse. The string always contains a prefix value (someprefix in the example) and can have from 0 to 4 key-value pairs (foo=someval baz=anotherval baz=somethingelse in the example)
You can use this regex for capturing your intended text,
(?<==|^)\w+
Which captures a word that is preceded by either an = character or is at ^ start of string.
Sample java code for same,
Pattern p = Pattern.compile("(?<==|^)\\w+");
String s = "someprefix foo=someval baz=anotherval baz=somethingelse";
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group());
}
Prints,
someprefix
someval
anotherval
somethingelse
Live Demo

Java regex doesn't find numbers

I'm trying to parse some text, but for some strange reason, Java regex doesn't work. For example, I've tried:
Pattern p = Pattern.compile("[A-Z][0-9]*,[0-9]*");
Matcher m = p.matcher("H3,4");
and it simply gives No match found exception, when I try to get the numbers m.group(1) and m.group(2). Am I missing something about how Java regex works?
Yes.
You must actually call matches() or find() on the matcher first.
Your regex must actually contain capturing groups
Example:
Pattern p = Pattern.compile("[A-Z](\\d*),(\\d*)");
matcher m = p.matcher("H3,4");
if (m.matches()) {
// use m.group(1), m.group(2) here
}
You also need the parenthesis to specify what is part of each group. I changed the leading part to be anything that's not a digit, 0 or more times. What's in each group is 1 or more digits. So, not * but + instead.
Pattern p = Pattern.compile("[^0-9]*([0-9]+),([0-9]+)");
Matcher m = p.matcher("H3,4");
if (m.matches())
{
String g1 = m.group(1);
String g2 = m.group(2);
}

Author and time matching regex

I would to use a regex in my Java program to recognize some feature of my strings.
I've this type of string:
`-Author- has wrote (-hh-:-mm-)
So, for example, I've a string with:
Cecco has wrote (15:12)
and i've to extract author, hh and mm fields. Obviously I've some restriction to consider:
hh and mm must be numbers
author hasn't any restrictions
I've to consider space between "has wrote" and (
How can I can use regex?
EDIT: I attach my snippet:
String mRegex = "(\\s)+ has wrote \\((\\d\\d):(\\d\\d)\\)";
Pattern mPattern = Pattern.compile(mRegex);
String[] str = {
"Cecco CQ has wrote (14:55)", //OK (matched)
"yesterday you has wrote that I'm crazy", //NO (different text)
"Simon has wrote (yesterday)", // NO (yesterday isn't numbers)
"John has wrote (22:32)", //OK
"James has wrote(22:11)", //NO (missed space between has wrote and ()
"Tommy has wrote (xx:ss)" //NO (xx and ss aren't numbers)
};
for(String s : str) {
Matcher mMatcher = mPattern.matcher(s);
while (mMatcher.find()) {
System.out.println(mMatcher.group());
}
}
homework?
Something like:
(.+) has wrote \((\d\d):(\d\d)\)
Should do the trick
() - mark groups to capture (there are three in the above)
.+ - any chars (you said no restrictions)
\d - any digit
\(\) escape the parens as literals instead of a capturing group
use:
Pattern p = Pattern.compile("(.+) has wrote \\((\\d\\d):(\\d\\d)\\)");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
To cope with an optional (HH:mm) at the end you need to start to use some dark regex voodoo:
Pattern p = Pattern.compile("(.+) has wrote\\s?(?:\\((\\d\\d):(\\d\\d)\\))?");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
m = p.matcher("Gareth has wrote");
if( m.matches()){
System.out.println(m.group(1));
// m.group(2) == null since it didn't match anything
}
The new unescaped pattern:
(.+) has wrote\s?(?:\((\d\d):(\d\d)\))?
\s? optionally match a space (there might not be a space at the end if there isn't a (HH:mm) group
(?: ... ) is a none capturing group, i.e. allows use to put ? after it to make is optional
I think #codinghorror has something to say about regex
The easiest way to figure out regular expressions is to use a testing tool before coding.
I use an eclipse plugin from http://www.brosinski.com/regex/
Using this I came up with the following result:
([a-zA-Z]*) has wrote \((\d\d):(\d\d)\)
Cecco has wrote (15:12)
Found 1 match(es):
start=0, end=23
Group(0) = Cecco has wrote (15:12)
Group(1) = Cecco
Group(2) = 15
Group(3) = 12
An excellent turorial on regular expression syntax can be found at http://www.regular-expressions.info/tutorial.html
Well, just in case you didn't know, Matcher has a nice function that can draw out specific groups, or parts of the pattern enclosed by (), Matcher.group(int). Like if I wanted to match for a number between two semicolons like:
:22:
I could use the regex ":(\\d+):" to match one or more digits between two semicolons, and then I can fetch specifically the digits with:
Matcher.group(1)
And then its just a matter of parsing the String into an int. As a note, group numbering starts at 1. group(0) is the whole match, so Matcher.group(0) for the previous example would return :22:
For your case, I think the regex bits you need to consider are
"[A-Za-z]" for alphabet characters (you could probably also safely use "\\w", which matchers alphabet characters, as well as numbers and _).
"\\d" for digits (1,2,3...)
"+" for indicating you want one or more of the previous character or group.

How do I parse recurring pattern with regex

I want to use regex to find unknown number of arguments in a string. I think that if I explain it would be hard so let's just see the example:
The regex: #ISNULL\('(.*?)','(.*?)','(.*?)'\)
The String: #ISNULL('1','2','3')
The result:
Group[0] "#ISNULL('1','2','3')" at 0 - 20
Group[1] "1" at 9 - 10
Group[2] "2" at 13 - 14
Group[3] "3" at 17 - 18
That's working great.
The problem begins when I need to find unknown number of arguments (2 and more).
What changes do I need to do to the regex in order to find all the arguments that will occur in the string?
So, if I parse this string "#ISNULL('1','2','3','4','5','6')" I'll find all the arguments.
If you don't know the number of potential matches in a repeated construct, you need a regex engine that supports captures in addition to capturing groups. Only .NET and Perl 6 offer this currently.
In C#:
string pattern = #"#ISNULL\(('([^']*)',?)+\)";
string input = #"#ISNULL('1','2','3','4','5','6')";
Match match = Regex.Match(input, pattern);
if (match.Success) {
Console.WriteLine("Matched text: {0}", match.Value);
for (int ctr = 1; ctr < match.Groups.Count; ctr++) {
Console.WriteLine(" Group {0}: {1}", ctr, match.Groups[ctr].Value);
int captureCtr = 0;
foreach (Capture capture in match.Groups[ctr].Captures) {
Console.WriteLine(" Capture {0}: {1}",
captureCtr, capture.Value);
captureCtr++;
}
}
}
In other regex flavors, you have to do it in two steps. E.g., in Java (code snippets courtesy of RegexBuddy):
First, find the part of the string you need:
Pattern regex = Pattern.compile("#ISNULL\\(('([^']*)',?)+\\)");
// or, using non-capturing groups:
// Pattern regex = Pattern.compile("#ISNULL\\((?:'(?:[^']*)',?)+\\)");
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
ResultString = regexMatcher.group();
}
Then use another regex to find and iterate over your matches:
List<String> matchList = new ArrayList<String>();
try {
Pattern regex = Pattern.compile("'([^']*)'");
Matcher regexMatcher = regex.matcher(ResultString);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group(1));
}
This answer is somewhat speculative as i have no clue what regex engine you are using.
If the parameters are always numbers and always enclosed in single quotes, then why don't you try using the digit class like this:
'(\d)+?'
This is just the \d class and the extraneous #ISNULL stuff removed, as i assume you are only interested in the parameters themselves. You may not need the + and of course i don't know whether the engine you are using supports the lazy ? operator, just give it a go.

Categories

Resources