Regex matches with multiple patterns - java

I'm trying to compile one Java Regex pattern but have that pattern look for 3 different matches. I've learned that I can do that using the pipe (|) but I'm having trouble with the actual syntax of the regex.
I'm looking through XML data and trying to pull out 3 matches. The XML will look something like this:
<Element createdOn="1405358703367" updatedOn="1405358718804" url="http://www.someurl.com" />
The regex I'm trying looks like this so far:
((?<="url": ").*(?=")) | (createdOn="(\d)") | (updatedOn="(\d)")
In the end I need to get everything between the quotes in the XML (i.e. 1405358703367, 1405358718804, and http://www.someurl.com).
I had the URL regex working on its own earlier, but now no matches seem to be made.
Thanks.

Get the matched group from index 2.
(url|createdOn|updatedOn)="([^"]*)"
Here is sample code:
String string = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String patternString = "(url|createdOn|updatedOn)=\"([^\"]*)\"";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(2));
}
output:
1405358703367
1405358718804
http://www.someurl.com

Java doesn't have a library method that extracts matches, but you only need one line:
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
This works by stripping off leading and trailing input up to/from the first/last quote, then splits on quote-nonquote-quote input, leaving the target matches as an array.
In action:
String input = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
System.out.println(Arrays.toString(matches));
Output:
[1405358703367, 1405358718804, http://www.someurl.com]

The pipe (|) is used to find a match that could be some-pattern OR some-other-pattern OR yet-another-pattern. It's not good at finding all occurrences of several patterns. To do that, if the patterns you're looking for aren't necessarily in a fixed order, you'll need to use a loop.
Here's a code example that starts with the pattern you tried, fixes some problems, and uses a loop to find the patterns:
Pattern p = Pattern.compile("((?<=url=\").*(?=\"))|(createdOn=\"(\\d+)\")|(updatedOn=\"(\\d+)\")");
Matcher m = p.matcher(source);
while (m.find()) {
System.out.println("Found: "+m.group());
System.out.println("Group 1: "+m.group(1));
System.out.println("Group 3: "+m.group(3));
System.out.println("Group 5: "+m.group(5));
}
(Some problems with your original pattern: You put space characters before and after each |, which are treated literally and mean the pattern has to match spaces that aren't there. I added + after \\d because you want to match more than one digit. There were some mistakes, like putting : after url instead of =.)
Now the code uses a loop to find each successive pattern that matches one of the patterns you're looking for. It matches either url=... or createdOn=... or updatedOn=..., but by using a loop we will find all of them. (Note that it doesn't care if it sees a url or a createdOn attribute twice in the source. You'll have to check that yourself.)
The group() method with no parameters will return whatever was matched by the pattern. group(1), group(3), and group(5) return certain subsections of the pattern; the numbers are determined by counting wherever you use ( in the pattern except for (?. So group 1 matches something using url as a lookbehind; group 2 starts with createdOn; group 3 is the sequence of digits following createdOn; group 4 starts with updatedOn, etc. The way the pattern is set up, not all of these will have values, since only one of the three alternatives separated by | will match. The rest will be null. As a result, the output of the above code will display null for two of the groups, and a useful value for the other. If you do things this way, you'll need to test for null to see which value actually got returned.
This would also be a case where named capturing groups could be useful. See http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html. Matcher has a group(name) function that takes a group name as a parameter.
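As a rough sketch of the named-group idea, the pattern below gives each alternative its own name, so the loop can ask for a value by name instead of counting parentheses. The group names (url, created, updated) and the class and method names are illustrative assumptions, not part of the original answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupsDemo {
    // Group names are hypothetical; any letter-led alphanumeric name works.
    static final Pattern ATTRIBUTE = Pattern.compile(
        "url=\"(?<url>[^\"]*)\""
        + "|createdOn=\"(?<created>\\d+)\""
        + "|updatedOn=\"(?<updated>\\d+)\"");

    static Map<String, String> extract(String source) {
        Map<String, String> found = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(source);
        while (m.find()) {
            // Only the alternative that matched has a non-null group.
            for (String name : new String[] {"url", "created", "updated"}) {
                String value = m.group(name);
                if (value != null) {
                    found.put(name, value);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "<Element createdOn=\"1405358703367\" "
            + "updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
        System.out.println(extract(source));
    }
}
```

The null check is still needed, because each call to find() matches only one of the three alternatives; the names just make it clearer which value was found.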
This is one approach, but there are always multiple approaches to string parsing, and the other answers posted here are valid also. Plus there are already XML parsers to take care of things like this for you.
NOTE: This answer was meant to point out how | works. I don't recommend actually doing things this way, since it's overly complicated. If you're going to look separately for each attribute, it would be simpler just to set up three patterns and look for each one, one at a time. Or use @braj's suggestion in a loop, and perform a later check to make sure the createdOn and updatedOn values are numeric.
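For comparison, here is a minimal sketch of the XML-parser route mentioned above, using the DOM parser that ships with the JDK. It assumes the input is a single well-formed element like the one in the question; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class XmlAttributeDemo {
    // Returns createdOn, updatedOn and url (in that order) from the root element.
    static String[] readAttributes(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            return new String[] {
                root.getAttribute("createdOn"),
                root.getAttribute("updatedOn"),
                root.getAttribute("url")
            };
        } catch (Exception e) {
            // Parsing problems are wrapped so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Element createdOn=\"1405358703367\" "
            + "updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
        for (String value : readAttributes(xml)) {
            System.out.println(value);
        }
    }
}
```

getAttribute returns an empty string when an attribute is absent, so a missing value does not need a null check here.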

Regex - Ignore part of the string

I am working on Pentaho which uses Java regex package : java.util.regex.
I want to extract a lot of information from the lines of a text file from both start and end of the string :
^StartofString Controls\(param1="(D[0-9]{0,})",param2="(G[0-9]{0,})",param3="([^"]{0,})",param4="([^"]{0,})"\):(?:.*)param5="([^"]{0,})",.*
There is a long part of the string I want to ignore and try to do so with (?:.*)
The non-capturing group seems to work when I test the regex in the step, but it does not work when I execute the transformation.
I test the string on 'Regex Evaluation' step, check with 'Filter rows' the boolean of previous step and extract groups within a Javascript step :
var pattern = Packages.java.util.regex.Pattern.compile(patternStr);
var matcher = pattern.matcher(content.toString());
var matchFound = matcher.find();
with patternStr being the same regex as the one in the 'Regex Evaluation' step, but with the backslashes escaped (\\).
I have read many questions about ignoring parts of strings in regex and still can't find the answer.
Any help is welcome.
I can provide more information if needed.
A non-capturing group doesn't mean that its content won't be matched; it means that the content won't be stored in a numbered group (you are still grouping tokens in your regex, which is useful for applying a quantifier to all of them at once).
For example, these regexes all match the exact same string, abc:
abc
a(?:b)c
a(b)c
However, in the third case you've defined a capturing group, which lets you access b independently. The first two cases are equivalent in all respects.
The non-capturing group becomes useful when you want to apply a quantifier to a group of tokens without creating an extra group you can reference later. The following regexes both match the same strings:
(ab)*(c)\2
(?:ab)*(c)\1
We want to apply * to the ab tokens. Either we do it with a capturing group (first example) and a group is created that we can reference, or we use a non-capturing group. The backreference at the end of the regex is supposed to match the c ; in the first example it's the second group since ab is the first one, while in the second c is the first group that can be referenced.
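A small Java sketch can confirm the numbering described above. The test string ababcc and the helper names are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Number of capturing groups the regex defines.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // True if the whole input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String capturing = "(ab)*(c)\\2";      // c is group 2
        String nonCapturing = "(?:ab)*(c)\\1"; // c is group 1

        System.out.println(groupCount(capturing));              // 2
        System.out.println(groupCount(nonCapturing));           // 1
        System.out.println(matchesAll(capturing, "ababcc"));    // true
        System.out.println(matchesAll(nonCapturing, "ababcc")); // true
    }
}
```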
Now that I've explained what non-capturing groups do, let's solve your problem : you want to remove something from the middle of your string, where you know what's at the beginning and what's at the end.
Let's assume the string you want to match is the following :
Aremove-thisB
And that you want the result AB.
There are multiple strategies to do so, the easiest in your case probably is to match both the beginning and end of the string in their own capturing group and create your output from there :
var pattern = Packages.java.util.regex.Pattern.compile("(A).*(B)");
var matcher = pattern.matcher(content.toString());
var matchFound = matcher.find();
if (matchFound) { return matcher.group(1) + matcher.group(2); }

Java REGEX: matching comments and NOT matching specific character

so I'm new to Java and having some trouble with regex. I'm trying to find winged comments (/* */) and end of line comments( // ) in a string so I can split along them and put the pieces in an array.
This is the regex I'm currently have:
stringofstuff.split("[!//.*?\n!]");
and it works, but my problem is that it's also matching the character "." so when the string contains a number like 90.55, my array looks like [90, 55] which is NOT what I want. I've tried adding ^\\. to the end of the regex after the closing square bracket:
stringofstuff.split("[!//.*?\n!]^\\.");
and it succeeds in not matching . but it no longer recognizes either type of comment! I have no clue where I'm going wrong, any suggestions?
You can use the Pattern and Matcher classes of the java.util.regex package to do so.
For example to find digits:
Pattern p = Pattern.compile("\\d");
Matcher m = p.matcher(string);
if(m.find())
{
System.out.println(m.start()+" "+m.end()+" "+m.group());
}
Similarly you can make different combinations of strings you want to separate out and they will be stored in m.group().
For different combinations and more information on regex package you can see here:
http://www.regular-expressions.info/java.html
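Applied to the comment-splitting question, one possible sketch looks like the following. Square brackets in a regex define a character class, so the original attempt split on each listed character individually (including the dot); here the two comment forms are written as alternatives instead. The class name and sample input are assumptions, and the pattern does not try to handle comment markers inside string literals.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class CommentSplitDemo {
    // (?s) lets . span newlines so a /* ... */ comment may cover several lines.
    // The second alternative matches // up to, but not including, the line break.
    static final Pattern COMMENT = Pattern.compile("(?s)/\\*.*?\\*/|//[^\\n]*");

    static String[] splitOnComments(String source) {
        return COMMENT.split(source);
    }

    public static void main(String[] args) {
        String source = "x = 90.55; /* note */ y = 2; // tail\nz = 3;";
        System.out.println(Arrays.toString(splitOnComments(source)));
    }
}
```

Because the dot is no longer a delimiter, 90.55 stays in one piece.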

Using backreference to refer to a pattern rather than actual match

I am trying to write a regex which would match a (not necessarily repeating) sequence of text blocks, e.g.:
foo,bar,foo,bar
My initial thought was to use backreferences, something like
(foo|bar)(,\1)*
But it turns out that this regex only matches foo,foo or bar,bar but not foo,bar or bar,foo (and so on).
Is there any other way to refer to a part of a pattern?
In the real world, foo and bar are 50+ character long regexes and I simply want to avoid copy pasting them to define a sequence.
With a decent regex flavor you could use (foo|bar)(?:,(?-1))* or the like.
But Java does not support subpattern calls.
So you end up having a choice of doing String replace/format like in ajx's answer, or you could condition the comma if you know when it should be present and when not. For example:
(?:(?:foo|bar)(?:,(?!$|\s)|))+
Perhaps you could build your regex bit by bit in Java, as in:
String subRegex = "foo|bar";
String fullRegex = String.format("(%1$s)(,(%1$s))*", subRegex);
The second line could be factored out into a function. The function would take a subexpression and return a full regex that would match a comma-separated list of subexpressions.
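A minimal sketch of such a function might look like this. The method name commaSeparatedListOf is hypothetical, and the subexpression is wrapped in a non-capturing group so that a top-level | inside it cannot leak into the surrounding regex.

```java
import java.util.regex.Pattern;

class ListRegexDemo {
    // Builds a regex matching one or more comma-separated occurrences of subRegex.
    static Pattern commaSeparatedListOf(String subRegex) {
        String item = "(?:" + subRegex + ")";
        return Pattern.compile(item + "(?:," + item + ")*");
    }

    public static void main(String[] args) {
        Pattern list = commaSeparatedListOf("foo|bar");
        System.out.println(list.matcher("foo,bar,foo,bar").matches()); // true
        System.out.println(list.matcher("foo,baz").matches());         // false
    }
}
```

The long subexpressions then live in one place each, and only the short builder has to know about the comma.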
The point of the back reference is to match the actual text that matches, not the pattern, so I'm not sure you could use that.
Can you use quantifiers like:
String s= "foo,bar,foo,bar";
String externalPattern = "(foo|bar)"; // comes from somewhere else
Pattern p = Pattern.compile(externalPattern + "(?:," + externalPattern + ")+");
Matcher m = p.matcher(s);
boolean b = m.find();
which would match two or more comma-separated instances of foo or bar.

Possible Regular Expression Question

I have a simple program that looks up details of an IP you give it, and I will show you an example of some of my code
int regIndex = src.indexOf("Region:") + 16;
int endIndex = src.indexOf("<", regIndex);
String region = src.substring(regIndex, endIndex);
if(regIndex == 15) region = "None";
int counIndex = src.indexOf("Country:") + 17;
int couneIndex = src.indexOf(" <", counIndex);
String country = src.substring(counIndex, couneIndex);
As you can see, it is definitely not the most efficient way to do this. The website I am using gives the information like this: http://whatismyipaddress.com/ip/1.1.1.1
I have never really used Regular Expressions before, but it seems to me like there might be one that could really make this more efficient and easier to program, but I've been looking around and I'm pretty lost.
So basically my question is, how could I use a Regular Expression for this (Or if there is another more efficient way).
Any help would be great,
Thanks :)
You can do something like this:
String s = "bla Country: Australia <bla";
Pattern pattern = Pattern.compile("Country: (.*) [<]");
Matcher matcher = pattern.matcher(s);
if(matcher.find()) {
System.out.println("Country = " + matcher.group(1));
}
The source would look like this
<tr><th>Country:</th><td>Australia <img src="http://whatismyipaddress.com/images/flags/au.png" alt="au flag"> </td></tr>
To use regular expression means to match a pattern.
The pattern that marks the data you want is pretty straightforward: Country:. You also need to match the tags that follow, like <\/th><td>. The forward slash is escaped here; that is only required in regex flavors that use / as a delimiter, and in Java it is optional but harmless. Then comes the data you are looking for. I would suggest matching everything that is not a <, so [^<]. This is a character class with a negation at the beginning, meaning any character that is not a <. To repeat it, add a + at the end, meaning one or more of the preceding element.
So, the complete thing should look like this:
Country:<\/th><td>\s*([^<]+)\s*<
I also added the parentheses; they store the matched text in a capturing group, so your result can be found in capturing group 1. I also added \s*, which is a whitespace character repeated 0 or more times, to match whitespace before or after your data, which I assume you don't need. Note that [^<]+ itself also matches whitespace, so trailing whitespace can still end up inside group 1 and may need trimming.
Firstly, there are some online sites that can help you develop a regular expression. They let you enter some text and a regular expression, and then show you the result of applying the expression to the text. This saves you from having to write code as you develop the expression and expand your understanding. A good site I use a lot is FileFormat regex, because it allows me to test one expression against multiple test strings. A quick search also brought up regex Planet, RegExr and RegexPal. There are lots of others.
In terms of resources, the Java Pattern class reference is useful for Java development, and I quite like regular-expressions.info as well.
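As a sketch of how that expression could be used from Java, the snippet below runs it against the source line shown earlier. The class and method names are assumptions, and the "None" fallback simply mirrors the question's code. Because [^<]+ is greedy and also matches spaces, the captured text keeps its trailing space, so the result is trimmed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountryExtractDemo {
    // The slash needs no escaping in Java, so it is written plainly here.
    static final Pattern COUNTRY =
        Pattern.compile("Country:</th><td>\\s*([^<]+)\\s*<");

    static String extractCountry(String html) {
        Matcher m = COUNTRY.matcher(html);
        // trim() removes the whitespace that [^<]+ swallowed before the next tag.
        return m.find() ? m.group(1).trim() : "None";
    }

    public static void main(String[] args) {
        String html = "<tr><th>Country:</th><td>Australia <img src=\"http://whatismyipaddress.com/images/flags/au.png\" alt=\"au flag\"> </td></tr>";
        System.out.println("Country = " + extractCountry(html)); // Country = Australia
    }
}
```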
For your problem I used fileFormat.info and came up with this regex to match "http://whatismyipaddress.com/ip/1.1.1.1":
.*//([.\w]+)/.*/(\d+(?:\.\d+){3})
or as a Java string:
".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})"
A quick breakdown: anything (.*), followed by two slashes (//), followed by one or more dots or word characters (([.\w]+)), followed by a slash, any number of characters and another slash (/.*/), followed by at least one digit (\d+), followed by 3 sets of a literal dot and at least one digit ((?:\.\d+){3}). The dot is escaped in that last part because an unescaped . outside a character class matches any character. The sets of parentheses around the server name part and the IP part are called capturing groups, and you can use methods on the Java Matcher class to return the contents of these sections. The ?: in the second part of the IP address says that the parentheses are used only to group the characters and are not to be treated as a capturing group.
This regex is not as strict or as flexible as it should be, but it's a starting point.
All of this can be researched on the above links.
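To show the capturing groups in use, here is a small sketch that applies the URL regex from Java, with the dot in the IP part escaped so that it matches only a literal dot. The class and method names are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlPartsDemo {
    static final Pattern URL_PARTS =
        Pattern.compile(".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})");

    // Returns {host, ip}, or null when the URL does not have the expected shape.
    static String[] hostAndIp(String url) {
        Matcher m = URL_PARTS.matcher(url);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = hostAndIp("http://whatismyipaddress.com/ip/1.1.1.1");
        System.out.println(parts[0]); // whatismyipaddress.com
        System.out.println(parts[1]); // 1.1.1.1
    }
}
```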

How to appendReplacement on a Matcher group instead of the whole pattern?

I am using a while(matcher.find()) to loop through all of the matches of a Pattern. For each instance or match of that pattern it finds, I want to replace matcher.group(3) with some new text. This text will be different for each one so I am using matcher.appendReplacement() to rebuild the original string with the new changes as it goes through. However, appendReplacement() replaces the entire Pattern instead of just the group.
How can I do this but only modify the third group of the match rather than the entire Pattern?
Here is some example code:
Pattern pattern = Pattern.compile("THE (REGEX) (EXPRESSION) (WITH MULTIPLE) GROUPS");
Matcher matcher = pattern.matcher("THE TEXT TO SEARCH AND MODIFY");
StringBuffer buffer = new StringBuffer();
while(matcher.find()){
matcher.appendReplacement(buffer, processTheGroup(matcher.group(3)));
}
but I would like to do something like this (obviously this doesn't work).
...
while(matcher.find()){
matcher.group(3).appendReplacement(buffer, processTheGroup(matcher.group(3)));
}
Something like that, where it only replaces a certain group, not the whole Pattern.
EDIT: changed the regex example to show that not all of the pattern is grouped.
I see this already has an accepted answer, but it is not fully correct. The correct answer appears to be something like this:
m.appendReplacement(buffer, "$1" + process(m.group(2)) + "$3");
This also illustrates that "$" is a special character in the replacement string passed to appendReplacement. Therefore you must take care in your "process()" function to replace all "$" with "\$". Matcher.quoteReplacement(replacementString) will do this for you (thanks @Med).
The previous accepted answer will fail if either group 1 or group 3 happens to contain a "$". You'll end up with "java.lang.IllegalArgumentException: Illegal group reference".
Let's say your entire pattern matches "(prefix)(infix)(suffix)", capturing the 3 parts into groups 1, 2 and 3 respectively. Now let's say you want to replace only group 2 (the infix), leaving the prefix and suffix intact the way they were.
Then what you do is you append what group(1) matched (unaltered), the new replacement for group(2), and what group(3) matched (unaltered), so something like this:
matcher.appendReplacement(
buffer,
matcher.group(1) + processTheGroup(matcher.group(2)) + matcher.group(3)
);
This will still match and replace the entire pattern, but since groups 1 and 3 are left untouched, effectively only the infix is replaced.
You should be able to adapt the same basic technique for your particular scenario.
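Putting the pieces together, a complete sketch might look like the following. The pattern, the sample input, and the body of processTheGroup are assumptions made up for illustration; the replacement deliberately contains a "$" to show why Matcher.quoteReplacement is needed, and appendTail copies whatever follows the last match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceOneGroupDemo {
    // Hypothetical stand-in for processTheGroup(): prefixes the value with "$".
    static String processTheGroup(String value) {
        return "$" + value;
    }

    static String replaceMiddleGroup(String input) {
        // Three groups: prefix (name=), infix (digits), suffix (;).
        Pattern pattern = Pattern.compile("(\\w+=)(\\d+)(;)");
        Matcher matcher = pattern.matcher(input);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            // $1 and $3 re-insert the untouched groups; the new infix is quoted
            // so that any "$" or "\" in it is treated literally.
            String replacement = "$1"
                + Matcher.quoteReplacement(processTheGroup(matcher.group(2)))
                + "$3";
            matcher.appendReplacement(buffer, replacement);
        }
        matcher.appendTail(buffer); // keep the text after the last match
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceMiddleGroup("apple=5; pear=12; done"));
        // apple=$5; pear=$12; done
    }
}
```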
