Java regex "\A" boundary match - java

I am looking for some help with this regex. I have strings of varying length, and want to match only the beginning. The strings have newlines in them so it seems \A is the way to go.
I want regex that will match all the following cases:
OPTIONAL: [any whitespace/newlines/etc]
OPTIONAL: <?.*?>
OPTIONAL: [any whitespace/newlines/etc]
MANDATORY: <lemon>
OPTIONAL: anything afterwards.
Since the strings can get huge, the final optional match makes this extremely slow.
My initial solution was:
"(^\\s*<?.*?>\\s*<lemon>)[\\s\\S]*|(^\\s*<lemon>.*)[\\s\\S]*"
This is extremely convoluted and matches the entire string instead of just the start
My current best try is:
"\\A(?:\\s*<?.*?>)?\\s*<lemon>"
However, this does not work if there is anything after <lemon>: the match then fails.
Has anyone got any ideas as to why? Examples on \A are sparse and I can't get it to work.

What you're missing is the notion of grouping. I've taken your regex and wrapped it in parentheses:
Pattern p = Pattern.compile("(\\A(?:\\s*<?.*?>)?\\s*<lemon>).*");
Matcher m = p.matcher(" <?.*?> <lemon> hi ");
if (m.find()) {
    System.out.println(m.group(1));
}
Group 0 will be the whole match.
Group 1 will be what you need.
This tutorial might explain how groups work
I am simply looking for a way to get a binary answer, similar to String.matches(), that stops going through the string as soon as a match is found.
Take this: \\A(?:\\s*<?.*?>)?\\s*<lemon>(.*?) with no grouping
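If only a yes/no answer about the start of the string is needed, one option is Matcher.lookingAt(). It anchors the match at the beginning of the input but, unlike String.matches(), does not require the pattern to cover the whole input, so text after <lemon> does not have to be matched. The following is a minimal sketch using the pattern from the question; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class LemonPrefixCheck {
    // Pattern taken from the question; compiled once and reused.
    private static final Pattern PREFIX =
            Pattern.compile("\\A(?:\\s*<?.*?>)?\\s*<lemon>");

    // Hypothetical helper: true if the input begins with the optional
    // declaration followed by <lemon>. Text after <lemon> is not required
    // to match anything.
    static boolean startsWithLemon(String input) {
        return PREFIX.matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(startsWithLemon(" <?xml version=\"1.0\"?>\n<lemon> more text")); // true
        System.out.println(startsWithLemon("<lemon>\nmore\ntext"));                          // true
        System.out.println(startsWithLemon("text first <lemon>"));                           // false
        // String.matches() must consume the entire input, hence the failure:
        System.out.println("<lemon> more".matches("\\A(?:\\s*<?.*?>)?\\s*<lemon>"));        // false
    }
}
```

Matcher.find() would also work with this pattern because of the \A anchor; lookingAt() just makes the "start only" intent explicit.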

Related

A regex to match the smallest nested part first

I am quite new to RegEx. I have not much experience. I already searched the internet and tried many things on regex101.com. Nothing seems to work.
This is the pattern:
\\((.*?)\\)
I use it in combination with Java's replaceAll to add ?: to each (...) in a string (the user input).
The user input is used as regular expression as well. But currently I am treating it as a normal String.
Imagine this user input: (Welcome, (StackOverflow|World)|Hello, Dad)
What I want as the result is: (?:Welcome, (?:StackOverflow|World)|Hello, Dad)
But I only get the first ?: : (?:Welcome, (StackOverflow|World)|Hello, Dad)
I think I understand the problem. I guess the regex engine scans from left to right and tries to get the smallest match (see .*?). It searches from ( to the next ), and that is (Welcome, (StackOverflow|World).
What could I do to match these nested matches first? I cannot let the user modify their input. I have to find a better regex pattern to match from the smallest possible match to the greatest possible match, and not from the left to the right.
I suggest searching for any unescaped ( (so as not to add ?: after a literal \() that is not followed by ? (to avoid matching lookarounds, non-capturing groups, etc.):
(?<!\\)((?:\\{2})*)\((?!\?)
and replace with $1(?:. See the regex demo.
Java declaration:
String pat = "(?<!\\\\)((?:\\\\{2})*)\\((?!\\?)";
Details:
(?<!\\) - no backslash immediately to the left of the current location
((?:\\{2})*) - Group 1: zero or more pairs of backslashes (that is, an even number of them)
\( - a literal ( ...
(?!\?) - that is not immediately followed with a literal ?.
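To see the replacement applied end to end, here is a minimal sketch that runs the pattern above through String.replaceAll on the input from the question. The class and method names are hypothetical.

```java
final class NonCapturingRewrite {
    // Java string form of: (?<!\\)((?:\\{2})*)\((?!\?)
    private static final String UNESCAPED_OPEN_PAREN =
            "(?<!\\\\)((?:\\\\{2})*)\\((?!\\?)";

    // Hypothetical helper: turns every plain "(" group into a "(?:" group.
    static String toNonCapturing(String userRegex) {
        // $1 puts back any pairs of backslashes consumed before the "(".
        return userRegex.replaceAll(UNESCAPED_OPEN_PAREN, "$1(?:");
    }

    public static void main(String[] args) {
        System.out.println(toNonCapturing("(Welcome, (StackOverflow|World)|Hello, Dad)"));
        // prints (?:Welcome, (?:StackOverflow|World)|Hello, Dad)

        // An escaped parenthesis and an existing lookahead are left untouched.
        System.out.println(toNonCapturing("literal \\( and lookahead (?=x)"));
        // prints literal \( and lookahead (?=x)
    }
}
```

Note that this sketch does not treat a ( inside a character class such as [(] specially, so such input would need extra handling.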

Regex matches with multiple patterns

I'm trying to compile one Java Regex pattern but have that pattern look for 3 different matches. I've learned that I can do that using the pipe (|) but I'm having trouble with the actual syntax of the regex.
I'm looking through XML data and trying to pull out 3 matches. The XML will look something like this:
<Element createdOn="1405358703367" updatedOn="1405358718804" url="http://www.someurl.com" />
The regex I'm trying looks like this so far:
((?<="url": ").*(?=")) | (createdOn="(\d)") | (updatedOn="(\d)")
In the end I need to get everything between the quotes in the XML (i.e. 1405358703367, 1405358718804, and http://www.someurl.com).
I had the URL regex working on its own earlier, but now no matches seem to be made.
Thanks.
Get the matched group from index 2.
(url|createdOn|updatedOn)="([^"]*)"
Here is sample code:
String string = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String patternString = "(url|createdOn|updatedOn)=\"([^\"]*)\"";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println(matcher.group(2));
}
output:
1405358703367
1405358718804
http://www.someurl.com
Java doesn't have a library method that extracts matches, but you only need one line:
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
This works by stripping off leading and trailing input up to/from the first/last quote, then splits on quote-nonquote-quote input, leaving the target matches as an array.
In action:
String input = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
System.out.println(Arrays.toString(matches));
Output:
[1405358703367, 1405358718804, http://www.someurl.com]
The pipe (|) is used to find a match that could be some-pattern OR some-other-pattern OR yet-another-pattern. It's not good at finding all occurrences of several patterns. To do that, if the patterns you're looking for aren't necessarily in a fixed order, you'll need to use a loop.
Here's a code example that starts with the pattern you tried, fixes some problems, and uses a loop to find the patterns:
Pattern p = Pattern.compile("((?<=url=\").*(?=\"))|(createdOn=\"(\\d+)\")|(updatedOn=\"(\\d+)\")");
Matcher m = p.matcher(source);
while (m.find()) {
    System.out.println("Found: " + m.group());
    System.out.println("Group 1: " + m.group(1));
    System.out.println("Group 3: " + m.group(3));
    System.out.println("Group 5: " + m.group(5));
}
(Some problems with your original pattern: You put space characters before and after each |, which are treated literally and mean the pattern has to match spaces that aren't there. I added + after \\d because you want to match more than one digit. There were some mistakes, like putting : after url instead of =.)
Now the code uses a loop to find each successive pattern that matches one of the patterns you're looking for. It matches either url=... or createdOn=... or updatedOn=..., but by using a loop we will find all of them. (Note that it doesn't care if it sees a url or a createdOn attribute twice in the source. You'll have to check that yourself.)
The group() method with no parameters will return whatever was matched by the pattern. group(1), group(3), and group(5) return certain subsections of the pattern; the numbers are determined by counting wherever you use ( in the pattern except for (?. So group 1 matches something using url as a lookbehind; group 2 starts with createdOn; group 3 is the sequence of digits following createdOn; group 4 starts with updatedOn, etc. The way the pattern is set up, not all of these will have values, since only one of the three alternatives separated by | will match. The rest will be null. As a result, the output of the above code will display null for two of the groups, and a useful value for the other. If you do things this way, you'll need to test for null to see which value actually got returned.
This would also be a case where named capturing groups could be useful. See http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html. Matcher has a group(name) function that takes a group name as a parameter.
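As a rough illustration of named capturing groups, the sketch below gives names to the two groups of a simpler attribute pattern and reads them with Matcher.group(name). The group names attr and value, and the class and method names, are choices made for this example only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupAttributes {
    // "attr" and "value" are group names chosen for this sketch.
    private static final Pattern ATTRIBUTE = Pattern.compile(
            "(?<attr>url|createdOn|updatedOn)=\"(?<value>[^\"]*)\"");

    // Collects each matched attribute name and its quoted value, in order.
    static Map<String, String> extract(String xml) {
        Map<String, String> found = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(xml);
        while (m.find()) {
            // A repeated attribute overwrites the earlier value.
            found.put(m.group("attr"), m.group("value"));
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<Element createdOn=\"1405358703367\" "
                + "updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
        System.out.println(extract(xml));
        // prints {createdOn=1405358703367, updatedOn=1405358718804, url=http://www.someurl.com}
    }
}
```

Because every match fills both named groups, there is no need to test individual groups for null here.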
This is one approach, but there are always multiple approaches to string parsing, and the other answers posted here are valid also. Plus there are already XML parsers to take care of things like this for you.
NOTE: This answer was meant to point out how | works. I don't recommend actually doing things this way, since it's overly complicated. If you're going to look separately for each attribute, it would be simpler just to set up three patterns and look for each one, one at a time. Or use @braj's suggestion in a loop, and perform a later check to make sure the createdOn and updatedOn values are numeric.

Regex nth match in equal string

Is it possible to use only a regex (no additional code!) for matching the nth match? For example:
"CAR" - "TRAIN" - "BOAT" - "BICYCLE"
Now I only want to match the BOAT, regex for matching would be "[A-Z]+" however this also matches the first, second and fourth.
Does anyone have a pure regex solution for this? I need this because I can't change the code that uses the regex, but I can provide a regex.
Best regards,
Robin
I think this lookbehind should do it:
(?<=^("[A-Z]+"[\s-]+){2})"[A-Z]+"
It matches a word that comes two words after the start of the string
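Java's java.util.regex has traditionally rejected lookbehinds that have no obvious maximum length, so the + quantifiers inside the lookbehind above may fail to compile there. Here is a minimal sketch that sidesteps this by bounding the quantifiers. The limits {1,20} and {1,5}, the escaped hyphen, and the class and method names are assumptions made for illustration, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThirdQuotedWord {
    // Same idea as the lookbehind above, with bounded repetition so that
    // the lookbehind has a known maximum length. The bounds are assumptions.
    private static final Pattern THIRD = Pattern.compile(
            "(?<=^(?:\"[A-Z]{1,20}\"[\\s\\-]{1,5}){2})\"[A-Z]+\"");

    // Returns the third quoted word (quotes included), or null if absent.
    static String find(String input) {
        Matcher m = THIRD.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(find("\"CAR\" - \"TRAIN\" - \"BOAT\" - \"BICYCLE\""));
        // prints "BOAT"
    }
}
```

Changing {2} to another count selects a different position, as long as the words and separators stay within the assumed bounds.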
If I understood you right and you're feeding multiple strings into one regexp, string by string, then no, this is not possible.
The regular expression itself has no memory that lasts beyond a single match, so if you match one thing and then another, no information about the first is retained.
(?!(\"[A-Z]\"\s-\s){2})(\"[A-Z]\") - where the {2} means which index you want.
The only problem with it is that it returns every match after the specified index as well. You could perform the match and return the first result.
Tested it using regexpal with your example.

Need regex to separate comma separated values (Interface list from a router query output)

I have an input like this
RX Only : Gi1/0/15,Gi1/0/20,Gi1/0/17
I want to capture 1/0/15, 1/0/20, 1/0/17 from this. But this input changes. Sometimes there are only 2 comma separated values, sometimes 1 sometimes more than 3.
The regex I came up with only captures the first group. If I use the non-greedy operator, then it captures last. What regex should I use to capture all these groups separately.
The language used would be Java.
It's often easier to just write the regex for the substrings you are interested in and then repeatedly use Matcher.find(), as opposed to trying to write a regex that matches the entire string and pulling what you want from a complex arrangement of groups.
Assuming what you are looking for are triples of numbers separated by "/", then:
Pattern p = Pattern.compile("\\d+/\\d+/\\d+");
Matcher m = p.matcher(inputString);
while (m.find()) {
    // your triple is in group 0
    System.out.println(m.group(0));
}
Give a man a fish ... or
http://gskinner.com/RegExr/
Do you really have to use regex here? If the data formats are quite similar, you can just use the indexOf function combined with substring. You will have to find the : character and start looking for commas from the next character. Then you check the position of \n and use the smaller index in order to retrieve a substring.
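A rough sketch of that regex-free approach follows. It uses String.split on the part between the colon and the end of the line instead of walking comma positions by hand, which is a simplification. It also assumes each interface name starts with a non-digit prefix such as Gi that should be dropped. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class InterfaceListParser {
    // Returns the numeric part of each comma-separated interface name
    // found after the first ':' on the line.
    static List<String> parse(String line) {
        List<String> result = new ArrayList<>();
        int colon = line.indexOf(':');
        if (colon < 0) {
            return result; // no list on this line
        }
        int end = line.indexOf('\n', colon);
        if (end < 0) {
            end = line.length(); // single-line input
        }
        for (String item : line.substring(colon + 1, end).split(",")) {
            String name = item.trim();
            int i = 0;
            // Skip the assumed non-digit prefix, e.g. "Gi".
            while (i < name.length() && !Character.isDigit(name.charAt(i))) {
                i++;
            }
            if (i < name.length()) {
                result.add(name.substring(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("RX Only : Gi1/0/15,Gi1/0/20,Gi1/0/17"));
        // prints [1/0/15, 1/0/20, 1/0/17]
    }
}
```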

Possible Regular Expression Question

I have a simple program that looks up details of an IP you give it. Here is an example of some of my code:
int regIndex = src.indexOf("Region:") + 16;
int endIndex = src.indexOf("<", regIndex);
String region = src.substring(regIndex, endIndex);
if(regIndex == 15) region = "None";
int counIndex = src.indexOf("Country:") + 17;
int couneIndex = src.indexOf(" <", counIndex);
String country = src.substring(counIndex, couneIndex);
As you can see, it is definitely not the most efficient way to do this. The website I am using gives the information like this: http://whatismyipaddress.com/ip/1.1.1.1
I have never really used Regular Expressions before, but it seems to me like there might be one that could really make this more efficient and easier to program, but I've been looking around and I'm pretty lost.
So basically my question is, how could I use a Regular Expression for this (Or if there is another more efficient way).
Any help would be great,
Thanks :)
You can do something like this:
String s = "bla Country: Australia <bla";
Pattern pattern = Pattern.compile("Country: (.*) [<]");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println("Country = " + matcher.group(1));
}
The source would look like this:
<tr><th>Country:</th><td>Australia <img src="http://whatismyipaddress.com/images/flags/au.png" alt="au flag"> </td></tr>
To use regular expression means to match a pattern.
The pattern that marks your wanted data is pretty straightforward: Country:. You also need to match the tags that follow, </th><td>, written here as <\/th><td>. Escaping the forward slash is not required in Java, but it is harmless. Then comes the data you are looking for. I would suggest matching everything that is not a <, so [^<]. This is a negated character class, meaning any character that is not a <. To repeat it, add a + at the end, meaning one or more of the preceding element.
So, the complete thing should look like this:
Country:<\/th><td>\s*([^<]+)\s*<
I also added parentheses here; they capture what they match, so your result can be found in capturing group 1. I also added \s*, which is a whitespace character repeated zero or more times. It is there to match whitespace before or after your data, which I assume you don't need. Note that because [^<]+ is greedy, trailing whitespace still ends up inside group 1, so trim the result if that matters.
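Here is how that expression might be used from Java, as a minimal sketch run against the sample row shown above. The forward slash is left unescaped, which Java accepts. The trim() call removes the trailing whitespace that the greedy [^<]+ captures. The "None" fallback mirrors the question's code; the class and method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CountryExtractor {
    private static final Pattern COUNTRY =
            Pattern.compile("Country:</th><td>\\s*([^<]+)\\s*<");

    // Returns the country name from the HTML, or "None" if it is not found.
    static String country(String html) {
        Matcher m = COUNTRY.matcher(html);
        // [^<]+ is greedy and swallows trailing spaces, hence the trim().
        return m.find() ? m.group(1).trim() : "None";
    }

    public static void main(String[] args) {
        String row = "<tr><th>Country:</th><td>Australia "
                + "<img src=\"http://whatismyipaddress.com/images/flags/au.png\" "
                + "alt=\"au flag\"> </td></tr>";
        System.out.println(country(row)); // prints Australia
    }
}
```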
Firstly, there are some online sites that can help you to develop a regular expression. They let you enter some text and a regular expression, and then show you the result of applying the expression to the text. This saves you having to write code as you develop the expression and expand your understanding. A good site I use a lot is FileFormat regex, because it allows me to test one expression against multiple test strings. A quick search also brought up RegexPlanet, RegExr and RegexPal. There are lots of others.
In terms of resources, the Java Pattern class reference is useful for Java development, and I quite like regular-expressions.info as well.
For your problem I used fileFormat.info and came up with this regex to match "http://whatismyipaddress.com/ip/1.1.1.1":
.*//([.\w]+)/.*/(\d+(?:\.\d+){3})
or as a Java string:
".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})"
A quick breakdown: anything (.*), followed by two slashes (//), followed by one or more dots or word characters (([.\w]+)), followed by a slash, any number of characters and another slash (/.*/), followed by at least one digit (\d+), followed by three sets of a literal dot and at least one digit ((?:\.\d+){3}). The sets of parentheses around the server name part and the IP part are called capturing groups, and you can use methods on the Java Matcher class to return the contents of these sections. The ?: in the second part of the IP address means the parentheses only group the characters and are not treated as a capturing group.
This regex is not as strict or as flexible as it should be, but it's a starting point.
All of this can be researched on the above links.
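For completeness, a small sketch of reading the two capturing groups with the Matcher class. It escapes the dot in the IP part so that only a literal . is accepted between the numbers, which is slightly stricter than leaving it unescaped. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IpUrlParser {
    // Group 1: server name, group 2: dotted IP at the end of the URL.
    private static final Pattern URL =
            Pattern.compile(".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})");

    // Returns {server, ip}, or null when the URL does not have that shape.
    static String[] parse(String url) {
        Matcher m = URL.matcher(url);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("http://whatismyipaddress.com/ip/1.1.1.1");
        System.out.println(parts[0]); // prints whatismyipaddress.com
        System.out.println(parts[1]); // prints 1.1.1.1
    }
}
```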
