Possible Regular Expression Question - java

I have a simple program that looks up details of an IP you give it, and I will show you an example of some of my code
int regIndex = src.indexOf("Region:") + 16;
int endIndex = src.indexOf("<", regIndex);
String region = src.substring(regIndex, endIndex);
if(regIndex == 15) region = "None";
int counIndex = src.indexOf("Country:") + 17;
int couneIndex = src.indexOf(" <", counIndex);
String country = src.substring(counIndex, couneIndex);
As you can see, it is definitely not the most efficient way to do this. The website I am using gives the information like this: http://whatismyipaddress.com/ip/1.1.1.1
I have never really used Regular Expressions before, but it seems to me like there might be one that could really make this more efficient and easier to program, but I've been looking around and I'm pretty lost.
So basically my question is, how could I use a Regular Expression for this (Or if there is another more efficient way).
Any help would be great,
Thanks :)

You can do something like this:
String s = "bla Country: Australia <bla";
Pattern pattern = Pattern.compile("Country: (.*) [<]");
Matcher matcher = pattern.matcher(s);
if(matcher.find()) {
System.out.println("Country = " + matcher.group(1));
}

The source would look like this
<tr><th>Country:</th><td>Australia <img src="http://whatismyipaddress.com/images/flags/au.png" alt="au flag"> </td></tr>
To use regular expression means to match a pattern.
The pattern that marks the data you want is straightforward: Country:. You also need to match the tags that follow it, like <\/th><td>. (Escaping the forward slash is harmless in Java but not required; it is only needed in languages that delimit regexes with slashes.) Then comes the data you are looking for. I would suggest matching everything that is not a <, so [^<]. This is a negated character class, meaning any single character that is not a <. To repeat it, add a + at the end, meaning one or more of the preceding item.
So, the complete thing should look like this:
Country:<\/th><td>\s*([^<]+)\s*<
I also added parentheses here. They put the matched text into a capturing group, so your result can be found in capturing group 1. I also added \s*, a whitespace character repeated 0 or more times, to match whitespace before or after your data, which I assume you don't need. (Because [^<]+ is greedy, it can still capture trailing whitespace, so trim the captured value.)
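As a rough illustration, the pattern above could be used from Java as in the sketch below. The class name CountryExtractor, the method extractCountry and the "None" fallback are assumptions made for this example (the fallback mirrors the question's own code); the sample input is the source line quoted earlier.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountryExtractor {
    // Same pattern as above; the "/" is left unescaped because Java does not require it.
    private static final Pattern COUNTRY =
            Pattern.compile("Country:</th><td>\\s*([^<]+)\\s*<");

    /** Returns the country name, or "None" when the page has no Country row. */
    static String extractCountry(String src) {
        Matcher m = COUNTRY.matcher(src);
        // trim() removes trailing whitespace that the greedy [^<]+ may have captured
        return m.find() ? m.group(1).trim() : "None";
    }

    public static void main(String[] args) {
        String src = "<tr><th>Country:</th><td>Australia "
                + "<img src=\"http://whatismyipaddress.com/images/flags/au.png\" alt=\"au flag\"> </td></tr>";
        System.out.println("Country = " + extractCountry(src));
    }
}
```

Compiling the pattern once into a static field avoids recompiling it for every lookup.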

Firstly there are some online sites that can help you to develop a regular expression. They let you enter some text and a regular expression, and then show you the result of applying the expression to the text. This saves you having to write code as you develop the expression and expand your understanding. A good site I use a lot is FileFormat regex because it allows me to test one expression against multiple test strings. A quick search also brought up regex Planet, RegExr and RegexPal. There are lots of others.
In terms of resources, the Java Pattern class reference is useful for Java development and I quite like regular-expressions.info as well.
For your problem I used fileFormat.info and came up with this regex to match "http://whatismyipaddress.com/ip/1.1.1.1":
.*//([.\w]+)/.*/(\d+(?:\.\d+){3})
or as a java string:
".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})"
A quick breakdown: anything (.*), followed by two slashes (//), followed by one or more dots or word characters (([.\w]+)), followed by a slash, any number of characters and another slash (/.*/), followed by at least one digit (\d+), followed by 3 sets of a literal dot and at least one digit ((?:\.\d+){3}). The dot before \d+ is escaped because an unescaped dot matches any character, not just a decimal point. The sets of parentheses around the server name part and the IP part are called capturing groups, and you can use methods on the Java Matcher class to return the contents of these sections. The ?: in the second part of the IP address says that the parentheses are used only to group the characters and are not to be treated as a capturing group.
This regex is not as strict or as flexible as it should be, but it's a starting point.
All of this can be researched on the above links.
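To make the capturing groups concrete, here is a small sketch that applies the expression to the example URL and reads the two groups back. The class and method names (IpUrlGroups, hostAndIp) are made up for this illustration, and the dot inside the IP part is escaped so that it only matches a literal dot.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IpUrlGroups {
    private static final Pattern URL_PATTERN =
            Pattern.compile(".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})");

    /** Returns {host, ip}, or null when the whole URL does not match the pattern. */
    static String[] hostAndIp(String url) {
        Matcher m = URL_PATTERN.matcher(url);
        if (!m.matches()) {
            return null;
        }
        // group(1) is the server name, group(2) is the dotted IP address
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = hostAndIp("http://whatismyipaddress.com/ip/1.1.1.1");
        System.out.println("host = " + parts[0] + ", ip = " + parts[1]);
    }
}
```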

Related

Regex matches with multiple patterns

I'm trying to compile one Java Regex pattern but have that pattern look for 3 different matches. I've learned that I can do that using the pipe (|) but I'm having trouble with the actual syntax of the regex.
I'm looking through XML data and trying to pull out 3 matches. The XML will look something like this:
<Element createdOn="1405358703367" updatedOn="1405358718804" url="http://www.someurl.com" />
The regex I'm trying looks like this so far:
((?<="url": ").*(?=")) | (createdOn="(\d)") | (updatedOn="(\d)")
In the end I need to get everything between the quotes in the XML (i.e. 1405358703367, 1405358718804, and http://www.someurl.com).
I had the URL regex working on its own earlier, but now no matches seem to be made.
Thanks.
Get the matched group from index 2.
(url|createdOn|updatedOn)="([^"]*)"
Here is sample code:
String string = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String patternString = "(url|createdOn|updatedOn)=\"([^\"]*)\"";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(2));
}
output:
1405358703367
1405358718804
http://www.someurl.com
Java has no single library method that returns all matches as an array, but you only need one line:
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
This works by stripping off leading and trailing input up to/from the first/last quote, then splits on quote-nonquote-quote input, leaving the target matches as an array.
In action:
String input = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
System.out.println(Arrays.toString(matches));
Output:
[1405358703367, 1405358718804, http://www.someurl.com]
The pipe (|) is used to find a match that could be some-pattern OR some-other-pattern OR yet-another-pattern. It's not good at finding all occurrences of several patterns. To do that, if the patterns you're looking for aren't necessarily in a fixed order, you'll need to use a loop.
Here's a code example that starts with the pattern you tried, fixes some problems, and uses a loop to find the patterns:
Pattern p = Pattern.compile("((?<=url=\").*(?=\"))|(createdOn=\"(\\d+)\")|(updatedOn=\"(\\d+)\")");
Matcher m = p.matcher(source);
while (m.find()) {
System.out.println("Found: "+m.group());
System.out.println("Group 1: "+m.group(1));
System.out.println("Group 3: "+m.group(3));
System.out.println("Group 5: "+m.group(5));
}
(Some problems with your original pattern: You put space characters before and after each |, which are treated literally and mean the pattern has to match spaces that aren't there. I added + after \\d because you want to match more than one digit. There were some mistakes, like putting : after url instead of =.)
Now the code uses a loop to find each successive pattern that matches one of the patterns you're looking for. It matches either url=... or createdOn=... or updatedOn=..., but by using a loop we will find all of them. (Note that it doesn't care if it sees a url or a createdOn attribute twice in the source. You'll have to check that yourself.)
The group() method with no parameters will return whatever was matched by the pattern. group(1), group(3), and group(5) return certain subsections of the pattern; the numbers are determined by counting wherever you use ( in the pattern except for (?. So group 1 matches something using url as a lookbehind; group 2 starts with createdOn; group 3 is the sequence of digits following createdOn; group 4 starts with updatedOn, etc. The way the pattern is set up, not all of these will have values, since only one of the three alternatives separated by | will match. The rest will be null. As a result, the output of the above code will display null for two of the groups, and a useful value for the other. If you do things this way, you'll need to test for null to see which value actually got returned.
This would also be a case where named capturing groups could be useful. See http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html. Matcher has a group(name) function that takes a group name as a parameter.
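A minimal sketch of the named-group idea follows. The group names "name" and "value", the class NamedGroupsDemo and the method attributes are all invented for the example; only the (?<name>...) syntax and Matcher.group(String) come from the Java API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupsDemo {
    // One alternation for the attribute name, one group for the quoted value.
    private static final Pattern ATTR =
            Pattern.compile("(?<name>url|createdOn|updatedOn)=\"(?<value>[^\"]*)\"");

    /** Collects the matched attributes in the order they appear in the source. */
    static Map<String, String> attributes(String source) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(source);
        while (m.find()) {
            // Named groups avoid counting parentheses and testing numbered groups for null.
            result.put(m.group("name"), m.group("value"));
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" "
                + "url=\"http://www.someurl.com\" />";
        System.out.println(attributes(source));
    }
}
```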
This is one approach, but there are always multiple approaches to string parsing, and the other answers posted here are valid also. Plus there are already XML parsers to take care of things like this for you.
NOTE: This answer was meant to point out how | works. I don't recommend actually doing things this way, since it's overly complicated. If you're going to look separately for each attribute, it would be simpler just to set up three patterns and look for each one, one at a time. Or use @braj's suggestion in a loop, and perform a later check to make sure the createdOn and updatedOn values are numeric.
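For comparison, this is roughly what the XML-parser route could look like with the DOM parser that ships with the JDK. It is only a sketch: the class and method names are hypothetical, and it assumes the input is a single well-formed element like the one in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class XmlAttributeDemo {
    /** Returns {createdOn, updatedOn, url} read from the root element. */
    static String[] readAttributes(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Element root = builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
        // getAttribute returns an empty string when the attribute is absent
        return new String[] {
            root.getAttribute("createdOn"),
            root.getAttribute("updatedOn"),
            root.getAttribute("url")
        };
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" "
                + "url=\"http://www.someurl.com\" />";
        for (String value : readAttributes(xml)) {
            System.out.println(value);
        }
    }
}
```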

Java regex "\A" boundary match

I am looking for some help with this regex. I have strings of varying length, and want to match only the beginning. The strings have newlines in them so it seems \A is the way to go.
I want regex that will match all the following cases:
OPTIONAL: [any whitespace/newlines/etc]
OPTIONAL: <?.*?>
OPTIONAL: [any whitespace/newlines/etc]
MANDATORY: <lemon>
OPTIONAL: anything afterwards.
Since the strings can get huge, the final Optional matching is making this be extremely slow.
My initial solution was:
"(^\\s*<?.*?>\\s*<lemon>)[\\s\\S]*|(^\\s*<lemon>.*)[\\s\\S]*"
This is extremely convoluted and matches the entire string instead of just the start
My current best try is:
"\\A(?:\\s*<?.*?>)?\\s*<lemon>"
However, this does not work if there is anything after <lemon>; in that case the match fails.
Has anyone got any ideas as to why? Examples on \A are sparse and I can't get it to work.
What you're missing is the notion of grouping. I've taken your regex and wrapped it in ( ) parentheses:
Pattern p = Pattern.compile("(\\A(?:\\s*<?.*?>)?\\s*<lemon>).*");
Matcher m = p.matcher(" <?.*?> <lemon> hi ");
if (m.find()) {
System.out.println(m.group(1));
}
group 0 will be the whole match
group 1 will be what you need.
This tutorial might explain how groups work
I am simply looking for a way to get a binary answer similar to String.matches(), which upon finding a match stop going through the string
Take this: \\A(?:\\s*<?.*?>)?\\s*<lemon>(.*?) with no grouping
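One way to get a yes/no answer without consuming the rest of the string is Matcher.lookingAt(), which anchors the match at the start of the input but, unlike String.matches(), does not require the pattern to cover the whole input. The sketch below makes two assumptions: the optional part is meant to be a literal <? ... ?> declaration (so the ? characters are escaped), and the names LemonPrefix and startsWithLemon are invented for the example.

```java
import java.util.regex.Pattern;

class LemonPrefix {
    // DOTALL lets .*? cross newlines inside the optional <? ... ?> declaration.
    private static final Pattern START =
            Pattern.compile("\\s*(?:<\\?.*?\\?>)?\\s*<lemon>", Pattern.DOTALL);

    /** True when the input begins with optional whitespace, an optional declaration, then <lemon>. */
    static boolean startsWithLemon(String input) {
        // lookingAt() matches from the beginning only and stops right after <lemon>
        return START.matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(startsWithLemon("  <?xml version=\"1.0\"?>\n <lemon> lots more text"));
        System.out.println(startsWithLemon("<apple> <lemon>"));
    }
}
```

Because the pattern has no trailing .* the engine never has to scan the remainder of a huge string.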

Regex: Match a string between two tags in a string

I am new to Regexp. I am stuck writing a regexp for the scenario below. Can someone please help me solve this?
If I have a String like the following:
<Tag1 attr="test"/>
<Tag2>
<Tag4 attr="test"/>
<Tag5 attr="test"/>
</Tag2>
<Tag3 attr="test"/>
What's the regex to match 'test' between the <Tag2> and </Tag2> tags?
Output should match 'test' in both Tag4 and Tag5...
Any help would be highly appreciated..
Why are you using a regex for this? I am not familiar with the Java libraries, but I would imagine there is a library that would allow you to do XQueries using XPaths. That would be the simpler approach.
Here is a website that shows examples
Here is a SO question on XPath in Java
XPath is really more appropriate for this. This looks like duplicate post. Original
Perl has a couple of good xpath parsers on CPAN. But here's a good page on multiline regex parsing if you absolutely must use it.
Everything said before is totally true. However, if you still want to practice some regex, here's an alternative:
Doing it in one match is not possible since one of the inner groups will always be discarded (see this) , so you'll have to extract the inner passage first.
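import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTagParse {
    static String html = "<Tag1 attr=\"test\"/><Tag2> <Tag4 attr=\"test_one\"/> <Tag5 attr=\"test_two\"/></Tag2><Tag3 attr=\"test\"/>";

    public static void main(String[] args) {
        // Step 1: isolate everything between <Tag2> and </Tag2> (DOTALL so newlines are allowed).
        Matcher mat1 = Pattern.compile("<Tag2>(.*?)</Tag2>", Pattern.DOTALL).matcher(html);
        if (!mat1.find()) {
            return; // no <Tag2> element, so there is nothing to extract
        }
        // Step 2: pull the attr value out of every tag inside that passage.
        // [^<>]*> allows any number of characters (such as "/") before the closing bracket.
        Matcher mat2 = Pattern.compile("<[^<>]*attr=\"([^\"]+)\"[^<>]*>").matcher(mat1.group(1));
        while (mat2.find()) {
            System.out.println(mat2.group(1));
        }
    }
}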
anyways, you'd be much better off using XPath :)
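Since several answers point to XPath, here is a hedged sketch of that route using the javax.xml.xpath API bundled with the JDK. The fragment in the question has several top-level tags, so the sketch wraps it in an artificial <root> element to make it well-formed; that wrapper, the class name and the method name are assumptions of this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathTagParse {
    /** Returns the attr values of all direct children of Tag2. */
    static List<String> attrsInsideTag2(String fragment) throws Exception {
        // Wrap the fragment so that it has a single root element.
        String xml = "<root>" + fragment + "</root>";
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//Tag2/*/@attr", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String fragment = "<Tag1 attr=\"test\"/><Tag2> <Tag4 attr=\"test_one\"/> "
                + "<Tag5 attr=\"test_two\"/></Tag2><Tag3 attr=\"test\"/>";
        System.out.println(attrsInsideTag2(fragment));
    }
}
```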
I'm not in practice with java, but I can offer some guidance to the regular expression, I hope. If you know what the specific attribute and value is that you're looking for, you can use something like the following:
Pattern pattern = Pattern.compile("<tag[45][^>]*?attr\\s*=\\s*[\"']test['\"][^>]*>", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher("<Tag1 attr='test'/><Tag2><Tag4 attr='test'/><Tag5 attr='test'/></Tag2><Tag3 attr='test'/>");
while (matcher.find()) {
    System.out.println(matcher.group());
}
Note that the backslashes are doubled inside the Java string, and that find() is used in a loop rather than matches(), because matches() only succeeds when the entire input matches the pattern.
the regex is made up of the following components:
match the literal string: <tag
followed by either a 4 or a 5 (the [45] designation)
followed by any number of characters other than > preceding the literal string: attr
followed by any number of spaces
followed by the literal character: =
followed by any number of spaces
followed by either the ' or " character
followed by the string literal: test
followed by either the ' or " character
followed by any number of characters that are not >
followed by >
The point of adding some of these extra bits is simply to highlight that you may want to account for different coding styles, etc. Note: I took the easy way out by setting the pattern as case-insensitive, but you can omit that flag and change your expression to check for the appropriate case (for example, if your attribute value is case-sensitive, you can change the 'tag' literal to [tT][aA][gG] so that only the tag name is matched case-insensitively).
I'm apparently too slow to type, since jvataman has already answered your question, but perhaps there is some value in my writeup, so I'll post anyway.

What's wrong with this regex?

I am trying the following code in Java:
String test = "http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf";
String regex = "[http://]{0,1}([a-zA-Z]*.)*\\.google\\.com/[-a-zA-Z/_.?&=]*";
System.out.println(test.matches(regex));
It runs for several minutes (after which I killed the VM) with no result.
Can anyone help me?
BTW: What would you recommend to speed up weblink-testing regexes in the future?
[http://] is a character class, meaning any one of those characters from the set.
Just leave those particular square brackets off if it must start with http://. If it's optional, you can use (http://)?.
One obvious problem is that you're looking for the sequence ([a-zA-Z]+.)*\\.google - this will do a lot of backtracking due to that naked . which means "any character" rather than the literal period that you wanted.
But even if you replace it with what you meant, ([a-zA-Z]+\\.)*\\.google, you still have a problem - this will then require two . characters immediately before google. You should instead try:
String regex = "(http://)?([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*";
That returns immediately for me with a true match.
Keep in mind that this currently requires the / at the end of google.com. If that's a problem, it's a minor fix, but I've left it there since you had it in your original regex.
You are trying to match the scheme as a character class using square brackets. That means only zero or one of the characters from that set. You want a subpattern, with parentheses. You can also change {0,1} to just say ?.
Also, you should remove the period just before google\\.com because you're already looking for a period in the subdomain subpattern of your regex. As cherouvim points out, you forgot to escape that period as well.
String regex = "(http://)?([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*";
In the ([a-zA-Z]*.) part you either need to escape the . (because right now it means "all characters") or remove it.
There are two problems with the regular expression.
The first is easy, as was mentioned by others. You need to match "http://" as a subpattern, not as a character class. Change the brackets to parentheses.
The second problem causes the very poor performance. It's causing the regex to backtrack repeatedly, trying to match the pattern.
What you're trying to do is match zero or more subdomains, which are groups of letters followed by a dot. Since you want to match the dot explicitly, escape the dot. Also remove the dot in front of "google" so you can match "http://google.com/etc" (ie, no leading dot in front of google).
So your expression becomes:
String regex = "(http://){0,1}([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*";
Running this regex on your example takes just a fraction of a second.
Assuming you fix the ([a-zA-Z]*\\.) you need to change * to + so the part becomes ([a-zA-Z]+\\.). Otherwise you'll be accepting http://...google.com and this is not valid.
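Putting the fixes from the answers above together, a quick check could look like the sketch below. The class and method names are invented for the example; the pattern is the corrected one quoted in the answers.

```java
import java.util.regex.Pattern;

class GoogleUrlRegex {
    // (http://)? is an optional group, each subdomain needs at least one letter and a literal dot.
    private static final Pattern GOOGLE =
            Pattern.compile("(http://)?([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*");

    static boolean isGoogleUrl(String url) {
        return GOOGLE.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isGoogleUrl("http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf"));
        System.out.println(isGoogleUrl("http://...google.com/"));
    }
}
```

Because every repetition of the subdomain group must end in a literal dot, there is only one way to split the host into subdomains, which is what removes the heavy backtracking.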
By grouping the part before google.com, I assume you are looking for part of the URL's host name. Regexp is a powerful tool, but you can simply use Java's URL class, which has a getHost() method. Then you can check whether the host name ends with google.com and split it, or use a simpler regexp on the host name alone.
URL url = new URL("http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf");
String host = url.getHost();
if (host.endsWith("google.com"))
{
String [] parts = host.split("\\.");
for (String s: parts)
System.out.println(s);
}
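A slightly stricter variant of the same idea is sketched below. It swaps in java.net.URI instead of URL (an assumption of this example, chosen because URI.create needs no checked-exception handling), and it compares against ".google.com" with the leading dot so that a host such as notgoogle.com is not accepted. The class and method names are hypothetical.

```java
import java.net.URI;

class GoogleHostCheck {
    /** True when the URL's host is google.com or one of its subdomains. */
    static boolean isGoogleHost(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return false; // no host component, e.g. a relative URL
        }
        host = host.toLowerCase();
        // The leading dot prevents "notgoogle.com" from passing the suffix test.
        return host.equals("google.com") || host.endsWith(".google.com");
    }

    public static void main(String[] args) {
        System.out.println(isGoogleHost("http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf"));
        System.out.println(isGoogleHost("http://notgoogle.com/"));
    }
}
```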

Linkify text with regular expressions in Java

I have a wysiwyg text area in a Java webapp. Users can input text and style it or paste some already HTML-formatted text.
What I am trying to do is to linkify the text. This means, converting all possible URLs within text, to their "working counterpart", i.e. adding < a href="...">...< /a>.
This solution works when all I have is plain text:
String r = "http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
Pattern pattern = Pattern.compile(r, Pattern.DOTALL | Pattern.UNIX_LINES | Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(comment);
comment = matcher.replaceAll("<a href=\"$0\">$0</a>"); // group 0 is the whole expression
But the problem is when there is some already formatted text, i.e. that it already has the < a href="...">...< /a> tags.
So I am looking for some way for the pattern not to match whenever it finds the text between two HTML tags (< a>). I have read this can be achieved with lookahead or lookbehind but I still can't make it work. I am sure I am doing it wrong because the regex still matches. And yes, I have been playing around/ debugging groups, changing $0 to $1 etc.
Any ideas?
You are close. You can use a "negative lookbehind" like so:
(?<!href=")http:// etc
All results preceded by href will be ignored.
If you want to use regex, (though I think parsing to XML/HTML first is more robust) I think look-ahead or -behind makes sense. A first stab might be to add this at the end of your regex:
(?!</a>)
Meaning: don't match if there's a closing a tag just afterwards. (This could be tweaked forever, of course.) This doesn't work well, though, because given the string
http://example.com/</a>
the regex will try to match "http://example.com/", fail due to the lookahead (as we hope), and then backtrack the greedy quantifier on the end and match "http://example.com" instead, which doesn't have a </a> after it.
You can fix the latter problem by using a possessive quantifier on your +, * and ? operators: just stick a + after them. This prevents them from backtracking. This is probably good for performance reasons, as well.
This works for me (note the three extra +'s):
String r = "http(s)?://([\\w+?\\.\\w+])++([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*+)?+(?!</a>)";
If you really want to do it with regex, then:
String r = "(?<![=\"\\/>])http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
i.e. check that the URL is not immediately preceded by =, ", / or >.
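A compact sketch of the negative-lookbehind approach is shown below. The URL pattern here is deliberately simplified compared with the one in the question, and the class and method names are invented, so treat it as an illustration of the lookbehind rather than a complete linkifier.

```java
import java.util.regex.Pattern;

class Linkifier {
    // (?<![="/>]) rejects URLs that directly follow =, ", / or >, which covers
    // href="..." attribute values and the text of an existing <a> element.
    private static final Pattern BARE_URL = Pattern.compile(
            "(?<![=\"/>])https?://[\\w.-]+(?:/[\\w./?%&=~#+-]*)?");

    static String linkify(String text) {
        // $0 is the whole match, used both as the link target and the link text
        return BARE_URL.matcher(text).replaceAll("<a href=\"$0\">$0</a>");
    }

    public static void main(String[] args) {
        String comment = "see http://example.com/ and "
                + "<a href=\"http://foo.com/\">http://foo.com/</a>";
        System.out.println(linkify(comment));
    }
}
```

A URL that follows a tag such as <p> directly would also be skipped by this lookbehind, which is one reason the HTML-parsing route is more robust.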
Perhaps html parsing will be more appropriate for you (htmlparser for example). Then you could have html nodes and only "linkify" links in the text and not in the attributes.
If you have to roll your own, at least look at the algorithms/patterns used in an Open Source implementation of Markdown, e.g., MarkdownJ.
