Regex - Optimize look behind

Regex - Optimize look behind - java

I have a regex that finds URLs in source code:
(?<!\b(XmlNamespace)\([^\n]{0,1000})"(http|ftp|socket):\/\/(?!www\.google-analytics\.com(\/collect)?)(\:(\d+)?)?("|\/))[\w\d]
But it is VERY slow. The main problem is the lookbehind.
I use Java (java.util.regex.Pattern).
Can someone help me, please?
UPD:
When I changed {0,1000} to {0,100}, the processing time dropped to 50 seconds, but that is not a solution. I believe the lookbehind is evaluated first and the main part only second. So the question is: how can I make "(http|ftp|socket):// match first and have the lookbehind checked only after that?

The regex was tested at RegexPlanet.
NOTE: The (?!www\.google-analytics\.com(/collect)?) lookahead does not make much sense, since right after // the pattern goes on to consume an optional : plus digits and then a " or /, so the host name can never be at that position anyway. The regex as a whole may therefore not do what you intend.
Below I focus only on how the pattern can be made faster.
The point is that the lookbehind in your pattern is triggered at the location before every character. If you place it after the initial subpattern, and repeat that subpattern at the end of the lookbehind, it will be triggered only after the initial subpattern has matched.
Fortunately, the Java regex engine accepts this lookbehind, because its width is still bounded even with the alternation inside.
So,
"(http|ftp|socket)://(?<!\bXmlNamespace\(.{0,1000}"(http|ftp|socket)://)(?!www\.google-analytics\.com(/collect)?)(:\d*)?["/]\w
should work better. Note that I removed the unnecessary escape symbols (/ is not a special character in Java regex) and put the ("|/) alternation into a character class, ["/]. Also, [\w\d] is the same as \w (which already matches digits).
Java test:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String value1 = "\"http://:2123\"123";
String pattern1 = "\"(http|ftp|socket)://(?<!\\bXmlNamespace\\(.{0,1000}\"(http|ftp|socket)://)(?!www\\.google-analytics\\.com(/collect)?)(:\\d*)?[\"/]\\w";
Pattern ptrn = Pattern.compile(pattern1);
Matcher matcher = ptrn.matcher(value1);
// find() searches for the pattern anywhere in the input
if (matcher.find()) {
    System.out.println("true");
} else {
    System.out.println("false");
}
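To see the reordering in isolation, here is a minimal sketch. It uses a simplified, hypothetical pattern (not your full one): the quoted scheme prefix is consumed first, the bounded lookbehind repeats that prefix at its end, and the rest of the URL is captured in group 1. The class name, method name and sample input are my own assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedUrlFinder {
    // Simplified pattern (an assumption, not the original regex):
    // 1) consume the opening quote and the scheme,
    // 2) only then run the bounded lookbehind, which repeats the consumed prefix,
    // 3) capture the rest of the URL up to the next quote or whitespace.
    private static final Pattern URL = Pattern.compile(
        "\"((?:http|ftp|socket)://"
        + "(?<!\\bXmlNamespace\\(.{0,1000}\"(?:http|ftp|socket)://)"
        + "\\w[^\"\\s]*)");

    static List<String> findUrls(String source) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(source);
        while (m.find()) {
            urls.add(m.group(1)); // URL without the opening quote
        }
        return urls;
    }

    public static void main(String[] args) {
        String source = "XmlNamespace(\"http://schemas.example.org/ns\");\n"
                      + "String u = \"ftp://files.example.org/a\";";
        // The XmlNamespace(...) URL on the first line is skipped.
        System.out.println(findUrls(source));
    }
}
```

Because . does not match line breaks by default, the lookbehind only looks back along the current line, which mirrors the [^\n]{0,1000} in the question.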

Related

Java Regular Expressions equivalent to PCRE/etc. shorthand `\K`?

Perl regex and PCRE (Perl-Compatible Regular Expressions), among others, have the shorthand \K, which drops everything matched to the left of it from the reported match (capturing groups are kept). Java doesn't support it, so what is Java's equivalent?

There is no direct equivalent. However, you can always re-write such patterns using capturing groups.
If you take a closer look at the \K operator and its limitations, you will see that it can be replaced with capturing groups.
See rexegg.com \K reference:
In the middle of a pattern, \K says "reset the beginning of the reported match to this point". Anything that was matched before the \K goes unreported, a bit like in a lookbehind.
The key difference between \K and a lookbehind is that in PCRE, a lookbehind does not allow you to use quantifiers: the length of what you look for must be fixed. On the other hand, \K can be dropped anywhere in a pattern, so you are free to have any quantifiers you like before the \K.
However, all this means that the pattern before \K is still a consuming pattern, i.e. the regex engine adds up the matched text to the match value and advances its index while matching the pattern, and \K only drops the matched text from the match keeping the index where it is. This means that \K is no better than capturing groups.
So, a value\s*=\s*\K\d+ PCRE/Onigmo pattern would translate into this Java code:
String s = "Min value = 5000 km";
Matcher m = Pattern.compile("value\\s*=\\s*(\\d+)").matcher(s);
if (m.find()) {
    System.out.println(m.group(1)); // group 1 holds only the digits
}
There is an alternative, but it can only be used with smaller, simpler patterns: a constrained-width lookbehind.
Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four characters. Likewise, (?<=A{1,10}) is valid.
So, this will also work:
m = Pattern.compile("(?<=value\\s{0,10}=\\s{0,10})\\d+").matcher(s);
if (m.find()) {
    System.out.println(m.group()); // the whole match is just the digits
}
See the Java demo.
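\K is also often used in substitutions, where only the part after \K should be replaced. In Java the same effect can be had by capturing the prefix and putting it back in the replacement. This is a minimal sketch; the class name, method name and the group name prefix are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeepPrefixReplace {
    // The text that \K would drop from the match is captured in a named group
    // and re-inserted by the replacement string.
    private static final Pattern VALUE =
        Pattern.compile("(?<prefix>value\\s*=\\s*)\\d+");

    static String replaceNumber(String input, String newNumber) {
        // ${prefix} refers to the named group; quoteReplacement protects
        // any $ or \ that the new text might contain.
        return VALUE.matcher(input)
                    .replaceAll("${prefix}" + Matcher.quoteReplacement(newNumber));
    }

    public static void main(String[] args) {
        System.out.println(replaceNumber("Min value = 5000 km", "9999"));
    }
}
```

A named group is used rather than $1 so that digits following the reference cannot be misread as part of the group number.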

Regex, lookbehind/lookahead with ".*"

This word has to be taken with the space behind it
word like this has to be taken too
If the word appears as \gloss{word}, \(anything here)sezione{word}, \gloss{anything word anything}, or \(anything here)sezione{anything word anything}, it must not be taken.
If the word appears as \(anything but gloss or sezione){word} or \(anything but gloss or sezione){strings word strings}, it has to be taken.
Obviously aword, worda and aworda must not be taken.
(the bold word has been taken, word has not)
My problem is avoiding the word when it is inside "{.... word .....}".
My guess was (?<!(sezione\{)|(gloss\{))(\b)( ?)word(\b)(?!.*\{}) so far, and I would have added a ".*" on the lookbehind and lookahead ( (?<!(sezione\{)|(gloss\{).*)[...] ) but like this it stops working.
If it matters, I plan to use Java's regex engine.
Thanks in advance.
Edit: the major problem is
\(anything here)sezione{anything word anything}
If I can avoid matching this one, the whole problem should be solved.

Let's establish a few hard facts about your use case:
Java (like most regex engines) doesn't support variable-length lookbehind; only lookbehinds with a bounded width are accepted.
Java's regex engine doesn't support the \K pattern, which would let you reset the start of the reported match.
In the absence of those, you will need a workaround that works in 3 steps:
Make sure the input matches the expected lookbehind pattern.
If it does, remove the part of the String matched by the lookbehind pattern.
In the resulting String, match and extract your search pattern.
Consider the following code:
String str = "(anything here)sezione{anything word anything}";
// look behind pattern
String lookbehind = "^.*?(?:sezione|gloss|word)\\{";
// make sure input is matching lookbehind pattern first
if (str.matches(lookbehind + ".*$")) {
    // actual search pattern
    Pattern p = Pattern.compile("[^}]*?\\b(word)\\b");
    // search in replaced String
    Matcher m = p.matcher(str.replaceFirst(lookbehind, ""));
    if (m.find())
        System.out.println(m.group(1));
    //> word
}
PS: You may need to improve the code by checking indexes in the input String to find where the search pattern starts.
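A different technique, not the three-step workaround above, is to let an alternation consume the blocks you want to skip and capture the word only in the other branch. The sketch below assumes there are no nested braces inside gloss{...} or sezione{...}; the class name, method name and sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordOutsideMacros {
    // Branch 1 swallows a whole gloss{...} or sezione{...} block (no nested braces assumed),
    // so any "word" inside it can never be matched by branch 2.
    // Branch 2 captures a standalone "word" anywhere else.
    private static final Pattern P =
        Pattern.compile("(?:gloss|sezione)\\{[^}]*\\}|\\b(word)\\b");

    /** Returns the start index of every "word" that should be taken. */
    static List<Integer> wordPositions(String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) { // only branch 2 sets group 1
                positions.add(m.start(1));
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "word \\gloss{word} \\textbf{a word b} \\subsezione{x word y} aword worda";
        System.out.println(wordPositions(text));
    }
}
```

Since branch 1 has no word boundary before sezione, it also covers the \(anything here)sezione{...} case, and the \b anchors in branch 2 keep aword and worda out.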

Pattern.compile("(.*?):")

I'm trying to understand the following code:
Pattern.compile("(.*?):")
I already did some research about what it could mean,
but I don't quite get it:
According to the java docs the * would mean 0 or more times,
while ? means once or not at all.
Also, what does the ':' mean?
Thanks

This is called a reluctant quantifier. An asterisk and a question mark *? together mean "zero or more times, without matching more characters from the input than is needed". This is what prevents the dot . expression from matching the subsequent colon : in the input.
A better expression to match the same sequence is [^:]*:, because it lets you avoid backtracking. Here is a link to an article explaining why.
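As a quick sketch of the negated-class version: the class [^:] can never run past a colon, so the engine has nothing to give back. The class and method names are assumptions for illustration, and returning null when there is no colon is just one possible choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstField {
    // [^:]* stops at the first colon by construction, no reluctant quantifier needed
    private static final Pattern UP_TO_COLON = Pattern.compile("([^:]*):");

    /** Returns the text before the first colon, or null if there is no colon. */
    static String beforeFirstColon(String input) {
        Matcher m = UP_TO_COLON.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(beforeFirstColon("Hello: This is a Test:")); // prints Hello
    }
}
```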

The ? after greedy operators such as + or * makes the operator non-greedy. Without the ?, .* would consume as much as possible, including earlier : characters, and stop only at the last :.
As it is, the regex will match any string that comes before the first colon (:). Here, the colon is not a special character. Whatever comes before the colon is put into a group, which can be accessed later through a Matcher object.
This code snippet will hopefully make things more clear:
String str = "Hello: This is a Test:";
Pattern p1 = Pattern.compile("(.*?):");
Pattern p2 = Pattern.compile("(.*):");
Matcher m1 = p1.matcher(str);
if (m1.find())
{
System.out.println(m1.group(1));
}
Matcher m2 = p2.matcher(str);
if (m2.find())
{
System.out.println(m2.group(1));
}
Yields:
Hello
Hello: This is a Test

This regular expression means anything ending with :, or it can be understood as anything up to the first :.
Here ':' has no special meaning; it is matched literally, so a string such as anystring: matches this pattern.

I think the '?' is redundant and will be applied on '.*'.
':' has no special meaning whatsoever in regexps and will be matched to the characters in the string.
EDIT: dasblinkenlight is right: if greedy, the regexp will try to match as much as it can, and his suggestion is right as well.
I found a link which lists greedy vs reluctant: What is the difference between `Greedy` and `Reluctant` regular expression quantifiers?

JAVA equivalent to Javascript REGEX

I'm a total beginner in Java.
In JavaScript I have this regex:
/[^0-9.,\-\ ]/gi
How can I do the same in Java?

Have a look at this: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
There's quite a lot you can do in Java with regex.

If you want to match repeatedly against that regex, you would do:
Pattern p = Pattern.compile("(?i)[^0-9.,\\- ]");
Matcher m = p.matcher(targetString);
Then use the Matcher methods in a loop to get the matches you want. Inside a Java string literal the backslash must be doubled, and the hyphen has to stay escaped (or be moved to the end of the class), otherwise it is read as a range operator. The "i" is a case-insensitivity flag (which you don't actually need, as the class contains no letters). Java has no "g" flag: calling m.find() repeatedly, or using replaceAll, applies the pattern to every match in the target string rather than only the first.
Also, the pattern above matches only one character at a time; you may in fact want [^0-9.,\- ]*, which matches 0 or more such characters, greedily (or + for one or more, which avoids empty matches). I would read the docs on the Pattern class if I were you.
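For illustration, here is a small sketch of the two usual stand-ins for the JavaScript g flag, assuming the character class is written as [^0-9.,\- ]. The class name, method names and sample input are my own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JsRegexPort {
    // Java form of /[^0-9.,\-\ ]/ ; the "i" flag is omitted because the class has no letters
    private static final Pattern NOT_ALLOWED = Pattern.compile("[^0-9.,\\- ]");

    /** Like input.replace(/.../g, "") in JavaScript: removes every match. */
    static String stripDisallowed(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    /** Walks over every match with find(), which is what "g" does implicitly. */
    static int countDisallowed(String input) {
        Matcher m = NOT_ALLOWED.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "Total: 1,234.50 - 7 EUR";
        System.out.println("[" + stripDisallowed(input) + "]");
        System.out.println(countDisallowed(input));
    }
}
```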

Possible Regular Expression Question

I have a simple program that looks up details of an IP you give it, and I will show you an example of some of my code
int regIndex = src.indexOf("Region:") + 16;
int endIndex = src.indexOf("<", regIndex);
String region = src.substring(regIndex, endIndex);
if(regIndex == 15) region = "None";
int counIndex = src.indexOf("Country:") + 17;
int couneIndex = src.indexOf(" <", counIndex);
String country = src.substring(counIndex, couneIndex);
As you can see, it is definitely not the most efficient way to do this. The website I am using gives the information like this: http://whatismyipaddress.com/ip/1.1.1.1
I have never really used Regular Expressions before, but it seems to me like there might be one that could really make this more efficient and easier to program, but I've been looking around and I'm pretty lost.
So basically my question is, how could I use a Regular Expression for this (Or if there is another more efficient way).
Any help would be great,
Thanks :)

You can do something like this:
String s = "bla Country: Australia <bla";
Pattern pattern = Pattern.compile("Country: (.*) [<]");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println("Country = " + matcher.group(1));
}

The source would look like this
<tr><th>Country:</th><td>Australia <img src="http://whatismyipaddress.com/images/flags/au.png" alt="au flag"> </td></tr>
Using a regular expression means matching a pattern.
The pattern that marks the data you want is pretty straightforward: Country:. You also need to match the tags that follow, <\/th><td>. The forward slash is escaped here, which is harmless, although Java does not require it. Then comes the data you are looking for. I would suggest matching everything that is not a <, so [^<]. This is a negated character class, meaning any character that is not a <. To repeat it, add a + at the end, meaning one or more of the preceding item.
So, the complete thing should look like this:
Country:<\/th><td>\s*([^<]+)\s*<
I also added the parentheses; they put the matched text into a capturing group, so your result can be found in group 1. I also added \s*, which is a whitespace character repeated 0 or more times, to match whitespace before or after your data; I assume you don't need it in the result. Note that [^<]+ can itself match trailing whitespace, so trim the captured value if that matters.
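Applied in Java, this could look like the following sketch. The class and method names are assumptions; the "None" fallback is borrowed from the question, and trim() removes the trailing space that [^<]+ picks up before the img tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CountryExtractor {
    // Same pattern as above; the slash is left unescaped because Java does not need it
    private static final Pattern COUNTRY =
        Pattern.compile("Country:</th><td>\\s*([^<]+)\\s*<");

    /** Returns the country name, or "None" if the row is not found. */
    static String country(String html) {
        Matcher m = COUNTRY.matcher(html);
        return m.find() ? m.group(1).trim() : "None";
    }

    public static void main(String[] args) {
        String row = "<tr><th>Country:</th><td>Australia "
            + "<img src=\"http://whatismyipaddress.com/images/flags/au.png\" alt=\"au flag\"> </td></tr>";
        System.out.println(country(row)); // prints Australia
    }
}
```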

Firstly, there are some online sites that can help you develop a regular expression. They let you enter some text and a regular expression, and then show you the result of applying the expression to the text. This saves you from having to write code while you develop the expression and expand your understanding. A good site I use a lot is FileFormat regex, because it allows me to test one expression against multiple test strings. A quick search also brought up regex Planet, RegExr and RegexPal. There are lots of others.
In terms of resources, the Java Pattern class reference is useful for Java development, and I quite like regular-expressions.info as well.
For your problem I used fileFormat.info and came up with this regex to match "http://whatismyipaddress.com/ip/1.1.1.1":
.*//([.\w]+)/.*/(\d+(?:\.\d+){3})
or as a Java string:
".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})"
A quick breakdown: anything (.*), followed by two slashes (//), followed by one or more dots or word characters (([.\w]+)), followed by a slash, any number of characters and another slash (/.*/), followed by at least one digit (\d+), followed by 3 sets of a dot and at least one digit ((?:\.\d+){3}). The dot is escaped there because an unescaped . would match any character. The parentheses around the server name part and the IP part are called capturing groups, and you can use methods on the Java Matcher class to return the contents of these sections. The ?: in the second part of the IP address says that the parentheses only group the characters and are not to be treated as a capturing group.
This regex is not as strict or as flexible as it should be, but it's a starting point.
All of this can be researched on the above links.
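To show how the two capturing groups are read back through Matcher, here is a short sketch that uses the pattern with the dot before the digits escaped. The class and method names are hypothetical, and returning null for non-matching input is only one possible choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IpUrlParts {
    // Group 1: server name, group 2: dotted IP at the end of the URL
    private static final Pattern URL =
        Pattern.compile(".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})");

    /** Returns {host, ip}, or null if the whole URL does not fit the pattern. */
    static String[] hostAndIp(String url) {
        Matcher m = URL.matcher(url);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = hostAndIp("http://whatismyipaddress.com/ip/1.1.1.1");
        System.out.println(parts[0] + " -> " + parts[1]);
    }
}
```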
