Java regex matcher.find fails occasionally

Java regex matcher.find fails occasionally - java

I have regexp which parses all names of used freemarker macros in template (for example from <#macroName /> I need only macroName). Templates are usually quite large (round 30 thousand characters).
Java code with regex looks like:
Pattern pattern = Pattern.compile(".*?<#(.*?)[ /].*?",
Pattern.DOTALL | Pattern.UNIX_LINES);
Matcher matcher = pattern.matcher(inputText);
while(matcher.find()){
//... some code
}
But sometimes happens that I get this exception:
java.util.regex.Pattern$Curly.match1(Pattern.java:3814)
java.util.regex.Pattern$Curly.match(Pattern.java:3763)
java.util.regex.Pattern$Start.match(Pattern.java:3072)
java.util.regex.Matcher.search(Matcher.java:1116)
java.util.regex.Matcher.find(Matcher.java:552)
...
Does anybody know why it happens or could anybody make me sure if the regexp I'm using is optimized well?
thank you

For <#macro macroName /> your regex looks a little bit convoluted. Either there are things (special cases) that <#macro macroName /> don't describe, or the regex is trying too hard. Try:
<#macro\s+(\S+)\s+/>
You should have now the macro's name in group #1.

You can get rid of the leading .*? because you don't need to consume the text before/between the matches. The regex engine will take care of scanning for the next match, and it will do it a lot more efficiently than what you're doing. Just give it the pattern for the tag itself and get out of its way.
You can get rid of the trailing .*? because it never does anything. Think about it: it's trying to match zero or more of any characters, reluctantly. That means the first thing it tries to do is match nothing. That attempt will succeed (it's always possible to match nothing), so it never tries to consume more characters.
You probably want something like this ():
<#(\w+)[\s/]
...or in Java-speak:
Pattern p= Pattern.compile("<#(\\w+)[ /]");
You don't need DOTALL (no dots) or any other modifiers.

Related

Why is this regex not matching URLs?

I have the following regex:
^(?=\w+)(-\w+)(?!\.)
Which I'm attempting to match against the following text:
www-test1.examples.com
The regex should match only the -test1 part of the string and only if it is before the first .and after the start of the expression. www can be any string but it should not be matched.
My pattern is not matching the -test1 part. What am I missing?

Java is one of the only languages that support non-fixed-length look-behinds (which basically means you can use quantifiers), so you can technically use the following:
(?<=^\w+)(-\w+)
This will match for -test without capturing the preceding stuff. However, it's generally not advisable to use non-fixed-length look-behinds, as they are not perfect, nor are they very efficient, nor are they portable across other languages. Having said that.. this is a simple pattern, so if you don't care about portability, sure, go for it.
The better solution though is to group what you want to capture, and reference the captured group (in this case, group 1):
^\w+(-\w+)
p.s. - \w will not match a dot, so no need to look ahead for it.
p.p.s. - to answer your question about why your original pattern ^(?=\w+)(-\w+)(?!\.) doesn't match. There are 2 reasons:
1) you start out with a start of string assertion, and then use a lookahead to see if what follows is one or more word chars. But lookaheads are zero-width assertions, meaning no characters are actually consumed in the match, so the pointer doesn't move forward to the next chars after the match. So it sees that "www" matches it, and moves on to the next part of the pattern, but the actual pointer hasn't moved past the start of string. So, it next tries to match your (-\w+) part. Well your string doesn't start with "-" so the pattern fails.
2) (?!\.) is a negative lookahead. Well your example string shows a dot as the very next thing after your "-test" part. So even if #1 didn't fail it, this would fail it.

The problem you're having is the lookahead. In this case, it's inappropriate if you want to capture what's between the - and the first .. The pattern you want is something like this:
(-\w+)(?=\.)
In this case, the contents of capture group 1 will contain the text you want.
Demo on Regex101

Try this:
(?<=www)\-\w+(?=\.)
Demo: https://regex101.com/r/xEpno7/1

Replace string part with regex pattern

I would like to replace the following string.
img/s/430x250/
The problem is there are variations, like:
img/s/265x200/
or:
img/s/110x73/
So I would like to replace this part in whole, but the numbers are changeable, so how could I make a pattern that replaces it from a string?

Is your goal to match all three of those cases?
If so, this should work: img\/s\/\d+x\d+\/
It searches for img/s/[1 or more digits]x[1 or more digits]/

This regular expression will match your examples
img\/s\/\d+?x\d+?\/
the / matches /
the \d matches digits 0-9 and the + means 1 or more. The ? makes it lazy instead of greedy.
the img and s just match that literally
check out https://regex101.com/ to try out regular expressions. It's much easier than testing them by debugging code. Once you find an expression that works, you can move on to make sure your specific code will perform the same.

JAVA equivalent to Javascript REGEX

I'm totally beginner in java.
In javascript i have this regex:
/[^0-9.,\-\ ]/gi
How can i do the same in java?

Have a look at this: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Theres quite a lot you can do in Java with Regex

If you want to match repeatedly against that regex, you would do:
Pattern p = Pattern.compile("(?i)[^0-9.,-\ ]");
Matcher m = p.matcher(targetString);
Then use the matcher methods in a loop to get the match you want. The "i" is a case insensitivity flag (which you actually don't need as there are no characters specified), but I'm not sure what the equivalent of the "g" flag is.. I think it's simply to attempt to apply the pattern repeatedly to the target string rather than to try and match the whole string, which is what the above code does.
Also, the pattern above will only match one character at a time, you may in fact want [^0-9.,-\ ]*, which will match against 0 or more characters, greedily. I would read the docs on the Pattern class if I were you.

How do I make this regex more general, sometimes it works and sometimes it doesn't

I have the following regex that I am using in a java application. Sometimes it works correctly and sometimes it doesn't.
<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->
Sometimes I will have whitespace before/after it, sometimes there will be text. The same goes for the region within the tags.
The main problem is that name=(\".*\")?> sometimes matches more than it is supposed to. I am not sure if that is something that is obvious to solve, simply looking at this code.

XML is not a regular language, nor is HTML or any other language with "nesting" constructs. Don't try to parse it with regular expressions.
Choose an XML parser.

As others have pointed out, the greedy .* (dot-star) that matches the "name" attribute needs to be made non-greedy (.*?) or even better, replaced with a negated character class ([^"]*) so it can't match beyond the closing quotation mark no matter what happens in the rest of the regex. Once you've fixed that, you'll probably find you have the same problem with the other dot-star; you need to make it non-greedy too.
Pattern p = Pattern.compile(
"<!--\\s*<editable\\s+name=\"([^\"]*)\">\\s*-->" +
"(.*?)" +
"<!--\\s*</editable>\\s*-->",
Pattern.DOTALL);
I don't get the significance of your remarks about whitespace. If it's linefeeds and/or carriage returns you're talking about, the DOTALL modifier lets the dot match those--and of course, \s matches them as well.
I wrote this in the form of a Java string literal to avoid confusion about where you need backslashes and how many of them you need. In a "raw" regex, there would be only one backslash in each of the whitespace shorthands (\s*), and the quotation marks wouldn't need to be escaped ("[^"]*").

I would replace that .* with [\w-]* for example if name is an identifier of some sort.
Or [^\"]* so it doesn't capture the end double quote.
Edit:
As mentioned in other post you might consider going for a simple DOM traversal, XPath or XQuery based evaluation process instead of a plain regular expression. But note that you will still need to have regex in the filtering process because you can find the target comments only by testing their body against a regular expression (as I doubt the body is constant judjing from the sample).
Edit 2:
It might be that the leading, trailing or internal whitespaces of the comment body makes your regexp fail. Consider putting \s* in the beginning and at the end, plus \s+ before the attribute-like thing.
<!--\s*<editable\s+name=(\"[^\"]*\")?>\s*-->(.*)<!--\s*</editable>\s*-->
Or when you are filtering on XML based search:
"\\s*<editable\\s+name=(\"[^\"]*\")?>\\s*"
"\\s*</editable>\\s*"
Edit 3: Fixed the escapes twice. Thanks Alan M.

the * multiplier is "greedy" by default, meaning it matches as much as possible, while still matching the pattern successfully.
You can disable this by using *?, so try:
(\".*?\")

Linkify text with regular expressions in Java

I have a wysiwyg text area in a Java webapp. Users can input text and style it or paste some already HTML-formatted text.
What I am trying to do is to linkify the text. This means, converting all possible URLs within text, to their "working counterpart", i.e. adding < a href="...">...< /a>.
This solution works when all I have is plain text:
String r = "http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
Pattern pattern = Pattern.compile(r, Pattern.DOTALL | Pattern.UNIX_LINES | Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(comment);
comment = matcher.replaceAll("$0"); // group 0 is the whole expression
But the problem is when there is some already formatted text, i.e. that it already has the < a href="...">...< /a> tags.
So I am looking for some way for the pattern not to match whenever it finds the text between two HTML tags (< a>). I have read this can be achieved with lookahead or lookbehind but I still can't make it work. I am sure I am doing it wrong because the regex still matches. And yes, I have been playing around/ debugging groups, changing $0 to $1 etc.
Any ideas?

You are close. You can use a "negative lookbehind" like so:
(?<!href=")http:// etc
All results preceded by href will be ignored.

If you want to use regex, (though I think parsing to XML/HTML first is more robust) I think look-ahead or -behind makes sense. A first stab might be to add this at the end of your regex:
(?!</a>)
Meaning: don't match if there's a closing a tag just afterwards. (This could be tweaked forever, of course.) This doesn't work well, though, because given the string
http://example.com/
This regex will try to match "http://example.com/", fail due to the lookahead (as we hope), and then backtrack the greedy qualifier to have on the end and match "http://example.com" instead, which doesn't have a after it.
You can fix the latter problem by using a possessive qualifier on your +, * and ? operators - just stick a + after them. This prevents them from back-tracking. This is probably good for performance reasons, as well.
This works for me (note the three extra +'s):
String r = "http(s)?://([\\w+?\\.\\w+])++([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*+)?+(?!</a>)";

If you really want to do it with regex, than:
String r = "(?<![=\"\\/>])http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
e.g. check that the URL is not following a =" or />

Perhaps html parsing will be more appropriate for you (htmlparser for example). Then you could have html nodes and only "linkify" links in the text and not in the attributes.

If you have to roll your own, at least look at the algorithms/patterns used in an Open Source implementation of Markdown, e.g., MarkdownJ.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java regex matcher.find fails occasionally - java

For <#macro macroName /> your regex looks a little bit convoluted. Either there are things (special cases) that <#macro macroName /> don't describe, or the regex is trying too hard. Try: <#macro\s+(\S+)\s+/> You should have now the macro's name in group #1.

Related

Why is this regex not matching URLs?

Replace string part with regex pattern

JAVA equivalent to Javascript REGEX

How do I make this regex more general, sometimes it works and sometimes it doesn't

Linkify text with regular expressions in Java

Categories

Resources