Java Regex Lookahead Conditional - java

I have a regex which works, but unfortunately not in Java, because Java does not support this type of conditional construct, (?(condition)yes|no).
I have already read about this topic, e.g. here:
Java support for conditional lookahead
Java Regex Pattern compilation error
My regex:
(?(?=\d{1,2}[.]\d{1,2}[.]\d{2,4})somerandomtextwhichisnotinthetext|^((($|EUR)? ?[-+]?(\d{1,8}[.,])*\d+([.,]\d+)?)|([-+]?(\d{1,8}[.,])*\d+([.,]\d+)? ?($|€|EUR)?))$)
I also tried a lookbehind, but the pattern it would have to match has a variable length, and this is unfortunately not supported...
The regex should match all of these patterns (a full match is needed --> matcher.group(0)):
123.342,22
123,233.22
232,11
232.2
232.2 €
but not this:
06.01.99
And it needs to be implemented in Java.
But still I have no solution...
Thanks for your help!!!

The point here is that you need to use the first part as a negative lookahead to add an exception to the other pattern:
^(?!\d{1,2}[.]\d{1,2}[.]\d{2,4}$)((($|EUR)? ?[-+]?(\d{1,8}[.,])*\d+([.,]\d+)?)|([-+]?(\d{1,8}[.,])*\d+([.,]\d+)? ?($|€|EUR)?))$
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
So, rather than requiring an exception pattern and then failing to match a fake string, it makes sense to simply use a valid match pattern and add an exception at the start.
I also see ($|€|EUR)?; you probably wanted to match a dollar symbol here, but an unescaped $ outside a character class is the end-of-string anchor. If I am right, replace it with ([$€]|EUR)?. Likewise, ($|EUR)? might need replacing with ([$€]|EUR)?.
Also, consider using non-capturing groups rather than capturing ones, since you say you are only interested in full match values.
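As a rough sketch of how the suggestions above could fit together in Java: the negative lookahead comes first, the groups are non-capturing, and the currency alternatives use [$€]. The class name AmountMatcher and the method fullMatch are made up for this illustration, and the euro sign is written as \u20AC only to sidestep source-encoding problems.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmountMatcher {
    // Negative lookahead first: reject input that looks like a date (d.m.yy up to dd.mm.yyyy).
    // The rest is the amount pattern from the answer, with non-capturing groups.
    // \u20AC is the euro sign.
    private static final Pattern AMOUNT = Pattern.compile(
        "^(?!\\d{1,2}[.]\\d{1,2}[.]\\d{2,4}$)"
        + "(?:(?:[$\u20AC]|EUR)? ?[-+]?(?:\\d{1,8}[.,])*\\d+(?:[.,]\\d+)?"
        + "|[-+]?(?:\\d{1,8}[.,])*\\d+(?:[.,]\\d+)? ?(?:[$\u20AC]|EUR)?)$");

    /** Returns the full match (group 0), or null if the input is not an amount. */
    static String fullMatch(String input) {
        Matcher m = AMOUNT.matcher(input);
        return m.matches() ? m.group(0) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "123.342,22", "123,233.22", "232,11", "232.2", "232.2 \u20AC", "06.01.99"
        };
        for (String s : samples) {
            // Every sample prints itself except "06.01.99", which prints null.
            System.out.println(s + " -> " + fullMatch(s));
        }
    }
}
```

Since matches() already requires the whole input to match, the ^ and $ anchors are redundant here; they are kept so the pattern behaves the same with find().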

Related

Regex with at least one of two options but in sequence

I am trying to write a regex (for use in a Java Pattern) that will match strings that possibly have a letter that is possibly followed by a space then number, but must have at least one of them. For example, the following strings should be matched:
"a 5"
"b 9"
" 8"
However, it should not match an empty string ("").
Furthermore, I would like to make each of the components a named capture group.
The following works, but allows the empty string.
"(?<let>\\p{Alpha})?( (?<num>\\p{Digit}))?"
To ensure that there is at least one of them, you can use lookahead (?=\\p{Alpha}| \\p{Digit}) at the beginning:
"(?=\\p{Alpha}| \\p{Digit})(?<let>\\p{Alpha})?( (?<num>\\p{Digit}))?"
In general, to avoid empty strings you can use (?=.).
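A small sketch of the lookahead variant above, assuming the whole input has to match (matches()). The class name LetterNumberMatcher and the helper describe are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterNumberMatcher {
    // The lookahead demands a letter or " digit" at the start, which rules out "".
    private static final Pattern P = Pattern.compile(
        "(?=\\p{Alpha}| \\p{Digit})(?<let>\\p{Alpha})?( (?<num>\\p{Digit}))?");

    /** Returns "let=..., num=..." for a full match, or null when the input does not match. */
    static String describe(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // A named group that took no part in the match yields null.
        return "let=" + m.group("let") + ", num=" + m.group("num");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a 5", "b 9", " 8", ""}) {
            System.out.println("\"" + s + "\" -> " + describe(s));
        }
    }
}
```

For " 8" the let group is null and num is 8; the empty string is rejected because the lookahead finds nothing to look at.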
You can use a negative lookahead to avoid empty input and keep your regex as:
^(?!$)(?<let>\p{L})?(?:\h+(?<num>\p{N}))?$
(?!$) is negative lookahead to fail the match for empty strings.
You can solve the problem with:
([a-z]? \d)|([a-z] \d?)
This covers your test cases. It is basic regular expression syntax, so working through one of the many good regex tutorials on the web is worthwhile.
You can use | for or, then simply repeat "any pattern" to match everything like this.
((?<let>[A-Za-z])|(?<num>\d)\s*)+
That lets you match any number of named patterns in any order.

Why is this regex not matching URLs?

I have the following regex:
^(?=\w+)(-\w+)(?!\.)
Which I'm attempting to match against the following text:
www-test1.examples.com
The regex should match only the -test1 part of the string, and only if it is before the first . and after the start of the expression. www can be any string, but it should not be matched.
My pattern is not matching the -test1 part. What am I missing?
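Java is one of the few regex flavors that support variable-length look-behinds (which basically means you can use quantifiers inside them), although some Java versions reject unbounded quantifiers such as + there and require an explicit upper bound like {1,50}. Where the look-behind is accepted, you can technically use the following: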
(?<=^\w+)(-\w+)
This will match -test1 without including the preceding text in the match. However, it's generally not advisable to use non-fixed-length look-behinds, as they are not perfect, nor are they very efficient, nor are they portable across other languages. Having said that, this is a simple pattern, so if you don't care about portability, sure, go for it.
The better solution though is to group what you want to capture, and reference the captured group (in this case, group 1):
^\w+(-\w+)
p.s. - \w will not match a dot, so no need to look ahead for it.
p.p.s. - to answer your question about why your original pattern ^(?=\w+)(-\w+)(?!\.) doesn't match. There are 2 reasons:
1) you start out with a start of string assertion, and then use a lookahead to see if what follows is one or more word chars. But lookaheads are zero-width assertions, meaning no characters are actually consumed in the match, so the pointer doesn't move forward to the next chars after the match. So it sees that "www" matches it, and moves on to the next part of the pattern, but the actual pointer hasn't moved past the start of string. So, it next tries to match your (-\w+) part. Well your string doesn't start with "-" so the pattern fails.
2) (?!\.) is a negative lookahead. Well your example string shows a dot as the very next thing after your "-test1" part. So even if #1 didn't fail it, this would fail it.
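To make the capture-group suggestion concrete, here is a minimal Java sketch. The names HostPartExtractor, extract and originalFinds are hypothetical, and the original pattern is included only to show that it finds nothing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostPartExtractor {
    // Consume the leading word characters, then capture "-word" in group 1.
    private static final Pattern CAPTURE = Pattern.compile("^\\w+(-\\w+)");
    // The pattern from the question: the lookahead consumes nothing, so "-" is
    // tested against the first character of the input.
    private static final Pattern ORIGINAL = Pattern.compile("^(?=\\w+)(-\\w+)(?!\\.)");

    /** Returns group 1 of the capture-group pattern, or null if there is no match. */
    static String extract(String host) {
        Matcher m = CAPTURE.matcher(host);
        return m.find() ? m.group(1) : null;
    }

    /** Reports whether the original pattern finds anything in the input. */
    static boolean originalFinds(String host) {
        return ORIGINAL.matcher(host).find();
    }

    public static void main(String[] args) {
        String host = "www-test1.examples.com";
        System.out.println(extract(host));       // -test1
        System.out.println(originalFinds(host)); // false
    }
}
```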
The problem you're having is the lookahead. In this case, it's inappropriate if you want to capture what's between the - and the first dot. The pattern you want is something like this:
(-\w+)(?=\.)
In this case, the contents of capture group 1 will contain the text you want.
Try this:
(?<=www)\-\w+(?=\.)
Demo: https://regex101.com/r/xEpno7/1

Java replaceAll to javascript regex

I want to move a user input test from Java to JavaScript. The code is supposed to remove wildcard characters from the user input string, at any position. I'm attempting to convert the following Java notation to JavaScript, but I keep getting the error
"Invalid regular expression: /(?<!\")~[\\d\\.]*|\\?|\\*/: Invalid group".
I have almost no experience with regular expressions. Any help will be much appreciated:
JAVA:
str = str.replaceAll("(?<!\")~[\\d\\.]*|\\?|\\*","");
My failing javascript version:
input = input.replace( /(?<!\")~[\\d\\.]*|\\?|\\*/g, '');
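The problem, as anubhava points out, is that JavaScript engines without ES2018 support don't accept lookbehind assertions, which is what the "Invalid group" error is complaining about. Sad but true. The lookbehind assertion in your original regex is (?<!\"). Specifically, it only lets a tilde match when it is not immediately preceded by a double quotation mark. Note also that inside a JavaScript regex literal the backslashes must not be doubled: \\d there means a literal backslash followed by d, not a digit.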
However, all is not lost. There are some tricks you can use to achieve the same result as a lookbehind. In this case, the lookbehind is there only to prevent the character prior to the tilde from being replaced as well. We can accomplish this in JavaScript by matching the character anyway, but then including it in the replacement:
input = input.replace( /([^"])~[\d.]*|\?|\*/g, '$1' );
Note that for the alternations \? and \*, there will be no groups, so $1 will evaluate to the empty string, so it doesn't hurt to include it in the replacement.
NOTE: this is not 100% equivalent to the original regular expression. In particular, lookaround assertions (like the lookbehind above) also prevent the input stream from being consumed, which can sometimes be very helpful when matching things that are right next to each other. However, in this case, I can't think of a way that that would be a problem. One visible difference is that a tilde at the very start of the input has no preceding character for [^"] to match, so it is left in place; using (^|[^"]) instead covers that case. To make a completely equivalent regex would be more difficult, but I believe this meets the need of the original regex.
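For reference when porting, this is a short sketch of what the original Java expression does to a sample input. The class name WildcardStripper and the sample string are made up; only the regex comes from the question.

```java
class WildcardStripper {
    /** Removes ~ (plus trailing digits/dots), ? and *, unless the ~ directly follows a double quote. */
    static String strip(String input) {
        return input.replaceAll("(?<!\")~[\\d\\.]*|\\?|\\*", "");
    }

    public static void main(String[] args) {
        // "~0.8", "?" and "*" are removed; the "~2" after the closing quote is kept.
        System.out.println(strip("foo~0.8 b?r* \"x y\"~2")); // foo br "x y"~2
    }
}
```

Any replacement written for JavaScript can be compared against this behaviour on the same inputs.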

Java Regex Negative Lookahead

After looking through the answers that are already on StackOverflow regarding this issue, I settled with the most accurate one I could find:
Java regex: Negative lookahead
I went over to gskinner and tested it. I put /foo/(?!.*\\bbar\\b).+ in the pattern input box and the following in the regex match text area:
/foo/abc123doremi
/foo/abc123doremi/bar/def456fasola
Gskinner recognised both of these as matches though so clearly either Gskinner is wrong or the regex pattern above isn't correct. Any thoughts?
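Your pattern looks for \\bbar\\b, that is, a literal backslash followed by bbar, another literal backslash and b, while your text contains /bar/. The lookahead therefore never finds anything to reject.
What you meant is probably \bbar\b (i.e. /foo/(?!.*\bbar\b).+)
Note that doubling the \ is only required inside Java String literals. That makes writing regexes in Java a bit of a pain.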
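The difference between the two levels of escaping can be checked directly in Java. This is only an illustrative sketch; the class name NegativeLookaheadEscaping and the constant names are invented.

```java
import java.util.regex.Pattern;

class NegativeLookaheadEscaping {
    // The Java literal "\\b" becomes the regex \b (word boundary).
    static final Pattern WORD_BOUNDARY = Pattern.compile("/foo/(?!.*\\bbar\\b).+");
    // The Java literal "\\\\b" becomes the regex \\b: a literal backslash followed by b.
    // This is what a regex tester sees when the Java-escaped text is pasted in unchanged.
    static final Pattern LITERAL_BACKSLASH = Pattern.compile("/foo/(?!.*\\\\bbar\\\\b).+");

    public static void main(String[] args) {
        String plain = "/foo/abc123doremi";
        String withBar = "/foo/abc123doremi/bar/def456fasola";
        System.out.println(WORD_BOUNDARY.matcher(plain).matches());       // true
        System.out.println(WORD_BOUNDARY.matcher(withBar).matches());     // false
        System.out.println(LITERAL_BACKSLASH.matcher(plain).matches());   // true
        System.out.println(LITERAL_BACKSLASH.matcher(withBar).matches()); // true
    }
}
```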

How do I make this regex more general, sometimes it works and sometimes it doesn't

I have the following regex that I am using in a java application. Sometimes it works correctly and sometimes it doesn't.
<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->
Sometimes I will have whitespace before/after it, sometimes there will be text. The same goes for the region within the tags.
The main problem is that name=(\".*\")?> sometimes matches more than it is supposed to. I am not sure if that is something that is obvious to solve, simply looking at this code.
XML is not a regular language, nor is HTML or any other language with "nesting" constructs. Don't try to parse it with regular expressions.
Choose an XML parser.
As others have pointed out, the greedy .* (dot-star) that matches the "name" attribute needs to be made non-greedy (.*?) or even better, replaced with a negated character class ([^"]*) so it can't match beyond the closing quotation mark no matter what happens in the rest of the regex. Once you've fixed that, you'll probably find you have the same problem with the other dot-star; you need to make it non-greedy too.
Pattern p = Pattern.compile(
"<!--\\s*<editable\\s+name=\"([^\"]*)\">\\s*-->" +
"(.*?)" +
"<!--\\s*</editable>\\s*-->",
Pattern.DOTALL);
I don't get the significance of your remarks about whitespace. If it's linefeeds and/or carriage returns you're talking about, the DOTALL modifier lets the dot match those--and of course, \s matches them as well.
I wrote this in the form of a Java string literal to avoid confusion about where you need backslashes and how many of them you need. In a "raw" regex, there would be only one backslash in each of the whitespace shorthands (\s*), and the quotation marks wouldn't need to be escaped ("[^"]*").
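A possible way to put the pattern above to work, sketched with a hypothetical class EditableRegionFinder and an invented sample document; the regex itself is the one from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EditableRegionFinder {
    private static final Pattern REGION = Pattern.compile(
        "<!--\\s*<editable\\s+name=\"([^\"]*)\">\\s*-->"
        + "(.*?)"
        + "<!--\\s*</editable>\\s*-->",
        Pattern.DOTALL);

    /** Returns "name=body" for every editable region found in the text. */
    static List<String> regions(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = REGION.matcher(text);
        while (m.find()) {
            // Group 1 is the name attribute, group 2 the region body.
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String doc = "intro <!-- <editable name=\"title\"> -->Hello\nWorld<!-- </editable> -->"
            + " middle <!--<editable name=\"body\">-->Text<!-- </editable> --> end";
        // Finds two regions: title (body spans two lines) and body.
        for (String region : regions(doc)) {
            System.out.println(region);
        }
    }
}
```

Because both the attribute and the body are matched without greedy dot-star, the second region is not swallowed by the first.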
I would replace that .* with [\w-]* for example if name is an identifier of some sort.
Or [^\"]* so it doesn't capture the end double quote.
Edit:
As mentioned in another post, you might consider a simple DOM traversal, or an XPath- or XQuery-based evaluation, instead of a plain regular expression. But note that you will still need a regex in the filtering step, because you can find the target comments only by testing their body against a regular expression (I doubt the body is constant, judging from the sample).
Edit 2:
It might be that leading, trailing or internal whitespace in the comment body makes your regexp fail. Consider putting \s* at the beginning and at the end, plus \s+ before the attribute-like thing.
<!--\s*<editable\s+name=(\"[^\"]*\")?>\s*-->(.*)<!--\s*</editable>\s*-->
Or when you are filtering on XML based search:
"\\s*<editable\\s+name=(\"[^\"]*\")?>\\s*"
"\\s*</editable>\\s*"
Edit 3: Fixed the escapes twice. Thanks Alan M.
The * quantifier is "greedy" by default, meaning it matches as much as possible while still allowing the overall pattern to match.
You can disable this by using *?, so try:
(\".*?\")
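The effect is easy to see on a tag with two quoted attributes. A minimal sketch, with a made-up class GreedyVersusLazy and sample tag:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVersusLazy {
    /** Returns the first match of the regex in the input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String tag = "<editable name=\"a\" class=\"b\">";
        System.out.println(firstMatch("\".*\"", tag));     // "a" class="b"
        System.out.println(firstMatch("\".*?\"", tag));    // "a"
        System.out.println(firstMatch("\"[^\"]*\"", tag)); // "a"
    }
}
```

The greedy version runs to the last quotation mark on the line, while the lazy version and the negated character class both stop at the first closing one.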
