Java Regex Negative Lookahead

After looking through the answers already on StackOverflow regarding this issue, I settled on the most accurate one I could find:
Java regex: Negative lookahead
I went over to gskinner and tested it. I put /foo/(?!.*\\bbar\\b).+ in the pattern input box and the following in the regex match text area:
/foo/abc123doremi
/foo/abc123doremi/bar/def456fasola
Gskinner recognised both of these as matches, though, so clearly either Gskinner is wrong or the regex pattern above isn't correct. Any thoughts?

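You are looking for \\bbar\\b (as a raw regex: a literal backslash, then "bbar", then another literal backslash and "b") while your text contains /bar/, so the lookahead never finds anything to reject and both lines match.
What you meant is probably \bbar\b (i.e. /foo/(?!.*\bbar\b).+)
Note that doubling the backslash is only required inside Java String literals. That makes writing regexes in Java a bit of a pain.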
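As a minimal sketch of the same pattern inside Java source (the class and method names are made up for illustration), the backslashes are doubled only because the regex sits in a String literal:

```java
import java.util.regex.Pattern;

class NegativeLookaheadDemo {
    // Raw regex: /foo/(?!.*\bbar\b).+
    // Each backslash is doubled only because this is a Java string literal.
    static final Pattern NO_BAR = Pattern.compile("/foo/(?!.*\\bbar\\b).+");

    // True when the whole path matches, i.e. no standalone "bar" follows /foo/.
    static boolean accepts(String path) {
        return NO_BAR.matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("/foo/abc123doremi"));                   // true
        System.out.println(accepts("/foo/abc123doremi/bar/def456fasola")); // false
    }
}
```

In an online tester the single-backslash form is what goes into the pattern box.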

Related

Java Regex Lookahead Conditional

I have a regex which works, but unfortunately not in Java because Java does not support this kind of conditional construct.
I have already read about this topic, e.g. here:
Java support for conditional lookahead
Java Regex Pattern compilation error
My regex:
(?(?=\d{1,2}[.]\d{1,2}[.]\d{2,4})somerandomtextwhichisnotinthetext|^((($|EUR)? ?[-+]?(\d{1,8}[.,])*\d+([.,]\d+)?)|([-+]?(\d{1,8}[.,])*\d+([.,]\d+)? ?($|€|EUR)?))$)
I also tried a lookbehind, but the pattern to be matched has a variable length, and this is unfortunately not supported...
The regex should match all of the following (a full match is needed --> matcher.group(0) ):
123.342,22
123,233.22
232,11
232.2
232.2 €
but not this:
06.01.99
And it needs to be implemented in Java.
But still I have no solution...
Thanks for your help!!!

The point here is that you need to use the first part as a negative lookahead to add an exception to the other pattern:
^(?!\d{1,2}[.]\d{1,2}[.]\d{2,4}$)((($|EUR)? ?[-+]?(\d{1,8}[.,])*\d+([.,]\d+)?)|([-+]?(\d{1,8}[.,])*\d+([.,]\d+)? ?($|€|EUR)?))$
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
See the regex demo
So, rather than requiring an exception pattern and then failing to match a fake string, it makes sense to simply use a valid match pattern and add an exception at the start.
I also see ($|€|EUR)?; you probably wanted to match a dollar symbol here, but an unescaped $ is an end-of-string anchor. If I am right, replace it with ([$€]|EUR)?. Likewise, ($|EUR)? might need replacing with ([$€]|EUR)?.
Also, consider using non-capturing groups rather than capturing ones, since you say you are only interested in full match values.
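Putting these suggestions together, here is a hedged sketch of the pattern with the date exception at the start, non-capturing groups, and [$€] for the currency symbols. The class and method names are hypothetical, and \u20AC is simply the euro sign written as an escape:

```java
import java.util.regex.Pattern;

class AmountMatcher {
    // The negative lookahead rejects date-like input (e.g. 06.01.99) up front;
    // the two alternatives accept a number with an optional currency
    // before or after it. \u20AC is the euro sign.
    static final Pattern AMOUNT = Pattern.compile(
        "^(?!\\d{1,2}[.]\\d{1,2}[.]\\d{2,4}$)"
        + "(?:(?:[$\u20AC]|EUR)? ?[-+]?(?:\\d{1,8}[.,])*\\d+(?:[.,]\\d+)?"
        + "|[-+]?(?:\\d{1,8}[.,])*\\d+(?:[.,]\\d+)? ?(?:[$\u20AC]|EUR)?)$");

    static boolean isAmount(String s) {
        return AMOUNT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"123.342,22", "123,233.22", "232,11", "232.2", "232.2 \u20AC", "06.01.99"};
        for (String s : samples) {
            System.out.println(s + " -> " + isAmount(s));
        }
    }
}
```

Only the last sample, the date, should be rejected.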

Java replaceAll to javascript regex

I want to move some user input tests from Java to JavaScript. The code is supposed to remove wildcard characters from the user input string, at any position. I'm attempting to convert the following Java notation to JavaScript, but keep getting the error
"Invalid regular expression: /(?<!\")~[\\d\\.]*|\\?|\\*/: Invalid group".
I have almost no experience with regular expressions. Any help will be much appreciated:
JAVA:
str = str.replaceAll("(?<!\")~[\\d\\.]*|\\?|\\*","");
My failing JavaScript version:
input = input.replace( /(?<!\")~[\\d\\.]*|\\?|\\*/g, '');

The problem, as anubhava points out, is that JavaScript engines without ES2018 support don't accept lookbehind assertions. Sad but true. The lookbehind assertion in your original regex is (?<!\"). Specifically, it asserts that the tilde is not immediately preceded by a double quotation mark.
However, all is not lost. There are some tricks you can use to achieve the same result as a lookbehind. In this case, the lookbehind is there only to prevent the character prior to the tilde from being replaced as well. We can accomplish this in JavaScript by matching the character anyway, but then including it in the replacement:
input = input.replace( /([^"])~[\d.]*|\?|\*/g, '$1' );
Note that for the alternations \? and \*, there will be no groups, so $1 will evaluate to the empty string, so it doesn't hurt to include it in the replacement.
NOTE: this is not 100% equivalent to the original regular expression. In particular, lookaround assertions (like the lookbehind above) also prevent the input stream from being consumed, which can sometimes be very helpful when matching things that are right next to each other. One concrete difference is that [^"] needs an actual character to match, so a tilde at the very start of the input is left in place. To make a completely equivalent regex would be more difficult, but I believe this meets the need of the original regex.
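Since the original pattern is Java, the two approaches can be compared side by side there. This is only a sketch with made-up class and method names: one method uses the lookbehind from the question, the other consumes the preceding character and restores it through $1.

```java
class TildeStripper {
    // Pattern from the question: the lookbehind only asserts that the tilde
    // is not preceded by a double quote; it consumes nothing.
    static String withLookbehind(String s) {
        return s.replaceAll("(?<!\")~[\\d\\.]*|\\?|\\*", "");
    }

    // Workaround: the preceding character is consumed by group 1 and put back
    // by $1. For the \? and \* alternatives group 1 is unset and adds nothing.
    static String withCapture(String s) {
        return s.replaceAll("([^\"])~[\\d.]*|\\?|\\*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(withLookbehind("foo~0.5 ba?r*")); // foo bar
        System.out.println(withCapture("foo~0.5 ba?r*"));    // foo bar
        // A leading tilde has no preceding character for [^"] to consume:
        System.out.println(withLookbehind("~2 x"));          // " x"
        System.out.println(withCapture("~2 x"));             // "~2 x"
    }
}
```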

Java - Regex to Split Tokens With Minimum Size and Delimiters

I know, I know, there are many similar questions, and I can say I read all of them. But I am not good at regex and I couldn't figure out the regular expression that I need.
I want to split a String in Java, and I have 4 constraints:
The delimiters are [.?!] (end of the sentence)
Decimal numbers shouldn't be tokenized
The delimiters shouldn't be removed.
The minimum size of each token should be 5
For example, for input:
"Hello World! This answer worth $1.45 in U.S. dollar. Thank you."
The output will be:
[Hello World!, This answer worth $1.45 in U.S. dollar., Thank you.]
Up to now I got the answer for three first constraints by this regex:
text.split("(?<=[.!?])(?<!\\d)(?!\\d)");
And I know I should use {5,} somewhere in my regex, but any combination that I tried doesn't work.
For cases like: "I love U.S. How about you?" it doesn't matter if it gives me one or two sentences, as long as it doesn't tokenize S. as a separate sentence.
Finally, a pointer to a good regex tutorial would be appreciated.
UPDATE: As Chris mentioned in the comments, it is almost impossible to solve questions like this (covering all the cases that occur in natural languages) with regex. However, I found HamZa's answer the closest, and the most useful one.
So, be careful! The accepted answer will not cover all possible use cases!

I'm basing my answer on a previously made regex.
The regex was basically (?<=[.?!])\s+(?=[a-z]) which means match any whitespace one or more times preceded with either ., ? or ! and followed by [a-z] (not forgetting the i modifier).
Now let's modify it to the needs of this question:
We'll first convert it to a JAVA regex: (?<=[.?!])\\s+(?=[a-z])
We'll add the i modifier to match case insensitive (?i)(?<=[.?!])\\s+(?=[a-z])
We'll put the expression in a positive lookahead to prevent the "eating" of the characters (delimiters in this case) : (?=(?i)(?<=[.?!])\\s+(?=[a-z]))
We'll add a negative lookbehind to check if there is no abbreviation in the format LETTER DOT LETTER DOT : (?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])
So our final regex looks like : (?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z]).
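A small sketch of how the final regex might be used with Pattern.split (the class and method names are invented for illustration). It does not enforce the minimum token length of 5 from the question; it only covers the delimiter, decimal-number and abbreviation handling:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SentenceSplitter {
    // Raw regex: (?i)(?<=[.?!])(?<![a-z]\.[a-z]\.)\s+(?=[a-z])
    // The whitespace between sentences is consumed; the delimiters are kept
    // because they are only looked at, never matched.
    static final Pattern BOUNDARY =
        Pattern.compile("(?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])");

    static String[] split(String text) {
        return BOUNDARY.split(text);
    }

    public static void main(String[] args) {
        String input = "Hello World! This answer worth $1.45 in U.S. dollar. Thank you.";
        System.out.println(Arrays.toString(split(input)));
        // prints "[Hello World!, This answer worth $1.45 in U.S. dollar., Thank you.]"
    }
}
```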
Some links:
Online tester, jump to JAVA
Explain tool (Not JAVA based)
THE regex tutorial
Java specific regex tutorial
SO regex chatroom
Some advanced nice regex-fu on SO
How does this regex find triangular numbers?
How can we match a^n b^n?
How does this Java regex detect palindromes?
How to determine if a number is a prime with regex?
"vertical" regex matching in an ASCII "image"
Can the for loop be eliminated from this piece of PHP code? ^-- See regex solution, although not sure if applicable in JAVA

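What about the following regular expression?
(?<=[.!?])(?!\w{1,5})(?<!\d)(?!\d)
e.g.
import java.util.regex.Pattern;
public class SplitDemo {
    // Zero-width split point: after ., ! or ?, and not directly before a word character or digit.
    private static final Pattern REGEX_PATTERN =
            Pattern.compile("(?<=[.!?])(?!\\w{1,5})(?<!\\d)(?!\\d)");
    public static void main(String[] args) {
        String input = "Hello World! This answer worth $1.45 in U.S. dollar. Thank you.";
        System.out.println(java.util.Arrays.toString(
                REGEX_PATTERN.split(input)
        )); // prints "[Hello World!,  This answer worth $1.45 in U.S.,  dollar.,  Thank you.]"
        // Nothing is consumed by the split, so every token after the first keeps its leading space.
    }
}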

How do I make this regex more general, sometimes it works and sometimes it doesn't

I have the following regex that I am using in a java application. Sometimes it works correctly and sometimes it doesn't.
<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->
Sometimes I will have whitespace before/after it, sometimes there will be text. The same goes for the region within the tags.
The main problem is that name=(\".*\")?> sometimes matches more than it is supposed to. I am not sure if that is something that is obvious to solve, simply looking at this code.

XML is not a regular language, nor is HTML or any other language with "nesting" constructs. Don't try to parse it with regular expressions.
Choose an XML parser.

As others have pointed out, the greedy .* (dot-star) that matches the "name" attribute needs to be made non-greedy (.*?) or even better, replaced with a negated character class ([^"]*) so it can't match beyond the closing quotation mark no matter what happens in the rest of the regex. Once you've fixed that, you'll probably find you have the same problem with the other dot-star; you need to make it non-greedy too.
Pattern p = Pattern.compile(
"<!--\\s*<editable\\s+name=\"([^\"]*)\">\\s*-->" +
"(.*?)" +
"<!--\\s*</editable>\\s*-->",
Pattern.DOTALL);
I don't get the significance of your remarks about whitespace. If it's linefeeds and/or carriage returns you're talking about, the DOTALL modifier lets the dot match those--and of course, \s matches them as well.
I wrote this in the form of a Java string literal to avoid confusion about where you need backslashes and how many of them you need. In a "raw" regex, there would be only one backslash in each of the whitespace shorthands (\s*), and the quotation marks wouldn't need to be escaped ("[^"]*").
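To show the pattern in use, here is a minimal sketch that collects every editable region from a string. The class name, the find method and the sample markup are assumptions made for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EditableRegions {
    static final Pattern REGION = Pattern.compile(
        "<!--\\s*<editable\\s+name=\"([^\"]*)\">\\s*-->"
        + "(.*?)"
        + "<!--\\s*</editable>\\s*-->",
        Pattern.DOTALL);

    // Returns "name=body" for every editable region, in document order.
    static List<String> find(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = REGION.matcher(html);
        while (m.find()) {
            out.add(m.group(1) + "=" + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<!-- <editable name=\"title\"> -->Hello<!-- </editable> -->\n"
                    + "<p>static</p>\n"
                    + "<!--<editable name=\"body\">-->line 1\nline 2<!--</editable>-->";
        System.out.println(find(html));
    }
}
```

Because both quantifiers are bounded (the negated class and the reluctant .*?), two regions in the same input come back as two separate matches instead of one long one.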

I would replace that .* with [\w-]* for example if name is an identifier of some sort.
Or [^\"]* so it doesn't capture the end double quote.
Edit:
As mentioned in another post, you might consider a simple DOM traversal or an XPath or XQuery based evaluation instead of a plain regular expression. But note that you will still need a regex in the filtering process, because you can find the target comments only by testing their body against a regular expression (I doubt the body is constant, judging from the sample).
Edit 2:
It might be that the leading, trailing or internal whitespaces of the comment body makes your regexp fail. Consider putting \s* in the beginning and at the end, plus \s+ before the attribute-like thing.
<!--\s*<editable\s+name=(\"[^\"]*\")?>\s*-->(.*)<!--\s*</editable>\s*-->
Or when you are filtering on XML based search:
"\\s*<editable\\s+name=(\"[^\"]*\")?>\\s*"
"\\s*</editable>\\s*"
Edit 3: Fixed the escapes twice. Thanks Alan M.

The * quantifier is "greedy" by default, meaning it matches as much as possible while still allowing the overall pattern to match.
You can change this by using the reluctant form *?, so try:
(\".*?\")
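The difference is easy to see on a tag with two quoted attributes. The following is an illustrative sketch only; the helper name and the sample string are invented:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String tag = "name=\"a\" id=\"b\"";
        System.out.println(firstMatch("\".*\"", tag));     // "a" id="b"  (greedy runs to the last quote)
        System.out.println(firstMatch("\".*?\"", tag));    // "a"         (reluctant stops at the first)
        System.out.println(firstMatch("\"[^\"]*\"", tag)); // "a"         (negated class cannot cross a quote)
    }
}
```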

regex for that excludes matches within quotes

I'm working on a pretty big refactoring project and I'm using IntelliJ's find/replace with regexp to help me out.
This is the regexp I'm using:
\b(?<!\.)Units(?![_\w(.])\b
I find that most matches that are not useful for my purpose are the matches that occur with strings within quotes, for example: "units"
I'd like to find a way to have the above expression not match when it finds a matching string that's between quotes...
Thx in advance, this place rocks!

Assuming the quotes are always paired on a given line, you could create matches before and after for an even number of quotes, and make sure the whole line is matched:
^([^"]*("[^"]*")*[^"]*)*\b(?<!\.)Units(?![_\w(.])\b([^"]*("[^"]*")*[^"]*)*$
This works because the fragment
([^"]*("[^"]*")*[^"]*)*
will only match paired quotes. By adding the begin and end line anchors, it forces the quotes on the left and right side of your regex to be an even count.
This won't handle embedded escaped quotes properly, and multiline quoted strings will be trouble.

IntelliJ uses Java regexes, doesn't it? Try this:
(?m)(?<![\w.])Units(?![\w(.])(?=(?:[^\r\n"\\]++|\\.)*+[^\r\n"\\]*+$)
The first part is your regex after a little cosmetic surgery:
(?<![\w.])Units(?![\w(.])
The \b at the beginning and end were effectively the same as a negative lookbehind and a negative lookahead (respectively) for \w, so I folded them into your existing lookarounds. The new lookahead matches the rest of the line if it contains an even number (including zero) of unescaped quotation marks:
(?=(?:[^\r\n"\\]++|\\.)*+[^\r\n"\\]*+$)
That handles pathological cases like the one Welbog pointed out, and unlike Michael's regex it will find multiple occurrences of the text on the same line. But it doesn't take comments into account. Is IntelliJ's find/replace feature intelligent enough to disregard text in comments? Come to think of it, doesn't it have some kind of refactoring support built in?
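For illustration, here is one way the "even number of unescaped quotes in the rest of the line" idea could be written in Java. This is an assumed variant that lists a complete "..." string as an explicit alternative inside the lookahead; it is not the exact pattern quoted above, and the class and method names are hypothetical. It also assumes quotes are balanced on each line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnquotedUnits {
    // Lookahead: the rest of the line may contain only non-quote text, escaped
    // characters, and complete "..." strings, i.e. an even number of unescaped
    // quotes. Assumed variant for illustration, not the answer's own regex.
    static final Pattern UNITS = Pattern.compile(
        "(?m)(?<![\\w.])Units(?![\\w(.])"
        + "(?=(?:[^\\r\\n\"\\\\]++|\\\\.|\"(?:[^\\r\\n\"\\\\]++|\\\\.)*+\")*+$)");

    // Replaces every occurrence of Units that is not inside a string literal.
    static String rename(String source, String replacement) {
        return UNITS.matcher(source).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(rename("int Units = parse(\"Units\") + Units;", "Qty"));
        // prints: int Qty = parse("Units") + Qty;
    }
}
```

Like the answer's pattern, this sketch knows nothing about comments.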
