Java regex basic usage problem - java

The following code works:
String str= "test with foo hoo";
Pattern pattern = Pattern.compile("foo");
Matcher matcher = pattern.matcher(str);
if(matcher.find()) { ... }
But this example does not:
if(Pattern.matches("foo", str)) { ... }
And neither this version:
if(str.matches("foo")) { ... }
In the real code, str is a chunk of text with multiple lines, in case the matcher treats that differently. Also, in the real code, replace will be used to replace a string of text.
Anyway, it is strange that it works in the first version but not the other two versions.
Edit
OK, I realise that the first example shows the same behaviour when if(matcher.matches()) { ... } is used instead of matcher.find(). I still cannot make it work for multiline input, but I will stick with the Pattern.compile/Pattern.matcher solution anyway.

Your last couple of examples fail because matches adds an implicit start and end anchor to your regular expression. In other words, it must be an exact match of the entire string, not a partial match.
You can work around this by using .*foo.* instead. Using Matcher.find is a more flexible solution though, so I'd recommend sticking with that.

In Java, String.matches delegates to Pattern.matches which in turn delegates to Matcher.matches, which checks if a regex matches the entire string.
From the java.util.regex.Matcher API:
Once created, a matcher can be used to perform three different kinds of match operations:
The matches method attempts to match the entire input sequence against the pattern.
The lookingAt method attempts to match the input sequence, starting at the beginning, against the pattern.
The find method scans the input sequence looking for the next subsequence that matches the pattern.
To find if a substring matches pattern, you can:
Matcher.find() the pattern within the string
Check if the entire string matches .*pattern.*
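The three match operations quoted above, and the two options in the list, can be compared side by side. This is only a minimal sketch using the string from the question; the class and method names (MatchOperations, wholeInput, prefix, anywhere) are made up for illustration.

```java
import java.util.regex.Pattern;

class MatchOperations {
    // matches(): the pattern must cover the entire input.
    static boolean wholeInput(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // lookingAt(): the pattern must match at the start, but may stop early.
    static boolean prefix(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    // find(): the pattern may match anywhere in the input.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String str = "test with foo hoo";
        System.out.println(wholeInput("foo", str));     // false
        System.out.println(prefix("foo", str));         // false
        System.out.println(prefix("test", str));        // true
        System.out.println(anywhere("foo", str));       // true
        System.out.println(wholeInput(".*foo.*", str)); // true
        System.out.println(str.matches("foo"));         // false
    }
}
```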
Related questions
On matches() matching whole string:
Why is my regex is not matching?
Java Regex Match Error
Java RegEx Pattern not matching (works in .NET)
On hitEnd() for partial matching:
How can I perform a partial match with java.util.regex.*?
Can java.util.regex.Pattern do partial matches?
On multiline vs singleline/Pattern.DOTALL mode:
string.matches(".*") returns false
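The multiline point in the last item probably explains why the .*foo.* workaround still fails on text with several lines: by default the dot does not match line terminators. A small sketch, assuming a three-line sample string invented for illustration (the class and method names are hypothetical too):

```java
import java.util.regex.Pattern;

class DotallDemo {
    // Whole-input match of .*foo.* with the given Pattern flags.
    static boolean wholeInputContainsFoo(String text, int flags) {
        return Pattern.compile(".*foo.*", flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line with foo\nthird line";

        // Without DOTALL, '.' stops at '\n', so .* cannot span the lines.
        System.out.println(wholeInputContainsFoo(text, 0));               // false

        // With DOTALL, '.' also matches line terminators.
        System.out.println(wholeInputContainsFoo(text, Pattern.DOTALL));  // true

        // find() needs neither the wrapping .* nor the flag.
        System.out.println(Pattern.compile("foo").matcher(text).find());  // true
    }
}
```

The same flag can be written inline as (?s) at the start of the pattern, which is handy with String.matches.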

Related

Why does my regex work on RegexPlanet and regex101 but not in my code?

Given the string #100=SAMPLE('Test','Test', I want to extract 100 and Test. I created the regular expression ^#(\d+)=SAMPLE\('([\w-]+)'.* for this purpose.
I tested the regex on RegexPlanet and regex101. Both tools give me the expected results, but when I try to use it in my code I don't get matches. I used the following snippet for testing the regex:
final String line = "#100=SAMPLE('Test','Test',";
final Pattern pattern = Pattern.compile("^#(\\d+)=SAMPLE\\('([\\w-]+)'.*");
final Matcher matcher = pattern.matcher(line);
System.out.println(matcher.matches());
System.out.println(matcher.find());
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
The output is
true
false
Exception in thread "main" java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:536)
at java.util.regex.Matcher.group(Matcher.java:496)
at Test.main(Test.java:15)
I used Java 8 for compiling and running the program. Why does the regex work with the online tools but not in my program?
A Matcher object allows you to query it several times, so that you can find the expression, get the groups, find the expression again, get the groups, and so on.
This means that it keeps state after each call - both for the groups that resulted from a successful match, and the position where to continue searching.
When you run two matching/finding methods consecutively, what you have is:
matches() - Matches at the beginning of the string, sets the groups.
find() - tries to find the next occurrence of the pattern after the previously matched/found occurrence, sets the groups.
But of course, in your case, the text doesn't contain two occurrences of the pattern, only one. So although matches() was successful and set proper groups, the find() then fails to find another match, and the groups are invalid (the groups are not accessible after a failed match/find).
And that's why you get the error message.
Now, if you're just playing around with this, to see the difference between matches and find, then there is nothing wrong with having both of them in the program. But you need to use reset() between them, which will cause find() not to try to continue from where matches() stopped (which will always fail if matches() succeeded). Instead, it will start scanning from the start, as if you had a fresh Matcher. And it will succeed and give you groups.
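As a rough illustration of the reset() point, here is the question's pattern run through matches() followed by find(), with and without a reset in between. The helper findAfterMatches and the class name are made up for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResetDemo {
    static final Pattern SAMPLE =
            Pattern.compile("^#(\\d+)=SAMPLE\\('([\\w-]+)'.*");

    // Calls matches() and then find() on the same Matcher,
    // optionally resetting the matcher between the two calls.
    static boolean findAfterMatches(String line, boolean resetBetween) {
        Matcher m = SAMPLE.matcher(line);
        m.matches();
        if (resetBetween) {
            m.reset(); // discard state so find() scans from the start again
        }
        return m.find();
    }

    public static void main(String[] args) {
        String line = "#100=SAMPLE('Test','Test',";
        System.out.println(findAfterMatches(line, false)); // false
        System.out.println(findAfterMatches(line, true));  // true

        Matcher m = SAMPLE.matcher(line);
        m.matches();
        m.reset();
        if (m.find()) { // only read groups after a successful match
            System.out.println(m.group(1)); // 100
            System.out.println(m.group(2)); // Test
        }
    }
}
```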
But as other answers here hinted, if you're not just trying to compare the results of matches and find, but just wanted to match your pattern and get the results, then you should choose only one of them.
matches() will try to match the entire string. For this reason, if it succeeds, running find() after it will never succeed - because it starts searching at the end of the string. If you use matches(), you don't need anchors like ^ and $ at the beginning and the end of your pattern.
find() will try to match anywhere in the string. It will start scanning from the left, but doesn't require that the actual match start there. It is also possible to use it more than once.
lookingAt() will try to match at the beginning of the string, but will not necessarily match the complete string. It's like having an ^ anchor at the beginning of your pattern.
So you choose which one of these is appropriate for you, and use it, and then you can use the groups. Always test that the match succeeded before attempting to use the groups!
As @RealSkeptic mentioned, you should remove the call to matcher.find() from your code: it advanced the matcher past the successful match before you had a chance to read the groups and print them to the console. The rest of your code remains as is:
final String line = "#100=SAMPLE('Test','Test',";
final Pattern pattern = Pattern.compile("^#(\\d+)=SAMPLE\\('([\\w-]+)'.*");
final Matcher matcher = pattern.matcher(line);
System.out.println(matcher.matches());
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
Output:
true
100
Test
Check this out; if I understood correctly, this is what you were trying to achieve:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test {
    public static void main(String[] args) {
        final String line = "#100=SAMPLE('Test','Test',";
        final Pattern pattern = Pattern.compile("^#(\\d+)=SAMPLE\\('([\\w-]+)'.*");
        final Matcher matcher = pattern.matcher(line);
        // Read the groups only if the match succeeded.
        if (matcher.find()) {
            System.out.println(matcher.group(1));
            System.out.println(matcher.group(2));
        }
    }
}

Password validation, one regex pattern at a time

For the password field, I have a TextWatcher, and in onTextChanged I run each of four regex patterns against the text, one at a time. My regex patterns are:
".{3,5}"
"(?=.*[A-Z])"
"(?=.*[a-z])"
"(?=.*\\d)"
I wrote this test code and do not understand why this would fail:
Pattern pat = Pattern.compile("(?=.*[A-Z])");
Matcher mat = pat.matcher("aB");
if (mat.matches()) {
    System.out.println("MATCHES!");
} else {
    System.out.println("DOES NOT MATCH");
}
I expected a match here, but it failed.
The other regex patterns fail in the same way.
With a look-ahead (?=condition) we can check many conditions against the entire string, because it is zero-width: once the test performed by the look-ahead is done, the regex engine's cursor is reset to where it was right before the test.
Since matches() checks whether the entire string matches the regex, and the look-ahead resets the cursor, the cursor never moves past the start of the string. The regex therefore cannot consume the entire string, and matches() returns false.
If you want to use matches() you can use regex like this
(?=.*[A-Z])(?=.*[a-z])(?=.*\\d).{3,5}
The .{3,5} part lets the regex engine iterate over 3 to 5 characters, so a string that is shorter or longer will not be accepted (because the regex wasn't able to match the entire string).
An alternative to this solution is to use find() instead of matches(). In that case you don't need look-around at all: a simple [A-Z], [a-z], or \\d with find() should be fine. We use look-around only when we want the regex to be able to iterate over the data more than once.
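Both approaches could be sketched roughly as follows. The class and method names are invented for illustration, and the length rule in the find() variant is checked with String.length() rather than a regex, which is an assumption about how the separate checks would be wired up.

```java
import java.util.regex.Pattern;

class PasswordChecks {
    // One pattern for matches(): the look-aheads test the rules,
    // then .{3,5} consumes the whole string.
    static final Pattern COMBINED =
            Pattern.compile("(?=.*[A-Z])(?=.*[a-z])(?=.*\\d).{3,5}");

    // Plain patterns for find(): no look-around needed.
    static final Pattern UPPER = Pattern.compile("[A-Z]");
    static final Pattern LOWER = Pattern.compile("[a-z]");
    static final Pattern DIGIT = Pattern.compile("\\d");

    static boolean validCombined(String s) {
        return COMBINED.matcher(s).matches();
    }

    // Each rule is tested on its own, so a caller could report
    // exactly which rule failed.
    static boolean validSeparate(String s) {
        boolean lengthOk = s.length() >= 3 && s.length() <= 5;
        return lengthOk
                && UPPER.matcher(s).find()
                && LOWER.matcher(s).find()
                && DIGIT.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(validCombined("aB1"));    // true
        System.out.println(validCombined("aB"));     // false
        System.out.println(validSeparate("aB1"));    // true
        System.out.println(validSeparate("aB1234")); // false
    }
}
```

The separate checks fit the one-pattern-at-a-time setup in the question, since each rule can drive its own piece of feedback in the UI.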

JAVA equivalent to Javascript REGEX

I'm a total beginner in Java.
In JavaScript I have this regex:
/[^0-9.,\-\ ]/gi
How can I do the same in Java?
Have a look at this: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
There's quite a lot you can do in Java with regex.
If you want to match repeatedly against that regex, you would do:
Pattern p = Pattern.compile("(?i)[^0-9.,\\- ]"); // the backslash must be doubled inside a Java string literal
Matcher m = p.matcher(targetString);
Then call the matcher methods in a loop to get the matches you want. The "i" is a case-insensitivity flag (which you don't actually need here, as the character class contains no letters). Java has no Pattern flag equivalent to "g"; in JavaScript it simply means that the pattern is applied repeatedly to the target string rather than once, and in Java you get the same effect by calling find() repeatedly on the matcher above, or by using replaceAll.
Also, the pattern above matches only one character at a time; you may in fact want [^0-9.,\- ]*, which will match 0 or more characters, greedily. I would read the docs on the Pattern class if I were you.
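The question does not say how the JavaScript regex is used, so the following is only a sketch under the assumption that it serves either to strip disallowed characters (like replace with the g flag) or to list them (like match with the g flag). The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsRegexPort {
    // Java form of /[^0-9.,\-\ ]/ : anything except digits, '.', ',', '-' and space.
    static final Pattern NOT_ALLOWED = Pattern.compile("[^0-9.,\\- ]");

    // Rough counterpart of str.replace(/.../g, ""): remove every match.
    static String stripDisallowed(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    // Rough counterpart of str.match(/.../g): collect every match with find().
    static List<String> disallowed(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = NOT_ALLOWED.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(stripDisallowed("a1,2b-3 x.5")); // 1,2-3 .5
        System.out.println(disallowed("a1,2b-3 x.5"));      // [a, b, x]
    }
}
```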

Using backreference to refer to a pattern rather than actual match

I am trying to write a regex which would match a (not necessarily repeating) sequence of text blocks, e.g.:
foo,bar,foo,bar
My initial thought was to use backreferences, something like
(foo|bar)(,\1)*
But it turns out that this regex only matches foo,foo or bar,bar but not foo,bar or bar,foo (and so on).
Is there any other way to refer to a part of a pattern?
In the real world, foo and bar are 50+ character long regexes and I simply want to avoid copy pasting them to define a sequence.
With a decent regex flavor you could use (foo|bar)(?:,(?-1))* or the like.
But Java does not support subpattern calls.
So you end up having a choice of doing String replace/format like in ajx's answer, or you could condition the comma if you know when it should be present and when not. For example:
(?:(?:foo|bar)(?:,(?!$|\s)|))+
Perhaps you could build your regex bit by bit in Java, as in:
String subRegex = "foo|bar";
String fullRegex = String.format("(%1$s)(,(%1$s))*", subRegex);
The second line could be factored out into a function. The function would take a subexpression and return a full regex that would match a comma-separated list of subexpressions.
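Factored out, that might look like the sketch below. The helper name commaSeparatedListOf is hypothetical, and non-capturing groups are used instead of the capturing ones above, on the assumption that the caller does not need the groups. If the subexpression contains its own numbered groups or backreferences, repeating it changes the numbering, so that case would need extra care.

```java
import java.util.regex.Pattern;

class ListRegex {
    // Builds a pattern for "sub(,sub)*". Wrapping subRegex in (?:...) keeps
    // an alternation such as foo|bar from leaking into the outer expression.
    static Pattern commaSeparatedListOf(String subRegex) {
        String full = String.format("(?:%1$s)(?:,(?:%1$s))*", subRegex);
        return Pattern.compile(full);
    }

    public static void main(String[] args) {
        Pattern list = commaSeparatedListOf("foo|bar");
        System.out.println(list.matcher("foo,bar,foo,bar").matches()); // true
        System.out.println(list.matcher("bar").matches());             // true
        System.out.println(list.matcher("foo,baz").matches());         // false
        System.out.println(list.matcher("foo,bar,").matches());        // false
    }
}
```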
The point of the back reference is to match the actual text that matches, not the pattern, so I'm not sure you could use that.
Could you use quantifiers instead, like this:
String s = "foo,bar,foo,bar";
String externalPattern = "(foo|bar)"; // comes from somewhere else
// one instance, then one or more ",instance" repetitions
Pattern p = Pattern.compile(externalPattern + "(," + externalPattern + ")+");
Matcher m = p.matcher(s);
boolean b = m.find();
which would match 2 or more instances of foo or bar separated by commas.

Java regular expression to match patterns and extract them

I tried writing a program in Java using regex to match a pattern and extract it. Given a string like "This is a link- #www.google.com# and this is another #google.com#" I should be able to get #www.google.com# and #google.com# strings extracted. Here is what I tried-
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseLinks {
    public static void main(String[] args) {
        String message = "This is a link- #www.google.com# and this is another #google.com#";
        Pattern p = Pattern.compile("#.*#");
        Matcher matcher = p.matcher(message);
        while (matcher.find()) {
            String result = matcher.group();
            System.out.println(result);
        }
    }
}
This results in output- #www.google.com# and this is another #google.com#. But what I wanted is only the strings #www.google.com# and #google.com# extracted. Can I please know the regex for this?
#[^#]+#
Though thinking about it, a hash sign is a bad choice for delimiting URLs, for rather obvious reasons.
The reason yours does not work is the greediness of the star (from regular-expressions.info):
[The star] repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is not matched at all.
Java's regex engine supports reluctant quantifiers, so you can use the non-greedy pattern .*? instead of the greedy .* so that it will end the capture as soon as possible instead of as late as possible.
If the language doesn't support it, then you can approximate it by simply checking for anything that's not an ending delimiter, like so:
#[^#]*#
Quantifiers in regular expressions are "greedy" by default, that is, they will match as much text as possible. In your example, the pattern "#.*#" translates to
match a "#"
match as many characters as possible such that you can still ...
... match a "#"
What you want is a "non-greedy" or "reluctant" quantifier such as "*?". Try "#.*?#" in your case.
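To compare the greedy, reluctant, and negated-class patterns on the message from the question, here is a small sketch. The helper findAll and the class name are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractLinks {
    // Collects every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            results.add(matcher.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String message =
                "This is a link- #www.google.com# and this is another #google.com#";

        // Greedy: a single match running from the first '#' to the last '#'.
        System.out.println(findAll("#.*#", message));

        // Reluctant: stops at the nearest closing '#'.
        System.out.println(findAll("#.*?#", message));   // [#www.google.com#, #google.com#]

        // Negated class: same result without relying on backtracking.
        System.out.println(findAll("#[^#]+#", message)); // [#www.google.com#, #google.com#]
    }
}
```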
