Regex check to see if a String contains non digit fails - java

Why does this fail?
String n = "h107";
if (n.matches("\\D+")) {
System.out.println("non digit in it");
}
I slept on it, and I still don't get it.
I got a solution now:
if (n.matches(".*\\D+.*")) {
But, maybe through my own lack of knowledge, I think the first one should also match. If it has to match the complete String anyway, then what's the point of the '^' character for a line beginning?

That is the recurring problem with .matches(): it is misnamed. It does NOT do what most people expect of regex matching, which is to look for a match anywhere in the input. Other languages have fallen prey to similar misnaming (Python's re.match, which only matches at the start of the input, is one example).
The problem is that it tries to match your whole input.
Use a Pattern, a Matcher and .find() instead (.find() does real regex matching, i.e. it finds text that matches anywhere in the input):
private static final Pattern NONDIGIT = Pattern.compile("\\D");
// in code
if (NONDIGIT.matcher(n).find())
// there is a non digit
You should in fact use a Pattern; String's .matches() will recompile a pattern each time. With a Pattern it is only compiled once.
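As a minimal sketch of the difference (the class name NonDigitCheck and the helper containsNonDigit are made up for illustration, not part of any library), the following contrasts String.matches() with Matcher.find() on the input from the question:

```java
import java.util.regex.Pattern;

public class NonDigitCheck {
    // Compiled once and reused; "\\D" matches a single non-digit character.
    private static final Pattern NON_DIGIT = Pattern.compile("\\D");

    // Hypothetical helper: true if any character in the input is not a digit.
    public static boolean containsNonDigit(String input) {
        return NON_DIGIT.matcher(input).find();
    }

    public static void main(String[] args) {
        String n = "h107";
        // matches() needs the whole string to consist of non-digits.
        System.out.println(n.matches("\\D+"));
        // find() looks for a match anywhere in the input.
        System.out.println(containsNonDigit(n));
        // A digits-only string has no non-digit to find.
        System.out.println(containsNonDigit("107"));
    }
}
```

The first call reports false because the digits in "h107" stop \D+ from covering the whole string, while find() only needs the single "h".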

String.matches returns true only if the entire string matches the pattern. Simply change your regular expression to \d+, which matches only if the entire string consists of digits, and negate the result:
String n = "h107";
if (!n.matches("\\d+")) {
System.out.println("non digit in it");
}
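One edge case of the negated test is worth checking, sketched here under the assumption that you wrap the test in a helper (DigitsOnlyCheck and isAllDigits are illustrative names): because \d+ needs at least one digit, an empty string is also reported as "not all digits".

```java
public class DigitsOnlyCheck {
    // Hypothetical helper: true only if the whole string is one or more digits.
    public static boolean isAllDigits(String s) {
        return s.matches("\\d+");
    }

    public static void main(String[] args) {
        // Each line prints whether the negated test would fire for the input.
        System.out.println(!isAllDigits("h107"));
        System.out.println(!isAllDigits("107"));
        System.out.println(!isAllDigits(""));
    }
}
```

If an empty string should not count as containing a non-digit, \d* in place of \d+ would be one way to express that.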

Related

Java String.Matches(); [duplicate]

if("test%$#*)$(%".matches("[^a-zA-Z\\.]"))
System.exit(0);
if("te/st.txt".matches("[^a-zA-Z\\.]"))
System.exit(0);
The program isn't exiting even though the regexes should be returning true. What's wrong with the code?
matches returns true only if regex matches entire string.
In your case your regex represents only one character that is not a-z, A-Z or '.'.
I suspect that you want to check if string contains one of these special characters which you described in regex. In that case surround your regex with .* to let regex match entire string. Oh, and you don't have to escape . inside character class [.].
if ("test%$#*)$(%".matches(".*[^a-zA-Z.].*")) {
//string contains a character that is not in range a-z, A-Z, or '.'
}
But if you care about performance, you can use the Matcher#find() method, which has two advantages.
It can return true the moment it finds a substring that matches the regex, so the application does not need to check the rest of the text; the longer the remaining text, the more time this saves.
It does not force us to build a new Pattern object on every call the way String#matches(regex) does, because we can create the Pattern once and reuse it with different data.
Demo:
Pattern p = Pattern.compile("[^a-zA-Z\\.]");
Matcher m = p.matcher("test%$#*)$(%");
if(m.find())
System.exit(0);
//OR with Matcher inlined since we don't really need that variable
if (p.matcher("test%$#*)$(%").find())
System.exit(0);
x.matches(y) is equivalent to
Pattern.compile(y).matcher(x).matches()
and requires the whole string x to match the regex y. If you just want to know if there is some substring of x that matches y then you need to use find() instead of matches():
if(Pattern.compile("[^a-zA-Z.]").matcher("test%$#*)$(%").find())
System.exit(0);
Alternatively you could reverse the sense of the test:
if(!"test%$#*)$(%".matches("[a-zA-Z.]*"))
by providing a pattern that matches the strings that are allowed rather than the characters that aren't, and then seeing whether the test string fails to match this pattern.
You always get false because the matches() method returns true only when the pattern matches the full string.
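The three approaches suggested in these answers can be put side by side. This is only a sketch: the class DisallowedCharCheck and its method names are invented for the comparison, and the third input is an extra case added to show a string that passes.

```java
import java.util.regex.Pattern;

public class DisallowedCharCheck {
    // One character that is neither a letter nor a dot; compiled once.
    private static final Pattern DISALLOWED = Pattern.compile("[^a-zA-Z.]");

    // find(): succeeds as soon as one disallowed character is seen.
    public static boolean viaFind(String s) {
        return DISALLOWED.matcher(s).find();
    }

    // matches() with the pattern wrapped in .* on both sides.
    public static boolean viaWrappedMatches(String s) {
        return s.matches(".*[^a-zA-Z.].*");
    }

    // Reversed test: the string fails to consist only of allowed characters.
    public static boolean viaAllowedOnly(String s) {
        return !s.matches("[a-zA-Z.]*");
    }

    public static void main(String[] args) {
        String[] inputs = {"test%$#*)$(%", "te/st.txt", "test.txt"};
        for (String s : inputs) {
            System.out.println(s + " -> " + viaFind(s) + " "
                    + viaWrappedMatches(s) + " " + viaAllowedOnly(s));
        }
    }
}
```

For these single-line inputs all three agree; the find() version is the one that reuses a compiled Pattern.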

Regex in Java: Capturing string up to a certain group of words

I'm trying to capture a string up until a certain word that is within some group of words.
I only want to capture the string up until the FIRST instance of one of these words, as they may appear many times in the string.
For example:
Group of words: (was, in, for)
String = "Once upon a time there was a fox in a hole";
would return "Once upon a time there"
Thank you
What you need is called a Lookahead. The exact regex for your situation is:
/^.+?(?=(?:was)|(?:in)|(?:for))/
Here, the ^ matches the beginning of the string, .+? is a lazy match (it will match the shortest possible string), (?= ... ) means "followed by" and (?: ... ) is a non-capturing group, which may or may not be necessary for you.
For bonus points, you should probably be using word boundaries to make sure you're matching the whole word, instead of a substring ("The fox inside" would return "The fox "), and a leading space in the lookahead to kill the trailing space in the match:
/^.+?(?=\s*\b(?:was)|(?:in)|(?:for)\b)/
Where \s* matches any amount of white space (including none at all) and \b matches the beginning or end of a word. It's a Zero-Width assertion, meaning it doesn't match an actual character.
Or, in Java:
Pattern p = Pattern.compile("^.+?(?=\\s*\\b(?:was)|(?:in)|(?:for)\\b)");
I think that will work. I haven't used it, but according to the documentation, that exact string should work. Just had to escape all the backslashes.
Edit
Here I am, more than a year later, and I just realized the regex above does not do what I thought it did at the time. Alternation has the lowest precedence, rather than the highest, so the old pattern splits into the three alternatives \s*\b(?:was), (?:in) and (?:for)\b. The pattern is more correctly:
/^.+?(?=\s*\b(?:was|in|for)\b)/
Additionally, future travelers, you may wish to capture the whole string if no such breaker word exists. Try THIS on for size:
/^(?:(?!\s*\b(?:was|in|for)\b).)*/
This one uses a NEGATIVE lookahead (which asserts that the pattern does not match at the current position). It's possibly slower, but it still does the job.
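A rough Java sketch of the grouped lookahead and the negative-lookahead variant follows. The class PrefixBeforeWord and its two methods are hypothetical names; returning null when nothing matches is a choice made for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrefixBeforeWord {
    // Lazy prefix that must be followed by optional whitespace and a whole stop word.
    private static final Pattern LAZY =
            Pattern.compile("^.+?(?=\\s*\\b(?:was|in|for)\\b)");
    // Consume one character at a time while no stop word starts at that position.
    private static final Pattern NEGATIVE =
            Pattern.compile("^(?:(?!\\s*\\b(?:was|in|for)\\b).)*");

    // Text before the first stop word, or null if there is no stop word.
    public static String beforeStopWord(String s) {
        Matcher m = LAZY.matcher(s);
        return m.find() ? m.group() : null;
    }

    // Text before the first stop word, or the whole line if there is none.
    public static String beforeStopWordOrAll(String s) {
        Matcher m = NEGATIVE.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "Once upon a time there was a fox in a hole";
        System.out.println("[" + beforeStopWord(s) + "]");
        System.out.println("[" + beforeStopWordOrAll(s) + "]");
        System.out.println("[" + beforeStopWord("no stop words here") + "]");
        System.out.println("[" + beforeStopWordOrAll("no stop words here") + "]");
    }
}
```

Both patterns use ".", which does not match line terminators by default, so this sketch assumes single-line input.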
You can use this code to capture the string before a terminating word:
Pattern p = Pattern.compile("^(.*?)((\\b(was|in|for)\\b)|$)");
String s = "Once upon a time there was a fox in a hole";
Matcher m = p.matcher(s);
if (m.find()) {
System.out.println(m.group(1));
}
This code produces the following output:
Once upon a time there
Here is how this expression works: the (\\b(was|in|for)\\b) means "any of the words listed in the inner parentheses, when they appear on word boundaries". The expression just outside allows for $ in order to capture something even if none of the terminating words appear in the source.
A very simple way to handle this is to just split the string with a regex and keep the first thing returned:
String str = "Once upon a time there was a fox in a hole";
String match = str.split("(was|in|for)")[0];
// match = "Once upon a time there "
In this example, match will either contain the first part of the string before the first matched word or, in the case of a string where the word wasn't found it will contain the entire string.
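The split approach can be tightened in the same spirit as the other answers. The sketch below is a variation, not the answer's own code: it adds word boundaries, swallows the whitespace before the stop word, and passes a limit of 2 so splitting stops after the first separator. SplitBeforeWord and beforeStopWord are illustrative names.

```java
public class SplitBeforeWord {
    // Hypothetical helper: text before the first whole stop word, or the whole string.
    public static String beforeStopWord(String s) {
        // Limit 2 means at most one split is performed.
        return s.split("\\s*\\b(?:was|in|for)\\b", 2)[0];
    }

    public static void main(String[] args) {
        String s = "Once upon a time there was a fox in a hole";
        System.out.println("[" + beforeStopWord(s) + "]");
        // "inside" contains "in" but is not the word "in".
        System.out.println("[" + beforeStopWord("a fox inside a hole") + "]");
        // Without word boundaries the split happens inside "inside".
        System.out.println("[" + "a fox inside a hole".split("(was|in|for)")[0] + "]");
    }
}
```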
String s = "Once upon a time there was a fox in the hole";
String[] taboo = {"was", "in", "for"};
for (String word : taboo) {
// index of this word in what is left of s, or -1 if it is absent
int idx = s.indexOf(word);
if (idx > -1) {
s = s.substring(0, idx);
}
}
System.out.print(s);
works on my computer..

How can I make a Java regex all or nothing?

I'm trying to make a regex all or nothing in the sense that the given word must EXACTLY match the regular expression - if not, a match is not found.
For instance, if my regex is:
^[a-zA-Z][a-zA-Z|0-9|_]*
Then I would want to match:
cat9
cat9_
bob_____
But I would NOT want to match:
cat7-
cat******
rango78&&
I want my regex to be as strict as possible, going for an all or nothing approach. How can I go about doing that?
EDIT: To make my regex absolutely clear, a pattern must start with a letter, followed by any number of numbers, letters, or underscores. Other characters are not permitted. Below is the program in question I am using to test out my regex.
Pattern p = Pattern.compile("^[a-zA-Z][a-zA-Z|0-9|_]*");
Scanner in = new Scanner(System.in);
String result = "";
while(!result.equals("-1")){
result = in.nextLine();
Matcher m = p.matcher(result);
if(m.find())
{
System.out.println(result);
}
}
I think that if you use String.matches(regex), then you will get the effect you are looking for. The documentation says that matches() will return true only if the entire string matches the pattern.
The regex won't match the second example. It's already strict, since * and & are not in the allowed set of characters.
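It may match a prefix, but you can avoid this by adding '$' to the end of the regex, which explicitly matches end of input. Note also that | has no special meaning inside a character class, so [a-zA-Z|0-9|_] also accepts a literal | character; drop the bars. So try,
^[a-zA-Z][a-zA-Z0-9_]*$
This will ensure the match is against the entire input string, and not just a prefix.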
Note that \w is the same as [A-Za-z0-9_]. And you need to anchor to the end of the string like so:
Pattern p = Pattern.compile("^[a-zA-Z]\\w*$");
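To check the suggestions against the examples in the question, here is a small sketch (IdentifierCheck and its methods are made-up names). It uses Matcher.matches(), which requires the whole input to match, so no explicit anchors are needed; it also shows that the original character class accepts a literal | character.

```java
import java.util.regex.Pattern;

public class IdentifierCheck {
    // A letter followed by any number of letters, digits or underscores.
    private static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z]\\w*");

    // Hypothetical helper: true only if the entire input matches.
    public static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    // The character class from the question, which treats | as a literal.
    public static boolean originalClassAccepts(String s) {
        return s.matches("[a-zA-Z][a-zA-Z|0-9|_]*");
    }

    public static void main(String[] args) {
        String[] inputs = {"cat9", "cat9_", "bob_____", "cat7-", "cat******", "rango78&&"};
        for (String s : inputs) {
            System.out.println(s + " -> " + isIdentifier(s));
        }
        System.out.println("a|b with original class -> " + originalClassAccepts("a|b"));
        System.out.println("a|b with \\w -> " + isIdentifier("a|b"));
    }
}
```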

Java regular expression to match patterns and extract them

I tried writing a program in Java using regex to match a pattern and extract it. Given a string like "This is a link- #www.google.com# and this is another #google.com#" I should be able to get #www.google.com# and #google.com# strings extracted. Here is what I tried-
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class ParseLinks {
public static void main(String[] args) {
String message = "This is a link- #www.google.com# and this is another #google.com#";
Pattern p = Pattern.compile("#.*#");
Matcher matcher = p.matcher(message);
while(matcher.find()) {
String result = matcher.group();
System.out.println(result);
}
}
}
This results in output- #www.google.com# and this is another #google.com#. But what I wanted is only the strings #www.google.com# and #google.com# extracted. Can I please know the regex for this?
#[^#]+#
Though thinking about it, a hash sign is a bad choice for delimiting URLs, for rather obvious reasons.
The reason why yours does not work is the greediness of the star (from regular-expressions.info):
[The star] repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is not matched at all.
Java regex supports it, so use the non-greedy pattern .*? instead of the greedy .* so that it will end the capture as soon as possible instead of as late as possible.
If the language doesn't support it, then you can approximate it by simply checking for anything that's not an ending delimiter, like so:
#[^#]*#
Regular expressions are "greedy" by default, that is, they will match as much text as possible. In your example, the pattern "#.*#" translates to
match a "#"
match as many characters as possible such that you can still ...
... match a "#"
What you want is a "non-greedy" or "reluctant" pattern such as "*?". Try "#.*?#" in your case.
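The greedy, reluctant and negated-class patterns from these answers can be compared directly on the message from the question. This is a minimal sketch; HashDelimited and extract are names invented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashDelimited {
    // Hypothetical helper: collect every match of regex in text, in order.
    public static List<String> extract(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String message = "This is a link- #www.google.com# and this is another #google.com#";
        // Greedy: runs from the first '#' to the last '#'.
        System.out.println(extract("#.*#", message));
        // Reluctant: stops at the nearest closing '#'.
        System.out.println(extract("#.*?#", message));
        // Negated class: cannot cross a '#' at all.
        System.out.println(extract("#[^#]+#", message));
    }
}
```

The reluctant and negated-class patterns both yield the two delimited strings here; they differ on an empty pair "##", which .*? matches and [^#]+ does not.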
