Cannot match my regular expression - java

I am trying to match a string that looks like "WIFLYMODULE-xxxx" where the x can be any digit. For example, I want to be able to find the following...
WIFLYMODULE-3253
WIFLYMODULE-1585
WIFLYMODULE-1632
I am currently using
final Pattern q = Pattern.compile("[WIFLYMODULE]-[0-9]{3}");
but I am not picking up the string that I want. So my question is, why is my regular expression not working? Am i going about it in the wrong way?

You should use (..) instead of [...]. [..] is used for Character class
With a "character class", also called "character set", you can tell the regex engine to match only one out of several characters.
(WIFLYMODULE)-[0-9]{4}
Here is demo
Note: But in this case it's not needed at all. (...) is used for capturing group to access it by Matcher.group(index)
Important Note: Use \b as word boundary to match the correct word.
\\bWIFLYMODULE-[0-9]{4}\\b
Sample code:
String str = "WIFLYMODULE-3253 WIFLYMODULE-1585 WIFLYMODULE-1632";
Pattern p = Pattern.compile("\\bWIFLYMODULE-[0-9]{4}\\b");
Matcher m = p.matcher(str);
while (m.find()) {
System.out.println(m.group());
}
output:
WIFLYMODULE-3253
WIFLYMODULE-1585
WIFLYMODULE-1632

The regex should be:
"WIFLYMODULE-[0-9]{4}"
The square brackets means: one of the characters listed inside. Also you were matching three numbers instead of four. So your were matching strings like (where xxx is a number of three digits):
W-xxx, I-xxx, F-xxx, L-xxx, Y-xxx, M-xxx, O-xxx, D-xxx, U-xxx, L-xxx, E-xxx

You had it match on 3 digits instead of 4. And putting WIFLYMODULE inside [] makes it match on only one of those characters.
final Pattern q = Pattern.compile("WIFLYMODULE-[0-9]{4}");

[...] means that one character out of the ones in the bracket must match and not the string within it.
You, however, want to match WIFLYMODULE, thus, you have to use Pattern.compile("WIFLYMODULE-[0-9]{3}"); or Pattern.compile("(WIFLYMODULE)-[0-9]{3}");
{n} means that the character (or group) must match n-times. In your example you need 4 instead of 3: Pattern.compile("WIFLYMODULE-[0-9]{4}");

This way will work:
final Pattern q = Pattern.compile("WIFLYMODULE-[0-9]{4}");
The pattern breaks down to:
WIFLYMODULE- The literal string WIFLYMODULE-
[0-9]{4} Exactly four digits
What you had was:
[WIFLYMODULE] Any one of the characters in WIFLYMODULE
- The literal string -
[0-9]{3} Exactly three digits

Related

Matching regex groups with Java

I am trying to split a line with regex by using groups, but it's not working as I expected.
I want to match for example this line:
Ex. #1: temp name(this is the data)
and also this:
Ex. #2: temp name()
I used this regex:
[\s]*temp[\s]+[\s]*([A-Za-z]+)[\s]*[(]\s*(.*)+[)]\s*[{]\s*
which means: grab anything that starts with temp then put in group #1 the "name" then grab whatever inside the bracket and put it in group #2.
However, group #2 is always empty.
This is my code to fetch the data:
Pattern PATTERN = Pattern.compile("[\\s]*temp[\\s]+[\\s]*([A-Za-z]+)[\\s]*[(]\\s*(.*)+[)]\\s*");
Matcher m = PATTERN.matcher("temp name(this is the data)");
m.matches();
String name = m.group(1);
String data = m.group(2); // always empty
What am I doing wrong?
Your pattern doesn't match because it requires an open curly brace at the end, but your input doesn't have one.
Ignoring that small problem, the main problem is the little + after your capture group (.*)+. The plus requires one or more matches of .* and the group returned is the last match of the many. The term .* is greedy, so it consumes everything up to the bracket. The only way to match again is to consume nothing. So the last match of group 2 is blank.
To fix it, remove the + after group 2:
Pattern PATTERN = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
Note also how I removed other unnecessary characters from your regex, eg the single-character character classes - ie [\\s] is identical to \s. And \\s+\\s* is identical to just \\s+, because + is greedy.
I also removed the trailing curly bracket, which you can restore if your input data actually has it (your question showed input of "temp name(this is the data)", which has no trailing curly bracket).
Your regex should be this:
Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
You had (.*)+ which means one or more matches of .*. This results in nothing being captured.
Testing:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Example {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
Matcher m = pattern.matcher("temp name(this is the data)");
if(m.matches()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
}
}
}
Output:
name
this is the data
[\s] is equivalent with \s
[\s]+[\s]* is equivalent with \s+
[(] is equivalent with \( (same for [)] and [}])
This would leave your regexp as:
\s*temp\s+([A-Za-z]+)\s*\(\s*(.*)+\)\s*\{\s*
Assuming you actually want to match temp name(...) { (your regexp is looking for a {, while in your question you do not specify that):
(.*)+ is your problem. You're saying: "Match any number (including 0) chatacters and put them in a capture group, repeat that at least once".
Regexp are by default greedy (= they consume as much as possible), so the capture group will first contain everything within the two brackets, then the + will try to match the entire group again, and will match it with "" (the emtpy string) as this fulfils the capture group's pattern. This will elave your capture group emtpy.
What you want instead is \s*temp\s+([A-Za-z]+)\s*\(\s*(.*)\)\s*\{\s*
The reason you are getting empty groups is because you are creating multiple capture groups every time you put something between (), even if it is nested.
To make a group so it doesnt capture you can designate it as a non-capturing group by using ?: for example (?:sometest(this is the value we want)) will return just one group while (sometest(this is the value we want)) will return 2 groups.
For your particular regex, I have refined and simplified it, as you had capture groups you did not need.
Simple solution:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*\{\\s*
given the input:
Ex. #1: temp name(this is the data) {
Ex. #2: temp name() {
$1 = name, $2 = data
Pay attention to the fact that your regex contains a trailing curly brace. You can modify the regex to match without it and it will result in this:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*
https://regex101.com/r/tD0tO0/1

Why does this regex capture the excluded character?

I have a regex like this:
(?:(\\s| |\\A|^))(?:#)[A-Za-z0-9]{2,}
What I am trying to do is find a pattern that starts with an # and has two or more characters after, however it can't start in the middle of a word.
I'm new to regex but was under the impression ?: matches but then excludes the character however my regex seems to match but include the characters. Ideally I'd like for "#test" to return "test" and "test#test" to not match at all.
Can anyone tell me what I've done wrong?
Thanks.
Your understanding is incorrect. The difference between (...) and (?:...) is only that the former also creates a numbered match group which can be referred to with a backreference from within the regex, or as a captured match group from code following the match.
You could change the code to use lookbehinds, but the simple and straightforward fix is to put ([A-Za-z0-9]{2,}) inside regular parentheses, like I have done here, and retrieve the first matched group. (The # doesn't need any parentheses around it in this scenario, but the ones you have are harmless.)
Try this : You could use word boundary to specify your condition.
public static void main(String[] args) {
String s1 = "#test";
String s2 = "test#test";
String pattern = "\\b#\\w{2,}\\b";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(s1);
m.find();
System.out.println(m.group());
}
o/p :
#test
throws `IllegalStateException` in the second case (s2)..
How about:
\W#[\S]{2}[\S]*
The strings caught by this regular expression needs to be trimmed and remove the first character.
I guess you better need the following one:
(?<=(?<!\w)#)\w{2,}
Debuggex Demo
Don't forget to escape the backslashes in Java since in a string literal:
(?<=(?<!\\w)#)\\w{2,}

Regular expression for a complex string

I am new to regexp. I need to validate a string, but when I use my current attempt it is always returning false.
Rules:
A text matcher like "polygon(( ))"
number matcher like X Y, where x and y can be any double numbers
as many X Y pairs, separated by comma.
eg:
PolyGoN((
-74.0075783459999 40.710775696,
-74.007375926 40.710655064,
-74.0074640719999 40.7108592490001,
-74.0075783459999 40.710775696))
Here is the code that I used:
String inputString = "POLYGON((-74.0075783459999 40.710775696, -74.007375926 40.710655064, -74.0072836009999 40.710720973, -74.0075783459999 40.710775696))";
String regexp = "polygon[\\((][(\\-?\\d+(\\.\\d+)?)\\s*(\\-?\\d+(\\.\\d+)?)]*[\\))]";
Pattern pattern = Pattern.compile(regexp, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(inputString);
boolean result = matcher.matches();
[\\((] is incorrect way of specifying that you need ( twice. No matter how many times you repeat a character but inside a character class [] it counts only once. Since, it's the same character repeating you don't even need a character class there but just the character \\( with a quantifier that tells how many times it should repeat {2}. So, you need \\({2} at the start and \\){2} at the end.
Another problem with your use of [] is that you used them to denote a group of double pairs that repeats (using *). You always use () for grouping a part of your match. [] denotes a character class only. I wonder why you got that wrong because you grouped your doubles and their pairs correctly.
Next, you've forgotten to match all the commas , separating the double pairs. I've included that as (,\\s*)? in my regex. The hyphen - (or the negative sign here) doesn't need to be escaped since it's not inside a character class [] and so the regex parser knows you're not using it to specify a character range.
The corrected regex is (indented for clarity)
polygon\({2}\s*(
(-?\d+(\.\d+)?)\s*(-?\d+(\.\d+)?)(,\s*)
)*(-?\d+(\.\d+)?)\s*(-?\d+(\.\d+)?)
\s*\){2}
m|Polygon\(\(((\s*-?\d+\.\d+\s*){2},)*(\s*-?\d+\.\d+\s*){2}\)\)|i

How can I make a Java regex all or nothing?

I'm trying to make a regex all or nothing in the sense that the given word must EXACTLY match the regular expression - if not, a match is not found.
For instance, if my regex is:
^[a-zA-Z][a-zA-Z|0-9|_]*
Then I would want to match:
cat9
cat9_
bob_____
But I would NOT want to match:
cat7-
cat******
rango78&&
I want my regex to be as strict as possible, going for an all or nothing approach. How can I go about doing that?
EDIT: To make my regex absolutely clear, a pattern must start with a letter, followed by any number of numbers, letters, or underscores. Other characters are not permitted. Below is the program in question I am using to test out my regex.
Pattern p = Pattern.compile("^[a-zA-Z][a-zA-Z|0-9|_]*");
Scanner in = new Scanner(System.in);
String result = "";
while(!result.equals("-1")){
result = in.nextLine();
Matcher m = p.matcher(result);
if(m.find())
{
System.out.println(result);
}
}
I think that if you use String.matches(regex), then you will get the effect you are looking for. The documentation says that matches() will return true only if the entire string matches the pattern.
The regex won't match the second example. It's already strict, since * and & are not in the allowed set of characters.
It may match a prefix, but you can avoid this by adding '$' to the end of the regex, which explicitly matches end of input. So try,
^[a-zA-Z][a-zA-Z|0-9|_]*$
This will ensure the match is against the entire input string, and not just a prefix.
Note that \w is the same as [A-Za-z0-9_]. And you need to anchor to the end of the string like so:
Pattern p = Pattern.compile("^[a-zA-Z]\\w*$")

How to match repeated patterns?

I would like to match:
some.name.separated.by.dots
But I don't have any idea how.
I can match a single part like this
\w+\.
How can I say "repeat that"
Try the following:
\w+(?:\.\w+)+
The + after (?: ... ) tell it to match what is inside the parenthesis one or more times.
Note that \w only matches ASCII characters, so a word like café wouldn't be matches by \w+, let alone words/text containing Unicode.
EDIT
The difference between [...] and (?:...) is that [...] always matches a single character. It is called a "character set" or "character class". So, [abc] does not match the string "abc", but matches one of the characters a, b or c.
The fact that \w+[\.\w+]* also matches your string is because [\.\w+] matches a . or a character from \w, which is then repeated zero or more time by the * after it. But, \w+[\.\w+]* will therefor also match strings like aaaaa or aaa............
The (?:...) is, as I already mentioned, simply used to group characters (and possible repeat those groups).
More info on character sets: http://www.regular-expressions.info/charclass.html
More info on groups: http://www.regular-expressions.info/brackets.html
EDIT II
Here's an example in Java (seeing you post mostly Java answers):
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String text = "some.text.here only but not Some other " +
"there some.name.separated.by.dots and.we are done!";
Pattern p = Pattern.compile("\\w+(?:\\.\\w+)+");
Matcher m = p.matcher(text);
while(m.find()) {
System.out.println(m.group());
}
}
}
which will produce:
some.text.here
some.name.separated.by.dots
and.we
Note that m.group(0) and m.group() are equivalent: meaning "the entire match".
This will also work:
(\w+(\.|$))+
You can use ? to match 0 or 1 of the preceeding parts, * to match 0 to any amount of the preceeding parts, and + to match at least one of the preceeding parts.
So (\w\.)? will match w. and a blank, (\w\.)* will match r.2.5.3.1.s.r.g.s. and a blank, and (\w\.)+ will match any of the above but not a blank.
If you want to match something like your example, you'll need to do (\w+\.)+, which means 'match at least one non whitespace, then a period, and match at least one of these'.
(\w+\.)+
Apparently, the body has to be at least 30 characters. I hope this is enough.

Categories

Resources