Why does this regex capture the excluded character? - java

I have a regex like this:
(?:(\\s| |\\A|^))(?:#)[A-Za-z0-9]{2,}
I am trying to find a pattern that starts with a # followed by two or more characters, but it must not start in the middle of a word.
I'm new to regex and was under the impression that ?: matches but then excludes the character. However, my regex seems to match and include the characters. Ideally I'd like "#test" to return "test" and "test#test" to not match at all.
Can anyone tell me what I've done wrong?
Thanks.

Your understanding is incorrect. The only difference between (...) and (?:...) is that the former also creates a numbered capture group, which can be referred to with a backreference from within the regex, or retrieved as a captured group from code after the match. Both forms still consume the text they match, so that text remains part of the overall match.
You could change the code to use lookbehinds, but the simple and straightforward fix is to put [A-Za-z0-9]{2,} inside regular parentheses, i.e. ([A-Za-z0-9]{2,}), and retrieve the first matched group. (The # doesn't need any parentheses around it in this scenario, but the ones you have are harmless.)
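A minimal sketch of the capturing-group fix described above. The class and method names (HashtagCapture, firstTag) are made up for illustration, and the leading alternation is simplified to (?:\s|^), which is an assumption about what the original pattern intended.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post.
final class HashtagCapture {
    // Non-capturing prefix (whitespace or start of input), then '#',
    // then the tag text in capturing group 1.
    private static final Pattern TAG =
            Pattern.compile("(?:\\s|^)#([A-Za-z0-9]{2,})");

    /** Returns the first tag without its '#', or null if none is found. */
    static String firstTag(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstTag("#test"));      // test
        System.out.println(firstTag("test#test"));  // null
    }
}
```

Here m.group() would still return the whole match including the prefix and the #; only m.group(1) is limited to the text inside the capturing parentheses.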

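Try this: you could use a boundary assertion to express your condition. Note that \b does not work in front of the #, because there is no word boundary between the start of the input (or a space) and #, while there is one between "t" and "#" in "test#test". Use \B (not a word boundary) in front of the # instead.
public static void main(String[] args) {
    String s1 = "#test";
    String s2 = "test#test";
    String pattern = "\\B#\\w{2,}\\b";
    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(s1);
    // Check find() first: group() throws IllegalStateException without a successful match
    if (m.find()) {
        System.out.println(m.group());
    }
}
Output:
#test
In the second case (s2), find() returns false, so nothing is printed.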

How about:
\W#[\S]{2}[\S]*
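The strings matched by this regular expression need to be trimmed and have their first character removed. Note that \W requires a character before the #, so a # at the very start of the input is not matched.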

I think the following is what you need:
(?<=(?<!\w)#)\w{2,}
Don't forget to escape the backslashes in a Java string literal:
(?<=(?<!\\w)#)\\w{2,}
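As a rough sketch of how the lookbehind version could be used from Java: because the lookbehind does not consume the #, m.group() already returns the tag alone. The class and method names here are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
final class HashtagLookbehind {
    // The lookbehind asserts that a '#' (itself not preceded by a word
    // character) sits right before the match, without consuming it.
    private static final Pattern TAG =
            Pattern.compile("(?<=(?<!\\w)#)\\w{2,}");

    /** Returns every tag in the input, without the leading '#'. */
    static List<String> tags(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tags("#test"));             // [test]
        System.out.println(tags("test#test"));         // []
        System.out.println(tags("see #foo and #bar")); // [foo, bar]
    }
}
```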

Related

Java regex, replace certain characters except

I have this string "u2x4m5x7" and I want to replace all the characters except a number followed by an x with "".
The output should be:
"2x5x"
Just the number followed by the x.
But I am getting this:
"2x45x7"
I'm doing this:
String string = "u2x4m5x7";
String s = string.replaceAll("[^0-9+x]","");
Please help!!!
Here is a one-liner using String#replaceAll with two replacements:
System.out.println(string.replaceAll("\\d+(?!x)", "").replaceAll("[^x\\d]", ""));
Here is another working solution. We can iterate the input string using a formal pattern matcher with the pattern \d+x. This is the whitelist approach, of trying to match the variable combinations we want to keep.
String input = "u2x4m5x7";
Pattern pattern = Pattern.compile("\\d+x");
Matcher m = pattern.matcher(input);
StringBuilder b = new StringBuilder();
while(m.find()) {
b.append(m.group(0));
}
System.out.println(b);
This prints:
2x5x
It looks like this would be much simpler by searching to get the match rather than replacing all non matches, but here is a possible solution, though it may be missing a few cases:
\d(?!x)|[^0-9x]|(?<!\d)x
https://regex101.com/r/v6udph/1
Basically it will:
\d(?!x) -- remove any digit not followed by an x
[^0-9x] -- remove all non-x/digit characters
(?<!\d)x -- remove all x's not preceded by a digit
But then again, grabbing from \dx would be much simpler
Capture what you want to keep in group 1, or otherwise match any single character, and replace each match with $1 (which is empty whenever the |. alternative matched).
String s = string.replaceAll("(\\d+x)|.", "$1");
See this demo at regex101 or a Java demo at tio.run
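The capture-or-skip idea can be tried with a small sketch like the one below. The class and method names are made up for illustration, and the second input is an extra case, not from the question, chosen to show that \d+x keeps multi-digit numbers together.

```java
// Hypothetical wrapper around the one-liner above.
final class KeepDigitX {
    /** Keeps every "digits followed by x" run and drops all other characters. */
    static String keep(String input) {
        // Group 1 is set only when "\d+x" matched; when "." matched instead,
        // $1 is empty, so that character is effectively deleted.
        // Note: "." does not match line terminators by default.
        return input.replaceAll("(\\d+x)|.", "$1");
    }

    public static void main(String[] args) {
        System.out.println(keep("u2x4m5x7")); // 2x5x
        System.out.println(keep("a12x3"));    // 12x
    }
}
```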

Cannot match my regular expression

I am trying to match a string that looks like "WIFLYMODULE-xxxx" where the x can be any digit. For example, I want to be able to find the following...
WIFLYMODULE-3253
WIFLYMODULE-1585
WIFLYMODULE-1632
I am currently using
final Pattern q = Pattern.compile("[WIFLYMODULE]-[0-9]{3}");
but I am not picking up the string that I want. So my question is, why is my regular expression not working? Am I going about it the wrong way?
You should use (...) instead of [...]. Square brackets define a character class:
With a "character class", also called "character set", you can tell the regex engine to match only one out of several characters.
(WIFLYMODULE)-[0-9]{4}
Note: in this case the group is not needed at all. (...) creates a capturing group that you can access with Matcher.group(index).
Important note: use \b as a word boundary so that only the complete word is matched.
\\bWIFLYMODULE-[0-9]{4}\\b
Sample code:
String str = "WIFLYMODULE-3253 WIFLYMODULE-1585 WIFLYMODULE-1632";
Pattern p = Pattern.compile("\\bWIFLYMODULE-[0-9]{4}\\b");
Matcher m = p.matcher(str);
while (m.find()) {
System.out.println(m.group());
}
output:
WIFLYMODULE-3253
WIFLYMODULE-1585
WIFLYMODULE-1632
The regex should be:
"WIFLYMODULE-[0-9]{4}"
The square brackets mean: one of the characters listed inside. Also, you were matching three digits instead of four. So you were matching strings like these (where xxx is a three-digit number):
W-xxx, I-xxx, F-xxx, L-xxx, Y-xxx, M-xxx, O-xxx, D-xxx, U-xxx, L-xxx, E-xxx
You had it match on 3 digits instead of 4. And putting WIFLYMODULE inside [] makes it match on only one of those characters.
final Pattern q = Pattern.compile("WIFLYMODULE-[0-9]{4}");
[...] means that one character out of the ones in the bracket must match and not the string within it.
You, however, want to match WIFLYMODULE, thus, you have to use Pattern.compile("WIFLYMODULE-[0-9]{3}"); or Pattern.compile("(WIFLYMODULE)-[0-9]{3}");
{n} means that the character (or group) must match n-times. In your example you need 4 instead of 3: Pattern.compile("WIFLYMODULE-[0-9]{4}");
This way will work:
final Pattern q = Pattern.compile("WIFLYMODULE-[0-9]{4}");
The pattern breaks down to:
WIFLYMODULE- The literal string WIFLYMODULE-
[0-9]{4} Exactly four digits
What you had was:
[WIFLYMODULE] Any one of the characters in WIFLYMODULE
- The literal string -
[0-9]{3} Exactly three digits
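To see the breakdown above in action, a small sketch can compare what each pattern actually finds in one of the sample strings. The helper name firstMatch is invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two patterns from the question and answers.
final class CharClassDemo {
    /** Returns the first substring matched by the regex, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "WIFLYMODULE-3253";
        // Character class: one letter from the set, a hyphen, three digits.
        System.out.println(firstMatch("[WIFLYMODULE]-[0-9]{3}", input)); // E-325
        // Literal text followed by exactly four digits.
        System.out.println(firstMatch("WIFLYMODULE-[0-9]{4}", input));   // WIFLYMODULE-3253
    }
}
```

The original pattern does match with find(), but only the fragment around the hyphen, which is why the full string never showed up.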

How to make regex matching fail if checked string still has leftover characters?

I'm trying to check a string with a regular expression, and this check should only pass if the string contains only *h, *d, *w and/or *m where * can be any number.
So far I've got this:
Pattern p = Pattern.compile("([0-9]h)|([0-9]d)|([0-9]w)|([0-9]m)");
Matcher m = p.matcher(strToCheck);
if(m.find()){
//matching successful code
}
And it works to detect if there are any of the number-letter combinations present in the checked string, but it also works if the input is, for instance, "12x5d", because it has "5d" in it. I don't know if this is a code problem or a regex problem. Is there a way to achieve what I want?
EDIT:
Thank you for your answers so far, but as requested, I'll try to clarify a bit. A string like "1w 2d 3h" or "1w 1w" is valid and should pass, but something like "1w X 2d 3h", "1wX 2d" or "w d h" should fail.
Use m.matches(), or add ^ and $ to the beginning and end of the regex respectively.
Edit: if you want sequences of these delimited by whitespace (as mentioned in the comments), you can use
Pattern p = Pattern.compile("\\b\\d[hdwm]\\b");
Matcher m = p.matcher(strToCheck);
while (m.find()) {
    // matching successful code
}
Note that this loop finds each valid token, but on its own it does not reject other text in the string.
Firstly, I think you should use matches() instead of find(). The former matches the entire string against the regex, whereas the latter searches within the string.
Secondly, you can simplify the regex like so: "[0-9][hdwm]".
Finally, if the number can contain multiple digits, use the + operator: "[0-9]+[hdwm]"
try this:
Pattern p = Pattern.compile("[0-9][hdwm]");
Matcher m = p.matcher(strToCheck);
if(m.matches()){
//matching successful code
}
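If you want to accept things like 5d only as a complete word, rather than as part of one, you can use the \b "word boundary" marker in regex:
Pattern p = Pattern.compile("\\b[0-9][hdwm]\\b");
Used with find(), this will match a string like "Dimension: 5h" while rejecting a string like "Dimension: 12wx5h". Note that \b has to apply to every alternative: in a pattern such as \b([0-9]h)|([0-9]d)|([0-9]w)|([0-9]m)\b the alternation binds more loosely than \b, so only the first and last alternatives are anchored.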
(If, on the other hand, you only want to match if the entire string is just 5d or the like, then use matches() as others have suggested.)
You can write it like this "^\\d+[hdwm]$". Which should only match on the desired strings.
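For the clarified requirement in the question's edit (strings like "1w 2d 3h" pass, "1w X 2d 3h" fails), one possible sketch combines matches() with a repeated group. It assumes tokens are separated by whitespace and that numbers may have several digits; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical validator; the whitespace-separated format is an assumption
// based on the examples given in the question's edit.
final class DurationValidator {
    // One token (digits + unit), then zero or more whitespace-separated tokens.
    private static final Pattern DURATIONS =
            Pattern.compile("\\d+[hdwm](?:\\s+\\d+[hdwm])*");

    /** True only if the whole string consists of valid tokens. */
    static boolean isValid(String s) {
        // matches() requires the entire input to match, unlike find().
        return DURATIONS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("1w 2d 3h"));   // true
        System.out.println(isValid("1w 1w"));      // true
        System.out.println(isValid("1w X 2d 3h")); // false
        System.out.println(isValid("1wX 2d"));     // false
        System.out.println(isValid("w d h"));      // false
    }
}
```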

How can I make a Java regex all or nothing?

I'm trying to make a regex all or nothing in the sense that the given word must EXACTLY match the regular expression - if not, a match is not found.
For instance, if my regex is:
^[a-zA-Z][a-zA-Z|0-9|_]*
Then I would want to match:
cat9
cat9_
bob_____
But I would NOT want to match:
cat7-
cat******
rango78&&
I want my regex to be as strict as possible, going for an all or nothing approach. How can I go about doing that?
EDIT: To make my regex absolutely clear, a pattern must start with a letter, followed by any number of numbers, letters, or underscores. Other characters are not permitted. Below is the program in question I am using to test out my regex.
Pattern p = Pattern.compile("^[a-zA-Z][a-zA-Z|0-9|_]*");
Scanner in = new Scanner(System.in);
String result = "";
while(!result.equals("-1")){
result = in.nextLine();
Matcher m = p.matcher(result);
if(m.find())
{
System.out.println(result);
}
}
I think that if you use String.matches(regex), then you will get the effect you are looking for. The documentation says that matches() will return true only if the entire string matches the pattern.
The regex won't match the second example. It's already strict, since * and & are not in the allowed set of characters.
It may match a prefix, but you can avoid this by adding '$' to the end of the regex, which explicitly matches end of input. So try,
^[a-zA-Z][a-zA-Z|0-9|_]*$
This will ensure the match is against the entire input string, and not just a prefix.
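Note that \w is the same as [A-Za-z0-9_]. Inside a character class, | is a literal pipe rather than alternation, so [a-zA-Z|0-9|_] also accepts the | character. And you need to anchor to the end of the string like so:
Pattern p = Pattern.compile("^[a-zA-Z]\\w*$");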
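The difference between find() and matches() discussed in these answers can be checked with a short sketch. The class and method names are invented, and \w is used in place of the question's character class.

```java
import java.util.regex.Pattern;

// Hypothetical demo contrasting prefix matching with whole-string matching.
final class StrictMatch {
    private static final Pattern PREFIX = Pattern.compile("^[a-zA-Z]\\w*");
    private static final Pattern WHOLE = Pattern.compile("[a-zA-Z]\\w*");

    /** find() succeeds as soon as some prefix fits the pattern. */
    static boolean prefixFound(String s) {
        return PREFIX.matcher(s).find();
    }

    /** matches() succeeds only if the entire input fits the pattern. */
    static boolean isIdentifier(String s) {
        return WHOLE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(prefixFound("cat7-"));   // true, "cat7" is matched
        System.out.println(isIdentifier("cat7-"));  // false
        System.out.println(isIdentifier("cat9_"));  // true
    }
}
```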

Java regular expression to match patterns and extract them

I tried writing a program in Java using regex to match a pattern and extract it. Given a string like "This is a link- #www.google.com# and this is another #google.com#" I should be able to get #www.google.com# and #google.com# strings extracted. Here is what I tried-
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class ParseLinks {
public static void main(String[] args) {
String message = "This is a link- #www.google.com# and this is another #google.com#";
Pattern p = Pattern.compile("#.*#");
Matcher matcher = p.matcher(message);
while(matcher.find()) {
String result = matcher.group();
System.out.println(result);
}
}
}
This results in output- #www.google.com# and this is another #google.com#. But what I wanted is only the strings #www.google.com# and #google.com# extracted. Can I please know the regex for this?
#[^#]+#
Though thinking about it, a hash sign is a bad choice for delimiting URLs, for rather obvious reasons.
The reason yours does not work is the greediness of the star (from regular-expressions.info):
[The star] repeats the previous item
zero or more times. Greedy, so as many
items as possible will be matched
before trying permutations with less
matches of the preceding item, up to
the point where the preceding item is
not matched at all.
Java regex supports this: use the non-greedy pattern .*? instead of the greedy .* so that the match ends as soon as possible instead of as late as possible.
In an engine that doesn't support it, you can approximate it by matching anything that's not the ending delimiter, like so:
#[^#]*#
Regular expressions are "greedy" by default, that is, they will match as much text as possible. In your example, the pattern "#.*#" translates to
match a "#"
match as many characters as possible such that you can still ...
... match a "#"
What you want is a "non-greedy" or "reluctant" quantifier such as "*?". Try "#.*?#" in your case.
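A short sketch comparing the three patterns mentioned in these answers on the question's input. The helper name findAll is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of greedy vs. reluctant vs. negated-class matching.
final class LinkExtractor {
    /** Collects every non-overlapping match of the regex in the input. */
    static List<String> findAll(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String message =
                "This is a link- #www.google.com# and this is another #google.com#";
        // Greedy: one match running from the first '#' to the last '#'.
        System.out.println(findAll("#.*#", message));
        // Reluctant: stops at the nearest closing '#'.
        System.out.println(findAll("#.*?#", message));   // [#www.google.com#, #google.com#]
        // Negated class: cannot run past a '#' at all.
        System.out.println(findAll("#[^#]+#", message)); // [#www.google.com#, #google.com#]
    }
}
```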
