I'm wondering about the behavior of the Matcher in Java.
I have a pattern which I compiled, and when iterating over the matcher's results I don't understand why a specific value is missing.
My code:
String str = "star wars";
Pattern p = Pattern.compile("star war|Star War|Starwars|star wars|star wars|pirates of the caribbean|long strage trip|drone|snatched (2017)");
Matcher matcher = p.matcher(str);
while (matcher.find()) {
    System.out.println("\nRegex : " + matcher.group());
}
I get hit with "star war" which is right as it is in my pattern.
But I don't get "star wars" as a hit and I don't understand why as it is part of my pattern.
The behavior is expected because alternation in an NFA regex is "eager", i.e. the first alternative that matches wins, and the remaining alternatives are not even tried. Also, note that once a regex engine finds a match with a consuming pattern (and yours is a consuming pattern, it is not a zero-width assertion like a lookahead/lookbehind/word boundary/anchor), the index is advanced to the end of the match and the next match is searched for from that position.
So, once your first star war alternative branch matches, there is no way to match star wars, as the regex index is now just before the last s.
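To make the "first alternative wins" behavior concrete, here is a small sketch. The class and helper names are made up for illustration, and the pattern is cut down to the two alternatives that matter:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    // Collects every match that find() reports, in order.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Shorter alternative first: it wins, and the next search starts at the final "s".
        System.out.println(findAll("star war|star wars", "star wars")); // [star war]
        // Longer alternative first: now the full title is reported.
        System.out.println(findAll("star wars|star war", "star wars")); // [star wars]
    }
}
```

Either way only one hit is reported, because the two alternatives overlap in the input and a consumed character cannot be matched again.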
Instead, just check whether the string contains each of the strings you are looking for. The simplest approach is a loop:
String str = "star wars";
String[] arr = {"star war","Star War","Starwars","star wars","pirates of the caribbean","long strage trip","drone","snatched (2017)"};
for (String s : arr) {
    if (str.contains(s))
        System.out.println(s);
}
See the Java demo
By the way, your regex contains snatched (2017), and it does not match ( and ), it only matches snatched 2017. To match literal parentheses, the ( and ) must be escaped. I also removed a dupe entry for star wars.
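A quick sketch of the parentheses point, in case it helps. The class and helper names are hypothetical; Pattern.quote is the standard way to treat a whole string as a literal:

```java
import java.util.regex.Pattern;

class LiteralParensDemo {
    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String title = "snatched (2017)";
        // Unescaped: ( and ) form a group, so the regex looks for "snatched 2017".
        System.out.println(found("snatched (2017)", title));                // false
        System.out.println(found("snatched (2017)", "snatched 2017"));      // true
        // Escaped parentheses are matched literally.
        System.out.println(found("snatched \\(2017\\)", title));            // true
        // Pattern.quote escapes the whole string for you.
        System.out.println(found(Pattern.quote("snatched (2017)"), title)); // true
    }
}
```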
A better way to build your regex would be like this:
String pattern = "[Ss]tar[\\s]{0,1}[Ww]ar[s]{0,1}";
Breaking down:
[Ss]: it will match either S or s in the first position
\s: matches any whitespace character (written \\s inside a Java string literal)
{0,1}: the previous character (or set) will be matched from 0 to 1 times
An alternative is:
String pattern = "[Ss]tar[\\s]?[Ww]ar[s]?";
?: the previous character (or set) will be matched once or not at all
For more information, see https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
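Here is a rough sketch that runs the compact pattern against the variants from the question. The class and method names are invented for the example, and [\\s]? and [s]? are written as the equivalent \\s? and s?:

```java
import java.util.regex.Pattern;

class StarWarsVariantsDemo {
    static final Pattern TITLE = Pattern.compile("[Ss]tar\\s?[Ww]ars?");

    // matches() requires the whole input to fit the pattern.
    static boolean isTitle(String s) {
        return TITLE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isTitle("star war"));   // true
        System.out.println(isTitle("Star War"));   // true
        System.out.println(isTitle("Starwars"));   // true
        System.out.println(isTitle("star wars"));  // true
        System.out.println(isTitle("StarWars"));   // true, although it was not in the original list
        System.out.println(isTitle("star  wars")); // false: two spaces
    }
}
```

Note that the compact pattern accepts a few spellings the original alternation did not list, which may or may not be what you want.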
Edit 1: fixed typo (\s -> \\s). Thanks, @eugene.
You want to match the whole input sequence, so you should use Matcher.matches() or add ^ and $:
Pattern p = Pattern.compile("^(star war|Star War|Starwars|star wars|"
+ "star wars|pirates of the caribbean)$");
will print
Regex : star wars
But I agree with @NAMS: Don't build your regex like this.
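For completeness, a minimal sketch of the Matcher.matches() variant. The class and method names are made up, and the alternation is a shortened version of the one in the question:

```java
import java.util.regex.Pattern;

class WholeInputMatchDemo {
    static final Pattern TITLES =
            Pattern.compile("star war|Star War|Starwars|star wars|pirates of the caribbean");

    // matches() only succeeds if some alternative covers the entire input,
    // so the engine backtracks past "star war" and tries "star wars".
    static boolean isKnownTitle(String s) {
        return TITLES.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isKnownTitle("star wars"));   // true
        System.out.println(isKnownTitle("star war"));    // true
        System.out.println(isKnownTitle("star wars 2")); // false
    }
}
```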
I have this string "u2x4m5x7" and I want to replace every character with "", except for a number followed by an x.
The output should be:
"2x5x"
Just the number followed by the x.
But I am getting this:
"2x45x7"
I'm doing this:
String string = "u2x4m5x7";
String s = string.replaceAll("[^0-9+x]","");
Please help!!!
Here is a one-liner using String#replaceAll with two replacements:
System.out.println(string.replaceAll("\\d+(?!x)", "").replaceAll("[^x\\d]", ""));
Here is another working solution. We can iterate the input string using a formal pattern matcher with the pattern \d+x. This is the whitelist approach, of trying to match the variable combinations we want to keep.
String input = "u2x4m5x7";
Pattern pattern = Pattern.compile("\\d+x");
Matcher m = pattern.matcher(input);
StringBuilder b = new StringBuilder();
while (m.find()) {
    b.append(m.group(0));
}
System.out.println(b);
This prints:
2x5x
It looks like this would be much simpler by searching to get the match rather than replacing all non matches, but here is a possible solution, though it may be missing a few cases:
\d(?!x)|[^0-9x]|(?<!\d)x
https://regex101.com/r/v6udph/1
Basically it will:
\d(?!x) -- remove any digit not followed by an x
[^0-9x] -- remove all non-x/digit characters
(?<!\d)x -- remove all x's not preceded by a digit
But then again, grabbing from \dx would be much simpler
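As a sketch of how the three-part pattern behaves in Java (class and method names are hypothetical), including one of the cases it misses:

```java
class BlacklistReplaceDemo {
    // Removes: digits not followed by x, anything that is not a digit or x,
    // and any x that is not preceded by a digit.
    static String strip(String s) {
        return s.replaceAll("\\d(?!x)|[^0-9x]|(?<!\\d)x", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("u2x4m5x7")); // 2x5x
        // Multi-digit numbers are cut down to their last digit,
        // because "1" in "12x" is a digit not followed by x.
        System.out.println(strip("12x"));      // 2x
    }
}
```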
Capture what you need into group 1, or else match any single character, and replace each match with $1 (which is empty when the |. branch matched).
String s = string.replaceAll("(\\d+x)|.", "$1");
See this demo at regex101 or a Java demo at tio.run
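A small runnable sketch of the capture-or-skip idea (class and method names are just for illustration). In Java, a $1 that refers to a group that did not take part in the match contributes nothing to the replacement:

```java
class CaptureOrSkipDemo {
    // Every match is either a "digits + x" token (kept via $1)
    // or a single other character (replaced by the empty group).
    static String keepDigitX(String s) {
        return s.replaceAll("(\\d+x)|.", "$1");
    }

    public static void main(String[] args) {
        System.out.println(keepDigitX("u2x4m5x7")); // 2x5x
        System.out.println(keepDigitX("u12x4"));    // 12x
    }
}
```

One caveat: by default . does not match line terminators, so newlines in the input would survive unless the DOTALL flag (?s) is added.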
I want to create a regex in order to break a string into words from a dictionary. If the string matches, I can iterate over each group and make some changes. Some of the words are prefixes of others. However, a regex like /(HH|HH12)+/ will not match the string HH12HH. What's wrong with the regex? Shouldn't it match the first HH12 and then the HH in the string?
You want to match an entire string in Java that should only contain HH12 or HH substrings. It is much easier to do in 2 steps: 1) check if the string meets the requirements (here, with matches("(?:HH12|HH)+")), 2) extract all tokens (here, with HH12|HH or HH(?:12)?, since the first alternative in an unanchored alternation group "wins" and the rest are not considered).
String str = "HH12HH";
Pattern p = Pattern.compile("HH12|HH");
List<String> res = new ArrayList<>();
if (str.matches("(?:HH12|HH)+")) { // If the whole string consists of the defined values
    Matcher m = p.matcher(str);
    while (m.find()) {
        res.add(m.group());
    }
}
System.out.println(res); // => [HH12, HH]
See the Java demo
An alternative is a regex that will check if a string meets the requirements with a lookahead at the beginning, and then will match consecutive tokens with a \G operator:
String str = "HH12HH";
Pattern p = Pattern.compile("(\\G(?!^)|^(?=(?:HH12|HH)+$))(?:HH12|HH)");
List<String> res = new ArrayList<>();
Matcher m = p.matcher(str);
while (m.find()) {
    res.add(m.group());
}
System.out.println(res);
See another Java demo
Details:
(\\G(?!^)|^(?=(?:HH12|HH)+$)) - the end of the previous successful match (\\G(?!^)) or (|) start of string (^) that is followed with 1+ sequences of HH12 or HH ((?:HH12|HH)+) up to the end of string ($)
(?:HH12|HH) - either HH12 or HH.
In the string HH12HH, the regex (HH|HH12)+ will work this way:
HH12HH
^ - both options work, continue
HH12HH
 ^ - the first alternative is entirely satisfied, mark it as a match
HH12HH
  ^ - no match
HH12HH
   ^ - no match
Because you set the A flag, which anchors the match to the start of the string, the rest will not produce a match. If you remove it, the pattern will match HH both at the start and at the end.
In this case, you have three options:
Put the longest pattern first: /(HH12|HH)/Ag. See demo. This is the one I prefer.
Factor out the shared part and use an optional group: /(HH(?:12)?)/Ag. See second demo.
Put a $ at the end like so /(HH|HH12)$/Ag
The problem you are having is entirely related to the way the regex engine decides what to match.
As I explained here, there are some regex flavors that pick the longest alternation... but you're not using one. Java's regex engine is the other type: the first matching alternation is used.
Your regex works a lot like this code:
if (bool1) {
    // This is where `HH` matches
} else if (bool1 && bool2) {
    // This is where `HH12` would match, but this code will never execute
}
The best way to fix this is to order your words in reverse, so that HH12 occurs before HH.
Then, you can just match with an alternation:
HH12|HH
It should be pretty obvious what matches, since you can get the results of each match.
(You could also put each word in its own capture group, but that's a bit harder to work with.)
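If the dictionary is not fixed, the reordering can be done in code. Below is a rough sketch under a few assumptions: the class and method names are invented, the words are quoted with Pattern.quote so they are treated literally, and the whole-string check from the earlier answer is reused before extracting tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestFirstTokenizer {
    // Returns the tokens of input, or an empty list if input is not
    // made up entirely of dictionary words.
    static List<String> tokenize(List<String> words, String input) {
        List<String> sorted = new ArrayList<>(words);
        // Longest words first, so a word is tried before any of its prefixes.
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));

        StringBuilder alternation = new StringBuilder();
        for (String w : sorted) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(w));
        }

        List<String> tokens = new ArrayList<>();
        if (!Pattern.matches("(?:" + alternation + ")+", input)) {
            return tokens;
        }
        Matcher m = Pattern.compile(alternation.toString()).matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(Arrays.asList("HH", "HH12"), "HH12HH")); // [HH12, HH]
        System.out.println(tokenize(Arrays.asList("HH", "HH12"), "HH12X"));  // []
    }
}
```

Longest-first ordering is enough when the only overlaps are prefixes like HH and HH12. For arbitrary dictionaries the find() loop does not backtrack across tokens, so it can still pick a split that differs from the one the whole-string check accepted.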
I have the following string:
CLASSIC STF
CLASSIC
I am using a regex to match the strings.
Pattern p = Pattern.compile("^CLASSIC(\\s*)$", Pattern.CASE_INSENSITIVE);
CLASSIC STF is also being displayed.
I am using m.find().
How can I make sure that only CLASSIC is displayed, and not CLASSIC STF?
Thanks for helping.
If you use Matcher.find() the expression CLASSIC(\s*) will match CLASSIC STF.
Matcher.matches() will return false, however, since it requires the expression to match the entire input.
To make Matcher.find() do the same, change the expression to ^CLASSIC(\s*)$, as said by reto.
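A short sketch of the difference between find() and matches() for this pattern (class and helper names are hypothetical):

```java
import java.util.regex.Pattern;

class FindVersusMatchesDemo {
    // find(): the regex may match any part of the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).find();
    }

    // matches(): the regex must cover the whole input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(found("CLASSIC(\\s*)", "CLASSIC STF"));        // true
        System.out.println(matchesWhole("CLASSIC(\\s*)", "CLASSIC STF")); // false
        System.out.println(found("^CLASSIC(\\s*)$", "CLASSIC STF"));      // false
        System.out.println(found("^CLASSIC(\\s*)$", "classic"));          // true
    }
}
```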
By default ^ and $ match against the beginning and end of the entire input string respectively, ignoring any newlines. I would expect that your expression would not match on the string you mention. Indeed:
String pattern = "^CLASSIC(\\s*)$";
String input = "CLASSIC STF%nCLASSIC";
Pattern p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(String.format(input));
while (m.find()) {
    System.out.println(m.group());
}
prints no results.
If you want ^ and $ to match the beginning and end of every line in the string, you should enable "multiline mode". Do so by replacing the third line of the code above with Pattern p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);. When I do so I get one result, namely: "CLASSIC".
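Here is the multiline variant as a self-contained sketch. The class and method names are made up, and a plain "\n" is used instead of %n so the input is the same on every platform:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineAnchorsDemo {
    // Returns every match of ^CLASSIC(\s*)$ under the given flags.
    static List<String> matches(String input, int flags) {
        Matcher m = Pattern.compile("^CLASSIC(\\s*)$", flags).matcher(input);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "CLASSIC STF\nCLASSIC";
        // Without MULTILINE, ^ and $ only anchor at the ends of the whole input.
        System.out.println(matches(input, Pattern.CASE_INSENSITIVE)); // []
        // With MULTILINE, they also anchor at each line boundary.
        System.out.println(matches(input, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE)); // [CLASSIC]
    }
}
```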
You also asked why "CLASSIC STF" is not matched. Let's break down your pattern to see why. The pattern says: match anything that...
starts at the beginning of a line ~ ^
begins with a C, followed by an L, A, S, S, I and C ~ CLASSIC
after which 0 or more whitespace characters follow ~ (\s*)
after which we see a line ending ~ $
After matching the space in "CLASSIC STF" (step 3) we are looking at a character "S". This doesn't match a line ending (step 4), so we cannot match the regex.
Note that the parentheses in your regex are not necessary. You can leave them out.
The Javadoc of the Pattern class is very elaborate. It could be helpful to read it.
EDIT:
If you want to check if a string/line contains the word "CLASSIC" using a regex, then I'd recommend to use the regex \bCLASSIC\b. If you want to see if a string starts with the word "CLASSIC", then I'd use ^CLASSIC\b.
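A quick sketch of the two word-boundary patterns from the edit above (class and method names are invented for the example):

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    static final Pattern CONTAINS_WORD = Pattern.compile("\\bCLASSIC\\b");
    static final Pattern STARTS_WITH_WORD = Pattern.compile("^CLASSIC\\b");

    static boolean containsWord(String s) {
        return CONTAINS_WORD.matcher(s).find();
    }

    static boolean startsWithWord(String s) {
        return STARTS_WITH_WORD.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("A CLASSIC STF"));   // true
        System.out.println(containsWord("NEOCLASSICAL"));    // false: CLASSIC is only part of a word
        System.out.println(startsWithWord("CLASSIC STF"));   // true
        System.out.println(startsWithWord("CLASSICAL STF")); // false
        System.out.println(startsWithWord("A CLASSIC STF")); // false
    }
}
```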
I wonder if this would help:
practice = c("CLASSIC STF", "CLASSIC")
grep("^CLASSIC[[:space:]STF]?", practice)
Using the following value of a text node...
MatcH one MatcHer two MarcH three
How can java matcher.find() be used to create the following output?
<wrap>MatcH</wrap> one MatcHer two <wrap>MarcH</wrap> three
Assuming a java regex that captures all words starting with capital 'M' and ending with a capital 'H'
\bM\w*H\b
Basically, I want to surround anything that matches this regex with wrap tags
String text = "MatcH one MatcHer two MarcH three";
Pattern pattern = Pattern.compile("\\bM\\w*H\\b");
Matcher matcher = pattern.matcher(text);
// replace each time the regex is found
while (matcher.find()) {
    text = text.replaceAll(matcher.group(), "<wrap>"
            + matcher.group() + "</wrap>");
}
ReplaceFirst/ReplaceAll is not working for me because it results in the following...
<wrap>MatcH</wrap> one <wrap>MatcH</wrap>er two <wrap>MarcH</wrap> three
Thanks in advance...
The problem is the replaceAll call inside the loop: the matcher finds MatcH, and replaceAll then replaces every occurrence of the text MatcH, including the one inside MatcHer, in that iteration of the loop. Note that the \\b doesn't appear in the output of group, so nothing prevents it from replacing inside MatcHer.
You can put a System.out.println inside the loop to print the output of group and the output of replaceAll to see what happens and why it does what it does.
Simplifying your code to just the below will work: (that's probably "hard-coding match numbers" but I don't really see a problem with that as it stands and I don't see a simpler solution)
String text = "MatcH one MatcHer two MarcH three";
text = text.replaceAll("\\b(M\\w*H)\\b", "<wrap>$1</wrap>");
The above is how regex is supposed to work. If you see that problems may arise in future using something similar to the above, regex may not be the way to go.
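If you do want to stay with a matcher.find() loop, as the question asks, one option that is not in the answer above is Matcher.appendReplacement together with appendTail, which replaces each match at its own position instead of searching the text again. A minimal sketch (class and method names are made up):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WrapMatchesDemo {
    static String wrap(String text) {
        Matcher m = Pattern.compile("\\bM\\w*H\\b").matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps any $ or \ in the matched text literal.
            m.appendReplacement(out, Matcher.quoteReplacement("<wrap>" + m.group() + "</wrap>"));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrap("MatcH one MatcHer two MarcH three"));
        // <wrap>MatcH</wrap> one MatcHer two <wrap>MarcH</wrap> three
    }
}
```

For a fixed replacement like this one, the replaceAll one-liner is shorter; the loop is mainly useful when the replacement has to be computed per match.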
I'm trying to capture a string up until a certain word that is within some group of words.
I only want to capture the string up until the FIRST instance of one of these words, as they may appear many times in the string.
For example:
Group of words: (was, in, for)
String = "Once upon a time there was a fox in a hole";
would return "Once upon a time there"
Thank you
What you need is called a Lookahead. The exact regex for your situation is:
/^.+?(?=(?:was)|(?:in)|(?:for))/
Here, ^ matches the beginning of the string, .+? is a lazy match (it will match the shortest possible string), (?= ... ) means "followed by" and (?: ... ) is a noncapturing group, which may or may not be necessary for you.
For bonus points, you should probably use word boundaries to make sure you're matching a whole word instead of a substring ("a fox inside" would otherwise return "a fox "), and a leading \s* in the lookahead to drop the trailing space from the match:
/^.+?(?=\s*\b(?:was)|(?:in)|(?:for)\b)/
Where \s* matches any amount of white space (including none at all) and \b matches the beginning or end of a word. It's a Zero-Width assertion, meaning it doesn't match an actual character.
Or, in Java:
Pattern p = Pattern.compile("^.+?(?=\\s*\\b(?:was)|(?:in)|(?:for)\\b)");
I think that will work. I haven't used it, but according to the documentation, that exact string should work. Just had to escape all the backslashes.
Edit
Here I am, more than a year later, and I just realized the regex above does not do what I thought it did at the time. Alternation has the lowest precedence, not the highest, so the leading \s*\b applied only to the first alternative and the trailing \b only to the last. The pattern is more correctly:
/^.+?(?=\s*\b(?:was|in|for)\b)/
Compare this new regex to my old one. Additionally, future travelers, you may wish to capture the whole string if no such breaker word exists. Try THIS on for size:
/^(?:(?!\s*\b(?:was|in|for)\b).)*/
This one uses a NEGATIVE lookahead (which asserts that the pattern does not match at the current position). It's possibly slower, but it still does the job. See it in action here.
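For anyone who wants to try the two corrected patterns in Java, here is a small sketch. The class, constant and method names are hypothetical; the regexes are the ones from the edit with the backslashes doubled:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BeforeBreakerWordDemo {
    // Positive lookahead: only matches if a breaker word exists.
    static final Pattern UP_TO_BREAKER =
            Pattern.compile("^.+?(?=\\s*\\b(?:was|in|for)\\b)");
    // Negative lookahead: falls back to the whole string if there is none.
    static final Pattern UP_TO_BREAKER_OR_END =
            Pattern.compile("^(?:(?!\\s*\\b(?:was|in|for)\\b).)*");

    // Returns the matched prefix, or null if the pattern does not match.
    static String prefix(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String fox = "Once upon a time there was a fox in a hole";
        System.out.println(prefix(UP_TO_BREAKER, fox));                    // Once upon a time there
        System.out.println(prefix(UP_TO_BREAKER_OR_END, fox));             // Once upon a time there
        System.out.println(prefix(UP_TO_BREAKER, "The fox slept"));        // null
        System.out.println(prefix(UP_TO_BREAKER_OR_END, "The fox slept")); // The fox slept
    }
}
```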
You can use this code to capture the string before a terminating word:
Pattern p = Pattern.compile("^(.*?)((\\b(was|in|for)\\b)|$)");
String s = "Once upon a time there was a fox in a hole";
Matcher m = p.matcher(s);
if (m.find()) {
    System.out.println(m.group(1));
}
This code produces the following output (link):
Once upon a time there
Here is how this expression works: the (\\b(was|in|for)\\b) means "any of the words listed in the inner parentheses, when they appear on word boundaries". The expression just outside allows for $ in order to capture something even if none of the terminating words appear in the source.
A very simple way to handle this is to just split the string with a regex and keep the first thing returned:
String str = "Once upon a time there was a fox in a hole";
String match = str.split("(was|in|for)")[0];
// match = "Once upon a time there "
In this example, match will either contain the first part of the string before the first matched word or, in the case of a string where the word wasn't found it will contain the entire string.
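A possible refinement of the split approach, sketched under a few assumptions of mine rather than taken from the answer: word boundaries so that words like "inside" do not count, a leading \\s* to drop the trailing space, and a limit of 2 so that split stops after the first breaker word. The class and method names are made up.

```java
class SplitAtFirstWordDemo {
    static String firstPart(String s) {
        // limit = 2: at most one split, and the array always has an element 0.
        return s.split("\\s*\\b(?:was|in|for)\\b", 2)[0];
    }

    public static void main(String[] args) {
        System.out.println(firstPart("Once upon a time there was a fox in a hole")); // Once upon a time there
        System.out.println(firstPart("a fox inside"));                               // a fox inside
        System.out.println(firstPart("The fox slept"));                              // The fox slept
    }
}
```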
String s = "Once upon a time there was a fox in the hole";
String[] taboo = {"was", "in", "for"};
for (int i = 0; i < taboo.length; i++) {
    // cut the string at the first occurrence of each word, if present
    if (s.indexOf(taboo[i]) > -1) {
        s = s.substring(0, s.indexOf(taboo[i]));
    }
}
System.out.print(s);
works on my computer..