regex break string into words in dictionary


I want to create a regex that breaks a string into words from a dictionary. If the string matches, I can iterate over each group and make some changes. Some of the words are prefixes of others. However, a regex like /(HH|HH12)+/ will not match the string HH12HH. What's wrong with the regex? Shouldn't it match the first HH12 and then HH in the string?

You want to match an entire string in Java that should only contain HH12 or HH substrings. It is much easier to do in 2 steps: 1) check if the string meets the requirements (here, with matches("(?:HH12|HH)+")), 2) extract all tokens (here, with HH12|HH or HH(?:12)?, since the first alternative in an unanchored alternation group "wins" and the rest are not considered).
String str = "HH12HH";
Pattern p = Pattern.compile("HH12|HH");
List<String> res = new ArrayList<>();
if (str.matches("(?:HH12|HH)+")) { // If the whole string consists of the defined values
Matcher m = p.matcher(str);
while (m.find()) {
res.add(m.group());
}
}
System.out.println(res); // => [HH12, HH]
An alternative is a regex that will check if a string meets the requirements with a lookahead at the beginning, and then will match consecutive tokens with a \G operator:
String str = "HH12HH";
Pattern p = Pattern.compile("(\\G(?!^)|^(?=(?:HH12|HH)+$))(?:HH12|HH)");
List<String> res = new ArrayList<>();
Matcher m = p.matcher(str);
while (m.find()) {
res.add(m.group());
}
System.out.println(res);
Details:
(\\G(?!^)|^(?=(?:HH12|HH)+$)) - the end of the previous successful match (\\G(?!^)) or (|) start of string (^) that is followed with 1+ sequences of HH12 or HH ((?:HH12|HH)+) up to the end of string ($)
(?:HH12|HH) - either HH12 or HH.

In the string HH12HH, the regex (HH|HH12)+ will work this way:
HH12HH
^ - both options work, continue
HH12HH
 ^ - the first alternative is entirely satisfied, mark it as a match
HH12HH
  ^ - no match
HH12HH
   ^ - no match
Because you set the A flag, which anchors the pattern at the start of the string, the rest will not produce a match. If you remove it, the pattern will match both the HH at the start and the HH at the end.
In this case, you have three options:
Put the longest pattern first: /(HH12|HH)/Ag. See demo. This is the one I prefer.
Factor out the shared part and use an optional group: /(HH(?:12)?)/Ag. See second demo.
Put a $ at the end, like so: /(HH|HH12)$/Ag
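The effect of alternation order can be checked directly in Java. The following is a minimal sketch, assuming java.util.regex and plain find() calls (Java has no A flag, so the patterns here are unanchored); the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {
    // Collects every match that find() reports for the given regex.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "HH12HH";
        // Short alternative first: HH wins at index 0, "12" is skipped.
        System.out.println(findAll("(HH|HH12)+", input));   // [HH, HH]
        // Longest alternative first: the whole string is one match.
        System.out.println(findAll("(HH12|HH)+", input));   // [HH12HH]
        // Shared prefix factored out with an optional group.
        System.out.println(findAll("(HH(?:12)?)+", input)); // [HH12HH]
    }
}
```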

The problem you are having is entirely related to the way the regex engine decides what to match.
As I explained here, there are some regex flavors that pick the longest alternation... but you're not using one. Java's regex engine is the other type: the first matching alternation is used.
Your regex works a lot like this code:
if(bool1){
// This is where `HH` matches
} else if (bool1 && bool2){
// This is where `HH12` would match, but this code will never execute
}
The best way to fix this is to order your words in reverse, so that HH12 occurs before HH.
Then, you can just match with an alternation:
HH12|HH
It should be pretty obvious what matches, since you can get the results of each match.
(You could also put each word in its own capture group, but that's a bit harder to work with.)
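For a real dictionary, the ordering can be done in code rather than by hand. Below is a rough sketch, not a definitive implementation: the class DictionaryTokenizer and its methods are hypothetical names, the words are sorted longest first and quoted with Pattern.quote, and \G keeps the tokens consecutive. It tokenizes greedily and does not backtrack across tokens, so an input that is only splittable with a shorter word at some position is reported as not matching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class DictionaryTokenizer {
    // Builds an alternation with the longest words first, so "HH12" is tried before "HH".
    static Pattern buildPattern(List<String> words) {
        String alternation = words.stream()
                .sorted(Comparator.comparingInt(String::length).reversed())
                .map(Pattern::quote) // treat every dictionary word as a literal
                .collect(Collectors.joining("|"));
        // \G forces each match to start where the previous one ended.
        return Pattern.compile("\\G(?:" + alternation + ")");
    }

    // Returns the tokens if the whole input can be split greedily, otherwise an empty list.
    static List<String> tokenize(String input, List<String> words) {
        List<String> tokens = new ArrayList<>();
        Matcher m = buildPattern(words).matcher(input);
        int end = 0;
        while (m.find()) {
            tokens.add(m.group());
            end = m.end();
        }
        return end == input.length() ? tokens : new ArrayList<>();
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("HH", "HH12");
        System.out.println(tokenize("HH12HH", dictionary)); // [HH12, HH]
        System.out.println(tokenize("HH12XX", dictionary)); // []
    }
}
```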

Related

Java regex, replace certain characters except

I have this string "u2x4m5x7" and I want to replace all the characters except a number followed by an x with "".
The output should be:
"2x5x"
Just the number followed by the x.
But I am getting this:
"2x45x7"
I'm doing this:
String string = "u2x4m5x7";
String s = string.replaceAll("[^0-9+x]","");
Please help!!!

Here is a one-liner using String#replaceAll with two replacements:
System.out.println(string.replaceAll("\\d+(?!x)", "").replaceAll("[^x\\d]", ""));
Here is another working solution. We can iterate the input string using a formal pattern matcher with the pattern \d+x. This is the whitelist approach, of trying to match the variable combinations we want to keep.
String input = "u2x4m5x7";
Pattern pattern = Pattern.compile("\\d+x");
Matcher m = pattern.matcher(input);
StringBuilder b = new StringBuilder();
while(m.find()) {
b.append(m.group(0));
}
System.out.println(b);
This prints:
2x5x

It looks like this would be much simpler by searching to get the match rather than replacing all non matches, but here is a possible solution, though it may be missing a few cases:
\d(?!x)|[^0-9x]|(?<!\d)x
https://regex101.com/r/v6udph/1
Basically it will:
\d(?!x) -- remove any digit not followed by an x
[^0-9x] -- remove all non-x/digit characters
(?<!\d)x -- remove all x's not preceded by a digit
But then again, grabbing from \dx would be much simpler
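In Java the backslashes of that pattern have to be doubled inside the string literal. A small sketch of the blacklist approach follows; the class and method names are hypothetical. It also shows one of the cases the pattern misses: in a multi-digit run such as 12x, the leading 1 is a digit not followed by x and is removed.

```java
class KeepDigitX {
    // Removes digits not followed by x, non-digit/non-x characters,
    // and x characters not preceded by a digit.
    static String keepDigitX(String input) {
        return input.replaceAll("\\d(?!x)|[^0-9x]|(?<!\\d)x", "");
    }

    public static void main(String[] args) {
        System.out.println(keepDigitX("u2x4m5x7")); // 2x5x
        System.out.println(keepDigitX("12x"));      // 2x (the leading 1 is dropped)
    }
}
```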

Capture what you need to $1 OR any character and replace with captured $1 (empty if |. matched).
String s = string.replaceAll("(\\d+x)|.", "$1");
See this demo at regex101 or a Java demo at tio.run

Java find value in a string using regex

I'm wondering about the behavior of using the matcher in java.
I have a pattern which I compiled and when running through the results of the matcher i don't understand why a specific value is missing.
My code:
String str = "star wars";
Pattern p = Pattern.compile("star war|Star War|Starwars|star wars|star wars|pirates of the caribbean|long strage trip|drone|snatched (2017)");
Matcher matcher = p.matcher(str);
while (matcher.find()) {
System.out.println("\nRegex : " + matcher.group());
}
I get hit with "star war" which is right as it is in my pattern.
But I don't get "star wars" as a hit and I don't understand why as it is part of my pattern.

The behavior is expected because alternation in NFA regex is "eager", i.e. the first match wins, and the rest of the alternatives are not even tested against. Also, note that once a regex engine finds a match in a consuming pattern (and yours is a consuming pattern, it is not a zero-width assertion like a lookahead/lookbehind/word boundary/anchor) the index is advanced to the end of the match and the next match is searched for from that position.
So, once your first star war alternative branch matches, there is no way to match star wars as the regex index is before the last s.
Just check whether the string contains each of the strings you are looking for; the simplest approach is a loop:
String str = "star wars";
String[] arr = {"star war","Star War","Starwars","star wars","pirates of the caribbean","long strage trip","drone","snatched (2017)"};
for(String s: arr){
if(str.contains(s))
System.out.println(s);
}
By the way, your regex contains snatched (2017), and it does not match ( and ), it only matches snatched 2017. To match literal parentheses, the ( and ) must be escaped. I also removed a dupe entry for star wars.
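The point about parentheses can be illustrated with Pattern.quote, which turns a string into a literal pattern so nothing has to be escaped by hand. This is only a sketch with hypothetical class and method names.

```java
import java.util.regex.Pattern;

class LiteralTitle {
    // Unescaped: "(2017)" is a capturing group, so this looks for "snatched 2017".
    static boolean containsUnescaped(String text) {
        return Pattern.compile("snatched (2017)").matcher(text).find();
    }

    // Quoted: the parentheses are matched literally.
    static boolean containsQuoted(String text) {
        return Pattern.compile(Pattern.quote("snatched (2017)")).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsUnescaped("snatched (2017)")); // false
        System.out.println(containsUnescaped("snatched 2017"));   // true
        System.out.println(containsQuoted("snatched (2017)"));    // true
    }
}
```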

A better way to build your regex would be like this:
String pattern = "[Ss]tar[\\s]{0,1}[Ww]ar[s]{0,1}";
Breaking down:
[Ss]: it will match either S or s in the first position
\s: representation of space
{0,1}: the previous character (or set) will be matched from 0 to 1 times
An alternative is:
String pattern = "[Ss]tar[\\s]?[Ww]ar[s]?";
?: the previous character (or set) will be matched once or not at all
For more information, see https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Edit 1: fixed typo (\s -> \\s). Thanks, @eugene.

You want to match the whole input sequence, so you should use Matcher.matches() or add ^ and $:
Pattern p = Pattern.compile("^(star war|Star War|Starwars|star wars|"
+ "star wars|pirates of the caribbean)$");
will print
Regex : star wars
But I agree with @NAMS: Don't build your regex like this.
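The difference between find() and matches() on the same alternation can be seen in a short sketch (the class name and the reduced two-word pattern are assumptions for illustration). find() stops at the first alternative that succeeds, while matches() has to cover the whole input and therefore backtracks into the later alternative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeInputMatch {
    private static final Pattern TITLES = Pattern.compile("star war|star wars");

    // Returns the first substring that find() reports, or null if there is none.
    static String firstFind(String input) {
        Matcher m = TITLES.matcher(input);
        return m.find() ? m.group() : null;
    }

    // True only if some alternative matches the entire input.
    static boolean matchesWhole(String input) {
        return TITLES.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(firstFind("star wars"));    // star war
        System.out.println(matchesWhole("star wars")); // true
    }
}
```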

How do I change the following pattern to find all possible matches?

I have a Java pattern here
String patternString = "(#)(.+?)([\\s,#.])";
I basically want to find all words beginning with a '#' in a given text string. The pattern matches all words except the last one if it is followed by the end of the line. I am using a hash map to store the values.
int x = 0;
HashMap<Integer, String> values = new HashMap<>();
while(matcher.find()) {
values.put(x++, matcher.group(2));
}
I have tried putting a '$' symbol in the third group to match the end of the line, but it doesn't seem to work. How do I tweak the pattern to match all words beginning with a '#', including the last word?

Unless I misunderstood your requirements, it can be much simpler. I'd suggest using the following pattern:
(#)([^\s]+)
It matches a # followed by as many non-whitespace characters as possible. The word itself is still in group 2 (group 1 is the #), so your matcher.group(2) call keeps working; only the third group is gone.
Depending on your exact requirements you can also use \w instead of [^\s] to match any word character (equivalent to [a-zA-Z0-9_]).
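As an illustration, here is a sketch that keeps the delimiters from the question (whitespace, comma, # and period) but expresses them as a negated character class, so the last word no longer needs a trailing delimiter. The class name and the single capture group are choices made for this example, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagFinder {
    // A '#' followed by one or more characters that are not a delimiter.
    private static final Pattern TAG = Pattern.compile("#([^\\s,#.]+)");

    static List<String> findTags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group(1)); // group 1 holds the word without the '#'
        }
        return tags;
    }

    public static void main(String[] args) {
        // The last tag is found even though nothing follows it.
        System.out.println(findTags("Learning #java and #regex, thanks #stackoverflow"));
        // [java, regex, stackoverflow]
    }
}
```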

Matching Exact string with regexp

I have the following string:
CLASSIC STF
CLASSIC
I am using a regexp to match the strings.
Pattern p = Pattern.compile("^CLASSIC(\\s*)$", Pattern.CASE_INSENSITIVE);
CLASSIC STF is also being displayed.
I am using m.find()
How can I make sure that only CLASSIC is displayed and not CLASSIC STF?
Thanks for helping.

If you use Matcher.find() the expression CLASSIC(\s*) will match CLASSIC STF.
Matcher.matches() will return false, however, since it requires the expression to match the entire input.
To make Matcher.find() do the same, change the expression to ^CLASSIC(\s*)$, as said by reto.

By default ^ and $ match against the beginning and end of the entire input string respectively, ignoring any newlines. I would expect that your expression would not match on the string you mention. Indeed:
String pattern = "^CLASSIC(\\s*)$";
String input = "CLASSIC STF%nCLASSIC";
Pattern p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(String.format(input));
while (m.find()) {
System.out.println(m.group());
}
prints no results.
If you want ^ and $ to match the beginning and end of every line in the string, you should enable "multiline mode". Do so by replacing line 3 of the code above with Pattern p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);. When I do so I get one result, namely: "CLASSIC".
You also asked why "CLASSIC STF" is not matched. Let's break down your pattern to see why. The pattern says: match anything that...
1. starts at the beginning of a line ~ ^
2. begins with a C, followed by an L, A, S, S, I and C ~ CLASSIC
3. after which 0 or more whitespace characters follow ~ (\s*)
4. after which we see a line ending ~ $
After matching the space in "CLASSIC STF" (step 3) we are looking at a character "S". This doesn't match a line ending (step 4), so we cannot match the regex.
Note that the parentheses in your regex are not necessary. You can leave them out.
The Javadoc of the Pattern class is very elaborate. It could be helpful to read it.
EDIT:
If you want to check if a string/line contains the word "CLASSIC" using a regex, then I'd recommend to use the regex \bCLASSIC\b. If you want to see if a string starts with the word "CLASSIC", then I'd use ^CLASSIC\b.
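The two word-boundary patterns can be tried out with a short sketch. The class and method names are hypothetical, and CASE_INSENSITIVE is carried over from the question.

```java
import java.util.regex.Pattern;

class ClassicWord {
    private static final Pattern CONTAINS_WORD =
            Pattern.compile("\\bCLASSIC\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern STARTS_WITH_WORD =
            Pattern.compile("^CLASSIC\\b", Pattern.CASE_INSENSITIVE);

    // True if CLASSIC appears anywhere as a whole word.
    static boolean containsWord(String s) {
        return CONTAINS_WORD.matcher(s).find();
    }

    // True if the string begins with the whole word CLASSIC.
    static boolean startsWithWord(String s) {
        return STARTS_WITH_WORD.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("CLASSIC STF"));  // true
        System.out.println(containsWord("CLASSICAL"));    // false
        System.out.println(startsWithWord("MY CLASSIC")); // false
    }
}
```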

I wonder if this would help:
practice = c("CLASSIC STF", "CLASSIC")
grep("^CLASSIC[[:space:]STF]?", practice)

How can I make a Java regex all or nothing?

I'm trying to make a regex all or nothing in the sense that the given word must EXACTLY match the regular expression - if not, a match is not found.
For instance, if my regex is:
^[a-zA-Z][a-zA-Z|0-9|_]*
Then I would want to match:
cat9
cat9_
bob_____
But I would NOT want to match:
cat7-
cat******
rango78&&
I want my regex to be as strict as possible, going for an all or nothing approach. How can I go about doing that?
EDIT: To make my regex absolutely clear, a pattern must start with a letter, followed by any number of numbers, letters, or underscores. Other characters are not permitted. Below is the program in question I am using to test out my regex.
Pattern p = Pattern.compile("^[a-zA-Z][a-zA-Z|0-9|_]*");
Scanner in = new Scanner(System.in);
String result = "";
while(!result.equals("-1")){
result = in.nextLine();
Matcher m = p.matcher(result);
if(m.find())
{
System.out.println(result);
}
}

I think that if you use String.matches(regex), then you will get the effect you are looking for. The documentation says that matches() will return true only if the entire string matches the pattern.

The regex won't match the second example. It's already strict, since * and & are not in the allowed set of characters.
It may match a prefix, but you can avoid this by adding '$' to the end of the regex, which explicitly matches end of input. Also note that | is a literal character inside a character class, so [a-zA-Z|0-9|_] additionally allows |; drop the pipes. So try:
^[a-zA-Z][a-zA-Z0-9_]*$
This will ensure the match is against the entire input string, and not just a prefix.

Note that \w is the same as [A-Za-z0-9_]. And you need to anchor to the end of the string like so:
Pattern p = Pattern.compile("^[a-zA-Z]\\w*$");
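A quick way to check the all-or-nothing behaviour against the examples from the question is Matcher.matches(), which already requires the whole input to match, so the anchors become optional. The sketch below uses a hypothetical class name.

```java
import java.util.regex.Pattern;

class IdentifierCheck {
    // A letter followed by any number of letters, digits or underscores.
    private static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z]\\w*");

    // matches() succeeds only if the entire string fits the pattern.
    static boolean isValid(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"cat9", "cat9_", "bob_____", "cat7-", "cat******", "rango78&&"}) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

The first three inputs are accepted and the last three are rejected.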
