I'm writing a Java application where a user can reduce a List of strings based on a filter that the user supplies.
So for example, the user could enter a filter such as:
ABC*xyz
This means the user is looking for strings that start with ABC and have xyz somewhere after it (the same as searching for ABC*xyz*).
Another example of a filter the user could enter is:
*DEF*mno*rst
This means that the string can start with anything, but it must then follow with DEF, followed by mno, followed by rst.
How would I write the Java code to generate the regular expression I need to check whether my strings match the filter the user has specified?
If you convert your syntax to a regex, which is the "easy" way to do this (it avoids writing a lexer yourself), you must remember to escape the string appropriately.
So if you go down this route, quote the parts of your syntax that aren't wildcards and join them with the regex .* (or .+ if you want your * to mean "at least one character"). This avoids incorrect results when the input contains *, ., (, ) or any of the other regex special characters.
Try something like:
public Pattern createPatternFromSearch(String query) {
StringBuilder sb = new StringBuilder();
for (String part : query.split("\\*")) {
if (part.length() > 0) {
sb.append(Pattern.quote(part));
}
sb.append(".*");
}
return Pattern.compile(sb.toString());
}
// ...
// then you can use it like....
Matcher matcher = createPatternFromSearch("*DEF*mno*rst").matcher(str);
if (matcher.matches()) {
// process the matching result
}
Note that by using Matcher#matches() (not find) and leaving the trailing .*, it will cater for your syntax that is anchored at the start only.
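To tie this back to the original goal of reducing a List of strings, here is a minimal sketch that applies the same quote-and-join idea to a whole list. The class and method names (WildcardFilter, toPattern, filter) are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class WildcardFilter {
    // Quote every literal chunk and put ".*" after each one,
    // so the pattern is anchored at the start only.
    static Pattern toPattern(String query) {
        StringBuilder sb = new StringBuilder();
        for (String part : query.split("\\*")) {
            if (!part.isEmpty()) {
                sb.append(Pattern.quote(part));
            }
            sb.append(".*");
        }
        return Pattern.compile(sb.toString());
    }

    // Keep only the items that match the user's filter.
    static List<String> filter(List<String> items, String query) {
        Pattern p = toPattern(query);
        List<String> kept = new ArrayList<>();
        for (String item : items) {
            if (p.matcher(item).matches()) {
                kept.add(item);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("ABC123xyz789", "xABCxyz", "ABC.xyz", "ABCxy");
        System.out.println(filter(items, "ABC*xyz")); // [ABC123xyz789, ABC.xyz]
    }
}
```

One edge case to be aware of with the split-based loop: a query consisting of just * splits into an empty array, so the generated pattern is empty and only matches the empty string. If that filter should mean "everything", it probably needs to be special-cased.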
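Replace * with .* and you have your regular expression. Use String#replace for this, because replaceAll treats its first argument as a regex and a bare * is not a valid pattern.
String str = "*DEF*mno*rst";
String regex = str.replace("*", ".*");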
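A small sketch of why a plain one-step replacement can surprise you when the filter contains other regex metacharacters, compared with quoting the literal parts. The names NaiveVersusQuoted, naiveMatches and quotedMatches are hypothetical, chosen only for this comparison.

```java
import java.util.regex.Pattern;

class NaiveVersusQuoted {
    // Only '*' is rewritten; every other character stays live regex syntax.
    static boolean naiveMatches(String filter, String candidate) {
        return candidate.matches(filter.replace("*", ".*"));
    }

    // Literal chunks are quoted, and ".*" is inserted between them.
    static boolean quotedMatches(String filter, String candidate) {
        StringBuilder regex = new StringBuilder();
        String[] parts = filter.split("\\*", -1); // -1 keeps trailing empty parts
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return candidate.matches(regex.toString());
    }

    public static void main(String[] args) {
        System.out.println(naiveMatches("1.5*", "125 mm"));  // true: '.' matched the '2'
        System.out.println(quotedMatches("1.5*", "125 mm")); // false
        System.out.println(quotedMatches("1.5*", "1.5 mm")); // true
    }
}
```

With the unquoted version, a filter such as a(b* would not just give a wrong answer but throw a PatternSyntaxException, because the parenthesis is never closed.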
I am having trouble writing a regex for strings such as NotificationGroup_n+En, where n is a number from 1 to 4. When the number falls in that range, I want to remove it together with the plus sign.
String BEFORE processing: NotificationGroup_4+E3
String AFTER processing: NotificationGroup_E3
I removed n (a number from 1 to 4) and kept _E followed by its number.
My question is how to write the regex for the string replace function so that it matches the number and then the plus sign, and leaves only the string with _En.
def String string = "Notification_Group_4+E3";
println(removeChar(string));
}
public static def removeChar(String string) {
if ((string.contains("1+"))||(string.contains("2+")||(string.contains("3+"))||(string.contains("4+")))) {
def stringReplaced = string.replace('4+', "");
return stringReplaced;
}
}
in groovy:
def result = "Notification_Group_4+E3".replaceFirst(/_\d\+(.*)/, '_$1')
println result
output:
~> groovy solution.groovy
Notification_Group_E3
~>
Regex explanation:
we use groovy slashy strings /.../ to define the regex. This makes escaping simpler
we first match on underscore _
Then we match on a single digit (0-9) using the predefined character class \d as described in the javadoc for the java Pattern class.
We then match for one + character. We have to escape this with a backslash \ since + without escaping in regular expressions means "one or more" (see greedy quantifiers in the javadocs) . We don't want one or more, we want just a single + character.
We then create a regex capturing group as described in the logical operators part of the java Pattern regex using the parens expression (.*). We do this so that we are not locked into the input string ending with E3. This way the input string can end in an arbitrary string and the pattern will still work. This essentially says "capture a group and include any character (that is the . in regex) any number of times (that is the * in regex)" which translates to "just capture the rest of the line, whatever it is".
Finally we replace with _$1, i.e. just underscore followed by whatever the capturing group captured. The $1 is a "back reference" to the "first captured group" as documented in, for example, the java Matcher javadocs.
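For anyone doing this from plain Java rather than Groovy, the same replacement could look roughly like the sketch below. The backslashes are doubled because Java has no slashy strings. The class name and the second variant, which narrows \d to [1-4] to reflect the 1-4 range mentioned in the question, are my own additions.

```java
class StripGroupIndex {
    // Same regex as the Groovy version, written with Java string escapes.
    static String stripAnyDigit(String s) {
        return s.replaceFirst("_\\d\\+(.*)", "_$1");
    }

    // Variant that only removes the digits 1 to 4.
    static String stripOneToFour(String s) {
        return s.replaceFirst("_[1-4]\\+(.*)", "_$1");
    }

    public static void main(String[] args) {
        System.out.println(stripAnyDigit("Notification_Group_4+E3"));  // Notification_Group_E3
        System.out.println(stripOneToFour("Notification_Group_7+E3")); // unchanged: 7 is out of range
    }
}
```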
Try this regex: \d.*?\+
in java :
String string = "Notification_Group_4+E3";
System.out.print(string.replaceAll("\\d.*?\\+", ""));
output :
Notification_Group_E3
The simple one-liner:
String res = 'Notification_Group_4+E3'.replaceAll( /_\d+\+/, '_' )
assert 'Notification_Group_E3' == res
I want to create a regex that breaks a string into words from a dictionary. If the string matches, I can iterate over each group and make some changes. Some of the words are prefixes of others. However, a regex like /(HH|HH12)+/ will not match the string HH12HH. What's wrong with the regex? Shouldn't it match the first HH12 and then the HH in the string?
You want to match an entire string in Java that should only contain HH12 or HH substrings. It is much easier to do in 2 steps: 1) check if the string meets the requirements (here, with matches("(?:HH12|HH)+")), 2) extract all tokens (here, with HH12|HH or HH(?:12)?, since the first alternative in an unanchored alternation group "wins" and the rest are not considered).
String str = "HH12HH";
Pattern p = Pattern.compile("HH12|HH");
List<String> res = new ArrayList<>();
if (str.matches("(?:HH12|HH)+")) { // If the whole string consists of the defined values
Matcher m = p.matcher(str);
while (m.find()) {
res.add(m.group());
}
}
System.out.println(res); // => [HH12, HH]
An alternative is a regex that will check if a string meets the requirements with a lookahead at the beginning, and then will match consecutive tokens with a \G operator:
String str = "HH12HH";
Pattern p = Pattern.compile("(\\G(?!^)|^(?=(?:HH12|HH)+$))(?:HH12|HH)");
List<String> res = new ArrayList<>();
Matcher m = p.matcher(str);
while (m.find()) {
res.add(m.group());
}
System.out.println(res);
Details:
(\\G(?!^)|^(?=(?:HH12|HH)+$)) - the end of the previous successful match (\\G(?!^)) or (|) start of string (^) that is followed with 1+ sequences of HH12 or HH ((?:HH12|HH)+) up to the end of string ($)
(?:HH12|HH) - either HH12 or HH.
In the string HH12HH, the regex (HH|HH12)+ will work this way:
HH12HH
^ - both options work, continue
HH12HH
^ - First condition is entirely satisfied, mark it as a match
HH12HH
^ - No Match
HH12HH
^ - No Match
Since you set the A flag, which anchors the pattern at the start of the string, the rest will not produce a match. If you remove it, the pattern will match both the HH at the start and the HH at the end.
In this case, you have three options:
Put the longest pattern first: /(HH12|HH)/Ag. This is the one I prefer.
Factor out the shared part and use an optional group: /(HH(?:12)?)/Ag
Put a $ at the end, like so: /(HH|HH12)$/Ag
The problem you are having is entirely related to the way the regex engine decides what to match.
There are some regex flavors that pick the longest alternation, but you're not using one. Java's regex engine is the other type: the first matching alternation is used.
Your regex works a lot like this code:
if(bool1){
// This is where `HH` matches
} else if (bool1 && bool2){
// This is where `HH12` would match, but this code will never execute
}
The best way to fix this is to order your words so that longer words come before their prefixes, i.e. HH12 occurs before HH.
Then, you can just match with an alternation:
HH12|HH
It should be pretty obvious what matches, since you can get the results of each match.
(You could also put each word in its own capture group, but that's a bit harder to work with.)
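For a dictionary with more than two words, the ordering can be done in code. The following is only a sketch under the assumption that sorting by descending length is good enough for your word list. DictionaryTokenizer, alternation and tokenize are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class DictionaryTokenizer {
    // Build an alternation such as HH12|HH: longer words first, each word quoted.
    static String alternation(Collection<String> words) {
        return words.stream()
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
    }

    // Return the tokens if the whole input consists of dictionary words, else an empty list.
    static List<String> tokenize(String input, Collection<String> words) {
        String alt = alternation(words);
        List<String> tokens = new ArrayList<>();
        if (!input.matches("(?:" + alt + ")+")) {
            return tokens;
        }
        Matcher m = Pattern.compile(alt).matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("HH12HH", Arrays.asList("HH", "HH12"))); // [HH12, HH]
        System.out.println(tokenize("HH13", Arrays.asList("HH", "HH12")));   // []
    }
}
```

A caveat with this approach: the whole-string check is allowed to backtrack, while the find() loop always takes the longest word at each position. For dictionaries where words overlap in awkward ways, the two can disagree, so the tokens are worth double-checking (for example by comparing their total length with the input length).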
I am trying to extract a special sequence out of a String using the following Regular Expression:
[(].*[)]
My Pattern should only match if the String contains () with text between them.
Somehow, if I create a new Pattern using Pattern#compile(myString) and then match the String using Matcher matcher = myPattern.matcher(); it doesn't find anything, even though I tried it on regexr.com and it worked there.
My Pattern is a static final Pattern object in another class (I directly used Pattern#compile(myString)).
Example String to match:
save (xxx,yyy)
The likely problem here is your quantifier.
Since you're using greedy * with a combination of . for any character, your match will not delimit correctly as . will also match closing ).
Try using reluctant [(].*?[)].
See quantifiers in docs.
You can also escape the parentheses instead of using custom character classes, like so: \\( and \\), but that has nothing to do with your issue.
Also note (thanks esprittn):
The * quantifier will match 0+ characters, so if you want to restrict your matches to non-empty parentheses, use .+? instead. That guarantees at least one character inside your parentheses.
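To make the greedy versus reluctant difference concrete, here is a short sketch that runs both patterns over an input containing two parenthesised groups. The input string and the helper name findAll are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVersusReluctant {
    // Collect every match of the regex. find() scans for the pattern anywhere
    // in the input, whereas matches() would require the entire input to match.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "save (a,b) then load (c,d)";
        System.out.println(findAll("[(].*[)]", input));  // [(a,b) then load (c,d)]
        System.out.println(findAll("[(].*?[)]", input)); // [(a,b), (c,d)]
    }
}
```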
Hope the code below helps: it extracts the data between '(' and ')', including the parentheses themselves.
String pattern = "\\(.*\\)";
String line = "save(xx,yy)";
Pattern TokenPattern = Pattern.compile(pattern);
Matcher m = TokenPattern.matcher(line);
while (m.find()) {
int start = m.start(0);
int end = m.end(0);
System.out.println(line.substring(start, end));
}
To drop the brackets, change 'start' to 'start + 1' and 'end' to 'end - 1' so that the substring bounds exclude them.
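As an alternative to adjusting the indexes by hand, a capturing group can return the inner text directly. This is just a sketch, and it swaps in a reluctant .*? so that several bracket pairs on one line are reported separately. InsideParentheses and extract are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InsideParentheses {
    // Group 1 captures only what sits between '(' and the next ')'.
    private static final Pattern INSIDE = Pattern.compile("\\((.*?)\\)");

    static List<String> extract(String line) {
        List<String> contents = new ArrayList<>();
        Matcher m = INSIDE.matcher(line);
        while (m.find()) {
            contents.add(m.group(1)); // group(0) would include the brackets
        }
        return contents;
    }

    public static void main(String[] args) {
        System.out.println(extract("save(xx,yy)")); // [xx,yy]
    }
}
```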
I am trying to split a line with regex by using groups, but it's not working as I expected.
I want to match for example this line:
Ex. #1: temp name(this is the data)
and also this:
Ex. #2: temp name()
I used this regex:
[\s]*temp[\s]+[\s]*([A-Za-z]+)[\s]*[(]\s*(.*)+[)]\s*[{]\s*
which means: grab anything that starts with temp then put in group #1 the "name" then grab whatever inside the bracket and put it in group #2.
However, group #2 is always empty.
This is my code to fetch the data:
Pattern PATTERN = Pattern.compile("[\\s]*temp[\\s]+[\\s]*([A-Za-z]+)[\\s]*[(]\\s*(.*)+[)]\\s*");
Matcher m = PATTERN.matcher("temp name(this is the data)");
m.matches();
String name = m.group(1);
String data = m.group(2); // always empty
What am I doing wrong?
Your pattern doesn't match because it requires an open curly brace at the end, but your input doesn't have one.
Ignoring that small problem, the main problem is the little + after your capture group (.*)+. The plus requires one or more matches of .* and the group returned is the last match of the many. The term .* is greedy, so it consumes everything up to the bracket. The only way to match again is to consume nothing. So the last match of group 2 is blank.
To fix it, remove the + after group 2:
Pattern PATTERN = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
Note also how I removed other unnecessary characters from your regex, e.g. the single-character character classes: [\\s] is identical to \\s. And \\s+\\s* is identical to just \\s+, because + is greedy.
I also removed the trailing curly bracket, which you can restore if your input data actually has it (your question showed input of "temp name(this is the data)", which has no trailing curly bracket).
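If you want to see the effect side by side, the sketch below runs the question's pattern (without the trailing brace) and the fixed one against the same input and prints group 2 for each. The class and helper names are made up for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Returns group 2 when the whole input matches, otherwise null.
    static String group2(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String repeated = "\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)+[)]\\s*";
        String fixed = "\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*";
        String input = "temp name(this is the data)";

        // The repeated group keeps only its last iteration, which is empty.
        System.out.println("[" + group2(repeated, input) + "]"); // []
        System.out.println("[" + group2(fixed, input) + "]");    // [this is the data]
    }
}
```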
Your regex should be this:
Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
You had (.*)+ which means one or more matches of .*. This results in nothing being captured.
Testing:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Example {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
Matcher m = pattern.matcher("temp name(this is the data)");
if(m.matches()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
}
}
}
Output:
name
this is the data
[\s] is equivalent with \s
[\s]+[\s]* is equivalent with \s+
[(] is equivalent with \( (same for [)] and [}])
This would leave your regexp as:
\s*temp\s+([A-Za-z]+)\s*\(\s*(.*)+\)\s*\{\s*
Assuming you actually want to match temp name(...) { (your regexp is looking for a {, while in your question you do not specify that):
(.*)+ is your problem. You're saying: "Match any number (including 0) of characters and put them in a capture group, repeat that at least once".
Regexps are greedy by default (= they consume as much as possible), so the capture group will first contain everything within the two brackets, then the + will try to match the entire group again, and will match it with "" (the empty string) as this fulfils the capture group's pattern. This will leave your capture group empty.
What you want instead is \s*temp\s+([A-Za-z]+)\s*\(\s*(.*)\)\s*\{\s*
The reason you are getting empty groups is because you are creating multiple capture groups every time you put something between (), even if it is nested.
To make a group that doesn't capture, designate it as a non-capturing group by using ?:. For example, (?:sometest(this is the value we want)) will return just one group, while (sometest(this is the value we want)) will return 2 groups.
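A quick way to check how many capturing groups a pattern defines is Matcher#groupCount(). The sketch below uses shortened stand-in patterns, sometest(\d+) with and without an outer group, rather than the exact text above. The class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {
    // Number of capturing groups defined by the pattern (group 0 is not counted).
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groups("(sometest(\\d+))"));   // 2
        System.out.println(groups("(?:sometest(\\d+))")); // 1

        // With the non-capturing outer group, the value we want is group 1.
        Matcher m = Pattern.compile("(?:sometest(\\d+))").matcher("sometest42");
        if (m.matches()) {
            System.out.println(m.group(1)); // 42
        }
    }
}
```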
For your particular regex, I have refined and simplified it, as you had capture groups you did not need.
Simple solution:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*\\{\\s*
given the input:
Ex. #1: temp name(this is the data) {
Ex. #2: temp name() {
$1 = name, $2 = data
Pay attention to the fact that your regex contains a trailing curly brace. You can modify the regex to match without it and it will result in this:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*
https://regex101.com/r/tD0tO0/1
I'm trying to make a regex all or nothing in the sense that the given word must EXACTLY match the regular expression - if not, a match is not found.
For instance, if my regex is:
^[a-zA-Z][a-zA-Z|0-9|_]*
Then I would want to match:
cat9
cat9_
bob_____
But I would NOT want to match:
cat7-
cat******
rango78&&
I want my regex to be as strict as possible, going for an all or nothing approach. How can I go about doing that?
EDIT: To make my regex absolutely clear, a pattern must start with a letter, followed by any number of numbers, letters, or underscores. Other characters are not permitted. Below is the program in question I am using to test out my regex.
Pattern p = Pattern.compile("^[a-zA-Z][a-zA-Z|0-9|_]*");
Scanner in = new Scanner(System.in);
String result = "";
while(!result.equals("-1")){
result = in.nextLine();
Matcher m = p.matcher(result);
if(m.find())
{
System.out.println(result);
}
}
I think that if you use String.matches(regex), then you will get the effect you are looking for. The documentation says that matches() will return true only if the entire string matches the pattern.
The regex won't match the second example. It's already strict, since * and & are not in the allowed set of characters.
It may match a prefix, but you can avoid this by adding '$' to the end of the regex, which explicitly matches end of input. So try,
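^[a-zA-Z][a-zA-Z0-9_]*$
This will ensure the match is against the entire input string, and not just a prefix. The | characters are dropped from the character class because inside [...] a | is a literal pipe rather than alternation, so the original class would also accept input such as cat|9.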
Note that \w is the same as [A-Za-z0-9_]. And you need to anchor to the end of the string like so:
Pattern p = Pattern.compile("^[a-zA-Z]\\w*$");
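To check the behaviour against the examples from the question, here is a small sketch contrasting matches(), which must consume the whole input, with find(), which is satisfied by any matching stretch. IdentifierCheck, isValid and containsMatch are names I picked for the example.

```java
import java.util.regex.Pattern;

class IdentifierCheck {
    // A letter followed by any number of letters, digits or underscores.
    private static final Pattern IDENT = Pattern.compile("[a-zA-Z]\\w*");

    // All or nothing: the entire string has to fit the pattern.
    static boolean isValid(String s) {
        return IDENT.matcher(s).matches();
    }

    // What the find()-based loop effectively tests: some part of the string fits.
    static boolean containsMatch(String s) {
        return IDENT.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"cat9", "cat9_", "bob_____", "cat7-", "cat******", "rango78&&"}) {
            System.out.println(s + " -> matches: " + isValid(s) + ", find: " + containsMatch(s));
        }
    }
}
```

Because matches() already requires the whole input to match, the ^ and $ anchors are optional in this variant. They matter when you stay with find().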