I would like to enforce that 4 different characters will be in a string.
Valid examples:
"1q2w3e4r5t"
"abcd"
Invalid examples:
"good"
"1ab1"
Ideas for a pattern?
You should consider using a non-regex solution. I only write this answer to show a simpler regex solution for this problem.
Initial solution
Here is a simpler regex solution, which asserts that there are at least 4 distinct characters in the string:
(.).*?((?!\1).).*?((?!\1|\2).).*?((?!\1|\2|\3).).*
Demo on regex101 (PCRE and Java have the same behavior for this regex)
.*?((?!\1).), .*?((?!\1|\2).), ... each search for the next character that has not appeared before. This is implemented by checking that the character differs from whatever the previous capturing groups captured.
Logically, the laziness/greediness of the quantifier doesn't matter here. The lazy quantifier .*? is used to make the search start from the closest character that has not appeared before, rather than from the furthest one. It should slightly improve performance in the matching case, since less backtracking is needed.
Used with String.matches(), which asserts that the whole string matches the regex:
input.matches("(.).*?((?!\\1).).*?((?!\\1|\\2).).*?((?!\\1|\\2|\\3).).*")
Improved solution
If you are concerned about performance:
(.)(?>.*?((?!\1).))(?>.*?((?!\1|\2).))(?>.*?((?!\1|\2|\3).)).*
Demo on regex101
With String.matches():
input.matches("(.)(?>.*?((?!\\1).))(?>.*?((?!\\1|\\2).))(?>.*?((?!\\1|\\2|\\3).)).*")
The (?>pattern) construct prevents backtracking into the group once you exit from the pattern inside. This is used to "lock" the capturing groups to the first appearance of each distinct character, since the result is the same even if you pick a different character later in the string.
This regex behaves the same as a normal program which loops from left-to-right, checks the current character against a set of distinct characters and adds it to the set if the current character is not in the set.
Due to this reason, the lazy quantifier .*? becomes significant, since it searches for the closest character which has not appeared so far.
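For comparison, the left-to-right loop that the regex mimics can be sketched in plain Java. This is only a minimal sketch: the class and method names (DistinctChars, hasAtLeastDistinct) are made up for illustration, and unlike . in the regex it also counts line terminators as characters.

```java
import java.util.HashSet;
import java.util.Set;

final class DistinctChars {
    // Hypothetical helper: true if `input` contains at least `required` distinct chars.
    static boolean hasAtLeastDistinct(String input, int required) {
        Set<Character> seen = new HashSet<>();
        for (int i = 0; i < input.length(); i++) {
            // add() returns false when the char is already in the set
            if (seen.add(input.charAt(i)) && seen.size() >= required) {
                return true; // stop early: the rest of the string cannot change the answer
            }
        }
        return seen.size() >= required;
    }

    public static void main(String[] args) {
        System.out.println(hasAtLeastDistinct("1q2w3e4r5t", 4)); // true
        System.out.println(hasAtLeastDistinct("good", 4));       // false
    }
}
```

The early return plays the same role as the atomic groups: once a distinct character has been recorded, it is never reconsidered.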
You can use a regular expression to validate this: each negative look-ahead checks that the newly captured character differs from the characters captured before it, so the four groups end up holding 4 distinct characters.
I'd say it is very ugly, but working:
String rx = "^(.).*?((?!\\1).).*?((?!\\1|\\2).).*?((?!\\1|\\2|\\3).).*?$";
See demo
IDEONE Demo
String re = "^(.).*?((?!\\1).).*?((?!\\1|\\2).).*?((?!\\1|\\2|\\3).).*?$";
// Good
System.out.println("1q2w3e4r5t".matches(re));
System.out.println("goody".matches(re));
System.out.println("gggoooggoofr".matches(re));
// Bad
System.out.println("good".matches(re));
System.out.println("1ab1".matches(re));
Output:
true
true
true
false
false
You can count the number of distinct chars like this:
String s = "abcdefaa";
long numDistinctChars = s.chars().distinct().count();
Or if not on Java 8 (I couldn't come up with something better):
Set<Character> set = new HashSet<>();
char[] charArray = s.toCharArray();
for (char c : charArray) {
set.add(Character.valueOf(c));
}
int numDistinctChars = set.size();
Related
I am looking for a way to match an optional ABC in the following strings.
Both strings should be matched either way, if ABC is there or not:
precedingstringwithundefinedlenghtABCsubsequentstringwithundefinedlength
precedingstringwithundefinedlenghtsubsequentstringwithundefinedlength
I've tried
.*(ABC).*
which doesn't work for an optional ABC. Making the ABC group optional doesn't work either, as the leading greedy .* consumes the whole string:
.*(ABC)?.*
This is NOT a duplicate of e.g. "Regex Match all characters between two strings", as I am looking for a certain string in between two arbitrary strings, kind of the other way around.
You can use
.*(ABC).*|.*
This works like this:
.*(ABC).* - this alternative is tried first, since it is the leftmost part of the alternation (see "Remember That The Regex Engine Is Eager"). It matches zero or more chars other than line break chars, as many as possible, then captures ABC into Group 1, and then matches the rest of the line with the right-hand .*
| - or
.* - is searched for if the first alternation part does not match.
Another solution without the need to use alternation:
^(?:.*(ABC))?.*
See this regex demo. Details:
^ - start of string
(?:.*(ABC))? - an optional non-capturing group that matches zero or more chars other than line break chars as many as possible and then captures into Group 1 an ABC char sequence
.* - zero or more chars other than line break chars as many as possible.
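A small Java sketch of how the optional group behaves in practice. The class name OptionalAbc, the method captureAbc and the shortened sample strings are assumptions made for illustration; the pattern is the one explained above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OptionalAbc {
    private static final Pattern P = Pattern.compile("^(?:.*(ABC))?.*");

    // Returns "ABC" when it occurs in the line, or null when the
    // optional group did not participate in the match.
    static String captureAbc(String line) {
        Matcher m = P.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(captureAbc("precedingABCsubsequent")); // ABC
        System.out.println(captureAbc("precedingsubsequent"));    // null
    }
}
```

Both inputs match as a whole; only the content of Group 1 tells you whether ABC was present.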
I’ve come up with an answer myself:
Using the OR operator seems to work:
(?:(?:.*(ABC))|.*).*
If there’s a better way, feel free to answer and I will accept it.
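You could use this regex: .*(ABC){0,1}.*. It means any, optional{min,max}, any. It is easier to read. Note, however, that {0,1} is equivalent to ?, so the greedy leading .* still consumes ABC and Group 1 stays empty, just as with .*(ABC)?.* from the question. I can't say whether your solution or mine is faster.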
Options:
{value} = n-times
{min,} = min to infinity
{min,max} = min to max
.+([ABC])?.+ should do the job
I want to replace only the numeric sections of a string. In most cases it's either a full URL or part of a URL, but it can be just a normal string as well.
/users/12345 becomes /users/XXXXX
/users/234567/summary becomes /users/XXXXXX/summary
/api/v1/summary/5678 becomes /api/v1/summary/XXXX
http://example.com/api/v1/summary/5678/single becomes http://example.com/api/v1/summary/XXXX/single
Notice that I am not replacing 1 from /api/v1
So far, I only have the following, which seems to work in most cases:
input.replaceAll("/[\\d]+$", "/XXXXX").replaceAll("/[\\d]+/", "/XXXXX/");
But this has 2 problems:
The replacement size doesn't match with the original string length.
The replacement character is hardcoded.
Is there a better way to do this?
In Java you can use:
str = str.replaceAll("(/|(?!^)\\G)\\d(?=\\d*(?:/|$))", "$1X");
RegEx Demo
RegEx Details:
\G asserts position at the end of the previous match or the start of the string for the first match.
(/|(?!^)\\G): Match / or end of the previous match (but not at start) in capture group #1
\\d: Match a digit
(?=\\d*(?:/|$)): Ensure that digits are followed by a / or end.
Replacement: $1X: replace it with capture group #1 followed by X
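As a quick check, the replacement can be run against the examples from the question. The wrapper class MaskDigits and its mask method are hypothetical names; the regex and replacement string are the ones given above.

```java
final class MaskDigits {
    // Replaces every digit of a purely numeric path segment with X.
    static String mask(String str) {
        return str.replaceAll("(/|(?!^)\\G)\\d(?=\\d*(?:/|$))", "$1X");
    }

    public static void main(String[] args) {
        System.out.println(mask("/users/12345"));          // /users/XXXXX
        System.out.println(mask("/users/234567/summary")); // /users/XXXXXX/summary
        System.out.println(mask("/api/v1/summary/5678"));  // /api/v1/summary/XXXX
    }
}
```

The 1 in /api/v1 is left alone because it is neither directly after a / nor directly after a previous match.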
Not a Java guy here, but the idea should be transferable. Just capture a /, the digits and optionally a /, count the length of the second group and put that many replacement characters back.
So
(/)(\d+)(/?)
becomes
$1XYZ$3
See a demo on regex101.com and this answer for a lambda equivalent to e.g. Python or PHP.
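In Java, the "count the length of the group" idea could look roughly like the sketch below. It assumes Java 11 or later (Matcher.replaceAll with a function, String.repeat). The names SegmentMasker and mask are hypothetical, and the pattern swaps the optional trailing (/?) for a look-ahead so that two consecutive numeric segments are both replaced; that change is my assumption, not part of the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SegmentMasker {
    // "/" followed by digits only, up to the next "/" or the end of the input.
    private static final Pattern NUMERIC_SEGMENT = Pattern.compile("(/)(\\d+)(?=/|$)");

    // Hypothetical helper: masks each numeric segment with as many
    // maskChar characters as the segment has digits.
    static String mask(String input, char maskChar) {
        return NUMERIC_SEGMENT.matcher(input).replaceAll(m ->
                // quoteReplacement keeps '$' or '\' mask characters literal
                Matcher.quoteReplacement(
                        m.group(1) + String.valueOf(maskChar).repeat(m.group(2).length())));
    }

    public static void main(String[] args) {
        System.out.println(mask("/users/12345", 'X'));         // /users/XXXXX
        System.out.println(mask("/api/v1/summary/5678", '#')); // /api/v1/summary/####
    }
}
```

This addresses both problems from the question: the replacement has the same length as the digits, and the replacement character is a parameter.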
First of all, you need something like this:
String new_s1 = s3.replaceAll("(\\/)(\\d)+(\\/)?", "$1XXXXX$3");
I am trying to use a multi-line regex to match all wildcards in a given source string. These strings can be in excess of 70,000 lines and each item is separated by a new line.
I seem to be experiencing huge processing times for my current regex and I can only assume that this is because it is probably poorly constructed and inefficient. If I execute the code on my phone it seems to run for an eternity.
My current regex:
(?im)(?=^(?:\*|.+\*$))^(?:\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+(?:[a-z0-9]+)(?:[.-]?\*)?$
Valid wildcard examples:
*test.com
*.test.com
*test
test.*
test*
*test*
I compile the pattern with:
private static final String WILDCARD_PATTERN = "(?im)(?=^(?:\\*|.+\\*$))^(?:\\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\\.)?)+(?:[a-z0-9]+)(?:[.-]?\\*)?$";
private static final Pattern wildcard_r = Pattern.compile(WILDCARD_PATTERN);
I look for matches with:
// Wildcards
while (wildcardPatternMatch.find()) {
String wildcard = wildcardPatternMatch.group();
myProperty.add(new property(wildcard, providerId));
System.out.println(wildcard);
}
Are there any changes I can make to improve / optimise the regex or do I need to look at running .replaceAll several times to remove all of the clutter before passing for regex matching?
The pattern you need is
(?im)^(?=\*|.+\*$)(?:\*[.-]?)?[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*(?:[.-]?\*)?$
See the regex demo
Main points:
The first lookahead should be after ^. If it is before, the check is done before and after each char in the string. Once it is after ^, it is only performed once at the start of a line
The (?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+ part, although short, is actually killing performance: since (?:(?<!-)\.)? is an optional pattern, the whole pattern reduces to (a+)+, a known type of pattern that causes catastrophic backtracking when there are other subpatterns to the right of it. You need to unroll it; the best "linear" way is [a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*.
The rest is OK.
Details
(?im) - case insensitive and multiline modifiers
^ - start of a line
(?=\*|.+\*$) - the string should either start or end with *
(?:\*[.-]?)? - an optional substring matching a * and an optional . or - char
[a-z0-9](?:[a-z0-9-]*[a-z0-9])? - an alphanumeric char followed with an optional sequence of any 0+ alphanumeric chars or/and - followed with an alphanumeric char
(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)* - 0 or more sequences of a dot followed with the pattern described above
(?:[.-]?\*)? - an optional substring matching an optional . or - char and then a *
$ - end of a line.
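Put together in Java, the pattern described above might be used as in the following sketch. The class name WildcardScanner, the method findWildcards and the sample input are assumptions for illustration; only the pattern itself comes from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WildcardScanner {
    // Same pattern as above, split into its parts for readability.
    private static final Pattern WILDCARD = Pattern.compile(
            "(?im)^(?=\\*|.+\\*$)(?:\\*[.-]?)?"
            + "[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
            + "(?:\\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*"
            + "(?:[.-]?\\*)?$");

    // Collects every line of `source` that is a valid wildcard entry.
    static List<String> findWildcards(String source) {
        List<String> result = new ArrayList<>();
        Matcher m = WILDCARD.matcher(source);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "*test.com\nexample.com\ntest.*\n*.test.com\n-bad*";
        System.out.println(findWildcards(source)); // [*test.com, test.*, *.test.com]
    }
}
```

Lines without a leading or trailing * are rejected by the look-ahead right at the line start, so the engine does not scan them character by character.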
I'd suggest taking a look at https://en.wikipedia.org/wiki/ReDoS#Evil_regexes
Your regex contains a repeated pattern:
(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+
Just as a quick example of how this might slow it down, take a look at the processing time on these two examples: exact matches versus having extra characters at the end and even worse, that set repeated several times
Edit: Another good reference
So I have a bonus task assigned and it asks to write a program which returns true if in a given string at least one character is repeated.
I am relatively new to regular expressions but to my knowledge this should work:
String input = "wool";
return input.matches(".*(.)/1+.*");
This should return true, because the '.*' at the beginning and the end express that there could be prefixes or suffixes. And the '(.)/1+' is a repeating pattern of any character.
As I said I'm relatively new to the regex stuff but I'm very interested in learning and understanding it.
Almost perfect, just / looks the wrong way around (should be \).
Also, when searching with Matcher.find() you don't need .* for prefixes and suffixes: find() locates a match anywhere in the string, so (.)\1 suffices. String.matches(), however, has to match the whole string, so the surrounding .* must stay in that case.
One more issue is that backslashes are special characters in Java strings, so when you write a regexp in Java, you need to double up on backslashes. This gives you:
return input.matches(".*(.)\\1.*");
EDIT: I forgot, you don't need + because if something repeats 3 times, it also repeats 2 times, so you will find it just by searching for a two-character repetition. Again, not an error, just not needed here.
And Kita has a good point that your task is not well-defined, as it does not say whether you are looking for the repeating characters next to each other or anywhere in the string. My solution is for the adjacent characters; if you need the repetition anywhere, use his.
EDIT2 after comments: Forgot the semantics of .matches. You guys are quite correct, edited appropriately.
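The difference between the two ways of applying the pattern can be sketched as follows; the class and method names (AdjacentRepeat, viaFind, viaMatches) are hypothetical.

```java
import java.util.regex.Pattern;

final class AdjacentRepeat {
    private static final Pattern DOUBLED = Pattern.compile("(.)\\1");

    // find() looks for the pattern anywhere in the input.
    static boolean viaFind(String input) {
        return DOUBLED.matcher(input).find();
    }

    // matches() must cover the whole input, hence the surrounding .*
    static boolean viaMatches(String input) {
        return input.matches(".*(.)\\1.*");
    }

    public static void main(String[] args) {
        System.out.println(viaFind("wool"));              // true
        System.out.println(viaMatches("wool"));           // true
        System.out.println("wool".matches("(.)\\1"));     // false: "wool" is not just two equal chars
    }
}
```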
If the task "in a given string at least one character is repeated" includes the following pattern:
abcbd (b is repeated)
then the regex pattern would be:
(.).*\1
This pattern assumes that other characters could be in between the repeating characters. Otherwise
(.)\1
will do.
Note that the task is to capture "at least one character is repeated", which means identifying a single occurrence is enough for the task, so \1 does not have to have + quantifier.
The code (String.matches() has to match the entire input, so the pattern is wrapped in .*):
return input.matches(".*(.).*\\1.*");
or
return input.matches(".*(.)\\1.*");
An alternative solution would be to add the characters to a HashSet and then compare the size of the set with the length of the string: if the set is smaller, at least one character is repeated.
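A minimal sketch of the set-based idea, with hypothetical names (RepeatedChar, hasRepeatedChar). Instead of comparing sizes at the end, it returns as soon as a character is seen twice, which gives the same answer.

```java
import java.util.HashSet;
import java.util.Set;

final class RepeatedChar {
    // True if any char occurs more than once, adjacent or not.
    static boolean hasRepeatedChar(String input) {
        Set<Character> seen = new HashSet<>();
        for (char c : input.toCharArray()) {
            // add() returns false if the char was already in the set
            if (!seen.add(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasRepeatedChar("abcbd")); // true
        System.out.println(hasRepeatedChar("abc"));   // false
    }
}
```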
I know that the ? is a greedy quantifier and ?? is the reluctant one for it.
When I use it as follows, it always gives me empty matches. Is that because the engine always operates from left to right (first trying zero occurrences, then one occurrence), or is there another reason?
Pattern pattern = Pattern.compile("a??");
Matcher matcher = pattern.matcher("aba");
while(matcher.find()){
System.out.println(matcher.start()+"["+matcher.group()+"]"+matcher.end());
}
Output :
0[]0
1[]1
2[]2
3[]3
Your regex could be explained as follows: "try to match zero characters, and if that fails try to match one 'a' character".
Trying to match zero characters will always succeed, so there is really no purpose for a regex that only contains a single reluctant element.
I'm not sure about the Java implementation but regular-expressions.info states this for ?? :
Makes the preceding item optional. Lazy, so the optional item is excluded in the match if possible. This construct is often excluded from documentation because of its limited use.
Thus you get 4 matches (3 character positions + the empty string at the end) and the optional a is excluded from each of those.
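To see when reluctance actually changes the result, the loop from the question can be run with a few patterns side by side. This is only a sketch; the class name ReluctantDemo and the helper matches are made up, and the output format is the same start[group]end as in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReluctantDemo {
    // Collects every match of `regex` in `input` as start[group]end.
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "[" + m.group() + "]" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(matches("a??", "aba"));  // [0[]0, 1[]1, 2[]2, 3[]3]
        System.out.println(matches("a?", "aba"));   // [0[a]1, 1[]1, 2[a]3, 3[]3]
        System.out.println(matches("a??b", "aba")); // [0[ab]2]
    }
}
```

On its own, a?? is satisfied with the empty match at every position. Once something has to follow it, as in a??b, the engine is forced to fall back to consuming the a.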