java pattern regular expression matching


I'm really bad with pattern matching. I'm trying to take in a password and just check that it meets these criteria:
contains at least 1 lowercase letter
contains at least 1 uppercase letter
contains at least 1 number
contains at least one of these special chars: ##$%
has a minimum length of 8 characters
has a maximum length of 10 characters
This is what I have:
Pattern pattern = Pattern.compile("((?=.*\\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[##$%]).{8,10})");
Matcher matcher = pattern.matcher(in);
if (!matcher.find())
{
    return false;
}
else
{
    return true;
}
I would also like to do something like this:
int MIN = 8,
MAX = 10;
"((?=.*\\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[##$%]).{MIN,MAX})"
but I get some weird message about malformed expression.
Something isn't right. My program crashes with this. I don't know what's wrong. Any ideas?
private boolean isValidPassword(String in)
{
    /* PASSWORD MUST:
     * contains at least 1 lowercase letter
     * contains at least 1 uppercase letter
     * contains at least 1 number
     * contains at least one of these special chars: ##$%
     * has a minimum length of 8 characters
     * has a maximum length of 10 characters
     */
    Pattern hasLowercase = Pattern.compile(".*[a-z].*");
    Pattern hasUppercase = Pattern.compile(".*[A-Z].*");
    Pattern hasNumber = Pattern.compile(".*[0-9].*");
    Pattern hasSpecial = Pattern.compile(".*(#|#|$|%).*");
    Matcher matcher = hasLowercase.matcher(in);
    if (!matcher.matches()) //a-z
    {
        return false;
    }
    matcher = hasUppercase.matcher(in);
    if (!matcher.matches()) //A-Z
    {
        return false;
    }
    matcher = hasNumber.matcher(in);
    if (!matcher.matches()) //0-9
    {
        return false;
    }
    matcher = hasSpecial.matcher(in);
    if (!matcher.matches()) //##$%
    {
        return false;
    }
    if (in.length() < MIN_LENGTH || in.length() > MAX_LENGTH) //length must be min-to-max.
    {
        return false;
    }
    return true;
}

If you really want to do this with regular expressions, it would be much easier to test the input against multiple simple expressions rather than one single and excessively complex expression.
Test your input against the following regexes.
If one of them fails, then the input is invalid.
.*[a-z].*
.*[A-Z].*
.*[0-9].*
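.*(#|#|\$|%).*
Note that the $ has to be escaped: unescaped, it is an end-of-input anchor, and the last expression would match any input.
Additionally, check the length of the input with basic string methods.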
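A minimal sketch of this approach, assuming the four rules and the 8 to 10 length limits from the question. The names PasswordRules and isValid are made up for illustration, the special-character set is taken as displayed in the question, and find() is used so the patterns do not need the surrounding .* parts:

```java
import java.util.regex.Pattern;

final class PasswordRules {
    // One small pattern per rule; find() succeeds if the class occurs anywhere in the input.
    private static final Pattern[] RULES = {
        Pattern.compile("[a-z]"),  // at least one lowercase letter
        Pattern.compile("[A-Z]"),  // at least one uppercase letter
        Pattern.compile("[0-9]"),  // at least one digit
        Pattern.compile("[#$%]")   // at least one special character ($ is literal inside brackets)
    };
    private static final int MIN_LENGTH = 8;   // limits assumed from the question
    private static final int MAX_LENGTH = 10;

    static boolean isValid(String in) {
        // The length check needs no regex at all.
        if (in == null || in.length() < MIN_LENGTH || in.length() > MAX_LENGTH) {
            return false;
        }
        for (Pattern rule : RULES) {
            if (!rule.matcher(in).find()) {
                return false;
            }
        }
        return true;
    }
}
```

Adding or removing a rule then only means editing the RULES array.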

I am not sure how to help you with crashing without more information, but I do have a suggestion.
Instead of trying to create one giant regular expression, I would recommend writing one expression for each rule and then testing them all on the string individually. This lets you easily change, add, or remove individual rules, and it also makes them easier to understand.
There is also the option of not using regex at all; the rules are easy to express by checking each character of the string against the character classes you need (for example with Character.isLowerCase, Character.isUpperCase, and Character.isDigit).
As for the malformed expression: inside a string literal, MIN and MAX are just letters, not your variables, so you should concatenate them into the string like this:
"((?=.*\\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[##$%]).{" + MIN + "," + MAX + "})" which will insert the values of MIN and MAX into the string.

I think that your expression might be off, but I found one that meets what you are looking for.
"^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!##$%^&*])(?=.{8,10}$)"
You can modify the min and max length by using
"^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!##$%^&*])(?=.{" + MIN + "," + MAX + "}$)"
The $ inside the last lookahead is what enforces the maximum length; without it, any input of at least 8 characters would pass.
I have included this RegEx in Regexr so you can see how it works.
http://regexr.com/3gnbd
Also, for future reference when testing regular expressions, regexr.com is very helpful for seeing the different components.
You also shouldn't need an if/else statement to return true or false, because you can just return the tested condition instead. return matcher.find() eliminates the need for the if statement.
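As a hedged sketch of the last two points, the following builds the pattern from MIN and MAX by concatenation and returns the tested condition directly. The class name PasswordPattern is made up, the special-character set is taken as displayed in the question, and matches() is used instead of find() so that the {MIN,MAX} quantifier has to cover the whole input:

```java
import java.util.regex.Pattern;

final class PasswordPattern {
    static final int MIN = 8;   // limits assumed from the question
    static final int MAX = 10;

    // The lookaheads check each rule; .{MIN,MAX} must then consume the entire input.
    private static final Pattern PASSWORD = Pattern.compile(
        "(?=.*\\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[#$%]).{" + MIN + "," + MAX + "}");

    static boolean isValidPassword(String in) {
        // Return the condition itself instead of wrapping it in if/else.
        return in != null && PASSWORD.matcher(in).matches();
    }
}
```

Compiling the pattern once into a static field also avoids recompiling it on every call.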

Related

Regular Expression to restrict some special characters

I am trying to write a regular expression to restrict some characters. The characters to restrict are based on requirements from various users.
I am trying to use this regex - [(char1|char2|char3|...)$]
Note: Each char will come from the requirement.
If the string entered by the user contains any of these characters, I'll return true. Now,
what I want to know is whether this expression will work for all conditions.
For example - requirement1 = .:, requirement2 = .:&%
I will concatenate | between the chars and then generate the regular expression in Java. This is working for my requirement1 but not for requirement2.
My sample Java code:
String requirement = ":>&%";
String regExp1 = null;
for (int i = 0; i < requirement.length(); i++) {
    regExp1 = "[(" + requirement.charAt(i);
    if (i - 1 != requirement.length()) {
        regExp1.concat("|");
    }
}
if (regExp1 != null) {
    regExp1.concat(")]$");
}
Pattern p = Pattern.compile(regExp);
Matcher m = p.matcher(arg);
if (m.find())
    return true;
else
    return false;
How can I generate standard regular expression?

If you want "one of these characters", the brackets are good enough. There is no need for parentheses and pipes.
Something like [.:,] and [.:&%] may work. If you want them one or more times, you have to add + at the end of your regex (i.e. [.:&%]+).
As said in the comments, beware of special characters: outside brackets the dot means any character, and inside brackets characters such as ], \, ^ and - still have a special meaning and need escaping.
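A minimal sketch of generating such a character class from a requirement string. The names RestrictedChars, forCharacters, and containsRestricted are made up for illustration, and the sketch assumes ordinary (Basic Multilingual Plane) characters. Every character that is not a letter or digit is backslash-escaped, which Java's regex syntax allows for non-alphabetic characters:

```java
import java.util.regex.Pattern;

final class RestrictedChars {
    // Builds a character class such as [\.\:\&\%] from the raw list of restricted characters.
    static Pattern forCharacters(String restricted) {
        if (restricted.isEmpty()) {
            throw new IllegalArgumentException("at least one restricted character is required");
        }
        StringBuilder regex = new StringBuilder("[");
        for (char c : restricted.toCharArray()) {
            // Escape everything except letters and digits, so ], \, ^ and - stay literal.
            if (!Character.isLetterOrDigit(c)) {
                regex.append('\\');
            }
            regex.append(c);
        }
        return Pattern.compile(regex.append(']').toString());
    }

    // True if the input contains at least one of the restricted characters.
    static boolean containsRestricted(String input, String restricted) {
        return forCharacters(restricted).matcher(input).find();
    }
}
```

For example, containsRestricted("a&b", ".:&%") is true, while containsRestricted("abc", ".:&%") is false.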

Replacing Strings with a number in it without a for loop

So I currently have this code;
for (int i = 1; i <= this.max; i++) {
    in = in.replace("{place" + i + "}", this.getUser(i)); // Get the place of a user.
}
Which works well, but I would like to just keep it simple (using Pattern matching)
so I used this code to check if it matches;
System.out.println(StringUtil.matches("{place5}", "\\{place\\d\\}"));
StringUtil's matches;
public static boolean matches(String string, String regex) {
    if (string == null || regex == null) return false;
    Pattern compiledPattern = Pattern.compile(regex);
    return compiledPattern.matcher(string).matches();
}
That returns true. Then comes the next part I need help with: replacing the {place5} so I can parse the number. I could replace "{place" and "}", but as far as I'm aware that no longer works if there are multiple placeholders in a string ("{place5} {username}"). If you know a simple way to do that, please let me know; if not, I can just stick with the for loop.

then comes the next part I need help with, replacing the {place5} so I can parse the number
In order to obtain the number after {place, you can use
s = s.replaceAll(".*\\{place(\\d+)}.*", "$1");
The regex matches an arbitrary number of characters before the string we are searching for, then {place, then matches and captures 1 or more digits with (\d+), then }, and then matches the rest of the string with .*. Note that if the string has newline symbols, you should prepend (?s) to the pattern. $1 in the replacement pattern "restores" the value we need.
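If the goal is to replace every {placeN} in one pass without the for loop, one option is a Matcher loop with appendReplacement. This is only a sketch: PlaceholderReplacer and replacePlaces are made-up names, and the userForPlace function stands in for the question's getUser(i). The digits are limited to nine so that Integer.parseInt cannot overflow:

```java
import java.util.function.IntFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderReplacer {
    // Captures the number inside {placeN}.
    private static final Pattern PLACE = Pattern.compile("\\{place(\\d{1,9})\\}");

    static String replacePlaces(String in, IntFunction<String> userForPlace) {
        Matcher m = PLACE.matcher(in);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int place = Integer.parseInt(m.group(1));
            // quoteReplacement keeps any $ or \ in the user name literal.
            m.appendReplacement(out, Matcher.quoteReplacement(userForPlace.apply(place)));
        }
        m.appendTail(out); // copy whatever follows the last placeholder
        return out.toString();
    }
}
```

Other placeholders such as {username} are left untouched, because only text matching the {placeN} pattern is replaced.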

Efficient Regular Expression for big data, if a String contains a word

I have code that works but is extremely slow. It determines whether a string contains a keyword. It needs to be efficient for hundreds of keywords that I will search for in thousands of documents.
What can I do to find the keywords efficiently (without falsely matching a word that merely contains the keyword)?
For example:
String keyword = "ac";
String document = "..."; // few-page-long file
If I use:
if (document.contains(keyword)) {
    // do something
}
it will also return true if the document contains a word like "account",
so I tried to use a regular expression as follows:
String pattern = "(.*)([^A-Za-z]" + keyword + "[^A-Za-z])(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(document);
if (m.find()) {
    // do something
}
Summary:
Hopefully this will be useful to someone else:
My regular expression would work, but it is extremely impractical for big data (it didn't terminate).
@anubhava perfected the regular expression. It was easy to understand and implement, and it managed to terminate, which is a big thing, but it was still a bit slow (roughly 240 seconds).
@Tomalak's solution is a bit complex to implement and understand, but it was the fastest solution, so hats off, mate (18 seconds).
So @Tomalak's solution was ~13 times faster than @anubhava's.

I don't think you need .* in your regex.
Try this regex:
String pattern = "\\b"+ Pattern.quote(keyword) + "\\b";
Here \\b is a word boundary. If the keyword can contain special characters, make sure they are not at the start or end of the word, or the word boundaries will fail to match.
You must also use Pattern.quote if your keyword contains special regex characters.
EDIT: You might use this regex if your keywords are separated by whitespace.
String pattern = "(?<=\\s|^)"+ Pattern.quote(keyword) + "(?=\\s|$)";
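A small sketch of the word-boundary version in use; KeywordMatcher, wholeWord, and containsWord are made-up names. When the same keywords are searched in many documents, it is probably worth compiling each pattern once and reusing it:

```java
import java.util.regex.Pattern;

final class KeywordMatcher {
    // Compiles a pattern that matches the keyword only as a whole word.
    static Pattern wholeWord(String keyword) {
        return Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
    }

    // find() scans the document, so no leading or trailing .* is needed.
    static boolean containsWord(String document, String keyword) {
        return wholeWord(keyword).matcher(document).find();
    }
}
```

With this, containsWord("my account", "ac") is false, while containsWord("ac is short for account", "ac") is true.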

The fastest-possible way to find substrings in Java is to use String.indexOf().
To achieve "entire-word-only" matches, you would need to add a little bit of logic to check the characters before and after a possible match to make sure they are non-word characters:
public class IndexOfWordSample {
    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        String search = "long";
        int index = indexOfWord(input, search);
        if (index > -1) {
            System.out.println("Hit for \"" + search + "\" at position " + index + ".");
        } else {
            System.out.println("No hit for \"" + search + "\".");
        }
    }
    public static int indexOfWord(String input, String word) {
        // Matches the empty string (edge of the input) or a single non-word character.
        String nonWord = "^\\W?$", before, after;
        int index, before_i, after_i = 0;
        while (true) {
            // Find the next candidate, starting after the previous one.
            index = input.indexOf(word, after_i);
            if (index == -1 || word.isEmpty()) break;
            before_i = index - 1;
            after_i = index + word.length();
            // Neighbouring characters, or "" at the edges of the input.
            before = "" + (before_i > -1 ? input.charAt(before_i) : "");
            after = "" + (after_i < input.length() ? input.charAt(after_i) : "");
            if (before.matches(nonWord) && after.matches(nonWord)) {
                return index;
            }
        }
        return -1;
    }
}
This would print:
Hit for "long" at position 44.
This should perform better than a pure regular expression approach.
Consider whether ^\W?$ already matches your expectation of a "non-word" character. The regular expression is a compromise here and may cost performance if your input string contains many "almost" matches.
For extra speed, ditch the regex and work with the Character class, checking a combination of the many properties it provides (like isAlphabetic, etc.) for before and after.
I've created a Gist with an alternative implementation that does that.
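The following is not the Gist mentioned above, just a minimal sketch of the same idea. The helper isWordChar is an assumption about what counts as a word character (letters, digits, and underscore), and WordSearch is a made-up class name:

```java
final class WordSearch {
    // Returns the index of the first whole-word occurrence of word in input, or -1.
    static int indexOfWord(String input, String word) {
        if (word.isEmpty()) {
            return -1;
        }
        int from = 0;
        while (true) {
            int index = input.indexOf(word, from);
            if (index == -1) {
                return -1;
            }
            int end = index + word.length();
            // A hit counts only if both neighbours are absent or non-word characters.
            boolean startOk = index == 0 || !isWordChar(input.charAt(index - 1));
            boolean endOk = end == input.length() || !isWordChar(input.charAt(end));
            if (startOk && endOk) {
                return index;
            }
            from = index + 1; // keep scanning after this candidate
        }
    }

    // Assumed definition of a word character; adjust to your own needs.
    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }
}
```

For the sample input above, indexOfWord(input, "long") skips "longer" and returns 44, the same position as the regex-assisted version.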

Java regex and pattern matching: finding "blanks" in pattern which do not include them?

So, I need to write a compiler scanner for homework, and thought it'd be "elegant" to use regex. The fact is, I have seldom used them before, and that was a long time ago, so I forgot most of the stuff about them and needed to have a look around. I used them successfully for the identifiers (or at least I think so; I still need to do some further tests, but for now they all look OK), but I have a problem with number recognition.
The function nextCh() reads the next character on the input (lookahead char). What I'd like to do here is to check if this char matches the regex [0-9]*. I append every matching char in the str field of my current token, then I read the int value of this field. It recognizes a single number input such as "123", but the problem I have is that for the input "123 456", the final str will be "123 456" while I should get 2 separate tokens with fields "123" and "456". Why is the " " being matched?
private void readNumber(Token t) {
    t.str = "" + ch; // force conversion char --> String
    final Pattern pattern = Pattern.compile("[0-9]*");
    nextCh(); // get next char and check if it is a digit
    Matcher match = pattern.matcher("" + ch);
    while (match.find() && ch != EOF) {
        t.str += ch;
        nextCh();
        match = pattern.matcher("" + ch);
    }
    t.kind = Kind.number;
    try {
        int value = Integer.parseInt(t.str);
        t.val = value;
    } catch(NumberFormatException e) {
        error(t, Message.BIG_NUM, t.str);
    }
}
Thank you!
PS: I did solve my problem using the code below. Nevertheless, I'd like to understand where the flaw is in my regex.
t.str = "" + ch;
nextCh(); // get next char and check if it is a number
while (ch>='0' && ch<='9') {
    t.str += ch;
    nextCh();
}
t.kind = Kind.number;
try {
    int value = Integer.parseInt(t.str);
    t.val = value;
} catch(NumberFormatException e) {
    error(t, Message.BIG_NUM, t.str);
}
EDIT: It turns out my regex also doesn't work for identifier recognition (again, it includes blanks), so I had to switch to a system similar to my "solution" (a while loop with a lot of conditions). I guess I'll need to study regex again :O

I'm not 100% sure whether this is relevant in your case, but this:
Pattern.compile("[0-9]*");
matches zero or more digits anywhere in the string, because of the asterisk. I think the space gets matched because find() succeeds with an empty match, i.e. 'zero digits'. If you wanted to make sure the char was a digit, you would have to match one or more, using the plus sign:
Pattern.compile("[0-9]+");
or, since you are only comparing a single char at a time, just match one digit:
Pattern.compile("^[0-9]$");
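To see the difference on a single blank character, here is a small sketch; EmptyMatchDemo and its two helper methods are made-up names used only for this illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmptyMatchDemo {
    // True if find() succeeds for the regex on the given single character.
    static boolean finds(String regex, char ch) {
        return Pattern.compile(regex).matcher(String.valueOf(ch)).find();
    }

    // Length of the first match found, or -1 when there is none.
    static int firstMatchLength(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() - m.start() : -1;
    }
}
```

Here finds("[0-9]*", ' ') is true and firstMatchLength("[0-9]*", " ") is 0, because the match is the empty string in front of the blank. With "[0-9]+" or "^[0-9]$", finds(..., ' ') is false.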

You should be using the matches method rather than the find method. From the documentation:
The matches method attempts to match the entire input sequence against the pattern
The find method scans the input sequence looking for the next subsequence that matches the pattern.
In other words, with find, a match anywhere in the string is enough (and because [0-9]* can also match the empty string, find succeeds even when there is no digit at all), but with matches the entire string must match the pattern.
For example, try this:
Pattern p = Pattern.compile("[0-9]*");
Matcher m123abc = p.matcher("123 abc");
System.out.println(m123abc.matches()); // prints false
System.out.println(m123abc.find()); // prints true

Use a simpler regex like
/\d+/
Where
\d means a digit
+ means one or more
In code:
final Pattern pattern = Pattern.compile("\\d+");
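As an illustration of \d+ applied to a whole input rather than one character at a time, the sketch below collects every run of digits with find(). NumberTokens and numbers are made-up names, and this is not a replacement for the scanner's nextCh() loop:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberTokens {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Returns each maximal run of digits in the input as a separate token.
    static List<String> numbers(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

For the input "123 456" this yields the two tokens "123" and "456"; the blank is never part of a match because + requires at least one digit.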

codingbat wordEnds using regex

I'm trying to solve wordEnds from codingbat.com using regex.
Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
wordEnds("XYXY", "XY") → "XY"
This is as simple as I can make it with my current knowledge of regex:
public String wordEnds(String str, String word) {
    return str.replaceAll(
        ".*?(?=word)(?<=(.|^))word(?=(.|$))|.+"
            .replace("word", java.util.regex.Pattern.quote(word)),
        "$1$2"
    );
}
replace is used to place in the actual word string into the pattern for readability. Pattern.quote isn't necessary to pass their tests, but I think it's required for a proper regex-based solution.
The regex has two major parts:
If, after matching as few characters as possible ".*?", word can still be found "(?=word)", then look behind to capture any character immediately preceding it "(?<=(.|^))", match "word", and look ahead to capture any character following it "(?=(.|$))".
The initial "if" test ensures that the atomic lookbehind captures only if there's a word
Using lookahead to capture the following character doesn't consume it, so it can be used as part of further matching
Otherwise match what's left "|.+"
Groups 1 and 2 would capture empty strings
I think this works in all cases, but it's obviously quite complex. I'm just wondering if others can suggest a simpler regex to do this.
Note: I'm not looking for a solution using indexOf and a loop. I want a regex-based replaceAll solution. I also need a working regex that passes all codingbat tests.
I managed to reduce the occurrence of word within the pattern to just one.
".+?(?<=(^|.)word)(?=(.?))|.+"
I'm still looking if it's possible to simplify this further, but I also have another question:
With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?

Based on your solution I managed to simplify the code a little bit:
public String wordEnds(String str, String word) {
    return str.replaceAll(".*?(?="+word+")(?<=(.|^))"+word+"(?=(.|$))|.+","$1$2");
}
Another way of writing it, with String.format inserting the word, would be:
public String wordEnds(String str, String word) {
    return str.replaceAll(
        String.format(".*?(?=%1$s)(?<=(.|^))%1$s(?=(.|$))|.+", word),
        "$1$2");
}
Note that both versions drop Pattern.quote, so they only work if word contains no regex metacharacters.
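For comparison, here is a sketch that keeps the concatenation style but quotes the word, so regex metacharacters in word are treated literally. It reuses the pattern from the question; the class name WordEndsRegex is made up:

```java
import java.util.regex.Pattern;

final class WordEndsRegex {
    static String wordEnds(String str, String word) {
        String w = Pattern.quote(word); // treat the word as a literal
        // Group 1: character before the word (if any); group 2: character after it (if any).
        String regex = ".*?(?=" + w + ")(?<=(.|^))" + w + "(?=(.|$))|.+";
        // Text matched by the trailing .+ alternative leaves both groups unset,
        // so it is replaced by nothing.
        return str.replaceAll(regex, "$1$2");
    }
}
```

With this version, the four examples from the problem statement give "c13i", "13", "11", and "XY".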

With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?
In Oracle's implementation, the behavior of look-behind is as follows:
By "studying" the regex (with the study() method in each node), it knows the maximum and minimum length of the pattern in the look-behind group. (The study() method is what allows for obvious look-behind length.)
It verifies the look-behind by starting a match at every position from index (current - min_length) to index (current - max_length) and exits early if the condition is satisfied.
Effectively, it will try to verify the look-behind on the shortest string first.
The implementation multiplies the matching complexity by O(k) factor.
This explains why changing ^|. to .? doesn't work: due to the starting position, it effectively checks for word before .word. The quantifier doesn't have a say here, since the ordering is imposed by the match range.
You can check the code of match method in Pattern.Behind and Pattern.NotBehind inner classes to verify what I said above.
In .NET's flavor, look-behind is likely implemented by the reverse matching feature, which means that no extra factor is incurred on the matching complexity.
My suspicion comes from the fact that the capturing group in (?<=(a+))b matches all a's in aaaaaaaaaaaaaab. The quantifier is shown to have free rein in the look-behind group.
I have tested that ^|. can be simplified to .? in .NET and the regex works correctly.

I am working with .NET's regex, but I was able to change your pattern to:
.+?(?<=(\w?)word)(?=(\w?))|.+
with positive results. You know it's a word (alphanumeric) character, so why not give the parser a valid hint of that fact: instead of any character, it's an optional alphanumeric character.
It may answer why you don't need to specify the anchors ^ and $, for what exactly is $: is it \r or \n or something else? (.NET has issues with $, and maybe you are not exactly capturing a null at $, but the null before \r or \n, which allowed you to change to .? for $.)

Another solution to look at...
public String wordEnds(String str, String word) {
    if(str.equals(word)) return "";
    int i = 0;
    String result = "";
    int stringLen = str.length();
    int wordLen = word.length();
    int diffLen = stringLen - wordLen;
    while(i<=diffLen){
        if(i==0 && str.substring(i,i+wordLen).equals(word)){
            result = result + str.charAt(i+wordLen);
        }else if(i==diffLen && str.substring(i,i+wordLen).equals(word)){
            result = result + str.charAt(i-1);
        }else if(str.substring(i,i+wordLen).equals(word)){
            result = result + str.charAt(i-1) + str.charAt(i+wordLen) ;
        }
        i++;
    }
    if(result.length()==1) result = result + result;
    return result;
}

Another possible solution:
public String wordEnds(String str, String word) {
    String result = "";
    if (str.contains(word)) {
        for (int i = 0; i < str.length(); i++) {
            if (str.startsWith(word, i)) {
                if (i > 0) {
                    result += str.charAt(i - 1);
                }
                if ((i + word.length()) < str.length()) {
                    result += str.charAt(i + word.length());
                }
            }
        }
    }
    return result;
}
