How to validate a phone number (US format) in Java?

I just want to know where I am going wrong here:
import java.io.*;
class Tokens{
public static void main(String[] args)
{
//String[] result = "this is a test".split("");
String[] result = "4543 6546 6556".split("");
boolean flag= true;
String num[] = {"0","1","2","3","4","5","6","7","8","9"};
String specialChars[] = {"-","#","#","*"," "};
for (int x=1; x<result.length; x++)
{
for (int y=0; y<num.length; y++)
{
if ((result[x].equals(num[y])))
{
flag = false;
continue;
}
else
{
flag = true;
}
if (flag == true)
break;
}
if (flag == false)
break;
}
System.out.println(flag);
}
}

If this is not homework, is there a reason you're avoiding regular expressions?
Here are some useful ones: http://regexlib.com/DisplayPatterns.aspx?cattabindex=6&categoryId=7
More generally, your code doesn't seem to validate that you have a phone number; it seems to merely validate that your string consists only of digits. You're also not allowing any special characters right now.

Aside from the regex suggestion (which is a good one), it would seem to make more sense to deal with arrays of characters rather than single-char Strings.
In particular, the split("") call (shudder) could/should be replaced by toCharArray(). This lets you iterate over each individual character, which more clearly indicates your intent, is less prone to bugs as you know you're treating each character at once, and is more efficient*. Likewise your valid character sets should also be characters.
Your logic is pretty strangely expressed; you're not even referencing the specialChars set at all, and the looping logic once you've found a match seems odd. I think this is your bug; the matching seems to be the wrong way round in that if the character matches the first valid char, you set flag to false and continue round the current loop; so it will definitely not match the next valid char and hence you break out of the loop with a true flag. Always.
I would have thought something like this would be more intuitive:
private static final Set<Character> VALID_CHARS = ...;
public boolean isValidPhoneNumber(String number)
{
for (char c : number.toCharArray())
{
if (!VALID_CHARS.contains(c))
{
return false;
}
}
// All characters were valid
return true;
}
This doesn't take sequences into account (e.g. the strings "--------** " and "1" would be valid because all individual characters are valid) but then neither does your original code. A regex is better because it lets you specify the pattern; I supply the above snippet as an example of a clearer way of iterating through the characters.
*Yes, premature optimization is the root of all evil, but when better, cleaner code also happens to be faster that's an extra win for free.
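For readers who want to run the character-set idea, here is a minimal self-contained sketch. The contents of VALID_CHARS are an assumption (the digits plus the separators from the question's specialChars array), and the class and method names are made up for illustration; like the snippet above, it checks individual characters only, not their order.

```java
import java.util.HashSet;
import java.util.Set;

class PhoneCharCheck {
    // Assumed allowed set: digits plus the separators listed in the question's specialChars
    private static final Set<Character> VALID_CHARS = new HashSet<>();
    static {
        for (char c : "0123456789-#* ".toCharArray()) {
            VALID_CHARS.add(c);
        }
    }

    // Returns true only if every character of the input is in the allowed set
    static boolean hasOnlyValidChars(String number) {
        for (char c : number.toCharArray()) {
            if (!VALID_CHARS.contains(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasOnlyValidChars("4543 6546 6556")); // true
        System.out.println(hasOnlyValidChars("4543x6546"));      // false
    }
}
```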

Maybe this is overkill, but with a grammar similar to:
<phone_number> := <area_code><space>*<local_code><space>*<number> |
<area_code><space>*"-"<space>*<local_code><space>*"-"<space>*<number>
<area_code> := <digit><digit><digit> |
"("<digit><digit><digit>")"
<local_code> := <digit><digit><digit>
<number> := <digit><digit><digit><digit>
you can write a recursive descent parser. See this page for an example.
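A rough sketch of what such a parser could look like for the grammar above, with one method per nonterminal. The class and method names are hypothetical. Since this grammar has no recursive rules, the "descent" is only one level deep, and the first separator decides which of the two <phone_number> alternatives is being parsed.

```java
class PhoneNumberParser {
    private final String input;
    private int pos = 0; // index of the next unread character

    private PhoneNumberParser(String input) {
        this.input = input;
    }

    // True if the whole string can be derived from <phone_number>
    static boolean isValid(String s) {
        PhoneNumberParser p = new PhoneNumberParser(s);
        return p.phoneNumber() && p.pos == s.length();
    }

    // <phone_number>: either no "-" between groups, or "-" between both pairs of groups
    private boolean phoneNumber() {
        if (!areaCode()) return false;
        skipSpaces();
        boolean dashed = accept('-');
        skipSpaces();
        if (!digits(3)) return false; // <local_code>
        skipSpaces();
        if (dashed) {
            if (!accept('-')) return false;
            skipSpaces();
        }
        return digits(4); // <number>
    }

    // <area_code>: three digits, optionally wrapped in parentheses
    private boolean areaCode() {
        if (accept('(')) {
            return digits(3) && accept(')');
        }
        return digits(3);
    }

    // Consumes exactly n ASCII digits, or consumes nothing and returns false
    private boolean digits(int n) {
        if (pos + n > input.length()) return false;
        for (int k = 0; k < n; k++) {
            char c = input.charAt(pos + k);
            if (c < '0' || c > '9') return false;
        }
        pos += n;
        return true;
    }

    // Consumes the expected character if it is next
    private boolean accept(char expected) {
        if (pos < input.length() && input.charAt(pos) == expected) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipSpaces() {
        while (accept(' ')) {
            // keep consuming spaces
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("(555) 123 4567"));   // true
        System.out.println(isValid("555 - 123 - 4567")); // true
        System.out.println(isValid("555-123 4567"));     // false: only one "-"
    }
}
```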

You can check out the Pattern class in Java; it makes working with regular expressions really easy:
https://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html.
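As a sketch of how Pattern might be used here: the regex below encodes one assumed reading of "US format" (a 3-digit area code, optionally in parentheses, then 3 and 4 digits, with an optional space or hyphen between groups). The question does not pin the format down, so treat the pattern and the names as illustrative.

```java
import java.util.regex.Pattern;

class UsPhoneRegex {
    // Assumed format: (ddd) or ddd, then ddd, then dddd, groups optionally separated by ' ' or '-'
    private static final Pattern US_PHONE =
            Pattern.compile("(\\(\\d{3}\\)|\\d{3})[ -]?\\d{3}[ -]?\\d{4}");

    static boolean isValidUsPhone(String s) {
        // matches() requires the whole input to match, so no ^ or $ anchors are needed
        return s != null && US_PHONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidUsPhone("(555) 123-4567")); // true
        System.out.println(isValidUsPhone("555-123-4567"));   // true
        System.out.println(isValidUsPhone("4543 6546 6556")); // false
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call.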

Related

Java: How to implement wildcard matching?

I'm researching on how to find k values in the BST that are closest to the target, and came across the following implementation with the rules:
'?' Matches any single character.
'*' Matches any sequence of characters (including the empty sequence).
The matching should cover the entire input string (not partial).
The function prototype should be:
bool isMatch(const char *s, const char *p)
Some examples:
isMatch("aa","a") → false
isMatch("aa","aa") → true
isMatch("aaa","aa") → false
isMatch("aa", "*") → true
isMatch("aa", "a*") → true
isMatch("ab", "?*") → true
isMatch("aab", "cab") → false
Code:
import java.util.*;
public class WildcardMatching {
boolean isMatch(String s, String p) {
int i=0, j=0;
int ii=-1, jj=-1;
while(i<s.length()) {
if(j<p.length() && p.charAt(j)=='*') {
ii=i;
jj=j;
j++;
} else if(j<p.length() &&
(s.charAt(i) == p.charAt(j) ||
p.charAt(j) == '?')) {
i++;
j++;
} else {
if(jj==-1) return false;
j=jj;
i=ii+1;
}
}
while(j<p.length() && p.charAt(j)=='*') j++;
return j==p.length();
}
public static void main(String args[]) {
String s = "aab";
String p = "a*";
WildcardMatching wcm = new WildcardMatching();
System.out.println(wcm.isMatch(s, p));
}
}
And my question is, what's the reason for having two additional indexes, ii and jj, and why do they get initialized with -1? What's the purpose of each? Wouldn't traversing it with i and j be enough?
And what's the purpose of ii=i; and jj=j; in the first if case, and i=ii+1; and j=jj; in the third if case?
Lastly, in what case would you encounter while(j<p.length() && p.charAt(j)=='*') j++;?
Examples would be extremely helpful in understanding.
Thank you in advance and will accept answer/up vote.
It looks like ii and jj are used to handle the wildcard "*", which matches to any sequence. Their initialization to -1 acts as a flag: it tells us if we've hit an unmatched sequence and are not currently evaluating a "*". We can walk through your examples one at a time.
Notice that i is related to the parameter s (the original string) and j is related to the parameter p (the pattern).
isMatch("aa","a"):
this returns false because the j<p.length() statement will fail before we leave the while loop, since the length of p ("a") is only 1 whereas the length of s ("aa") is 2, so we'll jump to the else block. This is where the -1 initialization comes in: since we never saw any wildcards in p, jj is still -1, indicating that there's no way the strings can match, so we return false.
isMatch("aa","aa"):
s and p are exactly the same, so the program repeatedly evaluates the else-if block with no problems and finally breaks out of the while loop once i equals 2 (the length of "aa"). The second while loop never runs, since j is not less than p.length() - in fact, since the else-if increments i and j together, they are both equal to 2, and 2 is not less than the length of "aa". We return j == p.length(), which evaluates to 2 == 2, and get true.
isMatch("aaa","aa"): this one fails for the same reason as the first. Namely, the strings are not the same length and we never hit a wildcard character.
isMatch("aa","*"): this is where it gets interesting. First we'll enter the if block, since we've seen a "*" in p. We set ii and jj to 0 and increment j only. On the second iteration, j<p.length() fails, so we jump to the else block. jj is not -1 anymore (it's 0), so we reset j to 0 and set i to 0+1. This basically allows us to keep evaluating the wildcard, since j just gets reset to jj, which holds the position of the wildcard, and ii tells us where to start from in our original string. This test case also explains the second while loop. In some cases our pattern may be much shorter than the original string, so we need to make sure it's matched up with wildcards. For example, isMatch("aaaaaa","a**") should return true, but the final return statement is checking to see if j == p.length(), asking if we checked the entire pattern. Normally we would stop at the first wildcard, since it matches anything, so we need to finally run through the rest of the pattern and make sure it only contains wildcards.
From here you can figure out the logic behind the other test cases. I hope this helped!
Let's look at this a bit out of order.
First, this is a parallel iteration of the string (s) and the wildcard pattern (p), using variable i to index s and variable j to index p.
The while loop will stop iterating when the end of s is reached. When that happens, hopefully the end of p has been reached too, in which case it'll return true (j==p.length()).
If however p ends with a *, that is also valid (e.g. isMatch("ab", "ab*")), and that's what the while(j<p.length() && p.charAt(j)=='*') j++; loop ensures, i.e. any * in the pattern at this point is skipped, and if that reaches end of p, then it returns true. If end of p is not reached, it returns false.
That was the answer to your last question. Now let's look at the loop. The else if will advance both i and j as long as there is a match, e.g. 'a' == 'a' or 'a' == '?'.
When a * wildcard is found (first if), it saves the current positions in ii and jj, in case backtracking becomes necessary, then skips the wildcard character.
This basically starts by assuming the wildcard matches the empty string (e.g. isMatch("ab", "a*b")). When it continues iterating, the else if will match the rest and method ends up returning true.
Now, if a mismatch is found (the else block), it will try to backtrack. Of course, if it doesn't have a saved wildcard (jj==-1), it can't backtrack, so it just returns false. That's why jj is initialized to -1, so it can detect if a wildcard was saved. ii could be initialized to anything, but is initialized to -1 for consistency.
If a wildcard position was saved in ii and jj, it will restore those values, then forward i by one, i.e. assuming that if the next character is matched against the wildcard, the rest of the matching will succeed and return true.
That's the logic. Now, it could be optimized a tiny bit, because that backtracking is sub-optimal. It currently resets j back to the *, and i back to the next character. When it loops around, it will enter the if, save the same value again in jj, save the i value in ii, and then increment j. Since that is a given (unless the end of s is reached), the backtracking could just do that too, saving a loop iteration, i.e.
} else {
if(jj==-1) return false;
i=++ii;
j=jj+1;
}
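Put together, the whole method with that shortened backtracking branch might look like the sketch below; it is the question's algorithm with only the else block changed, and the class name is made up. The main method runs the examples from the question.

```java
class WildcardMatchingOptimized {
    static boolean isMatch(String s, String p) {
        int i = 0, j = 0;     // current positions in s and p
        int ii = -1, jj = -1; // positions in s and p saved at the most recent '*'
        while (i < s.length()) {
            if (j < p.length() && p.charAt(j) == '*') {
                // Save a backtrack point and first try matching '*' against nothing
                ii = i;
                jj = j;
                j++;
            } else if (j < p.length()
                    && (s.charAt(i) == p.charAt(j) || p.charAt(j) == '?')) {
                i++;
                j++;
            } else {
                if (jj == -1) return false; // mismatch and no '*' to fall back on
                // Let the saved '*' absorb one more character and resume right after it
                i = ++ii;
                j = jj + 1;
            }
        }
        // Only trailing '*' characters may remain in the pattern
        while (j < p.length() && p.charAt(j) == '*') j++;
        return j == p.length();
    }

    public static void main(String[] args) {
        System.out.println(isMatch("aa", "a"));    // false
        System.out.println(isMatch("aa", "aa"));   // true
        System.out.println(isMatch("aaa", "aa"));  // false
        System.out.println(isMatch("aa", "*"));    // true
        System.out.println(isMatch("aa", "a*"));   // true
        System.out.println(isMatch("ab", "?*"));   // true
        System.out.println(isMatch("aab", "cab")); // false
    }
}
```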
The code looks buggy to me. (See below)
The ostensible purpose of ii and jj is to implement a form of backtracking.
For example, when you try to match "abcde" against the pattern "a*e", the algorithm will first match the "a" in the pattern against the "a" in the input string. Then it will eagerly match the "*" against the rest of the string ... and find that it has made a mistake. At that point, it needs to backtrack and try an alternative.
The ii and jj variables record the point to backtrack to, and each use of those variables is either recording a new backtrack point or backtracking.
Or at least, that was probably the author's intent at some point.
The while(j<p.length() && p.charAt(j)=='*') j++; seems to be dealing with an edge case.
However, I don't think this code is correct.
It certainly won't cope with backtracking in the case where there are multiple "*" wildcards in the pattern. That requires a recursive solution.
The part:
if(j<p.length() && p.charAt(j)=='*') {
ii=i;
jj=j;
j++;
doesn't make much sense. I'd have thought it should increment i not j. It might "mesh" with the behavior of the else part, but even if it does this is a convoluted way of coding this.
Advice:
Don't use this code as an example. Even if it works (in a limited sense) it is not a good way to do this task, or an example of clarity or good style.
I would handle this by translating the wildcard pattern into a regex and then using Pattern / Matcher to do the matching.
For example: Wildcard matching in Java
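One way to probe the multi-wildcard concern raised above is to compare the question's loop against a plain recursive matcher on a few patterns with several "*". The recursive method below is a hypothetical reference written for this comparison, not code from the thread, and the inputs are arbitrary. Agreement on a handful of cases is evidence rather than proof.

```java
class GreedyVsRecursive {
    // The question's algorithm, unchanged apart from being static
    static boolean greedy(String s, String p) {
        int i = 0, j = 0, ii = -1, jj = -1;
        while (i < s.length()) {
            if (j < p.length() && p.charAt(j) == '*') {
                ii = i;
                jj = j;
                j++;
            } else if (j < p.length()
                    && (s.charAt(i) == p.charAt(j) || p.charAt(j) == '?')) {
                i++;
                j++;
            } else {
                if (jj == -1) return false;
                j = jj;
                i = ii + 1;
            }
        }
        while (j < p.length() && p.charAt(j) == '*') j++;
        return j == p.length();
    }

    // Reference matcher: tries every split for '*'; exponential in the worst case
    static boolean recursive(String s, int i, String p, int j) {
        if (j == p.length()) return i == s.length();
        char c = p.charAt(j);
        if (c == '*') {
            // '*' matches nothing, or absorbs one character and stays in play
            return recursive(s, i, p, j + 1)
                    || (i < s.length() && recursive(s, i + 1, p, j));
        }
        return i < s.length()
                && (c == '?' || c == s.charAt(i))
                && recursive(s, i + 1, p, j + 1);
    }

    public static void main(String[] args) {
        String[][] cases = {
            {"abcde", "a*e"},
            {"abcabd", "a*b*d"},
            {"abcabd", "a*c*c"},
            {"mississippi", "m*iss*p*"},
        };
        for (String[] c : cases) {
            System.out.println(c[0] + " vs " + c[1] + ": greedy=" + greedy(c[0], c[1])
                    + ", recursive=" + recursive(c[0], 0, c[1], 0));
        }
    }
}
```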
I know you are asking about BST, but to be honest there is also a way of doing that with regex (not for competitive programming, but stable and fast enough to be used in a production environment):
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class WildCardMatcher{
public static void main(String []args){
// Test
String urlPattern = "http://*.my-webdomain.???",
urlToMatch = "http://webmail.my-webdomain.com";
WildCardMatcher wildCardMatcher = new WildCardMatcher(urlPattern);
System.out.printf("\"%s\".matches(\"%s\") -> %s%n", urlToMatch, wildCardMatcher, wildCardMatcher.matches(urlToMatch));
}
private final Pattern p;
public WildCardMatcher(final String urlPattern){
Pattern charsToEscape = Pattern.compile("([^*?]+)([*?]*)");
// here we need to escape all the strings that are not "?" or "*", and replace any "?" and "*" with ".?" and ".*"
Matcher m = charsToEscape.matcher(urlPattern);
StringBuffer sb = new StringBuffer();
String replacement, g1, g2;
while(m.find()){
g1 = m.group(1);
g2 = m.group(2);
// We first have to escape the literal text (the original string can contain characters that are special in a regex), then escape the '\' characters, which have a special meaning in replacement strings
replacement = (g1 == null ? "" : Matcher.quoteReplacement(Pattern.quote(g1))) +
(g2 == null ? "" : g2.replaceAll("([*?])", ".$1")); // simply replacing "*" and "?" with ".*" and ".?"
m.appendReplacement(sb, replacement);
}
m.appendTail(sb);
p = Pattern.compile(sb.toString());
}
@Override
public String toString(){
return p.toString();
}
public boolean matches(final String urlToMatch){
return p.matcher(urlToMatch).matches();
}
}
There is still a list of improvements that you could implement (lowercase / uppercase distinction, setting a maximum length for the string being checked so that attackers cannot make you match against a 4-gigabyte string, ...).
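A variant sketch of the same translation idea, assuming the rules from the question ("?" is exactly one character, "*" is any sequence). It walks the pattern one character at a time, which also covers patterns that begin with a wildcard. As far as I can tell, a leading "*" or "?" is copied into the regex unescaped by the constructor above, so this is offered as an alternative to try rather than a drop-in fix. The helper names are hypothetical.

```java
import java.util.regex.Pattern;

class WildcardToRegex {
    // Hypothetical helper: translates a wildcard pattern into a compiled regex
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder(); // pending run of non-wildcard characters
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                flushLiteral(literal, regex);
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        flushLiteral(literal, regex);
        // DOTALL lets '.' match line terminators too, so wildcards can span newlines
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Appends the pending literal text, quoted so regex metacharacters lose their meaning
    private static void flushLiteral(StringBuilder literal, StringBuilder regex) {
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
            literal.setLength(0);
        }
    }

    static boolean matches(String wildcard, String input) {
        return toRegex(wildcard).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("http://*.my-webdomain.???",
                "http://webmail.my-webdomain.com")); // true
        System.out.println(matches("*12", "abc12")); // true
        System.out.println(matches("a?c", "ac"));    // false: '?' needs one character
    }
}
```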

How to properly use java Pattern object to match string patterns

I wrote some code that does several string operations, including checking whether a given string matches a certain regular expression. It ran just fine with 70,000 inputs, but it started to give me an out-of-memory error when I ran it iteratively for five-fold cross-validation. It might just be that I have to assign more memory, but I have a feeling that I have written inefficient code, so I wanted to double-check that I didn't make any obvious mistake.
static Pattern numberPattern = Pattern.compile("^[a-zA-Z]*([0-9]+).*");
public static boolean someMethod(String line) {
String[] tokens = line.split(" ");
for(int i=0; i<tokens.length; i++) {
tokens[i] = tokens[i].replace(",", "");
tokens[i] = tokens[i].replace(";", "");
if(numberPattern.matcher(tokens[i]).find()) return true;
}
return false;
}
and I have also many lines like below:
token.matches("[a-z]+[A-Z][a-z]+");
Which way is more memory efficient? Do they look efficient enough? Any advice is appreciated!
Edited:
Sorry, I posted the wrong code, which I intended to modify before posting this question but forgot at the last minute. But the problem was that I had many similar-looking operations all over; aside from the fact that the example code did not make sense, I wanted to know whether the regexp comparison part was efficient.
Thanks for all of your comments, I'll look through and modify the code following the advice!
Well, first of all, take a second look at your code... it will always return true! You are never reading the 'match' variable, only assigning to it....
Second, String is immutable, so each time you split you create new instances... why don't you try to create a pattern that makes the matches you want while ignoring the commas and semicolons? I'm not sure, but I think it will use less memory...
Yes, this code is inefficient indeed because you can return immediately once you've found that match = true; (no point to continue looping).
Further, are you sure you need to break the line into tokens? Why not check the regex only once?
And last, if all comparisons checks failed, you should return false (last line).
Instead of altering the text and splitting it you can put it all in the regex.
// the \\b means it must be the start of the String or a word
static Pattern numberPattern = Pattern.compile("\\b[a-zA-Z,;]*[0-9,;]*[0-9]");
// return true if the string contains
// a number which might have letters in front
public static boolean someMethod(String line) {
return numberPattern.matcher(line).find();
}
Aside from what @alfasin has mentioned in his answer, you should avoid duplicating code. Rewrite the following:
{
tokens[i] = tokens[i].replace(",", "");
tokens[i] = tokens[i].replace(";", "");
}
Into:
tokens[i] = tokens[i].replaceAll(",|;", "");
And do this once, before the .split(), so that the operation doesn't have to be repeated inside the loop:
String[] tokens = line.replaceAll(",|;", "").split(" ");
                      ^^^^^^^^^^^^^^^^^^^^^^
Edit: After staring at your code for a bit I think I have a better solution, using regex ;)
public static boolean someMethod(String line) {
return Pattern.compile("\\b[a-zA-Z]*\\d")
.matcher(line.replaceAll(",|;", "")).find();
}
Online Regex Demo, Online Code Demo
\b is a Word Boundary.
It asserts position at the Boundary of a word (Start of line + after spacing)
Code Demo STDOUT:
foo does not match
bar does not match
bar1 does match
foo baz bar bar1 lolz does match
password_01 does not match
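On the token.matches("[a-z]+[A-Z][a-z]+") lines mentioned in the question: String.matches compiles its regex on every call, so in a hot loop it is usually cheaper to compile the pattern once and reuse it, as the question already does with numberPattern. A minimal sketch, with made-up class and method names; whether this is what caused the out-of-memory error is not something the thread establishes.

```java
import java.util.regex.Pattern;

class TokenPatterns {
    // Compiled once and reused; Pattern instances are immutable and safe to share
    private static final Pattern CAMEL = Pattern.compile("[a-z]+[A-Z][a-z]+");

    static boolean isCamel(String token) {
        // Only a Matcher is created per call, not a new Pattern
        return CAMEL.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(isCamel("fooBar")); // true
        System.out.println(isCamel("foobar")); // false
        // Same result as the one-liner, which recompiles the regex each time
        System.out.println("fooBar".matches("[a-z]+[A-Z][a-z]+")); // true
    }
}
```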

Check special arrangement of specific signs in a string in Java

I need to check whether a string consists of a specific arrangement of letters and numbers.
Valid arrangements are for example:
X
X-Y
A-H-K-L-J-Y
A-H-J-Y
123
12?
12*
12-17
Invalid are for example:
-X-Y
-XY
*12
?12
I have written this method in Java to solve this problem (but I don't have much experience with regular expressions):
public boolean checkPatternMatching(String sourceToScan, String searchPattern) {
boolean patternFounded;
if (sourceToScan == null) {
patternFounded = false;
} else {
Pattern pattern = Pattern.compile(Pattern.quote(searchPattern),
Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(sourceToScan);
patternFounded = matcher.find();
}
return patternFounded;
}
How can I implement this requirement with regular expressions?
By the way: is it a good solution to check whether a string has numeric content by using the isNumeric method from the StringUtils class?
//EDIT
The link added by the admins covers matching characters with regular expressions in general, not specific arrangements of characters!
After a good while trying to help, answering to constantly changing questions, just found out that the same was asked yesterday, and that the OP doesn't accept answers to his questions...all I have left to say is good night sir, good luck
n-th answer follows:
First pattern: [a-z](-[a-z])* : a letter, possibly followed by more letters, separated by -.
Second pattern: \d+(-\d+)*[?*]* : a number, possibly followed by more numbers, separated by -, and possibly ending with ? or *.
So join them together: ^(([a-z](-[a-z])*)|(\d+(-\d+)*[?*]*))$. ^ and $ mark the beginning and the end of the string; the outer group makes them apply to both alternatives.
A few more comments on the code: you don't need to use Pattern.quote, and you should use matches() instead of find(), because find() returns true if any part of the string matches the pattern, and you want the whole string:
public static boolean checkPatternMatching(String sourceToScan, String searchPattern) {
boolean patternFounded;
if (sourceToScan == null) {
patternFounded = false;
} else {
Pattern pattern = Pattern.compile(searchPattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(sourceToScan);
patternFounded = matcher.matches();
}
return patternFounded;
}
Called like this: checkPatternMatching(s, "^(([a-z](-[a-z])*)|(\\d+(-\\d+)*[?*]*))$")
About the second question, this is the current implementation of StringUtils.isNumeric:
public static boolean isNumeric(final CharSequence cs) {
if (isEmpty(cs)) {
return false;
}
final int sz = cs.length();
for (int i = 0; i < sz; i++) {
if (Character.isDigit(cs.charAt(i)) == false) {
return false;
}
}
return true;
}
So no, there is nothing wrong about it, that is as simple as it gets. But you need to include an external JAR in your program, which I find unnecessary if you just want to use such a simple method.
I believe that you should first remove the Pattern.quote() call, because it turns the input pattern into a string literal, which is not useful in your context.
To match the valid arrangements with letters, something like this should work:
^[a-z](?:-[a-z])*$
For the numbers (if I understood the rules correctly):
^\\d+(?:[?*]|-\\d+)*$
And if you want to combine them:
^(?:[a-z](?:-[a-z])*|\\d+(?:[?*]|-\\d+)*)$
I'm not familiar with Java itself, nor the isNumeric method, sorry.
As per your comment, if you want to accept *12 or 1?2 or 12*456, you can use:
^\\*?\\d+(?:[?*]\\d*|-\\d+)*$
Then add it to the previous regex like so:
^(?:[a-z](?:-[a-z])*|\\*?\\d+(?:[?*]\\d*|-\\d+)*)$
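As a quick check, the first combined pattern from this answer (the one before the extension for *12) can be run against the valid and invalid lists from the question. This is only a sketch: the class name is made up, and CASE_INSENSITIVE is assumed because the question's examples use upper-case letters.

```java
import java.util.regex.Pattern;

class ArrangementCheck {
    // Letters separated by '-', or numbers separated by '-' and optionally followed by ? or *
    private static final Pattern ARRANGEMENT = Pattern.compile(
            "^(?:[a-z](?:-[a-z])*|\\d+(?:[?*]|-\\d+)*)$",
            Pattern.CASE_INSENSITIVE);

    static boolean isValid(String s) {
        return s != null && ARRANGEMENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] valid = {"X", "X-Y", "A-H-K-L-J-Y", "A-H-J-Y", "123", "12?", "12*", "12-17"};
        String[] invalid = {"-X-Y", "-XY", "*12", "?12"};
        for (String s : valid) {
            System.out.println(s + " -> " + isValid(s)); // true for each
        }
        for (String s : invalid) {
            System.out.println(s + " -> " + isValid(s)); // false for each
        }
    }
}
```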

Fastest way to decide if string contains only allowed values?

I'm trying to determine what is the fastest way to implement this code:
Pattern ID_REGEX = Pattern.compile( "[A-Za-z0-9_\\-.]+" );
boolean match = ID_REGEX.matcher( id ).matches();
if ( !match ) throw new IllegalArgumentException("Disallowed character in ID");
Given that the ID_REGEX is constant, I would assume something like a BitSet or an array of permitted values is the fastest way to implement this, maybe even just a huge if statement.
Note that the search is for A-Za-z, not Character.isLetter.
Additional kudos for an OSS implementation
My fast but unclear solution
// encapsulate this into a class and do once; perhaps use a static initializer
boolean[] allowed = new boolean[256]; // default false
allowed[32] = true;
allowed['a'] = true;
// fill all allowed characters
allowed['Z'] = true;
// the check
for (int n=0,len=str.length(); n<len; n++) {
char ch = str.charAt(n);
if (ch>255 || !allowed[ch]) {
return false;
}
}
return true;
Some additional casts may be needed, but I hope the idea is clear.
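Filled in for the character class from the question ([A-Za-z0-9_\-.]+), the lookup-table idea might look like the sketch below. The class and method names are hypothetical, and the table holds the question's set rather than the example entries above. No speed claim is made here; it would need measuring against the regex.

```java
class IdValidator {
    // One flag per ASCII code; true means the character is allowed in an ID
    private static final boolean[] ALLOWED = new boolean[128];
    static {
        for (char c = 'A'; c <= 'Z'; c++) ALLOWED[c] = true;
        for (char c = 'a'; c <= 'z'; c++) ALLOWED[c] = true;
        for (char c = '0'; c <= '9'; c++) ALLOWED[c] = true;
        ALLOWED['_'] = true;
        ALLOWED['-'] = true;
        ALLOWED['.'] = true;
    }

    static boolean isValidId(String id) {
        int len = id.length();
        if (len == 0) return false; // the regex uses +, so an empty ID is rejected
        for (int n = 0; n < len; n++) {
            char ch = id.charAt(n);
            if (ch >= ALLOWED.length || !ALLOWED[ch]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidId("abc_123-x.y")); // true
        System.out.println(isValidId("a b"));         // false
        System.out.println(isValidId(""));            // false
    }
}
```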
I suppose you could iterate through the string, get the code of each character, and compare it with the ASCII table codes. This way, for each character you have three range comparisons (for a-z, A-Z, 0-9) and 6 ordinary integer comparisons for the other chars. I think it will be fastest. You could also try a multithreaded approach, in which you do basically the same thing, but divide the problem into several parts before starting and run the checks concurrently.
Edit (after @krosenvold's comment): the multithreaded approach is suitable only for very big strings, because creating threads has its own cost.

regex is very slow, how to check if a string is only with word chars fast?

I have a function that checks whether a string (most of the strings consist of a single CJK character) contains only word characters. It is invoked many, many times, so the cost is unacceptable, but I don't know how to optimize it. Any suggestions?
/*\w is equivalent to the character class [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
For more details see Unicode TR-18, and bear in mind that the set of characters
in each class can vary between Unicode releases.*/
private static final Pattern sOnlyWordChars = Pattern.compile("\\w+");
private boolean isOnlyWordChars(String s) {
return sOnlyWordChars.matcher(s).matches();
}
when s is "3g", or "go_url", or "hao123", isOnlyWordChars(s) should return true.
private boolean isOnlyWordChars(String s) {
char[] chars = s.toCharArray();
for (char c : chars) {
if(!Character.isLetter(c)) {
return false;
}
}
return true;
}
A better implementation
public static boolean isAlpha(String str) {
if (str == null) {
return false;
}
int sz = str.length();
for (int i = 0; i < sz; i++) {
if (Character.isLetter(str.charAt(i)) == false) {
return false;
}
}
return true;
}
Or, if you are using Apache Commons, StringUtils.isAlpha(). The second implementation in this answer is actually from the source code of isAlpha.
UPDATE
Hi, sorry for the late reply. I wasn't sure about the speed, although I have read in several places that a loop is faster than a regex. To be sure, I ran the following code on ideone, and here are the results:
for 5000000 iteration
with your codes: 4.99 seconds (runtime error after that so for big data it is not working)
with my first code 2.71 seconds
with my second code 2.52 seconds
for 500000 iteration
with your codes: 1.07 seconds
with my first code 0.36 seconds
with my second code 0.33 seconds
Here is the sample code I used.
N.B. There might be small mistakes. You can play with it to test in different scenario.
According to Jan's comment, I think those are minor things, like using private or public. Yes, the condition checking is a nice point.
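One caveat about the loops above: Character.isLetter is false for digits and for the underscore, so they return false for the question's "3g", "go_url" and "hao123". A sketch that stays closer to the question's \w, assuming letters, digits and '_' are the intended word characters (the question's own character-class comment does not list the underscore, so that part is a guess based on "go_url"):

```java
class WordCharCheck {
    // Assumption: a word character is a Unicode letter, a Unicode digit, or '_'
    static boolean isOnlyWordChars(String s) {
        if (s == null || s.isEmpty()) {
            return false; // mirrors \w+, which needs at least one character
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Works per UTF-16 char; supplementary code points are not handled specially
            if (!Character.isLetterOrDigit(c) && c != '_') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isOnlyWordChars("3g"));     // true
        System.out.println(isOnlyWordChars("go_url")); // true
        System.out.println(isOnlyWordChars("hao123")); // true
        System.out.println(isOnlyWordChars("a-b"));    // false
    }
}
```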
I think that the chief problem is your pattern.
I was working through an iterative solution, when I noticed that it failed on one of my test strings Supercalifragilisticexpalidociou5. The reason for this: \w+ only cares if there is one or more word characters. It doesn't care if you're not looking at a word character beyond what it's already matched.
To rectify this, use a lookaround:
(?!\W+)(\w+)
The \W+ condition will lock the regex if one or more characters are found to be a non-word character (such as &*()!#!#$).
The only thing I see is to change your pattern to:
^\\w++$
but I am not a Java expert.
Explanation:
I have added anchors (i.e. ^ and $), which improve the performance of the pattern (the regex engine fails at the first non-word character it meets before the end). I have also added a possessive quantifier (i.e. ++), so the regex engine does not keep backtrack positions and is faster.
More information here.
If you want to do this using regexes, then the most efficient way do it is to change the logic to a negation; i.e. "every character is a letter" becomes "no character is a non-letter".
private static final Pattern pat = Pattern.compile("\\W");
private boolean isOnlyWordChars(String s) {
return !pat.matcher(s).find();
}
This will test each character at most once ... with no backtracking.
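One behavioural difference worth knowing before swapping the two approaches: the negated check accepts the empty string, while \w+ with matches() does not. A small sketch comparing them; the names are made up, and it uses standard Java, where \w defaults to [a-zA-Z_0-9], which may differ from the platform described in the question's comment.

```java
import java.util.regex.Pattern;

class NegatedWordCheck {
    private static final Pattern NON_WORD = Pattern.compile("\\W");
    private static final Pattern ONLY_WORD = Pattern.compile("\\w+");

    // "No character is a non-word character"
    static boolean byNegation(String s) {
        return !NON_WORD.matcher(s).find();
    }

    // "Every character is a word character, and there is at least one"
    static boolean byMatches(String s) {
        return ONLY_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(byNegation("go_url") + " " + byMatches("go_url")); // true true
        System.out.println(byNegation("a b") + " " + byMatches("a b"));       // false false
        System.out.println(byNegation("") + " " + byMatches(""));             // true false
    }
}
```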
