How to properly use java Pattern object to match string patterns

I wrote some code that performs several string operations, including checking whether a given string matches a certain regular expression. It ran fine with 70,000 inputs, but it started throwing out-of-memory errors when I ran it iteratively for five-fold cross-validation. It may simply be that I have to assign more memory, but I have a feeling that I wrote inefficient code, so I wanted to double-check that I didn't make any obvious mistakes.
static Pattern numberPattern = Pattern.compile("^[a-zA-Z]*([0-9]+).*");
public static boolean someMethod(String line) {
String[] tokens = line.split(" ");
for(int i=0; i<tokens.length; i++) {
tokens[i] = tokens[i].replace(",", "");
tokens[i] = tokens[i].replace(";", "");
if(numberPattern.matcher(tokens[i]).find()) return true;
}
return false;
}
I also have many lines like the one below:
token.matches("[a-z]+[A-Z][a-z]+");
Which way is more memory efficient? Do they look efficient enough? Any advice is appreciated!
Edited:
Sorry, the code I first posted was wrong; I intended to fix it before posting this question but forgot at the last minute. The real issue is that I have many similar-looking operations all over the code. Aside from the fact that the example code did not make sense, I wanted to know whether the regexp comparison part was efficient.
Thanks for all of your comments, I'll look through and modify the code following the advice!

Well, first of all, take a second look at your code: it will always return true! You are never reading the 'match' variable, only assigning to it.
Second, String is immutable, so each time you split or replace you create new instances. Why not create a pattern that makes the matches you want while ignoring the commas and semicolons? I'm not sure, but I think it would take less memory.

Yes, this code is indeed inefficient, because you can return immediately once you've found that match = true (there is no point in continuing the loop).
Further, are you sure you need to break the line into tokens? Why not check the regex only once?
And last, if all comparison checks fail, you should return false (last line).

Instead of altering the text and splitting it you can put it all in the regex.
// the \\b means it must be the start of the String or a word
static Pattern numberPattern = Pattern.compile("\\b[a-zA-Z,;]*[0-9,;]*[0-9]");
// return true if the string contains
// a number which might have letters in front
public static boolean someMethod(String line) {
return numberPattern.matcher(line).find();
}

Aside from what @alfasin has mentioned in his answer, you should avoid duplicating code. Rewrite the following:
{
tokens[i] = tokens[i].replace(",", "");
tokens[i] = tokens[i].replace(";", "");
}
Into:
tokens[i] = tokens[i].replaceAll(",|;", "");
And do this before the call to .split(), so that the operation doesn't have to be repeated inside the loop:
String[] tokens = line.replaceAll(",|;", "").split(" ");
                      ^^^^^^^^^^^^^^^^^^^^^^
Edit: After staring at your code for a bit I think I have a better solution, using regex ;)
public static boolean someMethod(String line) {
return Pattern.compile("\\b[a-zA-Z]*\\d")
.matcher(line.replaceAll(",|;", "")).find();
}
\b is a Word Boundary.
It asserts position at the Boundary of a word (Start of line + after spacing)
Code Demo STDOUT:
foo does not match
bar does not match
bar1 does match
foo baz bar bar1 lolz does match
password_01 does not match
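On the original question about token.matches(...): String.matches compiles its regex on every call, whereas a static final Pattern is compiled once and reused. The sketch below shows the precompiled approach under that assumption. The class and method names (TokenChecks, isCamelLike, containsNumber) are made up for illustration; the patterns are the one from the question and the one from the answer above.

```java
import java.util.regex.Pattern;

final class TokenChecks {
    // Compiled once and shared; Pattern objects are immutable and thread-safe.
    private static final Pattern CAMEL_LIKE = Pattern.compile("[a-z]+[A-Z][a-z]+");
    private static final Pattern SEPARATORS = Pattern.compile("[,;]");
    private static final Pattern NUMBER = Pattern.compile("\\b[a-zA-Z]*\\d");

    // Same result as token.matches("[a-z]+[A-Z][a-z]+"), without recompiling.
    static boolean isCamelLike(String token) {
        return CAMEL_LIKE.matcher(token).matches();
    }

    // Strips commas and semicolons, then looks for a word made of
    // optional leading letters followed by a digit.
    static boolean containsNumber(String line) {
        String cleaned = SEPARATORS.matcher(line).replaceAll("");
        return NUMBER.matcher(cleaned).find();
    }

    public static void main(String[] args) {
        System.out.println(isCamelLike("fooBar"));                   // true
        System.out.println(containsNumber("foo baz bar bar1 lolz")); // true
        System.out.println(containsNumber("password_01"));           // false
    }
}
```

Whether this alone cures an out-of-memory error is not certain; it only removes the repeated compilation and the per-token temporary strings.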

Related

Check special arrangement of specific signs in a string in Java

I need to check a string whether it includes a specific arrangements of letters and numbers.
Valid arrangements are for example:
X
X-Y
A-H-K-L-J-Y
A-H-J-Y
123
12?
12*
12-17
Invalid are for example:
-X-Y
-XY
*12
?12
I have written this method in Java to solve this problem (but I don't have much experience with regular expressions):
public boolean checkPatternMatching(String sourceToScan, String searchPattern) {
boolean patternFounded;
if (sourceToScan == null) {
patternFounded = false;
} else {
Pattern pattern = Pattern.compile(Pattern.quote(searchPattern),
Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(sourceToScan);
patternFounded = matcher.find();
}
return patternFounded;
}
How can I implement this requirement with regular expressions?
By the way: is it a good solution to check whether a string has numeric content by using the isNumeric method from the Java class StringUtils?
//EDIT
The link that was added by the admins does not cover specific arrangements of characters, only the occurrence of characters with regular expressions in general!

After a good while trying to help, answering to constantly changing questions, just found out that the same was asked yesterday, and that the OP doesn't accept answers to his questions...all I have left to say is good night sir, good luck
n-th answer follows:
First pattern: [a-z](-[a-z])* : a letter, possibly followed by more letters, separated by -.
Second pattern: \d+(-\d+)*[?*]* : a number, possibly followed by more numbers, separated by -, and possibly ending with ? or *.
So join them together: ^(?:[a-z](-[a-z])*|\d+(-\d+)*[?*]*)$. ^ and $ mark the beginning and the end of the string; the alternatives are grouped so that both anchors apply to each of them.
Few more comments on the code: you don't need to use Pattern.quote, and you should use matches() instead of find(), because find() returns true if any part of the string matches the pattern, and you want the whole string:
public static boolean checkPatternMatching(String sourceToScan, String searchPattern) {
boolean patternFounded;
if (sourceToScan == null) {
patternFounded = false;
} else {
Pattern pattern = Pattern.compile(searchPattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(sourceToScan);
patternFounded = matcher.matches();
}
return patternFounded;
}
Called like this: checkPatternMatching(s, "^(?:[a-z](-[a-z])*|\\d+(-\\d+)*[?*]*)$")
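To check the combined pattern against the examples from the question, a small harness along these lines may help. It is only a sketch: the class name ArrangementCheck and the method isValid are hypothetical, and it uses a grouped form of the combined pattern so that the anchors cover both alternatives.

```java
import java.util.regex.Pattern;

final class ArrangementCheck {
    // Grouped alternation so that ^ and $ apply to both branches.
    private static final Pattern ARRANGEMENT = Pattern.compile(
            "^(?:[a-z](-[a-z])*|\\d+(-\\d+)*[?*]*)$", Pattern.CASE_INSENSITIVE);

    // matches() requires the whole input to fit the pattern.
    static boolean isValid(String s) {
        return s != null && ARRANGEMENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "X", "X-Y", "A-H-K-L-J-Y", "A-H-J-Y", "123", "12?", "12*", "12-17", // valid
            "-X-Y", "-XY", "*12", "?12"                                          // invalid
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```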
About the second question, this is the current implementation of StringUtils.isNumeric:
public static boolean isNumeric(final CharSequence cs) {
if (isEmpty(cs)) {
return false;
}
final int sz = cs.length();
for (int i = 0; i < sz; i++) {
if (Character.isDigit(cs.charAt(i)) == false) {
return false;
}
}
return true;
}
So no, there is nothing wrong about it, that is as simple as it gets. But you need to include an external JAR in your program, which I find unnecessary if you just want to use such a simple method.
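If pulling in the JAR is not wanted, an equivalent check can be written in a few lines with the standard library only. This is a sketch that assumes Java 8 or later (for CharSequence.chars()); the class name Digits is made up for the example.

```java
final class Digits {
    // Returns true only for non-empty input in which every char is a digit,
    // mirroring the behaviour of the isNumeric source shown above.
    static boolean isNumeric(CharSequence cs) {
        if (cs == null || cs.length() == 0) {
            return false;
        }
        return cs.chars().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        System.out.println(isNumeric("123"));
        System.out.println(isNumeric("12-17"));
    }
}
```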

I believe that you should first remove the Pattern.quote() method because that would turn the inputting patterns into string literals; and those are not really useful in your context.
To match the valid arrangements with letters, something like this should work:
^[a-z](?:-[a-z])*$
For the numbers (if I understood the rules correctly):
^\\d+(?:[?*]|-\\d+)*$
And if you want to combine them:
^(?:[a-z](?:-[a-z])*|\\d+(?:[?*]|-\\d+)*)$
I'm not familiar with Java itself, nor the isNumeric method, sorry.
As per your comment, if you want to accept *12 or 1?2 or 12*456, you can use:
^\\*?\\d+(?:[?*]\\d*|-\\d+)*$
Then add it to the previous regex like so:
^(?:[a-z](?:-[a-z])*|\\*?\\d+(?:[?*]\\d*|-\\d+)*)$

regex is very slow, how to check if a string is only with word chars fast?

I have a function that checks whether a string (most of the strings are a single CJK character) consists only of word characters. It is invoked a great many times, so its cost is unacceptable, but I don't know how to optimize it. Any suggestions?
/*\w is equivalent to the character class [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
For more details see Unicode TR-18, and bear in mind that the set of characters
in each class can vary between Unicode releases.*/
private static final Pattern sOnlyWordChars = Pattern.compile("\\w+");
private boolean isOnlyWordChars(String s) {
return sOnlyWordChars.matcher(s).matches();
}
when s is "3g", or "go_url", or "hao123", isOnlyWordChars(s) should return true.

private boolean isOnlyWordChars(String s) {
char[] chars = s.toCharArray();
for (char c : chars) {
if(!Character.isLetter(c)) {
return false;
}
}
return true;
}
A better implementation
public static boolean isAlpha(String str) {
if (str == null) {
return false;
}
int sz = str.length();
for (int i = 0; i < sz; i++) {
if (Character.isLetter(str.charAt(i)) == false) {
return false;
}
}
return true;
}
Or, if you are using Apache Commons, use StringUtils.isAlpha(). The second implementation in this answer is actually from the source code of isAlpha.
UPDATE
Hi, sorry for the late reply. I wasn't quite sure about the speed, although I had read in several places that a loop is faster than a regex. To be sure, I ran the following code on ideone, and here are the results:
For 5,000,000 iterations:
with your code: 4.99 seconds (runtime error after that, so it does not work for big data)
with my first code: 2.71 seconds
with my second code: 2.52 seconds
For 500,000 iterations:
with your code: 1.07 seconds
with my first code: 0.36 seconds
with my second code: 0.33 seconds
Here is the sample code I used.
N.B. There might be small mistakes. You can play with it to test different scenarios.
Regarding Jan's comment: I think those are minor things, like using private or public. Yes, the condition checking is a nice point.
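One caveat with the loops above: Character.isLetter rejects digits and the underscore, so the question's examples "3g", "go_url" and "hao123" would return false. A variant that accepts them might look like the sketch below. It assumes that "word character" means letter, digit or underscore, which is an interpretation of the question and not an exact reimplementation of \w; the class name WordChars is hypothetical.

```java
final class WordChars {
    // True when s is non-empty and every char is a letter, a digit or '_'.
    // An empty string returns false, as "\\w+" with matches() would.
    static boolean isOnlyWordChars(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isLetterOrDigit(c) && c != '_') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isOnlyWordChars("go_url"));
        System.out.println(isOnlyWordChars("go url"));
    }
}
```

On a standard JVM, \w is ASCII-only unless Pattern.UNICODE_CHARACTER_CLASS is set, so this loop is closer to the Unicode-aware definition quoted in the question's comment than a plain "\\w+" would be there.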

I think that the chief problem is your pattern.
I was working through an iterative solution when I noticed that it failed on one of my test strings, Supercalifragilisticexpalidociou5. The reason for this: \w+ only cares if there is one or more word characters. It doesn't care if you're not looking at a word character beyond what it's already matched.
To rectify this, use a lookaround:
(?!\W+)(\w+)
The \W+ condition will lock the regex if one or more characters are found to be a non-word character (such as &*()!#!#$).

The only thing I see is to change your pattern to:
^\\w++$
but I am not a Java expert.
Explanation:
I have added anchors (i.e. ^ and $), which improve the performance of the pattern (the regex engine fails at the first non-word character it encounters before reaching the end). I have added a possessive quantifier (i.e. ++), so the regex engine does not keep backtracking positions and is faster.
More information here.

If you want to do this using regexes, then the most efficient way do it is to change the logic to a negation; i.e. "every character is a letter" becomes "no character is a non-letter".
private static final Pattern pat = Pattern.compile("\\W");
private boolean isOnlyWordChars(String s) {
return !pat.matcher(s).find();
}
This will test each character at most once ... with no backtracking.
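One difference worth checking before swapping the two forms: the negated version returns true for an empty string, because there is no non-word character to find, while "\\w+" with matches() returns false. The sketch below compares both; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

final class NegatedWordCheck {
    private static final Pattern NON_WORD = Pattern.compile("\\W");
    private static final Pattern ONLY_WORD = Pattern.compile("\\w+");

    // "No character is a non-word character".
    static boolean byNegation(String s) {
        return !NON_WORD.matcher(s).find();
    }

    // "The whole string is one or more word characters".
    static boolean byMatch(String s) {
        return ONLY_WORD.matcher(s).matches();
    }

    // Negation with an explicit emptiness check, to agree with byMatch.
    static boolean byNegationNonEmpty(String s) {
        return !s.isEmpty() && byNegation(s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"hao123", "go url", ""}) {
            System.out.println("\"" + s + "\": negation=" + byNegation(s)
                    + ", match=" + byMatch(s));
        }
    }
}
```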

Regex not finding string

I am having issues with this code:
For some reason, it always fails to match the code.
for (int i = 0; i < pluginList.size(); i++) {
System.out.println("");
String findMe = "plugins".concat(FILE_SEPARATOR).concat(pluginList.get(i));
Pattern pattern = Pattern.compile("("+name.getPath()+")(.*)");
Matcher matcher = pattern.matcher(findMe);
// Check if the current plugin matches the string.
if (matcher.find()) {
return !pluginListMode;
}
}

All you really need is
return ("plugins"+FILE_SEPARATOR+pluginName).indexOf(name.getPath()) != -1;
But your code also makes no sense due to the fact that there's no way for that for-loop to enter a second iteration -- it returns unconditionally. So more probably you need something like this:
for (String pluginName : pluginList)
if (("plugins"+FILE_SEPARATOR+pluginName).indexOf(name.getPath()) != -1)
return false;
return true;

Right now we can only guess since we don't know what name.getPath() might return.
I suspect it fails because that string might contain characters that have special meaning inside regexes. Try it again with
Pattern pattern = Pattern.compile("("+Pattern.quote(name.getPath())+")(.*)");
and see what happens then.
Also the (.*) part (and even the parentheses around your name.getPath() result) don't appear to matter at all since you're not doing anything with the result of the match itself. At which point the question is why you're using a regex in the first place.
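Since name.getPath() is unknown, the following sketch uses a made-up path containing parentheses to show how an unquoted path can silently fail to match itself, and how Pattern.quote or a plain contains() avoids that. The strings and the class name QuoteDemo are assumptions for illustration only.

```java
import java.util.regex.Pattern;

final class QuoteDemo {
    // Treats needle as a regex: metacharacters such as ( ) keep their meaning.
    static boolean findRaw(String needle, String haystack) {
        return Pattern.compile(needle).matcher(haystack).find();
    }

    // Treats needle as literal text.
    static boolean findQuoted(String needle, String haystack) {
        return Pattern.compile(Pattern.quote(needle)).matcher(haystack).find();
    }

    public static void main(String[] args) {
        String needle = "plugins/tool(1)";              // hypothetical path
        String haystack = "plugins/tool(1)/readme.txt"; // hypothetical file

        System.out.println(findRaw(needle, haystack));    // false: "(1)" is a group matching "1"
        System.out.println(findQuoted(needle, haystack)); // true
        System.out.println(haystack.contains(needle));    // true, no regex needed
    }
}
```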

String splitting

I have a string. What is the best way to put the things in between the $ signs into a list in Java?
String temp = "$abc$and$xyz$";
How can I get all the variables within the $ signs as a list in Java?
[abc, xyz]
I can do it using StringTokenizer, but I want to avoid it if possible.
Thanks.

Maybe you could think about calling String.split(String regex) ...

The pattern is simple enough that String.split should work here, but in the more general case, one alternative for StringTokenizer is the much more powerful java.util.Scanner.
String text = "$abc$and$xyz$";
Scanner sc = new Scanner(text);
while (sc.findInLine("\\$([^$]*)\\$") != null) {
System.out.println(sc.match().group(1));
} // abc, xyz
The pattern to find is:
\$([^$]*)\$
  \_____/     i.e. literal $, a sequence of anything but $ (captured in group 1)
     1        and another literal $
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
(…) is used for grouping. (pattern) is a capturing group and creates a backreference.
The backslash preceding the $ (outside of a character class definition) is used to escape the $, which has a special meaning as the end-of-line anchor. That backslash is doubled in a String literal: "\\" is a String of length one containing a backslash.
This is not a typical usage of Scanner (usually the delimiter pattern is set, and tokens are extracted using next), but it does show how you'd use findInLine to find an arbitrary pattern (ignoring delimiters), and then use match() to access the MatchResult, from which you can get individual group captures.
You can also use this Pattern in a Matcher find() loop directly.
Matcher m = Pattern.compile("\\$([^$]*)\\$").matcher(text);
while (m.find()) {
System.out.println(m.group(1));
} // abc, xyz
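Since the question asks for a list and not printed output, the same find() loop can collect the captures. A minimal sketch, with the class name DollarTokens and the method extract chosen only for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DollarTokens {
    // A literal $, anything but $ (group 1), and a closing literal $.
    private static final Pattern DOLLAR_PAIR = Pattern.compile("\\$([^$]*)\\$");

    // Collects the text of every $...$ pair, scanning left to right.
    static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = DOLLAR_PAIR.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("$abc$and$xyz$")); // [abc, xyz]
    }
}
```

Because each match consumes its closing $, the text between two pairs ("and" here) is skipped, which is what the expected output in the question implies.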
Related questions
Validating input using java.util.Scanner
Scanner vs. StringTokenizer vs. String.Split

Just try this one: temp.split("\\$");

I would go for a regex myself, like Riduidel said.
This special case is, however, simple enough that you can just treat the String as a character sequence, and iterate over it char by char, and detect the $ sign. And so grab the strings yourself.
On a side note, I would try to go for different demarcation characters, to make it more readable to humans. Use $ as start-of-sequence and something else as end-of-sequence, for instance. Or something like what I think the Bash shell uses: ${some_value}. As said, the computer doesn't care, but you debugging your string just might :)
As for an appropriate regex, something like (\\$.*\\$)* or so should do. Though I'm no expert on regexes (see http://www.regular-expressions.info for nice info on regexes).

Basically I'd ditto Khotyn as the easiest solution. I see you post on his answer that you don't want zero-length tokens at beginning and end.
That brings up the question: What happens if the string does not begin and end with $'s? Is that an error, or are they optional?
If it's an error, then just start with:
if (!text.startsWith("$") || !text.endsWith("$"))
return "Missing $'s"; // or whatever you do on error
If that passes, fall into the split.
If the $'s are optional, I'd just strip them out before splitting. i.e.:
if (text.startsWith("$"))
text=text.substring(1);
if (text.endsWith("$"))
text=text.substring(0,text.length()-1);
Then do the split.
Sure, you could make more sophisticated regex's or use StringTokenizer or no doubt come up with dozens of other complicated solutions. But why bother? When there's a simple solution, use it.
PS There's also the question of what result you want to see if there are two $'s in a row, e.g. "$foo$$bar$". Should that give ["foo","bar"], or ["foo","","bar"]? Khotyn's split will give the second result, with zero-length strings. If you want the first result, you should split("\\$+").
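The strip-then-split approach described above can be put together as follows. This is a sketch that treats the surrounding $'s as optional and collapses runs of $; the class name SplitDemo and the method tokens are hypothetical.

```java
import java.util.Arrays;

final class SplitDemo {
    // Removes one optional leading and trailing $, then splits on runs of $.
    static String[] tokens(String text) {
        if (text.startsWith("$")) {
            text = text.substring(1);
        }
        if (text.endsWith("$")) {
            text = text.substring(0, text.length() - 1);
        }
        return text.split("\\$+");
    }

    public static void main(String[] args) {
        // A plain split keeps the empty string before the first $,
        // but drops trailing empty strings.
        System.out.println(Arrays.toString("$abc$and$xyz$".split("\\$"))); // [, abc, and, xyz]
        System.out.println(Arrays.toString(tokens("$abc$and$xyz$")));      // [abc, and, xyz]
        System.out.println(Arrays.toString(tokens("$foo$$bar$")));         // [foo, bar]
    }
}
```

Note that a split treats every $ as a delimiter, so "and" is returned as a token here, unlike the $...$ pair matching shown in the Scanner answer.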

If you want a simple split function then use Apache Commons Lang which has StringUtils.split. The java one uses a regex which can be overkill/confusing.

You can do it in simple manner writing your own code.
Just use the following code and it will do the job for you
import java.util.ArrayList;
import java.util.List;
public class MyStringTokenizer {
/**
* @param args
*/
public static void main(String[] args) {
List <String> result = getTokenizedStringsList("$abc$efg$hij$");
for(String token : result)
{
System.out.println(token);
}
}
private static List<String> getTokenizedStringsList(String string) {
List <String> tokenList = new ArrayList <String> ();
char [] in = string.toCharArray();
StringBuilder myBuilder = null;
int stringLength = in.length;
int start = -1;
int end = -1;
{
for(int i=0; i<stringLength;)
{
myBuilder = new StringBuilder();
while(i<stringLength && in[i] != '$')
i++;
i++;
while((i)<stringLength && in[i] != '$')
{
myBuilder.append(in[i]);
i++;
}
tokenList.add(myBuilder.toString());
}
}
return tokenList;
}
}

You can use
String temp = "$abc$and$xyz$";
String array[]=temp.split(Pattern.quote("$"));
List<String> list=new ArrayList<String>();
for(int i=0;i<array.length;i++){
list.add(array[i]);
}
Now the list has what you want.

How to validate phone number(US format) in Java?

I just want to know where I am going wrong here:
import java.io.*;
class Tokens{
public static void main(String[] args)
{
//String[] result = "this is a test".split("");
String[] result = "4543 6546 6556".split("");
boolean flag= true;
String num[] = {"0","1","2","3","4","5","6","7","8","9"};
String specialChars[] = {"-","#","#","*"," "};
for (int x=1; x<result.length; x++)
{
for (int y=0; y<num.length; y++)
{
if ((result[x].equals(num[y])))
{
flag = false;
continue;
}
else
{
flag = true;
}
if (flag == true)
break;
}
if (flag == false)
break;
}
System.out.println(flag);
}
}

If this is not homework, is there a reason you're avoiding regular expressions?
Here are some useful ones: http://regexlib.com/DisplayPatterns.aspx?cattabindex=6&categoryId=7
More generally, your code doesn't seem to validate that you have a phone number; it seems to merely validate that your string consists only of digits. You're also not allowing any special characters right now.

Aside from the regex suggestion (which is a good one), it would seem to make more sense to deal with arrays of characters rather than single-char Strings.
In particular, the split("") call (shudder) could/should be replaced by toCharArray(). This lets you iterate over each individual character, which more clearly indicates your intent, is less prone to bugs as you know you're treating each character at once, and is more efficient*. Likewise your valid character sets should also be characters.
Your logic is pretty strangely expressed; you're not even referencing the specialChars set at all, and the looping logic once you've found a match seems odd. I think this is your bug; the matching seems to be the wrong way round in that if the character matches the first valid char, you set flag to false and continue round the current loop; so it will definitely not match the next valid char and hence you break out of the loop with a true flag. Always.
I would have thought something like this would be more intuitive:
private static final Set<Character> VALID_CHARS = ...;
public boolean isValidPhoneNumber(String number)
{
for (char c : number.toCharArray())
{
if (!VALID_CHARS.contains(c))
{
return false;
}
}
// All characters were valid
return true;
}
This doesn't take sequences into account (e.g. the strings "--------** " and "1" would be valid because all individual characters are valid) but then neither does your original code. A regex is better because it lets you specify the pattern; I supply the above snippet as an example of a clearer way of iterating through the characters.
*Yes, premature optimization is the root of all evil, but when better, cleaner code also happens to be faster that's an extra win for free.

Maybe this is overkill, but with a grammar similar to:
<phone_number> := <area_code><space>*<local_code><space>*<number> |
                  <area_code><space>*"-"<space>*<local_code><space>*"-"<space>*<number>
<area_code>    := <digit><digit><digit> |
                  "("<digit><digit><digit>")"
<local_code>   := <digit><digit><digit>
<number>       := <digit><digit><digit><digit>
you can write a recursive descent parser. See this page for an example.
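Because this grammar has no recursion, it can also be written directly as a regular expression instead of a parser. The sketch below is one possible translation, assuming that <space> means a literal space character and <digit> means \d; the class name PhoneGrammar and its constants are made up for the example.

```java
import java.util.regex.Pattern;

final class PhoneGrammar {
    // Building blocks that mirror the grammar rules.
    private static final String AREA = "(?:\\d{3}|\\(\\d{3}\\))"; // <area_code>
    private static final String LOCAL = "\\d{3}";                  // <local_code>
    private static final String NUMBER = "\\d{4}";                 // <number>
    private static final String SPACES = " *";                     // <space>*
    private static final String DASH = " *- *";                    // <space>*"-"<space>*

    // First alternative: parts separated by optional spaces only.
    // Second alternative: parts separated by dashes, with optional spaces.
    private static final Pattern PHONE = Pattern.compile(
            AREA + SPACES + LOCAL + SPACES + NUMBER
            + "|" + AREA + DASH + LOCAL + DASH + NUMBER);

    static boolean isPhoneNumber(String s) {
        return s != null && PHONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"555 123 4567", "(555) 123 4567",
                                      "555-123-4567", "4543 6546 6556"}) {
            System.out.println(s + " -> " + isPhoneNumber(s));
        }
    }
}
```

As in the grammar, the two separator styles are not mixed: "(555)123-4567" is rejected by this sketch because only the second gap uses a dash.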

You can check out the Pattern class in Java; it makes working with regular expressions really easy:
https://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html.
