java regular expressions: performance and alternative


Recently I had to search a number of string values to see which ones match a certain pattern. Neither the number of string values nor the pattern itself is known until the user enters a search term. The problem is that each time my application runs the following line:
if (stringValue.matches (rexExPattern))
{
// do something so simple
}
it takes about 40 microseconds. Needless to say, when the number of string values exceeds a few thousand, this is too slow.
The pattern is something like:
"A*B*C*D*E*F*"
where A~F are just examples here, but the pattern always has the form shown above. Please note that the pattern actually changes per search. For example, "A*B*C*" may change to "W*D*G*A*".
I wonder if there is a better substitute for the above pattern or, more generally, an alternative to Java regular expressions.

Regular expressions in Java are compiled into an internal data structure. This compilation is the time-consuming process. Each time you invoke the method String.matches(String regex), the specified regular expression is compiled again.
So you should compile your regular expression only once and reuse it:
Pattern pattern = Pattern.compile(regexPattern);
for (String value : values) {
    Matcher matcher = pattern.matcher(value);
    if (matcher.matches()) {
        // your code here
    }
}

Consider the following (quick and dirty) test:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test3 {
// time that tick() was called
static long tickTime;
// called at start of operation, for timing
static void tick () {
tickTime = System.nanoTime();
}
// called at end of operation, prints message and time since tick().
static void tock (String action) {
long mstime = (System.nanoTime() - tickTime) / 1000000;
System.out.println(action + ": " + mstime + "ms");
}
// generate random strings of form AAAABBBCCCCC; a random
// number of characters each randomly repeated.
static List<String> generateData (int itemCount) {
Random random = new Random();
List<String> items = new ArrayList<String>();
long mean = 0;
for (int n = 0; n < itemCount; ++ n) {
StringBuilder s = new StringBuilder();
int characters = random.nextInt(7) + 1;
for (int k = 0; k < characters; ++ k) {
char c = (char)(random.nextInt('Z' - 'A') + 'A');
int rep = random.nextInt(95) + 5;
for (int j = 0; j < rep; ++ j)
s.append(c);
mean += rep;
}
items.add(s.toString());
}
mean /= itemCount;
System.out.println("generated data, average length: " + mean);
return items;
}
// match all strings in items to regexStr, do not precompile.
static void regexTestUncompiled (List<String> items, String regexStr) {
tick();
int matched = 0, unmatched = 0;
for (String item:items) {
if (item.matches(regexStr))
++ matched;
else
++ unmatched;
}
tock("uncompiled: regex=" + regexStr + " matched=" + matched +
" unmatched=" + unmatched);
}
// match all strings in items to regexStr, precompile.
static void regexTestCompiled (List<String> items, String regexStr) {
tick();
Matcher matcher = Pattern.compile(regexStr).matcher("");
int matched = 0, unmatched = 0;
for (String item:items) {
if (matcher.reset(item).matches())
++ matched;
else
++ unmatched;
}
tock("compiled: regex=" + regexStr + " matched=" + matched +
" unmatched=" + unmatched);
}
// test all strings in items against regexStr.
static void regexTest (List<String> items, String regexStr) {
regexTestUncompiled(items, regexStr);
regexTestCompiled(items, regexStr);
}
// generate data and run some basic tests
public static void main (String[] args) {
List<String> items = generateData(1000000);
regexTest(items, "A*");
regexTest(items, "A*B*C*");
regexTest(items, "E*C*W*F*");
}
}
Strings are random sequences of 1-7 characters with each character occurring 5-99 consecutive times (e.g. "AAAAAAGGGGGDDFFFFFF"). I guessed based on your expressions.
Granted, this might not be representative of your data set, but the timing estimates for applying those regular expressions to 1 million randomly generated strings of average length 208 on my modest 2.3 GHz dual-core i5 were:
Regex       Uncompiled   Precompiled
A*          0.564 sec    0.126 sec
A*B*C*      1.768 sec    0.238 sec
E*C*W*F*    0.795 sec    0.275 sec
Actual output:
generated data, average length: 208
uncompiled: regex=A* matched=6004 unmatched=993996: 564ms
compiled: regex=A* matched=6004 unmatched=993996: 126ms
uncompiled: regex=A*B*C* matched=18677 unmatched=981323: 1768ms
compiled: regex=A*B*C* matched=18677 unmatched=981323: 238ms
uncompiled: regex=E*C*W*F* matched=25495 unmatched=974505: 795ms
compiled: regex=E*C*W*F* matched=25495 unmatched=974505: 275ms
Even without the speedup of precompiled expressions, and even considering that the results vary wildly depending on the data set and regular expression (and even considering that I broke a basic rule of proper Java performance tests and forgot to prime HotSpot first), this is very fast, and I still wonder if the bottleneck is truly where you think it is.
After switching to precompiled expressions, if you still are not meeting your actual performance requirements, do some profiling. If you find your bottleneck is still in your search, consider implementing a more optimized search algorithm.
For example, assuming your data set is like my test set above: if your data set is known ahead of time, reduce each item in it to a smaller string key by removing repeated characters, e.g. for "AAAAAAABBBBCCCCCCC", store it in a map of some sort keyed by "ABC". When a user searches for "ABC*" (presuming your regexes are in that particular form), look for "ABC" items. Or whatever. It highly depends on your scenario.
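The key-based idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name RunKeyIndex and the helpers collapse and buildIndex are made up for this example, and an exact key lookup only finds items in which every letter of the key occurs at least once, so a pattern such as A*B*C* (where each letter is optional) would also need lookups for the shorter keys "AB", "AC", "A" and so on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the original answer.
class RunKeyIndex {
    // Collapse consecutive repeats: "AAAABBBCC" becomes "ABC".
    static String collapse(String s) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (i == 0 || c != s.charAt(i - 1)) {
                key.append(c);
            }
        }
        return key.toString();
    }

    // Group all items by their collapsed key, once, ahead of any search.
    static Map<String, List<String>> buildIndex(List<String> items) {
        Map<String, List<String>> index = new HashMap<>();
        for (String item : items) {
            index.computeIfAbsent(collapse(item), k -> new ArrayList<>()).add(item);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("AAAABBBCC", "ABBBBC", "AAACCC", "XXYY");
        Map<String, List<String>> index = buildIndex(items);
        // A search for the key "ABC" is now a single map lookup.
        System.out.println(index.get("ABC"));
    }
}
```

Whether this pays off depends on how often the data set changes relative to how often it is searched, since the index has to be rebuilt or updated whenever items are added.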

Related

Checking for consecutively repeated characters in java

I am quite new to java. I am wondering if it is possible to check for a certain number of consecutively repeated characters (the 'certain number' being determined by the user) in a string or an index in a string array. So far I have tried
int multiple_characters = 0;
String array1 [] = {"abc","aabc","xyyyxy"};
for (int index = 0; index < array1.length; index++){
for (int i = 0; i < array1[index].length(); i++){
if (array1[index].charAt(i) == array1[index].charAt(i+1)){
multiple_characters++;
}
}
}
But with this I get a StringIndexOutOfBounds error. I tried fixing this by putting in an extra if statement to make sure i was not equal to the array1[index].length, but this still threw up the same error. Other than the manual and cop-out method of:
if ((array1[index].charAt(i) == array1[index].charAt(i+1)) && (array1[index].charAt(i) == array1[index].charAt(i+2)))
and repeating however many times, (which would not be great for quick changes to my code), I can't seem to find a solution.

In the inner for loop (the one with the i variable), you call string.charAt(i+1) while i loops from 0 to the length of that string.
No wonder you get an index out of bounds exception: you're asking for the character AFTER the last one.
I advise that you try to understand the exception, and if you can't, debug your code: step through it one line at a time, or, if you don't know how to use a debugger, add println statements and check that the code does what you think it does. The place where your code acts differently from your expectation is where the bug is.
This plan of 'oh, it does not work, I'll just chuck it out entirely and find another way to do it' is suboptimal :) Go back to the first snippet, and just fix this.

You are getting StringIndexOutOfBoundsException because you are trying to access string.charAt(i + 1) where i goes up to the highest index (i.e. string.length() - 1) of string.
You can do it as follows:
class Main {
    public static void main(String[] args) {
        String[] array1 = { "abc", "aabc", "xyyyxy" };
        for (int index = 0; index < array1.length; index++) {
            String s = array1[index];
            System.out.println("String: " + s);
            int i;
            for (i = 0; i < s.length() - 1; i++) {
                int multiple_characters = 1;
                // Check the bound first so charAt(i + 1) is never called past the end.
                while (i < s.length() - 1 && s.charAt(i) == s.charAt(i + 1)) {
                    multiple_characters++;
                    i++;
                }
                System.out.println(s.charAt(i) + " has been repeated consecutively " + multiple_characters
                        + " time(s)");
            }
            // A final character that is not part of a longer run is not reported by the loop.
            if (i == s.length() - 1) {
                System.out.println(s.charAt(i) + " has been repeated consecutively 1 time(s)");
            }
            System.out.println("------------");
        }
    }
}
Output:
String: abc
a has been repeated consecutively 1 time(s)
b has been repeated consecutively 1 time(s)
c has been repeated consecutively 1 time(s)
------------
String: aabc
a has been repeated consecutively 2 time(s)
b has been repeated consecutively 1 time(s)
c has been repeated consecutively 1 time(s)
------------
String: xyyyxy
x has been repeated consecutively 1 time(s)
y has been repeated consecutively 3 time(s)
x has been repeated consecutively 1 time(s)
y has been repeated consecutively 1 time(s)
------------

If I was to look for repeated characters, I would go the regular expression route. For example to look for repeated a characters (repeated twice in this example), you could have:
import java.util.regex.Pattern;
public class Temp {
public static void main(final String[] args) {
String array1 [] = {"abc","aabc","xyyyxy"};
for (String item : array1){
if (Pattern.compile("[a]{2}").matcher(item).find()) {
System.out.println(item + " matches");
}
}
}
}
In this extract, the regular expression is "[a]{2}", which looks for the character a occurring twice in a row.
Of course, more complicated regular expressions are required for more complex matches. A good resource explaining them is:
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html
Another point is that, for efficiency's sake, it is common practice to move the:
Pattern.compile(*Pattern*)
outside of the method call, e.g. to a static final field.
This Stack Overflow question:
RegEx No more than 2 identical consecutive characters and a-Z and 0-9
gives quite a detailed description of the regular expression issues involved with this problem.
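Since the question asks for a user-chosen number of repeats of any character, not just a, one possible generalization uses a backreference. The sketch below is an assumption-laden illustration: the class name RepeatedRun and the method hasRun are hypothetical, and the pattern is rebuilt per call for brevity (for a fixed n it would be compiled once, as noted above).

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class RepeatedRun {
    // Returns true if s contains some character repeated at least n times in a row.
    // "(.)" captures one character and "\\1{n-1}" requires n-1 further copies of it.
    // Note that "." does not match line terminators by default.
    static boolean hasRun(String s, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        Pattern pattern = Pattern.compile("(.)\\1{" + (n - 1) + "}");
        return pattern.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] array1 = { "abc", "aabc", "xyyyxy" };
        for (String item : array1) {
            System.out.println(item + " has a run of 3: " + hasRun(item, 3));
        }
    }
}
```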

Regexp for Word similarity "n letter difference"

Assume I have a word like this: mert. I want to search for all 1-letter-difference variants of that word. aert, ert, meat, mmert, merst, merts etc. are all applicable. So my regular expression is like
[a-z]{0,2}ert OR m[a-z]{0,2}rt OR me[a-z]{0,2}t OR mer[a-z]{0,2}
So for an n-letter difference, I just replace 2 with n+1 and get all combinations.
My question is this: is there any shorter way of writing this regexp?
Thanks

Please check this solution. I have tested the code below, and it seems to work.
/**
* The function returns the list of words matched within nth_difference
*
* @param pattern search pattern
* @param data input data
* @param nth_difference difference
* @return the matched words
*/
static List<String> getNthDifferenceWords(String pattern, String[] data, int nth_difference) {
Map<Character, Integer> frequencyTable = new HashMap<>();
List<String> matchedWords = new ArrayList<>();
//Code complexity : O(n)
for (int i = 0; i < pattern.length(); ++i) {
frequencyTable.put(pattern.charAt(i), 1);
}
//Code complexity : O(m) where m is size of entire input;
for (String input : data) {
int matchCounter = 0;
for (int j=0; j<input.length(); ++j){
if(frequencyTable.containsKey(input.charAt(j))){
++matchCounter;
}
}
//System.out.println("matched=" + matchCounter);
if(input.length() <= pattern.length() && (matchCounter == pattern.length() - nth_difference)){
matchedWords.add(input);
}
if((input.length() - pattern.length() == 1) && (matchCounter >= input.length() - nth_difference)){
matchedWords.add(input);
}
}
return matchedWords;
}
public static void main(String[] args) {
int nth_difference = 1;
String pattern = "mert";
String[] data = new String[]{"aert", "ert", "meat", "mmert", "merst", "merts","meritos"};
System.out.println(getNthDifferenceWords(pattern,data,nth_difference));
nth_difference = 2;
pattern = "merit";
data = new String[]{"aert", "ert", "meat", "mmert", "merst", "merts","demerit","merito", "meritos"};
System.out.println(getNthDifferenceWords(pattern,data,nth_difference));
}
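As a point of comparison, a different and more standard way to express "at most n letters different" is the Levenshtein edit distance (insertions, deletions and substitutions each count as one). This is not the method used in the answer above; it is a swapped-in technique, and the names EditDistance, distance and within are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class illustrating Levenshtein distance.
class EditDistance {
    // Classic dynamic programming over two rows; O(a.length * b.length) time.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j letters of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i letters of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keep every candidate whose distance to the pattern is at most n.
    static List<String> within(String pattern, String[] data, int n) {
        List<String> matched = new ArrayList<>();
        for (String word : data) {
            if (distance(pattern, word) <= n) {
                matched.add(word);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String[] data = { "aert", "ert", "meat", "mmert", "merst", "merts", "meritos" };
        System.out.println(within("mert", data, 1));
    }
}
```

Comparing every word against the pattern is linear in the size of the word list, so for a large dictionary an index-based approach may be preferable.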

For a 1-letter difference, pre-build a table in the following way. Build a 2-column lexicon with the 'word' in the second column, and the following in the first column: One position at a time, remove one letter from the word.
Example: "meat" is the word; here are the rows for it in the table:
`col1` `col2`
------ ------
meat meat
eat meat
mat meat
met meat
mea meat
For "meet" (note the dup letter):
meet meet
eet meet
met meet -- only needed once
mee meet
Then test in a similar way. When searching for "mert", do
WHERE col1 IN ('mert', 'ert', 'mrt', 'met', 'mer')
Note that you will get both "meat" and "meet" from the above example. Note also what will happen with "met" and "meets".
And, it checks for simple transpositions. Searching for "meta":
WHERE col1 IN ('meta', 'eta', 'mta', 'mea', 'met')
will find "meat", "meet" (and other words like met, mean, ...) Arguably, "meta" -> "mean" is a 2-letter distance, but oh well.
Checking your test cases-- mert vs
aert -- via "ert"
ert -- via "ert"
meat -- via "met"
mmert -- via "mert"
merst -- via "mert"
merts -- via "mert"
Meanwhile, have PRIMARY KEY(col1, col2), INDEX(col2) on that table.
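The same lookup table can be sketched in memory instead of SQL. The following is only an illustration of the idea described above, with hypothetical names (DeletionIndex, variants, add, lookup); the map plays the role of the col1 to col2 table.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory version of the two-column table.
class DeletionIndex {
    // Maps each variant (col1) to the dictionary words (col2) that produce it.
    private final Map<String, Set<String>> index = new HashMap<>();

    // The word itself plus every string obtained by removing one letter.
    // A set is used so duplicates such as "met" from "meet" appear once.
    static Set<String> variants(String word) {
        Set<String> result = new LinkedHashSet<>();
        result.add(word);
        for (int i = 0; i < word.length(); i++) {
            result.add(word.substring(0, i) + word.substring(i + 1));
        }
        return result;
    }

    void add(String word) {
        for (String variant : variants(word)) {
            index.computeIfAbsent(variant, k -> new TreeSet<>()).add(word);
        }
    }

    // Equivalent of: WHERE col1 IN (variants of the query).
    Set<String> lookup(String query) {
        Set<String> found = new TreeSet<>();
        for (String variant : variants(query)) {
            found.addAll(index.getOrDefault(variant, Collections.emptySet()));
        }
        return found;
    }

    public static void main(String[] args) {
        DeletionIndex lexicon = new DeletionIndex();
        for (String w : new String[] { "aert", "ert", "meat", "mmert", "merst", "merts", "meritos" }) {
            lexicon.add(w);
        }
        System.out.println(lexicon.lookup("mert"));
    }
}
```

The index stores roughly one entry per letter of every word, which trades memory for lookups that do not scan the whole lexicon.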

Find all valid words when given a string of characters (Recursion / Binary Search)

I'd like some feedback on a method I tried to implement that isn't working 100%. I'm making an Android app for practice where the user is given 20 random letters. The user then uses these letters to make a word of whatever size. It then checks a dictionary to see if it is a valid English word.
The part that's giving me trouble is showing a "hint". If the user is stuck, I want to display the possible words that can be made. I initially thought of recursion. However, with 20 letters this can take quite a long time to execute. So, I also implemented a binary search to check if the current recursion path is a prefix of anything in the dictionary. I do get valid hints as output; however, it's not returning all possible words. Do I have a mistake here in my recursion thinking? Also, is there a recommended, faster algorithm? I've seen a method in which you check each word in a dictionary and see if the characters can make each word. However, I'd like to know how effective my method is vs. that one.
private static void getAllWords(String letterPool, String currWord) {
//Add to possibleWords when valid word
if (letterPool.equals("")) {
//System.out.println("");
} else if(currWord.equals("")){
for (int i = 0; i < letterPool.length(); i++) {
String curr = letterPool.substring(i, i+1);
String newLetterPool = (letterPool.substring(0, i) + letterPool.substring(i+1));
if(dict.contains(curr)){
possibleWords.add(curr);
}
boolean prefixInDic = binarySearch(curr);
if( !prefixInDic ){
break;
} else {
getAllWords(newLetterPool, curr);
}
}
} else {
//Every time we add a letter to currWord, delete from letterPool
//Attach new letter to curr and then check if in dict
for(int i=0; i<letterPool.length(); i++){
String curr = currWord + letterPool.substring(i, i+1);
String newLetterPool = (letterPool.substring(0, i) + letterPool.substring(i+1));
if(dict.contains(curr)) {
possibleWords.add(curr);
}
boolean prefixInDic = binarySearch(curr);
if( !prefixInDic ){
break;
} else {
getAllWords(newLetterPool, curr);
}
}
}
}
private static boolean binarySearch(String word){
int max = dict.size() - 1;
int min = 0;
int currIndex = 0;
boolean result = false;
while(min <= max) {
currIndex = (min + max) / 2;
if (dict.get(currIndex).startsWith(word)) {
result = true;
break;
} else if (dict.get(currIndex).compareTo(word) < 0) {
min = currIndex + 1;
} else if(dict.get(currIndex).compareTo(word) > 0){
max = currIndex - 1;
} else {
result = true;
break;
}
}
return result;
}

The simplest way to speed up your algorithm is probably to use a Trie (a prefix tree)
Trie data structures offer two relevant methods. isWord(String) and isPrefix(String), both of which take O(n) comparisons to determine whether a word or prefix exist in a dictionary (where n is the number of letters in the argument). This is really fast because it doesn't matter how large your dictionary is.
For comparison, your method for checking if a prefix exists in your dictionary using binary search is O(n*log(m)) where n is the number of letters in the string and m is the number of words in the dictionary.
I coded up a similar algorithm to yours using a Trie and compared it to the code you posted (with minor modifications) in a very informal benchmark.
With 20-char input, the Trie took 9ms. The original code didn't complete in reasonable time so I had to kill it.
Edit:
As to why your code doesn't return all hints, you don't want to break if the prefix is not in your dict. You should continue to check the next prefix instead.
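A minimal sketch of what such a Trie-based search might look like is given below. It is not the code that was benchmarked above; the class TrieHints and its methods are hypothetical names chosen for illustration. Note how a dead prefix leads to continue rather than break, so the remaining letters are still tried.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical Trie with the two lookups described above plus a hint search.
class TrieHints {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean word; // true if the path from the root to this node spells a word
    }

    private final Node root = new Node();

    void insert(String w) {
        Node node = root;
        for (char c : w.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.word = true;
    }

    // Walks one node per character; returns null if the path does not exist.
    private Node find(String s) {
        Node node = root;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children.get(s.charAt(i));
        }
        return node;
    }

    boolean isWord(String s) {
        Node node = find(s);
        return node != null && node.word;
    }

    boolean isPrefix(String s) {
        return find(s) != null;
    }

    // All dictionary words that can be built from the given letters.
    Set<String> hints(String letters) {
        Set<String> found = new TreeSet<>();
        collect(root, letters.toCharArray(), new boolean[letters.length()], new StringBuilder(), found);
        return found;
    }

    private void collect(Node node, char[] pool, boolean[] used, StringBuilder current, Set<String> found) {
        if (node.word) {
            found.add(current.toString());
        }
        for (int i = 0; i < pool.length; i++) {
            if (used[i]) {
                continue;
            }
            Node next = node.children.get(pool[i]);
            if (next == null) {
                continue; // dead prefix: skip this letter but keep trying the others
            }
            used[i] = true;
            current.append(pool[i]);
            collect(next, pool, used, current, found);
            current.setLength(current.length() - 1);
            used[i] = false;
        }
    }

    public static void main(String[] args) {
        TrieHints trie = new TrieHints();
        for (String w : new String[] { "cat", "act", "at", "tack" }) {
            trie.insert(w);
        }
        System.out.println(trie.hints("tca"));
    }
}
```

Walking the Trie node by node during the recursion also avoids re-checking the whole prefix from the start at every step.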

Is there a recommended, faster algorithm?
See Wikipedia article on "String searching algorithm", in particular the section named "Algorithms using a finite set of patterns", where "finite set of patterns" is your dictionary.
The Aho–Corasick algorithm listed first might be a good choice.

How can I most efficiently execute this recursive/iterative CPU intensive android task?

Some Background Info: I have made a program that given an arraylist of letters, and an array of integers finds all the combinations of words that can exist inside this arraylist where the words length is one of the integers in the int array (wordSizes).
i.e. given h, o, p, n, c, i, e, t, k and the integers 5 and 4, the solution would be:
phone tick.
My problem right now:
Inputs usually are about 25 characters and the output should usually return 5 word combinations.
I originally made this a console application for desktop, and runtimes are generally less than 1 minute.
I decided to port it to android and runtimes reach over 35 minutes. I am quite a beginner and not sure about how to run a CPU intensive task on Android.
public void findWordsLimited(ArrayList<Character> letters) {
for (String s1 : first2s) {
for (String s2 : possibleSeconds) {
boolean t = true;
String s1s2 = s1.concat(s2);
ArrayList<Character> tempLetters = new ArrayList<Character>(letters);
for (int i = 0; i < s1s2.length(); i++) {
if (tempLetters.contains(s1s2.charAt(i)))
tempLetters.remove(Character.valueOf(s1s2.charAt(i)));
else
t = false;
}
if (t) {
helperFindWordsL(tempLetters, s1 + " " + s2, 2);
}
}
}
}
public void helperFindWordsL(ArrayList<Character> letters, String prefix , int index) {
boolean r;
if (letters.size() <= 1) {
output += "Success : " + prefix + "\n";
Log.i(TAG, prefix);
}
else if (index < wordSizes.size()){
for (String s : lastCheck) {
if (s.length() == wordSizes.get(index)) {
ArrayList<Character> templetters = new ArrayList<Character>(letters);
r = true;
for (int j = 0; j < s.length(); j++) {
if (templetters.contains(s.charAt(j)))
templetters.remove(Character.valueOf(s.charAt(j)));
else {
r = false;
templetters = new ArrayList<Character>(letters);
}
}
if (r)
helperFindWordsL(templetters, prefix + " " + s, index + 1);
}
}
}
}
I am not too concerned about the algorithm, as this might be confusing because it is part of a bigger project to solve a word game puzzle.
A few questions:
How would I get a CPU intensive task like this finished fastest?
Right now I call the method findWordsLimited() from my MainActivity. On my desktop app (where it says output += Success... in HelperFindWordsL) I would print all solutions to the console, but right now I have made it so that the method adds to and in the end returns a giant string (String output) back to the MainActivity, with all solutions and that String is put inside of a TextView. Is that an inefficient way to display the data? If so, could you please help explain a better way?
Should I be running this as a background/foreground process or thread instead of just calling it from the MainActivity?
How can I make the runtimes on my Android device, which are currently 20x slower than on my desktop, faster?

Try to replace recursion with loops, and use arrays instead of lists to avoid inserts and removals; direct access to array elements is much faster. Pay most attention to the innermost loop, which calls templetters.contains(s.charAt(j)); optimizing this part of the code will have the biggest effect.
You may add break; after t = false;
String s1s2 = s1.concat(s2); creates a new String object on every iteration, which makes unnecessary work for the GC. I would replace it with two loops, one through s1 and then one through s2.
You could use 'letters' directly instead of ArrayList<Character> tempLetters = new ArrayList<Character>(letters);, just marking some items there as deleted. There is no need to create local clones.
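One way to follow the array suggestion is to keep a count per letter instead of an ArrayList<Character>. The sketch below is hypothetical (the names LetterCounts, countLetters, tryRemove and restore are not from the question) and assumes all input consists of lowercase letters a to z.

```java
// Hypothetical replacement for the contains()/remove() calls on ArrayList<Character>.
class LetterCounts {
    // Count how often each letter occurs; assumes only 'a'..'z'.
    static int[] countLetters(String letters) {
        int[] counts = new int[26];
        for (int i = 0; i < letters.length(); i++) {
            counts[letters.charAt(i) - 'a']++;
        }
        return counts;
    }

    // Try to take all letters of word out of counts.
    // On failure the counts are rolled back and false is returned.
    static boolean tryRemove(int[] counts, String word) {
        for (int i = 0; i < word.length(); i++) {
            if (--counts[word.charAt(i) - 'a'] < 0) {
                for (int j = i; j >= 0; j--) {
                    counts[word.charAt(j) - 'a']++;
                }
                return false;
            }
        }
        return true;
    }

    // Put the letters of word back, e.g. when backtracking after a recursive call.
    static void restore(int[] counts, String word) {
        for (int i = 0; i < word.length(); i++) {
            counts[word.charAt(i) - 'a']++;
        }
    }

    public static void main(String[] args) {
        int[] counts = countLetters("hopnicetk");
        System.out.println(tryRemove(counts, "phone")); // uses h, o, p, n, e
        System.out.println(tryRemove(counts, "tick"));  // uses the remaining letters
    }
}
```

With this layout the recursion can share a single counts array, calling tryRemove before descending and restore afterwards, so no list is cloned per candidate word.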

For Loop is performing slow

Please have a look at the following code
//Divide the hash into pieces of 3 characters
private void devideHash(String str)
{
int lastIndex = 0;
for(int i=0;i<=str.length();i=i+3)
{
lastIndex = i;
try
{
String stringPiece = str.substring(i, i+3);
// pw.println(stringPiece);
hashSet.add(stringPiece);
}
catch(Exception arr)
{
String stringPiece = str.substring(lastIndex, str.length());
// pw.println(stringPiece);
hashSet.add(stringPiece);
}
}
}
The above method receives a String like abcdefghijklmnop as the parameter. Inside the method, its job is to divide the String into pieces of 3 letters. So when the operation is completed, the hashset will have pieces like abc def ghi jkl mno p
But the problem is that if the input String is big, then this loop takes noticeable amount of time to complete. Is there any way I can use to speed this process?

As an option, you could replace all your code with this line:
private void divideHash(String str) {
hashSet.addAll(Arrays.asList(str.split("(?<=\\G...)")));
}
This is concise; whether it is faster than a plain loop depends on the input, so measure it.
Here's some test code:
String str = "abcdefghijklmnop";
hashSet.addAll(Arrays.asList(str.split("(?<=\\G...)")));
System.out.println(hashSet);
Output:
[jkl, abc, ghi, def, mno, p]

There is nothing we can really tell unless you tell us what the "noticeable large amount" is, and what is the expected time. It is recommended that you start a profiler to find what logic takes most time.
Some recommendations I can give from briefly reading your code is:
If the resulting Set is going to be huge, it will involve a lot of resizing and rehashing as your HashSet grows. It is recommended that you allocate the required size up front, e.g.
Set<String> hashSet = new HashSet<String>(input.length() / 3 + 1, 1.0f);
This will save you a lot of time spent on unnecessary rehashing.
Never use exception to control your program flow.
Why not simply do:
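for (int i = 0; i < input.length(); i += 3) {
    if (i + 3 > input.length()) {
        // last, shorter piece: substring from i to end
        hashSet.add(input.substring(i));
    } else {
        // full piece: substring from i to i+3
        hashSet.add(input.substring(i, i + 3));
    }
}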
