Highest performance for finding substrings

I have an array of strings (keywords), and I need to check how many of those strings exist within a larger string (text read from a file). The check needs to be case insensitive.
At the moment, this is what I do:
private void findKeywords() {
    String body = email.getMessage();
    for (String word : keywords) {
        if (body.toLowerCase().contains(word.toLowerCase())) {
            //some actions
        }
        if (email.getSubject().contains(word)) {
            //some actions
        }
    }
}
From reading questions in here another solution came up:
private void findKeywords() {
String body = email.getMessage();
for (String word : keywords) {
boolean body_match = Pattern.compile(Pattern.quote(word), Pattern.CASE_INSENSITIVE).matcher(body).find();
boolean subject_match = Pattern.compile(Pattern.quote(word), Pattern.CASE_INSENSITIVE).matcher(email.getSubject()).find();
if (body_match) {
rating++;
}
if (subject_match) {
rating++;
}
}
}
Which of these solutions is more efficient? Is there another, better way to do this? Any accepted solution must be simple to implement (on par with the above) and preferably use no external libraries, as this is not a very important issue in this case.

Both of the solutions seem viable to me. One improvement I would suggest is moving functions out of the loop. In your current code you are repeatedly doing actions such as toLowerCase() and Pattern.compile which you only need to do once.
Obviously there are much faster methods to solve this problem, but they require much more complex code than these 5-liners.
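As a minimal sketch of what moving the work out of the loop could look like for the first variant: the body is lower-cased once, and only the (short) keywords are lower-cased per iteration. The class name KeywordCounter and the method countMatches are made up for illustration, and Locale.ROOT is an assumption to keep the case mapping locale-independent.

```java
import java.util.List;
import java.util.Locale;

class KeywordCounter {
    // Counts how many keywords occur in body, ignoring case.
    static int countMatches(String body, List<String> keywords) {
        // Lower-case the large string once, outside the loop.
        String lowerBody = body.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String word : keywords) {
            if (lowerBody.contains(word.toLowerCase(Locale.ROOT))) {
                count++;
            }
        }
        return count;
    }
}
```

If the keyword list is fixed, the keywords could also be lower-cased once up front and reused across emails.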

Better: build a single pattern with all keywords. Then search on that pattern. Assuming your keywords do not contain meta-characters (characters with special meanings in patterns), then use:
StringBuilder keywordRegex = new StringBuilder();
for (String w : keywords) {
    keywordRegex.append("|").append(w);
}
// substring(1) drops the leading "|"
Pattern p = Pattern.compile(keywordRegex.substring(1));
Matcher m = p.matcher(textToMatch);
while (m.find()) {
    // match is at m.start(); word is m.group(0);
}
Much more efficient than iterating through all keywords: the pattern is compiled once, and the resulting matcher looks for all keywords in a single pass over the text.
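If the keywords may contain meta-characters, or the match has to be case insensitive as in the question, the same idea could be sketched with Pattern.quote and Pattern.CASE_INSENSITIVE. The names CombinedKeywordPattern, compile and found are hypothetical, and the sketch assumes a non-empty keyword list. One caveat: an alternation reports only one keyword per position, so a keyword that overlaps another match (for example "ab" and "abc") may not be reported separately.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class CombinedKeywordPattern {
    // Builds one case-insensitive pattern of the form \Qk1\E|\Qk2\E|...
    static Pattern compile(List<String> keywords) {
        String alternation = keywords.stream()
                .map(Pattern::quote)              // treat every keyword literally
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation, Pattern.CASE_INSENSITIVE);
    }

    // Returns the distinct keywords (lower-cased) that occur in text.
    static Set<String> found(Pattern p, String text) {
        Set<String> hits = new HashSet<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            hits.add(m.group().toLowerCase(Locale.ROOT));
        }
        return hits;
    }
}
```

The compiled Pattern can be kept in a field and reused for both the body and the subject.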

I think the explicit regex solution you mentioned would be more efficient since it doesn't have the toLowerCase operation, which would copy the input string in memory and make chars lowercase.
Both solutions should be practical and your question is mostly academic, but I think the regexes provide cleaner code.

If your email bodies are very large, writing a specialized case-insensitive contains may be justified, because you can avoid calling toLowerCase() on big strings:
// Assumes upper- and lower-casing small does not change its length.
static boolean containsIgnoreCase(String big, String small) {
    if (small == null || big == null || small.length() > big.length()) {
        return false;
    }
    String smallLC = small.toLowerCase();
    String smallUC = small.toUpperCase();
    for (int i = 0; i < big.length(); ++i) {
        if (matchesAt(big, i, smallLC, smallUC)) {
            return true;
        }
    }
    return false;
}
// True if big, starting at index, matches lc or uc character by character.
private static boolean matchesAt(String big, int index, String lc, String uc) {
    if (index + lc.length() > big.length()) {
        return false;
    }
    for (int i = 0; i < lc.length(); ++i) {
        char c = big.charAt(i + index);
        if ((c != lc.charAt(i)) && (c != uc.charAt(i))) {
            return false;
        }
    }
    return true;
}
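A shorter variant of the same idea, sketched here with the standard String.regionMatches(boolean ignoreCase, ...) method instead of a hand-written character comparison. The class name ContainsIgnoreCase is only for illustration; like the code above, it avoids creating a lower-cased copy of the big string.

```java
class ContainsIgnoreCase {
    // True if small occurs anywhere in big, ignoring case.
    static boolean containsIgnoreCase(String big, String small) {
        if (big == null || small == null) {
            return false;
        }
        int last = big.length() - small.length();
        for (int i = 0; i <= last; i++) {
            // Compare small against the region of big starting at i.
            if (big.regionMatches(true, i, small, 0, small.length())) {
                return true;
            }
        }
        return false;
    }
}
```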

Related

Return the first index from arraylist where string was found logic confusion

Guys, I have this method that I am trying to construct, but I am having a hard time understanding the logic. These are the requirements for the method:
public int search(String str) – search the list for parameter str.
Searches should work regardless of case. For example, “TOMATO” is
equivalent to “tomato.”
Hint: the String class has a method called
equalsIgnoreCase. If the string str appears more than once in the
ArrayList, return the first index where the string str was found or
return -1 if the string str was not found in the ArrayList.
This is what I have so far for my code, I am not sure if this is the right way to do it. My ArrayList is defined as words.
To solve this, I am thinking of using a foreach statement to iterate through the ArrayList, then an if to check whether the words match, and then returning the index based on the match, but I am getting an error. The other thing that confuses me is how to return only the first index. Maybe I am doing this wrong. Any help or direction is appreciated.
public int search(String str)
{
for(String s : words)
if(s.contains(s.equalsIgnoreCase(str)))
return s.get(s.equalsIgnoreCase(str));
}

An answer that calls words.indexOf(s) after finding a match has to search through the list a second time to find the index, even though the loop is already in a position to know it. This is the more efficient approach:
public int search(String str) {
int i = 0;
for (String s : words) {
if (s.equalsIgnoreCase(str))
return i;
i++;
}
return -1;
}
There is also the more classic approach, the way it might have been done before the enhanced for loop was added to the Java language:
public int search(String str) {
for (int i = 0; i < words.size(); i++)
if (words.get(i).equalsIgnoreCase(str))
return i;
return -1;
}

You actually overcomplicated it a little bit:
public int search(String str) {
for(String s : words) {
if(s.equalsIgnoreCase(str)) {
return words.indexOf(s);
}
}
return -1;
}
Since the return statement exits the method immediately, it will always return the index of the first matching word.

You can also use a stream to solve this problem (this version reports only whether a match exists, not its index):
public boolean search(List<String> words, String wordToMatch)
{
Predicate<String> equalityPred = s -> s.equalsIgnoreCase(wordToMatch);
return words.stream().anyMatch(equalityPred);
}
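To get the first matching index from a stream, as the original requirement asks, one option is to stream over the indices instead of the elements. This is only a sketch; the class name IndexSearch is made up, and the list is passed in as a parameter rather than read from a field.

```java
import java.util.List;
import java.util.stream.IntStream;

class IndexSearch {
    // Returns the first index whose element equals str ignoring case, or -1.
    static int search(List<String> words, String str) {
        return IntStream.range(0, words.size())
                .filter(i -> words.get(i).equalsIgnoreCase(str))
                .findFirst()   // the range is ordered, so this is the lowest index
                .orElse(-1);
    }
}
```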

java startsWith() method with custom rules

I am implementing a typing trainer and would like to create my own String startsWith() method with specific rules.
For example: the '-' char should be treated as equal to any long hyphen ('‒', etc.). I'll also add other rules for special accented characters (e equals é, but é does not equal e).
public class TestCustomStartsWith {
private static Map<Character, List<Character>> identityMap = new HashMap<>();
static { // different hyphens: ‒, –, —, ―
List<Character> list = new LinkedList<>();
list.add('‒');
list.add('–'); // etc
identityMap.put('-', list);
}
public static void main(String[] args) {
System.out.println(startsWith("‒d--", "-"));
}
public static boolean startsWith(String s, String prefix) {
if (s.startsWith(prefix)) return true;
if (prefix.length() > s.length()) return false;
int i = prefix.length();
while (--i >= 0) {
if (prefix.charAt(i) != s.charAt(i)) {
List<Character> list = identityMap.get(prefix.charAt(i));
if ((list == null) || (!list.contains(s.charAt(i)))) return false;
}
}
return true;
}
}
I could just replace all kinds of long hyphens with the '-' char, but if more rules are added, I'm afraid replacing will be too slow.
How can I improve this algorithm?

I don't know all of your custom rules, but would a regular expression work?
The user is passing in a String. Create a method to convert that String to a regex, e.g.
replace a short hyphen with short or long ([-‒]),
same for your accents, e becomes [eé]
Prepend the start-of-word boundary (\b),
Then convert this to a regex and give it a go.
Note that the list of replacements could be kept in a Map as suggested by Tobbias. Your code could be something like
public boolean myStartsWith(String testString, String startsWith) {
    for (Map.Entry<String,String> me : fancyTransformMap.entrySet()) {
        startsWith = startsWith.replaceAll(me.getKey(), me.getValue());
    }
    // lookingAt() anchors at the start of testString but does not require a full match.
    // Note: \b only matches next to a word character, so drop it if the prefix may start with a hyphen.
    return Pattern.compile("\\b" + startsWith).matcher(testString).lookingAt();
}
p.s. I'm not a regex super-guru, so there may well be possible improvements.
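A self-contained sketch of the regex idea, building the pattern character by character so that ordinary characters are quoted and only the mapped ones become character classes. Everything here is hypothetical: the class name RegexStartsWith, the RULES table and its two entries are just examples, and the \b boundary is left out so that a prefix starting with a hyphen still works. Map.of requires Java 9 or later.

```java
import java.util.Map;
import java.util.regex.Pattern;

class RegexStartsWith {
    // Example rules: expected character -> character class of accepted input.
    // \u2012..\u2015 are the long hyphens, \u00e9 is e with acute accent.
    private static final Map<Character, String> RULES = Map.of(
            '-', "[-\u2012\u2013\u2014\u2015]",
            'e', "[e\u00e9]");

    static Pattern toPattern(String prefix) {
        StringBuilder regex = new StringBuilder();
        for (char c : prefix.toCharArray()) {
            String charClass = RULES.get(c);
            // Characters without a rule are matched literally.
            regex.append(charClass != null ? charClass : Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(regex.toString());
    }

    static boolean startsWith(String s, String prefix) {
        // lookingAt() matches at the start of s without requiring a full match.
        return toPattern(prefix).matcher(s).lookingAt();
    }
}
```

Because the rules are keyed by the expected character, they stay one-directional: a prefix "e" accepts a typed é, but a prefix "é" does not accept a typed e. For a typing trainer, the compiled Pattern for the current target text could be cached instead of rebuilt on every keystroke.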

I'd think something like a HashMap that maps the undesirable characters to what you want them to be interpreted as might be the way to go if you are worried about performance:
Map<Character, Character> fastMap = new HashMap<>();
// read it as '<long hyphen> can be interpreted as <regular-hyphen>'
fastMap.put('–', '-');
fastMap.put('é', 'e');
fastMap.put('è', 'e');
fastMap.put('？', '?');
...
// and so on
That way you could ask for the value of the key: value = fastMap.get(key).
However, this only works as long as each key maps to a single value. The caveat is that é can't also be interpreted as è with this method: all the keys must be unique. However, if you are worried about performance, this is an exceedingly fast way of doing it, since the lookup time for a HashMap is pretty close to O(1). But as others on this page have written, premature optimization is often a bad idea: try implementing something that works first, and if at the end you find it is too slow, then optimize.
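Plugged into a startsWith check, the map lookup could look roughly like the sketch below. The class name NormalizingStartsWith and the three sample entries are assumptions for illustration; each typed character is accepted if it equals the expected character directly or after being mapped.

```java
import java.util.HashMap;
import java.util.Map;

class NormalizingStartsWith {
    // typed character -> character it may be interpreted as (sample entries only)
    private static final Map<Character, Character> FAST_MAP = new HashMap<>();
    static {
        FAST_MAP.put('\u2013', '-'); // en dash -> regular hyphen
        FAST_MAP.put('\u00e9', 'e'); // e with acute accent -> e
        FAST_MAP.put('\u00e8', 'e'); // e with grave accent -> e
    }

    static boolean startsWith(String s, String prefix) {
        if (prefix.length() > s.length()) {
            return false;
        }
        for (int i = 0; i < prefix.length(); i++) {
            char typed = s.charAt(i);
            char expected = prefix.charAt(i);
            // Accept an exact match, or a typed character that maps to the expected one.
            if (typed != expected && FAST_MAP.getOrDefault(typed, typed) != expected) {
                return false;
            }
        }
        return true;
    }
}
```

No intermediate string is created, so the cost is one map lookup per mismatching character.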

Print Tree components

I am new to Java and I want to create a very simple "word completion" program. I will be reading in a dictionary file and recursively adding the words into a Node array (size 26). I believe I have managed to do this successfully, but I am not sure how to go through and print the matches. For the sake of testing, I am simply inserting 2 words at the moment by calling the function. Once everything is working, I will add the method to read the file in and remove junk from the word.
For example: If the words "test" and "tester" are inside the tree and the user enters "tes", it should display "test" and "tester".
If somebody could please tell me how to go through and print the matches (if any), I would really appreciate it. Full code is below.
Thank you

What you implemented is called a "trie". You might want to look at the existing implementations.
What you used to store child nodes is called a hash table, and you might want to use a standard implementation and avoid implementing it yourself unless you have very specific reasons to do that. Your implementation has some limitations (character range, for example).
I think, your code has a bug in method has:
...
else if (letter[val].flag==true || word.length()==1) {
return true;
}
If that method is intended to return true if there are strings starting with word then it shouldn't check flag. If it must return true if there is an exact match only, it shouldn't check word.length().
And, finally, addressing your question: not the optimal, but the simplest solution would be to make a method, which takes a string and returns a node matching that string and a method that composes all the words from a node. Something like this (not tested):
class Tree {
...
public List<String> matches(CharSequence prefix) {
List<String> result = new ArrayList<>();
if(r != null) {
Node n = r._match(prefix, 0);
if(n != null) {
StringBuilder p = new StringBuilder();
p.append(prefix);
n._addWords(p, result);
}
}
return result;
}
}
class Node {
...
protected Node _match(CharSequence prefix, int index) {
assert index <= prefix.length();
if(index == prefix.length()) {
return this;
}
int val = prefix.charAt(index) - 'a';
assert val >= 0 && val < letter.length;
if (letter[val] != null) {
return letter[val]._match(prefix, index+1);
}
return null;
}
protected void _addWords(StringBuilder prefix, List<String> result) {
if(this.flag) {
result.add(prefix.toString());
}
for(int i = 0; i<letter.length; i++) {
if(letter[i] != null) {
prefix.append((char)(i + 'a'));
letter[i]._addWords(prefix, result);
prefix.delete(prefix.length() - 1, prefix.length());
}
}
}
}
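Since the question's full code is not shown here, the following is a self-contained sketch of the same two steps (walk down to the node for the prefix, then collect every word below it). It assumes lowercase a-z words and reuses the letter and flag names from the snippets above; the class name PrefixTrie and the insert method are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixTrie {
    private static final class Node {
        final Node[] letter = new Node[26]; // one slot per letter 'a'..'z'
        boolean flag;                       // true if a word ends at this node
    }

    private final Node root = new Node();

    // Assumes word contains only 'a'..'z'.
    void insert(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            int val = c - 'a';
            if (n.letter[val] == null) {
                n.letter[val] = new Node();
            }
            n = n.letter[val];
        }
        n.flag = true;
    }

    // Returns all stored words that start with prefix, in alphabetical order.
    List<String> matches(String prefix) {
        List<String> result = new ArrayList<>();
        Node n = root;
        for (char c : prefix.toCharArray()) {
            int val = c - 'a';
            if (val < 0 || val >= 26 || n.letter[val] == null) {
                return result; // no word has this prefix
            }
            n = n.letter[val];
        }
        collect(n, new StringBuilder(prefix), result);
        return result;
    }

    // Depth-first walk that appends a letter, recurses, then removes it again.
    private static void collect(Node n, StringBuilder current, List<String> out) {
        if (n.flag) {
            out.add(current.toString());
        }
        for (int i = 0; i < 26; i++) {
            if (n.letter[i] != null) {
                current.append((char) ('a' + i));
                collect(n.letter[i], current, out);
                current.setLength(current.length() - 1);
            }
        }
    }
}
```

With "test", "tester" and "team" inserted, matches("tes") yields test and tester.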

Maybe a long shot, but why don't you try regexes here? As far as I understand, you want to match words against a list of words:
List<String> getMatches(List<String> list, String regex) {
    Pattern p = Pattern.compile(regex);
    ArrayList<String> matches = new ArrayList<String>();
    for (String s : list) {
        // matches() requires the whole word to match the regex
        if (p.matcher(s).matches()) {
            matches.add(s);
        }
    }
    return matches;
}

Java Recursive String Comparison with "*" as a Wildcard

I'm writing a recursive method that compares two strings letter by letter. I'm having trouble making the "*" character match any letter and stand in for as many letters as needed (making it a wildcard).
I was wondering if someone can give me a hint on the algorithm that would be used?
Here is what I have so far.
public static boolean match(String x, String y) {
return match_loop(x, y, 0, 1);
}
public static boolean match_loop(String a, String b, int i, int s) {
try {
if (a == b) {
return true;
}
if (i >= a.length() && i >= b.length()) {
return true;
}
if (a.charAt(i) == b.charAt(i)) {
return match_loop(a, b, i + 1, s);
}
//(((...A bunch of if statements for my other recursion requirements
return false;
} catch (java.lang.StringIndexOutOfBoundsException e) {
return false;
}
}
public static void main(String[] args) {
System.out.println(match("test", "t*t")); // should return true
}
What I was thinking of doing is adding another argument to the method, an int that will act as a letter backcounter. Basically I'm thinking of this:
if a or b at char(i-s) (s originally being 1) is a *, call the recursion again with s+1,
and then add a few more if statements to fix the bugs. However, this method seems really long and repetitive. Are there any other algorithms I can use?

Do not use == for String value comparison. Use the equals() method.
if (a == b) should be if (a.equals(b))

If you are using only one character ("*") as a wildcard, I recommend using a regular expression. For example:
public static boolean match(String x, String y) {
    // Turn every '*' into ".*"; assumes y contains no other regex meta-characters
    String regex = y.replace("*", "(.*)");
    return x.matches(regex);
}
public static void main(String[] args) {
System.out.println(match("test", "t*t")); // should return true
}
I think it is easier to read the code this way.
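If the pattern may contain characters that are special in a regex (such as "." or "+"), a slightly more defensive sketch quotes everything except the "*". The class name WildcardRegex is hypothetical, and Pattern.DOTALL is an assumption so that "*" can also span line breaks.

```java
import java.util.regex.Pattern;

class WildcardRegex {
    // Matches text against a wildcard pattern in which only '*' is special.
    static boolean match(String text, String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");                             // any run of characters
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(text).matches();
    }
}
```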

Have a look at this algorithm. It returns all substrings that match the pattern, so you'll have to check whether the entire string is matched in the end, but that should be easy.
It runs in O(km) time, where k is the number of wildcards and m is the length of your input string.
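For the recursive approach the question asks about, a plain backtracking matcher can be sketched as below. This is not the O(km) algorithm referred to above (in the worst case it can take far longer on patterns with many stars); it is only meant to show the recursion. It assumes the wildcard appears in the second argument only, and the class name WildcardMatch is made up.

```java
class WildcardMatch {
    // True if the whole text matches the pattern, where '*' matches any run of characters.
    static boolean match(String text, String pattern) {
        return match(text, 0, pattern, 0);
    }

    private static boolean match(String t, int i, String p, int j) {
        if (j == p.length()) {
            return i == t.length();      // pattern used up: text must be used up too
        }
        if (p.charAt(j) == '*') {
            // Either the star matches nothing (advance the pattern),
            // or it swallows one character of text and stays in place.
            return match(t, i, p, j + 1)
                    || (i < t.length() && match(t, i + 1, p, j));
        }
        // Ordinary character: must be equal, then advance both.
        return i < t.length()
                && t.charAt(i) == p.charAt(j)
                && match(t, i + 1, p, j + 1);
    }
}
```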

This book will tell you exactly how to do it:
http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886
Here's a simple Java implementation that might get you on track: http://matt.might.net/articles/implementation-of-nfas-and-regular-expressions-in-java/
Basically the industrial-strength implementation is a state machine. You deconstruct the regular expression - the string with the '*' in it - and create a graph for it. Then you recursively search the graph, for example in a breadth-first tree search.
Here's some discussion of different ways to do it, that will help illustrate the approach: http://swtch.com/~rsc/regexp/regexp1.html

Is there a way to shorten a conditional that contains a bunch of boolean comparisons?

e.g
if("viewCategoryTree".equals(actionDetail)
|| "fromCut".equals(actionDetail)
|| "fromPaste".equals(actionDetail)
|| ("viewVendorCategory".equals(actionDetail))&&"viewCategoryTree".equals(vendorCategoryListForm.getActionOrigin())
|| ("viewVendorCategory".equals(actionDetail))&&"fromEdit".equals(vendorCategoryListForm.getActionOrigin())
|| "deleteSelectedItem".equals(actionDetail)
|| ("viewVendorCategory".equals(actionDetail))&&"fromLink".equals(vendorCategoryListForm.getActionOrigin())){
//do smth
}
I've tried something like this
if(check("deleteSelectedItem,viewCategoryTree,fromCut,fromPaste,{viewVendorCategory&&viewVendorCategory},{viewVendorCategory&&fromEdit},{viewVendorCategory&&fromLink}",actionDetail,actionOrigin)){
//do smth
}
public boolean check(String str, String ad, String ao){
String oneCmp = "";
String[] result = str.split(",");
ArrayList adList = new ArrayList();
ArrayList aoList = new ArrayList();
for (int i=0; i<result.length; i++){
oneCmp = result[i];
Matcher m = Pattern.compile("\\{([^}]*)\\}").matcher(oneCmp);
if(m.matches()){
m.find();
String agrp = m.group();
String[] groupresult = agrp.split("[\\W&&[^!]]+");
Boolean a = false;
Boolean b = false;
if(groupresult[0].startsWith("!")){
a = !groupresult[0].substring(1).equals(ad);
} else a = groupresult[0].equals(ad);
if(groupresult[1].startsWith("!")){
b = !groupresult[1].substring(1).equals(ao);
}else b = groupresult[1].equals(ao);
if(agrp.indexOf("&&")!=-1){
if(!(a && b))return false;
}
else if(agrp.indexOf("||")!=-1){
if(!(a || b))return false;
}
} else {
if(oneCmp.indexOf("^")==-1){
checklist(oneCmp,ad);
if(!checklist(oneCmp,ad))return false;
}else{
if(!checklist(oneCmp,ao))return false;
}
}
}
return false;
}
public boolean checklist(String str, String key){
if(str.startsWith("!")){
if(str.substring(1).equals(key))return false;
}else { if (!str.substring(1).equals(key)) return false;
}
}
return false;
}
Is there a better way to do this? Thanks.

Move the check to a method that takes actionDetail as argument:
// Assumes vendorCategoryListForm is a member variable.
boolean check(String actionDetail) {
return ("viewCategoryTree".equals(actionDetail)
|| "fromCut".equals(actionDetail)
|| "fromPaste".equals(actionDetail)
|| (("viewVendorCategory".equals(actionDetail))
&&"viewCategoryTree".equals(vendorCategoryListForm.getActionOrigin()))
|| (("viewVendorCategory".equals(actionDetail))
&&"fromEdit".equals(vendorCategoryListForm.getActionOrigin()))
|| "deleteSelectedItem".equals(actionDetail)
|| (("viewVendorCategory".equals(actionDetail))
&&"fromLink".equals(vendorCategoryListForm.getActionOrigin())));
}
if (check(actionDetail)) {
// do this
}

How about creating an array of what you need to test against.
And then some code like this:
arrayOfStrings = ["viewCategoryTree", ...]
match = false
for elem in arrayOfStrings:
if elem == actionDetail:
match = true
break
The good thing about an array is that it is easily extensible: you can easily add/remove elements to it both statically and dynamically.
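In Java, the same idea could be sketched with two sets, one for the action details that always pass and one for the origins allowed together with viewVendorCategory. The class name ActionCheck, the set names, and passing actionOrigin as a parameter are assumptions for the example; Set.of requires Java 9 or later and rejects null lookups, hence the explicit null checks.

```java
import java.util.Set;

class ActionCheck {
    // Action details that pass regardless of the origin.
    private static final Set<String> ALWAYS_ALLOWED =
            Set.of("viewCategoryTree", "fromCut", "fromPaste", "deleteSelectedItem");

    // Origins that are allowed when the action detail is viewVendorCategory.
    private static final Set<String> VENDOR_ORIGINS =
            Set.of("viewCategoryTree", "fromEdit", "fromLink");

    static boolean check(String actionDetail, String actionOrigin) {
        if (actionDetail == null) {
            return false;
        }
        if (ALWAYS_ALLOWED.contains(actionDetail)) {
            return true;
        }
        return "viewVendorCategory".equals(actionDetail)
                && actionOrigin != null
                && VENDOR_ORIGINS.contains(actionOrigin);
    }
}
```

Adding a new allowed value then means adding one string to a set rather than another branch to the conditional.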

Also kindly look at this post
Language Agnostic Credits to Galwegian
See Flattening Arrow Code for help.
1. Replace conditions with guard clauses.
2. Decompose conditional blocks into separate functions.
3. Convert negative checks into positive checks.

Honestly, that code is no more readable. I would rather suggest encapsulating the conditional check in a property of the type, like if (control.IsApplicable) { // do smth }.
It doesn't matter whether you parameterize it by one or two arguments.
But I suppose a better solution is to have an array of matches to test against, and to return true if one of them matches.

I don't think you are going to improve on this without adding a bunch of complexity, both in terms of the notation that you use to express the conditions and the implementation of the "engine" that evaluates them.
The notation issue is that: while you may end up expressing the conditions in fewer characters, someone else reading your code has to figure out what that funky string literal really means.
Besides, anything clever you do could have an impact on performance. For instance, your attempt compiles and applies a regex multiple times for each call to check.
Stick with what you've got would be my advice.

if (isValidActionDetail(actionDetail)
        || ("viewVendorCategory".equals(actionDetail)
            && ("viewCategoryTree".equals(vendorCategoryListForm.getActionOrigin())
                || "fromEdit".equals(vendorCategoryListForm.getActionOrigin())
                || "fromLink".equals(vendorCategoryListForm.getActionOrigin())))) {
    //do smth
}
// Action details that are accepted regardless of the action origin.
// viewVendorCategory is checked separately above because it also depends on the origin.
public static boolean isValidActionDetail(String actionDetail) {
    return "viewCategoryTree".equals(actionDetail) || "fromCut".equals(actionDetail)
        || "fromPaste".equals(actionDetail) || "deleteSelectedItem".equals(actionDetail);
}
You can decompose in the above way, as the first step to refactoring your logic.
