Match an array against a string in Java

I'm reading a file using a BufferedReader, so let's say I have
line = br.readLine();
I want to check if this line contains one of many possible strings (which I have in an array). I would like to be able to write something like:
while (!line.matches(stringArray)) { // not sure how to write this conditional
do something here;
line = br.readLine();
}
I'm fairly new to programming and Java; am I going about this the right way?

Copy all values into a Set<String> and then use contains():
Set<String> set = new HashSet<String> (Arrays.asList (stringArray));
while (!set.contains(line)) { ... }
[EDIT] If you want to find out if a part of the line contains a string from the set, you have to loop over the set. Replace set.contains(line) with a call to:
public boolean matches(Set<String> set, String line) {
for (String check: set) {
if (line.contains(check)) return true;
}
return false;
}
Adjust the check accordingly when you use regexp or a more complex method for matching.
[EDIT2] A third option is to concatenate the elements in the array in a huge regexp with |:
Pattern p = Pattern.compile("str1|str2|str3");
while (!p.matcher(line).find()) { // or matches for a whole-string match
...
}
This can be cheaper if you have many elements in the array, since the regexp engine can optimize the matching process. If the elements may contain regex metacharacters, quote them with Pattern.quote() first.
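As a rough sketch of the third option, the alternation can be built from the array at run time instead of being typed out by hand. The class and method names here (AnyOfMatcher, compileAnyOf, containsAny) are made up for illustration, and the sketch assumes the array is not empty, since an empty alternation would match every line.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

final class AnyOfMatcher {
    // Builds an alternation such as \Qstr1\E|\Qstr2\E so that every entry
    // is matched literally, even if it contains regex metacharacters.
    // Assumption: candidates is non-empty.
    static Pattern compileAnyOf(String[] candidates) {
        StringJoiner alternation = new StringJoiner("|");
        for (String candidate : candidates) {
            alternation.add(Pattern.quote(candidate));
        }
        return Pattern.compile(alternation.toString());
    }

    // True if at least one candidate occurs anywhere in the line.
    static boolean containsAny(Pattern anyOf, String line) {
        return anyOf.matcher(line).find();
    }
}
```

The pattern would be compiled once before the read loop and then reused for each line, for example `while (line != null && !AnyOfMatcher.containsAny(p, line)) { ... }`.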

It depends on what stringArray is. If it's a Collection then fine. If it's a true array, you should make it a Collection. The Collection interface has a method called contains() that will determine if a given Object is in the Collection.
Simple way to turn an array into a Collection:
String tokens[] = { ... };
List<String> list = Arrays.asList(tokens);
The problem with a List is that lookup is expensive (technically linear or O(n)). A better bet is to use a Set, which is unordered but has near-constant (O(1)) lookup. You can construct one like this:
From a Collection:
Set<String> set = new HashSet<String>(stringList);
From an array:
Set<String> set = new HashSet<String>(Arrays.asList(stringArray));
and then set.contains(line) will be a cheap operation.
Edit: Ok, I think your question wasn't clear. You want to see if the line contains any of the words in the array. What you want then is something like this:
BufferedReader in = null;
Set<String> words = ... // construct this as per above
try {
in = ...
String line; // a variable cannot be declared inside the loop condition
while ((line = in.readLine()) != null) {
for (String word : words) {
if (line.contains(word)) {
// do whatever
}
}
}
} catch (Exception e) {
e.printStackTrace();
} finally {
if (in != null) { try { in.close(); } catch (Exception e) { } }
}
This is quite a crude check, which is used surprisingly often and tends to give annoying false positives on words like "scrap". For a more sophisticated solution you probably have to use a regular expression and look for word boundaries:
Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
Matcher m = p.matcher(line);
if (m.find()) {
// word found
}
You will probably want to do this more efficiently (like not compiling the pattern with every line) but that's the basic tool to use.
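One way to avoid recompiling the pattern for every line is to compile it once and keep it in an object. The following is only a sketch: WholeWordMatcher and foundIn are hypothetical names, and it assumes the word starts and ends with a letter, digit or underscore, because \b only marks a boundary next to such characters.

```java
import java.util.regex.Pattern;

final class WholeWordMatcher {
    private final Pattern pattern;

    WholeWordMatcher(String word) {
        // \b on both sides restricts the match to whole words;
        // Pattern.quote keeps any regex metacharacters in the word literal.
        this.pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
    }

    // Reuses the precompiled pattern; only a new Matcher is created per line.
    boolean foundIn(String line) {
        return pattern.matcher(line).find();
    }
}
```

A matcher built for "crap" would then report a hit for "this is crap." but not for "a scrap of paper".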

Using the String.matches(regex) function, what about creating a regular expression that matches any one of the strings in the string array? Something like
String regex = ".*(";
for (int i = 0; i < array.length - 1; ++i)
regex += array[i] + "|";
regex += array[array.length - 1] + ").*";
while( line.matches(regex) )
{
//. . .
}

Related

How can I check if a string has a substring from a List?

I am looking for the best way to check if a string contains a substring from a list of keywords.
For example, I create a list like this:
List<String> keywords = new ArrayList<>();
keywords.add("mary");
keywords.add("lamb");
String s1 = "mary is a good girl";
String s2 = "she likes travelling";
String s1 has "mary" from the keywords, but string s2 does not have it. So, I would like to define a method:
boolean containsAKeyword(String str, List<String> keywords)
Where containsAKeyword(s1, keywords) would return true but containsAKeyword(s2, keywords) would return false. I can return true even if there is a single substring match.
I know I can just iterate over the keywords list and call str.contains() on each item in the list, but I was wondering if there is a better way to iterate over the complete list (avoid O(n) complexity) or if Java provides any built-in methods for this.
I would recommend iterating over the entire list. Thankfully, you can use an enhanced for loop:
for(String listItem : myArrayList){
if(myString.contains(listItem)){
// do something.
}
}
EDIT To the best of my knowledge, you have to iterate the list somehow. Think about it, how will you know which elements are contained in the list without going through it?
EDIT 2
The only way I can see the iteration running quickly is to do the above. The way this is designed, it will break early once you've found a match, without searching any further. You can put your return false statement at the end of looping, because if you have checked the entire list without finding a match, clearly there is none. Here is some more detailed code:
public boolean containsAKeyword(String myString, List<String> keywords){
for(String keyword : keywords){
if(myString.contains(keyword)){
return true;
}
}
return false; // Never found match.
}
EDIT 3
If you're using Kotlin, you can do this with the any method:
val containsKeyword = myArrayList.any { myString.contains(it) }
In JDK8 you can do this like:
public static boolean hasKey(String key) {
return keywords.stream().filter(k -> key.contains(k)).collect(Collectors.toList()).size() > 0;
}
hasKey(s1); // returns true
hasKey(s2); // returns false
Now you can use Java 8 stream for this purpose:
keywords.stream().anyMatch(keyword -> str.contains(keyword));
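Wrapped into the method signature from the question, the stream version might look like the sketch below. The class name KeywordCheck is an assumption for illustration. anyMatch short-circuits, so it stops at the first keyword found, just like the loop with an early return.

```java
import java.util.List;

final class KeywordCheck {
    // Returns true as soon as one keyword occurs in str as a substring.
    static boolean containsAKeyword(String str, List<String> keywords) {
        return keywords.stream().anyMatch(str::contains);
    }
}
```

This is still a linear scan over the keywords; it only shortens the code.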
Here is the solution
List<String> keywords = new ArrayList<>();
keywords.add("mary");
keywords.add("lamb");
String s1 = "mary is a good girl";
String s2 = "she likes travelling";
// The function
boolean check(String str, List<String> keywords) {
Iterator<String> it = keywords.iterator();
while(it.hasNext()){
if(str.contains(it.next()))
return true;
}
return false;
}
Iterate over the keyword list and return true if the string contains your keyword. Return false otherwise.
public boolean containsAKeyword(String str, List<String> keywords){
for(String k : keywords){
if(str.contains(k))
return true;
}
return false;
}
You can add all the words in keywords in a hashmap. Then you can use str.contains for string 1 and string 2 to check if keywords are available.
Depending on the size of the list, I would suggest using the matches() method of String. String.matches takes a regex argument, so with smaller lists you could simply build a regular expression and evaluate it:
String Str = new String("This is a test string");
System.out.println(Str.matches("(.*)test(.*)"));
This should print out "true."
Or you could use java.util.regex.Pattern.

Removing a String from an ArrayList

So I have a problem that takes the names of people from a user and stores them in an ArrayList(personalNames). After that I need to take that list and remove any name that has anything besides letters a-z (anything with numbers or symbols) in it and put them into a separate ArrayList(errorProneNames) that holds the errors. Could someone help me with the removal part?
public class NameList {
public static void main(String[] args) {
ArrayList<String> personalNames = new ArrayList<String>();
Scanner input = new Scanner(System.in);
String answer;
do{
System.out.println("Enter the personal Names: ");
String names = input.next();
personalNames.add(names);
System.out.println("would you like to enter another name (yes/no)?");
answer = input.next();
} while (answer.equalsIgnoreCase("yes"));
ArrayList<String> errorProneNames = new ArrayList<String>();
}
}
If it's the "how do I remove an element from an ArrayList<>" part which is causing problems, and you want to check all the values, you probably want to use an Iterator and call remove on that:
for (Iterator<String> iterator = personalNames.iterator(); iterator.hasNext(); ) {
String name = iterator.next();
if (isErrorProne(name)) {
iterator.remove();
}
}
Note that you mustn't remove an element from a collection while you're iterating over it in an enhanced-for loop except with the iterator. So this would be wrong:
// BAD CODE: DO NOT USE
for (String name : personalNames) {
if (isErrorProne(name)) {
personalNames.remove(name);
}
}
That will throw a ConcurrentModificationException.
Another option would be to create a new list of good names:
List<String> goodNames = new ArrayList<>();
for (String name : personalNames) {
if (!isErrorProne(name)) {
goodNames.add(name);
}
}
Now, if your real problem is that you don't know how to write the isErrorProne method, that's a different matter. I suspect that you want to use a regular expression to check that the name only contains letters, spaces, hyphens, and perhaps apostrophes - but you should think carefully about exactly what you want here. So you might want:
private static boolean isErrorProne(String name) {
return !name.matches("^[a-zA-Z \\-']+$");
}
Note that that won't cope with accented characters, for example. Maybe that's okay for your situation - maybe it's not. You need to consider exactly what you want to allow, and adjust the regular expression accordingly.
You may also want to consider expressing it in terms of whether something is a good name rather than whether it's a bad name - particularly if you use the last approach of building up a new list of good names.
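If Java 8 or later is available, Collection.removeIf is another way to do the removal without handling the iterator yourself. The sketch below is an illustration under assumptions: NameFilter and extractErrorProne are made-up names, and isErrorProne uses the stricter "letters a-z only" rule from the question rather than the more permissive pattern above.

```java
import java.util.ArrayList;
import java.util.List;

final class NameFilter {
    // Assumption: a name is acceptable only if it consists of ASCII letters.
    static boolean isErrorProne(String name) {
        return !name.matches("[a-zA-Z]+");
    }

    // Removes the error-prone names from personalNames and returns them
    // in a separate list, preserving their original order.
    static List<String> extractErrorProne(List<String> personalNames) {
        List<String> errorProneNames = new ArrayList<>();
        for (String name : personalNames) {
            if (isErrorProne(name)) {
                errorProneNames.add(name);
            }
        }
        // removeIf iterates internally, so no ConcurrentModificationException.
        personalNames.removeIf(NameFilter::isErrorProne);
        return errorProneNames;
    }
}
```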
Here is your solution :
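String regex = "[a-zA-Z]*";
for (Iterator<String> it = personalNames.iterator(); it.hasNext(); ) {
String temp = it.next();
if (!temp.matches(regex)){
errorProneNames.add(temp);
it.remove(); // remove via the iterator to avoid a ConcurrentModificationException
}
}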
You can use the remove() method of ArrayList
personalNames.remove("stringToBeRemoved");
Several overloads are available: you can remove by index or by object (the String itself). See the Javadocs for more info.
Also, to remove all Strings containing anything but the letters a-z, you can use a regex. The logic is as follows:
String regex = "[a-zA-Z]*";
String testString = "abc1";
if(!testString.matches(regex)){
System.out.println("Remove this");
}
As Jon pointed out, while iterating over the List do not use the List's remove() method; use the iterator's remove() method instead.
There are two ways you can do this:
The first is to iterate backwards through the list, remove them, then add them into the second list. I say to do it backwards, because it will change the index.
for (int i = personalNames.size()-1; i >= 0; i--) {
if (isBadName(personalNames.get(i))) {
errorProneNames.add(personalNames.get(i));
personalNames.remove(i);
}
}
The second way is to use the Iterator provided by ArrayList (personalNames.iterator()). This will allow you to go forward.
I would probably do this
// Check that the string contains only letters.
private static boolean onlyLetters(String in) {
if (in == null) {
return false;
}
for (char c : in.toCharArray()) {
if (!Character.isLetter(c)) {
return false;
}
}
return true;
}
public static void main(String[] args) {
ArrayList<String> personalNames = new ArrayList<String>();
ArrayList<String> errorProneNames = new ArrayList<String>(); // keep this list here.
Scanner input = new Scanner(System.in);
String answer;
do {
System.out.println("Enter the personal Names: ");
String names = input.next();
if (onlyLetters(names)) { // test on input.
personalNames.add(names); // good.
} else {
errorProneNames.add(names); // bad.
}
System.out
.println("would you like to enter another name (yes/no)?");
answer = input.next();
} while (answer.equalsIgnoreCase("yes"));
}
Get an iterator from the list. While the iterator has a next element, pass that element to a method, for example isNotProneName, which takes a String and returns false if the given String does not meet your requirements. If it returns false, remove the string via the iterator and add it to the other list.
Use the regex [a-zA-Z ]+ with String.matches to test for error-prone names, and an Iterator to remove them.
Iterator<String> it=personalNames.iterator();
while(it.hasNext()){
String name=it.next();
if(!name.matches("[a-zA-Z ]+")){
it.remove();
}
}

Very slow execution

I have a basic method which reads in ~1000 files with ~10,000 lines each from the hard drive. Also, I have an array of String called userDescription which has all the "description words" of the user. I have created a HashMap whose data structure is HashMap<String, HashMap<String, Integer>> which corresponds to HashMap<eachUserDescriptionWords, HashMap<TweetWord, Tweet_Word_Frequency>>.
The file is organized as:
<User=A>\t<Tweet="tweet...">\n
<User=A>\t<Tweet="tweet2...">\n
<User=B>\t<Tweet="tweet3...">\n
....
My method to do this is:
for (File file : tweetList) {
if (file.getName().endsWith(".txt")) {
System.out.println(file.getName());
BufferedReader in;
try {
in = new BufferedReader(new FileReader(file));
String str;
while ((str = in.readLine()) != null) {
// String split[] = str.split("\t");
String split[] = ptnTab.split(str);
String user = ptnEquals.split(split[1])[1];
String tweet = ptnEquals.split(split[2])[1];
// String user = split[1].split("=")[1];
// String tweet = split[2].split("=")[1];
if (tweet.length() == 0)
continue;
if (!prevUser.equals(user)) {
description = userDescription.get(user);
if (description == null)
continue;
if (prevUser.length() > 0 && wordsCount.size() > 0) {
for (String profileWord : description) {
if (wordsCorr.containsKey(profileWord)) {
HashMap<String, Integer> temp = wordsCorr
.get(profileWord);
wordsCorr.put(profileWord,
addValues(wordsCount, temp));
} else {
wordsCorr.put(profileWord, wordsCount);
}
}
}
// wordsCount = new HashMap<String, Integer>();
wordsCount.clear();
}
setTweetWordCount(wordsCount, tweet);
prevUser = user;
}
} catch (IOException e) {
System.err.println("Something went wrong: "
+ e.getMessage());
}
}
}
Here, the method setTweetWordCount counts the word frequency of all the tweets of a single user. The method is:
private void setTweetWordCount(HashMap<String, Integer> wordsCount,
String tweet) {
ArrayList<String> currTweet = new ArrayList<String>(
Arrays.asList(removeUnwantedStrings(tweet)));
if (currTweet.size() == 0)
return;
for (String word : currTweet) {
try {
if (word.equals("") || word.equals(null))
continue;
} catch (NullPointerException e) {
continue;
}
Integer countWord = wordsCount.get(word);
wordsCount.put(word, (countWord == null) ? 1 : countWord + 1);
}
}
The method addValues checks whether wordsCount has words that are already in the giant HashMap wordsCorr. If it does, it increases the count of the word in the original HashMap wordsCorr.
Now, my problem is that no matter what I do, the program is very slow. I ran this version on my server, which has fairly good hardware, but it's been 28 hours and the number of files scanned is just ~450. I tried to see if I was doing anything repeatedly that might be unnecessary, and I corrected a few such things. But the program is still very slow.
Also, I have increased the heap size to 1500m, which is the maximum that I can go.
Is there anything I might be doing wrong?
Thank you for your help!
EDIT: Profiling Results
First of all, I really want to thank you guys for the comments. I have changed some things in my program. I now use precompiled regexes instead of direct String.split(), along with other optimizations. However, after profiling, my addValues method is taking the most time. So, here's my code for addValues. Is there something that I should be optimizing here? Oh, and I've also changed my startProcess method a bit.
private HashMap<String, Integer> addValues(
HashMap<String, Integer> wordsCount, HashMap<String, Integer> temp) {
HashMap<String, Integer> merged = new HashMap<String, Integer>();
for (String x : wordsCount.keySet()) {
Integer y = temp.get(x);
if (y == null) {
merged.put(x, wordsCount.get(x));
} else {
merged.put(x, wordsCount.get(x) + y);
}
}
for (String x : temp.keySet()) {
if (merged.get(x) == null) {
merged.put(x, temp.get(x));
}
}
return merged;
}
EDIT2: Even after trying so hard with it, the program didn't run as expected. I did all the optimization of the "slow method" addValues but it didn't work. So I took a different path: creating a word dictionary and assigning an index to each word first, and then doing the processing. Let's see where it goes. Thank you for your help!
Two things come to mind:
You are using String.split(), which uses a regular expression to do the splitting. That's overkill for a fixed single-character delimiter. Use one of the many splitXYZ() methods from Apache StringUtils instead.
You are probably creating really huge hash maps. With very large hash maps, hash collisions can make the map operations much slower. This can be improved by using more widely spread hash values. See an example over here: Java HashMap performance optimization / alternative
One suggestion (I don't know how much of an improvement you'll get from it) is based on the observation that currTweet is never modified. There is no need to create a copy. I.e.
ArrayList<String> currTweet = new ArrayList<String>(
Arrays.asList(removeUnwantedStrings(tweet)));
can be replaced with
List<String> currTweet = Arrays.asList(removeUnwantedStrings(tweet));
or you can use the array directly (which will be marginally faster). I.e.
String[] currTweet = removeUnwantedStrings(tweet);
Also,
word.equals(null)
is always false by the definition of the contract of equals. The right way to null-check is:
if (null == word || word.equals(""))
Additionally, you won't need that null-pointer-exception try-catch if you do this. Exception handling is expensive when it happens, so if your word array tends to return lots of nulls, this could be slowing down your code.
More generally though, this is one of those cases where you should profile the code and figure out where the actual bottleneck is (if there is a bottleneck) instead of looking for things to optimize ad-hoc.
You would gain from a few more optimizations:
String.split recompiles the input regex (in string form) to a pattern every time. You should have a single static final Pattern ptnTab = Pattern.compile( "\\t" ), ptnEquals = Pattern.compile( "=" ); and call, e.g., ptnTab.split( str ). The resulting performance should be close to StringTokenizer.
word.equals( "" ) || word.equals( null ). Lots of wasted cycles here. If you are actually seeing null words, then you are catching NPEs, which is very expensive. See the response from @trutheality above.
You should allocate the HashMap with a very large initial capacity to avoid all the resizing that is bound to happen.
split() uses regular expressions, which are not "fast". try using a StringTokenizer or something instead.
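A regex-free split on a single tab character can also be written by hand with indexOf. This is a minimal sketch: TabSplitter and splitOnTab are hypothetical names, and unlike String.split it keeps trailing empty fields. Whether it is measurably faster for this workload would need to be confirmed by profiling.

```java
import java.util.ArrayList;
import java.util.List;

final class TabSplitter {
    // Splits a line on '\t' without involving the regex engine.
    static List<String> splitOnTab(String line) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        int tab;
        while ((tab = line.indexOf('\t', start)) >= 0) {
            fields.add(line.substring(start, tab));
            start = tab + 1;
        }
        // Whatever follows the last tab (possibly empty) is the final field.
        fields.add(line.substring(start));
        return fields;
    }
}
```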
Have you thought about using a database instead of plain Java? With the bulk-load tools that come with the database you can load the data into tables, and from there you can do set-based processing. One challenge that I see is loading the data into a table, as the fields are not delimited with a common separator like "'" or ":".
You could rewrite addValues like this to make it faster - a few notes:
I have not tested the code but I think it is equivalent to yours.
I have not tested that it is quicker (but would be surprised if it wasn't)
I have assumed that wordsCount is larger than temp, if not exchange them in the code
I have also replaced all the HashMaps by Maps which does not make any difference for you but makes the code easier to change later on
private Map<String, Integer> addValues(Map<String, Integer> wordsCount, Map<String, Integer> temp) {
Map<String, Integer> merged = new HashMap<String, Integer>(wordsCount); //puts everything from wordsCount
for (Map.Entry<String, Integer> e : temp.entrySet()) {
Integer countInWords = merged.get(e.getKey()); //the number in wordsCount
Integer countInTemp = e.getValue();
int newCount = countInTemp + (countInWords == null ? 0 : countInWords); //the sum
merged.put(e.getKey(), newCount);
}
return merged;
}
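If Java 8 is available and it is acceptable to update an existing map in place instead of building a new one (an assumption, since the original addValues returns a fresh map), Map.merge expresses the same summation more compactly. CountMerger and addInto are names invented for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

final class CountMerger {
    // Adds every count from source into target, modifying target in place.
    // Keys missing from target are inserted; existing counts are summed.
    static void addInto(Map<String, Integer> target, Map<String, Integer> source) {
        for (Map.Entry<String, Integer> e : source.entrySet()) {
            target.merge(e.getKey(), e.getValue(), Integer::sum);
        }
    }
}
```

Because nothing is copied, the cost is proportional to the size of source only, which may matter when the accumulated map is much larger than the per-user map.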

Smart way to combine multiple Strings into a single String that can later be separated into the original Strings?

Assuming there are no restrictions in the characters that can be used in the individual Strings, and the Strings may be empty.
Edit:
It seems the proper way to do this is to use a separator and to escape occurrences of that separator that already exist in any of the individual strings. Below is my attempt at this, which seems to work. Did I miss any cases that will break it?
public static void main(String args[])
{
Vector<String> strings = new Vector<String>();
strings.add("abab;jmma");
strings.add("defgh;,;");
strings.add("d;;efgh;,;");
strings.add("");
strings.add("");
strings.add(";;");
strings.add(";,;");
String string = combine(strings);
strings= separate(string);
System.out.println();
}
static String combine(Vector<String> strings)
{
StringBuilder builder = new StringBuilder();
for(String string : strings)
{
//don't prepend a SEPARATOR to the first string
if(!builder.toString().equals(""))
{
builder.append(";");
}
string = string.replaceAll(";", ",;");
builder.append(string);
}
return builder.toString();
}
static Vector<String> separate(String string)
{
Vector<String> strings = new Vector<String>();
separate(string, strings, 0);
return strings;
}
static void separate(String string, Vector<String> strings, int currIndex)
{
int nextIndex = -1;
int checkIndex = currIndex;
while(nextIndex == -1 && checkIndex < string.length())
{
nextIndex = string.indexOf(';', checkIndex);
if(nextIndex == -1)
{
//no separator remains; avoids charAt() on a negative index
break;
}
//look back to determine if this occurrence is escaped
if(nextIndex > 0 && string.charAt(nextIndex - 1) == ',')
{
//this one is escaped, doesn't count
checkIndex = nextIndex + 1;
nextIndex = -1;
}
}
if(nextIndex == -1)
{
//no more remain
String toAdd = string.substring(currIndex, string.length());
toAdd = toAdd.replaceAll(",;", ";");
strings.add(toAdd);
return;
}
else if(currIndex + 1 == nextIndex)
{
//empty string
strings.add("");
separate(string, strings, nextIndex);
}
else
{
//there could be more
String toAdd = string.substring(currIndex, nextIndex);
toAdd = toAdd.replaceAll(",;", ";");
strings.add(toAdd);
separate(string, strings, nextIndex + 1);
}
}
}
Take your Vector of Strings and convert it to a JSON object and store the JSON object.
( http://www.json.org/ and http://www.json.org/java/ )
With your code, you can recover empty strings using the two-argument version of split:
String[] separate(String string)
{
return string.split(SEPARATOR, -1);
}
If you can truly make no assumptions about the string contents, the only way to do this properly is by escaping the separator sequence (which can then be a single character) wherever it occurs in the source string(s). Obviously, if you escape the separator sequence, you need to unescape the result after splitting. (The escape mechanism will likely require at least one additional escape/unescape for the escape character itself.)
EDIT
Here's an example (XML-inspired) of escaping and unescaping. It assumes that the separator sequence is "\u0000" (a single NULL character).
/** Returns a String guaranteed to have no NULL character. */
String escape(String source) {
return source.replace("&", "&amp;").replace("\u0000", "&null;");
}
/** Reverses the above escaping and returns the result. */
String unescape(String escaped) {
return escaped.replace("&null;", "\u0000").replace("&amp;", "&");
}
Many other variations are possible. (It is important that the replacements when unescaping are in reverse order from those used for escaping.) Note that you can still use String.split() to separate the components.
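As one such variation, here is a self-contained sketch that uses ';' as the separator and a backslash as the escape character, and scans the combined string by hand instead of calling split. The names (EscapedJoiner, combine, separate) and the choice of characters are assumptions for illustration. Each string is followed by a terminating separator, which keeps an empty list distinguishable from a list holding one empty string.

```java
import java.util.ArrayList;
import java.util.List;

final class EscapedJoiner {
    private static final char SEP = ';';
    private static final char ESC = '\\';

    // Escapes SEP and ESC inside each part, then terminates the part with SEP.
    static String combine(List<String> parts) {
        StringBuilder out = new StringBuilder();
        for (String part : parts) {
            for (int i = 0; i < part.length(); i++) {
                char c = part.charAt(i);
                if (c == SEP || c == ESC) {
                    out.append(ESC);
                }
                out.append(c);
            }
            out.append(SEP);
        }
        return out.toString();
    }

    // Reverses combine(): an escaped character is copied literally,
    // an unescaped SEP ends the current part.
    static List<String> separate(String combined) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < combined.length(); i++) {
            char c = combined.charAt(i);
            if (c == ESC && i + 1 < combined.length()) {
                current.append(combined.charAt(++i));
            } else if (c == SEP) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        return parts;
    }
}
```

Scanning character by character avoids the look-behind problem of deciding whether a separator is preceded by an escape character that is itself escaped.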
You can build a class that stores the individual strings internally and then outputs a concatenated version of the strings when you call toString. Getting the original strings back is trivial as you already have them stored individually.
You can get the same behavior in two lines of code using the Google Guava library (Splitter and Joiner classes).
public String combine(Collection<String> strings) {
return Joiner.on("yourUniqueSeparator").join(strings);
}
public Iterable<String> separate(String toSeparate) {
return Splitter.on("yourUniqueSeparator").split(toSeparate);
}
Take a look at opencsv if you want to use delimited text. The api is rather easy to use, and it takes care of dealing with escaping quotes and the like. However, it treats null values as empty strings, so you might get a,,c if your input was { "a", null, "c" }. If that's not acceptable, you could use a recognizable string and convert it back later.
char tokenSeparator = ',';
char quoteChar = '"';
String inputData[] = {"a","b","c"};
StringWriter stringWriter = new StringWriter();
CSVWriter csvWriter = new CSVWriter(stringWriter, tokenSeparator, quoteChar);
csvWriter.writeNext(inputData);
csvWriter.close();
StringReader stringReader = new StringReader(stringWriter.toString());
CSVReader csvReader = new CSVReader(stringReader, tokenSeparator, quoteChar);
String outputData[] = csvReader.readNext();

Partially match strings in case of List.contains(String)

I have a List<String>
List<String> list = new ArrayList<String>();
list.add("ABCD");
list.add("EFGH");
list.add("IJ KL");
list.add("M NOP");
list.add("UVW X");
if I do list.contains("EFGH"), it returns true.
Can I get a true in case of list.contains("IJ")? I mean, can I partially match strings to find if they exist in the list?
I have a list of 15000 strings. And I have to check about 10000 strings if they exist in the list. What could be some other (faster) way to do this?
Thanks.
If the suggestion from Roadrunner-EX does not suffice, then I believe you are looking for the Knuth–Morris–Pratt algorithm.
Time complexity:
Building the failure table (preprocessing) takes O(k) time.
The search itself takes O(n) time.
So, the complexity of the overall algorithm is O(n + k).
n = length of the text being searched
k = length of the pattern you are searching for
A normal brute-force search has a worst-case time complexity of O(nk).
Moreover, the failure table depends only on the pattern, so when you search many strings for the same pattern it is built only once and each further search costs O(n), whereas brute force can cost O(nk) every time.
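For reference, a compact sketch of Knuth-Morris-Pratt in Java is shown below. The names Kmp, failureTable and indexOf are chosen for this illustration. For short strings like the ones in the question, the built-in String.contains may well be adequate, so this is mainly useful for seeing how the algorithm avoids re-reading text characters.

```java
final class Kmp {
    // fail[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    static int[] failureTable(String pattern) {
        int[] fail = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // Returns the index of the first occurrence of pattern in text, or -1.
    static int indexOf(String text, String pattern) {
        if (pattern.isEmpty()) {
            return 0;
        }
        int[] fail = failureTable(pattern);
        int k = 0; // number of pattern characters currently matched
        for (int i = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1]; // fall back without moving i backwards
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == pattern.length()) {
                return i - k + 1;
            }
        }
        return -1;
    }
}
```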
Perhaps you want to put each String group into a HashSet, and by fragment, I mean don't add "IJ KL" but rather add "IJ" and "KL" separately. If you need both the list and this search capabilities, you may need to maintain two collections.
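A minimal sketch of that idea follows, assuming the fragments are separated by whitespace; FragmentIndex and build are made-up names. Note that this only finds whole fragments: "IJ" is found, but a shorter substring such as "I" is not.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FragmentIndex {
    // Splits every entry on whitespace and collects the pieces in a HashSet,
    // so that a later contains() call is a constant-time lookup on average.
    static Set<String> build(List<String> entries) {
        Set<String> fragments = new HashSet<>();
        for (String entry : entries) {
            for (String fragment : entry.trim().split("\\s+")) {
                if (!fragment.isEmpty()) {
                    fragments.add(fragment);
                }
            }
        }
        return fragments;
    }
}
```

With 15000 entries and 10000 lookups, building the set once and then calling contains() for each query avoids the nested scan over the list.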
As a second answer, upon rereading your question, you could also extend ArrayList<String> and override the contains() method.
public class PartialStringList extends ArrayList<String>
{
public boolean contains(Object o)
{
if(!(o instanceof String))
{
return false;
}
String s = (String)o;
Iterator<String> iter = iterator();
while(iter.hasNext())
{
String iStr = iter.next();
if (iStr.contains(s))
{
return true;
}
}
return false;
}
}
Judging by your earlier comments, this is maybe not the speed you're looking for, but is this more similar to what you were asking for?
You could use IterableUtils from Apache Commons Collections.
List<String> list = new ArrayList<String>();
list.add("ABCD");
list.add("EFGH");
list.add("IJ KL");
list.add("M NOP");
list.add("UVW X");
boolean hasString = IterableUtils.contains(list, "IJ", new Equator<String>() {
@Override
public boolean equate(String o1, String o2) {
return o2.contains(o1);
}
@Override
public int hash(String o) {
return o.hashCode();
}
});
System.out.println(hasString); // true
You can iterate over the list, and then call contains() on each String.
public boolean listContainsString(List<String> list, String checkStr)
{
Iterator<String> iter = list.iterator();
while(iter.hasNext())
{
String s = iter.next();
if (s.contains(checkStr))
{
return true;
}
}
return false;
}
Something like that should work, I think.
How about:
java.util.List<String> list = new java.util.ArrayList<String>();
list.add("ABCD");
list.add("EFGH");
list.add("IJ KL");
list.add("M NOP");
list.add("UVW X");
java.util.regex.Pattern p = java.util.regex.Pattern.compile("IJ");
java.util.regex.Matcher m = p.matcher("");
for(String s : list)
{
m.reset(s);
if(m.find()) System.out.println("Partially Matched");
}
Here's some code that uses a regex to shortcut the inner loop if none of the test Strings are found in the target String.
public static void main(String[] args) throws Exception {
List<String> haystack = Arrays.asList(new String[] { "ABCD", "EFGH", "IJ KL", "M NOP", "UVW X" });
List<String> needles = Arrays.asList(new String[] { "IJ", "NOP" });
// To cut down on iterations, create one big regex to check the whole haystack
StringBuilder sb = new StringBuilder();
sb.append(".*(");
for (String needle : needles) {
sb.append(needle).append('|');
}
sb.replace(sb.length() - 1, sb.length(), ").*");
String regex = sb.toString();
for (String target : haystack) {
if (!target.matches(regex)) {
System.out.println("Skipping " + target);
continue;
}
for (String needle : needles) {
if (target.contains(needle)) {
System.out.println(target + " contains " + needle);
}
}
}
}
Output:
Skipping ABCD
Skipping EFGH
IJ KL contains IJ
M NOP contains NOP
Skipping UVW X
If you really want to get cute, you could use a binary search (bisection) to identify which segments of the target list match, but it mightn't be worth it.
It depends on how likely it is that you'll find a hit. Low hit rates will give a good result. High hit rates will perform not much better than the simple nested-loop version. Consider inverting the loops if some needles hit many targets and others hit none.
It's all about aborting a search path ASAP.
Yes, you can! Sort of.
What you are looking for, is often called fuzzy searching or approximate string matching and there are several solutions to this problem.
With the FuzzyWuzzy lib, for example, you can have all your strings assigned a score based on how similar they are to a particular search term. The actual values seem to be integer percentages of the number of characters matching with regards to the search string length.
After invoking FuzzySearch.extractAll, it is up to you to decide what the minimum score would be for a string to be considered a match.
There are also other, similar libraries worth checking out, like google-diff-match-patch or the Apache Commons Text Similarity API, and so on.
If you need something really heavy-duty, your best bet would probably be Lucene (as also mentioned by Ryan Shillington)
This is not a direct answer to the given problem. But I guess this answer will help someone partially compare a given string against the elements in a list using Apache Commons Collections.
final Equator<String> equator = new Equator<String>() {
@Override
public boolean equate(String o1, String o2) {
final int i1 = o1.lastIndexOf(":");
final int i2 = o2.lastIndexOf(":");
return o1.substring(0, i1).equals(o2.substring(0, i2));
}
@Override
public int hash(String o) {
final int i1 = o.lastIndexOf(":");
return o.substring(0, i1).hashCode();
}
};
final List<String> list = Lists.newArrayList("a1:v1", "a2:v2");
System.out.println(IteratorUtils.matchesAny(list.iterator(), new EqualPredicate("a2:v1", equator)));
