Very slow execution - java

I have a basic method which reads in ~1000 files, each with ~10,000 lines, from the hard drive. Also, I have a structure called userDescription holding each user's "description words". I have created a HashMap whose type is HashMap<String, HashMap<String, Integer>>, corresponding to HashMap<eachUserDescriptionWord, HashMap<TweetWord, TweetWordFrequency>>.
The file is organized as:
<User=A>\t<Tweet="tweet...">\n
<User=A>\t<Tweet="tweet2...">\n
<User=B>\t<Tweet="tweet3...">\n
....
My method to do this is:
for (File file : tweetList) {
    if (file.getName().endsWith(".txt")) {
        System.out.println(file.getName());
        BufferedReader in;
        try {
            in = new BufferedReader(new FileReader(file));
            String str;
            while ((str = in.readLine()) != null) {
                // String split[] = str.split("\t");
                String split[] = ptnTab.split(str);
                String user = ptnEquals.split(split[1])[1];
                String tweet = ptnEquals.split(split[2])[1];
                // String user = split[1].split("=")[1];
                // String tweet = split[2].split("=")[1];
                if (tweet.length() == 0)
                    continue;
                if (!prevUser.equals(user)) {
                    description = userDescription.get(user);
                    if (description == null)
                        continue;
                    if (prevUser.length() > 0 && wordsCount.size() > 0) {
                        for (String profileWord : description) {
                            if (wordsCorr.containsKey(profileWord)) {
                                HashMap<String, Integer> temp = wordsCorr.get(profileWord);
                                wordsCorr.put(profileWord, addValues(wordsCount, temp));
                            } else {
                                wordsCorr.put(profileWord, wordsCount);
                            }
                        }
                    }
                    // wordsCount = new HashMap<String, Integer>();
                    wordsCount.clear();
                }
                setTweetWordCount(wordsCount, tweet);
                prevUser = user;
            }
        } catch (IOException e) {
            System.err.println("Something went wrong: " + e.getMessage());
        }
    }
}
Here, the method setTweetWordCount counts the word frequencies across all the tweets of a single user. The method is:
private void setTweetWordCount(HashMap<String, Integer> wordsCount,
        String tweet) {
    ArrayList<String> currTweet = new ArrayList<String>(
            Arrays.asList(removeUnwantedStrings(tweet)));
    if (currTweet.size() == 0)
        return;
    for (String word : currTweet) {
        try {
            if (word.equals("") || word.equals(null))
                continue;
        } catch (NullPointerException e) {
            continue;
        }
        Integer countWord = wordsCount.get(word);
        wordsCount.put(word, (countWord == null) ? 1 : countWord + 1);
    }
}
The method addValues checks whether wordsCount has words that are already in the giant HashMap wordsCorr. If it does, it increases the word's count in the original HashMap wordsCorr.
Now, my problem is that no matter what I do, the program is very, very slow. I ran this version on my server, which has fairly good hardware, but it's been 28 hours and only ~450 files have been scanned. I checked whether I was doing anything repeatedly that might be unnecessary, and I corrected a few such spots, but the program is still very slow.
Also, I have increased the heap size to 1500m, which is the maximum I can go.
Is there anything I might be doing wrong?
Thank you for your help!
EDIT: Profiling Results
First of all, I really want to thank you guys for the comments. I have changed some things in my program: I now use precompiled regexes instead of calling String.split() directly, along with other optimizations. However, after profiling, my addValues method is taking the most time. So here's my code for addValues. Is there something I should be optimizing here? Oh, and I've also changed my startProcess method a bit.
private HashMap<String, Integer> addValues(
        HashMap<String, Integer> wordsCount, HashMap<String, Integer> temp) {
    HashMap<String, Integer> merged = new HashMap<String, Integer>();
    for (String x : wordsCount.keySet()) {
        Integer y = temp.get(x);
        if (y == null) {
            merged.put(x, wordsCount.get(x));
        } else {
            merged.put(x, wordsCount.get(x) + y);
        }
    }
    for (String x : temp.keySet()) {
        if (merged.get(x) == null) {
            merged.put(x, temp.get(x));
        }
    }
    return merged;
}
EDIT 2: Even after trying so hard, the program didn't run as expected. I did all the suggested optimization of the "slow method" addValues, but it didn't help. So I took a different path: creating a word dictionary and assigning an index to each word first, and then doing the processing. Let's see where it goes. Thank you for your help!

Two things come to mind:
You are using String.split(), which uses a regular expression to do the splitting. That's complete overkill. Use one of the many splitXYZ() methods from Apache StringUtils instead.
You are probably creating really huge hash maps. With very large hash maps, hash collisions will make the map operations much slower. This can be improved by using more widely spread hash values. See an example over here: Java HashMap performance optimization / alternative

One suggestion (I don't know how much of an improvement you'll get from it) is based on the observation that currTweet is never modified, so there is no need to create a copy. I.e.,
ArrayList<String> currTweet = new ArrayList<String>(
Arrays.asList(removeUnwantedStrings(tweet)));
can be replaced with
List<String> currTweet = Arrays.asList(removeUnwantedStrings(tweet));
or you can use the array directly (which will be marginally faster). I.e.
String[] currTweet = removeUnwantedStrings(tweet);
Also,
word.equals(null)
is always false by the definition of the contract of equals. The right way to null-check is:
if (null == word || word.equals(""))
Additionally, you won't need that null-pointer-exception try-catch if you do this. Exception handling is expensive when it happens, so if your word array tends to return lots of nulls, this could be slowing down your code.
More generally though, this is one of those cases where you should profile the code and figure out where the actual bottleneck is (if there is a bottleneck) instead of looking for things to optimize ad-hoc.

You would gain from a few more optimizations:
String.split recompiles the input regex (in string form) to a pattern every time. You should have a single static final Pattern ptnTab = Pattern.compile( "\\t" ), ptnEquals = Pattern.compile( "=" ); and call, e.g., ptnTab.split( str ). The resulting performance should be close to StringTokenizer.
word.equals( "" ) || word.equals( null ). Lots of wasted cycles here. If you are actually seeing null words, then you are catching NPEs, which is very expensive. See the response from #trutheality above.
You should allocate the HashMap with a very large initial capacity to avoid all the resizing that is bound to happen.
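A minimal sketch of those two suggestions together: the patterns are compiled once as static finals, and the counting map is pre-sized. The class name, capacity, and field layout here are illustrative, not taken from the original program.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class SplitDemo {
    // Compiled once and reused for every line, mirroring the ptnTab/ptnEquals idea.
    private static final Pattern PTN_TAB = Pattern.compile("\t");
    private static final Pattern PTN_EQUALS = Pattern.compile("=");

    static String[] parseLine(String line) {
        return PTN_TAB.split(line);
    }

    public static void main(String[] args) {
        String[] fields = parseLine("<User=A>\t<Tweet=\"hello world\">");
        String user = PTN_EQUALS.split(fields[0])[1]; // "A>"
        // Pre-sized so the map never rehashes while counting
        // (the capacity here is an arbitrary illustration).
        Map<String, Integer> wordsCount = new HashMap<String, Integer>(1 << 16);
        wordsCount.put(user, 1);
        System.out.println(user + " -> " + wordsCount.get(user));
    }
}
```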

split() uses regular expressions, which are not "fast". Try using a StringTokenizer or something similar instead.
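For the tab-separated format in the question, a StringTokenizer version might look like the following sketch (the method and class names are mine, not from the thread):

```java
import java.util.StringTokenizer;

public class TokenizerDemo {
    // Returns the second tab-separated field of a line, or null if there isn't one.
    static String secondField(String line) {
        StringTokenizer st = new StringTokenizer(line, "\t");
        if (!st.hasMoreTokens()) return null;
        st.nextToken(); // skip the first field
        return st.hasMoreTokens() ? st.nextToken() : null;
    }

    public static void main(String[] args) {
        System.out.println(secondField("<User=A>\t<Tweet=\"hi\">")); // <Tweet="hi">
    }
}
```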

Have you thought about using a database instead of Java? Using DB tools you can load the data into tables with the bulk data-load utilities that come with the DB, and from there you can do set-based processing. One challenge I see is loading the data into tables, since the fields are not delimited with a common separator like "'" or ":".

You could rewrite addValues like this to make it faster - a few notes:
I have not tested the code, but I think it is equivalent to yours.
I have not tested that it is quicker (but I would be surprised if it wasn't).
I have assumed that wordsCount is larger than temp; if not, exchange them in the code.
I have also replaced all the HashMaps with Maps, which makes no functional difference but makes the code easier to change later on.
private Map<String, Integer> addValues(Map<String, Integer> wordsCount, Map<String, Integer> temp) {
    Map<String, Integer> merged = new HashMap<String, Integer>(wordsCount); // puts everything from wordsCount in
    for (Map.Entry<String, Integer> e : temp.entrySet()) {
        Integer countInWords = merged.get(e.getKey()); // the count in wordsCount
        Integer countInTemp = e.getValue();
        int newCount = countInTemp + (countInWords == null ? 0 : countInWords); // the sum
        merged.put(e.getKey(), newCount);
    }
    return merged;
}
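On Java 8 or later, the same merge can be sketched with Map.merge, which drops the explicit null checks entirely. This is untested against the original data, but it is the same fold over the two maps:

```java
import java.util.HashMap;
import java.util.Map;

public class MergeDemo {
    static Map<String, Integer> addValues(Map<String, Integer> wordsCount,
                                          Map<String, Integer> temp) {
        Map<String, Integer> merged = new HashMap<>(wordsCount);
        // For each key in temp: insert its count, or add it to the existing count.
        for (Map.Entry<String, Integer> e : temp.entrySet()) {
            merged.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = new HashMap<>();
        a.put("hello", 2);
        Map<String, Integer> b = new HashMap<>();
        b.put("hello", 3);
        b.put("world", 1);
        System.out.println(addValues(a, b).get("hello")); // 5
    }
}
```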

read the value from nested foreach loop

I have two String arrays. I have to enter values from the second array, while the first array's elements are used to find the WebElement.
Here is the sample code:
public void isAllTheFieldsDisplayed(String values, String fields) {
    String[] questions = fields.split(",");
    String[] answers = values.split(",");
    for (String q : questions) {
        // HERE IS THE PROBLEM - I want only the first answer from the String[] answers
        // for the first question; similarly for the second question, I want the second
        // element from the String[] answers.
        // THIS WON'T WORK - for (String ans : answers)
        find(By.cssSelector("input[id='" + q + "']")).sendKeys(ans);
    }
}
You probably need to check whether the two arrays contain the same number of elements.
Utilise a simple integer for loop and index into both arrays:
for (int i = 0; i < questions.length; i++) {
    driver.findElement(By.id(questions[i])).sendKeys(answers[i]);
}
I assume the find method is some sort of wrapper for Selenium's findElement.
Since an id is being located, I suggest using By.id.
Ideally, check whether the WebElement is found before calling sendKeys.
Here's a slightly different approach, which could be overkill depending on your environment.
Because of the coupled relationship of your questions and answers, we want to make sure they get paired correctly, and once they're paired there's no reason to distribute them separately anymore.
This could be a re-usable utility function like so:
public Map<String, String> csvsToMap(String keyCsv, String valueCsv) {
    String[] questions = keyCsv.split(",");
    String[] answers = valueCsv.split(",");
    // This could also be something like "questions.length >= answers.length" so that if
    // there are more questions than answers the extras would be ignored rather than fail...
    if (questions.length != answers.length) { // fail fast and explicit
        throw new RuntimeException("Not the same number of questions and answers");
    }
    Map<String, String> map = new HashMap<>();
    for (int i = 0; i < questions.length; i++) {
        map.put(questions[i], answers[i]);
    }
    return map;
}
After the data has been sanitized and prepared for ingesting, handling it becomes a bit easier:
Map<String, String> preparedQuestions = csvsToMap(fields, values);
for (String aQuestion : preparedQuestions.keySet()) {
    String selector = "input[id='" + aQuestion + "']";
    String answer = preparedQuestions.get(aQuestion);
    driver.findElement(By.cssSelector(selector)).sendKeys(answer);
}
Or if java8, streams could be used:
csvsToMap(fields, values).entrySet().stream()
        .forEach(pair -> {
            String selector = "input[id='" + pair.getKey() + "']";
            driver.findElement(By.cssSelector(selector)).sendKeys(pair.getValue());
        });
Preparing your data in a function like this ahead of time lets you avoid using indexes altogether elsewhere. If this is a pattern you repeat, a helper function like this becomes a single point of failure, which lets you test it, gain confidence in it, and trust that there aren't other near-identical snippets elsewhere that might have subtle differences or bugs.
Note how this helper function doesn't have any side effects: as long as the same inputs are provided, the same output should always result. This makes it easier to test than it would be with WebDriver operations baked into the task, since WebDriver has built-in side effects (talking to the browser) that can fail at any time through no fault of your code.
Iterator may resolve this, but I haven't tried. Note that arrays have no iterator() method, so wrap them in lists first:
Iterator<String> itr = Arrays.asList(questions).iterator();
Iterator<String> itrans = Arrays.asList(answers).iterator();
while (itr.hasNext() && itrans.hasNext()) {
    driver.findElement(By.id(itr.next())).sendKeys(itrans.next());
}

For loop is slow

Please have a look at the following code
private StringBuffer populateStringWithUnmatchingWords(ArrayList<String> unmatchingWordsHolder)
{
    StringBuffer unMatchingWordsStr = new StringBuffer("");
    for (int u = 0; u < unmatchingWordsHolder.size(); u++)
    {
        Iterator iterInWordMap = wordMap.entrySet().iterator();
        while (iterInWordMap.hasNext())
        {
            Map.Entry mEntry = (Map.Entry) iterInWordMap.next();
            if (mEntry.getValue().equals(unmatchingWordsHolder.get(u)))
            {
                //out.println(matchingWords.get(m)+" : "+true);
                unMatchingWordsStr.append(mEntry.getKey());
                unMatchingWordsStr.append(",");
            }
        }
    }
    return unMatchingWordsStr;
}
This loop takes 8387 ms to complete. The unmatchingWordsHolder is pretty big too. wordMap is a HashMap and contains around 5000 elements as well.
This loop will search whether elements in unmatchingWordsHolder are available in wordMap. If they are available, then they will be loaded into unMatchingWordsStr.
Is there any way for me to speed up this task?
Does using Collection.contains() help at all? That would be much more readable, if nothing else, to my mind. It depends on the relative sizes of the List and the Map, though. The easiest way is something like the following, but since it iterates over the Map and does the lookup on the List, it isn't going to be ideal if the Map is far larger than the List:
private StringBuffer populateStringWithUnmatchingWords(ArrayList<String> unmatchingWordsHolder) {
    StringBuffer unMatchingWordsStr = new StringBuffer();
    for (Entry<String, String> entry : wordMap.entrySet()) {
        if (unmatchingWordsHolder.contains(entry.getValue())) {
            //out.println(matchingWords.get(m)+" : "+true);
            unMatchingWordsStr.append(entry.getKey());
            unMatchingWordsStr.append(",");
        }
    }
    return unMatchingWordsStr;
}
As noted elsewhere, if you don't need thread safety, StringBuilder is generally preferred to StringBuffer, but I didn't want to mess with your method signatures.
You are iterating through every element in the Map once per list element. A better way is to copy the list's values into a HashSet and use contains() to check membership in (near) constant time.
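A sketch of that inversion (assuming a key should be appended at most once even if the list holds the same value twice; the names here are illustrative): copy the list into a HashSet once, then make a single pass over the map, turning the nested O(n·m) scan into O(n+m).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UnmatchedDemo {
    static String collectKeys(Map<String, String> wordMap, List<String> unmatching) {
        Set<String> lookup = new HashSet<>(unmatching); // O(1) membership tests
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : wordMap.entrySet()) {
            if (lookup.contains(e.getValue())) {
                sb.append(e.getKey()).append(",");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> wordMap = new HashMap<>();
        wordMap.put("k1", "apple");
        wordMap.put("k2", "pear");
        System.out.println(collectKeys(wordMap, Arrays.asList("apple", "plum"))); // k1,
    }
}
```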
Not sure if I got your problem statement correctly, but if you want to return a comma separated string of all the words that are found in another set of words then here's how you would do in Java 8:
private String populateContainedWords(List<String> words, Set<String> wordSet)
{
    StringJoiner joiner = new StringJoiner(", ");
    words.stream().filter(wordSet::contains).forEach(joiner::add);
    return joiner.toString();
}
And if you only want distinct words in this comma-separated string, then use the following approach:
private String populateDistinctlyContainedWords(List<String> words, Set<String> wordSet)
{
    StringJoiner joiner = new StringJoiner(", ");
    words.stream().distinct().filter(wordSet::contains).forEach(joiner::add);
    return joiner.toString();
}
And if you want a comma-separated string of words from the words list that are NOT contained in the wordSet, then here's how that's done:
private String populateDisjointWords(List<String> words, Set<String> wordSet)
{
    StringJoiner joiner = new StringJoiner(", ");
    words.stream().filter(n -> !wordSet.contains(n)).forEach(joiner::add);
    return joiner.toString();
}

Elegant solution for string-counting?

The problem I have is an example of something I've seen often. I have a series of strings (one string per line, let's say) as input, and all I need to do is return how many times each string has appeared. What is the most elegant way to solve this, without using a trie or other string-specific structure? The solution I've used in the past has been a hashtable-esque collection of custom-made (String, Integer) objects implementing Comparable to keep track of how many times each string has appeared, but this method seems clunky for several reasons:
1) This method requires the creation of a compare function which is identical to String's own compareTo().
2) The impression that I get is that I'm misusing TreeSet, which has been my collection of choice. Updating the counter for a given string requires checking to see if the object is in the set, removing the object, updating the object, and then reinserting it. This seems wrong.
Is there a more clever way to solve this problem? Perhaps there is a better Collections interface I could use to solve this problem?
Thanks.
One possibility:
public class Counter {
    public int count = 1;
}

public void count(String[] values) {
    Map<String, Counter> stringMap = new HashMap<String, Counter>();
    for (String value : values) {
        Counter count = stringMap.get(value);
        if (count != null) {
            count.count++;
        } else {
            stringMap.put(value, new Counter());
        }
    }
}
This way you still keep a map, but at least you don't need to regenerate the entry every time you match an already-seen string: you access the Counter object, which is a wrapper around an int, and increase its value by one, optimizing the map access.
TreeMap is much better for this problem, or better yet, Guava's Multiset.
To use a TreeMap, you'd use something like
Map<String, Integer> map = new TreeMap<>();
for (String word : words) {
    Integer count = map.get(word);
    if (count == null) {
        map.put(word, 1);
    } else {
        map.put(word, count + 1);
    }
}
// print out each word and each count:
for (Map.Entry<String, Integer> entry : map.entrySet()) {
    System.out.printf("Word: %s Count: %d%n", entry.getKey(), entry.getValue());
}
Integer theCount = map.get("the");
if (theCount == null) {
    theCount = 0;
}
System.out.println(theCount); // number of times "the" appeared, or 0
Multiset would be much simpler than that; you'd just write
Multiset<String> multiset = TreeMultiset.create();
for (String word : words) {
    multiset.add(word);
}
for (Multiset.Entry<String> entry : multiset.entrySet()) {
    System.out.printf("Word: %s Count: %d%n", entry.getElement(), entry.getCount());
}
System.out.println(multiset.count("the")); // number of times "the" appeared
You can use a hash-map (no need to "create a comparable function"):
Map<String, Integer> count(String[] strings)
{
    Map<String, Integer> map = new HashMap<String, Integer>();
    for (String key : strings)
    {
        Integer value = map.get(key);
        if (value == null)
            map.put(key, 1);
        else
            map.put(key, value + 1);
    }
    return map;
}
Here is how you can use this method in order to print (for example) the string-count of your input:
Map<String, Integer> map = count(input);
for (String key : map.keySet())
    System.out.println(key + " " + map.get(key));
You can use a Bag data structure from the Apache Commons Collection, like the HashBag.
A Bag does exactly what you need: It keeps track of how often an element got added to the collections.
HashBag<String> bag = new HashBag<>();
bag.add("foo");
bag.add("foo");
bag.getCount("foo"); // 2
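If Java 8 is available, the plain-HashMap counting shown above also collapses to Map.merge, with no wrapper class and no null check (a sketch; the class name is mine):

```java
import java.util.HashMap;
import java.util.Map;

public class CountDemo {
    static Map<String, Integer> count(String[] strings) {
        Map<String, Integer> map = new HashMap<>();
        for (String s : strings) {
            map.merge(s, 1, Integer::sum); // start at 1, or add 1 to the existing count
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, Integer> m = count(new String[] { "the", "cat", "the" });
        System.out.println(m.get("the")); // 2
    }
}
```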

Parsing 2 Files Line-By-Line and need to avoid Duplicates (in special cases)

I have 2 files that I'm parsing line-by-line adding the information to 2 separate ArrayList<String> containers. I'm trying to create a final container "finalPNList" that reflects the 'Resulting File/ArrayList' below.
Issue is that I'm not successfully avoiding duplicates. I've changed the code various ways without success. Sometimes I restrict the condition too much, and avoid all duplicates, and sometimes I leave it too loose and include all duplicates. I can't seem to find the conditions to make it just right.
Here is the code so far -- in this case, seeing the contents of processLine() isn't truly relevant; just know that you're getting back a map with 2 ArrayList<String>s.
public static Map<String, List<String>> masterList = new HashMap<String, List<String>>();
public static List<String> finalPNList = new ArrayList<String>();
public static List<String> modifier = new ArrayList<String>();
public static List<String> skipped = new ArrayList<String>();

for (Entry<String, String> e : tab1.entrySet()) {
    String key = e.getKey();
    String val = e.getValue();
    // returns BufferedReader to start line processing
    inputStream = getFileHandle(val);
    // builds masterList containing all data
    masterList.put(key, processLine(inputStream));
}

for (Entry<String, List<String>> e : masterList.entrySet()) {
    String key = e.getKey();
    List<String> val = e.getValue();
    System.out.println(modifier.size());
    for (String s : val) {
        if (modifier.size() == 0)
            finalPNList.add(s);
        if (!modifier.isEmpty() && finalPNList.contains(s)
                && !modifier.contains(key)) {
            // s has been added by parent process so SKIP!
            skipped.add(s);
        } else
            finalPNList.add(s);
    }
    modifier.add(key);
}
Here is what the data may look like (extremely simplified; I'm dealing with about 20K lines total, about 10K lines in each file):
File A
123;data
123;data
456,data
File B
123;data
789,data
789,data
Resulting File/ArrayList
123;data
123;data
789,data
789,data
!modifier.contains(key) is always true, it can be removed from your if-statement.
modifier.size() == 0 can be replaced with modifier.isEmpty().
Since you seem to want to add duplicates from File B, you need to check File A, not finalPNList when checking for existence (I just checked the applicable list in masterList, feel free to change this to something more appropriate / efficient).
You need to have an else after your first if-statement, otherwise you're adding items from File A twice.
I assumed you just missed 456 in your output, otherwise I might not quite understand.
Modified code with your file-IO replaced with something that's more in the spirit of an SSCCE:
masterList.put("A", Arrays.asList("123", "123", "456"));
masterList.put("B", Arrays.asList("123", "789", "789"));
for (Map.Entry<String, List<String>> e : masterList.entrySet()) {
    String key = e.getKey();
    List<String> val = e.getValue();
    System.out.println(modifier.size());
    for (String s : val) {
        if (modifier.isEmpty())
            finalPNList.add(s);
        else if (!modifier.isEmpty() && masterList.get("A").contains(s)) {
            // s has been added by parent process so SKIP!
            skipped.add(s);
        } else
            finalPNList.add(s);
    }
    modifier.add(key);
}

match array against string in java

I'm reading a file using a BufferedReader, so let's say I have
line = br.readLine();
I want to check if this line contains one of many possible strings (which I have in an array). I would like to be able to write something like:
while (!line.matches(stringArray)) { // not sure how to write this conditional
    do something here;
    line = br.readLine();
}
I'm fairly new to programming and Java, am I going about this the right way?
Copy all values into a Set<String> and then use contains():
Set<String> set = new HashSet<String> (Arrays.asList (stringArray));
while (!set.contains(line)) { ... }
[EDIT] If you want to find out if a part of the line contains a string from the set, you have to loop over the set. Replace set.contains(line) with a call to:
public boolean matches(Set<String> set, String line) {
    for (String check : set) {
        if (line.contains(check)) return true;
    }
    return false;
}
Adjust the check accordingly when you use regexp or a more complex method for matching.
[EDIT2] A third option is to concatenate the elements of the array into one huge regexp with |:
Pattern p = Pattern.compile("str1|str2|str3");
while (!p.matcher(line).find()) { // or matches() for a whole-string match
    ...
}
This can be cheaper if you have many elements in the array, since the regex engine will optimize the matching process.
It depends on what stringArray is. If it's a Collection then fine. If it's a true array, you should make it a Collection. The Collection interface has a method called contains() that will determine if a given Object is in the Collection.
Simple way to turn an array into a Collection:
String tokens[] = { ... }
List<String> list = Arrays.asList(tokens);
The problem with a List is that lookup is expensive (technically linear or O(n)). A better bet is to use a Set, which is unordered but has near-constant (O(1)) lookup. You can construct one like this:
From a Collection:
Set<String> set = new HashSet<String>(stringList);
From an array:
Set<String> set = new HashSet<String>(Arrays.asList(stringArray));
and then set.contains(line) will be a cheap operation.
Edit: Ok, I think your question wasn't clear. You want to see if the line contains any of the words in the array. What you want then is something like this:
BufferedReader in = null;
Set<String> words = ... // construct this as per above
try {
    in = ...
    String line;
    while ((line = in.readLine()) != null) {
        for (String word : words) {
            if (line.contains(word)) {
                // do whatever
            }
        }
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (in != null) { try { in.close(); } catch (Exception e) { } }
}
This is quite a crude check, which is used surprisingly often and tends to give annoying false positives on words like "scrap". For a more sophisticated solution you probably have to use regular expressions and look for word boundaries:
Pattern p = Pattern.compile("(?<=\\b)" + word + "(?=\\b)");
Matcher m = p.matcher(line);
if (m.find()) {
    // word found
}
You will probably want to do this more efficiently (like not compiling the pattern with every line) but that's the basic tool to use.
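One way to do that (the word list, class, and method names here are illustrative) is to compile a whole-word pattern per word once, up front, and reuse the compiled patterns for every line; Pattern.quote guards against words containing regex metacharacters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class WordMatchDemo {
    // Compile each word into a whole-word pattern exactly once.
    static List<Pattern> compileWordPatterns(List<String> words) {
        List<Pattern> patterns = new ArrayList<>();
        for (String word : words) {
            patterns.add(Pattern.compile("\\b" + Pattern.quote(word) + "\\b"));
        }
        return patterns;
    }

    static boolean lineMatchesAny(String line, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(line).find()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Pattern> ps = compileWordPatterns(Arrays.asList("rap"));
        System.out.println(lineMatchesAny("scrap metal", ps)); // false: not a whole word
        System.out.println(lineMatchesAny("a rap song", ps));  // true
    }
}
```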
Using the String.matches(regex) function, what about creating a regular expression that matches any one of the strings in the string array? Something like
String regex = ".*(";
for (int i = 0; i < array.length - 1; ++i)
    regex += array[i] + "|";
regex += array[array.length - 1] + ").*";
while (line.matches(regex))
{
    //. . .
}
