Getting all combinations of 2 elements from a string array in Java

Let's say I have this array list: ['a', 'b', 'xx'].
I want to extract every ordered combination of 2 strings, for example ['a','b'] ['a', 'xx'] ['b', 'a'] ['b', 'xx'] ['xx', 'a'] ['xx', 'b'].
I have written this code, but when the array gets really big (10k elements, for example) the JVM runs out of memory.
private Text empty = new Text("");
public void start(Iterable<Text> values, Context context) throws IOException, InterruptedException {
List<String> sitesArr = new ArrayList<String>();
HashMap<String, String> hmapPairs = new HashMap<String, String>();
for (Text site : values){
sitesArr.add(site.toString());
}
insertPairsToHash(hmapPairs, sitesArr);
writeContextFromHash(hmapPairs, context);
}
private void insertPairsToHash(HashMap<String, String> hmapPairs, List<String> sitesArr) {
for (int i=0; i<sitesArr.size(); i++) {
for (int j=i+1; j<sitesArr.size(); j++) {
String firstPair = sitesArr.get(i) + "_" + sitesArr.get(j);
String secondPair = sitesArr.get(j) + "_" + sitesArr.get(i);
hmapPairs.put(firstPair,secondPair);
}
}
}
private void writeContextFromHash(HashMap<String, String> hmapPairs, Context context) throws IOException, InterruptedException {
Text textTowriteToFile = new Text("");
for(Map.Entry<String, String> entry : hmapPairs.entrySet()) {
textTowriteToFile.set(entry.getKey());
context.write(textTowriteToFile, empty);
textTowriteToFile.set(entry.getValue());
context.write(textTowriteToFile, empty);
}
}
I use 2 for loops, and in each iteration I insert both orderings of the pair into the hash: for 'a' and 'b', "a_b" is the key and "b_a" is the value.
Then I iterate once over the hash to send the values.
How can I make it faster while using less memory?

Why not just call "writeContextFromHash" right in the nested for loop and not create HashMap?

You should probably add some more information to your question. But basically, with this kind of program you will always run into memory problems as your input gets larger. With 10k entries you end up with about 100 million combinations, resulting in about 50 million Map entries. Multiplied by the size of each entry (which depends on your input), this uses a lot of memory.
If you know the rough size of your input beforehand, you might just assign enough memory to your JVM (unless your machine is too small). If this doesn't solve the problem, you cannot keep all results in memory. Either swap out to disk or, as suggested above, write your results directly to the output instead of keeping them in memory.
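The figures above follow from simple counting: n elements give n(n-1) ordered pairs, and the nested loops in the question create one map entry per unordered pair, so n(n-1)/2 entries. A small sketch of that arithmetic (the class and method names are made up for illustration):

```java
// Hypothetical helper, not part of the original code: counts what the nested loops produce.
class PairCount {
    // Ordered pairs (a_b and b_a both count): n * (n - 1)
    static long orderedPairs(long n) {
        return n * (n - 1);
    }

    // Map entries in the question's code: one per unordered pair, n * (n - 1) / 2
    static long mapEntries(long n) {
        return orderedPairs(n) / 2;
    }

    public static void main(String[] args) {
        long n = 10_000;
        System.out.println(orderedPairs(n)); // 99990000
        System.out.println(mapEntries(n));   // 49995000
    }
}
```

Each of those entries holds two freshly built strings, which is why the heap fills up long before the loops finish.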

You can simply refactor your class to stream the results, so that you never keep the whole list of combined elements in memory.
private Text empty = new Text("");
public void start(Iterable<Text> values,Context context) throws IOException, InterruptedException {
List<String> sitesArr = new ArrayList<String>();
for (Text site : values){
sitesArr.add(site.toString());
}
insertPairsToHash(sitesArr,context);
}
private void insertPairsToHash(List<String> sitesArr, Context context) throws IOException, InterruptedException {
for (int i=0; i<sitesArr.size(); i++) {
for (int j=i+1; j<sitesArr.size(); j++) {
String firstPair = sitesArr.get(i) + "_" + sitesArr.get(j);
String secondPair = sitesArr.get(j) + "_" + sitesArr.get(i);
// write both orderings immediately instead of collecting them in a map
doWrite(context, firstPair, secondPair);
}
}
}
private void doWrite(Context context, String firstPair, String secondPair) throws IOException, InterruptedException {
Text textTowriteToFile = new Text("");
textTowriteToFile.set(firstPair);
context.write(textTowriteToFile, empty);
textTowriteToFile.set(secondPair);
context.write(textTowriteToFile, empty);
}
This will lower your memory usage.
In general, you should stream results if your input is big or unbounded. Streaming adds some complexity but keeps the memory used for results independent of the size of your input.
EDIT (After comment):
You can drop used elements by removing them from the list.
In this case you should use a LinkedList instead of an ArrayList, because removing the head element from an ArrayList shifts all remaining elements and so costs much more CPU time than the same operation on a linked list.
This however will not lower the peak memory usage, only the usage over time (you will require less memory as the process goes on).
It could still be useful if other components consume more memory as the process progresses.
List<String> sitesArr = new LinkedList<>();
private void insertPairsToHash(List<String> sitesArr, Context context) throws IOException, InterruptedException {
while (!sitesArr.isEmpty()) {
String left = sitesArr.remove(0);
for (String right : sitesArr) {
String firstPair = left + "_" + right;
String secondPair = right + "_" + left;
doWrite(context, firstPair, secondPair);
}
}
}
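To try the streaming idea outside Hadoop, here is a minimal, self-contained sketch. It assumes a plain java.util.function.Consumer as a stand-in for context.write(...); PairStreamer and forEachPair are illustrative names, not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Sketch only: the Consumer stands in for context.write(...) from the question.
class PairStreamer {
    // Hands every ordered pair "x_y" to the sink as soon as it is built; nothing is stored here.
    static void forEachPair(List<String> items, Consumer<String> sink) {
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                sink.accept(items.get(i) + "_" + items.get(j));
                sink.accept(items.get(j) + "_" + items.get(i));
            }
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>(); // collected only to show the result
        forEachPair(Arrays.asList("a", "b", "xx"), out::add);
        System.out.println(out); // [a_b, b_a, a_xx, xx_a, b_xx, xx_b]
    }
}
```

Only the input list stays in memory; each pair becomes garbage as soon as the sink has consumed it.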

Related

Refactor the below code to utilize very minimum space even for large data

"A train has wagonCount wagons, indexed 0, 1, ..., wagonCount-1. Each wagon must be filled in the constructor of Train using the fillWagon function, which accepts a wagon's index and returns the wagon's cargo. The code below works, but the server has enough memory only for a small train. Refactor the code so that the server has enough memory even for a large train."
"I am thinking we can convert the Hashtable to an array, but I have no idea how to start. Please help; any idea would be a great help."
import java.util.Hashtable;
import java.util.function.Function;
public class Train {
private Hashtable<Integer, Integer> wagons;
public Train(int wagonCount, Function<Integer, Integer> fillWagon) {
this.wagons = new Hashtable<Integer, Integer>();
for (int i = 0; i < wagonCount; i++) {
this.wagons.put(i, fillWagon.apply(i));
}
}
public int peekWagon(int wagonIndex) {
return this.wagons.get(wagonIndex);
}
public static void main(String[] args) {
Train train = new Train(10, wagonIndex -> wagonIndex);
for (int i = 0; i < 10; i++) {
System.out.println("Wagon: " + i + ", cargo: " + train.peekWagon(i));
}
}
}
You could use an int[]; it consumes less memory.
It is the most compact structure for storing integers. Hashtable<Integer, Integer> has a complex structure and a large per-entry overhead, and even Integer[] consumes a lot more memory than int[]. So the best structure is an array of primitives. For a good explanation, see "Memory usage of Java objects".
We use the array index to access the element at the required position; compared with Hashtable.get, this needs fewer CPU resources:
import java.util.function.Function;
public class Train {
private int[] wagons;
public Train(int wagonCount, Function<Integer, Integer> fillWagon) {
this.wagons = new int[wagonCount];
for (int i = 0; i < wagonCount; i++) {
this.wagons[i] = fillWagon.apply(i);
}
}
public int peekWagon(int wagonIndex) {
return this.wagons[wagonIndex];
}
public static void main(String[] args) {
Train train = new Train(10, wagonIndex -> wagonIndex);
for (int i = 0; i < 10; i++) {
System.out.println("Wagon: " + i + ", cargo: " + train.peekWagon(i));
}
}
}
If it is a requirement to fill all wagons during the execution of the constructor, then there is just no way to store an arbitrary number of wagon contents in memory when that memory is limited to some small constant size. Sure, using an int array will take less memory than a map, but it still grows linearly with the input size.
If however, it is allowed to defer the actual storing of the wagon contents, then you could use the constructor to keep a reference to the callback function, and only call it when peekWagon is called. You could still use the little memory that is available for storing some of the wagon contents, but only for the last queried k wagons. That way, you will have in memory what is queried regularly, but will need to retrieve (again) the contents when that particular wagon is not (or no longer) in memory. You would then call the callback function again.
This assumes of course that the callback function will not have undesirable side-effects, and that it will always return the same value when passed the same argument.
If these assumptions are OK, your code could look like this:
import java.util.*;
import java.util.function.Function;
public class Main {
static int maxSize = 4;
private LinkedHashMap<Integer, Integer> wagons;
private Function<Integer, Integer> fillWagon;
public Main(int wagonCount, Function<Integer, Integer> fillWagon) {
this.wagons = new LinkedHashMap<Integer, Integer>();
this.fillWagon = fillWagon;
}
public int peekWagon(int wagonIndex) {
int content;
if (!this.wagons.containsKey(wagonIndex)) {
if (this.wagons.size() >= maxSize) {
// Make room by removing an entry
int key = this.wagons.entrySet().iterator().next().getKey();
this.wagons.remove(key);
}
content = this.fillWagon.apply(wagonIndex);
} else {
// Remove entry so to put it at end of LinkedHashMap
content = this.wagons.get(wagonIndex);
this.wagons.remove(wagonIndex);
}
this.wagons.put(wagonIndex, content);
return content;
}
/* ... */
}
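The same eviction logic can also be left to LinkedHashMap itself, which supports access ordering and a removeEldestEntry hook. A sketch under the same assumptions as above (fillWagon has no side effects and always returns the same value for the same index); LruTrain, cachedCount and the maxSize constructor parameter are illustrative names:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch only: keeps at most maxSize wagon contents and recomputes the rest on demand.
class LruTrain {
    private final Function<Integer, Integer> fillWagon;
    private final Map<Integer, Integer> cache;

    LruTrain(final int maxSize, Function<Integer, Integer> fillWagon) {
        this.fillWagon = fillWagon;
        // accessOrder = true keeps the least recently used entry at the front
        this.cache = new LinkedHashMap<Integer, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Integer> eldest) {
                return size() > maxSize; // evict once the cache grows past maxSize
            }
        };
    }

    int peekWagon(int wagonIndex) {
        Integer content = cache.get(wagonIndex); // get() also marks the entry as recently used
        if (content == null) {
            content = fillWagon.apply(wagonIndex); // not cached (any more): ask the callback again
            cache.put(wagonIndex, content);
        }
        return content;
    }

    int cachedCount() {
        return cache.size();
    }
}
```

With this variant there is no need to remove and re-insert an entry by hand to refresh its position.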

Modifying an individual element in an array inside of an ArrayList

I have to write a piece of code for a class that counts the occurrences of characters within an input file and then sorts them by that, and I chose to do that by creating an ArrayList where each object[] has two elements, the character and the number of occurrences.
I was trying to increment the integer representing the number of occurrences and I just couldn't get that to work
My current attempt looks like this:
for(int i=0;i<=text.length();i++) {
if(freqlist.contains(text.charAt(i))) {
freqlist.indexOf(text.charAt(i))[1]=freqlist.get(freqlist.indexOf(text.charAt(i)))[1]+1;
}
}
text is just a string containing all of the input file
freqlist is declared earlier as
List<Object[]> freqlist=new ArrayList<Object[]>();
So, I was wondering how one could increment or modify an element of an array that is inside of an arraylist
In general, there are 3 mistakes in your program that prevent it from working. First, the for loop has i<=text.length() where it should be i < text.length(); otherwise you will get a StringIndexOutOfBoundsException. Second, you use freqlist.contains(...) with a character, but the list holds Object[] elements, and arrays are compared by identity, so it never matches. Third, freqlist.indexOf(...) relies on the same equality check, and it returns an int index, not the array. I made the example work, although the data structure List<Object[]> is inefficient for the task. It is best to use Map<Character,Integer>.
Here it is:
import java.util.ArrayList;
import java.util.List;
class Scratch {
public static void main(String[] args) {
String text = "abcdacd";
List<Object[]> freqlist= new ArrayList<>();
for(int i=0;i < text.length();i++) {
Object [] objects = find(freqlist, text.charAt(i));
if(objects != null) {
objects[1] = (Integer)objects[1] +1;
} else {
freqlist.add(new Object[]{text.charAt(i), 1});
}
}
for (Object[] objects : freqlist) {
System.out.println(String.format(" %s => %d", objects[0], objects[1]));
}
}
private static Object[] find(List<Object[]> freqlist, Character charAt) {
for (Object[] objects : freqlist) {
if (charAt.equals(objects[0])) {
return objects;
}
}
return null;
}
}
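For comparison, a minimal sketch of the Map<Character,Integer> alternative mentioned above. CharFrequency is a made-up name, and the LinkedHashMap is only chosen here so that the printed order matches the order of first appearance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the Map-based approach.
class CharFrequency {
    static Map<Character, Integer> count(String text) {
        // LinkedHashMap keeps first-seen order, which makes the output predictable
        Map<Character, Integer> freq = new LinkedHashMap<>();
        for (int i = 0; i < text.length(); i++) {
            // merge() stores 1 for a new key, otherwise adds 1 to the existing count
            freq.merge(text.charAt(i), 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(count("abcdacd")); // {a=2, b=1, c=2, d=2}
    }
}
```

No linear search through the list is needed, so each character costs a single map lookup.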
The way I would do this is first parse the file and convert it to an array of characters. This would then be sent to the charCounter() method which would count the number of times a letter occurs in the file.
/**
* Calculate the number of times a character is present in a character array
*
* @param myChars An array of characters from an input file, this should be parsed and formatted properly
* before sending to method
* @return A hashmap of all characters with their number of occurrences; if a
* letter is not in myChars it is not added to the HashMap
*/
public HashMap<Character, Integer> charCounter(char[] myChars) {
HashMap<Character, Integer> myCharCount = new HashMap<>();
if (myChars.length == 0) System.exit(1);
for (char c : myChars) {
if (myCharCount.containsKey(c)) {
//get the current number for the letter
int currentNum = myCharCount.get(c);
//Place the new number plus one to the HashMap
myCharCount.put(c, (currentNum + 1));
} else {
//Place the character in the HashMap with 1 occurrence
myCharCount.put(c, 1);
}
}
return myCharCount;
}
If you are using Java 8, you could use some Stream magic for the grouping:
Map<String, Long> map = dummyString.chars() // Turn the String into an IntStream
.mapToObj(c -> String.valueOf((char) c)) // Turn each int into a one-character String (Character::toString does not accept an Integer on Java 8)
.collect(Collectors.groupingBy(
s -> s, // Use the character as a key for the map
Collectors.counting())); // Count the occurrences
Now you could sort the result.
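One possible way to do that sorting, as a sketch: copy the entries into a list and order them by count, highest first. FrequencySort and byCountDescending are illustrative names, and breaking ties by key is an assumption made here so the result is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: orders the entries of a count map by frequency.
class FrequencySort {
    static List<Map.Entry<String, Long>> byCountDescending(Map<String, Long> counts) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue()); // higher counts first
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey()); // ties: by key
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("a", 2L);
        counts.put("b", 1L);
        counts.put("c", 2L);
        counts.put("d", 2L);
        System.out.println(byCountDescending(counts)); // [a=2, c=2, d=2, b=1]
    }
}
```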

Letter Combinations of a Phone Number (Java) - Passing an array through functions

Problem
Given a digit string, return all possible letter combinations that the number could represent. (Check out your cellphone to see the mappings) Input:Digit string "23", Output: ["ad", "ae", "af", "bd", "be", "bf", "cd", "ce", "cf"]
Question
I'm confused about the solution code below from LeetCode. Why does passing the result array through recursive calls change the result array in letterCombinations? Is it because the result array in every recursive getString call is referencing the same result array?
public List<String> letterCombinations(String digits) {
HashMap<Integer, String> map = new HashMap<>();
map.put(2, "abc");
map.put(3, "def");
map.put(4, "ghi");
map.put(5, "jkl");
map.put(6, "mno");
map.put(7, "pqrs");
map.put(8, "tuv");
map.put(9, "wxyz");
map.put(0, "");
ArrayList<String> result = new ArrayList<>();
if (digits == null || digits.length() == 0) {
return result;
}
ArrayList<Character> temp = new ArrayList<>();
getString(digits, temp, result, map);
return result;
}
public void getString(String digits, ArrayList<Character> temp, ArrayList<String> result,
HashMap<Integer, String> map) {
if (digits.length() == 0) {
char[] arr = new char[temp.size()];
for (int i = 0; i < temp.size(); i++) {
arr[i] = temp.get(i);
}
result.add(String.valueOf(arr));
return;
}
Integer curr = Integer.valueOf(digits.substring(0, 1));
String letters = map.get(curr);
for (int i = 0; i < letters.length(); i++) {
temp.add(letters.charAt(i));
getString(digits.substring(1), temp, result, map);
temp.remove(temp.size() - 1);
}
}
Is it because the result array in every recursive getString call is referencing the same result array?
The answer is yes.
Why does passing the result array through recursive calls change the result array in letterCombinations?
letterCombinations passes a reference to its result list, so every getString call works on that same list object. Each recursive call that reaches the base case adds a string to it, and because no copy is ever made, those additions are visible in letterCombinations once the recursion returns.
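A tiny demo of that behaviour, separate from the LeetCode code (SharedListDemo and its methods are made-up names): Java passes a copy of the reference, so mutating the list through the parameter is visible to the caller, while reassigning the parameter is not.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative demo: the method receives a copy of the reference, not a copy of the list.
class SharedListDemo {
    static void addItem(List<String> list, String item) {
        list.add(item); // mutates the one list object the caller also holds
    }

    static void reassign(List<String> list) {
        list = new ArrayList<>(); // only rebinds the local parameter
        list.add("lost");         // the caller never sees this element
    }

    public static void main(String[] args) {
        List<String> result = new ArrayList<>();
        addItem(result, "ad");
        addItem(result, "ae");
        reassign(result);
        System.out.println(result); // [ad, ae]
    }
}
```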
Firstly I'll point out that despite the name of the site you got it from, this isn't especially clear code.
The call to getString() has three changing parameters - digits, temp and result.
map never changes - it would be better and clearer if it were a constant. Let's pretend it is, and leave result aside for now, so the signature of getString() is getString(String digits, List<Character> temp).
The naming isn't obvious, but temp contains the "work done so far", so the first time we call it, it's an empty list.
Let's look at what happens the first time it is called, with digits == "234" and temp an empty list:
digits.length() != 0 -- so we skip the whole of the first block.
we grab the first digit, 2, and look up its letters in the map - "abc"
we loop through the letters:
we put 'a' onto the end of temp, making temp == ['a']
then we call getString("34", ['a'])
we remove the last item from temp, making temp == []
then the same with 'b' -- getString("34",['b'])
then the same with 'c' -- getString("34",['c'])
Then we're done. But what happened in those recursive calls?
Follow the logic through getString("34",['a']) and you'll see how it grabs 3 from its local digits and makes calls like getString("4", ['a','d']).
In turn getString("4", ['a','d']) makes calls like getString("",['a','d','g']).
Finally we're at level where the recursion stops. Look at what happens when we call getString("",['a','d','g']):
digits.length == 0, so we go into the if block and return -- we don't progress into the part that would call getString() again.
we (in a bit of a laborious way) join the chars from temp into a String, and add it to result.
And that's it.
Better code:
if(digits.isEmpty()) {
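// String.join() needs CharSequence elements, so the List<Character> is mapped to Strings first (needs java.util.stream.Collectors)
result.add(temp.stream().map(String::valueOf).collect(Collectors.joining()));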
return;
}
We have never created a new result - we're just passing the same one (and the same map too) to every invocation of getString(). So when one getString() adds an item, that item's still there when the next getString() adds a second.
Recursive methods can usually be read as:
def recursiveMethod(params) {
if(it's a no-brainer) {
output an answer
} else {
do a little bit of the job
call recursiveMethod(newParams)
}
}
In this case, it's a no-brainer when digits is empty -- the whole of the answer is in temp and just needs adding to the result list.
If it's not a no-brainer, the "little bit of the job" is to handle the first digit, recursing for each possible letter it could represent.
Cleaner in my opinion, while maintaining the spirit of the original:
private static final Map<Character, String> DECODINGS = new HashMap<>();
static {
DECODINGS.put('2', "abc");
// <snip>
}
public List<String> letterCombinations(String digits) {
ArrayList<String> result = new ArrayList<>();
addCombinationsToList(digits, "", result);
return result;
}
private void addCombinationsToList(String digits, String prefix, List<String> list) {
if (digits.isEmpty()) {
list.add(prefix);
} else {
String letters = DECODINGS.get(digits.charAt(0));
for (int i = 0; i < letters.length(); i++) {
addCombinationsToList(digits.substring(1), prefix + letters.charAt(i), list);
}
}
}
By building an immutable string prefix + letters.charAt(i) rather than manipulating a mutable List<Character>, you avoid having to put it back the way you found it, making code that's much easier to understand.

Java Parsing Using Hmap

I am new to Java. I want to parse data in this format:
Apple;Mango;Orange:1234;Orange:1244;...;
There could be more than one "Orange" at any point, and each one carries its own number.
After splitting it, let's assume I have stored the first two data (Apple, Orange) in variables (in setters) so I can return them from the getter functions. Now I want to store the values (1234, 1244, etc.) of the "Orange" entries in a variable to return them later. Before that I have to check how many oranges there are. I know I have to use a for loop for that, but I don't know how to store the values in a variable.
Please help me.
String input = "Apple;Mango;Orange:1234;Orange:1244;...;";
String[] values = input.split(";");
String value1 = values[0];
String value2 = values[1];
HashMap<String, ArrayList<String>> map = new HashMap<String, ArrayList<String>>();
for (int i = 2; i < values.length; i++) {
    // each remaining item looks like "Orange:1234"; split it into key and id
    String[] parts = values[i].split(":");
    if (parts.length < 2) {
        continue; // skip items without a value, such as the "..." placeholder
    }
    String key = parts[0];
    String id = parts[1];
    if (map.get(key) == null) {
        map.put(key, new ArrayList<String>());
    }
    map.get(key).add(id);
}
//for any key s:
// get the values of s
map.get(s); // returns a list of all values added
// get the count of s
map.get(s).size(); // return the total number of values.
Let me try to rephrase the question by how I interpreted it and -- more importantly -- how it focuses on the input and output (expectations), not the actual implementation:
I need to parse the string
"Apple;Mango;Orange:1234;Orange:1244;...;"
in a way so I can retrieve the values associated (numbers after ':') with the fruits:
I should receive an empty list for both the Apple and Mango in the example, because they have no value;
I should receive a list of 1234, 1244 for Orange.
Of course your intuition of HashMap is right on the spot, but someone may always present a better solution if you don't get too involved with the specifics.
There are a few open questions left:
Should the fruits without values have a default value given?
Should the fruits without values be in the map at all?
How should input errors be handled?
How should duplicate values be handled?
Given this context, we can start writing code:
import java.util.*;
public class FruitMarker {
public static void main(String[] args) {
String input = "Apple;Mango;Orange:1234;Orange:1244";
// replace with parameter processing from 'args'
// avoid direct implementations in variable definitions
// also observe the naming referring to the function of the variable
Map<String, Collection<Integer>> fruitIds = new HashMap<String, Collection<Integer>>();
// iterate through items by splitting
for (String item : input.split(";")) {
String[] fruitAndId = item.split(":"); // this will return the same item in an array, if separator is not found
String fruitName = fruitAndId[0];
boolean hasValue = fruitAndId.length > 1;
Collection<Integer> values = fruitIds.get(fruitName);
// if we are accessing the key for the first time, we have to set its value
if (values == null) {
values = new ArrayList<Integer>(); // here I can use concrete implementation
fruitIds.put(fruitName, values); // be sure to put it back in the map
}
if (hasValue) {
int fruitValue = Integer.parseInt(fruitAndId[1]);
values.add(fruitValue);
}
}
// display the entries in table iteratively
for (Map.Entry<String, Collection<Integer>> entry : fruitIds.entrySet()) {
System.out.println(entry.getKey() + " => " + entry.getValue());
}
}
}
If you execute this code, you will get the following output:
Mango => []
Apple => []
Orange => [1234, 1244]
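On Java 8 and later, the "get, check for null, put back" step can be collapsed with Map.computeIfAbsent. The following is only a sketch of that variant: FruitParser and parse are illustrative names, and a TreeMap is used here (an assumption, not the code above) so that the keys come out in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: same parsing idea as above, with computeIfAbsent creating missing lists.
class FruitParser {
    static Map<String, List<Integer>> parse(String input) {
        Map<String, List<Integer>> fruitIds = new TreeMap<>(); // sorted keys, predictable output
        for (String item : input.split(";")) {
            String[] fruitAndId = item.split(":");
            if (item.isEmpty() || fruitAndId.length == 0) {
                continue; // ignore empty items such as ";;"
            }
            // creates the list on first access and returns the existing one afterwards
            List<Integer> values = fruitIds.computeIfAbsent(fruitAndId[0], k -> new ArrayList<>());
            if (fruitAndId.length > 1) {
                values.add(Integer.parseInt(fruitAndId[1]));
            }
        }
        return fruitIds;
    }

    public static void main(String[] args) {
        System.out.println(parse("Apple;Mango;Orange:1234;Orange:1244"));
        // {Apple=[], Mango=[], Orange=[1234, 1244]}
    }
}
```

Input errors (for example a non-numeric value) still surface as a NumberFormatException, so the open questions listed earlier apply to this variant as well.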

Very slow execution

I have a basic method which reads in ~1000 files with ~10,000 lines each from the hard drive. Also, I have an array of String called userDescription which has all the "description words" of the user. I have created a HashMap whose data structure is HashMap<String, HashMap<String, Integer>> which corresponds to HashMap<eachUserDescriptionWords, HashMap<TweetWord, Tweet_Word_Frequency>>.
The file is organized as:
<User=A>\t<Tweet="tweet...">\n
<User=A>\t<Tweet="tweet2...">\n
<User=B>\t<Tweet="tweet3...">\n
....
My method to do this is:
for (File file : tweetList) {
if (file.getName().endsWith(".txt")) {
System.out.println(file.getName());
BufferedReader in;
try {
in = new BufferedReader(new FileReader(file));
String str;
while ((str = in.readLine()) != null) {
// String split[] = str.split("\t");
String split[] = ptnTab.split(str);
String user = ptnEquals.split(split[1])[1];
String tweet = ptnEquals.split(split[2])[1];
// String user = split[1].split("=")[1];
// String tweet = split[2].split("=")[1];
if (tweet.length() == 0)
continue;
if (!prevUser.equals(user)) {
description = userDescription.get(user);
if (description == null)
continue;
if (prevUser.length() > 0 && wordsCount.size() > 0) {
for (String profileWord : description) {
if (wordsCorr.containsKey(profileWord)) {
HashMap<String, Integer> temp = wordsCorr
.get(profileWord);
wordsCorr.put(profileWord,
addValues(wordsCount, temp));
} else {
wordsCorr.put(profileWord, wordsCount);
}
}
}
// wordsCount = new HashMap<String, Integer>();
wordsCount.clear();
}
setTweetWordCount(wordsCount, tweet);
prevUser = user;
}
} catch (IOException e) {
System.err.println("Something went wrong: "
+ e.getMessage());
}
}
}
Here, the method setTweetWordCount counts the word frequency of all the tweets of a single user. The method is:
private void setTweetWordCount(HashMap<String, Integer> wordsCount,
String tweet) {
ArrayList<String> currTweet = new ArrayList<String>(
Arrays.asList(removeUnwantedStrings(tweet)));
if (currTweet.size() == 0)
return;
for (String word : currTweet) {
try {
if (word.equals("") || word.equals(null))
continue;
} catch (NullPointerException e) {
continue;
}
Integer countWord = wordsCount.get(word);
wordsCount.put(word, (countWord == null) ? 1 : countWord + 1);
}
}
The method addValues checks whether wordsCount has words that are already in the giant HashMap wordsCorr. If it does, it increases the count of the word in the original HashMap wordsCorr.
Now, my problem is that no matter what I do, the program is very, very slow. I ran this version on my server, which has fairly good hardware, but it's been 28 hours and the number of files scanned is just ~450. I tried to see if I was doing anything repeatedly which might be unnecessary, and I corrected a few such things. But the program is still very slow.
Also, I have increased the heap size to 1500m which is the maximum that I can go.
Is there anything I might be doing wrong?
Thank you for your help!
EDIT: Profiling Results
First of all, I really want to thank you guys for the comments. I have changed some things in my program. I now use precompiled regexes instead of direct String.split(), plus other optimizations. However, after profiling, my addValues method takes the most time. So, here's my code for addValues. Is there something that I should be optimizing here? Oh, and I've also changed my startProcess method a bit.
private HashMap<String, Integer> addValues(
HashMap<String, Integer> wordsCount, HashMap<String, Integer> temp) {
HashMap<String, Integer> merged = new HashMap<String, Integer>();
for (String x : wordsCount.keySet()) {
Integer y = temp.get(x);
if (y == null) {
merged.put(x, wordsCount.get(x));
} else {
merged.put(x, wordsCount.get(x) + y);
}
}
for (String x : temp.keySet()) {
if (merged.get(x) == null) {
merged.put(x, temp.get(x));
}
}
return merged;
}
EDIT2: Even after trying so hard, the program didn't run as expected. I did all the optimization of the "slow method" addValues, but it didn't help. So I took a different path: creating a word dictionary and assigning an index to each word first, and then doing the processing. Let's see where it goes. Thank you for your help!
Two things come to mind:
You are using String.split(), which uses a regular expression to do the splitting. That's completely oversized. Use one of the many splitXYZ() methods from Apache StringUtils instead.
You are probably creating really huge hash maps. When having very large hash maps, the hash collisions will make the hashmap functions much slower. This can be improved by using more widely spread hash values. See an example over here: Java HashMap performance optimization / alternative
One suggestion (I don't know how much of an improvement you'll get from it) is based on the observation that currTweet is never modified. There is no need to create a copy. I.e.
ArrayList<String> currTweet = new ArrayList<String>(
Arrays.asList(removeUnwantedStrings(tweet)));
can be replaced with
List<String> currTweet = Arrays.asList(removeUnwantedStrings(tweet));
or you can use the array directly (which will be marginally faster). I.e.
String[] currTweet = removeUnwantedStrings(tweet);
Also,
word.equals(null)
is always false by the definition of the contract of equals. The right way to null-check is:
if (null == word || word.equals(""))
Additionally, you won't need that null-pointer-exception try-catch if you do this. Exception handling is expensive when it happens, so if your word array tends to return lots of nulls, this could be slowing down your code.
More generally though, this is one of those cases where you should profile the code and figure out where the actual bottleneck is (if there is a bottleneck) instead of looking for things to optimize ad-hoc.
You would gain from a few more optimizations:
String.split recompiles the input regex (in string form) to a pattern every time. You should have a single static final Pattern ptnTab = Pattern.compile( "\\t" ), ptnEquals = Pattern.compile( "=" ); and call, e.g., ptnTab.split( str ). The resulting performance should be close to StringTokenizer.
word.equals( "" ) || word.equals( null ). Lots of wasted cycles here. If you are actually seeing null words, then you are catching NPEs, which is very expensive. See the response from @trutheality above.
You should allocate the HashMap with a very large initial capacity to avoid all the resizing that is bound to happen.
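The precompiled-pattern suggestion from the list above can be sketched like this. TabSplitter and splitLine are illustrative names; whether this is measurably faster than String.split for a single-character separator depends on the JDK, so profile before relying on it.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Sketch only: the pattern is compiled once and reused for every line.
class TabSplitter {
    private static final Pattern TAB = Pattern.compile("\\t");

    static String[] splitLine(String line) {
        return TAB.split(line); // no regex compilation per call
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitLine("user=A\ttweet=hello")));
        // [user=A, tweet=hello]
    }
}
```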
split() uses regular expressions, which are not "fast". try using a StringTokenizer or something instead.
Have you thought about using a database instead of Java? Most databases come with bulk-loading tools that can load the data into tables, and from there you can do set-based processing. One challenge that I see is loading the data into a table, as the fields are not delimited with a common separator like "'" or ":".
You could rewrite addValues like this to make it faster - a few notes:
I have not tested the code but I think it is equivalent to yours.
I have not tested that it is quicker (but would be surprised if it wasn't)
I have assumed that wordsCount is larger than temp, if not exchange them in the code
I have also replaced all the HashMaps by Maps which does not make any difference for you but makes the code easier to change later on
private Map<String, Integer> addValues(Map<String, Integer> wordsCount, Map<String, Integer> temp) {
Map<String, Integer> merged = new HashMap<String, Integer>(wordsCount); // copies everything from wordsCount into merged
for (Map.Entry<String, Integer> e : temp.entrySet()) {
Integer countInWords = merged.get(e.getKey()); //the number in wordsCount
Integer countInTemp = e.getValue();
int newCount = countInTemp + (countInWords == null ? 0 : countInWords); //the sum
merged.put(e.getKey(), newCount);
}
return merged;
}
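If allocating a new merged map on every call turns out to be the expensive part, another option is to add the counts into the existing map in place. This is a sketch, not a drop-in replacement: CountMerger and mergeInto are made-up names, it needs Java 8 for Map.merge, and it assumes the target map belongs to one profile word only (it must not be the same wordsCount object that is cleared and reused later).

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: merges counts without building a third map.
class CountMerger {
    // Adds every count from source into target, modifying target in place.
    static void mergeInto(Map<String, Integer> target, Map<String, Integer> source) {
        for (Map.Entry<String, Integer> e : source.entrySet()) {
            // inserts the count for a new word, otherwise sums old and new counts
            target.merge(e.getKey(), e.getValue(), Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> total = new HashMap<>();
        total.put("java", 1);
        Map<String, Integer> tweet = new HashMap<>();
        tweet.put("java", 3);
        tweet.put("map", 4);
        mergeInto(total, tweet);
        System.out.println(total.get("java") + " " + total.get("map")); // 4 4
    }
}
```

The work per call is then proportional to the size of the smaller per-user map rather than to the size of the accumulated one.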
