Counting occurrences of words in an array - java

I've been working on something which takes a stream of characters, forms words, makes an array of the words, then creates a vector which contains each unique word and the number of times it occurs (basically a word counter).
Anyway, I've not used Java in a long time (or done much programming, to be honest), and I'm not happy with how this currently looks. The part that builds the vector looks ugly to me, and I want to know whether I can make it less messy.
int counter = 1;
Vector<Pair<String, Integer>> finalList = new Vector<Pair<String, Integer>>();
Pair<String, Integer> wordAndCount = new Pair<String, Integer>(wordList.get(1), counter); // wordList contains " " as its first word; starting at wordList.get(1) skips it.
for (int i = 2; i < wordList.size(); i++) { // start at 2: wordList.get(1) is already held in wordAndCount
    if (wordAndCount.getLeft().equals(wordList.get(i))) {
        wordAndCount = new Pair<String, Integer>(wordList.get(i), ++counter); // pre-increment, so the second occurrence counts as 2
    }
    else {
        finalList.add(wordAndCount);
        wordAndCount = new Pair<String, Integer>(wordList.get(i), counter = 1);
    }
}
finalList.add(wordAndCount); // UGLY!!
As a secondary question, this gives me a vector with all the words in alphabetical order (as in the array). I want to have it sorted by occurrence, then alphabetical within that.
Would the best option be:
1. Iterate down the vector, testing each occurrence count against the one above, using Collections.swap() if it is higher, then checking the next one above (as it has now moved up one) and so on until it is no longer larger than anything above it. Any occurrence of 1 could be skipped.
2. Iterate down the vector again, testing each element against the first element of the vector and then iterating downwards until the number of occurrences is lower, and inserting it above that element. All occurrences of 1 would once again be skipped.
The first method would do more iterating over the elements, but the second requires adding and removing components of the vector (I think?), so I don't know which is more efficient, or whether it's worth considering.

Why not use a Map to solve your problem?
String[] words; // your incoming array of words
Map<String, Integer> wordMap = new HashMap<String, Integer>();
for (String word : words) {
    if (!wordMap.containsKey(word))
        wordMap.put(word, 1);
    else
        wordMap.put(word, wordMap.get(word) + 1);
}
Sorting can be done using Java's sorted collections:
SortedMap<Integer, SortedSet<String>> sortedMap = new TreeMap<Integer, SortedSet<String>>();
for (Entry<String, Integer> entry : wordMap.entrySet()) {
    if (!sortedMap.containsKey(entry.getValue()))
        sortedMap.put(entry.getValue(), new TreeSet<String>());
    sortedMap.get(entry.getValue()).add(entry.getKey());
}
Nowadays you should leave the sorting to the language's libraries. They have been proven correct over the years.
Note that the code may use a lot of memory because of all the data structures involved, but that is what we pay for higher-level programming (and memory is getting cheaper every second).
I didn't run the code to see that it works, but it does compile (I copied it directly from Eclipse).
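If you then want to print by occurrence count, highest first, one possible follow-up is sketched below; note it assumes sortedMap is declared as a TreeMap (or NavigableMap) rather than SortedMap, since descendingMap() lives on NavigableMap.
TreeMap<Integer, SortedSet<String>> sortedMap = ... // built as above
// Walk the counts from highest to lowest; within one count the TreeSet
// already yields its words alphabetically.
for (Map.Entry<Integer, SortedSet<String>> entry : sortedMap.descendingMap().entrySet()) {
    for (String word : entry.getValue()) {
        System.out.println(word + ": " + entry.getKey());
    }
}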

Re: sorting, one option is to write a custom Comparator that first examines the number of times each word appears, then (if equal) compares the words alphabetically.
private final class PairComparator implements Comparator<Pair<String, Integer>> {
    public int compare(Pair<String, Integer> p1, Pair<String, Integer> p2) {
        // Compare by Integer first; higher counts sort earlier here
        // (swap p1 and p2 to reverse the order).
        int byCount = p2.getRight().compareTo(p1.getRight()); // assumes Pair has getRight() alongside getLeft()
        if (byCount != 0) {
            return byCount;
        }
        // Compare by String, if necessary.
        return p1.getLeft().compareTo(p2.getLeft());
    }
}
You'd then sort finalList by calling Collections.sort(finalList, new PairComparator());

How about using the Google Guava library?
Multiset<String> multiset = HashMultiset.create();
for (String word : words) {
    multiset.add(word);
}
int countFoo = multiset.count("foo");
From their javadocs:
A collection that supports order-independent equality, like Set, but may have duplicate elements. A multiset is also sometimes called a bag.
Simple enough?
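If you also want the counts ordered, Guava can do that too; a small sketch using Multisets.copyHighestCountFirst (available in Guava 11 and later):
// Copies the multiset so that entrySet() iterates highest counts first.
ImmutableMultiset<String> byCount = Multisets.copyHighestCountFirst(multiset);
for (Multiset.Entry<String> entry : byCount.entrySet()) {
    System.out.println(entry.getElement() + ": " + entry.getCount());
}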

Related

How to count the number of occurrences of each word?

If I have an article in English, or a novel in English, and I want to count how many times each word appears, what is the fastest algorithm written in Java?
Some people said you can use a Map<String, Integer> to do this, but I was wondering: how do I know what the key words are? Every article has different words, so how do you know the "key" words and then add one to their count?
Here is another way to do it, using features that arrived in Java 8:
private void countWords(final Path file) throws IOException {
    Arrays.stream(new String(Files.readAllBytes(file), StandardCharsets.UTF_8).split("\\W+"))
          .collect(Collectors.groupingBy(Function.<String>identity(), TreeMap::new, counting()))
          .entrySet()
          .forEach(System.out::println);
}
So what is it doing?
1. It reads the text file completely into memory, into a byte array to be precise: Files.readAllBytes(file). This method turned up in Java 7 and loads files very fast, at the price that the file sits completely in memory, costing a lot of memory. For speed, however, this is a good approach.
2. The byte[] is converted to a String: new String(Files.readAllBytes(file), StandardCharsets.UTF_8), assuming the file is UTF-8 encoded. Change this to your needs. The price is a full in-memory copy of the already huge piece of data. It may be faster to work with a memory-mapped file instead.
3. The string is split at non-word characters: ...split("\\W+"), which creates an array of strings with all your words.
4. We create a stream from that array: Arrays.stream(...). This by itself does not do very much, but we can do a lot of fun things with the stream.
5. We group all the words together: Collectors.groupingBy(Function.<String>identity(), TreeMap::new, counting()). This means:
   - We group the words by the words themselves (identity()). We could also, for example, lowercase the string here first if the grouping should be case-insensitive. The word ends up as the key in a map.
   - As the structure for storing the grouped values we want a TreeMap (TreeMap::new). TreeMaps are sorted by their keys, so we can easily output in alphabetical order at the end. If you do not need sorting, you could also use a HashMap here.
   - As the value for each group we want the number of occurrences of each word (counting()). In the background that means that for each word we add to a group, we increase its counter by one.
6. From step 5 we are left with a Map that maps words to their count. Now we just want to print them, so we access a collection with all the key/value pairs in this map (.entrySet()).
7. Finally, the actual printing: we say that each element should be passed to the println method (.forEach(System.out::println)). And now you are left with a nice list.
So how good is this answer? The upside is that it is very short and thus highly expressive. It also gets along with only a single system call, hidden behind Files.readAllBytes (or at least a fixed number of them; I am not sure whether this really works with a single system call), and system calls can be a bottleneck. For example, if you are reading a file from a stream, each call to read may trigger a system call. This is significantly reduced by using a BufferedReader which, as the name suggests, buffers, but readAllBytes should still be fastest. The price is that it consumes a huge amount of memory. However, Wikipedia claims that a typical English book has 500 pages with 2,000 characters per page, which is roughly 1 megabyte, so memory consumption should not be a problem even if you are on a smartphone, a Raspberry Pi, or a really, really old computer.
This solution does enable some optimizations that were not possible prior to Java 8. For example, the idiom map.put(word, map.get(word) + 1) requires the "word" to be looked up twice in the map, which is an unnecessary waste.
But a simple loop might also be easier for the compiler to optimize and might save a number of method calls. So I wanted to know, and put this to a test. I generated a file using:
[ -f /tmp/random.txt ] && rm /tmp/random.txt; for i in {1..15}; do head -n 10000 /usr/share/dict/american-english >> /tmp/random.txt; done; perl -MList::Util -e 'print List::Util::shuffle <>' /tmp/random.txt > /tmp/random.tmp; mv /tmp/random.tmp /tmp/random.txt
This gives me a file of about 1.3 MB, so not that untypical for a book, with most words being repeated 15 times but in random order, to avoid this ending up as a branch-prediction test. Then I ran the following tests:
public class WordCountTest {

    @Test(dataProvider = "provide_description_testMethod")
    public void test(String description, TestMethod testMethod) throws Exception {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 100_000; i++) {
            testMethod.run();
        }
        System.out.println(description + " took " + (System.currentTimeMillis() - start) / 1000d + "s");
    }

    @DataProvider
    public Object[][] provide_description_testMethod() {
        Path path = Paths.get("/tmp/random.txt");
        return new Object[][]{
            {"classic", (TestMethod) () -> countWordsClassic(path)},
            {"mixed", (TestMethod) () -> countWordsMixed(path)},
            {"mixed2", (TestMethod) () -> countWordsMixed2(path)},
            {"stream", (TestMethod) () -> countWordsStream(path)},
            {"stream2", (TestMethod) () -> countWordsStream2(path)},
        };
    }

    private void countWordsClassic(final Path path) throws IOException {
        final Map<String, Integer> wordCounts = new HashMap<>();
        for (String word : new String(readAllBytes(path), StandardCharsets.UTF_8).split("\\W+")) {
            Integer oldCount = wordCounts.get(word);
            if (oldCount == null) {
                wordCounts.put(word, 1);
            } else {
                wordCounts.put(word, oldCount + 1);
            }
        }
    }

    private void countWordsMixed(final Path path) throws IOException {
        final Map<String, Integer> wordCounts = new HashMap<>();
        for (String word : new String(readAllBytes(path), StandardCharsets.UTF_8).split("\\W+")) {
            // merge's remapping function receives (oldValue, newValue)
            wordCounts.merge(word, 1, (oldCount, one) -> oldCount + 1);
        }
    }

    private void countWordsMixed2(final Path path) throws IOException {
        final Map<String, Integer> wordCounts = new HashMap<>();
        Pattern.compile("\\W+")
               .splitAsStream(new String(readAllBytes(path), StandardCharsets.UTF_8))
               .forEach(word -> wordCounts.merge(word, 1, (oldCount, one) -> oldCount + 1));
    }

    private void countWordsStream2(final Path tmpFile) throws IOException {
        Pattern.compile("\\W+").splitAsStream(new String(readAllBytes(tmpFile), StandardCharsets.UTF_8))
               .collect(Collectors.groupingBy(Function.<String>identity(), HashMap::new, counting()));
    }

    private void countWordsStream(final Path tmpFile) throws IOException {
        Arrays.stream(new String(readAllBytes(tmpFile), StandardCharsets.UTF_8).split("\\W+"))
              .collect(Collectors.groupingBy(Function.<String>identity(), HashMap::new, counting()));
    }

    interface TestMethod {
        void run() throws Exception;
    }
}
The results were:
type     length  diff
classic  4665s   +9%
mixed    4273s   +0%
mixed2   4833s   +13%
stream   4868s   +14%
stream2  5070s   +19%
Note that I previously also tested with TreeMaps, but found that the HashMaps were much faster, even when I sorted the output afterwards. I also changed the tests above after Tagir Valeev told me in the comments below about the Pattern.splitAsStream() method. Since I got strongly varying results, I let the tests run for quite a while, as you can see from the lengths in seconds above, to get meaningful results.
How I judge the results:
The "mixed" approach which does not use streams at all, but uses the "merge" method with callback introduced in Java 8 does improve the performance. This is something I expected because the classic get/put appraoch requires the key to be looked up twice in the HashMap and this is not required anymore with the "merge"-approach.
To my suprise the Pattern.splitAsStream() appraoch is actually slower compared to Arrays.asStream(....split()). I did have a look at the source code of both implementations and I noticed that the split() call saves the results in an ArrayList which starts with a size of zero and is enlarged as needed. This requires many copy operations and in the end another copy operation to copy the ArrayList to an array. But "splitAsStream" actually creates an iterator which I thought can be queried as needed avoiding these copy operations completely. I did not quite look through all the source that converts the iterator to a stream object, but it seems to be slow and I don't know why. In the end it theoretically could have to do with CPU memory caches: If exactly the same code is executed over and over again the code will more likely be in the cache then actually running on large function chains, but this is a very wild speculation on my side. It may also be something completely different. However splitAsStream MIGHT have a better memory footprint, maybe it does not, I did not profile that.
The stream approach in general is pretty slow. This is not totally unexpected because quite a number of method invocations take place, including for example something as pointless as Function.identity. However I did not expect the difference at this magnitude.
As an interesting side note I find the mixed approach which was fastest quite well to read and understand. The call to "merge" does not have the most ovbious effect to me, but if you know what this method is doing it seems most readable to me while at the same time the groupingBy command is more difficult to understand for me. I guess one might be tempted to say that this groupingBy is so special and highly optimised that it makes sense to use it for performance but as demonstrated here, this is not the case.
Map<String, Integer> countByWords = new HashMap<String, Integer>();
Scanner s = new Scanner(new File("your_file_path"));
while (s.hasNext()) {
    String next = s.next();
    Integer count = countByWords.get(next);
    if (count != null) {
        countByWords.put(next, count + 1);
    } else {
        countByWords.put(next, 1);
    }
}
s.close();
this count "I'm" as only one word
General overview of steps:
1. Create a HashMap<String, Integer>.
2. Read the file one word at a time. If the word doesn't exist in your HashMap, add it with a count of 1. If it does exist, increment its value by 1. Read until the end of the file.
This will result in a set of all your words and the count for each word; a minimal sketch follows below.
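A possible sketch of those steps, using Java 8's Map.merge to fold the two cases into one call (the file name is a placeholder):
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class WordFrequency {
    public static void main(String[] args) throws FileNotFoundException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        Scanner in = new Scanner(new File("input.txt")); // placeholder path
        while (in.hasNext()) {
            // merge(): store 1 for a new word, otherwise add 1 to the old count.
            counts.merge(in.next(), 1, Integer::sum);
        }
        in.close();
        System.out.println(counts);
    }
}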
If I were you, I would use one of the implementations of Map<String, Integer> (Java's generics require the wrapper type Integer, not the primitive int), like a HashMap. Then, as you loop through, if a word already exists, just increment its count by one; otherwise add it into the map. At the end you can pull out all of the words, or query it for a specific word to get the count.
If order is important to you, you could try a SortedMap<String, Integer> to be able to print them out in alphabetical order; see the sketch below.
Hope that helps!
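A small sketch of that idea ('words' stands in for your input tokens, an assumption for illustration):
SortedMap<String, Integer> counts = new TreeMap<String, Integer>();
for (String word : words) { // 'words' is an assumed String[] of input tokens
    Integer old = counts.get(word);
    counts.put(word, old == null ? 1 : old + 1);
}
// A TreeMap iterates its keys in sorted order, so this prints alphabetically.
System.out.println(counts);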
It is actually the classic word-count algorithm.
Here is the solution:
public Map<String, Integer> wordCount(String[] strings) {
    Map<String, Integer> map = new HashMap<String, Integer>();
    int count = 0;
    for (String s : strings) {
        if (map.containsKey(s)) {
            count = map.get(s);
            map.put(s, count + 1);
        } else {
            map.put(s, 1);
        }
    }
    return map;
}
Here is my solution:
Map<String, Integer> map = new HashMap<String, Integer>();
int count = 0;
for (int i = 0; i < strings.length; i++) {
    for (int j = 0; j < strings.length; j++) {
        if (strings[i].equals(strings[j])) { // equals(), not ==, to compare string contents
            count++;
        }
    }
    map.put(strings[i], count);
    count = 0;
}
return map;

Using the class methods of something in a TreeMap

I know there are already topics on this exact thing, but none of them actually answers my question. Is there a way to do this?
If I have a TreeMap that uses strings as the keys and TreeSet objects as the values, is there a way I can add an int to the set that is associated with a specific key?
Well, what I'm supposed to do is make a concordance from a text file using the TreeMap and TreeSet classes. My plan is this: use the words in the text file as the TreeMap keys, and make the values sets of the line numbers on which each word appears. So you step through the text file, and every time you get a word you check the TreeMap to see if you already have that key. If you don't, you add it and create a new TreeSet of line numbers starting with the one you are on. If you already have it, you just add the line number to the set. So you see, what I need is access to the set's .add() function,
something like
map.get(identifier).add(lineNumber);
I know that doesn't work, but how do I do it?
I mean, if there is an easier way to do what I'm trying to do, I'd be happy to do that instead, but I would still like to know how to do it this way, just for learning and experience and all that.
Consider the following logic (I assume the input words are in an array):
TreeMap<String, TreeSet<Integer>> index = new TreeMap<String, TreeSet<Integer>>();
for (int pos = 0; pos < input.length; pos++) {
    String word = input[pos];
    TreeSet<Integer> wordPositions = index.get(word);
    if (wordPositions == null) {
        wordPositions = new TreeSet<Integer>();
        index.put(word, wordPositions);
    }
    wordPositions.add(pos);
}
This results in the index you need, which maps from strings to the set of positions where the string appears. Depending on your specific needs, the outer/inner data structure can be changed to HashMap/HashSet respectively.
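On Java 8 and later, the null check can be folded into a single call with computeIfAbsent; a sketch of the same loop:
for (int pos = 0; pos < input.length; pos++) {
    // Creates the TreeSet the first time a word is seen, then reuses it.
    index.computeIfAbsent(input[pos], k -> new TreeSet<Integer>()).add(pos);
}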
Why not use a Map from String to List<Integer> (generics need the wrapper type Integer rather than int), something like:
Map<String, List<Integer>> map = new HashMap<String, List<Integer>>();
Then, whenever you get a word, check whether it already exists in the Map. If it does, add the line number to the List; if not, create a new entry in the Map for the given word and line number.
if (map.get(word) != null) {
    map.get(word).add(line);
} else {
    final List<Integer> list = new ArrayList<Integer>();
    list.add(line);
    map.put(word, list);
}
If I understand correctly, you want to have a TreeMap with each key referring to a TreeSet that stores the line numbers on which the key has appeared. It is definitely doable, and the implementation is quite simple. I am not sure why your map.get(identifier).add(lineNumber); is not working. This is how I would do it:
TreeMap<String, TreeSet<Integer>> map = new TreeMap<String, TreeSet<Integer>>();
TreeSet<Integer> set = new TreeSet<Integer>();
set.add(1234);
map.put("hello", set);
map.get("hello").add(123);
It all works fine.
The only reason your construct won't work is because the result of map.get(identifier) can be null. Personally, I like the lazy-initialization solution that @EyalSchneider answered with. But there is an alternative if you know all your identifiers ahead of time: for example, if you preload your Map with all known English words. Then you can do something like:
for (String word : allEnglishWords) {
    map.put(word, new LinkedList<Integer>());
}
for (int pos = 0; pos < input.length; pos++) {
    String word = input[pos];
    map.get(word).add(pos);
}

Huge Hashtable sorting - number of values - 553685

I created a hashtable to store the occurrences of words across multiple files, around 10,000 text files. Then I wanted to sort them from the hashtable and print the top 10 words. The hashtable is defined as,
Hashtable<String, Integer> problem1Counter = new Hashtable<String, Integer>();
When I kept the files to around 1,000, I was able to get the top ten words using a simple sort like this,
String[] keysProblem1 = (String[]) problem1Counter.keySet().toArray(new String[0]);
Integer[] valuesProblem1 = (Integer[]) problem1Counter.values().toArray(new Integer[problem1Counter.size()]);
int kk = 0;
String ii = null;
for (int jj = 0; jj < valuesProblem1.length; jj++) {
    for (int bb = 0; bb < valuesProblem1.length; bb++) {
        if (valuesProblem1[jj] < valuesProblem1[bb]) {
            kk = valuesProblem1[jj];
            ii = keysProblem1[jj];
            valuesProblem1[jj] = valuesProblem1[bb];
            keysProblem1[jj] = keysProblem1[bb];
            valuesProblem1[bb] = kk;
            keysProblem1[bb] = ii;
        }
    }
}
The above method stops working when the hashtable has more than 553,685 values. Can anyone suggest and show a better method to sort them? I'm a newbie to Java, but I have worked in ActionScript, so I'm reasonably comfortable.
Thanks.
Your problem starts when you split up the keys and values and try to keep the entries at each index connected yourself. Instead, keep them coupled, and sort the Map.Entry objects Java gives you.
I'm not sure this compiles, but it should give you a start.
// HashMap and Hashtable are very similar, but I generally use HashMap.
HashMap<String, Integer> answers = ...
// Get the Key/Value pairs into a list so we can sort them.
List<Map.Entry<String, Integer>> listOfAnswers =
        new ArrayList<Map.Entry<String, Integer>>(answers.entrySet());
// Our comparator defines how to sort our Key/Value pairs. We sort by the
// highest value, and don't worry about the key.
java.util.Collections.sort(listOfAnswers,
        new Comparator<Map.Entry<String, Integer>>() {
            public int compare(
                    Map.Entry<String, Integer> o1,
                    Map.Entry<String, Integer> o2) {
                return o2.getValue() - o1.getValue();
            }
        });
// The list is now sorted.
System.out.println(String.format("Top 3:\n%s: %d\n%s: %d\n%s: %d",
        listOfAnswers.get(0).getKey(), listOfAnswers.get(0).getValue(),
        listOfAnswers.get(1).getKey(), listOfAnswers.get(1).getValue(),
        listOfAnswers.get(2).getKey(), listOfAnswers.get(2).getValue()));
For a better way of doing the sort, I'd do it like this:
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

public class Main {
    /**
     * @param args
     */
    public static void main(String[] args) {
        HashMap<String, Integer> counter = new HashMap<String, Integer>();
        // [... Code to populate hashtable goes here ...]

        // Extract the map as a list
        List<Map.Entry<String, Integer>> entries = new ArrayList<Map.Entry<String, Integer>>(counter.entrySet());
        // Sort the list of entries.
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Entry<String, Integer> first, Entry<String, Integer> second) {
                // This will give a *positive* value if first freq < second freq, zero if they're
                // equal, negative if first > second. The result is a highest-frequency-first sort.
                return second.getValue() - first.getValue();
            }
        });
        // And display the results
        for (Map.Entry<String, Integer> entry : entries.subList(0, 10))
            System.out.println(String.format("%s: %d", entry.getKey(), entry.getValue()));
    }
}
Edit explaining why this works
Your original algorithm looks like a variant of selection sort, which is an O(n^2) algorithm. Your variant does a lot of extra swapping too, so it is quite slow.
Being O(n^2), if you multiply your problem size by 10, it will typically take 100 times longer to run. Sorting half a million elements needs about 250 billion comparisons, many of which will lead to a swap.
The built-in sort algorithm behind Collections#sort is a lightning-fast variant of merge sort, which runs in O(n log n) time. That means that every time you multiply the problem size by 10, it only takes about 30 times as long. Sorting half a million elements only needs about 10 million comparisons.
This is why experienced developers will advise you to use library functions whenever possible. Writing your own sort algorithms can be great for learning, but it takes a lot of work to implement one as fast and flexible as what's in the library.
- Create an inner class Word that implements Comparable.
- Override public int compareTo(Word w) to make it use occurrences.
- Create an array of Words of the size of your HashMap.
- Fill the array by iterating through the HashMap.
- Call Arrays.sort on the array; a sketch follows below.
Alternatively, since you only need the top 10, you can just iterate through your Words and maintain a top-10 list as you go along.
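A minimal sketch of that inner class (the name Word and its fields come from the question; the alphabetical tie-break is an added assumption to give equal counts a stable order):
class Word implements Comparable<Word> {
    final String text;
    final int occurrences;

    Word(String text, int occurrences) {
        this.text = text;
        this.occurrences = occurrences;
    }

    @Override
    public int compareTo(Word other) {
        // Highest occurrence count first; alphabetical for equal counts.
        int byCount = Integer.compare(other.occurrences, this.occurrences);
        return byCount != 0 ? byCount : this.text.compareTo(other.text);
    }
}
Fill a Word[] from the map entries, call Arrays.sort on it, and the first ten elements are your top ten.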

How do I utilize hashtables to hold words and frequency of use?

I am so confused right now. I am supposed to write a program that uses a hashtable. The hashtable holds words along with their frequency of use. The class "Word" holds a counter and the string. If the word is already in the table, then its frequency increases. I have been researching how to do this but am just lost. I need to be pointed in the right direction. Any help would be great.
Hashtable<String, Word> words = new Hashtable<String, Word>();

public void addWord(String s) {
    if (words.containsKey(s)) {
        words.get(s).plusOne();
    } else {
        words.put(s, new Word(s));
    }
}
This will do it.
Hashtable would be an unusual choice for any new Java code these days. I assume this is some kind of exercise.
I would be slightly concerned by any exercise that hadn't been updated to use newer mechanisms.
HashMap will give you better performance than Hashtable in any single threaded scenario.
But as Emmanuel Bourg points out, a Bag will do all of this for you without needing the Word class at all: just add String objects to the Bag, and the bag will automatically keep the count for you.
Anyway, you're being asked to use a Map, and a map lets you find things quickly by using a key. The key can be any Object, and Strings are very commonly used: they are immutable and have good implementations of hashCode and equals, which make them ideal keys.
The javadoc for Map talks about how you use maps. Hashtable is one implementation of this interface, though it isn't a particularly good one.
You need a good key to let you find existing Word objects quickly, so that you can increment the counter. While you could make the Word object itself into the key, you would have some work to do: better is to use the String that the Word contains as the key.
You find whether the Word is already in the map by looking for the value object that has the String as its key.
You'd better use a Bag, it keeps the count of each element:
http://commons.apache.org/collections/api-release/org/apache/commons/collections/Bag.html
This piece of code should solve your problem:
Hashtable<String, Word> myWords = new Hashtable<String, Word>();
myWords.put("test", new Word("test"));
myWords.put("anotherTest", new Word("anotherTest"));
String inputWord = "test";
if (myWords.containsKey(inputWord)) {
    myWords.get(inputWord).setCounter(myWords.get(inputWord).getCounter() + 1);
}
Given that the class Word has a counter and a string, I'd use a HashMap<String, Word>. If your input is an array of Strings, you can accomplish something like this by using:
public Map<String, Word> getWordCount(String[] input) {
    Map<String, Word> output = new HashMap<String, Word>();
    for (String s : input) {
        Word w = output.get(s);
        if (w == null) {
            w = new Word(s, 0);
        }
        w.incrementValue(); // Or w = new Word(s, w.getCount() + 1) if you have no such function
        output.put(s, w);
    }
    return output;
}

Java - Optimize finding a string in a list

I have an ArrayList of objects where each object contains a string 'word' and a date. I need to check whether the date has passed for a list of 500 words. The ArrayList could contain up to a million words and dates. I store the dates as integers, so the problem I have is finding the word I am looking for in the ArrayList.
Is there a way to make this faster? In Python I have a dict, and mWords['foo'] is a simple lookup without looping through the whole million items in the mWords array. Is there something like this in Java?
for (int i = 0; i < mWords.size(); i++) {
    if ( word == mWords.get(i).word ) {
        return mWords.get(i);
    }
}
If the words are unique, then use a HashMap. I mean, {"a", 1}, {"b", 2}:
Map<String, Integer> wordsAndDates = new HashMap<String, Integer>();
wordsAndDates.put("a", 1);
wordsAndDates.put("b", 2);
and wordsAndDates.get("a") return 1
If not you shouldn't use HashMap because it overrides previous value. I mean
wordsAndDates.put("a", 1);
wordsAndDates.put("b", 2);
wordsAndDates.put("a", 3);
and wordsAndDates.get("a") return 3
In such case you can use ArrayList and search in it
If you're not stuck with an ArrayList, you should use some kind of hash-based data structure. In this case a HashMap should fit nicely (it's pretty close to Python's dict). This will give you O(1) lookup time, compared to your current method of linear search.
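As a sketch of that (the WordEntry class and its fields are illustrative stand-ins for the asker's word-plus-date objects):
// Build the index once: O(n). Each lookup afterwards is O(1) on average.
Map<String, WordEntry> index = new HashMap<String, WordEntry>(mWords.size());
for (WordEntry entry : mWords) {
    index.put(entry.word, entry);
}
WordEntry hit = index.get("foo"); // the Java analogue of mWords['foo']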
You want to use a Map in Java
Map<String, Integer> mWords = new HashMap<String, Integer>();
mWords.put("foo", 112345);
What about Collections.binarySearch() (NB: the list must be sorted), if you are stuck with the ArrayList?
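A sketch of that route, with an illustrative WordEntry type standing in for the list's element class:
// Sort once by the word field, then each lookup is O(log n).
Comparator<WordEntry> byWord = new Comparator<WordEntry>() {
    public int compare(WordEntry a, WordEntry b) {
        return a.word.compareTo(b.word);
    }
};
Collections.sort(mWords, byWord);
int i = Collections.binarySearch(mWords, new WordEntry("foo", 0), byWord); // probe object, constructor assumed
WordEntry hit = (i >= 0) ? mWords.get(i) : null;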
