How to count the number of occurrences of each word? - java

If I have an article or a novel in English and I want to count how many times each word appears, what is the fastest algorithm, written in Java?
Some people said you can use a Map<String, Integer>() to do this, but I was wondering: how do I know what the key words are? Every article has different words, so how do you know the "key" words and then add one to their counts?
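In short, the words themselves become the keys: the first time you see a word you put it into the map with a count of 1, and every further time you look it up and add one. A minimal sketch (assuming the whole text is already in a String called text):
Map<String, Integer> counts = new HashMap<String, Integer>();
for (String word : text.split("\\W+")) {   // each distinct word becomes its own key
    Integer old = counts.get(word);
    counts.put(word, old == null ? 1 : old + 1);
}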

Here is another way to do it with the things that appeared in Java 8:
private void countWords(final Path file) throws IOException {
    Arrays.stream(new String(Files.readAllBytes(file), StandardCharsets.UTF_8).split("\\W+"))
        .collect(Collectors.groupingBy(Function.<String>identity(), TreeMap::new, counting()))
        .entrySet()
        .forEach(System.out::println);
}
So what is it doing?
It reads a text file completely into memory, into a byte array to be more precise: Files.readAllBytes(file). This method appeared in Java 7 and loads files very fast, at the price of holding the whole file in memory, which costs a lot of memory. For speed, however, this is a good approach.
The byte[] is converted to a String: new String(Files.readAllBytes(file), StandardCharsets.UTF_8), assuming that the file is UTF-8 encoded. Change this to suit your needs. The price is a full in-memory copy of the already huge piece of data. It may be faster to work with a memory-mapped file instead.
The string is split at non-word characters: ...split("\\W+"), which creates an array of strings with all your words.
We create a stream from that array: Arrays.stream(...). This by itself does not do very much, but we can do a lot of fun things with the stream.
We group all the words together: Collectors.groupingBy(Function.<String>identity(), TreeMap::new, counting()). This means:
We want to group the words by the words themselves (identity()). We could also, for example, lowercase the string here first if you want the grouping to be case insensitive (see the variant sketch after these steps). This will end up being the key in a map.
As the structure for storing the grouped values we want a TreeMap (TreeMap::new). TreeMaps are sorted by their keys, so we can easily output the result in alphabetical order in the end. If you do not need sorting, you could also use a HashMap here.
As the value for each group we want the number of occurrences of each word (counting()). In the background that means that for each word we add to a group, we increase its counter by one.
From step 5 we are left with a Map that maps words to their count. Now we just want to print them. So we access a collection with all the key/value pairs in this map (.entrySet()).
Finally the actual printing. We say that each element should be passed to the println method: .forEach(System.out::println). And now you are left with a nice list.
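For example, the case-insensitive grouping mentioned above might look like this (a variant sketch, not part of the benchmark below; it additionally needs java.util.Locale):
private void countWordsIgnoringCase(final Path file) throws IOException {
    Arrays.stream(new String(Files.readAllBytes(file), StandardCharsets.UTF_8).split("\\W+"))
        .collect(Collectors.groupingBy(word -> word.toLowerCase(Locale.ROOT), TreeMap::new, counting()))
        .forEach((word, count) -> System.out.println(word + "=" + count));
}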
So how good is this answer? The upside is that it is very short and thus highly expressive. It also gets by with only a single system call hidden behind Files.readAllBytes (or at least a fixed number; I am not sure if this really works with a single system call), and system calls can be a bottleneck. E.g. if you are reading a file from a stream, each call to read may trigger a system call. This is significantly reduced by using a BufferedReader that, as the name suggests, buffers. Still, readAllBytes should be fastest. The price for this is that it consumes huge amounts of memory. However, Wikipedia claims that a typical English book has 500 pages with 2,000 characters per page, which means roughly 1 megabyte, which should not be a problem in terms of memory consumption even if you are on a smartphone, a Raspberry Pi or a really, really old computer.
This solution does involve some optimizations that were not possible prior to Java 8. For example, the idiom map.put(word, map.get(word) + 1) requires the word to be looked up twice in the map, which is an unnecessary waste.
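To make the difference concrete, here is the same update written both ways (a small sketch; wordCounts is assumed to be a Map<String, Integer>):
// pre-Java 8 idiom: two lookups per word (get, then put)
Integer oldCount = wordCounts.get(word);
wordCounts.put(word, oldCount == null ? 1 : oldCount + 1);
// Java 8: a single lookup via merge
wordCounts.merge(word, 1, Integer::sum);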
But a simple loop might also be easier for the compiler to optimize and might save a number of method calls. So I wanted to know, and put this to a test. I generated a file using:
[ -f /tmp/random.txt ] && rm /tmp/random.txt; for i in {1..15}; do head -n 10000 /usr/share/dict/american-english >> /tmp/random.txt; done; perl -MList::Util -e 'print List::Util::shuffle <>' /tmp/random.txt > /tmp/random.tmp; mv /tmp/random.tmp /tmp/random.txt
Which gives me a file of about 1.3 MB, so not that untypical for a book, with most words being repeated 15 times but in random order, so that this does not end up being a branch prediction test. Then I ran the following tests:
public class WordCountTest {

    @Test(dataProvider = "provide_description_testMethod")
    public void test(String description, TestMethod testMethod) throws Exception {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 100_000; i++) {
            testMethod.run();
        }
        System.out.println(description + " took " + (System.currentTimeMillis() - start) / 1000d + "s");
    }

    @DataProvider
    public Object[][] provide_description_testMethod() {
        Path path = Paths.get("/tmp/random.txt");
        return new Object[][]{
            {"classic", (TestMethod)() -> countWordsClassic(path)},
            {"mixed",   (TestMethod)() -> countWordsMixed(path)},
            {"mixed2",  (TestMethod)() -> countWordsMixed2(path)},
            {"stream",  (TestMethod)() -> countWordsStream(path)},
            {"stream2", (TestMethod)() -> countWordsStream2(path)},
        };
    }
    private void countWordsClassic(final Path path) throws IOException {
        final Map<String, Integer> wordCounts = new HashMap<>();
        for (String word : new String(readAllBytes(path), StandardCharsets.UTF_8).split("\\W+")) {
            Integer oldCount = wordCounts.get(word);
            if (oldCount == null) {
                wordCounts.put(word, 1);
            } else {
                wordCounts.put(word, oldCount + 1);
            }
        }
    }
    private void countWordsMixed(final Path path) throws IOException {
        final Map<String, Integer> wordCounts = new HashMap<>();
        for (String word : new String(readAllBytes(path), StandardCharsets.UTF_8).split("\\W+")) {
            wordCounts.merge(word, 1, (oldCount, one) -> oldCount + 1); // merge's function receives (oldValue, newValue)
        }
    }

    private void countWordsMixed2(final Path path) throws IOException {
        final Map<String, Integer> wordCounts = new HashMap<>();
        Pattern.compile("\\W+")
            .splitAsStream(new String(readAllBytes(path), StandardCharsets.UTF_8))
            .forEach(word -> wordCounts.merge(word, 1, (oldCount, one) -> oldCount + 1));
    }
    private void countWordsStream2(final Path tmpFile) throws IOException {
        Pattern.compile("\\W+").splitAsStream(new String(readAllBytes(tmpFile), StandardCharsets.UTF_8))
            .collect(Collectors.groupingBy(Function.<String>identity(), HashMap::new, counting()));
    }

    private void countWordsStream(final Path tmpFile) throws IOException {
        Arrays.stream(new String(readAllBytes(tmpFile), StandardCharsets.UTF_8).split("\\W+"))
            .collect(Collectors.groupingBy(Function.<String>identity(), HashMap::new, counting()));
    }

    interface TestMethod {
        void run() throws Exception;
    }
}
The results were:
approach   time     diff
classic    4665s    +9%
mixed      4273s    +0%
mixed2     4833s    +13%
stream     4868s    +14%
stream2    5070s    +19%
Note that I previously also tested with TreeMaps, but found that HashMaps were much faster, even when I sorted the output afterwards. I also changed the tests above after Tagir Valeev told me in the comments below about the Pattern.splitAsStream() method. Since I got strongly varying results, I let the tests run for quite a while, as you can see from the times in seconds above, to get meaningful results.
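For reference, sorting a HashMap's output afterwards only needs one extra pass over the entries, e.g. (a sketch over one of the wordCounts maps above, not part of the timed code):
wordCounts.entrySet().stream()
    .sorted(Map.Entry.comparingByKey())
    .forEach(System.out::println);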
How I judge the results:
The "mixed" approach which does not use streams at all, but uses the "merge" method with callback introduced in Java 8 does improve the performance. This is something I expected because the classic get/put appraoch requires the key to be looked up twice in the HashMap and this is not required anymore with the "merge"-approach.
To my suprise the Pattern.splitAsStream() appraoch is actually slower compared to Arrays.asStream(....split()). I did have a look at the source code of both implementations and I noticed that the split() call saves the results in an ArrayList which starts with a size of zero and is enlarged as needed. This requires many copy operations and in the end another copy operation to copy the ArrayList to an array. But "splitAsStream" actually creates an iterator which I thought can be queried as needed avoiding these copy operations completely. I did not quite look through all the source that converts the iterator to a stream object, but it seems to be slow and I don't know why. In the end it theoretically could have to do with CPU memory caches: If exactly the same code is executed over and over again the code will more likely be in the cache then actually running on large function chains, but this is a very wild speculation on my side. It may also be something completely different. However splitAsStream MIGHT have a better memory footprint, maybe it does not, I did not profile that.
The stream approach in general is pretty slow. This is not totally unexpected because quite a number of method invocations take place, including for example something as pointless as Function.identity. However I did not expect the difference at this magnitude.
As an interesting side note I find the mixed approach which was fastest quite well to read and understand. The call to "merge" does not have the most ovbious effect to me, but if you know what this method is doing it seems most readable to me while at the same time the groupingBy command is more difficult to understand for me. I guess one might be tempted to say that this groupingBy is so special and highly optimised that it makes sense to use it for performance but as demonstrated here, this is not the case.

Map<String, Integer> countByWords = new HashMap<String, Integer>();
Scanner s = new Scanner(new File("your_file_path"));
while (s.hasNext()) {
    String next = s.next();
    Integer count = countByWords.get(next);
    if (count != null) {
        countByWords.put(next, count + 1);
    } else {
        countByWords.put(next, 1);
    }
}
s.close();
This counts "I'm" as only one word.

General overview of steps:
Create a HashMap<String, Integer>
Read the file one word at a time. If the word doesn't exist in your HashMap, add it with a count value of 1. If it exists, increment its value by 1. Read till the end of the file (see the sketch below).
This will result in a set of all your words and the count for each word.
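A minimal sketch of these steps (the file path is a placeholder; getOrDefault needs Java 8):
Map<String, Integer> counts = new HashMap<>();
try (Scanner scanner = new Scanner(new File("your_file_path"))) {
    while (scanner.hasNext()) {
        String word = scanner.next();
        counts.put(word, counts.getOrDefault(word, 0) + 1);  // add with 1, or increment the existing count
    }
}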

If I were you, I would use one of the implementations of Map<String, Integer> (generics require the wrapper type Integer rather than the primitive int), such as a HashMap. Then, as you loop through each word, if it already exists just increment its Integer by one, otherwise add it to the map. At the end you can pull out all of the words, or query the map for a specific word to get its count.
If order is important to you, you could try a SortedMap<String, Integer>, such as a TreeMap, to be able to print them out in alphabetical order.
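A small sketch of that idea, assuming the words are already in an array called words:
Map<String, Integer> counts = new TreeMap<String, Integer>();  // TreeMap keeps the keys in alphabetical order
for (String word : words) {
    Integer old = counts.get(word);
    counts.put(word, old == null ? 1 : old + 1);
}
for (Map.Entry<String, Integer> entry : counts.entrySet()) {
    System.out.println(entry.getKey() + ": " + entry.getValue());
}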
Hope that helps!

It is actually the classic word-count algorithm.
Here is the solution:
public Map<String, Integer> wordCount(String[] strings) {
    Map<String, Integer> map = new HashMap<String, Integer>();
    int count = 0;
    for (String s : strings) {
        if (map.containsKey(s)) {
            count = map.get(s);
            map.put(s, count + 1);
        } else {
            map.put(s, 1);
        }
    }
    return map;
}

Here is my solution:
Map<String, Integer> map = new HashMap<String, Integer>();
int count = 0;
for (int i = 0; i < strings.length; i++) {
    for (int j = 0; j < strings.length; j++) {
        if (strings[i].equals(strings[j])) {  // use equals(), not ==, to compare string contents
            count++;
        }
    }
    map.put(strings[i], count);
    count = 0;
}
return map;

Related

Split List<String> into small concatenated List<String> java 8 parallel execution

So I have 2000 records of company names. I would like to take the first 50 names, concatenate them, save the result as a string, and then append it to a new List.
Does anyone have any idea how we can achieve this using Java 8?
Can this be done using the parallel stream API?
Currently I'm iterating over the 2k records and appending the data to a StringBuilder. After every 50th record I add the StringBuilder's content to a list and create a new StringBuilder. Finally I get a list with all the data.
Example: a1, a2, ... till a2000
Final output: List of Strings with
1st entry -> concatenation of a1 to a50,
2nd entry -> concatenation of a51 to a100
Code:
List<String> bulkEmails; // fetched from DB
int count = 0;
List<String> splitEmails = new ArrayList<>(); // final output
StringBuilder builder = new StringBuilder();  // temp builder
for (String mail : bulkEmails) {
    builder.append(mail).append(",");
    count++;
    // after every 50th mail, append the concatenated chunk to the final output and reset the counter
    if (count == 50) {
        splitEmails.add(builder.toString());
        builder = new StringBuilder();
        count = 0;
    }
}
Suggestions are appreciated.
As I've said in the comments, 2K isn't really a massive amount of data.
And there's an issue with executing this task in parallel.
Let's imagine this task of splitting the data into groups of 50 elements is running in parallel somehow, and the first worker thread is assigned a chunk of data containing 523 elements, the second 518 elements, and so on, with none of these chunks being a multiple of 50. As a consequence, each thread would produce a partial result whose last group has a size that differs from 50.
Depending on how you want to deal with such cases, there are different approaches to implementing this functionality:
Join partial results as is. This implies that the final result would contain an arbitrary number of groups with sizes in the range [1,49]. That is the simplest option to implement and the cheapest to execute. But note that there could even be a case where every resulting group is smaller than 50, since you're not in control of the spliterator implementation (i.e. you can't dictate how large a chunk of data a particular thread is assigned to work with). Regardless of how strict or lenient your requirements are, that doesn't sound very nice.
The second option requires reordering the elements. While joining the partial results produced by each thread in parallel, we can merge the two last groups produced by every thread to ensure that at most one group in the final result differs in size from 50.
If you're not OK with either joining partial results as is or reordering the elements, that implies this task isn't suitable for parallel execution, because when the first thread produces a partial result ending with a group smaller than 50, all the groups created by the second thread need to be rearranged. That results in worse performance in parallel because the same work is done twice.
The second thing we need to consider is that the operation of creating groups requires maintaining state. Therefore, the right place for this transformation is inside a collector, where the stream is being consumed and the collector's mutable container gets updated.
Let's start with implementing a Collector which ignores the issue described above and joins partial results as is.
For that we can use the static method Collector.of().
public static <T> Collector<T, ?, Deque<List<T>>> getGroupCollector(int groupSize) {
    return Collector.of(
        ArrayDeque::new,
        (Deque<List<T>> deque, T next) -> {
            if (deque.isEmpty() || deque.getLast().size() == groupSize) deque.add(new ArrayList<>());
            deque.getLast().add(next);
        },
        (left, right) -> {
            left.addAll(right);
            return left;
        }
    );
}
Now let's implement a Collector which merges the two last groups produced in different threads (option 2 in the list above):
public static <T> Collector<T, ?, Deque<List<T>>> getGroupCollector(int groupSize) {
    return Collector.of(
        ArrayDeque::new,
        (Deque<List<T>> deque, T next) -> {
            if (deque.isEmpty() || deque.getLast().size() == groupSize) deque.add(new ArrayList<>());
            deque.getLast().add(next);
        },
        (left, right) -> {
            if (left.peekLast().size() < groupSize) {
                List<T> leftLast = left.pollLast();
                List<T> rightLast = right.peekLast();
                int llSize = leftLast.size();
                int rlSize = rightLast.size();
                if (rlSize + llSize <= groupSize) {
                    rightLast.addAll(leftLast);
                } else {
                    rightLast.addAll(leftLast.subList(0, groupSize - rlSize));
                    right.add(new ArrayList<>(leftLast.subList(groupSize - rlSize, llSize)));
                }
            }
            left.addAll(right);
            return left;
        }
    );
}
If you wish to implement a Collector whose combiner function rearranges the partial results if needed (the very last option in the list), I'm leaving it to the OP/reader as an exercise.
Now let's use the collector defined above (the latter one). Consider a stream of single-letter strings representing the characters of the English alphabet, and let the group size be 5.
public static void main(String[] args) {
    List<String> groups = IntStream.rangeClosed('A', 'Z')
        .mapToObj(ch -> String.valueOf((char) ch))
        .collect(getGroupCollector(5))
        .stream()
        .map(group -> String.join(",", group))
        .collect(Collectors.toList());

    groups.forEach(System.out::println);
}
Output:
A,B,C,D,E
F,G,H,I,J
K,L,M,N,O
P,Q,R,S,T
U,V,W,X,Y
Z
That's the sequential result. If you switch the stream from sequential to parallel, the contents of the groups would probably change, but the sizes would not be affected.
Also note that since there are effectively two independent streams chained together, you need to apply parallel() twice to make the whole thing run in parallel.
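For illustration, applying parallel() to both stages of the example above might look like this (a sketch; whether it pays off for such a small input is another question):
List<String> groups = IntStream.rangeClosed('A', 'Z').parallel()   // first stream in parallel
    .mapToObj(ch -> String.valueOf((char) ch))
    .collect(getGroupCollector(5))
    .stream().parallel()                                            // second stream in parallel
    .map(group -> String.join(",", group))
    .collect(Collectors.toList());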

Need to use concurrency in this Java method

So I have the following code that takes two arrays as input, applies some queries to match elements from Array1 with elements from Array2, and then returns the number of elements that are similar in the two ArrayLists.
Here is the code I use:
public static void get_ND_Matches() throws IOException {
    @SuppressWarnings("rawtypes")
    List<String> array1 = new ArrayList<String>();
    List<String> array2 = new ArrayList<String>();
    array1 = new ArrayList<String>(ClassesRetrieval.getDBpediaClasses());
    array2 = new ArrayList<String>(ClassesRetrieval.fileToArrayListYago());
    String maxLabel = "";
    HashMap<String, Integer> map = new HashMap<String, Integer>();
    int number;
    HashMap<String, ArrayList<String>> theMap = new HashMap<>();

    for (String yagoClass : array2) {
        theMap.put(yagoClass, getListTwo(yagoClass));
        System.out.println("Done for : " + yagoClass);
    }

    for (String dbClass : array1) {
        ArrayList<String> result = get_2D_Matches(dbClass);
        for (Map.Entry<String, ArrayList<String>> entry : theMap.entrySet()) {
            String yagoClass = entry.getKey();
            Set<String> IntersectionSet = Sets.intersection(Sets.newHashSet(result), Sets.newHashSet(entry.getValue()));
            System.out.println(dbClass + " and " + yagoClass + " = " + IntersectionSet.size());
            number = IntersectionSet.size();
            map.put(yagoClass, number);
        }
        int maxValue = (Collections.max(map.values()));
        for (Entry<String, Integer> entry : map.entrySet()) {
            if (entry.getValue() == maxValue && maxValue != 0) {
                maxLabel = entry.getKey();
            }
            if (maxValue == 0) {
                maxLabel = "Nothing in yago";
            }
        }
        System.out.println("-------------------------------");
        System.out.println(dbClass + " from DBPEDIA Corresponds to " + maxLabel);
        System.out.println("-------------------------------");
    }
}
This code returns for example:
Actor from DBPEDIA Corresponds to Yago_Actor
Album from DBPEDIA Corresponds to Yago_Album
SomeClass from DBPEDIA Corresponds to nothing in Yago
Etc..
Behind the scenes, this code uses getDBpediaClasses and then applies the get_2D_Matches() method to get an ArrayList of results for each class. Each resulting ArrayList is then compared to another ArrayList generated by getListTwo() for each class of fileToArrayListYago().
Now, because of all the calculations made in the background (there are millions of elements in each array), this process takes hours to execute.
I would really like to use concurrency/multithreading to solve this issue. Could anyone show me how to do that?
It makes little sense to parallelize code which is not perfectly clean and optimized. You may get a factor of 4 on a typical 4-core CPU, or nothing at all, depending on whether you pick the part to parallelize properly. Using a better algorithm may gain you much more.
It's possible that the bottleneck is get_2D_Matches, which you haven't published.
Computing the maximum directly instead of creating a throw-away HashMap<String,Integer> map could save quite some time, and so could moving Sets.newHashSet(result) out of the loop.
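For example, the inner loop could track the best match directly instead of filling a map and scanning it again (a sketch based on the variable names above; resultSet is hoisted out of the loop):
Set<String> resultSet = Sets.newHashSet(result);   // built once per dbClass instead of once per yagoClass
String maxLabel = "Nothing in yago";
int maxValue = 0;
for (Map.Entry<String, ArrayList<String>> entry : theMap.entrySet()) {
    int size = Sets.intersection(resultSet, Sets.newHashSet(entry.getValue())).size();
    if (size > maxValue) {
        maxValue = size;
        maxLabel = entry.getKey();
    }
}
System.out.println(dbClass + " from DBPEDIA Corresponds to " + maxLabel);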
You should really reconsider variable naming. With names like map, theMap, and result (for something which is not the method's result), it's hard to find out what's going on.
If you really want to parallelize it, you need to split your overlong method first. Then it's rather simple, as the processing of each dbClass can be done independently: just encapsulate it as a Callable and submit it to an ExecutorService.
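A rough sketch of that idea (findBestMatch is a hypothetical helper that would hold the body of the existing per-dbClass loop and return the best yago label):
ExecutorService executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
List<Future<String>> futures = new ArrayList<>();
for (String dbClass : array1) {
    // each dbClass becomes one independent task
    futures.add(executor.submit(() -> dbClass + " from DBPEDIA Corresponds to " + findBestMatch(dbClass, theMap)));
}
for (Future<String> future : futures) {
    System.out.println(future.get()); // get() blocks and may throw checked exceptions that need handling
}
executor.shutdown();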
However, I'd suggest cleaning up the code first, then submitting it to CR, and then considering parallelizing it.

Counting occurrences of words in an array

I've been working on something which takes a stream of characters, forms words, makes an array of the words, then creates a vector which contains each unique word and the number of times it occurs (basically a word counter).
Anyway, I've not used Java in a long time, or done much programming to be honest, and I'm not happy with how this currently looks. The part that builds the vector looks ugly to me, and I wanted to know if I could make it less messy.
int counter = 1;
Vector<Pair<String, Integer>> finalList = new Vector<Pair<String, Integer>>();
Pair<String, Integer> wordAndCount = new Pair<String, Integer>(wordList.get(1), counter); // wordList contains " " as first word, starting at wordList.get(1) skips it.
for (int i = 1; i < wordList.size(); i++) {
    if (wordAndCount.getLeft().equals(wordList.get(i))) {
        wordAndCount = new Pair<String, Integer>(wordList.get(i), counter++);
    }
    else if (!wordAndCount.getLeft().equals(wordList.get(i))) {
        finalList.add(wordAndCount);
        wordAndCount = new Pair<String, Integer>(wordList.get(i), counter = 1);
    }
}
finalList.add(wordAndCount); // UGLY!!
As a secondary question, this gives me a vector with all the words in alphabetical order (as in the array). I want to have it sorted by occurrence, then alphabetically within that.
Would the best option be:
Iterate down the vector, testing each occurrence count against the one above it, using Collections.swap() if it is higher, then checking the next one above (as it has now moved up one), and so on until it's no longer larger than anything above it. Any occurrence of 1 could be skipped.
Iterate down the vector again, testing each element against the first element of the vector, then iterating downwards until the number of occurrences is lower, and inserting it above that element. All occurrences of 1 would once again be skipped.
The first method would do more in terms of iterating over the elements, but the second one requires you to add and remove components of the vector (I think?), so I don't know which is more efficient, or whether it's worth considering.
Why not use a Map to solve your problem?
String[] words; // your incoming array of words.
Map<String, Integer> wordMap = new HashMap<String, Integer>();
for (String word : words) {
    if (!wordMap.containsKey(word))
        wordMap.put(word, 1);
    else
        wordMap.put(word, wordMap.get(word) + 1);
}
Sorting can be done using Java's sorted collections:
SortedMap<Integer, SortedSet<String>> sortedMap = new TreeMap<Integer, SortedSet<String>>();
for (Entry<String, Integer> entry : wordMap.entrySet()) {
    if (!sortedMap.containsKey(entry.getValue()))
        sortedMap.put(entry.getValue(), new TreeSet<String>());
    sortedMap.get(entry.getValue()).add(entry.getKey());
}
Nowadays you should leave the sorting to the language's libraries; they have been proven correct over the years.
Note that the code may use a lot of memory because of all the data structures involved, but that is what we pay for higher-level programming (and memory is getting cheaper every second).
I didn't run the code to verify that it works, but it does compile (I copied it directly from Eclipse).
re: sorting, one option is to write a custom Comparator which first examines the number of times each word appears, then (if equal) compares the words alphabetically.
private final class PairComparator implements Comparator<Pair<String, Integer>> {
    public int compare(Pair<String, Integer> p1, Pair<String, Integer> p2) {
        // compare by count first (descending, so the most frequent words come first);
        // this assumes the Pair class exposes the Integer via getRight(), matching getLeft() used above
        int byCount = p2.getRight().compareTo(p1.getRight());
        if (byCount != 0) {
            return byCount;
        }
        // counts are equal: fall back to alphabetical order of the words
        return p1.getLeft().compareTo(p2.getLeft());
    }
}
You'd then sort finalList by calling Collections.sort(finalList, new PairComparator());
How about using the Google Guava library?
Multiset<String> multiset = HashMultiset.create();
for (String word : words) {
    multiset.add(word);
}
int countFoo = multiset.count("foo");
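To list every distinct word with its count, you can iterate the multiset's entry set (a small sketch; Multiset.Entry is part of Guava):
for (Multiset.Entry<String> entry : multiset.entrySet()) {
    System.out.println(entry.getElement() + ": " + entry.getCount());
}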
From their javadocs:
A collection that supports order-independent equality, like Set, but may have duplicate elements. A multiset is also sometimes called a bag.
Simple enough?

How do I utilize hashtables to hold words and frequency of use?

I am so confused right now. I am supposed to write a program that uses a hashtable. The hashtable holds words along with their frequency of use. The class "Word" holds a counter and the string. If the word is already in the table then its frequency increases. I have been researching how to do this but am just lost. I need to be pointed in the right direction. Any help would be great.
Hashtable<String, Word> words = new Hashtable<String, Word>();

public void addWord(String s) {
    if (words.containsKey(s)) {
        words.get(s).plusOne();
    } else {
        words.put(s, new Word(s));
    }
}
This will do it.
Hashtable would be an unusual choice for any new Java code these days. I assume this is some kind of exercise.
I would be slightly concerned by any exercise that hadn't been updated to use newer mechanisms.
HashMap will give you better performance than Hashtable in any single threaded scenario.
But as Emmanuel Bourg points out, Bag will do all of this for you without needing the Word class at all: just add String objects to the Bag, and the bag will automatically keep count for you.
Anyway, you're being asked to use a Map, and a map lets you find things quickly by using a key. The key can be any Object, and Strings are very commonly used: they are immutable and have good implementations of hashCode and equals, which make them ideal keys.
The javadoc for Map talks about how you use maps. Hashtable is one implementation of this interface, though it isn't a particularly good one.
You need a good key to let you find existing Word objects quickly, so that you can increment the counter. While you could make the Word object itself into the key, you would have some work to do: better is to use the String that the Word contains as the key.
You find whether the Word is already in the map by looking for the value object that has the String as its key.
You'd better use a Bag, it keeps the count of each element:
http://commons.apache.org/collections/api-release/org/apache/commons/collections/Bag.html
This piece of code should solve your problem
Hashtable<String, Word> myWords = new Hashtable<String, Word>();
String inputWord = "test";
if (myWords.containsKey(inputWord)) {
    myWords.get(inputWord).setCounter(myWords.get(inputWord).getCounter() + 1);
} else {
    myWords.put(inputWord, new Word(inputWord));
}
Given that the class Word has a counter and a string, I'd use a HashMap<String, Word>. If your input is an array of Strings, you can accomplish something like this by using:
public Map<String, Word> getWordCount(String[] input) {
    Map<String, Word> output = new HashMap<String, Word>();
    for (String s : input) {
        Word w = output.get(s);
        if (w == null) {
            w = new Word(s, 0);
        }
        w.incrementValue(); // Or w = new Word(s, w.getCount() + 1) if you have no such function
        output.put(s, w);
    }
    return output;
}

Keeping a pair of primitives in a Java HashMap

I have a list of files. I would like to scan through them and keep a count of the number of files with the same size. The issue is with the file size, which is a long; as we know, a HashMap will take in only an object and not a primitive. So using new Long(filesize), I put it into the HashMap. Instead of getting pairs of (filesize, count), I got a list of (filesize, 1) due to the fact that each Long obj is unique.
How do I go about building this accumulator?
Any solution for 1.4.2?
You simply do it this way:
Map<Long, Integer> count = new HashMap<Long, Integer>();
for (File file : files) {
    long size = file.getTotalSpace();
    Integer n = count.get(size);
    if (n == null) {
        count.put(size, 1);
    } else {
        count.put(size, n + 1);
    }
}
There is some auto-boxing and unboxing going on here.
Instead of using new Long(size), you should use Long.valueOf(size). For small values that returns a cached Long reference, and it may also boost performance slightly (not that it will be visible unless you do millions of these operations).
ps. only works for java 1.5 or above
You can use Trove to store pairs (long,int) - TLongIntHashMap
or you could use AtomicInteger as a mutable integer.
Map<Long, AtomicInteger> count = new HashMap<Long, AtomicInteger>();
for (File file : files) {
    long size = file.length(); // getTotalSpace() gets the space consumed (e.g. a multiple of 8K) rather than the actual file size.
    AtomicInteger n = count.get(size);
    if (n == null) {
        count.put(size, new AtomicInteger(1));
    } else {
        n.getAndIncrement();
    }
}
Expanding on what cletus wrote.
His solution is fine, except it only stores each file size that you come across and the number of files that have that size. If you ever want to know which files those are, this data structure will be useless to you, so I don't think cletus's solution is quite complete. Instead I would do:
Map<Long, Collection<File>> count = new HashMap<Long, Collection<File>>();
for (File file : files) {
    long size = file.getTotalSpace();
    Collection<File> c = count.get(size);
    if (c == null) {
        c = new ArrayList<File>(); // or whatever collection you feel comfortable with
        count.put(size, c);
    }
    c.add(file);
}
Then you can get the number of files with c.size(), and you can easily iterate through all the files of a given size without having to run this procedure again.
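For example, to print each size together with how many files share it (a small sketch building on the map above):
for (Map.Entry<Long, Collection<File>> entry : count.entrySet()) {
    Collection<File> sameSize = entry.getValue();
    System.out.println(entry.getKey() + " bytes: " + sameSize.size() + " file(s)");
}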
I think there's more to this, and we'll need more details from you. I'm assuming you know there's definitely more than one file of a given size, otherwise I'd first check to see that that's the case. For all you know, you simply have a lot of files with unique file sizes.
You mentioned:
...due to the fact that each Long obj is unique.
I don't think this is the problem. While this may be true depending on how you are instantiating the Longs, it should not prevent HashMaps from behaving the way you want. As long as two key objects return the same hashCode() value and their equals() method says they are equal, your HashMap will not create another entry for them. In fact, it should not be possible for you to see "a list of (filesize, 1)" with the same filesize values (unless you wrote your own Long and failed to implement hashCode()/equals() correctly).
That said, Cletus' code should work if you're using Java 5 or higher; if you're using Java 1.4 or below, you'll need to either do your own boxing/unboxing manually, or look into Apache Commons Collections. Here's the pre-Java 5 version of Cletus' example:
Map count = new HashMap();
for (Iterator filesIter = files.iterator(); filesIter.hasNext();) {
    File file = (File) filesIter.next();
    Long size = new Long(file.length()); // length() here, since getTotalSpace() does not exist before Java 6
    Integer n = (Integer) count.get(size);
    if (n == null) {
        count.put(size, new Integer(1));
    } else {
        count.put(size, new Integer(n.intValue() + 1));
    }
}
