Java - Repeated word count in a large file

I want to find the repeated word count in a large file's content. Is there a best approach using the Java 8 Stream API?
Updated details:
The file format is comma-separated values and the file size is around 4 GB.

I don’t know if there’s a best approach, and it would also depend on the details you haven’t told us. For now I am assuming a text file with a number of words separated by spaces on each line. A possible approach would be:
Map<String, Long> result = Files.lines(filePath)
        .flatMap(line -> Stream.of(line.split(" ")))
        .collect(Collectors.groupingBy(word -> word, Collectors.counting()));
I think the splitting of each line into words needs to be refined; you will probably want to discard punctuation, for example. Take this as a starting point and develop it into what you need in your particular situation.
Edit: with thanks to @4castle for the inspiration, the splitting into words can be done this way if you prefer a method reference over a lambda:
Map<String, Long> result = Files.lines(filePath)
        .flatMap(Pattern.compile(" ")::splitAsStream)
        .collect(Collectors.groupingBy(word -> word, Collectors.counting()));
It produces the same result. Edit 2: nonsense about optimization deleted here.
Maybe we shouldn’t go too far here until we know the more exact requirement for delimiting words in each line.
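That said, since the update mentions a comma-separated 4 GB file, a minimal adaptation of the same pipeline, splitting on commas instead of spaces, might look like this (the trimming and the empty-field filter are assumptions about the actual data; the try-with-resources just makes sure the stream over such a large file gets closed):
Map<String, Long> result;
try (Stream<String> lines = Files.lines(filePath)) {
    result = lines
            .flatMap(line -> Stream.of(line.split(",")))   // split each CSV line on commas
            .map(String::trim)                             // drop whitespace around each value
            .filter(word -> !word.isEmpty())               // skip empty fields
            .collect(Collectors.groupingBy(word -> word, Collectors.counting()));
}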

If you already have a list of all the words, say List<String> words, then you can use something like:
Map<String, Integer> counts = words.parallelStream()
        .collect(Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum));
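If you then only need the words that actually repeat, you could filter that map afterwards, for example:
Map<String, Integer> repeated = counts.entrySet().stream()
        .filter(entry -> entry.getValue() > 1)             // keep only words seen more than once
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));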

You can approach the same problem in a different way: first count the number of words in the file (all words, including repeated ones). Then add all words to a Set (which does not allow duplicates) using a stream. Finally, subtract the size of the set from the total word count; that gives you the total number of repeated words.
Long totalWordCount = Files.lines(filePath)
        .flatMap(line -> Stream.of(line.split(" ")))
        .count();
Set<String> uniqueWords = Files.lines(filePath)
        .flatMap(line -> Stream.of(line.split(" ")))
        .collect(Collectors.toSet());
Long repetitiveWordCount = totalWordCount - (long) uniqueWords.size();
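Note that this streams the file twice. A rough single-pass variant of the same idea, reusing the groupingBy/counting collector shown earlier (the space-delimited assumption is unchanged), could be:
Map<String, Long> counts;
try (Stream<String> lines = Files.lines(filePath)) {          // close the stream when done
    counts = lines
            .flatMap(line -> Stream.of(line.split(" ")))
            .collect(Collectors.groupingBy(word -> word, Collectors.counting()));
}
long totalWordCount = counts.values().stream().mapToLong(Long::longValue).sum();
long repetitiveWordCount = totalWordCount - counts.size();    // total words minus distinct words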

Related

Reference antecedent Java stream step without breaking the stream pipeline?

I am new to functional programming, and I am trying to get better.
Currently, I am experimenting with some code that takes on the following basic form:
private static int myMethod(List<Integer> input) {
    Map<Integer, Long> freq = input
            .stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    return (int) freq
            .keySet()
            .stream()
            .filter(key -> freq.containsKey(freq.get(key)))
            .count();
}
First a HashMap is used to get the frequency of each number in the list. Next, we count the keys whose frequency values also exist as keys in the map.
What I don't like is how the two streams need to exist apart from one another, where a HashMap is made from a stream only to be instantly and exclusively consumed by another stream.
Is there a way to combine this into one stream? I was thinking something like this:
private static int myMethod(List<Integer> input) {
    return (int) input
            .stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
            .keySet()
            .stream()
            .filter(key -> freq.containsKey(freq.get(key)))
            .count();
}
but the problem here is there is no freq map to reference, as it is used as part of the pipeline, so the filter cannot do what it needs to do.
In summary, I don't like that this collects to a HashMap only to then convert back into a key set. Is there a way to "streamline" (pun intended) this operation so that it:
1. Does not go back and forth between a stream and a HashMap
2. References itself without needing to declare a separate map before the pipeline?
Thank you!
Your keySet is effectively nothing but a HashSet formed from your input. So you can make use of temporary storage such as:
Set<Integer> freq = new HashSet<>(input);
and then filter and count based on those values in a single stream pipeline:
return (int) input
        .stream()
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
        .values()          // just using the frequencies evaluated
        .stream()
        .filter(count -> freq.contains(count.intValue()))
        .count();

Collecting occurrences in a HashMap with streams

I've got an exercise to solve. I've got a Fox class, which has got name and color fields. My exercise is to find the frequency of the foxes by color.
Thus I've created a HashMap, where the String key would be the fox name and the Integer would be the occurrence count itself:
Map<String, Integer> freq = new HashMap<>();
Having done that, I have been trying to write the code with streams, but I am struggling to do that. I wrote something like this:
foxes.stream()
        .map(Fox::getColor)
        .forEach(/* ...continued later on */);
where foxes is a List.
My problem is basically with the syntax. I'd like to do something like: if the color has no occurrences yet, then
freq.put(Fox::getName, 1)
else
freq.replace(Fox::getName, freq.get(Fox::getName) + 1)
How should I put it together?
I wouldn't suggest proceeding with your approach, simply because there is already a built-in collector for this, i.e. the groupingBy collector with counting() as the downstream:
Map<String, Long> result = foxes.stream()
        .collect(Collectors.groupingBy(Fox::getName, Collectors.counting()));
This finds the frequency by name; likewise, you can get the frequency by colour by changing the classification function:
foxes.stream()
        .collect(Collectors.groupingBy(Fox::getColor, Collectors.counting()));

Best Way / data structure to count occurrences of strings

Let's assume I have a very long list of strings. I want to count the number of occurrences of each string. I don't know how many strings there are or what they look like (meaning: I have no dictionary of all possible strings).
My first idea was to create a Map and increase the integer every time I find the key again.
But this feels a bit clumsy. Is there a better way to count all occurrences of those strings?
Since Java 8, the easiest way is to use streams:
Map<String, Long> counts = list.stream()
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
Prior to Java 8, your currently outlined approach works just fine. (And the Java 8+ way is doing basically the same thing too, just with a more concise syntax).
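For reference, a minimal sketch of that pre-Java-8 approach (the list variable is assumed, as in the other snippets here):
Map<String, Integer> counts = new HashMap<>();
for (String s : list) {
    Integer current = counts.get(s);                  // null if the string has not been seen yet
    counts.put(s, current == null ? 1 : current + 1);
}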
You can do it without streams too:
Map<String, Long> map = new HashMap<>();
list.forEach(x -> map.merge(x, 1L, Long::sum));
If you really want a specific data structure, you can always look towards Guava's Multiset.
Usage will be similar to this:
List<String> words = Arrays.asList("a b c a a".split(" "));
Multiset<String> wordCounts = words.stream()
        .collect(toCollection(HashMultiset::create));   // toCollection is statically imported from Collectors
wordCounts.count("a"); // returns 3
wordCounts.count("b"); // returns 1
wordCounts.count("z"); // returns 0, no need to handle null!

Java binary search tree and hashtable

For my coursework (binary search trees and hashtables) I would like to make a Java program that scans a text file and orders words by frequency, something like a list of most popular tags.
Example:
1. Scan the file.
2. List the words that appear more than once:
WORD     TOTAL
Banana   10
Sun      7
Sea      3
Question 1: How do I scan a text file?
Question 2: How do I check for duplicates in the text file and count them?
Question 3: How do I print out the words that appear more than once, in the order shown in my example?
My programming is not strong.
Since it is coursework, I'm not going to provide the full details, but I'll try to point you in a possible direction:
Google how to read words from a text file (this is a very common problem, you should be able to find tons of examples)
Use, for instance, a HashMap (String to int) to count the words: if a word is not in the HashMap yet, add it with multiplicity 1; if it is already in there, increment its count (you might want to do some preprocessing on the words, for instance if you want to ignore capitals)
Filter the words with multiplicity more than 1 from your hashmap
Sort the filtered list of words based on their count
A very high-level implementation (with many open ends :) ):
List<String> words = readWordsFromFile();
Map<String, Integer> wordCounts = new HashMap<>();
for (String word : words) {
    String processedWord = preprocess(word);
    int count = 1;
    if (wordCounts.containsKey(processedWord)) {
        count = wordCounts.get(processedWord) + 1;
    }
    wordCounts.put(processedWord, count);
}
removeSingleOccurrences(wordCounts);
List<String> sortedWords = sortWords(wordCounts);
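One possible way to fill in two of those open ends, as a sketch rather than the only solution:
static void removeSingleOccurrences(Map<String, Integer> wordCounts) {
    // drop every word that appears only once
    wordCounts.values().removeIf(count -> count <= 1);
}

static List<String> sortWords(Map<String, Integer> wordCounts) {
    // sort the remaining words by their count, highest first
    List<String> sorted = new ArrayList<>(wordCounts.keySet());
    sorted.sort((a, b) -> wordCounts.get(b) - wordCounts.get(a));
    return sorted;
}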
You can use a Multiset from the Guava library: http://code.google.com/p/guava-libraries/wiki/NewCollectionTypesExplained#Multiset
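A rough sketch of how that could look (assuming the words have already been read into a List<String> words; Multisets.copyHighestCountFirst orders the elements by count, highest first):
Multiset<String> counts = HashMultiset.create(words);               // counts every word
for (Multiset.Entry<String> entry : Multisets.copyHighestCountFirst(counts).entrySet()) {
    if (entry.getCount() > 1) {                                     // only words that appear more than once
        System.out.println(entry.getElement() + " " + entry.getCount());
    }
}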

Find most common/frequent element in an ArrayList in Java

I have an ArrayList with 5 elements, each of which is an enum value. I want to build a method which returns another ArrayList with the most common element(s) in the list.
Example 1:
[Activities.WALKING, Activities.WALKING, Activities.WALKING, Activities.JOGGING, Activities.STANDING]
Method would return: [Activities.WALKING]
Example 2:
[Activities.WALKING, Activities.WALKING, Activities.JOGGING, Activities.JOGGING, Activities.STANDING]
Method would return: [Activities.WALKING, Activities.JOGGING]
WHAT I HAVE TRIED:
My idea was to declare a count for every activity, but that means that if I want to add another activity, I have to go and modify the code to add another count for that activity.
Another idea was to declare a HashMap<Activities, Integer> and iterate the array to insert each activity and its occurrence into it. But then how will I extract the Activities with the most occurrences?
Can you help me out guys?
The most common way of implementing something like this is counting with a Map: define a Map<MyEnum, Integer> which stores a zero for each element of your enumeration. Then walk through your list and increment the counter for each element that you find in the list. At the same time, maintain the current maximum count. Finally, walk through the counter map entries and add to the output list the keys of all entries whose counts match the maximum.
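A rough sketch of that counting approach, assuming the Activities enum from the question and an input list called activities (the EnumMap is just a convenient Map implementation for enum keys):
Map<Activities, Integer> counts = new EnumMap<>(Activities.class);
for (Activities a : Activities.values()) {
    counts.put(a, 0);                            // start every counter at zero
}
int max = 0;
for (Activities a : activities) {
    int c = counts.get(a) + 1;                   // increment the counter for this element
    counts.put(a, c);
    max = Math.max(max, c);                      // maintain the current maximum
}
List<Activities> mostCommon = new ArrayList<>();
for (Map.Entry<Activities, Integer> e : counts.entrySet()) {
    if (e.getValue() == max) {                   // keep every activity that reached the maximum
        mostCommon.add(e.getKey());
    }
}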
In statistics, this is called the "mode" (in your specific case, "multi mode" is also used, as you want all values that appear most often, not just one). A vanilla Java 8 solution looks like this:
Map<Activities, Long> counts = Stream.of(WALKING, WALKING, JOGGING, JOGGING, STANDING)
        .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
long max = Collections.max(counts.values());
List<Activities> result = counts
        .entrySet()
        .stream()
        .filter(e -> e.getValue().longValue() == max)
        .map(Entry::getKey)
        .collect(Collectors.toList());
Which yields:
[WALKING, JOGGING]
jOOλ is a library that supports modeAll() on streams. The following program:
System.out.println(
    Seq.of(WALKING, WALKING, JOGGING, JOGGING, STANDING)
       .modeAll()
       .toList()
);
Yields:
[WALKING, JOGGING]
(disclaimer: I work for the company behind jOOλ)
