Word Count Program using HashMaps

Word Count Program using HashMaps - java

import java.io.*;
import java.util.*;
public class ListSetMap2
{
public static void main(String[] args)
{
Map<String, Integer> my_collection = new HashMap<String, Integer>();
Scanner keyboard = new Scanner(System.in);
System.out.println("Enter a file name");
String filenameString = keyboard.nextLine();
File filename = new File(filenameString);
int word_position = 1;
int word_num = 1;
try
{
Scanner data_store = new Scanner(filename);
System.out.println("Opening " + filenameString);
while(data_store.hasNext())
{
String word = data_store.next();
if(word.length() > 5)
{
if(my_collection.containsKey(word))
{
my_collection.get(my_collection.containsKey(word));
Integer p = (Integer) my_collection.get(word_num++);
my_collection.put(word, p);
}
else
{
Integer i = (Integer) my_collection.get(word_num);
my_collection.put(word, i);
}
}
}
}
catch (FileNotFoundException e)
{
System.out.println("Nope!");
}
}
}
I'm trying to write a program where it inputs/scans a file, logs the words in a HashMap collection, and count's the times that word occurs in the document, with only words over 5 characters being counted.
It's a bit of a mess in the middle, but I'm running into issues on how to count the number of times that word occurs, and keeping a individual count for each word. I'm sure there is a simple solution here and I'm just missing it. Please help!

Your logic of setting the frequency of word is wrong. Here is a simple approach that should work for you:
// if the word is already present in the hashmap
if (my_collection.containsKey(word)) {
// just increment the current frequency of the word
// this overrides the existing frequency
my_collection.put(word, my_collection.get(word) + 1);
} else {
// since the word is not there just put it with a frequency 1
my_collection.put(word, 1);
}

(Only giving hints, since this seems to be homework.) my_collection is (correctly) a HashMap that maps String keys to Integer values; in your situation, a key is supposed to be a word, and the corresponding value is supposed to be the number of times you have seen that word (frequency). Each time you call my_collection.get(x), the parameter x needs to be a String, namely the word whose frequency you want to know (unfortunately, HashMap doesn't enforce this). Each time you call my_collection.put(x, y), x needs to be a String, and y needs to be an Integer or int, namely the frequency for that word.
Given this, give some more thought to what you're using as parameters, and the sequence in which you need to make the calls and how you need to manipulate the values. For example, if you've already determined that my_collection doesn't contain the word, does it make sense to ask my_collection for the word's frequency? If it does contain the word, how do you need to change the frequency before putting the new value into my_collection?
(Also, please choose a more descriptive name for my_collection, e.g. frequencies.)

Try this way -
while(data_store.hasNext()) {
String word = data_store.next();
if(word.length() > 5){
if(my_collection.get(word)==null) my_collection.put(1);
else{
my_collection.put(my_collection.get(word)+1);
}
}
}

Related

Print Total Number of Different Words (case sensitive) from a file

**Edit after reviewing Tormod's answer and implementing his advice.
As the title states I'm attempting to print the total number of different words after receiving a file name from command line input. I receive the following message after attempting to compile the program:
Note: Project.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Here is my code. Any help is greatly appreciated:
import java.lang.*;
import java.util.*;
import java.io.*;
public class Project {
public static void main(String[] args) throws IOException {
File file = new File(args[0]);
Scanner s = new Scanner(file);
HashSet lib = new HashSet<>();
try (Scanner sc = new Scanner(new FileInputStream(file))) {
int count = 0;
while(sc.hasNext()) {
sc.next();
count++;
}
System.out.println("The total number of word in the file is: " + count);
}
while (s.hasNext()) {
String data = s.nextLine();
String[] pieces = data.split("\\s+");
for (int count = 0; count < pieces.length; count++)
{
if(!lib.contains(pieces[count])) {
lib.add(pieces[count]);
}
}
}
System.out.print(lib.size());
}
}

I would implement it using a HashSet Add all the words, and read out the size. If you want to make it case insensitive just manipulate all the words to uppercase or something like that. this uses some memory but...
one problem you got with the algorithm is that you do only have one "words". it only holds the words at the same line. so you only count same words at the same line.
HashSet stores strings by their hash value, and thus stores one word only one time.
construction: HashSet lib = new HashSet<>();
inside the loop: if(!lib.contains(word)){lib.add(word);}
check the word count: lib.size()

for(String s : words) {
if(s.equals(word))
count++;
}
You are comparing the words to an empty String, since it's a word it's always gonna be false.
Like Tormod said, the best would be to store the words in a HashSet, as it won't keep duplicates. Then just read out its size.

Counting frequency of words from a .txt file in java

I am working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now, I'm struggling with the method that counts the frequency of words that appears in a .txt file.
I have a set of text files in both English and French in their respective folders labeled 1-20. The method asks for a directory (which in this case is "docs/train/eng/" or "docs/train/fre/") and for how many files that the program should go through (there are 20 files in each folder). Then it reads that file, splits all the words apart (I don't need to worry about capitalization or punctuation), and puts every word in a HashMap along with how many times they were in the file. (Key = word, Value = frequency).
This is the code I came up with for the method:
public static HashMap<String, Integer> countWords(String directory, int nFiles) {
// Declare the HashMap
HashMap<String, Integer> wordCount = new HashMap();
// this large 'for' loop will go through each file in the specified directory.
for (int k = 1; k < nFiles; k++) {
// Puts together the string that the FileReader will refer to.
String learn = directory + k + ".txt";
try {
FileReader reader = new FileReader(learn);
BufferedReader br = new BufferedReader(reader);
// The BufferedReader reads the lines
String line = br.readLine();
// Split the line into a String array to loop through
String[] words = line.split(" ");
int freq = 0;
// for loop goes through every word
for (int i = 0; i < words.length; i++) {
// Case if the HashMap already contains the key.
// If so, just increments the value
if (wordCount.containsKey(words[i])) {
wordCount.put(words[i], freq++);
}
// Otherwise, puts the word into the HashMap
else {
wordCount.put(words[i], freq++);
}
}
// Catching the file not found error
// and any other errors
}
catch (FileNotFoundException fnfe) {
System.err.println("File not found.");
}
catch (Exception e) {
System.err.print(e);
}
}
return wordCount;
}
The code compiles. Unfortunately, when I asked it to print the results of all the word counts for the 20 files, it printed this. It's complete gibberish (though the words are definitely there) and is not at all what I need the method to do.
If anyone could help me debug my code, I would greatly appreciate it. I've been at it for ages, conducting test after test and I'm ready to give up.

Let me combine all the good answers here.
1) Split up your methods to handle one thing each. One to read the files into strings[], one to process the strings[], and one to call the first two.
2) When you split think deeply about how you want to split. As #m0skit0 suggest you should likely split with \b for this problem.
3) As #jas suggested you should first check if your map already has the word. If it does increment the count, if not add the word to the map and set it's count to 1.
4) To print out the map in the way you likely expect, take a look at the below:
Map test = new HashMap();
for (Map.Entry entry : test.entrySet()){
System.out.println(entry.getKey() + " " + entry.getValue());
}

I would have expected something more like this. Does it make sense?
if (wordCount.containsKey(words[i])) {
int n = wordCount.get(words[i]);
wordCount.put(words[i], ++n);
}
// Otherwise, puts the word into the HashMap
else {
wordCount.put(words[i], 1);
}
If the word is already in the hashmap, we want to get the current count, add 1 to that and replace the word with the new count in the hashmap.
If the word is not yet in the hashmap, we simply put it in the map with a count of 1 to start with. The next time we see the same word we'll up the count to 2, etc.

If you split by space only, then other signs (parenthesis, punctuation marks, etc...) will be included in the words. For example: "This phrase, contains... funny stuff", if you split it by space you get: "This" "phrase," "contains..." "funny" and "stuff".
You can avoid this by splitting by word boundary (\b) instead.
line.split("\\b");
Btw your if and else parts are identical. You're always incrementing freq by one, which doesn't make much sense. If the word is already in the map, you want to get the current frequency, add 1 to it, and update the frequency in the map. If not, you put it in the map with a value of 1.
And pro tip: always print/log the full stacktrace for the exceptions.

ArrayList: Get length of longest string, Get average length of string

In Java, I have a method that reads in a text file that has all the words in the dictionary, each on their own line.
It reads each line by using a for loop and adds each word to an ArrayList.
I want to get the length of the longest word (String) in the Array. In addition, I want to get the length of the longest word in the dictionary file. It would probably be easier to split this into several methods, but I don't know the syntax.
So far, the code is have is:
public class spellCheck {
static ArrayList <String> dictionary; //the dictonary file
/**
* load file
* #param fileName the file containing the dictionary
* #throws FileNotFoundException
*/
public static void loadDictionary(String fileName) throws FileNotFoundException {
Scanner in = new Scanner(new File(fileName));
while (in.hasNext())
{
for(int i = 0; i < fileName.length(); ++i)
{
String dictionaryword = in.nextLine();
dictionary.add(dictionaryword);
}
}

Assuming that each word is on it's own line, you should be reading the file more like...
try (Scanner in = new Scanner(new File(fileName))) {
while (in.hasNextLine()) {
String dictionaryword = in.nextLine();
dictionary.add(dictionaryword);
}
}
Remember, if you open a resource, you are responsible for closing. See The try-with-resources Statement for more details...
Calculating the metrics can be done after reading the file, but since your here, you could do something like...
int totalWordLength = 0;
String longest = "";
while (in.hasNextLine()) {
String dictionaryword = in.nextLine();
totalWordLength += dictionaryword.length();
dictionary.add(dictionaryword);
if (dictionaryword.length() > longest.length()) {
longest = dictionaryword;
}
}
int averageLength = Math.round(totalWordLength / (float)dictionary.size());
But you could just as easily loop through the dictionary and use the same idea
(nb- I've used local variables, so you will either want to make them class fields or return them wrapped in some kind of "metrics" class - your choice)

Set a two counters and a variable that holds the current longest word found before you start reading in with your while loop. To find the average have one counter be incremented by one each time the line is read and have the second counter add up the total number of characters in each word (obviously the total number of characters entered, divided by the total number of words read -- as denoted by the total number of lines -- is the average length of each word.
As for the longest word, set the longest word to be the empty string or some dummy value like a single character. Each time you read in a line compare the current word with the previously found longest word (using the .length() method on the String to find its length) and if its longer set a new longest word found
Also, if you have all this in a file, I'd use a buffered reader to read in your input data

May be this could help
String words = "Rookie never dissappoints, dont trust any Rookie";
// read your file to string if you get string while reading then you can use below code to do that.
String ss[] = words.split(" ");
List<String> list = Arrays.asList(ss);
Map<Integer,String> set = new Hashtable<Integer,String>();
int i =0;
for(String str : list)
{
set.put(str.length(), str);
System.out.println(list.get(i));
i++;
}
Set<Integer> keys = set.keySet();
System.out.println(keys);
System.out.println(set);
Object j[]= keys.toArray();
Arrays.sort(j);
Object max = j[j.length-1];
set.get(max);
System.out.println("Tha longest word is "+set.get(max));
System.out.println("Length is "+max);

Using a user inputted string of characters find the longest word that can be made

Basically I want to create a program which simulates the 'Countdown' game on Channel 4. In effect a user must input 9 letters and the program will search for the largest word in the dictionary that can be made from these letters.I think a tree structure would be better to go with rather than hash tables. I already have a file which contains the words in the dictionary and will be using file io.
This is my file io class:
public static void main(String[] args){
FileIO reader = new FileIO();
String[] contents = reader.load("dictionary.txt");
}
This is what I have so far in my Countdown class
public static void main(String[] args) throws IOException{
Scanner scan = new Scanner(System.in);
letters = scan.NextLine();
}
I get totally lost from here. I know this is only the start but I'm not looking for answers. I just want a small bit of help and maybe a pointer in the right direction. I'm only new to java and found this question in an interview book and thought I should give it a .
Thanks in advance

welcome to the world of Java :)
The first thing I see there that you have two main methods, you don't actually need that. Your program will have a single entry point in most cases then it does all its logic and handles user input and everything.
You're thinking of a tree structure which is good, though there might be a better idea to store this. Try this: http://en.wikipedia.org/wiki/Trie
What your program has to do is read all the words from the file line by line, and in this process build your data structure, the tree. When that's done you can ask the user for input and after the input is entered you can search the tree.
Since you asked specifically not to provide answers I won't put code here, but feel free to ask if you're unclear about something

There are only about 800,000 words in the English language, so an efficient solution would be to store those 800,000 words as 800,000 arrays of 26 1-byte integers that count how many times each letter is used in the word, and then for an input 9 characters you convert to similar 26 integer count format for the query, and then a word can be formed from the query letters if the query vector is greater than or equal to the word-vector component-wise. You could easily process on the order of 100 queries per second this way.

I would write a program that starts with all the two-letter words, then does the three-letter words, the four-letter words and so on.
When you do the two-letter words, you'll want some way of picking the first letter, then picking the second letter from what remains. You'll probably want to use recursion for this part. Lastly, you'll check it against the dictionary. Try to write it in a way that means you can re-use the same code for the three-letter words.

I believe, the power of Regular Expressions would come in handy in your case:
1) Create a regular expression string with a symbol class like: /^[abcdefghi]*$/ with your letters inside instead of "abcdefghi".
2) Use that regular expression as a filter to get a strings array from your text file.
3) Sort it by length. The longest word is what you need!
Check the Regular Expressions Reference for more information.
UPD: Here is a good Java Regex Tutorial.

A first approach could be using a tree with all the letters present in the wordlist.
If one node is the end of a word, then is marked as an end-of-word node.
In the picture above, the longest word is banana. But there are other words, like ball, ban, or banal.
So, a node must have:
A character
If it is the end of a word
A list of children. (max 26)
The insertion algorithm is very simple: In each step we "cut" the first character of the word until the word has no more characters.
public class TreeNode {
public char c;
private boolean isEndOfWord = false;
private TreeNode[] children = new TreeNode[26];
public TreeNode(char c) {
this.c = c;
}
public void put(String s) {
if (s.isEmpty())
{
this.isEndOfWord = true;
return;
}
char first = s.charAt(0);
int pos = position(first);
if (this.children[pos] == null)
this.children[pos] = new TreeNode(first);
this.children[pos].put(s.substring(1));
}
public String search(char[] letters) {
String word = "";
String w = "";
for (int i = 0; i < letters.length; i++)
{
TreeNode child = children[position(letters[i])];
if (child != null)
w = child.search(letters);
//this is not efficient. It should be optimized.
if (w.contains("%")
&& w.substring(0, w.lastIndexOf("%")).length() > word
.length())
word = w;
}
// if a node its end-of-word we add the special char '%'
return c + (this.isEndOfWord ? "%" : "") + word;
}
//if 'a' returns 0, if 'b' returns 1...etc
public static int position(char c) {
return ((byte) c) - 97;
}
}
Example:
public static void main(String[] args) {
//root
TreeNode t = new TreeNode('R');
//for skipping words with "'" in the wordlist
Pattern p = Pattern.compile(".*\\W+.*");
int nw = 0;
try (BufferedReader br = new BufferedReader(new FileReader(
"files/wordsEn.txt")))
{
for (String line; (line = br.readLine()) != null;)
{
if (p.matcher(line).find())
continue;
t.put(line);
nw++;
}
// line is not visible here.
br.close();
System.out.println("number of words : " + nw);
String res = null;
// substring (1) because of the root
res = t.search("vuetsrcanoli".toCharArray()).substring(1);
System.out.println(res.replace("%", ""));
}
catch (Exception e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Output:
number of words : 109563
counterrevolutionaries
Notes:
The wordlist is taken from here
the reading part is based on another SO question : How to read a large text file line by line using Java?

More efficient or more modern? Reading in & Sorting A Text File With Java

I've been trying to upgrade my Java skills to use more of Java 5 & Java 6. I've been playing around with some programming exercises. I was asked to read in a paragraph from a text file and output a sorted (descending) list of words and output the count of each word.
My code is below.
My questions are:
Is my file input routine the most respectful of JVM resources?
Is it possible to cut steps out in regards to reading the file contents and getting the content into a collection that can make a sorted list of words?
Am I using the Collection classes and interface the most efficient way I can?
Thanks much for any opinions. I'm just trying to have some fun and improve my programming skills.
import java.io.*;
import java.util.*;
public class Sort
{
public static void main(String[] args)
{
String sUnsorted = null;
String[] saSplit = null;
int iCurrentWordCount = 1;
String currentword = null;
String pastword = "";
// Read the text file into a string
sUnsorted = readIn("input1.txt");
// Parse the String by white space into String array of single words
saSplit = sUnsorted.split("\\s+");
// Sort the String array in descending order
java.util.Arrays.sort(saSplit, Collections.reverseOrder());
// Count the occurences of each word in the String array
for (int i = 0; i < saSplit.length; i++ )
{
currentword = saSplit[i];
// If this word was seen before, increase the count & print the
// word to stdout
if ( currentword.equals(pastword) )
{
iCurrentWordCount ++;
System.out.println(currentword);
}
// Output the count of the LAST word to stdout,
// Reset our counter
else if (!currentword.equals(pastword))
{
if ( !pastword.equals("") )
{
System.out.println("Word Count for " + pastword + ": " + iCurrentWordCount);
}
System.out.println(currentword );
iCurrentWordCount = 1;
}
pastword = currentword;
}// end for loop
// Print out the count for the last word processed
System.out.println("Word Count for " + currentword + ": " + iCurrentWordCount);
}// end funciton main()
// Read The Input File Into A String
public static String readIn(String infile)
{
String result = " ";
try
{
FileInputStream file = new FileInputStream (infile);
DataInputStream in = new DataInputStream (file);
byte[] b = new byte[ in.available() ];
in.readFully (b);
in.close ();
result = new String (b, 0, b.length, "US-ASCII");
}
catch ( Exception e )
{
e.printStackTrace();
}
return result;
}// end funciton readIn()
}// end class Sort()
/////////////////////////////////////////////////
// Updated Copy 1, Based On The Useful Comments
//////////////////////////////////////////////////
import java.io.*;
import java.util.*;
public class Sort2
{
public static void main(String[] args) throws Exception
{
// Scanner will tokenize on white space, like we need
Scanner scanner = new Scanner(new FileInputStream("input1.txt"));
ArrayList <String> wordlist = new ArrayList<String>();
String currentword = null;
String pastword = null;
int iCurrentWordCount = 1;
while (scanner.hasNext())
wordlist.add(scanner.next() );
// Sort in descending natural order
Collections.sort(wordlist);
Collections.reverse(wordlist);
for ( String temp : wordlist )
{
currentword = temp;
// If this word was seen before, increase the count & print the
// word to stdout
if ( currentword.equals(pastword) )
{
iCurrentWordCount ++;
System.out.println(currentword);
}
// Output the count of the LAST word to stdout,
// Reset our counter
else //if (!currentword.equals(pastword))
{
if ( pastword != null )
System.out.println("Count for " + pastword + ": " +
CurrentWordCount);
System.out.println(currentword );
iCurrentWordCount = 1;
}
pastword = currentword;
}// end for loop
System.out.println("Count for " + currentword + ": " + iCurrentWordCount);
}// end funciton main()
}// end class Sort2

There are more idiomatic ways of reading in all the words in a file in Java.
BreakIterator is a better way of reading in words from an input.
Use List<String> instead of Array in almost all cases. Array isn't technically part of the Collection API and isn't as easy to replace implementations as List, Set and Map are.
You should use a Map<String,AtomicInteger> to do your word counting instead of walking the Array over and over. AtomicInteger is mutable unlike Integer so you can just incrementAndGet() in a single operation that just happens to be thread safe. A SortedMap implementation would give you the words in order with their counts as well.
Make as many variables, even local ones final as possible. and declare them right before you use them, not at the top where their intended scope will get lost.
You should almost always use a BufferedReader or BufferedStream with an appropriate buffer size equal to a multiple of your disk block size when doing disk IO.
That said, don't concern yourself with micro optimizations until you have "correct" behavior.

the SortedMap type might be efficient enough memory-wise to use here in the form SortedMap<String,Integer> (especially if the word counts are likely to be under 128)
you can provide customer delimiters to the Scanner type for breaking streams
Depending on how you want to treat the data, you might also want to strip punctuation or go for more advanced word isolation with a break iterator - see the java.text package or the ICU project.
Also - I recommend declaring variables when you first assign them and stop assigning unwanted null values.
To elaborate, you can count words in a map like this:
void increment(Map<String, Integer> wordCountMap, String word) {
Integer count = wordCountMap.get(word);
wordCountMap.put(word, count == null ? 1 : ++count);
}
Due to the immutability of Integer and the behaviour of autoboxing, this might result in excessive object instantiation for large data sets. An alternative would be (as others suggest) to use a mutable int wrapper (of which AtomicInteger is a form.)

Can you use Guava for your homework assignment? Multiset handles the counting. Specifically, LinkedHashMultiset might be useful.

Some other things you might find interesting:
To read the file you could use a BufferedReader (if it's text only).
This:
for (int i = 0; i < saSplit.length; i++ ){
currentword = saSplit[i];
[...]
}
Could be done using a extended for-loop (the Java-foreach), like shown here.
if ( currentword.equals(pastword) ){
[...]
} else if (!currentword.equals(pastword)) {
[...]
}
In your case, you can simply use a single else so the condition isn't checked again (because if the words aren't the same, they can only be different).
if ( !pastword.equals("") )
I think using length is faster here:
if (!pastword.length == 0)

Input method:
Make it easier on yourself and deal directly with characters instead of bytes. For example, you could use a FileReader and possibly wrap it inside a BufferedReader. At the least, I'd suggest looking at InputStreamReader, as the implementation to change from bytes to characters is already done for you. My preference would be using Scanner.
I would prefer returning null or throwing an exception from your readIn() method. Exceptions should not be used for flow control, but, here, you're sending an important message back to the caller: the file that you provided was not valid. Which brings me to another point: consider whether you truly want to catch all exceptions, or just ones of certain types. You'll have to handle all checked exceptions, but you may want to handle them differently.
Collections:
You're really not use Collections classes, you're using an array. Your implementation seems fine, but...
There are certainly many ways of handling this problem. Your method -- sorting then comparing to last -- is O(nlogn) on average. That's certainly not bad. Look at a way of using a Map implementation (such as HashMap) to store the data you need while only traversing the text in O(n) (HashMap's get() and put() -- and presumably contains() -- methods are O(1)).

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Word Count Program using HashMaps - java

Try this way - while(data_store.hasNext()) { String word = data_store.next(); if(word.length() > 5){ if(my_collection.get(word)==null) my_collection.put(1); else{ my_collection.put(my_collection.get(word)+1); } } }

Related

Print Total Number of Different Words (case sensitive) from a file

Counting frequency of words from a .txt file in java

ArrayList: Get length of longest string, Get average length of string

Using a user inputted string of characters find the longest word that can be made

More efficient or more modern? Reading in & Sorting A Text File With Java

Categories

Resources