Counting frequency of words from a .txt file in Java

I am working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now, I'm struggling with the method that counts the frequency of words appearing in a .txt file.
I have a set of text files in both English and French in their respective folders, labeled 1-20. The method asks for a directory (which in this case is "docs/train/eng/" or "docs/train/fre/") and for the number of files the program should go through (there are 20 files in each folder). It then reads each file, splits all the words apart (I don't need to worry about capitalization or punctuation), and puts every word in a HashMap along with how many times it appeared in the file. (Key = word, Value = frequency).
This is the code I came up with for the method:
public static HashMap<String, Integer> countWords(String directory, int nFiles) {
    // Declare the HashMap
    HashMap<String, Integer> wordCount = new HashMap();
    // This large 'for' loop will go through each file in the specified directory.
    for (int k = 1; k < nFiles; k++) {
        // Puts together the string that the FileReader will refer to.
        String learn = directory + k + ".txt";
        try {
            FileReader reader = new FileReader(learn);
            BufferedReader br = new BufferedReader(reader);
            // The BufferedReader reads the lines
            String line = br.readLine();
            // Split the line into a String array to loop through
            String[] words = line.split(" ");
            int freq = 0;
            // for loop goes through every word
            for (int i = 0; i < words.length; i++) {
                // Case if the HashMap already contains the key.
                // If so, just increments the value
                if (wordCount.containsKey(words[i])) {
                    wordCount.put(words[i], freq++);
                }
                // Otherwise, puts the word into the HashMap
                else {
                    wordCount.put(words[i], freq++);
                }
            }
        }
        // Catching the file not found error
        // and any other errors
        catch (FileNotFoundException fnfe) {
            System.err.println("File not found.");
        }
        catch (Exception e) {
            System.err.print(e);
        }
    }
    return wordCount;
}
The code compiles. Unfortunately, when I asked it to print the results of all the word counts for the 20 files, it printed this. It's complete gibberish (though the words are definitely there) and is not at all what I need the method to do.
If anyone could help me debug my code, I would greatly appreciate it. I've been at it for ages, conducting test after test and I'm ready to give up.

Let me combine all the good answers here.
1) Split up your methods to handle one thing each: one to read the files into a String[], one to process the String[], and one to call the first two.
2) When you split, think carefully about how you want to split. As @m0skit0 suggests, you should likely split on \b for this problem.
3) As @jas suggested, you should first check whether your map already contains the word. If it does, increment the count; if not, add the word to the map and set its count to 1.
4) To print out the map in the way you likely expect, take a look at the code below:
Map<String, Integer> test = new HashMap<>();
for (Map.Entry<String, Integer> entry : test.entrySet()) {
    System.out.println(entry.getKey() + " " + entry.getValue());
}

I would have expected something more like this. Does it make sense?
if (wordCount.containsKey(words[i])) {
    int n = wordCount.get(words[i]);
    wordCount.put(words[i], ++n);
}
// Otherwise, puts the word into the HashMap
else {
    wordCount.put(words[i], 1);
}
If the word is already in the hashmap, we want to get the current count, add 1 to that and replace the word with the new count in the hashmap.
If the word is not yet in the hashmap, we simply put it in the map with a count of 1 to start with. The next time we see the same word we'll up the count to 2, etc.
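For what it's worth, on Java 8 and later the whole if/else above collapses into a single call; a minimal sketch of the same logic:
// Equivalent to the containsKey/get/put sequence above:
// insert 1 for a new word, otherwise add 1 to the current count
wordCount.merge(words[i], 1, Integer::sum);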

If you split by space only, then other signs (parenthesis, punctuation marks, etc...) will be included in the words. For example: "This phrase, contains... funny stuff", if you split it by space you get: "This" "phrase," "contains..." "funny" and "stuff".
You can avoid this by splitting by word boundary (\b) instead.
line.split("\\b");
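For example, a quick sketch of what that gives (note that \b is zero-width, so the split also emits the separator runs as tokens, and you will want to filter those out):
String sample = "This phrase, contains... funny stuff";
for (String token : sample.split("\\b")) {
    // keep only tokens made of word characters, discarding
    // separator chunks such as ", " and "... "
    if (token.matches("\\w+")) {
        System.out.println(token);
    }
}
// prints: This, phrase, contains, funny, stuff (one per line)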
Btw your if and else parts are identical. You're always incrementing freq by one, which doesn't make much sense. If the word is already in the map, you want to get the current frequency, add 1 to it, and update the frequency in the map. If not, you put it in the map with a value of 1.
And pro tip: always print/log the full stacktrace for the exceptions.

Related

Find Word From a jumbled String

I have a scrambled String as follows: "artearardreardac".
I have a text file which contains close to 300,000 English dictionary words. I need to find the English words and be able to form a word square as follows:
C A R D
A R E A
R E A R
D A R T
My intention was to initially loop through the scrambled String, query that text file each time, and try to match 4 characters each time to see if it's a valid word.
The problem with this is checking against 300,000 words per loop... it's going to take ages. I looped through only the first letter 16 times and that by itself takes a significant time. The number of possibilities coming from this method seems endless. Even if I dismiss the efficiency for now, I could end up finding English words which may not form a word square.
My guess is that I have to find words while maintaining the letter formation correctly from the start somehow? I've been at it for hours and gone from fun to frustration. Can I just get some guidance please? I looked for similar questions but found none.
Note: This is an example and I am trying to keep it open for a longer string or a square of different size. (The example is 4x4. The user can decide to go with a 5x5 square with a string of length 25).
My Code
public static void main(String[] args) {
    String result = wordSquareCreator(4, "artearardreardac");
    System.out.println(result);
}

static String wordSquareCreator(int dimension, String letter) {
    String sortedWord = "";
    String temp;
    int front = 0;
    int firstLetterFront = 0;
    int back = dimension;
    // Looping through first 4 letters and only changing the first letter 16 times to try a match.
    for (int j = 0; j < letter.length(); j++) {
        String a = letter.substring(firstLetterFront, j + 1) + letter.substring(front + 1, back);
        temp = readFile(dimension, a);
        if (temp != null) {
            sortedWord += temp;
        }
        firstLetterFront++;
    }
    return sortedWord;
}

static String readFile(int dimension, String word) {
    // dict text file contains 300,000 English words
    File file = new File("dict.txt");
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new FileReader(file));
        String text;
        while ((text = reader.readLine()) != null) {
            if (text.length() == dimension) {
                if (text.equals(word)) {
                    // found a valid English word
                    return text;
                }
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (reader != null)
                reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return null;
}
You can greatly cut down your search space if you organize your dictionary properly. (Which can be done as you read it in, you don't need to modify the file on disk.)
Break it up into one list per word length, then sort each list.
Now, to reduce your search space--note that singletons can only occur on the diagonal from the top left to the bottom right. You have an odd number of C, T, R and A--those 4 letters make up this diagonal. (Note that you will not always be able to do this as they aren't guaranteed unique.) Your search space is now one set of 4 with 4 options (24 options) and one set of 6 (720 options except there are duplicates that cut this down.) 17k possible boards and under 1k words (edit: I originally said 5k but you can restrict the space to words starting with the correct letter and since it's a sorted list you don't need to consider the others at all) to try and you're already under 20 million possibilities to examine. You can cut this considerably by first filtering your word list to those that contain only the letters that are used.
At this point an exhaustive search isn't going to be prohibitive.
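A minimal sketch of that organization step, assuming the dictionary has already been read into a list named dictionaryWords (an illustrative name):
// One sorted list per word length; sorted lists let you binary-search
// and restrict the space to words starting with a given letter.
Map<Integer, List<String>> byLength = new HashMap<>();
for (String w : dictionaryWords) {
    byLength.computeIfAbsent(w.length(), k -> new ArrayList<>()).add(w);
}
for (List<String> group : byLength.values()) {
    Collections.sort(group);
}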
Since it seems that you want to create a word square out of the letters that you take in as a parameter to your function, you know that the word length in your square is sqrt(amountOfLetters). In your example code that would be sqrt(16) = 4. You can also disqualify a lot of words directly from your dictionary:
discard a word if it does not start with a letter in your "alphabet" (i.e. "A", "C", "D", "E", "R", "T")
discard a word if its length is not equal to your word length (i.e. 4)
discard a word if it has a letter not in your alphabet
The number of words that you want to "write" in your square is wordlength * 2 (since the words can only start from the upper row or from the left column).
You could actually start by going through your dictionary and copying only the valid words into a new file, as sketched below. Then compare your square against this new, shorter dictionary.
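A minimal sketch of that filtering, assuming the dictionary is in dict.txt with one word per line (names are illustrative):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DictionaryFilter {
    // Keep only words of the right length whose letters all come from the
    // square's alphabet; anything else can never appear in the square.
    static List<String> candidates(String letters, int dimension) throws IOException {
        try (Stream<String> lines = Files.lines(Paths.get("dict.txt"))) {
            return lines
                    .filter(w -> w.length() == dimension)
                    .filter(w -> w.chars().allMatch(c -> letters.indexOf(c) >= 0))
                    .collect(Collectors.toList());
        }
    }
}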
For building up the square, I think there are 2 possibilities to choose between.
The first one is to randomly organize the square from the letters and check whether the letters form correct words.
The second one is to randomly choose "correct" words from the dictionary and write them into your square. After that you check whether the words use the correct number and arrangement of letters.

Word Count Program using HashMaps

import java.io.*;
import java.util.*;

public class ListSetMap2
{
    public static void main(String[] args)
    {
        Map<String, Integer> my_collection = new HashMap<String, Integer>();
        Scanner keyboard = new Scanner(System.in);
        System.out.println("Enter a file name");
        String filenameString = keyboard.nextLine();
        File filename = new File(filenameString);
        int word_position = 1;
        int word_num = 1;
        try
        {
            Scanner data_store = new Scanner(filename);
            System.out.println("Opening " + filenameString);
            while (data_store.hasNext())
            {
                String word = data_store.next();
                if (word.length() > 5)
                {
                    if (my_collection.containsKey(word))
                    {
                        my_collection.get(my_collection.containsKey(word));
                        Integer p = (Integer) my_collection.get(word_num++);
                        my_collection.put(word, p);
                    }
                    else
                    {
                        Integer i = (Integer) my_collection.get(word_num);
                        my_collection.put(word, i);
                    }
                }
            }
        }
        catch (FileNotFoundException e)
        {
            System.out.println("Nope!");
        }
    }
}
I'm trying to write a program that inputs/scans a file, logs the words in a HashMap collection, and counts the number of times each word occurs in the document, with only words over 5 characters being counted.
It's a bit of a mess in the middle, but I'm running into issues with how to count the number of times a word occurs while keeping an individual count for each word. I'm sure there is a simple solution here and I'm just missing it. Please help!
Your logic for setting the frequency of a word is wrong. Here is a simple approach that should work for you:
// if the word is already present in the hashmap
if (my_collection.containsKey(word)) {
    // increment the current frequency of the word;
    // this put() replaces the existing value
    my_collection.put(word, my_collection.get(word) + 1);
} else {
    // since the word is not there, just put it with a frequency of 1
    my_collection.put(word, 1);
}
(Only giving hints, since this seems to be homework.) my_collection is (correctly) a HashMap that maps String keys to Integer values; in your situation, a key is supposed to be a word, and the corresponding value is supposed to be the number of times you have seen that word (frequency). Each time you call my_collection.get(x), the parameter x needs to be a String, namely the word whose frequency you want to know (unfortunately, HashMap doesn't enforce this). Each time you call my_collection.put(x, y), x needs to be a String, and y needs to be an Integer or int, namely the frequency for that word.
Given this, give some more thought to what you're using as parameters, and the sequence in which you need to make the calls and how you need to manipulate the values. For example, if you've already determined that my_collection doesn't contain the word, does it make sense to ask my_collection for the word's frequency? If it does contain the word, how do you need to change the frequency before putting the new value into my_collection?
(Also, please choose a more descriptive name for my_collection, e.g. frequencies.)
Try this way -
while (data_store.hasNext()) {
    String word = data_store.next();
    if (word.length() > 5) {
        if (my_collection.get(word) == null) {
            my_collection.put(word, 1);
        } else {
            my_collection.put(word, my_collection.get(word) + 1);
        }
    }
}

Verifying unexpected empty lines in a file

Aside: I am using the penn.txt file for the problem. The link here is to my Dropbox but it is also available in other places such as here. However, I've not checked whether they are exactly the same.
Problem statement: I would like to do some word processing on each line of the penn.txt file which contains some words and syntactic categories. The details are not relevant.
Actual "problem" faced: I suspect that the file has some consecutive blank lines (which should ideally not be present), which I think the code verifies but I have not verified it by eye, because the number of lines is somewhat large (~1,300,000). So I would like my Java code and conclusions checked for correctness.
I've used a slightly modified version of the code for converting a file to a String and counting the number of lines in a string. I'm not sure about the efficiency of the splitting, but it works well enough for this case.
File file = new File("final_project/penn.txt"); // location
System.out.println(file.exists());

// converting file to String
byte[] encoded = null;
try {
    encoded = Files.readAllBytes(Paths.get("final_project/penn.txt"));
} catch (IOException e1) {
    // TODO Auto-generated catch block
    e1.printStackTrace();
}
String mystr = new String(encoded, StandardCharsets.UTF_8);

// splitting and checking "consecutiveness" of \n
for (int j = 1; ; j++) {
    String split = new String();
    for (int i = 0; i < j; i++) {
        split = split + "\n";
    }
    if (mystr.split(split).length == 1) break;
    System.out.print("(" + mystr.split(split).length + "," + j + ") ");
}

// counting using Scanner
int count = 0;
try {
    Scanner reader = new Scanner(new FileInputStream(file));
    while (reader.hasNext()) {
        count++;
        String entry = reader.next();
        // some word processing here
    }
    reader.close();
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
System.out.println(count);
The number of lines in Gedit--if I understand correctly--matched the number of \n characters found at 1,283,169. I have verified (separately) that the number of \r and \r\n (combined) characters is 0 using the same splitting idea. The total splitting output is shown below:
(1283169,1) (176,2) (18,3) (13,4) (11,5) (9,6) (8,7) (7,8) (6,9) (6,10) (5,11) (5,12) (4,13) (4,14) (4,15) (4,16) (3,17) (3,18) (3,19) (3,20) (3,21) (3,22) (3,23) (3,24) (3,25) (2,26) (2,27) (2,28) (2,29) (2,30) (2,31) (2,32) (2,33) (2,34) (2,35) (2,36) (2,37) (2,38) (2,39) (2,40) (2,41) (2,42) (2,43) (2,44) (2,45) (2,46) (2,47) (2,48) (2,49) (2,50)
Please answer whether the following statements are correct or not:
From this, what I understand is that there is one instance of 50 consecutive \n characters and because of that there are exactly two instances of 25 consecutive \n characters and so on.
The last count (using Scanner) reading gives 1,282,969 which is an exact difference of 200. In my opinion, what this means is that there are exactly 200 (or 199?) empty lines floating about somewhere in the file.
Is there any way to separately verify this "discrepancy" of 200? (something like a set-theoretic counting of intersections maybe)
A partial answer to question (the last part) is as follows:
(Assuming the two statements in the question are true)
If, instead of printing the number of split parts, you print the number of occurrences of \n repeated j times, you get (simply subtracting 1):
(1283168,1) (175,2) (17,3) (12,4) (10,5) (8,6) (7,7) (6,8) (5,9) (5,10) (4,11) (4,12) (3,13) (3,14) (3,15) (3,16) (2,17) (2,18) (2,19) (2,20) (2,21) (2,22) (2,23) (2,24) (2,25) (1,26) (1,27) (1,28) (1,29) (1,30) (1,31) (1,32) (1,33) (1,34) (1,35) (1,36) (1,37) (1,38) (1,39) (1,40) (1,41) (1,42) (1,43) (1,44) (1,45) (1,46) (1,47) (1,48) (1,49) (1,50)
Note that for j > 3, the product of both numbers is <= 50, which is your maximum. What this means is that there is one place with 50 consecutive \n characters, and all the hits you are getting from 4 to 49 are actually part of that same run.
However, for 3, the largest multiple of 3 not exceeding 50 is 48, which gives 16, while you have 17 occurrences here. So there is an extra \n\n\n somewhere, with a non-\n character on both its 'sides'.
Now for 2 (\n\n), we can subtract 25 (coming from the 50 \ns) and 1 (coming from the separate \n\n\n) to obtain 175-26 = 149.
Counting the discrepancy, we should sum (2-1)*149 + (3-1)*1 + (50-1)*1, the -1 in each term coming because the first \n of each run is accounted for in the Scanner counting. This sum is exactly 200.
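As a cross-check, the run lengths can also be tallied directly in a single pass instead of by repeated splitting; a minimal sketch (mystr as in the question):
// runs.get(k) = number of places where exactly k consecutive '\n' appear
Map<Integer, Integer> runs = new TreeMap<>();
int run = 0;
for (int i = 0; i < mystr.length(); i++) {
    if (mystr.charAt(i) == '\n') {
        run++;
    } else if (run > 0) {
        runs.merge(run, 1, Integer::sum);
        run = 0;
    }
}
if (run > 0) {
    runs.merge(run, 1, Integer::sum);
}
System.out.println(runs);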

More efficient or more modern? Reading in & Sorting A Text File With Java

I've been trying to upgrade my Java skills to use more of Java 5 & Java 6. I've been playing around with some programming exercises. I was asked to read in a paragraph from a text file and output a sorted (descending) list of the words together with the count of each word.
My code is below.
My questions are:
Is my file input routine the most respectful of JVM resources?
Is it possible to cut steps out in regards to reading the file contents and getting the content into a collection that can make a sorted list of words?
Am I using the Collection classes and interfaces in the most efficient way I can?
Thanks much for any opinions. I'm just trying to have some fun and improve my programming skills.
import java.io.*;
import java.util.*;

public class Sort
{
    public static void main(String[] args)
    {
        String sUnsorted = null;
        String[] saSplit = null;
        int iCurrentWordCount = 1;
        String currentword = null;
        String pastword = "";

        // Read the text file into a string
        sUnsorted = readIn("input1.txt");

        // Parse the String by white space into String array of single words
        saSplit = sUnsorted.split("\\s+");

        // Sort the String array in descending order
        java.util.Arrays.sort(saSplit, Collections.reverseOrder());

        // Count the occurrences of each word in the String array
        for (int i = 0; i < saSplit.length; i++)
        {
            currentword = saSplit[i];

            // If this word was seen before, increase the count & print the
            // word to stdout
            if (currentword.equals(pastword))
            {
                iCurrentWordCount++;
                System.out.println(currentword);
            }
            // Output the count of the LAST word to stdout,
            // reset our counter
            else if (!currentword.equals(pastword))
            {
                if (!pastword.equals(""))
                {
                    System.out.println("Word Count for " + pastword + ": " + iCurrentWordCount);
                }
                System.out.println(currentword);
                iCurrentWordCount = 1;
            }
            pastword = currentword;
        } // end for loop

        // Print out the count for the last word processed
        System.out.println("Word Count for " + currentword + ": " + iCurrentWordCount);
    } // end function main()

    // Read The Input File Into A String
    public static String readIn(String infile)
    {
        String result = " ";
        try
        {
            FileInputStream file = new FileInputStream(infile);
            DataInputStream in = new DataInputStream(file);
            byte[] b = new byte[in.available()];
            in.readFully(b);
            in.close();
            result = new String(b, 0, b.length, "US-ASCII");
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        return result;
    } // end function readIn()
} // end class Sort
/////////////////////////////////////////////////
// Updated Copy 1, Based On The Useful Comments
//////////////////////////////////////////////////
import java.io.*;
import java.util.*;

public class Sort2
{
    public static void main(String[] args) throws Exception
    {
        // Scanner will tokenize on white space, like we need
        Scanner scanner = new Scanner(new FileInputStream("input1.txt"));
        ArrayList<String> wordlist = new ArrayList<String>();
        String currentword = null;
        String pastword = null;
        int iCurrentWordCount = 1;

        while (scanner.hasNext())
            wordlist.add(scanner.next());

        // Sort in descending natural order
        Collections.sort(wordlist);
        Collections.reverse(wordlist);

        for (String temp : wordlist)
        {
            currentword = temp;

            // If this word was seen before, increase the count & print the
            // word to stdout
            if (currentword.equals(pastword))
            {
                iCurrentWordCount++;
                System.out.println(currentword);
            }
            // Output the count of the LAST word to stdout,
            // reset our counter
            else // if (!currentword.equals(pastword))
            {
                if (pastword != null)
                    System.out.println("Count for " + pastword + ": " + iCurrentWordCount);
                System.out.println(currentword);
                iCurrentWordCount = 1;
            }
            pastword = currentword;
        } // end for loop

        System.out.println("Count for " + currentword + ": " + iCurrentWordCount);
    } // end function main()
} // end class Sort2
There are more idiomatic ways of reading in all the words in a file in Java.
BreakIterator is a better way of reading in words from an input.
Use List<String> instead of Array in almost all cases. Array isn't technically part of the Collection API and isn't as easy to replace implementations as List, Set and Map are.
You should use a Map<String, AtomicInteger> to do your word counting instead of walking the array over and over. AtomicInteger is mutable, unlike Integer, so you can just incrementAndGet() in a single operation that happens to be thread safe. A SortedMap implementation would give you the words in order with their counts as well (a sketch follows after this list).
Make as many variables as possible final, even local ones, and declare them right before you use them, not at the top where their intended scope will get lost.
You should almost always use a BufferedReader or BufferedStream with an appropriate buffer size equal to a multiple of your disk block size when doing disk IO.
That said, don't concern yourself with micro optimizations until you have "correct" behavior.
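A minimal sketch of the Map<String, AtomicInteger> counting idea from the list above (variable names are illustrative):
// One pass over the words: computeIfAbsent creates the counter on first
// sight, incrementAndGet bumps it in place with no Integer re-boxing.
final SortedMap<String, AtomicInteger> counts = new TreeMap<>();
for (final String word : words) {
    counts.computeIfAbsent(word, k -> new AtomicInteger(0)).incrementAndGet();
}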
the SortedMap type might be efficient enough memory-wise to use here in the form SortedMap<String,Integer> (especially if the word counts are likely to be under 128)
you can provide custom delimiters to the Scanner type for breaking streams
Depending on how you want to treat the data, you might also want to strip punctuation or go for more advanced word isolation with a break iterator - see the java.text package or the ICU project.
Also - I recommend declaring variables when you first assign them and stop assigning unwanted null values.
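A small sketch of that java.text.BreakIterator word isolation (locale and sample text are illustrative):
import java.text.BreakIterator;
import java.util.Locale;

public class WordIsolation {
    public static void main(String[] args) {
        String text = "This phrase, contains... funny stuff";
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        // Walk the boundary pairs; keep only segments that start with a
        // letter, skipping the punctuation and whitespace segments.
        for (int start = it.first(), end = it.next();
                end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String token = text.substring(start, end);
            if (Character.isLetter(token.charAt(0))) {
                System.out.println(token);
            }
        }
    }
}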
To elaborate, you can count words in a map like this:
void increment(Map<String, Integer> wordCountMap, String word) {
    Integer count = wordCountMap.get(word);
    wordCountMap.put(word, count == null ? 1 : ++count);
}
Due to the immutability of Integer and the behaviour of autoboxing, this might result in excessive object instantiation for large data sets. An alternative would be (as others suggest) to use a mutable int wrapper (of which AtomicInteger is a form.)
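A sketch of that mutable-wrapper alternative, using a single-element int[] as the cheapest possible wrapper (an illustrative choice):
// The array is created once per distinct word; afterwards each
// increment mutates it in place with no boxing at all.
Map<String, int[]> counts = new HashMap<>();
for (String word : words) {
    counts.computeIfAbsent(word, k -> new int[1])[0]++;
}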
Can you use Guava for your homework assignment? Multiset handles the counting. Specifically, LinkedHashMultiset might be useful.
Some other things you might find interesting:
To read the file you could use a BufferedReader (if it's text only).
This:
for (int i = 0; i < saSplit.length; i++) {
    currentword = saSplit[i];
    [...]
}
This could be done using an extended for-loop (the Java foreach), as shown here.
if (currentword.equals(pastword)) {
    [...]
} else if (!currentword.equals(pastword)) {
    [...]
}
In your case, you can simply use a single else so the condition isn't checked again (because if the words aren't the same, they can only be different).
if ( !pastword.equals("") )
I think using length is faster here:
if (pastword.length() != 0)
Input method:
Make it easier on yourself and deal directly with characters instead of bytes. For example, you could use a FileReader and possibly wrap it inside a BufferedReader. At the least, I'd suggest looking at InputStreamReader, as the implementation to change from bytes to characters is already done for you. My preference would be using Scanner.
I would prefer returning null or throwing an exception from your readIn() method. Exceptions should not be used for flow control, but, here, you're sending an important message back to the caller: the file that you provided was not valid. Which brings me to another point: consider whether you truly want to catch all exceptions, or just ones of certain types. You'll have to handle all checked exceptions, but you may want to handle them differently.
Collections:
You're not really using Collections classes; you're using an array. Your implementation seems fine, but...
There are certainly many ways of handling this problem. Your method -- sorting then comparing to last -- is O(nlogn) on average. That's certainly not bad. Look at a way of using a Map implementation (such as HashMap) to store the data you need while only traversing the text in O(n) (HashMap's get() and put() -- and presumably contains() -- methods are O(1)).
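A minimal sketch of that map-based approach, with one counting pass and a single descending sort of the distinct words for output (names are illustrative):
// O(n) counting pass with a HashMap, then sort only the distinct words.
final Map<String, Integer> counts = new HashMap<>();
for (final String word : words) {
    counts.merge(word, 1, Integer::sum);
}
final List<String> sorted = new ArrayList<>(counts.keySet());
sorted.sort(Collections.reverseOrder());
for (final String word : sorted) {
    System.out.println("Word Count for " + word + ": " + counts.get(word));
}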

Java - Count words in two documents

3 - Now I have to see whether there is any word in the current file from the above terms or not; if yes, then I will count it.
Now this is my problem: I'm stuck on step 3 :(
I have some idea of how to count words with a TreeMap (treemap.containsKey etc.), but it would be a global count, not a local count for each file :(
Any pseudo code?
One possibility would be to have one map for each file, e.g. stored in a map again.
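A minimal sketch of that idea (names are illustrative): an outer map keyed by file name, where each value is that file's own word-count map.
static Map<String, Map<String, Integer>> countPerFile(List<String> fileNames)
        throws FileNotFoundException {
    Map<String, Map<String, Integer>> countsByFile = new TreeMap<>();
    for (String fileName : fileNames) {
        Map<String, Integer> counts = new TreeMap<>();
        try (Scanner in = new Scanner(new File(fileName))) {
            while (in.hasNext()) {
                // merge() inserts 1 for a new word, otherwise adds 1
                counts.merge(in.next(), 1, Integer::sum);
            }
        }
        countsByFile.put(fileName, counts);
    }
    return countsByFile;
}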
It's unclear to me, but I'm assuming your "two documents" refers to Document A, containing all possible terms, of which you are not interested in the occurrence count, and Document B, containing some or all of those terms, of which you are interested in the occurrence count, provided they also appear in Document A.
I'm not sure that this is what you want, but it's my best guess from the way you've worded your question.
Your end result could be a Map (TreeMap if you prefer) where the string is the word, and the Integer is the occurrence count.
So you would first read through Document A, doing a map.put(word, 0); for every word. Each duplicated word would simply replace the existing entry in the map. You could test for existence first, but I don't think this would make much of a performance difference.
You have now completed your steps 1 and 2.
Now you need to read through your Document B, and for every word:
check its existence in the map
if it exists, increment the value
i.e.: if (map.containsKey(word)) map.put(word, map.get(word) + 1)
You have now completed your step 3 and have a map containing only the words contained in Document A, and their occurrence count in Document B.
If I've misunderstood your requirements I'm sure you could adapt this to fit.
EDIT
If you just want to count the words in one document, the pseudo code becomes:
for (word)
    if (map.containsKey(word))
        map.put(word, map.get(word) + 1)
    else
        map.put(word, 1)
I.e., for every word you hit, you increment its count by one. If the word hasn't been hit before, you initialise it in your map with a count of one.
At the end of this process you have a map containing each word in the document and its occurrence count.
He asked kind of the same thing in this topic: Java loop and increment problem
Assuming you'll have one word on each line and the last line of the input contains "-1" to break the loop:
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Scanner;

public class StackOverflow {

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        Map<String, Integer> countedWords = new HashMap<String, Integer>();
        int numberOfWords = 0;
        String word = "";
        while (true) {
            word = scanner.nextLine();
            if (word.equalsIgnoreCase("-1")) {
                break;
            }
            if (countedWords.containsKey(word)) {
                numberOfWords = countedWords.get(word);
                countedWords.put(word, ++numberOfWords);
            } else {
                countedWords.put(word, 1);
            }
        }
        Iterator it = countedWords.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry pairs = (Map.Entry) it.next();
            System.out.println(pairs.getKey() + " = " + pairs.getValue());
        }
    }
}
