Java - Count words in two documents

3 - Now I have to check whether any of the terms above appear in the current file, and if so, count them.
This is my problem: I am stuck on step 3.
I have some idea of how to count words with a TreeMap (treemap.containsKey etc.), but that would give a global count, not a separate count for each file.
Any pseudo code?

One possibility would be to have one map for each file, e.g. stored in a map again.
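A minimal sketch of that idea, assuming each file's text is already available as a String. The names PerFileCounts and countPerFile and the file names are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class PerFileCounts {
    // Outer key: file name. Inner map: word -> occurrences in that file only.
    static Map<String, Map<String, Integer>> countPerFile(Map<String, String> textByFile) {
        Map<String, Map<String, Integer>> result = new HashMap<>();
        for (Map.Entry<String, String> file : textByFile.entrySet()) {
            Map<String, Integer> counts = new TreeMap<>();
            for (String word : file.getValue().trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // Start at 1 for a new word, otherwise add 1 to the old count.
                    counts.merge(word, 1, Integer::sum);
                }
            }
            result.put(file.getKey(), counts);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> texts = new HashMap<>();
        texts.put("a.txt", "red green red");
        texts.put("b.txt", "green blue");
        System.out.println(countPerFile(texts));
    }
}
```

Because every file gets its own inner map, a word seen in one file never changes the count recorded for another file.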

It's unclear to me, but I'm assuming your "two documents" refers to Document A, containing all possible terms (whose occurrence counts in Document A itself you are not interested in), and Document B, containing some or all of those terms, whose occurrence counts you are interested in, provided they also appear in Document A.
I'm not sure that this is what you want, but it's my best guess from the way you've worded your question.
Your end result could be a Map<String, Integer> (a TreeMap if you prefer) where the String is the word and the Integer is the occurrence count.
So you would first read through Document A, doing a map.put(word, 0) for every word. Each duplicated word would simply replace the existing entry in the map. You could test for existence first, but I don't think this would make much of a performance difference.
You have now completed your steps 1 and 2.
Now you need to read through your Document B and for every word:
check its existence in the map
if it exists, increment the value
i.e.: if (map.containsKey(word)) map.put(word, map.get(word) + 1);
You have now completed your step 3 and have a map containing only the words contained in Document A, and their occurrence counts in Document B.
If I've misunderstood your requirements, I'm sure you could adapt this to fit.
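The steps above could be sketched in Java roughly as follows. This assumes both documents are already loaded into Strings and that words are separated by whitespace; TermCounter and countTerms are names invented for the example:

```java
import java.util.Map;
import java.util.TreeMap;

class TermCounter {
    // Counts how often each term from documentA occurs in documentB.
    static Map<String, Integer> countTerms(String documentA, String documentB) {
        Map<String, Integer> counts = new TreeMap<>();
        // Steps 1 and 2: register every term with a count of zero.
        for (String term : documentA.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                counts.put(term, 0);
            }
        }
        // Step 3: increment only the words that are known terms.
        for (String word : documentB.trim().split("\\s+")) {
            counts.computeIfPresent(word, (key, count) -> count + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countTerms("apple banana cherry", "banana apple banana kiwi"));
        // prints {apple=1, banana=2, cherry=0}
    }
}
```

Here computeIfPresent plays the role of the containsKey check followed by put: words such as kiwi that are not terms are simply ignored.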
EDIT
If you just want to count the words in one document, the pseudo code becomes:
for (word)
  if (map.containsKey(word))
    map.put(word, map.get(word) + 1)
  else
    map.put(word, 1)
i.e. for every word you hit, you increment its count by one. If the word hasn't been hit before, you initialise it in your map with a count of one.
At the end of this process you have a map containing each word in the document and its occurrence count.

He asked more or less the same thing in this topic: Java loop and increment problem
Assuming you have one word on each line and the last line of the file contains "-1" to break the loop:
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Scanner;
public class StackOverflow {
@SuppressWarnings("unchecked")
public static void main(String[] args) {
Scanner scanner = new Scanner(System.in);
Map<String, Integer> countedWords = new HashMap<String, Integer>();
int numberOfWords = 0;
String word = "";
while (true) {
word = scanner.nextLine();
if (word.equalsIgnoreCase("-1")) {
break;
}
if (countedWords.containsKey(word)) {
numberOfWords = countedWords.get(word);
countedWords.put(word, ++numberOfWords);
} else {
countedWords.put(word, 1);
}
}
// Iterate over the typed entry set so that no casts are needed
Iterator<Map.Entry<String, Integer>> it = countedWords.entrySet().iterator();
while (it.hasNext()) {
Map.Entry<String, Integer> pairs = it.next();
System.out.println(pairs.getKey() + " = " + pairs.getValue());
}
}
}

Related

How can I put multiple parts of a string into a list?

So the goal is to look for patterns like "zip" and "zap" in the string, starting with 'z' and ending with 'p'. Then, for all such strings, delete the middle letter.
What I had in mind was that I use a for loop to check each letter of the string and once it reaches a 'z', it gets the indexOf('p') and puts that and everything in the middle into an ArrayList, while deleting itself from the original string so that indexOf('p') can be found.
How can I do that?
This is my code so far:
package Homework;
import java.util.Scanner;
import java.util.ArrayList;
import java.util.List;
public class ZipZap {
public static void main(String[] args) {
Scanner in = new Scanner(System.in);
List<String> list = new ArrayList<String>();
System.out.print("Write a sentence with no spaces:");
String sen = in.next();
int len = sen.length();
int p1 = sen.indexOf('p');
String word = null;
String idk = null;
for (int i = 0; i < len; i++) {
if (sen.charAt(i) == 'z') {
word = sen.substring(i, p1 + 1);
list.add(word);
idk = sen.replace(word, "");
i = 0;
}
}
}
}
Use this. Here I am using the pattern "\bz.p\b" to find any three-letter word that starts with z and ends with p, with any single character in between:
String s = "Write a sentence with no zip and zap spaces:";
s = s.replaceAll("\\bz.p\\b", "zp");
System.out.println(s);
Output:
Write a sentence with no zp and zp spaces:
Or it can be (note that \w+ is greedy, so inside an unbroken run of letters it matches from the first z to the last p):
s = s.replaceAll("z\\w+p", "zp");
You can test your string here:
https://regex101.com/r/aKaNTJ/2
I think you’re saying that input zipzapityzoop, for example, should be changed to zpzpityzp, with i, a and oo going into the list. Please correct me if I misunderstood your intention.
You are on the right track and seem to understand the basics. The issues I see are minor, but of course you want to fix them:
As @RicharsSchwartz mentions, to find all strings like zip, zap and zoop, you need to find a p after every z you find. When you have found a z at index i, you may use sen.indexOf('p', i + 1) to find a p after the z (the second argument causes the search to begin at that index).
Every time you have found a z, you are setting i back to 0, thus starting over from the beginning of the string. There is no need to do that, and this way your program will never stop.
sen.substring(i, p1 + 1) takes out all of zip, whereas I understood that you only wanted i. You need to adjust the arguments to substring().
Your use of sen.replace(word, "") will replace all occurrences of word. So once you fix your program to take out a from zap, zappa will become zpp (not zppa), and azap will become zp. There is no easy way to remove just one specific occurrence of a substring from a String. I think the solution is to use the StringBuilder class. It has a delete method that removes the part between two specified indices, which is what you need.
Finally, you are assigning the changed string to a different variable, idk, but then you continue to search sen. This is like assigning zpzapityzoop, zipzpityzoop and zipzapityzp to idk in turn, but never zpzpityzp. However, if you use a StringBuilder as I just suggested, just use the same StringBuilder all the way through and you will be fine.
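Putting those points together, one possible shape for the loop is sketched below. It assumes the interpretation above (everything between a z and the next p is removed and collected); ZipZapSketch and collapse are hypothetical names, not part of the original code:

```java
import java.util.ArrayList;
import java.util.List;

class ZipZapSketch {
    // Removes everything between each 'z' and the next 'p',
    // appending every removed piece to 'removed'.
    static String collapse(String input, List<String> removed) {
        StringBuilder sb = new StringBuilder(input);
        int i = 0;
        while (i < sb.length()) {
            if (sb.charAt(i) == 'z') {
                int p = sb.indexOf("p", i + 1); // first 'p' after this 'z'
                if (p == -1) {
                    break; // no closing 'p' left, nothing more to do
                }
                removed.add(sb.substring(i + 1, p));
                sb.delete(i + 1, p); // the 'p' now sits at index i + 1
                i += 2;              // continue right after that 'p'
            } else {
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> removed = new ArrayList<>();
        System.out.println(collapse("zipzapityzoop", removed)); // prints zpzpityzp
        System.out.println(removed);                            // prints [i, a, oo]
    }
}
```

The index only ever moves forward, so the loop terminates, and because the same StringBuilder is searched and modified, each deletion is visible to the next search.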

Word Count Program using HashMaps

import java.io.*;
import java.util.*;
public class ListSetMap2
{
public static void main(String[] args)
{
Map<String, Integer> my_collection = new HashMap<String, Integer>();
Scanner keyboard = new Scanner(System.in);
System.out.println("Enter a file name");
String filenameString = keyboard.nextLine();
File filename = new File(filenameString);
int word_position = 1;
int word_num = 1;
try
{
Scanner data_store = new Scanner(filename);
System.out.println("Opening " + filenameString);
while(data_store.hasNext())
{
String word = data_store.next();
if(word.length() > 5)
{
if(my_collection.containsKey(word))
{
my_collection.get(my_collection.containsKey(word));
Integer p = (Integer) my_collection.get(word_num++);
my_collection.put(word, p);
}
else
{
Integer i = (Integer) my_collection.get(word_num);
my_collection.put(word, i);
}
}
}
}
catch (FileNotFoundException e)
{
System.out.println("Nope!");
}
}
}
I'm trying to write a program that scans a file, logs the words in a HashMap collection, and counts the number of times each word occurs in the document, with only words over 5 characters being counted.
It's a bit of a mess in the middle, but I'm running into issues with how to count the number of times a word occurs while keeping an individual count for each word. I'm sure there is a simple solution here and I'm just missing it. Please help!
Your logic of setting the frequency of word is wrong. Here is a simple approach that should work for you:
// if the word is already present in the hashmap
if (my_collection.containsKey(word)) {
// just increment the current frequency of the word
// this overrides the existing frequency
my_collection.put(word, my_collection.get(word) + 1);
} else {
// since the word is not there just put it with a frequency 1
my_collection.put(word, 1);
}
(Only giving hints, since this seems to be homework.) my_collection is (correctly) a HashMap that maps String keys to Integer values; in your situation, a key is supposed to be a word, and the corresponding value is supposed to be the number of times you have seen that word (frequency). Each time you call my_collection.get(x), the parameter x needs to be a String, namely the word whose frequency you want to know (unfortunately, HashMap doesn't enforce this). Each time you call my_collection.put(x, y), x needs to be a String, and y needs to be an Integer or int, namely the frequency for that word.
Given this, give some more thought to what you're using as parameters, and the sequence in which you need to make the calls and how you need to manipulate the values. For example, if you've already determined that my_collection doesn't contain the word, does it make sense to ask my_collection for the word's frequency? If it does contain the word, how do you need to change the frequency before putting the new value into my_collection?
(Also, please choose a more descriptive name for my_collection, e.g. frequencies.)
Try it this way:
while (data_store.hasNext()) {
String word = data_store.next();
if (word.length() > 5) {
// put() needs both the key (the word) and the new count
if (my_collection.get(word) == null) {
my_collection.put(word, 1);
} else {
my_collection.put(word, my_collection.get(word) + 1);
}
}
}
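For comparison, the same counting can be written more compactly with Map.merge (available since Java 8). This is only a sketch: LongWordCounter and countLongWords are invented names, and the Scanner reads from a String here instead of a file so the example is self-contained:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

class LongWordCounter {
    // Counts only the words that are longer than five characters.
    static Map<String, Integer> countLongWords(Scanner source) {
        Map<String, Integer> frequencies = new HashMap<>();
        while (source.hasNext()) {
            String word = source.next();
            if (word.length() > 5) {
                // Inserts 1 for a new word, otherwise adds 1 to the stored count.
                frequencies.merge(word, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        Scanner source = new Scanner("planet planet short galaxy planet");
        System.out.println(countLongWords(source));
    }
}
```

To use it with the program above, pass new Scanner(filename) instead and keep the FileNotFoundException handling.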

Explain how this permutation works

I found this code for generating permutations in Java:
public class MainClass {
public static void main(String args[]) {
permuteString("", "String");
}
public static void permuteString(String beginningString, String endingString) {
if (endingString.length() <= 1)
System.out.println(beginningString + endingString);
else
for (int i = 0; i < endingString.length(); i++) {
try {
String newString = endingString.substring(0, i) + endingString.substring(i + 1);
permuteString(beginningString + endingString.charAt(i), newString);
} catch (StringIndexOutOfBoundsException exception) {
exception.printStackTrace();
}
}
}
}
I can't understand it, even though I know it is only basic code. Could someone explain it to me to make it clearer? Thank you.
One can construct a permutation, by picking items from a bag repeatedly and thus constructing a sequence. For a string, the bag is a collection of characters. We can use a String to represent this.
If we thus want to construct a permuted string, we first check whether the bag is empty. In the above code, the bag is the endingString and the emptiness check is done with:
if (endingString.length() <= 1)
System.out.println(beginningString + endingString);
As you can see, the check does not test whether the bag is completely empty: as soon as the string has only one character (one element), evidently we will pick that one. So we pick it and print it after the sequence we've already constructed.
Note: because the check is length() <= 1 rather than length() == 1, the empty string is covered as well. It has exactly one permutation (the empty string itself), and the code prints it as an empty line.
Now we need the recursive case. Remember that beginningString stores the sequence we've constructed up till now and endingString stores the characters we can still pick from. A way to pick is to select a valid index i in the endingString. The character at that index is then picked.
We update the sequence (beginningString) by simply appending the character that was at index i, thus:
beginningString + endingString.charAt(i)
The updated bag then contains all the characters before the index and all the ones after it. This is expressed as:
String newString = endingString.substring(0, i) + endingString.substring(i + 1);
newString is here the new bag. We can then do the recursive call to pick the next item from the bag. So for a given index i, in order to pick and call recursively, the code reads:
String newString = endingString.substring(0, i) + endingString.substring(i + 1);
permuteString(beginningString + endingString.charAt(i), newString);
Now since we wish to enumerate over all possible permutations, we loop over all possible indices for i. Since we do this recursively as a consequence, we will enumerate all permutations.
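To see the order in which the permutations come out, here is a small variant of the same recursion that collects the results in a list instead of printing them. The names PermutationSketch, chosen and bag are mine; the logic is meant to mirror the code in the question:

```java
import java.util.ArrayList;
import java.util.List;

class PermutationSketch {
    // 'chosen' is the sequence built so far, 'bag' holds the characters still available.
    static void permute(String chosen, String bag, List<String> out) {
        if (bag.length() <= 1) {
            out.add(chosen + bag); // zero or one character left: only one way to finish
            return;
        }
        for (int i = 0; i < bag.length(); i++) {
            // Take the character at index i out of the bag and append it to the sequence.
            String rest = bag.substring(0, i) + bag.substring(i + 1);
            permute(chosen + bag.charAt(i), rest, out);
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        permute("", "abc", out);
        System.out.println(out); // prints [abc, acb, bac, bca, cab, cba]
    }
}
```

For "abc" the first level picks a, b, or c in turn, and each pick leaves a two-character bag that yields two results, giving 3 * 2 = 6 permutations.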

Counting frequency of words from a .txt file in java

I am working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now, I'm struggling with the method that counts the frequency of words that appears in a .txt file.
I have a set of text files in both English and French in their respective folders labeled 1-20. The method asks for a directory (which in this case is "docs/train/eng/" or "docs/train/fre/") and for how many files that the program should go through (there are 20 files in each folder). Then it reads that file, splits all the words apart (I don't need to worry about capitalization or punctuation), and puts every word in a HashMap along with how many times they were in the file. (Key = word, Value = frequency).
This is the code I came up with for the method:
public static HashMap<String, Integer> countWords(String directory, int nFiles) {
// Declare the HashMap
HashMap<String, Integer> wordCount = new HashMap();
// this large 'for' loop will go through each file in the specified directory.
for (int k = 1; k < nFiles; k++) {
// Puts together the string that the FileReader will refer to.
String learn = directory + k + ".txt";
try {
FileReader reader = new FileReader(learn);
BufferedReader br = new BufferedReader(reader);
// The BufferedReader reads the lines
String line = br.readLine();
// Split the line into a String array to loop through
String[] words = line.split(" ");
int freq = 0;
// for loop goes through every word
for (int i = 0; i < words.length; i++) {
// Case if the HashMap already contains the key.
// If so, just increments the value
if (wordCount.containsKey(words[i])) {
wordCount.put(words[i], freq++);
}
// Otherwise, puts the word into the HashMap
else {
wordCount.put(words[i], freq++);
}
}
// Catching the file not found error
// and any other errors
}
catch (FileNotFoundException fnfe) {
System.err.println("File not found.");
}
catch (Exception e) {
System.err.print(e);
}
}
return wordCount;
}
The code compiles. Unfortunately, when I asked it to print the results of all the word counts for the 20 files, it printed this. It's complete gibberish (though the words are definitely there) and is not at all what I need the method to do.
If anyone could help me debug my code, I would greatly appreciate it. I've been at it for ages, conducting test after test and I'm ready to give up.
Let me combine all the good answers here.
1) Split up your methods so that each handles one thing: one to read the files into String arrays, one to process those arrays, and one to call the first two.
2) When you split, think carefully about how you want to split. As @m0skit0 suggests, you should probably split with \b for this problem.
3) As @jas suggested, you should first check whether your map already has the word. If it does, increment the count; if not, add the word to the map and set its count to 1.
4) To print out the map in the way you likely expect, take a look at the code below:
Map<String, Integer> test = new HashMap<>();
for (Map.Entry<String, Integer> entry : test.entrySet()) {
System.out.println(entry.getKey() + " " + entry.getValue());
}
I would have expected something more like this. Does it make sense?
if (wordCount.containsKey(words[i])) {
int n = wordCount.get(words[i]);
wordCount.put(words[i], ++n);
}
// Otherwise, puts the word into the HashMap
else {
wordCount.put(words[i], 1);
}
If the word is already in the hashmap, we want to get the current count, add 1 to that and replace the word with the new count in the hashmap.
If the word is not yet in the hashmap, we simply put it in the map with a count of 1 to start with. The next time we see the same word we'll up the count to 2, etc.
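If you split by space only, then other characters (parentheses, punctuation marks, etc.) will be included in the words. For example, if you split "This phrase, contains... funny stuff" by space you get: "This" "phrase," "contains..." "funny" and "stuff".
You can avoid this by splitting by word boundary (\b) instead. Note that the resulting array then also contains the separators themselves (spaces and punctuation) as elements, so you need to skip those, or split on runs of non-word characters with \\W+ so that they are dropped.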
line.split("\\b");
Btw your if and else parts are identical. You're always incrementing freq by one, which doesn't make much sense. If the word is already in the map, you want to get the current frequency, add 1 to it, and update the frequency in the map. If not, you put it in the map with a value of 1.
And pro tip: always print/log the full stacktrace for the exceptions.
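Combining the suggestions in this thread, a corrected countWords might look roughly like the sketch below. It is written under a few assumptions: the files are numbered 1 to nFiles inclusive, every line of each file should be counted (not just the first), and words are separated by whitespace as in the question. The class name and the countLine helper are invented for the example:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

class WordFrequencySketch {
    // Adds the words of one line to the running counts.
    static void countLine(Map<String, Integer> counts, String line) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                // New word: store 1. Known word: add 1 to its current count.
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    static Map<String, Integer> countWords(String directory, int nFiles) {
        Map<String, Integer> counts = new HashMap<>();
        for (int k = 1; k <= nFiles; k++) { // <= so that the last file is included
            String path = directory + k + ".txt";
            try {
                for (String line : Files.readAllLines(Paths.get(path))) {
                    countLine(counts, line);
                }
            } catch (IOException e) {
                e.printStackTrace(); // full stack trace, as recommended above
            }
        }
        return counts;
    }
}
```

A call such as countWords("docs/train/eng/", 20) would then return one combined map for the English folder; Files.readAllLines decodes the files as UTF-8, which is an assumption about how they were saved.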

More efficient or more modern? Reading in & Sorting A Text File With Java

I've been trying to upgrade my Java skills to use more of Java 5 & Java 6. I've been playing around with some programming exercises. I was asked to read in a paragraph from a text file and output a sorted (descending) list of words and output the count of each word.
My code is below.
My questions are:
Is my file input routine the most respectful of JVM resources?
Is it possible to cut steps out in regards to reading the file contents and getting the content into a collection that can make a sorted list of words?
Am I using the Collection classes and interface the most efficient way I can?
Thanks much for any opinions. I'm just trying to have some fun and improve my programming skills.
import java.io.*;
import java.util.*;
public class Sort
{
public static void main(String[] args)
{
String sUnsorted = null;
String[] saSplit = null;
int iCurrentWordCount = 1;
String currentword = null;
String pastword = "";
// Read the text file into a string
sUnsorted = readIn("input1.txt");
// Parse the String by white space into String array of single words
saSplit = sUnsorted.split("\\s+");
// Sort the String array in descending order
java.util.Arrays.sort(saSplit, Collections.reverseOrder());
// Count the occurrences of each word in the String array
for (int i = 0; i < saSplit.length; i++ )
{
currentword = saSplit[i];
// If this word was seen before, increase the count & print the
// word to stdout
if ( currentword.equals(pastword) )
{
iCurrentWordCount ++;
System.out.println(currentword);
}
// Output the count of the LAST word to stdout,
// Reset our counter
else if (!currentword.equals(pastword))
{
if ( !pastword.equals("") )
{
System.out.println("Word Count for " + pastword + ": " + iCurrentWordCount);
}
System.out.println(currentword );
iCurrentWordCount = 1;
}
pastword = currentword;
}// end for loop
// Print out the count for the last word processed
System.out.println("Word Count for " + currentword + ": " + iCurrentWordCount);
}// end function main()
// Read The Input File Into A String
public static String readIn(String infile)
{
String result = " ";
try
{
FileInputStream file = new FileInputStream (infile);
DataInputStream in = new DataInputStream (file);
byte[] b = new byte[ in.available() ];
in.readFully (b);
in.close ();
result = new String (b, 0, b.length, "US-ASCII");
}
catch ( Exception e )
{
e.printStackTrace();
}
return result;
}// end function readIn()
}// end class Sort()
/////////////////////////////////////////////////
// Updated Copy 1, Based On The Useful Comments
//////////////////////////////////////////////////
import java.io.*;
import java.util.*;
public class Sort2
{
public static void main(String[] args) throws Exception
{
// Scanner will tokenize on white space, like we need
Scanner scanner = new Scanner(new FileInputStream("input1.txt"));
ArrayList <String> wordlist = new ArrayList<String>();
String currentword = null;
String pastword = null;
int iCurrentWordCount = 1;
while (scanner.hasNext())
wordlist.add(scanner.next() );
// Sort in descending natural order
Collections.sort(wordlist);
Collections.reverse(wordlist);
for ( String temp : wordlist )
{
currentword = temp;
// If this word was seen before, increase the count & print the
// word to stdout
if ( currentword.equals(pastword) )
{
iCurrentWordCount ++;
System.out.println(currentword);
}
// Output the count of the LAST word to stdout,
// Reset our counter
else //if (!currentword.equals(pastword))
{
if ( pastword != null )
System.out.println("Count for " + pastword + ": " +
iCurrentWordCount);
System.out.println(currentword );
iCurrentWordCount = 1;
}
pastword = currentword;
}// end for loop
System.out.println("Count for " + currentword + ": " + iCurrentWordCount);
}// end function main()
}// end class Sort2
There are more idiomatic ways of reading in all the words in a file in Java.
BreakIterator is a better way of reading in words from an input.
Use List<String> instead of an array in almost all cases. Arrays aren't technically part of the Collections API, and their implementation can't be swapped out as easily as that of a List, Set or Map.
You should use a Map<String,AtomicInteger> to do your word counting instead of walking the array over and over. AtomicInteger is mutable, unlike Integer, so you can just call incrementAndGet() in a single operation that just happens to be thread safe. A SortedMap implementation would give you the words in order with their counts as well.
Make as many variables final as possible, even local ones, and declare them right before you use them, not at the top where their intended scope gets lost.
You should almost always use a BufferedReader or BufferedInputStream with an appropriate buffer size, equal to a multiple of your disk block size, when doing disk IO.
That said, don't concern yourself with micro optimizations until you have "correct" behavior.
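As a rough illustration of the Map<String,AtomicInteger> and SortedMap suggestions, assuming whitespace-separated words and descending natural order as in the question (SortedWordCount and count are made-up names, and the Scanner reads a String here rather than input1.txt):

```java
import java.util.Collections;
import java.util.Scanner;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;

class SortedWordCount {
    // Returns the words in descending order, each with a mutable counter.
    static SortedMap<String, AtomicInteger> count(Scanner scanner) {
        SortedMap<String, AtomicInteger> counts = new TreeMap<>(Collections.reverseOrder());
        while (scanner.hasNext()) {
            // Create the counter on first sight of a word, then bump it in place.
            counts.computeIfAbsent(scanner.next(), word -> new AtomicInteger()).incrementAndGet();
        }
        return counts;
    }

    public static void main(String[] args) {
        SortedMap<String, AtomicInteger> counts = count(new Scanner("pear apple pear fig"));
        counts.forEach((word, n) -> System.out.println(word + ": " + n));
        // prints "pear: 2", "fig: 1", "apple: 1" on separate lines
    }
}
```

The TreeMap keeps the keys sorted as they are inserted, so no separate sort step or comparison with the previous word is needed.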
The SortedMap type might be efficient enough memory-wise to use here in the form SortedMap<String,Integer> (especially if the word counts are likely to be under 128, since autoboxing reuses cached Integer objects for values from -128 to 127).
You can provide custom delimiters to the Scanner type for breaking streams.
Depending on how you want to treat the data, you might also want to strip punctuation or go for more advanced word isolation with a break iterator - see the java.text package or the ICU project.
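A small sketch of word isolation with java.text.BreakIterator, in case it helps. BreakIteratorWords and words are names chosen for the example, and treating a segment as a word only when it starts with a letter or digit is my own simplification:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class BreakIteratorWords {
    // Splits text at word boundaries and keeps only the segments that look like words.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getWordInstance();
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            // Spaces and punctuation come back as segments too; skip them.
            if (Character.isLetterOrDigit(text.charAt(start))) {
                result.add(text.substring(start, end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! Hello."));
        // prints [Hello, world, Hello]
    }
}
```

Unlike splitting on whitespace, this keeps trailing commas and full stops out of the words, so "Hello," and "Hello." are counted as the same word.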
Also, I recommend declaring variables when you first assign them, rather than assigning unwanted null values up front.
To elaborate, you can count words in a map like this:
void increment(Map<String, Integer> wordCountMap, String word) {
Integer count = wordCountMap.get(word);
wordCountMap.put(word, count == null ? 1 : ++count);
}
Due to the immutability of Integer and the behaviour of autoboxing, this might result in excessive object instantiation for large data sets. An alternative would be (as others suggest) to use a mutable int wrapper (of which AtomicInteger is a form.)
Can you use Guava for your homework assignment? Multiset handles the counting. Specifically, LinkedHashMultiset might be useful.
Some other things you might find interesting:
To read the file you could use a BufferedReader (if it's text only).
This:
for (int i = 0; i < saSplit.length; i++ ){
currentword = saSplit[i];
[...]
}
This could be done using an enhanced for loop (Java's for-each).
if ( currentword.equals(pastword) ){
[...]
} else if (!currentword.equals(pastword)) {
[...]
}
In your case, you can simply use a single else so the condition isn't checked again (because if the words aren't the same, they can only be different).
if ( !pastword.equals("") )
I think using length() is faster here:
if (pastword.length() != 0)
Input method:
Make it easier on yourself and deal directly with characters instead of bytes. For example, you could use a FileReader and possibly wrap it inside a BufferedReader. At the least, I'd suggest looking at InputStreamReader, as the implementation to change from bytes to characters is already done for you. My preference would be using Scanner.
I would prefer returning null or throwing an exception from your readIn() method. Exceptions should not be used for flow control, but, here, you're sending an important message back to the caller: the file that you provided was not valid. Which brings me to another point: consider whether you truly want to catch all exceptions, or just ones of certain types. You'll have to handle all checked exceptions, but you may want to handle them differently.
Collections:
You're not really using the Collections classes; you're using an array. Your implementation seems fine, but...
There are certainly many ways of handling this problem. Your method (sorting, then comparing each word to the previous one) is O(n log n) on average. That's certainly not bad. Look at a way of using a Map implementation (such as HashMap) to store the data you need while only traversing the text in O(n): HashMap's get(), put() and containsKey() methods are O(1) on average.
