The code:
import java.io.File;
import java.util.Scanner;

class Main {
    public static void main(String[] args) throws Exception {
        // code
        int max = 0;
        int count = 0;
        String rep_word = "none";
        File myfile = new File("rough.txt");
        Scanner reader = new Scanner(myfile);
        Scanner sub_reader = new Scanner(myfile);
        while (reader.hasNextLine()) {
            String each_word = reader.next();
            while (sub_reader.hasNextLine()) {
                String check = sub_reader.next();
                if (check == each_word) {
                    count += 1;
                }
            }
            if (max < count) {
                max = count;
                rep_word = each_word;
            }
        }
        System.out.println(rep_word);
        reader.close();
        sub_reader.close();
    }
}
the rough.txt file:
I want to return the most repetitive word from the text file without using arrays.
I'm not getting the desired output. I found that the if statement is never satisfied, even when the variables 'check' and 'each_word' hold the same word. I don't understand where I went wrong.
You should use a HashMap to count the frequency of each word quickly and efficiently, without repeatedly re-reading the input file with two readers.
To do this, the Map::merge method is used; it also returns the current frequency of the word, so the maximum frequency can be tracked immediately.
int max = 0;
int count = 0;
String rep_word = "none";
// use LinkedHashMap to maintain insertion order
Map<String, Integer> freqMap = new LinkedHashMap<>();
// use try-with-resources to automatically close the scanner
try (Scanner reader = new Scanner(new File("rough.txt"))) {
    while (reader.hasNext()) {
        String word = reader.next();
        count = freqMap.merge(word, 1, Integer::sum);
        if (count > max) {
            max = count;
            rep_word = word;
        }
    }
}
System.out.println(rep_word + " repeated " + max + " times");
If several words share the same maximum frequency, it is easy to find all of them in the map:
for (Map.Entry<String, Integer> entry : freqMap.entrySet()) {
    if (max == entry.getValue()) {
        System.out.println(entry.getKey() + " repeated " + max + " times");
    }
}
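As an aside, the direct cause of the failing if in the posted code is that == compares String references (object identity), not contents; String::equals must be used for content comparison. A minimal self-contained illustration:

```java
public class StringCompareDemo {
    public static void main(String[] args) {
        // Two strings with identical contents but distinct object identities
        String a = new String("hello");
        String b = new String("hello");

        System.out.println(a == b);      // false: different objects
        System.out.println(a.equals(b)); // true: same contents
    }
}
```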
You could use a HashMap to store your text as key-value pairs: the key is a word, and the value holds its number of occurrences. Then get the key with the maximum value.
Something like the following:
class Main {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> map = new HashMap<>();
        File myfile = new File("rough.txt");
        Scanner reader = new Scanner(myfile);
        while (reader.hasNextLine()) {
            Scanner sub_reader = new Scanner(reader.nextLine());
            while (sub_reader.hasNext()) {
                String word = sub_reader.next();
                // if the word already exists, increment the counter
                if (map.containsKey(word)) map.put(word, map.get(word) + 1);
                else map.put(word, 1);
            }
            sub_reader.close();
        }
        // get the key of the max value in the map (Java 8 and higher)
        String mostRepeated = map.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
        System.out.println(mostRepeated);
        reader.close();
    }
}
I am working on a project in which I need to find the frequency of each word in a large corpus of over 100 million Bengali words. The file size is around 2 GB. I actually need the 20 most frequent and the 20 least frequent words with their frequency counts. I wrote the same code in PHP, but it was taking too long (the code was still running after a week), so I am trying to do this in Java.
In this code, it should work as follows:
- read a line from the corpus nahidd_filtered.txt
- split it on whitespace
- for each split word, read the whole frequency file freq3.txt
- if the word is found, increase its frequency count and store it in that file
- else count = 1 (new word) and store the frequency count in that file
I have tried to read chunks of text from the nahidd_filtered.txt corpus in a loop, storing each word with its frequency in freq3.txt. The freq3.txt file stores frequency counts like this:
Word1 Frequency1 (single whitespace in between)
Word2 Frequency2
...........
Simply speaking, I need the 20 most frequent and the 20 least frequent words, along with their frequency counts, from the large UTF-8 encoded corpus file. Please check the code and suggest why it is not working, or offer any other suggestions. Thank you very much.
import java.io.*;
import java.util.*;
import java.util.concurrent.TimeUnit;

public class Main {
    private static String fileToString(String filename) throws IOException {
        FileInputStream inputStream = null;
        Scanner reader = null;
        inputStream = new FileInputStream(filename);
        reader = new Scanner(inputStream, "UTF-8");
        /*BufferedReader reader = new BufferedReader(new FileReader(filename));*/
        StringBuilder builder = new StringBuilder();
        // For every line in the file, append it to the string builder
        while (reader.hasNextLine()) {
            String line = reader.nextLine();
            builder.append(line);
        }
        reader.close();
        return builder.toString();
    }

    public static final String UTF8_BOM = "\uFEFF";

    private static String removeUTF8BOM(String s) {
        if (s.startsWith(UTF8_BOM)) {
            s = s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) throws IOException {
        long startTime = System.nanoTime();
        System.out.println("-------------- Start Contents of file: ---------------------");
        FileInputStream inputStream = null;
        Scanner sc = null;
        String path = "C:/xampp/htdocs/thesis_freqeuncy_2/nahidd_filtered.txt";
        try {
            inputStream = new FileInputStream(path);
            sc = new Scanner(inputStream, "UTF-8");
            int countWord = 0;
            BufferedWriter writer = null;
            while (sc.hasNextLine()) {
                String word = null;
                String line = sc.nextLine();
                String[] wordList = line.split("\\s+");
                for (int i = 0; i < wordList.length; i++) {
                    word = wordList[i].replace("।", "");
                    word = word.replace(",", "").trim();
                    ArrayList<String> freqword = new ArrayList<>();
                    String freq = fileToString("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt");
                    /*freqword = freq.split("\\r?\\n");*/
                    Collections.addAll(freqword, freq.split("\\r?\\n"));
                    int flag = 0;
                    String[] freqwordsp = null;
                    int k;
                    for (k = 0; k < freqword.size(); k++) {
                        freqwordsp = freqword.get(k).split("\\s+");
                        String word2 = freqwordsp[0];
                        word = removeUTF8BOM(word);
                        word2 = removeUTF8BOM(word2);
                        word.replaceAll("\\P{Print}", "");
                        word2.replaceAll("\\P{Print}", "");
                        if (word2.toString().equals(word.toString())) {
                            flag = 1;
                            break;
                        }
                    }
                    int count = 0;
                    if (flag == 1) {
                        count = Integer.parseInt(freqwordsp[1]);
                    }
                    count = count + 1;
                    word = word + " " + count + "\n";
                    freqword.add(word);
                    System.out.println(freqword);
                    writer = new BufferedWriter(new FileWriter("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt"));
                    writer.write(String.valueOf(freqword));
                }
            }
            // writer.close();
            System.out.println(countWord);
            System.out.println("-------------- End Contents of file: ---------------------");
            long endTime = System.nanoTime();
            long totalTime = (endTime - startTime);
            System.out.println(TimeUnit.MINUTES.convert(totalTime, TimeUnit.NANOSECONDS));
            // note that Scanner suppresses exceptions
            if (sc.ioException() != null) {
                throw sc.ioException();
            }
        } finally {
            if (inputStream != null) {
                inputStream.close();
            }
            if (sc != null) {
                sc.close();
            }
        }
    }
}
First of all:
for each split word, read the whole frequency file freq3.txt
Don't do it! Disk I/O operations are very, very slow. Do you have enough memory to read the file into memory? It seems so:
String freq = fileToString("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt");
Collections.addAll(freqword, freq.split("\\r?\\n"));
If you really need this file, then load it once and work in memory. In this case a Map (word to frequency) may also be more convenient than a List. Save the collection to disk when the calculations are done.
Next, you could buffer your input stream, which may significantly improve performance:
inputStream = new BufferedInputStream(new FileInputStream(path));
And don't forget to close the stream/reader/writer, either explicitly or by using the try-with-resources statement.
Generally speaking, the code may be simplified depending on the API used. For example:
import java.io.*;
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DemoApplication {

    public static final String UTF8_BOM = "\uFEFF";

    private static String removeUTF8BOM(String s) {
        if (s.startsWith(UTF8_BOM)) {
            s = s.substring(1);
        }
        return s;
    }

    private static final String PATH = "words.txt";
    private static final String REGEX = " ";

    public static void main(String[] args) throws IOException {
        Map<String, Long> frequencyMap;
        try (BufferedReader reader = new BufferedReader(new FileReader(PATH))) {
            frequencyMap = reader
                    .lines()
                    .flatMap(s -> Arrays.stream(s.split(REGEX)))
                    .map(DemoApplication::removeUTF8BOM)
                    .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        }
        // prints the 20 least frequent words
        frequencyMap
                .entrySet()
                .stream()
                .sorted(Comparator.comparingLong(Map.Entry::getValue))
                .limit(20)
                .forEach(System.out::println);
    }
}
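The snippet above prints the 20 least frequent words; the 20 most frequent can be obtained from the same map by sorting with a reversed comparator. A small self-contained sketch (the class name and the tiny hard-coded map are placeholders for illustration):

```java
import java.util.*;
import java.util.stream.*;

public class TopWordsDemo {
    // Returns the `limit` most frequent entries, highest count first
    public static List<Map.Entry<String, Long>> top(Map<String, Long> frequencyMap, int limit) {
        return frequencyMap.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> frequencyMap = new HashMap<>();
        frequencyMap.put("a", 5L);
        frequencyMap.put("b", 2L);
        frequencyMap.put("c", 9L);
        // prints the two most frequent entries: c=9, then a=5
        top(frequencyMap, 2).forEach(System.out::println);
    }
}
```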
I would like to read through a text document and then add only the unique words to an ArrayList of Word objects. It appears that the code I have now does not enter any words at all into the wordList ArrayList.
public ArrayList<Word> wordList = new ArrayList<Word>();
String fileName, word;
int counter;
Scanner reader = null;
Scanner scanner = new Scanner(System.in);

try {
    reader = new Scanner(new FileInputStream(fileName));
}
catch (FileNotFoundException e) {
    System.out.println("The file could not be found. The program will now exit.");
    System.exit(0);
}

while (reader.hasNext()) {
    word = reader.next().toLowerCase();
    for (Word value : wordList) {
        if (value.getValue().contains(word)) {
            Word newWord = new Word(word);
            wordList.add(newWord);
        }
    }
    counter++;
}
public class Word {
    String value;
    int frequency;

    public Word(String v) {
        value = v;
        frequency = 1;
    }

    public String getValue() {
        return value;
    }

    public String toString() {
        return value + " " + frequency;
    }
}
Alright, let's start by fixing your current code. The issue is that you only add a new Word object to the list when one already exists. Instead, you need to add a new Word object when none exists, and increment the frequency otherwise. Here is an example fix:
ArrayList<Word> wordList = new ArrayList<Word>();
String fileName, word;
Scanner reader = null;
Scanner scanner = new Scanner(System.in);

try {
    reader = new Scanner(new FileInputStream(fileName));
}
catch (FileNotFoundException e) {
    System.out.println("The file could not be found. The program will now exit.");
    System.exit(0);
}

while (reader.hasNext()) {
    word = reader.next().toLowerCase();
    boolean wordExists = false;
    for (Word value : wordList) {
        // We have seen the word before, so increase its frequency.
        if (value.getValue().equals(word)) {
            value.frequency++;
            wordExists = true;
            break;
        }
    }
    // This is the first time we have seen the word!
    if (!wordExists) {
        Word newValue = new Word(word);
        newValue.frequency = 1;
        wordList.add(newValue);
    }
}
However, this is a really bad solution (O(n^2) runtime). Instead, we should use a data structure known as a Map, which brings the runtime down to O(n):
ArrayList<Word> wordList = new ArrayList<Word>();
String fileName, word;
Scanner reader = null;
Scanner scanner = new Scanner(System.in);

try {
    reader = new Scanner(new FileInputStream(fileName));
}
catch (FileNotFoundException e) {
    System.out.println("The file could not be found. The program will now exit.");
    System.exit(0);
}

Map<String, Integer> frequencyMap = new HashMap<String, Integer>();
while (reader.hasNext()) {
    word = reader.next().toLowerCase();
    // This is equivalent to searching every word in the list, via hashing (O(1))
    if (!frequencyMap.containsKey(word)) {
        frequencyMap.put(word, 1);
    } else {
        // We have already seen the word, increase its frequency.
        frequencyMap.put(word, frequencyMap.get(word) + 1);
    }
}

// Convert our map of word->frequency to a list of Word objects.
for (Map.Entry<String, Integer> entry : frequencyMap.entrySet()) {
    Word w = new Word(entry.getKey());
    w.frequency = entry.getValue();
    wordList.add(w);
}
Your for-each loop is iterating over wordList, but that is an empty ArrayList, so your code will never reach the wordList.add(newWord); line.
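To illustrate the fix: the list must be populated by adding a word the first time it is seen, and only incrementing on repeats. A self-contained sketch (the class names are placeholders, and a simplified Word class is reproduced here so the snippet compiles on its own):

```java
import java.util.*;

public class UniqueWordsDemo {
    static class Word {
        String value;
        int frequency;
        Word(String v) { value = v; frequency = 1; }
    }

    // Adds each distinct word to the list once, incrementing its frequency on repeats
    public static List<Word> collect(List<String> words) {
        List<Word> wordList = new ArrayList<>();
        for (String word : words) {
            boolean found = false;
            for (Word w : wordList) {
                if (w.value.equals(word)) {
                    w.frequency++;
                    found = true;
                    break;
                }
            }
            if (!found) {
                wordList.add(new Word(word));
            }
        }
        return wordList;
    }

    public static void main(String[] args) {
        // prints "hello 2" then "my 1"
        for (Word w : collect(Arrays.asList("hello", "my", "hello"))) {
            System.out.println(w.value + " " + w.frequency);
        }
    }
}
```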
I appreciate that perhaps you wanted a critique of why your algorithm wasn't working, or maybe it was an example of a much larger problem, but if all you want to do is count occurrences, there is a much simpler way of doing it.
Using streams in Java 8, you can boil this down to one method: create a stream of the lines in the file, lowercase them, and then use a collector to count them.
public static void main(final String args[]) throws IOException
{
    final File file = new File(System.getProperty("user.home") + File.separator + "Desktop" + File.separator + "myFile.txt");
    for (final Entry<String, Long> entry : countWordsInFile(file).entrySet())
    {
        System.out.println(entry);
    }
}

public static Map<String, Long> countWordsInFile(final File file) throws IOException
{
    return Files.lines(file.toPath())
            .map(String::toLowerCase)
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
I've not done anything with Streams until now so any critique welcome.
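Two small improvements worth noting on the stream version above: Files.lines keeps the underlying file open until the stream is closed, so wrapping it in try-with-resources is safer, and splitting each line on whitespace makes it count individual words rather than whole lines. A sketch under those assumptions (the class name is hypothetical):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CountWordsDemo {
    // Counts lowercase word occurrences in a stream of lines
    public static Map<String, Long> countWords(Stream<String> lines) {
        return lines
                .flatMap(line -> Stream.of(line.split("\\s+")))
                .map(String::toLowerCase)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static Map<String, Long> countWordsInFile(Path path) throws IOException {
        // try-with-resources closes the file handle behind Files.lines
        try (Stream<String> lines = Files.lines(path)) {
            return countWords(lines);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("words", ".txt");
        Files.write(tmp, java.util.Arrays.asList("Hello my", "hello"));
        System.out.println(countWordsInFile(tmp)); // e.g. {hello=2, my=1}
        Files.delete(tmp);
    }
}
```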
I've created a program that will look at a text file in a certain directory and then list the words in that file.
So, for example, if my text file contained this:
hello my name is john hello my
The output would show:
hello 2
my 2
name 1
is 1
john 1
However, now I want my program to search through multiple text files in a directory and list all the words that occur across all of the text files.
Here is my program that lists the words in a single file:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Scanner;

public class WordCountstackquestion implements Runnable {
    private String filename;

    public WordCountstackquestion(String filename) {
        this.filename = filename;
    }

    public void run() {
        int count = 0;
        try {
            HashMap<String, Integer> map = new HashMap<String, Integer>();
            Scanner in = new Scanner(new File(filename));
            while (in.hasNext()) {
                String word = in.next();
                if (map.containsKey(word))
                    map.put(word, map.get(word) + 1);
                else {
                    map.put(word, 1);
                }
                count++;
            }
            System.out.println(filename + " : " + count);
            for (String word : map.keySet()) {
                System.out.println(word + " " + map.get(word));
            }
        } catch (FileNotFoundException e) {
            System.out.println(filename + " was not found.");
        }
    }
}
My main class.
public class Mainstackquestion
{
    public static void main(String args[])
    {
        if (args.length > 0)
        {
            for (String filename : args)
            {
                CheckFile(filename);
            }
        }
        else
        {
            CheckFile("C:\\Users\\User\\Desktop\\files\\1.txt");
        }
    }

    private static void CheckFile(String file)
    {
        Runnable tester = new WordCountstackquestion(file);
        Thread t = new Thread(tester);
        t.start();
    }
}
I've made an attempt, using some online sources, to write a method that will look at multiple files, but I'm struggling and can't seem to implement it correctly in my program.
I would have a worker class for each file.
int count;

@Override
public void run()
{
    count = 0;
    /* Count the words... */
    ...
    ++count;
    ...
}
Then this method to use them.
public static void main(String args[]) throws InterruptedException
{
    WordCount[] counters = new WordCount[args.length];
    for (int idx = 0; idx < args.length; ++idx) {
        counters[idx] = new WordCount(args[idx]);
        counters[idx].start();
    }
    int total = 0;
    for (WordCount counter : counters) {
        counter.join();
        total += counter.count;
    }
    System.out.println("Total: " + total);
}
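If the goal is a combined word count across all files rather than just a total, each worker can expose its own map and the main thread can merge the maps after join(). A self-contained sketch of that idea (all names are hypothetical, and an in-memory word list stands in for each file's contents so the snippet runs on its own):

```java
import java.util.*;

public class MergeCountsDemo {
    static class WordCount extends Thread {
        private final List<String> words;          // stands in for one file's contents
        final Map<String, Integer> counts = new HashMap<>();

        WordCount(List<String> words) { this.words = words; }

        @Override
        public void run() {
            // Count word occurrences for this "file"
            for (String w : words) {
                counts.merge(w, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        WordCount a = new WordCount(Arrays.asList("hello", "my", "hello"));
        WordCount b = new WordCount(Arrays.asList("my", "name"));
        a.start();
        b.start();
        a.join();
        b.join();

        // Merge the per-file maps into one combined map
        Map<String, Integer> total = new HashMap<>();
        for (WordCount w : Arrays.asList(a, b)) {
            w.counts.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        System.out.println(total); // hello=2, my=2, name=1 in some order
    }
}
```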
I'm going to assume that all of these files lie in the same directory. You can do it this way:
public void run() {
    // Replace with the path to your folder
    File f = new File("link/to/folder/here");
    // Check if the file is a directory (always do this if you are going to use listFiles())
    if (f.isDirectory()) {
        // The Scanner variable is declared outside the loop to prevent mass creation of objects
        Scanner in = null;
        // Lists all files in the directory
        // You could also use a plain for loop, but I prefer enhanced for loops
        for (File file : f.listFiles()) {
            // Everything here is your old code, applied to each "file" in the directory
            int count = 0;
            try {
                HashMap<String, Integer> map = new HashMap<String, Integer>();
                in = new Scanner(file);
                while (in.hasNext()) {
                    String word = in.next();
                    if (map.containsKey(word))
                        map.put(word, map.get(word) + 1);
                    else {
                        map.put(word, 1);
                    }
                    count++;
                }
                System.out.println(file + " : " + count);
                for (String word : map.keySet()) {
                    System.out.println(word + " " + map.get(word));
                }
            } catch (FileNotFoundException e) {
                System.out.println(file + " was not found.");
            }
        }
        // Once done with the scanner, close it (I didn't see this in your code, so it is included now)
        in.close();
    }
}
If you wanted to use a plain for loop rather than an enhanced for loop (for compatibility purposes), see the link shared in the comments.
Otherwise, you can just keep scanning user input, throwing it all into an ArrayList (or whatever collection suits your needs), then loop through the ArrayList and move the "File f" variable inside the loop, sort of like this:
for (String s : arraylist) {
    File f = new File(s);
}
The purpose of this program is to read an input file and parse it looking for words. I used a class and instantiated objects to hold each unique word along with a count of that word as found in the input file. For instance, "Word" is found once, "are" is found once, "fun" is found twice, ... This program ignores numeric data (e.g. 0, 1, ...) as well as punctuation (things like . , ; : - ).
The assignment does not allow using a fixed-size array to hold word strings or counts. The program should work regardless of the size of the input file.
I am getting the following compiling error:
'<>' operator is not allowed for source level below 1.7 [line: 9]
import java.io.*;
import java.util.*;

public class Test {
    public static void main(String args[]) throws IOException {
        HashMap<String, Word> map = new HashMap<>();

        // The name of the file to open.
        String fileName = "song.txt";

        // This will reference one line at a time
        String line = null;

        try {
            // FileReader reads text files in the default encoding.
            FileReader fileReader = new FileReader(fileName);

            // Always wrap FileReader in BufferedReader.
            BufferedReader bufferedReader = new BufferedReader(fileReader);

            while ((line = bufferedReader.readLine()) != null) {
                String[] words = line.split(" ");
                for (String word : words) {
                    if (map.containsKey(word)) {
                        Word w = map.get(word);
                        w.setCount(w.getCount() + 1);
                    } else {
                        Word w = new Word(word, 1);
                        map.put(word, w);
                    }
                }
            }

            // Always close files.
            bufferedReader.close();
        }
        catch (FileNotFoundException ex) {
            System.out.println("Unable to open file '" + fileName + "'");
        }
        catch (IOException ex) {
            System.out.println("Error reading file '" + fileName + "'");
            // Or we could just do this:
            // ex.printStackTrace();
        }

        for (Map.Entry<String, Word> entry : map.entrySet()) {
            System.out.println(entry.getValue().getWord());
            System.out.println("count:" + entry.getValue().getCount());
        }
    }

    static class Word {
        String word;
        int count;

        public Word(String word, int count) {
            this.word = word;
            this.count = count;
        }

        public String getWord() {
            return word;
        }

        public void setWord(String word) {
            this.word = word;
        }

        public int getCount() {
            return count;
        }

        public void setCount(int count) {
            this.count = count;
        }
    }
}
You either need to compile with a JDK of version 1.7 or later, or change the line:
HashMap<String,Word> map = new HashMap<>();
to
HashMap<String,Word> map = new HashMap<String,Word>();
replace
HashMap<String,Word> map = new HashMap<>();
with:
HashMap<String,Word> map = new HashMap<String,Word>();
I have a method that counts the occurrences of words in a text file and returns the number of times each word is found on a particular line. However, it doesn't keep track of which line number the words are located on. I have a separate method that counts the number of lines in the text file, and I would like to combine the two into one method that tracks line numbers and keeps a log of the word occurrences on each line.
Here are the two methods I would like to combine, to give a result something like "Word occurs X times on line Y":
public class Hash
{
    private static final Object dummy = new Object(); // dummy variable

    public void hashbuild()
    {
        File file = new File("getty.txt");
        // LineNumberReader lnr1 = null;
        String line1;
        try {
            Scanner scanner = new Scanner(file);
            //lnr1 = new LineNumberReader(new FileReader("getty.txt"));
            // try{while((line1 = lnr1.readLine()) != null)
            // {}}catch(Exception e){}
            while (scanner.hasNextLine())
            {
                String line = scanner.nextLine();
                List<String> wordList1 = Arrays.asList(line.split("\\s+"));
                Map<Object, Integer> hm = new LinkedHashMap<Object, Integer>();
                for (Object item : wordList1)
                {
                    Integer count = hm.get(item);
                    if (hm.put(item, (count == null ? 1 : count + 1)) != null)
                    {
                        System.out.println("Found Duplicate : " + item);
                    }
                }
                for (Object key : hm.keySet())
                {
                    int value = hm.get(key);
                    if (value > 1)
                    {
                        System.out.println(key + " occurs " + value + " times on line # " + lnr1.getLineNumber());
                    }
                }
            }
        } catch (FileNotFoundException f)
        {
            f.printStackTrace();
        }
    }
}
Here is my original line-counting method:
public void countLines()
{
    LineNumberReader lnr = null;
    String line;
    try
    {
        lnr = new LineNumberReader(new FileReader("getty.txt"));
        while ((line = lnr.readLine()) != null)
        {
            System.out.print("\n" + lnr.getLineNumber() + " " + line);
        }
        System.out.println("\n");
    } catch (Exception e) {}
}
Why don't you just remember the line number in the while loop? Initialize a new variable and increment it each time you call nextLine().
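A sketch of that suggestion: keep an explicit line counter next to nextLine(), and count duplicate words per line with a map. The names are hypothetical, and a Scanner over a string stands in for the file so the snippet runs standalone:

```java
import java.util.*;

public class LineNumberDemo {
    // Returns messages like "word occurs N times on line #L" for each duplicate word
    public static List<String> duplicatesPerLine(Scanner scanner) {
        List<String> result = new ArrayList<>();
        int lineNumber = 0;                     // incremented once per nextLine()
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            lineNumber++;
            // Count word occurrences on this line only
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String word : line.split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() > 1) {
                    result.add(e.getKey() + " occurs " + e.getValue()
                            + " times on line #" + lineNumber);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Scanner sc = new Scanner("one two one\nthree three three");
        // prints "one occurs 2 times on line #1" and "three occurs 3 times on line #2"
        duplicatesPerLine(sc).forEach(System.out::println);
    }
}
```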