The purpose of this program is to read an input file and parse it looking for words. I used a class and instantiated objects to hold each unique word along with a count of how many times that word appears in the input file. For instance, in a given sentence, “Word” is found once, “are” is found once, “fun” is found twice, and so on. This program ignores numeric data (e.g. 0, 1, ...) as well as punctuation (things like . , ; : - ).
The assignment does not allow using a fixed-size array to hold word strings or counts. The program should work regardless of the size of the input file.
I am getting the following compile error:
'<>' operator is not allowed for source level below 1.7 [line: 9]
import java.io.*;
import java.util.*;
public class Test {
public static void main(String args[]) throws IOException {
HashMap<String,Word> map = new HashMap<>();
// The name of the file to open.
String fileName = "song.txt";
// This will reference one line at a time
String line = null;
try {
// FileReader reads text files in the default encoding.
FileReader fileReader =
new FileReader(fileName);
// Always wrap FileReader in BufferedReader.
BufferedReader bufferedReader =
new BufferedReader(fileReader);
while((line = bufferedReader.readLine()) != null) {
String[] words = line.split(" ");
for(String word : words){
if(map.containsKey(word)){
Word w = map.get(word);
w.setCount(w.getCount()+1);
}else {
Word w = new Word(word, 1);
map.put(word,w);
}
}
}
// Always close files.
bufferedReader.close();
}
catch(FileNotFoundException ex) {
System.out.println(
"Unable to open file '" +
fileName + "'");
}
catch(IOException ex) {
System.out.println(
"Error reading file '"
+ fileName + "'");
// Or we could just do this:
// ex.printStackTrace();
}
for(Map.Entry<String,Word> entry : map.entrySet()){
System.out.println(entry.getValue().getWord());
System.out.println("count:"+entry.getValue().getCount());
}
}
static class Word{
public Word(String word, int count) {
this.word = word;
this.count = count;
}
String word;
int count;
public String getWord() {
return word;
}
public void setWord(String word) {
this.word = word;
}
public int getCount() {
return count;
}
public void setCount(int count) {
this.count = count;
}
}
}
You either need to compile with a JDK of version 1.7 or later, or change the line:
HashMap<String,Word> map = new HashMap<>();
to
HashMap<String,Word> map = new HashMap<String,Word>();
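If you do have a JDK 1.7 or later installed, this error usually means the compiler (or your IDE's project settings) is still pinned to an older source level; with plain javac you can raise it explicitly, for example:
javac -source 1.7 -target 1.7 Test.java
As a side note on the program itself: line.split(" ") keeps digits and punctuation attached to the words, so the stated goal of ignoring them is not met yet. One possible stricter split (a suggestion, not part of the original fix) keeps only runs of letters:
String[] words = line.split("[^\\p{L}]+"); // split on anything that is not a letter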
I am working on a project in which I need to find the frequency of each word in a large corpus of over 100 million Bengali words. The file size is around 2 GB. I actually need the 20 most frequent and the 20 least frequent words with their frequency counts. I wrote the same code in PHP, but it is taking too long (the code is still running after a week), so I am trying to do this in Java.
In this code, it should work as follows:
-read a line from the corpus nahidd_filtered.txt
-split it on whitespace
-for each split word, read the whole frequency file freq3.txt
-if the word is found, increase its frequency count and store it in that file
-else count = 1 (new word) and store the frequency count in that file
I have tried to read chunks of text from the nahidd_filtered.txt corpus in a loop, storing each word with its frequency in freq3.txt. The freq3.txt file stores frequency counts like this:
Word1 Frequency1 (single whitespace in between)
Word2 Frequency2
...........
Simply speaking, I need the 20 most frequent and the 20 least frequent words, along with their frequency counts, from the large UTF-8 encoded corpus file. Please check the code and tell me why it is not working, or offer any other suggestions. Thank you very much.
import java.io.*;
import java.util.*;
import java.util.concurrent.TimeUnit;
public class Main {
private static String fileToString(String filename) throws IOException {
FileInputStream inputStream = null;
Scanner reader = null;
inputStream = new FileInputStream(filename);
reader = new Scanner(inputStream, "UTF-8");
/*BufferedReader reader = new BufferedReader(new FileReader(filename));*/
StringBuilder builder = new StringBuilder();
// For every line in the file, append it to the string builder
while (reader.hasNextLine()) {
String line = reader.nextLine();
builder.append(line);
}
reader.close();
return builder.toString();
}
public static final String UTF8_BOM = "\uFEFF";
private static String removeUTF8BOM(String s) {
if (s.startsWith(UTF8_BOM)) {
s = s.substring(1);
}
return s;
}
public static void main(String[] args) throws IOException {
long startTime = System.nanoTime();
System.out.println("-------------- Start Contents of file: ---------------------");
FileInputStream inputStream = null;
Scanner sc = null;
String path = "C:/xampp/htdocs/thesis_freqeuncy_2/nahidd_filtered.txt";
try {
inputStream = new FileInputStream(path);
sc = new Scanner(inputStream, "UTF-8");
int countWord = 0;
BufferedWriter writer = null;
while (sc.hasNextLine()) {
String word = null;
String line = sc.nextLine();
String[] wordList = line.split("\\s+");
for (int i = 0; i < wordList.length; i++) {
word = wordList[i].replace("।", "");
word = word.replace(",", "").trim();
ArrayList<String> freqword = new ArrayList<>();
String freq = fileToString("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt");
/*freqword = freq.split("\\r?\\n");*/
Collections.addAll(freqword, freq.split("\\r?\\n"));
int flag = 0;
String[] freqwordsp = null;
int k;
for (k = 0; k < freqword.size(); k++) {
freqwordsp = freqword.get(k).split("\\s+");
String word2 = freqwordsp[0];
word = removeUTF8BOM(word);
word2 = removeUTF8BOM(word2);
word.replaceAll("\\P{Print}", "");
word2.replaceAll("\\P{Print}", "");
if (word2.toString().equals(word.toString())) {
flag = 1;
break;
}
}
int count = 0;
if (flag == 1) {
count = Integer.parseInt(freqwordsp[1]);
}
count = count + 1;
word = word + " " + count + "\n";
freqword.add(word);
System.out.println(freqword);
writer = new BufferedWriter(new FileWriter("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt"));
writer.write(String.valueOf(freqword));
}
}
// writer.close();
System.out.println(countWord);
System.out.println("-------------- End Contents of file: ---------------------");
long endTime = System.nanoTime();
long totalTime = (endTime - startTime);
System.out.println(TimeUnit.MINUTES.convert(totalTime, TimeUnit.NANOSECONDS));
// note that Scanner suppresses exceptions
if (sc.ioException() != null) {
throw sc.ioException();
}
} finally {
if (inputStream != null) {
inputStream.close();
}
if (sc != null) {
sc.close();
}
}
}
}
First of all:
for each split word, read the whole frequency file freq3.txt
Don't do it! Disk I/O operations are very, very slow. Do you have enough memory to read the file into memory? It seems so:
String freq = fileToString("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt");
Collections.addAll(freqword, freq.split("\\r?\\n"));
If you really need this file, then load it once and work in memory. Also, in this case a Map (word to frequency) is a better fit than a List. Save the collection to disk when the calculations are done.
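For example, a minimal sketch of loading it once, assuming the per-line "word count" format shown in the question:
// load freq3.txt into memory once, before scanning the corpus
Map<String, Integer> frequencies = new HashMap<>();
try (BufferedReader freqReader = new BufferedReader(new InputStreamReader(
        new FileInputStream("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt"),
        StandardCharsets.UTF_8))) {
    String line;
    while ((line = freqReader.readLine()) != null) {
        String[] parts = line.split("\\s+");
        if (parts.length == 2) {
            frequencies.put(parts[0], Integer.parseInt(parts[1]));
        }
    }
}
// while scanning the corpus, update counts in memory:
// frequencies.merge(word, 1, Integer::sum);
// and write the whole map back to freq3.txt once at the end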
Next, you could buffer your input stream; it may significantly improve performance:
inputStream = new BufferedInputStream(new FileInputStream(path));
And don't forget to close the stream/reader/writer, either explicitly or by using a try-with-resources statement.
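For example, a sketch combining both points:
try (Scanner sc = new Scanner(new BufferedInputStream(new FileInputStream(path)), "UTF-8")) {
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // ... process the line ...
    }
} // sc and the underlying stream are closed automatically here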
Generally speaking, the code may be simplified depending on the API used. For example:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DemoApplication {
    public static final String UTF8_BOM = "\uFEFF";

    private static String removeUTF8BOM(String s) {
        if (s.startsWith(UTF8_BOM)) {
            s = s.substring(1);
        }
        return s;
    }

    private static final String PATH = "words.txt";
    private static final String REGEX = " ";

    public static void main(String[] args) throws IOException {
        Map<String, Long> frequencyMap;
        // read as UTF-8 explicitly; a plain FileReader would use the platform default encoding
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(PATH), StandardCharsets.UTF_8))) {
            frequencyMap = reader
                    .lines()
                    .flatMap(s -> Arrays.stream(s.split(REGEX)))
                    .map(DemoApplication::removeUTF8BOM)
                    .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        }
        frequencyMap
                .entrySet()
                .stream()
                .sorted(Comparator.comparingLong(Map.Entry::getValue))
                .limit(20)
                .forEach(System.out::println);
    }
}
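Note that this prints the 20 least frequent words (the sort is ascending). For the 20 most frequent, reverse the comparator; the explicit type witness is needed once .reversed() is chained:
frequencyMap
        .entrySet()
        .stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(20)
        .forEach(System.out::println);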
I would like to read through a text document and then add only the unique words to the ArrayList of "Word" objects. It appears that the code I have now does not enter any words at all into the wordList ArrayList.
public ArrayList<Word> wordList = new ArrayList<Word>();
String fileName, word;
int counter;
Scanner reader = null;
Scanner scanner = new Scanner(System.in);
try {
reader = new Scanner(new FileInputStream(fileName));
}
catch(FileNotFoundException e) {
System.out.println("The file could not be found. The program will now exit.");
System.exit(0);
}
while (reader.hasNext()) {
word = reader.next().toLowerCase();
for (Word value : wordList) {
if(value.getValue().contains(word)) {
Word newWord = new Word(word);
wordList.add(newWord);
}
}
counter++;
}
public class Word {
String value;
int frequency;
public Word(String v) {
value = v;
frequency = 1;
}
public String getValue() {
return value;
}
public String toString() {
return value + " " + frequency;
}
}
Alright, let's start by fixing your current code. The issue is that you are only adding a new Word object to the list when one already exists. Instead, you need to add a new Word object when none exists, and increment the frequency otherwise. Here is an example fix for that:
ArrayList<Word> wordList = new ArrayList<Word>();
String fileName, word;
Scanner reader = null;
Scanner scanner = new Scanner(System.in);
try {
reader = new Scanner(new FileInputStream(fileName));
}
catch(FileNotFoundException e) {
System.out.println("The file could not be found. The program will now exit.");
System.exit(0);
}
while (reader.hasNext()) {
word = reader.next().toLowerCase();
boolean wordExists = false;
for (Word value : wordList) {
// We have seen the word before so increase frequency.
if(value.getValue().equals(word)) {
value.frequency++;
wordExists = true;
break;
}
}
// This is the first time we have seen the word!
if (!wordExists) {
Word newValue = new Word(word);
newValue.frequency = 1;
wordList.add(newValue);
}
}
However, this is a really bad solution (O(n^2) runtime). Instead we should be using a data structure known as a Map, which will bring our runtime down to O(n).
ArrayList<Word> wordList = new ArrayList<Word>();
String fileName, word;
int counter;
Scanner reader = null;
Scanner scanner = new Scanner(System.in);
try {
reader = new Scanner(new FileInputStream(fileName));
}
catch(FileNotFoundException e) {
System.out.println("The file could not be found. The program will now exit.");
System.exit(0);
}
Map<String, Integer> frequencyMap = new HashMap<String, Integer>();
while (reader.hasNext()) {
word = reader.next().toLowerCase();
// This is equivalent to searching every word in the list via hashing (O(1))
if(!frequencyMap.containsKey(word)) {
frequencyMap.put(word, 1);
} else {
// We have already seen the word, increase frequency.
frequencyMap.put(word, frequencyMap.get(word) + 1);
}
}
// Convert our map of word->frequency to a list of Word objects.
for(Map.Entry<String, Integer> entry : frequencyMap.entrySet()) {
Word word = new Word(entry.getKey());
word.frequency = entry.getValue();
wordList.add(word);
}
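As an aside, on Java 8 the containsKey/put pair can be collapsed into a single call; the loop body above is equivalent to:
frequencyMap.merge(word, 1, Integer::sum); // insert 1, or add 1 to the existing count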
Your for-each loop is iterating over wordList, but that is an empty ArrayList, so your code will never reach the wordList.add(newWord); line.
I appreciate that perhaps you wanted critique on why your algorithm wasn't working, or maybe it was an example of a much larger problem, but if all you want to do is count occurrences, there is a much simpler way of doing this.
Using Streams in Java 8 you can boil this down to one method - create a Stream of the lines in the file, split them into words, lowercase them and then use a Collector to count them.
public static void main(final String args[]) throws IOException
{
final File file = new File(System.getProperty("user.home") + File.separator + "Desktop" + File.separator + "myFile.txt");
for (final Entry<String, Long> entry : countWordsInFile(file).entrySet())
{
System.out.println(entry);
}
}
public static Map<String, Long> countWordsInFile(final File file) throws IOException
{
    // split each line on whitespace so that multi-word lines are counted per word
    return Files.lines(file.toPath())
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            .map(String::toLowerCase)
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
I've not done anything with Streams until now so any critique welcome.
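One small critique, then: Files.lines returns a stream backed by an open file handle, so it is best wrapped in try-with-resources; a sketch of the same method with an explicit close:
public static Map<String, Long> countWordsInFile(final File file) throws IOException
{
    try (Stream<String> lines = Files.lines(file.toPath()))
    {
        return lines
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .map(String::toLowerCase)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }
}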
Here is my NameRecord constructor class:
public class NameRecord {
String firstName;
int count;
public NameRecord(String name, int count){
this.firstName = name;
this.count = count;
}
@Override
public String toString() {
return firstName + " - " + count + " registered births.";
}
public String getFirstName() {
return firstName;
}
public int getCount() {
return count;
}
}
And here is what I have so far of the actual program:
public class Names {
public final int MAX_NAMES = 3;
NameRecord[] boyNames = new NameRecord[MAX_NAMES];
String boysFile = "data/boynames.txt";
@Override
public String toString() {
String result = "";
for (NameRecord record : boyNames)
result += record + "\n";
return result;
}
public void loadNamesFromFile() {
try {
BufferedReader stream = new BufferedReader(new FileReader("Data/boysnames.txt"));
} catch (Exception e)
{
System.out.println("File not found");
}
}
}
Basically, the program reads a file and determines whether each name is on the boys or girls list txt file, and then outputs whether it is on the list and, if so, how many times it was used. I am only working with boys right now to keep confusion to a minimum. My question is: in the loadNamesFromFile method, how do I add information from the file to the boyNames array? I know NameRecord takes the name and the count, but I'm not sure how to retrieve that information from the file and add it to the array. I have included the top three names from the file below; the name is of course the first name, and the number is the number of times it was used, or count.
Jacob 29195
Michael 26991
Joshua 24950
First of all, if it is possible, change your file structure to this:
Jacob;29195
Michael;26991
Joshua;24950
to make the solution easier to develop.
Now, this is how you can read the file's lines and store them in your table:
public void loadNamesFromFile() {
    try {
        BufferedReader stream = new BufferedReader(new FileReader("Data/boysnames.txt"));
        String currentLine = "";
        int i = 0;
        // read until the end of the file is reached
        while ((currentLine = stream.readLine()) != null) {
            String[] record = currentLine.split(";");
            NameRecord name = new NameRecord(record[0], Integer.parseInt(record[1]));
            boyNames[i] = name;
            i++;
        }
        stream.close();
    } catch (Exception e) {
        System.out.println("File not found");
    }
}
First of all you should add a Scanner for the file in order to read what is in it. After that, you keep reading the file and adding the information until there is no more content. Besides this, I would use an ArrayList of NameRecord to be more flexible about the number of names.
I'm assuming that the content of your file always follows the same format (given your example).
public class Names {
    ArrayList<NameRecord> boyNames = new ArrayList<>();

    public void loadNamesFromFile() {
        try {
            File file = new File("Data/boysnames.txt");
            Scanner sc = new Scanner(file);
            // each record is a name token followed by an int token
            while (sc.hasNext()) {
                boyNames.add(new NameRecord(sc.next(), sc.nextInt()));
            }
            sc.close();
        } catch (FileNotFoundException e) {
            System.err.println(e);
        }
    }
}
My suggestion is to read the entire contents of the file, i.e. all the names, into a single String variable. Then, iterate over each word, count the number of occurrences, and add the info to the array. Let's use Scanner to read the file.
I presume that the length of the array boyNames[] is equal to the number of unique names in the file.
Scanner boys = new Scanner(new File("Data/boys.txt"));
int a, i, n = 0, c, b;
String con = "", x; // con holds all names, separated by single spaces
// reading the names
while (boys.hasNext())
    con += boys.next() + " ";
b = con.split(" ").length; // b = total number of names in the file
for (i = 0; i < b; i++) {
    x = con.split(" ")[i];
    if (!x.equals("*")) {
        c = 0;
        a = 0;
        // counting the frequency of x in con; searching for "x " (with the
        // trailing space) so that e.g. "Sam" does not also match inside "Samuel"
        while (con.indexOf(x + " ", a) != -1) {
            c++;
            a = con.indexOf(x + " ", a) + x.length() + 1;
        }
        // adding name and frequency to the array
        boyNames[n++] = new NameRecord(x, c);
        // marking all instances of x as consumed; note that replace (not
        // replaceAll) is used, since replaceAll treats x as a regex
        con = con.replace(x + " ", "* ");
    }
}
The boyNames[] array now stores the names and their respective frequencies in the file.
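To check the result, you can print the array; NameRecord.toString() already formats each entry:
for (NameRecord record : boyNames)
    System.out.println(record);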
I am working on creating an inverted index for a list of words in Java. Basically, for each word it builds a list of the documents the word appears in, together with the frequency of the word in each document; the desired output should look like this:
[word1:[FileNo:frequency],[FileNo:frequency],[FileNo:frequency],word2:[FileNo:frequency],[FileNo:frequency]...etc]
Here is the code:
package assigenment2;
import java.io.*;
import java.util.*;
public class invertedIndex {
public static Map<String, Map<Integer,Integer>> wordTodocumentMap;
public static BufferedReader buffer;
public static BufferedReader br;
public static BufferedReader reader;
public static List<String> files = new ArrayList<String>();
public static List<String>[] tokens;
public static void main(String[] args) throws IOException {
//read the token file and store the token in list
String tokensPath="/Users/Manal/Documents/workspace/Information Retrieval/tokens.txt";
int k=0;
String[] tokens = new String[8500];
String sCurrentLine;
try
{
FileReader fr=new FileReader(tokensPath);
BufferedReader br= new BufferedReader(fr);
while ((sCurrentLine = br.readLine()) != null)
{
tokens[k]=sCurrentLine;
k++;
}
System.out.println("the number of token are:"+k+" words");
br.close();
}
catch(Exception ex)
{System.out.println(ex);}
Up to there it works correctly; I believe the problem is in the manipulation of the nested map in the following part:
TreeMap<Integer,Integer> documentToCount = new TreeMap<Integer,Integer>();
//read files
System.out.print("Enter the path of files you want to process:\n");
Scanner InputPath = new Scanner(System.in);
String cranfield = InputPath.nextLine();
File cranfieldFiles = new File(cranfield);
for (File file: cranfieldFiles.listFiles())
{
int fileno = files.indexOf(file.getPath());
if (fileno == -1) //the current file isn't in the files list \
{
files.add(file.getPath());// add file to the files list
fileno = files.size() - 1;//the index of file will start from 0 to size-1
}
int frequency = 0;
BufferedReader reader = new BufferedReader(new FileReader(file));
for (String line = reader.readLine(); line != null; line = reader.readLine())
{
for (String _word : line.split(" "))
{
String word = _word.toLowerCase();
if (Arrays.asList(tokens).contains(word))
if (wordTodocumentMap.get(word) == null)//check whether word is new word
{
documentToCount = new TreeMap<Integer,Integer>();
wordTodocumentMap.put(word, documentToCount);
}
documentToCount.put(fileno, frequency+1);//add the location and frequency
}
}
}
reader.close();
}
}
The error I get is:
Exception in thread "main" java.lang.NullPointerException
at assigenment2.invertedIndex.main(invertedIndex.java:65)
You’re never instantiating wordTodocumentMap, so it remains null throughout. Therefore the line if (wordTodocumentMap.get(word) == null)//check whether word is new word throws a NullPointerException when you do .get(), that is, before you have anything to compare to null. One possible solution is to instantiate the map in the declaration:
public static Map<String, Map<Integer,Integer>> wordTodocumentMap = new HashMap<>();
There may be other problems in your code, but this should get you a step further.
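One of those other problems, for what it's worth: frequency is never incremented, so documentToCount.put(fileno, frequency + 1) always stores 1, and when a word has been seen before, the code writes into whatever inner map documentToCount happened to point at last. A sketch of a fix using the Java 8 map API:
Map<Integer, Integer> documentToCount =
        wordTodocumentMap.computeIfAbsent(word, w -> new TreeMap<>());
documentToCount.merge(fileno, 1, Integer::sum); // add 1 for this occurrence in this file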
I have a project that needs to modify some text in a text file.
For example, BB,BO,BR,BZ,CL,VE-BR
needs to become BB,BO,BZ,CL,VE.
and HU, LT, LV, UA, PT-PT/AR needs to become HU, LT, LV, UA,/AR.
I have tried to write some code, but it fails to loop. Also, in this case:
IN/CI, GH, KE, NA, NG, SH, ZW /EE, HU, LT, LV, UA,/AR, BB
"AR, BB,BO,BR,BZ,CL, CO, CR, CW, DM, DO,VE-AR-BR-MX"
I want to delete the AR in the second row, but my code only deletes the AR in the first row.
I have no idea and am seeking help.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.util.Scanner;
public class tomy {
static StringBuffer stringBufferOfData = new StringBuffer();
static StringBuffer stringBufferOfData1 = stringBufferOfData;
static String filename = null;
static String input = null;
static String s = "-";
static Scanner sc = new Scanner(s);
public static void main(String[] args) {
boolean fileRead = readFile();
if (fileRead) {
replacement();
writeToFile();
}
System.exit(0);
}
private static boolean readFile() {
System.out.println("Please enter your files name and path i.e C:\\test.txt: ");
filename = "C:\\test.txt";
Scanner fileToRead = null;
try {
fileToRead = new Scanner(new File(filename));
for (String line; fileToRead.hasNextLine()
&& (line = fileToRead.nextLine()) != null;) {
System.out.println(line);
stringBufferOfData.append(line).append("\r\n");
}
fileToRead.close();
return true;
} catch (FileNotFoundException ex) {
System.out.println("The file " + filename + " could not be found! "+ ex.getMessage());
return false;
} finally {
fileToRead.close();
return true;
}
}
private static void writeToFile() {
try {
BufferedWriter bufwriter = new BufferedWriter(new FileWriter(
filename));
bufwriter.write(stringBufferOfData.toString());
bufwriter.close();
} catch (Exception e) {// if an exception occurs
System.out.println("Error occured while attempting to write to file: "+ e.getMessage());
}
}
private static void replacement() {
System.out.println("Please enter the contents of a line you would like to edit: ");
String lineToEdit = sc.nextLine();
int startIndex = stringBufferOfData.indexOf(lineToEdit);
int endIndex = startIndex + lineToEdit.length() + 2;
String getdata = stringBufferOfData.substring(startIndex + 1, endIndex);
String data = " ";
Scanner sc1 = new Scanner(getdata);
Scanner sc2 = new Scanner(data);
String lineToEdit1 = sc1.nextLine();
String replacementText1 = sc2.nextLine();
int startIndex1 = stringBufferOfData.indexOf(lineToEdit1);
int endIndex1 = startIndex1 + lineToEdit1.length() + 3;
boolean test = lineToEdit.contains(getdata);
boolean testh = lineToEdit.contains("-");
System.out.println(startIndex);
if (testh = true) {
stringBufferOfData.replace(startIndex, endIndex, replacementText1);
stringBufferOfData.replace(startIndex1, endIndex1 - 2,
replacementText1);
System.out.println("Here is the new edited text:\n"
+ stringBufferOfData);
} else {
System.out.println("nth" + stringBufferOfData);
System.out.println(getdata);
}
}
}
I wrote a quick method for you that I think does what you want, i.e. remove all occurrences of a token in a line, where that token is embedded in the line and is identified by a leading dash.
The method reads the file and writes it straight out to a file after editing for the token. This would allow you to process a huge file without worrying about memory constraints.
You can simply rename the output file after a successful edit. I'll leave it up to you to work that out.
If you feel you really must use string buffers to do in memory management, then grab the logic for the line editing from my method and modify it to work with string buffers.
static void onePassReadEditWrite(final String inputFilePath, final String outputPath)
{
// the input file
Scanner inputScanner = null;
// output file
FileWriter outputWriter = null;
try
{
// open the input file
inputScanner = new Scanner(new File(inputFilePath));
// open output file
File outputFile = new File(outputPath);
outputFile.createNewFile();
outputWriter = new FileWriter(outputFile);
try
{
for (
String lineToEdit = inputScanner.nextLine();
/*
* NOTE: when this loop attempts to read beyond EOF it will throw the
* java.util.NoSuchElementException exception which is caught in the
* containing try/catch block.
*
* As such there is NO predicate required for this loop.
*/;
lineToEdit = inputScanner.nextLine()
)
// scan all lines from input file
{
System.out.println("START LINE [" + lineToEdit + "]");
// get position of dash in line
int dashInLinePosition = lineToEdit.indexOf('-');
while (dashInLinePosition != -1)
// this line has needs editing
{
// split line on dash
String halfLeft = lineToEdit.substring(0, dashInLinePosition);
String halfRight = lineToEdit.substring(dashInLinePosition + 1);
// get token after dash that is to be removed from whole line
String tokenToRemove = halfRight.substring(0, 2);
// reconstruct line from the 2 halves without the dash
StringBuilder sb = new StringBuilder(halfLeft);
sb.append(halfRight);
lineToEdit = sb.toString();
// get position of first token in line
int tokenInLinePosition = lineToEdit.indexOf(tokenToRemove);
while (tokenInLinePosition != -1)
// do for all tokens in line
{
// split line around token to be removed
String partLeft = lineToEdit.substring(0, tokenInLinePosition);
String partRight = lineToEdit.substring(tokenInLinePosition + tokenToRemove.length());
if ((!partRight.isEmpty()) && (partRight.charAt(0) == ','))
// remove prefix comma from right part
{
partRight = partRight.substring(1);
}
// reconstruct line from the left and right parts
sb = new StringBuilder(partLeft);
sb.append(partRight);
lineToEdit = sb.toString();
// find next token to be removed from line
tokenInLinePosition = lineToEdit.indexOf(tokenToRemove);
}
// handle additional dashes in line
dashInLinePosition = lineToEdit.indexOf('-');
}
System.out.println("FINAL LINE [" + lineToEdit + "]");
// write line to output file
outputWriter.write(lineToEdit);
outputWriter.write("\r\n");
}
}
catch (java.util.NoSuchElementException e)
// end of scan
{
}
finally
// housekeeping
{
outputWriter.close();
inputScanner.close();
}
}
catch(FileNotFoundException e)
{
e.printStackTrace();
}
catch(IOException e)
{
inputScanner.close();
e.printStackTrace();
}
}
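A possible invocation (the paths here are only examples), after which you would rename the output file over the original as mentioned above:
onePassReadEditWrite("C:\\test.txt", "C:\\test_edited.txt");
// then, e.g.:
// Files.move(Paths.get("C:\\test_edited.txt"), Paths.get("C:\\test.txt"),
//         StandardCopyOption.REPLACE_EXISTING);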