How to count the number of letters in a txt file - Java

I'd like to know whether there is an easy way to count the letters in a txt file.
Let's say I have several txt files, each containing a different number of letters, and I want to delete every txt file that has more than, say, 2000 letters.
Furthermore, let's assume I deal with one txt file at a time. I've tried this so far:
try (FileReader reader2 = new FileReader("C:\\Users\\Internet\\eclipse-workspace\\test2.txt");
     BufferedReader buff = new BufferedReader(reader2)) {
    int counter = 0;
    while (buff.ready()) {
        String aa = buff.readLine();
        counter = counter + aa.length();
    }
    System.out.println(counter);
}
catch (Exception e) {
    e.printStackTrace();
}
Is there an easier way, or one with better performance?
Reading all the letters into a String just to discard them afterwards seems like a lot of wasted time.
Should I maybe use an InputStream, call available(), and then divide? On the other hand, I saw that available() counts literally everything; for example, pressing Enter in the txt file adds 2 to the count.
Thanks for all answers.

You can use Files.lines as below:
counter = Files.lines(Paths.get("C:\\Users\\Internet\\eclipse-workspace\\test2.txt"))
        .mapToInt(String::length)
        .sum();

Related

Counting frequency of words from a .txt file in java

I am working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now, I'm struggling with the method that counts the frequency of the words that appear in a .txt file.
I have a set of text files in both English and French in their respective folders, labeled 1-20. The method asks for a directory (which in this case is "docs/train/eng/" or "docs/train/fre/") and for how many files the program should go through (there are 20 files in each folder). It then reads each file, splits all the words apart (I don't need to worry about capitalization or punctuation), and puts every word in a HashMap along with how many times it appeared in the file (Key = word, Value = frequency).
This is the code I came up with for the method:
public static HashMap<String, Integer> countWords(String directory, int nFiles) {
    // Declare the HashMap
    HashMap<String, Integer> wordCount = new HashMap();
    // this large 'for' loop will go through each file in the specified directory.
    for (int k = 1; k < nFiles; k++) {
        // Puts together the string that the FileReader will refer to.
        String learn = directory + k + ".txt";
        try {
            FileReader reader = new FileReader(learn);
            BufferedReader br = new BufferedReader(reader);
            // The BufferedReader reads the lines
            String line = br.readLine();
            // Split the line into a String array to loop through
            String[] words = line.split(" ");
            int freq = 0;
            // for loop goes through every word
            for (int i = 0; i < words.length; i++) {
                // Case if the HashMap already contains the key.
                // If so, just increments the value
                if (wordCount.containsKey(words[i])) {
                    wordCount.put(words[i], freq++);
                }
                // Otherwise, puts the word into the HashMap
                else {
                    wordCount.put(words[i], freq++);
                }
            }
            // Catching the file not found error
            // and any other errors
        }
        catch (FileNotFoundException fnfe) {
            System.err.println("File not found.");
        }
        catch (Exception e) {
            System.err.print(e);
        }
    }
    return wordCount;
}
The code compiles. Unfortunately, when I asked it to print the results of all the word counts for the 20 files, it printed this. It's complete gibberish (though the words are definitely there) and is not at all what I need the method to do.
If anyone could help me debug my code, I would greatly appreciate it. I've been at it for ages, conducting test after test and I'm ready to give up.
Let me combine all the good answers here.
1) Split up your methods to handle one thing each. One to read the files into strings[], one to process the strings[], and one to call the first two.
2) When you split, think carefully about how you want to split. As #m0skit0 suggests, you should likely split on \b for this problem.
3) As #jas suggested, you should first check whether your map already has the word. If it does, increment the count; if not, add the word to the map and set its count to 1.
4) To print out the map in the way you likely expect, take a look at the below:
Map<String, Integer> test = new HashMap<>();
for (Map.Entry<String, Integer> entry : test.entrySet()) {
    System.out.println(entry.getKey() + " " + entry.getValue());
}
I would have expected something more like this. Does it make sense?
if (wordCount.containsKey(words[i])) {
    int n = wordCount.get(words[i]);
    wordCount.put(words[i], ++n);
}
// Otherwise, puts the word into the HashMap
else {
    wordCount.put(words[i], 1);
}
If the word is already in the hashmap, we want to get the current count, add 1 to that and replace the word with the new count in the hashmap.
If the word is not yet in the hashmap, we simply put it in the map with a count of 1 to start with. The next time we see the same word we'll up the count to 2, etc.
If you split by space only, then other characters (parentheses, punctuation marks, etc.) will be included in the words. For example, for "This phrase, contains... funny stuff", if you split it by space you get: "This", "phrase,", "contains...", "funny" and "stuff".
You can avoid this by splitting by word boundary (\b) instead.
line.split("\\b");
Btw your if and else parts are identical. You're always incrementing freq by one, which doesn't make much sense. If the word is already in the map, you want to get the current frequency, add 1 to it, and update the frequency in the map. If not, you put it in the map with a value of 1.
And pro tip: always print/log the full stacktrace for the exceptions.
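Pulling these suggestions together, a rough sketch of what the corrected method could look like (the class name, the <= nFiles loop bound, and the ASCII-only \w filter are assumptions on my part; splitting on \b follows the advice above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;

public class WordCounter {

    public static HashMap<String, Integer> countWords(String directory, int nFiles) {
        HashMap<String, Integer> wordCount = new HashMap<>();

        // files are labeled 1..nFiles, so use <= to include the last one
        for (int k = 1; k <= nFiles; k++) {
            String learn = directory + k + ".txt";
            try (BufferedReader br = new BufferedReader(new FileReader(learn))) {
                String line;
                // read every line, not just the first one
                while ((line = br.readLine()) != null) {
                    // split on word boundaries so punctuation is not glued to words
                    for (String token : line.split("\\b")) {
                        if (!token.matches("\\w+")) {
                            continue; // skip punctuation and whitespace tokens produced by \b
                        }
                        // increment if present, otherwise start the count at 1
                        if (wordCount.containsKey(token)) {
                            wordCount.put(token, wordCount.get(token) + 1);
                        } else {
                            wordCount.put(token, 1);
                        }
                    }
                }
            } catch (IOException e) {
                e.printStackTrace(); // log the full stack trace
            }
        }
        return wordCount;
    }
}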

How to read integers/doubles from a large text file in Java

I am making a Pi-based RNG (random number generator) for a research project. I am stumped at this point because I can't figure out how to read the digits from a rather large file (1 GB). Here is the input:
....159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593344612847564823378678316527120190914564856692346034861045432664821339360726024914127372458700660631558817488152092096282925409171536436789259036001133053054882046652138414695194151160943305727036575959195309218611738193261179310511854807446237996274956735188575272489122793818301194912983367336244065664308602139494639522473719070217986094370277053921717629317675238467481846766940513200056812714526356082778577134275778960917363717872146844090122495343014654958537105079227968925892354201995611212902196086403441815981362977477130996051870721134999999837297804995105973173281609631859502445945534690830264252230825334468503526193118817101000313783875288658753320838142061717766914730359825349042875546873115956286388235378759375195778185778053217122680661300192787661119590921642019893809525720106548586327886593615338182....
The file is ugly, I know... it's Pi to the 1 billionth decimal place. I am not going into details on why I am doing this, but here is my goal. I want to be able to skip x decimal places before beginning to print output, and I also need to be able to read out y consecutive digits at a time, so if y were 4 the output would look like:
1111\n
2222\n
3333\n
4444\n....
My base objective is to be able to print at least one digit at a time; after that I can piece them together however I want. So the basic output is:
For input 3.1415.. I get..
3,1,4,1,5....
I tried a bunch of the file streams from the Java API, but they only give me bytes/bits... I have no idea how to convert them into something meaningful.
Also, reading line by line is not optimal, because my numbers have to be the same length and I feel like reading line by line would cut them off in a funny way.
What you need is a character stream, basically a subclass of Reader, so you can read character by character, rather than byte by byte.
To achieve what you need, you will have to:
1) open a character stream to the file containing your input digits. Prefer a BufferedReader over a bare FileReader to speed up the I/O, since reading char by char can be very slow, especially with large files
2) keep track of the previous character read (if any) and group runs of identical characters in an appropriate data structure (for instance a StringBuilder)
3) if you need to skip the first n characters, call Reader.skip(n) at the start
The following code does exactly what I understand of your requirements:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Reader;

public class Test {
    public static void main(String[] args) {
        final char decimalSeparator = ',';
        try (Reader reader = new BufferedReader(new FileReader("pi.txt"))) {
            int prevC = -1; // previous character read from the stream
            int c;          // latest character read from the stream
            StringBuilder sb = new StringBuilder();
            while ((c = reader.read()) != -1) {
                // if first digit or same as previous digit
                if ((prevC == -1) || (c == prevC)) {
                    sb.append((char) c);
                } else {
                    // print the group of digits and reset sb
                    if (sb.length() > 0) {
                        System.out.println(sb.toString());
                        sb = new StringBuilder();
                    }
                    sb.append((char) c);
                }
                prevC = c;
            }
            // print the last digits group
            if (sb.length() > 0) {
                System.out.println(sb.toString());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Okay, I have spoken to a CS professor and it seems that I have forgotten my basic Java training: 1 byte = 1 char here. In this case BufferedInputStream.read() returns the ASCII value of each char. Here is a simple solution:
FileInputStream ifs = new FileInputStream(pi); //Input File containing 1 billion digits
BufferedInputStream bis = new BufferedInputStream(ifs);
System.out.println((char)bis.read()); //Build strings or parse chars how you want
..Rinse and repeat. Sorry for wasting time... but I hope this will set someone on the right track down the road.
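Building on that, a hedged sketch of the original goal (skip x decimal places, then print y consecutive digits per line); the file name, skip count and group size below are placeholders, and the file is assumed to contain digit characters only:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class PiDigitGroups {
    public static void main(String[] args) throws IOException {
        long skip = 1000;  // x: decimal places to skip (assumption)
        int groupSize = 4; // y: digits printed per line (assumption)

        try (BufferedReader reader = new BufferedReader(new FileReader("pi.txt"))) {
            reader.skip(skip); // jump past the first x characters

            char[] buf = new char[groupSize];
            int read;
            // read y chars at a time and print them as one group per line;
            // a trailing partial group at end of file is dropped
            while ((read = reader.read(buf, 0, groupSize)) == groupSize) {
                System.out.println(new String(buf, 0, read));
            }
        }
    }
}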

Verifying unexpected empty lines in a file

Aside: I am using the penn.txt file for the problem. The link here is to my Dropbox but it is also available in other places such as here. However, I've not checked whether they are exactly the same.
Problem statement: I would like to do some word processing on each line of the penn.txt file which contains some words and syntactic categories. The details are not relevant.
Actual "problem" faced: I suspect that the file has some consecutive blank lines (which should ideally not be present), which I think the code verifies but I have not verified it by eye, because the number of lines is somewhat large (~1,300,000). So I would like my Java code and conclusions checked for correctness.
I've used a slightly modified version of the code for converting a file to a String and counting the number of lines in a string. I'm not sure about the efficiency of the splitting, but it works well enough for this case.
File file = new File("final_project/penn.txt"); // location
System.out.println(file.exists());

// converting file to String
byte[] encoded = null;
try {
    encoded = Files.readAllBytes(Paths.get("final_project/penn.txt"));
} catch (IOException e1) {
    // TODO Auto-generated catch block
    e1.printStackTrace();
}
String mystr = new String(encoded, StandardCharsets.UTF_8);

// splitting and checking "consecutiveness" of \n
for (int j = 1; ; j++) {
    String split = new String();
    for (int i = 0; i < j; i++) {
        split = split + "\n";
    }
    if (mystr.split(split).length == 1) break;
    System.out.print("(" + mystr.split(split).length + "," + j + ") ");
}

// counting using Scanner
int count = 0;
try {
    Scanner reader = new Scanner(new FileInputStream(file));
    while (reader.hasNext()) {
        count++;
        String entry = reader.next();
        // some word processing here
    }
    reader.close();
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
System.out.println(count);
The number of lines in Gedit--if I understand correctly--matched the number of \n characters found at 1,283,169. I have verified (separately) that the number of \r and \r\n (combined) characters is 0 using the same splitting idea. The total splitting output is shown below:
(1283169,1) (176,2) (18,3) (13,4) (11,5) (9,6) (8,7) (7,8) (6,9) (6,10) (5,11) (5,12) (4,13) (4,14) (4,15) (4,16) (3,17) (3,18) (3,19) (3,20) (3,21) (3,22) (3,23) (3,24) (3,25) (2,26) (2,27) (2,28) (2,29) (2,30) (2,31) (2,32) (2,33) (2,34) (2,35) (2,36) (2,37) (2,38) (2,39) (2,40) (2,41) (2,42) (2,43) (2,44) (2,45) (2,46) (2,47) (2,48) (2,49) (2,50)
Please answer whether the following statements are correct or not:
From this, what I understand is that there is one instance of 50 consecutive \n characters and because of that there are exactly two instances of 25 consecutive \n characters and so on.
The last count (using Scanner) gives 1,282,969, which is a difference of exactly 200. In my opinion, this means there are exactly 200 (or 199?) empty lines floating about somewhere in the file.
Is there any way to separately verify this "discrepancy" of 200? (something like a set-theoretic counting of intersections maybe)
A partial answer to the question (the last part) is as follows:
(Assuming the two statements in the question are true)
If, instead of printing the number of split parts, you print the number of occurrences of \n repeated j times, you get (simply subtracting 1 from each):
(1283168,1) (175,2) (17,3) (12,4) (10,5) (8,6) (7,7) (6,8) (5,9) (5,10) (4,11) (4,12) (3,13) (3,14) (3,15) (3,16) (2,17) (2,18) (2,19) (2,20) (2,21) (2,22) (2,23) (2,24) (2,25) (1,26) (1,27) (1,28) (1,29) (1,30) (1,31) (1,32) (1,33) (1,34) (1,35) (1,36) (1,37) (1,38) (1,39) (1,40) (1,41) (1,42) (1,43) (1,44) (1,45) (1,46) (1,47) (1,48) (1,49) (1,50)
Note that for j > 3 the product of the two numbers is <= 50, which is your maximum. What this means is that there is one place with 50 consecutive \n characters, and all the hits you are getting from 4 to 49 are actually part of that same run.
However, for 3, the largest multiple of 3 below 50 is 48, which gives 16, while you have 17 occurrences here. So there is an extra \n\n\n somewhere, with a non-\n character on both sides.
Now for 2 (\n\n), we can subtract 25 (coming from the run of 50 \ns) and 1 (coming from the separate \n\n\n) to obtain 175 - 26 = 149.
Accounting for the discrepancy, we should sum (2-1)*149 + (3-1)*1 + (50-1)*1, the -1 coming from the fact that the first \n in each of these runs is already accounted for in the Scanner counting. This sum is exactly 200.
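For a more direct cross-check (my sketch, not part of the original answer), one can count the empty lines explicitly; if the reasoning above holds, the result should come out to 200. The path is the one from the question:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class BlankLineCheck {
    public static void main(String[] args) throws IOException {
        // read the file line by line, splitting on line terminators
        List<String> lines = Files.readAllLines(Paths.get("final_project/penn.txt"));

        long blank = lines.stream().filter(String::isEmpty).count();
        System.out.println("total lines: " + lines.size());
        System.out.println("empty lines: " + blank);
    }
}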

Count the number of lines in a text file while it's still being written

I have a java program that creates a text file as follows:
FileWriter fstream = new FileWriter(filename + ".txt");
BufferedWriter outwriter = new BufferedWriter(fstream);
I add lines to the file by using
outwriter.write("line to be added");
Now, at certain stages of the program, I need to know how many lines I have added to the text file so far, while the file still has lines waiting to be added.
The whole point of this is to add a header and footer.
Is there any method that can find the current or last added line number?
EDIT
Adding a function or a counter for the written lines would be a solution, but the time constraint doesn't allow it: I am dealing with a huge amount of code, so it would be a major change in a lot of places and would consume a lot of time.
Take a look at Apache Commons Tailer.
It will perform a tail-like operation and call you back (via TailerListener) for each line as lines are added to the file. You can then maintain your count in the callback. You don't have to worry about writing file reading/parsing code etc.
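A rough sketch of that approach (the file name and polling delay are assumptions); note that lines only become visible to the tailer once the BufferedWriter flushes them:

import java.io.File;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.commons.io.input.Tailer;
import org.apache.commons.io.input.TailerListenerAdapter;

public class LineCountTail {
    public static void main(String[] args) {
        AtomicInteger lineCount = new AtomicInteger();

        TailerListenerAdapter listener = new TailerListenerAdapter() {
            @Override
            public void handle(String line) {
                // called once for every complete line appended to the file
                lineCount.incrementAndGet();
            }
        };

        // poll the file every 500 ms for newly written lines
        Tailer tailer = Tailer.create(new File("output.txt"), listener, 500);

        // ... later, whenever the count is needed:
        System.out.println("lines so far: " + lineCount.get());

        // tailer.stop(); // stop tailing when done
    }
}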
Sure. Just implement your own writer, e.g. a LineCountWriter extends Writer that wraps the other writer and counts the written lines.
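One possible sketch of such a wrapper (the class name comes from this answer; the rest is an assumption). Extending FilterWriter keeps it short, and lines are counted as newline characters pass through:

import java.io.FilterWriter;
import java.io.IOException;
import java.io.Writer;

public class LineCountWriter extends FilterWriter {
    private int lineCount = 0;

    public LineCountWriter(Writer out) {
        super(out);
    }

    @Override
    public void write(int c) throws IOException {
        super.write(c);
        if (c == '\n') {
            lineCount++;
        }
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        super.write(cbuf, off, len);
        for (int i = off; i < off + len; i++) {
            if (cbuf[i] == '\n') {
                lineCount++;
            }
        }
    }

    @Override
    public void write(String str, int off, int len) throws IOException {
        super.write(str, off, len);
        for (int i = off; i < off + len; i++) {
            if (str.charAt(i) == '\n') {
                lineCount++;
            }
        }
    }

    public int getLineCount() {
        return lineCount;
    }
}

For example, building the chain as new BufferedWriter(new LineCountWriter(fstream)) leaves the existing outwriter.write(...) calls unchanged; the trade-off is that only lines terminated with an explicit '\n' are counted, and the count updates when the buffer is flushed.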
Make a counter:
lineCounter += "line to be added".split("\n").length - 1;
Make a method which will write lines to the file:
private static int lineCount;

private static void fileWriter(BufferedWriter outwriter, String line) throws IOException {
    outwriter.write(line);
    lineCount++;
}
Now, each time you need to write a line to the file, just call this method. Whenever you need to know the line count, the lineCount variable will have it.
Hope this helps!
Create a method like this:
public int writeLine(BufferedWriter out, String message, int numberOfLines) throws IOException {
    out.write(message);
    return numberOfLines + 1;
}
Then you can do this:
int totalLines = 0;
totalLines = writeLine(outwriter, "Line to be added", totalLines);
System.out.println("Total lines: " + totalLines);
> "Total lines: 1"

Comparing Sentences From a Read-In File - Java

I need to read in a file that contains 2 sentences to compare and return a number between 0 and 1. If the sentences are exactly the same it should return a 1 for true and if they are totally opposite it should return a 0 for false. If the sentences are similar but words are changed to synonyms or something close it should return a .25 .5 or .75. The text file is formatted like this:
______________________________________
Text: Sample
Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.
Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
// Should score high point but not 1
Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
// Should score lower than text20
Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
// Should score lower than text21 but NOT 0
Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats.
// Should score a 0!
________________________________________________
I have a file reader, but I am not sure the best way to store each line so I can compare them. For now I have the file being read and then being printed out on the screen. What is the best way to store these and then compare them to get my desired number?
import java.io.*;

public class implement
{
    public static void main(String[] args)
    {
        try
        {
            FileInputStream fstream = new FileInputStream("textfile.txt");
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine;
            while ((strLine = br.readLine()) != null)
            {
                System.out.println(strLine);
            }
            in.close();
        }
        catch (Exception e)
        {
            System.err.println("Error: " + e.getMessage());
        }
    }
}
Save them in an array list.
List<String> list = new ArrayList<>();
// Read file
// while loop
list.add(strLine);
To check each variable in a sentence, simply remove the punctuation, then split on spaces and search for each word in the sentence you are comparing. I would suggest ignoring words of 2 or 3 characters; it is up to your discretion.
Then save the lines to the array and compare them however you want.
To compare similar words you will need a data structure that lets you check words efficiently, i.e. a hash table. Once you have this you can look up words fairly quickly. Next, this hash table of words will need a thesaurus linked to each word for similar words. Then take the similar words for the key words in each sentence and run a search for those words in the sentence you are comparing. Obviously, before you search for the similar words you would want to compare the two actual sentences. In the end you will need a more advanced data structure, which you will have to build yourself, to do more than direct comparisons.
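As a sketch of the first part of that suggestion (storing the lines, stripping punctuation, and computing a crude word-overlap score between two texts); the indices and scoring formula are only illustrations, not the synonym-aware comparison the assignment ultimately needs:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SentenceCompare {
    // crude similarity: fraction of words from 'a' that also appear in 'b'
    static double overlapScore(String a, String b) {
        Set<String> wordsA = toWords(a);
        Set<String> wordsB = toWords(b);
        if (wordsA.isEmpty()) {
            return 0.0;
        }
        long shared = wordsA.stream().filter(wordsB::contains).count();
        return (double) shared / wordsA.size();
    }

    // lower-case, strip punctuation, split on whitespace
    static Set<String> toWords(String text) {
        String cleaned = text.toLowerCase().replaceAll("[^a-z ]", " ");
        Set<String> words = new HashSet<>();
        for (String w : cleaned.split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // each line of the file is one "Text N: ..." entry
        List<String> lines = Files.readAllLines(Paths.get("textfile.txt"));
        for (String line : lines) {
            System.out.println(line);
        }
        // compare, e.g., the sample text against one candidate
        // (the indices are assumptions about the file layout)
        System.out.println(overlapScore(lines.get(1), lines.get(2)));
    }
}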
