Verifying unexpected empty lines in a file


Aside: I am using the penn.txt file for the problem. The link here is to my Dropbox but it is also available in other places such as here. However, I've not checked whether they are exactly the same.
Problem statement: I would like to do some word processing on each line of the penn.txt file which contains some words and syntactic categories. The details are not relevant.
Actual "problem" faced: I suspect that the file has some consecutive blank lines (which should ideally not be present), which I think the code verifies but I have not verified it by eye, because the number of lines is somewhat large (~1,300,000). So I would like my Java code and conclusions checked for correctness.
I've used a slightly modified version of the code for converting a file to a String and for counting the number of lines in a string. I'm not sure about the efficiency of splitting, but it works well enough for this case.
File file = new File("final_project/penn.txt"); //location
System.out.println(file.exists());
//converting file to String
byte[] encoded = null;
try {
    encoded = Files.readAllBytes(Paths.get("final_project/penn.txt"));
} catch (IOException e1) {
    // TODO Auto-generated catch block
    e1.printStackTrace();
}
String mystr = new String(encoded, StandardCharsets.UTF_8);
//splitting and checking "consecutiveness" of \n
for(int j=1; ; j++){
    String split = new String();
    for(int i=0; i<j; i++){
        split = split + "\n";
    }
    if(mystr.split(split).length==1) break;
    System.out.print("("+mystr.split(split).length + "," + j + ") ");
}
//counting using Scanner
int count=0;
try {
    Scanner reader = new Scanner(new FileInputStream(file));
    while(reader.hasNext()){
        count++;
        String entry = reader.next();
        //some word processing here
    }
    reader.close();
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
System.out.println(count);
The number of lines in Gedit--if I understand correctly--matched the number of \n characters found at 1,283,169. I have verified (separately) that the number of \r and \r\n (combined) characters is 0 using the same splitting idea. The total splitting output is shown below:
(1283169,1) (176,2) (18,3) (13,4) (11,5) (9,6) (8,7) (7,8) (6,9) (6,10) (5,11) (5,12) (4,13) (4,14) (4,15) (4,16) (3,17) (3,18) (3,19) (3,20) (3,21) (3,22) (3,23) (3,24) (3,25) (2,26) (2,27) (2,28) (2,29) (2,30) (2,31) (2,32) (2,33) (2,34) (2,35) (2,36) (2,37) (2,38) (2,39) (2,40) (2,41) (2,42) (2,43) (2,44) (2,45) (2,46) (2,47) (2,48) (2,49) (2,50)
Please answer whether the following statements are correct or not:
From this, what I understand is that there is one instance of 50 consecutive \n characters and because of that there are exactly two instances of 25 consecutive \n characters and so on.
The last count (using Scanner) reading gives 1,282,969 which is an exact difference of 200. In my opinion, what this means is that there are exactly 200 (or 199?) empty lines floating about somewhere in the file.
Is there any way to separately verify this "discrepancy" of 200? (something like a set-theoretic counting of intersections maybe)

A partial answer to the question (the last part) is as follows:
(Assuming the two statements in the question are true)
If, instead of printing the number of split parts, you print the number of occurrences of \n repeated j times, you'll get (simply subtracting 1):
(1283168,1) (175,2) (17,3) (12,4) (10,5) (8,6) (7,7) (6,8) (5,9) (5,10) (4,11) (4,12) (3,13) (3,14) (3,15) (3,16) (2,17) (2,18) (2,19) (2,20) (2,21) (2,22) (2,23) (2,24) (2,25) (1,26) (1,27) (1,28) (1,29) (1,30) (1,31) (1,32) (1,33) (1,34) (1,35) (1,36) (1,37) (1,38) (1,39) (1,40) (1,41) (1,42) (1,43) (1,44) (1,45) (1,46) (1,47) (1,48) (1,49) (1,50)
Note that for j>3, the product of the two numbers is <=50, which is your maximum. What this means is that there is a place with 50 consecutive \n characters, and all the hits you are getting from 4 to 49 are actually part of that same run.
However, for 3, the largest multiple of 3 less than 50 is 48, which gives 16 hits, while you have 17 occurrences here. So there is an extra \n\n\n somewhere with a non-\n character on both of its 'sides'.
Now for 2 (\n\n), we can subtract 25 (coming from the 50 \ns) and 1 (coming from the separate \n\n\n) to obtain 175-26 = 149.
To account for the discrepancy, we should sum (2-1)*149 + (3-1)*1 + (50-1)*1, the -1 coming in because the first \n in each of these runs is already accounted for in the Scanner count. This sum is 149 + 2 + 49 = 200.
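One way to check this arithmetic independently of split is to scan the string once and count the maximal runs of consecutive \n characters directly. The following is only a sketch: the class and method names (NewlineRuns, countRuns, emptyLines) are made up for illustration, and it assumes the file does not begin with a run of newlines, since a leading run would contain one more empty line than the formula counts.

```java
import java.util.Map;
import java.util.TreeMap;

class NewlineRuns {
    /** Maps run length to the number of maximal runs of consecutive '\n' of that length. */
    static Map<Integer, Integer> countRuns(String text) {
        Map<Integer, Integer> runs = new TreeMap<>();
        int current = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                current++;
            } else if (current > 0) {
                runs.merge(current, 1, Integer::sum);
                current = 0;
            }
        }
        if (current > 0) {
            runs.merge(current, 1, Integer::sum); // run that ends the text
        }
        return runs;
    }

    /** A run of k newlines between two non-empty lines contains k - 1 empty lines. */
    static int emptyLines(Map<Integer, Integer> runs) {
        int total = 0;
        for (Map.Entry<Integer, Integer> e : runs.entrySet()) {
            total += (e.getKey() - 1) * e.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> runs = countRuns("a\nb\n\nc\n\n\nd\n");
        System.out.println(runs);             // {1=2, 2=1, 3=1}
        System.out.println(emptyLines(runs)); // 3
    }
}
```

If the reasoning above is right, running countRuns on the file contents should report 149 runs of length 2, one of length 3 and one of length 50, for which emptyLines gives 149 + 2 + 49 = 200.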

Related

Find Word From a jumbled String

I have a scrambled String as follows: "artearardreardac".
I have a text file which contains English dictionary words, close to 300,000 of them. I need to find English words that together form a word square, as follows:
C A R D
A R E A
R E A R
D A R T
My intention was to initially loop through the scrambled String and query that text file each time, trying to match 4 characters at a time to see if they form a valid word.
The problem with this is checking against 300,000 words per loop iteration; it is going to take ages. I looped through only the first letter 16 times and that by itself takes a significant time. The number of possibilities coming from this method seems endless. Even if I ignore efficiency for now, I could end up finding English words which may not form a word square.
My guess is that I have to find words while somehow maintaining the letter arrangement correctly from the start? I have been at it for hours and gone from fun to frustration. Can I just get some guidance please? I looked for similar questions but found none.
Note: This is an example, and I am trying to keep it open to a longer string or a square of a different size. (The example is 4x4. The user can decide to go with a 5x5 square with a string of length 25.)
My Code
public static void main(String[] args){
    String result = wordSquareCreator(4, "artearardreardac");
    System.out.println(result);
}
static String wordSquareCreator(int dimension, String letter){
    String sortedWord = "";
    String temp;
    int front = 0;
    int firstLetterFront = 0;
    int back = dimension;
    //Looping through first 4 letters and only changing the first letter 16 times to try a match.
    for (int j = 0; j < letter.length(); j++) {
        String a = letter.substring(firstLetterFront, j+1) + letter.substring(front+1, back);
        temp = readFile(dimension, a);
        if(temp != null){
            sortedWord+= temp;
        }
        firstLetterFront++;
    }
    return sortedWord;
}
static String readFile(int dimension, String word){
    //dict text file contains 300,000 English words
    File file = new File("dict.txt");
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new FileReader(file));
        String text;
        while ((text = reader.readLine()) != null) {
            if(text.length() == dimension) {
                if(text.equals(word)){
                    //found a valid English word
                    return text;
                }
            }
        }
    }catch (Exception e){
        e.printStackTrace();
    }
    finally {
        try {
            if(reader != null)
                reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return null;
}

You can greatly cut down your search space if you organize your dictionary properly. (This can be done as you read it in; you don't need to modify the file on disk.)
Break it up into one list per word length, then sort each list.
Now, to reduce your search space, note that in a symmetric word square every off-diagonal letter appears twice, so letters with an odd count must occupy the diagonal from the top left to the bottom right. You have an odd number of C, T, R and A, so those 4 letters make up this diagonal. (You will not always be able to pin down the diagonal like this, as the odd-count letters aren't guaranteed to be unique.) Your search space is now one set of 4 diagonal letters (4! = 24 orderings) and one set of 6 letter pairs for the off-diagonal cells (6! = 720 orderings, fewer once duplicates are removed). That gives about 17k possible boards and under 1k words to try (edit: I originally said 5k, but you can restrict the search to words starting with the correct letter, and since the list is sorted you don't need to consider the others at all), so you're already under 20 million possibilities to examine. You can cut this considerably by first filtering your word list to words that contain only the letters that are used.
At this point an exhaustive search isn't going to be prohibitive.
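As a rough illustration of how such a search could look, here is a backtracking sketch that places one row at a time. It is not the exact enumeration described above: it swaps in a prefix check (row k must begin with the k-th letter of every row above it) plus a count of the letters still unused. The class name WordSquare and its methods are hypothetical, the dictionary is passed in as a list rather than read from dict.txt, and only lowercase a-z words are assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordSquare {
    /** Returns the rows of a symmetric word square that uses exactly the given letters, or an empty list. */
    static List<String> solve(int n, String letters, List<String> dictionary) {
        int[] remaining = new int[26];
        for (char c : letters.toCharArray()) remaining[c - 'a']++;
        List<String> candidates = new ArrayList<>();
        for (String w : dictionary) {
            if (w.length() == n && w.matches("[a-z]+")) candidates.add(w);
        }
        List<String> rows = new ArrayList<>();
        return place(n, candidates, remaining, rows) ? rows : new ArrayList<>();
    }

    private static boolean place(int n, List<String> candidates, int[] remaining, List<String> rows) {
        int k = rows.size();
        if (k == n) return true;
        // Symmetry: row k must begin with the k-th letter of every row above it.
        StringBuilder prefix = new StringBuilder();
        for (String row : rows) prefix.append(row.charAt(k));
        for (String w : candidates) {
            if (!w.startsWith(prefix.toString())) continue;
            if (!take(w, remaining)) continue;
            rows.add(w);
            if (place(n, candidates, remaining, rows)) return true;
            rows.remove(rows.size() - 1); // backtrack
            giveBack(w, remaining);
        }
        return false;
    }

    /** Removes the letters of w from the pool; returns false (pool unchanged) if one is missing. */
    private static boolean take(String w, int[] remaining) {
        for (int i = 0; i < w.length(); i++) {
            if (--remaining[w.charAt(i) - 'a'] < 0) {
                for (int j = 0; j <= i; j++) remaining[w.charAt(j) - 'a']++;
                return false;
            }
        }
        return true;
    }

    private static void giveBack(String w, int[] remaining) {
        for (char c : w.toCharArray()) remaining[c - 'a']++;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("area", "card", "dart", "rear", "tear", "read");
        for (String row : solve(4, "artearardreardac", dict)) {
            System.out.println(row); // card, area, rear, dart (one per line)
        }
    }
}
```

The linear scan with startsWith keeps the sketch short; with the sorted per-length lists suggested above, the candidates for a given prefix form one contiguous range that a binary search can locate.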

Since it seems that you want to create a word square out of the letters that you pass as a parameter to your function, you know that the word length in your square is sqrt(amountOfLetters). In your example code that would be sqrt(16) = 4. You can also disqualify a lot of words directly from your dictionary:
discard a word if it does not start with a letter in your "alphabet" (i.e. "A", "C", "D", "E", "R", "T")
discard a word if its length is not equal to your word length (i.e. 4)
discard a word if it contains a letter not in your alphabet
The number of words that you want to "write" in your square is wordlength * 2 (since the words can only start from the top row or from the left column).
You could start by going through your dictionary and copying only the valid words into a new file, then check your square against this new, shorter dictionary.
For building up the square, I think there are 2 possibilities to choose between.
The first is to arrange the letters into a square at random and check whether they form correct words.
The second is to pick "correct" words from the dictionary at random and write them into your square, then check whether the words use the correct number and arrangement of letters.
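The dictionary filtering described above could be sketched as follows. The names DictionaryFilter, filter and fits are hypothetical, the words are assumed to be lowercase a-z, and the check is slightly stricter than the three rules listed: it also rejects a word that needs more copies of a letter than the input string provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DictionaryFilter {
    /** Keeps only words of the given length that can be spelled from the available letters. */
    static List<String> filter(List<String> words, String letters, int length) {
        int[] available = new int[26];
        for (char c : letters.toCharArray()) available[c - 'a']++;
        List<String> kept = new ArrayList<>();
        for (String w : words) {
            if (w.length() == length && fits(w, available)) kept.add(w);
        }
        return kept;
    }

    /** True if every letter of w is in the alphabet and is not used more often than available. */
    private static boolean fits(String w, int[] available) {
        int[] used = new int[26];
        for (char c : w.toCharArray()) {
            if (c < 'a' || c > 'z') return false;
            if (++used[c - 'a'] > available[c - 'a']) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("card", "cart", "cats", "car", "area", "zeta", "teat");
        System.out.println(filter(words, "artearardreardac", 4)); // [card, cart, area]
    }
}
```

Running the filter once while loading the dictionary leaves a much shorter in-memory list, so there is no need to write a second file unless you want to reuse it between runs.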

Counting frequency of words from a .txt file in java

I am working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now, I'm struggling with the method that counts the frequency of words that appears in a .txt file.
I have a set of text files in both English and French in their respective folders labeled 1-20. The method asks for a directory (which in this case is "docs/train/eng/" or "docs/train/fre/") and for how many files that the program should go through (there are 20 files in each folder). Then it reads that file, splits all the words apart (I don't need to worry about capitalization or punctuation), and puts every word in a HashMap along with how many times they were in the file. (Key = word, Value = frequency).
This is the code I came up with for the method:
public static HashMap<String, Integer> countWords(String directory, int nFiles) {
    // Declare the HashMap
    HashMap<String, Integer> wordCount = new HashMap();
    // this large 'for' loop will go through each file in the specified directory.
    for (int k = 1; k < nFiles; k++) {
        // Puts together the string that the FileReader will refer to.
        String learn = directory + k + ".txt";
        try {
            FileReader reader = new FileReader(learn);
            BufferedReader br = new BufferedReader(reader);
            // The BufferedReader reads the lines
            String line = br.readLine();
            // Split the line into a String array to loop through
            String[] words = line.split(" ");
            int freq = 0;
            // for loop goes through every word
            for (int i = 0; i < words.length; i++) {
                // Case if the HashMap already contains the key.
                // If so, just increments the value
                if (wordCount.containsKey(words[i])) {
                    wordCount.put(words[i], freq++);
                }
                // Otherwise, puts the word into the HashMap
                else {
                    wordCount.put(words[i], freq++);
                }
            }
            // Catching the file not found error
            // and any other errors
        }
        catch (FileNotFoundException fnfe) {
            System.err.println("File not found.");
        }
        catch (Exception e) {
            System.err.print(e);
        }
    }
    return wordCount;
}
The code compiles. Unfortunately, when I asked it to print the results of all the word counts for the 20 files, it printed this. It's complete gibberish (though the words are definitely there) and is not at all what I need the method to do.
If anyone could help me debug my code, I would greatly appreciate it. I've been at it for ages, conducting test after test and I'm ready to give up.

Let me combine all the good answers here.
1) Split up your methods to handle one thing each: one to read the files into a String[], one to process the String[], and one to call the first two.
2) When you split, think carefully about how you want to split. As @m0skit0 suggests, you should likely split with \b for this problem.
3) As @jas suggested, you should first check if your map already has the word. If it does, increment the count; if not, add the word to the map and set its count to 1.
4) To print out the map in the way you likely expect, take a look at the below:
Map<String, Integer> test = new HashMap<>();
for (Map.Entry<String, Integer> entry : test.entrySet()) {
    System.out.println(entry.getKey() + " " + entry.getValue());
}

I would have expected something more like this. Does it make sense?
if (wordCount.containsKey(words[i])) {
    int n = wordCount.get(words[i]);
    wordCount.put(words[i], ++n);
}
// Otherwise, puts the word into the HashMap
else {
    wordCount.put(words[i], 1);
}
If the word is already in the hashmap, we want to get the current count, add 1 to that and replace the word with the new count in the hashmap.
If the word is not yet in the hashmap, we simply put it in the map with a count of 1 to start with. The next time we see the same word we'll up the count to 2, etc.
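The same get-or-start-at-1 logic can be written more compactly with Map.merge (available since Java 8). This is a minimal sketch rather than the asker's method: the class name WordCounter is made up, and it counts words over lines that have already been read, so the file handling is left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class WordCounter {
    /** Counts how often each whitespace-separated word occurs across all lines. */
    static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (word.isEmpty()) continue; // leading whitespace yields an empty first token
                // Inserts 1 for a new word, otherwise adds 1 to the stored count.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(Arrays.asList("the cat the", "cat sat"));
        System.out.println(counts.get("the") + " " + counts.get("cat") + " " + counts.get("sat")); // 2 2 1
    }
}
```

Note that the loop visits every line it is given; the method in the question calls readLine() only once per file, so only the first line of each file is counted there.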

If you split by space only, then other characters (parentheses, punctuation marks, etc.) will be included in the words. For example, if you split "This phrase, contains... funny stuff" by space you get: "This", "phrase,", "contains...", "funny" and "stuff".
You can avoid this by splitting on word boundaries (\b) instead. Note that the separators (spaces and punctuation) then come back as tokens of their own, so you still need to skip the tokens that are not words.
line.split("\\b");
By the way, your if and else parts are identical. You're always incrementing freq by one, which doesn't make much sense. If the word is already in the map, you want to get the current frequency, add 1 to it, and update the frequency in the map. If not, you put it in the map with a value of 1.
And pro tip: always print/log the full stacktrace for the exceptions.
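If you only want the words themselves, an alternative to \b is to split on runs of non-letter characters. The sketch below is an assumption about what the assignment needs, not part of the answers above: the Tokenizer class is hypothetical, and it uses the Unicode letter class \p{L} so that accented French letters count as part of a word, which the ASCII-only \W class would not do by default.

```java
import java.util.ArrayList;
import java.util.List;

class Tokenizer {
    /** Splits a line on runs of non-letter characters and drops empty tokens. */
    static List<String> words(String line) {
        List<String> out = new ArrayList<>();
        for (String token : line.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("This phrase, contains... funny stuff"));
        // prints [This, phrase, contains, funny, stuff]
    }
}
```

One consequence of this choice is that apostrophes and hyphens also act as separators, which may or may not be what you want for French text.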

Parse .csv File in java returns outofbounds exception

I have the following issue: I am trying to parse a .csv file in java, and store specifically 3 columns of it in a 2 Dimensional array. The Code for the method looks like this:
public static void parseFile(String filename) throws IOException{
    FileReader readFile = new FileReader(filename);
    BufferedReader buffer = new BufferedReader(readFile);
    String line;
    String[][] result = new String[10000][3];
    String[] b = new String[6];
    for(int i = 0; i<10000; i++){
        while((line = buffer.readLine()) != null){
            b = line.split(";",6);
            System.out.println("ID: "+b[0]+" Title: "+b[3]+ "Description: "+b[4]); // Here is where the outofbounds exception occurs...
            result[i][0] = b[0];
            result[i][1] = b[3];
            result[i][2] = b[4];
        }
    }
    buffer.close();
}
I feel like I have to specify this: the .csv file is HUGE. It has 32 columns, and (almost) 10.000 entries (!).
When Parsing, I keep getting the following:
XXXXX CHUNKS OF SUCCESFULLY EXTRACTED CODE
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException:3
at ParseCSV.parseFile(ParseCSV.java:24)
at ParseCSV.main(ParseCSV.java:41)
However, I realized that SOME of the content in the file has a strange format: for instance, some of the texts appear to contain line breaks, although as far as I can tell there is no newline character involved. If I delete those blank lines manually, the parser adds entries to the array up until the next blank line, and then the error appears again ...
Does anyone have an idea how to fix this? Any help would be greatly appreciated...

Your first problem is that you probably have at least one blank line in your csv file. You need to replace:
b = line.split(";", 6);
with
b = line.split(";");
if(b.length < 5){
    System.err.println("Warning, line has only " + b.length +
        " entries, so skipping it:\n" + line);
    continue;
}
If your input can legitimately have new lines or embedded semicolons within your entries, that is a more complex parsing problem, and you are probably better off using a third-party parsing library, as there are several very good ones.
If your input is not supposed to have new lines in it, the problem probably is \r. Windows uses \r\n to represent a new line, while most other systems just use \n. If multiple people/programs edited your text file, it is entirely possible to end up with stray \r characters by themselves, which are not easily handled by most parsers. Note that BufferedReader.readLine() treats a lone \r as a line terminator, so with your code a stray \r shows up as an unexpected line break (a short line) rather than as a character inside line.
If you read the lines some other way that leaves \r in them, an easy way to check whether that's your problem is to do this before you split the line:
line = line.replace("\r","");
If this is a process you are repeating many times, you might need to consider using a Scanner (or library) instead to get more efficient text processing. Otherwise, you can make do with this.

When you have new lines inside a field of your CSV file, then after this line
while((line = buffer.readLine()) != null){
the variable line will not hold a complete CSV record, just a fragment of text, possibly without any ;
For example, if you have the file
column1;column2;column
3 value
after the first iteration the variable line will hold
column1;column2;column
after the second iteration it will hold
3 value
When you call "3 value".split(";",6) it returns an array with one element, and when you later access b[3] it throws the exception.
The CSV format has many small details that take a lot of time to implement correctly. This is a good article covering the possible CSV cases:
http://en.wikipedia.org/wiki/Comma-separated_values#Basic_rules_and_examples
I would recommend a ready-made CSV parser such as this one:
https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVParser.html

String's split(pattern, limit) method returns an array sized to the number of tokens found, up to the number specified by the limit parameter. The limit is the maximum, not the minimum, number of array elements returned.
"1,2,3" split with (",", 6) will return an array of 3 elements: "1", "2" and "3".
"1,2,3,4,5,6,7" will return 6 elements: "1", "2", "3", "4", "5" and "6,7". The last element is goofy because the split method stopped splitting after 5 and returned the rest of the source string as the sixth element.
An empty line is represented as an empty string (""). Splitting "" will return an array of 1 element, the empty string.
In your case, the string array created here
String[] b = new String[6];
and assigned to b is replaced by the array returned by
b = line.split(";",6);
and meets its ultimate fate at the hands of the garbage collector, unseen and unloved.
Worse, in the case of the empty lines, it's replaced by a one-element array, so
System.out.println("ID: "+b[0]+" Title: "+b[3]+ "Description: "+b[4]);
blows up when trying to access b[3].
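These three cases are easy to confirm with a few lines of code. The class name SplitLimitDemo and the helper count are made up for the illustration; the behaviour shown is that of String.split(String, int).

```java
class SplitLimitDemo {
    /** Number of elements split returns for the given input, separator regex and limit. */
    static int count(String input, String regex, int limit) {
        return input.split(regex, limit).length;
    }

    public static void main(String[] args) {
        System.out.println(count("1,2,3", ",", 6));        // 3: fewer tokens than the limit
        String[] parts = "1,2,3,4,5,6,7".split(",", 6);
        System.out.println(parts.length + " " + parts[5]); // 6 6,7: the rest stays in the last element
        System.out.println(count("", ";", 6));             // 1: a single empty string
    }
}
```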
The suggested solution is either
while((line = buffer.readLine()) != null){
    if (line.length() != 0)
    {
        b = line.split(";",6);
        System.out.println("ID: "+b[0]+" Title: "+b[3]+ "Description: "+b[4]); // Here is where the outofbounds exception occurs...
        ...
    }
}
or (better because the previous could trip over a malformed line)
while((line = buffer.readLine()) != null){
    b = line.split(";",6);
    if (b.length == 6)
    {
        System.out.println("ID: "+b[0]+" Title: "+b[3]+ "Description: "+b[4]);
        ...
    }
}
You might also want to think about the for loop around the while. I don't think it's doing you any good.
while((line = buffer.readLine()) != null)
is going to read every line in the file, so
for(int i = 0; i<10000; i++){
    while((line = buffer.readLine()) != null){
is going to read every line in the file the first time. Then it is going to make 9999 more attempts to read the file, find nothing new, and exit the while loop each time.
The for loop also doesn't help with the array: i stays at 0 for the whole while loop, so every line is stored in result[0], overwriting the previous one. If you fix that by incrementing an index inside the while loop, you are still not protected from reading more than 10000 elements, because the while loop will read a 10001st line and overrun your array if there are more than 10000 lines in the file. Look into replacing the big array with an ArrayList or Vector, as they will grow to fit your file.
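Putting these points together, a version with a single loop, a length check and a growable list might look like the sketch below. The names CsvColumns and parse are hypothetical, the method takes a Reader so it can be tried on a string (for the real file you would pass new FileReader(filename)), and it still assumes no field contains an embedded semicolon or line break; for that, a CSV library is the safer choice.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class CsvColumns {
    /** Reads semicolon-separated lines and keeps columns 0, 3 and 4 of each well-formed line. */
    static List<String[]> parse(Reader source) {
        List<String[]> result = new ArrayList<>(); // grows with the file, no fixed 10000 limit
        try (BufferedReader buffer = new BufferedReader(source)) {
            String line;
            while ((line = buffer.readLine()) != null) {
                String[] b = line.split(";", 6);
                if (b.length < 5) { // blank or malformed line: b[3] or b[4] would not exist
                    System.err.println("Skipping line with " + b.length + " field(s): " + line);
                    continue;
                }
                result.add(new String[] { b[0], b[3], b[4] });
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "1;x;y;Title A;Desc A;rest\n\n2;x;y;Title B;Desc B;rest\n";
        for (String[] row : parse(new StringReader(sample))) {
            System.out.println("ID: " + row[0] + " Title: " + row[1] + " Description: " + row[2]);
        }
    }
}
```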

Please check that b.length is greater than the largest index you use (here, b.length > 4) before accessing b[].

How to read integers/doubles from a large text file in Java

I am making a Pi-based RNG (random number generator) for a research project. I am stumped at this point because I can't figure out how to read the digits from a rather large file (1 GB). Here is the input:
....159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593344612847564823378678316527120190914564856692346034861045432664821339360726024914127372458700660631558817488152092096282925409171536436789259036001133053054882046652138414695194151160943305727036575959195309218611738193261179310511854807446237996274956735188575272489122793818301194912983367336244065664308602139494639522473719070217986094370277053921717629317675238467481846766940513200056812714526356082778577134275778960917363717872146844090122495343014654958537105079227968925892354201995611212902196086403441815981362977477130996051870721134999999837297804995105973173281609631859502445945534690830264252230825334468503526193118817101000313783875288658753320838142061717766914730359825349042875546873115956286388235378759375195778185778053217122680661300192787661119590921642019893809525720106548586327886593615338182....
The file is ugly, I know... it's Pi to the 1 billionth decimal place. I am not going into details on why I am doing this, but here is my goal. I want to be able to skip x decimal places before beginning to print output, and I also need to be able to read out y consecutive digits at a time, so if it was 4 at a time the output would look like:
1111\n
2222\n
3333\n
4444\n....
My base objective is to be able to print at least 1 digit at a time, since after that I can piece them together how I want... So the basic output is:
For input 3.1415.. I get..
3,1,4,1,5....
I tried a bunch of file streams from the Java API, but they only print bytes/bits... I have no idea how to convert them to something meaningful.
Also, reading line by line is not optimal, since my numbers have to be the same length and I feel like reading line by line would cut them off in a funny way.

What you need is a character stream, basically a subclass of Reader, so you can read character by character, rather than byte by byte.
To achieve what you need, you will have to:
open a character stream to the file containing your input digits. Wrap the FileReader in a BufferedReader to speed up the I/O, since reading char by char can be very slow, especially with large files
keep track of the previous character read (if any) and group strings of identical characters in an appropriate data structure (for instance a StringBuilder)
if you need to skip the first n characters, use Reader.skip(n); at the start
The following code does what I understand of your requirements:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Reader;

public class Test {
    public static void main(String[] args) {
        try (Reader reader = new BufferedReader(new FileReader("pi.txt"))) {
            int prevC = -1; // previous character read from the stream
            int c; // latest character read from the stream
            StringBuilder sb = new StringBuilder();
            while ((c = reader.read()) != -1) {
                // if first digit or same as previous digit
                if ((prevC == -1) || (c == prevC)) {
                    sb.append((char) c);
                } else {
                    // print the group of digits and reset sb
                    if (sb.length() > 0) {
                        System.out.println(sb.toString());
                        sb = new StringBuilder();
                    }
                    sb.append((char) c);
                }
                prevC = c;
            }
            // print the last digits group
            if (sb.length() > 0) {
                System.out.println(sb.toString());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Okay, I have spoken to a CS professor and it seems that I have forgotten my basic Java training. In an ASCII file like this one, 1 byte = 1 char, and BufferedInputStream.read() returns the ASCII value of each char. Here is a simple solution:
FileInputStream ifs = new FileInputStream(pi); //Input File containing 1 billion digits
BufferedInputStream bis = new BufferedInputStream(ifs);
System.out.println((char)bis.read()); //Build strings or parse chars how you want
...Rinse and repeat. Sorry for wasting time... but I hope this will set someone on the right track down the road.
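For the original goal (skip x digits, then read y consecutive digits at a time), the "rinse and repeat" could be wrapped up roughly as follows. This is a sketch under assumptions: the class DigitGroups and its read method are invented for illustration, non-digit characters such as the decimal point and line breaks are ignored rather than counted, and an incomplete final group is dropped. For the real file you would pass something like new FileReader("pi.txt").

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class DigitGroups {
    /** Skips `skip` digits, then returns up to `maxGroups` groups of `groupSize` consecutive digits. */
    static List<String> read(Reader source, long skip, int groupSize, int maxGroups) {
        List<String> groups = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            long skipped = 0;
            StringBuilder group = new StringBuilder(groupSize);
            int c;
            while (groups.size() < maxGroups && (c = in.read()) != -1) {
                if (c < '0' || c > '9') continue;            // ignore '.', spaces, line breaks
                if (skipped < skip) { skipped++; continue; } // still skipping leading digits
                group.append((char) c);
                if (group.length() == groupSize) {
                    groups.add(group.toString());
                    group.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Skip the leading 3, then print three groups of four digits.
        for (String g : read(new StringReader("3.14159265358979"), 1, 4, 3)) {
            System.out.println(g); // 1415, 9265, 3589 (one per line)
        }
    }
}
```

Because the method stops after maxGroups groups, it never has to hold more than a few characters of the 1 GB file in memory at once.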

Comparing Sentences From a Read-In File - Java

I need to read in a file that contains 2 sentences to compare and return a number between 0 and 1. If the sentences are exactly the same it should return a 1 for true and if they are totally opposite it should return a 0 for false. If the sentences are similar but words are changed to synonyms or something close it should return a .25 .5 or .75. The text file is formatted like this:
______________________________________
Text: Sample
Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.
Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
// Should score high point but not 1
Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
// Should score lower than text20
Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
// Should score lower than text21 but NOT 0
Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats.
// Should score a 0!
________________________________________________
I have a file reader, but I am not sure the best way to store each line so I can compare them. For now I have the file being read and then being printed out on the screen. What is the best way to store these and then compare them to get my desired number?
import java.io.*;
public class implement
{
    public static void main(String[] args)
    {
        try
        {
            FileInputStream fstream = new FileInputStream("textfile.txt");
            DataInputStream in = new DataInputStream (fstream);
            BufferedReader br = new BufferedReader (new InputStreamReader(in));
            String strLine;
            while ((strLine = br.readLine()) != null)
            {
                System.out.println (strLine);
            }
            in.close();
        }
        catch (Exception e)
        {
            System.err.println("Error: " + e.getMessage());
        }
    }
}

Save them in an array list.
ArrayList<String> list = new ArrayList<>();
//Read File
//While loop
list.add(strLine);
To check each word in a sentence, simply remove punctuation, then split on spaces and search for each word in the sentence you are comparing against. I would suggest ignoring words of 2 or 3 characters; it is up to your discretion.
Then save the lines to the list and compare them however you want.
To compare similar words you will need a database to check words efficiently, i.e. a hash table. Once you have this you can look up words quickly. Next, this hash table of words will need a thesaurus entry linked to each word, listing similar words. Then take the similar words for the key words in each sentence and search for them in the sentence you are comparing against. Obviously, before you search for similar words you would want to compare the two actual sentences directly. In the end, to do more than direct comparisons you will need a more advanced data structure that you will have to build yourself.
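As a starting point only, the word-by-word comparison with a synonym lookup could be sketched like this. It uses Jaccard similarity (shared words divided by all distinct words), which is one possible scoring choice and not something the question prescribes; the class SentenceSimilarity and the tiny synonym map are hypothetical. It ignores word order and negation, so it would not give Text 24 the score of 0 that the question asks for.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SentenceSimilarity {
    /** Lower-cases, strips punctuation and maps each word to its canonical synonym, if any. */
    static Set<String> words(String text, Map<String, String> synonyms) {
        Set<String> out = new HashSet<>();
        for (String w : text.toLowerCase().split("[^a-z]+")) {
            if (w.isEmpty()) continue;
            out.add(synonyms.getOrDefault(w, w));
        }
        return out;
    }

    /** Jaccard similarity: size of the intersection over size of the union, between 0 and 1. */
    static double similarity(String a, String b, Map<String, String> synonyms) {
        Set<String> wa = words(a, synonyms);
        Set<String> wb = words(b, synonyms);
        if (wa.isEmpty() && wb.isEmpty()) return 1.0;
        Set<String> common = new HashSet<>(wa);
        common.retainAll(wb);
        Set<String> all = new HashSet<>(wa);
        all.addAll(wb);
        return (double) common.size() / all.size();
    }

    public static void main(String[] args) {
        Map<String, String> synonyms = new HashMap<>();
        synonyms.put("murky", "dark"); // hypothetical thesaurus entry
        System.out.println(similarity("a dark night", "a murky night", new HashMap<>())); // 0.5
        System.out.println(similarity("a dark night", "a murky night", synonyms));        // 1.0
    }
}
```

Scoring synonyms lower than exact matches, as the question's 0.25/0.5/0.75 steps suggest, would need weights on top of this, for example counting a synonym match as a fraction of a full match.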
