Comparing Sentences From a Read-In File - Java

I need to read in a file that contains two sentences, compare them, and return a number between 0 and 1. If the sentences are exactly the same, it should return 1; if they are totally opposite, it should return 0. If the sentences are similar but some words have been replaced with synonyms or near-synonyms, it should return 0.25, 0.5, or 0.75. The text file is formatted like this:
______________________________________
Text: Sample
Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.
Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
// Should score high point but not 1
Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
// Should score lower than text20
Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
// Should score lower than text21 but NOT 0
Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats.
// Should score a 0!
________________________________________________
I have a file reader, but I am not sure of the best way to store each line so that I can compare them. For now, the file is read and printed to the screen. What is the best way to store the lines and then compare them to get the number I want?
import java.io.*;
public class implement
{
    public static void main(String[] args)
    {
        try
        {
            FileInputStream fstream = new FileInputStream("textfile.txt");
            DataInputStream in = new DataInputStream (fstream);
            BufferedReader br = new BufferedReader (new InputStreamReader(in));
            String strLine;
            while ((strLine = br.readLine()) != null)
            {
                System.out.println (strLine);
            }
            in.close();
        }
        catch (Exception e)
        {
            System.err.println("Error: " + e.getMessage());
        }
    }
}

Save them in an ArrayList.
// needs: import java.util.ArrayList; import java.util.List;
List<String> list = new ArrayList<>();
// Read the file; inside the while loop:
list.add(strLine);
To check each word in a sentence, remove the punctuation, split on spaces, and search for each word in the sentence you are comparing against. I would suggest ignoring words of 2 or 3 characters, though that is at your discretion.
Then save the lines to the list and compare them however you want.
To compare similar words, you will need a data structure that can look words up efficiently, such as a hash table. Each word in that hash table then needs a thesaurus entry linking it to similar words. Take the similar words for the key words in each sentence and search for them in the sentence you are comparing against. Obviously, before you search for similar words you would want to compare the two actual sentences directly. In the end, you will have to build a more advanced data structure yourself to do anything beyond direct comparisons.
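As a rough illustration of the idea above, here is a minimal sketch. It assumes a tiny hand-made synonym map (the entries below are invented for the sample text, not a real thesaurus) and scores two sentences by word overlap (a Jaccard ratio, which is my choice of measure, not something the question prescribes). The class and method names are hypothetical. It ignores word order and negation, so it would not score Text 22 lower than Text 21 or give Text 24 a 0 on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SentenceSimilarity {
    // Hypothetical mini-thesaurus: maps a word to one canonical synonym.
    static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("murky", "dark");
        SYNONYMS.put("crimson", "red");
        SYNONYMS.put("felines", "cats");
    }

    // Lowercase, replace punctuation with spaces, split on whitespace,
    // drop words shorter than minLength, and map each word to its canonical form.
    static List<String> words(String sentence, int minLength) {
        List<String> result = new ArrayList<>();
        String cleaned = sentence.toLowerCase().replaceAll("\\p{Punct}+", " ").trim();
        for (String w : cleaned.split("\\s+")) {
            if (!w.isEmpty() && w.length() >= minLength) {
                result.add(SYNONYMS.getOrDefault(w, w));
            }
        }
        return result;
    }

    // Shared distinct words divided by all distinct words (Jaccard ratio), in [0, 1].
    static double score(String a, String b, int minLength) {
        Set<String> first = new HashSet<>(words(a, minLength));
        Set<String> second = new HashSet<>(words(b, minLength));
        Set<String> union = new HashSet<>(first);
        union.addAll(second);
        if (union.isEmpty()) {
            return 1.0; // two sentences with no usable words are treated as equal
        }
        Set<String> shared = new HashSet<>(first);
        shared.retainAll(second);
        return (double) shared.size() / union.size();
    }
}
```

With this map, SentenceSimilarity.score("a red chair", "a crimson chair", 1) returns 1.0, because "crimson" is mapped to "red" before comparing. Be careful with a minLength of 3 or 4: it would drop "not", which matters for the negated sample.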

Related

How to calculate the amount of letters written in a txt file

I'd like to know whether there is an easy way to count the letters in a txt file.
Let's say I have several txt files, each containing a different number of letters, and I want to delete every file that has more than, say, 2000 letters.
Furthermore, let's assume I deal with one txt file at a time. I've tried this so far:
try (FileReader reader2 = new FileReader("C:\\Users\\Internet\\eclipse-workspace\\test2.txt");
     BufferedReader buff = new BufferedReader(reader2)) {
    int counter = 0;
    while (buff.ready()) {
        String aa = buff.readLine();
        counter = counter + aa.length();
    }
    System.out.println(counter);
}
catch (Exception e) {
    e.printStackTrace();
}
Is there an easier way, or one with better performance?
Reading all the letters into a String just to discard them afterwards seems like a waste of time.
Should I maybe use an InputStream, call available(), and then divide? On the other hand, I saw that available() counts literally everything: when I press Enter in the txt file, it adds 2 to the count.
Thanks for all answers.
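You can use Files.lines as below. It can throw IOException, and the stream should be closed, hence the try-with-resources. The lines it returns exclude the line terminators, so line breaks are not counted:
// needs: java.nio.file.Files, java.nio.file.Paths, java.util.stream.Stream
int counter;
try (Stream<String> lines = Files.lines(Paths.get("C:\\Users\\Internet\\eclipse-workspace\\test2.txt"))) {
    counter = lines.mapToInt(String::length).sum();
}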
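If "letters" should exclude spaces, digits, and punctuation, a variation along these lines might work. This is only a sketch: the class name, the method names, and the choice of UTF-8 are my assumptions, and "letter" here means whatever Character.isLetter accepts.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class LetterCounter {
    // Counts every character except line terminators.
    static long countChars(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.mapToLong(String::length).sum();
        }
    }

    // Counts only the characters that Character.isLetter accepts.
    static long countLetters(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.mapToLong(l -> l.chars().filter(Character::isLetter).count()).sum();
        }
    }

    // Deletes the file if it holds more than 'limit' letters; returns true if it was deleted.
    static boolean deleteIfLongerThan(Path file, long limit) throws IOException {
        if (countLetters(file) > limit) {
            Files.delete(file);
            return true;
        }
        return false;
    }
}
```

For the use case in the question, something like LetterCounter.deleteIfLongerThan(Paths.get("test2.txt"), 2000) would be the call, with the path adjusted to the real file.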

Java - How to Delimit Single Quotes Around a Phrase but Not an Apostrophe in a Word

I am practicing Java on my own from a book. I read the chapter on text processing and wrapper classes and attempted the exercise below.
Word Counter
Write a program that asks the user for the name of a file. The program should display the number of words that the file contains.
import java.io.File;
import java.io.IOException;
import java.util.Scanner;
import java.util.StringTokenizer;
public class FileWordCounter {
public static void main(String[] args) throws IOException {
// Create a Scanner object
Scanner keyboard = new Scanner(System.in);
// Ask user for filename
System.out.print("Enter the name of a file: ");
String filename = keyboard.nextLine();
// Open file for reading
File file = new File(filename);
Scanner inputFile = new Scanner(file);
int words = 0;
String word = "";
while (inputFile.hasNextLine()) {
String line = inputFile.nextLine();
System.out.println(line); // for debugging
StringTokenizer stringTokenizer = new StringTokenizer(line, " \n.!?;,()"); // Create a StringTokenizer object and use the current line contents and delimiters as parameters
while (stringTokenizer.hasMoreTokens()) { // for each line do this
word = stringTokenizer.nextToken();
System.out.println(word); // for debugging
words++;
}
System.out.println("Line contains " + words + " words");
}
// Close file
inputFile.close();
System.out.println("The file has " + words + " words.");
}
}
I chose this random poem from online to test this program. I put the poem in a file called TheSniper.txt:
Two hundred yards away he saw his head;
He raised his rifle, took quick aim and shot him.
Two hundred yards away the man dropped dead;
With bright exulting eye he turned and said,
'By Jove, I got him!'
And he was jubilant; had he not won
The meed of praise his comrades haste to pay?
He smiled; he could not see what he had done;
The dead man lay two hundred yards away.
He could not see the dead, reproachful eyes,
The youthful face which Death had not defiled
But had transfigured when he claimed his prize.
Had he seen this perhaps he had not smiled.
He could not see the woman as she wept
To the news two hundred miles away,
Or through his very dream she would have crept.
And into all his thoughts by night and day.
Two hundred yards away, and, bending o'er
A body in a trench, rough men proclaim
Sadly, that Fritz, the merry is no more.
(Or shall we call him Jack? It's all the same.)
Here is some of my output...
For debugging purposes, I print out each line and the running total of words in the file, up to and including those in the current line.
Enter the name of a file: TheSniper.txt
Two hundred yards away he saw his head;
Two
hundred
yards
away
he
saw
his
head
Line contains 8 words
He raised his rifle, took quick aim and shot him.
He
raised
his
rifle
took
quick
aim
and
shot
him
Line contains 18 words
...
At the end, my program displays that the poem has 176 words. However, Microsoft Word counts 174 words. I see from printing each word that I am miscounting apostrophes and single quotes. Here is the last section of the poem in my output where the problem occurs:
(Or shall we call him Jack? It's all the same.)
Or
shall
we
call
him
Jack
It
s
all
the
same
Line contains 176 words
The file has 176 words
In my StringTokenizer delimiter list, when I don't include the single quote (which looks the same as an apostrophe), the word "It's" is counted as one word. However, when I do include it, "It's" is counted as two words (It and s) because the apostrophe gets treated as a delimiter. Also, the phrase 'By Jove, I got him!' is miscounted when I don't delimit on the single quote/apostrophe. Are the apostrophe and the single quote the same character when it comes to delimiting? I'm not sure how to delimit on single quotes that surround a phrase but not on an apostrophe inside a word like "It's". I hope my question is reasonably clear; please ask for any clarification. Any guidance is appreciated. Thank you!
Why not use another Scanner for each line to count the number of words?
int words = 0;
while (inputFile.hasNextLine()) {
    int lineLength = 0;
    Scanner lineScanner = new Scanner(inputFile.nextLine());
    while (lineScanner.hasNext()) {
        System.out.println(lineScanner.next());
        lineLength++;
    }
    System.out.println("Line contains " + lineLength + " words");
    words += lineLength;
}
I don't believe a StringTokenizer can delimit on the single quotes around a phrase like "'By Jove, I got him!'" while ignoring the one in "it's"; for that you would need a regex search that ignores single quotes in the middle of a word.
Alternatively, you could treat the characters ".!?;,()" as part of a word (e.g. "Jack?" is one word), which will give you the correct word count. This is what the Scanner does. Just change the delimiter in your StringTokenizer to " " (\n isn't required since you're already processing line by line):
StringTokenizer stringTokenizer = new StringTokenizer(line, " ");
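For the regex route mentioned above, one possible sketch is to match words directly instead of listing delimiters. The pattern below is my own assumption about what should count as a word: a run of letters, optionally followed by an apostrophe and more letters. It only knows the straight apostrophe ('), not the typographic one, and it does not count digits as words. The class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ApostropheWordCounter {
    // Letters, optionally followed by one or more groups of "apostrophe + letters".
    // A quote at the start or end of a phrase is not followed/preceded by a letter
    // on both sides, so it is never absorbed into a word.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:'\\p{L}+)*");

    static int countWords(String line) {
        Matcher matcher = WORD.matcher(line);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }
}
```

With this pattern, "It's" and "o'er" are each one match, while the quotes around 'By Jove, I got him!' are skipped, so that line counts as 5 words.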

Find Word From a jumbled String

I have a scrambled String as follows: "artearardreardac".
I have a text file containing close to 300,000 English dictionary words. I need to find English words in the scrambled string and arrange them into a word square as follows:
C A R D
A R E A
R E A R
D A R T
My intention was to loop through the scrambled String and query the text file on each iteration, trying to match 4 characters at a time to see whether they form a valid word.
The problem with this is that every iteration checks against 300,000 words, which is going to take ages. I looped through only the first letter 16 times, and that alone takes significant time. The number of possibilities with this method seems endless. Even if I set efficiency aside for now, I could end up finding English words that do not form a word square.
My guess is that I have to find the words while somehow keeping the letters in the correct formation from the start. I have been at it for hours and have gone from fun to frustration. Can I get some guidance, please? I looked for similar questions but found none.
Note: This is an example, and I am trying to keep it open to a longer string or a square of a different size. (The example is 4x4. The user can decide to go with a 5x5 square and a string of length 25.)
My Code
public static void main(String[] args){
String result = wordSquareCreator(4, "artearardreardac");
System.out.println(result);
}
static String wordSquareCreator(int dimension, String letter){
String sortedWord = "";
String temp;
int front = 0;
int firstLetterFront = 0;
int back = dimension;
//Looping through first 4 letters and only changing the first letter 16 times to try a match.
for (int j = 0; j < letter.length(); j++) {
String a = letter.substring(firstLetterFront, j+1) + letter.substring(front+1, back);
temp = readFile(dimension, a);
if(temp != null){
sortedWord+= temp;
}
firstLetterFront++;
}
return sortedWord;
}
static String readFile(int dimension, String word){
//dict text file contains 300,000 English words
File file = new File("dict.txt");
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader(file));
String text;
while ((text = reader.readLine()) != null) {
if(text.length() == dimension) {
if(text.equals(word)){
//found a valid English word
return text;
}
}
}
}catch (Exception e){
e.printStackTrace();
}
finally {
try {
if(reader != null)
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
return null;
}
You can greatly cut down your search space if you organize your dictionary properly. (This can be done as you read it in; you don't need to modify the file on disk.)
Break it up into one list per word length, then sort each list.
Now, to reduce your search space: assuming the square reads the same across and down, as in your example, every letter off the main diagonal (top left to bottom right) appears in a mirrored pair, so a letter that occurs an odd number of times must sit on that diagonal. You have an odd number of C, T, R and A, so those 4 letters make up the diagonal. (Note that you will not always be able to do this, as the odd-count letters aren't guaranteed to be unique.) Your search space is now one set of 4 diagonal cells with 4 letters (4! = 24 orderings) and one set of 6 cells above the diagonal (6! = 720 orderings, fewer once duplicate letters are accounted for). That is about 17k possible boards and under 1k words to try (edit: I originally said 5k, but you can restrict the space to words starting with the correct letter, and since it's a sorted list you don't need to consider the others at all), so you're already under 20 million possibilities to examine. You can cut this considerably by first filtering your word list down to the words that contain only the letters that are used.
At this point an exhaustive search isn't going to be prohibitive.
Since it seems that you want to create a word square out of the letters passed as a parameter to your function, you know that the word length in your square is exactly sqrt(amountOfLetters). In your example code that would be sqrt(16) = 4. You can also disqualify a lot of words directly from your dictionary:
discard a word if it does not start with a letter in your "alphabet" (i.e. "A", "C", "D", "E", "R", "T")
discard a word if its length is not equal to your word length (i.e. 4)
discard a word if it has a letter not in your alphabet
The number of words that you want to "write" in your square is wordlength * 2 (since the words can only start from the top row or from the left column).
You could actually start by going through your dictionary and copying only the valid words into a new file. Then check your square against this new, shorter dictionary.
For building up the square, I think there are 2 possibilities to choose between.
The first one is to randomly arrange the letters into a square and check whether they form correct words.
The second one is to randomly choose "correct" words from the dictionary and write them into your square. After that, you check whether the words use the correct number and arrangement of letters.
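To make the filtering and checking described in the answers above more concrete, here is a minimal backtracking sketch rather than a random search. It is built on assumptions of mine: the dictionary is passed in as an already-loaded List<String>, the letters are plain a-z, and the square is symmetric (row k equals column k), as in the CARD/AREA/REAR/DART example. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class WordSquareSolver {
    // Returns the rows of a symmetric word square that uses exactly the given
    // letters, or null if none can be built from the dictionary.
    static List<String> solve(String letters, List<String> dictionary) {
        int n = (int) Math.round(Math.sqrt(letters.length()));
        if (n * n != letters.length()) {
            return null; // the letter count is not a perfect square
        }
        int[] remaining = new int[26]; // how many of each letter are still unused
        for (char c : letters.toLowerCase().toCharArray()) {
            remaining[c - 'a']++;
        }
        List<String> candidates = new ArrayList<>();
        for (String word : dictionary) {
            String w = word.toLowerCase();
            if (w.length() == n && w.matches("[a-z]+")) {
                candidates.add(w); // keep only words of the right length
            }
        }
        List<String> rows = new ArrayList<>();
        return place(rows, candidates, remaining, n) ? rows : null;
    }

    private static boolean place(List<String> rows, List<String> candidates,
                                 int[] remaining, int n) {
        if (rows.size() == n) {
            return true; // n rows placed; all n*n letters are used up
        }
        // By symmetry, the next row must start with the next column of the rows so far.
        StringBuilder prefix = new StringBuilder();
        for (String row : rows) {
            prefix.append(row.charAt(rows.size()));
        }
        for (String w : candidates) {
            if (!w.startsWith(prefix.toString()) || !take(w, remaining)) {
                continue;
            }
            rows.add(w);
            if (place(rows, candidates, remaining, n)) {
                return true;
            }
            rows.remove(rows.size() - 1); // backtrack
            giveBack(w, remaining);
        }
        return false;
    }

    // Consumes the letters of w; undoes itself and returns false if a letter runs out.
    private static boolean take(String w, int[] remaining) {
        for (int i = 0; i < w.length(); i++) {
            if (--remaining[w.charAt(i) - 'a'] < 0) {
                for (int j = 0; j <= i; j++) {
                    remaining[w.charAt(j) - 'a']++;
                }
                return false;
            }
        }
        return true;
    }

    private static void giveBack(String w, int[] remaining) {
        for (int i = 0; i < w.length(); i++) {
            remaining[w.charAt(i) - 'a']++;
        }
    }
}
```

The dictionary is read once into memory before calling solve, instead of reopening dict.txt for every candidate as readFile does. Sorting the candidates and binary-searching for the prefix, as suggested in the first answer, would be a natural next optimisation.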

How do I get rid of these empty strings?

My constructor takes the filename of a text file and converts the file to an ArrayList of all its words in lowercase, without punctuation or white space. These specs, along with the constructor's argument, are specified by my homework assignment, so don't suggest I change them.
private ArrayList<String> list;
public Tokenizer(String file) throws IOException {
list = new ArrayList<>();
String thisLine;
BufferedReader br = new BufferedReader(new FileReader(file));
while ((thisLine = br.readLine()) != null)
list.addAll(Arrays.asList(thisLine.replaceAll("\\p{Punct}+"," ").toLowerCase().split("\\s+")));
}
My problem is that many empty strings appear. I've tried using "-1" as the second argument to "split", but it doesn't seem to do anything.
My other question is whether it's inefficient to use Arrays.asList, or whether I should just create an iterator, and whether you think I am doing anything else wrong. E.g., is there another way to pass a filename to the BufferedReader?
Thanks
Edit 1:
Below is the test I used, on an online book I found on Project Gutenberg (it is a text file, and there are no problems with the file itself). I also get similar results with a text file that I created myself, so I don't think it's a problem with the text file.
In fact, I'll just reproduce my entire code since it's pretty simple:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;
public class Tokenizer {
private ArrayList<String> list;
public Tokenizer(String file) throws IOException {
list = new ArrayList<>();
String thisLine;
BufferedReader br = new BufferedReader(new FileReader(file));
while ((thisLine = br.readLine()) != null)
list.addAll(Arrays.asList(thisLine.replaceAll("\\p{Punct}+"," ").toLowerCase().trim().split("\\s+")));
}
public ArrayList<String> wordList() {
return list;
}
public static void main(String[] args) throws IOException {
Tokenizer T = new Tokenizer("C:\\...\\1898amongmyb00loweuoft_djvu.txt");
ArrayList<String> array = T.wordList();
for(int i = 0; i < 20; i++) {
System.out.println(array.get(i));
}
}
}
And here is my output:
i
9
digitized
by
the
internet
archive
in
2007
with
funding
from
microsoft
corporation
No, those empty lines are not white space. They are empty strings. As in, "". I hope I am as clear as possible.
Since it will probably cause confusion, no that is not the actual argument I use for the path name of the file. The ellipsis (the "...") is just a shorthand, so I don't have to reveal my computer directories to the internet.
Also, yes there is another empty string at the end, but this website's interface will not let me put it there.
Edit 2:
I always forget something, here is the first few lines of the text file:
I 9
Digitized by the Internet Archive
in 2007 with funding from
Microsoft Corporation
http://www.archive.org/details/1898amongmyb00loweuoft
James Ettsscll Lotocll.
COMPLETE POETICAL AND PROSE WORKS. Riverside
Edition, n vols, crown 8vo, gilt top, each, $ 1.50 ; the set,
$ 1 6. 50.
1-4. Literary Essays (including My Study Windows, Among
My Books, Fireside Travels) ; 5. Political Essays ; 6. Literary
and Political Addresses ; 7. Latest Literary Essays and Ad-
dresses, The Old English Dramatists ; 8-1 1. Poems.
PROSE WORKS. Riverside Edition. With Portraits. 7 vols,
crown 8vo, gilt top, $10.50.
POEMS. Riverside Edition. With Portraits. 4 vols, crown
8vo, gilt top, $6.00.
COMPLETE POETICAL WORKS. Cambridge Edition.
Printed from clear type on opaque paper, and attractively
bound. With a Portrait and engraved Title-page, and a
Vignette of Lowell's Home, Elmwood. Large crown 8vo, $2.00.
Household Edition. With Portrait and Illustrations. Crown
8vo, $1.50.
Cabinet Edition. i8
I think I now see the problem. The empty strings correspond to the empty lines.
Edit 3:
So I ended up answering my own problem. I ended up doing this:
while ((thisLine = br.readLine()) != null) {
    ArrayList<String> newList = new ArrayList<>(Arrays.asList(thisLine.replaceAll("\\p{Punct}+"," ").toLowerCase().split("\\s+")));
    // remove("") returns true while an empty string was found and removed
    while (newList.remove(""));
    list.addAll(newList);
}
I did try using an if statement, but then you are checking the line before the split. This could be problematic because the split may produce some empty strings that you would then miss. Therefore, I built the list I was going to add to my main list, and before adding it, I went through it and deleted every empty string.
I don't really know if this is the most efficient way of doing things... if it's not, let me know!
Your problem is most likely whitespace at the beginning of thisLine (or punctuation there, which your replaceAll turns into a space). It is very common for a text document to have lines like this. When you split on \s+, leading whitespace produces an empty string as the first element; trailing empty strings are discarded by split, so whitespace at the end of the line is harmless.
To fix this, I would suggest adding a trim on your string before you do the split.
Using your code, change it to:
list.addAll(Arrays.asList(thisLine.replaceAll("\\p{Punct}+"," ").toLowerCase().trim().split("\\s+")));
Try that and see if it gets rid of most, if not all, of your empty strings. A completely empty line will still yield one empty string, because splitting "" returns an array containing just "". Also, you should consider breaking this statement up into multiple operations so that it is easier to read.
How about replacing
while ((thisLine = br.readLine()) != null)
    list.addAll(Arrays.asList(thisLine.replaceAll("\\p{Punct}+"," ").toLowerCase().trim().split("\\s+")));
with:
while ((thisLine = br.readLine()) != null)
    if (thisLine.length() > 0)
        list.addAll(Arrays.asList(thisLine.replaceAll("\\p{Punct}+", " ").toLowerCase().trim().split("\\s+")));
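One case the length check on the raw line may still miss is a line that contains only whitespace or punctuation: it is not empty before cleaning, but it is afterwards. A possible sketch that checks after cleaning instead is below; the helper class and method name are hypothetical, and the cleaning steps are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;

class LineTokenizer {
    // Returns the lowercase words of one line, never containing an empty string.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        String cleaned = line.replaceAll("\\p{Punct}+", " ").toLowerCase().trim();
        if (cleaned.isEmpty()) {
            return out; // blank, whitespace-only, or punctuation-only line
        }
        // After trim() there is no leading whitespace, so split yields no empty strings.
        for (String token : cleaned.split("\\s+")) {
            out.add(token);
        }
        return out;
    }
}
```

Inside the constructor's loop this would be used as list.addAll(LineTokenizer.tokens(thisLine)).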

Verifying unexpected empty lines in a file

Aside: I am using the penn.txt file for the problem. The link here is to my Dropbox but it is also available in other places such as here. However, I've not checked whether they are exactly the same.
Problem statement: I would like to do some word processing on each line of the penn.txt file which contains some words and syntactic categories. The details are not relevant.
Actual "problem" faced: I suspect that the file has some consecutive blank lines (which ideally should not be present). I think my code confirms this, but I have not checked by eye because the number of lines is fairly large (~1,300,000). So I would like my Java code and conclusions checked for correctness.
I've used a slightly modified version of the usual code for converting a file to a String and for counting the number of lines in a string. I'm not sure about the efficiency of splitting, but it works well enough for this case.
File file = new File("final_project/penn.txt"); //location
System.out.println(file.exists());
//converting file to String
byte[] encoded = null;
try {
encoded = Files.readAllBytes(Paths.get("final_project/penn.txt"));
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
String mystr = new String(encoded, StandardCharsets.UTF_8);
//splitting and checking "consecutiveness" of \n
for(int j=1; ; j++){
String split = new String();
for(int i=0; i<j; i++){
split = split + "\n";
}
if(mystr.split(split).length==1) break;
System.out.print("("+mystr.split(split).length + "," + j + ") ");
}
//counting using Scanner
int count=0;
try {
Scanner reader = new Scanner(new FileInputStream(file));
while(reader.hasNext()){
count++;
String entry = reader.next();
//some word processing here
}
reader.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
}
System.out.println(count);
The number of lines in Gedit--if I understand correctly--matched the number of \n characters found at 1,283,169. I have verified (separately) that the number of \r and \r\n (combined) characters is 0 using the same splitting idea. The total splitting output is shown below:
(1283169,1) (176,2) (18,3) (13,4) (11,5) (9,6) (8,7) (7,8) (6,9) (6,10) (5,11) (5,12) (4,13) (4,14) (4,15) (4,16) (3,17) (3,18) (3,19) (3,20) (3,21) (3,22) (3,23) (3,24) (3,25) (2,26) (2,27) (2,28) (2,29) (2,30) (2,31) (2,32) (2,33) (2,34) (2,35) (2,36) (2,37) (2,38) (2,39) (2,40) (2,41) (2,42) (2,43) (2,44) (2,45) (2,46) (2,47) (2,48) (2,49) (2,50)
Please answer whether the following statements are correct or not:
From this, what I understand is that there is one instance of 50 consecutive \n characters, and because of that there are exactly two instances of 25 consecutive \n characters, and so on.
The last count (using Scanner) gives 1,282,969, which is a difference of exactly 200. In my opinion, this means that there are exactly 200 (or 199?) empty lines floating about somewhere in the file.
Is there any way to verify this "discrepancy" of 200 independently? (Something like a set-theoretic counting of intersections, maybe.)
A partial answer to the question (the last part) is as follows:
(Assuming the two statements in the question are true)
If, instead of printing the number of split parts, you print the number of occurrences of \n repeated j times, you'll get (simply subtracting 1):
(1283168,1) (175,2) (17,3) (12,4) (10,5) (8,6) (7,7) (6,8) (5,9) (5,10) (4,11) (4,12) (3,13) (3,14) (3,15) (3,16) (2,17) (2,18) (2,19) (2,20) (2,21) (2,22) (2,23) (2,24) (2,25) (1,26) (1,27) (1,28) (1,29) (1,30) (1,31) (1,32) (1,33) (1,34) (1,35) (1,36) (1,37) (1,38) (1,39) (1,40) (1,41) (1,42) (1,43) (1,44) (1,45) (1,46) (1,47) (1,48) (1,49) (1,50)
Note that for j>3, the product of the two numbers is <=50, which is your maximum. This means that there is one place with 50 consecutive \n characters, and all the hits you are getting for j from 4 to 49 are actually part of that same run.
However, for 3, the largest multiple of 3 not exceeding 50 is 48, which gives 16 occurrences, while you have 17 here. So there is an extra \n\n\n somewhere with a non-\n character on both sides of it.
Now for 2 (\n\n), we can subtract 25 (coming from the 50 \n characters) and 1 (coming from the separate \n\n\n) to obtain 175-26 = 149.
To account for the discrepancy, we should sum (2-1)*149 + (3-1)*1 + (50-1)*1, the -1 arising because the first \n in each of these runs is already accounted for in the Scanner count. This sum is 200.
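One way to check this independently of split might be to scan the string once and count the maximal runs of consecutive \n directly. The sketch below does that; the class and method names are hypothetical, and the blank-line formula assumes every run sits between two non-empty lines (a run at the very start of the file would need separate handling).

```java
import java.util.Map;
import java.util.TreeMap;

class NewlineRuns {
    // Maps run length -> number of maximal runs of consecutive '\n' with that length.
    static Map<Integer, Integer> runLengths(String text) {
        Map<Integer, Integer> runs = new TreeMap<>();
        int run = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                run++;
            } else {
                if (run > 0) {
                    runs.merge(run, 1, Integer::sum);
                }
                run = 0;
            }
        }
        if (run > 0) {
            runs.merge(run, 1, Integer::sum); // run that ends the text
        }
        return runs;
    }

    // A run of k newlines between two non-empty lines contains k - 1 blank lines.
    static int blankLines(Map<Integer, Integer> runs) {
        int total = 0;
        for (Map.Entry<Integer, Integer> entry : runs.entrySet()) {
            total += (entry.getKey() - 1) * entry.getValue();
        }
        return total;
    }
}
```

If the reasoning above is right, NewlineRuns.runLengths(mystr) should show 149 runs of length 2, one of length 3, and one of length 50, and blankLines should return (2-1)*149 + (3-1)*1 + (50-1)*1 = 200.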
