How to read integers/doubles from a large text file in Java


I am building a Pi-based RNG (random number generator) for a research project. I am stumped because I can't figure out how to read the digits from a rather large file (1 GB). Here is the input:
....159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593344612847564823378678316527120190914564856692346034861045432664821339360726024914127372458700660631558817488152092096282925409171536436789259036001133053054882046652138414695194151160943305727036575959195309218611738193261179310511854807446237996274956735188575272489122793818301194912983367336244065664308602139494639522473719070217986094370277053921717629317675238467481846766940513200056812714526356082778577134275778960917363717872146844090122495343014654958537105079227968925892354201995611212902196086403441815981362977477130996051870721134999999837297804995105973173281609631859502445945534690830264252230825334468503526193118817101000313783875288658753320838142061717766914730359825349042875546873115956286388235378759375195778185778053217122680661300192787661119590921642019893809525720106548586327886593615338182....
The file is ugly, I know; it is Pi to the one-billionth decimal place. I won't go into why I am doing this, but here is my goal: I want to skip x decimal places before output begins, and I also need to read y consecutive digits at a time. For example, with 4 at a time the output would look like:
1111\n
2222\n
3333\n
4444\n....
My base objective is to print at least one digit at a time, since after that I can piece them together however I want. So the basic output is:
For input 3.1415.. I get..
3,1,4,1,5....
I tried a bunch of file streams from the Java API, but they only print bytes/bits, and I have no idea how to convert them into something meaningful.
Also, reading line by line is not optimal, since my numbers have to be the same length and I feel that reading line by line would cut them off in a funny way.

What you need is a character stream, that is, a subclass of Reader, so you can read character by character rather than byte by byte.
To achieve what you need, you will have to:
open a character stream to the file containing your input digits. Wrap the FileReader in a BufferedReader to speed up the I/O, since unbuffered char-by-char reads can be very slow, especially with large files
keep track of the previous character read (if any) and group strings of identical characters in an appropriate data structure (for instance a StringBuilder)
if you need to skip the first n characters, call Reader.skip(n) at the start
The following code does exactly what I understand of your requirements:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Reader;

public class Test {
    public static void main(String[] args) {
        try (Reader reader = new BufferedReader(new FileReader("pi.txt"))) {
            int prevC = -1; // previous character read from the stream
            int c; // latest character read from the stream
            StringBuilder sb = new StringBuilder();
            while ((c = reader.read()) != -1) {
                // if first digit or same as previous digit
                if ((prevC == -1) || (c == prevC)) {
                    sb.append((char) c);
                } else {
                    // print the group of digits and reset sb
                    if (sb.length() > 0) {
                        System.out.println(sb.toString());
                        sb = new StringBuilder();
                    }
                    sb.append((char) c);
                }
                prevC = c;
            }
            // print the last digits group
            if (sb.length() > 0) {
                System.out.println(sb.toString());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
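The code above groups runs of identical digits, whereas the question asks for skipping x digits and then emitting fixed-size groups of y digits. A minimal sketch of that variant follows. The class name DigitChunks, the method readChunks and its maxChunks parameter are hypothetical names chosen for illustration. It counts digits itself instead of calling Reader.skip, so that the decimal point and any line breaks do not count toward x, and it drops a trailing group shorter than y.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class DigitChunks {
    // Skips the first `skip` digits, then returns up to `maxChunks` groups of
    // `chunkSize` digits each. Non-digit characters are ignored entirely.
    static List<String> readChunks(Reader in, long skip, int chunkSize, int maxChunks)
            throws IOException {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder(chunkSize);
        long skipped = 0;
        int c;
        while (chunks.size() < maxChunks && (c = in.read()) != -1) {
            if (c < '0' || c > '9') {
                continue; // decimal point, newline, etc.
            }
            if (skipped < skip) {
                skipped++; // still inside the part to skip
                continue;
            }
            current.append((char) c);
            if (current.length() == chunkSize) {
                chunks.add(current.toString());
                current.setLength(0);
            }
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for new FileReader("pi.txt") here.
        try (Reader in = new BufferedReader(new StringReader("3.14159265358979"))) {
            for (String chunk : readChunks(in, 3, 4, 3)) {
                System.out.println(chunk); // prints 1592, 6535, 8979 on separate lines
            }
        }
    }
}
```

For the real 1 GB file, replacing the StringReader with a FileReader inside the same BufferedReader should be enough, since the loop never holds more than one group in memory.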

Okay, I have spoken to a CS professor, and it seems I had forgotten my basic Java training: in an ASCII file, 1 byte = 1 char. In this case BufferedInputStream returns the ASCII values of those chars. Here is a simple solution:
FileInputStream ifs = new FileInputStream(pi); //Input File containing 1 billion digits
BufferedInputStream bis = new BufferedInputStream(ifs);
System.out.println((char)bis.read()); //Build strings or parse chars how you want
...Rinse and repeat. Sorry for wasting your time, but I hope this will set someone on the right track down the road.
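To make the byte-to-digit step concrete, here is a small sketch that turns the ASCII values returned by read() into numeric digit values by subtracting '0'. The class AsciiDigits and its method digits are illustrative names, and the sketch assumes the input is plain ASCII, as in the Pi file.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class AsciiDigits {
    // Reads the stream byte by byte and collects the numeric value of every
    // ASCII digit; all other bytes (such as '.') are skipped.
    static List<Integer> digits(InputStream in) throws IOException {
        List<Integer> result = new ArrayList<>();
        int b;
        while ((b = in.read()) != -1) {
            if (b >= '0' && b <= '9') {
                result.add(b - '0'); // '3' is byte 51, and 51 - 48 = 3
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "3.1415".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(sample))) {
            System.out.println(digits(in)); // prints [3, 1, 4, 1, 5]
        }
    }
}
```

Collecting into a list is only sensible for small inputs; for a billion digits each value would be processed as it is read.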

Related

Find Word From a jumbled String

I have a scrambled String as follows: "artearardreardac".
I have a text file which contains close to 300,000 English dictionary words. I need to find the English words and arrange them into a word square as follows:
C A R D
A R E A
R E A R
D A R T
My intention was to loop through the scrambled String and query the text file each time, trying to match 4 characters at a time to see if they form a valid word.
The problem with this is checking against 300,000 words per loop, which is going to take ages. I looped through only the first letter 16 times and that by itself took a significant time. The number of possibilities with this method seems endless. Even ignoring efficiency for now, I could end up finding English words which do not form a word square.
My guess is that I have to find words while keeping the letter arrangement valid from the start somehow? I have been at it for hours and gone from fun to frustration. Can I get some guidance, please? I looked for similar questions but found none.
Note: This is an example and I am trying to keep it open for a longer string or a square of different size. (The example is 4x4. The user can decide to go with a 5x5 square with a string of length 25).
My Code
public static void main(String[] args){
String result = wordSquareCreator(4, "artearardreardac");
System.out.println(result);
}
static String wordSquareCreator(int dimension, String letter){
String sortedWord = "";
String temp;
int front = 0;
int firstLetterFront = 0;
int back = dimension;
//Looping through first 4 letters and only changing the first letter 16 times to try a match.
for (int j = 0; j < letter.length(); j++) {
String a = letter.substring(firstLetterFront, j+1) + letter.substring(front+1, back);
temp = readFile(dimension, a);
if(temp != null){
sortedWord+= temp;
}
firstLetterFront++;
}
return sortedWord;
}
static String readFile(int dimension, String word){
//dict text file contains 300,000 English words
File file = new File("dict.txt");
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader(file));
String text;
while ((text = reader.readLine()) != null) {
if(text.length() == dimension) {
if(text.equals(word)){
//found a valid English word
return text;
}
}
}
}catch (Exception e){
e.printStackTrace();
}
finally {
try {
if(reader != null)
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
return null;
}

You can greatly cut down your search space if you organize your dictionary properly. (Which can be done as you read it in, you don't need to modify the file on disk.)
Break it up into one list per word length, then sort each list.
Now, to reduce your search space, note that singletons can only occur on the diagonal from the top left to the bottom right: the square reads the same across and down, so every off-diagonal letter is mirrored and appears in pairs. You have an odd number of C, T, R and A, so those 4 letters make up this diagonal. (Note that you will not always be able to do this, as they aren't guaranteed unique.) Your search space is now one set of 4 letters (4! = 24 orderings) and one set of 6 (6! = 720 orderings, fewer once duplicates are removed). That is about 17k possible boards and under 1k words to try (edit: I originally said 5k, but you can restrict the space to words starting with the correct letter, and since the list is sorted you don't need to consider the others at all), so you're already under 20 million possibilities to examine. You can cut this considerably by first filtering your word list to those that contain only the letters that are used.
At this point an exhaustive search isn't going to be prohibitive.
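The diagonal observation can be checked mechanically by counting letters and keeping those with an odd count. The sketch below is only an illustration of that step; DiagonalLetters and oddCountLetters are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

class DiagonalLetters {
    // Returns the letters that occur an odd number of times, in alphabetical
    // order. In a symmetric word square these must lie on the main diagonal.
    static String oddCountLetters(String letters) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char ch : letters.toCharArray()) {
            counts.merge(ch, 1, Integer::sum);
        }
        StringBuilder odd = new StringBuilder();
        for (Map.Entry<Character, Integer> entry : counts.entrySet()) {
            if (entry.getValue() % 2 == 1) {
                odd.append(entry.getKey());
            }
        }
        return odd.toString();
    }

    public static void main(String[] args) {
        System.out.println(oddCountLetters("artearardreardac")); // prints acrt
    }
}
```

If the result has more letters than the square has diagonal cells, no symmetric square exists for that input; if it has fewer, the remaining diagonal cells have to be filled from paired letters.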

Since you want to build a word square from the letters passed to your function, the word length in your square is sqrt(amountOfLetters). In your example code that would be sqrt(16) = 4. You can also disqualify a lot of words directly from your dictionary:
discard a word if it does not start with a letter in your "alphabet" (i.e. "A", "C", "D", "E", "R", "T")
discard a word if it is not equal to your wordlength (i.e. 4)
discard a word if it has a letter not in your alphabet
The number of words that you want to "write" in your square is wordlength * 2 (since words can only start from the top row or from the left column).
You could start by going through your dictionary and copying only the valid words into a new file, then check your square against this shorter dictionary.
For building the square, I think there are 2 possibilities to choose between.
The first one is to arrange the letters randomly in the square and check whether they form correct words.
The second one is to randomly choose "correct" words from the dictionary and write them into your square. After that you check whether the words use the right number and arrangement of letters.
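The dictionary filtering described above could be sketched as follows. The names WordFilter, candidates and fits are made up for this example, and the sketch assumes lowercase a-z words. It is slightly stricter than the three discard rules: it also rejects a word that needs a letter more often than the input supplies it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class WordFilter {
    // Counts how often each letter a-z occurs; other characters are ignored.
    static int[] countLetters(String letters) {
        int[] counts = new int[26];
        for (char ch : letters.toCharArray()) {
            if (ch >= 'a' && ch <= 'z') {
                counts[ch - 'a']++;
            }
        }
        return counts;
    }

    // True if the word uses only available letters, each no more often than available.
    static boolean fits(String word, int[] available) {
        int[] used = new int[26];
        for (char ch : word.toCharArray()) {
            if (ch < 'a' || ch > 'z') {
                return false;
            }
            if (++used[ch - 'a'] > available[ch - 'a']) {
                return false;
            }
        }
        return true;
    }

    // Keeps words whose length is sqrt(letters.length()) and that fit the letters.
    static List<String> candidates(List<String> dictionary, String letters) {
        int length = (int) Math.round(Math.sqrt(letters.length()));
        int[] available = countLetters(letters);
        List<String> result = new ArrayList<>();
        for (String word : dictionary) {
            if (word.length() == length && fits(word, available)) {
                result.add(word);
            }
        }
        Collections.sort(result); // sorted, so lookups by first letter stay cheap
        return result;
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("card", "area", "rear", "dart", "zoo", "cards");
        System.out.println(candidates(dictionary, "artearardreardac"));
        // prints [area, card, dart, rear]
    }
}
```

Doing this once in memory while reading dict.txt avoids re-reading the file for every lookup, which is where the original code spends most of its time.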

Using bufferedreader then convert to a string

Hi, I have this assignment that I don't really understand how to pull off.
I've been programming Java for 2.5 weeks, so I'm really new.
I'm supposed to import a text document into my program and then perform these operations: count letters, count sentences, and compute the average word length. I have to do the counting letter by letter; I'm not allowed to scan the entire document at once. I've managed to import the text and print it out, but my problem is that I can't use my string "line" for any of these operations. I've tried converting it to arrays and strings, and after a lot of failed attempts I'm giving up. So how do I convert my input into something I can use? I always get the error message "line is not a variable" or something like that.
Jesper
UPDATE WITH MY SOLUTION! Some of it is in Swedish, sorry about that.
Somehow the formatting is wrong, so I uploaded the code here instead; I really don't feel like arguing with this right now!
http://txs.io/3eIb

To count letters, check each character. If it's a space or punctuation, ignore it. Otherwise, it's a letter and we should increment the letter count.
Every word should have a space after it unless it is the last word of the sentence. To get the number of words, track the number of spaces + number of sentences. To get the number of sentences, count the occurrences of ! ? and . Note that this overcounts words when a sentence-ending mark is itself followed by a space; counting each transition from a non-letter to a letter avoids that.
I would do that by looking at the ASCII value of each character.
int numLetters = 0;
int numSentences = 0;
int numWords = 0;
while (line = ...){
    for (int i = 0; i < line.length(); i++) {
        int curCharAsc = (int) line.charAt(i); // get ASCII value by casting char to int
        if ((curCharAsc >= 65 && curCharAsc <= 90) || (curCharAsc >= 97 && curCharAsc <= 122)) // uppercase or lowercase letter
            numLetters++;
        if (curCharAsc == 32) { // ASCII for space
            numWords++;
        }
        else if (curCharAsc == 33 || curCharAsc == 46 || curCharAsc == 63) { // '!', '.', '?'
            numWords++;
            numSentences++;
        }
    }
}
double avgWordLength = ((double) numLetters) / numWords; // cast to double before dividing to avoid integer division
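As a runnable variant of the same character-by-character idea, the sketch below swaps in a slightly different word rule: a word is counted whenever a letter follows a non-letter, rather than by counting spaces. TextStats and scan are hypothetical names, and the approach is only one reasonable choice; for instance, it treats "don't" as two words.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class TextStats {
    int letters;
    int words;
    int sentences;

    // Reads one character at a time; a new word starts at each letter that
    // follows a non-letter, and '.', '!' or '?' ends a sentence.
    static TextStats scan(Reader in) throws IOException {
        TextStats stats = new TextStats();
        boolean inWord = false;
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isLetter(c)) {
                stats.letters++;
                if (!inWord) {
                    stats.words++;
                }
                inWord = true;
            } else {
                inWord = false;
                if (c == '.' || c == '!' || c == '?') {
                    stats.sentences++;
                }
            }
        }
        return stats;
    }

    double averageWordLength() {
        return words == 0 ? 0.0 : (double) letters / words;
    }

    public static void main(String[] args) throws IOException {
        TextStats stats = scan(new StringReader("Hello world. How are you?"));
        System.out.println(stats.letters + " letters, " + stats.words + " words, "
                + stats.sentences + " sentences");
        // prints 19 letters, 5 words, 2 sentences
    }
}
```

Because scan takes any Reader, the BufferedReader already used to load the file can be passed in directly, which keeps the processing letter by letter as the assignment requires.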

Your code as presented works fine: it loads a file and prints out the contents line by line. What you probably need to do is capture each of those lines. Java has two useful classes for this, StringBuilder and StringBuffer (pick one).
BufferedReader input = new BufferedReader(new FileReader(args[0]));
String line;
StringBuffer buffer = new StringBuffer();
while ((line = input.readLine()) != null) {
System.out.println(line);
buffer.append(line+" ");
}
input.close();
performOperations(buffer.toString());
The only other possibility is (if your own code is not running for you) - possibly you aren't passing the input file name as a parameter when you run this class?
UPDATE
NB - I've modified the line
buffer.append(line+"\n");
to add a space instead of a line break, so that it is compatible with algorithms in the #faraza answer
The method performOperations doesn't exist yet. So you should / could add something like this
public static void performOperations(String data){
}
Your method could in turn call separate methods for each operation
public static void performOperations(String data){
countWords(data);
countLetters(data);
averageWordLength(data);
}
To take it to the next level, and introduce Object Orientation, you could create a class TextStatsCollector.
public class TextStatsCollector{
private final String data;
public TextStatsCollector(final String data) {
this.data = data;
}
public int countWords(){
//word count impl here
}
public int countLetters(){
//letter count impl here
}
public int averageWordLength(){
//average word length impl here
}
public void performOperations(){
System.out.println("Number of Words is " + countWords());
System.out.println("Number of Letters is " + countLetters());
System.out.println("Average word length is " + averageWordLength());
}
}
Then you could use TextStatsCollector like the following in your main method
new TextStatsCollector(buffer.toString()).performOperations();

Verifying unexpected empty lines in a file

Aside: I am using the penn.txt file for the problem. The link here is to my Dropbox but it is also available in other places such as here. However, I've not checked whether they are exactly the same.
Problem statement: I would like to do some word processing on each line of the penn.txt file which contains some words and syntactic categories. The details are not relevant.
Actual "problem" faced: I suspect that the file has some consecutive blank lines (which should ideally not be present), which I think the code verifies but I have not verified it by eye, because the number of lines is somewhat large (~1,300,000). So I would like my Java code and conclusions checked for correctness.
I've used slightly modified version of the code for converting file to String and counting number of lines in a string. I'm not sure about efficiency of splitting but it works well enough for this case.
File file = new File("final_project/penn.txt"); //location
System.out.println(file.exists());
//converting file to String
byte[] encoded = null;
try {
encoded = Files.readAllBytes(Paths.get("final_project/penn.txt"));
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
String mystr = new String(encoded, StandardCharsets.UTF_8);
//splitting and checking "consecutiveness" of \n
for(int j=1; ; j++){
String split = new String();
for(int i=0; i<j; i++){
split = split + "\n";
}
if(mystr.split(split).length==1) break;
System.out.print("("+mystr.split(split).length + "," + j + ") ");
}
//counting using Scanner
int count=0;
try {
Scanner reader = new Scanner(new FileInputStream(file));
while(reader.hasNext()){
count++;
String entry = reader.next();
//some word processing here
}
reader.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
}
System.out.println(count);
The number of lines in Gedit--if I understand correctly--matched the number of \n characters found at 1,283,169. I have verified (separately) that the number of \r and \r\n (combined) characters is 0 using the same splitting idea. The total splitting output is shown below:
(1283169,1) (176,2) (18,3) (13,4) (11,5) (9,6) (8,7) (7,8) (6,9) (6,10) (5,11) (5,12) (4,13) (4,14) (4,15) (4,16) (3,17) (3,18) (3,19) (3,20) (3,21) (3,22) (3,23) (3,24) (3,25) (2,26) (2,27) (2,28) (2,29) (2,30) (2,31) (2,32) (2,33) (2,34) (2,35) (2,36) (2,37) (2,38) (2,39) (2,40) (2,41) (2,42) (2,43) (2,44) (2,45) (2,46) (2,47) (2,48) (2,49) (2,50)
Please answer whether the following statements are correct or not:
From this, what I understand is that there is one instance of 50 consecutive \n characters and because of that there are exactly two instances of 25 consecutive \n characters and so on.
The last count (using Scanner) reading gives 1,282,969 which is an exact difference of 200. In my opinion, what this means is that there are exactly 200 (or 199?) empty lines floating about somewhere in the file.
Is there any way to separately verify this "discrepancy" of 200? (something like a set-theoretic counting of intersections maybe)

A partial answer to question (the last part) is as follows:
(Assuming the two statements in the question are true)
If, instead of printing the number of split parts, you print the number of occurrences of j consecutive \n characters, you get (by simply subtracting 1):
(1283168,1) (175,2) (17,3) (12,4) (10,5) (8,6) (7,7) (6,8) (5,9) (5,10) (4,11) (4,12) (3,13) (3,14) (3,15) (3,16) (2,17) (2,18) (2,19) (2,20) (2,21) (2,22) (2,23) (2,24) (2,25) (1,26) (1,27) (1,28) (1,29) (1,30) (1,31) (1,32) (1,33) (1,34) (1,35) (1,36) (1,37) (1,38) (1,39) (1,40) (1,41) (1,42) (1,43) (1,44) (1,45) (1,46) (1,47) (1,48) (1,49) (1,50)
Note that for j>3, the product of the two numbers is <=50, which is your maximum. This means there is one place with 50 consecutive \n characters, and all the hits you are getting from 4 to 49 are actually part of that same run.
However, for 3, the largest multiple of 3 below 50 is 48, which gives 16, while you have 17 occurrences here. So there is an extra \n\n\n somewhere with a non-\n character on both sides.
Now for 2 (\n\n), we can subtract 25 (coming from the 50 \ns) and 1 (coming from the separate \n\n\n) to obtain 175-26 = 149.
To account for the discrepancy, we sum (2-1)*149 + (3-1)*1 + (50-1)*1 = 149 + 2 + 49, the -1 appearing because the first \n in each run is already accounted for in the Scanner count. This sum is 200.
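One independent way to check this is to read the file line by line and record how long each run of empty lines is. The sketch below does that; BlankRuns and histogram are illustrative names, not part of the question's code. Keep in mind that a run of k empty lines corresponds to k+1 consecutive \n characters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

class BlankRuns {
    // Maps each run length (number of consecutive empty lines) to the number
    // of runs of exactly that length.
    static Map<Integer, Integer> histogram(BufferedReader in) throws IOException {
        Map<Integer, Integer> runs = new TreeMap<>();
        int current = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                current++;
            } else if (current > 0) {
                runs.merge(current, 1, Integer::sum);
                current = 0;
            }
        }
        if (current > 0) {
            runs.merge(current, 1, Integer::sum); // run at the very end of the file
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        String sample = "a\n\nb\n\n\n\nc\nd\n";
        try (BufferedReader in = new BufferedReader(new StringReader(sample))) {
            System.out.println(histogram(in)); // prints {1=1, 3=1}
        }
    }
}
```

Summing length times count over the map gives the total number of empty lines, which can then be compared with the difference of 200 seen in the question.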

Using a user inputted string of characters find the longest word that can be made

Basically I want to create a program which simulates the 'Countdown' game on Channel 4. In effect, a user inputs 9 letters and the program searches for the longest word in the dictionary that can be made from these letters. I think a tree structure would be a better fit than hash tables. I already have a file containing the dictionary words and will be using file I/O.
This is my file io class:
public static void main(String[] args){
FileIO reader = new FileIO();
String[] contents = reader.load("dictionary.txt");
}
This is what I have so far in my Countdown class
public static void main(String[] args) throws IOException{
Scanner scan = new Scanner(System.in);
String letters = scan.nextLine();
}
I get totally lost from here. I know this is only the start, but I'm not looking for full answers; I just want a bit of help and a pointer in the right direction. I'm new to Java, found this question in an interview book, and thought I should give it a try.
Thanks in advance

Welcome to the world of Java :)
The first thing I see is that you have two main methods; you don't actually need that. In most cases your program has a single entry point, which then runs all its logic and handles user input.
You're thinking of a tree structure, which is good, though there might be a better way to store this. Try this: http://en.wikipedia.org/wiki/Trie
What your program has to do is read all the words from the file line by line, and in this process build your data structure, the tree. When that's done you can ask the user for input and after the input is entered you can search the tree.
Since you asked specifically not to provide answers I won't put code here, but feel free to ask if you're unclear about something

There are only about 800,000 words in the English language, so an efficient solution is to store each word as an array of 26 1-byte integers counting how many times each letter occurs in the word. For an input of 9 characters, convert the query to the same 26-count format; a word can be formed from the query letters if the query vector is greater than or equal to the word vector component-wise. You could easily process on the order of 100 queries per second this way.
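A minimal sketch of this count-vector idea, assuming lowercase a-z words, might look like the following. CountdownSolver, vector, dominates and longest are names invented for the example. Words containing other characters are simply dropped when the dictionary is loaded, and on ties the first word found wins.

```java
import java.util.ArrayList;
import java.util.List;

class CountdownSolver {
    private final List<String> words = new ArrayList<>();
    private final List<byte[]> vectors = new ArrayList<>();

    // Precomputes one 26-entry count vector per dictionary word.
    CountdownSolver(List<String> dictionary) {
        for (String word : dictionary) {
            byte[] v = vector(word);
            if (v != null) {
                words.add(word);
                vectors.add(v);
            }
        }
    }

    // Returns per-letter counts, or null if s has characters outside a-z.
    static byte[] vector(String s) {
        byte[] v = new byte[26];
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 'a' || ch > 'z') {
                return null;
            }
            v[ch - 'a']++;
        }
        return v;
    }

    // True if the query has at least as many of every letter as the word needs.
    private static boolean dominates(byte[] query, byte[] word) {
        for (int k = 0; k < 26; k++) {
            if (word[k] > query[k]) {
                return false;
            }
        }
        return true;
    }

    // Returns the longest word that can be formed, or "" if none can.
    String longest(String letters) {
        byte[] query = vector(letters);
        String best = "";
        if (query == null) {
            return best;
        }
        for (int i = 0; i < words.size(); i++) {
            if (words.get(i).length() > best.length() && dominates(query, vectors.get(i))) {
                best = words.get(i);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        CountdownSolver solver = new CountdownSolver(List.of("ban", "banana", "ball", "banal", "cab"));
        System.out.println(solver.longest("aabbnlxyz")); // prints banal
    }
}
```

Each query is a single linear pass over the dictionary with 26 comparisons per word, so no tree structure is needed for a word list of this size.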

I would write a program that starts with all the two-letter words, then does the three-letter words, the four-letter words and so on.
When you do the two-letter words, you'll want some way of picking the first letter, then picking the second letter from what remains. You'll probably want to use recursion for this part. Lastly, you'll check it against the dictionary. Try to write it in a way that means you can re-use the same code for the three-letter words.

I believe, the power of Regular Expressions would come in handy in your case:
1) Create a regular expression with a character class like /^[abcdefghi]*$/, with your letters inside instead of "abcdefghi".
2) Use that regular expression as a filter to get an array of matching strings from your text file.
3) Sort it by length. The longest word is what you need, provided it does not use any letter more often than it appears in your input; the character class alone does not check that.
Check the Regular Expressions Reference for more information.
UPD: Here is a good Java Regex Tutorial.
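The three steps could be sketched as below. RegexFilter and matching are hypothetical names, and the sketch assumes the input letters are plain a-z so they can be placed inside a character class without escaping. The sample also shows the limitation of this approach: "banana" passes the filter for the letters "abn" even though it uses 'a' three times.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RegexFilter {
    // Keeps the words made only of the given letters, longest first.
    // The pattern does not limit how often each letter may be used.
    static List<String> matching(List<String> dictionary, String letters) {
        Pattern allowed = Pattern.compile("^[" + letters + "]*$");
        List<String> result = new ArrayList<>();
        for (String word : dictionary) {
            if (allowed.matcher(word).matches()) {
                result.add(word);
            }
        }
        // Sort by descending length; List.sort is stable, so ties keep file order.
        result.sort((a, b) -> Integer.compare(b.length(), a.length()));
        return result;
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("ban", "banana", "cab", "a");
        System.out.println(matching(dictionary, "abn")); // prints [banana, ban, a]
    }
}
```

A per-letter count check on the surviving words would be needed to get a valid Countdown answer.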

A first approach could be using a tree with all the letters present in the wordlist.
If a node is the end of a word, it is marked as an end-of-word node.
In the picture above, the longest word is banana. But there are other words, like ball, ban, or banal.
So, a node must have:
A character
If it is the end of a word
A list of children. (max 26)
The insertion algorithm is very simple: In each step we "cut" the first character of the word until the word has no more characters.
public class TreeNode {
public char c;
private boolean isEndOfWord = false;
private TreeNode[] children = new TreeNode[26];
public TreeNode(char c) {
this.c = c;
}
public void put(String s) {
if (s.isEmpty())
{
this.isEndOfWord = true;
return;
}
char first = s.charAt(0);
int pos = position(first);
if (this.children[pos] == null)
this.children[pos] = new TreeNode(first);
this.children[pos].put(s.substring(1));
}
public String search(char[] letters) {
String word = "";
String w = "";
for (int i = 0; i < letters.length; i++)
{
TreeNode child = children[position(letters[i])];
if (child != null)
w = child.search(letters);
//this is not efficient. It should be optimized.
if (w.contains("%")
&& w.substring(0, w.lastIndexOf("%")).length() > word
.length())
word = w;
}
// if a node its end-of-word we add the special char '%'
return c + (this.isEndOfWord ? "%" : "") + word;
}
//if 'a' returns 0, if 'b' returns 1...etc
public static int position(char c) {
return ((byte) c) - 97;
}
}
Example:
public static void main(String[] args) {
//root
TreeNode t = new TreeNode('R');
//for skipping words with "'" in the wordlist
Pattern p = Pattern.compile(".*\\W+.*");
int nw = 0;
try (BufferedReader br = new BufferedReader(new FileReader(
"files/wordsEn.txt")))
{
for (String line; (line = br.readLine()) != null;)
{
if (p.matcher(line).find())
continue;
t.put(line);
nw++;
}
// line is not visible here.
br.close();
System.out.println("number of words : " + nw);
String res = null;
// substring (1) because of the root
res = t.search("vuetsrcanoli".toCharArray()).substring(1);
System.out.println(res.replace("%", ""));
}
catch (Exception e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Output:
number of words : 109563
counterrevolutionaries
Notes:
The wordlist is taken from here
the reading part is based on another SO question : How to read a large text file line by line using Java?

Storing input from text file

My question is quite simple: I want to read in a text file, storing the first line as an integer and every other line in a multi-dimensional array. The way I was thinking of doing this is an if-statement plus a counter: when the counter is 0, store the line in the integer variable. This seems clumsy, though, and there must be a simpler way.
For example, if the contents of the text file were:
4
1 2 3 4
4 3 2 1
2 4 1 3
3 1 4 2
The first line "4", would be stored in an integer, and every other line would go into the multi-dimensional array.
public void processFile(String fileName){
int temp = 0;
int firstLine;
int[][] array;
try{
BufferedReader input = new BufferedReader(new FileReader(fileName));
String inputLine = null;
while((inputLine = input.readLine()) != null){
if(temp == 0){
firstLine = Integer.parseInt(inputLine);
}else{
// Rest goes into array;
}
temp++;
}
}catch (IOException e){
System.out.print("Error: " + e);
}
}

I'm intentionally not answering this to do it for you. Try something with:
String.split
A line that says something like array[temp-1] = new int[firstLine];
An inner for loop with another Integer.parseInt line
That should be enough to get you the rest of the way

Instead, you could store the first line of the file as an integer, and then enter a for loop where you loop over the rest of the lines of the file, storing them in arrays. This doesn't require an if, because you know that the first case happens first, and the other cases (array) happen after.
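A minimal sketch of that approach, combined with the String.split and Integer.parseInt hints from the previous answer, is shown below. GridReader and read are illustrative names, and the sketch assumes the file really contains as many data lines as the first line announces.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;

class GridReader {
    // Reads the row count from the first line, then parses that many rows of
    // whitespace-separated integers. No if-statement or line counter is needed.
    static int[][] read(BufferedReader in) throws IOException {
        int size = Integer.parseInt(in.readLine().trim());
        int[][] grid = new int[size][];
        for (int row = 0; row < size; row++) {
            String[] parts = in.readLine().trim().split("\\s+");
            grid[row] = new int[parts.length];
            for (int col = 0; col < parts.length; col++) {
                grid[row][col] = Integer.parseInt(parts[col]);
            }
        }
        return grid;
    }

    public static void main(String[] args) throws IOException {
        String sample = "4\n1 2 3 4\n4 3 2 1\n2 4 1 3\n3 1 4 2\n";
        try (BufferedReader in = new BufferedReader(new StringReader(sample))) {
            System.out.println(Arrays.deepToString(read(in)));
            // prints [[1, 2, 3, 4], [4, 3, 2, 1], [2, 4, 1, 3], [3, 1, 4, 2]]
        }
    }
}
```

In processFile, the StringReader would be replaced by new FileReader(fileName); a file with fewer lines than announced would make readLine return null, so real code would want a check there.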

I'm going to assume that you know how to use file IO.
I'm not extremely experienced, but this is how I would think about it:
while (inputFile.hasNext())
{
//Read the number
String number = inputFile.nextLine();
if(!number.equals(" ")){
//Do what you need to do with the character here (Ex: Store into an array)
}else{
//Continue on reading the document
}
}
Good Luck.
