Java: ASCII random line file access with state

Is there a better (ideally pre-existing, Java 1.6) solution than creating a streaming file reader class that meets the following criteria?
Given an ASCII file of arbitrary large size where each line is terminated by a \n
For each invocation of some method readLine() read a random line from the file
And for the life of the file handle no call to readLine() should return the same line twice
Update:
All lines must eventually be read
Context: the file's contents are created from Unix shell commands that produce a directory listing of all paths contained within a given directory; there are between a million and a billion files (which yields between a million and a billion lines in the target file). If there is some way to randomly distribute the paths into the file at creation time, that is an acceptable solution as well.

In order to avoid reading in the whole file, which may not be possible in your case, you may want to use a RandomAccessFile instead of a standard Java FileInputStream. With RandomAccessFile, you can use the seek(long position) method to skip to an arbitrary place in the file and start reading there. The code would look something like this.
RandomAccessFile raf = new RandomAccessFile("path-to-file", "r");
Set<String> sampledLines = new HashSet<String>();
for (int i = 0; i < numberOfRandomSamples; i++) {
    // seek to a random point in the file
    raf.seek((long) (Math.random() * raf.length()));
    // skip from the random location to the beginning of the next line
    int nextByte = raf.read();
    while (((char) nextByte) != '\n') {
        if (nextByte == -1) raf.seek(0); // wrap around to the beginning of the file if you reach the end
        nextByte = raf.read();
    }
    // read the line into a buffer
    StringBuilder lineBuffer = new StringBuilder();
    nextByte = raf.read();
    while (nextByte != -1 && ((char) nextByte) != '\n') {
        lineBuffer.append((char) nextByte);
        nextByte = raf.read(); // advance, otherwise this loop never terminates
    }
    // ensure uniqueness; retry if this line was already sampled
    String line = lineBuffer.toString();
    if (!sampledLines.add(line))
        i--;
}
raf.close();
Here, sampledLines should hold your randomly selected lines at the end. You may need to check that you haven't randomly skipped to the end of the file as well to avoid an error in that case.
EDIT: I made it wrap to the beginning of the file in case you reach the end. It was a pretty simple check.
EDIT 2: I made it verify uniqueness of lines by keeping the already-sampled lines in a set and retrying on duplicates.

Pre-process the input file and remember the offset of each new line. Use a BitSet to keep track of used lines. If you want to save some memory, then remember the offset of every 16th line; it is still easy to jump into the file and do a sequential lookup within a block of 16 lines.
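A minimal sketch of that idea, assuming the line count is small enough for the full offset index to fit in memory (the every-16th-line variant would store proportionally fewer offsets and scan forward within a block). The class name and the rejection-sampling loop are mine, not part of the answer, and a buffered indexing pass would be preferable in practice:
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Random;

public class RandomLineReader implements java.io.Closeable {
    private final RandomAccessFile raf;
    private final List<Long> offsets = new ArrayList<Long>();
    private final BitSet used = new BitSet();
    private final Random rnd = new Random();
    private int remaining;

    public RandomLineReader(String fileName) throws IOException {
        raf = new RandomAccessFile(fileName, "r");
        long pos = 0;
        offsets.add(0L);                          // first line starts at offset 0
        int b;
        while ((b = raf.read()) != -1) {
            pos++;
            if (b == '\n') offsets.add(pos);      // next line starts right after '\n'
        }
        if (!offsets.isEmpty() && offsets.get(offsets.size() - 1).longValue() == raf.length())
            offsets.remove(offsets.size() - 1);   // no line after a trailing '\n'
        remaining = offsets.size();
    }

    // Returns a random, not-yet-returned line, or null once every line has been read.
    public String readLine() throws IOException {
        if (remaining == 0) return null;
        int idx;
        do {
            idx = rnd.nextInt(offsets.size());
        } while (used.get(idx));                  // retry until we hit an unused line
        used.set(idx);
        remaining--;
        raf.seek(offsets.get(idx));
        return raf.readLine();
    }

    public void close() throws IOException {
        raf.close();
    }
}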

Since you can pad the lines to a fixed size, I would do something along these lines; note that even then there is a limit on how many entries a List can hold (an ArrayList is indexed by int).
Picking a random offset each time you want to read a line and remembering it in a Set would also work; the approach below, however, guarantees that the file is eventually read completely:
import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Random;

public class VeryLargeFileReading
    implements Iterator<String>, Closeable
{
    private static final Random RND = new Random();
    // Shuffled list of the byte offsets of all (fixed-size) lines
    private final List<Long> indices = new ArrayList<Long>();
    private final RandomAccessFile fd;

    public VeryLargeFileReading(String fileName, long lineSize) throws IOException
    {
        fd = new RandomAccessFile(fileName, "r");
        long nrLines = fd.length() / lineSize;
        for (long i = 0; i < nrLines; i++)
            indices.add(i * lineSize);
        Collections.shuffle(indices, RND);
    }

    // Iterator methods
    @Override
    public boolean hasNext()
    {
        return !indices.isEmpty();
    }

    @Override
    public void remove()
    {
        // Nope
        throw new UnsupportedOperationException();
    }

    @Override
    public String next()
    {
        if (indices.isEmpty())
            throw new NoSuchElementException();
        // Removing from the end of the (already shuffled) list is O(1)
        final long offset = indices.remove(indices.size() - 1);
        try {
            fd.seek(offset);
            return fd.readLine().trim();
        } catch (IOException e) {
            // Iterator.next() cannot throw checked exceptions
            throw new RuntimeException(e);
        }
    }

    @Override
    public void close() throws IOException
    {
        fd.close();
    }
}
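A short usage sketch for the class above (the file name, the 128-byte line size and process() are placeholders, not from the answer):
VeryLargeFileReading reader = new VeryLargeFileReading("paths.txt", 128);
try {
    while (reader.hasNext()) {
        String line = reader.next(); // each line is returned exactly once, in random order
        process(line);               // process(...) stands in for whatever you do with a path
    }
} finally {
    reader.close();
}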

If the number of files is truly arbitrary, it seems like there could be an associated issue with tracking processed files in terms of memory usage (or I/O time if tracking in files instead of a list or set). Solutions that keep a growing list of selected lines also run into timing-related issues.
I'd consider something along the lines of the following:
Create n "bucket" files. n could be determined based on something that takes into account the number of files and system memory. (If n is large, you could generate a subset of n to keep open file handles down.)
Each file's name is hashed and goes into an appropriate bucket file, "sharding" the directory based on arbitrary criteria (a rough sketch of this bucketing step follows below).
Read in the bucket file contents (just filenames) and process as-is (randomness provided by the hashing mechanism), or pick rnd(n) and remove as you go, providing a bit more randomness.
Alternatively, you could pad and use the random access idea, removing indices/offsets from a list as they're picked.
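A rough sketch of the bucketing step only (not the answer's exact design); the file names and the bucket count are arbitrary placeholders:
import java.io.*;

public class PathBucketer {
    public static void main(String[] args) throws IOException {
        int n = 64; // number of bucket files; tune to line count and memory
        PrintWriter[] buckets = new PrintWriter[n];
        for (int i = 0; i < n; i++)
            buckets[i] = new PrintWriter(new BufferedWriter(new FileWriter("bucket-" + i + ".txt")));

        BufferedReader in = new BufferedReader(new FileReader("paths.txt"));
        String path;
        while ((path = in.readLine()) != null) {
            // Spread lines across buckets by hash; mask the sign bit before taking the modulus
            int bucket = (path.hashCode() & 0x7fffffff) % n;
            buckets[bucket].println(path);
        }
        in.close();
        for (PrintWriter pw : buckets)
            pw.close();
        // Each bucket file should now be small enough to load, shuffle in memory, and process.
    }
}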

Related

Sum all the numbers in all the files present in all subdirectories - complexity?

Given a root directory, read all the files line by line inside rootDirectory or its subdirectories and sum all the numbers from each file. Each file has one number per line, so I just need to read all the files, sum all the numbers, and return the total. I came up with the code below and it does the job (if there is a better or more efficient way, let me know).
I am trying to understand the complexity of the program below. If the structure is very deep and we have a lot of files in a lot of subdirectories, what will the complexity of the program be? How should we describe the complexity in this case if it is asked in an interview?
public static void main(String[] args) {
    System.out.println(sumNumbersInFile("/home/david"));
}

private static int sumNumbersInFile(String rootDirectory) {
    if (rootDirectory == null || rootDirectory.isEmpty()) {
        return 0;
    }
    int sum = 0;
    File file = new File(rootDirectory);
    File[] entries = file.listFiles();
    if (entries == null) { // not a directory, or not readable
        return 0;
    }
    for (File fileEntry : entries) {
        if (fileEntry.isDirectory()) {
            // use the full path, not just the name, so the recursion finds the directory
            sum += sumNumbersInFile(fileEntry.getPath());
        } else {
            try (BufferedReader br = new BufferedReader(new FileReader(fileEntry))) {
                String line;
                while ((line = br.readLine()) != null) {
                    sum += Integer.parseInt(line);
                }
            } catch (NumberFormatException | IOException e) {
                e.printStackTrace();
            }
        }
    }
    return sum;
}
Let's say you have n files. You visit each file once, so that part is O(n). Let's say m is the maximum (or average) number of lines per file. You read each line in each file once, so in the worst case you read m lines in each of n files. That makes it O(n*m).
The reason you need both n and m is that you have two unknown variables: the number of files (it doesn't matter whether they are all in one folder or spread as one file and one subdirectory per directory; since you go one by one, you visit each exactly once) and the number of lines. Each of them can grow independently, so the cost is a function of two unknowns. Therefore it is O(n*m).
Even if you put all lines in one file, that would be O(f(r)), where f(r) = g(n*m), so it would still be O(n*m), where r is the overall number of lines (r = n*m). The function differs only by the constant overhead of traversing folders and opening files, which does not affect the order of growth.
You're still doing only one calculation step per line. The algorithm is O(n) where n is the number of lines in all files.

How to add information to objects that take 2 IDs as parameters without overwriting it when info is added again?

So, I'll start with a brief explanation so that my code makes sense. We are reading a file through a Scanner and adding every "N" lines to a "Block" that carries those lines (basically an ArrayList inside the Block object).
Every Block is then added randomly to a "DataNode" that has an "int machineID" and an "int nodeID" (one machine can hold several data nodes, and data nodes are just objects that carry an ArrayList of "Blocks").
I have written a method that takes the file location as a parameter, reads the file, fills blocks with lines, and then adds the blocks to the nodes. However, my code creates a new node with the chosen machine ID and node ID every time a block needs to be added (blocks are assigned randomly to machines and nodes). A node may need to hold two or more blocks, but what I have written overwrites the previously created node every time I want to add a block to the node's array list.
Here is my code: (this is a part of a much bigger problem so hope this makes sense)
public void createFile(String fileNameOriginal, String fileNameDFS) throws FileNotFoundException {
    this.fileNameOriginal = fileNameOriginal;
    this.fileNameDFS = fileNameDFS;
    int randomMachine;
    int randomNode;
    File f = new File(fileNameOriginal);
    Scanner readFile = new Scanner(f);
    while (readFile.hasNextLine()) {
        //IDs locating the node the block is going to be added to
        randomMachine = (int) (Math.random() * machines);
        randomNode = (int) (Math.random() * nodes);
        Block b = new Block(randomMachine, randomNode);
        // number of lines per block are decided in the class as parameter
        for (int i = 0; i <= n; i++) {
            b.add(readFile.nextLine());
        }
        // this here is my problem... everything is over written... :'(
        DataNode N = new DataNode(randomMachine, randomNode);
        N.addToNode(b);
    }
}
if you have taken the time to read the entire thing I thank you. :)
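One way to avoid recreating the node for every block, sketched against the (assumed) DataNode/Block API from the question, is to keep the nodes in a map keyed by the (machineID, nodeID) pair and reuse the existing node when one is already there:
// Sketch only: DataNode, Block and addToNode(...) are assumed from the question.
// Declare the map once (e.g. as a field, or before the while loop):
Map<String, DataNode> nodesById = new HashMap<String, DataNode>();

// Then, instead of "DataNode N = new DataNode(randomMachine, randomNode); N.addToNode(b);":
String key = randomMachine + ":" + randomNode;      // one key per (machineID, nodeID) pair
DataNode node = nodesById.get(key);
if (node == null) {                                 // first block for this machine/node pair
    node = new DataNode(randomMachine, randomNode);
    nodesById.put(key, node);
}
node.addToNode(b);                                  // existing nodes accumulate blocks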

What is the fastest way to get the dimensions of a CSV file in Java

My regular procedure for getting the dimensions of a CSV file is as follows:
Get how many rows it has:
I use a while loop to read every line and count each successful read. The downside is that it takes time to read the whole file just to count how many rows it has.
Then get how many columns it has:
I use String[] temp = lineOfText.split(","); and then take the size of temp.
Is there any smarter method? Like:
file1 = read.csv;
xDimention = file1.xDimention;
yDimention = file1.yDimention;
I guess it depends on how regular the structure is, and whether you need an exact answer or not.
I could imagine looking at the first few rows (or randomly skipping through the file), and then dividing the file size by average row size to determine a rough row count.
If you control how these files get written, you could potentially tag them or add a metadata file next to them containing row counts.
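A rough sketch of that estimation idea (sample the first few rows, then divide the file size by the average row length); the file name and sample size are arbitrary, and the result is only an approximation:
import java.io.*;

public class CsvRowEstimate {
    // Rough row-count estimate: average the length of the first sampleSize lines
    // and divide the file size by that average. Exact only for very regular files.
    public static long estimateRows(File csv, int sampleSize) throws IOException {
        long sampledBytes = 0;
        int sampledLines = 0;
        BufferedReader br = new BufferedReader(new FileReader(csv));
        try {
            String line;
            while (sampledLines < sampleSize && (line = br.readLine()) != null) {
                sampledBytes += line.length() + 1;   // +1 for the line terminator (assumes '\n')
                sampledLines++;
            }
        } finally {
            br.close();
        }
        if (sampledLines == 0) return 0;
        double avgRowSize = (double) sampledBytes / sampledLines;
        return Math.round(csv.length() / avgRowSize);
    }

    public static void main(String[] args) throws IOException {
        // "data.csv" is a placeholder file name
        System.out.println("~rows: " + estimateRows(new File("data.csv"), 100));
    }
}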
Strictly speaking, the way you're splitting the line doesn't cover all possible cases. "hello, world", 4, 5 should read as having 3 columns, not 4.
Your approach won't work with multi-line values (you'll get an invalid number of rows) and quoted values that might happen to contain the delimiter (you'll get an invalid number of columns).
You should use a CSV parser such as the one provided by univocity-parsers.
Using the uniVocity CSV parser, the fastest way to determine the dimensions would be with the following code. It parses a 150MB file to give its dimensions in 1.2 seconds:
// Let's create our own RowProcessor to analyze the rows
static class CsvDimension extends AbstractRowProcessor {

    int lastColumn = -1;
    long rowCount = 0;

    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        rowCount++;
        if (lastColumn < row.length) {
            lastColumn = row.length;
        }
    }
}

public static void main(String... args) throws FileNotFoundException {
    // let's measure the time roughly
    long start = System.currentTimeMillis();

    //Creates an instance of our own custom RowProcessor, defined above.
    CsvDimension myDimensionProcessor = new CsvDimension();

    CsvParserSettings settings = new CsvParserSettings();
    //This tells the parser that no row should have more than 2,000,000 columns
    settings.setMaxColumns(2000000);

    //Here you can select the column indexes you are interested in reading.
    //The parser will return values for the columns you selected, in the order you defined
    //By selecting no indexes here, no String objects will be created
    settings.selectIndexes(/*nothing here*/);

    //When you select indexes, the columns are reordered so they come in the order you defined.
    //By disabling column reordering, you will get the original row, with nulls in the columns you didn't select
    settings.setColumnReorderingEnabled(false);

    //We instruct the parser to send all rows parsed to your custom RowProcessor.
    settings.setRowProcessor(myDimensionProcessor);

    //Finally, we create a parser
    CsvParser parser = new CsvParser(settings);

    //And parse! All rows are sent to your custom RowProcessor (CsvDimension)
    //I'm using a 150MB CSV file with 1.3 million rows.
    parser.parse(new FileReader(new File("c:/tmp/worldcitiespop.txt")));

    //Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
    System.out.println("Columns: " + myDimensionProcessor.lastColumn);
    System.out.println("Rows: " + myDimensionProcessor.rowCount);

    System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");
}
The output will be:
Columns: 7
Rows: 3173959
Time taken: 1279 ms
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
IMO, what you are doing is an acceptable way to do it, but here are some ways you could make it faster (a rough sketch follows after the list):
Rather than reading lines, which creates a new String object for each line, just use String.indexOf to find the bounds of your lines
Rather than using line.split, again use indexOf to count the number of commas
Multithreading
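A rough sketch of those first two suggestions, reading the file in chunks and using indexOf only for the comma count; this deliberately ignores quoting, so it is only valid for simple CSV files, and the file name is a placeholder:
import java.io.*;

public class IndexOfDimensions {
    public static void main(String[] args) throws IOException {
        File file = new File("data.csv");            // placeholder file name
        long rows = 0;
        int cols;
        Reader reader = new BufferedReader(new FileReader(file), 1 << 16);
        try {
            char[] buf = new char[1 << 16];
            StringBuilder firstLine = new StringBuilder();
            boolean firstLineDone = false;
            int read;
            while ((read = reader.read(buf)) != -1) {
                for (int i = 0; i < read; i++) {
                    if (buf[i] == '\n') {
                        rows++;                       // count rows without building a String per line
                        firstLineDone = true;
                    } else if (!firstLineDone) {
                        firstLine.append(buf[i]);     // keep only the first line for the column count
                    }
                }
            }
            // columns = commas in the first line + 1 (again: quoting not handled)
            int commas = 0;
            for (int from = firstLine.indexOf(","); from != -1; from = firstLine.indexOf(",", from + 1)) {
                commas++;
            }
            cols = commas + 1;
        } finally {
            reader.close();
        }
        System.out.println("rows: " + rows + ", cols: " + cols);
    }
}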
I guess the options depend on how you use the data:
Store the dimensions of your CSV file when writing the file (in the first row or in an additional file)
Use a more efficient way to count lines - maybe http://docs.oracle.com/javase/6/docs/api/java/io/LineNumberReader.html
Instead of creating arrays of a fixed size (assuming that's what you need the line count for), use array lists - this may or may not be more efficient depending on the size of the file.
To find the number of rows you have to read the whole file; there is nothing you can do about that. However, your method of finding the number of columns is a bit inefficient: instead of split, just count how many times "," appears in the line. You might also need to handle fields enclosed in quotes, as mentioned by @Vlad.
The String.split method creates an array of strings and splits using a regexp, which is not very efficient.
I found this short but interesting solution here:
https://stackoverflow.com/a/5342096/4082824
LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")));
lnr.skip(Long.MAX_VALUE);
System.out.println(lnr.getLineNumber() + 1); //Add 1 because line index starts at 0
lnr.close();
My solution below simply and correctly processes CSV with multi-line cells or quoted values.
For example, we have this CSV file:
1,"""2""","""111,222""","""234;222""","""""","1
2
3"
2,"""2""","""111,222""","""234;222""","""""","2
3"
3,"""5""","""1112""","""10;2""","""""","1
2"
And my solution snippet is:
import java.io.*;

public class CsvDimension {

    public void parse(Reader reader) throws IOException {
        long cells = 0;
        int lines = 0;
        int c;
        boolean quoted = false;
        while ((c = reader.read()) != -1) {
            if (c == '"') {
                quoted = !quoted;
            }
            if (!quoted) {
                if (c == '\n') {
                    lines++;
                    cells++;
                }
                if (c == ',') {
                    cells++;
                }
            }
        }
        reader.close();
        if (lines == 0) {
            System.out.println("empty file");
            return;
        }
        System.out.printf("lines: %d%ncells: %d%ncols: %d%n", lines, cells, cells / lines);
    }

    public static void main(String[] args) throws IOException {
        new CsvDimension().parse(new BufferedReader(new FileReader(new File("test.csv"))));
    }
}

Sorting lines of an enormous file.txt in Java

I'm working with a very big text file (755 MB).
I need to sort the lines (about 1,890,000 of them) and then write them back to another file.
I already found a discussion with a starting file very similar to mine:
Sorting Lines Based on words in them as keys
The problem is that I cannot store the lines in an in-memory collection because I get a Java heap space exception, even after expanding the heap to its maximum (already tried!).
I also can't open it with Excel and use its sorting feature, because the file is too large to be loaded completely.
I thought about using a database, but I think writing all the lines and then running a SELECT query would take too long. Am I wrong?
Any hints appreciated
Thanks in advance
I think the solution here is to do a merge sort using temporary files:
Read the first n lines of the input file (n being the number of lines you can afford to store and sort in memory), sort them, and write them to the file 1.tmp (or however you call it). Do the same with the next n lines and store them in 2.tmp. Repeat until all lines of the original file have been processed.
Read the first line of each temporary file. Determine the smallest one (according to your sort order), write it to the destination file, and read the next line from the corresponding temporary file. Repeat until all lines have been processed.
Delete all the temporary files.
This works with arbitrarily large files, as long as you have enough disk space (a sketch of the chunking step follows below).
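A minimal sketch of the chunking step (step 1), assuming maxLines is whatever comfortably fits in the heap; the names are mine, not from the answer, and a sketch of the merge step appears further below under the answer that mentions the N-way merge:
import java.io.*;
import java.util.*;

public class ChunkSplitter {
    // Split the big file into sorted temporary chunk files of at most maxLines lines each.
    public static List<File> splitIntoSortedChunks(File input, int maxLines) throws IOException {
        List<File> chunks = new ArrayList<File>();
        BufferedReader br = new BufferedReader(new FileReader(input));
        try {
            List<String> lines = new ArrayList<String>(maxLines);
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
                if (lines.size() == maxLines) {
                    chunks.add(writeSortedChunk(lines, chunks.size()));
                    lines.clear();
                }
            }
            if (!lines.isEmpty()) {
                chunks.add(writeSortedChunk(lines, chunks.size()));
            }
        } finally {
            br.close();
        }
        return chunks;
    }

    private static File writeSortedChunk(List<String> lines, int index) throws IOException {
        Collections.sort(lines);                  // in-memory sort of one chunk
        File chunk = File.createTempFile("chunk-" + index + "-", ".tmp");
        PrintWriter pw = new PrintWriter(new BufferedWriter(new FileWriter(chunk)));
        for (String l : lines) {
            pw.println(l);
        }
        pw.close();
        return chunk;
    }
}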
You can run the following with
-mx1g -XX:+UseCompressedStrings # on Java 6 update 29
-mx1800m -XX:-UseCompressedStrings # on Java 6 update 29
-mx2g # on Java 7 update 2.
import java.io.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Main {
    public static void main(String... args) throws IOException {
        long start = System.nanoTime();
        generateFile("lines.txt", 755 * 1024 * 1024, 189000);

        List<String> lines = loadLines("lines.txt");

        System.out.println("Sorting file");
        Collections.sort(lines);
        System.out.println("... Sorted file");
        // save lines.

        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f second to read, sort and write to a file%n", time / 1e9);
    }

    private static void generateFile(String fileName, int size, int lines) throws FileNotFoundException {
        System.out.println("Creating file to load");
        int lineSize = size / lines;
        StringBuilder sb = new StringBuilder();
        while (sb.length() < lineSize) sb.append('-');
        String padding = sb.toString();

        PrintWriter pw = new PrintWriter(fileName);
        for (int i = 0; i < lines; i++) {
            String text = (i + padding).substring(0, lineSize);
            pw.println(text);
        }
        pw.close();
        System.out.println("... Created file to load");
    }

    private static List<String> loadLines(String fileName) throws IOException {
        System.out.println("Reading file");
        BufferedReader br = new BufferedReader(new FileReader(fileName));
        List<String> ret = new ArrayList<String>();
        String line;
        while ((line = br.readLine()) != null)
            ret.add(line);
        System.out.println("... Read file.");
        return ret;
    }
}
prints
Creating file to load
... Created file to load
Reading file
... Read file.
Sorting file
... Sorted file
Took 4.886 second to read, sort and write to a file
Divide and conquer is the best solution :)
Divide your file into smaller ones, sort each file separately, then regroup.
Links:
Sort a file with huge volume of data given memory constraint
http://hackerne.ws/item?id=1603381
Algorithm:
How much memory do we have available? Let’s assume we have X MB of memory available.
Divide the file into K chunks, where X * K = 2 GB. Bring each chunk into memory and sort the lines as usual using any O(n log n) algorithm. Save the lines back to the file.
Now bring the next chunk into memory and sort.
Once we’re done, merge them one by one.
The above algorithm is also known as an external sort. Step 3 is known as an N-way merge (a sketch follows below).
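A sketch of what that N-way merge could look like with a PriorityQueue holding the current line of each sorted chunk (names are placeholders; the chunk files are assumed to be individually sorted):
import java.io.*;
import java.util.*;

public class NWayMerge {
    // One entry per chunk: its current line plus the reader it came from
    private static class ChunkLine implements Comparable<ChunkLine> {
        final String line;
        final BufferedReader reader;
        ChunkLine(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
        public int compareTo(ChunkLine other) { return line.compareTo(other.line); }
    }

    // Repeatedly write out the smallest current line across all chunks.
    public static void merge(List<File> sortedChunks, File output) throws IOException {
        PriorityQueue<ChunkLine> queue = new PriorityQueue<ChunkLine>();
        for (File chunk : sortedChunks) {
            BufferedReader br = new BufferedReader(new FileReader(chunk));
            String first = br.readLine();
            if (first != null) queue.add(new ChunkLine(first, br));
            else br.close();
        }
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(output)));
        while (!queue.isEmpty()) {
            ChunkLine smallest = queue.poll();
            out.println(smallest.line);                     // smallest line across all chunks
            String next = smallest.reader.readLine();
            if (next != null) queue.add(new ChunkLine(next, smallest.reader));
            else smallest.reader.close();                   // this chunk is exhausted
        }
        out.close();
    }
}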
Why don't you try multithreading and increasing the heap size of the program you are running? (This still requires a merge-sort-style approach, and assumes you have more than 755 MB of memory in your system.)
Maybe you can use Perl to format the file and load it into a database such as MySQL; that is very fast, and you can then use an index to query the data and write it to another file.
You can also set the JVM heap size with options like '-Xms256m -Xmx1024m'. I hope this helps, thanks.

Strange FileInputStream/DataInputStream behaviour: seek()ing to odd positions

The good:
So, I have this binary data file (size - exactly 640631 bytes), and I'm trying to make Java read it.
I have two interchangeable classes implemented as layers for reading that data. One of them uses RandomAccessFile, which works great and all.
The bad:
Another one (the one this question is mostly about) tries to use FileInputStream and DataInputStream so that the very same data could (at least theoretically) be read on the MIDP 2.0 (CLDC 1.1) Java configuration (which doesn't have RandomAccessFile).
In that class, I open the data file like this:
FileInputStream res = new FileInputStream(new File(filename));
h = new DataInputStream(res);
...and implement seek()/skip() like this (position is a long that keeps track of the current position in the file):
public void seek(long pos) throws java.io.IOException {
    if (!this.isOpen()) {
        throw new java.io.IOException("No file is open");
    }

    if (pos < position) {
        // Seek to the start, then skip some bytes
        this.reset();
        this.skip(pos);
    } else if (pos > position) {
        // skip the remaining bytes until the position
        this.skip(pos - position);
    }
}
and
public void skip(long bytes) throws java.io.IOException {
    if (!this.isOpen()) {
        throw new java.io.IOException("No file is open");
    }

    long skipped = 0, step = 0;

    do {
        step = h.skipBytes((int) (bytes - skipped));
        if (step < 0) {
            throw new java.io.IOException("skip() failed");
        }
        skipped += step;
    } while (skipped < bytes);

    position += bytes;
}
The ugly:
The problem with the second class (the FileInputStream/DataInputStream one) is that sometimes it decides to reset the file position to some strange place in a file :) This happens both when I run this on J2SE (a computer) and J2ME (a mobile phone). Here's an example of the actual usage of that reader class and a bug that occurs:
// Open the data file
Reader r = new Reader(filename);
// r.position = 0, actual position in a file = 0
// Skip to where the data block that is needed starts
// (determined by some other code)
r.seek(189248);
// r.position = 189248, actual position in a file = 189248
// Do some reading...
r.readID(); r.readName(); r.readSurname();
// r.position = 189332, actual position in a file = 189332
// Skip some bytes (an unneeded record)
r.skip(288);
// r.position = 189620, actual position in a file = 189620
// Do some more reading...
r.readID(); r.readName(); r.readSurname();
// r.position = 189673, actual position in a file = 189673
// Skip some bytes (an unneeded record)
r.skip(37);
// AAAAND HERE WE GO:
// r.position = 189710, actual position in a file = 477
I was able to determine that when asked to skip another 37 bytes, Java positioned the file pointer at byte 477 from the very start of the file instead.
"Freshly" (just after opening a file) seeking to a position 189710 (and beyond that) works OK. However, reopening a file every time I need a seek() is just painfully slow, especially on a mobile phone.
What has happened?
I can see nothing wrong with this. Are you positive of the r.position value before the last skip? Unless there's an underlying bug in the JDK streams, or you have multiple threads using the Reader, the only possibility I can guess at is that something is modifying the position value incorrectly when you read your fields.
