Using more Threads for reading the same file in Java - java

How can I read a file in Java using multithreading?
It doesn't matter if it's slower than using one thread; I have to do it this way.
So, for example, if there are 2 threads, the first reads the first line and, at the same time, the second reads the second line; then the first reads the third line and the second reads the fourth line, and they continue reading in this way until the end of the file. How can I implement this in Java?

Just use a single BufferedReader that is shared between the threads, and synchronize on it when calling readLine().
It is completely pointless, though: the lock serializes the reads, so nothing is actually read in parallel.
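A minimal sketch of the shared-reader approach, assuming the class name SharedReaderDemo, the method readWithThreads and the thread names are all hypothetical. Every line is read exactly once, but the lock alone does not force the threads to alternate strictly line by line; which thread gets which line depends on scheduling.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SharedReaderDemo {

    // Reads every line of one shared reader using threadCount threads.
    // Returns the lines in the order in which the threads happened to read them.
    static List<String> readWithThreads(BufferedReader reader, int threadCount)
            throws InterruptedException {
        List<String> lines = Collections.synchronizedList(new ArrayList<>());
        Runnable worker = () -> {
            try {
                while (true) {
                    String line;
                    synchronized (reader) {      // only one thread reads at a time
                        line = reader.readLine();
                    }
                    if (line == null) {
                        return;                  // end of input reached
                    }
                    lines.add(line);
                    System.out.println(Thread.currentThread().getName() + " read: " + line);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            Thread t = new Thread(worker, "reader-" + i);
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();                            // wait until every thread is done
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // A StringReader stands in for a FileReader so the sketch runs anywhere.
        BufferedReader reader = new BufferedReader(new StringReader("one\ntwo\nthree\nfour\n"));
        readWithThreads(reader, 2);
    }
}
```

If strict alternation really is required, the threads would additionally need to hand a turn to each other, which the sketch deliberately leaves out.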

Related

What are the concerns regarding simultaneous reads and writes to a file?

Consider the following scenario:
Process 1 (Writer) continuously appends lines to a file (sharedFile.txt).
Process 2 (Reader) continuously reads lines from sharedFile.txt.
My questions are:
In Java, is it possible that:
the Reader process somehow crashes the Writer process (i.e. breaks the Writer's processing)?
the Reader somehow knows when to stop reading the file purely based on the file stats (the Reader doesn't know whether others are writing to the file)?
To demonstrate:
Process one (Writer):
    ...
    while (!done) {
        String nextLine; // process the line
        writeLine(nextLine);
        ...
    }
    ...
Process two (Reader):
    ...
    while (hasNextLine()) {
        String nextLine = readLine();
        ...
    }
    ...
NOTE:
The Writer process has priority, so nothing must interfere with it.
Since you are talking about processes, not threads, the answer depends on how the underlying OS manages open file handles:
On every OS I'm familiar with, the Reader will never crash the Writer process, as the Reader's file handle only allows reading. On Linux, the system calls a Reader can invoke, namely open(2) with the O_RDONLY flag, lseek(2) and read(2), are known not to interfere with the syscalls the Writer invokes, such as write(2).
On most OSes the Reader most likely won't know when to stop reading. More precisely, on some read attempt it will receive zero as the number of bytes read and will treat this as EOF (end of file). At that very moment a Writer may be preparing to append more data to the file, but the Reader has no way of knowing it.
If you need a way for two processes to communicate via a file, you can use some extra files that pass meta-information between Readers and Writers, such as whether a Writer is currently running. Introducing some structure into the file can be useful too (for example, every Writer appends a byte to the file indicating that a write is in progress).
For very fast non-blocking I/O you may want to consider memory-mapped files via Java's MappedByteBuffer.
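As a rough illustration of the memory-mapped option, here is a small sketch that maps a file read-only and decodes it as UTF-8. The class and method names are hypothetical, and the sketch assumes the file fits in a single mapping (a single FileChannel.map call is limited to Integer.MAX_VALUE bytes). It does not solve the end-of-data question by itself; it only shows the API.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedReadDemo {

    // Maps the whole file into memory (read-only) and returns its content as text.
    static String readAll(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);                   // copy the mapped bytes out
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }
}
```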
The code will not crash. However, the reader will terminate when the end is reached, even if the writer may still be writing. You will have to synchronize somehow!
Concern:
Your reader thread can read a stale value even when you think a writer thread has already updated the variable.
The same holds for a file: without synchronization, the reader may see a different value from the one the writer has written.
Java File IO and plain files were not designed for simultaneous writes and reads. Either your reader will overtake your writer, or your reader will never finish.
JB Nizet provided the answer in his comment. You use a BlockingQueue to hold the writer data while you're reading it. Either the queue will empty, or the reader will never finish. You have the means through the BlockingQueue methods to detect either situation.
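A minimal sketch of the BlockingQueue hand-off, assuming the writer and reader are two threads in the same JVM (a BlockingQueue cannot connect separate processes). The class name, the transfer method and the END marker are hypothetical; the marker is one common way for the reader to learn that the writer has finished.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueueHandoffDemo {

    // Hypothetical end-of-stream marker. It is compared by identity (==),
    // so a real line with the same text is never mistaken for it.
    private static final String END = new String("<<END>>");

    // Passes the lines from a writer thread to a reader thread through a queue.
    static List<String> transfer(List<String> lines) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> received = new ArrayList<>();

        Thread writer = new Thread(() -> {
            try {
                for (String line : lines) {
                    queue.put(line);
                }
                queue.put(END);                  // tell the reader we are done
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread reader = new Thread(() -> {
            try {
                // take() blocks while the queue is empty instead of reporting EOF
                for (String line = queue.take(); line != END; line = queue.take()) {
                    received.add(line);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        writer.start();
        reader.start();
        writer.join();
        reader.join();                           // join makes 'received' safely visible
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(transfer(Arrays.asList("first", "second", "third")));
    }
}
```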

Should multiple threads read from the same DataInputStream?

I'd like my program to get a file, and then create 4 files based on its byte content.
Working with only the main thread, I just create one DataInputStream and do my thing sequentially.
Now, I'm interested in making my program concurrent. Maybe I can have four threads - one for each file to be created.
I don't want to read the file's bytes into memory all at once, so my threads will need to query the DataInputStream constantly to stream the bytes using read().
What is not clear to me is, should my 4 threads call read() on the same DataInputStream, or should each one have their own separate stream to read from?
I don't think this is a good idea. See http://download.java.net/jdk7/archive/b123/docs/api/java/io/DataInputStream.html
DataInputStream is not necessarily safe for multithreaded access. Thread safety is optional and is the responsibility of users of methods in this class.
Assuming you want all of the data in each of your four new files, each thread should create its own DataInputStream.
If the threads share a single DataInputStream, at best each thread will get some random quarter of the data. At worst, you'll get a crash or data corruption due to multithreaded access to code that is not thread safe.
If you want to read data from 1 file into 4 separate ones, you should not share the DataInputStream directly. You can, however, wrap that stream and add functionality that makes it thread safe.
For example you may want to read in a chunk of data from your DataInputStream and cache that small chunk. When all 4 threads have read the chunk you can dispose of it and continue reading. You would never have to load the complete file into memory. You would only have to load a small amount.
If you look at the documentation of DataInputStream, you will see that it is a FilterInputStream, which means read operations are delegated to another InputStream. Suppose the underlying stream is a FileInputStream; on most platforms, concurrent reads of the same file are supported.
So in your case, you should create four different FileInputStreams, wrap each one in its own DataInputStream, and use one per thread. The reads will not interfere with each other.
Short answer is no.
Longer answer: have a single thread read the DataInputStream, and put the data into one of four Queues, one per output file. Decide which Queue based upon the byte content.
Have four threads, each one reading from a Queue, that write to the output files.
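The single-reader design above could be sketched roughly as follows. Everything named here is an assumption for illustration: the class ByteDispatchDemo, the routing rule (byte value modulo the number of outputs), the END sentinel, and the in-memory sinks that stand in for the four output files. Queueing one byte at a time keeps the sketch short; a real version would more likely pass chunks.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ByteDispatchDemo {

    private static final int END = -1;           // sentinel: no more bytes will follow

    // One thread (the caller) reads the source; one writer thread per output drains a queue.
    static List<byte[]> split(InputStream source, int outputs)
            throws IOException, InterruptedException {
        List<BlockingQueue<Integer>> queues = new ArrayList<>();
        List<ByteArrayOutputStream> sinks = new ArrayList<>();
        List<Thread> writers = new ArrayList<>();

        for (int i = 0; i < outputs; i++) {
            BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1024);
            ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for a file
            Thread writer = new Thread(() -> {
                try {
                    for (int b = queue.take(); b != END; b = queue.take()) {
                        sink.write(b);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            queues.add(queue);
            sinks.add(sink);
            writers.add(writer);
            writer.start();
        }

        try (DataInputStream in = new DataInputStream(source)) {
            int b;
            while ((b = in.read()) != -1) {      // read() yields 0..255, or -1 at the end
                queues.get(b % outputs).put(b);  // hypothetical rule: route by byte value
            }
        } finally {
            for (BlockingQueue<Integer> queue : queues) {
                queue.put(END);                  // let every writer thread terminate
            }
        }

        for (Thread writer : writers) {
            writer.join();
        }
        List<byte[]> result = new ArrayList<>();
        for (ByteArrayOutputStream sink : sinks) {
            result.add(sink.toByteArray());
        }
        return result;
    }
}
```

Only the calling thread ever touches the DataInputStream, so its lack of thread safety does not matter in this arrangement.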

Reading from socket input stream

I'm trying to determine the best way to transfer data through a socket between a client and server. Currently I have a BufferedReader that reads one character at a time (or however many characters have arrived since the last iteration). Through each iteration, it pulls the data received so far and puts it into an array. When the '|' character is read, it knows that the current instruction is done.
I know what I have so far is grossly inefficient and burns the CPU, but I'm a little unclear as to the differences between all the ways to read from a socket input stream. What would I use to not have to read each character at a time, but rather to wait until the input stream is finished receiving the current instruction (which would be terminated by "\n")?
I find the best way is to create a new thread for each socket/client that listens for input. readLine() has always worked fine for me; it blocks until a complete line has arrived. Perhaps this might help you out a bit.
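A small sketch of the thread-per-client idea with readLine(), assuming instructions are terminated by "\n" as the question suggests. The class name, the method names and the Consumer callback are hypothetical. readLine() blocks until a full line is available, so there is no busy loop over single characters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

class LineProtocolDemo {

    // Reads newline-terminated instructions until the stream ends and
    // hands each complete instruction to the handler.
    static void readInstructions(InputStream in, Consumer<String> handler) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {   // null means the peer closed the stream
            handler.accept(line);
        }
    }

    // Starts one listener thread for an already accepted client socket.
    static Thread serve(Socket client, Consumer<String> handler) {
        Thread thread = new Thread(() -> {
            try (Socket socket = client) {             // close the socket when done
                readInstructions(socket.getInputStream(), handler);
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        thread.start();
        return thread;
    }
}
```

On the server side, the Socket passed to serve would typically come from ServerSocket.accept().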

Process a file line by line concurrently

I am working on a data format transformation job.
There is a large file, around 10 GB. My current solution reads this file line by line, transforms the format of each line, then writes it to an output file. I found that the transformation is the bottleneck, so I am trying to do it concurrently.
Each line is a complete unit and has nothing to do with the other lines. Some lines may be discarded because a specific value in the line does not meet the requirements.
I have two plans:
One thread reads data line by line from the input file and puts each line into a queue; several threads take lines from the queue, transform the format, and put the results into an output queue; finally, an output thread reads lines from the output queue and writes them to an output file.
Several threads concurrently read data from different parts of the input file, then process the lines and write to a file through an output queue or a file lock.
Could you please give me some advice? I would really appreciate it.
Thanks in advance!
I would go for the first option ... reading data from a file in small pieces normally is slower than reading the whole file at once (depending on file caches/buffering/read ahead etc).
You also might need to think about a way to create the output file (acquiring all lines from the different processes, possibly in the correct order if needed).
Solution 1 makes sense.
This would also map nicely and simply to Java's Executor framework. Your main thread reads lines and submits each line to an Executor or ExecutorService.
It gets more complicated if you must keep order intact, though.
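One way the Executor idea could look while still keeping the output in input order is sketched below. The names ParallelTransformDemo, transform and convert are hypothetical, and the transformation itself (upper-casing, dropping blank lines) is only a placeholder for the real format conversion. Results are collected from the futures in submission order, and the number of in-flight lines is capped so a 10 GB input is never held in memory at once.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelTransformDemo {

    // Placeholder transformation: returns null for lines that should be discarded.
    static String transform(String line) {
        return line.trim().isEmpty() ? null : line.toUpperCase();
    }

    // The calling thread reads and writes; the pool threads only transform.
    static void convert(BufferedReader in, Writer out, int workers)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        int window = workers * 4;                      // cap on lines in flight
        Deque<Future<String>> pending = new ArrayDeque<>();
        try {
            String line;
            while ((line = in.readLine()) != null) {
                final String current = line;
                Callable<String> task = () -> transform(current);
                pending.addLast(pool.submit(task));
                if (pending.size() >= window) {
                    writeResult(pending.removeFirst(), out);  // oldest first keeps order
                }
            }
            while (!pending.isEmpty()) {
                writeResult(pending.removeFirst(), out);
            }
            out.flush();
        } finally {
            pool.shutdown();
        }
    }

    private static void writeResult(Future<String> future, Writer out)
            throws IOException, InterruptedException, ExecutionException {
        String result = future.get();                  // waits for this line's transform
        if (result != null) {
            out.write(result);
            out.write('\n');
        }
    }
}
```

If order does not matter, the worker tasks could instead push results straight into an output queue, which is closer to the first plan in the question.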

Completely stumped on a Java problem

Please Note: I am not "looking for teh codez" - just ideas for algorithms to solve this problem.
This IS a homework assignment. I thought I was in the home stretch, about to finish it out, but the last part has absolutely stumped me. Never have I been stuck like this. It has to do with threading in Java.
The Driver class reads a file, the first line indicates the number of threads, second line is a space delimited list of file names for each thread to read from. Each thread is numbered (0 - N), N being the total number of files. Each thread reads the file specified, and outputs to a file named t#_out.txt where # is the threads index.
After all of this is done the Driver thread must:
After all threads finish execution, the program Driver.java opens all
output files t#_out.txt, reads a line from each file, and writes the
line to an output file out.txt.
Example of the out.txt:
MyThread[0]: Line[1]: Something there is that doesn't love a wall,
MyThread[1]: Line[1]: HOG Butcher for the World,
MyThread[2]: Line[1]: I think that I shall never see
MyThread[0]: Line[2]: That sends the frozen-ground-swell under it,
MyThread[1]: Line[2]: Tool Maker, Stacker of Wheat,
MyThread[2]: Line[2]: A poem lovely as a tree.
MyThread[0]: Line[3]: And spills the upper boulders in the sun,
MyThread[1]: Line[3]: Player with Railroads and the Nation's Freight Handler;
MyThread[2]: Line[3]: A tree whose hungry mouth is prest
My problem is: What kind of loop structure could I setup to do this? Read a line from t1_out.txt, write to out.txt, read line from t2_out.txt, write to out.txt, read line from tN_out.txt, write to out.txt? How do I know when one file has reached the end?
Ideas:
Use a while(!done) loop to continue looping until each scanner is done. Keep track of an array of booleans indicating whether or not the Scanner is done reading its file. The Scanners would be in an array as well. The problem with this is how do I tell when ALL are done, to finish my infinite loop? In each iteration see if booleans[i] is done and if not then done = false? No good.
Just read every file's lines into its own String[] array. Then figure out a loop to alternate the writing to out.txt. The problem with this is: what happens when I hit an array index out of bounds? Also, this is not in the specs; they say to read a line and write a line.
EDIT: The solution was to create an allFilesReachedEOF() method which starts with a boolean set to true. It then loops through each file, and if ANY has another line to read, sets the return value to false. This was my while loop's condition: while (!allFilesReachedEOF()).
My problem was that I was trying to control the loop from within the loop. So if a file had another line it would continue, but if ANY file EOF'd, the loop would stop.
Thanks for the help!
You could do a while with a condition that not all the files have reached EOF. Then you iterate through all the files, and for those that haven't reached EOF, you read the next line and write it to your output file. As you go, you update your condition variable for the "while" loop.
Is this what you're looking to do?
It sounds like you could use a Queue to achieve this. Add each t#_out.txt's input to the Queue then implement a loop in which you read a line from the polled input and write it to your output. As long as the read line isn't EOF, re-add the input to the Queue. When the Queue is empty, break out from the loop.
Also I recommend a BufferedWriter for the output, which you flush() at the end, so that lines are written to disk in large buffered chunks rather than one at a time.
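The Queue idea could be sketched like this. The class and method names are hypothetical, and the sketch takes already opened readers so it stays independent of the t#_out.txt file names. An input goes back to the end of the queue only while it still yields lines.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

class RoundRobinMergeDemo {

    // Writes one line from each input in turn until every input is exhausted.
    static void merge(List<BufferedReader> inputs, Writer out) throws IOException {
        Queue<BufferedReader> queue = new ArrayDeque<>(inputs);
        BufferedWriter writer = new BufferedWriter(out);
        while (!queue.isEmpty()) {
            BufferedReader input = queue.poll();
            String line = input.readLine();
            if (line != null) {
                writer.write(line);
                writer.write('\n');
                queue.add(input);                // still has data: back to the end
            }
            // an input that returned null has reached EOF and is simply dropped
        }
        writer.flush();
    }
}
```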
Here are the main points:
Create a class that implements Runnable whose run() method does what you need one thread to do. It'll likely need fields for threadNumber and filename. The run method should make sure to close() the output streams of the output files.
For each thread you need to create, create an instance of your class (giving it the filename and any other data it needs) and pass it to the constructor of Thread. Keep a reference to the Thread objects.
Call the start() method on all the threads
Call the join() method on all the threads (join waits for the thread to finish)
Do your final Driver task of opening the output files
This can be done by using a do-while "exit condition" loop and an inner for loop for iterating through the output files. Set the exit condition to true before the start of the for loop, and reset it to false within the for loop if you get at least one line from any of the files.
Files that have reached eof will continue to be read, but will not return any lines. You can choose to print blank lines for these or just skip them.
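Put together, the steps above might look roughly like the sketch below. It is not the assignment's reference solution: the class names DriverSketch and Worker, the runAll method and the directory parameter are assumptions, and parsing of the driver's own input file is left out. The output line format follows the out.txt example in the question, and files that have reached EOF are skipped rather than printed as blank lines.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class DriverSketch {

    // One worker per input file: copies its lines to t#_out.txt with a prefix.
    static class Worker implements Runnable {
        private final int index;
        private final Path input;
        private final Path dir;

        Worker(int index, Path input, Path dir) {
            this.index = index;
            this.input = input;
            this.dir = dir;
        }

        @Override
        public void run() {
            Path output = dir.resolve("t" + index + "_out.txt");
            // try-with-resources closes both streams even if an error occurs
            try (BufferedReader in = Files.newBufferedReader(input);
                 BufferedWriter out = Files.newBufferedWriter(output)) {
                String line;
                int lineNumber = 0;
                while ((line = in.readLine()) != null) {
                    lineNumber++;
                    out.write("MyThread[" + index + "]: Line[" + lineNumber + "]: " + line);
                    out.newLine();
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    static void runAll(List<Path> inputs, Path dir) throws IOException, InterruptedException {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < inputs.size(); i++) {
            Thread thread = new Thread(new Worker(i, inputs.get(i), dir));
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            thread.join();                       // wait for every worker to finish
        }

        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(dir.resolve("out.txt"))) {
            for (int i = 0; i < inputs.size(); i++) {
                readers.add(Files.newBufferedReader(dir.resolve("t" + i + "_out.txt")));
            }
            boolean done;
            do {
                done = true;                     // assume finished until a line turns up
                for (BufferedReader reader : readers) {
                    String line = reader.readLine();
                    if (line != null) {
                        out.write(line);
                        out.newLine();
                        done = false;            // at least one file still had data
                    }
                }
            } while (!done);
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
    }
}
```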
