Reading Really Big Files with Java

I am reading a 77 MB file inside a Servlet; in the future this will be 150 GB. The file was not written with anything from the NIO package, just a plain BufferedWriter.
Now this is what I need to do.
Read the file line by line. Each line is a "hash code" of a text. Split it into pieces of 3 characters (3 characters represent 1 word). A line could be long or short; I don't know in advance.
After reading a line, convert it into real words. We have a Map of hashes to words, so we can find the words.
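For illustration, a minimal sketch of that decode step, assuming a hypothetical Map<String, String> from 3-character hash chunks to words (the map, its name, and how it gets loaded are placeholders, not part of the original question):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HashDecoder {

    private final Map<String, String> hashToWord; // assumed to be loaded elsewhere

    public HashDecoder(Map<String, String> hashToWord) {
        this.hashToWord = hashToWord;
    }

    /** Splits a line into 3-char chunks and maps each chunk to its word. */
    public List<String> decode(String line) {
        List<String> words = new ArrayList<>();
        for (int i = 0; i + 3 <= line.length(); i += 3) {
            String chunk = line.substring(i, i + 3);
            words.add(hashToWord.getOrDefault(chunk, "?"));
        }
        return words;
    }
}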
Up to now I used BufferedReader to read the file. It is slow and not suitable for huge files like 150 GB; even this 77 MB file took hours to process. Since we can't keep the user waiting for hours (it should finish within seconds), we decided to load the file into memory. At first we thought about loading every single line into a LinkedList, but memory cannot hold such a big amount. After a lot of searching, I decided that memory-mapping the file would be the answer. Memory is much faster than disk, so we could read the file very fast too.
Code:
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.logging.Level;
import java.util.logging.Logger;

public class MapRead {

    public MapRead() {
        try {
            File file = new File("E:/Amazon HashFile/Hash.txt");
            FileChannel c = new RandomAccessFile(file, "r").getChannel();
            MappedByteBuffer buffer = c.map(FileChannel.MapMode.READ_ONLY, 0, c.size()).load();
            for (int i = 0; i < buffer.limit(); i++) {
                System.out.println((char) buffer.get());
            }
            System.out.println(buffer.isLoaded());
            System.out.println(buffer.capacity());
        } catch (IOException ex) {
            Logger.getLogger(MapRead.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
But I could not see any "super fast" behavior, and I need line-by-line reading. I have a few questions:
You have read my description and know what I need to do. I have done the first step; is it correct?
Is the way I map the file correct? I mean, it seems no different from reading it the normal way. Does mapping hold the "entire" file in memory first, and do we then have to write other code to access that memory?
How do I read it line by line, very fast? (If I have to load/map the entire file into memory first for hours and then access it at high speed in seconds, I am totally fine with that too.)
Is reading files in Servlets a good idea? (This servlet will be accessed by thousands of users at once, and only one I/O stream will be open at a time.)
Update
This is how my code looks after I updated it following SO user Luiggi Mendoza's answer.
import java.util.concurrent.BlockingQueue;

public class BigFileProcessor implements Runnable {

    private final BlockingQueue<String> linesToProcess;

    public BigFileProcessor(BlockingQueue<String> linesToProcess) {
        this.linesToProcess = linesToProcess;
    }

    @Override
    public void run() {
        String line = "";
        try {
            while ((line = linesToProcess.take()) != null) {
                System.out.println(line); // This is not happening
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.concurrent.BlockingQueue;

public class BigFileReader implements Runnable {

    private final String fileName;
    int a = 0;
    private final BlockingQueue<String> linesRead;

    public BigFileReader(String fileName, BlockingQueue<String> linesRead) {
        this.fileName = fileName;
        this.linesRead = linesRead;
    }

    @Override
    public void run() {
        try {
            // Scanner did not work. I had to use BufferedReader
            BufferedReader br = new BufferedReader(new FileReader(new File("E:/Amazon HashFile/Hash.txt")));
            String str = "";
            while ((str = br.readLine()) != null) {
                // System.out.println(a);
                a++;
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class BigFileWholeProcessor {

    private static final int NUMBER_OF_THREADS = 2;

    public void processFile(String fileName) {
        BlockingQueue<String> fileContent = new LinkedBlockingQueue<String>();
        BigFileReader bigFileReader = new BigFileReader(fileName, fileContent);
        BigFileProcessor bigFileProcessor = new BigFileProcessor(fileContent);
        ExecutorService es = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
        es.execute(bigFileReader);
        es.execute(bigFileProcessor);
        es.shutdown();
    }
}
public class Main {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        BigFileWholeProcessor b = new BigFileWholeProcessor();
        b.processFile("E:/Amazon HashFile/Hash.txt");
    }
}
I am trying to print the file in BigFileProcessor. What I understood is this:
User enters a file name.
That file gets read by BigFileReader, line by line.
After each line, BigFileProcessor gets called. That is, assume BigFileReader reads the first line; now BigFileProcessor is called. Once BigFileProcessor finishes processing that line, BigFileReader reads line 2, and BigFileProcessor gets called again for that line, and so on.
Maybe my understanding of this code is incorrect. How should I process the lines?

I would suggest using multiple threads here:
One thread reads each line of the file and inserts it into a BlockingQueue to be processed.
Another thread (or several) takes the elements from this queue and processes them.
To implement this multi-threaded work, it is better to use the ExecutorService interface and pass it Runnable instances, each implementing one task. Remember to have only a single task reading the file.
You could also manage a way to pause reading if the queue reaches a specific size, e.g. if the queue has 10000 elements, wait until its size is down to 8000, then continue reading and filling the queue (see the bounded-queue sketch below).
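As a side note, a bounded LinkedBlockingQueue gives you this kind of back-pressure for free, since put() blocks while the queue is at capacity. A minimal sketch (the capacity value is illustrative):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BoundedQueueExample {

    public static void main(String[] args) throws InterruptedException {
        // The reader blocks on put() once 10000 lines are waiting,
        // and resumes automatically as the processor drains the queue.
        BlockingQueue<String> lines = new LinkedBlockingQueue<>(10000);
        lines.put("some line");     // blocks while the queue is full
        String next = lines.take(); // blocks while the queue is empty
        System.out.println(next);
    }
}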
Reading files in Servlets is good ?
I would recommend never doing heavy work in a servlet. Instead, fire an asynchronous task, e.g. via a JMS call, and process the file in that external agent.
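As an illustration only, a sketch of firing such a JMS message from the servlet side; the JNDI names and the queue are assumptions that depend entirely on your container's configuration:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.naming.InitialContext;

public class FileJobSubmitter {

    /** Fires an asynchronous "process this file" job; the servlet returns immediately. */
    public void submit(String fileName) throws Exception {
        InitialContext ctx = new InitialContext();
        // These JNDI names are hypothetical; use whatever your container defines.
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("jms/FileProcessingQueue");
        Connection conn = cf.createConnection();
        try {
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(queue);
            TextMessage msg = session.createTextMessage(fileName);
            producer.send(msg); // an external agent picks this up and processes the file
        } finally {
            conn.close();
        }
    }
}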
A brief sample of the multi-threaded approach described above:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.concurrent.BlockingQueue;

public class BigFileReader implements Runnable {

    private final String fileName;
    private final BlockingQueue<String> linesRead;

    public BigFileReader(String fileName, BlockingQueue<String> linesRead) {
        this.fileName = fileName;
        this.linesRead = linesRead;
    }

    @Override
    public void run() {
        // since it is a sample, I avoid managing how many lines you have read
        // and that stuff, but it should not be complicated to accomplish
        try (Scanner scanner = new Scanner(new File(fileName))) {
            while (scanner.hasNextLine()) {
                try {
                    linesRead.put(scanner.nextLine());
                } catch (InterruptedException ie) {
                    // handle the exception...
                    ie.printStackTrace();
                }
            }
        } catch (FileNotFoundException fnfe) {
            // run() cannot throw checked exceptions, so handle the failure here
            fnfe.printStackTrace();
        }
    }
}
import java.util.concurrent.BlockingQueue;

public class BigFileProcessor implements Runnable {

    private final BlockingQueue<String> linesToProcess;

    public BigFileProcessor(BlockingQueue<String> linesToProcess) {
        this.linesToProcess = linesToProcess;
    }

    @Override
    public void run() {
        String line = "";
        try {
            while ((line = linesToProcess.take()) != null) {
                // do what you want/need to process this line...
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class BigFileWholeProcessor {

    private static final int NUMBER_OF_THREADS = 2;

    public void processFile(String fileName) {
        BlockingQueue<String> fileContent = new LinkedBlockingQueue<String>();
        BigFileReader bigFileReader = new BigFileReader(fileName, fileContent);
        BigFileProcessor bigFileProcessor = new BigFileProcessor(fileContent);
        ExecutorService es = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
        es.execute(bigFileReader);
        es.execute(bigFileProcessor);
        es.shutdown();
    }
}
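One caveat worth flagging (my addition, not part of the original answer): BlockingQueue.take() never returns null; it blocks, so the processor loop above never exits on its own. A common fix is a sentinel "poison pill" that the reader enqueues once it reaches end of file. A minimal sketch:

import java.util.concurrent.BlockingQueue;

public class PoisonPillProcessor implements Runnable {

    // Sentinel instance; the reader must put exactly this object after its read loop.
    public static final String POISON_PILL = new String("EOF");

    private final BlockingQueue<String> linesToProcess;

    public PoisonPillProcessor(BlockingQueue<String> linesToProcess) {
        this.linesToProcess = linesToProcess;
    }

    @Override
    public void run() {
        try {
            String line;
            // Reference comparison: only the sentinel instance itself ends the loop.
            while ((line = linesToProcess.take()) != POISON_PILL) {
                // process the line...
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}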

NIO won't help you here. BufferedReader is not slow. If you're I/O bound, you're I/O bound -- get faster I/O.
Mapping the file into memory can help, but only if you're actually using the memory in place, rather than just copying all of the data out of the big byte array that you get back. The primary advantage of mapping the file is that it keeps the data out of the Java heap, and away from the garbage collector.
Your best performance will come from working on the data in place, not copying it into the heap if you can avoid it.
Some of your performance may be impacted by object creation. For example, if you were trying to load your data into a LinkedList, you're creating (likely) millions of nodes for the List itself, plus the objects wrapping your data (even if they're just Strings).
Creating Strings based on your memory-mapped array can be quite efficient, as the String will simply wrap the data, not copy it. But you'll have to be UTF-aware if you're working with something other than ASCII (as bytes are not characters in Java).
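To make "working in place" concrete, here is a minimal sketch (my illustration, assuming single-byte ASCII content and the file path from the question) that scans a mapped buffer for newline-delimited records, materializing only one line at a time on the heap:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MappedLineScanner {

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("E:/Amazon HashFile/Hash.txt", "r");
             FileChannel ch = raf.getChannel()) {
            // map() is limited to 2 GB per mapping, so a 150 GB file
            // would have to be mapped and scanned in chunks.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int lineStart = 0;
            for (int i = 0; i < buf.limit(); i++) {
                if (buf.get(i) == '\n') {
                    byte[] line = new byte[i - lineStart]; // materialize only the current line
                    buf.position(lineStart);
                    buf.get(line);
                    process(new String(line, StandardCharsets.US_ASCII));
                    lineStart = i + 1;
                }
            }
            // (a final line without a trailing '\n' is ignored in this sketch)
        }
    }

    private static void process(String line) {
        // decode the line here...
    }
}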
Also if you're loading in large things, with lots of objects, ensure that you have free space in your heap for them. And by free space, I mean actual room. You can have a 500MB heap, as specified by -Xmx, but the ACTUAL heap will not be that large initially, it will grow to that limit.
Assuming you have sufficient memory in the first place, you can do this via -Xms, which will pre-allocate the heap to a desired size, or you can simply do a quick byte[] buf = new byte[400 * 1024 * 1024], to make a huge allocation, force the GC, and stretch the heap.
What you don't want to be doing is allocating a million objects and having the VM run a GC every 10,000 allocations or so as the heap grows. Pre-allocating other data structures is also helpful (notably ArrayLists; LinkedLists not so much).

Divide the file into smaller parts. For this you'll need access to seekable reads, so you can fast-forward to other parts of the file (see the sketch below).
For each part, spawn multiple worker threads, each with its own copy of the hash lookup table. Let completed threads join a collector thread, which will write completed chunks in order and signal that processing is done.
It is better to stream file chunks than to load all of them into memory.
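A minimal sketch of that seek-based splitting (my illustration; chunk boundaries are advanced to the next newline so no line is cut in half):

import java.io.IOException;
import java.io.RandomAccessFile;

public class FileSplitter {

    /** Computes byte offsets that split the file into nChunks newline-aligned parts. */
    public static long[] chunkOffsets(String path, int nChunks) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            long[] offsets = new long[nChunks + 1];
            offsets[nChunks] = length;
            for (int i = 1; i < nChunks; i++) {
                raf.seek(i * (length / nChunks)); // jump near the ideal boundary
                raf.readLine();                   // skip forward past the current (partial) line
                offsets[i] = raf.getFilePointer(); // chunk starts at the next full line
            }
            return offsets;
        }
    }
}

Each worker thread then reads only its byte range [offsets[i], offsets[i+1]) through its own RandomAccessFile, so the workers never contend on a shared stream.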

Related

How to consume a process' stdout as a stream, without blocking?

In Java (or clojure) I would like to spin up an external process and consume its stdout as a stream. Ideally, I would like to consume the process' output stream every time that the external process flushes it, but am not sure how that can be accomplished, and how it can be accomplished without blocking.
Working with a Java ProcessPipeInputStream for a shelled-out process (for example on Unix), I find the inherited InputStream methods a bit low-level to work with, and I am not sure whether there is a way to consume from the stream each time the producer side flushes, or otherwise in a non-blocking fashion.
Many code examples block on the output stream in an infinite loop, thereby hogging a thread for the listening. My hope is that this blocking behavior can be avoided altogether.
Bottom line:
Is there a non-blocking way to be notified on an input stream, every time that the producing side of it flushes?
You need to create a separate Thread that consumes from the stream, allowing the rest of your program to do whatever it is meant to be doing in parallel.
import java.io.IOException;
import java.io.InputStream;

class ProcessOutputReader implements Runnable {

    private final InputStream processOutput;

    public ProcessOutputReader(final InputStream processOutput) {
        this.processOutput = processOutput;
    }

    @Override
    public void run() {
        try {
            int nextByte;
            while ((nextByte = processOutput.read()) != -1) {
                // do whatever you need to do byte-by-byte.
                processByte(nextByte);
            }
        } catch (final IOException ex) {
            // run() cannot throw checked exceptions, so handle the IO failure here.
            ex.printStackTrace();
        }
    }

    private void processByte(final int b) {
        // byte-by-byte processing goes here
    }
}
class Main {
    public static void main(final String[] args) {
        final Process proc = ...;
        final ProcessOutputReader reader = new ProcessOutputReader(proc.getInputStream());
        final Thread processOutputReaderThread = new Thread(reader);
        processOutputReaderThread.setDaemon(true); // allow the VM to terminate if this is the only thread still active
        processOutputReaderThread.start();
        ...
        // if you want to wait for the whole process output to be processed at some point you can do this:
        try {
            processOutputReaderThread.join();
        } catch (final InterruptedException ex) {
            // you need to decide how to recover if your wait was interrupted.
        }
    }
}
If instead of processing byte-by-byte you want to deal with each flush as a single piece... I'm not sure you are 100% guaranteed to be able to capture each process flush. After all, the process's own I/O framework (Java, C, Python, etc.) may handle the "flush" operation differently, and what you end up receiving may be multiple blocks of bytes for any given flush in the external process.
In any case you can attempt to do that by using the InputStream's available method like so:
@Override
public void run() {
    try {
        int nextByte;
        while ((nextByte = processOutput.read()) != -1) {
            final int available = processOutput.available();
            byte[] block = new byte[available + 1];
            block[0] = (byte) nextByte;
            final int actuallyAvailable = processOutput.read(block, 1, available);
            if (actuallyAvailable < available) {
                if (actuallyAvailable == -1) {
                    block = new byte[] { (byte) nextByte };
                } else {
                    block = Arrays.copyOf(block, actuallyAvailable + 1); // requires java.util.Arrays
                }
            }
            // do whatever you need to do with that block now.
            processBlock(block);
        }
    } catch (final IOException ex) {
        ex.printStackTrace();
    }
}
I'm not 100% sure of this, but I think one cannot trust that available() will return a guaranteed lower bound on the number of bytes you can retrieve without blocking, nor that the next read operation will return that number of bytes if requested; that is why the code above checks the actual number of bytes read (actuallyAvailable).

Java - How do I safely stop a thread in a Web App from the GUI?

Is there a way to safely and immediately stop the execution of a Thread in Java? Especially, if the logic inside the run() method of the Runnable implementation executes only a single iteration and does not regularly check for any flag that tells it to stop?
I am building a Web Application, using which a user can translate the contents of an entire document from one language to another.
Assuming the documents are extra-large, and subsequently assuming each translation is going to take a long time (say 20-25 minutes), my application creates a separate Thread for each translation that is initiated by its users. A user can see a list of active translations and decide to stop a particular translation job if he/she wishes so.
This is my Translator.java
public class Translator {
    public void translate(File file, String sourceLanguage, String targetLanguage) {
        // Translation happens here
        // .......
        // Translation ends and a new File is created.
    }
}
I have created a TranslatorRunnable class which implements the Runnable interface as follows:
public class TranslatorRunnable implements Runnable {

    private File document;
    private String sourceLanguage;
    private String targetLanguage;

    public TranslatorRunnable(File document, String sourceLanguage, String targetLanguage) {
        this.document = document;
        this.sourceLanguage = sourceLanguage;
        this.targetLanguage = targetLanguage;
    }

    public void run() {
        Translator translator = new Translator();
        translator.translate(this.document, this.sourceLanguage, this.targetLanguage);
        System.out.println("Translator thread is finished.");
    }
}
I'm creating the thread for translating a document from an outer class like this:
TranslatorRunnable tRunnable = new TranslatorRunnable(document, "ENGLISH", "FRENCH");
Thread t = new Thread(tRunnable);
t.start();
Now my problem is how do I stop a translation process (essentially a Thread) when the user clicks on "Stop" in the GUI?
I have read a few posts on StackOverflow and on other sites which tell me to have a volatile boolean flag inside the Runnable implementation, check it regularly from inside the run() method, and decide when to stop. See this post.
This doesn't work for me, as the run() method just calls Translator.translate(), which itself is going to take a long time. I have no option here.
The next thing I read was to use ExecutorService and its shutdownNow() method. But even there, I'd have to handle InterruptedException regularly within my code, which is again out of the question. I referred to the documentation of the ExecutorService class.
I know I cannot use Thread.stop() as it is deprecated and may cause issues with objects that are commonly used by all threads.
What options do I have?
Is my requirement really feasible without substantial changes to my design? If yes, please tell me how.
If it is absolutely necessary for me to change the design, could anyone tell me what is the best approach I can take?
Thanks,
Sriram
Is there a way to safely and immediately stop the execution of a Thread in Java?
No. Each thread is responsible for periodically checking whether it has been interrupted, so it can exit as soon as possible:

if (Thread.currentThread().isInterrupted()) {
    // release resources; finish quickly whatever it was doing
}

If you want a more responsive application, you have to change the logic (for example, divide each job into smaller batches) so each thread performs this check more often than once every 20-25 minutes. A sketch of that batching idea follows.
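A minimal sketch of the batching idea (my illustration; translateBatch is a hypothetical method standing in for one small slice of the real translation work):

public class BatchedTranslator {

    /** Translates the document in small batches, checking for interruption between batches. */
    public void translate(java.util.List<String> paragraphs) {
        for (String paragraph : paragraphs) {
            if (Thread.currentThread().isInterrupted()) {
                // release resources, delete partial output, etc.
                return; // exit as soon as possible
            }
            translateBatch(paragraph); // hypothetical: one small unit of work
        }
    }

    private void translateBatch(String paragraph) {
        // the real (slow) translation work for one batch goes here
    }
}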
If you are the one who created the Translator class, what stops you from adding some kind of flag inside the function that is checked periodically and, if set, stops reading lines from the file? Something like this:
private static volatile boolean needsToStop = false; // set from the GUI's "Stop" handler

public static List<String> readFile(String filename) {
    List<String> records = new ArrayList<>();
    try {
        BufferedReader reader = new BufferedReader(new FileReader(filename));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] split = line.split("\\s+");
            records.addAll(Arrays.asList(split));
            if (needsToStop) {
                break; // or throw an exception
            }
        }
        reader.close();
        return records;
    } catch (Exception e) {
        System.err.format("Exception occurred trying to read '%s'.", filename);
        e.printStackTrace();
        return null;
    }
}

Best way to write huge number of files

I am writing lots of files, like below.
public void call(Iterator<Tuple2<Text, BytesWritable>> arg0) throws Exception {
    while (arg0.hasNext()) {
        Tuple2<Text, BytesWritable> tuple2 = arg0.next();
        System.out.println(tuple2._1().toString());
        PrintWriter writer = new PrintWriter("/home/suv/junk/sparkOutPut/" + tuple2._1().toString(), "UTF-8");
        writer.println(new String(tuple2._2().getBytes()));
        writer.close();
    }
}
Is there any better way to write the files, without creating and closing a PrintWriter every time?
There is no significantly better way to write lots of files. What you are doing is inherently I/O intensive.
UPDATE - @Michael Anderson is right, I think. Using multiple threads to write the files will (probably) speed things up considerably. However, the I/O is still going to be the ultimate bottleneck, for a couple of reasons:
Creating, opening and closing files involves file and directory metadata access and update. This entails non-trivial CPU.
The file data and metadata changes need to be written to disk. That is possibly multiple disk writes.
There are at least 3 syscalls for each file written.
Then there are thread switching overheads.
Unless the quantity of data written to each file is significant (multiple kilobytes per file), I doubt that the techniques like using NIO, direct buffers, JNI and so on will be worthwhile. The real bottlenecks will be in the kernel: file system operations and low-level disk I/O.
... without closing or creating printwriter every time.
No. You need to create a new PrintWriter (or Writer, or OutputStream) for each file.
However, this ...
writer.println(new String(tuple2._2().getBytes()));
... looks rather peculiar. You appear to be:
calling getBytes() on a String (?),
converting the byte array to a String
calling the println() method on the String, which will copy it and then convert it back into bytes before finally outputting them.
What gives? What is the point of the String -> bytes -> String conversion?
I'd just do this:
writer.println(tuple2._2());
This should be faster, though I wouldn't expect the percentage speed-up to be that large.
I'm assuming you're after the fastest way. Because everyone knows fastest is best ;)
One simple way is to use a bunch of threads to do your writing for you.
However, you're not going to get much benefit from this unless your filesystem scales well. (I use this technique on Lustre-based cluster systems, in cases where "lots of files" could mean 10k; there, many of the writes will be going to different servers / disks.)
The code would look something like this. (Note: I think this version is not right, as for small numbers of files this fills the work queue, but see the next version for the better version anyway...)
public void call(Iterator<Tuple2<Text, BytesWritable>> arg0) throws Exception {
    int nThreads = 5;
    ExecutorService threadPool = Executors.newFixedThreadPool(nThreads);
    ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(threadPool);
    int nJobs = 0;
    while (arg0.hasNext()) {
        ++nJobs;
        final Tuple2<Text, BytesWritable> tuple2 = arg0.next();
        ecs.submit(new Callable<Void>() {
            @Override
            public Void call() throws Exception {
                System.out.println(tuple2._1().toString());
                String path = "/home/suv/junk/sparkOutPut/" + tuple2._1().toString();
                try (PrintWriter writer = new PrintWriter(path, "UTF-8")) {
                    writer.println(new String(tuple2._2().getBytes()));
                }
                return null;
            }
        });
    }
    for (int i = 0; i < nJobs; ++i) {
        ecs.take().get();
    }
    threadPool.shutdown(); // avoid leaking the pool's threads
}
Better yet is to start writing your files as soon as you have data for the first one, not when you've got data for all of them - and for this writing to not block the calculation thread(s).
To do this you split your application into several pieces communicating over a (thread safe) queue.
Code then ends up looking more like this:
public void main() {
    SomeMultithreadedQueue<Data> queue = ...;
    int nGeneratorThreads = 1;
    int nWriterThreads = 5;
    int nThreads = nGeneratorThreads + nWriterThreads;
    ExecutorService threadPool = Executors.newFixedThreadPool(nThreads);
    ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(threadPool);
    AtomicInteger completedGenerators = new AtomicInteger(0);
    // Start some generator threads.
    for (int i = 0; i < nGeneratorThreads; ++i) {
        ecs.submit(() -> {
            while (...) {
                Data d = ...;
                queue.push(d);
            }
            // The last generator to finish pushes the end-of-stream sentinel.
            // (Note: with several writer threads you would push one sentinel per writer.)
            if (completedGenerators.incrementAndGet() == nGeneratorThreads) {
                queue.push(null);
            }
            return null;
        });
    }
    // Start some writer threads.
    for (int i = 0; i < nWriterThreads; ++i) {
        ecs.submit(() -> {
            Data d;
            while ((d = queue.take()) != null) {
                String path = d.path();
                try (PrintWriter writer = new PrintWriter(path, "UTF-8")) {
                    writer.println(new String(d.getBytes()));
                }
            }
            return null;
        });
    }
    for (int i = 0; i < nThreads; ++i) {
        ecs.take().get();
    }
}
Note I've not provided an implementation of the queue class; you can easily wrap the standard Java thread-safe ones to get what you need, as sketched below.
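A minimal sketch of such a wrapper (my illustration; it exists mainly because BlockingQueue rejects null, so the null end-of-stream marker above needs a stand-in):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SomeMultithreadedQueue<T> {

    private static final Object NULL_SENTINEL = new Object(); // stands in for null

    private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();

    public void push(T item) {
        try {
            queue.put(item != null ? item : NULL_SENTINEL);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    @SuppressWarnings("unchecked")
    public T take() {
        try {
            Object item = queue.take();
            return item == NULL_SENTINEL ? null : (T) item;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}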
There's still a lot more that can be done to reduce latency, etc. Here are some further things I've used to get the times down:
don't even wait for all the data to be generated for a given file. Pass another queue containing packets of bytes to write.
Watch out for allocations - you can reuse some of your buffers.
There's some latency in the nio stuff - you can get some performance improvements by using C writes and JNI and direct buffers.
Thread switching can hurt, and the latency in the queues can hurt, so you might want to batch up your data slightly. Balancing this against the first point can be tricky.

How to run a java program on multiple threads

Below is my code to extract text from a text file and displaying it on the console.
Could someone please tell me how to make this program run on multiple threads simultaneously?
I would also like to know whether multiple threads are being used to perform the task, since the time taken to run it varies on every run.
//Code
import java.io.*;
import java.util.*;
class Extract {

    static int i = 0;
    FileInputStream in;
    BufferedReader br;
    ArrayList<String> stringList;
    String li;

    Extract() throws FileNotFoundException {
        FileInputStream in = new FileInputStream("C:\\Users\\sputta\\workspace\\Sample\\src\\threads.txt");
        br = new BufferedReader(new InputStreamReader(in));
        stringList = new ArrayList<String>();
        li = " ";
    }

    void call() {
        try {
            while (li != null) {
                String str = br.readLine();
                stringList.add(str);
                li = stringList.get(i);
                if (li != null) {
                    System.out.println(li);
                    i++;
                }
            }
            Thread.sleep(1000);
            in.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
class Caller implements Runnable {

    Extract target;
    Thread t;

    public Caller(Extract targ) {
        target = targ;
        t = new Thread(this);
        t.start();
        System.out.println(t.isAlive());
    }

    public void run() {
        synchronized (target) { // synchronized block
            target.call();
        }
    }
}
public class Sample {
    public static void main(String args[]) throws FileNotFoundException {
        long startTime = System.currentTimeMillis();
        System.out.println(startTime);
        Extract target = new Extract();
        Caller ob1 = new Caller(target);
        Caller ob2 = new Caller(target);
        Caller ob3 = new Caller(target);
        try {
            ob1.t.join();
            ob2.t.join();
            ob3.t.join();
        } catch (InterruptedException e) {
            System.out.println("Interrupted");
        }
    }
}
It does not make much sense performance-wise to have multiple threads reading from the same file, due to the inevitable input/output (I/O) bottleneck.
Two things that can be done to improve the situation:
"Split" the file into smaller pieces and assign each such "split" to a different thread. This is the approach followed by Hadoop, but it does require copying each "split" before processing, so it is only beneficial for large files (say, at least 100 MB each, or much more).
Use 1 thread to read from the file into a "prefetch" buffer, in memory, and then process the input from the buffer, via multiple other threads. A variation of this approach would be for the prefetch thread to "feed" each of the "consumer" threads with data, before each of them starts. Obviously, the relative allocation of prefetch vs. processing across the threads, will yield varying results, so further tuning would be necessary, depending on the application.
Both approaches have limitations and do not guarantee performance improvements in all cases.
Reading a text file line-by-line from a single thread can be done at a speed of over 1 million lines/sec, but still the bottleneck will remain in I/O, as already discussed.
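For reference, a minimal sketch (my illustration) of how one would measure that single-threaded line-reading rate for a given file:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineReadBenchmark {

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long count = 0;
        try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
            while (br.readLine() != null) {
                count++;
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d lines in %.2f s (%.0f lines/sec)%n", count, seconds, count / seconds);
    }
}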

Knowing when akka actors are finished

There are a few people working on a project along with me that have been trying to figure out the best way to deal with this issue. It seems this should be a standard thing wanted regularly, but for some reason we can't seem to get the right answer.
If I have some work to be done and I throw a bunch of messages at a router, how can I tell when all the work is done? For example, if we're reading lines of a 1 million line file and sending the line off to actors to process this, and you need to process the next file, but must wait for the first to complete, how can you know when it is complete?
One further comment. I'm aware of, and have used, Await.result() and Await.ready() together with Patterns.ask(). One difference is that each line would have a Future, and we'd have a HUGE array of these futures to wait on, not just one. Additionally, we are populating a large domain model that takes up considerable memory, and we do not wish to spend additional memory holding an equal number of futures waiting to be composed; with actors, each one completes after doing its work, without holding memory while waiting to be composed.
We're using Java and not Scala.
Pseudo code:
for (File file : files) {
    ...
    while ((String line = getNextLine(fileStream)) != null) {
        router.tell(line, this.getSelf());
    }
    // we need to wait for this work to finish before doing the next
    // file, because it depends on the previous work
}
It would seem you'd often want to do a lot of work and know when it's finished with actors.
I believe I have a solution for you and it does not involve accumulating a whole bunch of Futures. First, the high level concept. There will be two actors participating in this flow. The first we'll call FilesProcessor. This actor will be short lived and stateful. Whenever you want to process a bunch of files sequentially, you spin up an instance of this actor and pass it a message containing the names (or paths) of the files you want to process. When it has completed processing of all of the files, it stops itself. The second actor we will call LineProcessor. This actor is stateless, long lived and pooled behind a router. It processes a file line and then responds back to whoever requested the line processing telling them it has completed processing that line. Now onto the code.
First the messages:
public class Messages {

    public static class ProcessFiles {
        public final List<String> fileNames;

        public ProcessFiles(List<String> fileNames) {
            this.fileNames = fileNames;
        }
    }

    public static class ProcessLine {
        public final String line;

        public ProcessLine(String line) {
            this.line = line;
        }
    }

    public static class LineProcessed {}

    public static final LineProcessed LINE_PROCESSED = new LineProcessed();
}
And the FilesProcessor:
public class FilesProcessor extends UntypedActor {

    private List<String> files;
    private int awaitingCount;
    private ActorRef router;

    @Override
    public void onReceive(Object msg) throws Exception {
        if (msg instanceof ProcessFiles) {
            ProcessFiles pf = (ProcessFiles) msg;
            router = ...; // look up router
            files = pf.fileNames;
            processNextFile();
        } else if (msg instanceof LineProcessed) {
            awaitingCount--;
            if (awaitingCount <= 0) {
                processNextFile();
            }
        }
    }

    private void processNextFile() {
        if (files.isEmpty()) {
            getContext().stop(getSelf());
        } else {
            String file = files.remove(0);
            BufferedReader in = openFile(file);
            String input = null;
            awaitingCount = 0;
            try {
                while ((input = in.readLine()) != null) {
                    router.tell(new Messages.ProcessLine(input), getSelf());
                    awaitingCount++;
                }
            } catch (IOException e) {
                e.printStackTrace();
                getContext().stop(getSelf());
            }
        }
    }

    private BufferedReader openFile(String name) {
        // do whatever to load the file
        ...
    }
}
And the LineProcessor:
public class LineProcessor extends UntypedActor {

    @Override
    public void onReceive(Object msg) throws Exception {
        if (msg instanceof ProcessLine) {
            ProcessLine pl = (ProcessLine) msg;
            // Do whatever line processing...
            getSender().tell(Messages.LINE_PROCESSED, getSelf());
        }
    }
}
Now the line processor sends a response back with no additional content. You could certainly change this if you needed to send something back based on the processing of the line. I'm sure this code is not bulletproof; I just wanted to show a high-level concept for how you could accomplish this flow without request/response semantics and Futures.
If you have any questions on this approach or want more detail, let me know and I'd be happy to provide it.
Use context.setReceiveTimeout on the routees to send a message back to the sender with a count of the messages processed. When the total messages processed equals the amount sent, you are finished.
If your routees will stay busy enough that setReceiveTimeout won't fire often enough, schedule your own messages to send the counts back.
