Multithreaded file processing and reporting - java

I have an application that processes data stored in a number of files from an input directory and then produces some output depending on that data.
So far, the application works sequentially, i.e. it launches a "manager" thread that:
- Reads the contents of the input directory into a File[] array
- Processes each file in sequence and stores results
- Terminates when all files are processed
I would like to convert this into a multithreaded application, in which the "manager" thread:
- Reads the contents of the input directory into a File[] array
- Launches a number of "processor" threads, each of which processes a single file, stores results and returns a summary report for that file to the "manager" thread
- Terminates when all files have been processed
The number of "processor" threads would be at most equal to the number of files, since they would be recycled via a ThreadPoolExecutor.
Any solution avoiding the use of join() or wait()/notify() would be preferable.
Based on the above scenario:
- What would be the best way of having those "processor" threads report back to the "manager" thread? Would an implementation based on Callable and Future make sense here?
- How can the "manager" thread know when all "processor" threads are done, i.e. when all files have been processed?
- Is there a way of "timing" a processor thread and terminating it if it takes "too long" (i.e., it hasn't returned a result despite the lapse of a pre-configured amount of time)?
Any pointers to, or examples of, (pseudo-)source code would be greatly appreciated.

You can definitely do this without using join() or wait()/notify() yourself.
You should take a look at java.util.concurrent.ExecutorCompletionService to start with.
The way I see it you should write the following classes:
- FileSummary: a simple value object that holds the result of processing a single file
- FileProcessor implements Callable<FileSummary>: the strategy for converting a file into a FileSummary result
- FileManager: the high-level manager that creates FileProcessor instances, submits them to a work queue and then aggregates the results
The FileManager would then look something like this:
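import java.io.File;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
class FileManager {
    private final CompletionService<FileSummary> cs; // e.g. an ExecutorCompletionService
    FileManager(CompletionService<FileSummary> cs) {
        this.cs = cs;
    }
    public FinalResult processDir(File dir) throws InterruptedException, ExecutionException {
        FinalResult result = new FinalResult(); // assumes FinalResult has a no-arg constructor
        File[] files = dir.listFiles(); // null if dir is not a readable directory
        int fileCount = (files == null) ? 0 : files.length;
        for (int i = 0; i < fileCount; i++) {
            cs.submit(new FileProcessor(files[i]));
        }
        for (int i = 0; i < fileCount; i++) {
            FileSummary summary = cs.take().get(); // blocks until the next file is done
            // aggregate summary into result
        }
        return result;
    }
}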
If you want to implement a timeout, you can use the poll(timeout, unit) method on CompletionService instead of take(); it returns null if no task completes within the given time.
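As a rough sketch of the poll() idea (not the answerer's own code): the helper below submits tasks to an ExecutorCompletionService, gathers whatever finishes before one overall deadline, and cancels the rest. The names TimedCollector and collect, and the parameters, are made up for illustration. Note that the deadline covers the whole batch rather than each file, and that cancel(true) only stops tasks that respond to interruption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: collects results that arrive before an overall deadline.
class TimedCollector {
    static <T> List<T> collect(List<Callable<T>> tasks, int threads, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CompletionService<T> cs = new ExecutorCompletionService<>(pool);
        List<Future<T>> futures = new ArrayList<>();
        for (Callable<T> task : tasks) {
            futures.add(cs.submit(task));
        }
        List<T> results = new ArrayList<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            for (int i = 0; i < tasks.size(); i++) {
                long remaining = deadline - System.nanoTime();
                // poll() returns null if nothing completes within the remaining time.
                Future<T> done = cs.poll(Math.max(remaining, 0), TimeUnit.NANOSECONDS);
                if (done == null) {
                    break; // deadline reached
                }
                try {
                    results.add(done.get());
                } catch (ExecutionException e) {
                    // The task itself failed; a real manager would record this.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            for (Future<T> f : futures) {
                f.cancel(true); // no effect on tasks that already completed
            }
            pool.shutdownNow();
        }
        return results;
    }
}
```

Results come back in completion order, so a slow file never delays the summaries of the fast ones.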

wait()/notify() are very low level primitives and you are right in wanting to avoid them.
The simplest solution would be to use a thread-safe queue (or stack, etc. -- it doesn't really matter in this case). Before starting the worker threads, your main thread can add all the Files to the thread-safe queue/stack. Then start the worker threads, and let them all pull Files and process them until there are none left.
The worker threads can add results to another thread-safe queue/stack, where the main thread can get them from. The main thread knows how many Files there were, so when it has retrieved the same number of results, it will know that the job is finished.
Something like a java.util.concurrent.BlockingQueue would work, and there are other thread-safe collections in java.util.concurrent which would also be fine.
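A minimal sketch of the two-queue idea, assuming the work function never throws and never returns null (otherwise the manager would wait forever for a missing result, or the result queue would reject the null). QueueWorkers and processAll are illustrative names, not anything from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical helper: one queue of work items, one queue of results.
class QueueWorkers {
    /** Processes every input on `workers` threads; results arrive in completion order. */
    static <I, O> List<O> processAll(List<I> inputs, int workers, Function<I, O> work) {
        BlockingQueue<I> todo = new LinkedBlockingQueue<>(inputs); // filled before workers start
        BlockingQueue<O> done = new LinkedBlockingQueue<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                I item;
                // poll() returns null once the work queue is empty, which ends the worker.
                while ((item = todo.poll()) != null) {
                    done.add(work.apply(item));
                }
            });
            t.setDaemon(true);
            t.start();
        }
        List<O> results = new ArrayList<>();
        try {
            // The manager knows how many results to expect, so no join() is needed.
            for (int i = 0; i < inputs.size(); i++) {
                results.add(done.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }
}
```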
You also asked about terminating worker threads which are taking too long. I will tell you right up front: if you can make the code which runs on the worker threads robust enough that you can safely leave this feature out, you will make things a lot simpler.
If you do need this feature, the simplest and most reliable solution is to have a per-thread "terminate" flag, and make the worker task code check that flag frequently and exit if it is set. Make a custom class for workers, and include a volatile boolean field for this purpose. Also include a setter method (because of volatile, it doesn't need to be synchronized).
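Such a worker class might look roughly like this. CancellableWorker, requestTerminate and the demo method are invented names, and the loop body is only a stand-in for real file processing.

```java
// Hypothetical worker with a per-instance "terminate" flag.
class CancellableWorker implements Runnable {
    // Written by the manager, read by the worker; volatile makes the write visible.
    private volatile boolean terminate = false;

    /** Safe to call from any thread; no synchronization needed for a single volatile write. */
    void requestTerminate() {
        terminate = true;
    }

    @Override
    public void run() {
        // One small unit of real processing would go inside this loop,
        // so the flag is checked frequently.
        while (!terminate) {
            Thread.yield();
        }
    }

    /** Demo: start a worker, ask it to stop, and report whether it exited. */
    static boolean stopsWhenAsked() {
        CancellableWorker worker = new CancellableWorker();
        Thread t = new Thread(worker);
        t.start();
        worker.requestTerminate();
        try {
            t.join(5000); // used here only to observe the outcome of the demo
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }
}
```

The flag only helps if the work is split into steps short enough that the check runs often; a worker stuck inside one long blocking call will not see it.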
If a worker discovers that its "terminate" flag is set, it could push its File object back on the work queue/stack so another thread can process it. Of course, if there is some problem which means the File cannot be successfully processed, this will lead to an infinite cycle.
The best is to make the worker code very simple and robust, so you don't need to worry about it "not terminating".

No need for them to report back. Just keep a count of the number of jobs remaining to be done and have each thread decrement that count (atomically) when it finishes a job.
When the count of remaining jobs reaches zero, all the "processor" threads are done.
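In Java, one ready-made form of such a counter is java.util.concurrent.CountDownLatch. The sketch below is only one possible reading of the suggestion; RemainingJobs and runAll are made-up names.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: the latch is the "jobs remaining" counter.
class RemainingJobs {
    /** Runs every job on a pool and returns true if all of them finished in time. */
    static boolean runAll(List<Runnable> jobs, int threads, long timeoutMillis) {
        CountDownLatch remaining = new CountDownLatch(jobs.size());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Runnable job : jobs) {
            pool.execute(() -> {
                try {
                    job.run();
                } finally {
                    remaining.countDown(); // decrement even if the job throws
                }
            });
        }
        pool.shutdown(); // accept no new work; queued jobs still run
        try {
            // Returns true once the count has reached zero, false on timeout.
            return remaining.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```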
Sure, just add that code to the thread. When it starts working, check the time and compute the stop time. Periodically (say when you go to read more from the file), check to see if it's past the stop time and, if so, stop.
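The stop-time check could be sketched like this, with an Iterator standing in for "reading more from the file". DeadlineLoop and consume are hypothetical names.

```java
import java.util.Iterator;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper: a processing loop that gives up once its time budget is spent.
class DeadlineLoop {
    /** Returns the number of items handled before the items ran out or time was up. */
    static <T> int consume(Iterator<T> items, long budgetMillis, Consumer<T> handler) {
        // Compute the stop time once, when work starts.
        long stopAt = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMillis);
        int handled = 0;
        while (items.hasNext()) {
            // Periodic check; the subtraction form stays correct if nanoTime wraps.
            if (System.nanoTime() - stopAt >= 0) {
                break; // past the stop time: leave the rest unprocessed
            }
            handler.accept(items.next());
            handled++;
        }
        return handled;
    }
}
```

The check only runs between items, so a single item that hangs will still overrun the budget.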

Related

Need a solution for a Java method that returns a String value to be executed by only n threads in the JVM

A Java method that returns a string (a file name) consumes memory internally for a few image operations that cannot be optimized further; let's say it consumes 20 MB of heap space per execution.
This method runs as part of ProcessingImageData and needs to return the file name as output to the REST web service caller.
When some n threads run it in parallel, it fails with an OutOfMemoryError.
To avoid running out of heap space, can you please provide your suggestions, such as
allowing only a fixed number of threads to execute this method?
public String deleteImageAndProvideFile(String inputImage, int deletePageNum){
// process image
//find page and delete
//merge pages to new file
// return new file Name
}
If you have a number of tasks but you want to limit the number of threads performing them, use an ExecutorService with a bounded thread pool.
The Executors class has a helper method for creating what you need:
newFixedThreadPool(int nosThreads) (javadoc).
Adjust the nosThreads parameter according to how much memory you want to use.
The ExecutorService documentation explains how to use the API (javadoc). You submit tasks and get Future objects that can be used to wait until a given task is finished.
In your use-case, one of your web requests might submit a task to a "global" executor service and then wait for the task to complete. Alternatively you could design your system so that the processing is done asynchronously with the web requests; e.g. submit a task in one request, and then make another request to see if it has been done yet.
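A sketch of the first option, where the request thread submits to a shared fixed pool and waits. ImageService, MAX_CONCURRENT and the placeholder process method are assumptions for illustration; the real image work from the question would replace process.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical service: at most MAX_CONCURRENT image operations run at once.
class ImageService {
    // Assumed limit: at roughly 20 MB per call, 4 threads cap this work near 80 MB.
    private static final int MAX_CONCURRENT = 4;
    private final ExecutorService pool = Executors.newFixedThreadPool(MAX_CONCURRENT);

    /** Called by request threads; blocks until a pool thread has done the work. */
    String deleteImageAndProvideFile(String inputImage, int deletePageNum) {
        Future<String> result = pool.submit(() -> process(inputImage, deletePageNum));
        try {
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("image processing failed", e.getCause());
        }
    }

    // Placeholder for the memory-hungry image work described in the question.
    private String process(String inputImage, int deletePageNum) {
        return inputImage + "-without-page-" + deletePageNum;
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

Extra requests simply queue up inside the executor, so they cost waiting time rather than heap space for image buffers.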

Is a "Chain of Threads" a bad solution for this Java application?

I'm running a program where I download large files, parse them and then write the data I have extracted from the file into another file.
The files take a long time to download and parse but the write task only takes a minute or so on average. The solution I threw together was to have three fixed thread pools of three threads each.
ExecutorService downloadExecutor = Executors.newFixedThreadPool(3);
ExecutorService parseExecutor = Executors.newFixedThreadPool(3);
ExecutorService writeExecutor = Executors.newFixedThreadPool(3);
A thread in the download pool downloads the file, then submits a new thread to the parser threadpool, with the filename as a parameter. This is done within the thread itself. The download thread then gets to work downloading another file from a list of file URLs.
Once the parser thread has finished parsing the data I want from the file, it then submits a new thread containing the data to the write threadpool, where it is then written to a .csv file.
My question is if there is a more elegant solution to this. I have not really done much complex threading. Since I have a lot of files to download and parse, I do not want any of the threads being idle at any time. The idea, again, is that since parsing a file can take a while, I might as well make separate threads devoted to downloading those files first.
Why not use only one thread pool? Download, parse and save must wait for each other anyway, so the best separation of tasks would be to use one thread per file.
This is not a bad practice, as many developers write similar code. But there are some things you need to keep in mind.
Number One: you can't expect performance to increase just because you have more threads. There is an optimum number of threads, based on the number of CPUs.
Number Two: you must make sure exceptions are handled.
Number Three: you must make sure you can shut down all the thread pools in the event that you need to stop the application.
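For the third point, a common two-phase shutdown (modelled on the pattern shown in the ExecutorService javadoc) could look like this; PoolShutdown and shutdownAndAwait are illustrative names. With three pools in a chain, it would presumably be called on the download pool first, then the parse pool, then the write pool, so that upstream tasks stop producing work before downstream pools close.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: orderly shutdown first, forced shutdown as a fallback.
class PoolShutdown {
    /** Stops a pool in two phases; returns true if it terminated. */
    static boolean shutdownAndAwait(ExecutorService pool, long waitMillis) {
        pool.shutdown(); // phase 1: reject new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(waitMillis, TimeUnit.MILLISECONDS)) {
                pool.shutdownNow(); // phase 2: interrupt running tasks, drop queued ones
                return pool.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);
            }
            return true;
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```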
So your problem has two aspects:
Compute bound
IO bound
Reading and writing to the file is IO bound. Async IO is the best for IO bound tasks. Java has AsynchronousFileChannel that allows you to read and write files without worrying about thread pools where continuation is achieved through completion handlers.
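Example:
AsynchronousFileChannel ch = AsynchronousFileChannel.open(path);
final ByteBuffer buf = ByteBuffer.allocate(1024);
// Arguments: destination buffer, file position, attachment, completion handler
ch.read(buf, 0, 0,
    new CompletionHandler<Integer, Integer>() {
        @Override
        public void completed(Integer result, Integer attachment) {
            // result is the number of bytes read, or -1 at end of file
        }
        @Override
        public void failed(Throwable exc, Integer attachment) {
            // handle the I/O error
        }
    }
);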
And you do the same for writes, you just write to the channel
ch.write(...
Now for parsing the file: that's a compute-bound task, and you should get your CPU cores hot for that. You can assign a thread pool with a size equal to the number of cores you have.
executorService = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors())
Now what you need to remember is: you need to test your code, and testing concurrent code is hard. If you can't prove its correctness, don't do it.

Reading huge file in Java

I read a huge file (almost 5 million lines). Each line contains a date and a request, and I must parse the requests between specific dates. I use a BufferedReader to read the file up to the start date and then start parsing lines. Can I use threads for parsing the lines, because it takes a lot of time?
It isn't entirely clear from your question, but it sounds like you are reparsing your 5 million-line file every time a client requests data. You certainly can solve the problem by throwing more threads and more CPU cores at it, but a better solution would be to improve the efficiency of your application by eliminating duplicate work.
If this is the case, you should redesign your application to avoid reparsing the entire file on every request. Ideally you should store data in a database or in-memory instead of processing a flat text file on every request. Then on a request, look up the information in the database or in-memory data structure.
If you cannot eliminate the 5 million-line file entirely, you can periodically recheck the large file for changes, skip/seek to the end of the last record that was parsed, then parse only new records and update the database or in-memory data structure. This can all optionally be done in a separate thread.
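One way to "skip/seek to the end of the last record" is to remember a byte offset between calls. The sketch below uses RandomAccessFile and assumes writers only ever append whole lines; IncrementalReader and readNewLines are made-up names. RandomAccessFile.readLine() is unbuffered and treats each byte as one character, so this is meant to show the idea rather than to be fast or encoding-aware.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader that parses only what was appended since the last call.
class IncrementalReader {
    private final String path;
    private long offset = 0; // byte position where the previous read stopped

    IncrementalReader(String path) {
        this.path = path;
    }

    /** Returns the lines appended since the previous call (assumes whole-line appends). */
    synchronized List<String> readNewLines() {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offset); // skip everything already processed
            String line;
            while ((line = file.readLine()) != null) {
                lines.add(line);
            }
            offset = file.getFilePointer(); // remember where to resume next time
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

A background thread could call readNewLines() on a schedule and feed the result into the database or in-memory structure.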
Firstly, 5 million lines of 1000 characters is only 5 GB, which is not necessarily prohibitive for a JVM. If this is actually a critical use case with lots of hits then buying more memory is almost certainly the right thing to do.
Secondly, if that is not possible, most likely the right thing to do is to build an ordered Map based on the date. So every date is a key in the map and points to a list of line numbers which contain the requests. You can then go direct to the relevant line numbers.
Something of the form
TreeMap<Date, ArrayList<Integer>>
would do nicely. That should have a memory usage of order 5,000,000*32/8 bytes = 20 MB, which should be fine.
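A sketch of such an ordered index, assuming java.time.LocalDate keys instead of Date (my substitution, to keep the example short) and a TreeMap so that a date range becomes a single subMap view. DateIndex and its methods are hypothetical names.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical index from date to the line numbers holding requests for that date.
class DateIndex {
    // Sorted by date, so a date range maps to one contiguous view of the map.
    private final NavigableMap<LocalDate, List<Integer>> linesByDate = new TreeMap<>();

    void add(LocalDate date, int lineNumber) {
        linesByDate.computeIfAbsent(date, d -> new ArrayList<>()).add(lineNumber);
    }

    /** Line numbers of all requests dated between from and to, both inclusive. */
    List<Integer> linesBetween(LocalDate from, LocalDate to) {
        List<Integer> result = new ArrayList<>();
        for (List<Integer> lines : linesByDate.subMap(from, true, to, true).values()) {
            result.addAll(lines);
        }
        return result;
    }
}
```

This instance is not thread-safe; if the index is built by one thread and queried by others, it would need to be published safely or guarded by a lock.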
You could also use the FileChannel class to keep the I/O handle open as you go jumping from one line to a different line. This allows Memory Mapping.
See http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html
And http://en.wikipedia.org/wiki/Memory-mapped_file
A good way to parallelize a lot of small tasks is to wrap the processing of each task with a FutureTask and then pass each task to a ThreadPoolExecutor to run them. The executor should be initialized with the number of CPU cores your system has available.
When you call executor.execute(future), the future will be queued for background processing. To avoid creating and destroying too many threads, the ThreadPoolExecutor will only create as many threads as you specified and execute the futures one after another.
To retrieve the result of a future, call future.get(). If the future hasn't completed yet (or hasn't even started yet), this method will block until it is completed. But other futures get executed in the background while you wait.
Remember to call executor.shutdown() when you don't need it anymore, to make sure it terminates the background threads it otherwise keeps around until the keepalive time has expired or it is garbage-collected.
tl;dr pseudocode:
create executor
for each line in file
create new FutureTask which parses that line
pass future task to executor
add future task to a list
for each entry in task list
call entry.get() to retrieve result
executor.shutdown()
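The pseudocode above might translate to Java roughly as follows. LineParser and parseAll are invented names, and the parse function is supplied by the caller. With millions of lines, one task per line is probably too fine-grained, so a real version would likely hand each task a batch of lines instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.function.Function;

// Hypothetical helper: one FutureTask per line, run on a pool sized to the CPU count.
class LineParser {
    /** Parses every line in the background; results keep the input order. */
    static <R> List<R> parseAll(List<String> lines, Function<String, R> parse) {
        ExecutorService executor =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<FutureTask<R>> tasks = new ArrayList<>();
        try {
            for (String line : lines) {
                FutureTask<R> task = new FutureTask<>(() -> parse.apply(line));
                executor.execute(task); // queued for background processing
                tasks.add(task);
            }
            List<R> results = new ArrayList<>();
            for (FutureTask<R> task : tasks) {
                results.add(task.get()); // blocks until this particular line is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while parsing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a line failed to parse", e.getCause());
        } finally {
            executor.shutdown();
        }
    }
}
```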

Java multithreading and files

I'm working on a project. One part of it reads the files in given folders.
The program descends into the directory tree and collects file names and other info, which I wrap in my own DFile class and put into a collection for further work.
It worked when it was single-threaded (using a recursive read), but I want to make it multithreaded, ignoring the fact that multithreading won't speed up disk I/O. I want it for learning purposes.
So far I've been jumping from one design to another, changing plans, and can't get it right. Your help would be appreciated.
What I want is to supply a root folder name and have my program run several mini-threads (a user-defined number of them), where each thread reads the contents of a given folder:
- When it finds file, wraps it into DFile and puts into shared between threads collection
- When it finds folder, puts folder (as File object) into jobQueue, for other available thread to take work on it.
I can't get this system to work correctly. I've kept changing the code and my idea of what the classes should be, from one class with static collections to many classes.
So far these are the few classes I am listing here:
DirectoryCrawler http://pastebin.com/8tVGpGT9
I won't publish the rest of my work (maybe in another topic, because the overall purpose of the program is not covered here). The program should read a folder and make a list of the files in it, then sort it (where I'll probably use multithreading too), then search for files with the same hash, while a constantly running thread writes those groups of equal files into a result file. I don't need to gain any performance; the files are going to be small. At first I was working on speed, but I don't need it now.
Any help regarding design of reading would be appreciated
EDIT:
Such a headache :(( It doesn't work correctly :( Here is what I have so far:
Crawler (like a mini-thread for reading one folder; found files go to fileList, which is in another class, and folders go to the queue): pastebin.com/AkJLAUhD
Scanner class (I don't even know whether it should be Runnable or not). DirectoryScanner (main; should control the crawlers and hold the main file list): pastebin.com/2abGMgG9
DFile itself: pastebin.com/8uqPWh6Z (something went wrong with hashing; now when sorting, all files get the same hash, although it worked before; the hashing is for another, unrelated task)
FileList: pastebin.com/Q2yM6ZwS
testcode:
DirectoryScanner reader = new DirectoryScanner(4);
for (int i = 0; i < 4; i++) {
    reader.runTask(new DirectoryCrawler("myroot", reader));
}
try {
    reader.kill();
    while (!reader.isDone()) {
        System.out.println("notdone");
    }
    reader.getFileList().print();
} catch (Exception e) { // the original snippet had no catch block
    e.printStackTrace();
}
myroot is a folder with some files for test
In any case, I can't even decide whether the scanner itself should be Runnable, or only the crawlers, because while scanning I don't actually want to start doing other things like sorting (there is nothing to sort until all the files have been gathered).
You need the Executor threadpool and some classes:
An Fsearch class. This contains your container for the results. It also has a factory method that returns an Ffolder, counting up a 'foldersOutstanding' counter, and an OnComplete that counts them back in by counting down 'foldersOutstanding'.
You need an Ffolder class to represent a folder; it is passed its path as a ctor parameter, along with the Fsearch instance. It should have a run method that iterates over that folder.
Create and load up an Fsearch with the root folder and fire it into the pool. It creates an Ffolder, passing its root path and itself, and loads that on. Then it waits on a 'searchComplete' event.
That first Ffolder iterates its folder, creating (or depooling) DFiles for each 'ordinary' file and pushing them into the Fsearch container. If it finds a folder, it gets another Ffolder from the Fsearch, loads it with the new path and loads that onto the pool as well.
When an Ffolder has finished iterating its own folder, it calls the 'OnComplete' method of the Fsearch. The OnComplete counts down 'foldersOutstanding' and, when it is decremented to zero, all the folders have been scanned and all the files processed. The thread that did this final decrement signals the searchComplete event so that the Fsearch can continue. The Fsearch could then call some 'OnSearchComplete' event that it was passed when it was created.
It goes almost without saying that the Fsearch callbacks must be thread-safe.
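In Java, the design described above might be sketched like this, with an AtomicInteger as the 'foldersOutstanding' counter and a CountDownLatch as the 'searchComplete' event. FolderSearch and its methods are made-up names, plain File objects stand in for DFile, and the calling thread (not a pool thread) does the waiting. Symbolic-link cycles are not handled.

```java
import java.io.File;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical Java take on the Fsearch/Ffolder design; one instance per search.
class FolderSearch {
    private final ExecutorService pool;
    private final Queue<File> files = new ConcurrentLinkedQueue<>(); // shared results
    private final AtomicInteger foldersOutstanding = new AtomicInteger();
    private final CountDownLatch searchComplete = new CountDownLatch(1);

    FolderSearch(int threads) {
        pool = Executors.newFixedThreadPool(threads);
    }

    /** Scans root and everything below it; blocks the caller until the scan is over. */
    Queue<File> search(File root) {
        submitFolder(root);
        try {
            searchComplete.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return files;
    }

    // Count the folder in before queueing it, so the counter cannot hit zero early.
    private void submitFolder(File folder) {
        foldersOutstanding.incrementAndGet();
        pool.execute(() -> scan(folder));
    }

    private void scan(File folder) {
        try {
            File[] entries = folder.listFiles(); // null if unreadable or not a directory
            if (entries != null) {
                for (File entry : entries) {
                    if (entry.isDirectory()) {
                        submitFolder(entry); // another pool thread will pick it up
                    } else {
                        files.add(entry);
                    }
                }
            }
        } finally {
            // The task that takes the counter to zero signals completion.
            if (foldersOutstanding.decrementAndGet() == 0) {
                searchComplete.countDown();
            }
        }
    }
}
```

Because each subfolder is counted in before its parent is counted out, the counter can only reach zero once every folder has been scanned.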
Such an exercise does not have to be academic.
The container in the Fsearch, where all the DFiles go, could be a producer-consumer queue. Other threads could start processing the DFiles as the search is in progress, instead of waiting until the end.
I have done this before, (but not in Java), - it works OK. A design like this can easily do multiple searches in parallel - it's fun to issue an Fsearch for several hard drive roots at once - the clattering noise is impressive
Forgot to say - the big gain from such a design is when searching several networked drives with high latency. They can all be searched in parallel. The speedup over a miserable single-threaded sequential search is many times. By the time a single-thread search has finished queueing up the DFiles for one drive, the multi-search has searched four drives and already had most of its DFiles processed.
NOTE:
1) If implemented strictly as above, the threadpool thread that executes the Fsearch is blocked on the 'searchComplete' event until the search is over, so 'using up' one thread. There must therefore be more threadpool threads than live Fsearch instances else there will be no threads left over to do the actual searching, (yes, of course that happened to me:).
2) Unlike a single-thread search, results don't come back in any sort of predictable or repeatable order. If, for example, you signal your results as they come in to a GUI thread and try to display them in a TreeView, the path through the treeview component will likely be different for each result, updating the visual treeview will be lengthy. This can result in the Windows GUI input queue getting full, (10000 limit), because the GUI cannot keep up or, if using object pools for the Ffolder etc, the pool can empty, slugging performance and, if the GUI thread tries to get an Ffolder to issue a new search from the empty pool and so blocks, all-round deadlock with all Ffolder instances stuck in Windows messages, (yes, of course that happened to me:). It's best to not let such things happen!
Example - something like this I found - it's quite old Windows/C++ Builder code but it still works - I tried it on my Rad Studio 2009, removed all the legacy/proprietary gunge and added some extra comments. All it does here is count up the folders and files, just as an example. There are only a couple of 'runnable' classes. The myPool->submit() method loads a runnable onto the pool and its run() method gets executed. The base ctor has an 'OnComplete' EventHandler (TNotifyEvent) delegate parameter - that gets fired by the pool thread when the run() method returns.
//******************************* CLASSES ********************************
class DirSearch; // forward dec.
class ScanDir:public PoolTask{
String FmyDirPath;
DirSearch *FmySearch;
TStringList *filesAndFolderNames;
public: // Counts for FmyDirPath only
int fileCount,folderCount;
ScanDir(String thisDirPath,DirSearch *mySearch);
void run(); // an override - called by pool thread
};
class DirSearch:public PoolTask{
TNotifyEvent FonComplete;
int dirCount;
TEvent *searchCompleteEvent;
CRITICAL_SECTION countLock;
public:
String FdirPath;
int totalFileCount,totalFolderCount; // Count totals for all ScanDir's
DirSearch(String dirPath, TNotifyEvent onComplete);
ScanDir* getScanDir(String path); // get a ScanDir and inc's count
void run(); // an override - called by pool thread
void __fastcall scanCompleted(TObject *Sender); // called by ScanDir's
};
//******************************* METHODS ********************************
// ctor - just calls base ctor and initializes stuff..
ScanDir::ScanDir(String thisDirPath,DirSearch *mySearch):FmySearch(mySearch),
FmyDirPath(thisDirPath),fileCount(0),folderCount(0),
PoolTask(0,mySearch->scanCompleted){};
void ScanDir::run() // an override - called by pool thread
{
// fileCount=0;
// folderCount=0;
filesAndFolderNames=listAllFoldersAndFiles(FmyDirPath); // gets files
for (int index = 0; index < filesAndFolderNames->Count; index++)
{ // for all files in the folder..
if((int)filesAndFolderNames->Objects[index]&faDirectory){
folderCount++; //do count and, if it's a folder, start another ScanDir
String newFolderPath=FmyDirPath+"\\"+filesAndFolderNames->Strings[index];
ScanDir* newScanDir=FmySearch->getScanDir(newFolderPath);
myPool->submit(newScanDir);
}
else fileCount++; // inc 'ordinary' file count
}
delete(filesAndFolderNames); // don't leak the TStringList of filenames
};
DirSearch::DirSearch(String dirPath, TNotifyEvent onComplete):FdirPath(dirPath),
FonComplete(onComplete),totalFileCount(0),totalFolderCount(0),dirCount(0),
PoolTask(0,onComplete)
{
InitializeCriticalSection(&countLock); // thread-safe count
searchCompleteEvent=new TEvent(NULL,false,false,"",false); // an event
// for DirSearch to wait on till all ScanDir's done
};
ScanDir* DirSearch::getScanDir(String path)
{ // up the dirCount while providing a new ScanDir
EnterCriticalSection(&countLock);
dirCount++;
LeaveCriticalSection(&countLock);
return new ScanDir(path,this);
};
void DirSearch::run() // called on pool thread
{
ScanDir *firstScanDir=getScanDir(FdirPath); // get first ScanDir for top
myPool->submit(firstScanDir); // folder and set it going
searchCompleteEvent->WaitFor(INFINITE); // wait for them all to finish
}
/* NOTE - this is a DirSearch method, but it's called by the pool threads
running the ScanDirs when they complete. The 'DirSearch' pool thread is stuck
on the searchCompleteEvent, waiting for all the ScanDirs to complete, at which
point the dirCount will be zero and the searchCompleteEvent signalled.
*/
void __fastcall DirSearch::scanCompleted(TObject *Sender){ // a ScanDir is done
ScanDir* thiscan=(ScanDir*)Sender; // get the instance that completed back
EnterCriticalSection(&countLock); // thread-safe
totalFileCount+=thiscan->fileCount; // add ScanDir counts to totals
totalFolderCount+=thiscan->folderCount;
dirCount--; // another one gone..
bool allDone=(dirCount==0); // read the count while still holding the lock
LeaveCriticalSection(&countLock);
if(allDone) searchCompleteEvent->SetEvent(); // if all done, signal
delete(thiscan); // another one bites the dust..
};
..and here it is, working:
If you want to learn some multi-threading by doing some practical implementation, it would be best to pick something where the switch from a single-threaded activity to a multi-threaded one would actually make sense.
In this case, it doesn't make any sense. And achieving it would require some ugly pieces of code to be written. That is because you could have, for example, one thread handle just one subfolder (first level after the root folder). But what if you start with 200 subfolders? Or more... Will 200 threads make sense in that case? I doubt it...

Directory naming using Process ID and Thread ID

I have an application with a few threads that manipulate data and save the output in different temporary files on a particular directory, in a Linux or a Windows machine. These files eventually need to be erased.
What I want to do is to be able to better separate the files, so I am thinking of doing this by Process ID and Thread ID. This will help the application save disk space because, upon termination of a thread, the whole directory with that thread's files can be erased and leave the rest of the application reuse the corresponding disk space.
Since the application runs on a single instance of the JVM, I assume it will have a single Process ID, which will be that of the JVM, right?
That being the case, the only way to discriminate among these files is to save them in a folder, the name of which will be related to the Thread ID.
Is this approach reasonable, or should I be doing something else?
java.io.File can create temporary files for you. As long as you keep a list of those files associated with each thread, you can delete them when the thread exits. You can also mark the files to delete on exit in case a thread does not complete.
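For example, each thread or task could own a small tracker like the one below. TempFiles, create and deleteAll are hypothetical names; File.createTempFile and deleteOnExit are the standard java.io.File methods referred to above.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-thread (or per-task) tracker for temporary files.
class TempFiles {
    private final List<File> created = new ArrayList<>(); // owned by one thread or task

    /** Creates a uniquely named temp file and remembers it for later cleanup. */
    File create(String prefix) throws IOException {
        File f = File.createTempFile(prefix, ".tmp"); // prefix needs at least 3 characters
        f.deleteOnExit(); // safety net if deleteAll() is never reached
        created.add(f);
        return f;
    }

    /** Deletes every file created through this instance; returns how many were removed. */
    int deleteAll() {
        int deleted = 0;
        for (File f : created) {
            if (f.delete()) {
                deleted++;
            }
        }
        created.clear();
        return deleted;
    }
}
```

Calling deleteAll() in a finally block at the end of the thread's run() method frees the disk space as soon as that thread is done.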
It seems the simplest solution for this approach is really to extend Thread - never thought I'd see that day.
As P.T. already said Thread IDs are only unique as long as the thread is alive, they can and most certainly will be reused by the OS.
So instead of doing it this way, you use the Thread name that can be specified at construction and to make it simple, just write a small class:
public class MyThread extends Thread {
private static long ID = 0;
public MyThread(Runnable r) {
super(r, getNextName());
}
private static synchronized String getNextName() {
// We can get rid of synchronized with some AtomicLong and so on,
// doubt that's necessary though
return "MyThread " + ID++;
}
}
Then you can do something like this:
public static void main(String[] args) throws InterruptedException {
Thread t = new MyThread(new Runnable() {
@Override
public void run() {
System.out.println("Name: " + Thread.currentThread().getName());
}
});
t.start();
}
You have to redeclare every constructor you want to use and always use the MyThread class, but this way you can guarantee a unique mapping - well, for at least 2^64-1 threads (negative values are fine too after all), which should be more than enough.
Though I still don't think that's the best approach, possibly better to create some "job" class that contains all necessary information and can clean up its files as soon as it's no longer needed - that way you also can easily use ThreadPools and co where one thread will do more than one job. At the moment you have business logic in a thread - that doesn't strike me as especially good design.
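Such a "job" class could be sketched as follows: each job gets its own uniquely named temporary directory from Files.createTempDirectory, so no thread or process ID is needed in the name, and close() removes the directory when the job is done. TempDirJob, newFile and directory are made-up names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical job that owns a private temp directory and cleans it up itself.
class TempDirJob implements AutoCloseable {
    private final Path dir;

    TempDirJob(String name) {
        try {
            dir = Files.createTempDirectory(name); // unique even if names repeat
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Path for a file inside this job's directory (the file is not created here). */
    Path newFile(String fileName) {
        return dir.resolve(fileName);
    }

    Path directory() {
        return dir;
    }

    /** Removes the job's directory and everything in it. */
    @Override
    public void close() {
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order deletes children before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Used in a try-with-resources block inside a pool task, the directory disappears as soon as the job finishes, whichever thread happened to run it.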
You're correct, the JVM has one process ID, and all threads in that JVM will share the process id. (It is possible for a JVM to use multiple processes, but AFAIK, no JVM does that.)
A JVM may very well re-use underlying OS threads for multiple Java threads, so there is no guaranteed correlation between a thread exiting in Java and anything similar happening at the OS level.
If you just need to cleanup stale files, sorting the files by their creation timestamp should do the job sufficiently? No need to encode anything special at all in the temporary file names.
Note that PIDs and TIDs are neither guaranteed to be increasing, nor guaranteed to be unique across exits. The OS is free to recycle an ID. (In practice the IDs have to wrap around before re-use, but on some machines that can happen after only 32k or 64k processes have been created.)
