Java multithreading and files

I'm working on a project. One part of it reads the files in a given folder.
The program walks the directory tree, collects file names and other info, which I wrap in my own DFile class, and puts them into a collection for further work.
It worked when it was single-threaded (using a recursive read), but I want to do it with multiple threads, ignoring the fact that disk I/O and multithreading won't increase performance. I want it for learning purposes.
So far I've been jumping from one decision to another, changing plans for how it should work, and I can't get it right. Your help would be appreciated.
What I want is to supply a root folder name and have my program run several small worker threads (a user-defined number of threads for this purpose), each of which reads the contents of a given folder:
- When it finds a file, it wraps it in a DFile and puts it into a collection shared between the threads.
- When it finds a folder, it puts that folder (as a File object) into a job queue, for another available thread to work on.
I can't get this system right. I've been changing the code back and forth, moving from one class with static collections to many classes.
The few classes I have so far:
DirectoryCrawler http://pastebin.com/8tVGpGT9
I won't publish the rest of my work (maybe in another topic, because the purpose of the program isn't really covered here). The program should read a folder and make a list of the files in it, then sort that list (where I'll probably use multithreading too), then search for files with identical hashes; there is also a constantly running thread that writes those groups of equal files to a result file. I don't need to gain any performance; the files are going to be small. I worked on speed at first, but I don't need it now.
Any help regarding the design of the reading would be appreciated.
EDIT:
So much headache. It doesn't work correctly. Here is what I have so far:
Crawler (like a mini-thread for reading one folder; found files go to a fileList that lives in another class, and folders go to the queue): pastebin.com/AkJLAUhD
Scanner class (I don't even know whether it should be Runnable or not). DirectoryScanner (the main class; it should control the crawlers and hold the main file list): pastebin.com/2abGMgG9
DFile itself: pastebin.com/8uqPWh6Z (something went wrong with the hashing; now when sorting, everything gets the same hash. It worked before. The hashing is for another, unrelated task.)
FileList: pastebin.com/Q2yM6ZwS
Test code:
DirectoryScanner reader = new DirectoryScanner(4);
for (int i = 0; i < 4; i++) {
    reader.runTask(new DirectoryCrawler("myroot", reader));
}
try {
    reader.kill();
    while (!reader.isDone()) {
        System.out.println("notdone");
    }
    reader.getFileList().print();
} catch (Exception e) {
    e.printStackTrace();
}
myroot is a folder with some files for testing.
I can't even decide whether the scanner itself should be Runnable, or only the crawlers, because while scanning I don't actually want to start doing other things like sorting (there is nothing to sort until all the files are gathered).

You need the Executor thread pool and a couple of classes:
An Fsearch class. This contains your container for the results. It also has a factory method that returns an Ffolder, counting up a 'foldersOutstanding' counter, and an OnComplete method that counts them back in by counting down 'foldersOutstanding'.
You need an Ffolder class to represent a folder; it is passed its path as a ctor parameter, along with the Fsearch instance. It should have a run method that iterates that folder path.
Create and load up an Fsearch with the root folder and fire it into the pool. It creates a folder class, passing it the root path and itself, and loads that on. Then it waits on a 'searchComplete' event.
That first Ffolder iterates its folder, creating (or depooling) DFiles for each 'ordinary' file and pushing them into the Fsearch container. If it finds a folder, it gets another Ffolder from the Fsearch, loads it with the new path and loads that onto the pool as well.
When an Ffolder has finished iterating its own folder, it calls the 'OnComplete' method of the Fsearch. The OnComplete counts down 'foldersOutstanding' and, when it reaches zero, all the folders have been scanned and all the files processed. The thread that did the final decrement signals the searchComplete event so that the Fsearch can continue. The Fsearch could then call some 'OnSearchComplete' event that it was passed when it was created.
It goes almost without saying that the Fsearch callbacks must be thread-safe.
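Here is a minimal Java sketch of that design, using the Fsearch/Ffolder names from above (the choice of ExecutorService, AtomicInteger and CountDownLatch is mine; the description above phrases the same counting in terms of events):

import java.io.File;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class Fsearch {
    private final ExecutorService pool;
    private final Queue<File> results = new ConcurrentLinkedQueue<>(); // DFiles would go here
    private final AtomicInteger foldersOutstanding = new AtomicInteger();
    private final CountDownLatch searchComplete = new CountDownLatch(1);

    Fsearch(ExecutorService pool) { this.pool = pool; }

    void submitFolder(File dir) {              // factory method: count the folder out...
        foldersOutstanding.incrementAndGet();
        pool.submit(new Ffolder(dir, this));
    }

    void addResult(File f) { results.add(f); } // must be thread-safe

    void onComplete() {                        // ...and count it back in
        if (foldersOutstanding.decrementAndGet() == 0) {
            searchComplete.countDown();        // last folder done: release the waiter
        }
    }

    Queue<File> awaitResults() throws InterruptedException {
        searchComplete.await();
        return results;
    }
}

class Ffolder implements Runnable {
    private final File dir;
    private final Fsearch search;

    Ffolder(File dir, Fsearch search) { this.dir = dir; this.search = search; }

    public void run() {
        File[] entries = dir.listFiles();
        if (entries != null) {
            for (File f : entries) {
                if (f.isDirectory()) search.submitFolder(f); // a new job for another thread
                else search.addResult(f);                    // an 'ordinary' file
            }
        }
        search.onComplete();
    }
}

public class FsearchDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Fsearch search = new Fsearch(pool);
        search.submitFolder(new File("myroot")); // the root folder from the question
        System.out.println(search.awaitResults().size() + " files found");
        pool.shutdown();
    }
}

One deliberate deviation: the wait happens on the calling thread rather than on a pool thread, which sidesteps the thread-starvation issue described in note 1) further down.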
Such an exercise does not have to be academic.
The container in the Fsearch, where all the DFiles go, could be a producer-consumer queue. Other threads could then start processing the DFiles while the search is still in progress, instead of waiting until the end.
I have done this before (though not in Java) and it works OK. A design like this can easily run multiple searches in parallel; it's fun to issue an Fsearch for several hard drive roots at once - the clattering noise is impressive.
Forgot to say: the big gain from such a design comes when searching several networked drives with high latency. They can all be searched in parallel, and the speedup over a miserable single-threaded sequential search is many times. By the time a single-threaded search has finished queueing up the DFiles for one drive, the multi-search has searched four drives and already had most of its DFiles processed.
NOTE:
1) If implemented strictly as above, the threadpool thread that executes the Fsearch is blocked on the 'searchComplete' event until the search is over, so it 'uses up' one thread. There must therefore be more threadpool threads than live Fsearch instances, or there will be no threads left over to do the actual searching (yes, of course that happened to me :).
2) Unlike a single-threaded search, results don't come back in any predictable or repeatable order. If, for example, you signal your results as they come in to a GUI thread and try to display them in a TreeView, the path through the treeview component will likely be different for each result, and updating the visual treeview will be lengthy. This can result in the Windows GUI input queue filling up (the 10,000-message limit) because the GUI cannot keep up. Or, if you use object pools for the Ffolders etc., a pool can empty, slugging performance, and if the GUI thread then tries to get an Ffolder from the empty pool to issue a new search, and so blocks, you get an all-round deadlock with all the Ffolder instances stuck in Windows messages (yes, of course that happened to me :). It's best not to let such things happen!
Example - something like this that I found. It's quite old Windows/C++ Builder code, but it still works; I tried it in my RAD Studio 2009, removed all the legacy/proprietary gunge and added some extra comments. All it does here is count up the folders and files, just as an example. There are only a couple of 'runnable' classes. The myPool->submit() method loads a runnable onto the pool, and its run() method gets executed. The base ctor has an 'OnComplete' event handler (TNotifyEvent) delegate parameter, which gets fired by the pool thread when the run() method returns.
//******************************* CLASSES ********************************
class DirSearch; // forward dec.

class ScanDir:public PoolTask{
    String FmyDirPath;
    DirSearch *FmySearch;
    TStringList *filesAndFolderNames;
public: // Counts for FmyDirPath only
    int fileCount,folderCount;
    ScanDir(String thisDirPath,DirSearch *mySearch);
    void run(); // an override - called by pool thread
};

class DirSearch:public PoolTask{
    TNotifyEvent FonComplete;
    int dirCount;
    TEvent *searchCompleteEvent;
    CRITICAL_SECTION countLock;
public:
    String FdirPath;
    int totalFileCount,totalFolderCount; // Count totals for all ScanDirs
    DirSearch(String dirPath, TNotifyEvent onComplete);
    ScanDir* getScanDir(String path); // get a ScanDir and inc the count
    void run(); // an override - called by pool thread
    void __fastcall scanCompleted(TObject *Sender); // called by ScanDirs
};
//******************************* METHODS ********************************
// ctor - just calls base ctor and initializes stuff..
ScanDir::ScanDir(String thisDirPath,DirSearch *mySearch):FmySearch(mySearch),
    FmyDirPath(thisDirPath),fileCount(0),folderCount(0),
    PoolTask(0,mySearch->scanCompleted){};

void ScanDir::run() // an override - called by pool thread
{
    filesAndFolderNames=listAllFoldersAndFiles(FmyDirPath); // gets files
    for (int index = 0; index < filesAndFolderNames->Count; index++)
    { // for all entries in the folder..
        if((int)filesAndFolderNames->Objects[index]&faDirectory){
            folderCount++; // do count and, if it's a folder, start another ScanDir
            String newFolderPath=FmyDirPath+"\\"+filesAndFolderNames->Strings[index];
            ScanDir* newScanDir=FmySearch->getScanDir(newFolderPath);
            myPool->submit(newScanDir);
        }
        else fileCount++; // inc 'ordinary' file count
    }
    delete(filesAndFolderNames); // don't leak the TStringList of filenames
};

DirSearch::DirSearch(String dirPath, TNotifyEvent onComplete):FdirPath(dirPath),
    FonComplete(onComplete),totalFileCount(0),totalFolderCount(0),dirCount(0),
    PoolTask(0,onComplete)
{
    InitializeCriticalSection(&countLock); // thread-safe count
    searchCompleteEvent=new TEvent(NULL,false,false,"",false); // an event
    // for DirSearch to wait on till all ScanDirs are done
};

ScanDir* DirSearch::getScanDir(String path)
{ // up the dirCount while providing a new ScanDir
    EnterCriticalSection(&countLock);
    dirCount++;
    LeaveCriticalSection(&countLock);
    return new ScanDir(path,this);
};

void DirSearch::run() // called on pool thread
{
    ScanDir *firstScanDir=getScanDir(FdirPath); // get first ScanDir for top
    myPool->submit(firstScanDir); // folder and set it going
    searchCompleteEvent->WaitFor(INFINITE); // wait for them all to finish
}

/* NOTE - this is a DirSearch method, but it's called by the pool threads
running the ScanDirs when they complete. The 'DirSearch' pool thread is stuck
on the searchCompleteEvent, waiting for all the ScanDirs to complete, at which
point the dirCount will be zero and the searchCompleteEvent signalled.
*/
void __fastcall DirSearch::scanCompleted(TObject *Sender){ // a ScanDir is done
    ScanDir* thisScan=(ScanDir*)Sender; // get the instance that completed back
    EnterCriticalSection(&countLock); // thread-safe
    totalFileCount+=thisScan->fileCount; // add the ScanDir counts to the totals
    totalFolderCount+=thisScan->folderCount;
    dirCount--; // another one gone..
    LeaveCriticalSection(&countLock);
    if(!dirCount) searchCompleteEvent->SetEvent(); // if all done, signal
    delete(thisScan); // another one bites the dust..
};
And here it is, working.

If you want to learn some multithreading by doing a practical implementation, it would be best to pick something where the switch from a single-threaded activity to a multi-threaded one actually makes sense.
In this case it doesn't make much sense, and achieving it would require some ugly code. That is because you could have, for example, one thread handle just one subfolder (the first level below the root folder). But what if you start with 200 subfolders? Or more? Would 200 threads make sense in that case? I doubt it...

Related

Is Files.copy a thread-safe function in Java?

I have a function whose purpose is to create a directory and copy a CSV file to that directory. This same function gets run multiple times, each time by an object in a different thread. It gets called in the object's constructor, but I have logic in there to only copy the file if it does not already exist (meaning, it checks to make sure that one of the other instances running in parallel did not already create it).
Now, I know that I could simply rearrange the code so that the directory is created and the file copied before the objects are run in parallel, but that is not ideal for my use case.
I am wondering: will the following code ever fail? That is, due to one of the instances being in the middle of copying the file while another instance attempts to start copying that same file to the same location?
private void prepareGroupDirectory() {
    new File(outputGroupFolderPath).mkdirs();
    String map = "/path/map.csv";
    File source = new File(map);
    String myFile = "/path/test_map.csv";
    File dest = new File(myFile);
    // copy file
    if (!dest.exists()) {
        try {
            Files.copy(source, dest);
        } catch (Exception e) {
            // do nothing
        }
    }
}
To sum it all up: is this function thread-safe in the sense that different threads could all run it in parallel without it breaking? I think yes, but any thoughts would be helpful!
To be clear, I have tested this many, many times and it has worked every time. I am asking this question to make sure that, in theory, it will still never fail.
EDIT: Also, this is highly simplified so that I could ask the question in an easy-to-understand format.
This is what I have now after following the comments (I still need to use nio instead), but it is currently working:
private void prepareGroupDirectory() {
    new File(outputGroupFolderPath).mkdirs();
    logger.info("created group directory");
    String map = instance.getUploadedMapPath().toString();
    File source = new File(map);
    String myFile = FilenameUtils.getBaseName(map) + "." + FilenameUtils.getExtension(map);
    File dest = new File(outputGroupFolderPath + File.separator + "results_" + myFile);
    instance.setWritableMapForGroup(dest.getAbsolutePath());
    logger.info("instance details at time of preparing group folder: {} ", instance);
    final ReentrantLock lock = new ReentrantLock();
    lock.lock();
    try {
        // copy file
        if (!dest.exists()) {
            String pathToWritableMap = createCopyOfMap(source, dest);
            logger.info(pathToWritableMap);
        }
    } catch (Exception e) {
        // do nothing
        // thread-safe
    } finally {
        lock.unlock();
    }
}
It isn't.
What you're looking for is the concept of rename-into-place. The problem with file operations is that almost none of them are atomic.
Presumably you don't just want 'only one' thread to win the race to make this file; you also want the file to either be perfect or not exist at all. You would not want anybody to be able to observe the CSV file in a half-baked state, and you most certainly wouldn't want a crash halfway through generating the CSV file to mean that the file is there, half-baked, but its mere existence prevents any attempt to write it out properly. You can't use finally blocks or exception catching to address this issue; someone might trip over a power cable.
So, how do you solve all these problems?
You do not write to foo.csv. Instead, you write to foo.csv.23498124908.tmp, where that number is randomly generated. Because that just isn't the actual CSV file anybody is looking for, you can take all the time in the world to finish it properly. Once it is done, then you do the magic trick:
You rename foo.csv.23498124908.tmp to foo.csv, and do so atomically: one instant in time foo.csv does not exist, the next instant it does, and it has the complete contents. Also, that rename will only succeed if the file didn't exist before: it is impossible for two separate threads to both rename their foo.csv.23481498.tmp files to foo.csv simultaneously. If you were to try it and get the timing just perfect, one of them (arbitrary which one) 'wins'; the other one gets an IOException and doesn't rename anything.
The way to do this is Files.move(from, to, StandardCopyOption.ATOMIC_MOVE). ATOMIC_MOVE is even kind enough to flat out refuse to execute if somehow the OS/filesystem combination simply does not support it (they pretty much all do, though).
The second advantage is that this locking mechanism works even if you have multiple entirely different apps running. If they all use ATOMIC_MOVE, or the equivalent in that language's API, only one can win, whether we're talking threads in a JVM or apps on a system.
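A minimal sketch of the write-then-rename trick (the class and method names here are illustrative, not an existing API):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.ThreadLocalRandom;

class AtomicCsvWriter {
    static void writeAtomically(Path target, byte[] csvBytes) throws IOException {
        // The temp file must be in the SAME directory: an atomic rename
        // cannot cross filesystem boundaries.
        Path tmp = target.resolveSibling(
                target.getFileName() + "." + ThreadLocalRandom.current().nextLong() + ".tmp");
        Files.write(tmp, csvBytes); // take all the time in the world; nobody looks for this name
        try {
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException lostTheRace) {
            // Another writer may have gotten there first (exact behavior when the
            // target already exists is filesystem-dependent); discard our copy.
            Files.deleteIfExists(tmp);
        }
    }
}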
If you instead want to avoid the notion of multiple threads all doing the work to make this CSV file, when only one should do it and the rest should 'wait' until the first thread is done, file system locks are not the answer. What you can try is making an empty file whose existence is a sign that some other thread is working on it, and there's even a primitive for that in java's java.nio.file API: the CREATE_NEW flag can be used when creating a file, which means: atomically create it, failing if the file already exists, with concurrency guarantees (if multiple processes/threads all run that simultaneously, one succeeds and all the others fail, guaranteed). However, CREATE_NEW can only atomically create. It cannot atomically write; nothing can (hence the whole 'rename it into place' trick above).
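For reference, a sketch of that lock-file primitive (the names are illustrative); the caveats below apply to it:

import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

class LockFileSketch {
    // Returns true if we won the right to do the work. Files.createFile has
    // CREATE_NEW semantics: exactly one thread/process can succeed.
    static boolean tryClaim(Path lock) throws IOException {
        try {
            Files.createFile(lock); // atomic create; throws if it already exists
            return true;            // we own the work; delete the lock file when done
        } catch (FileAlreadyExistsException somebodyElseIsOnIt) {
            return false;           // another thread/process is (or was) doing it
        }
    }
}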
The problems with such lock files are twofold:
If the JVM crashes, that file doesn't go away. Ever launched a Linux daemon process, such as postgres, and had it tell you 'the pid file is still there; if there is no postgres running, please delete it'? Yeah, that problem.
There's no way to know when the work is done, other than to re-check for the file's existence every few milliseconds. If you wait very few milliseconds you're potentially thrashing the disk (hopefully your OS and disk cache algorithms do a decent job). If you wait a lot, you might be waiting around for no reason for a long time.
Hence why you shouldn't do this stuff, and should just use locks within the process. Use synchronized, or make a java.util.concurrent.locks.ReentrantLock, or whatnot.
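A sketch of that in-process lock (class and method names are mine); the crucial difference from the edited code in the question is that the lock object is shared, not created per call:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class MapFileInstaller {
    private static final Object LOCK = new Object(); // one lock shared by every caller

    // check-then-copy is now atomic within this JVM; a 'new ReentrantLock()'
    // created inside the method would protect nothing, since every call
    // would be locking its own private lock.
    static void copyIfAbsent(Path source, Path dest) throws IOException {
        synchronized (LOCK) {
            if (Files.notExists(dest)) {
                Files.copy(source, dest);
            }
        }
    }
}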
To answer your code snippet specifically: no, that is broken. It is possible for two threads to run simultaneously and both get false from dest.exists(), thus both entering the copy block, and then they fall all over each other when copying. Depending on the file system, usually one thread ends up 'winning', with its copy operation succeeding, and the other thread's copy is seemingly lost to the aether (most file systems are ref/node based, meaning the file was written to disk but its 'pointer' was immediately overwritten, and the filesystem considers it garbage, more or less).
Presumably you consider that a failing scenario, and your code does not guarantee that it can't happen.
NB: What API are you using? Files.copy(instanceOfJavaIoFile, anotherInstanceOfJavaIoFile) isn't java. There is java.nio.file.Files.copy(instanceOfjnfPath, anotherInstanceOfjnfPath) - that's the one you want. Perhaps this Files you have is from Apache Commons? I strongly suggest you don't use that stuff; those APIs are usually obsolete (java itself has better APIs to do the same thing) and badly designed. Ditch java.io.File; it's an outdated API. Use java.nio.file instead. The old API doesn't have ATOMIC_MOVE or CREATE_NEW, and it doesn't throw exceptions when things go wrong - it just returns false, which is easily ignored and has no room to explain what went wrong. Hence why you should not use it. One of the major issues with the Apache libraries is that they use the anti-pattern of piling a ton of static utility methods into a giant container. Unfortunately, the second take on file stuff in java itself (java.nio.file) is similarly boneheaded API design. I guess in the java world, the third time will be the charm. At any rate, a bad core java API with advanced capabilities is still better than a bad Apache utility API that wraps the older API, which simply does not expose the kinds of capabilities you need here.

Read JSON files into collections, best practice

I'm working on a JavaFX application. I have several JSON files which I would like to read and insert into collections in domain objects. I am using Gson to read these files at present. My application currently works; however, there is a long delay before it launches. I assume that this is because I'm reading these files sequentially and on the same thread. Therefore, I am looking to improve the launch time by introducing some concurrency. I'm thinking that if I can figure out how to read the files in parallel, it should speed up the launch. I'm new to the idea of concurrency, so I'm trying to learn as I go. Needless to say, I've hit a few roadblocks and can't seem to find much information or examples online.
Here are my issues:
I'm not sure if the JSON file reads can be done in a background thread.
The domain classes use these Collections to compute and eventually display values in the GUI. From my understanding, if you modify the GUI it has to be done on the JavaFX Application Thread and not in the background. I'm not sure if loading data to be used in the GUI counts as modifying the GUI. I'm not directly updating any GUI nodes like textField.setText("something") by reading JSON, so I would assume no, I'm not. Am I wrong?
What is the difference between a Task<V> and a Thread, or an ExecutorService and a Callable<V>? Is one method preferred over the other? I've tried both and failed. When I tried using a Task and a background thread, I would get a NullPointerException because the app tried to access the collection before the files were read and the data initialized. It went from being too slow to being too fast. SMH.
To solve this problem, I heard about preloaders. The idea here was to show some sort of splash screen that delays until the loading of resources (the reading of the JSON files) is complete, then proceed to the main application. However, examples and information on these are VERY scarce. I'm using Java 10 and IntelliJ, so I may have cornered myself into a one-in-a-million niche.
I'm not asking for anyone to solve my problem for me. I'm just a little lost and don't know where or how to proceed. I'll be happy to share specifics if needed, but I think my issues are still conceptual at this point.
Help me, StackOverflow, you're my only hope.
Edit: code example:
public class Employee {
    private List<Employee> employeeList;

    public Employee() {
        employeeList = new ArrayList<>();
        populateEmployees();
    }

    private final void populateEmployees() {
        Task<Void> readEmployees = new Task<>() {
            @Override
            protected Void call() throws Exception {
                System.out.println("Starting to read employee.json"); // #1
                InputStream in = getClass().getResourceAsStream("/json/employee.json");
                Reader reader = new InputStreamReader(in);
                Type type = new TypeToken<List<Employee>>(){}.getType();
                Gson gson = new Gson();
                employeeList.addAll(gson.fromJson(reader, type));
                System.out.println("employeeList has " + employeeList.size() + " elements"); // #2
                return null;
            }
        };
        readEmployees.run();
        System.out.println(readEmployees.getMessage()); // #3
    }
}
I see #1 printed to the console, but never #2 or #3. How do I know that it processed all the way through the Task?
How much your app will speed up depends on how big those files are and how many of them there are. You should know that creating threads is itself a resource-consuming task. I can imagine a situation where you have plenty of files and creating a new thread for each one would make your app initialize even more slowly.
In the case of a large number of files, or a number of files that can change over time, you can arrange a thread pool of constant size, e.g. 5 threads, that work on the file-reading tasks simultaneously.
Back to the problem and the question of whether it is worth using separate threads for reading the files: I'll say yes, but only if your app has some initialization work that can be done without knowing the content of those files. You should be aware that at some point you'll probably need to wait for the file-parsing results.
As part of solving the problem, you can do some benchmarking to check how long parsing each file takes; then you'll know which configuration/number of working threads will be best. E.g. you won't create a thread per file when parsing takes 1 second, but if you have 100 files with 1 second of processing time each, you can create a thread pool and divide the job equally among the threads.
Yes, the JSON file reads can be done in a background thread.
I don't know JavaFX, but in general the concepts of Thread and Task are the same everywhere. A Thread gives you the certainty that you're starting a new thread; it's a lower level of abstraction. A Task is a higher abstraction, where you want to run part of your code separately and asynchronously, but you don't want to have to care which thread it runs on. In some languages, a Task actually hides a thread pool behind it.
Preloaders are fine because they show the user that some job is being done in the background, so they won't worry that the application has frozen. On the other hand, if you can speed up the initialization process, that's even better. You can combine the two ideas, but remember: no one wants to wait a lot :)
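To connect this to the code in the question: a sketch of putting that Task on a real background thread, with the result handed back on the JavaFX Application Thread (loadEmployeesFromJson is a hypothetical helper standing in for the Gson code):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javafx.concurrent.Task;

ExecutorService background = Executors.newFixedThreadPool(2);

Task<List<Employee>> readEmployees = new Task<>() {
    @Override
    protected List<Employee> call() {   // runs on the background thread
        return loadEmployeesFromJson(); // the Gson parsing goes here
    }
};
// Both callbacks run on the JavaFX Application Thread, where GUI work belongs:
readEmployees.setOnSucceeded(e -> employeeList.addAll(readEmployees.getValue()));
readEmployees.setOnFailed(e -> readEmployees.getException().printStackTrace());
background.execute(readEmployees); // NOT readEmployees.run(), which stays on the caller's thread

Nothing touches employeeList until setOnSucceeded fires, which avoids the NullPointerException scenario the question describes.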

java application multi-threading design and optimization

I designed a Java application. A friend suggested using multithreading; he claims that running my application as several threads will decrease the run time significantly.
In my main class, I carry out several operations (out of our scope here) to fill global static variables and hash maps that are used across the whole lifetime of the process. Then I run the core of the application on the entries of an ArrayList:
for (int customerID : customers) {
    ConsumerPrinter consumerPrinter = new ConsumerPrinter();
    consumerPrinter.runPE(docsPath, outputPath, customerID);
    System.out.println("Customer with CustomerID:" + customerID + " Done");
}
For each iteration of this loop, the XMLs of the given customer are fetched from the machine and parsed, and calculations are run on the parsed data. Later, the processed results are written to a text file (the fetched and written data can reach several gigabytes at most and 50 MB on average). More than one iteration can write to the same file.
Should I make this piece of code multi-threaded so that each group of customers is taken in an independent thread?
How can I know the optimal number of threads to run?
What are the best practices to take into consideration when implementing multithreading?
Should I make this piece of code multi-threaded so that each group of customers is taken in an independent thread?
Yes, multithreading will save processing time. While iterating over your list you can spawn a new thread for each iteration and do the customer processing in it. But you need proper synchronization: if processing two customers requires operating on the same resource, you must synchronize that operation to avoid possible race conditions or memory-consistency issues.
How can I know the optimal number of threads to run?
You can't really, without actually analyzing the processing time for n customers with different numbers of threads. It will depend on the number of cores your processor has and on what processing actually takes place for each customer.
What are the best practices to take into consideration when implementing multithreading?
The first criterion is that you must have multiple cores and your OS must support multithreading. Almost every system does these days, but it is a good thing to check. Secondly, you must analyze all the possible scenarios that may lead to a race condition. Every resource that you know will be shared among multiple threads must be thread-safe. You must also look out for possible memory-consistency issues (declare shared variables as volatile). Finally, there are things you cannot predict or analyze until you actually run test cases, like deadlocks (analyze a thread dump) or memory leaks (analyze a heap dump).
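Putting those answers together, a sketch of the question's loop on a fixed-size pool (this assumes runPE() is thread-safe and that writes to shared output files are synchronized; customers, docsPath and outputPath are the variables from the question):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

void processAll() throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors()); // a starting point; measure and tune
    for (int customerID : customers) {
        pool.submit(() -> {
            ConsumerPrinter consumerPrinter = new ConsumerPrinter();
            consumerPrinter.runPE(docsPath, outputPath, customerID);
            System.out.println("Customer with CustomerID:" + customerID + " Done");
        });
    }
    pool.shutdown();                          // accept no new tasks
    pool.awaitTermination(1, TimeUnit.HOURS); // wait for the submitted ones to finish
}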
The idea of multithreading is to move some heavy process off the main thread into, let's say, another "block of memory".
Any UI updates have to be done on the main/default thread, like printing messages or inflating a view, for example. You can ask the app to draw a bitmap, download images from the internet, or run a heavy validation/loop block on a separate thread; imagine that you are creating a second, short-lived app to handle those tasks for you.
Remember: you can ask the app to download/draw an image on another thread, but you have to print that image to the screen on the main thread.
This is commonly used to load a large bitmap on a separate thread, do the math to resize that large image, and then, on the main thread, inflate/print/paint/show the smaller version of the image to the user.
In your case, I don't know how heavy the runPE() method is or what it does; you could try to create another thread for it, but the rest should stay on the main thread, since it is the main process of your UI.
You could also optimize your loop by placing ConsumerPrinter consumerPrinter = new ConsumerPrinter(); before the for(...): since it does not change dynamically, you can move it out of the loop and avoid creating the same object each time the loop repeats :)
While straight Java multithreading (java.util.concurrent) can be used, as other answers have discussed, consider also alternative programming approaches to multithreading, such as the actor model. The actor model still uses threads underneath, but much of the complexity is handled by the actor framework rather than directly by you, the programmer. In addition, there is less (or no) need to reason about synchronizing shared state between threads, because of the way programs using the actor model are structured.
See Which Actor model library/framework for Java? for a discussion of popular actor model libraries.

Multithreaded file processing and reporting

I have an application that processes data stored in a number of files from an input directory and then produces some output depending on that data.
So far, the application works on a sequential basis, i.e. it launches a "manager" thread that
Reads the contents of the input directory into a File[] array
Processes each file in sequence and stores results
Terminates when all files are processed
I would like to convert this into a multithreaded application, in which the "manager" thread
Reads the contents of the input directory into a File[] array
Launches a number of "processor" threads, each of which processes a single file, stores results and returns a summary report for that file to the "manager" thread
Terminates when all files have been processed
The number of "processor" threads would be at most equal to the number of files, since they would be recycled via a ThreadPoolExecutor.
Any solution avoiding the use of join() or wait()/notify() would be preferable.
Based on the above scenario:
What would be the best way of having those "processor" threads report back to the "manager" thread? Would an implementation based on Callable and Future make sense here?
How can the "manager" thread know when all "processor" threads are done, i.e. when all files have been processed?
Is there a way of "timing" a processor thread and terminating it if it takes "too long" (i.e., it hasn't returned a result despite the lapse of a pre-configured amount of time)?
Any pointers to, or examples of, (pseudo-)source code would be greatly appreciated.
You can definitely do this without using join() or wait()/notify() yourself.
You should take a look at java.util.concurrent.ExecutorCompletionService to start with.
The way I see it you should write the following classes:
FileSummary - Simple value object that holds the result of processing a single file
FileProcessor implements Callable<FileSummary> - The strategy for converting a file into a FileSummary result
FileManager - The high-level manager that creates FileProcessor instances, submits them to a work queue and then aggregates the results.
The FileManager would then look something like this:
class FileManager {
private CompletionService<FileSummary> cs; // Initialize this in constructor
public FinalResult processDir(File dir) {
int fileCount = 0;
for(File f : dir.listFiles()) {
cs.submit(new FileProcessor(f));
fileCount++;
}
for(int i = 0; i < fileCount; i++) {
FileSummary summary = cs.take().get();
// aggregate summary into final result;
}
}
If you want to implement a timeout you can use the poll() method on CompletionService instead of take().
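For example, the aggregation loop's body could become something like this (30 seconds is an arbitrary cut-off; the checked exceptions propagate as before):

import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

Future<FileSummary> next = cs.poll(30, TimeUnit.SECONDS);
if (next == null) {
    // no file finished within the window: log it, cancel outstanding work, etc.
} else {
    FileSummary summary = next.get();
    // aggregate as before
}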
wait()/notify() are very low-level primitives, and you are right to want to avoid them.
The simplest solution would be to use thread-safe queues (or stacks, etc.; it doesn't really matter much in this case). Before starting the worker threads, your main thread can add all the Files to the thread-safe queue/stack. Then it starts the worker threads, and they all pull Files and process them until there are none left.
The worker threads can add their results to another thread-safe queue/stack, where the main thread can get them. The main thread knows how many Files there were, so when it has retrieved the same number of results, it knows that the job is finished.
Something like a java.util.concurrent.BlockingQueue would work, and there are other thread-safe collections in java.util.concurrent which would also be fine.
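A sketch of that arrangement (FileResult and process are stand-ins for your own summary type and per-file logic; filling the input queue before starting the workers is what makes "queue empty" a safe exit condition):

import java.io.File;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

void runAll(File inputDir, int nThreads) throws InterruptedException {
    BlockingQueue<File> input = new LinkedBlockingQueue<>();
    BlockingQueue<FileResult> output = new LinkedBlockingQueue<>();

    File[] files = inputDir.listFiles();
    for (File f : files) input.add(f); // fill BEFORE any worker starts

    Runnable worker = () -> {
        File f;
        while ((f = input.poll()) != null) { // non-blocking poll: null means nothing left
            output.add(process(f));
        }
    };
    for (int i = 0; i < nThreads; i++) new Thread(worker).start();

    for (int i = 0; i < files.length; i++) {
        FileResult r = output.take(); // main thread blocks for each result
        // aggregate r ...
    }
}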
You also asked about terminating worker threads which are taking too long. I will tell you right up front: if you can make the code which runs on the worker threads robust enough that you can safely leave this feature out, you will make things a lot simpler.
If you do need this feature, the simplest and most reliable solution is to have a per-thread "terminate" flag, and to make the worker task code check that flag frequently and exit if it is set. Make a custom class for the workers, and include a volatile boolean field for this purpose. Also include a setter method (because the field is volatile, the setter doesn't need to be synchronized).
If a worker discovers that its "terminate" flag is set, it can push its File object back onto the work queue/stack so another thread can process it. Of course, if there is some problem which means the File can never be successfully processed, this will lead to an infinite cycle.
The best is to make the worker code very simple and robust, so you don't need to worry about it "not terminating".
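A sketch of such a worker (hasMoreChunks and processNextChunk are hypothetical placeholders for the per-file work; the volatile field is the whole trick):

import java.io.File;
import java.util.concurrent.BlockingQueue;

class Worker implements Runnable {
    private final BlockingQueue<File> workQueue; // the shared queue from above
    private volatile boolean terminateRequested = false;

    Worker(BlockingQueue<File> workQueue) { this.workQueue = workQueue; }

    void requestTerminate() { terminateRequested = true; } // volatile: no synchronization needed

    public void run() {
        File current;
        while ((current = workQueue.poll()) != null) {
            while (hasMoreChunks(current)) {
                if (terminateRequested) {
                    workQueue.add(current); // push it back so another thread can retry it
                    return;
                }
                processNextChunk(current);  // check the flag between chunks
            }
        }
    }
}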
No need for them to report back. Just keep a count of the number of jobs remaining to be done, and have each thread decrement that count when it finishes one.
When the count of remaining jobs reaches zero, all the "processor" threads are done.
Sure, just add that code to the thread: when it starts working, check the time and compute the stop time. Periodically (say, whenever you go to read more from the file), check whether it's past the stop time and, if so, stop.
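Sketched together (everything here except AtomicInteger is a placeholder name):

import java.util.concurrent.atomic.AtomicInteger;

AtomicInteger remaining = new AtomicInteger(jobCount); // jobs left to do

Runnable job = () -> {
    long stopAt = System.currentTimeMillis() + MAX_MILLIS; // compute the stop time up front
    while (hasMoreToRead()) {
        if (System.currentTimeMillis() > stopAt) return;   // past the stop time: give up
        readSomeMore();
    }
    if (remaining.decrementAndGet() == 0) {
        // the count reached zero: all "processor" threads are done
    }
};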

Directory naming using Process ID and Thread ID

I have an application with a few threads that manipulate data and save their output in different temporary files in a particular directory, on a Linux or a Windows machine. These files eventually need to be erased.
What I want is to be able to better separate the files, so I am thinking of doing it by process ID and thread ID. This will help the application save disk space because, upon termination of a thread, the whole directory with that thread's files can be erased, letting the rest of the application reuse the corresponding disk space.
Since the application runs in a single instance of the JVM, I assume it will have a single process ID, which will be that of the JVM, right?
That being the case, the only way to discriminate among these files is to save them in a folder whose name is related to the thread ID.
Is this approach reasonable, or should I be doing something else?
java.io.File can create temporary files for you. As long as you keep a list of the files associated with each thread, you can delete them when the thread exits. You can also mark the files delete-on-exit in case a thread does not complete.
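For example ("myapp" and the file-name prefix are illustrative):

import java.io.File;
import java.io.IOException;

File tempDir = new File(System.getProperty("java.io.tmpdir"), "myapp");
tempDir.mkdirs();
File scratch = File.createTempFile("worker-", ".tmp", tempDir); // unique name, no IDs needed
scratch.deleteOnExit(); // safety net in case the thread never cleans up
// ... use scratch, remembering it in this thread's list of files ...
scratch.delete();       // normal cleanup when the thread finishes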
It seems the simplest solution for this approach really is to extend Thread - never thought I'd see the day.
As P.T. already said, thread IDs are only unique as long as the thread is alive; they can, and most certainly will, be reused by the OS.
So instead of doing it that way, use the thread name that can be specified at construction. To make it simple, just write a small class:
public class MyThread extends Thread {
    private static long ID = 0;

    public MyThread(Runnable r) {
        super(r, getNextName());
    }

    private static synchronized String getNextName() {
        // We could get rid of synchronized with an AtomicLong and so on,
        // but I doubt that's necessary.
        return "MyThread " + ID++;
    }
}
Then you can do something like this:
public static void main(String[] args) throws InterruptedException {
    Thread t = new MyThread(new Runnable() {
        @Override
        public void run() {
            System.out.println("Name: " + Thread.currentThread().getName());
        }
    });
    t.start();
}
You have to override all the constructors you want to use and always use the MyThread class, but this way you can guarantee a unique mapping - well, for at least 2^64-1 threads (negative values are fine too, after all), which should be more than enough.
Though I still don't think that's the best approach. It's probably better to create some "job" class that contains all the necessary information and can clean up its files as soon as it's no longer needed; that way you can also easily use thread pools and the like, where one thread will do more than one job. At the moment you have business logic in a thread, and that doesn't strike me as especially good design.
You're correct: the JVM has one process ID, and all threads in that JVM will share that process ID. (It is possible for a JVM to use multiple processes, but AFAIK no JVM does that.)
A JVM may very well re-use underlying OS threads for multiple Java threads, so there is no guaranteed correlation between a thread exiting in Java and anything similar happening at the OS level.
If you just need to clean up stale files, sorting the files by their creation timestamp should do the job sufficiently. No need to encode anything special into the temporary file names.
Note that PIDs and TIDs are neither guaranteed to be increasing, nor guaranteed to be unique across exits. The OS is free to recycle an ID. (In practice the IDs have to wrap around before re-use, but on some machines that can happen after only 32k or 64k processes have been created.)
