Java nio: Multithreading with DirectoryStream

Java nio: Multithreading with DirectoryStream - java

I have an occasional hard to replicate bug where one of my threads hangs.
A web spider thread dumps html files into a directory.
A file processing thread reads the files in the directory, process
them one by one and moves them.
Since the file processor move file (by logical necessity) can only occur when a file is already in the directory, the file processor file read process is asynchronous and unlikely to lead to a hang.
HOWEVER, the fileprocessor thread also scans the directory and this can happen as the web spider thread saves a file into the directory.
QUESTION:
If a file is saved into this directory while the following read directory method is called, will it cause a hang? (Frankly, I don't see how it could, but maybe that is why I have tthe bug).
If yes, then how do I resolve the issue?
private void listFiles(Path path)
{
Log.getLogger().debug("started ......");
try (DirectoryStream<Path> stream = Files.newDirectoryStream(path))
{
for (Path entry : stream)
{
if (Files.isDirectory(entry))
{
listFiles(entry);
}
else
{
files.add(entry);
}
}
}
catch (Exception e)
{
Log.getLogger().error(e.getMessage(), e);
}
Log.getLogger().debug("done");
}

To avoid threads from interfering in each others work a semaphore (or mutex in it's simplest form) should be used. Sempaphores can be acquired by a thread in order to run code in the so called critical section. Code in this section could for example access an ArrayList. If multiple threads access that list and add and remove elements to and from it, you will eventually get a ConcurrentModificationException. In other cases you will not get an exception at all but your program might do unexpected things (see the lost-update problem for example).
If you however acquire a lock each time you access the critical section, other threads won't be able to access the shared resource (the list in this case or the directory in your case).
In order to achieve this behaviour you can either use classes that implement the lock interface, create an object and use that as a lock like so:
Object lock = new Object();
synchronized(lock) {
// do critical work here
}
A third and probably the most uneffective (but simplest) way is to use the synchronized keyword for your methods. Only one method that is declared as synchronized inside a class can be called at a time.

Related

Is Files.copy a thread-safe function in Java?

I have a function, that's purpose is to create a directory and copy a csv file to that directory. This same function gets ran multiple times, each time by an object in a different thread. It gets called in the object's constructor, but I have logic in there to only copy the file if it does not already exist (meaning, it checks to make sure that one of the other instances in parallel did not already create it).
Now, I know that I could simply rearrange the code so that this directory is created and the file is copied before the objects are ran in parallel, but that is not ideal for my use case.
I am wondering, will the following code ever fail? That is, due to one of the instances being in the middle of copying a file, while another instance attempts to start copying that same file to the same location?
private void prepareGroupDirectory() {
new File(outputGroupFolderPath).mkdirs();
String map = "/path/map.csv"
File source = new File(map);
String myFile = "/path/test_map.csv";
File dest = new File(myFile);
// copy file
if (!dest.exists()) {
try{
Files.copy(source, dest);
}catch(Exception e){
// do nothing
}
}
}
To sum it all up. Is this function thread-safe in the sense that, different threads could all run this function in parallel without it breaking? I think yes, but any thoughts would be helpful!
To be clear, I have tested this many many times and it has worked every time. I am asking this question to make sure, that in theory, it will still never fail.
EDIT: Also, this is highly simplified so that I could ask the question in an easy to understand format.
This is what I have now after following comments (I still need to use nio instead), but this is currently working:
private void prepareGroupDirectory() {
new File(outputGroupFolderPath).mkdirs();
logger.info("created group directory");
String map = instance.getUploadedMapPath().toString();
File source = new File(map);
String myFile = FilenameUtils.getBaseName(map) + "." + FilenameUtils.getExtension(map);
File dest = new File(outputGroupFolderPath + File.separator + "results_" + myFile);
instance.setWritableMapForGroup(dest.getAbsolutePath());
logger.info("instance details at time of preparing group folder: {} ", instance);
final ReentrantLock lock = new ReentrantLock();
lock.lock();
try {
// copy file
if (!dest.exists()) {
String pathToWritableMap = createCopyOfMap(source, dest);
logger.info(pathToWritableMap);
}
} catch (Exception e) {
// do nothing
// thread-safe
} finally {
lock.unlock();
}
}

It isn't.
What you're looking for is the concept of rotate-into-place. The problem with file operations is that almost none of it is atomic.
Presumably you don't just want 'only one' thread to win the race for making this file, you also want that file to either be perfect, or not exist at all: You would not want anybody to be able to observe that CSV file in a half-baked state, and you most certainly wouldn't want a crash halfway through generating the CSV file to mean that the file is there, half-baked, but its mere existence means it prevents any attempt to write it out properly. You can't use finally blocks or exception catching to address this issue; someone might trip over a powercable.
So, how do you solve all these problems?
You do not write to foo.csv. Instead you write to foo.csv.23498124908.tmp where that number is randomly generated. Because that just isn't the actual CSV file anybody is looking for, you can take all the time in the world to finish it properly. Once it is done, then you do the magic trick:
You rename foo.csv.23498124908.tmp into foo.csv, and do so atomically - one instant in time foo.csv does not exist, the next instant in time it does and it has the complete contents. Also, that rename will only succeed if the file didn't exist before: It is impossible for two separate threads to both rename their foo.csv.23481498.tmp file into foo.csv simultaneously. If you were to try it and get the timing just perfect, one of them (arbitrary which one) 'wins', the other one gets an IOException and doesn't rename anything.
The way to do this is using Files.move(from, to, StandardCopyOptions.ATOMIC_MOVE). ATOMIC_MOVE is even kind enough to flat out refuse to execute if somehow the OS/filesystem combination simply does not support ATOMIC_MOVE (they pretty much all do, though).
The second advantage is that this locking mechanism works even if you have multiple entirely different apps running. If they all use ATOMIC_MOVE or the equivalent of this in that language's API, only one can win, whether we're talking 'threads in a JVM' or 'apps on a system'.
If you want to instead avoid the notion that multiple threads are both simultaneously doing the work to make this CSV file even though only one should do so and the rest should 'wait' until the first thread is done, file system locks are not the answer - you can try (make an empty file whose existence is a sign that some other thread is working on it) - and there's even a primitive for that in java's java.nio.file APIs. The CREATE_NEW flag can be used when creating a file, which means: Atomically create it, failing if the file already exists with concurrency guarantees (if multiple processes/threads all run that simultaneously, one succeeds and all others fail, guaranteed). However, CREATE_NEW can only atomically create. It cannot atomically write, nothing can (hence the whole 'rename it into place' trick above).
The problem with such locks are two fold:
If the JVM crashes that file doesn't go away. Ever launched a linux daemon process, such as postgresd, and it told you that 'the pid file is still there, if there is no postgres running please delete it'? Yeah, that problem.
There's no way to know when it is done, other than to just re-check for that file's existence every few milliseconds. If you wait very few milliseconds you're trashing the disk potentially (hopefully your OS and disk cache algorithms do a decent job). If you wait a lot you might be waiting around for no reason for a long time.
Hence why you shouldn't do this stuff, and just use locks within the process. Use synchronized or make a new java.util.concurrent.ReentrantLock or whatnot.
To answer your code snippet specifically, no that is broken: It is possible for 2 threads to run simultaneously and both get false when it runs dest.exists(), thus both entering the copy block, and then they fall all over each other when copying - depending on file system, usually one thread ends up 'winning', with their copy operation succeeding and the other thread's seemingly lost to the aether (most file systems are ref/node based, meaning, the file was written to disk but its 'pointer' was immediately overwritten, and the filesystem considers it garbage, more or less).
Presumably you consider that a failing scenario, and your code does not guarantee that it can't happen.
NB: What API are you using? Files.copy(instanceOfJavaIoFile, anotherInstanceOfJavaIoFile) isn't java. There is java.nio.file.Files.copy(instanceOfjnfPath, anotherInstanceOfjnfPath) - that's the one you want. Perhaps this Files you have is from apache commons? I strongly suggest you don't use that stuff; those APIs are usually obsolete (java itself has better APIs to do the same thing), and badly designed. Ditch java.io.File, it's outdated API. Use java.nio.file instead. The old API doesn't have ATOMIC_MOVE or CREATE_NEW, and doesn't throw exceptions when things go wrong - it just returns false which is easily ignored and has no room to explain what went wrong. Hence why you should not use it. One of the major issues with the apache libraries is that it uses the anti-pattern of piling a ton of static utility methods into a giant container. Unfortunately, the second take on file stuff in java itself (java.nio.file) is similarly boneheaded API design. I guess in the java world, third time will be the charm. At any rate, a bad core java API with advanced capabilities is still a better than a bad apache utility API that wraps around the older API which simply does not expose the kinds of capabilities you need here.

How to achieve Synchronization in Multithreading Java App

I've designed a Server-Client App using Java and i have connected to the Server more than one users.
The Server provides some features, such as:
downloading files
creating files
writting/appending files etc.
I spotted some issues that need to be synchronized when two or more users send the same request.
For Example: When users want to download the same file at the same time, how can i synchronize this action by using synchronized blocks or any other method?
//main (connecting 2 users in the server)
ServerSocket server= new ServerSocket(8080, 50);
MyThread client1=new MyThread(server);
MyThread client2=new MyThread(server);
client1.start();
client2.start();
Here is the method i would like to synchronize:
//outstream = new BufferedWriter(new OutputStreamWriter(sock.getOutputStream()));//output to client
//instream = new BufferedReader(new InputStreamReader(sock.getInputStream()));//input from client
public void downloadFile(File file,String name) throws FileNotFoundException, IOException {
synchronized(this)
{
if (file.exists()) {
BufferedReader readfile = new BufferedReader(new FileReader(name + ".txt"));
String newpath = "../Socket/" + name + ".txt";
BufferedWriter socketfile = new BufferedWriter(new FileWriter(newpath));
String line;
while ((line = readfile.readLine()) != null) {
outstream.write(line + "\n");
outstream.flush();
socketfile.write(line);
}
outstream.write("EOF\n");
outstream.flush();
socketfile.close();
outstream.write("Downloaded\n");
outstream.flush();
} else {
outstream.write("FAIL\n");
}
outstream.flush();
}
}
Note: This method is in a class that extends Thread and is being used when i want to "download" the file in the overriden method Run()
Does this example assures me that when 2 users want to download the same file one of them will have to wait? and will the other one get it? Thanks for your time!

Locking in concurrent is used to provide mutual exclusion to some piece of code. For locking you can use as synchronized as unstructured lock like ReentrantLock and others.
The main goal of any lock is to provide mutual exclusion to the piece of code placed inside which will mean that this piece will be executed only by one thead at a time. Section inside the lock is called critical section.
To achieve a proper locking it is not enough just to place critical code there. Also you have to make sure that modification of you variables made inside the critical section is made only there. Because if you locked some piece of code but references to variables inside also got passed to some concurrent executing thread without any locking then lock wont save you in that case and you will get a data race. Locks secure only execution of a critical section and only guarantee you that code placed in the critical section will be executed only by one thread at a time.
//outstream = new BufferedWriter(new
OutputStreamWriter(sock.getOutputStream()));//output to client
//instream = new BufferedReader(new
InputStreamReader(sock.getInputStream()));//input from client
public void downloadFile(File file,String name) throws
FileNotFoundException, IOException {
synchronized(this)
{
Who is the owner of this method? Client? If yes then it won't work. You should lock on the same object. It should be shared with all threads which require locking. But in your case every client will have it's own lock and the other threads know nothing about other thread's locks. You can lock at the Client.class. This will work.
synchronize(this) vs synchronize(MyClass.class)
After doing that you will have proper locking for reading (downloading) the file. But what about write? Imagine the case when during reading some other thread will want to modify that file. But you have locks only for reading. You are reading the beginning of the file and the other thread is modifying the end of it. So the writing thread will succeed and you will get logically a corrupted file with the begging from one file and the end of the other. Of course file systems and standard java library try to take care about such cases (by using locks in readers\writes, locking the file offsets etc) but in general it is a possible scenario. So you will need also the same lock on write. And read and write methods should share and use the same lock.
And we've came to a situation when we have correct behavior but low performance. This is our tradeoff. But we can do better. Now we are using the same lock for every write and read method and this means that we can read or write to only one any file at a time. But this is not correct cause we can modify or read different files without any possible corruption. So the better approach will be to associate a lock with a file not the whole method. And here nio comes to help you.
How can I lock a file using java (if possible)
https://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileLock.html
And actually you can read a file concurrently if offsets are different. Due to obvious physical reasons you can't read the same part of a file concurrently. But concurrent read and taking care about offsets seems as a huge overhead fmpv and im not sure that you will need that. Anyway here is some info: Concurrent reading of a File (java preferred)

Thread safe creating and deleting file in Java

Are methods Files.createFile() and Files.delete() thread safe? I have read in documentation that createFile() always an atomic operation but delete() is not. Should I somehow synchronize these blocks in my Java application and how? What atomic operation means for multihreading task?

a. What atomic operation means for multihreading task?
In context of multi-threading atomicity is the ability of a thread to execute a task in such a manner so that other threads have apparently no side-effect over the state varibles of that task when it was being executed by this thread.
File.createNewFile() :- For this method the state is existence or non existence of the file, when the thread was about to execute this method. Lets say that when this method was being executed by the thread the file did not exist. Now lets say that this method takes 5 ms of time to execute and create the file. So according to the concept of Atomicity no other thread should be able to create the same file(which was not existing before) during these 5ms otherwise the very first assumption of this thread about the state of the file will change and hence the output.
So in this case the executing thread does-this by obtaining a write lock over the directory where file is to be created.
Files.delete():- The Java doc for this method says
this method may not be atomic with respect to
other file system operations. If the file is a symbolic link, then the
symbolic link itself, not the final target of the link, is deleted.
the above statement says that this operation is also atomic but in case if this method is invoked on a symbolic link, the link is deleted and not the file. Which implies that the original file exists and file system operations on that file are feasible by other threads.
to determine if a file is a symlink see the reference:-
determine symlink
b. Should I somehow synchronize these blocks in my Java application and how?
You need not handle any multi-threading scenarios in both the cases.
However you can use the method mentioned in the link above to determine symlinks and handle that separately as you would wish.
But no synchronization is required from your end for sure.

Do you mean File.createNewFile()?
Javadoc says:
The check for the existence of the file and the creation of the file if it does not exist are a single operation that is atomic with respect to all other filesystem activities that might affect the file.
With other words, between the check if the file exist and the creation of the file will be no other file operation, changing the existence of the file.
If two threads want to create the same non existing file, only one will create the file and return true. The other thread will return false.
Usually you dont need to synchronize these operations but do a proper exception handling. Maybe other programs operate on your files too.

Directory naming using Process ID and Thread ID

I have an application with a few threads that manipulate data and save the output in different temporary files on a particular directory, in a Linux or a Windows machine. These files eventually need to be erased.
What I want to do is to be able to better separate the files, so I am thinking of doing this by Process ID and Thread ID. This will help the application save disk space because, upon termination of a thread, the whole directory with that thread's files can be erased and leave the rest of the application reuse the corresponding disk space.
Since the application runs on a single instance of the JVM, I assume it will have a single Process ID, which will be that of the JVM, right?
That being the case, the only way to discriminate among these files is to save them in a folder, the name of which will be related to the Thread ID.
Is this approach reasonable, or should I be doing something else?

java.io.File can create temporary files for you. As long as you keep a list of those files associated with each thread, you can delete them when the thread exits. You can also mark the files to delete on exit in case a thread does not complete.

It seems the simplest solution for this approach is really to extend Thread - never thought I'd see that day.
As P.T. already said Thread IDs are only unique as long as the thread is alive, they can and most certainly will be reused by the OS.
So instead of doing it this way, you use the Thread name that can be specified at construction and to make it simple, just write a small class:
public class MyThread extends Thread {
private static long ID = 0;
public MyThread(Runnable r) {
super(r, getNextName());
}
private static synchronized String getNextName() {
// We can get rid of synchronized with some AtomicLong and so on,
// doubt that's necessary though
return "MyThread " + ID++;
}
}
Then you can do something like this:
public static void main(String[] args) throws InterruptedException {
Thread t = new MyThread(new Runnable() {
#Override
public void run() {
System.out.println("Name: " + Thread.currentThread().getName());
}
});
t.start();
}
You have to overwrite all constructors you want to use and always use the MyThread class, but this way you can guarantee a unique mapping - well at least 2^64-1 (negative values are fine too after all) which should be more than enough.
Though I still don't think that's the best approach, possibly better to create some "job" class that contains all necessary information and can clean up its files as soon as it's no longer needed - that way you also can easily use ThreadPools and co where one thread will do more than one job. At the moment you have business logic in a thread - that doesn't strike me as especially good design.

You're correct, the JVM has one process ID, and all threads in that JVM will share the process id. (It is possible for a JVM to use multiple processes, but AFAIK, no JVM does that.)
A JVM may very well re-use underlying OS threads for multiple Java threads, so there is no guaranteed correlation between a thread exiting in Java and anything similar happening at the OS level.
If you just need to cleanup stale files, sorting the files by their creation timestamp should do the job sufficiently? No need to encode anything special at all in the temporary file names.
Note that PIDs and TIDs are neither guaranteed to be increasing, no guaranteed to be unique across exits. The OS is free to recycle an ID. (In practice the IDs have to wrap around before re-use, but on some machines that can happen after only 32k or 64k processes have been created.

FileChannel & RandomAccessFile don't seem to work

To put it simple: a swing app that uses sqlitejdbc as backend. Currently, there's no problem launching multiple instances that work with the same database file. And there should be.
The file is locked (can't delete it while the app is running) so the check should be trivial. Turns out not.
File f = new File("/path/to/file/db.sqlite");
FileChannel channel = new RandomAccessFile(f, "rw").getChannel();
System.out.println(channel.isOpen());
System.out.println(channel.tryLock());
results in
true
sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]
No matter whether the app is running or not. Am I missing the point?
TIA.

FileLocks are exclusive to the JVM, not an individual thread. So if you ran that code inside the same process as your Swing app, you would get the lock because it is shared by the JVM.
If your Swing app is not running, no other process is contending for the lock so you will obtain it there is well.

A File System level lock interacts with other applications. You get one of these from FileChannel. So what you do in your example code will make the file seem locked to another process, for example vi.
However, other Java threads or processes within the JVM will NOT see the lock. The key sentence is "File locks are held on behalf of the entire Java virtual machine. They are not suitable for controlling access to a file by multiple threads within the same virtual machine." You are not seeing the lock, so you are running sqlitejdbc from within the same JVM as your application.
So the question is how do you see whether your JVM has already acquired a lock on a file (assuming you don't control the code acquiring the lock)? One suggestion I would have is try and acquire an exclusive lock on a different subset of the file, for example with this code:
fc.tryLock(0L, 1L, false)
If there is already a lock you should get an OverlappingFileLockException. This is a bit hacky but might work.

Can you do a little experiment? Run two copies of this program (just your code with a sleep):
public class Main {
public static void main(String [] args) throws Exception {
File f = new File("/path/to/file/db.sqlite");
FileChannel channel = new RandomAccessFile(f, "rw").getChannel();
System.out.println(channel.isOpen());
System.out.println(channel.tryLock());
Thread.sleep(60000);
}
}
If this doesn't lock you know that tryLock() isn't working on your OS/drive/JVM. If this does lock then something else is wrong with your logic. Let us know the result in a comment.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.