How to check for a dynamically created file in Java?


I have an application that needs to check for a file which may be created dynamically during execution; I will give up after some MAX time if the file has still not shown up. Is there a more efficient way in Java to check for the file than polling for it and sleeping every X seconds? If not, what would be the most efficient way of doing this?

You currently have to poll the file system as you mentioned. Java 7 is supposed to have file system notifications, so this should get easier at some point.
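The notification API anticipated here shipped in Java 7 as java.nio.file.WatchService. A minimal sketch of waiting for one file with an upper time limit might look like the following; the class and method names (FileAwaiter, awaitFile) are made up for illustration, and it assumes the parent directory already exists.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of any library.
final class FileAwaiter {

    /** Waits up to timeoutMillis for target to exist; returns true if it showed up in time. */
    static boolean awaitFile(Path target, long timeoutMillis)
            throws IOException, InterruptedException {
        Path dir = target.toAbsolutePath().getParent();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try (WatchService watcher = dir.getFileSystem().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            // Check only after registering, so a file created in between is not missed.
            if (Files.exists(target)) {
                return true;
            }
            while (true) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    return Files.exists(target);
                }
                WatchKey key = watcher.poll(remaining, TimeUnit.NANOSECONDS);
                if (key == null) {
                    return Files.exists(target); // timed out
                }
                key.pollEvents(); // drain; the existence check below is what decides
                if (Files.exists(target)) {
                    return true;
                }
                if (!key.reset()) {
                    return Files.exists(target); // directory is no longer watchable
                }
            }
        }
    }
}
```

On platforms without native notifications the JDK may fall back to polling internally, so the latency can be higher there.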

If the same program is doing the creation of the file as the polling, you could instead have the logic that creates the file notify the part of the program using Object.notify(). A general description of the wait() and notify/notifyAll() mechanism can be found here: http://java.sun.com/docs/books/tutorial/essential/concurrency/guardmeth.html
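As a rough sketch of that wait()/notifyAll() idea, assuming the producer and the consumer share one object: the class name FileReadySignal and its methods are invented for this example.

```java
// Hypothetical guarded-block helper shared by the file creator and the waiter.
final class FileReadySignal {
    private boolean ready = false;

    /** Called by the code that has just finished creating the file. */
    synchronized void markReady() {
        ready = true;
        notifyAll();
    }

    /** Blocks until markReady() was called or the timeout elapsed; returns whether the file is ready. */
    synchronized boolean awaitReady(long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!ready) { // loop guards against spurious wakeups
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) {
                return false; // gave up after the MAX time
            }
            wait(remainingMillis);
        }
        return true;
    }
}
```

The creating code calls markReady() after closing the file; the waiting code calls awaitReady(maxMillis) and gives up when it returns false.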

You could try JPoller to poll for the file changes.
If you are running on Windows, you can get directory change notifications; see Obtaining Directory Change Notifications. Of course, this is not cross-platform, and it requires JNA or a similar native bridge. In fact, JNA offers such a class, the FileMonitor class (in the download), which uses the underlying platform's file change notification.
If you are watching a handful of files or fewer, polling is unlikely to be a performance problem. It's just not a "feel-good" solution, but it is not bad enough to warrant the pain of a non-pure-Java solution. Monitoring directories containing thousands of files, on the other hand, would benefit from direct notification from the OS.

Related

Is java.io.File.createNewFile() atomic in a network file system?

EDIT: Well, I'm back a bunch of months later. The lock mechanism that I was trying to code doesn't work, because createNewFile isn't reliable on NFS. Check the answer below.
Here is my situation: I have only 1 application which may access the files, so I don't have any constraint about what other applications may do, but the application runs concurrently on several servers in the production environment for redundancy and performance purposes (a couple of machines, each hosting a couple of JVMs running our app).
Basically, what I need is to put some kind of flag in a folder to tell the other instances to leave this folder alone because another instance is already dealing with it.
Many search results suggest using FileLock to achieve this, but I checked the Javadoc, and from my understanding it will not help much, since it relies on the hosting OS's locking facilities and there are several different hosting machines.
This question covers a similar subject: Java file locking on a network, and the accepted answer recommends implementing your own kind of cooperative locking process (using File.createNewFile(), as asked by the OP).
The Javadoc of File.createNewFile() says that the method atomically creates the file if it doesn't already exist. Does that work reliably on a network file system?
I mean, how is it possible, with the potential network lag, to do both the existence check and the creation simultaneously? The Javadoc states:
The check for the existence of the file and the creation of the file if it does not exist are a single operation that is atomic with respect to all other filesystem activities that might affect the file.

No, createNewFile doesn't work properly on a network file system.
Even if the system call is atomic, it's only atomic regarding the OS, and not over the network.
Over time, I got a couple of collisions, roughly once every 2-3 months (approx. once every 600k files).
What happens is this: my program runs as 6 separate instances on 2 separate servers, so let's call them A1, A2, A3 and B1, B2, B3.
When A1, A2, and A3 try to create the same file, the OS can properly ensure that only one file is created, since it is working with itself.
When A1 and B1 try to create the same file at the same exact moment, there is some form of network cache and/or network delays happening, and they both get a true return from File.createNewFile().
My code then proceeds by renaming the parent folder to stop the other instances of the program from unnecessarily trying to process the folder and that's where it fails :
On A1, the folder renaming operation is successful, but the lock file can't be removed, so A1 just leaves it like that and keeps on processing new incoming folders.
On B1, the folder renaming operation (File.renameTo(), can't do much to fix it) gets stuck in an infinite loop because the folder was already renamed (also causing huge I/O traffic according to my sysadmin), and B1 is unable to process any new file until the program is restarted.

The check for the existence of the file and the creation of the file if it does not exist are a single operation that is atomic with respect to all other filesystem activities that might affect the file.
That can be implemented easily via the open() system call or its equivalents in any operating system I have ever used.

I mean, how is it possible with the potential network lag to do both
existence check and creation simultaneously ?
There is a difference between simultaneously and atomically. The Javadoc does not say that this function performs two simultaneous actions; it says the two actions are designed to work atomically. If the method performs the two operations atomically, the file will never be created without first checking whether it exists. If the file is created by the current call, no file of that name was present; if it is not created, a file by that name already existed.
I don't see a reason to doubt that the function is atomic or works reliably, whether the call goes over the network or to a local disk. A local call is equally unreliable: so many things can go wrong in I/O.
What you do have to be wary of is using the empty file created by this function as a lock, as explained in D-Mac's answer to this question and as explicitly mentioned in the Javadoc for this function.
You are looking for a directory lock, and empty files working as a directory lock (to signal other processes and threads not to touch it) have worked quite well for me, provided due care is taken to write the logic for checking file existence, cleaning up lock files, and handling orphaned locks.
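To make the "lock file plus orphan cleanup" idea concrete, here is a small sketch. The class DirectoryLock, the ".lock" file name and the staleness threshold are all assumptions of this example, not something from the answers above, and it inherits the NFS caveats reported earlier in this thread: it is only as reliable as createNewFile() is on the underlying filesystem.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical cooperative directory lock based on an empty marker file.
final class DirectoryLock {
    private final File lockFile;
    private final long staleAfterMillis;

    DirectoryLock(File dir, long staleAfterMillis) {
        this.lockFile = new File(dir, ".lock"); // marker name is an assumption
        this.staleAfterMillis = staleAfterMillis;
    }

    /** Returns true if this caller now owns the lock. */
    boolean tryAcquire() throws IOException {
        if (lockFile.createNewFile()) {
            return true;
        }
        // Someone else holds the lock; treat it as orphaned if it is too old.
        long modified = lockFile.lastModified(); // 0 if the file vanished meanwhile
        boolean orphaned = modified != 0
                && System.currentTimeMillis() - modified > staleAfterMillis;
        if (orphaned && lockFile.delete()) {
            // Note: two instances cleaning up at once can still race here.
            return lockFile.createNewFile();
        }
        return false;
    }

    /** Removes the marker; call from a finally block once processing is done. */
    void release() {
        lockFile.delete();
    }
}
```

The stale threshold should be comfortably longer than the slowest legitimate processing run, otherwise a live lock may be mistaken for an orphan.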

java api to receive notifications when a file system is mounted

I am looking for a Java API that will allow registering for file system mount events, i.e. when a file system is mounted or dismounted. Specifically I want to know when a file system on removable USB devices is available, and also know exactly what type of USB device it was.
The udev subsystem provides notifications on USB plug and unplug events by default but not specifically when the file system on the device is available. It is possible to create udev rules that can do this in pieces, e.g. create a directory and execute a program when devices are added and removed. But my experience with udev rules is that the syntax is arcane and they are fragile and not simple to debug.
I've installed usbmount per this post:
https://serverfault.com/questions/414120/how-to-get-usb-devices-to-automount-in-ubuntu-12-04-server
though I believe the devices were automounting by default.
As an alternative I constructed a JDK 7 WatchService on /media which can detect changes in /etc/mtab. This works, but I have seen cases where the file systems on some USB devices are still not ready (attempts to read the directory throw an Exception) even after the entry in /etc/mtab is made. I added a timer to sleep for a configurable number of milliseconds; in most cases a 100 ms wait works, but not 100% of the time. In other words, increasing the wait time is neither a guarantee nor deterministic.
Clearly at some low level the mount event is being generated because the Nautilus pop-up window gets displayed. I had a case of one flash drive that would put the Nautilus icon in the launch pad menu but it would not mount until the icon was clicked open.
I've also looked at these options:
tailing /var/log/syslog; this may be the next best option. I see lines like the following:
:Dec 2 08:58:07 fred-Inspiron-530 udisksd[1759]: Mounted /dev/sdk1 at /media/fred/USB DISK1 on behalf of uid 1000
I am going to try a WatchService here to see if the same timing issue exists, i.e. whether the directory is readable once this message is written.
jlibudev [ github.com/nigelb/jlibudev ] Much better Java API to udev subsystem than writing rules but it still falls short in that you still have to piece a number of different events together. NB: jlibudev depends on JNA [https://github.com/twall/jna] and purejavacomm [ github.com/nyholku/purejavacomm, sparetimelabs.com/purejavacomm/purejavacomm.php] both of which are pretty useful in their own right.
lsusb provides details on the usb device but nothing about where it is mounted.
Ideally I would like a simple API that would allow registering for file system mount/unmount events using the standard Java event listening pattern. I want to believe that such an API exists or is at least possible given that at a macro-level the net effect is occurring. I am still scouring the JDK 7 and JDK 8 APIs for other options.
Any and all pointers and assistance would be greatly appreciated.

Since there's no OS-agnostic way to deal with mounting filesystems, there's definitely no JDK API for this. I'm guessing this problem is not dealt with much (not a lot of programs deal with mounting filesystems directly), so it's unlikely that there's any prebuilt library out there waiting for you.
Of the approaches you mentioned, all sound roughly equal in terms of how platform-specific they are (all Linux-only), so that just leaves performance and ease of coding as open questions. Regarding performance: running lsusb more than once a second is (a) a giant hack :-) and (b) slow, since fork+exec is slow compared to running something in-process. Tailing the event log will create a lot of (unpredictable) work for your program that is unrelated to USB mounts, and it makes your implementation more fragile (what if the message strings change when you upgrade your OS?). Regarding ease of programming: using JNA or JNI to call into libudev and using a WatchService on /media sound about equal. Using libudev seems like the most portable option across Linux distros and user configurations (I'm guessing that's what Nautilus uses).
However, for simplicity of implementation that will work for 99% of users, it's hard to do better than a WatchService on /media. To help ensure that the filesystem is available before use, I would use a loop with some kind of randomized exponential backoff between attempts to read the directory. That way you never wait much longer than necessary for the filesystem to mount, you aren't burning tons of CPU waking up and trying to read, and you don't have to pick a single timeout number that won't work everywhere. If you care enough to avoid tying down a single thread sleeping forever, I'd use a ScheduledExecutorService to issue Runnables that try to access the filesystem; if it's not available they schedule themselves to run again in a bit, otherwise they alert your main thread, using a queue of some kind, that a new filesystem is available for use.
Edit: I just learned that you could also watch for updates to the /proc/mounts file. Hopefully since the kernel is responsible for updating this file things only show up when they're fully mounted, although I don't know for certain. For more details, How to interpret /proc/mounts? and the Red Hat docs were useful.
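The randomized exponential backoff described in this answer could be sketched roughly as below. MountReadinessProbe and its method names are invented for the example, "readable" is assumed to mean that a directory listing succeeds, and the sketch retries indefinitely, so a real version would probably also cap the number of attempts.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical probe: retries reading a mount point with jittered, doubling delays.
final class MountReadinessProbe {
    private final ScheduledExecutorService scheduler;
    private final BlockingQueue<Path> ready; // consumed by the main thread
    private final long maxDelayMillis;

    MountReadinessProbe(ScheduledExecutorService scheduler,
                        BlockingQueue<Path> ready, long maxDelayMillis) {
        this.scheduler = scheduler;
        this.ready = ready;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Starts probing dir; the path is put on the queue once it can be listed. */
    void probe(Path dir, long initialDelayMillis) {
        schedule(dir, initialDelayMillis);
    }

    private void schedule(Path dir, long delayMillis) {
        // Jitter: wait somewhere between half the delay and the full delay.
        long jittered = delayMillis / 2
                + ThreadLocalRandom.current().nextLong(delayMillis / 2 + 1);
        scheduler.schedule(() -> attempt(dir, delayMillis), jittered, TimeUnit.MILLISECONDS);
    }

    private void attempt(Path dir, long delayMillis) {
        if (isReadable(dir)) {
            ready.add(dir);
        } else {
            schedule(dir, Math.min(delayMillis * 2, maxDelayMillis)); // back off
        }
    }

    static boolean isReadable(Path dir) {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            entries.iterator().hasNext(); // force an actual read of the directory
            return true;
        } catch (IOException | DirectoryIteratorException e) {
            return false;
        }
    }
}
```

The main thread would call probe() when the watcher reports a new entry and then take() from the queue, so no thread sits in a sleep loop.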

Java application needs exclusive access to file being delivered by sftp

Environment: Java 7 on an Ubuntu 12 server.
I have a Java application that polls for incoming .zip files that are delivered via sftp. I have no control over the client that's delivering the files.
The files being delivered are quite large, and in some cases, the poll mechanism detects a file while it's still being written. In this situation, the Java application borks because it thinks the file is corrupt.
What's the most effective way of detecting when the local sftp server has finished writing the file?

There are a number of approaches to dealing with this. You can choose one, but the more you implement the better:
The sender should upload as a .tmp file, then rename to .zip once done so that the watcher only sees the finished file.
The watcher should check the last modified time of the file, and if it was modified in the last 10 seconds (maybe 1 minute) then ignore the file and try again later.
If your OS supports it, try to get an exclusive lock on the file before reading it. This is not so easy in java, and depends on OS specifics.
Always send the file as a zip file, since an incomplete or otherwise corrupted file will fail the CRC check. You also get the added benefit of smaller transfers, a smaller archive folder, etc. (Of course you are already doing this, as mentioned in the question.)
Look at the File2 component of Camel and all the options it gives you. Makes you want to use Camel, right?

See answer: https://stackoverflow.com/a/5851185/92063 which mentions incron. You can use it to notify your application that a file system event has taken place.
A quote from the linked website:
incron :: inotify cron system
This program is an "inotify cron" system. It consists of a daemon and
a table manipulator. You can use it a similar way as the regular cron.
The difference is that the inotify cron handles filesystem events
rather than time periods.

You have no control over the sender, that is unfortunate because the best solution would be the following (I will give another solution afterwards which doesn't require the sender to change anything).
The sender should rename the file when the upload is finished.
E.g. the file is named fileInProgress.txt during upload and fileFinished.txt when the upload is finished. You then restrict your Java program to only watch for files named *Finished.txt. This is the easiest and an absolutely reliable solution.
Your solution would be the following.
From your java program do a file-listing on the upload folder and store the file-sizes.
Wait for 10 secs (or longer if you want to be on the safe side).
Do a file-listing again.
All files that didn't change in size are finished and can be processed.
Note that this does not give you absolute certainty that the upload is finished, but it comes closer the longer the interval between your file size checks is.
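The size-comparison steps above can be sketched as follows. StableFileScanner and its two methods are names made up for this illustration; the caller is expected to take one snapshot, wait the chosen interval, take a second snapshot and then compare.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for the "list, wait, list again" approach.
final class StableFileScanner {

    /** Records the current size of every regular file directly inside dir. */
    static Map<Path, Long> snapshotSizes(Path dir) throws IOException {
        Map<Path, Long> sizes = new HashMap<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    sizes.put(entry, Files.size(entry));
                }
            }
        }
        return sizes;
    }

    /** Returns the files present in both snapshots whose size did not change. */
    static List<Path> unchanged(Map<Path, Long> before, Map<Path, Long> after) {
        List<Path> stable = new ArrayList<>();
        for (Map.Entry<Path, Long> entry : after.entrySet()) {
            if (entry.getValue().equals(before.get(entry.getKey()))) {
                stable.add(entry.getKey());
            }
        }
        return stable;
    }
}
```

Files that first appear in the second snapshot are deliberately not reported; they get their chance on the next round.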

As David Roussel mentioned, Camel would be very useful for this. Take a look at initialDelay (among any others you may find useful) from File2 as this would place a specified delay before it polls the directory.
Any sort of file polling that I have done I have used Camel as it is easier to handle these kinds of situations.

How can I process multiple files concurrently?

I have a scenario where web archive files (warc) are dropped periodically by a crawler into different directories. Each warc file internally consists of thousands of HTML files.
Now, I need to build a framework to process these files efficiently. I know Java doesn't scale in terms of parallel processing of I/O. What I'm thinking is to have a monitor thread which scans this directory, picks up the file names and drops them into an ExecutorService or some Java blocking queue. A bunch of worker threads (maybe a small number, because of the I/O issue) listening under the executor service will read the files, read the HTML files within and do the respective processing. This is to make sure that threads do not fight over the same file.
Is this the right approach in terms of performance and scalability? Also, how should the files be handled once they are processed? Ideally, the files should be moved or tagged so that they are not picked up by a thread again. Can this be handled through Future objects?

In recent versions of Java (starting from Java 7, as far as I know) there are already built-in file change notification services as part of the NIO library. You might want to check this out first instead of going on your own. See here

My key recommendation is to avoid re-inventing the wheel unless you have some specific requirement.
If you're using Java 7, you can take advantage of the WatchService (as suggested by Simeon G).
If you're restricted to Java 6 or earlier, these services aren't available in the JRE. However, Apache Commons-IO provides file monitoring. See here.
As an advantage over Java 7, Commons-IO monitors will create a thread for you that raises events against the registered callback. With Java 7, you need to poll the event list yourself.
Once you have the events, your suggestion of using an ExecutorService to process files off-thread is a good one. Moving files is supported by Java IO and you can just ignore any delete events that are raised.
I've used this model in the past with success.
Here are some things to watch out for:
The new file event will likely be raised as soon as the file exists in the directory. HOWEVER, data may still be being written to it. Consider reasonable expectations for file size and how long you need to wait until a file is considered 'whole'.
What is the maximum amount of time you must spend on a file?
Make your executor service parameters tweakable via config - this will simplify your performance testing
Hope this helps. Good luck.
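For the Java 7 route, polling the event list and handing new files to an ExecutorService might look roughly like this. DirectoryDispatcher and dispatchOnce are hypothetical names, the watcher is assumed to have been registered on dir for ENTRY_CREATE by the caller, and the sketch does not yet wait for a file to be 'whole' before submitting it.

```java
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical glue between a WatchService and a worker pool.
final class DirectoryDispatcher {

    /**
     * Waits up to timeoutMillis for one batch of events on the watcher and submits
     * every newly created file in dir to the worker pool.
     * Returns the number of files submitted.
     */
    static int dispatchOnce(WatchService watcher, Path dir, ExecutorService workers,
                            Consumer<Path> handler, long timeoutMillis)
            throws InterruptedException {
        WatchKey key = watcher.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (key == null) {
            return 0; // nothing happened within the timeout
        }
        int submitted = 0;
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                // The event context is the file name relative to the watched directory.
                Path file = dir.resolve((Path) event.context());
                workers.submit(() -> handler.accept(file));
                submitted++;
            }
        }
        key.reset(); // re-arm the key so later events are delivered
        return submitted;
    }
}
```

A monitor thread would call dispatchOnce in a loop; because each file name is submitted exactly once per create event, the workers do not compete for the same file.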

What's the most efficient method of continually deleting files older than X hours on Windows?

I have a directory that continually fills up with "artefact" files. Many different programs dump their temporary files in this directory and it's unlikely that these programs will become self-cleaning any time soon.
Meanwhile, I would like to write a program that continually deletes files in this directory as they become stale, which I'll define as "older than 30 minutes".
A typical approach would be to have a timed mechanism that lists the files in the directory, filters on the old stuff, and deletes the old stuff. However, this approach is not very performant in my case because this directory could conceivably contain 10s or hundreds of thousands of files that do not yet qualify as stale. Consequently, this approach would continually be looping over the same thousands of files to find the old ones.
What I'd really like to do is implement some kind of directory listener that was notified of any new files added to the directory. This listener would then add those files to a queue to be deleted down the road. However, there doesn't appear to be a way to implement such a solution in the languages I program in (JVM languages like Java and Scala).
So: I'm looking for the most efficient way to keep a directory "as clean as it can be" on Windows, preferably with a JVM language. Also, though I've never programmed with Powershell, I'd consider it if it offered this kind of functionality. Finally, if there are 3rd party tools out there to do such things, I'd like to hear about them.
Thanks.

Why can't you issue a directory system command sorted by oldest first:
c:>dir /OD
Take the results and delete all files older than your threshold or sleep if no files are old enough.
Combine that with a Timer or Executor set to a granularity 1 second - 1 minute which guarantees that the files don't keep piling up faster than you can delete them.

If you don't want to write C++, you can use Python. Install pywin32 and you can then use the win32 API as such:
import win32api, win32con
change_handle = win32api.FindFirstChangeNotification(
    path_to_watch,
    0,
    win32con.FILE_NOTIFY_CHANGE_FILE_NAME
)
Full explanation of what to do with that handle by Tim Golden here: http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html.

In Java, you can also use Apache Commons JCI FAM. It is an open-source Java library that you can use for free.
JDK 7 (released in beta currently) includes support for file notifications as well. Check out Java NIO2 tutorial.
Both options should work both on Windows and Linux.

http://www.cyberpro.com.au/Tips_n_Tricks/Windows_Related_Tips/Purge_a_Directory_in_Windows_automatically/

I'd go with C++ for a utility like this - lets you interface with the WIN32 API, which does indeed have directory listening facilities (FindFirstChangeNotification or ReadDirectoryChangesW). Use one thread that listens for change notifications and updates your list of files (iirc FFCN requires you to rescan the folder, whereas RDCW gives you the actual changes).
If you keep this list sorted according to modification time, it becomes easy to Sleep() just long enough for a file to go stale, instead of polling at some random fixed interval. You might want to do a WaitForSingleObject with a timeout instead of Sleep, in order to react to outside changes (ie, the file you're waiting for to become stale has been deleted externally, so you'll want to wake up and determine when the next file will become stale).
Sounds like a fun little tool to write :)
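The "keep the list sorted and sleep just long enough" idea maps fairly directly onto java.util.concurrent.DelayQueue on the JVM. A minimal sketch, with StaleFileReaper and its members being names invented here; it assumes a file's age is counted from the moment it is tracked (for example when a directory watcher reports it), not from its modification time.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical reaper: deletes tracked files once they have been known for ttlMillis.
final class StaleFileReaper {

    /** A tracked file together with the instant at which it becomes stale. */
    static final class Expiring implements Delayed {
        final Path file;
        final long expiresAtNanos;

        Expiring(Path file, long ttlMillis) {
            this.file = file;
            this.expiresAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(ttlMillis);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(expiresAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                                other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    private final DelayQueue<Expiring> queue = new DelayQueue<>();
    private final long ttlMillis;

    StaleFileReaper(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Registers a newly seen file for later deletion. */
    void track(Path file) {
        queue.add(new Expiring(file, ttlMillis));
    }

    /** Blocks until the oldest tracked file is stale, deletes it if still present, and returns it. */
    Path reapNext() throws InterruptedException, IOException {
        Expiring next = queue.take(); // sleeps exactly until the next expiry
        Files.deleteIfExists(next.file); // tolerate files already removed externally
        return next.file;
    }
}
```

One thread would feed track() from directory notifications while another loops on reapNext(), so the directory is never rescanned.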

You might want to bite the bullet and code it up in C# (or VB). What you're asking for is pretty well handled by the FileSystemWatcher class. It would work basically the way you are describing. Register files as they are added into the directory. Have a periodic timer that scans the list of files for ones that are stale and deletes them if they are still there. I'd probably code it up as a Windows service running under a service id that has enough rights to read/delete files in the directory.
EDIT: A quick google turned up this FileSystemWatcher for Java. Commercial software. Never used it, so can't comment on how well it works.
