I need to develop an application that will process CSV files as soon as they are created in a predefined directory. A huge number of incoming files is expected.
I have seen applications using Apache Commons IO file monitoring in production. It works pretty well; I have seen it process as many as 21 million files in a day. It seems that Apache Commons IO file monitoring polls the directory and calls listFiles to find the files to process.
My question:
Is JDK WatchService as good an option as Apache Commons IO File Monitoring? Does anyone know of any pros and cons?
Since I asked this question, I have gained some more insight into the matter, so I am trying to answer it for those who might have a similar question.
Apache Commons IO monitoring uses a polling mechanism with a configurable polling interval. On every poll, it calls the listFiles() method of the File class and compares the result with the listFiles() output of the previous iteration to identify file creation, modification and deletion. The algorithm is robust, and I have never seen it miss a file. It works well even with a large volume of files. However, since it polls and invokes listFiles() on every iteration, it consumes CPU cycles unnecessarily when the file inflow is low. It also works on network drives.
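To make the snapshot comparison concrete, here is a minimal sketch of the idea using only the JDK. It is not the Commons IO implementation; the class name DirectorySnapshot and its methods are hypothetical, and it only illustrates how two consecutive listFiles() results can be diffed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: diffs two directory listings the way a polling monitor could. */
final class DirectorySnapshot {

    /** Maps file name to last-modified time for every regular file in dir. */
    static Map<String, Long> take(File dir) {
        Map<String, Long> snapshot = new HashMap<>();
        File[] files = dir.listFiles(); // null if dir does not exist or cannot be read
        if (files != null) {
            for (File f : files) {
                if (f.isFile()) {
                    snapshot.put(f.getName(), f.lastModified());
                }
            }
        }
        return snapshot;
    }

    /** Names present in the current poll but absent from the previous one. */
    static List<String> created(Map<String, Long> previous, Map<String, Long> current) {
        List<String> names = new ArrayList<>();
        for (String name : current.keySet()) {
            if (!previous.containsKey(name)) {
                names.add(name);
            }
        }
        return names;
    }

    /** Names present in the previous poll but absent from the current one. */
    static List<String> deleted(Map<String, Long> previous, Map<String, Long> current) {
        return created(current, previous);
    }

    /** Names present in both polls whose last-modified time differs. */
    static List<String> modified(Map<String, Long> previous, Map<String, Long> current) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            Long before = previous.get(e.getKey());
            if (before != null && !before.equals(e.getValue())) {
                names.add(e.getKey());
            }
        }
        return names;
    }
}
```

A scheduler would call take() once per polling interval and pass the previous and current maps to created(), deleted() and modified(). The cost of each poll grows with the number of files in the directory, which is the CPU overhead mentioned above.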
The JDK WatchService does not need polling. It is event based: it is triggered only when an event occurs, so it needs less CPU when the file inflow is low. If the inflow is heavy and events are processed more slowly than they occur, events may overflow. Additionally, it does not work reliably with network drives.
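As an illustration of the overflow case, the sketch below reads one batch of events from a WatchService and falls back to a full directory listing when it sees OVERFLOW. The class name CreatedFileWatcher and the fallback strategy are assumptions for the example, not something the JDK prescribes.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper around a WatchService registered for ENTRY_CREATE on dir. */
final class CreatedFileWatcher {

    /**
     * Waits up to timeoutMillis for one batch of events and returns the created paths.
     * Returns an empty list if nothing arrived in time.
     */
    static List<Path> nextBatch(WatchService watcher, Path dir, long timeoutMillis)
            throws IOException, InterruptedException {
        List<Path> created = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (key == null) {
            return created;
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            WatchEvent.Kind<?> kind = event.kind();
            if (kind == StandardWatchEventKinds.OVERFLOW) {
                // Events were lost: assumed recovery is to re-list the whole directory.
                created.clear();
                try (DirectoryStream<Path> all = Files.newDirectoryStream(dir)) {
                    for (Path p : all) {
                        created.add(p);
                    }
                }
                break;
            }
            if (kind == StandardWatchEventKinds.ENTRY_CREATE) {
                // The event context is the file name relative to the watched directory.
                created.add(dir.resolve((Path) event.context()));
            }
        }
        key.reset(); // re-arm the key so later events are queued
        return created;
    }
}
```

The watcher would be created with FileSystems.getDefault().newWatchService() and the directory registered with dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE). After an overflow the caller has to tolerate seeing files it has already processed.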
Hence, in conclusion, if the file inflow is continuous and huge, it is better to go with Apache Commons IO file monitoring. Otherwise, the JDK WatchService is a good option.
Related
One thread pool downloads files from the FTP server and another thread pool reads those files.
Both thread pools run concurrently. I'll explain exactly what happens with an example.
Let's assume I have one CSV file with 100 records.
While threadPool-1 is downloading the file and writing it to the /pending folder, threadPool-2 reads from that same file. Assume that in 1 second only 10 records have been written to the file in the /pending folder, so threadPool-2 reads only 10 records.
threadPool-2 doesn't know that the other 90 records are still being downloaded, so it never reads them: it has no way of telling whether the whole file has been downloaded. After reading, it moves the file to another folder, so the remaining 90 records are not processed any further.
My question is: how can I wait until the whole file is downloaded, so that only then threadPool-2 reads the contents of the file?
One more thing: both thread pools use the scheduleAtFixedRate method and run every 10 seconds.
Please guide me on this.
I'm a fan of Mark Rotteveel's #6 suggestion (in comments above):
use a temporary name when downloading,
rename when download is complete.
That looks like:
FTP download threads write all files with some added extension – perhaps .pending – but name it whatever you want.
When a file is downloaded – say some.pdf – the FTP download thread writes the file to some.pdf.pending
When an FTP download thread completes a file, the last step is a file rename operation – this is the mechanism for ensuring only "done" files are ready to be processed. So it downloads the file to some.pdf.pending, then at the end, renames it to some.pdf.
Reader threads look for files, ignoring anything matching *.pending
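The steps above could be sketched as follows. The class PendingFiles, its method names and the use of an InputStream as the download source are assumptions made for illustration; the real FTP client code would supply the stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for the "temporary name, then rename" hand-off. */
final class PendingFiles {

    static final String SUFFIX = ".pending"; // assumed marker for unfinished files

    /** Writer side: stream into name.pending, then rename to name as the last step. */
    static Path writeThenPublish(InputStream source, Path targetDir, String name)
            throws IOException {
        Path temp = targetDir.resolve(name + SUFFIX);
        Path done = targetDir.resolve(name);
        Files.copy(source, temp, StandardCopyOption.REPLACE_EXISTING);
        // Both paths are in the same directory, so the rename can be atomic.
        return Files.move(temp, done, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Reader side: every regular file that does not carry the pending suffix. */
    static List<Path> readyFiles(Path dir) throws IOException {
        List<Path> ready = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                boolean pending = p.getFileName().toString().endsWith(SUFFIX);
                if (Files.isRegularFile(p) && !pending) {
                    ready.add(p);
                }
            }
        }
        return ready;
    }
}
```

The download threads would call writeThenPublish() and the reader threads would call readyFiles() on each scheduled run. Keeping the temporary file in the same directory as the final name matters, because an atomic move is generally only possible within one file system.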
I've built systems using this approach and they worked out well. In contrast, I've also worked with more complicated systems that tried to coordinate across threads, and those often did not work so well.
Over time, any software system will have bugs. Edsger Dijkstra captured this so well:
"If debugging is the process of removing software bugs, then programming must be the process of putting them in."
However difficult it is to reason about program correctness now, while the program is still in the design phase and has not yet been built, it will be harder to reason about correctness when things are broken in production (which will happen, because bugs).
That is, when things are broken and you're under time pressure to find the root cause (and fix it!), even the best of us would be at a disadvantage with a complicated (vs. simple) system.
The approach of using temporary names is simple to reason about, which should minimize code complexity and thus make it easier to implement.
In turn, maintenance and bug fixes should be easier, too.
Keep it simple – let the filesystem help you out.
Problem statement: an FTP server is flooded with files arriving at a rate of 100 Mbps (i.e. 12.5 MB/s); each file is approximately 100 MB. Files are deleted 30 seconds after their creation timestamp. Any process that wants to read a file must fetch the complete file in less than 30 seconds. I am using Java to solve this particular problem.
I need suggestions on which design pattern would be best suited to this kind of problem. How can I make sure that each file is consumed before the server deletes it?
Your suggestions will be greatly appreciated. Thanks.
If the Java application runs on the same machine as the FTP service, then it could use File.renameTo(File) or equivalent to move a required file out of the FTP server directory and into another directory. Then it can process it at its leisure. It could use a WatchService to monitor the FTP directory for newly arrived files. It should watch for events on the directory, and when a file starts to appear it should wait for the writes to stop happening. (Depending on the OS, you may or may not be able to move the file while the FTP service is writing to it.)
There is a secondary issue of whether a Java application could keep up with the required processing rate. However, if you had multiple cores and multiple worker threads, then your app could potentially process the files in parallel. (It depends on how computationally and/or I/O intensive the processing is, and on the rate at which a Java thread can read a file, which will be OS and possibly hardware dependent.)
If the Java application is not running on the FTP server, it would probably need to use FTP to fetch the file. I am doubtful that you could implement something to do that consistently and reliably; i.e. without losing files occasionally.
Environment: Java 7 on an Ubuntu 12 server.
I have a Java application that polls for incoming .zip files that are delivered via sftp. I have no control over the client that's delivering the files.
The files being delivered are quite large, and in some cases, the poll mechanism detects a file while it's still being written. In this situation, the Java application borks because it thinks the file is corrupt.
What's the most effective way of detecting when the local sftp server has finished writing the file?
There are a number of approaches to dealing with this. You can choose one, but the more you implement the better:
The sender should upload as a .tmp file, then rename to .zip once done so that the watcher only sees the finished file.
The watcher should check the last modified time of the file, and if it was modified in the last 10 seconds (maybe 1 minute) then ignore the file and try again later.
If your OS supports it, try to get an exclusive lock on the file before reading it. This is not so easy in Java, and depends on OS specifics.
Always send the file as a zip file, because if the file is incomplete or otherwise corrupted it will fail the CRC check. You also get the added benefit of smaller transfers, a smaller archive folder, etc. (Of course you are already doing this, as mentioned in the question.)
Look at the File2 component of Camel and all the options it gives you. Makes you want to use Camel, right?
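The last-modified check from the list above might look like the sketch below. The class name QuietPeriod is hypothetical, and the current time is passed in as a parameter only to keep the check easy to test; the 10-second threshold is the caller's choice.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: treats a file as finished once it has not been modified for a while. */
final class QuietPeriod {

    /**
     * Returns true if file was last modified at least quietMillis before nowMillis.
     * A false result means "skip it for now and look again on the next poll".
     */
    static boolean isQuiet(Path file, long quietMillis, long nowMillis) throws IOException {
        long lastModified = Files.getLastModifiedTime(file).toMillis();
        return nowMillis - lastModified >= quietMillis;
    }
}
```

A poller would call isQuiet(file, 10_000, System.currentTimeMillis()) for each candidate. This is a heuristic: a stalled upload that resumes after the quiet period would still be picked up too early, which is why combining it with the CRC check is useful.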
See answer: https://stackoverflow.com/a/5851185/92063 which mentions incron. You can use it to notify your application that a file system event has taken place.
A quote from the linked website:
incron :: inotify cron system
This program is an "inotify cron" system. It consists of a daemon and
a table manipulator. You can use it a similar way as the regular cron.
The difference is that the inotify cron handles filesystem events
rather than time periods.
You have no control over the sender, which is unfortunate, because the best solution would be the following (afterwards I will give another solution, which doesn't require the sender to change anything).
The sender should rename the file when the upload is finished.
E.g. the file is named fileInProgress.txt during upload and fileFinished.txt when the upload is finished. You restrict your Java program to only watch for files named *Finished.txt. This is the easiest and an absolutely reliable solution.
Your solution would be the following.
From your java program do a file-listing on the upload folder and store the file-sizes.
Wait for 10 seconds (or longer if you want to be on the safe side).
Do a file-listing again.
All files that didn't change in size are finished and can be processed.
Note that this does not give you absolute certainty that the upload is finished, but the longer the interval between your file size checks, the closer it comes.
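The two-listing comparison described in the steps above can be sketched like this. The class StableSizeCheck and its method names are hypothetical; the waiting between the two listings is left to the caller.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: finds files whose size did not change between two listings. */
final class StableSizeCheck {

    /** Lists dir and records the size in bytes of every regular file, keyed by name. */
    static Map<String, Long> sizes(Path dir) throws IOException {
        Map<String, Long> sizes = new TreeMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    sizes.put(p.getFileName().toString(), Files.size(p));
                }
            }
        }
        return sizes;
    }

    /** Names that appear in both listings with the same size. */
    static List<String> unchanged(Map<String, Long> first, Map<String, Long> second) {
        List<String> stable = new ArrayList<>();
        for (Map.Entry<String, Long> e : second.entrySet()) {
            Long before = first.get(e.getKey());
            if (before != null && before.equals(e.getValue())) {
                stable.add(e.getKey());
            }
        }
        return stable;
    }
}
```

The caller would take one listing with sizes(), sleep for the chosen interval, take a second listing and process only the names returned by unchanged(). Files that first appear in the second listing are deliberately left for the next round.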
As David Roussel mentioned, Camel would be very useful for this. Take a look at initialDelay (among any others you may find useful) from File2 as this would place a specified delay before it polls the directory.
For any sort of file polling I have done, I have used Camel, as it makes these kinds of situations easier to handle.
I want to be able to measure thread I/O through code in my live/running application. So far the best (and only) solution I found is this one, which requires me to hook directly into Windows' Performance Monitor. However, it seems very complex and there must be a simpler way to do this. I don't mind writing different code for Windows and Linux; to be honest, I was expecting to.
Thanks for your help
Update:
So what I mean by disk I/O is: if you open the Windows Resource Monitor and go to the Disk tab, you can look at each running process and see its average read/write rate in B/sec over the last minute. I want to get the same data, but per thread.
Take a look at Hyperic Sigar. It is a cross-platform native resource Java API.
The class FileSystemUsage gives you a bunch of stats about file system usage (obvious...) such as
Disk Queue
Disk Writes / Disk Bytes Written
Disk Reads / Disk Bytes Read
Most of them are cumulative, but you just compute a delta every n seconds to get a rate.
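Turning a cumulative counter into a rate is just a delta over elapsed time. The sketch below shows that arithmetic in isolation; CounterRate is a hypothetical class and does not call Sigar, so the counter value would have to be read from whichever API you use.

```java
/** Hypothetical helper: converts successive readings of a cumulative counter into a per-second rate. */
final class CounterRate {

    private long lastValue;
    private long lastNanos;
    private boolean primed;

    /**
     * Records a new reading and returns the rate since the previous one, in units per second.
     * Returns NaN for the first reading, or if no time has elapsed.
     */
    double sample(long cumulativeValue, long nowNanos) {
        double rate = Double.NaN;
        if (primed && nowNanos > lastNanos) {
            double seconds = (nowNanos - lastNanos) / 1_000_000_000.0;
            rate = (cumulativeValue - lastValue) / seconds;
        }
        lastValue = cumulativeValue;
        lastNanos = nowNanos;
        primed = true;
        return rate;
    }
}
```

In use, a scheduled task would call sample(bytesWrittenSoFar, System.nanoTime()) every n seconds, with one CounterRate instance per counter.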
There's also a javaagent for profiling/monitoring file IO called JPicus. I have not looked at it in a while, but it may be useful.
I have a scenario where web archive (WARC) files are dropped periodically by a crawler into different directories. Each WARC file internally consists of thousands of HTML files.
Now, I need to build a framework to process these files efficiently. I know Java doesn't scale in terms of parallel processing of I/O. What I'm thinking of is a monitor thread that scans the directory, picks up the file names and drops them into an ExecutorService or a Java blocking queue. A bunch of worker threads (maybe a small number, because of the I/O issue) behind the executor service will read the WARC files, read the HTML files within them and do the respective processing. This is to make sure that threads do not fight over the same file.
Is this the right approach in terms of performance and scalability? Also, how should I handle the files once they are processed? Ideally, the files should be moved or tagged so that they are not picked up by a thread again. Can this be handled through Future objects?
In recent versions of Java (starting from Java 7) there is already a built-in file change notification service, WatchService, in the java.nio.file package. You might want to check this out first instead of building your own. See here
My key recommendation is to avoid re-inventing the wheel unless you have some specific requirement.
If you're using Java 7, you can take advantage of the WatchService (as suggested by Simeon G).
If you're restricted to Java 6 or earlier, these services aren't available in the JRE. However, Apache Commons IO provides file monitoring. See here.
As an advantage over Java 7, Commons-IO monitors will create a thread for you that raises events against the registered callback. With Java 7, you will need to poll the event list yourself.
Once you have the events, your suggestion of using an ExecutorService to process files off-thread is a good one. Moving files is supported by Java IO and you can just ignore any delete events that are raised.
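As a rough sketch of that model, the helper below submits one file to an ExecutorService, runs a caller-supplied processing step and then moves the file to a "done" directory so it is not picked up again. FileWorkers, the doneDir parameter and the Consumer-based processing hook are assumptions for illustration. The returned Future lets the caller learn where the file ended up or whether processing failed.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical helper: process a file off-thread, then move it out of the watched directory. */
final class FileWorkers {

    /**
     * Submits file for processing. The Future completes with the file's new location
     * in doneDir, or fails with the exception thrown by the processor or the move.
     */
    static Future<Path> submit(ExecutorService pool, Path file, Path doneDir,
                               Consumer<Path> processor) {
        Callable<Path> task = () -> {
            processor.accept(file); // application-specific work, supplied by the caller
            Path target = doneDir.resolve(file.getFileName());
            return Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
        };
        return pool.submit(task);
    }
}
```

The pool could be created with Executors.newFixedThreadPool(n), with n read from configuration so that it can be tuned during performance testing. Because each file is submitted exactly once by the monitoring thread, workers never compete for the same file.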
I've used this model in the past with success.
Here are some things to watch out for:
The new file event will likely be raised as soon as the file exists in the directory. HOWEVER, data may still be being written to it. Consider reasonable expectations for file size and how long you need to wait until a file is considered 'whole'.
What is the maximum amount of time you must spend on a file?
Make your executor service parameters tweakable via config - this will simplify your performance testing
Hope this helps. Good luck.