Java application needs exclusive access to file being delivered by sftp - java

Environment: Java 7 on an Ubuntu 12 server.
I have a Java application that polls for incoming .zip files that are delivered via sftp. I have no control over the client that's delivering the files.
The files being delivered are quite large, and in some cases, the poll mechanism detects a file while it's still being written. In this situation, the Java application borks because it thinks the file is corrupt.
What's the most effective way of detecting when the local sftp server has finished writing the file?

There are a number of approaches to dealing with this. You can choose one, but the more you implement the better:
The sender should upload as a .tmp file, then rename to .zip once done so that the watcher only sees the finished file.
The watcher should check the last modified time of the file, and if it was modified in the last 10 seconds (maybe 1 minute) then ignore the file and try again later.
If your OS supports it, try to get an exclusive lock on the file before reading it. This is not so easy in Java, and depends on OS specifics.
Always send the file as a zip file, because if the file is incomplete or otherwise corrupted it will fail the CRC check. You also get the added benefit of smaller transfers, a smaller archive folder, etc. (Of course you are already doing this, as mentioned in the question.)
Look at the File2 component of Camel and at all the options it gives you. Makes you want to use Camel, right?
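As a rough illustration of the last-modified check and the zip check from the list above, here is a minimal sketch. The class and method names (UploadChecks, isQuiet, opensAsZip) are made up for this example, and the quiet period is whatever you pass in. Note that opening the archive only proves that the central directory at the end of the file is readable; it is not a full CRC check of every entry.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of any library.
final class UploadChecks {

    /** True if the file was last modified at least quietMillis ago. */
    static boolean isQuiet(Path file, long quietMillis) throws IOException {
        long lastModified = Files.getLastModifiedTime(file).toMillis();
        long age = System.currentTimeMillis() - lastModified;
        return age >= quietMillis;
    }

    /**
     * True if the file can be opened as a zip archive. A truncated upload
     * normally fails here because the central directory is written last.
     */
    static boolean opensAsZip(Path file) {
        try (ZipFile zip = new ZipFile(file.toFile())) {
            return zip.size() >= 0;
        } catch (IOException e) {
            // ZipException (a subclass of IOException) signals a broken archive.
            return false;
        }
    }
}
```

A poller could skip any file for which isQuiet(file, 10000L) is false and retry on the next cycle, then call opensAsZip before handing the file on.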

See answer: https://stackoverflow.com/a/5851185/92063 which mentions incron. You can use it to notify your application that a file system event has taken place.
A quote from the linked website:
incron :: inotify cron system
This program is an "inotify cron" system. It consists of a daemon and
a table manipulator. You can use it a similar way as the regular cron.
The difference is that the inotify cron handles filesystem events
rather than time periods.

You have no control over the sender, which is unfortunate, because the best solution would be the following (I will give another solution afterwards that doesn't require the sender to change anything).
The sender should rename the file when the upload is finished.
E.g. the file is named fileInProgress.txt during upload and fileFinished.txt when the upload is finished. You then restrict your Java program to only watch for files named *Finished.txt. This is the easiest and an absolutely reliable solution.
Your solution would be the following.
From your java program do a file-listing on the upload folder and store the file-sizes.
Wait for 10 secs (or longer if you want to be on the safe side).
Do a file-listing again.
All files that didn't change in size are finished and can be processed.
Note that this does not give you absolute certainty that the upload is finished, but it comes closer the longer the interval between your file size checks is.
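The steps above could be sketched roughly like this. StableSizeScanner and its method names are invented for the example; the wait between the two listings is left to the caller.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any library.
final class StableSizeScanner {

    /** Lists the regular files in dir together with their current sizes. */
    static Map<Path, Long> snapshot(Path dir) throws IOException {
        Map<Path, Long> sizes = new HashMap<Path, Long>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    sizes.put(entry, Files.size(entry));
                }
            }
        }
        return sizes;
    }

    /** Files present in both listings whose size did not change in between. */
    static List<Path> unchanged(Map<Path, Long> before, Map<Path, Long> after) {
        List<Path> stable = new ArrayList<Path>();
        for (Map.Entry<Path, Long> current : after.entrySet()) {
            Long previous = before.get(current.getKey());
            if (previous != null && previous.equals(current.getValue())) {
                stable.add(current.getKey());
            }
        }
        return stable;
    }
}
```

Typical use would be: take a snapshot, sleep for 10 seconds or more, take a second snapshot, and process only what unchanged(first, second) returns. Files that appeared between the two listings are picked up one cycle later.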

As David Roussel mentioned, Camel would be very useful for this. Take a look at initialDelay (among any others you may find useful) from File2 as this would place a specified delay before it polls the directory.
For any sort of file polling I have done, I have used Camel, as it makes these kinds of situations easier to handle.

Related

How to wait until whole files is downloaded from ftp server in Java?

One thread pool downloads files from the FTP server and another thread pool reads those files.
Both thread pools run concurrently. I'll explain what happens with an example.
Let's assume I have one CSV file with 100 records.
While threadPool-1 is downloading the file and writing it to the /pending folder, threadPool-2 reads the content of that same file. Assume that after 1 sec only 10 records have been written to the file in the /pending folder, so threadPool-2 reads only 10 records.
threadPool-2 doesn't know that the other 90 records are still being downloaded, so it never reads them, because it can't tell whether the whole file has been downloaded. After reading, it moves the file to another folder. So my remaining 90 records are never processed.
My question is: how do I wait until the whole file is downloaded, so that only then threadPool-2 reads its contents?
One more thing: both thread pools use the scheduleAtFixedRate method and run every 10 sec.
Please guide me on this.
I'm a fan of Mark Rotteveel's #6 suggestion (in comments above):
use a temporary name when downloading,
rename when download is complete.
That looks like:
FTP download threads write all files with some added extension – perhaps .pending – but name it whatever you want.
When a file is downloaded – say some.pdf – the FTP download thread writes the file to some.pdf.pending
When an FTP download thread completes a file, the last step is a file rename operation – this is the mechanism for ensuring only "done" files are ready to be processed. So it downloads the file to some.pdf.pending, then at the end, renames it to some.pdf.
Reader threads look for files, ignoring anything matching *.pending
I've built systems using this approach and they worked out well. In contrast, I've also worked with more complicated systems that tried to coordinate across threads, and those often did not work so well.
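For what it's worth, the temporary-name approach might look roughly like this. PendingDownload, save and isReady are names invented for this sketch, and the InputStream stands in for whatever your FTP client hands you. The atomic rename assumes the pending file and the final file are in the same directory, which they are here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of any library.
final class PendingDownload {
    static final String PENDING_SUFFIX = ".pending";

    /**
     * Writes the stream to "<name>.pending" and renames it to the final
     * name only after every byte has been written.
     */
    static Path save(InputStream source, Path target) throws IOException {
        Path pending = target.resolveSibling(target.getFileName() + PENDING_SUFFIX);
        Files.copy(source, pending, StandardCopyOption.REPLACE_EXISTING);
        return Files.move(pending, target, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Reader threads should skip anything this returns false for. */
    static boolean isReady(Path file) {
        return !file.getFileName().toString().endsWith(PENDING_SUFFIX);
    }
}
```

The download threads call save(...) for each file, and the reader threads filter their directory listing with isReady(...), so a half-written file is never visible under its final name.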
Over time, any software system will have bugs. Edsger Dijkstra captured this so well:
"If debugging is the process of removing software bugs, then programming must be the process of putting them in."
However difficult it is to reason about program correctness now – while the program is still in design phase, and has not yet been built – it will be harder to reason about correctness when things are broken in production (which will happen, because bugs).
That is, when things are broken and you're under time pressure to find the root cause (and fix it!), even the best of us would be at a disadvantage with a complicated (vs. simple) system.
The approach of using temporary names is simple to reason about, which should minimize code complexity and thus make it easier to implement.
In turn, maintenance and bug fixes should be easier, too.
Keep it simple – let the filesystem help you out.

Processing 100 MB+ files from FTP server in less than 30 sec?

Problem statement: an FTP server is flooded with files arriving at a rate of 100 Mbps (i.e. 12.5 MB/s); each file is approximately 100 MB. Files are deleted 30 sec after their creation timestamp. Any process interested in reading those files must take away the complete file in less than 30 sec. I am using Java to solve this particular problem.
I need suggestions on which design pattern would be best suited to this kind of problem. How would I make sure that each file is consumed before the server deletes it?
Your suggestion will be greatly appreciated. Thanks
If the Java application runs on the same machine as the FTP service, then it could use File.renameTo(File) or equivalent to move a required file out of the FTP server directory and into another directory. Then it can process it at its leisure. It could use a WatchService to monitor the FTP directory for newly arrived files. It should watch for events on the directory, and when a file starts to appear it should wait for the writes to stop happening. (Depending on the OS, you may or may not be able to move the file while the FTP service is writing to it.)
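A minimal sketch of the WatchService part, assuming Java 7 or later. ArrivalWatcher and its method names are invented for the example. Keep in mind that ENTRY_CREATE is raised when the file first appears, not when the FTP service has finished writing it, so the caller still has to wait for the writes to stop before moving the file.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of any library.
final class ArrivalWatcher {

    /** Registers dir for "file created" events and returns the watcher. */
    static WatchService watch(Path dir) throws IOException {
        WatchService watcher = dir.getFileSystem().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
        return watcher;
    }

    /**
     * Waits up to timeoutMillis for the next batch of events and returns the
     * newly created paths (empty if nothing arrived in time).
     */
    static List<Path> nextArrivals(WatchService watcher, Path dir, long timeoutMillis)
            throws InterruptedException {
        List<Path> arrivals = new ArrayList<Path>();
        WatchKey key = watcher.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (key == null) {
            return arrivals;
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            // OVERFLOW events carry no path, so only ENTRY_CREATE is handled.
            if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                arrivals.add(dir.resolve((Path) event.context()));
            }
        }
        key.reset(); // re-arm the key so later events are delivered
        return arrivals;
    }
}
```

Each returned path is a candidate only; once it has stopped growing it can be moved out of the FTP directory with Files.move or File.renameTo.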
There is a secondary issue of whether a Java application could keep up with the required processing rate. However, if you had multiple cores and multiple worker threads, then your app could potentially process them in parallel. (It depends on how computationally and/or I/O intensive the processing is, and on the rate at which a Java thread can read a file, which will be OS and possibly hardware dependent.)
If the Java application is not running on the FTP server, it would probably need to use FTP to fetch the file. I am doubtful that you could implement something to do that consistently and reliably; i.e. without losing files occasionally.

Concurrency while reading files from file system

We have an application that reads files from a particular folder, processes them and copies them (with some business logic) to another folder.
The problem here is that when there is a very large number of files to be processed, running a single instance of the application or a single thread is no longer enough to process these files.
One approach we have for this is to start multiple instances of the application (I feel something is wrong with this approach; suggest an alternative if there is one).
Whether we spawn threads or start multiple instances of the application, care should be taken that if one thread reads a file and starts processing it, another thread does not pick it up.
We are trying to achieve this by having a database table with the list of file names in the folder, so that when a thread first reads the table for a file name, we change the status to in-process or completed and pessimistically lock the table so that other threads cannot read it.
Is there any better solution to the problem ?
You can use most of your existing implementation as the front-end processor to feed file streams to worker threads that you can start/stop as demand dictates. Only the front-end thread opens files, so there is no possibility of one worker interfering with another.
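One possible shape for this, as a sketch only: a single front-end method lists the directory, reads each file, and hands the content to a worker pool. FrontEndDispatcher, dispatch and process are invented names, process is a placeholder for the real business logic, and reading whole files into memory assumes they are small enough for that.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical helper, not part of any library.
final class FrontEndDispatcher {

    /** Placeholder for the real processing; here it just counts bytes. */
    static long process(byte[] content) {
        return content.length;
    }

    /**
     * Runs on the single front-end thread: it alone opens the files, so no
     * two workers can ever pick up the same file.
     */
    static List<Future<Long>> dispatch(Path dir, ExecutorService workers) throws IOException {
        List<Future<Long>> results = new ArrayList<Future<Long>>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path file : entries) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                final byte[] content = Files.readAllBytes(file);
                results.add(workers.submit(new Callable<Long>() {
                    @Override
                    public Long call() {
                        return process(content);
                    }
                }));
            }
        }
        return results;
    }
}
```

The pool size is then the only thing you scale up or down; for example, Executors.newFixedThreadPool(4) gives four workers without any locking table.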
EDIT: Added the word 'no' as it changes the meaning quite a bit...
Also have a look at JDK 7. It has a new file I/O API and a fork/join framework which might help.
Take a look at Apache Camel (http://camel.apache.org), and its File component (http://camel.apache.org/file2.html). Using Camel allows you to very easily define a set of processing instructions to consume files in a directory atomically, and also to configure a thread pool to deal with multiple files at the same time. Camel in Action's a great book to get you started.
What you describe reminds me of the classical style to develop on UNIX.
In this classical style, you would move a file to a work-in-progress directory so that other processes do not pick it up. In general, you could use one directory per processing state and then move files from state to state.
This works essentially because file moves are atomic (at least on Unix systems and NTFS).
What is nice with this approach is that it is pretty easy to handle problematic situations like crashes, and it automatically has a nice management interface everyone is familiar with (the filesystem GUI, ls, Windows Explorer, ...).
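In Java 7 terms, claiming a file by moving it could be sketched as follows. DirectoryStateMachine and claim are invented names. The sketch assumes both directories live on the same filesystem; otherwise the atomic move is not available and Files.move throws AtomicMoveNotSupportedException.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of any library.
final class DirectoryStateMachine {

    /**
     * Tries to claim a file by moving it into workDir.
     * Returns the new path, or null if another worker moved it first.
     */
    static Path claim(Path file, Path workDir) throws IOException {
        Path claimed = workDir.resolve(file.getFileName());
        try {
            return Files.move(file, claimed, StandardCopyOption.ATOMIC_MOVE);
        } catch (NoSuchFileException alreadyTaken) {
            // The source vanished: someone else won the race for this file.
            return null;
        }
    }
}
```

Each worker calls claim(file, workDir) and only processes the file when the result is non-null; after a crash, whatever is still sitting in the work directory shows exactly which files were in flight.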

How can I process multiple files concurrently?

I have a scenario where web archive files (warc) are being dropped by a crawler periodically in different directories. Each warc file internally consists of thousands of HTML files.
Now, I need to build a framework to process these files efficiently. I know Java doesn't scale in terms of parallel processing of I/O. What I'm thinking is to have a monitor thread which scans these directories, picks up the file names and drops them into an ExecutorService or some Java blocking queue. A bunch of worker threads (maybe a small number, because of the I/O issue) listening under the executor service will read the files, read the HTML files within and do the respective processing. This is to make sure that threads do not fight for the same file.
Is this the right approach in terms of performance and scalability? Also, how should the files be handled once they are processed? Ideally, the files should be moved or tagged so that they are not picked up by a thread again. Can this be handled through Future objects?
In recent versions of Java (starting from Java 7, in the java.nio.file package) there are already built-in file change notification services as part of the NIO library. You might want to check this out first instead of going on your own.
My key recommendation is to avoid re-inventing the wheel unless you have some specific requirement.
If you're using Java 7, you can take advantage of the WatchService (as suggested by Simeon G).
If you're restricted to Java 6 or earlier, these services aren't available in the JRE. However, Apache Commons-IO provides file monitoring (the org.apache.commons.io.monitor package).
As an advantage over Java 7, Commons-IO monitors will create a thread for you that raises events against the registered callback. With Java 7, you will need to poll the event list yourself.
Once you have the events, your suggestion of using an ExecutorService to process files off-thread is a good one. Moving files is supported by Java IO and you can just ignore any delete events that are raised.
I've used this model in the past with success.
Here are some things to watch out for:
The new file event will likely be raised as soon as the file exists in the directory. HOWEVER, data may still be being written to it at that point. Consider reasonable expectations for file size and how long you need to wait until a file is considered 'whole'.
What is the maximum amount of time you must spend on a file?
Make your executor service parameters tweakable via config - this will simplify your performance testing
Hope this helps. Good luck.

How to check for a dynamically created file in Java?

I have an application where I need to check for a file which may be created dynamically during my execution, I will give up after some MAX time where the file has yet to show up. I wanted to know if there was a more efficient method in Java of checking for the file other than polling for it and then sleeping every X seconds? If not what would be the most efficient manner of doing this?
You currently have to poll the file system as you mentioned. Java 7 is supposed to have file system notifications, so this should get easier at some point.
If the same program is doing the creation of the file as the polling, you could instead have the logic that creates the file notify the part of the program using Object.notify(). A general description of the wait() and notify/notifyAll() mechanism can be found here: http://java.sun.com/docs/books/tutorial/essential/concurrency/guardmeth.html
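A minimal guarded-block sketch of that idea, with a maximum wait as in the question. FileReadySignal, markReady and awaitReady are invented names; the part of the program that creates the file would call markReady() once it has finished writing.

```java
// Hypothetical helper, not part of any library.
final class FileReadySignal {
    private final Object lock = new Object();
    private boolean ready; // guarded by lock

    /** Called by the code that creates the file, once it is complete. */
    void markReady() {
        synchronized (lock) {
            ready = true;
            lock.notifyAll();
        }
    }

    /**
     * Blocks until markReady() has been called or maxWaitMillis has passed.
     * Returns true if the file is ready, false if we gave up.
     */
    boolean awaitReady(long maxWaitMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxWaitMillis;
        synchronized (lock) {
            // Loop to guard against spurious wakeups.
            while (!ready) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false;
                }
                lock.wait(remaining);
            }
            return true;
        }
    }
}
```

This only works when the creator and the waiter share the same FileReadySignal instance inside one JVM; for files created by another process you are back to polling or file system notifications.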
You could try JPoller to poll for the file changes.
If you are running on Windows, you can get directory change notifications; see Obtaining Directory Change Notifications. Of course, this is not cross-platform, and will require use of JNA or a similar native bridge. In fact, JNA offers such a class, the FileMonitor class (in the download), which uses the underlying platform's file change notification.
If you are watching a handful of files or fewer, then of course polling is unlikely to be a performance problem; it's just not a "feel-good" solution, but not so bad as to warrant the pain of a non-pure-Java solution. Monitoring directories containing thousands of files, on the other hand, would benefit from direct notification from the OS.
