Processing 100 MB+ files from FTP server in less than 30 sec?

Problem statement: an FTP server is flooded with files arriving at a rate of 100 Mbps (i.e. 12.5 MB/s); each file is approximately 100 MB. Files are deleted 30 seconds after their creation timestamp. Any process that wants to read a file must retrieve the complete file in less than 30 seconds. I am using Java to solve this problem.
I need suggestions on which design pattern would be best suited to this kind of problem. How can I make sure that each file is consumed before the server deletes it?
Your suggestions will be greatly appreciated. Thanks.

If the Java application runs on the same machine as the FTP service, then it could use File.renameTo(File) or equivalent to move a required file out of the FTP server directory and into another directory. Then it can process it at its leisure. It could use a WatchService to monitor the FTP directory for newly arrived files. It should watch for events on the directory, and when a file starts to appear it should wait for the writes to stop happening. (Depending on the OS, you may or may not be able to move the file while the FTP service is writing to it.)
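As a rough illustration of the "move it out of the FTP directory" step, here is a minimal sketch using java.nio.file.Files.move. The class name FtpFileClaimer and the work directory are hypothetical, and the sketch assumes both directories are on the same file system and that the FTP service has already finished writing the file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: names are illustrative, not from the original answer.
final class FtpFileClaimer {

    /**
     * Moves a finished file out of the FTP directory into workDir so the
     * server's cleanup can no longer delete it. Returns the new location.
     */
    static Path claim(Path file, Path workDir) throws IOException {
        Files.createDirectories(workDir);
        Path target = workDir.resolve(file.getFileName());
        // ATOMIC_MOVE throws AtomicMoveNotSupportedException rather than
        // falling back to a slow copy-then-delete (e.g. across file systems).
        return Files.move(file, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Requesting ATOMIC_MOVE is a deliberate choice here: a copy-then-delete of a 100 MB file could itself run into the 30-second window, so it seems safer to fail loudly if a cheap rename is not possible.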
There is a secondary issue of whether a Java application could keep up with the required processing rate. However, if you had multiple cores and multiple worker threads, then your app could potentially process the files in parallel. (It depends on how computationally and/or I/O intensive the processing is, and on the rate at which a Java thread can read a file, which will be OS and possibly hardware dependent.)
If the Java application is not running on the FTP server, it would probably need to use FTP to fetch the file. I am doubtful that you could implement something to do that consistently and reliably; i.e. without losing files occasionally.

Related

How to wait until the whole file is downloaded from an FTP server in Java?

One thread pool downloads files from the FTP server and another thread pool reads the downloaded files.
Both thread pools run concurrently. I'll explain what happens with an example.
Let's assume I have one CSV file with 100 records.
While threadPool-1 is downloading the file and writing it to the /pending folder, threadPool-2 reads the content of that same file. Assume that in 1 second only 10 records have been written to the file in /pending, so threadPool-2 reads only 10 records.
threadPool-2 doesn't know that the other 90 records are still being downloaded, so it never reads them, because it cannot tell whether the whole file has been downloaded. After reading, it moves the file to another folder, so my remaining 90 records are never processed.
My question is: how can I wait until the whole file is downloaded, so that only then threadPool-2 reads its contents?
One more thing: both thread pools use the scheduleAtFixedRate method and run every 10 seconds.
Please guide me on this.

I'm a fan of Mark Rotteveel's #6 suggestion (in comments above):
use a temporary name when downloading,
rename when download is complete.
That looks like:
FTP download threads write all files with some added extension – perhaps .pending – but name it whatever you want.
When a file is downloaded – say some.pdf – the FTP download thread writes the file to some.pdf.pending
When an FTP download thread completes a file, the last step is a file rename operation – this is the mechanism for ensuring only "done" files are ready to be processed. So it downloads the file to some.pdf.pending, then at the end, renames it to some.pdf.
Reader threads look for files, ignoring anything matching *.pending
I've built systems using this approach and they worked out well. In contrast, I've also worked with more complicated systems that tried to coordinate across threads, and those often did not work so well.
Over time, any software system will have bugs. Edsger Dijkstra captured this so well:
"If debugging is the process of removing software bugs, then programming must be the process of putting them in."
However difficult it is to reason about program correctness now, while the program is still in the design phase and has not yet been built, it will be harder to reason about correctness when things are broken in production (which will happen, because bugs).
That is, when things are broken and you're under time pressure to find the root cause (and fix it!), even the best of us would be at a disadvantage with a complicated (vs. simple) system.
The approach of using temporary names is simple to reason about, which should minimize code complexity and thus make it easier to implement.
In turn, maintenance and bug fixes should be easier, too.
Keep it simple – let the filesystem help you out.
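The temporary-name approach above can be sketched roughly as follows. This is a minimal illustration, not the code from the systems mentioned: the class name PendingFiles is hypothetical, and the InputStream stands in for whatever stream your FTP client library gives you.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: names are illustrative only.
final class PendingFiles {
    static final String SUFFIX = ".pending";

    /** Downloader side: write to name.pending, rename to name when complete. */
    static Path download(InputStream in, Path dir, String name) throws IOException {
        Path temp = dir.resolve(name + SUFFIX);
        Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
        // Rename within the same directory: readers see either no file
        // under the final name or the whole file, never a partial one.
        return Files.move(temp, dir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
    }

    /** Reader side: list regular files, ignoring anything matching *.pending. */
    static List<Path> readyFiles(Path dir) throws IOException {
        List<Path> ready = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                if (Files.isRegularFile(p)
                        && !p.getFileName().toString().endsWith(SUFFIX)) {
                    ready.add(p);
                }
            }
        }
        return ready;
    }
}
```

The download threads would call download(...) and the reader threads would call readyFiles(...) on each scheduled run; no other coordination between the two pools is assumed.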

Incomplete files moved on Windows NFS

I have a Unix system mounting an NFS "share" from a Windows server. On the Windows server, a PowerShell script checks every 10 s whether a new file has arrived on the NFS share and uses Move-Item to move it elsewhere, where it is processed further.
What we are seeing is that files get corrupted in this process. My hunch is that the NFS write takes a little longer than expected, so the script picks up an incomplete file and moves it to the other folder with Move-Item. A colleague has a theory that the further processing picks up the file before Move-Item has completed. I do not believe that theory, because Move-Item within the same file system should be an atomic, metadata-only operation. (Don't be confused by the NFS reference: the Windows server has these files locally and the NFS share is mounted by the Unix system, so Move-Item does not involve NFS and, in my case, doesn't cross file system boundaries either.)
Either way, I want to know why writing the file over NFS, which is done by a Java process on Unix, does not lock the file on the Windows host file system. Would I have to explicitly set an NFS lock from Java somehow? Does Java even support fcntl-style locks?
Also, if I used the PowerShell copy command rather than Move-Item, there would be a period during which the destination file is only partially copied. Doesn't the copy command automatically hold a lock on the destination file until it is finished?
EDIT: This is getting more and more puzzling. First I tried locking the file explicitly while writing to NFS. This is Java, and it creates a huge problem with NFS: I couldn't get the nlockmgr service to work. There is a firewall between the two hosts; I allowed all the right traffic through and still get no response to the lock requests from the Windows NFS server. This causes the Java side to hang completely, so badly that you can't even kill -KILL the JVM. The only way to end this nightmare is to reboot the Unix system, crazy! There also isn't a timeout on the lock request, which is a big problem in Java. I have seen similar problems elsewhere, such as reading from a socket, where you can't kill a thread that hangs on the read. In any case, there is no way to cancel a lock request, so I gave up on that.
Then I added a filter to the PowerShell script so that it only moves files whose last-write time is more than 10 seconds in the past. That should leave more than enough time for the writer to finish, but apparently it doesn't help either.
UPDATE: I have now watched it, and the copy process on Unix from S3 to NFS to Windows NTFS does take a long time, even though it all runs on AWS, so even S3 should be fast. Yet the file size crawls from 0 kB to 64 kB to 90 kB, and 10 seconds is not long enough to wait between chunks. I increased the wait time to 30 seconds and that seems to work, but it is not guaranteed.
Locking would be the right solution, but I have two major obstacles:
I can't get the Windows NFS "share", mounted on Unix, to work with the nlockmgr service.
The JVM stalls completely and becomes unkillable if nlockmgr has a problem.

Apache Commons IO File Monitoring vs. JDK WatchService

I need to develop an application that will process CSV files as soon as they are created in a predefined directory. A huge number of incoming files is expected.
I have seen applications using Apache Commons IO File Monitoring in production. It works pretty well; I have seen it process as many as 21 million files in a day. It seems Apache Commons IO File Monitoring polls the directory and calls listFiles to find the files to process.
My question:
Is JDK WatchService as good an option as Apache Commons IO File Monitoring? Does anyone know of any pros and cons?

Since I asked this question, I have gained some more insight into the matter, so I am answering it for those who might have a similar question.
Apache Commons IO monitoring uses a polling mechanism with a configurable polling interval. On every poll, it calls the listFiles() method of the File class and compares the result with the listFiles() output of the previous iteration to identify file creation, modification and deletion. The algorithm is robust and I have never seen it miss a file. It works well even with a large volume of files. However, since it polls and invokes listFiles on every iteration, it consumes unnecessary CPU cycles if the inflow of files is low. It works even on network drives.
JDK WatchService does not need polling. It is event based: it is triggered only when an event occurs, and hence requires less CPU if the inflow of files is low. If the inflow is heavy and events are processed at a slower rate than the rate at which they occur, there is a chance of event overflow. Additionally, it will not work with network drives.
Hence, in conclusion, if the file inflow is continuous and huge, it is better to go for Apache Commons IO File Monitoring. Otherwise, JDK WatchService is a good option.
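To make the polling idea concrete, here is a minimal sketch of "list the directory and compare with the previous listing" using only the JDK. It is not Commons IO's actual implementation: the class name DirectoryPoller is hypothetical, and it detects only newly created names, not modifications or deletions.

```java
import java.io.File;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of snapshot-diff polling; not Commons IO code.
final class DirectoryPoller {
    private final File dir;
    private Set<String> previous = new HashSet<>();

    DirectoryPoller(File dir) {
        this.dir = dir;
    }

    /** Returns the file names that appeared since the previous poll. */
    Set<String> pollCreated() {
        Set<String> current = new HashSet<>();
        File[] files = dir.listFiles(); // null if dir is missing or unreadable
        if (files != null) {
            for (File f : files) {
                current.add(f.getName());
            }
        }
        Set<String> created = new TreeSet<>(current);
        created.removeAll(previous); // keep only names not seen last time
        previous = current;
        return created;
    }
}
```

Calling pollCreated() from a scheduled task gives the configurable polling interval described above; the cost of listing the whole directory on every call is the CPU overhead the answer refers to.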

Java application needs exclusive access to file being delivered by sftp

Environment: Java 7 on an Ubuntu 12 server.
I have a Java application that polls for incoming .zip files that are delivered via sftp. I have no control over the client that's delivering the files.
The files being delivered are quite large, and in some cases, the poll mechanism detects a file while it's still being written. In this situation, the Java application borks because it thinks the file is corrupt.
What's the most effective way of detecting when the local sftp server has finished writing the file?

There are a number of approaches to dealing with this. You can choose one, but the more you implement the better:
The sender should upload as a .tmp file, then rename to .zip once done so that the watcher only sees the finished file.
The watcher should check the last modified time of the file, and if it was modified in the last 10 seconds (maybe 1 minute) then ignore the file and try again later.
If your OS supports it, try to get an exclusive lock on the file before reading it. This is not so easy in Java, and depends on OS specifics.
Always send the file as a zip file, because if the file is incomplete or otherwise corrupted it will fail the CRC check. You also get the added benefit of smaller transfers, a smaller archive folder, etc. (Of course you are already doing this, as mentioned in the question.)
Look at the File2 component of Camel and at all the options it gives you. Makes you want to use Camel, right?
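The zip suggestion in the list above can be checked from Java roughly like this. It is a sketch under assumptions: the class name ZipCheck is hypothetical, and it relies on the fact that a zip's central directory sits at the end of the file, so a partially written archive typically fails to open at all. The CRC of each entry is recomputed explicitly rather than assuming the library verifies it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: names are illustrative only.
final class ZipCheck {

    /** Returns true if the archive opens and every entry matches its stored CRC. */
    static boolean isCompleteZip(Path zip) {
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            byte[] buf = new byte[8192];
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                CRC32 crc = new CRC32();
                try (InputStream in = zf.getInputStream(entry)) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        crc.update(buf, 0, n);
                    }
                }
                // getCrc() is -1 when the stored CRC is unknown; skip those.
                if (entry.getCrc() != -1 && entry.getCrc() != crc.getValue()) {
                    return false;
                }
            }
            return true;
        } catch (IOException e) {
            // Includes ZipException for truncated or corrupt archives.
            return false;
        }
    }
}
```

A poller could call isCompleteZip(...) and simply retry on the next cycle when it returns false, which combines naturally with the last-modified check.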

See answer: https://stackoverflow.com/a/5851185/92063 which mentions incron. You can use it to notify your application that a file system event has taken place.
A quote from the linked website:
incron :: inotify cron system
This program is an "inotify cron" system. It consists of a daemon and
a table manipulator. You can use it a similar way as the regular cron.
The difference is that the inotify cron handles filesystem events
rather than time periods.

You have no control over the sender, that is unfortunate because the best solution would be the following (I will give another solution afterwards which doesn't require the sender to change anything).
The sender should rename the file when the upload is finished.
E.g. the file is named fileInProgress.txt during upload and fileFinished.txt when the upload is finished. You then restrict your Java program to only watch for files named *Finished.txt. This is the easiest and an absolutely reliable solution.
Since you cannot change the sender, your solution would be the following.
From your java program do a file-listing on the upload folder and store the file-sizes.
Wait for 10 seconds (or longer if you want to be on the safe side).
Do a file-listing again.
All files that didn't change in size are finished and can be processed.
Note that this does not give you absolute certainty that the upload is finished, but it comes closer the longer the interval between your file-size checks is.
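The file-size comparison described above could be sketched as follows. The class and method names are hypothetical; the caller is assumed to take one snapshot, wait (for example with Thread.sleep or on the next scheduled poll), take a second snapshot, and then compare the two.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: names are illustrative only.
final class StableSizeCheck {

    /** Records the current size of every regular file in dir. */
    static Map<Path, Long> snapshot(Path dir) throws IOException {
        Map<Path, Long> sizes = new HashMap<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                if (Files.isRegularFile(p)) {
                    sizes.put(p, Files.size(p));
                }
            }
        }
        return sizes;
    }

    /** Files present in both snapshots whose size did not change. */
    static List<Path> unchanged(Map<Path, Long> before, Map<Path, Long> after) {
        List<Path> stable = new ArrayList<>();
        for (Map.Entry<Path, Long> e : after.entrySet()) {
            // before.get(...) is null for files that appeared in between,
            // so brand-new files are never reported as stable.
            if (e.getValue().equals(before.get(e.getKey()))) {
                stable.add(e.getKey());
            }
        }
        return stable;
    }
}
```

As the answer says, this is a heuristic: a stalled upload looks the same as a finished one, so a longer wait between snapshots only lowers the risk.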

As David Roussel mentioned, Camel would be very useful for this. Take a look at initialDelay (among any others you may find useful) from File2 as this would place a specified delay before it polls the directory.
For any sort of file polling I have done, I have used Camel, as it makes these kinds of situations easier to handle.

Measuring Thread I/O in Java

I want to be able to measure thread I/O through code in my live, running application. So far the best (and only) solution I have found is this one, which requires me to hook directly into Windows' performance monitor. However, it seems very complex and there must be a simpler way to do this. I don't mind writing different code for Windows and Linux; to be honest, I was expecting to.
Thanks for your help.
Update:
What I mean by disk I/O is this: if you open Windows Resource Monitor and go to the Disk tab, you can look at each running process and see its average read/write rate in B/sec over the last minute. I want to get the same data, but per thread.

Take a look at Hyperic Sigar. It is a cross-platform native resource Java API.
The class FileSystemUsage gives you a bunch of stats about file system usage (obvious...) such as
Disk Queue
Disk Writes / Disk Bytes Written
Disk Reads / Disk Bytes Read
Most of them are cumulative, but you just compute a delta every n seconds to get a rate.
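The delta computation mentioned above is simple arithmetic: rate = (current reading - previous reading) / elapsed seconds. A minimal sketch follows; the class name RateSampler is hypothetical, and the cumulative reading is passed in by the caller, so no Sigar call is shown or assumed here.

```java
// Hypothetical helper: turns successive cumulative counter readings into a rate.
final class RateSampler {
    private long lastValue;
    private long lastNanos;
    private boolean primed = false;

    /**
     * Feeds one cumulative reading taken at nowNanos (e.g. System.nanoTime()).
     * Returns units per second since the previous reading, or NaN for the
     * first reading, when there is nothing to compare against yet.
     */
    double sample(long cumulativeValue, long nowNanos) {
        double rate = Double.NaN;
        if (primed && nowNanos > lastNanos) {
            double seconds = (nowNanos - lastNanos) / 1_000_000_000.0;
            rate = (cumulativeValue - lastValue) / seconds;
        }
        lastValue = cumulativeValue;
        lastNanos = nowNanos;
        primed = true;
        return rate;
    }
}
```

For example, a reading of 1000 bytes followed two seconds later by a reading of 3000 bytes gives (3000 - 1000) / 2 = 1000 bytes per second.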
There's also a javaagent for profiling/monitoring file IO called JPicus. I have not looked at it in a while, but it may be useful.
