Concurrent common file download in Java

I am working on a concurrent file download process, but I am not sure what approach to take.
About:
An application bundles a bunch of files together into a zip file. The files are usually available on the hard drive in a common location (for example /tmp). However, there are cases when files are not there and need to be downloaded from a remote HTTP server.
Question:
How can I download multiple files concurrently and ensure that NO other thread (bundling files) downloads the same file at the same time?
Moreover, how can I ensure that in case of multiple applications running at the same time (remember that the files are all located in a common location), no instance of the application downloads the same file at the same time?
Please describe a strategy and perhaps a way to implement it. Perhaps a solution to this issue already exists.
Thank you!

You could use a queue or DB to track the files that need downloading: keep a 'status' column, and have a thread mark a file as 'fetching' before it starts and as 'done' when it finishes. Keep a last-change timestamp, and if a file has been downloading for a long time, stop or restart the download.
Using a database for this file queue can also ensure that other apps don't fetch the same file multiple times (and lets you persist downloads across restarts). You can likewise have multiple downloads running, and the DB can be used to track download speed, progress, etc.
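For illustration, a hedged sketch of the claim step with plain JDBC. The table and column names (and the NOW() function, which is MySQL/PostgreSQL syntax) are assumptions; the point is that the atomic UPDATE lets only one thread or app instance flip a file from NEW to FETCHING:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    class DownloadQueue {
        // Assumed schema: files(name VARCHAR PRIMARY KEY, status VARCHAR, updated_at TIMESTAMP)
        private final Connection conn;

        DownloadQueue(Connection conn) { this.conn = conn; }

        /** Returns true if this caller won the right to download the file. */
        boolean tryClaim(String fileName) throws SQLException {
            String sql = "UPDATE files SET status = 'FETCHING', updated_at = NOW() "
                       + "WHERE name = ? AND status = 'NEW'";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, fileName);
                return ps.executeUpdate() == 1; // 0 rows means someone else claimed it first
            }
        }

        void markDone(String fileName) throws SQLException {
            String sql = "UPDATE files SET status = 'DONE', updated_at = NOW() WHERE name = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, fileName);
                ps.executeUpdate();
            }
        }
    }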
In the future, your question should include specific code and a specific problem. As written, it is very broad and invites a discussion (better suited for chat) rather than a single answer someone else might reuse.

Here is a possible strategy:
In case of a single app: have some sort of dispatcher thread which reads work from a queue (this could be a persistent queue too, such as a DB table) and spawns a new thread for each item that was read from the queue. By read I mean read and remove from the queue.
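A minimal sketch of such a dispatcher, assuming a BlockingQueue of file names and a fixed worker pool (names and pool size are placeholders):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class Dispatcher implements Runnable {
        private final BlockingQueue<String> workQueue;                 // file names to fetch
        private final ExecutorService workers = Executors.newFixedThreadPool(4);

        Dispatcher(BlockingQueue<String> workQueue) { this.workQueue = workQueue; }

        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    String item = workQueue.take();                    // read AND remove
                    workers.submit(() -> download(item));              // one task per item
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                workers.shutdown();
            }
        }

        private void download(String fileName) { /* fetch over HTTP, write to /tmp */ }
    }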
In case of multiple apps: have that queue stored in a shared DB (or any other shared storage). There may then be a separate, single dispatcher app which just reads work, or portions of work, from the DB and hands it to worker apps. Each worker app asks the dispatcher app for work; this ensures that only the dispatcher app reads from the DB (or whatever central storage you decide to use), which in turn eliminates the need to synchronize access to the permanent storage.

Related

Multithreaded access to files in Java

I'm working on a multithreaded server in Java.
The server monitors a directory of files. Clients can ask the server:
to download a file from the server directory
to upload a new version of an already existing file to the server, overwriting the old version in the server directory.
To do the transfers, I'm planning to use FileChannels and SocketChannels, using the methods transferFrom and transferTo. According to the documentation, these two methods are thread safe.
The thing is that a single call to these two functions might not be sufficient to read/write the file entirely.
The problem arises when there is more than one request for the same file at the same time. In that scenario, multiple threads could be doing read/write operations on the same file, and since each individual transferFrom/transferTo call may move only part of the file, calls from different threads can interleave. If thread A is serving a download request and thread B is serving an upload request for the same file, the following can happen:
Thread A starts reading from the file
In thread A, for some reason the read call returns before the EOF
Thread B overwrites the entire file with a single write call
Thread A continues reading from the file
In this case, the downloading client receives a portion of the old version and a portion of the new version.
To solve this I think I should be using some sort of locking, but I'm not sure how to do it in an efficient way. I could create two synchronized methods for reading and writing, but that obviously creates too much contention.
The best solution I have in mind is to use lock striping: before doing any read/write operation, a hash based on the filename is calculated, and then the lock at position lockArr[hash % numOfLocks] is acquired.
I also think I should be using ReadWriteLocks, since multiple simultaneous reads should be allowed.
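Roughly what I have in mind (class name and stripe count are placeholders):

    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    class StripedFileLocks {
        private static final int NUM_LOCKS = 64; // stripe count, chosen arbitrarily
        private final ReadWriteLock[] locks = new ReadWriteLock[NUM_LOCKS];

        StripedFileLocks() {
            for (int i = 0; i < NUM_LOCKS; i++) {
                locks[i] = new ReentrantReadWriteLock();
            }
        }

        private ReadWriteLock lockFor(String fileName) {
            // Math.floorMod avoids a negative index for negative hash codes
            return locks[Math.floorMod(fileName.hashCode(), NUM_LOCKS)];
        }

        void read(String fileName, Runnable readOp) {
            ReadWriteLock lock = lockFor(fileName);
            lock.readLock().lock();   // many readers may hold this at once
            try { readOp.run(); } finally { lock.readLock().unlock(); }
        }

        void write(String fileName, Runnable writeOp) {
            ReadWriteLock lock = lockFor(fileName);
            lock.writeLock().lock();  // exclusive: blocks readers and writers
            try { writeOp.run(); } finally { lock.writeLock().unlock(); }
        }
    }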
Now, this is my analysis of the problem and I could be completely wrong. Is there any better solution to this?
Locking means that somebody has to wait for somebody else -- not the best solution.
When the client uploads a file, you should write it out to a temp file on the same disk (usually in the same directory), and then when the file upload is done:
Rename the old version to a temporary name. Any current readers should be forced to close the old one, re-open the temp version, and seek to the correct position.
Rename the uploaded file to the target file name.
Delete the temp version of the old file when any readers are done with it.
In a typical implementation, you'd need a centralized class (let's call it ConcurrentFileAccessor) to manage the interactions between threads.
Readers would need to register with this class, and synchronize on some object during the actual read operation. When an upload completes, the writer would have to claim all those locks to block reads, close all the read files, rename the old version, reopen, seek, and then release them to allow the readers to continue.
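For illustration, a minimal sketch of the rename dance with java.nio (class name and the .old suffix are made up; note that whether ATOMIC_MOVE replaces an existing target is platform specific, which is why the old version is parked first):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    class AtomicReplace {
        // 'uploadTmp' must live on the same filesystem (e.g. the same
        // directory) as 'target' for the rename to be atomic.
        static void publish(Path uploadTmp, Path target) throws IOException {
            Path oldCopy = target.resolveSibling(target.getFileName() + ".old");
            // 1. Park the old version under a temporary name.
            Files.move(target, oldCopy, StandardCopyOption.REPLACE_EXISTING);
            // 2. Publish the uploaded file under the target name.
            Files.move(uploadTmp, target, StandardCopyOption.ATOMIC_MOVE);
            // 3. Delete 'oldCopy' only after all registered readers have
            //    re-opened 'target' and sought back to their positions.
        }
    }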

Java File watcher with multiple JVM watching single directory for incoming files

I have a situation where two Java applications are watching a directory for incoming files. Say there is a directory DIR that is being watched by two JVM processes for any files with the extension .SGL.
The problem we face here is that sometimes both nodes are notified about a new file and both try to process the same file.
Usually we handle these situations using a database: each node tries to insert into a table with a unique file-name column, and only the one that succeeds continues processing.
But in this situation, we don't have a database.
What is the best way to handle this kind of problem? Can we depend on file-renaming solutions? Is file renaming an atomic operation?
For such a situation Spring Integration suggests FileSystemPersistentAcceptOnceFileListFilter: https://docs.spring.io/spring-integration/reference/html/files.html#file-reading
Stores "seen" files in a MetadataStore to survive application restarts.
The default key is 'prefix' plus the absolute file name; value is the timestamp of the file.
Files are deemed as already 'seen' if they exist in the store and have the
same modified time as the current file.
When you have a shared persistent MetadataStore for all your application instances, only one of them will process a given file; all the others will just filter it out.
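A hedged sketch of using that filter directly, outside a full Spring Integration flow (the prefix, directory, and class name are placeholders; the choice of shared store is up to you):

    import java.io.File;
    import java.util.List;
    import org.springframework.integration.file.filters.FileSystemPersistentAcceptOnceFileListFilter;
    import org.springframework.integration.metadata.ConcurrentMetadataStore;

    class SglPoller {
        private final FileSystemPersistentAcceptOnceFileListFilter filter;

        // The store must be a shared implementation (e.g. JdbcMetadataStore or
        // RedisMetadataStore) for cross-JVM de-duplication; the in-memory
        // SimpleMetadataStore would only de-duplicate within one JVM.
        SglPoller(ConcurrentMetadataStore sharedStore) {
            this.filter = new FileSystemPersistentAcceptOnceFileListFilter(sharedStore, "sgl-");
        }

        void poll(File dir) {
            File[] candidates = dir.listFiles((d, name) -> name.endsWith(".SGL"));
            if (candidates == null) return;
            List<File> fresh = filter.filterFiles(candidates); // not-yet-seen files only
            for (File f : fresh) {
                process(f);
            }
        }

        private void process(File f) { /* application-specific work */ }
    }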
Every watcher (even two in the same JVM) should always be notified of the new File being added.
If you want to divide the work, you can either
use one JVM to run twice as many threads and divide the work via a queue.
use an operation which will only succeed for one JVM. e.g.
file rename
create a lock file
lock the file itself
Is file renaming an atomic operation?
Yes. Only one process can successfully rename a file, even if both attempt to rename it to the same name.
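For example, a minimal claim-by-rename sketch (the .inprogress suffix is made up); whichever JVM's move succeeds owns the file:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    class RenameClaim {
        /** Returns true if this JVM won the race to claim the file. */
        static boolean claim(Path file) {
            Path claimed = file.resolveSibling(file.getFileName() + ".inprogress");
            try {
                Files.move(file, claimed, StandardCopyOption.ATOMIC_MOVE);
                return true;  // we renamed it; the file is ours to process
            } catch (IOException lostRace) {
                return false; // another JVM renamed it first
            }
        }
    }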

Concurrency while reading files from file system

We have an application that reads files from a particular folder, processes them and copies them (after some business logic) to another folder.
The problem here is that when there is a very large number of files to be processed, running a single instance of the application or a single thread is no longer enough to process these files.
One approach we have for this is to start multiple instances of the application (I feel something is wrong with this approach; suggest an alternative if there is one).
Whether spawning threads or starting multiple instances of the application, care should be taken that if one thread reads a file and starts processing it, another thread does not pick it up.
We are trying to achieve this by having a database table with the list of file names in the folder, so that when a thread first reads a file name from the table, we change the status to in-process or completed and pessimistically lock the table so that other threads cannot read it.
Is there any better solution to the problem?
You can use most of your existing implementation as the front-end processor that feeds file streams to worker threads, which you can start/stop as demand dictates. Only the front-end thread opens files, so there is no possibility of one worker interfering with another.
EDIT: Added the word 'no' as it changes the meaning quite a bit...
Also have a look at JDK 7. It has a new file I/O API and a fork/join framework which might help.
Take a look at Apache Camel (http://camel.apache.org), and its File component (http://camel.apache.org/file2.html). Using Camel allows you to very easily define a set of processing instructions to consume files in a directory atomically, and also to configure a thread pool to deal with multiple files at the same time. Camel in Action's a great book to get you started.
What you describe reminds me of the classical style of development on UNIX.
In this classical style, you would move a file to a work-in-progress directory so that other processes do not pick it up. In general, you could use one directory per processing state and then move files from state to state.
This works essentially because file moves are atomic (at least under Unix systems and NTFS).
What is nice about this approach is that it is pretty easy to handle problematic situations like crashes, and it automatically comes with a nice management interface everyone is familiar with (the filesystem GUI, ls, Windows Explorer, ...).
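A small sketch of that style (directory names are assumptions); the move into the work-in-progress directory doubles as the claim, so two workers can never process the same file:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    class StateDirPipeline {
        static final Path INCOMING = Paths.get("/data/incoming");
        static final Path WORKING  = Paths.get("/data/working");
        static final Path DONE     = Paths.get("/data/done");

        void drainIncoming() throws IOException {
            try (DirectoryStream<Path> files = Files.newDirectoryStream(INCOMING)) {
                for (Path file : files) {
                    Path claimed = WORKING.resolve(file.getFileName());
                    try {
                        Files.move(file, claimed); // atomic on the same filesystem
                    } catch (IOException lostRace) {
                        continue;                  // another instance grabbed it first
                    }
                    process(claimed);
                    Files.move(claimed, DONE.resolve(claimed.getFileName()));
                }
            }
        }

        private void process(Path file) { /* business logic */ }
    }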

How can I process multiple files concurrently?

I have a scenario where web archive files (warc) are periodically dropped by a crawler into different directories. Each warc file internally consists of thousands of HTML files.
Now, I need to build a framework to process these files efficiently. I know Java doesn't scale well for parallel I/O. What I'm thinking is to have a monitor thread which scans the directories, picks up the file names and drops them into an ExecutorService or some Java blocking queue. A bunch of worker threads (maybe a small number, because of the I/O concern) behind the executor service will read the files, read the HTML files within, and do the respective processing. This also makes sure that threads do not fight over the same file.
Is this the right approach in terms of performance and scalability? Also, how should the files be handled once they are processed? Ideally, they should be moved or tagged so that they are not picked up by a thread again. Can this be handled through Future objects?
In recent versions of Java (since Java 7) there is already a built-in file change notification service as part of the NIO library. You might want to check this out first instead of rolling your own. See here
My key recommendation is to avoid re-inventing the wheel unless you have some specific requirement.
If you're using Java 7, you can take advantage of the WatchService (as suggested by Simeon G).
If you're restricted to Java 6 or earlier, these services aren't available in the JRE. However, Apache Commons-IO provides file monitoring. See here.
As an advantage over Java 7, Commons-IO monitors will create a thread for you that raises events against the registered callback. With Java 7, you need to poll the event list yourself.
Once you have the events, your suggestion of using an ExecutorService to process files off-thread is a good one. Moving files is supported by Java IO and you can just ignore any delete events that are raised.
I've used this model in the past with success.
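For what it's worth, a minimal sketch of that model with Java 7's WatchService (directory and pool size are placeholders):

    import java.nio.file.FileSystems;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.WatchEvent;
    import java.nio.file.WatchKey;
    import java.nio.file.WatchService;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;
    import static java.nio.file.StandardWatchEventKinds.OVERFLOW;

    class WarcMonitor {
        public static void main(String[] args) throws Exception {
            Path dir = Paths.get("/data/warc-drop");
            ExecutorService workers = Executors.newFixedThreadPool(4); // small pool: I/O bound

            try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
                dir.register(watcher, ENTRY_CREATE);
                while (true) {
                    WatchKey key = watcher.take();              // blocks until events arrive
                    for (WatchEvent<?> event : key.pollEvents()) {
                        if (event.kind() == OVERFLOW) continue;
                        Path created = dir.resolve((Path) event.context());
                        workers.submit(() -> process(created)); // hand off, one file per task
                    }
                    if (!key.reset()) break;                    // directory no longer watchable
                }
            } finally {
                workers.shutdown();
            }
        }

        static void process(Path warcFile) { /* parse entries, then move/tag the file */ }
    }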
Here are some things to watch out for:
The new-file event will likely be raised as soon as the file exists in the directory. HOWEVER, data may still be being written to it. Consider reasonable expectations for file size and how long you need to wait until a file is considered 'whole'.
What is the maximum amount of time you must spend on a file?
Make your executor service parameters tweakable via config - this will simplify your performance testing
Hope this helps. Good luck.

How to synchronize file access in a shared folder using Java (or: ReadWriteLock at the network level)

I have multiple applications running in one virtual machine. I have multiple virtual machines running on one server. And I have multiple servers. They all share a file in a shared folder on Linux. The file is read and written by all applications. While the file is being written, no application is allowed to read it; and while an application is reading the file, no application is allowed to write to it.
How do I manage to synchronize the applications so they wait for the write process to finish before they read, and vice versa? (The applications inside a VM have to be synchronized, and also the applications across servers.)
The current implementation uses "file semaphores". If the file is about to be written, the application tries to "acquire" the semaphore by creating an additional file (let's name it "file.semaphore") in the shared folder. If the "file.semaphore" file already exists, this means the semaphore is already held by a different application. The problem with this approach is that I cannot make sure the "file exists" test and the "create file" operation are executed atomically. This way, it is possible that two applications test for the "file.semaphore" file, see that it does not exist, and try to create the file at the same time.
You can use NIO locking capabilities. See FileChannel#lock().
However, this will work only if the underlying filesystem supports locking over the network. Recent NFS versions should support it. Samba probably supports it too, but I can't say for sure.
See this article for an example.
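A minimal sketch, assuming the underlying filesystem honors the lock (class and method names are made up):

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.file.Path;
    import static java.nio.file.StandardOpenOption.READ;
    import static java.nio.file.StandardOpenOption.WRITE;

    class SharedFileAccess {
        // Exclusive advisory lock over the whole file; readers could use
        // channel.lock(0, Long.MAX_VALUE, true) for a shared lock instead.
        static void withExclusiveLock(Path file, Runnable criticalSection) throws IOException {
            try (FileChannel channel = FileChannel.open(file, READ, WRITE);
                 FileLock lock = channel.lock()) { // blocks until acquired
                criticalSection.run();
            }
            // Caveat: a FileLock is held on behalf of the whole JVM, not one
            // thread; threads in the same JVM must coordinate separately, or a
            // second lock() attempt throws OverlappingFileLockException.
        }
    }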
Have a look at the Javadocs for the createNewFile() method - it specifically states that creating files is not a reliable method for synchronization, and recommends the FileLock class instead (it lives in java.nio.channels, so this is essentially the same as what Ivan Dubrov is suggesting).
This would imply that your identification of the problem is accurate, and no amount of playing around will solve this with traditional file creation. My first thought was to check the return code from createNewFile(), but if the Javadocs say it's not suitable then it's time to move on.
You need to combine file locking, for protection between JVMs, with synchronization within the threads of a given JVM. See the answer by cyber-monk here.
I am also trying to determine the best way to solve this problem for a similar situation (fewer participating processes, but the same underlying problem). If you haven't been able to employ the file-locking scheme suggested by Ivan (e.g. the system, language, or network service does not support it), maybe you could designate one of the participants as a referee.
All participants write unique semaphore files, call them "participant#.request", when they want the file. The referee polls the file system for these semaphores. When he sees one, he writes back "participant#.lock" and deletes the request. If he happens to see several at the "same time", he selects one at random (or the first by file modification time) and deletes only that one's request. The participant that was issued the lock then knows it can access the file safely. When the participant is done with the file, it deletes its own lock. While a lock is in place, no other locks are issued by the referee.
Any requests still present after a participant deletes its lock could be served a new lock without a new request, so the other participants could simply poll for their lock after sending the request. This is probably what a locking mechanism does anyway, except perhaps for the ability to manage the lock as a queue, with requests served in the order they were received (i.e. if the referee uses modification time). Also, since you're in charge of the referee, you could add timeouts to locks: the referee issues a timeout semaphore to a process that is hogging the file and then removes its lock (hoping, of course, that if the process holding the lock died, it did so nicely).
