I need to monitor a certain folder for new files, which I need to process.
I have the following requirements:
The filenames are sequence numbers, and I need to process each file in order: lowest number first, with no guarantee that every sequence number exists (e.g. 1, 2, 5, 8, 9).
If files already exist in the folder during startup, I need to process them directly
I need a guarantee that I only process each file once
I need to avoid reading incomplete files (which are still being copied)
The service should, of course, be reliable...
What is the most common way to accomplish this?
I'm using Java SE7 and Spring 4.
I have already looked at the Java 7 WatchService, but it does not seem to handle files that already exist at startup, nor does it help with avoiding incomplete files.
Assembling comments into an answer.
The easiest way to process the files in the correct order is to load the entire directory listing into an array or list and then sort it with an appropriate comparator, e.g. load the files with File.list() or File.listFiles().
This is not the most efficient approach, but for fewer than 10,000 files it should be adequate, unless you need fast startup performance (expect a small lag before processing begins while the whole directory is listed).
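A minimal sketch of that startup scan, assuming the filename (minus any extension) is the bare sequence number (the class and helper names are made up):

    import java.io.File;
    import java.util.Arrays;
    import java.util.Comparator;

    public class StartupScan {
        // Assumes every file in the directory is named by its sequence number,
        // e.g. "5" or "5.dat"; anything else would make parseLong throw.
        private static long sequenceNumber(File f) {
            String name = f.getName();
            int dot = name.lastIndexOf('.');
            return Long.parseLong(dot >= 0 ? name.substring(0, dot) : name);
        }

        public static File[] pendingFilesInOrder(File dir) {
            File[] files = dir.listFiles();
            if (files == null) {
                return new File[0]; // dir does not exist or is not a directory
            }
            // Lowest sequence number first; gaps (1, 2, 5, 8, 9) are fine.
            Arrays.sort(files, new Comparator<File>() {
                @Override
                public int compare(File a, File b) {
                    return Long.compare(sequenceNumber(a), sequenceNumber(b));
                }
            });
            return files;
        }
    }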
To avoid reading incomplete files, acquire an exclusive FileLock on each file via a FileChannel (which you can get from a FileOutputStream or a FileInputStream, though you may not be able to obtain an exclusive lock from a FileInputStream). Assuming the OS supports file locking (modern OSes do) and the application writing the file is well behaved and holds a lock while writing (hopefully it is), then as soon as you can acquire the lock you know the file is complete.
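A sketch of that lock probe, using a RandomAccessFile opened "rw" since an exclusive lock generally requires a writable channel:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.channels.OverlappingFileLockException;

    public class CompletionProbe {
        // Returns true once we can take (and immediately release) an exclusive
        // lock, i.e. once a well-behaved writer has finished and released its lock.
        public static boolean isCompletelyWritten(File file) {
            RandomAccessFile raf = null;
            try {
                raf = new RandomAccessFile(file, "rw");
                FileChannel channel = raf.getChannel();
                FileLock lock = channel.tryLock();
                if (lock == null) {
                    return false; // another process still holds the lock
                }
                lock.release();
                return true;
            } catch (OverlappingFileLockException e) {
                return false; // locked by another thread in this JVM
            } catch (IOException e) {
                return false; // treat I/O trouble as "not ready yet"
            } finally {
                if (raf != null) {
                    try { raf.close(); } catch (IOException ignored) { }
                }
            }
        }
    }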
If for some reason you cannot rely on file locking, then either have the writing program write to a temporary file first (perhaps with a different extension) and then atomically move/rename the file (atomic on most OSes as long as it stays on the same file system/partition), or monitor the file for a period of time to see whether further bytes are still being written (not the most robust method).
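If you control the writer, the temp-file-plus-atomic-move approach might look like this (paths and extensions are made up; keeping the temp file in the same directory keeps the move on one file system):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class AtomicPublish {
        // Write to a temp name in the same directory, then publish with an
        // atomic move so the watcher never observes a half-written file.
        public static void publish(Path dir, String sequenceNumber, byte[] content)
                throws IOException {
            Path tmp = dir.resolve(sequenceNumber + ".tmp"); // made-up extensions
            Files.write(tmp, content);
            // Same directory, hence same file system, so ATOMIC_MOVE is available
            // on most platforms; it throws AtomicMoveNotSupportedException otherwise.
            Files.move(tmp, dir.resolve(sequenceNumber + ".dat"),
                    StandardCopyOption.ATOMIC_MOVE);
        }
    }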
Related
I'm working on a multithreaded server in Java.
The server monitors a directory of files. Clients can ask the server:
to download a file from the server directory
to upload a new version of an already existing file to the server, overwriting the old version in the server directory.
To do the transfers, I'm planning to use FileChannels and SocketChannels, using the methods transferFrom and transferTo. According to the documentation, these two methods are thread safe.
The thing is that a single call to these two functions might not be sufficient to read or write the file entirely.
The problem arises when there is more than one request for the same file at the same time, so multiple threads could be doing read/write operations on the same file. The individual calls to transferFrom/transferTo are thread-safe, according to the Java documentation, but since a single call might not transfer the whole file, if thread A is replying to a download request and thread B is replying to an upload request for the same file, it could happen that:
Thread A starts reading from the file
In thread A, for some reason the read call returns before the EOF
Thread B overwrites the entire file with a single write call
Thread A continues reading from the file
In this case, the downloading client receives a portion of the old version and a portion of the new version.
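As an aside, a single short transfer is normally handled by looping until the requested count has been moved, which is exactly what opens the window for the interleaving above; a minimal sketch (channel names are illustrative):

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    public class Transfers {
        // A single transferTo call may move fewer bytes than requested,
        // so loop until the whole file has been handed to the socket.
        public static void sendFile(FileChannel file, SocketChannel socket)
                throws IOException {
            long pos = 0;
            long size = file.size();
            while (pos < size) {
                pos += file.transferTo(pos, size - pos, socket);
            }
        }
    }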
To solve this I think I should be using some sort of locking, but I'm not sure how to do it in an efficient way. I could create two synchronized methods for reading and writing, but that would obviously create too much contention.
The best solution I have in mind is to use lock striping. Before doing any read/write operation, a hash based on the filename is calculated, and then the lock at position lockArr[hash % numOfLocks] is acquired.
I also think I should be using ReadWriteLocks, since multiple simultaneous reads should be allowed.
Now, this is my analysis of the problem and I could be completely wrong. Is there any better solution to this?
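For reference, a minimal sketch of the striping idea described in the question (the class name, numOfLocks, and forFile are illustrative):

    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class LockStripes {
        private final ReadWriteLock[] lockArr;

        public LockStripes(int numOfLocks) {
            lockArr = new ReadWriteLock[numOfLocks];
            for (int i = 0; i < numOfLocks; i++) {
                lockArr[i] = new ReentrantReadWriteLock();
            }
        }

        // The same filename always maps to the same stripe; unrelated files
        // that collide on a stripe will contend, the price of a bounded lock count.
        public ReadWriteLock forFile(String filename) {
            int hash = filename.hashCode() & 0x7fffffff; // avoid negative indices
            return lockArr[hash % lockArr.length];
        }
    }

Readers would take forFile(name).readLock() around the whole transfer and writers the writeLock(), so simultaneous downloads of the same file remain possible.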
Locking means that somebody has to wait for somebody else -- not the best solution.
When the client uploads a file, you should write it out to a temp file on the same disk (usually in the same directory), and then when the file upload is done:
Rename the old version to a temporary name. Any current readers should be forced to close the old one, re-open the temp version, and seek to the correct position.
Rename the uploaded file to the target file name.
Delete the temp version of the old file when any readers are done with it.
In a typical implementation, you'd need a centralized class (let's call it ConcurrentFileAccessor) to manage the interactions between threads.
Readers would need to register with this class, and synchronize on some object during the actual read operation. When an upload completes, the writer would have to claim all those locks to block reads, close all the read files, rename the old version, reopen, seek, and then release them to allow the readers to continue.
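A skeleton of that idea, with the rename/reopen details left out (the class, interface, and method names are all hypothetical, and the Reader callbacks stand in for whatever bookkeeping the download threads already do):

    import java.util.Set;
    import java.util.concurrent.CopyOnWriteArraySet;

    public class ConcurrentFileAccessor {
        // Stand-in for whatever per-download state the server already keeps.
        public interface Reader {
            void pause();               // stop reading and close the old file
            void switchTo(String path); // open the renamed copy and seek back
            void resume();
        }

        private final Set<Reader> readers = new CopyOnWriteArraySet<Reader>();

        public void register(Reader r)   { readers.add(r); }
        public void unregister(Reader r) { readers.remove(r); }

        // Called by the writer once an upload has fully arrived in its temp file.
        public synchronized void publishNewVersion(String oldVersionTempName) {
            for (Reader r : readers) r.pause();
            // 1. rename current file -> oldVersionTempName
            // 2. rename uploaded temp file -> target name   (not shown)
            for (Reader r : readers) r.switchTo(oldVersionTempName);
            for (Reader r : readers) r.resume();
        }
    }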
I have a very large text data file. Is it possible to read this file by multiple file readers from different locations at the same time in parallel? For example one reader starts reading it from start and goes to middle, another starts reading from the middle to the end of the file.
I have an alternative approach of reading the file via a stream and using parallel, but it does not achieve the goal:
Files.lines(filePath).parallel()
Multiple readers might not be able to read the same file because it has already been opened and locked by another thread that is currently reading it. Is there any way to share this file among many threads so they can read it concurrently?
Is it possible to read this file by multiple file readers from different locations at the same time in parallel?
Yes. You can read the file from multiple points at once using RandomAccessFile. You can only do this if the file isn't locked.
If you have a text file, you need to find where each line starts, and the built-in tools don't really support this, so you will have to do some of the work yourself.
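A sketch of reading two halves of a file in parallel, each thread with its own RandomAccessFile so file pointers aren't shared (the class name is made up; for text you would still need to adjust the split point to the next newline yourself):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class HalfReader implements Runnable {
        private final String path;
        private final long start, end;

        public HalfReader(String path, long start, long end) {
            this.path = path;
            this.start = start;
            this.end = end;
        }

        @Override
        public void run() {
            try {
                RandomAccessFile raf = new RandomAccessFile(path, "r");
                try {
                    raf.seek(start); // jump to this thread's region
                    byte[] buf = new byte[8192];
                    long remaining = end - start;
                    while (remaining > 0) {
                        int n = raf.read(buf, 0, (int) Math.min(buf.length, remaining));
                        if (n < 0) break; // unexpected EOF
                        remaining -= n;
                        // process buf[0..n) here
                    }
                } finally {
                    raf.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Start one thread with (0, length/2) and another with (length/2, length); for text, scan forward from the midpoint to the next '\n' and use that as the real boundary.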
Is there any way to make this file shared among many threads and they can concurrently read it.
We have a couple of tools which support both concurrent reads and writes to the same file across processes so it is possible. (They work on binary files, not text)
I have a situation where two Java applications are watching a directory for incoming files. Say there is a directory DIR that is being watched by two JVM processes for any files with the extension .SGL.
The problem we face here is that, sometimes both nodes are being notified about the new files and both nodes are trying to process the same file.
Usually we handle these situations using a database: we try to insert into a table with a unique file-name column, and only the node whose insert succeeds continues processing.
But for this situation, we don't have database.
What is the best way to handle these kinds of problems? Can we depend on file-renaming solutions? Is file renaming an atomic operation?
For such a situation Spring Integration suggests FileSystemPersistentAcceptOnceFileListFilter: https://docs.spring.io/spring-integration/reference/html/files.html#file-reading
Stores "seen" files in a MetadataStore to survive application restarts.
The default key is 'prefix' plus the absolute file name; value is the timestamp of the file.
Files are deemed as already 'seen' if they exist in the store and have the same modified time as the current file.
When you have a shared, persistent MetadataStore for all your application instances, only one of them will process a given file; all the others will just filter it out.
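As a sketch of how that looks in code (Spring Integration 4.x; the Redis-backed store is just one example of a MetadataStore all instances can share, and the "sgl-" prefix is arbitrary):

    import java.io.File;
    import java.util.Collections;
    import java.util.List;

    import org.springframework.data.redis.connection.RedisConnectionFactory;
    import org.springframework.integration.file.filters.FileSystemPersistentAcceptOnceFileListFilter;
    import org.springframework.integration.redis.metadata.RedisMetadataStore;

    public class SharedOnceFilter {
        private final FileSystemPersistentAcceptOnceFileListFilter filter;

        public SharedOnceFilter(RedisConnectionFactory redis) {
            // Both JVMs point at the same Redis, so a file accepted by one
            // instance is treated as already 'seen' by the other.
            filter = new FileSystemPersistentAcceptOnceFileListFilter(
                    new RedisMetadataStore(redis), "sgl-");
        }

        public List<File> newFiles(File dir) {
            File[] files = dir.listFiles();
            return files == null ? Collections.<File>emptyList()
                                 : filter.filterFiles(files);
        }
    }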
Every watcher (even two in the same JVM) should always be notified of the new File being added.
If you want to divide the work, you can either
use one JVM to run twice as many threads and divide the work via a queue.
use an operation which will only succeed for one JVM, e.g.:
file rename
create a lock file
lock the file itself
Is file renaming an atomic operation?
Yes: only one process can successfully rename a file, even if both attempt to rename it to the same name.
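A minimal sketch of the rename-as-claim idea (the class name and suffix scheme are illustrative):

    import java.io.File;

    public class FileClaim {
        // renameTo returns true for exactly one of the competing processes,
        // so the winner processes the file and the loser simply moves on.
        public static boolean tryClaim(File file, String nodeId) {
            File claimed = new File(file.getParentFile(),
                    file.getName() + "." + nodeId + ".processing");
            return file.renameTo(claimed);
        }
    }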
EDIT: Well, I'm back a bunch of months later; the lock mechanism I was trying to code doesn't work, because createNewFile isn't reliable on NFS. Check the answer below.
Here is my situation: I have only one application which may access the files, so I don't have any constraints about what other applications may do, but the application runs concurrently on several servers in the production environment for redundancy and performance purposes (a couple of machines each host a couple of JVMs running our apps).
Basically, what I need is to put some kind of flag in a folder to tell the other instances to leave this folder alone as another instance is already dealing with it.
Many search results suggest using FileLock to achieve this, but I checked the Javadoc and, from my understanding, it will not help much, since it relies on the hosting OS's locking facilities, and here the instances run on different host machines.
This question covers a similar subject : Java file locking on a network , and the accepted answer is recommending to implement your own kind of cooperative locking process (using the File.createNewFile() as asked by the OP).
The Javadoc of File.createNewFile() says that the process is atomically creating the file if it doesn't already exist. Does that work reliably in a network file system ?
I mean, with the potential network lag, how is it possible to do both the existence check and the creation simultaneously?
The check for the existence of the file and the creation of the file if it does not exist are a single operation that is atomic with respect to all other filesystem activities that might affect the file.
No, createNewFile doesn't work properly on a network file system.
Even if the system call is atomic, it's only atomic regarding the OS, and not over the network.
Over time, I got a couple of collisions, roughly once every 2-3 months (approx. once every 600k files).
The thing that happens is my program is running in 6 separates instances over 2 separate servers, so let's call them A1,A2,A3 and B1,B2,B3.
When A1, A2, and A3 try to create the same file, the OS can properly ensure that only one file is created, since everything is coordinated within a single machine.
When A1 and B1 try to create the same file at the same exact moment, there is some form of network cache and/or network delays happening, and they both get a true return from File.createNewFile().
My code then proceeds to rename the parent folder to stop the other instances of the program from unnecessarily trying to process the folder, and that's where it fails:
On A1, the folder renaming operation succeeds, but the lock file can't be removed, so A1 just leaves it like that and keeps on processing new incoming folders.
On B1, the folder renaming operation (File.renameTo(); can't do much to fix it) gets stuck in an infinite loop because the folder was already renamed (also causing huge I/O traffic, according to my sysadmin), and B1 is unable to process any new files until the program is rebooted.
The check for the existence of the file and the creation of the file if it does not exist are a single operation that is atomic with respect to all other filesystem activities that might affect the file.
That can be implemented easily via the open() system call or its equivalents in any operating system I have ever used.
I mean, with the potential network lag, how is it possible to do both the existence check and the creation simultaneously?
There is a difference between simultaneously and atomically. The Javadoc is not saying the method performs two simultaneous actions, but rather two actions designed to work atomically. If the method performs the two operations atomically, that means a file will never be created without the existence check happening first: if the file gets created by the current call, no file was present; and if it doesn't get created, a file of that name already existed.
I don't see a reason to doubt the function being atomic or reliable just because the call goes over the network rather than to a local disk; a local call can be equally unreliable, since so many things can go wrong in I/O.
What you do have to be wary of is using the empty file created by this function as a lock, as explained in D-Mac's answer to this question, and as explicitly mentioned in the Javadoc for this function.
You are looking for a directory lock, and empty files acting as a directory lock (to signal other processes and threads to leave it alone) have worked quite well for me, provided due care is taken to write the logic for checking file existence, cleaning up lock files, and handling orphaned locks.
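A minimal sketch of such a lock-file scheme, including orphaned-lock cleanup (per the EDIT above, don't trust createNewFile() over NFS; the class name, lock-file name, and staleness cutoff are all arbitrary choices):

    import java.io.File;
    import java.io.IOException;

    public class DirLock {
        // Arbitrary cutoff after which a leftover lock is considered orphaned.
        private static final long STALE_AFTER_MS = 60L * 60L * 1000L;

        public static boolean tryLock(File dir) throws IOException {
            File lock = new File(dir, ".lock");
            if (lock.exists()
                    && System.currentTimeMillis() - lock.lastModified() > STALE_AFTER_MS) {
                lock.delete(); // clean up after a crashed instance
            }
            // Atomic check-and-create, but as the EDIT above notes, not dependable on NFS.
            return lock.createNewFile();
        }

        public static void unlock(File dir) {
            new File(dir, ".lock").delete();
        }
    }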
Is there any way to access the number of blocks allocated to a file with the standard Java File API? Or even to do it with some unsupported and undocumented API underneath; anything to avoid native-code plugins.
I'm talking about the st_blocks field of struct stat that the fstat/stat syscalls work on in Unix.
What I want to do is create a sparse copy of a file that now contains lots of redundant data, i.e. make a new copy holding only the active data, written sparsely, and then swap the two files with an atomic rename/link operation, after which the old file is removed. But I need a way to find out how many blocks are allocated to the file beforehand, since it might already have been sparsely copied.
This will be used to free up disk space in a database application that is 100% Java. The benefit of relying on the filesystem's sparse-file support is that I would not have to change the index that points to the data's location, which would increase the complexity of the task at hand.
I think I can do reasonably well by relying on the file timestamp to see whether files have already been cleaned up. But this intrigued me: I cannot even find anything in the Java 7 NIO.2 API for file-attribute access at this level.
The only way I can think of is to use ls -s filename to get the actual size of the file on disk. http://www.lrdev.com/lr/unix/sparsefile.html
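A sketch of shelling out for that block count (Unix-only; the class name is made up, and the block unit ls reports is platform-dependent, so check your system's man page):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class AllocatedBlocks {
        // Runs "ls -s <file>" and parses the leading block count, e.g. "128 data.db".
        public static long blocksOf(String path) throws IOException, InterruptedException {
            Process p = new ProcessBuilder("ls", "-s", path).start();
            BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
            try {
                String line = r.readLine();
                p.waitFor();
                if (line == null) {
                    throw new IOException("no output from ls -s " + path);
                }
                return Long.parseLong(line.trim().split("\\s+")[0]);
            } finally {
                r.close();
            }
        }
    }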