My boss is worried that our NFS file system will not be happy with the JBoss-run Java process calling getFD().sync() on the files we are writing.
We have noticed that the time stamp on the created file is frequently minutes (sometimes as many as 15 minutes) after the time the log claims the file was finished writing. My only guess is that NFS is holding the file in memory and not writing it until it feels like it. sync should solve that problem, right?
I also noticed that close() is never called on the file. Could that have been the cause as well?
Any thoughts appreciated.
If you mean that the Java code never calls close() on the stream, yes, that is a bug. Always close a stream, input or output, as soon as use is complete. Good static analysis tools will warn about code that fails to do this.
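As a minimal sketch of that advice, assuming a plain FileOutputStream (the class name DurableWrite and the method writeAndSync are made up for illustration), try-with-resources guarantees the close(), and getFD().sync() can be called just before it. Whether the sync changes the delayed timestamps seen on the NFS mount is not something this sketch can promise.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class DurableWrite {
    // Hypothetical helper: writes the text, asks the OS to push it to the
    // storage device, then closes the stream.
    static void writeAndSync(Path target, String text) throws IOException {
        try (FileOutputStream out = new FileOutputStream(target.toFile())) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
            out.flush();        // matters only if buffered wrappers are added later
            out.getFD().sync(); // blocks until the OS reports the data as written
        } // try-with-resources closes the stream even if write or sync throws
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("durable", ".txt");
        writeAndSync(tmp, "hello\n");
        System.out.print(new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8));
        Files.delete(tmp);
    }
}
```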
One thread pool downloads files from the FTP server and another thread pool reads those files.
Both thread pools run concurrently. I'll explain what happens with an example.
Let's assume I have one CSV file with 100 records.
While threadPool-1 is downloading the file and writing it to the pending folder, threadPool-2 reads from that same file. Assume that in 1 second only 10 records have been written to the file in /pending, so threadPool-2 reads only 10 records.
threadPool-2 doesn't know that the other 90 records are still being downloaded, so it never reads them. After reading, it moves the file to another folder, so those 90 records are never processed.
My question is: how can threadPool-2 wait until the whole file has been downloaded before reading it?
One more thing: both thread pools use the scheduleAtFixedRate method and run every 10 seconds.
Please guide me on this.
I'm a fan of Mark Rotteveel's #6 suggestion (in comments above):
use a temporary name when downloading,
rename when download is complete.
That looks like:
FTP download threads write all files with some added extension – perhaps .pending – but name it whatever you want.
When a file is downloaded – say some.pdf – the FTP download thread writes the file to some.pdf.pending
When an FTP download thread completes a file, the last step is a file rename operation – this is the mechanism for ensuring only "done" files are ready to be processed. So it downloads the file to some.pdf.pending, then at the end, renames it to some.pdf.
Reader threads look for files, ignoring anything matching *.pending
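The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the class PendingRename, its method names, and the InputStream standing in for the FTP download are all hypothetical, and the atomic rename relies on both names living in the same directory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

class PendingRename {
    static final String SUFFIX = ".pending"; // assumed marker extension

    // Downloader side: stream into name.pending, then rename to the final name.
    static Path store(InputStream data, Path dir, String name) throws IOException {
        Path tmp = dir.resolve(name + SUFFIX);
        Path done = dir.resolve(name);
        Files.copy(data, tmp, StandardCopyOption.REPLACE_EXISTING);
        // Same directory, so the rename is expected to be atomic;
        // Files.move throws if an atomic move is not possible.
        return Files.move(tmp, done, StandardCopyOption.ATOMIC_MOVE);
    }

    // Reader side: list regular files, skipping anything still named *.pending.
    static List<Path> readyFiles(Path dir) throws IOException {
        List<Path> ready = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                boolean pending = p.getFileName().toString().endsWith(SUFFIX);
                if (Files.isRegularFile(p) && !pending) {
                    ready.add(p);
                }
            }
        }
        return ready;
    }
}
```

The reader pool would call readyFiles on each scheduled run and process only what it returns, so a half-downloaded file is never visible under its final name.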
I've built systems using this approach and they worked out well. In contrast, I've also worked with more complicated systems that tried to coordinate across threads, and those often did not work so well.
Over time, any software system will have bugs. Edsger Dijkstra captured this so well:
"If debugging is the process of removing software bugs, then programming must be the process of putting them in."
However difficult it is to reason about program correctness now – while the program is still in the design phase and has not yet been built – it will be harder to reason about correctness when things are broken in production (which will happen, because bugs). That is, when things are broken and you're under time pressure to find the root cause (and fix it!), even the best of us would be at a disadvantage with a complicated (vs. simple) system.
The approach of using temporary names is simple to reason about, which should minimize code complexity and thus make it easier to implement.
In turn, maintenance and bug fixes should be easier, too.
Keep it simple – let the filesystem help you out.
I have a Unix system mounting an NFS "share" from a Windows server. On the Windows server I have a PowerShell script that will check every 10 s if there is a new file coming in on the NFS share and Move-Item it somewhere else and then it gets processed further.
What we are seeing is that files are corrupted in this process. My hunch is that the NFS writing takes a little longer, the script picks up an incomplete file and Move-Item it to the other folder. There is also a theory a colleague has that the further processing picks up the file before Move-Item has completed. I do not believe in that theory, because Move-Item on the same file system should be an atomic metadata only operation. (Don't be confused by the NFS reference, the Windows server has these files locally, the NFS share is mounted by the Unix system, so Move-Item does not involve NFS, and in my case, doesn't cross file system boundaries either.)
Either way, I want to know why writing the file to NFS, which is done by a Java process on Unix, does not lock the file on the Windows host file system. Would I have to explicitly request an NFS lock from Java somehow? Does Java even support fcntl-style locks?
Also, if I used the PowerShell copy command rather than Move-Item, there would be a moment when the destination file is only partially copied. Doesn't the copy command automatically hold a lock on the destination file until it is finished?
EDIT: This is actually getting more and more puzzling. First I tried locking the file explicitly while writing to the NFS share from Java. That creates a huge problem with NFS: I couldn't get the nlockmgr service to actually work. There is a firewall between the two hosts; I made all the right passages through it and still get no response to the lock requests from the Windows NFS server. This causes the Java side to hang completely, so badly that you can't even kill -KILL the JVM. The only way to end this nightmare is to reboot the Unix system, crazy! There also isn't a timeout on the lock request, which is a big problem in Java. I have seen similar problems elsewhere, such as reading from a socket, where you can't kill a thread that hangs in the read. In any case, there is no way to cancel a lock request, so I gave up on that.
Then I added a filter in the PowerShell script to only move files whose last-write time is at least 10 seconds before the current time. That should leave more than enough time for the writer to finish. But apparently it doesn't help either.
UPDATE: but yes, I now watched it, that copy process on Unix from S3 to NFS to Windows NTFS takes a long time, and it is all running on AWS so even S3 should be considered fast. Yet, it crawls between 0 kB ... 64 kB ... 90 kB with 10 seconds not enough to wait between each new chunk written. I updated this wait time to 30 seconds and that seems to work, but it is not guaranteed.
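For illustration only, here is the same "wait until the file has been quiet for a while" filter written in Java rather than PowerShell. The class and method names are hypothetical, and as noted above the approach is a heuristic: a writer that stalls longer than the chosen window will still be picked up half-finished.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

class QuietFileFilter {
    // True when the file was last modified at least minAge before 'now'.
    // 'now' is passed in so the caller uses one clock reading per scan.
    static boolean isQuiet(Path file, Duration minAge, Instant now) throws IOException {
        Instant lastWrite = Files.getLastModifiedTime(file).toInstant();
        return !lastWrite.plus(minAge).isAfter(now);
    }
}
```

A scan would call isQuiet(file, Duration.ofSeconds(30), Instant.now()) and skip any file for which it returns false.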
The locking would be the right solution, but I have 2 major obstacles:
I can't get the nlockmgr service to work when the Windows NFS "share" is mounted on Unix.
The JVM stalls completely, and cannot be killed, if nlockmgr has a problem.
Using Java's java.nio.channels.FileLock I am trying to synchronize file reading and writing on a Windows filesystem. I have a test program that runs in a loop:
Lock the file X.LOCK
Test that X.JSON exists (just a consistency check)
Write the file X.TMP
Rename X.TMP to X.JSON using java.nio.file.Files.move(), which deletes the old X.JSON and renames X.TMP to X.JSON in an atomic action.
Test that X.JSON exists (this always returns true)
Release the lock on X.LOCK
I run this in a tight loop in multiple instances of the test program. It locks the symbolic file "X.LOCK" and not the actual file that is being written and renamed. I believe that is necessary to preserve the lock through the rename operation.
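One iteration of that loop could look roughly like the sketch below. It is not the original test program: the class name LockedReplace and the method replace are assumptions, and the initial "X.JSON exists" check is left out so the sketch also works on an empty directory.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

class LockedReplace {
    // Replaces dir/X.JSON with new content while holding an exclusive lock
    // on dir/X.LOCK. File locks are held per JVM, so this coordinates
    // separate processes, not threads inside one process.
    static void replace(Path dir, String json) throws IOException {
        Path lockFile = dir.resolve("X.LOCK");
        Path tmp = dir.resolve("X.TMP");
        Path target = dir.resolve("X.JSON");
        try (FileChannel ch = FileChannel.open(lockFile,
                 StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = ch.lock()) { // blocks until other processes release it
            Files.write(tmp, json.getBytes(StandardCharsets.UTF_8));
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING,
                       StandardCopyOption.ATOMIC_MOVE);
            if (!Files.exists(target)) {
                throw new IllegalStateException("X.JSON missing after rename");
            }
        } // the lock is released first, then the channel is closed
    }
}
```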
Here is what I find: In about 2% of the cases, process 1 will write/rename/release the lock, and process 2 which was waiting on that lock will get that lock, start executing, but find that X.JSON does not exist. The "exists" check returns false!
If I introduce a delay (200ms) after the rename, and before the unlock, then the whole thing runs 100% reliably. I can try smaller delays, but I am loath to add any delay since that is never the right answer to making a reliable program.
It appears that when one process atomically renames a file, it takes some time for the other process to see that. But the unlock signal goes faster! So the lock signal tells the other program to move forward, and that other program can't see the file it is supposed to be working on!
Question: is there any way I can force the unlock signal to be sent AFTER the file system has settled and guaranteed to be consistent with operations that were put in there before the unlock was called?
Any hints on where I can look for information on this kind of timing/sequencing on a Windows file system using Java? I have not tried this test program on any other platform yet, but I certainly will check Linux soon.
UPDATE
I am suspicious of interference from virus scanning. I got a test to a reproducible state, and it was failing about 1% of the time, this time reporting "AccessDeniedException". I think the virus scan might be kicking in, scanning the file between being created and being renamed, and when it does this, it runs at a higher privilege, and causes this error when trying to rename it. Anyone else run into this problem?
The solution appears to be that on a system where virus scan is running, depending upon the specific brand of virus scanner, it is possible that the call to move can be interfered with. I was calling:
java.nio.file.Files.move(src, dest, StandardCopyOption.REPLACE_EXISTING,
StandardCopyOption.ATOMIC_MOVE);
This command will effectively delete the dest if it exists, and rename the src file to the dest, and it will do it atomically. It is documented that if it can not do it atomically, it will throw an exception. I was getting AccessDeniedException which is not mentioned in the documentation specifically but apparently happens.
What appears to be happening, and this all depended on a specific timing that occurred about 1% of the time, is that the virus scanner working on either the src file or the dest file caused the atomic move to fail.
I tried on each of three different systems configured differently. The windows computer with the Microsoft Windows Defender never caused the AccessDeniedException while another with Trend Micro virus scan was failing regularly. That is not a thorough survey of virus scan options; they were the only options I had available for test. The machine with the Trend Micro also has an encrypted hard disk, and that might be a factor to make the timing such as to trip this problem.
I even went so far as to implement a "retry" where if the move threw an exception, the code would wait 10ms and try again. Even with this, the retry failed about 0.1% of the time. Maybe I could have waited longer, but that would in any case be a problem making the code slower.
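A retry of that kind might look like the sketch below. The names RetryingMove and moveWithRetry are hypothetical, and the choice to retry only on AccessDeniedException (rather than on any exception) is an assumption made for this illustration.

```java
import java.io.IOException;
import java.nio.file.AccessDeniedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class RetryingMove {
    // Attempts the atomic move up to maxAttempts times, pausing between tries.
    // Any other IOException is passed straight to the caller.
    static void moveWithRetry(Path src, Path dest, int maxAttempts, long pauseMillis)
            throws IOException, InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                Files.move(src, dest, StandardCopyOption.REPLACE_EXISTING,
                           StandardCopyOption.ATOMIC_MOVE);
                return;
            } catch (AccessDeniedException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and report the last failure
                }
                Thread.sleep(pauseMillis);
            }
        }
    }
}
```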
What worked was to add a step to delete the file being replaced before doing the move. My guess is that the virus scan is either stopped by the delete, or else it continues to scan on the src or dest file without bothering the move command. The steps are these:
Lock the file X.LOCK
Test that X.JSON exists (just a consistency check)
Write the file X.TMP
(NEW) Delete the old X.JSON
Rename X.TMP to X.JSON using java.nio.file.Files.move(), which now simply renames X.TMP to X.JSON in an atomic action.
Test that X.JSON exists (this always returns true)
Release the lock on X.LOCK
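The write, delete and rename steps of the revised sequence can be sketched as below, assuming the caller already holds the lock on X.LOCK. The class name DeleteThenMove and the method publish are made up for illustration. Note that between the delete and the rename X.JSON does not exist at all, so this only makes sense when every reader and writer takes the same lock.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class DeleteThenMove {
    // Assumes the caller holds the X.LOCK lock for the whole call.
    static void publish(Path dir, String json) throws IOException {
        Path tmp = dir.resolve("X.TMP");
        Path target = dir.resolve("X.JSON");
        Files.write(tmp, json.getBytes(StandardCharsets.UTF_8));
        Files.deleteIfExists(target); // (NEW) remove the old X.JSON first
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE); // plain rename
    }
}
```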
Is this now 100% reliable? I can't say for sure, since all this is timing dependent. It is possible that this just changed the timing in a way that allows it to run.
I'm running my Java application on a Windows 2008 server (64-bit) in the HotSpot VM.
A few months ago I created a tool to assist in the detection of deadlocking in my application. For the past month or so, the only thing that has been giving me any problems is the writing to text files.
The main thread always seems to get stuck on the following line for what I would assume to be almost 5 seconds at a time. After a few seconds the application continues to run normally and without problems:
PrintWriter writer = new PrintWriter(new FileWriter(PATH + name + ".txt"));
Not sure what causes this, but any insight into the problem would be most appreciated. The files I'm writing are small and that is unlikely the issue (unless anyone has any objections).
If you need any more information, please let me know.
Is PATH on a network drive? You could see almost any delay writing to a network file system. It's generally a very bad idea to do that with applications. They should generally write all their files locally and then post transactions to a server somehow.
When your file system gets overloaded, you can see delays with even the simplest of tasks. E.g., if I create a large file (multiple GB) and then try a simple disk access that is not cached, it can wait for seconds.
I would check your disk write cache is turned on and your disks are idle most of the time. ;)
In my Java app, on Linux, I need to periodically read some text files that change often.
(these text files are updated by a separate app).
Do I need to be concerned about the rare case when attempting to read the file at the exact moment it is being updated? If so, how can I guarantee that my reads always return without failing? Does the OS handle this for me, or could I potentially read 1/2 a file?
thanks.
The OS can help you achieve consistent reads, but it requires that both apps are written with this in mind.
In a nutshell, you open the file in your java app with exclusive read/write permission - this ensures that no one else, including your other app is modifying the file while you are reading it. The FileLock class can help you ensure you have exclusive access to a file.
Your other app will then periodically try to write to the file. If it does this at the same time you are reading the file, then access will be denied, and the other app should retry. This is the critical part, since if the app doesn't expect the file to be unavailable and treats this as a fatal error condition, the write will fail, and app doesn't save the data and may fail/exit etc.
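A minimal sketch of the reading side with FileLock follows; the class ExclusiveAccess and the method tryReadExclusive are hypothetical names. Two caveats: an exclusive lock needs a channel opened for writing, and whether a lock actually blocks a program that does not itself use locking is system-dependent. On Linux these locks are generally advisory, so the writer has to take the same lock for this to help.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Optional;

class ExclusiveAccess {
    // Returns the file content, or Optional.empty() if another process
    // currently holds the lock and the caller should retry later.
    static Optional<String> tryReadExclusive(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                 StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            FileLock lock = ch.tryLock(); // null if another process holds the lock
            if (lock == null) {
                return Optional.empty();
            }
            try {
                ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
                while (buf.hasRemaining() && ch.read(buf) != -1) {
                    // keep reading until the buffer is full or EOF is reached
                }
                return Optional.of(new String(buf.array(), 0, buf.position(),
                                              StandardCharsets.UTF_8));
            } finally {
                lock.release();
            }
        }
    }
}
```

The writing app would use the same tryLock call and, on a null result, pause and retry instead of treating it as a fatal error.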
If the other app must always be able to write to the file, then you have to avoid using exclusive reads. Instead, you have to try to detect an inconsistent read, such as by checking the last modified timestamp when you start reading, and when you finish reading. If the timestamps are the same, then you are good to go and have a consistent read.
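The timestamp check could be sketched like this, with ConsistentRead and readStable as hypothetical names. It is a heuristic rather than a guarantee: a change that lands within the file system's timestamp granularity can go unnoticed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

class ConsistentRead {
    // Reads the file and accepts the result only if the last-modified
    // timestamp is the same before and after the read.
    static String readStable(Path file, int maxAttempts) throws IOException {
        for (int i = 0; i < maxAttempts; i++) {
            FileTime before = Files.getLastModifiedTime(file);
            byte[] bytes = Files.readAllBytes(file);
            FileTime after = Files.getLastModifiedTime(file);
            if (before.equals(after)) {
                return new String(bytes, StandardCharsets.UTF_8);
            }
        }
        throw new IOException("file kept changing during " + maxAttempts + " read attempts");
    }
}
```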
Yes, you need to worry about this.
No, your reads shouldn't "fail" AFAIK, unless the file is momentarily locked, in which case you can catch the exception and try again after a brief pause. You might certainly, though, get more or less data than you expected.
(If you post code we might be able to comment more accurately on what'll happen.)