In my program I have a loop that scans a bunch of files and reads their content. The problem occurred around the 1,500th file, and I can't seem to reproduce it (or understand it).
The problem:
java.io.FileNotFoundException: /path/to/file//myFile (Too many open files)
Exception points to this method:
private static String readFileAsRawString(File f) throws IOException {
    FileInputStream stream = new FileInputStream(f); // <------------ Stacktrace
    try {
        FileChannel fc = stream.getChannel();
        MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        return Charset.defaultCharset().decode(bb).toString();
    } finally {
        stream.close();
    }
}
I ran this method over 20,000 files in QA and it seems to have no problems.
Do you see anything wrong with the code I pasted above that would cause this issue?
The mapping is suspect. A MappedByteBuffer can outlive its FileChannel, and is valid until it is garbage collected. You might not have enough garbage to run the GC, but perhaps on a particular platform file handles are retained by unreferenced buffers.
Unless explicit garbage collection is disabled (-XX:+DisableExplicitGC), you should be able to test for this by catching the exception, calling System.gc(), and trying again (a sketch follows below). If it works on the second try, that's your problem. However, calling System.gc() as a permanent fix is a bad idea. The solution that will perform best overall will take some profiling on the target platform.
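A minimal sketch of that diagnostic, assuming it wraps the readFileAsRawString method from the question (diagnostic only, not a fix):
// Diagnostic sketch: if the second attempt succeeds after a System.gc(),
// unreferenced MappedByteBuffers were probably holding file descriptors open.
private static String readWithGcRetry(File f) throws IOException {
    try {
        return readFileAsRawString(f);
    } catch (FileNotFoundException e) {   // "Too many open files" surfaces here
        System.gc();                      // ask the JVM to reclaim unreferenced buffers
        return readFileAsRawString(f);    // if the retry works, that's your problem
    }
}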
Don't use MappedByteBuffer for this trivial task. There is no well-defined time at which they are released. Just open the file, read it, close it.
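For example, a plain read that holds exactly one descriptor and releases it in the finally block could look roughly like this (a sketch that keeps the original method's use of the default charset):
private static String readFileAsRawString(File f) throws IOException {
    InputStream in = new FileInputStream(f);
    try {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);            // accumulate the whole file in memory
        }
        return new String(out.toByteArray(), Charset.defaultCharset());
    } finally {
        in.close();                          // the only descriptor held is released here
    }
}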
I think you are opening too many files too fast; try adding a wait to test this.
Then add a static counter that keeps track of open files, and if many files are already open, add a wait mechanism (one possible sketch follows below)...
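One way to realize that suggestion is a shared Semaphore acting as both the counter and the wait mechanism; the limit of 500 permits is an arbitrary illustrative value, and readFileAsRawString is the question's method:
// Sketch: bound the number of files open at any one time.
private static final java.util.concurrent.Semaphore OPEN_FILES =
        new java.util.concurrent.Semaphore(500);   // illustrative limit

private static String readThrottled(File f) throws IOException, InterruptedException {
    OPEN_FILES.acquire();                 // waits if too many files are currently open
    try {
        return readFileAsRawString(f);
    } finally {
        OPEN_FILES.release();             // always return the permit
    }
}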
Related
I'm doing some file I/O with multiple files (writing to 19 files, it so happens). After writing to them a few hundred times I get the Java IOException: Too many open files. But I actually have only a few files opened at once. What is the problem here? I can verify that the writes were successful.
On Linux and other UNIX / UNIX-like platforms, the OS places a limit on the number of open file descriptors that a process may have at any given time. In the old days, this limit used to be hardwired1, and relatively small. These days it is much larger (hundreds / thousands), and subject to a "soft" per-process configurable resource limit. (Look up the ulimit shell builtin ...)
Your Java application must be exceeding the per-process file descriptor limit.
You say that you have 19 files open, and that after a few hundred times you get an IOException saying "too many files open". Now this particular exception can ONLY happen when a new file descriptor is requested; i.e. when you are opening a file (or a pipe or a socket). You can verify this by printing the stacktrace for the IOException.
Unless your application is being run with a small resource limit (which seems unlikely), it follows that it must be repeatedly opening files / sockets / pipes, and failing to close them. Find out why that is happening and you should be able to figure out what to do about it.
FYI, the following pattern is a safe way to write to files that is guaranteed not to leak file descriptors.
Writer w = new FileWriter(...);
try {
    // write stuff to the file
} finally {
    try {
        w.close();
    } catch (IOException ex) {
        // Log error writing file and bail out.
    }
}
1 - Hardwired, as in compiled into the kernel. Changing the number of available fd slots required a recompilation ... and could result in less memory being available for other things. In the days when Unix commonly ran on 16-bit machines, these things really mattered.
UPDATE
The Java 7 way is more concise:
try (Writer w = new FileWriter(...)) {
    // write stuff to the file
} // the `w` resource is automatically closed
UPDATE 2
Apparently you can also encounter a "too many files open" error while attempting to run an external program. The basic cause is as described above. However, the reason that you encounter this in exec(...) is that the JVM is attempting to create "pipe" file descriptors that will be connected to the external application's standard input / output / error.
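The usual mitigation is to consume and close all three pipe streams of each Process and wait for it to finish, so the pipe descriptors are released promptly. A rough sketch (command is an illustrative variable); for programs that produce a lot of output the streams should really be drained concurrently, but the point here is only that each pipe gets closed:
Process p = Runtime.getRuntime().exec(command);
try {
    p.getOutputStream().close();                        // we send no input to the child
    try (BufferedReader stdout = new BufferedReader(
            new InputStreamReader(p.getInputStream()))) {
        while (stdout.readLine() != null) { /* drain stdout */ }
    }
    try (BufferedReader stderr = new BufferedReader(
            new InputStreamReader(p.getErrorStream()))) {
        while (stderr.readLine() != null) { /* drain stderr */ }
    }
    p.waitFor();                                        // let the child exit cleanly
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
} finally {
    p.destroy();                                        // release any remaining pipes
}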
For UNIX:
As Stephen C has suggested, changing the maximum file descriptor value to a higher value avoids this problem.
Try looking at your present file descriptor capacity:
$ ulimit -n
Then change the limit according to your requirements.
$ ulimit -n <value>
Note that this just changes the limits in the current shell and any child / descendant process. To make the change "stick" you need to put it into the relevant shell script or initialization file.
You're obviously not closing your file descriptors before opening new ones. Are you on Windows or Linux?
Although in most cases the error quite clearly means that file handles have not been closed, I just encountered an instance with JDK 7 on Linux that is, well... sufficiently ****ed up to be worth explaining here.
The program opened a FileOutputStream (fos), a BufferedOutputStream (bos) and a DataOutputStream (dos). After writing to the DataOutputStream, the dos was closed, and I thought everything had gone fine.
Internally, however, the dos tried to flush the bos, which returned a "disk full" error. That exception was swallowed by the DataOutputStream, and as a consequence the underlying bos was not closed, hence the fos was still open.
At a later stage the file was then renamed from (something with a .tmp) to its real name. Thereby, Java's file descriptor tracking lost track of the original .tmp, yet it was still open!
To solve this, I had to flush the DataOutputStream myself first, catch the IOException, and then close the FileOutputStream myself.
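Sketched roughly (the names are illustrative), the workaround looked something like this: flush where the IOException is visible, and close the underlying FileOutputStream no matter what:
FileOutputStream fos = new FileOutputStream(tmpFile);   // tmpFile: the .tmp file
DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(fos));
try {
    // ... write the data through dos ...
    dos.flush();        // surfaces the "disk full" IOException instead of it being swallowed
} finally {
    try {
        dos.close();    // may still misbehave on a full disk
    } finally {
        fos.close();    // but the underlying descriptor is always released
    }
}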
I hope this helps someone.
If you're seeing this in automated tests: it's best to properly close all files between test runs.
If you're not sure which file(s) you have left open, a good place to start is the "open" calls which are throwing exceptions! 😄
If you have a file handle that should be open exactly as long as its parent object is alive, you could add a finalize method on the parent that closes the file handle, and call System.gc() between tests.
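A test-only sketch of that idea (the class and field names are made up for illustration; finalize() is discouraged in production code, but combined with System.gc() between tests it can help catch leaked handles):
class ParentWithFile {
    private final RandomAccessFile handle;          // illustrative handle type

    ParentWithFile(File f) throws IOException {
        this.handle = new RandomAccessFile(f, "r");
    }

    @Override
    protected void finalize() throws Throwable {
        try {
            handle.close();                         // close when the parent is collected
        } finally {
            super.finalize();
        }
    }
}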
Recently I had a program batch-processing files. I was certainly closing each file in the loop, but the error was still there.
Later, I resolved the problem by garbage-collecting eagerly every hundred files:
int index = 0;
while (hasMoreFiles) {                  // hasMoreFiles: your batch loop condition
    OutputStream out = openNextFile();  // openNextFile(): however each file is opened in your loop
    try {
        // do with outputStream...
    } finally {
        out.close();
    }
    if (++index % 100 == 0) {           // garbage-collect every hundred files
        System.gc();
    }
}
I'm facing a little weird situation.
I'm copying from FileInputStream to FileOutputStream a file that is sized around 500MB.
It goes pretty well (takes around 500ms). When I close this FileOutputStream the FIRST time, it takes about 1ms.
But here comes the catch: when I run this again, every consecutive close takes around 1500-2000ms!
The duration drops back to 1ms when I delete the file.
Is there some essential java.io knowledge I'm missing?
It seems to be OS-related. I'm running on Arch Linux (the same code run on Windows 7 gives times all under 20ms). Note that it doesn't matter whether it runs on OpenJDK or Oracle's JDK. The drive is a solid-state drive with an ext4 file system.
Here is my testing code:
public void copyMultipleTimes() throws IOException {
    copy();
    copy();
    copy();
    new File("/home/d1x/temp/500mb.out").delete();
    copy();
    copy();
    // Runtime.getRuntime().exec("sync") => same results
    // Thread.sleep(30000)               => same results
    // combination of sync & sleep       => same results
    copy();
}
private void copy() throws IOException {
    FileInputStream fis = new FileInputStream("/home/d1x/temp/500mb.in");
    FileOutputStream fos = new FileOutputStream("/home/d1x/temp/500mb.out");
    IOUtils.copy(fis, fos); // copyLarge => same results
    // copying always takes the same amount of time; only close() gets slower
    fis.close(); // closing the input stream is always fast
    // fos.flush();         // has no effect
    // fos.getFD().sync();  // solves the problem but takes ~2.5s
    long start = System.currentTimeMillis();
    fos.close();
    System.out.println("OutputStream close took " + (System.currentTimeMillis() - start) + "ms");
}
The output is then:
OutputStream close took 0ms
OutputStream close took 1951ms
OutputStream close took 1934ms
OutputStream close took 1ms
OutputStream close took 1592ms
OutputStream close took 1727ms
@Duncan proposed the following explanation:
The first call to close() returns quickly, yet the OS is still flushing data to disk. The subsequent calls to close() can't complete until the previous flushing is complete.
I think this is close to the mark, but not exactly correct.
I think that what is actually going on here is that the first copy is filling up the operating system's file buffer cache with large numbers of dirty pages. The internal daemon that flushes the dirty pages to discs may start working on them, but it is still going when you start the second copy.
When you do the second copy, the OS tries to acquire buffer cache pages for reading and writing. But since the buffer cache is full of dirty pages the read and write calls are repeatedly blocked, waiting for free pages to become available. But before a dirty page can be recycled, the data in the page needs to be written to disc. The net result is that the copy slows down to the effective data write rate.
A 30 second pause may not be sufficient to complete flushing the dirty pages to disc.
One thing you could try is to do an fsync(fd) or fdatasync(fd) between the copies. In Java, the way to do that is to call FileDescriptor.sync().
Now, I can't say if this is going to improve total copy throughput, but I'd expect a sync operation to be better at writing out (just) one file than relying on the page eviction algorithm to do it.
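In terms of the question's copy() method, that is exactly the commented-out fos.getFD().sync() line; a rough sketch using the same variables:
// Sketch: force this file's dirty pages to disk before closing, so the next
// copy does not inherit a buffer cache full of this file's dirty pages.
IOUtils.copy(fis, fos);
fos.getFD().sync();   // fsync(2): itself takes ~2.5s here, but it removes the slow close()
fos.close();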
You seem to be on to something interesting. Under Linux, someone is allowed to keep holding a file handle to the original file when you open it again, effectively deleting the directory entry and starting afresh. This does not bother the original file (handle). On closing, then, maybe some disk directory work happens.
Test it with IOUtils.copyLarge and Files.copy:
Path target = Paths.get("/home/d1x/temp/500mb.out");
Files.copy(fis, target, StandardCopyOption.REPLACE_EXISTING);
(I once saw an IOUtils.copy that just called copyLarge, but Files.copy should behave nicely.)
Note that I asked this question because I was curious why this was happening; it was not meant to be a measurement of copy throughput.
To summarize:
As EJP noted, the whole thing is not connected to Java. The result is the same if multiple consecutive cp commands are run in a bash script.
The best answer to why this is happening is Stephen's: an fsync between the copy calls removes the issue (but the fsync itself takes ~2.5s).
The best way to solve this is to use Files.copy(in, target, REPLACE_EXISTING) (as in Joop's answer): first check whether the target file exists and, if so, delete it (instead of "overwriting" it). Then you can write and close the stream quickly. A sketch follows below.
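A sketch of that last point, reusing the paths from the test code above:
// Sketch: delete the stale target first so close() does not have to wait for
// the overwritten file's dirty pages, then copy into a fresh file.
Path source = Paths.get("/home/d1x/temp/500mb.in");
Path target = Paths.get("/home/d1x/temp/500mb.out");
Files.deleteIfExists(target);   // delete instead of overwriting in place
Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);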
On my Windows 7 machine, Files.newInputStream returns a sun.nio.ch.ChannelInputStream. When I tested its performance against FileInputStream, I was surprised to find that FileInputStream is faster.
This test
InputStream in = new FileInputStream("test");
long t0 = System.currentTimeMillis();
byte[] a = new byte[16 * 1024];
for (int n; (n = in.read(a)) != -1;) {
}
System.out.println(System.currentTimeMillis() - t0);
reads a 100MB file in 125ms. If I replace the first line with
InputStream in = Files.newInputStream(Paths.get("test"));
I get 320ms.
If Files.newInputStream is slower, what advantages does it have over FileInputStream?
If you tested new FileInputStream second, you are probably just seeing the effect of cache priming by the operating system. It isn't plausible that Java is causing any significant difference to an I/O-bound process. Try it the other way around, and on a much larger dataset.
I don't want to be a buzzkill, but the javadoc doesn't state any advantages, nor does any other documentation I could find:
Opens a file, returning an input stream to read from the file. The stream will not be buffered, and is not required to support the mark or reset methods. The stream will be safe for access by multiple concurrent threads. Reading commences at the beginning of the file. Whether the returned stream is asynchronously closeable and/or interruptible is highly file system provider specific and therefore not specified.
I think the method is just a utility method, not necessarily meant to replace or improve on FileInputStream. Note that the concurrency point might explain some of the slowdown.
Your FileInputStream and FileOutputStream objects might introduce long GC pauses
Every time you create either a FileInputStream or a FileOutputStream, you are creating an object. Even if you close it correctly and promptly, it will be put into a special category that only gets cleaned up when the garbage collector does a full GC. Sadly, due to backwards compatibility constraints, this is not something that can be fixed in the JDK anytime soon as there could be some code out there where somebody has extended FileInputStream / FileOutputStream and is relying on those finalize() methods to ensure the call to close().
The solution (at least if you are using Java 7 or newer) is not too hard
— just switch to Files.newInputStream(...) and Files.newOutputStream(...)
https://dzone.com/articles/fileinputstream-fileoutputstream-considered-harmful
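A minimal sketch of that switch, with illustrative paths, using try-with-resources so the streams are closed deterministically (the article above recommends this to avoid the finalize()-related GC cost of FileInputStream / FileOutputStream):
Path in = Paths.get("input.dat");       // illustrative paths
Path out = Paths.get("output.dat");
try (InputStream is = Files.newInputStream(in);
     OutputStream os = Files.newOutputStream(out)) {
    byte[] buf = new byte[8192];
    for (int n; (n = is.read(buf)) != -1; ) {
        os.write(buf, 0, n);
    }
}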
The documentation says:
"The stream will not be buffered"
That's because Files.newInputStream(Path) supports non-blocking I/O.
You can try it in debug mode: you can open the non-blocking input stream and modify the file at the same time, but if you use FileInputStream, you cannot do such things.
FileInputStream will require a "write lock" on the file, so it can buffer the content of the file and increase the speed of reading.
But ChannelInputStream cannot. It must guarantee that it is reading the "current" content of the file.
The above is my experience; I didn't check every point in the Java docs.
I'm using RandomAccessFile in java:
file = new RandomAccessFile(filename, "rw");
...
file.writeBytes(...);
How can I ensure that this data is flushed to the Operating System? There is no file.flush() method. (Note that I don't actually expect it to be physically written, I'm content with it being flushed to the operating system, so that the data will survive a tomcat crash but not necessarily an unexpected server power loss).
I'm using tomcat6 on Linux.
The only classes that provide a .flush() method are those that actually maintain their own buffers. As java.io.RandomAccessFile does not itself maintain a buffer, it does not need to be flushed.
Have a careful look at the RandomAccessFile constructor javadoc:
The "rws" and "rwd" modes work much like the force(boolean) method of the FileChannel class, passing arguments of true and false, respectively, except that they always apply to every I/O operation and are therefore often more efficient. If the file resides on a local storage device then when an invocation of a method of this class returns it is guaranteed that all changes made to the file by that invocation will have been written to that device. This is useful for ensuring that critical information is not lost in the event of a system crash. If the file does not reside on a local device then no such guarantee is made.
You can use the getFD().sync() method.
Here's what I do in my app:
rf.close();
rf = new RandomAccessFile("mydata", "rw");
This gives a 3-4x gain in performance compared to getFD().sync() and a 5-7x gain compared to "rws" mode, and it does exactly what the original question proposed: it passes the unsaved data on to the OS and out of the JVM. It doesn't physically write to disk, and therefore introduces no annoying delays.
I reached here with the very same curiosity.
And I really can't figure out what the "needs to be flushed to the OS but not necessarily to disk" part means.
In my opinion, the closest match to the concept of a managed flush is getFD().sync(), as @AVD said:
try(RandomAccessFile raw = new RandomAccessFile(file, "rw")) {
raw.write...
raw.write...
raw.getFD().sync();
raw.wirte...
}
which, according to its documentation, works very much like FileChannel#force(boolean) called with true.
Now "rws" and "rwd" look like they work as if StandardOpenOption#SYNC and StandardOpenOption#DSYNC, respectively, had been specified when opening a FileChannel.
try(RandomAccessFile raw = new RandomAccessFile(file, "rws")) {
raw.write...
raw.write...
raw.wirte...
// don't worry be happy, woo~ hoo~ hoo~
}
I learned that you can't.
Some related links here: http://www.cs.usfca.edu/~parrt/course/601/lectures/io.html
and here: http://tutorials.jenkov.com/java-io/bufferedwriter.html