I'm facing a slightly weird situation.
I'm copying from FileInputStream to FileOutputStream a file that is sized around 500MB.
It goes pretty well (takes around 500ms). When I close this FileOutputStream the FIRST time, it takes about 1ms.
But here comes the catch, when I run this again, every consecutive close takes around 1500-2000ms!
The duration is dropped back to 1ms when I delete this file.
Is there some essential java.io knowledge I'm missing?
It seems to be related to the OS. I'm running on ArchLinux (the same code run on Windows 7 keeps all the times under 20ms). Note that it doesn't matter whether it runs in OpenJDK or Oracle's JDK. The drive is a solid-state drive with an ext4 file system.
Here is my testing code:
public void copyMultipleTimes() throws IOException {
    copy();
    copy();
    copy();
    new File("/home/d1x/temp/500mb.out").delete();
    copy();
    copy();
    // Runtime.getRuntime().exec("sync") => same results
    // Thread.sleep(30000) => same results
    // combination of sync & sleep => same results
    copy();
}
private void copy() throws IOException {
    FileInputStream fis = new FileInputStream("/home/d1x/temp/500mb.in");
    FileOutputStream fos = new FileOutputStream("/home/d1x/temp/500mb.out");
    IOUtils.copy(fis, fos); // copyLarge => same results
    // copying always takes the same amount of time; only close() "enlarges"
    fis.close(); // closing the input stream is always fast
    // fos.flush(); // has no effect
    // fos.getFD().sync(); // solves the problem but takes ~2.5s
    long start = System.currentTimeMillis();
    fos.close();
    System.out.println("OutputStream close took " + (System.currentTimeMillis() - start) + "ms");
}
The output is then:
OutputStream close took 0ms
OutputStream close took 1951ms
OutputStream close took 1934ms
OutputStream close took 1ms
OutputStream close took 1592ms
OutputStream close took 1727ms
@Duncan proposed the following explanation:
The first call to close() returns quickly, yet the OS is still flushing data to disk. The subsequent calls to close() can't complete until the previous flushing is complete.
I think this is close to the mark, but not exactly correct.
I think that what is actually going on here is that the first copy is filling up the operating system's file buffer cache with large numbers of dirty pages. The internal daemon that flushes the dirty pages to discs may start working on them, but it is still going when you start the second copy.
When you do the second copy, the OS tries to acquire buffer cache pages for reading and writing. But since the buffer cache is full of dirty pages the read and write calls are repeatedly blocked, waiting for free pages to become available. But before a dirty page can be recycled, the data in the page needs to be written to disc. The net result is that the copy slows down to the effective data write rate.
A 30 second pause may not be sufficient to complete flushing the dirty pages to disc.
One thing you could try is to do an fsync(fd) or fdatasync(fd) between the copies. In Java, the way to do that is to call FileDescriptor.sync().
Now, I can't say if this is going to improve total copy throughput, but I'd expect a sync operation to be better at writing out (just) one file than relying on the page eviction algorithm to do it.
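For illustration, a minimal sketch of what that could look like in the copy method from the question (the method name is made up; the paths and IOUtils are the ones the question already uses):

private void copyWithSync() throws IOException {
    try (FileInputStream fis = new FileInputStream("/home/d1x/temp/500mb.in");
         FileOutputStream fos = new FileOutputStream("/home/d1x/temp/500mb.out")) {
        IOUtils.copy(fis, fos);
        // Push this file's dirty pages to disk before close(), so the next
        // copy does not start against a buffer cache full of dirty pages.
        fos.getFD().sync();   // roughly what fsync(fd) does at the OS level
    }
}

Whether this improves total throughput is exactly the open question above; the sync itself was measured at ~2.5s in the question.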
You seem to be onto something interesting. Under Linux, a process is allowed to keep holding a file handle to the original file while you open (i.e. recreate) it, which effectively deletes the directory entry and starts afresh. This does not bother the original file (handle). On closing, then, maybe some disk directory work happens.
Test it with IOUtils.copyLarge and Files.copy:
Path target = Paths.get("/home/d1x/temp/500mb.out");
Files.copy(fis, target, StandardCopyOption.REPLACE_EXISTING);
(I once saw an IOUtils.copy implementation that just called copyLarge, but Files.copy should behave nicely.)
Note that this question was asked because I was curious why this was happening; it was not meant to be a measurement of copy throughput.
To summarize:
As EJP noted, the whole thing is not connected to Java. The result is the same if multiple consecutive cp commands are run in a bash script.
The best answer to why this is happening is Stephen's: an fsync between the copy calls removes the issue (but the fsync itself takes ~2.5s).
The best way to solve this is to do it the way Files.copy(in, target, REPLACE_EXISTING) does (as in Joop's answer): first check whether the target file exists and, if so, delete it (instead of "overwriting" it). Then you can write and close the stream fast; a sketch of that variant follows below.
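A minimal sketch of that "delete first, then write" variant (the method name is made up; the paths and IOUtils are the ones from the question):

private void copyToFreshFile() throws IOException {
    File target = new File("/home/d1x/temp/500mb.out");
    // Delete the existing target first so we never rewrite an existing file's
    // pages; the subsequent write and close() are then fast again.
    if (target.exists()) {
        target.delete();
    }
    try (FileInputStream fis = new FileInputStream("/home/d1x/temp/500mb.in");
         FileOutputStream fos = new FileOutputStream(target)) {
        IOUtils.copy(fis, fos);
    }
}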
Related
I'm doing some file I/O with multiple files (writing to 19 files, it so happens). After writing to them a few hundred times I get the Java IOException: Too many open files. But I actually have only a few files opened at once. What is the problem here? I can verify that the writes were successful.
On Linux and other UNIX / UNIX-like platforms, the OS places a limit on the number of open file descriptors that a process may have at any given time. In the old days, this limit used to be hardwired1, and relatively small. These days it is much larger (hundreds / thousands), and subject to a "soft" per-process configurable resource limit. (Look up the ulimit shell builtin ...)
Your Java application must be exceeding the per-process file descriptor limit.
You say that you have 19 files open, and that after a few hundred times you get an IOException saying "too many files open". Now this particular exception can ONLY happen when a new file descriptor is requested; i.e. when you are opening a file (or a pipe or a socket). You can verify this by printing the stacktrace for the IOException.
Unless your application is being run with a small resource limit (which seems unlikely), it follows that it must be repeatedly opening files / sockets / pipes, and failing to close them. Find out why that is happening and you should be able to figure out what to do about it.
FYI, the following pattern is a safe way to write to files that is guaranteed not to leak file descriptors.
Writer w = new FileWriter(...);
try {
// write stuff to the file
} finally {
try {
w.close();
} catch (IOException ex) {
// Log error writing file and bail out.
}
}
1 - Hardwired, as in compiled into the kernel. Changing the number of available fd slots required a recompilation ... and could result in less memory being available for other things. In the days when Unix commonly ran on 16-bit machines, these things really mattered.
UPDATE
The Java 7 way is more concise:
try (Writer w = new FileWriter(...)) {
    // write stuff to the file
} // the `w` resource is automatically closed
UPDATE 2
Apparently you can also encounter a "too many files open" while attempting to run an external program. The basic cause is as described above. However, the reason that you encounter this in exec(...) is that the JVM is attempting to create "pipe" file descriptors that will be connected to the external application's standard input / output / error.
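A hedged sketch of how to keep those pipe descriptors from accumulating: merge and drain the child's output, then wait for the process so the OS can reclaim everything. The method name is made up; the imports needed are java.io.BufferedReader, java.io.InputStreamReader and java.io.IOException.

static void runAndDrain(String... command) throws IOException, InterruptedException {
    ProcessBuilder pb = new ProcessBuilder(command);
    pb.redirectErrorStream(true);   // merge stderr into stdout: one pipe fewer to drain
    Process p = pb.start();
    try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
        while (r.readLine() != null) {
            // consume the output so the pipe is drained and then closed
        }
    }
    p.waitFor();                    // reap the process so its descriptors can be released
}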
For UNIX:
As Stephen C has suggested, changing the maximum file descriptor value to a higher value avoids this problem.
Try looking at your present file descriptor capacity:
$ ulimit -n
Then change the limit according to your requirements.
$ ulimit -n <value>
Note that this just changes the limits in the current shell and any child / descendant process. To make the change "stick" you need to put it into the relevant shell script or initialization file.
You're obviously not closing your file descriptors before opening new ones. Are you on Windows or Linux?
Although in most general cases the error is quite clearly that file handles have not been closed, I just encountered an instance with JDK7 on Linux that is sufficiently messed up to be worth explaining here.
The program opened a FileOutputStream (fos), a BufferedOutputStream (bos) and a DataOutputStream (dos). After writing to the DataOutputStream, the dos was closed and I thought everything had gone fine.
Internally, however, the dos tried to flush the bos, which returned a "disk full" error. That exception was eaten by the DataOutputStream, and as a consequence the underlying bos was not closed, hence the fos was still open.
At a later stage that file was then renamed from (something with a .tmp) to its real name. Thereby, the Java file descriptor tracker lost track of the original .tmp, yet it was still open!
To solve this, I had to first flush the DataOutputStream myself, catch the IOException, and then close the FileOutputStream myself.
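A sketch of the defensive pattern that follows from this (the file name is illustrative): flush the top-level stream yourself so the write error surfaces, and close the underlying FileOutputStream in a finally so the descriptor is released even if the wrappers misbehave.

FileOutputStream fos = new FileOutputStream("out.tmp");   // illustrative file name
try {
    DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(fos));
    // ... write to dos ...
    dos.flush();   // a "disk full" (or other) write error surfaces here as an IOException,
                   // instead of being swallowed by an intermediate close()
    dos.close();
} finally {
    fos.close();   // closing an already-closed FileOutputStream is a no-op, so this is safe
                   // and it guarantees the underlying descriptor is released
}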
I hope this helps someone.
If you're seeing this in automated tests: it's best to properly close all files between test runs.
If you're not sure which file(s) you have left open, a good place to start is the "open" calls which are throwing exceptions! 😄
If you have a file handle that should be open exactly as long as its parent object is alive, you could add a finalize method on the parent that calls close on the file handle, and call System.gc() between tests.
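A minimal sketch of that idea (the class and field names are made up; note that finalize() is deprecated in newer JDKs, so treat this strictly as a test-only workaround):

class LogReader {                            // hypothetical parent object
    private final FileInputStream in;

    LogReader(File file) throws IOException {
        this.in = new FileInputStream(file);
    }

    @Override
    protected void finalize() throws Throwable {
        try {
            in.close();                      // last-chance close when the parent is collected
        } finally {
            super.finalize();
        }
    }
}

// ...and between test runs:
System.gc();                                 // a request only; collection is not guaranteed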
Recently I had a program batch-processing files. I had certainly closed each file in the loop, but the error was still there.
Later, I resolved the problem by garbage-collecting eagerly every hundred files:
int index = 0;
while (/* there are more files to process */) {
    OutputStream out = /* open the next file */;
    try {
        // write to the output stream...
    } finally {
        out.close();
    }
    if (index++ % 100 == 0) {
        System.gc();
    }
}
On my Windows 7 machine, Files.newInputStream returns a sun.nio.ch.ChannelInputStream. When I tested its performance against FileInputStream, I was surprised to find that FileInputStream is faster.
This test
InputStream in = new FileInputStream("test");
long t0 = System.currentTimeMillis();
byte[] a = new byte[16 * 1024];
for (int n; (n = in.read(a)) != -1;) {
}
System.out.println(System.currentTimeMillis() - t0);
reads a 100 MB file in 125 ms. If I replace the first line with
InputStream in = Files.newInputStream(Paths.get("test"));
I get 320ms.
If Files.newInputStream is slower, what advantages does it have over FileInputStream?
If you tested new FileInputStream second, you are probably just seeing the effect of cache priming by the operating system. It isn't plausible that Java is causing any significant difference to an I/O-bound process. Try it the other way around, and on a much larger dataset.
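To control for that, a sketch of a fairer comparison: run both variants several times over the same file so the OS page cache is warm for both, and only compare the later runs (the file name is the one from the question; the class name is made up):

import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.Callable;

public class ReadBenchmark {

    // Drain the stream with the same 16 KB buffer as in the question and return elapsed ms.
    static long time(Callable<InputStream> opener) throws Exception {
        long t0 = System.currentTimeMillis();
        try (InputStream in = opener.call()) {
            byte[] a = new byte[16 * 1024];
            while (in.read(a) != -1) { /* just drain */ }
        }
        return System.currentTimeMillis() - t0;
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 3; i++) {   // repeat so both variants also run against a warm cache
            System.out.println("FileInputStream:      "
                    + time(() -> new FileInputStream("test")) + " ms");
            System.out.println("Files.newInputStream: "
                    + time(() -> Files.newInputStream(Paths.get("test"))) + " ms");
        }
    }
}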
I don't want to be the buzzkill, but the javadoc doesn't state any advantages, nor does any documentation I could find:
Opens a file, returning an input stream to read from the file. The stream will not be buffered, and is not required to support the mark or reset methods. The stream will be safe for access by multiple concurrent threads. Reading commences at the beginning of the file. Whether the returned stream is asynchronously closeable and/or interruptible is highly file system provider specific and therefore not specified.
I think the method is just a utility method, not necessarily meant to replace or improve on FileInputStream. Note that the concurrency point might explain some of the slowdown.
Your FileInputStreams and FileOutputStreams might introduce long GC pauses
Every time you create either a FileInputStream or a FileOutputStream, you are creating an object. Even if you close it correctly and promptly, it will be put into a special category that only gets cleaned up when the garbage collector does a full GC. Sadly, due to backwards compatibility constraints, this is not something that can be fixed in the JDK anytime soon as there could be some code out there where somebody has extended FileInputStream / FileOutputStream and is relying on those finalize() methods to ensure the call to close().
The solution (at least if you are using Java 7 or newer) is not too hard: just switch to Files.newInputStream(...) and Files.newOutputStream(...).
https://dzone.com/articles/fileinputstream-fileoutputstream-considered-harmful
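For example, a minimal sketch of the switch, using try-with-resources so the streams are still closed deterministically (the method name and buffer size are illustrative):

// Sketch only; source and target are whatever paths your code currently opens
// with new FileInputStream(...) / new FileOutputStream(...).
static void copyFile(Path source, Path target) throws IOException {
    try (InputStream in = Files.newInputStream(source);
         OutputStream out = Files.newOutputStream(target)) {
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);
        }
    } // closed deterministically here; per the article above, these stream
      // implementations do not rely on finalize() for cleanup
}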
The documentation says:
"The stream will not be buffered."
That's because Files.newInputStream(Path) supports non-blocking IO.
You can try it in debug mode: you can open the non-blocking input stream and modify the file at the same time, but if you use FileInputStream, you cannot do such things.
FileInputStream requires a "write lock" on the file, so it can buffer the content of the file and increase the reading speed.
But ChannelInputStream cannot; it must guarantee that it is reading the "current" content of the file.
The above is from my experience; I didn't check every point in the Javadoc.
In my program I have a loop that scans a bunch of files and reads their content. The problem happened at around the 1500th file and can't seem to be reproduced (or understood, by me).
The problem:
java.io.FileNotFoundException: /path/to/file//myFile (Too many open files)
Exception points to this method:
private static String readFileAsRawString(File f) throws IOException {
    FileInputStream stream = new FileInputStream(f); // <------------ Stacktrace
    try {
        FileChannel fc = stream.getChannel();
        MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        return Charset.defaultCharset().decode(bb).toString();
    } finally {
        stream.close();
    }
}
I ran this method over 20,000 files in QA and it seemed to have no problems.
Do you see anything wrong with the code I pasted above that would cause this issue?
The mapping is suspect. A MappedByteBuffer can outlive its FileChannel, and is valid until it is garbage collected. You might not have enough garbage to run the GC, but perhaps on a particular platform file handles are retained by unreferenced buffers.
Unless explicit garbage collection is disabled (i.e. unless -XX:+DisableExplicitGC is set), you should be able to test for this by catching the exception, calling System.gc(), and trying again. If it works on the second try, that's your problem. However, calling System.gc() as a permanent fix is a bad idea. The solution that will perform best overall will take some profiling on the target platform.
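A sketch of that diagnostic, wrapping the readFileAsRawString method from the question (the wrapper name is made up); if the retry after System.gc() succeeds, the unreferenced MappedByteBuffers were indeed holding the descriptors:

private static String readWithGcRetry(File f) throws IOException {
    try {
        return readFileAsRawString(f);
    } catch (FileNotFoundException e) {   // "Too many open files" surfaces as this exception
        System.gc();                      // give unreferenced MappedByteBuffers a chance to be collected
        return readFileAsRawString(f);    // if this succeeds, the mappings held the descriptors
    }
}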
Don't use MappedByteBuffer for this trivial task. There is no well-defined time at which they are released. Just open the file, read it, close it.
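One hedged way to do exactly that, as a drop-in replacement for the method in the question (Java 7+, using Files.readAllBytes):

private static String readFileAsRawString(File f) throws IOException {
    // Open, read, close: no mapping involved, so the descriptor is released
    // as soon as readAllBytes() returns.
    byte[] bytes = Files.readAllBytes(f.toPath());
    return new String(bytes, Charset.defaultCharset());
}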
I think you are opening too many files too fast; try adding a wait() to test this.
Then add a static counter that keeps track of open files, and if many files are already open, add a wait mechanism...
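As a sketch, such a counter could be a small wrapper around the open and close calls (the names here are made up; AtomicInteger is java.util.concurrent.atomic.AtomicInteger):

// Hypothetical diagnostic: count how many files this class currently holds open.
private static final AtomicInteger OPEN_FILES = new AtomicInteger();

private static FileInputStream openCounted(File f) throws IOException {
    int open = OPEN_FILES.incrementAndGet();
    if (open > 900) {   // warn a little below a typical default ulimit of 1024
        System.err.println("WARNING: " + open + " files currently open");
    }
    return new FileInputStream(f);
}

private static void closeCounted(InputStream in) throws IOException {
    in.close();
    OPEN_FILES.decrementAndGet();
}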
I'm trying to understand whether the performance I'm obtaining from the get() method of the MappedByteBuffer class is normal or not. My code is the following:
private byte[] testBuffer = new byte[4194304];
private File sdcardDir, filepath;
private FileInputStream inputStream;
private FileChannel fileChannel;
private MappedByteBuffer mappedByteBuffer;
// Obtain the root folder of the external storage
sdcardDir = Environment.getExternalStorageDirectory();
// Create the reference to the file to be read
filepath = new File(sdcardDir, "largetest.avi");
inputStream = new FileInputStream(filepath);
fileChannel = inputStream.getChannel();
mappedByteBuffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, (4194304));
Log.d("GFXUnpack", "Starting to read");
mappedByteBuffer.position(0);
mappedByteBuffer.get(testBuffer, 0, (4194304));
Log.d("GFXUnpack", "Ended to read");
mappedByteBuffer.rewind();
Since I'm a beginner and I needed the fastest way to read data from the SD card, I looked for documentation and found that file mapping is considered, in many cases, the fastest approach to reading from a file. But when I run the above code, although the buffer is correctly filled, the performance is so slow (or maybe not? You decide!) that reading those 4194304 bytes takes almost 5 seconds, which is less than 1 MB per second. I'm using Eclipse directly connected to my Optimus Dual smartphone; the read takes the same time even if I put the reading operation in a loop (I thought initialization overhead might stop mattering once multiple reads are performed... not the case).
This size-to-time relation doesn't change if I make the file smaller or larger: 8 MB is read in almost 9 seconds, 2 MB in 2 seconds, and so on.
I've read that even a slow SD card can be read at a speed of at least 5 MB per second...
Note that 4194304 is a power-of-2 value, since I've read that this should increase performance.
Please tell me your opinion: is 1 MB per second the actual performance on a modern smartphone, or is there something wrong with my code? Thank you.
It's worth noting that in the HotSpot JVM, MappedByteBuffer.get() uses an intrinsic rather than a native call. When copying large blocks of data it copies multiple bytes at a time, e.g. 8 bytes, or longer with MMX instructions.
AFAIK, Android doesn't do this, which makes this call much more expensive.
I can't see anything wrong with your code. It is probably just the speed of the device and / or the file system implementation. As Tom Hawtin puts it "[m]emory mapped I/O will not make your disks run faster".
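If you want to separate the cost of the mapping from the speed of the device, a hedged sketch of a comparison read over the same filepath using a plain BufferedInputStream (the 64 KB buffer size is arbitrary):

// Sketch: read the same 4 MB with a plain buffered stream over the same file,
// to compare against the MappedByteBuffer.get() timing above.
long t0 = System.currentTimeMillis();
byte[] plainBuffer = new byte[4194304];
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(filepath), 64 * 1024);
try {
    int off = 0;
    while (off < plainBuffer.length) {
        int n = bis.read(plainBuffer, off, plainBuffer.length - off);
        if (n == -1) break;   // file shorter than 4 MB
        off += n;
    }
} finally {
    bis.close();
}
Log.d("GFXUnpack", "Buffered read took " + (System.currentTimeMillis() - t0) + " ms");

If the buffered read is substantially faster on the same device, the mapping itself (or how Android implements get()) is the bottleneck rather than the SD card.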