Files.newInputStream creates slow InputStream - java

On my Windows 7 machine, Files.newInputStream returns a sun.nio.ch.ChannelInputStream. When I tested its performance against FileInputStream I was surprised to find that FileInputStream is faster.
This test
InputStream in = new FileInputStream("test");
long t0 = System.currentTimeMillis();
byte[] a = new byte[16 * 1024];
for (int n; (n = in.read(a)) != -1;) {
}
System.out.println(System.currentTimeMillis() - t0);
reads a 100 MB file in 125 ms. If I replace the first line with
InputStream in = Files.newInputStream(Paths.get("test"));
I get 320 ms.
If Files.newInputStream is slower, what advantages does it have over FileInputStream?

If you tested new FileInputStream second, you are probably just seeing the effect of cache priming by the operating system. It isn't plausible that Java is causing any significant difference to an I/O-bound process. Try it the other way around, and on a much larger dataset.
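A rough sketch of re-running the comparison in both orders, reusing the question's file name "test" (the class and helper names here are made up):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadOrderCheck {
    // Drains the stream with the same 16 KB buffer the question uses and
    // returns the elapsed time in milliseconds.
    static long time(InputStream in) throws IOException {
        long t0 = System.currentTimeMillis();
        try (InputStream stream = in) {
            byte[] a = new byte[16 * 1024];
            while (stream.read(a) != -1) {
                // just drain the file
            }
        }
        return System.currentTimeMillis() - t0;
    }

    public static void main(String[] args) throws IOException {
        // Swap these two lines (or run the program twice) before drawing
        // conclusions, so OS cache priming by the first read does not
        // penalize whichever stream goes second.
        System.out.println("FileInputStream:      " + time(new FileInputStream("test")) + " ms");
        System.out.println("Files.newInputStream: " + time(Files.newInputStream(Paths.get("test"))) + " ms");
    }
}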

I don't want to be a buzzkill, but the javadoc doesn't state any advantages, and neither does any documentation I could find:
Opens a file, returning an input stream to read from the file. The
stream will not be buffered, and is not required to support the mark
or reset methods. The stream will be safe for access by multiple
concurrent threads. Reading commences at the beginning of the file.
Whether the returned stream is asynchronously closeable and/or
interruptible is highly file system provider specific and therefore
not specified.
I think the method is just a utility method, not necessarily meant to replace or improve on FileInputStream. Note that the concurrency point might explain some of the slowdown.

Your FileInputStream and FileOutputStream objects might introduce long GC pauses
Every time you create either a FileInputStream or a FileOutputStream, you are creating an object. Even if you close it correctly and promptly, it will be put into a special category that only gets cleaned up when the garbage collector does a full GC. Sadly, due to backwards compatibility constraints, this is not something that can be fixed in the JDK anytime soon as there could be some code out there where somebody has extended FileInputStream / FileOutputStream and is relying on those finalize() methods to ensure the call to close().
The solution (at least if you are using Java 7 or newer) is not too hard
— just switch to Files.newInputStream(...) and Files.newOutputStream(...)
https://dzone.com/articles/fileinputstream-fileoutputstream-considered-harmful
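A minimal sketch of the suggested switch, assuming Java 7+ and placeholder file names; the streams returned by Files.newInputStream and Files.newOutputStream do not carry the finalize()-based cleanup that FileInputStream and FileOutputStream do:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CopyWithoutFinalizers {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes both streams; nothing is left for a full GC
        // to finalize, unlike new FileInputStream(...) / new FileOutputStream(...).
        try (InputStream in = Files.newInputStream(Paths.get("test.in"));
             OutputStream out = Files.newOutputStream(Paths.get("test.out"))) {
            byte[] buf = new byte[16 * 1024];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
        }
    }
}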

The documentation says
"The stream will not be buffered"
That's because Files.newInputStream(Path) supports non-blocking I/O.
You can try it in debug mode: you can open the non-blocking input stream and modify the file at the same time, but if you use FileInputStream, you cannot do such things.
FileInputStream will require a "write lock" on the file, so it can buffer the content of the file and increase reading speed.
But ChannelInputStream cannot; it must guarantee that it is reading the "current" content of the file.
The above is my experience; I didn't check every point against the Javadoc.

Related

Using buffered streams for sending objects?

I'm currently using Java sockets in a client-server application with OutputStream, not BufferedOutputStream (and the same for the input streams).
The client and server exchange serialized objects (via the writeObject() method).
Does it make sense (more speed) to use BufferedOutputStream and BufferedInputStream in this case?
And when do I have to flush, or should I not call flush() at all?
Does it make sense (more speed) to use BufferedOutputStream and BufferedInputStream in this case?
Actually, it probably doesn't make sense.[1]
The object stream implementation internally wraps the stream it has been given with a private class called BlockDataOutputStream that does buffering. If you wrap the stream yourself, you will have two levels of buffering ... which is likely to make performance worse.[2]
And when do I have to flush, or should I not call flush() at all?
Yes, flushing is probably necessary. But there is no universal answer as to when to do it.
On the one hand, if you flush too often, you generate extra network traffic.
On the other hand, if you don't flush when it is needed, the server can stall waiting for an object that the client has written but not flushed.
You need to find the compromise between these two syndromes ... and that depends on your application's client/server interaction patterns; e.g. whether the message patterns are synchronous (e.g. message/response) or asynchronous (e.g. message streaming).
[1] - To be certain on this, you would need to do some forensic testing to 1) measure the system performance, and 2) determine what syscalls are made and when network packets are sent. For a general answer, you would need to repeat this for a number of use-cases. I'd also recommend looking at the Java library code yourself to confirm my (brief) reading.
[2] - Probably only a little bit worse, but a well designed benchmark would pick up a small performance difference.
UPDATE
After writing the above, I found this Q&A - Performance issue using Javas Object streams with Sockets - which seems to suggest that using BufferedInputStream / BufferedOutputStream helps. However, I'm not certain whether the performance improvement that was reported is 1) real (i.e. not a warmup artefact) and 2) due to the buffering. It could be just due to adding the flush() call. (Why: because the flush could cause the network stack to push the data sooner.)
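A rough sketch of the pattern discussed above — writing objects straight to the socket's stream (no extra BufferedOutputStream) and flushing at message boundaries; the host, port, and payloads are made up:

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.net.Socket;

public class ObjectSender {
    public static void main(String[] args) throws IOException {
        // ObjectOutputStream already buffers internally (via its private
        // BlockDataOutputStream), so no extra wrapping is added here.
        try (Socket socket = new Socket("localhost", 9000);
             ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream())) {
            out.writeObject("request-1");
            out.flush(); // push the message now so the server is not left waiting
            out.writeObject("request-2");
            out.flush();
        }
    }
}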
I think these links might help you:
What is the purpose of flush() in Java streams?
The flush method flushes the output stream and forces any buffered output bytes to be written out. The general contract of flush is that calling it is an indication that, if any bytes previously written have been buffered by the implementation of the output stream, such bytes should immediately be written to their intended destination.
How java.io.Buffer* stream differs from normal streams?
Internally a buffer array is used and instead of reading bytes individually from the underlying input stream enough bytes are read to fill the buffer. This generally results in faster performance as less reads are required on the underlying input stream.
http://www.oracle.com/technetwork/articles/javase/perftuning-137844.html
As a means of starting the discussion, here are some basic rules on how to speed up I/O:
1. Avoid accessing the disk.
2. Avoid accessing the underlying operating system.
3. Avoid method calls.
4. Avoid processing bytes and characters individually.
So using buffered streams usually speeds up the I/O process, as fewer read() calls are made on the underlying stream.
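A minimal sketch of that effect with a placeholder file name: the BufferedInputStream refills an internal buffer (8 KB by default), so the single-byte reads below rarely reach the underlying file stream:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedReadExample {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream("test"))) {
            long count = 0;
            for (int b; (b = in.read()) != -1; ) {
                count++; // each read() is served from the buffer most of the time
            }
            System.out.println("Read " + count + " bytes");
        }
    }
}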

Java FileOutputStream consecutive close takes a long time

I'm facing a slightly weird situation.
I'm copying a file sized around 500 MB from a FileInputStream to a FileOutputStream.
It goes pretty well (takes around 500 ms). When I close this FileOutputStream the FIRST time, it takes about 1 ms.
But here comes the catch: when I run this again, every consecutive close takes around 1500-2000 ms!
The duration drops back to 1 ms when I delete the file.
Is there some essential java.io knowledge I'm missing?
It seems to be related to the OS. I'm running on Arch Linux (the same code run on Windows 7 keeps all times under 20 ms). Note that it doesn't matter whether it runs on OpenJDK or Oracle's JDK. The hard drive is a solid-state drive with an ext4 file system.
Here is my testing code:
public void copyMultipleTimes() throws IOException {
    copy();
    copy();
    copy();
    new File("/home/d1x/temp/500mb.out").delete();
    copy();
    copy();
    // Runtime.getRuntime().exec("sync") => same results
    // Thread.sleep(30000) => same results
    // combination of sync & sleep => same results
    copy();
}

private void copy() throws IOException {
    FileInputStream fis = new FileInputStream("/home/d1x/temp/500mb.in");
    FileOutputStream fos = new FileOutputStream("/home/d1x/temp/500mb.out");
    IOUtils.copy(fis, fos); // copyLarge => same results
    // copying always takes the same amount of time; only close() "enlarges"
    fis.close(); // closing the input stream is always fast
    // fos.flush();        // has no effect
    // fos.getFD().sync(); // solves the problem but takes ~2.5s
    long start = System.currentTimeMillis();
    fos.close();
    System.out.println("OutputStream close took " + (System.currentTimeMillis() - start) + "ms");
}
The output is then:
OutputStream close took 0ms
OutputStream close took 1951ms
OutputStream close took 1934ms
OutputStream close took 1ms
OutputStream close took 1592ms
OutputStream close took 1727ms
@Duncan proposed the following explanation:
The first call to close() returns quickly, yet the OS is still flushing data to disk. The subsequent calls to close() can't complete until the previous flushing is complete.
I think this is close to the mark, but not exactly correct.
I think that what is actually going on here is that the first copy is filling up the operating system's file buffer cache with large numbers of dirty pages. The internal daemon that flushes the dirty pages to discs may start working on them, but it is still going when you start the second copy.
When you do the second copy, the OS tries to acquire buffer cache pages for reading and writing. But since the buffer cache is full of dirty pages the read and write calls are repeatedly blocked, waiting for free pages to become available. But before a dirty page can be recycled, the data in the page needs to be written to disc. The net result is that the copy slows down to the effective data write rate.
A 30 second pause may not be sufficient to complete flushing the dirty pages to disc.
One thing you could try is to do an fsync(fd) or fdatasync(fd) between the copies. In Java, the way to do that is to call FileDescriptor.sync().
Now, I can't say if this is going to improve total copy throughput, but I'd expect a sync operation to be better at writing out (just) one file than relying on the page eviction algorithm to do it.
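A minimal sketch of that suggestion, with placeholder paths: force the data out with FileDescriptor.sync() before close(), so the next copy does not start against a buffer cache full of dirty pages:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class SyncBeforeClose {
    static void copyWithSync(String from, String to) throws IOException {
        try (FileInputStream fis = new FileInputStream(from);
             FileOutputStream fos = new FileOutputStream(to)) {
            byte[] buf = new byte[64 * 1024];
            for (int n; (n = fis.read(buf)) != -1; ) {
                fos.write(buf, 0, n);
            }
            fos.getFD().sync(); // fsync(2): blocks until the data reaches the device
        }
    }
}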
You seem to be on to something interesting. Under Linux, someone is allowed to keep holding a file handle to the original file while you open it anew, which actually deletes the directory entry and starts afresh. This does not bother the original file (handle). On closing, then, maybe some disk directory work happens.
Test it with IOUtils.copyLarge and Files.copy:
Path target = Paths.get("/home/d1x/temp/500mb.out");
Files.copy(fis, target, StandardCopyOption.REPLACE_EXISTING);
(I once saw an IOUtils.copy that just called copyLarge, but Files.copy should behave nicely.)
Note that this question was asked because I was curious why this is happening, it was not meant to be measurement of copy throughput.
To summarize:
As EJP noted, the whole thing is not connected to Java. The result is the same if multiple consecutive cp commands are run in a bash script.
The best answer to why this is happening is Stephen's: an fsync between the copy calls removes the issue (but the fsync itself takes ~2.5 s).
The best way to solve this is to do it as Files.copy(in, target, REPLACE_EXISTING) (as in Joop's answer): first check whether the target file exists and, if so, delete it (instead of "overwriting" it). Then you can write and close the stream quickly.
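A small sketch of that last point, with a placeholder path and a made-up helper name: delete the stale target first, then write to a fresh file:

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DeleteBeforeWrite {
    static FileOutputStream openFresh(String name) throws IOException {
        Path target = Paths.get(name);
        Files.deleteIfExists(target);                 // drop the old directory entry
        return new FileOutputStream(target.toFile()); // brand-new file, nothing to overwrite
    }
}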

Too many open file error, java.io.FileNotFoundException

In my program I have a loop that scans a bunch of files and reads their content. The problem happened at around the 1500th file and can't seem to be reproduced (or understood, at least by me).
The problem:
java.io.FileNotFoundException: /path/to/file//myFile (Too many open files)
Exception points to this method:
private static String readFileAsRawString(File f) throws IOException {
    FileInputStream stream = new FileInputStream(f); // <------------ Stacktrace
    try {
        FileChannel fc = stream.getChannel();
        MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        return Charset.defaultCharset().decode(bb).toString();
    } finally {
        stream.close();
    }
}
I ran this method over 20,000 files in QA and it seems to have no problems.
Do you see anything wrong with the code I pasted above that would cause this issue?
The mapping is suspect. A MappedByteBuffer can outlive its FileChannel, and is valid until it is garbage collected. You might not have enough garbage to run the GC, but perhaps on a particular platform file handles are retained by unreferenced buffers.
Unless explicit garbage collection is disabled (-XX:-DisableExplicitGC), you should be able to test for this by catching the exception, calling System.gc(), and trying again. If it works on the second try, that's your problem. However, calling System.gc() as a permanent fix is a bad idea. The solution that will perform best overall will take some profiling on the target platform.
Don't use MappedByteBuffer for this trivial task. There is no well-defined time at which they are released. Just open the file, read it, close it.
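A minimal sketch of that advice, keeping the original method's signature: Files.readAllBytes opens the file, reads it, and closes the descriptor before returning, so no MappedByteBuffer lingers until a GC:

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;

public class ReadWithoutMapping {
    private static String readFileAsRawString(File f) throws IOException {
        byte[] bytes = Files.readAllBytes(f.toPath());
        return new String(bytes, Charset.defaultCharset());
    }
}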
I think you open too many files too fast; try adding a wait to test this.
Then add a static counter that keeps track of open files and, if many files are already open, add a wait mechanism...
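A rough sketch of that counter-and-wait idea using a Semaphore; the limit of 256 is an arbitrary illustrative value, and the file is assumed to fit in an int-sized array:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.concurrent.Semaphore;

public class BoundedFileReader {
    private static final int MAX_OPEN = 256;
    private static final Semaphore OPEN_FILES = new Semaphore(MAX_OPEN);

    static byte[] read(File f) throws IOException, InterruptedException {
        OPEN_FILES.acquire();                 // wait here if too many files are open
        try (FileInputStream in = new FileInputStream(f)) {
            byte[] data = new byte[(int) f.length()];
            int off = 0;
            for (int n; off < data.length && (n = in.read(data, off, data.length - off)) != -1; ) {
                off += n;
            }
            return data;
        } finally {
            OPEN_FILES.release();             // permit freed once the stream is closed
        }
    }
}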

FileInputStream and FileOutputStream to the same file: Is a read() guaranteed to see all write()s that "happened before"?

I am using a file as a cache for big data. One thread writes to it sequentially, another thread reads it sequentially.
Can I be sure that all data that has been written (by write()) in one thread can be read() from another thread, assuming a proper "happens-before" relationship in terms of the Java memory model? Is this behavior documented?
In my JDK, FileOutputStream does not override flush(), and OutputStream.flush() is empty. That's why I'm wondering...
The streams in question are owned exclusively by a class that I have full control of. Each stream is guaranteed to be accessed by one thread only. My tests show that it works as expected, but I'm still wondering if this is guaranteed and documented.
See also this related discussion.
Assuming you are using a POSIX file system, then yes.
FileInputStream and FileOutputStream on *nix use the read and write system calls internally. The documentation for write says that reads will see the results of past writes:
After a write() to a regular file has successfully returned:
Any successful read() from each byte position in the file that was
modified by that write shall return the data specified by the write()
for that position until such byte positions are again modified.
I'm pretty sure NTFS on Windows has the same read()/write() guarantees.
You can't talk about a "happens-before" relationship in terms of the Java memory model between your FileInputStream and FileOutputStream objects, since they don't share any memory or thread. The VM is free to reorder their operations as long as it honors your synchronization requirements. When you have proper synchronization between reads and writes without application-level buffering, you are safe.
However, FileInputStream and FileOutputStream share a file, which leaves things up to the OS; on mainstream operating systems you can expect a read after a write to return the written data, in order.
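A rough sketch of that combination, with made-up class and method names: the AtomicLong provides the Java-level happens-before edge between the two threads, and the OS guarantees that a read() of positions already covered by a completed write() returns the written data:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

public class FileHandoff {
    // Bytes fully handed to write(); published after the write so the reader
    // thread sees a consistent length.
    private final AtomicLong published = new AtomicLong(0);
    private long bytesRead = 0; // only touched by the reader thread

    // Writer thread: write first, then publish the new length.
    void append(FileOutputStream out, byte[] data) throws IOException {
        out.write(data);
        published.addAndGet(data.length);
    }

    // Reader thread: never read past the published length.
    int readNext(FileInputStream in, byte[] buf) throws IOException {
        long available = published.get() - bytesRead;
        if (available <= 0) {
            return 0; // nothing new has been published yet
        }
        int n = in.read(buf, 0, (int) Math.min(buf.length, available));
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }
}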
If FileOutputStream does not override flush(), then I think you can be sure all data written by write() can be read by read(), unless your OS does something weird with the data (like starting a new thread that waits for the hard drive to spin up to the right speed instead of blocking, etc.) so that it is not written immediately.
No, you need to flush() the Streams (at least for Buffered(Input|Output)Streams), otherwise you could have data in a buffer.
Maybe you need a concurrent data structure?

How do I flush a 'RandomAccessFile' (java)?

I'm using RandomAccessFile in java:
file = new RandomAccessFile(filename, "rw");
...
file.writeBytes(...);
How can I ensure that this data is flushed to the Operating System? There is no file.flush() method. (Note that I don't actually expect it to be physically written, I'm content with it being flushed to the operating system, so that the data will survive a tomcat crash but not necessarily an unexpected server power loss).
I'm using tomcat6 on Linux.
The only classes that provide a .flush() method are those that actually maintain their own buffers. As java.io.RandomAccessFile does not itself maintain a buffer, it does not need to be flushed.
Have a look carefully at RandomAccessFile constructor javadoc:
The "rws" and "rwd" modes work much like the force(boolean) method of the FileChannel class, passing arguments of true and false, respectively, except that they always apply to every I/O operation and are therefore often more efficient. If the file resides on a local storage device then when an invocation of a method of this class returns it is guaranteed that all changes made to the file by that invocation will have been written to that device. This is useful for ensuring that critical information is not lost in the event of a system crash. If the file does not reside on a local device then no such guarantee is made.
You can use the getFD().sync() method.
Here's what I do in my app:
rf.close();
rf = new RandomAccessFile("mydata", "rw");
This gives a 3-4x gain in performance compared to getFD().sync() and 5-7x compared to "rws" mode, and it does exactly what the original question proposed: it passes unsaved data on to the OS and out of the JVM. It doesn't physically write to disk, and therefore introduces no annoying delays.
I reached here with the very same curiosity.
And I really can't figure out what the "flushed to the OS but not necessarily written to disk" part means.
In my opinion, the thing that best matches the concept of managed flushing is getFD().sync(), as @AVD said:
try (RandomAccessFile raw = new RandomAccessFile(file, "rw")) {
    raw.write...
    raw.write...
    raw.getFD().sync();
    raw.write...
}
which, by its documentation, looks like it works very much like what FileChannel#force(boolean) does with true.
Now "rws" and "rwd" look like they work as if StandardOpenOption#SYNC and StandardOpenOption#DSYNC, respectively, had been specified when opening a FileChannel.
try (RandomAccessFile raw = new RandomAccessFile(file, "rws")) {
    raw.write...
    raw.write...
    raw.write...
    // don't worry be happy, woo~ hoo~ hoo~
}
I learned that you can't.
Some related links here: http://www.cs.usfca.edu/~parrt/course/601/lectures/io.html
and here: http://tutorials.jenkov.com/java-io/bufferedwriter.html
