I'm currently using Java sockets in a client-server application with OutputStream and not BufferedOutputStream (and the same for input streams).
The client and server exchange serialized objects (via the writeObject() method).
Does it make sense (more speed) to use BufferedOutputStream and BufferedInputStream in this case?
And when do I have to flush, or should I not write a flush() statement at all?
Does it make sense (more speed) to use BufferedOutputStream and BufferedInputStream in this case?
Actually, it probably doesn't make sense¹.
The object stream implementation internally wraps the stream it has been given with a private class called BlockDataOutputStream that does buffering. If you wrap the stream yourself, you will have two levels of buffering ... which is likely to make performance worse².
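To make that concrete, here is a minimal sketch of the two setups (assuming a connected java.net.Socket named socket; the variable names are just for illustration):

// ObjectOutputStream buffers internally, so this is usually sufficient:
ObjectOutputStream direct = new ObjectOutputStream(socket.getOutputStream());

// Wrapping it yourself adds a second layer of buffering on top:
ObjectOutputStream doubleBuffered = new ObjectOutputStream(new BufferedOutputStream(socket.getOutputStream()));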
And when do I have to flush, or should I not write a flush() statement at all?
Yes, flushing is probably necessary. But there is no universal answer as to when to do it.
On the one hand, if you flush too often, you generate extra network traffic.
On the other hand, if you don't flush when it is needed, the server can stall waiting for an object that the client has written but not flushed.
You need to find a compromise between these two extremes ... and that depends on your application's client/server interaction patterns; e.g. whether the message patterns are synchronous (e.g. message/response) or asynchronous (e.g. message streaming).
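For the synchronous (message/response) case, a common compromise is to flush exactly once per request, just before blocking for the reply. A minimal sketch, assuming out and in are an ObjectOutputStream / ObjectInputStream pair wrapped around the socket streams:

out.writeObject(request);
out.flush(); // push the request onto the wire now, or the server may stall waiting for it
Object response = in.readObject(); // now it is safe to block for the reply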
¹ To be certain of this, you would need to do some forensic testing to 1) measure the system performance, and 2) determine what syscalls are made and when network packets are sent. For a general answer, you would need to repeat this for a number of use-cases. I'd also recommend looking at the Java library code yourself to confirm my (brief) reading.
² Probably only a little bit worse, but a well-designed benchmark would pick up a small performance difference.
UPDATE
After writing the above, I found this Q&A - Performance issue using Javas Object streams with Sockets - which seems to suggest that using BufferedInputStream / BufferedOutputStream helps. However, I'm not certain whether the performance improvement that was reported is 1) real (i.e. not a warmup artefact) and 2) due to the buffering. It could be just due to adding the flush() call. (Why: because the flush could cause the network stack to push the data sooner.)
I think these links might help you:
What is the purpose of flush() in Java streams?
The flush method flushes the output stream and forces any buffered output bytes to be written out. The general contract of flush is that calling it is an indication that, if any bytes previously written have been buffered by the implementation of the output stream, such bytes should immediately be written to their intended destination.
How java.io.Buffer* stream differs from normal streams?
Internally a buffer array is used and instead of reading bytes individually from the underlying input stream enough bytes are read to fill the buffer. This generally results in faster performance as less reads are required on the underlying input stream.
http://www.oracle.com/technetwork/articles/javase/perftuning-137844.html
As a means of starting the discussion, here are some basic rules on how to speed up I/O:
1. Avoid accessing the disk.
2. Avoid accessing the underlying operating system.
3. Avoid method calls.
4. Avoid processing bytes and characters individually.
So using buffered streams usually speeds up I/O processing, as fewer read() calls are made in the background.
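As a rough illustration (a sketch only; the file name is made up and java.io imports are assumed), compare reading a file byte by byte with and without the buffered wrapper:

// Unbuffered: every read() may go down to the OS / disk.
try (InputStream raw = new FileInputStream("data.bin")) {
    int b;
    while ((b = raw.read()) != -1) { /* process b */ }
}

// Buffered: most read() calls are served from an in-memory array.
try (InputStream in = new BufferedInputStream(new FileInputStream("data.bin"))) {
    int b;
    while ((b = in.read()) != -1) { /* process b */ }
}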
Related
I have noticed that some I/O classes in Java (such as BufferedWriter and FileWriter) require a call to flush() after writing (with the exception of auto-flushing, which I'll get to later).
For example, this call to println() will not print anything. However, if I invoke writer#flush() afterwards, the line will print.
PrintWriter writer = new PrintWriter(System.out);
writer.println("test"); // nothing appears yet: the output is still buffered
writer.flush();         // now "test" is printed
Also, does autoflushing impact performance in any way (especially in larger/consistent writes), or is it just a convenience, and is it recommended to use it?
Why does the PrintWriter class (and other writers) require a call to flush after writing?
To the extent that flushing is wanted¹, it will be needed if the "stack" of output classes beneath the print writer is doing some output buffering. If an output stream is buffered, then some event needs to trigger pushing (flushing) the buffered output to the external file, pipe, socket or whatever. The things that will trigger flushing are:
the buffer filling up
something calling close() on the stream, or
something calling flush() on the stream.
In the case of a PrintWriter, the underlying stream can also be flushed by the class's auto-flushing mechanism.
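For example, PrintWriter's two-argument constructor enables auto-flushing, which then kicks in on println, printf and format:

PrintWriter writer = new PrintWriter(System.out, true); // autoFlush = true
writer.println("test"); // auto-flushed; no explicit flush() needed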
The reason for buffering output (in general) is efficiency. Performing the low-level output operation that writes data to the (external) file, pipe, whatever involves a system call. There are significant overheads in doing this, so you want to avoid doing lots of "little" writes.
Output buffering is the standard way to solve this problem. Data to be written is collected in the buffer until the buffer fills up. The net result is that lots of "little" writes can be aggregated into one "big" write. The performance improvement can be significant.
Also, does autoflushing impact performance in any way (especially in larger/consistent writes), or is it just a convenience, and is it recommended to use it?
It is really a convenience to avoid having to explicitly flush. It doesn't improve performance. Indeed, if you don't need the data to be flushed¹, then unnecessary auto-flushing will reduce performance.
¹ You would want the data to be flushed if someone or something wants to see the data you are writing as soon as possible.
They don't require flushing as such; you only need it if you want to guarantee that the output written so far has actually been delivered, which is exactly what flushing does. If you are writing to a file and just want to make sure the data gets there before the program terminates, then there is no need to flush, because close() will do it for you.
When data is written to an output stream, the underlying operating system does not guarantee that the data will make it to the file system immediately. In many operating systems, the data may be cached in memory, with a write occurring only after a temporary cache is filled or after some amount of time has passed.

If the data is cached in memory and the application terminates unexpectedly, the data would be lost, because it was never written to the file system. To address this, all output stream classes provide a flush() method, which requests that all accumulated data be written immediately to disk.

The flush() method helps reduce the amount of data lost if the application terminates unexpectedly. It is not without cost, though. Each time it is used, it may cause a noticeable delay in the application, especially for large files. Unless the data that you are writing is extremely critical, the flush() method should be used only intermittently. For example, it should not necessarily be called after every write.

You also do not need to call the flush() method when you have finished writing data, since the close() method will automatically do this.
Read more in the book: OCP Oracle Certified Professional Java SE 11.
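For example, with try-with-resources the implicit close() at the end of the block flushes any remaining buffered data (a sketch; the file name is made up):

try (BufferedWriter out = new BufferedWriter(new FileWriter("log.txt"))) {
    out.write("done");
} // close() is called here and flushes the buffer automatically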
Hope this is clear to you!
How does a buffer actually optimize the process of reading/writing?
Every time we read a byte we access the file. I have read that a buffer reduces the number of accesses to the file. The question is: how? In the buffered section of the picture, when we load bytes from the file into the buffer we access the file just like in the unbuffered section, so where is the optimization?
I mean ... the buffer must access the file every time it reads a byte, so
even if the data in the buffer is read faster, this will not improve performance in the process of reading. What am I missing?
The fundamental misconception is to assume that a file is read byte by byte. Most storage devices, including hard drives and solid-state discs, organize the data in blocks. Likewise, network protocols transfer data in packets rather than single bytes.
This affects how the controller hardware and low-level software (drivers and operating system) work. Often, it is not even possible to transfer a single byte on this level. So, requesting the read of a single byte ends up reading one block and ignoring everything but one byte. Even worse, writing a single byte may imply reading an entire block, changing one byte of it, and writing the block back to the device. For network transfers, sending a packet with a payload of only one byte implies using 99% of the bandwidth for metadata rather than actual payload.
Note that sometimes, an immediate response is needed or a write is required to be definitely completed at some point, e.g. for safety. That’s why unbuffered I/O exists at all. But for most ordinary use cases, you want to transfer a sequence of bytes anyway and it should be transferred in chunks of a size suitable to the underlying hardware.
Note that even if the underlying system injects buffering of its own, or when the hardware truly transfers single bytes, performing 100 operating system calls that each transfer a single byte is still significantly slower than performing a single operating system call telling it to transfer 100 bytes at once.
But you should not consider the buffer to be something between the file and your program, as suggested in your picture. You should consider the buffer to be part of your program. Just like you would not consider a String object to be something between your program and a source of characters, but rather a natural way to process such items. E.g. when you use the bulk read method of InputStream (e.g. of a FileInputStream) with a sufficiently large target array, there is no need to wrap the input stream in a BufferedInputStream; it would not improve the performance. You should just stay away from the single byte read method as much as possible.
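For example, a plain FileInputStream read in bulk needs no extra buffering layer (a sketch; the file name is made up):

try (InputStream in = new FileInputStream("data.bin")) {
    byte[] chunk = new byte[8192];
    int n;
    while ((n = in.read(chunk)) != -1) {
        // process chunk[0 .. n-1] in bulk instead of byte by byte
    }
}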
As another practical example, when you use an InputStreamReader, it will already read the bytes into a buffer (so no additional BufferedInputStream is needed) and the internally used CharsetDecoder will operate on that buffer, writing the resulting characters into a target char buffer. When you use, e.g. Scanner, the pattern matching operations will work on that target char buffer of a charset decoding operation (when the source is an InputStream or ByteChannel). Then, when delivering match results as strings, they will be created by another bulk copy operation from the char buffer. So processing data in chunks is already the norm, not the exception.
This has been incorporated into the NIO design. So, instead of supporting a single byte read method and fixing it by providing a buffering decorator, as the InputStream API does, NIO’s ByteChannel subtypes only offer methods using application managed buffers.
So we could say, buffering is not improving the performance, it is the natural way of transferring and processing data. Rather, not buffering is degrading the performance by requiring a translation from the natural bulk data operations to single item operations.
As your picture shows, buffered file contents are held in memory, while an unbuffered file is not read directly unless it is streamed to the program.
A File is only a representation of a pathname. From the File Javadoc:
An abstract representation of file and directory pathnames.
Meanwhile, a buffer such as ByteBuffer takes content from the file (depending on the buffer type, direct or non-direct) and allocates it in memory.
The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance.
It actually depends on the conditions: if the file is accessed repeatedly, then buffering is the faster solution. But if the file is larger than main memory and is accessed only once, unbuffered access seems to be the better solution.
Basically, for reading: if you request 1 byte, the buffer reads 1000 bytes from the file and returns the first one; the next 999 single-byte reads take nothing from the file and are served from its internal buffer in RAM. Only after you have read all 1000 bytes does it actually read another 1000 bytes from the file.
Same thing for writing, but in reverse: if you write 1 byte it is buffered, and only once you have written 1000 bytes may they actually be written to the file.
Note that choosing the buffer size changes the performance quite a bit; see e.g. https://stackoverflow.com/a/237495/2442804 for further details on taking into account the file system block size, available RAM, etc.
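Both buffered stream classes also accept an explicit buffer size if the default (8 KiB in current JDKs) does not suit your workload; for example (file name made up):

InputStream in = new BufferedInputStream(new FileInputStream("big.dat"), 64 * 1024); // 64 KiB buffer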
Reading Thinking in Java (4th ed.), I've got some doubts about I/O performance:
I've read that it's better to "wrap" InputStream objects in a BufferedInputStream, but in my mind I can't see any difference. Aren't file operations already buffered? What are the advantages of buffered file writes?
The system's I/O buffering is on a different level than BufferedInputStream / BufferedOutputStream.
Each call to FileOutputStream.write(...) induces a native method call (which is typically more costly than a Java-internal call), and then a context switch into the OS kernel to do the actual writing. Even if the kernel (or the file system driver, or the hard disk controller, or the hard disk itself) does more buffering of its own, these costs still occur.
By wrapping a BufferedOutputStream around this, we will call the native write method only much less often, thus allowing much higher throughput.
(The same is valid for other types of IO, of course, I just used FileOutputStream as an example.)
Aren't file operations already buffered?
Maybe, maybe not - depending on the OS, the HD used, the way of access (e.g. reading big consecutive blocks vs randomly accessing small blocks all over the place), etc. In the worst case, adding a BufferedInputStream probably won't harm performance noticeably. In the best case, it can improve it by magnitudes (replacing many little file accesses by one big read/write).
An InputStream will only request as much data as you ask for, so if you request 1000 characters one character at a time, that turns into 1000 separate disk accesses, which becomes pretty slow.
A BufferedInputStream, however, will request data from the InputStream in larger chunks, thus reducing the need for separate disk accesses.
The same goes for output: instead of writing every character separately, there are fewer physical disk writes with a BufferedOutputStream.
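The write side looks analogous; a sketch (the file name is made up):

try (OutputStream out = new BufferedOutputStream(new FileOutputStream("out.bin"))) {
    for (int i = 0; i < 1000; i++) {
        out.write(i); // collected in the buffer; only a few physical writes occur
    }
} // close() flushes whatever is still in the buffer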
Is it more efficient to flush the OutputStream after each individual invocation of ObjectOutputStream#writeObject rather than flushing the stream after a sequence of object writes? (Example: write object and flush 4 times, or write 4 times and then just flush once?)
How does ObjectOutputStream work internally?
Is it somehow better to send four Object[5] arrays (flushing each one) than one Object[20], for example?
It is not better. In fact it is probably worse, from a performance perspective. Each of those flushes will force the OS-level TCP/IP stack to send the data "right now". If you just do one flush at the end, you should save on system calls, and on network traffic.
If you haven't done this already, inserting a BufferedOutputStream between the Socket OutputStream and the ObjectOutputStream will make a much bigger difference to performance. This allows the serialized data to accumulate in memory before being written to the socket stream. This potentially saves many system calls and could improve performance by orders of magnitude ... depending on the actual objects being sent.
(The representation of four Object[5] objects is larger than one Object[20] object, and that results in a performance hit in the first case. However, this is marginal at most, and tiny compared with the flushing and buffering issues.)
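A minimal sketch of that arrangement (socket is assumed to be a connected java.net.Socket, and payload is whatever serializable object you send):

ObjectOutputStream out = new ObjectOutputStream(new BufferedOutputStream(socket.getOutputStream()));
out.flush(); // push the serialization header, so the peer's ObjectInputStream constructor doesn't block
out.writeObject(payload);
out.flush(); // send the accumulated serialized bytes to the socket in one go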
How does this stream work internally?
That is too general a question to answer sensibly. I suggest that you read up on serialization starting with the documents on this page.
No, it shouldn't matter, unless you have reason to believe the net link is likely to go down, and partial data is useful. Otherwise it just sounds like a way to make the code more complex for no reason.
If you look at the one and only public constructor of ObjectOutputStream, you note that it requires an underlying OutputStream for its instantiation.
When and how you flush your ObjectStream is entirely dependent on the type of stream you are using. (And in considering all this, do keep in mind that not all extensions of OutputStream are guaranteed to respect your request to flush -- it is entirely implementation-dependent, as spelled out in the 'contract' of the javadocs.)
But certainly we can reason about it and even pull up the code and see what is actually done.
IFF the underlying OutputStream must use OS services for devices (such as the disk, or the network interface in the case of Sockets), then the behavior of flush() is entirely OS-dependent. For example, you may grab the output stream of a socket and then instantiate an ObjectOutputStream to write serialized objects to the net. The TCP/IP implementation of the host OS is then in charge.
What is more efficient?
Well, if your object stream is wrapping a ByteArrayOutputStream, you are potentially looking at a series of reallocs and System.arraycopy() calls. I say potentially, since the ByteArrayOutputStream implementation doubles its buffer size on each (internal) resize, and it is very unlikely that writing n (small) objects and flushing each time will result in n reallocs (where n is assumed to be a reasonably small number).
But if you are wrapping a network stream, you must keep in mind that network writes are very expensive. It makes much more sense, if your protocol allows it, to chunk your writes (to fill the send buffer) and just flush once.
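One way to do that chunking, sketched under the assumption that your protocol tolerates delayed delivery (messages and socketOut are hypothetical names): serialize everything into an in-memory buffer first, then hand it to the network stream in a single write:

ByteArrayOutputStream bytes = new ByteArrayOutputStream();
ObjectOutputStream out = new ObjectOutputStream(bytes);
for (Object msg : messages) {
    out.writeObject(msg);
}
out.flush(); // flushes into the byte array, not the network
socketOut.write(bytes.toByteArray()); // one large write to the socket stream
socketOut.flush();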
I have a client program that sends an encrypted object to a server, and am using a bunch of streams to do it:
// httpOutput is the HttpURLConnection request stream
Cipher rsaCipher = Cipher.getInstance("RSA/ECB/PKCS1Padding");
rsaCipher.init(Cipher.ENCRYPT_MODE, certificate);
CipherOutputStream cipherOutput = new CipherOutputStream(httpOutput, rsaCipher);
BufferedOutputStream bufferedOutput = new BufferedOutputStream(cipherOutput);
ObjectOutputStream objectOutput = new ObjectOutputStream(bufferedOutput);
Am I putting the buffer in the right place/does it matter?
Am I putting the buffer in the right place?
Yes. In general, it is best to put the buffered stream as close as possible to the code that produces or consumes the data. This is on the assumption that the lower layers can handle large chunks of data more efficiently than small chunks.
Does it matter?
It depends on what is underneath, and specifically on what the performance gain of chunking is at each level. If the performance gain is small, then it doesn't matter much. But if the performance gain is large (e.g. at the stage where you are doing system calls), then having the buffering in place is very important.
There are a few situations where inserting a buffered stream could degrade performance (a bit). For example:
if the application code (the producer / consumer) is reading / writing data in large chunks, or
if the stream is writing to / reading from an in-memory buffer and there is no filtering / conversion in the stream stack that would benefit from chunking.
In these cases, the buffering does not help, and adds a small performance cost due to the extra method calls and per-call buffer state management. However, the performance penalty is unlikely to be significant, even if the streaming is a performance critical part of the application. (But if performance is an important requirement, profile it!)
The buffering should happen as close to the raw stream as possible. In this case, it should wrap the HTTP output stream.
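In code, that ordering would look like this (a sketch reusing httpOutput and rsaCipher from the question):

BufferedOutputStream bufferedOutput = new BufferedOutputStream(httpOutput); // buffer sits directly above the raw stream
CipherOutputStream cipherOutput = new CipherOutputStream(bufferedOutput, rsaCipher);
ObjectOutputStream objectOutput = new ObjectOutputStream(cipherOutput);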
You might be interested in reading the Java documentation on I/O performance. Buffered streams allow you to deal with chunks of data instead of individual bytes. In most cases, this allows for faster reading and writing from disk.
I've never done performance testing, but if I'm using a buffered stream, it's usually very early on in the wrapping. For example, I would buffer the httpOutput stream first, then apply the CipherOutputStream to that, then finally the ObjectOutputStream. However, I'm not sure why I do it that way; it's just the way I've always done it: apply the buffering as early as possible.