Optimal block size in java streams

Optimal block size in java streams - java

I have a theoretical question. Let's imagine you have an InputStream and an OutputStream. You nead to copy content from one to the other and you don't know exactly the size of the content you need to transfer. What is the best choice in general of the block size in the write method?

The answer is: it depends. For a general solution, stop worrying about it and just use a library. Common choices:
Apache Commons IO IOUtils#copy() or copyLarge(), or
Google Guava's ByteStreams#copy()

The default buffer size for BufferedInputStream and BufferedOutputStream is 8 KB and this is typically a good size.
Note: if you are reading a Socket fast enough, you will rarely get more one packet, ~1.5 KB. If you are reading from disk, you typically get whatever size you ask for, however the performance doesn't improve much from 32 KB to 256 KB and is likely to be dependant on the hardware you use.
However I have also found, unless you are benchmarking you rarely see a noticeable difference if you use only 512 bytes as a buffer size (which Inflator/Deflator streams do) i.e. the difference might be 15% or less.
In summary, you are unlikely to notice a difference with buffer sizes between 512 bytes and 32 KB. The latter is likely to be more than enough for most situations. I tend to use 256 KB as I have a lot of memory and few pre-allocated buffers.

Related

Buffered-Input Stream [duplicate]

Let me preface this post with a single caution. I am a total beginner when it comes to Java. I have been programming PHP on and off for a while, but I was ready to make a desktop application, so I decided to go with Java for various reasons.
The application I am working on is in the beginning stages (less than 5 classes) and I need to read bytes from a local file. Typically, the files are currently less than 512kB (but may get larger in the future). Currently, I am using a FileInputStream to read the file into three byte arrays, which perfectly satisfies my requirements. However, I have seen a BufferedInputStream mentioned, and was wondering if the way I am currently doing this is best, or if I should use a BufferedInputStream as well.
I have done some research and have read a few questions here on Stack Overflow, but I am still having troubles understanding the best situation for when to use and not use the BufferedInputStream. In my situation, the first array I read bytes into is only a few bytes (less than 20). If the data I receive is good in these bytes, then I read the rest of the file into two more byte arrays of varying size.
I have also heard many people mention profiling to see which is more efficient in each specific case, however, I have no profiling experience and I'm not really sure where to start. I would love some suggestions on this as well.
I'm sorry for such a long post, but I really want to learn and understand the best way to do these things. I always have a bad habit of second guessing my decisions, so I would love some feedback. Thanks!

If you are consistently doing small reads then a BufferedInputStream will give you significantly better performance. Each read request on an unbuffered stream typically results in a system call to the operating system to read the requested number of bytes. The overhead of doing a system call is may be thousands of machine instructions per syscall. A buffered stream reduces this by doing one large read for (say) up to 8k bytes into an internal buffer, and then handing out bytes from that buffer. This can drastically reduce the number of system calls.
However, if you are consistently doing large reads (e.g. 8k or more) then a BufferedInputStream slows things a bit. You typically don't reduce the number of syscalls, and the buffering introduces an extra data copying step.
In your use-case (where you read a 20 byte chunk first then lots of large chunks) I'd say that using a BufferedInputStream is more likely to reduce performance than increase it. But ultimately, it depends on the actual read patterns.

If you are using a relatively large arrays to read the data a chunk at a time, then BufferedInputStream will just introduce a wasteful copy. (Remember, read does not necessarily read all of the array - you might want DataInputStream.readFully). Where BufferedInputStream wins is when making lots of small reads.

BufferedInputStream reads more of the file that you need in advance. As I understand it, it's doing more work in advance, like, 1 big continous disk read vs doing many in a tight loop.
As far as profiling - I like the profiler that's built into netbeans. It's really easy to get started with. :-)

I can't speak to the profiling, but from my experience developing Java applications I find that using any of the buffer classes - BufferedInputStream, StringBuffer - my applications are exceptionally faster. Because of which, I use them even for the smallest files or string operation.

import java.io.*;
class BufferedInputStream
{
public static void main(String arg[])throws IOException
{
FileInputStream fin=new FileInputStream("abc.txt");
BufferedInputStream bis=new BufferedInputStream(fin);
int size=bis.available();
while(true)
{
int x=bis.read(fin);
if(x==-1)
{
bis.mark(size);
System.out.println((char)x);
}
}
bis.reset();
while(true)
{
int x=bis.read();
if(x==-1)
{
break;
System.out.println((char)x);
}
}
}
}

Buffered vs Unbuffered. How actually buffer work?

How actually a buffer optimize the process of reading/writing?
Every time when we read a byte we access the file. I read that a buffer reduces the number of accesses the file. The question is how?. In the Buffered section of picture, when we load bytes from the file to the buffer we access the file just like in Unbuffered section of picture so where is the optimization?
I mean ... the buffer must access the file every time when reads a byte so
even if the data in the buffer is read faster this will not improve performance in the process of reading. What am I missing?

The fundamental misconception is to assume that a file is read byte by byte. Most storage devices, including hard drives and solid-state discs, organize the data in blocks. Likewise, network protocols transfer data in packets rather than single bytes.
This affects how the controller hardware and low-level software (drivers and operating system) work. Often, it is not even possible to transfer a single byte on this level. So, requesting the read of a single byte ends up reading one block and ignoring everything but one byte. Even worse, writing a single byte may imply reading an entire block, changing one bye of it, and writing the block back to the device. For network transfers, sending a packet with a payload of only one byte implies using 99% of the bandwidth for metadata rather than actual payload.
Note that sometimes, an immediate response is needed or a write is required to be definitely completed at some point, e.g. for safety. That’s why unbuffered I/O exists at all. But for most ordinary use cases, you want to transfer a sequence of bytes anyway and it should be transferred in chunks of a size suitable to the underlying hardware.
Note that even if the underlying system injects a buffering on its own or when the hardware truly transfers single bytes, performing 100 operating system calls to transfer a single byte on each still is significantly slower than performing a single operating system call telling it to transfer 100 bytes at once.
But you should not consider the buffer to be something between the file and your program, as suggested in your picture. You should consider the buffer to be part of your program. Just like you would not consider a String object to be something between your program and a source of characters, but rather a natural way to process such items. E.g. when you use the bulk read method of InputStream (e.g. of a FileInputStream) with a sufficiently large target array, there is no need to wrap the input stream in a BufferedInputStream; it would not improve the performance. You should just stay away from the single byte read method as much as possible.
As another practical example, when you use an InputStreamReader, it will already read the bytes into a buffer (so no additional BufferedInputStream is needed) and the internally used CharsetDecoder will operate on that buffer, writing the resulting characters into a target char buffer. When you use, e.g. Scanner, the pattern matching operations will work on that target char buffer of a charset decoding operation (when the source is an InputStream or ByteChannel). Then, when delivering match results as strings, they will be created by another bulk copy operation from the char buffer. So processing data in chunks is already the norm, not the exception.
This has been incorporated into the NIO design. So, instead of supporting a single byte read method and fixing it by providing a buffering decorator, as the InputStream API does, NIO’s ByteChannel subtypes only offer methods using application managed buffers.
So we could say, buffering is not improving the performance, it is the natural way of transferring and processing data. Rather, not buffering is degrading the performance by requiring a translation from the natural bulk data operations to single item operations.

As stated in your picture, buffered file contents are saved in memory and unbuffered file is not read directly unless it is streamed to program.
File is only representation on path only. Here is from File Javadoc:
An abstract representation of file and directory pathnames.
Meanwhile, buffered stream like ByteBuffer takes content (depends on buffer type, direct or indirect) from file and allocate it into memory as heap.
The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance.
Actually depends on the condition, if the file is accessed repeatedly, then buffered is a faster solution rather than unbuffered. But if the file is larger than main memory and it is accessed once, unbuffered seems to be better solution.

Basically for reading if you request 1 byte the buffer will read 1000 bytes and return you the first byte, for next 999 reads for 1 byte it will not read anything from the file but use its internal buffer in RAM. Only after you read all the 1000 bytes it will actually read another 1000 bytes from the actual file.
Same thing for writing but in reverse. If you write 1 byte it will be buffered and only if you have written 1000 bytes they may be written to the file.
Note that choosing the buffer size changes the performance quite a bit, see e.g. https://stackoverflow.com/a/237495/2442804 for further details, respecting file system block size, available RAM, etc.

Why is the default char buffer size of BufferedReader 8192?

When I construct a new BufferedReader it is providing me a buffer of 8192 characters. What is the logic/reason behind this?
8192 = 2 to the power of 13

Traditionally, memory managers and paging files in the operating system work on pages that are sized in powers of 2. This allows very efficient multiply/divide operations to be performed with left/right shift operations. When working with a buffer, the worst case scenario is to have a buffer with size 1 byte longer than the page size (that would result in an extra page swap with very low benefit). So the default buffer sizes will also tend to be implemented in factors of two.
I'd assume (but have not checked) that the JVM looks for buffers like this and attempts to align them on page boundaries.
Why does this matter? Page misses are quite expensive. If you are doing a ton of IO, it's better to avoid the case where the page backing the buffer gets swapped out to disk (kind of defeats the purpose of the buffer). That said, for most applications, this is a micro-optimization, and for the vast majority of cases, the default is fine.
For reference, Windows and Linux both currently use a 4KB memory page size. So the default buffer on BufferedReader will consume exactly 2 pages.

As the BufferedReader Javadoc says
The buffer size may be specified, or the default size may be used. The default is large enough for most purposes.
The default was chosen as being "large enough" (which I would interpret as "good enough").

8192, as you said, is 2^13. The exact reason for this number being the default is hard to come by, but I'd venture to say it's based on the combination of normal use scenarios and data efficiency. You can specify a buffer size of whatever you want, though, using a different object constructor.
BufferedReader(Reader in, int sz)
Creates a buffering character-input stream that uses an input buffer of the specified size.
https://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
and BufferedReader default buffer size? will provide further insight.

There is a JDK ticket https://bugs.openjdk.org/browse/JDK-4953311 that states
Most OSes that we support uses a buffer size of 8192 (8K) bytes for their IO buffering, and this is also the buffer size used by Microsoft VM on Win32. We should change the default buffer size in these two classes to 8K.

8192 is 2^13 and also reveals much information regarding the RIGHT v WRONG encoded in all we do. If one takes away or adds to the author intent, he modifies and therefore corrupts the entire thing. Try to add or take away from something perfect... good luck!

Java compress/decompress large files (>1gb)

I have made an application in android that lets the user compress and decompress files and I used the package java.util.zip. Everything is okay. the speed, files are totally compressed and decompressed together with the directories. The only problem is that the application is not able to compress/decompress large files (greater than 1gb).
I believe the problem is the size of my buffer. Other codes that I've seen, the value of their buffer is 1024 or 2048 or 8192 but my value of my buffer is base on the size of the chosen file (just to make it flexible). But once the user chose a large file (with a size of >8 digits), that's were the error comes out. I searched over the net and also here in this site but I can't find an answer. my problem is similar to this:
To Compress a big file in a ZIP with Java
Thanks for the future help! :)
EDIT:
Thanks for the comments and answers. It really helped a lot. I thought BUFFER in compressing/decompressing in java means the size of file so in my program, I made the buffer size flexible (buffer size = file size). Will someone please explain how buffer works so I can understand why is it okay that BUFFER has a fixed value. Also for me to figure it out why others people is telling that it is much better if the buffer size is 8k or else. Thanks a lot! :)

If you size the buffer to the size of the file, then it means that you will have OutOfMemoryError whenever the file size is too big for memory available.
Use a normal buffer size and let it do it's work - buffering the data in a streaming fashion, one chunk at a time, rather than all in one go.
For explanation, see for example the documentation of BufferedOutputStream:
The class implements a buffered output stream. By setting up such an
output stream, an application can write bytes to the underlying output
stream without necessarily causing a call to the underlying system for
each byte written.
So using a buffer is more efficient than non-buffered writing.
And from the write method:
Ordinarily this method stores bytes from the given array into this
stream's buffer, flushing the buffer to the underlying output stream
as needed. If the requested length is at least as large as this
stream's buffer, however, then this method will flush the buffer and
write the bytes directly to the underlying output stream.
Each write causes the in-memory buffer to fill up, until the buffer is full. When the buffer is full, it is flushed and cleared. If you use a very large buffer, you will cause a large amount of data to be stored in memory before flushing. If your buffer is the same size as the input file, then you are saying you need to read the whole content into memory before flushing it. Using the default buffer size is usually just fine. There will be more physical writes (flushes); you avoid exploding memory.
By allowing you to specify a specific buffer size, the API is letting you choose the right balance between memory consumption and i/o to suit your application. If you tune your application for performance, you might end up tweaking buffer size. But the default size will be reasonable for many situations.

It sounds like it would help to simply set a maximum size for the buffer, something like:
//After calculating the buffer size bufSize:
bufSize = Math.min(bufSize, MAXSIZE);

Using Dynamic Buffers? Java

In Java, I have a method
public int getNextFrame( byte[] buff )
that reads from a file into the buffer and returns the number of bytes read. I am reading from .MJPEG that has a 5byte value, say "07939", followed by that many bytes for the jpeg.
The problem is that the JPEG byte size could overflow the buffer. I cannot seem to find a neat solution for the allocation. My goal is to not create a new buffer for every image. I tried a direct ByteBuffer so I could use its array() method to get direct access to the underlying buffer. The ByteBuffer does not expand dynamically.
Should I be returning a reference to the parameter? Like:
public ByteBuffer getNextFrame( ByteBuffer ref )
How do I find the bytes read? Thanks.

java.io.ByteArrayOutputStream is a wrapper around a byte-array and enlarges it as needed. Perhaps this is something you could use.
Edit:
To reuse just call reset() and start over...

Just read the required number of bytes. Do not use read(buffer), but use read(buffer,0,size). If there are more bytes, just discard them, the JPG is broken anyway.

EDIT:
Allocating a byte[] is so much faster than reading from a file or a
socket, I would be surprised it will make much difference, unless you
have a system where micro-seconds cost money.
The time it takes to read a file of 64 KB is about 10 ms (unless the
file is in memory)
The time it takes to allocate a 64 KB byte[] is about 0.001 ms,
possibly faster.
You can use apache IO's IOBuffer, however this expands very expensively.
You can also use ByteBuffer, the position() will tell you how much data was read.
If you don't know how big the buffer will be and you have a 64-bit JVM you can create a large direct buffer. This will only allocate memory (by page) when used. The upshot is that you can allocate a 1 GB but might only ever use 4 KB if that is all you need. Direct buffer doesn't support array() however, you would have to read from the ByteBuffer using its other methods.
Another solution is to use an AtomicReference<byte[]> the called method can increase the size as required, but if its large enough it would reuse the previous buffer.

The usual way of accomplishing this in a high-level API is either let the user provide an OutputStream and fill it with your data (which can be a ByteArrayOutputStream or something completely different), or have an InputStream as return value, that the user can read to get the data (which will dynamically load the correct parts from the file and stop when finished).

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.