I was first curious about how the buffer size of 'BufferedInputStream' class is decided for each program. I found out in STACKOVERFLOW that the default buffer size is 8KB.
I wonder if there is profound meaning in the default buffer size of
'BufferInputStream' class, which is 8KB in size.
When I searched java docs for information about the 'BufferedInputStream' class,
I found out there are two forms of constructors for the class.
One constructor form offers users the ability to change the buffer size.
BufferedInputStream(InputStream in, int size)
Could the buffer size of this class be critical in deciding performance of some programs? I'm curious if anyone uses the above form of the constructor to change the buffer size to fit/optimize his/her program.
Is there any profound meaning to the default buffer size of 8KB?
Thank you for reading.
Could the buffer size of this class be critical in deciding performance of some programs? I'm curious if anyone uses the above form of the constructor to change the buffer size to fit/optimize his/her program.
Probably not. Changing from a buffer size of 1 to 2 will about double your performance (by reducing system calls). Changing from 2 to 4 will double it again. Changing from 4 to 8, again. You get the idea. At some point this ceases being true, as the performance ceases being dominated by system calls and starts being dominated by transfer sizes. 8k is a good place to stop. Use more if you like but you won't notice much difference.
Is there any profound meaning to the default buffer size of 8KB?
There isn't. It is 8k in size. By default. That's the meaning. You can change it via a constructor. Nothing more to it.
Related
Let’s say I’ve mapped a memory region [0, 1000] and now I have MappedByteBuffer.
Can I read and write to this buffer from multiple threads at the same time without locking, assuming that each thread accesses different part of the buffer for exp. T1 [0, 500), T2 [500, 1000]?
If the above is true, is it possible to determine whether it’s better to create one big buffer for multiple threads, or smaller buffer for each thread?
Detailed Intro:
If you wanna learn how to answer those questions yourself, check their implementation source codes:
MappedByteBuffer: https://github.com/himnay/java7-sourcecode/blob/master/java/nio/MappedByteBuffer.java (notice it's still abstract, so you cannot instantiate it directly)
extends ByteBuffer: https://github.com/himnay/java7-sourcecode/blob/master/java/nio/ByteBuffer.java
extends Buffer: https://github.com/himnay/java7-sourcecode/blob/329bbb33cbe8620aee3cee533eec346b4b56facd/java/nio/Buffer.java (which only does index checks, and does not grant an actual access to any buffer memory)
Now it gets a bit more complicated:
When you wanna allocate a MappedByteBuffer, you will get either a
HeapByteBuffer: https://github.com/himnay/java7-sourcecode/blob/329bbb33cbe8620aee3cee533eec346b4b56facd/java/nio/HeapByteBuffer.java
or a DirectByteBuffer: https://github.com/himnay/java7-sourcecode/blob/329bbb33cbe8620aee3cee533eec346b4b56facd/java/nio/DirectByteBuffer.java
Instead of having to browse internet pages, you could also simply download the source code packages for your Java version and attach them in your IDE so you can see the code in development AND debug modes. A lot easier.
Short (incomplete) answer:
Neither of them does secure against multithreading.
So if you ever needed to resize the MappedByteBuffer, you might get stale or even bad (ArrayIndexOutOfBoundsException) access
If the size is constant, you can rely on either Implementation to be "thread safe", as far as your requirements are concerned
On a side note, here also lies an implementation failure creep in the Java implementation:
MappedByteBuffer extends ByteBuffer
ByteBuffer has the heap byte[] called "hb"
DirectByteBuffer extends MappedByteBuffer extends ByteBuffer
So DirectByteBuffer still has ByteBuffer's byte[] hb buffer,
but does not use it
and instead creates and manages its own Buffer
This design flaw comes from the step-by-step development of those classes (they were no all planned and implemented at the same time), AND the topic of package visibility, resulting in inversion of dependency/hierarchy of the implementation.
Now to the true answer:
If you wanna do proper object-oriented programming, you should NOT share resource unless utterly needed.
This ESPECIALLY means that each Thread should have its very own Buffer.
Advantage of having one global buffer: the only "advantage" is to reduce the additional memory consumption for additional object references. But this impact is SO MINIMAL (not even 1:10000 change in your app RAM consumption) that you will NEVER notice it. There's so many other objects allocated for any number of weird (Java) reasons everywhere that this is the least of your concerns. Plus you would have to introduce additional data (index boundaries) which lessens the 'advantage' even more.
The big Advantages of having separate buffers:
You will never have to take care of the pointer/index arithmetics
especially when it comes to you needing more threads at any given time
You can freely allocate new threads at any time without having to rearrange any data or do more pointer arithmetics
you can freely reallocate/resize each individual buffer when needed (without worrying about all the other threads' indexing requirement)
Debugging: You can locate problems so much easier that result from "writing out of boundaries", because if they tried, the bad thread would crash, and not other threads that would have to deal with corrupted data
Java ALWAYS checks each array access (on normal heap arrays like byte[]) before it accesses it, exactly to prevent side effects
think back: once upon a time there was the big step in operating systems to introduce linear address space so programs would NOT have to care about where in the hardware RAM they're loaded.
Your one-buffer-design would be the exact step backwards.
Conclusion:
If you wanna have a really bad design choice - which WILL make life a lot harder later on - you go with one global Buffer.
If you wanna do it the proper OO way, separate those buffers. No convoluted dependencies and side effect problems.
When I construct a new BufferedReader it is providing me a buffer of 8192 characters. What is the logic/reason behind this?
8192 = 2 to the power of 13
Traditionally, memory managers and paging files in the operating system work on pages that are sized in powers of 2. This allows very efficient multiply/divide operations to be performed with left/right shift operations. When working with a buffer, the worst case scenario is to have a buffer with size 1 byte longer than the page size (that would result in an extra page swap with very low benefit). So the default buffer sizes will also tend to be implemented in factors of two.
I'd assume (but have not checked) that the JVM looks for buffers like this and attempts to align them on page boundaries.
Why does this matter? Page misses are quite expensive. If you are doing a ton of IO, it's better to avoid the case where the page backing the buffer gets swapped out to disk (kind of defeats the purpose of the buffer). That said, for most applications, this is a micro-optimization, and for the vast majority of cases, the default is fine.
For reference, Windows and Linux both currently use a 4KB memory page size. So the default buffer on BufferedReader will consume exactly 2 pages.
As the BufferedReader Javadoc says
The buffer size may be specified, or the default size may be used. The default is large enough for most purposes.
The default was chosen as being "large enough" (which I would interpret as "good enough").
8192, as you said, is 2^13. The exact reason for this number being the default is hard to come by, but I'd venture to say it's based on the combination of normal use scenarios and data efficiency. You can specify a buffer size of whatever you want, though, using a different object constructor.
BufferedReader(Reader in, int sz)
Creates a buffering character-input stream that uses an input buffer of the specified size.
https://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
and BufferedReader default buffer size? will provide further insight.
There is a JDK ticket https://bugs.openjdk.org/browse/JDK-4953311 that states
Most OSes that we support uses a buffer size of 8192 (8K) bytes for their IO buffering, and this is also the buffer size used by Microsoft VM on Win32. We should change the default buffer size in these two classes to 8K.
8192 is 2^13 and also reveals much information regarding the RIGHT v WRONG encoded in all we do. If one takes away or adds to the author intent, he modifies and therefore corrupts the entire thing. Try to add or take away from something perfect... good luck!
I have a case where I need to peek ahead in the stream for the existence of a certain regular expression and then read data from the stream.
mark and reset allow me to do this but I am facing an issue where mark becomes invalid if the readAheadLimit goes beyond the size of the current buffer.
For example: I have a BufferedReader with buffer size of 1k.
Lets say I am at position 1000 (mark=1000) in the buffer and I need to check for the regex in the next 100 chars (readAheadLimit=100).
So while reading, the moment I cross the current buffer size (1024), a new buffer is allocated and the mark becomes invalid (not able to reset) and the data is streamed into the new buffer in a normal way.
I think this is the intended behavior but is there a way to get around this?
Appreciate your help.
regards
the moment I cross the current buffer size (1024), a new buffer is allocated
No it isn't. The existing buffer is cleared and readied for another use.
and the mark becomes invalid (not able to reset)
No it doesn't, unless you've gone beyond the read ahead limit.
You don't seem to have read the API. You call mark() with an argument that says how far ahead you want to go before calling reset(), in this case 100 bytes, and the API is required to allow you to do exactly that. So when you get up to 100 characters ahead, call reset(), and you are back where you were when you called mark(). How that happens internally isn't your problem, but it is certainly required to happen.
And how did you get a BufferedReader with a 1k buffer? The default is 4096.
There are at least two options:
Set default cache size much more than 1k:
new BufferedReader(originalReader, 1024 * 1024) // e.g. 1Mb
Apply custom buffering to increase cache size as soon as limit was exceeded. In case if you are working with huge amount of data - custom buffering can store data it in database or file.
I have made an application in android that lets the user compress and decompress files and I used the package java.util.zip. Everything is okay. the speed, files are totally compressed and decompressed together with the directories. The only problem is that the application is not able to compress/decompress large files (greater than 1gb).
I believe the problem is the size of my buffer. Other codes that I've seen, the value of their buffer is 1024 or 2048 or 8192 but my value of my buffer is base on the size of the chosen file (just to make it flexible). But once the user chose a large file (with a size of >8 digits), that's were the error comes out. I searched over the net and also here in this site but I can't find an answer. my problem is similar to this:
To Compress a big file in a ZIP with Java
Thanks for the future help! :)
EDIT:
Thanks for the comments and answers. It really helped a lot. I thought BUFFER in compressing/decompressing in java means the size of file so in my program, I made the buffer size flexible (buffer size = file size). Will someone please explain how buffer works so I can understand why is it okay that BUFFER has a fixed value. Also for me to figure it out why others people is telling that it is much better if the buffer size is 8k or else. Thanks a lot! :)
If you size the buffer to the size of the file, then it means that you will have OutOfMemoryError whenever the file size is too big for memory available.
Use a normal buffer size and let it do it's work - buffering the data in a streaming fashion, one chunk at a time, rather than all in one go.
For explanation, see for example the documentation of BufferedOutputStream:
The class implements a buffered output stream. By setting up such an
output stream, an application can write bytes to the underlying output
stream without necessarily causing a call to the underlying system for
each byte written.
So using a buffer is more efficient than non-buffered writing.
And from the write method:
Ordinarily this method stores bytes from the given array into this
stream's buffer, flushing the buffer to the underlying output stream
as needed. If the requested length is at least as large as this
stream's buffer, however, then this method will flush the buffer and
write the bytes directly to the underlying output stream.
Each write causes the in-memory buffer to fill up, until the buffer is full. When the buffer is full, it is flushed and cleared. If you use a very large buffer, you will cause a large amount of data to be stored in memory before flushing. If your buffer is the same size as the input file, then you are saying you need to read the whole content into memory before flushing it. Using the default buffer size is usually just fine. There will be more physical writes (flushes); you avoid exploding memory.
By allowing you to specify a specific buffer size, the API is letting you choose the right balance between memory consumption and i/o to suit your application. If you tune your application for performance, you might end up tweaking buffer size. But the default size will be reasonable for many situations.
It sounds like it would help to simply set a maximum size for the buffer, something like:
//After calculating the buffer size bufSize:
bufSize = Math.min(bufSize, MAXSIZE);
I need a byte buffer class in Java for single-threaded use. The buffer should resize when it's full, rather than throw an exception or something. Very important issue for me is performance.
What would you recommend?
ADDED:
At the momement I use ByteBuffer but it cannot resize. I need one that can resize.
Any reason not to use the boring normal ByteArrayOutputStream?
As mentioned by miku above, Evan Jones gives a review of different types and shows that it is very application dependent. So without knowing further details it is hard to speculate.
I would start with ByteArrayOutputStream, and only if profiling shows it is your performance bottleneck move to something else. Often when you believe the buffer code is the bottleneck, it will actually be network or other IO - wait until profiling shows you need an optimisation before wasting time finding a replacement.
If you are moving to something else, then other factors you will need to think about:
You have said you are using single threaded use, so BAOS's synchronization is not needed
what is the buffer being filled by and fed into? If either end is already wired to use Java NIO, then using a direct ByteBuffer is very efficient.
Are you using a circular buffer or a plain linear buffer? If you are then the Ostermiller Utils are pretty efficient, and GPL'd
You can use a direct ByteBuffer. Direct memory uses virtual memory to start with is only allocated to the application when it is used. i.e. the amount of main memory it uses re-sizes automagically.
Create a direct ByteBuffer larger than you need and it will only consume what you use.
you can also write manual code for checking the buffer content continously and if its full then make a new buffer of greater size and shift all the data in that new buffer.