Was writing the javadoc for :
/**
* ...Buffers the input stream so do not pass in a BufferedInputStream ...
*/
public static void meth(InputStream is) throws IOException {
BufferedInputStream bis = new BufferedInputStream(is,
INPUT_STREAM_BUFFER_SIZE);
// rest omitted
}
But is it really a problem to pass a buffered input stream in ? So this :
InputStream is = new BufferedInputStream(new FileInputStream("C:/file"), SIZE);
meth(is);
would buffer the is into bis - or would java detect that is is already buffered and set bis = is ? If yes, would different buffer sizes make a difference ? If no, why not ?
NB : I am talking about input streams but actually the question is valid for output streams too
But is it really a problem to pass a buffered input stream in ?
Not really. There is potentially a small overhead in doing this, but it is negligible compared with the overall cost of reading input.
If you look at the code of BufferedInputStream, (e.g. the read1 method) you will see that block reads are implemented to be efficient when buffered streams are stacked.
[Re the example code:] would java detect that is is already buffered and set bis = is ?
No.
If no, why not ?
Because Java (the language, the compiler) generally doesn't understand the semantics of Java library classes. And in this case, since the benefit of such an optimization would be negligible, it i not worthwhile implementing.
Of course, you are free to write your meth method to do this kind of thing explicitly ... though I predict that it will make little difference.
I do not quite get why in read1 they "bother" to copy to the input buffer only if the requested length is less than the buf.length (or if there is a marked position in the input stream)
I assume that you are referring to this code (in read1):
if (len >= getBufIfOpen().length && markpos < 0) {
return getInIfOpen().read(b, off, len);
}
The first part is saying that if the user is asking for less than the stream's configured buffer size, we don't want to short-circuit the buffering. (Otherwise, we'd have the problem that doing a read(byte[], int, int) with a small requested length would be pessimal.)
The second part is to do with the way that mark / reset is implemented. Instead of using mark / reset on the underlying stream (which may or may not be supported), a BufferedInputStream uses the buffer to implement it. What you are seeing is part of that logic. (You can work the details for yourself ... reading the comments in the source code.)
If you buffer the stream twice then it will use more memory and be slower than if you only did so once, but it will still work.
It's certainly worth documenting that your stream does buffering so that users will know they don't need to do so themselves.
Generally it's best to discourage rather than actively prevent this sort of misuse.
The answer is no, Java would not detect the double buffering.
It is up to the user to avoid this problem. The BufferedInputStream has no way of knowing whether the InputStream you pass into the constructor is buffered or not.
Here is the source code for the BufferedInputStream constructor:
public BufferedInputStream(InputStream in, int size) {
super(in);
if (size <= 0) {
throw new IllegalArgumentException("Buffer size <= 0");
}
buf = new byte[size];
}
EDIT
From the comments is it a problem to double buffer a stream?
The short answer is yes.
The idea of buffering is to increase speed so that data is spooled into memory and written out (usually to very slow IO) in chunks. If you double buffer you spool data into memory and then flush that data back into memory somewhere else. This certainly has a cost in terms of speed...
Related
To start with, I understand the concept of buffering as a wrapper around, for instance, FileInuptStream to act as a temporary container for contents read(lets take read scenario) from an underlying stream, in this case - FileInputStream.
Say, there are 100 bytes to read from a stream(file as a source).
Without buffering, code(read method of BufferedInputStream) has to make 100 reads(one byte at a time).
With buffering, depending on buffer size, code makes <= 100 reads.
Lets assume buffer size to be 50.
So, the code reads the buffer(as a source) only twice to read the contents of a file.
Now, as the FileInuptStream is the ultimate source(though wrapped by BufferedInputStream) of data(file which contains 100 bytes), wouldn't it has to read 100 times to read 100 bytes? Though, the code calls read method of BufferedInputStream but, the call is passed to read method of FileInuptStream which needs to make 100 read calls. This is the point which I'm unable to comprehend.
IOW, though wrapped by a BufferedInputStream, the underlying streams(such as FileInputStream) still have to read one byte at a time. So, where is the benefit(not for the code which requires only two read calls to buffer but, to the application's performance) of buffering?
Thanks.
EDIT:
I'm making this as a follow-up 'edit' rather than 'comment' as I think its contextually better suits here and as a TL;DR for readers of chat between #Kayaman and me.
The read method of BufferedInputStream says(excerpt):
As an additional convenience, it
attempts to read as many bytes as possible by repeatedly invoking the
read method of the underlying stream. This iterated read continues
until one of the following conditions becomes true:
The specified number of bytes have been read,
The read method of the underlying stream returns -1, indicating end-of-file, or
The available method of the underlying stream returns zero, indicating that further input requests would block.
I digged into the code and found method call trace as under:
BufferedInputStream -> read(byte b[]) As a I want to see buffering in action.
BufferedInputStream -> read(byte b[], int off, int len)
BufferedInputStream -> read1(byte[] b, int off, int len) - private
FileInputStream -
read(byte b[], int off, int len)
FileInputStream -> readBytes(byte b[], int off, int len) - private and native. Method description from source code -
Reads a subarray as a sequence of bytes.
Call to read1(#4, above mentioned) in BufferedInputStream is in an infinite for loop. It returns on conditions mentioned in above excerpt of read method description.
As I had mentioned in OP(#6), the call does seem to handle by an underlying stream which matches API method description and method call trace.
The question still remains, if native API call - readBytes of FileInputStream reads one byte at a time and create an array of those bytes to return?
The underlying streams(such as FileInputStream) still have to read
one byte at a time
Luckily no, that would be hugely inefficient. It allows the BufferedInputStream to make read(byte[8192] buffer) calls to the FileInputStream which will return a chunk of data.
If you then want to read a single byte (or not), it will efficiently be returned from BufferedInputStream's internal buffer instead of having to go down to the file level. So the BI is there to reduce the times we do actual reads from the filesystem, and when those are done, they're done in an efficient fashion even if the end user wanted to read just a few bytes.
It's quite clear from the code that BufferedInputStream.read() does not delegate directly to UnderlyingStream.read(), as that would bypass all the buffering.
public synchronized int read() throws IOException {
if (pos >= count) {
fill();
if (pos >= count)
return -1;
}
return getBufIfOpen()[pos++] & 0xff;
}
I have read that BufferedOutputStream Class improves efficiency and must be used with FileOutputStream in this way -
BufferedOutputStream bout = new BufferedOutputStream(new FileOutputStream("myfile.txt"));
and for writing to the same file below statement is also works -
FileOutputStream fout = new FileOutputStream("myfile.txt");
But the recommended way is to use Buffer for reading / writing operations and that's the reason only I too prefer to use Buffer for the same.
But my question is how to measure performance of above 2 statements. Is their any tool or kind of something, don't know exactly what? but which will be useful to analyse it's performance.
As new to JAVA language, I am very curious to know about it.
Buffering is only helpful if you are doing inefficient reading or writing. For reading, it's helpful for letting you read line by line, even when you could gobble up bytes / chars faster just using read(byte[]) or read(char[]). For writing, it allows you to buffer pieces of what you want to send through I/O with the buffer, and to send them only on flush (see PrintWriter (PrintOutputStream(?).setAutoFlush())
But if you are just trying to read or write as fast as you can, buffering doesn't improve performance
For an example of efficient reading from a file:
File f = ...;
FileInputStream in = new FileInputStream(f);
byte[] bytes = new byte[(int) f.length()]; // file.length needs to be less than 4 gigs :)
in.read(bytes); // this isn't guaranteed by the API but I've found it works in every situation I've tried
Versus inefficient reading:
File f = ...;
BufferedReader in = new BufferedReader(f);
String line = null;
while ((line = in.readLine()) != null) {
// If every readline call was reading directly from the FS / Hard drive,
// it would slow things down tremendously. That's why having a buffer
//capture the file contents and effectively reading from the buffer is
//more efficient
}
These numbers came from a MacBook Pro laptop using an SSD.
BufferedFileStreamArrayBatchRead (809716.60-911577.03 bytes/ms)
BufferedFileStreamPerByte (136072.94 bytes/ms)
FileInputStreamArrayBatchRead (121817.52-1022494.89 bytes/ms)
FileInputStreamByteBufferRead (118287.20-1094091.90 bytes/ms)
FileInputStreamDirectByteBufferRead (130701.87-956937.80 bytes/ms)
FileInputStreamReadPerByte (1155.47 bytes/ms)
RandomAccessFileArrayBatchRead (120670.93-786782.06 bytes/ms)
RandomAccessFileReadPerByte (1171.73 bytes/ms)
Where there is a range in the numbers, it varies based on the size of the buffer being used. A larger buffer results in more speed up to a point, typically somewhere around the size of the caches within the hardware and operating system.
As you can see, reading bytes individually is always slow. Batching the reads into chunks is easily the way to go. It can be the difference between 1k per ms and 136k per ms (or more).
These numbers are a little old, and they will vary wildly by setup but they will give you an idea. The code for generating the numbers can be found here, edit Main.java to select the tests that you want to run.
An excellent (and more rigorous) framework for writing benchmarks is JMH. A tutorial for learning how to use JMH can be found here.
I am really in trouble: I want to read HUGE files over several GB using FileChannels and MappedByteBuffers - all the documentation I found implies it's rather simple to map a file using the FileChannel.map() method.
Of course there is a limit at 2GB as all the Buffer methods use int for position, limit and capacity - but what about the system implied limits below that?
In reality, I get lots of problems regarding OutOfMemoryExceptions! And no documentation at all that really defines the limits!
So - how can I map a file that fits into the int-limit safely into one or several MappedByteBuffers without just getting exceptions?
Can I ask the system which portion of a file I can safely map before I try FileChannel.map()? How?
Why is there so little documentation about this feature??
I can offer some working code. Whether this solves your problem or not is difficult to say. This hunts through a file for a pattern recognised by the Hunter.
See the excellent article Java tip: How to read files quickly for the original research (not mine).
// 4k buffer size.
static final int SIZE = 4 * 1024;
static byte[] buffer = new byte[SIZE];
// Fastest because a FileInputStream has an associated channel.
private static void ScanDataFile(Hunter p, FileInputStream f) throws FileNotFoundException, IOException {
// Use a mapped and buffered stream for best speed.
// See: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly
FileChannel ch = f.getChannel();
long red = 0L;
do {
long read = Math.min(Integer.MAX_VALUE, ch.size() - red);
MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, red, read);
int nGet;
while (mb.hasRemaining() && p.ok()) {
nGet = Math.min(mb.remaining(), SIZE);
mb.get(buffer, 0, nGet);
for (int i = 0; i < nGet && p.ok(); i++) {
p.check(buffer[i]);
}
}
red += read;
} while (red < ch.size() && p.ok());
// Finish off.
p.close();
ch.close();
f.close();
}
What I use is a List<ByteBuffer> where each ByteBuffer maps to the file in block of 16 MB to 1 GB. I uses powers of 2 to simplify the logic. I have used this to map in files up to 8 TB.
A key limitation of memory mapped files is that you are limited by your virtual memory. If you have a 32-bit JVM you won't be able to map in very much.
I wouldn't keep creating new memory mappings for a file because these are never cleaned up. You can create lots of these but there appears to be a limit of about 32K of them on some systems (no matter how small they are)
The main reason I find MemoryMappedFiles useful is that they don't need to be flushed (if you can assume the OS won't die) This allows you to write data in a low latency way, without worrying about losing too much data if the application dies or too much performance by having to write() or flush().
You don't use the FileChannel API to write the entire file at once. Instead, you send the file in parts. See example code in Martin Thompson's post comparing performance of Java IO techniques: Java Sequential IO Performance
In addition, there is not much documentation because you are making a platform-dependent call. from the map() JavaDoc:
Many of the details of memory-mapped files are inherently dependent
upon the underlying operating system and are therefore unspecified.
The bigger the file, the less you want it all in memory at once. Devise a way to process the file a buffer at a time, a line at a time, etc.
MappedByteBuffers are especially problematic, as there is no defined release of the mapped memory, so using more than one at a time is essentially bound to fail.
I want to read a large InputStream and return it as a String.
This InputStream is a large one. So, normally it takes much time and a lot of memory while it is excuting.
The following code is the one that I've developed so far.
I need to convert this code as it does the job in a lesser time consuming lesser memory.
Can you give me any idea to do this.
BufferedReader br =
new BufferedReader(
new InputStreamReader(
connection.getInputStream(),
"UTF-8")
);
StringBuilder response = new StringBuilder(1000);
char[] buffer = new char[4096];
int n = 0;
while(n >= 0){
n = br.read(buffer, 0, buffer.length);
if(n > 0){
response.append(buffer, 0, n);
}
}
return response.toString();
Thank you!
When you are doing buffered I/O you can just read one char at a time from the buffered reader. Then build up the string, and do a toString() at the end.
You may find that for large files on some operating systems, mmaping the file via FileChannel.map will give you better performance - map the file and then create a string out of the mapped ByteBuffer. You'll have to benchmark though, as it may be that 'traditional' IO is faster in some cases.
Do you know in advance the likely maxiumum length of your string? You currently specify an intiial capacity of 1000 for your buffer. If what you read is lots bigger than thet you'll pay some cost in allocating larger internal buffers.
If you have control over the life-cycle of what you're reading, perhaps you could allocate a single re-usable byte array as the buffer. Hence avoiding garbage collection.
Increase the size of your buffer. The bigger the buffer, the faster all the data can be read. If you know (or can work out) how many bytes are available in the stream, you could even allocate a buffer of the same size up-front.
You could run the code in a separate thread... it won't run any faster but at least your program will be able to do some other work instead of waiting for data from the stream.
Is this:
ByteBuffer buf = ByteBuffer.allocate(1000);
...the only way to initialize a ByteBuffer?
What if I have no idea how many bytes I need to allocate..?
Edit: More details:
I'm converting one image file format to a TIFF file. The problem is the starting file format can be any size, but I need to write the data in the TIFF to little endian. So I'm reading the stuff I'm eventually going to print to the TIFF file into the ByteBuffer first so I can put everything in Little Endian, then I'm going to write it to the outfile. I guess since I know how long IFDs are, headers are, and I can probably figure out how many bytes in each image plane, I can just use multiple ByteBuffers during this whole process.
The types of places that you would use a ByteBuffer are generally the types of places that you would otherwise use a byte array (which also has a fixed size). With synchronous I/O you often use byte arrays, with asynchronous I/O, ByteBuffers are used instead.
If you need to read an unknown amount of data using a ByteBuffer, consider using a loop with your buffer and append the data to a ByteArrayOutputStream as you read it. When you are finished, call toByteArray() to get the final byte array.
Any time when you aren't absolutely sure of the size (or maximum size) of a given input, reading in a loop (possibly using a ByteArrayOutputStream, but otherwise just processing the data as a stream, as it is read) is the only way to handle it. Without some sort of loop, any remaining data will of course be lost.
For example:
final byte[] buf = new byte[4096];
int numRead;
// Use try-with-resources to auto-close streams.
try(
final FileInputStream fis = new FileInputStream(...);
final ByteArrayOutputStream baos = new ByteArrayOutputStream()
) {
while ((numRead = fis.read(buf)) > 0) {
baos.write(buf, 0, numRead);
}
final byte[] allBytes = baos.toByteArray();
// Do something with the data.
}
catch( final Exception e ) {
// Do something on failure...
}
If you instead wanted to write Java ints, or other things that aren't raw bytes, you can wrap your ByteArrayOutputStream in a DataOutputStream:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
while (thereAreMoreIntsFromSomewhere()) {
int someInt = getIntFromSomewhere();
dos.writeInt(someInt);
}
byte[] allBytes = baos.toByteArray();
Depends.
Library
Converting file formats tends to be a solved problem for most problem domains. For example:
Batik can transcode between various image formats (including TIFF).
Apache POI can convert between office spreadsheet formats.
Flexmark can generate HTML from Markdown.
The list is long. The first question should be, "What library can accomplish this task?" If performance is a consideration, your time is likely better spent optimising an existing package to meet your needs than writing yet another tool. (As a bonus, other people get to benefit from the centralised work.)
Known Quantities
Reading a file? Allocate file.size() bytes.
Copying a string? Allocate string.length() bytes.
Copying a TCP packet? Allocate 1500 bytes, for example.
Unknown Quantities
When the number of bytes is truly unknown, you can do a few things:
Make a guess.
Analyze example data sets to buffer; use the average length.
Example
Java's StringBuffer, unless otherwise instructed, uses an initial buffer size to hold 16 characters. Once the 16 characters are filled, a new, longer array is allocated, and then the original 16 characters copied. If the StringBuffer had an initial size of 1024 characters, then the reallocation would not happen as early or as often.
Optimization
Either way, this is probably a premature optimization. Typically you would allocate a set number of bytes when you want to reduce the number of internal memory reallocations that get executed.
It is unlikely that this will be the application's bottleneck.
The idea is that it's only a buffer - not the whole of the data. It's a temporary resting spot for data as you read a chunk, process it (possibly writing it somewhere else). So, allocate yourself a big enough "chunk" and it normally won't be a problem.
What problem are you anticipating?