BufferedReader with FileReader working - java

I have a doubt regarding how BufferedReader works with FileReader. I have studied most of the posts on Stack Overflow and Google as well, but my doubt is still not cleared. This is my third day trying to understand this..! :)
Here it is:
My Understanding says, when we use below code snippet
BufferedReader in = new BufferedReader(new FileReader("foo.in"));
FileReader reads the data byte-wise and puts it into a buffer. Here the buffer is created by BufferedReader, and the BufferedReader instance reads from that buffer.
This made me think, because this post, Understanding how BufferedReader works in Java, says that BufferedReader doesn't store anything itself. If that's the case, then I thought BufferedReader is doing two things: first it creates a buffer, and second it creates an instance of BufferedReader which reads from that buffer...! Makes sense...?
My second doubt: BufferedReader can be used to avoid I/O operations, that is, to avoid the time-consuming work where bytes are read from disk, converted to chars, and then handed out. To overcome this, BufferedReader can be used, which reads a big chunk of data at once. This makes me think that when BufferedReader is wrapped around FileReader, the FileReader stream reads first and the data is then passed to BufferedReader. Then how does it take a big chunk...?
My understanding is that BufferedReader is helpful because it reads data from a buffer, which is in memory. So rather than reading bytes from disk and converting them at the same time, it first puts all the bytes in a buffer in memory and then reads from there, because that is fast to read and can be converted to chars as well. I concluded this from reading online, but I don't agree 100%, because no step is skipped even after putting the data into the buffer, so how does it reduce the time taken....? :(
I'm literally confused by these. Can anyone help me understand this more precisely?

FileReader reads the data byte-wise
No. It constructs a FileInputStream and an InputStreamReader (FileReader is a subclass of InputStreamReader), and reads from the latter, as characters.
and puts it into a buffer
It puts the data into the caller's buffer.
Here the buffer is created by BufferedReader, and the BufferedReader instance reads from that buffer.
Correct.
This made me think, because this post, Understanding how BufferedReader works in Java, says that BufferedReader doesn't store anything itself.
That statement in that post is complete and utter nonsense, and so is any other source that says so. Of course it stores data. It is a buffer. See the Javadoc, and specifically the following statement: 'reads text from a character-input stream, buffering characters [my emphasis] so as to provide for the efficient reading of characters, arrays, and lines.'
If that's the case, then I thought BufferedReader is doing two things: first it creates a buffer, and second it creates an instance of BufferedReader which reads from that buffer...! Makes sense...?
No, but neither did your source. Your first intuition above was correct.
My second doubt: BufferedReader can be used to avoid I/O operations, that is, to avoid the time-consuming work where bytes are read from disk, converted to chars, and then handed out. To overcome this, BufferedReader can be used, which reads a big chunk of data at once. This makes me think that when BufferedReader is wrapped around FileReader, the FileReader stream reads first and the data is then passed to BufferedReader. Then how does it take a big chunk...?
By supplying a big buffer to FileReader.read().
My understanding is that BufferedReader is helpful because it reads data from a buffer, which is in memory. So rather than reading bytes from disk and converting them at the same time, it first puts all the bytes in a buffer in memory and then reads from there, because that is fast to read and can be converted to chars as well. I concluded this from reading online, but I don't agree 100%, because no step is skipped even after putting the data into the buffer, so how does it reduce the time taken....? :(
The step of reading character by character from the disk is skipped. Reading a chunk from a disk file costs more or less the same as reading one byte, and system calls are themselves expensive, so one large read is far cheaper than many tiny ones.
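The "big buffer" point can be made visible with a small experiment. The following is only a sketch: CountingReader and callsWhenReadingCharByChar are hypothetical names invented for this illustration, and a StringReader stands in for the file. It counts how many read calls actually reach the underlying reader when the caller consumes the text one character at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

class ChunkedReadDemo {

    /** Hypothetical helper: wraps a Reader and counts the read calls that reach it. */
    static class CountingReader extends Reader {
        private final Reader delegate;
        int calls = 0;

        CountingReader(Reader delegate) {
            this.delegate = delegate;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            calls++;
            return delegate.read(buf, off, len);
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }

    /** Consumes the text one char at a time; returns how many calls hit the source. */
    static int callsWhenReadingCharByChar(String text, boolean buffered) {
        CountingReader source = new CountingReader(new StringReader(text));
        try (Reader in = buffered ? new BufferedReader(source) : source) {
            while (in.read() != -1) {
                // discard the character; only the call count matters here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.calls;
    }

    public static void main(String[] args) {
        char[] chars = new char[1000];
        Arrays.fill(chars, 'x');
        String text = new String(chars);
        System.out.println("unbuffered calls: " + callsWhenReadingCharByChar(text, false));
        System.out.println("buffered calls:   " + callsWhenReadingCharByChar(text, true));
    }
}
```

With a real FileReader each of those underlying calls can end up as a system call, which is why the buffered variant tends to be so much cheaper for character-at-a-time or line-at-a-time reading.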

Related

Buffer and File in Java

I'm new to Java and I want to ask what the difference is between using FileReader/FileWriter and using BufferedReader/BufferedWriter. Apart from speed, is there any other reason to use the buffered classes?
In code that copies a file's content into another file, is it better to use BufferedReader and BufferedWriter?
The short version is: an unbuffered FileWriter/FileReader passes each write/read straight through to the underlying stream, which is immediate but inefficient for many small operations, whereas a buffered writer/reader saves up writes/reads and does them in chunks (based on the buffer size), which is far more efficient overall, although output may be delayed until the buffer fills or is flushed.
So to answer your question, a buffered writer/reader is generally best, especially if you are not sure which one to use.
Take a look at the JavaDoc for the BufferedWriter, it does a great job of explaining how it works:
In general, a Writer sends its output immediately to the underlying
character or byte stream. Unless prompt output is required, it is
advisable to wrap a BufferedWriter around any Writer whose write()
operations may be costly, such as FileWriters and OutputStreamWriters.
For example,
PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("foo.out")));
will buffer the PrintWriter's output to the file. Without buffering,
each invocation of a print() method would cause characters to be
converted into bytes that would then be written immediately to the
file, which can be very inefficient.
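To illustrate the coalescing the Javadoc describes, here is a minimal sketch. CountingWriter and callsWhenWritingCharByChar are hypothetical names made up for this example, and an in-memory StringWriter stands in for the file. It counts how many write calls reach the underlying Writer when the caller writes one character at a time.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class CoalescedWriteDemo {

    /** Hypothetical helper: counts write calls and keeps the output in memory. */
    static class CountingWriter extends Writer {
        private final StringWriter sink = new StringWriter();
        int calls = 0;

        @Override
        public void write(char[] buf, int off, int len) {
            calls++;
            sink.write(buf, off, len);
        }

        @Override
        public void flush() {
            // nothing to flush for an in-memory sink
        }

        @Override
        public void close() {
            // nothing to release
        }

        String contents() {
            return sink.toString();
        }
    }

    /** Writes 'x' the given number of times, one char per call; returns calls seen by the target. */
    static int callsWhenWritingCharByChar(int chars, boolean buffered) {
        CountingWriter target = new CountingWriter();
        try (Writer out = buffered ? new BufferedWriter(target) : target) {
            for (int i = 0; i < chars; i++) {
                out.write('x');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target.calls;
    }

    public static void main(String[] args) {
        System.out.println("unbuffered calls: " + callsWhenWritingCharByChar(100, false));
        System.out.println("buffered calls:   " + callsWhenWritingCharByChar(100, true));
    }
}
```

The trade-off mentioned above also shows up here: in the buffered case nothing reaches the target until the buffer fills, flush() is called, or the writer is closed.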

How does BufferedReader read files from S3?

I have a very large file (several GB) in AWS S3, and I only need a small number of lines in the file which satisfy a certain condition. I don't want to load the entire file in-memory and then search for and print those few lines - the memory load for this would be too high. The right way would be to only load those lines in-memory which are needed.
As per AWS documentation to read from file:
fullObject = s3Client.getObject(new GetObjectRequest(bucketName, key));
displayTextInputStream(fullObject.getObjectContent());

private static void displayTextInputStream(InputStream input) throws IOException {
    // Read the text input stream one line at a time and display each line.
    BufferedReader reader = new BufferedReader(new InputStreamReader(input));
    String line = null;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
    System.out.println();
}
Here we are using a BufferedReader. It is not clear to me what is happening underneath here.
Are we making a network call to S3 each time we are reading a new line, and only keeping the current line in the buffer? Or is the entire file loaded in-memory and then read line-by-line by BufferedReader? Or is it somewhere in between?
One answer to your question is already given in the documentation you linked:
Your network connection remains open until you read all of the data or close the input stream.
A BufferedReader doesn't know where the data it reads is coming from, because you're passing another Reader to it. A BufferedReader creates a buffer of a certain size (e.g. 4096 characters) and fills this buffer by reading from the underlying Reader before it starts handing out data in response to calls of read() or read(char[] buf).
The Reader you pass to the BufferedReader is - by the way - using another buffer for itself to do the conversion from a byte-based stream to a char-based reader. It works the same way as with BufferedReader, so the internal buffer is filled by reading from the passed InputStream which is the InputStream returned by your S3-client.
What exactly happens within this client if you attempt to load data from the stream is implementation dependent. One way would be to keep open one network connection and you can read from it as you wish or the network connection can be closed after a chunk of data has been read and a new one is opened when you try to get the next one.
The documentation quoted above seems to say that we've got the former situation here, so: No, calls of readLine are not leading to single network calls.
And to answer your other question: No, the BufferedReader, the InputStreamReader and most likely the InputStream returned by the S3 client do not load the whole document into memory. That would contradict the whole purpose of using streams in the first place, and the S3 client could simply return a byte[][] instead (to get around the limit of 2^31 - 1 elements per byte array).
Edit: There is an exception to the last paragraph. If the whole multi-gigabyte document has no line breaks, calling readLine will actually read the whole data into memory (and most likely lead to an OutOfMemoryError). I assumed a "regular" text document while answering your question.
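As a rough sketch of the "keep only the lines you need" approach from the question: the method below streams any InputStream line by line and retains only matching lines. The name matchingLines and the sample data are invented for illustration, a ByteArrayInputStream stands in for the stream returned by the S3 client, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class LineFilterDemo {

    /**
     * Reads the stream one line at a time and keeps only the lines that satisfy
     * the condition. Apart from the matches, only the reader's buffers and the
     * current line are held in memory at any moment.
     */
    static List<String> matchingLines(InputStream input, Predicate<String> condition) {
        List<String> matches = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(input, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (condition.test(line)) {
                    matches.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in data; in the question this would be fullObject.getObjectContent().
        byte[] data = "alpha\nERROR one\nbeta\nERROR two\n".getBytes(StandardCharsets.UTF_8);
        List<String> hits = matchingLines(new ByteArrayInputStream(data),
                                          line -> line.startsWith("ERROR"));
        System.out.println(hits);
    }
}
```

The caveat from the edit above still applies: a single enormous line would be materialized in full by readLine().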
If you are basically not searching for a specific word/words, and you are aware of the bytes range, you can also use Range header in S3. This should be specifically useful as you are working with a single file of several GB size. Specifying Range not only helps to reduce the memory, but also is faster, as only the specified part of the file is read.
See Is there "S3 range read function" that allows to read assigned byte range from AWS-S3 file?
Hope this helps.
Sreram
Depends on the size of the lines in your file. readLine() will continue to build the string fetching data from the stream in blocks the size of your buffer size, until you hit a line termination character. So the memory used will be on the order of your line length + buffer length.
Only a single HTTP call is made to the AWS infrastructure, and the data is read into memory in small blocks, of which the size may vary and is not directly under your control.
This is very memory-efficient already, assuming each line in the file is a reasonably small size.
One way to optimize further (for network and compute resources), assuming that your "certain condition" is a simple string match, is to use S3 Select: https://aws.amazon.com/s3/features/#s3-select

Why is it advisable to wrap BufferedReader around InputStreamReader?

BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
Please explain why InputStreamReader(System.in) is passed to the BufferedReader constructor.
The docs answer this very question.
In general, each read request made of a Reader causes a corresponding read request to be made of the underlying character or byte stream. It is therefore advisable to wrap a BufferedReader around any Reader whose read() operations may be costly, such as FileReaders and InputStreamReaders.
For example,
BufferedReader in = new BufferedReader(new FileReader("foo.in"));
will buffer the input from the specified file. Without buffering, each invocation of read() or readLine() could cause bytes to be read from the file, converted into characters, and then returned, which can be very inefficient.
A data buffer (or just buffer) is a region of a physical memory storage used to temporarily store data while it is being moved from one place to another. Typically, the data is stored in a buffer as it is retrieved from an input device (such as a microphone, file, disk, etc) or just before it is sent to an output device (such as speakers).
Wrapping the reader in BufferedReader increases the program efficiency by reducing the time it takes to read data from external device each time directly. Instead, BufferedReader can read the data from the external device and store the data in buffer for further processing.
The Java BufferedReader class provides buffering to your Reader instances. Buffering can speed up I/O quite a bit.

Where is the data queued with a BufferedReader?

I am reading a large csv from a web service Like this:
br = new BufferedReader(new InputStreamReader(website.openStream(), "UTF-16"));
I read line by line and write into a database. The writing into a database is the bottleneck of this operation and I am wondering if it is possible that I will "timeout" with the webservice so I get the condition where the webservice just cuts the connection because I am not reading anything from it...
Or does the BufferedReader just buffer the stream into memory until I read from it?
Yes, it is possible that the webservice stream will time out while you are writing to the db. If the db is slow enough that this might happen, then you may need to copy the file locally before pushing it into the db.
+1 for Brian's answer.
Furthermore, I would recommend you have a look at my csv-db-tools on GitHub. The csv-db-importer module illustrates how to import large CSV files into the database. The code is optimized to insert one row at a time and keep the memory free from data buffered from large CSV files.
BufferedReader will, as you have speculated, read the contents of the stream into memory. Any calls to read or readLine will read data from the buffer, not from the original stream, assuming the data is already available in the buffer. The advantage here is that data is read in larger batches, rather than requested from the stream at each invocation of read or readLine.
You will likely only experience a timeout like you describe if you are reading large amounts of data. I had some trouble finding a credible reference but I have seen several mentions of the default buffer size on BufferedReader being 8192 chars. This means that if your stream is delivering more than 8192 chars of data, the buffer could potentially fill and cause your process to wait on the DB bottleneck before reading more data from the stream.
If you think you need to reserve a larger buffer than this, the BufferedReader constructor is overloaded with a second parameter allowing you to specify the size of the buffer in chars. Keep in mind, though, that unless you are moving small enough pieces of data to buffer the entire stream, you could run into the same problem even with a larger buffer.
br = new BufferedReader(new InputStreamReader(website.openStream(), "UTF-16"), size);
will initialize your BufferedReader with a buffer of size chars.
EDIT:
After reading @Keith's comment, I think he's got the right of it here. If you experience timeouts the smaller buffer will cause you to read from the socket more frequently, hopefully eliminating that issue. If he posts an answer with that you should accept his.
BufferedReader just reads in chunks into an internal buffer, whose default size is unspecified but has been 8192 chars for many years. It doesn't do anything while you're not calling it.
But I don't think your perceived problem even exists. I don't see how the web service will even know. Write timeouts in TCP are quite difficult to implement. Some platforms have APIs for that, but they aren't supported by Java.
Most likely the web service is just using a blocking mode socket and it will just block in its write if you aren't reading fast enough.
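The effect of the two-argument constructor can be checked with a small sketch. RecordingReader and largestChunkRequested are hypothetical names for this illustration, and a StringReader stands in for the web service stream. It records the largest chunk BufferedReader ever requests from the underlying reader, which shows that the size argument is measured in chars and that nothing is read ahead beyond one buffer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class BufferSizeDemo {

    /** Hypothetical helper: remembers the largest chunk any caller asked for. */
    static class RecordingReader extends Reader {
        private final Reader delegate;
        int largestRequest = 0;

        RecordingReader(Reader delegate) {
            this.delegate = delegate;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            largestRequest = Math.max(largestRequest, len);
            return delegate.read(buf, off, len);
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }

    /** Drains the text line by line; bufferChars <= 0 means "use the default size". */
    static int largestChunkRequested(String text, int bufferChars) {
        RecordingReader source = new RecordingReader(new StringReader(text));
        try (BufferedReader in = bufferChars > 0
                ? new BufferedReader(source, bufferChars)
                : new BufferedReader(source)) {
            while (in.readLine() != null) {
                // a real program would write the line to the database here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.largestRequest;
    }

    public static void main(String[] args) {
        String csv = "id,name\n1,alice\n2,bob\n";
        System.out.println("explicit size 16: " + largestChunkRequested(csv, 16));
        // The default is an implementation detail, so it is printed rather than assumed.
        System.out.println("default size:     " + largestChunkRequested(csv, 0));
    }
}
```

Because BufferedReader only refills when the caller asks for more data, a slow consumer simply stops pulling from the socket between refills, which is consistent with the answers above.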

Is it overkill to use BufferedWriter and BufferedOutputStream together?

I want to write to a socket. From reading about network IO, it seems to me that the optimal way to write to it is to do something like this:
OutputStream outs=null;
BufferedWriter out=null;
out =
new BufferedWriter(
new OutputStreamWriter(new BufferedOutputStream(outs),"UTF-8"));
The BufferedWriter would buffer the input to the OutputStreamWriter which is recommended, because it prevents the writer from starting up the encoder for each character.
The BufferedOutputStream would then buffer the bytes from the Writer to avoid putting one byte at a time potentially onto the network.
It looks a bit like overkill, but it all seems like it helps?
Grateful for any help..
EDIT: From the javadoc on OutputStreamWriter:
Each invocation of a write() method causes the encoding converter to be invoked on the given character(s). The resulting bytes are accumulated in a buffer before being written to the underlying output stream. The size of this buffer may be specified, but by default it is large enough for most purposes. Note that the characters passed to the write() methods are not buffered.
For top efficiency, consider wrapping an OutputStreamWriter within a BufferedWriter so as to avoid frequent converter invocations. For example:
Writer out = new BufferedWriter(new OutputStreamWriter(System.out));
The purpose of the Buffered* classes is to coalesce small write operations into a larger one, thereby reducing the number of system calls, and increasing throughput.
Since a BufferedWriter already collects writes in a buffer, then converts the characters in the buffer into another buffer, and writes that buffer to the underlying OutputStream in a single operation, the OutputStream is already invoked with large write operations. Therefore, a BufferedOutputStream finds nothing to combine, and is simply redundant.
As an aside, the same can apply to the BufferedWriter: buffering will only help if the writer is only passed few characters at a time. If you know the caller only writes huge strings, the BufferedWriter will find nothing to combine and is redundant, too.
The BufferedWriter would buffer the input to the outputStreamWriter, which is recommended because it prevents the writer from starting up the encoder for each character.
Recommended by who, and in what context? What do you mean by "starting up the encoder"? Are you writing a single character at a time to the writer anyway? (We don't know much about how you're using the writer... that could be important.)
The BufferedOutputStream would then buffer the bytes from the Writer to avoid putting one byte at a time potentially onto the network.
What makes you think it would write one byte at a time? I think it very unlikely that OutputStreamWriter will write a byte at a time to the underlying stream, unless you really write a character at a time to it.
Additionally, I'd expect the network output stream to use something like Nagle's algorithm to avoid sending single-byte packets.
As ever with optimization, you should do it based on evidence... have you done any testing with and without these layers of buffering?
EDIT: Just to clarify, I'm not saying the buffering classes are useless. In some cases they're absolutely the right way to go. I'm just saying that as with all optimization, they shouldn't be used blindly. You should consider what you're trying to optimize for (processor usage, memory usage, network usage etc) and measure. There are many factors which matter here - not least of which is the write pattern. If you're already writing "chunkily" - writing large blocks of character data - then the buffers will have relatively little impact. If you're actually writing a single character at a time to the writer, then they would be more significant.
Yes it is overkill. From the Javadoc for OutputStreamWriter: "Each invocation of a write() method causes the encoding converter to be invoked on the given character(s). The resulting bytes are accumulated in a buffer before being written to the underlying output stream.".
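In the spirit of "measure before optimizing", the following sketch compares the two stacks from the question. CountingOutputStream and streamWritesFor are hypothetical names for this illustration, an in-memory sink stands in for the socket, and the result is for one specific write pattern only: 100 single-character writes followed by close.

```java
import java.io.BufferedOutputStream;
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class EncoderBufferDemo {

    /** Hypothetical helper: counts every write call that reaches the final stream. */
    static class CountingOutputStream extends OutputStream {
        private final ByteArrayOutputStream sink = new ByteArrayOutputStream();
        int calls = 0;

        @Override
        public void write(int b) {
            calls++;
            sink.write(b);
        }

        @Override
        public void write(byte[] b, int off, int len) {
            calls++;
            sink.write(b, off, len);
        }
    }

    /** Writes 'x' one char at a time through the stack; returns writes seen by the target. */
    static int streamWritesFor(int chars, boolean withBufferedOutputStream) {
        CountingOutputStream target = new CountingOutputStream();
        OutputStream bytes = withBufferedOutputStream ? new BufferedOutputStream(target) : target;
        try (Writer out = new BufferedWriter(new OutputStreamWriter(bytes, StandardCharsets.UTF_8))) {
            for (int i = 0; i < chars; i++) {
                out.write('x');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target.calls;
    }

    public static void main(String[] args) {
        System.out.println("without BufferedOutputStream: " + streamWritesFor(100, false));
        System.out.println("with BufferedOutputStream:    " + streamWritesFor(100, true));
    }
}
```

If both counts come out the same for your real write pattern, the extra BufferedOutputStream layer is only adding a copy; a different pattern (for example, frequent explicit flushes) could behave differently, so it is worth rerunning with representative data.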
