I have a very large file (several GB) in AWS S3, and I only need a small number of lines in the file which satisfy a certain condition. I don't want to load the entire file in-memory and then search for and print those few lines - the memory load for this would be too high. The right way would be to only load those lines in-memory which are needed.
The AWS documentation gives this example for reading from an object:
    fullObject = s3Client.getObject(new GetObjectRequest(bucketName, key));
    displayTextInputStream(fullObject.getObjectContent());

    private static void displayTextInputStream(InputStream input) throws IOException {
        // Read the text input stream one line at a time and display each line.
        BufferedReader reader = new BufferedReader(new InputStreamReader(input));
        String line = null;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        System.out.println();
    }
Here we are using a BufferedReader. It is not clear to me what is happening underneath here.
Are we making a network call to S3 each time we are reading a new line, and only keeping the current line in the buffer? Or is the entire file loaded in-memory and then read line-by-line by BufferedReader? Or is it somewhere in between?
One answer to your question is already given in the documentation you linked:
Your network connection remains open until you read all of the data or close the input stream.
A BufferedReader doesn't know where the data it reads is coming from, because you're passing another Reader to it. A BufferedReader creates a buffer of a certain size (8192 characters by default) and fills this buffer by reading from the underlying Reader before it starts handing out data to calls of read() or read(char[] buf).
The Reader you pass to the BufferedReader is - by the way - using another buffer for itself to do the conversion from a byte-based stream to a char-based reader. It works the same way as with BufferedReader, so the internal buffer is filled by reading from the passed InputStream which is the InputStream returned by your S3-client.
What exactly happens within this client when you load data from the stream is implementation dependent. One option is to keep a single network connection open and let you read from it as you wish. Another is to close the connection after a chunk of data has been read and open a new one when you ask for the next chunk.
The documentation quoted above seems to say that we've got the former situation here, so: No, calls to readLine do not each trigger a separate network call.
And to answer your other question: No, the BufferedReader, the InputStreamReader and most likely the InputStream returned by the S3 client do not load the whole document into memory. That would contradict the whole purpose of using streams in the first place, and the S3 client could simply return a byte[][] instead (to get around the limit of 2^31 - 1 elements per byte array).
Edit: There is an exception to the last paragraph. If the whole multi-gigabyte document contains no line breaks, calling readLine will actually read all of the data into memory (and most likely end in an OutOfMemoryError). I assumed a "regular" text document while answering your question.
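To see the chunked behaviour for yourself, here is a minimal sketch that stands in for the S3 stream with an in-memory stream and counts how often the reader stack actually pulls bytes from it. ReadCallCounter and CountingInputStream are hypothetical names made up for this illustration and are not part of the AWS SDK. A real network stream may return less data per call than an in-memory one does, so treat the numbers as indicative only.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadCallCounter {

    /** Hypothetical wrapper that counts how often the layers above ask the source for bytes. */
    static final class CountingInputStream extends FilterInputStream {
        int readCalls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            readCalls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            readCalls++;
            return super.read(b, off, len);
        }
    }

    /** Reads lineCount generated lines and returns {linesRead, readCallsOnSource}. */
    static int[] readAllLines(int lineCount) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < lineCount; i++) {
            text.append("line ").append(i).append('\n');
        }
        // Stand-in for the stream an S3 client would return (assumption for this demo).
        CountingInputStream source = new CountingInputStream(
                new ByteArrayInputStream(text.toString().getBytes(StandardCharsets.UTF_8)));

        int lines = 0;
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(source, StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                lines++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new int[] {lines, source.readCalls};
    }

    public static void main(String[] args) {
        int[] result = readAllLines(1000);
        System.out.println(result[0] + " lines, " + result[1] + " reads on the underlying stream");
    }
}
```

The number of reads on the source should come out far smaller than the number of lines, because each read fills a buffer that then serves many readLine calls.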
If you are not searching for specific words and you already know the byte range you need, you can also use the Range header in S3. This is especially useful because you are working with a single file of several GB. Specifying a Range not only reduces memory use but is also faster, as only the specified part of the file is read.
See Is there "S3 range read function" that allows to read assigned byte range from AWS-S3 file?
Hope this helps.
Sreram
Depends on the size of the lines in your file. readLine() will continue to build the string fetching data from the stream in blocks the size of your buffer size, until you hit a line termination character. So the memory used will be on the order of your line length + buffer length.
Only a single HTTP call is made to the AWS infrastructure, and the data is read into memory in small blocks, of which the size may vary and is not directly under your control.
This is very memory-efficient already, assuming each line in the file is a reasonably small size.
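As a rough sketch of what this looks like for the original problem (keeping only the lines that satisfy a condition), the loop below holds one line plus the reader's buffers in memory at a time, and stores only the matches. LineFilter and matchingLines are hypothetical names, UTF-8 is an assumed encoding, and the InputStream could be the one returned by the S3 client.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class LineFilter {

    /** Streams through input line by line and collects only the lines matching condition. */
    static List<String> matchingLines(InputStream input, Predicate<String> condition) {
        List<String> matches = new ArrayList<>();
        // try-with-resources closes the reader and, through it, the underlying stream.
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(input, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (condition.test(line)) {
                    matches.add(line); // non-matching lines become garbage immediately
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a large remote object (assumption for this demo).
        byte[] data = "alpha\nERROR one\nbeta\nERROR two\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(matchingLines(new ByteArrayInputStream(data), l -> l.startsWith("ERROR")));
    }
}
```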
One way to optimize further (for network and compute resources), assuming that your "certain condition" is a simple string match, is to use S3 Select: https://aws.amazon.com/s3/features/#s3-select
Related
I'm new to Java and I want to ask what the difference is between using FileReader/FileWriter and using BufferedReader/BufferedWriter. Apart from speed, is there any other reason to use the buffered classes?
In code that copies a file's content into another file, is it better to use BufferedReader and BufferedWriter?
The short version is: an unbuffered file reader/writer passes every read or write straight through to the underlying stream, which is inefficient when there are many small operations. A buffered reader/writer saves up reads/writes and does them in chunks (based on the buffer size), which is far more efficient, although output can be delayed until the buffer fills up or is flushed.
So to answer your question, a buffered reader/writer is generally best, especially if you are not sure which one to use.
Take a look at the JavaDoc for the BufferedWriter, it does a great job of explaining how it works:
In general, a Writer sends its output immediately to the underlying
character or byte stream. Unless prompt output is required, it is
advisable to wrap a BufferedWriter around any Writer whose write()
operations may be costly, such as FileWriters and OutputStreamWriters.
For example,
PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("foo.out")));
will buffer the PrintWriter's output to the file. Without buffering,
each invocation of a print() method would cause characters to be
converted into bytes that would then be written immediately to the
file, which can be very inefficient.
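For the file-copy case from the question, a minimal sketch might look like the following. BufferedCopy, copyText and copyViaTempFiles are hypothetical names, and UTF-8 is an assumed encoding. Files.newBufferedReader and Files.newBufferedWriter return buffered wrappers directly, which saves spelling out the FileReader/FileWriter wrapping by hand.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BufferedCopy {

    /** Copies text from source to target in chunks through buffered reader/writer. */
    static void copyText(Path source, Path target) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] chunk = new char[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } // closing the writer flushes whatever is still sitting in its buffer
    }

    /** Demo helper: writes text to a temp file, copies it, and returns the copy's content. */
    static String copyViaTempFiles(String text) {
        try {
            Path source = Files.createTempFile("copy-src", ".txt");
            Path target = Files.createTempFile("copy-dst", ".txt");
            try {
                Files.write(source, text.getBytes(StandardCharsets.UTF_8));
                copyText(source, target);
                return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(source);
                Files.deleteIfExists(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(copyViaTempFiles("first line\nsecond line\n"));
    }
}
```

If the goal is only a byte-for-byte copy with no text processing, Files.copy(source, target) is simpler and avoids the character decoding altogether.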
I have a doubt about how BufferedReader works with FileReader. I have studied most of the posts on Stack Overflow and Google, but my doubt is still not cleared. This is the third day I have spent trying to understand it! :)
Here it is:
My understanding is that when we use the code snippet below,
BufferedReader in = new BufferedReader(new FileReader("foo.in"));
FileReader reads the data byte-wise and puts it into a buffer. Here the buffer is created by BufferedReader, and the instance of BufferedReader reads from that buffer.
This made me think, because the post "Understanding how BufferedReader works in Java" says that BufferedReader doesn't store anything itself. If that is the case, then I thought BufferedReader must be doing two things: creating a buffer, and creating an instance of BufferedReader that reads from that buffer. Does that make sense?
My second doubt: BufferedReader can be used to avoid I/O operations, that is, to avoid the time-consuming work of reading bytes from disk, converting them to chars, and handing them out. To overcome this, BufferedReader reads a big chunk of data at once. This makes me think that when BufferedReader is wrapped around FileReader, the FileReader stream reads first and the data is then passed to BufferedReader. So how does it take a big chunk?
My understanding is that BufferedReader is helpful because it reads data from a buffer, which is memory. Rather than reading bytes from disk and converting them at the same time, it first puts all the bytes in a buffer in memory and then reads from there, because that is fast to read and can be converted to chars as well. I concluded this from reading online, but I don't agree 100%, because no step is skipped even after putting the data into the buffer, so how does it reduce the time taken? :(
I'm really confused by this. Can anyone help me understand it more precisely?
FileReader reads the data byte-wise
No. It constructs a FileInputStream and an InputStreamReader, and reads from the latter, as characters.
and puts it into a buffer
Puts into the caller's buffer.
Here the buffer is created by BufferedReader, and the instance of BufferedReader reads from that buffer.
Correct.
This made me think, because the post "Understanding how BufferedReader works in Java" says that BufferedReader doesn't store anything itself
That statement in that post is complete and utter nonsense, and so is any other source that says so. Of course it stores data. It is a buffer. See the Javadoc, and specifically the following statement: 'reads text from a character-input stream, buffering characters [my emphasis] so as to provide for the efficient reading of characters, arrays, and lines.'
If that is the case, then I thought BufferedReader must be doing two things: creating a buffer, and creating an instance of BufferedReader that reads from that buffer. Does that make sense?
No, but neither did your source. Your first intuition above was correct.
My second doubt: BufferedReader can be used to avoid I/O operations, that is, to avoid the time-consuming work of reading bytes from disk, converting them to chars, and handing them out. To overcome this, BufferedReader reads a big chunk of data at once. This makes me think that when BufferedReader is wrapped around FileReader, the FileReader stream reads first and the data is then passed to BufferedReader. So how does it take a big chunk?
By supplying a big buffer to FileReader.read().
My understanding is that BufferedReader is helpful because it reads data from a buffer, which is memory. Rather than reading bytes from disk and converting them at the same time, it first puts all the bytes in a buffer in memory and then reads from there, because that is fast to read and can be converted to chars as well. I concluded this from reading online, but I don't agree 100%, because no step is skipped even after putting the data into the buffer, so how does it reduce the time taken? :(
The step of reading character by character from the disk is skipped. It is more or less just as efficient to read a chunk from a disk file as it is to read one byte, and system calls are themselves expensive.
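A small sketch can make "supplying a big buffer" visible: wrap the underlying Reader in something that records each request it receives, then read one character at a time through a BufferedReader. ChunkSizeProbe and RecordingReader are hypothetical names, and a CharArrayReader stands in for the FileReader so that the example needs no file on disk.

```java
import java.io.BufferedReader;
import java.io.CharArrayReader;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;

class ChunkSizeProbe {

    /** Hypothetical wrapper that records how the BufferedReader calls the underlying Reader. */
    static final class RecordingReader extends FilterReader {
        int calls = 0;
        int largestRequest = 0;

        RecordingReader(Reader in) {
            super(in);
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            calls++;
            largestRequest = Math.max(largestRequest, len);
            return super.read(cbuf, off, len);
        }
    }

    /** Returns {charsRead, callsOnUnderlyingReader, largestRequestedLength}. */
    static int[] probe(int textLength, int bufferSize) {
        char[] data = new char[textLength];
        Arrays.fill(data, 'x');
        RecordingReader source = new RecordingReader(new CharArrayReader(data));

        int chars = 0;
        try (BufferedReader reader = new BufferedReader(source, bufferSize)) {
            // The caller reads one char at a time; the buffer absorbs those calls.
            while (reader.read() != -1) {
                chars++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new int[] {chars, source.calls, source.largestRequest};
    }

    public static void main(String[] args) {
        int[] r = probe(10_000, 1024);
        System.out.println(r[0] + " chars read via " + r[1]
                + " underlying calls, largest request " + r[2] + " chars");
    }
}
```

The caller makes one read() call per character, but the underlying Reader only sees a handful of calls, each asking for a whole buffer's worth of characters.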
For example I have a file whose content is:
abcdefg
Then I use the following code to read 'defg'.
ByteBuffer bb = ByteBuffer.allocate(4);
int read = channel.read(bb, 3);
assert(read == 4);
Since there is enough data in the file, is this assertion guaranteed to hold? Can I assume that the method returns a number less than the limit of the given buffer only when there aren't enough bytes in the file?
Can I assume that the method returns a number less than the limit of the given buffer only when there aren't enough bytes in the file?
The Javadoc says:
a read might not fill the buffer
and gives some examples, and
returns the number of bytes read, possibly zero, or -1 if the channel has reached end-of-stream.
This is NOT sufficient to allow you to make that assumption.
In practice, you are likely to always get a full buffer when reading from a file, modulo the end of file scenario. And that makes sense from an OS implementation perspective, given the overheads of making a system call.
But I can also imagine situations where returning a half-empty buffer might make sense. For example, when reading from a locally mounted remote file system over a slow network link, there is some advantage in returning a partially filled buffer so that the application can start processing the data. Some future OS may implement the read system call to do that in this scenario. If you assume that you will always get a full buffer, you may get a surprise when your application is run on the (hypothetical) new platform.
Another issue is that there are some kinds of stream where you will definitely get partially filled buffers. Socket streams, pipes and console streams are obvious examples. If you code your application assuming file stream behavior, you could get a nasty surprise when someone runs it against another kind of stream ... and fails.
No, in general you cannot assume that the number of bytes read will be equal to the number of bytes requested, even if there are bytes left to be read in the file.
If you are reading from a local file, chances are that the number of bytes requested will actually be read, but this is by no means guaranteed (and won't likely be the case if you're reading a file over the network).
See the documentation for the ReadableByteChannel.read(ByteBuffer) method (which applies to FileChannel.read(ByteBuffer) as well). Assuming that the channel is in blocking mode and the buffer has space remaining, the only guarantee is that at least one byte will be read, unless the end of the stream has been reached, in which case -1 is returned.
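If you need exactly N bytes, the usual approach is to loop until the buffer is full or the end of the file is reached, rather than trusting a single call. The following is a minimal sketch of that idea using the positional read from the question; ChannelReadFully, readFully and demo are hypothetical names, and the temp file exists only to make the example self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChannelReadFully {

    /**
     * Keeps reading from the given file position until dst is full or end of file.
     * Returns the number of bytes actually read, which is smaller than the
     * buffer's capacity only if the file ended first.
     */
    static int readFully(FileChannel channel, ByteBuffer dst, long position) throws IOException {
        int total = 0;
        while (dst.hasRemaining()) {
            int n = channel.read(dst, position + total);
            if (n < 0) {
                break; // end of file
            }
            total += n;
        }
        return total;
    }

    /** Demo helper: writes content to a temp file and reads length bytes from position. */
    static String demo(String content, int position, int length) {
        try {
            Path file = Files.createTempFile("channel-demo", ".txt");
            try {
                Files.write(file, content.getBytes(StandardCharsets.US_ASCII));
                try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                    ByteBuffer bb = ByteBuffer.allocate(length);
                    int read = readFully(channel, bb, position);
                    return new String(bb.array(), 0, read, StandardCharsets.US_ASCII);
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("abcdefg", 3, 4)); // prints "defg"
    }
}
```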
Is there a way to get a Stream<String> stream out of a BufferedReader reader such that each string in stream represents one line of reader, with the additional condition that stream is provided directly (before reader has read everything)? I want to process the data of stream in parallel with getting it from reader, to save time.
Edit: I want to process the data in parallel with reading. I don't want to process different lines in parallel. They should be processed in order.
Let's take an example of how I want to save time. Let's say our reader will present 100 lines to us. It takes 2 ms to read one line and 1 ms to process it. If I first read all the lines and then process them, it will take me 300 ms. What I want to do is: as soon as a line is read, I want to process it and read the next line in parallel. The total time will then be 201 ms.
What I don't like about BufferedReader.lines(): as far as I understand, reading starts only when I want to process the lines. Let's assume I already have my reader but have to do precomputations before being able to process the first line. Let's say they cost 30 ms. In the above example the total time would then be 231 ms or 301 ms using reader.lines() (can you tell me which of those times is correct?). But it would be possible to get the job done in 201 ms, since the precomputations can be done in parallel with reading the first 15 lines.
You can use reader.lines().parallel(). This way your input will be split into chunks and further stream operations will be performed on the chunks in parallel. If the further operations take significant time, you might get a performance improvement.
In your case the default heuristic will not work as you want, and I guess there's no ready-made solution which will allow you to use single-line batches. You can write a custom spliterator which splits after each line. Look into the java.util.Spliterators.AbstractSpliterator implementation. Probably the easiest solution would be to write something similar, but limit batch sizes to one element in trySplit and read a single line in the tryAdvance method.
To do what you want, you would typically have one thread that reads lines and adds them to a blocking queue, and a second thread that takes lines from this blocking queue and processes them.
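A minimal sketch of that two-thread arrangement could look like this. PipelinedLines and process are hypothetical names. The sketch assumes an unbounded LinkedBlockingQueue for simplicity, which is only reasonable if processing keeps up with reading; with a slow consumer and a huge input a bounded queue would be the safer choice.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

class PipelinedLines {

    // Unique marker object meaning "no more lines". It is compared by identity, so a
    // real line that happens to contain the same text is not mistaken for it.
    private static final String END = new String("<end>");

    /** Reads lines on a background thread while the calling thread processes them in order. */
    static List<String> process(BufferedReader reader, Function<String, String> work) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(); // unbounded: an assumption
        AtomicReference<IOException> failure = new AtomicReference<>();

        Thread producer = new Thread(() -> {
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    queue.add(line);
                }
            } catch (IOException e) {
                failure.set(e);
            } finally {
                queue.add(END); // always signal the consumer, even after an error
            }
        });
        producer.start(); // reading begins now, before any line is processed

        List<String> results = new ArrayList<>();
        try {
            for (String line = queue.take(); line != END; line = queue.take()) {
                results.add(work.apply(line)); // single consumer, so order is preserved
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for lines", e);
        }
        if (failure.get() != null) {
            throw new UncheckedIOException(failure.get());
        }
        return results;
    }

    public static void main(String[] args) {
        BufferedReader reader = new BufferedReader(new StringReader("a\nb\nc\n"));
        System.out.println(process(reader, String::toUpperCase)); // prints [A, B, C]
    }
}
```

Because the producer thread is started before any processing happens, precomputations done on the calling thread between start() and the first take() would overlap with reading, which is the effect the question asks for.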
You are looking at the wrong place. You are thinking that a stream of lines will read lines from the file but that’s not how it works. You can’t tell the underlying system to read a line as no-one knows what a line is before reading.
A BufferedReader gets its name from its character buffer. This buffer has a default capacity of 8192 characters. Whenever a new line is requested, the buffer is scanned for a newline sequence and the part up to it is returned. When the buffer does not hold enough characters to complete the current line, the entire buffer will be filled.
Now, filling the buffer may lead to requests to read bytes from the underlying InputStream to fill the buffer of the character decoder. How many bytes will be requested and how many bytes will actually be read depends on the buffer size of the character decoder, on how many bytes of the actual encoding map to one character, and on whether the underlying InputStream has its own buffer and how big it is.
The actual expensive operation is the reading of bytes from the underlying stream and there is no trivial mapping from line read requests to these read operations. Requesting the first line may cause reading, let’s say one 16 KiB chunk from the underlying file, and the subsequent one hundred requests might be served from the filled buffer and cause no I/O at all. And nothing you do with the Stream API can change anything about that. The only thing you would parallelize is the search for new line characters in the buffer which is too trivial to benefit from parallel execution.
You could reduce the buffer sizes of all involved parties to roughly get your intended parallel reading of one line while processing the previous line. However, that parallel execution will never compensate for the performance degradation caused by the small buffer sizes.
I am reading a large CSV from a web service like this:
br = new BufferedReader(new InputStreamReader(website.openStream(), "UTF-16"));
I read line by line and write into a database. The writing into a database is the bottleneck of this operation and I am wondering if it is possible that I will "timeout" with the webservice so I get the condition where the webservice just cuts the connection because I am not reading anything from it...
Or does the BufferedReader just buffer the stream into memory until I read from it?
Yes, it is possible that the web service stream will time out while you are writing to the db. If the db really is slow enough that this might happen, then you may need to copy the file locally before pushing it into the db.
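As a rough sketch of the "copy locally first" approach: download the whole response to a temp file as fast as the network allows, close the connection, and only then feed the lines to the slow database writer. SpoolThenImport, spoolToTempFile and importLines are hypothetical names, and the Consumer stands in for whatever code writes a row to the database.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Consumer;

class SpoolThenImport {

    /** Drains the remote stream into a temp file without doing any slow work in between. */
    static Path spoolToTempFile(InputStream remote) throws IOException {
        Path local = Files.createTempFile("download", ".csv");
        // createTempFile already created the file, so it must be replaced.
        Files.copy(remote, local, StandardCopyOption.REPLACE_EXISTING);
        return local;
    }

    /** Spools the stream locally, then passes each line to slowSink; returns the line count. */
    static int importLines(InputStream remote, Charset charset, Consumer<String> slowSink) {
        try {
            Path local;
            try (InputStream in = remote) {
                local = spoolToTempFile(in);
            } // the remote connection is closed here, before the slow part starts
            try (BufferedReader reader = Files.newBufferedReader(local, charset)) {
                int count = 0;
                String line;
                while ((line = reader.readLine()) != null) {
                    slowSink.accept(line); // e.g. the database insert
                    count++;
                }
                return count;
            } finally {
                Files.deleteIfExists(local);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for website.openStream() (assumption for this demo).
        byte[] csv = "a,b\nc,d\n".getBytes(StandardCharsets.UTF_16);
        int n = importLines(new ByteArrayInputStream(csv), StandardCharsets.UTF_16, System.out::println);
        System.out.println(n + " lines imported");
    }
}
```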
+1 for Brian's answer.
Furthermore, I would recommend you have a look at my csv-db-tools on GitHub. The csv-db-importer module illustrates how to import large CSV files into the database. The code is optimized to insert one row at a time and keep the memory free from data buffered from large CSV files.
BufferedReader will, as you have speculated, read the contents of the stream into memory. Any calls to read or readLine will read data from the buffer, not from the original stream, assuming the data is already available in the buffer. The advantage here is that data is read in larger batches, rather than requested from the stream at each invocation of read or readLine.
You will likely only experience a timeout like you describe if you are reading large amounts of data. I had some trouble finding a credible reference, but I have seen several mentions of the default buffer size of BufferedReader being 8192 chars (note that the unit is chars, not bytes). This means that if your stream delivers more data than fits in the buffer, the buffer could potentially fill and cause your process to wait on the DB bottleneck before reading more data from the stream.
If you think you need to reserve a larger buffer than this, the BufferedReader constructor is overloaded with a second parameter allowing you to specify the size of the buffer in chars. Keep in mind, though, that unless you are moving small enough pieces of data to buffer the entire stream, you could run into the same problem even with a larger buffer.
br = new BufferedReader(new InputStreamReader(website.openStream(), "UTF-16"), size);
will initialize your BufferedReader with a buffer of size chars.
EDIT:
After reading @Keith's comment, I think he's got the right of it here. If you experience timeouts, a smaller buffer will cause you to read from the socket more frequently, hopefully eliminating that issue. If he posts an answer with that, you should accept his.
BufferedReader just reads in chunks into an internal buffer, whose default size is not specified in the Javadoc but has been 8192 chars for many years. It doesn't do anything while you're not calling it.
But I don't think your perceived problem even exists. I don't see how the web service will even know. Write timeouts in TCP are quite difficult to implement. Some platforms have APIs for that, but they aren't supported by Java.
Most likely the web service is just using a blocking mode socket and it will just block in its write if you aren't reading fast enough.