Does Stream.skip load everything into memory? - java

I have Stream<String> s = bufferedReader.lines();
The bufferedReader returns a lazy stream.
When using s.skip(100).limit(100), does it load all of s into memory, or is the stream only evaluated up to the skip and limit values?
What would be memory footprint for using the skip/limit function?

In order to skip 100 lines, it’s unavoidable to read the data into memory, because the locations of the line terminators must be found to know which character position to skip to.
Still, the memory footprint does not depend on the number of the skipped lines, as the size of the BufferedReader’s internal buffer will be fixed at the BufferedReader’s construction, see BufferedReader(Reader) and BufferedReader(Reader, int).
As a corner case, if a line doesn’t fit into the buffer (note that the default buffer size is 8192), a temporary StringBuffer will be used for that line.
Generally, there is no dedicated skip operation in the backends which the Stream API could exploit, which you can verify by looking at the Spliterator interface, so skipping lines is not different to repeatedly invoking readLine() on the BufferedReader, but dropping the resulting strings immediately.
As explained at the beginning, reading the data is unavoidable, so the cost of creating temporary string instances from the read data is negligible compared to the reading itself.
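For illustration, a minimal sketch of such a pipeline (the file name is an assumption); the memory retained after the terminal operation is essentially the reader’s buffer plus whatever the pipeline keeps, here the 100 collected lines:

// Lines 0..99 are read and dropped immediately; only lines 100..199 are retained.
try (BufferedReader reader = Files.newBufferedReader(Path.of("data.txt"))) {
    List<String> window = reader.lines()
            .skip(100)
            .limit(100)
            .collect(Collectors.toList());
    // memory held afterwards: the reader's 8 KiB char buffer + the 100 collected lines
}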

Related

Buffered vs Unbuffered. How does a buffer actually work?

How does a buffer actually optimize the process of reading/writing?
Every time we read a byte we access the file. I read that a buffer reduces the number of accesses to the file. The question is: how? In the buffered section of the picture, when we load bytes from the file into the buffer, we access the file just like in the unbuffered section, so where is the optimization?
I mean, the buffer must access the file every time it reads a byte, so even if the data in the buffer is read faster, this will not improve performance in the process of reading. What am I missing?
The fundamental misconception is to assume that a file is read byte by byte. Most storage devices, including hard drives and solid-state discs, organize the data in blocks. Likewise, network protocols transfer data in packets rather than single bytes.
This affects how the controller hardware and low-level software (drivers and operating system) work. Often, it is not even possible to transfer a single byte on this level. So, requesting the read of a single byte ends up reading one block and ignoring everything but one byte. Even worse, writing a single byte may imply reading an entire block, changing one byte of it, and writing the block back to the device. For network transfers, sending a packet with a payload of only one byte implies using 99% of the bandwidth for metadata rather than actual payload.
Note that sometimes, an immediate response is needed or a write is required to be definitely completed at some point, e.g. for safety. That’s why unbuffered I/O exists at all. But for most ordinary use cases, you want to transfer a sequence of bytes anyway and it should be transferred in chunks of a size suitable to the underlying hardware.
Note that even if the underlying system injects buffering of its own, or the hardware truly transfers single bytes, performing 100 operating system calls to transfer a single byte each is still significantly slower than performing a single operating system call telling it to transfer 100 bytes at once.
But you should not consider the buffer to be something between the file and your program, as suggested in your picture. You should consider the buffer to be part of your program. Just like you would not consider a String object to be something between your program and a source of characters, but rather a natural way to process such items. E.g. when you use the bulk read method of InputStream (e.g. of a FileInputStream) with a sufficiently large target array, there is no need to wrap the input stream in a BufferedInputStream; it would not improve the performance. You should just stay away from the single byte read method as much as possible.
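For example, a minimal sketch of that bulk read pattern (the file name and chunk size are assumptions chosen for illustration):

try (InputStream in = new FileInputStream("data.bin")) {
    byte[] chunk = new byte[8192];              // application-managed buffer
    int n;
    while ((n = in.read(chunk)) != -1) {
        // process chunk[0 .. n); no BufferedInputStream needed for bulk reads
    }
}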
As another practical example, when you use an InputStreamReader, it will already read the bytes into a buffer (so no additional BufferedInputStream is needed) and the internally used CharsetDecoder will operate on that buffer, writing the resulting characters into a target char buffer. When you use, e.g. Scanner, the pattern matching operations will work on that target char buffer of a charset decoding operation (when the source is an InputStream or ByteChannel). Then, when delivering match results as strings, they will be created by another bulk copy operation from the char buffer. So processing data in chunks is already the norm, not the exception.
This has been incorporated into the NIO design. So, instead of supporting a single byte read method and fixing it by providing a buffering decorator, as the InputStream API does, NIO’s ByteChannel subtypes only offer methods using application managed buffers.
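A hedged sketch of that style, where the application supplies the buffer to the channel (file name and buffer size are assumptions):

try (ReadableByteChannel ch = Files.newByteChannel(Path.of("data.bin"))) {
    ByteBuffer buf = ByteBuffer.allocate(8192);   // application-managed buffer
    while (ch.read(buf) != -1) {
        buf.flip();
        // consume the bytes between position and limit ...
        buf.clear();
    }
}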
So we could say, buffering is not improving the performance, it is the natural way of transferring and processing data. Rather, not buffering is degrading the performance by requiring a translation from the natural bulk data operations to single item operations.
As stated in your picture, buffered file contents are held in memory, whereas an unbuffered file is not read into memory until it is streamed to the program.
A File is only a representation of a pathname. Here is the File Javadoc:
An abstract representation of file and directory pathnames.
Meanwhile, a buffer such as ByteBuffer takes content from the file (depending on the buffer type, direct or indirect) and allocates it in memory.
The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance.
It actually depends on the situation: if the file is accessed repeatedly, then buffered I/O is the faster solution. But if the file is larger than main memory and is accessed only once, unbuffered I/O seems to be the better solution.
Basically, for reading: if you request 1 byte, the buffer reads 1000 bytes and returns you the first byte; for the next 999 single-byte reads it does not read anything from the file but serves them from its internal buffer in RAM. Only after you have read all 1000 bytes does it actually read another 1000 bytes from the actual file.
Same thing for writing but in reverse. If you write 1 byte it will be buffered and only if you have written 1000 bytes they may be written to the file.
Note that choosing the buffer size changes the performance quite a bit, see e.g. https://stackoverflow.com/a/237495/2442804 for further details, respecting file system block size, available RAM, etc.
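For instance (file name and buffer size are arbitrary choices for illustration):

// Each single-byte read() below is served from the in-memory buffer;
// the underlying file is only touched when the buffer runs empty.
try (InputStream in = new BufferedInputStream(new FileInputStream("data.bin"), 64 * 1024)) {
    int b;
    while ((b = in.read()) != -1) {
        // consume b
    }
}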

Are there any performance benefits to leaving BufferedReader stream open?

Before I ask my question, I am fully aware that leaving an input stream open can cause a memory leak, and therefore doing so is bad practice.
Consider the following preconditions:
Only a single file is needed to be read
The file in question is a text file which contains rows of data
This file is quite large: 50MB or more
The file is read many, many times during a test run
The reason I am asking is that in my test automation suite, the same file is required to be read over and over again to validate certain data fields.
In its current state, the data reader function opens a BufferedReader stream, reads/returns data, and then closes the stream.
However, due to the file size and the number of times the file is read, I don't know if leaving the stream open would be beneficial. If I'm being honest, I don't know if the file size affects the opening of the stream at all.
So in summary, given the above listed preconditions, will leaving open a BufferedReader input stream improve overall performance? And is a memory leak still possible?
If you have enough memory to do this, then you will probably get best performance by reading the entire file into a StringBuilder, turning it into a String, and then repeatedly reading from the String via a StringReader.
However, you may need 6 or more times as many bytes of (free) heap space as the size of the file.
2 x to allow for byte -> char expansion
3 x because of the way that a StringBuilder buffer expands as it grows.
You can save space by holding the file in memory as bytes (not chars), and by reading into a byte[] of exactly the right size. But then you need to repeat the bytes -> chars decoding each time you read from the byte[].
You should benchmark the alternatives if you need ultimate performance.
And look at using Buffer to reduce copying.
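A minimal sketch of the "read once, re-read from memory" approach (path and charset are assumptions):

// Read the file once; a 50MB text file fits comfortably in a default heap.
String content = new String(Files.readAllBytes(Path.of("data.txt")), StandardCharsets.UTF_8);
// Every subsequent "read" of the file works on the in-memory copy.
try (BufferedReader r = new BufferedReader(new StringReader(content))) {
    r.lines().forEach(line -> { /* validate fields */ });
}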
Re your idea. Keeping the BufferedReader open and using mark and reset would give you a small speedup compared with closing and reopening. But the larger your file is, the smaller the speedup is in relative terms. For a 50GB file, I suspect that the speedup would be insignificant.
Yes, not closing a stream could improve performance in theory as the object will not trigger garbage collection
assuming you're not de-referencing the BufferedReader. Also, the underlying resources won't need to be sync'd. See this similar answer: Performance hit opening and closing filehandler?
However, not closing your BufferedReader will result in a memory leak and you'll see the heap increase.
I suggest, as others have in comments and answers, to just read the file into memory and use that. A 50MB file isn't that much, and the performance of reading from a String once it's in memory will be much higher than re-reading a file.

Convert `BufferedReader` to `Stream<String>` in a parallel way

Is there a way to obtain a Stream<String> stream out of a BufferedReader reader such that each string in stream represents one line of reader, with the additional condition that stream is provided directly (before reader has read everything)? I want to process the data of stream in parallel with getting it from reader, to save time.
Edit: I want to process the data parallel to reading. I don't want to process different lines parallel. They should be processed in order.
Let's make an example on how I want to save time. Let's say our reader will present 100 lines to us. It takes 2 ms to read one line and 1 ms to process it. If I first read all the lines and then process them, it will take me 300 ms. What I want to do is: As soon as a line is read I want to process it and parallel read the next line. The total time will then be 201 ms.
What I don't like about BufferedReader.lines(): As far as I understood reading starts when I want to process the lines. Let's assume I have already my reader but have to do precomputations before being able to process the first line. Let's say they cost 30 ms. In the above example the total time would then be 231 ms or 301 ms using reader.lines() (can you tell me which of those times is correct?). But it would be possible to get the job done in 201 ms, since the precomputations can be done parallel to reading the first 15 lines.
You can use reader.lines().parallel(). This way your input will be split into chunks and further stream operations will be performed on the chunks in parallel. If the further operations take significant time, then you might get a performance improvement.
In your case default heuristic will not work as you want and I guess there's no ready solution which will allow you to use single line batches. You can write a custom spliterator which will split after each line. Look into java.util.Spliterators.AbstractSpliterator implementation. Probably the easiest solution would be to write something similar, but limit batch sizes to one element in trySplit and read single line in tryAdvance method.
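A hedged sketch of that idea; the class name LineSpliterator is made up for illustration and is not a ready-made JDK class:

// Reads one line per tryAdvance and splits off single-line batches in trySplit,
// so a parallel stream can hand each line to the pool as soon as it is read.
class LineSpliterator extends Spliterators.AbstractSpliterator<String> {
    private final BufferedReader reader;

    LineSpliterator(BufferedReader reader) {
        super(Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL);
        this.reader = reader;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        String line = readLine();
        if (line == null) return false;
        action.accept(line);
        return true;
    }

    @Override
    public Spliterator<String> trySplit() {
        String line = readLine();
        return line == null ? null : Collections.singletonList(line).spliterator();
    }

    private String readLine() {
        try {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Usage: Stream<String> lines = StreamSupport.stream(new LineSpliterator(reader), true);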
To do what you want, you would typically have one thread that reads lines and add them to a blocking queue, and a second thread that would get lines from this blocking queue and process them.
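A minimal sketch of that producer/consumer setup (file name, queue capacity, and the end-of-input sentinel are assumptions; exception handling is elided):

BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
String eof = new String("EOF");                    // sentinel, compared by reference below
Thread producer = new Thread(() -> {
    try (BufferedReader br = Files.newBufferedReader(Path.of("data.txt"))) {
        String line;
        while ((line = br.readLine()) != null) {
            queue.put(line);                       // blocks if the consumer falls behind
        }
        queue.put(eof);
    } catch (IOException | InterruptedException e) {
        throw new RuntimeException(e);
    }
});
producer.start();
// Consumer: processes lines in order while the producer keeps reading.
for (String line = queue.take(); line != eof; line = queue.take()) {
    // process line
}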
You are looking at the wrong place. You are thinking that a stream of lines will read lines from the file but that’s not how it works. You can’t tell the underlying system to read a line as no-one knows what a line is before reading.
A BufferedReader has its name because of its character buffer. This buffer has a default capacity of 8192. Whenever a new line is requested, the buffer is parsed for a newline sequence and that part is returned. When the buffer does not hold enough characters to find a complete line, the entire buffer is filled.
Now, filling the buffer may lead to requests to read bytes from the underlying InputStream to fill the buffer of the character decoder. How many bytes will be requested and how many bytes will actually be read depends on the buffer size of the character decoder, on how many bytes of the actual encoding map to one character, and on whether the underlying InputStream has its own buffer and how big it is.
The actual expensive operation is the reading of bytes from the underlying stream and there is no trivial mapping from line read requests to these read operations. Requesting the first line may cause reading, let’s say one 16 KiB chunk from the underlying file, and the subsequent one hundred requests might be served from the filled buffer and cause no I/O at all. And nothing you do with the Stream API can change anything about that. The only thing you would parallelize is the search for new line characters in the buffer which is too trivial to benefit from parallel execution.
You could reduce the buffer sizes of all involved parties to roughly get your intended parallel reading of one line while processing the previous line, however, that parallel execution will never compensate the performance degradation caused by the small buffer sizes.

mark and readAheadLimit

I have a case where I need to peek ahead in the stream for the existence of a certain regular expression and then read data from the stream.
mark and reset allow me to do this but I am facing an issue where mark becomes invalid if the readAheadLimit goes beyond the size of the current buffer.
For example: I have a BufferedReader with buffer size of 1k.
Let's say I am at position 1000 (mark=1000) in the buffer and I need to check for the regex in the next 100 chars (readAheadLimit=100).
So while reading, the moment I cross the current buffer size (1024), a new buffer is allocated and the mark becomes invalid (not able to reset) and the data is streamed into the new buffer in a normal way.
I think this is the intended behavior but is there a way to get around this?
the moment I cross the current buffer size (1024), a new buffer is allocated
No it isn't. The existing buffer is cleared and readied for another use.
and the mark becomes invalid (not able to reset)
No it doesn't, unless you've gone beyond the read ahead limit.
You don't seem to have read the API. You call mark() with an argument that says how far ahead you want to read before calling reset(), in this case 100 characters, and the implementation is required to allow you to do exactly that. So when you get up to 100 characters ahead, call reset(), and you are back where you were when you called mark(). How that happens internally isn't your problem, but it is certainly required to happen.
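To make that contract concrete, a minimal sketch (file name, limit, and pattern are assumptions):

BufferedReader r = new BufferedReader(new FileReader("data.txt"));
r.mark(100);                                  // we will read at most 100 chars before reset()
char[] peek = new char[100];
int n = r.read(peek, 0, 100);                 // peek ahead (may return fewer chars)
boolean found = n > 0 && new String(peek, 0, n).matches("(?s).*pattern.*");
r.reset();                                    // guaranteed to succeed within the limit
// continue reading from the marked position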
And how did you get a BufferedReader with a 1k buffer? The default is 8192.
There are at least two options:
Set the buffer size to something much larger than 1k:
new BufferedReader(originalReader, 1024 * 1024) // e.g. 1Mb
Apply custom buffering that increases the buffer size as soon as the limit is exceeded. If you are working with a huge amount of data, such custom buffering could even store the data in a database or a file.

Would FileChannel.read read fewer bytes than specified if there's enough data?

For example I have a file whose content is:
abcdefg
Then I use the following code to read 'defg':
ByteBuffer bb = ByteBuffer.allocate(4);
int read = channel.read(bb, 3);
assert(read == 4);
Since there's adequate data in the file, can I suppose so? Can I assume that the method returns a number less than the limit of the given buffer only when there aren't enough bytes in the file?
Can I assume that the method returns a number less than limit of the given buffer only when there aren't enough bytes in the file?
The Javadoc says:
a read might not fill the buffer
and gives some examples, and
returns the number of bytes read, possibly zero, or -1 if the channel has reached end-of-stream.
This is NOT sufficient to allow you to make that assumption.
In practice, you are likely to always get a full buffer when reading from a file, modulo the end of file scenario. And that makes sense from an OS implementation perspective, given the overheads of making a system call.
But I can also imagine situations where returning a half-empty buffer might make sense. For example, when reading from a locally-mounted remote file system over a slow network link, there is some advantage in returning a partially filled buffer so that the application can start processing the data. Some future OS may implement the read system call to do that in this scenario. If you assume that you will always get a full buffer, you may get a surprise when your application is run on the (hypothetical) new platform.
Another issue is that there are some kinds of stream where you will definitely get partially filled buffers. Socket streams, pipes and console streams are obvious examples. If you code your application assuming file stream behavior, you could get a nasty surprise when someone runs it against another kind of stream ... and fails.
No, in general you cannot assume that the number of bytes read will be equal to the number of bytes requested, even if there are bytes left to be read in the file.
If you are reading from a local file, chances are that the number of bytes requested will actually be read, but this is by no means guaranteed (and won't likely be the case if you're reading a file over the network).
See the documentation for the ReadableByteChannel.read(ByteBuffer) method (which applies for FileChannel.read(ByteBuffer) as well). Assuming that the channel is in blocking mode, the only guarantee is that at least one byte will be read.
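In practice, the portable pattern is to loop until the buffer is full or end-of-stream is reached; a hedged sketch:

// Keeps issuing positional reads until the buffer has no space left,
// instead of assuming that a single read() fills it.
static void readFully(FileChannel channel, ByteBuffer buffer, long position) throws IOException {
    while (buffer.hasRemaining()) {
        int n = channel.read(buffer, position);
        if (n < 0) {
            throw new EOFException("end of file reached before the buffer was filled");
        }
        position += n;
    }
}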
