Convert `BufferedReader` to `Stream<String>` in a parallel way - java

Is there a way to obtain a Stream<String> stream from a BufferedReader reader such that each string in stream represents one line of reader, with the additional condition that stream is provided directly (before reader has read everything)? I want to process the data of stream in parallel with getting it from reader, to save time.
Edit: I want to process the data in parallel with reading it. I don't want to process different lines in parallel; they should be processed in order.
Let's make an example of how I want to save time. Say our reader will present 100 lines to us, it takes 2 ms to read one line, and 1 ms to process it. If I first read all the lines and then process them, it will take me 300 ms. What I want to do instead is: as soon as a line is read, I want to process it and read the next line in parallel. The total time will then be 201 ms.
What I don't like about BufferedReader.lines(): as far as I understand, reading only starts when I want to process the lines. Let's assume I already have my reader but have to do precomputations before being able to process the first line, say costing 30 ms. In the above example the total time would then be 231 ms or 301 ms using reader.lines() (can you tell me which of those times is correct?). But it would be possible to get the job done in 201 ms, since the precomputations can be done in parallel with reading the first 15 lines.

You can use reader.lines().parallel(). This way your input will be split into chunks, and further stream operations will be performed on those chunks in parallel. If the further operations take significant time, you might get a performance improvement.
In your case the default splitting heuristic will not work as you want, and I guess there's no ready-made solution which would allow you to use single-line batches. You can write a custom spliterator which splits after each line. Look into the java.util.Spliterators.AbstractSpliterator implementation. Probably the easiest solution would be to write something similar, but limit batch sizes to one element in trySplit and read a single line in the tryAdvance method.
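A minimal sketch of such a spliterator, along the lines described above (the class and demo names are illustrative, not from any library):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.StreamSupport;

public class LineSpliteratorDemo {

    // Emits one line per tryAdvance; trySplit hands out single-line batches.
    static class LineSpliterator extends Spliterators.AbstractSpliterator<String> {
        private final BufferedReader reader;

        LineSpliterator(BufferedReader reader) {
            super(Long.MAX_VALUE, ORDERED | NONNULL);
            this.reader = reader;
        }

        @Override
        public boolean tryAdvance(Consumer<? super String> action) {
            try {
                String line = reader.readLine();
                if (line == null) return false;   // end of input
                action.accept(line);
                return true;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @Override
        public Spliterator<String> trySplit() {
            String[] one = new String[1];
            if (tryAdvance(line -> one[0] = line)) {
                // Split off a one-element prefix so each parallel task gets a single line
                return Spliterators.spliterator(one, ORDERED | NONNULL);
            }
            return null;   // nothing left to split
        }
    }

    public static void main(String[] args) {
        BufferedReader br = new BufferedReader(new StringReader("a\nb\nc"));
        long count = StreamSupport.stream(new LineSpliterator(br), true).count();
        System.out.println(count);   // prints 3
    }
}
```

Note that this only controls how lines are handed to parallel tasks; as discussed in the answers below, it does not change how the underlying reader performs I/O.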

To do what you want, you would typically have one thread that reads lines and adds them to a blocking queue, and a second thread that takes lines from this queue and processes them.
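A minimal sketch of that producer/consumer setup (the class name, the sentinel value, and the toUpperCase stand-in for real per-line work are all illustrative):

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedReader {
    private static final String EOF = "\u0000EOF";   // sentinel marking end of input

    // Reads on a background thread; processes lines in order on the calling thread.
    static List<String> process(BufferedReader reader) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        // Producer: reads lines and puts them on the queue
        Thread producer = new Thread(() -> {
            try {
                String line;
                while ((line = reader.readLine()) != null) queue.put(line);
                queue.put(EOF);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        producer.start();

        // Consumer (this thread): processes lines in order while the producer keeps reading
        List<String> processed = new ArrayList<>();
        String line;
        while (!(line = queue.take()).equals(EOF)) {
            processed.add(line.toUpperCase());   // stand-in for the real per-line work
        }
        producer.join();
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        BufferedReader br = new BufferedReader(new StringReader("one\ntwo\nthree"));
        System.out.println(process(br));   // prints [ONE, TWO, THREE]
    }
}
```

The bounded queue also gives you backpressure: if processing falls behind, the reading thread blocks on put rather than buffering unboundedly.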

You are looking at the wrong place. You are thinking that a stream of lines will read lines from the file but that’s not how it works. You can’t tell the underlying system to read a line as no-one knows what a line is before reading.
A BufferedReader gets its name from its character buffer. This buffer has a default capacity of 8192. Whenever a new line is requested, the buffer is scanned for a newline sequence and that part is returned. When the buffer does not hold enough characters to find a complete line, the entire buffer is refilled.
Now, filling the buffer may lead to requests to read bytes from the underlying InputStream to fill the buffer of the character decoder. How many bytes will be requested and how many bytes will actually be read depends on the buffer size of the character decoder, on how many bytes of the actual encoding map to one character, and on whether the underlying InputStream has its own buffer and how big it is.
The actual expensive operation is the reading of bytes from the underlying stream and there is no trivial mapping from line read requests to these read operations. Requesting the first line may cause reading, let’s say one 16 KiB chunk from the underlying file, and the subsequent one hundred requests might be served from the filled buffer and cause no I/O at all. And nothing you do with the Stream API can change anything about that. The only thing you would parallelize is the search for new line characters in the buffer which is too trivial to benefit from parallel execution.
You could reduce the buffer sizes of all involved parties to roughly get your intended parallel reading of one line while processing the previous line; however, that parallel execution will never compensate for the performance degradation caused by the small buffer sizes.

Does Stream.skip load everything into memory?

I have Stream<String> s = bufferedReader.lines();
The bufferedReader returns a lazy stream.
When using s.skip(100).limit(100), does it load all of s into memory, or is the stream evaluated only up to the skip and limit values?
What would be the memory footprint of using the skip/limit functions?
In order to skip 100 lines, it’s unavoidable to read the data into memory, to find the locations of the line terminators and hence the character position to skip to.
Still, the memory footprint does not depend on the number of the skipped lines, as the size of the BufferedReader’s internal buffer will be fixed at the BufferedReader’s construction, see BufferedReader(Reader) and BufferedReader(Reader, int).
As a corner case, if a line doesn’t fit into the buffer (note that the default buffer size is 8192), a temporary StringBuffer will be used for that line.
Generally, there is no dedicated skip operation in the backends which the Stream API could exploit, which you can verify by looking at the Spliterator interface, so skipping lines is not different to repeatedly invoking readLine() on the BufferedReader, but dropping the resulting strings immediately.
As explained at the beginning, reading the data is unavoidable, so the cost of creating temporary string instances from the read data is negligible compared to the reading itself.
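As a sketch of what this means in practice: limit short-circuits, so lines after the selected window are never pulled from the reader at all, and only the 100 selected lines are materialized as strings (all names here are illustrative):

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SkipLimitDemo {
    public static void main(String[] args) {
        // 300 numbered lines standing in for a file
        String data = IntStream.rangeClosed(1, 300)
                .mapToObj(i -> "line" + i)
                .collect(Collectors.joining("\n"));
        BufferedReader br = new BufferedReader(new StringReader(data));

        // Lines 1..100 are read and immediately discarded; lines 101..200 are
        // collected; lines 201..300 are never requested from the reader.
        List<String> window = br.lines().skip(100).limit(100)
                .collect(Collectors.toList());

        System.out.println(window.get(0));   // prints line101
        System.out.println(window.size());   // prints 100
    }
}
```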

Where is the data queued with a BufferedReader?

I am reading a large csv from a web service Like this:
br = new BufferedReader(new InputStreamReader(website.openStream(), "UTF-16"));
I read line by line and write into a database. Writing to the database is the bottleneck of this operation, and I am wondering if it is possible that I will "time out" with the web service, i.e. hit the condition where the web service just cuts the connection because I am not reading anything from it...
Or does the BufferedReader just buffer the stream into memory until I read from it?
Yes, it is possible that the web service stream will time out while you are writing to the db. If the db is really slow enough for this to happen, then you may need to copy the file locally before pushing it into the db.
+1 for Brian's answer.
Furthermore, I would recommend you have a look at my csv-db-tools on GitHub. The csv-db-importer module illustrates how to import large CSV files into the database. The code is optimized to insert one row at a time and keep the memory free from data buffered from large CSV files.
BufferedReader will, as you have speculated, read the contents of the stream into memory. Any calls to read or readLine will read data from the buffer, not from the original stream, assuming the data is already available in the buffer. The advantage here is that data is read in larger batches, rather than requested from the stream at each invocation of read or readLine.
You will likely only experience a timeout like you describe if you are reading large amounts of data. I had some trouble finding a credible reference, but the default buffer size of BufferedReader is 8192 characters. This means that if your stream delivers more data than the buffer can hold, the buffer could fill and cause your process to wait on the DB bottleneck before reading more data from the stream.
If you think you need to reserve a larger buffer than this, the BufferedReader constructor is overloaded with a second parameter allowing you to specify the size of the buffer in characters. Keep in mind, though, that unless the buffer can hold the entire stream, you could run into the same problem even with a larger buffer.
br = new BufferedReader(new InputStreamReader(website.openStream(), "UTF-16"), size);
will initialize your BufferedReader with a buffer of size characters.
EDIT:
After reading #Keith's comment, I think he's got the right of it here. If you experience timeouts the smaller buffer will cause you to read from the socket more frequently, hopefully eliminating that issue. If he posts an answer with that you should accept his.
BufferedReader just reads chunks into an internal buffer, whose default size is unspecified by the documentation but has been 8192 chars for many years. It doesn't do anything while you're not calling it.
But I don't think your perceived problem even exists. I don't see how the web service will even know. Write timeouts in TCP are quite difficult to implement. Some platforms have APIs for that, but they aren't supported by Java.
Most likely the web service is just using a blocking mode socket and it will just block in its write if you aren't reading fast enough.

Java: How to improve reading of a 50 GB file

I am reading a 50G file containing millions of rows separated by newline character. Presently I am using following syntax to read the file
String line = null;
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("FileName")));
while ((line = br.readLine()) != null)
{
// Processing each line here
// All processing is done in memory. No IO required here.
}
Since the file is too big, it is taking 2 hrs to process the whole file. Can I improve the reading of the file from the hard disk so that the I/O (reading) operation takes minimal time? The restriction on my code is that I have to process each line in sequential order.
it is taking 2 Hrs to process the whole file.
50 GB / 2 hours equals approximately 7 MB/s. That's not a bad rate at all. A good (modern) hard disk should be capable of sustaining a higher rate continuously, so maybe your bottleneck is not the I/O? You're already using BufferedReader, which, like the name says, buffers (in memory) what it reads. You could experiment with creating the reader with a somewhat bigger buffer than the default size (8192 characters), like so:
BufferedReader br = new BufferedReader(
new InputStreamReader(new FileInputStream("FileName")), 100000);
Note that with the default 8192-character buffer and 7 MB/s throughput the BufferedReader is going to re-fill its buffer almost 1000 times per second, so lowering that number (by enlarging the buffer) could really help cut down some overhead. But if the processing that you're doing, instead of the I/O, is the bottleneck, then no I/O trick is going to help you much. You should maybe consider making it multi-threaded, but whether it's doable, and how, depends on what "processing" means here.
Your only hope is to parallelize the reading and processing of what's inside. Your strategy should be to never require the entire file contents to be in memory at once.
Start by profiling the code you have to see where the time is being spent. Rewrite the part that takes the most time and re-profile to see if it improved. Keep repeating until you get an acceptable result.
I'd think about Hadoop and a distributed solution. Data sets that are larger than yours are processed routinely now. You might need to be a bit more creative in your thinking.
Without NIO you won't be able to break the throughput barrier. For example, try using new Scanner(File) instead of directly creating readers. Recently I took a look at that source code, it uses NIO's file channels.
But the first thing I would suggest is to run an empty loop with BufferedReader that does nothing but reading. Note the throughput -- and also keep an eye on the CPU. If the loop floors the CPU, then there's definitely an issue with the IO code.
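Such an empty read loop might look like this ("FileName" is the placeholder path from the question; the drain helper is illustrative):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class ReadOnlyBenchmark {

    // Reads every line and discards it; returns the line count.
    static long drain(Reader in) throws IOException {
        long lines = 0;
        try (BufferedReader br = new BufferedReader(in)) {
            while (br.readLine() != null) {
                lines++;   // do nothing with the line
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long lines = drain(new FileReader("FileName"));
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d lines in %.1f s%n", lines, seconds);
    }
}
```

If this loop alone takes close to your 2 hours, the bottleneck is I/O; if it finishes much faster, your per-line processing is what needs optimizing.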
Disable the antivirus and any other program which adds to disk contention while reading the file.
Defragment the disk.
Create a raw disk partition and read the file from there.
Read the file from an SSD.
Create a 50GB Ramdisk and read the file from there.
I think you may get the best results by re-considering the problem you're trying to solve. There's clearly a reason you're loading this 50Gig file. Consider if there isn't a better way to break the stored data down and only use the data you really need.
The way you read the file is fine. There might be ways to make it faster, but that usually requires understanding where your bottleneck is. Because the I/O throughput is actually on the lower end, I assume the computation is having a performance side effect. If it's not too lengthy, you could show your whole program.
Alternatively, you could run your program without the contents of the loop and see how long it takes to read through the file :)

Would FileChannel.read read fewer bytes than specified if there's enough data?

For example I have a file whose content is:
abcdefg
then I use the following code to read 'defg'.
ByteBuffer bb = ByteBuffer.allocate(4);
int read = channel.read(bb, 3);
assert(read == 4);
Since there is adequate data in the file, can I assume so? Can I assume that the method returns a number less than the limit of the given buffer only when there aren't enough bytes in the file?
Can I assume that the method returns a number less than the limit of the given buffer only when there aren't enough bytes in the file?
The Javadoc says:
a read might not fill the buffer
and gives some examples, and
returns the number of bytes read, possibly zero, or -1 if the channel has reached end-of-stream.
This is NOT sufficient to allow you to make that assumption.
In practice, you are likely to always get a full buffer when reading from a file, modulo the end of file scenario. And that makes sense from an OS implementation perspective, given the overheads of making a system call.
But, I can also imagine situations where returning a half-empty buffer might make sense. For example, when reading from a locally-mounted remote file system over a slow network link, there is some advantage in returning a partially filled buffer so that the application can start processing the data. Some future OS may implement the read system call to do that in this scenario. If you assume that you will always get a full buffer, you may get a surprise when your application is run on the (hypothetical) new platform.
Another issue is that there are some kinds of stream where you will definitely get partially filled buffers. Socket streams, pipes and console streams are obvious examples. If you code your application assuming file stream behavior, you could get a nasty surprise when someone runs it against another kind of stream ... and fails.
No, in general you cannot assume that the number of bytes read will be equal to the number of bytes requested, even if there are bytes left to be read in the file.
If you are reading from a local file, chances are that the number of bytes requested will actually be read, but this is by no means guaranteed (and won't likely be the case if you're reading a file over the network).
See the documentation for the ReadableByteChannel.read(ByteBuffer) method (which applies for FileChannel.read(ByteBuffer) as well). Assuming that the channel is in blocking mode, the only guarantee is that at least one byte will be read.
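The standard way to deal with this is a "read fully" loop that keeps calling read until the buffer is full or end-of-stream is reached; a sketch (class and method names are illustrative):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ReadFully {

    // Keeps calling read until the buffer is full or end-of-stream is reached.
    static int readFully(FileChannel ch, ByteBuffer bb, long pos) throws IOException {
        int total = 0;
        while (bb.hasRemaining()) {
            int n = ch.read(bb, pos + total);   // absolute read; doesn't move the channel position
            if (n < 0) break;                   // end of stream before the buffer was full
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Demo on a temp file containing the question's example content
        Path p = Files.createTempFile("demo", ".txt");
        Files.write(p, "abcdefg".getBytes(StandardCharsets.UTF_8));
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            ByteBuffer bb = ByteBuffer.allocate(4);
            int read = readFully(ch, bb, 3);
            System.out.println(read + " " + new String(bb.array(), StandardCharsets.UTF_8));
            // prints: 4 defg
        } finally {
            Files.delete(p);
        }
    }
}
```

Written this way, the code works the same for local files, network file systems, and any other blocking channel.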
