Java How to improve reading of 50 Gigabit file

Java How to improve reading of 50 Gigabit file - java

I am reading a 50G file containing millions of rows separated by newline character. Presently I am using following syntax to read the file
String line = null;
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("FileName")));
while ((line = br.readLine()) != null)
{
// Processing each line here
// All processing is done in memory. No IO required here.
}
Since the file is too big, it is taking 2 Hrs to process the whole file. Can I improve the reading of file from the harddisk so that the IO(Reading) operation takes minimal time. The restriction with my code is that I have to process each line sequential order.

it is taking 2 Hrs to process the whole file.
50 GB / 2 hours equals approximately 7 MB/s. It's not a bad rate at all. A good (modern) hard disk should be capable of sustaining higher rate continuously, so maybe your bottleneck is not the I/O? You're already using BufferedReader, which, like the name says, is buffering (in memory) what it reads. You could experiment creating the reader with a bit bigger buffer than the default size (8192 bytes), like so:
BufferedReader br = new BufferedReader(
new InputStreamReader(new FileInputStream("FileName")), 100000);
Note that with the default 8192 bytes buffer and 7 MB/s throughput the BufferedReader is going to re-fill its buffer almost 1000 times per second, so lowering that number could really help cutting down some overhead. But if the processing that you're doing, instead of the I/O, is the bottleneck, then no I/O trick is going to help you much. You should maybe consider making it multi-threaded, but whether it's doable, and how, depends on what "processing" means here.

Your only hope is to parallelize the reading and processing of what's inside. Your strategy should be to never require the entire file contents to be in memory at once.
Start by profiling the code you have to see where the time is being spent. Rewrite the part that takes the most time and re-profile to see if it improved. Keep repeating until you get an acceptable result.
I'd think about Hadoop and a distributed solution. Data sets that are larger than yours are processed routinely now. You might need to be a bit more creative in your thinking.

Without NIO you won't be able to break the throughput barrier. For example, try using new Scanner(File) instead of directly creating readers. Recently I took a look at that source code, it uses NIO's file channels.
But the first thing I would suggest is to run an empty loop with BufferedReader that does nothing but reading. Note the throughput -- and also keep an eye on the CPU. If the loop floors the CPU, then there's definitely an issue with the IO code.

Disable the antivirus and any other program which adds to disk contention while reading the file.
Defragment the disk.
Create a raw disk partition and read the file from there.
Read the file from an SSD.
Create a 50GB Ramdisk and read the file from there.

I think you may get the best results by re-considering the problem you're trying to solve. There's clearly a reason you're loading this 50Gig file. Consider if there isn't a better way to break the stored data down and only use the data you really need.

The way you read the file is fine. There might be ways to get it faster, but it usually requires understanding where your bottleneck is. Because the IO throughput is actually on the lower end, I assume the computation is having a performance side effect. If its not too lengthy you could show you whole program.
Alternatively, you could run your program without the contents of the loop and see how long it takes to read through the file :)

Related

Buffered-Input Stream [duplicate]

Let me preface this post with a single caution. I am a total beginner when it comes to Java. I have been programming PHP on and off for a while, but I was ready to make a desktop application, so I decided to go with Java for various reasons.
The application I am working on is in the beginning stages (less than 5 classes) and I need to read bytes from a local file. Typically, the files are currently less than 512kB (but may get larger in the future). Currently, I am using a FileInputStream to read the file into three byte arrays, which perfectly satisfies my requirements. However, I have seen a BufferedInputStream mentioned, and was wondering if the way I am currently doing this is best, or if I should use a BufferedInputStream as well.
I have done some research and have read a few questions here on Stack Overflow, but I am still having troubles understanding the best situation for when to use and not use the BufferedInputStream. In my situation, the first array I read bytes into is only a few bytes (less than 20). If the data I receive is good in these bytes, then I read the rest of the file into two more byte arrays of varying size.
I have also heard many people mention profiling to see which is more efficient in each specific case, however, I have no profiling experience and I'm not really sure where to start. I would love some suggestions on this as well.
I'm sorry for such a long post, but I really want to learn and understand the best way to do these things. I always have a bad habit of second guessing my decisions, so I would love some feedback. Thanks!

If you are consistently doing small reads then a BufferedInputStream will give you significantly better performance. Each read request on an unbuffered stream typically results in a system call to the operating system to read the requested number of bytes. The overhead of doing a system call is may be thousands of machine instructions per syscall. A buffered stream reduces this by doing one large read for (say) up to 8k bytes into an internal buffer, and then handing out bytes from that buffer. This can drastically reduce the number of system calls.
However, if you are consistently doing large reads (e.g. 8k or more) then a BufferedInputStream slows things a bit. You typically don't reduce the number of syscalls, and the buffering introduces an extra data copying step.
In your use-case (where you read a 20 byte chunk first then lots of large chunks) I'd say that using a BufferedInputStream is more likely to reduce performance than increase it. But ultimately, it depends on the actual read patterns.

If you are using a relatively large arrays to read the data a chunk at a time, then BufferedInputStream will just introduce a wasteful copy. (Remember, read does not necessarily read all of the array - you might want DataInputStream.readFully). Where BufferedInputStream wins is when making lots of small reads.

BufferedInputStream reads more of the file that you need in advance. As I understand it, it's doing more work in advance, like, 1 big continous disk read vs doing many in a tight loop.
As far as profiling - I like the profiler that's built into netbeans. It's really easy to get started with. :-)

I can't speak to the profiling, but from my experience developing Java applications I find that using any of the buffer classes - BufferedInputStream, StringBuffer - my applications are exceptionally faster. Because of which, I use them even for the smallest files or string operation.

import java.io.*;
class BufferedInputStream
{
public static void main(String arg[])throws IOException
{
FileInputStream fin=new FileInputStream("abc.txt");
BufferedInputStream bis=new BufferedInputStream(fin);
int size=bis.available();
while(true)
{
int x=bis.read(fin);
if(x==-1)
{
bis.mark(size);
System.out.println((char)x);
}
}
bis.reset();
while(true)
{
int x=bis.read();
if(x==-1)
{
break;
System.out.println((char)x);
}
}
}
}

Are there any performance benefits to leaving BufferedReader stream open?

Before I ask my question, I am fully aware that leaving an input stream open can cause a memory leak, and therefore doing so is bad practice.
Consider the following preconditions:
Only a single file is needed to be read
The file in question is a text file which contains rows of data
This file is quite large: 50MB or more
The file is read many, many times during a test run
The reason I am asking is that in my test automation suite, the same file is required to be called over and over again to validate certain data fields.
In its current state, the data reader function opens a BufferedReader stream, reads/returns data, and then closes stream.
However, due to the file size and the number of times the file is read, I don't know if leaving the stream open would be beneficial. If I'm being honest, I don't know if the file size affects the opening of the stream at all.
So in summary, given the above listed preconditions, will leaving open a BufferedReader input stream improve overall performance? And is a memory leak still possible?

If you have enough memory to do this, then you will probably get best performance by reading the entire file into a StringBuilder, turning it into a String, and then repeatedly reading from the String via a StringReader.
However, you may need 6 or more times as many bytes of (free) heap space as the size of the file.
2 x to allow for byte -> char expansion
3 x because of the way that a StringBuilder buffer expands as it grows.
You can save space by holding the file in memory as as bytes (not chars), and by reading into a byte[] of exactly the right size. But then you need to repeat the bytes -> chars decoding each time you read from the byte[].
You should benchmark the alternatives if you need ultimate performance.
And look at using Buffer to reduce copying.
Re your idea. Keeping the BufferedReader open and using mark and reset would give you a small speedup compared with closing and reopening. But the larger your file is, the smaller the speedup is in relative terms. For a 50GB file, I suspect that the speedup would be insignificant.

Yes, not closing a stream could improve performance in theory as the object will not trigger garbage collection
assuming you're not de-referencing the BufferedReader. Also, the undelying resources won't need to be sync'd. See similar answer: Performance hit opening and closing filehandler?
However, not closing you BufferedReader will result in memory leak and you'll see heap increase.
I suggest as other's have in comments and answers to just read the file into a memory and use that. A 50MB file that isn't that much, plus the performance reading from a String once in memory will be much higher than re-reading a file.

Where is the data queued with a BufferedReader?

I am reading a large csv from a web service Like this:
br = new BufferedReader(new InputStreamReader(website.openStream(), "UTF-16"));
I read line by line and write into a database. The writing into a database is the bottleneck of this operation and I am wondering if it is possible that I will "timeout" with the webservice so I get the condition where the webservice just cuts the connection because I am not reading anything from it...
Or does the BufferedReader just buffer the stream into memory until I read from it?

yes, it is possible that the webservice stream will timeout while you are writing to the db. If the db is really slow enough that this might timeout, then you may need to copy the file locally before pushing it into the db.

+1 for Brian's answer.
Furthermore, I would recommend you have a look at my csv-db-tools on GitHub. The csv-db-importer module illustrates how to import large CSV files into the database. The code is optimized to insert one row at a time and keep the memory free from data buffered from large CSV files.

BufferedReader will, as you have speculated, read the contents of the stream into memory. Any calls to read or readLine will read data from the buffer, not from the original stream, assuming the data is already available in the buffer. The advantage here is that data is read in larger batches, rather than requested from the stream at each invocation of read or readLine.
You will likely only experience a timeout like you describe if you are reading large amounts of data. I had some trouble finding a credible reference but I have seen several mentions of the default buffer size on BufferedReader being 8192 bytes (8kb). This means that if your stream is reading more than 8kb of data, the buffer could potentially fill and cause your process to wait on the DB bottleneck before reading more data from the stream.
If you think you need to reserve a larger buffer than this, the BufferedReader constructor is overloaded with a second parameter allowing you to specify the size of the buffer in bytes. Keep in mind, though, that unless you are moving small enough pieces of data to buffer the entire stream, you could run into the same problem even with a larger buffer.
br = new BufferedReader(new InputStreamReader(website.openStream(), "UTF-16"), size);
will initialize your BufferedReader with a buffer of size bytes.
EDIT:
After reading #Keith's comment, I think he's got the right of it here. If you experience timeouts the smaller buffer will cause you to read from the socket more frequently, hopefully eliminating that issue. If he posts an answer with that you should accept his.

BufferedReader just reads in chunks into an internal buffer, whose default size is unspecified but has been 4096 chars for many years. It doesn't do anything while you're not calling it.
But I don't think your perceived problem even exists. I don't see how the web service will even know. Write timeouts in TCP are quite difficult to implement. Some platforms have APIs for that, but they aren't supported by Java.
Most likely the web service is just using a blocking mode socket and it will just block in its write if you aren't reading fast enough.

What is the fastest way to write a large amount of data from memory to a file?

I have a program that generates a lot of data and puts it in a queue to write but the problem is its generating data faster than I'm currently writing(causing it to max memory and start to slow down). Order does not matter as I plan to parse the file later.
I looked around a bit and found a few questions that helped me design my current process(but I still find it slow). Here's my code so far:
//...background multi-threaded process keeps building the queue..
FileWriter writer = new FileWriter("foo.txt",true);
BufferedWriter bufferWritter = new BufferedWriter(writer);
while(!queue_of_stuff_to_write.isEmpty()) {
String data = solutions.poll().data;
bufferWritter.newLine();
bufferWritter.write(data);
}
bufferWritter.close();
I'm pretty new to programming so I maybe assessing this wrong(maybe a hardware issue as I'm using EC2), but is there a to very quickly dump the queue results into a file or if my approach is okay can I improve it somehow? As order does not matter, does it make more sense to write to multiple files on multiple drives? Will threading make it faster?,etc..I'm not exactly sure the best approach and any suggestions would be great. My goal is to save the results of the queue(sorry no outputting to /dev/null :-) and keep memory consumption as low as possible for my app(I'm not 100% sure but the queue fills up 15gig, so I'm assuming it'll be a 15gig+ file).
Fastest way to write huge data in text file Java (realized I should use buffered writer)
Concurrent file write in Java on Windows (made me see that maybe multi-threading writes wasn't a great idea)

Looking at that code, one thing that springs to mind is character encoding. You're writing strings, but ultimately, it's bytes that go to the streams. A writer character-to-byte encoding under the hood, and it's doing it in the same thread that is handling writing. That may mean that there is time being spent encoding that is delaying writes, which could reduce the rate at which data is written.
A simple change would be to use a queue of byte[] instead of String, do the encoding in the threads which push onto the queue, and have the IO code use a BufferedOutputStream rather than a BufferedWriter.
This may also reduce memory consumption, if the encoded text takes up less than two bytes per character on average. For latin text and UTF-8 encoding, this will usually be true.
However, i suspect it's likely that you're simply generating data faster than your IO subsystem can handle it. You will need to make your IO subsystem faster - either by using a faster one (if you're on EC2, perhaps renting a faster instance, or writing to a different backend - SQS vs EBS vs local disk, etc), or by ganging several IO subsystems together in parallel somehow.

Yes, writing multiple files on multiple drives should help, and if nothing else is writing to those drives at the same time, performance should scale linearly with the number of drives until I/O is no longer the bottleneck. You could also try a couple other optimizations to boost performance even more.
If you're generating huge files and the disk simply can't keep up, you can use a GZIPOutputStream to shrink the output--which, in turn, will reduce the amount of disk I/O. For non-random text, you can usually expect a compression ratio of at least 2x-10x.
//...background multi-threaded process keeps building the queue..
OutputStream out = new FileOutputStream("foo.txt",true);
OutputStreamWriter writer = new OutputStreamWriter(new GZIPOutputStream(out));
BufferedWriter bufferWriter = new BufferedWriter(writer);
while(!queue_of_stuff_to_write.isEmpty()) {
String data = solutions.poll().data;
bufferWriter.newLine();
bufferWriter.write(data);
}
bufferWriter.close();
If you're outputting regular (i.e., repetitive) data, you might also want to consider switching to a different output format--for example, a binary encoding of the data. Depending on the structure of your data, it might be more efficient to store it in a database. If you're outputting XML and really want to stick to XML, you should look into a Binary XML format, such as EXI or Fast InfoSet.

I guess as long as you produce your data out of calculations and do not load your data from another data source, writing will always be slower than generating your data.
You can try writing your data in multiple files (not in the same file -> due to synchronization problems) in multiple threads (but I guess that will not fix your problem).
Is it possible for you to wait for the writing part of your application to finish its operation and continue your calculations?
Another approach is:
Do you empty your queue? Does solutions.poll() reduce your solutions queue?

writing to different files using multiple threads is a good idea. Also, you should look into setting the BufferedWriters buffer size, which you can do from the constructor. Try initializing with a 10 Mb buffer and see if that helps

In Java, what is the difference between using a BufferedWriter or writing straight to file? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
In Java, what is the advantage of using BufferedWriter to append to a file?
The site that I am looking at says
"The BufferWriter class is used to write text to a character-output stream, buffering characters so as to provide for the efficient writing of single characters, arrays, and strings."
What make's it more efficient and why?

BufferedWriter is more efficient because it uses buffers rather than writing character by character. So it reduces I/O operations of the disk. Data is collected in a buffer and write to the file when the buffer is full.
This is why sometimes no data is written in the file if you didn't call flush method. That is data is collected in the buffer but program exits before writing them to the file. Calling flush method will cause the data to be written in the file even the buffer is not filled completely.

The cost of writing becomes expensive when you write character by character to the file. For reducing that cost, buffers are provided. If you are writing to Buffer, it waits for some limit and then writes the whole to the disk.

A BufferedWriter waits until the buffer (8192 bytes) is full and writes the whole buffer in one disk operation. Unbuffered each single write would result in a disk I/O which is obviously more expensive.

Hard disk hava a minimum unit of information storage so for example if you are writing a single byte the operating system asks for the disk to store a unit of storage (I think that the minimum is 512 bytes). So you ask for writing one byte and the operating system writes much more. If you ask to store 512 bytes with 512 calls you end up doing a lot more I/O (512 disk operations) that buffering 512 bytes and issuing only one call (1 disk operation).

As the name suggests, BufferedWriter uses a buffer to reduce the costs of writes. If you are writing to file, you might know that writing 1byte or writing 4kbytes roughly costs the same. The time required to perform such write is dominated by the access time (~8ms) which is the time required by the disk to rotate and to seek the right sector.
Additionally, aggregating small writes in a bigger one allows you to reduce the overhead on the operating system, achieving better performances.
Most of the operating systems do have an internal buffer to cache writes. However, these caches tries to figure out what the application is doing, by analyzing the write patterns. If the application itself is able to perform that caching, and perform a write only when the data is ready, the result (in terms of performance) is better.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.