Let me preface this post with a single caution. I am a total beginner when it comes to Java. I have been programming PHP on and off for a while, but I was ready to make a desktop application, so I decided to go with Java for various reasons.
The application I am working on is in the beginning stages (less than 5 classes) and I need to read bytes from a local file. Typically, the files are currently less than 512kB (but may get larger in the future). Currently, I am using a FileInputStream to read the file into three byte arrays, which perfectly satisfies my requirements. However, I have seen a BufferedInputStream mentioned, and was wondering if the way I am currently doing this is best, or if I should use a BufferedInputStream as well.
I have done some research and have read a few questions here on Stack Overflow, but I am still having troubles understanding the best situation for when to use and not use the BufferedInputStream. In my situation, the first array I read bytes into is only a few bytes (less than 20). If the data I receive is good in these bytes, then I read the rest of the file into two more byte arrays of varying size.
I have also heard many people mention profiling to see which is more efficient in each specific case, however, I have no profiling experience and I'm not really sure where to start. I would love some suggestions on this as well.
I'm sorry for such a long post, but I really want to learn and understand the best way to do these things. I always have a bad habit of second guessing my decisions, so I would love some feedback. Thanks!
If you are consistently doing small reads then a BufferedInputStream will give you significantly better performance. Each read request on an unbuffered stream typically results in a system call to the operating system to read the requested number of bytes. The overhead of doing a system call is may be thousands of machine instructions per syscall. A buffered stream reduces this by doing one large read for (say) up to 8k bytes into an internal buffer, and then handing out bytes from that buffer. This can drastically reduce the number of system calls.
However, if you are consistently doing large reads (e.g. 8k or more) then a BufferedInputStream slows things a bit. You typically don't reduce the number of syscalls, and the buffering introduces an extra data copying step.
In your use-case (where you read a 20 byte chunk first then lots of large chunks) I'd say that using a BufferedInputStream is more likely to reduce performance than increase it. But ultimately, it depends on the actual read patterns.
If you are using a relatively large arrays to read the data a chunk at a time, then BufferedInputStream will just introduce a wasteful copy. (Remember, read does not necessarily read all of the array - you might want DataInputStream.readFully). Where BufferedInputStream wins is when making lots of small reads.
BufferedInputStream reads more of the file that you need in advance. As I understand it, it's doing more work in advance, like, 1 big continous disk read vs doing many in a tight loop.
As far as profiling - I like the profiler that's built into netbeans. It's really easy to get started with. :-)
I can't speak to the profiling, but from my experience developing Java applications I find that using any of the buffer classes - BufferedInputStream, StringBuffer - my applications are exceptionally faster. Because of which, I use them even for the smallest files or string operation.
import java.io.*;
class BufferedInputStream
{
public static void main(String arg[])throws IOException
{
FileInputStream fin=new FileInputStream("abc.txt");
BufferedInputStream bis=new BufferedInputStream(fin);
int size=bis.available();
while(true)
{
int x=bis.read(fin);
if(x==-1)
{
bis.mark(size);
System.out.println((char)x);
}
}
bis.reset();
while(true)
{
int x=bis.read();
if(x==-1)
{
break;
System.out.println((char)x);
}
}
}
}
Related
Before I ask my question, I am fully aware that leaving an input stream open can cause a memory leak, and therefore doing so is bad practice.
Consider the following preconditions:
Only a single file is needed to be read
The file in question is a text file which contains rows of data
This file is quite large: 50MB or more
The file is read many, many times during a test run
The reason I am asking is that in my test automation suite, the same file is required to be called over and over again to validate certain data fields.
In its current state, the data reader function opens a BufferedReader stream, reads/returns data, and then closes stream.
However, due to the file size and the number of times the file is read, I don't know if leaving the stream open would be beneficial. If I'm being honest, I don't know if the file size affects the opening of the stream at all.
So in summary, given the above listed preconditions, will leaving open a BufferedReader input stream improve overall performance? And is a memory leak still possible?
If you have enough memory to do this, then you will probably get best performance by reading the entire file into a StringBuilder, turning it into a String, and then repeatedly reading from the String via a StringReader.
However, you may need 6 or more times as many bytes of (free) heap space as the size of the file.
2 x to allow for byte -> char expansion
3 x because of the way that a StringBuilder buffer expands as it grows.
You can save space by holding the file in memory as as bytes (not chars), and by reading into a byte[] of exactly the right size. But then you need to repeat the bytes -> chars decoding each time you read from the byte[].
You should benchmark the alternatives if you need ultimate performance.
And look at using Buffer to reduce copying.
Re your idea. Keeping the BufferedReader open and using mark and reset would give you a small speedup compared with closing and reopening. But the larger your file is, the smaller the speedup is in relative terms. For a 50GB file, I suspect that the speedup would be insignificant.
Yes, not closing a stream could improve performance in theory as the object will not trigger garbage collection
assuming you're not de-referencing the BufferedReader. Also, the undelying resources won't need to be sync'd. See similar answer: Performance hit opening and closing filehandler?
However, not closing you BufferedReader will result in memory leak and you'll see heap increase.
I suggest as other's have in comments and answers to just read the file into a memory and use that. A 50MB file that isn't that much, plus the performance reading from a String once in memory will be much higher than re-reading a file.
I am reading a 50G file containing millions of rows separated by newline character. Presently I am using following syntax to read the file
String line = null;
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("FileName")));
while ((line = br.readLine()) != null)
{
// Processing each line here
// All processing is done in memory. No IO required here.
}
Since the file is too big, it is taking 2 Hrs to process the whole file. Can I improve the reading of file from the harddisk so that the IO(Reading) operation takes minimal time. The restriction with my code is that I have to process each line sequential order.
it is taking 2 Hrs to process the whole file.
50 GB / 2 hours equals approximately 7 MB/s. It's not a bad rate at all. A good (modern) hard disk should be capable of sustaining higher rate continuously, so maybe your bottleneck is not the I/O? You're already using BufferedReader, which, like the name says, is buffering (in memory) what it reads. You could experiment creating the reader with a bit bigger buffer than the default size (8192 bytes), like so:
BufferedReader br = new BufferedReader(
new InputStreamReader(new FileInputStream("FileName")), 100000);
Note that with the default 8192 bytes buffer and 7 MB/s throughput the BufferedReader is going to re-fill its buffer almost 1000 times per second, so lowering that number could really help cutting down some overhead. But if the processing that you're doing, instead of the I/O, is the bottleneck, then no I/O trick is going to help you much. You should maybe consider making it multi-threaded, but whether it's doable, and how, depends on what "processing" means here.
Your only hope is to parallelize the reading and processing of what's inside. Your strategy should be to never require the entire file contents to be in memory at once.
Start by profiling the code you have to see where the time is being spent. Rewrite the part that takes the most time and re-profile to see if it improved. Keep repeating until you get an acceptable result.
I'd think about Hadoop and a distributed solution. Data sets that are larger than yours are processed routinely now. You might need to be a bit more creative in your thinking.
Without NIO you won't be able to break the throughput barrier. For example, try using new Scanner(File) instead of directly creating readers. Recently I took a look at that source code, it uses NIO's file channels.
But the first thing I would suggest is to run an empty loop with BufferedReader that does nothing but reading. Note the throughput -- and also keep an eye on the CPU. If the loop floors the CPU, then there's definitely an issue with the IO code.
Disable the antivirus and any other program which adds to disk contention while reading the file.
Defragment the disk.
Create a raw disk partition and read the file from there.
Read the file from an SSD.
Create a 50GB Ramdisk and read the file from there.
I think you may get the best results by re-considering the problem you're trying to solve. There's clearly a reason you're loading this 50Gig file. Consider if there isn't a better way to break the stored data down and only use the data you really need.
The way you read the file is fine. There might be ways to get it faster, but it usually requires understanding where your bottleneck is. Because the IO throughput is actually on the lower end, I assume the computation is having a performance side effect. If its not too lengthy you could show you whole program.
Alternatively, you could run your program without the contents of the loop and see how long it takes to read through the file :)
I have a program that generates a lot of data and puts it in a queue to write but the problem is its generating data faster than I'm currently writing(causing it to max memory and start to slow down). Order does not matter as I plan to parse the file later.
I looked around a bit and found a few questions that helped me design my current process(but I still find it slow). Here's my code so far:
//...background multi-threaded process keeps building the queue..
FileWriter writer = new FileWriter("foo.txt",true);
BufferedWriter bufferWritter = new BufferedWriter(writer);
while(!queue_of_stuff_to_write.isEmpty()) {
String data = solutions.poll().data;
bufferWritter.newLine();
bufferWritter.write(data);
}
bufferWritter.close();
I'm pretty new to programming so I maybe assessing this wrong(maybe a hardware issue as I'm using EC2), but is there a to very quickly dump the queue results into a file or if my approach is okay can I improve it somehow? As order does not matter, does it make more sense to write to multiple files on multiple drives? Will threading make it faster?,etc..I'm not exactly sure the best approach and any suggestions would be great. My goal is to save the results of the queue(sorry no outputting to /dev/null :-) and keep memory consumption as low as possible for my app(I'm not 100% sure but the queue fills up 15gig, so I'm assuming it'll be a 15gig+ file).
Fastest way to write huge data in text file Java (realized I should use buffered writer)
Concurrent file write in Java on Windows (made me see that maybe multi-threading writes wasn't a great idea)
Looking at that code, one thing that springs to mind is character encoding. You're writing strings, but ultimately, it's bytes that go to the streams. A writer character-to-byte encoding under the hood, and it's doing it in the same thread that is handling writing. That may mean that there is time being spent encoding that is delaying writes, which could reduce the rate at which data is written.
A simple change would be to use a queue of byte[] instead of String, do the encoding in the threads which push onto the queue, and have the IO code use a BufferedOutputStream rather than a BufferedWriter.
This may also reduce memory consumption, if the encoded text takes up less than two bytes per character on average. For latin text and UTF-8 encoding, this will usually be true.
However, i suspect it's likely that you're simply generating data faster than your IO subsystem can handle it. You will need to make your IO subsystem faster - either by using a faster one (if you're on EC2, perhaps renting a faster instance, or writing to a different backend - SQS vs EBS vs local disk, etc), or by ganging several IO subsystems together in parallel somehow.
Yes, writing multiple files on multiple drives should help, and if nothing else is writing to those drives at the same time, performance should scale linearly with the number of drives until I/O is no longer the bottleneck. You could also try a couple other optimizations to boost performance even more.
If you're generating huge files and the disk simply can't keep up, you can use a GZIPOutputStream to shrink the output--which, in turn, will reduce the amount of disk I/O. For non-random text, you can usually expect a compression ratio of at least 2x-10x.
//...background multi-threaded process keeps building the queue..
OutputStream out = new FileOutputStream("foo.txt",true);
OutputStreamWriter writer = new OutputStreamWriter(new GZIPOutputStream(out));
BufferedWriter bufferWriter = new BufferedWriter(writer);
while(!queue_of_stuff_to_write.isEmpty()) {
String data = solutions.poll().data;
bufferWriter.newLine();
bufferWriter.write(data);
}
bufferWriter.close();
If you're outputting regular (i.e., repetitive) data, you might also want to consider switching to a different output format--for example, a binary encoding of the data. Depending on the structure of your data, it might be more efficient to store it in a database. If you're outputting XML and really want to stick to XML, you should look into a Binary XML format, such as EXI or Fast InfoSet.
I guess as long as you produce your data out of calculations and do not load your data from another data source, writing will always be slower than generating your data.
You can try writing your data in multiple files (not in the same file -> due to synchronization problems) in multiple threads (but I guess that will not fix your problem).
Is it possible for you to wait for the writing part of your application to finish its operation and continue your calculations?
Another approach is:
Do you empty your queue? Does solutions.poll() reduce your solutions queue?
writing to different files using multiple threads is a good idea. Also, you should look into setting the BufferedWriters buffer size, which you can do from the constructor. Try initializing with a 10 Mb buffer and see if that helps
Reading Thinking in Java 4th ed. I've got some doubts about I/O operations performance:
I've read that it's better to "wrap" InputStream objects in BufferedInputStream, but in my mind I can't see any difference. Isn't i.e. file operations already buffered? What's the advantages of file buffered write?
The system's IO buffering is on a different level than the Buffered*putStream.
Each call on FileOutputStream.write(...) induces a native method call (which is typically more costly than a java-internal call), and then a context switch to the OS' kernel to do the actual writing. Even if the kernel (or the file system driver or the harddisk controller or the harddisk itself) is doing more buffering, these costs will occur.
By wrapping a BufferedOutputStream around this, we will call the native write method only much less often, thus allowing much higher throughput.
(The same is valid for other types of IO, of course, I just used FileOutputStream as an example.)
Isn't i.e. file operations already buffered?
Maybe, maybe not - depending on the OS, the HD used, the way of access (e.g. reading big consecutive blocks vs randomly accessing small blocks all over the place), etc. In the worst case, adding a BufferedInputStream probably won't harm performance noticeably. In the best case, it can improve it by magnitudes (replacing many little file accesses by one big read/write).
An InputStream will only request as much data as you request, so if you request 1000 characters one character at a time, that will turn out to be 1000 seperate disk accesses, which will become pretty slow.
A BufferedInputStream however will request data from the InputStream in larger chunks, thus reducing the need for seperate disk accesses.
The same goes for output, instead of writing every character seperately, there are fewer physical disk writes with a BufferedOutputStream.
I want to read a big text file, what i decided to create four threads and read 25% of file by each one.
and then join them.
but its not more impressive.
can any one tell me can i use concurrent programming for the same.
as my file structure have some data as
name contact compnay policyname policynumber uniqueno
and I want to put all data in hashmap at last.
thanks
Reading a large file is typically limited by I/O performance, not by CPU time. You can't speed up the reading by dividing into multiple threads (it will rather decrease performance, since it's still the same file, on the same drive). You can use concurrent programming to process the data, but that can only improve performance after reading the file.
You may, however, have some luck by dedicating one single thread to reading the file, and delegate the actual processing from this thread to worker threads, whenever a data unit has been read.
If it is a big file chances are that it is written to disk as a contiguous part and "streaming" the data would be faster than parallel reads as this would start moving the heads back and forth. To know what is fastest you need intimate knowledge of your target production environment, because on high end storage the data will likely be distributed over multiple disks and parallel reads might be faster.
Best approach is i think is to read it with large chunks into memory. Making it available as a ByteArrayInputStream to do the parsing.
Quite likely you will peg the CPU during parsing and handling of the data. Maybe parallel map-reduce could help here spread the load over all cores.
You might want to use Memory-mapped file buffers (NIO) instead of plain java.io.
Well, you might flush the disk cache and put a high contention on the synchronization of the hashmap if you do it like that. I would suggest that you simply make sure that you have buffered the stream properly (possibly with a large buffer size). Use the BufferedReader(Reader in, int sz) constructor to specify buffer size.
If the bottle neck is not parsing the lines (that is, the bottle neck is not the CPU usage) you should not parallelize the task in the way described.
You could also look into memory mapped files (available through the nio package), but thats probably only useful if you want to read and write files efficiently. A tutorial with source code is available here: http://www.linuxtopia.org/online_books/programming_books/thinking_in_java/TIJ314_029.htm
well you can take help from below link
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
OR
by using large buffer
or using this
import java.io.*;
public class line1 {
public static void main(String args[]) {
if (args.length != 1) {
System.err.println("missing filename");
System.exit(1);
}
try {
FileInputStream fis =
new FileInputStream(args[0]);
BufferedInputStream bis =
new BufferedInputStream(fis);
DataInputStream dis =
new DataInputStream(bis);
int cnt = 0;
while (dis.readLine() != null)
cnt++;
dis.close();
System.out.println(cnt);
}
catch (IOException e) {
System.err.println(e);
}
}
}