Reading Thinking in Java 4th ed. I've got some doubts about I/O operations performance:
I've read that it's better to "wrap" InputStream objects in a BufferedInputStream, but I can't see the difference. Aren't file operations already buffered somewhere? What are the advantages of buffered file writes?
The system's I/O buffering is on a different level than BufferedInputStream/BufferedOutputStream.
Each call to FileOutputStream.write(...) induces a native method call (which is typically more costly than a Java-internal call) and then a context switch into the OS kernel to do the actual writing. Even if the kernel (or the file system driver, or the hard disk controller, or the hard disk itself) does more buffering, these costs still occur.
By wrapping a BufferedOutputStream around it, we call the native write method much less often, allowing much higher throughput.
(The same is valid for other types of IO, of course, I just used FileOutputStream as an example.)
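As a rough sketch of what that wrapping looks like (the file name and the loop are invented for the example):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferedWriteDemo {
    public static void main(String[] args) throws IOException {
        // Without the wrapper, each write(int) would be a native call plus a
        // kernel transition. With it, bytes collect in an 8192-byte java-side
        // array, and one bulk native write happens per filled buffer.
        try (OutputStream out =
                new BufferedOutputStream(new FileOutputStream("out.bin"))) {
            for (int i = 0; i < 100_000; i++) {
                out.write(i & 0xFF); // lands in the in-memory buffer
            }
        } // close() flushes the final partial buffer
    }
}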
Aren't file operations already buffered somewhere?
Maybe, maybe not - depending on the OS, the disk used, the access pattern (e.g. reading big consecutive blocks vs. randomly accessing small blocks all over the place), etc. In the worst case, adding a BufferedInputStream probably won't harm performance noticeably. In the best case, it can improve it by orders of magnitude (replacing many little file accesses with one big read/write).
A plain InputStream reads only as much data as you ask for, so if you request 1000 characters one character at a time, that turns into 1000 separate disk accesses, which will be pretty slow.
A BufferedInputStream, however, requests data from the InputStream in larger chunks, thus reducing the need for separate disk accesses.
The same goes for output: instead of writing every character separately, there are fewer physical disk writes with a BufferedOutputStream.
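For example, both loops below do the same work, but the second touches the disk far less often ("data.bin" is a placeholder):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadComparison {
    public static void main(String[] args) throws IOException {
        // Unbuffered: every read() can reach down to the OS.
        try (InputStream in = new FileInputStream("data.bin")) {
            while (in.read() != -1) { /* one byte per call */ }
        }
        // Buffered: the wrapper refills an internal 8192-byte array in bulk,
        // so most read() calls are served from memory.
        try (InputStream in =
                new BufferedInputStream(new FileInputStream("data.bin"))) {
            while (in.read() != -1) { /* same loop, far fewer disk accesses */ }
        }
    }
}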
How does a buffer actually optimize the process of reading/writing?
Every time we read a byte, we access the file. I've read that a buffer reduces the number of accesses to the file. The question is: how? In the Buffered section of the picture, when we load bytes from the file into the buffer, we access the file just like in the Unbuffered section, so where is the optimization?
I mean... the buffer must access the file every time it reads a byte, so
even if the data in the buffer is read faster, this will not improve performance in the process of reading. What am I missing?
The fundamental misconception is to assume that a file is read byte by byte. Most storage devices, including hard drives and solid-state disks, organize the data in blocks. Likewise, network protocols transfer data in packets rather than single bytes.
This affects how the controller hardware and low-level software (drivers and operating system) work. Often, it is not even possible to transfer a single byte on this level. So, requesting the read of a single byte ends up reading one block and ignoring everything but one byte. Even worse, writing a single byte may imply reading an entire block, changing one byte of it, and writing the block back to the device. For network transfers, sending a packet with a payload of only one byte implies using 99% of the bandwidth for metadata rather than actual payload.
Note that sometimes, an immediate response is needed or a write is required to be definitely completed at some point, e.g. for safety. That’s why unbuffered I/O exists at all. But for most ordinary use cases, you want to transfer a sequence of bytes anyway and it should be transferred in chunks of a size suitable to the underlying hardware.
Note that even if the underlying system adds buffering of its own, or if the hardware truly can transfer single bytes, performing 100 operating system calls to transfer a single byte each is still significantly slower than performing a single operating system call telling it to transfer 100 bytes at once.
But you should not consider the buffer to be something between the file and your program, as suggested in your picture. You should consider the buffer to be part of your program. Just like you would not consider a String object to be something between your program and a source of characters, but rather a natural way to process such items. E.g. when you use the bulk read method of InputStream (e.g. of a FileInputStream) with a sufficiently large target array, there is no need to wrap the input stream in a BufferedInputStream; it would not improve the performance. You should just stay away from the single byte read method as much as possible.
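A sketch of that bulk-read style ("data.bin" and the processing step are placeholders):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BulkRead {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream("data.bin")) {
            byte[] chunk = new byte[64 * 1024]; // the buffer is part of your program
            int n;
            while ((n = in.read(chunk)) != -1) {
                // process chunk[0..n) here; no BufferedInputStream needed,
                // since each read() call already transfers a large block
            }
        }
    }
}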
As another practical example, when you use an InputStreamReader, it will already read the bytes into a buffer (so no additional BufferedInputStream is needed) and the internally used CharsetDecoder will operate on that buffer, writing the resulting characters into a target char buffer. When you use, e.g. Scanner, the pattern matching operations will work on that target char buffer of a charset decoding operation (when the source is an InputStream or ByteChannel). Then, when delivering match results as strings, they will be created by another bulk copy operation from the char buffer. So processing data in chunks is already the norm, not the exception.
This has been incorporated into the NIO design. So, instead of supporting a single byte read method and fixing it by providing a buffering decorator, as the InputStream API does, NIO’s ByteChannel subtypes only offer methods using application managed buffers.
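For instance, a minimal sketch of the channel-plus-application-managed-buffer style ("data.bin" is again a placeholder):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;

public class ChannelRead {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("data.bin"))) {
            ByteBuffer buf = ByteBuffer.allocate(64 * 1024); // application-managed buffer
            while (ch.read(buf) != -1) {
                buf.flip();
                // consume the bytes between position and limit here ...
                buf.clear();
            }
        }
    }
}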
So we could say that buffering does not improve performance; it is the natural way of transferring and processing data. Rather, not buffering degrades performance by requiring a translation from the natural bulk data operations to single-item operations.
As shown in your picture, buffered file contents are held in memory, whereas an unbuffered file is not read into memory until it is streamed to the program.
A File is only a representation of a path. From the File Javadoc:
An abstract representation of file and directory pathnames.
Meanwhile, a buffer such as ByteBuffer takes content from the file and allocates it in memory (on or off the garbage-collected heap, depending on whether the buffer is non-direct or direct). From the ByteBuffer.allocateDirect Javadoc:
The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance.
It actually depends on the conditions: if the file is accessed repeatedly, then buffered is the faster solution. But if the file is larger than main memory and is accessed only once, unbuffered seems to be the better solution.
Basically, for reading: if you request 1 byte, the buffer will read, say, 1000 bytes from the file and return the first one to you; the next 999 one-byte reads are served from its internal buffer in RAM without touching the file. Only after you have read all 1000 bytes will it actually read another 1000 bytes from the file.
The same goes for writing, but in reverse: if you write 1 byte, it is buffered, and only once you have written 1000 bytes may they actually be written to the file.
Note that choosing the buffer size changes the performance quite a bit; see e.g. https://stackoverflow.com/a/237495/2442804 for further details, taking into account file system block size, available RAM, etc.
When I want to write Java code for writing text to a file, it usually looks something like this:
File logFile = new File("/home/someUser/app.log");
FileWriter writer;
try {
    writer = new FileWriter(logFile, true);
    writer.write("Some text.");
    writer.close();
} catch (IOException e) {
    e.printStackTrace();
}
I am now writing a Logger that will be used extensively by an in-house reporting tool. For reasons outside the context of this question, I can't use one of the traditional logging frameworks (SLF4J, Log4j, Logback, JUL, JCL, etc.). So I have to make something homegrown.
This logging system will be simple, non-configurable, but has to be capable of handling high-volume (possibly hundreds of log operations per second, or more).
So I ask: how can I optimize my normal file I/O template above to handle high-throughput logging? What kind of "hidden gems of Java File I/O" can I capitalize on here? Pretty much anything goes, except, like I said, use of other logging frameworks. The basic Logger API needs to be something like:
public class Logger {

    private File logFile;

    public Logger(File logFile) {
        super();
        setFile(logFile);
    }

    public void log(String message) {
        ???
    }
}
Thanks in advance!
Update: If my Logger used a BufferedOutputStream instead of a FileWriter, then how can I properly synchronize my log(String) : void method?
public class Logger {

    private File logFile;

    // Constructor, getters/setters, etc.

    public synchronized void log(String message) throws IOException {
        FileOutputStream foutStream = new FileOutputStream(logFile, true);
        BufferedOutputStream boutStream = new BufferedOutputStream(foutStream);
        boutStream.write(message.getBytes(StandardCharsets.UTF_8));
        // etc.
    }
}
If you want to achieve maximum throughput for the logging operation you should decouple the logging of messages from writing them to the file system by using a queue and a separate log-writing thread.
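A minimal sketch of that decoupled design (the class name, queue choice, and flush policy are illustrative, not prescribed; note that buffered messages are lost if the process dies):

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncLogger {

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public AsyncLogger(Path logFile) throws IOException {
        BufferedWriter out = Files.newBufferedWriter(logFile,
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        Thread writerThread = new Thread(() -> {
            try (BufferedWriter w = out) {
                while (true) {
                    w.write(queue.take()); // blocks until a message arrives
                    w.newLine();
                    if (queue.isEmpty()) {
                        w.flush(); // push pending output once we go idle
                    }
                }
            } catch (InterruptedException | IOException e) {
                // on shutdown or I/O failure, still-queued messages are lost
            }
        }, "log-writer");
        writerThread.setDaemon(true);
        writerThread.start();
    }

    public void log(String message) {
        queue.offer(message); // callers never touch the file system
    }
}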
The purpose of a logging system isn't just to achieve maximum throughput. It is required as an audit trail. Business decisions must be made as to how much data loss, if any, is tolerable in case of a crash. You need to investigate that first, before committing yourself to any specific technical solution.
I'm speaking only about throughput here and not about engineering or reliability concerns, since the question asked only about performance.
You will want to buffer writes to disk. Writing lots of little tiny pieces with unbuffered I/O incurs a whole bunch of overhead:
The cost of a native method call; the JVM has to do a good amount of bookkeeping, including knowing which threads are running native methods and which aren't, in order to work. This is on the order of tens or hundreds of nanoseconds on modern platforms.
Copying the data from the Java heap to native memory through a magic JNI call. There's a memory copy taking time proportional to the length of your data, but there's also a bunch of JVM bookkeeping. Ballpark the bookkeeping overhead around a few hundred nanoseconds.
The cost of an OS call to write() or similar. Ballpark the overhead around 2 microseconds. (There are other costs, too; your caches and TLB have probably been flushed upon return.) write() also needs to copy the data from user space to kernel space.
The OS may internally buffer your writes. It may not. This depends on the OS and the characteristics of the underlying filesystem. You can typically force it not to buffer your writes. You can typically also flush the OS's buffer. Doing so will incur the cost of a disk seek and a write. Ballpark the disk seek around 8ms and the write between 100MB/s and 1GB/s. Throw the disk seek overhead out the window if you're using a RAM disk or flash storage or something like that --- latencies there are typically much lower.
The really big cost you want to avoid if possible is the disk seek cost. 8 milliseconds is a hell of a long time to wait when writing a 100-odd-byte log message. You will want some kind of buffering between the user and the backing storage, whether it's provided by the OS or hidden by the logging interface.
The overhead of a system call from the JVM is also significant, though it's about 1000 times less than the cost of a disk seek. You're spending two or three microseconds to tell the kernel to buffer a write of 100-odd bytes. Almost all of those two or three microseconds are spent handling various bookkeeping tasks that have nothing at all to do with writing a log message to a file. This is why you want the buffering to happen in userspace, and preferably in Java code instead of native code. (However, engineering concerns may render this impossible.)
Java already comes with drop-in buffering solutions --- BufferedWriter and BufferedOutputStream. It turns out that these are internally synchronised. You'll want to use BufferedOutputStream so that the String-to-bytes conversion happens outside of the lock rather than inside.
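A minimal sketch of that arrangement (class name is invented; the 8 MB size anticipates the sizing discussion below):

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class BufferedLogger {

    private final OutputStream out;

    public BufferedLogger(File logFile) throws IOException {
        // Append mode, with a large explicit buffer instead of the 8192-byte default.
        out = new BufferedOutputStream(new FileOutputStream(logFile, true), 8 * 1024 * 1024);
    }

    public void log(String message) throws IOException {
        // The String-to-bytes conversion happens here, outside any lock;
        // BufferedOutputStream's write is synchronized internally.
        out.write((message + System.lineSeparator()).getBytes(StandardCharsets.UTF_8));
    }
}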
You could do one better than the Buffered classes if you kept a queue of Strings that you flush once it reaches a certain size. This saves a memory copy, but I rather doubt this is worth doing.
On buffer sizes, I'd suggest something around 4 MB or 8 MB. Buffer sizes in this range cover up the latency of a disk seek fairly well on most typical modern hardware. Your southbridge can push about 1 GB/s and a typical disk can push about 100 MB/s. Maxing out your southbridge, then, an 8 MB write will take about 8 milliseconds --- roughly as long as a disk seek. With a single "typical modern disk", 90% of the time spent doing an 8 MB random write is spent doing the write.
Again, you can't do buffering inside Java if log messages need to be reliably written to the backing store. You need to trust the kernel in that case, and you pay a speed hit for doing so.
I use MappedByteBuffers to achieve thread safety between readers and writers of a file via volatile variables (the writer updates its position and readers read the writer's position). (This is a file upload system; the incoming file is a stream, if that matters.) There are more tricks, obviously (sparse files, power-of-two mapping growth), but it all boils down to that.
I can't find a faster way to write to a file while concurrently reading it without caching the file completely in memory (which I cannot do due to its sheer size).
Is there any other method of IO that guarantees visibility within the same process for readers to written bytes? MappedByteBuffer makes its guarantees, indirectly, via the Java Memory Model, and I'd expect any other solution to do the same (read: non platform specific and more).
Is this the fastest way? Am I missing something in the docs?
I did some tests quite a few years ago on what was then decent hardware, and MappedByteBuffer was about 20% faster than any other I/O technique. It does have the disadvantage for writing that you need to know the file size in advance.
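For reference, a minimal sketch of mapping a file for writing (the file name and size are made up; it illustrates why the length must be chosen up front):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedWriteSketch {
    public static void main(String[] args) throws IOException {
        long size = 1024 * 1024; // mapped region (and thus file size) fixed up front
        try (FileChannel ch = FileChannel.open(Paths.get("upload.dat"),
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            map.put("hello".getBytes(StandardCharsets.UTF_8)); // goes straight to the page cache
        }
    }
}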
I have a program that generates a lot of data and puts it in a queue to write, but the problem is it's generating data faster than I'm currently writing (causing it to max out memory and start to slow down). Order does not matter, as I plan to parse the file later.
I looked around a bit and found a few questions that helped me design my current process (but I still find it slow). Here's my code so far:
//...background multi-threaded process keeps building the queue..
FileWriter writer = new FileWriter("foo.txt", true);
BufferedWriter bufferWriter = new BufferedWriter(writer);
while (!queue_of_stuff_to_write.isEmpty()) {
    String data = solutions.poll().data;
    bufferWriter.newLine();
    bufferWriter.write(data);
}
bufferWriter.close();
I'm pretty new to programming, so I may be assessing this wrong (maybe it's a hardware issue, as I'm using EC2), but is there a way to very quickly dump the queue results into a file? Or, if my approach is okay, can I improve it somehow? As order does not matter, does it make more sense to write to multiple files on multiple drives? Will threading make it faster? Etc. I'm not exactly sure of the best approach, so any suggestions would be great. My goal is to save the results of the queue (sorry, no outputting to /dev/null :-) and keep memory consumption as low as possible for my app (I'm not 100% sure, but the queue fills up 15 GB, so I'm assuming it'll be a 15 GB+ file).
Fastest way to write huge data in text file Java (realized I should use buffered writer)
Concurrent file write in Java on Windows (made me see that maybe multi-threading writes wasn't a great idea)
Looking at that code, one thing that springs to mind is character encoding. You're writing strings, but ultimately, it's bytes that go to the streams. A writer does the character-to-byte encoding under the hood, and it does it on the same thread that handles the writing. That may mean time is spent encoding that delays writes, which could reduce the rate at which data is written.
A simple change would be to use a queue of byte[] instead of String, do the encoding in the threads which push onto the queue, and have the IO code use a BufferedOutputStream rather than a BufferedWriter.
This may also reduce memory consumption, if the encoded text takes up less than two bytes per character on average. For latin text and UTF-8 encoding, this will usually be true.
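A rough sketch of that change (class and method names are made up; the file name follows your example):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ByteQueueWriter {

    static final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();

    // Producer threads encode while they still hold the String:
    static void produce(String data) throws InterruptedException {
        queue.put((data + "\n").getBytes(StandardCharsets.UTF_8));
    }

    // The writer thread then only moves raw bytes:
    public static void main(String[] args) throws IOException {
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("foo.txt", true))) {
            byte[] chunk;
            while ((chunk = queue.poll()) != null) {
                out.write(chunk);
            }
        }
    }
}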
However, I suspect it's likely that you're simply generating data faster than your IO subsystem can handle it. You will need to make your IO subsystem faster - either by using a faster one (if you're on EC2, perhaps renting a faster instance, or writing to a different backend - SQS vs EBS vs local disk, etc), or by ganging several IO subsystems together in parallel somehow.
Yes, writing multiple files on multiple drives should help, and if nothing else is writing to those drives at the same time, performance should scale linearly with the number of drives until I/O is no longer the bottleneck. You could also try a couple other optimizations to boost performance even more.
If you're generating huge files and the disk simply can't keep up, you can use a GZIPOutputStream to shrink the output--which, in turn, will reduce the amount of disk I/O. For non-random text, you can usually expect a compression ratio of at least 2x-10x.
//...background multi-threaded process keeps building the queue..
OutputStream out = new FileOutputStream("foo.txt", true);
OutputStreamWriter writer = new OutputStreamWriter(new GZIPOutputStream(out));
BufferedWriter bufferWriter = new BufferedWriter(writer);
while (!queue_of_stuff_to_write.isEmpty()) {
    String data = solutions.poll().data;
    bufferWriter.newLine();
    bufferWriter.write(data);
}
bufferWriter.close();
If you're outputting regular (i.e., repetitive) data, you might also want to consider switching to a different output format--for example, a binary encoding of the data. Depending on the structure of your data, it might be more efficient to store it in a database. If you're outputting XML and really want to stick to XML, you should look into a Binary XML format, such as EXI or Fast InfoSet.
I'd guess that as long as you produce your data from calculations and do not load it from another data source, writing will always be slower than generating the data.
You can try writing your data to multiple files (not to the same file, to avoid synchronization problems) using multiple threads (but I suspect that will not fix your problem).
Is it possible for you to wait for the writing part of your application to finish its operation before continuing your calculations?
Another thing to check:
Do you actually empty your queue? Does solutions.poll() remove elements from your solutions queue?
Writing to different files using multiple threads is a good idea. Also, you should look into setting the BufferedWriter's buffer size, which you can do from the constructor. Try initializing it with a 10 MB buffer and see if that helps.
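For example (10 MB is the figure suggested above, not a measured optimum):

// The second constructor argument sets the buffer size in chars;
// the default is 8192.
BufferedWriter bufferWriter =
        new BufferedWriter(new FileWriter("foo.txt", true), 10 * 1024 * 1024);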
Possible Duplicate:
In Java, what is the advantage of using BufferedWriter to append to a file?
The site that I am looking at says
"The BufferWriter class is used to write text to a character-output stream, buffering characters so as to provide for the efficient writing of single characters, arrays, and strings."
What make's it more efficient and why?
BufferedWriter is more efficient because it uses a buffer rather than writing character by character, which reduces the number of I/O operations on the disk. Data is collected in the buffer and written to the file when the buffer is full.
This is why sometimes no data ends up in the file if you don't call the flush method: the data is collected in the buffer, but the program exits before writing it to the file. Calling flush causes the data to be written to the file even if the buffer is not completely full.
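For example ("out.txt" is arbitrary):

BufferedWriter w = new BufferedWriter(new FileWriter("out.txt"));
w.write("short message"); // still sitting in the in-memory buffer; the file may be empty
w.flush();                // forces the buffered characters out to the FileWriter now
w.close();                // close() also flushes before releasing the file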
The cost of writing becomes expensive when you write character by character to the file. To reduce that cost, buffers are provided. When you write to a buffer, it waits until some limit is reached and then writes the whole chunk to disk at once.
A BufferedWriter waits until its buffer (8192 characters by default) is full and writes the whole buffer in one disk operation. Unbuffered, every single write would result in disk I/O, which is obviously more expensive.
Hard disks have a minimum unit of information storage, so, for example, if you are writing a single byte, the operating system asks the disk to store a whole unit of storage (I think the minimum is 512 bytes). So you ask to write one byte and the operating system writes much more. If you store 512 bytes with 512 calls, you end up doing a lot more I/O (512 disk operations) than if you buffer the 512 bytes and issue only one call (1 disk operation).
As the name suggests, BufferedWriter uses a buffer to reduce the cost of writes. If you are writing to a file, you might know that writing 1 byte or writing 4 kilobytes costs roughly the same. The time required to perform such a write is dominated by the access time (~8 ms), which is the time required by the disk to rotate and seek to the right sector.
Additionally, aggregating small writes into a bigger one allows you to reduce the overhead on the operating system, achieving better performance.
Most operating systems do have an internal buffer to cache writes. However, these caches try to figure out what the application is doing by analyzing the write patterns. If the application itself is able to perform that caching, and performs a write only when the data is ready, the result (in terms of performance) is better.