Optimizing Java File I/O for High-Volume Log Files

When I want to write Java code for writing text to a file, it usually looks something like this:
File logFile = new File("/home/someUser/app.log");
FileWriter writer;
try {
writer = new FileWriter(logFile, true);
writer.write("Some text.");
writer.close();
} catch (IOException e) {
e.printStackTrace();
}
I am now writing a Logger that will be used extensively by an in-house reporting tool. For reasons outside the context of this question, I can't use one of the traditional logging frameworks (SLF4J, Log4j, Logback, JUL, JCL, etc.). So I have to make something homegrown.
This logging system will be simple, non-configurable, but has to be capable of handling high-volume (possibly hundreds of log operations per second, or more).
So I ask: how can I optimize my normal file I/O template above to handle high-throughput logging? What kind of "hidden gems of Java File I/O" can I capitalize on here? Pretty much anything goes, except, like I said, use of other logging frameworks. The basic Logger API needs to be something like:
public class Logger {
private File logFile;
public Logger(File logFile) {
super();
setFile(logFile);
}
public void log(String message) {
???
}
}
Thanks in advance!
Update: If my Logger used a ByteOutputStream instead of a FileWriter, then how can I properly synchronize my log(String) : void method?
public class Logger {
private File logFile;
// Constructor, getters/setters, etc.
public synchronized void log(String message) {
FileOutputStream foutStream = new FileOutputStream(logFile);
ByteOutputStream boutStream = new ByteOutputStream(foutStream);
boutStream.write(message.getBytes(Charset.forName("utf-8")));
// etc.
}
}

If you want to achieve maximum throughput for the logging operation you should decouple the logging of messages from writing them to the file system by using a queue and a separate log-writing thread.
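A minimal sketch of that queue-plus-writer-thread idea, assuming a bounded LinkedBlockingQueue and a single background thread. The class name AsyncLogger, the queue capacity of 10,000, and the shutdown() method are illustrative choices, not part of the question's API, and error handling is deliberately thin (if the writer thread dies, producers will eventually block on a full queue).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical name; a sketch of a queue-backed logger, not a hardened implementation.
class AsyncLogger {
    // Stop marker for the writer thread. It is compared by identity, so a
    // user message with the same text is still logged normally.
    private static final String POISON = new String("<stop>");

    // Bounded queue (capacity is an assumption): a full queue blocks producers.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
    private final Thread writerThread;

    AsyncLogger(Path logFile) {
        writerThread = new Thread(() -> writeLoop(logFile), "log-writer");
        writerThread.start();
    }

    // Called by application threads: only enqueues, never touches the file.
    public void log(String message) throws InterruptedException {
        queue.put(message);
    }

    // Runs on the writer thread: the only code that performs file I/O.
    private void writeLoop(Path logFile) {
        try (BufferedWriter out = Files.newBufferedWriter(logFile, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            while (true) {
                String message = queue.take();
                if (message == POISON) {
                    return; // try-with-resources flushes and closes the file
                }
                out.write(message);
                out.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Enqueues the stop marker and waits until everything before it is written.
    public void shutdown() throws InterruptedException {
        queue.put(POISON);
        writerThread.join();
    }
}
```

Application threads would call log(...) as often as they like and call shutdown() once at exit. Anything still sitting in the queue or in the writer's buffer is lost if the process crashes, which is the usual trade-off with this design.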

The purpose of a logging system isn't just to achieve maximum throughput. It is required as an audit trail. Business decisions must be made as to how much data loss, if any, is tolerable in case of a crash. You need to investigate that first, before committing yourself to any specific technical solution.

I'm speaking only about throughput here and not about engineering or reliability concerns, since the question asked only about performance.
You will want to buffer writes to disk. Writing lots of little tiny pieces with unbuffered I/O incurs a whole bunch of overhead:
The cost of a native method call; the JVM has to do a good amount of bookkeeping, including knowing which threads are running native methods and which aren't, in order to work. This is on the order of tens or hundreds of nanoseconds on modern platforms.
Copying the data from the Java heap to native memory through a magic JNI call. There's a memory copy taking time proportional to the length of your data, but there's also a bunch of JVM bookkeeping. Ballpark the bookkeeping overhead around a few hundred nanoseconds.
The cost of an OS call to write() or similar. Ballpark the overhead around 2 microseconds. (There are other costs, too; your caches and TLB have probably been flushed upon return.) write() also needs to copy the data from user space to kernel space.
The OS may internally buffer your writes. It may not. This depends on the OS and the characteristics of the underlying filesystem. You can typically force it not to buffer your writes. You can typically also flush the OS's buffer. Doing so will incur the cost of a disk seek and a write. Ballpark the disk seek around 8ms and the write between 100MB/s and 1GB/s. Throw the disk seek overhead out the window if you're using a RAM disk or flash storage or something like that --- latencies there are typically much lower.
The really big cost you want to avoid if possible is the disk seek cost. 8 milliseconds is a hell of a long time to wait when writing a 100-odd-byte log message. You will want some kind of buffering between the user and the backing storage, whether it's provided by the OS or hidden by the logging interface.
The overhead of a system call from the JVM is also significant, though it's about 1000 times less than the cost of a disk seek. You're spending two or three microseconds to tell the kernel to buffer a write of 100-odd bytes. Almost all of those two or three microseconds are spent handling various bookkeeping tasks that have nothing at all to do with writing a log message to a file. This is why you want the buffering to happen in userspace, and preferably in Java code instead of native code. (However, engineering concerns may render this impossible.)
Java already comes with drop-in buffering solutions --- BufferedWriter and BufferedOutputStream. It turns out that these are internally synchronised. You'll want to use BufferedOutputStream so that the String-to-bytes conversion happens outside of the lock rather than inside.
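As a rough sketch of that suggestion: open one BufferedOutputStream when the logger is created, encode each message before calling write, and hand the stream a whole message in a single call. The class name BufferedLogger, the bufferSizeBytes parameter, and the close() method are assumptions made for illustration. The sketch relies on the stream's own internal synchronisation to keep messages from different threads from interleaving.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical class name; the stream is opened once and reused for every message.
class BufferedLogger {
    private final OutputStream out;

    BufferedLogger(File logFile, int bufferSizeBytes) throws IOException {
        // 'true' opens the file in append mode, like the FileWriter in the question.
        this.out = new BufferedOutputStream(new FileOutputStream(logFile, true), bufferSizeBytes);
    }

    public void log(String message) throws IOException {
        // String-to-bytes conversion happens here, before the stream's lock is taken.
        byte[] bytes = (message + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        // One write call per message, so the whole message is copied under one lock.
        out.write(bytes, 0, bytes.length);
    }

    // Flushes whatever is still buffered and closes the file.
    public void close() throws IOException {
        out.close();
    }
}
```

Something like new BufferedLogger(logFile, 8 * 1024 * 1024) would give a buffer in the size range discussed in this answer. Messages still in the buffer are lost if the JVM dies before a flush or close.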
You could do one better than the Buffered classes if you kept a queue of Strings that you flush once it reaches a certain size. This saves a memory copy, but I rather doubt this is worth doing.
On buffer sizes, I suggested something around 4MB or 8MB. Buffer sizes in this range cover up the latency of a disk seek fairly well on most typical modern hardware. Your southbridge can push about 1GB/s and a typical disk can push about 100MB/s. Maxing out your southbridge, then, an 8MB write will take about 8 milliseconds, roughly as long as a disk seek. With a single "typical modern disk", 90% of the time spent doing an 8MB random write is spent doing the write (about 80 milliseconds of transfer at 100MB/s against an 8 millisecond seek).
Again, you can't do buffering inside Java if log messages need to be reliably written to the backing store. You need to trust the kernel in that case, and you pay a speed hit for doing so.

Related

Buffered-Input Stream [duplicate]

Let me preface this post with a single caution. I am a total beginner when it comes to Java. I have been programming PHP on and off for a while, but I was ready to make a desktop application, so I decided to go with Java for various reasons.
The application I am working on is in the beginning stages (less than 5 classes) and I need to read bytes from a local file. Typically, the files are currently less than 512kB (but may get larger in the future). Currently, I am using a FileInputStream to read the file into three byte arrays, which perfectly satisfies my requirements. However, I have seen a BufferedInputStream mentioned, and was wondering if the way I am currently doing this is best, or if I should use a BufferedInputStream as well.
I have done some research and have read a few questions here on Stack Overflow, but I am still having trouble understanding when to use and when not to use a BufferedInputStream. In my situation, the first array I read bytes into is only a few bytes (less than 20). If the data I receive is good in these bytes, then I read the rest of the file into two more byte arrays of varying size.
I have also heard many people mention profiling to see which is more efficient in each specific case, however, I have no profiling experience and I'm not really sure where to start. I would love some suggestions on this as well.
I'm sorry for such a long post, but I really want to learn and understand the best way to do these things. I always have a bad habit of second guessing my decisions, so I would love some feedback. Thanks!

If you are consistently doing small reads then a BufferedInputStream will give you significantly better performance. Each read request on an unbuffered stream typically results in a system call to the operating system to read the requested number of bytes. The overhead of a system call may be thousands of machine instructions per syscall. A buffered stream reduces this by doing one large read for (say) up to 8k bytes into an internal buffer, and then handing out bytes from that buffer. This can drastically reduce the number of system calls.
However, if you are consistently doing large reads (e.g. 8k or more) then a BufferedInputStream slows things a bit. You typically don't reduce the number of syscalls, and the buffering introduces an extra data copying step.
In your use-case (where you read a 20 byte chunk first then lots of large chunks) I'd say that using a BufferedInputStream is more likely to reduce performance than increase it. But ultimately, it depends on the actual read patterns.
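To make the small-reads case concrete, here is a small sketch that counts how many read calls actually reach the underlying stream when a caller reads one byte at a time. ReadCallCounter and CountingInputStream are made-up names, and a ByteArrayInputStream stands in for the file, so the counts illustrate call volume rather than real syscall cost.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadCallCounter {
    // Wraps a stream and counts every read call that reaches it.
    static final class CountingInputStream extends FilterInputStream {
        int calls;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    // Consumes the data one byte at a time and returns how many read calls
    // hit the underlying source, with or without a BufferedInputStream in between.
    static int underlyingReads(byte[] data, boolean buffered) throws IOException {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        InputStream in = buffered ? new BufferedInputStream(source) : source;
        while (in.read() != -1) {
            // discard the byte; only the call count matters here
        }
        return source.calls;
    }
}
```

For 16 KB of data, the unbuffered variant issues one underlying read per byte plus one for end of stream, while the buffered variant needs only a handful of bulk reads.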

If you are using relatively large arrays to read the data a chunk at a time, then BufferedInputStream will just introduce a wasteful copy. (Remember, read does not necessarily fill the whole array - you might want DataInputStream.readFully.) Where BufferedInputStream wins is when making lots of small reads.
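For the pattern in the question (a short header, then the rest in big chunks), a sketch with DataInputStream.readFully and no BufferedInputStream might look like the following. The class name HeaderThenBody and the two length parameters are placeholders; the real code would validate the header before reading the body.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

class HeaderThenBody {
    // Reads a fixed-size header and then a fixed-size body from the file.
    // Returns { header, body }.
    static byte[][] read(String path, int headerLength, int bodyLength) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            byte[] header = new byte[headerLength];
            // readFully keeps reading until the array is full, and throws
            // EOFException if the file ends first. A plain read() may stop early.
            in.readFully(header);

            byte[] body = new byte[bodyLength];
            in.readFully(body);
            return new byte[][] { header, body };
        }
    }
}
```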

BufferedInputStream reads more of the file than you need in advance. As I understand it, it's doing more work up front: one big continuous disk read instead of many small ones in a tight loop.
As far as profiling - I like the profiler that's built into netbeans. It's really easy to get started with. :-)

I can't speak to the profiling, but from my experience developing Java applications I find that using any of the buffer classes - BufferedInputStream, StringBuffer - my applications are exceptionally faster. Because of which, I use them even for the smallest files or string operation.

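import java.io.*;
class BufferedInputStreamDemo
{
    public static void main(String arg[]) throws IOException
    {
        FileInputStream fin = new FileInputStream("abc.txt");
        BufferedInputStream bis = new BufferedInputStream(fin);
        // available() is only an estimate of the remaining bytes; it is used
        // here as the read limit so the mark stays valid for the whole file.
        int size = bis.available();
        // Mark the start of the stream before reading, not at end of stream.
        bis.mark(size + 1);
        // First pass: print every byte until read() reports end of stream.
        while (true)
        {
            int x = bis.read();
            if (x == -1)
            {
                break;
            }
            System.out.print((char) x);
        }
        // Rewind to the marked position.
        bis.reset();
        // Second pass: the same bytes are read again.
        while (true)
        {
            int x = bis.read();
            if (x == -1)
            {
                break;
            }
            System.out.print((char) x);
        }
        bis.close();
    }
}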

Why does the PrintWriter class (and other writers) require a call to flush after writing?

I have noticed that some I/O classes in Java (PrintWriter and a lot of others, like BufferedWriter and FileWriter) require a call to flush() after writing. (With the exception of auto-flush, which I'll get to later.)
For example, this call to println() will not work. However, if I invoke writer#flush() after, the line will print.
PrintWriter writer = new PrintWriter(System.out);
writer.println("test");
Also, does autoflushing impact performance in any way (especially in larger/consistent writes), or is it just a convenience, and is it recommended to use it?

Why does the PrintWriter class (and other writers) require a call to flush after writing?
To the extent that flushing is wanted [1], it will be needed if the "stack" of output classes beneath the print writer is doing some output buffering. If an output stream is buffered, then some event needs to trigger pushing (flushing) the buffered output to the external file, pipe, socket or whatever. The things that will trigger flushing are:
the buffer filling up
something calling close() on the stream, or
something calling flush() on the stream.
In the case of a PrintWriter, the underlying stream can also be flushed by the class's auto-flushing mechanism.
The reason for buffering output (in general) is efficiency. Performing the low-level output operation that writes data to the (external) file, pipe, whatever involves a system call. There are significant overheads in doing this, so you want to avoid doing lots of "little" writes.
Output buffering is the standard way to solve this problem. Data to be written is collected in the buffer until the buffer fills up. The net result is that lots of "little" writes can be aggregated into a "big" write. The performance improvement can be significant.
Also, does autoflushing impact performance in any way (especially in larger/consistent writes), or is it just a convenience, and is it recommended to use it?
It is really a convenience to avoid having to explicitly flush. It doesn't improve performance. Indeed, if you don't need the data to be flushed [1], then unnecessary auto-flushing will reduce performance.
1 - You would want the data to be flushed if someone or something wants to see the data you are writing as soon as possible.
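The buffering behaviour described in this answer can be observed with a small sketch that writes to an in-memory sink instead of System.out. FlushDemo and its method names are invented for the example; ByteArrayOutputStream.size() is used to see how many bytes have actually left the PrintWriter.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintWriter;

class FlushDemo {
    // Returns how many bytes have reached the sink right after println,
    // without any explicit flush() call.
    static int bytesVisibleAfterPrintln(boolean autoFlush) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintWriter writer = new PrintWriter(sink, autoFlush);
        writer.println("test");
        return sink.size();
    }

    // Returns how many bytes have reached the sink after println plus flush().
    static int bytesVisibleAfterFlush() {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintWriter writer = new PrintWriter(sink, false);
        writer.println("test");
        writer.flush();
        return sink.size();
    }
}
```

Without auto-flush, nothing has reached the sink after println because the text is still in the writer's buffer. With auto-flush enabled, or after an explicit flush(), the bytes for "test" and the line separator are there.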

They don't strictly require flushing. You only need it if you want to guarantee that the output written so far has actually been displayed, which is exactly what flushing does. If you are writing to a file and just want to make sure the data gets there before the program terminates, there is no need to flush explicitly, provided you close the writer, since close() flushes first.

When data is written to an output stream, the underlying operating system does not guarantee that the data will make it to the file system immediately. In many operating systems, the data may be cached in memory, with a write occurring only after a temporary cache is filled or after some amount of time has passed.
If the data is cached in memory and the application terminates unexpectedly, the data would be lost, because it was never written to the file system. To address this, all output stream classes provide a flush() method, which requests that all accumulated data be written immediately to disk.
The flush() method helps reduce the amount of data lost if the application terminates unexpectedly. It is not without cost, though. Each time it is used, it may cause a noticeable delay in the application, especially for large files. Unless the data that you are writing is extremely critical, the flush() method should be used only intermittently. For example, it should not necessarily be called after every write.
You also do not need to call the flush() method when you have finished writing data, since the close() method will automatically do this.
Read from the book here -> OCP Oracle Certified Professional Java SE 11
Hope this is clear to you!

Java IO Thread Safety

I use MappedByteBuffers to achieve thread safety between readers and writers of a file via volatile variables (the writer updates its position and readers read the writer's position). This is a file upload system and the incoming file is a stream, if that matters. There are more tricks, obviously (sparse files, power of two mapping growth), but it all boils down to that.
I can't find a faster way to write to a file while concurrently reading the same file without caching it completely in memory (which I cannot do due to its sheer size).
Is there any other method of IO that guarantees visibility within the same process for readers to written bytes? MappedByteBuffer makes its guarantees, indirectly, via the Java Memory Model, and I'd expect any other solution to do the same (read: non platform specific and more).
Is this the fastest way? Am I missing something in the docs?

I did some tests quite a few years ago on what was then decent hardware, and MappedByteBuffer was about 20% faster than any other I/O technique. It does have the disadvantage for writing that you need to know the file size in advance.
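For readers unfamiliar with the setup, here is a very reduced sketch of the scheme from the question: one writer appends through a MappedByteBuffer and publishes its position in a volatile field, and readers only look at bytes below that position. The class name MappedAppendBuffer and its methods are invented, it assumes a single writer thread, and the visibility reasoning is the question's own assumption rather than something this sketch proves. It also shows the drawback mentioned above: the mapping size is fixed when the file is mapped.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: single writer, many readers, fixed capacity.
class MappedAppendBuffer implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer buffer;
    // Number of bytes the writer has published so far.
    private volatile int writtenBytes;

    MappedAppendBuffer(Path file, int capacity) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        // The size of the mapping is decided here, so it must be known
        // (or over-estimated) before any data is written.
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, capacity);
    }

    // Writer thread only. Throws IndexOutOfBoundsException past the capacity.
    void append(byte[] data) {
        int start = writtenBytes;
        for (int i = 0; i < data.length; i++) {
            buffer.put(start + i, data[i]); // absolute put: buffer position is untouched
        }
        writtenBytes = start + data.length; // publish the new position last
    }

    // Reader threads: copy everything published so far.
    byte[] readAll() {
        int limit = writtenBytes; // read the published position first
        byte[] copy = new byte[limit];
        for (int i = 0; i < limit; i++) {
            copy[i] = buffer.get(i); // absolute get: buffer position is untouched
        }
        return copy;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```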

What is the fastest way to write a large amount of data from memory to a file?

I have a program that generates a lot of data and puts it in a queue to write, but the problem is that it's generating data faster than I'm currently writing it (causing it to max out memory and start to slow down). Order does not matter, as I plan to parse the file later.
I looked around a bit and found a few questions that helped me design my current process(but I still find it slow). Here's my code so far:
//...background multi-threaded process keeps building the queue..
FileWriter writer = new FileWriter("foo.txt",true);
BufferedWriter bufferWritter = new BufferedWriter(writer);
while(!queue_of_stuff_to_write.isEmpty()) {
String data = solutions.poll().data;
bufferWritter.newLine();
bufferWritter.write(data);
}
bufferWritter.close();
I'm pretty new to programming, so I may be assessing this wrong (maybe it's a hardware issue, as I'm using EC2), but is there a way to very quickly dump the queue results into a file? Or, if my approach is okay, can I improve it somehow? As order does not matter, does it make more sense to write to multiple files on multiple drives? Will threading make it faster? I'm not exactly sure of the best approach, and any suggestions would be great. My goal is to save the results of the queue (sorry, no outputting to /dev/null :-) and keep memory consumption as low as possible for my app (I'm not 100% sure, but the queue fills up 15 GB, so I'm assuming it'll be a 15 GB+ file).
Fastest way to write huge data in text file Java (realized I should use buffered writer)
Concurrent file write in Java on Windows (made me see that maybe multi-threading writes wasn't a great idea)

Looking at that code, one thing that springs to mind is character encoding. You're writing strings, but ultimately, it's bytes that go to the streams. A writer does character-to-byte encoding under the hood, and it's doing it in the same thread that is handling writing. That may mean that time spent encoding is delaying writes, which could reduce the rate at which data is written.
A simple change would be to use a queue of byte[] instead of String, do the encoding in the threads which push onto the queue, and have the IO code use a BufferedOutputStream rather than a BufferedWriter.
This may also reduce memory consumption, if the encoded text takes up less than two bytes per character on average. For latin text and UTF-8 encoding, this will usually be true.
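Roughly, that change could look like the sketch below: producers encode to UTF-8 before enqueueing, and the I/O thread only copies bytes into a BufferedOutputStream. EncodedQueueWriter, the END marker, and the newline-per-record convention are assumptions for the example; the question's own queue and element types are not shown in full, so they are not reproduced here.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;

class EncodedQueueWriter {
    // End-of-data marker, compared by identity rather than by content.
    static final byte[] END = new byte[0];

    // Producer side: the encoding cost is paid on the generating thread.
    static void enqueue(BlockingQueue<byte[]> queue, String line) throws InterruptedException {
        queue.put((line + "\n").getBytes(StandardCharsets.UTF_8));
    }

    // Consumer side: copies already-encoded chunks to the target until END
    // is taken from the queue. Returns the number of bytes written.
    static long drain(BlockingQueue<byte[]> queue, OutputStream target)
            throws IOException, InterruptedException {
        long written = 0;
        try (OutputStream out = new BufferedOutputStream(target)) {
            for (byte[] chunk = queue.take(); chunk != END; chunk = queue.take()) {
                out.write(chunk);
                written += chunk.length;
            }
        }
        return written;
    }
}
```

The generating threads would call enqueue(...), and whichever thread owns the file would call drain(queue, new FileOutputStream("foo.txt", true)), with END put on the queue once the producers are done.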
However, I suspect it's likely that you're simply generating data faster than your IO subsystem can handle it. You will need to make your IO subsystem faster - either by using a faster one (if you're on EC2, perhaps renting a faster instance, or writing to a different backend - SQS vs EBS vs local disk, etc), or by ganging several IO subsystems together in parallel somehow.

Yes, writing multiple files on multiple drives should help, and if nothing else is writing to those drives at the same time, performance should scale linearly with the number of drives until I/O is no longer the bottleneck. You could also try a couple other optimizations to boost performance even more.
If you're generating huge files and the disk simply can't keep up, you can use a GZIPOutputStream to shrink the output--which, in turn, will reduce the amount of disk I/O. For non-random text, you can usually expect a compression ratio of at least 2x-10x.
//...background multi-threaded process keeps building the queue..
OutputStream out = new FileOutputStream("foo.txt",true);
OutputStreamWriter writer = new OutputStreamWriter(new GZIPOutputStream(out));
BufferedWriter bufferWriter = new BufferedWriter(writer);
while(!queue_of_stuff_to_write.isEmpty()) {
String data = solutions.poll().data;
bufferWriter.newLine();
bufferWriter.write(data);
}
bufferWriter.close();
If you're outputting regular (i.e., repetitive) data, you might also want to consider switching to a different output format--for example, a binary encoding of the data. Depending on the structure of your data, it might be more efficient to store it in a database. If you're outputting XML and really want to stick to XML, you should look into a Binary XML format, such as EXI or Fast InfoSet.

I guess as long as you produce your data out of calculations and do not load your data from another data source, writing will always be slower than generating your data.
You can try writing your data in multiple files (not in the same file -> due to synchronization problems) in multiple threads (but I guess that will not fix your problem).
Is it possible for you to wait for the writing part of your application to finish its operation and continue your calculations?
Another approach is:
Do you empty your queue? Does solutions.poll() reduce your solutions queue?

Writing to different files using multiple threads is a good idea. Also, you should look into setting the BufferedWriter's buffer size, which you can do from the constructor. Try initializing with a 10 MB buffer and see if that helps.
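For reference, the buffer size is the second argument of the BufferedWriter constructor. A tiny sketch, with BigBufferWriter and open(...) as made-up names:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

class BigBufferWriter {
    // Opens an appending writer whose in-memory buffer holds bufferChars characters.
    // Note the unit: BufferedWriter sizes its buffer in chars, not bytes.
    static BufferedWriter open(String path, int bufferChars) throws IOException {
        return new BufferedWriter(new FileWriter(path, true), bufferChars);
    }
}
```

Calling open("foo.txt", 10 * 1024 * 1024) asks for a buffer of about ten million characters, which takes roughly twice that many bytes of heap.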

Java I/O classes and performance

Reading Thinking in Java 4th ed. I've got some doubts about I/O operations performance:
I've read that it's better to "wrap" InputStream objects in a BufferedInputStream, but I can't see what difference it makes. Aren't file operations, for example, already buffered? What are the advantages of buffered file writes?

The system's IO buffering is on a different level than the Buffered*putStream.
Each call on FileOutputStream.write(...) induces a native method call (which is typically more costly than a java-internal call), and then a context switch to the OS' kernel to do the actual writing. Even if the kernel (or the file system driver or the harddisk controller or the harddisk itself) is doing more buffering, these costs will occur.
By wrapping a BufferedOutputStream around this, we call the native write method much less often, thus allowing much higher throughput.
(The same is valid for other types of IO, of course, I just used FileOutputStream as an example.)
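The effect on the number of underlying write calls can be sketched with a counting stream in place of the FileOutputStream. WriteCallCounter and CountingOutputStream are illustrative names, and the 100-byte message size is an arbitrary choice; with a real file, each counted call would be a native write.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

class WriteCallCounter {
    // Discards the data and counts how many write calls reach it.
    static final class CountingOutputStream extends OutputStream {
        int calls;

        @Override
        public void write(int b) {
            calls++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            calls++;
        }
    }

    // Writes the given number of 100-byte messages and returns how many
    // write calls hit the underlying stream, with or without buffering.
    static int underlyingWrites(int messages, boolean buffered) throws IOException {
        CountingOutputStream sink = new CountingOutputStream();
        OutputStream out = buffered ? new BufferedOutputStream(sink) : sink;
        byte[] message = new byte[100];
        for (int i = 0; i < messages; i++) {
            out.write(message);
        }
        out.flush(); // push out whatever is still buffered
        return sink.calls;
    }
}
```

With 1000 messages, the unbuffered variant makes 1000 underlying calls, while the buffered one makes far fewer because each underlying call carries many messages at once.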

Aren't file operations, for example, already buffered?
Maybe, maybe not - depending on the OS, the HD used, the way of access (e.g. reading big consecutive blocks vs randomly accessing small blocks all over the place), etc. In the worst case, adding a BufferedInputStream probably won't harm performance noticeably. In the best case, it can improve it by orders of magnitude (replacing many little file accesses with one big read/write).

An InputStream will only request as much data as you request, so if you request 1000 characters one character at a time, that will turn out to be 1000 separate disk accesses, which will become pretty slow.
A BufferedInputStream, however, will request data from the InputStream in larger chunks, thus reducing the need for separate disk accesses.
The same goes for output: instead of writing every character separately, there are fewer physical disk writes with a BufferedOutputStream.
