Fast and scalable IO passthrough


Original
I have a batch job that passes the contents it downloads from various URLs through to S3 storage. I am currently using blocking IO and have reached the point where the job is IO bound, spending most of its time blocked on IO. To speed up the whole process I was thinking about using non-blocking IO.
Unfortunately I wasn't able to find utility code for passing content through from one set of channels to another set of channels. Since I have read that writing correct non-blocking code is not exactly easy, I would prefer to use an existing utility or framework rather than write that code myself.
The TransferManager seems to be the only possible option for higher throughput when using the AWS SDK, but it only offers the usage of streams and seems to use IO threads in the background. Apparently there is no out of the box option for non-blocking uploads to S3.
What would you recommend? Right now I can only imagine the following solutions:
1. Stay with blocking IO and use my own IO thread pool
2. Use non-blocking IO to download the files to the local filesystem, then upload with TransferManager
3. Use non-blocking IO for pass through
Option 1 will obviously not scale and 2 will probably work for a while, but I would really like to keep my IOs on EBS low so I'd rather use 3.
To successfully implement option 3 I guess I would have to implement a lot myself so my final question is, whether you think it's worth it and if so, which tools I could use to make this work.
Edit 1
Clarifying that by IO bound I actually meant that the job is mostly waiting for IO. Here you can see that my bandwidth is not really saturated, but I would like it to be if that is possible.

If your job is I/O bound you've finished. You're bound by the speed of the network, not of your code. Using NIO won't make it any faster.
Clarifying that by IO bound I actually meant that the job is mostly waiting for IO.
Yes, that's what 'I/O bound' means. Nothing has changed.
Here you can see that my bandwidth is not really saturated, but I would like it to be if that is possible.
You could try using larger buffers, but as I said it seems to me that you've already finished.
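The "larger buffers" suggestion can be sketched as a plain blocking copy loop with a configurable buffer size. This is only an illustration: the class name PassThrough and the 256 KiB size used in the usage note are assumptions, not something from the question, and whether a larger buffer helps at all depends on where the job is actually waiting.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class PassThrough {
    /**
     * Copies everything from in to out using a buffer of the given size
     * and returns the number of bytes copied. Neither stream is closed.
     */
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }
}
```

A call such as PassThrough.copy(download, upload, 256 * 1024) would then be the unit of work handed to each thread of a blocking IO pool.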

Related

File synchronizer architecture

I have to build a file synchronizer: an application that synchronizes a large number of data files around the clock (24 hours a day) from many external systems to my local system, essentially using FTP, SFTP and NFS.
There are more than twenty streams; the logic is slightly different for each of them and it must be configurable.
One of the requirements is that if one of the streams goes down for some reason, it must be possible to bring it back up without restarting the entire system.
Another requirement is that the transfer rate is balanced. In other words, it must not happen that some streams are fully synchronized while another stream is 10 hours behind.
I am unsure about which architecture to implement: with a single multithreaded system I would have a very high thread count (more than 100, I would say), and fulfilling the two requirements outlined above would make it complicated.
I was thinking of running several processes, or different instances of the same process, even if it seems a little "ugly". That way some of the load balancing would be done by the operating system, and it would be simpler to kill or start a stream. Performance might even be better, as several processes could use much more RAM. Does anyone have any tips or advice? Thanks a lot, and sorry for my poor English. Gian

As @kayaman said, 100 threads is not a lot. If that means 100 threads per unit of work, and you will have many units of work, implying an increase in threads of several orders of magnitude, I would suggest having a look at Fibers.
As long as you don't block the fibers, you can have 100000+ fibers running over a couple (typically number of CPU cores) of threads. Each fiber would then just wait for a callback from the process before continuing.
To access your endpoints and handle them in similar ways, have a look at Apache Camel - it will allow you to stream the FTP, SFTP, etc and handle each as just another endpoint (in theory you should be able to plug email in as well and stream packets that are emailed to the endpoint)
Regarding balancing the streams, this is business logic you need to implement. If one stream is receiving packets faster than another stream, you should be able to limit the rate by not requesting more packets under certain conditions. Need some more information on how you retrieve the packages and which libraries you are using in order to be of better assistance here.

What is the primary use case for asynchronous I/O

My application is not web based, just need to use sockets to service around 1000 clients. Throughput and latency are of utmost importance to me. Currently using select() from NIO but thinking of moving on to asynchronous IO in NIO.2.
When should asynchronous I/O be used?
What is the primary use case for asynchronous I/O?

If you are using Infiniband networks I would suggest looking at the Asynchronous IO.
Comparing Java 7 Async NIO with NIO.
However, if you are using regular Ethernet, it is just as likely to be slower as faster. The coding is more complicated than using non-blocking IO, which in turn is more complicated than using blocking IO.
If latency is of utmost importance I suggest you look at using kernel by-pass network adapters like Solarflare. However if a 100 micro-second latency is acceptable to you, it is unlikely you need this.

Asynchronous IO is very good in situations where you need to scale to handle many concurrent connections. Historically one way of doing this was to dedicate a thread per connection, but this approach has limitations you can read about here. With asynchronous IO you can more easily handle many connections with fewer threads and thus scale better.
Depending on the problem it might or might not be the right approach, as nothing can beat a single thread when it comes to latency. However, this is a very extreme end and means you care about microseconds. In many cases it is a trade-off between latency and throughput/scalability.
In any case it comes down to what you want to solve and what your expectations (numbers) need to be. Asynchronous IO is great in many cases, and so is synchronous IO. There might also be other things you want to consider, such as the protocol. Multiple clients interested in the same data (streaming) could indicate you want to look at multicast. If traffic is dedicated per client, then that might not be the approach.
Not knowing your latency requirements, but assuming they are not in the range of a few microseconds, I would definitely look into asynchronous IO just from reading that you have 1000 clients. Asynchronous IO is by no means slow, and synchronous/single-thread connections might not scale well for you.
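To make the NIO.2 style concrete, here is a minimal sketch of an echo server built on AsynchronousServerSocketChannel and CompletionHandler. The class name AsyncEchoServer, the echo protocol and the 4096-byte buffer are assumptions chosen for illustration; it says nothing about whether this would outperform a select() loop for 1000 clients.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousServerSocketChannel;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;

final class AsyncEchoServer implements AutoCloseable {
    private final AsynchronousServerSocketChannel server;

    /** Binds to the loopback interface; port 0 lets the OS pick a free port. */
    AsyncEchoServer(int port) throws IOException {
        server = AsynchronousServerSocketChannel.open()
                .bind(new InetSocketAddress("127.0.0.1", port));
        acceptNext();
    }

    int port() throws IOException {
        return ((InetSocketAddress) server.getLocalAddress()).getPort();
    }

    private void acceptNext() {
        server.accept(null, new CompletionHandler<AsynchronousSocketChannel, Void>() {
            @Override public void completed(AsynchronousSocketChannel client, Void att) {
                acceptNext(); // keep accepting further clients
                readNext(client, ByteBuffer.allocate(4096));
            }
            @Override public void failed(Throwable exc, Void att) {
                // Typically reached once the server channel is closed; nothing to do.
            }
        });
    }

    private void readNext(AsynchronousSocketChannel client, ByteBuffer buf) {
        client.read(buf, null, new CompletionHandler<Integer, Void>() {
            @Override public void completed(Integer n, Void att) {
                if (n < 0) { closeQuietly(client); return; } // peer closed the connection
                buf.flip();
                writeAll(client, buf);
            }
            @Override public void failed(Throwable exc, Void att) { closeQuietly(client); }
        });
    }

    private void writeAll(AsynchronousSocketChannel client, ByteBuffer buf) {
        client.write(buf, null, new CompletionHandler<Integer, Void>() {
            @Override public void completed(Integer n, Void att) {
                if (buf.hasRemaining()) { writeAll(client, buf); return; } // partial write
                buf.clear();
                readNext(client, buf); // only one read or write is pending per client at a time
            }
            @Override public void failed(Throwable exc, Void att) { closeQuietly(client); }
        });
    }

    private static void closeQuietly(AsynchronousSocketChannel c) {
        try { c.close(); } catch (IOException ignored) { }
    }

    @Override public void close() throws IOException { server.close(); }
}
```

No thread is parked per connection here: each callback runs on a thread from the channel group's pool only when the operating system reports a completed operation.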

Synchronizing several threads writing to the same file in java

Of course there is the obvious way of using "synchronized".
But I'm creating a system designed for running on several cores
and writing to that file several times within the same millisecond.
So I believe that using synchronize will hurt performance badly.
I was thinking of using the Pipe class of java (but not sure if it will help)
or having each thread write to a different file and an additional thread collecting
those writings, creating the final result.
I should mention that the order of the writings isn't important and it is timestamped
in nanotime anyway.
What is the better idea of those two? have any other suggestions?
thanks.

Using some sort of synchronization (eg. single mutexes) is quite easy to implement.
If I had enough RAM I would create a queue of some sort for each log-producer thread, and a log-consumer thread to read from all the queues in a round-robin fashion (or something like that).
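A simplified sketch of that idea follows. Note the swap: instead of one queue per producer read round-robin, it uses a single shared LinkedBlockingQueue drained by one consumer thread, which is easier to show and keeps all file access on one thread. The class name QueuedFileWriter and the sentinel-based shutdown are assumptions made for the example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class QueuedFileWriter implements AutoCloseable {
    // Unique sentinel object, compared by identity, that tells the consumer to stop.
    private static final String STOP = new String("STOP");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread consumer;
    private volatile IOException failure;

    QueuedFileWriter(Path file) {
        consumer = new Thread(() -> drainTo(file), "log-consumer");
        consumer.start();
    }

    /** Called by producer threads; only enqueues and never touches the file. */
    void write(String line) {
        queue.add(line);
    }

    /** Runs on the consumer thread: the only code that writes to the file. */
    private void drainTo(Path file) {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String line = queue.take(); line != STOP; line = queue.take()) {
                out.write(line);
                out.newLine();
            }
        } catch (IOException e) {
            failure = e;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Flushes everything queued so far, then stops the consumer. */
    @Override public void close() throws IOException {
        queue.add(STOP);
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (failure != null) throw failure;
    }
}
```

The queue is unbounded, so this trades memory for producer speed, in line with the "if I had enough RAM" caveat; a bounded queue would instead make producers wait when the disk falls behind.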

Not a direct answer to your question, but the logback project has synchronization facilities built into it for writing to the same file from different threads, so you might try to use it if it suits your needs, or at least take a look at its source code. Since it's built for speed, I'm pretty sure the implementation isn't to be taken for granted.

You are right to be concerned, you are not going to be able to have all the threads write to the same file without a performance problem.
When I had this problem (writing my own logging, way back before Log4j) I created two fixed-size buffers in memory and had all the producer threads write to one buffer while a dedicated consumer thread read from the other buffer and wrote to a file. That way the writer threads had to synchronize only on getting and incrementing the index to the buffer and when the buffers were being swapped, and it only blocked when the current buffer was full. It was memory-intensive but fast.
For other ideas you could check out how loggers like Log4j and Logback work, they would have had to solve this problem.

Try to use JMS. All your processes running on different machines may send JMS message instead of writing to file. Create only one queue receiver that receives messages and writes them to file. Log4J already has this functionality: see JMSAppender.

Does log4j use NIO for writing data to file?

It seems to be pretty fast already and I was just wondering if anyone knows whether it's using NIO. I tried searching the whole source code for NIO (well, it's kind of a way to search :) lol), but did not hit anything. Also, if it's not using NIO, do you think it's worthwhile to modify log4j to use NIO to make it faster? Any pointers, advice and links to resources would be very helpful.

Also, if it's not using NIO, do you think it's worthwhile to modify log4j to use NIO to make it faster?
No, unless logging is a significant part of your application's activities, in which case there is usually something wrong.
You seem to be under the impression that NIO 'is faster', but this is not true in general. Just try creating two files, one with standard IO and one with NIO, writing a bunch of data to them and closing them. You'll see the performance hardly differs. NIO will only perform better in certain use cases; most usually the case of many connections.
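The experiment described above could look roughly like the sketch below, which writes the same data once through a FileOutputStream and once through a FileChannel and returns the elapsed nanoseconds for each. The class and method names are made up for illustration, and a single timed run like this is not a rigorous benchmark (no warm-up, and the OS page cache dominates).

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class IoVsNio {
    /** Writes chunk to file 'repeats' times with java.io; returns elapsed nanoseconds. */
    static long writeWithStream(Path file, byte[] chunk, int repeats) throws IOException {
        long start = System.nanoTime();
        try (OutputStream out = new FileOutputStream(file.toFile())) {
            for (int i = 0; i < repeats; i++) {
                out.write(chunk);
            }
        }
        return System.nanoTime() - start;
    }

    /** Writes chunk to file 'repeats' times with a FileChannel; returns elapsed nanoseconds. */
    static long writeWithChannel(Path file, byte[] chunk, int repeats) throws IOException {
        long start = System.nanoTime();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(chunk);
            for (int i = 0; i < repeats; i++) {
                buf.rewind(); // reuse the same buffer from position 0
                while (buf.hasRemaining()) {
                    ch.write(buf); // a single call may write only part of the buffer
                }
            }
        }
        return System.nanoTime() - start;
    }
}
```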

Check out the FileAppender source. Pretty much just standard java.io.

I don't see any reason why FileChannel could be any faster than FileOutputStream in this case.
Maybe by using MappedByteBuffer? But in append mode, the behavior is OS dependent.
Ultimately, the performance depends on your hard drive; your code matters very little.

Elaborating on Confusion's answer, File NIO blocks as well. That's why it is not faster than traditional File IO in some scenarios. Quoting O'Reilly's Java NIO book:
File channels are always blocking and cannot be placed into
nonblocking mode. Modern operating systems have sophisticated caching
and prefetch algorithms that usually give local disk I/O very low
latency. Network filesystems generally have higher latencies but often
benefit from the same optimizations. The nonblocking paradigm of
stream-oriented I/O doesn't make as much sense for file-oriented
operations because of the fundamentally different nature of file I/O.
For file I/O, the true winner is asynchronous I/O, which lets a
process request one or more I/O operations from the operating system
but does not wait for them to complete. The process is notified at a
later time that the requested I/O has completed. Asynchronous I/O is
an advanced capability not available on many operating systems. It is
under consideration as a future NIO enhancement.
Edit: With that said, you can get better read/write efficiency if you use File NIO with a MappedByteBuffer. Note that using MappedByteBuffer in Log4j 2 is under consideration.

How do I get Java to use my multi-core processor with GZIPInputStream?

I'm using a GZIPInputStream in my program, and I know that the performance would be helped if I could get Java running my program in parallel.
In general, is there a command-line option for the standard VM to run on many cores? It's running on just one as it is.
Thanks!
Edit
I'm running plain ol' Java SE 6 update 17 on Windows XP.
Would putting the GZIPInputStream on a separate thread explicitly help? No! Do not put the GZIPInputStream on a separate thread! Do NOT multithread I/O!
Edit 2
I suppose I/O is the bottleneck, as I'm reading and writing to the same disk...
In general, though, is there a way to make GZIPInputStream faster? Or a replacement for GZIPInputStream that runs parallel?
Edit 3
Code snippet I used:
GZIPInputStream gzip = new GZIPInputStream(new FileInputStream(INPUT_FILENAME));
DataInputStream in = new DataInputStream(new BufferedInputStream(gzip));

AFAIK the action of reading from this stream is single-threaded, so multiple CPUs won't help you if you're reading one file.
You could, however, have multiple threads, each unzipping a different file.
That being said, unzipping is not particularly calculation intensive these days, you're more likely to be blocked by the cost of IO (e.g., if you are reading two very large files in two different areas of the HD).
More generally (assuming this is a question of someone new to Java), Java doesn't do things in parallel for you. You have to use threads to tell it what are the units of work that you want to do and how to synchronize between them. Java (with the help of the OS) will generally take as many cores as is available to it, and will also swap threads on the same core if there are more threads than cores (which is typically the case).
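As a sketch of the "multiple threads, each unzipping a different file" idea, the following uses a fixed thread pool with one task per file. The names ParallelGunzip, decompressedSize and decompressAll are hypothetical, and counting the decompressed bytes merely stands in for whatever processing the program really does.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;

final class ParallelGunzip {
    /** Decompresses one .gz file and returns the number of decompressed bytes. */
    static long decompressedSize(Path gzFile) throws IOException {
        try (InputStream in = new GZIPInputStream(
                new BufferedInputStream(Files.newInputStream(gzFile)))) {
            byte[] buf = new byte[64 * 1024];
            long total = 0;
            for (int n; (n = in.read(buf)) != -1; ) {
                total += n;
            }
            return total;
        }
    }

    /** Runs one decompression task per file on a pool; results keep the input order. */
    static List<Long> decompressAll(List<Path> gzFiles, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (Path p : gzFiles) {
                futures.add(pool.submit(() -> decompressedSize(p)));
            }
            List<Long> sizes = new ArrayList<>();
            for (Future<Long> f : futures) {
                sizes.add(f.get()); // waits for that file's task to finish
            }
            return sizes;
        } finally {
            pool.shutdown();
        }
    }
}
```

Each individual stream is still read by a single thread; the parallelism comes only from handling several files at once, so it helps only when the disk can keep up.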

PIGZ = Parallel Implementation of GZip is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data. http://www.zlib.net/pigz/ It's not in Java yet; any takers? Of course the world needs it in Java.
Sometimes the compression or decompression is a big CPU-consumer, though it helps the I/O not be the bottleneck.
See also Dataseries (C++) from HP Labs. PIGZ only parallelizes the compression, while Dataseries breaks the output into large compressed blocks, which are decompressible in parallel. Also has a number of other features.

Wrap your GZIP streams in Buffered streams, this should give you a significant performance increase.
    OutputStream out = new BufferedOutputStream(
        new GZIPOutputStream(
            new FileOutputStream(myFile)
        )
    );
And likewise for the input stream. Using the buffered input/output streams reduces the number of disk reads.

I'm not seeing any answer addressing the other processing of your program.
If you're just unzipping a file, you'd be better off simply using the command line gunzip tool; but likely there's some processing happening with the files you're pulling out of that stream.
If you're extracting something that comes in reasonably sized chunks, then your processing of those chunks should be happening in a separate thread from the unzipping.
You could manually start a Thread on each large String or other block of data; but since Java 5 you'd be better off with one of the classes in java.util.concurrent, such as a ThreadPoolExecutor.
Update
It's not clear to me from the question and other comments whether you really ARE just extracting files using Java. If you really, really think you should try to compete with gunzip, then you can probably gain some performance by using large buffers; i.e. work with a buffer of, say, 10 MB (binary, not decimal: 10 × 1048576 = 10485760 bytes), fill that in a single gulp and write it to disk likewise. That will give your OS a chance to do some medium-scale planning for disk space, and you'll need fewer system-level calls too.

Compression seems like a hard case for parallelization because the bytes emitted by the compressor are a non-trivial function of the previous W bytes of input, where W is the window size. You can obviously break a file into pieces and create independent compression streams for each of the pieces that run in their own threads. You may need to retain some compression metadata so the decompressor knows how to put the file back together.
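One way to sketch the "independent compression streams per piece" idea in Java is shown below. It compresses fixed-size chunks in parallel and simply concatenates the results, relying on two assumptions: that a gzip file may consist of several members back to back, and that the JDK's GZIPInputStream reads such concatenated members as one stream (true for current JDKs). With this variant no extra metadata is needed, at the cost of a somewhat worse compression ratio because every chunk starts with an empty window. The class name ChunkedGzip and its parameters are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPOutputStream;

final class ChunkedGzip {
    /** Compresses one slice of data into a complete, standalone gzip member. */
    private static byte[] gzip(byte[] data, int off, int len) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(data, off, len);
        }
        return bytes.toByteArray();
    }

    /** Compresses chunks in parallel and concatenates the members in input order. */
    static byte[] compress(byte[] data, int chunkSize, int threads) throws Exception {
        if (data.length == 0) {
            return gzip(data, 0, 0); // still emit one valid (empty) member
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> parts = new ArrayList<>();
            for (int off = 0; off < data.length; off += chunkSize) {
                final int start = off;
                final int len = Math.min(chunkSize, data.length - off);
                parts.add(pool.submit(() -> gzip(data, start, len)));
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (Future<byte[]> part : parts) {
                out.write(part.get()); // futures are consumed in submission order
            }
            return out.toByteArray();
        } finally {
            pool.shutdown();
        }
    }
}
```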

Compression and decompression using gzip is a serialized process. To use multiple threads you would have to write a custom program to break up the input file into many streams, and then a custom program to decompress and join them back together. Either way, IO is going to be a bottleneck WAY before CPU usage is.

Run multiple VMs. Each VM is a process and you should be able to run at least three processes per core without suffering any drop in performance. Of course, your application would have to be able to leverage multiprocessing in order to benefit. There is no magic bullet which is why you see articles in the press moaning about how we don't yet know how to use multicore machines.
However, there are lots of people out there who have structured their applications into a master which manages a pool of worker processes and parcels out work packages to them. Not all problems are amenable to being solved this way.

I think it is a mistake to assume that multithreading IO is always evil. You probably need to profile your particular case to be sure, because:
Recent operating systems use the currently free memory for the cache, and your files may actually not be on the hard drive when you are reading them.
Recent drives such as SSDs have much faster access times, so changing the reading location is much less of an issue.
The question is too general to assume we are reading from a single hard drive.
You may need to tune your read buffer to make it large enough to reduce the switching costs. In the extreme case, one can read all files into memory and decompress them there in parallel, which is faster and loses nothing to multithreaded IO. However, something less extreme may also work better.
You also do not need to do anything special to use multiple available cores on JRE. Different threads will normally use different cores as managed by the operating system.

You can't parallelize the standard GZIPInputStream, it is single threaded, but you can pipeline decompression and processing of the decompressed stream into different threads, i.e. set up the GZIPInputStream as a producer and whatever processes it as a consumer, and connect them with a bounded blocking queue.
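A minimal sketch of that producer/consumer pipeline: one thread decompresses and puts chunks on an ArrayBlockingQueue, while the calling thread takes and processes them. The class name GunzipPipeline, the 64 KiB chunk size and the byte-sum "processing" are assumptions used only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.zip.GZIPInputStream;

final class GunzipPipeline {
    // Sentinel, compared by identity, that marks the end of the stream.
    private static final byte[] END = new byte[0];

    /**
     * Decompresses on a producer thread while the calling thread consumes chunks.
     * Returns the sum of all decompressed bytes as stand-in processing.
     */
    static long process(InputStream compressed, int queueCapacity) throws Exception {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(queueCapacity);
        Throwable[] producerError = new Throwable[1];

        Thread producer = new Thread(() -> {
            try (InputStream in = new GZIPInputStream(compressed)) {
                byte[] buf = new byte[64 * 1024];
                for (int n; (n = in.read(buf)) != -1; ) {
                    if (n > 0) {
                        queue.put(Arrays.copyOf(buf, n)); // blocks while the queue is full
                    }
                }
            } catch (Throwable t) {
                producerError[0] = t;
            } finally {
                try {
                    queue.put(END); // always tell the consumer to stop
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "gunzip-producer");
        producer.start();

        long checksum = 0;
        for (byte[] chunk = queue.take(); chunk != END; chunk = queue.take()) {
            for (byte b : chunk) {
                checksum += b & 0xFF;
            }
        }
        producer.join(); // also makes producerError visible to this thread
        if (producerError[0] != null) {
            throw new IOException("decompression failed", producerError[0]);
        }
        return checksum;
    }
}
```

The bounded queue provides back-pressure: if processing is slower than decompression, the producer blocks in put() instead of filling memory.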
