Our application reads and writes lots of encrypted files, and we have created classes that implement InputStream and OutputStream to handle all these details. These implementations work fine (and are very convenient to use, by the way).
To be a bit more specific, when we write (in our OutputStream implementation), we utilize the Bouncy Castle (version 1.53) CMS classes and we compress, sign, encrypt, and sign again.
The problem we have is that we occasionally write large (>1GB after compression, 10+GB before compression) files, and these can take over an hour to write. During this process, one CPU is pegged; the bottleneck is the compression step (based on profiling). Not compressing is not an option: we get good compression (better than 10x), these files are transported across a network (so much smaller is much faster), and we pay to store these files essentially forever, so much smaller is cheaper.
The servers that we typically run this code on are AWS EC2 c5.4xlarge instances, which have lots of memory and 16 vCPUs.
Here is the code for where we set up the compression:
import org.bouncycastle.cms.CMSCompressedDataStreamGenerator;
import org.bouncycastle.cms.jcajce.ZlibCompressor;

CMSCompressedDataStreamGenerator compressedGenerator = new CMSCompressedDataStreamGenerator();
compressedStream = compressedGenerator.open(signedStream, new ZlibCompressor());
We have run into a similar situation where we GZIP (rather than encrypt) large files. We successfully worked around this performance bottleneck by using a parallel GZIP writer, as sketched below. With the parallel GZIP writer, we can effectively utilize all available CPUs and complete the compression in a fraction of the time the JDK default implementation takes.
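For illustration, a minimal sketch of the kind of parallel GZIP writing we mean, assuming the shevek/parallelgzip library and its ParallelGZIPOutputStream wrapper (this is not our exact code, and you should check that library's current API before relying on it):

// Assumes the shevek/parallelgzip library is on the classpath.
import org.anarres.parallelgzip.ParallelGZIPOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ParallelGzipSketch {
    public static void main(String[] args) throws Exception {
        // The stream fans deflate work out across the available cores.
        try (OutputStream out = new ParallelGZIPOutputStream(new FileOutputStream("big.gz"))) {
            Files.copy(Paths.get("big.dat"), out); // file names here are made up
        }
    }
}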
The questions:
Is there a way to configure/invoke Bouncy Castle to parallelize the ZlibCompressor?
Is there a different Bouncy Castle compressor that we can/should use that will provide equivalent compression and significantly improve throughput?
Related
Is there a way to do parallel zipping in Java?
I'm currently using ParallelScatterZipCreator, but unfortunately it only parallelizes per file. So if a single file is much larger than the others, the parallel zipping only helps for the smaller files; everything then has to wait until the large file is zipped serially.
Is there a better library out there that utilizes all CPU cores even if we're zipping a single file?
TL;DR: You may not need compression at all. If you do, then you probably don't want to use the zip format, it's outdated tech with sizable downsides, and clearly you have some fairly specific needs. You probably want ZStandard (zstd).
How does compression work
Compression works by looking at a blob of bytes and finding repetitions of some form within it. Thus, it is not possible to compress a single byte.
This makes the job fundamentally at odds with parallelising: if you take a blob of 1 million bytes and compress it by lopping it up into 10 chunks of 100k bytes each and compressing each mini-blob individually, then any repetition whose occurrences land in two different mini-blobs is a missed opportunity to compress data, one you would not have missed had you compressed the data as a single blob.
The only reason ZIP does let you parallelize a little bit is because it is an old format - sensible at the time, but in this age, just about every part of the ZIP format is crap.
Why is ZIP bad?
ZIP is a mixed bag, conflating two unrelated jobs.
A bundler. A bundling tool is some software that takes a bunch of files and turns that into a single stream (a single sack of bytes). To do so, the bundling tool will take the metadata about a file (its name, its owner/group or other access info, its last-modified time, etcetera), and the data within it, and serializes this into a single stream. zip does this, as does e.g. the posix tar tool.
A compressor. A compressor takes a stream of data and compresses it by finding repeated patterns.
zip in essence is only the first of these, a bundler; but as part of the bundling, the part with 'the data within this file' has a flag to indicate that a compression algorithm has been applied to the bytes representing the data. In theory you can use just about any algorithm, but in practice, every zip file has all entries compressed with the DEFLATE algorithm, which is vastly inferior to more modern algorithms.
.tar.gz is the exact same technology (first bundle it: tar file, then gzip that tar file. gzip is the DEFLATE algorithm), but vastly more efficient in certain cases (it applies compression to the entire stream vs. restarting from scratch for every file. If you take a sack of 1000 small, similar files, then that in .tar.gz form is orders of magnitude smaller than that in .zip form).
Also, zip is old, and it made choices defensible at the time but silly in modern systems: You can't 'stream' zips (you can't meaningfully start unpacking one until the whole file has been received), because the bundler's info is at the end of the file.
So why can I parallelize zips?
Because zips 'restart' their compression window on every file. This is inefficient and hurts the compression ratio of zip files.
You can apply the exact same principle to any block of data, if you want: trade compression efficiency for parallelizability. ZIP is the format that doesn't do it in a useful way; as you said, if you have one much larger file, the point is moot.
'Restart the window at' is a principle that can be generalized, and various compression formats support it in a much more useful fashion (restart every X bytes, vs. ZIP's unreliable 'restart at every file'), as the sketch below illustrates.
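To make the trade-off concrete, here is a minimal, hypothetical sketch of 'restart the window every X bytes' using the JDK's own Deflater: each fixed-size chunk becomes an independent deflate stream, compressed on its own thread. A matching decompressor would need to know the block boundaries (framing is not shown here), and any repetition spanning two blocks is lost - that is the compression-ratio price.

import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.Deflater;

public class BlockDeflate {
    // Compress each blockSize-byte slice as an independent deflate stream, in parallel.
    static List<byte[]> compressInBlocks(byte[] input, int blockSize) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<byte[]>> futures = new ArrayList<>();
        for (int off = 0; off < input.length; off += blockSize) {
            final int start = off;
            final int len = Math.min(blockSize, input.length - start);
            futures.add(pool.submit(() -> {
                Deflater deflater = new Deflater();
                deflater.setInput(input, start, len);
                deflater.finish();
                ByteArrayOutputStream block = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                while (!deflater.finished()) {
                    block.write(buf, 0, deflater.deflate(buf));
                }
                deflater.end();
                return block.toByteArray();
            }));
        }
        List<byte[]> blocks = new ArrayList<>();
        for (Future<byte[]> f : futures) {
            blocks.add(f.get()); // preserves block order
        }
        pool.shutdown();
        return blocks;
    }
}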
What is the bottleneck?
Multiple aspects are involved when sending data: The speed at which the source provides the bytes you want to send, the speed at which the bytes are processed into a package that can then be sent (e.g. a zip tool, but can be anything, including just sending it verbatim, uncompressed), the speed at which the packaged-up bytes are transited to the target system, the speed at which the target can unpack it, and the speed at which the target can then process the unpacked result.
Are you sure that the compression aspect is the bottleneck?
In the base case where you read the bytes off of a harddisk, zip them up, send them across a residential internet pipe to another system, and that system unzips and saves them on an HDD, it is rather likely that the bottleneck is the network. Parallelizing the compression step is a complete waste and in fact only slows things down by reducing compression ratios.
If you're reading files off of a spinning platter, then the speed of the source is likely the bottleneck, and parallel processing considerably slows things down: You're now asking the read head to bounce around, and this is much slower than reading the data sequentially in one go.
If you have a fast source, and a fast pipe, then the bottleneck is doubtlessly the compression and uncompression, but the solution is then not to compress at all: If you are transferring data off of SSDs or from a USB3-connected byte-spewing sensor and transfer it across a 10M CAT6 cable from one gigabit ethernet port to another, then why compress at all? Just send those bytes. Compression isn't going to make it any faster, and as long as you don't saturate that 1Gb connection, you gain absolutely nothing whatsoever by attempting to compress it.
If you have a slow pipe, then the only way to make things faster is to compress as much as you can. Which most definitely involves not using the DEFLATE algorithm (e.g. don't use zip). Use another algorithm and configure it to get better compression rates, at the cost of CPU performance. Parallelising is irrelevant; it's not the bottleneck, so there is no point whatsoever in doing it.
Conclusions
Most likely you want to either send your files over uncompressed, or ZStandard them over, tweaking the compression vs. speed ratio as needed. I'm not aware of any ZStandard (zstd) implementation in pure Java, but the zstd-jni project gives you a Java API for invoking the C zstd library, as sketched below.
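For reference, a minimal sketch using zstd-jni's ZstdOutputStream (the file names are made up; check the project's current API before relying on this):

import com.github.luben.zstd.ZstdOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ZstdSketch {
    public static void main(String[] args) throws Exception {
        // Level 3 is zstd's default; higher levels trade CPU for a better ratio.
        try (OutputStream out = new ZstdOutputStream(new FileOutputStream("data.zst"), 3)) {
            Files.copy(Paths.get("data.bin"), out);
        }
    }
}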
If you insist on sticking with ZIP, the answer is a fairly basic 'nope, you cannot really do that', though you could in theory write a parallel ZIP compressor that has worse compression power but parallelizes better (by restarting the window within a single file for larger files in addition to the forced-upon-you-by-the-format restart at every file), and produces ZIP files that are still compatible with just about every unzip tool on the planet. I'm not aware of one, I don't think one exists, and writing one yourself would be a decidedly non-trivial exercise.
I am trying to optimize the logging system of an Android application which causes some unwanted latency. There are multiple files opened which log different parts and should be kept separate.
I am not very familiar with low-level filesystem design, and even less with the current flash and/or SSD memory used in mobile phones (as opposed to traditional HDDs). I assume that memory is organized in disk blocks (512B, or 4096B more recently) and that some form of contiguous, linked, or indexed allocation is used.
I am using BufferedOutputStreams with a buffer size of 256B, but this value was chosen at random (this provides a good answer for buffer size).
Does writing in append mode to multiple open files create additional overhead that can significantly decrease performance (e.g. from the allocation strategy)? Is it greatly influenced by the output buffer size (in this particular case of multiple files)?
I am on Android, which tends to use a variety of filesystems, and this makes it hard to understand how each one influences appending to multiple open files. The I/O functions of Java (or any other language) are probably very similar.
My search for this particular issue turned up empty; maybe I need some domain-specific terms that I am not familiar with.
I have Scala (+ Java) code which reads and writes at a certain rate. Profiling tells me how much time each method in the code takes to execute. How do I measure whether my program is reaching its maximum efficiency, i.e. reading at the maximum speed possible with the given configuration? I know this is hardware-specific and varies from machine to machine. Is there a short way to measure whether my program is reading and writing at the fastest rate the hardware allows? (I'm using FileWriter along with BufferedWriter.)
With the description given, your best option might be experimenting. Things you could try and measure:
Changing the buffer size (this didn't help me when I tried it)
Switching to NIO (might help for large files)
Caching the data you read (might help for small files), and caching directory content if there are many files in it; opening a file gets slower as the number of files in a folder grows.
One way to make sure the problem is not in your own code is to profile it, get the CPU time distribution tree for your method, and expand the execution paths taking most of the time. If all these paths lead into the Java standard libraries, you're probably already at your best performance.
UPDATE
Some other things & techniques, based on the hprof you provided.
With a profiler or some other technique (I prefer stopwatches, as they give more stable and realistic results; see the stopwatch sketch below) you need to find your bottleneck.
Most of the IO can be optimized to use a single buffer - this is less painful with Guava or Apache Commons IO.
However, there is not much you can do if Jackson is in your serialization chain and it turns out to be your bottleneck. Changing the algorithm, perhaps?
There are slow parts (compared to native filesystem IO), e.g. formatters (String.format is very slow), Jackson, etc.
There are typical slow operations with IO, e.g. buffer allocations and string concatenation; too many allocated char[] buffers is a smell calling for IO optimization.
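A minimal stopwatch sketch of the kind meant above, using plain System.nanoTime() (the payload and file name are made up; substitute the step you actually want to measure):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

public class StopwatchDemo {
    public static void main(String[] args) throws Exception {
        byte[] payload = new byte[10 * 1024 * 1024]; // stand-in for your serialized data
        long t0 = System.nanoTime();
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("out.bin"))) {
            out.write(payload); // the step under test
        }
        long t1 = System.nanoTime();
        System.out.printf("write+flush+close: %d ms%n", (t1 - t0) / 1_000_000);
    }
}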
I'm using a GZIPInputStream in my program, and I know that performance would improve if I could get Java to run my program in parallel.
In general, is there a command-line option for the standard VM to run on many cores? It's running on just one as it is.
Thanks!
Edit
I'm running plain ol' Java SE 6 update 17 on Windows XP.
Would putting the GZIPInputStream on a separate thread explicitly help? (Answered in the comments: No! Do not put the GZIPInputStream on a separate thread! Do NOT multithread I/O!)
Edit 2
I suppose I/O is the bottleneck, as I'm reading and writing to the same disk...
In general, though, is there a way to make GZIPInputStream faster? Or a replacement for GZIPInputStream that runs parallel?
Edit 3
Code snippet I used:
GZIPInputStream gzip = new GZIPInputStream(new FileInputStream(INPUT_FILENAME));
DataInputStream in = new DataInputStream(new BufferedInputStream(gzip));
AFAIK the action of reading from this stream is single-threaded, so multiple CPUs won't help you if you're reading one file.
You could, however, have multiple threads, each unzipping a different file.
That being said, unzipping is not particularly computation-intensive these days; you're more likely to be blocked by the cost of IO (e.g., if you are reading two very large files from two different areas of the HD).
More generally (assuming this is a question from someone new to Java), Java doesn't do things in parallel for you. You have to use threads to tell it what units of work you want done and how to synchronize between them. Java (with the help of the OS) will generally use as many cores as are available to it, and will swap threads on the same core if there are more threads than cores (which is typically the case).
PIGZ = Parallel Implementation of GZip is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data. http://www.zlib.net/pigz/ It's not Java yet - any takers? Of course the world needs it in Java.
Sometimes the compression or decompression is a big CPU-consumer, though it helps the I/O not be the bottleneck.
See also DataSeries (C++) from HP Labs. PIGZ only parallelizes the compression, while DataSeries breaks the output into large compressed blocks, which are decompressible in parallel. It also has a number of other features.
Wrap your GZIP streams in Buffered streams, this should give you a significant performance increase.
OutputStream out = new BufferedOutputStream(
    new GZIPOutputStream(
        new FileOutputStream(myFile)
    )
);
And likewise for the input stream, as shown below. Using the buffered input/output streams reduces the number of disk reads.
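For the input side, the same idea looks like this:

InputStream in = new BufferedInputStream(
    new GZIPInputStream(
        new FileInputStream(myFile)
    )
);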
I'm not seeing any answer addressing the other processing of your program.
If you're just unzipping a file, you'd be better off simply using the command line gunzip tool; but likely there's some processing happening with the files you're pulling out of that stream.
If you're extracting something that comes in reasonably sized chunks, then your processing of those chunks should be happening in a separate thread from the unzipping.
You could manually start a Thread for each large String or other block of data; but since Java 1.6 or so you'd be better off with one of the fancy new classes in java.util.concurrent, such as a ThreadPoolExecutor. A sketch follows.
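A minimal sketch of that shape, assuming fixed-size chunks and a hypothetical processChunk step (chunk size and the processing itself are placeholders):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.GZIPInputStream;

public class UnzipAndProcess {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new FileInputStream(args[0])))) {
            byte[] chunk = new byte[1 << 20];
            int n;
            while ((n = in.read(chunk)) > 0) {
                byte[] copy = java.util.Arrays.copyOf(chunk, n);
                pool.submit(() -> processChunk(copy)); // processing runs off the unzip thread
            }
        }
        pool.shutdown();
    }

    static void processChunk(byte[] data) { /* hypothetical per-chunk work */ }
}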
Update
It's not clear to me from the question and other comments whether you really ARE just extracting files using Java. If you really, really think you should try to compete with gunzip, then you can probably gain some performance by using large buffers; i.e. work with a buffer of, say, 10 MiB (binary, not decimal: 10 x 1,048,576 bytes), fill that in a single gulp and write it to disk likewise (see the sketch below). That will give your OS a chance to do some medium-scale planning for disk space, and you'll need fewer system-level calls too.
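A sketch of that approach, with an assumed 10 MiB working buffer (file names come from the command line; sizes are illustrative):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

public class BigBufferGunzip {
    public static void main(String[] args) throws Exception {
        byte[] buf = new byte[10 * 1024 * 1024]; // 10 MiB working buffer
        try (InputStream in = new GZIPInputStream(new FileInputStream(args[0]), 64 * 1024);
             OutputStream out = new FileOutputStream(args[1])) {
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n); // large writes mean fewer system calls
            }
        }
    }
}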
Compression seems like a hard case for parallelization because the bytes emitted by the compressor are a non-trivial function of the previous W bytes of input, where W is the window size. You can obviously break a file into pieces and create independent compression streams for each piece, each running in its own thread. You may need to retain some compression metadata so the decompressor knows how to put the file back together.
Compression and decompression using gzip is a serialized process. To use multiple threads you would have to write a custom program to break the input file up into many streams, and then a custom program to decompress and join them back together. Either way, IO is going to be a bottleneck WAY before CPU usage is.
Run multiple VMs. Each VM is a process and you should be able to run at least three processes per core without suffering any drop in performance. Of course, your application would have to be able to leverage multiprocessing in order to benefit. There is no magic bullet which is why you see articles in the press moaning about how we don't yet know how to use multicore machines.
However, there are lots of people out there who have structured their applications into a master which manages a pool of worker processes and parcels out work packages to them. Not all problems are amenable to being solved this way.
I think it is a mistake to assume that multithreading IO is always evil. You probably need to profile your particular case to be sure, because:
Recent operating systems use the currently free memory for the cache, and your files may actually not be on the hard drive when you are reading them.
Recent drives such as SSDs have much faster access times, so changing the reading location is much less of an issue.
The question is too general to assume we are reading from a single hard drive.
You may need to tune your read buffer to make it large enough to reduce the switching costs. In the boundary case, one can read all files into memory and decompress there in parallel - faster, with no loss from IO multithreading. However, something less extreme may also work better.
You also do not need to do anything special to use the multiple available cores on the JRE. Different threads will normally use different cores, as managed by the operating system.
You can't parallelize the standard GZIPInputStream; it is single-threaded. But you can pipeline decompression and processing of the decompressed stream into different threads, i.e. set up the GZIPInputStream as a producer and whatever processes it as a consumer, and connect them with a bounded blocking queue, as sketched below.
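A minimal sketch of that producer/consumer arrangement (the consume step is a hypothetical placeholder for your processing):

import java.io.FileInputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.zip.GZIPInputStream;

public class PipelinedGunzip {
    private static final byte[] POISON = new byte[0]; // end-of-stream marker

    public static void main(String[] args) throws Exception {
        // Bounded queue: applies backpressure if the consumer falls behind.
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(16);

        Thread producer = new Thread(() -> {
            try (GZIPInputStream in = new GZIPInputStream(new FileInputStream(args[0]))) {
                byte[] buf = new byte[1 << 16];
                int n;
                while ((n = in.read(buf)) > 0) {
                    queue.put(java.util.Arrays.copyOf(buf, n));
                }
                queue.put(POISON);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        producer.start();

        byte[] chunk;
        while ((chunk = queue.take()) != POISON) {
            consume(chunk); // processing overlaps with decompression
        }
    }

    static void consume(byte[] data) { /* hypothetical per-chunk work */ }
}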
I support a legacy Java application that uses flat files (plain text) for persistence. Due to the nature of the application, the size of these files can reach hundreds of MB per day, and often the limiting factor in application performance is file IO. Currently, the application uses a plain ol' java.io.FileOutputStream to write data to disk.
Recently, we've had several developers assert that using memory-mapped files, implemented in native code (C/C++) and accessed via JNI, would provide greater performance. However, FileOutputStream already uses native methods for its core methods (i.e. write(byte[])), so it appears a tenuous assumption without hard data or at least anecdotal evidence.
I have several questions on this:
1. Is this assertion really true? Will memory-mapped files always provide faster IO compared to Java's FileOutputStream?
2. Does the class MappedByteBuffer, accessed from a FileChannel, provide the same functionality as a native memory-mapped file library accessed via JNI? What is MappedByteBuffer lacking that might lead you to use a JNI solution?
3. What are the risks of using memory-mapped files for disk IO in a production application? That is, applications that have continuous uptime with minimal reboots (once a month, max). Real-life anecdotes from production applications (Java or otherwise) preferred.
Question #3 is important - I could answer this question myself partially by writing a "toy" application that perf tests IO using the various options described above, but by posting to SO I'm hoping for real-world anecdotes / data to chew on.
[EDIT] Clarification - each day of operation, the application creates multiple files that range in size from 100MB to 1 gig. In total, the application might be writing out multiple gigs of data per day.
Memory mapped I/O will not make your disks run faster(!). For linear access it seems a bit pointless.
A NIO mapped buffer is the real thing (usual caveat about any reasonable implementation).
As with other NIO direct allocated buffers, the buffers are not normal memory and won't get GCed as efficiently. If you create many of them, you may find that you run out of memory/address space without running out of Java heap. This is obviously a worry with long-running processes.
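For reference, a minimal sketch of an NIO mapped buffer in use (file name and region size are made up):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedWrite {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("data.bin", "rw");
             FileChannel channel = file.getChannel()) {
            // Map a 64 MB region; writes land in the page cache and the OS flushes them.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 64 * 1024 * 1024);
            buf.put("hello".getBytes());
            buf.force(); // explicit flush to disk, if durability matters here
        }
    }
}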
You might be able to speed things up a bit by examining how your data is being buffered during writes. This tends to be application specific as you would need an idea of the expected data writing patterns. If data consistency is important, there will be tradeoffs here.
If you are just writing out new data to disk from your application, memory mapped I/O probably won't help much. I don't see any reason you would want to invest time in some custom coded native solution. It just seems like too much complexity for your application, from what you have provided so far.
If you are sure you really need better I/O performance - or just O performance in your case, I would look into a hardware solution such as a tuned disk array. Throwing more hardware at the problem is often times more cost effective from a business point of view than spending time optimizing software. It is also usually quicker to implement and more reliable.
In general, there are a lot of pitfalls in over optimization of software. You will introduce new types of problems to your application. You might run into memory issues/ GC thrashing which would lead to more maintenance/tuning. The worst part is that many of these issues will be hard to test before going into production.
If it were my app, I would probably stick with the FileOutputStream with some possibly tuned buffering. After that I'd use the time honored solution of throwing more hardware at it.
From my experience, memory-mapped files perform MUCH better than plain file access in both real-time and persistence use cases. I've worked primarily with C++ on Windows, but Linux performance is similar, and you're planning to use JNI anyway, so I think it applies to your problem.
For an example of a persistence engine built on memory mapped file, see Metakit. I've used it in an application where objects were simple views over memory-mapped data, the engine took care of all the mapping stuff behind the curtains. This was both fast and memory efficient (at least compared with traditional approaches like those the previous version used), and we got commit/rollback transactions for free.
In another project I had to write multicast network applications. The data was sent in randomized order to minimize the impact of consecutive packet loss (combined with FEC and blocking schemes). Moreover, the data could well exceed the address space (video files were larger than 2Gb), so memory allocation was out of the question. On the server side, file sections were memory-mapped on demand and the network layer directly picked the data from these views; as a consequence, memory usage was very low. On the receiver side, there was no way to predict the order in which packets would be received, so it had to maintain a limited number of active views on the target file, and data was copied directly into these views. When a packet had to be put in an unmapped area, the oldest view was unmapped (and eventually flushed to the file by the system) and replaced by a new view on the destination area. Performance was outstanding, notably because the system did a great job of committing data as a background task, and real-time constraints were easily met.
Since then I'm convinced that even the best fine-crafted software scheme cannot beat the system's default I/O policy with memory-mapped files, because the system knows more than user-space applications about when and how data must be written. Also, what is important to know is that memory mapping is a must when dealing with large data, because the data is never allocated (hence never consuming memory) but dynamically mapped into the address space and managed by the system's virtual memory manager, which is always faster than the heap. So the system always uses the memory optimally, and commits data whenever it needs to, behind the application's back, without impacting it.
Hope it helps.
As for point 3 - if the machine crashes and there are any pages that were not flushed to disk, then they are lost. Another thing is the waste of address space - mapping a file to memory consumes address space (and requires a contiguous area), and, well, on 32-bit machines it's a bit limited. But you've said about 100MB, so it should not be a problem. And one more thing - expanding the size of the mmapped file requires some work.
By the way, this SO discussion can also give you some insights.
If you write fewer bytes, it will be faster. What if you filtered it through a GZIPOutputStream, or what if you wrote your data into ZipFiles or JarFiles?
As mentioned above, use NIO (a.k.a. new IO). There's also a new, new IO (NIO.2) coming out.
The proper use of a RAID hard drive solution would help you, but that would be a pain.
I really like the idea of compressing the data. Go for the GZIPOutputStream, dude! That would double your throughput if the CPU can keep up. It is likely that you can take advantage of the now-standard dual-core machines, eh?
-Stosh
I did a study where I compared write performance to a raw ByteBuffer versus write performance to a MappedByteBuffer. Memory-mapped files are supported by the OS, and their write latencies are very good, as you can see in my benchmark numbers. Performing synchronous writes through a FileChannel is approximately 20 times slower, and that's why people do asynchronous logging all the time. In my study I also give an example of how to implement asynchronous logging through a lock-free and garbage-free queue, for ultimate performance very close to a raw ByteBuffer. A rough sketch of that shape follows.
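A rough, hypothetical sketch of the shape of such an asynchronous logger. Note the caveats: ConcurrentLinkedQueue is lock-free but not garbage-free (unlike the queue in the study), the writer thread busy-spins for brevity, and no bounds or wrap-around handling is shown for the mapped region:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.ConcurrentLinkedQueue;

public class AsyncMappedLogger {
    private final ConcurrentLinkedQueue<byte[]> queue = new ConcurrentLinkedQueue<>();
    private final MappedByteBuffer buf;
    private volatile boolean running = true;

    AsyncMappedLogger(String path, int size) throws Exception {
        FileChannel channel = new RandomAccessFile(path, "rw").getChannel();
        buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
        Thread writer = new Thread(() -> {
            while (running || !queue.isEmpty()) {
                byte[] msg = queue.poll();     // lock-free dequeue
                if (msg != null) buf.put(msg); // write lands in the page cache
            }
        });
        writer.setDaemon(true);
        writer.start();
    }

    void log(byte[] message) { queue.offer(message); } // caller never blocks on disk
    void close() { running = false; }
}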