Is there a way to do parallel zipping in Java?
I'm currently using ParallelScatterZipCreator, but unfortunately it only parallelizes per file. So if a single file is much larger than the others, the parallel zipping only helps with the smaller files; it then has to wait until the large file is zipped serially.
Is there a better library out there that utilizes all CPU cores even if we're zipping a single file?
TL;DR: You may not need compression at all. If you do, then you probably don't want to use the zip format, it's outdated tech with sizable downsides, and clearly you have some fairly specific needs. You probably want ZStandard (zstd).
How does compression work
Compression works by looking at a blob of bytes and finding repetitions of some form within it. Thus, it is not possible to compress a single byte.
This makes the job fundamentally at odds with parallelising: if you take a blob of 1 million bytes and compress it by chopping it up into 10 chunks of 100k bytes each and compressing each mini-blob individually, then any repetition where one occurrence sits in one mini-blob and another occurrence sits in a different one is a missed opportunity to compress data, an opportunity you would not have missed had you compressed the data as one blob.
The only reason ZIP does let you parallelize a little bit is that it is an old format - sensible at the time, but in this day and age just about every part of the ZIP format is crap.
Why is ZIP bad?
ZIP is a mixed bag, conflating two unrelated jobs.
A bundler. A bundling tool is some software that takes a bunch of files and turns them into a single stream (a single sack of bytes). To do so, the bundling tool takes the metadata about a file (its name, its owner/group or other access info, its last-modified time, etcetera) and the data within it, and serializes this into a single stream. zip does this, as does e.g. the posix tar tool.
A compressor. A compressor takes a stream of data and compresses it by finding repeated patterns.
zip in essence is only #1, but as part of the bundling, the 'data within this file' part carries a flag indicating that a compression algorithm has been applied to the bytes representing the data. In theory you can use just about any algorithm, but in practice every zip file has all entries compressed with the DEFLATE algorithm, which is vastly inferior to more modern algorithms.
.tar.gz is the exact same technology (first bundle it: tar file, then gzip that tar file. gzip is the DEFLATE algorithm), but vastly more efficient in certain cases (it applies compression to the entire stream vs. restarting from scratch for every file. If you take a sack of 1000 small, similar files, then that in .tar.gz form is orders of magnitude smaller than that in .zip form).
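As an illustration of the bundler/compressor split, here is a minimal sketch of producing a .tar.gz in Java. It assumes the Apache Commons Compress library is on the classpath, and the input file names are placeholders:

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;

public class TarGzExample {
    public static void main(String[] args) throws IOException {
        List<Path> files = List.of(Paths.get("a.txt"), Paths.get("b.txt"));   // placeholder inputs
        try (TarArchiveOutputStream tar = new TarArchiveOutputStream(         // the bundler
                new GzipCompressorOutputStream(                               // the compressor, over the whole tar stream
                        new BufferedOutputStream(new FileOutputStream("bundle.tar.gz"))))) {
            for (Path file : files) {
                TarArchiveEntry entry = new TarArchiveEntry(file.toFile(), file.getFileName().toString());
                tar.putArchiveEntry(entry);
                Files.copy(file, tar);        // all file data flows through one continuous compression window
                tar.closeArchiveEntry();
            }
        }
    }
}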
Also, zip is old, and it made choices defensible at the time but silly in modern systems: You can't 'stream' zips (you can't meaningfully start unpacking one until the whole file has been received), because the bundler's info is at the end of the file.
So why can I parallelize zips?
Because zips 'restart' their compression window on every file. This is inefficient and hurts the compression ratio of zip files.
You can apply the exact same principle to any block of data if you want: trade compression efficiency for parallelizability. ZIP just doesn't do it in a useful way; as you said, if you have one much larger file, the point is moot.
'Restart the window here' is a principle that can be generalized, and various compression formats support it in a much more useful fashion (restart every X bytes, vs. ZIP's unreliable 'restart at every file').
What is the bottleneck?
Multiple aspects are involved when sending data: The speed at which the source provides the bytes you want to send, the speed at which the bytes are processed into a package that can then be sent (e.g. a zip tool, but can be anything, including just sending it verbatim, uncompressed), the speed at which the packaged-up bytes are transited to the target system, the speed at which the target can unpack it, and the speed at which the target can then process the unpacked result.
Are you sure that the compression aspect is the bottleneck?
In the base case where you read the bytes off of a hard disk, zip them up, send them across a residential internet pipe to another system, that system unzips, and saves them on an HDD, it is rather likely that the bottleneck is the network. Parallelizing the compression step is a complete waste and in fact only slows things down by reducing compression ratios.
If you're reading files off of a spinning platter, then the speed of the source is likely the bottleneck, and parallel processing considerably slows things down: You're now asking the read head to bounce around, and this is much slower than reading the data sequentially in one go.
If you have a fast source, and a fast pipe, then the bottleneck is doubtlessly the compression and uncompression, but the solution is then not to compress at all: If you are transferring data off of SSDs or from a USB3-connected byte-spewing sensor and transfer it across a 10M CAT6 cable from one gigabit ethernet port to another, then why compress at all? Just send those bytes. Compression isn't going to make it any faster, and as long as you don't saturate that 1Gb connection, you gain absolutely nothing whatsoever by attempting to compress it.
If you have a slow pipe, then the only way to make things faster is to compress as much as you can. Which most definitely involves not using the DEFLATE algorithm (e.g. don't use zip). Use another algorithm and configure it to get better compression rates, at the cost of CPU performance. Parallelising is irrelevant; it's not the bottleneck, so there is no point whatsoever in doing it.
Conclusions
Most likely you want to either send your files over uncompressed, or ZStandard them over, tweaking the compression-vs-speed ratio as needed. I'm not aware of a pure-Java ZStandard (zstd) implementation, but the zstd-jni project gives you a Java API for invoking the C zstd library.
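For example, a minimal sketch using zstd-jni's streaming classes, assuming the com.github.luben:zstd-jni dependency is on the classpath (file names and the compression level are placeholders):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import com.github.luben.zstd.ZstdInputStream;
import com.github.luben.zstd.ZstdOutputStream;

public class ZstdExample {
    public static void main(String[] args) throws IOException {
        // Compress: higher levels trade CPU time for a better ratio
        try (FileInputStream in = new FileInputStream("input.bin");
             ZstdOutputStream out = new ZstdOutputStream(
                     new BufferedOutputStream(new FileOutputStream("input.bin.zst")), 6)) {
            in.transferTo(out);
        }
        // Decompress
        try (ZstdInputStream in = new ZstdInputStream(
                     new BufferedInputStream(new FileInputStream("input.bin.zst")));
             FileOutputStream out = new FileOutputStream("roundtrip.bin")) {
            in.transferTo(out);
        }
    }
}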
If you insist on sticking with ZIP, the answer is a fairly basic 'nope, you cannot really do that'. In theory you could write a parallel ZIP compressor that has worse compression but parallelizes better (by restarting the window within a single large file, in addition to the restart at every file that the format forces on you) and that still produces ZIP files compatible with just about every unzip tool on the planet. I'm not aware of one, I don't think one exists, and writing one yourself would be a decidedly non-trivial exercise.
Related
I need to compress PNG images on PC for data transmission in Python, transfer it to mobile and read it there in Java. I need lossless compression.
The hard part is compression and decompression in different environments and programming languages. I need to use something available for both languages. I've tried zlib, which technically should work, but it only decreases size by about 0.001% (with "best" compression setting 9).
What am I doing wrong with zlib, if anything?
What are the possible alternatives?
Is there any other way to go about this problem? I just need to transfer data as byte stream and the compression was my first thought to optimizing it.
Compressing an already-compressed file (like PNG or JPG) will normally not yield much and can occasionally even increase the file size.
It's not worth the effort.
I am compressing files of over 2GB in Java using a consecutive application of two compression algorithms; one LZ based and one Huffman-based. (This is similar to DEFLATE).
Since 2GB is too large to be held in any buffer, I have to pass the file through one algorithm outputting a temporary file, then pass that temporary file through the second algorithm outputting the final file.
An alternative is to compress the file in 8MB blocks (the size where I don't get an Out-Of-Memory error) but then I have an inability to take full advantage of the redundancy within the entire file.
Any ideas how to perform these operations more neatly, with no temporary files and no compressing in blocks? Do other compression tools compress in blocks? How do they deal with this issue? Regards
Java comes with the java.util.zip library to perform data compression in ZIP format.
The overall concept is quite straightforward.
The library reads the file with a FileInputStream, adds the file name to a ZipEntry, and writes it to a ZipOutputStream.
The imports java.util.zip.ZipEntry and java.util.zip.ZipOutputStream bring the ZIP classes into the program.
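A minimal sketch of that workflow (file names are placeholders):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipOneFile {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("report.txt");            // placeholder input file
             ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("report.zip"))) {
            zos.putNextEntry(new ZipEntry("report.txt"));   // entry name stored in the archive
            in.transferTo(zos);                             // copy the file's bytes into the entry
            zos.closeEntry();
        }
    }
}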
But how can I decompress a file?
What's wrong with piping streams? You can read from an InputStream, compress the bytes, and write them to an output stream that is connected to the input stream of the next algorithm. Take a look at PipedInputStream and PipedOutputStream.
I hope that these algorithms can work incrementally.
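If they can, a minimal sketch of the plumbing might look like this; the two stages below are stand-ins (DeflaterOutputStream and GZIPOutputStream) just to show how the pipe connects them without a temporary file, and the file names are placeholders:

import java.io.*;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

public class PipedStages {
    public static void main(String[] args) throws Exception {
        PipedOutputStream stage1Out = new PipedOutputStream();
        PipedInputStream stage2In = new PipedInputStream(stage1Out, 64 * 1024);   // 64 KB pipe buffer

        // Stage 1 (producer thread): read the big file and run the first algorithm,
        // writing its output into the pipe instead of a temporary file.
        Thread stage1 = new Thread(() -> {
            try (InputStream in = new BufferedInputStream(new FileInputStream("big.dat"));
                 OutputStream out = new DeflaterOutputStream(stage1Out)) {         // stand-in for the LZ stage
                in.transferTo(out);
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        stage1.start();

        // Stage 2 (this thread): read from the pipe and run the second algorithm.
        try (InputStream in = new BufferedInputStream(stage2In);
             OutputStream out = new GZIPOutputStream(new FileOutputStream("big.out"))) {   // stand-in for the Huffman stage
            in.transferTo(out);
        }
        stage1.join();
    }
}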
You could use two levels of java.util.zip. First, just concatenate all files (without compression). If possible, sort the entries by file type so that similar files are next to each other (this will increase compression ratio). Second, compress this stream. You don't need to run two separate phases; instead, you can wrap the first within the second stage, like CompressStream(ConcatenateFiles(directory)). That way you have a zip file within another zip file: the outer zip file is compressed, the inner is not and contains all the actual files.
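A minimal sketch of that nesting, assuming a flat directory of regular files (directory and file names are placeholders):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class TwoLevelZip {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("data");   // placeholder directory
        try (ZipOutputStream outer = new ZipOutputStream(
                new BufferedOutputStream(new FileOutputStream("bundle.zip")))) {
            outer.setLevel(Deflater.BEST_COMPRESSION);        // outer zip: the compression stage
            outer.putNextEntry(new ZipEntry("files.zip"));    // single entry holding the inner zip

            ZipOutputStream inner = new ZipOutputStream(outer);
            inner.setLevel(Deflater.NO_COMPRESSION);          // inner zip: concatenation only (level 0)
            try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
                for (Path file : files) {
                    if (!Files.isRegularFile(file)) continue;
                    inner.putNextEntry(new ZipEntry(file.getFileName().toString()));
                    Files.copy(file, inner);
                    inner.closeEntry();
                }
            }
            inner.finish();       // finish the inner archive without closing the shared stream
            outer.closeEntry();
        }
    }
}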
It is true that java.util.zip used to have problems with files larger than 2 GB (I did run into those problems). However, I believe that was only the case for ZipFile and not for ZipIn/OutputStream. Also, I think those problems are fixed with recent Java versions.
Buffer size: regular compression algorithms such as Deflate will not benefit from chunk sizes larger than about 64 KB. More advanced algorithms can benefit from using larger chunk sizes, for example bzip2 up to 900 KB, or LZMA2 up to 2 MB. Anything beyond that is more likely the domain of data deduplication, which might or might not make sense for what you want to do.
I want to write a big file to the local disk.
I split the big file into many small files and tried to write them to the disk. I observed that when I split the file and tried to write, there was a big increase in disk write time.
Also, I copy the files from a disk and write them to another computer's disk (the reducer). I observed a big increase in read time as well. Can anybody explain the reason to me? I am working with Hadoop.
Thanks!
That's due to the underlying file system and hardware.
There's overhead for each file in addition to its contents, for example the MFT for NTFS (on Windows). So for a single large file the file system has to do less bookkeeping, and thus it's faster.
As arranged by your OS, a single big file tends to be written to consecutive sectors of the hard drive where possible, but multiple small files may or may not be written that way. The resulting increase in seek time may account for the increased reading time for many small files.
The efficiency of your OS may also play a big part, for example whether it prefetches file contents, how it makes use of buffers, etc. With many small files it's more difficult for the OS to use the buffer (and deal with other issues) efficiently. (Under different scenarios it can behave differently.)
EDIT: As for the copy process you mentioned, your OS generally does it in the following steps:
read data from disk -> write data to buffer -> read from buffer -> write to (possibly another) disk
This is usually done in multiple threads. When dealing with many small files, the OS may fail to coordinate these threads efficiently (some threads are very busy while others must wait). For a single large file the OS doesn't have to deal with these issues.
Every file system has a smallest non-sharable unit, called a page, defined to store data. Say, for example, that your file system has a page size of 4 KB. Now if you save a big file of 8 KB, it will consume 2 pages on the disk. But if you break the file into 4 files, each of size 2 KB, then it will consume 4 half-filled pages, using 16 KB of disk space.
Similarly, if you break the file into 8 small files, each of size 1 KB, then it will consume 8 pages on the disk, though only partially filled, and 32 KB of your disk space is consumed.
The same is true for reading overhead. If your file spans several pages, they might be scattered, which leads to high overhead in seek time/access time.
I have a bunch of files (around 4000), each weighing 1-5K more or less,
all created using the serialization mechanism of Java.
I'd like to compress them and send them over a network as a single file.
(They total for around 200-300MB).
I'm looking for a way to increase the compression / decompression speed, without hurting the file size too much (as it should still be sent over the network and get stored in the server).
Currently using the zip package that comes with Apache Ant.
I read that zip files store meta data for each file, so I guess zip files won't be the best choice here.
So what's preferable?
Gzip / Tar?
Or not compressing at all?
Which java library would you recommend for this case?
Thanks in advance.
Not compressing at all would be fastest, but the resulting file size is the downside.
One reason why tar.gz produces smaller file sizes than zip alone is that gzip gets to work with a bigger buffer of data (the whole tar file), while in your case, zip only gets to work with the data from one file at a time (usually a lot less than the size of the tar file, if there are a lot of files).
So gzip gets to compress an entire book with chapters of pages at a time, while zip compresses each chapter of a book and then wraps the compressed chapters up in a book - i.e. a compressed collection of objects is usually smaller than a collection of compressed objects.
To produce a similar result to tar.gz, you can zip up the files in a first pass using the 'store' algorithm, and then zip up the resulting zip file using the default deflate algorithm.
A lot depends on the network that you are using.
If it's over the internet, you might be better off sending (say) 50 zipped-up files rather than one file. If you transfer the data as one file and the copy fails, you will have to send it all again.
Copying as separate files will allow you to transfer some in parallel and to minimise the risk of a large upload failing.
Another possibility might be switching to another Serialization mechanism. JBoss Serialization is API and functionality compatible, but produces 30% less data.
I'm using a GZIPInputStream in my program, and I know that the performance would be helped if I could get Java running my program in parallel.
In general, is there a command-line option for the standard VM to run on many cores? It's running on just one as it is.
Thanks!
Edit
I'm running plain ol' Java SE 6 update 17 on Windows XP.
Would putting the GZIPInputStream on a separate thread explicitly help? No! Do not put the GZIPInputStream on a separate thread! Do NOT multithread I/O!
Edit 2
I suppose I/O is the bottleneck, as I'm reading and writing to the same disk...
In general, though, is there a way to make GZIPInputStream faster? Or a replacement for GZIPInputStream that runs parallel?
Edit 3
Code snippet I used:
GZIPInputStream gzip = new GZIPInputStream(new FileInputStream(INPUT_FILENAME));
DataInputStream in = new DataInputStream(new BufferedInputStream(gzip));
AFAIK the action of reading from this stream is single-threaded, so multiple CPUs won't help you if you're reading one file.
You could, however, have multiple threads, each unzipping a different file.
That being said, unzipping is not particularly calculation intensive these days, you're more likely to be blocked by the cost of IO (e.g., if you are reading two very large files in two different areas of the HD).
More generally (assuming this is a question of someone new to Java), Java doesn't do things in parallel for you. You have to use threads to tell it what are the units of work that you want to do and how to synchronize between them. Java (with the help of the OS) will generally take as many cores as is available to it, and will also swap threads on the same core if there are more threads than cores (which is typically the case).
PIGZ (Parallel Implementation of GZip) is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data. http://www.zlib.net/pigz/ It's not Java yet; any takers? Of course, the world needs it in Java.
Sometimes the compression or decompression is a big CPU-consumer, though it helps the I/O not be the bottleneck.
See also Dataseries (C++) from HP Labs. PIGZ only parallelizes the compression, while Dataseries breaks the output into large compressed blocks, which are decompressible in parallel. Also has a number of other features.
Wrap your GZIP streams in Buffered streams, this should give you a significant performance increase.
OutputStream out = new BufferedOutputStream(
    new GZIPOutputStream(
        new FileOutputStream(myFile)
    )
);
And likewise for the input stream. Using the buffered input/output streams reduces the number of disk reads.
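For the input side, the matching pattern would be (myFile again being the placeholder from above):

InputStream in = new BufferedInputStream(
    new GZIPInputStream(
        new FileInputStream(myFile)
    )
);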
I'm not seeing any answer addressing the other processing of your program.
If you're just unzipping a file, you'd be better off simply using the command line gunzip tool; but likely there's some processing happening with the files you're pulling out of that stream.
If you're extracting something that comes in reasonably sized chunks, then your processing of those chunks should be happening in a separate thread from the unzipping.
You could manually start a Thread on each large String or other block of data; but since Java 5 you'd be better off with one of the classes in java.util.concurrent, such as a ThreadPoolExecutor.
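A minimal sketch of that split, assuming the per-chunk work is CPU-bound and order-independent; processChunk, the file name, and the chunk size are placeholders:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.zip.GZIPInputStream;

public class UnzipAndProcess {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try (InputStream in = new BufferedInputStream(
                new GZIPInputStream(new FileInputStream("input.gz")))) {   // single reader thread
            byte[] buf = new byte[1 << 20];                                // 1 MiB chunks (arbitrary)
            int n;
            while ((n = in.read(buf)) != -1) {
                byte[] chunk = Arrays.copyOf(buf, n);
                pool.submit(() -> processChunk(chunk));                    // CPU-heavy work happens off the reader thread
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void processChunk(byte[] data) {
        // placeholder for whatever per-chunk processing the program actually does
    }
}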
Update
It's not clear to me from the question and other comments whether you really ARE just extracting files using Java. If you really, really think you should try to compete with gunzip, then you can probably gain some performance by using large buffers; i.e. work with a buffer of, say, 10 MB (binary, not decimal: 10 * 1,048,576 bytes), fill that in a single gulp and write it to disk likewise. That will give your OS a chance to do some medium-scale planning for disk space, and you'll need fewer system-level calls too.
Compression seems like a hard case for parallelization because the bytes emitted by the compressor are a non-trivial function of the previous W bytes of input, where W is the window size. You can obviously break a file into pieces and create independent compression streams for each of the pieces that run in their own threads. You may need to retain some compression metadata so the decompressor knows how to put the file back together.
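A minimal sketch of that idea, relying on the fact that a gzip file may consist of several concatenated members (RFC 1952), so independently compressed pieces can simply be written out in order; piece size and file names are placeholders:

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPOutputStream;

public class ParallelGzipSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<byte[]>> pieces = new ArrayList<>();
        try (InputStream in = new FileInputStream("input.bin")) {      // placeholder input
            byte[] buf = new byte[8 * 1024 * 1024];                     // 8 MiB pieces (arbitrary); cross-piece repetitions are lost
            int n;
            while ((n = in.read(buf)) != -1) {
                byte[] piece = Arrays.copyOf(buf, n);
                pieces.add(pool.submit(() -> gzipMember(piece)));       // each piece compressed independently
            }
        }
        try (FileOutputStream out = new FileOutputStream("output.gz")) {
            for (Future<byte[]> f : pieces) {
                out.write(f.get());                                     // concatenate members in the original order
            }
        }
        pool.shutdown();
    }

    private static byte[] gzipMember(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }
}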
Compression and decompression using gzip is a serialized process. To use multiple threads you would have to write a custom program to break up the input file into many streams, and then a custom program to decompress and join them back together. Either way, IO is going to be a bottleneck WAY before CPU usage is.
Run multiple VMs. Each VM is a process and you should be able to run at least three processes per core without suffering any drop in performance. Of course, your application would have to be able to leverage multiprocessing in order to benefit. There is no magic bullet which is why you see articles in the press moaning about how we don't yet know how to use multicore machines.
However, there are lots of people out there who have structured their applications into a master which manages a pool of worker processes and parcels out work packages to them. Not all problems are amenable to being solved this way.
I think it is a mistake to assume that multithreading IO is always evil. You probably need to profile your particular case to be sure, because:
Recent operating systems use the currently free memory for the cache, and your files may actually not be on the hard drive when you are reading them.
Recent hard drives like SSD have much faster access times so changing the reading location is much less an issue.
The question is too general to assume we are reading from a single hard drive.
You may need to tune your read buffer to make it large enough to reduce the switching costs. In the extreme case, one can read all files into memory and decompress them there in parallel, which is faster and loses nothing to I/O multithreading. However, something less extreme may also work well.
You also do not need to do anything special to use multiple available cores on JRE. Different threads will normally use different cores as managed by the operating system.
You can't parallelize the standard GZIPInputStream; it is single threaded. But you can pipeline decompression and processing of the decompressed stream into different threads, i.e. set up the GZIPInputStream as a producer and whatever processes it as a consumer, and connect them with a bounded blocking queue.
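A minimal sketch of that pipeline, using an ArrayBlockingQueue as the bounded hand-off; the file name and the processing step are placeholders:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.zip.GZIPInputStream;

public class PipelinedGunzip {
    private static final byte[] POISON = new byte[0];   // end-of-stream marker

    public static void main(String[] args) throws Exception {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(16);   // bounded: the producer blocks when the queue is full

        // Producer: the only thread touching the GZIPInputStream.
        Thread producer = new Thread(() -> {
            try (InputStream in = new BufferedInputStream(
                    new GZIPInputStream(new FileInputStream("input.gz")))) {   // placeholder file
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) != -1) {
                    queue.put(Arrays.copyOf(buf, n));
                }
                queue.put(POISON);
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        producer.start();

        // Consumer: processes decompressed chunks while the producer keeps decompressing.
        byte[] chunk;
        while ((chunk = queue.take()) != POISON) {
            process(chunk);
        }
        producer.join();
    }

    private static void process(byte[] chunk) {
        // placeholder for the actual processing of the decompressed data
    }
}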