Is there a way to do parallel zipping in Java?
I'm currently using ParallelScatterZipCreator, but unfortunately it only does parallel zipping per file. So if there's a single file that is much larger than the other files, the parallel zipping only happens for the smaller files; then it has to wait until the large file is zipped serially.
Is there a better library out there that utilizes all CPU cores even if we're zipping a single file?
TL;DR: You may not need compression at all. If you do, then you probably don't want to use the zip format, it's outdated tech with sizable downsides, and clearly you have some fairly specific needs. You probably want ZStandard (zstd).
How does compression work
Compression works by looking at a blob of bytes and finding repetitions of some form within it. Thus, it is not possible to compress a single byte.
This makes the job fundamentally at odds with parallelising: if you take a blob of 1 million bytes and compress it by chopping it up into 10 chunks of 100k bytes each and compressing each mini-blob individually, then any repetition whose occurrences fall in two different mini-blobs is an opportunity to compress data that you have missed, and that you would not have missed if you had compressed the data as one blob instead.
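A minimal sketch of that effect, using plain java.util.zip and made-up sample data: the same million bytes compressed once as a whole and once as ten independent chunks. The chunked variant cannot match repetitions across chunk boundaries, so its total output comes out far larger.

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class ChunkedVsWhole {

    // DEFLATE a byte range and return the compressed size in bytes.
    static int deflatedSize(byte[] data, int off, int len) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(data, off, len);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.size();
    }

    public static void main(String[] args) {
        // Build 1 MB whose repetitions span the whole blob: a 16 KB random block, repeated.
        byte[] block = new byte[16 * 1024];
        new java.util.Random(42).nextBytes(block);
        byte[] blob = new byte[1_000_000];
        for (int i = 0; i < blob.length; i++) {
            blob[i] = block[i % block.length];
        }

        int whole = deflatedSize(blob, 0, blob.length);

        // Same bytes, compressed as 10 independent 100 KB chunks: each chunk has to
        // rediscover the repeating pattern from scratch.
        int chunked = 0;
        int chunkSize = blob.length / 10;
        for (int i = 0; i < 10; i++) {
            chunked += deflatedSize(blob, i * chunkSize, chunkSize);
        }

        System.out.println("whole blob : " + whole + " bytes");
        System.out.println("10 chunks  : " + chunked + " bytes");
    }
}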
The only reason ZIP lets you parallelize a little bit is that it is an old format - sensible at the time, but in this age, just about every part of the ZIP format is crap.
Why is ZIP bad?
ZIP is a mixed bag, conflating two unrelated jobs.
A bundler. A bundling tool is some software that takes a bunch of files and turns them into a single stream (a single sack of bytes). To do so, the bundling tool takes the metadata about a file (its name, its owner/group or other access info, its last-modified time, etcetera) and the data within it, and serializes this into a single stream. zip does this, as does e.g. the posix tar tool.
A compressor. A compressor takes a stream of data and compresses it by finding repeated patterns.
zip in essence is only #1, but as part of the bundler, the part with 'the data within this file' has a flag to indicate that a compression algorithm has been applied to the bytes representing the data. In theory you can use just about any algorithm, but in practice, every zip file has all entries compressed with the DEFLATE algorithm, which is vastly inferior to more modern algorithms.
.tar.gz is the exact same technology (first bundle it: tar file, then gzip that tar file. gzip is the DEFLATE algorithm), but vastly more efficient in certain cases (it applies compression to the entire stream vs. restarting from scratch for every file. If you take a sack of 1000 small, similar files, then that in .tar.gz form is orders of magnitude smaller than that in .zip form).
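For illustration, here is a minimal sketch of that layering using Apache Commons Compress (the same library ParallelScatterZipCreator comes from): the tar layer only bundles, and a single gzip stream wraps the whole bundle, so repetitions shared between files are still found. The paths and the way files are collected are just assumptions for the example.

import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;

public class TarGzExample {
    public static void create(Path target, List<Path> files) throws IOException {
        try (OutputStream fileOut = Files.newOutputStream(target);
             OutputStream gzOut = new GzipCompressorOutputStream(new BufferedOutputStream(fileOut));
             TarArchiveOutputStream tarOut = new TarArchiveOutputStream(gzOut)) {
            for (Path file : files) {
                // The bundler records metadata plus the raw bytes of each file...
                TarArchiveEntry entry = new TarArchiveEntry(file.toFile(), file.getFileName().toString());
                tarOut.putArchiveEntry(entry);
                Files.copy(file, tarOut);
                tarOut.closeArchiveEntry();
            }
            tarOut.finish();
            // ...while the compressor (gzip) sees one continuous stream.
        }
    }
}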
Also, zip is old, and it made choices defensible at the time but silly in modern systems: You can't 'stream' zips (you can't meaningfully start unpacking one until the whole file has been received), because the bundler's info is at the end of the file.
So why can I parallelize zips?
Because zips 'restart' their compression window on every file. This is inefficient and hurts the compression ratio of zip files.
You can apply the exact same principle to any block of data, if you want: trade compression efficiency for parallelizability. ZIP is the format that doesn't do it in a useful way; as you said, if you have one much larger file, the point is moot.
'restart the window at' is a principle that can be generalized, and various compression formats support it in a much more useful fashion (restart every X bytes, vs. ZIP's unreliable 'restart at every file').
What is the bottleneck?
Multiple aspects are involved when sending data:
The speed at which the source provides the bytes you want to send.
The speed at which those bytes are processed into a package that can then be sent (e.g. by a zip tool, but it can be anything, including just sending them verbatim, uncompressed).
The speed at which the packaged-up bytes are transited to the target system.
The speed at which the target can unpack them.
The speed at which the target can then process the unpacked result.
Are you sure that the compression aspect is the bottleneck?
In the base case where you read the bytes off of a hard disk, zip them up, send them across a residential internet pipe to another system, and that system unzips and saves them on an HDD, it is rather likely that the bottleneck is the network. Parallelizing the compression step is a complete waste and in fact only slows things down by reducing compression ratios.
If you're reading files off of a spinning platter, then the speed of the source is likely the bottleneck, and parallel processing considerably slows things down: You're now asking the read head to bounce around, and this is much slower than reading the data sequentially in one go.
If you have a fast source, and a fast pipe, then the bottleneck is doubtlessly the compression and uncompression, but the solution is then not to compress at all: If you are transferring data off of SSDs or from a USB3-connected byte-spewing sensor and transfer it across a 10M CAT6 cable from one gigabit ethernet port to another, then why compress at all? Just send those bytes. Compression isn't going to make it any faster, and as long as you don't saturate that 1Gb connection, you gain absolutely nothing whatsoever by attempting to compress it.
If you have a slow pipe, then the only way to make things faster is to compress as much as you can. Which most definitely involves not using the DEFLATE algorithm (e.g. don't use zip). Use another algorithm and configure it to get better compression rates, at the cost of CPU performance. Parallelising is irrelevant; it's not the bottleneck, so there is no point whatsoever in doing it.
Conclusions
Most likely you want to either send your files over uncompressed, or ZStandard your files over, tweaking the compression v. speed ratio as needed. I'm not aware of any ZStandard (zstd) impl for java itself, but the zstd-jni project gives you a java-based API for invoking the C zstd library.
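As a minimal sketch of the zstd-jni route (Maven coordinate com.github.luben:zstd-jni): wrap your streams in the library's ZstdOutputStream/ZstdInputStream. The level of 3 and the file paths are just illustrative; higher levels trade CPU time for a better ratio.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import com.github.luben.zstd.ZstdInputStream;
import com.github.luben.zstd.ZstdOutputStream;

public class ZstdCopy {

    // Compress: wrap the destination in a ZstdOutputStream at the chosen level.
    public static void compress(Path source, Path target, int level) throws IOException {
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = new ZstdOutputStream(Files.newOutputStream(target), level)) {
            in.transferTo(out);
        }
    }

    // Decompress: wrap the source in a ZstdInputStream.
    public static void decompress(Path source, Path target) throws IOException {
        try (InputStream in = new ZstdInputStream(Files.newInputStream(source));
             OutputStream out = Files.newOutputStream(target)) {
            in.transferTo(out);
        }
    }
}

// Usage (paths made up): ZstdCopy.compress(Path.of("data.bin"), Path.of("data.bin.zst"), 3);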
If you insist on sticking with ZIP, the answer is a fairly basic 'nope, you cannot really do that', though you could in theory write a parallel ZIP compressor that has worse compression power but parallelizes better (by restarting the window within a single file for larger files in addition to the forced-upon-you-by-the-format restart at every file), and produces ZIP files that are still compatible with just about every unzip tool on the planet. I'm not aware of one, I don't think one exists, and writing one yourself would be a decidedly non-trivial exercise.
I need to compress PNG images on a PC for data transmission in Python, transfer them to mobile, and read them there in Java. I need lossless compression.
The hard part is compression and decompression in different environments and programming languages. I need to use something available for both languages. I've tried zlib, which technically should work, but it only decreases size by about 0.001% (with "best" compression setting 9).
What am I doing wrong with zlib, if anything?
What are the possible alternatives?
Is there any other way to go about this problem? I just need to transfer data as byte stream and the compression was my first thought to optimizing it.
Compressing an already-compressed file (like PNG/JPG) will normally not yield much and can even occasionally increase the file size.
It's not worth the effort.
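You can verify this claim yourself on the Java side with a few lines: deflate the raw bytes of an already-compressed file and compare sizes. The file path here is just a placeholder.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class RecompressCheck {
    public static void main(String[] args) throws IOException {
        byte[] original = Files.readAllBytes(Path.of("image.png"));   // placeholder path

        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (DeflaterOutputStream out =
                 new DeflaterOutputStream(compressed, new Deflater(Deflater.BEST_COMPRESSION))) {
            out.write(original);
        }

        // For PNG/JPG input, expect roughly the same size, sometimes slightly larger:
        // the image data inside a PNG is already DEFLATE-compressed.
        System.out.printf("original: %d bytes, re-deflated: %d bytes%n",
                original.length, compressed.size());
    }
}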
I wrote a backup program using Deflater and SHA-1 to store files and a hash value. I see that Java's Deflater uses zlib. If I explicitly set the Deflater's level, can I expect to always get the same series of bytes regardless of platform and JRE version?
If not then what do I use? Are there any stable and fast pure-Java implementations?
Do the SHA-1 before compression. Then you verify the correctness of the compression and decompression as well.
There is no assurance that what a compressor produces today will be the same as what a later version of the compressor produces tomorrow for the same input. And there should be no such assurance, since that would preclude improvements in compression.
The only assurance is that the compression-decompression process is lossless, so that what you get from the decompressor is exactly what you fed the compressor. For that reason, you need to compute signatures on the input of the compressor and the output of the decompressor. Ignore the intermediate compressed stream.
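A minimal sketch of that advice: hash the uncompressed bytes, compress with Deflater, and after inflating check the hash again. The stand-in data and the way the hash would be stored are assumptions for the example.

import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class HashThenCompress {
    public static void main(String[] args) throws Exception {
        byte[] original = "file contents to back up".getBytes();   // stand-in for real file data

        // 1. Hash the *input* of the compressor; this is what you store with the backup.
        byte[] storedHash = MessageDigest.getInstance("SHA-1").digest(original);

        // 2. Compress. The exact compressed bytes may differ between zlib versions;
        //    that is fine, because the hash is not taken over them.
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(original);
        deflater.finish();
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) compressed.write(buf, 0, deflater.deflate(buf));
        deflater.end();

        // 3. Later: decompress and verify the round trip against the stored hash.
        Inflater inflater = new Inflater();
        inflater.setInput(compressed.toByteArray());
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        while (!inflater.finished()) restored.write(buf, 0, inflater.inflate(buf));
        inflater.end();

        byte[] restoredHash = MessageDigest.getInstance("SHA-1").digest(restored.toByteArray());
        System.out.println("round trip intact: " + Arrays.equals(storedHash, restoredHash));
    }
}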
I have gone through the URL, in which it is suggested that for writing huge amounts of bulk data you should use a buffered writer, but I just want to know its advantages over memory-mapped IO. My main focus is to make this process as fast as possible, and since JDK 1.5 memory-mapped IO has also been fast, so why is it not preferred?
I use memory-mapped files in Chronicle, however I would say:
Memory-mapped files are much harder to work with, especially for text, as you need random access and text has variable-length characters. Plain IO is the simplest and is usually much faster than your hardware, so unless you have a PCI SSD card, you won't notice much difference for larger files.
In short, if your write speed is slow, check which hard drive you are writing to as there is not much you can do in software to make it faster. (Except use compression)
Memory mapped I/O is:
20% faster than java.io when reading
Unusable when writing files of unknown length, as you have to specify the length when mapping.
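To make the contrast concrete, here is a minimal sketch of both approaches with the standard NIO/IO classes; note that the mapped version needs the final size up front, which is exactly the limitation above. Paths and data are illustrative.

import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedVsPlain {

    // Plain buffered IO: simple, and works for output of unknown length.
    static void writePlain(Path target, Iterable<String> lines) throws IOException {
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            for (String line : lines) {
                out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
    }

    // Memory-mapped IO: you must know (or over-estimate) the total size before mapping.
    static void writeMapped(Path target, byte[] data) throws IOException {
        try (FileChannel channel = FileChannel.open(target,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, data.length);
            buffer.put(data);
            buffer.force();   // flush the mapped region to disk
        }
    }
}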
I am new to Hadoop and I am working with a program whose map output is very large compared to the size of the input file.
I installed the LZO library and changed the config files, but it didn't have any effect on my program. How do I compress the map output? Is LZO the best choice?
If yes, how do I implement that in my program?
To compress the intermediate output (your map output), you need to set the following properties in your mapred-site.xml:
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.LzoCodec</value>
</property>
If you want to do it on a job-per-job basis, you can also implement that directly in your code in one of the following ways:
conf.set("mapred.compress.map.output", "true")
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.LzoCodec");
or
jobConf.setMapOutputCompressorClass(LzoCodec.class);
Also it's worth mentioning that the property mapred.output.compression.type should be left to the default of RECORD, because BLOCK compression for intermediate output causes bad performance.
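Putting the per-job snippets together, a minimal driver sketch with the old mapred API might look like this. Whether the codec class is org.apache.hadoop.io.compress.LzoCodec or com.hadoop.compression.lzo.LzoCodec depends on which LZO package you installed (the hadoop-lzo project uses the latter), and the mapper/reducer wiring is omitted.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

import com.hadoop.compression.lzo.LzoCodec;   // assumes the hadoop-lzo jar is on the classpath

public class LzoMapOutputJob {
    public static void main(String[] args) throws Exception {
        JobConf jobConf = new JobConf(LzoMapOutputJob.class);
        jobConf.setJobName("lzo-map-output-example");

        // Compress only the intermediate (map) output; the job output stays uncompressed here.
        jobConf.setCompressMapOutput(true);
        jobConf.setMapOutputCompressorClass(LzoCodec.class);

        // ... set mapper, reducer, input and output paths as usual ...
        JobClient.runJob(jobConf);
    }
}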
When choosing what type of compression to use, I think you need to consider 2 aspects:
Compression ratio: how much compression actually occurs. The higher the %, the better the compression.
IO performance: since compression is an IO-intensive operation, different methods of compression have different performance implications.
The goal is to balance compression ratio and IO performance; you can have a compression codec with a very high compression ratio but poor IO performance.
It's really hard to tell you which one you should use and which one you should not; it also depends on your data, so you should try a few and see what makes more sense. In my experience, Snappy and LZO are the most efficient ones. Recently I heard about LZF, which sounds like a good candidate too. I found a post proposing a benchmark of compressions here, but I would definitely advise you not to take that as ground truth and to do your own benchmark.
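In the spirit of "do your own benchmark", here is a rough micro-benchmark sketch. It uses codecs that ship with Hadoop (DefaultCodec, GzipCodec) so it runs without native libraries; to compare Snappy, LZO or LZF you would list those codec classes instead, assuming their native libraries are installed. The sample data is made up, so substitute a slice of your real map output.

import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecBenchmark {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Stand-in for your real intermediate data: 10 MB of repetitive log-like text.
        byte[] line = "2023-01-01 INFO some repetitive log line payload\n".getBytes();
        byte[] sample = new byte[10 * 1024 * 1024];
        for (int i = 0; i < sample.length; i++) sample[i] = line[i % line.length];

        List<Class<? extends CompressionCodec>> codecs =
                Arrays.asList(DefaultCodec.class, GzipCodec.class);

        for (Class<? extends CompressionCodec> codecClass : codecs) {
            CompressionCodec codec = ReflectionUtils.newInstance(codecClass, conf);
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();

            long start = System.nanoTime();
            try (OutputStream out = codec.createOutputStream(compressed)) {
                out.write(sample);
            }
            long millis = (System.nanoTime() - start) / 1_000_000;

            System.out.printf("%-14s ratio=%.2f  time=%dms%n",
                    codecClass.getSimpleName(),
                    (double) compressed.size() / sample.length,
                    millis);
        }
    }
}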
If you are using Hadoop 0.21 or later, you have to set these properties in your mapred-site.xml:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
And do not forget to restart Hadoop after the change. Also make sure that you have both 32-bit and 64-bit liblzo2 installed. For detailed help on how to set this up, you can refer to the following links:
https://github.com/toddlipcon/hadoop-lzo
https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1
And along with the points made by Charles sir, you should keep one more aspect in mind:
CPU cycles: the compression algorithm you are going to use should consume fewer CPU cycles. Otherwise, the cost of compression can void or even reverse the speed advantage.
Snappy is another option, but it is primarily optimized for 64-bit machines. If you are on a 32-bit machine, be careful.
Based on recent progress, LZ4 also seems good and has recently been integrated into Hadoop. It's fast but has higher memory requirements. You can go here to find more on LZ4.
But as Charles sir has said a fair decision can be made only after some experiments.