I am currently reading about the deflate algorithm and as part of learning I picked one file that I zipped using different methods. What I found and what confuses me very much is that the different methods produced different bytes representing the compressed file.
I tried zipping the file using WinRar, 7-Zip, the Java zlib library (ZipOutputStream class), and also manually by just running deflate on the source data (Deflater class). All four methods produced completely different bytes.
My goal was just to see that all of the methods produced the same byte array as a result, but this was not the case and my question is why could that be? I made sure by checking the file headers that all of this software actually used the deflate algorithm.
Can anyone help with this? Is it possible that deflate algorithm can produce different compressed result for exactly the same source file?
There are many, many deflate representations of the same data. Surely you have already noticed that you can set a compression level. That could only have an effect if there were different ways to compress the same data. What you get depends on the compression level, any other compression settings, the software you are using, and the version of that software.
The only guarantee is that when you compress and then decompress, you get exactly what you started with. There is no guarantee, nor does there need to be or should be such a guarantee, that you get the same thing when you decompress and then compress.
Why do you have that goal?
The reason is that Deflate is a format, not an algorithm. The compression happens in two steps. First comes LZ77, where you have a huge choice among a near-infinite number of possible matching algorithms. Then the LZ77 messages are encoded with Huffman trees, and again there is a very large number of ways to define those trees. Additionally, from time to time in the stream of LZ77 messages it can pay to redefine the trees and start a new block, or not, so there is yet another enormous number of choices about how to split the stream into blocks.
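If you want to see this for yourself, here is a minimal sketch using java.util.zip.Deflater (class and input data are just placeholders) that compresses the same bytes at two levels; both outputs are valid streams for the same data, yet they are typically not byte-identical:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;

public class DeflateLevels {
    // Compress data with the given level and return the zlib-wrapped deflate stream.
    static byte[] deflate(byte[] data, int level) {
        Deflater d = new Deflater(level);
        d.setInput(data);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("the quick brown fox jumps over the lazy dog ").append(i).append('\n');
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] fast = deflate(data, Deflater.BEST_SPEED);        // level 1
        byte[] small = deflate(data, Deflater.BEST_COMPRESSION); // level 9
        // Both decompress back to the same data, but the streams themselves usually differ.
        System.out.println("same bytes? " + Arrays.equals(fast, small)
                + "  sizes: " + fast.length + " vs " + small.length);
    }
}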
I have a large file and I want to send it over the network to multiple consumers via a pub/sub method. For that purpose I chose to use jeromq. The code is working, but I am not happy because I want to optimize it. The file is large, over 2 GB. My question now is: if I send a compressed file, for example with gzip, to the consumers, will the performance improve, or does the compress/decompress step introduce additional overhead so that performance does not improve? What do you think?
In addition to compression, is there any other technique I could use?
For example, use erasure coding, split the data into chunks, send the chunks to consumers, and then have them communicate with each other to retrieve the original file.
(Maybe my second idea is stupid or I haven't understood something correctly; please give me your directions.)
If I send a compressed file, for example with gzip, to the consumers, will the performance improve, or does the compress/decompress step introduce additional overhead so that performance does not improve?
Gzip has a relatively good compression ratio (though not the best), but it is slow. In practice it can be so slow that the network interconnect is faster than compressing and decompressing the data stream; gzip is only really worthwhile on relatively slow networks. There are faster compression algorithms, generally with a lower compression ratio. LZ4, for example, is very fast at both compressing and decompressing a data stream in real time. There is a catch though: the compression ratio depends strongly on the kind of file being sent. Binary files and already-compressed data will barely shrink, especially with a fast algorithm like LZ4, so compression will not be worth it for them. For text-based data, or data with repeated patterns or a reduced range of byte values, compression can be very useful.
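If you want to measure this rather than guess, a small sketch like the following (gzip from the JDK; the 1 MiB chunk, class name, and timing are placeholders, not jeromq API) lets you compare the compression time against the bytes saved per chunk:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class ChunkGzip {
    // Gzip one slice of the file before handing it to the pub/sub layer.
    static byte[] gzip(byte[] chunk) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(chunk);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] chunk = new byte[1 << 20];   // stand-in for a 1 MiB slice of the real file
        long t0 = System.nanoTime();
        byte[] packed = gzip(chunk);
        long ms = (System.nanoTime() - t0) / 1_000_000;
        // Weigh this time (plus decompression on the consumer) against the time saved
        // by sending packed.length bytes instead of chunk.length bytes.
        System.out.println(chunk.length + " -> " + packed.length + " bytes in " + ms + " ms");
    }
}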
For example, use erasure coding, split the data into chunks, send the chunks to consumers, and then have them communicate with each other to retrieve the original file.
This is a common broadcast pattern in distributed computing. The method is used in distributed hash table algorithms and also in MPI implementations (massively used on supercomputers). For example, MPI uses a tree-based broadcast when the number of machines is relatively large and/or the amount of data is big. Note that this method introduces additional latency overheads, which are generally small in your case unless the network has very high latency. Distributed hash tables use more complex algorithms, since they generally assume that a node can fail at any time (so they adapt dynamically) and cannot be fully trusted (so they use checks like hashes and sometimes even fetch data from multiple sources to avoid malicious injection from specific machines). They also make no assumptions about the network structure or speed (resulting in imbalanced download speeds).
Is there a way to get the possible compression ratio of a file just reading it?
You know, some files are more compressible than others... my software has to tell me the percentage of possible compression of my files.
e.g.
Compression Ratio: 50% -> I can save 50% of my file's space if I compress it
Compression Ratio: 99% -> I can save only 1% of my file's space if I compress it
First, this will depend largely on the compression method you choose. And second, I seriously doubt it's possible without computation of time and space complexity comparable to actually doing the compression. I'd say your best bet is to compress the file, keeping track of the size of what you've already produced and dropping/freeing it (once you're done with it, obviously) instead of writing it out.
To actually do this, unless you really want to implement it yourself, it will probably be easiest to use the java.util.zip package, in particular the Deflater class and its deflate method.
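For instance, here is a rough sketch of that approach with Deflater; the file path and buffer sizes are placeholders. The compressed bytes go into a scratch buffer only to be counted, never stored:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.Deflater;

public class CompressionRatioProbe {
    public static void main(String[] args) throws Exception {
        Deflater def = new Deflater();            // default level, same family as ZIP's deflate
        byte[] in = new byte[64 * 1024];
        byte[] scratch = new byte[64 * 1024];     // compressed bytes are counted, then thrown away
        long rawBytes = 0, compressedBytes = 0;
        try (InputStream is = Files.newInputStream(Paths.get("somefile.bin"))) { // placeholder path
            int n;
            while ((n = is.read(in)) > 0) {
                rawBytes += n;
                def.setInput(in, 0, n);
                while (!def.needsInput()) {
                    compressedBytes += def.deflate(scratch);
                }
            }
        }
        def.finish();
        while (!def.finished()) {
            compressedBytes += def.deflate(scratch);
        }
        def.end();
        System.out.printf("compressed to %.1f%% of original size%n",
                100.0 * compressedBytes / Math.max(1, rawBytes));
    }
}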
Firstly, you need a bit of information theory. There are two key results:
According to Shannon, one can compute the entropy (i.e. the compressed size) of a source from its symbol probabilities. So the smallest compressed size is defined by a statistical model which produces symbol probabilities at each step. All algorithms use that approach, implicitly or explicitly, to compress data. See the Wikipedia article on entropy for more details.
According to Kolmogorov, the smallest compressed size is given by the smallest possible program that produces the source. In that sense it is not computable. Some programs partially use that approach to compress data (e.g. you can write a small console application which produces 1 million digits of pi instead of zipping those 1 million digits).
So, you can't find the compressed size without doing the actual compression. But if you need an approximation, you can rely on Shannon's entropy and build a simple statistical model. Here is a very simple approach:
Compute order-1 statistics for each symbol in the source file.
Calculate entropy by using those statistics.
Your estimate will be more or less the same as what ZIP's default compression algorithm (deflate) achieves. Here is a more advanced version of the same idea (be aware it uses a lot of memory!). It actually uses entropy to determine block boundaries, segmenting the file into homogeneous chunks.
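As an illustration of those two steps, here is a minimal sketch that computes the per-byte Shannon entropy from single-symbol frequencies and turns it into a size estimate; it knowingly ignores the repeated-string matching that deflate's LZ77 stage adds, so treat it as an approximation only:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EntropyEstimate {
    public static void main(String[] args) throws Exception {
        long[] freq = new long[256];
        long total = 0;
        byte[] buf = new byte[64 * 1024];
        try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
            int n;
            while ((n = is.read(buf)) > 0) {
                for (int i = 0; i < n; i++) freq[buf[i] & 0xFF]++;   // step 1: symbol counts
                total += n;
            }
        }
        double bitsPerByte = 0.0;                                     // step 2: Shannon entropy
        for (long f : freq) {
            if (f == 0) continue;
            double p = (double) f / total;
            bitsPerByte -= p * (Math.log(p) / Math.log(2));
        }
        long estimatedBytes = (long) Math.ceil(total * bitsPerByte / 8);
        System.out.printf("%.3f bits/byte, estimated %d of %d bytes%n",
                bitsPerByte, estimatedBytes, total);
    }
}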
Not possible without examining the file. The only thing you can do is assign an approximate ratio by file extension, based on statistics gathered from a relatively large sample by doing the actual compression and measuring. For example, a statistical analysis will likely show that .zip and .jpg files are not very compressible, while files like .txt and .doc may be highly compressible.
The results of this are rough guidance only and will probably be way off in some cases, as there is absolutely no guarantee of compressibility by file extension. The file could contain anything, no matter what the extension says.
UPDATE: Assuming you can examine the file, you can use the java.util.zip APIs to read the raw file, compress it, and see what the before/after difference is.
I was assigned to parallelize GZip in Java 7, and I am not sure whether it is possible.
The assignment is:
Parallelize gzip using a given number of threads
Each thread takes a 1024 KiB block, using the last 32 KiB of the previous block as a dictionary. There is an option to use no dictionary.
Read from stdin and write to stdout.
What I have tried:
I have tried using GZIPOutputStream, but there doesn't seem to be a way to isolate and parallelize the deflate(), nor can I access the deflater to alter the dictionary. I tried extending GZIPOutputStream, but it didn't behave as I wanted, since I still couldn't isolate the compress/deflate step.
I tried using Deflater with wrap enabled and a FilterOutputStream to output the compressed bytes, but I wasn't able to get it to compress properly into the gzip format. I made it so that each thread had a compressor that writes to a byte array, which is then written to the OutputStream.
I am not sure whether I implemented my approaches incorrectly or took the wrong approach entirely. Can anyone point me in the right direction for which classes to use for this project?
Yep, zipping a file with a dictionary can't be parallelized, as everything depends on everything that came before it. Maybe your teacher asked you to parallelize the individual gzipping of multiple files in a folder? That would be a great example of parallelized work.
To make a process concurrent, you need portions of code that can run concurrently and independently. Most compression algorithms are designed to run sequentially, where every byte depends on every byte that came before.
The only way to do this kind of compression concurrently is to change the algorithm (making it incompatible with existing approaches).
I think you can do it by inserting appropriate resets in the compression stream. The idea is that the underlying compression engine used in gzip allows the deflater to be reset; the original aim was to make it easier to recover from stream corruption, at the cost of a worse compression ratio. After a reset, the deflater is in a known state that is independent of the content compressed so far. You could therefore start from that known state in multiple threads, at many locations in the input data, have each thread produce a compressed chunk, and have each chunk end with the data produced by the following reset so that the deflater is back in the known state. Then you just have to reassemble the compressed pieces into the overall compressed stream. “Simple!” (Hah!)
I don't know if this will work, and I suspect that the complexity of the whole thing will make it not a viable choice except when you're compressing single very large files. (If you had many files, it would be much easier to just compress each of those in parallel.) Still, that's what I'd try first.
(Also note that the gzip format is just a deflated stream with extra metadata.)
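In case it helps to see the shape of the per-block work, here is a rough, hedged sketch under the assignment's assumptions (raw deflate blocks, previous block's last 32 KiB as dictionary); the class and method names are made up. Each non-final block ends with a full flush so it lands on a byte boundary and the compressor state is reset, which is what lets independently produced chunks be concatenated. The caller would still have to write the gzip header, concatenate the pieces in order, and append the CRC-32 and length of the uncompressed data as the trailer; that assembly is omitted here.

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class BlockCompressor {
    // Compress one block as raw deflate, primed with the tail of the previous block.
    // dictionary may be null for the first block or when "no dictionary" is selected.
    static byte[] compressBlock(byte[] block, byte[] dictionary, boolean lastBlock) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // true = raw deflate, no zlib wrapper
        if (dictionary != null) {
            def.setDictionary(dictionary);       // e.g. the previous block's last 32 KiB
        }
        def.setInput(block);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8 * 1024];
        if (lastBlock) {
            def.finish();                        // only the final block terminates the deflate stream
            while (!def.finished()) {
                out.write(buf, 0, def.deflate(buf));
            }
        } else {
            // A full flush emits everything for this block on a byte boundary and resets the
            // compressor state, so a chunk produced independently by another thread can follow it.
            int n;
            while ((n = def.deflate(buf, 0, buf.length, Deflater.FULL_FLUSH)) > 0) {
                out.write(buf, 0, n);
            }
        }
        def.end();
        return out.toByteArray();
    }
}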
I need to compress strings (written in a known but variable language) of anywhere from 10 to 1000 characters into individual UDP packets.
What compression algorithms available in Java are well suited to this task?
Are there maybe open source Java libraries available to do this?
"It depends".
I would start with just the primary candidates: LZMA ("7-zip"), deflate (direct; zlib: deflate + a small wrapper; gzip: deflate + a slightly larger wrapper; zip: deflate + an even larger wrapper), bzip2 (I doubt this would be that good here, as it works best with a relatively large window), and perhaps even one of the other LZ* branches like LZS, which has an RFC for IP payload compression, but...
...run some analysis based upon the actual data and the compression/throughput of several different approaches. Java has both GZIPOutputStream ("deflate in a gzip wrapper") and DeflaterOutputStream ("plain deflate", recommended over the gzip or zip "wrappers") in the standard library, and there are LZMA Java implementations (you just need the compressor, not the container), so these should all be trivial to mock up.
If there is regularity between the packets then it is possible this could be exploited -- e.g. build cache mappings, Huffman tables, or just modify the "windows" of one of the other algorithms -- but packet loss and "de-compressibility" likely need to be accounted for. Going down this route, though, adds far more complexity. More ideas for helping out the compressor may be found at SO: How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?.
Also, the protocol should likely have a simple zero-compression "fall back", because some data (especially small or random data) might not be practically compressible, or might "compress" to a larger size. (zlib actually has this guard, but it also carries the "wrapper overhead", so very small data would be better encoded separately.) The overhead of the "wrapper" around the compressed data -- such as gzip or zip -- also needs to be taken into account for such small sizes. This is especially important to consider for string data of less than ~100 characters.
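As a sketch of that fall-back idea (raw deflate via the JDK's Deflater, with a hypothetical one-byte flag in front so the receiver knows whether to inflate; class and method names are made up):

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class MaybeCompress {
    // Returns a payload whose first byte says: 1 = deflated, 0 = stored as-is.
    static byte[] pack(byte[] raw) {
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION, true); // raw deflate, no zlib/gzip wrapper
        def.setInput(raw);
        def.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(1);                                 // optimistically mark as compressed
        byte[] buf = new byte[512];
        while (!def.finished()) {
            out.write(buf, 0, def.deflate(buf));
        }
        def.end();
        if (out.size() >= raw.length + 1) {           // compression did not pay off: store raw
            byte[] stored = new byte[raw.length + 1];
            stored[0] = 0;
            System.arraycopy(raw, 0, stored, 1, raw.length);
            return stored;
        }
        return out.toByteArray();
    }
}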
Happy coding.
Another thing to consider is the encoding used to shove the characters into the output stream. I would first start with UTF-8, but that may not always be ideal.
See SO: Best compression algorithm for short text strings, which suggests SMAZ, but I do not know how well this algorithm transfers to Unicode / binary data.
Also consider that not all deflate (or other format) implementations are created equal. I am not privy to how Java's standard deflate compares to a 3rd party one (say JZlib) in terms of efficiency for small data, but consider Compressing Small Payloads [.NET], which shows rather negative numbers for "the same compression" format. The article also ends nicely:
...it’s usually most beneficial to compress anyway, and determine which payload (the compressed or the uncompressed one) has the smallest size and include a small token to indicate whether decompression is required.
My final conclusion: always test using real-world data and measure the benefits, or you might be in for a little surprise in the end!
Happy coding. For real this time.
The simplest thing to do would be to layer a GZIPOutputStream on top of a ByteArrayOutputStream, as that is built into the JDK, using
// needs java.io.ByteArrayOutputStream, java.util.zip.GZIPOutputStream, java.nio.charset.StandardCharsets
ByteArrayOutputStream baos = new ByteArrayOutputStream();
GZIPOutputStream zos = new GZIPOutputStream(baos);
zos.write(someText.getBytes(StandardCharsets.UTF_8)); // pick an explicit charset
zos.finish();   // writes the gzip trailer
zos.flush();
byte[] udpBuffer = baos.toByteArray();
There may be other algorithms that do a better job, but I'd try this first to see if it fits your needs, as it doesn't require any extra jars and does a pretty good job.
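On the receiving side, the counterpart would be something along these lines (assuming udpBuffer and packetLength come from the received DatagramPacket, and the same UTF-8 charset as above):

ByteArrayInputStream bais = new ByteArrayInputStream(udpBuffer, 0, packetLength);
GZIPInputStream zis = new GZIPInputStream(bais);
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buf = new byte[1024];
int n;
while ((n = zis.read(buf)) > 0) {
    out.write(buf, 0, n);
}
String someText = new String(out.toByteArray(), StandardCharsets.UTF_8);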
Most standard compression algorithms don't work so well with small amounts of data. Often there is a header and a checksum, and it takes time for the compression to warm up, i.e. to build a data dictionary based on the data it has seen.
For this reason you can find that
small packets may be smaller, or the same size, if you send them with no compression at all.
a simple application/protocol-specific compression scheme is often better.
you may have to provide a prebuilt data dictionary to the compression algorithm and strip out the headers as much as possible.
I usually go with the second option for small data packets.
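For the last point, here is a rough sketch using java.util.zip with a shared, prebuilt dictionary and raw deflate (nowrap), so no zlib/gzip header or checksum is sent. The dictionary contents, class names, and the assumption that the receiver knows the original length are all placeholders for whatever your real protocol provides:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DictionaryCodec {
    // Prebuilt, shared dictionary: in practice, fill it with substrings that dominate your real packets.
    static final byte[] DICT =
            "{\"type\":\"\",\"id\":,\"timestamp\":,\"payload\":\"".getBytes(StandardCharsets.UTF_8);

    static byte[] compress(byte[] raw) {
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION, true); // raw deflate: no header or checksum
        def.setDictionary(DICT);
        def.setInput(raw);
        def.finish();
        byte[] buf = new byte[raw.length + 64];   // enough for small, possibly incompressible packets
        int n = def.deflate(buf);
        def.end();
        return Arrays.copyOf(buf, n);
    }

    static byte[] decompress(byte[] packed, int originalLength) throws Exception {
        Inflater inf = new Inflater(true);        // raw stream: set the same dictionary up front
        inf.setDictionary(DICT);
        inf.setInput(packed);
        byte[] out = new byte[originalLength];
        int off = 0;
        while (!inf.finished() && off < out.length) {
            off += inf.inflate(out, off, out.length - off);
        }
        inf.end();
        return out;
    }
}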
A good compression algorithm for short strings/URLs is this LZW implementation; it is in Java and can easily be ported to client-side GWT:
https://code.google.com/p/lzwj/source/browse/src/main/java/by/dev/madhead/lzwj/compress/LZW.java
Some remarks:
Use a 9-bit code word length for small strings (though you can experiment to see which works better). The resulting ratio ranges from 1 (very small strings, where the compressed output is no larger than the original) down to about 0.5 (larger strings).
In the case of client-side GWT, for other code word lengths it was necessary to adjust the input/output processing to work on a per-byte basis, to avoid bugs when buffering the bit sequence into a long, which is emulated for JS.
I'm using it for encoding complex URL parameters in client-side GWT, together with Base64 encoding and AutoBean serialization to JSON.
Update: a Base64 implementation is here: http://www.source-code.biz/base64coder/java
You have to change it to make it URL-safe, i.e. change the following characters:
'+' -> '-'
'/' -> '~'
'=' -> '_'
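For example, the substitution and its inverse can be done with plain String.replace (the method names here are just illustrative):

// Make a standard Base64 string safe to embed in a URL.
static String toUrlSafe(String base64) {
    return base64.replace('+', '-').replace('/', '~').replace('=', '_');
}

// Undo the substitution before Base64-decoding.
static String fromUrlSafe(String urlSafe) {
    return urlSafe.replace('-', '+').replace('~', '/').replace('_', '=');
}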
I'm trying to find a balance between performance and degree of compression when gzipping a Java webapp response.
In looking at the Deflater class, I can set a level and a strategy. The levels are self explanatory - BEST_SPEED to BEST_COMPRESSION.
I'm not sure regarding the strategies - DEFAULT_STRATEGY, FILTERED and HUFFMAN_ONLY
I can make some sense from the Javadoc but I was wondering if someone had used a specific strategy in their apps and if you saw any difference in terms of performance / degree of compression.
The strategy options mentioned in the Java Deflater originated, I believe, in zlib, the C implementation of ZLIB (RFC 1950) and DEFLATE (RFC 1951). They are present in virtually all compression libraries that implement DEFLATE.
To understand what they mean, you need to know a little about DEFLATE. The compression algorithm combines LZ77 and Huffman coding. The basics are:
LZ77 compression works by finding sequences of data that are repeated. Implementations typically use a "sliding window" of between 1k and 32k, to keep track of data that went before. For any repeats in the original data, instead of inserting the repeated data in the output, the LZ77 compression inserts a "back-reference". Imagine the back reference saying "here, insert the same data you saw 8293 bytes ago, for 17 bytes". The back-ref is encoded as this pair of numbers: a length - in this case 17 - and a distance (or offset) - in this case, 8293.
Huffman coding substitutes codes for the actual data. When the data says X, the Huffman code says Y. This obviously helps compression only if the substitute is shorter than the original. (a counter-example is in the Jim Carrey movie Yes Man, when Norm uses "Car" for a shortname for Carl. Carl points out that Carl is already pretty short.) The Huffman encoding algorithm does a frequency analysis, and uses the shortest substitutes for the data sequences that occur most often.
Deflate combines these, so that one can use Huffman codes on LZ77 back-refs. The strategy options on the various DEFLATE/ZLIB compressors just tell the library how much to weight Huffman versus LZ77.
FILTERED changes how the LZ77 matching is used: short matches (5 bytes or fewer) are discarded, forcing more Huffman coding. So when the documentation says
Use (Filtered) for data produced by a filter (or predictor), ... Filtered data consists mostly of small values with a somewhat random distribution.
(from the zlib man page)
...my reading of the code says that it still does LZ77 matching, but throws away matches of 5 or fewer bytes and relies on Huffman coding for those instead, which fits data consisting mostly of small, somewhat randomly distributed values. But the number 5 isn't mentioned in the doc, so there's no guarantee that number won't be changed from rev to rev, or from one implementation of ZLIB/DEFLATE to another (like the C version and the Java version).
HUFFMAN_ONLY says: only do the substitution codes based on frequency analysis. HUFFMAN_ONLY is very, very fast, but not very effective in compression for most data. Unless you have a very small range of byte values (for example, if the bytes in your actual data stream take only 20 of the 256 possible values), or have extreme requirements for speed in compression at the expense of size, HUFFMAN_ONLY will not be what you want.
DEFAULT combines the two in the way the authors expected it would be most effective for most applications.
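If you want to see the effect on your own responses rather than guess, a small harness along these lines (plain JDK Deflater; the input file is a placeholder for a captured webapp response) compares the output size for each strategy:

import java.io.ByteArrayOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.Deflater;

public class StrategyComparison {
    static int compressedSize(byte[] data, int strategy) {
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
        def.setStrategy(strategy);              // DEFAULT_STRATEGY, FILTERED or HUFFMAN_ONLY
        def.setInput(data);
        def.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8 * 1024];
        while (!def.finished()) {
            out.write(buf, 0, def.deflate(buf));
        }
        def.end();
        return out.size();
    }

    public static void main(String[] args) throws Exception {
        byte[] body = Files.readAllBytes(Paths.get(args[0]));   // e.g. a saved response body
        System.out.println("default : " + compressedSize(body, Deflater.DEFAULT_STRATEGY));
        System.out.println("filtered: " + compressedSize(body, Deflater.FILTERED));
        System.out.println("huffman : " + compressedSize(body, Deflater.HUFFMAN_ONLY));
    }
}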
As far as I know, within DEFLATE there is no way to do only LZ77. There is no LZ77_ONLY strategy. But of course you could build or acquire your own LZ77 encoder and that would be "LZ77 only".
Keep in mind that the strategy never affects the correctness of the compression; it affects only how the compression is carried out, and its performance in speed or size.
There are other ways to tweak the compressor. One is to set the size of the LZ77 sliding window. In the C library, this is specified with a "Window bits" option. If you understand LZ77, then you know that a smaller window means less searching back, which means faster compression, at the expense of missing some matches. This is often the more effective knob to turn when compressing.
The bottom line is that, for the 80% case, you don't care to tweak strategy. You might be interested in fiddling with window bits, just to see what happens. But only do that when you've done everything else you need to do in your app.
Reference: How DEFLATE works, by Antaeus Feldspar