Is there a way to get the possible compression ratio of a file just by reading it?
You know, some files are more compressible than others... my software has to tell me the percentage of possible compression of my files.
e.g.
Compression Ratio: 50% -> I can save 50% of my file's space if I compress it
Compression Ratio: 99% -> I can save only 1% of my file's space if I compress it
First, this will depend largely on the compression method you choose. And second, I seriously doubt it's possible without a computation whose time and space cost is comparable to actually doing the compression. I'd say your best bet is to compress the file, keeping track of the size of what you've already produced and dropping/freeing it (once you're done with it, obviously) instead of writing it out.
To actually do this, unless you really want to implement it yourself, it'll probably be easiest to use the java.util.zip package, in particular the Deflater class and its deflate method.
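For example, a rough sketch (untested, and assuming deflate as the target format) that feeds the file through a Deflater and only counts the output bytes instead of keeping them:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.Deflater;

public class CompressionRatioEstimator {

    // Returns the deflated size of the file without keeping any compressed data around.
    static long deflatedSize(Path file) throws IOException {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        byte[] input = new byte[8192];
        byte[] scratch = new byte[8192]; // output is counted, then overwritten
        long compressedBytes = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(input)) != -1) {
                deflater.setInput(input, 0, read);
                while (!deflater.needsInput()) {
                    compressedBytes += deflater.deflate(scratch);
                }
            }
            deflater.finish();
            while (!deflater.finished()) {
                compressedBytes += deflater.deflate(scratch);
            }
        } finally {
            deflater.end();
        }
        return compressedBytes;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        long original = Files.size(file);
        long compressed = deflatedSize(file);
        System.out.printf("Compressed size is %.1f%% of the original%n",
                100.0 * compressed / original);
    }
}

The scratch buffer is reused, so the only memory cost is the two 8 KB buffers plus the deflater's internal state.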
First, you need some background in information theory. There are two relevant theories in this field:
According to Shannon, one can compute the entropy (i.e. the compressed size) of a source by using its symbol probabilities. So, the smallest compressed size is defined by a statistical model which produces symbol probabilities at each step. All algorithms use that approach, implicitly or explicitly, to compress data. See the Wikipedia article on entropy for more details.
According to Kolmogorov, the smallest compressed size can be found by finding the smallest possible program which produces the source. In that sense, it is not computable. Some programs partially use that approach to compress data (e.g. you can write a small console application which produces 1 million digits of pi instead of zipping those 1 million digits of pi).
So, you can't find the compressed size without evaluating the actual compression. But if you need an approximation, you can rely on Shannon's entropy theory and build a simple statistical model. Here is a very simple solution:
Compute order-1 statistics for each symbol in the source file.
Calculate entropy by using those statistics.
Your estimate will be more or less the same as what ZIP's default compression algorithm (deflate) achieves. Here is a more advanced version of the same idea (be aware it uses lots of memory!). It actually uses entropy to determine block boundaries, segmenting the file into homogeneous blocks.
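A minimal sketch of that estimate (single-symbol byte frequencies only, so it ignores the repeated sequences that deflate's LZ77 stage would also exploit):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EntropyEstimate {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));

        // Step 1: single-symbol statistics - count how often each byte value occurs.
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++;
        }

        // Step 2: Shannon entropy in bits per symbol, H = -sum(p * log2(p)).
        double bitsPerByte = 0.0;
        for (long count : counts) {
            if (count == 0) {
                continue;
            }
            double p = (double) count / data.length;
            bitsPerByte -= p * (Math.log(p) / Math.log(2));
        }

        long estimatedSize = (long) Math.ceil(bitsPerByte * data.length / 8);
        System.out.printf("Entropy: %.3f bits/byte, estimated compressed size: %d of %d bytes%n",
                bitsPerByte, estimatedSize, data.length);
    }
}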
Not possible without examining the file. The only thing you can do is estimate an approximate ratio by file extension, based on statistics gathered from a relatively large sample by doing the actual compression and measuring. For example, a statistical analysis will likely show that .zip and .jpg files are not very compressible, while files like .txt and .doc compress heavily.
The results of this will be rough guidance only and will probably be way off in some cases, as there's absolutely no guarantee of compressibility by file extension. The file could contain anything, no matter what the extension says it may or may not be.
UPDATE: Assuming you can examine the file, then you can use the java.util.zip APIs to read the raw file, compress it, and see what the before/after difference is.
Related
I am currently reading about the deflate algorithm and, as part of learning, I picked one file that I zipped using different methods. What I found, and what confuses me very much, is that the different methods produced different bytes for the compressed file.
I tried zipping the file using WinRAR, 7-Zip, the Java zlib library (ZipOutputStream class), and also manually by just applying deflate to the source data (Deflater class). All four methods produced completely different bytes.
My goal was just to see that all of the methods produced the same byte array as a result, but this was not the case, and my question is why that could be. I made sure by checking the file headers that all of this software actually used the deflate algorithm.
Can anyone help with this? Is it possible that the deflate algorithm can produce different compressed results for exactly the same source file?
There are many, many deflate representations of the same data. Surely you have already noticed that you can set a compression level. That could only have an effect if there were different ways to compress the same data. What you get depends on the compression level, any other compression settings, the software you are using, and the version of that software.
The only guarantee is that when you compress and then decompress, you get exactly what you started with. There is no guarantee, nor does there need to be or should be such a guarantee, that you get the same thing when you decompress and then compress.
Why do you have that goal?
The reason is that deflate is a format, not an algorithm. The compression happens in two steps: LZ77 (here you have a large choice among a quasi-infinity of possible algorithms), then the LZ77 messages are encoded with Huffman trees (again, a very large number of choices about how to define those trees). Additionally, from time to time in the stream of LZ77 messages, it is good to redefine the trees and start a new block - or not. Here again there is an enormous number of choices about how to split those blocks.
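To see this in practice, here is a small example (hedged: whether the two outputs differ, and by how much, depends on the input and on the zlib version bundled with your JDK) that deflates the same data at two levels and compares the raw bytes:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;

public class DeflateVariants {

    // Deflate the input with the given level and return the raw compressed bytes.
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level, true); // true = raw deflate, no zlib wrapper
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[input.length * 2 + 64]; // plenty for this small input
        int length = 0;
        while (!deflater.finished()) {
            length += deflater.deflate(buffer, length, buffer.length - length);
        }
        deflater.end();
        return Arrays.copyOf(buffer, length);
    }

    public static void main(String[] args) {
        byte[] input = ("the quick brown fox jumps over the lazy dog, "
                + "the quick brown fox jumps over the lazy dog")
                .getBytes(StandardCharsets.UTF_8);

        byte[] fast = deflate(input, Deflater.BEST_SPEED);
        byte[] best = deflate(input, Deflater.BEST_COMPRESSION);

        System.out.println("BEST_SPEED:       " + fast.length + " bytes");
        System.out.println("BEST_COMPRESSION: " + best.length + " bytes");
        System.out.println("Identical output? " + Arrays.equals(fast, best));
    }
}

Both outputs inflate back to the same original data; only the encoded representation differs.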
I am testing Huffman coding now, and I wanted to know which types of files (like .txt, .jpg, .mp3 etc.) compress well under Huffman-based compression. I implemented Huffman coding in Java and found that I was getting about a 40% size reduction for .txt files (ones with ordinary English text) and almost 0% to 1% reduction on .jpg, .mp3, and .mp4 files (of course I haven't tested it on huge files above 1 MB, because my program is super slow). I understand that Huffman coding works best for files which have more frequently occurring symbols, but I do not know what kind of symbols are in a video, audio or image file, hence the question. Since I have designed this program (I did it for a project at school, I will not deny it; I did it on my own and am only asking for a few pointers for my research), I wanted to know where it would work well.
Thanks.
Note: I initially created this project only for .txt files and, to my surprise, it worked on all other types of files as well, so I wanted to test it, which is why I had to ask this question. I found out that for image files you don't encode the symbols themselves, but rather some RGB values? Correct me if I am wrong.
It's all about the amount of redundancy in the file.
In any file, each byte occupies 8 bits, allowing 256 distinct symbols per byte. In a text file, a relatively small number of those symbols are actually used, and the distribution of the symbols is not flat (there are more e's than q's). Thus the information "density" is more like 5 bits per byte.
JPEG, MP3 and MP4 files are already compressed and have almost no redundancy. All 256 symbols are used, with about equal frequency, so the information "density" is very close to 8 bits per byte. You cannot compress them further.
So, let's say I want to recode a PNG to JPEG in Java. The image has an extreme resolution, let's say for example 10,000 x 10,000 px. Using the "standard" Java image API writers and readers, you need at some point to have the entire image decoded in RAM, which takes an extreme amount of RAM space (hundreds of MB). I have been looking at how other tools do this, and I found that ImageMagick uses disk pixel storage, but this seems to be way too slow for my needs. So what I need is a true streaming recoder. And by true streaming I mean reading and processing the data in chunks or bins, not just taking a stream as input and decoding it whole beforehand.
Now, first the theory behind it - is it even possible, given the JPEG and PNG algorithms, to do this using streams, or let's say in bins of data, so that there is no need to have the entire image in memory (or other storage)? In JPEG compression, the first few stages could be done on streams, but I believe Huffman encoding needs to build the entire tree of value probabilities after quantization, therefore it needs to analyze the whole image - so the whole image needs to be decoded beforehand, or somehow on demand by regions.
And the golden question: if the above can be achieved, is there any Java library that actually works this way and saves a large amount of RAM?
If I create a 10,000 x 10,000 PNG file, full of incompressible noise, with ImageMagick like this:
convert -size 10000x10000 xc:gray +noise random image.png
I see ImageMagick uses 675M of RAM to create the resulting 572MB file.
I can convert it to a JPEG with vips like this:
vips im_copy image.png output.jpg
and vips uses no more than 100MB of RAM while converting, and takes 7 seconds on a reasonably specced iMac around 4 years old - albeit with SSD.
I have thought about this for a while, and I would really like to implement such a library. Unfortunately, it's not that easy. Different image formats store pixels in different ways. PNGs or GIFs may be interlaced. JPEGs may be progressive (multiple scans). TIFFs are often striped or tiled. BMPs are usually stored bottom-up. PSDs are channeled. Etc.
Because of this, the minimum amount of data you have to read to recode to a different format may, in the worst case, be the entire image (or maybe not, if the format supports random access and you can live with a lot of seeking back and forth)... Resampling (scaling) the image to a new file in the same format would probably work in most cases, though (probably not so well for progressive JPEGs, unless you can resample each scan separately).
If you can live with a disk buffer, though, as the second-best option, I have created some classes that allow BufferedImages to be backed by NIO MappedByteBuffers (memory-mapped file buffers, kind of like virtual memory). While the performance isn't really like that of in-memory images, it's also not entirely useless. Have a look at MappedImageFactory and MappedFileBuffer.
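For illustration only, a minimal sketch of the underlying NIO idea (this is not the MappedImageFactory code, just a plain memory-mapped temp file standing in for a large raster):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MappedRasterSketch {
    public static void main(String[] args) throws IOException {
        int width = 10_000, height = 10_000;
        long bytes = (long) width * height * 4; // packed ARGB: ~400 MB on disk, not on the heap

        Path temp = Files.createTempFile("raster", ".buf");
        temp.toFile().deleteOnExit(); // clean-up timing is platform dependent while the mapping is alive

        try (RandomAccessFile raf = new RandomAccessFile(temp.toFile(), "rw");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer pixels = channel.map(FileChannel.MapMode.READ_WRITE, 0, bytes);

            // Write one pixel (x=123, y=456) without ever holding the full raster in memory.
            long offset = ((long) 456 * width + 123) * 4;
            pixels.putInt((int) offset, 0xFF00FF00); // opaque green
            System.out.println("Pixel written; the 400 MB raster never touches the Java heap");
        }
    }
}

The operating system pages the mapped region in and out on demand, which is what keeps the heap usage small.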
I've written a PNG encoder/decoder that does that (reads and writes progressively, which only requires storing one row in memory): PNGJ
I don't know if there is something similar for JPEG.
I need to compress strings (written in a known but variable language) of anywhere from 10 to 1000 characters into individual UDP packets.
What compression algorithms available in Java are well suited to this task?
Are there maybe open source Java libraries available to do this?
"It depends".
I would start with just the primary candidates: LZMA ("7-zip"), deflate (direct; zlib: deflate + a small wrapper; gzip: deflate + a slightly larger wrapper; zip: deflate + an even larger wrapper), bzip2 (I doubt this would be that good here; it works best with a relatively large window), perhaps even one of the other LZ* branches like LZS, which has an RFC for IP payload compression, but...
...run some analysis based upon the actual data and the compression/throughput of several different approaches. Java has both GZIPOutputStream ("deflate in a gzip wrapper") and DeflaterOutputStream ("plain deflate", recommended over the gzip or zip "wrappers") as standard, and there are LZMA Java implementations (you just need the compressor, not the container), so these should all be trivial to mock up.
If there is regularity between the packets then it is possible this could be utilized - e.g. building cache mappings, Huffman tables, or just modifying the "window" of one of the other algorithms - but packet loss and "de-compressibility" likely need to be accounted for. Going down this route adds far more complexity, though. More ideas for helping out the compressor may be found at SO: How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?
Also, the protocol should likely have a simple "fall back" of zero compression, because some (especially small, random) data might not be practically compressible or might "compress" to a larger size (zlib actually has this guard, but also has the "wrapper overhead", so it would be better encoded separately for very small data). The overhead of the "wrapper" for the compressed data - such as gzip or zip - also needs to be taken into account for such small sizes. This is especially important to consider for string data of less than ~100 characters.
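A sketch of such a fall-back (the single token byte and the use of raw deflate are just this example's convention, not a standard format):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class PacketEncoder {

    private static final byte UNCOMPRESSED = 0;
    private static final byte DEFLATED = 1;

    // Encodes the payload, falling back to no compression when deflate does not help.
    static byte[] encode(String text) throws IOException {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, true); // raw deflate, no wrapper
        try (DeflaterOutputStream dos = new DeflaterOutputStream(baos, deflater)) {
            dos.write(raw);
        } finally {
            deflater.end();
        }
        byte[] compressed = baos.toByteArray();

        // One token byte tells the receiver whether to inflate the rest of the packet.
        boolean useCompressed = compressed.length < raw.length;
        byte[] smaller = useCompressed ? compressed : raw;

        byte[] packet = new byte[smaller.length + 1];
        packet[0] = useCompressed ? DEFLATED : UNCOMPRESSED;
        System.arraycopy(smaller, 0, packet, 1, smaller.length);
        return packet;
    }
}

The receiver reads the token first and only inflates when it says DEFLATED.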
Happy coding.
Another thing to consider is the encoding used to shove the characters into the output stream. I would first start with UTF-8, but that may not always be ideal.
See SO: Best compression algorithm for short text strings, which suggests SMAZ, but I do not know how well this algorithm will transfer to Unicode / binary.
Also consider that not all deflate (or other format) implementations are created equal. I am not privy to how Java's standard deflate compares to a 3rd party's (say JZlib) in terms of efficiency for small data, but consider Compressing Small Payloads [.NET], which shows rather negative numbers for "the same compression" format. The article also ends nicely:
...it’s usually most beneficial to compress anyway, and determine which payload (the compressed or the uncompressed one) has the smallest size and include a small token to indicate whether decompression is required.
My final conclusion: always test using real-world data and measure the benefits, or you might be in for a little surprise in the end!
Happy coding. For real this time.
The simplest thing to do would be to layer a GZIPOutputStream on top of a ByteArrayOutputStream, as that is built into the JDK:
// Compress the text into an in-memory buffer using the JDK's built-in GZIP support
ByteArrayOutputStream baos = new ByteArrayOutputStream();
GZIPOutputStream zos = new GZIPOutputStream(baos);
zos.write(someText.getBytes(StandardCharsets.UTF_8)); // use an explicit charset
zos.finish(); // flushes the remaining compressed data and writes the gzip trailer
zos.close();
byte[] udpBuffer = baos.toByteArray();
There may be other algorithms that do a better job, but I'd try this first to see if it fits your needs, as it doesn't require any extra jars and does a pretty good job.
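On the receiving end you would reverse the steps; a minimal sketch, assuming the sender encoded the text as UTF-8 before gzipping:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class GzipPacketReader {

    // Reverses the GZIP step on the receiving side of the UDP packet.
    static String decode(byte[] udpBuffer) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream zis = new GZIPInputStream(new ByteArrayInputStream(udpBuffer))) {
            byte[] buffer = new byte[1024];
            int read;
            while ((read = zis.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}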
Most standard compression algorithms don't work so well with small amounts of data. Often there is a header and a checksum, and it takes time for the compression to warm up, i.e. to build a data dictionary based on the data it has seen.
For this reason you may find that:
small packets may be smaller or the same size with no compression.
a simple application/protocol-specific compression is better
you have to provide a prebuilt data dictionary to the compression algorithm and strip out the headers as much as possible (see the sketch below).
I usually go with the second option for small data packets.
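For the prebuilt-dictionary idea above, here is a small sketch using the preset-dictionary support in Deflater/Inflater (the dictionary contents are purely hypothetical; both ends must agree on the exact same bytes):

import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionaryDemo {
    public static void main(String[] args) throws DataFormatException {
        // Hypothetical dictionary: phrases that occur in almost every packet.
        byte[] dictionary = "temperature=;humidity=;station=;timestamp="
                .getBytes(StandardCharsets.UTF_8);
        byte[] message = "station=42;temperature=21.5;humidity=60"
                .getBytes(StandardCharsets.UTF_8);

        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setDictionary(dictionary);   // both sides must agree on this
        deflater.setInput(message);
        deflater.finish();
        byte[] compressed = new byte[256];    // plenty of room for this tiny message
        int compressedLength = deflater.deflate(compressed);
        deflater.end();

        Inflater inflater = new Inflater();
        inflater.setInput(compressed, 0, compressedLength);
        byte[] restored = new byte[256];
        int n = inflater.inflate(restored);
        if (inflater.needsDictionary()) {     // zlib signals that the preset dictionary is required
            inflater.setDictionary(dictionary);
            n = inflater.inflate(restored);
        }
        inflater.end();

        System.out.println(compressedLength + " bytes -> " +
                new String(restored, 0, n, StandardCharsets.UTF_8));
    }
}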
A good compression algorithm for short strings/URLs is this LZW implementation; it is in Java and can be easily ported for client GWT:
https://code.google.com/p/lzwj/source/browse/src/main/java/by/dev/madhead/lzwj/compress/LZW.java
Some remarks:
Use a 9-bit code word length for small strings (though you may try which is better). The ratio ranges from 1 (for very small strings, where the compressed output is no larger than the original string) to 0.5 (for larger strings).
In the case of client GWT, for other code word lengths it was necessary to adjust input/output processing to work on a per-byte basis, to avoid bugs when buffering the bit sequence into a long, which is emulated for JS.
I'm using it for complex URL parameter encoding in client GWT, together with base64 encoding and AutoBean serialization to JSON.
UPD: a base64 implementation is here: http://www.source-code.biz/base64coder/java
You have to change it to make it URL-safe, i.e. change the following characters:
'+' -> '-'
'/' -> '~'
'=' -> '_'
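A tiny sketch of that substitution (these replacement characters are this answer's convention, not a standard; if you can use Java 8+, java.util.Base64.getUrlEncoder() already provides a URL-safe alphabet):

public class UrlSafeBase64 {

    // Make a regular base64 string safe to embed in a URL, per the mapping above.
    static String toUrlSafe(String base64) {
        return base64.replace('+', '-').replace('/', '~').replace('=', '_');
    }

    // Reverse the mapping before decoding.
    static String fromUrlSafe(String urlSafe) {
        return urlSafe.replace('-', '+').replace('~', '/').replace('_', '=');
    }
}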
I'm trying to find a balance between performance and degree of compression when gzipping a Java webapp response.
In looking at the Deflater class, I can set a level and a strategy. The levels are self-explanatory - BEST_SPEED to BEST_COMPRESSION.
I'm not sure about the strategies - DEFAULT_STRATEGY, FILTERED and HUFFMAN_ONLY.
I can make some sense of the Javadoc, but I was wondering if someone has used a specific strategy in their apps and whether you saw any difference in terms of performance / degree of compression.
The strategy options mentioned in the Java Deflater originated, I believe, in the zlib (C) implementation of ZLIB (RFC 1950) and DEFLATE (RFC 1951). They are present in virtually all compression libraries that implement DEFLATE.
To understand what they mean, you need to know a little about DEFLATE. The compression algorithm combines LZ77 and Huffman coding. The basics are:
LZ77 compression works by finding sequences of data that are repeated. Implementations typically use a "sliding window" of between 1k and 32k, to keep track of data that went before. For any repeats in the original data, instead of inserting the repeated data in the output, the LZ77 compression inserts a "back-reference". Imagine the back reference saying "here, insert the same data you saw 8293 bytes ago, for 17 bytes". The back-ref is encoded as this pair of numbers: a length - in this case 17 - and a distance (or offset) - in this case, 8293.
Huffman coding substitutes codes for the actual data. When the data says X, the Huffman code says Y. This obviously helps compression only if the substitute is shorter than the original. (a counter-example is in the Jim Carrey movie Yes Man, when Norm uses "Car" for a shortname for Carl. Carl points out that Carl is already pretty short.) The Huffman encoding algorithm does a frequency analysis, and uses the shortest substitutes for the data sequences that occur most often.
Deflate combines these, so that one can use Huffman codes on LZ77 back-refs. The strategy option on the various DEFLATE/ZLIB compressors just tells the library how much to weight Huffman versus LZ77.
FILTERED usually means the LZ77 matches are stopped at a length of 5. So when the documentation says
Use (Filtered) for data produced by a filter (or predictor), ... Filtered data consists mostly of small values with a somewhat random distribution.
(from the zlib man page)
...my reading of the code says that it does LZ77 matching, but only up to sequences of 5 or fewer bytes. That's what the doc means by "small values" I guess. But the number 5 isn't mentioned in the doc, so there's no guarantee that number won't be changed from rev to rev, or from one implementation of ZLIB/DEFLATE to another (like the C version and the Java version).
HUFFMAN_ONLY says: only do the substitution codes based on frequency analysis. HUFFMAN_ONLY is very, very fast, but not very effective in compression for most data. Unless you have a very small range of byte values (for example, if the bytes in your actual datastream take only 20 of the possible 256 values), or have extreme requirements for speed in compression at the expense of size, HUFFMAN_ONLY will not be what you want.
DEFAULT_STRATEGY combines the two in the way the authors expected to be most effective for most applications.
As far as I know, within DEFLATE there is no way to do only LZ77. There is no LZ77_ONLY strategy. But of course you could build or acquire your own LZ77 encoder and that would be "LZ77 only".
Keep in mind that the strategy never affects the correctness of the compression; it affects only how it operates and how it performs, either in speed or in output size.
There are other ways to tweak the compressor. One is to set the size of the LZ77 sliding window. In the C library, this is specified with a "Window bits" option. If you understand LZ77, then you know that a smaller window means less searching back, which means faster compression, at the expense of missing some matches. This is often the more effective knob to turn when compressing.
The bottom line is that, for the 80% case, you don't care to tweak strategy. You might be interested in fiddling with window bits, just to see what happens. But only do that when you've done everything else you need to do in your app.
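If you do want to experiment, here is a small sketch comparing the three strategies at the default level (the payload is made up; real numbers depend on your data and the zlib version bundled with your JDK):

import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class StrategyComparison {

    // Deflate the input with the given strategy at the default level and return the output size.
    static int deflatedSize(byte[] input, int strategy) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setStrategy(strategy);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[input.length * 2 + 64];
        int length = 0;
        while (!deflater.finished()) {
            length += deflater.deflate(buffer, length, buffer.length - length);
        }
        deflater.end();
        return length;
    }

    public static void main(String[] args) {
        // Hypothetical payload: repetitive text, which favours LZ77 matching.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n");
        }
        byte[] input = sb.toString().getBytes(StandardCharsets.UTF_8);

        System.out.println("DEFAULT_STRATEGY: " + deflatedSize(input, Deflater.DEFAULT_STRATEGY) + " bytes");
        System.out.println("FILTERED:         " + deflatedSize(input, Deflater.FILTERED) + " bytes");
        System.out.println("HUFFMAN_ONLY:     " + deflatedSize(input, Deflater.HUFFMAN_ONLY) + " bytes");
    }
}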
reference:
How DEFLATE works, by Antaeus Feldspar