Java - Parallelizing Gzip

I was assigned to parallelize GZip in Java 7, and I am not sure whether it is possible.
The assignment is:
Parallelize gzip using a given number of threads
Each thread takes a 1024 KiB block, using the last 32 KiB of the previous block as a dictionary. There is an option to use no dictionary.
Read from stdin and write to stdout.
What I have tried:
I have tried using GZIPOutputStream, but there doesn't seem to be a
way to isolate and parallelize the deflate(), nor can I access the
deflater to alter the dictionary. I tried extending GZIPOutputStream, but it didn't behave the way I wanted, since I still couldn't isolate the compress/deflate step.
I tried using Deflater with wrap enabled and a FilterOutputStream to
output the compressed bytes, but I wasn't able to get it to compress
properly in GZip format. I set it up so that each thread has a compressor that writes to a byte array, which is then written to the OutputStream.
I am not sure whether I implemented my approaches incorrectly or took the wrong approaches entirely. Can anyone point me in the right direction as to which classes to use for this project?

Yep, zipping a file with a dictionary can't be parallelized, as every part of the output depends on everything that came before it. Maybe your teacher asked you to parallelize the individual gzipping of multiple files in a folder? That would be a great example of parallelizable work.

To make a process concurrent, you need portions of code which can run concurrently and independently. Most compression algorithms are designed to run sequentially, where every byte depends on every byte that comes before it.
The only way to do compression concurrently is to change the algorithm (making it incompatible with existing approaches).

I think you can do it by inserting appropriate resets in the compression stream. The idea is that the underlying compression engine used in gzip allows the deflater to be reset; the intent of that feature is to make it easier to recover from stream corruption, at the cost of a worse compression ratio. After a reset, the deflater is in a known state that is independent of the content compressed so far. You could therefore start from that state in multiple threads (and from many locations in the input data, of course), produce a compressed chunk in each, and include the bytes produced by the following reset so that the deflater is taken back to the known state. Then you just have to reassemble the compressed pieces into the overall compressed stream. “Simple!” (Hah!)
I don't know if this will work, and I suspect that the complexity of the whole thing will make it not a viable choice except when you're compressing single very large files. (If you had many files, it would be much easier to just compress each of those in parallel.) Still, that's what I'd try first.
(Also note that the gzip format is just a deflated stream with extra metadata.)
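For what it's worth, here is a rough Java 7 sketch of that reset idea (essentially what pigz does); all class and helper names are mine, empty input and error handling are ignored, and the whole input is buffered in memory for simplicity. Each block is deflated as a raw stream on its own thread, seeded via setDictionary() with the last 32 KiB of the previous block, flushed to a byte boundary with FULL_FLUSH, and the pieces are concatenated between a hand-written gzip header and trailer:
import java.io.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.zip.*;

public class ParallelGzipSketch {
    static final int BLOCK_SIZE = 1024 * 1024;  // 1024 KiB blocks, per the assignment
    static final int DICT_SIZE = 32 * 1024;     // 32 KiB dictionary window

    public static void main(String[] args) throws Exception {
        int threads = Integer.parseInt(args[0]);
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        List<byte[]> blocks = readBlocks(System.in);            // buffers everything; a real solution would pipeline
        List<Future<byte[]>> pieces = new ArrayList<Future<byte[]>>();
        for (int i = 0; i < blocks.size(); i++) {
            final byte[] block = blocks.get(i);
            final byte[] dict = (i == 0) ? null : tail(blocks.get(i - 1), DICT_SIZE);
            final boolean last = (i == blocks.size() - 1);
            pieces.add(pool.submit(new Callable<byte[]>() {
                public byte[] call() { return deflateBlock(block, dict, last); }
            }));
        }

        OutputStream out = new BufferedOutputStream(System.out);
        // 10-byte gzip header: magic, CM = 8 (deflate), no flags, no mtime, OS = unknown
        out.write(new byte[] { 0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, (byte) 0xff });
        CRC32 crc = new CRC32();
        long total = 0;
        for (int i = 0; i < blocks.size(); i++) {
            out.write(pieces.get(i).get());                     // compressed pieces go out strictly in order
            crc.update(blocks.get(i));
            total += blocks.get(i).length;
        }
        writeIntLE(out, (int) crc.getValue());                  // gzip trailer: CRC32 of the uncompressed data...
        writeIntLE(out, (int) total);                           // ...and its length mod 2^32, both little-endian
        out.flush();
        pool.shutdown();
    }

    static byte[] deflateBlock(byte[] block, byte[] dict, boolean last) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // true = raw deflate, no zlib wrapper
        if (dict != null) def.setDictionary(dict);
        def.setInput(block);
        if (last) def.finish();
        ByteArrayOutputStream piece = new ByteArrayOutputStream();
        byte[] buf = new byte[64 * 1024];
        while (true) {
            int n = last
                    ? def.deflate(buf)                                       // drain until the final block is written
                    : def.deflate(buf, 0, buf.length, Deflater.FULL_FLUSH);  // Java 7+: flush to a byte boundary
            piece.write(buf, 0, n);
            if (last ? def.finished() : n < buf.length) break;
        }
        def.end();
        return piece.toByteArray();
    }

    static List<byte[]> readBlocks(InputStream in) throws IOException {
        List<byte[]> blocks = new ArrayList<byte[]>();
        byte[] buf = new byte[BLOCK_SIZE];
        int filled = 0, n;
        while ((n = in.read(buf, filled, buf.length - filled)) != -1) {
            filled += n;
            if (filled == buf.length) { blocks.add(buf); buf = new byte[BLOCK_SIZE]; filled = 0; }
        }
        if (filled > 0) blocks.add(Arrays.copyOf(buf, filled));
        return blocks;
    }

    static byte[] tail(byte[] b, int n) {
        return b.length <= n ? b : Arrays.copyOfRange(b, b.length - n, b.length);
    }

    static void writeIntLE(OutputStream out, int v) throws IOException {
        out.write(v); out.write(v >>> 8); out.write(v >>> 16); out.write(v >>> 24);
    }
}
The stitching logic is the important part: non-final pieces end on a FULL_FLUSH boundary, so simply concatenating them in order yields one valid deflate stream, and only the final piece carries the "last block" marker.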

Related

Deflate algorithm different result with different software

I am currently reading about the deflate algorithm and as part of learning I picked one file that I zipped using different methods. What I found and what confuses me very much is that the different methods produced different bytes representing the compressed file.
I tried zipping the file using WinRar, 7-Zip, the Java zlib library (ZipOutputStream class), and also manually by just doing the deflate on the source data (Deflater class). All four methods produced completely different bytes.
My goal was just to see that all of the methods produced the same byte array as a result, but this was not the case and my question is why could that be? I made sure by checking the file headers that all of this software actually used the deflate algorithm.
Can anyone help with this? Is it possible that the deflate algorithm can produce different compressed results for exactly the same source file?
There are many, many deflate representations of the same data. Surely you have already noticed that you can set a compression level. That could only have an effect if there were different ways to compress the same data. What you get depends on the compression level, any other compression settings, the software you are using, and the version of that software.
The only guarantee is that when you compress and then decompress, you get exactly what you started with. There is no guarantee, nor does there need to be or should be such a guarantee, that you get the same thing when you decompress and then compress.
Why do you have that goal?
The reason is that Deflate is a format, not an algorithm. The compression happens in two steps: LZ77 (here you have a large choice among a near-infinity of possible algorithms). Then, the LZ77 messages are encoded with Huffman trees (again, a very large number of choices about how to define those trees). Additionally, from time to time in the stream of LZ77 messages, it is good to redefine the trees and start a new block - or not. Here there is again an enormous number of choices about how to split those blocks.
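As a quick illustration of the "many representations" point (this snippet is mine, not from the answers above): compressing the same bytes at two different levels with java.util.zip.Deflater typically gives different output, even though both streams inflate back to the same data.
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.Deflater;

public class DeflateVariants {
    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) sb.append("the quick brown fox jumps over the lazy dog ").append(i);
        byte[] data = sb.toString().getBytes();

        byte[] fast = deflate(data, Deflater.BEST_SPEED);
        byte[] best = deflate(data, Deflater.BEST_COMPRESSION);
        // Both inflate back to the same input, but the compressed bytes (and sizes) typically differ.
        System.out.println("level 1: " + fast.length + " bytes, level 9: " + best.length
                + " bytes, identical? " + Arrays.equals(fast, best));
    }

    static byte[] deflate(byte[] data, int level) {
        Deflater def = new Deflater(level);
        def.setInput(data);
        def.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!def.finished()) out.write(buf, 0, def.deflate(buf));
        def.end();
        return out.toByteArray();
    }
}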

Writing a random access file transparently to a zip file

I have a Java application that writes a RandomAccessFile to the file system. It has to be a RAF because some things are not known until the end, at which point I seek back and write some information at the start of the file.
I would like to somehow put the file into a zip archive. I guess I could just do this at the end, but this would involve copying all the data that has been written so far. Since these files can potentially grow very large, I would prefer a way that somehow did not involve copying the data.
Is there some way to get something like a "ZipRandomAccessFile", a la the ZipOutputStream which is available in the jdk?
It doesn't have to be jdk only, I don't mind taking in third party libraries to get the job done.
Any ideas or suggestions..?
Maybe you need to change the file format so it can be written sequentially.
In fact, since it is a Zip and Zip can contain multiple entries, you could write the sequential data to one ZipEntry and the data known 'only at completion' to a separate ZipEntry - which gives the best of both worlds.
It is easy to write, not having to go back to the beginning of the large sequential chunk. It is easy to read - if the consumer needs to know the 'header' data before reading the larger resource, they can read the data in that zip entry before proceeding.
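A minimal sketch of that two-entry layout, with invented entry names and placeholder header values:
import java.io.*;
import java.util.zip.*;

public class TwoEntryWriter {
    public static void main(String[] args) throws IOException {
        ZipOutputStream zip = new ZipOutputStream(new FileOutputStream("output.zip"));
        try {
            zip.putNextEntry(new ZipEntry("data.bin"));
            // ...write the large payload here, sequentially, as it is produced...
            zip.write(new byte[0]);
            zip.closeEntry();

            zip.putNextEntry(new ZipEntry("header.properties"));
            zip.write("recordCount=42\nchecksum=deadbeef\n".getBytes("UTF-8")); // values known only at completion
            zip.closeEntry();
        } finally {
            zip.close();
        }
    }
}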
The way the DEFLATE format is specified, it only makes sense if you read it from the start. So each time you'd seek back and forth, the underlying zip implementation would have to start reading the file from the start. And if you modify something, the whole file would have to be decompressed first (not just up to the modification point), the change applied to the decompressed data, then compress the whole thing again.
To sum it up, ZIP/DEFLATE isn't the format for this. However, breaking your data up into smaller, fixed size files that are compressed individually might be feasible.
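If you do go the chunking route, a sketch of it (file naming is invented here) could be as simple as compressing each fixed-size chunk to its own small deflate stream, so rewriting one chunk never touches the others:
import java.io.*;
import java.util.zip.*;

public class ChunkStore {
    static final int CHUNK_SIZE = 256 * 1024;   // arbitrary fixed chunk size

    static void writeChunk(File dir, int chunkIndex, byte[] chunk) throws IOException {
        File f = new File(dir, "chunk-" + chunkIndex + ".z");
        DeflaterOutputStream out = new DeflaterOutputStream(new FileOutputStream(f));
        try {
            out.write(chunk);   // compress just this chunk, independent of all others
        } finally {
            out.close();        // finishes the deflate stream for this chunk
        }
    }

    static byte[] readChunk(File dir, int chunkIndex) throws IOException {
        File f = new File(dir, "chunk-" + chunkIndex + ".z");
        InflaterInputStream in = new InflaterInputStream(new FileInputStream(f));
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] b = new byte[8192];
        int n;
        try {
            while ((n = in.read(b)) != -1) buf.write(b, 0, n);
        } finally {
            in.close();
        }
        return buf.toByteArray();
    }
}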
The point of compression is to recognize redundancy in data (like some characters occurring more often or repeated patterns) and make the data smaller by encoding it without that redundancy. This makes it infeasible to create a compression algorithm that would allow random access writing. In particular:
You never know in advance how well a piece of data can be compressed. So if you change some block of data, its compressed version will most likely be either longer or shorter.
As a compression algorithm processes the data stream, it uses the knowledge accumulated so far (like discovered repeated patterns) to compress the data at its current position. So if you change something, the algorithm needs to re-compress everything from that change to the end.
So the only reasonable solution is to manipulate the data and compress it all at once at the end.

Advice on replacing a block of bytes in a file at run time, when the file is read

Folks, I trust that the community will see this as a relevant question. My apologies if not; mods, please close.
I am developing a video playback app with static content for a customer. My customer wants me to implement some basic security to stop someone unpacking the deployed app (it's for Android) and simply copying the MPEGs. My customer has made basic protection a critical requirement and, he's paying the bills :)
The files are too big to decrypt on the fly so I'm considering the following approach. I'd welcome thoughts and suggestions as to alternatives. I am aware of the arguments for and against copy-protection schemes and security through obscurity, which my proposed approach uses, and my question is not "should I?".
Take a block of bytes, say 256, from somewhere in the header of the MPEG. Replace those bytes with random values such that the MPEG won't play without a lot of effort to repair it. Store the original 256 bytes in one of the app's bitmaps such that the bitmap still displays properly. When playing the video, read it in through a byte stream and replace the bytes with their original values before passing them to the output stream.
In summary:
Extract 256 bytes from the header of the MPEG
Store these bytes in a bitmap
Overwrite those 256 bytes in the MPEG with random values
At run time, read the 256 bytes back out of the bitmap
Read MPEG through an inputstream using a byte array buffer
Replace randomised bytes with the original values
Stream the input to an OutputStream which is the input to the video player (a rough sketch of such a filter follows).
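A minimal sketch of steps 4-7, assuming the 256 original bytes have already been recovered from the bitmap (class and variable names are invented):
import java.io.*;

class HeaderRestoringInputStream extends FilterInputStream {
    private final byte[] original;  // the real 256 header bytes, recovered from the bitmap
    private long pos = 0;           // how many bytes of the MPEG have passed through so far

    HeaderRestoringInputStream(InputStream in, byte[] original) {
        super(in);
        this.original = original;
    }

    @Override public int read() throws IOException {
        int b = in.read();
        if (b < 0) return b;
        int result = pos < original.length ? (original[(int) pos] & 0xff) : b;
        pos++;
        return result;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        // Swap back any scrambled bytes that fall inside this buffer.
        for (int i = 0; i < n && pos + i < original.length; i++) {
            buf[off + i] = original[(int) (pos + i)];
        }
        if (n > 0) pos += n;
        return n;
    }
    // Note: skip() and mark()/reset() are not handled in this sketch.
}
The playback code just reads from this stream as if the file were intact and copies it to the player's input.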
I do recognise at least 2 ways to defeat this, reverse engineering and screen grabbing, but the point is to prevent the average thief simply copying my customer's content with no effort.
Thoughts folks?
Thanks
I would suggest using an encryption/decryption scheme for the entire stream:
Real time video stream decryption is the standard way to deal with this issue. Its processing overhead is negligible when compared to the actual video decoding. For example, each and every single DVD player out there supports the CSS encryption scheme.
While using Java does impose some restrictions, such as the inability to make effective use of various CPU-specific instructions, you should be able to find a decryption algorithm that is not very expensive. I would suggest profiling your application before rejecting stream encryption algorithms out of hand.
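As a rough illustration of that suggestion (key and IV handling is out of scope and the names below are placeholders), wrapping the asset stream in a CipherInputStream with AES in CTR mode hands the player a decrypted stream on the fly:
import java.io.*;
import javax.crypto.*;
import javax.crypto.spec.*;

class StreamDecryption {
    // key and iv would come from wherever you decide to hide them; both are placeholders here
    static InputStream openDecrypted(File encryptedVideo, byte[] key, byte[] iv) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");   // stream-cipher mode: cheap, no padding
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return new CipherInputStream(new BufferedInputStream(new FileInputStream(encryptedVideo)), cipher);
    }
}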
Mangling the header does make some video files hard to read, but far from impossible. Some files have redundant information, others are actually the result of straight-out concatenation which would leave any following segments readable. Some streaming video codecs actually insert enough metadata to rebuild the stream every few seconds. And there are a lot of video formats out there.
In other words there is no way to guarantee that removing any number of bytes from the start of a file would make it unreadable. I also think that imposing on your client a bunch of restrictions w.r.t. the video formats that they can use is not reasonable and limits the future usefulness of your application.

Effectively compress strings of 10-1000 characters in Java?

I need to compress strings (written in a known but variable language) of anywhere from 10 to 1000 characters into individual UDP packets.
What compression algorithms available in Java are well suited to this task?
Are there maybe open source Java libraries available to do this?
"It depends".
I would start with just the primary candidates: LZMA ("7-zip"), deflate (direct; zlib: deflate + small wrapper; gzip: deflate + slightly larger wrapper; zip: deflate + even larger wrapper), bzip2 (I doubt this would be that good here, as it works best with a relatively large window), perhaps even one of the other LZ* branches like LZS, which has an RFC for IP payload compression, but...
...run some analysis based upon the actual data and the compression/throughput of several different approaches. Java has both GZIPOutputStream ("deflate in gzip wrapper") and DeflaterOutputStream ("plain deflate", recommended over the gzip or zip "wrappers") as standard, and there are LZMA Java implementations (you just need the compressor, not the container), so these should all be trivial to mock up.
If there is regularity between the packets then it is possible this could be exploited -- e.g. build cache mappings, Huffman tables, or just modify the "window" of one of the other algorithms -- but packet loss and "de-compressibility" likely need to be accounted for. Going down this route, though, adds far more complexity. More ideas for helping out the compressor may be found at SO: How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?.
Also, the protocol should likely have a simple "fall back" of zero compression, because some [especially small random] data might not be practically compressible or might "compress" to a larger size (zlib actually has this guard, but also has the "wrapper overhead", so it would be better encoded separately for very small data). The overhead of the "wrapper" for the compressed data -- such as gzip or zip -- also needs to be taken into account for such small sizes. This is especially important for string data of less than ~100 characters.
Happy coding.
Another thing to consider is the encoding used to shove the characters into the output stream. I would first start with UTF-8, but that may not always be ideal.
See SO: Best compression algorithm for short text strings which suggests SMAZ, but I do not know how this algorithm will transfer to unicode / binary.
Also consider that not all deflate (or other format) implementations are created equal. I am not privy to how Java's standard deflate compares to a 3rd-party one (say JZlib) in terms of efficiency for small data, but consider Compressing Small Payloads [.NET], which shows rather negative numbers for "the same compression" format. The article also ends nicely:
...it’s usually most beneficial to compress anyway, and determine which payload (the compressed or the uncompressed one) has the smallest size and include a small token to indicate whether decompression is required.
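A sketch of that fall-back, using an invented one-byte flag in front of the payload:
import java.io.*;
import java.util.zip.*;

class MaybeCompress {
    static byte[] encode(byte[] raw) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DeflaterOutputStream dos = new DeflaterOutputStream(baos); // plain deflate, no gzip wrapper
        dos.write(raw);
        dos.close();
        byte[] compressed = baos.toByteArray();

        byte[] payload = compressed.length < raw.length ? compressed : raw;
        byte[] packet = new byte[payload.length + 1];
        packet[0] = (byte) (payload == compressed ? 1 : 0);        // 1 = "decompress me", 0 = raw
        System.arraycopy(payload, 0, packet, 1, payload.length);
        return packet;
    }
}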
My final conclusion: always test using real-world data and measure the benefits, or you might be in for a little surprise in the end!
Happy coding. For real this time.
The simplest thing to do would be to layer a GZIPOutputStream on top of a ByteArrayOutputStream, as that is built into the JDK:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
GZIPOutputStream zos = new GZIPOutputStream(baos);
zos.write(someText.getBytes(StandardCharsets.UTF_8)); // be explicit about the charset (java.nio.charset.StandardCharsets)
zos.finish();                                          // writes the gzip trailer
zos.flush();
byte[] udpBuffer = baos.toByteArray();
There may be other algorithms that do a better job, but I'd try this first to see if it fits your needs, as it doesn't require any extra jars and does a pretty good job.
Most standard compression algorithms don't work so well with small amounts of data. Often there is a header and a checksum, and it takes time for the compression to warm up, i.e. build a data dictionary based on the data it has seen.
For this reason you can find that
small packets may come out smaller, or the same size, with no compression at all.
a simple application- or protocol-specific compression scheme is better.
you have to provide a pre-built data dictionary to the compression algorithm and strip out the headers as much as possible (sketched below).
I usually go with the second option for small data packets.
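A sketch of the pre-built dictionary idea from the last bullet (the dictionary bytes below are placeholders; in practice you would mine common substrings from real packets, and both ends must use the identical dictionary):
import java.util.Arrays;
import java.util.zip.*;

class DictionaryPackets {
    static final byte[] DICT = "GET POST http:// Content-Type: application/json".getBytes();

    static byte[] compress(byte[] packet) {
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
        def.setDictionary(DICT);                      // seed the window with expected substrings
        def.setInput(packet);
        def.finish();
        byte[] buf = new byte[packet.length + 64];    // big enough for small packets, even if incompressible
        int n = def.deflate(buf);
        def.end();
        return Arrays.copyOf(buf, n);
    }

    static byte[] decompress(byte[] data, int originalLength) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(data);
        byte[] out = new byte[originalLength];        // in a real protocol, send the uncompressed length alongside
        int n = inf.inflate(out);                     // returns 0 and flags that a dictionary is needed...
        if (inf.needsDictionary()) {
            inf.setDictionary(DICT);                  // ...so supply the same dictionary and inflate again
            n = inf.inflate(out);
        }
        inf.end();
        return Arrays.copyOf(out, n);
    }
}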
A good compression algorithm for short strings/URLs is this LZW implementation; it is in Java and can easily be ported for client-side GWT:
https://code.google.com/p/lzwj/source/browse/src/main/java/by/dev/madhead/lzwj/compress/LZW.java
Some remarks:
Use a 9-bit code word length for small strings (though you may experiment to see which works better). The compression ratio ranges from 1 (very small strings, where the compressed output is no larger than the original) down to about 0.5 (larger strings).
For client-side GWT with other code word lengths, the input/output processing had to be adjusted to work on a per-byte basis, to avoid bugs when buffering the bit sequence into a long, which is emulated in JS.
I'm using it for encoding complex URL parameters in client-side GWT, together with Base64 encoding and AutoBean serialization to JSON.
Update: a Base64 implementation is here: http://www.source-code.biz/base64coder/java
You have to change it to make it URL-safe, i.e. change the following characters:
'+' -> '-'
'/' -> '~'
'=' -> '_'
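In other words, after encoding you would apply something like the following (base64Text being whatever the encoder returned):
String urlSafe = base64Text.replace('+', '-').replace('/', '~').replace('=', '_');
String restored = urlSafe.replace('-', '+').replace('~', '/').replace('_', '=');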

How would you change a single byte in a file?

What is the best way to change a single byte in a file using Java? I've implemented this in several ways. One uses pure byte-array manipulation, but this is highly sensitive to the amount of memory available and doesn't scale past 50 MB or so (i.e. I can't allocate 100 MB worth of byte[] without getting OutOfMemory errors). I also implemented it another way which works, and scales, but it feels quite hacky.
If you're a java io guru, and you had to contend with very large files (200-500MB), how might you approach this?
Thanks!
I'd use RandomAccessFile, seek to the position I wanted to change and write the change.
If all I wanted to do was change a single byte, I wouldn't bother reading the entire file into memory. I'd use a RandomAccessFile, seek to the byte in question, write it, and close the file.
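A minimal sketch of that, for the single-byte case (names are mine):
import java.io.RandomAccessFile;

class PatchByte {
    static void patch(String path, long offset, byte newValue) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(path, "rw");
        try {
            raf.seek(offset);     // jump straight to the byte in question
            raf.write(newValue);  // overwrite it in place; the rest of the file is untouched
        } finally {
            raf.close();
        }
    }
}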
