Decompressing gzipped data with Inflater in Java

I'm trying to decompress gzipped data with Inflater. According to the docs,
If the parameter 'nowrap' is true then the ZLIB header and checksum
fields will not be used. This provides compatibility with the
compression format used by both GZIP and PKZIP.
Note: When using the 'nowrap' option it is also necessary to provide
an extra "dummy" byte as input. This is required by the ZLIB native
library in order to support certain optimizations.
Passing true to the constructor, then attempting to decompress data results in DataFormatException: invalid block type being thrown. Following the instructions in this answer, I've added a dummy byte to the end of setInput()'s parameter, to no avail.
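For reference, a minimal sketch of the failing approach (gzipped here is a placeholder for the actual gzip bytes):

import java.util.zip.Inflater;

byte[] input = new byte[gzipped.length + 1];   // extra "dummy" byte appended
System.arraycopy(gzipped, 0, input, 0, gzipped.length);
Inflater inflater = new Inflater(true);        // nowrap = true
inflater.setInput(input);
byte[] out = new byte[8192];
int n = inflater.inflate(out);                 // DataFormatException: invalid block type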
Will I have to use GZIPInputStream instead? What am I doing wrong?

The Java documentation is incorrect or at least misleading:
nowrap - if true then support GZIP compatible compression
What nowrap means is that raw deflate data will be decompressed. A gzip stream is raw deflate data wrapped with a gzip header and trailer. To fully decode the gzip format with this class, you will need to process the gzip header as described in RFC 1952, use Inflater to decompress the raw deflate data, compute the CRC-32 of the uncompressed data using the CRC32 class, and then verify the CRC and length (modulo 2^32) against the gzip trailer, again as specified in the RFC.
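A minimal sketch of those steps, assuming a single gzip member with no optional header fields (FLG == 0); a full RFC 1952 parser must also handle FEXTRA, FNAME, FCOMMENT and FHCRC:

import java.io.ByteArrayOutputStream;
import java.util.zip.CRC32;
import java.util.zip.Inflater;

static byte[] gunzip(byte[] gz) throws Exception {
    if ((gz[0] & 0xFF) != 0x1F || (gz[1] & 0xFF) != 0x8B)
        throw new IllegalArgumentException("not gzip");
    int offset = 10;                               // fixed-size gzip header
    Inflater inf = new Inflater(true);             // raw deflate
    inf.setInput(gz, offset, gz.length - offset - 8);  // exclude 8-byte trailer
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!inf.finished()) {
        int n = inf.inflate(buf);
        if (n == 0) break;
        out.write(buf, 0, n);
    }
    inf.end();
    byte[] data = out.toByteArray();
    CRC32 crc = new CRC32();                       // crc32 of the uncompressed data
    crc.update(data);
    if (crc.getValue() != readLE32(gz, gz.length - 8)
            || (data.length & 0xFFFFFFFFL) != readLE32(gz, gz.length - 4))
        throw new IllegalStateException("gzip trailer mismatch");
    return data;
}

static long readLE32(byte[] b, int i) {            // gzip trailer fields are little-endian
    return (b[i] & 0xFFL) | (b[i + 1] & 0xFFL) << 8
            | (b[i + 2] & 0xFFL) << 16 | (b[i + 3] & 0xFFL) << 24;
}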

I think that to read a GZIP stream it's not enough to set nowrap=true; you must also consume the gzip header, which is not part of the compressed stream. See e.g. readHeader() in this implementation.

Related

Decompressing/inflating zlib-compressed data without adler32 checksum

Update 2 (newest)
Here's the situation:
A foreign application is storing zlib deflated (compressed) data in this format:
78 9C BC (...data...) 00 00 FF FF - let's call it DATA1
If I take original XML file and deflate it in Java or Tcl, I get:
78 9C BD (...data...) D8 9F 29 BB - let's call it DATA2
The last 4 bytes of DATA2 are definitely the Adler-32 checksum, which in DATA1 is replaced with the zlib FULL-SYNC marker (why? I have no idea).
The 3rd byte differs by a value of 1.
The (...data...) is equal between DATA1 and DATA2.
Now the most interesting part: if I modify DATA1, changing the 3rd byte from BC to BD and leaving the last bytes untouched (so still ending 00 00 FF FF), and inflate this data with new Inflater(true) (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/zip/Inflater.html#%3Cinit%3E(boolean)), I am able to decode it correctly! (The Inflater in this mode requires neither the zlib header nor the Adler-32 checksum.)
Questions:
Why does changing BC to BD work? Is it safe to do in all cases? I checked a few cases and it worked each time.
Why would any application output an incorrect (?) deflate value of BC at all?
Why would the application start with a zlib header (78 9C) but not produce a compliant zlib structure (a sync-flush marker instead of the Adler-32)? It's not a small hobby application but a widely used business app (tens of thousands of business users, I would say).
Update 1 (old)
After further analysis, it seems that I have a zlib-compressed byte array that is missing the final Adler-32 checksum.
According to RFC 1950, a correct zlib stream must end with the Adler-32 checksum, but for some reason the dataset I work with has zlib bytes that lack it. The data always ends with 00 00 FF FF, which in the zlib format marks a SYNC FLUSH. For a complete zlib stream, the Adler-32 should follow, but there is none.
Still it should be possible to inflate such data, right?
As mentioned earlier (in the original question below), I've tried to pass this byte array to the Java Inflater (and to Tcl's, too), with no luck. Somehow the application that produces these bytes is able to read them correctly (as also mentioned below).
How can I decompress it?
Original question, before update:
Context
There is an application (closed source) that connects to MS SQL Server and stores compressed XML documents in a column of type image. On request, this application can export a document into a regular XML file on the local disk, so I have access to both the plain-text XML data and the compressed data directly in the database.
The problem
I'd like to be able to decompress any value from this column using my own code connecting to the SQL Server.
The problem is that the data is in some kind of weird zlib format. It does start with the typical zlib header bytes (78 9C), but I'm unable to decompress it (I used the method described at Java Decompress a string compressed with zlib deflate).
The whole data looks like 789CBC58DB72E238...7E526E7EFEA5E3D5FF0CFE030000FFFF (of course dots mean more bytes inside - total of 1195).
What I've tried already
What caught my attention was the trailing 0000FFFF, but even if I truncate it, decompression still fails. I also tried truncating every possible number of bytes from the end (in a loop, chopping one byte per iteration) - none of those iterations worked either.
I also compressed the original XML file to zlib bytes to see what it looks like, and apart from the 2 zlib header bytes and maybe 5-6 more bytes after them, the rest of the data was different. The number of output bytes was also different (smaller), but not by much (~1180 vs. 1195 bytes).
The difference on the deflate side is that the foreign application is using Z_SYNC_FLUSH or Z_FULL_FLUSH to flush the provided data so far to the compressed stream. You are (correctly) using Z_FINISH to end the stream. In the first case you end up with a partial deflate stream that is not terminated and has no check value. Instead it just ends with an empty stored block, which results in the 00 00 ff ff bytes at the end. In the second case you end up with a complete deflate stream and a zlib trailer with the check value. In that case, there happens to be a single deflate block (the data must be relatively small), so the first block is the last block, and is marked as such with a 1 as the low bit of the first byte.
What you are doing is setting the last-block bit on the first block (BC is 10111100 in binary, BD is 10111101; the low bit of the first byte of a deflate block is the BFINAL flag). This will not always work, since the stream may have more than one block, in which case some other bit in the middle of the stream would need to be set.
I'm guessing that what you are getting is part, but not all of the compressed data. There is a flush to permit transmission of the data so far, but that would normally be followed by continued compression and more such flushed packets.
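In practice this means the sync-flushed data can be inflated as-is, without flipping any bits: skip the 2-byte zlib header and inflate raw, stopping when no more output is produced (finished() never becomes true, because the stream was flushed rather than ended). A sketch:

import java.io.ByteArrayOutputStream;
import java.util.zip.Inflater;

static byte[] inflateSyncFlushed(byte[] data) throws Exception {
    Inflater inf = new Inflater(true);         // raw deflate: no zlib header, no Adler-32
    inf.setInput(data, 2, data.length - 2);    // skip the 78 9C header
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!inf.finished()) {
        int n = inf.inflate(buf);
        if (n == 0) break;                     // no more output: flushed stream exhausted
        out.write(buf, 0, n);
    }
    inf.end();
    return out.toByteArray();
}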
(Same question as #2, with the same answer.)

Java decompress HTTP GZIP content from json attribute

We are working with Packetbeat, a network packet analyzer tool, to capture HTTP requests and responses. Packetbeat persists these packet events in JSON format. The problem comes when the server supports gzip compression: Packetbeat cannot unzip the content, so it saves the gzip content directly as a JSON attribute. As you can see (note: the JSON has been simplified):
{
  {
    ...,
    "content-type": "application/json;charset=UTF-8",
    "transfer-encoding": "chunked",
    "content-length": 6347,
    "x-application-context": "proxy-service:pre,native:8080",
    "content-encoding": "gzip",
    "connection": "keep-alive",
    "date": "Mon, 18 Dec 2017 07:18:23 GMT"
  },
  "body": "\u001f\ufffd\u0008\u0000\u0000\u0000\u0000\u0000\u0000\u0003\ufffd]k\ufffd\u0014DZ\ufffd/\ufffdYI\ufffd#\ufffd*\ufffdo\ufffd\ufffd\ufffd\u0002\t\u0010^\ufffd\u001c\u000eE=\ufffd{\ufffdb\ufffd\ufffdE\ufffd\ufffdC\ufffd\ufffdf\ufffd,\ufffd\u003e\ufffd\ufffd\ufffd\u001ef\u001a\u0008\u0005\ufffd\ufffdg\ufffd\ufffd\ufffdYYU\ufffd\ufffd;\ufffdoN\ufffd\ufffd\ufffdg\ufffd\u0011UdK\ufffd\u0015\u0015\ufffdo\u000eH\ufffd\u000c\u0015Iq\ndC\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd ... "
}
We are thinking of preprocessing the packet JSON files to unzip the content. Could someone tell me what I need to decompress the zipped "body" JSON attribute using Java?
Your data is irrecoverably broken. Generally I would suggest Base64 encoding for transferring binary data packed into JSON, but you can read about possible alternatives in "Binary Data in JSON String. Something better than BASE64" if you like experimenting.
Otherwise, in theory you could just use a variant of String.getBytes() to get an array of bytes and wrap the result into the streams mentioned in the other answer (note the class is spelled GZIPInputStream):

import java.io.ByteArrayInputStream;
import java.util.zip.GZIPInputStream;

byte[] bodyBytes = body.getBytes();
ByteArrayInputStream bais = new ByteArrayInputStream(bodyBytes);
GZIPInputStream gis = new GZIPInputStream(bais);
// read the decompressed bytes from gis, perhaps via an additional DataInputStream
Apart from the String-thing (which is usually not a good idea), this is how you unpack a gzip-compressed array of bytes.
However, valid gzip data starts with the magic number 0x1F, 0x8B (see Wikipedia, or dig up the actual specification). Your data starts with 0x1F (the \u001f part) but continues with a \ufffd Unicode character, which is the replacement character (see Wikipedia again).
Some tool encoded the binary data and did not like the 0x8B, most probably because it was >= 0x80. If you read further in your JSON, there are many \ufffd-s in it: all byte values above (or equal to) 0x80 have been replaced with it. So the data at the moment is irrecoverably broken, even if JSON supported raw binary data inside (which it does not).
In Java you can use the GZIPInputStream class to decode the GZIP data, I think you would need to turn the value into a ByteArrayInputStream first.

Is it possible to get a raw deflate out of java.util.zip.Deflater?

The zlib docs specify that one can pass a negative windowBits argument to the deflateInit2() function:
windowBits can also be –8..–15 for raw deflate. In this case, -windowBits
determines the window size. deflate() will then generate raw deflate data
with no zlib header or trailer, and will not compute an adler32 check value.
I've used this in my C code, and in Java am able to inflate bytes thus compressed by passing true for the nowrap parameter to the Inflater constructor.
However, passing true for the nowrap parameter to the Deflater does not yield a raw deflate. It returns 20 more bytes than I get with my C implementation, which sure sounds like a header and checksum. (Passing false for nowrap yields an even longer byte array.)
I've scanned through the Deflater docs but have not found a way to specify window bits, or to say that I want a raw deflate. Is it possible with this library, or would I need to use some other implementation to get a raw deflate in Java?
Deflater produces zlib-wrapped data (RFC 1950). Deflater(level, true) produces raw deflate data (RFC 1951).
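A minimal round trip to illustrate (both sides use nowrap = true; the snippet belongs in a method that declares throws Exception, and real code should loop on deflate()/inflate()):

import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

byte[] input = "hello, raw deflate".getBytes(StandardCharsets.UTF_8);

Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);  // nowrap
def.setInput(input);
def.finish();
byte[] compressed = new byte[1024];
int clen = def.deflate(compressed);            // RFC 1951 raw deflate, no header/trailer
def.end();

Inflater inf = new Inflater(true);             // nowrap on the inflate side too
inf.setInput(compressed, 0, clen);
byte[] restored = new byte[1024];
int rlen = inf.inflate(restored);              // restores the original bytes
inf.end();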

Compressing string to specified max chars in Java

I want to compress any string that has more than 50 chars in it. If the char count is less than 50, leave it untouched; otherwise it has to be compressed down to the required limit (50).
I will be inserting this compressed/uncompressed output into a DB. So, when fetching the data back from the DB, I want compressed strings to be easily distinguishable from uncompressed ones (some common pattern for compressed strings).
Please suggest the best compression lib/algorithm?
Please suggest the best compression lib/algorithm?
There is no such algorithm or library: not every string can be compressed to a given maximum number of characters (e.g. 50) without loss of information.
If your compression is allowed to be lossy, you can use the abbreviate function from Apache Commons Lang.
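For example, with commons-lang3 (note this is truncation, not reversible compression):

import org.apache.commons.lang3.StringUtils;

// Strings of 50 chars or fewer come back unchanged; longer ones are cut
// to 47 chars plus "..." so the total stays at 50.
String stored = StringUtils.abbreviate(text, 50);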
I would use Deflater / Inflater from the java.util.zip package. They use the native zlib compression library and are very efficient. It's also easy to differentiate compressed text from uncompressed: try to decompress it, and you will get an exception if it's plain text.
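A minimal sketch of that try-to-decompress pattern. Note that Deflater output is binary, so it belongs in a binary column (or should be Base64-encoded), and nothing guarantees it fits in 50 chars:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

static byte[] compress(String s) {
    Deflater def = new Deflater();
    def.setInput(s.getBytes(StandardCharsets.UTF_8));
    def.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[1024];
    while (!def.finished())
        out.write(buf, 0, def.deflate(buf));
    def.end();
    return out.toByteArray();
}

static String readBack(byte[] stored) {
    Inflater inf = new Inflater();
    inf.setInput(stored);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[1024];
    try {
        while (!inf.finished()) {
            int n = inf.inflate(buf);
            if (n == 0) break;
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    } catch (DataFormatException e) {
        return new String(stored, StandardCharsets.UTF_8);  // not compressed: plain text
    } finally {
        inf.end();
    }
}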

java AudioInputStream AudioSystem.write to pipe length error

Trying to pipe an AudioInputStream I got from a library to this process:
oggenc - -o wtf.ogg
via
AudioSystem.write(audio, AudioFileFormat.Type.WAVE, process.getOutputStream());
gives me the error:
IOException
stream length not specified
file: com.sun.media.sound.WaveFileWriter.write(WaveFileWriter.java)
line: 129
So it seems the audio stream length needs to be specified. This is 44.8 kHz stereo PCM audio with no length specified. I got it from a library, so I can't modify their code. I tried creating another AudioInputStream from this stream, but I couldn't find a proper length. So how can I specify the length?
So: you are reading audio from an unknown source and then trying to write it out in WAV format. The WAV format requires a header that contains the data length, and WaveFileWriter throws an exception when fed a stream of unknown length.
I guess your direction is Java -> oggenc.
So you need to check whether oggenc accepts a headerless WAV stream. If it does, just pass your audio to the output stream directly, without processing it with AudioSystem.write(), which is for headered WAVs.
According to http://www.linuxcommand.org/man_pages/oggenc1.html, oggenc can accept raw input.
You can also solve the length issue by reimplementing what AudioSystem.write() does. It fails with a stream in WAVE format because of the data length required in the header. You should write the audio data to an array first and then prepare a correct WAVE header. Example: How to record audio to byte array. A sketch of the buffering approach is below.
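A minimal sketch of that buffering approach, assuming the whole stream fits in memory (audio and process are the stream and the oggenc process from the question):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

// Buffer the unknown-length stream so the frame count becomes known,
// letting AudioSystem.write() emit a valid WAV header.
ByteArrayOutputStream buf = new ByteArrayOutputStream();
byte[] chunk = new byte[4096];
int n;
while ((n = audio.read(chunk)) > 0)
    buf.write(chunk, 0, n);
byte[] pcm = buf.toByteArray();

AudioFormat fmt = audio.getFormat();
AudioInputStream sized = new AudioInputStream(
        new ByteArrayInputStream(pcm), fmt,
        pcm.length / fmt.getFrameSize());      // frame length now known
AudioSystem.write(sized, AudioFileFormat.Type.WAVE, process.getOutputStream());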
