### Update 2 (newest)
Here's the situation:
A foreign application is storing zlib deflated (compressed) data in this format:
78 9C BC (...data...) 00 00 FF FF - let's call it DATA1
If I take the original XML file and deflate it in Java or Tcl, I get:
78 9C BD (...data...) D8 9F 29 BB - let's call it DATA2
The last 4 bytes of DATA2 are definitely the Adler-32 checksum, which in DATA1 is replaced with the zlib sync-flush marker (why? I have no idea).
The 3rd byte differs by a value of 1.
The (...data...) is equal between DATA1 and DATA2.
Now the most interesting part: if I take DATA1, change the 3rd byte from BC to BD, leave the last bytes untouched (so the stream still ends with 00 00 FF FF) and inflate the data with new Inflater(true) (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/zip/Inflater.html#%3Cinit%3E(boolean)), I am able to decode it correctly! (In this mode the Inflater requires neither the zlib header nor the Adler-32 checksum.)
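A minimal sketch of that experiment (the class name is my own; it assumes the whole stream fits in memory and, as discussed in the answer, only works when the stream happens to contain a single deflate block):

```java
import java.util.Arrays;
import java.util.zip.Inflater;

public class FlipAndInflate {
    // Strip the 2-byte zlib header, set the BFINAL bit on the first deflate
    // block (BC -> BD), and inflate in raw mode, which expects neither a
    // zlib header nor an Adler-32 trailer.
    public static byte[] inflate(byte[] data1) throws Exception {
        byte[] raw = Arrays.copyOfRange(data1, 2, data1.length); // drop 78 9C
        raw[0] |= 1;                        // mark the first block as the last one
        Inflater inf = new Inflater(true);  // raw deflate mode
        inf.setInput(raw);
        byte[] buf = new byte[1 << 16];
        int n = inf.inflate(buf);
        inf.end();
        return Arrays.copyOf(buf, n);
    }
}
```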
Questions:
Why does changing BC to BD work? Is it safe to do in all cases? I checked a few cases and it worked each time.
Why would any application output an incorrect (?) deflate value of BC at all?
Why would the application start with a zlib header (78 9C) but not produce a compliant zlib structure (a sync-flush marker instead of the Adler-32 checksum)? It's not a small hobby application, but a widely used business app (I would say tens of thousands of business users).
### Update 1 (old)
After further analysis it seems that I have a zlib-compressed byte array that is missing the final Adler-32 checksum.
According to RFC 1950, a correct zlib stream must end with the Adler-32 checksum, but for some reason the dataset I work with has zlib bytes that lack it. The data always ends with 00 00 FF FF, which in the zlib format is a SYNC_FLUSH marker. For a complete zlib object there should be an Adler-32 checksum afterwards, but there is none.
Still it should be possible to inflate such data, right?
As mentioned earlier (in original question below), I've tried to pass this byte array to Java inflater (I've tried with one from Tcl too), with no luck. Somehow the application that produces these bytes is able to read it correctly (as also mentioned below).
How can I decompress it?
Original question, before update:
Context
There is an application (closed source) that connects to MS SQL Server and stores a compressed XML document there in a column of image type. This application - when requested - can export the document into a regular XML file on the local disk, so I have access to both the plain-text XML data and the compressed data directly in the database.
The problem
I'd like to be able to decompress any value from this column using my own code connecting to the SQL Server.
The problem is that it is some kind of weird zlib format. It does start with the typical zlib header bytes (78 9C), but I'm unable to decompress it (I used the method described at Java Decompress a string compressed with zlib deflate).
The whole data looks like 789CBC58DB72E238...7E526E7EFEA5E3D5FF0CFE030000FFFF (of course dots mean more bytes inside - total of 1195).
What I've tried already
What caught my attention was the ending 00 00 FF FF, but even if I truncate it, the decompression still fails. I also tried decompressing after truncating every possible number of bytes from the end (in a loop, chopping one byte per iteration) - none of the iterations worked either.
I also compressed the original XML file into zlib bytes to see what it looks like, and apart from the 2 zlib header bytes and maybe 5-6 more bytes afterwards, the rest of the data was different. The number of output bytes was also different (smaller), but not by much (roughly ~1180 vs 1195 bytes).
The difference on the deflate side is that the foreign application is using Z_SYNC_FLUSH or Z_FULL_FLUSH to flush the provided data so far to the compressed stream. You are (correctly) using Z_FINISH to end the stream. In the first case you end up with a partial deflate stream that is not terminated and has no check value. Instead it just ends with an empty stored block, which results in the 00 00 ff ff bytes at the end. In the second case you end up with a complete deflate stream and a zlib trailer with the check value. In that case, there happens to be a single deflate block (the data must be relatively small), so the first block is the last block, and is marked as such with a 1 as the low bit of the first byte.
What you are doing is setting the last block bit on the first block. This will in general not always work, since the stream may have more than one block. In that case, some other bit in the middle of the stream would need to be set.
I'm guessing that what you are getting is part, but not all of the compressed data. There is a flush to permit transmission of the data so far, but that would normally be followed by continued compression and more such flushed packets.
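For what it's worth, Java's Inflater can also recover the flushed data without patching any bytes, simply by not insisting that the stream be finished. A sketch (the class name is my own invention):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class PartialZlib {
    // Inflate a zlib stream that ends at a sync-flush point (00 00 FF FF)
    // and therefore has no Adler-32 trailer: stop when the inflater runs
    // out of input instead of waiting for finished().
    public static byte[] inflatePartial(byte[] data) throws DataFormatException {
        Inflater inf = new Inflater();       // zlib mode: consumes the 78 9C header
        inf.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inf.finished()) {
            int n = inf.inflate(buf);
            if (n == 0) break;               // no trailer will ever arrive; stop here
            out.write(buf, 0, n);
        }
        inf.end();
        return out.toByteArray();
    }
}
```

This avoids the bit-flipping entirely, so it also works when the partial stream contains more than one block.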
(Same question as #2, with the same answer.)
Related
We are working with Packetbeat, a network packet analyzer tool, to capture HTTP requests and responses. Packetbeat persists these packet events in JSON format. The problem comes when the server uses gzip compression: Packetbeat cannot unzip the content, so it saves the gzip content directly as a JSON attribute. As you can see (note: the JSON has been simplified):
{
{
... ,
"content-type":"application/json;charset=UTF-8",
"transfer-encoding":"chunked",
"content-length":6347,
"x-application-context":"proxy-service:pre,native:8080",
"content-encoding":"gzip",
"connection":"keep-alive",
"date":"Mon, 18 Dec 2017 07:18:23 GMT"
},
"body": "\u001f\ufffd\u0008\u0000\u0000\u0000\u0000\u0000\u0000\u0003\ufffd]k\ufffd\u0014DZ\ufffd/\ufffdYI\ufffd#\ufffd*\ufffdo\ufffd\ufffd\ufffd\u0002\t\u0010^\ufffd\u001c\u000eE=\ufffd{\ufffdb\ufffd\ufffdE\ufffd\ufffdC\ufffd\ufffdf\ufffd,\ufffd\u003e\ufffd\ufffd\ufffd\u001ef\u001a\u0008\u0005\ufffd\ufffdg\ufffd\ufffd\ufffdYYU\ufffd\ufffd;\ufffdoN\ufffd\ufffd\ufffdg\ufffd\u0011UdK\ufffd\u0015\u0015\ufffdo\u000eH\ufffd\u000c\u0015Iq\ndC\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd ... "
}
We are thinking of preprocessing the packet JSON files to unzip the content. Could someone tell me what I need to decompress the zipped "body" JSON attribute using Java?
Your data is irrecoverably broken. Generally I would suggest using Base64 encoding for transferring binary data packed into JSON, but you can read about possible alternatives in Binary Data in JSON String. Something better than Base64 if you like experimenting.
Otherwise, in theory you could just use a variant of String.getBytes() to get an array of bytes, and wrap the result into the streams mentioned in the other answer:

import java.io.ByteArrayInputStream;
import java.util.zip.GZIPInputStream;

byte[] bodyBytes = body.getBytes();
ByteArrayInputStream bais = new ByteArrayInputStream(bodyBytes);
GZIPInputStream gis = new GZIPInputStream(bais);
// read from gis here, perhaps through an additional DataInputStream

Apart from the String step (which is usually not a good idea), this is how you unpack a gzip-compressed array of bytes.
However, valid gzip data starts with the magic number 0x1F 0x8B (see Wikipedia, or dig up the actual specification). Your data starts with 0x1F (the \u001f part), but continues with the \ufffd Unicode character, which is a replacement character (see Wikipedia again).
Some tool encoded the binary data and did not like the 0x8B, most probably because it was >= 0x80. If you read further in your JSON, there are many \ufffd-s in it: every byte value at or above 0x80 has been replaced with this character. So the data is irrecoverably broken at this point, even if JSON supported raw binary data inside (which it does not).
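To illustrate the Base64 route: if the producing side could be changed to wrap the gzip bytes in Base64 before putting them into JSON, the consumer side becomes trivial. A hypothetical round trip (the class name is made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBase64 {
    // Producer side: gzip the raw bytes, then Base64-encode so the result
    // survives JSON transport unharmed.
    public static String pack(byte[] raw) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gos = new GZIPOutputStream(bos)) {
            gos.write(raw);
        }
        return Base64.getEncoder().encodeToString(bos.toByteArray());
    }

    // Consumer side: Base64-decode the JSON attribute, then gunzip.
    public static byte[] unpack(String body) throws Exception {
        byte[] gz = Base64.getDecoder().decode(body);
        try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gis.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```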
In Java you can use the GZIPInputStream class to decode the GZIP data, I think you would need to turn the value into a ByteArrayInputStream first.
I'm creating an Android application that needs a massive database (70 MB, but the application has to work offline...). The largest table has two columns, a keyword and a definition. The definitions themselves are relatively short, usually under 2000 characters, so compressing each one individually wouldn't save me much, since compression libraries store the rules needed to decompress a string as part of the compressed output.
However if I could compress all of these strings with the same set of rules and then store just the compressed data in the DB and the rules elsewhere, I could save a lot of space. Does anyone know of a library that will let me do something like this?
Desired behavior:
public String getDefinition(String keyword) {
    DecompressionObject decompresser = new DecompressionObject(RULES_FILE);
    byte[] data = queryDatabase(keyword);
    return decompresser.decompress(data);
}
The "rules", as you call them, are not why you are getting limited compression efficacy. The Huffman code table that precedes the data in a deflate stream is around 80 bytes, so it is not significant compared to your 2000-byte strings.
What is limiting the compression efficacy is simply a lack of history from which to draw matching strings. The only place to look for matching strings is in the 2000 characters, and then only in the preceding characters at any point in the compression.
What you could do to improve compression would be to create a dictionary of common strings that would be used as history to precede each string you are compressing. Then that same dictionary is provided to the decompressor ahead of time for it to use to decompress each string. This assumes that there is some commonality of content in your ensemble of strings.
zlib provides these functions in deflateSetDictionary() and inflateSetDictionary().
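In Java the same pair is exposed as Deflater.setDictionary() and Inflater.setDictionary(). A sketch of the round trip (the class name and dictionary contents are made up; on the inflate side the dictionary is supplied only when the inflater asks for it):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DictCodec {
    public static byte[] compress(byte[] data, byte[] dict) {
        Deflater def = new Deflater();
        def.setDictionary(dict);  // must be set before any data is compressed
        def.setInput(data);
        def.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!def.finished()) {
            out.write(buf, 0, def.deflate(buf));
        }
        def.end();
        return out.toByteArray();
    }

    public static byte[] decompress(byte[] compressed, byte[] dict) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!inf.finished()) {
            int n = inf.inflate(buf);
            if (n == 0 && inf.needsDictionary()) {
                inf.setDictionary(dict);  // same dictionary, supplied on demand
            } else if (n == 0) {
                break;
            } else {
                out.write(buf, 0, n);
            }
        }
        inf.end();
        return out.toByteArray();
    }
}
```

The dictionary would be built once from strings common across your definitions and shipped alongside the database.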
I have a multi-threaded client-server application that uses Vector<String> as a queue of messages to send.
I need, however, to send a file using this application. In C++ I would not really worry, but in Java I'm a little confused when converting anything to string.
Java has 2-byte characters. When you look at a Java string in hex, it usually looks like:
00XX 00XX 00XX 00XX
unless some non-ASCII Unicode characters are present.
Java also uses Big endian.
These facts make me unsure, whether - and eventually how - to add the file into the queue. Preferred format of the file would be:
-- Headers --
2 bytes Size of the block (excluding header, which means first four bytes)
2 bytes Data type (text message/file)
-- End of headers --
2 bytes Internal file ID (to avoid referring by filenames)
2 bytes Length of filename
X bytes Filename
X bytes Data
You can see I'm already using 2 bytes for all numbers to avoid some horrible operations required when getting 2 numbers out of one char.
But I have really no idea how to add the file data correctly. For numbers, I assume this would do:
StringBuilder packetData = new StringBuilder();
packetData.append((char) packetSize);
packetData.append((char) PacketType.BINARY.ordinal()); //Just convert enum constant to number
But the file data is a real problem. If I have described anything wrongly regarding the Java data types, please correct me - I'm a beginner.
Does it have to send only Strings? If it does, then you really need to encode the data using Base64 or similar. The best approach overall would probably be to send raw bytes. Depending on how difficult it would be to refactor your code to support byte arrays instead of just Strings, that may be worth doing.
To answer your String question I just saw pop up in the comments, there's a getBytes method on a String.
For the socket question, see:
Java sending and receiving file (byte[]) over sockets
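If the queue really must stay Vector&lt;String&gt;, one hedged option is to build the frame described in the question with a big-endian ByteBuffer (Java's default byte order) and then Base64-encode the whole thing, rather than casting numbers to char. A sketch with made-up names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class FilePacket {
    // Frame layout from the question; all numeric fields are 2-byte big-endian.
    // The size field excludes the 4-byte header (size + type).
    public static String encode(int fileId, String filename, byte[] data, int packetType) {
        byte[] nameBytes = filename.getBytes(StandardCharsets.UTF_8);
        int size = 2 + 2 + nameBytes.length + data.length;
        ByteBuffer buf = ByteBuffer.allocate(4 + size);  // big-endian by default
        buf.putShort((short) size);
        buf.putShort((short) packetType);
        buf.putShort((short) fileId);
        buf.putShort((short) nameBytes.length);
        buf.put(nameBytes);
        buf.put(data);
        return Base64.getEncoder().encodeToString(buf.array());
    }
}
```

The receiver Base64-decodes the String back to bytes and reads the same fields off a ByteBuffer; no 2-byte-character or endianness worries remain, because the String only ever carries ASCII.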
I am using netty, and have to parse binary data in a ChannelBufferInputStream. Here's the code I am using:
ins.skipBytes(14); // skip 14 bytes header
byte[] b = new byte[195]; // note that 195 is the length of data after inflation
(new InflaterInputStream(ins)).read(b, 0, 195);
This works as expected, but it sets the mark on the ChannelBufferInputStream after 195 bytes.
Needless to say, the mark should have been set after less than 195 bytes.
Is it possible to get the no. of 'actual' bytes read from the inputstream so that I can set the mark myself? Or is there some other way to inflate a ChannelBuffer's data in netty?
Without knowing what the larger code flow looks like, it's hard to recommend a best practice, but assuming you're reading an incoming network stream, a better pattern might be to use a sequence of pipeline handlers like:
HeaderHandler --> decoder, returns null until 14 bytes are read
InflaterDecoder --> inflates the remainder (will ZlibDecoder work?)
AppHandler --> Receives the inflated buffer
But to answer your first question directly: ChannelBufferInputStream.readBytes() will, to quote the Javadoc:
Returns the number of read bytes by this stream so far.
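If you'd rather stay with plain java.util.zip, the Inflater class can report how many compressed bytes it actually consumed, so you can reposition the mark yourself. A sketch (the class name is made up; it assumes the compressed region fits in a byte array):

```java
import java.util.zip.Inflater;

public class MeasuredInflate {
    // Inflates src into dst and returns { bytesProduced, compressedBytesConsumed }.
    // getBytesRead() counts only the bytes the inflater actually used, so any
    // trailing data after the deflate stream is not included.
    public static int[] inflateCounting(byte[] src, byte[] dst) throws Exception {
        Inflater inf = new Inflater();
        inf.setInput(src);
        int produced = 0;
        while (!inf.finished()) {
            int n = inf.inflate(dst, produced, dst.length - produced);
            if (n == 0) break;  // needs more input or a preset dictionary
            produced += n;
        }
        int consumed = (int) inf.getBytesRead();
        inf.end();
        return new int[] { produced, consumed };
    }
}
```

With the consumed count in hand, you can skipBytes() exactly that far on the original stream instead of relying on InflaterInputStream's read-ahead.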
E.g. I save the tuple T = {k1, v1, k2, v2} to Redis (from the Erlang side, via eredis):
eredis:q(Conn, ["SET", <<"mykey">>, term_to_binary(T)]).
I am trying to use the code below to read this erlang term:
Jedis j = Redis.pool.getResource();
byte[] t = j.get("mykey").getBytes();
OtpInputStream ois = new OtpInputStream(t);
System.out.println(OtpErlangObject.decode(ois));
The error is: com.ericsson.otp.erlang.OtpErlangDecodeException: Uknown data type: 239.
So how can I get the erlang term correctly?
Erlang side:
term_to_binary({k1, v1, k2, v2}).
<<131,104,4,100,0,2,107,49,100,0,2,118,49,100,0,2,107,50,
100,0,2,118,50>>
Java side:
j.get("mykey").getBytes():
-17 -65 -67 104 4 100 0 2 107 49 100 0 2 118 49 100 0 2 107 50 100 0 2 118 50.
It seems that only the first 3 bytes are different. So I changed those three bytes to the single byte 131,
and then it printed correctly with System.out.println(OtpErlangObject.decode(ois)).
But when the term is more complicated, such as a record with a list inside, this won't work, because other replaced characters appear not only at the head of the data but also in the middle and at the end.
Why is the data I saved different from what I got back?
The negative numbers at the beginning of the byte array are not valid values for erlang external term syntax.
I would assume that since you have been storing the erlang terms in redis this way for some time, you are inserting them correctly.
That really only leaves one thing: an encoding conversion is corrupting the bytes. When the value is fetched as a String (and again when you call getBytes()), the raw bytes are run through a character encoding, most likely your platform default (probably UTF-8). Erlang's external term format is binary, not text, so every byte at or above 0x80 is invalid on its own in UTF-8 and gets replaced with U+FFFD, which re-encodes as EF BF BD - exactly the -17 -65 -67 you see. Once that replacement has happened, no encoding passed to getBytes() (ASCII included) can restore the original bytes. The fix is to avoid the String round-trip entirely and fetch the value as raw bytes - Jedis has a binary get(byte[]) overload for exactly this purpose.
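The replacement can be reproduced in a few lines (a sketch; the class name is made up, and it assumes the bytes were decoded as UTF-8). Feeding it the first bytes of the term_to_binary output yields exactly the -17 -65 -67 prefix from the question:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Corruption {
    // Decode raw bytes as UTF-8 and re-encode: any byte that is not valid
    // UTF-8 on its own (e.g. 131 = 0x83) becomes U+FFFD, which re-encodes
    // as EF BF BD (-17 -65 -67 as signed bytes).
    public static byte[] roundTrip(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_8);  // lossy decode
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```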