We are working with Packetbeat, a network packet analyzer tool that captures HTTP requests and responses. Packetbeat persists these packet events in JSON format. The problem arises when the server supports gzip compression: Packetbeat cannot unzip the content and saves the gzip content directly as a JSON attribute. As you can see (note: the JSON has been simplified):
{
"headers": {
... ,
"content-type":"application/json;charset=UTF-8",
"transfer-encoding":"chunked",
"content-length":6347,
"x-application-context":"proxy-service:pre,native:8080",
"content-encoding":"gzip",
"connection":"keep-alive",
"date":"Mon, 18 Dec 2017 07:18:23 GMT"
},
"body": "\u001f\ufffd\u0008\u0000\u0000\u0000\u0000\u0000\u0000\u0003\ufffd]k\ufffd\u0014DZ\ufffd/\ufffdYI\ufffd#\ufffd*\ufffdo\ufffd\ufffd\ufffd\u0002\t\u0010^\ufffd\u001c\u000eE=\ufffd{\ufffdb\ufffd\ufffdE\ufffd\ufffdC\ufffd\ufffdf\ufffd,\ufffd\u003e\ufffd\ufffd\ufffd\u001ef\u001a\u0008\u0005\ufffd\ufffdg\ufffd\ufffd\ufffdYYU\ufffd\ufffd;\ufffdoN\ufffd\ufffd\ufffdg\ufffd\u0011UdK\ufffd\u0015\u0015\ufffdo\u000eH\ufffd\u000c\u0015Iq\ndC\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd ... "
}
We are thinking of preprocessing the packet JSON files to unzip the content. Could someone tell me what I need to decompress the gzipped "body" JSON attribute using Java?
Your data is irrecoverably broken. Generally I would suggest using Base64 encoding for transferring binary data packed into JSON; if you like experimenting, you can read about possible alternatives in "Binary Data in JSON String. Something better than Base64".
Otherwise, in theory you could just use a variant of String.getBytes() to get an array of bytes, and wrap the result in the streams mentioned in the other answer:
import java.io.ByteArrayInputStream;
import java.util.zip.GZIPInputStream;

byte[] bodyBytes = body.getBytes(java.nio.charset.StandardCharsets.ISO_8859_1); // ISO-8859-1 maps chars 0-255 back to single bytes
ByteArrayInputStream bais = new ByteArrayInputStream(bodyBytes);
GZIPInputStream gis = new GZIPInputStream(bais);
// do something with gis here, perhaps wrap it in an additional DataInputStream
Apart from the String-thing (which is usually not a good idea), this is how you unpack a gzip-compressed array of bytes.
However, valid gzip data starts with the magic number 0x1F, 0x8B (see Wikipedia, or you can also dig up the actual specification). Your data starts with 0x1F (the \u001f part), but continues with a \ufffd Unicode character, which is a replacement character (see Wikipedia again).
Some tool encoded the binary data and did not like the 0x8B, most probably because it was >= 0x80. If you read further in your JSON, there are many \ufffd-s in it: all byte values above (or equal to) 0x80 have been replaced with this character. So the data is irrecoverably broken at the moment, even if JSON supported raw binary data inside (which it does not).
In Java you can use the GZIPInputStream class to decode the GZIP data, I think you would need to turn the value into a ByteArrayInputStream first.
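Putting the two suggestions together: if the capture were fixed upstream to store the gzipped body as Base64 (as recommended above), decoding it could look like the following sketch. The class and method names, and the assumption that the decompressed body is UTF-8 text, are mine, not part of Packetbeat:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.zip.GZIPInputStream;

class GunzipBody {
    // Decodes a Base64-encoded gzip payload (the format suggested above for
    // storing binary bodies in JSON) back into the original text.
    static String decodeBody(String base64Gzip) throws IOException {
        byte[] gzipBytes = Base64.getDecoder().decode(base64Gzip);
        try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(gzipBytes))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = gis.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), java.nio.charset.StandardCharsets.UTF_8);
        }
    }
}
```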
Related
In an Android app I have a byte array containing data in the following format:
In another Node.js server, the same data is stored in a Buffer which looks like this:
I am looking for a way to convert both data to the same format so I can compare the two and check if they are equal. What would be the best way to approach this?
[B#cbf1911 is not a format. That is the result of invoking the .toString() method on a Java object which doesn't have a custom toString implementation (thus, you get the default implementation written in java.lang.Object itself). The format of that string is:
binary-style-class-name#system-identity-hashcode.
[B is the binary style class name. That's JVM-ese for byte[].
cbf1911 is the system identity hashcode, which is (highly oversimplified and not truly something you can use to look stuff up) basically the memory address.
It is not the content of that byte array.
Lots of Java APIs allow you to pass in any object and will just invoke toString for you. Wherever you're doing this, you wrote a bug; you need to write some explicit code to turn that byte array into data.
Note that converting bytes into characters, which you'll have to do whenever you need to put that byte array onto a character-based comms channel (such as JSON or email), is tricky.
<Buffer 6a 61 ...>
This lists each byte as a pair of hex digits. It is an incredibly inefficient format, but it gets the job done.
A better option is base64. That is merely highly inefficient (but not incredibly inefficient): it spends 4 characters to encode 3 bytes (versus the node.js format, which spends 3 characters to encode 1 byte). Base64 is a widely supported standard.
When encoding, you need to explicitly write that. When decoding, same story.
In java, to encode:
import android.util.Base64;

class Foo {
    void example() {
        byte[] array = ....;
        String base64 = Base64.encodeToString(array, Base64.DEFAULT);
        System.out.println(base64);
    }
}
That string is generally 'safe': it has no characters in it that could end up being interpreted as control flow (so no &lt;, no ", etc.), and it is 100% ASCII, which tends to survive broken charset-encoding transitions, common when tossing strings about the interwebs.
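For completeness, the decode direction on the Java side might look like this sketch. It uses the JDK's java.util.Base64; on Android, android.util.Base64.decode(base64, Base64.DEFAULT) is the equivalent call:

```java
import java.util.Base64;

class FooDecode {
    // Turns the Base64 string produced by the encoding example back into the original bytes.
    static byte[] example(String base64) {
        return Base64.getDecoder().decode(base64);
    }
}
```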
How do you decode base64 in node? I don't know, but I'm sure a web search for 'node base64 decode' will provide hundreds of tutorials.
Good luck!
In Java, I am polling a WebSphere MQ message queue, expecting a message of STRING format that is composed entirely of XML. Part of this XML will contain the bytes of a file attachment (any format: PDF, image, etc.), which will then be converted to a BLOB for storage in an Oracle DB and later retrieval.
The issue I am having is that the known size of example files being sent over end up in my Db with a different size. I am not adding anything to the bytes (as far as I know), and the size appears to be larger directly after I get the message. I cannot determine if I am somehow adding information at retrieve, conversion from bytes -> String, or if this is happening on the front end when the sender populates the message.
My code at retrieve of the message:
inboundmsg = new MQMessage();
inboundmsg = getMQMsg(FrontIncomingQueue, gmo);
strLen = inboundmsg.getMessageLength();
strData = new byte[strLen];
ibm_id = inboundmsg.messageId;
inboundmsg.readFully(strData);
inboundmsgContents = new String(strData);
I see a file known to have size 21K go to 28K. A coworker has suggested that charset/encoding may be the issue. I do not specify a charset in the constructor call to String above, nor in any of the calls to getBytes when converting back from a string (for other unrelated uses). My default charset is ISO-8859-1. When speaking with the vendor who is initiating the message transfer, I asked her what charset she is using. Her reply:
"I am using the File.WriteAllBytes method in C# - I pass it the path of my file and it writes it to a byte[]. I could not find any documentation on the MSDN about what encoding the function uses. The method creates a byte array and from what I have read online this morning there is no encoding, its just a sequence of 8bit unsigned binary data with no encoding."
Another coworker has suggested that perhaps the MQ charset is the culprit, but my reading of the documentation suggests that MQ charset only affects the behavior of readString, readLine, & writeString.
If I circumvent MQ totally, and populate a byte array using a File Input Stream and a local file, the file size is preserved all the way to Db store, so this definitely appears to be happening at or during message transfer.
The problem is evident in the wording of the question. You describe a payload that contains arbitrary binary data, yet you are trying to process it as a string. These two things are mutually exclusive.
This appears to be complicated by the vendor not providing valid XML. For example, consider the attachment:
<PdfBytes>iVBORw0KGgoAAAANS … AAAAASUVORK5CYII=</PdfBytes>
If the attachment legitimately contains any XML special character such as < or > then the result is invalid XML. If it contains null bytes, some parsers assume they have reached the end of the text and stop parsing there. That is why you normally see any attachment in XML either converted to Base64 for transport or else converted to hexadecimal.
The vendor describes writing raw binary data which suggests that what you are receiving contains non-string characters and therefore should not be sent as string data. If she had described some sort of conversion that would make the attachment XML compliant then string would be appropriate.
Interestingly, a Base64 encoding results in a payload that is about 1.33 times the size of the original. Coincidence that 21k * 1.33 ≈ 28k? One would think that what is received is actually the binary payload in Base64 format. That actually would be parseable as a string and accounts for the difference in file sizes. But it isn't at all what the vendor described doing. She said she's writing "8bit unsigned binary data with no encoding" and not Base64.
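The 4:3 expansion is easy to check directly with the JDK's Base64 encoder. In this sketch, the class name and the rounding of "21K" to exactly 21,000 bytes are illustrative only:

```java
import java.util.Base64;

class Base64Growth {
    // Base64 emits 4 output characters for every 3 input bytes (with padding),
    // so the encoded length of n bytes is 4 * ceil(n / 3).
    static int encodedLength(int rawBytes) {
        byte[] raw = new byte[rawBytes];
        return Base64.getEncoder().encodeToString(raw).length();
    }
}
```

For 21,000 input bytes this yields 28,000 characters, matching the observed 21K-to-28K growth.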
So we expect it to fail but not necessarily to result in a larger payload. Consider that WebSphere MQ receiving a message in String format will attempt to convert it. If the CCSID of the message differs from that requested on the GET then MQ will attempt a conversion. If the inbound CCSID is UTF-16 or any double-byte character set, certain characters will be expanded from one to two bytes - assuming the conversion doesn't hit invalid binary characters that cause it to fail.
If the two CCSIDs are the same then no conversion is attempted in the MQ classes but there is still an issue in that something has to parse an XML payload that is by definition not valid and therefore subject to unexpected results. If it happens that the binary payload does not contain any XML special characters and the parser doesn't choke on any embedded null bytes, then the parser is going to rather heroic lengths to forgive the non-compliant payload. If it gets to the </PdfBytes> tag without choking, it may assume that the payload is valid and convert everything between the <PdfBytes>...</PdfBytes> tags itself. Presumably to Base64.
All of this is conjecture, of course. But in a situation where the payload is unambiguously not string data any attempt to parse it as string data will either fail outright or produce unexpected and potentially bizarre results. You are actually unfortunate that it doesn't fail outright because now there's an expectation that the problem is on your end when it clearly appears to be the vendor's fault.
Assuming that the content of the payload remains unchanged, the vendor should be sending bytes messages and you should be receiving them as bytes. That would at least fix the problems MQ is having reconciling the expected format with the actual received format, but it would still be invalid XML. If it happens to work with the vendor sending binary data in a message set to type String and you processing it as bytes, then count your blessings and use it that way, but don't count on it being reliable. Eventually you'll get a payload with an embedded XML special character, and then you will have a very bad day.
Ideally, the vendor should know better than to send binary data in an XML payload without converting it first to string and it is up to them to fix it so that it is compliant with the XML spec and reliable.
Please see this MSDN page: XML, SOAP, and Binary Data
I have a multi-threaded client-server application that uses Vector<String> as a queue of messages to send.
I need, however, to send a file using this application. In C++ I would not really worry, but in Java I'm a little confused when converting anything to string.
Java has 2-byte characters. When you see a Java string in hex, it usually looks like:
00XX 00XX 00XX 00XX
Unless some Unicode characters are present.
Java also uses Big endian.
These facts make me unsure, whether - and eventually how - to add the file into the queue. Preferred format of the file would be:
-- Headers --
2 bytes Size of the block (excluding header, which means first four bytes)
2 bytes Data type (text message/file)
-- End of headers --
2 bytes Internal file ID (to avoid referring by filenames)
2 bytes Length of filename
X bytes Filename
X bytes Data
You can see I'm already using 2 bytes for all numbers to avoid some horrible operations required when getting 2 numbers out of one char.
But I have really no idea how to add the file data correctly. For numbers, I assume this would do:
StringBuilder packetData = new StringBuilder();
packetData.append((char) packetSize);
packetData.append((char) PacketType.BINARY.ordinal()); //Just convert enum constant to number
But the file data is really a problem. If I have described anything wrongly regarding the Java data types, please correct me; I'm a beginner.
Does it have to send only Strings? I think if it does then you really need to encode it using base64 or similar. The best approach overall would probably be to send it as raw bytes. Depending on how difficult it would be to refactor your code to support byte arrays instead of just Strings, that may be worth doing.
To answer your String question I just saw pop up in the comments, there's a getBytes method on a String.
For the socket question, see:
Java sending and receiving file (byte[]) over sockets
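To illustrate the byte-array approach suggested above, here is a sketch of building one packet in the question's layout with DataOutputStream, which writes big-endian (matching the Java convention the asker mentions). The class name, method signature, and UTF-8 filename encoding are my assumptions:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class PacketBuilder {
    // Builds one packet from the question's layout using raw bytes instead of chars.
    static byte[] build(int fileId, String filename, byte[] data, int packetType) throws IOException {
        byte[] nameBytes = filename.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        int size = 2 + 2 + nameBytes.length + data.length; // everything after the 4-byte header
        out.writeShort(size);             // 2 bytes: size of the block (excluding header)
        out.writeShort(packetType);       // 2 bytes: data type (text message / file)
        out.writeShort(fileId);           // 2 bytes: internal file ID
        out.writeShort(nameBytes.length); // 2 bytes: length of filename
        out.write(nameBytes);             // filename
        out.write(data);                  // raw file data, no string conversion needed
        return bos.toByteArray();
    }
}
```

The file bytes go straight into the stream, so no char packing or endianness tricks are needed.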
I'm trying to decompress gzipped data with Inflater. According to the docs,
If the parameter 'nowrap' is true then the ZLIB header and checksum
fields will not be used. This provides compatibility with the
compression format used by both GZIP and PKZIP.
Note: When using the 'nowrap' option it is also necessary to provide
an extra "dummy" byte as input. This is required by the ZLIB native
library in order to support certain optimizations.
Passing true to the constructor, then attempting to decompress data results in DataFormatException: invalid block type being thrown. Following the instructions in this answer, I've added a dummy byte to the end of setInput()'s parameter, to no avail.
Will I have to use GZIPInputStream instead? What am I doing wrong?
The Java documentation is incorrect or at least misleading:
nowrap - if true then support GZIP compatible compression
What nowrap means is that raw deflate data will be decompressed. A gzip stream is raw deflate data wrapped with a gzip header and trailer. To fully decode the gzip format with this class, you will need to process the gzip header as described in RFC 1952, use Inflater to decompress the raw deflate data, compute the CRC-32 of the uncompressed data using the CRC32 class, and then verify the CRC and length (modulo 2^32) against the gzip trailer, again as specified in the RFC.
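Those steps might look like the following sketch. It only accepts the simplest header (FLG == 0: no extra field, file name, comment, or header CRC), does a single inflate() call, and appends the extra "dummy" byte the Javadoc requires for nowrap mode; a full implementation must honor all the FLG bits per RFC 1952:

```java
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

class RawGzip {
    static byte[] gunzip(byte[] gz) throws DataFormatException {
        if ((gz[0] & 0xFF) != 0x1F || (gz[1] & 0xFF) != 0x8B) {
            throw new IllegalArgumentException("bad gzip magic number");
        }
        if (gz[3] != 0) {
            throw new IllegalArgumentException("optional header fields not handled in this sketch");
        }
        int headerLen = 10;                          // fixed-size gzip header
        int deflateLen = gz.length - headerLen - 8;  // strip the 8-byte trailer
        // Copy the deflate data plus one extra "dummy" byte, as the Inflater docs require for nowrap.
        byte[] deflate = Arrays.copyOfRange(gz, headerLen, headerLen + deflateLen + 1);
        Inflater inf = new Inflater(true);           // raw deflate, no zlib wrapper
        inf.setInput(deflate);
        byte[] buf = new byte[64 * 1024];            // single-shot buffer; loop for bigger payloads
        int n = inf.inflate(buf);
        inf.end();
        // Trailer holds CRC-32 then length, both little-endian (RFC 1952).
        CRC32 crc = new CRC32();
        crc.update(buf, 0, n);
        long stored = 0;
        for (int i = 0; i < 4; i++) {
            stored |= (long) (gz[gz.length - 8 + i] & 0xFF) << (8 * i);
        }
        if (crc.getValue() != stored) {
            throw new DataFormatException("CRC mismatch");
        }
        return Arrays.copyOf(buf, n);
    }
}
```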
I think that to read a gzip stream it's not enough to set nowrap=true; you must also consume the gzip header, which is not part of the compressed stream. See e.g. readHeader() in this implementation.
I have a client-server app. The client is in C++, the server in Java.
I am sending a byte stream from the client to the server, and from the server to the client.
Tell me please, when I send char(-1) from C++, what value does it correspond to in Java?
And what value must I send from Java to C++ to get char(-1) in the C++ code?
As you are writing through a byte stream, your char(-1) arrives as 255, since byte streams normally transmit unsigned bytes.
The -1 that you read at the end of a stream cannot be sent explicitly, but only by closing the stream.
There's no single answer; it depends on how C++ encodes the data and how Java interprets it. The most common encoding of char(-1) is the number 255. Note that this isn't defined by C++; a one's-complement system might encode it as 254. But also note that there are innumerable ways to encode data across the wire: Elias coding, various ASN.1 encodings, decimal digits, hex, etc.
At the Java end, even assuming a simple char-to-byte encoding, it depends on how you de-serialise the byte and into what type.
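Assuming the common case of a plain byte stream, the mapping on the Java side can be sketched like this (the class and method names are illustrative):

```java
class UnsignedByteDemo {
    // Java's byte is signed, so the wire value 255 arrives as (byte) -1.
    // Masking with 0xFF recovers the unsigned value a C++ unsigned char would see.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }
}
```

Going the other way, writing the int 255 (or casting it to byte) onto the stream produces the single byte a C++ client reads as char(-1) on a two's-complement system.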