Unzip http response - java

Beginner in Java here: I'm trying to decompress an HTTP response in gzip format. Roughly, I have a BufferedReader which lets me read lines of the HTTP response from a socket. With it, I parse the HTTP header, and if it specifies that the body is gzip-encoded, then I have to decompress it. Here is the code I use:
DataInputStream response = new DataInputStream(clientSideSocket.getInputStream());
BufferedReader buffer = new BufferedReader(new InputStreamReader(response))
header = parseHTTPHeader(buffer); // return a map<String,String> with header options
StringBuilder SBresponseBody = new StringBuilder();
String responseBody = new String();
String line;
while((line = buffer.readLine()) != null) // extract the body as if it was a string...
SBresponseBody.append(line);
responseBody = SBresponseBody.toString();
if (header.get("Content-Encoding").contains("gzip"))
responseBody = unzip(responseBody); // function I try to construct
My attempt for the unzip function is as follows:
private String unzip(String body) throws IOException {
String responseBody = "";
byte[] readBuffer = new byte[5000];
GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(body.getBytes()));
int read = gzip.read(readBuffer,0,readBuffer.length);
gzip.close();
byte[] result = Arrays.copyOf(readBuffer, read);
responseBody = new String(result, "UTF-8");
return responseBody;
}
I get an error in the GZIPInputStream: not GZIP format (because gzip header is not found in body).
Here are my thoughts:
• Is body.getBytes() wrong, since the body has been read by a BufferedReader as a character string, so converting it back to byte[] makes no sense because the bytes have already been interpreted the wrong way? Or am I converting the String body back to byte[] in the wrong way?
• Do I have to build a gzip header myself using the information provided in the HTTP header and add it to the String body?
• Do I need to create another InputStream from my socket.getInputStream() to read the information byte by byte, or is that tricky since there is already a buffer "connected" to this socket?

Roughly, I have a bufferReader which allows me to read lines of http response from a socket.
You've hand-rolled an HTTP client.
This is not a good thing; HTTP is considerably more complicated than you think it is. gzip is just one of about 10,000 things you need to think about. There's HTTP/2, SPDY, HTTP/3, chunked transfer encoding, TLS, redirects, MIME packing, and so much more to think about.
So, if you want to write an actual HTTP client, you need about 100x this code and a ton of domain knowledge, because the actual specs of the HTTP protocol, while handy, don't really tell the whole story. The de-facto protocol you're implementing is 'whatever servers connected to the internet tend to send', and what they tend to send is tightly wound up with 'whatever commonly used browsers tend to get right', which is almost, but not quite, what the spec document says. This is one of those cases where pragmatics and implementations are the 'real spec', and the actual spec is merely attempting to document reality.
That's a long way around to saying: your mistake is trying to hand-roll an HTTP client. Don't do that. Use OkHttp, or the HttpClient introduced in JDK 11 in the core libraries.
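For a sense of the difference, a minimal GET with the JDK 11 client might look like the sketch below (the URL is a placeholder; note that neither this client nor HttpURLConnection transparently decompresses gzip, while OkHttp does when it adds the Accept-Encoding header for you):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class Jdk11ClientSketch {
    // Build a plain request; the builder defaults to the GET method.
    static HttpRequest get(String url) {
        return HttpRequest.newBuilder().uri(URI.create(url)).build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // BodyHandlers.ofString decodes the body bytes to characters using
        // the charset from the Content-Type response header, so there is no
        // manual Reader juggling at all.
        HttpResponse<String> response =
                client.send(get("https://example.com/"), // placeholder URL
                            HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Redirects, TLS, connection reuse, and HTTP/2 negotiation are all handled inside the client rather than in your code.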
But, I know what I want!
Your code is loaded up with bugs, though.
DataInputStream response = new DataInputStream(clientSideSocket.getInputStream());
DataInputStream is useless here. Remove that wrapper.
BufferedReader buffer = new BufferedReader(new InputStreamReader(response))
Missing semicolon. Also, this is broken: it will convert the bytes flowing over the wire to characters using the platform default encoding, which is wrong. You need to look at the charset in the Content-Type header.
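Pulling the charset out of a Content-Type value can be sketched like this (the helper name and the ISO-8859-1 fallback are my assumptions; the fallback is HTTP/1.1's historical default for text types):

```java
public class CharsetSniff {
    // Naive extraction of the charset parameter from a Content-Type
    // header value such as "text/html; charset=UTF-8".
    static String charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // case-insensitive match on the "charset=" prefix
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    return p.substring(8).trim().replace("\"", "");
                }
            }
        }
        return "ISO-8859-1"; // assumed fallback, HTTP/1.1's historical default
    }
}
```

A real client also has to cope with malformed values, which is one more argument for a library.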
responseBody = unzip(responseBody)
You cannot do this. Your major misunderstanding is that you appear to think there is no difference between a bunch of bytes and a sequence of characters.
That's wrong. Once you have decoded bytes into chars, you cannot unzip them anymore.
The fix is to check the Content-Encoding header FIRST, then wrap your InputStream in a GZIPInputStream before any Reader gets involved.
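That order of operations can be sketched as follows (method and variable names are illustrative, not from the question's code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class GzipBodySketch {
    // Decide, based on the parsed Content-Encoding header, whether to wrap
    // the raw byte stream in a GZIPInputStream BEFORE any character
    // decoding happens; only the final, decompressed bytes become a String.
    static String readBody(InputStream raw, String contentEncoding, String charset)
            throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(raw)
                : raw;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n); // only the n bytes actually read
        }
        return out.toString(charset);
    }
}
```

The key point: bytes stay bytes through the gunzip step, and the charset is applied exactly once, at the end.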

Related

Get Content type of HttpURLConnection.getInputStream()

I'm having trouble using HttpUrlConnection. I'm working on multiple servers. Some servers send response in gzip encoding and some don't. For gzip encoding, I'm using
inputStream = new GZIPInputStream(connection.getInputStream());
inputStreamReader = new InputStreamReader(inputStream);
And for normal encoding, I'm using
inputStreamReader = new InputStreamReader(connection.getInputStream());
Is it possible to know the encoding of getInputStream so that I know beforehand whether or not to use GZIPInputStream. Or is there a generic input stream reader for both compressed & uncompressed. Thanks.
Get the content encoding from the HttpURLConnection using getContentEncoding().
If it's gzip-encoded, the result of that call should be "gzip", and then you know what type of input stream you need to create.
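A small sketch of that check, working for both kinds of server (the helper name is mine):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.util.zip.GZIPInputStream;

public class EncodingAwareStream {
    // Wrap the connection's stream in a GZIPInputStream only when the
    // server actually declared "Content-Encoding: gzip" in its response.
    static InputStream openBody(HttpURLConnection connection) throws IOException {
        InputStream body = connection.getInputStream();
        if ("gzip".equalsIgnoreCase(connection.getContentEncoding())) {
            body = new GZIPInputStream(body);
        }
        return body;
    }
}
```

With this, a single `new InputStreamReader(openBody(connection), charset)` covers the compressed and uncompressed servers alike.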

HttpsUrlConnection response returned from servlet contains extra 'b''0''\r\n' characters when read through python library

I am using HttpsURLConnection to call a server and return its response from my servlet. I am copying the response from the HttpsURLConnection to the HttpServletResponse using streams, copying bytes from the connection's response input stream to the servlet response's output stream, and detecting the end by checking whether read returns < 0.
Following is the code for copying the response. The variable response is of type HttpServletResponse and the variable httpCon is of type HttpsURLConnection.
InputStream responseStream = httpCon.getInputStream();
if (responseStream != null)
{
OutputStream os = response.getOutputStream();
byte[] buffer = new byte[1024];
int len;
while ((len = responseStream.read(buffer)) >= 0)
{
os.write(buffer, 0, len);
}
os.flush();
os.close();
}
On the client side, I am using python requests library to read the response.
What I am seeing is that if I use curl to test my servlet, I get the proper response JSON: response = u'{"key":"value"}'.
If I read it with the Python requests library, it puts some extra characters in the response, which looks like the following:
response = u'b0\r\n{"key":"value"}\r\n0\r\n\r\n'
Both the strings are unicode. But the second one has extra characters.
The same response comes through properly if I try curl or the Postman REST client, but from Python requests it does not work. I tried another Python library, livetest, and with that too the response has the same characters. I also tried changing the Accept-Encoding header, but it had no effect.
Because of this, I am not able to parse the json.
I don't want to change the client to parse this kind of string.
Can I change something on the server so that it will work correctly?
Did the response contain the header "Transfer-Encoding: chunked"?
If so, the response body is in chunked transfer encoding:
https://en.wikipedia.org/wiki/Chunked_transfer_encoding.
In that case, getting \r\n0\r\n\r\n at the end of the response is expected, since that is the terminating sequence of this encoding. I guess curl/Postman handle chunked transfer encoding for you, which is why you can't find these chunk markers there.
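If you do want to change the server so the container can avoid chunking, one option (a sketch; the HttpServletResponse usage in the comment is an assumption based on the question's variables) is to buffer the upstream body fully and send an explicit Content-Length:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedCopy {
    // Buffer the upstream body fully, so the servlet container can send a
    // Content-Length header instead of falling back to chunked encoding.
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] tmp = new byte[1024];
        int len;
        while ((len = in.read(tmp)) >= 0) {
            buf.write(tmp, 0, len);
        }
        return buf.toByteArray();
    }
    // Assumed usage in the servlet (variables from the question):
    //   byte[] body = drain(httpCon.getInputStream());
    //   response.setContentLength(body.length);
    //   response.getOutputStream().write(body);
}
```

The trade-off is memory: this only makes sense for responses of modest size.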

Reading from response body multiple times in Apache HttpClient 4.x

I'm using the Apache HttpClient 4.2.3 in my application. We store the response of an HTTP call like so:
HttpResponse httpResponse = httpClient.execute(httpRequest);
The response body is an InputStream in the 4.x API:
InputStream responseStream = httpResponse.getEntity().getContent();
My problem is I need to read the response body as a string and as a byte[] at various points in the application. But the InputStream used by Apache is an EofSensorInputStream, which means once I reach the stream EOF, it gets closed. Is there anyway I can get the string and byte[] representations multiple times and not close the stream?
I've already tried wrapping the byte array in a new ByteArrayInputStream and setting that as the request body, but it doesn't work since my response body can reach a few gigs. I've also tried this, but I noticed the original response stream still gets closed.
Any pointers would be welcome.
EDIT: On a related note, it would be also be great if I could find the length of the InputStream either without consuming the stream or by reversing the consumption.
1. I think you have somewhat conflicting requirements:
a)
it doesn't work since my response body can reach a few gigs
b)
Is there anyway I can get the string and byte[] representations multiple times and not close the stream
If you do not have enough memory this is not possible.
Btw, another way to get the response as bytes is EntityUtils.toByteArray(HttpEntity entity).
Do you really need N-gigs String? What are you going to do with it?
2.
it would be also be great if I could find the length of the InputStream
httpResponse.getEntity().getContentLength()
3. Since the response does not fit into memory, I would suggest saving it into a file (or a temp file). Then open an InputStream on that file, and read it as many times as you need.
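That approach can be sketched like this (names are mine): spool the entity stream to a temp file once, then re-open the file for each pass:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolToFile {
    // Spool the entity stream to a temp file once; afterwards the file can
    // be re-read as many times as needed without holding gigabytes in memory.
    static Path spool(InputStream entityContent) throws IOException {
        Path tmp = Files.createTempFile("http-body-", ".tmp");
        // REPLACE_EXISTING because createTempFile already created the file
        Files.copy(entityContent, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }
}
```

Assumed usage: `try (InputStream in = Files.newInputStream(tmp)) { ... }` for each pass, then delete the file when done.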

Read complete HTTP request-header

I'm currently creating a little webserver (for testing purposes) and I have a problem reading the HTTP request-header (coming from the browser, chromium in my case).
First, I simply tried something like this:
BufferedReader in = new BufferedReader(
new InputStreamReader(client_socket.getInputStream(), "UTF-8")
);
StringBuilder builder = new StringBuilder();
while (in.ready()){
builder.append(in.readLine());
}
return builder.toString();
This worked fine for the first request. However, after the first request was done, the ready() method only returned false (I closed the client_socket as well as all readers/writers).
After a little searching I stumbled across this older question: Read/convert an InputStream to a String
I tried the first four solutions (two with Apache Commons, one with a Scanner and one with a do-while loop). All of them blocked forever and the browser gave me a "Website not reachable" error.
I'd like to do this on my own (without using any libraries or embedded servers); that's why I'm even trying.
I'm a little lost right now; how would you go about this?
You are reading from the socket until there is no more data to read. That is wrong. You need to keep reading until you encounter a zero-length line, then process the received headers to determine if there is more data to read (look for Content-Length: ... and Transfer-Encoding: chunked headers), e.g.:
StringBuilder builder = new StringBuilder();
String line;
do
{
line = in.readLine();
if (line == "") break;
builder.append(line);
}
while (true);
// use builder as needed...
// read message body data if headers say there
// is a body to read, per RFC 2616 Section 4.4...
Read RFC 2616 Section 4 for more details. Not only does this facilitate proper request reading, but doing so also allows you to support HTTP keep-alives correctly (for sending multiple requests on a single connection).
The solution suggested above by Remy Lebeau has a bug, as a test of mine showed: it compares strings with == instead of equals(). This alternative is fail-safe:
StringBuilder builder = new StringBuilder();
String line;
do
{
line = in.readLine();
if (line.equals("")) break;
builder.append(line);
}
while (true);
Refer to: How do I compare strings in Java?
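For what it's worth, readLine() can also return null if the peer closes the connection before sending the blank line, so a defensive version (my sketch, combining both fixes) checks for both conditions:

```java
import java.io.BufferedReader;
import java.io.IOException;

public class HeaderReader {
    // Read header lines until the blank line that terminates the header
    // block, treating null (connection closed early) like end-of-headers.
    static String readHeaders(BufferedReader in) throws IOException {
        StringBuilder builder = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null && !line.isEmpty()) {
            builder.append(line).append("\r\n"); // keep one header per line
        }
        return builder.toString();
    }
}
```

Appending the line terminator also preserves the header boundaries, which the original concatenation loses.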

Java: HttpComponents gets rubbish Response from input Stream from a specific URL

I am currently trying to get HttpComponents to send HttpRequests and retrieve the Response.
On most URLs this works without a problem, but when I try to get the URL of a phpBB forum, namely http://www.forum.animenokami.com, the client takes more time and the response entity contains passages more than once, resulting in a broken HTML file.
For example, the meta tags are contained six times. Since many other URLs work, I can't figure out what I am doing wrong.
The page works correctly in known browsers, so it is not a problem on their side.
Here is the code I use to send and receive.
URI uri1 = new URI("http://www.forum.animenokami.com");
HttpGet get = new HttpGet(uri1);
get.setHeader(new BasicHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:6.0) Gecko/20100101 Firefox/6.0"));
HttpClient httpClient = new DefaultHttpClient();
HttpResponse response = httpClient.execute(get);
HttpEntity ent = response.getEntity();
InputStream is = ent.getContent();
BufferedInputStream bis = new BufferedInputStream(is);
byte[] tmp = new byte[2048];
int l;
String ret = "";
while ((l = bis.read(tmp)) != -1){
ret += new String(tmp);
}
I hope you can help me.
If you need anymore Information I will try to provide it as soon as possible.
This code is completely broken:
String ret = "";
while ((l = bis.read(tmp)) != -1){
ret += new String(tmp);
}
Three things:
This is converting the whole buffer into a string on each iteration, regardless of how much data has been read. (I suspect this is what's actually going wrong in your case.)
It's using the default platform encoding, which is almost never a good idea.
It's using string concatenation in a loop, which leads to poor performance.
Fortunately you can avoid all of this very easily using EntityUtils:
String text = EntityUtils.toString(ent);
That will use the appropriate character encoding specified in the response, if any, or ISO-8859-1 otherwise. (There's another overload which allows you to specify which character encoding to use if it's not specified.)
It's worth understanding what's wrong with your original code though rather than just replacing it with the better code, so that you don't make the same mistakes in other situations.
It works fine but what I don't understand is why I see the same text multiple times only on this URL.
It will be because your client is seeing more incomplete buffers when it reads the socket. That could be:
• because there is a network bandwidth bottleneck on the route from the remote site to your client,
• because the remote site is doing some unnecessary flushes, or
• some other reason.
The point is that your client must pay close attention to the number of bytes read into the buffer by the read call, otherwise it will end up inserting junk. Network streams in particular are prone to not filling the buffer.
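Concretely, the manual loop has to accumulate bytes and honour the count returned by read(); something like the sketch below (the method name and the UTF-8 choice are mine; in the real code the charset should come from the response, which is what EntityUtils does for you):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReadLoopSketch {
    // Accumulate bytes first, honouring the count returned by read(),
    // and decode to a String exactly once with an explicit charset.
    static String readAll(InputStream is) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] tmp = new byte[2048];
        int l;
        while ((l = is.read(tmp)) != -1) {
            out.write(tmp, 0, l); // only the l bytes actually read this time
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

This fixes all three problems at once: no stale buffer contents, no per-iteration decoding, and no string concatenation in a loop.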
