HTTP gzip encoding of html

For a project of mine I have to write my own lightweight web server.
At the moment it does what I want it to do, but it is rather slow, at least too slow for me.
Therefore I was thinking about implementing gzip compression to speed things up.
Here is how I do it:
public static String encodeToGZip(String data) {
    ByteArrayOutputStream bout = null;
    try {
        bout = new ByteArrayOutputStream();
        GZIPOutputStream output = new GZIPOutputStream(bout);
        output.write(data.getBytes());
        output.flush();
        output.close();
        bout.close();
    } catch (IOException ex) {
        ex.printStackTrace();
    }
    try {
        return new String(bout.toByteArray(), "UTF-8");
    } catch (UnsupportedEncodingException ex) {
        return null;
    }
}
The problem is that the browser can't decode the data I send, even though it states that it accepts gzip encoding, so I must be sending corrupt data.
This is the result.
Wireshark capture:
GET /login.html HTTP/1.1
Host: localhost:9090
Connection: keep-alive
Cache-Control: no-cache
Pragma: no-cache
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.79 Safari/535.11
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
HTTP/1.1 200 OK
Connection: close
Server: My Lite Server v0
Content-Encoding: gzip
Content-Type: text/html
...............T...N...0....#.......O...?...$...........BB...g...6...[.....u...........6......................g6e...............S......c..$..........`I
Gw............AOAhU...XO...d...].... IU...h...+......[.....Y.........b...|x.........rm1.........1.....L...uI.........S...n............F......T2.[$X.......M.....M.#*...........d....58HL:....Wx......Z...........m...t...Z.)'XQdg
......X.........~......(......<.......p/.......
..........."...6|7........3
...r.Sv.../...rT...."..........SrJ..........M.vR^...4$...
.q...x.................../...8...........M...y#...j......7........d..le....;..................~......o....F......

return new String(bout.toByteArray(), "UTF-8");
This line in your method will produce garbage strings.
The above constructor performs a transcoding operation from the given encoding to UTF-16. You take a bunch of arbitrary bytes and try to decode them as UTF-8. You can only decode UTF-8 encoded character data as UTF-8. Java does not have binary-safe strings (all strings are UTF-16); you must use byte arrays instead.
Just write the compressed bytes to your OutputStream.
Avoid using data.getBytes() as it uses the default system encoding. This will produce non-portable code as the default system encoding is system and configuration dependent. Prefer always setting an encoding explicitly.
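A minimal sketch of what that could look like, assuming the whole page fits in memory. The class name GzipBody, the method names, and the exact set of response headers are made up for illustration and are not taken from the question's server. The point is that the compressed data stays a byte[] from start to finish and the text is encoded with an explicit charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of the original server.
final class GzipBody {

    /** Compresses text to gzip bytes. The result is binary and is never turned into a String. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream (done by try-with-resources) writes the gzip trailer.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8)); // explicit charset
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Builds a complete response: ASCII header block, blank line, then the raw compressed bytes. */
    static byte[] buildResponse(String html) {
        byte[] body = gzip(html);
        String head = "HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/html; charset=UTF-8\r\n"
                + "Content-Encoding: gzip\r\n"
                + "Content-Length: " + body.length + "\r\n"
                + "Connection: close\r\n"
                + "\r\n";
        byte[] headBytes = head.getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headBytes, 0, headBytes.length);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }
}
```

The array returned by buildResponse can be passed directly to the socket's OutputStream with write(byte[]).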

Related

Change header-body-separator with HttpURLConnection (from java.net)

I am sending a simple POST request with the built-in HttpURLConnection class, but I want to change the way Java separates the headers from the body (look at the outputs of tcpflow -a port 80 below). Here is the code:
// Create HttpURLConnection object
URL url = new URL("http://httpbin.org/post");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
// Set request method
connection.setRequestMethod("POST");
// Write body and "Content-Length" header
String body = "This+is+the+body+of+the+post+request.";
connection.setRequestProperty("Content-Length",
String.valueOf(body.length()));
connection.setDoOutput(true);
connection.getOutputStream().write(body.getBytes("US-ASCII"));
// Send request
int responseCode = connection.getResponseCode();
When I execute that code and look at what Java actually sends using tcpflow -a port 80 (prints all requests/responses on port 80), I see the following (I cut away the response):
192.168.178.113.54654-054.225.177.165.00080: POST /post HTTP/1.1
User-Agent: Java/9
Host: httpbin.org
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive
Content-type: application/x-www-form-urlencoded
Content-Length: 37
192.168.178.113.54654-054.225.177.165.00080: This is the body of the post request.
The headers are correct, the body is correct. But I can see that the body is transferred in a separate connection. I know that this is a problem with java.net.HttpURLConnection because when I try the same with Apache's HttpClient, tcpflow -a port 80 gives me:
192.168.178.113.39708-054.243.202.193.00080: POST /post HTTP/1.1
Content-Length: 37
Host: httpbin.org
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.3 (Java/9)
Accept-Encoding: gzip,deflate
This+is+the+body+of+the+post+request.
Here the body is sent with the headers, separated by just \r\n\r\n. I would like the Java library (java.net.HttpURLConnection) to do the same. Is that possible?
EDIT: I found out that the reason the server rejected my requests was not that the body was in a different packet than the headers (packet, not connection, as Julian Reschke pointed out), but simply that I sent the wrong data (facepalm).

Encoded Http Request/Response body

I've built an Android proxy server that passes HTTP requests and responses using Java sockets.
The proxy is working, and all content in the browser passes through it. I would like to be able to read the requests and responses, but their bodies seem to be encoded:
GET http://m.onet.pl/ HTTP/1.1
Host: m.onet.pl
Proxy-Connection: keep-alive
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Linux; Android 4.4.4; XT1039 Build/KXB21.14-L1.56) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36
DNT: 1
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-GB,en;q=0.8,en-US;q=0.6,pl;q=0.4
Cookie: onet_ubi=201509221839473724130028; onetzuo_ticket=9AEDF08D278EC7965FF6A20BABD36EF0010012ED90FDD127C16068426F8B65A5D81A000000000000000050521881000000; onet_cid=dd6df83b3a8c33cd497d1ec3fcdea91b; __gfp_64b=2Mp2U1jvfJ3L9f.y6CbKfJ0oVfA7pVdBYfT58G1nf7T.p7; ea_uuid=201509221839478728300022; onet_cinf=1; __utma=86187972.1288403231.1442939988.1444999380.1445243557.40; __utmb=86187972.13.10.1445243557; __utmc=86187972; __utmz=86187972.1442939988.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
�����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
So in both the request and the response a lot of "���" occurs. I didn't find any info about HTTP encoding. What is it? How can I properly read the body?
Assuming it might be a gzipped message, I tried:
while ((count = externalServerInputReader.read(buf, 0, buf.length)) != -1)
{
    String stream = new String(buf, 0 , count);
    proxyOutputStream.write(buf, 0, count);
    if (stream.contains("content-encoding: gzip")) {
        ByteArrayInputStream bais = new ByteArrayInputStream(buf);
        GZIPInputStream gzis = new GZIPInputStream(bais);
        InputStreamReader reader = new InputStreamReader(gzis);
        BufferedReader in = new BufferedReader(reader);
        String readed;
        while ((readed = in.readLine()) != null) {
            Log.d("Hello", "UnGzip: " + readed);
        }
    }
}
proxyOutputStream.flush();
However, I get an error when attempting to gunzip:
unknown format (magic number 5448)

I tried your sample request by saving it to "/tmp/req" and replaying it using cat /tmp/req | nc m.onet.pl 80. The server sent back a gzip encoded response, which I could tell from the response header content-encoding: gzip. In the case where the response is gzip encoded, you could decompress it in Java using java.util.zip.GZIPInputStream. Note that the user agent in your example is also advertising support for "deflate" and "sdch" too, so you may also get responses with those encodings. The "deflate" encoding can be decompressed using java.util.zip.InflaterInputStream. I'm not aware of any built in support for sdch, so you would need to find or write a library to decompress that - see this other Stack Overflow question for a possible starting point: "Java SDCH compressor/decompressor".
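As a rough sketch of the decoder choice described above, the following picks a wrapper stream based on the value of the content-encoding header. The class and method names are made up for illustration. It assumes that "deflate" bodies carry the zlib wrapper, which is what InflaterInputStream expects by default; a server that sends raw deflate data would need different handling. Unsupported encodings such as sdch are rejected rather than guessed at.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not part of the proxy in the question.
final class BodyDecoders {

    /**
     * Wraps the raw body stream in a decompressor that matches the
     * Content-Encoding header value (null means the header was absent).
     */
    static InputStream decode(String contentEncoding, InputStream rawBody) throws IOException {
        String enc = contentEncoding == null
                ? "identity"
                : contentEncoding.trim().toLowerCase(Locale.ROOT);
        switch (enc) {
            case "gzip":
                return new GZIPInputStream(rawBody);
            case "deflate":
                return new InflaterInputStream(rawBody); // assumes zlib-wrapped data
            case "identity":
            case "":
                return rawBody; // not compressed
            default:
                throw new IOException("Unsupported Content-Encoding: " + enc);
        }
    }
}
```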
To address the updated part of your question where you added a stab at using GZIPInputStream, the most immediate issue is that you should only gunzip the stream after the HTTP response headers have ended. The simplest thing to do would be to wait for "\r\n\r\n" to come across the underlying InputStream (not a Reader) and then run the data starting with the next byte on through a single GZIPInputStream. That should probably work for the example you gave - I successfully decoded the replayed response I got using gunzip -c. For thoroughness, there are some other issues that will keep this from working as a general solution for arbitrary websites, but I think it will be enough to get you started. (Some examples: 1) you might miss a "content-encoding" header because you are splitting the response into chunks of length buf.length. 2) Responses which use chunked encoding would need to be de-chunked. 3) Keep-alive responses would necessitate that you track when the response ends rather than waiting for end of stream.)
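Here is a minimal sketch of the "wait for \r\n\r\n, then gunzip the rest" approach. All names are illustrative. It assumes the simplest case only: the body is gzip encoded, is not chunked, is UTF-8 text, and the server closes the connection at the end of the response. The headers are read one byte at a time on purpose, so that nothing past the blank line is consumed before the GZIPInputStream takes over.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the proxy in the question.
final class GzipResponseReader {

    /** Reads up to and including the first CRLF CRLF and returns the header block as text. */
    static String readHeaderBlock(InputStream in) throws IOException {
        ByteArrayOutputStream head = new ByteArrayOutputStream();
        int matched = 0; // number of bytes of "\r\n\r\n" matched so far
        int b;
        while (matched < 4 && (b = in.read()) != -1) {
            head.write(b);
            int expected = (matched % 2 == 0) ? '\r' : '\n';
            if (b == expected) {
                matched++;
            } else {
                matched = (b == '\r') ? 1 : 0; // a stray CR may start a new match
            }
        }
        if (matched < 4) {
            throw new EOFException("Stream ended before the end of the headers");
        }
        return new String(head.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    /** Decompresses everything that is left on the stream (assumes a UTF-8 text body). */
    static String readGzipBody(InputStream in) throws IOException {
        GZIPInputStream gzip = new GZIPInputStream(in);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = gzip.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Call readHeaderBlock first, check that the returned text contains a gzip content-encoding header (case-insensitively), and only then call readGzipBody on the same stream.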

jquery ajax and java server, lost data

I have this AJAX function that looks like this:
$.ajax({
    type: "POST",
    url: "http://localhost:55556",
    data: "lots and lots of pie",
    cache: false,
    success: function(result)
    {
        alert("sent");
    },
    error: function()
    {
        alert('An error has occurred, please try again.');
    }
});
and a server that looks like this:
clientSocket = AcceptConnection();
inp = new BufferedReader(new InputStreamReader (clientSocket.getInputStream()));
String requestString = inp.readLine();
BufferedReader ed = new BufferedReader(new InputStreamReader(clientSocket.getInputStream()));
while(true){
    String tmp = inp.readLine();
    System.out.println(tmp);
}
Now the odd thing is that when I send the AJAX request, my server prints the following via System.out:
Host: localhost:55556
User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Content-Length: 20
Origin: null
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
The question is: where is the data that I sent? Where is "lots and lots of pie"?

The data should come after a blank line after the header lines, but I think the problem is that the data does not end with a newline character, and therefore, you cannot read it with the .readLine() method.
While looping through the header lines, you could look for the "Content-Length" line and get the length of the data. When you have reached the blank line, stop using .readLine(). Instead switch to reading one character at a time, reading the number of characters specified by the "Content-Length" header. I think you can find example code for this in this answer.
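A minimal sketch of that idea, with made-up class and method names. One deviation from the description above: Content-Length counts bytes, not characters, so this version reads the headers and the body as bytes from the raw InputStream instead of going through a BufferedReader. It assumes the body is UTF-8 text, as the Content-Type in the question states, and it does no error handling for malformed requests.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the server in the question.
final class RequestBodyReader {

    /** Reads one line ending in LF, dropping CR. Returns null if the stream ended with no data. */
    static String readLine(InputStream in) throws IOException {
        StringBuilder line = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') {
                line.append((char) b); // header bytes are treated as ISO-8859-1
            }
        }
        return (b == -1 && line.length() == 0) ? null : line.toString();
    }

    /** Skips the request line and headers, then reads exactly Content-Length bytes of body. */
    static String readBody(InputStream in) throws IOException {
        readLine(in); // request line, e.g. "POST / HTTP/1.1"
        int contentLength = 0;
        String header;
        // The blank line (empty string) marks the end of the headers.
        while ((header = readLine(in)) != null && !header.isEmpty()) {
            int colon = header.indexOf(':');
            if (colon > 0 && header.substring(0, colon).trim().equalsIgnoreCase("Content-Length")) {
                contentLength = Integer.parseInt(header.substring(colon + 1).trim());
            }
        }
        byte[] body = new byte[contentLength];
        new DataInputStream(in).readFully(body); // blocks until all body bytes have arrived
        return new String(body, StandardCharsets.UTF_8);
    }
}
```

With the request from the question, readBody(clientSocket.getInputStream()) would be expected to return the 20 bytes announced by Content-Length without waiting for a trailing newline.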
If you can, I suggest you use a library to help with this. I think the Apache HTTP Core library can help with this. See this answer.

Neglecting HTTP Post request headers when read using BufferedReader

I am currently listening on a port using BufferedReader like:
ServerSocket ss = new ServerSocket(2346);
Socket s = ss.accept();
BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
while(true){
inputLine = in.readLine();
if(inputLine==null)
break;
}
Now I am getting all the headers and everything like:
POST /record HTTP/1.1
Accept-Encoding: gzip,deflate
Content-Type: text/xml;charset=UTF-8
SOAPAction: ""
Content-Length: 1969
Host: localhost:2346
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.1.1 (java 1.5)
<S:Envelope xmlns:S="http://www.w3.org/2003/05/soap-envelope">...
The problem is that I need just the content of the POST request (the last line above). Is there a Java parser that can do that? Also, in my request to the socket I have to send an extra empty line for it to be read properly. Is there a solution for this?
Thanks

The body of an HTTP message is always separated from the headers by one blank line. You can either write your own parser or use a library like HttpCore http://hc.apache.org/httpcomponents-core-ga/
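If you go the "write your own parser" route and already hold the complete message in memory, the split at the blank line can be sketched as follows. The class and method names are made up, and the sketch assumes the message uses CRLF line endings and that the caller knows the body's charset.

```java
import java.nio.charset.Charset;

// Hypothetical helper for a message that is already fully buffered.
final class HttpMessageSplitter {

    /** Returns the index of the first body byte, or -1 if no blank line was found. */
    static int bodyOffset(byte[] message) {
        for (int i = 0; i + 3 < message.length; i++) {
            if (message[i] == '\r' && message[i + 1] == '\n'
                    && message[i + 2] == '\r' && message[i + 3] == '\n') {
                return i + 4; // first byte after the blank line
            }
        }
        return -1;
    }

    /** Returns everything after the blank line, decoded with the given charset. */
    static String body(byte[] message, Charset charset) {
        int offset = bodyOffset(message);
        return offset < 0 ? "" : new String(message, offset, message.length - offset, charset);
    }
}
```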

Urlencoding data for post request body. Am I using wrong charset?

I want to replicate a working POST request in Java. For testing purposes, let's take a message like: 'äöõüäöõüäöõüäöõü'
Working POST request (with the encoded message 'äöõüäöõüäöõüäöõü'):
Header
POST http://www.mysite.com/newreply.php?do=postreply&t=477352 HTTP/1.1
Host: www.warriorforum.com
Connection: keep-alive
Content-Length: 403
Origin: http://www.mysite.com
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko)Chrome/14.0.835.202 Safari/535.1
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Accept: */*
Referer: http://www.mysite.com/test-forum/477352-test.html
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Cookie: bblastvisit=1319205053; bblastactivity=0; bbuserid=265374; bbpassword=1125e9ec1ab41f532ab8ec6f77ddaf94; bbsessionhash=91444317c100996990a04d6c5bbd8375;
Body
securitytoken=1319806096-618e5f9012901e2d818bf2c74c2121baa064be57&ajax=1&ajax_lastpost=1319806096&message=%u00E4%u00F6%u00F5%u00FC%u00E4%u00F6%u00F5%u00FC%u00E4%u00F6%u00F5%u00FC%u00E4%u00F6%u00F5%u00FC&wysiwyg=0&styleid=1&signature=1&fromquickreply=1&s=&do=postreply&t=477352&p=who%20cares&specifiedpost=0&parseurl=1&loggedinuser=265374
As we can see in the request body, 'äöõüäöõüäöõüäöõü' is encoded as: %u00E4%u00F6%u00F5%u00FC%u00E4%u00F6%u00F5%u00FC%u00E4%u00F6%u00F5%u00FC%u00E4%u00F6%u00F5%u00FC
Now I want to replicate it.
Let's URL-encode the text with charset UTF-8 in Java:
String userText = "äöõüäöõüäöõüäöõü";
String encoded = URLEncoder.encode(userText, "utf-8");
Result: %C3%A4%C3%B6%C3%B5%C3%BC%C3%A4%C3%B6%C3%B5%C3%BC%C3%A4%C3%B6%C3%B5%C3%BC%C3%A4%C3%B6%C3%B5%C3%BC%0A%0A%0A%5BSIZE%3D%221%22%5D%5BI%5D << NOT THE SAME
Let's try ISO-8859-1:
String userText = "äöõüäöõüäöõüäöõü";
String encoded = URLEncoder.encode(userText, "ISO-8859-1");
Result: %E4%F6%F5%FC%E4%F6%F5%FC%E4%F6%F5%FC%E4%F6%F5%FC%0A%0A%0A%5BSIZE%3D%221%22%5D%5BI%5D << NOT THE SAME
Neither of them produces the same encoded string as the working example, even though they all have the same input. What am I missing here?

%u00E4%u00F6%u00F5%u00FC%u00E4%u00F6%u00F5%u00FC%u00E4%u00F6%u00F5%u00FC%u00E4%u00F6%u00F5%u00FC
I don't know what the above data is encoded as, but it isn't application/x-www-form-urlencoded; charset=UTF-8 as the request claims. This is not legal data for this MIME type.
It looks like some UTF-16BE-encoded form.
URLEncoder.encode(userText, "utf-8"); would be the correct way to encode the application/x-www-form-urlencoded; charset=UTF-8 values if this was actually what the server was expecting. (ref)
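To make the difference concrete, here is a small sketch that produces both forms. formEncodeUtf8 is just the standard URLEncoder call. percentU is a hypothetical helper written for this illustration: it emits %uXXXX for every non-ASCII UTF-16 code unit, which reproduces the message value in the captured request, but it is only a guess at what the site's client-side script does and is not a standard encoding. Note also that URLEncoder writes a space as "+", while the capture uses "%20".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class for comparing the two encodings.
final class EscapeComparison {

    /** Standard application/x-www-form-urlencoded encoding with UTF-8. */
    static String formEncodeUtf8(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    /** Non-standard form: ASCII is form-encoded, everything else becomes %uXXXX per UTF-16 code unit. */
    static String percentU(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(formEncodeUtf8(String.valueOf(c)));
            } else {
                out.append(String.format("%%u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String userText = "\u00E4\u00F6\u00F5\u00FC"; // the four test characters from the question
        System.out.println(formEncodeUtf8(userText)); // %C3%A4%C3%B6%C3%B5%C3%BC
        System.out.println(percentU(userText));       // %u00E4%u00F6%u00F5%u00FC
    }
}
```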
