I am currently trying to get HttpComponents to send HttpRequests and retrieve the Response.
On most URLs this works without a problem, but when I try to fetch the URL of a phpBB forum, namely http://www.forum.animenokami.com, the client takes much longer and the response entity contains passages more than once, resulting in a broken HTML file.
For example, the meta tags are contained six times. Since many other URLs work, I can't figure out what I am doing wrong.
The page works correctly in common browsers, so it is not a problem on their side.
Here is the code I use to send and receive.
URI uri1 = new URI("http://www.forum.animenokami.com");
HttpGet get = new HttpGet(uri1);
get.setHeader(new BasicHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:6.0) Gecko/20100101 Firefox/6.0"));
HttpClient httpClient = new DefaultHttpClient();
HttpResponse response = httpClient.execute(get);
HttpEntity ent = response.getEntity();
InputStream is = ent.getContent();
BufferedInputStream bis = new BufferedInputStream(is);
byte[] tmp = new byte[2048];
int l;
String ret = "";
while ((l = bis.read(tmp)) != -1){
ret += new String(tmp);
}
I hope you can help me.
If you need any more information, I will try to provide it as soon as possible.
This code is completely broken:
String ret = "";
while ((l = bis.read(tmp)) != -1){
ret += new String(tmp);
}
Three things:
This is converting the whole buffer into a string on each iteration, regardless of how much data has been read. (I suspect this is what's actually going wrong in your case.)
It's using the default platform encoding, which is almost never a good idea.
It's using string concatenation in a loop, which leads to poor performance.
Fortunately you can avoid all of this very easily using EntityUtils:
String text = EntityUtils.toString(ent);
That will use the appropriate character encoding specified in the response, if any, or ISO-8859-1 otherwise. (There's another overload which allows you to specify which character encoding to use if it's not specified.)
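For example, if you happen to know the server actually sends UTF-8 even when it doesn't declare it, the overload that takes a charset could be used (the charset here is just an assumption for illustration):
String text = EntityUtils.toString(ent, "UTF-8");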
It's worth understanding what's wrong with your original code though rather than just replacing it with the better code, so that you don't make the same mistakes in other situations.
It works fine but what I don't understand is why I see the same text multiple times only on this URL.
It will be because your client is seeing more incomplete buffers when it reads the socket. That could be:
because there is a network bandwidth bottleneck on the route from the remote site to your client,
because the remote site is doing some unnecessary flushes, or
some other reason.
The point is that your client must pay close attention to the number of bytes read into the buffer by the read call; otherwise it will end up inserting junk. Network streams in particular are prone to not filling the buffer.
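For reference, here is a rough sketch of a manual read loop that respects the count returned by read() (the helper name and the ISO-8859-1 fallback are assumptions; EntityUtils.toString remains the simpler option):

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Sketch: accumulate only the bytes actually read, decode once at the end.
static String readAll(InputStream is) throws IOException {
    BufferedInputStream bis = new BufferedInputStream(is);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    byte[] tmp = new byte[2048];
    int l;
    while ((l = bis.read(tmp)) != -1) {
        baos.write(tmp, 0, l); // only the bytes read in this iteration, not the whole buffer
    }
    // ISO-8859-1 is the historical HTTP default; ideally use the charset from the response headers
    return new String(baos.toByteArray(), StandardCharsets.ISO_8859_1);
}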
Related
I'm a beginner in Java, trying to decompress an HTTP response in gzip format. Roughly, I have a BufferedReader which allows me to read lines of the HTTP response from a socket. With that, I parse the HTTP header, and if it specifies that the body is in gzip format then I have to decompress it. Here is the code I use:
DataInputStream response = new DataInputStream(clientSideSocket.getInputStream());
BufferedReader buffer = new BufferedReader(new InputStreamReader(response))
header = parseHTTPHeader(buffer); // return a map<String,String> with header options
StringBuilder SBresponseBody = new StringBuilder();
String responseBody = new String();
String line;
while((line = buffer.readLine())!= null) // extract the body as if was a string...
SBresponseBody.append(line);
responseBody = SBresponseBody.toString();
if (header.get("Content-Encoding").contains("gzip"))
responseBody = unzip(responseBody); // function I try to construct
My attempt for the unzip function is as follows:
private String unzip(String body) throws IOException {
String responseBody = "";
byte[] readBuffer = new byte[5000];
GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(body.getBytes()));
int read = gzip.read(readBuffer,0,readBuffer.length);
gzip.close();
byte[] result = Arrays.copyOf(readBuffer, read);
responseBody = new String(result, "UTF-8");
return responseBody;
}
I get an error in the GZIPInputStream: not GZIP format (because gzip header is not found in body).
Here are my thoughts:
• Is body.getBytes() wrong, since the body has been read by a BufferedReader as a character string and converting it back to byte[] makes no sense because it has already been interpreted the wrong way? Or am I converting the String body back to byte[] the wrong way?
• Do I have to build a gzip header myself using the information provided in the HTTP header and add it to the String body?
• Do I need to create another InputStream from my socket.getInputStream() to read the information byte by byte, or is it tricky since there is already a buffer "connected" to this socket?
Roughly, I have a BufferedReader which allows me to read lines of the HTTP response from a socket.
You've hand-rolled an HTTP client.
This is not a good thing; HTTP is considerably more complicated than you think it is. gzip is just one of about 10,000 things you need to think about. There's HTTP/2, SPDY, HTTP/3, chunked transfer encoding, TLS, redirects, MIME packing, and so much more to think about.
So, if you want to write an actual HTTP client, you need about 100x this code and a ton of domain knowledge, because the actual specs of the HTTP protocol, while handy, don't really tell the story. The de-facto protocol you're implementing is 'whatever servers connected to the internet tend to send' and what they tend to send is tightly wound up with 'whatever commonly used browsers tend to get right', which is almost, but not quite, what that spec document says. This is one of those cases where pragmatics and implementations are the 'real spec', and the actual spec is merely attempting to document reality.
That's a long way around to say: your mistake is trying to hand-roll an HTTP client. Don't do that. Use OkHttp or the HTTP client introduced in JDK 11 in the core libraries.
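For illustration, a minimal sketch using the JDK 11 java.net.http client (the URL is a placeholder; redirect handling and similar behaviour are configurable on the builder):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder(URI.create("http://example.com/"))
        .GET()
        .build();
// The client takes care of the wire protocol (HTTP/1.1 and HTTP/2, chunked transfer, TLS, ...).
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
String body = response.body();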
But, I know what I want!
Your code is loaded up with bugs, though.
DataInputStream response = new DataInputStream(clientSideSocket.getInputStream());
DataInputStream is useless here. Remove that wrapper.
BufferedReader buffer = new BufferedReader(new InputStreamReader(response))
Missing semicolon. Also, this is broken: it will convert the bytes flowing over the wire to characters using the platform default encoding, which is wrong; you need to look at the Content-Type header.
responseBody = unzip(responseBody)
You cannot do this. Your major misunderstanding is that you appear to think that there is no difference between a bunch of bytes and a sequence of characters.
That's wrong. Once you've stored bytes as chars, you cannot unzip them anymore.
The fix is to check for the gzip header FIRST, then wrap your InputStream in a GZIPInputStream.
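A rough sketch of that order of operations, reusing the names from the question (clientSideSocket, header) and assuming UTF-8 as the text charset:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

InputStream raw = clientSideSocket.getInputStream();
// ... parse the headers from the raw bytes first ...
InputStream body = raw;
String encoding = header.get("Content-Encoding");
if (encoding != null && encoding.contains("gzip")) {
    body = new GZIPInputStream(raw); // decompress the bytes BEFORE turning them into characters
}
// Only now convert bytes to characters, ideally using the charset from Content-Type:
BufferedReader reader = new BufferedReader(new InputStreamReader(body, StandardCharsets.UTF_8));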
I'm using the Apache HttpClient 4.2.3 in my application. We store the response of an HTTP call like so:
HttpResponse httpResponse = httpClient.execute(httpRequest);
The response body is an InputStream in the 4.x API:
InputStream responseStream = httpResponse.getEntity().getContent();
My problem is that I need to read the response body as a String and as a byte[] at various points in the application. But the InputStream used by Apache is an EofSensorInputStream, which means that once I reach the stream's EOF, it gets closed. Is there any way I can get the String and byte[] representations multiple times without closing the stream?
I've already tried wrapping the byte array in a new ByteArrayInputStream and setting that as the request body, but it doesn't work since my response body can reach a few gigs. I've also tried this, but I noticed the original response stream still gets closed.
Any pointers would be welcome.
EDIT: On a related note, it would also be great if I could find the length of the InputStream, either without consuming the stream or by reversing the consumption.
1. I think you have somewhat conflicting requirements:
a)
it doesn't work since my response body can reach a few gigs
b)
Is there anyway I can get the string and byte[] representations multiple times and not close the stream
If you do not have enough memory this is not possible.
Btw, another way to get the response as bytes is EntityUtils.toByteArray(HttpEntity entity).
Do you really need N-gigs String? What are you going to do with it?
2.
it would also be great if I could find the length of the InputStream
httpResponse.getEntity().getContentLength()
3. Since the response does not fit into memory, I would suggest saving it to a file (or a temp file). Then set up an InputStream on that file and read it as many times as you need.
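For example, something along these lines (a sketch assuming the httpResponse from the question and Java 7+):

import java.io.File;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

File tempFile = File.createTempFile("response", ".dat");
try (InputStream responseStream = httpResponse.getEntity().getContent()) {
    // Stream the body straight to disk; nothing is held in memory.
    Files.copy(responseStream, tempFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
}
// Now the body can be re-read as often as needed, as bytes or wrapped in a Reader:
try (InputStream in = Files.newInputStream(tempFile.toPath())) {
    // ... read as byte[] or String in manageable chunks ...
}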
I am trying to create a proxy server.
I want to read the websites byte by byte so that I can display images and everything else. I tried readLine but I can't display images. Do you have any suggestions for how I can change my code and send all the data to the browser with a DataOutputStream object?
try{
Socket s = new Socket(InetAddress.getByName(req.hostname), 80);
String file = parcala(req.url);
DataOutputStream out = new DataOutputStream(clientSocket.getOutputStream());
BufferedReader dis = new BufferedReader(new InputStreamReader(s.getInputStream()));
PrintWriter socketOut = new PrintWriter(s.getOutputStream());
socketOut.print("GET "+ req.url + "\n\n");
//socketOut.print("Host: "+req.hostname);
socketOut.flush();
String line;
while ((line = dis.readLine()) != null){
System.out.println(line);
}
}
catch (Exception e){}
}
Edited Part
This is what I have to do. I can block banned web sites, but I can't allow other web sites in my program.
In the filter program, you will open a TCP socket at the specified port and wait for connections. If a request comes (i.e. the client types a URL to access a web site), the application will process it to decide whether access is allowed or not and then, using the same socket, it will send the reply back to the client. After the client opened her connection to WebPolice (and her request has been checked and is allowed), the real web page needs to be shown to the client. Therefore, since the user already gave her request, now it is WebPolice's turn to forward the request so that the user can get the web page. Thus, WebPolice acts as a client and requests the web page. This means you need to open a connection to the web server (without closing the connection to the user), forward the request over this connection, get the reply and forward it back to the client. You will use threads to handle multiple connections (at the same time and/or at different times).
I don't know what exactly you're trying to do, but crafting an HTTP request and reading its response involves somewhat more than you have done here. readLine() won't work on binary data anyway.
You can take a look at the URLConnection class (stolen here):
URL oracle = new URL("http://www.oracle.com/");
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
Then you can read textual or binary data from the in object.
readLine() will treat the data read as a String, so unless you want to mess around with converting it back to bytes, I wouldn't recommend it.
I would just read bytes until you can't read any more, then write them out to a file. This should allow you to grab the images, keeping file headers intact, which can be important when dealing with files other than text.
Hope this helps.
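A rough sketch of that byte-for-byte copy, reusing the socket names from the question (error handling omitted):

import java.io.InputStream;
import java.io.OutputStream;

InputStream in = s.getInputStream();                 // from the remote web server
OutputStream out = clientSocket.getOutputStream();   // back to the browser
byte[] buffer = new byte[4096];
int n;
while ((n = in.read(buffer)) != -1) {
    out.write(buffer, 0, n); // copy exactly the bytes read; works for text and binary alike
}
out.flush();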
Instead of using a BufferedReader you can try using the InputStream directly.
It has several methods for reading bytes.
http://docs.oracle.com/javase/6/docs/api/java/io/InputStream.html
I'm currently creating a little web server (for testing purposes) and I have a problem reading the HTTP request header (coming from the browser, Chromium in my case).
First, I simply tried something like this:
BufferedReader in = new BufferedReader(
new InputStreamReader(client_socket.getInputStream(), "UTF-8")
);
StringBuilder builder = new StringBuilder();
while (in.ready()){
builder.append(in.readLine());
}
return builder.toString();
This worked fine for the first request. However, after the first request was done, the ready() method only returned false (I closed the client_socket as well as all readers/writers).
After a little searching I stumbled across this older question: Read/convert an InputStream to a String
I tried the first four solutions (two with Apache Commons, the one with the Scanner and the one with the do-while loop). All of them blocked for ever and the browser gave me a "Website not reachable"-error.
I'd like to do this on my own (without using any libraries or embedded servers); that's why I'm even trying.
I'm a little lost right now, how would you go about this?
You are reading from the socket until there is no more data to read. That is wrong. You need to keep reading until you encounter a 0-length line, then process the received headers to determine if there is more data to read (look for Content-Length: ... and Transfer-Encoding: chunked headers), eg:
StringBuilder builder = new StringBuilder();
String line;
do
{
line = in.readLine();
if (line == "") break;
builder.append(line);
}
while (true);
// use builder as needed...
// read message body data if headers say there
// is a body to read, per RFC 2616 Section 4.4...
Read RFC 2616 Section 4 for more details. Not only does this facilitate proper request reading, but doing so also allows you to support HTTP keep-alives correctly (for sending multiple requests on a single connection).
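As a rough illustration of the body-reading step (assuming the headers have already been collected into a hypothetical map called headers, and keeping in mind that Content-Length counts bytes, so reading characters like this is only safe for single-byte encodings):

// Sketch: read exactly Content-Length characters after the blank line, if that header is present.
String lengthHeader = headers.get("Content-Length"); // 'headers' comes from your own header parsing
if (lengthHeader != null) {
    int contentLength = Integer.parseInt(lengthHeader.trim());
    char[] body = new char[contentLength];
    int off = 0;
    while (off < contentLength) {
        int n = in.read(body, off, contentLength - off);
        if (n == -1) break; // connection closed early
        off += n;
    }
    // For Transfer-Encoding: chunked you must decode the chunk sizes instead (RFC 2616 Section 3.6.1).
}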
The solution suggested above by Remy Lebeau is wrong, as a test of mine showed. This alternative is fail-safe:
StringBuilder builder = new StringBuilder();
String line;
do
{
line = in.readLine();
if (line.equals("")) break;
builder.append(line);
}
while (true);
Refer to: How do I compare strings in Java?
Here is my code:
DefaultHttpClient client = new DefaultHttpClient();
HttpGet request = new HttpGet(url);
HttpResponse response = client.execute(request);
This works for every URL I have tried so far, except for some URLs that contain an anchor. Some of these anchored URLs return a 400. The weird thing is that it isn't all links that contain an anchor; a lot of them work just fine.
Unfortunately, I have to be really general as I can't provide the specific urls here.
The links are completely valid and work just fine in any browser, but the HttpClient returns a 400 when trying the link. If I remove the anchor it will work.
Any ideas what to look for?
For example: http://www.somedomain.com/somedirectory/somepage#someanchor
Sorry again for the generics
EDIT: I should mention this is for Android.
Your usage of the anchor in the URL is incorrect.
When we perform a GET, we need to fetch the entire resource (page). The anchor is just a tag marking a location; normally your browser will scroll to the position of the anchor once the page is loaded. It does not make sense to GET the page at a specific anchor: the entire page must be fetched.
It is possible your inconsistent results are because some web servers are ignoring the anchor component, and others are correcting your error.
The solution is just to remove the #anchor portion of the URL before running your code.
As @Greg Sansom says, the URL should not be sent with an anchor / fragment. The fragment part of the URL is not relevant to the server.
Here's the expected URL syntax from relevant part of the HTTP 1.1 specification:
http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]
Note: there is no fragment part in the syntax.
What happens if you do send a fragment clearly is server implementation specific. I expect that you will see a variety of responses:
Some servers will silently strip / ignore the fragment part. (This is what you are expecting to happen).
Some servers might treat this as a request error and respond with a 400.
Some servers might mistakenly treat the fragment as part of the path or query, and give you a 404 or some other response, depending on how "confused" the fragment makes the server.
Some servers might actually imbue the fragment with a specific meaning. (This strikes me as a stupid thing to do, but you never know ...)
IMO, the most sensible solution is to strip it from the URL before instantiating the HttpGet object.
FOLLOWUP
The recommended way to remove a fragment from a URL string is to turn it into a java.net.URL or java.net.URI instance, extract the relevant components, use these to create a new java.net.URL or java.net.URI instance (leaving out the fragment of course), and finally turn it back into a String.
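For instance, a sketch along those lines using java.net.URI (variable names assumed; the URI constructors declare URISyntaxException):

import java.net.URI;
import java.net.URISyntaxException;

URI original = new URI(url);
// Rebuild the URI from its components, leaving out the fragment.
URI stripped = new URI(original.getScheme(), original.getAuthority(),
        original.getPath(), original.getQuery(), null);
String strippedUrl = stripped.toString();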
But I think that the following should also work, if you can safely assume that your URLs are all valid absolute HTTP or HTTPS URLs.
int pos = url.indexOf("#");
String strippedUrl = (pos >= 0) ? url.substring(0, pos) : url;
String user_url2 = "http://www.somedomain.com/somedirectory/somepage#someanchor";
HttpClient client = new DefaultHttpClient();
HttpGet siteRequest = new HttpGet(user_url2);
StringBuilder sb = new StringBuilder();
HttpResponse httpResponse;
String result = null;
try {
    httpResponse = client.execute(siteRequest);
    HttpEntity entity = httpResponse.getEntity();
    InputStream in = entity.getContent();
    String line = null;
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(in));
    while ((line = reader.readLine()) != null)
    {
        sb.append(line);
    }
    result = sb.toString();
} catch (IOException e) {
    e.printStackTrace();
}
The result string will contain the content returned for the URL.
There is a bug in Android HttpClient that was fixed in HttpClient 1.2 but not backported to Android
https://issues.apache.org/jira/browse/HTTPCLIENT-1177
https://github.com/apache/httpclient/commit/be6347aef0f7450133017b775113a8f3fadd2f1c
I have opened a bug report at:
https://code.google.com/p/android/issues/detail?id=65909