How to open many HttpURLConnections? - java

I'm trying to implement a simple URL availability checker which basically checks whether a link is available (no HTTP 403, 404, etc. returned).
I have more than 20,000 links (to different servers/websites) in my database for testing purposes, but it doesn't seem to work when I try to create more than 10 threads.
Here is the code I'm using for opening the connection and reading the response code within each WorkerThread.
URL url = new URL(dto.getUrl());
httpUrlConnection = (HttpURLConnection) url.openConnection();
httpUrlConnection.setUseCaches(false);
// httpUrlConnection.setConnectTimeout(6000);
httpUrlConnection.setDoInput(true);
httpUrlConnection.setDoOutput(false);
httpUrlConnection.setRequestMethod("GET");
httpUrlConnection.setRequestProperty("Host", dto.getUrl().replace("http://", ""));
// httpUrlConnection.setRequestProperty("Connection",
// "Keep-Alive");
httpUrlConnection.setRequestProperty("User-Agent", USER_AGENT);
httpUrlConnection.setRequestProperty("Cache-Control", "no-cache");
httpUrlConnection.connect();
int code = httpUrlConnection.getResponseCode();
A few issues I have noticed when multiple threads open the connections:
1) Only the first 100-200 connections seem to open with no problem; after that, I start getting "Read timeout", "Connection timeout", "Connection reset", etc. However, if I run the code again, the links that threw those exceptions return a proper response code (if they get processed within the first 100).
2) The response code is sometimes not valid (especially if the link was processed after the first 100 links). I have noticed that 404 is sometimes returned when in fact it should be 200 (I checked this by moving the link into the first 100).
I did try using Apache's HttpClient, but it also fails to process the links correctly with many threads.
So does anyone know a solution to this problem? What is the maximum number of connections you can open using HttpURLConnection with multiple threads? Is there any other way to open many HTTP connections and check response codes?
Thank you all in advance!
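For context, a minimal sketch of how such checks are usually bounded: a fixed-size thread pool, explicit connect/read timeouts, and a disconnect() in a finally block so sockets are released. The pool size and the UrlDto/dtos names are illustrative, not taken from the original code.
ExecutorService pool = Executors.newFixedThreadPool(10); // bound concurrency instead of one thread per link
for (UrlDto dto : dtos) { // UrlDto / dtos are hypothetical stand-ins for the question's DTO list
    pool.submit(() -> {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(dto.getUrl()).openConnection();
            conn.setConnectTimeout(6000); // fail fast instead of tying up a worker
            conn.setReadTimeout(6000);
            conn.setRequestMethod("GET");
            int code = conn.getResponseCode();
            // record the code for this dto here
        } catch (IOException e) {
            // record the failure for this dto here
        } finally {
            if (conn != null) {
                conn.disconnect(); // release the underlying socket
            }
        }
    });
}
pool.shutdown();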

Related

PORT Change in error case: HttpURLConnection

I am consuming an API using HttpURLConnection in my Android application and it runs fine, but if I get a response code other than 200 OK (like 404 or 500), my port changes when I make the next request after the error response:
My code for the Android request is below, along with a Wireshark log:
try {
    url = new URL(path_url + apiMsg); // in the real code, there is an ip and a port
    conn = (HttpURLConnection) url.openConnection();
    conn.setDoInput(true);
    conn.setDoOutput(true);
    conn.setConnectTimeout(5000);
    conn.setRequestProperty("Content-Type", "application/json; charset=UTF-8");
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Accept", "*/*");
}
Please refer to the Wireshark log:
https://files.fm/u/w7umrwwk
So how do I avoid the port change in the error scenario, as in the success (200) case, so that we keep running on the same port?
Read about sun.net.http.errorstream.enableBuffering in the HttpURLConnection source code.
By default, when the response code is >= 400, the connection is closed.
That is a clean, though not very efficient, way of handling error streams.
Instead of setting obscure system properties to handle this, it would be better to move to a proper HTTP client such as Apache's.
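For illustration, a minimal sketch of the property-based workaround (these sun.net.* properties are JDK-internal and unsupported, and must be set before the first connection is opened); note that the error body also has to be drained via getErrorStream() for the socket to go back into the keep-alive cache:
// Set once at startup, before any HttpURLConnection is created.
System.setProperty("http.keepAlive", "true"); // the default, shown here for clarity
System.setProperty("sun.net.http.errorstream.enableBuffering", "true"); // buffer error bodies so the socket can be reused

// Later, for each request (conn is the connection from the code above):
int code = conn.getResponseCode();
InputStream body = (code >= 400) ? conn.getErrorStream() : conn.getInputStream();
if (body != null) {
    byte[] buf = new byte[4096];
    while (body.read(buf) != -1) {
        // drain the body completely; an undrained body forces the connection to be closed
    }
    body.close();
}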

Parallelize HttpsURLConnection in Java

I am working on an application which executes a lot of HTTP POST requests against a specific URL. The data is included in the request as JSON. Performing a single POST request takes about 200 ms until I get the response code back. To parallelize these requests, I started using several worker threads, each opening an HttpURLConnection and posting its requests over this connection.
Unfortunately I do not see much improvement, so I wanted to clarify what is happening with the underlying TCP sockets and get some thoughts from you.
Each of my workers executes the request like the following example:
public void submitRequest() throws IOException {
    URL url = new URL("https://example.com");
    HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setRequestProperty("Accept", "application/json");
    conn.setDoOutput(true);
    conn.setRequestMethod("POST");
    DataOutputStream output = new DataOutputStream(conn.getOutputStream());
    // Construct the POST data.
    String content = generatedJSONString();
    // Send the request data.
    output.writeBytes(content);
    output.flush();
    output.close();
    // Check the response code here.
    int response = conn.getResponseCode();
}
Because each worker uses its own HttpsURLConnection, I assume that the necessary TLS handshake is only done once per worker and the requests can then be executed without an additional handshake? I want each worker to use its own TCP connection to the server. I tried to check with Wireshark, but it does not show me the TLS handshake, and I don't really know why.
Even with 50 workers in parallel, my code does not execute fast enough to post the relevant data to the URL. So I ask myself whether it is because the TCP connections are not being saturated by the code, since the JSON data (usually 10 key-value pairs) is very little compared to the TCP/HTTP overhead. If not, I might ask the service provider to consider a better way to deliver data in batches, i.e. more JSON data per individual POST request. I also changed the code to use the Apache HttpClient, but I got similar performance, so I think there is a mistake in how I believe the TLS connections get established and reused.
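One thing to keep in mind: an HttpsURLConnection object is used for a single request; reuse of the underlying TCP/TLS connection happens transparently through the JVM's keep-alive cache, which by default only keeps around 5 idle connections per destination, so with 50 workers many requests may pay for a fresh handshake. A minimal sketch of one way to test that theory, assuming all requests go to the same host (requestCount is an illustrative name):
// Set before any connections are opened: let the keep-alive cache hold as many idle
// connections per destination as there are workers.
System.setProperty("http.maxConnections", "50");

ExecutorService pool = Executors.newFixedThreadPool(50);
for (int i = 0; i < requestCount; i++) { // requestCount is illustrative
    pool.submit(() -> {
        try {
            submitRequest(); // the worker method from the question
        } catch (IOException e) {
            // log and continue; one failed POST should not stop the other workers
        }
    });
}
pool.shutdown(); // no new tasks; already-submitted requests run to completion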

How do I reset a URL connection

I use a URL connection to download a stream from the Internet. But after I reset the modem, I can't continue downloading the stream because it fails with "Connection reset". How do I solve this?
Here is my code:
URL url = new URL(_URL);
HttpURLConnection hUC = (HttpURLConnection) url.openConnection();
hUC.connect();
InputStream is = hUC.getInputStream();
while (true) {
    if ((_data.num = is.read(_data.b)) == -1) {
        break;
    }
    // write the chunk to the file
    fos.write(_data.b, 0, _data.num);
}
You can't - at least, not how you may be expecting.
Instead, you need to handle your exception, and determine how much data you've already read. Once your Internet connection is re-established - assuming that the HTTP server you're downloading from supports requestable byte ranges - you can then set custom HTTP Headers on the request and re-download the remaining portions. (This will require a new HttpURLConnection.)
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35 shows the related HTTP specifications involved to make this work.
This is a bit more complicated if you're looking for a "resume" type feature.
You would need to reissue the request once the internet comes back after a disconnect, and add a header to the request in order to resume the download at the byte number where you left off.
You need to set the Range property in the request header in order to specify how far in you're resuming. Then you would just continue to write to the "fos" object from there.
Check out this url: Java: resume Download in URLConnection
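A minimal sketch of the Range-header resume described above, assuming the caller tracks how many bytes have already been written to the file (resumeDownload, fileUrl and target are illustrative names):
void resumeDownload(String fileUrl, File target, long alreadyDownloaded) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(fileUrl).openConnection();
    // Ask the server for everything from the last byte we received onwards.
    conn.setRequestProperty("Range", "bytes=" + alreadyDownloaded + "-");
    if (conn.getResponseCode() == HttpURLConnection.HTTP_PARTIAL) { // 206: the server honoured the range
        try (InputStream in = conn.getInputStream();
             FileOutputStream fos = new FileOutputStream(target, true)) { // append to the existing file
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                fos.write(buf, 0, n);
            }
        }
    } else {
        // The server ignored the Range header (or replied 200); restart the download from scratch.
    }
    conn.disconnect();
}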

Infinite redirect loop in HTTP request

I'm getting a "too many redirects" error from URLConnection when trying to fetch www.palringo.com:
URL url = new URL("http://www.palringo.com/");
HttpURLConnection.setFollowRedirects(true);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
System.out.println("Response code = " + connection.getResponseCode());
outputs the dreaded:
Exception in thread "main" java.net.ProtocolException: Server redirected too many times (20)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
According to wget there is just one redirect, from www.palringo.com to www.palringo.com/en/gb/
Any ideas why my request using URLConnection for /en/gb results in another 302 response for the same resource?
The problem is exemplified by:
URL url = new URL("http://www.palringo.com/en/gb/");
HttpURLConnection.setFollowRedirects(false);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
// Just for testing, use Chrome header, to eliminate "anti-crawler" response!
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30");
System.out.println("Response code = " + connection.getResponseCode());
This outputs:
Response code = 302
Redirected to /en/gb/
hence an infinite redirect loop.
Interestingly, although browsers and wget handle it, curl does not:
joel#bohr:/tmp$ curl http://www.palringo.com/en/gb/
curl: (7) couldn't connect to host
A request for /en/gb/ is redirected to /en/gb/ precisely once.
The problem is that your HttpURLConnection (or whatever code you use -- sorry, I'm NOT familiar with Java) does not use cookies.
Disable cookies in your browser and you will observe exactly the same behaviour -- an infinite redirect.
The reason: the server checks whether a cookie is set. If it is not set, the server sets it and redirects. Because cookies are not supported/disabled, the server-side script redirects over and over again.
Solution: enable/add cookie support in your code and try again.
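A minimal sketch of enabling cookies for HttpURLConnection with the stock java.net CookieManager (set it once, before any connection is opened):
CookieManager cookieManager = new CookieManager();
cookieManager.setCookiePolicy(CookiePolicy.ACCEPT_ALL); // or ACCEPT_ORIGINAL_SERVER, the default
CookieHandler.setDefault(cookieManager); // HttpURLConnection now stores and resends cookies automatically

// Subsequent connections pick the handler up transparently:
HttpURLConnection connection = (HttpURLConnection) new URL("http://www.palringo.com/en/gb/").openConnection();
System.out.println("Response code = " + connection.getResponseCode());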
I think the redirect is defined with a pattern like /* -> /en/gb.
So when you arrive at /en/gb, the redirect rule fires again.
Check your redirect rules. Where are they defined? In the Apache web server or somewhere else? Check them all. Verify whether this is the case and fix the rules accordingly.
The problem is on the server side. It might be a broken Apache httpd rewrite rule that is sending redirects that loop back to the same place. It might be something else. Whatever it is, you are unlikely to be able to fix it on the client side.
I'm basically running a crawler and just noticed this issue.
Ah.
It is possible that it is an anti-crawler defence measure. "Hmmm ... looks like one of those pesky crawlers who ignore my robots.txt file, waste all of my bandwidth and steal my precious content. Lets cause him some pain with a redirect loop!!".
Check that your crawler is obeying the "robots.txt" protocol. Check the ToS for the site you are crawling to see if what you are doing is allowed.
You could be right, but if so how come wget and browsers handle this with just the one redirect?
Maybe because the server is looking at the request headers, or at your pattern of requests.
The Terms of Service (that I see) say this:
"You agree to not use the Service to: ... xiii - Run any automated systems, processes, scripts or bots for any purpose without the express written permission of Palringo."
Arguably, crawling their site is in violation of that.
You will also get this error if you're trying to connect to a service that requires authentication and you provide the wrong username and password.

Trying to GET a Google Spreadsheet in Java is returning HTTP error 405

Been working on this all day and have gotten nowhere with it.
My Java code looks like this:
final URL url = new URL(String.format("https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=%s&exportFormat=tsv&gid=0", spreadsheetId));
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty("Authorization", "GoogleLogin auth=" + wiseAuth.getAuthToken());
conn.setRequestProperty("GData-Version", "3.0");
conn.setRequestMethod("GET");
conn.setDoOutput(true); // trouble here, see below
conn.setInstanceFollowRedirects(true);
conn.connect();
I always get a FileNotFound error when attempting conn.getInputStream(). I narrowed it down to the response code being 405 Method Not Allowed. The exception message contains my URL, and I can access the page just fine in my browser.
It was then that I discovered that setDoOutput(true) causes the request to be executed as a POST internally. But if I remove that line, conn.getInputStream() is null, and conn.getOutputStream() appears to return nothing -- though maybe I am setting it up wrong?
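For reference, a minimal sketch of the same request issued as a plain GET -- the only change from the code above is dropping setDoOutput(true), since enabling output is what switches HttpURLConnection to POST:
final URL url = new URL(String.format("https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=%s&exportFormat=tsv&gid=0", spreadsheetId));
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty("Authorization", "GoogleLogin auth=" + wiseAuth.getAuthToken());
conn.setRequestProperty("GData-Version", "3.0");
conn.setRequestMethod("GET");
// no setDoOutput(true) here: a GET has no request body, so no output stream is needed
conn.setInstanceFollowRedirects(true);
try (InputStream in = conn.getInputStream()) {
    // read the TSV export from the stream
}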
I don't recommend doing it like this; even if you get it working now, you cannot be sure it will keep working in the future if Google changes it.
Instead, consider using the Google Spreadsheet API. The provided Java examples are pretty straightforward, and you should be able to accomplish what you want.
I would recommend using a web debugger like Fiddler to see what exactly your application is sending in the GET request and compare it to your browser. You might be missing an important header or something, and Fiddler makes it really easy to slowly strip down your browser's request to the essential elements (just drag a request to clone it, then take out headers).
