I'm getting a "too many redirects" error from URLConnection when trying to fetch www.palringo.com:
URL url = new URL("http://www.palringo.com/");
HttpURLConnection.setFollowRedirects(true);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
System.out.println("Response code = " + connection.getResponseCode());
outputs the dreaded:
Exception in thread "main" java.net.ProtocolException: Server redirected too many times (20)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
According to wget there is just one redirect, from www.palringo.com to www.palringo.com/en/gb/
Any ideas why my URLConnection request for /en/gb/ results in another 302 response for the same resource?
The problem is exemplified by:
URL url = new URL("http://www.palringo.com/en/gb/");
HttpURLConnection.setFollowRedirects(false);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
// Just for testing, use Chrome header, to eliminate "anti-crawler" response!
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30");
System.out.println("Response code = " + connection.getResponseCode());
This outputs:
Response code = 302
Redirected to /en/gb/
hence an infinite redirect loop.
Interestingly, although browsers and wget handle it, curl does not:
joel@bohr:/tmp$ curl http://www.palringo.com/en/gb/
curl: (7) couldn't connect to host
A request for /en/gb/ is redirected to /en/gb/ precisely once.
The problem is that your HttpURLConnection (or whatever code you use -- sorry, I'm NOT familiar with Java) does not use cookies.
Disable cookies in browser and observe exactly the same behaviour -- infinite redirect.
The reason: the server checks whether the cookie is set. If it is not set, the server sets it and redirects. Because cookies are not supported (or are disabled), the server-side script redirects over and over again.
Solution: Enable/add cookie support to your code and try again.
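In Java that means installing a cookie handler before the connection is opened; a minimal sketch, assuming the default in-memory CookieManager is enough:
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchWithCookies {
    public static void main(String[] args) throws Exception {
        // Install a JVM-wide in-memory cookie store; HttpURLConnection
        // consults CookieHandler.getDefault() and will send back the
        // cookie the server sets on the first redirect.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

        HttpURLConnection connection =
                (HttpURLConnection) new URL("http://www.palringo.com/").openConnection();
        System.out.println("Response code = " + connection.getResponseCode());
    }
}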
I think the redirect is defined with a pattern like /* -> /en/gb/, so when you arrive at /en/gb/ the redirect rule matches again.
Check your redirect rules. Where are they defined? In the Apache web server or somewhere else? Check them all, verify whether this is the case, and fix the rules accordingly.
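For illustration only, a hypothetical Apache mod_rewrite configuration with exactly this flaw, and a guarded version that breaks the loop:
# Broken: the pattern matches every path, including the target /en/gb/,
# so each request produces another 302.
RewriteRule ^.*$ /en/gb/ [R=302,L]

# Guarded: exclude the target before redirecting.
RewriteCond %{REQUEST_URI} !^/en/gb/
RewriteRule ^.*$ /en/gb/ [R=302,L]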
The problem is on the server side. It might be a broken Apache httpd rewrite rule that is sending redirects that loop back to the same place. It might be something else. Whatever it is, you are unlikely to be able to fix it on the client side.
I'm basically running a crawler and just noticed this issue.
Ah.
It is possible that it is an anti-crawler defence measure. "Hmmm ... looks like one of those pesky crawlers who ignore my robots.txt file, waste all of my bandwidth and steal my precious content. Let's cause him some pain with a redirect loop!!"
Check that your crawler is obeying the "robots.txt" protocol. Check the ToS for the site you are crawling to see if what you are doing is allowed.
You could be right, but if so how come wget and browsers handle this with just the one redirect?
Maybe because the server is looking at the request headers, or at your pattern of requests.
The Terms of Service (that I see) say this:
"You agree to not use the Service to: ... xiii - Run any automated systems, processes, scripts or bots for any purpose without the express written permission of Palringo."
Arguably, crawling their site is in violation of that.
You will also get this error if you're trying to connect to a service that requires authentication and you provide a wrong username and password.
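That case looks the same on the client because HttpURLConnection re-asks its Authenticator on every 401 and gives up with the same ProtocolException after about 20 attempts. A minimal sketch that reproduces it (the URL and credentials are placeholders):
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.PasswordAuthentication;
import java.net.URL;

public class WrongPasswordLoop {
    public static void main(String[] args) throws Exception {
        // A wrong password here is retried on every 401 until the
        // "Server redirected too many times" limit is hit.
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication("user", "wrong-password".toCharArray());
            }
        });

        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://example.com/protected").openConnection(); // placeholder URL
        System.out.println(conn.getResponseCode());
    }
}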
Related
I am tasked with checking whether some URLs are working correctly. I'm using Java to make an HTTP GET request and read the response code.
So what I did was this.
URL u = new URL("some URL");
HttpURLConnection huc = (HttpURLConnection) u.openConnection();
huc.setRequestMethod("GET");
huc.connect();
int code = huc.getResponseCode();
System.out.println(code + " " + huc.getURL());
The problem: some sites require you to log in to access the page, but the page doesn't return a 401 code, it returns 200. Note that the web page doesn't show up until a username and password are provided; it asks for authentication in a pop-up window.
So how do I catch these kind of links?
Also, how can I identify if a webpage shows a login page like http://www.example.com/login/? Is it sufficient to just check the URL for the word “login”?
There's no universal way to deal with this. You have to know how the site you're using does authentication: 401? A separate login page? Multi-factor auth (e.g. an RSA token)? Checking for the substring "login" in the URL is a possible way of handling some cases, but not enough for a general solution.
For example, a 401 will only happen when using basic authentication (or when trying to access protected resources directly). There are a lot of other ways to do auth.
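As a rough starting point, you can at least distinguish HTTP-level authentication from everything else by looking at the status code and the WWW-Authenticate header. A sketch (the URL is a placeholder, and the "login" check is only a heuristic):
import java.net.HttpURLConnection;
import java.net.URL;

public class AuthProbe {
    public static void main(String[] args) throws Exception {
        HttpURLConnection huc = (HttpURLConnection)
                new URL("http://www.example.com/some-page").openConnection(); // placeholder
        huc.setRequestMethod("GET");
        int code = huc.getResponseCode();

        if (code == 401 && huc.getHeaderField("WWW-Authenticate") != null) {
            // Basic/Digest auth announces itself explicitly.
            System.out.println("HTTP authentication required: "
                    + huc.getHeaderField("WWW-Authenticate"));
        } else if (code == 200 && huc.getURL().getPath().contains("login")) {
            // Heuristic only: form-based logins often redirect to a
            // URL containing "login" and then answer 200.
            System.out.println("Possibly a form-based login page");
        } else {
            System.out.println(code + " " + huc.getURL());
        }
    }
}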
John sums up the issue quite well in his comment:
If you have to deal with pages that roll their own custom authentication, then it follows that you probably have to write your own custom code to accommodate them. Depending on how the relevant sites work, you might be able to bypass authentication by sending an appropriate cookie in your request, as if you had already authenticated, or by some similar means.
I admit there is a possibility that I am not well informed about the subject, but I've done loads of reading and I still can't get an answer to my question.
From what I have learnt, to make communication secure with HTTPS I need to be using some sort of public key (it reminds me of PGP encryption).
My goal is to make a secured POST request from my Java application (which, the moment it starts working, I will rewrite as an Android app, if that matters) to a PHP application accessible via an https address.
Naturally I did some Google research on the topic and got a lot of results on how to make an SSL connection. None of those results used any sort of certificate/hash prints; they just use HttpsURLConnection instead of HttpURLConnection, and everything else is almost identical.
Right now, an almost copy-paste of something I found here is this:
String httpsURL = "https://xx.yyyy.zzz/requestHandler.php?getParam1=value1&getParam2=value2";
String query = "email=" + URLEncoder.encode("abc@xyz.com", "UTF-8");
query += "&";
query += "password=" + URLEncoder.encode("tramtarie", "UTF-8");

URL myurl = new URL(httpsURL);
HttpsURLConnection con = (HttpsURLConnection) myurl.openConnection();
con.setRequestMethod("POST");
// Content length must be the byte length of the body, not the char count
con.setRequestProperty("Content-Length", String.valueOf(query.getBytes("UTF-8").length));
con.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
con.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)");
con.setDoOutput(true);
con.setDoInput(true);

DataOutputStream output = new DataOutputStream(con.getOutputStream());
output.writeBytes(query);
output.close();

DataInputStream input = new DataInputStream(con.getInputStream());
for (int c = input.read(); c != -1; c = input.read()) {
    System.out.print((char) c);
}
input.close();

System.out.println("Resp Code:" + con.getResponseCode());
System.out.println("Resp Message:" + con.getResponseMessage());
Which sadly does not work and ends up with this exception:
Exception in thread "main" javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: No subject alternative DNS name matching app.elessy.cz found
This probably means that it checks the certificate and finds that the certificate I am using does not match the domain name it was issued for (it is the web host's certificate, registered for the hosting provider's domain, not the domain I own). The only reason I am using https is to secure data for internal purposes; I do not want this site to be visited by outside users, so this certificate should be OK.
There are two things that I just don't get about the code and everything.
None of the code I have been able to find uses MD5/SHA-1 fingerprints (supposedly the public keys for message encryption?) or a certificate; it just somehow automatically connects to the https website and is supposed to work. It doesn't work for me, though.
Do I really need those MD5/SHA-1 fingerprints that were provided to me? Or at least, what do those fingerprints mean in this context?
Edit:
Following the given answer and the duplicate mark, I managed to get it working, in the sense that I can communicate with the application behind https.
But I didn't have to use any sort of MD5/SHA-1 fingerprint. How do I know now that it is safe? Does the protocol handle this on its own? Is the communication secured either way when I use the built-in Java classes to connect to an app behind https?
I am probably not looking for a precise technical explanation, but more for assurance that yes, the communication is safe even though I do not (knowingly) use a certificate or the server's public key to encrypt my messages; that it does the SSL connection for me.
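For what it's worth: yes, the TLS layer does this for you. During the handshake, HttpsURLConnection validates the server's certificate chain against the JRE's default trust store (that is exactly the check that failed with the SSLHandshakeException above) and negotiates the encryption keys; you never touch MD5/SHA-1 fingerprints in code. A minimal sketch that inspects the validated chain (the URL is a placeholder):
import java.net.URL;
import java.security.cert.Certificate;
import javax.net.ssl.HttpsURLConnection;

public class ShowServerCerts {
    public static void main(String[] args) throws Exception {
        HttpsURLConnection con =
                (HttpsURLConnection) new URL("https://example.org/").openConnection();
        con.connect(); // TLS handshake: certificate validation and key exchange happen here

        // If connect() returned without an exception, the chain was accepted.
        for (Certificate cert : con.getServerCertificates()) {
            System.out.println(cert.getType());
        }
        con.disconnect();
    }
}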
I'm trying to use jsoup to log in to a website (an intranet page with some subsystems), enter a subsystem, search for something and parse the page.
I can log in, but when I try to access the subsystem I receive an HTTP error 502. However, in the browser it opens normally.
I think there is some problem with a proxy (which is already set in Java). After a few tries my login is blocked and I get HTTP error 407 (proxy authentication required, or something like that).
I already tried .userAgent("mozilla..."), .timeout(...), .ignoreHttpErrors(true), .ignoreContentType(true), and using .cookie(...) too.
Is there some way to solve this?
Connection.Response x = Jsoup.connect("page").data("...").method(Connection.Method.GET).execute();
I used the given suggestion (Apache HttpClient) and I don't get the HTTP errors anymore.
But I still want to know if jsoup can get around this problem, because then I could use just one .jar instead of 6 (5 from Apache plus jsoup to parse the responses). Thanks to those who edited my post and to ollo for the suggestion.
Here's an example using Java's URLConnection. Note that proxy settings are not HTTP request headers, so they cannot be set with addRequestProperty; pass a java.net.Proxy when opening the connection, or set the JVM-wide system properties:
Proxy proxy = new Proxy(Proxy.Type.HTTP,
        new InetSocketAddress("proxy server", 8080)); // java.net.Proxy / InetSocketAddress
URLConnection connection = new URL("your url").openConnection(proxy);
// Alternative: JVM-wide system properties
System.setProperty("http.proxyHost", "yourproxyserver");
System.setProperty("http.proxyPort", "portnumber");
InputStream responseStream = connection.getInputStream();
// Read the response into a buffer and parse it with jsoup
See also my answer here: JSoup over VPN/proxy
(I guess that's a better one.)
But I really recommend HttpClient (or something similar) for such connection things. As I said before, jsoup has only limited connection support.
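For completeness: if you want to stay with jsoup alone, newer versions (1.9+, if I remember correctly) let you set a proxy per connection. A hedged sketch with placeholder host and port:
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class JsoupViaProxy {
    public static void main(String[] args) throws Exception {
        Connection.Response res = Jsoup.connect("http://www.example.com/") // placeholder URL
                .proxy("proxy.example.com", 8080) // placeholder proxy host/port
                .userAgent("Mozilla/5.0")
                .method(Connection.Method.GET)
                .execute();
        System.out.println(res.statusCode());
    }
}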
This question already has answers here: 403 Forbidden with Java but not web browser?
My code goes like this:
URL url;
URLConnection uc;
StringBuilder parsedContentFromUrl = new StringBuilder();
String urlString="http://www.example.com/content/w2e4dhy3kxya1v0d/";
System.out.println("Getting content for URl : " + urlString);
url = new URL(urlString);
uc = url.openConnection();
uc.connect();
BufferedInputStream in = new BufferedInputStream(uc.getInputStream());
int ch;
while ((ch = in.read()) != -1) {
parsedContentFromUrl.append((char) ch);
}
System.out.println(parsedContentFromUrl);
However, when I access the URL through a browser there is no problem, but when I try to access it through a Java program it throws an exception:
java.io.IOException: Server returned HTTP response code: 403 for URL
What is the solution?
Set a browser-like User-Agent on the connection before calling uc.connect(); (request properties cannot be changed once the connection is open):
uc = url.openConnection();
uc.addRequestProperty("User-Agent",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
uc.connect();
However, from the server's side it is a nice idea to only allow certain types of user agents; this keeps the website safe and bandwidth usage low. There are some bad 'User-Agents' you might want to block from your server if you don't want people leeching your content and bandwidth. But, as the example above shows, the user agent can be spoofed.
403 means forbidden. From RFC 2616:
10.4.4 403 Forbidden
The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.
You need to contact the owner of the site to make sure the permissions are set properly.
EDIT: I see your problem. I ran the URL through Fiddler and noticed that I am getting a 407, which means the following. This should help you go in the right direction.
10.4.8 407 Proxy Authentication Required
This code is similar to 401 (Unauthorized), but indicates that the client must first authenticate itself with the proxy. The proxy MUST return a Proxy-Authenticate header field (section 14.33) containing a challenge applicable to the proxy for the requested resource. The client MAY repeat the request with a suitable Proxy-Authorization header field (section 14.34). HTTP access authentication is explained in "HTTP Authentication: Basic and Digest Access Authentication".
Also see this relevant question.
java.io.IOException: Server returned HTTP response code: 403 for URL
If the browser can access the page and your code cannot, then there's something different between the browser request and your request. You can inspect the browser request, using, say, Firebug, to see what the differences are. Some things I can think of are:
1. The site sets a cookie (maybe during login). You may be able to handle this in code, but you will have to explicitly add support for passing the cookie. This is the most likely cause; see the sketch below.
2. The site filters based on user agents. You can set the user agent. This is not as likely.
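A hedged sketch of the first approach, replaying a session cookie copied from the browser's developer tools (the cookie name and value are hypothetical placeholders):
import java.io.BufferedInputStream;
import java.net.URL;
import java.net.URLConnection;

public class FetchWithBrowserCookie {
    public static void main(String[] args) throws Exception {
        URLConnection uc =
                new URL("http://www.example.com/content/w2e4dhy3kxya1v0d/").openConnection();
        // Placeholders: copy the real cookie and a browser-like agent string.
        uc.addRequestProperty("Cookie", "JSESSIONID=abc123");
        uc.addRequestProperty("User-Agent", "Mozilla/5.0");
        uc.connect();

        BufferedInputStream in = new BufferedInputStream(uc.getInputStream());
        for (int ch; (ch = in.read()) != -1; ) {
            System.out.print((char) ch);
        }
    }
}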
I have HTTP-based queries in my code, and one specific kind seems to give rise to IOExceptions upon receiving a 505 response from the server. I have looked up the 505 response along with other people who seem to have similar problems. Apparently 505 stands for an HTTP version mismatch, but when I copy the same query URL into any browser (I tried Firefox, SeaMonkey and Opera) there seems to be no problem. One of the posts I read suggested that browsers might automatically handle the version-mismatch problem.
I have tried to dig deeper by using the nice developer tool that comes with Opera, and it looks like there is no mismatch in versions (I believe Java uses HTTP/1.1) and a nice 200 OK response is received. Why do I experience problems when the same query goes through my Java code?
private InputStream openURL(String urlName) throws IOException {
    URL url = new URL(urlName);
    URLConnection urlConnection = url.openConnection();
    return urlConnection.getInputStream();
}
sample link: http://www.uniprot.org/uniprot/?query=mnemonic%3aNUGM_HUMAN&format=tab&columns=id,entry%20name,reviewed,organism,length
There have been some issues in Tomcat with URLs containing a space. To fix the problem, you need to encode the URL with URLEncoder. Note that URLEncoder is meant for individual query components; applying it to a whole URL also escapes the scheme and slashes, as the output below shows, so in practice encode only the offending parts.
Example (notice the space):
String url = "http://example.org/test test2/index.html";
String encodedURL = java.net.URLEncoder.encode(url, "UTF-8");
System.out.println(encodedURL); // outputs http%3A%2F%2Fexample.org%2Ftest+test2%2Findex.html
As a developer at www.uniprot.org I have the advantage of being able to look in the request logs. According to the logs, in the last year we have not sent a 505 response code. In any case, our servers do understand HTTP/1.0 requests as well as the default HTTP/1.1 (though you might not get the results that you expect).
That makes me suspect there was either some kind of data corruption on the way, or you were affected by a hardware failure (lately we have had some trouble with a switch and a whole datacentre ;). In any case, if you ever have questions or problems with uniprot.org, please contact help@uniprot.org and we can see if we can help/fix the problem.
Your code snippet seems normal and should work.
Regards,
Jerven Bolleman
Are you behind a proxy? This code works for me and prints out the same text I see through a browser.
final URL url = new URL("http://www.uniprot.org/uniprot/?query=mnemonic%3aNUGM_HUMAN&format=tab&columns=id,entry%20name,reviewed,organism,length");
final URLConnection conn = url.openConnection();
final InputStream is = conn.getInputStream();
System.out.println(IOUtils.toString(is)); // IOUtils is from Apache Commons IO
conn is an instance of HttpURLConnection
from the API documentation for the URL class:
The URL class does not itself encode or decode any URL components [...]. It is the responsibility of the caller to encode any fields, which need to be escaped prior to calling URL, and also to decode any escaped fields, that are returned from URL.
So if you have any spaces in your URL string, encode them before calling new URL(...).
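Applied to the uniprot URL above, that means encoding only the query values, not the whole URL. A sketch (URLEncoder encodes a space as '+', which is fine inside a query string):
import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;

public class EncodedQuery {
    public static void main(String[] args) throws Exception {
        // Encode each query value separately; the scheme, host and
        // parameter names stay as-is.
        String url = "http://www.uniprot.org/uniprot/?query="
                + URLEncoder.encode("mnemonic:NUGM_HUMAN", "UTF-8")
                + "&format=tab&columns="
                + URLEncoder.encode("id,entry name,reviewed,organism,length", "UTF-8");
        InputStream in = new URL(url).openConnection().getInputStream();
        System.out.println("Connected: " + url);
        in.close();
    }
}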
@posdef I was having the same HTTP error code 505 problem. When I pasted the URL from my Java code into Firefox or Chrome it worked, but through code it was giving an IOException. At last I came to know that the URL string contained brackets '(' and ')'; after removing them it worked, so it seems I needed URL encoding just like the browsers do.