I'm trying to use jsoup to log in to a website (an intranet page with some subsystems), enter a subsystem, search for something and parse the page.
I can log in, but when I try to access the subsystem I receive an HTTP error 502, even though the page opens normally in the browser.
I think it is some problem with a proxy (which is already set in Java). After a few tries my login is blocked and I get HTTP error 407 (page blocked, or something like that).
I already tried .userAgent("Mozilla..."), .timeout(...), .ignoreHttpErrors(true), .ignoreContentType(true) and using .cookie too.
Is there some way to solve this?
Connection.Response x = Jsoup.connect("page").data("...").method(Connection.Method.GET).execute();
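For reference, the full login-then-search flow I'm attempting looks roughly like this (the URLs, form field names and search parameter are placeholders):

// Log in first and capture the session cookies.
Connection.Response loginResponse = Jsoup.connect("http://intranet/login")
        .data("username", "user")
        .data("password", "pass")
        .method(Connection.Method.POST)
        .execute();

Map<String, String> cookies = loginResponse.cookies();

// Reuse the cookies when entering the subsystem and searching.
Document result = Jsoup.connect("http://intranet/subsystem/search?q=something")
        .cookies(cookies)
        .userAgent("Mozilla/5.0")
        .timeout(10000)
        .get();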
I used the given suggestion (Apache HttpClient) and I don't get the HTTP errors anymore.
But I still want to know whether jsoup can get around this problem, because then I could use just one .jar instead of 6 (5 from Apache plus jsoup to parse the responses). Thanks to those who edited my post and to ollo for the suggestion.
Here's an example using Java's URLConnection:
// Configure the proxy explicitly for this connection:
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("proxy server", 8080)); // your proxy host and port
URLConnection connection = new URL("your url").openConnection(proxy);

// Alternative: set the proxy globally via system properties
System.setProperty("http.proxyHost", "yourproxyserver");
System.setProperty("http.proxyPort", "portnumber");

InputStream responseStream = connection.getInputStream();
// Read the response into a buffer and parse it with jsoup
See also my answer here: JSoup over VPN/proxy
(I guess that's a better one.)
But I really recommend HttpClient (or a similar library) for such connection things. As I said before, jsoup has only limited connection support.
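If you go that route, a rough sketch with HttpClient could look like this (the builder-based proxy setup assumes HttpClient 4.3 or later; the URLs are placeholders):

import org.apache.http.HttpHost;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Route the request through the proxy, then hand the HTML to jsoup for parsing.
HttpHost proxy = new HttpHost("proxy server", 8080);
CloseableHttpClient client = HttpClients.custom().setProxy(proxy).build();

HttpGet get = new HttpGet("http://intranet/subsystem/search?q=something");
try (CloseableHttpResponse response = client.execute(get)) {
    String html = EntityUtils.toString(response.getEntity());
    Document doc = Jsoup.parse(html);
    System.out.println(doc.title());
}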
Related
I need to find the HTTP response code of URLs in Java. I know this can be done using the URL and HttpURLConnection APIs and have gone through previous questions like this
and this.
I need to do this for around 2000 links, so speed is the most important requirement. Among those, I have already crawled 150-250 pages using crawler4j, and I don't know a way to get the response code from that library (which means I would have to connect to those links again with another library just to find the response code).
In Crawler4J, the class WebCrawler has a method handlePageStatusCode, which is exactly what you are looking for and what you would also have found if you had looked for it. Override it and be happy.
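A minimal sketch of such an override (this assumes crawler4j's handlePageStatusCode(WebURL, int, String) signature; check it against the version you are using):

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class StatusLoggingCrawler extends WebCrawler {

    @Override
    protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
        // Called for every fetched URL, so you get the status code without a second request.
        System.out.println(statusCode + " " + statusDescription + " -> " + webUrl.getURL());
    }
}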
The answer behind your first link contains everything you need:
How to get HTTP response code for a URL in Java?
URL url = new URL("http://google.com");
HttpURLConnection connection = (HttpURLConnection)url.openConnection();
connection.setRequestMethod("GET");
connection.connect();
int code = connection.getResponseCode();
The response code is the HTTP code returned by the server.
I am trying to read a website using the java.net package classes. The site has content, and I can see it manually in the HTML source utilities in the browser. When I get its response code and try to view the site using Java, it connects successfully but treats the site as one without content (a 204 code). What is going on, and is it possible to get around this to view the HTML automatically?
Thanks for your responses.
Do you need the URL?
Here is the code:
URL hef = new URL("the website");
BufferedReader kj = null;
// Note: each openConnection() call opens a separate connection to the site
int kjkj = ((HttpURLConnection) hef.openConnection()).getResponseCode();
System.out.println(kjkj);
String j = ((HttpURLConnection) hef.openConnection()).getResponseMessage();
System.out.println(j);
URLConnection g = hef.openConnection();
g.connect();
try {
    kj = new BufferedReader(new InputStreamReader(g.getInputStream()));
    String y;
    while ((y = kj.readLine()) != null) {
        System.out.println(y);
    }
} finally {
    if (kj != null) {
        kj.close();
    }
}
Suggestions:
Verify that when manually accessing the site (with a web browser client) you are actually getting a 200 return code.
Make sure that the HTTP request issued from the automated (Java-based) logic is similar/identical to what is sent by an interactive web browser client. In particular, make sure the User-Agent is identical (some sites purposely alter their responses depending on the agent).
You can use an HTTP debugging proxy, maybe something like Fiddler2, to see exactly what is being sent to and received from the server.
I'm not sure whether the java.net package is robot-aware, but that could be a factor as well (can you check whether the underlying site has a robots.txt file?).
Edit:
Assuming you are using the java.net package's HttpURLConnection class, the "robot" hypothesis doesn't apply.
On the other hand, you'll probably want to use the connection's setRequestProperty() method to prepare the desired HTTP headers for the request (so they match those from the web browser client).
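For example, something along these lines (the URL is a placeholder and the header values are copied from a typical browser):

HttpURLConnection connection = (HttpURLConnection) new URL("http://the.website/").openConnection();
// Mimic an interactive browser as closely as possible
connection.setRequestProperty("User-Agent",
        "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30");
connection.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
connection.setRequestProperty("Accept-Language", "en-US,en;q=0.8");
System.out.println(connection.getResponseCode()); // hopefully 200 instead of 204 now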
Maybe you can post the relevant portions of your code.
I'm getting a "too many redirects" error from URLConnection when trying to fetch www.palringo.com:
URL url = new URL("http://www.palringo.com/");
HttpURLConnection.setFollowRedirects(true);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
System.out.println("Response code = " + connection.getResponseCode());
outputs the dreaded:
Exception in thread "main" java.net.ProtocolException: Server redirected too many times (20)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
According to wget there is just one redirect, from www.palringo.com to www.palringo.com/en/gb/
Any ideas why my request using URLConnection for /en/gb results in another 302 response for the same resource?
The problem is exemplified by:
URL url = new URL("http://www.palringo.com/en/gb/");
HttpURLConnection.setFollowRedirects(false);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
// Just for testing, use Chrome header, to eliminate "anti-crawler" response!
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30");
System.out.println("Response code = " + connection.getResponseCode());
This outputs:
Response code = 302
Redirected to /en/gb/
hence an infinite redirect loop.
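(The "Redirected to" line is printed by one extra statement appended to the snippet above; since redirects are not being followed, the target is exposed as a plain response header:)

// Where does the 302 point? Available because setFollowRedirects(false) was used.
System.out.println("Redirected to " + connection.getHeaderField("Location"));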
Interestingly, although browsers and wget handle it, curl does not:
joel#bohr:/tmp$ curl http://www.palringo.com/en/gb/
curl: (7) couldn't connect to host
A request for /en/gb/ is redirected to /en/gb/ precisely once.
The problem is that your HttpURLConnection (or whatever code you use -- sorry, I'm not familiar with Java) does not use cookies.
Disable cookies in your browser and you will observe exactly the same behaviour -- an infinite redirect.
The reason: the server checks whether a cookie is set. If it is not set, it sets the cookie and redirects. Because cookies are not supported (or are disabled), the server-side script redirects over and over again.
Solution: enable/add cookie support to your code and try again.
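With plain HttpURLConnection, a minimal sketch of that (installing the JDK's default in-memory cookie store before making the request) would be:

import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;

// Install a default cookie store so the cookie set by the first 302 is sent back.
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

URL url = new URL("http://www.palringo.com/en/gb/");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
System.out.println("Response code = " + connection.getResponseCode()); // should settle on 200 if cookies were the issue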
I think the redirect is defined with a pattern like /* -> /en/gb.
So, when you arrive at /en/gb, the redirect rule fires again.
Check your redirect rules. Where are they defined? In the Apache web server or somewhere else? Check them all, verify whether this is (or is not) the case, and fix the rules accordingly.
The problem is on the server side. It might be a broken Apache httpd rewrite rule that is sending redirects that loop back to the same place. It might be something else. Whatever it is, you are unlikely to be able to fix it on the client side.
I'm basically running a crawler and just noticed this issue.
Ah.
It is possible that it is an anti-crawler defence measure. "Hmmm ... looks like one of those pesky crawlers who ignore my robots.txt file, waste all of my bandwidth and steal my precious content. Let's cause him some pain with a redirect loop!!".
Check that your crawler is obeying the "robots.txt" protocol. Check the ToS for the site you are crawling to see if what you are doing is allowed.
You could be right, but if so how come wget and browsers handle this with just the one redirect?
Maybe because the server is looking at the request headers, or at your pattern of requests.
The Terms of Service (that I see) say this:
"You agree to not use the Service to: ... xiii - Run any automated systems, processes, scripts or bots for any purpose without the express written permission of Palringo."
Arguably, crawling their site is in violation of that.
You will also get this error if you're trying to connect to a service that requires authentication and you provide the wrong username and password.
Been working on this all day and have gotten nowhere with it.
My Java code looks like this:
final URL url = new URL(String.format("https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=%s&exportFormat=tsv&gid=0", spreadsheetId));
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty("Authorization", "GoogleLogin auth=" + wiseAuth.getAuthToken());
conn.setRequestProperty("GData-Version", "3.0");
conn.setRequestMethod("GET");
conn.setDoOutput(true); // trouble here, see below
conn.setInstanceFollowRedirects(true);
conn.connect();
I always get a FileNotFoundException when attempting to call conn.getInputStream(). I narrowed it down to the response code being 405 Method Not Allowed. The exception message shows my URL, and I can access the page just fine in my browser.
It was then that I discovered that setDoOutput(true) causes the request to be executed as a POST internally. But if I remove that line, conn.getInputStream() is null, and conn.getOutputStream() appears to return nothing -- though maybe I am setting it up wrong?
I don't recommend doing it like this; even if you get it working now, you cannot be sure it will keep working in the future if Google starts changing things.
Instead, consider using the Google Spreadsheets API. The provided Java examples are pretty straightforward, and you should be able to accomplish what you want.
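A rough sketch with the gdata-java client -- treat the class and method names here as assumptions and check them against the client library you actually pull in:

import java.net.URL;
import com.google.gdata.client.spreadsheet.SpreadsheetService;
import com.google.gdata.data.spreadsheet.SpreadsheetEntry;
import com.google.gdata.data.spreadsheet.SpreadsheetFeed;

// List the spreadsheets visible to the account instead of scraping the export URL.
SpreadsheetService service = new SpreadsheetService("mySpreadsheetApp");
service.setUserCredentials("user@example.com", "password");

URL feedUrl = new URL("https://spreadsheets.google.com/feeds/spreadsheets/private/full");
SpreadsheetFeed feed = service.getFeed(feedUrl, SpreadsheetFeed.class);
for (SpreadsheetEntry entry : feed.getEntries()) {
    System.out.println(entry.getTitle().getPlainText());
}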
I would recommend using a web debugger like Fiddler to see what exactly your application is sending in the GET request and compare it to your browser. You might be missing an important header or something, and Fiddler makes it really easy to slowly strip down your browser's request to the essential elements (just drag a request to clone it, then take out headers).
I have HTTP-based queries in my code, and one specific kind seems to give rise to IOExceptions upon receiving a 505 response from the server. I have looked up the 505 response along with other people who seemed to have similar problems. Apparently 505 stands for HTTP version mismatch, but when I copy the same query URL into any browser (I tried Firefox, SeaMonkey and Opera) there seems to be no problem. One of the posts I read suggested that the browsers might automatically handle the version mismatch problem.
I have tried to dig deeper by using the nice developer tool that comes with Opera, and it looks like there is no mismatch in versions (I believe Java uses HTTP 1.1) and a nice 200 OK response is received. Why do I experience problems when the same query goes through my Java code?
private InputStream openURL(String urlName) throws IOException {
    URL url = new URL(urlName);
    URLConnection urlConnection = url.openConnection();
    return urlConnection.getInputStream();
}
sample link: http://www.uniprot.org/uniprot/?query=mnemonic%3aNUGM_HUMAN&format=tab&columns=id,entry%20name,reviewed,organism,length
There have been some issues in Tomcat with URLs containing spaces. To fix the problem, you need to encode your URL with URLEncoder.
Example (notice the space):
String url="http://example.org/test test2/index.html";
String encodedURL=java.net.URLEncoder.encode(url,"UTF-8");
System.out.println(encodedURL); //outputs http%3A%2F%2Fexample.org%2Ftest+test2%2Findex.html
As a developer at www.uniprot.org I have the advantage of being able to look in the request logs. According to the logs, we have not sent a 505 response code in the last year. In any case, our servers do understand HTTP 1.0 requests as well as the default HTTP 1.1 (though you might not get the results that you expect).
That makes me suspect there was either some kind of data corruption on the way, or you were affected by a hardware failure (lately we have had some trouble with a switch and a whole datacentre ;). In any case, if you ever have questions or problems with uniprot.org, please contact help#uniprot.org and we can see if we can help/fix the problem.
Your code snippet seems normal and should work.
Regards,
Jerven Bolleman
Are you behind a proxy? This code works for me and prints out the same text I see through a browser.
final URL url = new URL("http://www.uniprot.org/uniprot/?query=mnemonic%3aNUGM_HUMAN&format=tab&columns=id,entry%20name,reviewed,organism,length");
final URLConnection conn = url.openConnection();
final InputStream is = conn.getInputStream();
System.out.println(IOUtils.toString(is));
conn is an instance of HttpURLConnection
From the API documentation for the URL class:
The URL class does not itself encode or decode any URL components
[...]. It is the responsibility of the caller to encode any fields,
which need to be escaped prior to calling URL, and also to decode any
escaped fields, that are returned from URL.
So if you have any spaces in your URL string, encode them before calling new URL(urlStr).
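For example, a sketch that lets java.net.URI do the escaping (the query string mirrors the uniprot example above):

import java.io.InputStream;
import java.net.URI;
import java.net.URL;

// URI's multi-argument constructor escapes illegal characters (such as spaces) per component.
URI uri = new URI("http", "www.uniprot.org", "/uniprot/",
        "query=mnemonic:NUGM_HUMAN&format=tab&columns=id,entry name,reviewed,organism,length",
        null);
URL url = uri.toURL();
InputStream in = url.openConnection().getInputStream();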
@posdef I was having the same HTTP error code 505 problem. When I pasted the URL that I was using in my Java code into Firefox or Chrome, it worked, but going through the code gave an IOException. At last I realized that the URL string contained brackets '(' and ')'; after removing them it worked, so it seems I needed URL encoding, just like the browsers do.