I am trying to parse a document with jsoup (Java). This is my Java code:
package test;

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class crawler {
    private static final int TIMEOUT_IN_MS = 5000;

    public static void main(String[] args) throws MalformedURLException, IOException {
        Document doc = Jsoup.parse(new URL("http://www.internet.com/"), TIMEOUT_IN_MS);
        System.out.println(doc.html());
    }
}
OK, this works. But when I want to parse an HTTPS site, I get this error message:
Document doc = Jsoup.parse(new URL("https://www.somesite.com/"), TIMEOUT_IN_MS);
System.out.println(doc.html());
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://www.somesite.com/
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:590)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:540)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:227)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:216)
at org.jsoup.Jsoup.parse(Jsoup.java:183)
at test.crawler.main(crawler.java:14)
I only get this error message when I try to parse HTTPS; HTTP is working.
Jsoup supports HTTPS fine; it just uses Java's URLConnection under the hood.
A 403 server response indicates that the server has 'forbidden' the request, normally due to authorization issues. If you're getting an HTTP response status code at all, the TLS (HTTPS) negotiation has worked.
The issue here is probably not related to HTTPS; it's just that the URL you're having trouble fetching happens to be HTTPS. You need to understand why the server is giving you a 403. My guess is that either you need to send some authorization tokens (cookies or URL parameters), or it is blocking the request because of the user agent, which defaults to "Java" unless you specify it. Lots of services block requests that way. Try setting the user agent to a common browser string, using the Jsoup.connect methods.
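For example, a minimal sketch using the imports from your code (untested; the user-agent string is an arbitrary browser example and the URL is the placeholder from the question):

// Fetch with a browser-like User-Agent instead of the default "Java".
Document doc = Jsoup.connect("https://www.somesite.com/")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
        .timeout(5000) // same 5 s timeout as the original code
        .get();
System.out.println(doc.html());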
(People won't be able to help you more without real example URLs, because we can't tell what the server is doing from this info alone.)
You would need to provide authentication when hitting the URL. Also try the solution in "403 Forbidden with Java but not web browser?" if the request works in a browser but not from Java code.
You could also just ignore SSL certificate validation if required (note that validateTLSCertificates was deprecated and removed in later jsoup releases):
Jsoup.connect("https://example.com").validateTLSCertificates(false).get()
Related
I tried to parse web pages ending with the .tv and .mobi extensions, but every time I try I end up with the same error. Jsoup can easily parse websites ending with .com, .org, .in, etc., but not .tv or .mobi.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.jsoup.nodes.Element;

public class sample {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.xmovies8.tv").get();
        String title = doc.title();
        System.out.println(title);
    }
}
Stack trace:
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://www.xmovies8.tv
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:598)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:548)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:235)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:224)
at eric.sample.main(sample.java:30)
/home/azeem/.cache/netbeans/8.1/executor-snippets/run.xml:53: Java returned: 1
BUILD FAILED (total time: 3 seconds)
It also failed to parse:
http://www.xmovies8.tv
www.fztvseries.mobi
Is there any solution in Jsoup so that I can connect to websites ending with .mobi, .tv, .xyz, etc.?
Your problem has nothing to do with the TLD of the domain you're attempting to scrape; in fact, it has nothing to do with the name at all, or even with Jsoup.
If you read your stack trace, you will see you're getting a response code of HTTP 403 Forbidden, which according to the HTTP specification means your request was seen by the web server and deliberately refused.
Now, this could be for a number of reasons that all depend on the website you're trying to scrape.
It could be that the website sees you are trying to scrape it and has explicitly gone out of its way to prevent scraping.
It could also be that the page requires a permission you don't have, or that you need to be logged in.
I also noticed that particular domain uses CloudFlare, so it could be that CloudFlare is intercepting your request before it's even reaching the website itself.
I would make sure it's not against the website's policy to scrape it, and if it isn't, try changing the User-Agent header of your scraper from the default Java one to a normal browser user agent and see if that works; a sketch follows.
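A minimal sketch of that idea (untested; the user-agent and referrer values are arbitrary browser-like examples, not values the site is known to require):

// Retry the fetch while looking like an ordinary browser visit.
Document doc = Jsoup.connect("http://www.xmovies8.tv")
        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36")
        .referrer("http://www.google.com")
        .get();
System.out.println(doc.title());

If CloudFlare is running browser integrity checks in front of the site, this may still not be enough.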
I get a 404 status error (page not found). I only want to send a request from my Android app to a Mean.io web app through
the following URL:
http://192.168.0.103:3000/auth/register
I have also tried:
http://10.0.2.2:3000/auth/register
I had already googled, but neither of the solutions above worked for me. However, the URL http://192.168.0.103:3000/auth/register does work
in the Chrome browser on my PC.
Here is the code:
public class AppConfig {
    // Server user register url
    //public static String URL_REGISTER = "http://10.0.2.2:3000/auth/register";
    public static String URL_REGISTER = "http://192.168.0.103:3000/auth/register";
}
In case you want to know where the variable URL_REGISTER gets used: it is used in the registerUser() method.
I'm posting the method through a link because it is too big to post here. In the link below you can see that URL_REGISTER gets used on line 10.
Link: http://pastebin.com/ttH6upnb
1. Be sure you connect to the right server: 192.168.x.x and 10.x.x.x are local (private) addresses that are not routed to the internet. Beware: if you get a 404, perhaps another server, such as a proxy, is responding to you.
2. Read this: Using java.net.URLConnection to fire and handle HTTP requests.
3. Begin by getting the page "/" and checking the headers (right server, etc.); see the sketch after this list.
4. Then verify your code, step by step.
5. Check whether it should be GET or POST; also, authentication is not easy (check the headers).
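A minimal sketch of step 3, assuming the host and port from the question; it only dumps the status and headers so you can see which server is actually answering:

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class ServerCheck {
    public static void main(String[] args) throws Exception {
        // Fetch "/" and print status plus headers to see who answers.
        URL root = new URL("http://192.168.0.103:3000/");
        HttpURLConnection conn = (HttpURLConnection) root.openConnection();
        System.out.println("Status: " + conn.getResponseCode());
        for (Map.Entry<String, List<String>> h : conn.getHeaderFields().entrySet()) {
            System.out.println(h.getKey() + ": " + h.getValue());
        }
        conn.disconnect();
    }
}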
I want to get an HTML page from a meta refresh redirect, very similar to the question "can jsoup handle meta refresh redirect".
But I can't get it to work. I want to do a search on http://synchronkartei.de.
I have the following code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SynchronkarteiScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.synchronkartei.de/search.php")
                .data("cat", "2")
                .data("search", "Thomas Danneberg")
                .data("action", "search")
                .followRedirects(true)
                .get();

        Elements meta = doc.select("html head meta");
        for (final Element m : meta) {
            if (m.attr("http-equiv").contains("refresh")) {
                doc = Jsoup.connect(m.baseUri() + m.attr("content").split("=")[1]).get();
            }
        }
        System.out.println(doc.body().toString());
    }
}
This does the search, which leads to a temporary page that refreshes to the real result page.
It is the same as going to http://synchronkartei.de, selecting "Sprecher" from the dropdown box, entering "Thomas Danneberg" into the text field, and hitting enter.
But even after extracting the refresh URL and doing a second connect, I still get the content of the temporary landing page, as can be seen in the println of the body.
So what is going wrong here?
As a note, the site synchronkartei.de always redirects to HTTPS. And since it uses a certificate from StartCom, Java complains about the certificate path. To make the above code snippet work, it is necessary to use the VM parameter -Djavax.net.ssl.trustStore=<path-to-keystore> with the correct certificate.
I have to admit that I am no expert in Jsoup, but I do know some details about the Synchronkartei.
Deutsche Synchronkartei supports OpenSearchDescriptions, which is linked at /search.xml. That said, you could also use https://www.synchronkartei.de/search.php?search={searchTerms} to get your search term into the session.
All you need is a cookie "sid" with the session ID that the Synchronkartei provides you. After that, a direct request to https://www.synchronkartei.de/index.php?action=search will give you the results, regardless of your referrer.
What I mean is: first send a request to https://www.synchronkartei.de/search.php?search={searchTerms} or https://www.synchronkartei.de/search.php?cat={Category}&search={searchTerms}&action=search (as you did above) and, provided it comes back with an HTTP status of 200, ignore the result completely but save the session cookie. After that, place a request to https://www.synchronkartei.de/index.php?action=search, which should give you the whole list of results.
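A minimal sketch of that two-step flow in jsoup (untested; it assumes the sid session cookie is set on the first response, as described above):

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SynchronkarteiSession {
    public static void main(String[] args) throws Exception {
        // Step 1: run the search only to obtain the session cookies.
        Connection.Response search = Jsoup.connect("https://www.synchronkartei.de/search.php")
                .data("cat", "2")
                .data("search", "Thomas Danneberg")
                .data("action", "search")
                .method(Connection.Method.GET)
                .execute();

        // Step 2: fetch the result list, sending the saved cookies back.
        Document results = Jsoup.connect("https://www.synchronkartei.de/index.php?action=search")
                .cookies(search.cookies())
                .get();
        System.out.println(results.body().text());
    }
}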
Funzi
I use an HttpURLConnection to connect to a website and receive a response code of 404 (HTTP_NOT_FOUND). However, I have no problem opening the website in my browser (IE).
Why the difference, and what can I do about it?
Regards, Pavan
This is my program:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

public class TestGet {
    private static URL source;

    public static void main(String[] args) {
        doGet();
    }

    public static void doGet() {
        try {
            source = new URL("http://localhost:8080/");
            System.out.println("Url is" + source.toString());
            URLConnection connection = source.openConnection();
            connection.setRequestProperty("User-Agent", "Mozilla/5.0 ( compatible ) ");
            connection.setRequestProperty("Accept", "*/*");
            connection.setDoInput(true);
            connection.setDoOutput(true);
            System.out.println(((HttpURLConnection) connection).getResponseCode());
            BufferedReader rdr = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            StringBuffer b = new StringBuffer();
            String line = null;
            while (true) {
                line = rdr.readLine();
                if (line == null)
                    break;
                b.append(line);
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.err.println(e.toString());
        }
    }
}
Stack Trace
Url ishttp://localhost:8080/
404
java.io.FileNotFoundException: http://localhost:8080/
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection$6.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at TestGet.doGet(TestGet.java:28)
at TestGet.main(TestGet.java:11)
Caused by: java.io.FileNotFoundException: http://localhost:8080/
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.HttpURLConnection.getResponseCode(Unknown Source)
at TestGet.doGet(TestGet.java:26)
... 1 more
java.io.FileNotFoundException: http://localhost:8080/
You are getting a 404 error, which means the response for the request was not found. First you need to make sure that there is a server serving at http://localhost:8080/ and that it returns some content with code 200. If not, there is nothing we can do to help you.
The easiest way to test whether there is anything at the URL is to paste it into the web browser's address bar and hit go. However, this does not guarantee that the Java code will be able to access it; for example, the server might be designed to respond with 404 if it cannot find the web browser User-Agent header.
Since the server returns a status code, either 200 or 404, this is not a firewall problem.
According to your latest edit of the question, you can view the page in the web browser but cannot download it with your Java code, and the headers seem to be set correctly. There are only two problems I can see:
You should not call connection.setDoOutput(true). This forces the connection to do an HTTP POST instead of a GET, and the server may not support POST.
Your server may be returning 404 even when it should return 200. Since the web browser doesn't care about the error status and tries to render the content anyway, it seems to work from the web browser. If so, you should fix the server to respond correctly first; otherwise, try reading the error stream via HttpURLConnection#getErrorStream() instead.
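A sketch of that second suggestion against the code from the question (source is the URL field declared there; getErrorStream() returns null when there is no error body, and java.io.InputStream is needed in addition to the imports already present):

HttpURLConnection conn = (HttpURLConnection) source.openConnection();
int status = conn.getResponseCode();
// On an error status, read the error body instead of letting
// getInputStream() throw FileNotFoundException.
InputStream body = (status >= 400) ? conn.getErrorStream() : conn.getInputStream();
if (body != null) {
    BufferedReader rdr = new BufferedReader(new InputStreamReader(body));
    for (String line; (line = rdr.readLine()) != null; ) {
        System.out.println(line);
    }
}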
I had a similar issue. For me it helped to inspect the packets using RawCap. RawCap is one of the few Windows packet sniffers that lets you sniff localhost.
In my case the server was returning a 404 due to an authentication issue.
If the URL http://localhost:8080/ can be accessed fine in the web browser, the code should work. I ran the program on my machine and it works, so you must check whether the web server service is OK.
I know this is very late in the game, but I was just recently having the same issue and none of the solutions here worked for me. In my case, I actually had another process running on the same port that was stealing the requests from the Java app. Using yair's answer, you can check for a process running on the same port like this: in the command prompt, do netstat -nao | find "8080" on Windows or netstat -nap | grep 8080 on Linux. It should show a line with LISTENING and 127.0.0.1:8080, followed by the process ID. Just terminate that process and you should be good to go.
I had the problem too. In my case I had an invisible Unicode character in the URL string, so the connection couldn't open it (the FileNotFoundException indicates that). I removed it and it worked.
I had a similar scenario where the web service processed POST requests from the browser (in my case Postman, an API-testing Chrome extension) correctly, but HttpURLConnection kept failing with a 404 for large payloads. I mistakenly assumed that the problem had to be in my HttpURLConnection client code.
When I later tried to replicate the request from cURL with a large payload, I got the same 404 error. Even though I used the cURL command generated by Postman, which should therefore be identical to Postman's request, there was a difference in how the web service reacted to the two requests. Some client middleware in Postman may have intercepted and modified the requests.
TL;DR
Check the web service. It may be the culprit. Try another non-browser, barebones HTTP client like cURL to see how the web service reacts to it.
I have a page like localhost:7001/MyServlet, and I am making an HTTP connection request to it like below:
String url = "http://localhost:7001/MyServlet";
PostMethod method = new PostMethod(url);
HttpClient client = new HttpClient();
However, MyServlet is protected by j_security_check, so when I make my connection, I get redirected to the login page.
How can I get authenticated and access my URL in one HTTP connection?
Note: I use Apache Commons HttpClient:
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;
I doubt you can log in and call the server in a single request unless HTTP BASIC authentication is enabled. While I do not know the details of HttpClient's API yet, basically you will need to track a session using cookies: POST your login to /j_security_check, then access the servlet. (The same basic process works for /j_acegi_security_check if you are using Acegi Security.)
A nasty wrinkle in Tomcat is that just posting right away to /j_security_check gives a 400 "bad request"; its authenticator is rather finicky about state transitions and was clearly not designed with programmatic clients in mind. You need to first access /loginEntry (you can throw away the response other than the session cookies); then post your login information to /j_security_check; then follow the resulting redirect (back to /loginEntry, I think), which will actually store your new login information; and finally post to the desired servlet. NetBeans #5c3cb7fb60fe shows this in action, logging in to a Hudson server that uses Tomcat's container authentication.
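A sketch of that dance with Commons HttpClient 3.x (untested; the /loginEntry path is the Hudson-specific quirk described above, and the credentials are placeholders). The same HttpClient instance carries the session cookies across all four requests:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.methods.PostMethod;

public class ContainerLogin {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient(); // keeps session cookies across requests

        // 1. Prime the session before touching j_security_check.
        GetMethod entry = new GetMethod("http://localhost:7001/loginEntry");
        client.executeMethod(entry);
        entry.releaseConnection();

        // 2. Post the credentials to the container's login action.
        PostMethod login = new PostMethod("http://localhost:7001/j_security_check");
        login.addParameter("j_username", "user");     // placeholder
        login.addParameter("j_password", "password"); // placeholder
        client.executeMethod(login);
        String redirect = login.getResponseHeader("Location") == null
                ? null : login.getResponseHeader("Location").getValue();
        login.releaseConnection();

        // 3. Follow the redirect so the container stores the login.
        if (redirect != null) {
            GetMethod follow = new GetMethod(redirect);
            client.executeMethod(follow);
            follow.releaseConnection();
        }

        // 4. Now the protected servlet can be reached.
        PostMethod servlet = new PostMethod("http://localhost:7001/MyServlet");
        client.executeMethod(servlet);
        System.out.println(servlet.getStatusCode());
        servlet.releaseConnection();
    }
}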