Can't connect to URLs ending with .tv using Jsoup - Java

I tried to parse web pages ending with the .tv and .mobi extensions, but every time I end up with the same error. Jsoup can easily parse websites ending with .com, .org, .in etc., but not .tv or .mobi.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.jsoup.nodes.Element;

public class sample {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.xmovies8.tv").get();
        String title = doc.title();
        System.out.println(title);
    }
}
Stack trace:
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://www.xmovies8.tv
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:598)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:548)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:235)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:224)
at eric.sample.main(sample.java:30)
/home/azeem/.cache/netbeans/8.1/executor-snippets/run.xml:53: Java returned: 1
BUILD FAILED (total time: 3 seconds)
It also failed to parse:
http://www.xmovies8.tv
www.fztvseries.mobi
Is there any solution in Jsoup so that I can connect to websites ending with .mobi, .tv, .xyz, etc.?

Your problem doesn't have anything to do with the TLD of the domain you're attempting to scrape; in fact, it has nothing to do with the name at all, or even with Jsoup.
If you read your stack trace, you will see you're getting a response code of HTTP 403 Forbidden, which, according to the HTTP specification, means your request was seen by the web server and deliberately refused.
Now, this could be for a number of reasons that all depend on the website you're trying to scrape.
It could be that the website detects you are scraping and has explicitly gone out of its way to prevent being scraped.
It could also be that that page requires a permission you don't have, or you need to be logged in.
I also noticed that particular domain uses CloudFlare, so it could be that CloudFlare is intercepting your request before it even reaches the website itself.
I would make sure it's not against the website's policy to scrape it, and if it isn't, try changing the User-Agent header of your scraper to a normal browser user agent instead of the default "Java", and see if that works.
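For example, a minimal sketch of setting a browser-like User-Agent with Jsoup; the exact browser string (and the referrer) is just an illustration, not something this particular site is known to require:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class UserAgentExample {
    public static void main(String[] args) throws Exception {
        // Pretend to be a regular desktop browser instead of the default "Java" agent
        Document doc = Jsoup.connect("http://www.xmovies8.tv")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0 Safari/537.36")
                .referrer("http://www.google.com")
                .get();
        System.out.println(doc.title());
    }
}

If CloudFlare or the site itself is actively blocking scrapers, this may still return a 403, but it rules out the default Java user agent as the cause.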

Related

Parse https with jsoup (java)

I'm trying to parse a document with Jsoup (Java). This is my Java code:
package test;

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class crawler {
    private static final int TIMEOUT_IN_MS = 5000;

    public static void main(String[] args) throws MalformedURLException, IOException {
        Document doc = Jsoup.parse(new URL("http://www.internet.com/"), TIMEOUT_IN_MS);
        System.out.println(doc.html());
    }
}
OK, this works. But when I want to parse an https site, I get this error message:
Document doc = Jsoup.parse(new URL("https://www.somesite.com/"), TIMEOUT_IN_MS);
System.out.println(doc.html());
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://www.somesite.com/
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:590)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:540)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:227)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:216)
at org.jsoup.Jsoup.parse(Jsoup.java:183)
at test.crawler.main(crawler.java:14)
I only get this error message when I try to parse https; http is working.
Jsoup supports https fine - it just uses Java's URLConnection under the hood.
A 403 server response indicates that the server has 'forbidden' the request, normally due to authorization issues. If you're getting an HTTP response status code, the TLS (https) negotiation has worked.
The issue here is probably not related to HTTPS; it's just that the URL you're having trouble fetching happens to be HTTPS. You need to understand why the server is giving you a 403 - my guess is either you need to send some authorization tokens (cookies or URL params), or it is blocking the request because of the user agent (which defaults to "Java" unless you specify it). Lots of services block requests that way. Look to set the user agent to a common browser string; use the Jsoup.connect methods to do that.
(People won't be able to help you more without real example URLs, because we can't tell what the server is doing just with this info.)
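For example, a rough sketch with Jsoup, assuming the server wants a browser-like user agent plus some kind of authorization token; the cookie name, parameter name, and their values here are purely illustrative:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AuthorizedFetch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.somesite.com/")
                .userAgent("Mozilla/5.0")              // browser-like agent instead of the default "Java"
                .cookie("session", "your-session-id")  // hypothetical auth cookie, if the site hands one out
                .data("token", "your-token")           // hypothetical URL/form parameter
                .get();
        System.out.println(doc.title());
    }
}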
You would need to provide authentication when hitting the URL. Also try the solution in 403 Forbidden with Java but not web browser? if the request works in a browser but not from Java code.
You could also just ignore the SSL certificate, if that's required:
Jsoup.connect("https://example.com").validateTLSCertificates(false).get()

Can't send request from app to web app

I have a 404 status error (page not found). I only want to send a request from my Android app to a Mean.io web app through
the following url:
http://192.168.0.103:3000/auth/register
I have also tried:
http://10.0.2.2:3000/auth/register
I have already googled, but neither of the solutions above worked for me. However, the url http://192.168.0.103:3000/auth/register does work in the Chrome browser on my PC.
Here is the code:
public class AppConfig {
    // Server user register url
    //public static String URL_REGISTER = "http://10.0.2.2:3000/auth/register";
    public static String URL_REGISTER = "http://192.168.0.103:3000/auth/register";
}
In case you want to know where the variable URL_REGISTER gets used: it's used in the registerUser() method.
I'm posting the method through a link, because it is too big to post here. In the link below you can see that URL_REGISTER gets used on line 10.
Link: http://pastebin.com/ttH6upnb
1. Be sure you connect to the server: 192.168.x.x and 10.x.x.x are local addresses (they do not go out to the internet). Beware: if you get a 404, perhaps another server, such as a proxy, is responding to you.
2. Read this: Using java.net.URLConnection to fire and handle HTTP requests
3. Begin by getting the page "/" and checking the headers (right server, etc.); see the sketch after this list.
4. Then verify your code, step by step.
5. Check whether the request should be GET or POST; authentication is not easy (check the headers).
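A rough sketch of that first sanity check (step 3), assuming the Mean.io server from the question is listening on 192.168.0.103:3000; it just fetches "/" and dumps the status line and headers:

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class ServerCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://192.168.0.103:3000/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // The status line tells you whether you reached the server you expect
        System.out.println("Status: " + conn.getResponseCode() + " " + conn.getResponseMessage());

        // The headers reveal proxies, redirects, server type, etc.
        for (Map.Entry<String, List<String>> header : conn.getHeaderFields().entrySet()) {
            System.out.println(header.getKey() + ": " + header.getValue());
        }
        conn.disconnect();
    }
}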

Jsoup meta refresh redirect

I want to get an HTML page from a meta refresh redirect, very similar to the question can jsoup handle meta refresh redirect.
But I can't get it to work. I want to do a search on http://synchronkartei.de.
I have the following code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SynchronkarteiScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.synchronkartei.de/search.php")
                .data("cat", "2")
                .data("search", "Thomas Danneberg")
                .data("action", "search")
                .followRedirects(true)
                .get();

        Elements meta = doc.select("html head meta");
        for (final Element m : meta) {
            if (m.attr("http-equiv").contains("refresh")) {
                doc = Jsoup.connect(m.baseUri() + m.attr("content").split("=")[1]).get();
            }
        }
        System.out.println(doc.body().toString());
    }
}
This does the search, which leads to a temporary page that, once refreshed, opens the real result page.
It is the same as going to http://synchronkartei.de, selecting "Sprecher" from the dropdown box, entering "Thomas Danneberg" into the text field and hitting enter.
But even after extracting the refresh URL and doing a second connect, I still get the content of the temporary landing page, which can be seen in the println of the body.
So what is going wrong here?
As a note, the site synchronkartei.de always redirects to HTTPS, and since it uses a certificate from StartCom, Java complains about the certificate path. To make the above code snippet work, it is necessary to use the VM parameter -Djavax.net.ssl.trustStore=<path-to-keystore> with the correct certificate.
I have to admit that I am no expert in Jsoup, but I do know some details about the Synchronkartei.
Deutsche Synchronkartei supports OpenSearchDescriptions, which is linked at /search.xml. That said, you could also use https://www.synchronkartei.de/search.php?search={searchTerms} to get your search term into the session.
All you need is a cookie "sid" with the session ID the Synchronkartei provides you. After that, a direct request to https://www.synchronkartei.de/index.php?action=search will give you the results, regardless of your referrer.
What I mean is: first send a request to https://www.synchronkartei.de/search.php?search={searchTerms} or https://www.synchronkartei.de/search.php?cat={Category}&search={searchTerms}&action=search (as you did above), and if it comes back with an HTTP result of 200, ignore the content completely but save the session cookie. After that, place a request to https://www.synchronkartei.de/index.php?action=search, which should then give you the whole list of results.
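A rough sketch of that two-step flow with Jsoup; it assumes the "sid" session cookie is all the session needs, and is untested against the live site:

import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SynchronkarteiSearch {
    public static void main(String[] args) throws Exception {
        // Step 1: fire the search to establish the session and keep the cookies (e.g. "sid")
        Connection.Response search = Jsoup.connect("https://www.synchronkartei.de/search.php")
                .data("cat", "2")
                .data("search", "Thomas Danneberg")
                .data("action", "search")
                .method(Connection.Method.GET)
                .execute();
        Map<String, String> cookies = search.cookies();

        // Step 2: request the result list within that session
        Document results = Jsoup.connect("https://www.synchronkartei.de/index.php?action=search")
                .cookies(cookies)
                .get();
        System.out.println(results.body().text());
    }
}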
Funzi

How to parse a custom XML-style error code response from a website

I'm developing a program that queries and prints out open data from the local transit authority, which is returned in the form of an XML response.
Normally, when there are buses scheduled to run in the next few hours (and in other typical situations), the XML response generated by the page is handled correctly by the java.net.URLConnection.getInputStream() function, and I am able to print the individual results afterwards.
The problem arises when the buses are NOT running, or when some other problem with my queries develops after they are sent to the transit authority's web server. When the authority developed their service, they came up with their own unique error response codes, which are also sent as XML. For example, one of these error messages might look like this:
<Error xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Code>3005</Code>
<Message>Sorry, no stop estimates found for given values.</Message>
</Error>
(This code and similar is all that I receive from the transit authority in such situations.)
However, it appears that URLConnection.getInputStream() and some of its siblings are unable to interpret this custom code as a "valid" response that I can handle and print out as an error message. Instead, they give me a more generic HTTP/1.1 404 Not Found error. This problem cascades into my program which then prints out a java.io.FileNotFoundException error pointing to the offending input stream.
My question is therefore two-fold:
1. Is there a way to retrieve, parse, and print a custom XML-formatted error code sent by a web service using the plugins that are available in Java?
2. If the above is not possible, what other tools should I use or develop to handle such custom codes as described?
URLConnection isn't up to the job of REST, in my opinion, and if you're using getInputStream, I'm almost certain you're not handling character encoding correctly.
Check out Spring's RestTemplate - it's really easy to use (just as easy as URLConnection), powerful and flexible. You will need to change the ResponseErrorHandler, because the default one will throw an exception on 404, but it looks like you want it to carry on and parse the XML in the response.
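A minimal sketch of that approach, assuming the transit authority returns its XML error document in the body of the 404 response (the endpoint URL here is a placeholder, not the real API):

import java.io.IOException;

import org.springframework.http.client.ClientHttpResponse;
import org.springframework.web.client.ResponseErrorHandler;
import org.springframework.web.client.RestTemplate;

public class TransitClient {
    public static void main(String[] args) {
        RestTemplate restTemplate = new RestTemplate();

        // Swallow 4xx/5xx statuses so the XML error body is returned instead of an exception
        restTemplate.setErrorHandler(new ResponseErrorHandler() {
            @Override
            public boolean hasError(ClientHttpResponse response) throws IOException {
                return false; // treat every response as "not an error"
            }

            @Override
            public void handleError(ClientHttpResponse response) throws IOException {
                // never called, because hasError() always returns false
            }
        });

        // Placeholder URL; substitute the real transit-authority endpoint and parameters
        String xml = restTemplate.getForObject("http://api.example-transit.org/estimates?stop=1234", String.class);
        System.out.println(xml); // either the estimates XML or the <Error> document
    }
}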

Is it possible to access the html of a site with a 204 response code via java.net?

I am trying to read a website using the java.net package classes. The site has content, and I can see it manually in HTML source utilities in the browser. When I get its response code and try to view the site using Java, it connects successfully but interprets the site as one without content (204 code). What is going on, and is it possible to get around this to view the HTML automatically?
Thanks for your responses. Do you need the URL? Here is the code:
URL hef = new URL(the website); // actual URL withheld in the question
BufferedReader kj = null;

HttpURLConnection g = (HttpURLConnection) hef.openConnection();
System.out.println(g.getResponseCode());    // prints 204
System.out.println(g.getResponseMessage());

try {
    kj = new BufferedReader(new InputStreamReader(g.getInputStream()));
    String y;
    while ((y = kj.readLine()) != null) {   // read each line once, stop at end of stream
        System.out.println(y);
    }
} finally {
    if (kj != null) {
        kj.close();
    }
}
Suggestions:
Assert that when manually accessing the site (with a web browser client) you are effectively getting a 200 return code.
Make sure that the HTTP request issued from the automated (Java-based) logic is similar/identical to the one sent by an interactive web browser client. In particular, make sure the User-Agent is identical (some sites purposely alter their responses depending on the agent).
You can use a packet sniffer, or maybe something like Fiddler2, to see exactly what is being sent and received to/from the server.
I'm not sure that the java.net package is robot-aware, but that could be a factor as well (can you check whether the underlying site has a robots.txt file?).
Edit:
Assuming you are using the java.net package's HttpURLConnection class, the "robot" hypothesis doesn't apply.
On the other hand, you'll probably want to use the connection's setRequestProperty() method to prepare the desired HTTP headers for the request (so that they match those from the web browser client).
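For instance, a small sketch of using setRequestProperty() to mimic a browser's request headers (the URL and agent string are placeholders, not from the question):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HeaderedRequest {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL("http://www.example.com/").openConnection();

        // Mimic the headers a browser would send
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        conn.setRequestProperty("Accept", "text/html");

        System.out.println(conn.getResponseCode() + " " + conn.getResponseMessage());
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}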
Maybe you can post the relevant portions of your code.
