I am building a web-scraper using Java and JavaFX. I already have an application running using JavaFX.
I am building a web-scraper following similar procedures as this blog: https://ksah.in/introduction-to-web-scraping-with-java/
However, instead of having a fixed URL, I want to accept any URL as input and scrape it. For this, I need to handle the case where the URL is not found, and display "Page not found" in my application console when that happens.
Here is my code for the part where I get the URL:
void search() {
    List<Course> v = scraper.scrape(textfieldURL.getText(), textfieldTerm.getText(), textfieldSubject.getText());
    ...
}
and then I do:
try {
    HtmlPage page = client.getPage(baseurl + "/" + term + "/subject/" + sub);
    ...
} catch (Exception e) {
    System.out.println(e);
}
in the scraper file.
The API will throw a FailingHttpStatusCodeException if the server returns a failing status code and the property WebClientOptions.setThrowExceptionOnFailingStatusCode(boolean) is set to true (which is the default).
You can also get the WebResponse from the Page and call getStatusCode() to get the HTTP status code.
The tutorial you added contains the following code:
.....
WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);
try {
    String searchUrl = "https://newyork.craigslist.org/search/sss?sort=rel&query=" + URLEncoder.encode(searchQuery, "UTF-8");
    HtmlPage page = client.getPage(searchUrl);
} catch (Exception e) {
    e.printStackTrace();
}
.....
With this code, when client.getPage throws any error (for example on a 404), the exception is caught and its stack trace printed to the console.
As you stated, you want to print "Page not found", which means we have to catch a specific exception and log the message. The library used in the tutorial is net.sourceforge.htmlunit, and as you can see here (http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/WebClient.html#getPage-java.lang.String-), the getPage method throws a FailingHttpStatusCodeException, which carries the status code from the HTTP response (http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/FailingHttpStatusCodeException.html).
This means we have to catch the FailingHttpStatusCodeException and check whether the status code is a 404. If yes, log the message; if not, print the stack trace, for example.
Just for the sake of clean code, try not to catch 'em all (like in Pokémon) as in the tutorial; instead use specific catch blocks for the IOException, FailingHttpStatusCodeException and MalformedURLException declared by the getPage method.
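A minimal sketch of that structure, assuming HtmlUnit is on the classpath and the URL has already been assembled from the text fields as in the question (the class and method names here are illustrative, not from your code):

```java
import java.io.IOException;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Scraper {
    // Map an HTTP status code to the message we want in the console:
    // 404 gets the friendly text, anything else keeps the raw code.
    static String messageFor(int statusCode) {
        return statusCode == 404 ? "Page not found" : "HTTP error " + statusCode;
    }

    void fetch(WebClient client, String url) {
        try {
            HtmlPage page = client.getPage(url);
            // ... scrape the page ...
        } catch (FailingHttpStatusCodeException e) {
            // Thrown for failing status codes while the option is enabled.
            System.out.println(messageFor(e.getStatusCode()));
        } catch (MalformedURLException e) {
            // Must come before IOException, which it extends.
            System.out.println("Invalid URL: " + url);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        System.out.println(messageFor(404));
    }
}
```

Note the catch order: MalformedURLException extends IOException, so it has to be listed first or the compiler rejects it.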
When running the following code:
try {
    Document doc = Jsoup.connect("https://pomofocus.io/").get();
    Elements text = doc.select("div.sc-kEYyzF");
    System.out.println(text.text());
} catch (IOException e) {
    e.printStackTrace();
}
No output occurs. When changing the println to:
System.out.println(text.first().text());
I get a NullPointerException but nothing else.
jsoup doesn't execute JavaScript; it parses the HTML that the server returns. Check View Source (as opposed to Inspect) in your browser to see the raw response from the server, and therefore what is actually selectable.
Hey all, I'm trying to write a web scraper that downloads all the songs from this website:
https://billwurtz.com/songs.html
but some of his older URLs contain spaces ("%20"), such as https://billwurtz.com/can%20i.mp3, which in his newer links he has changed to '-'.
Anyway, when I try to download these tracks I get a 400 error (Bad Request), which makes me think that Java is sending the request with a literal space and not %20.
URL website = new URL("https://www.billwurtz.com/" + song);
Path out = Path.of("songs/" + song);
System.out.println(website.toString());
try (InputStream in = website.openStream()) {
    Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
} catch (Exception e) {
    // missed.add(song + ":" + e.getMessage());
    throw new Exception(e.getMessage());
}
returns
https://www.billwurtz.com/can%20i.mp3
Exception in thread "main" java.lang.Exception: Server returned HTTP response code: 400 for URL: https://billwurtz.com/can i.mp3
I tried looking around but only came across issues caused by the programmer building a URL that contains literal spaces, whereas here the URL is clearly using %20.
Thank you all for your time :)
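One thing worth checking (a sketch, not a confirmed fix for this particular site): java.net.URL sends the path exactly as given, so if a literal space ends up in the path, for instance after following a redirect, the server rejects the request with a 400. The multi-argument java.net.URI constructor percent-encodes illegal characters such as spaces. It also re-encodes '%' itself, so any name that already contains %20 has to be decoded first to avoid double encoding:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class SongUrl {
    // Build a song URL whose path keeps spaces percent-encoded.
    // Decode first so an already-encoded name ("can%20i.mp3") isn't
    // double-encoded to %2520; the URI constructor then re-encodes
    // the raw space as %20. (Caveat: URLDecoder also turns '+' into
    // a space, so this assumes song names contain no literal '+'.)
    static String songUrl(String song) throws URISyntaxException {
        String raw = URLDecoder.decode(song, StandardCharsets.UTF_8);
        return new URI("https", "billwurtz.com", "/" + raw, null).toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(songUrl("can i.mp3"));    // https://billwurtz.com/can%20i.mp3
        System.out.println(songUrl("can%20i.mp3"));  // same result
    }
}
```

The resulting string can be passed to `new URL(...)` (or `URI.toURL()`) and opened as in your snippet.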
I'm writing a small program and I want to fetch an element from a website. I've followed many tutorials to learn how to write this code with jSoup. An example of what I'm trying to print is "Monday, November 19, 2018 - 3:00pm to 7:00pm". I'm running into the error
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://my.cs.ubc.ca/course/cpsc-210
Here is my code:
public class WebPageReader {
    private String url = "https://my.cs.ubc.ca/course/cpsc-210";
    private Document doc;

    public void readPage() {
        try {
            doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0")
                    .referrer("https://www.google.com")
                    .timeout(1000)
                    .followRedirects(true)
                    .get();
            Elements temp = doc.select("span.date-display-single");
            int i = 0;
            for (Element officeHours : temp) {
                i++;
                System.out.println(officeHours);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Thanks for the help.
Status 403 means your access is forbidden.
Please make sure you have access to https://my.cs.ubc.ca/course/cpsc-210
I tried to access https://my.cs.ubc.ca/course/cpsc-210 from a browser and it returns an error page, so I think you need credentials to access it.
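If you want the program to report the status instead of throwing, a sketch using jsoup's ignoreHttpErrors (assuming jsoup is on the classpath; the describe helper is just an illustration):

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class StatusCheck {
    // Human-readable note for the status codes relevant here.
    static String describe(int status) {
        if (status == 403) return "403: access forbidden - the page likely requires login";
        if (status == 200) return "200: OK";
        return "HTTP " + status;
    }

    public static void main(String[] args) throws IOException {
        // ignoreHttpErrors(true) makes execute() return the response,
        // with its status code, instead of throwing HttpStatusException.
        Connection.Response res = Jsoup.connect("https://my.cs.ubc.ca/course/cpsc-210")
                .userAgent("Mozilla/5.0")
                .ignoreHttpErrors(true)
                .execute();
        System.out.println(describe(res.statusCode()));
    }
}
```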
In HtmlUnit, how do I disable the exception thrown when the requested page returns a failing status code (like 4xx)? I need to get the status code, so if an exception is thrown, I can't get it.
Page page = null;
try {
    page = webClient.getPage(requestSettings);
    System.out.println(page.getWebResponse().getStatusCode()); // never reached, the exception is already thrown
} catch (Exception e) {
    System.out.println(page.getWebResponse().getStatusCode()); // fails with a NullPointerException, page is still null
    System.out.println(e);
}
The following method seems to work only on older versions of HtmlUnit. I'm using v2.25 and the method doesn't exist.
webClient.setThrowExceptionOnFailingStatusCode(false);
The setting moved to WebClientOptions in the newer API; you should use:
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
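Putting it together, a sketch (assuming HtmlUnit 2.x on the classpath; the URL is a placeholder): with the option disabled, getPage returns normally even for 4xx responses, so the status code can be read from the WebResponse:

```java
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;

public class StatusFetcher {
    // True for 4xx client errors such as 404.
    static boolean isClientError(int status) {
        return status >= 400 && status < 500;
    }

    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Don't throw on 4xx/5xx; just hand back the page.
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            Page page = webClient.getPage("https://example.com/no-such-page");
            int status = page.getWebResponse().getStatusCode();
            System.out.println(status + (isClientError(status) ? " (client error)" : ""));
        }
    }
}
```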
I want to use google-oauth-java-client to get an authorization code from Sina Weibo.
This is the GET request that gets the code from Sina:
https://api.weibo.com/oauth2/authorize?client_id=70090552&response_type=code&redirect_uri=http://127.0.0.1/weibo
Please solve this without a web page, client only!
Can anybody give me some advice?
The GET method uses the browser and returns the code; the POST method uses an HttpRequest, and we can get the parameter from the HttpResponse.
So if you want to get the code, just use the browser and redirect to the URL to get the code.
Here is how I get the access_token.
If you want, you can use google-oauth-java-client for authorization with Twitter or Facebook as well.
I solved this with the Javadoc, which shows some examples; this is the root of the Javadoc and this is the package I used to solve it.
Here is the example I wrote:
// https://server.example.com/token - example server url
try {
    TokenResponse response =
        new AuthorizationCodeTokenRequest(new NetHttpTransport(), new JacksonFactory(),
            new GenericUrl("here is the server url"), "here write your code")
            .setRedirectUri("here write the redirectUrl")
            .set("client_id", "here write your client_id")
            .set("client_secret", "here write your client_secret")
            .set("Other else need", "Other else need")
            .execute();
    System.out.println("Access token: " + response.getAccessToken());
} catch (TokenResponseException e) {
    if (e.getDetails() != null) {
        System.err.println("Error: " + e.getDetails().getError());
        if (e.getDetails().getErrorDescription() != null) {
            System.err.println(e.getDetails().getErrorDescription());
        }
        if (e.getDetails().getErrorUri() != null) {
            System.err.println(e.getDetails().getErrorUri());
        }
    } else {
        System.err.println(e.getMessage());
    }
}
Here you can find the solution:
http://code.google.com/p/google-oauth-java-client/source/browse/dailymotion-cmdline-sample/src/main/java/com/google/api/services/samples/dailymotion/cmdline/DailyMotionSample.java?repo=samples
This and this will help you. First understand the mechanism, then implement it according to your scenario.