In HtmlUnit, how do I stop an exception from being thrown when the requested page returns a failing status code (such as 4xx)? I need to read the status code, and I can't do that if an exception is thrown first.
Page page = null;
try {
page = webClient.getPage(requestSettings);
System.out.println(page.getWebResponse().getStatusCode()); // it doesn't go to this line because exception is already thrown
} catch (Exception e) {
System.out.println(page.getWebResponse().getStatusCode()); // it will fail because of NullPointerException
System.out.println(e);
}
The following method seems to work only on older versions of HtmlUnit. I'm using v2.25 and the method doesn't exist.
webClient.setThrowExceptionOnFailingStatusCode(false);
The newer API moves this setting to WebClientOptions, so you should use:
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
When running the following code:
try {
Document doc = Jsoup.connect("https://pomofocus.io/").get();
Elements text = doc.select("div.sc-kEYyzF");
System.out.println(text.text());
}
catch (IOException e) {
e.printStackTrace();
}
No output occurs. When changing the println to:
System.out.println(text.first().text());
I get a NullPointerException but nothing else.
jsoup doesn't execute javascript - it parses the HTML that the server returns. You can check View Source (vs Inspect) to see the response from the server, and what is selectable.
I am building a web-scraper using Java and JavaFx. I already have an application running using JavaFx.
I am building a web-scraper following similar procedures as this blog: https://ksah.in/introduction-to-web-scraping-with-java/
However, instead of having a fixed url, I want to input any url and scrape. For this, I need to handle the error when the url is not found. Therefore, I need to display "Page not found" in my application console when the url is not found.
Here is my code for the part where I get URL:
void search() {
List<Course> v = scraper.scrape(textfieldURL.getText(), textfieldTerm.getText(),textfieldSubject.getText());
...
}
and then I do:
try {
HtmlPage page = client.getPage(baseurl + "/" + term + "/subject/" + sub);
...
}catch (Exception e) {
System.out.println(e);
}
in the scraper file.
It seems that the API will throw FailingHttpStatusCodeException if you set it up correctly.
if the server returns a failing status code AND the property
WebClientOptions.setThrowExceptionOnFailingStatusCode(boolean) is set
to true.
You can also get the WebResponse from the Page and call getStatusCode() to get the HTTP status code.
The tutorial you added contains the following code:
.....
WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);
try {
String searchUrl = "https://newyork.craigslist.org/search/sss?sort=rel&query=" + URLEncoder.encode(searchQuery, "UTF-8");
HtmlPage page = client.getPage(searchUrl);
}catch(Exception e){
e.printStackTrace();
}
.....
With this code, when client.getPage throws any error, for example because of a 404, it will be caught and printed to the console.
As you stated you want to print "Page not found", which means we have to catch a specific exception and log the message. The library used in the tutorial is net.sourceforge.htmlunit and as you can see here (http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/WebClient.html#getPage-java.lang.String-) the getPage method throws a FailingHttpStatusCodeException, which contains the status code from the HttpResponse. (http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/FailingHttpStatusCodeException.html)
This means we have to catch the FailingHttpStatusCodeException and check whether the status code is 404. If it is, log the message; if not, print the stack trace, for example.
Just for the sake of clean code, try not to catch them all (like in pokemon) as in the tutorial but use specific catch-blocks for the IOException, FailingHttpStatusCodeException and MalformedURLException from the getPage method.
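The decision step described above (404 gets its own message, anything else is reported differently) can be kept in a small helper. This is only a sketch: StatusMessages and describe are hypothetical names, not part of HtmlUnit, and the HtmlUnit call itself is left out so the snippet runs with the JDK alone.

```java
import java.net.HttpURLConnection;

// Hypothetical helper (not part of HtmlUnit): maps an HTTP status code
// to the text the application console should show.
final class StatusMessages {
    private StatusMessages() {}

    static String describe(int statusCode) {
        if (statusCode == HttpURLConnection.HTTP_NOT_FOUND) { // 404
            return "Page not found";
        }
        if (statusCode >= 400) {
            // Any other client or server error.
            return "Request failed with HTTP status " + statusCode;
        }
        return "OK";
    }

    public static void main(String[] args) {
        System.out.println(describe(404)); // prints "Page not found"
        System.out.println(describe(503)); // prints "Request failed with HTTP status 503"
    }
}
```

Inside a catch (FailingHttpStatusCodeException e) block, the call would then be StatusMessages.describe(e.getStatusCode()), with separate catch blocks for IOException and MalformedURLException.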
I'm writing a small program and I want to fetch an element from a website. I've followed many tutorials to learn how to write this code with jSoup. An example of what I'm trying to print is "Monday, November 19, 2018 - 3:00pm to 7:00pm". I'm running into the error
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://my.cs.ubc.ca/course/cpsc-210
Here is my code:
public class WebPageReader {
private String url = "https://my.cs.ubc.ca/course/cpsc-210";
private Document doc;
public void readPage(){
try {
doc = Jsoup.connect(url).
userAgent("Mozilla/5.0")
.referrer("https://www.google.com").timeout(1000).followRedirects(true).get();
Elements temp=doc.select("span.date-display-single");
int i=0;
for (Element officeHours:temp){
i++;
System.out.println(officeHours);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
Thanks for the help.
Status 403 means access is forbidden.
Please make sure you have access to https://my.cs.ubc.ca/course/cpsc-210
I tried to open https://my.cs.ubc.ca/course/cpsc-210 in a browser and it returned an error page. I think you need credentials to access it.
I have a situation where, before I process an input file, I want to check whether certain information is set up in the database. In this particular case it is a client's name and the parameters used for processing. If this information is not set up, the file import should fail.
In many StackOverflow pages, the users resolve handling EmptyResultDataAccessException exceptions generated by queryForObject returning no rows by catching them in the Java code.
The issue is that Spring Integration is catching the exception well before my code is catching it and in theory, I would not be able to tell this error from any number of EmptyResultDataAccessException exceptions which may be thrown with other queries in the code.
Example code segment showing try...catch with queryForObject:
MapSqlParameterSource mapParameters = new MapSqlParameterSource();
// Step 1 check if client exists at all
mapParameters.addValue("clientname", clientName);
try {
clientID = this.namedParameterJdbcTemplate.queryForObject(FIND_BY_NAME, mapParameters, Long.class);
} catch (EmptyResultDataAccessException e) {
// EmptyResultDataAccessException normally has no SQLException cause, so log its own message
logger.debug("No client was found");
logger.debug(e.getMessage());
return null;
}
return clientID;
In the above code, no row was returned and I want to handle that properly (I have not coded that portion yet). However, the catch block is never triggered; my generic error handler and its associated error channel are triggered instead.
Segment from file BatchIntegrationConfig.java:
@Bean
@ServiceActivator(inputChannel="errorChannel")
public DefaultErrorHandlingServiceActivator errorLauncher(JobLauncher jobLauncher){
logger.debug("====> Default Error Handler <====");
return new DefaultErrorHandlingServiceActivator();
}
Segment from file DefaultErrorHandlingServiceActivator.java:
public class DefaultErrorHandlingServiceActivator {
@ServiceActivator
public void handleThrowable(Message<Throwable> errorMessage) throws Throwable {
// error handling code should go here
}
}
Tested Facts:
queryForObject expects a row to be returned and will throw an
exception otherwise, therefore you have to handle the exception
or use a different query which returns a row.
Spring Integration is monitoring exceptions and catching them before
my own code can handle them.
What I want to be able to do:
Catch the very specific condition and log it or let the end user know what they need to do to fix the problem.
Edit on 10/26/2016 per recommendation from @Artem:
Changed my existing input channel to Spring provided Handler Advice:
@Transformer(inputChannel = "memberInputChannel", outputChannel = "commonJobGateway", adviceChain="handleAdvice")
Added support Bean and method for the advice:
@Bean
ExpressionEvaluatingRequestHandlerAdvice handleAdvice() {
ExpressionEvaluatingRequestHandlerAdvice advice = new ExpressionEvaluatingRequestHandlerAdvice();
advice.setOnFailureExpression("payload");
advice.setFailureChannel(customErrorChannel());
advice.setReturnFailureExpressionResult(true);
advice.setTrapException(true);
return advice;
}
private MessageChannel customErrorChannel() {
return new DirectChannel();
}
I initially had some issues with wiring up this feature, but in the end, I realized that it is creating yet another channel which will need to be monitored for errors and handled appropriately. For simplicity, I have chosen to not use another channel at this time.
Although potentially not the best solution, I switched to checking for row counts instead of returning actual data. In this situation, the data exception is avoided.
The main code above moved to:
MapSqlParameterSource mapParameters = new MapSqlParameterSource();
mapParameters.addValue("clientname", clientName);
// Step 1 check if client exists at all; if exists, continue
// Step 2 check if client enrollment rules are available
if (this.namedParameterJdbcTemplate.queryForObject(COUNT_BY_NAME, mapParameters, Integer.class) == 1) {
if (this.namedParameterJdbcTemplate.queryForObject(CHECK_RULES_BY_NAME, mapParameters, Integer.class) != 1) return null;
} else return null;
return findClientByName(clientName);
I then check the data upon return to the calling method in Spring Batch:
if (clientID != null) {
logger.info("Found client ID ====> " + clientID);
}
else {
throw new ClientSetupJobExecutionException("Client " +
fileNameParts[1] + " does not exist or is improperly setup in the database.");
}
Although not needed, I created a custom Java Exception which could be useful at a later point in time.
A Spring Integration Service Activator can be supplied with the ExpressionEvaluatingRequestHandlerAdvice, which works like a try...catch and lets you perform some logic in the onFailureExpression: http://docs.spring.io/spring-integration/reference/html/messaging-endpoints-chapter.html#expression-advice
Your problem might be that you catch (EmptyResultDataAccessException e), but it may be the cause wrapped inside another exception rather than the top-level exception thrown by the this.namedParameterJdbcTemplate.queryForObject() invocation.
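If the exception you care about arrives as a cause inside another exception, one way to check is to walk the cause chain. The sketch below is an assumption-laden illustration: Causes and findCause are hypothetical names, and IllegalStateException stands in for EmptyResultDataAccessException so the snippet runs without Spring on the classpath.

```java
// Hypothetical helper: searches an exception and its causes for a given type.
final class Causes {
    private Causes() {}

    // Returns the first throwable in the cause chain that is an instance
    // of the requested type, or null if there is none.
    static <T extends Throwable> T findCause(Throwable error, Class<T> type) {
        Throwable current = error;
        int depth = 0;
        while (current != null && depth < 50) { // depth limit guards against cyclic chains
            if (type.isInstance(current)) {
                return type.cast(current);
            }
            current = current.getCause();
            depth++;
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for a framework exception wrapping the data access exception.
        Exception wrapped = new RuntimeException("handler failed",
                new IllegalStateException("no rows"));
        IllegalStateException found = findCause(wrapped, IllegalStateException.class);
        System.out.println(found.getMessage()); // prints "no rows"
    }
}
```

In an error handler this could be called with the payload of the error message and EmptyResultDataAccessException.class, which would let that case be told apart from other failures.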
I have a Swing application that reads HTML pages using the following code:
String urlzip = null;
try {
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
for (Element link : links) {
if (link.attr("abs:href").contains("BcfiHtm.zip")) {
urlzip = link.attr("abs:href").toString();
}
}
} catch (IOException e) {
textAreaStatus.append("Failed to get new file from internet:"+e.getMessage()+"\n");
e.printStackTrace();
}
return urlzip;
My Swing application then returns a string. It works fine and reads any HTML page that I give it. However, sometimes the application gives me an error of type "Exception report". How can I increase the timeout?
There's an example on this page.
Jsoup.connect("http://example.com").timeout(3000)
This error occurs when a read cannot complete, either because the response is large or because of a connection problem. I would suggest increasing the timeout to at least 1 minute (the value is in milliseconds), like this:
Jsoup.connect("http://example.com").timeout(60000);