HTTP error fetching URL. Status=403 in Java

I'm writing a small program that fetches an element from a website. I've followed several tutorials on writing this kind of code with jsoup. An example of what I'm trying to print is "Monday, November 19, 2018 - 3:00pm to 7:00pm". I'm running into this error:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://my.cs.ubc.ca/course/cpsc-210
Here is my code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class WebPageReader {
    private String url = "https://my.cs.ubc.ca/course/cpsc-210";
    private Document doc;

    public void readPage() {
        try {
            doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0")
                    .referrer("https://www.google.com")
                    .timeout(1000)
                    .followRedirects(true)
                    .get();
            Elements temp = doc.select("span.date-display-single");
            for (Element officeHours : temp) {
                System.out.println(officeHours);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Thanks for the help.

Status 403 means your access is forbidden.
Please make sure you have access to https://my.cs.ubc.ca/course/cpsc-210.
I tried to open https://my.cs.ubc.ca/course/cpsc-210 in a browser and it returns an error page, so I think you need credentials to access it.
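If you want to inspect the status code yourself instead of catching the exception, jsoup can be told not to throw on error statuses. A minimal sketch, using the same URL and user agent as in your code:

import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;

public class StatusCheck {
    public static void main(String[] args) throws IOException {
        // ignoreHttpErrors(true) makes jsoup return the response
        // instead of throwing HttpStatusException on a 4xx/5xx status.
        Connection.Response res = Jsoup.connect("https://my.cs.ubc.ca/course/cpsc-210")
                .userAgent("Mozilla/5.0")
                .ignoreHttpErrors(true)
                .execute();
        System.out.println("Status: " + res.statusCode() + " " + res.statusMessage());
        if (res.statusCode() == 200) {
            // Only parse and select when the page was actually served.
            System.out.println(res.parse().select("span.date-display-single"));
        }
    }
}

If this keeps printing 403, the server is refusing the request no matter how it is made, and you will need valid credentials (or an official API) rather than a different jsoup setup.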

Related

No output for JSoup file

When running the following code:
try {
    Document doc = Jsoup.connect("https://pomofocus.io/").get();
    Elements text = doc.select("div.sc-kEYyzF");
    System.out.println(text.text());
} catch (IOException e) {
    e.printStackTrace();
}
No output occurs. When changing the println to:
System.out.println(text.first().text());
I get a NullPointerException but nothing else.
jsoup doesn't execute JavaScript; it parses the HTML that the server returns. You can check View Source (as opposed to Inspect) in your browser to see the raw response from the server, and hence what is selectable.
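To confirm this, you can print the HTML that jsoup actually received and check whether the element exists in it at all. A small sketch:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class RawHtmlCheck {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://pomofocus.io/").get();
        // Anything injected client-side by JavaScript will not appear here.
        System.out.println(doc.html());
        // 0 matches confirms the element is not in the raw server response.
        System.out.println(doc.select("div.sc-kEYyzF").size());
    }
}

If the selection count is 0, the div is rendered by JavaScript, and you would need a browser-driven tool such as Selenium or HtmlUnit (with JavaScript enabled) instead of jsoup.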

unable to capture 'g-recaptcha-response' for Recaptchav2 with Selenium

So I've been trying to build a web scraper, but some of the data I need is locked behind a reCAPTCHA. From what I've gathered scouring the internet, every captcha has a TextArea element named 'g-recaptcha-response' that gets filled in as the captcha is completed. My current testing workaround is to solve the captcha manually, capture the response, and feed it back into the headless browser. However, I'm unable to get the response: as soon as the answer is submitted, the response element can no longer be found.
org.openqa.selenium.NoSuchElementException: no such element: Unable to locate element: {"method":"css selector","selector":"*[name='g-recaptcha-response']"}
public static String captchaSolver(String captchaUrl) {
    setUp();
    driver.get(captchaUrl);
    new WebDriverWait(driver, 2); // note: this wait is constructed but never used
    try {
        while (true) {
            String response = driver.findElement(By.name("g-recaptcha-response")).getText();
            if (response.length() != 0) {
                System.out.println(response);
                break;
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return "";
}
Try to find the element by CSS like this:
*[name*='g-recaptcha-response']
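Applied to your code, that would look something like the sketch below. Two assumptions worth flagging: the token is set as the textarea's value, so getAttribute("value") is usually more reliable than getText(), and an explicit WebDriverWait replaces the busy loop.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class CaptchaResponseReader {
    // Sketch: assumes 'driver' is an initialized WebDriver, as in the question.
    public static String readResponse(WebDriver driver) {
        By token = By.cssSelector("*[name*='g-recaptcha-response']");
        // Give yourself up to 120 seconds to solve the captcha manually.
        // Waits for the element to exist; you may additionally need to
        // wait for its value to become non-empty.
        WebDriverWait wait = new WebDriverWait(driver, 120);
        wait.until(ExpectedConditions.presenceOfElementLocated(token));
        // The response token is stored in the textarea's value attribute.
        return driver.findElement(token).getAttribute("value");
    }
}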

How to handle 404 page not found error properly?

I am building a web scraper using Java and JavaFX. I already have an application running using JavaFX.
I am building the scraper following a similar procedure to this blog post: https://ksah.in/introduction-to-web-scraping-with-java/
However, instead of having a fixed URL, I want to input any URL and scrape it. For this, I need to handle the error when the URL is not found, and display "Page not found" in my application console.
Here is my code for the part where I get URL:
void search() {
    List<Course> v = scraper.scrape(textfieldURL.getText(), textfieldTerm.getText(), textfieldSubject.getText());
    ...
}
and then I do:
try {
    HtmlPage page = client.getPage(baseurl + "/" + term + "/subject/" + sub);
    ...
} catch (Exception e) {
    System.out.println(e);
}
in the scraper file.
It seems that the API will throw a FailingHttpStatusCodeException if you set it up correctly. Per the javadoc, it is thrown "if the server returns a failing status code AND the property WebClientOptions.setThrowExceptionOnFailingStatusCode(boolean) is set to true."
You can also get the WebResponse from the Page and call getStatusCode() to get the HTTP status code.
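For example (a sketch assuming a plain HtmlUnit WebClient, configured like the one in the tutorial):

import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;

public class StatusProbe {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(false);
            // With this option off, getPage() no longer throws on 4xx/5xx,
            // so the status code can be read from the response instead.
            client.getOptions().setThrowExceptionOnFailingStatusCode(false);
            Page page = client.getPage("https://example.com/no-such-page");
            int status = page.getWebResponse().getStatusCode();
            if (status == 404) {
                System.out.println("Page not found");
            }
        }
    }
}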
The tutorial you added contains the following code:
...
WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);
try {
    String searchUrl = "https://newyork.craigslist.org/search/sss?sort=rel&query=" + URLEncoder.encode(searchQuery, "UTF-8");
    HtmlPage page = client.getPage(searchUrl);
} catch (Exception e) {
    e.printStackTrace();
}
...
With this code, when client.getPage throws any error, for example a 404, it will be caught and printed to the console.
As you stated, you want to print "Page not found", which means we have to catch a specific exception and log the message. The library used in the tutorial is net.sourceforge.htmlunit, and as you can see here (http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/WebClient.html#getPage-java.lang.String-), the getPage method throws a FailingHttpStatusCodeException, which contains the status code from the HTTP response (http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/FailingHttpStatusCodeException.html).
This means we have to catch the FailingHttpStatusCodeException and check whether the status code is a 404. If yes, log the message; if not, print the stack trace, for example.
Just for the sake of clean code, try not to catch them all (like in Pokémon) as in the tutorial, but use specific catch blocks for the IOException, FailingHttpStatusCodeException and MalformedURLException thrown by the getPage method, as in the sketch below.
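A sketch of what those specific catch blocks could look like (the method and field names here are illustrative, not from the question):

import java.io.IOException;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Scraper {
    void fetch(WebClient client, String url) {
        try {
            HtmlPage page = client.getPage(url);
            // ... scrape the page ...
        } catch (FailingHttpStatusCodeException e) {
            if (e.getStatusCode() == 404) {
                System.out.println("Page not found");
            } else {
                e.printStackTrace();
            }
        } catch (MalformedURLException e) {
            // Must be caught before IOException, which it extends.
            System.out.println("Bad URL: " + url);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}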

com.google.gwt.user.client.Window: how to get the HTTP status code from the response?

I use com.google.gwt.user.client.Window.open(String url, String name, String features) to download a file from the server.
Everything is OK when the result from the server is 200 OK: I get the file.
The problem comes when the result from the server is different from 200 OK, for example 500 Internal Server Error or 401 Unauthorized. Then I get an ugly Tomcat error page that contains information about the problem.
I would like to catch every status code other than 200 OK and display my own message, or redirect to e.g. the login page.
How can I achieve this?
To achieve the required functionality, we first need to check for the existence of the file on the server. We can do that with a simple HEAD request. Here is sample code for that:
XMLHttpRequest req = XMLHttpRequest.create();
req.open("HEAD", fileURL);
req.setOnReadyStateChange(new ReadyStateChangeHandler() {
    @Override
    public void onReadyStateChange(XMLHttpRequest xhr) {
        if (xhr.getReadyState() == XMLHttpRequest.DONE) {
            if (xhr.getStatus() == 200) {
                // The file exists, so opening the window is safe.
                Window.open(fileURL, winTitle, "");
            } else {
                // TODO handle other status codes
            }
        }
    }
});
req.send();

SocketTimeoutException: Read timed out, how to fix it?

I have a Swing application that reads HTML pages using the following code:
String urlzip = null;
try {
    Document doc = Jsoup.connect(url).get();
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        if (link.attr("abs:href").contains("BcfiHtm.zip")) {
            urlzip = link.attr("abs:href");
        }
    }
} catch (IOException e) {
    textAreaStatus.append("Failed to get new file from internet:" + e.getMessage() + "\n");
    e.printStackTrace();
}
return urlzip;
Then my Swing application returns the string. It works fine and reads any HTML page I give it. However, sometimes the application gives me a SocketTimeoutException: Read timed out. How can I increase the timeout?
There's an example of this in the jsoup documentation:
Jsoup.connect("http://example.com").timeout(3000)
This error occurs when you are trying to read data and, because of a large response or a connection problem, the read cannot complete in time. I would suggest increasing your timeout to at least one minute, like this:
Jsoup.connect("http://example.com").timeout(60000);
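Applied to the code from the question, that would look something like the sketch below, assuming the same url and textAreaStatus fields. SocketTimeoutException is caught separately so the timeout case can get its own message (it extends IOException, so it must come first):

import java.io.IOException;
import java.net.SocketTimeoutException;

try {
    Document doc = Jsoup.connect(url).timeout(60000).get();
    // ... select the links as before ...
} catch (SocketTimeoutException e) {
    textAreaStatus.append("Timed out reading " + url + "\n");
} catch (IOException e) {
    textAreaStatus.append("Failed to get new file from internet:" + e.getMessage() + "\n");
}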
