So I've been trying to build a webscraper but some of the data I need to scrape is locked behind a reCaptcha. From what I've gathered scouring around on the internet is every captcha has a TextArea element with the 'g-recaptcha-response' that gets filled in as the captcha is completed. The current solution for testing is to simply get around the captcha with me manually doing it and trying to capture the response and feed it back into the headless browser however I'm unable to get the response since as soon as the answer is submitted it can no longer find the response element.
org.openqa.selenium.NoSuchElementException: no such element: Unable to locate element: {"method":"css selector","selector":"*[name='g-recaptcha-response']"}
public static String captchaSolver(String captchaUrl) {
setUp();
driver.get(captchaUrl);
new WebDriverWait(driver,2);
try {
while (true) {
String response = driver.findElement(By.name("g-recaptcha-response")).getText();
if (response.length()!=0) {
System.out.println(response);
break;
}
}
}catch (Exception e){
e.printStackTrace();
}
return "";
}
Try to find the element by CSS like this:
*[name*='g-recaptcha-response']
Related
I am building a web-scraper using Java and JavaFx. I already have an application running using JavaFx.
I am building a web-scraper following similar procedures as this blog: https://ksah.in/introduction-to-web-scraping-with-java/
However, instead of having a fixed url, I want to input any url and scrape. For this, I need to handle the error when the url is not found. Therefore, I need to display "Page not found" in my application console when the url is not found.
Here is my code for the part where I get URL:
void search() {
List<Course> v = scraper.scrape(textfieldURL.getText(), textfieldTerm.getText(),textfieldSubject.getText());
...
}
and then I do:
try {
HtmlPage page = client.getPage(baseurl + "/" + term + "/subject/" + sub);
...
}catch (Exception e) {
System.out.println(e);
}
in the scraper file.
It seems that the API will throw FailingHttpStatusCodeException if you set it up correctly.
if the server returns a failing status code AND the property
WebClientOptions.setThrowExceptionOnFailingStatusCode(boolean) is set
to true.
You can also get the WebResponse from the Page and call getStatusCode() to get the HTTP status code.
The tutorial you added contains the following code:
.....
WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);
try {
String searchUrl = "https://newyork.craigslist.org/search/sss?sort=rel&query=" + URLEncoder.encode(searchQuery, "UTF-8");
HtmlPage page = client.getPage(searchUrl);
}catch(Exception e){
e.printStackTrace();
}
.....
With this code when client.getPage throws any error, for example a 404, it will be catched and printed to the console.
As you stated you want to print "Page not found", which means we have to catch a specific exception and log the message. The library used in the tutorial is net.sourceforge.htmlunit and as you can see here (http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/WebClient.html#getPage-java.lang.String-) the getPage method throws a FailingHttpStatusCodeException, which contains the status code from the HttpResponse. (http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/FailingHttpStatusCodeException.html)
This means we have to catch the FailingHttpStatusCodeException and check if the statuscode is a 404. If yes, log the message, if not, print the stacktrace for example.
Just for the sake of clean code, try not to catch them all (like in pokemon) as in the tutorial but use specific catch-blocks for the IOException, FailingHttpStatusCodeException and MalformedURLException from the getPage method.
I'm facing a problem with finding an element that does not exist. When I try to login into the application, if it failed login it will show the element -
dr.findElement(By.className("message-error ")).getText();
And after a successful login it will show this:
dr.findElement(By.className("message-success")).getText();
When I run the code and it doesn't find the element, then execution stops with the exception: element is not found
String mes=null;
mes=dr.findElement(By.className("message-success")).getText();
if(mes!=null) {
File out= new File("success.txt");
FileWriter fr =new FileWriter(out,true);
PrintWriter pw=new PrintWriter(fr);
pw.println(mes+"|"+user.get(i)+"|"+pass.get(i));
pw.close();
}
mes=dr.findElement(By.className("message-error")).getText();
if(mes!=null) {
File out= new File("error.txt");
FileWriter fr =new FileWriter(out,true);
PrintWriter pw=new PrintWriter(fr);
pw.println(mes+"|"+user.get(i)+"|"+pass.get(i));
pw.close();
}
The element does not appear.
For example, the success element will not shown until it is successful and the error element will not appear in the CSS until it gets an error.
So how can I tell it if element should exit or come to live or appear do an action?
What is the right thing to do in an if statement if the login is successful? Do this and login fail do this?
Use WebdriverWait to wait for the visibility of success message,
// After valid login
WebDriverWait wait = new WebDriverWait(driver, 60);
WebElement successmessage wait.until(ExpectedConditions.visibilityOfElementLocated(By.className("message-success")));
sucessmessage.getText();
Similarly for error message,
// After invalid login
WebElement errormessage wait.until(ExpectedConditions.visibilityOfElementLocated(By.className("message-error")));
errormessage.getText();
What i think is you have same element but the class getting change as per application behavior. suppose if you are able to login in the application then it show the message element with class having attribute message-success and if it won't allow then error message with class having attribute 'message-error' in the same element.
I've handled the same in below code -
// first get the message element with some other locator or other attribute (don't use class name)
WebElement message = driver.findElement(By.locator);
if(message.getAttribute("class").contains("message-success")){
System.out.println("Success message is " + message.getText())
// write the code to perform further action you want on login success
}else if (message.getAttribute("class").contains("message-error")){
System.out.println("Error message is " + message.getText())
// write the code to perform further action you want on login Fail
}else{
System.out.println("Message is empty" + message.getText())
}
Let me know if you have further queries.
My Choice is to use Webdriver wait. it is the perfect way to find an element.
public WebDriverWait(WebDriver driver,15)
WebElement successmessage = wait.until(ExpectedConditions.visibilityOfallElementLocatedby(By.className("message-success")));
sucessmessage.getText();
visibilityOfallElementLocatedby :
An expectation for checking that all elements present on the web page that match the locator are visible. Visibility means that the elements are not only displayed but also have a height and width that is greater than 0.
The above i wrote is for success Message, similar way try for invalid login.
To use different types of wait, check this doc - https://seleniumhq.github.io/selenium/docs/api/java/allclasses-noframe.html
search Wait in that document.
this is the answer for the question it work with me
try {
WebElement sucmes=dr.findElement(By.xpath("//div[#class='message']"));
String suclogin="login success:";
if(sucmes.getText().contains(suclogin)) {
File suclogin= new File("suclog.txt");
FileWriter suclogr =new FileWriter(suclogin,true);
PrintWriter suclogrw=new PrintWriter(suclogr);
suclogrw.println(sucmes.getText());
suclogrw.close();
}else{
//the other action here
}
}
I'm writing a small program and I want to fetch an element from a website. I've followed many tutorials to learn how to write this code with jSoup. An example of what I'm trying to print is "Monday, November 19, 2018 - 3:00pm to 7:00pm". I'm running into the error
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://my.cs.ubc.ca/course/cpsc-210
Here is my code:
public class WebPageReader {
private String url = "https://my.cs.ubc.ca/course/cpsc-210";
private Document doc;
public void readPage(){
try {
doc = Jsoup.connect(url).
userAgent("Mozilla/5.0")
.referrer("https://www.google.com").timeout(1000).followRedirects(true).get();
Elements temp=doc.select("span.date-display-single");
int i=0;
for (Element officeHours:temp){
i++;
System.out.println(officeHours);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
Thanks for the help.
Status 403 means your access is forbidden.
Please make sure you have an access to https://my.cs.ubc.ca/course/cpsc-210
I have tried to access https://my.cs.ubc.ca/course/cpsc-210 from browser. It returns Error Page. I think you need to use credential to access it.
I need to download a source code of this webpage: https://app.zonky.cz/#/marketplace/ so I could have the code checking if there is a new loan available. Unfortunate for me, the web page uses a loading spinner for the time the page is being loaded in the background. When I try to download the page's source using:
String url = "https://app.zonky.cz/#/marketplace/";
StringBuilder text = new StringBuilder();
try
{
URL pageURL = new URL(url);
Scanner scanner = new Scanner(pageURL.openStream(), "utf-8");
try {
while (scanner.hasNextLine()){
text.append(scanner.nextLine() + "\n");
}
}
finally{
scanner.close();
}
}
catch(Exception ex)
{
//
}
System.out.println(text.toString());
I get the page's source from the moment the spinner is being shown. Do you know of a better approach?
Solution:
public static String getSource() {
WebDriver driver = new FirefoxDriver();
driver.get("https://app.zonky.cz/#/marketplace/");
String output = driver.getPageSource();
driver.close();
return output;
}
You could always wait until the page has finished loading by checking if an element exists(one that is only loaded after the spinner disappears)
Also have you looked into using selenium ? it can be really useful for interacting with websites and handling tricky procedures such as waiting for elements :P
Edit: a pretty simple tutorial for Selenium waiting can be found here - http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp#explicit-and-implicit-waits
I have a swing application that read HTML pages using the following command
String urlzip = null;
try {
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
for (Element link : links) {
if (link.attr("abs:href").contains("BcfiHtm.zip")) {
urlzip = link.attr("abs:href").toString();
}
}
} catch (IOException e) {
textAreaStatus.append("Failed to get new file from internet:"+e.getMessage()+"\n");
e.printStackTrace();
}
return urlzip;
then my swing application will return a string, It works fine and it reads any HTML page that I give to it. However, some times the application gave me the following error type Exception report. How can i increase timeOut?
There's an example on this page.
Jsoup.connect("http://example.com").timeout(3000)
This error occurs while you are trying to read data and because of large data or connection problem it can not complete the task. I would suggest you to increase your Timeout using above code atleast for 1 minute. so it will be like below code,
Jsoup.connect("http://example.com").timeout(60000);