I am trying to fetch a dynamic page from a URL in Java. I have done this with Selenium, but it takes a lot of time, because invoking the Selenium driver is slow. That's why I switched to HtmlUnit, since it is a GUI-less browser. But my HtmlUnit implementation throws an exception.
Questions:
1. How can I correct my HtmlUnit implementation?
2. Is the page produced by Selenium similar to the page produced by HtmlUnit? [Are both dynamic or not?]
My Selenium code is:
public static void main(String[] args) throws IOException {
    // Selenium
    WebDriver driver = new FirefoxDriver();
    driver.get("ANY URL HERE");
    String html_content = driver.getPageSource();
    driver.close();
    // Jsoup builds the DOM here by parsing the HTML content
    Document doc = Jsoup.parse(html_content);
    // OPERATIONS USING DOM TREE
}
HtmlUnit code:
package XXX.YYY.ZZZ.Template_Matching;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnit {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        HtmlPage currentPage = webClient.getPage("http://www.jabong.com/women/clothing/womens-tops/?source=women-leftnav");
        String textSource = currentPage.asText();
        System.out.println(textSource);
    }
}
It throws an exception.
1: How can I correct my HtmlUnit implementation?
Looking at the stack trace, it seems to be saying that the JavaScript engine executed some JavaScript that tried to access an attribute on a JavaScript "undefined" value. If that is correct, it would be a bug in the JavaScript you are testing, not in the HtmlUnit code.
2: Is the page produced by Selenium similar to the page produced by HtmlUnit?
That does not make sense. Neither Selenium nor HtmlUnit "produces" a page. The page is produced by the server code you are testing.
If you are asking whether HtmlUnit is capable of dealing with code that has embedded JavaScript: there is clear evidence in the stack trace that it is trying to execute the JavaScript.
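If the goal is simply to get past the page's broken JavaScript rather than debug it, HtmlUnit can be told not to abort on script errors. A minimal sketch, assuming a 2.x HtmlUnit version where WebClient is AutoCloseable and exposes WebClientOptions:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class TolerantClient {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // keep loading the page even when its scripts throw
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage("http://www.jabong.com/women/clothing/womens-tops/?source=women-leftnav");
            System.out.println(page.asText());
        }
    }
}
```

This does not fix the underlying JavaScript bug; it only stops HtmlUnit from turning it into a fatal exception.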
Related
This is one of the pages that I am going to scrape: https://www.tokopedia.com/berkahcell2/promo-termurah-vr-virtual-reality-box-v-2-0-remote-bluetooth-gamepad/review?src=topads
I want to scrape the comment text under "ulasan terbaru" ("latest reviews"), which I theorize is the result of JavaScript (I might be wrong, though; I am not entirely sure how to check that through Inspect Element). Other than that, I am also unsure about several things in HtmlUnit.
I have read that to scrape JavaScript-generated content I need to use HtmlUnit rather than Jsoup. I followed http://htmlunit.10904.n7.nabble.com/Selecting-a-div-by-class-name-td25787.html to try to scrape the comment div by class, but I got zero output.
public static void comment(String url) throws IOException {
    WebClient client = new WebClient();
    client.setCssEnabled(true);
    client.setJavaScriptEnabled(true);
    try {
        HtmlPage page = client.getPage(url);
        List<?> date = page.getByXPath("//div/@class='list-box-comment'");
        System.out.println(date.size());
        for (int i = 0; i < date.size(); i++) {
            System.out.println(date.get(i).asText());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
This is the part of my code that handles the comment scraping. Am I doing it right? I have two problems:
1. At asText() it says "cannot resolve method asText()".
2. Even if I run without asText(), I get this error:
com.gargoylesoftware.htmlunit.ObjectInstantiationException: unable to create HTML parser
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:418)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:342)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:203)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
at ReviewScraping.comment(ReviewScraping.java:86)
at ReviewScraping.main(ReviewScraping.java:108)
Caused by: org.xml.sax.SAXNotRecognizedException: Feature 'http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe' is not recognized.
at org.apache.xerces.parsers.AbstractSAXParser.setFeature(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:411)
... 11 more
I hope that I can show all of the comments.
Edit: I use IntelliJ as my IDE, and the dependencies for HtmlUnit are in my IntelliJ project structure via Maven.
Regarding your code:
public static void main(String[] args) throws IOException {
    final String url = "https://www.tokopedia.com/berkahcell2/promo-termurah-vr-virtual-reality-box-v-2-0-remote-bluetooth-gamepad/review?src=topads";
    try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        HtmlPage page = webClient.getPage(url);
        webClient.waitForBackgroundJavaScript(40_000);
        System.out.println(page.asXml());
        List<DomNode> date = page.getByXPath("//div[@class='list-box-comment']");
        System.out.println(date.size());
        for (int i = 0; i < date.size(); i++) {
            System.out.println(date.get(i).asText());
        }
    }
}
Now for the problems with the page itself:
I have done some tests, and it looks like the page produces errors with real browsers as well (check the browser console). But with HtmlUnit you get more problems (maybe because of missing support for some JavaScript features). Usually these kinds of pages use many, many lines of JS code, so it would be really time-consuming for me to figure out what is going wrong. If you would like to get this fixed, try to find the real reason for the problem (see http://htmlunit.sourceforge.net/submittingJSBugs.html for some hints) and file a bug report.
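When hunting for the real reason, HtmlUnit's own logging is the first place to look. A sketch of the usual incantations; note that HtmlUnit logs through commons-logging, so these java.util.logging levels only take effect when commons-logging is routed to java.util.logging:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class HtmlUnitLogging {
    public static void main(String[] args) {
        // verbose while diagnosing a JavaScript problem...
        Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.ALL);
        // ...or silence HtmlUnit's warnings entirely once you are done
        Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
        Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
    }
}
```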
I'm trying to get the HTML source of the Zhaopin login page with HttpResponse (and HttpClient), Jsoup, and HtmlUnit (when I first tried, it worked), but I haven't succeeded. All three methods return obfuscated HTML source (and with all three of them I tried sending all the headers).
So I tried PhantomJS, because I read that it waits for the page's JavaScript to execute, but I'm also having no success.
Has anyone used it?
Here is the method I use:
public static Document renderPage(String url) {
    System.setProperty("phantomjs.binary.path", "/usr/local/share/phantomjs-1.9.8-linux-x86_64/bin/phantomjs");
    WebDriver ghostDriver = new PhantomJSDriver();
    try {
        ghostDriver.manage().timeouts().setScriptTimeout(-1, TimeUnit.DAYS);
        ghostDriver.manage().timeouts().pageLoadTimeout(-1, TimeUnit.DAYS);
        ghostDriver.get(url);
        return Jsoup.parse(ghostDriver.getPageSource());
    } finally {
        ghostDriver.quit();
    }
}
Thanks!
This produces the source of the page (at least here, with the latest SNAPSHOT of HtmlUnit). The page code still contains a lot of JavaScript stuff, but it should be easy to filter that out.
try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    final HtmlPage page = webClient.getPage("https://passport.zhaopin.com/org/login");
    webClient.waitForBackgroundJavaScript(10000);
    System.out.println(page.asXml());
}
As a novice to Selenium, I am trying to automate a shopping site with Selenium WebDriver and Java. My scenario: when I search with a keyword and get results, I should be able to pick any one of the results at random. But I am unable to pick a random search result: I either get a "No such element" error, or, when I try to click the same result every time, the search results seem to vary from run to run. Please help on how to proceed.
Here is the code:
package newPackage;

import java.util.concurrent.TimeUnit;
import org.openqa.selenium.*;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.FluentWait;
import org.openqa.selenium.support.ui.Wait;

public class flipKart {
    public static void main(String[] args) throws InterruptedException {
        System.setProperty("webdriver.chrome.driver", "C:\\chromedriver.exe");
        WebDriver dr = new ChromeDriver();
        dr.get("http://m.barnesandnoble.com/");
        dr.manage().window().maximize();
        dr.findElement(By.xpath(".//*[@id='search_icon']")).click();
        dr.findElement(By.xpath(".//*[@id='sk_mobContentSearchInput']")).sendKeys("Golden Book");
        dr.findElement(By.xpath(".//*[@id='sk_mobContentSearchInput']")).sendKeys(Keys.ENTER);
        dr.findElement(By.xpath(".//[@id='skMob_productDetails_prd9780735217034']/div/div")).click();
        dr.findElement(By.xpath(".//*[@id='pdpAddtoBagBtn']")).click();
    }
}
You should write a method that waits for the visibility of the element that needs to be clicked.
As a quick check you could pause the script with Thread.sleep(), but an explicit wait is the better approach.
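Such a wait method might look like the following sketch (using the Selenium 3-style WebDriverWait constructor from the support package; the locator and timeout are placeholders):

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class WaitHelper {
    // waits up to timeoutSeconds for the element to become visible, then returns it
    public static WebElement waitForVisible(WebDriver driver, By locator, long timeoutSeconds) {
        WebDriverWait wait = new WebDriverWait(driver, timeoutSeconds);
        return wait.until(ExpectedConditions.visibilityOfElementLocated(locator));
    }
}
```

You would then call waitForVisible(dr, By.xpath("..."), 10).click() instead of clicking a findElement result directly.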
Hard to answer with the information given, but these tips may help:
If you're getting "no such element", verify that the CSS selector or XPath you are using is correct. Firefox's Firebug FireFinder is an excellent tool for this: it will highlight the element your selector points to.
If your selector is correct, make sure you are using findElementsBy... and not findElementBy...
The plural version returns a list of WebElements, from which you can then pull random elements to click on.
Use an intelligent wait to make sure the elements have loaded on the page. Sometimes Selenium will try to interact with elements on the page before they appear. The Selenium API has plenty of methods to help here, but if you're just debugging, a quick Thread.sleep(5000) when you load the page will work.
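The random-pick step itself is plain Java: take the list returned by findElements and choose an index with java.util.Random. A sketch of just that selection logic (the element type is generic here; with Selenium it would be a List&lt;WebElement&gt;):

```java
import java.util.List;
import java.util.Random;

public class RandomPick {
    // returns a randomly chosen element of the list, or null for an empty list
    public static <T> T pickRandom(List<T> items, Random rnd) {
        if (items.isEmpty()) {
            return null;
        }
        return items.get(rnd.nextInt(items.size()));
    }
}
```

With Selenium this becomes something like pickRandom(dr.findElements(By.xpath("...")), new Random()).click(), once the results are visible.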
I am running some tests on a website that refers to a JavaScript array _gaq, which is not defined anywhere in the page. I can see a similar exception in the browser, but there it is ignored. I called setThrowExceptionOnScriptError(false), but HtmlUnit still throws
com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: "_gaq" is not defined.
Below is my code
WebClient wb = new WebClient(BrowserVersion.CHROME);
wb.getOptions().setThrowExceptionOnScriptError(false);
page = wb.getPage("http://www.axisbank.com/");
HtmlElement el = (HtmlElement) page.getByXPath("//*[@id=\"form1\"]/div[5]/div[2]/div[3]/div/div[5]/img").get(0);
page = el.click();
el = (HtmlElement) page.getByXPath("//*[@id=\"ContentPlaceHolder1_btnLogin\"]").get(0);
System.out.println(el.asText());
page = el.click();
Any suggestions on how to solve this problem? I tried adding page.executeScript("var _gaq = []"), but it still fails.
Don't use HtmlUnit for pages with serious JavaScript; its JavaScript engine is simply not good enough.
Use Selenium instead.
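A minimal Selenium equivalent of the HtmlUnit attempt might look like this sketch (the XPaths are copied from the question and may need adjusting; a real browser engine executes the page's scripts, _gaq included):

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class AxisBankClick {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://www.axisbank.com/");
            driver.findElement(By.xpath("//*[@id=\"form1\"]/div[5]/div[2]/div[3]/div/div[5]/img")).click();
            // read the login button's text, as the HtmlUnit version tried to do
            System.out.println(driver.findElement(By.xpath("//*[@id=\"ContentPlaceHolder1_btnLogin\"]")).getText());
        } finally {
            driver.quit();
        }
    }
}
```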
I need to access the value of a div element inside an iframe from Java code. The iframe is in a web browser and the Java code is on a local server. I need this to test the values in the iframe. I am new to coding/automation, so any suggestion on how this can be done would be helpful. I found out online how to access it through JS, but I need to get the values in my Java code. I am not using Selenium WebDriver but an actual browser.
Any suggestions/pointers would be helpful. Thanks, Stack Overflow!
If you have a solution in JS, then you can use the Java Scripting API.
Below is a basic example:
import javax.script.*;

public class EvalScript {
    public static void main(String[] args) throws Exception {
        // create a script engine manager
        ScriptEngineManager factory = new ScriptEngineManager();
        // create a JavaScript engine
        ScriptEngine engine = factory.getEngineByName("JavaScript");
        // evaluate JavaScript code from a String
        engine.eval("print('Hello, World')");
    }
}
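To move values between the Java side and the script side, e.g. a div's text obtained in JS, you can use engine.put() and engine.get(). A sketch with a made-up variable name; note that getEngineByName("JavaScript") may return null on newer JDKs, where the Nashorn engine has been removed:

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class ExchangeValues {
    public static String divText() throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
        if (engine == null) {
            return null; // no JavaScript engine available on this JDK
        }
        // pass a Java value into the script (the div content here is a placeholder)
        engine.put("divContent", "hello from the iframe");
        // let the script work with it, then read the result back in Java
        engine.eval("var result = divContent.toUpperCase();");
        return (String) engine.get("result");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(divText());
    }
}
```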