I am running some tests against a website that refers to a JavaScript array, _gaq, which is not defined anywhere in the page. I see a similar exception in a real browser, but the browser ignores it. I called setThrowExceptionOnScriptError(false), but HtmlUnit still throws:
com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: "_gaq" is not defined.
Below is my code
WebClient wb = new WebClient(BrowserVersion.CHROME);
wb.getOptions().setThrowExceptionOnScriptError(false);
page = wb.getPage("http://www.axisbank.com/");
HtmlElement el = ((HtmlElement)(page.getByXPath("//*[@id=\"form1\"]/div[5]/div[2]/div[3]/div/div[5]/img").get(0)));
page = el.click();
el = ((HtmlElement)(page.getByXPath("//*[@id=\"ContentPlaceHolder1_btnLogin\"]").get(0)));
System.out.println(el.asText());
page = el.click();
Any suggestions on how to solve this problem? I tried adding page.executeScript("var _gaq = []"), but it still fails.
Don't use HtmlUnit for pages with serious JavaScript. Its JavaScript engine is simply not good enough.
Use Selenium instead.
This is one of the pages that I am going to scrape: https://www.tokopedia.com/berkahcell2/promo-termurah-vr-virtual-reality-box-v-2-0-remote-bluetooth-gamepad/review?src=topads
I want to scrape the comment text under "ulasan terbaru", which I suspect is generated by JavaScript (I might be wrong, though; I am not entirely sure how to check this through inspect element). I am also unsure about several things in HtmlUnit.
I have read that to scrape JavaScript-generated content I need to use HtmlUnit rather than Jsoup. Following http://htmlunit.10904.n7.nabble.com/Selecting-a-div-by-class-name-td25787.html, I tried to select the comment div by class, but I got zero results.
public static void comment(String url) throws IOException{
WebClient client = new WebClient();
client.setCssEnabled(true);
client.setJavaScriptEnabled(true);
try {
HtmlPage page = client.getPage(url);
List<?> date = page.getByXPath("//div/@class='list-box-comment'");
System.out.println(date.size());
for(int i =0 ; i<date.size();i++){
System.out.println(date.get(i).asText());
}
}
catch(Exception e){
e.printStackTrace();
}
}
This is the part of my code that handles the comment scraping. Am I doing it right? I have two problems:
At "asText()" the IDE says "can't resolve method asText()".
Even if I run it without "asText()", I get this error:
com.gargoylesoftware.htmlunit.ObjectInstantiationException: unable to create HTML parser
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:418)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:342)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:203)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
at ReviewScraping.comment(ReviewScraping.java:86)
at ReviewScraping.main(ReviewScraping.java:108)
Caused by: org.xml.sax.SAXNotRecognizedException: Feature 'http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe' is not recognized.
at org.apache.xerces.parsers.AbstractSAXParser.setFeature(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:411)
... 11 more
I hope to be able to show all of the comments.
/edit I use IntelliJ as my IDE, and the HtmlUnit dependencies are added to my IntelliJ project structure through Maven.
Regarding your code:
public static void main(String[] args) throws IOException {
final String url = "https://www.tokopedia.com/berkahcell2/promo-termurah-vr-virtual-reality-box-v-2-0-remote-bluetooth-gamepad/review?src=topads";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(40_000);
System.out.println(page.asXml());
List<DomNode> date = page.getByXPath("//div[@class='list-box-comment']");
System.out.println(date.size());
for(int i = 0 ; i < date.size();i++){
System.out.println(date.get(i).asText());
}
}
}
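One change in the code above is the XPath: a predicate of the form //div[@class='list-box-comment'] selects the div elements, while an expression of the form //div/@class='list-box-comment' is a comparison and evaluates to a boolean. Here is a minimal sketch of that difference using the JDK's javax.xml.xpath instead of HtmlUnit; the sample markup and the class name XPathPredicateDemo are made up for illustration, and HtmlUnit's own handling of a boolean result may differ.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathPredicateDemo {
    // Hypothetical markup standing in for the real page.
    static final String SAMPLE =
        "<html><body>"
        + "<div class='list-box-comment'>first</div>"
        + "<div class='other'>skip</div>"
        + "<div class='list-box-comment'>second</div>"
        + "</body></html>";

    // Evaluates the expression against SAMPLE and returns the requested result type.
    private static Object evaluate(String expression, QName type) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expression, new InputSource(new StringReader(SAMPLE)), type);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Number of nodes selected by a node-set expression.
    static int countNodes(String expression) {
        return ((NodeList) evaluate(expression, XPathConstants.NODESET)).getLength();
    }

    // Result of an expression evaluated as a boolean.
    static boolean asBoolean(String expression) {
        return (Boolean) evaluate(expression, XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) {
        // Predicate form: selects the matching div elements.
        System.out.println(countNodes("//div[@class='list-box-comment']"));
        // Comparison form: only says whether some div has that class.
        System.out.println(asBoolean("//div/@class='list-box-comment'"));
    }
}
```

With this sample the predicate form selects the two matching divs, and the comparison form only reports that at least one div has that class.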
Now the problems with the page itself:
I have done some tests, and it looks like the page also produces errors in real browsers (check the browser console). With HtmlUnit you get more problems, maybe because some JavaScript features are not supported. Pages like this usually use many, many lines of JS code, so it would be really time consuming for me to figure out what is going wrong. If you would like to get this fixed, try to find the real cause of the problem (see http://htmlunit.sourceforge.net/submittingJSBugs.html for some hints) and file a bug report.
I am working with a page source that runs Javascript to print text on the page. Here is a snippet of the source I'm speaking of:
var display_session = get_cookie("LastMRH_Session");
if(null != display_session) {
document.getElementById("sessionDIV").innerHTML = '<BR>The session reference number: ' + display_session + '<BR><BR>';
document.getElementById("sessionDIV").style.visibility = "visible";
}
The page displays the value of the display_session variable when it is not null, but I want to know whether there is a way to use this variable in my Selenium WebDriver code. The JavaScript function that contains this code does not return display_session, and I cannot alter the page source. I tried this based on the post here, but it throws an exception.
JavascriptExecutor js = (JavascriptExecutor) driverRp;
Object result = js.executeScript("return display_session");
System.out.println("sessionId = "+result);
Any suggestions?
Figured it out. Needed to go after the cookie itself instead of messing with the Javascript variable.
Cookie cookie = driverRp.manage().getCookieNamed("LastMRH_Session");
String sessionId = cookie.getValue().toUpperCase();
System.out.println("sessionId = "+sessionId);
Found the solution at this post.
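For anyone curious what looking up a cookie by name involves, here is a minimal sketch in plain Java. It assumes a cookie string in the "name=value; name2=value2" form that document.cookie exposes; the class name CookieLookup and the sample values are made up, and this is not the page's actual get_cookie function, whose source is not shown.

```java
import java.util.Optional;

final class CookieLookup {
    // Returns the value of the named cookie from a "k1=v1; k2=v2" string, if present.
    static Optional<String> find(String cookieString, String name) {
        for (String pair : cookieString.split(";")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip fragments without a name=value shape
            }
            if (pair.substring(0, eq).trim().equals(name)) {
                return Optional.of(pair.substring(eq + 1).trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical cookie string; the real value comes from the browser session.
        String cookies = "theme=dark; LastMRH_Session=ab12cd";
        System.out.println(find(cookies, "LastMRH_Session").map(String::toUpperCase).orElse("(none)"));
    }
}
```

In WebDriver itself none of this is needed, since getCookieNamed does the lookup for you.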
If the display_session variable is declared inside a function, then you will not be able to access it; it is only reachable if it is at global scope.
If your intention is to read the value of a cookie, you can use the driver.manage().getCookieNamed(...) as an alternative to executing Javascript.
EDIT:
Just saw that you were able to figure out the same. Didn't see your answer when I posted. Glad it worked out.
Try this:
js.executeScript("return get_cookie(\"LastMRH_Session\")");
I'm using HtmlUnit in Java to work with a drop-down list.
I tried as User skaffman suggests:
WebDriver driver = new HtmlUnitDriver();
driver.get("https://...");
......................
WebClient client = new WebClient();
Page page = client.getPage("https://...");
HtmlSelect select = (HtmlSelect) page.getElementById(mySelectId);
HtmlOption option = select.getOptionByValue(desiredOptionValue);
select.setSelectedAttribute(option, true);
It does not recognize getElementById. Eclipse recommends switching to findElement(By.id(" ")). Please help.
I agree with my colleague. The above code is correct; make sure you enable JavaScript, otherwise you will have issues with HtmlUnit.
driver = new HtmlUnitDriver();
((HtmlUnitDriver) driver).setJavascriptEnabled(true);
In your code, you are declaring the local variable to be of the type Page that will contain the return value from client.getPage("https://...");
Although it's usually good practice to develop toward the generic interface (in this case, Page), the generic interface does not contain the method to getElementById(...).
Try changing your 4th line of code to the following:
HtmlPage page = client.getPage("https://...");
(I am assuming that the content being returned by client.getPage("https://..."); is of MimeType text/html).
You could also use XmlPage or XhtmlPage, depending on your MimeType.
If it is none of these that you are retrieving via client.getPage("https://...");, then you should not be attempting to call getElementById on a structure that does not have this as part of its API.
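The point about the generic interface can be sketched without HtmlUnit at all. In the example below, Page and HtmlPage are simplified, hypothetical stand-ins for the real HtmlUnit types (their methods here are invented for illustration); it only shows why a variable declared as the interface type cannot see a method that exists on the concrete class.

```java
final class PageTypeDemo {
    // Hypothetical stand-in for the generic page interface: no element lookup here.
    interface Page {
        String getUrl();
    }

    // Hypothetical stand-in for the HTML-specific page type, which adds the lookup.
    static final class HtmlPage implements Page {
        public String getUrl() {
            return "https://example.invalid/";
        }

        String getElementById(String id) {
            return "<select id='" + id + "'>";
        }
    }

    // Returns the concrete HTML type behind the generic interface.
    static Page load() {
        return new HtmlPage();
    }

    static String lookup(String id) {
        Page page = load();
        // page.getElementById(id);  // does not compile: Page declares no such method
        if (page instanceof HtmlPage) {
            HtmlPage html = (HtmlPage) page; // narrow to the type that has the method
            return html.getElementById(id);
        }
        throw new IllegalStateException("not an HTML page");
    }

    public static void main(String[] args) {
        System.out.println(lookup("mySelectId"));
    }
}
```

Declaring the variable as HtmlPage from the start, as suggested above, avoids the explicit check and cast.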
I'm connecting to a web server that serves a specific JavaScript file. (I'm using HttpURLConnection at the moment.)
What I need is a connection that makes it possible to manipulate a JavaScript function.
Afterwards I want to run the whole JavaScript again.
I want the following function to always return "new FlashSocketBackend()":
function createBackend() {
if (flashSocketsWork) {
return new FlashSocketBackend()
} else {
return new COMETBackend()
}
}
Do I have to use HtmlUnit for this?
What's the easiest way to connect, manipulate, and re-run the script?
Thanks.
With HtmlUnit you can indeed do it.
Even though you cannot manipulate an existing JS function, you can execute whatever JavaScript code you wish on an existing page.
Example:
WebClient htmlunit = new WebClient();
HtmlPage page = htmlunit.getPage("http://www.google.com");
page = page.executeJavaScript("<JS code here>").getNewPage();
//manipulate the JS code and re-execute
page = page.executeJavaScript("<manipulated JS code here>").getNewPage();
//manipulate the JS code and re-execute
page = page.executeJavaScript("<manipulated JS code here>").getNewPage();
more:
http://www.aviyehuda.com/2011/05/htmlunit-a-quick-introduction/
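For the createBackend case from the question, one possible snippet to execute is an assignment that rebinds the global name to a function that always returns the Flash backend. This is only a sketch under assumptions: BackendOverride and its methods are hypothetical helper names, the Consumer stands in for whatever executes the script on the page, and whether the page picks up the new function depends on when it calls createBackend.

```java
import java.util.function.Consumer;

final class BackendOverride {
    // Builds a JS statement that rebinds a global function to always construct one type.
    static String overrideScript(String functionName, String constructorName) {
        return "window." + functionName
            + " = function () { return new " + constructorName + "(); };";
    }

    // Hands the script to whatever executes JavaScript on the page.
    static void apply(Consumer<String> executeJavaScript) {
        executeJavaScript.accept(overrideScript("createBackend", "FlashSocketBackend"));
    }

    public static void main(String[] args) {
        // Here the "executor" just prints the script instead of running it.
        apply(System.out::println);
    }
}
```

With HtmlUnit, the executor could be a method reference such as page::executeJavaScript on an HtmlPage, in place of System.out::println.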
Your best shot is probably to use Rhino, an open-source implementation of JavaScript written entirely in Java: load your page by setting window.location and then, hopefully, run your JavaScript function. Some time ago I read "Bringing the Browser to the Server", and it seemed possible.
I'm using HtmlUnit to test some pages, and I'd like to know how I can execute some JavaScript code in the context of the current page. I'm aware that the docs say I'd better emulate the behavior of a user on a page, but that isn't working :( (I have a div which has an onclick property; I call its click method but nothing happens). So I did some googling and tried:
JavaScriptEngine jse = webClient.getJavaScriptEngine();
jse.execute(page, what here?);
Seems like I have to instantiate the script first, but I've found no info on how to do it (right). Could someone share a code snippet showing how to make webclient instance execute the needed code?
You need to call executeJavaScript() on the page, not on webClient.
Example:
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3);
webClient.setJavaScriptEnabled(true);
HtmlPage page = webClient.getPage("http://www.google.com/ncr");
ScriptResult scriptResult = page.executeJavaScript("document.title");
System.out.println(scriptResult.getJavaScriptResult());
prints "Google". (I'm sure you'll have some more exciting code to put in there.)
I don't know the JavaScriptEngine you're quoting and maybe it's not the answer you want, but this sounds like a perfect case for Selenium IDE.
Selenium IDE is a Firefox add-on that records clicks, typing, and other actions to make a test, which you can play back in the browser.
In TestPlan, using the HtmlUnit backend, the Google example is:
GotoURL http://www.google.com/ncr
set %Title% as evalJavaScript document.title
Notice %Title%