I'm trying to get the HTML source of the Zhaopin login page with HttpResponse (and HttpClient), Jsoup and HtmlUnit (it worked the first time I tried), but I haven't succeeded. All three methods return an obfuscated HTML source (and with all three I tried sending all the headers).
So I tried PhantomJS, because I read that it waits for the page's JavaScript to execute, but I'm also having no success with it.
Has someone used it?
Here is the method I use:
public static Document renderPage(String url) {
    // Point Selenium at the local PhantomJS binary
    System.setProperty("phantomjs.binary.path", "/usr/local/share/phantomjs-1.9.8-linux-x86_64/bin/phantomjs");
    WebDriver ghostDriver = new PhantomJSDriver();
    try {
        // Negative values are intended to effectively disable the timeouts
        ghostDriver.manage().timeouts().setScriptTimeout(-1, TimeUnit.DAYS);
        ghostDriver.manage().timeouts().pageLoadTimeout(-1, TimeUnit.DAYS);
        ghostDriver.get(url);
        // Hand the rendered source over to Jsoup
        return Jsoup.parse(ghostDriver.getPageSource());
    } finally {
        ghostDriver.quit();
    }
}
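(A more targeted alternative to the open-ended timeouts above would be an explicit wait for an element that only appears once the login form has rendered. A minimal sketch, where the "loginForm" id is a hypothetical placeholder and WebDriverWait, ExpectedConditions and By come from the standard Selenium support packages:)

public static Document renderPageWithWait(String url) {
    WebDriver driver = new PhantomJSDriver();
    try {
        driver.get(url);
        // "loginForm" is an assumed id; replace it with a locator that actually exists on the page
        new WebDriverWait(driver, 30)
                .until(ExpectedConditions.presenceOfElementLocated(By.id("loginForm")));
        return Jsoup.parse(driver.getPageSource());
    } finally {
        driver.quit();
    }
}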
Thanks!
This produces the source of the page (at least here with the latest SNAPSHOT of HtmlUnit). The page source still contains a lot of JavaScript, but it should be easy to strip that out.
try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    final HtmlPage page = webClient.getPage("https://passport.zhaopin.com/org/login");
    webClient.waitForBackgroundJavaScript(10000);
    System.out.println(page.asXml());
}
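If you only want the markup without the embedded script blocks, one way (a sketch, assuming you post-process the HtmlUnit output with Jsoup) is to parse the XML and drop the script elements:

// Parse the XML produced by HtmlUnit and remove script/noscript elements
Document doc = Jsoup.parse(page.asXml());
doc.select("script, noscript").remove();
System.out.println(doc.outerHtml());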
This is the page that I am going to scrape: https://www.tokopedia.com/berkahcell2/promo-termurah-vr-virtual-reality-box-v-2-0-remote-bluetooth-gamepad/review?src=topads
I want to scrape the comment text under "ulasan terbaru", which I suspect is rendered by JavaScript (I might be wrong, though; I am not entirely sure how to check that through inspect element). Other than that, I am also unsure about several things in HtmlUnit.
I have read that to scrape JavaScript-generated content I need to use HtmlUnit rather than Jsoup. I followed http://htmlunit.10904.n7.nabble.com/Selecting-a-div-by-class-name-td25787.html to try to scrape the comment div by class, but I got zero output.
public static void comment(String url) throws IOException {
    WebClient client = new WebClient();
    client.setCssEnabled(true);
    client.setJavaScriptEnabled(true);
    try {
        HtmlPage page = client.getPage(url);
        List<?> date = page.getByXPath("//div/@class='list-box-comment'");
        System.out.println(date.size());
        for (int i = 0; i < date.size(); i++) {
            System.out.println(date.get(i).asText());
        }
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
This is the part of my code that will handle the comment scraping; am I doing it right? I have two problems:
At "asText()" the IDE says "cannot resolve method asText()".
Even if I run it without "asText()", I get this error:
com.gargoylesoftware.htmlunit.ObjectInstantiationException: unable to create HTML parser
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:418)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:342)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:203)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
at ReviewScraping.comment(ReviewScraping.java:86)
at ReviewScraping.main(ReviewScraping.java:108)
Caused by: org.xml.sax.SAXNotRecognizedException: Feature 'http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe' is not recognized.
at org.apache.xerces.parsers.AbstractSAXParser.setFeature(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:411)
... 11 more
I hope that I can get all of the comments printed.
/edit: I use IntelliJ as my IDE, and the dependencies for HtmlUnit are in my IntelliJ project structure via Maven.
Regarding your code:
public static void main(String[] args) throws IOException {
    final String url = "https://www.tokopedia.com/berkahcell2/promo-termurah-vr-virtual-reality-box-v-2-0-remote-bluetooth-gamepad/review?src=topads";
    try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        HtmlPage page = webClient.getPage(url);
        webClient.waitForBackgroundJavaScript(40_000);
        System.out.println(page.asXml());

        List<DomNode> date = page.getByXPath("//div[@class='list-box-comment']");
        System.out.println(date.size());
        for (int i = 0; i < date.size(); i++) {
            System.out.println(date.get(i).asText());
        }
    }
}
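If the list is still empty after the wait, the background scripts may simply not be finished yet. One possible variation (a sketch, reusing the webClient and page variables from above) is to keep giving the JavaScript more time until the comment nodes show up:

// Poll for the comment divs, giving pending background scripts more time each round
List<DomNode> comments = page.getByXPath("//div[@class='list-box-comment']");
int attempts = 0;
while (comments.isEmpty() && attempts < 5) {
    webClient.waitForBackgroundJavaScript(5_000);
    comments = page.getByXPath("//div[@class='list-box-comment']");
    attempts++;
}
System.out.println("found " + comments.size() + " comment blocks");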
Now to the problems with the page itself:
I have done some tests, and it looks like the page produces errors with real browsers as well (check the browser console). With HtmlUnit you get more problems (maybe because of missing support for some JavaScript features). Pages like this usually use many, many lines of JS code, so it would be really time consuming for me to figure out what is going wrong. If you would like to get this fixed, try to find the real cause of the problem (see http://htmlunit.sourceforge.net/submittingJSBugs.html for some hints) and file a bug report.
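The "unable to create HTML parser" exception from your original code is a separate issue: a SAXNotRecognizedException for a cyberneko feature usually means an older nekohtml or xerces jar on the classpath is shadowing the versions HtmlUnit expects. A quick diagnostic sketch (assuming those classes resolve on your classpath) is to print where they are actually loaded from:

// Print the jar each parser class is loaded from; an old standalone nekohtml/xerces
// jar showing up here would explain the "feature not recognized" error.
System.out.println(org.cyberneko.html.HTMLConfiguration.class
        .getProtectionDomain().getCodeSource().getLocation());
System.out.println(org.apache.xerces.parsers.AbstractSAXParser.class
        .getProtectionDomain().getCodeSource().getLocation());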
I have a funny bug happening to me today. I've been using Selenium for years and never had an issue navigating to a URL (via driver.navigate().to(url)). Today, however, I'm attempting to navigate to a specific URL and find that, after executing the program several times, it sometimes just stays on the original page without navigating to the new page.
The funny thing is that this only happens about 50% of the time, and only when navigating to a specific URL at a specific part of the program (in other parts of the program I have no issue navigating to this URL).
Is it possible that some element on the current page is preventing driver.navigate().to(url) from executing?
I've looked at this and this question, but both seem to have issues with navigating altogether. In my case, it sometimes works and sometimes doesn't (even when the exact same URL is being used).
Also, I'm not getting any specific errors (so I don't have much more info to post). The program just moves on as if the statement didn't exist.
I'll be happy to provide additional details if necessary.
Code:
shoppingCartURL.navToShoppingCart(driver);

String[] XPath = { "//*[contains(@value,'Delete')]" };
List<WebElement> elementList = webElementX.getElementListByXPath(driver, XPath);
System.out.println("Deleting " + elementList.size() + " element(s) from shopping cart");

for (int elementListCounter = 0; elementListCounter < elementList.size(); elementListCounter++) {
    WebElement singleElement = elementList.get(elementListCounter);
    try {
        singleElement.click();
    } catch (Exception e) {
        System.out.println("Exception occurred (StaleElementReferenceException)");
    }
}

if (conditionX == false) {
    productPage.navToProductPage(driver, product); // this method is not always executed, program continues to execute productPage.performActionOnPage(driver); without navigating to 'product page'
    productPage.performActionOnPage(driver);
}
public void navToProductPage(WebDriver driver, String product)
{
    String URL = "https://www.example.com/product/" + product;
    System.out.println("Navigating to " + URL); // always prints the correct url (but still doesn't always navigate to url as mentioned in question)
    driver.navigate().to(URL);
}
Update:
I noticed ref=cart-ajax-error in the redirect URL (after deleting items from the cart). Apparently the site uses AJAX to refresh the page after items are deleted from the cart. Might this conflict with my attempt to navigate to another page? In other words, perhaps Selenium is getting two different instructions at the same time, refresh the page and navigate to a new page, so it remains on the current page?
If this is true, what can be done to resolve this issue?
Thanks!
When something sometimes happens and sometimes doesn't, it's nearly always a matter of timing.
Add an explicit wait to your code after navigating to the URL:
WebDriverWait wait = new WebDriverWait(driver, 20);
wait.until(ExpectedConditions.urlToBe(URL));
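Folded into the navToProductPage helper from the question, that could look roughly like this (a sketch; the 20-second timeout is an arbitrary choice):

public void navToProductPage(WebDriver driver, String product) {
    String url = "https://www.example.com/product/" + product;
    driver.navigate().to(url);
    // block until the browser actually reports the new URL before the caller moves on
    new WebDriverWait(driver, 20).until(ExpectedConditions.urlToBe(url));
}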
I have an HTML page that contains a "Login with my Pass" link inside a quick-action holder div (most of the markup is omitted here).
Using Selenium WebDriver in Java, I tried the below:
WebDriver driver = new HtmlUnitDriver();
String url = "http://www.pic.net.sh/index.html";
// Load the page
driver.get(url);
//driver.click("xpath=//a[contains(@href,'listDetails.do')
//driver.findElement(By.xpath("//div[@class='quickActionHolderInfo loginAccountPass']/a[contains(text(), 'Login')]")).click();
driver.findElement(By.xpath("//div[@class='span12 quickActionHolder']/a[@href='/organisation/pass/login']")).click();
BUT this page redirects to another page (and then to subsequent ones), and when I tried the below to find the final redirected URL:
System.out.println("Page title is: " + driver.getCurrentUrl());
it shows the URL of the page immediately after clicking the link, but NOT the final one.
Have I missed something, or is my code insufficient to deal with a redirection that happens after a delay? I have enquired about it here, but to no avail; if anyone can point me to a similar issue, I will be most grateful.
It usually happens because the current URL is read before the final page has been reached. There are various ways to handle this:
Try waiting for some element on the final page, then call getCurrentUrl().
Another way is to apply a while loop until the current URL changes.
I use the following code for my tests:
int i = 0;
do {
    try {
        Thread.sleep(200);
    } catch (InterruptedException e) {
    }
    i++;
} while (driver.getCurrentUrl().equals("http://www.pic.net.sh/index.html") && i < 10);
Either way should work.
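For the first option, a minimal sketch (the id below is a placeholder; use an element that only exists on the final page):

// Wait up to 20 seconds for an element that only exists on the final page,
// then read the URL the browser ended up on.
WebDriverWait wait = new WebDriverWait(driver, 20);
wait.until(ExpectedConditions.presenceOfElementLocated(By.id("someFinalPageElement")));
System.out.println("Final URL: " + driver.getCurrentUrl());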
I am trying to get a dynamic page from a URL. I am working in Java. I have done this using Selenium, but it takes a lot of time, since invoking the Selenium driver is slow. That's why I switched to HtmlUnit, as it is a GUI-less browser. But my HtmlUnit implementation throws an exception.
Questions:
How can I correct my HtmlUnit implementation?
Is the page produced by Selenium similar to the page produced by HtmlUnit? [Are both dynamic or not?]
My Selenium code is:
public static void main(String[] args) throws IOException {
    // Selenium
    WebDriver driver = new FirefoxDriver();
    driver.get("ANY URL HERE");
    String html_content = driver.getPageSource();
    driver.close();

    // Jsoup makes DOM here by parsing HTML content
    Document doc = Jsoup.parse(html_content);

    // OPERATIONS USING DOM TREE
}
HtmlUnit code:
package XXX.YYY.ZZZ.Template_Matching;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.junit.Assert;
import org.junit.Test;

public class HtmlUnit {
    public static void main(String[] args) throws Exception {
        //HtmlUnit htmlUnit = new HtmlUnit();
        //htmlUnit.homePage();
        WebClient webClient = new WebClient();
        HtmlPage currentPage = webClient.getPage("http://www.jabong.com/women/clothing/womens-tops/?source=women-leftnav");
        String textSource = currentPage.asText();
        System.out.println(textSource);
    }
}
It throws an exception (stack trace not shown).
1: How can I correct my HtmlUnit implementation?
Looking at the stack trace, it seems to be saying that the JavaScript engine executed some JavaScript that tried to access an attribute on a JavaScript "undefined" value. If that is correct, it would be a bug in the JavaScript you are testing, not in the HtmlUnit code.
2: Is the page produced by Selenium similar to the page produced by HtmlUnit?
That does not quite make sense. Neither Selenium nor HtmlUnit "produces" a page. The page is produced by the server code you are testing.
If you are asking whether HtmlUnit is capable of dealing with code that has embedded JavaScript: there is clear evidence in the stack trace that it is trying to execute the JavaScript.
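If the goal is just to get the rendered page despite the page's own script errors, HtmlUnit can be told to tolerate them. A sketch, assuming a reasonably recent HtmlUnit version (the options API replaced the older setters, and NicelyResynchronizingAjaxController lives in com.gargoylesoftware.htmlunit):

// Tolerate script errors coming from the page itself and wait for background JavaScript to finish
WebClient webClient = new WebClient();
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
HtmlPage currentPage = webClient.getPage("http://www.jabong.com/women/clothing/womens-tops/?source=women-leftnav");
webClient.waitForBackgroundJavaScript(10_000);
System.out.println(currentPage.asText());
webClient.close();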
I'm connecting to a web server that serves a specific piece of JavaScript. (Using HttpURLConnection at the moment.)
What I need is a connection that makes it possible to manipulate a JavaScript function.
Afterwards I want to run the whole JavaScript again.
I want the following function to always return "new FlashSocketBackend()":
function createBackend() {
    if (flashSocketsWork) {
        return new FlashSocketBackend()
    } else {
        return new COMETBackend()
    }
}
Do I have to use HtmlUnit for this?
What's the easiest way to connect, manipulate and re-run the script?
Thanks.
With HtmlUnit you can indeed do this.
Even though you cannot manipulate an existing JS function in place, you can execute whatever JavaScript code you wish on an existing page.
Example:
WebClient htmlunit = new WebClient();
HtmlPage page = htmlunit.getPage("http://www.google.com");
page = (HtmlPage) page.executeJavaScript("<JS code here>").getNewPage();
//manipulate the JS code and re-execute
page = (HtmlPage) page.executeJavaScript("<manipulated JS code here>").getNewPage();
//manipulate the JS code and re-execute
page = (HtmlPage) page.executeJavaScript("<manipulated JS code here>").getNewPage();
more:
http://www.aviyehuda.com/2011/05/htmlunit-a-quick-introduction/
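Applied to the function from the question, the manipulation step could simply redefine createBackend so that it always returns the flash-socket backend (a sketch; it assumes FlashSocketBackend is already defined on the loaded page):

// Redefine createBackend on the already-loaded page; any later call to it
// (from the page's own scripts or from executeJavaScript) now returns FlashSocketBackend
page = (HtmlPage) page.executeJavaScript(
        "createBackend = function() { return new FlashSocketBackend(); };").getNewPage();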
Your best shot is probably Rhino, an open-source implementation of JavaScript written entirely in Java: load your page by setting window.location and then, hopefully, run your JavaScript function. I read "Bringing the Browser to the Server" some time ago, and it seemed possible.