I'm trying to get the page content that the JavaScript function getWines() returns. The page I'm trying to get the information from is http://hedonism.co.uk/wines/. I'm using HtmlUnit and wrote the following code:
final WebClient webClient = new WebClient(
BrowserVersion.INTERNET_EXPLORER_10);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setJavaScriptEnabled(false);
final HtmlPage page = webClient.getPage(url);
String javaScriptCode = "getWines(1)";
ScriptResult result = page.executeJavaScript(javaScriptCode);
Page page1 = result.getNewPage();
StringBuffer p = WebGet.getBuffPageContent(page1.getUrl().toString(), true);
System.out.println(p.toString());
But this approach doesn't seem to work. I receive the same page, with the same source code, that I had before the function call, so I can't get information such as the wine name. Am I doing something fundamentally wrong?
Hi, I want to scrape information from a website, so I tried to use Jsoup (and also HttpClient). I realized that neither of them can "see" certain content of the HTML page: when I print the parsed HTML, I get an empty div like the one shown below, while other divs are printed just fine.
Here's my code:
class Main {
public static void main(String args[]) throws IOException, InterruptedException {
Document doc = Jsoup.connect(url).get();
System.out.println(doc.getElementsByClass("needed content"));
}
}
The result in the terminal is:
<div class="needed content"></div>
I searched for answers on Stack Overflow. Some recommend using the Jackson library:
Java - How do I access a child of Div using JSoup
Some recommend embedding a browser in Java:
Is there a way to embed a browser in Java?
Some recommend using HtmlUnit:
Fail to get full content of page with JSoup
I just tried combining Jsoup with HtmlUnit and got the same result. Here's the code:
try(WebClient wc = new WebClient()){
wc.getOptions().setJavaScriptEnabled(true);
wc.getOptions().setCssEnabled(false);
wc.getOptions().setThrowExceptionOnScriptError(false);
wc.getOptions().setTimeout(10000);
HtmlPage page = wc.getPage("https://chainlinklabs.com/jobs");
String pageXml = page.asXml();
Document doc2 = Jsoup.parse(pageXml, url);
System.out.println(doc2.getElementsByClass("needed content"));
System.out.println("Thank God!");
}
My interpretation of the problem is that Jsoup does not show part of the HTML content because that part is generated by JavaScript. Am I heading in the right direction?
There is no need to re-parse the page from HtmlUnit with jsoup; it is a waste of resources. All the select options are also available in HtmlUnit (see https://htmlunit.sourceforge.io/gettingStarted.html), and maybe more.
This simple code works for me. Parts of the page are generated by a JavaScript script that runs asynchronously, so you have to wait for these scripts to finish before accessing the page.
public static void main(String[] args) throws IOException {
String url = "https://chainlinklabs.com/jobs";
try (final WebClient webClient = new WebClient()) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScriptStartingBefore(10_000);
// System.out.println("--------------------------------");
// System.out.println(page.asXml());
// System.out.println("--------------------------------");
System.out.println("- Jobs -------------------------");
final DomNodeList<DomNode> jobTitles = page.querySelectorAll(".job-title");
for (DomNode domNode : jobTitles) {
System.out.println(domNode.asNormalizedText());
}
System.out.println("--------------------------------");
}
}
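If the fixed wait above turns out to be too short or too long for a particular page, one alternative is to poll for the element you actually need. The following is only a sketch using the JDK alone; WaitUntil, waitUntil and its parameters are hypothetical names introduced here, not part of HtmlUnit.

```java
import java.util.function.BooleanSupplier;

public class WaitUntil {

    /**
     * Polls the condition until it is true or the timeout has passed.
     * Returns true if the condition became true, false on timeout or interrupt.
     */
    public static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "the page content has arrived": becomes true after about 200 ms.
        final long readyAt = System.currentTimeMillis() + 200;
        boolean ready = waitUntil(() -> System.currentTimeMillis() >= readyAt, 2_000, 50);
        System.out.println("ready = " + ready);
    }
}
```

With HtmlUnit, the condition could be something like () -> !page.querySelectorAll(".job-title").isEmpty(), assuming the .job-title selector from the code above still matches the page.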
I am trying to run the tutorial here. The code looks like this:
public class Test {
public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
WebClient client = new WebClient(BrowserVersion.FIREFOX);
HtmlPage page = client.getPage("https://google.com/");
// Getting Form from google home page. tsf is the form name
HtmlForm form = page.getHtmlElementById("tsf"); // Error occurs here
form.getInputByName("q").setValueAttribute("test");
// Creating a virtual submit button
HtmlButton submitButton = (HtmlButton)page.createElement("button");
submitButton.setAttribute("type", "submit");
form.appendChild(submitButton);
// Submitting the form and getting the result
HtmlPage newPage = submitButton.click();
// Getting the result as text
String text = newPage.asNormalizedText();
System.out.println(text);
}
}
But I am getting error message:
Exception in thread "main" com.gargoylesoftware.htmlunit.ElementNotFoundException: elementName=[*] attributeName=[id] attributeValue=[tsf]
at com.gargoylesoftware.htmlunit.html.HtmlPage.getHtmlElementById(HtmlPage.java:1670)
at Test.main(Test.java:20)
Since this tutorial is relatively old, the ID tsf might be outdated. However, when I inspect the form on the Google home page, I can't figure out its name. Maybe I don't understand the meaning of the HtmlForm object. (I am completely new to this topic.)
There is no element with the ID tsf anymore. The best way to check is to open the site and use your browser's web developer tools (F12 in most browsers). You can see the whole HTML document there.
I am trying to parse a website with HtmlUnit and Jsoup, and I am facing this problem.
I have different pages to parse, and I stored the links to these pages in a string array.
I want to loop over the array and parse each page, and I proceed in this way:
1) For loop over the length of the links array
2) Open a new WebClient
3) Create a new HtmlPage from the link with the getPage method
4) Parse and get some elements
5) Close the WebClient
6) Go back to 2).
This way I obtain what I want, but the code is a little slow. So I tried to open and close the WebClient outside the for loop, like this:
1) Open a new WebClient
2) For loop over the length of the links array
3) Create a new HtmlPage from the link with the getPage method
4) Parse and get some elements
5) Go back to 2).
6) Close the WebClient
It's much faster, but I don't obtain the same results as with the previous way.
Is it wrong to use the WebClient constructor in this way?
EDIT:
Here is the code I'm testing:
public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
// TODO Auto-generated method stub
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
String[] links = {"http://www.oddsportal.com/tennis/china/atp-beijing/murray-andy-dimitrov-grigor-fTdGYm3q/#cs;2;6",
"http://www.oddsportal.com/tennis/china/atp-beijing/murray-andy-dimitrov-grigor-fTdGYm3q/#cs;2;9"};
String bm = null;
String[] odds = new String[2];
//Second way
WebClient webClient = new WebClient(BrowserVersion.CHROME);
System.out.println("Client opened");
for (int i=0; i<links.length; i++) {
HtmlPage page = webClient.getPage(links[i]);
System.out.println("Page loaded");
Document csDoc = Jsoup.parse(page.asXml());
System.out.println("Page parsed");
Element table = csDoc.select("table.table-main.detail-odds.sortable").first();
Elements cols = table.select("td:eq(0)");
if (cols.first().text().trim().contains("bet365.it")) {
bm = cols.first().text().trim();
odds[i]=table.select("tbody > tr.lo").select("td.right.odds").first().text().trim();
}
else {
Elements footTable = csDoc.select("table.table-main.detail-odds.sortable");
Elements footRow = footTable.select("tfoot > tr.aver");
odds[i] = footRow.select("td.right").text().trim();
bm = "AVG";
}
webClient.close();
}
System.out.println(bm +"\t" +odds[0] + "\t" + odds[1]);
}
If I run this code, the results are right. If I move webClient.close(); outside the for loop, the results are not correct. In particular, odds[0] is equal to odds[1].
Think about WebClient as the replacement of your browser. Creating a new WebClient is like starting a new browser.
If you want to do something equivalent to opening a new tab in your browser, you can use WebClient#openWindow(..). From a memory point of view, it is a good idea to close the window when you are done.
If you are looking for performance, why do you re-parse the whole page with Jsoup? HtmlUnit retrieves the page, parses it, creates the whole DOM, and runs the JavaScript on top of this DOM before you get the page back from your getPage call.
You are then using HtmlUnit to serialize the DOM tree back to HTML and Jsoup to parse the page again.
HtmlUnit offers many ways to search for elements on a page. I suggest using this API directly on the page you got.
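A side note on the two links in the question: they differ only in the part after the #. The fragment is never sent to the server, so both links request the same document and any difference between them has to be produced by JavaScript running on the page. Whether this is what makes odds[0] equal odds[1] with a shared WebClient is an assumption, not something confirmed here. The sketch below uses only the JDK; FragmentCheck and its methods are hypothetical names.

```java
import java.net.URI;

public class FragmentCheck {

    /** Returns the URL with the fragment (everything from '#' on) removed. */
    public static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    /** True when the two URLs are identical once the fragment is ignored. */
    public static boolean sameResource(String a, String b) {
        return withoutFragment(a).equals(withoutFragment(b));
    }

    public static void main(String[] args) {
        // The two links from the question.
        String first = "http://www.oddsportal.com/tennis/china/atp-beijing/murray-andy-dimitrov-grigor-fTdGYm3q/#cs;2;6";
        String second = "http://www.oddsportal.com/tennis/china/atp-beijing/murray-andy-dimitrov-grigor-fTdGYm3q/#cs;2;9";

        System.out.println("fragment 1: " + URI.create(first).getFragment());
        System.out.println("fragment 2: " + URI.create(second).getFragment());
        System.out.println("same resource: " + sameResource(first, second));
    }
}
```

If the check reports the same resource, it may be worth testing whether the page content really changes between the two getPage calls before blaming the shared WebClient.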
I have problems getting content by URL. I'm using HtmlUnit to parse an HTML page, but when I run my application, the content that is filled in after the JavaScript executes is missing.
I get the HTML without the content I need.
Can anyone help me, please?
Example code:
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38)) {
webClient.waitForBackgroundJavaScript(30 * 1000);
final HtmlPage page = webClient.getPage("http://.....some url");
final String pageAsXml = page.asXml();
final String pageAsText = page.asText();
}
I am using WebClient to get the page source. I have logged in successfully. After that, I use the same object to get the page source from a different URL, but it throws an exception like:
java.lang.ClassCastException: com.gargoylesoftware.htmlunit.UnexpectedPage cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlPage
This is the code I am using:
forms = (List<HtmlForm>) firstPage.getForms();
form = firstPage.getFormByName("");
HtmlTextInput usernameInput = form.getInputByName("email");
HtmlPasswordInput passInput = form.getInputByName("password");
HtmlHiddenInput redirectInput = form.getInputByName("redirect");
HtmlHiddenInput submitInput = form.getInputByName("form_submit");
usernameInput.setValueAttribute(username);
passInput.setValueAttribute(password);
//Create Submit Button
HtmlElement button = firstPage.createElement("button");
button.setAttribute("type", "submit");
button.setAttribute("name", "submit");
form.appendChild(button);
System.out.println(form.asXml());
HtmlPage pageAfterLogin = button.click();
String sourc = pageAfterLogin.asXml();
System.out.println(pageAfterLogin.asXml());
/////////////////////////////////////////////////////////////////////////
The code above runs successfully and logs in.
After that, I use this code:
HtmlPage downloadPage = null;
downloadPage=(HtmlPage)webClient.getPage("url");
But I get this exception:
java.lang.ClassCastException: com.gargoylesoftware.htmlunit.UnexpectedPage cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlPage
The JavaDoc of UnexpectedPage states:
A generic page that is returned whenever an unexpected content type is
returned by the server.
I would advise that you check the content type of webClient.getPage("url");
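To illustrate the kind of check meant here, the following sketch uses small stand-in classes so that it runs on its own. Page, HtmlPage and UnexpectedPage below are hypothetical stand-ins, not HtmlUnit's classes, and load() only imitates a loader that picks the page class from the response content type.

```java
public class PageTypeCheck {

    // Hypothetical stand-ins for HtmlUnit's Page, HtmlPage and UnexpectedPage.
    public interface Page {
        String contentType();
    }

    public static final class HtmlPage implements Page {
        public String contentType() {
            return "text/html";
        }
    }

    public static final class UnexpectedPage implements Page {
        private final String type;

        public UnexpectedPage(String type) {
            this.type = type;
        }

        public String contentType() {
            return type;
        }
    }

    /** Imitates a loader that chooses the page class from the content type. */
    public static Page load(String contentType) {
        return contentType.startsWith("text/html") ? new HtmlPage() : new UnexpectedPage(contentType);
    }

    /** Checks the runtime type first instead of casting blindly. */
    public static String describe(Page page) {
        if (page instanceof HtmlPage) {
            return "html";
        }
        return "not html: " + page.contentType();
    }

    public static void main(String[] args) {
        System.out.println(describe(load("text/html")));
        System.out.println(describe(load("application/pdf")));
    }
}
```

With HtmlUnit itself, the same idea would be to assign the result of getPage to a Page variable, test it with instanceof HtmlPage, and otherwise look at page.getWebResponse().getContentType() to see what the server actually sent.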
Instead of using
HtmlPage downloadPage = null;
downloadPage=(HtmlPage)webClient.getPage("url");
use
UnexpectedPage downloadPage = null;
downloadPage=(UnexpectedPage)webClient.getPage("url");
It worked fine for me.