I'm having trouble getting page content by URL. I'm using HtmlUnit to parse an HTML page, but when I run my application I don't get the content that is filled in after the JavaScript has executed.
I get the HTML without the content I need.
Can anyone help, please?
Example code:
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38)) {
webClient.waitForBackgroundJavaScript(30 * 1000);
final HtmlPage page = webClient.getPage("http://.....some url");
final String pageAsXml = page.asXml();
final String pageAsText = page.asText();
}
Hi, I want to scrape information from a website, so I tried Jsoup (and also HttpClient). I realized that neither of them could "see" certain content of the HTML page: when I print the parsed HTML, I get an empty div like the one below, while other divs print just fine.
Here's my code:
public class Main {
public static void main(String args[]) throws IOException, InterruptedException {
Document doc = Jsoup.connect(url).get();
System.out.println(doc.getElementsByClass("needed content"));
}
}
The result in the terminal is:
<div class="needed content"></div>
I searched for answers on Stack Overflow. Some recommend using the Jackson library:
Java - How do I access a child of Div using JSoup
Some recommend embedding a browser in Java:
Is there a way to embed a browser in Java?
Some recommend using HtmlUnit:
Fail to get full content of page with JSoup
I just tried combining Jsoup with HtmlUnit and got the same result. Here's the code:
try(WebClient wc = new WebClient()){
wc.getOptions().setJavaScriptEnabled(true);
wc.getOptions().setCssEnabled(false);
wc.getOptions().setThrowExceptionOnScriptError(false);
wc.getOptions().setTimeout(10000);
HtmlPage page = wc.getPage("https://chainlinklabs.com/jobs");
String pageXml = page.asXml();
Document doc2 = Jsoup.parse(pageXml, url);
System.out.println(doc2.getElementsByClass("needed content"));
System.out.println("Thank God!");
}
My interpretation of the problem is that Jsoup is not showing part of the HTML content because it contains JavaScript. Am I heading in the right direction?
There is no need to re-parse the page from HtmlUnit with jsoup (it is a waste of resources). All the selection options are also available in HtmlUnit (see https://htmlunit.sourceforge.io/gettingStarted.html), and maybe more.
This simple code works for me. Parts of the page are generated by a JavaScript script that runs asynchronously, so you have to wait for these scripts to finish before accessing the page.
public static void main(String[] args) throws IOException {
String url = "https://chainlinklabs.com/jobs";
try (final WebClient webClient = new WebClient()) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScriptStartingBefore(10_000);
// System.out.println("--------------------------------");
// System.out.println(page.asXml());
// System.out.println("--------------------------------");
System.out.println("- Jobs -------------------------");
final DomNodeList<DomNode> jobTitles = page.querySelectorAll(".job-title");
for (DomNode domNode : jobTitles) {
System.out.println(domNode.asNormalizedText());
}
System.out.println("--------------------------------");
}
}
I am trying to run the tutorial from here. The code looks like this:
public class Test {
public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
WebClient client = new WebClient(BrowserVersion.FIREFOX);
HtmlPage page = client.getPage("https://google.com/");
// Getting Form from google home page. tsf is the form name
HtmlForm form = page.getHtmlElementById("tsf"); // Error occurs here
form.getInputByName("q").setValueAttribute("test");
// Creating a virtual submit button
HtmlButton submitButton = (HtmlButton)page.createElement("button");
submitButton.setAttribute("type", "submit");
form.appendChild(submitButton);
// Submitting the form and getting the result
HtmlPage newPage = submitButton.click();
// Getting the result as text
String text = newPage.asNormalizedText();
System.out.println(text);
}
}
But I am getting error message:
Exception in thread "main" com.gargoylesoftware.htmlunit.ElementNotFoundException: elementName=[*] attributeName=[id] attributeValue=[tsf]
at com.gargoylesoftware.htmlunit.html.HtmlPage.getHtmlElementById(HtmlPage.java:1670)
at Test.main(Test.java:20)
Since this tutorial is relatively old, the ID tsf might be outdated. However, when I look for the form name on the Google home page, I can't figure it out. Maybe I don't understand the meaning of the whole HtmlForm object. (I am completely new to this topic.)
There is no element with the ID tsf anymore. The best way to check is to open the site and use your browser's developer tools (F12 in most browsers). You can see the whole HTML document from there.
I'm trying to get the page content that the JavaScript function getWines() returns. The page I'm trying to get info from is http://hedonism.co.uk/wines/. I'm using HtmlUnit and wrote the following code:
final WebClient webClient = new WebClient(
BrowserVersion.INTERNET_EXPLORER_10);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setJavaScriptEnabled(false);
final HtmlPage page = webClient.getPage(url);
String javaScriptCode = "getWines(1)";
ScriptResult result = page.executeJavaScript(javaScriptCode);
Page page1 = result.getNewPage();
StringBuffer p = WebGet.getBuffPageContent(page1.getUrl().toString(), true);
System.out.println(p.toString());
But this approach doesn't seem to work. I receive the same page I had before the function call, with the same source code, so I can't get information such as the wine name. Maybe I'm doing something completely wrong?
I am writing a Java program using HtmlUnit for a page with a radio button that must be clicked before a set of fields can be filled out. I am having trouble finding the fields that need to be entered after the radio button is clicked. Currently my code is:
String url = "http://cpdocket.cp.cuyahogacounty.us/";
final WebClient webClient = new WebClient();
final HtmlPage page = webClient.getPage(url);
final HtmlForm form = page.getForms().get(0);
final HtmlElement button = form.getElementById("SheetContentPlaceHolder_btnYes");
final HtmlPage page2 = button.click();
try {
synchronized (page2) {
page2.wait(3000);
}
}
catch(InterruptedException e)
{
System.out.println("error");
}
//returns the first page after the security page
final HtmlForm form2 = page2.getForms().get(0);
final HtmlRadioButtonInput button2 = form2.getInputByValue("forcl");
button2.setDefaultChecked(true);
page2.refresh();
final HtmlForm form3 = page2.getForms().get(0);
form3.getInputByName("ctl00$SheetContentPlaceHolder$foreclosureSearch$txtZip").setValueAttribute("44106");
final HtmlSubmitInput button3 = form3.getInputByValue("Submit");
final HtmlPage page3 = button3.click();
try {
synchronized (page3) {
page2.wait(10000);
}
}
catch(InterruptedException e)
{
System.out.println("error");
}
The first page is a security page that needs to be bypassed. The second page is where I run into the issue, as I get this error:
com.gargoylesoftware.htmlunit.ElementNotFoundException: elementName=[input] attributeName=[name] attributeValue=[ctl00$SheetContentPlaceHolder$foreclosureSearch$txtZip]
at com.gargoylesoftware.htmlunit.html.HtmlForm.getInputByName(HtmlForm.java:463)
at Courtscraper.scrapeWebsite(Courtscraper.java:58)
I believe this means that the input field cannot be found in the form. I have been using two websites as references: Website1 and Website2. I am not sure, but I believe I may have to create a new HtmlPage after setting the radio button to true.
Without knowing the page, it is impossible to say why the error is happening. However, as you say, it is clear that getInputByName is not finding the element and is throwing the exception.
Given that code, and assuming you have not made a typo in the name string passed to getInputByName, I would suggest removing this line:
page2.refresh();
Refreshing the page after making modifications to it might result in getting an unmodified page again.
Regarding creating a new HtmlPage after setting the radio button to true, that would only be necessary if the radio has an onchange or a similar event attached that fires a JavaScript AJAX call that modifies the DOM and creates the element that you are trying to fetch.
That's all I can suggest given that code.
In your code, after creating page2, send a WebRequest instead of creating a new page, like this:
String url = "http://cpdocket.cp.cuyahogacounty.us/Search.aspx";
String EventTarget = "ctl00$SheetContentPlaceHolder$rbCivilForeclosure";
String world = "ctl00$SheetContentPlaceHolder$UpdatePanel1|ctl00$SheetContentPlaceHolder$rbCivilForeclosure";
String Viewstate = page2.getElementById("__VIEWSTATE").getAttribute("value");
String EventValidation = page2.getElementById("__EVENTVALIDATION").getAttribute("value");
WebRequest req1 = new WebRequest(new URL(url));
req1.setHttpMethod(HttpMethod.POST);
req1.setAdditionalHeader("Origin", "http://cpdocket.cp.cuyahogacounty.us");
req1.setAdditionalHeader("Referer", "http://cpdocket.cp.cuyahogacounty.us/Search.aspx");
req1.setAdditionalHeader("X-Requested-With", "XMLHttpRequest");
// The single-argument URLEncoder.encode(String) is deprecated; pass the charset explicitly.
String txtview1 = "ctl00$ScriptManager1=" + URLEncoder.encode(world, "UTF-8")
+ "&__EVENTTARGET=" + URLEncoder.encode(EventTarget, "UTF-8")
+ "&__EVENTARGUMENT=&__LASTFOCUS=&__VIEWSTATE=" + URLEncoder.encode(Viewstate, "UTF-8")
+ "&__EVENTVALIDATION=" + URLEncoder.encode(EventValidation, "UTF-8")
+ "&ctl00$SheetContentPlaceHolder$rbSearches=forcl&__ASYNCPOST=true&";
//System.out.println("this is text view =============== " + txtview1);
req1.setRequestBody(txtview1);
req1.setAdditionalHeader("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8");
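// Reuse the WebClient from the question (named webClient there) so cookies and session state carry over.
String re = webClient.getPage(req1).getWebResponse().getContentAsString();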
System.out.println("========== " + re);
Once the code above has run successfully, you get the response as a String.
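Assembling the request body by hand makes it easy to miss an encoding step. Here is a minimal sketch of building the same kind of application/x-www-form-urlencoded body with only the JDK, assuming Java 10 or later for URLEncoder.encode(String, Charset). The class name FormBody and the sample field values are hypothetical and only for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FormBody {
    // Encodes every field name and value, then joins the pairs with '&'.
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("__EVENTTARGET", "ctl00$SheetContentPlaceHolder$rbCivilForeclosure");
        fields.put("__EVENTARGUMENT", "");
        fields.put("__ASYNCPOST", "true");
        System.out.println(encode(fields));
    }
}
```

Unlike the hand-built string, this also encodes the `$` in field names as `%24`. That is standard form encoding, and the server decodes it back to the same name.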
I am using a WebClient to get page source. I have logged in successfully. After that, I use the same object to get the page source of a different URL, but it throws an exception like:
java.lang.ClassCastException: com.gargoylesoftware.htmlunit.UnexpectedPage cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlPage
This is the code I am using:
forms = (List<HtmlForm>) firstPage.getForms();
form = firstPage.getFormByName("");
HtmlTextInput usernameInput = form.getInputByName("email");
HtmlPasswordInput passInput = form.getInputByName("password");
HtmlHiddenInput redirectInput = form.getInputByName("redirect");
HtmlHiddenInput submitInput = form.getInputByName("form_submit");
usernameInput.setValueAttribute(username);
passInput.setValueAttribute(password);
//Create Submit Button
HtmlElement button = firstPage.createElement("button");
button.setAttribute("type", "submit");
button.setAttribute("name", "submit");
form.appendChild(button);
System.out.println(form.asXml());
HtmlPage pageAfterLogin = button.click();
String sourc = pageAfterLogin.asXml();
System.out.println(pageAfterLogin.asXml());
/////////////////////////////////////////////////////////////////////////
The code above runs successfully and logs in.
After that, I use this code:
HtmlPage downloadPage = null;
downloadPage=(HtmlPage)webClient.getPage("url");
But I get this exception:
java.lang.ClassCastException: com.gargoylesoftware.htmlunit.UnexpectedPage cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlPage
The JavaDoc of UnexpectedPage states:
A generic page that is returned whenever an unexpected content type is
returned by the server.
I would advise that you check the content type of the response returned by webClient.getPage("url").
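To act on that advice before a cast can fail, one option is to inspect the Content-Type header and only treat the response as HTML when it says so. Below is a minimal sketch using only the JDK. The class name ContentTypes and the method isHtml are hypothetical, and the assumption that only text/html and application/xhtml+xml end up as an HtmlPage should be checked against your HtmlUnit version.

```java
import java.util.Locale;

public class ContentTypes {
    // Returns true when the media type, with parameters such as charset removed,
    // is HTML or XHTML. Assumption: other types are not returned as HtmlPage.
    static boolean isHtml(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return false;
        }
        int semi = contentTypeHeader.indexOf(';');
        String mediaType = (semi >= 0 ? contentTypeHeader.substring(0, semi) : contentTypeHeader)
                .trim()
                .toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html") || mediaType.equals("application/xhtml+xml");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=UTF-8"));
        System.out.println(isHtml("application/pdf"));
    }
}
```

With HtmlUnit, the header value would come from the page's WebResponse (for example via getContentType()), after assigning the result of getPage to the general Page type rather than casting it straight to HtmlPage.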
Instead of using
HtmlPage downloadPage = null;
downloadPage=(HtmlPage)webClient.getPage("url");
Use
UnexpectedPage downloadPage = null;
downloadPage=(UnexpectedPage)webClient.getPage("url");
It worked fine for me.