HtmlUnit - Get returned information - java

I am struggling with the last part of my project, which uses HtmlUnit. I have successfully filled out the form details and clicked the submit button, but this returns a Page object:
Page submitted = button.click();
The API for the Page interface can be found here - http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/Page.html . I have spent a while trawling through the API to see how, given the page returned after clicking the button, I can access the HTML table on that page.
Would anyone be able to help me with the appropriate method calls I would need to complete this?
Thanks

If the page returned is truly HTML (and not, for instance, a zip file) you can do this:
HtmlPage htmlPage = (HtmlPage) button.click();
DomNodeList<HtmlElement> nodes = htmlPage.getElementsByTagName("table");
...
HtmlTable table = getTheTableIWant(nodes);
doSomethingWith(table);
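If going further through the HtmlUnit API is inconvenient, another route is to serialize the page with asXml() and read the table with the JDK's own XPath classes. The sketch below is only an illustration of that idea: the class name, the method name and the sample markup are made up, and it assumes the text is well-formed XML with no DTD that needs fetching.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reads the cells of the first <table> in an XML string.
final class TableFromXml {

    static List<List<String>> readFirstTable(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // All rows of the first table in document order.
            NodeList rows = (NodeList) xpath.evaluate(
                    "(//table)[1]//tr", doc, XPathConstants.NODESET);
            List<List<String>> result = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                // Header and data cells that are direct children of this row.
                NodeList cells = (NodeList) xpath.evaluate(
                        "./td | ./th", rows.item(i), XPathConstants.NODESET);
                List<String> row = new ArrayList<>();
                for (int j = 0; j < cells.getLength(); j++) {
                    row.add(cells.item(j).getTextContent().trim());
                }
                result.add(row);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample; in practice the string would come from htmlPage.asXml().
        String xml = "<html><body><table>"
                + "<tr><th>Name</th><th>Qty</th></tr>"
                + "<tr><td>Apple</td><td>3</td></tr>"
                + "</table></body></html>";
        readFirstTable(xml).forEach(System.out::println);
    }
}
```

Each inner list is one table row, so picking a particular cell is a matter of indexing into the result.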

Related

Easiest way to "browse" to a page and submit form in Java

What I need to do is browse to a webpage, log in, then browse to another webpage on that site that requires you to be logged in, so it needs to save cookies. After that, I need to click an element on that page, fill out the form that appears, and get the message that the webpage returns to me. The reason I need to actually go to the page and click the button, as opposed to just navigating directly to the link, is that you are assigned a session ID every time you log in and click the link, and it's always different. The button looks like this; it's not a normal href link:
<span id=":tv" idlink="" class="sA" tabindex="0" role="link">Next</span>
Anyway, what would be the easiest way to do this? Thanks.
Update:
After trying HtmlUnit and other headless browser libraries, it doesn't seem that this is going to work with anything "headless." Another thing I recently found out about this page is that all the HTML is in some weird format: it's all inside a script tag. Here is a sample.
"?ui\x3d2\x26view\x3dss\x26mset\x3dmain\x26ver\x3d-68igm85d1771\x26am\x3d!Zsl-0RZ-XLv0BO3aNKsL0sgMg3nH10t5WrPgJSU8CYS-KNWlyrLmiW3HvC5ykER_n_5dDw\x26fri"],"http://example.com/?ctx\x3d%67mail\x26hl\x3den",,0,"Gmail","Gmail",[["us","c130f0854ca2c2bb",[["n"],["m","New features!"],["u"],["k","0"],["p","1000:500000,10,200000,5,100000,3,75000,2,0,1"],["h","https://survey.googleratings.com/wix/p1679258.aspx?l\x3d1033"],["at","query,5,contacts,5,adv,5,cf,5,default,20"],["v","https://www.youtube.com/embed/Ra8HG6MkOXY?showinfo\x3d0"],
When I do inspect element on the button, the HTML code that I posted above for the button comes up, but not when doing view source. Basically, what I am going to need to do is use some sort of GUI and have the user navigate to the link and then have the program fill out the info. Does anyone know how I can do this? Thanks.
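For reference, the \x3d and \x26 sequences in the sample are JavaScript hex escapes for = and &, so the first string is really a URL query. A small sketch of undoing them in Java follows; the class and method names are made up for illustration, and it only handles the two-digit \xNN form seen in the sample.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns JavaScript \xNN escapes back into characters.
final class JsHexUnescape {

    // A literal backslash, an 'x', then exactly two hex digits.
    private static final Pattern HEX_ESCAPE = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");

    static String decode(String escaped) {
        return HEX_ESCAPE.matcher(escaped).replaceAll(m ->
                // quoteReplacement keeps characters such as '$' from being
                // treated as group references in the replacement string.
                Matcher.quoteReplacement(
                        String.valueOf((char) Integer.parseInt(m.group(1), 16))));
    }

    public static void main(String[] args) {
        // Start of the sample string from the page's script tag.
        System.out.println(decode("?ui\\x3d2\\x26view\\x3dss\\x26mset\\x3dmain"));
        // prints ?ui=2&view=ss&mset=main
    }
}
```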
Have a look at the 5 Minute Getting Started Guide for Selenium: http://code.google.com/p/selenium/wiki/GettingStarted
On the login page, look at the form's HTML to see the url it posts to and the url parameters. Then request that url with the same parameters filled in with correct info, and make sure to save all the cookie headers to send to the second page. Then use an html parser to find your link. There are several html parsers available on sourceforge, and you could even try java's built in xml parsers, though if the site has even a tiny html mistake they will glitch.
EDIT: I didn't notice that it is not a normal link. In that case you will need to look at the site's JavaScript to see where the link leads. If the link requires JavaScript to run, it gets more complicated. Plain Java cannot run a page's JavaScript the way a browser does, but I found a library called DJ Native Swing which includes a web browser class that you can add to JFrames. It uses your native browser to render and to run JavaScript.
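For the plain-form case described before the edit, posting "the same parameters" means sending them URL-encoded in the request body. Here is a minimal sketch using only the JDK (Java 11+). The field names, the URL and the class name are placeholders, not taken from any real login page, and the request is only built, not sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: builds the POST a browser would send for a login form.
final class LoginFormPost {

    // Encodes fields as application/x-www-form-urlencoded: name=value&name=value
    static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // actionUrl is whatever the form's action attribute points at.
    static HttpRequest buildLoginRequest(String actionUrl, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(actionUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(fields)))
                .build();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("UserName", "My User");   // placeholder field names and values
        fields.put("Password", "p&ss=1");
        System.out.println(formEncode(fields));
        HttpRequest request = buildLoginRequest("http://example.com/login", fields);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request through an HttpClient that has a cookie handler configured would take care of the "save all the cookie headers" part.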
This should be possible in Selenium as others have noted.
I have used Selenium to log in, then crawl a site and discover every permutation of values for every form on the site (30+ forms). These values are later used to fill and submit the form with a specific permutation of values. This site was very JS/jQuery heavy, and I used Selenium's built-in support for a JavaScript executor, CSS selectors, and XPath to accomplish this.
I implemented HtmlUnit and HttpUnit as faster alternatives, but found they were not as reliable as Selenium given the JS semantics of the site I was crawling.
It's hard to give you code on how to accomplish it because your Selenium implementation will be quite page-specific and I can't look at the page you're coding against to figure out what's going on with that button script junk. However, I have included some possibly relevant Selenium code (Java) snippets:
WebElement element = driver.findElement(By.id(value)); //find element on page
List<WebElement> buttons = parent.findElements(By.xpath("./tr/td/button")); //find child elements
button.click(); //click a single element, e.g. buttons.get(0)
element.submit(); //submit enclosing form
element.sendKeys(text); //enter text in an input
String elementText = (String) ((JavascriptExecutor) driver).executeScript("return arguments[0].innerText || arguments[0].textContent", element); //interact with a selenium element via JS
If you are coding similar functions on different pages, then PageObjects behind interfaces can help.
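To show what "PageObjects behind interfaces" can look like, here is a small sketch in plain Java with no Selenium dependency. All names are made up, and the in-memory implementation only stands in for a real one that would wrap a WebDriver and call findElement, sendKeys and submit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PageObjectDemo {

    // What calling code depends on: the operations a form page offers,
    // not how a particular page's HTML is laid out.
    interface FormPage {
        void enter(String fieldName, String text);
        String submit(); // returns the text of the resulting page
    }

    // Hypothetical in-memory stand-in, useful for tests. A Selenium-backed
    // implementation of FormPage would live in its own class per page.
    static final class FakeFormPage implements FormPage {
        private final Map<String, String> values = new LinkedHashMap<>();

        @Override
        public void enter(String fieldName, String text) {
            values.put(fieldName, text);
        }

        @Override
        public String submit() {
            return "submitted " + values;
        }
    }

    // The same routine works for every page that implements FormPage.
    static String fillAndSubmit(FormPage page, Map<String, String> values) {
        values.forEach(page::enter);
        return page.submit();
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("UserName", "alice");
        System.out.println(fillAndSubmit(new FakeFormPage(), values));
    }
}
```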
The link Anew posted is a good starting point and good ol' StackOverflow has answers to just about any Selenium problem ever.
Instead of trying to browse around programmatically, try executing the login request and save the cookies then set those in the next request to the form post.
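A rough sketch of the cookie part using the JDK's java.net.CookieManager, without any network calls: it stores what a login response set and produces the Cookie header for the next request. The URLs, the cookie value and the class name are invented for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.List;
import java.util.Map;

// Hypothetical helper: carries cookies from one response to the next request.
final class CookieCarryOver {

    private final CookieManager cookies = new CookieManager();

    // Call with the Set-Cookie header values of the login response.
    void rememberResponse(URI responseUri, List<String> setCookieHeaders) {
        try {
            cookies.put(responseUri, Map.of("Set-Cookie", setCookieHeaders));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Value to send as the Cookie header of the follow-up request.
    String cookieHeaderFor(URI nextRequestUri) {
        try {
            List<String> values = cookies.get(nextRequestUri, Map.of())
                    .getOrDefault("Cookie", List.of());
            return String.join("; ", values);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CookieCarryOver jar = new CookieCarryOver();
        // Pretend the login response set a session cookie.
        jar.rememberResponse(URI.create("http://example.com/login"),
                List.of("SESSIONID=abc123; Path=/"));
        // The form post to the same host gets the cookie back.
        System.out.println(jar.cookieHeaderFor(URI.create("http://example.com/form")));
    }
}
```

With java.net.http.HttpClient, passing a CookieManager to HttpClient.newBuilder().cookieHandler(...) should make the client do this bookkeeping on every request.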
HtmlUnit is pretty bad at processing JavaScript: its Rhino-based JS engine often produces errors (a page that runs with no errors is very much the exception). I would advise using Selenium, which is basically a framework for controlling real browsers (Chrome, Firefox).
For your question, the following code would do the work
selenium.open(myurl);
selenium.click("id=:tv");
You then have to wait for the page to load
selenium.waitForPageToLoad(someTime);
I would recommend htmlunit any day. It's a great library.
First, check out their web page (http://htmlunit.sourceforge.net/) to get HtmlUnit up and running. Make sure you use the latest version (2.12 at the time of writing).
Try these settings to ignore pretty much any obstacle:
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17);
webClient.getOptions().setRedirectEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getCookieManager().setCookiesEnabled(true);
Then when fetching your page, make sure you wait for background Javascript before doing anything with the page, like posting a login form:
//Get Page
HtmlPage page1 = webClient.getPage("https://login-url/");
//Wait for background Javascript
webClient.waitForBackgroundJavaScript(10000);
//Get first form on page
HtmlForm form = page1.getForms().get(0);
//Get login input fields using input field name
HtmlTextInput userName = form.getInputByName("UserName");
HtmlPasswordInput password = form.getInputByName("Password");
//Set input values
userName.setValueAttribute("MyUserName");
password.setValueAttribute("MyPassword");
//Find the first button inside the form (".//" keeps the XPath relative to the form; "//button" would search the whole page)
HtmlElement button = (HtmlElement) form.getFirstByXPath(".//button");
//Post by clicking the button and cast the result, login arrival url, to a new page and repeat what you did with page1 or something else :)
HtmlPage page2 = (HtmlPage) button.click();
//Profit
System.out.println(page2.asXml());
I hope this basic example will help you!

HtmlUnit webscraping Anchor tag with dropdown link that has JavaScript

Is it possible to click on a link using HtmlUnit when that link shows a dropdown list of links on mouseover? Clicking the initial link does nothing; the list of links only drops down when you mouse over it. I would like to click one of the dropdown links and grab the web page that is associated with that link.
The problem seems to be that the Anchor has JavaScript and also it is a drop down list. If the Anchor did not have JavaScript and drop down then I would not have any problems.
Here is the pertinent JavaScript Code:
<script language='JavaScript' type='text/javascript'>
<!--
function mmLoadMenus(){
window.mm_menu_0805151542_0 = new Menu("root",211,23,"Arial, Helvetica, sans-serif",11,"#FFFFFF","#FFFFFF","#056CB9","#014D98","left","middle",3,0,1000,-5,7,true,false,true,2,true,false);
mm_menu_0805151542_0.addMenuItem("View Tax Sales","window.open('TCTaxSaleBrief.asp', '_blank','width=800,height=580,scrollbars=1,resizable=yes,top=50,left=100');");
mm_menu_0805151542_0.addMenuItem("Registration Renewal Reprint","window.open('vrRenewal.asp', '_blank','width=800,height=580,scrollbars=1,resizable=yes,top=50,left=100');");
mm_menu_0805151542_0.addMenuItem("Drivers License","window.open('http://www.dds.ga.gov/', '_blank');");
mm_menu_0805151542_0.addMenuItem("Online Tag Renewals","location='../TaxCommissioner/TagRenewal.html'");
mm_menu_0805151542_0.hideOnMouseOut=true;
mm_menu_0805151542_0.bgColor='#CCCCCC';
mm_menu_0805151542_0.menuBorder=0;
mm_menu_0805151542_0.menuLiteBgColor='#FFFFFF';
mm_menu_0805151542_0.menuBorderBgColor='#015BA7';
</script>
Here is the pertinent Anchor:
Online Services<br />
Here is the snippet of Java Code that I am using to make this work.
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_10);
String webPage="http://website.html";
try {
HtmlPage taxComPage = webClient.getPage(webPage);
HtmlElement htmlElement = taxComPage.getDocumentElement();
//HtmlAnchor anchor = taxComPage.getAnchorByText("View Tax Sales");
//HtmlAnchor htmlAnchor = taxComPage.getHtmlElementById("link10");
HtmlAnchor anchor = taxComPage.getAnchorByText("Online Services");
HtmlPage page = anchor.click();
}catch
If it is the case that HtmlUnit does not work with JavaScript please let me know!
Thanks
I understand that there is a function called mmLoadMenus() which holds the text that is displayed on mouseover, but I am having trouble seeing how this function is associated with the anchor. In the anchor there is something called MM_showMenu. What is MM_showMenu, and who created it? Is it a JavaScript keyword? I don't see it being defined anywhere. I have searched the whole page, and the only place it is mentioned is in the anchor. It seems to be some type of function, with the parameters window.mm_menu_0805151542_0,104,0,null,'link11' being passed to it. The only connection that I can make between mmLoadMenus() and the anchor is that the anchor has mm_menu_0805151542_0 in it. I am not that well versed in JavaScript; maybe that is why I am not making a strong connection between the JavaScript function and the anchor.
The data is already on the page, so why not scrape it from the JavaScript function itself? It is just a matter of parsing out the text, which is much easier than trying to force the menu to load.
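As a sketch of that parsing, the following pulls each menu label and its target out of the addMenuItem(...) calls with regular expressions. The class name and patterns are only an illustration and assume the calls keep the exact shape shown in the question; the script text itself would have to be obtained from the page first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts label -> target pairs from the menu script.
final class MenuItemScraper {

    // addMenuItem("Label","action") with the label and action captured.
    private static final Pattern ITEM =
            Pattern.compile("addMenuItem\\(\"([^\"]*)\",\\s*\"([^\"]*)\"\\)");

    // The first quoted value after window.open( or location= inside the action.
    private static final Pattern TARGET =
            Pattern.compile("(?:window\\.open\\(|location=)'([^']*)'");

    static Map<String, String> extract(String script) {
        Map<String, String> items = new LinkedHashMap<>();
        Matcher item = ITEM.matcher(script);
        while (item.find()) {
            Matcher target = TARGET.matcher(item.group(2));
            if (target.find()) {
                items.put(item.group(1), target.group(1));
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // Two of the lines from the question's script.
        String script =
                "mm_menu_0805151542_0.addMenuItem(\"View Tax Sales\",\"window.open('TCTaxSaleBrief.asp', '_blank','width=800,height=580,scrollbars=1,resizable=yes,top=50,left=100');\");\n"
                + "mm_menu_0805151542_0.addMenuItem(\"Online Tag Renewals\",\"location='../TaxCommissioner/TagRenewal.html'\");";
        extract(script).forEach((label, url) -> System.out.println(label + " -> " + url));
    }
}
```

The targets are relative URLs in most cases, so they still need to be resolved against the page's own URL before being fetched with webClient.getPage(...).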

Posting data using web client

I am creating a submit button using a web client, but it's not working.
This is the code, which I am using:
HtmlElement button = firstPage.createElement("button");
button.setAttribute("type", "submit");
button.setAttribute("name", "submit");
form.appendChild(button);
System.out.println(form.asXml());
HtmlPage pageAfterLogin = button.click();
Simple question: does the page include an HTML FORM tag?
Also see the link below, though I'm not sure it will help:
HtmlUnit, how to post form without clicking submit button?

Getting Final HTML with Javascript rendered Java as String

I want to fetch data from an HTML page (scrape it), but its reviews are loaded by JavaScript. With a normal Java URL fetch I only get the original HTML, without the JavaScript executed. I want the final page with the JavaScript executed.
Example :- http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp
This page has comments in a Facebook plugin, which are fetched by JavaScript.
The same happens on this page:
http://www.imdb.com/title/tt0848228/reviews
What should I do?
Use phantomjs: http://phantomjs.org
var page = require('webpage').create();
page.open("http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp")
setTimeout(function(){
// Where you want to save it
page.render("screenshoot.png")
// You can access its content using jQuery
var fbcomments = page.evaluate(function(){
return $(".fb-comments iframe").contents().find(".postContainer")
})
},10000)
You have to start phantomjs with the option --web-security=no to allow cross-domain interaction (i.e. for the Facebook iframe).
To communicate with other applications from phantomjs you can use a web server or make a POST request: https://github.com/ariya/phantomjs/blob/master/examples/post.js
You can use HtmlUnit, a Java-based "GUI-less browser". You can easily get the final rendered output of any page, because it loads the page as a web browser does, executing its JavaScript, and returns the result. You can disable this behaviour though.
UPDATE: You were asking for example? You don't have to do anything extra for doing that:
Example:
WebClient webClient = new WebClient();
HtmlPage myPage = ((HtmlPage) webClient.getPage(myUrl));
UPDATE 2: You can get iframe as follows:
HtmlPage myFrame = (HtmlPage) myPage.getFrameByName(myIframeName).getEnclosedPage();
Please read the documentation at the link above. There is nothing you can't do when it comes to getting page content in HtmlUnit.
Here is a simple way to solve that problem.
Hello, you can use HtmlUnit, a Java API. I think it can help you access the content produced by the executed JS as plain HTML.
WebClient webClient = new WebClient();
HtmlPage myPage = (HtmlPage) webClient.getPage(new URL("YourURL"));
System.out.println(myPage.getVisibleText());

Java HTMLUnit - How do I access the page DOM I'm sent to after submitting a form

I'm very confused on how to access the newly loaded page's DOM after I .click() the submit button on a page and am sent to another one.
Any ideas?
Thanks.
The click() method should return an HtmlPage object with the new page:
HtmlPage newPage = myElement.click();
Since you're being redirected to a new page, wouldn't you handle that page's DOM just like you normally would? In jQuery you would use:
$(document).ready()
And when the DOM has finished loading, you can manipulate it however you want.
