I know you may think this question is stupid, but I need to use HtmlUnit. However, it returns a page either as XML or as text.
I don't know how to get the pure HTML (the same as the source code a browser shows).
I need this because I have to feed the page to some already-written modules. Any ideas?
You can use the following piece of code to achieve your goal:
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebResponse;
WebClient webClient = new WebClient();
Page page = webClient.getPage("http://example.com");
WebResponse response = page.getWebResponse();
// the raw response body, exactly as the server sent it
String content = response.getContentAsString();
See the javadoc of the WebResponse.getContentAsString() method.
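For contrast (a sketch, not from the original answer; the URL is a placeholder): page.asXml() gives you HtmlUnit's re-serialized DOM, which is normalized and may differ from the original source, while getContentAsString() returns the body exactly as the server sent it.
WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage("http://example.com");
// exactly what the server sent:
String rawSource = page.getWebResponse().getContentAsString();
// HtmlUnit's normalized re-serialization of the parsed DOM:
String normalized = page.asXml();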
I'm testing my website, and what I do is move through it using the HtmlUnit library in Java, like this for example:
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
HtmlPage page1 = webClient.getPage(mypage);
// sent using POST
HtmlForm form = page1.getForms().get(0);
HtmlSubmitInput button = form.getInputByName("myButton");
HtmlPage page2 = button.click();
// I want to open page2 on a web browser and continue there using a function like
// continueOnBrowser(page2);
I filled in a form programmatically using HtmlUnit and then submitted it; the form uses the POST method. Now I'd like to see the content of the response in a web browser window. Simply opening the URL in a browser doesn't work, since the page is the response to a POST request.
This seems like the wrong approach to me; obviously, if you do everything programmatically, you can't expect to just open the browser and continue there. But I can't figure out what would solve my problem.
Do you have any suggestions?
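One possible workaround (a sketch, not from the original thread): dump the response that HtmlUnit received to a temporary file and open that file in the default desktop browser. Note the browser session will not share HtmlUnit's cookies, so dynamic content may render differently.
import java.awt.Desktop;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
// assumes page2 from the snippet above
File tmp = File.createTempFile("htmlunit-response", ".html");
Files.write(tmp.toPath(),
        page2.getWebResponse().getContentAsString().getBytes(StandardCharsets.UTF_8));
Desktop.getDesktop().browse(tmp.toURI());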
So I'm trying to use HtmlUnit to go to a URL, but visiting that URL downloads a JSON file containing the data I want. I'm not sure how to word this, but basically: in HtmlUnit, how can I get the result from a downloaded file?
I'm bad at explaining, so here, look.
I'm trying to check username availability with this:
private static final String URL = "https://twitter.com/users/username_available?username=";
...
HtmlPage page = webClient.getPage(URL + users[finalUsersIndex]);
That basically creates a new page for each username. The thing is, URL + username returns a JSON file describing the user's availability. I know how to read the JSON file, but the problem is this:
java.lang.ClassCastException: com.gargoylesoftware.htmlunit.UnexpectedPage cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlPage
I get that on this line:
HtmlPage page = webClient.getPage(URL + users[finalUsersIndex]);
I suppose I need to create a different kind of page for the response, but how would I do that, given that the file downloads automatically rather than, say, after clicking a button? (Correct me if I'm wrong.)
Sorry, it's 4 AM.
As its name indicates, an HtmlPage is a page containing HTML. JSON is not HTML.
As the documentation indicates:
The DefaultPageCreator will create a Page depending on the content type of the HTTP response, basically HtmlPage for HTML content, XmlPage for XML content, TextPage for other text content and UnexpectedPage for anything else.
(emphasis mine).
So, as the exception you're getting indicates, the behavior you observed is the documented behavior: you're getting a page that is neither HTML, nor XML, nor text, so you get an UnexpectedPage.
Your code should thus be:
UnexpectedPage page = webClient.getPage(URL + users[finalUsersIndex]);
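From there you can read the JSON body off the response yourself (a minimal sketch, using the same getContentAsString() call shown earlier in this thread):
UnexpectedPage page = webClient.getPage(URL + users[finalUsersIndex]);
// the raw JSON body of the response
String json = page.getWebResponse().getContentAsString();
// parse `json` with whatever JSON library you already use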
When I used HttpUnit, I would invoke its getCurrentPage() method to get the current page. How can I do that in HtmlUnit? I tried webClient.getHomePage(), but that seems to return the HtmlUnit website.
One suggestion I got was to call getPage with the previous URL, but that doesn't work for me: I'm refactoring existing code, and its structure makes it impossible to re-execute the previous request.
You can use the following approach to get the HtmlPage object from the WebClient, assuming you have already navigated to a page, whether by wc.getPage(url), by submitting a form on a previous page, or by any other method. Here wc is the WebClient object:
HtmlPage currentPage = (HtmlPage) wc.getCurrentWindow().getEnclosedPage();
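In context (a sketch; the URL is a placeholder): navigation can happen in one place, and the current page can be recovered later without re-executing the request.
WebClient wc = new WebClient();
wc.getPage("http://example.com/start"); // navigate somewhere first
// ... later, in code that never saw the original request:
HtmlPage currentPage = (HtmlPage) wc.getCurrentWindow().getEnclosedPage();
System.out.println(currentPage.getTitleText());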
I want to fetch data from an HTML page (scrape it). But it contains reviews rendered by JavaScript. With a plain Java URL fetch I only get the raw HTML, without the JavaScript executed. I want the final page after the JavaScript has run.
Example: http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp
This page pulls in comments through a Facebook plugin, which are fetched by JavaScript.
The same applies to this page:
http://www.imdb.com/title/tt0848228/reviews
What should I do?
Use phantomjs: http://phantomjs.org
var page = require('webpage').create();
page.open("http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp");
setTimeout(function () {
    // Where you want to save the rendered page
    page.render("screenshot.png");
    // You can access its content using jQuery
    var fbcomments = page.evaluate(function () {
        return $(".fb-comments iframe").contents().find(".postContainer");
    });
}, 10000);
You have to run PhantomJS with the --web-security=no option to allow cross-domain interaction (i.e., for the Facebook iframe).
To communicate with other applications from PhantomJS you can use a web server or make a POST request: https://github.com/ariya/phantomjs/blob/master/examples/post.js
You can use HtmlUnit, a Java-based "GUI-less browser". You can easily get the final rendered output of any page, because it loads the page the way a web browser does and returns the final rendered output. (You can disable this behaviour, though.)
UPDATE: You were asking for an example? You don't have to do anything extra to get that:
Example:
WebClient webClient = new WebClient();
HtmlPage myPage = (HtmlPage) webClient.getPage(myUrl);
UPDATE 2: You can get an iframe as follows:
HtmlPage myFrame = (HtmlPage) myPage.getFrameByName(myIframeName).getEnclosedPage();
Please read the documentation at the link above. When it comes to getting page content, there is very little you can't do in HtmlUnit.
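If the comments are injected asynchronously, a hedged sketch (myUrl and myIframeName are placeholders, as above) is to give the background JavaScript time to finish before reading the frame:
WebClient webClient = new WebClient();
webClient.getOptions().setJavaScriptEnabled(true); // on by default in recent versions
HtmlPage myPage = webClient.getPage(myUrl);
// give pending scripts (e.g. the Facebook plugin) up to 10 seconds to finish
webClient.waitForBackgroundJavaScript(10000);
HtmlPage myFrame = (HtmlPage) myPage.getFrameByName(myIframeName).getEnclosedPage();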
A simple way to solve this problem: you can use HtmlUnit, a Java API. I think it can help you access the executed JS content as plain HTML.
WebClient webClient = new WebClient();
HtmlPage myPage = (HtmlPage) webClient.getPage(new URL("YourURL"));
System.out.println(myPage.getVisibleText());
I have a snippet of code like this:
URL webUrl = new URL(url);
BufferedReader reader = new BufferedReader(new InputStreamReader(webUrl.openStream()));
When I try to get the HTML content of some pages, I get a response saying that my browser doesn't support frames, so I don't get the real HTML of the page.
Is there a workaround?
Maybe telling the program to identify itself as some browser?
For me it is critical only to get the HTML; then I want to parse it.
EDIT: I cannot get the src of the frame from the HTML in the browser; it is hidden in JS.
The "You don't support frames and we haven't put sensible alternative content here" message will be in the <noframes> element. You need to access the appropriate <frame> element, access its src attribute, resolve the URI in it, and then fetch data from there.
You must set a user-agent string in your HTTP request, so that the server thinks you support frames. I suggest something like HtmlUnit or HttpClient for this.
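With the plain java.net approach from the question, that just means setting the header on the connection (a sketch; the user-agent string is only an example):
URL webUrl = new URL(url);
URLConnection conn = webUrl.openConnection();
// pretend to be a real browser so the server serves the frameset normally
conn.setRequestProperty("User-Agent",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0");
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));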