Problem getting real html code in HtmlUnit - java

When I open https://www.instagram.com/metallica/ in a browser and view its source, I see a JavaScript variable window._sharedData containing a "graphql" field.
When I fetch the same page with HtmlUnit, window._sharedData is not the same.
What's the problem? How can I get the same JS field as in the browser using HtmlUnit?
BrowserVersion my = new BrowserVersionBuilder(BrowserVersion.FIREFOX_52)
.setUserAgent("Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2)").build();
WebClient webClient = new WebClient(my);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setCssEnabled(false);
HtmlPage htmlPage = webClient.getPage("https://www.instagram.com/metallica/");
String pageContent = htmlPage.getWebResponse().getContentAsString();
UPD
window._sharedData in Browser: {"config":{"csrf_token":"zkBxaROkhJqxHV7QAYKvHYNU8QCP15vm","viewer":null,"viewerId":null},"country_code":"RU","language_code":"ru","locale":"ru_RU","entry_data":{"ProfilePage":
window._sharedData in response: {"config":{"csrf_token":"Rpm5P3Ok3ZUh7wVklBLPkMzw9k3u1tbz","viewer":null,"viewerId":null},"country_code":"RU","language_code":"en","locale":"en_US","entry_data":{"LoginAndSignupPage":
So the difference is LoginAndSignupPage instead of ProfilePage.
UPD2
On my server, Instagram for some unknown reason redirects any address to /accounts/login, which is why the content is different. So now the question is: how can I prevent this redirect?
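A quick way to confirm from code which page the server actually returned is to look at the first key under entry_data in the fetched source. The following is only a minimal sketch: the class and method names are made up for illustration, and it uses a regular expression on the raw text rather than a JSON parser, so it assumes the layout shown in the two snippets above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlUnit.
final class SharedDataProbe {
    // Captures the first key inside "entry_data":{ ... }, e.g. "ProfilePage".
    private static final Pattern ENTRY_KEY =
            Pattern.compile("\"entry_data\"\\s*:\\s*\\{\\s*\"([^\"]+)\"");

    /** Returns the first key under entry_data, or empty if it is not found. */
    static Optional<String> entryPageType(String html) {
        Matcher m = ENTRY_KEY.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String fromBrowser = "{\"language_code\":\"ru\",\"entry_data\":{\"ProfilePage\":";
        String fromServer = "{\"language_code\":\"en\",\"entry_data\":{\"LoginAndSignupPage\":";
        System.out.println(entryPageType(fromBrowser).orElse("unknown"));
        System.out.println(entryPageType(fromServer).orElse("unknown"));
    }
}
```

With the code from the question, the input would be pageContent, so a result of LoginAndSignupPage tells you the redirect happened before any parsing problem could.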

getWebResponse() returns the response exactly as it came from the server. If you want the current state of the page, you have to wait for the JavaScript in the page to finish and then use something like page.getEnclosingWindow().getEnclosedPage().asXml();
If you compare with a real browser, please make sure that:
there are no cookies stored by the browser, because HtmlUnit always starts with an empty cookie store
JavaScript is enabled for HtmlUnit (the code in the question calls setJavaScriptEnabled(false))

Related

Open a web browser page after a POST request using Htmlunit library

I'm testing my website by navigating through it with the HtmlUnit library and Java. Like this, for example:
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
HtmlPage page1 = webClient.getPage(mypage);
// sent using POST
HtmlForm form = page1.getForms().get(0);
HtmlSubmitInput button = form.getInputByName("myButton");
HtmlPage page2 = button.click();
// I want to open page2 on a web browser and continue there using a function like
// continueOnBrowser(page2);
I filled in a form programmatically using HtmlUnit and then submitted it; the form uses the POST method. I'd like to see the content of the response inside a web browser page, but opening the URL directly doesn't work, since the content is the response to a POST request.
This seems like the wrong approach to me: obviously, if you do something programmatically you can't expect to open the browser and continue from there. I can't figure out what could solve my problem.
Do you have any suggestions?
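One possible workaround, sketched below, is to dump the markup of the response to a temporary file and ask the desktop to open it. This is only a sketch under assumptions: BrowserPreview is a made-up helper name, it uses only the JDK (java.awt.Desktop), and the browser will not share HtmlUnit's session, so relative links, stylesheets and further form submissions may not behave as they do on the real site.

```java
import java.awt.Desktop;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of HtmlUnit.
final class BrowserPreview {
    /** Writes the markup to a temporary .html file and returns its path. */
    static Path writeToTempFile(String html) {
        try {
            Path file = Files.createTempFile("htmlunit-response-", ".html");
            Files.write(file, html.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Opens the file in the default browser; returns false if that is not supported. */
    static boolean openInBrowser(Path file) {
        if (!Desktop.isDesktopSupported()
                || !Desktop.getDesktop().isSupported(Desktop.Action.BROWSE)) {
            return false; // headless machine or desktop without a browse action
        }
        try {
            Desktop.getDesktop().browse(file.toUri());
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the code from the question this would be called as BrowserPreview.openInBrowser(BrowserPreview.writeToTempFile(page2.asXml()));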

Using HtmlUnit to get the access token from instagram

Since HttpURLConnection didn't cut it, I switched to HtmlUnit to programmatically get the auth code needed for an Instagram access token and then do whatever I need from there. The thing is, I'm stuck trying to retrieve the authorization code from the URL
mysite.com/?code=ca1ec5b06a0b409293cff74ed9876a46
but I can't get to that link, since the authorization URL doesn't seem to redirect to it. This is the authorization URL:
https://instagram.com/accounts/login/?force_classic_login=&next=/oauth/authorize/%3Fclient_id%CLIENT_ID%26redirect_uri%3Dhttp%3A//MYSITE.COM%26response_type%3Dcode
This is the code where I try to access that URL:
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setRedirectEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setCssEnabled(true);
HtmlPage page = (HtmlPage) webClient.getPage(authURL);
WebResponse response = page.getWebResponse();
String content = response.getContentAsString();
System.out.println(page.getUrl());
I resolved this by requesting the authorization URL without being logged in to the site. When I was asked to log in, I used HtmlUnit to get the HTML login form and log in to Instagram, and then I finally got the desired redirect URL with the code.
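Once the final redirect URL is available, pulling the code parameter out of it needs nothing beyond the JDK. A minimal sketch, assuming the URL has the shape shown above (AuthCodeExtractor is a made-up name, and the value is returned without URL-decoding):

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical helper, not part of HtmlUnit.
final class AuthCodeExtractor {
    /** Returns the raw value of the "code" query parameter, if present. */
    static Optional<String> extractCode(String redirectUrl) {
        String query = URI.create(redirectUrl).getRawQuery();
        if (query == null) {
            return Optional.empty(); // URL has no query string at all
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("code")) {
                return Optional.of(pair.substring(eq + 1));
            }
        }
        return Optional.empty();
    }
}
```

With the code from the question, the argument would be page.getUrl().toString() after the login form has been submitted.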

Log in https aspx web page with jsoup

Is it possible to log in to an HTTPS ASPX web page using jsoup?
The page where I try to log in is: https://by.vulog.com/communauto-labs/login.aspx
What I'm ultimately trying to do is access https://by.vulog.com/communauto-labs/index.aspx in order to parse the HTML and get some information, but when I try to access this page, it still redirects me to the login page (I can see that by looking at the HTML of the homePage variable).
Or should I use some other tool?
Here is my code, which does not seem to work:
Connection.Response response = Jsoup.connect("https://by.vulog.com/communauto-labs/login.aspx")
.method(Connection.Method.GET)
.execute();
response = Jsoup.connect("https://by.vulog.com/communauto-labs/login.aspx")
.data("ctl00$ContentPlaceHolder1$LoginForm$UserName", "my_login")
.data("ctl00$ContentPlaceHolder1$LoginForm$Password", "my_password")
.cookies(response.cookies())
.method(Connection.Method.POST)
.execute();
Document homePage = Jsoup.connect("https://by.vulog.com/communauto-labs/index.aspx")
.cookies(response.cookies())
.get();
Struggling with the problem, I used a brute-force solution:
I logged in through my browser (Chrome), used the developer tools to get the authentication cookies, and passed them directly to my program before launching it.
I don't like this solution, but it's for a single-use program.
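For anyone taking the same shortcut, the Cookie request header copied from the developer tools can be turned into the Map that jsoup's cookies(...) method accepts. This is only a sketch: CookieHeaderParser is a made-up name and the cookie names in the usage note are placeholders, not the ones this site actually sets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of jsoup.
final class CookieHeaderParser {
    /** Parses "name1=value1; name2=value2" into an ordered name-to-value map. */
    static Map<String, String> parse(String cookieHeader) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String part : cookieHeader.split(";")) {
            String pair = part.trim();
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty or malformed fragments
            }
            // Split on the first '=' only, since values may contain '=' themselves.
            cookies.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return cookies;
    }
}
```

The result could then be passed along as Jsoup.connect("https://by.vulog.com/communauto-labs/index.aspx").cookies(CookieHeaderParser.parse("sessionCookie=abc; authCookie=xyz")).get();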

POSTing a request to the correct URL when HtmlUnit ignores form.setActionAttribute and form.setAttribute

I'm trying to submit a form using HtmlUnit, but it seems that the action attribute of the form is ignored, since the HTTP POST goes to the same page.
I'm getting the form on this URL:
http://www.tjse.jus.br/tjnet/consultas/internet/consnomeparte.wsp
And in the source code of this URL we can find that the action attribute is set to this URL:
http://www.tjse.jus.br/tjnet/consultas/internet/respconsnomeparte.wsp
But HtmlUnit always posts to the first URL.
I'm using Fiddler to analyse the request made by a real web browser and the one made by HtmlUnit. Comparing the two HTTP POSTs, it's easy to see that HtmlUnit is posting to the same page, i.e. the first URL mentioned.
I need HtmlUnit to POST to the second URL.
If anyone could help me, I'd appreciate it.
Problem solved.
Instead of using:
HtmlPage page2 = button.click();
I used:
button.click().getWebResponse().getContentAsString();
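When comparing Fiddler traces it can help to compute where the form is expected to post. A relative action attribute is normally resolved against the URL of the page that contains the form, which the JDK can reproduce. This sketch assumes there is no base element in the page and no script rewriting the action; FormActionResolver is a made-up name.

```java
import java.net.URI;

// Hypothetical helper, not part of HtmlUnit.
final class FormActionResolver {
    /** Resolves a form's action attribute against the URL of the page containing it. */
    static String resolve(String pageUrl, String action) {
        return URI.create(pageUrl).resolve(action).toString();
    }
}
```

For the page in the question, FormActionResolver.resolve("http://www.tjse.jus.br/tjnet/consultas/internet/consnomeparte.wsp", "respconsnomeparte.wsp") yields the second URL, and an absolute action is returned unchanged.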
I would use something similar to the following.
// Enter your username in the field
searchForm.getInputByName("Username").setValueAttribute(schoolID);
//Submit the form and get the result page
HtmlPage pageResult = (HtmlPage) searchForm.getInputByValue("Search").click();
//Page results in raw html source code
String html = pageResult.asXml();
/*
* filter source code if needed to collect desired data
*/
// Second, separate example: log in via another server URL
page = (HtmlPage) webClient.getPage("https://"+url);
HtmlForm LoginForm = page.getFormByName("Form1");
// login to web portal
LoginForm.getInputByName("txtUserName").setValueAttribute(username);
LoginForm.getInputByName("txtPassword").setValueAttribute(password);
//Submit the form and get the result page
HtmlPage pageResult = (HtmlPage) LoginForm.getInputByName("btnLogin").click();
Note: this HtmlUnit code complies with the HtmlUnit 2.15 API.

Getting Final HTML with Javascript rendered Java as String

I want to fetch data from an HTML page (scrape it), but it contains reviews that are loaded by JavaScript. With a normal Java URL fetch I only get the original HTML, without the JavaScript executed. I want the final page with the JavaScript executed.
Example: http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp
This page has comments in a Facebook plugin, which are fetched by JavaScript.
The same applies to this page:
http://www.imdb.com/title/tt0848228/reviews
What should I do?
Use phantomjs: http://phantomjs.org
var page = require('webpage').create();
page.open("http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp")
setTimeout(function(){
// Where you want to save it
page.render("screenshoot.png")
// You can access its content using jQuery
var fbcomments = page.evaluate(function(){
return $(".fb-comments iframe").contents().find(".postContainer")
})
},10000)
You have to run phantomjs with the option --web-security=no to allow cross-domain interaction (i.e. for the Facebook iframe).
To communicate with other applications from phantomjs you can use a web server or make a POST request: https://github.com/ariya/phantomjs/blob/master/examples/post.js
You can use HtmlUnit, a Java-based "GUI-less browser". You can easily get the final rendered output of any page, because it loads the page as a web browser does and returns the final rendered output. You can disable this behaviour though.
UPDATE: You were asking for an example? You don't have to do anything extra for that:
Example:
WebClient webClient = new WebClient();
HtmlPage myPage = ((HtmlPage) webClient.getPage(myUrl));
UPDATE 2: You can get iframe as follows:
HtmlPage myFrame = (HtmlPage) myPage.getFrameByName(myIframeName).getEnclosedPage();
Please read the documentation from the link above. There is nothing you can't do with HtmlUnit when it comes to getting page content.
The simple way to solve that problem.
Hello, you can use HtmlUnit, a Java API. I think it can help you access the content produced by the executed JS as simple HTML.
WebClient webClient = new WebClient();
HtmlPage myPage = (HtmlPage) webClient.getPage(new URL("YourURL"));
System.out.println(myPage.getVisibleText());
