Call Ajax using HtmlUnit - java

I want to crawl a web page that has a download button. When I press it, the current page shows the download progress in its title and then shows a download link that can be clicked. I think this is done via Ajax, because I can see some requests in the developer console under Network -> XHR.
This is my code to crawl the site:
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setCssEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
final HtmlPage page = webClient.getPage("https://9xbuddy.com/process?url=https://www.fembed.com/v/6mv22g3qfsdfsd");
// final ScriptResult scriptResult = page.executeJavaScript("beacon.js");
webClient.waitForBackgroundJavaScript(10000);
webClient.waitForBackgroundJavaScriptStartingBefore(10000);
But this code returns the page I get right after the button click, without the Ajax results loaded. I know which Ajax requests the site makes; is there any way to call them manually?

You can construct the Ajax calls manually with HtmlUnit. If the Google Chrome console is not sufficient for identifying them, you can use a tool such as Fiddler. Once you have identified the HTTP call, you can reconstruct it with HtmlUnit like this:
URL url = new URL(
"http://tws.target.com/searchservice/item/search_results/v1/by_keyword?callback=getPlpResponse&navigation=true&category=55krw&searchTerm=&view_type=medium&sort_by=bestselling&faceted_value=&offset=60&pageCount=60&response_group=Items&isLeaf=true&parent_category_id=55kug&custom_price=false&min_price=from&max_price=to");
WebRequest requestSettings = new WebRequest(url, HttpMethod.GET);
requestSettings.setAdditionalHeader("Accept", "*/*");
requestSettings.setAdditionalHeader("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8");
requestSettings.setAdditionalHeader("Referer", "http://www.target.com/c/xbox-one-games-video/-/N-55krw");
requestSettings.setAdditionalHeader("Accept-Language", "en-US,en;q=0.8");
requestSettings.setAdditionalHeader("Accept-Encoding", "gzip,deflate,sdch");
requestSettings.setAdditionalHeader("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.3");
Page page = webClient.getPage(requestSettings);
System.out.println(page.getWebResponse().getContentAsString());
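
A long query string like the one above is easier to maintain when it is assembled from named parameters. Below is a minimal sketch of that idea using only the JDK. The class name `AjaxUrlBuilder`, its `build` method, and any URL you pass in are hypothetical and not part of HtmlUnit; the resulting string could then be handed to `new URL(...)` as in the answer.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not an HtmlUnit class): builds a GET URL from named parameters.
final class AjaxUrlBuilder {

    private AjaxUrlBuilder() {
    }

    // Appends the URL-encoded parameters to baseUrl in the map's iteration order.
    static String build(String baseUrl, Map<String, String> params) {
        if (params.isEmpty()) {
            return baseUrl;
        }
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            query.add(encode(entry.getKey()) + "=" + encode(entry.getValue()));
        }
        return baseUrl + "?" + query;
    }

    private static String encode(String value) {
        // URLEncoder applies form encoding, so a space becomes '+'.
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

Passing a `LinkedHashMap` keeps the parameters in insertion order, which makes the generated URL easy to compare against the request captured in the browser or Fiddler.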

Related

Crawl dynamically changed web page with HtmlUnit

I want to crawl a web page using HtmlUnit. My purpose is:
Load page
Write something to text field
Press download button
Get new page
This is the web site: https://9xbuddy.com/
Using a browser, I can write a URL into the text field, press the download button, and get the download link.
My code is:
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setCssEnabled(false);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
final HtmlPage page = webClient.getPage("https://9xbuddy.com/sites/fembed");
final HtmlForm form = page.getForms().get(0);
final HtmlInput urlInput = form.getInputByName("url");
urlInput.click();
urlInput.type(iframeUrl);
final List<HtmlButton> byXPath = (List<HtmlButton>) form.getByXPath("//button[@class='orange-gradient submit_btn']");
final HtmlPage click = byXPath.get(0).click();
webClient.waitForBackgroundJavaScript(15000);
The problem is:
When I press the download button, it probably sends an Ajax request, because the title changes to "save" and after a few seconds changes to "Process completed". With the code above I want to wait for all Ajax requests, but what I finally get is the "save" title, which means HtmlUnit didn't wait for Ajax. How can I do this?
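
One common way to approach this kind of problem is to poll for the expected end state instead of relying on a single fixed wait. The helper below is a minimal, library-independent sketch of that idea; `Poller` and `waitUntil` are hypothetical names, not HtmlUnit API. With HtmlUnit, the condition might be something like `() -> page.getTitleText().contains("completed")`, assuming the page really sets such a title when processing finishes.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper (not part of HtmlUnit): polls a condition until it holds or time runs out.
final class Poller {

    private Poller() {
    }

    // Returns true as soon as the condition holds, false if timeoutMillis elapses first
    // or the waiting thread is interrupted.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The condition is checked before every sleep, so a state that is already reached returns immediately, and the boolean result lets the caller tell a timeout apart from success.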

Open a web browser page after a POST request using Htmlunit library

I'm testing my website by navigating through it with the HtmlUnit library and Java, like this for example:
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
HtmlPage page1 = webClient.getPage(mypage);
// sent using POST
HtmlForm form = page1.getForms().get(0);
HtmlSubmitInput button = form.getInputByName("myButton");
HtmlPage page2 = button.click();
// I want to open page2 on a web browser and continue there using a function like
// continueOnBrowser(page2);
I filled in a form programmatically using HtmlUnit and then submitted it; the form uses the POST method. I'd like to see the content of the response in a web browser page, but opening the URL in a browser doesn't show it, since it's the response to a POST request.
This seems like the wrong approach to me: obviously, if you do something programmatically, you can't expect to open the browser and continue there... I can't figure out what could solve my problem.
Do you have any suggestions?

Web scraping using HtmlUnit on an intranet website

I am presently using HtmlUnit to automatically fill in forms and click a button on an intranet site. The code works on internet websites but fails on the intranet website. The intranet website is an ASP site that only opens in IE. The code I am using is as follows:
final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER, "10.20.30.31", 8182);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(true);
System.out.println(url);
HtmlPage page = webClient.getPage(url);
System.out.println("HTML page opened");
HtmlInput searchBox = page.getElementByName("txtFaq"); //this is actual
searchBox.setValueAttribute(faq);
HtmlSubmitInput update = page.getElementByName("clear");
page = update.click();
HtmlDivision resultStatsDiv = page.getFirstByXPath("//div[@id='resultStats']");
System.out.println(resultStatsDiv.asText());
webClient.close();
On execution it throws the following exception:
java.net.SocketTimeoutException: Read timed out
What am I missing here?

Using HtmlUnit to get the access token from instagram

Since HttpURLConnection didn't cut it, I switched to HtmlUnit to programmatically get the auth code, use it to get the access token from Instagram, and then do whatever I need from there. The thing is, I'm stuck trying to retrieve the authorization code from the URL
mysite.com/?code=ca1ec5b06a0b409293cff74ed9876a46
but I can't reach that link, since I don't seem to be redirected to it from the authorization URL, this one:
https://instagram.com/accounts/login/?force_classic_login=&next=/oauth/authorize/%3Fclient_id%CLIENT_ID%26redirect_uri%3Dhttp%3A//MYSITE.COM%26response_type%3Dcode
This is the code where I try to access that URL:
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setRedirectEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setCssEnabled(true);
HtmlPage page = (HtmlPage) webClient.getPage(authURL);
WebResponse response = page.getWebResponse();
String content = response.getContentAsString();
System.out.println(page.getUrl());
I resolved this by requesting the authorization URL without being logged in to the site. When I was asked to log in, I used HtmlUnit to get the HTML login form and log in to Instagram, and finally I got the desired redirect URL with the code.
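
Once the redirect has happened, the authorization code still has to be pulled out of the final URL (the value that `page.getUrl()` prints in the question's code). A minimal JDK-only sketch of that step follows. `RedirectParams` is a hypothetical helper name, and the sketch assumes the code arrives as a `code` query parameter, as in the example URL above.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper (not part of HtmlUnit): reads one query parameter from a URL string.
final class RedirectParams {

    private RedirectParams() {
    }

    // Returns the decoded value of the first parameter called name, if present.
    static Optional<String> get(String url, String name) {
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return Optional.empty();
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            if (URLDecoder.decode(key, StandardCharsets.UTF_8).equals(name)) {
                return Optional.of(URLDecoder.decode(value, StandardCharsets.UTF_8));
            }
        }
        return Optional.empty();
    }
}
```

With HtmlUnit this might be called as `RedirectParams.get(page.getUrl().toString(), "code")`; an empty `Optional` then signals that the redirect did not carry a code.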

POSTing a request to the correct URL when HtmlUnit is ignoring form.setActionAttribute and form.setAttribute

I'm trying to submit a form using HtmlUnit, but it seems that the action attribute of the form is ignored, since the HTTP POST goes to the same page.
I'm getting the form on this URL:
http://www.tjse.jus.br/tjnet/consultas/internet/consnomeparte.wsp
And in the source code of this URL we can find that the action attribute is set to this URL:
http://www.tjse.jus.br/tjnet/consultas/internet/respconsnomeparte.wsp
But HtmlUnit always posts to the first URL.
I'm using Fiddler to analyse the request made by a real web browser and the one made by HtmlUnit. Comparing the two HTTP POSTs, it's easy to see that HtmlUnit is POSTing to the same page, i.e., the first URL mentioned.
I need HtmlUnit to POST to the second URL.
If anyone could help me, I'd appreciate it.
Problem solved.
Instead of using:
HtmlPage page2 = button.click();
I used:
button.click().getWebResponse().getContentAsString();
I would use something similar to the following.
// Enter your username in the field
searchForm.getInputByName("Username").setValueAttribute(schoolID);
//Submit the form and get the result page
HtmlPage pageResult = (HtmlPage) searchForm.getInputByValue("Search").click();
//Page results in raw html source code
String html = pageResult.asXml();
/*
* filter source code if needed to collect desired data
*/
// Log in via another server URL
page = (HtmlPage) webClient.getPage("https://"+url);
HtmlForm loginForm = page.getFormByName("Form1");
// Log in to the web portal
loginForm.getInputByName("txtUserName").setValueAttribute(username);
loginForm.getInputByName("txtPassword").setValueAttribute(password);
// Submit the form and get the result page (separate name so it does not clash with pageResult above)
HtmlPage loginResult = (HtmlPage) loginForm.getInputByName("btnLogin").click();
Note: this HtmlUnit code complies with the HtmlUnit 2.15 API.
