I am using HtmlUnit to load a webpage containing a dynamically updated ajax component using the following:-
WebClient webClient = new WebClient(BrowserVersion.CHROME);
URL url = new URL("https://live.xxx.com/en/ajax/getDetailedQuote/" + instrument);
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);
HtmlPage redirectPage = webClient.getPage(requestSettings);
This works and I get the contents of the page at the time of request.
I want however to be able to monitor and respond to changes on the page.
I tried the following:-
webClient.addWebWindowListener(new WebWindowListener() {
public void webWindowContentChanged(WebWindowEvent event) {
System.out.println("Content changed ");
}
});
But I only get "Content changed" when the page is first loaded, and not when it updates.
I fear there is not really a solution with HtmlUnit (at least not out of the box). The webWindowContentChanged hook is only called if the (whole) page inside the window is replaced.
You can try to implement a DomChangeListener and attach it to the page (or maybe to the body of the page).
If you would rather track the Ajax requests at the HTTP level, you can intercept the requests (see https://htmlunit.sourceforge.io/faq.html#HowToModifyRequestOrResponse for more details).
If you need more, please open an issue and we can discuss.
I have the URL https://www.facebook.com/ads/library/?id=286238429359299 which gets redirected to https://www.facebook.com/ads/library/?active_status=all&ad_type=political_and_issue_ads&country=US&impression_search_field=has_impressions_lifetime&id=286238429359299&view_all_page_id=575939395898200 in the browser.
I'm using the following code:
@Test
public void createWebClient() throws IOException {
getLogger("com.gargoylesoftware").setLevel(OFF);
WebClient webClient = new WebClient(CHROME);
WebClientOptions options = webClient.getOptions();
options.setJavaScriptEnabled(true);
options.setRedirectEnabled(true);
webClient.waitForBackgroundJavaScriptStartingBefore(10000);
// IMPORTANT: Without the country/language selection cookie the redirection does not work!
URL s = webClient.getPage("https://www.facebook.com/ads/library/?id=286238429359299").getUrl();
}
The above code does not follow the redirection. Is there something I am missing? I need to get the final URL that the original URL resolves to.
Actually, the URL https://www.facebook.com/ads/library/?id=286238429359299 returns a page with JavaScript. The JavaScript inspects the browser environment; for example, it detects whether the current browser is a headless browser and whether the web driver is legitimate. So I think the solution is to analyze the JavaScript, and from that you will get the final URL.
I think it never actually resolves to the final URL because the browser is headless.
Please load the same page in a browser, view the source code and search for "page_uri"; you will see exactly the URI you are looking for.
If you check the HtmlUnit output by printing the page
System.out.println(page.asXml());
You will see that "page_uri" contains the originally entered URL.
I suggest using Selenium WebDriver (not headless).
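Once you have the page source from a real browser, the "page_uri" value can be pulled out with a regular expression. The following is only a minimal sketch: it assumes the value is embedded as a JSON string of the form "page_uri":"..." with slashes escaped as \/, which is a guess about the markup and not something confirmed here. The class name PageUriExtractor and the sample string are made up for illustration. Other escapes (such as \u0025) are left untouched.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PageUriExtractor {

    // Matches "page_uri":"<value>" and captures the raw (still escaped) value.
    private static final Pattern PAGE_URI =
            Pattern.compile("\"page_uri\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    /** Returns the first page_uri value found in the source, with \/ unescaped. */
    static Optional<String> extractPageUri(String source) {
        Matcher m = PAGE_URI.matcher(source);
        if (!m.find()) {
            return Optional.empty();
        }
        // Assumption: only forward slashes are escaped in the embedded JSON.
        return Optional.of(m.group(1).replace("\\/", "/"));
    }

    public static void main(String[] args) {
        // Hypothetical sample, not real Facebook output.
        String sample = "{\"page_uri\":\"https:\\/\\/www.example.com\\/ads\\/library\\/?id=1\"}";
        System.out.println(extractPageUri(sample).orElse("not found"));
    }
}
```

Applied to the HtmlUnit output this will just return the URL you entered, as described above, so it is only useful on source obtained from a browser that actually completed the redirect.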
I am trying to get to the docSearch form of the https://eagletw.mohavecounty.us/treasurer/treasurerweb/search.jsp web page using the latest HTMLUnit release (2.37.0). As you can see using Firefox's DOM Inspector, there is such a form
WebClient webClient = new WebClient();
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.setRefreshHandler(new RefreshHandler() {
public void handleRefresh(Page page, URL url, int arg) throws IOException {
System.out.println("handleRefresh");
}
});
HtmlPage page = (HtmlPage) webClient.getPage("https://eagletw.mohavecounty.us/treasurer/treasurerweb/search.jsp");
webClient.waitForBackgroundJavaScript(1000000);
webClient.waitForBackgroundJavaScriptStartingBefore(100000);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
page.getEnclosingWindow().getJobManager().waitForJobs(1000000);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.waitForBackgroundJavaScriptStartingBefore(1000000);
HtmlForm form = page.getFormByName("docSearch");
The last line of the above code gives me the following exception:
com.gargoylesoftware.htmlunit.ElementNotFoundException: elementName=[form] attributeName=[name] attributeValue=[docSearch]
Any tips on what I can try in my code to get to the docSearch form ?
Do you believe this is a problem with HTMLUnit itself? Should I file this as an issue on HTMLUnit's GitHub site?
I have spent some time on this to build a complete sample. The page is only available from the US; I had to set up a VPN to access it. The sample contains some hints; hope that helps.
final String url = "https://eagletw.mohavecounty.us/treasurer/treasurerweb/search.jsp";
try (final WebClient webClient = new WebClient()) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
// webClient.getOptions().setUseInsecureSSL(true);
// open the url, this will do a redirect to the login page
HtmlPage page = webClient.getPage(url);
// System.out.println(page.asXml());
// System.out.println("--------------------------------");
// click the Public User Login
for (DomElement elem : page.getElementById("middle_left").getElementsByTagName("input")) {
if (elem instanceof HtmlSubmitInput
&& "Login".equals(((HtmlSubmitInput) elem).getValueAttribute())) {
page = elem.click();
break;
}
}
// System.out.println(page.asXml());
// System.out.println("--------------------------------");
// search by owner name
HtmlInput ownerInput = (HtmlInput) page.getElementById("TaxAOwnerIDSearchString");
ownerInput.type("Trump");
// click submit
for (DomElement elem : page.getElementsByTagName("input")) {
if (elem instanceof HtmlSubmitInput) {
page = elem.click();
}
}
// System.out.println(page.asXml());
// System.out.println("--------------------------------");
System.out.println(page.asText());
}
Your code looks really desperate. Usually it is more helpful to try to understand what is going on than to copy every snippet you can find into your code and hope that it helps.
A good starting point is to understand how the page is working. Use a good web proxy like Charles (or Fiddler) to monitor what happens when you open the page with your browser. Sadly I cannot open your url because my browser reports server not found. Because of this the rest of this answer is more of a guess.
The next step is to create your web client and try to live with the default settings.
WebClient webClient = new WebClient();
webClient.getOptions().setThrowExceptionOnScriptError(false);
With these two lines your client is ready.
At the very least, your RefreshHandler setup completely breaks the handling of refresh cases.
The next step is to check the output after you have got the page and compare it with your browser/web proxy session.
HtmlPage page = (HtmlPage) webClient.getPage("https://eagletw.mohavecounty.us/treasurer/treasurerweb/search.jsp");
System.out.println(page.asXml());
Now you can check whether the form is there (in the output) or not. If not, you have to figure out with the proxy whether there is any kind of JS-based background reloading. Usually you will see the requests in your proxy output.
To wait for this you can call something like
webClient.waitForBackgroundJavaScriptStartingBefore(100_000);
Sometimes these background jobs replace the content of the current window. To take care of this, it is a good idea to get the current page content from the window before dumping it.
page = (HtmlPage) page.getEnclosingWindow().getEnclosedPage();
System.out.println(page.asXml());
Hope that clarifies it a bit. If you need more help, I need to be able to access the page myself. Otherwise it is only guessing.
I am using HtmlUnit from net.sourceforge.htmlunit to simulate a web browser. I am trying to log in to the Steam web app, but I encountered a problem. After setting the credentials I wanted to use the click method:
final WebClient webClient = new WebClient();
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setRedirectEnabled(true);
webClient.setCookieManager(new CookieManager());
final HtmlPage loginPage = webClient.getPage(loginPageConfiguration.getLoginPageUrl());
final HtmlTextInput user = loginPage.getHtmlElementById(loginPageConfiguration.getLoginInputId());
user.setText(loginCredentials.getUsername());
final HtmlPasswordInput password = loginPage.getHtmlElementById(loginPageConfiguration.getPasswordInputId());
password.setText(loginCredentials.getPassword());
final HtmlPage afterLoginPage = loginPage.getHtmlElementById(loginPageConfiguration.getLoginButtonId()).click();
In a normal browser, a successful login redirects to http://store.steampowered.com/, but afterLoginPage is still the previous login page.
Without knowing the page and having credentials to access it, I can only guess. Maybe the application is a single-page application that replaces the visual content using Ajax (maybe in combination with redirecting). Because Ajax is asynchronous, it might help to wait a bit after the click and before addressing the page.
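"Waiting a bit" can be done with a fixed sleep, but polling for a condition with a timeout is usually more robust. Below is a generic sketch using only the JDK; the class name Poller and the method waitUntil are made up for illustration and are not part of HtmlUnit. In the login case the condition would be whatever tells you the login has finished, for example a check on the URL of the window's current page.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

final class Poller {

    /**
     * Re-evaluates the condition every `interval` until it is true or
     * `timeout` has elapsed. Returns true if the condition was met in time.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true after roughly 200 ms.
        boolean met = waitUntil(() -> System.currentTimeMillis() - start >= 200,
                Duration.ofSeconds(2), Duration.ofMillis(50));
        System.out.println("condition met: " + met);
    }
}
```

The condition is checked once before the first sleep, so an already finished login returns immediately.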
It might also be a good starting point to understand what is going on by checking the HTTP communication of the real browser (by using the developer tools or a web proxy like Charles). You can then compare this with the communication done by HtmlUnit (e.g. enable the HttpClient wire log).
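One way to switch on the wire log from code is sketched below. It assumes HtmlUnit 2.x (which uses Apache HttpClient 4.x), Commons Logging on the classpath, and no other logging backend (Log4j, Logback) taking over; with a different setup the logger org.apache.http.wire has to be set to DEBUG in that backend's own configuration instead. The class name WireLog is made up for illustration.

```java
final class WireLog {

    /**
     * Sets the Commons Logging system properties that route logging to
     * SimpleLog and raise the HttpClient wire logger to DEBUG.
     * Call this before the first WebClient is created.
     */
    static void enable() {
        System.setProperty("org.apache.commons.logging.Log",
                "org.apache.commons.logging.impl.SimpleLog");
        System.setProperty("org.apache.commons.logging.simplelog.showdatetime", "true");
        System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.http.wire",
                "DEBUG");
    }

    public static void main(String[] args) {
        enable();
        System.out.println(System.getProperty(
                "org.apache.commons.logging.simplelog.log.org.apache.http.wire"));
    }
}
```

The same properties can be passed as -D options on the java command line, which avoids any ordering problems with class initialization.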
Another option might be a javascript error. Please check your log output.
I want to access instagram pages without using the API. I need to find the number of followers, so it is not simply a source download, since the page is being built dynamically.
I found HtmlUnit as a library to simulate the browser, so that the JS gets rendered, and I get back the content I want.
HtmlPage myPage = ((HtmlPage) webClient.getPage("http://www.instagram.com/instagram"));
This call however results in the following exception:
Exception in thread "main" com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 403 Forbidden for http://d36xtkk24g8jdx.cloudfront.net/bluebar/3a30db9/scripts/webfont.js
So it can't access that script, but if I'm interpreting this correctly, it's just for font loading, which I don't need. I googled how to tell it to ignore parts of the page, and found this thread.
webClient.setWebConnection(new WebConnectionWrapper(webClient) {
@Override
public WebResponse getResponse(final WebRequest request) throws IOException {
if (request.getUrl().toString().contains("webfont")) {
System.out.println(request.getUrl().toString());
return super.getResponse(request);
} else {
System.out.println("returning response...");
return new StringWebResponse("", request.getUrl());
}
}
});
With that code, the exception goes away, but the source (or page title, or anything else I've tried) seems to be empty. "returning response..." is printed once.
I'm open to different approaches as well. Ultimately, entire page source in a single string would be good enough for me, but I need the JS to execute.
HtmlUnit with JavaScript is not a good solution here, because its JavaScript engine (Mozilla Rhino) does not work for many JavaScript-heavy pages and causes a lot of problems.
You can use PhantomJS as a WebDriver:
PhantomJs
As I understand it and have used it, AJAX is used to make requests from the client to the server and to then update a HTML DIV on the client with new content.
However, I want to use AJAX from a client to a servlet to verify the existence of a URL. If the result is bad, I can set an error message in the servlet and return it to the client page for display.
But does anyone know if, in the case of a positive result, I can have my servlet automatically display another (the next) page to the user? Or should that request be triggered by JavaScript on the client when the positive result is received?
Thanks
Mr Morgan.
Since your Ajax call is executed in the background, the result returned by the servlet goes back to the Ajax call, which should then act according to that result, e.g. trigger the display of another page. (That page could already have been included in the Ajax response, in which case you can show it in a div or iframe or ...)
As per the W3 specification, XMLHttpRequest follows the redirect to the new location when the server returns a proper 301/302 response and the new request satisfies the Same Origin Policy. This, however, fails in some browsers, such as certain Google Chrome versions.
To achieve the best cross-browser result, including the case where the redirected URL doesn't meet the Same Origin Policy rules, change the location on the JavaScript side instead. You can let your servlet send the status and the desired new URL, e.g.:
Map<String, Object> map = new HashMap<String, Object>();
map.put("redirect", true);
map.put("location", "http://stackoverflow.com");
response.setContentType("application/json");
response.setCharacterEncoding("UTF-8");
response.getWriter().write(new Gson().toJson(map));
(that Gson is by the way Google Gson which eases converting Java Objects to JSON)
and then in the Ajax success callback handler in JS:
if (response.redirect) {
window.location = response.location;
}
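To tie this back to the original question (verify a URL, then either report an error or move on to the next page), the servlet-side decision could look roughly like the sketch below. It uses only the JDK so that it runs on its own: the JSON is built by hand where the answer above uses Gson, and the class name UrlCheckReply, the "error" field and the 3-second timeouts are assumptions made for illustration. A URL counts as existing here if a HEAD request returns a 2xx or 3xx status, which is only one possible definition.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class UrlCheckReply {

    /** Returns true if a HEAD request to the URL answers with a 2xx or 3xx status. */
    static boolean exists(String candidate) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(candidate).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(3000);
            conn.setReadTimeout(3000);
            int status = conn.getResponseCode();
            conn.disconnect();
            return status >= 200 && status < 400;
        } catch (IOException | ClassCastException e) {
            // Malformed URL, non-HTTP scheme, or network failure.
            return false;
        }
    }

    /** Builds the JSON body that the Ajax success callback inspects. */
    static String reply(String candidate, String nextPage) {
        if (exists(candidate)) {
            return "{\"redirect\":true,\"location\":\"" + escape(nextPage) + "\"}";
        }
        return "{\"redirect\":false,\"error\":\""
                + escape("URL not reachable: " + candidate) + "\"}";
    }

    /** Minimal JSON string escaping: quotes, backslashes and control characters. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "not a url" is malformed, so no network access is needed for this demo.
        System.out.println(reply("not a url", "/next.jsp"));
    }
}
```

In a real servlet the returned string would be written to the response with the application/json content type, and the callback would show response.error when response.redirect is false.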
In your success callback (on the client), set self.location.href to the new URL.
HTML is a "pull" technology: Nothing gets displayed in the browser that the browser hasn't previously requested from the server.
Hence, you don't have a chance to "make the servlet automatically display a different page." You have to talk your browser (from JavaScript) into requesting a different page.