When I open the same site with Google Chrome on one side and HtmlUnit.WebClient(BrowserVersion.CHROME) on the other, I do not see the same cookies on both sides. Cookies are checked here with Google-Chrome-Dev. Nine cookies vs four cookies for the same site.
The site is linckx.odoo.com.
Is there something missing in my HtmlUnit code?
final WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setCssEnabled(false);
webClient.getCookieManager().setCookiesEnabled(true);
CookieManager cookieManager = webClient.getCookieManager();
final HtmlPage loginPage = webClient.getPage(url + "/en_US/web/login");
If I access the page https://linckx.odoo.com/en_US/web/login
with a fresh browser (all cookies removed beforehand), I get
nothing more.
Maybe the differences come from other requests you made in your browser before.
new WebClient() in HtmlUnit is like starting a browser with a completely new profile (all caches are empty, no cookies).
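To see exactly which cookies differ, it can help to collect the cookie names from both sides and diff them. The following is only a minimal sketch using plain Java sets; the class name, the helper `onlyIn` and the cookie names are hypothetical and for illustration only. With HtmlUnit the names would presumably be collected from `webClient.getCookieManager().getCookies()`, and the browser side from the developer tools.

```java
import java.util.Set;
import java.util.TreeSet;

public class CookieDiff {

    /** Returns the names that are in {@code left} but not in {@code right}, sorted. */
    static Set<String> onlyIn(Set<String> left, Set<String> right) {
        Set<String> result = new TreeSet<>(left);
        result.removeAll(right);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical cookie names, not taken from the real site
        Set<String> browserCookies = Set.of("cookie_a", "cookie_b", "cookie_c");
        Set<String> htmlUnitCookies = Set.of("cookie_a");

        System.out.println("Only in browser:  " + onlyIn(browserCookies, htmlUnitCookies));
        System.out.println("Only in HtmlUnit: " + onlyIn(htmlUnitCookies, browserCookies));
    }
}
```

Cookies that show up only in the browser are good candidates for having been set by earlier requests or by scripts that did not run in HtmlUnit.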
I have the URL https://www.facebook.com/ads/library/?id=286238429359299 which gets redirected to https://www.facebook.com/ads/library/?active_status=all&ad_type=political_and_issue_ads&country=US&impression_search_field=has_impressions_lifetime&id=286238429359299&view_all_page_id=575939395898200 in the browser.
I'm using the following code:
@Test
public void createWebClient() throws IOException {
getLogger("com.gargoylesoftware").setLevel(OFF);
WebClient webClient = new WebClient(CHROME);
WebClientOptions options = webClient.getOptions();
options.setJavaScriptEnabled(true);
options.setRedirectEnabled(true);
webClient.waitForBackgroundJavaScriptStartingBefore(10000);
// IMPORTANT: Without the country/language selection cookie the redirection does not work!
URL s = webClient.getPage("https://www.facebook.com/ads/library/?id=286238429359299").getUrl();
}
The above code doesn't take the redirection into account. Is there something I am missing? I need to get the final URL that the original URL resolves to.
Actually, the URL https://www.facebook.com/ads/library/?id=286238429359299 returns a page with JavaScript. The JavaScript inspects the browser environment; for example, it detects whether the current browser is headless and whether the web driver is legitimate. So I think the solution is to analyze the JavaScript to get the final URL.
I think it never actually resolves to the final URL because the browser is headless.
Please load the same page in a browser, view the source code and search for "page_uri"; you will see exactly the URI you are looking for.
If you check the HtmlUnit output by printing the page
System.out.println(page.asXml());
You will see that "page_uri" contains the originally entered URL.
I suggest using Selenium WebDriver (not headless).
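Searching the page source for "page_uri" can also be done in code. The sketch below is a rough illustration only: it assumes the value is embedded as a JSON string of the form "page_uri":"..." with forward slashes escaped as \/, which is a guess about the markup and not something confirmed here. The class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageUriExtractor {

    // Assumption: the value appears as a JSON string, e.g. "page_uri":"https:\/\/..."
    private static final Pattern PAGE_URI =
            Pattern.compile("\"page_uri\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    /** Returns the first page_uri value found in the source, if any. */
    static Optional<String> extractPageUri(String source) {
        Matcher matcher = PAGE_URI.matcher(source);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // Undo the JSON escaping of forward slashes; other escapes are left as-is
        return Optional.of(matcher.group(1).replace("\\/", "/"));
    }

    public static void main(String[] args) {
        // Made-up snippet standing in for the real page source
        String source = "{\"page_uri\":\"https:\\/\\/example.com\\/library\\/?id=1\"}";
        System.out.println(extractPageUri(source).orElse("page_uri not found"));
    }
}
```

The input string could be the result of `page.asXml()` or the source saved from a real browser, which makes it easy to compare what each side received.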
I am trying to get to the docSearch form of the https://eagletw.mohavecounty.us/treasurer/treasurerweb/search.jsp web page using the latest HTMLUnit release (2.37.0). As you can see using Firefox's DOM Inspector, there is such a form
WebClient webClient = new WebClient();
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.setRefreshHandler(new RefreshHandler() {
public void handleRefresh(Page page, URL url, int arg) throws IOException {
System.out.println("handleRefresh");
}
});
HtmlPage page = (HtmlPage) webClient.getPage("https://eagletw.mohavecounty.us/treasurer/treasurerweb/search.jsp");
webClient.waitForBackgroundJavaScript(1000000);
webClient.waitForBackgroundJavaScriptStartingBefore(100000);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
page.getEnclosingWindow().getJobManager().waitForJobs(1000000);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.waitForBackgroundJavaScriptStartingBefore(1000000);
HtmlForm form = page.getFormByName("docSearch");
The last line of the above code gives me the following exception:
com.gargoylesoftware.htmlunit.ElementNotFoundException: elementName=[form] attributeName=[name] attributeValue=[docSearch]
Any tips on what I can try in my code to get to the docSearch form ?
Do you believe this is a problem with HTMLUnit itself? Should I file this as an issue on HTMLUnit's GitHub site?
I have spent some time on this to build a complete sample. The page is only available from the US; I had to set up a VPN to access it. The sample contains some hints; hope that helps.
final String url = "https://eagletw.mohavecounty.us/treasurer/treasurerweb/search.jsp";
try (final WebClient webClient = new WebClient()) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
// webClient.getOptions().setUseInsecureSSL(true);
// open the url, this will do a redirect to the login page
HtmlPage page = webClient.getPage(url);
// System.out.println(page.asXml());
// System.out.println("--------------------------------");
// click the Public User Login
for (DomElement elem : page.getElementById("middle_left").getElementsByTagName("input")) {
if (elem instanceof HtmlSubmitInput
&& "Login".equals(((HtmlSubmitInput) elem).getValueAttribute())) {
page = elem.click();
break;
}
}
// System.out.println(page.asXml());
// System.out.println("--------------------------------");
// search by owner name
HtmlInput ownerInput = (HtmlInput) page.getElementById("TaxAOwnerIDSearchString");
ownerInput.type("Trump");
// click submit
for (DomElement elem : page.getElementsByTagName("input")) {
if (elem instanceof HtmlSubmitInput) {
page = elem.click();
}
}
// System.out.println(page.asXml());
// System.out.println("--------------------------------");
System.out.println(page.asText());
}
Your code looks really desperate. Usually it is more helpful to try to understand what is going on than to copy every snippet you can find into your code and hope it helps.
A good starting point is to understand how the page is working. Use a good web proxy like Charles (or Fiddler) to monitor what happens when you open the page with your browser. Sadly I cannot open your url because my browser reports server not found. Because of this the rest of this answer is more of a guess.
The next step is to create your web client and try to live with the default settings.
WebClient webClient = new WebClient();
webClient.getOptions().setThrowExceptionOnScriptError(false);
With these two lines your client is ready.
At the very least, your RefreshHandler setup completely breaks the handling of refresh cases.
The next step is to check the output after you get the page and compare it with your browser/web proxy session.
HtmlPage page = (HtmlPage) webClient.getPage("https://eagletw.mohavecounty.us/treasurer/treasurerweb/search.jsp");
System.out.println(page.asXml());
Now you can check whether the form is there (in the output) or not. If not, you have to figure out with the proxy whether there is any kind of JS-based background reloading. Usually you will see the requests in your proxy output.
To wait for this you can call something like
webClient.waitForBackgroundJavaScriptStartingBefore(100_000);
Sometimes these background jobs replace the content of the current window. To take care of this, it is a good idea to get the current page content from the window before dumping.
page = (HtmlPage) page.getEnclosingWindow().getEnclosedPage();
System.out.println(page.asXml());
Hope that clarifies it a bit. If you need more help, I need to be able to access the page myself. Otherwise it is only guessing.
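Checking whether the form is in the dumped output does not have to be done by eye. As a rough diagnostic sketch (a regular expression is not a real HTML parser, and the class and method names are hypothetical), something like the following could be run against the string returned by `page.asXml()`. It assumes the name attribute is written with double quotes.

```java
import java.util.regex.Pattern;

public class FormCheck {

    /** Returns true if the markup contains a form tag with the given name attribute. */
    static boolean containsFormNamed(String markup, String formName) {
        // Assumption: attributes are double-quoted in the dumped markup
        Pattern pattern = Pattern.compile(
                "<form\\b[^>]*\\bname\\s*=\\s*\"" + Pattern.quote(formName) + "\"",
                Pattern.CASE_INSENSITIVE);
        return pattern.matcher(markup).find();
    }

    public static void main(String[] args) {
        // Made-up markup standing in for the real page dump
        String markup = "<html><body><form id=\"f1\" name=\"docSearch\"></form></body></html>";
        System.out.println("docSearch present: " + containsFormNamed(markup, "docSearch"));
    }
}
```

If this reports false for the HtmlUnit dump while the browser shows the form, the form is most likely added later by a script or lives on a different page (for example behind a login redirect).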
I am trying to scrape my router's web page to find connected devices and information about them.
I have written this code:
String searchUrl="http://192.168.1.1";
HtmlPage page=client.getPage(searchUrl);
System.out.println(page.asXml());
The problem is that the code returned by HtmlUnit is different from the code in Chrome. In HtmlUnit the section that lists connected devices is missing.
Try something like this:
try (final WebClient webClient = new WebClient()) {
// do not stop at js errors
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getPage(searchUrl);
// now wait for the js starting async (you can play with the timing)
webClient.waitForBackgroundJavaScript(10000);
// maybe the page was replaced from the async js
HtmlPage page = (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();
System.out.println(page.asXml());
}
Usually that helps.
If you are still facing problems, you have to open an issue on GitHub (https://github.com/HtmlUnit/htmlunit).
But keep in mind I can only really help if I can run and debug your code here, which means your web app has to be public.
I am using htmlunit from net.sourceforge.htmlunit to simulate a web browser. I am trying to log in to the Steam web app, but I encountered a problem. After setting the credentials I wanted to use the click method:
final WebClient webClient = new WebClient();
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setRedirectEnabled(true);
webClient.setCookieManager(new CookieManager());
final HtmlPage loginPage = webClient.getPage(loginPageConfiguration.getLoginPageUrl());
final HtmlTextInput user = loginPage.getHtmlElementById(loginPageConfiguration.getLoginInputId());
user.setText(loginCredentials.getUsername());
final HtmlPasswordInput password = loginPage.getHtmlElementById(loginPageConfiguration.getPasswordInputId());
password.setText(loginCredentials.getPassword());
final HtmlPage afterLoginPage = loginPage.getHtmlElementById(loginPageConfiguration.getLoginButtonId()).click();
In a normal browser, a successful login redirects to http://store.steampowered.com/, but afterLoginPage is still the previous login page.
Without knowing the page and having credentials to access it, I can only guess. Maybe the application is a single-page application that replaces the visual content using Ajax (maybe in combination with redirecting). Because Ajax is asynchronous, it might help to wait a bit after the click and before accessing the page.
A good starting point for understanding what is going on is to check the HTTP communication of the real browser (using the developer tools or a web proxy like Charles). You can then compare this with the communication done by HtmlUnit (e.g. enable the HttpClient wire log).
Another option might be a javascript error. Please check your log output.
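"Waiting a bit" tends to be more reliable as a poll with a timeout than as a fixed sleep. The helper below is only a generic sketch in plain Java; the class and method names are hypothetical and it is not part of HtmlUnit. With HtmlUnit, the condition could for instance compare the URL of `webClient.getCurrentWindow().getEnclosedPage()` against the expected target, which is an assumption about how you would wire it up.

```java
import java.util.function.BooleanSupplier;

public class WaitFor {

    /**
     * Polls the condition until it returns true or the timeout elapses.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        final long start = System.currentTimeMillis();
        // Stand-in condition: becomes true after roughly 200 ms
        boolean met = waitUntil(() -> System.currentTimeMillis() - start > 200, 2_000, 50);
        System.out.println("condition met: " + met);
    }
}
```

The condition is checked once before the first sleep, so a page that has already changed does not cost any waiting time.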
I am using the GUI-less browser HtmlUnit to retrieve the content of web pages, and the code works fine for other sites except "http://www.xyzzzzzzz.com.sg/". Can anybody explain why this is happening? I have already tried the HtmlUnit WebDriver with all three browsers (CHROME, FIREFOX and IE) as BrowserVersion, but nothing works.
public class Test{
public static void main(String[] args) throws Exception {
String url = "http://www.xyzzzzzzz.com.sg/";
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.getOptions().setUseInsecureSSL(true);
HtmlPage currentPage = (HtmlPage) webClient.getPage(url);
String content = currentPage.asXml();
webClient.waitForBackgroundJavaScript(20000);
System.out.println(content); // NOT SHOWING PROPER CONTECT
}
}
Can you please describe what you mean by "NOT SHOWING PROPER CONTECT"? I don't think there is a mistake in the code.
Sometimes JavaScript execution causes problems for HtmlUnit, so also try disabling it.