I am trying to scrape my router's web admin page to search for connected devices and information about them.
I wrote this code:
WebClient client = new WebClient();
String searchUrl = "http://192.168.1.1";
HtmlPage page = client.getPage(searchUrl);
System.out.println(page.asXml());
The problem is that the markup returned by HtmlUnit is different from what I see in Chrome: in HtmlUnit the section that lists the connected devices is missing.
Try something like this:
try (final WebClient webClient = new WebClient()) {
    // do not stop at js errors
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getPage(searchUrl);

    // now wait for the js starting async (you can play with the timing)
    webClient.waitForBackgroundJavaScript(10000);

    // maybe the page was replaced from the async js
    HtmlPage page = (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();
    System.out.println(page.asXml());
}
Usually that helps.
If you are still facing problems, you should open an issue on GitHub (https://github.com/HtmlUnit/htmlunit).
But keep in mind I can only really help if I can run and debug your code here, which means your web app has to be public.
Related
I have the URL https://www.facebook.com/ads/library/?id=286238429359299 which gets redirected to https://www.facebook.com/ads/library/?active_status=all&ad_type=political_and_issue_ads&country=US&impression_search_field=has_impressions_lifetime&id=286238429359299&view_all_page_id=575939395898200 in the browser.
I'm using the following code:
@Test
public void createWebClient() throws IOException {
    getLogger("com.gargoylesoftware").setLevel(OFF);
    WebClient webClient = new WebClient(CHROME);
    WebClientOptions options = webClient.getOptions();
    options.setJavaScriptEnabled(true);
    options.setRedirectEnabled(true);
    webClient.waitForBackgroundJavaScriptStartingBefore(10000);
    // IMPORTANT: Without the country/language selection cookie the redirection does not work!
    URL s = webClient.getPage("https://www.facebook.com/ads/library/?id=286238429359299").getUrl();
}
The above code doesn't take the redirection into account. Is there something I am missing? I need to get the final URL that the original URL resolves to.
Actually, the URL https://www.facebook.com/ads/library/?id=286238429359299 returns a page with JavaScript. That JavaScript inspects the browser environment; for example, it detects whether the current browser is headless and whether the web driver is legitimate. So I think the solution is to analyze the JavaScript, and then you will get the final URL.
I think it never actually resolves to the final URL because the client is headless.
Please load the same page in a browser, view the source code and search for "page_uri"; you will see exactly the URI you are looking for.
If you check the HtmlUnit output or print the page with
System.out.println(page.asXml());
you will see that "page_uri" contains the originally entered URL.
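For example, a quick way to inspect it (a minimal sketch; page is the HtmlPage returned by getPage, and the 200-character window is arbitrary):
String xml = page.asXml();
int idx = xml.indexOf("page_uri");
if (idx >= 0) {
    // print a short snippet around the marker
    System.out.println(xml.substring(idx, Math.min(xml.length(), idx + 200)));
}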
I suggest using Selenium WebDriver (not headless) instead.
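A minimal sketch of that approach (assuming Selenium 4 and a local ChromeDriver; the wait condition and timeout are illustrative only):
import java.time.Duration;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.WebDriverWait;

WebDriver driver = new ChromeDriver();
try {
    final String start = "https://www.facebook.com/ads/library/?id=286238429359299";
    driver.get(start);
    // wait until the page's JavaScript has redirected away from the start URL
    new WebDriverWait(driver, Duration.ofSeconds(15))
            .until(d -> !d.getCurrentUrl().equals(start));
    System.out.println(driver.getCurrentUrl()); // the resolved URL
} finally {
    driver.quit();
}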
I am trying to get to the docSearch form of the https://eagletw.mohavecounty.us/treasurer/treasurerweb/search.jsp web page using the latest HtmlUnit release (2.37.0). As you can see using Firefox's DOM Inspector, there is such a form:
WebClient webClient = new WebClient();
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.setRefreshHandler(new RefreshHandler() {
    public void handleRefresh(Page page, URL url, int arg) throws IOException {
        System.out.println("handleRefresh");
    }
});
HtmlPage page = (HtmlPage) webClient.getPage("https://eagletw.mohavecounty.us/treasurer/treasurerweb/search.jsp");
webClient.waitForBackgroundJavaScript(1000000);
webClient.waitForBackgroundJavaScriptStartingBefore(100000);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
page.getEnclosingWindow().getJobManager().waitForJobs(1000000);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.waitForBackgroundJavaScriptStartingBefore(1000000);
HtmlForm form = page.getFormByName("docSearch");
The last line of the above code gives me the following exception:
com.gargoylesoftware.htmlunit.ElementNotFoundException: elementName=[form] attributeName=[name] attributeValue=[docSearch]
Any tips on what I can try in my code to get to the docSearch form?
Do you believe this is a problem with HTMLUnit itself? Should I file this as an issue on HTMLUnit's GitHub site?
I have spent some time on this to build a complete sample. The page is only available from the US; I had to set up a VPN to access it. The sample contains some hints; hope that helps.
final String url = "https://eagletw.mohavecounty.us/treasurer/treasurerweb/search.jsp";
try (final WebClient webClient = new WebClient()) {
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    // webClient.getOptions().setUseInsecureSSL(true);

    // open the url, this will do a redirect to the login page
    HtmlPage page = webClient.getPage(url);
    // System.out.println(page.asXml());
    // System.out.println("--------------------------------");

    // click the Public User Login
    for (DomElement elem : page.getElementById("middle_left").getElementsByTagName("input")) {
        if (elem instanceof HtmlSubmitInput
                && "Login".equals(((HtmlSubmitInput) elem).getValueAttribute())) {
            page = elem.click();
            break;
        }
    }
    // System.out.println(page.asXml());
    // System.out.println("--------------------------------");

    // search by owner name
    HtmlInput ownerInput = (HtmlInput) page.getElementById("TaxAOwnerIDSearchString");
    ownerInput.type("Trump");

    // click submit
    for (DomElement elem : page.getElementsByTagName("input")) {
        if (elem instanceof HtmlSubmitInput) {
            page = elem.click();
        }
    }
    // System.out.println(page.asXml());
    // System.out.println("--------------------------------");
    System.out.println(page.asText());
}
Your code looks really desperate. Usually it is more helpful to try to understand what is going on than to copy every snippet you can find into your code and hope it will help.
A good starting point is to understand how the page works. Use a good web proxy like Charles (or Fiddler) to monitor what happens when you open the page with your browser. Sadly I cannot open your URL because my browser reports "server not found"; because of this the rest of this answer is more of a guess.
The next step is to create your web client and try to live with the default settings.
WebClient webClient = new WebClient();
webClient.getOptions().setThrowExceptionOnScriptError(false);
With these two lines your client is ready.
At the very least, your RefreshHandler setup completely breaks the handling of refresh cases.
The next step is to check the output after you get the page and compare it with your browser/web proxy session.
HtmlPage page = (HtmlPage) webClient.getPage("https://eagletw.mohavecounty.us/treasurer/treasurerweb/search.jsp");
System.out.println(page.asXml());
Now you can check whether the form is there (in the output) or not. If not, you have to figure out with the proxy whether there is any kind of JS-based background reloading. Usually you will see such requests in your proxy output.
To wait for this you can call something like
webClient.waitForBackgroundJavaScriptStartingBefore(100_000);
Sometimes these background jobs replace the content of the current window. To take care of this it is a good idea to get the current page content from the window before dumping it.
page = (HtmlPage) page.getEnclosingWindow().getEnclosedPage();
System.out.println(page.asXml());
Hope that clarifies it a bit. If you need more help, I need to be able to access the page myself; otherwise it is only guessing.
I'm trying to scrape a page that uses Cloudflare; until recently this was possible with no issues. However, as of yesterday I'm encountering 503s (the DDoS protection page), and today it transitioned to plain 403s. Inspecting the response, I can see that the page is asking me to enable cookies. I am currently using HtmlUnit to perform the scrapes and have the BrowserVersion set to Chrome.
Here is my current attempt:
private HtmlPage scrapeJS(String targetUrl) throws ScrapeException {
    Log.verbose("Attempting JS scrape ...");
    WebClient client = new WebClient(BrowserVersion.CHROME);
    client.getOptions().setJavaScriptEnabled(true);
    client.getOptions().setCssEnabled(css);
    client.getOptions().setUseInsecureSSL(insecureSSL);
    client.setCookieManager(new CookieManager());
    client.getOptions().setRedirectEnabled(true);

    HtmlPage page;
    try {
        page = client.getPage(targetUrl);
        client.waitForBackgroundJavaScript(10000);
    } catch (FailingHttpStatusCodeException e) {
        Log.verbose("JS scrape resulted in " + e.getStatusCode());
        throw new ScrapeException(source, e);
    } catch (IOException e) {
        throw new ScrapeException(source, e);
    }
    return page;
}
I should mention that this fails both the cookies check and with 503s on my desktop, but it passes the cookies check on my laptop (which is a Mac).
I have looked around a little, but most posts dealing with HtmlUnit seem a bit dated, and the solutions, such as waiting for background JS, do not work; nor does changing the user agent between Firefox and Chrome.
I solved it here: https://stackoverflow.com/a/69760898/2751894
Just use one of these JVM properties:
-Djdk.tls.client.protocols="TLSv1.3,TLSv1.2" or -Dhttps.protocols="TLSv1.3,TLSv1.2"
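If you cannot change the JVM launch flags, the same can be set programmatically, as long as it runs before the first HTTPS connection is opened (a sketch):
// equivalent to the -D flag above; call before the WebClient does any TLS handshake
System.setProperty("jdk.tls.client.protocols", "TLSv1.3,TLSv1.2");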
I'm facing this error while connecting to a website:
java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.toString(Ljava/io/InputStream;Ljava/nio/charset/Charset;)Ljava/lang/String;
at com.gargoylesoftware.htmlunit.WebResponse.getContentAsString(WebResponse.java:242)
I'm using HtmlUnit 2.29, and I am testing this:
WebClient client;
HtmlPage homePage;
Document doc = null;
try {
    client = new WebClient(BrowserVersion.FIREFOX_52);
    client.getOptions().setUseInsecureSSL(true);
    client.setAjaxController(new NicelyResynchronizingAjaxController());
    client.getOptions().setThrowExceptionOnFailingStatusCode(false);
    client.getOptions().setThrowExceptionOnScriptError(false);
    client.waitForBackgroundJavaScript(20000);
    client.waitForBackgroundJavaScriptStartingBefore(20000);
    client.getOptions().setCssEnabled(false);
    client.getOptions().setJavaScriptEnabled(true);
    client.getCache().setMaxSize(0);
    homePage = client.getPage(url);
I've tested it in a separate new class, and it works as normal. But when using this block (client.getPage(url);) in another project, it gives a NoSuchMethodError from IOUtils.toString. What could be missing here?
Usually this is caused by a broken dependency. Make sure you have the correct version of Apache commons-io on your classpath, and also check for duplicates.
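One quick check (a sketch, using only the JDK): print which jar IOUtils is actually loaded from at runtime, to spot an old commons-io shadowing the version you expect.
// prints the location of the jar that provided IOUtils
System.out.println(org.apache.commons.io.IOUtils.class
        .getProtectionDomain().getCodeSource().getLocation());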
And BTW:
client.waitForBackgroundJavaScript(20000);
client.waitForBackgroundJavaScriptStartingBefore(20000);
are NOT options; doing these calls during your client setup is useless. Please read the whole posts on Stack Overflow before copying and pasting.
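If you do want to wait for background JavaScript, those calls belong after a page has been loaded, roughly (reusing the names from your snippet):
homePage = client.getPage(url);
// now there is something to wait for
client.waitForBackgroundJavaScript(20000);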
I am using HtmlUnit from net.sourceforge.htmlunit to simulate a web browser. I am trying to log in to the Steam web app, but I encountered a problem. After setting the credentials, I wanted to use the click method:
final WebClient webClient = new WebClient();
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setRedirectEnabled(true);
webClient.setCookieManager(new CookieManager());
final HtmlPage loginPage = webClient.getPage(loginPageConfiguration.getLoginPageUrl());
final HtmlTextInput user = loginPage.getHtmlElementById(loginPageConfiguration.getLoginInputId());
user.setText(loginCredentials.getUsername());
final HtmlPasswordInput password = loginPage.getHtmlElementById(loginPageConfiguration.getPasswordInputId());
password.setText(loginCredentials.getPassword());
final HtmlPage afterLoginPage = loginPage.getHtmlElementById(loginPageConfiguration.getLoginButtonId()).click();
In a normal browser, after a successful login it redirects to http://store.steampowered.com/, but afterLoginPage still holds the previous login page.
Without knowing the page and without credentials to access it, I can only guess. Maybe the application is a single-page application that replaces the visible content using AJAX (maybe in combination with redirects). Because AJAX is asynchronous, it might help to wait a bit after the click and before addressing the page.
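A minimal sketch of that idea (the timeout is arbitrary; loginPage, webClient and loginPageConfiguration are the objects from your code):
HtmlPage afterLoginPage = loginPage
        .getHtmlElementById(loginPageConfiguration.getLoginButtonId())
        .click();
// give the async JS some time to finish
webClient.waitForBackgroundJavaScript(10000);
// the window content may have been replaced; re-read the page from the window
afterLoginPage = (HtmlPage) afterLoginPage.getEnclosingWindow().getEnclosedPage();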
It also might be a good starting point to understand what is going on by checking the HTTP communication of the real browser (using the developer tools or a web proxy like Charles). You can then compare this with the communication done by HtmlUnit (e.g. enable the HttpClient wire log).
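How to enable the wire log depends on your logging setup; with commons-logging's SimpleLog it can look like this (a sketch, set before the WebClient is created):
// route commons-logging to SimpleLog and turn on the HttpClient wire log
System.setProperty("org.apache.commons.logging.Log",
        "org.apache.commons.logging.impl.SimpleLog");
System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.http.wire", "DEBUG");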
Another option might be a JavaScript error. Please check your log output.