I'm facing this error while connecting to a website.
java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.toString(Ljava/io/InputStream;Ljava/nio/charset/Charset;)Ljava/lang/String;
at com.gargoylesoftware.htmlunit.WebResponse.getContentAsString(WebResponse.java:242)
I'm using HtmlUnit 2.29, and I'm testing with this code:
WebClient client;
HtmlPage homePage;
Document doc = null;
try {
    client = new WebClient(BrowserVersion.FIREFOX_52);
    client.getOptions().setUseInsecureSSL(true);
    client.setAjaxController(new NicelyResynchronizingAjaxController());
    client.getOptions().setThrowExceptionOnFailingStatusCode(false);
    client.getOptions().setThrowExceptionOnScriptError(false);
    client.waitForBackgroundJavaScript(20000);
    client.waitForBackgroundJavaScriptStartingBefore(20000);
    client.getOptions().setCssEnabled(false);
    client.getOptions().setJavaScriptEnabled(true);
    client.getCache().setMaxSize(0);
    homePage = client.getPage(url);
} catch (Exception e) {
    e.printStackTrace();
}
I've tested it in a separate new class and it works as expected. But when I use the same block (client.getPage(url);) in another project, it throws the NoSuchMethodError from IOUtils.toString. What could be missing here?
Usually this is caused by a broken dependency. Make sure you have the correct version of commons-io (org.apache.commons.io) on your classpath, and check for duplicates as well.
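As a quick sanity check, here is a minimal sketch (the class name is just for illustration) that prints which JAR the IOUtils class is actually loaded from. The toString(InputStream, Charset) overload only exists in commons-io 2.3 and later, so an older JAR shadowing the correct one would explain the error:

import org.apache.commons.io.IOUtils;

public class IoUtilsOriginCheck {
    public static void main(String[] args) {
        // Prints the JAR (or directory) that IOUtils was loaded from,
        // revealing any stale commons-io copy on the classpath.
        System.out.println(IOUtils.class
                .getProtectionDomain()
                .getCodeSource()
                .getLocation());
    }
}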
And by the way:
client.waitForBackgroundJavaScript(20000);
client.waitForBackgroundJavaScriptStartingBefore(20000);
are not options; calling them during client setup is useless, because they wait for background JavaScript jobs that only exist once a page has been loaded. Please read the whole posts on Stack Overflow before copy-and-pasting.
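For illustration, a minimal sketch of where those calls belong: after the page load, once background jobs can actually be running.

homePage = client.getPage(url);
// wait for JavaScript jobs started by the page (the timeout is adjustable)
client.waitForBackgroundJavaScript(20000);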
Related
I'm writing a generic web crawler that extracts the main content from a given web page (it has to crawl many different pages).
I've tried to achieve this with different tools, among them:
HtmlUnit: returned too much scrap when crawling.
Essence: failed to extract the important information on many pages.
Boilerpipe: retrieves the content successfully, with almost perfect results, but:
When I try to crawl pages like TripAdvisor, instead of the given page's HTML it returns the following message:
We noticed that you're using an unsupported browser. The Tripadvisor website may not display properly. We support the following browsers: Windows: Internet Explorer, Mozilla Firefox, Google Chrome. Mac: Safari.
I am using this user agent:
private final static String USER_AGENT = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
I've also tried different user agents, even mobile ones, but I always get the same message. Could it be related to JavaScript?
My code is the following, in case it's needed:
public void getPageName(String urlString) throws Exception {
    try (final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
        boolean javascriptEnabled = true;
        webClient.setRefreshHandler(new WaitingRefreshHandler(TIMEOUT / 1000));
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
        webClient.getCache().setMaxSize(0);
        webClient.getOptions().setRedirectEnabled(true);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setJavaScriptEnabled(javascriptEnabled);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setTimeout(TIMEOUT);

        // Boilerpipe. NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you.
        URL url = new URL(urlString);
        InputSource is = new InputSource();
        is.setEncoding("UTF-8");
        is.setByteStream(url.openStream());
        String text = DefaultExtractor.INSTANCE.getText(is);

        System.out.println("\n******************\n");
        System.out.println(text);
        System.out.println("\n******************\n");
        writeIntoFile(text);
    } catch (Exception e) {
        System.out.println("Error when reading page " + e);
    }
}
We noticed that you're using an unsupported browser. The Tripadvisor website may not display properly. We support the following browsers: Windows: Internet Explorer, Mozilla Firefox, Google Chrome. Mac: Safari.
Most websites require JavaScript, and a message like this usually means that your code does not support JavaScript. In your snippet that is literally the case: Boilerpipe reads url.openStream() directly, bypassing the WebClient entirely, so no JavaScript ever runs.
Maybe you should give HtmlUnit a second try. And if you have suggestions or bug reports for HtmlUnit, feel free to open issues on GitHub and I will try to help.
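One way to combine the two is sketched below (untested against TripAdvisor, and whether ArticleExtractor or DefaultExtractor gives better results is your call): let HtmlUnit fetch and render the page so JavaScript runs, then hand the rendered markup to Boilerpipe instead of a raw stream. This would replace the body of a method like getPageName above, reusing its TIMEOUT constant:

try (final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setTimeout(TIMEOUT);
    HtmlPage page = webClient.getPage(urlString);
    // give asynchronous JavaScript a chance to finish
    webClient.waitForBackgroundJavaScript(TIMEOUT);
    // getText(String) parses the already-rendered HTML
    String text = ArticleExtractor.INSTANCE.getText(page.asXml());
    System.out.println(text);
}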
I have the URL https://www.facebook.com/ads/library/?id=286238429359299 which gets redirected to https://www.facebook.com/ads/library/?active_status=all&ad_type=political_and_issue_ads&country=US&impression_search_field=has_impressions_lifetime&id=286238429359299&view_all_page_id=575939395898200 in the browser.
I'm using the following code:
@Test
public void createWebClient() throws IOException {
    getLogger("com.gargoylesoftware").setLevel(OFF);
    WebClient webClient = new WebClient(CHROME);
    WebClientOptions options = webClient.getOptions();
    options.setJavaScriptEnabled(true);
    options.setRedirectEnabled(true);
    webClient.waitForBackgroundJavaScriptStartingBefore(10000);
    // IMPORTANT: Without the country/language selection cookie the redirection does not work!
    URL s = webClient.getPage("https://www.facebook.com/ads/library/?id=286238429359299").getUrl();
}
The above code doesn't take the redirection into account. Is there something I am missing? I need to get the final URL that the original URL resolves to.
Actually, the URL https://www.facebook.com/ads/library/?id=286238429359299 returns a page with JavaScript. That JavaScript detects the browser environment; for example, it checks whether the current browser is headless and whether the web driver is legitimate. So I think the solution is to analyze the JavaScript, and then you will get the final URL.
I think it never actually resolves to the final URL because the browser is headless.
Load the same page in a real browser, view the source code, and search for "page_uri"; you will see exactly the URI you are looking for.
If you check the HtmlUnit output or print the page with
System.out.println(page.asXml());
you will see that "page_uri" contains the originally entered URL.
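For illustration, a hypothetical sketch of pulling that value out of the rendered markup with a regex (the "page_uri" key comes from the observation above; the exact shape of the embedded data on the page is an assumption):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String xml = page.asXml();
// look for "page_uri":"..." inside the embedded script data
Matcher m = Pattern.compile("\"page_uri\"\\s*:\\s*\"([^\"]+)\"").matcher(xml);
if (m.find()) {
    System.out.println(m.group(1));
}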
I suggest using Selenium WebDriver (not headless).
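A minimal sketch of that suggestion (assuming chromedriver is installed; a real, non-headless browser executes the redirect script itself):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

WebDriver driver = new ChromeDriver();
driver.get("https://www.facebook.com/ads/library/?id=286238429359299");
// the JS redirect may take a moment; add an explicit wait if needed
String finalUrl = driver.getCurrentUrl();
driver.quit();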
I am trying to scrape my router's web page to search for connected devices and information about them.
I have written this code:
WebClient client = new WebClient();
String searchUrl = "http://192.168.1.1";
HtmlPage page = client.getPage(searchUrl);
System.out.println(page.asXml());
The problem is that the markup returned by HtmlUnit is different from what I see in Chrome. In HtmlUnit I don't get the section that lists the connected devices.
Try something like this:
try (final WebClient webClient = new WebClient()) {
    // do not stop at js errors
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getPage(searchUrl);
    // now wait for the js starting async (you can play with the timing)
    webClient.waitForBackgroundJavaScript(10000);
    // maybe the page was replaced from the async js
    HtmlPage page = (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();
    System.out.println(page.asXml());
}
Usually that helps.
If you are still facing problems, open an issue on GitHub (https://github.com/HtmlUnit/htmlunit).
But keep in mind that I can only really help if I can run and debug your code here, which means your web app has to be public.
I am using HtmlUnit from net.sourceforge.htmlunit to simulate a web browser. I am trying to log in to the Steam web app, but I have encountered a problem. After setting the credentials I wanted to use the click method:
final WebClient webClient = new WebClient();
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setRedirectEnabled(true);
webClient.setCookieManager(new CookieManager());
final HtmlPage loginPage = webClient.getPage(loginPageConfiguration.getLoginPageUrl());
final HtmlTextInput user = loginPage.getHtmlElementById(loginPageConfiguration.getLoginInputId());
user.setText(loginCredentials.getUsername());
final HtmlPasswordInput password = loginPage.getHtmlElementById(loginPageConfiguration.getPasswordInputId());
password.setText(loginCredentials.getPassword());
final HtmlPage afterLoginPage = loginPage.getHtmlElementById(loginPageConfiguration.getLoginButtonId()).click();
In a normal browser, after a successful login it redirects to http://store.steampowered.com/, but afterLoginPage is still the previous login page.
Without knowing the page and without credentials to access it, I can only guess. Maybe the application is a single-page application that replaces the visible content using AJAX (maybe in combination with redirects). Because AJAX is asynchronous, it might help to wait a bit after the click and before inspecting the page, as sketched below.
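A minimal sketch of that idea, reusing the webClient and loginPageConfiguration from the question (the 10-second timeout is an arbitrary assumption):

loginPage.getHtmlElementById(loginPageConfiguration.getLoginButtonId()).click();
// let asynchronous AJAX jobs finish before reading the result
webClient.waitForBackgroundJavaScript(10000);
// the window content may have been replaced, so re-read the page
HtmlPage afterLoginPage = (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();
System.out.println(afterLoginPage.getUrl());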
It might also be a good starting point to check the HTTP communication of a real browser (using the developer tools or a web proxy like Charles) to understand what is going on. You can then compare it with the communication done by HtmlUnit (e.g. enable the HttpClient wire log).
Another option might be a JavaScript error. Please check your log output.
I want to export data from VersionOne into my own Java application. Please help me retrieve this data. I used the following code, but it is not working:
V1APIConnector dataConnector = new V1APIConnector("http://www10.v1host.com/loxvo/rest-1.v1/", "username", "password");
V1APIConnector metaConnector = new V1APIConnector("http://www10.v1host.com/loxvo/meta.v1/");
MetaModel metaModel = new MetaModel(metaConnector);
Services services = new Services(metaModel, dataConnector);
It seems there is some problem with my URL. Please tell me what the proper URL would be, as my company URL is https://www10.v1host.com/loxvo/.
You have the correct form of the URL in your post, but not in your sample code. All hosted instances use https, as you show at the end, but your code has http. While a browser will simply accept a redirect and take you from http to https, the API client code does not; it simply fails to establish a connection.
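Concretely, the connector setup from the question just needs the https scheme:

V1APIConnector dataConnector = new V1APIConnector(
        "https://www10.v1host.com/loxvo/rest-1.v1/", "username", "password");
V1APIConnector metaConnector = new V1APIConnector(
        "https://www10.v1host.com/loxvo/meta.v1/");
MetaModel metaModel = new MetaModel(metaConnector);
Services services = new Services(metaModel, dataConnector);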