I'm trying to scrape a page that uses Cloudflare; until recently this was possible with no issues. However, as of yesterday I'm encountering 503 responses (the DDoS protection page), and today it has transitioned to plain 403s. Inspecting the response, I can see that the page is asking me to enable cookies. I am currently using HtmlUnit to perform the scrapes and I have the BrowserVersion set to Chrome.
Here is my current attempt:
private HtmlPage scrapeJS(String targetUrl) throws ScrapeException {
    Log.verbose("Attempting JS scrape ...");

    WebClient client = new WebClient(BrowserVersion.CHROME);
    client.getOptions().setJavaScriptEnabled(true);
    client.getOptions().setCssEnabled(css);
    client.getOptions().setUseInsecureSSL(insecureSSL);
    client.setCookieManager(new CookieManager());
    client.getOptions().setRedirectEnabled(true);

    HtmlPage page;
    try {
        page = client.getPage(targetUrl);
        client.waitForBackgroundJavaScript(10000);
    } catch (FailingHttpStatusCodeException e) {
        Log.verbose("JS scrape resulted in " + e.getStatusCode());
        throw new ScrapeException(source, e);
    } catch (IOException e) {
        throw new ScrapeException(source, e);
    }
    return page;
}
I should mention that this fails both the cookies check and the 503 on my desktop, but it passes the cookies check on my laptop (which is a Mac).
I have looked around a little, but most posts dealing with HtmlUnit seem a bit dated, and the suggested solutions, such as waiting for background JS, do not work, nor does changing the user agent between Firefox and Chrome.
I solved it here: https://stackoverflow.com/a/69760898/2751894
Just use one of these JVM properties:
-Djdk.tls.client.protocols="TLSv1.3,TLSv1.2" or -Dhttps.protocols="TLSv1.3,TLSv1.2"
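The same properties can also be set programmatically, as long as this happens before the JVM opens its first TLS connection. A minimal sketch (the target URL is only a placeholder):

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class TlsProtocolsExample {
    public static void main(String[] args) throws Exception {
        // Must be set before the first TLS handshake in this JVM.
        System.setProperty("jdk.tls.client.protocols", "TLSv1.3,TLSv1.2");
        System.setProperty("https.protocols", "TLSv1.3,TLSv1.2");

        try (WebClient client = new WebClient(BrowserVersion.CHROME)) {
            client.getOptions().setJavaScriptEnabled(true);
            HtmlPage page = client.getPage("https://example.com/"); // placeholder URL
            System.out.println(page.getTitleText());
        }
    }
}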
I'm programming a generic web crawler that extracts the main content from a given webpage (it has to crawl different pages).
I've tried to achieve this with different tools, among them:
HtmlUnit: returned too much junk when crawling.
Essence: failed to get the important information on many pages.
Boilerpipe: retrieves the content successfully, with almost perfect results, but:
When I try to crawl pages like TripAdvisor, instead of the requested page's HTML it returns the following message:
We noticed that you're using an unsupported browser. The Tripadvisor website may not display properly. We support the following browsers: Windows: Internet Explorer, Mozilla Firefox, Google Chrome. Mac: Safari.
I am using this user agent:
private final static String USER_AGENT = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
I've also tried different user agents, even mobile ones, but I always get the same error. Could it be related to JavaScript?
My code is the following, if needed:
public void getPageName(String urlString) throws Exception {
    try (final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
        boolean javascriptEnabled = true;
        webClient.setRefreshHandler(new WaitingRefreshHandler(TIMEOUT / 1000));
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
        webClient.getCache().setMaxSize(0);
        webClient.getOptions().setRedirectEnabled(true);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setJavaScriptEnabled(javascriptEnabled);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setTimeout(TIMEOUT);

        // Boilerpipe
        // NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you
        URL url = new URL(urlString);
        InputSource is = new InputSource();
        is.setEncoding("UTF-8");
        is.setByteStream(url.openStream());
        String text = DefaultExtractor.INSTANCE.getText(is);

        System.out.println("\n******************\n");
        System.out.println(text);
        System.out.println("\n******************\n");
        writeIntoFile(text);
    } catch (Exception e) {
        System.out.println("Error when reading page " + e);
    }
}
Most websites require JavaScript, and usually this kind of message shows that your code does not support JavaScript.
Maybe you have to give HtmlUnit a second try. And if you have some suggestions or bug reports for HtmlUnit, feel free to open issues on GitHub and I will try to help.
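One hedged reading of that advice, applied to the code above: Boilerpipe is fed directly from url.openStream(), so the configured WebClient, its user agent and its JavaScript support never touch the download. A minimal sketch of letting HtmlUnit render the page first and then handing the rendered markup to Boilerpipe (the method name getMainContent and the 10-second wait are just illustrative choices):

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public String getMainContent(String urlString) throws Exception {
    try (WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setJavaScriptEnabled(true);

        HtmlPage page = webClient.getPage(urlString);
        webClient.waitForBackgroundJavaScript(10_000); // let async JS finish

        // Boilerpipe can extract from an HTML string instead of a raw URL stream.
        String renderedHtml = page.asXml();
        return ArticleExtractor.INSTANCE.getText(renderedHtml);
    }
}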
I am using HtmlUnit to load a webpage containing a dynamically updated Ajax component, using the following:
WebClient webClient = new WebClient(BrowserVersion.CHROME);
URL url = new URL("https://live.xxx.com/en/ajax/getDetailedQuote/" + instrument);
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);
HtmlPage redirectPage = webClient.getPage(requestSettings);
This works and I get the contents of the page at the time of request.
I want however to be able to monitor and respond to changes on the page.
I tried the following:
webClient.addWebWindowListener(new WebWindowListener() {
    public void webWindowOpened(WebWindowEvent event) { }

    public void webWindowContentChanged(WebWindowEvent event) {
        System.out.println("Content changed ");
    }

    public void webWindowClosed(WebWindowEvent event) { }
});
But I only get "Content changed" when the page is first loaded, and not when it updates.
I fear there is not really a solution with HtmlUnit (at least not out of the box). The webWindowContentChanged hook is only called if the (whole) page inside the window is replaced.
You can try to implement a DomChangeListener and attach it to the page (or maybe to the body of the page); a sketch follows below.
If you would like to track the Ajax requests more at the HTTP level, you have the option to intercept the requests (see https://htmlunit.sourceforge.io/faq.html#HowToModifyRequestOrResponse for more details).
If you need more, please open an issue and we can discuss.
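A minimal sketch of the DomChangeListener idea, reusing the webClient and requestSettings from the question above (the 30-second wait is arbitrary):

HtmlPage page = webClient.getPage(requestSettings);

// React to nodes being added to or removed from the page's DOM.
page.addDomChangeListener(new DomChangeListener() {
    public void nodeAdded(DomChangeEvent event) {
        System.out.println("Node added: " + event.getChangedNode().getNodeName());
    }

    public void nodeDeleted(DomChangeEvent event) {
        System.out.println("Node deleted: " + event.getChangedNode().getNodeName());
    }
});

// Give the background JavaScript time to fire its updates.
webClient.waitForBackgroundJavaScript(30_000);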
I am trying to scrape my router's web page to find connected devices and information about them.
I have written this code:
WebClient client = new WebClient(); // WebClient assumed to be created beforehand
String searchUrl = "http://192.168.1.1";
HtmlPage page = client.getPage(searchUrl);
System.out.println(page.asXml());
The problem is that the markup returned by HtmlUnit is different from what Chrome shows. In the HtmlUnit output I don't have the section that lists the connected devices.
Try something like this:
try (final WebClient webClient = new WebClient()) {
    // do not stop at js errors
    webClient.getOptions().setThrowExceptionOnScriptError(false);

    webClient.getPage(searchUrl);

    // now wait for the js starting async (you can play with the timing)
    webClient.waitForBackgroundJavaScript(10000);

    // maybe the page was replaced from the async js
    HtmlPage page = (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();
    System.out.println(page.asXml());
}
Usually that helps.
If you are still facing problems, you have to open an issue on GitHub (https://github.com/HtmlUnit/htmlunit).
But keep in mind I can only really help if I can run and debug your code here, which means your web app has to be public.
I am using the following code to update a confluence page:
public void publish() throws IOException {
    XWikiXmlRpcClient rpc = new XWikiXmlRpcClient(CONFLUENCE_URI);
    try {
        rpc.login(USER_NAME, PASSWORD);

        // The info macro would get rendered as an info box in the page
        Page page = new Page();
        page.setSpace("ATF");
        page.setTitle("New Page");
        page.setContent("New Page Created \\\\ {{info}}This is XMLRPC Test{{/info}}");
        page.setParentId("demo UTF Home");
        rpc.storePage(page);
    } catch (XmlRpcException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
When I try to run the program, I get the following exception:
org.apache.xmlrpc.client.XmlRpcClientException: Failed to parse server's response: Expected methodResponse element, got html
This looks like a bug in the Apache XML-RPC client, going by this JIRA:
https://issues.apache.org/jira/browse/XMLRPC-159
It says the issue was fixed in 3.1.2 of the library, but I am using 3.1.3.
Has anyone seen this before?
Maybe the server really returned HTML; sometimes a server simply returns a 200 because something is there that always produces HTML. In that case the bugfix in the XML-RPC library you linked to does not apply.
To check for this possibility, look into the server access logs for the URL of the request and its status code (it should be a 200); with this information you can replay the request, e.g. in a browser or a command-line client like wget or curl, and see what is really returned as the response.
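As an illustration of that check, here is a small Java sketch that fetches the endpoint and dumps the raw status and body, so you can see whether the server answers with XML or with an HTML page (the endpoint URL below is a placeholder; use whatever CONFLUENCE_URI points at):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ReplayCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; substitute the URL your XML-RPC client is configured with.
        URL endpoint = new URL("https://confluence.example.com/rpc/xmlrpc");

        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("GET");

        int status = conn.getResponseCode();
        System.out.println("Status: " + status);

        InputStream body = status < 400 ? conn.getInputStream() : conn.getErrorStream();
        if (body == null) {
            System.out.println("(no response body)");
            return;
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(body, StandardCharsets.UTF_8))) {
            // If this prints an HTML page rather than XML, the parse error above is
            // explained by the response itself, not by the XML-RPC client library.
            in.lines().forEach(System.out::println);
        }
    }
}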
I want to access instagram pages without using the API. I need to find the number of followers, so it is not simply a source download, since the page is being built dynamically.
I found HtmlUnit as a library to simulate the browser, so that the JS gets rendered, and I get back the content I want.
HtmlPage myPage = ((HtmlPage) webClient.getPage("http://www.instagram.com/instagram"));
This call however results in the following exception:
Exception in thread "main" com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 403 Forbidden for http://d36xtkk24g8jdx.cloudfront.net/bluebar/3a30db9/scripts/webfont.js
So it can't access that script, but if I'm interpreting this correctly, it's just for font loading, which I don't need. I googled how to tell it to ignore parts of the page, and found this thread.
webClient.setWebConnection(new WebConnectionWrapper(webClient) {
    @Override
    public WebResponse getResponse(final WebRequest request) throws IOException {
        if (request.getUrl().toString().contains("webfont")) {
            System.out.println(request.getUrl().toString());
            return super.getResponse(request);
        } else {
            System.out.println("returning response...");
            return new StringWebResponse("", request.getUrl());
        }
    }
});
With that code, the exception goes away, but the source (or page title, or anything else I've tried) seems to be empty. "returning response..." is printed once.
I'm open to different approaches as well. Ultimately, entire page source in a single string would be good enough for me, but I need the JS to execute.
HtmlUnit with JS is not a good solution, because its Mozilla Rhino JavaScript engine does not work for many JS-heavy pages and has a lot of problems.
You can use PhantomJS as a WebDriver:
PhantomJs
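For completeness, a sketch of the PhantomJS-as-WebDriver route, assuming the Selenium GhostDriver bindings (org.openqa.selenium.phantomjs) are on the classpath and that the binary path below is adjusted for your machine:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.remote.DesiredCapabilities;

public class PhantomJsExample {
    public static void main(String[] args) {
        DesiredCapabilities caps = new DesiredCapabilities();
        // Assumed install location of the phantomjs binary; change as needed.
        caps.setCapability("phantomjs.binary.path", "/usr/local/bin/phantomjs");

        WebDriver driver = new PhantomJSDriver(caps);
        try {
            driver.get("http://www.instagram.com/instagram");
            // The fully rendered page source, after JavaScript has executed.
            System.out.println(driver.getPageSource());
        } finally {
            driver.quit();
        }
    }
}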