I'm programming a generic webcrawler that gets the main content from a given webpage (it has to crawl different pages).
I've tried to achieve this with different tools, among them:
HtmlUnit: returned too much scrap when crawling.
Essence: failed to get the important information on many pages.
Boilerpipe: retrieves the content successfully, almost perfect results but:
When I try to crawl pages like TripAdvisor, instead of the given webpage's HTML it returns the following message:
We noticed that you're using an unsupported browser. The Tripadvisor
website may not display properly.We support the following browsers:
Windows: Internet Explorer, Mozilla Firefox, Google Chrome. Mac:
Safari.
I am using user agent:
private final static String USER_AGENT = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
I've also tried different user agents, even mobile ones, but I always get the same error. Could it be related to JavaScript?
My code is the following, if needed:
public void getPageName(String urlString) throws Exception {
    try (final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
        boolean javascriptEnabled = true;
        webClient.setRefreshHandler(new WaitingRefreshHandler(TIMEOUT / 1000));
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
        webClient.getCache().setMaxSize(0);
        webClient.getOptions().setRedirectEnabled(true);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setJavaScriptEnabled(javascriptEnabled);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setTimeout(TIMEOUT);

        // Boilerpipe
        // NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you
        URL url = new URL(urlString);
        InputSource is = new InputSource();
        is.setEncoding("UTF-8");
        is.setByteStream(url.openStream());
        String text = DefaultExtractor.INSTANCE.getText(is);

        System.out.println("\n******************\n");
        System.out.println(text);
        System.out.println("\n******************\n");
        writeIntoFile(text);
    } catch (Exception e) {
        System.out.println("Error when reading page " + e);
    }
}
Most websites require JavaScript, and this kind of message usually means that your code does not support JavaScript.
Maybe you should give HtmlUnit a second try. And if you have suggestions or bug reports for HtmlUnit, feel free to open issues on GitHub and I will try to help.
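One thing worth noting about the code in the question: the Boilerpipe call reads url.openStream() directly, so the carefully configured WebClient (user agent, JavaScript, redirects) is never used for the actual fetch. Below is a minimal sketch of letting HtmlUnit download and render the page first and only then handing the markup to Boilerpipe, assuming HtmlUnit 2.x and Boilerpipe's ArticleExtractor are on the classpath:

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class HtmlUnitBoilerpipeSketch {

    public static String extractMainContent(String urlString) throws Exception {
        try (WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

            // Fetch with full browser emulation (headers, cookies, JavaScript)...
            HtmlPage page = webClient.getPage(urlString);
            webClient.waitForBackgroundJavaScript(10_000);

            // ...and only then hand the rendered markup to Boilerpipe.
            return ArticleExtractor.INSTANCE.getText(page.asXml());
        }
    }
}

Whether TripAdvisor then serves real content still depends on its bot detection, so treat this as a starting point rather than a guaranteed fix.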
I have the URL https://www.facebook.com/ads/library/?id=286238429359299 which gets redirected to https://www.facebook.com/ads/library/?active_status=all&ad_type=political_and_issue_ads&country=US&impression_search_field=has_impressions_lifetime&id=286238429359299&view_all_page_id=575939395898200 in the browser.
I'm using the following code:
@Test
public void createWebClient() throws IOException {
    getLogger("com.gargoylesoftware").setLevel(OFF);
    WebClient webClient = new WebClient(CHROME);
    WebClientOptions options = webClient.getOptions();
    options.setJavaScriptEnabled(true);
    options.setRedirectEnabled(true);
    webClient.waitForBackgroundJavaScriptStartingBefore(10000);
    // IMPORTANT: Without the country/language selection cookie the redirection does not work!
    URL s = webClient.getPage("https://www.facebook.com/ads/library/?id=286238429359299").getUrl();
}
The above code doesn't take the redirection into account. Is there something I am missing? I need to get the final URL that the original URL resolves to.
Actually, the URL https://www.facebook.com/ads/library/?id=286238429359299 returns a page with JavaScript. The JavaScript detects the environment of the web browser; for example, it checks whether the current browser is headless and whether the web driver is legitimate. So I think the solution is to analyse the JavaScript, and then you will get the final URL.
I think it never actually resolves to the final URL due to being headless.
Load the same page in a browser, view the source code and search for "page_uri", and you will see exactly the URI you are looking for.
If you check the HtmlUnit output or print the page:
System.out.println(page.asXml());
you will see that "page_uri" contains the originally entered URL.
I suggest using Selenium WebDriver (not headless).
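As a quick way to see this, here is a minimal sketch (assuming the same WebClient setup as in the question) that prints where HtmlUnit actually landed and a short window of the rendered markup around "page_uri"; what that markup contains still depends on Facebook's bot detection, as noted above:

HtmlPage page = webClient.getPage("https://www.facebook.com/ads/library/?id=286238429359299");
webClient.waitForBackgroundJavaScript(10000);

System.out.println("Landed on: " + page.getUrl());   // URL after any redirects HtmlUnit followed

String xml = page.asXml();
int i = xml.indexOf("page_uri");
if (i >= 0) {
    // Print a short window around the marker instead of the whole document
    System.out.println(xml.substring(i, Math.min(xml.length(), i + 200)));
}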
I'm trying to scrape a page that uses Cloudflare; until recently this was possible with no issues. However, as of yesterday I'm encountering 503s (the DDoS protection page), and today it transitioned to plain 403s. Inspecting the response, I can see that the page is asking me to enable cookies. I am currently using HtmlUnit to perform the scrapes, with the BrowserVersion set to Chrome.
Here is my current attempt:
private HtmlPage scrapeJS(String targetUrl) throws ScrapeException {
    Log.verbose("Attempting JS scrape ...");
    WebClient client = new WebClient(BrowserVersion.CHROME);
    client.getOptions().setJavaScriptEnabled(true);
    client.getOptions().setCssEnabled(css);
    client.getOptions().setUseInsecureSSL(insecureSSL);
    client.setCookieManager(new CookieManager());
    client.getOptions().setRedirectEnabled(true);

    HtmlPage page;
    try {
        page = client.getPage(targetUrl);
        client.waitForBackgroundJavaScript(10000);
    } catch (FailingHttpStatusCodeException e) {
        Log.verbose("JS scrape resulted in " + e.getStatusCode());
        throw new ScrapeException(source, e);
    } catch (IOException e) {
        throw new ScrapeException(source, e);
    }
    return page;
}
I should mention that this fails both the cookies check and 503s on my desktop, but it passes the cookies check on my laptop (which is a Mac).
I have looked around a little, but most posts dealing with HtmlUnit seem a bit dated, and the suggested solutions, such as waiting for background JS, do not work, nor does changing the user agent between Firefox and Chrome.
I solved it here: https://stackoverflow.com/a/69760898/2751894
Just use one of these JVM properties:
-Djdk.tls.client.protocols="TLSv1.3,TLSv1.2" or -Dhttps.protocols="TLSv1.3,TLSv1.2"
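If you would rather not pass JVM flags, the same properties can be set programmatically; a small sketch (set them early, before any HTTPS connection is made, otherwise the values may be ignored):

// Same property names as the flags above, set from code before the first TLS handshake.
System.setProperty("jdk.tls.client.protocols", "TLSv1.3,TLSv1.2");
System.setProperty("https.protocols", "TLSv1.3,TLSv1.2");

// ...then create the WebClient and scrape as before.
WebClient client = new WebClient(BrowserVersion.CHROME);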
I was trying to scrape links from Google using 600 different searches. In the process I started getting the following error.
Error
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=http://ipv4.google.com/sorry/IndexRedirect?continue=http://google.com/search/...
I've done my research, and it happens because of the Google Scholar ban, which restricts you to a limited number of searches and requires solving a CAPTCHA to proceed, which Jsoup can't do.
Code
Document doc = Jsoup.connect("http://google.com/search?q=" + keyWord)
.userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
.timeout(5000)
.get();
Answers on the internet are extremely vague and don't provide a clear solution. Someone did mention that cookies can solve this issue, but didn't say a single thing about how to do it.
Some hints to improve your scraping:
1. Use proxies
Proxies reduce your chances of getting caught by a CAPTCHA. You should use between 50 and 150 proxies, depending on your average result set. Here are two websites that can provide some proxies: SEO-proxies.com or Proxify Switch Proxy.
// Setup proxy
String proxyAddress = "1.2.3.4";
int proxyPort = 1234;
Proxy proxy = new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(proxyAddress, proxyPort));

// Fetch url with proxy
Document doc = Jsoup.connect(searchUrl)
        .proxy(proxy)
        .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2")
        .header("Content-Language", "en-US")
        .get();
2. Captchas
If by any means you get caught by a CAPTCHA, you can use an online CAPTCHA solving service (Bypass Captcha or DeathByCaptcha, to name a few). Below is a generic step-by-step procedure to get the CAPTCHA solved automatically:
Detect captcha error page
try {
    // Perform search here...
} catch (HttpStatusException e) {
    switch (e.getStatusCode()) {
        case java.net.HttpURLConnection.HTTP_UNAVAILABLE:
            if (e.getUrl().contains("http://ipv4.google.com/sorry/IndexRedirect?continue=http://google.com/search/...")) {
                // Ask online captcha service for help...
            } else {
                // ...
            }
            break;
        default:
            // ...
    }
}
Download the captcha image (CI)
byte[] captchaImage = Jsoup
        .connect(imageCaptchaUrl)
        //.cookie(..., ...)       // Some cookies may be needed...
        .ignoreContentType(true)  // Needed for fetching an image
        .execute()
        .bodyAsBytes();           // byte[] array returned...
Send the CI to the online captcha solving service
This part depends on the captcha service API. You can find some services in this 8 best captcha solving services article.
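For illustration only, here is what such a call could look like with plain HttpURLConnection; the endpoint https://captcha-service.example/solve and the image field name are hypothetical, since the real request format depends entirely on the service you pick:

// Hypothetical endpoint and field name -- adapt to your captcha service's API.
String payload = "image=" + URLEncoder.encode(
        Base64.getEncoder().encodeToString(captchaImage), "UTF-8");

HttpURLConnection conn = (HttpURLConnection)
        new URL("https://captcha-service.example/solve").openConnection();
conn.setRequestMethod("POST");
conn.setDoOutput(true);
try (OutputStream out = conn.getOutputStream()) {
    out.write(payload.getBytes(StandardCharsets.UTF_8));
}

// The service's answer, e.g. the decoded captcha text.
String captchaSolution = new String(conn.getInputStream().readAllBytes(), StandardCharsets.UTF_8).trim();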
Wait for the response... (1-2 seconds is perfect)
Fill the form with the response and send it with Jsoup
The Jsoup FormElement is a life saver here. See this working sample code for details.
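A hedged sketch of this step with Jsoup's FormElement; the captchaPageUrl variable, the form selector and the input[name=captcha] field are hypothetical and depend on the actual sorry-page markup:

// Reload the captcha page, fill in the solved text and submit the form.
Document sorryPage = Jsoup.connect(captchaPageUrl).get();
FormElement form = (FormElement) sorryPage.select("form").first();
form.select("input[name=captcha]").val(captchaSolution);   // hypothetical field name
Connection.Response response = form.submit().execute();    // cookies from the response can be reused for further searches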
3. Some other hints
The Hints for Google scrapers article can give you some more pointers for improving your code. You'll find the first two hints presented here plus some more:
Cookies: clear them on each IP change or don't use them at all
Threads: you should not open too many connections. Firefox limits itself to 4 connections per proxy.
Returned results: append &num=100 to your URL to send fewer requests
Request rates: make your requests look human. You should not send more than 500 requests per 24h per IP. (A small sketch combining the last two hints follows below.)
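A small sketch combining the last two hints (the pause length is an arbitrary example value):

// Ask for 100 results per page so fewer requests are needed...
String searchUrl = "http://google.com/search?q=" + URLEncoder.encode(keyWord, "UTF-8") + "&num=100";
Document doc = Jsoup.connect(searchUrl)
        .userAgent("Mozilla/5.0")
        .timeout(5000)
        .get();

// ...and pause between requests so the rate looks human.
Thread.sleep(30_000 + new java.util.Random().nextInt(30_000));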
References:
How to use Jsoup through a proxy?
How to download an image with Jsoup?
How to fill a form with Jsoup?
FormElement javadoc
HttpStatusException javadoc
As an alternative to Stephan's answer, you can use this package to get Google search results without the hassle of proxies. Code sample:
Map<String, String> parameter = new HashMap<>();
parameter.put("q", "Coffee");
parameter.put("location", "Portland");
GoogleSearchResults serp = new GoogleSearchResults(parameter);
JsonObject data = serp.getJson();
JsonArray results = (JsonArray) data.get("organic_results");
JsonObject first_result = results.get(0).getAsJsonObject();
System.out.println("first coffee: " + first_result.get("title").getAsString());
Project Github
I want to access instagram pages without using the API. I need to find the number of followers, so it is not simply a source download, since the page is being built dynamically.
I found HtmlUnit as a library to simulate the browser, so that the JS gets rendered, and I get back the content I want.
HtmlPage myPage = ((HtmlPage) webClient.getPage("http://www.instagram.com/instagram"));
This call however results in the following exception:
Exception in thread "main" com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 403 Forbidden for http://d36xtkk24g8jdx.cloudfront.net/bluebar/3a30db9/scripts/webfont.js
So it can't access that script, but if I'm interpreting this correctly, it's just for font loading, which I don't need. I googled how to tell it to ignore parts of the page, and found this thread.
webClient.setWebConnection(new WebConnectionWrapper(webClient) {
    @Override
    public WebResponse getResponse(final WebRequest request) throws IOException {
        if (request.getUrl().toString().contains("webfont")) {
            System.out.println(request.getUrl().toString());
            return super.getResponse(request);
        } else {
            System.out.println("returning response...");
            return new StringWebResponse("", request.getUrl());
        }
    }
});
With that code, the exception goes away, but the source (or page title, or anything else I've tried) seems to be empty. "returning response..." is printed once.
I'm open to different approaches as well. Ultimately, entire page source in a single string would be good enough for me, but I need the JS to execute.
HtmlUnit with JS is not a good solution, because its JavaScript engine (Mozilla Rhino) does not work for many JS-heavy pages and has a lot of problems.
You can use PhantomJS as a WebDriver:
PhantomJs
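A hedged sketch of the PhantomJS-as-WebDriver route, assuming the phantomjsdriver (GhostDriver) dependency and a local PhantomJS binary; the binary path below is an assumption, and note that PhantomJS itself is no longer maintained:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

public class PhantomJsSketch {
    public static void main(String[] args) {
        // Path to the PhantomJS executable -- adjust for your machine.
        System.setProperty("phantomjs.binary.path", "/usr/local/bin/phantomjs");

        WebDriver driver = new PhantomJSDriver();
        try {
            driver.get("http://www.instagram.com/instagram");
            String renderedSource = driver.getPageSource();  // full page source after JS has run
            System.out.println(renderedSource);
        } finally {
            driver.quit();
        }
    }
}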
I have a severe concern here. I have searched all through Stack Overflow and many other sites; everywhere they give the same solution, and I have tried all of them, but I am not able to resolve this issue.
I have the following code:
Document doc = Jsoup.connect(url).timeout(30000).get();
Here I'm using the Jsoup library, and the result I get is not equal to the actual page source you can see via right click on the page -> view page source. Many parts are missing in the result I get with the above line of code.
After searching some sites on Google, I saw this method:
URL url = new URL(webPage);
URLConnection urlConnection = url.openConnection();
urlConnection.setConnectTimeout(10000);
urlConnection.setReadTimeout(10000);
InputStream is = urlConnection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);

int numCharsRead;
char[] charArray = new char[1024];
StringBuffer sb = new StringBuffer();
while ((numCharsRead = isr.read(charArray)) > 0) {
    sb.append(charArray, 0, numCharsRead);
}
String result = sb.toString();

System.out.println(result);
But no luck.
While searching the internet about this problem, I saw many sites saying that I had to set the proper charset and encoding type of the webpage while downloading its page source. But how do I determine these dynamically from my code? Are there any classes in Java for that? I also went through crawler4j a bit, but it didn't do much for me. Please help; I've been stuck with this problem for over a month and have tried everything I can. My final hope is the gods of Stack Overflow, who have always helped!
I had this recently. I'd run into some sort of robot protection. Change your original line to:
Document doc = Jsoup.connect(url)
        .userAgent("Mozilla/5.0")
        .timeout(30000)
        .get();
The problem might be that your web page is rendered by JavaScript run in a browser, and Jsoup alone can't help you with this, so you may try using HtmlUnit to emulate the browser (it can also be driven through Selenium): using Jsoup to sign in and crawl data.
UPDATE
There are several reasons why the HTML is different. The most probable is that this web page contains <script> elements that hold dynamic page logic. This could be an application inside your web page which sends requests to the server and adds or removes content depending on the responses.
Jsoup would never render such pages, because that is a job for a browser like Chrome, Firefox or IE. Jsoup is a lightweight parser for the plain-text HTML you get from the server.
So what you could do is use a web driver which emulates a web browser and renders the page in memory, so it has the same content as shown to the user. You can even perform mouse clicks with this driver.
And the proposed implementation of the web driver in the linked answer is HtmlUnit. It's the most lightweight solution; however, it might give you unexpected results: Selenium vs HtmlUnit?.
If you want the most real page rendering, you might want to consider Selenium WebDriver.
Why do you want to parse a web page this way? If there is a consumable service available from the website, it might expose a REST API.
To answer your question: a webpage viewed in a web browser may not be the same as the same webpage downloaded using a URLConnection.
The following are a few of the reasons for these differences:
Request headers: when the client (Java application/browser) makes a request for a URL, it sets various headers as part of the request, and the web server may change the content of the response accordingly.
JavaScript: once the response is received, any JavaScript present in the response is executed by the browser's JavaScript engine, which may change the contents of the DOM.
Browser plugins, such as IE Browser Helper Objects, Firefox extensions or Chrome extensions, may change the contents of the DOM.
In simple terms, when you request a URL using a URLConnection you receive raw data, whereas when you request the same URL from a browser's address bar you get a webpage processed by JavaScript and browser plugins.
URLConnection/Jsoup will allow you to set request headers as required, but you may still get a different response due to points 2 and 3. Selenium allows you to remote-control a browser and has an API to access the rendered page; it is used for automated testing of web applications.
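A hedged sketch of that Selenium route, assuming the selenium-java dependency and a matching chromedriver on the PATH; the URL is a placeholder, and the rendered HTML can still be handed to Jsoup afterwards:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class SeleniumSourceSketch {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("http://example.com");              // placeholder URL
            String renderedHtml = driver.getPageSource();  // HTML after JavaScript has run
            Document doc = Jsoup.parse(renderedHtml);      // parse the processed page with Jsoup
            System.out.println(doc.title());
        } finally {
            driver.quit();
        }
    }
}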