How to scrape Google SERPs with Jsoup? - java

I was trying to scrape links from Google using 600 different searches. In the process, I started getting the following error.
Error
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=http://ipv4.google.com/sorry/IndexRedirect?continue=http://google.com/search/...
Now I've done my research, and it happens because of Google's anti-scraping ban: it restricts you to a limited number of searches and requires you to solve a CAPTCHA to proceed, which Jsoup can't do.
Code
Document doc = Jsoup.connect("http://google.com/search?q=" + keyWord)
        .userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
        .timeout(5000)
        .get();
Answers on the internet are extremely vague and don't provide a clear solution. Someone did mention that cookies can solve this issue, but didn't say a single thing about how to do it.

Some hints to improve your scraping:
1. Use proxies
Proxies reduce your chances of being caught by a captcha. You should use between 50 and 150 proxies, depending on your average result set. Here are two websites that can provide some proxies: SEO-proxies.com or Proxify Switch Proxy.
// Setup proxy
String proxyAddress = "1.2.3.4";
int proxyPort = 1234;
Proxy proxy = new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(proxyAddress, proxyPort));
// Fetch url with proxy (connect() must come first; proxy(Proxy) is available on the Connection)
Document doc = Jsoup.connect(searchUrl)
        .proxy(proxy)
        .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2")
        .header("Content-Language", "en-US")
        .get();
2. Captchas
If by any means you do get caught by a captcha, you can use an online captcha-solving service (Bypass Captcha and DeathByCaptcha, to name a few). Below is a generic step-by-step procedure for getting the captcha solved automatically:
Detect captcha error page
--
try {
    // Perform search here...
} catch (HttpStatusException e) {
    switch (e.getStatusCode()) {
        case java.net.HttpURLConnection.HTTP_UNAVAILABLE:
            // Match on the redirect prefix; the full URL varies per query
            if (e.getUrl().contains("http://ipv4.google.com/sorry/IndexRedirect")) {
                // Ask online captcha service for help...
            } else {
                // ...
            }
            break;
        default:
            // ...
    }
}
Download the captcha image (CI)
--
byte[] captchaImage = Jsoup
        .connect(imageCaptchaUrl)
        //.cookie(..., ...) // Some cookies may be needed...
        .ignoreContentType(true) // Needed for fetching an image
        .execute()
        .bodyAsBytes(); // byte[] array returned...
Send the CI to the online captcha service
--
This part depends on the captcha service API. You can find some services in this 8 best captcha solving services article.
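Purely as an illustration (the endpoint, parameter names and response format below are entirely hypothetical; each real service documents its own API), the upload could look like this:
// Upload the captcha image bytes to a (hypothetical) solving service as a multipart POST.
// Assumes: import org.jsoup.Connection; captchaImage from the previous step.
Connection.Response resp = Jsoup.connect("https://captcha-service.example.com/solve") // placeholder URL
        .data("apiKey", MY_API_KEY) // hypothetical credential
        .data("image", "captcha.png", new java.io.ByteArrayInputStream(captchaImage))
        .ignoreContentType(true)
        .method(Connection.Method.POST)
        .execute();
String solvedText = resp.body().trim(); // assume the service answers with the solution as plain text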
Wait for response... (1-2 second(s) is perfect)
Fill the form with response and send it with Jsoup
The Jsoup FormElement is a life saver here. See this working sample code for details.
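For instance, a minimal sketch of that idea (the form selector, the captcha field name and solvedText are placeholders, not Google's actual markup):
// Load the captcha page, fill the answer into the form, and submit it.
Document captchaPage = Jsoup.connect(captchaPageUrl).get();
FormElement form = (FormElement) captchaPage.select("form").first();
form.select("input[name=captcha]").first().val(solvedText); // answer from the captcha service
Document confirmation = form.submit().execute().parse(); // submit() builds the request from the form's fields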
3. Some other hints
The Hints for Google scrapers article can give you some more pointers for improving your code. You'll find the first two hints presented here, plus some more (a short sketch of the last two follows this list):
Cookies: clear them on each IP change, or don't use them at all
Threads: you should not open too many connections; Firefox limits itself to 4 connections per proxy
Returned results: append &num=100 to your URL to send fewer requests
Request rates: make your requests look human; you should not send more than 500 requests per 24h per IP
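A quick sketch of those last two hints (the delay bounds are illustrative; pick what fits your quota):
// Request 100 results at once so far fewer pages are needed,
// then pause long enough to stay under ~500 requests per 24h per IP.
String searchUrl = "http://google.com/search?q=" + keyWord + "&num=100";
Document doc = Jsoup.connect(searchUrl)
        .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2")
        .get();
Thread.sleep(180_000 + new java.util.Random().nextInt(60_000)); // ~3-4 min; caller must handle InterruptedException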
References:
How to use Jsoup through a proxy?
How to download an image with Jsoup?
How to fill a form with Jsoup?
FormElement javadoc
HttpStatusException javadoc

As an alternative to Stephan's answer, you can use this package to get Google search results without the hassle of proxies. Code sample:
Map<String, String> parameter = new HashMap<>();
parameter.put("q", "Coffee");
parameter.put("location", "Portland");
GoogleSearchResults serp = new GoogleSearchResults(parameter);
JsonObject data = serp.getJson();
JsonArray results = (JsonArray) data.get("organic_results");
JsonObject first_result = results.get(0).getAsJsonObject();
System.out.println("first coffee: " + first_result.get("title").getAsString());
Project Github

Related

JSOUP / HTTP error fetching URL. Status=503

I am using Jsoup to scrape a web page as follows:
public static final String GOOGLE_SEARCH_URL = "https://www.google.com/search";
String searchURL = GOOGLE_SEARCH_URL + "?q=" + searchTerm + "&num=" + num + "&start=" + start;
Document doc = Jsoup.connect(searchURL)
        .userAgent("Mozilla/5.0 Chrome/26.0.1410.64 Safari/537.31")
        // .ignoreHttpErrors(true)
        .maxBodySize(1024 * 1024 * 3)
        .followRedirects(true)
        .timeout(100000)
        .ignoreContentType(true)
        .get();
Elements results = doc.select("h3.r > a");
for (Element result : results) {
    String linkHref = result.attr("href");
}
But my problem is that the code works fine at the start; after a while, it stops and always gives me the "HTTP error fetching URL. Status=503" error.
When I add .ignoreHttpErrors(true) it works without any error, but it does not scrape the web page.
*searchTerm is any keyword I want to search for, and num is the number of pages I need to retrieve.
Could anyone help, please?
Does this mean that Google blocked my IP from scraping? If yes, is there any solution, or how can I scrape the Google search results, please?
I need help.
Thank you,
A 503 error usually means the website you're trying to scrape is blocking you because it doesn't want non-human users navigating its sites. Especially Google.
There are some things you can do, though, such as:
Use a proxy rotator
Use chromedriver
Add some delays to your application after each page
Basically, you need to be as human as possible to prevent sites from blocking you.
EDIT:
I need to warn you that scraping Google search results is against their ToS and might be illegal depending on where you are.
What you can do
You can use a proxy rotating service to mask your requests so Google will see them as requests from multiple regions. Search for "proxy rotator service" if you're interested. It might be expensive, depending on what you do with the data.
Then code a module that changes the User-Agent on every request to make Google less suspicious of your requests.
Add a random delay after scraping each page; I suggest around 1-5 seconds. A randomized delay makes your requests look more human to Google.
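A minimal sketch of those two ideas together (the User-Agent strings and delay bounds are illustrative only):
// Pick a different User-Agent per request and sleep a random 1-5 seconds between pages.
String[] userAgents = {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
};
java.util.Random rnd = new java.util.Random();
Document doc = Jsoup.connect(searchURL)
        .userAgent(userAgents[rnd.nextInt(userAgents.length)])
        .get();
Thread.sleep(1000 + rnd.nextInt(4000)); // 1000-5000 ms; caller must handle InterruptedException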
Lastly, if everything fails, you might want to look into the Google Search API and use their API instead of scraping their site.

Google Drive API - OAuth2.0: How to Automate Authentication Process? Doubts and Questions

I'm trying to integrate Google APIs inside a project (a thesis project) and I have some doubts and questions. So, here is the scenario:
I wrote a back-end application in Java that runs solely from a command line and has absolutely no interaction with a user. Its goal is to allow communication and interaction between sensors and actuators. Everything works great. Now I'd like to integrate something that lets the sensors back up data, both with a certain periodicity and when some detected threshold value is reached. So I thought, why not try Google Drive. The first very useful links have been:
https://developers.google.com/drive/web/quickstart/quickstart-java
https://developers.google.com/accounts/docs/OAuth2InstalledApp
The quick start examples work like a charm. However, it requires quite a bit of setup: create a project inside the Developer Console (and therefore an account), enable the Drive API, then create a client ID and a client secret. Once you've done these steps, you can hard-code the client ID and secret to form the request URL for Google Drive. Then you're kindly asked to enter the URL in a browser, log in if you're not already, accept, and finally copy and paste into your console the authorization code for obtaining an access token. Wow, quite a security process. But hey, I completely agree with it, above all in a scenario where we have either a web app, a smartphone app or a web service that needs users' authentication and authorization in order to let the app do its job by accessing someone else's account. But in my case, I just would like the sensors to back up data on my Google Drive.
These facts lead to my first question: in order to use Google APIs (Drive in this case), do I have to create a project anyway? Or is there another approach? If I'm not wrong, there is no other way to create a client ID and secret without creating a project inside the Developer Console. This puzzles me a lot. Why should I create a project just to use what are basically some libraries?
So, let's assume the previous steps are justifiable constraints and move on to the real question: how to automate the authentication process? Given my scenario, where a sensor (simply a Java module) wants to back up data, it would be impossible to complete all those steps. The Google page about OAuth 2.0 has great explanations about different scenarios where we can embed the authentication procedure, including one for "devices with limited input capabilities". Unluckily, this is more complicated than the others and requires that "the user switches to a device or computer with richer input capabilities, launches a browser, navigates to the URL specified on the limited-input device, logs in, and enters the code." (LOL)
So I didn't give up, and I ended up on this post that talks about the OAuth Playground: How do I authorise an app (web or installed) without user intervention? (canonical ?). It really looks like a solution for me, in particular where it says:
NB2. This technique works well if you want a web app which accesses your own (and only your own) Drive account, without bothering to write the authorization code which would only ever be run once. Just skip step 1, and replace "my.drive.app" with your own email address in step 5.
However, if I'm not wrong, I think the OAuth Playground is just for helping test and debug projects that use Google APIs, isn't it? Moreover, Google Drive classes such as GoogleAuthorizationCodeFlow and GoogleCredential (used inside the Java quick start example) always need the client ID, client secret and so on, which brings me back to point zero (create a project and do the whole graphical procedure).
In conclusion: is there a way to avoid the "graphical" authentication interaction and convert it into an automated process using only Drive's APIs, without user intervention? Thanks a lot, I would be grateful for any tip, hint, answer, pointer :-)
This is just a snippet of code that I wrote thanks to pinoyyid's suggestions. Just to recap what we should do in this case (when your program has no user interaction for completing the whole Google GUI authentication process). As reported in https://developers.google.com/drive/web/quickstart/quickstart-java:
Go to the Google Developers Console.
Select a project, or create a new one.
In the sidebar on the left, expand APIs & auth. Next, click APIs. In the list of APIs, make sure the status is ON for the Drive API.
In the sidebar on the left, select Credentials.
In either case, you end up on the Credentials page and can create your project's credentials from here.
From the Credentials page, click Create new Client ID under the OAuth heading to create your OAuth 2.0 credentials. Your application's client ID, email address, client secret, redirect URIs, and JavaScript origins are in the Client ID for web application section.
pinoyyid's post is neater and gets straight to the point: How do I authorise a background web app without user intervention? (canonical ?)
Pay attention to step number 7
Finally, the snippet of code is very simple; it's just about sending a POST request, and it's possible to do that in many ways in Java. Therefore this is just an example, and I'm sure there is room for improvements ;-)
// Used both to set the access token the first time the module runs
// and, in general, to refresh the token.
public void sendPOST() {
    try {
        URL url = new URL("https://www.googleapis.com/oauth2/v3/token");

        // Build the application/x-www-form-urlencoded request body
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("client_id", CLIENT_ID);
        params.put("client_secret", CLIENT_SECRET);
        params.put("refresh_token", REFRESH_TOKEN);
        params.put("grant_type", "refresh_token");
        StringBuilder postData = new StringBuilder();
        for (Map.Entry<String, Object> param : params.entrySet()) {
            if (postData.length() != 0) postData.append('&');
            postData.append(URLEncoder.encode(param.getKey(), "UTF-8"));
            postData.append('=');
            postData.append(URLEncoder.encode(String.valueOf(param.getValue()), "UTF-8"));
        }
        byte[] postDataBytes = postData.toString().getBytes("UTF-8");

        HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setRequestProperty("Content-Length", String.valueOf(postDataBytes.length));
        conn.setDoOutput(true);
        conn.getOutputStream().write(postDataBytes);

        // Read the response body, which should be a JSON structure
        BufferedReader in_rd = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String inputLine;
        StringBuilder responseBody = new StringBuilder();
        while ((inputLine = in_rd.readLine()) != null) {
            responseBody.append(inputLine);
        }
        in_rd.close();

        // Parse the response into a JSON object
        // (JSONObject needs a String; passing the StringBuilder itself would not work)
        JSONObject jsonResp = new JSONObject(responseBody.toString());

        // Replace the previous access token
        ACCESS_TOKEN = jsonResp.getString("access_token");
    }
    catch (MalformedURLException ex_URL) {
        System.out.println("An error occurred: " + ex_URL.getMessage());
    }
    catch (JSONException ex_json) {
        System.out.println("An error occurred: " + ex_json.getMessage());
    }
    catch (IOException ex_IO) {
        System.out.println("An error occurred: " + ex_IO.getMessage());
    }
} // end of sendPOST method
I hope this snippet of code will help others who face the same situation!
I wrote the SO post at How do I authorise an app (web or installed) without user intervention? (canonical ?)
What it describes is indeed the solution to your use case. The key bit you'd missed is step 7, where you enter the details of your own application into the OAuth Playground. From that point on, the Playground is impersonating your app, so you can do the one-time authorization and obtain a refresh token.

Google calendar access via java

I have a home-integrated project working with Google Calendar... well, it was working. I've been using it for at least 6 months, maybe a year, I forget. Suddenly Google changed the rules, and I can't figure out how to make things work now.
I don't want to use a whole library to do the extremely basic operations I need to do. I don't need a bunch of extra libraries in my Tomcat app.
Here is the full code sample that used to post a new calendar event, and get the id back so that we could later delete it if we wanted to for an update, etc.
I only get 403 errors back now, and the user/pass is OK: I can get my auth token, I can also log in with a browser, I did the captcha unlock page, etc. It just stopped working on 11/18/2014. It was working on 11/17/2014.
Error:
java.io.IOException: Server returned HTTP response code: 403 for URL: https://www.google.com/calendar/feeds/myuser@gmail.com/private/full
Help? urlc.getInputStream() throws the exception.
I would be happy to use OAuth2 as well, but I can't get past the fact that all the docs say to use a library, and that the user is going to be presented with the Google consent page to accept. They can't be... they don't interact with this. This is an automated server-side app building out calendar events. There is no user present or web browser. So I don't get what to do... they have the service account item, and I downloaded my private key, but I see nowhere that they tell you what you are supposed to do with the private key...
I'm happy to do CalDAV too, but again, OAuth keeps me from proceeding. I have no issues with the technical aspects after login, but I can't understand Google's login architecture well enough to get that far anymore.
--Ben
HttpURLConnection urlc = (HttpURLConnection) new URL("https://www.google.com/calendar/feeds/myuser@gmail.com/private/full").openConnection();
urlc.setDoOutput(true);
urlc.setFollowRedirects(false);
urlc.setRequestMethod("POST");
urlc.setRequestProperty("Content-Type", "application/atom+xml");
urlc.setRequestProperty("Authorization", "GoogleLogin auth=" + authToken);
OutputStream out = urlc.getOutputStream();
out.write(b);
out.close();
int code = urlc.getResponseCode();
String location = "";
for (int x = 0; x < 10; x++) {
    System.out.println(x + ":" + urlc.getHeaderFieldKey(x) + ":" + urlc.getHeaderField(x));
    if (urlc.getHeaderFieldKey(x) != null && urlc.getHeaderFieldKey(x).equalsIgnoreCase("Location")) {
        location = urlc.getHeaderField(x);
    }
}
String result = consumeResponse(urlc.getInputStream());
System.out.println(result);
urlc.disconnect();
// Re-send the same request to the Location header returned by the first response
urlc = (HttpURLConnection) new URL(location).openConnection();
urlc.setDoOutput(true);
urlc.setFollowRedirects(false);
urlc.setRequestMethod("POST");
urlc.setRequestProperty("Content-Type", "application/atom+xml");
urlc.setRequestProperty("Authorization", "GoogleLogin auth=" + authToken);
out = urlc.getOutputStream();
out.write(b);
out.close();
code = urlc.getResponseCode();
result = consumeResponse(urlc.getInputStream());
System.out.println("Raw result:" + result);
gcal_id = result.substring(result.indexOf("gCal:uid value='") + "gCal:uid value='".length());
gcal_id = gcal_id.substring(0, gcal_id.indexOf("@google.com"));
System.out.println("Calendar ID:" + gcal_id);
So I am partially answering my own question...
The "solution" is having a refresh token. This can be used offline to get new access tokens on demand that are good for about 1 hour. You submit your refresh token to: ht tps :/ /account s. go ogle .c om/o/oauth2/token and it will give you back a "Bearer" access token to use for the next hour.
To get your refresh token though, you need to go to a URL in your browser to get the access, and your allowed redirect URLs must be configured to where you are going to 'redirect' to. It can be something invalid, so long as you can get the 'code' parameter its going to give you. You will need this code to then get the refresh token.
Configure the allowed redirect URLs in your developer console. Find your own link to the dev console. I don't have the points to tell you apparently.
An example URL to go to is something like this:
https://accounts.google.com/o/oauth2/auth?scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcalendar&state=&redirect_uri=url_encoded_url_to_redirect_to_that_is_in_developer_console&response_type=code&client_id=some_google_randomized_id.apps.googleusercontent.com&access_type=offline&approval_prompt=force
All of this info was pulled from:
https://developers.google.com/accounts/docs/OAuth2WebServer#refresh
So with all of this, you can now do the normal calendar API calls directly and pass in the Bearer authorization header.
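For instance, a minimal sketch of such a direct call (the v3 events endpoint is the current public one; ACCESS_TOKEN is the bearer token from the refresh exchange above):
// List events on the primary calendar by passing the Bearer token in the Authorization header.
HttpURLConnection conn = (HttpURLConnection) new URL(
        "https://www.googleapis.com/calendar/v3/calendars/primary/events").openConnection();
conn.setRequestProperty("Authorization", "Bearer " + ACCESS_TOKEN);
conn.setRequestProperty("Accept", "application/json");
System.out.println("HTTP " + conn.getResponseCode()); // 200 means the token is good
// The body is JSON; read conn.getInputStream() and parse it with your JSON library of choice.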
So in total, you need exactly zero Google libraries to do all of this; they just make it very difficult to get to the meat of what is really going on. Half the "examples", even on Google's pages, reference invalid things. Most spend the majority of the example telling you how to reconfigure your Eclipse to run the example...
The other side effect is that this also requires the JSON format for calendar entries, and not the former XML style GCal was using. Not really a downside, just a change.
Until next year when it all breaks again...
https://apidata.googleusercontent.com/caldav/v2/calid/user
Where calid should be replaced by the "calendar ID" of the calendar to be accessed. This can be found through the Google Calendar web interface as follows: in the pull-down menu next to the calendar name, select Calendar Settings. On the resulting page the calendar ID is shown in a section labelled Calendar Address. The calendar ID for a user's primary calendar is the same as that user's email address.
Please refer to the link below:
https://developers.google.com/google-apps/calendar/caldav/v2/guide

web page source downloaded through Jsoup is not equal to the actual web page source

I have a serious concern here. I have searched all through Stack Overflow and many other sites. Everywhere they give the same solution, and I have tried all of those, but I am not able to resolve this issue.
I have the following code:
Document doc = Jsoup.connect(url).timeout(30000).get();
Here I'm using the Jsoup library, and the result I am getting is not equal to the actual page source that we can see by right-clicking on the page -> View Page Source. Many parts are missing in the result that I am getting with the above line of code.
After searching some sites on Google, I saw this method:
URL url = new URL(webPage);
URLConnection urlConnection = url.openConnection();
urlConnection.setConnectTimeout(10000);
urlConnection.setReadTimeout(10000);
InputStream is = urlConnection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
int numCharsRead;
char[] charArray = new char[1024];
StringBuffer sb = new StringBuffer();
while ((numCharsRead = isr.read(charArray)) > 0) {
    sb.append(charArray, 0, numCharsRead);
}
String result = sb.toString();
System.out.println(result);
But no luck.
While I was searching the internet for this problem, I saw many sites that said I had to set the proper charset and encoding types of the webpage while downloading its page source. But how will I get to know these things from my code dynamically? Are there any classes in Java for that? I also went through crawler4j a bit, but it did not do much for me. Please help, guys. I have been stuck with this problem for over a month now. I have tried every way I can, so my final hope is the gods of Stack Overflow, who have always helped!
I had this recently. I'd run into some sort of robot protection. Change your original line to:
Document doc = Jsoup.connect(url)
        .userAgent("Mozilla/5.0")
        .timeout(30000)
        .get();
The problem might be that your web page is rendered by JavaScript run in a browser. Jsoup alone can't help you with this, so you may try HtmlUnit, which emulates a browser (and can also be driven through Selenium): using Jsoup to sign in and crawl data.
UPDATE
There are several reasons why the HTML can differ. The most probable is that this web page contains <script> elements carrying dynamic page logic. This could be an application inside your web page which sends requests to the server and adds or removes content depending on the responses.
Jsoup would never render such pages, because that's a job for a browser like Chrome, Firefox or IE. Jsoup is a lightweight parser for the plain-text HTML you get from the server.
So what you could do is you could use a web driver which emulates a web browser and renders a page in memory, so it would have the same content as shown to the user. You may even do mouse clicks with this driver.
And the proposed implementation for the web driver in the linked answer is HtmlUnit. It's the most lightweight solution; however, it might give you unexpected results: Selenium vs HtmlUnit?.
If you want the most real page rendering, you might want to consider Selenium WebDriver.
Why do you want to parse a web page this way? If there is a consumable service available from the website, it might have a REST API.
To answer your question: a webpage viewed in a web browser may not be the same as the same webpage downloaded using a URLConnection.
The following could be few of the reasons that cause these differences:
Request headers: when the client (Java application or browser) makes a request for a URL, it sets various headers as part of the request, and the web server may change the content of the response accordingly.
JavaScript: once the response is received, if there are JavaScript elements present in the response, they are executed by the browser's JavaScript engine, which may change the contents of the DOM.
Browser plugins, such as IE Browser Helper Objects, Firefox extensions or Chrome extensions, may change the contents of the DOM.
In simple terms, when you request a URL using a URLConnection you are receiving raw data, whereas when you request the same URL through a browser's address bar you get the webpage as processed by JavaScript and browser plugins.
URLConnection/Jsoup will allow you to set request headers as required, but you may still get a different response due to points 2 and 3. Selenium allows you to remote-control a browser and has an API to access the rendered page. Selenium is used for automated testing of web applications.
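As a rough illustration (assuming the selenium-java and htmlunit-driver dependencies on the classpath; the URL is a placeholder), you could render the page headlessly and hand the result to Jsoup:
// Imports assumed: org.openqa.selenium.WebDriver, org.openqa.selenium.htmlunit.HtmlUnitDriver
WebDriver driver = new HtmlUnitDriver(true); // true enables JavaScript execution
try {
    driver.get("https://example.com");
    Document doc = Jsoup.parse(driver.getPageSource()); // rendered DOM, not the raw server response
    System.out.println(doc.title());
} finally {
    driver.quit(); // always release the headless browser
}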

How to mimic a web browser from Java using jSoup

I am using a query like this in jSoup:
Document doc = Jsoup.connect(urlString).timeout(1000).post();
It works for some sites, however:
it doesn't work for Google search queries (e.g. urlString = "http://www.google.com/search?q=text") - I don't know why, or what makes it special
result documents contain messages like "JavaScript should be turned on in your browser" which I would rather avoid
there are probably more quirks, but I haven't tested it fully yet...
My question: could these problems be avoided if we could mimic a web browser more closely? What is the best way to do it?
What are the other differences that can be encountered between getting pages via web browser and via Java (URLConnection or jSoup)?
I would like to answer your question.
In Google, when you search, parameters are passed in the URL, so it's a GET request.
In this case, you should use the .get() method.
On many websites, though, parameters are passed using a POST request.
Taking the example of a simple login page: the username and password are passed using a POST request, and in addition there are many hidden fields inside that page which also need to be passed.
If we miss those parameters, it will result in an error.
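A minimal sketch of the difference (the login URL and field names are hypothetical):
// GET: the query parameters live in the URL itself.
Document searchPage = Jsoup.connect("http://www.google.com/search?q=text")
        .userAgent("Mozilla/5.0")
        .get();
// POST: parameters, including any hidden form fields, are sent in the request body.
Document home = Jsoup.connect("https://example.com/login")
        .data("username", "alice")
        .data("password", "secret")
        .data("csrf_token", hiddenValueFromLoginPage) // hidden field, name hypothetical
        .post();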
I realized that the problem with some sites not responding was actually that I was using post() instead of get(). With get() it works fine now!
It also probably helps to add userAgent to the query, for example:
.userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
In the meantime, I've also tested HtmlUnit for the same task, and it worked, but it seems like overkill for the purpose of simply getting an HTML file (for some kind of processing). It basically runs a whole invisible web browser in the background to do the task.
