I am using Jsoup to scrape web pages as follows:
public static final String GOOGLE_SEARCH_URL = "https://www.google.com/search";
String searchURL = GOOGLE_SEARCH_URL + "?q="+searchTerm+"&num="+num +
"&start=" + start;
Document doc = Jsoup.connect(searchURL)
.userAgent("Mozilla/5.0 Chrome/26.0.1410.64 Safari/537.31")
// .ignoreHttpErrors(true)
.maxBodySize(1024*1024*3)
.followRedirects(true)
.timeout(100000)
.ignoreContentType(true)
.get();
Elements results = doc.select("h3.r > a");
for (Element result : results) {
String linkHref = result.attr("href");
}
But my problem is that the code works fine at the start; after a while it stops and always gives me an "HTTP error fetching URL. Status=503" error.
When I add .ignoreHttpErrors(true) it runs without any error, but it does not scrape anything.
*searchTerm is any keyword I want to search for, and num is the number of pages I need to retrieve.
Could anyone help, please?
Does this mean that Google has blocked my IP from scraping? If so, is there any solution, or another way to scrape the Google search results?
I need help.
Thank you,
A 503 error usually means that the website you are trying to scrape is blocking you, because it does not want non-human users navigating its site. This is especially true of Google.
There are some things you can do, though, such as:
Use a proxy rotator
Use ChromeDriver
Add some delay to your application after each page
Basically, you need to look as human as possible to prevent sites from blocking you.
EDIT:
I need to warn you that scraping Google search results is against their ToS, and it might be illegal depending on where you are.
What you can do
You can use a proxy rotating service to mask your requests so Google sees them as requests from multiple regions. Search for "proxy rotator service" if you are interested. It might be expensive, depending on what you do with the data.
Then code a module that changes the User-Agent on every request to make Google less suspicious of your requests.
Add a random delay after scraping each page; I suggest around 1-5 seconds. A randomized delay makes your requests look more human to Google (a minimal sketch combining this with the previous point follows this list).
Lastly, if everything fails, you might want to look into the Google Search API and use it instead of scraping their site.
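To make the User-Agent and delay points concrete, here is a minimal, untested sketch of how they could look with Jsoup; the user-agent strings and the 1-5 second range are arbitrary examples, not values Google requires:
import java.util.Random;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PoliteFetcher {
    private static final String[] USER_AGENTS = {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
    };
    private static final Random RANDOM = new Random();

    public static Document fetch(String searchUrl) throws Exception {
        // Pick a different User-Agent for each request
        String userAgent = USER_AGENTS[RANDOM.nextInt(USER_AGENTS.length)];
        Document doc = Jsoup.connect(searchUrl)
                .userAgent(userAgent)
                .timeout(10000)
                .get();
        // Random 1-5 second pause before the caller fetches the next page
        Thread.sleep(1000 + RANDOM.nextInt(4000));
        return doc;
    }
}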
I am writing an application in Java that works with the Spotify Web API to get the album artwork of the currently playing album (and maybe other stuff in the future, hence the long list of scopes). Per Spotify's guide, I have to use callbacks in order to get the access token. However, when using the authorization link, Spotify gives me the following intensely helpful and insightful error message.
Spotify Error Message
The code I am using to open the authorization window is:
if(Desktop.isDesktopSupported())
{
String url = "https://accounts.spotify.com/authorize/";
url += "client_id="+SpotifyClientID;
url += "&response_type=code";
url += "&redirect_uri=http%3A%2F%2Flocalhost%3A8888%2Fcallback%2F";
url += "&state="+state;
url += "&scope=playlist-read-private%20playlist-read-collaborative%20user-library-read%20user-read-private%20user-read-playback-state%20user-modify-playback-state%20user-read-currently-playing";
Desktop.getDesktop().browse(new URI(url));
}
Similar questions have been asked, and their issue was that their callback URL was not whitelisted; however, I went to the Spotify Dashboard and made SURE http://localhost:8888/callback/ was whitelisted. I've tried using 'http://localhost:8888/callback/' directly in the URL, and I've also tried URL-encoding it so that it becomes 'http%3A%2F%2Flocalhost%3A8888%2Fcallback%2F', as shown in the code above. Can anyone give me any insight as to why the error message appears instead of the login page?
Figured it out myself. Turns out I am awesome at links. /s I changed the last '/' in "https://accounts.spotify.com/authorize/" into a '?' so that the endpoint would actually receive the parameters I was passing, and it worked perfectly.
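For reference, a minimal sketch of the corrected URL construction (same SpotifyClientID, state and scope placeholders as in the question, with the scope list shortened here):
// The base URL must end with '?' so the query parameters are actually parsed
String url = "https://accounts.spotify.com/authorize?";
url += "client_id=" + SpotifyClientID;
url += "&response_type=code";
url += "&redirect_uri=http%3A%2F%2Flocalhost%3A8888%2Fcallback%2F";
url += "&state=" + state;
url += "&scope=playlist-read-private%20user-read-currently-playing";
Desktop.getDesktop().browse(new URI(url));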
Need some help with fetching some data from a website.
Previously, we had the following code in our application and it used to fetch the required data. We simply read the required fields by forming a URL with the username, password and search parameter (DEA number). The same URL (with parameters) could also be hit from a browser directly to see the results. It was a simple GET request:
URL url = new URL(
        "http://www.deanumber.com/Websvc/deaWebsvc.asmx/GetQuery?UserName=" + getUsername() + "&Password=" + getPassword() + "&DEA="
        + deaNumber
        + "&BAC=&BASC=&ExpirationDate=&Company=&Zip=&State=&PI=&MaxRows=");
Document document = parser.parse(url.toExternalForm());
// Ask the document for a list of all <DEA> elements it contains
NodeList sections = document.getElementsByTagName("DEA");
// Followed by a loop that reads each element via sections.item(index).getFirstChild(), etc.
Now, the website URL has got changed to following:
https://www.deanumber.com/RelId/33637/ISvars/default/Home.htm
I am able to log in to the URL with credentials, go to the search page, enter the DEA number and search. The login page appears as a pop-up once I click the 'Login' link on the home page, and the final result also appears as a pop-up. This is a POST request, so I am unable to form a complete URL that I could use in my code.
I am not an expert in web services, but I think I need a web service URL like the one used in the code above. I am not sure how to get that. Even if I get the URL, I am not sure how to perform the login through Java code and search for the DEA number.
Also, it would be great if I could validate the URL manually before using it in Java. Let me know if there is any way to do that.
Or, if there is any alternative approach in Java, kindly suggest it.
Thanks in advance.
First of all, the previous approach provided by the website was completely wrong and insecure, because it passed the username and password as query string parameters in plain text. I think they realized this and changed their way of authenticating.
Also, it looks like they have restricted direct URL-based requests from client applications like yours. For such requests, they have published web services; check this link. They have also listed the rates for different web service request volumes.
So you may need to open a formal communication channel with them to get authentication and other details for accessing their web services. Depending on what they use for web service client authentication, you can then code your client to access those services.
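Purely as an illustration of what such a client call could look like (the endpoint, query parameter and Basic authentication used here are assumptions; the real values have to come from deanumber.com's web service documentation), a plain HttpURLConnection sketch:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DeaLookupSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and query parameter
        URL url = new URL("https://www.deanumber.com/Websvc/deaWebsvc.asmx/GetQuery?DEA=AB1234567");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Hypothetical Basic authentication; the provider may instead use an
        // API key, WS-Security or session cookies
        String credentials = Base64.getEncoder()
                .encodeToString("username:password".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + credentials);

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // XML response; parse with DocumentBuilder as before
            }
        }
    }
}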
I hope this helps.
I was trying to scrape links from Google using 600 different searches, and in the process I started getting the following error.
Error
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=http://ipv4.google.com/sorry/IndexRedirect?continue=http://google.com/search/...
I've done my research, and it happens because Google bans or rate-limits automated searches and then requires you to solve a captcha to proceed, which Jsoup can't do.
Code
Document doc = Jsoup.connect("http://google.com/search?q=" + keyWord)
.userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
.timeout(5000)
.get();
Answers on the internet are extremely vague and don't provide a clear solution. Someone did mention that cookies can solve this issue, but didn't say a single thing about how to do it.
Some hints to improve your scraping:
1. Use proxies
Proxies let you reduce the chances of getting caught by a captcha. You should use between 50 and 150 proxies, depending on your average result set. Here are two websites that can provide some proxies: SEO-proxies.com or Proxify Switch Proxy.
// Setup proxy
String proxyAdress = "1.2.3.4";
int proxyPort = 1234;
Proxy proxy = new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(proxyAdress, proxyPort));
// Fetch url with proxy (Connection.proxy(...) requires a recent Jsoup version)
Document doc = Jsoup //
    .connect(searchUrl) //
    .proxy(proxy) //
    .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2") //
    .header("Content-Language", "en-US") //
    .get();
2. Captchas
If by any means you get caught by a captcha, you can use an online captcha-solving service (Bypass Captcha and DeathByCaptcha, to name a few). Below is a generic step-by-step procedure to get the captcha solved automatically:
Detect captcha error page
--
try {
    // Perform search here...
} catch (HttpStatusException e) {
    switch (e.getStatusCode()) {
    case java.net.HttpURLConnection.HTTP_UNAVAILABLE:
        if (e.getUrl().contains("http://ipv4.google.com/sorry/IndexRedirect?continue=http://google.com/search/...")) {
            // Ask online captcha service for help...
        } else {
            // ...
        }
        break;
    default:
        // ...
    }
}
Download the captcha image (CI)
--
byte[] captchaImage = Jsoup //
    .connect(imageCaptchaUrl) //
    //.cookie(..., ...) // Some cookies may be needed...
    .ignoreContentType(true) // Needed for fetching the image
    .execute() //
    .bodyAsBytes(); // raw image bytes
Send the CI to the online captcha service
--
This part depends on the captcha service's API. You can find some services in the "8 best captcha solving services" article.
Wait for the response (1-2 seconds is perfect)
Fill the form with the response and send it with Jsoup
The Jsoup FormElement is a lifesaver here. See this working sample code for details; a rough sketch of the idea follows.
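The sketch below is illustrative only: the form selector, the input field name and the captchaPageUrl/captchaSolution variables are assumptions, not the actual markup of Google's captcha page.
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

// Load the captcha page, keeping the session cookies
Connection.Response resp = Jsoup.connect(captchaPageUrl)
        .userAgent("Mozilla/5.0")
        .execute();
Document captchaPage = resp.parse();

// Grab the form and fill in the solution returned by the captcha service
FormElement form = (FormElement) captchaPage.select("form").first();
form.select("input[name=captcha]").first().val(captchaSolution);

// Submit the form with the same cookies and user agent
Document result = form.submit()
        .cookies(resp.cookies())
        .userAgent("Mozilla/5.0")
        .post();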
3. Some other hints
The Hints for Google scrapers article can give you some more pointers for improving your code. You'll find the first two hints presented here plus some more:
Cookies: clear them on each IP change or don't use them at all
Threads: you should not open too many connections. Firefox limits itself to 4 connections per proxy.
Returned results: append &num=100 to your URL to send fewer requests
Request rates: make your requests look human. You should not send more than 500 requests per 24 hours per IP.
References:
How to use Jsoup through a proxy?
How to download an image with Jsoup?
How to fill a form with Jsoup?
FormElement javadoc
HttpStatusException javadoc
As an alternative to Stephan's answer, you can use this package to get Google search results without the hassle of proxies. Code sample:
Map<String, String> parameter = new HashMap<>();
parameter.put("q", "Coffee");
parameter.put("location", "Portland");
GoogleSearchResults serp = new GoogleSearchResults(parameter);
JsonObject data = serp.getJson();
JsonArray results = (JsonArray) data.get("organic_results");
JsonObject first_result = results.get(0).getAsJsonObject();
System.out.println("first coffee: " + first_result.get("title").getAsString());
Project Github
I have a home-integration project working with Google Calendar... well, it was working. I've been using it for at least 6 months, maybe a year, I forget. Suddenly Google changed the rules, and I can't figure out how to make things work now.
I don't want to use a whole library to do the extremely basic operations I need to do. I don't need a bunch of extra libraries in my Tomcat app.
Here is the full code sample that used to post a new calendar event, and get the id back so that we could later delete it if we wanted to for an update, etc.
I only get 403 errors back now. The user/pass is OK, I can get my auth token, I can also log in with a browser, I did the captcha unlock page, etc. It just stopped working on 11/18/2014; it was working on 11/17/2014.
Error:
java.io.IOException: Server returned HTTP response code: 403 for URL: https://www.google.com/calendar/feeds/myuser#gmail.com/private/full
Help? urlc.getInputStream() throws the exception.
I would be happy to use OAuth2 as well, but I can't get past the fact that all the docs tell you to use a library, and that the user is going to be presented with the Google consent page to accept. They can't be... they don't interact with this. This is an automated server-side app building out calendar events; there is no user present or web browser. So I don't get what to do... they have the service account option, and I downloaded my private key, but I see nowhere that they tell you what you are supposed to do with the private key...
I'm happy to do CalDAV too, but again, OAuth keeps me from proceeding. I have no issues with the technical aspects after login, but I can't understand Google's login architecture well enough to get that far anymore.
--Ben
HttpURLConnection urlc = (HttpURLConnection)new URL("https://www.google.com/calendar/feeds/myuser#gmail.com/private/full").openConnection();
urlc.setDoOutput(true);
urlc.setFollowRedirects(false);
urlc.setRequestMethod("POST");
urlc.setRequestProperty("Content-Type", "application/atom+xml");
urlc.setRequestProperty("Authorization", "GoogleLogin auth=" + authToken);
OutputStream out = urlc.getOutputStream();
out.write(b);
out.close();
int code = urlc.getResponseCode();
String location = "";
for (int x=0; x<10; x++)
{
System.out.println(x+":"+urlc.getHeaderFieldKey(x)+":"+urlc.getHeaderField(x));
if (urlc.getHeaderFieldKey(x) != null && urlc.getHeaderFieldKey(x).equalsIgnoreCase("Location")) location = urlc.getHeaderField(x);
}
String result = consumeResponse(urlc.getInputStream());
System.out.println(result);
urlc.disconnect();
urlc = (HttpURLConnection)new URL(location).openConnection();
urlc.setDoOutput(true);
urlc.setFollowRedirects(false);
urlc.setRequestMethod("POST");
urlc.setRequestProperty("Content-Type", "application/atom+xml");
urlc.setRequestProperty("Authorization", "GoogleLogin auth=" + authToken);
out = urlc.getOutputStream();
out.write(b);
out.close();
code = urlc.getResponseCode();
result = consumeResponse(urlc.getInputStream());
System.out.println("Raw result:"+result);
gcal_id = result.substring(result.indexOf("gCal:uid value='")+"gCal:uid value='".length());
gcal_id = gcal_id.substring(0,gcal_id.indexOf("#google.com"));
System.out.println("Calendar ID:"+gcal_id);
So I am partially answering my own question...
The "solution" is having a refresh token. This can be used offline to get new access tokens on demand, which are good for about 1 hour. You submit your refresh token to https://accounts.google.com/o/oauth2/token and it will give you back a "Bearer" access token to use for the next hour.
To get your refresh token, though, you need to go to a URL in your browser to grant access, and your allowed redirect URLs must be configured to include wherever you are going to 'redirect' to. It can be something invalid, as long as you can grab the 'code' parameter it is going to give you. You will need this code to then get the refresh token.
Configure the allowed redirect URLs in your developer console. Find your own link to the dev console; I don't have the reputation to post it, apparently.
An example URL to go to is something like this:
https://accounts.google.com/o/oauth2/auth?scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcalendar&state=&redirect_uri=url_encoded_url_to_redirect_to_that_is_in_developer_console&response_type=code&client_id=some_google_randomized_id.apps.googleusercontent.com&access_type=offline&approval_prompt=force
All of this info was pulled from:
https://developers.google.com/accounts/docs/OAuth2WebServer#refresh
So with all of this, you can now do the normal calendar API calls directly and pass in the Bearer authorization header.
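For completeness, a bare-bones sketch of that flow using nothing but HttpURLConnection (CLIENT_ID, CLIENT_SECRET and REFRESH_TOKEN are placeholders for the values obtained as described above; parsing the JSON response is left out):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class TokenRefreshSketch {
    public static void main(String[] args) throws Exception {
        // Exchange the long-lived refresh token for a ~1 hour access token
        String body = "client_id=" + URLEncoder.encode("CLIENT_ID", "UTF-8")
                + "&client_secret=" + URLEncoder.encode("CLIENT_SECRET", "UTF-8")
                + "&refresh_token=" + URLEncoder.encode("REFRESH_TOKEN", "UTF-8")
                + "&grant_type=refresh_token";

        HttpURLConnection conn = (HttpURLConnection)
                new URL("https://accounts.google.com/o/oauth2/token").openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }

        // The JSON response contains "access_token"; printed here, parse it in real code
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }

        // Every Calendar API call then just adds:
        // conn.setRequestProperty("Authorization", "Bearer " + accessToken);
    }
}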
So in total, you need exactly zero Google libraries to do all of this; they just make it very difficult to get to the meat of what is really going on. Half the "examples", even on Google's pages, reference invalid things, and most spend the majority of the example telling you how to reconfigure your Eclipse to run it...
The other side effect is that this also requires the JSON format for calendar entries, rather than the former XML style GCal was using. Not really a downside, just a change.
Until next year when it all breaks again...
https://apidata.googleusercontent.com/caldav/v2/calid/user
Where calid should be replaced by the "calendar ID" of the calendar to be accessed. This can be found through the Google Calendar web interface as follows: in the pull-down menu next to the calendar name, select Calendar Settings. On the resulting page the calendar ID is shown in a section labelled Calendar Address. The calendar ID for a user's primary calendar is the same as that user's email address.
Please refer to the link below:
https://developers.google.com/google-apps/calendar/caldav/v2/guide
I have a serious concern here. I have searched all through Stack Overflow and many other sites; everywhere they give the same solution, and I have tried all of those, but I am not able to resolve this issue.
I have the following code:
Document doc = Jsoup.connect(url).timeout(30000).get();
Here I am using the Jsoup library, and the result I get is not equal to the actual page source that you can see by right-clicking on the page -> View Page Source. Many parts are missing from the result I get with the above line of code.
After searching some sites on Google, I saw this method:
URL url = new URL(webPage);
URLConnection urlConnection = url.openConnection();
urlConnection.setConnectTimeout(10000);
urlConnection.setReadTimeout(10000);
InputStream is = urlConnection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
int numCharsRead;
char[] charArray = new char[1024];
StringBuffer sb = new StringBuffer();
while ((numCharsRead = isr.read(charArray)) > 0) {
sb.append(charArray, 0, numCharsRead);
}
String result = sb.toString();
System.out.println(result);
But no luck.
While searching the internet for this problem, I saw many sites which said I had to set the proper charset and encoding type of the web page while downloading its source. But how do I determine those things dynamically from my code? Are there any classes in Java for that? I also went through crawler4j a bit, but it did not do much for me. Please help, guys; I have been stuck with this problem for over a month now and have tried everything I can. My final hope is the gods of Stack Overflow, who have always helped!
I had this recently. I'd run into some sort of robot protection. Change your original line to:
Document doc = Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(30000)
.get();
The problem might be that your web page is rendered by JavaScript running in a browser. Jsoup alone can't help you with this, so you may try HtmlUnit (for example through Selenium's HtmlUnitDriver) to emulate the browser: using Jsoup to sign in and crawl data.
UPDATE
There are several reasons why the HTML can differ. The most probable is that this web page contains <script> elements with dynamic page logic. This could be an application inside your web page which sends requests to the server and adds or removes content depending on the responses.
Jsoup would never render such pages, because that is a job for a browser like Chrome, Firefox or IE. Jsoup is a lightweight parser for the plain HTML text you get from the server.
So what you could do is use a web driver which emulates a web browser and renders the page in memory, so it has the same content as is shown to the user. You can even perform mouse clicks with such a driver.
And the proposed implementation for the web driver in the linked answer is HtmlUnit. It's the most lightweight solution; however, it might give you unexpected results: Selenium vs HtmlUnit?.
If you want the most realistic page rendering, you might want to consider Selenium WebDriver. A minimal sketch of the headless-browser approach follows.
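This sketch assumes the selenium-java and htmlunit-driver dependencies are on the classpath and uses a placeholder URL; it simply lets HtmlUnit render the page and hands the rendered HTML to Jsoup.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class RenderedSourceSketch {
    public static void main(String[] args) {
        WebDriver driver = new HtmlUnitDriver(true); // true enables JavaScript
        try {
            driver.get("https://example.com");        // placeholder URL
            String renderedHtml = driver.getPageSource();
            Document doc = Jsoup.parse(renderedHtml);  // hand the rendered DOM to Jsoup
            System.out.println(doc.title());
        } finally {
            driver.quit();
        }
    }
}
Swapping HtmlUnitDriver for ChromeDriver or FirefoxDriver gives more faithful rendering at the cost of a heavier setup.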
Why do you want to parse a web page this way? If there is a consumable service available from the website, it might expose a REST API.
To answer your question: a web page viewed in a web browser may not be the same as the same web page downloaded using a URLConnection.
The following are a few of the reasons for these differences:
Request headers: when the client (Java application or browser) requests a URL, it sets various headers as part of the request, and the web server may change the content of the response accordingly.
JavaScript: once the response is received, any JavaScript in it is executed by the browser's JavaScript engine, which may change the contents of the DOM.
Browser plugins, such as IE Browser Helper Objects, Firefox extensions or Chrome extensions, may also change the contents of the DOM.
In simple terms, when you request a URL using a URLConnection you receive raw data, whereas when you request the same URL through a browser's address bar you get the web page as processed by JavaScript and browser plugins.
URLConnection/Jsoup will allow you to set request headers as required (see the short sketch below), but you may still get a different response due to points 2 and 3. Selenium allows you to remote-control a browser and has an API to access the rendered page; Selenium is used for automated testing of web applications.
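As a small illustration of the request-headers point, here is the same fetch with some browser-like headers set through Jsoup; the header values are just examples chosen to mimic a browser, not required values.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect(url)
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
        .header("Accept", "text/html,application/xhtml+xml")
        .header("Accept-Language", "en-US,en;q=0.9")
        .referrer("https://www.google.com")
        .timeout(30000)
        .get();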