Java web-scraper sees captcha

I have made a web-scraper for Google Scholar in Java with JSoup. The scraper searches Scholar for a DOI and finds the citations for that paper. This data is needed for a research project.
However, the scraper only works for the first few requests; after that it encounters a captcha on the Scholar site.
Yet when I open the website in my browser (Chrome), Google Scholar opens normally.
How is this possible? All requests come from the same IP address!
So far I have tried the following options:
Choose a random user-agent for each request (from a list of 5 user-agents)
Add a random delay between requests of 5 to 50 seconds
Use a Tor proxy. However, almost all the exit nodes have already been blocked by Google
When I analyse the requests made by Chrome to Scholar, I see that a cookie with some session IDs is used. This is probably why Chrome's requests are not blocked. Is it possible to use this cookie for requests made with JSoup?
Thank you!

There are three things that spring to mind:
You aren't saving the cookies between requests. Your first request should save the cookies and pass them back to the server on the next request (setting the Referer header wouldn't hurt either). There's an example here.
If Google is being tricky, they could see that your first request didn't load any of the CSS/JS/images on the page. That is a sure sign that you are a bot.
JavaScript is doing something on the page once you have it loaded.
I think the first is the most likely option. You should try to copy as many of the headers you see in the Chrome request as possible into your Java code; a sketch follows.
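A minimal sketch of that cookie handling with Jsoup; the Scholar URL, query and user-agent string are placeholders you would replace with your own values:

    import java.util.Map;
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class ScholarSession {
        public static void main(String[] args) throws Exception {
            String ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"; // placeholder user-agent
            String baseUrl = "https://scholar.google.com/";          // placeholder URL

            // First request: keep the raw response so its cookies can be read
            Connection.Response first = Jsoup.connect(baseUrl)
                    .userAgent(ua)
                    .method(Connection.Method.GET)
                    .execute();
            Map<String, String> cookies = first.cookies();

            // Second request: send the saved cookies and a Referer header back
            Document doc = Jsoup.connect(baseUrl + "scholar?q=some+doi") // placeholder query
                    .userAgent(ua)
                    .referrer(baseUrl)
                    .cookies(cookies)
                    .get();
            System.out.println(doc.title());
        }
    }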

Related

Tracking site XHR with Java

I tried to use HtmlUnit to send POST requests to communicate with a server and ran into a small problem: the target .php URL changes from time to time
(www123.example.net -> www345.example.net, etc.). The only way to get the new address is to open the site, check its XHR requests, find the one that goes to www???.example.net, and then use that address to send the POSTs.
So the question is: is there a way to track XHR using HtmlUnit or any other Java library?
If you really need help you have to show your problem in more detail: provide some info about the web site you are requesting, show your code, and try to explain what you expect and what goes wrong. Without these details we can only guess.
It looks like you should think of HtmlUnit more as a browser you can control from Java than as a way of doing simple HTTP requests. Have a look at the simple samples on the HtmlUnit web site (the one at the bottom is for you).
Try something like this (the same steps a user of an ordinary browser performs); see the sketch below the list:
* open the url/page
* fill the various form fields
* find the submit button and click it
* use the resulting page content
Usually HtmlUnit does all the stuff in the background for you.
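A hedged sketch of those steps, assuming a recent HtmlUnit 2.x; the URL and the form/field names are assumptions you would replace with whatever the real page uses:

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlForm;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
    import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

    public class HtmlUnitFormExample {
        public static void main(String[] args) throws Exception {
            try (WebClient client = new WebClient()) {
                // open the url/page (placeholder URL)
                HtmlPage page = client.getPage("https://www.example.net/search.php");

                // fill the various form fields (form index and field name are assumptions)
                HtmlForm form = page.getForms().get(0);
                HtmlTextInput query = form.getInputByName("query");
                query.type("my search term");

                // find the submit button and click it (assumed input name)
                HtmlSubmitInput submit = form.getInputByName("submit");
                HtmlPage result = submit.click();

                // use the resulting page content
                System.out.println(result.asXml());
            }
        }
    }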

Http Get Request - what data is actually sent?

I'm currently building a web spider in Java with Apache Commons. I'm crawling basic Google search queries like https://google.com/search?q=word&hl=en
Somehow, after about 60 queries I get blocked; it seems they recognize me as a bot and I get a 503 Service Unavailable response.
Now the important part:
If I visit the same site with Firefox/Chrome I get the desired result.
If I make a GET request from my application using the same HTTP headers (user-agent, cookies, cache etc.) I am still blocked.
HOW does Google know whether I'm connecting via my application or the Chrome browser, when the only information it has is the IP address and the HTTP headers? (Maybe I'm wrong?)
Are there more parameters that could identify my app? Something that Google sees and I don't?
(Maybe important: I'm using Chrome Developer Tools and httpbin.org to compare the headers of Browser and Application.)
Thanks a lot
Since you have not specified how quickly you send the 60 queries, I am assuming it is at a high rate, and that is why Google is blocking you. Several times I have rapidly done Google searches from Chrome; it asks for a captcha after a while and then blocks soon after.
Please see the Custom Search API and this post about the terms of service: Replacement for Google API
FAQ on blocked searches: Google FAQ
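If you go the Custom Search route, a minimal sketch with Apache HttpClient 4.x might look like this; the API key and search-engine id (cx) are placeholders you obtain from the Google developer console:

    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class CustomSearchExample {
        public static void main(String[] args) throws Exception {
            String apiKey = "API_KEY"; // placeholder
            String cx = "CX";          // placeholder search-engine id
            String query = URLEncoder.encode("word", StandardCharsets.UTF_8.name());
            String url = "https://www.googleapis.com/customsearch/v1?key=" + apiKey
                    + "&cx=" + cx + "&q=" + query + "&hl=en";

            try (CloseableHttpClient client = HttpClients.createDefault()) {
                // The API returns JSON, so there is no captcha/503 dance to fight
                String json = EntityUtils.toString(
                        client.execute(new HttpGet(url)).getEntity());
                System.out.println(json);
            }
        }
    }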

Can directing to a specific site be avoided in any way using servlets?

Is it possible to block access to a specific site using servlets? For example, when yahoo.com is typed in the URL box, the network connection should be cut off, whereas when you type some other website URL, say google.com, the network connection should remain intact. (Maybe by working with IP filters?)
Servlets would not be the right choice for this kind of requirement.
Servlets are deployed on a server and only run when a request is submitted to them from the browser. Typing a URL into the browser's address bar does not pass the request through your servlet at all.
You can install some kind of proxy on your local network and block the websites there.
If, however, you are talking about a browser request that calls your servlet, and your servlet then redirects the request to a website like Yahoo or Google, the following procedure might work for you (see the sketch after the list):
Maintain a list of blocked websites
Check the requested website in the request object
If the requested website is not in the list, allow the redirect
Otherwise, forward to a page where a "website blocked" message is displayed
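A hypothetical sketch of that procedure; the "target" parameter name, the blocked-host list, and the /blocked.jsp page are all assumptions for illustration:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class RedirectGateServlet extends HttpServlet {
        // hypothetical blocklist
        private static final Set<String> BLOCKED =
                new HashSet<>(Arrays.asList("yahoo.com", "www.yahoo.com"));

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            String target = req.getParameter("target"); // e.g. "google.com" (assumed parameter)
            if (target == null || BLOCKED.contains(target.toLowerCase())) {
                // blocked (or missing) site: show the blocked-message page
                req.getRequestDispatcher("/blocked.jsp").forward(req, resp);
            } else {
                // allowed site: redirect the browser there
                resp.sendRedirect("https://" + target);
            }
        }
    }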
Hope this answers your question.
Please let me know if you have any further questions.

Use httpclient to click next button?

I am writing a program to scrape the source code of a website. Each time the next button is clicked to go to the next page on the website, a POST request is sent.
I have been looking at using HttpClient to take care of this, and have looked through examples and the HttpClient API, but I can't seem to figure out whether HttpClient can do this. Is this a function of HttpClient, and if so which class would I use?
I think you're saying that the webpage you're performing an HTTP GET on contains a "next" button, and that when you view the webpage in the browser and click the next button, the next page of the website is displayed.
If this is the case then yes, HttpClient is able to do the same thing. But understand that HttpClient does not integrate with your web browser. Instead, you can scour the source code returned from the HTTP GET request using a library like jsoup to extract the URL for the "next" page of the website, and then issue another HTTP GET to fetch that resource.
Assuming you already have the HttpClient code to issue the initial HTTP GET request, no additional API is required. You just make another request once your program has discovered the URL for the "next" resource; a sketch follows.
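A sketch of that flow, assuming Apache HttpClient 4.x and jsoup; the URLs and the "a.next" selector are placeholders for whatever the real page uses (if the button actually submits a form, you would POST the form's fields to its action URL instead):

    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class NextPageExample {
        public static void main(String[] args) throws Exception {
            try (CloseableHttpClient client = HttpClients.createDefault()) {
                String firstUrl = "https://www.example.com/list?page=1"; // placeholder
                String html = EntityUtils.toString(
                        client.execute(new HttpGet(firstUrl)).getEntity());

                // Parse the first page and look for the "next" link (assumed selector)
                Document doc = Jsoup.parse(html, firstUrl);
                Element next = doc.selectFirst("a.next");
                if (next != null) {
                    String nextUrl = next.absUrl("href"); // resolve a relative href
                    String nextHtml = EntityUtils.toString(
                            client.execute(new HttpGet(nextUrl)).getEntity());
                    System.out.println(nextHtml.length());
                }
            }
        }
    }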

Crawl contents loaded by ajax

Nowadays many websites contain content loaded by AJAX (e.g. comments on some video websites). Normally we can't crawl this data, and what we get is just some JS source code. So here is the question: in what ways can we execute the JavaScript code after we get the HTML response and reach the final page we want?
I know that HtmlUnit has the ability to execute background JS, yet it has quite a few bugs and errors. Are there any other tools that can help me with this?
Some people tell me that I can crawl the AJAX request URL, analyze its parameters and send the request again so as to get the data. If things can't work out the way I mention above, can anyone tell me how to extract the AJAX URL and send the request in the correct format?
By the way, if the solution is in Java, that would be best.
Yes, Netwoof can crawl Ajax easily. Its API and bot builder let you do it without a line of code.
That's the great thing about HTTP: you don't even need Java. My go-to tool for debugging AJAX is the Chrome extension Postman. I start by looking at the request in the Chrome debugger and identifying the salient bits (URL, form-encoded params, etc.).
Then it can be as simple as opening a tab and launching requests at the server with Postman. As long as it's all in the same browser context, all of your cookies (for authentication, etc.) will be shipped along too.
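Once the debugger has revealed the endpoint, the same request can be replayed from Java as well; here is a sketch with jsoup, where the URL, parameter names and session cookie are placeholders for whatever you observe in DevTools:

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;

    public class AjaxReplayExample {
        public static void main(String[] args) throws Exception {
            Connection.Response res = Jsoup.connect("https://www.example.com/comments/ajax") // placeholder endpoint
                    .ignoreContentType(true)                    // response is JSON, not HTML
                    .data("videoId", "12345", "page", "2")      // placeholder form params
                    .cookie("SESSIONID", "copied-from-browser") // placeholder auth cookie
                    .method(Connection.Method.POST)
                    .execute();
            System.out.println(res.body()); // raw JSON, ready for a JSON parser
        }
    }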
