Scraping cricinfo: only two pages are being loaded - Java

I am scraping espncricinfo.
My goal is to extract batting statistics for each player of a specified country. Right now I extract all players for the input country, and then for each player I fetch another link that gives me that player's batting stats.
I am using Apache HttpComponents for the HTTP requests and jsoup for parsing DOM elements.
Everything works fine, except that when I start scraping, two players are scraped perfectly and then my application hangs.
I've narrowed the problem down to the method that grabs a single page: whatever espncricinfo link I give this method, it is only able to process two requests and no more.
I imagine the problem might be some kind of bot-prevention mechanism implemented by espncricinfo. Can anybody help me get around this?
Here is the code of the grab method:
public Document scrapSinglePage(String method, String url) {
    try {
        HttpGet httpGet = new HttpGet(url);
        HttpResponse httpResponse = httpClient.execute(httpGet, getLocalContext());
        BufferedReader rd = new BufferedReader(
                new InputStreamReader(httpResponse.getEntity().getContent()));
        StringBuilder htmlResponse = new StringBuilder();
        String line;
        while ((line = rd.readLine()) != null) {
            htmlResponse.append("\r\n").append(line);
        }
        // Parse response
        document = Jsoup.parse(htmlResponse.toString());
        return document;
    } catch (IOException ex) {
        Logger.getLogger(Scrapper.class.getName()).log(Level.SEVERE, null, ex);
        return null;
    }
}
I will appreciate your help on this.
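One thing worth checking before assuming bot prevention: with Apache HttpClient, a pooling connection manager defaults to a maximum of two connections per route, and a connection is only handed back to the pool once the response entity has been fully consumed and its stream closed. Since scrapSinglePage never closes the entity's stream, the third request can block forever waiting for a free connection. A minimal sketch of a version that releases the connection (httpClient, getLocalContext() and Scrapper are the fields and class from the question):

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public Document scrapSinglePage(String url) {
    HttpGet httpGet = new HttpGet(url);
    try {
        HttpResponse httpResponse = httpClient.execute(httpGet, getLocalContext());
        // EntityUtils.toString() reads the body to the end and closes the
        // content stream, which is what hands the connection back to the pool
        String html = EntityUtils.toString(httpResponse.getEntity());
        return Jsoup.parse(html);
    } catch (IOException ex) {
        httpGet.abort(); // give up the underlying connection on failure
        Logger.getLogger(Scrapper.class.getName()).log(Level.SEVERE, null, ex);
        return null;
    }
}

If this fixes the hang, it was the connection pool, not espncricinfo.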

Related

How to parse an HTTP response argument in Java?

I am working on getting the current order book of cryptocurrencies.
My code is shown below.
public static void bid_ask() {
    HttpClient client = HttpClientBuilder.create().build();
    HttpGet request = new HttpGet("https://api.binance.com//api/v1/depth?symbol=ETHUSDT");
    System.out.println("Binance ETHUSDT");
    try {
        HttpResponse response = client.execute(request);
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            try (InputStream stream = entity.getContent()) {
                BufferedReader reader =
                        new BufferedReader(new InputStreamReader(stream));
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The result of this code is as follows:
Binance ETHUSDT
"lastUpdateId":236998360,"bids":[["88.98000000","2.30400000",[]]..........,"asks":[["89.04000000","16.06458000",[]],.......
What I want to build is a kind of price array:
Double Binance[][] = [88.98000000][2.30400000][89.04000000][6.06458000]
How can I extract just the price and quantity from the HTTP GET response?
If the response is JSON, use a parser like Jackson and extract the values; these can then be added to an array.
This link will help you.
You will need to parse those values out of the line variable. There are many ways to do that, including regular expressions or plain String functions. You'd create an ArrayList just before the loop and add each price to it. I highly recommend using the JSR 354 Money and Currency API rather than a double or a float. See -> https://www.baeldung.com/java-money-and-currency
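To make the Jackson suggestion concrete, here is a minimal sketch that parses the depth response into price/quantity pairs. The field names lastUpdateId, bids and asks come from the sample output above; the class and helper names are illustrative:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.List;

public class DepthParser {
    public static void main(String[] args) throws Exception {
        // 'json' would be the response body collected from the HTTP call above
        String json = "{\"lastUpdateId\":236998360,"
                + "\"bids\":[[\"88.98000000\",\"2.30400000\",[]]],"
                + "\"asks\":[[\"89.04000000\",\"16.06458000\",[]]]}";

        JsonNode root = new ObjectMapper().readTree(json);
        List<double[]> bids = toPairs(root.get("bids"));
        List<double[]> asks = toPairs(root.get("asks"));
        System.out.println("best bid: " + bids.get(0)[0] + " qty " + bids.get(0)[1]);
        System.out.println("best ask: " + asks.get(0)[0] + " qty " + asks.get(0)[1]);
    }

    // each entry is ["price", "qty", ...]; keep only the first two fields
    static List<double[]> toPairs(JsonNode entries) {
        List<double[]> pairs = new ArrayList<>();
        for (JsonNode entry : entries) {
            pairs.add(new double[] { entry.get(0).asDouble(), entry.get(1).asDouble() });
        }
        return pairs;
    }
}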

Google's Custom Search as if manually searched

I want to use Google's Custom Search API to search for song lyrics on the web via Java.
To get the name and artist of the currently playing song I use Tesseract OCR. Even though the OCR works perfectly, I often don't get any results.
But when I try it manually (open Google in the web browser and search for the same string), it works fine.
So I don't really know what the difference is between the manual search and the API call.
Do I have to add some parameters to the API request?
//The String searchString is what I am searching for, so the song name and artist
String searchUrl = "https://www.googleapis.com/customsearch/v1?key=(myKEY)=de&cx=(myID)&q=" + searchString + "lyrics";
String data = getData(searchUrl);
JSONObject json = new JSONObject(data);
String link = "";
try
{
    link = json.getJSONArray("items").getJSONObject(0).getString("link");
    URI url = new URI(link);
    System.out.println(link);
    Desktop.getDesktop().browse(url);
}
catch (Exception e)
{
    System.out.println("No Results");
}

private static String getData(String _urlLink) throws IOException
{
    StringBuilder result = new StringBuilder();
    URL url = new URL(_urlLink);
    URLConnection conn = url.openConnection();
    BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String line;
    while ((line = rd.readLine()) != null)
    {
        result.append(line);
    }
    rd.close();
    return result.toString();
}
Try removing =de before &cx, and use + to represent the spaces between words, like this: https://www.googleapis.com/customsearch/v1?key=(yourKEY)&cx=(yourID)&q=paradise+coldplay+lyrics
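More generally, since searchString comes from OCR it can contain spaces and other characters that are not valid in a URL, so it is safer to encode the query before appending it. A minimal sketch; the (myKEY) and (myID) placeholders are from the question:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryBuilder {
    static String buildSearchUrl(String searchString)
            throws UnsupportedEncodingException {
        // encode spaces and special characters; " lyrics" is appended
        // with a leading space so it stays a separate search term
        String q = URLEncoder.encode(searchString + " lyrics", "UTF-8");
        return "https://www.googleapis.com/customsearch/v1"
                + "?key=(myKEY)&cx=(myID)&q=" + q;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildSearchUrl("Paradise Coldplay"));
        // -> ...&q=Paradise+Coldplay+lyrics
    }
}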

Download AJAX-generated content using Java

I have a webpage that displays a list of movies. The content is created with AJAX (as far as my limited knowledge can tell).
I want to download the content, in this case the movie playing times, using Java. I know how to download a simple website, but here my approach only gives me the following as a result instead of the playing times:
ajaxpage('http://data.cineradoplex.de/mod/AndyCineradoProg/extern',
"kinoprogramm");
How do I make my program download the results this AJAX function returns?
Here is the code I use:
String line = "";
URL myUrl = http://www.cineradoplex.de/programm/spielplan/;
BufferedReader in = null;
try {
myUrl = new URL(URL);
in = new BufferedReader(new InputStreamReader(myUrl.openStream()));
while ((line = in.readLine()) != null) {
System.out.println(line);
}
} finally {
if (in != null) {
in.close();
}
}
In your response you can see the address from which the actual data is retrieved:
http://data.cineradoplex.de/mod/AndyCineradoProg/extern
You can request its contents directly and parse them.
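For example, a minimal sketch with jsoup, assuming (as the snippet above suggests) that the endpoint serves HTML; the "td" selector is a placeholder you would adapt to the actual markup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Showtimes {
    public static void main(String[] args) throws Exception {
        // fetch the endpoint the AJAX call loads, not the outer page
        Document doc = Jsoup
                .connect("http://data.cineradoplex.de/mod/AndyCineradoProg/extern")
                .get();
        // print all table cells as a starting point; refine the CSS
        // selector once you see the real structure of the response
        for (Element cell : doc.select("td")) {
            System.out.println(cell.text());
        }
    }
}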

Screen scraping in Java

I'm trying to create an application, written in Java, that uses my university's class search function. I am using a simple HTTP GET request with the following code:
public static String GET_Request(String urlToRead) {
    java.net.CookieManager cm = new java.net.CookieManager();
    java.net.CookieHandler.setDefault(cm);
    URL url;
    HttpURLConnection conn;
    BufferedReader rd;
    String line;
    String result = "";
    try {
        url = new URL(urlToRead);
        conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        while ((line = rd.readLine()) != null) {
            result += line;
        }
        rd.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}
But it is not working.
Here is the URL I am trying to scrape:
https://webapp4.asu.edu/catalog/classlist?c=TEMPE&s=CSE&n=100&t=2141&e=open&hon=F
I tried looking into jsoup, but when I go to their "try jsoup" tab and fetch the URL, it comes up with the same result as the GET request.
The repeated, failed result I get with both the HTTP GET request and jsoup is that they bring up the university's search page, not the actual classes and information about whether they are open or not.
What I am ultimately looking for is a way to scrape the website to see whether the classes have open seats. Once I get the contents of the web page I can parse through it; I'm just not getting any good results.
Thanks!
You need to add a cookie to answer the initial course offerings question:
class search course catalog
Indicate which course offerings you wish to see
* ASU Campus
* ASU Online
You do this by simply adding
conn.setRequestProperty("Cookie", "onlineCampusSelection=C");
to the HttpURLConnection.
I found the cookie by using Google Chrome's Developer Tools (Ctrl-Shift-I): look at the Resources tab, then expand Cookies to see the webapp4.asu.edu cookies.
The following code (mostly yours) gets the HTML of the page you are looking for:
public static void main(String[] args) {
    System.out.println(download("https://webapp4.asu.edu/catalog/classlist?c=TEMPE&s=CSE&n=100&t=2141&e=open&hon=F"));
}

static String download(String urlToRead) {
    java.net.CookieManager cm = new java.net.CookieManager();
    java.net.CookieHandler.setDefault(cm);
    String result = "";
    try {
        URL url = new URL(urlToRead);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Cookie", "onlineCampusSelection=C");
        BufferedReader rd = new BufferedReader(new InputStreamReader(
                conn.getInputStream()));
        String line;
        while ((line = rd.readLine()) != null) {
            result += line + "\n";
        }
        rd.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}
That said, I'd use a real parser like jsoup or HTML Parser to do the actual parsing job.
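Since jsoup was already mentioned, a minimal sketch of the same request done entirely in jsoup, sending the cookie and selecting elements directly (the "tr" selector is a placeholder; inspect the page to find the rows that hold the seat counts):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ClassSearch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup
                .connect("https://webapp4.asu.edu/catalog/classlist?c=TEMPE&s=CSE&n=100&t=2141&e=open&hon=F")
                .cookie("onlineCampusSelection", "C") // same cookie as above
                .get();
        // dump each table row; replace "tr" with a more specific
        // selector once you know the page structure
        for (Element row : doc.select("tr")) {
            System.out.println(row.text());
        }
    }
}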

How can I retrieve a feed in JSON from a Java Servlet?

I want to make an HTTP request and store the result in a JSONObject. I haven't worked much with servlets, so I am unsure whether I am 1) making the request properly and 2) supposed to create the JSONObject myself. I have imported the JSONObject and JSONArray classes, but I don't know where I ought to use them. Here's what I have:
public void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws IOException {
    // create URL
    try {
        // With a single string.
        URL url = new URL(FEED_URL);
        // Read all the text returned by the server
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String str;
        while ((str = in.readLine()) != null) {
            // str is one line of text; readLine() strips the newline character(s)
        }
        in.close();
    } catch (MalformedURLException e) {
    } catch (IOException e) {
    }
}
My FEED_URL is already written so that it will return a feed formatted as JSON.
This has been getting to me for hours. Thank you very much, you guys are an invaluable resource!
First gather the response into a String:
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
StringBuilder fullResponse = new StringBuilder();
String str;
while ((str = in.readLine()) != null) {
    fullResponse.append(str);
}
Then, if the string starts with "{", you can use:
JSONObject obj = new JSONObject(fullResponse.toString()); //[1]
and if it starts with "[", you can use:
JSONArray arr = new JSONArray(fullResponse.toString()); //[2]
[1] http://json.org/javadoc/org/json/JSONObject.html#JSONObject%28java.lang.String%29
[2] http://json.org/javadoc/org/json/JSONArray.html#JSONArray%28java.lang.String%29
Firstly, this is actually not a servlet problem: you don't have any problems with the javax.servlet API, only with the java.net API and the JSON API.
For parsing and formatting JSON strings, I would recommend Gson (Google's JSON library) instead of the legacy JSON APIs. It has much better support for generics and nested properties and can convert a JSON string to a full-fledged JavaBean in a single call.
I've posted a complete code example before here. Hope you find it useful.
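To illustrate the Gson approach, a minimal sketch; the Feed and Entry classes are hypothetical stand-ins for whatever shape your feed actually has:

import com.google.gson.Gson;
import java.util.List;

public class FeedExample {
    // Hypothetical classes mirroring the JSON structure of the feed
    static class Feed {
        String title;
        List<Entry> entries;
    }
    static class Entry {
        String title;
        String link;
    }

    public static void main(String[] args) {
        String json = "{\"title\":\"demo\",\"entries\":"
                + "[{\"title\":\"first\",\"link\":\"http://example.com\"}]}";
        // one call turns the JSON string into a typed object graph
        Feed feed = new Gson().fromJson(json, Feed.class);
        System.out.println(feed.entries.get(0).link);
    }
}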
