Screen scraping in Java


I'm trying to write a Java application that uses my university's class search function. I send a simple HTTP GET request with the following code:
public static String GET_Request(String urlToRead) {
    java.net.CookieManager cm = new java.net.CookieManager();
    java.net.CookieHandler.setDefault(cm);
    URL url;
    HttpURLConnection conn;
    BufferedReader rd;
    String line;
    String result = "";
    try {
        url = new URL(urlToRead);
        conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        while ((line = rd.readLine()) != null) {
            result += line;
        }
        rd.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}
But it is not working.
Here is the url I am trying to scrape:
https://webapp4.asu.edu/catalog/classlist?c=TEMPE&s=CSE&n=100&t=2141&e=open&hon=F
I looked into jsoup, but when I fetch the URL on their "try jsoup" tab, it returns the same result as my GET request.
In both cases the failure is the same: I get the university's search page, not the actual class list with the information about whether each class is open.
What I am ultimately looking for is a way to scrape the page that shows whether classes have open seats. Once I have the contents of that page I can parse them; I'm just not getting the right page back.
Thanks!

You need to send a cookie that answers the site's initial course offerings question:
class search course catalog
Indicate which course offerings you wish to see
* ASU Campus
* ASU Online
You do this by simply adding
conn.setRequestProperty("Cookie", "onlineCampusSelection=C");
to the HttpURLConnection.
I found the cookie using Google Chrome's Developer Tools (Ctrl-Shift-I): I opened the Resources tab, then expanded Cookies to see the cookies for webapp4.asu.edu.
The following code (mostly yours) gets the HTML of the page you are looking for:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
public static void main(String[] args) {
    System.out.println(download("https://webapp4.asu.edu/catalog/classlist?c=TEMPE&s=CSE&n=100&t=2141&e=open&hon=F"));
}
static String download(String urlToRead) {
    java.net.CookieManager cm = new java.net.CookieManager();
    java.net.CookieHandler.setDefault(cm);
    // StringBuilder avoids re-copying the whole page on every appended line.
    StringBuilder result = new StringBuilder();
    try {
        URL url = new URL(urlToRead);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Send the cookie that answers the course offerings question.
        conn.setRequestProperty("Cookie", "onlineCampusSelection=C");
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader rd = new BufferedReader(new InputStreamReader(
                conn.getInputStream()))) {
            String line;
            while ((line = rd.readLine()) != null) {
                result.append(line).append("\n");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result.toString();
}
Although, I'd use a real parser like jsoup or HTML Parser to do the actual parsing job.
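To see the cookie effect without touching the real site, here is a minimal, self-contained sketch. It assumes a throwaway local server (using the JDK's built-in com.sun.net.httpserver) that stands in for webapp4.asu.edu: it answers "search page" unless the request carries onlineCampusSelection=C. The class name CookieGateDemo, the helper names and the two response strings are made up for illustration; only the cookie name and value come from the answer above.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class CookieGateDemo {

    // Starts a local stand-in server on a free port (hypothetical, not the real site).
    static HttpServer startServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                String cookie = exchange.getRequestHeaders().getFirst("Cookie");
                boolean chosen = cookie != null && cookie.contains("onlineCampusSelection=C");
                byte[] body = (chosen ? "class list" : "search page")
                        .getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Plain GET; the Cookie header is only sent when a value is supplied.
    static String fetch(String urlToRead, String cookie) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(urlToRead).openConnection();
            conn.setRequestMethod("GET");
            if (cookie != null) {
                conn.setRequestProperty("Cookie", cookie);
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startServer();
        try {
            String base = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
            System.out.println(fetch(base, null));
            System.out.println(fetch(base, "onlineCampusSelection=C"));
        } finally {
            server.stop(0);
        }
    }
}
```

The first request has no cookie and gets the stand-in search page; the second sends the cookie and gets the stand-in class list, which mirrors the behaviour described in the question. The sketch needs Java 9 or later because of InputStream.readAllBytes.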

Related

G suite account get report java sample question

I am trying to use this API to get a report with Java. Here is the link:
https://developers.google.com/admin-sdk/reports/v1/appendix/activity/meet
and here is what I am using now:
public static String getGraph() {
    String PROTECTED_RESOURCE_URL = "https://www.googleapis.com/admin/reports/v1/activity/users/all/applications/meet?eventName=call_ended&maxResults=10&access_token=";
    String graph = "";
    try {
        URL urUserInfo = new URL(PROTECTED_RESOURCE_URL + "access_token");
        HttpURLConnection connObtainUserInfo = (HttpURLConnection) urUserInfo.openConnection();
        if (connObtainUserInfo.getResponseCode() == HttpURLConnection.HTTP_OK) {
            StringBuilder sbLines = new StringBuilder("");
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connObtainUserInfo.getInputStream(), "utf-8"));
            String strLine = "";
            while ((strLine = reader.readLine()) != null) {
                sbLines.append(strLine);
            }
            graph = sbLines.toString();
        }
    } catch (IOException ex) {
        ex.printStackTrace();
    }
    return graph;
}
I am pretty sure this is not a smart way to do it, and the string I get back is quite complex. Is there a Java sample that gets the data directly, instead of using Java's built-in HTTP request classes?
Or is there a class I can import to convert the JSON string into an object?
Can anyone help?
I have been trying this for many days already!
Thanks!!

Retrieve HTML content refreshed with Ajax

I tried to get HTML content from a website and I did it with this code.
public void extractRoutes(String urlStringifyed) throws MalformedURLException, IOException {
    URL url = new URL(urlStringifyed);
    URLConnection c = url.openConnection();
    c.connect();
    BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
    String line = null;
    StringBuilder sb = new StringBuilder();
    while ((line = br.readLine()) != null) {
        sb.append(line);
    }
}
Now I want to get the content of a specific page that is loaded with Ajax and protected with reCAPTCHA, but I can't.
Below is the URL. I'm passing all the arguments, but the content I get back says that the service is temporarily down and that I should try again later. What I don't understand is that when I copy the URL and paste it into my browser, it works fine. The second link, which does not involve reCAPTCHA, shows the same message.
https://mersultrenurilor.infofer.ro/ro-RO/Itineraries?DepartureStationName=Ia%C8%99i&ArrivalStationName=Suceava&DepartureDate=21.01.2019&TimeSelectionId=0&MinutesInDay=0&ChangeStationName=&DepartureTrainRunningNumber=&ArrivalTrainRunningNumber=&ConnectionsTypeId=0&OrderingTypeId=0&g-recaptcha-response=03AO9ZY1ChGhLCoSKCnF49dyCskHENK7ZUYdJEK_UCDVPn7RYGp40CMRUxvA0Q_ni6fDhP9BRm6viymicOOudd78WJbaHb2vbbtCq0DLS7NzngWBAgBKaWBFBa94RKqetwMSR89p5G1a8oS3bknB6d2tyZ2zhUk1veesR2Ef-RNVXDMpy0GotKH_XGPylDTvL5ftIrDem1LmWb4lQYNY0CCJ7jFScQf6SRqSH18jBWHAGEXVSlsQjoK8X4Q6riSlo1LK_vMJR-F-HVig7vavBd6zTI6LjceGyBtlQZCK7tcIuj4cS9Yg-tMbRKn_laukwLkceOpN8Q88_Aafz9JPtyx-eJAN_5fMbuRw
http://mersultrenurilor.infofer.ro/ro-RO/Itineraries?DepartureStationName=Ia%C8%99i&ArrivalStationName=Suceava&DepartureDate=21.01.2019%200%3A00%3A00&AreOnlyTrainsWithReservation=False&ArrivalTrainRunningNumber=&DepartureTrainRunningNumber=&ConnectionsTypeId=0&MinutesInDay=0&OrderingTypeId=0&TimeSelectionId=0&ChangeStationName=&IsSearchWanted=False
How can I get the HTML content (I'm interested in the train routes that are shown) from this loaded URL?

Google's Custom Search as if manually searched

I want to use Google's Custom Search API from Java to search the web for song lyrics.
To get the name and artist of the song currently playing I use Tesseract OCR. Even when the OCR works perfectly, I often don't get any results.
But when I try it manually, by opening Google in the web browser and searching for the same string, it works fine.
So I don't really know what the difference is between the manual search and the API call.
Do I have to add some parameters to the API request?
// The String searchString is what I am searching for, so the song name and artist
String searchUrl = "https://www.googleapis.com/customsearch/v1?key=(myKEY)=de&cx=(myID)&q=" + searchString + "lyrics";
String data = getData(searchUrl);
JSONObject json = new JSONObject(data);
String link = "";
try {
    link = json.getJSONArray("items").getJSONObject(0).getString("link");
    URI url = new URI(link);
    System.out.println(link);
    Desktop.getDesktop().browse(url);
} catch (Exception e) {
    System.out.println("No Results");
}
private static String getData(String _urlLink) throws IOException {
    StringBuilder result = new StringBuilder();
    URL url = new URL(_urlLink);
    URLConnection conn = url.openConnection();
    BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String line;
    while ((line = rd.readLine()) != null) {
        result.append(line);
    }
    rd.close();
    return result.toString();
}

Try removing =de before &cx, and use + to represent the spaces between words, like this: https://www.googleapis.com/customsearch/v1?key=(yourKEY)&cx=(yourID)&q=paradise+coldplay+lyrics
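Rather than inserting the + signs by hand, the query can be encoded with java.net.URLEncoder, which turns spaces into + and escapes characters such as & or / that OCR output might contain. The following is only a sketch: the class name SearchUrlBuilder, the method buildSearchUrl and the placeholder key and ID values are assumptions for illustration, and the Charset overload of URLEncoder.encode needs Java 10 or later. Note that it joins the song text and the word lyrics with a space, which the code in the question does not do.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SearchUrlBuilder {

    // Builds the Custom Search URL; key and cx are passed through unchanged.
    static String buildSearchUrl(String key, String cx, String query) {
        // URLEncoder produces form encoding: spaces become '+', reserved characters become %XX.
        String encodedQuery = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return "https://www.googleapis.com/customsearch/v1?key=" + key
                + "&cx=" + cx
                + "&q=" + encodedQuery;
    }

    public static void main(String[] args) {
        String searchString = "paradise coldplay"; // e.g. the OCR result
        // "(yourKEY)" and "(yourID)" are placeholders, as in the answer above.
        System.out.println(buildSearchUrl("(yourKEY)", "(yourID)", searchString + " lyrics"));
    }
}
```

With the inputs shown, the query part of the printed URL ends in q=paradise+coldplay+lyrics, matching the URL suggested above.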

How to Properly Check for JSON Updates - Java

I'm trying to write a program that checks for announcements via a web API. It connects to a remote server and reads the JSON on the page. I cannot test my code because the server is not live yet. Would this work, and is it the correct way to go about it?
public class AnnouncementChecker implements Runnable {
    private final String announcementsURL = "REDACTED";
    private String lastAnnouncement = "";
    @Override
    public void run() {
        try {
            URL url = new URL(announcementsURL);
            HttpURLConnection http = (HttpURLConnection) url.openConnection();
            http.setRequestMethod("conditional GET");
            http.setRequestProperty("Connection", "keep-alive");
            http.setUseCaches(true);
            http.setAllowUserInteraction(false);
            if (lastAnnouncement != "") {
                http.setRequestProperty("If-Modified-Since", lastAnnouncement);
            }
            http.setConnectTimeout(10);
            http.connect();
            int status = http.getResponseCode();
            if (status == 304 || (status == 200 && lastAnnouncement == "")) {
                lastAnnouncement = http.getHeaderField("Last-Modified");
                BufferedReader br = new BufferedReader(new InputStreamReader(http.getInputStream()));
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = br.readLine()) != null) {
                    sb.append(line + "\n");
                }
                br.close();
                String json = sb.toString();
                JSONParser parser = new JSONParser();
                JSONObject jsonResponse = (JSONObject) parser.parse(json);
                //String announcement = (String) jsonResponse.get("message");
                //TODO What to do with announcement...
            }
            http.getInputStream().close();
            http.disconnect();
        } catch (IOException | ParseException e) {
            e.printStackTrace();
        }
    }
}

I would recommend setting up a test of some kind, even though the server is not available. It would answer your question, and it would stay in place to protect you when the code or the business requirements change.
To help with that, I would recommend separating the code that fetches the response from the code that parses it. That way you can test the parsing independently of the part that makes the HTTP connection.
If you have no idea how to do that then I'd be happy to post an example for you.
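One possible shape for that split is sketched below. The fetching side is reduced to a Supplier<String>, so a test can hand in a canned JSON string while production code hands in the HTTP call. Everything here is an assumption for illustration: the names AnnouncementReader, parseMessage and latest are invented, the "message" field is taken from the commented-out line in the question, and the regular expression is only a stand-in for a real JSON parser (it does not handle escaped quotes or nested objects).

```java
import java.util.Optional;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnnouncementReader {

    // Stand-in for a real JSON parser: finds a plain string field called "message".
    private static final Pattern MESSAGE =
            Pattern.compile("\"message\"\\s*:\\s*\"([^\"]*)\"");

    // Parsing only: no network access, so it can be tested with fixed strings.
    static Optional<String> parseMessage(String json) {
        Matcher m = MESSAGE.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Glue: the source decides where the JSON comes from (HTTP in production, a constant in tests).
    static Optional<String> latest(Supplier<String> source) {
        return parseMessage(source.get());
    }

    public static void main(String[] args) {
        // Canned response used while the real server is not live yet.
        Supplier<String> fakeServer = () -> "{\"message\": \"Maintenance at 10pm\"}";
        System.out.println(latest(fakeServer).orElse("no announcement"));
    }
}
```

In a real project parseMessage would delegate to whichever JSON library is already in use, and the test would live in a unit test class rather than in main; the point of the sketch is only that the parsing step never needs the server.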

Download AJAX generated content using java

I have a webpage on which a list of movies is being displayed. The content is created using AJAX (as far as my limited knowledge would suggest...).
I want to download the content, in this case the movie playing times, using Java. I know how to download a simple website, but here my solution only gives me the following result instead of the playing times:
ajaxpage('http://data.cineradoplex.de/mod/AndyCineradoProg/extern',
"kinoprogramm");
How do I make my program download the results this AJAX function gives?
Here is the code I use:
String line = "";
// The page address must be a quoted string before it is turned into a URL.
String urlString = "http://www.cineradoplex.de/programm/spielplan/";
BufferedReader in = null;
try {
    URL myUrl = new URL(urlString);
    in = new BufferedReader(new InputStreamReader(myUrl.openStream()));
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }
} finally {
    if (in != null) {
        in.close();
    }
}

In your response you can see the address from which the actual data is retrieved:
http://data.cineradoplex.de/mod/AndyCineradoProg/extern
You can request that address directly and parse its contents.
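If the data address should not be hard-coded, it can be pulled out of the downloaded page first. The sketch below assumes the page always contains a call of the form ajaxpage('<url>', ...) with the URL in single quotes, as in the output shown in the question; the class name AjaxUrlExtractor and the method extractDataUrl are made up for illustration. The extracted address can then be downloaded with the same code that was used for the outer page.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AjaxUrlExtractor {

    // Matches ajaxpage('<url>' and captures the single-quoted first argument.
    private static final Pattern AJAX_CALL = Pattern.compile("ajaxpage\\(\\s*'([^']+)'");

    // Returns the first ajaxpage URL found in the page source, if any.
    static Optional<String> extractDataUrl(String html) {
        Matcher m = AJAX_CALL.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // The snippet the question reports getting back from the outer page.
        String html = "ajaxpage('http://data.cineradoplex.de/mod/AndyCineradoProg/extern',\n"
                + "\"kinoprogramm\");";
        System.out.println(extractDataUrl(html).orElse("no ajaxpage call found"));
    }
}
```

For the snippet from the question this prints http://data.cineradoplex.de/mod/AndyCineradoProg/extern. A regular expression is enough for this one call; for anything more involved an HTML parser is the safer choice.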
