Scrape a Dynamic Website using Java with Selenium?

I'm trying to scrape https://www.rspca.org.uk/findapet#onSubmitSetHere to get a list of all pets for adoption.
I've built web scrapers before using crawler4j but the websites were static.
Since https://www.rspca.org.uk/findapet#onSubmitSetHere is not a static website, how can I scrape it? Is it possible? What technologies should I use and how?
Update:
When you fill in the search form (Select type of pet and Enter postcode/town or county) in the UI, the results are then displayed below the search box.
In the screenshot, the search bar is outlined in red and the results are outlined in black.
I'm trying to scrape the results and also the content of each result.
I've had a look at the requests the browser makes to retrieve the results, but from Chrome dev tools it isn't obvious which request is being made.

You could use Selenium to extract information from the DOM once a browser has rendered it, but I think a simpler solution is to use "developer tools" to find the request that the browser makes when the "search" button is clicked, and try to reproduce that.
In this case the browser makes a POST to https://www.rspca.org.uk/findapet?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search
The body of the POST request contains a lot of parameters, including animalType and location. The content-type of the request is application/x-www-form-urlencoded.
To see these parameters, go to the "Network" tab in Chrome dev tools, click on the "findapet" request (it's the first one in the list when I do this), and click on the "Payload" tab to see the query string parameters and the form parameters (which contain animalType and location).
The response contains HTML.
I would try making a request to that endpoint and then parsing the HTML in the response.
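As a rough sketch of that approach, the following uses only the JDK's java.net.http client (Java 11+) to build the form-encoded POST. The endpoint and the parameter names animalType and location come from the request described above; the class name, the helper names, and the example values ("DOG", "London") are hypothetical, and the real request carries more parameters that would have to be copied from the Payload tab. The request is only sent when the program is started with --send.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class FindAPetRequest {

    // Endpoint observed in the browser's Network tab (quoted in the answer above).
    static final String SEARCH_URL = "https://www.rspca.org.uk/findapet"
            + "?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets&p_p_lifecycle=1"
            + "&p_p_state=normal&p_p_mode=view"
            + "&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search";

    // Encode key/value pairs as application/x-www-form-urlencoded.
    static String encodeForm(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Build (but do not send) the POST request.
    static HttpRequest buildSearchRequest(Map<String, String> params) {
        return HttpRequest.newBuilder(URI.create(SEARCH_URL))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(params)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("animalType", "DOG");   // hypothetical value
        params.put("location", "London");  // hypothetical value
        System.out.println("Form body: " + encodeForm(params));

        // Only touch the network when explicitly asked to.
        if (args.length > 0 && args[0].equals("--send")) {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(buildSearchRequest(params), HttpResponse.BodyHandlers.ofString());
            System.out.println("Status: " + response.statusCode());
            System.out.println(response.body().length() + " characters of HTML");
        }
    }
}
```

The returned HTML can then be handed to whichever HTML parser you already use. If the server rejects the request, the missing piece is most likely one of the other form parameters or a cookie that the browser sends.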

Related

Logging into and retrieving data from a website using a "post" request on JSoup

I am trying to create an app that logs into my school's grade servers and displays data. The website I am trying to log into is: "https://portal.mcpsmd.org/public/"
I have written my code according to this stackoverflow question. Here is the relevant code for my specific situation:
Connection.Response loginForm = Jsoup.connect("https://portal.mcpsmd.org/public/")
.method(Connection.Method.GET)
.execute();
Document document = Jsoup.connect("https://portal.mcpsmd.org/guardian/home.html/")
.data("account", "#######")
.data("pw","*******")
.cookies(loginForm.cookies())
.post();
System.out.println(document.title());
The reason I have written my code this way is because when I did "inspect element" on my school's page I saw this:
screenshot of school's inspect element
I can see that my school uses the method "post" and the login request is at "/guardian/home.html"
Many of the Stack Overflow questions I have looked at have told me to use a method like this, where I connect to the login request and send it my username and password as data. Am I going about this correctly at all? I am trying to use JSoup to log in to my school's website, so in my mind, if I call "document.title()", it should print the title of the page I get after signing in. Instead, the program only prints the title of the login page.
I have been working on this project for almost two days now, and initially started with 0 knowledge of JSoup. I would greatly appreciate any explanation on the best way to go about logging into a website using JSoup.

How to get the Request URLs for different links on a page in Java?

I am on this page: http://www.flashscore.com/nhl/. You see here the first table with 'Today's Matches'. If you click on a match, you get to the summary of the game.
There you can click on the 'H2H' tab and you get here: http://www.flashscore.com/match/Q1OevyV9/#h2h;overall. At this point if you open the developer tools, and you click on the network tab you can find out the Request URL.
I started to write a program in Java which gets all the Request URLs for all the H2Hs of those matches that are in the 'Today's Matches' table.
final Document page = Jsoup
.connect("http://d.flashscore.com/x/feed/tx_xlQp8HDC_pMu72He4")
.cookie("_ga","GA1.2.47011772.1485726144")
.referrer("http://d.flashscore.com/x/feed/proxy-local")
.userAgent(myUserAgent)
.header("X-Fsign", "SW9D1eZo")
.header("X-GeoIP", "1")
.header("X-Requested-With", "XMLHttpRequest")
.get();
So, with that code I got the page, but I have no idea how to proceed. Can somebody with experience in web scraping please help me?

Jsoup with a plugin

I'm using Jsoup to scrape some online data from different stores, but I'm having trouble figuring out how to programmatically replicate what I do as a user. To get the data manually (after logging in), a user must select a store from a tree that pops up.
As best I can tell, the tree is not hard-coded into the site but is built interactively when your computer interacts with the server. When you look for the table in "view page source," there are no entries. When I inspect the tree, I do see the HTML and it seems to come from the "FancyTree" plugin.
As best as I can tell from tracking my activity on Developer Tools -- Network, the next step is a "GET" request which doesn't change the URL, so I'm not sure how my store selection is being transferred.
Any advice on how to get Jsoup or Java generally to programmatically interact with this table would be extremely helpful, thank you!

Jsoup can only parse the original source file, not the DOM. In order to parse the DOM, you'll need to render the page with something like HtmlUnit. Then you can parse the html content with Jsoup.
// load page using HTML Unit and fire scripts
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(myURL);
// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml());
// do something with html content
System.out.println(doc.html());
// clean up resources
webClient.close();
See Parsing Javascript Generated Page with Jsoup.

How to show the content of a log file in new window when we click on a link in spring mvc

My main goal: when I click the link, a new browser window should open and display the content of the entire log file. The window should not have an address bar or navigation buttons (Back, Forward).
Is there any way to do this in a Spring MVC project?
Here is what I am doing now:
When I click the link, the controller is called with a parameter logName.
The controller then has access to all the details of the log file, such as its content and path. I set these details on an object and send it back to the JSP.
At this point I am not sure how to open a new window and display the content of the log file in it.
Please suggest an approach.
It would be very helpful if you could share some examples.

Neither Spring nor JSP has anything to do with it; the only way to make the user's browser open a link in a new window with customised chrome is client-side JavaScript. window.open() lets you configure the popup to hide certain interface elements (see all options in the documentation).
Your code would look something like:
<input type="button" value="Show Log" onclick="showLog(logName)">
function showLog(logName) {
var url = "/path-to-your-controller.html?logName=" + logName;
window.open(url, "LogPage", "toolbar=no,location=no,menubar=no");
}
However, I don't think using a customised browser popup is a good solution; it's been disappearing from the web for a reason. It would be more elegant to fetch raw data using AJAX and display it in a JS popup: it wouldn't interfere with user's page navigation (you tagged the question with jQuery, you could use jQuery UI for that).
What is more, I wouldn't be surprised if window.open wasn't supported by all browsers in the same way† - something to keep in mind if you're targeting a wider audience.
† seems that Chrome ignores location=no, for instance

Easiest way to "browse" to a page and submit form in Java

What I need to do is browse to a webpage, log in, then browse to another webpage on that site that requires you to be logged in, so it needs to save cookies. After that, I need to click an element on that page, fill out the form that appears, and get the message that the webpage returns to me. The reason I need to actually go to the page and click the button, as opposed to just navigating directly to the link, is that you are assigned a session ID every time you log in and click the link, and it's always different. The button looks like this; it's not a normal href link:
<span id=":tv" idlink="" class="sA" tabindex="0" role="link">Next</span>
Anyway, what would be the easiest way to do this? Thanks.
Update:
After trying HtmlUnit and other headless browser libraries, it doesn't seem that this can be done with anything "headless." Another thing that I recently found out about this page is that all the HTML is in some weird format: it's all inside a script tag. Here is a sample.
"?ui\x3d2\x26view\x3dss\x26mset\x3dmain\x26ver\x3d-68igm85d1771\x26am\x3d!Zsl-0RZ-XLv0BO3aNKsL0sgMg3nH10t5WrPgJSU8CYS-KNWlyrLmiW3HvC5ykER_n_5dDw\x26fri"],"http://example.com/?ctx\x3d%67mail\x26hl\x3den",,0,"Gmail","Gmail",[["us","c130f0854ca2c2bb",[["n"],["m","New features!"],["u"],["k","0"],["p","1000:500000,10,200000,5,100000,3,75000,2,0,1"],["h","https://survey.googleratings.com/wix/p1679258.aspx?l\x3d1033"],["at","query,5,contacts,5,adv,5,cf,5,default,20"],["v","https://www.youtube.com/embed/Ra8HG6MkOXY?showinfo\x3d0"],
When I do inspect element on the button, the HTML code that I posted above for the button comes up, but not when doing view source. Basically, what I am going to need to do is use some sort of GUI and have the user navigate to the link and then have the program fill out the info. Does anyone know how I can do this? Thanks.

Have a look at the 5 Minute Getting Started Guide for Selenium: http://code.google.com/p/selenium/wiki/GettingStarted

On the login page, look at the form's HTML to see the URL it posts to and the parameters it sends. Then request that URL with the same parameters filled in with the correct info, and make sure to save all the cookie headers so you can send them to the second page. Then use an HTML parser to find your link. There are several HTML parsers available on SourceForge, and you could even try Java's built-in XML parsers, though they will fail if the site has even a tiny HTML mistake.
EDIT: I didn't notice that it is not a normal link. In that case you will need to look at the site's JavaScript to see where the link leads. If the link requires JavaScript to run, it gets more complicated. Java cannot execute browser JavaScript on its own, but I found a library called DJ Native Swing which includes a web browser class that you can add to JFrames. It uses your native browser to render the page and to run JavaScript.

This should be possible in Selenium as others have noted.
I have used Selenium to log in, then crawl a site and discover every permutation of values for every form on the site (30+ forms). These values are later used to fill and submit the form with a specific permutation of values. The site was very JS/jQuery heavy, and I used Selenium's built-in support for the JavaScript executor, CSS selectors, and XPath to accomplish this.
I implemented HtmlUnit and HttpUnit as faster alternatives, but found they were not as reliable as Selenium given the JS semantics of the site I was crawling.
It's hard to give you code on how to accomplish it because your Selenium implementation will be quite page-specific and I can't look at the page you're coding against to figure out what's going on with that button script junk. However, I have included some possibly relevant Selenium (Java) snippets:
WebElement element = driver.findElement(By.id(value)); // find a single element on the page
List<WebElement> buttons = parent.findElements(By.xpath("./tr/td/button")); // find child elements
buttons.get(0).click(); // click the first matching button
element.submit(); // submit the enclosing form
element.sendKeys(text); // enter text in an input
String elementText = (String) ((JavascriptExecutor) driver).executeScript("return arguments[0].innerText || arguments[0].textContent", element); // read an element's text via JS
If you are coding similar functions on different pages, then PageObjects behind interfaces can help.
The link Anew posted is a good starting point and good ol' StackOverflow has answers to just about any Selenium problem ever.

Instead of trying to browse around programmatically, try executing the login request and save the cookies then set those in the next request to the form post.
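A minimal sketch of that idea using only the JDK (Java 11+): an HttpClient configured with a CookieManager stores the cookies set by the login response and sends them automatically on later requests. The class name, method names, and command-line arguments are placeholders for illustration and are not taken from any real site; the login form's field names would have to be read from the page's HTML.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.stream.Collectors;

public class CookieSession {

    // Stores cookies from every response and replays them on later requests.
    final CookieManager cookies = new CookieManager(null, CookiePolicy.ACCEPT_ALL);

    final HttpClient client = HttpClient.newBuilder()
            .cookieHandler(cookies)
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    // POST an already URL-encoded form body, e.g. the login form.
    String postForm(String url, String formBody) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // GET a page; cookies saved by earlier responses are sent automatically.
    String get(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Names of the cookies currently stored for the given URL's host (handy for debugging).
    List<String> cookieNames(String url) {
        return cookies.getCookieStore().get(URI.create(url)).stream()
                .map(HttpCookie::getName)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: CookieSession <loginUrl> <formBody> <pageUrl>");
            return;
        }
        CookieSession session = new CookieSession();
        session.postForm(args[0], args[1]);                     // log in
        System.out.println("cookies: " + session.cookieNames(args[2]));
        System.out.println(session.get(args[2]));               // fetch a page behind the login
    }
}
```

This only helps when the target link or form can be reached with plain HTTP requests. If the session ID is generated by JavaScript in the page, as the question suggests, a real browser driven by Selenium is likely still needed.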

HtmlUnit is pretty bad at processing JavaScript: its Rhino-based JS engine often produces errors (a page that runs with no errors is very much the exception). I would advise using Selenium, which is basically a framework for driving real browsers such as Chrome and Firefox, optionally in headless mode.
For your question, the following code (written against the older Selenium RC API) would do the job:
selenium.open(myurl);
selenium.click("id=:tv");
You then have to wait for the page to load
selenium.waitForPageToLoad(someTime);

I would recommend htmlunit any day. It's a great library.
First, check out their web page (http://htmlunit.sourceforge.net/) to get HtmlUnit up and running. Make sure you use the latest snapshot (2.12 at the time of writing).
Try these settings to ignore pretty much any obstacle:
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17);
webClient.getOptions().setRedirectEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getCookieManager().setCookiesEnabled(true);
Then when fetching your page, make sure you wait for background Javascript before doing anything with the page, like posting a login form:
//Get Page
HtmlPage page1 = webClient.getPage("https://login-url/");
//Wait for background Javascript
webClient.waitForBackgroundJavaScript(10000);
//Get first form on page
HtmlForm form = page1.getForms().get(0);
//Get login input fields using input field name
HtmlTextInput userName = form.getInputByName("UserName");
HtmlPasswordInput password = form.getInputByName("Password");
//Set input values
userName.setValueAttribute("MyUserName");
password.setValueAttribute("MyPassword");
//Find the first button in form using name, id or xpath
HtmlElement button = (HtmlElement) form.getFirstByXPath("//button");
//Post by clicking the button and cast the result, login arrival url, to a new page and repeat what you did with page1 or something else :)
HtmlPage page2 = (HtmlPage) button.click();
//Profit
System.out.println(page2.asXml());
I hope this basic example will help you!
