Will the Jaunt web scraper be capable of scraping this JavaScript site? - java

I have never done web scraping before; in fact, just 3 hours ago I googled the term "web scraping" to find out what it means, so that is my level of competence on the subject. But I have a task to scrape some numbers for different football matches from the website "betstars.uk", and from what I can see it is a JavaScript website (is it?), which makes an already hard task even harder for me. Can the Jaunt tool for Java do this job, or do I need something else? I am asking because I want to avoid spending more than an hour learning how to use it just to find out it can't do the job.

For some reason I cannot load the website, so I can't tell you whether it uses JavaScript to load content or not.
It's impossible to scrape a JavaScript-based website with Jaunt, because it is a basic web scraping library and does not execute JavaScript at all. However, if the site does use JavaScript, you could use HtmlUnit to load the JavaScript-generated content and scrape the information you need.
Here is an easy tutorial on How to Scrape Javascript in Java
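If you do go the HtmlUnit route, here is a minimal sketch, assuming HtmlUnit 2.x (package com.gargoylesoftware.htmlunit; newer releases moved to org.htmlunit) and assuming the site fills in the odds via background JavaScript; the timeout is a guess you would need to tune:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JauntAlternative {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);
            // Real-world pages often throw script errors unrelated to the data you want
            client.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = client.getPage("https://betstars.uk");
            // Give background AJAX calls up to 10 seconds to finish
            client.waitForBackgroundJavaScript(10_000);
            System.out.println(page.asXml()); // the DOM after JavaScript has run
        }
    }
}
```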

Related

Scrape multiple pages of a website using WebClient (Java)

I am trying to scrape a website using WebClient. I am able to get the data on the first page and parse it, but I do not know how to read the data on the second page; the website calls JavaScript to navigate to the second page. Can anyone suggest how I can get the data from the next pages?
Thanks in advance
The problem you're going to have is that while you (a person) can read the JavaScript in the first page and see that it navigates to another page, having the computer do this is going to be hard.
If you could identify the block of code performing the navigation, you would then need to execute it in such a way that allowed your program to extract the URL. This again is going to be very specific to the structure of the JavaScript and would require a person to identify it.
In short, I think you're dead in the water with this one, though it serves as a good example of why the Unobtrusive JavaScript concept is so important.
This framework integrates HtmlUnit with its headless, JavaScript-enabled browser to fully support scraping multiple pages in the same WebClient session: https://github.com/subes/invesdwin-webproxy
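As a rough illustration of multi-page navigation within one WebClient session using plain HtmlUnit (the URL and XPath below are placeholders; clicking the anchor lets HtmlUnit run the onclick JavaScript and hand back the resulting page):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class NextPageSketch {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage first = client.getPage("https://example.com/results"); // placeholder URL
            // Placeholder XPath: find the link whose onclick handler loads page two
            HtmlAnchor next = first.getFirstByXPath("//a[contains(text(),'Next')]");
            // click() executes the anchor's JavaScript and returns whatever page it navigates to
            HtmlPage second = next.click();
            System.out.println(second.asXml());
        }
    }
}
```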

Best architecture for crawling a website in an application

I am working on a product that needs a feature to crawl a user-supplied URL and publish a separate mobile site for that user. In the crawling process we want to crawl the site's content, CSS, images, and scripts. The product also performs other activities, such as scheduling marketing campaigns. What I want to ask:
What is the best practice, and which open source framework, for this task?
Should we do it in the application itself, or should there be a separate server for this activity (in case it takes load)? Keep in mind that we have roughly 1 lakh (100,000) users visiting every month to publish their mobile sites from the website, and around 1-2k concurrent users.
The application is built in Java on the Java EE platform, using Spring and Hibernate as server-side technologies.
We used Berkeley DB Java Edition for managing an off-heap queue of links to crawl and for distinguishing links pending download from those already downloaded.
For parsing HTML, TagSoup is the best choice for the messy markup found in the wild (see the sketch after this answer).
Batik is the choice for parsing CSS and SVG.
PDFBox is awesome and lets you extract links from PDFs.
The Quartz scheduler is an industry-proven choice for event scheduling.
And yes, you will need one or more servers for crawling, one server for aggregating results and scheduling tasks, and perhaps another server for the web front end and back end.
This worked well for http://linktiger.com and http://pagefreezer.com
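As a hedged sketch of the TagSoup piece mentioned above, here is link extraction from deliberately malformed HTML via its SAX parser (the HTML snippet is made up):

```java
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import java.io.StringReader;

public class LinkExtractor {
    public static void main(String[] args) throws Exception {
        // Note the unclosed tags: TagSoup parses them without complaint
        String html = "<html><body><a href='/page1'>One<a href='/page2'>Two";
        Parser parser = new Parser(); // TagSoup's SAX-compatible parser
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("a".equalsIgnoreCase(local) && atts.getValue("href") != null) {
                    System.out.println(atts.getValue("href")); // collect into your crawl queue
                }
            }
        });
        parser.parse(new InputSource(new StringReader(html)));
    }
}
```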
I'm implementing a crawling project based on the Selenium HtmlUnit Driver. I think it's really the best Java framework for automating a headless browser.

How can I implement something similar to Excel's web query feature in Java?

I'm new here and I hope I'm not asking something that has already been answered. I have searched everywhere but have yet to discover an adequate answer.
My objective is fairly simple: I want to create a program which will stream the live gold and silver rates from this website.
How would I be able to pinpoint the values that I want to download? Currently, I have managed to implement this using Microsoft Excel's web query feature wherein I am able to select a table from the webpage. However, I want to make it a standalone application.
By the way, I need to retrieve the rates to perform a calculation which is then displayed to the user.
I would greatly appreciate any ideas on how this can be achieved.
I think you need to scrape or parse the data from your website. For that, take a look at the HtmlCleaner and jsoup HTML parsers.
You can use XPath with HtmlCleaner to pinpoint the data. Here is a nice example: Xpath Example.
You can use Firefox's Firebug extension to get the XPath of an element, but your XPath is going to be very long, by the look of the website you mentioned.
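A minimal sketch of the HtmlCleaner-plus-XPath approach (the URL and XPath are placeholders you would replace after inspecting the real page, e.g. with Firebug):

```java
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

import java.net.URL;

public class XPathSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at the page carrying the rates table
        TagNode root = new HtmlCleaner().clean(new URL("https://example.com/live-rates"));
        // Placeholder XPath: second cell of the first table row
        Object[] cells = root.evaluateXPath("//table//tr[1]/td[2]");
        if (cells.length > 0) {
            System.out.println(((TagNode) cells[0]).getText());
        }
    }
}
```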
Then you have to execute the code at an interval of your choice.
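For the interval part, the JDK's own ScheduledExecutorService is enough; a sketch with a made-up 30-second period and a hypothetical fetchRates method:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RatePoller {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Re-run the scrape every 30 seconds; choose whatever period suits you
        scheduler.scheduleAtFixedRate(RatePoller::fetchRates, 0, 30, TimeUnit.SECONDS);
    }

    private static void fetchRates() {
        // Hypothetical: call your HtmlCleaner/jsoup scraping code here
        System.out.println("fetching rates...");
    }
}
```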
And if there is JavaScript in play, then you have to execute the JavaScript behind the page from your Java code.
Try using Rhino from Mozilla and its integration libraries, or the JDK 1.6 ScriptEngine facility.
For a ScriptEngine example, take a look here: http://metoojava.wordpress.com/2010/06/20/execute-javascript-from-java/
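A tiny self-contained ScriptEngine example (the script itself is made up; on JDK 6/7 the "JavaScript" engine resolves to Rhino, on JDK 8 to Nashorn):

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class JsEval {
    public static void main(String[] args) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
        // Evaluate a snippet and read back the result as a Java object
        Object result = engine.eval("var price = 1800.50; price * 2;");
        System.out.println(result); // prints 3601.0
    }
}
```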
In short, take a look at HTML parsers to parse the content of your page.

Getting HTML from web pages that use AJAX

I wanted to know how to scrape web pages that use AJAX to fetch content as the page is rendered. Typically an HTTP GET for such pages will just fetch the HTML with the JavaScript code embedded in it. But I want to know if it is possible to programmatically (preferably in Java) query such pages and simulate a web browser's requests, so that I get the HTML content that results after the AJAX calls.
In The Productive Programmer, author Neal Ford suggests that the functional testing tool Selenium can be used for non-testing tasks. Your task of inspecting HTML after client-side DOM manipulation has taken place falls into this category. Selenium even allows you to automate interactions with the browser, so if you need some buttons clicked to fire some AJAX events, you can script it. Selenium works by using a browser plugin and a Java-based server. Selenium test code (or non-test code, in your case) can be written in a variety of languages including Java, C# and other .NET languages, PHP, Perl, Python and Ruby.
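A hedged sketch of that idea with Selenium WebDriver (2.x/3.x-style API; Selenium 4 passes a Duration to WebDriverWait; the URL and element id are placeholders):

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class AjaxScrapeSketch {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver(); // needs the Firefox driver on your PATH
        try {
            driver.get("https://example.com/ajax-page"); // placeholder URL
            // Block until the AJAX-populated element appears, then read the final DOM
            new WebDriverWait(driver, 10)
                    .until(ExpectedConditions.presenceOfElementLocated(By.id("content")));
            System.out.println(driver.getPageSource());
        } finally {
            driver.quit();
        }
    }
}
```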
You may want to look at HtmlUnit.
Why choose when you can have both? TestPlan supports both Selenium and HTMLUnit as a backend. Plus it has a really simple language for doing the most common tasks (extensions can be written in Java if need be -- which is rare actually).

autogenerate HTTP screen scraping Java code

I need to screen scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself using Apache's HTTP client library to make the relevant HTTP calls to download the data. I figured out the relevant calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
As you can imagine this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code corresponding to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure it supports this exact use case.
Thanks,
Don
I would also add +1 for HtmlUnit, since its functionality is very powerful: if you need behaviour "as though a real browser were scraping and using the page", it's definitely the best option available. HtmlUnit executes (if you want it to) the JavaScript in the page.
It currently has full-featured support for all the main JavaScript libraries and will execute JS code that uses them. Correspondingly, you can get handles to the JavaScript objects in the page programmatically within your test.
If, however, the scope of what you are trying to do is smaller, more along the lines of reading some HTML elements where you don't much care about JavaScript, then NekoHTML should suffice. It is similar to JDOM in giving programmatic, rather than XPath-based, access to the tree. You would probably need to use Apache's HttpClient to retrieve pages.
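A minimal sketch of the retrieval step with Apache HttpClient 4.x (the URL is a placeholder); the returned HTML string can then be handed to NekoHTML or any other parser:

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class PageFetcher {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet get = new HttpGet("https://example.com/data"); // placeholder URL
            try (CloseableHttpResponse response = client.execute(get)) {
                String html = EntityUtils.toString(response.getEntity());
                System.out.println(html); // feed this into NekoHTML or your parser of choice
            }
        }
    }
}
```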
The manageability.org blog has an entry listing a whole bunch of web page scraping tools for Java. I do not seem to be able to reach it right now, but I did find a text-only representation in Google's cache here.
You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.
I would say I personally like to use HtmlUnit and Selenium as my two favorite tools for screen scraping.
A tool called The Grinder allows you to script a session to a site by going through its proxy. The output is Python (runnable in Jython).

Categories

Resources