Use HtmlUnit as a crawler - java

I needed a headless browser to parse pages.
HtmlUnit allowed me to set up a Heroku Java app for this purpose.
But now I'm running into a couple of issues.
The current one is malformed URLs: "//path" instead of "/path" or "http(s)://path".
I downloaded the sources of the 2.9.4 version and pushed tiny fixes into them...
But modifying the standard sources isn't really sustainable, for obvious maintainability reasons.
So I'm wondering whether I'm digging in the wrong direction.
HtmlUnit is designed to browse pages for testing purposes. Mine is to behave like a real browser and make pages work as well as possible, especially because my damned target websites are the ultra-dirty, standards-ignoring kind...
What is your opinion on this approach?

HtmlUnit is used in Selenium 2/WebDriver for headless browser "simulation", and there it works fine.
So I see no reason not to try HtmlUnit. Maybe you can also have a look at Selenium 2/WebDriver.
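For reference, a minimal sketch of driving HtmlUnit through Selenium's HtmlUnitDriver; the URL and selector below are placeholders. In older Selenium releases HtmlUnitDriver ships with the main distribution, while newer ones provide it as the separate htmlunit-driver artifact.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class HtmlUnitCrawlSketch {
    public static void main(String[] args) {
        // true enables JavaScript support in the headless HtmlUnit browser
        WebDriver driver = new HtmlUnitDriver(true);
        try {
            driver.get("https://example.com/"); // placeholder URL
            // Collect the links the page exposes once scripts have run
            for (WebElement link : driver.findElements(By.cssSelector("a[href]"))) {
                System.out.println(link.getAttribute("href"));
            }
        } finally {
            driver.quit();
        }
    }
}
```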

Related

How to scrape a website with dynamic content loading?

How can I scrape a website with dynamically loaded content, like a forbes.com article, using Apache HttpClient but without WebDriver (it's slow)?
I've tried getting the sitemap.xml, but their sitemap only includes the latest articles and I want info from very old articles.
Also, I want a more generic solution; my current WebDriver setup (Selenium with PhantomJS) is site-specific and slow.
I'd suggest you try the tool ui4j. It's a wrapper around the JavaFX WebKit engine with a headless mode. It can help you speed things up.
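A rough sketch of that route; the calls below follow ui4j's published examples (BrowserFactory.getWebKit(), navigate(), executeScript()), but treat the exact method names, the headless system property, and the placeholder URL as assumptions to verify against the version you use.

```java
import com.ui4j.api.browser.BrowserEngine;
import com.ui4j.api.browser.BrowserFactory;
import com.ui4j.api.browser.Page;

public class Ui4jScrapeSketch {
    public static void main(String[] args) {
        // ui4j reportedly runs without a window when started with -Dui4j.headless=true
        BrowserEngine browser = BrowserFactory.getWebKit();

        // Navigate and let the WebKit engine execute the page's JavaScript (placeholder URL)
        Page page = browser.navigate("https://example.com/article");

        // Read back the rendered DOM via a JavaScript expression
        Object renderedHtml = page.executeScript("document.documentElement.outerHTML");
        System.out.println(renderedHtml);
    }
}
```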

Parse a page (partly generated by JavaScript) by using Selenium

I've got a problem: I want to parse a page (e.g. this one) to collect information about the offered apps and save that information in a database.
Moreover, I am using crawler4j to visit every (available) page. The problem, as far as I can see, is that crawler4j needs links to follow in the source code.
But in this case the hrefs are generated by some JavaScript code, so crawler4j does not get new links to visit / pages to crawl.
So my idea was to use Selenium so that I can inspect several elements as in a real browser like Chrome or Firefox (I'm quite new to this).
But, to be honest, I don't know how to get the "generated" HTML instead of the plain source code.
Can anybody help me?
To inspect elements, you do not need the Selenium IDE; just use Firefox with the Firebug extension. Also, with the developer tools add-on you can view a page's source and also the generated source (this is mainly useful for PHP).
Crawler4j cannot handle JavaScript like this. It is better left to another, more advanced crawling library. See this response here:
Web Crawling (Ajax/JavaScript enabled pages) using java
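For the "generated HTML" part of the question: with Selenium you can let a real browser execute the JavaScript and then read back the rendered DOM. A minimal sketch, assuming a local Firefox/geckodriver setup; the URL and the wait selector are placeholders.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class GeneratedHtmlSketch {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("https://example.com/apps"); // placeholder URL
            // Wait until the JS-generated links appear (placeholder selector);
            // newer Selenium versions take a Duration instead of a plain number of seconds
            new WebDriverWait(driver, 10)
                .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("a.app-link")));
            // Returns the page as currently held by the browser, which for real browsers
            // typically reflects the DOM after JavaScript has run (exact behaviour varies by driver)
            String generatedHtml = driver.getPageSource();
            System.out.println(generatedHtml);
        } finally {
            driver.quit();
        }
    }
}
```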

Command line based HTTP POST to retrieve data from javascript-rich webpage

I'm not sure if this is possible, but I would like to retrieve some data from a web page that uses JavaScript to render its data. This would be from a Linux shell.
What I am able to do now:
HTTP POST using curl/lynx/wget to log in and get headers from the command line
use those headers to get into 'secure' areas of the site from the command line
However, the only elements rendered on the page are the static HTML. Most of the info I need is rendered dynamically with JS (albeit eventually as HTML as well) and doesn't show up in a command-line browser. I understand the issue is the lack of a JS interpreter.
As such, some workarounds I thought might be possible are:
calling a full browser from the command line and somehow passing the info back to stdout; this would mean I have to be able to POST.
passing the headers (with session info, etc.) I got from curl to one of these full browsers and again dumping the output HTML back to stdout; it could even be a print-screen of the window if all else fails.
A pure Java solution would be OK too.
Does anyone have any experience doing something similar and succeeding?
Thanks!
You can use WebDriver to do this; you just need to have a web browser installed. There are other solutions as well, such as Selenium and HtmlUnit (no browser required, but it might behave differently).
You can find an example Selenium project here.
WebDriver
WebDriver is a tool for writing automated tests of websites. It aims
to mimic the behaviour of a real user, and as such interacts with the
HTML of the application.
Selenium
Selenium automates browsers. That's it. What you do with that power is
entirely up to you. Primarily it is for automating web applications
for testing purposes, but is certainly not limited to just that.
Boring web-based administration tasks can (and should!) also be
automated as well.
HtmlUnit
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML
documents and provides an API that allows you to invoke pages, fill
out forms, click links, etc... just like you do in your "normal"
browser.
I would recommend using WebDriver because it does not require a standalone server the way Selenium does, while HtmlUnit might be suitable if you don't want to install a browser or worry about Xvfb in a headless environment.
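Since a pure Java solution is acceptable, here is a minimal HtmlUnit sketch of the login-then-dump flow. The URL, form name, and field names are placeholders for whatever the target site actually uses.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlPasswordInput;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class HeadlessLoginSketch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        // Recent HtmlUnit versions configure behaviour through getOptions()
        webClient.getOptions().setThrowExceptionOnScriptError(false);

        HtmlPage loginPage = webClient.getPage("https://example.com/login"); // placeholder URL
        HtmlForm form = loginPage.getFormByName("login");                    // placeholder form name
        HtmlTextInput user = form.getInputByName("username");               // placeholder field names
        HtmlPasswordInput pass = form.getInputByName("password");
        user.setValueAttribute("me");
        pass.setValueAttribute("secret");

        HtmlSubmitInput submit = form.getInputByName("submit");
        HtmlPage securePage = submit.click();

        // Give background JavaScript a chance to finish rendering
        webClient.waitForBackgroundJavaScript(5_000);

        // Dump the rendered page to stdout, which covers the "back to stdout" requirement
        System.out.println(securePage.asXml());

        webClient.close(); // older HtmlUnit versions use closeAllWindows() instead
    }
}
```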
You might want to see what Selenium can do for you. It has numerous language drivers (Java included) that can be used to interact with the browser to process content, typically for testing and verification purposes. I'm not sure it covers exactly what you are looking for, but I wanted to make you aware of its existence and potential.
This is impossible unless you set up a websocket, and even then I guess it really depends.
Could you detail your objective? Out of personal curiosity :-)

Replacement for Jaxer for parsing/crawling websites

I have an old tool that an (ex-)colleague wrote a few years back with Jaxer, and I'd like to replace/rewrite it.
Jaxer is an (abandoned) server-side framework based on a headless Mozilla/Gecko-Browser allowing you to use JavaScript and the DOM server-side.
Since Jaxer is abandoned and because I have big problems installing and running Aptana Studio 1.5 with Jaxer on a new computer, I'm looking for a library/framework/something on which I can base a new version.
This tool is only run locally inside Aptana Studio (the IDE for Jaxer) and was never intended to be an actual web app. It crawls our customers' websites by loading them page by page into the server-side Mozilla. To do that, it uses jQuery and predefined CSS selectors to find the links in the menus and to parse other information out of the pages. The final result is basically a glorified sitemap.
I'd like to keep this modus operandi if possible and continue using jQuery/JavaScript/the DOM to load and parse/access the pages, but it could be wrapped in a framework based on another language such as Java. I considered writing something based on Gecko myself, but that seems a bit over the top, so I'm open to other suggestions.
As far as HTML crawling/parsing goes:
http://ccil.org/~cowan/XML/tagsoup/
or
http://jsoup.org/
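For the link-extraction part, a minimal jsoup sketch with a placeholder URL and selector. Note that jsoup parses static HTML with CSS selectors but does not execute JavaScript, so it fits sites whose menus are present in the markup itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class MenuLinkSketch {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page (placeholder URL)
        Document doc = Jsoup.connect("https://example.com/").get();

        // Predefined CSS selector for the menu links, as in the old Jaxer tool (placeholder selector)
        Elements links = doc.select("nav a[href]");
        for (Element link : links) {
            // abs:href resolves relative URLs against the page's base URL
            System.out.println(link.text() + " -> " + link.attr("abs:href"));
        }
    }
}
```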

autogenerate HTTP screen scraping Java code

I need to screen scrape some data from a website because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself, using Apache's HttpClient library to make the relevant HTTP calls to download the data. I figured out which calls I needed to make by clicking through the screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
As you can imagine, this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code corresponding to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure if it supports this exact use case.
Thanks,
Don
I would also add +1 for HtmlUnit, since its functionality is very powerful: if you need behaviour 'as though a real browser were scraping and using the page', that's definitely the best option available. HtmlUnit executes (if you want it to) the JavaScript in the page.
It currently has full-featured support for all the main JavaScript libraries and will execute JS code that uses them. Correspondingly, you can get handles to the JavaScript objects in the page programmatically within your test.
If, however, the scope of what you are trying to do is smaller, more along the lines of reading some of the HTML elements where you don't much care about JavaScript, then NekoHTML should suffice. It's similar to JDOM, giving programmatic, rather than XPath, access to the tree. You would probably need to use Apache's HttpClient to retrieve the pages.
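A small sketch of the "handles to JavaScript objects" idea with HtmlUnit; the URL and the expression being evaluated are placeholders.

```java
import com.gargoylesoftware.htmlunit.ScriptResult;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JsHandleSketch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        HtmlPage page = webClient.getPage("https://example.com/"); // placeholder URL

        // Evaluate a JavaScript expression in the context of the loaded page
        // and get a handle to the result (placeholder expression)
        ScriptResult result = page.executeJavaScript("document.title");
        Object title = result.getJavaScriptResult();
        System.out.println("Page title via JS: " + title);

        webClient.close(); // older HtmlUnit versions use closeAllWindows() instead
    }
}
```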
The manageability.org blog has an entry that lists a whole bunch of web page scraping tools for Java. However, I don't seem to be able to reach it right now, but I did find a text-only version in Google's cache here.
You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.
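For the session handling mentioned above, a small sketch showing that the same WebClient carries its cookies across page loads (placeholder URLs).

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.util.Cookie;

public class SessionSketch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();

        // The first request establishes the session; the cookie jar is filled automatically
        webClient.getPage("https://example.com/login");   // placeholder URL
        // Subsequent requests on the same WebClient reuse those cookies
        webClient.getPage("https://example.com/account"); // placeholder URL

        // Inspect what the server set
        for (Cookie cookie : webClient.getCookieManager().getCookies()) {
            System.out.println(cookie.getName() + "=" + cookie.getValue());
        }

        webClient.close(); // older HtmlUnit versions use closeAllWindows() instead
    }
}
```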
Personally, HtmlUnit and Selenium are my two favorite tools for screen scraping.
A tool called The Grinder allows you to script a session to a site by going through its proxy. The output is Python (runnable in Jython).
