I want to know how to scrape web pages that use AJAX to fetch the content they render. Typically an HTTP GET for such a page just fetches the HTML with the JavaScript code embedded in it. Is it possible to request such pages programmatically (preferably in Java) and simulate a web browser's request, so that I get the HTML content that results after the AJAX calls?
In The Productive Programmer, author Neal Ford suggests that the functional testing tool Selenium can be used for non-testing tasks. Your task of inspecting HTML after client-side DOM manipulation has taken place falls into this category. Selenium even lets you automate interactions with the browser, so if you need some buttons clicked to fire AJAX events, you can script it. Selenium works by using a browser plugin and a Java-based server. Selenium test code (or non-test code in your case) can be written in a variety of languages including Java, C# and other .NET languages, PHP, Perl, Python and Ruby.
You may want to look at HtmlUnit.
Why choose when you can have both? TestPlan supports both Selenium and HtmlUnit as a backend. Plus it has a really simple language for doing the most common tasks (extensions can be written in Java if need be -- which is rare actually).
The program I am writing is in Java.
I am writing a little program that will download the html of webpages and save them. It works easily for basic pages that don't use JavaScript. But how can I download the page if I want it after a script has updated it? The page I am dealing with is actually updated by Ajax which might be one step harder.
I understand that this is probably a difficult problem that involves setting up a JavaScript run time environment of some kind. I am prepared for a solution of any level of difficulty, I just don't know exactly how to approach it or where to get started.
You can't do that with plain Java alone. Because the page you want to download is rendered with JavaScript, you must be able to execute that JavaScript to get the fully rendered page.
For that you need a headless browser: a web browser that can load web pages but does not display them in a GUI, and instead provides the fully rendered content to programs or scripts.
You can start with the best-known ones: Selenium, HtmlUnit and PhantomJS.
I'm not sure if this is possible but I would like to retrieve some data from a web page that uses Javascript to render data. This would be from a linux shell.
What I am able to do now:
http post using curl/lynx/wget to login and get headers from command line
use headers to get into 'secure' locations in the webpage on command line
However, the only elements that are rendered on the page are the static HTML. Most of the info I need is rendered dynamically with JS (albeit eventually as HTML as well) and doesn't show up in a command-line browser. I understand the issue is the lack of a JS interpreter.
As such... some workarounds I thought might be possible are:
Calling a full browser from the command line and somehow passing the info back to stdout. This would mean that I have to be able to POST.
Passing the headers (with session info, etc.) I got from curl to one of these full browsers and again dumping the output HTML back to stdout. It could very well be a print-screen function on the window if all else fails.
A pure Java solution would be OK too.
Does anyone have experience doing something similar and succeeding?
Thanks!
You can use WebDriver for this, but you need to have a web browser installed. There are other solutions as well, such as Selenium and HtmlUnit (which needs no browser but might behave differently from a real one).
You can find an example Selenium project here.
WebDriver
WebDriver is a tool for writing automated tests of websites. It aims
to mimic the behaviour of a real user, and as such interacts with the
HTML of the application.
Selenium
Selenium automates browsers. That's it. What you do with that power is
entirely up to you. Primarily it is for automating web applications
for testing purposes, but is certainly not limited to just that.
Boring web-based administration tasks can (and should!) also be
automated as well.
HtmlUnit
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML
documents and provides an API that allows you to invoke pages, fill
out forms, click links, etc... just like you do in your "normal"
browser.
I would recommend WebDriver because it does not require a standalone server the way Selenium does, while HtmlUnit might be suitable if you don't want to install a browser or worry about Xvfb in a headless environment.
You might want to see what Selenium can do for you. It has numerous language drivers (Java included) that can be used to interact with the browser to process content typically for testing and verification purposes. I'm not exactly sure how you can get exactly what you are looking for out of it but wanted to make you aware of its existence and potential.
This is impossible unless you set up a WebSocket, and even then I guess it really depends.
Could you detail your objective? For my personal curiosity :-)
I'm trying to find out if there is a standard or recommended way to communicate from JavaScript to the application that embeds a browser widget, and vice versa. The hosting application may be written in either Java or C++ and may run on Windows and Unix platforms, but the JavaScript would be shared across both clients.
So far I've read about:
window.external (This seems to be IE specific, so it wouldn't work on Unix.)
LiveConnect (This seems to be Java and Mozilla specific, so it wouldn't work for IE or C++ based applications.)
SWT's Browser widget has some of this capability, but this would be a Java-only solution.
What other options are out there?
Thanks!
Shyam
We have a VB6 application that hosts Microsoft's WebBrowser object (IE). We've used a simple URL intercept mechanism to facilitate communication between the browser and the hosting application. Since the browser control has a before navigate interface, we can pull out the URL and examine it for commands and either cancel the navigation event (since it was meant for the hosting app) or let it pass through (since it is a normal URL).
We used something like app://commandName?arg1=val&arg2=val in our JavaScript or HTML link tags.
Then, in the BeforeNavigate event from the browser, we check the URL for app://. If we find it, we know the browser is sending the parent application a message.
Simple but effective (for our needs anyway).
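For a Java host, the check described above could look roughly like the sketch below. It assumes the host receives the pending URL as a string before navigation happens. The class name AppUrlCommand and its parse method are hypothetical and made up for illustration; only the app://commandName?arg1=val&arg2=val shape comes from the description. The original application is VB6, so this is not its code.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: recognizes app:// URLs and splits them into a
// command name plus decoded arguments. Not taken from the original VB6 app.
final class AppUrlCommand {
    static final String PREFIX = "app://";

    final String name;
    final Map<String, String> args;

    private AppUrlCommand(String name, Map<String, String> args) {
        this.name = name;
        this.args = args;
    }

    // Returns the parsed command, or empty when the URL is an ordinary one
    // that the browser should be allowed to navigate to.
    static Optional<AppUrlCommand> parse(String url) {
        if (url == null || !url.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
            return Optional.empty();
        }
        String rest = url.substring(PREFIX.length());
        int q = rest.indexOf('?');
        String name = q < 0 ? rest : rest.substring(0, q);
        // Some browser controls normalize "app://cmd" to "app://cmd/".
        if (name.endsWith("/")) {
            name = name.substring(0, name.length() - 1);
        }
        Map<String, String> args = new LinkedHashMap<>();
        if (q >= 0) {
            for (String pair : rest.substring(q + 1).split("&")) {
                if (pair.isEmpty()) continue;
                int eq = pair.indexOf('=');
                String key = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                args.put(URLDecoder.decode(key, StandardCharsets.UTF_8),
                         URLDecoder.decode(value, StandardCharsets.UTF_8));
            }
        }
        return Optional.of(new AppUrlCommand(name, args));
    }

    public static void main(String[] argv) {
        Optional<AppUrlCommand> cmd = parse("app://commandName?arg1=val&arg2=val");
        // A present value means: cancel the navigation and handle the command.
        cmd.ifPresent(c -> System.out.println(c.name + " " + c.args));
    }
}
```

An empty result means the URL is a normal one and navigation should proceed. How the navigation is actually cancelled depends on the embedded browser widget and is not shown here.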
EDIT
I should also mention that most embedded browsers have mechanisms to manipulate the DOM. With that in mind, you should be able to extract information (HTML nodes) and inject information at will.
JavaScript has the XMLHttpRequest API, which makes it possible to send data to, and retrieve data from, a server. Using this API with messages formatted in XML or JSON is known as AJAX.
AJAX can be used to implement the example you gave, of a tree node in the HTML/JavaScript that retrieves the list of children from the server when it is expanded.
Note that when using AJAX, the server may be written in any language (C, Java, Python, Ruby, etc).
I suggest you read about AJAX. Once you have a good understanding of it, you can read a little about web services. A web service is a way for two applications, written in arbitrary programming languages, to communicate over the web.
I would like to open a webpage and run some JavaScript code from within a Java app.
For example, I would like to open the page www.mytestpage.com and run the following JavaScript code:
document.getElementById("txtEmail").value="test#hotmail.com";
submit();
void(0);
This works in a browser... how can I do it programmatically from within a Java app?
Thanks!
You can use Rhino to execute JavaScript, but you won't have a DOM available, i.e. document.getElementById() would not work.
You can use HtmlUnit (headless) or WebDriver/Selenium (driving a browser) to execute JavaScript in an environment that has a DOM available.
I'm not sure what you are looking for, but I assume that you want to send an automated POST request. This can be done with the HttpClient library; you only have to set the appropriate request (POST or GET) parameters.
Look at examples - with this library you can do basic authentication or post files too.
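As a rough illustration of the automated POST idea, here is a sketch that uses the JDK's built-in java.net.http.HttpClient (Java 11+) as a stand-in, which is not necessarily the library this answer refers to. The URL https://example.com/subscribe and the form field name txtEmail are assumptions; the question only gives an element id, and the real form action and field names would have to be read from the page.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch only: the class name, URL and field name are hypothetical.
final class FormPostSketch {

    // Encodes fields as application/x-www-form-urlencoded (key=value&key=value).
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Builds the POST request without sending it.
    static HttpRequest buildPost(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("txtEmail", "user@example.com"); // assumed field name
        HttpRequest request = buildPost("https://example.com/subscribe", fields);

        // Sends the request and prints the status code the server returns.
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```

This only submits the form data directly. It does not run any JavaScript on the page, so it works only when the server accepts a plain form POST.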
Your question is a bit ambiguous, as we don't know the position of the Java program.
If that's a Java applet inside your page, you should look at Java<->JavaScript interaction, it works well.
If you need a separate Java program to control a browser, like sending a bookmarklet in the address bar (as one of your tags suggests), it is a bit harder (depends on target browser), perhaps look at the Robot class.
There's the Rhino JS engine, written in Java, which you can run on an app server such as Tomcat and feed JS to. However, it's not clear what you are trying to do with this.
There's also Envjs, a simulated browser environment that is based on Rhino but complete enough to run jQuery and/or Prototype.
DWR (and other frameworks) now support "reverse ajax." The general idea is that you use one of three methods to communicate back to the client:
Comet (long-lived HTTP session)
Polling
opportunistic / piggy-back (i.e. next time a request comes from the client, append your js call)
Regardless of method (which is typically a configuration-time decision and not a coding issue), you will have full access to any/all js calls you want to make.
Check out the reference page from DWR to get a pretty good explanation.
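To make the piggy-back option a bit more concrete, here is a minimal sketch of the idea, not DWR's actual implementation or API. The class PiggybackQueue and its methods are hypothetical: the server queues JavaScript calls as they come up, and drains the queue when the client's next ordinary request arrives so the calls can be appended to that response.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration of piggy-backing; DWR handles this internally.
final class PiggybackQueue {
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();

    // Called by server-side code whenever it wants the client to run a script.
    void enqueue(String script) {
        pending.add(script);
    }

    // Called while building the response to the client's next request:
    // returns all queued scripts in order and leaves the queue empty.
    List<String> drain() {
        List<String> out = new ArrayList<>();
        String script;
        while ((script = pending.poll()) != null) {
            out.add(script);
        }
        return out;
    }

    public static void main(String[] args) {
        PiggybackQueue queue = new PiggybackQueue();
        queue.enqueue("updateBadge(3);");   // made-up client-side function
        queue.enqueue("refreshList();");    // made-up client-side function
        // On the next client request, append these to the response body.
        System.out.println(String.join("\n", queue.drain()));
    }
}
```

In practice there would be one such queue per client session. Polling works the same way, except that the client sends requests on a timer purely to collect whatever is queued.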
I need to screen scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself using Apache's HTTP client library to make the relevant HTTP calls to download the data. I figured out the relevant calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
As you can imagine, this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code that corresponds to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure if it supports this exact use case.
Thanks,
Don
I would also add +1 for HtmlUnit since its functionality is very powerful: if you need behaviour 'as though a real browser was scraping and using the page', that's definitely the best option available. HtmlUnit executes (if you want it to) the JavaScript in the page.
It currently has full-featured support for all the main JavaScript libraries and will execute JS code using them. Correspondingly, you can get handles to the JavaScript objects in the page programmatically within your test.
If, however, the scope of what you are trying to do is smaller, more along the lines of reading some of the HTML elements where you don't much care about JavaScript, then NekoHTML should suffice. It's similar to JDOM, giving programmatic - rather than XPath - access to the tree. You would probably need to use Apache's HttpClient to retrieve pages.
The manageability.org blog has an entry which lists a whole bunch of web page scraping tools for Java. However, I do not seem to be able to reach it right now, but I did find a text only representation in Google's cache here.
You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.
Personally, HtmlUnit and Selenium are my two favorite tools for screen scraping.
A tool called The Grinder allows you to script a session to a site by going through its proxy. The output is Python (runnable in Jython).