I want my Java program to see the 'generated source' of a webpage, the way the Web Developer Toolbar extension for Firefox (https://addons.mozilla.org/en-US/firefox/addon/web-developer/) shows it under its 'View Source' menu, as opposed to the raw HTML source that Java networking normally returns via:
HttpURLConnection.getInputStream();
Can a Java program do this, or at least delegate the task to another application on the same computer, written in something else (e.g. JavaScript) that is embedded in the browser?
Selenium should be able to do that. I used it a long time ago, so I don't remember exactly how, but it is basically a browser plugin plus some server code that communicates with the plugin. You can talk to the server through a Java driver to control the browser content and also get all the data from the DOM.
EDIT:
Depending on whether a "real" browser is necessary, you can also use HtmlUnit, which is basically a GUI-less browser written in Java.
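As a rough sketch of the Selenium approach (class names are from the Selenium WebDriver Java API; the URL is a placeholder, and you need the Selenium jars plus a matching geckodriver on the PATH), reading back the rendered DOM could look like:

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class RenderedSource {
    public static void main(String[] args) {
        // Drives a real Firefox instance.
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("https://example.com"); // placeholder URL
            // getPageSource() returns the current DOM serialized back to HTML,
            // i.e. the "generated source", not the raw bytes from the server.
            String renderedHtml = driver.getPageSource();
            System.out.println(renderedHtml);
        } finally {
            driver.quit();
        }
    }
}
```

Note that `getPageSource()` reflects whatever the page's scripts have done to the DOM by the time it is called, so you may need to wait for Ajax activity to finish first.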
If by "generated source" you mean the full DOM of a working web page, including elements that have been added, removed, or modified by JavaScript in that page, then there is no way to get it without a full browser engine: you must first render the page, and then use some sort of communication with that page or engine to give you the HTML for the generated result. You cannot do this with Java alone.
You could put JavaScript in the web page itself that fetches the innerHTML of the whole page after it has been fully generated and then sends it to your server with an Ajax call. You would have to stay within the limits of the same-origin policy, which does not allow Ajax calls to domains other than the one the host page came from.
You could also find some server-side rendering engine that could do the same on the server side that your java application could use/communicate with.
Related
The program I am writing is in Java.
I am writing a little program that downloads the HTML of web pages and saves it. It works easily for basic pages that don't use JavaScript, but how can I download the page as it looks after a script has updated it? The page I am dealing with is actually updated by Ajax, which may be one step harder.
I understand that this is probably a difficult problem that involves setting up a JavaScript run time environment of some kind. I am prepared for a solution of any level of difficulty, I just don't know exactly how to approach it or where to get started.
You can't do that with Java alone. Since the page you want to download is rendered with JavaScript, you must be able to execute that JavaScript to get the fully rendered page.
Because of this, you need a headless browser: a web browser that can access web pages but doesn't show the output in a GUI, and whose purpose is to provide the content of web pages, fully rendered, to programs or scripts.
You can start with the most famous ones: Selenium, HtmlUnit, and PhantomJS.
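As a minimal HtmlUnit sketch (assuming a recent HtmlUnit version where `WebClient` is `AutoCloseable`; the URL and the 5-second wait are placeholders), fetching a page after its scripts have run could look like:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HeadlessFetch {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);
            // Real-world pages often have script errors; don't abort on them.
            client.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = client.getPage("https://example.com"); // placeholder URL
            // Give pending background JavaScript (e.g. Ajax) up to 5 seconds.
            client.waitForBackgroundJavaScript(5000);
            // Serialize the DOM as it stands after script execution.
            System.out.println(page.asXml());
        }
    }
}
```

Everything runs inside the JVM, so this is the lightest option when you don't need pixel-perfect rendering from a real browser.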
I am a Java programmer. I would like to write a client-side Java program that adds on to Firefox to perform operations on the HTML received from a specific remote web site, BEFORE that HTML is displayed in the user's browser. The client-side Java program would have to:
Locate and read specific files on the local (end-user) machine on which it resides.
Check the URLs of web pages requested by Firefox.
If a URL requested through Firefox contains a specific domain:
Iterate through the HTML text looking for startcode and endcode.
Slice out the string between startcode and endcode.
Transform the string between startcode and endcode using information from a file on the local PC.
Replace the string between startcode and endcode with the transformed string.
Allow the Firefox browser window to display the modified HTML.
Basically, the Java program would intercept incoming HTML from a specific web site and alter the contents before the contents are displayed on the user's screen. How would I go about writing this kind of program?
Of course, I have administrative privileges on the computers that would run this program, but I have never written a browser add-on before. I would like to write it in Java, and the code would always need to be on the client computer, never on the server. I do not know where to start this project.
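The slice/transform/replace step itself is plain string handling and needs no browser. A minimal sketch (the `START`/`END` markers and the upper-casing transform are placeholders; the real transform would use data from the local file):

```java
import java.util.function.UnaryOperator;

public class HtmlRewriter {
    // Replaces the text between the first occurrence of startCode and endCode
    // with transform(innerText); returns html unchanged if either marker is missing.
    public static String rewrite(String html, String startCode, String endCode,
                                 UnaryOperator<String> transform) {
        int start = html.indexOf(startCode);
        if (start < 0) return html;
        int innerStart = start + startCode.length();
        int end = html.indexOf(endCode, innerStart);
        if (end < 0) return html;
        String inner = html.substring(innerStart, end);
        return html.substring(0, innerStart) + transform.apply(inner) + html.substring(end);
    }

    public static void main(String[] args) {
        String page = "<p>START secret END</p>";
        // Placeholder transform: upper-case the sliced-out string.
        System.out.println(rewrite(page, "START", "END", String::toUpperCase));
        // -> <p>START SECRET END</p>
    }
}
```

The harder part of the project is the interception point (getting the HTML before Firefox displays it), which is what the answers below address.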
@Athafoud is correct in general: no browser supports Java out of the box.
Instead:
You can write browser extensions for Firefox, Chrome, Safari, and Opera in JavaScript. E.g. firefox-addon has a link list to get you started with Firefox extension development.
You can also write browser extensions for Firefox in C/C++ (to some extent) using either js-ctypes or XPCOM.
You can write some limited C++ code for Chrome via its NaCl APIs.
You could potentially write Java applets for browsers that support the Java plugin, bundle them with your extension, and script them from it (to some extent), but that is a PITA.
Firefox extension APIs are the most capable: anything Firefox can do, extensions can do too (including calling into external libraries). Other browsers have far more limited extensibility and extension-facing APIs, due to architectural issues and sometimes in the name of security (although that bold security claim is... well, bold).
As for the particular requirements you gave in your question:
Firefox extensions are capable of transforming raw HTTP responses (although this is a bit cumbersome), as well as the DOM once the HTML is parsed (from JavaScript). Firefox can read/write all files in the file system (abiding by OS-level ACLs, of course).
Chrome extensions are not capable of transforming raw HTTP responses at the moment, but you can modify the DOM once it is parsed. Also, IIRC, Chrome cannot read arbitrary files by default, but you can manually enable read access.
I don't think you are able to use native Java to write a Firefox add-on. You can use JavaScript. A good place to start is the Mozilla documentation site.
There is also a good guide here: shortest-tutorial-for-firefox-extension. It is a bit old and the SDK has changed, but I think it is a good start.
And a more up-to-date one from Mozilla itself: how-to-develop-firefox-extension.
VIA JAVA, I want to log in to a website.
Authentication: the site has a JavaScript button that performs the redirection to the home page. My web crawler can log in programmatically to sites that have HTML buttons, using Jsoup. But when I try to log in to a website that submits via JavaScript, I can't seem to get authenticated in any of the ways I have discovered so far.
So far I've tried:
I've tried to log in using the native Java API, with URLConnection and OutputWriter. It fills the user and password fields with the proper values, but when I try to execute the JavaScript method, it simply doesn't work;
Jsoup. It can log me in to any website containing HTML buttons, but since it doesn't support JavaScript, it won't help much;
I've tried HtmlUnit. Not only does it print a gazillion lines of output, it takes a long, long time to run, and in the end it still fails;
Lastly, I tried Rhino (which HtmlUnit is based on) and got it to run a long list of JavaScript methods, but it still cannot authenticate;
I have already tried Selenium as well, and got nowhere.
I'm running out of ideas. Maybe I haven't explored all the solutions contained in one of these APIs, but I still can't log in to a website containing a JavaScript button. Does anyone have any ideas?
Using Selenium WebDriver, send JavaScript commands to the browser. I've successfully used it to reliably and repeatedly run hundreds of tests against complicated JavaScript/Ajax procedures on the client.
If you target a specific web page, you can customize the script and make it quite small.
WebDriver driver; // Assigned elsewhere
JavascriptExecutor js = (JavascriptExecutor) driver;
// This is javascript, but can be done through Webdriver directly
js.executeScript("document.getElementById('theform').submit();");
Filling out the form is assumed to have been handled by using the Selenium Webdriver API. You can also send commands to click() the right button etcetera.
Using Selenium Webdriver, you could also write <script> tags to the browser, in order to load larger libraries. Remember that you may have to wait/sleep until the browser has loaded the script files - both your own and the one the original web page uses for the login procedures. It could be seconds to load and execute all of it. To avoid sleeping for too long, use the more reliable method of injecting a small script that will check if everything else has been loaded (checking web page script's status flags, browser status).
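A sketch of such a polling wait, instead of a fixed sleep (Selenium 4 style `WebDriverWait`; the `document.readyState` check is a generic readiness test, and a real page might need a check on its own status flags instead):

```java
import java.time.Duration;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.WebDriverWait;

public class WaitForScripts {
    // Polls for up to 10 seconds until the document reports that loading,
    // including script files, has completed.
    public static void waitUntilReady(WebDriver driver) {
        new WebDriverWait(driver, Duration.ofSeconds(10)).until(d ->
            (Boolean) ((JavascriptExecutor) d).executeScript(
                "return document.readyState === 'complete';"));
    }
}
```

The wait returns as soon as the condition holds, so in the common case it costs far less than a worst-case sleep.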
I suggest HtmlUnit:
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.
It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.
It is typically used for testing purposes or to retrieve information from web sites.
I had an issue that sounds similar (I had a login button that called a javascript method).
I used JMeter to observe what was being passed when I manually clicked the login button through a web browser (but I imagine you could do this with WireShark as well).
In my Java code, I created a PostMethod with all the parameters that were being sent.
PostMethod post = new PostMethod(WEB_URL); // URL of the login page
// first is the name of the field on the login page,
// then the value being submitted for that field
post.addParameter(FIELD_USERNAME, "username");
post.addParameter(FIELD_PASSWORD, "password");
I then used HttpClient (org.apache.commons.httpclient.HttpClient) to execute the Post request.
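With the legacy Commons HttpClient 3.x API that `PostMethod` comes from, executing the request might look like this sketch (the URL and field names are placeholders for the ones observed in JMeter):

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;

public class LoginPost {
    public static void main(String[] args) throws Exception {
        PostMethod post = new PostMethod("https://example.com/login"); // placeholder URL
        post.addParameter("username", "myUser");   // placeholder field names/values,
        post.addParameter("password", "myPass");   // including any hidden parameters
        HttpClient client = new HttpClient();
        try {
            int status = client.executeMethod(post);   // sends the POST
            String body = post.getResponseBodyAsString();
            System.out.println("HTTP status: " + status);
        } finally {
            post.releaseConnection();
        }
    }
}
```

The same `HttpClient` instance keeps the session cookies it receives, so subsequent requests made through it stay logged in.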
One thing to note, there were 'hidden' parameters that were being passed that I did not see by manually looking at the login page. These were revealed to me when I used JMeter.
I will be happy to clarify anything that seems unclear.
I have developed a Java applet which opens a URL connection to a different server. The applet retrieves the contents of the HTML page, does some processing, and then shows the result to the user. Can I cross-compile it to JavaScript using GWT?
Cross compile: No.
Port: Probably. Depends on your constraints.
You won't be able to do a straight recompile and have it "just work" (GWT only supports a subset of the JRE, and any UI code definitely isn't part of that), but you might be able to port some of your logic over. If you're using XPath to pull content out of the page, that code will most likely need to be redone as well. There's a GWT wrapper for Sarissa that works pretty well.
Also, since the requested page will be on a different server, you'll need to set up some method of doing a cross-site request: either browser hacks or a proxy on the hosting server.
I wanted to know how to scrape web pages that use AJAX to fetch content as they render. Typically, an HTTP GET for such pages just fetches the HTML with the JavaScript code embedded in it. But I want to know if it is possible to programmatically (preferably in Java) query such pages, simulating a web browser's requests, so that I get the HTML content that results after the AJAX calls.
In The Productive Programmer, author Neal Ford suggests that the functional testing tool Selenium can be used for non-testing tasks. Your task of inspecting HTML after client-side DOM manipulation has taken place falls into this category. Selenium even allows you to automate interactions with the browser, so if you need some buttons clicked to fire some AJAX events, you can script it. Selenium works by using a browser plugin and a Java-based server. Selenium test code (or non-test code, in your case) can be written in a variety of languages, including Java, C# and other .NET languages, PHP, Perl, Python, and Ruby.
You may want to look at HtmlUnit.
Why choose when you can have both? TestPlan supports both Selenium and HtmlUnit as backends. Plus, it has a really simple language for doing the most common tasks (extensions can be written in Java if need be, which is actually rare).