How to use wkhtmltopdf in Java web application?

How to use wkhtmltopdf in Java web application? - java

I am newbie in wkhtmltopdf. I am wondering how to use wkhtmltopdf with my Dynamic Web Project in Eclipse? How to integrate wkhtmltopdf with my Java dynamic web application?
Is there any tutorials available for beginners of wkhtmltopdf ?
(Basically, I would like to use wkhtmltopdf in my web application so that when user click a save button , the current page will be saved to PDF file).

First, a technical note: Because you want to use wkhtmltopdf in a web project, if and when you deploy to a Linux server machine that you access via ssh (i.e. over the network), you will need to either use the patched Qt version, or run an X server, e.g. the dummy X server xvfb. (I don't know what happens if you deploy to a server running an operating system other than Linux.)
Second, it should be really quite simple to use wkhtmltopdf from any language in a web project.
If you just want to save the server-generated version of the current page, i.e. without any changes which might have been made like the user filling on forms, or Javascript adding new DOM elements, you just need to have an extra optional argument like ?generate=pdf on the end of your URL, which will cause that page to be generated as a PDF, and then the PDF button will link to that URL. This may be a lot of work to add to each page manually if you are just using simple JSP or something, but depending on which web framework you are using, the web framework may offer some help to implement the same action on every page, if you need to implement that.
To implement this approach, you would probably want to capture the response by wrapping the response object and overridding its getWriter() and getOutputStream() methods.
Another approach is to have a button "submit and generate PDF" which will generate the next page as a PDF. This might make more sense if you have a form the user needs to fill in - I don't know. It's a design decision really.
A third approach is to use Javascript to upload the current state of the page back to the server, and process that using wkhtmltopdf. This will work on any page. (This can even be used on any site, not just yours, if you make it a bookmarklet. Just an idea that occurred to me - it may not be a good idea.)
A fourth approach is, because wkhtmltopdf can fetch URLs, to pass the URL of your page instead of the contents of the page (which will only work if the request was a HTTP GET, or if it's equivalent to a HTTP GET on the same URL). This has some small amount of overhead over capturing your own response output, but it will probably be negligible. You will also very likely need to copy the cookie(s) into a cookie jar with this approach, since presumably your user might be logged in or have an implicit session.
So as you can see there are quite a lot of choices!
Now, the question remains: when your server has the necessary HTML, from any of the above approaches, how to feed it into wkhtmltopdf? This is pretty simple. You will need to spawn an external process using either Runtime.getRuntime().exec(), or the newer API called ProcessBuilder - see http://www.java-tips.org/java-se-tips/java.util/from-runtime.exec-to-processbuilder.html for a comparison. If you are smart about it you should be able to do this without needing to create any temporary files.
One of the wkhtmltopdf websites is currently down, but the main README is available here, which explains the command line arguments.
This is merely an outline answer which gives some pointers. If you need more details, let us know what specifically you need to know.

Additional info:
If you do end up trying to call wkhtmltopdf in an external process from java (or for that matter, any language), please note that the "normal" output that you see when using wkhtmltopdf from the command line (i.e. what you would expect to see in STDOUT) is not not in STDOUT but in STDERR. I raised this issue in the project page
http://code.google.com/p/wkhtmltopdf/issues/detail?id=825
and was replied that this is by design because wkhtmltopdf supports giving the actual pdf output in STDOUT. Please see the link for more details and java code.

java-wkhtmltopdf-wrapper provides an easy API for using wkhtmltopdf in Java.
It also works out-of-the-box on a headless server with xvfb.
E.g., on a Ubuntu or Debian server:
aptitude install wkhtmltopdf xvfb
Then in Java:
Pdf pdf = new Pdf();
pdf.addPage("http://www.google.com", PageType.url);
pdf.saveAs("output.pdf");
See the examples on their Github page for more options.

Related

Chrome/Firefox web browser automation for collect data

I would like to browse automatically in a website to collect some data.
There's a page with a form. The form consists of a select and a submit button. Selecting an option of the select and clicking on the submit button leads to another page where there's some tables with related data.
I need to collect and save in file this data for each option. Probably I will need to go back to the first page to repeat the task for each option. The detail is that I don't know the exactly number of options previously.
My idea is to do that task, preferably, with Firefox or Chrome. I think that the only way to do that is via programming.
Someone could indicate me a way to do that task in a easy and fast way. I know a little bit about Java, Javascript and Python.

You might want to google "web browser automation" tool like Selenium. Although not entirely fit for the purpose I think it can be used to implement your requirement.

Since the task is relatively well constrained, I would avoid Selenium (it's a little brittle), and instead try this approach:
Get a comprehensive list of options from the first page, record that in a text file
Capture, using a network monitoring tool like Fiddler, the traffic that is sent when you submit the first page. See what exactly is submitted to the server - and how (POST vs GET, parameter encoding, etc).
Use a tool like curl to replay the request steps in the exact format that you captured in step 2. Then write a batch script (using bash or python) to run through all the values in the text file from step 1 to do curl for all the values in the dropdown. Save curl output to files.

I found a solution to my problem. It's called HtmlUnit:
http://htmlunit.sourceforge.net/gettingStarted.html
HtmlUnit is a "GUI-Less browser for Java programs".
It allows to web browsing and data collecting using Java and it's very simple and easy to use.
Not exactly what I asked, but it's better. At least to me.

Command line based HTTP POST to retrieve data from javascript-rich webpage

I'm not sure if this is possible but I would like to retrieve some data from a web page that uses Javascript to render data. This would be from a linux shell.
What I am able to do now:
http post using curl/lynx/wget to login and get headers from command line
use headers to get into 'secure' locations in the webpage on command line
However, the only elements that are rendered on the page are the static html. Most of the info I need are rendered dynamically with js (albeit eventually as a html as well) and don't show up on a command line browser. I understand the issue is with the lack of a js interpreter.
As such... some workarounds I thought might be possible are:
calling full browsers from command line and somehow passing the info back to stdout. this would mean that I have to be able to POST.
passing the headers (with session info, etc...) i got from curl to one of these full browsers and again dumping the output html back to stdout. it could very be a printscreen function on the window if all else fails.
a pure java solution would be OK too.
Anyone has any experience doing something similar and succeeding?
Thanks!

You can use WebDriver to do, just that you need have web browser installed. There are other solution as well such as Selenium and HtmlUnit (without browser but might behave differently).
You can find example of Selenium project at here.
WebDriver
WebDriver is a tool for writing automated tests of websites. It aims
to mimic the behaviour of a real user, and as such interacts with the
HTML of the application.
Selenium
Selenium automates browsers. That's it. What you do with that power is
entirely up to you. Primarily it is for automating web applications
for testing purposes, but is certainly not limited to just that.
Boring web-based administration tasks can (and should!) also be
automated as well.
HtmlUnit
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML
documents and provides an API that allows you to invoke pages, fill
out forms, click links, etc... just like you do in your "normal"
browser.
I would recommend use WebDriver because it is not required standalone server like Selenium, while for HtmlUnit might suitable if you dont want install browser without worry about Xvfb in headless environment.

You might want to see what Selenium can do for you. It has numerous language drivers (Java included) that can be used to interact with the browser to process content typically for testing and verification purposes. I'm not exactly sure how you can get exactly what you are looking for out of it but wanted to make you aware of its existence and potential.

This is impossible unless you setup a websocket, and even like this I guess it really depends.
Could you detail your objective? For my personal curiosity :-)

When JEditorPane will not do. JRex, JRex wherefore are thou JRex?

IDEA: Implement a recent web browser into a java application (for saved offline, non server content).
The question is this: can I have a java application implement a webbrowser with jquery / html / css support within a java program?
So I am asking anyone who has played with JRex for advice: I want to know how complicated will it be to integrate an open source webbrowser into java. I am not all that keen on the idea of compiling Mozilla from source build. Is there a ready made compiled version?
Is there a simplified method to have latest compiled version (most current in terms of support for HTML css & javascript), and integrate that into an application?
Also: I appreciate the amount of work required to support for HTML4 nevermind 5, and CSS2 compliance. How close is JRex to that?
Application: My intention with the webbrowser is to render a webpage from offline content. It will not need to be online content, and will simply be for file based displays = e.g. file:///C:...
Does the webbrowser have to be wrapped into a server to function, e.g. to pass files to the browser to render is how complicated? I am not keen to have to implement Jetty or another server type application just for this.
If JRex is not the solution... what then? Is it possible to start a browser implementation within Java and can Java interact with the information and traverse the Dom?
Or alternatively is there .hta equivalent in recent browsers like firefox?

If you need to have the embedded browser interact with your application code, you could try the SWT Browser control, it's actually maintained as opposed to JRex. Browser uses either WebKit or Gecko or embedded IE as appropriate, or lets you choose which one you want, so it should run jQuery and familiar Javascript. And since SWT is a JNI library to begin with they probably already have guidance on how to deploy an app that uses JNI.
You can feed HTML into the control from a string (example) or a java Url - which can point to local files or resource files in your JAR, which I assume will let you split your app into different files.
To call Java code, you need to expose it as Javascript functions. example
To manipulate the HTML from Java code, you need to call Javascript functions from Java. example
To make the previous two tasks easier, you might want to look into a JSON library to simplify passing around complex data.

Does it have to be implemented within a Java program? Could you let the user use the default browser on their machine (ie does it matter what browser)?
If not would use the Java Desktop API.
if (desktop.isSupported(Desktop.Action.BROWSE)) {
txtBrowserURI.setEnabled(true);
btnLaunchBrowser.setEnabled(true);
}
If you are using Java 1.5 try http://javadesktop.org/articles/jdic/

How can I open a webpage within a Java app and run my own javascript code

I would like to open a webpage and run a javascript code from within a java app.
For example I would like to open the page www.mytestpage.com and run the following javascript code:
document.getElementById("txtEmail").value="test#hotmail.com";
submit();
void(0);
This works in a browser...how can I do it programatically within a java app?
Thanks!

You can use Rhino to execute JavaScript but you won't have a DOM available - i.e. document.getElementById() would work.
You can use HTML Unit (headless) or WebDriver/Selenium (Driving a browser) to execute JavaScript in an environment that has a DOM available.

I'm not sure what you are looking for but I assume that you want to write automated POST request. This can be done in with Http Client library. Only you have to set appropriate request (POST or GET) parameters.
Look at examples - with this library you can do basic authentication or post files too.

Your question is a bit ambiguous, as we don't know the position of the Java program.
If that's a Java applet inside your page, you should look at Java<->JavaScript interaction, it works well.
If you need a separate Java program to control a browser, like sending a bookmarklet in the address bar (as one of your tags suggests), it is a bit harder (depends on target browser), perhaps look at the Robot class.

There's Rhino JS engine written in Java that you can run on app server such as Tomcat and feed JS to, however - it's not clear what are you trying to do with this?
There's also Envjs simulated browser environment which is based on Rhino but complete enough to run jQuery and/or Prototype

DWR (and other frameworks) now support "reverse ajax." The general idea is that you use one of three methods to communicate back to the client:
Comet (long-lived https session)
Polling
opportunistic / piggy-back (i.e. next time a request comes from the client, append your js call)
Regardless of method (which is typically a configuration-time decision and not a coding issue), you will have full access to any/all js calls you want to make.
Check out the reference page from DWR to get a pretty good explanation.

autogenerate HTTP screen scraping Java code

I need to screen scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself using Apache's HTTP client library to make the relevant HTTP calls to download the data. I figured out the relevant calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
As you can imagine this is a fairly tedious process, and I'm wodering if there's a tool that can actually generate the Java code that corresponds to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure if it supports this exact use case.
Thanks,
Don

I would also add +1 for HtmlUnit since its functionality is very powerful: if you are needing behaviour 'as though a real browser was scraping and using the page' that's definitely the best option available. HtmlUnit executes (if you want it to) the Javascript in the page.
It currently has full featured support for all the main Javascript libraries and will execute JS code using them. Corresponding with that you can get handles to the Javascript objects in page programmatically within your test.
If however the scope of what you are trying to do is less, more along the lines of reading some of the HTML elements and where you dont much care about Javascript, then using NekoHTML should suffice. Its similar to JDom giving programmatic - rather than XPath - access to the tree. You would probably need to use Apache's HttpClient to retrieve pages.

The manageability.org blog has an entry which lists a whole bunch of web page scraping tools for Java. However, I do not seem to be able to reach it right now, but I did find a text only representation in Google's cache here.

You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.

I would say I personally like to use HtmlUnit and Selenium as my 2 favorite tools for Screen Scraping.

A tool called The Grinder allows you to script a session to a site by going through its proxy. The output is Python (runnable in Jython).

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.