Is there any API available that captures a whole web page, just like a browser's "Save As" option does?
I think emulating a browser might be an easy solution in your case (though a bit heavyweight). Look into HtmlUnit.
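For example, here is a minimal, untested sketch of saving a page and its resources with HtmlUnit. It assumes a recent HtmlUnit version where WebClient is AutoCloseable and where HtmlPage.save() downloads the linked resources alongside the file:

import java.io.File;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class SavePage {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Fetch the page, executing JavaScript like a real browser would.
            HtmlPage page = webClient.getPage("http://example.com/");
            // Save the page; resources (images, CSS) go into a sibling folder.
            page.save(new File("saved-page.html"));
        }
    }
}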
You can have a look at Lobo (http://lobobrowser.org/java-browser.jsp), which is a web browser written in Java. It might provide a way for you to download a whole page, with its resources, into a directory.
Another way could be to use something like Selenium and start a Firefox instance from your Java app, after recording a macro for going to File -> Save As, etc.
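As a rough, untested sketch of the WebDriver variant, assuming the Selenium Firefox bindings are set up (note that getPageSource() captures only the HTML, not images or CSS):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class SaveWithSelenium {
    public static void main(String[] args) throws Exception {
        // Drive a real Firefox instance and dump the rendered page source.
        WebDriver driver = new FirefoxDriver();
        driver.get("http://example.com/");
        Files.write(Paths.get("page.html"),
                driver.getPageSource().getBytes(StandardCharsets.UTF_8));
        driver.quit();
    }
}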
Is it possible to check from a Java program whether a website is already open in the default browser? I need my program to open a specific website before doing some other stuff, so is it possible to check whether this website is already open?
EDIT:
OK, I'll try to explain the situation a little further. I want to download some files from a webpage (http://aula.au.dk/main/document/document.php?cidReq=IMFFOUANAE12). When you click a file, you're redirected to some file destination where you can download it. My program lists all these files, and when you click a filename, the browser opens the URL that redirects you to the download of that specific file. My problem is that if the URL I linked above isn't open, I get an SQL error from the website. Apparently this error only shows when the above URL isn't open in a tab. So if I download a file, close the browser, and try to download a new one, I get the problem. But as noted below, it seems cookies can help me out.
I'm not that into all this HTTP and website stuff.
Regards
Jesper
No, it is not possible to do what you have asked.
I am not sure whether checking is possible, but you can open the default browser using the Desktop class, or you can do:
// Launch Firefox directly in a new window (Windows install path shown).
Runtime rt = Runtime.getRuntime();
rt.exec(new String[]{"C:\\Program Files\\Mozilla Firefox\\firefox.exe", "-new-window", "example.com"});
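More portably, the Desktop class mentioned above can do the same without a hard-coded browser path. A minimal sketch (call it from a method that declares throws Exception, since browse() and the URI constructor throw checked exceptions):

import java.awt.Desktop;
import java.net.URI;

// Open a URL in the user's default browser, if the platform supports it.
if (Desktop.isDesktopSupported()
        && Desktop.getDesktop().isSupported(Desktop.Action.BROWSE)) {
    Desktop.getDesktop().browse(new URI("http://example.com"));
}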
Why should you worry even if your page is already opened?
Yes, but even if you try to open the same page, the browser will keep the session of the old page; it will be tracked using the jsessionid. You can handle this with cookies; please check these links:
http://www.mkyong.com/servlet/a-simple-cookie-example-in-servlet/
http://www.tutorialspoint.com/servlets/servlets-session-tracking.htm
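For the specific problem above, here is a rough client-side sketch: request the course page first so the server sets its session cookie, then reuse the cookies for the downloads. The file URL below is just an illustrative placeholder:

import java.io.InputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.URL;

public class SessionDownload {
    public static void main(String[] args) throws Exception {
        // Store cookies for all subsequent URL connections in this JVM.
        CookieHandler.setDefault(new CookieManager());

        // First request: the server sets its session cookie (e.g. JSESSIONID).
        new URL("http://aula.au.dk/main/document/document.php?cidReq=IMFFOUANAE12")
                .openStream().close();

        // Later requests reuse the same session automatically.
        try (InputStream in = new URL("http://aula.au.dk/some-file-url").openStream()) {
            // ... read the file bytes here ...
        }
    }
}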
It's not possible, purely because Java has no way of knowing which browsers are open or in use. Implementing such a feature would literally require hooking natively into the API of every browser. While that can be done, it would only support the browsers you choose (Firefox, Chrome, IE, Safari, Opera, etc.), and it requires directly interfacing with native C libraries, which is outside the scope of this and certainly overkill for such a trivial feature.
IDEA: Embed a recent web browser in a Java application (for saved offline, non-server content).
The question is this: can a Java application embed a web browser with jQuery/HTML/CSS support within a Java program?
So I am asking anyone who has played with JRex for advice: I want to know how complicated it would be to integrate an open-source web browser into Java. I am not all that keen on the idea of compiling Mozilla from source. Is there a ready-made compiled version?
Is there a simplified way to get the latest compiled version (most current in terms of support for HTML, CSS and JavaScript) and integrate it into an application?
Also: I appreciate the amount of work required to support HTML4, never mind HTML5, or CSS2 compliance. How close is JRex to that?
Application: my intention is to render a webpage from offline content. It will not need online content and will simply be for file-based display, e.g. file:///C:...
Does the web browser have to be wrapped in a server to function? E.g. how complicated is it to pass files to the browser to render? I am not keen on having to implement Jetty or another server-type application just for this.
If JRex is not the solution, what then? Is it possible to start a browser implementation within Java, and can Java interact with the information and traverse the DOM?
Or, alternatively, is there an .hta equivalent in recent browsers like Firefox?
If you need to have the embedded browser interact with your application code, you could try the SWT Browser control; it's actively maintained, as opposed to JRex. Browser uses WebKit, Gecko or embedded IE as appropriate, or lets you choose which one you want, so it should run jQuery and familiar JavaScript. And since SWT is a JNI library to begin with, they probably already have guidance on how to deploy an app that uses JNI.
You can feed HTML into the control from a string (example) or a java Url - which can point to local files or resource files in your JAR, which I assume will let you split your app into different files.
To call Java code from the page, you need to expose it as JavaScript functions (example).
To manipulate the HTML from Java code, you need to call JavaScript functions from Java (example).
To make the previous two tasks easier, you might want to look into a JSON library to simplify passing around complex data.
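Here is a minimal, untested sketch of both directions with the SWT Browser control; the function name javaHello is just an illustration:

import org.eclipse.swt.SWT;
import org.eclipse.swt.browser.Browser;
import org.eclipse.swt.browser.BrowserFunction;
import org.eclipse.swt.layout.FillLayout;
import org.eclipse.swt.widgets.Display;
import org.eclipse.swt.widgets.Shell;

public class EmbeddedBrowser {
    public static void main(String[] args) {
        Display display = new Display();
        Shell shell = new Shell(display);
        shell.setLayout(new FillLayout());

        Browser browser = new Browser(shell, SWT.NONE);
        // Expose a JS function named "javaHello" that calls back into Java.
        new BrowserFunction(browser, "javaHello") {
            @Override
            public Object function(Object[] arguments) {
                return "Hello from Java, " + arguments[0];
            }
        };
        // Feed HTML from a string; the page calls straight back into Java.
        browser.setText("<html><body>"
                + "<script>document.write(javaHello('page'));</script>"
                + "</body></html>");

        shell.open();
        while (!shell.isDisposed()) {
            if (!display.readAndDispatch()) display.sleep();
        }
        display.dispose();
    }
}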
Does it have to be implemented within a Java program? Could you let the user use the default browser on their machine (i.e. does it matter which browser)?
If not, I would use the Java Desktop API.
// Assumes a UI with a URI text field and a launch button.
Desktop desktop = Desktop.getDesktop();
if (desktop.isSupported(Desktop.Action.BROWSE)) {
    txtBrowserURI.setEnabled(true);
    btnLaunchBrowser.setEnabled(true);
}
If you are using Java 1.5, try http://javadesktop.org/articles/jdic/
I am a newbie with wkhtmltopdf. I am wondering how to use wkhtmltopdf with my Dynamic Web Project in Eclipse. How do I integrate wkhtmltopdf with my Java dynamic web application?
Are there any tutorials available for beginners with wkhtmltopdf?
(Basically, I would like to use wkhtmltopdf in my web application so that when the user clicks a save button, the current page is saved to a PDF file.)
First, a technical note: Because you want to use wkhtmltopdf in a web project, if and when you deploy to a Linux server machine that you access via ssh (i.e. over the network), you will need to either use the patched Qt version, or run an X server, e.g. the dummy X server xvfb. (I don't know what happens if you deploy to a server running an operating system other than Linux.)
Second, it should be really quite simple to use wkhtmltopdf from any language in a web project.
If you just want to save the server-generated version of the current page, i.e. without any changes that might have been made, such as the user filling in forms or JavaScript adding new DOM elements, you just need an extra optional parameter like ?generate=pdf on the end of your URL, which will cause that page to be generated as a PDF; the PDF button then links to that URL. This may be a lot of work to add to each page manually if you are just using simple JSP or something, but depending on which web framework you are using, it may offer some help for implementing the same action on every page.
To implement this approach, you would probably want to capture the response by wrapping the response object and overriding its getWriter() and getOutputStream() methods.
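A bare-bones sketch of that wrapper (class and method names are illustrative; a complete version would also override getOutputStream()):

import java.io.CharArrayWriter;
import java.io.PrintWriter;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpServletResponseWrapper;

// Captures what the application writes to the response so the HTML can be
// fed to wkhtmltopdf instead of (or as well as) being sent to the client.
public class CapturingResponseWrapper extends HttpServletResponseWrapper {
    private final CharArrayWriter buffer = new CharArrayWriter();
    private final PrintWriter writer = new PrintWriter(buffer);

    public CapturingResponseWrapper(HttpServletResponse response) {
        super(response);
    }

    @Override
    public PrintWriter getWriter() {
        return writer; // divert output into the in-memory buffer
    }

    public String getCapturedHtml() {
        return buffer.toString();
    }
}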
Another approach is to have a button "submit and generate PDF" which will generate the next page as a PDF. This might make more sense if you have a form the user needs to fill in - I don't know. It's a design decision really.
A third approach is to use JavaScript to upload the current state of the page back to the server, and process that using wkhtmltopdf. This will work on any page. (It can even be used on any site, not just yours, if you make it a bookmarklet. Just an idea that occurred to me; it may not be a good one.)
A fourth approach is, because wkhtmltopdf can fetch URLs, to pass the URL of your page instead of the contents of the page (this will only work if the request was an HTTP GET, or is equivalent to an HTTP GET on the same URL). This has some small overhead compared to capturing your own response output, but it will probably be negligible. You will also very likely need to copy the cookie(s) into a cookie jar with this approach, since presumably your user might be logged in or have an implicit session.
So as you can see there are quite a lot of choices!
Now the question remains: when your server has the necessary HTML, from any of the above approaches, how do you feed it into wkhtmltopdf? This is pretty simple. You will need to spawn an external process using either Runtime.getRuntime().exec() or the newer ProcessBuilder API - see http://www.java-tips.org/java-se-tips/java.util/from-runtime.exec-to-processbuilder.html for a comparison. If you are smart about it, you should be able to do this without creating any temporary files.
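A rough sketch of the temp-file-free variant, assuming the wkhtmltopdf binary is on the PATH and accepts "-" for reading HTML from stdin and writing the PDF to stdout:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class HtmlToPdf {
    public static byte[] convert(String html) throws Exception {
        // "-" as input reads HTML from stdin; "-" as output writes PDF to stdout.
        Process process = new ProcessBuilder("wkhtmltopdf", "-", "-").start();
        try (OutputStream stdin = process.getOutputStream()) {
            stdin.write(html.getBytes(StandardCharsets.UTF_8));
        }
        // Read the PDF from stdout. Note wkhtmltopdf's progress chatter goes
        // to stderr, which production code should also drain to avoid blocking.
        ByteArrayOutputStream pdf = new ByteArrayOutputStream();
        try (InputStream stdout = process.getInputStream()) {
            byte[] buf = new byte[8192];
            for (int n; (n = stdout.read(buf)) != -1; ) {
                pdf.write(buf, 0, n);
            }
        }
        process.waitFor();
        return pdf.toByteArray();
    }
}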
One of the wkhtmltopdf websites is currently down, but the main README is available here, which explains the command line arguments.
This is merely an outline answer which gives some pointers. If you need more details, let us know what specifically you need to know.
Additional info:
If you do end up calling wkhtmltopdf in an external process from Java (or, for that matter, any language), please note that the "normal" output you see when using wkhtmltopdf from the command line (i.e. what you would expect in STDOUT) is not in STDOUT but in STDERR. I raised this issue on the project page
http://code.google.com/p/wkhtmltopdf/issues/detail?id=825
and was told that this is by design, because wkhtmltopdf supports sending the actual PDF output to STDOUT. Please see the link for more details and Java code.
java-wkhtmltopdf-wrapper provides an easy API for using wkhtmltopdf in Java.
It also works out-of-the-box on a headless server with xvfb.
E.g., on an Ubuntu or Debian server:
aptitude install wkhtmltopdf xvfb
Then in Java:
Pdf pdf = new Pdf();
pdf.addPage("http://www.google.com", PageType.url); // the page source is a URL
pdf.saveAs("output.pdf");                           // render and write the PDF
See the examples on their Github page for more options.
I want to know how to convert an HTML file to an image. How do I do this?
You can check out the source code for the popular BrowserShots service,
http://browsershots.org/
If you're running Windows and have the GD library installed, you can use imagegrabwindow. I've never used it myself, but as always, the PHP site has lots of documentation and examples.
Use wkhtmltopdf.
It has bindings for PHP, or you can run it yourself from the command line.
The problem is that you would need to implement all the functionality of a browser and an HTTP stack (and that still does not handle the case where the content is modified using JavaScript).
As John McCollum says, if you've got the website open in a browser on your PC, then you can use imagegrabwindow or snapsIE (MSIE only).
If you want to be able to get a snapshot using code only, then you might want to look at one of the off-the-shelf solutions. AFAIK there are several programs (at least two of which are called html2pdf) that will generate a PDF from static HTML, and it's relatively easy, using standard tools, to trim this to window size and convert it to an image file.
e.g. https://metacpan.org/pod/distribution/PDF-FromHTML/script/html2pdf.pl
I need to screen scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself using Apache's HTTP client library to make the relevant HTTP calls to download the data. I figured out the relevant calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
As you can imagine, this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code corresponding to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure whether it supports this exact use case.
Thanks,
Don
I would also add +1 for HtmlUnit, since its functionality is very powerful: if you need behaviour 'as though a real browser were scraping and using the page', it's definitely the best option available. HtmlUnit executes (if you want it to) the JavaScript in the page.
It currently has full-featured support for all the main JavaScript libraries and will execute JS code using them. Correspondingly, you can get handles to the JavaScript objects in the page programmatically within your test.
If, however, the scope of what you are trying to do is smaller, more along the lines of reading a few HTML elements, and you don't care much about JavaScript, then NekoHTML should suffice. It's similar to JDom, giving programmatic, rather than XPath, access to the tree. You would probably need to use Apache's HttpClient to retrieve pages.
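A small, untested HtmlUnit sketch of the first approach, listing the links on a page (the URL is a placeholder, and a recent HtmlUnit with an AutoCloseable WebClient is assumed):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class ListLinks {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            // JavaScript runs by default, as in a real browser.
            HtmlPage page = client.getPage("http://example.com/");
            for (HtmlAnchor a : page.getAnchors()) {
                System.out.println(a.getHrefAttribute() + " -> " + a.asText());
            }
        }
    }
}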
The manageability.org blog has an entry listing a whole bunch of web page scraping tools for Java. I can't seem to reach it right now, but I did find a text-only version in Google's cache here.
You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.
Personally, I like to use HtmlUnit and Selenium as my two favorite tools for screen scraping.
A tool called The Grinder allows you to script a session against a site by recording it through its proxy. The output is Python (runnable in Jython).