I have some files in a directory tree which is being served over HTTP.
Given some sub-directory A in that directory tree, I want to be able to download directory A and all of the sub-directories and files it contains.
It seems likely that a simple/direct/atomic solution exists in some dark corner of Java. Does anyone know how to do this?
A web crawler will not solve my problem, since files in sub-directories may link to directories that are not sub-directories of A.
==Update==
The directories and files must be hosted in a static manner.
The server is statically hosting files in a directory tree, the client is running Java and attempting to copy some branch of the directory tree using HTTP.
VFS is the answer to this. Unfortunately I answered the question myself, so I can't choose it as the answer until two days from now. If someone would write up my answer, I would be happy to mark their write-up as the answer.
==Further Update==
VFS is in fact not the answer. VFS will not list directories over HTTP, as stated here. There do seem to be a few people who are interested in that functionality.
My first suggestion would be to create a servlet/JSP which recursively reads the directory structure (using java.io.File), reads all files, puts them in one zip (java.util.zip), and sends it to the browser for download.
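A minimal sketch of that idea, assuming a servlet container; the base directory, class name, and URL mapping are placeholders, not a real deployment:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class ZipDirectoryServlet extends HttpServlet {
        private static final File BASE_DIR = new File("/var/www/static"); // placeholder

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            resp.setContentType("application/zip");
            resp.setHeader("Content-Disposition", "attachment; filename=\"download.zip\"");
            try (ZipOutputStream zos = new ZipOutputStream(resp.getOutputStream())) {
                addToZip(BASE_DIR, "", zos);
            }
        }

        // Walk the tree recursively, writing one ZipEntry per file.
        private void addToZip(File dir, String prefix, ZipOutputStream zos)
                throws IOException {
            for (File f : dir.listFiles()) {
                if (f.isDirectory()) {
                    addToZip(f, prefix + f.getName() + "/", zos);
                } else {
                    zos.putNextEntry(new ZipEntry(prefix + f.getName()));
                    try (FileInputStream in = new FileInputStream(f)) {
                        byte[] buf = new byte[8192];
                        int n;
                        while ((n = in.read(buf)) > 0) {
                            zos.write(buf, 0, n);
                        }
                    }
                    zos.closeEntry();
                }
            }
        }
    }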
I don't know of an atomic solution, but the most straightforward one would be to use a URLConnection to fetch the sub-directory (assuming the server lists the directory), parse the response to find the contents of that directory, and then use URLConnection again to fetch each of the files under it.
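A rough sketch of that client side, assuming the server returns an HTML index whose entries appear as plain href attributes; the regex-based parsing is naive and tied to that assumption:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DirectoryFetcher {
        private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

        // Returns the href values found in the directory index at the given URL.
        public static List<String> listEntries(String dirUrl) throws IOException {
            URLConnection conn = new URL(dirUrl).openConnection();
            List<String> entries = new ArrayList<String>();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    Matcher m = HREF.matcher(line);
                    while (m.find()) {
                        entries.add(m.group(1));
                    }
                }
            }
            return entries;
        }
    }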
Based on these answers, I am now wondering whether you meant the Java to be on the client side or the server side!
So you want, from the client side, to retrieve a list of all files and directories under a particular URL on the server, as if it were a folder on a local disk file system? That's usually not possible unless the server has directory indexing enabled. And even then, you would still need to parse the HTML page which represents the directory index and pick out all the <a> elements representing the files and folders yourself. There's no java.io.File-style approach for this; that would have been a huge security hole. One would, for example, be able to download all the source files from http://gmail.com. HTTP is not meant as a file transfer protocol. Use FTP; that's what it stands for.
Assuming you have control over both the server and the client, I would write a page (in the server technology of your choice: ASP, JSP, PHP, etc.) that reads the server directory structure and dynamically returns a page consisting of a bunch of links to each file to be downloaded.
Then, client side, you can trigger a download of each link.
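The per-link download could be as small as this sketch (link and localName are placeholders):

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class LinkDownloader {
        // Streams one linked file to local disk.
        public static void download(String link, String localName) throws Exception {
            try (InputStream in = new URL(link).openStream()) {
                Files.copy(in, Paths.get(localName), StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }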
What is the client-side technology? Is the thing doing the downloading an application of some sort, or a web browser? Does it have to have a client interface?
If this is some sort of in-house utility program, maybe you can just use FTP instead? Having FTP access open on the server and downloading a directory would be easy...
Adding another possible answer:
If the server does not have directory listings turned on, then you basically have to make a modification on the server side. The easiest thing would be to make a page that returns the directory structure to the client in a known format (see my first answer above).
If you control the server and have directory listings on, and you are always using the same server program (IIS, Tomcat, JBoss, etc.), then you might be able to just make the client web-crawl the directory listings. For example, in a directory listing from IIS, you can tell which links are directories and which are files, because IIS always puts a '/' at the end of a directory link and shows '<dir>' instead of a file size:
Friday, October 16, 2009 03:55 PM <dir> Unity
Thursday, July 02, 2009 10:42 AM 95 Global.asax
You can tell here that the 1st link is a directory, and the 2nd is an actual file.
So if you are using a consistent server app, just take a look at how the directory listing is returned. Maybe you'll get lucky.
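For illustration, two heuristics tied to the IIS format shown above; these are assumptions about that specific listing style, not a general parser:

    public class IisListingHeuristics {
        // IIS appends '/' to directory links.
        public static boolean isDirectoryLink(String href) {
            return href.endsWith("/");
        }

        // IIS shows "<dir>" (in the raw HTML often escaped as "&lt;dir&gt;")
        // where a file size would otherwise appear.
        public static boolean isDirectoryLine(String listingLine) {
            return listingLine.contains("<dir>") || listingLine.contains("&lt;dir&gt;");
        }
    }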
If I am not terribly mistaken, HTTP does not tell you anything about the "structure" of the server side - if such a thing even exists.
Think about REST where the URI does not really tell you where to find a file on the server, but could merely trigger some action, retrieve data or the like.
So I do not think what you are trying to achieve can be done reliably, be it with Java or any other language. Or maybe I am getting you wrong here?
Talk about low-hanging fruit ;-) Thanks for the offer, e5!
Commons VFS provides a single API for accessing various different file systems. It presents a uniform view of the files from various different sources, such as the files on local disk, on an HTTP server, or inside a Zip archive.
http://commons.apache.org/vfs/
For the first time in a while, Google beat Stack Overflow: Apache Commons VFS does exactly what I need.
Commons VFS provides a single API for accessing various different file systems. It presents a uniform view of the files from various different sources, such as the files on local disk, on an HTTP server, or inside a Zip archive.
http://commons.apache.org/vfs/
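For reference, a minimal sketch of what the VFS API looks like for this (using the org.apache.commons.vfs2 packages of the current library; the URL is a placeholder). As the update below explains, the listing step is exactly where HTTP falls short:

    import org.apache.commons.vfs2.FileObject;
    import org.apache.commons.vfs2.FileSystemManager;
    import org.apache.commons.vfs2.VFS;

    public class VfsListing {
        public static void main(String[] args) throws Exception {
            FileSystemManager fsManager = VFS.getManager();
            FileObject dir = fsManager.resolveFile("http://example.com/some/dir/");
            // This is the step that fails for http:// URLs: VFS cannot
            // enumerate the children of an HTTP "directory".
            for (FileObject child : dir.getChildren()) {
                System.out.println(child.getName().getBaseName());
            }
        }
    }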
==Update==
As stated in the question, VFS only appears to solve this problem: it does not allow the listing of HTTP directories.
Related
For a company-internal webapp I need a way to access the full path of files on our internal file server that the user drags and drops onto a drop zone on the website. As the file server is mounted in a similar manner on the web server, the path will point to the same file on that side and I can process the files further.
As that is not possible with plain HTML/JavaScript for security reasons, I did some research, and it seems that one viable way to accomplish this would be to embed a Java applet in the HTML code of the website. However, as this would be my first encounter with Java, I'd like to confirm that this is actually possible before starting out.
Alternatively, I'd welcome another way to accomplish this task.
I am trying to make an app that uses a bunch of text files as a base for most of its actions, and these text files can be updated from a web server.
Currently my app is able to download a batch of text files via a zipped archive, but I was wondering if there is a way to check whether I already have the contents of the zip file before downloading it.
What I have now is that I download and unzip, followed by a line-by-line check to see whether the current files are different from the recently downloaded files.
This seems very inefficient, but I do not know of any other way.
If anybody has any suggestions and can either give a small example or point me to one I would greatly appreciate it.
To assemble what BackSlash and the others already said in the comments:
One possible solution could be to:

- Create a hash of the file when the file is being created (good) or after download (bad)
- Store this hash somewhere, e.g. inside the filename: instructions-d41d8cd98f00b204e9800998ecf8427e.zip
- Client: query the server with the hash string
- Server: check the transmitted hash against the hash of the newest version
- Server: respond accordingly (e.g. by using the HTTP built-in 304 response)
- Client: act upon the response of the server
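A minimal sketch of the client-side hashing step, assuming MD5 to match the example filename above; how the hash reaches the server (filename probing or a conditional request) is left open:

    import java.io.IOException;
    import java.math.BigInteger;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class UpdateChecker {
        // Hash the local file so it can be compared with the server's version.
        public static String md5Of(Path file) throws IOException, NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(Files.readAllBytes(file));
            return String.format("%032x", new BigInteger(1, md.digest()));
        }

        public static void main(String[] args) throws Exception {
            String local = md5Of(Paths.get("instructions.zip"));
            System.out.println("local hash: " + local);
            // Compare against the hash embedded in the server's filename, or
            // send it in a request and let the server answer 304 if unchanged.
        }
    }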
This was a question about testing file upload functionality using a local Java server on the Windows 7 platform. Since the question evolved with Marko's input, I have edited it so that those who run into the same challenge do not waste time on evolution details and reach conclusions sooner.
The challenge was to direct an uploaded file to a folder outside of the WAR structure and successfully read it from there. For example: upload an image into c:/tmp/ and then redirect to a confirmation page that displays the image with <img src="c:/tmp/test.jpg" />. The upload worked, but the image would not be displayed. Based on Marko's input, this makes sense, because a browser sitting at localhost will refuse to load anything from the local disk structure using c:. Maybe these are security considerations similar to those with the file input control, where we cannot set a default path...
The following tag will work in a locally created .html file, but when pasted into a JSP it won't work. The difference is that the browser uses localhost to get to the JSP.
<img src="c:/tmp/test.jpg" />
Solutions
I think that Marko's answer pretty much defines what needs to be done. While I didn't go with that approach, it clearly is the better way to do it and I will accept that as the answer. Thanks, Marko!
For those who don't want to bother installing a web server and are willing to live with a bit of a hack, here's what I have done. Again, I didn't want to upload files into my WAR structure because I would then need to remember to clear that folder before deploying to the server. But that upload folder still needs to be accessible, so I simply created another dummy project and put the upload folder under its WebContent. This works for the purposes of my local testing. The only nuisance is that after uploading a file, I need to refresh the dummy project's WebContent in Eclipse.
config.properties
#for uploading files
fileUploadDirectory=C:/javawork/modelsite/tmp/WebContent
#for building html links
publicFileServicePrefix=http://localhost:8080/tmp
<img src="http://localhost:8080/tmp/test.jpg" /> <!-- this works; tmp is the name of my dummy project -->
If you are citing literally the HTML that goes to the browser (the one that you access via "view source"), then this has nothing to do with Java. The browser is the one who interprets these links. If they fail to load, the problem is in the browser/file system.
UPDATE
According to the results of your additional diagnostics, I conclude that the browser (sensibly!) refuses to load anything from your local disk if it is referenced from an HTML file coming from an internet URL, even when that URL is localhost.
UPDATE 2
(Deleted, irrelevant)
UPDATE 3
However you handle the files uploaded to the server, it's definitely not going to look like your solution -- the file is on the server's local filesystem, not the client's. This sort of thing can be handled at the Apache HTTP server level -- reserve a URL section for static content and configure Apache with a base directory from which to serve that static content. Even if you run the server locally, on the same machine where you test it, you still need to go through the network interface.
I need to write a Java client application which, when given the below URL, will enumerate the directories/files recursively beneath it. I also need to get the last modified timestamp for each since I'm only concerned with changes since a known timestamp.
http://www.myserver.com/testproduct/
For example, suppose the following exist on the server.
http://www.myserver.com/testproduct/red/file1.txt
http://www.myserver.com/testproduct/red/file2.txt
http://www.myserver.com/testproduct/red/black/file3.txt
http://www.myserver.com/testproduct/red/black/file4.txt
http://www.myserver.com/testproduct/orange/anotherfile.html
http://www.myserver.com/testproduct/orange/mymovie.avi
http://www.myserver.com/testproduct/readme.txt
I need to, starting at the specified URL (http://www.myserver.com/testproduct/) enumerate the directories and files recursively beneath it along with the last modified timestamp of each. Once I have the list of directories/files, I'll be selectively downloading some of the files based on timestamp and other client-side filters.
The server is running Apache and is configured to allow directory listing.
I did some experimentation using Apache's HttpClient Java class, and when I request the contents of http://www.myserver.com/testproduct/ I get back an HTML file, which of course is the same thing you see if you go there in your browser: an HTML page showing the contents of the folder.
Is this the only way to do it, i.e. scraping the resulting HTML page to parse out the files and directories? Also, I'm not sure I can reliably distinguish files from directories based on the HTML returned.
Is there a better way to enumerate directories and files without page scraping the resultant HTML?
If you have any control over the server, you should ask them to implement WebDAV, which is meant for precisely this sort of scenario. Apache comes with a mod_dav module that just needs to be configured. On the Java client side, see this question.
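For the Java client side, a hedged sketch using the third-party Sardine WebDAV library (com.github.sardine); this assumes mod_dav is enabled for the /testproduct/ path:

    import java.util.List;
    import com.github.sardine.DavResource;
    import com.github.sardine.Sardine;
    import com.github.sardine.SardineFactory;

    public class WebDavLister {
        public static void main(String[] args) throws Exception {
            Sardine sardine = SardineFactory.begin();
            // PROPFIND on the collection; returns files and sub-collections
            // along with their last-modified timestamps.
            List<DavResource> resources =
                    sardine.list("http://www.myserver.com/testproduct/");
            for (DavResource res : resources) {
                System.out.println((res.isDirectory() ? "dir  " : "file ")
                        + res.getName() + "  " + res.getModified());
            }
        }
    }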
If your application is not on the same machine as the server, then there isn't much you can do besides scraping the data you're looking for. If you know about all of the products that exist on your server, then you can just issue web requests for each file and you will get them. However, if you only know the root path or a single product page, then you will essentially have to crawl the web site and extract the links to the other products from it. You would only select URLs to crawl if they're on the same host and you haven't seen/crawled them before.
For example:
if http://www.myserver.com/testproduct/ contains links to
http://www.myserver.com/testproduct/red/file1.txt
http://www.myserver.com/testproduct/red/file2.txt
http://www.devboost.com/
http://www.myspace.com/
http://blog.devboost.com/
http://beta.devboost.com/
http://www.myserver.com/testproduct/red/file2.txt
Then you would ignore any link that does not start with the host www.myserver.com.
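As a sketch of that filter, using the third-party jsoup library to extract the links (an assumption; any HTML parser would do). Restricting recursion to links under the starting URL skips other hosts, the parent-directory link, and Apache's column-sorting links:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class SameHostCrawler {
        // Prints the files under pageUrl, recursing into sub-directories only.
        public static void crawl(String pageUrl) throws Exception {
            Document doc = Jsoup.connect(pageUrl).get();
            for (Element a : doc.select("a[href]")) {
                String link = a.absUrl("href");
                // Only descend: this skips other hosts, the parent-directory
                // link, and Apache's "?C=..." column-sorting links.
                if (!link.startsWith(pageUrl) || link.contains("?")) {
                    continue;
                }
                if (link.endsWith("/")) {
                    crawl(link);
                } else {
                    System.out.println("file: " + link);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            crawl("http://www.myserver.com/testproduct/");
        }
    }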
Regarding directories and timestamps: as pointed out in the comments, HTTP does not support directory browsing, and if you're trying to get the timestamp when the file was last modified, you're out of luck on that one too.
More importantly, I don't know how much it would benefit you to know that a file has not changed when that file is generating dynamic content. For example, it's extremely likely that the file which is responsible for displaying a product page hasn't changed in a LONG time: usually the same file is responsible for displaying all of the products in the database, especially if it's part of an MVC-type framework. In other words, you would have to parse the HTML, determine whether there are any changes you care about, and then process the file accordingly.
We have to make a Java application demo available on Internet using JWS. It goes pretty well; all that's missing is making working directory files available for the application.
We know about the getResource() way... The problem is that we have different plugins for the same application, and each one requires different files (and often different versions of the same files) in its working directory to work properly. So we just change the working directory when we want the application to have a different behavior.
Currently, the demo is packaged in a signed jar file and it loads correctly until it requires a file from the working directory. Obviously, the Internet users of this demo don't have the files ready. We need a way to make these files available to the WebStart application (read/write access) in its working directory.
We've thought of some things, like having the application download the files itself when it starts, or package them in the jar and extract them on startup.
Looking for advice and/or new ideas. I'll continue working on this... I'll update if I ever find something reliable.
Thank you very much!
I said I would share what I found in my research for something that would fit my needs. Here's what I have so far.
I have found that the concept of a current working directory (CWD) does not really make sense in the context of a Java Web Start (JWS) application. The effect of this was that I stopped trying to find the CWD of a JWS application and started looking for other options.
I have found that (no, I didn't know that) you can refer (using getResource()) to a file in the root directory of a JAR file by simply adding a '/' in front of its name ("/log4j.properties", for example). The impact of this is that I can now put any file that is only ever read in the root of that JAR file (which is really just a ZIP file) and refer to it using AnyClass.class.getResourceAsStream. That rules out the problem of read-only files required to run the application, at the cost of a switch in the code telling whether the application is run from a valid CWD or from a JWS context. (You can very simply set a property in the JNLP file of the JWS application and check whether that property is set to know where to look for the file.)
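A small sketch of that switch; the property name jnlp.app.context is a made-up example of something that could be set in the JNLP file:

    import java.io.FileInputStream;
    import java.io.InputStream;

    public class ConfigLoader {
        public static InputStream openConfig(String name) throws Exception {
            if (System.getProperty("jnlp.app.context") != null) {
                // Running under JWS: read from the root of the JAR.
                return ConfigLoader.class.getResourceAsStream("/" + name);
            }
            // Running normally: read from the current working directory.
            return new FileInputStream(name);
        }
    }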
For write-only files (log files in my case), I used the user.home property, adding a directory with the name of the application: <user.home>/.appname, and put the log files in it.
Read/write files (which I don't have in my case) would probably simply go in the same place as the write-only files. The software could deal with uploading them somewhere if needed, once modified, I guess.
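A sketch of that location, with "appname" as a placeholder:

    import java.io.File;

    public class AppDirs {
        // Returns <user.home>/.appname, creating it on first use.
        public static File logDir() {
            File dir = new File(System.getProperty("user.home"), ".appname");
            dir.mkdirs();
            return dir;
        }
    }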
That's the way I deal with the problem for now.
Note that there is a service you can explicitly ask for to get file access on the computer (unless you go all the way and ask for full access, which requires signed JAR files).
Then you need to determine where these files need to go; basically you have no idea what is where, nor whether you may actually write anywhere. You can create temp files, but those go away.
Would a file system abstraction talking to the JNLP server do, so that you store the user's data on the server?
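For completeness, a hedged sketch of asking for one of those services via the javax.jnlp API that ships with Java Web Start:

    import javax.jnlp.FileContents;
    import javax.jnlp.FileOpenService;
    import javax.jnlp.ServiceManager;

    public class JnlpFileAccess {
        public static void main(String[] args) throws Exception {
            // JWS shows its own security prompt for this service, so no
            // signing or full permissions are required.
            FileOpenService fos = (FileOpenService)
                    ServiceManager.lookup("javax.jnlp.FileOpenService");
            FileContents contents = fos.openFileDialog(null, null);
            if (contents != null) {
                System.out.println("user chose: " + contents.getName());
            }
        }
    }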