I use this code snippet to download some MP3 files:
File target = /*...*/;
InputStream in = new URL(link).openStream();
Files.copy(in, target.toPath(), StandardCopyOption.REPLACE_EXISTING);
It usually works fine, but now I have a series of files that are way too small and don't work. For example: https://kritisches-denken-podcast.de/wp-content/uploads/2019/01/KDP-Episode-17-Selbsterhaltungstherapie.mp3 should be about 46 MB (when I download it via browser) but is only 315 bytes when I download it with the code above on my Android device.
The URL has a redirect built into it. Usually such redirects, especially for URLs targeted at non-browsers (which an mp3 URL clearly is), are served up as an HTTP 301 'Moved Permanently' (or sometimes a 302 'Moved Temporarily'), with the right URL sent along in the Location header. The text you see (the 315 bytes you download) is merely 'fallback' HTML that also states that the content has moved. Fortunately, there is no need to parse this.
The HTTP 'browser' behind URL's openStream code is very basic and does not follow redirects. You need an API that does. HttpURLConnection (also from the core libraries) can do it, but it does not follow redirects that switch from http to https or vice versa, so you might not want to rely on it. Just in case you do:
File target = /*...*/;
HttpURLConnection con = (HttpURLConnection) new URL(link).openConnection();
con.setInstanceFollowRedirects(true); // follow same-protocol redirects (this is actually the default)
try (InputStream in = con.getInputStream()) {
Files.copy(in, target.toPath(), StandardCopyOption.REPLACE_EXISTING);
}
If the above is no good (presumably due to the HTTP/HTTPS redirect issue), I suggest picking up a real HTTP client, which the standard API does not provide. I suggest OkHttp.
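For illustration, a minimal OkHttp sketch (my addition, not from the original answer; it assumes OkHttp 3.x on the classpath and reuses the link and target variables from above; OkHttp follows redirects by default, including http-to-https switches):
// imports: okhttp3.OkHttpClient, okhttp3.Request, okhttp3.Response
OkHttpClient client = new OkHttpClient();
Request request = new Request.Builder().url(link).build();
try (Response response = client.newCall(request).execute()) {
    if (!response.isSuccessful()) {
        throw new IOException("Unexpected response: " + response);
    }
    // Stream the (redirect-resolved) body straight into the target file.
    Files.copy(response.body().byteStream(), target.toPath(),
            StandardCopyOption.REPLACE_EXISTING);
}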
Related
I am trying to read a website using the java.net package classes. The site has content, and I can see it in the browser's HTML source view. When I get its response code and try to view the site using Java, it connects successfully but interprets the site as one without content (a 204 code). What is going on, and is it possible to get around this to view the HTML automatically?
Thanks for your responses.
Do you need the URL?
Here is the code:
URL url = new URL(/* the website */);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
// Open the connection once and reuse it for the status and the body.
System.out.println(connection.getResponseCode());
System.out.println(connection.getResponseMessage());
BufferedReader reader = null;
try {
    reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    // Call readLine() once per iteration; calling it twice skips every other line.
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
} finally {
    if (reader != null) {
        reader.close();
    }
}
Suggestions:
Assert that when manually accessing the site (with a web browser client) you are actually getting a 200 return code
Make sure that the HTTP request issued from the automated (Java-based) logic is similar/identical to the one sent by an interactive web browser client. In particular, make sure the User-Agent is identical (some sites purposely alter their responses depending on the agent).
You can use an HTTP debugging proxy, maybe something like Fiddler2, to see exactly what is being sent and received to/from the server
I'm not sure that the java.net package is robot-aware, but that could be a factor as well (check whether the underlying site has a robots.txt file).
Edit:
Assuming you are using the java.net package's HttpURLConnection class, the "robot" hypothesis doesn't apply.
On the other hand, you'll probably want to use the connection's setRequestProperty() method to prepare the desired HTTP headers for the request (so they match those from the web browser client)
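For example, a minimal sketch (the URL and header values are placeholders; copy the real values from your browser's developer tools):
URL url = new URL("http://example.com/"); // hypothetical URL
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
// The User-Agent below is a placeholder; use the exact string your browser sends.
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
connection.setRequestProperty("Accept", "text/html");
System.out.println(connection.getResponseCode());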
Maybe you can post the relevant portions of your code.
I have HTML-based queries in my code, and one specific kind seems to give rise to IOExceptions upon receiving a 505 response from the server. I have looked up the 505 response along with other people who seemed to have similar problems. Apparently 505 stands for HTTP version mismatch, but when I copy the same query URL into any browser (I tried Firefox, SeaMonkey and Opera) there seems to be no problem. One of the posts I read suggested that the browsers might automatically handle the version mismatch problem.
I have tried to dig deeper by using the nice developer tool that comes with Opera, and it looks like there is no mismatch in versions (I believe Java uses HTTP 1.1) and a nice 200 OK response is received. Why do I experience problems when the same query goes through my Java code?
private InputStream openURL(String urlName) throws IOException{
URL url = new URL(urlName);
URLConnection urlConnection = url.openConnection();
return urlConnection.getInputStream();
}
sample link: http://www.uniprot.org/uniprot/?query=mnemonic%3aNUGM_HUMAN&format=tab&columns=id,entry%20name,reviewed,organism,length
There have been some issues in Tomcat with URLs containing spaces. To fix the problem, you need to encode your URL with URLEncoder.
Example (notice the space):
String url="http://example.org/test test2/index.html";
String encodedURL=java.net.URLEncoder.encode(url,"UTF-8");
System.out.println(encodedURL); //outputs http%3A%2F%2Fexample.org%2Ftest+test2%2Findex.html
Note that URLEncoder escapes everything, including the scheme and slashes, so in practice you should encode only the components (such as path segments or query values) that contain spaces, not the whole URL.
As a developer at www.uniprot.org I have the advantage of being able to look in the request logs. In the last year, according to the logs, we have not sent a 505 response code. In any case our servers understand HTTP 1.0 requests as well as the default HTTP 1.1 (though you might not get the results that you expect).
That makes me suspect there was either some kind of data corruption on the way, or you were affected by a hardware failure (lately we have had some trouble with a switch and a whole datacentre ;). In any case, if you ever have questions or problems with uniprot.org please contact help#uniprot.org and we can see if we can help/fix the problem.
Your code snippet seems normal and should work.
Regards,
Jerven Bolleman
Are you behind a proxy? This code works for me and prints out the same text I see through a browser.
final URL url = new URL("http://www.uniprot.org/uniprot/?query=mnemonic%3aNUGM_HUMAN&format=tab&columns=id,entry%20name,reviewed,organism,length");
final URLConnection conn = url.openConnection();
final InputStream is = conn.getInputStream();
System.out.println(IOUtils.toString(is));
conn is an instance of HttpURLConnection
From the API documentation for the URL class:
The URL class does not itself encode or decode any URL components
[...]. It is the responsibility of the caller to encode any fields,
which need to be escaped prior to calling URL, and also to decode any
escaped fields, that are returned from URL.
So if you have any spaces in your url-str, encode them before calling new URL(url-str).
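A minimal sketch of encoding just the query value before building the URL (the query value is taken from the question's sample link; everything else is illustrative):
String query = "mnemonic:NUGM_HUMAN"; // raw value containing a character that needs escaping
String encoded = URLEncoder.encode(query, "UTF-8"); // yields "mnemonic%3ANUGM_HUMAN"
URL url = new URL("http://www.uniprot.org/uniprot/?query=" + encoded + "&format=tab");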
@posdef I was having the same HTTP error code 505 problem. When I pasted the URL I was using in my Java code into Firefox or Chrome, it worked, but through code it gave an IOException. At last I came to know that the URL string contained brackets '(' and ')'; by removing them it worked, so it seems I needed URL encoding just like the browsers do.
I am trying to create an HttpServlet that forwards all incoming requests, as is, to another servlet running on a different domain.
How can this be accomplished? The RequestDispatcher's forward() only operates on the same server.
Edit: I can't introduce any dependencies.
You can't when the target doesn't run in the same ServletContext, or in the same/clustered webserver wherein the webapps are configured to share the ServletContext (in the case of Tomcat, check the crossContext option).
You have to send a redirect by HttpServletResponse.sendRedirect(). If your actual concern is reusing the query parameters on the new URL, just resend them along.
response.sendRedirect(newURL + "?" + request.getQueryString());
Or when it's a POST, send an HTTP 307 redirect; the client will reapply the same POST query parameters to the new URL.
response.setStatus(HttpServletResponse.SC_TEMPORARY_REDIRECT);
response.setHeader("Location", newURL);
Update: as per the comments, that's apparently not an option either, since you want to hide the URL. In that case, you have to let the servlet play proxy. You can do this with an HTTP client, e.g. the Java SE provided java.net.URLConnection (mini tutorial here) or the more convenient Apache Commons HttpClient.
If it's GET, just do:
InputStream input = new URL(newURL + "?" + request.getQueryString()).openStream();
OutputStream output = response.getOutputStream();
// Copy.
Or if it's POST:
URLConnection connection = new URL(newURL).openConnection();
connection.setDoOutput(true);
// Set and/or copy request headers here based on current request?
InputStream input1 = request.getInputStream();
OutputStream output1 = connection.getOutputStream();
// Copy.
InputStream input2 = connection.getInputStream();
OutputStream output2 = response.getOutputStream();
// Copy.
Note that you possibly need to capture/replace/update the relative links in the HTML response, if any. Jsoup may be extremely helpful in this.
As others have pointed out, what you want is a proxy. Your options:
Find an open-source Java library that does this. There are a few out there, but I haven't used any of them, so I can't recommend any.
Write it yourself. Shouldn't be too hard, just remember to deal with stuff like passing along all headers and response codes.
Use the proxy module in Apache 2.2. This is the one I'd pick, because I already know that it works reliably.
Jetty has a sample ProxyServlet implementation that uses URL.openConnection() under the hood. Feel free to use as-is or to use as inspiration for your own implementation. ;-)
Or you can use Apache HttpClient, see the tutorial.
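A minimal GET sketch with Apache HttpClient (my addition; it assumes the 4.x HttpComponents API and reuses newURL, request and response from the snippets above):
CloseableHttpClient client = HttpClients.createDefault();
HttpGet get = new HttpGet(newURL + "?" + request.getQueryString());
try (CloseableHttpResponse proxyResponse = client.execute(get)) {
    // Mirror the upstream status code, then stream the body through.
    response.setStatus(proxyResponse.getStatusLine().getStatusCode());
    proxyResponse.getEntity().writeTo(response.getOutputStream());
}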
Even though it's not part of HTTP 1.1/RFC 2616, webapps that wish to force a resource to be downloaded (rather than displayed) in a browser can use the Content-Disposition header like this:
Content-Disposition: attachment; filename=FILENAME
Even though it's only defined in RFC 2183 and not part of HTTP 1.1, it works in most web browsers as desired.
So from the client side, everything is good enough.
However on the server-side, in my case, I've got a Java webapp and I don't know how I'm supposed to set that header, especially in the following case...
I'll have a file (say called "bigfile") hosted on an Amazon S3 instance (my S3 bucket shall be accessible using a partial address like: files.mycompany.com/) so users will be able to access this file at files.mycompany.com/bigfile.
Now is there a way to craft a servlet (or a .jsp) so that the Content-Disposition header is always added when the user wants to download that file?
What would the code look like and what are the gotchas, if any?
I got this working as Pointy pointed out. Instead of linking directly to the asset - in my case PDFs - one now links to a JSP called download.jsp, which takes and parses GET parameters and then serves out the PDF as a download.
Download here
Here's the JSP code I used. It's working in IE8, Chrome and Firefox:
<%@page session="false"
contentType="text/html; charset=utf-8"
import="java.io.IOException,
java.io.InputStream,
java.io.OutputStream,
javax.servlet.ServletContext,
javax.servlet.http.HttpServlet,
javax.servlet.http.HttpServletRequest,
javax.servlet.http.HttpServletResponse,
java.io.File,
java.io.FileInputStream"
%>
<%
//Set the headers.
response.setContentType("application/x-download");
response.setHeader("Content-Disposition", "attachment; filename=downloaded.pdf");
[pull the file path from the request parameters]
File file = new File("[pdf path pulled from the requests parameters]");
FileInputStream fileIn = new FileInputStream(file);
ServletOutputStream outstream = response.getOutputStream();
byte[] outputByte = new byte[40096];
int bytesRead;
while((bytesRead = fileIn.read(outputByte, 0, 40096)) != -1)
{
outstream.write(outputByte, 0, bytesRead); // write only the bytes actually read
}
fileIn.close();
outstream.flush();
outstream.close();
%>
You wouldn't have a URL that was a direct reference to the file. Instead, you'd have a URL that leads to your servlet code (or to some sort of action code in your server-side framework). That, in turn, would have to access the file contents and shovel them out to the client, after setting up the header. (You'd also want to remember to deal with cache control headers, as appropriate.)
The HttpServletResponse class has APIs that'll let you set all the headers you want. You have to make sure that you set up the headers before you start dumping out the file contents, because the headers literally have to come first in the stream being sent out to the browser.
This is not that much different from a situation where you might have a servlet that would generate a download on-the-fly.
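A minimal servlet sketch of that approach (my addition; the bucket URL and filename come from the question, while error handling and cache control headers are omitted):
protected void doGet(HttpServletRequest request, HttpServletResponse response)
        throws IOException {
    // Headers must be set before any body bytes are written.
    response.setContentType("application/octet-stream");
    response.setHeader("Content-Disposition", "attachment; filename=bigfile");
    // Fetch the file from S3 and shovel it out to the client.
    try (InputStream in = new URL("http://files.mycompany.com/bigfile").openStream()) {
        OutputStream out = response.getOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }
}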
Edit: I'll leave that stuff above here for posterity's sake, but I'll note that there is (or might be) some way to hand over some HTTP headers to S3 when you store a file, such that Amazon will spit those back out when the file is served out. I'm not exactly sure how you'd do that, and I'm not sure that Content-Disposition is a header that you can set up that way, but I'll keep looking.
Put a .htaccess file in the root folder with the following line:
Header set Content-Disposition attachment
I just found this via Google.
I had a similar problem, but I still want to use a servlet (as I generate the content).
However, the following line is all you need in a servlet.
response.setHeader("Content-Disposition", "attachment; filename=downloadedData.json");
I have a Java webserver (no standard software ... self-written). Everything seems to work fine, but when I try to call a page that contains pictures, those pictures are not displayed. Do I have to send images with the output stream to the client? Am I missing an extra step?
As there is too much code to post here, here is a little outline of what happens or is supposed to happen:
1. client logs in
2. client gets a session id and so on
3. the client is connected with an output stream
4. we build the response with the HTML code for a certain 'GET' request
5. look at what the GET request is all about
6. send html response || file || image (not working yet)
So much for the basic outline ...
It sends CSS files and the like, but I still have a problem with images!
Does anybody have an idea? How can I send images from a server to a browser?
Thanks.
I check requests from the client and responses from the server with Charles. It sends the files (like CSS or JS) fine, but not images: though the status is "200 OK", the Transfer-Encoding is chunked ... I have no idea what that means!? Does anybody know?
EDIT:
Here is the file-reading code:
try{
File requestedFile = new File( file );
PrintStream out = new PrintStream( this.getHttpExchange().getResponseBody() );
// The file is sent:
InputStream in = new FileInputStream( requestedFile );
byte content[] = new byte[(int)requestedFile.length()];
in.read( content );
try{
// some header stuff
out.write( content );
}
catch( Exception e ){
e.printStackTrace();
}
in.close();
if(out!=null){
out.close();
System.out.println( "FILE " + uri + " SEND!" );
}
}
catch ( /*all exceptions*/ ) {
// catch it ...
}
Your browser will send separate GET /image.png HTTP/1.1 requests to your server; you should handle those file GETs too. There is no good browser-independent way to embed an image directly in HTML; only the <img src="data:base64codedimage"> scheme is available in some browsers.
As you create your HTML response, you can include the contents of the external js/css files directly between <script></script> and <style></style> tags.
Edit: I advise using Firebug for further diagnostics.
Are you certain that you send out the correct MIME type for the files?
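If in doubt, a minimal sketch of extension-based Content-Type selection (the helper and its mappings are my assumption, not part of the original answer):
// Picks a Content-Type from the file extension; falls back to a generic binary type.
private static String guessMimeType(String path) {
    if (path.endsWith(".html")) return "text/html";
    if (path.endsWith(".css"))  return "text/css";
    if (path.endsWith(".js"))   return "application/javascript";
    if (path.endsWith(".png"))  return "image/png";
    if (path.endsWith(".jpg") || path.endsWith(".jpeg")) return "image/jpeg";
    return "application/octet-stream";
}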
If you need a tiny OpenSource webserver to be inspired by, then have a look at http://www.acme.com/java/software/Acme.Serve.Serve.html which serves us well for ad-hoc server needs.
Do I have to send those external files or images with the output stream to the client?
The client will make separate requests for those files, which your server will have to serve. However, those requests can arrive over the same persistent connection (a.k.a. keepalive). The two most likely reasons for your problem:
The client tries to send multiple requests over a persistent connection (which is the default with HTTP 1.1) and your server is not handling this correctly. The easiest way to avoid this is to send a Connection: close header with the response (see the sketch after this list).
The client tries to open a separate connection and your server isn't handling it correctly.
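A minimal sketch of writing such a response by hand (my addition; socket and content stand in for whatever your server actually uses):
// "Connection: close" tells the client not to reuse this connection,
// so the server can simply close it after the body.
PrintStream out = new PrintStream(socket.getOutputStream());
out.print("HTTP/1.1 200 OK\r\n");
out.print("Content-Type: image/png\r\n");
out.print("Content-Length: " + content.length + "\r\n");
out.print("Connection: close\r\n");
out.print("\r\n"); // blank line separates headers from body
out.write(content, 0, content.length);
out.flush();
out.close();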
Edit:
There's a problem with this line:
in.read( content );
This method is not guaranteed to fill the array; it will read an arbitrary number of bytes and return that number. You have to use it in a loop to make sure everything is read. Since you have to do a loop anyway, it's a good idea to use a smaller array as a buffer to avoid keeping the whole file in memory and running into an OutOfMemoryError with large files.
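A sketch of the corrected loop, using a small buffer as suggested (variable names are mine):
byte[] buffer = new byte[8192]; // small buffer instead of a whole-file array
int bytesRead;
while ((bytesRead = in.read(buffer)) != -1) {
    out.write(buffer, 0, bytesRead); // write only the bytes actually read
}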
Probably step #4 is where you are going wrong:
// 4. we built the response with the HTML-Code for a certain 'GET'-request
Some of the requests will be a 'GET /css/styles.css' or 'GET /js/main.js' or 'GET /images/header.jpg'. Make sure you stream those files in those circumstances - try loading those URLs directly.
Images (and CSS/JS files) are requested by the browser as completely separate GET requests, so there's definitely no need to "send those ... with the output stream". If you're getting pages served up OK but images aren't being loaded, my first guess would be that you're not setting your response headers appropriately (for example, setting the Content-Type of the response to text/html), so the browser isn't interpreting it as a proper page and therefore isn't loading the images.
Some other things to try if that doesn't work:
Check if you can access an image directly
Use something like Firebug or Fiddler to check whether the browser is actually requesting the image/css/js files and that all your request/response headers look OK
Use an existing web server!