So, I have a server application that returns ZIP files, and I'm working with huge files (>= 5 GB). I'm using the Jersey client to issue a GET request to this application, after which I want to extract the ZIP and save it as a folder. This is the client configuration:
Client client = ClientBuilder.newClient();
client.register(JacksonJaxbJsonProvider.class);
client.register(MultiPartFeature.class);
return client;
And here's the code fetching the response from the server:
client.target(subMediumResponseLocation).path("download?delete=true").request()
.get().readEntity(InputStream.class)
My code then goes through a bunch of (unimportant for this question) steps and finally gets to the writing of data.
try (ZipInputStream zis = new ZipInputStream(inputStream)) {
    ZipEntry ze = zis.getNextEntry();
    while (ze != null) {
        String fileName = ze.getName();
        if (fileName.contains(".")) {
            size += saveDataInDirectory(folder, zis, fileName);
        }
        zis.closeEntry();            // close the current entry before moving on
        ze = zis.getNextEntry();
    }
} finally {
    inputStream.close();
}
Now the issue I'm getting is that the ZipInputStream refuses to work. I can debug the application and see that there are bytes in the InputStream, but when it gets to the while(ze != null) check, getNextEntry() returns null for the very first entry, resulting in an empty directory.
I have also tried writing the InputStream from the client to a ByteArrayOutputStream using the transferTo method, but I get a Java heap space error saying the array length is too big (even though my heap settings are -Xmx16g and -Xms12g).
My thought was that maybe, since the InputStream is lazily loaded by Jersey (using the UrlConnector directly), it doesn't play well with the ZipInputStream. Another possible issue is that I'm not using a ByteArrayInputStream for the ZipInputStream.
What would a proper solution for this be (keeping in mind the heap issues)?
OK, so I solved it: apparently my request was getting a 404 because the query param was embedded in the path (.path("download?delete=true")) instead of being added as a proper query parameter.
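For reference, a minimal sketch of the corrected request, reusing the client and subMediumResponseLocation from above; the query parameter goes through queryParam() rather than being embedded in the path:
InputStream inputStream = client
        .target(subMediumResponseLocation)
        .path("download")                  // path only, no query string
        .queryParam("delete", "true")      // query parameter added explicitly
        .request()
        .get()
        .readEntity(InputStream.class);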
I am making a request to a server which I have no control over. It returns a downloadable response, and I am downloading the file on the client as follows:
File backupFile = new File("Download.zip");
CloseableHttpResponse response = ...;
try(InputStream inputStream = response.getEntity().getContent()) {
try(FileOutputStream fos = new FileOutputStream(backupFile)) {
int inByte;
while((inByte = inputStream.read()) != -1) {
fos.write(inByte);
}
}
}
I am getting the following exception:
Premature end of Content-Length delimited message body (expected: 548846; received: 536338)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:142)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:120)
I went through the existing SO question about "Premature end of Content-Length delimited message body", but that question and its answers address a serious bug where the server doesn't deliver what it promised. Also, I am not closing the client before the download is complete.
In my case, the file (a zip file) is perfectly fine; it's just that the Content-Length estimate is off by a tiny fraction.
I reported this to the server maintainer, but I was wondering if there is a way to ignore this exception, assuming I validate the downloaded file myself.
Assuming the file is complete as is, you can simply catch the exception, flush and close the output stream, and the file will be written in its entirety as the server sent it. Of course, if the file is only partially complete, you won't be able to open it as a zip file in any context, so do make sure the file is actually intact as sent and that the only problem is the Content-Length.
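For example, a minimal sketch of that approach, reusing response and backupFile from the question and using a buffered copy; the exception is assumed to surface as org.apache.http.ConnectionClosedException (check the exact type in your own stack trace):
try (InputStream inputStream = response.getEntity().getContent();
     FileOutputStream fos = new FileOutputStream(backupFile)) {
    byte[] buffer = new byte[8192];
    int read;
    try {
        while ((read = inputStream.read(buffer)) != -1) {
            fos.write(buffer, 0, read);
        }
    } catch (org.apache.http.ConnectionClosedException e) {
        // The server's Content-Length was slightly off; keep the bytes we did
        // receive and validate the ZIP afterwards.
    }
    fos.flush();
}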
I'm trying to extract specific files from an archive on Amazon S3 without having to read all the bytes, because the archives can be huge and I only need 2 or 3 files from each.
I'm using the AWS Java SDK. Here's the code (exception handling skipped):
AWSCredentials credentials = new BasicAWSCredentials("accessKey", "secretKey");
AWSCredentialsProvider credentialsProvider = new AWSStaticCredentialsProvider(credentials);
AmazonS3 s3Client = AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).withCredentials(credentialsProvider).build();
S3Object object = s3Client.getObject("bucketname", "file.tar.gz");
S3ObjectInputStream objectContent = object.getObjectContent();
TarArchiveInputStream tarInputStream = new TarArchiveInputStream(new GZIPInputStream(objectContent));
TarArchiveEntry currentEntry;
while((currentEntry = tarInputStream.getNextTarEntry()) != null) {
if(currentEntry.getName().equals("1/foo.bar") && currentEntry.isFile()) {
FileOutputStream entryOs = new FileOutputStream("foo.bar");
IOUtils.copy(tarInputStream, entryOs);
entryOs.close();
break;
}
}
objectContent.abort(); // Warning at this line
tarInputStream.close(); // warning at this line
When I use this method, it gives a warning that not all the bytes from the stream were read, which is intentional on my part.
WARNING: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
Is it necessary to drain the stream and what would be the downsides of not doing it? Can I just ignore the warning?
You don't have to worry about the warning; it only tells you that the HTTP connection will be closed and that there may be remaining data you will miss. Since close() delegates to abort(), you get the warning with either call.
Note that the saving is not guaranteed either: if the files you are interested in are located towards the end of the archive, you end up reading most of it anyway.
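If you would rather silence the warning and keep the connection reusable, one option (a sketch only, and usually a bad trade-off for huge archives, since it reads the rest of the archive over the network) is to drain the remainder instead of aborting:
// Sketch only: read and discard whatever is left so the HTTP connection can be
// returned to the pool; with multi-GB archives abort() is normally preferable.
byte[] skipBuffer = new byte[8192];
while (objectContent.read(skipBuffer) != -1) {
    // discard the remaining bytes
}
tarInputStream.close();  // also closes the wrapped gzip and S3 streams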
S3 supports HTTP range requests, so if you can influence the format of the archive, or generate some metadata during its creation, you could skip ahead or request only the bytes of the file you are interested in.
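For illustration, a sketch of a ranged request with the v1 SDK; entryOffset and entryLength are hypothetical values you would have to record when the archive is built, and note that byte ranges only map cleanly to entries if the archive is not compressed as a single gzip stream:
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

// Sketch only: entryOffset/entryLength are hypothetical metadata recorded at
// archive-creation time; S3 itself has no index into the tar.
GetObjectRequest rangedRequest = new GetObjectRequest("bucketname", "file.tar.gz")
        .withRange(entryOffset, entryOffset + entryLength - 1);  // inclusive byte range
S3Object partialObject = s3Client.getObject(rangedRequest);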
I'm trying to download zip files from the internet using the following code:
public void getFile(String updateURL) throws Exception {
URL url = new URL(updateURL);
HttpURLConnection httpsConn = (HttpURLConnection) url.openConnection();
httpsConn.setRequestMethod("GET");
TrustModifier.relaxHostChecking(httpsConn);
int responseCode = httpsConn.getResponseCode();
if (responseCode == HttpsURLConnection.HTTP_OK) {
String fileName = "fileFromNet";
try (FileOutputStream outputStream = new FileOutputStream(fileName)) {
ReadableByteChannel rbc = Channels.newChannel(httpsConn.getInputStream());
outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
}
}
httpsConn.disconnect();
}
TrustModifier is a class used to solve the "trust issue": http://www.obsidianscheduler.com/blog/ignoring-self-signed-certificates-in-java/
The code above works well for zip files available via plain HTTP, or for uncompressed files exposed via HTTPS, but if I try to download a zip file from an HTTPS endpoint, only a small fragment of the original file is downloaded. I have tested different download links from the internet and always get the same result.
Does anybody have an idea what I've been doing wrong here?
Thank you.
transferFrom() must be called in a loop until the transfer is complete; in this case, the only way you can know that is by adding up the return values of transferFrom() until they equal the Content-Length of the HTTP response.
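A sketch of such a loop, reusing httpsConn and outputStream from the question; if the server does not send a Content-Length (for example a chunked response), fall back to a plain buffered copy instead:
// Sketch only: keep calling transferFrom() until the transferred byte count
// reaches the Content-Length reported by the server.
long expected = httpsConn.getContentLengthLong();
ReadableByteChannel rbc = Channels.newChannel(httpsConn.getInputStream());
long position = 0;
while (position < expected) {
    long transferred = outputStream.getChannel().transferFrom(rbc, position, expected - position);
    if (transferred <= 0) {
        break;  // the stream ended early; avoid looping forever
    }
    position += transferred;
}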
Actually, the problem was in the TrustModifier class I was using to switch off the server certificate check. Once I removed it because I no longer needed it (I took the certificate from the server and put it in a local trust store), my problem was solved.
I'm currently developing an app that should measure (fairly precisely) the size of webpages.
The thing I'm struggling with now is that I need to know the sizes of particular files on the website. I have an array of URLs and I try to fetch their headers to get Content-Length; however, some files return -1 since they are sent chunked. If they return -1, I download them to get their size.
And here lies the problem: I found out that I always get the uncompressed version of the file.
Example file: http://www.google-analytics.com/analytics.js
When I open it in Chrome, the response headers show the file being served with Content-Encoding: gzip and a much smaller transfer size. However, when I download it using HttpURLConnection, it has a size of 25421 bytes, and when I check the Content-Encoding header, it's always null.
connection = (HttpURLConnection)(new URL(url)).openConnection();
connection.setRequestProperty("Accept-Encoding", "gzip");
connection.connect();
int contentLength = connection.getContentLength();
if (contentLength == -1 && connection != null) {
InputStream input = connection.getInputStream();
byte[] buffer = new byte[4096];
int count = 0, len;
while ((len = input.read(buffer)) > 0) {
count += len;
}
contentLength = count;
}
So the problem is that I download a webpage with my application and it reports (let's say) 400 kB. But when I measure it with a tool like http://tools.pingdom.com/fpt/ , the size is much smaller, like 100 kB, since most of the scripts are gzipped and the actual transfer is therefore smaller.
I know 300 kB is not that much, but on a mobile connection every kB counts, and I want my app to be precise.
Could you point out where I'm making a mistake, or how I could solve this?
Thank you
Your HttpURLConnection setup code looks correct to me. You could try setting the User-Agent to a standard browser one; perhaps the server is trying to be more intelligent than it ought to be. Failing that, run your traffic through a debugging proxy like Fiddler or Burp to see what's going on at the network level.
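A sketch of the first suggestion, reusing the connection setup from the question; the User-Agent string below is just an illustrative desktop-browser value:
// Sketch only: present the request as a regular browser so the server applies
// the same content negotiation it applies to Chrome.
connection = (HttpURLConnection) (new URL(url)).openConnection();
connection.setRequestProperty("User-Agent",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
connection.setRequestProperty("Accept-Encoding", "gzip");
connection.connect();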
If you are using iJetty, you have to enable gzip compression first.
You have to enable the GzipFilter to make Jetty return compressed content. Have a look here on how to do that: http://blog.max.berger.name/2010/01/jetty-7-gzip-filter.html
You can also use the gzip init parameter to make Jetty look for pre-compressed content. That means if the file file.txt is requested, Jetty will look for a file named file.txt.gz and return that.
How can I download an image from a server and then write it as a response from my servlet? What is the best way to do this while keeping good performance?
Here's my code:
JSONObject imageJson;
... //getting my JSON
String imgUrl = (String) imageJson.get("img");  // get() returns Object, so cast (or use getString)
If you don't need to hide the image source and the server is accessible from the client as well, I'd just point your response at the remote server (you already have the URL). That way you don't need to download it to your own server first; the client can access it directly, so you don't waste your resources (a minimal sketch of this follows below).
However, if you still need to download it to your server first, the following post might help: Writing image to servlet response with best performance
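A minimal sketch of the redirect option, assuming the servlet's HttpServletResponse is available as servletResponse and imgUrl has already been read from the JSON:
// Sketch only: send the client straight to the image's origin instead of
// proxying the bytes through this servlet.
servletResponse.sendRedirect(imgUrl);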
It's important to avoid intermediate buffering of the image in the servlet. Instead, just stream whatever was received to the servlet response:
InputStream is = new URL(imgUrl).openStream();
OutputStream os = servletResponse.getOutputStream();
IOUtils.copy(is, os);
is.close();
I'm using IOUtils from Apache Commons (not necessary, but useful).
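If you want the same streaming approach with the resources closed reliably, a hedged variant (the content type below is an assumption; ideally copy it from the upstream response):
// Sketch only: stream the remote image straight into the servlet response and
// close the input stream even if the copy fails.
servletResponse.setContentType("image/png");  // assumption; take it from the upstream server if possible
try (InputStream is = new URL(imgUrl).openStream()) {
    IOUtils.copy(is, servletResponse.getOutputStream());
}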
The complete solution: download a map and save it to a file.
String imgUrl = "http://maps.googleapis.com/maps/api/staticmap?center=-15.800513,-47.91378&zoom=11&size=200x200&sensor=false";
InputStream is = new URL(imgUrl).openStream();
File archivo = new File("c://temp//mapa.png");
archivo.setWritable(true);
OutputStream output = new FileOutputStream(archivo);
IOUtils.copy(is, output);
IOUtils.closeQuietly(output);
is.close();