Partially reading a tar.gz file from Amazon S3 - java

I'm trying to extract specific files from an archive on Amazon S3 without having to read all the bytes, because the archives can be huge and I only need two or three files from them.
I'm using the AWS Java SDK. Here's the code (exception handling skipped):
AWSCredentials credentials = new BasicAWSCredentials("accessKey", "secretKey");
AWSCredentialsProvider credentialsProvider = new AWSStaticCredentialsProvider(credentials);
AmazonS3 s3Client = AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).withCredentials(credentialsProvider).build();
S3Object object = s3Client.getObject("bucketname", "file.tar.gz");
S3ObjectInputStream objectContent = object.getObjectContent();
TarArchiveInputStream tarInputStream = new TarArchiveInputStream(new GZIPInputStream(objectContent));
TarArchiveEntry currentEntry;
while ((currentEntry = tarInputStream.getNextTarEntry()) != null) {
    if (currentEntry.getName().equals("1/foo.bar") && currentEntry.isFile()) {
        FileOutputStream entryOs = new FileOutputStream("foo.bar");
        IOUtils.copy(tarInputStream, entryOs);
        entryOs.close();
        break;
    }
}
objectContent.abort(); // Warning at this line
tarInputStream.close(); // warning at this line
When I use this method it gives a warning that not all the bytes from the stream were read, which is intentional on my part.
WARNING: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
Is it necessary to drain the stream and what would be the downsides of not doing it? Can I just ignore the warning?

You don't have to worry about the warning - it only tells you that the HTTP connection will be aborted and that there may be remaining data you will never read. Since close() delegates to abort(), you get the warning from either call.
Note that aborting early is not guaranteed to save you much anyway: if the files you are interested in are located towards the end of the archive, you will already have read most of the bytes by the time you reach them.
S3's HTTP server supports range requests, so if you can influence the format of the archive, or generate some metadata (for example an index of entry offsets) while it is created, you could skip ahead or request only the part containing the file you are interested in.
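For illustration, here is a minimal sketch of such a ranged GET with the v1 Java SDK, reusing the s3Client from the question. The byte offsets are purely hypothetical - a plain .tar.gz cannot be seeked into arbitrarily, so this only helps if the archive layout (or an index built alongside it) tells you where the entry lives:

import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectInputStream;

// Hypothetical offsets, e.g. looked up in an index written when the archive was built
GetObjectRequest rangedRequest = new GetObjectRequest("bucketname", "file.tar.gz")
        .withRange(1024, 2047); // inclusive byte range
S3Object partialObject = s3Client.getObject(rangedRequest);
try (S3ObjectInputStream partialContent = partialObject.getObjectContent()) {
    // only the requested range is transferred, so reading it fully avoids the warning
    byte[] chunk = IOUtils.toByteArray(partialContent); // same IOUtils as above
    // ... decompress / parse the chunk according to your archive layout
}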

Related

In Spring Boot Java Extracting the zip file from response without physically saving it

In Spring Boot, the zip file that comes back as a response has a corrupted structure before saving, but there is no problem when I save it to disk first. I need to take the file inside the zip and process its information into the database, but I cannot physically download this file because I am using GCP. How can I extract the file inside this zip that comes back as a response? How can I solve this, please help.
Here is what it looks like in response.body() before saving (part of it):
"PK C`iUq �=n 緰) bu_customerfile_22110703_B001_10292121141�]i��������RI%U�v��CJ� ���×��My��y/ίϹ�������=>����}����8���׿}~}~yz�������ͲL��
�o�0�fV�29f�����6΋$K�c$�F��/�8˳�L��_�QaZ-q�F�d4γE�[���(f�8�D�0��2_��P"�I�A��D��4�߂�����D��(�T�$.��<�,���i]Fe�iM�q<ʨ�Olmi�(&���?�y�y4��<��Q�X�ޘp�#�6f-.F����8����"I㢨ҤU]�E��WI� %#������(W�8*0c�p:L��:� �}�G����e<����a�"
Here is the request call:
OkHttpClient client1 = new OkHttpClient().newBuilder()
        .build();
MediaType mediaType1 = MediaType.parse("text/plain");
RequestBody body1 = RequestBody.create(mediaType1, "");
Request request1 = new Request.Builder()
        .url(vers)
        .method("POST", body1)
        .addHeader("Cookie", "ASP.NET_SessionId=44dxexdxass5mtf00udjfwns")
        .build();
Response response1 = client1.newCall(request1).execute();
String data = response1.body().string();
Would it be a matter of encoding? A Java String is UTF-16 internally, so converting the raw ZIP bytes to a String corrupts them. Try a datatype that is more like an array of bytes instead.
Try something like what is mentioned here:
Having trouble reading http response into input stream
Update: Get the response as a stream of bytes and feed it into a ZipInputStream object as shown here: https://zetcode.com/java/zipinputstream/
Then iterate over the contained entries to find the one you need, retrieve the stream associated with that entry, and read from there. (I realize that is a bit of hand-waving, but it's been a while since I used Zip files and Java.) That should get you down the correct path.
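A minimal sketch of that approach, building on the request1 call from the question; the .csv suffix check is just a placeholder for however you identify the entry you need, and readAllBytes requires Java 9+:

import java.io.ByteArrayInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Read the body as raw bytes instead of a String so the ZIP data is not corrupted
byte[] zipBytes = response1.body().bytes();

try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
    ZipEntry entry;
    while ((entry = zis.getNextEntry()) != null) {
        if (!entry.isDirectory() && entry.getName().endsWith(".csv")) { // placeholder match
            // the stream is positioned at this entry; read() returns -1 at the entry's end
            byte[] fileBytes = zis.readAllBytes();
            // ... parse fileBytes and write the data to the database, no file on disk needed
        }
        zis.closeEntry();
    }
}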

Jersey Client download ZIP file and unpack efficiently

So, I have a server application that returns ZIP files and I'm working with huge files (>=5GB). I am then using the jersey client to do a GET request from this application after which I want to basically extract the ZIP and save it as a folder. This is the client configuration:
Client client = ClientBuilder.newClient();
client.register(JacksonJaxbJsonProvider.class);
client.register(MultiPartFeature.class);
return client;
And here's the code fetching the response from the server:
client.target(subMediumResponseLocation).path("download?delete=true").request()
.get().readEntity(InputStream.class)
My code then goes through a bunch of (unimportant for this question) steps and finally gets to the writing of data.
try (ZipInputStream zis = new ZipInputStream(inputStream)) {
    ZipEntry ze = zis.getNextEntry();
    while (ze != null) {
        String fileName = ze.getName();
        if (fileName.contains(".")) {
            size += saveDataInDirectory(folder, zis, fileName);
        }
        zis.closeEntry();
        ze = zis.getNextEntry();
    }
} finally {
    inputStream.close();
}
Now the issue I'm getting is that the ZipInputStream refuses to work. I can debug the application and see that there are bytes in the InputStream, but when it gets to the while (ze != null) check, it returns null on the first entry, resulting in an empty directory.
I have also tried writing the InputStream from the client to a ByteArrayOutputStream using the transferTo method, but I get a java heap space error saying the array length is too big (even though my heap space settings are Xmx=16gb and Xms=12gb).
My thoughts were that maybe since the InputStream is lazy loaded by Jersey using the UrlConnector directly, this doesn't react well with the ZipInputStream. Another possible issue is that I'm not using a ByteArrayInputStream for the ZipInputStream.
What would a proper solution for this be (keeping in mind the heap issues)?
OK, so I solved it: apparently my request was getting a 404 because I added the query param in the path... .path("download?delete=true")
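Presumably the fix is to add the query parameter through Jersey's queryParam() rather than embedding it in the path string, along these lines (a sketch reusing the names from the question):

InputStream inputStream = client.target(subMediumResponseLocation)
        .path("download")                 // path segment only, no query string
        .queryParam("delete", "true")     // query parameter added separately
        .request()
        .get()
        .readEntity(InputStream.class);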

Get Zip file size over FTP Connection in Java

I'm sending a zip file over an FTP connection, so to fetch the file size I have used:
URLConnection conn = imageURL.openConnection();
long l = conn.getContentLengthLong();
But it returns -1.
For files sent over an HTTP request, I get the correct file size.
How do I get the correct file size over the FTP connection in this case?
"for files sent over Http request, I get correct file size."
MAYBE. URLConnection.getContentLength[Long] returns specifically the content-length header. HTTP (and HTTPS) supports several different ways of delimiting bodies, and depending on the HTTP options and versions the server implements, it might use a content-length header or it might not.
Somewhat similarly, an FTP server may provide the size of a 'retrieved' file at the beginning of the operation, or it may not. But it never uses a content-length header to do so, so getContentLength[Long] doesn't get it. However, the implementation code does store it internally if the server provides it, and it can be extracted by the following quite ugly hack:
URL url = new URL("ftp://anonymous:dummy@192.168.56.2/pub/test");
URLConnection conn = url.openConnection();
try (InputStream is = conn.getInputStream()) {
    if (!conn.getClass().getName().equals("sun.net.www.protocol.ftp.FtpURLConnection"))
        throw new Exception("conn wrong");
    Field fld1 = conn.getClass().getDeclaredField("ftp");
    fld1.setAccessible(true);
    Object ftp = fld1.get(conn);
    if (!ftp.getClass().getName().equals("sun.net.ftp.impl.FtpClient"))
        throw new Exception("ftp wrong");
    Field fld2 = ftp.getClass().getDeclaredField("lastTransSize");
    fld2.setAccessible(true);
    long size = fld2.getLong(ftp);
    System.out.println(size);
}
Hacking undocumented internals may fail at any time, and versions of Java above 8 progressively discourage it: 9 through 15 give a warning message about illegal access unless you use --add-opens to permit it and 16 makes it an error (i.e. fatal). Unless you really need the size before reading the data (which implicitly gives you the size) I would not recommend this.
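If poking at the JDK internals is not an option, a dedicated FTP client library can ask the server for the size directly. A sketch using Apache Commons Net (an extra dependency, not part of the question's code), assuming the server answers a directory listing for the file:

import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPFile;

FTPClient ftp = new FTPClient();
ftp.connect("192.168.56.2");
ftp.login("anonymous", "dummy");       // credentials taken from the URL above
ftp.enterLocalPassiveMode();
FTPFile[] files = ftp.listFiles("/pub/test");
if (files.length == 1) {
    long size = files[0].getSize();    // size as reported by the server's listing
    System.out.println(size);
}
ftp.logout();
ftp.disconnect();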

How to Ignore the content-length exception while writing (downloading) a file from server response?

I am making a request to a server, which I have no control over. It returns a downloadable response. I am downloading the file in client as follows
File backupFile = new File("Download.zip");
CloseableHttpResponse response = ...;
try (InputStream inputStream = response.getEntity().getContent()) {
    try (FileOutputStream fos = new FileOutputStream(backupFile)) {
        int inByte;
        while ((inByte = inputStream.read()) != -1) {
            fos.write(inByte);
        }
    }
}
I am getting the following exception:
Premature end of Content-Length delimited message body (expected: 548846; received: 536338)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:142)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:120)
Premature end of Content-Length delimited message body (expected:
I went through the above SO question, but that question and its answers address a serious bug, where the server doesn't deliver what it promised. Also, I am not closing the client before the download is complete.
In my case, the file (zip file) is perfect; it is just that the size estimate is off by a tiny fraction.
I reported this to the server maintainer, but I was wondering if there is a way to ignore this exception, assuming the checks on the downloaded file are done on my side.
Assuming the file is complete as is, you can simply catch the exception, close the stream, and the file should have been written in its entirety as the server sent it. Of course, if the file is only partially complete, you won't be able to open it as a zip file in any context, so do make sure the file really is correct as sent and that it is only a problem of the Content-Length header.
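A sketch of that idea layered on the question's copy loop. Which exception class HttpClient actually throws here is an assumption on my part (ConnectionClosedException is what the 4.x ContentLengthInputStream raises for this message), so verify it against your stack trace:

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import org.apache.http.ConnectionClosedException;

File backupFile = new File("Download.zip");
try (InputStream inputStream = response.getEntity().getContent();
     FileOutputStream fos = new FileOutputStream(backupFile)) {
    int inByte;
    try {
        while ((inByte = inputStream.read()) != -1) {
            fos.write(inByte);
        }
    } catch (ConnectionClosedException e) {
        // The server's Content-Length was slightly off; everything received so far
        // has already been written to the file, so log it and carry on.
        System.err.println("Ignoring content-length mismatch: " + e.getMessage());
    }
}
// Afterwards, verify the result yourself, e.g. by opening it with java.util.zip.ZipFile.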

amazon s3 upload file time out

I have a JPG file of 800 KB. I try to upload it to S3 and keep getting a timeout error.
Can you please figure out what is wrong? 800 KB is rather small for an upload.
Error Message: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
HTTP Status Code: 400
AWS Error Code: RequestTimeout
Long contentLength = null;
System.out.println("Uploading a new object to S3 from a file\n");
try {
    byte[] contentBytes = IOUtils.toByteArray(is);
    contentLength = Long.valueOf(contentBytes.length);
} catch (IOException e) {
    System.err.printf("Failed while reading bytes from %s", e.getMessage());
}
ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentLength(contentLength);
s3.putObject(new PutObjectRequest(bucketName, key, is, metadata));
Is it possible that IOUtils.toByteArray is draining your input stream so that there is no more data to be read from it when the service call is made? In that case a stream.reset() would fix the issue.
But if you're just uploading a file (as opposed to an arbitrary InputStream), you can use the simpler form of AmazonS3.putObject() that takes a File, and then you won't need to compute the content length at all.
http://docs.amazonwebservices.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3.html#putObject(java.lang.String, java.lang.String, java.io.File)
This will automatically retry any such network errors several times. You can tweak how many retries the client uses by instantiating it with a ClientConfiguration object.
http://docs.amazonwebservices.com/AWSJavaSDK/latest/javadoc/com/amazonaws/ClientConfiguration.html#setMaxErrorRetry(int)
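A sketch combining both suggestions, with placeholder bucket, key, and file names; the builder-style client shown here assumes the v1 SDK:

import java.io.File;
import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

ClientConfiguration config = new ClientConfiguration()
        .withMaxErrorRetry(5); // retry transient network errors a few extra times

AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withClientConfiguration(config)
        .build();

// File-based overload: the SDK determines the content length itself,
// so there is no need to buffer the stream or set metadata manually.
s3.putObject("my-bucket", "photos/photo.jpg", new File("photo.jpg"));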
If your endpoint is behind a VPC it will also silently error out. You can add a new VPC endpoint here for s3
https://aws.amazon.com/blogs/aws/new-vpc-endpoint-for-amazon-s3/
