In my Java application I need to write data to S3 whose size I don't know in advance, and the sizes are usually big, so as recommended in the AWS S3 documentation I am using the Java AWS SDK's low-level (multipart upload) API to write data to the S3 bucket.
In my application I provide S3BufferedOutputStream, an implementation of OutputStream, so that other classes in the app can use this stream to write to the S3 bucket.
I store the incoming data in a buffer, and once the buffered data exceeds the buffer size I upload the buffer's contents as a single UploadPartRequest.
Here is the implementation of the write method of S3BufferedOutputStream
@Override
public void write(byte[] b, int off, int len) throws IOException {
    this.assertOpen();
    int o = off, l = len;
    int size;
    while (l > (size = this.buf.length - position)) {
        System.arraycopy(b, o, this.buf, this.position, size);
        this.position += size;
        flushBufferAndRewind();
        o += size;
        l -= size;
    }
    System.arraycopy(b, o, this.buf, this.position, l);
    this.position += l;
}
The whole implementation is similar to this: code repo
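For context, here is a rough sketch of what the synchronous flushBufferAndRewind() mentioned above might look like with the v1 low-level API; the s3, bucketName, key, uploadId, partETags and partNumber fields are assumptions about the surrounding class, not code from the actual repo:
private void flushBufferAndRewind() {
    // Assumed fields: s3 (AmazonS3), bucketName, key, uploadId, partETags, partNumber, buf, position
    UploadPartRequest uploadRequest = new UploadPartRequest()
            .withBucketName(bucketName)
            .withKey(key)
            .withUploadId(uploadId)                 // from the earlier InitiateMultipartUploadRequest
            .withPartNumber(++partNumber)           // part numbers start at 1
            .withInputStream(new ByteArrayInputStream(buf, 0, position))
            .withPartSize(position);
    // This call blocks until the part is uploaded, which is the bottleneck described below
    partETags.add(s3.uploadPart(uploadRequest).getPartETag());
    this.position = 0;                              // rewind the buffer
}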
My problem is that each UploadPartRequest is executed synchronously, so we have to wait for one part to finish uploading before we can upload the next part. And because I am using the AWS S3 low-level API, I cannot benefit from the parallel uploading provided by the TransferManager.
Is there a way to achieve parallel uploads using the low-level SDK?
Or are there code changes that would let it operate asynchronously without corrupting the uploaded data and while maintaining the order of the parts?
Here's some example code from a class that I have. It submits the parts to an ExecutorService and holds onto the returned Future. This is written for the v1 Java SDK; if you're using the v2 SDK you could use an async client rather than the explicit threadpool:
// WARNING: data must not be updated by caller; make a defensive copy if needed
public synchronized void uploadPart(byte[] data, boolean isLastPart)
{
    partNumber++;
    logger.debug("submitting part {} for s3://{}/{}", partNumber, bucket, key);

    final UploadPartRequest request = new UploadPartRequest()
            .withBucketName(bucket)
            .withKey(key)
            .withUploadId(uploadId)
            .withPartNumber(partNumber)
            .withPartSize(data.length)
            .withInputStream(new ByteArrayInputStream(data))
            .withLastPart(isLastPart);

    futures.add(
        executor.submit(new Callable<PartETag>()
        {
            @Override
            public PartETag call() throws Exception
            {
                int localPartNumber = request.getPartNumber();
                logger.debug("uploading part {} for s3://{}/{}", localPartNumber, bucket, key);
                UploadPartResult response = client.uploadPart(request);
                String etag = response.getETag();
                logger.debug("uploaded part {} for s3://{}/{}; etag is {}", localPartNumber, bucket, key, etag);
                return new PartETag(localPartNumber, etag);
            }
        }));
}
Note: this method is synchronized to ensure that parts are not submitted out of order.
Once you've submitted all of the parts, you use this method to wait for them to finish and then complete the upload:
public void complete()
{
    logger.debug("waiting for upload tasks of s3://{}/{}", bucket, key);

    List<PartETag> partTags = new ArrayList<>();
    for (Future<PartETag> future : futures)
    {
        try
        {
            partTags.add(future.get());
        }
        catch (Exception e)
        {
            throw new RuntimeException(String.format("failed to complete upload task for s3://%s/%s", bucket, key), e);
        }
    }

    logger.debug("completing multi-part upload for s3://{}/{}", bucket, key);

    CompleteMultipartUploadRequest request = new CompleteMultipartUploadRequest()
            .withBucketName(bucket)
            .withKey(key)
            .withUploadId(uploadId)
            .withPartETags(partTags);
    client.completeMultipartUpload(request);

    logger.debug("completed multi-part upload for s3://{}/{}", bucket, key);
}
You'll also need an abort() method that cancels outstanding parts and aborts the upload. This, and the rest of the class, are left as an exercise for the reader.
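If it helps, a minimal sketch of such an abort() method could look like the following; it assumes the same client, bucket, key, uploadId and futures fields used above:
public void abort()
{
    // Cancel parts that haven't started yet; parts already in flight may still complete or fail
    for (Future<PartETag> future : futures)
    {
        future.cancel(true);
    }
    logger.debug("aborting multi-part upload for s3://{}/{}", bucket, key);
    client.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, key, uploadId));
}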
You should look at using the AWS SDK for Java V2. You are referencing V1, not the newest Amazon S3 Java API. If you are not familiar with V2, start here:
Get started with the AWS SDK for Java 2.x
To perform Async operations via the Amazon S3 Java API, you use S3AsyncClient.
Now to learn how to upload an object using this client, see this code example:
import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectResponse;
import java.nio.file.Paths;
import java.util.concurrent.CompletableFuture;
/**
 * To run this AWS code example, ensure that you have set up your development environment, including your AWS credentials.
 *
 * For information, see this documentation topic:
 *
 * https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/get-started.html
 */
public class S3AsyncOps {

    public static void main(String[] args) {

        final String USAGE = "\n" +
                "Usage:\n" +
                "    S3AsyncOps <bucketName> <key> <path>\n\n" +
                "Where:\n" +
                "    bucketName - the name of the Amazon S3 bucket (for example, bucket1). \n\n" +
                "    key - the name of the object (for example, book.pdf). \n" +
                "    path - the local path to the file (for example, C:/AWS/book.pdf). \n";

        if (args.length != 3) {
            System.out.println(USAGE);
            System.exit(1);
        }

        String bucketName = args[0];
        String key = args[1];
        String path = args[2];

        Region region = Region.US_WEST_2;
        S3AsyncClient client = S3AsyncClient.builder()
                .region(region)
                .build();

        PutObjectRequest objectRequest = PutObjectRequest.builder()
                .bucket(bucketName)
                .key(key)
                .build();

        // Put the object into the bucket
        CompletableFuture<PutObjectResponse> future = client.putObject(objectRequest,
                AsyncRequestBody.fromFile(Paths.get(path))
        );
        future.whenComplete((resp, err) -> {
            try {
                if (resp != null) {
                    System.out.println("Object uploaded. Details: " + resp);
                } else {
                    // Handle error
                    err.printStackTrace();
                }
            } finally {
                // Only close the client when you are completely done with it
                client.close();
            }
        });

        future.join();
    }
}
That is uploading an object using the S3AsyncClient client. To perform a multi-part upload, you need to use this method:
https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/S3AsyncClient.html#createMultipartUpload-software.amazon.awssdk.services.s3.model.CreateMultipartUploadRequest-
To see an example of a multipart upload using the S3 sync client, see:
https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/javav2/example_code/s3/src/main/java/com/example/s3/S3ObjectOperations.java
That is your solution: use the S3AsyncClient object's createMultipartUpload method.
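As a rough illustration of how those pieces fit together with the v2 async client, a minimal multipart upload might look like the sketch below. The bucket, key and the single small part are placeholder assumptions; real code would split the data into parts of at least 5 MB (except the last) and keep several uploadPart futures in flight at once:
import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.*;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

public class S3AsyncMultipartSketch {
    public static void main(String[] args) {
        String bucket = args[0];   // placeholder: your bucket name
        String key = args[1];      // placeholder: the destination object key

        S3AsyncClient s3 = S3AsyncClient.builder().region(Region.US_EAST_1).build();

        // 1. Start the multipart upload and remember the upload id
        String uploadId = s3.createMultipartUpload(
                CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build())
                .join()
                .uploadId();

        // 2. Upload parts. Each uploadPart call returns a CompletableFuture,
        //    so several parts can be uploaded in parallel.
        byte[] partData = "example part data".getBytes(StandardCharsets.UTF_8); // placeholder payload
        CompletableFuture<UploadPartResponse> part1 = s3.uploadPart(
                UploadPartRequest.builder()
                        .bucket(bucket).key(key)
                        .uploadId(uploadId)
                        .partNumber(1)
                        .build(),
                AsyncRequestBody.fromBytes(partData));

        CompletedPart completed1 = CompletedPart.builder()
                .partNumber(1)
                .eTag(part1.join().eTag())
                .build();

        // 3. Complete the upload with the collected part numbers and ETags
        s3.completeMultipartUpload(CompleteMultipartUploadRequest.builder()
                .bucket(bucket).key(key)
                .uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(completed1).build())
                .build()).join();

        s3.close();
    }
}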
I want to archive all the files and subdirectories in an S3 directory to some other S3 location using Java. Is there any direct way to copy one S3 directory to another in Java or Scala?
There is no API call to operate on whole directories in Amazon S3.
In fact, directories/folders do not exist in Amazon S3. Rather, each object stores the full path in its filename (Key).
If you wish to copy multiple objects that have the same prefix in their Key, your code will need to loop through the objects, copying one object at a time.
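For example, a simple copy loop with the v1 SDK might look roughly like this (the client, bucket and prefix variables are placeholders):
// Sketch: copy every object under sourcePrefix to destPrefix, one object at a time
ListObjectsV2Request listRequest = new ListObjectsV2Request()
        .withBucketName(sourceBucket)
        .withPrefix(sourcePrefix);
ListObjectsV2Result result;
do {
    result = s3.listObjectsV2(listRequest);
    for (S3ObjectSummary summary : result.getObjectSummaries()) {
        String destKey = destPrefix + summary.getKey().substring(sourcePrefix.length());
        s3.copyObject(sourceBucket, summary.getKey(), destBucket, destKey);
    }
    listRequest.setContinuationToken(result.getNextContinuationToken());
} while (result.isTruncated());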
A bit wordy, but does the job: reasonable logging, multithreading via TransferManager, handling continuation token for "folders" with more than 1000 keys:
/**
 * Copies all content from s3://sourceBucketName/sourceFolder to s3://destinationBucketName/destinationFolder.
 */
public void copyAll(String sourceBucketName, String sourceFolder, String destinationBucketName, String destinationFolder) {
    log.info("Copying data from s3://{}/{} to s3://{}/{}", sourceBucketName, sourceFolder, destinationBucketName, destinationFolder);

    TransferManager transferManager = TransferManagerBuilder.standard()
            .withS3Client(client)
            .build();
    try {
        ListObjectsV2Request request = new ListObjectsV2Request()
                .withBucketName(sourceBucketName)
                .withPrefix(sourceFolder);
        ListObjectsV2Result objects;
        do {
            objects = client.listObjectsV2(request);
            List<Copy> transfers = new ArrayList<>();
            for (S3ObjectSummary object : objects.getObjectSummaries()) {
                String sourceKey = object.getKey();
                String sourceRelativeKey = sourceKey.substring(sourceFolder.length());
                String destinationKey = destinationFolder + sourceRelativeKey;
                transfers.add(transferManager.copy(sourceBucketName, sourceKey, destinationBucketName, destinationKey));
            }
            for (Copy transfer : transfers) {
                log.debug(transfer.getDescription());
                transfer.waitForCompletion();
            }
            log.info("Copied batch of {} objects. Last object: {}", transfers.size(), transfers.isEmpty() ? "None" : transfers.get(transfers.size() - 1).getDescription());
            request.setContinuationToken(objects.getNextContinuationToken());
        } while (objects.isTruncated());
        log.info("Copy operation completed successfully from s3://{}/{} to s3://{}/{}", sourceBucketName, sourceFolder, destinationBucketName, destinationFolder);
    } catch (InterruptedException e) {
        // Resetting interrupt flag and returning control to the caller.
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
    } finally {
        transferManager.shutdownNow(false);
    }
}
I am getting 20k small XML files, 1 KB to 3 KB in size, per minute.
I have to upload all the files as they arrive in the directory.
Sometimes the incoming rate increases to 100k files per minute.
Is there anything in Java or the AWS API that can help me match the incoming speed?
I am using the uploadFileList() API to upload all the files.
I have tried a watch event as well, so that whenever a file arrives in the folder it is uploaded to S3, but that is very slow compared to the incoming rate and creates a huge backlog.
I have tried multithreading too, but if I spin up more threads I get a "reduce your request rate" error from S3.
And sometimes I also get the error below:
AmazonServiceException:
com.amazonaws.services.s3.model.AmazonS3Exception: Your socket
connection to the server was not read from or written to within the
timeout period. Idle connections will be closed.
But when I do not use threading I do not get this error.
Another approach I have tried is to create one big file, upload it to S3, and then split it back into small files in S3. That works, but it delays the upload to S3 and impacts the users who access these files from S3.
I know uploading small files to S3 is not ideal, but I have a use case like that.
The speed I currently see is about 5k files uploaded per minute.
Can someone please suggest an alternative approach so that my upload speed increases to at least 15k files per minute?
I am sharing my full code, where I am trying to upload using a multithreaded application.
Class one, where I create the list of Files to hand to the threads:
public class FileProcessThreads {

    public ArrayList process(String fileLocation) {

        File dir = new File(fileLocation);
        File[] directoryListing = dir.listFiles();
        ArrayList<File> files = new ArrayList<File>();

        if (directoryListing.length > 0) {
            for (File path : directoryListing) {
                files.add(path);
            }
        }
        return files;
    }
}
Class 2, where I create the thread pool and Executor:
public class UploadExecutor {

    private static String fileLocation = "C:\\Users\\u6034690\\Desktop\\ONEFILE";
    // private static String fileLocation="D:\\TRFAudits_Moved\\";
    private static final String _logFileName = "s3FileUploader.log";
    private static Logger _logger = Logger.getLogger(UploadExecutor.class);

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        _logger.info("----------Starting application's main method----------------- ");
        AWSCredentials credential = new ProfileCredentialsProvider("TRFAuditability-Prod-ServiceUser").getCredentials();
        final ClientConfiguration config = new ClientConfiguration();
        AmazonS3Client s3Client = (AmazonS3Client) AmazonS3ClientBuilder.standard().withRegion("us-east-1")
                .withCredentials(new AWSStaticCredentialsProvider(credential)).withForceGlobalBucketAccessEnabled(true)
                .build();
        s3Client.getClientConfiguration().setMaxConnections(100);
        TransferManager tm = new TransferManager(s3Client);
        while (true) {
            FileProcessThreads fp = new FileProcessThreads();
            List<File> records = fp.process(fileLocation);
            while (records.size() <= 0) {
                try {
                    _logger.info("No records found, will wait for 10 Seconds");
                    TimeUnit.SECONDS.sleep(10);
                    records = fp.process(fileLocation);
                } catch (InterruptedException e) {
                    _logger.error("InterruptedException: " + e.toString());
                }
            }
            _logger.info("Total no of Audit files = " + records.size());
            ExecutorService es = Executors.newFixedThreadPool(2);
            int recordsInEachThread = (int) (records.size() / 2);
            _logger.info("No of records in each thread = " + recordsInEachThread);
            UploadObject my1 = new UploadObject(records.subList(0, recordsInEachThread), tm);
            UploadObject my2 = new UploadObject(records.subList(recordsInEachThread, records.size()), tm);
            es.execute(my1);
            es.execute(my2);
            es.shutdown();
            try {
                boolean finished = es.awaitTermination(1, TimeUnit.MINUTES);
                if (!finished) {
                    Thread.sleep(1000);
                }
            } catch (InterruptedException e) {
                _logger.error("InterruptedException: " + e.toString());
            }
        }
    }
}
The last class, where I upload the files to S3:
public class UploadObject implements Runnable {

    static String bucketName = "a205381-auditxml/S3UPLOADER";
    private String fileLocation = "C:\\Users\\u6034690\\Desktop\\ONEFILE";
    // private String fileLocation="D:\\TRFAudits\\";
    // static String bucketName = "a205381-auditxml/S3UPLOADER";
    private static Logger _logger;
    List<File> records;
    TransferManager tm;

    UploadObject(List<File> list, TransferManager tm) {
        this.records = list;
        this.tm = tm;
        _logger = Logger.getLogger(UploadObject.class);
    }

    public void run() {
        System.out.println(Thread.currentThread().getName() + " : ");
        uploadToToS3();
    }

    public void uploadToToS3() {
        _logger.info("Number of records to be processed in current thread: " + records.size());
        MultipleFileUpload xfer = tm.uploadFileList(bucketName, "TEST", new File(fileLocation), records);
        try {
            xfer.waitForCompletion();
            TransferState xfer_state = xfer.getState();
            _logger.info("Upload status -----------------" + xfer_state);
            for (File file : records) {
                try {
                    Files.delete(FileSystems.getDefault().getPath(file.getAbsolutePath()));
                } catch (IOException e) {
                    _logger.error("IOException: " + e.toString());
                    System.exit(1);
                }
            }
            _logger.info("Successfully completed file cleanse");
        } catch (AmazonServiceException e) {
            _logger.error("AmazonServiceException: " + e.toString());
            System.exit(1);
        } catch (AmazonClientException e) {
            _logger.error("AmazonClientException: " + e.toString());
            System.exit(1);
        } catch (InterruptedException e) {
            _logger.error("InterruptedException: " + e.toString());
            System.exit(1);
        }
        System.out.println("Completed");
        _logger.info("Upload completed");
        _logger.info("Calling Transfer manager shutdown");
        // tm.shutdownNow();
    }
}
It sounds like you're tripping the built-in protections for S3 (quoted docs below). I've also listed some similar questions below; some of these advise rearchitecting using SQS to even out and distribute the load on S3.
Short of introducing more moving pieces, you can reuse your S3Client and TransferManager. Move them up out of your Runnable object and pass them into its constructor. TransferManager itself uses multithreading, according to the Javadoc:
When possible, TransferManager attempts to use multiple threads to upload multiple parts of a single upload at once. When dealing with large content sizes and high bandwidth, this can have a significant increase on throughput.
You can also increase the max number of simultaneous connections that the S3Client uses.
Maybe:
s3Client.getClientConfiguration().setMaxConnections(75) or even higher.
DEFAULT_MAX_CONNECTIONS is set to 50.
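For example, a sketch of raising the limit up front when building the v1 client (the value 100 is just an illustration):
// Sketch: raise the HTTP connection pool size when the client is built
ClientConfiguration clientConfig = new ClientConfiguration()
        .withMaxConnections(100);   // DEFAULT_MAX_CONNECTIONS is 50

AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
        .withRegion("us-east-1")
        .withClientConfiguration(clientConfig)
        .build();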
Lastly, you could try to upload to different prefixes/folders under the bucket, as noted below to allow scaling for high request rates.
The current AWS Request Rate and Performance Guidelines
Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/POST/DELETE and 5,500 GET requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket. It is simple to increase your read or write performance exponentially. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second.
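For instance, spreading keys across a handful of prefixes could look roughly like this; the prefix count and naming scheme are illustrative assumptions, and tm, bucketName and file come from your existing code:
// Sketch: spread uploads across N prefixes so each prefix can scale independently
int prefixCount = 10;                                        // illustrative
int slot = Math.abs(file.getName().hashCode()) % prefixCount;
String key = "audit/" + slot + "/" + file.getName();         // e.g. audit/3/somefile.xml
tm.upload(bucketName, key, file);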
The current AWS S3 Error Best Practices
Tune Application for Repeated SlowDown errors
As with any distributed system, S3 has protection mechanisms which detect intentional or unintentional resource over-consumption and react accordingly. SlowDown errors can occur when a high request rate triggers one of these mechanisms. Reducing your request rate will decrease or eliminate errors of this type. Generally speaking, most users will not experience these errors regularly; however, if you would like more information or are experiencing high or unexpected SlowDown errors, please post to our Amazon S3 developer forum https://forums.aws.amazon.com/ or sign up for AWS Premium Support https://aws.amazon.com/premiumsupport/.
Similar questions:
S3 SlowDown: Please reduce your request rate exception
Amazon Web Services S3 Request Limit
AWS Forums - Maximizing Connection Reuse for S3 getObjectMetadata() Calls
S3 Transfer Acceleration does not necessarily give faster upload speeds; it is sometimes slower than a normal upload when used from within the same region. Amazon S3 Transfer Acceleration uses the AWS edge infrastructure around the world to get data onto the AWS backbone quicker. When you use Amazon S3 Transfer Acceleration, your request is routed to the best AWS edge location based on latency; Transfer Acceleration then sends your uploads back to S3 over the AWS-managed backbone network using optimized network protocols, persistent connections from edge to origin, fully open send and receive windows, and so forth. As you are already within the region, you wouldn't see any benefit from using it. But it's better to test the speed from https://s3-accelerate-speedtest.s3-accelerate.amazonaws.com/en/accelerate-speed-comparsion.html
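If you do want to test it, enabling acceleration with the v1 SDK looks roughly like this (the bucket name is a placeholder):
// Sketch: enable Transfer Acceleration on the bucket with a regular client...
AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion("us-east-1").build();
s3.setBucketAccelerateConfiguration(new SetBucketAccelerateConfigurationRequest(
        "my-bucket",                        // placeholder bucket name
        new BucketAccelerateConfiguration(BucketAccelerateStatus.Enabled)));

// ...then build an acceleration-enabled client for the actual uploads
AmazonS3 accelerated = AmazonS3ClientBuilder.standard()
        .withRegion("us-east-1")
        .withAccelerateModeEnabled(true)
        .build();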
I need your help as I am new to this field. I want to use the Azure Storage Blob service to upload images, list them, and download them, but I am facing some problems.
I have imported a project from this repository, and as soon as I import it I get these errors:
Description Resource Path Location Type
Archive for required library: 'C:/Users/NUTRIP-DEVLP1/.m2/repository/org/apache/commons/commons-lang3/3.4/commons-lang3-3.4.jar' in project 'blobAzureApp' cannot be read or is not a valid ZIP file blobAzureApp Build path Build Path Problem
Description Resource Path Location Type
The project cannot be built until build path errors are resolved blobAzureApp Unknown Java Problem
Should I run this as a normal Java application or a Maven project? If Maven, how do I run it?
I suggest using the official Java SDK in your Maven project:
<dependency>
<groupId>com.microsoft.azure</groupId>
<artifactId>azure-storage-blob</artifactId>
<version>10.1.0</version>
</dependency>
sample upload code:
static void uploadFile(BlockBlobURL blob, File sourceFile) throws IOException {
    FileChannel fileChannel = FileChannel.open(sourceFile.toPath());

    // Uploading a file to the blobURL using the high-level methods available in TransferManager class
    // Alternatively call the Upload/StageBlock low-level methods from BlockBlobURL type
    TransferManager.uploadFileToBlockBlob(fileChannel, blob, 8 * 1024 * 1024, null)
            .subscribe(response -> {
                System.out.println("Completed upload request.");
                System.out.println(response.response().statusCode());
            });
}
sample list code:
static void listBlobs(ContainerURL containerURL) {
    // Each ContainerURL.listBlobsFlatSegment call return up to maxResults (maxResults=10 passed into ListBlobOptions below).
    // To list all Blobs, we are creating a helper static method called listAllBlobs,
    // and calling it after the initial listBlobsFlatSegment call
    ListBlobsOptions options = new ListBlobsOptions(null, null, 10);

    containerURL.listBlobsFlatSegment(null, options)
            .flatMap(containersListBlobFlatSegmentResponse ->
                    listAllBlobs(containerURL, containersListBlobFlatSegmentResponse))
            .subscribe(response -> {
                System.out.println("Completed list blobs request.");
                System.out.println(response.statusCode());
            });
}

private static Single<ContainersListBlobFlatSegmentResponse> listAllBlobs(ContainerURL url, ContainersListBlobFlatSegmentResponse response) {
    // Process the blobs returned in this result segment (if the segment is empty, blobs() will be null).
    if (response.body().blobs() != null) {
        for (Blob b : response.body().blobs().blob()) {
            String output = "Blob name: " + b.name();
            if (b.snapshot() != null) {
                output += ", Snapshot: " + b.snapshot();
            }
            System.out.println(output);
        }
    } else {
        System.out.println("There are no more blobs to list off.");
    }

    // If there is not another segment, return this response as the final response.
    if (response.body().nextMarker() == null) {
        return Single.just(response);
    } else {
        /*
         IMPORTANT: ListBlobsFlatSegment returns the start of the next segment; you MUST use this to get the next
         segment (after processing the current result segment).
        */
        String nextMarker = response.body().nextMarker();

        /*
         The presence of the marker indicates that there are more blobs to list, so we make another call to
         listBlobsFlatSegment and pass the result through this helper function.
        */
        return url.listBlobsFlatSegment(nextMarker, new ListBlobsOptions(null, null, 1))
                .flatMap(containersListBlobFlatSegmentResponse ->
                        listAllBlobs(url, containersListBlobFlatSegmentResponse));
    }
}
sample download code:
static void getBlob(BlockBlobURL blobURL, File sourceFile) {
    try {
        // Get the blob using the low-level download method in BlockBlobURL type
        // com.microsoft.rest.v2.util.FlowableUtil is a static class that contains helpers to work with Flowable
        blobURL.download(new BlobRange(0, Long.MAX_VALUE), null, false)
                .flatMapCompletable(response -> {
                    AsynchronousFileChannel channel = AsynchronousFileChannel.open(Paths
                            .get(sourceFile.getPath()), StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                    return FlowableUtil.writeFile(response.body(), channel);
                }).doOnComplete(() -> System.out.println("The blob was downloaded to " + sourceFile.getAbsolutePath()))
                // To call it synchronously add .blockingAwait()
                .subscribe();
    } catch (Exception ex) {
        System.out.println(ex.toString());
    }
}
For more details, please refer to this doc. Hope it helps.
Recently I was unable to copy files using s3.copyObject(sourceBucket, sourceKey, destBucket, destKey); for two reasons:
1) The source and destination buckets are in two different regions (us-east-1 and us-east-2 in my case).
2) The server resides in a VPC with an S3 endpoint enabled. An S3 endpoint is an internal connection to S3, but only within the same region.
Given that we are moving large files, we could not download and then re-upload, even temporarily. We also wanted to keep the S3 endpoint in place, because the application makes heavy use of S3 assets once in region.
The solution is to stream copy the files from one stream to another. I wrote this simple function which will handle it.
ZipException is just a custom exception. Throw whatever you want.
Hopefully this helps somebody.
public static void copyObject(AmazonS3 sourceClient, AmazonS3 destClient, String sourceBucket, String sourceKey, String destBucket, String destKey) throws IOException {
    S3ObjectInputStream inStream = null;
    try {
        GetObjectRequest request = new GetObjectRequest(sourceBucket, sourceKey);
        S3Object object = sourceClient.getObject(request);
        inStream = object.getObjectContent();

        destClient.putObject(destBucket,
                destKey, inStream, object.getObjectMetadata());
    } catch (SdkClientException e) {
        throw new ZipException("Unable to copy file.", e);
    } finally {
        if (inStream != null) {
            inStream.close();
        }
    }
}
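As a quick usage illustration (the clients, bucket names and keys below are placeholders, not from the original post):
// Hypothetical usage: one client per region, then stream-copy a single object
AmazonS3 sourceClient = AmazonS3ClientBuilder.standard().withRegion("us-east-1").build();
AmazonS3 destClient = AmazonS3ClientBuilder.standard().withRegion("us-east-2").build();

copyObject(sourceClient, destClient,
        "source-bucket", "path/to/large-file.zip",
        "dest-bucket", "path/to/large-file.zip");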
I am trying to get the file size (content length) using the Amazon S3 Java SDK.
public Long getObjectSize(AmazonS3Client amazonS3Client, String bucket, String key)
        throws IOException {
    Long size = null;
    S3Object object = null;
    try {
        object = amazonS3Client.getObject(bucket, key);
        size = object.getObjectMetadata().getContentLength();
    } finally {
        if (object != null) {
            //object.close();
        }
    }
    return size;
}
1. This results in 50 calls (the connection pool size); after that I start getting connection pool errors.
2. If that line is uncommented, the calls take a very long time.
I followed this and this, but I'm not sure what I am doing wrong here.
Any help on this?
I'm guessing at what your actual question is, but I think you can reduce your code and eliminate the need to create an S3Object at all by doing something like:
public Long getObjectSize(AmazonS3Client amazonS3Client, String bucket, String key)
        throws IOException {
    return amazonS3Client.getObjectMetadata(bucket, key).getContentLength();
}
That should remove the need to call object.close() which you appear to be having issues with.
For v2 of the Amazon S3 Java SDK, try something like this:
HeadObjectRequest headObjectRequest =
HeadObjectRequest.builder()
.bucket(bucket)
.key(key)
.build();
HeadObjectResponse headObjectResponse =
s3Client.headObject(headObjectRequest);
Long contentLength = headObjectResponse.contentLength();
So we have two SDKs.
For v1 of the Amazon S3 Java SDK, use:
client.getObjectMetadata(bucket, key).getContentLength();
where client is an instance of AmazonS3 coming from import com.amazonaws.services.s3.AmazonS3; and the Gradle dependency implementation 'com.amazonaws:aws-java-sdk-s3:1.12.353'.
For v2 of the Amazon S3 Java SDK, use:
return client.headObject(HeadObjectRequest.builder().bucket(bucket).key(key).build()).contentLength();
where client is an instance of S3Client coming from import software.amazon.awssdk.services.s3.S3Client; and the Gradle dependency implementation 'software.amazon.awssdk:s3:2.18.35'.