I have the following code for uploading files to Amazon S3:
AmazonS3Client client = new AmazonS3Client(credentials,
        new ClientConfiguration().withMaxConnections(100)
                                 .withConnectionTimeout(120 * 1000)
                                 .withMaxErrorRetry(15));
TransferManager tm = new TransferManager(client);
TransferManagerConfiguration configuration = new TransferManagerConfiguration();
configuration.setMultipartUploadThreshold(MULTIPART_THRESHOLD);
tm.setConfiguration(configuration);

Upload upload = tm.upload(bucket, key, file);
try {
    upload.waitForCompletion();
} catch (InterruptedException ex) {
    logger.error(ex.getMessage());
} finally {
    tm.shutdownNow(false);
}
It works, but some uploads (around 1 GB) produce the following log message:
INFO AmazonHttpClient:Unable to execute HTTP request: bucket-name.s3.amazonaws.com failed to respond
org.apache.http.NoHttpResponseException: bucket-name.s3.amazonaws.com failed to respond
I have tried creating the TransferManager without an explicit AmazonS3Client, but it doesn't help.
Is there any way to fix this?
The log message is telling you that there was a transient error sending data to S3. You've configured .withMaxErrorRetry(15), so the AmazonS3Client is transparently retrying the request that failed and the overall upload is succeeding.
There isn't necessarily anything to fix here - sometimes packets get lost on the network, especially if you're trying to push through a lot of packets at once. Waiting a little while and retrying is usually the right way to deal with this, and that's what's already happening.
If you wanted, you could try turning down the MaxConnections setting to limit how many chunks of the file will be uploaded at a time - there's probably a sweet spot where you're still getting reasonable throughput, but not overloading the network.
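For example, a lower connection cap on the same ClientConfiguration from the question could look like this (25 is an arbitrary value to illustrate the idea; the right number depends on your network):

AmazonS3Client client = new AmazonS3Client(credentials,
        new ClientConfiguration()
                .withMaxConnections(25)        // fewer parts in flight at once
                .withConnectionTimeout(120 * 1000)
                .withMaxErrorRetry(15));
TransferManager tm = new TransferManager(client);

Start near your current setting and reduce the value until the NoHttpResponseException retries stop showing up in the logs, while watching that throughput stays acceptable.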
We are using Java 8 and the AWS SDK to programmatically upload files to AWS S3. For uploading large files (>100 MB), we read that the preferred method is multipart upload. We tried it, but it does not seem to speed things up; the upload time remains almost the same as without multipart upload. Worse, we even encountered out-of-memory errors saying heap space is insufficient.
Questions:
Is using multipart upload really supposed to speed up the upload? If not, why use it?
How come multipart upload eats up memory faster than a regular upload? Does it upload all the parts concurrently?
See below for the code we used:
private static void uploadFileToS3UsingBase64(String bucketName, String region, String accessKey, String secretKey,
        String fileBase64String, String s3ObjectKeyName) {
    byte[] bI = org.apache.commons.codec.binary.Base64.decodeBase64((fileBase64String.substring(fileBase64String.indexOf(",") + 1)).getBytes());
    InputStream fis = new ByteArrayInputStream(bI);
    long start = System.currentTimeMillis();
    AmazonS3 s3Client = null;
    TransferManager tm = null;
    try {
        s3Client = AmazonS3ClientBuilder.standard().withRegion(region)
                .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(accessKey, secretKey)))
                .build();
        tm = TransferManagerBuilder.standard()
                .withS3Client(s3Client)
                .withMultipartUploadThreshold((long) (50 * 1024 * 1025))
                .build();
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setHeader(Headers.STORAGE_CLASS, StorageClass.Standard);
        PutObjectRequest putObjectRequest = new PutObjectRequest(bucketName, s3ObjectKeyName,
                fis, metadata).withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams());
        Upload upload = tm.upload(putObjectRequest);
        // Optionally, wait for the upload to finish before continuing.
        upload.waitForCompletion();
        long end = System.currentTimeMillis();
        long duration = (end - start) / 1000;
        // Log status
        System.out.println("Successful upload in S3 multipart. Duration = " + duration);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (s3Client != null)
            s3Client.shutdown();
        if (tm != null)
            tm.shutdownNow();
    }
}
Using multipart will only speed up the upload if you upload multiple parts at the same time.
In your code you're setting withMultipartUploadThreshold. If your upload size is larger than that threshold, you should observe concurrent uploading of separate parts; if it is not, only one upload connection should be used. You say you have a >100 MB file, and in your code you have 50 * 1024 * 1025 = 52,480,000 bytes as the multipart upload threshold, so concurrent uploading of parts of that file should have been happening.
However, if your upload throughput is already capped by your network speed, there will not be any increase in throughput. This might be the reason you're not observing any speed increase.
There are other reasons to use multipart upload too: it is recommended for fault tolerance, and it also supports a larger maximum object size than a single-operation upload.
For more details, see the documentation:
Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.
Using multipart upload provides the following advantages:
Improved throughput - You can upload parts in parallel to improve throughput.
Quick recovery from any network issues - Smaller part size minimizes the impact of restarting a failed upload due to a network error.
Pause and resume object uploads - You can upload object parts over time. After you initiate a multipart upload, there is no expiry; you must explicitly complete or stop the multipart upload.
Begin an upload before you know the final object size - You can upload an object as you are creating it.
We recommend that you use multipart upload in the following ways:
If you're uploading large objects over a stable high-bandwidth network, use multipart upload to maximize the use of your available bandwidth by uploading object parts in parallel for multi-threaded performance.
If you're uploading over a spotty network, use multipart upload to increase resiliency to network errors by avoiding upload restarts. When using multipart upload, you need to retry uploading only parts that are interrupted during the upload. You don't need to restart uploading your object from the beginning.
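As a rough illustration of the knobs discussed above, a TransferManager configured for parallel part uploads might look like this (the threshold and part size are arbitrary example values, not recommendations):

TransferManager tm = TransferManagerBuilder.standard()
        .withS3Client(s3Client)
        .withMultipartUploadThreshold(50L * 1024 * 1024)   // files above ~50 MB are split into parts
        .withMinimumUploadPartSize(10L * 1024 * 1024)      // each part is at least ~10 MB
        .build();

With values like these, a 100 MB upload would be split into roughly ten parts that the TransferManager's internal thread pool can upload concurrently.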
The answer from eis is very good. Though you should still take some action:
String.getBytes(StandardCharsets.US_ASCII) or ISO_8859_1 prevents using a more costly encoding like UTF-8. If the platform encoding were UTF-16LE, the data would even be corrupted (0x00 bytes).
The standard java.util.Base64 class has decoders/encoders that might work. It can operate on a String directly. However, check that it handles your input correctly (line endings).
try-with-resources also closes the resource in case of exceptions or early returns.
The ByteArrayInputStream was never closed; closing it would have been better style (easier garbage collection?).
You could set the ExecutorFactory to a thread pool factory that limits the number of threads globally (see the sketch after the snippet below).
So
byte[] bI = Base64.getDecoder().decode(
fileBase64String.substring(fileBase64String.indexOf(',') + 1));
try (InputStream fis = new ByteArrayInputStream(bI)) {
...
}
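And, as mentioned above, a sketch of capping the TransferManager's upload threads globally via an ExecutorFactory (the pool size of 5 is only an example value; tune it to your memory and bandwidth budget):

TransferManager tm = TransferManagerBuilder.standard()
        .withS3Client(s3Client)
        .withMultipartUploadThreshold((long) (50 * 1024 * 1025))
        .withExecutorFactory(() -> Executors.newFixedThreadPool(5))   // bounds the threads used for part uploads
        .build();

A smaller pool directly limits how many parts are buffered and sent at once, which also helps with the heap-space errors you mentioned.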
So I am developing two services. One service handles files on the filesystem and returns them via a REST interface.
ServiceA:
File zipFile = new File(folderDownload.getZipPath());
try {
    final InputStream inputStream = new BufferedInputStream(new FileInputStream(zipFile));
    return ResponseEntity
            .status(HttpStatus.OK)
            .eTag(folderDownload.getDownloadId())
            .contentLength(zipFile.length())
            .lastModified(Instant.ofEpochMilli(zipFile.lastModified()))
            .body(new InputStreamResource(inputStream));
} catch (Exception e) {
    throw new DownloadException(zipFile.getName(), e.getMessage(), 400);
}
ServiceB is a public-facing API that receives requests and validates them; if a request is valid, it retrieves the file from serviceA and returns it to the client. For security reasons, serviceA can only interact with serviceB, so there is no way around sending the file through two services...
WebClient client = WebClient.create(STORAGE_HOST);

// Request serviceA to get the file data
Flux<DataBuffer> fileDataStream = client.get()
        .uri(uriBuilder -> uriBuilder.path("folders/zip/download").queryParam("requestId", requestId).build())
        .accept(MediaType.APPLICATION_OCTET_STREAM)
        .retrieve()
        .bodyToFlux(DataBuffer.class);

// Write the stream to the output stream instead of loading it all into memory
DataBufferUtils.write(fileDataStream, outputStream).map(DataBufferUtils::release).then().block();
When downloading smaller files, everything is OK.
When downloading larger files, I was getting an OutOfMemoryError on serviceB. I started using WebClient instead of RestTemplate and those problems went away. But now I am facing a new problem:
If the file is large, the request times out before completing, so the user receives an incomplete file. What is the solution here?
The exception I see in console on serviceB:
2020-04-09 13:40:24.138 WARN 4106 --- [io-8081-exec-10] s.a.s.e.CustomGlobalExceptionHandler : Async request timed out
2020-04-09 13:40:24.138 WARN 4106 --- [io-8081-exec-10] .m.m.a.ExceptionHandlerExceptionResolver : Resolved [org.springframework.web.context.request.async.AsyncRequestTimeoutException]
Also, I find it weird that when downloading directly from serviceA I get no loading indicator...
I have been stuck on this problem for the past week and I am slowly losing my mind. Please help :(
Kind regards
I have Kafka records:
ConsumerRecords<String, Events> records = kafkaConsumer.poll(POLL_TIMEOUT);
I want to run the below code using parallel streams, not multithreading.
records.forEach((record) -> {
    Event event = record.value();
    HTTPSend.send(event);
});
I tried multithreading, but I want to try parallel streams:
for (ConsumerRecord<String, Event> record : records) {
    executor.execute(new Runnable() {
        @Override
        public void run() {
            HTTPSend.send(record.value());
        }
    });
}
Actually, I'm facing an issue with HTTPSend.send under multithreading (even with a thread pool of one thread). I'm getting:
"Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target".
This is a request over HTTPS. The error occurs only the first time a request is made; afterwards, the exception vanishes. Poof!
For multithreading I'm using:
int threadCount = 1;
BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(threadCount, true);
RejectedExecutionHandler handler = new ThreadPoolExecutor.CallerRunsPolicy();
ExecutorService executor = new ThreadPoolExecutor(threadCount, threadCount, 0L, TimeUnit.MILLISECONDS, queue, handler);
HTTPSend.send() is:
long sizeSend = 0;
SSLContext sc = null;
try {
    sc = SSLContext.getInstance("TLS");
    sc.init(null, TRUST_ALL_CERTS, new SecureRandom());
} catch (NoSuchAlgorithmException | KeyManagementException e) {
    LOGGER.error("Failed to create SSL context", e);
}

// Ignore differences between given hostname and certificate hostname
HostnameVerifier hv = (hostname, session) -> true;

// Create the REST client and configure it to connect meta
Client client = ClientBuilder.newBuilder()
        .hostnameVerifier(hv)
        .sslContext(sc).build();

WebTarget baseTarget = client.target(getURL()).path(HTTP_PATH);
Response jsonResponse = null;
try {
    StringBuilder eventsBatchString = new StringBuilder();
    eventsBatchString.append(this.getEvent(event));
    Entity<String> entity = Entity.entity(eventsBatchString.toString(), MediaType.APPLICATION_JSON_TYPE);
    Invocation.Builder builder = baseTarget.request();
    LOGGER.debug("about to send the event {} and URL {}", entity, getURL());
    jsonResponse = builder.header(HTTP_ACK_CHANNEL, guid.toString())
            .header("Content-type", MediaType.APPLICATION_JSON)
            .header("Authorization", String.format("Meta %s", eventsModuleConfig.getSecretKey()))
            .post(entity);
I see what you want to do, and I'm not sure that's the best idea (I'm also not sure it's not).
The poll / commit model of Kafka allows simple backpressure and lets you pick up from the last item processed if you crash. By returning to your poll loop "immediately" you are telling Kafka "I am ready for more", and committing the offset (manually or automatically) tells Kafka that you have successfully read up to that point.
What you seem to want to do is read off Kafka as fast as possible, commit the offsets, put the records into an executor queue, and then balance your requests per second from there.
I'm not 100% sure that's a good idea: what happens if your app crashes? You may have committed some Kafka messages that actually didn't make it upstream. If you really do want to do this, I would suggest manually committing the offset (via commitSync) upon completion of the Runnable, instead of letting the high-level consumer do it for you (there's a sketch at the end of this answer).
Why might you want to use a thread executor? I think the same goals can be accomplished with Kafka itself:
You may want to post multiple messages to the web server at the same time. A well-partitioned Kafka topic will let multiple consumers / consumer groups consume multiple partitions, and thus - assuming a perfectly scaling HTTP server - would let you parallelize the posting of messages to your server. Yay for process-based concurrency!
Maybe the web server is not perfectly scalable, or is slow for this request (say each request takes one second): then you need to limit the number of requests per second the web server receives, and with a queue you might have a couple of threads posting without backing up Kafka.
In this case you can set max.poll.records to a scalable value that your web server requires. There's probably a better way to do this too, although it's escaping me at the moment.
If your web server takes a long time to respond you may get errors related to failing heartbeats. In that case I direct you to this SO answer on the timeout / heartbeat topic.
Instead of using a thread executor, thus making synchronous HTTP requests appear to be async, I would use an evented HTTP client like Netty, achieving parallelism without thread-based concurrency.
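For reference, here is a rough sketch of the manual-commit arrangement described above, reusing the kafkaConsumer, POLL_TIMEOUT, Event and HTTPSend names from the question (no error handling, and it assumes enable.auto.commit is set to false):

ExecutorService pool = Executors.newFixedThreadPool(4);

while (true) {
    ConsumerRecords<String, Event> records = kafkaConsumer.poll(POLL_TIMEOUT);
    List<Future<?>> inFlight = new ArrayList<>();
    for (ConsumerRecord<String, Event> record : records) {
        inFlight.add(pool.submit(() -> HTTPSend.send(record.value())));
    }
    boolean allSent = true;
    for (Future<?> f : inFlight) {
        try {
            f.get();                 // wait for this batch's sends to finish
        } catch (Exception e) {
            allSent = false;         // a send failed; don't commit past it
        }
    }
    if (allSent) {
        kafkaConsumer.commitSync();  // only now tell Kafka the batch was processed
    }
}

That keeps backpressure intact (the next poll only happens after the batch is done) while still posting several records to the web server at once.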
For solving a "slow consumer" use case where you're doing I/O processing, you should use something like Parallel Consumer (PC) to avoid the "head of line blocking" problem you're describing.
By using PC, you can processing all your keys in parallel, regardless of how long it takes to do your I/O.
It also comes with a non blocking Vert.x module which more efficiently uses the CPU.
PC directly solves for this, by sub partitioning the input partitions by key and processing each key in parallel.
It also tracks per record acknowledgement. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).
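A minimal sketch of what that looks like, loosely based on the project README (exact class and method names may differ between versions, and the topic name is a placeholder, so treat this as an outline rather than a drop-in snippet):

ParallelConsumerOptions<String, Event> options = ParallelConsumerOptions.<String, Event>builder()
        .ordering(ProcessingOrder.KEY)       // parallel across keys, ordered within a key
        .maxConcurrency(100)
        .consumer(kafkaConsumer)
        .build();

ParallelStreamProcessor<String, Event> processor =
        ParallelStreamProcessor.createEosStreamProcessor(options);
processor.subscribe(Collections.singletonList("events"));
processor.poll(record -> HTTPSend.send(record.value()));

PC takes care of offset tracking, so a slow HTTP call for one key doesn't block records for other keys.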
I'm trying to insert a large file into Google Drive using google-api-services-drive version v2-rev93-1.16.0-rc.
I've set setChunkSize() to the minimum in order to have my own ProgressListener notified more frequently. The following code is used to insert the file:
File body = new File();
body.setTitle(filetobeuploaded.getName());
body.setMimeType("application/zip");
body.setFileSize(filetobeuploaded.length());
InputStreamContent mediaContent =
new InputStreamContent("application/zip",
new BufferedInputStream(new FileInputStream(filetobeuploaded)));
mediaContent.setLength(filetobeuploaded.length());
Insert insert = drive.files().insert(body, mediaContent);
MediaHttpUploader uploader = insert.getMediaHttpUploader();
uploader.setChunkSize(MediaHttpUploader.MINIMUM_CHUNK_SIZE);
uploader.setProgressListener(new CustomProgressListener(filetobeuploaded));
insert.execute();
After 'a while' (sometimes 200 MB, sometimes 300 MB) I get an IOException:
Exception in thread "main" java.io.IOException: insufficient data written
at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.close(HttpURLConnection.java:3213)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:81)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:960)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequest(MediaHttpUploader.java:482)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:390)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:418)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:460)
Any ideas how to get this code working?
You won't be able to get it working from a frontend because of time constraints. The only reliable way (but a pain) is to do it from a backend using resumable upload, since the backend/task queue may also be shut down while processing chunks.
Does 'a while' happen to mean one hour? In that case you are probably experiencing the following bug:
http://code.google.com/p/gdata-issues/issues/detail?id=5124
That issue only applies to the Drive resumable media upload. Check this reply:
https://stackoverflow.com/a/30796105/4576135
For me this meant "you are doing a POST and specified a content length, but the stream you uploaded wasn't long enough to match that content length" (in my case a ByteArrayInputStream that had already been consumed, so it was effectively exhausted).
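One way to avoid declaring a length the stream can't actually deliver is to hand the client the file itself and let it work out the length. This is only a sketch against the variable names from the question (filetobeuploaded, body, drive); whether it fits depends on why an InputStream was used in the first place:

// FileContent derives its length from the file, so the declared and actual lengths can't drift apart
FileContent mediaContent = new FileContent("application/zip", filetobeuploaded);
Insert insert = drive.files().insert(body, mediaContent);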
I have written an app using the official API to upload to Google Drive, and this works perfectly. However, I can't find any way to cancel an upload. I'm running it in an AsyncTask; so far I have tried setting the Drive object, the file path, the token and a few other variables to null. I've also tried to invalidate the token, but it seems this only removes it from the cache and doesn't actually invalidate it server-side. I've called cancel on the AsyncTask, but once the upload has started there seems to be no way to stop it.
My issue is that if a user starts uploading a 100 MB file there's no way for them to cancel it other than turning their internet connection off and on, which isn't practical. This is my upload code:
java.io.File jFile = new java.io.File(path); //Does not get file just info
File body = setBody(jFile, mimeType);
try
{
    java.io.File mediaFile = new java.io.File(path);
    InputStreamContent mediaContent =
            new InputStreamContent(mimeType,
                    new BufferedInputStream(new FileInputStream(mediaFile)));
    mediaContent.setLength(mediaFile.length());
    request = drive.files().insert(body, mediaContent);
    request.getMediaHttpUploader().setProgressListener(new CustomProgressListener());
    File file = request.execute();
Surely there is a way to cancel an upload?
Great question! I filed a feature request for the ability to cancel a media upload request:
https://code.google.com/p/google-api-java-client/issues/detail?id=671
However, this may be a difficult feature to implement based on the current design of that library.
Another option to consider is to not use AsyncTask and instead implement the multi-threading yourself and abort the running Thread. Not pleasant, but it may be your only option for doing this right now.
In a comment I said that putting the request in a thread and calling interrupt does not work, so I was searching for another way of doing it.
I finally found one, an ugly one: close the InputStream of the uploaded file!
I initialize with:
mediaContent = new InputStreamContent(NOTE_MIME_TYPE,new BufferedInputStream(new FileInputStream(fileContent)));
mediaContent.setLength(fileContent.length());
And in a stop listener I have:
IOUtils.closeQuietly(mediaContent.getInputStream());
For sure it should be possible to do this in a better way!
Regards
I've been looking for an answer for a long time as well. The following is the best I could come up with.
Store the Future object anywhere you want (use Object or Void):
Future<Object> future;
Call this code (Java 8):
try
{
    ExecutorService executor = Executors.newSingleThreadExecutor();
    future = (Future<Object>) executor.submit(() ->
    {
        // start uploading process here
    });
    future.get(); // block until the upload task completes (or is cancelled)
    executor.shutdown();
}
catch (CancellationException | ExecutionException | InterruptedException e)
{}
To cancel, call:
future.cancel(true);
This solution hasn't produced any nasty side effects so far, unlike stopping a thread (which is deprecated), and it guarantees cancellation, unlike interrupting.
I've found a nasty way to do it.
I am developing an Android app, but I assume similar logic can apply to other platforms.
I've set a progress listener on the upload request:
uploader.setProgressListener(new MediaHttpUploaderProgressListener() {
    @Override
    public void progressChanged(MediaHttpUploader uploader) throws IOException {
        if (cancelUpload) {
            // This was the only way I found to abort the upload
            throw new GoogleDriveUploadServiceCancelException("Upload canceled");
        }
        reportProgress(uploader);
    }
});
As you can see, I've used a boolean member that is set from the outside when I need to abort the upload.
If this boolean is set to true, then on the next update of the upload progress the listener throws, which aborts the request.
Again, very ugly, but that was the only way I found to make it work.
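For completeness, the flag itself is just flipped from whatever cancel entry point you have; making it volatile matters because it is written from one thread and read from the uploader's thread (the names here are mine, not from any API):

private volatile boolean cancelUpload = false;

public void cancelUpload() {
    cancelUpload = true;   // the next progressChanged(...) call will throw and abort the request
}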