As part of my web service, I have a picture repository which retrieves an image from Amazon S3 (a datastore) then returns it. This is how the method that does this looks:
File getPicture(String path) throws IOException {
File file = File.createTempFile(path, ".png");
S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, path));
IOUtils.copy(object.getObjectContent(), new FileOutputStream(file));
return file;
}
The problem is that it takes way too long to get a response from the service (a 3 MB image took 7.5 seconds to download). I noticed that if I comment out the IOUtils.copy() line, the response is significantly faster, so it must be that particular call that's causing the delay.
I've seen this method used in almost all modern examples of converting an S3Object to a file but I seem to be a unique case. Am I missing a trick here?
Appreciate any help!
From the AWS documentation:
public S3Object getObject(GetObjectRequest getObjectRequest)
The returned Amazon S3 object contains a direct stream of data from the HTTP connection. The underlying HTTP connection cannot be reused until the user finishes reading the data and closes the stream.
public S3ObjectInputStream getObjectContent()
Note: The method is a simple getter and does not actually create a stream. If you retrieve an S3Object, you should close this input stream as soon as possible, because the object contents aren't buffered in memory and stream directly from Amazon S3.
If you remove the IOUtils.copy line, the method returns quickly because you never actually consume the stream. If the file is large, downloading it will take time; you can't do much about that unless you can get a better connection to the AWS services.
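That said, it is worth closing both streams so the underlying HTTP connection can be reused and the file handle is released. A minimal sketch of the copy step using try-with-resources (a plain ByteArrayInputStream stands in here for the S3 object content, purely for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopyDemo {
    // Copy an InputStream to a temp file in buffered chunks, closing both
    // streams even if an exception is thrown (try-with-resources).
    static File copyToTempFile(InputStream in, String prefix) throws IOException {
        File file = File.createTempFile(prefix, ".png");
        try (InputStream is = in; OutputStream os = new FileOutputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = is.read(buf)) != -1) {
                os.write(buf, 0, n);
            }
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3, 4, 5};
        File f = copyToTempFile(new ByteArrayInputStream(data), "demo");
        System.out.println(f.length()); // 5
        f.delete();
    }
}
```

With the real SDK you would pass `object.getObjectContent()` as the input stream; the resource handling is the same.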
Related
We are using Java 8 and the AWS SDK to programmatically upload files to AWS S3. For uploading large files (>100 MB), we read that the preferred method is multipart upload. We tried that, but it does not seem to speed things up; the upload time remains almost the same as without multipart upload. Worse, we even encountered OutOfMemoryError exceptions saying heap space is insufficient.
Questions:
Is using multipart upload really supposed to speed up the upload? If not, why use it?
How come multipart upload eats up memory faster than not using it? Does it upload all the parts concurrently?
See below for the code we used:
private static void uploadFileToS3UsingBase64(String bucketName, String region, String accessKey, String secretKey,
String fileBase64String, String s3ObjectKeyName) {
byte[] bI = org.apache.commons.codec.binary.Base64.decodeBase64((fileBase64String.substring(fileBase64String.indexOf(",")+1)).getBytes());
InputStream fis = new ByteArrayInputStream(bI);
long start = System.currentTimeMillis();
AmazonS3 s3Client = null;
TransferManager tm = null;
try {
s3Client = AmazonS3ClientBuilder.standard().withRegion(region)
.withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(accessKey, secretKey)))
.build();
tm = TransferManagerBuilder.standard()
.withS3Client(s3Client)
.withMultipartUploadThreshold((long) (50* 1024 * 1025))
.build();
ObjectMetadata metadata = new ObjectMetadata();
metadata.setHeader(Headers.STORAGE_CLASS, StorageClass.Standard);
PutObjectRequest putObjectRequest = new PutObjectRequest(bucketName, s3ObjectKeyName,
fis, metadata).withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams());
Upload upload = tm.upload(putObjectRequest);
// Optionally, wait for the upload to finish before continuing.
upload.waitForCompletion();
long end = System.currentTimeMillis();
long duration = (end - start)/1000;
// Log status
System.out.println("Successful upload in S3 multipart. Duration = " + duration);
} catch (Exception e) {
e.printStackTrace();
} finally {
if (s3Client != null)
s3Client.shutdown();
if (tm != null)
tm.shutdownNow();
}
}
Using multipart will only speed up the upload if you upload multiple parts at the same time.
In your code you're setting withMultipartUploadThreshold. If your upload size is larger than that threshold, you should observe concurrent upload of separate parts; if it is not, only one upload connection is used. You say you have a >100 MB file, and in your code you have 50 * 1024 * 1025 = 52 480 000 bytes as the multipart upload threshold, so concurrent upload of parts of that file should have been happening.
However, if your upload throughput is anyway capped by your network speed, there would not be any increase in throughput. This might be the reason you're not observing any speed increase.
There are other reasons to use multipart upload too: it is recommended for fault-tolerance reasons as well, and it supports a larger maximum object size than a single upload.
For more details see documentation:
Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.
Using multipart upload provides the following advantages:
Improved throughput - You can upload parts in parallel to improve throughput.
Quick recovery from any network issues - Smaller part size minimizes the impact of restarting a failed upload due to a network error.
Pause and resume object uploads - You can upload object parts over time. After you initiate a multipart upload, there is no expiry; you must explicitly complete or stop the multipart upload.
Begin an upload before you know the final object size - You can upload an object as you are creating it.
We recommend that you use multipart upload in the following ways:
If you're uploading large objects over a stable high-bandwidth network, use multipart upload to maximize the use of your available bandwidth by uploading object parts in parallel for multi-threaded performance.
If you're uploading over a spotty network, use multipart upload to increase resiliency to network errors by avoiding upload restarts. When using multipart upload, you need to retry uploading only parts that are interrupted during the upload. You don't need to restart uploading your object from the beginning.
The answer from eis is very fine. Though you should still take some action:
String.getBytes(StandardCharsets.US_ASCII) or ISO_8859_1 prevents using a more costly encoding like UTF-8. If the platform encoding were UTF-16LE, the data would even be corrupt (0x00 bytes).
The standard Java Base64 class has several de-/encoders that should work, and it can operate on a String. However, check that the variant is handled correctly (line endings).
try-with-resources also closes the resource in case of exceptions/internal returns.
The ByteArrayInputStream was not closed; closing it would have been better style (easier garbage collection).
You could set the ExecutorFactory to a thread-pool factory limiting the number of threads globally.
So
byte[] bI = Base64.getDecoder().decode(
fileBase64String.substring(fileBase64String.indexOf(',') + 1));
try (InputStream fis = new ByteArrayInputStream(bI)) {
...
}
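For reference, a self-contained version of that decode step with the standard java.util.Base64 decoder (the data-URI string below is a made-up example, not data from the original question):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64DecodeDemo {
    // Strip the "data:...;base64," prefix and decode the remainder with
    // the standard (non-MIME) Base64 alphabet.
    static byte[] decodeDataUri(String dataUri) {
        String payload = dataUri.substring(dataUri.indexOf(',') + 1);
        return Base64.getDecoder().decode(payload);
    }

    public static void main(String[] args) throws IOException {
        String dataUri = "data:text/plain;base64,SGVsbG8="; // encodes "Hello"
        byte[] bytes = decodeDataUri(dataUri);
        // try-with-resources closes the stream even on an exception
        try (InputStream fis = new ByteArrayInputStream(bytes)) {
            System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // Hello
        }
    }
}
```

If the input may contain line breaks (MIME-style Base64), use `Base64.getMimeDecoder()` instead.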
I'm quite new to AWS. I have designed an architecture that uses API Gateway to call a Lambda function written in Java. Since I have some configuration, I decided to put a standard Java configuration file in S3 and load it when needed. This takes a lot of time, about 15 seconds, for a very small file.
To read the file I'm using the AmazonS3Client class. Do I have other options?
long ms = System.nanoTime();
AmazonS3Client client = new AmazonS3Client(new DefaultAWSCredentialsProviderChain());
GetObjectRequest request = new GetObjectRequest("bucket","filepath");
InputStream inputStream = client.getObject(request).getObjectContent();
try {
PropertiesConfiguration p = new PropertiesConfiguration();
p.load(inputStream);
composite.addConfiguration(p);
log.debug(String.format("Configuration read in %f ms", (System.nanoTime() - ms) / 1000000f));
} catch (ConfigurationException e) {
log.error("error reading configuration on S3: " + e);
}
So the questions: if storing the config file in an S3 bucket is a bad idea, where should a configuration be stored?
Is that performance normal? I'm thinking of using S3 a lot elsewhere in my architecture, but a 15-second handshake for a file is, of course, unacceptable.
In this case I think you should try storing it on EBS instead. It will cost you more, because EBS is optimized for I/O; EBS storage is organized into volumes, and once an EBS volume is attached to a server it is treated like a local disk drive.
I'm making a client server application in Java. In short, the Server has some files. The Client can send a file to the Server and the Client can request to download all the files from the Server. I'm using RMI for the Server and Client to communicate and I'm using the RMI IO library to send files between Client and Server.
Some example code:
Server:
class Server implements ServerService {
// private Map<String, File> files;
private ConcurrentHashMap<String, File> files; // Solution
// adding a file to the server
public synchronized void addFile(RemoteInputStream inFile, String filename)
throws RemoteException, IOException {
// From RMI IO library
InputStream istream = RemoteInputStreamClient.wrap(inFile);
File f = new File(dir, filename);
FileOutputStream ostream = new FileOutputStream(f);
// read until EOF; available() is not a reliable end-of-stream test
int b;
while ((b = istream.read()) != -1) {
ostream.write(b);
}
istream.close();
ostream.close();
files.put(filename, f);
}
// requesting all files
public void requestFiles(ClientService stub)
throws RemoteException, IOException {
for(File f: files.values()) {
//Open a stream to this file and give it to the Client
RemoteInputStreamServer istream = null;
istream = new SimpleRemoteInputStream(new BufferedInputStream(
new FileInputStream(f)));
stub.receiveFile(istream.export());
}
}
}
Please note that this is just some example code to demonstrate.
My questions concerns concurrent access to the files on the Server. As you can see, I've made the addFile method synchronized because it modifies the resources on my Server. My requestFiles method is not synchronized.
I am wondering if this can cause some trouble. When Client A is adding a File and Client B is at the same time requesting all files, or vice versa, will this cause trouble? Or will the addFile method wait (or make the other method wait) because it is synchronized?
Thanks in advance!
Yes, this could cause trouble. Other threads can execute requestFiles() while a single thread is performing the addFile() method.
It is not possible for two invocations of synchronized methods
on the same object to interleave. When one thread is executing a
synchronized method for an object, all other threads that invoke
synchronized methods for the same object block (suspend execution)
until the first thread is done with the object.
[Source] http://docs.oracle.com/javase/tutorial/essential/concurrency/syncmeth.html
So methods that are declared synchronized lock the instance for all synchronized methods on that instance (in your case, the instance of Server). If you made requestFiles() synchronized as well, you would essentially be synchronizing all access to the Server instance, so you wouldn't have this problem.
You could also use synchronized blocks on the files map. See this Stack Overflow question:
Java synchronized block vs. Collections.synchronizedMap
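As a minimal sketch of that approach (the map contents and method bodies here are illustrative, not taken from the original Server class), both the write and the read synchronize on the same lock object, so they can never interleave:

```java
import java.util.HashMap;
import java.util.Map;

public class SyncMapDemo {
    private final Map<String, String> files = new HashMap<>();

    // Writers and readers synchronize on the same object (the map itself),
    // so a read can never interleave with a put().
    public void addFile(String name, String contents) {
        synchronized (files) {
            files.put(name, contents);
        }
    }

    public int countFiles() {
        synchronized (files) {
            return files.size();
        }
    }

    public static void main(String[] args) {
        SyncMapDemo d = new SyncMapDemo();
        d.addFile("a.txt", "hello");
        d.addFile("b.txt", "world");
        System.out.println(d.countFiles()); // 2
    }
}
```

This locks only around the map operations rather than around the whole method, so the slow file I/O itself can proceed concurrently.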
That being said, a model that locks the entire Server object whenever a file is being written or read hampers a concurrent design.
Depending on the rest of your design, and assuming each file you write with addFile() has a different name (so you are not overwriting files), I would explore something like the following:
Remove the map completely, and have each method interact with the file system separately.
I would use a temporary (.tmp) extension for files being written by addFile(), and then (once the file has been written) perform an atomic rename to change the extension to .txt.
Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
Then restrict the entire 'requestFiles()' method to just '.txt' files. This way file writes and file reads could happen in parallel.
Obviously use whatever extensions you require.
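A sketch of that write-then-rename step (directory and file names here are made up for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicWriteDemo {
    // Write to a .tmp file first, then atomically rename it to .txt, so a
    // reader scanning for .txt files never observes a half-written file.
    static Path writeAtomically(Path dir, String name, byte[] data) throws IOException {
        Path tmp = dir.resolve(name + ".tmp");
        Path dst = dir.resolve(name + ".txt");
        Files.write(tmp, data);
        return Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("scores");
        Path p = writeAtomically(dir, "game1", "hello".getBytes());
        System.out.println(Files.exists(p));                        // true
        System.out.println(Files.exists(dir.resolve("game1.tmp"))); // false
    }
}
```

Note that ATOMIC_MOVE is only guaranteed within the same file system; keep the .tmp file in the same directory as its final destination.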
I'm trying to insert a large file into Google Drive using google-api-services-drive version v2-rev93-1.16.0-rc.
I've set setChunkSize() to the minimum in order to have my own ProgressListener notified more frequently. The following code is used to insert the file:
File body = new File();
body.setTitle(filetobeuploaded.getName());
body.setMimeType("application/zip");
body.setFileSize(filetobeuploaded.length());
InputStreamContent mediaContent =
new InputStreamContent("application/zip",
new BufferedInputStream(new FileInputStream(filetobeuploaded)));
mediaContent.setLength(filetobeuploaded.length());
Insert insert = drive.files().insert(body, mediaContent);
MediaHttpUploader uploader = insert.getMediaHttpUploader();
uploader.setChunkSize(MediaHttpUploader.MINIMUM_CHUNK_SIZE);
uploader.setProgressListener(new CustomProgressListener(filetobeuploaded));
insert.execute();
After 'a while' (sometimes 200 MB, sometimes 300 MB) I get an IOException:
Exception in thread "main" java.io.IOException: insufficient data written
at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.close(HttpURLConnection.java:3213)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:81)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:960)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequest(MediaHttpUploader.java:482)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:390)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:418)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:460)
Any ideas how to get this code working?
You won't be able to get it working from a frontend because of time constraints. The only reliable way (but a pain) is to do it from a backend using resumable upload, since the backend/task queue may also be shut down while processing chunks.
Does 'a while' happen to mean 1 hour? In this case you are probably experiencing the following bug:
http://code.google.com/p/gdata-issues/issues/detail?id=5124
This issue only affects Drive resumable media upload. Check this reply:
https://stackoverflow.com/a/30796105/4576135
For me this meant "you are doing a POST and specified a content length, but the stream you uploaded wasn't long enough to match that content length" (a closed ByteArrayInputStream in my case, closed because it had previously been consumed and was essentially exhausted already).
This is my first question on Stack Overflow; I hope you can help me. I've done a bit of searching online, but I keep finding tutorials or answers about reading text files with a BufferedReader or reading bytes from files on the internet. Ideally, I'd like to have a file on my server called "http://ascistudent.com/scores.data" that stores all of the Score objects made by players of a game I have made.
The game is a simple "block-dropping" game where you try to get 3 of the same blocks touching to increase the score. When time runs out, the scores are loaded from the file and the player's score is inserted at the right position in a List of Score objects. After that, the scores are saved back to the same file.
At the moment I get an exception, java.io.EOFException, on the highlighted line:
URL url = new URL("http://ascistudent.com/scores.data");
InputStream is = url.openStream();
Score s;
ObjectInputStream load;
//if(is.available()==0)return;
load = new ObjectInputStream(is); //----------java.io.EOFException
while ((s = (Score)load.readObject()) != null){
scores.add(s);
}
load.close();
I suspect this is due to the file being empty. But when I catch this exception and tell it to write to the file anyway (after changing the Score list) with the following code, nothing appears to be written (the exception keeps happening).
URL url = new URL("http://ascistudent.com/scores.data");
URLConnection ucon = url.openConnection();
ucon.setDoInput(true);
ucon.setDoOutput(true);
os = ucon.getOutputStream();
ObjectOutputStream save = new ObjectOutputStream(os);
for(Score s:scores){
save.writeObject(s);
}
save.close();
What am I doing wrong? Can anyone point me in the right direction?
Thanks very much,
Luke
Natively you can't write to a URLConnection unless that connection is writable.
What I mean is that you cannot directly write to a URL unless the other side accepts what you are going to send. In HTTP this is done through a POST request that attaches data from your client to the request itself.
On the server side you'll have to accept this POST request, take the data and append it to scores.data. You can't write to the file directly; you need to process the request in the webserver, e.g.:
http://host/scores.data
provides the data, while
http://host/uploadscores
should be a different URL that accepts a POST request, processes it and remotely modifies scores.data.
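A minimal, self-contained sketch of that client-side POST (the /uploadscores endpoint is hypothetical, and the in-process test server below merely stands in for the real webserver; readAllBytes requires Java 9+):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class PostDemo {
    // POST a raw byte payload to the given URL and return the HTTP status.
    // Writing to the connection alone changes nothing server-side; the
    // server must read the request body and store it itself.
    static int postBytes(String urlString, byte[] payload) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setFixedLengthStreamingMode(payload.length);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(payload);
        }
        int status = conn.getResponseCode(); // actually performs the request
        conn.disconnect();
        return status;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real webserver: accept the POST and acknowledge it.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/uploadscores", exchange -> {
            byte[] body = exchange.getRequestBody().readAllBytes();
            System.out.println("server received " + body.length + " bytes");
            exchange.sendResponseHeaders(200, -1);
            exchange.close();
        });
        server.start();
        int port = server.getAddress().getPort();
        int status = postBytes("http://localhost:" + port + "/uploadscores", "score-bytes".getBytes());
        System.out.println(status); // 200
        server.stop(0);
    }
}
```

In the real setup, the handler body would deserialize the posted Score objects and merge them into scores.data on disk.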