S3 multithreaded download library - java

I have a java application that needs to do fast and reliable downloads from Amazon's S3. Ideally, I'd use something like the AWS SDK's TransferManager ( http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html ), except I'd like to process the data in a streaming fashion, without having to stage all the downloaded data on local disk.
Ideally, the library would have an interface similar to AmazonS3#getObject(), but the implementation would be faster and more robust. Even better, the library would support pre-fetching for multiple S3 objects: I could give it a list of objects that I want to download eventually, then consume a sequence of streams for each object quickly. It's ok if the library has to use a lot of RAM to do the pre-fetching.
Does anybody know of a library that has some/all of these features?

I would recommend using minio-java
Java Library for Amazon S3 Compatible Cloud Storage
io.minio.MinioClient.getObject returns an InputStream, and you can make multiple getObject calls where each call returns its own InputStream.
MinioClient s3Client = new MinioClient("https://s3.amazonaws.com", "YOUR-ACCESSKEYID", "YOUR-SECRETACCESSKEY");
InputStream stream1 = s3Client.getObject("my-bucketname", "my-objectname1");
InputStream stream2 = s3Client.getObject("my-bucketname", "my-objectname2");
Here, the streams are not pre-fetched. If pre-fetching is a hard requirement, you could use another variant of getObject:
public void getObject(String bucketName, String objectName, String fileName)
The advantage of using this method is that it resumes a previous partial getObject, if any.
MinioClient s3Client = new MinioClient("https://s3.amazonaws.com", "YOUR-ACCESSKEYID", "YOUR-SECRETACCESSKEY");
s3Client.getObject("my-bucketname", "my-objectname1", "/mycachedir/my-objectname1");
s3Client.getObject("my-bucketname", "my-objectname2", "/mycachedir/my-objectname2");

Related

How to upload a Flux java object into Azure Blob Storage?

I am trying to upload a Flux object into Azure Blob Storage, but I'm not sure how to send a Flux POJO using BlobAsyncClient. BlobAsyncClient has upload methods that take Flux or BinaryData, but I have had no luck trying to convert CombinedResponse to ByteBuffer or BinaryData. Does anyone have any suggestions or know how to upload a Flux object to Blob Storage?
You will need an async blob container client:
#Bean("blobServiceClient")
BlobContainerAsyncClient blobServiceClient(ClientSecretCredential azureClientCredentials, String storageAccount, String containerName) {
BlobServiceClientBuilder blobServiceClientBuilder = new BlobServiceClientBuilder();
return blobServiceClientBuilder
.endpoint(format("https://%s.blob.core.windows.net/", storageAccount))
.credential(azureClientCredentials)
.buildAsyncClient()
.getBlobContainerAsyncClient(containerName);
}
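The ClientSecretCredential injected into that bean is not defined above. A minimal sketch of how one might build it, assuming the azure-identity library is on the classpath (the property names are placeholders):
@Bean
ClientSecretCredential azureClientCredentials(
        @Value("${azure.tenant-id}") String tenantId,      // placeholder property names
        @Value("${azure.client-id}") String clientId,
        @Value("${azure.client-secret}") String clientSecret) {
    return new ClientSecretCredentialBuilder()
            .tenantId(tenantId)
            .clientId(clientId)
            .clientSecret(clientSecret)
            .build();
}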
And in your code you can use the container client to get a blob client and save your Flux to it:
Flux<ByteBuffer> content = getContent();
blobServiceClient.getBlobAsyncClient(id)
.upload(content, new ParallelTransferOptions(), true);
I get that the getContent() step is the part you are struggling with. You can save either a BinaryData object or a Flux<ByteBuffer> stream.
To turn your object into a BinaryData object, use the static helper method:
BinaryData foo = BinaryData.fromObject(myObject);
BinaryData is meant for exactly what the name says: binary data. For example the content of an image file.
If you want to turn it into a ByteBuffer, keep in mind that you're turning an object into a stream of data. You will probably want a standardized way of doing that so it can be reliably reversed: rather than an ad-hoc stream of bytes that may break if you ever load the data in a different client (or just a different version of the same one), we usually save a JSON or XML representation of the object.
My go-to tool for this is Jackson:
byte[] myBytes = new ObjectMapper().writeValueAsBytes(myObject);
var myByteBuffer = ByteBuffer.wrap(myBytes);
And return it as a Flux:
Flux<ByteBuffer> myFlux = Flux.just(myByteBuffer);
By the way, Azure uses a JSON serializer under the hood in the BinaryData.fromObject() method. From the JavaDoc:
Creates an instance of BinaryData by serializing the Object using the default JsonSerializer.
Note: This method first looks for a JsonSerializerProvider implementation on the classpath. If no implementation is found, a default Jackson-based implementation will be used to serialize the object.
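Putting the pieces together, a minimal end-to-end sketch (myObject and id are placeholders from the snippets above; remember that nothing happens until the returned Mono is subscribed to):
// Serialize the POJO to JSON bytes, wrap them in a Flux<ByteBuffer>, and upload.
byte[] myBytes = new ObjectMapper().writeValueAsBytes(myObject);
Flux<ByteBuffer> content = Flux.just(ByteBuffer.wrap(myBytes));
blobServiceClient.getBlobAsyncClient(id)
        .upload(content, new ParallelTransferOptions(), true)
        .subscribe();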

How to putObject without knowing the InputStream length

I'm using SDK 2.0 and trying to putObject into the bucket.
From a different API I'm receiving an InputStream which holds a file (CSV or plain text):
InputStream stream = otherApi.get();
S3Client s3 = S3Client.create();
s3.putObject(PutObjectRequest.builder(), RequestBody ? )
RequestBody has several useful methods, such as RequestBody.fromInputStream, but they require a contentLength, which I don't know. Files could be 1 MB or even 20 MB.
Has anyone faced this problem while using the new API version?
The old 1.x API did not require knowing the contentLength.
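One workaround, sketched under the assumption that the stream comfortably fits in memory (which should be fine at 1-20 MB), is to buffer it into a byte array so the length is known and then use RequestBody.fromBytes; the bucket and key below are placeholders:
// Buffer the unknown-length stream so the SDK gets a content length.
// readAllBytes() requires Java 9+; bucket and key are placeholders.
byte[] bytes = stream.readAllBytes();
S3Client s3 = S3Client.create();
s3.putObject(
        PutObjectRequest.builder()
                .bucket("my-bucket")
                .key("my-key")
                .build(),
        RequestBody.fromBytes(bytes));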

S3 Implementation for org.apache.parquet.io.InputFile?

I am trying to write a Scala-based AWS Lambda to read Snappy-compressed Parquet files stored in S3. The process will write them back out as partitioned JSON files.
I have been trying to use the org.apache.parquet.hadoop.ParquetFileReader class to read the files... the non-deprecated way to do this appears to be passing it an implementation of the org.apache.parquet.io.InputFile interface. There is one for Hadoop (HadoopInputFile)... but I cannot find one for S3. I also tried some of the deprecated ways for this class, but could not get them to work with S3 either.
Any solution to this dilemma?
Just in case anyone is interested... why I am doing this in Scala? Well... I cannot figure out another way to do it. The Python implementations for Parquet (pyarrow and fastparquet) both seem to struggle with complicated list/struct based schemas.
Also, I have seen some AvroParquetReader-based code (Read parquet data from AWS s3 bucket) that might be a different solution, but I could not get it to work without a known schema. Maybe I am missing something there.
I'd really like to get the ParquetFileReader class to work, as it seems clean.
Appreciate any ideas.
Hadoop uses its own filesystem abstraction layer, which has an implementation for S3 (https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A).
The setup should look something like the following (Java, but the same should work with Scala):
// Constants is org.apache.hadoop.fs.s3a.Constants (hadoop-aws);
// DefaultAWSCredentialsProviderChain comes from the AWS SDK for Java v1.
Configuration conf = new Configuration();
conf.set(Constants.ENDPOINT, "https://s3.eu-central-1.amazonaws.com/");
conf.set(Constants.AWS_CREDENTIALS_PROVIDER,
        DefaultAWSCredentialsProviderChain.class.getName());
// maybe additional configuration properties depending on the credential provider
URI uri = URI.create("s3a://bucketname/path");
org.apache.hadoop.fs.Path path = new Path(uri);
ParquetFileReader pfr = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf));
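From there you can, for example, pull the schema out of the footer before iterating row groups (just a usage sketch on top of the snippet above):
// Read the file schema from the Parquet footer.
MessageType schema = pfr.getFooter().getFileMetaData().getSchema();
System.out.println(schema);
pfr.close();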

Get a Google Cloud Storage file from its BlobKey

I wrote a Google App Engine application that makes use of Blobstore to save programmatically-generated data. To do so, I used the Files API, which unfortunately has been deprecated in favor to Google Cloud Storage. So I'm rewriting my helper class to work with GCS.
I'd like to keep the interface as similar as possible to what it was before, also because I persist BlobKeys in the Datastore to keep references to the files (and changing the model of a production application is always painful). When I save something to GCS, I retrieve a BlobKey with
BlobKey blobKey = blobstoreService.createGsBlobKey("/gs/" + fileName.getBucketName() + "/" + fileName.getObjectName());
as prescribed here, and I persist it in the Datastore.
So here's the question: the documentation tells me how to serve a GCS file with blobstoreService.serve(blobKey, resp); in a servlet response, BUT how can I retrieve the file content (as InputStream, byte array or whatever) to use it in my code for further processing? In my current implementation I do that with a FileReadChannel reading from an AppEngineFile (both deprecated).
Here is the code to open a Google Cloud Storage object as an InputStream. Unfortunately, you have to use the bucket name and object name, not the BlobKey:
GcsFilename gcs_filename = new GcsFilename(bucket_name, object_name);
GcsService service = GcsServiceFactory.createGcsService();
ReadableByteChannel rbc = service.openReadChannel(gcs_filename, 0);
InputStream stream = Channels.newInputStream(rbc);
Given a blobKey, use the BlobstoreInputStream class to read the value from Blobstore, as described in the documentation:
BlobstoreInputStream in = new BlobstoreInputStream(blobKey);
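If you need the content as a byte array for further processing, you can drain that stream in the usual way (a plain-Java sketch, nothing App Engine specific):
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buffer = new byte[8192];
int n;
while ((n = in.read(buffer)) != -1) {
    out.write(buffer, 0, n);
}
in.close();
byte[] content = out.toByteArray(); // the file content, ready for further processing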
You can get the Cloud Storage filename only in the upload handler (fileInfo.gs_object_name) and store it in your database. After that it is lost, and it does not seem to be preserved in BlobInfo or other metadata structures.
Google says:
Unlike BlobInfo metadata, FileInfo metadata is not persisted to datastore. (There is no blob key either, but you can create one later if needed by calling create_gs_key.) You must save the gs_object_name yourself in your upload handler or this data will be lost.
Sorry, this is a python link, but it should be easy to find something similar in java.
https://developers.google.com/appengine/docs/python/blobstore/fileinfoclass
Here is the Blobstore approach (sorry, this is for Python, but I am sure you will find it quite similar for Java):
blob_reader = blobstore.BlobReader(blob_key)
if blob_reader:
    file_content = blob_reader.read()

Playing mp3 files in JavaFx from input stream

I am using the JavaFX media player to play an mp3 file with the following code:
new MediaPlayer(new Media(FileObject.toURI().toString())).play();
However, I now have a requirement to play the mp3 from byte data in memory instead of a File object. The reason is that the mp3 file is encrypted and shipped along with the program, so I need to decrypt it in memory or into an input stream.
I could decrypt the mp3 to a temporary file in the temp directory, but that would add performance overhead and the audio content would be insecure.
From the Media Javadoc:
Only HTTP, FILE, and JAR URLs are supported. If the provided URL is invalid then an exception will be thrown. If an asynchronous error occurs, the error property will be set. Listen to this property to be notified of any such errors.
I'm not personally familiar with JavaFX, but that would suggest to me that, without resorting to nasty hacks, you're not going to be able to read media directly from memory. Normally for this kind of URI-only interface I'd suggest registering a custom URLStreamHandler and a custom protocol that reads from memory. But assuming that Javadoc is correct, JavaFX uses its own URL resolution, so presumably this will not work.
Given this then I suspect the only way to make this work is to provide access to the in-memory MP3 over HTTP. You could do this using Jetty or any similar embeddable servlet container. Something along the following lines:
1) Start up Jetty as per the Quick Start Guide.
2) Register a servlet that looks something like the one below. This servlet will expose your in-memory data:
public class MagicAccessServlet extends HttpServlet {

    // Holds the decrypted media, keyed by a random UUID.
    private static final Map<String, byte[]> mediaMap = new ConcurrentHashMap<>();

    public static String registerMedia(byte[] media) {
        String key = UUID.randomUUID().toString();
        mediaMap.put(key, media);
        return key;
    }

    public static void deregisterMedia(String key) {
        mediaMap.remove(key);
    }

    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String key = req.getParameter("key");
        byte[] media = mediaMap.get(key);
        resp.setContentLength(media.length);
        resp.getOutputStream().write(media);
    }
}
Then you can access it from within your application using an HTTP URL, e.g. something like:
String key = MagicAccessServlet.registerMedia(decodedMp3);
new MediaPlayer(new Media("http://localhost:<port>/<context>/<servlet>?key=" + key)).play();
Unfortunately, as the Media constructor stands, I see no easy way to do this other than the temporary file approach. While I agree there would be a performance overhead, if the files aren't too big (and most mp3 files aren't), the overhead should be minimal. And technically, decrypting the content to memory also renders it insecure (though admittedly much harder to extract).
One slightly crazy approach I did think of was to use sockets. You could set up a separate part of your application which decrypts the encrypted content and then streams the raw mp3 bytes over a certain port on localhost. You could then provide this as an HTTP URI to the Media constructor.
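For what it's worth, here is a rough sketch of that idea using the JDK's built-in com.sun.net.httpserver.HttpServer instead of a full servlet container (the port and path are arbitrary placeholders, and decodedMp3 is assumed to be the decrypted byte array):
// Serve the decrypted bytes over localhost so Media can load them via an HTTP URL.
HttpServer server = HttpServer.create(new InetSocketAddress("localhost", 8123), 0);
server.createContext("/media", exchange -> {
    exchange.getResponseHeaders().set("Content-Type", "audio/mpeg");
    exchange.sendResponseHeaders(200, decodedMp3.length);
    try (OutputStream os = exchange.getResponseBody()) {
        os.write(decodedMp3);
    }
});
server.start();
new MediaPlayer(new Media("http://localhost:8123/media")).play();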
