I am currently evaluating a proof of concept which uses Google bucket, a java microservice and Dataflow.
The communication flow is like so:
User sends CSV file to third party service
Service uploads CSV file to Google bucket with ID and filename
A create event is triggered and sent as a HTTP request to Java microservice
Java service triggers a Google Dataflow job
I am starting to think that the Java service is not necessary and I can directly call Dataflow after the CSV is uploaded to the bucket?
This is the service as you can see its just a basic controller that validates the request params from the "Create" trigger and then delegates to the Dataflow service
#PostMapping(value = "/dataflow", produces = {MediaType.APPLICATION_JSON_VALUE})
public ResponseEntity<Object> triggerDataFlowJob(#RequestBody Map<String, Object> body) {
Map<String, String> requestParams = getRequestParams(body);
log.atInfo().log("Body %s", requestParams);
String bucket = requestParams.get("bucket");
String fileName = requestParams.get("name");
if (Objects.isNull(bucket) || Objects.isNull(fileName)) {
AuditLogger.log(AuditCode.INVALID_CLOUD_STORAGE_REQUEST.getCode(), AuditCode.INVALID_CLOUD_STORAGE_REQUEST.getAuditText());
return ResponseEntity.accepted().build();
}
log.atInfo().log("Triggering a Dataflow job, using Cloud Storage bucket: %s --> and file %s", bucket, fileName);
try {
return DataflowTransport
.newDataflowClient(options)
.build()
.projects()
.locations()
.flexTemplates()
.launch(gcpProjectIdProvider.getProjectId(),
dataflowProperties.getRegion(),
launchFlexTemplateRequest)
.execute();
} catch (Exception ex) {
if (ex instanceof GoogleJsonResponseException && ((GoogleJsonResponseException) ex).getStatusCode() == 409) {
log.atInfo().log("Dataflow job already triggered using Cloud Storage bucket: %s --> and file %s", bucket, fileName);
} else {
log.atSevere().withCause(ex).log("Error while launching dataflow jobs");
AuditLogger.log(AuditCode.LAUNCH_DATAFLOW_JOB.getCode(), AuditCode.LAUNCH_DATAFLOW_JOB.getAuditText());
}
}
return ResponseEntity.accepted().build();
}
Is there a way to directly integrate Google bucket triggers with Dataflow?
When a file is uploaded to Cloud Storage, you can trigger a Cloud Function V2 with event arc.
Then in this Cloud Function, you can trigger a Dataflow job.
Deploy and trigger the Cloud Function V2, with event type object finalize :
gcloud functions deploy your_function_name \
--gen2 \
--trigger-event-filters="type=google.cloud.storage.object.v1.finalized" \
--trigger-event-filters="bucket=YOUR_STORAGE_BUCKET
In the Cloud Function, you will trigger the Dataflow job with a code sample that looks like this :
def startDataflowProcess(data, context):
from googleapiclient.discovery import build
#replace with your projectID
project = "grounded-pivot-266616"
job = project + " " + str(data['timeCreated'])
#path of the dataflow template on google storage bucket
template = "gs://sample-bucket/sample-template"
inputFile = "gs://" + str(data['bucket']) + "/" + str(data['name'])
#user defined parameters to pass to the dataflow pipeline job
parameters = {
'inputFile': inputFile,
}
#tempLocation is the path on GCS to store temp files generated during the dataflow job
environment = {'tempLocation': 'gs://sample-bucket/temp-location'}
service = build('dataflow', 'v1b3', cache_discovery=False)
#below API is used when we want to pass the location of the dataflow job
request = service.projects().locations().templates().launch(
projectId=project,
gcsPath=template,
location='europe-west1',
body={
'jobName': job,
'parameters': parameters,
'environment':environment
},
)
response = request.execute()
print(str(response))
This Cloud Function shows an example with Python but you can keep your logic with Java if you prefer.
Related
In my java application I need to write data to S3, which I don't know the size in advance and sizes are usually big so as recommend in the AWS S3 documentation I am using the Using the Java AWS SDKs (low-level-level API) to write data to the s3 bucket.
In my application I provide S3BufferedOutputStream which is an implementation OutputStream where other classes in the app can use this stream to write to the s3 bucket.
I store the data in a buffer and loop and once the data is bigger than bucket size I upload data in the buffer as a a single UploadPartRequest
Here is the implementation of the write method of S3BufferedOutputStream
#Override
public void write(byte[] b, int off, int len) throws IOException {
this.assertOpen();
int o = off, l = len;
int size;
while (l > (size = this.buf.length - position)) {
System.arraycopy(b, o, this.buf, this.position, size);
this.position += size;
flushBufferAndRewind();
o += size;
l -= size;
}
System.arraycopy(b, o, this.buf, this.position, l);
this.position += l;
}
The whole implementation is similar to this: code repo
My problem here is that each UploadPartRequest is done synchronously, so we have to wait for one part to be uploaded to be able to upload the next part. And because I am using the AWS S3 low level API I can not benefit from the parallel uploading provided by the TransferManager
Is there a way to achieve the parallel upload using low level SDK?
Or some code changes that can be done to operate Asynchronously without corrupting the uploaded data and maintain order of the data?
Here's some example code from a class that I have. It submits the parts to an ExecutorService and holds onto the returned Future. This is written for the v1 Java SDK; if you're using the v2 SDK you could use an async client rather than the explicit threadpool:
// WARNING: data must not be updated by caller; make a defensive copy if needed
public synchronized void uploadPart(byte[] data, boolean isLastPart)
{
partNumber++;
logger.debug("submitting part {} for s3://{}/{}", partNumber, bucket, key);
final UploadPartRequest request = new UploadPartRequest()
.withBucketName(bucket)
.withKey(key)
.withUploadId(uploadId)
.withPartNumber(partNumber)
.withPartSize(data.length)
.withInputStream(new ByteArrayInputStream(data))
.withLastPart(isLastPart);
futures.add(
executor.submit(new Callable<PartETag>()
{
#Override
public PartETag call() throws Exception
{
int localPartNumber = request.getPartNumber();
logger.debug("uploading part {} for s3://{}/{}", localPartNumber, bucket, key);
UploadPartResult response = client.uploadPart(request);
String etag = response.getETag();
logger.debug("uploaded part {} for s3://{}/{}; etag is {}", localPartNumber, bucket, key, etag);
return new PartETag(localPartNumber, etag);
}
}));
}
Note: this method is synchronized to ensure that parts are not submitted out of order.
Once you've submitted all of the parts, you use this method to wait for them to finish and then complete the upload:
public void complete()
{
logger.debug("waiting for upload tasks of s3://{}/{}", bucket, key);
List<PartETag> partTags = new ArrayList<>();
for (Future<PartETag> future : futures)
{
try
{
partTags.add(future.get());
}
catch (Exception e)
{
throw new RuntimeException(String.format("failed to complete upload task for s3://%s/%s"), e);
}
}
logger.debug("completing multi-part upload for s3://{}/{}", bucket, key);
CompleteMultipartUploadRequest request = new CompleteMultipartUploadRequest()
.withBucketName(bucket)
.withKey(key)
.withUploadId(uploadId)
.withPartETags(partTags);
client.completeMultipartUpload(request);
logger.debug("completed multi-part upload for s3://{}/{}", bucket, key);
}
You'll also need an abort() method that cancels outstanding parts and aborts the upload. This, and the rest of the class, are left as an exercise for the reader.
You should look at using the AWS SDK for Java V2. You are referencing V1, not the newest Amazon S3 Java API. If you are not familiar with V2, start here:
Get started with the AWS SDK for Java 2.x
To perform Async operations via the Amazon S3 Java API, you use S3AsyncClient.
Now to learn how to upload an object using this client, see this code example:
import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectResponse;
import java.nio.file.Paths;
import java.util.concurrent.CompletableFuture;
// snippet-end:[s3.java2.async_ops.import]
// snippet-start:[s3.java2.async_ops.main]
/**
* To run this AWS code example, ensure that you have setup your development environment, including your AWS credentials.
*
* For information, see this documentation topic:
*
* https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/get-started.html
*/
public class S3AsyncOps {
public static void main(String[] args) {
final String USAGE = "\n" +
"Usage:\n" +
" S3AsyncOps <bucketName> <key> <path>\n\n" +
"Where:\n" +
" bucketName - the name of the Amazon S3 bucket (for example, bucket1). \n\n" +
" key - the name of the object (for example, book.pdf). \n" +
" path - the local path to the file (for example, C:/AWS/book.pdf). \n" ;
if (args.length != 3) {
System.out.println(USAGE);
System.exit(1);
}
String bucketName = args[0];
String key = args[1];
String path = args[2];
Region region = Region.US_WEST_2;
S3AsyncClient client = S3AsyncClient.builder()
.region(region)
.build();
PutObjectRequest objectRequest = PutObjectRequest.builder()
.bucket(bucketName)
.key(key)
.build();
// Put the object into the bucket
CompletableFuture<PutObjectResponse> future = client.putObject(objectRequest,
AsyncRequestBody.fromFile(Paths.get(path))
);
future.whenComplete((resp, err) -> {
try {
if (resp != null) {
System.out.println("Object uploaded. Details: " + resp);
} else {
// Handle error
err.printStackTrace();
}
} finally {
// Only close the client when you are completely done with it
client.close();
}
});
future.join();
}
}
That is uploading an object using the S3AsyncClient client. To perform a multi-part upload, you need to use this method:
https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/S3AsyncClient.html#createMultipartUpload-software.amazon.awssdk.services.s3.model.CreateMultipartUploadRequest-
TO see an example of Multipart upload using the S3 Sync client, see:
https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/javav2/example_code/s3/src/main/java/com/example/s3/S3ObjectOperations.java
That is your solution - use S3AsyncClient object's createMultipartUpload method.
I have BlobServiceAsyncClient
Used TenantID, clientID, ClientSecret, ContainerName for creating the blobContainerAsyncClient.
Uploading file as
blobContainerAsyncClient.getBlobAsyncClient(fileName).upload(.........);
You can use the below code
creates a Shared Access Signature with Read only permission and available only for the next 10 minutes.
public string CreateSAS(string blobName)
{
var container = blobClient.GetContainerReference(ContainerName);
// Create the container if it doesn't already exist
container.CreateIfNotExists();
var blob = container.GetBlockBlobReference(blobName);
var sas = blob.GetSharedAccessSignature(new SharedAccessBlobPolicy()
{
Permissions = SharedAccessBlobPermissions.READ,
SharedAccessExpiryTime = DateTime.UtcNow.AddMinutes(10),
});
return sas;
}
Please refer this document for more information: https://tech.trailmax.info/2013/07/upload-files-to-azure-blob-storage-with-using-shared-access-keys/
I am writing a quick proof-of-concept for downloading images from Azure Blob Storage using the Java 12 Azure Storage SDK. The following code works properly when I convert it to synchronous. However, despite the subscribe() at the bottom of the code, I only see the subscription message. The success and error handlers are not firing. I would appreciate any suggestions or ideas.
Thank you for your time and help.
private fun azureReactorDownload() {
var startTime = 0L
var accountName = "abcd"
var key = "09sd0908sd08f0s&&6^%"
var endpoint = "https://${accountName}.blob.core.windows.net/$accountName
var containerName = "mycontainer"
var blobName = "animage.jpg"
// Get the Blob Service client, so we can use it to access blobs, containers, etc.
BlobServiceClientBuilder()
// Container URL
.endpoint(endpoint)
.credential(
SharedKeyCredential(
accountName,
key
)
)
.buildAsyncClient()
// Get the container client so we can work with our container and its blobs.
.getContainerAsyncClient(containerName)
// Get the block blob client so we can access individual blobs and include the path
// within the container as part of the filename.
.getBlockBlobAsyncClient(blobName)
// Initiate the download of the desired blob.
.download()
.map { response ->
// Drill down to the ByteBuffer.
response.value()
}
.doOnSubscribe {
println(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Subscription arrived.")
startTime = System.currentTimeMillis()
}
.doOnSuccess { data ->
data.map { byteBuffer ->
println(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> READY TO WRITE TO THE FILE")
byteBuffer.writeToFile("/tmp/azrxblobdownload.jpg")
val elapsedTime = System.currentTimeMillis() - startTime
println(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Finished downloading blob in $elapsedTime ms.")
}
}
.doOnError {
println(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Failed to download blob: ${it.localizedMessage}")
}
.subscribe()
}
fun ByteBuffer.writeToFile(path: String) {
val fc = FileOutputStream(path).channel
fc.write(this)
fc.close()
}
I see someone asking the same question 4 months ago and getting no answer:
Azure Blob Storage Java SDK: Why isn't asynchronous working?
I'm going to conjecture that this part of the JDK just isn't working right now. I wouldn't recommend using Azure's version of Java.
You should be able to accomplish it another way perhaps one of these answers:
Downloading Multiple Files Parallelly or Asynchronously in Java
I've worked with Microsoft and have a documented solution at the following link: https://github.com/Azure/azure-sdk-for-java/issues/5071. The person who worked with me provided very good background information, so it is more than just some working code.
I have opened a similar query with Microsoft for the downloadToFile() method in the Azure Java SDK v12, which is throwing an exception when saving to a file.
Here is the working code from that posting:
private fun azureReactorDownloadMS() {
var startTime = 0L
val chunkCounter = AtomicInteger(0)
// Get the Blob Service client, so we can use it to access blobs, containers, etc.
val aa = BlobServiceClientBuilder()
// Container URL
.endpoint(kEndpoint)
.credential(
SharedKeyCredential(
kAccountName,
kAccountKey
)
)
.buildAsyncClient()
// Get the container client so we can work with our container and its blobs.
.getContainerAsyncClient(kContainerName)
// Get the block blob client so we can access individual blobs and include the path
// within the container as part of the filename.
.getBlockBlobAsyncClient(kBlobName)
.download()
// Response<Flux<ByteBuffer>> to Flux<ByteBuffer>
.flatMapMany { response ->
response.value()
}
.doOnSubscribe {
println(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Subscription arrived.")
startTime = System.currentTimeMillis()
}
.doOnNext { byteBuffer ->
println(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> CHUNK ${chunkCounter.incrementAndGet()} FROM BLOB ARRIVED...")
}
.doOnComplete {
val elapsedTime = System.currentTimeMillis() - startTime
println(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Finished downloading ${chunkCounter.incrementAndGet()} chunks of data for the blob in $elapsedTime ms.")
}
.doOnError {
println(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Failed to download blob: ${it.localizedMessage}")
}
.blockLast()
}
why does spring webflux (or java nio) makes multipartData DataBuffer tmp file?
in my case on macOS, files like /private/var/folders/v6/vtrxqpbd4lb3pq8v_sbm10hc0000gn/T/nio-file-upload/nio-body-1-82f11dbe-61b3-4e5d-8c43-92e02aa38481.tmp made on request and then deleted.
is it possible to improve performance with preventing disk write?
this is my code:
public class FileHandler {
public Mono<ServerResponse> postFile(ServerRequest req) {
val file = req.multipartData()
.map(map -> map.getFirst("file"))
.ofType(FilePart.class);
val buffer = file.flatMap(part -> part.content().next());
val hash = buffer.map(d -> {
try {
val md = MessageDigest.getInstance("SHA-1");
md.update(d.asByteBuffer());
return Base64Utils.encodeToString(md.digest());
} catch (NoSuchAlgorithmException e) {
// does not reach here!
return "";
}
});
val name = file.map(FilePart::filename);
return ok().body(hash, String.class);
}
}
The multipart file support in Spring WebFlux is using the Synchronoss NIO Multipart library. The downside of that implementation is that it's not fully reactive and as a result it can create temporary files to not load the whole content in memory.
What makes you think that this behavior is a performance problem? Do you have a sample or benchmark results that show that this is an issue?
The Spring Framework team already worked on this and a fully reactive implementation will be available as the default in Spring Framework 5.2 (see spring-framework#21659).
I am trying to use Lambda function for S3 Put event notification. My Lambda function should be called once I put/add any new JSON file in my S3 bucket.
The challenge I have is there are not enough documents for this to implement such Lambda function in Java. Most of doc I found are for Node.js
I want, my Lambda function should be called and then inside that Lambda function, I want to consume that added json and then send that JSON to AWS ES Service.
But what all classes I should use for this? Anyone has any idea about this? S3 abd ES are all setup and running. The auto generated code for lambda is
`
#Override
public Object handleRequest(S3Event input, Context context) {
context.getLogger().log("Input: " + input);
// TODO: implement your handler
return null;
}
What next??
Handling S3 events in Lambda can be done, but you have to keep in mind, the the S3Event object only transports the reference to the object and not the object itself. To get to the actual object you have to invoke the AWS SDK yourself.
Requesting a S3 Object within a lambda function would look like this:
public Object handleRequest(S3Event input, Context context) {
AmazonS3Client s3Client = new AmazonS3Client(new DefaultAWSCredentialsProviderChain());
for (S3EventNotificationRecord record : input.getRecords()) {
String s3Key = record.getS3().getObject().getKey();
String s3Bucket = record.getS3().getBucket().getName();
context.getLogger().log("found id: " + s3Bucket+" "+s3Key);
// retrieve s3 object
S3Object object = s3Client.getObject(new GetObjectRequest(s3Bucket, s3Key));
InputStream objectData = object.getObjectContent();
//insert object into elasticsearch
}
return null;
}
Now the rather difficult part to insert this object into ElasticSearch. Sadly the AWS SDK does not provide any functions for this. The default approach would be to do a REST call against the AWS ES endpoint. There are various samples out their on how to proceed with calling an ElasticSearch instance.
Some people seem to go with the following project:
Jest - Elasticsearch Java Rest Client
Finally, here are the steps for S3 --> Lambda --> ES integration using Java.
Have your S3, Lamba and ES created on AWS. Steps are here.
Use below Java code in your lambda function to fetch a newly added object in S3 and send it to ES service.
public Object handleRequest(S3Event input, Context context) {
AmazonS3Client s3Client = new AmazonS3Client(new DefaultAWSCredentialsProviderChain());
for (S3EventNotificationRecord record : input.getRecords()) {
String s3Key = record.getS3().getObject().getKey();
String s3Bucket = record.getS3().getBucket().getName();
context.getLogger().log("found id: " + s3Bucket+" "+s3Key);
// retrieve s3 object
S3Object object = s3Client.getObject(new GetObjectRequest(s3Bucket, s3Key));
InputStream objectData = object.getObjectContent();
//Start putting your objects in AWS ES Service
String esInput = "Build your JSON string here using S3 objectData";
HttpClient httpClient = new DefaultHttpClient();
HttpPut putRequest = new HttpPut(AWS_ES_ENDPOINT + "/{Index_name}/{product_name}/{unique_id}" );
StringEntity input = new StringEntity(esInput);
input.setContentType("application/json");
putRequest.setEntity(input);
httpClient.execute(putRequest);
httpClient.getConnectionManager().shutdown();
}
return "success";}
Use either Postman or Sense to create Actual index & corresponding mapping in ES.
Once done, download and run proxy.js on your machine. Make sure you setup ES Security steps suggested in this post
Test setup and Kibana by running http://localhost:9200/_plugin/kibana/ URL from your machine.
All is set. Go ahead and set your dashboard in Kibana. Test it by adding new objects in your S3 bucket