Read and write to a file in an Amazon S3 bucket - java

I need to read a large (>15 MB) file (say sample.csv) from an Amazon S3 bucket. I then need to process the data in sample.csv and keep writing it to another directory in the S3 bucket. I intend to use an AWS Lambda function to run my Java code.
As a first step I developed Java code that runs on my local system. The Java code reads the sample.csv file from the S3 bucket and uses the putObject method to write data back to the S3 bucket. But I find that only the last line was processed and written back.
Region clientRegion = Region.Myregion;
AwsBasicCredentials awsCreds = AwsBasicCredentials.create("myAccessId","mySecretKey");
S3Client s3Client = S3Client.builder().region(clientRegion).credentialsProvider(StaticCredentialsProvider.create(awsCreds)).build();
ResponseInputStream<GetObjectResponse> s3objectResponse = s3Client.getObject(GetObjectRequest.builder().bucket(bucketName).key("Input/sample.csv").build());
BufferedReader reader = new BufferedReader(new InputStreamReader(s3objectResponse));
String line = null;
while ((line = reader.readLine()) != null) {
    s3Client.putObject(PutObjectRequest.builder().bucket(bucketName).key("Test/Testout.csv").build(), RequestBody.fromString(line));
}
Example: sample.csv contains
1,sam,21,java,beginner;
2,tom,28,python,practitioner;
3,john,35,c#,expert.
My output should be
1,mas,XX,java,beginner;
2,mot,XX,python,practitioner;
3,nhoj,XX,c#,expert.
But only 3,nhoj,XX,c#,expert is written in the Testout.csv.

The putObject() method creates an Amazon S3 object.
It is not possible to append to or modify an S3 object, so each iteration of the while loop creates a new Amazon S3 object under the same key, overwriting the previous one. That is why only the last line survives in Testout.csv.
Instead, I would recommend:
Download the source file from Amazon S3 to local disk (use getObject() with a destination file to download to disk)
Process the file and output to a local file
Upload the output file to the Amazon S3 bucket in a single putObject() call
This separates the AWS code from your processing code, which should be easier to maintain.
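Putting those steps together, here is a minimal sketch using the v2 SDK from the question. The bucket and key names are taken from the question, and processLine is a hypothetical stand-in for the name-reversal and age-masking shown in the sample data:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class CsvTransformer {

    // Hypothetical per-line transform matching the question's example:
    // reverse the name field and mask the age field.
    static String processLine(String line) {
        String[] fields = line.split(",");
        fields[1] = new StringBuilder(fields[1]).reverse().toString();
        fields[2] = "XX";
        return String.join(",", fields);
    }

    public static void main(String[] args) throws IOException {
        S3Client s3 = S3Client.create();
        String bucket = "my-bucket"; // assumed bucket name

        // 1. Download the source object to a local temp file
        Path input = Files.createTempFile("sample", ".csv");
        Files.delete(input); // getObject refuses to overwrite an existing file
        s3.getObject(GetObjectRequest.builder()
                .bucket(bucket).key("Input/sample.csv").build(), input);

        // 2. Process every line locally
        List<String> output = Files.readAllLines(input).stream()
                .map(CsvTransformer::processLine)
                .collect(Collectors.toList());
        Path result = Files.write(Files.createTempFile("testout", ".csv"), output);

        // 3. Upload the complete output once, instead of once per line
        s3.putObject(PutObjectRequest.builder()
                .bucket(bucket).key("Test/Testout.csv").build(),
                RequestBody.fromFile(result));
    }
}
```

Because the whole output file is uploaded in one putObject call, every processed line ends up in Testout.csv instead of only the last one.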

Related

Max file upload limitation at jvm or AWS S3 end

I am trying to upload a file through TransferManager (Java) to an AWS S3 bucket.
My question is: if I try to upload a file of approximately 4 GB, do I need to handle anything at the Java end?
Below is the code:
File tempFile = AWSUtil.createTempFileFromStream(inputStream);
putObjectRequest = new PutObjectRequest(bucketName, fileName, tempFile);
final TransferManager transferMgr = TransferManagerBuilder.standard().withS3Client(s3client).build();
upload = transferMgr.upload(putObjectRequest);
As far as I know, AWS S3 has an object size limit of 5 TB, so that is not a concern.
Please let me know if I need to handle anything at the Java end.
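For what it's worth, a single PUT request is limited to 5 GB, and TransferManager transparently switches to multipart upload above a configurable threshold, so a 4 GB file should need no special handling beyond waiting for completion. A sketch under those assumptions (the bucket, key, file path, threshold, and part size are all illustrative; partCount is a hypothetical helper showing how the part size divides the file):

```java
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

public class LargeUpload {

    // How many parts a multipart upload of fileSize bytes will need
    static long partCount(long fileSize, long partSize) {
        return (fileSize + partSize - 1) / partSize;
    }

    public static void main(String[] args) throws InterruptedException {
        AmazonS3 s3client = AmazonS3ClientBuilder.defaultClient();
        TransferManager transferMgr = TransferManagerBuilder.standard()
                .withS3Client(s3client)
                .withMultipartUploadThreshold(16L * 1024 * 1024) // multipart above 16 MB
                .withMinimumUploadPartSize(64L * 1024 * 1024)    // 64 MB parts
                .build();

        Upload upload = transferMgr.upload("my-bucket", "big.bin",
                new File("/tmp/big.bin")); // assumed bucket, key, and file
        upload.waitForCompletion(); // blocks until every part has been uploaded
        transferMgr.shutdownNow();
    }
}
```

The one thing worth handling at the Java end is the InterruptedException (or AmazonClientException) from waitForCompletion, so a failed part upload is not silently ignored.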

Create multiple empty directories in Amazon S3 using java

I am new to S3 and I am trying to create multiple directories in Amazon S3 using Java with only one call to S3.
I could only come up with this:
ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentLength(0);
InputStream emptyContent = new ByteArrayInputStream(new byte[0]);
PutObjectRequest putObjectRequest = new PutObjectRequest(bucket, "test/tryAgain/", emptyContent, metadata);
s3.putObject(putObjectRequest);
But the problem with this approach when uploading 10 folders (when a key ends with "/", the console shows the object as a folder) is that I have to make 10 calls to S3.
I want to create all the folders at once, the way a batch delete works with DeleteObjectsRequest.
Can anyone suggest how to solve this?
Can you be a bit more specific as to what you're trying to do (or avoid doing)?
If you're primarily concerned with the cost per PUT, I don't think there is a way to batch 'upload' a directory with each file being a separate key and avoid that cost. Each PUT (even in a batch process) will cost you the price per PUT.
If you're simply trying to find a way to efficiently and recursively upload a folder, check out the uploadDirectory() method of TransferManager.
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html#uploadDirectory-java.lang.String-java.lang.String-java.io.File-boolean-
public MultipleFileUpload uploadDirectory(String bucketName,
                                          String virtualDirectoryKeyPrefix,
                                          File directory,
                                          boolean includeSubdirectories)
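If the goal really is empty folder markers rather than uploading real files, there is, as far as I know, no batch PUT; each marker still costs one request. A sketch with the v1 SDK from the question (createFolders and folderKey are hypothetical helpers, not SDK methods):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

public class FolderMarkers {

    // The console only renders a zero-byte object as a folder if its key ends in "/"
    static String folderKey(String name) {
        return name.endsWith("/") ? name : name + "/";
    }

    static void createFolders(AmazonS3 s3, String bucket, String... folders) {
        for (String folder : folders) {
            ObjectMetadata metadata = new ObjectMetadata();
            metadata.setContentLength(0);
            InputStream emptyContent = new ByteArrayInputStream(new byte[0]);
            // One PUT per marker: there is no batch counterpart to DeleteObjectsRequest
            s3.putObject(new PutObjectRequest(bucket, folderKey(folder),
                    emptyContent, metadata));
        }
    }
}
```

The loop keeps the per-key cost explicit; since S3 folders are purely a key-name convention, many applications skip the markers entirely and just upload objects with the prefix in the key.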

Accessing S3 Objects with storage class Glacier

I wrote a piece of (java) software that downloads objects (archives) from an S3 bucket, extracts the data locally and does operations on it.
A few days back, I set the lifecycle policy of all the objects in the "folder" within S3 to be moved to Glacier automatically 2 days after creation, so that I have time to download and extract the data before it is archived. However, when accessing the data programmatically, Amazon Web Services throws an error:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class
I suppose this is due to the fact that the objects' storage classes have been updated to Glacier.
So far I have used the following code to access my S3 data:
public static void downloadObjectFromBucket(String bucketName, String pathToObject, String objectName) throws IOException {
    AmazonS3 s3Client = new AmazonS3Client(new ProfileCredentialsProvider());
    S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, pathToObject));
    // Copy the object's content to a local file in buffered chunks
    try (InputStream reader = new BufferedInputStream(object.getObjectContent());
         OutputStream writer = new BufferedOutputStream(new FileOutputStream(new File(objectName)))) {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, read);
        }
    }
}
Do I have to update my code or change some settings in the AWS Console? It is unclear to me, since the objects are still in S3, and accessing every S3 object worked fine up until a few days ago when I adapted the lifecycle policies.
An Amazon S3 lifecycle policy can be used to archive objects from S3 into Amazon Glacier.
When archived (as indicated by a storage class of Glacier), the object still "appears" to be in S3 (it appears in listings, and you can see its size and metadata), but its contents are kept in Glacier. Therefore, the contents cannot be accessed directly.
To retrieve the contents of an object with a storage class of Glacier, you need to issue a RestoreObject request, which restores a temporary copy into S3. A standard retrieval takes 3-5 hours. You also nominate a duration for how long the restored copy should remain in S3 (where it is stored with the Reduced Redundancy storage class). Once the object is restored, you can retrieve its contents as usual.
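The restore can be requested from the SDK as well. A sketch with the v1 SDK used in the question (the bucket, key, and 7-day retention are illustrative, and needsRestore is a hypothetical helper):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.RestoreObjectRequest;

public class GlacierRestore {

    // Objects in these storage classes cannot be read until restored
    static boolean needsRestore(String storageClass) {
        return "GLACIER".equals(storageClass) || "DEEP_ARCHIVE".equals(storageClass);
    }

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // Ask S3 to restore a temporary copy and keep it for 7 days
        s3.restoreObjectV2(new RestoreObjectRequest("my-bucket", "folder/archive.zip")
                .withExpirationInDays(7));
        // The restore itself takes hours; check the object's metadata (its
        // ongoing-restore flag) before calling getObject() again.
    }
}
```

In other words, no console changes are strictly required; the existing download code works again once the restore has completed, for as long as the restored copy is retained.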

How to access index Lucene on Amazon S3?

I created a bucket on S3 to share some Lucene indexes between multiple EC2 instances.
The indexes were created on a local machine and then uploaded into the bucket.
Now I would like to access these indexes from my virtual machine in EC2, but Lucene's IndexReader needs a local file directory.
In the specific I have this situation:
path index on bucket S3 -> bucket_name/indexes/index_target_directory
IndexReader indexReader = DirectoryReader.open(FSDirectory.open(new File(index_target_directory_path)));
AmazonS3 s3 = new AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey));
S3Object s3object = s3.getObject(new GetObjectRequest("bucket_name","indexes/index_target_directory"));
I know that s3object.getContent() returns an InputStream, how can I use it with IndexReader?
Have a look at the lucene-s3directory which I wrote. It can read and write Lucene indices to/from AWS S3 directly and does not need a local filesystem. It's at a pretty early stage, so use with caution.
S3Directory dir = new S3Directory("my-lucene-index");
dir.create();
// use it in your code in place of FSDirectory, for example
dir.close();
dir.delete();

Converting MultipartFile to java.io.File without copying to local machine

I have a Java Spring MVC web application. From the client, through AngularJS, I am uploading a file and posting it to the Controller as a web service.
In my Controller, I am getting it as a MultipartFile and I can copy it to the local machine.
But I want to upload the file to an Amazon S3 bucket. So I have to convert it to java.io.File. Right now what I am doing is copying it to the local machine and then uploading to S3 using jets3t.
Here is my way of converting in controller
MultipartHttpServletRequest mRequest=(MultipartHttpServletRequest)request;
Iterator<String> itr=mRequest.getFileNames();
while(itr.hasNext()){
MultipartFile mFile=mRequest.getFile(itr.next());
String fileName=mFile.getOriginalFilename();
fileLoc="/home/mydocs/my-uploads/"+date+"_"+fileName; //date is String form of current date.
Then I am using FileCopyUtils of Spring Framework:
File newFile = new File(fileLoc);
// if the directory does not exist, create it
if (!newFile.getParentFile().exists()) {
newFile.getParentFile().mkdirs();
}
FileCopyUtils.copy(mFile.getBytes(), newFile);
So it will create a new file on the local machine. That file I am uploading to S3:
S3Object fileObject = new S3Object(newFile);
s3Service.putObject("myBucket", fileObject);
It creates a file on my local system, which I don't want.
Without creating a file on the local system, how can I convert a MultipartFile to java.io.File?
MultipartFile, by default, is already saved on your server as a temporary file when the user uploads it.
From that point on, you can do anything you want with this file.
There is a method that moves that temp file to any destination you want:
http://docs.spring.io/spring/docs/3.0.x/api/org/springframework/web/multipart/MultipartFile.html#transferTo(java.io.File)
But MultipartFile is just API, you can implement any other MultipartResolver
http://docs.spring.io/spring/docs/3.0.x/api/org/springframework/web/multipart/MultipartResolver.html
This API accepts an input stream, and you can do anything you want with it. The default implementation (usually commons-multipart) saves it to the temp dir as a file.
But another problem remains: if the S3 API accepts a file as a parameter, you cannot get around this - you need a real file. If you want to avoid creating files at all, you would have to use (or write) an S3 API that works with streams.
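Concretely, transferTo() lets the controller hand the container's temp file its final location without an explicit byte copy. A sketch (the base directory and the helper names save and destinationFor are assumptions, not Spring API):

```java
import java.io.File;
import java.io.IOException;

import org.springframework.web.multipart.MultipartFile;

public class UploadHandler {

    // Build the destination path for an uploaded file under a base directory
    static File destinationFor(String baseDir, String originalName) {
        return new File(baseDir, originalName);
    }

    static File save(MultipartFile mFile, String baseDir) throws IOException {
        File dest = destinationFor(baseDir, mFile.getOriginalFilename());
        if (!dest.getParentFile().exists()) {
            dest.getParentFile().mkdirs();
        }
        // May move the container's temp file rather than copying its bytes
        mFile.transferTo(dest);
        return dest;
    }
}
```

This replaces the FileCopyUtils.copy() step from the question; the resulting File can then be passed to the S3 client as before.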
The question is already more than one year old, so I'm not sure if the jets3t link provided by the OP had the following snippet at that time:
If your data isn't a File or String, you can use any input stream as a data source, but you must manually set the Content-Length.
// Create an object containing a greeting string as input stream data.
String greeting = "Hello World!";
S3Object helloWorldObject = new S3Object("HelloWorld2.txt");
ByteArrayInputStream greetingIS = new ByteArrayInputStream(greeting.getBytes());
helloWorldObject.setDataInputStream(greetingIS);
helloWorldObject.setContentLength(greeting.getBytes(Constants.DEFAULT_ENCODING).length);
helloWorldObject.setContentType("text/plain");
s3Service.putObject(testBucket, helloWorldObject);
It turns out you don't have to create a local file first. As @Boris suggests, you can feed the S3Object with the data input stream, content type, and content length you get from MultipartFile.getInputStream(), MultipartFile.getContentType(), and MultipartFile.getSize() respectively.
Instead of copying it to your local machine, you can just do this and replace the file name with this:
File newFile = new File(multipartFile.getOriginalFilename());
This way, you don't have to have a local destination to create your file.
If you are trying to use it in an HttpEntity, check my answer here:
https://stackoverflow.com/a/68022695/7532946
