How to access a Lucene index on Amazon S3? - java

I created a bucket on S3 to share some Lucene indexes between multiple EC2 instances.
The indexes were created on my local machine and then uploaded into the bucket.
Now I would like to access these indexes from my virtual machine in EC2, but Lucene's IndexReader needs a local file directory.
Specifically, I have this situation:
path of the index in the S3 bucket -> bucket_name/indexes/index_target_directory
IndexReader indexReader = DirectoryReader.open(FSDirectory.open(new File(index_target_directory_path)));
AmazonS3 s3 = new AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey));
S3Object s3object = s3.getObject(new GetObjectRequest("bucket_name","indexes/index_target_directory"));
I know that s3object.getObjectContent() returns an InputStream; how can I use it with IndexReader?

Have a look at the lucene-s3directory which I wrote. It can read and write Lucene indices to/from AWS S3 directly and does not need a local filesystem. It's at a pretty early stage, so use it with caution.
S3Directory dir = new S3Directory("my-lucene-index");
dir.create();
// use it in your code in place of FSDirectory, for example
dir.close();
dir.delete();
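If S3Directory implements Lucene's Directory abstraction (which is the premise of the project), it can be handed straight to DirectoryReader in place of FSDirectory. A minimal sketch under that assumption; the index name is a placeholder:
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

S3Directory dir = new S3Directory("my-lucene-index");
// assumes S3Directory extends org.apache.lucene.store.Directory, so it replaces FSDirectory.open(...)
IndexReader indexReader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(indexReader);
// ... run queries against the index stored in S3 ...
indexReader.close();
dir.close();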

Related

How to copy files from an S3 bucket in one region to another region using the AWS Java SDK?

We have two AWS regions, with an S3 bucket in each region. How do I copy the files inside the bucket from one region to the other using the AWS Java SDK?
We do not have access to the credentials of the source bucket, but we have a presigned URL for each file in the source bucket, which we can use to download the file and then upload it to the destination bucket with the AWS upload URL.
There are space constraints when downloading the files, so we are trying to find a way to copy files from the bucket in one region to the other using the AWS Java SDK. Is this achievable?
Edit:
For some more clarity: both buckets already exist, and this is a continuous process to be implemented as part of our code, not a one-time activity.
Normally, copying between buckets is easy. The CopyObject() command can copy objects between buckets (even buckets in different regions and different accounts) without having to download/upload the files.
However, since you only have access to the files via a pre-signed URL, you will need to:
Download each file individually
Upload each file to the target Amazon S3 bucket using the PutObject() command in the AWS SDK for Java
This would best be done on an Amazon EC2 instance in either the source or destination regions.
It would be much better if the owner of the source bucket could provide you with "normal" access to the objects so that you can use CopyObject(), or if they were able to copy the objects to your bucket for you.
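As a rough sketch of that download/upload step with the AWS SDK for Java v1 (the pre-signed URL, destination client, bucket and key below are placeholders, and very large objects would be better served by multipart uploads via TransferManager):
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public static void copyViaPresignedUrl(AmazonS3 destClient, String presignedUrl,
                                       String destBucket, String destKey) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(presignedUrl).openConnection();
    try (InputStream in = conn.getInputStream()) {
        ObjectMetadata meta = new ObjectMetadata();
        // Setting Content-Length lets the SDK stream the body instead of buffering it in memory
        meta.setContentLength(conn.getContentLengthLong());
        destClient.putObject(destBucket, destKey, in, meta);
    } finally {
        conn.disconnect();
    }
}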
I would recommend exploring the S3 cross-region replication service first (a configuration sketch follows the list below).
Programmatically copying files from bucket to bucket raises a number of concerns:
S3 API rate limits
AWS cost to transfer data
AWS cost to host your solution, whether it runs on EC2, in a container, or serverless
cost to develop and maintain your codebase
what about S3 object metadata? Does it need to be preserved?
what about S3 buckets with versioning enabled?
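To give a feel for what cross-region replication involves on the Java side, here is a hedged sketch with the SDK for Java v1. It assumes versioning is already enabled on both buckets and that an IAM role with replication permissions exists; every name and ARN below is a placeholder, and replication only applies to objects created after the configuration is set:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketReplicationConfiguration;
import com.amazonaws.services.s3.model.ReplicationDestinationConfig;
import com.amazonaws.services.s3.model.ReplicationRule;
import com.amazonaws.services.s3.model.ReplicationRuleStatus;

import java.util.HashMap;
import java.util.Map;

AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion("us-east-1").build();

Map<String, ReplicationRule> rules = new HashMap<>();
rules.put("replicate-everything", new ReplicationRule()
        .withPrefix("")                                  // empty prefix = replicate all new objects
        .withStatus(ReplicationRuleStatus.Enabled)
        .withDestinationConfig(new ReplicationDestinationConfig()
                .withBucketARN("arn:aws:s3:::destination-bucket")));

BucketReplicationConfiguration config = new BucketReplicationConfiguration()
        .withRoleARN("arn:aws:iam::123456789012:role/replication-role")
        .withRules(rules);

s3.setBucketReplicationConfiguration("source-bucket", config);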
The code below copies files from one region to another, provided the two regions share the same access key and secret key.
System.setProperty(SDKGlobalConfiguration.DISABLE_CERT_CHECKING_SYSTEM_PROPERTY, "true");
AmazonS3 s3ClientBuilder = null;
AmazonS3 s3desClientBuilder = null;
TransferManager transferManager = null;
try {
    ClientConfiguration clientCfg = new ClientConfiguration();
    clientCfg.setProtocol(Protocol.HTTPS);
    clientCfg.setSignerOverride("S3SignerType");
    AWSStaticCredentialsProvider credentialProvider = new AWSStaticCredentialsProvider(
            new BasicAWSCredentials(accessKey, secretKey));

    // Client for the source region
    s3ClientBuilder = AmazonS3ClientBuilder
            .standard()
            .withCredentials(credentialProvider)
            .withEndpointConfiguration(new EndpointConfiguration(s3Endpoint, region.getName()))
            .withClientConfiguration(clientCfg)
            .build();

    // List all bucket names visible to the source client
    List<Bucket> buckets = s3ClientBuilder.listBuckets();
    System.out.println("Your Amazon S3 buckets are:");
    for (Bucket b : buckets) {
        System.out.println("* " + b.getName());
    }

    // Client for the destination region and a TransferManager on top of it,
    // built once outside the loop so they are not recreated per object
    s3desClientBuilder = AmazonS3ClientBuilder
            .standard()
            .withCredentials(credentialProvider)
            .withEndpointConfiguration(new EndpointConfiguration(s3desEndpoint, region.getName()))
            .withClientConfiguration(clientCfg)
            .build();
    transferManager = TransferManagerBuilder.standard()
            .withS3Client(s3desClientBuilder)
            .build();

    // List the objects (and their keys) under the given prefix in the source bucket
    ListObjectsRequest lor = new ListObjectsRequest()
            .withBucketName(SOURCE_BUCKET_NAME)
            .withPrefix("vivek/20210801");
    ObjectListing objectListing = s3ClientBuilder.listObjects(lor);
    for (S3ObjectSummary summary : objectListing.getObjectSummaries()) {
        SOURCE_KEY = summary.getKey();
        DESTINATION_KEY = SOURCE_KEY;
        // Server-side copy from the source bucket to the destination bucket
        Copy copy = transferManager.copy(
                new CopyObjectRequest(SOURCE_BUCKET_NAME, SOURCE_KEY,
                        DESTINATION_BUCKET_NAME, DESTINATION_KEY),
                s3ClientBuilder, null);
        copy.waitForCopyResult();
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (transferManager != null) transferManager.shutdownNow();
    if (s3ClientBuilder != null) s3ClientBuilder.shutdown();
    if (s3desClientBuilder != null) s3desClientBuilder.shutdown();
}

Read and write to a file in an Amazon S3 bucket

I need to read a large (>15 MB) file (say sample.csv) from an Amazon S3 bucket. I then need to process the data in sample.csv and keep writing it to another directory in the S3 bucket. I intend to use an AWS Lambda function to run my Java code.
As a first step I wrote Java code that runs on my local system. It reads sample.csv from the S3 bucket, and I used the put method to write the data back to the S3 bucket. But I find that only the last line was processed and written back.
Region clientRegion = Region.Myregion;
AwsBasicCredentials awsCreds = AwsBasicCredentials.create("myAccessId", "mySecretKey");
S3Client s3Client = S3Client.builder()
        .region(clientRegion)
        .credentialsProvider(StaticCredentialsProvider.create(awsCreds))
        .build();
ResponseInputStream<GetObjectResponse> s3objectResponse = s3Client.getObject(
        GetObjectRequest.builder().bucket(bucketName).key("Input/sample.csv").build());
BufferedReader reader = new BufferedReader(new InputStreamReader(s3objectResponse));
String line = null;
while ((line = reader.readLine()) != null) {
    s3Client.putObject(
            PutObjectRequest.builder().bucket(bucketName).key("Test/Testout.csv").build(),
            RequestBody.fromString(line));
}
Example: sample.csv contains
1,sam,21,java,beginner;
2,tom,28,python,practitioner;
3,john,35,c#,expert.
My output should be
1,mas,XX,java,beginner;
2,mot,XX,python,practitioner;
3,nhoj,XX,c#,expert.
But only 3,nhoj,XX,c#,expert is written in the Testout.csv.
The putObject() method creates an Amazon S3 object.
It is not possible to append to or modify an existing S3 object, so each iteration of the while loop creates a new Amazon S3 object under the same key, overwriting the previous one. That is why only the last line survives.
Instead, I would recommend:
Download the source file from Amazon S3 to local disk (use GetObject() with a destinationFile to download to disk)
Process the file and output to a local file
Upload the output file to the Amazon S3 bucket (use PutObject())
This separates the AWS code from your processing code, which should be easier to maintain.
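Following that outline with the same SDK v2 client the question uses, a minimal sketch; transformLine() is a hypothetical placeholder for the per-line processing, s3Client and bucketName are assumed to be set up as in the question, and the temp files land in /tmp, the only writable path inside Lambda:
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

Path input = Files.createTempFile("sample", ".csv");
Path output = Files.createTempFile("Testout", ".csv");
Files.delete(input);  // getObject refuses to overwrite an existing file

// 1. Download the whole source object to local disk
s3Client.getObject(
        GetObjectRequest.builder().bucket(bucketName).key("Input/sample.csv").build(),
        input);

// 2. Process every line locally and write all results to a local output file
List<String> processed = Files.readAllLines(input).stream()
        .map(line -> transformLine(line))    // hypothetical per-line transformation
        .collect(Collectors.toList());
Files.write(output, processed);

// 3. Upload the complete output file as a single S3 object
s3Client.putObject(
        PutObjectRequest.builder().bucket(bucketName).key("Test/Testout.csv").build(),
        output);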

Create multiple empty directories in Amazon S3 using java

I am new to S3 and I am trying to create multiple directories in Amazon S3 using java by only making one call to S3.
I could only come up with this:
ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentLength(0);
InputStream emptyContent = new ByteArrayInputStream(new byte[0]);
PutObjectRequest putObjectRequest = new PutObjectRequest(bucket,
"test/tryAgain/", emptyContent, metadata);
s3.putObject(putObjectRequest);
But the problem with this, when creating 10 folders (when the key ends with "/", the console shows the object as a folder), is that I have to make 10 calls to S3.
I want to create all the folders at once, the way a batch delete works with DeleteObjectsRequest.
Can anyone please suggest how to solve this?
Can you be a bit more specific as to what you're trying to do (or avoid doing)?
If you're primarily concerned with the cost per PUT, I don't think there is a way to batch 'upload' a directory with each file being a separate key and avoid that cost. Each PUT (even in a batch process) will cost you the price per PUT.
If you're simply trying to find a way to efficiently and recursively upload a folder, check out the uploadDirectory() method of TransferManager.
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html#uploadDirectory-java.lang.String-java.lang.String-java.io.File-boolean-
public MultipleFileUpload uploadDirectory(String bucketName,
                                          String virtualDirectoryKeyPrefix,
                                          File directory,
                                          boolean includeSubdirectories)
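A short usage sketch with the SDK for Java v1; the bucket name, key prefix and local directory are placeholders, and waitForCompletion() throws InterruptedException that real code would have to handle:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.MultipleFileUpload;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

import java.io.File;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
TransferManager tm = TransferManagerBuilder.standard().withS3Client(s3).build();

// Uploads every file under /tmp/myFolder to s3://my-bucket/test/..., preserving subdirectories
MultipleFileUpload upload = tm.uploadDirectory("my-bucket", "test", new File("/tmp/myFolder"), true);
upload.waitForCompletion();  // blocks until all files have been transferred
tm.shutdownNow();
Note that uploadDirectory uploads the files it finds on disk and does not create empty folder markers, so it addresses the recursive-upload case rather than the "many empty directories in one call" case.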

Accessing S3 Objects with storage class Glacier

I wrote a piece of (java) software that downloads objects (archives) from an S3 bucket, extracts the data locally and does operations on it.
A few days ago, I set a lifecycle policy on all the objects in the "folder" within S3 so that they are moved to Glacier automatically two days after creation, which gives me time to download and extract the data before it is archived. However, when accessing the data programmatically, Amazon Web Services throws an error:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class
I suppose this is due to the fact that the objects' storage classes have been updated to Glacier.
So far I have used the following code to access my S3 data:
public static void downloadObjectFromBucket(String bucketName, String pathToObject, String objectName) throws IOException {
    AmazonS3 s3Client = new AmazonS3Client(new ProfileCredentialsProvider());
    S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, pathToObject));
    InputStream reader = new BufferedInputStream(object.getObjectContent());
    File file = new File(objectName);
    OutputStream writer = new BufferedOutputStream(new FileOutputStream(file));
    int read = -1;
    while ((read = reader.read()) != -1) {
        writer.write(read);
    }
    writer.flush();
    writer.close();
    reader.close();
}
Do I have to update my code or change some settings in the AWS Console? It is unclear to me, since the objects are still listed in S3, and accessing every S3 object worked fine until a few days ago when I adapted the lifecycle policies.
An Amazon S3 lifecycle policy can be used to archive objects from S3 into Amazon Glacier.
When archived (as indicated by a Storage Class of Glacier), the object still "appears" to be in S3 (it appears in listings, you can see its size and metadata) but the contents of the object is kept in Glacier. Therefore, the contents cannot be accessed.
To retrieve the contents of an S3 object with a storage class of Glacier, you will need to use RestoreObject to restore the contents into S3. This takes 3-5 hours. You also nominate a duration for how long the contents should remain in S3 (where it will be stored with a storage class of Reduced Redundancy). Once the object is restored, you can retrieve its contents.
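A hedged sketch of that restore flow with the SDK for Java v1; the bucket, key and 7-day duration are placeholders. Once the restore completes, the existing downloadObjectFromBucket() method works again:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.RestoreObjectRequest;

AmazonS3 s3Client = AmazonS3ClientBuilder.defaultClient();

// Ask S3 to restore the archived contents back into S3 for 7 days
s3Client.restoreObjectV2(new RestoreObjectRequest("bucket_name", "path/to/object", 7));

// The restore is asynchronous; poll the metadata until getOngoingRestore() returns false
ObjectMetadata metadata = s3Client.getObjectMetadata("bucket_name", "path/to/object");
Boolean restoreInProgress = metadata.getOngoingRestore();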

Uploading files to S3 using AmazonS3Client.java api

I am using AmazonS3Client.java to upload files to S3 from my application. I am using the putObject method to upload the file
val putObjectRequest = new PutObjectRequest(bucketName, key, inputStream, metadata)
val acl = CannedAccessControlList.Private
putObjectRequest.setCannedAcl(acl)
s3.putObject(putObjectRequest)
This works for buckets at the top level of my S3 account. Now, suppose I want to upload the file to a sub-bucket, for example bucketB, which is inside bucketA. How should I specify the bucket name for bucketB?
Thank You
It is admittedly somewhat surprising, but there is no such thing as a "sub-bucket" in S3. All buckets are top-level. The structures inside buckets that you see in the S3 admin console or other UIs are called "folders", but even they don't really exist! You can't directly create or destroy folders, for instance, or set any attributes on them. Folders are purely a presentation-level convention for viewing the underlying flat set of objects in your bucket. That said, it's pretty easy to split your objects into (purely notional) folders. Just give them hierarchical names, with each level separated by a "/".
val putObjectRequest = new PutObjectRequest(bucketName, topFolderName +"/" + subFolderName+ "/" +key, inputStream, metadata)
Try using putObjectRequest.setKey("folderName/" + key) to include the folder prefix in the key.
