I'm trying to list all the so-called folders and sub-folders in an S3 bucket.
Since I am listing all the folders in a path recursively, I am not using the withDelimiter() method.
All the so-called folder names should end with /, and this is my logic for listing all the folders and sub-folders.
Here's the Scala code (intentionally not pasting the catch block here):
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ListObjectsRequest
import scala.collection.JavaConverters._

val awsCredentials = new BasicAWSCredentials(awsKey, awsSecretKey)
val client = new AmazonS3Client(awsCredentials)

def listFoldersRecursively(bucketName: String, fullPath: String): List[String] = {
  try {
    val listObjectsRequest = new ListObjectsRequest()
      .withBucketName(bucketName)
      .withPrefix(fullPath)
    val folderPaths = client
      .listObjects(listObjectsRequest)
      .getObjectSummaries
      .asScala
      .map(_.getKey)
    folderPaths.filter(_.endsWith("/")).toList
  }
}
Here's the structure of my bucket as seen through an S3 client:
Here's the list I am getting using this Scala code:
Without any apparent pattern, many folders are missing from the list of retrieved folders.
I did not use
client.listObjects(listObjectsRequest).getCommonPrefixes.toList
because it was returning an empty list for some reason.
P.S.: I couldn't add photos to the post directly because I'm a new user.
Without any apparent pattern, many folders are missing from the list of retrieved folders.
Here's your problem: you are assuming there should always be objects with keys ending in / to symbolize folders.
This is an incorrect assumption. They will only be there if you created them, either via the S3 console or the API. There's no reason to expect them, as S3 doesn't actually need them or use them for anything, and the S3 service does not create them spontaneously, itself.
If you use the API to upload an object with key foo/bar.txt, this does not create the foo/ folder as a distinct object. It will appear as a folder in the console for convenience, but it isn't there unless at some point you deliberately created it.
Of course, the only way to upload such an object with the console is to "create" the folder unless it already appears -- but "appears in the console" does not necessarily equate to "exists as a distinct object".
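For illustration, here's roughly what the console's "create folder" amounts to: putting a zero-byte object whose key ends in /. A hedged sketch with the AWS Java SDK v1, reusing client and bucketName from the question:

import com.amazonaws.services.s3.model.ObjectMetadata;
import java.io.ByteArrayInputStream;

// A "folder" marker is just a zero-byte object whose key ends in "/".
ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentLength(0);
client.putObject(bucketName, "foo/",
        new ByteArrayInputStream(new byte[0]), metadata);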
Filtering on endsWith("/") is invalid logic.
This is why the underlying API includes CommonPrefixes with each ListObjects response if delimiter and prefix are specified. This is a list of the next level of "folders", which you have to recursively drill down into in order to find the next level.
If you specify a prefix, all keys that contain the same string between the prefix and the first occurrence of the delimiter after the prefix are grouped under a single result element called CommonPrefixes. If you don't specify the prefix parameter, the substring starts at the beginning of the key. The keys that are grouped under the CommonPrefixes result element are not returned elsewhere in the response.
https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html
You need to access this functionality with whatever library you are using, or you need to iterate the entire list of keys and discover the actual common prefixes on / boundaries using string splitting.
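For the CommonPrefixes approach, a minimal sketch with the AWS Java SDK v1 (in Java for brevity; the same calls work from Scala, and client, bucketName, and fullPath are the names from the question). With a delimiter set, each response carries the next level of "folders" in getCommonPrefixes(), which you would then recurse into:

import com.amazonaws.services.s3.model.ListObjectsRequest;
import java.util.List;

ListObjectsRequest request = new ListObjectsRequest()
        .withBucketName(bucketName)
        .withPrefix(fullPath)    // e.g. "parent/"
        .withDelimiter("/");
// One level of "folders" directly under fullPath; recurse into each to go deeper.
List<String> childFolders = client.listObjects(request).getCommonPrefixes();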
Well, in case someone faces the same problem in the future, the alternative logic I used is the one suggested by @Michael above: I iterated through all the keys and split each one at the last occurrence of /. The first part of the split, with / appended, was the key of a folder, and I added it to another list. At the end, I returned the distinct entries of the list I had been appending to. This gave me all the folders and sub-folders under a certain prefix location.
Note that I didn't use CommonPrefixes because I wasn't using any delimiter, and that's because I didn't want the list of folders at one level but instead wanted to recursively get all the folders and sub-folders.
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

def listFoldersRecursively(bucketName: String, fullPath: String): List[String] = {
  try {
    val listObjectsRequest = new ListObjectsRequest()
      .withBucketName(bucketName)
      .withPrefix(fullPath)
    val folderPaths = client
      .listObjects(listObjectsRequest)
      .getObjectSummaries
      .asScala
      .map(_.getKey)
      .toList
    val foldersList: ArrayBuffer[String] = ArrayBuffer()
    for (folderPath <- folderPaths) {
      // splitAt keeps everything before the last "/" in _1
      val split = folderPath.splitAt(folderPath.lastIndexOf("/"))
      if (!split._1.equals(""))
        foldersList += split._1 + "/"
    }
    foldersList.toList.distinct
  }
}
P.S.: The catch block is intentionally missing due to irrelevancy.
The listObjects function (and the other listing methods) paginates, returning up to 1000 entries per response.
From the doc:
Because buckets can contain a virtually unlimited number of keys, the complete results of a list query can be extremely large. To manage large result sets, Amazon S3 uses pagination to split them into multiple responses. Always check the ObjectListing.isTruncated() method to see if the returned listing is complete or if additional calls are needed to get more results. Alternatively, use the AmazonS3Client.listNextBatchOfObjects(ObjectListing) method as an easy way to get the next page of object listings.
Related
I am uploading a list of files into S3 using uploadFileList().
This API takes the list (records) as a parameter, like below:
MultipleFileUpload xfer = tm.uploadFileList(bucketName, "TEST", new File(fileLocation), records);
The records in the list look like this:
21564_114762642_ANA_9ECB7C98-C2D7-428A-B6AD-7A6C62E1A7BE_App.xml.gz
21224_114762642_ANA_9ECB7C98-C2D7-428A-B6AD-7A6C62E1A7BE_App.xml.gz
20780_114762642_ANA_9ECB7C98-C2D7-428A-B6AD-7A6C62E1A7BE_App.xml.gz
20407_114762642_ANA_9ECB7C98-C2D7-428A-B6AD-7A6C62E1A7BE_App.xml.gz
This is working fine as of now.
Now I need to add a prefix in the API: the leading digits of the file name (21564 for the first file, for example) should be its prefix.
To do this I would have to iterate over the list and upload file by file, but that would slow down the upload to S3 compared to uploading the whole list.
Is there any way to add a prefix while uploading a list into S3? The files in the list are random, but the pattern is fixed.
See the S3 documentation about object keys. Because S3 buckets are flat (not a filesystem hierarchy), you can tell Amazon the key prefix to use for your uploaded files, thus grouping them all under the same prefix. For example, I could supply "MovieReviews/" as a prefix for a list of files and the resulting object keys in S3 would begin with that. Some tools understand the slashes and allow you to browse your S3 bucket as a directory hierarchy.
In your case, if the files should use the first N characters as a grouping key, then you can upload them in batches by first grouping according to that substring, e.g. fileList.stream().collect(Collectors.groupingBy(s -> s.substring(0, N))), as sketched below.
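A rough sketch of that batching, assuming records is the List<File> from the question, tm is the existing TransferManager, and N is the prefix length (grouping here is on each file's name rather than a plain string list):

import com.amazonaws.services.s3.transfer.MultipleFileUpload;
import java.io.File;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Group the files by the leading N characters of their names.
Map<String, List<File>> groups = records.stream()
        .collect(Collectors.groupingBy(f -> f.getName().substring(0, N)));

// One bulk upload per group, using the group key as the S3 prefix.
for (Map.Entry<String, List<File>> e : groups.entrySet()) {
    MultipleFileUpload xfer =
            tm.uploadFileList(bucketName, e.getKey(), new File(fileLocation), e.getValue());
    xfer.waitForCompletion();  // throws InterruptedException
}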
I have three MapReduce jobs, operating on the same input files, that produce tab-delimited output. The first value is the key; this is the case for the output of all three MR jobs.
What I want to do now is use MapReduce to "stitch" these files together by key. What would be the best Mapper output and Reducer input? I tried using ArrayWritable, but because of the shuffle, for some records the ArrayWritable from one file ends up in the third position instead of the second.
I want this:
Key \t Values-from-first-MR-job \t Values-from-second-MR-job \t Values-from-third-MR-job
And this should be the same for all records. But, as I said, because of the shuffle, sometimes this happens for a few records:
Key \t Values-from-third-MR-job \t Values-from-first-MR-job \t Values-from-second-MR-job
How should I set up my Mapper and Reducer to fix this?
It's possible with simple tagging on the emitted value, since only three types of files are involved. In the map, extract the path of the split, identify which output it came from, and add a suitable prefix to the value. For clarity, say the outputs are in three directories:
path1/mr_out_1
path2/mr_out_2
path3/mr_out_3
Using TextInputFormat for all these paths, in the map you will do:
String[] keyVal = value.toString().split("\t", 2);
Path filePath = ((FileSplit) context.getInputSplit()).getPath();
String dirName = filePath.getParent().getName();
Text outValue = new Text();
if (dirName.equals("mr_out_1")) {
    outValue.set("1_" + keyVal[1]);
} else if (dirName.equals("mr_out_2")) {
    outValue.set("2_" + keyVal[1]);
} else {
    outValue.set("3_" + keyVal[1]);
}
context.write(new Text(keyVal[0]), outValue);
If you have all the files in the same directory, use the file name instead of dirName, then identify the flag based on the name (a regex match may be suitable):
String fileName = filePath.getName();
if(fileName.matches("regex")){ ... }
In the reduce, just put the incoming values into a List and sort them. The rest is simple enough.
List<String> list = new ArrayList<String>(3);
for (Text v : values) {
    list.add(v.toString());
}
Collections.sort(list);  // the "1_", "2_", "3_" tags restore the job order
StringBuilder builder = new StringBuilder();
for (String s : list) {
    builder.append(s.substring(2)).append("\t");  // drop the "N_" tag
}
context.write(key, new Text(builder.toString().trim()));
I think it will serve the purpose. Keep in mind that the Collections.sort strategy will fail if there are more than 9 files (due to lexicographic ordering). In that case you may extract the tag separately, parse it as an Integer, and use a TreeMap<tag, actualString> for sorting, as sketched below.
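A hedged sketch of that TreeMap variant (same reduce context as above; the tag is parsed as an integer, so the ordering no longer depends on string comparison):

import java.util.TreeMap;

TreeMap<Integer, String> ordered = new TreeMap<Integer, String>();
for (Text v : values) {
    String s = v.toString();
    int sep = s.indexOf('_');
    // The numeric tag decides the order; the rest is the original value.
    ordered.put(Integer.valueOf(s.substring(0, sep)), s.substring(sep + 1));
}
StringBuilder builder = new StringBuilder();
for (String s : ordered.values()) {
    builder.append(s).append("\t");
}
context.write(key, new Text(builder.toString().trim()));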
NB: All the above snippets use the new API. I didn't write them in an IDE, so a few syntax errors may exist, and I didn't follow proper conventions; for example, the map's outKey could be a class member, and using outKey.set(keyVal[0]) would avoid the overhead of creating a Text object per record.
For example, I've got a place in my code that receives many files from disk (many of them identical) and then unmarshalls them.
final File configurationFile = getConfigurationFile();
FileOutputStream fileOutputStream = new FileOutputStream(configurationFile);
Marshaller.marshal(configObject, fileOutputStream);
Obviously, I can create a special cache map for them to improve performance (so as not to unmarshall identical files again and again). For my case, a HashMap implementation will be enough.
The question is: what key for that should I use?
Is configurationFile.hashCode() very bad for this?
Thanks for all your answers!
Use the canonical path instead of the absolute path (explanation of the difference) and put it in a HashSet. Sets don't allow duplicate values: if you try to add a value that already exists, add returns false, otherwise true.
Example code (untested):
Set<String> filesMarshalled = new HashSet<>();
...
final File configurationFile = getConfigurationFile();
if (filesMarshalled.add(configurationFile.getCanonicalPath())) {
    // not marshalled yet
    FileOutputStream fileOutputStream = new FileOutputStream(configurationFile);
    Marshaller.marshal(configObject, fileOutputStream);
}
You can also use a HashSet without actually worrying about the key.
if(hashset.add(file)) {
// do unmarshling;
} else {
//do nothing
}
The HashSet.add() method returns true if the object can be added.
If you try to add a duplicate entry, it will return false, since duplicates are not allowed in sets.
...identical files again and again...
What counts as identical?
If the file content decides, you may use a hash of the content (e.g. MD5, SHA-1, SHA-256) as the key.
If the file name must be identical, simply use the file name as the key.
If the file path, then use the full path of the file as the key (File.getCanonicalPath()).
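If content decides, here is a minimal sketch of deriving such a key with standard JDK APIs only (contentKey is a hypothetical helper name):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

static String contentKey(File file) throws IOException, NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest(Files.readAllBytes(file.toPath()));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
        hex.append(String.format("%02x", b));  // byte-by-byte hex encoding
    }
    return hex.toString();  // identical content -> identical key
}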
I want to capture a specific file name, and the easiest way I found was a library called JavaXT. Based on the examples on the official site (http://www.javaxt.com/javaxt-core/io/Directory/Recursive_Directory_Search), I tried to print the result in my console application with:
javaxt.io.Directory directory = new javaxt.io.Directory("/temp");
javaxt.io.File[] files;
//Return a list of PDF documents found in the current directory and in any subdirectories
files = directory.getFiles("*.pdf", true);
System.out.println(files);
But the returned value is always strange characters like [Ljavaxt.io.File;@5266db4e
Someone could help me to print the correct file(s) name?
When you try to print an array, what you get is its type descriptor followed by its identity hash code. Try this if you want to see it:
Integer[] a = { 1, 2, 3 };
System.out.println(a);
the output will be
[Ljava.lang.Integer;@3244331c
If you want to print element by element, you can iterate through the array. In this case, using a for-each:
for (javaxt.io.File f : files)
System.out.println(f);
Note that this will print the String returned by the method toString() of the object.
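As a side note, java.util.Arrays.toString gives a one-line dump of the elements' toString() values, if writing a loop is overkill:

import java.util.Arrays;

System.out.println(Arrays.toString(files));  // e.g. [a.pdf, b.pdf]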
Your files variable is an array. You need
for(javaxt.io.File f:files) System.out.println(f);
Because files is an array, Java will print the array type and the hex hash code.
I am trying to write the name of a file into Accumulo. I am using accumulo-core-1.43.
For some reason, certain files seem to be written into Accumulo with trailing \x00 characters at the end of the name. The upload comes through a Java servlet (using the jQuery File Upload plugin). In the servlet, I check the name of the file with a System.out.println and it looks normal; I even tried unescaping the string with
org.apache.commons.lang.StringEscapeUtils.unescapeJava(...);
The actual writing to accumulo looks like this:
Mutation mut = new Mutation(new Text(checkSum));
Value val = new Value(new Text(filename).getBytes());
long timestamp = System.currentTimeMillis();
mut.put(new Text(colFam), new Text(EMPTY_BYTES), timestamp, val);
but nothing unusual showed up there (perhaps \x00 isn't escaped?). But then if I do a scan on my table in Accumulo, there will be one or more \x00 characters in the file name.
The problem this seems to cause: I return that string within XML when I retrieve a list of files (where it shows up) and pass it back to the browser, and the XSL that is supposed to render the information in the XML no longer works when these extra characters are present (not sure why that is the case either).
In Chrome, in the response for these calls, I see three red dots after the file name, and when I hover over them, \u0 pops up (which I think is a different representation of 0/null?).
Anyway, I'm just trying to figure out why this happens, or at the very least, how I can filter out \x00 characters before returning the file name in Java. Any ideas?
You are likely using the Hadoop Text class incorrectly -- this is not an error in Accumulo. Specifically, the mistake is in this line of your example:
Value val = new Value(new Text(filename).getBytes());
You must adhere to the length provided by the Text class; see the Text javadoc for more information. If you're using Hadoop 2.2.0, you can use the provided copyBytes method on Text. If you're on an older version of Hadoop where this method doesn't yet exist, you can use something like the ByteBuffer class or the System.arraycopy method to get a copy of the byte[] with the proper limits enforced.
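A hedged sketch of both options (Text and Value are the real Hadoop and Accumulo classes; filename is the variable from the question):

import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

Text text = new Text(filename);

// Hadoop 2.2.0+: copyBytes() already respects getLength()
Value val = new Value(text.copyBytes());

// Older Hadoop: copy only getLength() bytes out of the backing array,
// since getBytes() may return a larger, padded buffer.
byte[] exact = new byte[text.getLength()];
System.arraycopy(text.getBytes(), 0, exact, 0, text.getLength());
Value valOld = new Value(exact);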