I am trying to print the metadata of all the objects in an S3 bucket, but the listing never returns more than 1000 objects. I tried using objectListing.isTruncated(), and it did not help. Here is a sample of the code I used to try to list more than 1000 objects.
ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
        .withBucketName(bucketName);
ObjectListing objectListing;
do {
    objectListing = s3client.listObjects(listObjectsRequest);
    for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
        System.out.println(" - " + objectSummary.getKey() + " " +
                "(size = " + objectSummary.getSize() + ")");
        listObjectsRequest.setMarker(objectListing.getNextMarker());
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker());
} while (objectListing.isTruncated());
For all those reading this in 2018 or later: there is a newer API in the Java SDK that lets you iterate through the objects in an S3 bucket very easily, without having to deal with pagination yourself:
AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
S3Objects.inBucket(s3, "bucket").forEach((S3ObjectSummary objectSummary) -> {
    // TODO: Consume `objectSummary` the way you need
    // System.out.println(objectSummary.getKey());
});
Amazon recently published the AWS SDK for Java 2.x. The API changed, so here is an SDK 2.x version:
S3Client client = S3Client.builder().region(Region.US_EAST_1).build();
ListObjectsV2Request request = ListObjectsV2Request.builder()
        .bucket("the-bucket")
        .prefix("the-prefix")
        .build();
ListObjectsV2Iterable response = client.listObjectsV2Paginator(request);
for (ListObjectsV2Response page : response) {
    page.contents().forEach(x -> System.out.println(x.key()));
}
ListObjectsV2Iterable is lazy as well:
When the operation is called, an instance of this class is returned. At this point, no service calls are made yet and so there is no guarantee that the request is valid. As you iterate through the iterable, SDK will start lazily loading response pages by making service calls until there are no pages left or your iteration stops. If there are errors in your request, you will see the failures only after you start iterating through the iterable.
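If you don't need the page boundaries, the same paginator can also be flattened. A minimal sketch reusing the client and request built above (contents() should lazily yield every S3Object across all pages):

// Flattened iteration over all objects, still loading pages lazily
client.listObjectsV2Paginator(request)
      .contents()
      .forEach(obj -> System.out.println(obj.key() + " (size = " + obj.size() + ")"));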
This solved my problem. I set up a marker and checked whether the listing was truncated, and I was able to print all the objects (more than 1000).
ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
        .withBucketName(bucketName);
ObjectListing objectListing;

// Ask for the output path once, before paging through the bucket
System.out.println("Enter the path where to save your file");
Scanner scan = new Scanner(System.in);
String path = scan.nextLine();
fileOne = new File(path);
fw = new FileWriter(fileOne.getAbsoluteFile(), true);
bw = new BufferedWriter(fw);
bw.write("Writing data to file");
bw.write("\n");

do {
    objectListing = s3.listObjects(listObjectsRequest);
    for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
        String key = objectSummary.getKey();
        String dummyKey = key.substring(2);
        if (dummyKey.equalsIgnoreCase("somestring")) {
            S3Object s3object = s3.getObject(new GetObjectRequest(bucketName, key));
            BufferedReader reader = new BufferedReader(new InputStreamReader(s3object.getObjectContent()));
            String line;
            int i = 0;
            while ((line = reader.readLine()) != null) {
                if (i > 0) {
                    bw.append(line + "," + s3object.getKey().substring(0, 2));
                    bw.append(objectSummary.getLastModified().toString());
                    bw.newLine();
                }
                i++;
                System.out.println(line);
            }
            reader.close();
        }
    }
    // Advance to the next page of results
    listObjectsRequest.setMarker(objectListing.getNextMarker());
} while (objectListing.isTruncated());
bw.close();
I'm stuck on this part. The aim is to take the values from a file.ini with this format:
X = Y
X1 = Y1
X2 = Y2
take the Y values, substitute them for the corresponding X keys in an scxml file, and save the result as a new file.scxml.
As you can see from my pasted code, I use a HashMap and the keys and values are read and printed correctly, but although it seems right, the code that replaces the values only works for the first entry of the HashMap.
The code is currently as follows:
public String getPropValues() throws IOException {
    try {
        Properties prop = new Properties();
        String pathconf = this.pathconf;
        String pathxml = this.pathxml;

        //Read file conf
        File inputFile = new File(pathconf);
        InputStream is = new FileInputStream(inputFile);
        BufferedReader br = new BufferedReader(new InputStreamReader(is));

        //load the buffered file
        prop.load(br);
        String name = prop.getProperty("name");

        //Read xml file to get the format
        FileReader reader = new FileReader(pathxml);
        String newString;
        StringBuffer str = new StringBuffer();
        String lineSeparator = System.getProperty("line.separator");
        BufferedReader rb = new BufferedReader(reader);

        //read file.ini to HashMap
        Map<String, String> mapFromFile = getHashMapFromFile();

        //iterate over HashMap entries
        for (Map.Entry<String, String> entry : mapFromFile.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
            //replace values
            while ((newString = rb.readLine()) != null) {
                str.append(lineSeparator);
                str.append(newString.replaceAll(entry.getKey(), entry.getValue()));
            }
        }
        rb.close();

        String pathwriter = pathxml + name + ".scxml";
        BufferedWriter bw = new BufferedWriter(new FileWriter(new File(pathwriter)));
        bw.write(str.toString());
        //flush the stream
        bw.flush();
        //close the stream
        bw.close();
    } catch (Exception e) {
        System.out.println("Exception: " + e);
    }
    return result;
}
So my .ini file is, for example:
Apple = red
Lemon = yellow
It prints the keys and values correctly:
Apple -> red
Lemon -> yellow
but in the file it only replaces Apple with red, not the other keys.
The problem lies in your control flow order.
By the time the first iteration of your for loop (the one for the first entry, Apple -> red) has run, the BufferedReader rb has already reached the end of the stream, so the subsequent iterations do nothing.
You then have to either reinitialize the BufferedReader for each entry or, better, invert the loops so that the iteration over your Map entries happens inside the BufferedReader read loop (see the edited code below):
EDIT (following @David's hints)
You can assign the result of each replacement back to the current line, which is then appended to the output buffer once per line:
public String getPropValues() throws IOException {
    try {
        // ...
        BufferedReader rb = new BufferedReader(reader);

        //read file.ini to HashMap
        Map<String, String> mapFromFile = getHashMapFromFile();

        //replace values
        while ((newString = rb.readLine()) != null) {
            // iterate over HashMap entries
            for (Map.Entry<String, String> entry : mapFromFile.entrySet()) {
                newString = newString.replace(entry.getKey(), entry.getValue());
            }
            str.append(lineSeparator)
               .append(newString);
        }
        rb.close();
        // ...
    } catch (Exception e) {
        System.out.println("Exception: " + e);
    }
    return result;
}
Currently, my code iterates over a huge number of files (.ndjson.gzip).
During the iteration, each file is converted into a string and that string is inserted into a HashSet.
My original idea was to stream all the files, so that the end user could manipulate the stream however they wish.
Also, it is a huge chunk of data, so I thought it would be much faster to stream the files one after the other.
1. How can I implement my idea?
2. For example, how can I return a stream from a function (getAllFiles()), so that I could print the content?
public class ServiceGzipped {
    HashSet<String> getAllFiles() throws IOException {
        HashSet<String> hashSet = new HashSet<>();
        ListObjectsV2Request request = new ListObjectsV2Request().withBucketName("some-bucket-name").withPrefix("some-prefix");
        ListObjectsV2Result result;
        do {
            result = client.listObjectsV2(request);
            for (S3ObjectSummary summary : result.getObjectSummaries()) {
                System.out.println(summary.getKey() + " : " + summary.getSize());
                String s = downloadFromAWS(summary.getKey());
                hashSet.add(s);
            }
            String token = result.getNextContinuationToken();
            System.out.println(token);
            request.setContinuationToken(token);
        } while (result.isTruncated());
        return hashSet;
    }

    String downloadFromAWS(String file) {
        S3Object s3Object = client.getObject(bucketName, file);
        String output = "";
        try {
            GZIPInputStream gzipInputStream = new GZIPInputStream(s3Object.getObjectContent());
            InputStreamReader reader = new InputStreamReader(gzipInputStream);
            BufferedReader in = new BufferedReader(reader);
            String readed;
            while ((readed = in.readLine()) != null) {
                System.out.println(readed);
                output += readed;
            }
            return output;
        } catch (Exception e) {
            System.out.println(e);
        }
        return null;
    }
Thanks in advance for your help :)
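One possible direction, sketched only: the v1 SDK's com.amazonaws.services.s3.iterable.S3Objects helper handles continuation tokens internally, so it can be wrapped in a lazy Stream. This sketch assumes the client field and the downloadFromAWS method from the snippet above, and reuses the placeholder bucket and prefix names:

Stream<String> getAllFilesAsStream() {
    // S3Objects pages through the listing lazily; no manual continuation-token handling
    Iterable<S3ObjectSummary> summaries =
            S3Objects.withPrefix(client, "some-bucket-name", "some-prefix");

    // Nothing is downloaded until a terminal operation pulls elements from the stream
    return StreamSupport.stream(summaries.spliterator(), false)
                        .map(summary -> downloadFromAWS(summary.getKey()));
}

// Usage: the caller decides how to consume the content, e.g.
// getAllFilesAsStream().forEach(System.out::println);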
I'm working on this "program" that reads data from 2 large csv files (line by line), compares an array element from the files and, when a match is found, writes the data I need into a 3rd file. The only problem is that it is very slow: it reads 1-2 lines per second, which is extremely slow considering I have millions of records. Any ideas on how I could make it faster? Here's my code:
public class ReadWriteCsv {
    public static void main(String[] args) throws IOException {
        FileInputStream inputStream = null;
        FileInputStream inputStream2 = null;
        Scanner sc = null;
        Scanner sc2 = null;
        String csvSeparator = ",";
        String line;
        String line2;
        String path = "D:/test1.csv";
        String path2 = "D:/test2.csv";
        String path3 = "D:/newResults.csv";
        String[] columns;
        String[] columns2;
        Boolean matchFound = false;
        int count = 0;
        StringBuilder builder = new StringBuilder();
        FileWriter writer = new FileWriter(path3);
        try {
            // specifies where to take the files from
            inputStream = new FileInputStream(path);
            inputStream2 = new FileInputStream(path2);
            // creating scanners for files
            sc = new Scanner(inputStream, "UTF-8");
            // while there is another line available do:
            while (sc.hasNextLine()) {
                count++;
                // storing the current line in the temporary variable "line"
                line = sc.nextLine();
                System.out.println("Number of lines read so far: " + count);
                // defines the columns[] as the line being split by ","
                columns = line.split(",");
                inputStream2 = new FileInputStream(path2);
                sc2 = new Scanner(inputStream2, "UTF-8");
                // checks if there is a line available in File2 and goes in the
                // while loop, reading file2
                while (!matchFound && sc2.hasNextLine()) {
                    line2 = sc2.nextLine();
                    columns2 = line2.split(",");
                    if (columns[3].equals(columns2[1])) {
                        matchFound = true;
                        builder.append(columns[3]).append(csvSeparator);
                        builder.append(columns[1]).append(csvSeparator);
                        builder.append(columns2[2]).append(csvSeparator);
                        builder.append(columns2[3]).append("\n");
                        String result = builder.toString();
                        writer.write(result);
                    }
                }
                builder.setLength(0);
                sc2.close();
                matchFound = false;
            }
            if (sc.ioException() != null) {
                throw sc.ioException();
            }
        } finally {
            // then I close my inputStreams, scanners and writer
Use an existing CSV library rather than rolling your own. It will be far more robust than what you have now.
However, your problem is not CSV parsing speed; it's that your algorithm is O(n^2): for each line in the first file, you scan the entire second file. This kind of algorithm explodes very quickly with the size of the data, and with millions of rows you'll run into problems. You need a better algorithm.
The other problem is that you are re-parsing the second file on every scan. You should at least read it into memory once at the start of the program, for example into an ArrayList or, better, a HashMap keyed on the column you match against, so you only load and parse it once.
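A rough sketch of that idea, using the paths and column indexes from the question (the naive split(",") is kept for brevity; a real CSV library would handle quoting properly):

static void joinCsvFiles() throws IOException {
    // Load test2.csv once into a map keyed on the column used for matching
    Map<String, String[]> lookup = new HashMap<>();
    try (BufferedReader r2 = new BufferedReader(new FileReader("D:/test2.csv"))) {
        String row;
        while ((row = r2.readLine()) != null) {
            String[] cols = row.split(",");
            lookup.put(cols[1], cols);
        }
    }

    // Single pass over test1.csv with O(1) lookups instead of rescanning the second file
    try (BufferedReader r1 = new BufferedReader(new FileReader("D:/test1.csv"));
         BufferedWriter out = new BufferedWriter(new FileWriter("D:/newResults.csv"))) {
        String row;
        while ((row = r1.readLine()) != null) {
            String[] cols = row.split(",");
            String[] match = lookup.get(cols[3]);
            if (match != null) {
                out.write(cols[3] + "," + cols[1] + "," + match[2] + "," + match[3]);
                out.newLine();
            }
        }
    }
}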
Use univocity-parsers' CSV parser as it won't take much longer than a couple of seconds to process two files with 1 million rows each:
public void diff(File leftInput, File rightInput) {
    CsvParserSettings settings = new CsvParserSettings(); // many config options here, check the tutorial
    CsvParser leftParser = new CsvParser(settings);
    CsvParser rightParser = new CsvParser(settings);
    leftParser.beginParsing(leftInput);
    rightParser.beginParsing(rightInput);
    String[] left;
    String[] right;
    int row = 0;
    while ((left = leftParser.parseNext()) != null && (right = rightParser.parseNext()) != null) {
        row++;
        if (!Arrays.equals(left, right)) {
            System.out.println(row + ":\t" + Arrays.toString(left) + " != " + Arrays.toString(right));
        }
    }
    leftParser.stopParsing();
    rightParser.stopParsing();
}
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
My code uses BufferedReader to read columns of data in a text file. The text file looks like:
Year.....H2OIN....CO2IN
0.000......0.0..........0.0
1.000......2.0..........6.0
2.000......3.0..........7.0
3.000......4.0..........8.0
My formatting code looks like:
try {
    FileInputStream file = new FileInputStream(inputFile);
    BufferedReader in = new BufferedReader(new InputStreamReader(file));
    f = new Formatter("M:\\TESTPACK\\AL6000803OUT.TXT");
    while ((line = in.readLine()) != null) {
        if (line.startsWith(" 0.000"))
            break;
    }
    while ((line = in.readLine()) != null) {
        stream = line.split(parse);
        start = line.substring(6,9);
        if (start.equals("000")) {
            H2OIN = Double.parseDouble(stream[1]);
            CO2IN = Double.parseDouble(stream[2]);
            f.format("%s ", H2OIN);
            f.format("%s ", CO2IN);
        }
    }
} catch (FileNotFoundException e) {
} catch (IOException e) {
}
f.close();
However, my output file looks like:
2.0 6.0 3.0 7.0 4.0 8.0
While I want it to look like:
2.0 3.0 4.0
6.0 7.0 8.0
I need a suggestion for how to apply formatting to the data strings, not the data itself. Essentially I need to transpose columns of data to rows of data. The duplicate post suggested was not the problem I'm trying to solve.
You'll need to include two StringBuffers: one for your H2OIN row and another for your CO2IN row.
Like so:
With your other declarations...
StringBuffer H2OINRow = new StringBuffer();
StringBuffer CO2INRow = new StringBuffer();
In your if (start.equals("000")) block...
// in place of the f.format calls
H2OINRow.append(H2OIN + " ");
CO2INRow.append(CO2IN + " ");
After your while loops...
f.format("%s\n", H2OINRow);
f.format("%s\n", CO2INRow);
I suggest you gather the values you want on each output line into separate Lists.
Your while loop would then look like:
List<String> h2oValues = new ArrayList<String>();
List<String> c02Values = new ArrayList<String>();
while ((line = in.readLine()) != null) {
    stream = line.split(parse);
    start = line.substring(6,9);
    if (start.equals("000")) {
        H2OIN = Double.parseDouble(stream[1]);
        CO2IN = Double.parseDouble(stream[2]);
        h2oValues.add(String.valueOf(H2OIN));
        c02Values.add(String.valueOf(CO2IN));
    }
}
After that, loop over the values of h2oValues to write them on one line, and do the same for c02Values:
for (String value : h2oValues) {
    f.format("%s ", value);
}
// Add an end-of-line character... using the system one, you might want to change that
f.format("%n");
for (String value : c02Values) {
    f.format("%s ", value);
}
For the end line, see this question if you want to change it.
I have a JSON file (.json) in Amazon S3. I need to read it and create a new field called Hash_index for each JsonObject. The file is very big, so I am using the GSON library (streaming with JsonReader) to avoid an OutOfMemoryError while reading it. Below is my code.
//Create the Hashed JSON
public void createHash() throws IOException
{
    System.out.println("Hash Creation Started");
    strBuffer = new StringBuffer("");
    try
    {
        //List all the Buckets
        List<Bucket> buckets = s3.listBuckets();
        for (int i = 0; i < buckets.size(); i++)
        {
            System.out.println("- " + (buckets.get(i)).getName());
        }

        //Downloading the Object
        System.out.println("Downloading Object");
        S3Object s3Object = s3.getObject(new GetObjectRequest(inputBucket, inputFile));
        System.out.println("Content-Type: " + s3Object.getObjectMetadata().getContentType());

        //Read the JSON File
        /*BufferedReader reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()));
        while (true) {
            String line = reader.readLine();
            if (line == null) break;
            // System.out.println(" " + line);
            strBuffer.append(line);
        }*/
        // JSONTokener jTokener = new JSONTokener(new BufferedReader(new InputStreamReader(s3Object.getObjectContent())));
        // jsonArray = new JSONArray(jTokener);
        JsonReader reader = new JsonReader(new BufferedReader(new InputStreamReader(s3Object.getObjectContent())));
        reader.beginArray();
        int gsonVal = 0;
        while (reader.hasNext()) {
            JsonParser _parser = new JsonParser();
            JsonElement jsonElement = _parser.parse(reader);
            JsonObject jsonObject1 = jsonElement.getAsJsonObject();

            //Do something
            StringBuffer hashIndex = new StringBuffer("");

            //Add Title and Body Together to the list
            String titleAndBodyContainer = jsonObject1.get("title") + " " + jsonObject1.get("body");

            //Remove full stops and commas
            titleAndBodyContainer = titleAndBodyContainer.replaceAll("\\.(?=\\s|$)", " ");
            titleAndBodyContainer = titleAndBodyContainer.replaceAll(",", " ");
            titleAndBodyContainer = titleAndBodyContainer.toLowerCase();

            //Create a word list without duplicated words
            StringBuilder result = new StringBuilder();
            HashSet<String> set = new HashSet<String>();
            for (String s : titleAndBodyContainer.split(" ")) {
                if (!set.contains(s)) {
                    result.append(s);
                    result.append(" ");
                    set.add(s);
                }
            }
            //System.out.println(result.toString());

            //Re-Arranging everything into Alphabetic Order
            String testString = "acarpous barnyard gleet diabolize acarus creosol eaten gleet absorbance";
            //String testHash = "057 1$k 983 5*1 058 52j 6!v 983 03z";
            String[] finalWordHolder = (result.toString()).split(" ");
            Arrays.sort(finalWordHolder);

            //Navigate through text and create the Hash
            for (int arrayCount = 0; arrayCount < finalWordHolder.length; arrayCount++)
            {
                if (wordMap.containsKey(finalWordHolder[arrayCount]))
                {
                    hashIndex.append((String) wordMap.get(finalWordHolder[arrayCount]));
                }
            }
            //System.out.println(hashIndex.toString().trim());

            jsonObject1.addProperty("hash_index", hashIndex.toString().trim());
            jsonObject1.addProperty("primary_key", gsonVal);
            jsonObjectHolder.add(jsonObject1); //Add the JSON Object to the JSON collection
            jsonHashHolder.add(hashIndex.toString().trim());
            System.out.println("Primary Key: " + jsonObject1.get("primary_key"));
            //System.out.println(Arrays.toString(finalWordHolder));
            //System.out.println("- " + hashIndex.toString());
            //break;
            gsonVal++;
        }
        System.out.println("Hash Creation Completed");
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
When this code is executed, I get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2894)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:407)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at HashCreator.createHash(HashCreator.java:252)
at HashCreator.<init>(HashCreator.java:66)
at Main.main(Main.java:9)
Line number 252 is result.append(s); it is inside the loop that builds the HashSet.
Previously, it generated an OutOfMemoryError at line number 254, which is set.add(s); that line is also inside the same loop.
My JSON files are really, really big: gigabytes and terabytes. I have no idea how to avoid this issue.
Use a streaming JSON library like Jackson.
Read in some JSON, add the hash, and write it out.
Then read in some more, process them, and write them out.
Keep going until you have processed all the objects.
http://wiki.fasterxml.com/JacksonInFiveMinutes#Streaming_API_Example
(See also this StackOverflow post: Is there a streaming API for JSON?)
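A minimal sketch of that approach using Jackson's streaming API. The hash_index field name matches the question; computeHash is a hypothetical stand-in for the hashing logic already in the question, and the input/output streams are placeholders:

void addHashes(InputStream in, OutputStream out) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    JsonFactory factory = mapper.getFactory();

    try (JsonParser parser = factory.createParser(in);
         JsonGenerator generator = factory.createGenerator(out)) {

        if (parser.nextToken() != JsonToken.START_ARRAY) {
            throw new IllegalStateException("Expected a JSON array");
        }
        generator.writeStartArray();

        // Only one object is held in memory at a time, not the whole file
        while (parser.nextToken() == JsonToken.START_OBJECT) {
            ObjectNode obj = mapper.readTree(parser);
            obj.put("hash_index", computeHash(obj));
            mapper.writeTree(generator, obj);
        }

        generator.writeEndArray();
    }
}

// Hypothetical helper: replace with the title/body hashing logic from the question
static String computeHash(ObjectNode obj) {
    return "";
}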