OutOfMemoryError in StringBuilder and HashSet - java

I have a JSON file (.json) in Amazon S3. I need to read it and create a new field called hash_index for each JsonObject. The file is very big, so I am using the GSON streaming reader to avoid an OutOfMemoryError while reading the file. Below is my code; please note that I am using GSON.
//Create the hashed JSON
public void createHash() throws IOException
{
    System.out.println("Hash Creation Started");
    strBuffer = new StringBuffer("");
    try
    {
        //List all the buckets
        List<Bucket> buckets = s3.listBuckets();
        for (int i = 0; i < buckets.size(); i++)
        {
            System.out.println("- " + (buckets.get(i)).getName());
        }
        //Download the object
        System.out.println("Downloading Object");
        S3Object s3Object = s3.getObject(new GetObjectRequest(inputBucket, inputFile));
        System.out.println("Content-Type: " + s3Object.getObjectMetadata().getContentType());
        //Read the JSON file
        /*BufferedReader reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()));
        while (true) {
            String line = reader.readLine();
            if (line == null) break;
            // System.out.println(" " + line);
            strBuffer.append(line);
        }*/
        // JSONTokener jTokener = new JSONTokener(new BufferedReader(new InputStreamReader(s3Object.getObjectContent())));
        // jsonArray = new JSONArray(jTokener);
        JsonReader reader = new JsonReader(new BufferedReader(new InputStreamReader(s3Object.getObjectContent())));
        reader.beginArray();
        int gsonVal = 0;
        while (reader.hasNext()) {
            JsonParser _parser = new JsonParser();
            JsonElement jsonElement = _parser.parse(reader);
            JsonObject jsonObject1 = jsonElement.getAsJsonObject();
            //Do something
            StringBuffer hashIndex = new StringBuffer("");
            //Add title and body together to the list
            String titleAndBodyContainer = jsonObject1.get("title") + " " + jsonObject1.get("body");
            //Remove full stops and commas
            titleAndBodyContainer = titleAndBodyContainer.replaceAll("\\.(?=\\s|$)", " ");
            titleAndBodyContainer = titleAndBodyContainer.replaceAll(",", " ");
            titleAndBodyContainer = titleAndBodyContainer.toLowerCase();
            //Create a word list without duplicated words
            StringBuilder result = new StringBuilder();
            HashSet<String> set = new HashSet<String>();
            for (String s : titleAndBodyContainer.split(" ")) {
                if (!set.contains(s)) {
                    result.append(s);
                    result.append(" ");
                    set.add(s);
                }
            }
            //System.out.println(result.toString());
            //Re-arrange everything into alphabetical order
            String testString = "acarpous barnyard gleet diabolize acarus creosol eaten gleet absorbance";
            //String testHash = "057 1$k 983 5*1 058 52j 6!v 983 03z";
            String[] finalWordHolder = (result.toString()).split(" ");
            Arrays.sort(finalWordHolder);
            //Navigate through the text and create the hash
            for (int arrayCount = 0; arrayCount < finalWordHolder.length; arrayCount++)
            {
                if (wordMap.containsKey(finalWordHolder[arrayCount]))
                {
                    hashIndex.append((String) wordMap.get(finalWordHolder[arrayCount]));
                }
            }
            //System.out.println(hashIndex.toString().trim());
            jsonObject1.addProperty("hash_index", hashIndex.toString().trim());
            jsonObject1.addProperty("primary_key", gsonVal);
            jsonObjectHolder.add(jsonObject1); //Add the JSON object to the JSON collection
            jsonHashHolder.add(hashIndex.toString().trim());
            System.out.println("Primary Key: " + jsonObject1.get("primary_key"));
            //System.out.println(Arrays.toString(finalWordHolder));
            //System.out.println("- " + hashIndex.toString());
            //break;
            gsonVal++;
        }
        System.out.println("Hash Creation Completed");
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
When this code is executed, I get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2894)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:407)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at HashCreator.createHash(HashCreator.java:252)
at HashCreator.<init>(HashCreator.java:66)
at Main.main(Main.java:9)
[root@ip-172-31-45-123 JarFiles]#
Line number 252 is result.append(s);, inside the word-deduplication loop.
Previously it generated the OutOfMemoryError at line number 254, which is set.add(s);, also inside that loop.
My JSON files are really, really big: gigabytes, even terabytes. I have no idea how to avoid the above issue.

Use a streaming JSON library like Jackson.
Read in some JSON, add the hash, and write it out.
Then read in some more, process them, and write them out.
Keep going until you have processed all the objects.
http://wiki.fasterxml.com/JacksonInFiveMinutes#Streaming_API_Example
(See also this StackOverflow post: Is there a streaming API for JSON?)
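A minimal sketch of that read-process-write loop with Jackson's streaming API follows. The computeHash helper is a hypothetical stand-in for your hashing logic, and the input/output streams would come from S3 and wherever you write results; only one object is held in memory at a time.

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamingHashRewrite {
    public static void rewrite(InputStream in, OutputStream out) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonFactory factory = mapper.getFactory();
        try (JsonParser parser = factory.createParser(in);
             JsonGenerator generator = factory.createGenerator(out)) {
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IllegalStateException("Expected a JSON array");
            }
            generator.writeStartArray();
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                // Read exactly one object of the array into memory...
                ObjectNode node = mapper.readTree(parser);
                // ...attach the new field...
                node.put("hash_index", computeHash(node));
                // ...write it straight back out, and let it be garbage collected.
                mapper.writeTree(generator, node);
            }
            generator.writeEndArray();
        }
    }

    private static String computeHash(ObjectNode node) {
        return ""; // placeholder: deduplicate, sort, and map the words here
    }
}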

Related

How can I iterate over a large number of files from S3, decompress each of them, and pass them on as a Stream?

So currently, my code iterates over a huge number of files (.ndjson.gzip).
During the iteration, each file is converted into a string, and this string is inserted into a HashSet.
My original idea was to stream all the files, so that the end user could manipulate the stream as they wish.
Also, since it is a huge chunk of data, I thought it would be much faster to stream the files one after the other.
1. How can I implement my idea?
2. For example, how can I return a stream from a function (getAllFiles()), so that I could print the content?
public class ServiceGzipped {
    HashSet<String> getAllFiles() throws IOException {
        HashSet<String> hashSet = new HashSet<>();
        ListObjectsV2Request request = new ListObjectsV2Request().withBucketName("some-bucket-name").withPrefix("some-prefix");
        ListObjectsV2Result result;
        do {
            result = client.listObjectsV2(request);
            for (S3ObjectSummary summary : result.getObjectSummaries()) {
                System.out.println(summary.getKey() + " : " + summary.getSize());
                String s = downloadFromAWS(summary.getKey());
                hashSet.add(s);
            }
            String token = result.getNextContinuationToken();
            System.out.println(token);
            request.setContinuationToken(token);
        } while (result.isTruncated());
        return hashSet;
    }

    String downloadFromAWS(String file) {
        S3Object s3Object = client.getObject(bucketName, file);
        String output = "";
        try {
            GZIPInputStream gzipInputStream = new GZIPInputStream(s3Object.getObjectContent());
            InputStreamReader reader = new InputStreamReader(gzipInputStream);
            BufferedReader in = new BufferedReader(reader);
            String readed;
            while ((readed = in.readLine()) != null) {
                System.out.println(readed);
                output += readed;
            }
            return output;
        } catch (Exception e) {
            System.out.println(e);
        }
        return null;
    }
}
Thanks in advance for your help :)
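A minimal sketch of the streaming idea, assuming the same client and bucketName fields as above (and imports for java.util.stream.Stream, java.nio.charset.StandardCharsets, and java.io.UncheckedIOException): instead of building a String per file, expose each object's lines as a lazy Stream<String>, so the caller decides how to consume them and nothing is buffered beyond the current line.

// Inside ServiceGzipped, as an alternative to downloadFromAWS:
Stream<String> streamFromAWS(String file) {
    try {
        S3Object s3Object = client.getObject(bucketName, file);
        GZIPInputStream gzip = new GZIPInputStream(s3Object.getObjectContent());
        BufferedReader in = new BufferedReader(new InputStreamReader(gzip, StandardCharsets.UTF_8));
        // Lines are read on demand; close the S3 stream once the caller is done.
        return in.lines().onClose(() -> {
            try {
                in.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}

A caller can then print a file without ever holding it in memory:

try (Stream<String> lines = service.streamFromAWS(key)) {
    lines.forEach(System.out::println);
}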

Writing multiple times on a JSON file with Java

I'm trying to write multiple times to a JSON file using JSON-Simple and Java, but I run into problems after the second run. I'm new to JSON, so this is just a way to learn about it. Here is the code:
public class Writer {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws IOException {
        JSONParser parser = new JSONParser();
        JSONObject outer = new JSONObject();
        JSONObject inner = new JSONObject();
        JSONObject data = new JSONObject();
        ArrayList<JSONObject> arr = new ArrayList<JSONObject>();
        inner.put("Name", "Andrea");
        inner.put("Email", "andrea@mail.com");
        arr.add(inner);
        outer.put("Clienti", arr);
        System.out.println("Dati: " + outer);
        File file = new File("temp.json");
        if (file.exists()) {
            PrintWriter write = new PrintWriter(new FileWriter(file));
            Iterator<JSONObject> iterator = arr.iterator();
            while (iterator.hasNext()) {
                JSONObject it = iterator.next();
                data = (JSONObject) it;
            }
            arr.add(data);
            outer.put("Clienti", arr);
            System.out.println("Dati: " + outer);
            write.write(outer.toString());
            write.flush();
            write.close();
        } else {
            PrintWriter write = new PrintWriter(new FileWriter(file));
            write.write(outer.toString());
            write.flush();
            write.close();
        }
    }
}
So I just want to add the same thing without losing what I added before, but here is what happens when I run it.
The first run goes well; it prints normally to the file.
Result:
Dati: {"Clienti":[{"Email":"andrea@gmail.com","Nome":"Andrea"}]}
The second run works too; it adds another entry to the list, keeping the first one.
Result:
Dati:
{"Clienti":[{"Email":"andrea@gmail.com","Nome":"Andrea"},{"Email":"andrea@gmail.com","Nome":"Andrea"}]}
From the third run onward it no longer updates the file: instead of adding another entry to the existing two, it just prints the second result again.
I have tried many options but still can't understand how to add a third entry without losing the previous two. How can I solve this?
Solved by reading the existing file back in first and putting this in the if clause:
if (file.exists()) {
    Object obj = parser.parse(new FileReader("temp.json"));
    JSONObject jsonObject = (JSONObject) obj;
    JSONArray array = (JSONArray) jsonObject.get("Clienti");
    PrintWriter write = new PrintWriter(new FileWriter(file));
    Iterator<JSONObject> iterator = array.iterator();
    while (iterator.hasNext()) {
        JSONObject it = iterator.next();
        data = (JSONObject) it;
        System.out.println("Data" + data);
        arr.add(data);
    }
    arr.add(inner);
    System.out.println(arr);
    outer.put("Clienti", arr);
    System.out.println("Dati: " + outer);
    write.write(outer.toString());
    write.flush();
    write.close();
}

Could we iterate over the complete set of objects in Amazon S3

I have tried to print the metadata of all the objects in an S3 bucket. However, it does not return results for more than 1000 objects. I have tried using objectListing.isTruncated(), and it did not help. Here is a sample of the code I used to try to list more than 1000 objects.
ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
        .withBucketName(bucketName);
ObjectListing objectListing;
do {
    objectListing = s3client.listObjects(listObjectsRequest);
    for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
        System.out.println(" - " + objectSummary.getKey() + " " +
                "(size = " + objectSummary.getSize() + ")");
        listObjectsRequest.setMarker(objectListing.getNextMarker()); // note: the marker is also set again after the loop
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker());
} while (objectListing.isTruncated());
For all those who read this in 2018+: there is a new API in the Java SDK that lets you iterate through the objects in an S3 bucket very easily, without wrestling with pagination:
AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
S3Objects.inBucket(s3, "bucket").forEach((S3ObjectSummary objectSummary) -> {
    // TODO: Consume `objectSummary` the way you need
    // System.out.println(objectSummary.getKey());
});
Amazon recently published AWS SDK for Java 2.x. The API changed, so here is an SDK 2.x version:
S3Client client = S3Client.builder().region(Region.US_EAST_1).build();
ListObjectsV2Request request = ListObjectsV2Request.builder().bucket("the-bucket").prefix("the-prefix").build();
ListObjectsV2Iterable response = client.listObjectsV2Paginator(request);
for (ListObjectsV2Response page : response) {
    page.contents().forEach(x -> System.out.println(x.key()));
}
ListObjectsV2Iterable is lazy as well:
When the operation is called, an instance of this class is returned. At this point, no service calls are made yet and so there is no guarantee that the request is valid. As you iterate through the iterable, SDK will start lazily loading response pages by making service calls until there are no pages left or your iteration stops. If there are errors in your request, you will see the failures only after you start iterating through the iterable.
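For example (assuming the same client and request as above), the paginator can also flatten the pages for you via its contents() view, so you never touch page boundaries at all:

// Lazily walks every page of results, one object at a time.
client.listObjectsV2Paginator(request)
      .contents()
      .forEach(obj -> System.out.println(obj.key()));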
This solved my problem. I set up a marker for the truncated list and was able to print all the objects (more than 1000).
ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
        .withBucketName(bucketName);
ObjectListing objectListing;
do {
    objectListing = s3.listObjects(listObjectsRequest);
    System.out.println("Enter the path where to save your file");
    Scanner scan = new Scanner(System.in);
    String path = scan.nextLine();
    fileOne = new File(path);
    fw = new FileWriter(fileOne.getAbsoluteFile(), true);
    bw = new BufferedWriter(fw);
    bw.write("Writing data to file");
    bw.write("\n");
    for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
        String key = objectSummary.getKey();
        String dummyKey = key.substring(2);
        if (dummyKey.equalsIgnoreCase("somestring")) {
            S3Object s3object = s3.getObject(new GetObjectRequest(bucketName, key));
            BufferedReader reader = new BufferedReader(new InputStreamReader(s3object.getObjectContent()));
            String line;
            int i = 0;
            while ((line = reader.readLine()) != null) {
                if (i > 0) {
                    bw.append(line + "," + s3object.getKey().substring(0, 2));
                    bw.append(objectSummary.getLastModified().toString());
                    bw.newLine();
                }
                i++;
                System.out.println(line);
            }
        }
        // bw.close();
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker());
} while (objectListing.isTruncated());

How to extract multiple JSON Objects into String

I am reading multiple JSON objects from a file and converting them into a string using a StringBuilder.
These are the JSON objects:
{"Lng":"-1.5908601","Lat":"53.7987816"}
{"Lng":"-2.5608601","Lat":"54.7987816"}
{"Lng":"-3.5608601","Lat":"55.7987816"}
{"Lng":"-4.5608601","Lat":"56.7987816"}
{"Lng":"-5.560837","Lat":"57.7987816"}
{"Lng":"-6.5608294","Lat":"58.7987772"}
{"Lng":"-7.5608506","Lat":"59.7987823"}
How do I convert them into a string?
The actual code is:
BufferedReader reader = new BufferedReader(new InputStreamReader(contents.getInputStream()));
StringBuilder builder = new StringBuilder();
String line;
try {
    while ((line = reader.readLine()) != null) {
        builder.append(line);
    }
} catch (IOException e) {
    msg.Log(e.toString());
}
String contentsAsString = builder.toString();
//msg.Log(contentsAsString);
I tried this code:
JSONObject json = new JSONObject(contentsAsString);
Iterator<String> iter = json.keys();
while (iter.hasNext()) {
    String key = iter.next();
    try {
        Object value = json.get(key);
        msg.Log("Value :- " + value);
    } catch (JSONException e) {
        //error
    }
}
It just gives the first object. How do I loop over all of them?
Try this and see how it works for you:
BufferedReader in = new BufferedReader(new FileReader("foo.in"));
ArrayList<JSONObject> contentsAsJsonObjects = new ArrayList<JSONObject>();
while (true) {
    String str = in.readLine();
    if (str == null) break;
    contentsAsJsonObjects.add(new JSONObject(str));
}
for (int i = 0; i < contentsAsJsonObjects.size(); i++) {
    JSONObject json = contentsAsJsonObjects.get(i);
    String lat = json.getString("Lat");
    String lng = json.getString("Lng");
    Log.i("TAG", lat + lng);
}
What you are doing is loading multiple JSON objects into one JSONObject. That cannot work: only the first object is parsed, because the parser does not expect anything after the first }. Since you want to loop over the loaded objects, you should load them into a JSON array.
If you can edit the input file, convert it to an array by adding brackets and commas:
[
    {},
    {}
]
If you cannot, prepend the opening bracket to the beginning of the StringBuilder, append a comma after each loaded line, and close the array at the end, as in the sketch below. Consider an additional check to avoid exceptions caused by an improper input file.
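A minimal sketch of that wrapping, reusing the reader from the question's first snippet and assuming one JSON object per line as in the sample input:

StringBuilder builder = new StringBuilder("[");
String line;
boolean first = true;
while ((line = reader.readLine()) != null) {
    if (line.trim().isEmpty()) continue; // guard against blank lines
    if (!first) builder.append(',');     // comma between consecutive objects
    builder.append(line);
    first = false;
}
builder.append(']');
String contentsAsString = builder.toString();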
Finally, you can create a JSON array from the string and loop over it with this code:
JSONArray array = new JSONArray(contentsAsString);
for (int i = 0; i < array.length(); ++i) {
    JSONObject object = array.getJSONObject(i);
}

How to read and store data from a text file in which the first line contains titles, and the other lines are related data

I have a text file with 300 lines or so, in a format like:
Name Amount Unit CountOfOrder
A 1 ml 5000
B 1 mgm 4500
C 4 gm 4200
// more data
I need to read the text file line by line because each line of data should stay together for further processing.
Right now I just use a string array for each line and access the data by index.
for each line in file:
    array[0] = {data from the 'Name' column}
    array[1] = {data from the 'Amount' column}
    array[2] = {data from the 'Unit' column}
    array[3] = {data from the 'CountOfOrder' column}
    ....
    someOtherMethods(array);
    ....
However, I realized that if the text file changes its format (e.g. two columns are switched, or another column is inserted), it would break my program (accessing by index might fetch the wrong data or even cause an exception).
So I would like to use the titles as references to access each column. Maybe a HashMap is a good option, but since I have to keep each line of data together, building a HashMap for every line would be too expensive.
Does anyone have any thoughts on this? Please help!
You only need a single hash map, one that maps your column names to the proper column index. You fill the arrays by indexing with integers as you did before; to retrieve a column by name you'd use array[hashmap.get("Amount")]. A minimal sketch of this idea follows.
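A small self-contained sketch (the header and data lines are hypothetical stand-ins for lines read from your file):

import java.util.HashMap;
import java.util.Map;

class ColumnIndexExample {
    public static void main(String[] args) {
        // Hypothetical lines; a real program would read these from the file.
        String headerLine = "Name\tAmount\tUnit\tCountOfOrder";
        String dataLine = "A\t1\tml\t5000";

        // Build the name-to-index map once, from the header line.
        Map<String, Integer> columnIndex = new HashMap<>();
        String[] headers = headerLine.split("\t");
        for (int i = 0; i < headers.length; i++) {
            columnIndex.put(headers[i], i);
        }

        // Each data line stays a plain array; lookups go through the map,
        // so reordered or inserted columns no longer break the code.
        String[] fields = dataLine.split("\t");
        System.out.println(fields[columnIndex.get("Amount")]); // prints "1"
    }
}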
You can read the file using opencsv.
CSVReader reader = new CSVReader(new FileReader("yourfile.txt"), '\t');
List<String[]> lines = reader.readAll();
The first line contains the headers.
You can read each line of the file and, assuming the first line holds the column headers, parse that line to get the names of all the columns:
String[] column_headers = firstline.split("\t");
This gives you the names of all the columns; after that you just read each line, split on tabs, and the fields will all line up.
You could do something like this:
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream(FILE)));
String line = null;
String[] headers = null;
String[] data = null;
Map<String, List<String>> contents = new HashMap<String, List<String>>();
if ((line = in.readLine()) != null) {
    headers = line.split("\t");
}
for (String h : headers) {
    contents.put(h, new ArrayList<String>());
}
while ((line = in.readLine()) != null) {
    data = line.split("\t");
    if (data.length != headers.length) {
        throw new IOException("Row has a different number of columns than the header");
    }
    for (int i = 0; i < data.length; i++) {
        contents.get(headers[i]).add(data[i]);
    }
}
It would give you flexibility and only requires building the map once. You can then get the column lists from the map, so it should be a convenient data structure for the rest of your program to use.
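For example, hypothetical usage of the contents map built above:

// All values from the "Amount" column, in file order.
List<String> amounts = contents.get("Amount");
// The Amount value of the third data row.
String thirdAmount = amounts.get(2);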
This will give you individual lists of the columns.
public static void main(String args[]) throws FileNotFoundException, IOException {
    List<String> headerList = new ArrayList<String>();
    List<String> column1 = new ArrayList<String>();
    List<String> column2 = new ArrayList<String>();
    List<String> column3 = new ArrayList<String>();
    List<String> column4 = new ArrayList<String>();
    int lineCount = 0;
    BufferedReader br = new BufferedReader(new FileReader("file.txt"));
    try {
        String line;
        String tokens[];
        while ((line = br.readLine()) != null) {
            tokens = line.split("\t");
            if (lineCount == 0) {
                // The first line holds the column headers.
                for (int count = 0; count < tokens.length; count++) {
                    headerList.add(tokens[count]);
                }
            } else {
                // Every other line is data: one value per column list.
                int count = 0;
                column1.add(tokens[count]); ++count;
                column2.add(tokens[count]); ++count;
                column3.add(tokens[count]); ++count;
                column4.add(tokens[count]); ++count;
            }
            lineCount++;
        }
    } catch (IOException e) {
    } finally {
        br.close();
    }
}
Using the standard java.util.Scanner:
String aa = " asd 9 1 3 \n d -1 4 2";
Scanner ss = new Scanner(aa);
ss.useDelimiter("\n");
while (ss.hasNext()) {
    String line = ss.next();
    Scanner fs = new Scanner(line);
    System.out.println("1>" + fs.next() + " " + fs.nextInt() + " " + fs.nextLong() + " " + fs.nextBigDecimal());
}
Using a bunch of HashMaps is OK... I wouldn't be afraid of that ;)
If you need to process a lot of data, then try to translate your problem into a data-processing transformation.
For example: read all of your data into HashMaps, but store them in a database using some JPA implementation... then you can work through your data from there ;)
