Read a huge json array file of objects - java

I have a big JSON file, about ~40 GB in size. When I try to convert this file (an array of objects) to a list of Java objects, it crashes. I've tried every maximum heap size (-Xmx), but nothing has worked!
public Set<Interlocutor> readJsonInterlocutorsToPersist() {
    String userHome = System.getProperty(USER_HOME);
    log.debug("Read file interlocutors " + userHome);
    try {
        ObjectMapper mapper = new ObjectMapper();
        // JSON file to Java object
        Set<Interlocutor> interlocutorDeEntities = mapper.readValue(
                new File(userHome + INTERLOCUTORS_TO_PERSIST),
                new TypeReference<Set<Interlocutor>>() {
                });
        return interlocutorDeEntities;
    } catch (Exception e) {
        log.error("Exception while Reading InterlocutorsToPersist file.",
                e.getMessage());
        return null;
    }
}
Is there a way to read this file using BufferedReader and then to push object by object?

You should definitely have a look at the Jackson Streaming API (https://www.baeldung.com/jackson-streaming-api). I have used it myself for multi-GB JSON files. The great thing is that you can divide your JSON into several smaller JSON objects and then parse them with mapper.readTree(parser). That way you can combine the convenience of normal Jackson with the speed and scalability of the Streaming API.
Related to your problem:
I understand that you have a really large array (which is the reason for the file size) made up of much smaller, more manageable objects:
e.g.:
[ // 40GB
{}, // Only 400 MB
{},
]
What you can do now is parse the file with Jackson's Streaming API and walk through the array, while each individual object is still parsed as a "regular" Jackson object and then processed easily.
You may have a look at Use Jackson To Stream Parse an Array of Json Objects, which matches your problem pretty well.
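A minimal sketch of that approach, applied to the question's Interlocutor class (the process method is a hypothetical stand-in for whatever you do with each object, e.g. persisting it):

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.io.IOException;

public class InterlocutorStreamReader {

    public void readInterlocutors(File jsonFile) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        JsonFactory factory = mapper.getFactory();

        try (JsonParser parser = factory.createParser(jsonFile)) {
            // Advance to the opening '[' of the top-level array
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IllegalStateException("Expected a JSON array");
            }
            // Read one element at a time; only one object is held in memory at once
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                Interlocutor interlocutor = mapper.readValue(parser, Interlocutor.class);
                process(interlocutor); // hypothetical per-object handling
            }
        }
    }

    private void process(Interlocutor interlocutor) {
        // e.g. persist to a database or aggregate, instead of keeping all 40 GB in memory
    }
}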

is there a way to read this file using BufferedReader and then to push
object by object ?
Of course not. Even if you can open the file, how would you store 40 GB of Java objects in memory? I don't think you have that much memory in your computers (and technically, using ObjectMapper you would need about twice that much working memory: 40 GB to hold the JSON plus 40 GB to hold the resulting Java objects, i.e. about 80 GB).
I think you should use one of the approaches from those questions, but store the information in a database or in files instead of in memory. For example, if the JSON contains millions of rows, you should parse and save each row to the database without keeping them all in memory. You can then fetch the data from the database step by step (for example, no more than 1 GB at a time).
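A rough sketch of that idea, combining a streaming parse (as in the sketch above) with JDBC batch inserts (the table name, columns, and the Interlocutor getters are assumptions made for illustration):

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.sql.Connection;
import java.sql.PreparedStatement;

public class InterlocutorDbLoader {

    private static final int BATCH_SIZE = 10_000;

    public void loadIntoDatabase(File jsonFile, Connection connection) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        String sql = "INSERT INTO interlocutor (id, name) VALUES (?, ?)"; // assumed schema

        try (JsonParser parser = mapper.getFactory().createParser(jsonFile);
             PreparedStatement statement = connection.prepareStatement(sql)) {

            parser.nextToken(); // skip START_ARRAY
            int count = 0;
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                Interlocutor interlocutor = mapper.readValue(parser, Interlocutor.class);
                statement.setLong(1, interlocutor.getId());     // assumed getters
                statement.setString(2, interlocutor.getName());
                statement.addBatch();

                if (++count % BATCH_SIZE == 0) {
                    statement.executeBatch(); // flush every 10,000 rows, keeping memory flat
                }
            }
            statement.executeBatch(); // flush the remainder
        }
    }
}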

Related

Read Very Large and Dynamic Nested JSON file in JAVA

I have a huge JSON file (500+ MB) that consists of a dynamic structure of nested JSON. This JSON was written to the file using 'json.dump' in Python.
My problem is: how can I read this huge JSON file with a buffered approach?
If I read all of it in one go, it throws a Java heap space error.
My idea is to read one JSON record, parse it, then continue to the next record, parse it, and so on. But how can I tell where one JSON record ends? I can't find the separator between records.
Any suggestions? Please ask if something is not clear.
Thanks
Assuming that you can't simply increase the heap space size with -Xmx, you can switch your JSON reading logic to a SAX-style JSON parser, e.g. RapidJSON or the Jackson Streaming API. Instead of storing the entire JSON body in memory, those libraries emit an event for each encountered JSON construct:
{
"hello": "world",
"t": true
...
}
will produce the following events when using RapidJSON:
StartObject()
Key("hello", 5, true)
String("world", 5, true)
Key("t", 1, true)
Bool(true)
...
EndObject()
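With the Jackson Streaming API the equivalent event loop might look roughly like this (a sketch that just prints the events, not tied to any particular schema):

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.File;
import java.io.IOException;

public class JsonEventPrinter {

    public void printEvents(File jsonFile) throws IOException {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(jsonFile)) {
            JsonToken token;
            while ((token = parser.nextToken()) != null) {
                switch (token) {
                    case START_OBJECT:
                        System.out.println("StartObject()");
                        break;
                    case FIELD_NAME:
                        System.out.println("Key(\"" + parser.getCurrentName() + "\")");
                        break;
                    case VALUE_STRING:
                        System.out.println("String(\"" + parser.getText() + "\")");
                        break;
                    case VALUE_TRUE:
                    case VALUE_FALSE:
                        System.out.println("Bool(" + parser.getBooleanValue() + ")");
                        break;
                    case END_OBJECT:
                        System.out.println("EndObject()");
                        break;
                    default:
                        // numbers, arrays, nulls, ...
                        break;
                }
            }
        }
    }
}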

Loading and processing very large files with java

I'm trying to load a CSV file with a huge number of lines (>5 million), but it slows down massively when processing them all into an ArrayList of each value.
I've tried a few different variations of reading from and removing entries of the input list I loaded from the file, but it still ends up running out of heap space, even when I allocate 14 GB to the process, while the file is only 2 GB.
I know I need to remove values so that I don't end up with duplicate references in memory, i.e. so that I don't end up with an ArrayList of lines and also an ArrayList of the individual comma-separated values, but I have no idea how to do something like that.
Edit: For reference, in this particular situation, data should end up containing 16 * 5 million values.
If there's a more elegant solution, I'm all for it.
The intention when loading this file is to process it as a database, with the appropriate methods like select and select where, all handled by a sheet class. It worked just fine with my smaller sample file of 36k lines, but I guess it doesn't scale very well.
Current code:
//Load method to load it from file
private static CSV loadCSV(String filename, boolean absolute)
{
    String fullname = "";
    if (!absolute)
    {
        fullname = baseDirectory + filename;
        if (!Load.exists(fullname,false))
            return null;
    }
    else if (absolute)
    {
        fullname = filename;
        if (!Load.exists(fullname,false))
            return null;
    }

    ArrayList<String> output = new ArrayList<String>();
    AtomicInteger atomicInteger = new AtomicInteger(0);
    try (Stream<String> stream = Files.lines(Paths.get(fullname)))
    {
        stream.forEach(t -> {
            output.add(t);
            atomicInteger.getAndIncrement();
            if (atomicInteger.get() % 10000 == 0)
            {
                Log.log("Lines done " + output.size());
            }
        });
        CSV c = new CSV(output);
        return c;
    }
    catch (IOException e)
    {
        Log.log("Error reading file " + fullname,3,"FileIO");
        e.printStackTrace();
    }
    return null;
}

//Process method inside CSV class
public CSV(List<String> output)
{
    Log.log("Inside csv " + output.size());
    ListIterator<String> iterator = output.listIterator();
    while (iterator.hasNext())
    {
        ArrayList<String> d = new ArrayList<String>(Arrays.asList(iterator.next().split(splitter,-1)));
        data.add(d);
        iterator.remove();
    }
}
You need to use a database that provides the functionality required for your task (select, group).
Any database can efficiently read and aggregate 5 million rows.
Don't try to perform these operations on an ArrayList; that only works well on small datasets.
I think some key concepts are missing here:
You said the file size is 2GB. That does not mean that when you load that file's data into an ArrayList, the size in memory will also be 2GB. Why? Usually files store data using UTF-8 character encoding, whereas the JVM internally stores String values using UTF-16. So, assuming your file contains only ASCII characters, each character occupies 1 byte in the filesystem but 2 bytes in memory. Assuming (for the sake of simplicity) that all String values are unique, there will also be space required to store the String references, which are 32 bits each (assuming a 64-bit system with compressed oops). How much is your heap (excluding other memory areas)? How much is your eden space and old space? I'll come back to this again shortly.
In your code, you don't specify the ArrayList size. This is a blunder in this case. Why? The JVM creates a small ArrayList. After some time the JVM sees that this guy keeps pumping in data. Let's create a bigger ArrayList and copy the data of the old ArrayList into the new list. This event has some deeper implications when you are dealing with such a huge volume of data: firstly, note that both the old and new arrays (with millions of entries) are in memory simultaneously, occupying space; secondly, data is unnecessarily copied from one array to another, not once or twice but repeatedly, every time the array runs out of space. What happens to the old array? Well, it's discarded and needs to be garbage collected. So, this repeated array copying and garbage collection slows down the process. The CPU is really working hard here. What happens when your data no longer fits into the young generation (which is smaller than the heap)? Maybe you need to observe the behaviour using something like JVisualVM.
All in all, what I mean to say is that there are a good number of reasons why a 2GB file fills up your much larger heap and why your process performance is poor.
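As a small illustration of the second point, if you know roughly how many lines to expect, presizing the list avoids the repeated grow-and-copy cycle (the 5,000,000 figure is just taken from the question):

// Pre-size the backing array once instead of letting it grow and copy repeatedly
int expectedLines = 5_000_000; // rough estimate from the question
ArrayList<String> output = new ArrayList<String>(expectedLines);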
I would have a method that takes a line read from the file as a parameter and splits it into a list of strings, then returns that list. I would then add that list to the CSV object in the file-reading loop. That would mean only one large collection instead of two, and the read lines could be freed from memory more quickly.
Something like this:
CSV csv = new CSV();
try (Stream<String> stream = Files.lines(Paths.get(fullname))) {
    stream.forEach(t -> {
        // splitFileRow is the line-splitting helper described above
        List<String> splittedString = splitFileRow(t);
        csv.add(splittedString);
    });
}
Trying to solve this problem in pure Java is overwhelming. I suggest using a processing engine like Apache Spark, which can process the file in a distributed way by increasing the level of parallelism.
Apache Spark has a dedicated API for loading CSV files:
spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
You can transform it into an RDD or a DataFrame and perform operations on it.
You can find more online, or here
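For reference, a minimal Java version of that might look like the following (the application name, file path, and column names are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvWithSpark {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("large-csv")        // placeholder application name
                .master("local[*]")          // run locally using all cores
                .getOrCreate();

        // Load the CSV as a DataFrame; Spark partitions and streams it for you
        Dataset<Row> df = spark.read()
                .format("csv")
                .option("header", "true")
                .load("/path/to/huge.csv");  // placeholder path

        // "select ... where ..." style operations, analogous to the sheet class
        df.select("someColumn")              // placeholder column name
          .where("someColumn > 100")
          .show();

        spark.stop();
    }
}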

TokenBuffer Jackson Json

I am wondering what to do about large files of JSON.
I don't want to store them in memory, so I don't want to use JsonNode, because I think that keeps the entire tree in memory.
My other idea was to use TokenBuffer. However, I'm wondering how this works. Does the TokenBuffer store the entire document as well? Is there a maximum limit? I know it's considered a performance best practice, but if I do:
TokenBuffer buff = jParser.readValueAs(TokenBuffer.class);
It seems like it reads the whole document at once (which I don't want).
The purpose of TokenBuffer is to store an extensible array of JSON tokens in memory. It does that by initially creating one Segment object holding 16 JsonTokens and then adding new Segment objects as needed.
You are correct to guess that the entire document will be loaded into memory. The only difference is that instead of storing an array of chars, it stores the tokens. The performance advantages, according to the docs:
You can reprocess JSON tokens without re-parsing JSON content from textual representation.
It's faster if you want to iterate over all tokens in the order they were appended in the buffer.
TokenBuffer is not a low level buffer of a disk file in memory.
If you only want to parse a file once, without loading all of it into memory at the same time, skip the TokenBuffer. Just create a JsonParser from a JsonFactory or MappingJsonFactory and pull tokens with nextToken. Example:
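A minimal sketch of that single-pass pattern (the "name" field handling is purely illustrative):

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.File;
import java.io.IOException;

public class SinglePassParser {

    public void parseOnce(File jsonFile) throws IOException {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(jsonFile)) {
            JsonToken token;
            // One pass over the document; nothing is buffered beyond the current token
            while ((token = parser.nextToken()) != null) {
                if (token == JsonToken.FIELD_NAME && "name".equals(parser.getCurrentName())) {
                    // move to the field's value, use it, then let it go out of scope
                    parser.nextToken();
                    String value = parser.getText();
                }
            }
        }
    }
}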

Jackson writeValueAsString too slow

I want to create JSON string from object.
ObjectMapper om = new ObjectMapper();
String str = om.writeValueAsString(obj);
Some objects are large, and it takes a long time to create the JSON string.
Creating an 8 MB JSON string takes about 15 seconds.
How can I improve this?
Make sure you have enough memory: a Java String storing 8 MB of serialized JSON needs about 16 megabytes of contiguous heap memory.
But more importantly: why are you creating a java.lang.String in memory?
What possible use is there for such a huge String?
If you need to write JSON content to a file, there are different methods for that; similarly for writing to a network socket. At the very least you could write the output as a byte[] (which takes 50% less memory), but in most cases incremental writing to an external stream requires very little memory.
15 seconds is definitely very slow. Without GC problems, and after initial warmup, Jackson should write 8 megs in a fraction of a second, something like 10-20 milliseconds for a simple object consisting of standard Java types.
EDIT:
Just realized that during construction of the result String, temporary memory usage will be doubled as well, since the buffered content is not yet cleared when the String is constructed. So an 8 MB result would need at least 32 MB to construct the String. With a default heap of 64 MB this would not work well.
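A minimal sketch of the "write to an external stream instead of building a String" suggestion (the output path is a placeholder):

import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class JsonStreamWriter {

    private final ObjectMapper mapper = new ObjectMapper(); // reuse a single instance

    public void writeToFile(Object obj) throws IOException {
        // Serialize straight to the stream; no 8-16 MB String is ever built in memory
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("/tmp/output.json"))) { // placeholder path
            mapper.writeValue(out, obj);
        }
    }
}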

why my jsonParser using Streaming API use much more memory

I have code which parses the whole JSON file:
ConcurrentHashMap<String, ValueClass<V>> space = new ConcurrentHashMap<String, ValueClass<V>>();
Map<String, ValueClass<V>> map = mapper.readValue(new FileReader(fileName)...);
for (Map.Entry<String, ValueClass<V>> entry : map.entrySet()) {
    space.put(entry.getKey(), entry.getValue());
}
The memory used to parse the file is large, much more than the size of the file itself. To save memory, I decided to replace the code with the JSON Streaming API as follows:
ConcurrentHashMap<String, ValueClass<V>> space = new ConcurrentHashMap<String, ValueClass<V>>();
JsonFactory f = new MappingJsonFactory();
JsonParser jp = f.createJsonParser(new File(fileName));
JsonToken current = jp.nextToken();
if (current != JsonToken.START_OBJECT) {
    // show error and return
    return;
}
ObjectMapper mapper = new ObjectMapper();
while (jp.nextToken() != JsonToken.END_OBJECT) {
    String key = jp.getCurrentName();
    current = jp.nextToken();
    if (key != null) {
        if (current == JsonToken.START_OBJECT) {
            long mem_before = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
            ValueClass<V> value = mapper.readValue(jp, ValueClass.class);   // (1)
            long mem_after = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
            long mem_diff = mem_after - mem_before;                         // (2)
            space.put(key, value);                                          // (3)
        } else {
            jp.skipChildren();
        }
    } else {
        jp.skipChildren();
    }
}
However, it uses even more memory than parsing the whole file at once, and the measurements show that the memory increase is due to (1) (I measure the allocated memory using Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory() before and after (1) to obtain the difference).
Shouldn't using the Streaming API save memory, according to http://www.ngdata.com/parsing-a-large-json-file-efficiently-and-easily and other articles?
EDIT
#StaxMan: Thanks for answering me. It looks like you are very familiar with the memory consumption we should expect.
My purpose is to load the content from the file into memory so that I do not need to search the file itself, so I do need all of the content in the map at all times.
The reason I think streaming may help is this: if we load the entire file as one object, then besides the memory consumed by the map variable space, we need a large amount of additional memory to parse the whole file, which is a big burden. With streaming, although we still need the same memory for the map variable space, we only need a small amount of additional memory to parse each entry of the file, and this additional memory stays small because we reuse it for each entry instead of parsing the whole file at once. Doesn't that save memory? Please correct me if I am wrong.
I understand that the memory size may be more than the file size. But first, I do not understand why it needs so much more than the file (2.7 times the file size in my case); second, I do not understand why with streaming I use even more memory (double that of not using streaming). A memory leak? But I do not see any problem in my code. It should have been at most 2.7 times.
Besides, do you know how to estimate the memory consumption of an object such as each entry of my map variable space?
It would help if you explained what you are trying to achieve. For example, do you absolutely need all the values in memory at all times? Or would it be possible to process values one at a time and avoid building that Map with all the keys and values?
Your second attempt does not make much sense without an explanation of what you are trying to do there; but essentially the amount of memory used should be the space for the value POJOs, the keys, and the Map that contains them. This may be more than the input JSON or less, but it really depends on the kind of content you are dealing with. Strings, for example, will consume more memory in the JVM than in the JSON file: most characters in UTF-8 are a single byte, whereas each char in the JVM is a 16-bit value (UCS-2 encoded). Further, whereas JSON Strings have 3 or 4 bytes of overhead (quotes, separators), the memory overhead for java.lang.String is more like 16 or 24 bytes per String.
Similarly, Maps consume quite a bit of memory for structure that the JSON file does not need. JSON has a couple of separators per entry, maybe 6-8 bytes (depending on whether indentation is used). Java Maps, on the other hand, have a bigger per-entry overhead (maybe 16 bytes?) in addition to the String keys (with the above-mentioned overhead) and the POJO values.
So, all in all, it is not uncommon that objects in memory consume a bit more memory than what the JSON file itself contains.
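A rough back-of-the-envelope illustration of those overheads (all numbers are approximations, assuming a pre-Java-9 64-bit JVM with compressed oops, as in the answer above):

// JSON entry on disk:  "name":"Alice",      ~ 15 bytes (UTF-8, quotes, separators)
// Same entry in the JVM:
//   String "name"   -> ~40 bytes  (object header + char[] header + 2 bytes per char)
//   String "Alice"  -> ~40 bytes
//   HashMap node    -> ~32 bytes  (header + hash + key/value/next references)
// Total              -> ~110 bytes, i.e. several times the on-disk size,
// so a 2-3x blow-up over the file size is entirely plausible.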
