Read Very Large and Dynamic Nested JSON file in Java

I have a huge JSON file (500+ MB) consisting of a dynamically structured, nested JSON document. The file was written with json.dump in Python.
My problem is: how can I read this huge JSON file in a buffered way?
If I read the whole content in one go, it throws a Java heap error.
My idea is to read one JSON record, parse it, then continue to the next record, parse it, and so on. But how can I tell where one JSON record ends? I can't find a separator between records.
Any suggestions? Please ask if something is not clear.
Thanks

Assuming that you can't simply increase the heap size with -Xmx, you can switch your JSON reading logic to a SAX-style streaming JSON parser, e.g. RapidJSON (C++) or the Jackson Streaming API (Java). Instead of storing the entire JSON body in memory, these libraries emit an event for each JSON construct they encounter:
{
"hello": "world",
"t": true
...
}
will produce the following event sequence when using RapidJSON:
StartObject()
Key("hello", 5, true)
String("world", 5, true)
Key("t", 1, true)
Bool(true)
...
EndObject()
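
Since the question is about Java: the Jackson Streaming API gives you the same kind of token-level view. A minimal sketch, assuming the input lives in a file named huge.json (a placeholder name):

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.File;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(new File("huge.json"))) {
            JsonToken token;
            // Pull one token at a time; only the current token is held in memory
            while ((token = parser.nextToken()) != null) {
                if (token == JsonToken.FIELD_NAME) {
                    System.out.println("Key: " + parser.getCurrentName());
                } else if (token.isScalarValue()) {
                    System.out.println("Value: " + parser.getText());
                }
            }
        }
    }
}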

Related

Read a huge json array file of objects

I have a big JSON file, about 40 GB in size. When I try to convert this file (an array of objects) to a list of Java objects, it crashes. I've tried every maximum heap size (-Xmx), but nothing has worked!
public Set<Interlocutor> readJsonInterlocutorsToPersist() {
    String userHome = System.getProperty(USER_HOME);
    log.debug("Read file interlocutors " + userHome);
    try {
        ObjectMapper mapper = new ObjectMapper();
        // Bind the entire JSON file to a set of Java objects in one call
        Set<Interlocutor> interlocutorDeEntities = mapper.readValue(
                new File(userHome + INTERLOCUTORS_TO_PERSIST),
                new TypeReference<Set<Interlocutor>>() {});
        return interlocutorDeEntities;
    } catch (Exception e) {
        // Pass the exception itself (not just its message) so the stack trace is logged
        log.error("Exception while reading InterlocutorsToPersist file.", e);
        return null;
    }
}
Is there a way to read this file using a BufferedReader and then to push object by object?
You should definitely have a look at the Jackson Streaming API (https://www.baeldung.com/jackson-streaming-api). I've used it myself for multi-GB JSON files. The great thing is that you can divide your JSON into several smaller JSON objects and then parse each of them with mapper.readTree(parser). That way you can combine the convenience of normal Jackson with the speed and scalability of the Streaming API.
Related to your problem:
I understand that you have a really large array (which is the reason for the file size) made up of individual objects that are much more manageable:
e.g.:
[ // 40GB
{}, // Only 400 MB
{},
]
What you can do now is parse the file with Jackson's Streaming API and walk through the array. Each individual object can then be read as a "regular" Jackson tree (or bound to a POJO) and processed easily; see the sketch below.
You may also want to have a look at Use Jackson To Stream Parse an Array of Json Objects, which matches your problem pretty well.
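
A minimal sketch of that pattern, assuming the 40 GB file (here huge.json, a placeholder) is one top-level array of objects:

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;

public class ArrayStreamer {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        try (JsonParser parser = mapper.getFactory().createParser(new File("huge.json"))) {
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IllegalStateException("Expected a top-level JSON array");
            }
            // Only one array element is ever materialized at a time
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                JsonNode element = mapper.readTree(parser);
                // ...process the element here, e.g. bind it to a POJO and persist it
            }
        }
    }
}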
is there a way to read this file using BufferedReader and then to push object by object?
Of course not. Even if you can open this file, how would you store 40 GB worth of Java objects in memory? You almost certainly don't have that much memory in your machine (and technically, using ObjectMapper you would need roughly twice that as working memory: 40 GB to hold the JSON plus 40 GB for the resulting Java objects, i.e. about 80 GB).
I think you should use one of the approaches from the other answers, but store the information in a database or in files instead of in memory. For example, if the JSON contains millions of rows, parse and save each row to a database without keeping them all in memory. Then you can fetch the data back from the database step by step (for example, no more than 1 GB at a time).

TokenBuffer Jackson Json

I am wondering what to do about large files of JSON.
I don't want to store them in memory, so I don't want to use JsonNode, because I think that stores the entire tree in memory.
My other idea was to use TokenBuffer. However, I'm wondering how this works. Does the TokenBuffer store the entire document as well? Is there a maximum size? I know it's a performance best practice, but if I do:
TokenBuffer buff = jParser.readValueAs(TokenBuffer.class);
It seems like it reads the whole document at once (which I don't want).
The purpose of TokenBuffer is to store an extensible sequence of JSON tokens in memory. It does that by initially creating one Segment object holding 16 JsonTokens and then adding new Segment objects as needed.
You are correct to guess that the entire document will be loaded into memory. The only difference is that instead of storing an array of characters, it stores the tokens. The performance advantages, according to the docs:
You can reprocess JSON tokens without re-parsing JSON content from textual representation.
It's faster if you want to iterate over all tokens in the order they were appended in the buffer.
TokenBuffer is not a low level buffer of a disk file in memory.
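
To make the "reprocess without re-parsing" point concrete, here is a small sketch (the tiny inline document is just for illustration):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.util.TokenBuffer;
import java.util.Map;

public class TokenBufferReplay {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        String doc = "{\"hello\":\"world\",\"t\":true}";

        // Parse the text once into a buffer of tokens
        TokenBuffer buffer = mapper.readValue(doc, TokenBuffer.class);

        // Replay the tokens twice -- no textual re-parsing happens
        Map<?, ?> asMap = mapper.readValue(buffer.asParser(), Map.class);
        JsonNode asTree = mapper.readTree(buffer.asParser());

        System.out.println(asMap);
        System.out.println(asTree);
    }
}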
If you only want to parse a file once, without loading all of it into memory at once, skip the TokenBuffer. Just create a JsonParser from a JsonFactory or MappingJsonFactory and pull tokens with nextToken(). Example.

How to parse multiple JSON data from single stream in Java continuously?

I have an InputStreamReader (connected to a socket) which will receive multiple JSON documents. For example, it will carry
{ "name" : "foo" }
and some time later (without the connection being closed), the stream will carry another JSON document:
{ "name" : "bar" }
I want to parse these in my processing loop with json-simple or json-smart, whichever works. Is there any way to do this?
I'd like a JSON parser that takes input from a stream, blocks for more data when none has arrived yet, and, once it has received a complete JSON document (detectable via some method), can go on to parse the next JSON document continuously.
I tried json-simple and json-smart, but with no success.
Any help or advice would be appreciated.
Thank you.
If I understand the question correctly, the problem is parsing one complete JSON object out of a batch of JSON objects arriving continuously through an input stream.
So, if it is possible to detect that a complete JSON object has been received, we can parse that object.
To detect whether the received JSON text is complete and valid, we can maintain a stack of braces1. For every { we receive, push it onto the stack; for every }, pop a { off the stack. Whenever the stack becomes empty (and not at the very start), we can conclude that a complete JSON object has arrived and proceed to parse it; a sketch follows the footnote below.
1 - I'm not sure whether the brace structure of JSON is guaranteed to be balanced; I haven't explored the JSON grammar in depth.
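
A minimal sketch of this framing idea. Note that a plain brace counter would miscount braces that occur inside string literals, so the sketch also tracks quotes and escapes; the class and method names are hypothetical:

import java.io.IOException;
import java.io.Reader;

public class JsonFramer {
    // Reads characters until one complete top-level JSON object has been
    // seen, then returns it as a string; returns null if the stream ends.
    public static String readNextObject(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int depth = 0;
        boolean inString = false, escaped = false, started = false;
        int c;
        while ((c = in.read()) != -1) { // blocks until more data arrives
            char ch = (char) c;
            if (started) sb.append(ch);
            if (inString) {
                if (escaped) escaped = false;
                else if (ch == '\\') escaped = true;
                else if (ch == '"') inString = false;
            } else if (ch == '"') {
                inString = true;
            } else if (ch == '{') {
                if (!started) { started = true; sb.append(ch); }
                depth++;
            } else if (ch == '}') {
                depth--;
                if (started && depth == 0) return sb.toString();
            }
        }
        return null;
    }
}

Each returned string can then be handed to json-simple, json-smart, or any other parser.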

How do I parse a string of JSON if I do not know its object model beforehand?

I want to work with OpenStreetMap (OSM). OSM keeps its data formats as flexible as possible by using key-value pairs. I am developing an application for Android, and I am going to send it a JSON string of OSM data. What should I do if I do not know what the JSON will look like in advance? What would be the best library?
Thanks for your help,
Chris
This may be what you are looking for
http://code.google.com/p/google-gson/
Cheers
First of all, you need to know whether the JSON contains an array or an object. If the first non-whitespace character is a [, it's an array; if it's a {, it's an object. Creating a JSONArray when the first character is a { (or vice versa) will throw a runtime exception.
Second of all, once you have your JSONObject, you're going to want to get data out of it, so you have to know the names of the keys to get the values, i.e.
myStreet = myJsonObject.getString("street name")
If you're not going to get data from it, what's the point of having the JSON file? Surely you can open the JSON in a lint tool to see what the structure is; see also the sketch below.
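
If the keys genuinely aren't known in advance, org.json also lets you discover them at runtime. A small sketch (the payload string is just a placeholder):

import java.util.Iterator;
import org.json.JSONArray;
import org.json.JSONObject;

public class DynamicJson {
    public static void main(String[] args) {
        String payload = "{\"street name\":\"Main St\",\"number\":42}";
        String trimmed = payload.trim();

        if (trimmed.startsWith("[")) {
            // Top-level array: iterate elements by index
            JSONArray array = new JSONArray(trimmed);
            for (int i = 0; i < array.length(); i++) {
                System.out.println(array.get(i));
            }
        } else {
            // Top-level object: iterate keys without knowing them beforehand
            JSONObject object = new JSONObject(trimmed);
            Iterator<String> keys = object.keys();
            while (keys.hasNext()) {
                String key = keys.next();
                System.out.println(key + " = " + object.get(key));
            }
        }
    }
}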
Hope this helps!

Incremental streaming JSON library for Java

Can anyone recommend a JSON library for Java which allows me to give it chunks of data as they come in, in a non-blocking fashion? I have read through A better Java JSON library and similar questions, and haven't found precisely what I'd like.
Essentially, what I'd like is a library which allows me to do something like the following:
String jsonString1 = "{ \"A broken";
String jsonString2 = " json object\" : true }";
JSONParser p = new JSONParser(...);
p.parse(jsonString1);
p.isComplete(); // returns false
p.parse(jsonString2);
p.isComplete(); // returns true
Object o = p.getResult();
Notice the actual key name ("A broken json object") is split between pieces.
The closest I've found is this async-json-library which does almost exactly what I'd like, except it cannot recover objects where actual strings or other data values are split between pieces.
There are a few blocking streaming/incremental JSON parsers (as per Is there a streaming API for JSON?); but for async processing, nothing yet that I am aware of.
The lib you refer to seems badly named; it does not appear to do real asynchronous processing, but merely allows one to parse a sequence of JSON documents (which multiple other libs allow as well).
If there were people who really wanted this, writing one would not be impossible -- for XML there is Aalto, and handling JSON is quite a bit simpler than XML.
For what it is worth, there is actually this feature request to add non-blocking parsing mode for Jackson; but very few users have expressed interest in getting that done (via voting for the feature request).
EDIT (2016-01): While not async, Jackson's ObjectMapper also allows convenient sub-tree-by-sub-tree binding of parts of a stream -- see ObjectReader.readValues() (an ObjectReader is created from an ObjectMapper), or the shortcut versions of ObjectMapper.readValues(...). Note the trailing "s" there, which implies a stream of objects, not just a single one.
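
A minimal sketch of that readValues() usage (the file name stream.json is a placeholder for a file holding a sequence of root-level JSON objects):

import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.util.Map;

public class ReadValuesExample {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        MappingIterator<Map<String, Object>> it = mapper
                .readerFor(Map.class)
                .readValues(new File("stream.json"));
        while (it.hasNextValue()) {
            // Each call binds exactly one root-level JSON object
            Map<String, Object> value = it.nextValue();
            System.out.println(value);
        }
    }
}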
Google Gson can incrementally parse JSON from an InputStream:
https://sites.google.com/site/gson/streaming
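
A short sketch of that streaming usage, following the linked docs; the Map element type and the caller-supplied InputStream are placeholders:

import com.google.gson.Gson;
import com.google.gson.stream.JsonReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class GsonStreaming {
    public static void readArray(InputStream in) throws Exception {
        Gson gson = new Gson();
        try (JsonReader reader = new JsonReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            reader.beginArray();
            while (reader.hasNext()) {
                // Gson consumes exactly one array element's worth of tokens
                Map<?, ?> element = gson.fromJson(reader, Map.class);
                System.out.println(element);
            }
            reader.endArray();
        }
    }
}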
I wrote such a parser: JsonParser.java. See examples of how to use it: JsonParserTest.java.
