I am building a tool to parse huge JSON documents (around 1 GB). In that logic, I create a JsonParser object and keep reading until it reaches the expected JsonToken. Then I want to create another JsonParser (call it the child) that starts from the previous JsonParser's token position without much overhead. Is there a way to do that with the JsonParser API? I am using skipChildren(), but that is also taking too long in my scenario.
You can try calling releaseBuffered(...) to get the data that has been read but not yet consumed by the parser, then prepend that data to the input stream (getInputSource()) and parse the resulting stream. (One way to do this might be to use an input stream that supports marks when constructing the parser.)
However, since you're already using a stream-based API, you probably won't get better performance than with skipChildren().
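A minimal sketch of that idea, assuming the parent parser was created over an InputStream (the helper name and the source parameter are illustrative, not part of the Jackson API):

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;

// Hypothetical helper: builds a second parser positioned roughly where
// 'parent' stopped. Only works when 'parent' reads from the given byte stream.
static JsonParser childParserFrom(JsonFactory factory, JsonParser parent,
                                  InputStream source) throws IOException {
    // Recover the bytes the parent has read ahead but not yet consumed
    ByteArrayOutputStream buffered = new ByteArrayOutputStream();
    parent.releaseBuffered(buffered);
    // Stitch the unconsumed bytes back in front of the rest of the stream
    InputStream remaining = new SequenceInputStream(
            new ByteArrayInputStream(buffered.toByteArray()), source);
    return factory.createParser(remaining);
}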
Related
We have a very long XML response from a web service, and we want to optimize the time spent parsing it. The strategy is to parse the response piece by piece until we have extracted what we need from it: the mandatory elements.
We know which elements are mandatory, and we also know that the response is a valid XML document, so we can stop reading before reaching the end.
What tools would help us do this?
Thank you in advance.
Using the StAX API is probably the most efficient way of parsing XML, since the calling code is in complete control and can skip whole element subtrees that are not needed.
There is a good tutorial about this approach at https://medium.com/tech-travelstart/efficient-xml-parsing-with-java-df3169e1766b
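A minimal StAX sketch along those lines; the element name "mandatory" stands in for whatever element you actually need:

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.InputStream;

static String readMandatoryElement(InputStream in) throws XMLStreamException {
    XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
    while (reader.hasNext()) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT
                && "mandatory".equals(reader.getLocalName())) {
            // Consume just this element's text and stop; the rest of the
            // document is never read
            return reader.getElementText();
        }
    }
    return null; // element not found
}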
I would not advise parsing XML by hand. Either you parse it accurately, which is as time-consuming as using a real parser, or you parse it approximately and risk errors.
If you need performance, you can use a SAX parser. It gets more complicated, but it is as fast as parsing can be.
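A rough SAX sketch of that approach. SAX pushes events at you, so the only way to stop early is to throw from the handler; the element name is a placeholder:

import javax.xml.parsers.SAXParserFactory;
import java.io.InputStream;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

static String findFirst(InputStream in, String elementName) throws Exception {
    StringBuilder value = new StringBuilder();
    DefaultHandler handler = new DefaultHandler() {
        private boolean inTarget;
        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            if (qName.equals(elementName)) inTarget = true;
        }
        @Override
        public void characters(char[] ch, int start, int length) {
            if (inTarget) value.append(ch, start, length);
        }
        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            // Abort the parse as soon as we have what we need
            if (qName.equals(elementName)) throw new SAXException("done");
        }
    };
    try {
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
    } catch (SAXException stoppedEarly) {
        // expected: we bailed out after reading the target element
    }
    return value.length() > 0 ? value.toString() : null;
}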
I have to write a Jersey client that should handle a huge payload (>1 GB), but if I use the Java object model I get an out-of-memory error. I am considering the Jackson streaming API, but I am not sure whether the payload will still be buffered in memory and occupy more than 1 GB. Can someone explain how streaming works on the client side?
The Jackson streaming API is identical on the server and client side. It can be very efficient, but it is substantially more work than the databind API, since you have to write a lot of that logic yourself (see Jackson Performance).
Functionally, you want to leave the input in the stream and parse (and process) it piece by piece. When you know the structure, or the payload happens to be an array, you can process each object in the array one by one to avoid reading the entire array before processing; see the second sketch below.
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;

JsonFactory factory = new JsonFactory();
try (JsonParser parser = factory.createParser(inputStream)) {
    // nextToken() returns null once the end of input is reached
    while (parser.nextToken() != null) {
        // process tokens, etc. here
    }
}
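For the one-object-at-a-time array case described above, you can combine the streaming parser with databind and bind each element individually; Item is a placeholder for your own POJO:

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper mapper = new ObjectMapper();
try (JsonParser parser = mapper.getFactory().createParser(inputStream)) {
    if (parser.nextToken() == JsonToken.START_ARRAY) {
        // Each iteration binds exactly one element, so only one element
        // is ever held in memory at a time
        while (parser.nextToken() == JsonToken.START_OBJECT) {
            Item item = mapper.readValue(parser, Item.class);
            // process item here, then let it become garbage
        }
    }
}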
My current Android application uses Realm to store JSON data.
I use the following Realm insert:
backgroundRealm.createOrUpdateAllFromJson(Data.class, rawJson);
My raw JSON data is retrieved via a REST API.
The API supports paging; however, each subsequent call needs to be passed the last id contained in the previous rawJson.
These JSON arrays have 25,000 items, and I don't want the cost of parsing the entire 25,000-item array.
Is there any method I can use to extract the last array item and discover the last id value?
Each item resembles this:
{"id":6,"risk":"xxxxxxxxx","active":0,"from":"2016-07-18"}
What options do I have?
Is it just rawJson.lastIndexOf("id") and substring()?
There's no standard way to avoid parsing the entire JSON. You can, on the other hand, avoid storing it all in RAM and extracting it into nodes. Take a streaming JSON parser (these exist) and watch for the events of the element you need; the last such event you receive will contain the last id.
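A sketch of that approach using Jackson's streaming parser (any streaming parser will do); the field name "id" matches the sample item below:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.IOException;

static long lastId(String rawJson) throws IOException {
    long lastId = -1; // assumes ids are non-negative
    try (JsonParser parser = new JsonFactory().createParser(rawJson)) {
        JsonToken token;
        while ((token = parser.nextToken()) != null) {
            if (token == JsonToken.FIELD_NAME && "id".equals(parser.getCurrentName())) {
                parser.nextToken();             // advance to the value
                lastId = parser.getLongValue(); // keep overwriting; the last one wins
            }
        }
    }
    return lastId;
}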
On the other hand, if you know that the schema and the serialization format of the JSON are not going to change (e.g. you control them), you can scan the text of the unparsed JSON from the end and extract the value by counting braces and quotes. This is, of course, a brittle approach, though a very fast one. How brittle it is depends on the schema: if the list you're looking for is the last element of another list, the approach is much more robust than if the value sits under a particular key, which could end up anywhere in the object.
If you are in control of the REST API side of the app, consider rethinking it, e.g. returning items from latest to earliest, so you can find what you need at the start of the stream while parsing and safely discard the rest if you don't need the older data.
First, the background info:
I'm using Commons HttpClient to make a GET request to the server.
The server responds with a JSON string.
I parse the string using org.json.
The problem:
Actually, everything works for small responses (smaller than 2^31 bytes, the maximum value of an int, which limits getResponseBody and StringBuilder). I, on the other hand, have a giant response (several GB) and I'm getting stuck. I tried using HttpClient's getResponseBodyAsStream, but the response is so big that my system grinds to a halt. I tried using a String, a StringBuilder, and even saving it to a file.
The question:
First, is this the right approach? If so, what is the best way to handle such a response? If not, how should I proceed?
If you ever have a response that can be on the order of gigabytes, you should parse the JSON as a stream, (almost) character by character, and avoid creating any String objects. (This is very important: Java's stop-the-world garbage collection can freeze your system for seconds if you constantly create a lot of garbage.)
You can use SAXophone to build the parsing logic.
You'll have to implement all the methods like onObjectStart, onObjectClose, onObjectKey, etc. It's hard at first, but once you take a look at the PrettyPrinter implementation in the test packages, you'll get the idea.
Once properly implemented, you can handle an infinite stream of data ;)
P.S. SAXophone is designed for HFT, so it's all about performance and producing no garbage.
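SAXophone's exact callback interface is best taken from its own test sources, so here is a comparable low-garbage sketch using Jackson's streaming API instead: getTextCharacters() exposes the parser's internal buffer, avoiding a String allocation per value (handleValue is a hypothetical callback of yours):

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.IOException;
import java.io.InputStream;

static void process(InputStream in) throws IOException {
    try (JsonParser parser = new JsonFactory().createParser(in)) {
        JsonToken token;
        while ((token = parser.nextToken()) != null) {
            if (token == JsonToken.VALUE_STRING) {
                // Read the text out of the parser's buffer without
                // materializing a String object
                char[] buf = parser.getTextCharacters();
                handleValue(buf, parser.getTextOffset(), parser.getTextLength());
            }
        }
    }
}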
I googled this and didn't find an answer; maybe I didn't search for it correctly.
My question is: I'm parsing XML from the beginning to the end of the document, one way.
What if, somewhere in the middle, I need to make the parser go back to the start of the document?
I only know myXmlPullParser.next() (and the other next variants) for moving forward, but on some condition I need to start parsing from the beginning of the document again.
Is it possible?
Is it possible?
Sure. Create a new pull parser instance using the same code you used to create the first one. Or try calling setInput() on your existing instance, providing it a fresh copy of the data.
If it is reading from a file (or the network), I think caching the data in memory and reading it back into the parser may be the fastest route; a sketch of that idea follows.
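A minimal sketch of that caching idea, assuming the document fits in memory (readWholeDocument is a hypothetical method that returns the XML as a String):

import java.io.StringReader;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserFactory;

// Cache the raw XML once, e.g. after reading it from the file or network
String cachedXml = readWholeDocument(); // hypothetical

XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
parser.setInput(new StringReader(cachedXml));
// ... move forward with parser.next() as usual ...

// To "rewind", point the same parser at a fresh Reader over the cached data
parser.setInput(new StringReader(cachedXml));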