TokenBuffer Jackson Json - java

I am wondering what to do about large JSON files.
I don't want to hold them in memory, so I don't want to use JsonNode, because I think that stores the entire tree in memory.
My other idea was to use TokenBuffer, but I'm not sure how it works. Does the TokenBuffer store the entire document as well? Is there a maximum size? I know it's recommended as a performance best practice, but if I do:
TokenBuffer buff = jParser.readValueAs(TokenBuffer.class);
It seems like it reads the whole document at once (which I don't want).

The purpose of TokenBuffer is to store an extensible sequence of JSON tokens in memory. It does that by initially creating one Segment object holding 16 tokens and then adding new Segment objects as needed.
You are correct to guess that the entire document will be loaded into memory. The only difference is that instead of storing an array of chars it stores the tokens. The performance advantages, according to the docs, are:
You can reprocess JSON tokens without re-parsing JSON content from textual representation.
It's faster if you want to iterate over all tokens in the order they were appended in the buffer.
TokenBuffer is not a low level buffer of a disk file in memory.
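To illustrate what the buffering buys you, here is a minimal sketch (assuming Jackson 2.x databind; MyPayload and data.json are placeholders): the same buffered tokens can be replayed through asParser() any number of times without re-parsing the original text.

import java.io.File;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.util.TokenBuffer;

ObjectMapper mapper = new ObjectMapper();

// The whole document still ends up in the buffer, just as tokens rather than characters.
TokenBuffer buffer = mapper.readValue(new File("data.json"), TokenBuffer.class);

// Replay the buffered tokens as often as needed, without touching the text again.
JsonNode asTree = mapper.readTree(buffer.asParser());
MyPayload asPojo = mapper.readValue(buffer.asParser(), MyPayload.class);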
If you only want to parse a file once, without loading all of it into memory, skip the TokenBuffer. Just create a JsonParser from a JsonFactory or MappingJsonFactory and pull tokens with nextToken. For example:
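Here is a minimal sketch of that token-by-token loop, using the Jackson 2.x method names (createParser; older versions call it createJsonParser). The field name "name" and the file name are placeholders.

import java.io.File;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

JsonFactory factory = new JsonFactory();
try (JsonParser parser = factory.createParser(new File("large.json"))) {
    JsonToken token;
    while ((token = parser.nextToken()) != null) {
        // Only the current token is held in memory at any point.
        if (token == JsonToken.FIELD_NAME && "name".equals(parser.getCurrentName())) {
            parser.nextToken();               // advance to the field's value
            System.out.println(parser.getText());
        }
    }
}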

Related

Read a huge json array file of objects

I have a big JSON file, about 40 GB in size. When I try to convert this file, an array of objects, to a list of Java objects, it crashes. I've tried every maximum heap size (-Xmx), but nothing has worked!
public Set<Interlocutor> readJsonInterlocutorsToPersist() {
    String userHome = System.getProperty(USER_HOME);
    log.debug("Read file interlocutors " + userHome);
    try {
        ObjectMapper mapper = new ObjectMapper();
        // JSON file to Java object
        Set<Interlocutor> interlocutorDeEntities = mapper.readValue(
                new File(userHome + INTERLOCUTORS_TO_PERSIST),
                new TypeReference<Set<Interlocutor>>() {
                });
        return interlocutorDeEntities;
    } catch (Exception e) {
        log.error("Exception while Reading InterlocutorsToPersist file.",
                e.getMessage());
        return null;
    }
}
Is there a way to read this file using BufferedReader and then to push object by object?
You should definitely have a look at the Jackson Streaming API (https://www.baeldung.com/jackson-streaming-api). I used it myself for multi-gigabyte JSON files. The great thing is that you can divide your JSON into several smaller JSON objects and then parse each of them with mapper.readTree(parser). That way you can combine the convenience of normal Jackson with the speed and scalability of the Streaming API.
Related to your problem:
I understand that you have a really large array (which is the reason for the file size) made up of much smaller, more manageable objects:
e.g.:
[        // 40 GB in total
  {},    // only 400 MB
  {},
]
What you can do now is parse the file with Jackson's Streaming API and walk through the array, while each individual object is read as a "regular" Jackson object and then processed easily, as sketched below.
You may also have a look at Use Jackson To Stream Parse an Array of Json Objects, which matches your problem pretty well.
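Here is a hedged sketch of that pattern, assuming Jackson 2.x; the file name is a placeholder, and handle stands in for whatever you do with a single element:

import java.io.File;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper mapper = new ObjectMapper();
try (JsonParser parser = mapper.getFactory().createParser(new File("huge-array.json"))) {
    if (parser.nextToken() != JsonToken.START_ARRAY) {
        throw new IllegalStateException("Expected the root to be a JSON array");
    }
    while (parser.nextToken() == JsonToken.START_OBJECT) {
        // Reads exactly one array element into a tree; the rest of the file stays on disk.
        JsonNode element = mapper.readTree(parser);
        handle(element);   // placeholder for your per-object processing
    }
}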
Is there a way to read this file using BufferedReader and then to push object by object?
Of course not. Even if you could open this file, how would you store 40 GB as Java objects in memory? I doubt you have that much memory in your computers (and technically, using ObjectMapper you would need roughly twice that as working memory: 40 GB to hold the JSON plus 40 GB to hold the results as Java objects = 80 GB).
I think you should use one of the approaches from those questions, but store the information in a database or in files instead of in memory. For example, if the JSON contains millions of rows, you should parse and save each row to the database without keeping them all in memory, and then read the data back from the database step by step (say, no more than 1 GB at a time).
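As a hedged sketch of that idea, reusing the Interlocutor type and file constants from the question (saveBatch is a hypothetical method standing in for your database or file writer):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper mapper = new ObjectMapper();
List<Interlocutor> batch = new ArrayList<>();
try (JsonParser parser = mapper.getFactory().createParser(
        new File(userHome + INTERLOCUTORS_TO_PERSIST))) {
    if (parser.nextToken() != JsonToken.START_ARRAY) {
        throw new IllegalStateException("Expected a JSON array");
    }
    while (parser.nextToken() == JsonToken.START_OBJECT) {
        batch.add(mapper.readValue(parser, Interlocutor.class));
        if (batch.size() == 10_000) {   // flush in small batches to keep the heap flat
            saveBatch(batch);           // hypothetical persistence call
            batch.clear();
        }
    }
}
if (!batch.isEmpty()) {
    saveBatch(batch);                   // persist the remainder
}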

Jackson writeValueAsString too slow

I want to create a JSON string from an object.
ObjectMapper om = new ObjectMapper();
String str = om.writeValueAsString(obj);
Some objects are large, and it takes a long time to create the JSON string.
Creating an 8 MB JSON string takes about 15 seconds.
How can I improve this?
Make sure you have enough memory: a Java String storing 8 MB of serialized JSON needs about 16 MB of contiguous memory in the heap.
But more importantly: why are you creating a java.lang.String in memory?
What possible use is there for such a huge String?
If you need to write JSON content to a file, there are different methods for that; similarly for writing to a network socket. At the very least you could write the output as a byte[] (which takes 50% less memory), but in most cases incremental writing to an external stream requires very little memory.
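For instance, a minimal sketch of writing straight to a file or an OutputStream with the standard writeValue methods (obj is the object from the question; the file name is a placeholder):

import java.io.File;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper om = new ObjectMapper();

// Write directly to a file: no intermediate String is built.
om.writeValue(new File("out.json"), obj);

// Or stream to any OutputStream (file, socket, HTTP response, ...).
try (OutputStream out = Files.newOutputStream(Paths.get("out.json"))) {
    om.writeValue(out, obj);
}

// If you really need the serialized form in memory, byte[] is about half the size of a String.
byte[] json = om.writeValueAsBytes(obj);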
15 seconds is definitely very slow. Without GC problems, after initial warmup, Jackson should write 8 MB in a fraction of a second, something like 10-20 milliseconds for a simple object consisting of standard Java types.
EDIT:
Just realized that during construction of the result String, temporary memory usage will be doubled as well, since the buffered content is not yet cleared when the String is constructed. So an 8 MB result would need at least 32 MB to construct the String. With a default heap of 64 MB this would not work well.

Java StringBuilder high volumes of data blocks thread

I'm running a small program that processes around 215K records in the database. These records contain XML that is used by JAXB to marshal and unmarshal objects.
The program was trying to find XML documents that, for legacy reasons, can no longer be unmarshalled. Each time I got an unmarshal exception, I saved the exception message, which contains the XML, in an ArrayList. At the end I wanted to send out a mail listing all failed records with the causing exception message, so I used the messages in the ArrayList together with a StringBuilder to compose the email body.
However, there were around 75K failures, and while building the body the StringBuilder just stopped appending at a certain point in the for loop and the thread was blocked. I have since changed my approach and no longer append the XML from the exception message, but I'm still not clear why it didn't work.
Could it be that the VM ran out of memory, or can Strings only be of a certain size (which I doubt, certainly in the 64-bit era)? Is there a better way I could have solved this? I contemplated passing the StringBuilder to my service instead of saving the strings in an ArrayList first, but that would make for such a dirty interface :(
Any architectural insights would be appreciated.
EDIT
As requested, here is the code; it's no rocket science. Assume that the failures list contains around 75K entries, and each entry contains an XML document of, on average, 500 to 1000 lines.
private String createBodyMessage(List<String> failures) {
    StringBuilder builder = new StringBuilder();
    builder.append("Failed operations\n");
    builder.append("=================\n\n");
    for (String failure : failures) {
        builder.append(failure);
        builder.append("\n");
    }
    return builder.toString();
}
You might just be successful with:
int sizeEstimate = failures.size() * 20;
StringBuilder builder = new StringBuilder(sizeEstimate);
builder.append("Failed operations\n");
builder.append("=================\n\n");
while (!failures.isEmpty()) {
    builder.append(failures.remove(0));
    builder.append('\n');
}
This does less resizing of the StringBuilder's internal buffer, and it consumes the failures list as it goes, which frees that memory.
It might not solve the problem if the text is too huge.
A compressed attachment, however, is standard procedure.
StringBuffer is backed by an array, and the maximum number of elements in a Java array is 2^31-1.
Reaching this size will normally throw an error, on Java 7 at least, but I'm not completely sure.
The solution is to swap your data out to a file before your StringBuffer reaches a fixed size.
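A minimal sketch of that swap-to-file idea (the threshold and file name are arbitrary placeholders, not tuned values):

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

private void writeBodyToFile(List<String> failures) throws IOException {
    final int flushThreshold = 1 << 20;   // spill to disk after roughly a million chars
    StringBuilder builder = new StringBuilder();
    try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("failures.txt"))) {
        writer.write("Failed operations\n=================\n\n");
        for (String failure : failures) {
            builder.append(failure).append('\n');
            if (builder.length() >= flushThreshold) {
                writer.write(builder.toString());   // spill the buffered chunk to the file
                builder.setLength(0);               // and reuse the builder
            }
        }
        writer.write(builder.toString());           // write whatever is left
    }
}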
Could it be that the VM ran out of memory?
If you filled up the heap, you would get an OutOfMemoryError.
or can Strings only be of a certain size (which I doubt, certainly in the 64-bit era)?
Actually, yes. A Java String or StringBuilder can contain at most 2^31-1 characters [1].
Is there a better way I could have solved this? I contemplated passing the StringBuilder to my service instead of saving the strings in an ArrayList first ...
That won't help if the real problem is that the concatenation of the strings is too large to hold in a StringBuilder.
Actually, a better approach would be to stream the strings into a PipedOutputStream, and use the corresponding PipedInputStream to construct a MimeBodyPart that you then attach to the email. You could include a compressor in the stream stack too.
But an even better approach would be not to attempt to send gigabytes of erroneous data as email attachments. Save them as files that can be fetched (or whatever) if the email recipient wants them.
[1] Surprisingly, the javadocs don't seem to state this explicitly. However, String.length() returns an int, and various string-manipulation methods take int arguments to specify offsets and lengths. And certainly, the standard implementations of String and StringBuilder use a single char[] as backing store, and arrays are limited to 2^31-1 elements by the JLS and the JVM spec.
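As a hedged sketch of that last suggestion (file names are placeholders; the gzip step mirrors the "compressor in the stream stack" idea above), the failures can be streamed straight into a compressed file instead of a giant String body:

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.zip.GZIPOutputStream;

private Path writeFailuresReport(List<String> failures) throws IOException {
    Path report = Paths.get("failed-operations.txt.gz");
    // Stream each failure straight into a gzipped file; nothing is concatenated in memory.
    try (Writer out = new OutputStreamWriter(
            new GZIPOutputStream(Files.newOutputStream(report)), StandardCharsets.UTF_8)) {
        out.write("Failed operations\n=================\n\n");
        for (String failure : failures) {
            out.write(failure);
            out.write('\n');
        }
    }
    return report;   // attach or link this file rather than building the body in memory
}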

How can I get byteSize of String Array other than traversing the Array

I want to optimize my code by using a ByteBuffer in place of String. What I receive is a String[], and I format each element of it.
e.g. String strAry[] = {"Help", "I", "am", "trapped", "in", "a", "fortune", "cookie", "factory"};
is my String array. I am writing its content to a .csv file in the
format "StrArray[0]";"StrArray[1]";"StrArray[2]";"StrArray[3]"; and so on,
which internally creates multiple Strings, and this code sometimes runs in a loop hundreds of thousands of times.
I want to use a ByteBuffer. When creating it with
ByteBuffer bbuf = ByteBuffer.allocate(bufferSize); I need to specify the buffer size.
I don't want to iterate over each element of the String[] to calculate its byte size.
Any help is appreciated.
Couple of notes:
Data structure usage
I think you should be using CharBuffer and not ByteBuffer; CharBuffer's capacity is specified in characters, not bytes.
Buffers from Java NIO are meant to be used as buffers, which means you may need to read into them multiple times.
If you need to have the whole content in memory, buffers are not the data structure for this use case.
You don't have to know the exact size for a buffer; the allocated size is the maximum capacity of the buffer.
StringBuilder is a mutable data structure for string processing. You might consider using it instead.
You don't have to know the exact size.
Computation of the final size
This can be done using the Stream API (Java 8) or similar utility methods.
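For instance, a minimal sketch of computing the total character count up front (assuming Java 8+; the 3 extra characters per element account for the quotes and semicolon in the format above):

import java.util.Arrays;

String[] strAry = {"Help", "I", "am", "trapped", "in", "a", "fortune", "cookie", "factory"};

// Total characters in the elements, plus 3 per element for the "...;" decoration.
int totalChars = Arrays.stream(strAry)
                       .mapToInt(String::length)
                       .sum()
               + strAry.length * 3;

StringBuilder line = new StringBuilder(totalChars);   // sized once, no resizing later
for (String s : strAry) {
    line.append('"').append(s).append("\";");
}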

why my jsonParser using Streaming API use much more memory

I have code which JSON-parses the whole file:
ConcurrentHashMap<String, ValueClass<V>> space = new ConcurrentHashMap<String, ValueClass<V>>();
Map<String, ValueClass<V>> map = mapper.readValue(new FileReader(fileName)...);
for (Map.Entry<String, ValueClass<V>> entry : map.entrySet()) {
    space.put(entry.getKey(), entry.getValue());
}
The memory used to parse the file is large, much more than the size of the file itself. To save memory, I decided to replace the code with the streaming API as follows:
ConcurrentHashMap<String, ValueClass<V>> space = new ConcurrentHashMap<String, ValueClass<V>>();
JsonFactory f = new MappingJsonFactory();
JsonParser jp = f.createJsonParser(new File(fileName));
JsonToken current = jp.nextToken();
if (current != JsonToken.START_OBJECT) {
    // show error and return
    return;
}
ObjectMapper mapper = new ObjectMapper();
while (jp.nextToken() != JsonToken.END_OBJECT) {
    String key = jp.getCurrentName();
    current = jp.nextToken();
    if (key != null) {
        if (current == JsonToken.START_OBJECT) {
            long mem_before = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
            ValueClass<V> value = mapper.readValue(jp, ValueClass.class);            // (1)
            long mem_after = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
            long mem_diff = mem_after - mem_before;                                  // (2)
            space.put(key, value);                                                   // (3)
        } else {
            jp.skipChildren();
        }
    } else {
        jp.skipChildren();
    }
}
However, it uses even more memory than parsing the whole file at once, and the measurements show that the memory increase happens at (1). (I measure the allocated memory as Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory() before and after (1) and take the difference.)
Shouldn't using the streaming API save memory, as http://www.ngdata.com/parsing-a-large-json-file-efficiently-and-easily and other articles say?
EDIT
@StaxMan: Thanks for answering me. It looks like you are very familiar with the memory consumption we should expect.
My purpose is to load the content from the file into memory so that I do not need to search the file itself, so I do need all the content in the map all the time.
The reason I think streaming may help is this: when loading the entire file as one object, I guess that besides the memory consumed by the map variable space, we need a large amount of additional memory to parse the whole file, which is a big burden; with streaming, although we still need the same memory for space, we only need a small amount of additional memory to parse each entry, because that memory is reused for every entry instead of holding the whole file. Doesn't that save memory? Please correct me if I am wrong.
I understand the in-memory size may be larger than the file size. But first, I do not understand why it needs so much more than the file (2.7 times the file size in my case); second, I do not understand why streaming uses even more memory (double that of the non-streaming version). A memory leak? I do not see any problem in my code; it should have been no more than 2.7 times at worst.
Also, do you know how to estimate the memory consumption of an object such as each entry of my map variable space?
It would help if you explained what you are trying to achieve. For example, do you absolutely need all the values in memory at all times? Or would it be possible to process values one at a time and avoid building that Map with all the keys and values?
Your second attempt does not make much sense without an explanation of what you are trying to do there, but essentially the amount of memory used should be the space for the value POJOs, the keys, and the Map that contains them. This may be more than the input JSON or less, but it really depends on the kind of content you are dealing with. Strings, for example, will consume more memory in the JVM than in the JSON file: most characters in UTF-8 are a single byte, whereas each char in the JVM is a 16-bit value (UCS-2 encoded). Further, whereas JSON Strings have 3 or 4 bytes of overhead (quotes, separators), the memory overhead of java.lang.String is more like 16 or 24 bytes per String.
Similarly, Maps consume quite a bit of memory for structure that the JSON file does not need. JSON has a couple of separators per entry, maybe 6-8 bytes (depending on whether indentation is used). Java Maps, on the other hand, have a bigger per-entry overhead (maybe 16 bytes?) in addition to the String keys (with the above-mentioned overhead) and the POJO values.
So, all in all, it is not uncommon for objects in memory to consume a bit more memory than what the JSON file itself contains.
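To make the arithmetic above concrete, here is a rough, hedged back-of-envelope estimate for a single map entry; the per-object overheads used (~40 bytes per small String, ~32 bytes per HashMap node) are typical 64-bit HotSpot figures, not exact values.

// Rough estimate only; real numbers depend on JVM version, pointer compression, etc.
public class EntryOverheadEstimate {
    public static void main(String[] args) {
        String key = "someKey";
        String value = "someValue";

        // What the entry costs inside the JSON file:  "someKey":"someValue",
        int jsonBytes = ("\"" + key + "\":\"" + value + "\",").length();   // ASCII, so 1 byte per char

        // What the same entry costs on the heap, roughly.
        int keyHeap = 40 + 2 * key.length();     // String object + backing char[] (2 bytes per char)
        int valueHeap = 40 + 2 * value.length();
        int nodeHeap = 32;                       // HashMap.Node: hash, key ref, value ref, next ref
        int heapBytes = keyHeap + valueHeap + nodeHeap;

        System.out.println(jsonBytes + " bytes of JSON vs roughly " + heapBytes + " bytes on the heap");
    }
}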
