Jackson writeValueAsString too slow - java

I want to create JSON string from object.
ObjectMapper om = new ObjectMapper();
String str = om.writeValueAsString(obj);
Some objects are large, and it takes long time to create JSON string.
To create 8MB JSON string, it needs about 15secs.
How can I improve this?

Make sure you have enough memory: Java String for storing 8 MB of serialized JSON needs about 16 megabytes of contiguous memory in heap.
But more importantly: why are you creating a java.lang.String in memory?
What possible use is there for such a huge String?
If you need to write JSON content to a file, there are different methods for that; similarly for writing to a network socket. At very least you could write output as a byte[] (takes 50% less memory), but in most cases incremental writing to an external stream requires very little memory.
15 seconds is definitely very slow. Without GC problems, after initial warmup, Jackson should write 8 megs in fraction of a second, something like 10-20 milliseconds for simple object consisting of standard Java types.
EDIT:
Just realized that during construction of the result String, temporary memory usage will be doubled as well, since buffered content is not yet cleared when String is constructed. So 8 MB would need at least 32 MB to construct String. With default heap of 64 MB this would not work well.

Related

What is an overhead for creating Java objects from lines of csv file

the code reads lines of CSV file like:
Stream<String> strings = Files.lines(Paths.get(filePath))
then it maps each line in the mapper:
List<String> tokens = line.split(",");
return new UserModel(tokens.get(0), tokens.get(1), tokens.get(2), tokens.get(3));
and finally collects it:
Set<UserModel> current = currentStream.collect(toSet())
File size is ~500MB
I've connected to the server using jconsole and see that heap size grew from 200MB to 1.8GB while processing.
I can't understand where this x3 memory usage came from - I expected something like 500MB spike or so?
My first impression was it's because there is no throttling and garbage collector simply doesn't have enough time for cleanup.
But I've tried to use guava rate limiter to let garbage collector time to do it's job but result is the same.
Tom Hawtin made good points - I just wanna expand on them and provide a bit more details.
Java Strings take at least 40 bytes of memory (that's for empty string) due to java object header (see later) overhead and an internal byte array.
That means the minimal size for non-empty string (1 or more characters) is 48 bytes.
Nowawadays, JVM uses Compact Strings which means that ASCII-only strings only occupy 1 byte per character - before it was 2 bytes per char minimum.
That means if your file contains characters beyond ASCII set, then memory usage can grow significantly.
Streams also have more overhead compared to plain iteration with arrays/lists (see here Java 8 stream objects significant memory usage)
I guess your UserModel object adds at least 32 bytes overhead on top of each line, because:
the minimum size of java object is 16 bytes where first 12 bytes are the JVM "overhead": object's class reference (4 bytes when Compressed Oops are used) + the Mark word (used for identity hash code, Biased locking, garbage collectors)
and the next 4 bytes are used by the reference to the first "token"
and the next 12 bytes are used by 3 references to the second, third and fourth "token"
and the last 4 bytes are required due to Java Object Alignment at 8-byte boundaries (on 64-bit architectures)
That being said, it's not clear whether you even use all the data that you read from the file - you parse 4 tokens from a line but maybe there are more?
Moreover, you didn't mention how exactly the heap size "grew" - If it was the commited size or the used size of the heap. The used portion is what actually is being "used" by live objects, the commited portion is what has been allocated by the JVM at some point but could be garbage-collected later; used < commited in most cases.
You'd have to take a heap snapshot to find out how much memory actually the result set of UserModel occupies and that would actually be interesting to compare to the size of the file.
It may be that the String implementation is using UTF-16 whereas the file may be using UTF-8. That would be double the size assuming all US ASCII characters. However, I believe JVM tend to use a compact form for Strings nowadays.
Another factor is that Java objects tend to be allocated on a nice round address. That means there's extra padding.
Then there's memory for the actual String object, in addition to the actual data in the backing char[] or byte[].
Then there's your UserModel object. Each object has a header and references are usually 8-bytes (may be 4).
Lastly not all the heap will be allocated. GC runs more efficiently when a fair proportion of the memory isn't, at any particular moment, being used. Even C malloc will end up with much of the memory unused once a process is up and running.
You code reads the full file into memory. Then you start splitting each line into an array, then you create objects of your custom class for each line. So basically you have 3 different pieces of "memory usage" for each line in your file!
While enough memory is available, the jvm might simply not waste time running the garbage collector while turning your 500 megabytes into three different representations. Therefore you are likely to "triplicate" the number of bytes within your file. At least until the gc kicks in and throws away the no longer required file lines and splitted arrays.

why my jsonParser using Streaming API use much more memory

I have code which jsonparse the whole file:
ConcurrentHashMap<String, ValueClass<V>> space = new ConcurrentHashMap<String, ValueClass<V>> ();
Map<String, ValueClass<V>> map = mapper.readValue(new FileReader(fileName)...);
for each entry of map:
space.put(entry.getKey(), entry.value());
the memory used to parse the file is large and is much more than the size of the file itself. To save memory, I decided to replace the code with jsonparse streaming API as follows:
ConcurrentHashMap<String, ValueClass<V>> space = new ConcurrentHashMap<String, ValueClass<V>> ();
JsonFactory f = new MappingJsonFactory();
JsonParser jp = f.createJsonParser(new File(fileName);
JsonToken current;
current = jp.nextToken();
if (current != JsonToken.START_OBJECT) {
show error and return;
}
ObjectMapper mapper = new ObjectMapper();
while (jp.nextToken() != JsonToken.END_OBJECT) {
key = jp.getCurrentName();
current = jp.nextToken();
if (key != null) {
if (current == JsonToken.START_OBJECT) {
mem_before=Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory();
value = mapper.readValue(jp, ValueClass.class); (1)
mem_after=Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory();
mem_diff = mem_after - mem_before; (2)
space.put(key, value); (3)
}
else jp.skipChildren();
else jp.skipChildren();
}
however, it spends even much more memory than parsing the whole file once. And it shows the memory increase is due to (1) (I detect the allocatedMemory using Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory before (1) and after (1) to obtain difference)
shouldn't using streaming API save memory, according to http://www.ngdata.com/parsing-a-large-json-file-efficiently-and-easily and other article said?
EDIT
#StaxMan: Thanks for answering me. Looks like you are very familiar with memory consumption we should expect.
My purpose is to load content from file to memory, so that I do not need to search content from file. so I do need all the content in map all the time.
The reason I think streaming may help is: if loading entire file as one object, I guess besides consuming memory for the map variable: space, we need a large additional memory to parse the whole file, which is a big burden; but by using streaming, although we still need the same memory for the map variable:space, we only need a small amount of additional memory to parse each entry of a file, this additional memory is small because we reuse it to parse each entry instead of parsing the whole file. Doesn’t it save memory therefore? Please correct me if I am wrong.
I understand memory size may be more than file size. But first I do not understand why it need so much more than file (2.7 times of file size in my case); second, I do not understand why with streaming I spend even more memory (even double of without using streaming). Memory leak? But I do not see any problem of my code. should have been at least no more than 2.7 times.
Besides, do you know how to estimate memory consumption of an object such as each entry of my map variable space?
It would help if you explained what you are trying to achieve. For example, do you absolutely need all the values in memory at all times? Or would it be possible to process values one at a time, and avoid building that Map with all keys, values.
Your second attempt does not make much sense without explanation what you are trying to do there; but essentially amount of memory used should be space for value POJOs, keys, and Map that contains them. This may be more than input JSON or less, but it really depends on kind of content you are dealing with. Strings, for example, will consume memory memory in JVM than JSON file: most characters in UTF-8 are single byte, whereas each char in JVM is 16-bit value (UCS-2 encoded). Further, whereas JSON Strings have 3 or 4 byte overhead (quotes, separators), memory overhead for java.lang.String is more like 16 or 24 bytes per String.
Similarly, Maps consume quite a bit of memory for structure that JSON file does not need. JSON has couple of separators per entry, maybe 6 - 8 bytes (depending on indentation is used). Java Maps, on the other hand, have bigger per-entry overhead (maybe 16 bytes?) in addition to String keys (with above-mentioned overhead) and POJO values.
So: all in all, it is not uncommon that objects in memory may consume bit more memory than what JSON file itself contains.

best way of loading a large text file in java

I have a text file, with a sequence of integer per line:
47202 1457 51821 59788
49330 98706 36031 16399 1465
...
The file has 3 million lines of this format. I have to load this file into the memory and extract 5-grams out of it and do some statistics on it. I do have memory limitation (8GB RAM). I tried to minimize the number of objects I create (only have 1 class with 6 float variables, and some methods). And each line of that file, basically generates number of objects of this class (proportional to the size of the line in temrs of #ofwords). I started to feel that Java is not a good way to do these things when C++ is around.
Edit:
Assume that each line produces (n-1) objects of that class. Where n is the number of tokens in that line separated by space (i.e. 1457). So considering the average size of 10 words per line, each line gets mapped to 9 objects on average. So, there will be 9*3*10^6 objects.So, the memory needed is: 9*3*10^6*(8 bytes obj header + 6 x 4 byte floats) + (a map(String,Objects) and another map (Integer,ArrayList(Objects))). I need to keep everything in the memory, because there will be some mathematical optimization happening afterwards.
Reading/Parsing the file:
The best way to handle large files, in any language, is to try and NOT load them into memory.
In java, have a look at MappedByteBuffer. it allows you to map a file into process memory and access its contents without loading the whole thing into your heap.
You might also try reading the file line-by-line and discarding each line after you read it - again to avoid holding the entire file in memory at once.
Handling the resulting objects
For dealing with the objects you produce while parsing, there are several options:
Same as with the file itself - if you can perform whatever it is you want to perform without keeping all of them in memory (while "streaming" the file) - that is the best solution. you didnt describe the problem youre trying to solve so i dont know if thats possible.
Compression of some sort - switch from Wrapper objects (Float) to primitives (float), use something like the flyweight pattern to store your data in giant float[] arrays and only construct short-lived objects to access it, find some pattern in your data that allows you to store it more compactly
Caching/offload - if your data still doesnt fit in memory "page it out" to disk. this can be as simple as extending guava to page out to disk or bringing in a library like ehcache or the likes.
a note on java collections and maps in particular
For small objects java collections and maps in particular incur a large memory penalty (due mostly to everything being wrapped as Objects and the existence of the Map.Entry inner class instances). at the cost of a slightly less elegant API, you should probably look at gnu trove collections if memory consumption is an issue.
Optimal would be to hold only integers and line ends.
To that end, one way would be: convert the file to two files:
one binary file of integers (4 bytes)
one binary file with indexes where the next line would start.
For this one can use a Scanner to read, and a DataOutputStream+BufferedOutputStream to write.
Then you can load those two files in arrays of primitive type:
int[] integers = new int[(int)integersFile.length() / 4];
int[] lineEnds = new int[(int)lineEndsFile.length() / 4];
Reading can be done with a MappedByteBuffer.toIntBuffer(). (You then would not even need the arrays, but it would become a bit COBOL like verbose.)

File size vs. in memory size in Java

If I take an XML file that is around 2kB on disk and load the contents as a String into memory in Java and then measure the object size it's around 33kB.
Why the huge increase in size?
If I do the same thing in C++ the resulting string object in memory is much closer to the 2kB.
To measure the memory in Java I'm using Instrumentation.
For C++, I take the length of the serialized object (e.g string).
I think there are multiple factors involved.
First of all, as Bruce Martin said, objects in java have an overhead of 16 bytes per object, c++ does not.
Second, Strings in Java might be 2 Bytes per character instead of 1.
Third, it could be that Java reserves more Memory for its Strings than the C++ std::string does.
Please note that these are just ideas where the big difference might come from.
Assuming that your XML file contains mainly ASCII characters and uses an encoding that represents them as single bytes, then you can espect the in memory size to be at least double, since Java uses UTF-16 internally (I've heard of some JVMs that try to optimize this, thouhg). Added to that will be overhead for 2 objects (the String instance and an internal char array) with some fields, IIRC about 40 bytes overall.
So your "object size" of 33kb is definitely not correct, unless you're using a weird JVM. There must be some problem with the method you use to measure it.
In Java String object have some extra data, that increases it's size.
It is object data, array data and some other variables. This can be array reference, offset, length etc.
Visit http://www.javamex.com/tutorials/memory/string_memory_usage.shtml for details.
String: a String's memory growth tracks its internal char array's growth. However, the String class adds another 24 bytes of overhead.
For a nonempty String of size 10 characters or less, the added overhead cost relative to useful payload (2 bytes for each char plus 4 bytes for the length), ranges from 100 to 400 percent.
More:
What is the memory consumption of an object in Java?
Yes, you should GC and give it time to finish. Just System.gc(); and print totalMem() in the loop. You also better to create a million of string copies in array (measure empty array size and, then, filled with strings), to be sure that you measure the size of strings and not other service objects, which may present in your program. String alone cannot take 32 kb. But hierarcy of XML objects can.
Said that, I cannot resist the irony that nobody cares about memory (and cache hits) in the world of Java. We are know that JIT is improving and it can outperform the native C++ code in some cases. So, there is not need to bother about memory optimization. Preliminary optimization is a root of all evils.
As stated in other answers, Java's String is adding an overhead. If you need to store a large number of strings in memory, I suggest you to store them as byte[] instead. Doing so the size in memory should be the same than the size on disk.
String -> byte[] :
String a = "hello";
byte[] aBytes = a.getBytes();
byte[] -> String :
String b = new String(aBytes);

Huge String Table in Java

I've got a question about storing huge amount of Strings in application memory. I need to load from file and store about 5 millions lines, each of them max 255 chars (urls), but mostly ~50. From time to time i'll need to search one of them. Is it possible to do this app runnable on ~1GB of RAM?
Will
ArrayList <String> list = new ArrayList<String>();
work?
As far as I know String in java is coded in UTF-8, what gives me huge memory use. Is it possible to make such array with String coded in ANSI?
This is console application run with parameters:
java -Xmx1024M -Xms1024M -jar "PServer.jar" nogui
The latest JVMs support -XX:+UseCompressedStrings by default which stores strings which only use ASCII as a byte[] internally.
Having several GB of text in a List isn't a problem, but it can take a while to load from disk (many seconds)
If the average URL is 50 chars which are ASCII, with 32 bytes of overhead per String, 5 M entries could use about 400 MB which isn't much for a modern PC or server.
A Java String is a full blown object. This means that appart from the characters of the string theirselves, there is other information to store in it (a pointer to the class of the object, a counter with the number of pointers pointing to it, and some other infrastructure data). So an empty String already takes 45 bytes in memory (as you can see here).
Now you just have to add the maximum lenght of your string and make some easy calculations to get the maximum memory of that list.
Anyway, I would suggest you to load the string as byte[] if you have memory issues. That way you can control the encoding and you can still do searchs.
Is there some reason you need to restrict it to 1G? If you want to search through them, you definitely don't want to swap to disk, but if the machine has more memory it makes sense to go higher then 1G.
If you have to search, use a SortedSet, not an ArrayList

Categories

Resources