File size vs. in-memory size in Java

If I take an XML file that is around 2kB on disk and load the contents as a String into memory in Java and then measure the object size it's around 33kB.
Why the huge increase in size?
If I do the same thing in C++ the resulting string object in memory is much closer to the 2kB.
To measure the memory in Java I'm using Instrumentation.
For C++, I take the length of the serialized object (e.g. string).
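For context, a minimal sketch of how such a measurement is typically wired up with Instrumentation (the agent class name here is illustrative); note that getObjectSize() reports only the shallow size of the object passed in, not its backing array:

import java.lang.instrument.Instrumentation;

// Registered as Premain-Class in the agent jar's manifest and started with -javaagent.
public class SizeAgent {
    private static volatile Instrumentation inst;

    public static void premain(String agentArgs, Instrumentation instrumentation) {
        inst = instrumentation;
    }

    // Shallow size only: for a String this excludes the internal char[]/byte[].
    public static long sizeOf(Object o) {
        return inst.getObjectSize(o);
    }
}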

I think there are multiple factors involved.
First of all, as Bruce Martin said, objects in Java have an overhead of about 16 bytes per object; C++ objects do not.
Second, Strings in Java may use 2 bytes per character instead of 1.
Third, it could be that Java reserves more memory for its Strings than C++'s std::string does.
Please note that these are just ideas where the big difference might come from.

Assuming that your XML file contains mainly ASCII characters and uses an encoding that represents them as single bytes, then you can expect the in-memory size to be at least double, since Java uses UTF-16 internally (I've heard of some JVMs that try to optimize this, though). Added to that will be overhead for 2 objects (the String instance and an internal char array) with some fields, IIRC about 40 bytes overall.
So your "object size" of 33 kB is definitely not correct, unless you're using a weird JVM. There must be some problem with the method you use to measure it.

In Java, a String object carries some extra data that increases its size: the object header, the array data, and a few other fields such as the array reference, offset, and length.
Visit http://www.javamex.com/tutorials/memory/string_memory_usage.shtml for details.

String: a String's memory growth tracks its internal char array's growth. However, the String class adds another 24 bytes of overhead.
For a nonempty String of 10 characters or less, the added overhead relative to the useful payload (2 bytes for each char plus 4 bytes for the length) ranges from 100 to 400 percent.
More:
What is the memory consumption of an object in Java?

Yes, you should GC and give it time to finish. Just call System.gc() and print totalMemory() in a loop. You would also do better to create a million copies of the string in an array (measure the size of the empty array first, then filled with strings), to be sure that you measure the size of the strings and not of other service objects that may be present in your program. A String alone cannot take 32 kB, but a hierarchy of XML objects can.
That said, I cannot resist the irony that nobody cares about memory (and cache hits) in the world of Java. We know that the JIT is improving and that it can outperform native C++ code in some cases, so there is no need to bother with memory optimization. Premature optimization is the root of all evil.
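A rough sketch of that measurement idea (the array size and sample content are arbitrary; results vary by JVM):

public class StringFootprint {
    public static void main(String[] args) throws Exception {
        Runtime rt = Runtime.getRuntime();
        String[] copies = new String[1_000_000];

        System.gc();
        Thread.sleep(200);                                   // give GC a moment to finish
        long before = rt.totalMemory() - rt.freeMemory();

        for (int i = 0; i < copies.length; i++) {
            copies[i] = new String("some sample content".toCharArray());
        }

        System.gc();
        Thread.sleep(200);
        long after = rt.totalMemory() - rt.freeMemory();

        System.out.println("approx. bytes per String: " + (after - before) / (double) copies.length);
        System.out.println(copies.length);                   // keep the array reachable
    }
}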

As stated in other answers, Java's String adds some overhead. If you need to store a large number of strings in memory, I suggest storing them as byte[] instead. That way, the size in memory should be about the same as the size on disk.
String -> byte[]:
String a = "hello";
byte[] aBytes = a.getBytes(StandardCharsets.UTF_8); // explicit charset (java.nio.charset) so the byte count matches the file's encoding
byte[] -> String:
String b = new String(aBytes, StandardCharsets.UTF_8);

Related

What is the overhead of creating Java objects from the lines of a CSV file

The code reads the lines of a CSV file like this:
Stream<String> strings = Files.lines(Paths.get(filePath))
Then it maps each line in a mapper:
String[] tokens = line.split(",");
return new UserModel(tokens[0], tokens[1], tokens[2], tokens[3]);
And finally collects it:
Set<UserModel> current = currentStream.collect(toSet());
File size is ~500MB
I've connected to the server using jconsole and see that heap size grew from 200MB to 1.8GB while processing.
I can't understand where this 3x memory usage came from - I expected something like a 500MB spike or so.
My first impression was that it's because there is no throttling and the garbage collector simply doesn't have enough time for cleanup.
But I've tried using the Guava rate limiter to give the garbage collector time to do its job, and the result is the same.
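For reference, a condensed, self-contained version of the pipeline described in the snippets above (the UserModel fields are assumptions; a record is used here only for brevity):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CsvLoader {
    record UserModel(String id, String name, String email, String country) {}

    static Set<UserModel> load(String filePath) throws IOException {
        // Files.lines streams lazily, but collect(toSet()) keeps every UserModel alive.
        try (Stream<String> lines = Files.lines(Paths.get(filePath))) {
            return lines.map(line -> {
                String[] tokens = line.split(",");
                return new UserModel(tokens[0], tokens[1], tokens[2], tokens[3]);
            }).collect(Collectors.toSet());
        }
    }
}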
Tom Hawtin made good points - I just want to expand on them and provide a bit more detail.
Java Strings take at least 40 bytes of memory (that's for an empty string) due to the Java object header (see below) and an internal byte array.
That means the minimal size for non-empty string (1 or more characters) is 48 bytes.
Nowadays, the JVM uses Compact Strings, which means that ASCII-only strings occupy just 1 byte per character - before, it was at least 2 bytes per char.
That means if your file contains characters beyond ASCII set, then memory usage can grow significantly.
Streams also have more overhead compared to plain iteration with arrays/lists (see here Java 8 stream objects significant memory usage)
I guess your UserModel object adds at least 32 bytes of overhead on top of each line (see the layout sketch after this breakdown), because:
the minimum size of java object is 16 bytes where first 12 bytes are the JVM "overhead": object's class reference (4 bytes when Compressed Oops are used) + the Mark word (used for identity hash code, Biased locking, garbage collectors)
and the next 4 bytes are used by the reference to the first "token"
and the next 12 bytes are used by 3 references to the second, third and fourth "token"
and the last 4 bytes are required due to Java Object Alignment at 8-byte boundaries (on 64-bit architectures)
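A minimal sketch for verifying these numbers on a specific JVM with the OpenJDK JOL library (assuming the jol-core dependency is on the classpath; the class name is only an example):

import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.info.GraphLayout;

public class LayoutDemo {
    public static void main(String[] args) {
        // Field offsets, padding and shallow instance size of String on this JVM.
        System.out.println(ClassLayout.parseClass(String.class).toPrintable());

        // Deep size of one object graph: the String plus its backing array.
        System.out.println(GraphLayout.parseInstance("hello,world,foo,bar").totalSize());
    }
}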
That being said, it's not clear whether you even use all the data that you read from the file - you parse 4 tokens from a line but maybe there are more?
Moreover, you didn't mention how exactly the heap size "grew" - whether it was the committed size or the used size of the heap. The used portion is what is actually occupied by live objects; the committed portion is what the JVM has reserved from the OS at some point, parts of which may currently be free; used < committed in most cases.
You'd have to take a heap snapshot to find out how much memory the result set of UserModel actually occupies, and that would be interesting to compare to the size of the file.
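A small sketch that prints the distinction directly (these are the same numbers jconsole shows):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapStats {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("used:      " + heap.getUsed());       // live + not-yet-collected objects
        System.out.println("committed: " + heap.getCommitted());  // memory the JVM has reserved from the OS
        System.out.println("max:       " + heap.getMax());        // the -Xmx limit
    }
}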
It may be that the String implementation is using UTF-16 whereas the file may be using UTF-8. That would double the size, assuming all US-ASCII characters. However, I believe JVMs tend to use a compact form for Strings nowadays.
Another factor is that Java objects tend to be allocated on a nice round address. That means there's extra padding.
Then there's memory for the actual String object, in addition to the actual data in the backing char[] or byte[].
Then there's your UserModel object. Each object has a header and references are usually 8-bytes (may be 4).
Lastly not all the heap will be allocated. GC runs more efficiently when a fair proportion of the memory isn't, at any particular moment, being used. Even C malloc will end up with much of the memory unused once a process is up and running.
Your code reads the full file into memory. Then you split each line into an array, then you create an object of your custom class for each line. So you basically have 3 different pieces of "memory usage" for each line in your file!
While enough memory is available, the JVM might simply not waste time running the garbage collector while turning your 500 megabytes into three different representations. Therefore you are likely to "triplicate" the number of bytes in your file, at least until the GC kicks in and throws away the no-longer-needed file lines and split arrays.

Jackson writeValueAsString too slow

I want to create JSON string from object.
ObjectMapper om = new ObjectMapper();
String str = om.writeValueAsString(obj);
Some objects are large, and it takes a long time to create the JSON string.
Creating an 8 MB JSON string takes about 15 seconds.
How can I improve this?
Make sure you have enough memory: Java String for storing 8 MB of serialized JSON needs about 16 megabytes of contiguous memory in heap.
But more importantly: why are you creating a java.lang.String in memory?
What possible use is there for such a huge String?
If you need to write JSON content to a file, there are different methods for that; similarly for writing to a network socket. At very least you could write output as a byte[] (takes 50% less memory), but in most cases incremental writing to an external stream requires very little memory.
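For example, a sketch of streaming the output instead of building an intermediate String (the file name is a placeholder, obj is the object from the question, and the usual java.io/java.nio and Jackson imports are assumed):

ObjectMapper om = new ObjectMapper();

// Write straight to a file without materializing the JSON in memory.
om.writeValue(new File("output.json"), obj);

// Or write to any OutputStream (socket, servlet response, ...).
try (OutputStream out = Files.newOutputStream(Paths.get("output.json"))) {
    om.writeValue(out, obj);
}

// If an in-memory result is really needed, bytes take roughly half the memory of a String.
byte[] json = om.writeValueAsBytes(obj);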
15 seconds is definitely very slow. Without GC problems, after initial warmup, Jackson should write 8 megabytes in a fraction of a second, something like 10-20 milliseconds for a simple object consisting of standard Java types.
EDIT:
Just realized that during construction of the result String, temporary memory usage will be doubled as well, since buffered content is not yet cleared when String is constructed. So 8 MB would need at least 32 MB to construct String. With default heap of 64 MB this would not work well.

best way of loading a large text file in java

I have a text file with a sequence of integers per line:
47202 1457 51821 59788
49330 98706 36031 16399 1465
...
The file has 3 million lines in this format. I have to load this file into memory, extract 5-grams out of it, and do some statistics on it. I do have a memory limitation (8GB RAM). I tried to minimize the number of objects I create (I only have 1 class with 6 float variables and some methods). Each line of that file basically generates a number of objects of this class (proportional to the size of the line in terms of number of words). I started to feel that Java is not a good way to do these things when C++ is around.
Edit:
Assume that each line produces (n-1) objects of that class, where n is the number of tokens in that line separated by spaces (e.g. 1457). Considering an average of 10 words per line, each line maps to 9 objects on average, so there will be 9*3*10^6 objects. The memory needed is therefore: 9*3*10^6 * (8-byte object header + 6 x 4-byte floats) + (a Map(String, Object) and another Map(Integer, ArrayList(Object))). I need to keep everything in memory, because there will be some mathematical optimization happening afterwards.
Reading/Parsing the file:
The best way to handle large files, in any language, is to try and NOT load them into memory.
In Java, have a look at MappedByteBuffer. It allows you to map a file into process memory and access its contents without loading the whole thing into your heap.
You might also try reading the file line-by-line and discarding each line after you read it - again to avoid holding the entire file in memory at once.
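A minimal sketch of that line-by-line approach (processLine stands in for the 5-gram extraction and statistics; the file name is a placeholder):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineStreamer {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("data.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                processLine(line);   // update statistics, then let the line become garbage
            }
        }
    }

    static void processLine(String line) {
        // placeholder for the actual 5-gram work
    }
}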
Handling the resulting objects
For dealing with the objects you produce while parsing, there are several options:
Same as with the file itself - if you can perform whatever it is you want to perform without keeping all of them in memory (while "streaming" the file), that is the best solution. You didn't describe the problem you're trying to solve, so I don't know if that's possible.
Compression of some sort - switch from wrapper objects (Float) to primitives (float), use something like the flyweight pattern to store your data in giant float[] arrays and only construct short-lived objects to access it (see the sketch after this list), or find some pattern in your data that allows you to store it more compactly
Caching/offload - if your data still doesn't fit in memory, "page it out" to disk. This can be as simple as extending Guava to page out to disk, or bringing in a library like Ehcache or the like.
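A minimal sketch of the flyweight idea from the second option (the field count of 6 matches the class described in the question; the names are invented):

public class NGramStore {
    private static final int FIELDS = 6;      // six floats per record
    private final float[] data;               // all records packed into one array

    public NGramStore(int recordCount) {
        this.data = new float[recordCount * FIELDS];
    }

    public void set(int record, int field, float value) {
        data[record * FIELDS + field] = value;
    }

    public float get(int record, int field) {
        return data[record * FIELDS + field];
    }
}

This replaces millions of small objects, each with its own object header, with a single large array.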
A note on Java collections, and maps in particular
For small objects, Java collections (and maps in particular) incur a large memory penalty, due mostly to everything being wrapped as Objects and to the Map.Entry instances. At the cost of a slightly less elegant API, you should probably look at the GNU Trove collections if memory consumption is an issue.
Optimal would be to hold only integers and line ends.
To that end, one way would be: convert the file to two files:
one binary file of integers (4 bytes)
one binary file with indexes where the next line would start.
For this one can use a Scanner to read, and a DataOutputStream+BufferedOutputStream to write.
Then you can load those two files in arrays of primitive type:
int[] integers = new int[(int)integersFile.length() / 4];
int[] lineEnds = new int[(int)lineEndsFile.length() / 4];
Reading can be done with a MappedByteBuffer and its asIntBuffer() view. (You would then not even need the arrays, but it would become a bit COBOL-like in its verbosity.)
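A sketch of the conversion step described above (the file names are placeholders):

import java.io.*;
import java.util.Scanner;

public class ToBinary {
    public static void main(String[] args) throws IOException {
        // Writes a packed binary file of ints plus an index file recording
        // where each line ends (as an int offset into the ints file).
        try (Scanner in = new Scanner(new File("numbers.txt"));
             DataOutputStream ints = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream("integers.bin")));
             DataOutputStream ends = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream("lineEnds.bin")))) {
            int count = 0;
            while (in.hasNextLine()) {
                Scanner lineScanner = new Scanner(in.nextLine());
                while (lineScanner.hasNextInt()) {
                    ints.writeInt(lineScanner.nextInt());
                    count++;
                }
                ends.writeInt(count);            // index where this line ends
            }
        }
    }
}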

Is this leaking memory or am I just reaching the limit of objects I can keep in memory?

I have two large CSV files which contain data that is required for users of a web application to validate some info. I defined an ArrayList< String[] > and intended to keep the contents of both files in memory so I wouldn't have to read them each time a user logged in and used the application.
I'm getting a java.lang.OutOfMemoryError: Java heap space, though, when initializing the application and trying to read the second file. (It finishes reading the first file just fine but hangs when reading the second file and after a while I get that exception)
The code for reading the files is pretty straightforward:
ArrayList<String[]> tokenizedLines = new ArrayList<String[]>();

public void parseTokensFile() throws Exception {
    BufferedReader bRead = null;
    FileReader fRead = null;
    try {
        fRead = new FileReader(this.tokensFile);
        bRead = new BufferedReader(fRead);
        String line;
        while ((line = bRead.readLine()) != null) {
            tokenizedLines.add(StringUtils.split(line, fieldSeparator));
        }
    } catch (Exception e) {
        throw new Exception("Error parsing file.");
    } finally {
        bRead.close();
        fRead.close();
    }
}
I read that Java's split function could use up a lot of memory when reading large amounts of data, since substring keeps a reference to the original string, so a substring of some String will use up as much memory as the original even though we only want a few chars. So I made a simple split function to try to avoid this:
public String[] split(String inputString, String separator) {
    ArrayList<String> storage = new ArrayList<String>();
    String remainder = new String(inputString);
    int separatorLength = separator.length();
    while (remainder.length() > 0) {
        int nextOccurance = remainder.indexOf(separator);
        if (nextOccurance != -1) {
            storage.add(new String(remainder.substring(0, nextOccurance)));
            remainder = new String(remainder.substring(nextOccurance + separatorLength));
        } else {
            break;
        }
    }
    storage.add(remainder);
    String[] tokenizedFields = storage.toArray(new String[storage.size()]);
    storage = null;
    return tokenizedFields;
}
This gives me the same error, though, so I'm wondering if it's not a memory leak but simply that I can't have structures with so many objects in memory. One file is about 600,000 lines long, with 5 fields per line, and the other is around 900,000 lines long with about the same number of fields per line.
The full stacktrace is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at xxx.xxx.xxx.StringUtils.split(StringUtils.java:16)
at xxx.xxx.xxx.GFTokensFile.parseTokensFile(GFTokensFile.java:36)
So, after the long post (sorry :P), is this a restriction of the amount of memory assigned to my JVM or am I missing something obvious and wasting resources somewhere?
Your JVM won't get more than 2GB on a 32-bit operating system with 4GB of RAM. That's one upper limit.
The second is the max heap size you specify when you start the JVM. Look at that -Xmx parameter.
The third is the fact of life that you cannot fit X units of anything into a Y sized container where X > Y. You know the size of your files. Try parsing each one individually and seeing what kind of heap they're consuming.
I'd recommend that you download Visual VM, install all the available plugins, and have it monitor your application while it's running. You'll be able to see the entire heap, perm gen space, GC collection, what objects are taking up the most memory, etc.
Getting data is invaluable for all problems, but especially ones like this. Without it, you're just guessing.
I cannot see a storage leak in the original version of the program.
The scenarios where split and similar methods can leak significant storage are rather limited:
You have to NOT be retaining a reference to the original string that you split.
You need to be retaining references to a subset of the strings produced by the string splitting.
What happens when String.substring() is called is that it creates a new String object that shares the original String's backing array. If the original String reference is then garbage collected, then the substring String is now holding onto an array of characters that includes characters that are not "in" the substring. This can be a storage leak, depending on how long the substring is kept.
In your example, you are keeping strings that contain all characters apart from the field separator character. There is a good chance that this is actually saving space compared to the space used if each substring were an independent String. Certainly, it is no surprise that your version of split doesn't solve the problem.
I think you need to either increase the heap size, or change your application so that it doesn't need to keep all of the data in memory at the same time.
Try improving your code or leave data processing to a database.
The memory usage is larger than your file sizes because the code makes redundant copies of the data: there is the data still to be processed, the processed data, and some partial data.
String is immutable (see here); there is no need to use new String(...) to store the result, as split already makes that copy.
If you can, delegate the whole data storage and searching to a database. CSV files are easily imported/exported to databases and they do all the hard work.
While I wouldn't recommend actual string interning for what you are doing, how about using the idea behind that technique? You could use a HashSet or HashMap to make sure you only use a single String instance whenever your data contains the same sequence of characters. I mean, there must be some kind of overlap in the data, right?
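A small sketch of that idea, using a map as a canonical pool instead of String.intern() (the names are illustrative):

import java.util.HashMap;
import java.util.Map;

public class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    // Returns one shared instance per distinct character sequence, so repeated
    // fields in the CSV (countries, codes, ...) are stored only once.
    public String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }
}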
On the other hand, what you might be seeing here could be a bad case of heap fragmentation. I'm not sure how the JVM handles these cases, but in the Microsoft CLR larger objects (especially arrays) will be allocated on a separate heap. Growth strategies, such as those of the ArrayList will create a larger array, then copy over the content of the previous array before releasing the reference to it. The Large Object Heap (LOH) isn't compacted in the CLR, so this growth strategy will leave huge areas of free memory that the ArrayList can no longer use.
I don't know how much of that applies to the Java VM, but you could try building the list using a LinkedList first, then dumping the list's content into an ArrayList or directly into an array. That way the large array of lines would be created only once, without causing any fragmentation.
Be sure that the total length of both files is lower than your heap size. You can set the max heap size using the JVM option -Xmx.
Then, if you have that much content, maybe you shouldn't load it entirely into memory. I once had a similar problem and fixed it using an index file that stores the offsets of the information in the large file; then I just had to read one line at the right offset.
Also, in your split method there are some strange things.
String remainder = new String(inputString);
You don't have to preserve inputString by making a copy; Strings are immutable, so changes only apply within the scope of the split method.

Huge String Table in Java

I've got a question about storing a huge amount of Strings in application memory. I need to load from a file and store about 5 million lines, each of them at most 255 chars (URLs), but mostly ~50. From time to time I'll need to search for one of them. Is it possible to make this app run on ~1GB of RAM?
Will
ArrayList <String> list = new ArrayList<String>();
work?
As far as I know, String in Java is coded in UTF-8, which gives me huge memory use. Is it possible to make such an array with Strings coded in ANSI?
This is console application run with parameters:
java -Xmx1024M -Xms1024M -jar "PServer.jar" nogui
The latest JVMs support -XX:+UseCompressedStrings by default, which stores strings that contain only ASCII as a byte[] internally.
Having several GB of text in a List isn't a problem, but it can take a while to load from disk (many seconds).
If the average URL is 50 ASCII chars, with 32 bytes of overhead per String, 5 M entries could use about 400 MB, which isn't much for a modern PC or server.
A Java String is a full-blown object. This means that, apart from the characters of the string themselves, there is other information to store in it (a pointer to the class of the object, the object header used for locking and garbage collection, and some other infrastructure data). So an empty String already takes 45 bytes in memory (as you can see here).
Now you just have to add the maximum length of your strings and do some simple calculations to get the maximum memory for that list.
Anyway, I would suggest loading the strings as byte[] if you have memory issues. That way you can control the encoding and you can still do searches.
Is there some reason you need to restrict it to 1 GB? If you want to search through them, you definitely don't want to swap to disk, but if the machine has more memory it makes sense to go higher than 1 GB.
If you have to search, use a SortedSet, not an ArrayList
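For illustration, a sketch of the lookup with a TreeSet (the file path is a placeholder); contains() is then O(log n) instead of a linear scan over an ArrayList:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class UrlIndex {
    public static void main(String[] args) throws IOException {
        TreeSet<String> urls;
        try (Stream<String> lines = Files.lines(Paths.get("urls.txt"))) {
            urls = lines.collect(Collectors.toCollection(TreeSet::new));
        }
        System.out.println(urls.contains("http://example.com/some/page"));
    }
}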
