Serialization - differences between C++ and Java

I've recently been running some benchmarks trying to find the "best" serialization frameworks for C++ and also in Java. The factors that make up "best" for me are the speed of de/serialization and the resulting size of the serialized object.
If I look at the results of various frameworks in Java, I see that the resulting byte[] is generally smaller than the object size in memory. This is even the case with the built-in Java serialization. If you then look at some of the other offerings (protobuf etc.) the size decreases even more.
I was quite surprised that when I looked at things on the C++ side (Boost, protobuf), the resulting object is generally no smaller (and in some cases bigger) than the original object.
Am I missing something here? Why do I get a fair amount of "compression" for free in Java but not in C++?
N.B. For measuring the size of the objects in Java, I'm using Instrumentation: http://docs.oracle.com/javase/6/docs/api/java/lang/instrument/Instrumentation.html
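For reference, Instrumentation is only available through a Java agent; a minimal sketch of the usual premain hook (the class and jar names here are illustrative, not from the question):

import java.lang.instrument.Instrumentation;

public class ObjectSizer {
    private static volatile Instrumentation instrumentation;

    // Registered via the Premain-Class entry in the agent jar's manifest
    // and loaded with: java -javaagent:objectsizer.jar ...
    public static void premain(String args, Instrumentation inst) {
        instrumentation = inst;
    }

    // Shallow size of a single object, as reported by the JVM (approximate by design).
    public static long sizeOf(Object o) {
        return instrumentation.getObjectSize(o);
    }
}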

Did you compare the absolute size of the data? I would say that Java has more overhead, so if you "compress" the data into a serialized buffer, the amount of overhead decreases a lot more. In C/C++ you have almost the bare minimum required for the physical data size, so there is not much room for compression. In fact, you have to add additional information to be able to deserialize it, which can even make the serialized form larger than the original object.
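One rough way to see this effect is to compare the Instrumentation-based size with the length of the default-serialized form; a minimal sketch (the helper name is mine, not from the question):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializedSize {
    // Number of bytes produced by default Java serialization for a single object graph.
    static int serializedSize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.size();
    }
}

Comparing that count with the Instrumentation size for the same object makes the stripped-out header and padding overhead visible; note that the serialized form also carries a class descriptor, which is why very small objects can serialize to more bytes than they occupy in memory.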

Object size can be bigger than the actual data size because of the padding between data members.
When an object is serialized, this padding is discarded, and as a result the serialized form is smaller.
Because Java is a managed environment, each object also carries header and bookkeeping data for memory management and ownership; since that overhead is dropped on serialization as well, the apparent "compression" is bigger.

Related

Minimizing overhead for large trees in java

I need to implement a large octree in Java. The tree will be very large so I will need to use a paging mechanism and split the tree into many small files for long-term storage.
My concern is that Java objects have a very high space overhead. If I were using C, I would be able to store the 8 reference pointers and any other data with only a byte or so of overhead to store the node type.
Is there any way that I can approach this level of overhead in Java?
I would be tempted to just use a single byte array per file. I could then use offsets in place of pointers (this is how I plan on storing the files). However, even when limiting the max file size, that would easily leave me with arrays too large to fit in contiguous memory, particularly if that memory becomes heavily fragmented. This would also lead to large time overheads for adding new nodes as the entire space would need to be reallocated. A ByteBuffer might resolve the first problem (I am not entirely sure of this), however it would not solve the second, as the size of a ByteBuffer is static.
For the moment I will just stick to using node objects. If anyone knows a more space efficient solution with a low time cost, please let me know.
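For illustration, the offset-based layout sketched above might look roughly like this, assuming a fixed record of 8 child offsets plus one type byte per node (all names and sizes here are my own choices, not from the question):

import java.nio.ByteBuffer;

public class PackedOctree {
    // Per node: 8 child offsets (4 bytes each) + 1 byte of node type = 33 bytes.
    private static final int NODE_BYTES = 8 * 4 + 1;
    private final ByteBuffer nodes;

    PackedOctree(int maxNodes) {
        this.nodes = ByteBuffer.allocate(maxNodes * NODE_BYTES);
    }

    // Offset of child i of the node stored at 'offset'; a negative value means "no child".
    int child(int offset, int i) {
        return nodes.getInt(offset + i * 4);
    }

    byte nodeType(int offset) {
        return nodes.get(offset + 8 * 4);
    }
}

The resizing concern raised in the question still applies: growing the tree means allocating a bigger buffer and copying, unless the storage is split into fixed-size pages.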

How do I calculate 64bit Java Memory Cost

I'm trying to find a simple and accurate reference for the cost in bytes of Java 64 bit Objects. I've not been able to find this. The primitives are clearly specified, but there are all these edge cases and exceptions that I am trying to figure out, like padding for an Object and the cost vs. the space they actually take up on the heap, etc. From the gist of what I'm reading here: http://btoddb-java-sizing.blogspot.com/ that can actually be different?? :-/
If you turn off the TLAB, you will get accurate accounting and you can see exactly how much memory each object allocation uses.
The best way to see where your memory is being used is via a memory profiler. Worrying about bytes here and there is most likely a waste of time. When you have hundreds of MB, then it makes a difference, and the best way to see that is in a profiler.
BTW, most systems use 32-bit references, even in 64-bit JVMs. There is no such thing as a 64-bit Object. Apart from the header, the object uses the same space whether it is in a 32-bit JVM or using 32-bit references in a 64-bit JVM.
You are essentially asking for a simple way to get an accurate prediction of object sizes in Java.
Unfortunately ... there isn't one!
The blog posting you found mentions a number of complicating factors. Another one is that the object sizing calculation can potentially change from one Java release to the next, or between different Java implementation vendors.
In practice, your options are:
Estimate the sizes based on what you know, and accept that your estimates may be wrong. (If you take account of enough factors, you should be able to get reasonable ballpark estimates, at least for a particular platform. But accurate predictions are inherently hard work.)
Write micro benchmarks using the TLAB technique to measure the size of the objects (a rough sketch follows at the end of this answer).
The other point is that in most cases it doesn't matter if your object size predictions are not entirely accurate. The recommended approach is to implement, measure and then optimize. This does not require accurate size information until you get to the optimization stage, and at that point you can measure the sizes ... if you need the information.
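A minimal sketch of that TLAB-off measurement, assuming the JVM is started with -XX:-UseTLAB so allocations are not batched into thread-local buffers (the class name and the measured type are placeholders):

public class SizeBenchmark {
    // Run with: java -XX:-UseTLAB SizeBenchmark
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        int count = 1_000_000;
        Object[] keep = new Object[count];   // hold references so nothing is collected mid-measurement
        System.gc();
        long before = rt.totalMemory() - rt.freeMemory();
        for (int i = 0; i < count; i++) {
            keep[i] = new Object();          // replace with the class you want to size
        }
        long after = rt.totalMemory() - rt.freeMemory();
        System.out.println("approx bytes per object: " + (after - before) / (double) count);
        System.out.println(keep.length);     // keep the array reachable
    }
}

Averaging over a large number of instances smooths out allocation granularity and gives a reasonable per-object estimate for a given platform.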

How to create efficient bit set structure for big data?

Java's BitSet is entirely in memory and has no compression in it.
Say I have 1 billion entries in a bit map - that occupies 125 MB in memory.
Say I have to do AND and OR operations on 10 such bit maps; that takes 1250 MB (about 1.25 GB) of memory, which is unacceptable.
How can I do fast operations on such bit maps without holding them uncompressed in memory?
I do not know the distribution of the bits in the bit set.
I have also looked at JavaEWAH, which is a variant of the Java BitSet class, using run-length encoding (RLE) compression.
Is there any better solution?
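For reference, the JavaEWAH route mentioned above looks roughly like this (a sketch from memory; check the library's documentation for the exact API of the version you use):

import com.googlecode.javaewah.EWAHCompressedBitmap;

public class EwahExample {
    public static void main(String[] args) {
        EWAHCompressedBitmap a = new EWAHCompressedBitmap();
        EWAHCompressedBitmap b = new EWAHCompressedBitmap();
        a.set(3);                // EWAH requires bits to be set in increasing order
        a.set(1_000_000);
        b.set(1_000_000);

        EWAHCompressedBitmap and = a.and(b);   // operates on the compressed form directly
        EWAHCompressedBitmap or  = a.or(b);
        System.out.println(and.cardinality() + " / " + or.cardinality());
    }
}

How well this works depends entirely on the bit distribution: RLE compresses long runs of identical bits, so uniformly random bitmaps gain little.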
One solution is to keep the arrays off the heap.
You'll want to read this answer by @PeterLawrey to a related question.
In summary, the performance of memory-mapped files in Java is quite good and it avoids keeping huge collections of objects on the heap.
The operating system may limit the size of an individual memory-mapped region. It's easy to work around this limitation by mapping multiple regions. If the regions are of fixed size, simple bit operations on the entity's index can be used to find the corresponding memory-mapped region in the list of memory-mapped files.
Are you sure you need compression? Compression trades time for space. It's possible that the reduced I/O ends up saving you time, but it's also possible that it won't. Can you add an SSD?
If you haven't yet tried memory-mapped files, start with that. I'd take a close look at implementing something on top of Peter's Chronicle.
If you need more speed you could try doing your binary operations in parallel.
If you end up needing compression you could always implement it on top of Chronicle's memory mapped arrays.
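A minimal sketch of the memory-mapped approach, assuming two existing files of raw 64-bit words and a bitmap small enough for a single mapping (the file names are placeholders; larger bitmaps need several mapped regions, as noted above):

import java.io.RandomAccessFile;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;

public class MappedBitmapAnd {
    public static void main(String[] args) throws Exception {
        long bytes = 125_000_000L;   // 1 billion bits
        try (FileChannel a = new RandomAccessFile("bitmapA.bin", "rw").getChannel();
             FileChannel b = new RandomAccessFile("bitmapB.bin", "rw").getChannel()) {
            // Each mapping must stay under 2 GB; bigger bitmaps would use a list of mappings.
            LongBuffer wordsA = a.map(FileChannel.MapMode.READ_WRITE, 0, bytes).asLongBuffer();
            LongBuffer wordsB = b.map(FileChannel.MapMode.READ_WRITE, 0, bytes).asLongBuffer();
            // AND b into a, one 64-bit word at a time, without a heap copy of either bitmap.
            for (int i = 0; i < wordsA.capacity(); i++) {
                wordsA.put(i, wordsA.get(i) & wordsB.get(i));
            }
        }
    }
}

The operating system pages the data in and out as needed, so the Java heap only ever holds the small buffer objects, not the bitmaps themselves.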
From the comments, here is what I would say as a complement to your initial question:
the distribution of the bits is unknown, so BitSet is probably the best representation we can use
you need to use the bit fields in different modules and want to cache them
That being said, my advice would be to implement a dedicated cache solution, using a LinkedHashMap with access order if LRU is an acceptable eviction strategy, and keeping permanent storage of the BitSets on disk.
Pseudo code:
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

class BitSetHolder {
    private final int size;                // maximum number of cached BitSets
    private final BitSetCache bitSetCache;

    BitSetHolder(int size) {
        this.size = size;
        this.bitSetCache = new BitSetCache(size);
    }

    class BitSetCache extends LinkedHashMap<Integer, BitSet> {
        BitSetCache(int capacity) {
            super(capacity, 0.75f, true);  // access order => LRU eviction
        }
        @Override
        protected boolean removeEldestEntry(Map.Entry<Integer, BitSet> eldest) {
            return size() > BitSetHolder.this.size; // size is known in BitSetHolder
        }
    }

    BitSet get(int i) {                    // get from cache, load from disk if absent
        BitSet bitSet = bitSetCache.get(i);
        if (bitSet == null) {
            bitSet = readFromDisk(i);      // if not in cache, put it in cache
            bitSetCache.put(i, bitSet);
        }
        return bitSet;
    }

    private BitSet readFromDisk(int i) {
        return new BitSet();               // placeholder for the on-disk storage
    }
}
That way:
you have transparent access to your 10 bit sets
you keep the most recently accessed bit sets in memory
you limit the memory to the size of the cache (the minimum size should be 3 if you want to create a bit set by combining 2 others)
If this is an option for your requirements, I could develop it a little more. Anyway, this is adaptable to other eviction strategies, LRU being the simplest as it is native to LinkedHashMap.
The best solution depends a great deal on the usage patterns and structure of the data.
If your data has some structure beyond a raw bit blob, you might be able to do better with a different data structure. For example, a word list can be represented very efficiently in both space and lookup time using a DAG.
Sample Directed Graph and Topological Sort Code
BitSet is internally represented as a long[], which makes it slightly harder to refactor. If you grab the source out of OpenJDK, you'd want to rewrite it so that internally it uses iterators, backed by either files or in-memory compressed blobs. You would have to rewrite all the loops in BitSet to use those iterators, so the entire blob never has to be instantiated.
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/BitSet.java
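As a sketch of that streaming idea (not the real BitSet internals; the raw-long file format and the names are assumptions), two bitmaps can be combined one word at a time straight from disk, so neither is ever fully resident:

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class StreamingOr {
    // ORs two files of raw 64-bit words and reports the combined cardinality
    // without holding either bitmap in memory. Assumes plain files of equal length.
    static long orCardinality(String fileA, String fileB) throws IOException {
        try (DataInputStream a = new DataInputStream(new BufferedInputStream(new FileInputStream(fileA)));
             DataInputStream b = new DataInputStream(new BufferedInputStream(new FileInputStream(fileB)))) {
            long bits = 0;
            while (a.available() > 0 && b.available() > 0) {
                bits += Long.bitCount(a.readLong() | b.readLong());
            }
            return bits;
        }
    }
}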

Serialized object size vs in memory object size in Java

Is there a way of estimating (roughly) the in-memory object size from the serialized object size in Java?
The size in memory will usually be between half and double the serialized size. The most extreme example might be a Byte object, which can be more than 80 bytes serialized but only 16 bytes in memory.
You can use a profiler to tell you how much memory an object uses. Another way is to use a tool based on Instrumentation.getObjectSize(object)
You might find this interesting Getting the size of an Object
A very nice tool for this challenge:
https://github.com/jbellis/jamm
From the readme.txt:
MemoryMeter is as accurate as java.lang.instrument.Instrumentation.getObjectSize, which only claims to provide "approximate" results, but in practice seems to work as expected.
MemoryMeter uses reflection to crawl the object graph for measureDeep. Reflection is slow: measuring a one-million object Cassandra Memtable (that is, 1 million children from MemoryMeter.countChildren) took about 5 seconds wall clock time.
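Typical usage is roughly the following; the exact API differs between jamm versions, and the jar must be loaded as an agent (-javaagent:jamm.jar) for it to work:

import org.github.jamm.MemoryMeter;

public class MeasureExample {
    public static void main(String[] args) {
        MemoryMeter meter = new MemoryMeter();
        Object subject = new java.util.ArrayList<String>();
        System.out.println("shallow: " + meter.measure(subject));      // the object itself
        System.out.println("deep:    " + meter.measureDeep(subject));  // object plus everything it references
    }
}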

determining java memory usage

Hmmm. Is there a primer anywhere on memory usage in Java? I would have thought Sun or IBM would have had a good article on the subject but I can't find anything that looks really solid. I'm interested in knowing two things:
at runtime, figuring out how much memory the classes in my package are using at a given time
at design time, estimating general memory overhead requirements for various things like:
how much memory overhead is required for an empty object (in addition to the space required by its fields)
how much memory overhead is required when creating closures
how much memory overhead is required for collections like ArrayList
I may have hundreds of thousands of objects created and I want to be a "good neighbor" to not be overly wasteful of RAM. I mean I don't really care whether I'm using 10% more memory than the "optimal case" (whatever that is), but if I'm implementing something that uses 5x as much memory as I could if I made a simple change, I'd want to use less memory (or be able to create more objects for a fixed amount of memory available).
I found a few articles (Java Specialists' Newsletter and something from JavaWorld) and the built-in method java.lang.instrument.Instrumentation.getObjectSize(), which claims to measure an "approximation" (??) of memory use, but these all seem kind of vague...
(and yes I realize that a JVM running on two different OS's may be likely to use different amounts of memory for different objects)
I used JProfiler a number of years ago and it did a good job, and you could break down memory usage to a fairly granular level.
As of Java 5, on Hotspot and other VMs that support it, you can use the Instrumentation interface to ask the VM the memory usage of a given object. It's fiddly but you can do it.
In case you want to try this method, I've added a page to my web site on querying the memory size of a Java object using the Instrumentation framework.
As a rough guide in Hotspot on 32 bit machines:
objects use 8 bytes for "housekeeping"
fields use what you'd expect them to use given their bit length (though booleans tend to be allocated an entire byte)
object references use 4 bytes
overall object size has a granularity of 8 bytes (i.e. if you have an object with 1 boolean field it will use 16 bytes; if you have an object with 8 booleans it will also use 16 bytes)
There's nothing special about collections in terms of how the VM treats them. Their memory usage is the total of their internal fields plus -- if you're counting this -- the usage of each object they contain. You need to factor in things like the default array size of an ArrayList, and the fact that that size increases by a factor of 1.5 whenever the list gets full. But either by asking the VM or by using the above metrics, looking at the source code of the collections and "working it through" will essentially get you to the answer.
If by "closure" you mean something like a Runnable or Callable, well again it's just a boring old object like any other. (N.B. They aren't really closures!!)
You can use JMP, but it's only caught up to Java 1.5.
I've used the profiler that comes with newer versions of Netbeans a couple of times and it works very well, supplying you with a ton of information about memory usage and runtime of your programs. Definitely a good place to start.
If you are using a pre-1.5 VM, you can get the approximate size of objects by using serialization. Be warned though: this can require double the amount of memory for that object.
See if PerfAnal will give you what you are looking for.
This might not be the exact answer you are looking for, but the posts at the following link will give you very good pointers. Other Question about Memory
I believe the profiler included in Netbeans can monitor memory usage also, you can try that
