Java JSON object size: JsonNode vs. String

I work on a Java application that needs to hold ~50k JSON objects in memory.
Each JSON string is ~5000 characters long.
Extra memory consumption is my concern.
I want to compare the JSON objects later, but processing cost is not my concern, only the extra memory consumption.
Which is more efficient:
keeping the JSON as a Java String, or
keeping the JSON as a Jackson JsonNode object?
I tried serializing the JsonNode objects and the resulting files are smaller than the string size, but I am not sure whether the same holds in memory.
My use-case:
I need to detect changes to some objects, which are encoded as JSON. This change detection runs every minute and compares the current state with the last state (which we hold in memory).
There are no hooks, events, or similar mechanisms for getting notified of changes.
We already hold a list of these objects in memory, with only a limited subset of the JSON fields.
I cannot change that architecture.
Now, instead of mapping the JSON data to some POJO and comparing each property manually, the idea is to hold the JSON strings/objects and then calculate the diff/patch with some library.
This simplifies the logic a lot and is more generic, but we are worried about the extra memory consumption.
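For illustration, the comparison side of this could look roughly like the sketch below. ObjectMapper.readTree and JsonNode.equals are standard Jackson APIs; the class and method names here are made up for this example.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ChangeDetector {
    private final ObjectMapper mapper = new ObjectMapper();

    // Compare the last and the current state as parsed trees.
    // JsonNode.equals() performs a deep comparison: object field
    // order does not matter, array element order does.
    public boolean hasChanged(String lastJson, String currentJson) throws Exception {
        JsonNode last = mapper.readTree(lastJson);
        JsonNode current = mapper.readTree(currentJson);
        return !last.equals(current);
    }
}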

You can use the java.lang.instrument package's getObjectSize() method to get approximations of the object sizes for both approaches.
long getObjectSize(Object objectToSize)
From the javadoc:
Returns an implementation-specific approximation of the amount of storage consumed by the specified object. The result may include some or all of the object's overhead, and thus is useful for comparison within an implementation but not between implementations. The estimate may change during a single invocation of the JVM.
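As a rough sketch of how this is wired up: the agent below must be packaged in a jar with a Premain-Class manifest entry and started with -javaagent; the class and jar names are assumptions.

import java.lang.instrument.Instrumentation;

public class SizeAgent {
    private static volatile Instrumentation inst;

    // Called by the JVM before main() when started with
    // -javaagent:sizeagent.jar
    public static void premain(String args, Instrumentation instrumentation) {
        inst = instrumentation;
    }

    // Note: getObjectSize() reports a shallow size. For a String you
    // would add the backing array; for a JsonNode tree you would have
    // to walk the whole object graph and sum the parts.
    public static long sizeOf(Object o) {
        return inst.getObjectSize(o);
    }
}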

Related

Java object arrays - use of hardware memory cache

Iterating over consecutive elements of an array is generally considered to be more efficient than iterating over consecutive linked list elements because of caching.
This is undoubtedly true as long as the elements have elementary data types. But if the elements are objects, my understanding is that only the references to the objects will be stored in the contiguous memory area of the array (which is likely to be cached), while the actual object data may be stored anywhere in main memory and cannot be cached effectively.
Since you normally not only iterate over the container but also need to access object data in each iteration, doesn't this more or less kill the performance benefit of the array over the list?
Edit: The comment that different scenarios vary greatly is probably correct. So let's consider a specific one: you search for a specific object in the container. In order to find it, you need to compare a given string to another string that is a field of the object.
No: for objects ("pointers") there is an indirection in both cases. But a linked list additionally needs to follow a reference from every node to the next one, so it still has extra overhead.
But yes, relatively speaking, the gain concerns only part of the work; counting indirection steps, it is very roughly half of the pure walk-through.
And of course every indirection makes access patterns more chaotic and slower.
By the way, there is also ArrayList, which is similarly fast to a plain array.
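To make the search scenario from the edit concrete, here is a sketch (the Item class and its field are made up). Both variants must dereference each element to read the string; the linked list just adds one more pointer chase per node to advance:

import java.util.LinkedList;

class Item {
    final String name;
    Item(String name) { this.name = name; }
}

public class SearchDemo {
    // Array: the walk over the references is one contiguous scan.
    static Item findInArray(Item[] items, String key) {
        for (Item it : items)
            if (it.name.equals(key)) return it;
        return null;
    }

    // LinkedList: each step follows a node pointer before the same
    // per-element dereference, roughly doubling the indirections.
    static Item findInList(LinkedList<Item> items, String key) {
        for (Item it : items)
            if (it.name.equals(key)) return it;
        return null;
    }
}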

Tweaking java classes for CPU cache friendliness

When designing java classes, what are the recommendations for achieving CPU cache friendliness?
What I have learned so far is that one should use POD as much as possible (i.e. int instead of Integer). Thus, the data will be allocated consecutively when the containing object is allocated. E.g.
class Local
{
    private int data0;
    private int data1;
    // ...
}
is more cache friendly than
class NoSoLocal
{
    private Integer data0;
    private Integer data1;
    // ...
}
The latter will require two separate allocations for the Integer objects, which can be at arbitrary locations in memory, especially after a GC run. On the other hand, the first approach might lead to duplication of data in cases where the data could be reused.
Is there a way to have them located close to each other in memory, so that the parent object and its contained elements will be in the CPU cache at once rather than distributed arbitrarily over the whole memory, and so that the GC will keep them together?
You cannot force the JVM to place related objects close to each other (though the JVM tries to do so automatically). But there are certain tricks that make Java programs more cache-friendly.
Let me show you some examples from real-life projects.
BEWARE! This is not a recommended way to code in Java!
Do not adopt the following techniques unless you are absolutely sure why you are doing it.
Inheritance over composition. You've definitely heard the contrary principle, "Favor composition over inheritance". But with composition you have an extra reference to follow. This is not good for cache locality, and it also requires more memory. The classic example of inheritance over composition is the JDK 8 LongAdder and LongAccumulator classes, which extend the utility class Striped64.
Transform arrays of structures into a structure of arrays. This again helps to save memory and to speed up bulk operations on a single field, e.g. key lookups:
class Entry {
    long key;
    Object value;
}
Entry[] entries;
will be replaced with
long[] keys;
Object[] values;
Flatten data structures by inlining. My favorite example is inlining a 160-bit SHA1 hash represented by a byte[]. The code before:
class Blob {
    long offset;
    int length;
    byte[] sha1_hash;
}
The code after:
class Blob {
    long offset;
    int length;
    int hash0, hash1, hash2, hash3, hash4;
}
Replace String with char[]. You know, String in Java contains a char[] under the hood. Why pay the performance penalty for an extra reference?
Avoid linked lists. Linked lists are very cache-unfriendly; hardware works best with linear structures. LinkedList can often be replaced with ArrayList, and a classic HashMap may be replaced with an open-addressing hash table.
Use primitive collections. Trove is a high-performance library containing specialized lists, maps, sets etc. for primitive types.
Build your own data layouts on top of arrays or ByteBuffers. A byte array is a perfect linear structure. To achieve the best cache locality you can pack an object's data manually into a single byte array.
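As an illustration of that last point, here is a minimal sketch that packs the Blob records from above into a single ByteBuffer; the record layout and the class name are assumptions.

import java.nio.ByteBuffer;

public class BlobStore {
    // 8 bytes offset + 4 bytes length + 20 bytes SHA1 per record
    private static final int RECORD_SIZE = 8 + 4 + 20;
    private final ByteBuffer data;

    public BlobStore(int capacity) {
        data = ByteBuffer.allocate(capacity * RECORD_SIZE);
    }

    public void put(int index, long offset, int length, byte[] sha1) {
        int base = index * RECORD_SIZE;
        data.putLong(base, offset);
        data.putInt(base + 8, length);
        for (int i = 0; i < 20; i++) {
            data.put(base + 12 + i, sha1[i]);
        }
    }

    public long offsetAt(int index) {
        return data.getLong(index * RECORD_SIZE);
    }
}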
the first approach might lead to duplication of data in cases where the data could be reused.
But not in the case you mention. An int is 4 bytes, and a reference is typically also 4 bytes, so you don't gain anything by using Integer. For a more complex type, however, it can make a big difference.
Is there a way to have them located close to each other in memory, so that the parent object and its contained elements will be in the CPU cache at once rather than distributed arbitrarily over the whole memory, and so that the GC will keep them together?
The GC will do this anyway, provided the objects are only used in one place. If the objects are used in multiple places, they will be close to one reference.
Note: this is not guaranteed to be the case; however, when objects are allocated they will typically be contiguous in memory, as this is the simplest allocation strategy. When copying retained objects, the HotSpot GC will copy them in reverse order of discovery, i.e. they are still together but in reverse order.
Note 2: Using 4 bytes for an int is still going to be more efficient than using 28 bytes for an Integer (4 bytes for the reference, 16 bytes for the object header, 4 bytes for the value and 4 bytes for padding).
Note 3: Above all, you should favour clarity over performance, unless you have measured that you need a more performant solution. In this case, an int cannot be null but an Integer can be. If you want a value which should never be null, use int, not for performance but for clarity.

Using Strings vs POJOs: which one consumes more memory?

I have a database column of varchar type. I would like to store this data in an ArrayList for comparison with another list. I can think of two ways to do this. One is to get the data, assign it to a String, and store it in the ArrayList. The second method would be to have a POJO with a getter and setter for this variable, and store the POJO in the ArrayList. When I have to compare it against another variable, I either do a String comparison or get the value out of the POJO's getter and then compare. While I feel that using the String reduces a lot of code, I would like to know if using one (String) over the other (POJO) has any memory implication. I usually need to compare around 1000 objects. So which one would consume less memory, and, in other words, which one would run faster with better performance? I need to use Java 1.4.
In Java, every object you create takes heap space. In the POJO case that means heap space for the POJO plus heap space for the String it holds, and extra time for the getter/setter calls as well. I would prefer just Strings.

How to estimate the serialization size of objects in Java without actually serializing them?

To enhance messaging in a cluster, it's important to know at runtime roughly how big a message is (should I prefer processing it locally or remotely?).
I could only find frameworks that estimate the in-memory object size based on Java instrumentation. I've tested classmexer, which didn't come close to the serialization size, and sourceforge SizeOf.
In a small test case, SizeOf was around 10% off and 10x faster than actual serialization. (Still, transient fields break the estimation completely, and since e.g. ArrayList's backing storage is transient but is serialized as an array anyway, it's not easy to patch SizeOf. But I could live with that.)
On the other hand, 10x faster with 10% error doesn't seem very good. Any ideas how I could do better?
Update: I also tested ObjectSize (http://sourceforge.net/projects/objectsize-java). Results only seem good for objects without inheritance :(
The size a class takes at runtime doesn't necessarily have any bearing on its serialized size. The example you've mentioned is transient fields. Other examples include objects that implement Externalizable and handle serialization themselves.
If an object implements Externalizable or provides readObject()/writeObject() then your best bet is to serialize the object to a memory buffer to find out the size. It's not going to be fast, but it will be accurate.
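A minimal sketch of that buffer measurement (slow but exact, since it produces the real stream):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public final class SerializedSize {
    public static int of(Serializable obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(buffer);
        out.writeObject(obj);
        out.close();
        // Includes the stream header and the class descriptors,
        // exactly as they would appear on the wire.
        return buffer.size();
    }
}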
If an object is using the default serialization, then you could amend SizeOf to take into account transient fields.
After serializing many objects of the same type, you may be able to build up a "serialization profile" for that type that correlates serialized size with the runtime size reported by SizeOf. This then allows you to measure the runtime size quickly (using SizeOf) and convert it into an estimated serialized size that is more accurate than what SizeOf alone provides.
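A rough sketch of that profiling idea, assuming you measure a few instances exactly first; the class name and the ratio bookkeeping are made up, and SizeOf stands in for whatever fast estimator you use.

import java.util.HashMap;
import java.util.Map;

public class SizeProfile {
    private final Map<Class<?>, Double> ratios = new HashMap<Class<?>, Double>();

    // Record one exact measurement: the fast runtime estimate
    // (e.g. from SizeOf) versus the true serialized size.
    public void record(Object obj, long runtimeSize, long serializedSize) {
        ratios.put(obj.getClass(), (double) serializedSize / runtimeSize);
    }

    // Later, estimate the serialized size from the fast runtime
    // estimate alone, falling back to the raw value if the type
    // has not been profiled yet.
    public long estimate(Object obj, long runtimeSize) {
        Double ratio = ratios.get(obj.getClass());
        return ratio == null ? runtimeSize : Math.round(runtimeSize * ratio);
    }
}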
There are many good points in the other answers; one thing that is missing is that the serialization mechanism may cache certain objects.
For example, you serialize a series of objects A, B, and C, all of the same class, each holding two objects o1 and o2. Let us say that the object overhead is 100 bytes, and that the objects look like:
Object shared = new Object();
Object shared2 = new Object();
A.o1 = new Object();
A.o2 = shared;
B.o1 = shared2;
B.o2 = shared;
C.o1 = shared2;
C.o2 = shared;
For simplicity's sake we might say that the generic objects take 50 bytes each to serialize, so A's serialization size is 100 (overhead) + 50 (o1) + 50 (o2) = 200 bytes. One could make a similarly naive estimation for B and C as well. However, if all three are serialized by the same object output stream before reset is called, what you will see in the stream is a serialization of A with o1 and o2, then a serialization of B with its o1, BUT only a reference to o2, since that same object was already serialized. So, if an object reference takes 16 bytes, the size of B is now 100 (overhead) + 50 (o1) + 16 (reference for o2) = 166 bytes. The size it takes to serialize has changed!
We could do a similar calculation for C and get 100 + 16 + 16 = 132 bytes with both objects cached. So the serialization size differs for all three objects, with a ~33% difference between the largest and the smallest.
So unless you serialize each object with a fresh stream (no cache) every time, it is difficult to accurately estimate the size required to serialize an object.
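This caching is easy to observe. The sketch below writes the same object twice to one ObjectOutputStream (the payload string is arbitrary); the second write emits only a handful of bytes for the back-reference.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class SharedRefDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(buffer);

        String shared = "some fairly long shared payload string";
        out.writeObject(shared);
        out.flush();
        int firstWrite = buffer.size();

        out.writeObject(shared);   // cached: only a back-reference is written
        out.flush();
        int secondWrite = buffer.size() - firstWrite;

        System.out.println("first: " + firstWrite + " bytes, second: " + secondWrite + " bytes");
        out.close();
    }
}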
Just an idea: you could serialize the object to a byte buffer first, get its length, and then decide whether to send the buffer's contents to a remote location or do the processing locally (if that depends on the message size).
Drawback: you may waste the serialization effort if you later decide not to use the buffer. But if you estimate instead, you waste the estimation effort whenever you do need to serialize (because in that case you estimate first and serialize in a second step).
There is no way to estimate the serialized size of an object with both good precision and good speed. For example, an object could be a cache of the digits of Pi that reconstructs itself at runtime given only the length you need. It would serialize only the 4 bytes of its 'length' attribute, while the live object could be using hundreds of megabytes of memory to store the digits.
The only solution I can think of is adding your own interface with a method int estimateSerializeSize(). For every object implementing this interface you would call that method to get the proper size; for objects that don't implement it you would have to fall back to SizeOf.
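That interface might look like the following sketch; the method name comes from the answer, while the fallback wiring and the assumption that SizeOf.deepSizeOf is the library's deep-measurement entry point are mine.

public interface SizeEstimable {
    int estimateSerializeSize();
}

class MessageSizer {
    // Use the object's own estimate when available, otherwise fall
    // back to a generic (slower, approximate) measurement.
    static long sizeOf(Object msg) {
        if (msg instanceof SizeEstimable) {
            return ((SizeEstimable) msg).estimateSerializeSize();
        }
        return net.sourceforge.sizeof.SizeOf.deepSizeOf(msg);
    }
}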

Java's Representation of Serialized Objects

I'm looking for the format that Java uses to serialize objects. The default serialization serializes the object in a binary format. In particular, I'm curious to know if two runs of a program can serialize the same object differently.
What condition should an object satisfy so that the object maintains its behavior under Java's default serialization/deserialization round-trip?
You need the Java Object Serialization Specification at http://java.sun.com/javase/6/docs/platform/serialization/spec/protocol.html.
If you have two objects with all properties set to identical values, then they will be serialized the same way.
If it weren't repeatable, then it wouldn't be useful!
They will always serialize it the same way. If this wasn't the case, there would be no guarantee that another program could de-serialize the data correctly, defeating the purpose of serialization.
Typically, running the same single-threaded algorithm on the same data will produce the same result.
However, things such as the order in which a HashSet serialises its entries are not guaranteed. Indeed, an object may be subtly altered when serialised.
I like @Stephen C's example of Object.hashCode(). If such nondeterministic hash codes are serialized, then when we deserialize, the hash codes will be of no use. For example, if we serialize a HashMap keyed on Object.hashCode(), its deserialized version would behave differently from the original map. That is, looking up the same object would give different results in the two maps.
If you don't want binary, then you can use JSON (http://www.json.org/example.html) in Java: http://www.json.org/java/
Or XML, for that matter: http://www.developer.com/xml/article.php/1377961/Serializing-Java-Objects-as-XML.htm
"I'm looking for the format that Java uses to serialize objects."
Not to be inane, but it writes them somehow; how exactly can, and probably should, be determined by you. A Character maps to... well, it gets involved, but rather than re-inventing the wheel, let us ask: exactly what do you need to have available to reconstruct an object to a given state?
"The default serialization serializes the object in a binary format."
So? (Again, not trying to be inane; it sounds like we need to define a problem whose data model has not been pinned down yet.)
"I'm curious to know if two runs of a program can serialize the same object differently."
If you had a stream of information, how would you determine what state the object needed to be restored to?
