I wrote a small test program to show how many bytes we need to serialize Integer object:
ByteArrayOutputStream data = new ByteArrayOutputStream();
try {
ObjectOutputStream output = new ObjectOutputStream(data);
output.writeObject(1);
output.flush();
System.out.println(data.toByteArray().length);
} catch (IOException e) {
e.printStackTrace();
}
However, the result is surprising, it takes 81 bytes. If I serialize String "1", it only takes 8 bytes instead. I know java has optimization for String serialization, but why not do the same thing for Integer? I think it shouldn't be very difficult.
Or does anyone has some workaround? I need a method which can serialize everything include objects and basic types. Thanks for your answers!
It's a balancing act, between making the serialization protocol more complicated by having direct support for lots of types, and between keeping it simple.
In my experience, Integer values are relatively rare compared with int values - and the latter does have built-in support, along with all the other primitive types. It's also worth noting that although serializing a single Integer object is expensive, the incremental cost is much smaller, because there's already a reference to the class in the stream. So after the first Integer has been written, a new Integer only takes 10 bytes - and a reference to an Integer which has already been written to the stream (common if you're boxing small values) is only 5 bytes.
Personally I would try to avoid native Java binary serialization anyway - it's platform specific and very brittle, as well as not being terribly compact. I like Protocol Buffers but there are lots of other alternatives available too.
Related
Background: I'm writing some code which serializes and deserializes a binary protocol across a network, and I'm writing C#, Swift and Java client code.
We get a packet off the network (Byte array), deserialize it into a tree of structures, and then traverse the tree looking for stuff.
In C#, I made use of the Span<T> struct which gives you a "view" onto an existing memory buffer. Likewise in Swift, I used the Data struct which can also act like a "view".
In C++ I'd probably use some similar concept.
Essentially something like:
raw buffer: R_T1aaa_T2bbb_T3ccc.
parses as Root -> [Tag1[aaa],Tag2[bbb], Tag3[ccc]]
In the above model, the Root,Tag1,Tag2,Tag3 objects are structs allocated on the stack. The tag contents (aaa, bbb, ccc) are just pointers into the underlying raw buffer and we haven't allocated or copied any heap memory for them at all.
In Java, I had begun by creating a class like this (#MarkRotteveel kindly points out that I could use ByteBuffer rather than making my own wrapper class):
class ByteArrayView {
#NonNull final byte[] owner; // shared
final int offset;
final int length;
}
I was going to proceed with the same pattern as used in C#, Swift, however the thought occurred to me - I'm using array views to avoid heap allocation, however everything in Java is a heap allocation. Instead of copying some byte[] values around the place, I allocate ByteArrayView objects, but it's heap-allocations nonetheless.
I'm aware that if I had large "slices" then something like my ByteArrayView or java.nio.ByteBuffer would reduce the total amount of memory that my program allocations, but for the most part my tags are small - e.g. 4 bytes, 12 bytes, 32 bytes.
Does trading small byte[] objects for wrapper objects have any meaningful result if they're all still on the heap? Does it actually make things worse because there's an additional indirection before we get to the actual bytes that we want to read? Is it worth the extra code complexity?
What would experienced Java developers do here?
I want to write ONLY the values of the data members of an object into a file, so here I can can't use serialization since it writes a whole lot other information which i don't need. Here's is what I have implemented in two ways. One using byte buffer and other without using it.
Without using ByteBuffer:
1st method
public class DemoSecond {
byte characterData;
byte shortData;
byte[] integerData;
byte[] stringData;
public DemoSecond(byte characterData, byte shortData, byte[] integerData,
byte[] stringData) {
super();
this.characterData = characterData;
this.shortData = shortData;
this.integerData = integerData;
this.stringData = stringData;
}
public static void main(String[] args) {
DemoSecond dClass= new DemoSecond((byte)'c', (byte)0x7, new byte[]{3,4},
new byte[]{(byte)'p',(byte)'e',(byte)'n'});
File checking= new File("c:/objectByteArray.dat");
try {
if (!checking.exists()) {
checking.createNewFile();
}
// POINT A
FileOutputStream bo = new FileOutputStream(checking);
bo.write(dClass.characterData);
bo.write(dClass.shortData);
bo.write(dClass.integerData);
bo.write(dClass.stringData);
// POINT B
bo.close();
} catch (FileNotFoundException e) {
System.out.println("FNF");
e.printStackTrace();
} catch (IOException e) {
System.out.println("IOE");
e.printStackTrace();
}
}
}
Using byte buffer: One more thing is that the size of the data members will always remain fixed i.e. characterData= 1byte, shortData= 1byte, integerData= 2byte and stringData= 3byte. So the total size of this class is 7byte ALWAYS
2nd method
// POINT A
FileOutputStream bo = new FileOutputStream(checking);
ByteBuffer buff= ByteBuffer.allocate(7);
buff.put(dClass.characterData);
buff.put(dClass.shortData);
buff.put(dClass.integerData);
buff.put(dClass.stringData);
bo.write(buff.array());
// POINT B
I want know which one of the two methods is more optimized? And kindly give the reason also.
The above class DemoSecond is just a sample class.
My original classes will be of size 5 to 50 bytes. I don't think here size might be the issue.
But each of my classes is of fixed size like the DemoSecond
Also there are so many files of this type which I am going to write in the binary file.
PS
if I use serialization it also writes the word "characterData", "shortData", "integerData","stringData" also and other information which I don't want to write in the file. What I am corcern here is about THEIR VALUES ONLY. In case of this example its:'c', 7, 3,4'p','e','n'. I want to write only this 7bytes into the file, NOT the other informations which is USELESS to me.
As you are doing file I/O, you should bear in mind that the I/O operations are likely to be very much slower than any work done by the CPU in your output code. To a first approximation, the cost of I/O is an amount proportional to the amount of data you are writing, plus a fixed cost for each operating system call made to do the I/O.
So in your case you want to minimise the number of operating system calls to do the writing. This is done by buffering data in the application, so the application performs few put larger operating system calls.
Using a byte buffer, as you have done, is one way of doing this, so your ByteBuffer code will be more efficient than your FileOutputStream code.
But there are other considerations. Your example is not performing many writes. So it is likely to be very fast anyway. Any optimisation is likely to be a premature optimisation. Optimisations tend to make code more complicated and harder to understand. To understand your ByteBuffer code a reader needs to understand how a ByteBuffer works in addition to everything they need to understand for the FileOutputStream code. And if you ever change the file format, you are more likely to introduce a bug with the ByteBuffer code (for example, by having a too small a buffer).
Buffering of output is commonly done. So it should not surprise you that Java already provides code to help you. That code will have been written by experts, tested and debugged. Unless you have special requirements you should always use such code rather than writing your own. The code I am referring to is the BufferedOutputStream class.
To use it simply adapt your code that does not use the ByteBuffer, by changing the line of your code that opens the file to
OutputStream bo = new BufferedOutputStream(new FileOutputStream(checking));
The two methods differ only in the byte buffer allocated.
If you are concerning about unnecessary write action to file, there is already a BufferedOutputStream you can use, for which buffer is allocated internally, and if you are writing to same outputstream multiple times, it is definitely more efficient than allocating buffer every time manually.
It would be simplest to use a DataOutputStream around a BufferedOutputStream around the FileOutputStream.
NB You can't squeeze 'shortData' into a byte. Use the various primitives of DataOutputStream, and use the corresponding ones of DataInputStream when reading them back.
I want to try abusing Java classes as structures and for that I'm wondering if it is possible to serialize a byte array to a class and other way around.
So if I have a class like this:
public class Handshake
{
byte command;
byte error;
short size;
int major;
int ts;
char[] secret; // aligned size = 32 bytes
}
Is there an easy way (without having to manually read bytes and fill out the class which requires 3 times as much code) to deserialize a set of bytes into this class? I know that Java doesn't have structs but I'm wondering if it is possible to simplify the serialization process so it does it automatically. The bytes are not from Java's serializer, they are just aligned bytes derived from C structs.
The bytes are not from Java's serializer, they are just aligned bytes
derived from C structs.
Bad idea. It can break as soon as someone compiles that code on a different platform, using a different compiler or settings, etc.
Much better: use a standardized binary interface with implementations in Java and C++ like ASN.1 or Google's Protocol Buffers.
You can write a library to do the deserializtion using reflection. This may result in more code being required, but may suit your needs. It worth nothing that char in Java 16-bit rather than 8 bit and a char[] is a separate Object, unlike in C.
In short you can write a library which reads this data without touching the Handshake class. Only you can decide if this is actually easier than adding a method or two to the handshake class..
Do not do that! I will break sooner or later. Use some binary serialization format, like [Hessian][1], which supports both java and C++ (I'm not aware of anything that works on plain C)
Also remember C does not force size for int's or long's, they are platform dependent.
So if you must use C, and you are forced to write your own library, be very careful.
For evaluating an algorithm I have to count how often the items of a byte-array are read/accessed. The byte-array is filled with the contents of a file and my algorithm can skip over many of the bytes in the array (like for example the Boyer–Moore string search algorithm). I have to find out how often an item is actually read. This byte-array is passed around to multiple methods and classes.
My ideas so far:
Increment a counter at each spot where the byte-array is read. This seems error-prone since there are many of these spots. Additionally I would have to remove this code afterwards such that it does not influence the runtime of my algorithm.
Use an ArrayList instead of a byte-array and overwrite its "get" method. Again, there are a lot of methods that would have to be modified and I suspect that there would be a performance loss.
Can I somehow use the Eclipse debug-mode? I see that I can specify a hit-count for watchpoints but it does not seem to be possible to output the hit count?!
Can maybe the Reflection API help me somehow?
Somewhat like 2), but in order to reduce the effort: Can I make a Java method accept an ArrayList where it wants an array such that it transparently calls the "get" method whenever an item is read?
There might be an out-of-the-box solution but I'd probably just wrap the byte array in a simple class.
public class ByteArrayWrapper {
private byte [] bytes;
private long readCount = 0;
public ByteArrayWrapper( byte [] bytes ) {
this.bytes = bytes;
}
public int getSize() { return bytes.length; }
public byte getByte( int index ) { readCount++; return bytes[ index ]; }
public long getReadCount() { return readCount; }
}
Something along these lines. Of course this does influence the running time but not very much. You could try it and time the difference, if you find it is significant, we'll have to find another way.
The most efficient way to do this is to add some code injection. However this is likely to be much more complicated to get right than writing a wrapper for your byte[] and passing this around. (tedious but at least the compiler will help you) If you use a wrapper which does basicly nothing (no counting) it will be almost as efficient as not using a wrapper and when you want counting you can use an implementation which does that.
You could use EHCache without too much overhead: implement an in-memory cache, keyed by array index. EHCache provides an API which will let you query hit rates "out of the box".
There's no way to do this automatically with a real byte[]. Using JVM TI might help here, but I suspect it's overkill.
Personally I'd write a simple wrapper around the byte[] with methods to read() and write() specific fields. Those methods can then track all accesses (either individually for each byte, or as a total or both).
Of course this would require the actual access to be modified, but if you're testing some algorithms that might not be such a big drawback. The same goes for performance: it will definitely suffer a bit, but the effect might be small enough not to worry about it.
I have a very large object which I wish to serialize. During the process of serialization, it comes to occupy some 130MB of heap as an weblogic.utils.io.UnsyncByteArrayOutputStream. I am using a BufferedOutputStream to speed up writing the data to disk, which reduces the amount of time for which this object is held in memory.
Is it possible to use a buffer to reduce the size of the object in memory though? It would be good if there was a way to serialize it x bytes at a time and write those bytes to disk.
Sample code follows if it is of any use. There's not much to go on though I don't think. If it's the case that there needs to be a complete in-memory copy of the object to be serialised (and therefore no concept of a serialization buffer) then I suppose I am stuck.
ObjectOutputStream tmpSerFileObjectStream = null;
OutputStream tmpSerFileStream = null;
BufferedOutputStream bufferedStream = null;
try {
tmpSerFileStream = new FileOutputStream(tmpSerFile);
bufferedStream = new BufferedOutputStream(tmpSerFileStream);
tmpSerFileObjectStream = new ObjectOutputStream(bufferedStream);
tmpSerFileObjectStream.writeObject(siteGroup);
tmpSerFileObjectStream.flush();
} catch (InvalidClassException invalidClassEx) {
throw new SiteGroupRepositoryException(
"Problem encountered with class being serialised", invalidClassEx);
} catch (NotSerializableException notSerializableEx) {
throw new SiteGroupRepositoryException(
"Object to be serialized does not implement " + Serializable.class,
notSerializableEx);
} catch (IOException ioEx) {
throw new SiteGroupRepositoryException(
"Problem encountered while writing ser file", ioEx);
} catch (Exception ex) {
throw new SiteGroupRepositoryException(
"Unexpected exception encountered while writing ser file", ex);
} finally {
if (tmpSerFileObjectStream != null) {
try {
tmpSerFileObjectStream.close();
if(null!=tmpSerFileStream)tmpSerFileStream.close();
if(null!=bufferedStream)bufferedStream.close();
} catch (IOException ioEx) {
logger.warn("Exception caught on trying to close ser file stream", ioEx);
}
}
}
This is wrong on so many levels. This is a massive abuse of serialization. Serialization is mostly intended for temporarily storing an object. For example,
session objects between tomcat server restarts.
transferring objects between jvms ( load balancing at website )
Java's serialization makes no effort to handle long-term storage of objects (No versioning support) and may not handle large objects well.
For something so big, I would suggest some investigation first:
Ensure that you are not trying to persist the entire JVM Heap.
Look for member variables that can be labeled as 'transient' to avoid including them it the serialization ( perhaps you have references to service objects )
Consider possibility that there is a memory leak and the object is excessively large.
If everything is indeed correct, you will have to research alternatives to java.io.Serialization. Taking more control via java.io.Externalization might work. But I would suggest something like a json or xml representation.
Update:
Investigate :
google's protocol buffer
facebook's Thrift
Avro
Cisco's Etch
Take a look at this benchmarkings as well.
What is the "siteGroup" object that you're trying to save? I ask, because it's unlikely that any one object is 130MB in size, unless it has a ginormous list/array/map/whatever in it -- and if that's the case, the answer would be to persist that data in a database.
But if there's no monster collection in the object, then the problem is likely that the object tree contains references to a bagillion objects, and the serialization of course does a deep copy (this fact has been used as a shortcut to implement clone() a lot of times), so everything gets cataloged all at once in a top-down fashion.
If that's the problem, then the solution would be to implement your own serialization scheme where each object gets serialized in a bottom-up fashion, possibly in multiple files, and only references are maintained to other objects, instead of the whole thing. This would allow you to write each object out individually, which would have the effect you're looking for: smaller memory footprint due to writing the data out in chunks.
However, implementing your own serialization, like implementing a clone() method, is not all that easy. So it's a cost/benefit thing.
It sounds like whatever runtime you are using has a less-than-ideal implementation of object serialization that you likely don't have any control over.
A similar complaint is mentioned here, although it is quite old.
http://objectmix.com/weblogic/523772-outofmemoryerror-adapter.html
Can you use a newer version of weblogic? Can you reproduce this in a unit test? If so, try running it under a different JVM and see what happens.
I don't know about weblogic (that is - JRockit I suppose) serialization in particular: honestly I see no reason for using ByteArrayOutputStreams...
You may want to implement java.io.Externalizable if you need more control on how your object is serialized - or switch to an entirely different serialization system (eg: Terracotta) if you don't want to write read/write methods yourself (if you have many big classes).
Why does it occupy all those bytes as an unsync byte array output stream?
That's not how default serialization works. You must have some special code in there to make it do that. Solution: don't.