I have a very large object which I wish to serialize. During serialization, it comes to occupy some 130MB of heap as a weblogic.utils.io.UnsyncByteArrayOutputStream. I am using a BufferedOutputStream to speed up writing the data to disk, which reduces the amount of time for which this object is held in memory.
Is it possible to use a buffer to reduce the size of the object in memory though? It would be good if there was a way to serialize it x bytes at a time and write those bytes to disk.
Sample code follows, if it is of any use; there's not much to go on, though, I don't think. If it's the case that there needs to be a complete in-memory copy of the object to be serialised (and therefore no concept of a serialization buffer), then I suppose I am stuck.
ObjectOutputStream tmpSerFileObjectStream = null;
OutputStream tmpSerFileStream = null;
BufferedOutputStream bufferedStream = null;
try {
    tmpSerFileStream = new FileOutputStream(tmpSerFile);
    bufferedStream = new BufferedOutputStream(tmpSerFileStream);
    tmpSerFileObjectStream = new ObjectOutputStream(bufferedStream);
    tmpSerFileObjectStream.writeObject(siteGroup);
    tmpSerFileObjectStream.flush();
} catch (InvalidClassException invalidClassEx) {
    throw new SiteGroupRepositoryException(
            "Problem encountered with class being serialised", invalidClassEx);
} catch (NotSerializableException notSerializableEx) {
    throw new SiteGroupRepositoryException(
            "Object to be serialized does not implement " + Serializable.class,
            notSerializableEx);
} catch (IOException ioEx) {
    throw new SiteGroupRepositoryException(
            "Problem encountered while writing ser file", ioEx);
} catch (Exception ex) {
    throw new SiteGroupRepositoryException(
            "Unexpected exception encountered while writing ser file", ex);
} finally {
    if (tmpSerFileObjectStream != null) {
        try {
            tmpSerFileObjectStream.close();
            if (null != tmpSerFileStream) tmpSerFileStream.close();
            if (null != bufferedStream) bufferedStream.close();
        } catch (IOException ioEx) {
            logger.warn("Exception caught on trying to close ser file stream", ioEx);
        }
    }
}
This is wrong on so many levels. This is a massive abuse of serialization. Serialization is mostly intended for temporarily storing an object. For example,
session objects between Tomcat server restarts
transferring objects between JVMs (load balancing at a website)
Java's serialization makes no effort to handle long-term storage of objects (no real versioning support beyond serialVersionUID) and may not handle large objects well.
For something so big, I would suggest some investigation first:
Ensure that you are not trying to persist the entire JVM Heap.
Look for member variables that can be marked transient to avoid including them in the serialization (perhaps you have references to service objects).
Consider the possibility that there is a memory leak and the object is excessively large.
If everything is indeed correct, you will have to research alternatives to java.io.Serializable. Taking more control via java.io.Externalizable might work. But I would suggest something like a JSON or XML representation instead.
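If you do go down the Externalizable route, the idea is to stream the contents of the big collection element by element rather than handing one huge object graph to writeObject(). A minimal sketch, with invented field names since we can't see the real SiteGroup class:

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.util.ArrayList;
import java.util.List;

public class SiteGroup implements Externalizable {
    private String name;
    private List<String> entries = new ArrayList<>();

    public SiteGroup() {
        // a public no-arg constructor is required by Externalizable
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(entries.size());
        for (String entry : entries) {
            out.writeUTF(entry); // entries are written one at a time, not as a single graph
        }
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        name = in.readUTF();
        int size = in.readInt();
        entries = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            entries.add(in.readUTF());
        }
    }
}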
Update:
Investigate:
Google's Protocol Buffers
Facebook's Thrift
Avro
Cisco's Etch
Take a look at these benchmarks as well.
What is the "siteGroup" object that you're trying to save? I ask, because it's unlikely that any one object is 130MB in size, unless it has a ginormous list/array/map/whatever in it -- and if that's the case, the answer would be to persist that data in a database.
But if there's no monster collection in the object, then the problem is likely that the object tree contains references to a bagillion objects, and the serialization of course does a deep copy (this fact has been used as a shortcut to implement clone() a lot of times), so everything gets cataloged all at once in a top-down fashion.
If that's the problem, then the solution would be to implement your own serialization scheme where each object gets serialized in a bottom-up fashion, possibly in multiple files, and only references are maintained to other objects, instead of the whole thing. This would allow you to write each object out individually, which would have the effect you're looking for: smaller memory footprint due to writing the data out in chunks.
However, implementing your own serialization, like implementing a clone() method, is not all that easy. So it's a cost/benefit thing.
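To make the bottom-up idea a bit more concrete, here is a rough, hypothetical sketch (the NodeRecord and ChunkedWriter names are invented): each object becomes its own small serialized record, children are referred to by ID rather than by direct reference, and records are written one at a time so only one of them is in memory during the write.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class NodeRecord implements Serializable {
    long id;
    String payload;
    List<Long> childIds = new ArrayList<>(); // children referenced by ID, not by object
}

class ChunkedWriter {
    // Each record gets its own file (or its own entry in one indexed file),
    // so the serialized form of the whole tree never exists in memory at once.
    void writeAll(Collection<NodeRecord> records, File dir) throws IOException {
        for (NodeRecord record : records) {
            File target = new File(dir, record.id + ".ser");
            try (ObjectOutputStream out = new ObjectOutputStream(
                    new BufferedOutputStream(new FileOutputStream(target)))) {
                out.writeObject(record);
            }
        }
    }
}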
It sounds like whatever runtime you are using has a less-than-ideal implementation of object serialization that you likely don't have any control over.
A similar complaint is mentioned here, although it is quite old.
http://objectmix.com/weblogic/523772-outofmemoryerror-adapter.html
Can you use a newer version of weblogic? Can you reproduce this in a unit test? If so, try running it under a different JVM and see what happens.
I don't know about WebLogic (that is, JRockit I suppose) serialization in particular; honestly, I see no reason for it to be using ByteArrayOutputStreams...
You may want to implement java.io.Externalizable if you need more control over how your object is serialized, or switch to an entirely different serialization system (e.g. Terracotta) if you don't want to write the read/write methods yourself (if you have many big classes).
Why does it occupy all those bytes as an unsync byte array output stream?
That's not how default serialization works. You must have some special code in there to make it do that. Solution: don't.
I have these serialization and deserialization classes which I use to send objects to Kafka:
public class SaleRequestFactorySerializer implements Serializable, Serializer<SaleRequestFactory> {

    @Override
    public byte[] serialize(String topic, SaleRequestFactory data)
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try
        {
            ObjectOutputStream outputStream = new ObjectOutputStream(out);
            outputStream.writeObject(data);
            out.close();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        return out.toByteArray();
    }
}
public class SaleResponseFactoryDeserializer implements Serializable, Deserializer<SaleRequestFactory> {

    @Override
    public SaleRequestFactory deserialize(String topic, byte[] data)
    {
        SaleRequestFactory saleRequestFactory = null;
        try
        {
            ByteArrayInputStream bis = new ByteArrayInputStream(data);
            ObjectInputStream in = new ObjectInputStream(bis);
            saleRequestFactory = (SaleRequestFactory) in.readObject();
            in.close();
        }
        catch (IOException | ClassNotFoundException e)
        {
            e.printStackTrace();
        }
        return saleRequestFactory;
    }
}
It's not clear to me whether I need to add flush() calls in order to prevent a memory leak. Can you guide me on what I'm missing in my code?
You don't need to; the garbage collector doesn't magically get confused by resources. The usual worry is that the resource a stream represents may take a very long time to be released if the GC never kicks in because there's no need for it. There's no 'leaking' whatsoever for these.
ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream, etc. don't actually represent any OS resources. Contrast with e.g. FileInputStream, Files.newInputStream, socket.getInputStream(), etc.
Regardless of whether it represents a resource or not, closing any stream will flush as part of the operation.
You can improve this code considerably by ditching OIS and OOS; a bit of web searching will show that almost everybody, notably including most of the Java language designers over at Oracle itself, doesn't particularly like Java's built-in serialization. The protocol is tightly bound up with Java (so it's not possible to document the scheme without saying: "Well, uh, take this class file, load up a JVM, deserialize it there; otherwise, no way to read this thing"), which is not great. It's also not a particularly efficient algorithm, and it produces rather large blobs of binary data. Look instead into serialization schemes which don't use Java's built-in mechanism, such as Jackson or GSON, which serialize objects into JSON and back out. JSON isn't an efficient format either, but at least it's easy to read with eyeballs, or from other languages, and it is easily specced.
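As a minimal sketch of that JSON route (assuming Jackson is on the classpath and that SaleRequestFactory is a plain bean that Jackson can map):

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serializer;

public class SaleRequestFactoryJsonSerializer implements Serializer<SaleRequestFactory> {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public byte[] serialize(String topic, SaleRequestFactory data) {
        try {
            // JSON bytes: readable by humans and by non-Java consumers of the topic
            return mapper.writeValueAsBytes(data);
        } catch (JsonProcessingException e) {
            throw new RuntimeException("Could not serialize record for topic " + topic, e);
        }
    }
}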
If you want as much performance as you can squeeze out, write a protobuf based serializer.
If you want to stick with this code instead, note that e.printStackTrace() is the worst way to handle an exception. An exception happened, and your choice here is to dump part of the error into System.err (beyond the reach of loggers and completely invisible on most server deployments), the rest into the void, and then keep on going as if nothing went wrong, returning empty byte arrays. That most likely causes either wild goose chases (with your app 'silently doing nothing' and you wondering what's going on) or a cascade of errors, where the weird state you're returning, such as empty arrays, causes other errors, which, if you handle them the same way, cause still other errors, so that a single problem sends a flurry of hundreds of cascading stack traces to syserr, all of them except the first utterly irrelevant. It also means you have to dirty up your code; for example, you tend to run into the compiler demanding that you initialize a variable before using it, even though you always set that variable unless an exception occurs.
The right way to handle an exception is to actually handle it (and logging it is not handling it). If that's not an option, then the right way is to throw an exception; give callers a chance. If you can't throw the exception straight onwards, wrap it. If you can't do that, wrap it in a RuntimeException. Thus, the ¯\_(ツ)_/¯ "I have no clue and I don't want to be bothered" line of code is not e.printStackTrace() but throw new RuntimeException("Unhandled", e); - this has none of these problems: it lists ALL the info about the error, it will not cause execution to continue with weird data returned, and log frameworks have a chance to see this stuff.
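Applied to the serialize method from the question, that advice might look roughly like this (a sketch, not a drop-in: it still uses Java serialization, just with try-with-resources doing the flushing and closing, and with failures rethrown instead of printed and swallowed):

@Override
public byte[] serialize(String topic, SaleRequestFactory data) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // try-with-resources closes (and therefore flushes) the ObjectOutputStream
    // before toByteArray() is called
    try (ObjectOutputStream objectStream = new ObjectOutputStream(out)) {
        objectStream.writeObject(data);
    } catch (IOException e) {
        // rethrow so the caller (and any log framework up the stack) sees the failure
        throw new RuntimeException("Could not serialize record for topic " + topic, e);
    }
    return out.toByteArray();
}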
I am reading about Avro, and I am trying to compare Avro with the Java serialization system. But somehow I am not able to gauge why Avro is used for data serialization instead of Java serialization. As a matter of fact, why did another system come along to replace the Java serialization system?
Here is the summary of my understanding.
To use Java's serialization capabilities, we have to make the class implement the Serializable interface. If you do so and serialize the object, then during deserialization we do something like
e = (Employee) in.readObject();
Next, we can use the getters/setters to play with the employee object.
In Avro,
first comes the schema definition. Next we use the Avro APIs to serialize. Again, on deserialization there is something like this:
public AvroHttpRequest deSerealizeAvroHttpRequestJSON(byte[] data) {
    DatumReader<AvroHttpRequest> reader
            = new SpecificDatumReader<>(AvroHttpRequest.class);
    try {
        Decoder decoder = DecoderFactory.get().jsonDecoder(
                AvroHttpRequest.getClassSchema(), new String(data));
        return reader.read(null, decoder);
    } catch (IOException e) {
        logger.error("Deserialization error: " + e.getMessage());
        return null; // nothing could be read
    }
}
Next, we can use the getters/setters to play with the deserialized object.
My question is that I don't see any difference between these two approaches; both do the same thing, and only the APIs are different. Can anyone please help me understand this better?
The built-in Java serialization has some pretty significant downsides. For instance, without careful consideration, you may not be able to deserialize an object even when nothing about its data has changed, only the class's methods.
You can also create a case in which the serialVersionUID is the same (set manually) but the object still cannot be deserialized, because of a type incompatibility between the two systems.
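For example (a hypothetical illustration; in reality both systems would call the class Employee, and it is split into V1/V2 here only so the snippet compiles as one file):

import java.io.Serializable;

// The version deployed on system A
class EmployeeV1 implements Serializable {
    private static final long serialVersionUID = 1L;
    private int salary;        // written to the stream as an int
}

// The version deployed on system B
class EmployeeV2 implements Serializable {
    private static final long serialVersionUID = 1L; // same UID as on system A...
    private String salary;     // ...but the field's declared type changed, so reading a
                               // stream written by system A fails with InvalidClassException
}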
A third-party serialization library can help mitigate this by using an abstract mapping to pair data together. Well-conceived serialization libraries can even provide mappings between different versions of the object.
Finally, the error handling in third-party serialization libraries is typically more useful for a developer or operator.
I've switched from Memcached to Hazelcast. After a while I noticed that the size of the cache was bigger than usual; I saw this in the Hazelcast Management Center.
So I did the following:
1. Before calling IMap.set(key, value), where the value is an ArrayList, I serialize the value to a file; the file is 128 KB in size.
2. After IMap.set() is called, I IMap.get() the same entry, which suddenly is now 6 MB in size.
The object in question contains many objects which are referenced multiple times in the same structure.
I've opened the two binary files and seen that the 6 MB file has a lot of duplicated data. The serialization used by Hazelcast somehow makes copies of the referenced objects.
All the classes stored in the cache are Serializable, except the enums.
Using Memcached, the value size is 128 KB in both cases.
I've tried Kryo with Hazelcast and there was not really a difference; still over 6 MB.
Has anyone had a similar problem with Hazelcast? If so, how did you solve it without changing the cache provider?
I can provide the object structure and try to reproduce it with non-sensitive data if someone needs it.
I'm not claiming this is the right way, but after a lost day I finally came up with a solution that works around this. I cannot say whether it is a feature or just a problem to report.
Anyway, in Hazelcast, if you put a value into an IMap as an ArrayList, it will be serialized entry by entry. That means that if we have 100 entries referencing the same 6 KB instance A, we will end up with 600 KB in Hazelcast. Here is some short raw code which demonstrates my answer.
To work around or avoid this with Java serialization, you should wrap the ArrayList in an object; this will do the trick (see the sketch after the test code below).
(only with Serializable, no other implementations)
@Test
public void start() throws Exception {
    HazelcastInstance client = produceHazelcastClient();

    Data data = new Data();
    ArrayList<Data> datas = new ArrayList<>();

    IntStream.range(0, 1000).forEach(i -> {
        datas.add(data);
    });

    writeFile(datas, "DataLeoBefore", "1");

    client.getMap("data").put("LEO", datas);

    Object redeserialized = client.getMap("data").get("LEO");

    writeFile(redeserialized, "DataLeoAfter", "1");
}

public void writeFile(Object value, String key, String fileName) {
    try {
        Files.write(Paths.get("./" + fileName + "_" + key), SerializationUtils.serialize((ArrayList) value));
    } catch (IOException e) {
        e.printStackTrace();
    }
}
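The wrapper mentioned above could look something like this (the holder class name is invented): because the whole list now goes through a single writeObject() call, repeated references to the same instance are written once and restored as back-references instead of being duplicated per entry.

import java.io.Serializable;
import java.util.ArrayList;

public class DataListHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    // the whole list is part of one serialized graph, so shared references survive
    private final ArrayList<Data> datas;

    public DataListHolder(ArrayList<Data> datas) {
        this.datas = datas;
    }

    public ArrayList<Data> getDatas() {
        return datas;
    }
}

In the test above you would then put new DataListHolder(datas) into the map instead of the raw ArrayList.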
Hazelcast can be configured to use several different serialization schemes; Java serialization (the default) is the least efficient in terms of both time and space. Typically choosing the right serialization strategy gives a bigger payoff than almost any other optimization you could do.
The reference manual gives a good overview of the different serialization schemes and the tradeoffs involved.
IMDG Reference Manual v3.11 - Serialization
I typically would go with IdentifiedDataSerializable if my application is all Java, or Portable if I needed to support cross-language clients or object versioning.
If you need to use Java serialization for some reason, you might check and verify that the shared object property is set to true to avoid creating multiple copies of the same object. (That property can be set in the serialization section of hazelcast.xml, or programmatically through the SerializationConfig object.)
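A sketch of the programmatic route, assuming a Hazelcast 3.x member and that the shared-object setting is exposed as SerializationConfig.setEnableSharedObject in your version:

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class SharedObjectConfigExample {
    public static void main(String[] args) {
        Config config = new Config();

        // Keep back-references to already-written objects instead of duplicating them
        config.getSerializationConfig().setEnableSharedObject(true);

        HazelcastInstance instance = Hazelcast.newHazelcastInstance(config);
        // ... use the instance ...
    }
}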
I wrote a small test program to show how many bytes we need to serialize an Integer object:
ByteArrayOutputStream data = new ByteArrayOutputStream();
try {
    ObjectOutputStream output = new ObjectOutputStream(data);
    output.writeObject(1);
    output.flush();
    System.out.println(data.toByteArray().length);
} catch (IOException e) {
    e.printStackTrace();
}
However, the result is surprising: it takes 81 bytes. If I serialize the String "1", it takes only 8 bytes instead. I know Java has an optimization for String serialization, but why not do the same thing for Integer? I think it shouldn't be very difficult.
Or does anyone have a workaround? I need a method which can serialize everything, including objects and primitive types. Thanks for your answers!
It's a balancing act between making the serialization protocol more complicated by having direct support for lots of types, and keeping it simple.
In my experience, Integer values are relatively rare compared with int values - and the latter does have built-in support, along with all the other primitive types. It's also worth noting that although serializing a single Integer object is expensive, the incremental cost is much smaller, because there's already a reference to the class in the stream. So after the first Integer has been written, a new Integer only takes 10 bytes - and a reference to an Integer which has already been written to the stream (common if you're boxing small values) is only 5 bytes.
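A quick way to see those incremental numbers for yourself is to keep writing to the same stream and print the size after each write; the deltas, rather than the totals, show the per-value cost (a small sketch, assuming the same ByteArrayOutputStream setup as above):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class IntegerSerializationCost {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        ObjectOutputStream output = new ObjectOutputStream(data);

        output.writeObject(1);   // first Integer: stream header + class descriptor + value
        output.flush();
        System.out.println("after first Integer:    " + data.size());

        output.writeObject(2);   // another Integer: class back-reference + value
        output.flush();
        System.out.println("after second Integer:   " + data.size());

        output.writeObject(1);   // same boxed value again: just a handle to the earlier object
        output.flush();
        System.out.println("after repeated Integer: " + data.size());
    }
}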
Personally I would try to avoid native Java binary serialization anyway - it's platform specific and very brittle, as well as not being terribly compact. I like Protocol Buffers but there are lots of other alternatives available too.
How can I write many serializable objects to a single file and then read back a few of the objects as and when needed?
You'd have to implement the indexing aspect yourself, but otherwise this can be done. When you serialize an object you essentially write it to an OutputStream, which you can point wherever you want. Storing multiple objects in a file this way is straightforward.
The tough part comes when you want to read "a few" objects back. How are you going to know how to seek to the position in the file that contains the specific object you want? If you're always reading objects back in the same order you wrote them, from the start of the file onwards, this will not be a problem. But if you want to have random access to objects in the "middle" of the stream, you're going to have to come up with some way to determine the byte offset of the specific object you're interested in.
(This method would have nothing to do with synchronization or even Java per se; you've got to design a scheme that will fit with your requirements and environment.)
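One possible scheme along those lines, purely as a hypothetical sketch: serialize each object into its own self-contained record, length-prefix the records, and remember each record's byte offset in an index so a single object can be read back later with a seek. (In a real implementation the index itself would also have to be persisted somewhere.)

import java.io.*;
import java.util.*;

public class IndexedObjectFile {
    private final Map<String, Long> index = new HashMap<>(); // key -> byte offset of its record

    public void writeAll(File file, Map<String, Serializable> objects) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            for (Map.Entry<String, Serializable> entry : objects.entrySet()) {
                index.put(entry.getKey(), raf.getFilePointer());
                byte[] record = toBytes(entry.getValue());
                raf.writeInt(record.length);   // length prefix so the record can be read back alone
                raf.write(record);
            }
        }
    }

    public Object readOne(File file, String key) throws IOException, ClassNotFoundException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(index.get(key));          // jump straight to the wanted record
            byte[] record = new byte[raf.readInt()];
            raf.readFully(record);
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(record))) {
                return in.readObject();
            }
        }
    }

    private static byte[] toBytes(Serializable value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }
}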
The writing part is easy. You just have to remember that you have to write all objects 'at once'. You can't create a file with serialized objects, close it and open it again to append more objects. If you try it, you'll get error messages on reading.
For deserializing, I think you have to process the complete file and keep the objects you're interested in. The others will be created but collected by the gc on the next occasion.
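A sketch of that approach: read the stream from the start, keep only the objects that match some test, and let the rest become garbage as the loop moves on (the instanceof filter here is just a placeholder for whatever criterion you actually need):

import java.io.*;
import java.util.*;

public class SelectiveReader {
    public static List<Object> readMatching(File file) throws IOException, ClassNotFoundException {
        List<Object> wanted = new ArrayList<>();
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) {
                Object next;
                try {
                    next = in.readObject();
                } catch (EOFException endOfStream) {
                    break; // no more objects in the file
                }
                if (next instanceof String) { // hypothetical filter; replace with your own test
                    wanted.add(next);
                }
            }
        }
        return wanted;
    }
}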
Make an Object[] for storing your objects. It worked for me.
I'd use a flat-file database (e.g. Berkeley DB Java Edition). Just write your nodes as rows in a table like:
Node
----
id
value
parent_id
To read multiple objects back from a file:
public class ReadObjectFromFile {

    public static Object[] readObject() throws IOException {
        Object[] list = null;
        try {
            byte[] bytes = Files.readAllBytes(Paths.get("src/objectFile.txt"));
            ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes));
            list = (Object[]) ois.readObject();
            ois.close();
        } catch (IOException | ClassNotFoundException e) {
            e.printStackTrace();
        }
        return list;
    }
}