Serialization is the process of converting an object stored in memory into a stream of bytes to be transferred over a network, stored in a DB, etc.
But isn't the object already stored in memory as bits and bytes? Why do we need another process to convert the object stored as bytes into another byte representation? Can't we just transmit the object directly over the network?
I think I may be missing something in the way the objects are stored in memory, or the way the object fields are accessed.
Can someone please help me in clearing up this confusion?
Different systems don't store things in memory in the same way. The obvious example is endianness.
Serialization defines a way by which systems using different in-memory representations can communicate.
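To make the endianness point concrete, here is a minimal sketch using `java.nio.ByteBuffer`. The class name `EndiannessDemo` and its helper methods are made up for illustration. It shows the same `int` laid out in two byte orders, and what happens when the reader assumes the wrong one.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class EndiannessDemo {
    // Lay out a 32-bit int as 4 bytes in the given byte order.
    static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    // Interpret 4 bytes as an int, assuming the given byte order.
    static int decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getInt();
    }

    public static void main(String[] args) {
        int value = 0x01020304;
        byte[] big = encode(value, ByteOrder.BIG_ENDIAN);
        byte[] little = encode(value, ByteOrder.LITTLE_ENDIAN);

        System.out.println(Arrays.toString(big));    // [1, 2, 3, 4]
        System.out.println(Arrays.toString(little)); // [4, 3, 2, 1]

        // Reading big-endian bytes as little-endian yields a different number.
        System.out.println(Integer.toHexString(decode(big, ByteOrder.LITTLE_ENDIAN))); // 4030201
    }
}
```

A serialization format avoids this by fixing the byte order as part of the format, so both sides agree regardless of their native layout.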
Another important fact is that the requirements on in-memory and serialized data may be different: when in-memory, fast read (and maybe write) access is desirable; when serialized, small size is desirable. It is easier to create two different formats to fit these two use cases than it is to create one format which is good for both.
An example which springs to mind is LinkedHashMap: this basically stores two versions of the mapping when in memory (one to capture insertion order; one as a traditional hash map). However, you don't need both of these representations to reconstruct the same map from a serialized form: you only need the insertion order of key/value pairs. As such, the serialized form does not store the same data as the in-memory form.
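As a rough illustration of the LinkedHashMap point, the sketch below round-trips a map through Java's built-in serialization and checks that insertion order survives. The class name `LinkedHashMapRoundTrip` and its helpers are hypothetical. It only demonstrates the observable behaviour, not the exact bytes written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;

public class LinkedHashMapRoundTrip {
    // Serialize any Serializable object into a byte array.
    static byte[] serialize(Object value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild an object from bytes produced by serialize().
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> original = new LinkedHashMap<>();
        original.put("cherry", 3);
        original.put("apple", 1);
        original.put("banana", 2);

        @SuppressWarnings("unchecked")
        Map<String, Integer> copy = (Map<String, Integer>) deserialize(serialize(original));

        // The hash table is rebuilt on the receiving side; insertion order is preserved.
        System.out.println(new ArrayList<>(copy.keySet())); // [cherry, apple, banana]
    }
}
```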
Serialization turns the pre-existing bytes from the memory into a universal form.
This is done because different systems allocate memory in different ways. Thus, we cannot ensure that the object can be saved directly from the memory on one machine and then be loaded back in properly into another, different machine.
Maybe you can find more information on this page of the Oracle docs.
An explanation of object serialization from the book Thinking in Java:
When you create an object, it exists for as long as you need it, but under no circumstances does it exist when the program terminates. While this makes sense at first, there are situations in which it would be incredibly useful if an object could exist and hold its information even while the program wasn’t running. Then, the next time you started the program, the object would be there and it would have the same information it had the previous time the program was running. Of course, you can get a similar effect by writing the information to a file or to a database, but in the spirit of making everything an object, it would be quite convenient to declare an object to be "persistent," and have all the details taken care of for you.
Java’s object serialization allows you to take any object that implements the Serializable interface and turn it into a sequence of bytes that can later be fully restored to regenerate the original object. This is even true across a network, which means that the serialization mechanism automatically compensates for differences in operating systems. That is, you can create an object on a Windows machine, serialize it, and send it across the network to a Unix machine, where it will be correctly reconstructed. You don’t have to worry about the data representations on the different machines, the byte ordering, or any other details.
Hope this helps you.
Let's go with that mindset: we take the object as is and send it as a byte array over the network. Another socket or HTTP handler receives that byte array.
Now, two things come to mind:
How many bytes should be sent?
What are these bytes? What class do they represent?
You will have to provide this data as well, so for this action alone we need two extra steps.
Now, in C# and Java, as opposed to C++, objects are scattered throughout the heap, and each object holds references to the objects it contains. So now we have another requirement:
Recursively "catch" all the inner objects and pack them into the byte array.
Now we have a packed byte array which represents some object hierarchy, and we need to tell the other side how to unpack this byte array back into the object plus the objects it holds. So:
Send information on how to unpack that byte array into an object hierarchy.
Some things an object has cannot be sent over the network, such as functions. So now we have yet another step:
Strip away things that cannot be serialized, like functions.
This process goes on and on; for every new solution you will find many problems. Serialization is the process of taking that byte array you are talking about and making it something that can be handled in other environments, like networks and files.
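The first two steps above (saying what the bytes are and how many there are) can be sketched by hand with `DataOutputStream`. This is only an illustration under assumptions: the `Point` type, the `"Point"` type tag, and the frame layout (type name, payload length, payload) are invented here and are not any standard wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ManualFraming {
    // Hypothetical example type with two int fields.
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Frame layout (an assumption of this sketch): type name, payload length, payload.
    static byte[] write(Point p) {
        try {
            ByteArrayOutputStream payloadBytes = new ByteArrayOutputStream();
            DataOutputStream payload = new DataOutputStream(payloadBytes);
            payload.writeInt(p.x);
            payload.writeInt(p.y);
            payload.flush();

            ByteArrayOutputStream frameBytes = new ByteArrayOutputStream();
            DataOutputStream frame = new DataOutputStream(frameBytes);
            frame.writeUTF("Point");              // what the bytes represent
            frame.writeInt(payloadBytes.size());  // how many bytes follow
            frame.write(payloadBytes.toByteArray());
            frame.flush();
            return frameBytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Point read(byte[] frame) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame));
            String type = in.readUTF();
            if (!"Point".equals(type)) {
                throw new IllegalArgumentException("unknown type: " + type);
            }
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            DataInputStream fields = new DataInputStream(new ByteArrayInputStream(payload));
            int x = fields.readInt();
            int y = fields.readInt();
            return new Point(x, y);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] frame = write(new Point(3, -7));
        Point copy = read(frame);
        System.out.println(frame.length + " bytes -> (" + copy.x + ", " + copy.y + ")"); // 19 bytes -> (3, -7)
    }
}
```

Even this tiny example already needs a type tag and a length prefix, and it does not yet handle nested objects, which is where a general serialization mechanism earns its keep.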
I was just introduced to the concept of serialisation in Java and while I 'get' the fundamentals, I can't help but feel like it's a bit of an overkill? My logic is that if I have pointers to the objects and I know how many bytes they take up in memory, why can't I just theoretically write these bytes to some txt file, along with some extra bytes to indicate the type? With this, can't I just read these bytes back and restore my original object?
The amount of detail my book goes into serialisation is giving me a good indication that I'm not really understanding the importance of this and that there is probably something more subtle than just writing out all the bytes exactly as they are. Any help is greatly appreciated! (I have some background in c++ if that helps)
Why can't I just theoretically write these bytes to some txt file, along with some extra bytes to indicate the type? With this, can't I just read these bytes back and restore my original object?
How could anyone ever read them back in? Say I'm writing code that's supposed to read in your file. Please tell me what the third byte means so that I can decode it properly.
What if the internal representation of the object contains pointers to other objects that might be in different memory locations the next time the program runs? For example, it is quite common to manage identical strings by having internal references to the same internal string object. How will writing that reference to a file be sensible given that the internal string object may not exist in the next run?
To write data to a file, you need to write it out in some specific format that actually contains all the information you need to be able to read back in. What happens to work internally for this program at this time just won't do as there's no guarantee another program at another time can make sense of it.
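The point about internal references can be seen in a small sketch. Here a list holds the same `StringBuilder` twice, and Java's built-in serialization is used for the round trip. The class name `SharedReferenceDemo` and the `roundTrip` helper are hypothetical. The copy keeps the sharing (both slots point at one object), but that object is a new one at a new location, so a raw memory address written to a file would have been meaningless.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

public class SharedReferenceDemo {
    // Serialize to a byte array and immediately deserialize again.
    static Object roundTrip(Object value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder shared = new StringBuilder("hello");
        List<StringBuilder> original = new ArrayList<>();
        original.add(shared);
        original.add(shared); // second reference to the same object

        @SuppressWarnings("unchecked")
        List<StringBuilder> copy = (List<StringBuilder>) roundTrip(original);

        System.out.println(copy.get(0) == copy.get(1));     // true: sharing is preserved
        System.out.println(copy.get(0) == original.get(0)); // false: it is a new object
        System.out.println(copy.get(0));                    // hello
    }
}
```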
What you suggest works provided:
the order and type of fields don't change. Note this is not set at compile time.
the byte order doesn't change.
you don't have any references, e.g. no String, enum, List or Map.
the name and package of the type don't change.
We at Chronicle use a form of serialization which supports this, as it's much faster, but it's very limiting. You have to be very aware of those limitations and have a problem which is suitable. We also have a form of serialization which has none of these constraints, but it is slower.
The purpose of Java Serialization is to support arbitrary object graphs even if data is exchanged between systems which might arrange the data differently.
Isn't an object already stored as a bunch of bytes? Is serialization just a protocol that forces some order to how those bytes are organized when transferred over a network?
Technically, yes, everything in a computer is represented as data somewhere. So any in-memory object is "a bunch of bytes".
However, when being used in a live application the state of that object is subject to change. It's in flux. And that state can be stored/changed/known/etc. across multiple mediums.
Serialization is the process of capturing the state of an object into some static form which can be persisted to a more static medium. Specifically this information needs to include everything required to re-create the object at a later time.
It doesn't really matter what the form or medium is. The data can be raw binary, JSON, XML, text, any custom format, etc. The storage medium can be a file system, a database, a network connection, active memory, etc. And it could be stored for milliseconds or for centuries.
As an analogy, consider a human being. There is a lot of information which makes up everything that is "a person". How would you "serialize" a person?
You could save their DNA sequence to a computer (a simple array of characters would do the trick). But does that store the state of the person? You could re-create a person from that data, but could you re-create that same person in the same state? No, all of their memories would be lost.
So in attempting to serialize the person, we've discovered that the information which represents the state of a person includes more than the original information which was used to create the person. That state information is stored in a separate medium during the lifespan of the person and isn't as easily available. But it would be necessary in order to serialize the person.
Continuing the analogy... consider transporters from Star Trek. The "object" is a person, and that person is successfully converted into a stream of data which is then re-constructed on the other end of the transfer. The two transporter systems are separate, simply exchanging information. This information is enough to re-create the original object in the exact state at which it was serialized.
"Serialization" means transforming an object into another state for the purposes of transfer or persistence.
Isn't an object already stored as a bunch of bytes? Yes, but that's not the point. The point is persistence and transfer. Take the case of an image: it is one thing in memory, another when saved as a JPEG, and another when saved as a GIF or a TIFF or a BMP.
Is serialization just a protocol that forces some order to how those bytes are organized when transferred over a network? See the answer to your first question on the meaning of serialization.
Let's say I have an object I'd like to store in a direct byte buffer.
I'd like to be able to access parts of the object from the direct byte buffer without de-serializing the whole object. Is there a safe way to do this?
I'm thinking you could somehow capture the byte array offsets when serializing the object, then once it's been written to the direct byte buffer you would adjust these offsets according to the offset of the direct byte buffer. I'm not sure if it's possible to do this...
The real question is not, is this possible, since it almost certainly is, but why do you want to do this in the first place?
If you just want to access a few fields from the object, the easiest way to do that will be to deserialize it and then copy those few fields out.
The only reason you might want to avoid the (de-)serialization is for speed, but if this is in one of your busy loops, then you are lost anyway. If the network (de-)serialization is the issue, then you should design your protocol better.
I think the best way of doing this is like so. It's a bit of a workaround, but it should be effective.
import java.util.Map;

// Maps each member name to the offset at which that member was written.
interface OffsetMemberMap {
    Map<String, Long> offsetMemberMap();
}
The idea is to create objects that implement the above interface. This map will store memory addresses against Strings for each member. Child objects would be created first and once added to a DirectByteBuffer the offset position would be stored in the parent within this map.
In order to access a specific member the user would need to supply the string which addresses that member and thus only what's needed would be de-serialized. This would allow you to store large linked objects in DirectByteBuffers whilst being able to only serialize/de-serialize the bits you need when writing/reading.
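One way the offset-map idea might look in practice is sketched below. It is a simplified, hypothetical variant rather than the answer's exact design: the class `OffsetIndexedRecord` keeps the map itself instead of implementing `OffsetMemberMap`, uses `Integer` offsets because `ByteBuffer` indices are ints, and stores strings as a length followed by UTF-8 bytes. Each member is read through an absolute offset, so nothing else in the buffer has to be decoded.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class OffsetIndexedRecord {
    // Off-heap storage; 256 bytes is an arbitrary capacity for this sketch.
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(256);
    // Member name -> offset in the buffer where that member starts.
    private final Map<String, Integer> offsets = new HashMap<>();

    public void putInt(String member, int value) {
        offsets.put(member, buffer.position());
        buffer.putInt(value);
    }

    public void putString(String member, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        offsets.put(member, buffer.position());
        buffer.putInt(utf8.length); // length prefix
        buffer.put(utf8);
    }

    public int getInt(String member) {
        // Absolute read: does not touch the buffer's position.
        return buffer.getInt(offsets.get(member));
    }

    public String getString(String member) {
        int offset = offsets.get(member);
        byte[] utf8 = new byte[buffer.getInt(offset)];
        // Use a duplicate so the shared buffer's write position is left alone.
        ByteBuffer view = buffer.duplicate();
        view.position(offset + Integer.BYTES);
        view.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        OffsetIndexedRecord record = new OffsetIndexedRecord();
        record.putInt("id", 42);
        record.putString("name", "Ada");
        record.putInt("age", 36);

        System.out.println(record.getInt("age"));     // 36
        System.out.println(record.getString("name")); // Ada
    }
}
```

The trade-off is that the offset map lives on the heap and must be kept in sync with the buffer, so this only pays off when the stored objects are large compared with the index.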
I want to send some objects through sockets from client to server. I can serialize them as objects or convert them to XML. Which of these methods takes less memory?
Serializing them will take A LOT less space. You can also try kryo to get an even better size for your serialized objects. It supports Deflate compression/decompression. Take note however that it's non-standard, so the other side of the socket must use the library as well to de-serialize.
Naturally serialization takes a lot less memory than converting to XML... think of all those <...> and </...> tags! Serialization takes care of all that with numbers, not ASCII characters.
Also, you can serialize to XML! http://x-stream.github.io/
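How much smaller a binary form is depends heavily on the format, so it is worth measuring rather than assuming. The sketch below compares three encodings of one small object: a hand-written compact binary layout, a hand-built XML string, and Java's built-in serialization, which also writes class metadata into the stream. The `User` class, the XML shape, and the method names are assumptions made for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class SizeComparison {
    // Hypothetical payload type.
    static class User implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        final String name;
        User(int id, String name) { this.id = id; this.name = name; }
    }

    // Compact binary: 4-byte int, then a length-prefixed string.
    static int compactBinarySize(User u) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(u.id);
            out.writeUTF(u.name);
            out.flush();
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hand-built XML, with no escaping; good enough for a size estimate.
    static int xmlSize(User u) {
        String xml = "<user><id>" + u.id + "</id><name>" + u.name + "</name></user>";
        return xml.getBytes(StandardCharsets.UTF_8).length;
    }

    // Java's built-in serialization, including its stream header and class descriptor.
    static int javaSerializedSize(User u) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(u);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        User user = new User(42, "Ada");
        System.out.println("compact binary: " + compactBinarySize(user)); // compact binary: 9
        System.out.println("xml:            " + xmlSize(user));           // xml:            40
        System.out.println("java built-in:  " + javaSerializedSize(user));
    }
}
```

For a single tiny object the built-in format's fixed overhead can dominate, so the comparison is most meaningful on data that resembles what you will actually send.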
Converting to XML takes up more space on the client and server than just sending them serialized, since you are basically copying the content into a new variable. Sending them serialized may not use the full capacity of a packet, but you can always just process the first packet and overwrite it with the next to save some space (At least that's how I'm currently doing it).
However, serializing it will probably make the transfer slower, since you have to send multiple packets. On the other hand, if you put everything into one XML, you might run into size restrictions on the packets.
(I'm talking about DatagramSocket and DatagramPacket here, since these are the ones I use. I don't know how the situation is with other transfer methods.)
With XML vs. Java serialization, one may use more bandwidth than the other, but the main memory used will be your objects. If you are worried about memory use, I would make your object structure more efficient (assuming it is a real issue).
You can stream XML and Java Objects as you serialize/deserialize which is why they shouldn't use much memory.
Obviously, if you build your serialized data before sending it, this will be inefficient.
I'm trying to design a lightweight way to store persistent data in Java. I've already got a very efficient way to serialize POJOs to DataOutputStreams (and back), but I'm trying to think of a good way to ensure that changes to the data in the POJOs get serialized when necessary.
This is for a client-side app where I'm trying to keep the size of the eventual distributable as low as possible, so I'm reluctant to use anything that would pull-in heavy-weight dependencies. Right now my distributable is almost 10MB, and I don't want it to get much bigger.
I've considered DB4O but it's too heavy - I need something light. Really, it's probably more a design pattern I need, rather than a library.
Any ideas?
The 'lightest weight' persistence option will almost surely be simply marking some classes Serializable and reading/writing from some fixed location. Are you trying to accomplish something more complex than this? If so, it's time to bundle hsqldb and use an ORM.
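A minimal sketch of the "mark it Serializable and read/write a fixed location" approach follows. The `Settings` class, its fields, and the file name are hypothetical; a real application would choose its own location and decide how to handle a missing or corrupt file.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SettingsStore {
    // Hypothetical persistent state.
    static class Settings implements Serializable {
        private static final long serialVersionUID = 1L;
        String theme = "light";
        int fontSize = 12;
    }

    // Overwrite the file with the current state.
    static void save(Settings settings, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(settings);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the state back; fails loudly if the file is missing or unreadable.
    static Settings load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (Settings) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The location is an assumption of this sketch.
        Path file = Paths.get(System.getProperty("java.io.tmpdir"), "settings.ser");

        Settings settings = new Settings();
        settings.theme = "dark";
        save(settings, file);

        Settings restored = load(file);
        System.out.println(restored.theme + " " + restored.fontSize); // dark 12
    }
}
```

This needs no extra dependencies, which fits the size constraint in the question, but it rewrites the whole object on every save.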
If your users are tech savvy, or you're just worried about initial payload, there are libraries which can pull dependencies at runtime, such as Grape.
If you already have a compact data output format in bytes (which I assume you have if you can persist efficiently to a DataOutputStream) then an efficient and general technique is to use run-length-encoding on the difference between the previous byte array output and the new byte array output.
Points to note:
If the object has not changed, the difference in byte arrays will be an array of zeros and hence will compress very small....
The first time you serialize the object, consider the previous output to be all zeros so that you communicate a complete set of data.
You probably want to be a bit clever when the object has variable-sized substructures....
You can also try zipping the difference rather than RLE - might be more efficient in some cases where you have a large object graph with a lot of changes
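The diff-then-compress idea can be sketched as follows. Several details are assumptions of this example rather than anything the answer specifies: the "difference" is taken as a byte-wise XOR, a missing previous byte counts as zero, and the run-length format is simple (count, value) pairs with runs capped at 255. The class name `DeltaRle` is made up.

```java
import java.io.ByteArrayOutputStream;

public class DeltaRle {
    // XOR each current byte with the previous output; unchanged bytes become 0.
    static byte[] xorDiff(byte[] previous, byte[] current) {
        byte[] diff = new byte[current.length];
        for (int i = 0; i < current.length; i++) {
            byte old = i < previous.length ? previous[i] : 0;
            diff[i] = (byte) (old ^ current[i]);
        }
        return diff;
    }

    // Reverse of xorDiff: rebuild the current bytes from the previous ones and the diff.
    static byte[] applyDiff(byte[] previous, byte[] diff) {
        return xorDiff(previous, diff);
    }

    // Encode as (run length, value) pairs, each run at most 255 bytes long.
    static byte[] rleEncode(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < data.length) {
            byte value = data[i];
            int run = 1;
            while (i + run < data.length && data[i + run] == value && run < 255) {
                run++;
            }
            out.write(run);
            out.write(value);
            i += run;
        }
        return out.toByteArray();
    }

    static byte[] rleDecode(byte[] encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i + 1 < encoded.length; i += 2) {
            int run = encoded[i] & 0xFF;
            for (int j = 0; j < run; j++) {
                out.write(encoded[i + 1]);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] previous = new byte[100];
        byte[] current = new byte[100];
        current[10] = 7; // a single changed byte

        System.out.println(rleEncode(xorDiff(previous, previous)).length); // 2
        System.out.println(rleEncode(xorDiff(previous, current)).length);  // 6
    }
}
```

The receiver would call `rleDecode` and then `applyDiff` against its copy of the previous bytes. As the answer notes, variable-sized substructures shift later bytes and make a positional diff much less effective.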