Binary serialization protocol - java

I have a requirement to transfer information over the wire (binary over TCP) between two applications, one in Java and the other in C++. I need a protocol implementation to transfer objects between them. The object classes are present in both applications (and are mapped accordingly). I just need an encoding scheme that retains the object representation on one side and can be decoded on the other side as a complete object.
For example:
C++ class
class Person
{
int age;
string name;
};
Java class
class Person
{
int age;
String name;
}
C++ encoding
Person p;
p.age = 20;
p.name = "somename";
char[] arr = SomeProtocolEncoder.encode(p);
socket.send(arr);
Java decoding
byte[] arr = socket.read();
SomeProtocolIntermediateObject object = SomeProtocolDecoder.decode(arr);
Person p = (Person)ReflectionUtil.get(object);
The protocol should provide some intermediate object that maintains the object's representational state, so that I can get the object back later using reflection.

Sounds like you want Protobufs: http://code.google.com/apis/protocolbuffers/docs/tutorials.html

Check out Google's protocol buffers.

Thrift is what you're looking for. You just create a definition of the structs and methods you need to call and it does all of the heavy lifting. It's got binary protocols (optionally with zlib compression or ssl). It'll probably do your taxes but you didn't hear that from me.

You might want to check out these projects and choose one:
Protocol Buffers
Thrift
Apache Avro
Here is a Thrift-vs-PB comparison I read recently. You should also refer to this Wiki for performance comparisons between these libraries.

You can check the amef protocol, an example of C++ encoding in amef would be like,
//Create a new AMEF object
AMEFObject *object = new AMEFObject();
//Add a child string object
object->addPacket("This is the Automated Message Exchange Format Object property!!","adasd");
//Add a child integer object
object->addPacket(21213);
//Add a child boolean object
object->addPacket(true);
AMEFObject *object2 = new AMEFObject();
string j = "This is the property of a nested Automated Message Exchange Format Object";
object2->addPacket(j);
object2->addPacket(134123);
object2->addPacket(false);
//Add a child character object
object2->addPacket('d');
//Add a child AMEF Object
object->addPacket(object2);
//Encode the AMEF object
string str = new AMEFEncoder()->encode(object,false);
Decoding in Java would be like:
byte[] arr = /* AMEF-encoded byte array value */;
AMEFDecoder decoder = new AMEFDecoder();
AMEFObject object1 = decoder.decode(arr, true);
The protocol implementation has codecs for both C++ and Java. The interesting part is that it can retain the object class representation in the form of name-value pairs.
I needed a similar protocol in my last project, and when I stumbled upon this one I modified the base library to fit my requirements. Hope this helps.

What about plain old ASN.1?
It would have the advantage of being really backed by a standard (and widely used). The problem is finding a compiler/runtime for each language.

This project is the ultimate comparison of Java serialization protocols:
https://github.com/eishay/jvm-serializers/wiki
Some libraries also provide C++ serialization.
I've personally ported Python Construct to Java. If there's some interest I'll be happy to start a conversion project to C++ and/or JavaScript!
http://construct.wikispaces.com/
https://github.com/ZiglioNZ/construct

Related

Protoc variable consistent

Just to be clear: I am quite an amateur in C++ coding.
Presently, I am using Protobuf to serialize and exchange data between a C++ model and a Java model. The two models use different variable names for the same scientific term (for daily river drainage, the C++ model uses dailyRiverDrianage and the Java model uses dailyRdrainage), so I defined a new variable in the .proto file for each shared value.
My question is: what is the best way to link the two (the proto variable and the model variable)? I can't change the variable names in Java or C++.
Basically, you need an intermediate layer on one side to provide a consistent mapping to the proto files. Do it on the Java side, since you are more comfortable in that language. That intermediate layer would map the fields of the generated protobuf classes to the Java variables that have different names.
Edit:
C++ side
message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
}
Java Side
message Individual {
required string fullName = 1;
required int32 personal_id = 2;
optional string personal_email = 3;
}
Send the data from the C++ side to the Java side. Generate the same Person message on the Java side, deserialize the message, then get the data out and copy (map) it:
name -> fullName
id -> personal_id
email -> personal_email
This is then your decoder/converter unit that you can tinker around as the interfaces change.
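The mapping above could be sketched as a small converter class. This is only an illustration under stated assumptions: PersonWire is a hand-written stand-in for the class protoc would generate from the Person message, Individual stands in for the existing Java model, and the Java field names personalId and personalEmail are hypothetical.

```java
// Hypothetical stand-in for the protobuf-generated Person message.
final class PersonWire {
    final String name;
    final int id;
    final String email; // optional on the wire; null when absent

    PersonWire(String name, int id, String email) {
        this.name = name;
        this.id = id;
        this.email = email;
    }
}

// Hypothetical stand-in for the existing Java model whose names cannot change.
final class Individual {
    String fullName;
    int personalId;
    String personalEmail;
}

// The intermediate layer: the only place that knows both naming schemes.
final class PersonMapper {
    private PersonMapper() {}

    // name -> fullName, id -> personalId, email -> personalEmail
    static Individual toModel(PersonWire wire) {
        Individual model = new Individual();
        model.fullName = wire.name;
        model.personalId = wire.id;
        model.personalEmail = wire.email;
        return model;
    }

    // The reverse direction, for sending data back to the C++ side.
    static PersonWire toWire(Individual model) {
        return new PersonWire(model.fullName, model.personalId, model.personalEmail);
    }
}
```

Keeping both directions in one class means a renamed field only has to be touched in one place.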

Sending objects through sockets

The only socket programming I have done in the past is simple text streams. I am wondering what is the most effective way to send something like a Java object through a socket.
For instance if I have the following Employee class (Dependent would be a simple class composed of a dependent's information):
public class Employee {
private String name;
private double salary;
private ArrayList<Dependent> dependents;
}
Should I just make the Employee object Serializable and send instances through the socket? Or should I write up an XML file containing the Employee's information and send that? Or is there some completely different and better way? Any guidance would be greatly appreciated. Thank you!
If you are only sending data between JVMs, then either choice is possible.
A textual representation (XML, JSON, or custom) has several advantages:
it's easier to make it interoperable between Java and other languages
it's less brittle in the face of version changes or slightly different versions of your code at each end of the socket
it's vastly easier to test and debug
Depending on the format, it may be a little slower, but this is often not significant.
If you are not necessarily tied to using XML, you could also try JSON. The google-gson library makes this trivial. Serialising the object is as simple as:-
Employee employee = new Employee();
...
Gson gson = new Gson();
String json = gson.toJson(employee);
And to deserialise the String at the other end:-
String socketDataAsString = null;
...<read from socket>...
Gson gson = new Gson();
Employee employee = gson.fromJson(socketDataAsString, Employee.class);
If you must use low-level sockets directly, there are a couple of ways you could do it. You could convert the object to a text format, send the bytes, and then reconstruct it on the other side. If your objects are serializable, you can send them over the socket (http://www.rgagnon.com/javadetails/java-0043.html).
If you have some flexibility, you could use RMI to interact remotely as well.
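For the Serializable route, a minimal sketch might look like the following. It assumes Employee and Dependent are changed to implement Serializable, and it uses in-memory byte arrays in place of the socket streams so that it runs on its own. The helper class EmployeeWire and the simplified fields are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class Dependent implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    Dependent(String name) { this.name = name; }
}

final class Employee implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    final double salary;
    final List<Dependent> dependents = new ArrayList<>();
    Employee(String name, double salary) { this.name = name; this.salary = salary; }
}

final class EmployeeWire {
    // Serializes the whole object graph, including the dependents list.
    static byte[] write(Employee employee) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(employee);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    // Rebuilds the Employee; the receiving side needs the same classes on its classpath.
    static Employee read(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Employee) in.readObject();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        } catch (ClassNotFoundException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

With a real connection, the ObjectOutputStream would wrap socket.getOutputStream() and the ObjectInputStream would wrap socket.getInputStream() instead of the byte-array streams.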

Is it possible to use struct-like constructs in Java?

I'm considering using Java for a large project but I haven't been able to find anything that remotely represented structures in Java. I need to be able to convert network packets to structures/classes that can be used in the application.
I know that it is possible to use RandomAccessFile but this way is NOT acceptable. So I'm curious if it is possible to "cast" a set of bytes to a structure like I could do in C. If this is not possible then I cannot use Java.
So the question I'm asking is if it is possible to cast aligned data to a class without any extra effort beyond specifying the alignment and data types?
No. You cannot cast an array of bytes to a class object.
That being said, you can use a java.nio.ByteBuffer and easily extract the fields you need into an object like this:
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class Packet {
    private final int type;
    private final float data1;
    private final short data2;

    public Packet(byte[] bytes) {
        ByteBuffer bb = ByteBuffer.wrap(bytes);
        bb.order(ByteOrder.BIG_ENDIAN); // or LITTLE_ENDIAN
        // Fields are read in wire order; each get advances the buffer position.
        type = bb.getInt();
        data1 = bb.getFloat();
        data2 = bb.getShort();
    }
}
You're basically asking whether you can use a C-specific solution to a problem in another language. The answer is, predictably, 'no'.
However, it is perfectly possible to construct a class that takes a set of bytes in its constructor and constructs an appropriate instance.
class Foo {
    int someField;
    String anotherField;

    public Foo(byte[] bytes) {
        someField = someFieldFromBytes(bytes);
        anotherField = anotherFieldFromBytes(bytes);
        // etc.
    }
}
You can ensure there is a one-to-one mapping of class instances to byte arrays. Add a toBytes() method to serialize an instance into bytes.
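A sketch of such a class with both directions implemented is shown below. The wire layout (a 4-byte big-endian int, then a 4-byte length, then UTF-8 bytes) is an assumption chosen for illustration, and the class name WireFoo is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class WireFoo {
    final int someField;
    final String anotherField;

    WireFoo(int someField, String anotherField) {
        this.someField = someField;
        this.anotherField = anotherField;
    }

    // Assumed layout: int32 someField, int32 text length, UTF-8 text bytes.
    WireFoo(byte[] bytes) {
        ByteBuffer bb = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN);
        someField = bb.getInt();
        byte[] text = new byte[bb.getInt()];
        bb.get(text);
        anotherField = new String(text, StandardCharsets.UTF_8);
    }

    // Produces exactly the layout the byte[] constructor expects.
    byte[] toBytes() {
        byte[] text = anotherField.getBytes(StandardCharsets.UTF_8);
        ByteBuffer bb = ByteBuffer.allocate(4 + 4 + text.length).order(ByteOrder.BIG_ENDIAN);
        bb.putInt(someField).putInt(text.length).put(text);
        return bb.array();
    }
}
```

Because the constructor and toBytes() share one layout, new WireFoo(x.toBytes()) reproduces the fields of x.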
No, you cannot do that. Java simply doesn't have the same concepts as C.
You can create a class that behaves much like a struct:
public class Structure {
public int field1;
public String field2;
}
and you can have a constructor that takes an array of bytes or a DataInput to read the bytes:
public class Structure {
    ...
    public Structure(byte[] data) throws IOException {
        this(new DataInputStream(new ByteArrayInputStream(data)));
    }
    public Structure(DataInput in) throws IOException {
        // DataInput reads multi-byte values in big-endian order
        field1 = in.readInt();
        field2 = in.readUTF();
    }
}
then read bytes off the wire and pump them into Structures:
byte[] bytes = network.read();
DataInputStream stream = new DataInputStream(new ByteArrayInputStream(bytes));
Structure structure1 = new Structure(stream);
Structure structure2 = new Structure(stream);
...
It's not as concise as C but it's pretty close. Note that the DataInput interface cleanly removes any mucking around with endianness on your behalf, so that's definitely a benefit over C.
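To make the byte-order point concrete, here is a small sketch of the writing side using DataOutputStream, which always emits big-endian values regardless of the host CPU. The class name DataStreamDemo and the "field1:field2" return format of decode are assumptions for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class DataStreamDemo {
    // Writes the two fields in the same order the Structure constructor reads them.
    static byte[] encode(int field1, String field2) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(field1); // 4 bytes, high byte first
            out.writeUTF(field2); // 2-byte length, then modified UTF-8 bytes
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    // Reads the fields back and joins them for easy inspection.
    static String decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int field1 = in.readInt();
            String field2 = in.readUTF();
            return field1 + ":" + field2;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Note that a C or C++ peer on a little-endian machine still has to convert these values itself, for example with ntohl.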
As Joshua says, serialization is the typical way to do these kinds of things. However, there are other binary protocols such as MessagePack, Protocol Buffers, and Avro.
If you want to play with the bytecode structures, look at ASM and CGLIB; these are very common in Java applications.
There is nothing which matches your description.
The closest thing to a struct in Java is a simple class that holds values accessible either through its fields or through set/get methods.
The typical means to convert between Java class instances and on-the-wire representations is Java serialization which can be heavily customized as need be. It is what is used by Java's Remote Method Invocation API and works extremely well.
ByteBuffer.wrap(new byte[] {}).getDouble();
No, this is not possible. You're trying to use Java like C, which is bound to cause complications. Either learn to do things the Java way, or go back to C.
In this case, the Java way would probably involve DataInputStream and/or DataOutputStream.
You cannot cast an array of bytes to an instance of a class.
But you can do much more with Java.
Java has a built-in, powerful and flexible serialization mechanism. This is what you need. You can read and write objects to and from a stream.
If both sides are written in Java, there is no problem at all. If one side is not Java, you can customize your serialization. Start by reading the Javadoc of java.io.Serializable.

Cannot deserialize protobuf data from C++ in Java

My problem is to serialize protobuf data in C++ and deserialize it properly in Java.
Here is the code I use, following the hints given by dcn:
With this I create the protobuf data in C++ and write it to an ostream, which is sent via a socket.
Name name;
name.set_name("platzhirsch");
boost::asio::streambuf b;
std::ostream os(&b);
ZeroCopyOutputStream *raw_output = new OstreamOutputStream(&os);
CodedOutputStream *coded_output = new CodedOutputStream(raw_output);
coded_output->WriteLittleEndian32(name.ByteSize());
name.SerializeToCodedStream(coded_output);
socket.send(b);
This is the Java side where I try to parse it:
NameProtos.Name name = NameProtos.Name.parseDelimitedFrom(socket.getInputStream());
System.out.println(name.newBuilder().build().toString());
However, with this I get the following exception:
com.google.protobuf.UninitializedMessageException: Message missing required fields: name
What am I missing?
The flawed line of code is: name.newBuilder().build().toString()
This could never have worked: newBuilder() creates a new builder whose name field is uninitialized, so build() throws. Anyway, the answer here solved the rest of my problem.
One last thing, which I was told on the protobuf mailing list: in order to flush the CodedOutputStreams, the objects have to be deleted!
delete coded_output;
delete raw_output;
I don't know what received is in your Java code, but your problem may be due to some charset conversion. Note also that protobuf does not delimit the messages when serializing.
Therefore you should use raw data to transmit the messages (byte array or directly (de)serialize from/to streams).
If you intend to send many messages, you should also send the size before each message.
In Java you can do this directly via parseDelimitedFrom(InputStream) and writeDelimitedTo(OutputStream). You can do the same in C++, with a little more work, using CodedOutputStream:
codedOutput.WriteVarint32(protoMessage.ByteSize());
protoMessage.SerializeToCodedStream(&codedOutput);
See also this earlier thread.
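To illustrate what the size prefix looks like on the wire, here is a plain-Java sketch of varint length-prefixed framing, the same shape that writeDelimitedTo and parseDelimitedFrom use (a base-128 varint size followed by the message bytes). It does not use the protobuf library; the class VarintFraming and its method names are hypothetical, and the payload is treated as opaque bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class VarintFraming {
    // Writes the payload length as a base-128 varint, then the payload itself.
    static void writeFrame(OutputStream out, byte[] payload) throws IOException {
        int len = payload.length;
        while ((len & ~0x7F) != 0) {
            out.write((len & 0x7F) | 0x80); // low 7 bits, continuation bit set
            len >>>= 7;
        }
        out.write(len);
        out.write(payload);
    }

    // Reads one varint length, then exactly that many payload bytes.
    static byte[] readFrame(InputStream in) throws IOException {
        int len = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) throw new EOFException("stream ended inside the length prefix");
            len |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 28) throw new IOException("length prefix longer than 5 bytes");
        }
        if (len < 0) throw new IOException("negative length");
        byte[] payload = new byte[len];
        int off = 0;
        while (off < len) { // a single read may return fewer bytes than requested
            int n = in.read(payload, off, len - off);
            if (n < 0) throw new EOFException("stream ended inside the payload");
            off += n;
        }
        return payload;
    }

    // In-memory convenience wrappers so the sketch can be tried without a socket.
    static byte[] encode(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            writeFrame(out, payload);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    static byte[] decode(byte[] framed) {
        try {
            return readFrame(new ByteArrayInputStream(framed));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

A 300-byte payload, for instance, gets the two-byte prefix 0xAC 0x02. A fixed 4-byte prefix written with WriteLittleEndian32 is a different format and would not be understood by a reader that expects a varint.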
You're writing two things to the stream, a size and the Name object, but only trying to read one.
As a general question: why do you feel the need to use CodedInputStream? To quote the docs:
Typically these classes will only be used internally by the protocol buffer library in order to encode and decode protocol buffers. Clients of the library only need to know about this class if they wish to write custom message parsing or serialization procedures
And to emphasize jtahlborn's comment: why little-endian? Java deals with big-endian values, so will have to convert on reading.
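If the sender does keep a fixed 4-byte little-endian size prefix, the Java side has to say so explicitly when reading it. A minimal sketch, with a hypothetical class name and the four prefix bytes assumed to be already read from the socket:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class LittleEndianPrefix {
    // Interprets four bytes as a little-endian 32-bit size using ByteBuffer.
    static int readSize(byte[] fourBytes) {
        return ByteBuffer.wrap(fourBytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    // The same conversion by hand: the least significant byte comes first.
    static int readSizeManually(byte[] b) {
        return (b[0] & 0xFF)
             | (b[1] & 0xFF) << 8
             | (b[2] & 0xFF) << 16
             | (b[3] & 0xFF) << 24;
    }
}
```

Reading the same bytes with DataInputStream.readInt() would interpret them as big-endian and yield a very different size.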

Builders in Java versus C++?

In Google's Protocol Buffer API for Java, they use these nice Builders that create an object (see here):
Person john =
    Person.newBuilder()
        .setId(1234)
        .setName("John Doe")
        .setEmail("jdoe@example.com")
        .addPhone(
            Person.PhoneNumber.newBuilder()
                .setNumber("555-4321")
                .setType(Person.PhoneType.HOME))
        .build();
But the corresponding C++ API does not use such Builders (see here)
The C++ and the Java API are supposed to be doing the same thing, so I'm wondering why they didn't use builders in C++ as well. Are there language reasons behind that, i.e. it's not idiomatic or it's frowned upon in C++? Or probably just the personal preference of the person who wrote the C++ version of Protocol Buffers?
The proper way to implement something like that in C++ would use setters that return a reference to *this.
class Person {
    std::string name;
public:
    // Each setter returns *this so that calls can be chained.
    Person &setName(std::string const &s) { name = s; return *this; }
    Person &addPhone(PhoneNumber const &n);
};
The class could be used like this, assuming similarly defined PhoneNumber:
Person p = Person()
.setName("foo")
.addPhone(PhoneNumber()
.setNumber("123-4567"));
If a separate builder class is wanted, then that can be done too. Such builders should be allocated on the stack, of course.
I would go with the "not idiomatic", although I have seen examples of such fluent-interface styles in C++ code.
It may be because there are a number of ways to tackle the same underlying problem. Usually, the problem being solved here is that of named arguments (or rather the lack of them). An arguably more C++-like solution to this problem might be Boost's Parameter library.
The difference is partially idiomatic, but is also the result of the C++ library being more heavily optimized.
One thing you failed to note in your question is that the Java classes emitted by protoc are immutable and thus must have constructors with (potentially) very long argument lists and no setter methods. The immutable pattern is used commonly in Java to avoid complexity related to multi-threading (at the expense of performance) and the builder pattern is used to avoid the pain of squinting at large constructor invocations and needing to have all the values available at the same point in the code.
The C++ classes emitted by protoc are not immutable and are designed so that the objects can be reused over multiple message receptions (see the "Optimization Tips" section on the C++ Basics Page); they are thus harder and more dangerous to use, but more efficient.
It is certainly the case that the two implementations could have been written in the same style, but the developers seemed to feel that ease of use was more important for Java and performance was more important for C++, perhaps mirroring the usage patterns for these languages at Google.
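The immutable-message-plus-builder combination described above can be sketched by hand as follows. This is not the code protoc generates; Contact and its fields are hypothetical, and the real generated classes are considerably more involved.

```java
// Immutable value class: all fields final, no setters, private constructor.
final class Contact {
    private final int id;
    private final String name;
    private final String email;

    private Contact(Builder b) {
        this.id = b.id;
        this.name = b.name;
        this.email = b.email;
    }

    int getId() { return id; }
    String getName() { return name; }
    String getEmail() { return email; }

    static Builder newBuilder() { return new Builder(); }

    // Mutable companion: collects values one at a time, possibly in different places.
    static final class Builder {
        private int id;
        private String name;
        private String email = "";

        Builder setId(int id) { this.id = id; return this; }
        Builder setName(String name) { this.name = name; return this; }
        Builder setEmail(String email) { this.email = email; return this; }

        // Validation happens once, at the point where the immutable object is created.
        Contact build() {
            if (name == null) {
                throw new IllegalStateException("name is required");
            }
            return new Contact(this);
        }
    }
}
```

Usage mirrors the generated API: Contact.newBuilder().setId(1234).setName("John Doe").build(). Once built, a Contact can be shared between threads without further synchronization because nothing can change it.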
Your claim that "the C++ and the Java API are supposed to be doing the same thing" is unfounded. They're not documented to do the same things. Each output language can create a different interpretation of the structure described in the .proto file. The advantage of that is that what you get in each language is idiomatic for that language. It minimizes the feeling that you're, say, "writing Java in C++." That would definitely be how I'd feel if there were a separate builder class for each message class.
For an integer field foo, the C++ output from protoc will include a method void set_foo(int32 value) in the class for the given message.
The Java output will instead generate two classes. One directly represents the message, but only has getters for the field. The other class is the builder class and only has setters for the field.
The Python output is different still. The class generated will include a field that you can manipulate directly. I expect the plug-ins for C, Haskell, and Ruby are also quite different. As long as they can all represent a structure that can be translated to equivalent bits on the wire, they've done their jobs. Remember these are "protocol buffers," not "API buffers."
The source for the C++ plug-in is provided with the protoc distribution. If you want to change the return type for the set_foo function, you're welcome to do so. I normally avoid responses that amount to, "It's open source, so anyone can modify it" because it's not usually helpful to recommend that someone learn an entirely new project well enough to make major changes just to solve a problem. However, I don't expect it would be very hard in this case. The hardest part would be finding the section of code that generates setters for fields. Once you find that, making the change you need will probably be straightforward. Change the return type, and add a return *this statement to the end of the generated code. You should then be able to write code in the style given in Hrnt's answer.
To follow up on my comment...
struct Person
{
    int id;
    std::string name;

    struct Builder
    {
        int id;
        std::string name;
        Builder &setId(int id_)
        {
            id = id_;
            return *this;
        }
        Builder &setName(std::string name_)
        {
            name = name_;
            return *this;
        }
    };

    static Builder build(/* insert mandatory values here */)
    {
        return Builder(/* and then use mandatory values here */)/* or here: .setId(val) */;
    }

    Person(const Builder &builder)
        : id(builder.id), name(builder.name)
    {
    }
};

void Foo()
{
    Person p = Person::build().setId(2).setName("Derek Jeter");
}
This ends up getting compiled into roughly the same assembler as the equivalent code:
struct Person
{
int id;
std::string name;
};
Person p;
p.id = 2;
p.name = "Derek Jeter";
In C++ you have to explicitly manage memory, which would probably make the idiom more painful to use - either build() has to call the destructor for the builder, or else you have to keep it around to delete it after constructing the Person object.
Either is a little scary to me.
