Send XML Document and JSON as bytes over web socket in Java

What are the benefits of using ByteArrayOutputStream to convert the XML or JSON data to send over web socket instead of sending these values as Strings?

Security: JSON and XML are easy to decode (this matters mostly for plain WS, as opposed to WSS).
Efficiency: Lower traffic usage and, in most cases, cheaper encode/decode processing. Byte arrays can be very compact compared to strings, especially for data that is not text by nature (compare a boolean array of size 32 packed into 4 bytes with the 128+ bytes (32*4) needed for a string representation, both in data usage and in encode/decode CPU usage). Check THIS link.
Generality: You can send any type of data, including objects with complex hierarchical inheritance between them. To decode JSON with a complex tree-like inheritance structure, you need a very complex parsing method.
Simplicity: You can chunk the data meaningfully. Suppose we always use the first 2 bytes of the data as its type, to decode the rest (see the sketch after this list). Normally additional libraries do that for us.
Integrity: Corrupted data is easier to recognize. Even without a checksum, 1-bit data corruption can be detected in most cases.
Compatibility: You can use versioned serialized objects to control compatibility (version control). Although you could add a version field to JSON, that can cause a lot of difficulty, inefficiency and trouble. Check THIS.
And probably other reasons in special cases.
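
As an illustration of the type-prefix idea in the Simplicity point above, here is a minimal sketch using the standard javax.websocket API (the class name, type codes and helper methods are assumptions for illustration, not part of the original answer):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import javax.websocket.Session;

public class BinaryFrames {
    // Hypothetical type codes: the first 2 bytes of every frame tell the receiver how to decode the rest.
    static final short TYPE_XML = 1;
    static final short TYPE_JSON = 2;

    static void sendTyped(Session session, short type, byte[] payload) throws IOException {
        ByteBuffer frame = ByteBuffer.allocate(2 + payload.length);
        frame.putShort(type);     // 2-byte type prefix
        frame.put(payload);       // raw payload bytes
        frame.flip();
        session.getBasicRemote().sendBinary(frame);
    }

    static String decodeText(ByteBuffer frame) {
        short type = frame.getShort();                 // read the type prefix first
        byte[] payload = new byte[frame.remaining()];
        frame.get(payload);
        if (type == TYPE_XML || type == TYPE_JSON) {
            return new String(payload, StandardCharsets.UTF_8);
        }
        throw new IllegalArgumentException("Unknown frame type: " + type);
    }
}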

Related

Is additional base64 encoding necessary when sending files as byte[] from Java service to Java Service via RestTemplate?

I am sending data via json body in a post request from a client (Java) to a server (Java) using a Spring RestTemplate and RestController.
The data is present as a POJO on the client and will be parsed into a POJO with the same structure on the server.
On the client I am converting a file to byte[] with Files.readAllBytes and storing it in the content field.
On the server side the whole object including the byte[] will be marshalled to XML using JAXB annotations.
class BinaryObject {
String fileName;
String mimeCode;
byte[] content;
}
Everything is working fine and running as intended.
I heard it could be beneficial to encode the content field before transmitting the data to the server and decode it there before it is marshalled into XML.
My Question
Is it necessary or recommended to additionally encode / decode the content field with base64?
TL;DR
To the best of my knowledge, you are not going against any good practice with your current implementation. One might question the design (exchanging files in JSON? Storing binary inside XML?), but that is a separate question.
Still, there is room for possible optimization, but the toolset you use (e.g. Spring RestTemplate + Spring Controller + JSON serialization (Jackson) + XML using JAXB) kind of hides the possible optimizations from you.
You have to carefully weigh the pros and cons of working around your comfortable "automat(g)ical" serializations that work well as of today, to see if it is worth the trouble to tweak them.
We can nonetheless discuss the theory of what could be done.
A discussion about Base64
Base64 encoding is an efficient way to encode binary data in pure text formats (e.g. MIME structures such as email or some HTTP bodies, JSON, XML, ...) but it has two costs: the first is a non-negligible size increase (~33%), the second is CPU time.
Sometimes (but you'd have to profile and check if that is your case) this cost is not negligible, especially for large files (due to some buffering and char/byte conversions in the frameworks, you could easily end up using e.g. 4x the size of the encoded file in the Java heap).
When handling 10kb files at 10 requests/sec, this is usually NOT an issue.
But 10MB files at 100 req/second, well that is another ball park.
So you'd have to check (I doubt your typical server will reach 100 req/s with 10MB files, because that is a 1GB/s incoming network bandwidth).
What is optimizable in your current process
In your current process, you have multiple encodings taking place: the client needs to Base64 encode the bytes read from the file.
When the request hits the server, the server decodes the base64 to a byte[], then your XML serialization (JAXB) re-converts the byte[] to base64.
So in effect, "you" (more exactly, the REST controller side of things) decoded the base64 content all for nothing, because the XML side of things could have used it directly.
What could be done
A few things.
Do you need base64 at the calling site?
First, you do not have to encode at the client side. When using JSON, there is no choice, but the world did not wait for JSON to exchange files (e.g. arbitrary binary content) over HTTP.
If your content is a file name, a MIME type, and a file body, then standard, direct HTTP calls with no JSON at all is perfectly fine.
The MIME type could be mapped to the Content-Type HTTP header, the file name inside the Content-Disposition HTTP header, and the contents as the raw HTTP body. No base64 needed (but you need your server side to accept raw HTTP content as is). This is as standard as can be.
This change would allow you to remove the encoding (client side), lower the network size of the call (~33% less), and remove one decoding at the server side. The server would just have to base64 encode (once) a raw stream to produce the XML, and you would not even need to buffer the whole file contents for that (you'd have to tweak your JAXB model a bit, but you can JAXB-serialize bytes directly from an InputStream, which means almost no buffering, and since your CPU probably encodes faster than your network serves content, no real latency incurred).
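
A rough sketch of that raw-HTTP variant on the client side, assuming a hypothetical /upload endpoint that accepts the raw body: the file bytes travel as the request body, the metadata travels in headers, and no base64 is involved.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.web.client.RestTemplate;

public class RawUploadClient {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get("report.pdf");                 // hypothetical file to send

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_PDF);   // plays the role of the mimeCode field
        headers.set(HttpHeaders.CONTENT_DISPOSITION,
                "attachment; filename=\"" + file.getFileName() + "\""); // carries the file name

        // The body is just the raw bytes, no JSON wrapper and no base64.
        HttpEntity<byte[]> request = new HttpEntity<>(Files.readAllBytes(file), headers);
        new RestTemplate().postForEntity("https://example.org/upload", request, Void.class);
    }
}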
If this, for some reason, is not an option (let's say your client has to send JSON, and therefore base64 content):
Can you avoid decoding at the server side?
Sort of. You can use a server-side bean where the content is actually a String and NOT a byte[]. This is hacky, but your REST controller will no longer deserialize base64; it will keep it "as is", as a JSON string (which happens to be base64 encoded content, but the controller does not care).
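
A small sketch of such a server-side bean (the class name is hypothetical, the field names match the question); Jackson binds the JSON field straight to the String, so the controller never pays for a base64 decode:

import com.fasterxml.jackson.annotation.JsonProperty;

class BinaryObjectDto {
    @JsonProperty("fileName") public String fileName;
    @JsonProperty("mimeCode") public String mimeCode;
    @JsonProperty("content")  public String content;  // deliberately a String, not byte[]: the base64 text is kept as-is
}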
So your server will have saved the CPU cost of one base64 decoding, but in exchange, you'll have a base64 String in the Java heap (compared to the raw byte[]: +33% size on Java >= 9 with compact strings, +166% size on Java < 9).
If you are to profit from this, you also have to tweak your JAXB to see the base64 encoded String as a byte[], which is not trivial as far as I can tell, unless you modify the JAXB object in such a way that it accepts a String instead of the byte[], which is kind of hacky (if your JAXB objects are generated from an XML schema, this might really become a pain to implement).
All in all this is much harder - probably too much if you are not really hitting the wall, performance wise, on this particular issue.
A few other things
Are your files pure binary, or are they actually text? If they are text, you may benefit from using CDATA encoding on the XML side instead of base64.
Is your XML actually a SOAP call? If so, and if the service supports MTOM, you could avoid base64 completely, but that is an altogether different subject.

Obtain a string from the compressed data and vice versa in java

I want to compress a string (an XML document) in Java and store it in a Cassandra db as varchar. I should be able to decompress it while reading from the db. I looked into GZIP and LZ4, and both return a byte array on compression.
My goal is to obtain a string from the compressed data which can also be used to decompress and get back the original string.
What is the best possible approach?
I don't see any good reasons for you to compress your data: Cassandra can do it for you transparently (it will LZ4 your data by default). So, if your goal is to reduce your data footprint then you have a non-existent problem, and I'd feed the XML document directly to C*.
By the way, all compression algorithms take an array of bytes and produce an array of bytes. As a solution, you could apply something like base64 encoding to your compressed byte array. On decompression, reverse the logic: base64-decode your string and then apply your decompression algorithm.
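
A minimal sketch of that compress-then-encode round trip using GZIP from the JDK (the answer mentions GZIP and LZ4; LZ4 would need an extra library, so GZIP is used here for illustration):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class XmlCompression {

    // XML string -> GZIP bytes -> base64 text (safe to store in a varchar column).
    static String compressToString(String xml) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(bos.toByteArray());
    }

    // Reverse the logic: base64 text -> GZIP bytes -> original XML string.
    static String decompressFromString(String stored) throws IOException {
        byte[] compressed = Base64.getDecoder().decode(stored);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}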
Not enough reputation to comment, so posting as an answer. If you want a string back, then significant compression will depend on your data. A very simple solution might be something like Java compressing Strings, but that would only work if your string contains only characters and no numbers. You can modify this solution to work for most characters, but if you don't have repeating characters you might actually get a larger string than your original one.

Java Serialised object vs Non serialised object

1) Can a non-serialised java object be sent over the network to be executed by another JVM or stored in local file storage to get the data restored?
2) What is the difference between serialising and storing the java object vs storing the java object without serialising it?
Serialization is a way to represent a Java object as a series of bytes. It's just a format, nothing more.
A "build-in" java serialization is a class that provides an API for conversion of the java object to a series of bytes. That's it. Of course, deserialization is a "complementary" process that allows to convert this binary stream back to the object.
The serialization/deserialization itself has nothing to do with the "sending over the network" thing. It's just convenient to send a binary stream that can be created from the object with serialization.
Even more, sometimes the built-in serialization is not an optimal way to get the binary stream, because sometimes the object can be converted using fewer bytes.
So you can use your own custom protocol, provide your own customization for serialization (for example, Externalizable), or even use third-party libraries like Apache Avro.
I think this effectively answers both of your questions:
You can turn a non-serialized object (I guess one that doesn't implement the "Serializable" interface) into a series of bytes (a byte stream) by yourself if you want, and then send it over the network, store it in a binary file, or whatever.
Of course you'll have to understand how to read this binary format to convert it back.
Since serialization is just a protocol of conversion and not a "storage related thing", the answer is obvious.
Hope this helps.
In short, you don't store a non-serialized object in java. So I would say no to both questions.
Edit: ObjectOutputStream and ObjectInputStream can write primitives as well as serializable objects, if that's what you are using.
1) Can a non-serialised java object be sent over the network to be executed by another JVM or stored in local file storage to get the data restored?
An object is marshalled using ObjectOutputStream to be sent over the wire. Serialization is the standard Java way of storing the state of an object. You can devise your own way of doing the same, but there is no point re-inventing the wheel unless you see a big problem in the standard way.
2) What is the difference between serialising and storing the java object vs storing the java object without serialising it?
Serialization stores the state of the object using ObjectOutputStream, and it can be de-serialized using ObjectInputStream. A serialized object can be saved to a file or sent over the network. Serialization is the standard way to achieve all this, but you can always invent your own way to do so if you really have a reason to.
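
A minimal sketch of the standard round trip described above: ObjectOutputStream writes a Serializable object, ObjectInputStream restores it (the Person class and file name are made up for illustration).

import java.io.*;

class Person implements Serializable {
    private static final long serialVersionUID = 1L;   // version field for compatibility control
    String name;
    int age;
    Person(String name, int age) { this.name = name; this.age = age; }
}

public class SerializationDemo {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        File file = new File("person.bin");             // could just as well be a socket stream

        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new Person("Alice", 30));   // serialize: object -> bytes
        }

        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            Person restored = (Person) in.readObject(); // deserialize: bytes -> object
            System.out.println(restored.name + " " + restored.age);
        }
    }
}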
The purpose of serialization is to store the state of objects in a self contained way that doesn't require raw memory references, run time state etc. In other words, objects can be represented as a string of bits that can be stored on disk, sent over a network etc.

Generate and parse text files in Java

I'm looking for a library/framework to generate/parse TXT files from/into Java objects.
I'm thinking in something like Castor or JAXB, where the mapping between the file and the objects can be defined programmatically or with XML/annotations. The TXT file is not homogeneous and has no separators (fixed positions). The size of the file is not big, therefore DOM-like handling is allowed, no streaming required.
For instance:
TextWriter.write(Collection objects) -> FileOutputStream
TextReader.read(FileInputStream fis) -> Collection
I suggest you use Google's Protocol Buffers:
Protocol buffers are a flexible, efficient, automated mechanism for
serializing structured data – think XML, but smaller, faster, and
simpler. You define how you want your data to be structured once, then
you can use special generated source code to easily write and read
your structured data to and from a variety of data streams and using a
variety of languages. You can even update your data structure without
breaking deployed programs that are compiled against the "old" format.
Protobuf messages can be exported/read in binary or text format.
Other solutions would depend on what you call a text file: if base64 is texty enough for you, you could simply use standard Java serialization with base64 encoding of the binary stream.
You can do this using Jackson: serialize to JSON and back.
http://jackson.codehaus.org/
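
A minimal sketch of the Jackson round trip suggested above (the Item class is a made-up example):

import com.fasterxml.jackson.databind.ObjectMapper;

public class JacksonDemo {

    // Simple POJO with public fields and a default constructor so Jackson can bind it.
    public static class Item {
        public String name;
        public int quantity;
        public Item() {}
        public Item(String name, int quantity) { this.name = name; this.quantity = quantity; }
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        String json = mapper.writeValueAsString(new Item("widget", 3)); // object -> text
        Item back = mapper.readValue(json, Item.class);                 // text -> object
        System.out.println(json + " -> " + back.name + " x" + back.quantity);
    }
}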
Just generate and parse it with XML or JSON formats, there's a whole load of libraries out there that will do all the work for you.

Best way to serialize a C structure to be deserialized by Java, etc

Currently, I'm saving and loading some data in C/C++ structs to files by using fread()/fwrite(). This works just fine when working within this one C app (I can recompile whenever the structure changes to update the sizeof() arguments to fread()/fwrite()), but how can I load this file in other programs without knowing in advance the sizeof()s of the C struct?
In particular, I have written this other Java app that visualizes the data contained in that C struct binary file, but I'd like a general solution as to how to read that binary file. (Instead of me having to manually put in the sizeof()s in the Java app source whenever the C structure changes...)
I'm thinking of serializing to text or XML of some sort, but I'm not sure where to start with that (how to serialize in C, then how to deserialize in Java and possibly other languages in the future), and if that is advisable here where one member of the struct is a float array that can go upwards of ~50 MB in binary format (and I have hundreds of these data files to read and write).
The C structure is simple (no severe nesting or pointer references) and looks like the following:
struct MyStructure {
char *title;
int id;
int param1;
int param2;
float *data;
};
The parts that are liable to change the most are the param integers.
What are my options here?
If you have control of both code bases, you should consider using Protocol Buffers.
You could use Java's DataInput/DataOutput format that is well described in the javadoc.
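
A rough sketch of that DataOutput/DataInput idea for the struct above; the field order and the length prefix for the float array are conventions you would have to agree on with the C side, not anything standard:

import java.io.*;

public class DataStreamDemo {
    public static void main(String[] args) throws IOException {
        // Write the struct fields in a fixed, agreed-upon order.
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream("mystruct.dat"))) {
            out.writeUTF("my title");      // title (length-prefixed modified UTF-8)
            out.writeInt(42);              // id
            out.writeInt(1);               // param1
            out.writeInt(2);               // param2
            float[] data = {1.0f, 2.5f};
            out.writeInt(data.length);     // length prefix for the float array
            for (float f : data) out.writeFloat(f);
        }

        // Read them back in the same order.
        try (DataInputStream in = new DataInputStream(new FileInputStream("mystruct.dat"))) {
            String title = in.readUTF();
            int id = in.readInt(), p1 = in.readInt(), p2 = in.readInt();
            float[] data = new float[in.readInt()];
            for (int i = 0; i < data.length; i++) data[i] = in.readFloat();
            System.out.println(title + " " + id + " " + p1 + " " + p2 + " n=" + data.length);
        }
    }
}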
Take a look at JSON. http://www.json.org. If you go to/from JavaScript it's a big help. I don't know how good the Java support is though.
If your structure isn't going to change (much), and your data is in a pretty consistent format, you could just write the values out to a CSV file, or some other plain format.
This can be easily read in Java, and you won't have to worry about serializing to XML. Sometimes going simple is the easiest route.
Take a look at Resin's Hessian/Burlap services. You may not want the whole service, just part of the API and an understanding of the wire protocol.
If:
your data is essentially a big array of floats;
you are able to test the writing/reading procedure in all the likely environments (=combinations of machines/OS/C compiler) that each end will be running on;
performance is important.
then I would probably just keep writing the data from C in the way that you are doing (maybe with a slight amendment -- see below) and turn the problem into how you read that data from Java.
To read the data back in from Java, use a ByteBuffer. Essentially, pull in slabs of bytes from your data, wrap a ByteBuffer around them, and then use the get(), getFloat(), getInt() etc methods. The NIO package also has "wrapper" buffers, e.g. FloatBuffer, which from tests I've done appear to be about 20% faster for reading large numbers of the same type.
Now, one thing you'll have to be careful about is byte ordering. From Java, you need to call order(ByteOrder.LITTLE_ENDIAN) or order(ByteOrder.BIG_ENDIAN) on your buffer before you start reading the data. To decide which to use, I'd recommend that at the very start of the stream you write some known 16-bit value (e.g. 255 = 0x00ff). Then from Java, pull out these two bytes and check the order (0xff, 0x00 or 0x00, 0xff) to see whether you have little or big endian.
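
A sketch of that byte-order probe on the Java side, assuming the C writer emits the 16-bit marker 0x00FF before the rest of the data (the file name and payload layout are made up for illustration):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class StructReader {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size()); // single read is fine for a sketch
            ch.read(buf);
            buf.flip();

            // Peek at the two marker bytes to decide the byte order.
            byte b0 = buf.get(0), b1 = buf.get(1);
            ByteOrder order = (b0 == 0x00 && b1 == (byte) 0xFF)
                    ? ByteOrder.BIG_ENDIAN
                    : ByteOrder.LITTLE_ENDIAN;
            buf.order(order);

            buf.getShort();                   // skip the marker itself
            float first = buf.getFloat();     // then read the payload, e.g. the floats
            System.out.println(order + ", first float = " + first);
        }
    }
}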
One possibility is creating small XML files with title, ID, params, etc, and then a reference (by filename) to where the float data is contained. Assuming there's nothing special about the float data, and that Java and C are using the same floating point format, you can read that file in with readFloat() of a DataInputStream.
I like the CSV and "Protocol Buffers" answers (though, at a glance, the protocol buffer thing might be very similar to YAML for all I know).
If you need tightly packed records for high volume data, you might consider this:
Create a textual file header describing the current file structure: record sizes (types????) and field names / sizes. Read and parse the header, then use low level binary I/O operations to load up each record's fields, er, object's properties or whatever we are calling it this year.
This gives you the ability to change the structure a bit and have it be self-describing, while still allowing you to pack a high volume into a smaller space than XML would allow.
TMTOWTDI, I guess.
