MessagePack can serialize small integers and short strings compactly by merging the type identifier with the actual data.
However, when the data to serialize contains a primitive array (a Java double[], for instance), MessagePack apparently spends one byte per value in the array to specify its type, instead of recognizing that the type is the same for every value.
Is there a way to avoid this behavior while remaining interoperable (other than using a binary string and converting in the application)?
Related
I plan to write a wrapper for SWIFT MT203 and MT204 messages.
The message structure is as follows:
MT203 -
Two mandatory sequences: the first occurs exactly once and the second can occur two to ten times. Each sequence can contain mandatory and optional fields.
MT204 -
Two mandatory sequences: the first occurs exactly once and the second can occur more than once. Each sequence can contain mandatory and optional fields.
[References for the MT203 and MT204]
https://www2.swift.com/knowledgecentre/publications/usgf_20180720/1.0?topic=finmt203.htm
https://www2.swift.com/knowledgecentre/publications/usgf_20180720/1.0?topic=finmt204.htm
Which data structure is better for storing the second sequence in each case?
I prefer an array for MT203, since I know the maximum size of the second sequence, but for MT204 I am unsure whether an array or an ArrayList is the better choice.
During unpacking we have to read the fields consecutively, but not all fields of the second sequence are mandatory.
[Please also comment if choosing an array for the first case is not valid.]
I think you'd do quite fine with either data structure.
Having said that, there are some things you might want to consider: you can make an ArrayList (like any other list) unmodifiable. That prevents unwanted modification of the contents, which is useful when you pass these message objects around and want to prevent someone else from modifying a message accidentally. There are several ways to do this, such as Collections.unmodifiableList(myArrayList), which returns a read-only view of the list, or Guava's ImmutableList.copyOf(myArrayList), which makes an immutable copy.
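As a minimal sketch of the read-only idea, the snippet below wraps a defensive copy of a list with the JDK's Collections.unmodifiableList. The class name, the snapshot helper, and the placeholder field strings are made up for illustration and do not come from the SWIFT specification.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ReadOnlySequenceSketch {
    // Copies the source first, so later changes to 'source' are not visible
    // through the returned list, then wraps the copy in a read-only view.
    static List<String> snapshot(List<String> source) {
        return Collections.unmodifiableList(new ArrayList<>(source));
    }

    public static void main(String[] args) {
        List<String> repeatedSequence = new ArrayList<>();
        repeatedSequence.add("placeholder-field-1"); // hypothetical content
        repeatedSequence.add("placeholder-field-2");

        List<String> readOnly = snapshot(repeatedSequence);
        try {
            readOnly.add("placeholder-field-3");
        } catch (UnsupportedOperationException e) {
            System.out.println("modification rejected");
        }
    }
}
```

Without the defensive copy, the wrapper would only be a view: whoever still holds the original ArrayList could change what readers of the view see.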
That said, I believe there are more important considerations than the features of lists versus the features of arrays:
First of all, I would consider having both use the same data structure. Especially if both messages are used in the same part of the codebase, it is going to be very confusing if one message type is an array while the other is a list. This might ultimately become a pain, as the two messages will have to be handled differently. For example, if you want to log messages, you'll have to do that differently for lists and for arrays.
Secondly, I would recommend modelling each of these messages as a class. That class would (obviously) use an array or a list internally to store the message data, but it would also give higher-level, semantic access to the contents of the message.
Say you wanted the ValueDate of MT203 (field index 1): you'd always need to call dateFormat.parse(message[1]) for that, and everyone would need to remember what index 1 was and how to parse the date string into an actual date object. If you had a class like this:
import java.util.Date;
import java.util.List;

class MultipleGeneralFinancialInstitutionTransfer {
    private List<String> messageData;

    /** constructor... */

    public Date getValueDate() {
        // parseDate is imagined here: a helper that parses the field's actual date format
        return parseDate(messageData.get(1));
    }
}
it would be much more convenient to work with that message - and nobody would need to remember the actual format of that message.
I. Size: An array in Java is fixed in size; we cannot change its size after creating it. An ArrayList is dynamic in size: when we add elements to it, its capacity increases automatically.
II. Performance: In Java, arrays and ArrayLists perform differently for different operations.
Adding or getting: Storing an element in, or retrieving an element from, an array or an ArrayList has similar performance. These are constant-time operations (amortized constant time for ArrayList.add()).
Resizing: The automatic resizing of an ArrayList costs some performance. An ArrayList is internally backed by an array; when that array is full, a larger one is allocated and the elements are copied from the old array to the new one.
III. Primitives: An array can contain primitive data types as well as objects. An ArrayList cannot contain primitive data types; it contains only objects.
IV. Iterator: An ArrayList can be traversed with an Iterator object or an enhanced for loop. An array is traversed with a for loop (indexed or enhanced).
V. Type Safety: Java ensures the type safety of the elements of an ArrayList at compile time through generics. An array can contain only objects of its component type (or a subtype). If we try to store an object of a different type in an array, an ArrayStoreException is thrown at run time.
VI. Length: The size of an ArrayList can be obtained with the size() method. Every array object has a length field that holds the length of the array.
VII. Adding elements: In an ArrayList we use the add() method to add objects. In an array the assignment operator is used to set elements.
VIII. Multi-dimension: An array can be multi-dimensional. An ArrayList is always one-dimensional.
Now you can choose whichever suits your needs better.
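A few of the points above (fixed versus growing size, boxing, and the run-time ArrayStoreException) can be illustrated with a short sketch. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class ArrayVersusListSketch {
    // Arrays are covariant, so the wrong element type is only caught at run time.
    static boolean arrayStoreFails() {
        Object[] objects = new String[1];
        try {
            objects[0] = Integer.valueOf(1); // compiles, but fails when executed
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    // An ArrayList grows past its initial capacity; each double is boxed to a Double.
    static int grownSize(int count) {
        List<Double> values = new ArrayList<>(2); // initial capacity of 2
        for (int i = 0; i < count; i++) {
            values.add((double) i);
        }
        return values.size();
    }

    public static void main(String[] args) {
        double[] fixed = new double[10]; // length is fixed at creation
        fixed[0] = 1.0;                  // elements are set by assignment
        System.out.println(fixed.length);
        System.out.println(grownSize(100));
        System.out.println(arrayStoreFails());
    }
}
```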
I work with messages that contain a few attributes and an array of a thousand floating point values (double[]). When the messages are serialized with protocol buffers, thanks to the "packed=true" directive, the double values are aligned and stored compactly in the messages.
But by default the Java classes generated for that message represent the double array as an array list (!), boxing primitive double values into objects, scattering those objects in memory, while at the end I need the double[] representation for further aggregations...
Is there an option to generate classes that handle repeated primitive values as Java primitive arrays?
As explained here, what is needed are versions of ArrayList that store unboxed values. Since Java generics work only with objects (boxed types), a separate implementation is needed for each primitive type. You can use the ones provided by Apache Commons Primitives.
After discussing this topic in several places, the answer is a clear no.
With protocol buffers the binary representation for vectors of numbers is efficient. But it is currently not possible with the Java implementation to efficiently deserialize those vectors (instead of primitive arrays you get collections of boxed numbers...)
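If the generated accessor hands back a List<Double>, as described in the question, one workaround is to copy it once into a double[] before aggregating. The sketch below uses only the JDK stream API. It does not avoid the boxing that happens during deserialization, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class UnboxSketch {
    // Copies a list of boxed doubles into a primitive array in a single pass.
    static double[] toPrimitiveArray(List<Double> boxed) {
        return boxed.stream().mapToDouble(Double::doubleValue).toArray();
    }

    public static void main(String[] args) {
        // Stand-in for the list returned by a generated message class.
        List<Double> fromMessage = Arrays.asList(1.5, 2.5, 3.0);
        double[] values = toPrimitiveArray(fromMessage);
        System.out.println(Arrays.stream(values).sum());
    }
}
```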
Let's suppose I define a class
public class PointFloat {
    float x;
    float y;
}
Then I instantiate an array
PointFloat[] points = new PointFloat[10];
At this point I have an array of ten PointFloat Objects. Let's suppose that some code assigns values x and y to every PointFloat.
What I need is to store that array in a VARBINARY column in a MySQL database.
To accomplish this I would need to convert this array of PointFloats to a byte[] so I can insert it into the database using a PreparedStatement.
Using a PreparedStatement is nothing new for me, but this is my first time using object serialization.
How do you convert an array of PointFloat of any size to a byte[]?
Please keep it as simple as possible.
Thank you very much for reading.
You can simply use an ObjectOutputStream to write your array into a ByteArrayOutputStream. See this answer for details and an example: https://stackoverflow.com/a/2836659/337621
Since your object contains only two floats, standard serialization completely fits your needs, provided that PointFloat implements java.io.Serializable.
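A minimal sketch of that approach, assuming PointFloat is changed to implement Serializable (the interface and the serialVersionUID are additions that the class in the question does not have). The helper class and method names are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class PointFloat implements Serializable {
    private static final long serialVersionUID = 1L; // assumed, not in the question
    float x;
    float y;
}

class PointSerializationSketch {
    // Writes the whole array with standard Java serialization.
    static byte[] toBytes(PointFloat[] points) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(points);
        }
        return buffer.toByteArray();
    }

    // Reads the array back from the bytes produced by toBytes.
    static PointFloat[] fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (PointFloat[]) in.readObject();
        }
    }
}
```

The resulting byte[] can then be bound with PreparedStatement.setBytes.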
At this point I have an array of ten PointFloat Objects
No. At this point, you have an array of 10 null references.
Choose how you want to transform the points to a byte array. You could design a custom representation, or use Java serialization, or JSON, or XML, for example.
I would choose a format that is readable whatever the language is, and that won't be unreadable as soon as you change the Point class (so not the native Java serialization). JSON is very compact (for a text-based representation). There are dozens of JSON serializers, for every language. They're all documented.
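For comparison, here is a sketch of the "custom representation" option: each point is written as two 4-byte floats with java.nio.ByteBuffer (big-endian by default), so ten points take 80 bytes. The layout, the added constructor, and the codec class are assumptions made for this example. Unlike JSON, this format is not self-describing.

```java
import java.nio.ByteBuffer;

class PointFloat {
    float x;
    float y;

    PointFloat(float x, float y) { // constructor added for the example
        this.x = x;
        this.y = y;
    }
}

class PointCodecSketch {
    // Layout: x0, y0, x1, y1, ... as IEEE 754 floats, 8 bytes per point.
    static byte[] encode(PointFloat[] points) {
        ByteBuffer buffer = ByteBuffer.allocate(points.length * 2 * Float.BYTES);
        for (PointFloat p : points) {
            buffer.putFloat(p.x).putFloat(p.y);
        }
        return buffer.array();
    }

    // Rebuilds the points; the count is derived from the byte length.
    static PointFloat[] decode(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        PointFloat[] points = new PointFloat[bytes.length / (2 * Float.BYTES)];
        for (int i = 0; i < points.length; i++) {
            float x = buffer.getFloat();
            float y = buffer.getFloat();
            points[i] = new PointFloat(x, y);
        }
        return points;
    }
}
```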
I'm using Kryo IO directly to do my own low level primitive serialization of Strings, Longs and Doubles.
What I'm wondering is if there is any way for Kryo IO to automatically detect the primitive data types from the serialized bytes when reading them back?
If I have a byte array of say 10 serialized values, and I don't know if they were Strings, Longs, or Doubles; is there any way for Kryo to determine the data types (like MsgPack can)?
Kryo is no different to the normal Java serialization in this respect. There are two ways in which the deserializer can know what type it is deserializing each time:
1. It is a field in a known class, so the deserializer implementation reads each field in its proper order.
2. There is type information embedded in the stream in some manner to let it know. The writeClassAndObject() method in Kryo does just that: it prepends a compact class identifier to the actual object content, letting the deserializer know what to do.
Alternatively, you can do something like this manually e.g. by sending a single byte that would select among a limited number of supported types.
Besides, this is what the MessagePack format mandates as well...
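The manual variant can be sketched as follows. To keep the example dependency-free it uses java.io's DataOutputStream and DataInputStream rather than Kryo's own IO classes, and the tag values are arbitrary choices for this illustration.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class TaggedValueSketch {
    // One tag byte per value selects among the supported types (arbitrary values).
    static final byte TAG_STRING = 0;
    static final byte TAG_LONG = 1;
    static final byte TAG_DOUBLE = 2;

    static void write(DataOutputStream out, Object value) throws IOException {
        if (value instanceof String) {
            out.writeByte(TAG_STRING);
            out.writeUTF((String) value);
        } else if (value instanceof Long) {
            out.writeByte(TAG_LONG);
            out.writeLong((Long) value);
        } else if (value instanceof Double) {
            out.writeByte(TAG_DOUBLE);
            out.writeDouble((Double) value);
        } else {
            throw new IllegalArgumentException("unsupported type: " + value.getClass());
        }
    }

    // Reads the tag first, then decodes the payload it announces.
    static Object read(DataInputStream in) throws IOException {
        byte tag = in.readByte();
        switch (tag) {
            case TAG_STRING: return in.readUTF();
            case TAG_LONG:   return in.readLong();
            case TAG_DOUBLE: return in.readDouble();
            default: throw new IOException("unknown tag " + tag);
        }
    }
}
```

The cost is one extra byte per value, which is the same kind of overhead the type prefix adds in self-describing formats.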
I have a Tuple object that holds 3 primitives: Tuple(double, long, long). To avoid creating a huge number of Tuple objects, I'm thinking of using the Trove library's primitive maps, which take one primitive as key and one as value. In my case, it would be Map<double, some primitive>.
My question: is it possible to efficiently encode the two longs into a single primitive that I can store in the map, and later decode them?
is it possible to efficiently encode the two longs into a single primitive
No, simply because longs are 64-bit, and no Java primitive is longer than that. You would need a 128-bit primitive to encode two longs into it.
That's right: you cannot pack two 64-bit primitives into another primitive, which is at most 64 bits in size. Both double and long are defined by the language as 64-bit types.
The question is whether you can impose some restrictions on the numbers you are dealing with. If you know that you will always have even numbers or odd numbers, that the first component stays within the int range, or that you are dealing with multiples of 1000, you can save some bits here.
Practically speaking, you will never make use of all 2^64 x 2^64 combinations of pairs of long values.
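As an illustration of such a restriction, suppose (this is an assumption, not something the question states) that both values always fit into the 32-bit int range. Then the pair can be packed into one long with shifts and masks. The class and method names are made up.

```java
class PackedPairSketch {
    // Puts 'high' in the upper 32 bits and 'low' in the lower 32 bits.
    // The mask stops a negative 'low' from overwriting the upper half
    // through sign extension.
    static long pack(int high, int low) {
        return ((long) high << 32) | (low & 0xFFFFFFFFL);
    }

    // The arithmetic shift restores the sign of the upper half.
    static int high(long packed) {
        return (int) (packed >> 32);
    }

    // The narrowing cast keeps exactly the lower 32 bits.
    static int low(long packed) {
        return (int) packed;
    }

    public static void main(String[] args) {
        long packed = pack(-5, 123456);
        System.out.println(high(packed) + " " + low(packed));
    }
}
```

The packed long can then serve as the value in a primitive double-to-long map. If either component can exceed the int range, this scheme silently loses information.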
On the other hand, it's no big deal to handle maps of pairs of values. That is the whole point of object-oriented languages like Java: not only to deal with data types like struct in C, but also to bind methods to the data.
You can find good implementations of a Pair class on the web, e.g. at angelikalanger.com. Or you can easily code an implementation yourself, especially since you only need a pair of Long values. Also consider using Pair<Double, Pair<Long, Long>>, or implement a Tuple<M,N,T> class right away instead of a Map (i.e. a key-value combination), following the outline of the Pair<M,N> implementation.
Finally, you could even employ an in-memory database like H2 to hold your Tuple(double, long, long) entries. It is enough to include it in your project as a Java library and configure it properly.
By the way, a 3-tuple is called a triple. Therefore, you could correctly call your class Triple(double, long, long) or better Triple(Double, Long, Long).
You could use Trove's double-Object map and encode the two longs into a BigInteger, but if your objective is to stay strictly with primitive types, that obviously isn't any help.
As Joonas says, there is no single primitive that will hold 128 bits. What might meet your need is to use an array to hold the two longs: Map<Double, long[]>. While Double and long[] are not strictly primitives, that might suit. Remember that you cannot put double (small-d) into a Map, as Maps can only contain reference types, not primitives.
Alternatively, how about Map<Double, Pair>, where Pair is a small class to hold two longs? Most libraries have something like that lying around somewhere.
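A small sketch of that last suggestion, using a plain java.util.HashMap rather than Trove. The LongPair class and the sample values are made up for illustration. It still allocates one small object per entry, so it does not remove the allocation cost the question is worried about.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal immutable holder for two primitive longs (hypothetical name).
final class LongPair {
    final long first;
    final long second;

    LongPair(long first, long second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof LongPair)) {
            return false;
        }
        LongPair that = (LongPair) other;
        return first == that.first && second == that.second;
    }

    @Override
    public int hashCode() {
        return 31 * Long.hashCode(first) + Long.hashCode(second);
    }
}

class PairMapSketch {
    // Builds a tiny map; the double keys are boxed to Double automatically.
    static Map<Double, LongPair> sample() {
        Map<Double, LongPair> map = new HashMap<>();
        map.put(1.5, new LongPair(10L, 20L));
        map.put(2.5, new LongPair(Long.MAX_VALUE, Long.MIN_VALUE));
        return map;
    }
}
```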