I'm trying to pipe data from Spark (app written in Java) to a C++ executable.
My RDDs are of type JavaRDD&lt;CustomMatrix&gt;, where CustomMatrix implements Serializable. It consists of metadata (int, long, String, ...) and a short[][].
Other transformations, like map/flatMap/... , work well.
I would like to send the array (short[][]) to a C++ program, to perform some transformations, and get back the modified array.
I used the pipe() function to pipe data as Strings to a C++ executable. But now I have to serialize my data and send it to the C++ executable. Does anybody have an idea how this should be handled effectively?
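For what it's worth, one workable approach (a sketch with illustrative names, not taken from the question) is to encode each short[][] as a single Base64 line, since pipe() speaks a line-oriented text protocol:

```java
import java.nio.ByteBuffer;
import java.util.Base64;

public class MatrixCodec {
    // Pack a short[][] into bytes: row count, column count, then row-major payload.
    static String encode(short[][] m) {
        int rows = m.length, cols = rows == 0 ? 0 : m[0].length;
        ByteBuffer buf = ByteBuffer.allocate(8 + 2 * rows * cols);
        buf.putInt(rows).putInt(cols);
        for (short[] row : m)
            for (short v : row)
                buf.putShort(v);
        return Base64.getEncoder().encodeToString(buf.array());
    }

    // Inverse of encode(); the C++ side would do the equivalent after Base64-decoding.
    static short[][] decode(String line) {
        ByteBuffer buf = ByteBuffer.wrap(Base64.getDecoder().decode(line));
        int rows = buf.getInt(), cols = buf.getInt();
        short[][] m = new short[rows][cols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                m[r][c] = buf.getShort();
        return m;
    }
}
```

Each RDD element then maps to one input line for rdd.pipe("./cpp_exec"), and the C++ program writes one Base64 line back per line it reads.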
Related
I have a Java program with a function that I wrote: I pass it a list of strings, the function encrypts each element, and it returns a list of the encrypted elements.
My problem is this:
I need to use this function from a Python script (send a Python list as input, and receive a Java ArrayList back).
How can I call a Java function that I wrote from a Python script?
And are the list objects compatible between Python and Java (list vs. ArrayList)?
A big thank you to all!
** Edit: I'm about to use this entire package in an AWS Lambda function **
The main decisions for choosing a solution seem to be:
What do we use to execute the Java program?
How do we transfer computed data from the Java program to the Python program?
E.g. you could decide to run the Java program on the JVM, invoked via a call to the operating system from Python.
The computed data could be sent to standard output (in some suitable format) and read in and processed by Python. (See link for the os call and i/o)
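As a rough illustration of that approach (class and method names are made up), the Java side could print its results to standard output in a simple line format for the Python caller to parse:

```java
// Illustrative only: a stand-in for the real Java computation.
public class Compute {
    // Format one computed row as a comma-separated line.
    static String formatRow(int id, double value) {
        return id + "," + value;
    }

    public static void main(String[] args) {
        // The Python side launches this program (e.g. via subprocess)
        // and reads these lines from the process's standard output.
        for (int i = 0; i < 3; i++) {
            System.out.println(formatRow(i, i * 1.5));
        }
    }
}
```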
Explanation
I need to exchange binary structured data over a stream (TCP socket or pipe) between C++, Java and Python programs.
Therefore my question:
How to exchange binary structured data over a stream for C++, Java and Python?
There is no way to create the complete object to be serialized beforehand - there must be the possibility to stream in and stream out the data.
Because of performance issues I need some binary protocol format.
I want to use (if possible) some existing library, because hand-crafting all the (de-)serialization is a pain.
What I want
My idea is something like (for C++ writer):
StreamWriter sw(7); // fd to output to
while (DataSet const ds = get_next_row_from_db()) {
    sw << ds; // data set is some structured data
}
and for C++ reader
StreamReader sr(9); // fd for input
while (sr) {
    DataSet const ds(sr);
    // handle ds
}
with a similar syntax and semantics for Java and Python.
What I did
I thought about using an existing library like Google Protocol Buffers, but this does not support stream handling and there is the need to create the complete object hierarchy before serialization.
I also thought about creating my own binary format, but that is too much work and pain.
I would recommend explicitly documenting how your data types are to be serialized, and writing serialization and deserialization code in each language as needed. I have found in the past that with good documentation of how the data is to be serialized, this is fairly painless.
Your other major option is to standardize on one platform's default serialization method, but that means you have to figure out that method and implement in the other languages. This tends to be trickier as the default serialization methods are often complex and not well documented.
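For example, a documented length-prefixed record format might look like this in Java (the field layout here is illustrative, not a standard):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class RecordIO {
    // Documented wire format (one possible convention):
    //   int32  id       (big-endian, as DataOutputStream always writes)
    //   int32  nameLen
    //   bytes  name     (UTF-8, nameLen bytes)
    //   double value
    static void write(DataOutputStream out, int id, String name, double value) throws IOException {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        out.writeInt(id);
        out.writeInt(nameBytes.length);
        out.write(nameBytes);
        out.writeDouble(value);
    }

    static Object[] read(DataInputStream in) throws IOException {
        int id = in.readInt();
        byte[] nameBytes = new byte[in.readInt()];
        in.readFully(nameBytes);
        double value = in.readDouble();
        return new Object[]{id, new String(nameBytes, StandardCharsets.UTF_8), value};
    }
}
```

With the format written down like this, the C++ and Python readers only need to follow the same byte layout.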
The options are Apache Thrift, Google's Protocol Buffers, and Apache Avro. A good comparison is at http://www.slideshare.net/IgorAnishchenko/pb-vs-thrift-vs-avro
So I recommend you try Apache Avro.
Problem: Trying to read from some electronic scales via the COM port using Java
I am trying to read from a COM port using Java. So far I have been successful in creating a small application that uses the Java SerialPort and InputStream classes to read from the COM port.
The application uses a SerialPortEventListener to listen to events sent via the COM port from the scale to the computer. So far I have had some success using an InputStream inside the event listener to read some bytes from the COM port, however the output does not make any sense and I keep getting messages in the form of:
[B@8813f2
or
[B@1d58aae
To clarify, I am receiving these messages on screen when I interact with the keypad of the scale. I just need some help interpreting the output correctly. Am I using the correct classes to read from and write to the COM port?
You have read the data into a byte[], and then attempted to dump it by using System.out.println(data) where data is declared as byte[]. Unfortunately, that just prints the array's default representation, which is, uselessly, "[B@" followed by the hex hash code.
Instead, you want to dump the contents of the array. Using
System.out.println(Arrays.toString(data))
is the simplest way which should work for you.
Otherwise, you need to iterate the array, and print each byte, or transform the byte array to a String using, for example, new String(data) (which will use the platform default encoding).
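A minimal demonstration of the difference:

```java
import java.util.Arrays;

public class ArrayPrintDemo {
    public static void main(String[] args) {
        byte[] data = {72, 105};                    // the bytes for "Hi"
        System.out.println(data);                   // prints something like [B@1d58aae
        System.out.println(Arrays.toString(data));  // prints [72, 105]
        System.out.println(new String(data));       // prints Hi (platform default charset)
    }
}
```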
Those look like the result of printing a byte array object as a raw object reference, so somewhere you most likely have a confused call to System.out.println (or System.err) with the raw array.
The object you have there is apparently a byte array. I take it you're taking the object and printing it to the console.
See: http://download.oracle.com/javase/1.5.0/docs/api/java/lang/Class.html#getName()
and: http://download.oracle.com/javase/1.5.0/docs/api/java/lang/Object.html#toString()
I have a general socket programming question for you.
I have a C struct called Data:
struct Attribs {
    int color;
};

struct data {
    double speed;
    double length;
    char carName[32];
    struct Attribs attribs;
};
I would like to be able to create a similar structure in Java, create a socket, create the data packet with the above struct, and send it to a C++ socket listener.
What can you tell me about serialized data (basically, the 1s and 0s transferred in the packet)? How does C++ "read" these packets and recreate the struct? How are structs like this stored in the packet?
Generally, anything you can tell me to give me ideas on how to solve such a matter.
Thanks!
Be wary of endianness if you use binary serialization. Sun's JVM is big-endian, and if you are on Intel x86 you are on a little-endian machine.
I would use Java's ByteBuffer for fast native serialization. ByteBuffers are part of the NIO library, thus supposedly higher performance than the ol' DataInput/OutputStreams.
Be especially wary of serializing floats! As suggested above, it's safer to transfer all your data as character strings across the wire.
On the C++ side, regardless of the networking, you will have a filled buffer of data at some point, so your deserialization code will look something like:
size_t amount_read = 0;
data my_data;
memcpy(&my_data.speed, buffer + amount_read, sizeof(my_data.speed));
amount_read += sizeof(my_data.speed);
memcpy(&my_data.length, buffer + amount_read, sizeof(my_data.length));
amount_read += sizeof(my_data.length);
Note that the sizes of basic C++ types are implementation-defined, so primitive types in Java and C++ don't directly translate.
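To make that concrete on the Java side, here is a sketch using ByteBuffer (assuming a packed little-endian layout agreed with the C++ reader; the field widths follow the struct above):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class DataPacket {
    // 8 + 8 + 32 + 4 = 52 bytes; this assumes the C++ side reads a packed
    // little-endian layout rather than relying on its compiler's struct padding.
    static byte[] serialize(double speed, double length, String carName, int color) {
        ByteBuffer buf = ByteBuffer.allocate(52).order(ByteOrder.LITTLE_ENDIAN);
        buf.putDouble(speed);
        buf.putDouble(length);
        byte[] name = new byte[32]; // fixed-width, NUL-padded like char carName[32]
        byte[] src = carName.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(src, 0, name, 0, Math.min(src.length, 31));
        buf.put(name);
        buf.putInt(color);
        return buf.array();
    }
}
```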
You could use Google Protocol buffers. My preferred solution if dealing with a variety of data structures.
You could use JSON for serialization too.
The basic process is:
java app creates some portable version of the structs in the java app, for example XML
java app sends XML to C++ app via a socket
C++ app receives XML from java app
C++ app creates instances of structs using the data in the XML message
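A minimal sketch of the first step in Java (hand-built XML is fine for a flat struct like this one; a real XML library is safer once you need escaping or nesting):

```java
public class StructToXml {
    // Build a tiny XML message by hand for the struct's fields.
    static String toXml(double speed, double length, String carName, int color) {
        return "<data>"
             + "<speed>" + speed + "</speed>"
             + "<length>" + length + "</length>"
             + "<carName>" + carName + "</carName>"
             + "<color>" + color + "</color>"
             + "</data>";
    }
}
```

The C++ side then parses the same tags back out and fills its struct.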
Currently, I'm saving and loading some data in C/C++ structs to files by using fread()/fwrite(). This works just fine when working within this one C app (I can recompile whenever the structure changes to update the sizeof() arguments to fread()/fwrite()), but how can I load this file in other programs without knowing in advance the sizeof()s of the C struct?
In particular, I have written this other Java app that visualizes the data contained in that C struct binary file, but I'd like a general solution as to how read that binary file. (Instead of me having to manually put in the sizeof()s in the Java app source whenever the C structure changes...)
I'm thinking of serializing to text or XML of some sort, but I'm not sure where to start (how to serialize in C, then how to deserialize in Java and possibly other languages in the future), and whether that is advisable here, given that one member of the struct is a float array that can exceed ~50 MB in binary form (and I have hundreds of these data files to read and write).
The C structure is simple (no severe nesting or pointer references) and looks like the following:
struct MyStructure {
    char *title;
    int id;
    int param1;
    int param2;
    float *data;
};
The parts that are liable to change the most are the param integers.
What are my options here?
If you have control of both code bases, you should consider using Protocol Buffers.
You could use Java's DataInput/DataOutput format that is well described in the javadoc.
Take a look at JSON: http://www.json.org. If you're coming from JavaScript it's a big help. I don't know how good the Java support is, though.
If your structure isn't going to change (much), and your data is in a pretty consistent format, you could just write the values out to a CSV file, or some other plain format.
This can be easily read in Java, and you won't have to worry about serializing to XML. Sometimes going simple is the easiest route.
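For instance, a record of the struct above could be flattened to one CSV line (the layout here is just one possible convention):

```java
public class CsvCodec {
    // One record per line: title,id,param1,param2,data... (illustrative layout;
    // assumes the title itself contains no commas).
    static String toCsv(String title, int id, int p1, int p2, float[] data) {
        StringBuilder sb = new StringBuilder();
        sb.append(title).append(',').append(id).append(',').append(p1).append(',').append(p2);
        for (float f : data)
            sb.append(',').append(f);
        return sb.toString();
    }
}
```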
Take a look at Resin's Hessian/Burlap services. You may not want the whole service, just part of the API and an understanding of the wire protocol.
If:
your data is essentially a big array of floats;
you are able to test the writing/reading procedure in all the likely environments (=combinations of machines/OS/C compiler) that each end will be running on;
performance is important.
then I would probably just keep writing the data from C in the way that you are doing (maybe with a slight amendment -- see below) and turn the problem into how you read that data from Java.
To read the data back in from Java, use a ByteBuffer. Essentially, pull in slabs of bytes from your data, wrap a ByteBuffer around them, and then use the get(), getFloat(), getInt() etc methods. The NIO package also has "wrapper" buffers, e.g. FloatBuffer, which from tests I've done appear to be about 20% faster for reading large numbers of the same type.
Now, one thing you'll have to be careful about is byte ordering. From Java, you need to call order(ByteOrder.LITTLE_ENDIAN) or order(ByteOrder.BIG_ENDIAN) on your buffer before you start reading the data. To decide which to use, I'd recommend that at the very start of the stream, you write some known two-byte value (e.g. 255 = 0x00FF). Then from Java, pull out these two bytes and check the order (0xFF, 0x00 or 0x00, 0xFF) to see whether you have little- or big-endian data.
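A sketch of both ideas in Java (the marker convention follows the suggestion above; it assumes the C side wrote the 16-bit value 0x00FF as the first two bytes of the stream):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDetect {
    // Decide byte order from the two-byte 0x00FF marker at the stream start.
    static ByteOrder detect(byte[] marker) {
        if (marker[0] == 0x00 && (marker[1] & 0xFF) == 0xFF)
            return ByteOrder.BIG_ENDIAN;  // 0x00 0xFF: most significant byte first
        return ByteOrder.LITTLE_ENDIAN;   // 0xFF 0x00
    }

    // Bulk-read floats via a FloatBuffer view in the detected order.
    static float[] readFloats(byte[] payload, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.wrap(payload).order(order);
        float[] out = new float[payload.length / 4];
        buf.asFloatBuffer().get(out);
        return out;
    }
}
```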
One possibility is creating small XML files with title, ID, params, etc, and then a reference (by filename) to where the float data is contained. Assuming there's nothing special about the float data, and that Java and C are using the same floating point format, you can read that file in with readFloat() of a DataInputStream.
I like the CSV and "Protocol Buffers" answers (though, at a glance, the protocol buffer thing might be very similar to YAML for all I know).
If you need tightly packed records for high volume data, you might consider this:
Create a textual file header describing the current file structure: record sizes (types????) and field names / sizes. Read and parse the header, then use low level binary I/O operations to load up each record's fields, er, object's properties or whatever we are calling it this year.
This gives you the ability to change the structure a bit and have it be self-describing, while still allowing you to pack a high volume into a smaller space than XML would allow.
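A sketch of that idea in Java (the field names and types in the header are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SelfDescribingFile {
    // Write a one-line textual header naming the fields, followed by
    // tightly packed binary records (three int32s per record here).
    static byte[] write(int[][] records) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(raw);
        out.write("id:int32,param1:int32,param2:int32\n".getBytes(StandardCharsets.US_ASCII));
        for (int[] r : records)
            for (int v : r)
                out.writeInt(v);
        return raw.toByteArray();
    }
}
```

A reader parses the header line first, then knows the size and layout of every record that follows.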
TMTOWTDI, I guess.