Which options do I have for Java process communication?

We have a piece of code of the following form:
void processParam(Object param)
{
    wrapperForComplexNativeObject result = jniCallWhichMayCrash(param);
    processResult(result);
}
processParam is a method which is called with many different arguments.
jniCallWhichMayCrash is a native method which is intended to do some complex processing of its parameter and to create a complex object. It can crash in some cases.
wrapperForComplexNativeObject is a wrapper type generated by SWIG.
processResult is a method written in pure Java which processes its parameter by creating several kinds of objects (by "kinds" I don't mean classes, more like hierarchies):
1 - Some non-unique objects which reference each other (from the same hierarchy). These objects can have duplicates created by invocations of processParam() with different parameter values. Since it's costly to keep all the duplicates, it's necessary to cache them.
2 - Some unique objects which reference each other (from the same hierarchy) and some of the objects of the 1st kind.
After processParam has been executed for each argument from some set, the data created in processResult will be processed together. The problem is that jniCallWhichMayCrash may crash the entire JVM, and that would be very bad. The cause of a crash may be such that it happens for one argument value and not for another. We've decided that it's better to tolerate such crashes and just skip some chunks of data when they occur. To do this, we should run processParam inside a separate process and pass the result somehow (HOW? HOW?! This is the question) to the main process; then, in case of a crash, we only lose part of the data (which is OK) instead of everything else.

So the main problem now is implementing the transport between the processes. Which options do I have? I can think of serializing and transmitting binary data over streams, but serialization may not be very fast due to the objects' complexity. Do I have other options for implementing this?

Let us assume that the processes are on the same machine. Your options include:
Use Runtime.exec() (or a ProcessBuilder) to launch a new process for each request, passing the parameter object as a command-line argument or via the process's standard input, and reading the result from the process's standard output. The process exits on completion of a single request.
Use Runtime.exec() (or a ProcessBuilder) to launch a long-running process, using the process's standard input / output for sending the requests and replies. The process instance handles multiple requests.
Use a "named pipe" to send requests / replies to an existing local (or possibly remote) process.
Use raw TCP/IP Sockets or Unix Domain Sockets to send requests / replies to an existing local (or possibly remote) process.
For each of the above, you will need to design your own request formats and deal with parameter / result encoding and decoding on both sides.
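For example, here is a rough sketch of the second option (a long-running worker). It assumes the parameter and result types implement Serializable; the entry point worker.Main and the classpath are made-up names:

import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Parent side of a long-running worker process. Requests and replies are
// Java-serialized over the child's stdin/stdout; if the child JVM crashes,
// call() throws, and the caller can skip that chunk of data and start a
// fresh worker.
public class WorkerClient {
    private final Process process;
    private final ObjectOutputStream toWorker;
    private final ObjectInputStream fromWorker;

    public WorkerClient() throws Exception {
        process = new ProcessBuilder("java", "-cp", "app.jar", "worker.Main").start();
        toWorker = new ObjectOutputStream(process.getOutputStream());
        toWorker.flush(); // push the stream header so the worker can open its side
        fromWorker = new ObjectInputStream(process.getInputStream());
    }

    public Object call(Object param) throws Exception {
        toWorker.writeObject(param);
        toWorker.flush();
        return fromWorker.readObject(); // blocks; throws if the worker died
    }
}

The worker's main() mirrors this: an ObjectInputStream over System.in, an ObjectOutputStream over System.out (flushed after each reply), and a read/process/write loop.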
Implement the process as a web service and use JSON or XML (or something else) to encode the parameters and results. Depending on your chosen encoding scheme, there will be existing libraries that deal with encoding / decoding and (possibly) mapping to Java types.
SOAP / WSDL - with these, you typically design the application protocol at a higher level of abstraction, and the framework libraries take care of encoding / decoding, dispatching requests and so on.
CORBA or an equivalent like ICE. These options are like SOAP / WSDL, but using more efficient wire representations, etc.
Message queuing systems like MQSeries.
Note that the last four are normally used in systems where the client and server are on separate machines, but they work just as well (and maybe faster) when client and server are colocated.
I should perhaps add that an alternative approach is to get rid of the problematic JNI code: either replace it with pure Java code, or run it as an external command or service without a Java wrapper around it.

Have you thought about using web-inspired methods? In your case, web services could be a solution in all their diversity:
REST invocation
WSDL and all the heavyweight machinery
Even XML-RPC over HTTP, like that used by Spring remoting or JSPF net export, could inspire you

If you can isolate the responsibilities of the processes, i.e. P1 is a producer of data and P2 is a consumer, the most robust answer is to use a file to communicate your data. There is overhead (read: CPU cycles) involved in serialization/deserialization, but your processes will not crash each other, and it is very easy to debug and synchronize.
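A minimal sketch of that file-based handoff, assuming the produced data implements Serializable (the paths and class name are illustrative): the producer writes to a temporary file and renames it into place, so the consumer never observes a half-written result.

import java.io.*;
import java.nio.file.*;

class FileHandoff {
    // Producer (child process) side: serialize, then atomically rename.
    static void write(Serializable result, Path target) throws IOException {
        Path tmp = Files.createTempFile(target.getParent(), "result", ".tmp");
        try (ObjectOutputStream out =
                new ObjectOutputStream(Files.newOutputStream(tmp))) {
            out.writeObject(result);
        }
        // A crash before this move leaves only a .tmp file, which the
        // consumer simply ignores.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    // Consumer (parent process) side: deserialize a completed file.
    static Object read(Path source) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(Files.newInputStream(source))) {
            return in.readObject();
        }
    }
}

The consumer can poll the directory (or use java.nio.file.WatchService) and deserialize each completed file as it appears.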

Related

Best approach to stream multiple types with gRPC

I have a server that passes messages to a client. The messages are of different types and the server has a generic handleMessage and passMessage method for the clients.
Now I intend to adapt this and use gRPC for it. I know I could expose all methods of the server by defining services in my .proto file. But is there also a way to stream heterogeneous types with one RPC call using gRPC?
There is oneof, which allows me to define a message that has only one of its properties set. I could have a MessageContainer that is a oneof, with all my message types included in this container. The container then holds only one of the types, and I would only need to write one service method:
service MessageService {
    rpc messageHandler(ClientInfo) returns (stream MessageContainer);
}
This way, the server could stream multiple types to the client through one unique interface. Does this make sense? Or is it better to have all methods exposed individually?
UPDATE
I found this thread, which argues oneof would be the way to go. I'd like that, obviously, as it saves me from creating potentially dozens of services and stubs. It would also help ensure a FIFO setup instead of multiplexing several streams and not being sure which message came first. But it feels dirty for some reason.
Yes, this makes sense (and what you are calling MessageContainer is best understood as a sum type).
... but it is still better to define different methods when you can ("better" here means "more idiomatic, more readable by future maintainers of your system, and better able to be changed in the future when method semantics need to change").
The question of whether to express your service as a single RPC method returning a sum type or as multiple RPC methods comes down to whether or not the particular addend type that will be used can be known at RPC invocation time. Is it the case that when you set request.my_type_determining_field to 5 that the stream transmitted by the server always consists of MessageContainer messages that have their oneof set to a MyFifthKindOfParticularMessage instance? If so then you should probably just write a separate RPC method that returns a stream of MyFifthKindOfParticularMessage messages. If, however, it is the case that at RPC invocation time you don't know with certainty what the used addend types of the messages transmitted from the server will be (and "messages with different addend types in the same stream" is a sub-use-case of this), then I don't think it's possible for your service to be factored into different RPCs and the right thing for you to do is have one RPC method that returns a stream of a sum type.
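To make the dispatch concrete, here is a hedged sketch of the client side in grpc-java. It assumes a oneof named payload inside MessageContainer; protobuf's Java codegen then produces a PayloadCase enum and a getPayloadCase() accessor to switch on. MessageContainer, ChatMessage, and StatusUpdate are hypothetical generated types standing in for your actual messages.

import io.grpc.stub.StreamObserver;

// Receives the server's stream and dispatches on the oneof case.
class MessageDispatcher implements StreamObserver<MessageContainer> {
    @Override
    public void onNext(MessageContainer msg) {
        switch (msg.getPayloadCase()) {
            case CHAT_MESSAGE:
                handleChat(msg.getChatMessage());
                break;
            case STATUS_UPDATE:
                handleStatus(msg.getStatusUpdate());
                break;
            case PAYLOAD_NOT_SET:
                break; // a well-behaved server never sends an empty container
        }
    }

    @Override public void onError(Throwable t) { /* stream failed */ }
    @Override public void onCompleted() { /* stream finished */ }

    private void handleChat(ChatMessage m) { /* ... */ }
    private void handleStatus(StatusUpdate s) { /* ... */ }
}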

How does Apache Spark send functions to other machines under the hood

I started playing with Pyspark to do some data processing. It was interesting to me that I could do something like
rdd.map(lambda x : (x['somekey'], 1)).reduceByKey(lambda x,y: x+y).count()
And it would send the logic in these functions over potentially numerous machines to execute in parallel.
Now, coming from a Java background, if I wanted to send an object containing some methods to another machine, that machine would need to know the class definition of the object I'm streaming over the network. Recently Java added functional interfaces, which create an implementation of that interface for me at compile time (i.e. MyInterface impl = () -> System.out.println("Stuff");),
where MyInterface has just one method, doStuff().
However, if I wanted to send such a function over the wire, the destination machine would need to know the implementation (impl itself) in order to call its doStuff() method.
My question boils down to... How does Spark, written in Scala, actually send functionality to other machines? I have a couple hunches:
The driver streams class definitions to other machines, and those machines dynamically load them with a class loader. Then the driver streams the objects and the machines know what they are, and can execute on them.
Spark has a set of methods defined on all machines (core libraries) which are all that are needed for anything I could pass it. That is, my passed function is converted into one or more function calls on the core library. (Seems unlikely since the lambda can be just about anything, including instantiating other objects inside)
Thanks!
Edit: Spark is written in Scala, but I was interested in hearing how this might be approached in Java (where a function cannot exist unless it's in a class, thus changing the class definition which needs to be updated on worker nodes).
Edit 2:
This is the problem in Java, in case of confusion:
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.Socket;

public class Playground
{
    // The lambda's target interface must extend Serializable, or
    // writeObject() rejects it locally before it ever reaches the wire
    private interface DoesThings extends Serializable
    {
        void doThing();
    }

    public void func() throws Exception {
        Socket s = new Socket("addr", 1234);
        ObjectOutputStream oos = new ObjectOutputStream(s.getOutputStream());
        oos.writeObject("Hello!"); // Works just fine, you're just sending a string
        oos.writeObject((DoesThings) () -> System.out.println("Hey, I'm doing a thing!!")); // Sends the object, but errors on the other machine
        DoesThings dt = (DoesThings) () -> System.out.println("Hey, I'm doing a thing!!");
        System.out.println(dt.getClass());
    }
}
The System.out.println(dt.getClass()) prints:
"class JohnLibs.Playground$$Lambda$1/23237446"
Now, assume that the interface definition wasn't in the same file but in a shared file both machines had. But this driver program, func(), essentially creates a new class which implements DoesThings.
As you can see, the destination machine is not going to know what JohnLibs.Playground$$Lambda$1/23237446 is, even though it knows what DoesThings is. It all comes down to the fact that you can't pass a function without it being bound to a class. In Python you could just send a string with the definition and then execute that string (since it's interpreted). Perhaps that's what Spark does, since it uses Scala instead of Java (if Scala can have functions outside of classes).
Java bytecode, which is of course what both Java and Scala are compiled to, was created specifically to be platform independent. So if you have a classfile you can move it to any other machine, regardless of "silicon" architecture, and provided it has a JVM of at least that version, it will run. James Gosling and his team did this deliberately to allow code to move between machines right from the very start, and it was easy to demonstrate in Java 0.98 (the first version I played with).
When the JVM tries to load a class, it uses an instance of a ClassLoader. Classloaders encompass two things: the ability to fetch the binary of a bytecode file, and the ability to load the code (verify its integrity, convert it into an in-memory instance of java.lang.Class, and make it available to other code in the system). At Java 1, you mostly had to write your own classloader if you wanted to take control of how the bytes were loaded, although there was a Sun-specific AppletClassLoader, which was written to load classfiles over HTTP rather than from the file system.
A little later, at Java 1.2, the "how to fetch the bytes of the classfile" part was separated out into URLClassLoader. That can use any supported protocol to load classes; indeed, the protocol support mechanism was, and is, extensible via pluggable protocol handlers. So now you can load classes from anywhere without the risk of making mistakes in the harder part, which is verifying and installing the class.
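For example (the URL and class name here are made up), loading a class over HTTP takes only a few lines:

import java.net.URL;
import java.net.URLClassLoader;

public class RemoteLoadDemo {
    public static void main(String[] args) throws Exception {
        // Fetch classfiles from a remote codebase instead of the local disk
        URLClassLoader loader = new URLClassLoader(
                new URL[] { new URL("http://example.com/classes/") });
        Class<?> c = loader.loadClass("com.example.RemoteTask");
        Object task = c.getDeclaredConstructor().newInstance();
        System.out.println("Loaded " + task.getClass().getName());
    }
}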
Along with that, Java's RMI mechanism allows a serialized object (the class name, along with the "state" part of an object) to be wrapped in a MarshalledObject. This adds "where this class may be loaded from", represented as a URL. RMI automates the conversion of real objects in memory to MarshalledObjects and also ships them around on the network. If a JVM receives a marshalled object for which it already has the class definition, it always uses that local definition (for security). If not, however, then provided a bunch of criteria are met (security criteria, and just plain working-correctly criteria), the classfile may be loaded from that remote server, allowing a JVM to load classes for which it has never seen the definitions. (Obviously, the code for such systems must typically be written against ubiquitous interfaces; if not, there's going to be a lot of reflection going on!)
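A minimal illustration of the MarshalledObject round trip (purely local here; RMI adds the codebase URL and the network transport on top of this):

import java.rmi.MarshalledObject;

public class MarshalDemo {
    public static void main(String[] args) throws Exception {
        // Captures the serialized form (plus a codebase annotation, if
        // java.rmi.server.codebase is set) at construction time...
        MarshalledObject<String> wrapped = new MarshalledObject<>("hello");
        // ...and get() deserializes a fresh copy of the original object
        String copy = wrapped.get();
        System.out.println(copy);
    }
}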
Now, I don't know whether Spark uses the RMI infrastructure (indeed, I found your question while trying to determine exactly that). I do know that Hadoop does not, seemingly because the authors wanted to create their own system, which is fun and educational of course, rather than use a flexible, configurable, extensively-tested (including security-tested!) system.
However, all that has to happen to make this work in general are the steps I outlined for RMI. The requirements are essentially:
1) Objects can be serialized into some byte-sequence format understood by all participants.
2) When objects are sent across the wire, the receiving end must have some way to obtain the classfile that defines them. This can be a) pre-installation, b) RMI's approach of "here's where to find this", or c) the sending system ships the jar. Any of these can work.
3) Security should probably be maintained. In RMI, this requirement was rather "in your face", but I don't see it in Spark, so they either hid the configuration or perhaps just fixed what it can do.
Anyway, that's not really an answer, since I described principles, with a specific example, but not the actual specific answer to your question. I'd still like to find that!
When you submit a Spark application to the cluster, your code is deployed to all worker nodes, so your class and function definitions exist on all nodes.

Servlet and Command pattern, compile vs runtime?

I'm writing a Java servlet that acts as a Front Controller. To carry out functions I'm using the Domain Command pattern. Currently, I'm initializing all my commands and storing them in a map, with the name (string) of the command as the key and the object as the value. Whenever the servlet receives a request, I get the command from the map using the command parameter from the URL:
// at init
HashMap<String, DomainCommand> commands = new HashMap<String, DomainCommand>();
commands.put("someCommand", new SomeCommand());
// at request
String command = request.getParameter("command");
DomainCommand c = commands.get(command);
c.execute();
This works well and does what I want, since my DomainCommands have no class attributes to be shared between threads. An alternative to this is using reflection to create the object, like so:
String command = request.getParameter("command");
DomainCommand c = (DomainCommand) Class.forName(command).newInstance(); // assuming same (default) package
c.execute();
Both of these work. Which is better from a performance/memory saving point of view?
Performance
When using the Map, the only cost is a HashMap lookup (negligible). Reflection, on the other hand, can take much more time and is less safe: remember you have to make sure the user is not passing a bogus command that would allow them to run arbitrary code.
Memory
When the DomainCommands are created at startup, they will end up in the old generation after some time and thus not be subject to garbage collection most of the time. When created per request, they will most likely be garbage collected immediately. So overall the memory footprint will be comparable, except that the second approach requires more GC runs.
All in all, the map of commands is a much better approach. BTW, if you use DI frameworks like Spring or Guice (unless that's overkill for you) or web frameworks like Struts/Spring MVC, they will do precisely this work for you.
The first approach, storing the commands in a HashMap, is better. The problem with the second approach is that you have to look up the command class and create a new instance of it every time you execute that command.
In fact, frameworks like Struts are built precisely on the command pattern, with a controller servlet as the front controller and individual action classes as the commands.
From a performance perspective, the 1st approach you mentioned is definitely faster.
How about the following options?
using the Visitor pattern for commands
storing your command beans in JNDI and looking up a command bean by its name (from the request); have a service that retrieves the command from JNDI
using an IoC framework (Spring) where all the command beans are initialized at container startup and the command lookup is done on the application context
Performance-wise, I would prefer the 3rd option.
You asked for an answer specifically from a performance/memory saving point of view, and the other answers answer that. I agree that the Map approach is probably better in this regard.
However, you should be sure that this is even a concern before worrying about it at this point; I assume the network overhead of a single call to your servlet far outweighs one HashMap lookup of a short string.
A larger concern should be clarity and maintainability. In this regard as well, I would say that the Map approach is much superior, as it:
Doesn't tie the API (legal values of the command parameter) to the implementation (names of classes)
Makes it clear which classes are intended to be used as commands and which are not (very important if you later want to make a change)
Allows the API to be more flexible (for example, you could allow the command parameter to be case-insensitive, or to have more than one command map to the same class)
To quote the Zen of Python: "Explicit is better than implicit".
How about merging the two options?
Struts does exactly this. It keeps a Map that caches all the commands that have been requested through the servlet. If a command isn't in the cache, it creates a new instance of it (just like your option 2).
The advantage is quicker execution: retrieve the command from the cache, or else create a new one and store it in the cache. It's definitely faster than option 2 alone.
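A hedged sketch of that merged approach: consult a cache first and fall back to reflection only on a miss. DomainCommand is the interface from the question; the package prefix and the whitelist are assumptions (the whitelist matters because calling Class.forName() on raw request input would let a caller instantiate arbitrary classes).

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class CommandRegistry {
    private final Map<String, DomainCommand> cache = new ConcurrentHashMap<>();
    private final Set<String> allowed = Set.of("SomeCommand", "OtherCommand");

    DomainCommand lookup(String name) {
        if (!allowed.contains(name)) {
            throw new IllegalArgumentException("Unknown command: " + name);
        }
        // Create on first use, then serve every later request from the cache
        return cache.computeIfAbsent(name, n -> {
            try {
                return (DomainCommand) Class.forName("commands." + n)
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot create command " + n, e);
            }
        });
    }
}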

Ideal web service protocol for single-operation Java program with many parameters?

I have been tasked with making an existing Java program available via a web service that will be invoked through a PHP form.
The program performs biological simulations and takes a large number of input parameters (around 30 different double, double[], and String fields). These parameters are usually set up as instance variables of an "Input" class which is passed into the program, which in turn produces an "Output" class (which itself has many important fields). For example, this is how they usually call the program:
Input in = new Input();
in.a = 3.14;
in.b = 42;
in.c = 2.71;
Output out = (new Program()).run(in);
I am looking for a way to turn this into a web service in the easiest way possible. Ideally, I would want it to work like this:
1) The PHP client form formats an array containing the input data and sends it to the service.
2) The Java web service automatically parses this data and uses it to populate an Input instance.
3) The web service invokes the program and produces an Output instance, which is formatted and sent back to the client.
4) The client retrieves the output formatted as an associative array.
I have already tried a few different methods (such as a SOAP operation and wrapping the data in XML), but nothing seems as elegant as I would like. The problem is that the program's Input variable specifications are likely to change, and I would prefer that the only maintenance for such changes be on the PHP form end.
Unfortunately, I don't have enough experience with web services to make an educated decision on what my setup should be like. Does anyone have any advice for what the most effective solution would be?
IMHO JSON over REST would be best. Look here: http://db.tmtec.biz/blogs/index.php/json-restful-web-service-in-java
If your program is standalone Java, use Jetty as an embedded web server.
If the application is already running as a web application skip this.
The second phase is creating the web service. I'd recommend a RESTful web service: it is easier to implement and maintain than SOAP. You can use XML or JSON as the data format; I'd probably recommend JSON.
Now it is up to you. If you already use some kind of web framework, check whether it supports REST. If you do not use any framework, you can implement a simple REST API using a plain servlet that implements doPost(), parses the input (JSON), and calls the implementation.
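As a sketch of that plain-servlet approach, assuming the Gson library for the JSON mapping (Input, Output, and Program are the classes from the question): because Gson maps JSON keys to fields reflectively, adding a parameter later means touching only the PHP form and the Input class, not the servlet.

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.gson.Gson;

public class SimulationServlet extends HttpServlet {
    private final Gson gson = new Gson();

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // JSON body -> Input fields, matched by name
        Input in = gson.fromJson(req.getReader(), Input.class);
        Output out = new Program().run(in);
        resp.setContentType("application/json");
        gson.toJson(out, resp.getWriter()); // Output fields -> JSON body
    }
}

On the PHP side, json_decode() of the response body then yields the associative array the client wants.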
There are a lot of possibilities. You should search for data serialization formats and choose the one most suitable for this purpose.
One possibility would be to use Protocol Buffers.

web service client reference, in a servlet

I have a servlet, and that servlet uses a .NET web service to perform some function. To do this, I created the web service client in NetBeans using the "JAX-RPC" style client.
Let's say that my service name is "Tester". Then two of the generated classes are called "Tester" and "TesterSoap".
To get a reference to the web service, I need to do this:
Tester t = new Tester_Impl();
TesterSoap tsoap = t.getTesterSoap();
To use the webservice, I can then do this:
tsoap.runTest();
My question is: since this is a servlet which gets executed many times, should I store the first two lines in static variables (so they only ever get executed once), or store them locally so that they execute every time the servlet is invoked?
Another way of asking the same question: is there a performance hit every time the first two lines are called? (I'm testing everything locally, so it's hard to measure.)
Thanks...
If the default constructor and initialization blocks of the Tester_Impl class and the getTesterSoap() method don't do anything expensive (e.g. reading a file from disk, loading data from a DB, connecting a socket, etc., and I suppose they don't), then you don't need to worry about it.
You could consider declaring them as instance variables of the class extending HttpServlet. But, and this is a big but, they would then be shared among all HTTP requests, because there is only one instance of a particular servlet class during the whole application's lifetime. So if the Tester_Impl class is supposed to hold state, declaring it as an instance variable is a very bad idea: it would be shared among all requests. In other words, it's not threadsafe. If you want to ensure thread safety in servlets, declare everything within the very same method block.
I would not optimize prematurely here. Test this out in as close to a production environment as you can (i.e. not on your local box) and see what the performance hit is. What I've done in the past is write a small shell script that hits my server with wget n times with a delay of k milliseconds and then measure the latency, possibly instrumenting the code with some timing or profiling myself (or with jvisualvm or some other profiling tool).
If you want to protect your design from a possible performance hit without doing the testing, you could use a factory to provide instances of the service client; then you could swap a singleton service client for many of them whenever you feel like it.
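A minimal sketch of that factory idea, using the generated types from the question (whether a shared instance is safe depends on the generated stub being stateless, so the factory hides that decision behind one method):

// Callers ask the factory for a client instead of constructing Tester_Impl
// directly, so the policy can change in one place later.
public final class TesterClientFactory {
    private TesterClientFactory() {}

    // Current policy: a new stub per call (always safe).
    // Could later return a cached singleton if profiling justifies it.
    public static TesterSoap newClient() {
        Tester t = new Tester_Impl();
        return t.getTesterSoap();
    }
}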
