How does Apache Spark send functions to other machines under the hood - java

I started playing with Pyspark to do some data processing. It was interesting to me that I could do something like
rdd.map(lambda x : (x['somekey'], 1)).reduceByKey(lambda x,y: x+y).count()
And it would send the logic in these functions over potentially numerous machines to execute in parallel.
Now, coming from a Java background, if I wanted to send an object containing some methods to another machine, that machine would need to know the class definition of the object I'm streaming over the network. Java recently introduced functional interfaces, where the compiler creates an implementation of the interface for me (i.e. MyInterface impl = () -> System.out.println("Stuff");),
where MyInterface has just one method, doStuff().
However, if I wanted to send such a function over the wire, the destination machine would need to know the implementation (impl itself) in order to call its doStuff() method.
My question boils down to... How does Spark, written in Scala, actually send functionality to other machines? I have a couple of hunches:
The driver streams class definitions to other machines, and those machines dynamically load them with a class loader (see the sketch below). Then the driver streams the objects, and the machines know what they are and can execute on them.
Spark has a set of methods defined on all machines (core libraries) which are all that are needed for anything I could pass it. That is, my passed function is converted into one or more function calls on the core library. (This seems unlikely, since the lambda can be just about anything, including code that instantiates other objects.)
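A minimal sketch of what the first hunch could look like on the receiving side (purely hypothetical, not a claim about Spark's internals): a class loader that turns bytecode received over the network into a loadable class.

// Hypothetical: define a class from raw bytecode bytes streamed by the driver.
class NetworkClassLoader extends ClassLoader {
    public Class<?> defineFromBytes(String className, byte[] bytecode) {
        return defineClass(className, bytecode, 0, bytecode.length);
    }
}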
Thanks!
Edit: Spark is written in Scala, but I was interested in hearing how this might be approached in Java, where a function cannot exist unless it's in a class, so passing a new function means a new class definition that needs to be updated on the worker nodes.
Edit 2:
This is the problem in java in case of confusion:
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.Socket;

public class Playground
{
    // The target interface must be Serializable (or the lambda cast with an
    // extra Serializable bound) for writeObject to accept the lambda at all.
    private interface DoesThings extends Serializable
    {
        void doThing();
    }

    public void func() throws Exception {
        Socket s = new Socket("addr", 1234);
        ObjectOutputStream oos = new ObjectOutputStream(s.getOutputStream());
        oos.writeObject("Hello!"); // Works just fine, you're just sending a String
        oos.writeObject((DoesThings) () -> System.out.println("Hey, I'm doing a thing!!")); // Sends the object, but errors on the other machine
        DoesThings dt = (DoesThings) () -> System.out.println("Hey, I'm doing a thing!!");
        System.out.println(dt.getClass());
    }
}
The System.out.println(dt.getClass()) line prints:
"class JohnLibs.Playground$$Lambda$1/23237446"
Now, assume the interface definition wasn't in the same file, but in a shared file both machines had. This driver program, func(), still essentially creates a new class which implements DoesThings.
As you can see, the destination machine is not going to know what JohnLibs.Playground$$Lambda$1/23237446 is, even though it knows what DoesThings is. It all comes down to the fact that you can't pass a function without it being bound to a class. In Python you could just send a string with the definition and then execute that string (since it's interpreted). Perhaps that's what Spark does, since it uses Scala instead of Java (if Scala can have functions outside of classes).

Java bytecode, which is, of course, what both Java and Scala are compiled to, was created specifically to be platform independent. So, if you have a classfile you can move it to any other machine, regardless of "silicon" architecture, and provided it has a JVM of at least that version, it will run. James Gosling and his team did this deliberately to allow code to move between machines right from the very start, and it was easy to demonstrate in Java 0.98 (the first version I played with).
When the JVM tries to load a class, it uses an instance of a ClassLoader. Classloaders encompass two things: the ability to fetch the binary of a bytecode file, and the ability to load the code (verify its integrity, convert it into an in-memory instance of java.lang.Class, and make it available to other code in the system). At Java 1, you mostly had to write your own classloader if you wanted to take control of how the bytes were loaded, although there was a Sun-specific AppletClassLoader, which was written to load classfiles over HTTP rather than from the file system.
A little later, at Java 1.2, the "how to fetch the bytes of the classfile" part was separated out into URLClassLoader. That can use any supported protocol to load classes. Indeed, the protocol support mechanism was and is extensible via pluggable protocol handlers. So, now you can load classes from anywhere without the risk of making mistakes in the harder part, which is how you verify and install the class.
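To make that concrete, here is a minimal sketch of loading a class from a remote location with URLClassLoader (the URL and class name are hypothetical, and this is not a claim about Spark's own mechanism):

import java.net.URL;
import java.net.URLClassLoader;

public class RemoteLoadDemo {
    public static void main(String[] args) throws Exception {
        // Point a URLClassLoader at a remote location serving classfiles or jars.
        URLClassLoader loader = new URLClassLoader(
                new URL[] { new URL("http://some-driver-host/jobs/classes/") });
        Class<?> clazz = loader.loadClass("com.example.SomeShippedFunction");
        Object instance = clazz.getDeclaredConstructor().newInstance();
        System.out.println("Loaded " + instance.getClass().getName());
        loader.close();
    }
}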
Along with that, Java's RMI mechanism allows a serialized object (the class name, along with the "state" part of an object) to be wrapped in a MarshalledObject. This adds "where this class may be loaded from", represented as a URL. RMI automates the conversion of real objects in memory to MarshalledObjects and also ships them around on the network. If a JVM receives a marshalled object for which it already has the class definition, it always uses that class definition (for security). If not, however, then provided a bunch of criteria are met (security criteria, and just plain working-correctly criteria), the classfile may be loaded from that remote server, allowing a JVM to load classes for which it has never seen the definitions. (Obviously, the code for such systems must typically be written against ubiquitous interfaces--if not, there's going to be a lot of reflection going on!)
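A minimal sketch of the MarshalledObject part (the payload class is hypothetical; the annotated codebase comes from the java.rmi.server.codebase property of the sending JVM):

import java.io.Serializable;
import java.rmi.MarshalledObject;

public class MarshalDemo {
    // Hypothetical payload class; it must be Serializable.
    static class SomePayload implements Serializable {
        final String message;
        SomePayload(String message) { this.message = message; }
    }

    public static void main(String[] args) throws Exception {
        MarshalledObject<SomePayload> wrapped = new MarshalledObject<>(new SomePayload("hello"));
        // get() deserializes the wrapped object; in RMI, if the receiving JVM
        // does not know the class, it may fetch it from the codebase URL that
        // was recorded when the object was marshalled.
        SomePayload copy = wrapped.get();
        System.out.println(copy.message);
    }
}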
Now, I don't know whether Spark uses the RMI infrastructure (indeed, I found your question while trying to determine exactly that). I do know that Hadoop does not, seemingly because the authors wanted to create their own system--which is fun and educational, of course--rather than use a flexible, configurable, extensively tested (including security-tested!) system.
However, all that has to happen to make this work in general are the steps I outlined for RMI. The requirements are essentially:
1) Objects can be serialized into some byte-sequence format understood by all participants.
2) When objects are sent across the wire, the receiving end must have some way to obtain the classfile that defines them. This can be a) pre-installation, b) RMI's approach of "here's where to find this", or c) the sending system ships the jar. Any of these can work.
3) Security should probably be maintained. In RMI, this requirement was rather "in your face", but I don't see it in Spark, so they either hid the configuration, or perhaps just fixed what it can do.
Anyway, that's not really an answer, since I described principles, with a specific example, but not the actual specific answer to your question. I'd still like to find that!

When you submit a Spark application to the cluster, your code is deployed to all worker nodes, so your class and function definitions exist on all nodes.
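As a concrete illustration, here is roughly the question's PySpark snippet written against Spark's Java API (a sketch; the record shape and local setup are assumptions). The functional interfaces this API accepts (PairFunction, Function2, and so on) all extend java.io.Serializable, so the lambda instance is serialized and shipped to the executors, while its class definition is already on the workers because the application jar was distributed at submit time.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class PairCountExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("PairCountExample").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            Map<String, String> record = new HashMap<>();
            record.put("somekey", "value");
            JavaRDD<Map<String, String>> rdd = sc.parallelize(Arrays.asList(record, record));

            long count = rdd
                    // This lambda becomes a PairFunction, which extends
                    // java.io.Serializable, so Spark serializes it (plus any
                    // captured variables) and ships it to the executors.
                    .mapToPair(x -> new Tuple2<>(x.get("somekey"), 1))
                    .reduceByKey((a, b) -> a + b)
                    .count();

            System.out.println(count);
        }
    }
}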

Related

Why use enums when they create dependencies across teams?

I know enums are used when we are expecting only a set of values to be passed. We don't want the caller to pass anything other than the well-defined set.
And this works very well inside a project, because you know what you have to pass.
But consider 2 projects: I am using the models of the 1st project in the 2nd.
Second project has a method like this.
public void updateRefundMode(RefundMode refundMode)
enum RefundMode { CASH, CARD, GIFT_VOUCHER }
Now I realise RefundMode can also be PHONEPE, so if I start passing this to the 1st project, it would fail at their end (unable to deserialize enum PHONEPE), even though I've added this value at my end.
Which is fine, because if my first project doesn't know about PHONEPE, then it doesn't know how to handle it, so it has to update its models too.
But my problem is: let's imagine I'm passing a complex object which also carries this RefundMode. When I pass a new RefundMode, shouldn't just this field become null or be ignored at their end, rather than the whole object being rejected and the entire flow/request breaking?
Is there a way I can tell Jackson (via JSON properties) to just ignore that field if an unknown value is passed? Curious to know... (although in that case I'm breaking the rules of the enum). So why not keep a String, which solves all these problems?
It's all about contracts.
When you are in a client/server situation, be it a mobile app and a web server, or a Java library (jar) and another Java project, you have to keep the contracts in mind.
As you observed, a change in contracts needs to be propagated to both parties: the client and the server (supplier).
One way of working with this is to use versioning. You may say: "Version 1: those are the refund modes.". Then the mobile app may call the web server by specifying the contract version in the URL: /api/v1/refund?mode=CASH
When the contract needs to be changed, you need to consider what to do with the clients. In the case of mobile apps, the users might not have updated their app to the latest version, so their app may still be calling /api/v1 (and not supporting new refund modes). In that case, you may want to support both /api/v1 and /api/v2 (with the new refund mode) in your web server.
As your example shows, it is not always possible to transparently adapt one contract version to another (in your example, there is no good equivalent to PHONEPE in the original enum). If you have to deal with contract updates, I suggest explicitly writing code to them (you can use dedicated JSON schemas, classes and services) instead of trying to bridge the gaps. Think of what would happen with a third, fourth version.
Edit: to answer your last question, you can ignore unknown fields in JSON by following this answer (with the caveats explained above): https://stackoverflow.com/a/59307683/2223027
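A minimal sketch of that approach with Jackson (the wrapper class and field names are hypothetical): the DeserializationFeature below maps unknown enum constants to null instead of failing the whole request.

import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class RefundModeDemo {
    enum RefundMode { CASH, CARD, GIFT_VOUCHER }

    // Hypothetical wrapper object that carries the refund mode among other fields.
    static class RefundRequest {
        public String orderId;
        public RefundMode refundMode;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper()
                // Unknown constants such as "PHONEPE" become null instead of
                // causing the whole object to fail deserialization.
                .configure(DeserializationFeature.READ_UNKNOWN_ENUM_VALUES_AS_NULL, true);

        RefundRequest req = mapper.readValue(
                "{\"orderId\":\"42\",\"refundMode\":\"PHONEPE\"}", RefundRequest.class);
        System.out.println(req.refundMode); // prints: null
    }
}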
Edit 2: in general, using Enums is a form of strong typing. Sure, you could use Strings, or even bits, but then it would be easier to make mistakes, like using GiftVoucher instead of GIFT_VOUCHER.

Generating code for converting between classes

In one of the projects I'm working on, we have different systems.
Since those systems should evolve independently, we have a number of CommunicationLib classes to handle communication between those systems.
CommunicationLib objects are not used inside any system, only in communication between systems.
Since much of the functionality requires data retrieval, I am often forced to create "local" system objects that mirror the CommLib objects. I use a converter utility class to convert from such objects to CommLib objects.
The code might look like this:
public static CommLibObjX objXToCommLib(objX p) {
    CommLibObjX b = new CommLibObjX();
    b.setAddressName(p.getAddressName());
    b.setCityId(p.getCityId());
    b.setCountryId(p.getCountryId());
    b.setFieldx(p.getFieldx());
    b.setFieldy(p.getFieldy());
    [...]
    return b;
}
Is there a way to generate such code automatically, using Eclipse or other tools? Some fields might have different names, but I would like to generate a draft of the converter method and edit it manually.
Try Apache commons-beanutils:
BeanUtils.copyProperties(b, p);
It copies property values from the origin bean to the destination bean for all cases where the property names are the same (note that the destination bean is the first argument).
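A minimal sketch of how that might look for the converter above (assuming commons-beanutils is on the classpath; only properties whose names match are copied, and the commented-out mapping is a hypothetical example):

import org.apache.commons.beanutils.BeanUtils;

public class ObjXConverter {
    public static CommLibObjX objXToCommLib(objX p) throws Exception {
        CommLibObjX b = new CommLibObjX();
        // Destination first, origin second: copies every matching property.
        BeanUtils.copyProperties(b, p);
        // Fields with different names still have to be mapped by hand, e.g.:
        // b.setAddressName(p.getStreetName());
        return b;
    }
}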
If you feel the need to have source code automatically generated, you are probably doing something wrong. I think you need to reexamine the design of the communication between your two "systems". How do these "systems" communicate?
If they are on different computers or in different processes, design a wire protocol for them to use, rather than serializing objects.
If they are classes used together, design better entity classes, which are suitable for them both.

What exactly is the point of the codebase in Java RMI?

I'm currently learning about RMI.
I don't really understand the concept of the codebase. Every paper I read suggests that the client which calls the remote object can load the method definitions from the codebase.
The problem is: don't I need the descriptions/interfaces in my classpath anyway? How can I call methods on the remote object if I only learn about them at runtime? That wouldn't even compile.
Am I completely missing the point here? What exactly is the point of the codebase then? It seems like a lot of extra work and requirements to provide a codebase.
Thanks
Well, let's say you provide your client with only the interfaces, while the implementations are located in a given codebase. The client then asks the server to send it a given object; the client expects to receive an object that implements a given interface, but the actual implementation is unknown to it. When the client deserializes the sent object, that is when it has to go to the codebase and download the implementing class for the actual object being passed.
This makes the client very thin, and you can easily update your classes in the codebase without having to update every single client.
EDIT
Let's say you have an RMI server with the following interface:
public interface MiddleEarth extends java.rmi.Remote {
    List<Creature> getAllCreatures() throws java.rmi.RemoteException;
}
The client will only have the interfaces for MiddleEarth and Creature in its class path, but none of the implementations.
The implementations of Creature are serializable objects of type Elf, Man, Dwarf and Hobbit. These implementations are located in your codebase, but not in your client's class path.
When you ask your RMI server to send you the list of all creatures in Middle Earth, it will send objects that implement Creature, that is, any of the classes listed above.
When the client receives the serialized objects, it has to look for the class files in order to deserialize them, but these are not located in the local class path. Every object in this stream comes tagged with the codebase that can be used to look for missing classes. Therefore, the client resorts to the codebase to look for these classes, and there it will find the actual creature classes being used.
The codebase works in both directions, so if you send your server a Creature (e.g. an Ent) it will look for it in the codebase as well.
This means that when the client and server need to publish new types of creatures, all they have to do is update creaturesImpl.jar in the codebase, and nothing in the server or client applications themselves.
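A minimal sketch of how the codebase gets wired up on the server side (the URL and the MiddleEarthImpl class are hypothetical, and newer JDKs restrict or disable remote class loading by default): the serializing JVM annotates the stream with java.rmi.server.codebase, and a client that trusts that location can download the classes it is missing.

import java.rmi.registry.LocateRegistry;
import java.rmi.server.UnicastRemoteObject;
import java.util.Collections;
import java.util.List;

public class MiddleEarthServerMain {
    // Trivial hypothetical implementation of the remote interface.
    static class MiddleEarthImpl implements MiddleEarth {
        public List<Creature> getAllCreatures() { return Collections.emptyList(); }
    }

    public static void main(String[] args) throws Exception {
        // Where clients (and the registry) may download classes they don't have.
        System.setProperty("java.rmi.server.codebase",
                "http://files.example.com/creaturesImpl.jar");

        MiddleEarth stub = (MiddleEarth) UnicastRemoteObject.exportObject(new MiddleEarthImpl(), 0);
        LocateRegistry.createRegistry(1099).rebind("MiddleEarth", stub);
    }
}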

Which options do I have for Java process communication?

We have a place in the code of the following form:
void processParam(Object param)
{
    wrapperForComplexNativeObject result = jniCallWhichMayCrash(param);
    processResult(result);
}
processParam - a method which is called with many different arguments.
jniCallWhichMayCrash - a native method which is intended to do some complex processing of its parameter and to create some complex object. It can crash in some cases.
wrapperForComplexNativeObject - a wrapper type generated by SWIG.
processResult - a method written in pure Java which processes its parameter by creating several kinds (by kinds I don't mean classes, more like hierarchies) of objects:
1 - Some non-unique objects which reference each other (from the same hierarchy). These objects can have duplicates created by invocations of processParam() with different parameter values; since it's costly to keep all the duplicates, it's necessary to cache them.
2 - Some unique objects which reference each other (from the same hierarchy) and some of the objects of the 1st kind.
After processParam has been executed for each argument from some set, the data created in processResult will be processed together. The problem is that the jniCallWhichMayCrash method may crash the entire JVM, which would be very bad. The crash may happen for one argument value and not for another. We've decided that it's better to tolerate such crashes and just skip some chunks of data when they occur. To do this, we should run the processParam function inside a separate process and pass the result somehow (HOW? HOW?! This is the question) back to the main process; in case of any crashes we will only lose part of the data (that's OK) without losing everything else. So for now the main problem is implementing the transport between the processes. Which options do I have? I can think about serialization and transmitting the binary data over streams, but serialization may not be very fast due to the objects' complexity. Maybe I have other options for implementing this?
Let us assume that the processes are on the same machine. Your options include:
Use Runtime.exec() to launch a new process for each request, passing the parameter object as a command-line argument or via the process's standard input, and reading the result from the process's standard output. The process exits after completing a single request.
Use Runtime.exec() to launch a long-running process, using the process's standard input/output for sending the requests and replies. The process instance handles multiple requests (see the sketch after this list).
Use a "named pipe" to send requests / replies to an existing local (or possibly remote) process.
Use raw TCP/IP Sockets or Unix Domain Sockets to send requests / replies to an existing local (or possibly remote) process.
For each of the above, you will need to design your own request formats and deal with parameter / result encoding and decoding on both sides.
Implement the process as a web service and use JSON or XML (or something else) to encode the parameters and results. Depending on your chosen encoding scheme, there will be existing libraries that deal with encoding/decoding and (possibly) mapping to Java types.
SOAP / WSDL - with these, you typically design the application protocol at a higher level of abstraction, and the framework libraries take care of encoding / decoding, dispatching requests and so on.
CORBA or an equivalent like ICE. These options are like SOAP / WSDL, but using more efficient wire representations, etc.
Message queuing systems like MQ-series.
Note that the last four are normally used in systems where the client and server are on separate machines, but they work just as well (and maybe faster) when client and server are colocated.
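Here is a minimal sketch of the second option (the jar and class names are hypothetical; requests and replies must be Serializable). The parent launches a long-running worker JVM and exchanges serialized objects over the worker's standard input and output, so a crash in the worker only loses the chunk currently being processed:

import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class ParentProcess {
    public static void main(String[] args) throws Exception {
        // Launch the worker JVM that wraps the crash-prone JNI call. The worker
        // is expected to read requests with an ObjectInputStream on System.in
        // and write replies with an ObjectOutputStream on System.out.
        Process worker = new ProcessBuilder("java", "-cp", "worker.jar", "WorkerMain").start();

        try (ObjectOutputStream out = new ObjectOutputStream(worker.getOutputStream());
             ObjectInputStream in = new ObjectInputStream(worker.getInputStream())) {
            out.writeObject("some parameter"); // any Serializable request object
            out.flush();
            Object result = in.readObject();   // the worker's Serializable reply
            System.out.println("Worker returned: " + result);
        } catch (Exception e) {
            // If the worker JVM crashed, only this chunk of data is lost.
            System.err.println("Worker failed for this parameter: " + e);
        }
    }
}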
I should perhaps add that an alternative approach is to get rid of the problematic JNI code. Either replace it with pure Java code, or run it as an external command or service without a Java wrapper around it.
Have you thought about using web-inspired methods? In your case, web services in all their diversity could be a solution:
REST invocation
WSDL and all the heavyweight mechanisms
Even XML-RPC over HTTP, like the one used by Spring remoting or JSPF net export, could inspire you
If you can isolate the responsibilities of the processes, i.e. P1 is a producer of data and P2 is a consumer, the most robust answer is to use a file to communicate your data. There is overhead (read: CPU cycles) involved in serialization/deserialization, but your processes will not crash and it is very easy to debug/synchronize.

Getting to Guice created objects from dumb data objects

I've taken the plunge and used Guice for my latest project. Overall impressions are good, but I've hit an issue that I can't quite get my head around.
Background: It's a Java6 application that accepts commands over a network, parses those commands, and then uses them to modify some internal data structures. It's a simulator for some hardware our company manufactures. The changes I make to the internal data structures match the effect the commands have on the real hardware, so subsequent queries of the data structures should reflect the hardware state based on previously run commands.
The issue I've encountered is that the command objects need to access those internal data structures. Those structures are being created by Guice because they vary depending on the actual instance of the hardware being emulated. The command objects are not being created by Guice because they're essentially dumb objects: they accept a text string, parse it, and invoke a method on the data structure.
The only way I can get this all to work is to have those command objects be created by Guice and pass in the data structures via injection. It feels really clunky and totally bloats the constructor of the data objects.
What have I missed here?
Dependency injection works best for wiring services. It can be used to inject value objects, but this can be a bit awkward especially if those objects are mutable.
That said, you can use Providers and @Provides methods to bind objects that you create yourself.
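For instance, a minimal sketch of a @Provides method in a module (the Hardware type here is a hypothetical stand-in for one of your internal data structures):

import com.google.inject.AbstractModule;
import com.google.inject.Provides;
import com.google.inject.Singleton;

public class SimulatorModule extends AbstractModule {
    // Hypothetical stand-in for the emulated hardware's internal data structure.
    static class Hardware {
        final String model;
        Hardware(String model) { this.model = model; }
    }

    @Provides
    @Singleton
    Hardware provideHardware() {
        // You construct the object however you like; Guice just hands it out
        // to injection points that ask for Hardware.
        return new Hardware("model-x");
    }
}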
Assuming that responding to a command is not that different from responding to a http request, I think you're going the right path.
A commonly used pattern in HTTP applications is to wrap the logic of the application in short-lived objects that have both parameters from the request and some backends injected. Then you instantiate such an object and call a simple, parameterless method that does all the magic.
Maybe scopes could inspire you somehow? Look into the documentation and some code examples for the technical details. In code it looks more or less like this; here's how it might work for your case:
class MyRobot {
    Scope myScope; // a seedable scope, e.g. the SimpleScope from Guice's custom-scopes docs
    Injector i;

    public void doCommand(Command c) {
        myScope.seed(Key.get(Command.class), c);
        i.getInstance(Handler.class).doSomething();
    }
}

class Handler {
    private final Command c;
    private final Hardware h;

    @Inject
    public Handler(Command c, Hardware h) {
        this.c = c;
        this.h = h;
    }

    public boolean doSomething() {
        h.doCommand(c);
        // or c.modifyState(h) if you want c to access internals of h
        return true;
    }
}
Some people frown upon injecting value objects this way, and it is a bit awkward, but I've seen it in at least two projects that relied heavily on Guice, and it worked well.
Granted, you'll inject a few value objects in the constructors, but if you think of them not as value objects but as parameters of the class that change its behaviour, it all makes sense.
