Share SparkContext between Java and R Apps under the same Master - java

So here is the setup.
Currently I have two Spark applications initialized. I need to pass data between them (preferably through a shared SparkContext/SQLContext so I can just query a temp table). I currently use Parquet files for DataFrame transfer, but is there any other way?
The MasterURL points to the same Spark master.
Start Spark via Terminal:
/opt/spark/sbin/start-master.sh;
/opt/spark/sbin/start-slave.sh spark://`hostname`:7077
Java App Setup:
// conf sets the master to MasterURL, 6 GB of executor memory, and 4 cores
JavaSparkContext context = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(context.sc());
Then I register an existing DataFrame later on:
//existing dataframe to temptable
df.registerTempTable("table");
SparkR App Setup:
sc <- sparkR.init(master='MasterURL', sparkEnvir=list(spark.executor.memory='6G', spark.cores.max='4'))
sqlContext <- sparkRSQL.init(sc)
# attempt to get temptable
df <- sql(sqlContext, "SELECT * FROM table"); # throws the error

As far as I know it is not possible given your current configuration. Tables created using registerTempTable are bound to the specific SQLContext that was used to create the corresponding DataFrame. Even if your Java and SparkR applications use the same master, their drivers run on separate JVMs and cannot share a single SQLContext.
There are tools, like Apache Zeppelin, which take a different approach: a single SQLContext (and SparkContext) is exposed to the individual backends. This way you can register a table using, for example, Scala and read it from Python. There is a fork of Zeppelin which provides some support for SparkR and R; you can check how it starts and interacts with the R backend.
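If the two contexts cannot be shared, the file-based handoff you already use remains the practical route. Here is a minimal sketch of that approach, assuming a location both applications can read (the path below is a placeholder):
// Java side: persist the registered DataFrame to a shared location.
// "/shared/tables/table.parquet" is a placeholder path.
df.write().parquet("/shared/tables/table.parquet");
On the SparkR side the data can then be loaded back, for example with read.df(sqlContext, "/shared/tables/table.parquet", "parquet"), and registered as a temp table in that application's own SQLContext.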

Related

How can I add classes or jars when connecting to an already existing Spark cluster?

I'm writing a program that accesses a Spark cluster as a client. It connects like this:
val sc = new SparkContext(new SparkConf(loadDefaults = false)
.setMaster(sparkMasterEndpoint)
.setAppName("foo")
.set("spark.cassandra.connection.host", cassandraHost)
.setJars(Seq("target/scala-2.11/foo_2.11-1.0.jar"))
)
And that context is then used to run operations on Spark. However, any lambdas / anonymous functions I use in my code can't run on Spark. For example, I might have:
val groupsDescription = sc.someRDD()
.groupBy(x => x.getSomeString())
.map(x => x._1 + " " + x._2.count(_ => true))
This returns a lazily evaluated RDD, but when I try to extract some value from that RDD, I get this exception from Spark:
java.lang.ClassNotFoundException: my.app.Main$$anonfun$groupify$1$$anonfun$2$$anonfun$apply$1
This happens even though I've supplied my application's JAR file to Spark. I even see a log line (in my application, not on my Spark cluster) telling me the JAR has been uploaded:
[info] o.a.s.SparkContext - Added JAR target/scala-2.11/foo_2.11-1.0.jar at spark://192.168.51.15:53575/jars/foo_2.11-1.0.jar with timestamp 1528320841157
I can find absolutely NOTHING on this subject anywhere, and it's driving me crazy! How has nobody else run into this issue? All the related results I see are about bundling your JARs for use with spark-submit, which is not what I'm doing; I have a standalone application that connects to an independent Spark cluster. Is this simply not supported? What else could I be missing? What else could be causing this?

BaseX and Java: could modules be used in Java API of BaseX

Say I have two .xqy files. One references functions from the other by importing it as a module.
I want to port my .xqy files to use the Java API of BaseX. I found some Java_Examples of BaseX. In RunQueries.java the query is placed in a string variable. Could I reference a function from another module like this:
String query = "mymodule:testFunction()";
// Process the query by using the database command
System.out.println("\n* Use the database command:");
query(query);
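One way this might work (a sketch, not verified against the BaseX examples): declare the module import in the query prolog itself, so the same query() helper used in RunQueries.java can execute it. The namespace URI and file name below are assumptions for illustration.
// Hypothetical module namespace and location; adjust to your own .xqy files.
String query =
    "import module namespace mymodule = 'http://example.org/mymodule' "
  + "at 'mymodule.xqy'; "
  + "mymodule:testFunction()";
query(query);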

pyspark: call a custom java function from pyspark. Do I need Java_Gateway?

I wrote the following MyPythonGateway.java so that I can call my custom java class from Python:
import py4j.GatewayServer;

public class MyPythonGateway {

    public String findMyNum(String input) {
        return MyUtiltity.parse(input).getMyNum();
    }

    public static void main(String[] args) {
        // expose this class as the Py4J entry point
        GatewayServer server = new GatewayServer(new MyPythonGateway());
        server.start();
    }
}
and here is how I used it in my Python code:
from py4j.java_gateway import JavaGateway

def main():
    gateway = JavaGateway()  # connect to the JVM
    myObj = gateway.entry_point.findMyNum("1234 GOOD DAY")
    print(myObj)

if __name__ == '__main__':
    main()
Now I want to use MyPythonGateway.findMyNum() function from PySpark, not just a standalone python script. I did the following:
myNum = sparkcontext._jvm.myPackage.MyPythonGateway.findMyNum("1234 GOOD DAY")
print(myNum)
However, I got the following error:
... line 43, in main:
myNum = sparkcontext._jvm.myPackage.MyPythonGateway.findMyNum("1234 GOOD DAY")
File "/home/edamameQ/spark-1.5.2/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 726, in __getattr__
py4j.protocol.Py4JError: Trying to call a package.
So what did I miss here? I don't know whether I should run MyPythonGateway as a separate Java application to start a gateway server when using PySpark. Please advise. Thanks!
Below is exactly what I need:
input.map(f)

def f(row):
    # call MyUtility.java
    # x = MyUtility.parse(row).getMyNum()
    # return x
What would be the best way to approach this? Thanks!
First of all, the error you see usually means the class you're trying to use is not accessible, so most likely it is a CLASSPATH issue.
Regarding the general idea, there are two important issues:
you cannot access the SparkContext inside an action or transformation, so using the PySpark gateway won't work (see How to use Java/Scala function from an action or a transformation? for some details). If you want to use Py4J from the workers you'll have to start a separate gateway on each worker machine.
you really don't want to pass data between Python and the JVM this way. Py4J is not designed for data-intensive tasks.
In PySpark, before calling the method
myNum = sparkContext._jvm.myPackage.MyPythonGateway.findMyNum("1234 GOOD DAY")
you have to import the MyPythonGateway Java class as follows:
java_import(sparkContext._jvm, "myPackage.MyPythonGateway")
myPythonGateway = sparkContext._jvm.MyPythonGateway()
myPythonGateway.findMyNum("1234 GOOD DAY")
Specify the JAR containing myPackage.MyPythonGateway with the --jars option of spark-submit.
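For example, the invocation might look like this (the JAR path and script name are placeholders):
spark-submit --jars /path/to/my-gateway.jar my_pyspark_script.py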
If input in input.map(f) is an RDD, for example, this might work, since you can't access the JVM variable (attached to the SparkContext) inside an executor's map function (and to my knowledge there is no equivalent of @transient lazy val in PySpark).
import py4j.java_gateway

def pythonGatewayIterator(iterator):
    results = []
    # each partition opens its own connection to a gateway running on that worker
    jvm = py4j.java_gateway.JavaGateway().jvm
    mygw = jvm.myPackage.MyPythonGateway()
    for value in iterator:
        results.append(mygw.findMyNum(value))
    return results
inputs.mapPartitions(pythonGatewayIterator)
All you need to do is compile the JAR and add it to the PySpark classpath with the --jars or --driver-class-path spark-submit options. Then access the class and method with the code below:
sc._jvm.com.company.MyClass.func1()
where sc is the SparkContext.
Tested with Spark 2.3. Keep in mind that you can call a JVM class method only from the driver program, not from an executor.

How to submit spark job from within java program to standalone spark cluster without using spark-submit?

I am using Spark to perform some computations, but I want the job to be submitted from a Java application. It works properly when submitted using the spark-submit script. Has anyone tried to do this?
Thanks.
Don't forget to add the fat JAR containing your code to the context.
val conf = new SparkConf()
  .setMaster(...)
  .setAppName(...)
  .setJars(Seq("/path/to/code.jar"))
val sc = new SparkContext(conf)
As long as you have a master and an available worker started, you should be able to, provided you have the following in your Java application:
String master = "spark://IP:7077"; //set IP address to that of your master
String appName = "Name of your Application Here";
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
I was able to run JUnit tests from within IntelliJ that utilized the JavaSparkContext without having to use the spark-submit script. I am running into issues when performing actions on DataFrames, though (not sure if that's related).
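Putting the two answers together, a Java version that also ships the application JAR might look like this (the master URL, app name, and JAR path are placeholders):
SparkConf conf = new SparkConf()
    .setAppName("Name of your Application Here")
    .setMaster("spark://IP:7077")                    // address of your master
    .setJars(new String[] { "/path/to/code.jar" });  // fat JAR containing your classes
JavaSparkContext sc = new JavaSparkContext(conf);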

Running Pig Jobs remotely

I am learning Pig and want to run a Pig script on a remote cluster through Java code using PigServer. Can anybody guide me on how to achieve this? Thanks in advance.
Can the above code be used to make a remote call, i.e. Pig is installed on cluster1 and the call is made from an application server outside the cluster?
You have to use the PigServer class to connect to your cluster, register your Pig queries, and get results. You can either run a script by passing the path of a file on disk, or write your Pig script lines directly and pass them as Java strings.
To run a Pig script from a file on disk:
PigServer pig = new PigServer(ExecType.MAPREDUCE);
pig.registerScript("/path/to/test.pig");
To pass your Pig program as Strings:
PigServer pig = new PigServer(ExecType.MAPREDUCE);
pig.registerQuery("A = LOAD 'something' USING PigLoader();");
You can get back the results, for example, like this:
Iterator<Tuple> i = pig.openIterator("A");
HashMap<Integer, Integer> map = new HashMap<Integer, Integer>();
while (i.hasNext()) {
    Integer val = DataType.toInteger(i.next().get(0));
    map.put(val, val);
}
Note that you need to have some properties on your classpath, namely fs.default.name and mapred.job.tracker, or you can just pass them to the PigServer constructor.
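For instance, a sketch of the constructor route mentioned above (the host names and ports are placeholders):
// Point PigServer at the remote cluster by passing the endpoints directly.
Properties props = new Properties();
props.setProperty("fs.default.name", "hdfs://namenode-host:8020");
props.setProperty("mapred.job.tracker", "jobtracker-host:8021");
PigServer pig = new PigServer(ExecType.MAPREDUCE, props);
pig.registerScript("/path/to/test.pig");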
