Running Pig Jobs remotely

Running Pig Jobs remotely - java

I am learning Pig and want to run a Pig script on a remote cluster from Java code using PigServer. Can anybody guide me on how to achieve this? Thanks in advance.

Can the above code be used for a remote call, i.e. when Pig is installed on cluster1 and the call is made from an application server outside the cluster?

You have to use the PigServer class to connect to your cluster, register your Pig queries and get results. You can either run a script by passing the path of a file on your disk, or write your Pig script lines directly and pass them as Java strings.
To pass a Pig script by filename:
PigServer pig = new PigServer(ExecType.MAPREDUCE);
pig.registerScript("/path/to/test.pig");
To pass your Pig program as Strings:
PigServer pig = new PigServer(ExecType.MAPREDUCE);
pig.registerQuery("A = LOAD 'something' USING PigLoader();");
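You can get the results back like this, for example:
// requires java.util.Iterator, java.util.HashMap, org.apache.pig.data.Tuple and org.apache.pig.data.DataType
Iterator<Tuple> i = pig.openIterator("A");
HashMap<Integer, Integer> map = new HashMap<Integer, Integer>();
while (i.hasNext()) {
    // read the first field of each tuple as an Integer
    Integer val = DataType.toInteger(i.next().get(0));
    map.put(val, val);
}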
Note that the properties fs.default.name and mapred.job.tracker must be available, either from configuration on your classpath or passed in a Properties object to the PigServer constructor. They are what point the client at the remote cluster.
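As a rough sketch of the second option, the two settings can be collected in a java.util.Properties object and handed to the PigServer(ExecType, Properties) constructor. The host names and ports below are placeholders, not values from the question, and the PigServer call is left as a comment so the snippet runs without Pig on the classpath.

```java
import java.util.Properties;

public class PigClusterProperties {

    // Collects the two cluster settings named in the answer.
    static Properties forCluster(String nameNodeUri, String jobTracker) {
        Properties props = new Properties();
        props.setProperty("fs.default.name", nameNodeUri);
        props.setProperty("mapred.job.tracker", jobTracker);
        return props;
    }

    public static void main(String[] args) {
        // Placeholder addresses (assumptions); use those of your own cluster.
        Properties props = forCluster("hdfs://namenode.example.com:8020",
                                      "jobtracker.example.com:8021");
        // With Pig on the classpath, the object would then be passed along:
        //   PigServer pig = new PigServer(ExecType.MAPREDUCE, props);
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Passing the properties explicitly keeps the cluster address in the application's own configuration, which may be convenient when the caller runs outside the cluster.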

Related

Java Linux Shell Application

So I just received a task for creating a Java Shell App, without using any 3rd party libraries, and without using Runtime.exec() or ProcessBuilder APIs.
I don't want the solution (obviously I want to do this myself), but I do need a hint on how to do this. I want the app to open a shell prompt that accepts various commands, using JDK 8 (Nashorn?).
Thanks!

It is not really clear what you want to achieve. If you want to run a Nashorn shell, you can do it like this (Java 8):
import jdk.nashorn.tools.Shell;
public class NashornShell {
    public static void main(String[] args) {
        // -scripting enables shell extensions such as $EXEC
        Shell.main(new String[]{ "-scripting"});
    }
}
When you see the Nashorn prompt jjs> you can execute Linux commands...
jjs> $EXEC("ls");
which will list the current directory (using the Linux ls command).
... or execute Java commands ...
jjs> java.lang.System.out.println("foo");
... or execute JavaScript commands ...
jjs> print("foo");
For more information, have a look at the Nashorn guide.
Edit: if you want to pass only yourCommand as a parameter:
// requires jdk.nashorn.api.scripting.NashornScriptEngineFactory and javax.script.ScriptEngine
NashornScriptEngineFactory factory = new NashornScriptEngineFactory();
ScriptEngine engine = factory.getScriptEngine(new String[]{"-scripting"});
String yourCommand = "ls";
Object eval = engine.eval("$EXEC(\"" + yourCommand + "\")");
System.out.println(eval);
Edit: regarding the question "Do you think that instead of Nashorn I could just use raw streams directed to the OS from the JVM?"
The following is possible:
Commands.java
class Commands {
    public static void main(String[] args) {
        // each printed line is picked up by run.sh and executed as a command
        System.out.println("ls");
        System.out.println("whoami");
    }
}
run.sh
#!/bin/sh
java Commands | while read command; do
    echo
    echo "command: $command"
    $command
done
But obviously this is not recommended when you want to execute the output of Commands:
- your Java application has no control over the return status of the individual commands
- if a command waits for user input, your Java application doesn't know about it
- your Java application has no access to the output produced by the commands
- all commands are executed blindly
- and there are some more downsides

Regardless of what you're trying to do, using a Nashorn internal class like jdk.nashorn.tools.Shell is not a good idea. In Java 9, this package is not exported by the jdk.scripting.nashorn module, so you'll get a package access failure (even in the absence of a security manager, because the module read/export access check works more like a member access check for classes).
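A hedged sketch of the alternative: the public javax.script API can look the engine up by name instead of touching jdk.nashorn.tools.Shell. The class and method names below are made up for illustration. Nashorn was removed from the JDK in version 15, so the lookup may return null, and the sketch handles that case. It also does not pass -scripting, so $EXEC should not be assumed to be available this way.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class NashornLookup {

    // Evaluates the script with Nashorn if this JVM provides it,
    // otherwise returns an explanatory message.
    static String evalOrExplain(String script) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");
        if (engine == null) {
            return "no Nashorn engine available on this JVM";
        }
        try {
            return String.valueOf(engine.eval(script));
        } catch (ScriptException e) {
            return "script failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(evalOrExplain("'foo'.toUpperCase()"));
    }
}
```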

pyspark: call a custom java function from pyspark. Do I need Java_Gateway?

I wrote the following MyPythonGateway.java so that I can call my custom java class from Python:
import py4j.GatewayServer;
public class MyPythonGateway {
    public String findMyNum(String input) {
        return MyUtility.parse(input).getMyNum();
    }
    public static void main(String[] args) {
        // expose this object as the Py4J entry point
        GatewayServer server = new GatewayServer(new MyPythonGateway());
        server.start();
    }
}
and here is how I used it in my Python code:
from py4j.java_gateway import JavaGateway
def main():
    gateway = JavaGateway()  # connect to the JVM
    myObj = gateway.entry_point.findMyNum("1234 GOOD DAY")
    print(myObj)
if __name__ == '__main__':
    main()
Now I want to use MyPythonGateway.findMyNum() function from PySpark, not just a standalone python script. I did the following:
myNum = sparkcontext._jvm.myPackage.MyPythonGateway.findMyNum("1234 GOOD DAY")
print(myNum)
However, I got the following error:
... line 43, in main:
myNum = sparkcontext._jvm.myPackage.MyPythonGateway.findMyNum("1234 GOOD DAY")
File "/home/edamameQ/spark-1.5.2/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 726, in __getattr__
py4j.protocol.Py4JError: Trying to call a package.
So what did I miss here? I don't know if I should run a separate Java application of MyPythonGateway to start a gateway server when using PySpark. Please advise. Thanks!
Below is exactly what I need (pseudocode):
input.map(f)
def f(row):
    # call MyUtility.java
    # x = MyUtility.parse(row).getMyNum()
    # return x
What would be the best way to approach this? Thanks!

First of all the error you see usually means the class you're trying to use is not accessible. So most likely it is a CLASSPATH issue.
Regarding the general idea, there are two important issues:
- You cannot access SparkContext inside an action or transformation, so using the PySpark gateway won't work (see How to use Java/Scala function from an action or a transformation? for some details). If you want to use Py4J from the workers, you'll have to start a separate gateway on each worker machine.
- You really don't want to pass data between Python and the JVM this way. Py4J is not designed for data-intensive tasks.

In PySpark, before calling the method
myNum = sparkcontext._jvm.myPackage.MyPythonGateway.findMyNum("1234 GOOD DAY")
you have to import the MyPythonGateway Java class as follows:
from py4j.java_gateway import java_import
java_import(spark.sparkContext._jvm, "myPackage.MyPythonGateway")
myPythonGateway = spark.sparkContext._jvm.MyPythonGateway()
myPythonGateway.findMyNum("1234 GOOD DAY")
Also specify the jar containing myPackage.MyPythonGateway with the --jars option of spark-submit.

If inputs in input.map(f) is an RDD, for example, the following might work. You can't access the JVM variable (attached to the Spark context) inside the executor in a map function over an RDD, so each partition opens its own gateway (to my knowledge there is no equivalent of @transient lazy val in PySpark).
import py4j.java_gateway
def pythonGatewayIterator(iterator):
    results = []
    # connect to a Py4J gateway server reachable from this worker
    jvm = py4j.java_gateway.JavaGateway().jvm
    mygw = jvm.myPackage.MyPythonGateway()
    for value in iterator:
        results.append(mygw.findMyNum(value))
    return results
inputs.mapPartitions(pythonGatewayIterator)

All you need to do is compile a jar and add it to the PySpark classpath with the --jars or --driver-class-path spark-submit options. Then access the class and method with the code below:
sc._jvm.com.company.MyClass.func1()
where sc is the SparkContext.
Tested with Spark 2.3. Keep in mind that you can call a JVM class method only from the driver program, not from an executor.
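For a call of the form sc._jvm.com.company.MyClass.func1(...) to resolve without creating an instance, func1 presumably has to be a public static method of a public class. The sketch below shows that shape. MyClass and func1 come from the answer; the body is a made-up placeholder, and the package declaration is omitted so the file runs on its own.

```java
// Stand-in for the answer's com.company.MyClass. In a real project the file
// would start with `package com.company;` and be built into the jar that is
// passed to spark-submit with --jars.
public class MyClass {

    // Static, so it can be called on the class itself from the driver.
    public static String func1(String input) {
        // Placeholder logic (assumption): return the leading run of digits.
        int end = 0;
        while (end < input.length() && Character.isDigit(input.charAt(end))) {
            end++;
        }
        return input.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(func1("1234 GOOD DAY")); // prints 1234
    }
}
```

By contrast, findMyNum in the question is an instance method, so it would need an object created first, as in the java_import answer.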

How to submit spark job from within java program to standalone spark cluster without using spark-submit?

I am using Spark to perform some computations, but I want the job to be submitted from a Java application. It works properly when submitted using the spark-submit script. Has anyone tried to do this?
Thanks.

Don't forget to add the fat JAR containing your code to the context.
val conf = new SparkConf()
  .setMaster(...)
  .setAppName(...)
  .setJars(Seq("/path/to/code.jar")) // setJars expects a sequence of jar paths
val sc = new SparkContext(conf)

As long as you have a master and an available worker started, you should be able to do this with the following in your Java application:
// requires org.apache.spark.SparkConf and org.apache.spark.api.java.JavaSparkContext
String master = "spark://IP:7077"; // set IP to the address of your master
String appName = "Name of your Application Here";
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
I was able to run JUnit tests from within IntelliJ that used the JavaSparkContext without having to use the spark-submit script. I am running into issues when performing actions on DataFrames, though (not sure if that's related).

Can we Runtime.getRuntime().exec("groovy");

I installed Groovy.
And I am trying to run groovy scripts from a command prompt that I created using Java, like so:
Runtime.getRuntime().exec("groovy");
So if I type in "groovy" to the command line, this is what I get:
>>>groovy
Cannot run program "groovy": CreateProcess error=2, The system cannot find the file specified
Does anyone have an idea as to what might be going wrong? Should I just use Groovy's implementation of exec? Like:
def processBuilder=new ProcessBuilder("ls")
processBuilder.redirectErrorStream(true)
processBuilder.directory(new File("Your Working dir")) // <--
def process = processBuilder.start()
My guess is that it wouldn't matter whether using Java's implementation or Groovy's implementation.
So how do I run a groovy script?

The approach described in the question, calling the groovy executable, starts a second Java runtime instance and class loader. The more efficient way is to embed the Groovy script directly into the running Java runtime as a Java class and invoke it.
Here are three ways to execute a Groovy script from Java:
1) The simplest way is to use GroovyShell.
Here is an example Java main program and the target Groovy script to invoke:
== TestShell.java ==
import groovy.lang.Binding;
import groovy.lang.GroovyShell;
import java.io.File;
public class TestShell {
    public static void main(String[] args) throws Exception {
        // call Groovy expressions from Java code
        Binding binding = new Binding();
        binding.setVariable("input", "world");
        GroovyShell shell = new GroovyShell(binding);
        // running the script prints "Hello world"
        Object retVal = shell.evaluate(new File("hello.groovy"));
        System.out.println("x=" + binding.getVariable("x")); // x=123
        System.out.println("return=" + retVal); // return=ok
    }
}
== hello.groovy ==
println "Hello $input"
x = 123 // script-scoped variables are available via the GroovyShell
return "ok"
2) The next option is to use GroovyClassLoader to parse the script into a class and then create an instance of it. This approach treats the Groovy script as a class and invokes methods on it as on any Java class.
GroovyClassLoader gcl = new GroovyClassLoader();
Class<?> clazz = gcl.parseClass(new File("hello.groovy"));
Object aScript = clazz.getDeclaredConstructor().newInstance();
// probably cast the object to an interface and invoke methods on it
3) Finally, you can create GroovyScriptEngine and pass in objects as variables using binding. This runs the Groovy script as a script and passes in input using binding variables as opposed to calling explicit methods with arguments.
Note: This third option is for developers who want to embed groovy scripts into a server and have them reloaded on modification.
import groovy.lang.Binding;
import groovy.util.GroovyScriptEngine;
String[] roots = new String[] { "/my/groovy/script/path" };
GroovyScriptEngine gse = new GroovyScriptEngine(roots);
Binding binding = new Binding();
binding.setVariable("input", "world");
gse.run("hello.groovy", binding);
System.out.println(binding.getVariable("x")); // x is the variable set by hello.groovy
Note: You must include the groovy-all jar in your CLASSPATH for these approaches to work.
Reference: http://groovy.codehaus.org/Embedding+Groovy
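If starting the external groovy executable is still wanted, one possible cause of the "CreateProcess error=2" in the question is that the Windows launcher is often a batch file (groovy.bat), which Runtime.exec("groovy") does not resolve on its own. Another is that Groovy is simply not on the PATH. The sketch below only builds the command line and routes it through cmd.exe on Windows. The class and method names are made up for illustration, and the actual launch is left as a comment because it needs Groovy installed.

```java
import java.util.ArrayList;
import java.util.List;

public class GroovyCommand {

    // Builds the argument list for launching the groovy executable.
    // On Windows the command goes through cmd.exe so that the shell can
    // resolve a batch-file launcher via PATH.
    static List<String> build(String osName, String scriptPath) {
        List<String> command = new ArrayList<>();
        if (osName.toLowerCase().startsWith("windows")) {
            command.add("cmd.exe");
            command.add("/c");
        }
        command.add("groovy");
        command.add(scriptPath);
        return command;
    }

    public static void main(String[] args) {
        List<String> command = build(System.getProperty("os.name"), "hello.groovy");
        System.out.println(command);
        // To actually launch it (requires Groovy on the PATH):
        //   new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```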

Different output -- when running mathtext in command line and when the command is executed from a java program using apache-commons-exec

I am trying to run mathtext from a Java program using apache-commons-exec. The problem is that I get different output when I run the same command from the Java program and when I run it in a shell.
If I run mathtext like this in the shell:
./mathtext test.png "\$\frac{{\left( {{p^2} - {q^2}} \right)}}{2}\$"
I get the perfect PNG,
but when I run the same thing using apache-commons-exec
Map map = new HashMap();
map.put("target", new File(trgtFileName));
DefaultExecuteResultHandler resultHandler = new DefaultExecuteResultHandler();
Executor exec = new DefaultExecutor();
exec.setWorkingDirectory(/*I set the working directory where the mathtext is*/);
CommandLine cl = new CommandLine("./mathtext");
cl.addArgument("${target}");
cl.addArgument(latex);
cl.setSubstitutionMap(map);
// Logger.log4j.info("command is:::"+cl.toString());
ExecuteWatchdog watchdog = new ExecuteWatchdog(5000);
exec.setWatchdog(watchdog);
exec.execute(cl,EnvironmentUtils.getProcEnvironment(),resultHandler);
resultHandler.waitFor();
I get an image that shows the raw TeX string instead of the rendered equation :(
Can somebody please help me solve this issue? I want to get exactly the same output as in the shell.
Thanks.

I figured out where the problem was:
$ is a special character for the Unix shell but not for Java. So although on the command line the $ in the input has to be escaped, like:
"\$\frac{{\left( {{p^2} - {q^2}} \right)}}{2}\$"
inside the Java program I don't need to escape the '$' or put " (double quotes) at the beginning and at the end. I had to pass the argument like:
$\frac{{\left( {{p^2} - {q^2}} \right)}}{2}$
Hope this helps somebody :)
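To make the difference concrete, here is a small sketch of how the argument could look in Java source. It uses a plain argument list of the kind java.lang.ProcessBuilder accepts rather than commons-exec, and the class name is made up for illustration. The only escaping needed is Java's own: each backslash in the string literal is doubled.

```java
import java.util.Arrays;
import java.util.List;

public class MathtextArgs {

    // '$' has no special meaning in a Java string, and no surrounding double
    // quotes are added; only the backslashes are escaped for the compiler.
    static final String LATEX = "$\\frac{{\\left( {{p^2} - {q^2}} \\right)}}{2}$";

    static List<String> command(String target) {
        return Arrays.asList("./mathtext", target, LATEX);
    }

    public static void main(String[] args) {
        // Without a shell in between, each element is passed on as one argument.
        command("test.png").forEach(System.out::println);
    }
}
```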
--Shankhoneer

