I am trying to run a Spark job with master URL yarn-client, using SparkLauncher 2.10. The Java code is wrapped in a NiFi processor, and NiFi is currently running as root. When I do yarn application -list, I see the Spark job started with USER = root. I want it to run with USER = hive.
Here is my SparkLauncher code:
Process spark = new SparkLauncher()
.setSparkHome(cp.fetchProperty(GlobalConstant.spark_submit_work_dir).toString())
.setAppResource(cp.fetchProperty(GlobalConstant.spark_app_resource))
.setMainClass(cp.fetchProperty(GlobalConstant.spark_main_class))
.addAppArgs(ps.getName())
// .setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS,"-Duser.name=hive")
.setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, "-Dlog4j.configuration=file:///opt/eim/log4j_submitgnrfromhdfs.properties")
.setVerbose(true)
.launch();
Do I need to pass the user as a driver extra Java option? The environment is non-Kerberos.
I read somewhere that I need to pass the user name as a driver extra Java option, but I cannot find that post now.
Setting export HADOOP_USER_NAME=hive worked. SparkLauncher has an overload that accepts a Map of environment variables. As for spark.yarn.principal, the environment is non-Kerberos, and from what I have read the principal only applies with Kerberos. I did the following:
Process spark = new SparkLauncher(getEnvironmentVar(ps.getRunAs()))
.setSparkHome(cp.fetchProperty(GlobalConstant.spark_submit_work_dir).toString())
.setAppResource(cp.fetchProperty(GlobalConstant.spark_app_resource))
.setMainClass(cp.fetchProperty(GlobalConstant.spark_main_class))
.addAppArgs(ps.getName())
// .setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS,"-Duser.name=hive")
.setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, "-Dlog4j.configuration=file:///opt/eim/log4j_submitgnrfromhdfs.properties")
.setVerbose(true)
.launch();
Instead of new SparkLauncher(), I used SparkLauncher(java.util.Map<String,String> env) and added (or replaced) HADOOP_USER_NAME=hive in that map.
Checked with yarn application -list; the job now launches as intended with USER = hive.
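For reference, a minimal sketch of what getEnvironmentVar could look like; the helper is not shown in the original code, so its name and shape here are assumptions:
// Hypothetical helper: build the environment map handed to SparkLauncher,
// forcing HADOOP_USER_NAME to the requested run-as user (e.g. "hive").
import java.util.HashMap;
import java.util.Map;

private Map<String, String> getEnvironmentVar(String runAs) {
    Map<String, String> env = new HashMap<>(System.getenv()); // keep the current environment
    env.put("HADOOP_USER_NAME", runAs);                       // the user YARN should report
    return env;
}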
I have a trivially small Spark application written in Java that I am trying to run in a K8s cluster using spark-submit. I built an image with Spark binaries, my uber-JAR file with all necessary dependencies (in /opt/spark/jars/my.jar), and a config file (in /opt/spark/conf/some.json).
In my code, I start with
SparkSession session = SparkSession.builder()
.appName("myapp")
.config("spark.logConf", "true")
.getOrCreate();
Path someFilePath = FileSystems.getDefault().getPath("/opt/spark/conf/some.json");
String someString = new String(Files.readAllBytes(someFilePath));
and get this exception at readAllBytes from the Spark driver:
java.nio.file.NoSuchFileException: /opt/spark/conf/some.json
If I run my Docker image manually I can definitely see the file /opt/spark/conf/some.json as I expect. My Spark job runs as root so file permissions should not be a problem.
I have been assuming that, since the same Docker image, with the file indeed present, will be used to start the driver (and executors, but I don't even get to that point), the file should be available to my application. Is that not so? Why wouldn't it see the file?
You seem to get this exception from one of your worker nodes, not from the container.
Make sure you have specified all the files you need via the --files option of spark-submit.
spark-submit --master yarn --deploy-mode cluster --files <local file dependencies> ...
https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
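If you go that route, one way to read the shipped copy is through SparkFiles rather than a hard-coded image path; a sketch, assuming the file keeps the name some.json:
// Sketch: after `spark-submit ... --files /opt/spark/conf/some.json`,
// the file is distributed with the job and its local copy can be
// resolved via SparkFiles instead of an absolute path inside the image.
import org.apache.spark.SparkFiles;
import java.nio.file.Files;
import java.nio.file.Paths;

String localPath = SparkFiles.get("some.json");
String someString = new String(Files.readAllBytes(Paths.get(localPath)));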
I am trying to set an environment variable for my Spark application, running in local mode.
Here is the spark-submit command:
spark-submit --conf spark.executorEnv.FOO=bar --class com.amazon.Main SWALiveOrderModelSpark-1.0-super.jar
However, when I try to access it:
System.out.println("env variable:- " + System.getenv("FOO"));
the output is:
env variable:- null
Does anyone know how I can resolve this?
spark.executorEnv.[EnvironmentVariableName] is used to (emphasis mine):
Add the environment variable specified by EnvironmentVariableName to the Executor process.
It won't be visible on the driver, except through org.apache.spark.SparkConf. To access it using System.getenv you have to do it in the right context, for example from a task:
sc.range(0, 1).map(_ => System.getenv("FOO")).collect.foreach(println)
bar
You are setting a Spark environment variable using SparkConf, so you'll have to use SparkConf to fetch it as well:
sc.getConf.get("spark.executorEnv.FOO")
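Since the question code is in Java, here is a rough Java equivalent of the two answers above; the app name is a placeholder, and the printed value matches the Scala example only when the job is actually submitted with --conf spark.executorEnv.FOO=bar:
// Rough Java equivalent of the Scala snippets above.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.Arrays;

SparkConf conf = new SparkConf().setAppName("env-check");
JavaSparkContext sc = new JavaSparkContext(conf);

// Evaluated inside a task, i.e. on the executor side.
sc.parallelize(Arrays.asList(1))
  .map(i -> System.getenv("FOO"))
  .collect()
  .forEach(System.out::println);

// Read the setting back on the driver via the configuration.
System.out.println(sc.getConf().get("spark.executorEnv.FOO"));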
I have an application in which I want to use Java to start and stop Docker containers. It seems that the way to do this is with docker-machine create, which works fine when I test it from the command line.
However, when I run it via Commons Exec from Java I get the error:
(aa4567c1-058f-46ae-9e97-56fb8b45211c) Creating SSH key...
Error creating machine: Error in driver during machine creation: /usr/local/bin/VBoxManage modifyvm aa4567c1-058f-46ae-9e97-56fb8b45211c --firmware bios --bioslogofadein off --bioslogofadeout off --bioslogodisplaytime 0 --biosbootmenu disabled --ostype Linux26_64 --cpus 1 --memory 1024 --acpi on --ioapic on --rtcuseutc on --natdnshostresolver1 off --natdnsproxy1 on --cpuhotplug off --pae on --hpet on --hwvirtex on --nestedpaging on --largepages on --vtxvpid on --accelerate3d off --boot1 dvd failed:
VBoxManage: error: Could not find a registered machine with UUID {aa4567c1-058f-46ae-9e97-56fb8b45211c}
VBoxManage: error: Details: code VBOX_E_OBJECT_NOT_FOUND (0x80bb0001), component VirtualBoxWrap, interface IVirtualBox, callee nsISupports
VBoxManage: error: Context: "FindMachine(Bstr(a->argv[0]).raw(), machine.asOutParam())" at line 500 of file VBoxManageModifyVM.cpp
I have set my VBOX_USER_HOME variable in an initializationScript that I'm using to start the machine:
export WORKERID=$1
export VBOX_USER_HOME=/Users/me/Library/VirtualBox
# create the worker using docker-machine, load the env of the newly
# created machine, then start the container in it
docker-machine create $WORKERID && \
eval $(docker-machine env $WORKERID) && \
docker run -d myimage
And I'm executing this from Java via the Commons Exec CommandLine class:
CommandLine cmdline = new CommandLine("/bin/sh");
cmdline.addArgument(initializeWorkerScript.getAbsolutePath());
cmdline.addArgument("test");
Executor executor = new DefaultExecutor();
executor.execute(cmdline); // blocks until the script exits; a non-zero exit throws ExecuteException
If there is another library that can interface with docker-machine from Java I'm happy to use that, or to change out Commons Exec if that's the issue (though I don't understand why). The basic requirement is that I have some way to get docker-machine to create a machine using Java and then later to be able to use docker-machine to stop that machine.
As it turns out, the example I posted should work. The issue I was having was that I was provisioning machines with UUID names, and those names contained dash (-) characters, which apparently break VBoxManage. This might be because of some kind of path problem, but I'm just speculating. When I changed my UUIDs to use dots (.) instead of dashes, the machine loaded and started just fine.
I'm happy to remove this post if the moderators want, but will leave it up here in case people are looking for solutions to problems with docker-machine create naming issues.
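For anyone hitting the same naming problem, a small sketch of the workaround, reusing the Commons Exec setup from the question (initializeWorkerScript is the same File as above):
// Sketch: generate a machine name with no dashes before handing it to
// the script, since dashes in the name tripped up VBoxManage here.
import java.util.UUID;
import org.apache.commons.exec.CommandLine;
import org.apache.commons.exec.DefaultExecutor;

String workerId = UUID.randomUUID().toString().replace('-', '.');
CommandLine cmdline = new CommandLine("/bin/sh");
cmdline.addArgument(initializeWorkerScript.getAbsolutePath());
cmdline.addArgument(workerId);
DefaultExecutor executor = new DefaultExecutor();
executor.execute(cmdline);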
I'm writing a Spark app in Java called MyApp, and I have the following code:
final JavaRDD<MyClass<MyInnerClass>> myRDD = ...
myRDD.repartition(50).map(ObjectMapperFactory.OBJECT_MAPPER_MIXIN::writeValueAsString)
.saveAsTextFile(outputPath, GzipCodec.class);
When I run the tests for my package, this app hangs. However, if I use a lambda function instead of passing in a method reference, like this
myRDD.repartition(50).map(t -> ObjectMapperFactory.OBJECT_MAPPER_MIXIN.writeValueAsString(t))
.saveAsTextFile(outputPath, GzipCodec.class);
then the tests work fine. The only change is in the map statement. I am using Java 8 and Spark 1.5.2, and I am running the cluster in local mode.
EDIT: If I run the test for MyApp by itself, it works fine with the old code. However, if I run the full test suite, that is when the app starts to hang. There are other Java Spark apps in my package.
I want to start an Ubuntu instance in Google Cloud. My problem is that when I try to run a startup script like:
Metadata.Items item2 = new Metadata.Items();
item2.setKey("startup-script");
item2.setValue("my script....");
the startup script never runs. Does anybody have an idea how I can run a startup script on a custom image automatically?
I have cloud-init preinstalled on my image.
In the Google Developers Console, go to your Ubuntu instance's settings.
Then add a new Custom Metadata key-value pair.
Key: startup-script
Value: your script or a URL of where to find it
I checked that it works with my own custom Ubuntu image. Try something simple at first, like:
echo `date` >> /tmp/startup-script
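If the metadata is set programmatically as in the question, here is a hedged sketch of attaching that item to the instance using the same client model classes; everything beyond the setKey/setValue calls shown above is an assumption about how the rest of the request is built:
// Sketch: the startup-script item only takes effect if it is attached
// to the instance's Metadata in the insert/update request.
import com.google.api.services.compute.model.Instance;
import com.google.api.services.compute.model.Metadata;
import java.util.Collections;

Metadata.Items item2 = new Metadata.Items();
item2.setKey("startup-script");
item2.setValue("#! /bin/bash\necho `date` >> /tmp/startup-script");

Metadata metadata = new Metadata();
metadata.setItems(Collections.singletonList(item2));

Instance instance = new Instance(); // ...plus the rest of the instance definition
instance.setMetadata(metadata);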