I am trying to set an environment variable for my Spark application, which runs in local mode.
Here is the spark-submit job:
spark-submit --conf spark.executorEnv.FOO=bar --class com.amazon.Main SWALiveOrderModelSpark-1.0-super.jar
However, when I try to access it:
System.out.println("env variable:- " + System.getenv("FOO"));
the output is:
env variable:- null
Does anyone know how I can resolve this?
spark.executorEnv.[EnvironmentVariableName] is used to (emphasis mine):
Add the environment variable specified by EnvironmentVariableName to the Executor process.
It won't be visible on the driver, other than through org.apache.spark.SparkConf. To access it using System.getenv you have to do it in the right context, for example from a task:
sc.range(0, 1).map(_ => System.getenv("FOO")).collect.foreach(println)
bar
You are setting a Spark environment variable using SparkConf. You'll have to use SparkConf to fetch it as well
sc.getConf.get("spark.executorEnv.FOO")
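For example, a minimal Java sketch of both access paths, mirroring the Java call in the question (the class and application names are illustrative):
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ExecutorEnvCheck {
    public static void main(String[] args) {
        // Same effect as --conf spark.executorEnv.FOO=bar on spark-submit;
        // the master is expected to be supplied by spark-submit.
        SparkConf conf = new SparkConf()
                .setAppName("executor-env-check")
                .set("spark.executorEnv.FOO", "bar");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Driver side: read it back through SparkConf, not System.getenv.
            System.out.println("from conf: " + sc.getConf().get("spark.executorEnv.FOO"));

            // Inside a task, System.getenv can see it when the task runs in an executor process.
            sc.parallelize(Arrays.asList(0))
              .map(x -> System.getenv("FOO"))
              .collect()
              .forEach(System.out::println);
        }
    }
}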
I am trying to understand why the following error occurs.
"Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.SparkSession. java.lang.reflect.InvocationTargetException"
Basically, I am trying to use the Delta module to perform an "upsert" on my table in a Glue job.
When I run the following code, I get the error mentioned above.
from delta import *
from pyspark.sql.session import SparkSession
spark = SparkSession \
    .builder \
    .config("spark.jars.packages", "io.delta:delta-core_2.11:0.5.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
This is the only piece of code I run, and I get the error. Do you have any idea why this is happening?
Most probably you are using the wrong version, likely Glue 3.0. There were some workarounds to use Delta with Glue 2.0, but those can give that kind of error when you try them with Glue 3.0. Also, setting the Spark session config inside the job does not work for some parameters; it depends on the version.
But no worries: AWS has announced the 4th version of Glue; here is the official announcement.
Here is the official guide on using Delta with Glue, and below I will state the key points to make it work.
The first and trickiest part is supplying the configuration for Delta. With Glue 4.0 you can now do it the straightforward way; in the older versions, you did this by nesting the Spark conf settings inside the --conf job parameter of Glue :)
--conf = spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
You have to set the --datalake-formats job parameter to delta.
After that, make sure you have selected Glue 4.0. Also, make sure to handle the symlink manifest files in your scripts or by using crawlers.
If you want more flexibility, you can also choose to use AWS EMR; here is a walkthrough on using Delta there.
I want to set an environment variable, System.env.TEMP, from an initd.gradle file in the init.d folder, and it should be accessible within the other Gradle projects.
I tried to set it like this in the C:\Users\myuser\.gradle\init.d\initd.gradle file:
def envVars = [:]
envVars['System.env.TEMP'] += "$buildDir/tmp";
tasks.withType(Exec) { environment << envVars }
And I tried to access it from the D:\test-workspace\GradleTaskExample\build.gradle file, as in the following code:
println "$System.env.TEMP";
But it prints only the old value, not the value that I set. How can I set the value so that it can be read with a println in the Gradle scope? If what I'm trying to do is not possible, please explain.
Thank you.
I have an application that should use Java to start and stop Docker containers. It seems that the way to do this is with docker-machine create, which works fine when I test it from the command line.
However, when I run it using Commons Exec from Java, I get this error:
(aa4567c1-058f-46ae-9e97-56fb8b45211c) Creating SSH key...
Error creating machine: Error in driver during machine creation: /usr/local/bin/VBoxManage modifyvm aa4567c1-058f-46ae-9e97-56fb8b45211c --firmware bios --bioslogofadein off --bioslogofadeout off --bioslogodisplaytime 0 --biosbootmenu disabled --ostype Linux26_64 --cpus 1 --memory 1024 --acpi on --ioapic on --rtcuseutc on --natdnshostresolver1 off --natdnsproxy1 on --cpuhotplug off --pae on --hpet on --hwvirtex on --nestedpaging on --largepages on --vtxvpid on --accelerate3d off --boot1 dvd failed:
VBoxManage: error: Could not find a registered machine with UUID {aa4567c1-058f-46ae-9e97-56fb8b45211c}
VBoxManage: error: Details: code VBOX_E_OBJECT_NOT_FOUND (0x80bb0001), component VirtualBoxWrap, interface IVirtualBox, callee nsISupports
VBoxManage: error: Context: "FindMachine(Bstr(a->argv[0]).raw(), machine.asOutParam())" at line 500 of file VBoxManageModifyVM.cpp
I have set my VBOX_USER_HOME variable in an initializationScript that I'm using to start the machine:
export WORKERID=$1
export VBOX_USER_HOME=/Users/me/Library/VirtualBox
# create the worker using docker-machine, load the env of the
# newly created machine, then start the container
docker-machine create $WORKERID && \
eval $(docker-machine env $WORKERID) && \
docker run -d myimage
And I'm executing this from Java via the Commons Exec CommandLine class:
CommandLine cmdline = new CommandLine("/bin/sh");
cmdline.addArgument(initializeWorkerScript.getAbsolutePath());
cmdline.addArgument("test");
Executor executor = new DefaultExecutor();
If there is another library that can interface with docker-machine from Java I'm happy to use that, or to change out Commons Exec if that's the issue (though I don't understand why). The basic requirement is that I have some way to get docker-machine to create a machine using Java and then later to be able to use docker-machine to stop that machine.
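For reference, Commons Exec can also hand environment variables straight to the subprocess, in case that is part of the problem. A minimal sketch (the script path is illustrative; EnvironmentUtils ships with the library):
import java.io.File;
import java.io.IOException;
import java.util.Map;

import org.apache.commons.exec.CommandLine;
import org.apache.commons.exec.DefaultExecutor;
import org.apache.commons.exec.environment.EnvironmentUtils;

public class RunInitScript {
    public static void main(String[] args) throws IOException {
        File initializeWorkerScript = new File("/path/to/init-worker.sh"); // illustrative path

        CommandLine cmdline = new CommandLine("/bin/sh");
        cmdline.addArgument(initializeWorkerScript.getAbsolutePath());
        cmdline.addArgument("test");

        // Copy the current process environment and add what the script needs.
        Map<String, String> env = EnvironmentUtils.getProcEnvironment();
        env.put("VBOX_USER_HOME", "/Users/me/Library/VirtualBox");

        DefaultExecutor executor = new DefaultExecutor();
        int exitValue = executor.execute(cmdline, env);
        System.out.println("exit value: " + exitValue);
    }
}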
As it turns out, the example that I posted should work. The issue I was having is that I was provisioning machines with a UUID name, and that name contained dash (-) characters, which apparently break VBoxManage. This might be because of some kind of path problem, but I'm just speculating. When I changed my UUID to use dots (.) instead of dashes, it loaded and started the machine just fine.
I'm happy to remove this post if the moderators want, but will leave it up here in case people are looking for solutions to problems with docker-machine create naming issues.
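For illustration, a minimal sketch of the rename (the variable name is illustrative; the dot-separated scheme is just what happened to work for me):
import java.util.UUID;

// VBoxManage could not find machines whose names contained dashes,
// so use dots instead before handing the name to docker-machine create.
String workerId = UUID.randomUUID().toString().replace('-', '.');
That workerId is then what the initialization script receives as $1.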
I am trying to run a Spark job with masterURL=yarn-client, using SparkLauncher 2.10. The Java code is wrapped in a NiFi processor, and NiFi is currently running as root. When I do yarn application -list, I see the Spark job started with USER = root. I want to run it with USER = hive.
Following is my SparkLauncher code:
Process spark = new SparkLauncher()
    .setSparkHome(cp.fetchProperty(GlobalConstant.spark_submit_work_dir).toString())
    .setAppResource(cp.fetchProperty(GlobalConstant.spark_app_resource))
    .setMainClass(cp.fetchProperty(GlobalConstant.spark_main_class))
    .addAppArgs(ps.getName())
    // .setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, "-Duser.name=hive")
    .setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, "-Dlog4j.configuration=file:///opt/eim/log4j_submitgnrfromhdfs.properties")
    .setVerbose(true)
    .launch();
Do I need to pass the user as a driver extra option? The environment is non-Kerberos.
I read somewhere that I need to pass the user name as a driver extra Java option, but I cannot find that post now!
export HADOOP_USER_NAME=hive worked. SparkLauncher has an overload that accepts a Map of environment variables. As for spark.yarn.principal, the environment is non-Kerberos, and as per my reading spark.yarn.principal works only with Kerberos. I did the following:
Process spark = new SparkLauncher(getEnvironmentVar(ps.getRunAs()))
    .setSparkHome(cp.fetchProperty(GlobalConstant.spark_submit_work_dir).toString())
    .setAppResource(cp.fetchProperty(GlobalConstant.spark_app_resource))
    .setMainClass(cp.fetchProperty(GlobalConstant.spark_main_class))
    .addAppArgs(ps.getName())
    // .setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, "-Duser.name=hive")
    .setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, "-Dlog4j.configuration=file:///opt/eim/log4j_submitgnrfromhdfs.properties")
    .setVerbose(true)
    .launch();
Instead of new SparkLauncher(), I used SparkLauncher(java.util.Map<String,String> env) and added or replaced HADOOP_USER_NAME=hive in that map.
Checked with yarn application -list: the job launches as intended with USER=hive.
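For reference, a minimal sketch of building such an environment map by hand (getEnvironmentVar above is my own helper; the app resource and main class here are illustrative):
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.launcher.SparkLauncher;

public class LaunchAsHive {
    public static void main(String[] args) throws IOException {
        // Start from the current process environment and override the user YARN should use.
        Map<String, String> env = new HashMap<>(System.getenv());
        env.put("HADOOP_USER_NAME", "hive");

        Process spark = new SparkLauncher(env)
                .setAppResource("/path/to/app.jar")   // illustrative path
                .setMainClass("com.example.Main")     // illustrative class
                .setVerbose(true)
                .launch();
    }
}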
To run my application from the command line, I run:
java -Dconfig.file=./config/devApp.config -jar ./build/libs/myJar.jar
and inside my code, I have:
String configPath = System.getProperty("config.file");
Which gets the property just fine. However, when I try to debug using the built-in NetBeans debug task, the property is null. The output of my run is:
Executing: gradle debug
Arguments: [-Dconfig.file=./config/devApp.config, -PmainClass=com.comp.entrypoints.Runner, -c, /home/me/Documents/projects/proj/settings.gradle]
JVM Arguments: [-Dconfig.file=./config/devApp.config]
These values come from the NetBeans run configuration: I set the property in both the Arguments and JVM Arguments fields to see if either would set it. Regardless of what I do, it is null. Can someone help me figure out how to set the system property so my app can get it?
You are setting the property on the Gradle JVM, which has almost nothing to do with the JVM your application runs in. If you want to use Gradle to start your app for debugging, you have to tweak your Gradle build file to set or forward the system property to the debug task.
Assuming the debug task is of type JavaExec, this would be something like
systemProperty 'config.file', System.properties.'config.file'
in the configuration of your debug task to forward what you set up in the "JVM Arguments" field in NetBeans.
It seems that the "Arguments (each line is an argument):" and "JVM Arguments (each line is an argument):" fields provide values to the Gradle task itself. The way I managed to pass the properties over to the application was to append them to the jvmLineArgs argument.
My application is now receiving the profiles property.
Thanks to #Vampire for the "guess work", lol!