Our application's Hadoop cluster has Spark 1.5 installed, but due to specific requirements we developed our Spark job with version 2.0.2. When I submit the job to YARN, I use the --jars option to override the Spark libraries on the cluster, but it still does not pick up the Scala library jar. It throws an error saying:
ApplicationMaster: User class threw exception: java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
at org.apache.spark.sql.SparkSession$Builder.config(SparkSession.scala:713)
at org.apache.spark.sql.SparkSession$Builder.appName(SparkSession.scala:704)
Any ideas on how to override the cluster libraries during spark-submit?
The shell command I use to submit the job is below.
spark-submit \
--jars test.jar,spark-core_2.11-2.0.2.jar,spark-sql_2.11-2.0.2.jar,spark-catalyst_2.11-2.0.2.jar,scala-library-2.11.0.jar \
--class Application \
--master yarn \
--deploy-mode cluster \
--queue xxx \
xxx.jar \
<params>
That's fairly easy: YARN doesn't care which version of Spark you are running. It executes the jars provided by the YARN client, which spark-submit packages up, and that process ships your application jar along with the Spark libraries.
In order to deploy Spark 2.0 instead of the provided 1.5, you just need to install Spark 2.0 on the host from which you start your job (e.g. in your home directory), set the YARN_CONF_DIR environment variable to point to your Hadoop configuration, and then use that spark-submit.
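A minimal sketch of that approach, assuming a home-directory install, the Apache archive as the download source, and the Hadoop configuration under /etc/hadoop/conf (adjust these to your environment):
# Sketch: install Spark 2.0.2 next to the cluster's 1.5 and submit with it.
cd ~
wget https://archive.apache.org/dist/spark/spark-2.0.2/spark-2.0.2-bin-hadoop2.7.tgz
tar -xzf spark-2.0.2-bin-hadoop2.7.tgz
# Point the new spark-submit at the cluster's Hadoop/YARN configuration.
export YARN_CONF_DIR=/etc/hadoop/conf
~/spark-2.0.2-bin-hadoop2.7/bin/spark-submit \
--class Application \
--master yarn \
--deploy-mode cluster \
--queue xxx \
xxx.jar \
<params>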
Related
I just upgraded Flink from version 1.10 to 1.11. In 1.11, Flink provides a new feature that lets users deploy a job in Application Mode on Kubernetes.
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#deploy-session-cluster
In v1.10, we start the Flink K8s cluster and then submit the job to Flink by running:
exec ./bin/flink run \
-d \
/streakerflink_deploy.jar \
--arg1 blablabla \
--arg2 blablabla \
--arg3 blablabla \
...
We pass the Java arguments through this command.
But in v1.11, if we run in Application Mode, we don't need to run the flink run command above. I am wondering how we pass arguments to the Flink job in Application Mode (aka a job cluster)?
Any help will be appreciated!
Since you are using a Helm chart to start the Flink cluster on Kubernetes (aka K8s), I assume you are talking about the standalone K8s mode. Actually, Application Mode is very similar to the job cluster in 1.10 and before.
So you could set the job arguments in the args field of jobmanager-job.yaml, as in the following:
...
args: ["standalone-job", "--job-classname", "org.apache.flink.streaming.examples.join.WindowJoin", "--windowSize", "3000", "--rate", "100"]
...
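If you manage those manifests directly rather than only through the Helm chart, a minimal sketch of redeploying the job manager after editing the args field could look like this (the manifest name follows the Flink standalone K8s docs; the config map, service, and task manager manifests are assumed to already be deployed):
# Sketch: recreate the JobManager Job so the new args take effect.
kubectl delete -f jobmanager-job.yaml --ignore-not-found
kubectl create -f jobmanager-job.yaml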
If you really mean the native K8s mode, then the arguments can be appended directly to the flink run-application command.
$ ./bin/flink run-application -p 8 -t kubernetes-application \
-Dkubernetes.cluster-id=<ClusterId> \
-Dtaskmanager.memory.process.size=4096m \
-Dkubernetes.taskmanager.cpu=2 \
-Dtaskmanager.numberOfTaskSlots=4 \
-Dkubernetes.container.image=<CustomImageName> \
local:///opt/flink/examples/streaming/WindowJoin.jar \
--windowSize 3000 --rate 100
Note:
Please keep in mind that the key difference between standalone K8s and native K8s mode is dynamic resource allocation. In native mode, Flink has an embedded K8s client, so the JobManager can allocate and release TaskManager pods on demand. Currently, native mode can only be used via the Flink commands (kubernetes-session.sh, flink run-application).
Flink's Application Mode on Kubernetes is described in the docs. You have to create a Docker image containing your job; the job can then be executed using ./bin/flink run-application [...] as described there.
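A minimal sketch of such an image, assuming a Flink 1.11 base image and a job jar named my-job.jar (both placeholders); the jar goes under /opt/flink/usrlib so that the Application Mode entrypoint can find it:
# Sketch: bundle the job jar into a custom Flink image (names are placeholders).
cat > Dockerfile <<'EOF'
FROM flink:1.11.0
RUN mkdir -p /opt/flink/usrlib
COPY my-job.jar /opt/flink/usrlib/my-job.jar
EOF
docker build -t my-flink-job:1.0 .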
I am new to Spark. I am trying to run a Spark job in client mode, and it works well if I use the same path for the jar and other resource files. After killing the running application with a YARN command and resubmitting the Spark job with an updated jar and file locations, the job still uses the old paths. After a system reboot, the Spark job picks up the new paths. The spark-submit command:
spark-submit \
--class export.streaming.DataExportStreaming \
--jars /usr/hdp/current/spark-client/lib/postgresql-9.4.1209.jar \
--driver-class-path /usr/hdp/current/spark-client/lib/postgresql-9.4.1209.jar \
--conf spark.driver.extraClassPath=/usr/hdp/current/spark-client/lib/postgresql-9.4.1209.jar \
--conf spark.executor.extraClassPath=/usr/hdp/current/spark-client/lib/postgresql-9.4.1209.jar \
--master yarn --deploy-mode client \
--files /usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_selfservice_session_daily.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_selfservice_session_device.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_selfservice_session_workflow.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_selfservice_session_workflow_step.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_assignment.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_daily.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_device.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_queue.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_workflow.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_workflow_step.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_user_login_session.sql /usr/lib/firebet-spark/52.0.2-1/data-export/lib/data-export-assembly-52.0.2-1.jar /usr/lib/firebet-spark/52.0.2-1/data-export/resources/application.conf
How do I fix this problem?
Is the spark-submit command correct?
Which deployment mode is better for production, client or cluster?
I've got a Marathon job running for the following:
./bin/spark-class org.apache.spark.deploy.mesos.MesosClusterDispatcher
However, following that I'd like to now be able to run individual Spark jobs as separate Marathon jobs with the command:
./bin/spark-submit ....
My question is:
How can I call spark-submit from a Mesos executor if the binaries are not installed on it?
(Note: I'm aware that http://spark.apache.org/docs/latest/running-on-mesos.html#connecting-spark-to-mesos also recommends installing Spark on all the Mesos slaves, but is that the only option?)
Any guidance is much appreciated.
Just run the following command:
/opt/spark/spark-2.1.0-bin-hadoop2.7/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master mesos://127.0.0.1:31258 \
--deploy-mode cluster \
--supervise \
--executor-memory 2G \
--total-executor-cores 1 \
/opt/spark/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.1.0.jar 1000
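On the question of whether installing Spark on every agent is the only option: Spark's Mesos documentation also supports pointing executors at a Spark binary package via spark.executor.uri, so the agents download it themselves. A sketch, assuming the tarball has been uploaded to HDFS (the HDFS path is an example only):
/opt/spark/spark-2.1.0-bin-hadoop2.7/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master mesos://127.0.0.1:31258 \
--deploy-mode cluster \
--supervise \
--executor-memory 2G \
--total-executor-cores 1 \
--conf spark.executor.uri=hdfs:///tmp/spark-2.1.0-bin-hadoop2.7.tgz \
/opt/spark/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.1.0.jar 1000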
I am trying to submit a spark job to the CDH yarn cluster via the following commands
I have tried several combinations and none of them work...
I now have all the POI jars located both in my local /root and in HDFS under /user/root/lib, so I have tried the following:
spark-submit --master yarn-cluster --class "ReadExcelSC" ./excel_sc.jar --jars /root/poi-3.12.jars, /root/poi-ooxml-3.12.jar, /root/poi-ooxml-schemas-3.12.jar
spark-submit --master yarn-cluster --class "ReadExcelSC" ./excel_sc.jar --jars file:/root/poi-3.12.jars, file:/root/poi-ooxml-3.12.jar, file:/root/poi-ooxml-schemas-3.12.jar
spark-submit --master yarn-cluster --class "ReadExcelSC" ./excel_sc.jar --jars hdfs://mynamenodeIP:8020/user/root/poi-3.12.jars,hdfs://mynamenodeIP:8020/user/root/poi-ooxml-3.12.jar,hdfs://mynamenodeIP:8020/user/root/poi-ooxml-schemas-3.12.jar
How do I propagate the jars to all cluster nodes? None of the above works, and the job still fails to find the class; I keep getting the same error:
java.lang.NoClassDefFoundError: org/apache/poi/ss/usermodel/WorkbookFactory
The same command works with "--master local", without specifying --jars, because I have copied my jars to /opt/cloudera/parcels/CDH/lib/spark/lib.
However, for yarn-cluster mode I would need to distribute the external jars to all cluster nodes, and the commands above do not work.
Appreciate your help, thanks.
P.S. I am using CDH 5.4.2 with Spark 1.3.0.
According to the help options of spark-submit:
--jars includes local jars on the driver and executor classpaths [it only sets the classpath].
--files copies the files your application needs into the working directory of every executor node [it actually ships your jar to the working dir].
Note: this is similar to the -file option in Hadoop Streaming, which ships the mapper/reducer scripts to the slave nodes.
So try the --files option as well (and see the note on argument order after the help excerpt below).
$ spark-submit --help
Options:
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
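Separately, it may be worth double-checking the argument order: spark-submit treats everything after the application jar as arguments to the application itself, so --jars has to come before ./excel_sc.jar, with no spaces after the commas. A sketch reusing the paths from the question (assuming poi-3.12.jars there was a typo for poi-3.12.jar):
spark-submit \
--master yarn-cluster \
--class "ReadExcelSC" \
--jars /root/poi-3.12.jar,/root/poi-ooxml-3.12.jar,/root/poi-ooxml-schemas-3.12.jar \
./excel_sc.jar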
hope this helps
Have you tried the solution posted in this thread:
Spark on yarn jar upload problems
The problem was solved by copying spark-assembly.jar into a directory on HDFS, where every node can reach it, and then passing it to spark-submit via the --conf spark.yarn.jar parameter. The commands are listed below:
hdfs dfs -copyFromLocal /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar /user/spark/spark-assembly.jar
/var/tmp/spark/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class MRContainer --master yarn-cluster --conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar simplemr.jar
I’m trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to 1.3.0, so I changed my build.sbt like so:
-libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"
+libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
then build an assembly jar and submit it:
HADOOP_CONF_DIR=/etc/hadoop/conf \
spark-submit \
--driver-class-path=/etc/hbase/conf \
--conf spark.hadoop.validateOutputSpecs=false \
--conf spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.3.0-hadoop2.4.0.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--deploy-mode=cluster \
--master=yarn \
--class=TestObject \
--num-executors=54 \
target/scala-2.11/myapp-assembly-1.2.jar
The job fails to submit, with the following exception in the terminal:
15/03/19 10:30:07 INFO yarn.Client:
15/03/19 10:20:03 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1420225286501_4698 failed 2 times due to AM
Container for appattempt_1420225286501_4698_000002 exited with exitCode: 127
due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Finally, I go and check the YARN app master’s web interface (since the job is there, I know it at least made it that far), and the only logs it shows are these:
Log Type: stderr
Log Length: 61
/bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory
Log Type: stdout
Log Length: 0
I’m not sure how to interpret that – is {{JAVA_HOME}} a literal (including the brackets) that’s somehow making it into a script? Is this coming from the worker nodes or the driver? Anything I can do to experiment & troubleshoot?
I do have JAVA_HOME set in the hadoop config files on all the nodes of the cluster:
% grep JAVA_HOME /etc/hadoop/conf/*.sh
/etc/hadoop/conf/hadoop-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
/etc/hadoop/conf/yarn-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
Has this behavior changed in 1.3.0 since 1.2.1? Using 1.2.1 and making no other changes, the job completes fine.
[Note: I originally posted this on the Spark mailing list; I'll update both places if/when I find a solution.]
Have you tried setting JAVA_HOME in the etc/hadoop/yarn-env.sh file? It's possible that your JAVA_HOME environment variable is not available to the YARN containers that are running your job.
It has happened to me before that certain environment variables set in .bashrc on the nodes were not being read by the YARN workers spawned on the cluster.
There is a chance that the error is unrelated to the version upgrade but instead related to YARN environment configuration.
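As a quick sanity check (a sketch only; the paths mirror those quoted in the question, and the NodeManager is assumed to run as the yarn user), you can verify on a worker node that yarn-env.sh actually exports JAVA_HOME:
grep JAVA_HOME /etc/hadoop/conf/yarn-env.sh
sudo -u yarn bash -c 'source /etc/hadoop/conf/yarn-env.sh && echo "$JAVA_HOME"'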
Okay, so I got some other people in the office to help work on this, and we figured out a solution. I'm not sure how much of this is specific to the file layouts of Hortonworks HDP 2.0.6 on CentOS, which is what we're running on our cluster.
Manually copy some directories from one of the cluster machines (or any machine that can successfully use the Hadoop client) to your local machine; we'll call that machine $GOOD.
Set up Hadoop config files:
cd /etc
sudo mkdir hbase hadoop
sudo scp -r $GOOD:/etc/hbase/conf hbase
sudo scp -r $GOOD:/etc/hadoop/conf hadoop
Set up Hadoop libraries & executables:
mkdir ~/my-hadoop
scp -r $GOOD:/usr/lib/hadoop\* ~/my-hadoop
cd /usr/lib
sudo ln -s ~/my-hadoop/* .
path+=(/usr/lib/hadoop*/bin) # Add to $PATH (this syntax is for zsh)
Set up the Spark libraries & executables:
cd ~/Downloads
wget http://apache.mirrors.lucidnetworks.net/spark/spark-1.4.1/spark-1.4.1-bin-without-hadoop.tgz
tar -zxvf spark-1.4.1-bin-without-hadoop.tgz
cd spark-1.4.1-bin-without-hadoop
path+=(`pwd`/bin)
hdfs dfs -copyFromLocal lib/spark-assembly-*.jar /apps/local/
Set some environment variables:
export JAVA_HOME=$(/usr/libexec/java_home -v 1.7)
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_DIST_CLASSPATH=$(hadoop --config $HADOOP_CONF_DIR classpath)
`grep 'export HADOOP_LIBEXEC_DIR' $HADOOP_CONF_DIR/yarn-env.sh`
export SPOPTS="--driver-java-options=-Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib"
export SPOPTS="$SPOPTS --conf spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.4.1-hadoop2.2.0.jar"
Now the various spark shells can be run like so:
sparkR --master yarn $SPOPTS
spark-shell --master yarn $SPOPTS
pyspark --master yarn $SPOPTS
Some remarks:
The JAVA_HOME setting is the same as I've had all along; I just included it here for completeness. All the focus on JAVA_HOME turned out to be a red herring.
The --driver-java-options=-Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib was necessary because I was getting errors about java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path. The jnilib file is the correct choice for OS X.
The --conf spark.yarn.jar piece is just to save time, avoiding re-copying the assembly file to the cluster every time you fire up the shell or submit a job.
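With all of that in place, submitting the original job from the laptop looks roughly like this (a sketch: the class, jar, and executor count are the ones from the question above):
spark-submit --master yarn --deploy-mode cluster $SPOPTS \
--num-executors 54 \
--class TestObject \
target/scala-2.11/myapp-assembly-1.2.jar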
Well, to start off, I would recommend moving to Java 7. However, that is not what you are asking about or need help with.
For setting JAVA_HOME, I would recommend setting it in your .bashrc rather than in multiple files. Moreover, I would recommend installing Java with alternatives so that it is linked into /usr/bin.
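For the alternatives suggestion, a sketch on RHEL/CentOS-style systems (the JDK path and priority are examples only):
# Register a JDK with alternatives so /usr/bin/java points at it, then pick it interactively.
sudo alternatives --install /usr/bin/java java /usr/jdk64/jdk1.7.0_67/bin/java 2
sudo alternatives --config java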