Spark Job using old application resources and jar - java

I am new to spark . Trying to run spark job with client mode and it works well if I use the same path for jar and other resource files. After killing the running application using Yarn command and if spark job is resubmitted with updated jar and file locations, job still uses my old path. After reboot of system , spark job takes new path. Spark-submit command
spark-submit \
--class export.streaming.DataExportStreaming \
--jars /usr/hdp/current/spark-client/lib/postgresql-9.4.1209.jar \
--driver-class-path /usr/hdp/current/spark-client/lib/postgresql-9.4.1209.jar \
--conf spark.driver.extraClassPath=/usr/hdp/current/spark-client/lib/postgresql-9.4.1209.jar \
--conf spark.executor.extraClassPath=/usr/hdp/current/spark-client/lib/postgresql-9.4.1209.jar \
--master yarn --deploy-mode client \
--files /usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_selfservice_session_daily.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_selfservice_session_device.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_selfservice_session_workflow.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_selfservice_session_workflow_step.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_assignment.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_daily.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_device.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_queue.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_workflow.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_workflow_step.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_user_login_session.sql /usr/lib/firebet-spark/52.0.2-1/data-export/lib/data-export-assembly-52.0.2-1.jar /usr/lib/firebet-spark/52.0.2-1/data-export/resources/application.conf
How to fix this problem ?
Is spark-submit command is correct ?
Which deployment mode is better client or cluster in production ?

Related

Spark submit on Yarn Cluster mode with config file put into HDFS issue

I have a spark program that needs to be passed a config file as a parameter for the main method. Currently when I submit the job in yarn cluster mode, I need to put the config file in all worker nodes so that the program can find it. However, I want to put it into HDFS path but will get the file not found error. Below is the command I use:
spark-submit --master yarn\
--name StreamingApp \
--deploy-mode cluster \
--class com.test.streaming.App \
--driver-java-options "-Djava.security.auth.login=/home/spark/auth.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/spark/auth.conf" \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/spark/auth.conf" \
--conf "spark.driver.extraClassPath=/etc/hbase/conf/" \
/home/spark/StreamingFramework-0.0.1-SNAPSHTO-jar-with-dependencies.jar /home/spark/config.json
How can I put the last parameter (/home/spark/config.json) into HDFS so it works?
Need some clarity with regards to the usage of this config file here.
In case it is just needed as an argument to the main method, & the content is being used for spark session initialisation, then there should be no need to copy it onto any of the worker nodes.
In case the file is needed in the driver or the executors, then you should be passing it using the --files argument.
Copying to hdfs from local can be done using https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html#copyFromLocal

hail.utils.java.FatalError: IllegalStateException: unread block data

I am trying to run a basic script on spark cluster that takes in a file, converts it and outputs in different format. The spark cluster at the moment consists of 1 master and 1 slave both running on the same node. The full command is:
nohup spark-submit --master spark://tr-nodedev1:7077 --verbose --conf spark.driver.port=40065 --driver-memory 4g --conf spark.driver.extraClassPath=/opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar
--conf spark.executor.extraClassPath=./hail-all-spark.jar ./hail_scripts/v02/convert_vcf_to_hail.py /clinvar_37.vcf -ht
--genome-version 37 --output /seqr-reference-hail2/clinvar_37.ht &
And it gives an error:
hail.utils.java.FatalError: IllegalStateException: unread block data
More detailed stack trace can be found on another forum where I asked the same question:
https://discuss.hail.is/t/unread-block-data-error-spark-master-slave-issue/1182
Such command works fine:
nohup spark-submit --conf spark.driver.extraClassPath=/opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar
--conf spark.executor.extraClassPath=./hail-all-spark.jar ./hail_scripts/v02/convert_vcf_to_hail.py /hgmd_pro_2019.3_hg19_noDB.vcf -ht
--genome-version 37 --output /seqr-reference-hail2/hgmd_2019.3_hg19_noDB.ht &
So, in local mode it runs fine, but in standalone it's not. So, I guess it is the issue of master-slave different settings, possibly JAVA. However, setting them in spark-env.sh like that:
export JAVA_HOME=/usr/lib/jvm/java
export SPARK_JAVA_OPTS+=" -Djava.library.path= $SPARK_LIBRARY_PATH : $JAVA_HOME "
Does not fix the issue. To start master + slave I just use start-all.sh script. Any suggestions would be greatly appreciated.
Ok, we fixed it and the solution was to add the following setting to our command that runs the script:
–jars /opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar
So, the working command is the following:
spark-submit --master spark://ai-grisnodedev1:7077 --verbose --conf spark.driver.port=40065 --driver-memory 4g --conf spark.driver.extraClassPath=/opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar --conf spark.executor.extraClassPath=./hail-all-spark.jar --jars /opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar test_hail.py
For future Hail 0.2 users may be important to know that this --jars parameter is required to specify, and that it should point to hail-all-spark.jar.

configure log4j for each spark job running on yarn mode

I am running spark jobs on yarn client mode. I am running these jobs using spark-submit command inside unix script. I want to have logs for each spark job running.
I tried using below command to get log :
spark-submit --master yarn --deploy-mode client --num-executors 10 --executor-memory 2G --driver-memory 2G --jars $spark_jars --class $spark_class $main_jar |& tee -a ${log_file}
but here if spark job gets failed , it will not be caught in command status check , may be unix checks status of |$tee command which is always success whether spark job succeed or fails
if [ $? -eq 0 ]; then
echo "===========SPARK JOB COMPLETED==================" |& tee -a ${log_file}
else
echo "===========SPARK JOB FAILED=====================" |& tee -a ${log_file}
fi
I tried using log4j but couldn't succeed.
I want to have each spark job log file stored on local unix server.
Please help !!
As soon as you submit your spark application. It generates an application_id. Since this application is running in distributed cluster you can't get logs of spark application with redirection.
However, when you do something like below, it just redirects console logging into a file.
spark-submit --master yarn --deploy-mode client --num-executors 10 --executor-memory 2G --driver-memory 2G --jars $spark_jars --class $spark_class $main_jar > ${log_file}
To get the logging of the spark application submitted to yarn cluster for example, you need to use yarn logs command:
yarn logs -applicationId <application ID> [OPTIONS]

Override spark's libraries in spark submit

Our application's hadoop cluster has spark 1.5 installed. But due to specific requirements we have developed spark job with version 2.0.2. When I submit the job to yarn, I use the --jars command to override the spark libraries in cluster. But still it is not picking the scala library jar. It throws an error saying
ApplicationMaster: User class threw exception: java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
at org.apache.spark.sql.SparkSession$Builder.config(SparkSession.scala:713)
at org.apache.spark.sql.SparkSession$Builder.appName(SparkSession.scala:704)
Any ideas about how to override the cluster libraries during spark submit ?
The shell command I use to submit the job is below.
spark-submit \
--jars test.jar,spark-core_2.11-2.0.2.jar,spark-sql_2.11-2.0.2.jar,spark-catalyst_2.11-2.0.2.jar,scala-library-2.11.0.jar \
--class Application \
--master yarn \
--deploy-mode cluster \
--queue xxx \
xxx.jar \
<params>
That's fairly easy - Yarn doesn't care which version of Spark you are running, it will execute the jars provided by the yarn client which is packaged by spark submit. That process packages your application jar along the spark libs.
In order to deploy Spark 2.0 instead of the provided 1.5, you just need to install spark 2.0 on the host from which you start your job, e.g. in your home dir, set the YARN_CONF_DIR env vars to point to your hadoop conf and then use that spark-submit.

spark-submit yarn-cluster with --jars does not work?

I am trying to submit a spark job to the CDH yarn cluster via the following commands
I have tried several combinations and it all does not work...
I now have all the poi jars located in both my local /root, as well as HDFS /user/root/lib, hence I have tried the following
spark-submit --master yarn-cluster --class "ReadExcelSC" ./excel_sc.jar --jars /root/poi-3.12.jars, /root/poi-ooxml-3.12.jar, /root/poi-ooxml-schemas-3.12.jar
spark-submit --master yarn-cluster --class "ReadExcelSC" ./excel_sc.jar --jars file:/root/poi-3.12.jars, file:/root/poi-ooxml-3.12.jar, file:/root/poi-ooxml-schemas-3.12.jar
spark-submit --master yarn-cluster --class "ReadExcelSC" ./excel_sc.jar --jars hdfs://mynamenodeIP:8020/user/root/poi-3.12.jars,hdfs://mynamenodeIP:8020/user/root/poi-ooxml-3.12.jar,hdfs://mynamenodeIP:8020/user/root/poi-ooxml-schemas-3.12.jar
How do I propogate the jars to all cluster nodes? because none of the above is working, and the job still somehow does not get to reference the class, as I keep getting the same error:
java.lang.NoClassDefFoundError: org/apache/poi/ss/usermodel/WorkbookFactory
The same command works with "--master local", without specifying the --jars, as I have copied my jars to /opt/cloudera/parcels/CDH/lib/spark/lib.
However for yarn-cluster mode, I would need to distribute the external jars to all cluster, but the above code does not work.
Appreciate your help, thanks.
p.s. I am using CDH5.4.2 with spark 1.3.0
According to help options from Spark Submit
--jars includes the local jars to include on the driver and executor classpaths. [it will just set the path]
---files will copy the jars needed for you appication to run to all the working dir of executor nodes [it will transport your jar to
working dir]
Note: This is similar to -file options in hadoop streaming , which transports the mapper/reducer scripts to slave nodes.
So try with --files options as well.
$ spark-submit --help
Options:
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
hope this helps
Have you tried the solution posted in this thread:
Spark on yarn jar upload problems
The problem was solved by copying spark-assembly.jar into a directory on the hdfs for each node and then passing it to spark-submit --conf spark.yarn.jar as a parameter. Commands are listed below:
hdfs dfs -copyFromLocal /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar /user/spark/spark-assembly.jar
/var/tmp/spark/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class MRContainer --master yarn-cluster --conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar simplemr.jar

Categories

Resources