I am running Spark jobs in YARN client mode, submitting them with spark-submit from inside a Unix shell script. I want to keep a log for each Spark job that runs.
I tried the command below to capture the log:
spark-submit --master yarn --deploy-mode client --num-executors 10 --executor-memory 2G --driver-memory 2G --jars $spark_jars --class $spark_class $main_jar |& tee -a ${log_file}
but here, if the Spark job fails, the failure is not caught by the status check below: the shell's $? reflects the exit status of the last command in the pipeline (tee), which succeeds whether the Spark job succeeds or fails:
if [ $? -eq 0 ]; then
echo "===========SPARK JOB COMPLETED==================" |& tee -a ${log_file}
else
echo "===========SPARK JOB FAILED=====================" |& tee -a ${log_file}
fi
I tried using log4j but could not get it to work.
I want each Spark job's log file stored on the local Unix server.
Please help!
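A shell-level workaround for the status check (a sketch, assuming bash; the variables are the ones used above) is to read the exit code of spark-submit itself from PIPESTATUS instead of relying on $?:
spark-submit --master yarn --deploy-mode client --num-executors 10 --executor-memory 2G --driver-memory 2G --jars $spark_jars --class $spark_class $main_jar |& tee -a ${log_file}
# PIPESTATUS[0] holds the exit code of the first command in the pipeline (spark-submit), not tee
spark_status=${PIPESTATUS[0]}
if [ ${spark_status} -eq 0 ]; then
echo "===========SPARK JOB COMPLETED==================" | tee -a ${log_file}
else
echo "===========SPARK JOB FAILED=====================" | tee -a ${log_file}
fi
Alternatively, putting set -o pipefail at the top of the script makes the pipeline's overall exit status non-zero whenever spark-submit fails.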
As soon as you submit your Spark application, it is assigned an application_id. Since the application runs on a distributed cluster, you cannot capture its logs simply by redirecting output.
When you do something like the following, it only redirects the driver's console logging into a file:
spark-submit --master yarn --deploy-mode client --num-executors 10 --executor-memory 2G --driver-memory 2G --jars $spark_jars --class $spark_class $main_jar > ${log_file}
To get the logs of a Spark application submitted to the YARN cluster, you need to use the yarn logs command:
yarn logs -applicationId <application ID> [OPTIONS]
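For example, to store a job's aggregated YARN logs in a local file (a sketch, assuming bash, YARN log aggregation enabled, and that ${log_file} holds the spark-submit console output from the command above):
# pull the application id out of the console log, then dump the YARN logs to a local file
app_id=$(grep -oE "application_[0-9]+_[0-9]+" ${log_file} | head -1)
yarn logs -applicationId ${app_id} > ${log_file}.yarn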
I have a Spark program whose main method needs a config file passed as a parameter. Currently, when I submit the job in yarn cluster mode, I have to put the config file on all worker nodes so the program can find it. I would rather put it on an HDFS path, but then I get a file-not-found error. Below is the command I use:
spark-submit --master yarn \
--name StreamingApp \
--deploy-mode cluster \
--class com.test.streaming.App \
--driver-java-options "-Djava.security.auth.login=/home/spark/auth.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/spark/auth.conf" \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/spark/auth.conf" \
--conf "spark.driver.extraClassPath=/etc/hbase/conf/" \
/home/spark/StreamingFramework-0.0.1-SNAPSHOT-jar-with-dependencies.jar /home/spark/config.json
How can I put the last parameter (/home/spark/config.json) into HDFS so it works?
Some clarity is needed with regard to how this config file is used.
If it is only needed as an argument to the main method, and its content is used for Spark session initialisation, then there should be no need to copy it onto any of the worker nodes.
If the file is needed by the driver or the executors, then you should pass it using the --files argument.
Copying from local to HDFS can be done with hdfs dfs -copyFromLocal; see https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html#copyFromLocal
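For example (a sketch; the HDFS destination path is an assumption), copy the config to HDFS once, pass it with --files so YARN localises it into every container's working directory, and then refer to it by its bare file name in the program arguments:
hdfs dfs -copyFromLocal /home/spark/config.json /user/spark/config.json
spark-submit --master yarn \
--deploy-mode cluster \
--class com.test.streaming.App \
--files hdfs:///user/spark/config.json \
/home/spark/StreamingFramework-0.0.1-SNAPSHOT-jar-with-dependencies.jar config.json
In cluster mode both the driver and the executors then get a local copy of config.json in their working directory, so nothing needs to be copied onto the worker nodes by hand.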
I am trying to run a basic script on a Spark cluster that takes in a file, converts it, and writes the output in a different format. The cluster currently consists of one master and one slave, both running on the same node. The full command is:
nohup spark-submit --master spark://tr-nodedev1:7077 --verbose \
--conf spark.driver.port=40065 --driver-memory 4g \
--conf spark.driver.extraClassPath=/opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar \
--conf spark.executor.extraClassPath=./hail-all-spark.jar \
./hail_scripts/v02/convert_vcf_to_hail.py /clinvar_37.vcf -ht \
--genome-version 37 --output /seqr-reference-hail2/clinvar_37.ht &
And it gives an error:
hail.utils.java.FatalError: IllegalStateException: unread block data
A more detailed stack trace can be found on another forum where I asked the same question:
https://discuss.hail.is/t/unread-block-data-error-spark-master-slave-issue/1182
The following command, however, works fine:
nohup spark-submit \
--conf spark.driver.extraClassPath=/opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar \
--conf spark.executor.extraClassPath=./hail-all-spark.jar \
./hail_scripts/v02/convert_vcf_to_hail.py /hgmd_pro_2019.3_hg19_noDB.vcf -ht \
--genome-version 37 --output /seqr-reference-hail2/hgmd_2019.3_hg19_noDB.ht &
So, in local mode it runs fine, but in standalone mode it does not. I guess the master and slave have different settings, possibly for Java. However, setting them in spark-env.sh like this:
export JAVA_HOME=/usr/lib/jvm/java
export SPARK_JAVA_OPTS+=" -Djava.library.path=$SPARK_LIBRARY_PATH:$JAVA_HOME"
does not fix the issue. To start the master and the slave I just use the start-all.sh script. Any suggestions would be greatly appreciated.
OK, we fixed it; the solution was to add the following option to the command that runs the script:
--jars /opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar
So, the working command is the following:
spark-submit --master spark://ai-grisnodedev1:7077 --verbose \
--conf spark.driver.port=40065 --driver-memory 4g \
--conf spark.driver.extraClassPath=/opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar \
--conf spark.executor.extraClassPath=./hail-all-spark.jar \
--jars /opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar \
test_hail.py
For future Hail 0.2 users it may be important to know that the --jars parameter is required, and that it should point to hail-all-spark.jar.
Our application's Hadoop cluster has Spark 1.5 installed, but due to specific requirements we developed our Spark job against version 2.0.2. When I submit the job to YARN, I use the --jars option to override the Spark libraries on the cluster, but it still does not pick up the Scala library jar. It throws an error saying:
ApplicationMaster: User class threw exception: java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
at org.apache.spark.sql.SparkSession$Builder.config(SparkSession.scala:713)
at org.apache.spark.sql.SparkSession$Builder.appName(SparkSession.scala:704)
Any ideas on how to override the cluster libraries during spark-submit?
The shell command I use to submit the job is below.
spark-submit \
--jars test.jar,spark-core_2.11-2.0.2.jar,spark-sql_2.11-2.0.2.jar,spark-catalyst_2.11-2.0.2.jar,scala-library-2.11.0.jar \
--class Application \
--master yarn \
--deploy-mode cluster \
--queue xxx \
xxx.jar \
<params>
That's fairly easy: YARN doesn't care which version of Spark you are running; it executes the jars provided by the YARN client that spark-submit packages. That process ships your application jar along with the Spark libraries.
To deploy Spark 2.0 instead of the provided 1.5, you just need to install Spark 2.0 on the host from which you start the job (e.g. in your home directory), set the YARN_CONF_DIR environment variable to point to your Hadoop configuration, and then use that installation's spark-submit.
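A minimal sketch of that setup, assuming Spark 2.0.2 was unpacked into the home directory and the Hadoop client configuration lives in /etc/hadoop/conf (adjust both paths to your environment):
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
~/spark-2.0.2-bin-hadoop2.7/bin/spark-submit \
--class Application \
--master yarn \
--deploy-mode cluster \
--queue xxx \
xxx.jar \
<params>
Because the local Spark 2.0.2 distribution is what gets packaged and uploaded to YARN, there should be no need to pass the Spark and Scala jars through --jars at all.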
I am new to Spark. I am trying to run a Spark job in client mode, and it works well as long as I use the same paths for the jar and other resource files. After I kill the running application with a YARN command and resubmit the Spark job with updated jar and file locations, the job still uses my old paths. After a system reboot, the Spark job picks up the new paths. The spark-submit command:
spark-submit \
--class export.streaming.DataExportStreaming \
--jars /usr/hdp/current/spark-client/lib/postgresql-9.4.1209.jar \
--driver-class-path /usr/hdp/current/spark-client/lib/postgresql-9.4.1209.jar \
--conf spark.driver.extraClassPath=/usr/hdp/current/spark-client/lib/postgresql-9.4.1209.jar \
--conf spark.executor.extraClassPath=/usr/hdp/current/spark-client/lib/postgresql-9.4.1209.jar \
--master yarn --deploy-mode client \
--files /usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_selfservice_session_daily.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_selfservice_session_device.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_selfservice_session_workflow.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_selfservice_session_workflow_step.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_assignment.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_daily.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_device.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_queue.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_workflow.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_session_workflow_step.sql,/usr/lib/firebet-spark/52.0.2-1/data-export/resources/rollup_user_login_session.sql /usr/lib/firebet-spark/52.0.2-1/data-export/lib/data-export-assembly-52.0.2-1.jar /usr/lib/firebet-spark/52.0.2-1/data-export/resources/application.conf
How can I fix this problem?
Is the spark-submit command correct?
Which deploy mode is better in production, client or cluster?
I've got a Marathon job running the following:
./bin/spark-class org.apache.spark.deploy.mesos.MesosClusterDispatcher
Following that, I'd now like to run individual Spark jobs as separate Marathon jobs with the command:
./bin/spark-submit ....
My question is:
how can I call spark-submit from a Mesos executor if the binaries are not installed on it?
(Note: I'm aware that http://spark.apache.org/docs/latest/running-on-mesos.html#connecting-spark-to-mesos also recommends installing Spark on all the Mesos slaves, but is that the only option?)
Any guidance is much appreciated.
Just run the following command:
/opt/spark/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --class org.apache.spark.examples.SparkPi --master mesos://127.0.0.1:31258 --deploy-mode cluster --supervise --executor-memory 2G --total-executor-cores 1 /opt/spark/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.1.0.jar 1000
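If the Spark binaries are not installed on the host where Marathon schedules the task, one option (a sketch; the Marathon endpoint and tarball URL are assumptions) is to have Marathon/Mesos fetch and unpack a Spark distribution into the task sandbox and run spark-submit from there:
curl -X POST http://marathon.example.com:8080/v2/apps \
-H "Content-Type: application/json" \
-d '{
  "id": "spark-pi",
  "cpus": 1,
  "mem": 1024,
  "instances": 1,
  "uris": ["https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz"],
  "cmd": "cd spark-2.1.0-bin-hadoop2.7 && ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master mesos://127.0.0.1:31258 --deploy-mode cluster --supervise --executor-memory 2G --total-executor-cores 1 examples/jars/spark-examples_2.11-2.1.0.jar 1000"
}'
Note that the agents which end up running the driver and executors may still need access to a Spark distribution, e.g. via spark.executor.uri, as described in the Mesos page linked above.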