Spark-Submit through command line does not enforce UTF-8 encoding - java

When I run my spark job from an IDE using Spark's Java APIs, I get the output in a desired encoding format (UTF-8). But if I start the 'spark-submit' method from command line, the output misses out on the encoding.
Is there a way where I can enforce encoding to 'spark-submit' when used through command line interface.
I am using Windows 10 OS and Eclipse IDE.
Your help will be really appreciated.
Thank you.

Run your Spark job like this :
spark-submit --class com.something.class --name "someName" --conf "spark.driver.extraJavaOptions=-Dfile.encoding=utf-8"

Not working in my case
The command i use is
spark-submit --class com.rera.esearch --jars /Users/nitinthakur/.ivy2/cache/mysql/mysql-connector-java/jars/mysql-connector-java-8.0.11.jar /Users/nitinthakur/IdeaProjects/Rera2/target/scala-2.11/rera2_2.11-0.1.jar
--conf "spark.driver.extraJavaOptions=-Dfile.encoding=utf-8" 127.0.0.1 root
Output of below commands
println(System.getProperty("file.encoding")) // US-ASCII
println(scala.util.Properties.encodingString) // US-ASCII

If you are seeing the issue in a code that runs in executor(like the code between foreachPartition or mapPartition) you would have to set spark.executor.extraJavaOptions that is
--conf 'spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8'
if your code is running in driver then set as said above, i.e
--conf "spark.driver.extraJavaOptions=-Dfile.encoding=utf-8"

Related

Spark submit on Yarn Cluster mode with config file put into HDFS issue

I have a spark program that needs to be passed a config file as a parameter for the main method. Currently when I submit the job in yarn cluster mode, I need to put the config file in all worker nodes so that the program can find it. However, I want to put it into HDFS path but will get the file not found error. Below is the command I use:
spark-submit --master yarn\
--name StreamingApp \
--deploy-mode cluster \
--class com.test.streaming.App \
--driver-java-options "-Djava.security.auth.login=/home/spark/auth.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/spark/auth.conf" \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/spark/auth.conf" \
--conf "spark.driver.extraClassPath=/etc/hbase/conf/" \
/home/spark/StreamingFramework-0.0.1-SNAPSHTO-jar-with-dependencies.jar /home/spark/config.json
How can I put the last parameter (/home/spark/config.json) into HDFS so it works?
Need some clarity with regards to the usage of this config file here.
In case it is just needed as an argument to the main method, & the content is being used for spark session initialisation, then there should be no need to copy it onto any of the worker nodes.
In case the file is needed in the driver or the executors, then you should be passing it using the --files argument.
Copying to hdfs from local can be done using https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html#copyFromLocal

hail.utils.java.FatalError: IllegalStateException: unread block data

I am trying to run a basic script on spark cluster that takes in a file, converts it and outputs in different format. The spark cluster at the moment consists of 1 master and 1 slave both running on the same node. The full command is:
nohup spark-submit --master spark://tr-nodedev1:7077 --verbose --conf spark.driver.port=40065 --driver-memory 4g --conf spark.driver.extraClassPath=/opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar
--conf spark.executor.extraClassPath=./hail-all-spark.jar ./hail_scripts/v02/convert_vcf_to_hail.py /clinvar_37.vcf -ht
--genome-version 37 --output /seqr-reference-hail2/clinvar_37.ht &
And it gives an error:
hail.utils.java.FatalError: IllegalStateException: unread block data
More detailed stack trace can be found on another forum where I asked the same question:
https://discuss.hail.is/t/unread-block-data-error-spark-master-slave-issue/1182
Such command works fine:
nohup spark-submit --conf spark.driver.extraClassPath=/opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar
--conf spark.executor.extraClassPath=./hail-all-spark.jar ./hail_scripts/v02/convert_vcf_to_hail.py /hgmd_pro_2019.3_hg19_noDB.vcf -ht
--genome-version 37 --output /seqr-reference-hail2/hgmd_2019.3_hg19_noDB.ht &
So, in local mode it runs fine, but in standalone it's not. So, I guess it is the issue of master-slave different settings, possibly JAVA. However, setting them in spark-env.sh like that:
export JAVA_HOME=/usr/lib/jvm/java
export SPARK_JAVA_OPTS+=" -Djava.library.path= $SPARK_LIBRARY_PATH : $JAVA_HOME "
Does not fix the issue. To start master + slave I just use start-all.sh script. Any suggestions would be greatly appreciated.
Ok, we fixed it and the solution was to add the following setting to our command that runs the script:
–jars /opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar
So, the working command is the following:
spark-submit --master spark://ai-grisnodedev1:7077 --verbose --conf spark.driver.port=40065 --driver-memory 4g --conf spark.driver.extraClassPath=/opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar --conf spark.executor.extraClassPath=./hail-all-spark.jar --jars /opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/hail/hail-all-spark.jar test_hail.py
For future Hail 0.2 users may be important to know that this --jars parameter is required to specify, and that it should point to hail-all-spark.jar.

Why does spark-submit fail with “Error executing Jupyter command”?

When trying to run Spark locally on my Mac (which used to work) ...
/Library/Java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/bin/java \
-cp /usr/local/Cellar/apache-spark/2.4.0/libexec/conf/:/usr/local/Cellar/apache-spark/2.4.0/libexec/jars/* \
-Xmx1g org.apache.spark.deploy.SparkSubmit \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 \
/Users/crump/main.py
I'm now getting the following error:
Error executing Jupyter command '/Users/crump/main.py': [Errno 2] No such file or directory
The file is there. Since I know this used to work, I must have installed something recently that changed a library, sdk, etc.
Ok, I found the answer finally: PYSPARK_DRIVER_PYTHON=jupyter in my environment. I set this up to launch Jupyter/Spark notebooks with just the pyspark command, but it causes spark-submit to fail.
The solution is set the variable to use python, not jupyter: PYSPARK_DRIVER_PYTHON=python.

spark-submit yarn-cluster with --jars does not work?

I am trying to submit a spark job to the CDH yarn cluster via the following commands
I have tried several combinations and it all does not work...
I now have all the poi jars located in both my local /root, as well as HDFS /user/root/lib, hence I have tried the following
spark-submit --master yarn-cluster --class "ReadExcelSC" ./excel_sc.jar --jars /root/poi-3.12.jars, /root/poi-ooxml-3.12.jar, /root/poi-ooxml-schemas-3.12.jar
spark-submit --master yarn-cluster --class "ReadExcelSC" ./excel_sc.jar --jars file:/root/poi-3.12.jars, file:/root/poi-ooxml-3.12.jar, file:/root/poi-ooxml-schemas-3.12.jar
spark-submit --master yarn-cluster --class "ReadExcelSC" ./excel_sc.jar --jars hdfs://mynamenodeIP:8020/user/root/poi-3.12.jars,hdfs://mynamenodeIP:8020/user/root/poi-ooxml-3.12.jar,hdfs://mynamenodeIP:8020/user/root/poi-ooxml-schemas-3.12.jar
How do I propogate the jars to all cluster nodes? because none of the above is working, and the job still somehow does not get to reference the class, as I keep getting the same error:
java.lang.NoClassDefFoundError: org/apache/poi/ss/usermodel/WorkbookFactory
The same command works with "--master local", without specifying the --jars, as I have copied my jars to /opt/cloudera/parcels/CDH/lib/spark/lib.
However for yarn-cluster mode, I would need to distribute the external jars to all cluster, but the above code does not work.
Appreciate your help, thanks.
p.s. I am using CDH5.4.2 with spark 1.3.0
According to help options from Spark Submit
--jars includes the local jars to include on the driver and executor classpaths. [it will just set the path]
---files will copy the jars needed for you appication to run to all the working dir of executor nodes [it will transport your jar to
working dir]
Note: This is similar to -file options in hadoop streaming , which transports the mapper/reducer scripts to slave nodes.
So try with --files options as well.
$ spark-submit --help
Options:
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
hope this helps
Have you tried the solution posted in this thread:
Spark on yarn jar upload problems
The problem was solved by copying spark-assembly.jar into a directory on the hdfs for each node and then passing it to spark-submit --conf spark.yarn.jar as a parameter. Commands are listed below:
hdfs dfs -copyFromLocal /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar /user/spark/spark-assembly.jar
/var/tmp/spark/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class MRContainer --master yarn-cluster --conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar simplemr.jar

JAVA_HOME error with upgrade to Spark 1.3.0

I’m trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to 1.3.0, so I changed my build.sbt like so:
-libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"
+libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
then make an assembly jar, and submit it:
HADOOP_CONF_DIR=/etc/hadoop/conf \
spark-submit \
--driver-class-path=/etc/hbase/conf \
--conf spark.hadoop.validateOutputSpecs=false \
--conf spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.3.0-hadoop2.4.0.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--deploy-mode=cluster \
--master=yarn \
--class=TestObject \
--num-executors=54 \
target/scala-2.11/myapp-assembly-1.2.jar
The job fails to submit, with the following exception in the terminal:
15/03/19 10:30:07 INFO yarn.Client:
15/03/19 10:20:03 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1420225286501_4698 failed 2 times due to AM
Container for appattempt_1420225286501_4698_000002 exited with exitCode: 127
due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Finally, I go and check the YARN app master’s web interface (since the job is there, I know it at least made it that far), and the only logs it shows are these:
Log Type: stderr
Log Length: 61
/bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory
Log Type: stdout
Log Length: 0
I’m not sure how to interpret that – is {{JAVA_HOME}} a literal (including the brackets) that’s somehow making it into a script? Is this coming from the worker nodes or the driver? Anything I can do to experiment & troubleshoot?
I do have JAVA_HOME set in the hadoop config files on all the nodes of the cluster:
% grep JAVA_HOME /etc/hadoop/conf/*.sh
/etc/hadoop/conf/hadoop-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
/etc/hadoop/conf/yarn-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
Has this behavior changed in 1.3.0 since 1.2.1? Using 1.2.1 and making no other changes, the job completes fine.
[Note: I originally posted this on the Spark mailing list, I'll update both places if/when I find a solution.]
Have you tried setting JAVA_HOME in the etc/hadoop/yarn-env.sh file? It's possible that your JAVA_HOME environment variable not available to the YARN containers that are running your job.
It has happened to me before that certain env variables that were in the .bashrc on the nodes were not being read by the yarn workers spawned on the cluster.
There is a chance that the error is unrelated to the version upgrade but instead related to YARN environment configuration.
Okay, so I got some other people in the office to help work on this, and we figured out a solution. I'm not sure how much of this is specific to the file layouts of Hortonworks HDP 2.0.6 on CentOS, which is what we're running on our cluster.
We manually copy some directories from one of the cluster machines (or any machine that can successfully use the Hadoop client) to your local machine. Let's call that machine $GOOD.
Set up Hadoop config files:
cd /etc
sudo mkdir hbase hadoop
sudo scp -r $GOOD:/etc/hbase/conf hbase
sudo scp -r $GOOD:/etc/hadoop/conf hadoop
Set up Hadoop libraries & executables:
mkdir ~/my-hadoop
scp -r $GOOD:/usr/lib/hadoop\* ~/my-hadoop
cd /usr/lib
sudo ln –s ~/my-hadoop/* .
path+=(/usr/lib/hadoop*/bin) # Add to $PATH (this syntax is for zsh)
Set up the Spark libraries & executables:
cd ~/Downloads
wget http://apache.mirrors.lucidnetworks.net/spark/spark-1.4.1/spark-1.4.1-bin-without-hadoop.tgz
tar -zxvf spark-1.4.1-bin-without-hadoop.tgz
cd spark-1.4.1-bin-without-hadoop
path+=(`pwd`/bin)
hdfs dfs -copyFromLocal lib/spark-assembly-*.jar /apps/local/
Set some environment variables:
export JAVA_HOME=$(/usr/libexec/java_home -v 1.7)
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_DIST_CLASSPATH=$(hadoop --config $HADOOP_CONF_DIR classpath)
`grep 'export HADOOP_LIBEXEC_DIR' $HADOOP_CONF_DIR/yarn-env.sh`
export SPOPTS="--driver-java-options=-Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib"
export SPOPTS="$SPOPTS --conf spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.4.1-hadoop2.2.0.jar"
Now the various spark shells can be run like so:
sparkR --master yarn $SPOPTS
spark-shell --master yarn $SPOPTS
pyspark --master yarn $SPOPTS
Some remarks:
The JAVA_HOME setting is the same as I've had all along - just included it here for completion. All the focus on JAVA_HOME turned out to be a red herring.
The --driver-java-options=-Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib was necessary because I was getting errors about java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path. The jnilib file is the correct choice for OS X.
The --conf spark.yarn.jar piece is just to save time, avoiding re-copying the assembly file to the cluster every time you fire up the shell or submit a job.
Well, to start off I would recommend you to move to Java 7. However, that is not what you are looking for or need help with.
For setting JAVA_HOME, I would recommend you set it in your bashrc, rather than setting in multiple files. Moreover, I would recommend you installing java with alternatives to /usr/bin.

Categories

Resources