JAVA_HOME error with upgrade to Spark 1.3.0

I’m trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to 1.3.0, so I changed my build.sbt like so:
-libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"
+libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
Then I make an assembly jar and submit it:
HADOOP_CONF_DIR=/etc/hadoop/conf \
spark-submit \
--driver-class-path=/etc/hbase/conf \
--conf spark.hadoop.validateOutputSpecs=false \
--conf spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.3.0-hadoop2.4.0.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--deploy-mode=cluster \
--master=yarn \
--class=TestObject \
--num-executors=54 \
target/scala-2.11/myapp-assembly-1.2.jar
The job fails to submit, with the following exception in the terminal:
15/03/19 10:20:03 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1420225286501_4698 failed 2 times due to AM
Container for appattempt_1420225286501_4698_000002 exited with exitCode: 127
due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Finally, I go and check the YARN app master’s web interface (since the job is there, I know it at least made it that far), and the only logs it shows are these:
Log Type: stderr
Log Length: 61
/bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory
Log Type: stdout
Log Length: 0
I’m not sure how to interpret that – is {{JAVA_HOME}} a literal (including the brackets) that’s somehow making it into a script? Is this coming from the worker nodes or the driver? Anything I can do to experiment & troubleshoot?
I do have JAVA_HOME set in the hadoop config files on all the nodes of the cluster:
% grep JAVA_HOME /etc/hadoop/conf/*.sh
/etc/hadoop/conf/hadoop-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
/etc/hadoop/conf/yarn-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
Has this behavior changed in 1.3.0 since 1.2.1? Using 1.2.1 and making no other changes, the job completes fine.
[Note: I originally posted this on the Spark mailing list, I'll update both places if/when I find a solution.]

Have you tried setting JAVA_HOME in the yarn-env.sh file (/etc/hadoop/conf/yarn-env.sh in your case)? It's possible that your JAVA_HOME environment variable is not available to the YARN containers that are running your job.
It has happened to me before that env variables set in .bashrc on the nodes were not being read by the YARN workers spawned on the cluster.
There is a chance that the error is unrelated to the version upgrade and is instead related to YARN environment configuration.
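If you want to rule that out without changing the cluster configuration, Spark on YARN can forward environment variables to the application master and the executors. A minimal sketch (trimmed to the relevant options), using the JDK path from your hadoop-env.sh:
# Sketch only: forward JAVA_HOME explicitly to the YARN application master and
# executors via spark.yarn.appMasterEnv.* / spark.executorEnv.*.
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/jdk64/jdk1.6.0_31 \
--conf spark.executorEnv.JAVA_HOME=/usr/jdk64/jdk1.6.0_31 \
--class TestObject \
target/scala-2.11/myapp-assembly-1.2.jar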

Okay, so I got some other people in the office to help work on this, and we figured out a solution. I'm not sure how much of this is specific to the file layouts of Hortonworks HDP 2.0.6 on CentOS, which is what we're running on our cluster.
The idea is to manually copy some directories from one of the cluster machines (or any machine that can successfully use the Hadoop client) to your local machine. Let's call that machine $GOOD.
Set up Hadoop config files:
cd /etc
sudo mkdir hbase hadoop
sudo scp -r $GOOD:/etc/hbase/conf hbase
sudo scp -r $GOOD:/etc/hadoop/conf hadoop
Set up Hadoop libraries & executables:
mkdir ~/my-hadoop
scp -r $GOOD:/usr/lib/hadoop\* ~/my-hadoop
cd /usr/lib
sudo ln -s ~/my-hadoop/* .
path+=(/usr/lib/hadoop*/bin) # Add to $PATH (this syntax is for zsh)
Set up the Spark libraries & executables:
cd ~/Downloads
wget http://apache.mirrors.lucidnetworks.net/spark/spark-1.4.1/spark-1.4.1-bin-without-hadoop.tgz
tar -zxvf spark-1.4.1-bin-without-hadoop.tgz
cd spark-1.4.1-bin-without-hadoop
path+=(`pwd`/bin)
hdfs dfs -copyFromLocal lib/spark-assembly-*.jar /apps/local/
Set some environment variables:
export JAVA_HOME=$(/usr/libexec/java_home -v 1.7)
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_DIST_CLASSPATH=$(hadoop --config $HADOOP_CONF_DIR classpath)
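# the next line executes grep's output (the export HADOOP_LIBEXEC_DIR=... line from yarn-env.sh) in the current shell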
`grep 'export HADOOP_LIBEXEC_DIR' $HADOOP_CONF_DIR/yarn-env.sh`
export SPOPTS="--driver-java-options=-Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib"
export SPOPTS="$SPOPTS --conf spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.4.1-hadoop2.2.0.jar"
Now the various spark shells can be run like so:
sparkR --master yarn $SPOPTS
spark-shell --master yarn $SPOPTS
pyspark --master yarn $SPOPTS
Some remarks:
The JAVA_HOME setting is the same as I've had all along; I just included it here for completeness. All the focus on JAVA_HOME turned out to be a red herring.
The --driver-java-options=-Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib was necessary because I was getting errors about java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path. The jnilib file is the correct choice for OS X.
The --conf spark.yarn.jar piece is just to save time, avoiding re-copying the assembly file to the cluster every time you fire up the shell or submit a job.
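For completeness, a batch job can be submitted from the same environment; the following is only a rough sketch, reusing the class and assembly jar names from the original question together with $SPOPTS:
# Sketch: submit a job from the laptop using the environment configured above.
# The class name and assembly jar are the ones from the original question.
spark-submit \
--master yarn \
--deploy-mode cluster \
$SPOPTS \
--class TestObject \
target/scala-2.11/myapp-assembly-1.2.jar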

Well, to start off, I would recommend moving to Java 7. However, that is not what you are asking about or what you need help with.
For setting JAVA_HOME, I would recommend setting it in your .bashrc rather than in multiple files. Moreover, I would recommend installing Java with alternatives so that it is linked into /usr/bin.
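On a CentOS-style node that would look roughly like the sketch below; the JDK path and priority are placeholders, not values from your cluster:
# Sketch (path and priority are placeholders): register a JDK with the
# alternatives system so that /usr/bin/java points at it.
sudo alternatives --install /usr/bin/java java /usr/java/jdk1.7.0_79/bin/java 200
sudo alternatives --config java # interactively choose which registered JDK to use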

Related

Google App Engine: JAVA command not recognized by Python in Flex environment with Docker (`java` command is not found from this Python process)

I deployed a Python project on Google App Engine. As the project has a dependency on Java, I used a Docker container to configure two environments: Python + Java.
However, when I make a call to my Python service in GAE, it fails with the error "java command is not found from this Python process. Please ensure Java is installed and PATH is set for java".
During the build of the Dockerfile I am able to access Java after installing it, but during API execution it is not recognized by Python.
The "app.yaml" file used:
runtime: custom
env: flex
entrypoint: gunicorn -w 4 -k uvicorn.workers.UvicornWorker src.main:app
Below is the Dockerfile used in the deploy:
### 1. Get Linux
FROM alpine:3.7
# Default to UTF-8 file.encoding
ENV LANG C.UTF-8
### 2. Get Java via the package manager
RUN apk update \
&& apk upgrade \
&& apk add --no-cache bash \
&& apk add --no-cache --virtual=build-dependencies unzip \
&& apk add --no-cache curl \
&& apk add --no-cache openjdk8-jre
#### OPTIONAL : 4. SET JAVA_HOME environment variable, uncomment the line below if you need it
ENV JAVA_HOME="/usr/lib/jvm/java-1.8-openjdk"
RUN export JAVA_HOME
ENV PATH $PATH:$JAVA_HOME/bin
RUN export PATH
RUN find / -name "java"
RUN java -version
FROM python:3.7
EXPOSE 8080
ENV APP_HOME /src
WORKDIR /src
COPY requirements.txt ./requirements.txt
RUN pip3 install -r requirements.txt
COPY . /src
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"]
Here is a printout of the RUN java -version command during the deploy process: [screenshot of the java -version output omitted]
Does anyone know why the error is happening even with Python and Java running on the same service on App Engine?
Are there any additional settings missing?
You are using two FROM statements in your Dockerfile, so you're effectively doing a multi-stage Docker build, but you are doing it wrong.
A multi-stage build should be something like this:
### 1. Get Linux
FROM alpine:3.7 as build
# here you install with apk and build your Java program
FROM python:3.7
EXPOSE 8080
...
# here you copy what you need from the previous Docker build, though
# since it was Java, which is not installed here anymore because it is
# a python image, you need to consider if you really need a multi-stage
# build
COPY --from=build /src /src
Otherwise, just remove the second FROM python:3.7 and install Python with apk in the Alpine image.
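A single-stage version along those lines could look roughly like the sketch below; the apk package names and the pip bootstrap are assumptions for alpine:3.7 and may need adjusting:
### Sketch: a single FROM, with both Java and Python installed via apk
### (package names are assumptions for alpine:3.7)
FROM alpine:3.7
ENV LANG C.UTF-8
RUN apk update \
&& apk add --no-cache bash curl openjdk8-jre python3 \
&& python3 -m ensurepip \
&& pip3 install --upgrade pip
ENV JAVA_HOME=/usr/lib/jvm/java-1.8-openjdk
ENV PATH=$PATH:$JAVA_HOME/bin
WORKDIR /src
COPY requirements.txt ./requirements.txt
RUN pip3 install -r requirements.txt
COPY . /src
EXPOSE 8080
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"]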

Why does spark-submit fail with “Error executing Jupyter command”?

When trying to run Spark locally on my Mac (which used to work) ...
/Library/Java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/bin/java \
-cp /usr/local/Cellar/apache-spark/2.4.0/libexec/conf/:/usr/local/Cellar/apache-spark/2.4.0/libexec/jars/* \
-Xmx1g org.apache.spark.deploy.SparkSubmit \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 \
/Users/crump/main.py
I'm now getting the following error:
Error executing Jupyter command '/Users/crump/main.py': [Errno 2] No such file or directory
The file is there. Since I know this used to work, I must have installed something recently that changed a library, sdk, etc.
Ok, I found the answer finally: PYSPARK_DRIVER_PYTHON=jupyter in my environment. I set this up to launch Jupyter/Spark notebooks with just the pyspark command, but it causes spark-submit to fail.
The solution is to set the variable to use python, not jupyter: PYSPARK_DRIVER_PYTHON=python.
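One way to keep both workflows is to set the Jupyter driver only when you actually want a notebook, for example (a sketch; PYSPARK_DRIVER_PYTHON_OPTS is the usual companion setting for launching the notebook UI):
# Sketch: enable the Jupyter driver only for interactive sessions...
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook pyspark
# ...so that batch submits keep using the plain python interpreter:
PYSPARK_DRIVER_PYTHON=python spark-submit /Users/crump/main.py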

Maven specify settings file location via MAVEN_OPTS

I need to use Maven with a settings file in a specific location. Normally you can pass options via the MAVEN_OPTS environment variable, but those are passed to the JVM, so the following fails:
$ MAVEN_OPTS="-s /settings.xml"
$ mvn clean
Unrecognized option: -s
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
I searched a lot and found two keys, org.apache.maven.user-settings and org.apache.maven.global-settings, which are explained here, but it seems they work with Maven 2 only. Aliasing mvn to mvn -s /settings.xml would probably work, but I do not like it.
From the mvn shell script:
# -----------------------------------------------------------------------------
# Apache Maven Startup Script
#
# Environment Variable Prerequisites
#
# JAVA_HOME Must point at your Java Development Kit installation.
# MAVEN_OPTS (Optional) Java runtime options used when Maven is executed.
# MAVEN_SKIP_RC (Optional) Flag to disable loading of mavenrc files.
# -----------------------------------------------------------------------------
so MAVEN_OPTS contains JVM arguments, not Maven arguments (which is consistent with the error message indicating the JVM doesn't like your arguments).
The actual invocation is
exec "$JAVACMD" \
$MAVEN_OPTS \
$MAVEN_DEBUG_OPTS \
-classpath "${CLASSWORLDS_JAR}" \
"-Dclassworlds.conf=${MAVEN_HOME}/bin/m2.conf" \
"-Dmaven.home=${MAVEN_HOME}" \
"-Dlibrary.jansi.path=${MAVEN_HOME}/lib/jansi-native" \
"-Dmaven.multiModuleProjectDirectory=${MAVEN_PROJECTBASEDIR}" \
${CLASSWORLDS_LAUNCHER} "$@"
so there is nowhere to put it. I would therefore suggest that you write your own mvn wrapper script which in turn calls the real Maven command with the arguments you like (in my experience scripts are more robust than aliases). Additionally, I have recently found that Java versions later than 8 have... interesting issues... so I really need to have mvn8, mvn11 (and perhaps more) commands anyway.
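Such a wrapper can be tiny; a sketch (the script name ~/bin/mvns and the settings path are placeholders):
#!/bin/sh
# Sketch of a personal wrapper script, e.g. saved as ~/bin/mvns and put on PATH:
# it forwards all arguments to the real Maven together with the fixed settings file.
exec mvn -s /settings.xml "$@"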
Another approach that I only started using recently is the Maven wrapper (https://github.com/takari/maven-wrapper) where a ./mvnw command is placed in your project which then downloads Maven when needed. This is very useful. To get started use
mvn -N io.takari:maven:wrapper
after which ./mvnw should be directly usable instead of mvn. The interesting part here is that the generated Maven command looks like
exec "$JAVACMD" \
$MAVEN_OPTS \
-classpath "$MAVEN_PROJECTBASEDIR/.mvn/wrapper/maven-wrapper.jar" \
"-Dmaven.home=${M2_HOME}" "-Dmaven.multiModuleProjectDirectory=${MAVEN_PROJECTBASEDIR}" \
${WRAPPER_LAUNCHER} $MAVEN_CONFIG "$@"
and MAVEN_CONFIG is not set earlier in the script. So for mvnw you can set MAVEN_CONFIG to your "-s /settings.xml" string.
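For example (a sketch):
# Sketch: mvnw appends $MAVEN_CONFIG to its command line, so the settings file
# is picked up without touching MAVEN_OPTS.
export MAVEN_CONFIG="-s /settings.xml"
./mvnw clean install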
Maven 4
The MAVEN_ARGS environment variable is supported and can be used.
Maven 3
There was a feature request, MNG-5824: Support MAVEN_ARGS environment variable as a way of supplying default command line arguments. It was closed unimplemented, with a suggestion to use the .mvn/maven.config file in the project directory instead.
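That file simply holds default command-line arguments; a sketch (newer Maven versions expect one argument per line):
# Sketch: project-level default arguments picked up by Maven 3.3.1+.
mkdir -p .mvn
printf -- '-s\n/settings.xml\n' > .mvn/maven.config
mvn clean verify # now runs as if -s /settings.xml had been passed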

Override spark's libraries in spark submit

Our application's Hadoop cluster has Spark 1.5 installed, but due to specific requirements we have developed our Spark job against version 2.0.2. When I submit the job to YARN, I use the --jars option to override the Spark libraries on the cluster, but it is still not picking up the Scala library jar. It throws an error saying:
ApplicationMaster: User class threw exception: java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
at org.apache.spark.sql.SparkSession$Builder.config(SparkSession.scala:713)
at org.apache.spark.sql.SparkSession$Builder.appName(SparkSession.scala:704)
Any ideas about how to override the cluster libraries during spark submit ?
The shell command I use to submit the job is below.
spark-submit \
--jars test.jar,spark-core_2.11-2.0.2.jar,spark-sql_2.11-2.0.2.jar,spark-catalyst_2.11-2.0.2.jar,scala-library-2.11.0.jar \
--class Application \
--master yarn \
--deploy-mode cluster \
--queue xxx \
xxx.jar \
<params>
That's fairly easy: YARN doesn't care which version of Spark you are running; it executes the jars provided by the YARN client, which is packaged by spark-submit. That process packages your application jar along with the Spark libs.
In order to deploy Spark 2.0 instead of the provided 1.5, you just need to install Spark 2.0 on the host from which you start your job (e.g. in your home dir), set the HADOOP_CONF_DIR/YARN_CONF_DIR env vars to point to your Hadoop conf, and then use that installation's spark-submit.
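Concretely, that could look like the sketch below; the download URL, package variant, and conf path are assumptions, so adjust them for your environment:
# Sketch: run a locally unpacked Spark 2.0.2 against the cluster's YARN.
cd ~
wget https://archive.apache.org/dist/spark/spark-2.0.2/spark-2.0.2-bin-hadoop2.7.tgz
tar -zxf spark-2.0.2-bin-hadoop2.7.tgz
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
~/spark-2.0.2-bin-hadoop2.7/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue xxx \
--class Application \
xxx.jar <params>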

How to create and configure Hadoop client script?

There is a running Hadoop cluster.
And I have downloaded a Hadoop distribution (in this case 0.20.205.0).
I need to create a shell script (bash/zsh/perl) that will be capable of calling Hadoop on that cluster. Ideally it should be callable from a Sqoop script this way:
exec ${HADOOP_HOME}/bin/hadoop com.cloudera.sqoop.Sqoop "$@"
How can I call Hadoop and provide namenode/jobtracker URIs?
How do I provide extra libs with Sqoop and DB drivers?
This should be simple enough using the Hadoop generic options. I'm assuming you've configured the contents of ${HADOOP_HOME}/conf for your cluster (namely core-site.xml and mapred-site.xml):
exec ${HADOOP_HOME}/bin/hadoop com.cloudera.sqoop.Sqoop \
-libjars myjar1.jar,myjar2.jar "$@"
Here you pass the jars to be placed on the classpath via the -libjars option.
If you have multiple clusters you want to target, then you'll just need to either create different conf folders for each cluster and set the HADOOP_CONF_DIR environment variable prior to calling the hadoop script, or you can use the -Dkey=value generic arguments to set the fs.default.name and mapred.job.tracker appropriately:
exec ${HADOOP_HOME}/bin/hadoop com.cloudera.sqoop.Sqoop \
-libjars myjar1.jar,myjar2.jar \
-Dfs.default.name=hdfs://namenode-servername:9000 \
-Dmapred.job.tracker=jobtracker-servername:9001 \
"$@"
My problem actually was running Sqoop, so I solved it by simply supplying the -fs and -jt parameters as the first arguments to the Sqoop command (e.g. sqoop-import):
sqoop-import \
-fs $HADOOP_FILESYSTEM -jt $HADOOP_JOB_TRACKER \
--connect $DB_CONNECTION_STRING --username $DB_USER -P \
--outdir /home/user/sqoop/generated_code \
"$@" # <- other parameters
