There is a running Hadoop cluster.
I have downloaded a Hadoop distribution (in this case 0.20.205.0).
I need to create a shell script (bash/zsh/perl) that can call Hadoop on that cluster. Ideally it should be callable from a Sqoop script this way:
exec ${HADOOP_HOME}/bin/hadoop com.cloudera.sqoop.Sqoop "$@"
How can I call Hadoop and provide namenode/jobtracker URIs?
How do I provide extra libs with Sqoop and DB drivers?
This should be simple enough using the Hadoop generic options. I'm assuming you've configured the contents of ${HADOOP_HOME}/conf for your cluster (namely core-site.xml and mapred-site.xml):
exec ${HADOOP_HOME}/bin/hadoop com.cloudera.sqoop.Sqoop \
-libjars myjar1.jar,myjar2.jar "$@"
Here you pass the jars to be placed on the classpath via the -libjars option.
If you have multiple clusters you want to target, you'll need to either create a different conf folder for each cluster and set the HADOOP_CONF_DIR environment variable before calling the hadoop script, or use the -Dkey=value generic arguments to set fs.default.name and mapred.job.tracker appropriately:
exec ${HADOOP_HOME}/bin/hadoop com.cloudera.sqoop.Sqoop \
-libjars myjar1.jar,myjar2.jar \
-Dfs.default.name=hdfs://namenode-servername:9000 \
-Dmapred.job.tracker=jobtracker-servername:9001 \
"$#"
My problem was actually running Sqoop, so I solved it by simply supplying the -fs and -jt parameters as the first arguments to the Sqoop command (e.g. sqoop-import):
sqoop-import \
-fs $HADOOP_FILESYSTEM -jt $HADOOP_JOB_TRACKER \
--connect $DB_CONNECTION_STRING --username $DB_USER -P \
--outdir /home/user/sqoop/generated_code \
"$#" # <- other parameters
I have a Spark program that needs a config file passed as a parameter to the main method. Currently, when I submit the job in yarn cluster mode, I need to put the config file on all worker nodes so that the program can find it. However, when I try to use an HDFS path instead, I get a file-not-found error. Below is the command I use:
spark-submit --master yarn \
--name StreamingApp \
--deploy-mode cluster \
--class com.test.streaming.App \
--driver-java-options "-Djava.security.auth.login=/home/spark/auth.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/spark/auth.conf" \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/spark/auth.conf" \
--conf "spark.driver.extraClassPath=/etc/hbase/conf/" \
/home/spark/StreamingFramework-0.0.1-SNAPSHOT-jar-with-dependencies.jar /home/spark/config.json
How can I put the last parameter (/home/spark/config.json) into HDFS so it works?
Some clarity is needed on how this config file is actually used.
If it is only needed as an argument to the main method, and its content is used to initialise the Spark session, then there should be no need to copy it onto any of the worker nodes.
If the file is needed by the driver or the executors, then you should pass it using the --files argument.
Copying from local to HDFS can be done with the FileSystem shell's copyFromLocal command: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html#copyFromLocal
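A rough sketch of that workflow, assuming an HDFS destination of /user/spark/config.json (adjust the paths to your environment):
hdfs dfs -copyFromLocal /home/spark/config.json /user/spark/config.json
spark-submit --master yarn \
--deploy-mode cluster \
--class com.test.streaming.App \
--files hdfs:///user/spark/config.json \
/home/spark/StreamingFramework-0.0.1-SNAPSHOT-jar-with-dependencies.jar config.json
With --files, YARN localizes the file into each container's working directory, so the application can open it by its bare name (config.json) rather than an absolute local path.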
There is a Dockerfile:
FROM openjdk:11.0.12-jre-slim
COPY target/app.jar /app.jar
COPY configs configs
ENTRYPOINT ["java","-jar","/app.jar"]
The configs folder contains JSON configs for the Java application.
The docker build command is:
docker build --build-arg -f ~/IdeaProjects/app --no-cache -t app:latest
And the run command is:
docker run --entrypoint="cp configs var/opt/configs/ && java -jar app.jar" app:latest
Let's set aside the option of copying the configs in the Dockerfile via the COPY command; unfortunately, this must be done using --entrypoint.
An error occurs when the docker run command is executed:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: exec: "cp configs var/opt/configs/ && java -jar app.jar": stat cp configs var/opt/configs/ && java -jar app.jar: no such file or directory: unknown.
Could you explain why the error occurred in this case?
I would do this with an entrypoint wrapper script. A Dockerfile can have both an ENTRYPOINT and a CMD; if you do, the CMD gets passed as arguments to the ENTRYPOINT. This means you can make the ENTRYPOINT a shell script that does first-time setup, then ends with exec "$@" to replace itself with the CMD.
#!/bin/sh
# docker-entrypoint.sh
# copy the configuration to the right place
cp -r configs var/opt/configs/
# run the main container command
exec "$#"
In the Dockerfile, make sure to COPY the script in (it should be checked in to source control as executable) and set it as the ENTRYPOINT.
...
COPY docker-entrypoint.sh .
ENTRYPOINT ["./docker-entrypoint.sh"] # must be JSON-array syntax
CMD ["java", "-jar", "/app.jar"] # what was previously ENTRYPOINT
When you run the container it's straightforward to replace the CMD, so you can double-check that this is doing the right thing by running an interactive shell in place of the java application.
docker run -v "$PWD/alt-configs:/configs" --rm -it my-image sh
If you do need to override the command like this at docker run time, the command you show uses && to run two commands consecutively. This needs to run a shell to be understood correctly, and in this context you need to manually provide a /bin/sh -c wrapper.
I would still recommend changing ENTRYPOINT to CMD in your Dockerfile; then you could run a relatively straightforward
docker run \
... \
-v "$PWD/alt-configs:/configs" \
my-image \
/bin/sh -c 'cp configs var/opt/configs && java -jar /app.jar'
If you use --entrypoint, it can only name the executable to run (the remaining arguments have to go after the image name), and it is a docker run option, so it needs to come before the image name. I'd recommend designing your image to avoid needing this awkward construct, but it would look like this:
docker run \
... \
-v "$PWD/alt-configs:/configs" \
--entrypoint /bin/sh \
my-image \
-c 'cp configs var/opt/configs && java -jar /app.jar'
Your proposed command fails because it passes the entire command line, including the embedded spaces and shell operators, as a single word; the OS-level process handling then looks for an executable file with spaces and ampersands in its filename, hence the "no such file or directory" error.
I need to use Maven with a settings file in a specific location. Normally you can use the MAVEN_OPTS environment variable, but its contents are passed to the JVM, so the following fails:
$ export MAVEN_OPTS="-s /settings.xml"
$ mvn clean
Unrecognized option: -s
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
I searched a lot and found two keys, org.apache.maven.user-settings and org.apache.maven.global-settings, which are explained here, but they seem to work with Maven 2 only. Aliasing mvn to mvn -s /settings.xml would probably work, but I don't like that approach.
From the mvn shell script:
# -----------------------------------------------------------------------------
# Apache Maven Startup Script
#
# Environment Variable Prerequisites
#
# JAVA_HOME Must point at your Java Development Kit installation.
# MAVEN_OPTS (Optional) Java runtime options used when Maven is executed.
# MAVEN_SKIP_RC (Optional) Flag to disable loading of mavenrc files.
# -----------------------------------------------------------------------------
so MAVEN_OPTS contains JVM arguments, not Maven arguments (which is consistent with the error message indicating the JVM doesn't like your arguments).
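For comparison, MAVEN_OPTS is the right place for JVM flags (heap size, system properties), not for Maven's own command-line options:
export MAVEN_OPTS="-Xmx2g"            # JVM flag: accepted
export MAVEN_OPTS="-s /settings.xml"  # Maven option: the JVM rejects it with "Unrecognized option: -s"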
The actual invocation is
exec "$JAVACMD" \
$MAVEN_OPTS \
$MAVEN_DEBUG_OPTS \
-classpath "${CLASSWORLDS_JAR}" \
"-Dclassworlds.conf=${MAVEN_HOME}/bin/m2.conf" \
"-Dmaven.home=${MAVEN_HOME}" \
"-Dlibrary.jansi.path=${MAVEN_HOME}/lib/jansi-native" \
"-Dmaven.multiModuleProjectDirectory=${MAVEN_PROJECTBASEDIR}" \
${CLASSWORLDS_LAUNCHER} "$@"
so there is nowhere to put it. I would therefore suggest you write your own mvn wrapper script which in turn calls the real Maven command with the arguments you like (in my experience scripts are more robust than aliases). Additionally, I have recently found that Java versions later than 8 have... interesting issues... so I really need mvn8, mvn11 (and perhaps more) commands anyway.
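A minimal wrapper along these lines (the settings path and the location of the real Maven installation are assumptions; adjust them to your setup):
#!/bin/sh
# ~/bin/mvn - hypothetical wrapper that always passes the custom settings file;
# call the real mvn by its full path so the wrapper does not invoke itself
exec /opt/apache-maven-3.9.6/bin/mvn -s /settings.xml "$@"
Put it on your PATH ahead of the real mvn, or give it a distinct name such as mvn-custom.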
Another approach that I only started using recently is the Maven wrapper (https://github.com/takari/maven-wrapper) where a ./mvnw command is placed in your project which then downloads Maven when needed. This is very useful. To get started use
mvn -N io.takari:maven:wrapper
after which ./mvnw should be directly usable instead of mvn. The interesting part here is that the generated Maven command looks like
exec "$JAVACMD" \
$MAVEN_OPTS \
-classpath "$MAVEN_PROJECTBASEDIR/.mvn/wrapper/maven-wrapper.jar" \
"-Dmaven.home=${M2_HOME}" "-Dmaven.multiModuleProjectDirectory=${MAVEN_PROJECTBASEDIR}" \
${WRAPPER_LAUNCHER} $MAVEN_CONFIG "$@"
and MAVEN_CONFIG is not set earlier in the script. So for mvnw you can set MAVEN_CONFIG to your "-s /settings.xml" string.
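For example, reusing the settings path from the question:
MAVEN_CONFIG="-s /settings.xml" ./mvnw clean install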
Maven 4
The MAVEN_ARGS environment variable is supported and can be used.
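For example:
export MAVEN_ARGS="-s /settings.xml"
mvn clean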
Maven 3
There was a feature request, MNG-5824: Support MAVEN_ARGS environment variable as a way of supplying default command line arguments. This was closed unimplemented with a suggestion to use the .mvn/maven.config file in the project directory.
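A sketch of that approach: create a .mvn/maven.config file in the project root containing the extra arguments. Note that recent Maven 3.9+ releases expect one argument per line, while older 3.x versions accept them on a single line:
-s
/settings.xml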
I am trying to run a JNI program on Hadoop using MapReduce. Here is the command:
bin/hadoop jar /Users/ming/Desktop/mctest/mctest.jar -files /Users/ming/Desktop/mctest/libGenerateRandom.jnilib mc hdfs://localhost:9000/Users/ming/seeds_shuffle.txt hdfs://localhost:9000/Users/ming/output
The .jnilib (a Mac OS X library file, like a .so file on Linux) should be shipped to the task nodes along with the jar file, but I get an error.
Can anyone help?
Thanks.
Instead use:
bin/hadoop jar /Users/ming/Desktop/mctest/mctest.jar \
<main-class> \
-files /Users/ming/Desktop/mctest/libGenerateRandom.jnilib \
mc \
hdfs://localhost:9000/Users/ming/seeds_shuffle.txt \
hdfs://localhost:9000/Users/ming/output
where <main-class> is the fully qualified class name, e.g. com.you.MainRunner.
That's because hadoop jar expects the main class to appear before any additional arguments such as -files.
I’m trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to 1.3.0, so I changed my build.sbt like so:
-libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"
+libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
then build an assembly jar and submit it:
HADOOP_CONF_DIR=/etc/hadoop/conf \
spark-submit \
--driver-class-path=/etc/hbase/conf \
--conf spark.hadoop.validateOutputSpecs=false \
--conf spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.3.0-hadoop2.4.0.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--deploy-mode=cluster \
--master=yarn \
--class=TestObject \
--num-executors=54 \
target/scala-2.11/myapp-assembly-1.2.jar
The job fails to submit, with the following exception in the terminal:
15/03/19 10:30:07 INFO yarn.Client:
15/03/19 10:20:03 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1420225286501_4698 failed 2 times due to AM
Container for appattempt_1420225286501_4698_000002 exited with exitCode: 127
due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Finally, I go and check the YARN app master’s web interface (since the job is there, I know it at least made it that far), and the only logs it shows are these:
Log Type: stderr
Log Length: 61
/bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory
Log Type: stdout
Log Length: 0
I’m not sure how to interpret that – is {{JAVA_HOME}} a literal (including the brackets) that’s somehow making it into a script? Is this coming from the worker nodes or the driver? Anything I can do to experiment & troubleshoot?
I do have JAVA_HOME set in the hadoop config files on all the nodes of the cluster:
% grep JAVA_HOME /etc/hadoop/conf/*.sh
/etc/hadoop/conf/hadoop-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
/etc/hadoop/conf/yarn-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
Has this behavior changed in 1.3.0 since 1.2.1? Using 1.2.1 and making no other changes, the job completes fine.
[Note: I originally posted this on the Spark mailing list, I'll update both places if/when I find a solution.]
Have you tried setting JAVA_HOME in the etc/hadoop/yarn-env.sh file? It's possible that your JAVA_HOME environment variable is not available to the YARN containers that are running your job.
It has happened to me before that certain environment variables set in .bashrc on the nodes were not picked up by the YARN workers spawned on the cluster.
There is a chance that the error is unrelated to the version upgrade but instead related to YARN environment configuration.
Okay, so I got some other people in the office to help work on this, and we figured out a solution. I'm not sure how much of this is specific to the file layouts of Hortonworks HDP 2.0.6 on CentOS, which is what we're running on our cluster.
Manually copy some directories from one of the cluster machines (or any machine that can successfully use the Hadoop client) to your local machine. Let's call that machine $GOOD.
Set up Hadoop config files:
cd /etc
sudo mkdir hbase hadoop
sudo scp -r $GOOD:/etc/hbase/conf hbase
sudo scp -r $GOOD:/etc/hadoop/conf hadoop
Set up Hadoop libraries & executables:
mkdir ~/my-hadoop
scp -r $GOOD:/usr/lib/hadoop\* ~/my-hadoop
cd /usr/lib
sudo ln -s ~/my-hadoop/* .
path+=(/usr/lib/hadoop*/bin) # Add to $PATH (this syntax is for zsh)
Set up the Spark libraries & executables:
cd ~/Downloads
wget http://apache.mirrors.lucidnetworks.net/spark/spark-1.4.1/spark-1.4.1-bin-without-hadoop.tgz
tar -zxvf spark-1.4.1-bin-without-hadoop.tgz
cd spark-1.4.1-bin-without-hadoop
path+=(`pwd`/bin)
hdfs dfs -copyFromLocal lib/spark-assembly-*.jar /apps/local/
Set some environment variables:
export JAVA_HOME=$(/usr/libexec/java_home -v 1.7)
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_DIST_CLASSPATH=$(hadoop --config $HADOOP_CONF_DIR classpath)
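# the backtick line below runs the "export HADOOP_LIBEXEC_DIR=..." line from yarn-env.sh in the current shell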
`grep 'export HADOOP_LIBEXEC_DIR' $HADOOP_CONF_DIR/yarn-env.sh`
export SPOPTS="--driver-java-options=-Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib"
export SPOPTS="$SPOPTS --conf spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.4.1-hadoop2.2.0.jar"
Now the various spark shells can be run like so:
sparkR --master yarn $SPOPTS
spark-shell --master yarn $SPOPTS
pyspark --master yarn $SPOPTS
Some remarks:
The JAVA_HOME setting is the same as I've had all along; it's included here only for completeness. All the focus on JAVA_HOME turned out to be a red herring.
The --driver-java-options=-Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib was necessary because I was getting errors about java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path. The jnilib file is the correct choice for OS X.
The --conf spark.yarn.jar piece is just to save time, avoiding re-copying the assembly file to the cluster every time you fire up the shell or submit a job.
Well, to start off, I would recommend moving to Java 7. However, that is not what you are asking about.
For setting JAVA_HOME, I would recommend setting it in your .bashrc rather than in multiple files. Moreover, I would recommend installing Java via alternatives so that it is linked under /usr/bin.
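On CentOS-style systems that would be something along these lines (the JDK path is an assumption; substitute the one installed on your nodes):
# register the JDK with alternatives and then pick it interactively
sudo alternatives --install /usr/bin/java java /usr/jdk64/jdk1.7.0_67/bin/java 2
sudo alternatives --config java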