I am learning to work on a Hadoop cluster. I have worked for some time with Hadoop Streaming, where I coded map-reduce scripts in Perl/Python and ran the jobs.
However, I haven't found any good explanation of how to run a Java map-reduce job.
For example:
I have the following program-
http://www.infosci.cornell.edu/hadoop/wordcount.html
Can somebody tell me how to actually compile this program and run the job?
Create a directory to hold the compiled class:
mkdir WordCount_classes
Compile your class:
javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d WordCount_classes WordCount.java
Create a jar file from your compiled class:
jar -cvf $HOME/code/hadoop/WordCount.jar -C WordCount_classes/ .
Create a directory for your input and copy all your input files into it, then run your job as follows:
bin/hadoop jar $HOME/code/hadoop/WordCount.jar WordCount ${INPUTDIR} ${OUTPUTDIR}
The output of your job will be put in the ${OUTPUTDIR} directory. This directory is created by the Hadoop job, so make sure it doesn't exist before you run the job.
See here for a full example.
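For reference, here is a minimal sketch of a WordCount class along the lines of the linked example, written against the org.apache.hadoop.mapreduce API (adjust for your Hadoop version; on 2.x you would use Job.getInstance instead of the deprecated Job constructor):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // Job.getInstance(conf, "word count") on 2.x
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}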
I am following Scheduled Jobs with Custom Clock Processes in Java with Quartz and RabbitMQ, but I am struggling to actually run another dyno from within the jar file packaged by Spring.
In the package (server-1.0-SNAPSHOT.jar) I need to run company.server.Scheduler. The .class file is in BOOT-INF/classes.
I've tried to do this but I am always getting
Error: Could not find or load main class company.server.Scheduler
I'm struggling to get the syntax right.
So what I need to run is
BOOT-INF/classes/company/server/Scheduler.class
I have tried this:
java -classpath BOOT-INF/classes -jar server-1.0-SNAPSHOT.jar company.server.Scheduler
java -classpath server-1.0-SNAPSHOT.jar:/BOOT-INF/classes company.server.Scheduler
But each of these either runs the main class from the manifest or crashes.
Also tried:
java -classpath server-1.0-SNAPSHOT.jar BOOT-INF.classes.company.server.Scheduler
java -classpath server-1.0-SNAPSHOT.jar BOOT-INF/classes/company/server/Scheduler
Try it out ...
git clone https://github.com/silentsnooc/run-scheduler
cd run-scheduler/
mvn clean install
cd target/
java -cp .:scheduler-test-1.0-SNAPSHOT.jar BOOT-INF.classes.Scheduler
If I follow this on Heroku then it should be something like
java -cp scheduler-test-1.0-SNAPSHOT.jar:BOOT-INF/classes/* Scheduler
but it's not working, telling me that the main class Scheduler could not be found.
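None of the plain -classpath variants can work here, because classes under BOOT-INF/classes are not visible to an ordinary classpath lookup into the jar root. One approach worth trying (a sketch, assuming the jar was built with the standard Spring Boot 1.x/2.x loader; the launcher package moved in later Spring Boot versions) is to delegate to Spring Boot's PropertiesLauncher and point it at the alternative main class:

java -cp server-1.0-SNAPSHOT.jar -Dloader.main=company.server.Scheduler org.springframework.boot.loader.PropertiesLauncher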
I want to run my own Java program on a single-node Hadoop cluster.
How do I run a Java program on a single-node Hadoop cluster? Do I need to package my Java code into a jar file and then execute it?
Yes, you need to package it into a .jar file. I will explain step by step:
1) Write your Java code in the Eclipse IDE.
2) To create a jar of your project, follow this link.
3) Copy your dataset to HDFS using the following command:
$ bin/hadoop dfs -copyFromLocal /path/to/file/on/filesystem /path/to/input/on/hdfs
4) Run your jar, passing the paths of the input dataset and output directory on HDFS. If your jar's manifest doesn't specify a main class, add the fully qualified class name right after the jar path:
$ bin/hadoop jar path/to/jar/on/filesystem /path/to/input/on/hdfs /path/to/outputdir/on/hdfs
5) Verify the resulting files in the output folder with the following command:
$ bin/hadoop fs -ls /path/to/outputdir/on/hdfs
6) The following command shows the output in the part-00000 file, which is generated by the MapReduce job:
$ bin/hadoop fs -cat /path/to/outputdir/on/hdfs/part-00000
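Putting steps 3 to 6 together with concrete values (the paths and names below are hypothetical placeholders; WordCount stands in for your main class):

$ bin/hadoop dfs -copyFromLocal ~/data/books.txt /user/hduser/input/books.txt
$ bin/hadoop jar ~/jars/WordCount.jar WordCount /user/hduser/input /user/hduser/output
$ bin/hadoop fs -ls /user/hduser/output
$ bin/hadoop fs -cat /user/hduser/output/part-00000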
Hope this helps you.
I am trying to run Hadoop in standalone mode and have set up all the correct configuration files, and I have successfully run the WordCount example. The problem arises when I try to organize my source code and jar files into a file hierarchy to make things a little more organized.
hadoop --config ~/myconfig jar ~/MYPROGRAMSRC/WordCount.jar MYPROGRAMSRC.WordCount ~/wordCountInput/allData ~/wordCountOutput
I use the above code to invoke hadoop from a script file in my home directory. It fails to recognize the WordCount file one level below in the MYPROGRAMSRC directory.
The ~/MYPROGRAMSRC directory contains the:
WordCount.jar, WordCount.java, WordCount.class, WordCount$Map.class and WordCount$Reduce.class files.
But why is Hadoop throwing a ClassNotFoundException?
Exception in thread "main" java.lang.ClassNotFoundException: MYPROGRAMSRC.WordCount
I know my program runs because if I transfer the script file into the same directory as the WordCount.class file and run the following command:
hadoop --config ~/myconfig jar WordCount.jar WordCount ~/wordCountInput/allData ~/wordCountOutput
It runs fine.
Try
hadoop --config ~/myconfig jar ~/MYPROGRAMSRC/WordCount.jar WordCount ~/wordCountInput/allData ~/wordCountOutput
MYPROGRAMSRC.WordCount makes no sense if MYPROGRAMSRC is a directory.
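The argument after the jar path is a Java class name, not a file path; your WordCount apparently lives in the default package, which is why the bare WordCount invocation works. If you really wanted to address it as MYPROGRAMSRC.WordCount, the source itself would have to declare that package, roughly like this hypothetical sketch (the compiled class would then sit at MYPROGRAMSRC/WordCount.class inside the jar):

// Hypothetical variant: only if you want the class addressed as MYPROGRAMSRC.WordCount.
package MYPROGRAMSRC;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // ... the same job setup as in your current default-package WordCount ...
    }
}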
I have class files loaded onto the Hadoop file system, and I have also loaded an input file into HDFS.
When I run a class file through the hadoop command in the terminal, I get a "class not found" error.
For example, I have the following HDFS contents:
WordCount.class
WordCountMapper.class
WordCountReducer.class
SampleInput.txt
Can someone tell me where I am going wrong? Or can this even be done?
Below is the command line we use for running a Java MapReduce job on our 4-node Hadoop 2.2.0 cluster daily, and it works fine. We run it from the namenode, but any machine in the cluster should work.
hadoop jar ~/..path../mr_orchestrate/target/mr-orchestrate-1.0.jar com.rr.ap.orchestrate.MROrchestrate /user/hduser/in/Sample_15Feb2014.txt /user/hduser/out/out15Feb2014
You may need the "-libjars" option to add other library paths. Note that the jar itself is read from the local filesystem, not from HDFS; only the input and output paths refer to HDFS.
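In other words, to answer the question above: don't put the .class files on HDFS at all. A sketch of the usual flow, reusing the file names from the question and assuming WordCount is the driver class in the default package:

jar -cvf WordCount.jar WordCount.class WordCountMapper.class WordCountReducer.class
hadoop fs -copyFromLocal SampleInput.txt /user/hduser/SampleInput.txt
hadoop jar WordCount.jar WordCount /user/hduser/SampleInput.txt /user/hduser/output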
I'm really new to Hadoop and not familiar with terminal commands.
I followed the steps to install Hadoop on my Mac and can run some of the built-in Hadoop examples. However, when I tried to run the WordCount example, it generated many errors, such as "org.apache cannot be resolved".
A post online said you should put it where you write your Java code. I normally use Eclipse; however, in Eclipse there were so many errors that the project could not be compiled.
Any suggestions?
Thanks!
Assuming you have also followed the directions to start up a local cluster or a pseudo-distributed cluster, here is the easiest way.
Go to the Hadoop directory, which should be whatever directory was unzipped when you downloaded the Hadoop release from Apache. From there you can run the following commands to run Hadoop.
for Hadoop version 0.23.*
cd $HOME/path/to/hadoop-0.23.*
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-0.23.5.jar wordcount myinput outputdir
for Hadoop version 0.20.*
cd $HOME/path/to/hadoop-0.20.*
./bin/hadoop jar hadoop-0.20.2-examples.jar wordcount myinput outputdir
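If you want to compile your own WordCount instead of running the bundled examples jar, the "org.apache ... cannot be resolved" errors simply mean the Hadoop jars are not on the compiler's classpath. On the command line the fix looks roughly like this (a sketch; adjust the jar name and location to your Hadoop version), and in Eclipse the equivalent fix is adding the same Hadoop jars to the project's build path:

javac -classpath $HOME/path/to/hadoop-0.20.2/hadoop-0.20.2-core.jar -d wordcount_classes WordCount.java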