I'm really new to Hadoop and not familiar with terminal commands.
I followed a step-by-step guide to install Hadoop on my Mac and can run some of the built-in Hadoop examples. However, when I tried to run the WordCount example, it generated many errors such as org.apache cannot be resolved.
A post online said you should put it where you write your Java code. I usually use Eclipse, but in Eclipse there were so many errors that the project could not be compiled.
Any suggestions?
Thanks!
Assuming you have also followed the directions to start up a local or pseudo-distributed cluster, here is the easiest way.
Go to the Hadoop directory, which is whatever directory gets unpacked when you download the Hadoop distribution from Apache. From there you can run these commands:
for Hadoop version 0.23.*
cd $HOME/path/to/hadoop-0.23.*
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-0.23.5.jar wordcount myinput outputdir
for Hadoop version 0.20.*
cd $HOME/path/to/hadoop-0.20.*
./bin/hadoop jar hadoop-0.20.2-examples.jar wordcount myinput outputdir
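If you haven't prepared any input yet, something along these lines should work in local/standalone mode (the file names and directories here are just examples):
mkdir myinput
cp ./*.txt myinput/
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-0.23.5.jar wordcount myinput outputdir
cat outputdir/part-r-00000
Any plain text files will do as sample input; the word counts end up in the part-* files inside outputdir.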
I am new to Hadoop/Giraph and Java. As part of a task, I downloaded the Cloudera Quickstart VM and installed Giraph on top of it. I am using the book "Practical Graph Analytics with Apache Giraph" by Roman Shaposhnik, Claudio Martella, and Dionysios Logothetis, from which I tried to run the first example on page 111 (Twitter Followership Graph).
Defining the Shell Environment for Giraph Execution
$ export HADOOP_HOME=/usr/lib/hadoop
$ export GIRAPH_HOME=/usr/local/giraph
$ export HADOOP_CONF_DIR=$GIRAPH_HOME/conf
$ PATH=$HADOOP_HOME/bin:$GIRAPH_HOME/bin:$PATH
Running the Giraph Application
$ giraph target/*.jar GiraphHelloWorld -vip src/main/resources/1 \
    -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat \
    -w 1 -ca giraph.SplitMasterWorker=false,giraph.logLevel=error
I created both the jar file and the Java program in the /home/cloudera/target folder, and the graph txt file is in src/main/resources/1.
I am getting the errors attached below after running the above commands with the attached program.
https://i.stack.imgur.com/tAQaT.jpg (Error1)
https://i.stack.imgur.com/GqY2O.jpg (Error2)
https://i.stack.imgur.com/ATacy.jpg (Java Program)
Please let me know if anything else is needed.
The issue behind the above error was the process by which the jar file and classes were created. They need to be created in Eclipse as a new Maven project. I created my own pom file and Java program and built the project.
Once the jars and classes were created successfully, I ran the GiraphHelloWorld example following the same systematic approach as before. Also make sure to point HADOOP_CLASSPATH at the folder that contains the "classes" folder.
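For example, something along these lines before launching the job should work (the path is just an example that matches the target folder mentioned above):
export HADOOP_CLASSPATH=/home/cloudera/target:$HADOOP_CLASSPATH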
I am trying to run Hadoop in standalone mode, have set up all the correct configuration files, and have successfully run the WordCount example. The problem arises when I try to organize my source code and jar files into a directory hierarchy to make things a little more organized.
hadoop --config ~/myconfig jar ~/MYPROGRAMSRC/WordCount.jar MYPROGRAMSRC.WordCount ~/wordCountInput/allData ~/wordCountOutput
I use the above command to invoke Hadoop from a script in my home directory. It fails to recognize the WordCount class one level below, in the MYPROGRAMSRC directory.
The ~/MYPROGRAMSRC directory contains the following files:
WordCount.jar, WordCount.java, WordCount.class, WordCount$Map.class and WordCount$Reduce.class.
But why is Hadoop throwing a ClassNotFoundException:
Exception in thread "main" java.lang.ClassNotFoundException: MYPROGRAMSRC.WordCount
I know my program runs because if I transfer the script file into the same directory as the WordCount.class file and run the following command:
hadoop --config ~/myconfig jar WordCount.jar WordCount ~/wordCountInput/allData ~/wordCountOutput
It runs fine.
Try
hadoop --config ~/myconfig jar ~/MYPROGRAMSRC/WordCount.jar ~/MYPROGRAMSRC/WordCount ~/wordCountInput/allData ~/wordCountOutput
MYPROGRAMSRC.WordCount makes no sense if MYPROGRAMSRC is a directory.
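You can also check what fully qualified name the class actually has by listing the jar contents. If WordCount.class sits at the top level of the jar (no package directories), the class name to pass is just WordCount:
jar tf ~/MYPROGRAMSRC/WordCount.jar
hadoop --config ~/myconfig jar ~/MYPROGRAMSRC/WordCount.jar WordCount ~/wordCountInput/allData ~/wordCountOutput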
I have class files loaded onto the Hadoop file system, and I have also loaded the input file into HDFS.
When I run the class file through the hadoop command in the terminal, I get a class not found error.
E.g., my HDFS contents are:
WordCount.class
WordCountMapper.class
WordCountReducer.class
SampleInput.txt
Can someone tell me where I am going wrong? Or can this even be done?
Below is the command line we use daily for running a Java MapReduce job on our 4-node Hadoop 2.2.0 cluster, and it works fine. We run it from the namenode, but any machine in the cluster should work.
hadoop jar ~/..path../mr_orchestrate/target/mr-orchestrate-1.0.jar com.rr.ap.orchestrate.MROrchestrate /user/hduser/in/Sample_15Feb2014.txt /user/hduser/out/out15Feb2014
You may need the "-libjars" option to add other library paths.
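For example, something like this should work, assuming your driver parses generic options (e.g. via ToolRunner/GenericOptionsParser); the extra jar path is just a placeholder:
hadoop jar ~/..path../mr_orchestrate/target/mr-orchestrate-1.0.jar com.rr.ap.orchestrate.MROrchestrate -libjars /path/to/extra-lib.jar /user/hduser/in/Sample_15Feb2014.txt /user/hduser/out/out15Feb2014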
I am learning to work on a Hadoop cluster. I have worked for some time with Hadoop Streaming, where I coded map-reduce scripts in Perl/Python and ran the jobs.
However, I haven't found any good explanation of how to run a Java map-reduce job.
For example:
I have the following program-
http://www.infosci.cornell.edu/hadoop/wordcount.html
Can somebody tell me how to actually compile this program and run the job?
Create a directory to hold the compiled class:
mkdir WordCount_classes
Compile your class:
javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d WordCount_classes WordCount.java
Create a jar file from your compiled class:
jar -cvf $HOME/code/hadoop/WordCount.jar -C WordCount_classes/ .
Create a directory for your input and copy all your input files into it, then run your job as follows:
bin/hadoop jar $HOME/code/hadoop/WordCount.jar WordCount ${INPUTDIR} ${OUTPUTDIR}
The output of your job will be put in the ${OUTPUTDIR} directory. This directory is created by the Hadoop job, so make sure it doesn't exist before you run the job.
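For example, if you are rerunning the job against HDFS, clear out the old output directory first and then inspect the results when the job finishes (older releases use -rmr instead of -rm -r):
bin/hadoop fs -rm -r ${OUTPUTDIR}
bin/hadoop fs -cat ${OUTPUTDIR}/part-*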
See here for a full example.
I am new to hadoop.
I have a file WordCount.java which references hadoop.jar and stanford-parser.jar.
I am running the following commands:
javac -classpath .:hadoop-0.20.1-core.jar:stanford-parser.jar -d ep WordCount.java
jar cvf ep.jar -C ep .
bin/hadoop jar ep.jar WordCount gutenburg gutenburg1
After executing, I am getting the following error:
java.lang.ClassNotFoundException: edu.stanford.nlp.parser.lexparser.LexicalizedParser
The class is in stanford-parser.jar ...
What can be the possible problem?
Thanks
I think you need to add the stanford-parser jar when invoking Hadoop as well, not just when compiling. (If you look inside ep.jar, I imagine it will only have one file in it: WordCount.class.)
E.g.
bin/hadoop jar ep.jar WordCount -libjars stanford-parser.jar gutenburg gutenburg1
See Map/Reduce Tutorial
mdma is on the right track, but you'll also need your job driver to implement Tool.
I had the same problem. I think the reason the -libjars option doesn't get recognized by your program is that you are not parsing it by calling GenericOptionsParser.getRemainingArgs(). In Hadoop 0.21.0's WordCount.java example (in mapred/src/examples/org/apache/hadoop/examples/), this piece of code is found, and after doing the same in my program, -libjars comma-separated-jars is recognized:
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
  System.err.println("Usage: wordcount <in> <out>");
  System.exit(2);
}
...
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
I've just found out that you can simply edit $HADOOP_HOME/conf/hadoop-env.sh and add your JARs to HADOOP_CLASSPATH.
This is probably the simplest and most efficient way.
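For example, a line like this in hadoop-env.sh should do it (the path is just an example):
export HADOOP_CLASSPATH=/path/to/stanford-parser.jar:$HADOOP_CLASSPATH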
Another option you can try, since -libjars doesn't seem to be working for you, is to package everything into a single jar, i.e. your code plus the dependencies.
This was how it had to be done prior to ~Hadoop-0.18.0 (somewhere around there they fixed this).
Using Ant (I use Ant in Eclipse) you can set up a build that unpacks the dependencies and adds them to the target build. You can probably do this by hand too, by manually unpacking the dependency jar and adding its contents to your jar.
Even though I use 0.20.1 now, I still use this method. It makes starting a job from the command line simpler.
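If you would rather do the repackaging by hand than with Ant, a rough sketch looks like this (directory and jar names are just examples):
mkdir ep_all
cd ep_all
jar xf ../stanford-parser.jar    # unpack the dependency's classes
cp -r ../ep/* .                  # add your own compiled classes on top
cd ..
jar cvf ep-all.jar -C ep_all .
bin/hadoop jar ep-all.jar WordCount gutenburg gutenburg1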