Running word count on hadoop - java

My Hadoop started on a single node, but while running the word count program it shows the following error. What might be the problem?
hduser@reshmi-Inspiron-1545:~/hadoop-1.0.4$ hadoop-examples-1.0.4.jar wordcount
/home/hduser/hadoop-1.0.4/dft /home/hduser/hadoop-1.0.4/dft-output
/home/hduser/hadoop-1.0.4/hadoop-examples-1.0.4.jar: line 1: $'PK\003\004': command not found
/home/hduser/hadoop-1.0.4/hadoop-examples-1.0.4.jar: line 2: syntax error near unexpected token `)'
/home/hduser/hadoop-1.0.4/hadoop-examples-1.0.4.jar: line 2: `)CA META-INF/��PK'
hduser@reshmi-Inspiron-1545:~/hadoop-1.0.4$ ^C
hduser@reshmi-Inspiron-1545:~/hadoop-1.0.4$ cd ..
hduser@reshmi-Inspiron-1545:~$ /hadoop-1.0.4 hadoop-examples-1.0.4.jar wordcount
/home/hduser/hadoop-1.0.4/dft /home/hduser/hadoop-1.0.4/dft-output
bash: /hadoop-1.0.4: No such file or directory
hduser@reshmi-Inspiron-1545:~$ ^C

You need to use the hadoop script in the bin folder and the jar subcommand to invoke your job. Try this:
hduser@reshmi-Inspiron-1545:~$ cd ~/hadoop-1.0.4
hduser@reshmi-Inspiron-1545:~/hadoop-1.0.4$ bin/hadoop jar hadoop-examples-1.0.4.jar \
wordcount /home/hduser/hadoop-1.0.4/dft \
/home/hduser/hadoop-1.0.4/dft-output
I'm also assuming from your input and output paths that your Hadoop is configured for local mode (since the paths are local paths under /home/hduser).
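If the cluster were instead configured to use HDFS, the input would first have to be copied into HDFS and the job pointed at the HDFS paths. A minimal sketch, assuming a /user/hduser home directory in HDFS (the dft names are just carried over from above):
bin/hadoop fs -put /home/hduser/hadoop-1.0.4/dft /user/hduser/dft
bin/hadoop jar hadoop-examples-1.0.4.jar wordcount /user/hduser/dft /user/hduser/dft-output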

Related

Java: Class not found for PageRank algorithm in Apache Hadoop

I am trying to run the PageRank algorithm on an Apache Hadoop (2.6.5) cluster (1 master, 2 slaves). I am using the program in this repository: https://github.com/danielepantaleone/hadoop-pagerank.git. I was able to compile all the sources using this command:
sudo javac -classpath ${HADOOP_CLASSPATH} -d ./build src/it/uniroma1/hadoop/pagerank/PageRank.java src/it/uniroma1/hadoop/pagerank/job1/PageRankJob1Mapper.java src/it/uniroma1/hadoop/pagerank/job1/PageRankJob1Reducer.java src/it/uniroma1/hadoop/pagerank/job2/PageRankJob2Mapper.java src/it/uniroma1/hadoop/pagerank/job2/PageRankJob2Reducer.java src/it/uniroma1/hadoop/pagerank/job3/PageRankJob3Mapper.java
I created the jar file using this command: sudo jar -cf build/pagerank.jar build/.
I am trying to run the program just like the wordcount example, like this:
sudo bin/hadoop jar hadoop-pagerank/build/pagerank.jar PageRank --input /usr/local/hdfs/web-Google.txt --output /usr/local/hdfs-out-PR
Sometimes I get an error like this:
Exception in thread "main" java.lang.NoClassDefFoundError: PageRank (wrong name: it/uniroma1/hadoop/pagerank/PageRank)
and sometimes, depending on how I compile, I get an error like this: Exception in thread "main" java.lang.ClassNotFoundException: PageRank.
I am not sure what I am doing wrong. Can anyone please help me with the proper steps to compile and run the program in Hadoop? I don't have a pom.xml file, and I am able to run the provided wordcount example jar.
You have to use the fully qualified class name (package name plus class name) in your command, i.e.
it.uniroma1.hadoop.pagerank.PageRank
rather than PageRank, like this:
hadoop jar hadoop-pagerank/build/pagerank.jar it.uniroma1.hadoop.pagerank.PageRank --input /usr/local/hdfs/web-Google.txt --output /usr/local/hdfs-out-PR
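If the ClassNotFoundException persists even with the fully qualified name, it may be a packaging issue: jar -cf build/pagerank.jar build/. stores the classes under a leading build/ directory inside the jar, so the package path no longer starts at the jar root. A possible fix, sketched under the assumption that the compiled classes live in build/, is to repackage with -C so the package directories sit at the root:
jar -cf pagerank.jar -C build .   # repackage so it/uniroma1/... is at the jar root
hadoop jar pagerank.jar it.uniroma1.hadoop.pagerank.PageRank --input /usr/local/hdfs/web-Google.txt --output /usr/local/hdfs-out-PR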

Error: Could not find or load main class in python

I am trying to run this command in Python:
java JSHOP2.InternalDomain logistics
It works well when I run it in cmd.
I wrote this in Python:
import subprocess

args = ['java',
        r"-classpath",
        r".;./JSHOP2.jar;./antlr.jar",
        r"JSHOP2.InternalDomain",
        thisDir + "/logistics"
        ]
proc = subprocess.Popen(args, stdout=subprocess.PIPE)
proc.communicate()
I have the jar files in the current directory, but I get this error:
Error: Could not find or load main class JSHOP2.InternalDomain
Does anyone know what the problem is? Can't it find the jar files?
You can't count on the current working directory always being the same when running your Python code. Explicitly set a working directory using the cwd argument:
proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                        cwd='/directory/containing/jarfiles')
Alternatively, use absolute paths in your -classpath command-line argument. If that path is thisDir, then use that:
proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                        cwd=thisDir)
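For reference, the equivalent direct invocation with absolute classpath entries would look roughly like this (C:\jshop2 is a hypothetical location; substitute wherever JSHOP2.jar and antlr.jar actually live):
java -classpath "C:\jshop2;C:\jshop2\JSHOP2.jar;C:\jshop2\antlr.jar" JSHOP2.InternalDomain logistics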

Class Configuration not found when simulating Hadoop YARN SLS

I am trying to simulate Hadoop YARN SLS (Scheduling Load Simulator) with the sources given in Hadoop's GitHub repository; the SLS source files are located in [REF-1].
Here are the steps I have done:
Using VMware as the host.
Using Ubuntu 14.04
Installing Hadoop v 2.6.0 [REF-2]
User : hduser | group : hadoop
Installing any needed packages (e.g. maven)
Cloning Hadoop's GitHub repository [REF-1]
Syntax : git clone https://git.apache.org/hadoop.git
Result : hduser@ubuntu:~/hadoop$
I made the changes inside the directory hduser@ubuntu:~/hadoop/hadoop-tools$
FYI: I used the code from MaxinetSLS [REF-3] as the way I compile the source files. The SLS source files can be downloaded with this command on Linux: git clone https://github.com/wette/netSLS.git. By default, I can run this program with no error; the SLS simulator works perfectly.
From MaxinetSLS's source files, I copied the files below into my work in hduser@ubuntu:~/hadoop/hadoop-tools$ :
netSLS/generator > hduser@ubuntu:~/hadoop/hadoop-tools$
netSLS/html > hduser@ubuntu:~/hadoop/hadoop-tools$
netSLS/sls.sh > hduser@ubuntu:~/hadoop/hadoop-tools$
netSLS/sls/hadoop/ > hduser@ubuntu:~/hadoop/hadoop-tools/hadoop-sls$
Then, I modified some files as follows.
netSLS/sls.sh
#!/usr/bin/env bash
function print_usage {
    echo -e "usage: sls.sh TraceFile"
    echo -e
    echo -e "Starts SLS with the given trace file."
}
if [[ -z $1 ]]; then
    print_usage
    exit 1
fi
TRACE_FILE=$(realpath $1)
if [[ ! -f ${TRACE_FILE} ]]; then
    echo "File not found: ${TRACE_FILE}"
    print_usage
    exit 1
fi
cd hadoop-sls
OUTPUT_DIRECTORY="/tmp/sls"
mkdir -p ${OUTPUT_DIRECTORY}
ARGS="-inputsls ${TRACE_FILE}"
ARGS+=" -output ${OUTPUT_DIRECTORY}"
ARGS+=" -printsimulation"
mvn exec:java -Dexec.args="${ARGS}"
hduser@ubuntu:~/hadoop/hadoop-tools/hadoop-sls/pom.xml$
[REF-4]
hduser@ubuntu:~/hadoop/hadoop-tools$ nano hadoop-sls/hadoop/etc/hadoop/sls-runner.xml
[REF-5]
Next, I try to:
Compile the sources using hduser@ubuntu:~/hadoop/hadoop-tools/hadoop-sls$ mvn compile
Compiled with no error (mvn_compile_perfect.jpg).
Run the program using hduser@ubuntu:~/hadoop/hadoop-tools$ ./sls.sh generator/small.json
Got the error here (error_json_compile.jpg). :(
So far, I have gone through some information related to similar problems I faced [REF-6] and tried the suggestions, but I still get the same error. I suspect the problem is in the ~/hadoop/hadoop-tools/hadoop-sls/pom.xml that I modified incorrectly. I lack experience with the Linux environment. :(
References : http://1drv.ms/21zcJIH (txt file)
*Cannot post more than 2 links in my post. :(

Hadoop Hive unable to move source to destination

I am trying to use Hive 1.2.0 over Hadoop 2.6.0. I have created an employee table. However, when I run the following query:
hive> load data local inpath '/home/abc/employeedetails' into table employee;
I get the following error:
Failed with exception Unable to move source file:/home/abc/employeedetails to destination hdfs://localhost:9000/user/hive/warehouse/employee/employeedetails_copy_1
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
What am I doing wrong here? Are there any specific permissions that I need to set? Thanks in advance!
As mentioned by Rio, the issue involved a lack of permissions to load data into the Hive table. I figured out that the following command solves my problem:
hadoop fs -chmod g+w /user/hive/warehouse
Check the permissions on the HDFS directory:
hdfs dfs -ls /user/hive/warehouse/employee/employeedetails_copy_1
It seems like you may not have permission to load data into the Hive table.
The error might be due to a permission issue on the local filesystem.
Change the permissions on the local filesystem:
sudo chmod -R 777 /home/abc/employeedetails
Now, run:
hive> load data local inpath '/home/abc/employeedetails' into table employee;
If we face the same error after running the above command in distributed mode, we can try the command below as a superuser on all nodes:
sudo usermod -a -G hdfs yarn
Note: we get this error after restarting all of the YARN services (in Ambari). My problem was resolved. This is an admin command, so take care when running it.
I met the same problem and searched for two days. Finally I found the reason: the datanode starts for a moment and then shuts down.
Steps to solve it:
hadoop fs -chmod -R 777 /home/abc/employeedetails
hadoop fs -chmod -R 777 /user/hive/warehouse/employee/employeedetails_copy_1
vi hdfs-site.xml and add the following property:
<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
</property>
hdfs --daemon start datanode
vi hdfs-site.xml  # find the locations of 'dfs.datanode.data.dir' and 'dfs.namenode.name.dir'. If they point to the same location, you must change one of them; this is why my datanode could not start.
Under 'dfs.datanode.data.dir'/data/current, edit the VERSION file and copy its clusterID into the clusterID field of the VERSION file under 'dfs.namenode.name.dir'/data/current.
start-all.sh
If the above does not solve it, follow the steps below carefully (be careful, for the safety of your data); these steps are what finally solved the problem for me.
stop-all.sh
Delete the data folder under 'dfs.datanode.data.dir', the data folder under 'dfs.namenode.name.dir', and the tmp folder.
hdfs namenode -format
start-all.sh
This solved the problem.
You may also meet another problem like this:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /opt/hive/tmp/root/1be8676a-56ac-47aa-ab1c-aa63b21ce1fc. Name node is in safe mode
Solution: hdfs dfsadmin -safemode leave
It might be because your hive user does not have access to the HDFS directories.

Need assistance with running the WordCount.java provided by Cloudera

I am trying to run the WordCount.java example provided by Cloudera. I ran the command below and am getting the exception shown below it. Do you have any suggestions on how to proceed? I have gone through all the steps provided by Cloudera.
Thanks in advance.
hadoop jar ~/Desktop/wordcount.jar org.myorg.WordCount ~/Desktop/input ~/Desktop/output
Error:
ERROR security.UserGroupInformation: PriviledgedActionException
as:root (auth:SIMPLE)
cause:org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist: hdfs://localhost/home/rushabh/Desktop/input
Exception in thread "main"
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: hdfs://localhost/home/rushabh/Desktop/input
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:194)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:205)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:977)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:969)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1248)
at org.myorg.WordCount.main(WordCount.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Your input and output files should be on HDFS; at least the input should be on HDFS.
Use the following command:
hadoop jar ~/Desktop/wordcount.jar org.myorg.WordCount hdfs:/input hdfs:/output
To copy a file from your Linux filesystem to HDFS, use the following command:
hadoop dfs -copyFromLocal ~/Desktop/input hdfs:/
and check your file using:
hadoop dfs -ls hdfs:/
Hope this will help.
The error message says that this file does not exist: "hdfs://localhost/home/rushabh/Desktop/input".
Check that the file does exist at the location you've told it to use.
Check that the hostname is correct. You are using "localhost", which most likely resolves to a loopback IP address, e.g. 127.0.0.1. That always means "this host" in the context of the machine on which you are running the code.
When I tried to run the wordcount MapReduce code, I was getting this error:
ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/user/hduser/wordcount
I was trying to execute the wordcount MapReduce Java code with /user/hduser/wordcount and /user/hduser/wordcount-output as the input and output paths. I just prefixed these paths with the 'fs.default.name' value from core-site.xml and it ran perfectly.
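For example, if core-site.xml sets fs.default.name to hdfs://localhost:9000 (a common single-node value; substitute whatever your configuration actually uses, and your own jar name and main class), the fully qualified invocation would look like:
hadoop jar wordcount.jar org.myorg.WordCount hdfs://localhost:9000/user/hduser/wordcount hdfs://localhost:9000/user/hduser/wordcount-output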
The error clearly states that your input path is local. Please point the input path at something on HDFS rather than on the local machine. My guess is that
hadoop jar ~/Desktop/wordcount.jar org.myorg.WordCount ~/Desktop/input ~/Desktop/output
needs to be changed to
hadoop jar ~/Desktop/wordcount.jar org.myorg.WordCount <hdfs-input-dir> <hdfs-output-dir>
NOTE: To run a MapReduce job, the input directory must be in HDFS, not on the local filesystem.
Hope this helps.
So I added the input folder to HDFS using the following command:
hadoop dfs -put /usr/lib/hadoop/conf input/
Check the ownership of the files in HDFS to ensure that the owner of the job (root) has read privileges on the input files. Cloudera provides an HDFS viewer that you can use to browse the filespace; open a web browser to either localhost:50075 or {fqdn}:50075 and click on "Browse the filesystem" to view the input directory and input files. Check the ownership flags, just like on a *nix filesystem.
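From the command line, a quick check and (if needed) ownership change could look like this; /user/root/input is an assumption, so use whatever input path the job actually reads:
hadoop fs -ls /user/root/input
hadoop fs -chown -R root /user/root/input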
