Need assistance with running the WordCount.java provided by Cloudera - java

Hey guys, I am trying to run the WordCount.java example provided by Cloudera. I ran the command below and am getting the exception shown after it. Do you have any suggestions on how to proceed? I have gone through all the steps provided by Cloudera.
Thanks in advance.
hadoop jar ~/Desktop/wordcount.jar org.myorg.WordCount ~/Desktop/input
~/Desktop/output
Error:
ERROR security.UserGroupInformation: PriviledgedActionException
as:root (auth:SIMPLE)
cause:org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist: hdfs://localhost/home/rushabh/Desktop/input
Exception in thread "main"
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: hdfs://localhost/home/rushabh/Desktop/input
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:194)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:205)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:977)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:969)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1248)
at org.myorg.WordCount.main(WordCount.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)

Your input and output files should be in HDFS, or at least the input should be.
Use the following command:
hadoop jar ~/Desktop/wordcount.jar org.myorg.WordCount hdfs:/input
hdfs:/output
To copy a file from your Linux filesystem to HDFS, use the following command:
hadoop dfs -copyFromLocal ~/Desktop/input hdfs:/
and check your file using:
hadoop dfs -ls hdfs:/
Hope this will help.

The error message says that this file does not exist: "hdfs://localhost/home/rushabh/Desktop/input".
Check that the file does exist at the location you've told it to use.
Check that the hostname is correct. You are using "localhost", which most likely resolves to a loopback IP address, e.g. 127.0.0.1. That always means "this host" in the context of the machine you are running the code on.
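If it helps, here is a minimal, hypothetical Java sketch (not from the original post) for checking how that path resolves and whether it exists on the filesystem the job will actually use; the path is the one from the error message and can be swapped for yours.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckInputPath {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml from the classpath, so fs.default.name decides
        // whether this talks to HDFS or the local filesystem.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("/home/rushabh/Desktop/input"); // path from the error message
        System.out.println("Resolves to: " + fs.makeQualified(input));
        System.out.println("Exists: " + fs.exists(input));
    }
}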

When I tried to run the WordCount MapReduce code, I was getting this error:
ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/user/hduser/wordcount
I was trying to execute the WordCount MapReduce Java code with /user/hduser/wordcount and /user/hduser/wordcount-output as the input and output paths. I just prefixed these paths with the 'fs.default.name' value from core-site.xml and it ran perfectly.
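As a rough illustration of what that prefixing does (the hdfs://localhost:54310 value below is a placeholder, not taken from the post), the bare path and the fs.default.name value combine into a fully qualified HDFS URI:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class QualifyPath {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Placeholder default; normally this value comes from core-site.xml on the classpath.
        String defaultFs = conf.get("fs.default.name", "hdfs://localhost:54310");
        Path input = new Path(defaultFs + "/user/hduser/wordcount");
        System.out.println("Qualified input path: " + input); // hdfs://localhost:54310/user/hduser/wordcount
    }
}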

The error clearly states that your input path is local. Please point the input path at something on HDFS rather than on the local machine. My guess is that
hadoop jar ~/Desktop/wordcount.jar org.myorg.WordCount ~/Desktop/input
~/Desktop/output
needs to be changed to
hadoop jar ~/Desktop/wordcount.jar org.myorg.WordCount <hdfs-input-dir>
<hdfs-output-dir>
NOTE: To run a MapReduce job, the input directory must be in HDFS, not on the local filesystem.
Hope this helps.

So I added the input folder to HDFS using the following command:
hadoop dfs -put /usr/lib/hadoop/conf input/

Check the ownership of the files in HDFS to ensure that the owner of the job (root) has read privileges on the input files. Cloudera provides an HDFS viewer that you can use to browse the filespace: open a web browser to either localhost:50075 or {fqdn}:50075 and click on "Browse the filesystem" to view the input directory and input files. Check the ownership flags, just like on a *nix filesystem.
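If you prefer to check this programmatically, here is a hedged sketch (the /user/root/input path is a placeholder) that lists the input directory and prints the owner, group, and permission bits for each file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowOwnership {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Placeholder path: point this at wherever your input lives in HDFS.
        for (FileStatus st : fs.listStatus(new Path("/user/root/input"))) {
            System.out.println(st.getPermission() + " " + st.getOwner()
                    + ":" + st.getGroup() + " " + st.getPath());
        }
    }
}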

Related

Hadoop Hdfs DFSAdmin - Cannot initialize Cluster or hdfs points to file:/// not hdfs://

I'm using the Hadoop DFSAdmin API to report dead blocks for an HDFS backup utility.
But when I invoke my jar in the environment (with the Hadoop jars included in the classpath):
java -classpath "/usr/local/flytxt/hadoop:/usr/local/HBackup/conf:/usr/local/HBackup/lib/*:/usr/local/HBackup/hadoop-distcp-2.7.5.jar" -Djava.library.path="/usr/local/HBackup/lib/native" -jar hdfs-br-dfsadmin-0.0.1-SNAPSHOT.jar
it fails, returning:
report: FileSystem file:/// is not an HDFS file system
Usage: hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning]
This is the code:
String[] argv = { "-report", "-dead" };
DFSAdmin.main(argv);
I tried the suggestions from a related thread, but they didn't help.
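One thing worth knowing: when you launch with java -jar, the -classpath option is ignored, so the directory holding core-site.xml may never reach the classpath and the Configuration falls back to file:///. Below is a minimal sketch, assuming your cluster config lives under /usr/local/HBackup/conf (the file names are assumptions), that loads the config explicitly before calling DFSAdmin:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.tools.DFSAdmin;
import org.apache.hadoop.util.ToolRunner;

public class ReportDeadNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed locations of the cluster config files; adjust to your setup.
        conf.addResource(new Path("/usr/local/HBackup/conf/core-site.xml"));
        conf.addResource(new Path("/usr/local/HBackup/conf/hdfs-site.xml"));
        // fs.defaultFS now points at HDFS instead of the default file:///
        int rc = ToolRunner.run(conf, new DFSAdmin(), new String[] { "-report", "-dead" });
        System.exit(rc);
    }
}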

MapReduce WordCount example issue

I was trying to run the basic WordCount example of Apache MapReduce 2.7 from here:
https://hadoop.apache.org/docs/r2.7.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0
I put the input files at: /user/hadoopLearning/WordCount/input/
Output path: /user/hadoopLearning/WordCount/output/
Then I ran the following command:
hadoop jar wc.jar WordCount /user/hadoopLearning/WordCount/input/file01 /user/hadoopLearning/WordCount/output
However, on running it I get the following error:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: **Output directory** hdfs://sandbox.hortonworks.com:8020/user/hadoopLearning/WordCount/**input**/file01 already exists
I haven't written a single piece of code myself; I copied everything from the Apache website linked above.
I understand the error, but if we look at it closely, it says the output directory already exists, while the path it gives in the stack trace is the input directory's.
Can anyone please help me? I am a beginner in the field of Hadoop. Thanks in advance.
You're trying to create a file that already exists; HDFS doesn't allow that.
Replace your output path ('/user/hadoopLearning/WordCount/output') with something else.
Try this command:
hadoop jar wc.jar WordCount /user/hadoopLearning/WordCount/input/file01 /user/hadoopLearning/WordCount/new_output_path
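If you rerun the job often, a common pattern (a hedged sketch, not part of the tutorial code) is to delete the output directory from the driver before submitting; be careful, since it removes whatever is there:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanOutputDir {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path output = new Path("/user/hadoopLearning/WordCount/output");
        if (fs.exists(output)) {
            fs.delete(output, true); // true = recursive delete of the old results
        }
        System.out.println("Output path is clear: " + output);
    }
}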

Hadoop Hive unable to move source to destination

I am trying to use Hive 1.2.0 over Hadoop 2.6.0. I have created an employee table. However, when I run the following query:
hive> load data local inpath '/home/abc/employeedetails' into table employee;
I get the following error:
Failed with exception Unable to move source file:/home/abc/employeedetails to destination hdfs://localhost:9000/user/hive/warehouse/employee/employeedetails_copy_1
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
What am I doing wrong here? Are there any specific permissions that I need to set? Thanks in advance!
As mentioned by Rio, the issue involved a lack of permissions to load data into the Hive table. I figured out that the following command solves my problem:
hadoop fs -chmod g+w /user/hive/warehouse
Check the permissions on the HDFS directory:
hdfs dfs -ls /user/hive/warehouse/employee/employeedetails_copy_1
It seems you may not have permission to load data into the Hive table.
The error might be due to a permission issue on the local filesystem.
Change the permissions on the local filesystem:
sudo chmod -R 777 /home/abc/employeedetails
Now, run:
hive> load data local inpath '/home/abc/employeedetails' into table employee;
If we face the same error after running the above command in distributed mode, we can try the command below as a superuser on all nodes.
sudo usermod -a -G hdfs yarn
Note: we get this error after restarting all of the YARN services (in Ambari). My problem was resolved. This is an admin command, so take care when running it.
I met the same problem and searched for two days. Finally I found that the reason was that the datanode started for a moment and then shut down.
Steps to solve it:
hadoop fs -chmod -R 777 /home/abc/employeedetails
hadoop fs -chmod -R 777 /user/hive/warehouse/employee/employeedetails_copy_1
Edit hdfs-site.xml and add the following property:
<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
</property>
hdfs --daemon start datanode
Edit hdfs-site.xml again and find the locations of 'dfs.datanode.data.dir' and 'dfs.namenode.name.dir'. If they are the same location, you must change one of them; this was the reason my datanode would not start.
Under 'dfs.datanode.data.dir'/data/current, edit the VERSION file and copy its clusterID into the VERSION file under 'dfs.namenode.name.dir'/data/current.
start-all.sh
If the above does not solve it, follow the steps below, being careful because they are destructive to your data; in my case these steps are what finally solved the problem.
stop-all.sh
Delete the data folders under 'dfs.datanode.data.dir' and 'dfs.namenode.name.dir', and the tmp folder.
hdfs namenode -format
start-all.sh
That solved the problem.
You might also meet another problem like this:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException):
Cannot create directory
/opt/hive/tmp/root/1be8676a-56ac-47aa-ab1c-aa63b21ce1fc. Name node is
in safe mode
The fix: hdfs dfsadmin -safemode leave
It might be because your hive user does not have access to the HDFS directories.

Hadoop: NullPointerException when calling getFsStatistics

I'm encountering the following Exception when running a MapReduce job taking a file stored on HDFS as input:
15/03/27 17:18:12 INFO mapreduce.Job: Task Id : attempt_1427398929405_0005_m_000005_2, Status : FAILED
Error: java.lang.NullPointerException
at org.apache.hadoop.mapred.Task.getFsStatistics(Task.java:347)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.<init>(MapTask.java:486)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:735)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
I don't have a good sense of what this means, other than that it looks possibly related to file permissions. I've also found another Stack Overflow post with the same exception/trace: NullPointerException with MR2 in windows.
To summarize that post: the exception was caused by Hadoop being run as a different user than the one running the MR job.
I've tried the following:
chmod-ing all files and directories in HDFS to 777 (just to experiment)
running the hadoop job with sudo
but neither approach has yielded any results.
I'm running all Hadoop processes on localhost ("pseudo-distributed mode"). I started hadoop using start-yarn.sh and start-dfs.sh with my normal local user. I'm running the hadoop job with the same user. I've also set dfs.datanode.data.dir and dfs.namenode.name.dir to paths on my local machine to which I have permission to read/write with my local user. I've set dfs.permissions.enabled to false.
Am I misinterpreting this Exception? Is there anything else I should try? Thank you.
In the end, it was my own FileSplit subclass causing the problem. I was not correctly (de)serializing the FileSplit's Path, so when it was sent across the wire this field became null. Hadoop then called getFsStatistics on the null Path, causing the NullPointerException.
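For anyone hitting the same thing, here is a hedged sketch (the class name and extra field are hypothetical) of what a correctly serialized FileSplit subclass looks like; the key point is that write/readFields must round-trip every field, including the parent's Path, or it arrives as null on the task side:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TaggedFileSplit extends FileSplit {
    private String tag; // hypothetical extra field carried by the custom split

    public TaggedFileSplit() { } // no-arg constructor required for deserialization

    public TaggedFileSplit(Path file, long start, long length, String[] hosts, String tag) {
        super(file, start, length, hosts);
        this.tag = tag;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        super.write(out); // without this, the Path is never sent across the wire
        Text.writeString(out, tag);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        super.readFields(in); // without this, getPath() returns null on the task side
        tag = Text.readString(in);
    }
}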

java.lang.NoClassDefFoundError with HBase Scan

I am trying to run a MapReduce job to scan a HBase table. Currently I am using the version 0.94.6 of HBase that comes with Cloudera 4.4. At some point in my program I use Scan(), and I properly import it with:
import org.apache.hadoop.hbase.client.Scan;
It compiles well and I am able to create a jar file too. I do it by passing the hbase classpath as the value for the -cp option. When running the program, I obtain the following message:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/client/Scan
I run the code using:
hadoop jar my_program.jar MyJobClass -libjars <list_of_jars>
where list_of_jars contains /opt/cloudera/parcels/CDH/lib/hbase/hbase.jar. Just to double-check, I confirmed that hbase.jar contains Scan. I do it with:
jar tf /opt/cloudera/parcels/CDH/lib/hbase/hbase.jar
And I can see the line:
org/apache/hadoop/hbase/client/Scan.class
in the output. All looks OK to me. I don't understand why it is saying that Scan is not defined; I pass the correct jar, and it contains the class.
Any help is appreciated.
Setting the HADOOP_CLASSPATH variable fixed the issue:
export HADOOP_CLASSPATH=`/usr/bin/hbase classpath`
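A complementary step (a hedged sketch; class and job names are placeholders) is to have the driver ship the HBase jars to the map/reduce tasks via TableMapReduceUtil, so the tasks don't hit the same error at run time; the HADOOP_CLASSPATH export above still covers the client side:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class ScanJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-scan-example"); // placeholder job name
        job.setJarByClass(ScanJobDriver.class);
        // Ships the HBase jars (including the one containing Scan) with the job,
        // so the tasks can resolve them without a cluster-wide classpath change.
        TableMapReduceUtil.addDependencyJars(job);
        // The usual TableMapReduceUtil.initTableMapperJob(...) call would set up the Scan here.
        // job.waitForCompletion(true);
    }
}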
