Simple word count MapReduce example yielding strange results - java

I am having a strange problem with a Hadoop Map/Reduce job. The job submits correctly, runs, but produces incorrect/strange results. It seems as if the mapper and reducer are not run at all. The input file is transformed from:
12
16
132
654
132
12
to
0 12
4 16
8 132
13 654
18 132
23 12
I assume the first column contains the keys generated for the input pairs before the mapper, but neither the mapper nor the reducer seems to run. The job ran fine when I used the old API.
Source for the job is provided below. I am using Hortonworks as the platform.
public class HadoopAnalyzer
{
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
        {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens())
            {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
        {
            int sum = 0;
            for (IntWritable val : values)
            {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception
    {
        JobConf conf = new JobConf(HadoopAnalyzer.class);
        conf.setJobName("wordcount");
        conf.set("mapred.job.tracker", "192.168.229.128:50300");
        conf.set("fs.default.name", "hdfs://192.168.229.128:8020");
        conf.set("fs.defaultFS", "hdfs://192.168.229.128:8020");
        conf.set("hbase.master", "192.168.229.128:60000");
        conf.set("hbase.zookeeper.quorum", "192.168.229.128");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        System.out.println("Executing job.");
        Job job = new Job(conf, "job");
        job.setInputFormatClass(InputFormat.class);
        job.setOutputFormatClass(OutputFormat.class);
        job.setJarByClass(HadoopAnalyzer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextInputFormat.addInputPath(job, new Path("/user/usr/in"));
        TextOutputFormat.setOutputPath(job, new Path("/user/usr/out"));
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.waitForCompletion(true);
        System.out.println("Done.");
    }
}
Maybe I am missing something obvious, but can anyone shed some light on what might be going wrong here?

The output you see is exactly what the identity mapper produces, because you used the following:
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
which should have been:
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
You extended Mapper and Reducer with your own Map and Reduce classes, but you never registered them with the job, so Hadoop ran the default (identity) Mapper and Reducer. That is why each output line is just the key generated by TextInputFormat (the byte offset of the line) followed by the unchanged line.
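For reference, a minimal corrected driver sketch in the new API (paths and class names are taken from the question; the setOutputKeyClass/setOutputValueClass calls are added here because the reducer emits Text/IntWritable pairs rather than the defaults):
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(HadoopAnalyzer.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    // register the custom classes, not the framework base classes
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    TextInputFormat.addInputPath(job, new Path("/user/usr/in"));
    TextOutputFormat.setOutputPath(job, new Path("/user/usr/out"));
    job.waitForCompletion(true);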

Related

Map reduce job executes but does not produce output

Looking for some help please.
The MapReduce job executes but no output is produced. It is a simple program to count the total number of words in a file. I started very simply, to make sure it works, with a txt file that has one row with the following content:
tiny country second largest country second tiny food exporter second
second second
Unfortunately it does not work; any suggestion about where to look next would be appreciated. I have cut and pasted the last part of the output log below.
File System Counters
FILE: Number of bytes read=890
FILE: Number of bytes written=947710
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=1
Map output records=1
Map output bytes=87
Map output materialized bytes=95
Input split bytes=198
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=95
Reduce input records=1
Reduce output records=1
Spilled Records=2
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=7
Total committed heap usage (bytes)=468713472
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=82
File Output Format Counters
Bytes Written=97
Process finished with exit code 0
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] datas = line.split("\t");
        for (String data : datas) {
            Text outputKey = new Text(data);
            IntWritable outputValue = new IntWritable();
            context.write(outputKey, outputValue);
        }
    }
}

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(final Text outputKey,
                       final Iterable<IntWritable> values,
                       final Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(outputKey, new IntWritable(sum));
    }
}

public class Main extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJobName("WordCount");
        job.setJarByClass(Main.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        Path inputFilePath = new Path("/Users/francesco/input/input.txt");
        Path outputFilePath = new Path("/Users/francesco/output/first");
        FileInputFormat.addInputPath(job, inputFilePath);
        FileOutputFormat.setOutputPath(job, outputFilePath);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Main(), args);
        System.exit(exitCode);
    }
}
You don't set any value on the IntWritable you emit in your mapper:
    IntWritable outputValue = new IntWritable();
A no-argument IntWritable holds 0, so every word ends up with a count of 0. It needs to be replaced with:
    IntWritable outputValue = new IntWritable(1);
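A minimal sketch of the corrected map() method (it keeps the tab split from the question; note that the sample line shown above is space-separated, so you may want to split on whitespace instead):
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] datas = line.split("\t");
        for (String data : datas) {
            // emit a count of 1 for each token; the reducer sums these counts
            context.write(new Text(data), new IntWritable(1));
        }
    }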

Map reduce Driver Code not Working

I am learning Big Data Hadoop on my own and I wrote a simple MapReduce word count job which is not working. Please have a look:
// importing all classes
public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String Line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(Line);
            while (tokenizer.hasMoreTokens()) {
                value.set(tokenizer.nextToken());
                context.write(value, new IntWritable(1));
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable x : values) {
                sum = sum + x.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Word Count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
    }
}
But after replacing these lines in the driver code:
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
with these:
    Path outputPath = new Path(args[1]);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    outputPath.getFileSystem(conf).delete(outputPath);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
it works properly.
May I know the reason, and what these lines are for?
I am assuming that when you say it is not working, you were getting the error below:
    org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:54310/<<your_output_directory>> already exists
The output directory must not exist before the MapReduce job is submitted, so that would have given you the above exception.
The new lines of code that you added to the driver get the FileSystem (local or HDFS, depending on the conf object) from the path and delete the output path before the MapReduce job is submitted, so now the job executes because the output directory no longer exists. (Note also that the new lines end with job.waitForCompletion(true), which is what actually submits the job and waits for it; the original main() never submitted the job at all.)
Can you please tell us what error you were getting earlier?
That error was most likely caused by the output path already existing.
The code you added deletes the output path every time: if the output path already exists, it is deleted before the job runs.
You can also write that code like this:
    Path outputPath = new Path(args[1]);
    FileSystem fs = outputPath.getFileSystem(conf);
    if (fs.exists(outputPath)) {
        // true = delete recursively
        fs.delete(outputPath, true);
    }

hadoop mapreduce Mapper reading incorrect value from text file

I am writing a MapReduce program to process a text file and append a string to each line. The problem I am facing is that the Text value coming into the map method of the mapper is incorrect.
Whenever a line in the file is shorter than the previous line, a few characters are automatically appended to the line, making its length equal to the previously read line.
The map method parameters are as below:
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
I am logging the value inside the map method and observing this behavior.
Any pointers?
Code Snippet
Driver
    Configuration configuration = new Configuration();
    configuration.set("CLIENT_ID", "Test");
    Job job = Job.getInstance(configuration, JOB_NAME);
    job.setJarByClass(JobDriver.class);
    job.setMapperClass(AdwordsMapper.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
Mapper
    public class AdwordsMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String textLine = new String(value.getBytes());
            textLine = new StringBuffer(textLine).append(",")
                    .append(context.getConfiguration().get("CLIENT_ID")).toString();
            context.write(new Text(""), new Text(textLine));
        }
    }
As far as I know, the problem in your mapper is getBytes(). Hadoop reuses the same Text object for every record, and getBytes() returns its whole backing byte array, which can still contain trailing bytes from a longer, previously read line; only the first getLength() bytes are valid. So instead of this:
    String textLine = new String(value.getBytes());
try this:
    String textLine = value.toString();
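For completeness, a minimal sketch of the corrected map() method (identifiers taken from the question):
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // toString() converts only the valid getLength() bytes of the reused Text buffer
        String textLine = value.toString();
        textLine = textLine + "," + context.getConfiguration().get("CLIENT_ID");
        context.write(new Text(""), new Text(textLine));
    }
An equivalent fix is new String(value.getBytes(), 0, value.getLength(), StandardCharsets.UTF_8), which also respects the valid length of the buffer.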

How to specify the KeyValueTextInputFormat separator in the Hadoop .20 API?

In the new API (apache.hadoop.mapreduce.KeyValueTextInputFormat), how do I specify a separator (delimiter) other than tab (which is the default) to separate the key and the value?
Sample Input :
one,first line
two,second line
Output required:
Key : one
Value : first line
Key : two
Value : second line
I am specifying KeyValueTextInputFormat as:
    Job job = new Job(conf, "Sample");
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    KeyValueTextInputFormat.addInputPath(job, new Path("/home/input.txt"));
This is working fine for tab as a separator.
In the newer API you should use mapreduce.input.keyvaluelinerecordreader.key.value.separator configuration property.
Here's an example:
    Configuration conf = new Configuration();
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    Job job = new Job(conf);
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    // next job set-up
Please set the following in the driver code:
    conf.set("key.value.separator.in.input.line", ",");
For KeyValueTextInputFormat the input line should be a key-value pair separated by "\t":
    Key1    Value1,Value2
By changing the default separator, you will be able to read the file as you wish.
For the new API, here is the solution:
    // New API
    Configuration conf = new Configuration();
    conf.set("key.value.separator.in.input.line", ",");
    Job job = new Job(conf);
    job.setInputFormatClass(KeyValueTextInputFormat.class);
Map
    public class Map extends Mapper<Text, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            System.out.println("key---> " + key);
            System.out.println("value---> " + value.toString());
            ...
Output
    key---> one
    value---> first line
    key---> two
    value---> second line
It's a matter of sequence.
The line conf.set("key.value.separator.in.input.line", ",") must come before you create the instance of the Job class, so:
    conf.set("key.value.separator.in.input.line", ",");
    Job job = new Job(conf);
First, the new API was not finished in 0.20.*, so if you want to use the new API in 0.20.* you have to implement the feature yourself. For example, you can use FileInputFormat to achieve this: ignore the LongWritable key and split the Text value on the comma yourself, as in the sketch below.
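A minimal sketch of that manual-split approach (the class name and the two-argument split are illustrative):
    public static class CommaSplitMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // split only on the first comma, so the value part may itself contain commas
            String[] parts = line.toString().split(",", 2);
            if (parts.length == 2) {
                context.write(new Text(parts[0].trim()), new Text(parts[1].trim()));
            }
        }
    }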
By default, the KeyValueTextInputFormat class uses tab as the separator between key and value in the input text file.
If you want to read the input with a custom separator, then you have to set the corresponding configuration attribute.
For the new Hadoop APIs, it is different:
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ";");
Example
public class KeyValueTextInput extends Configured implements Tool {

    public static void main(String args[]) throws Exception {
        String log4jConfPath = "log4j.properties";
        PropertyConfigurator.configure(log4jConfPath);
        int res = ToolRunner.run(new KeyValueTextInput(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        Configuration conf = this.getConf();
        //conf.set("key.value.separator.in.input.line", ",");
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
        Job job = Job.getInstance(conf, "WordCountSampleTemplate");
        job.setJarByClass(KeyValueTextInput.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        //job.setMapOutputKeyClass(Text.class);
        //job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        Path outputPath = new Path(args[1]);
        FileSystem fs = FileSystem.get(new URI(outputPath.toString()), conf);
        fs.delete(outputPath, true);
        FileOutputFormat.setOutputPath(job, outputPath);
        return job.waitForCompletion(true) ? 0 : 1;
    }
}

class Map extends Mapper<Text, Text, Text, Text> {
    public void map(Text k1, Text v1, Context context) throws IOException, InterruptedException {
        context.write(k1, v1);
    }
}

class Reduce extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text Key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        String sum = " || ";
        for (Text value : values)
            sum = sum + value.toString() + " || ";
        context.write(Key, new Text(sum));
    }
}

Hadoop MapReduce job with HDFS input and HBase output

I'm new to Hadoop.
I have a MapReduce job which is supposed to get its input from HDFS and write the output of the reducer to HBase. I haven't found any good example.
Here's the code. The error running this example is "Type mismatch in map, expected ImmutableBytesWritable received IntWritable".
Mapper Class
    public static class AddValueMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, IntWritable> {

        /* input  <key: line number, value: full line>
         * output <key: log key, value: value> */
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            byte[] key;
            int value, pos = 0;
            String line = value.toString();
            String p1, p2 = null;
            pos = line.indexOf("=");
            // Key part
            p1 = line.substring(0, pos);
            p1 = p1.trim();
            key = Bytes.toBytes(p1);
            // Value part
            p2 = line.substring(pos + 1);
            p2 = p2.trim();
            value = Integer.parseInt(p2);
            context.write(new ImmutableBytesWritable(key), new IntWritable(value));
        }
    }
Reducer Class
    public static class AddValuesReducer extends TableReducer<ImmutableBytesWritable, IntWritable, ImmutableBytesWritable> {

        public void reduce(ImmutableBytesWritable key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            // Loop over the values
            while (values.iterator().hasNext()) {
                total += values.iterator().next().get();
            }
            // Put to HBase
            Put put = new Put(key.get());
            put.add(Bytes.toBytes("data"), Bytes.toBytes("total"), Bytes.toBytes(total));
            context.write(key, put);
        }
    }
I had a similar job with HDFS only and it works fine.
Edited 18-06-2013: the college project finished successfully two years ago. For the job configuration (driver part), check the correct answer.
Here is the code which will solve your problem.
Driver
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "JOB_NAME");
    job.setJarByClass(yourclass.class);
    job.setMapperClass(yourMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(job, new Path(inputPath));
    TableMapReduceUtil.initTableReducerJob(TABLE, yourReducer.class, job);
    job.setReducerClass(yourReducer.class);
    job.waitForCompletion(true);
Mapper & Reducer
    class yourMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // @Override map()
    }

    class yourReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
        // @Override reduce()
    }
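To flesh out the reducer skeleton, a reduce() method for that TableReducer could look roughly like the sketch below (the column family "data" and qualifier "total" are taken from the question; put.add() is the pre-1.0 HBase API, newer versions use addColumn() instead):
    class yourReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (IntWritable value : values) {
                total += value.get();
            }
            // the row key is the mapper's key; one cell holds the summed total
            Put put = new Put(Bytes.toBytes(key.toString()));
            put.add(Bytes.toBytes("data"), Bytes.toBytes("total"), Bytes.toBytes(total));
            context.write(new ImmutableBytesWritable(Bytes.toBytes(key.toString())), put);
        }
    }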
Not sure why the HDFS-only version works: normally you have to set the input format for the job, and FileInputFormat is an abstract class. Perhaps you left some lines out, such as:
    job.setInputFormatClass(TextInputFormat.class);
The best and fastest way to bulk load data into HBase is to use HFileOutputFormat and the CompleteBulkLoad utility.
You will find sample code here:
Hope this will be useful :)
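A rough sketch of that bulk-load approach, based on the 0.9x-era HBase API (the class names BulkLoadDriver and PutGeneratingMapper, the table name "target_table", and the paths are placeholders; newer HBase versions use HFileOutputFormat2 with a slightly different signature):
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulkload-prepare");
    job.setJarByClass(BulkLoadDriver.class);
    job.setMapperClass(PutGeneratingMapper.class);      // emits <ImmutableBytesWritable, Put>
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path("/user/usr/in"));
    Path hfileDir = new Path("/user/usr/hfiles");
    FileOutputFormat.setOutputPath(job, hfileDir);
    HTable table = new HTable(conf, "target_table");
    // wires in the reducer, partitioner and output format needed to write HFiles
    HFileOutputFormat.configureIncrementalLoad(job, table);
    if (job.waitForCompletion(true)) {
        // moves the generated HFiles into the table's regions
        new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
    }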
Change this:
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
to ImmutableBytesWritable, IntWritable.
I am not sure... hope it works.
