MapReduce Mapper explanation - Java

There is an NCDC weather dataset example in Hadoop: The Definitive Guide.
The Mapper class code is as follows:
Example 2-3. Mapper for maximum temperature example
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}
And the driver code is:
Example 2-5. Application to find the maximum temperature in the weather dataset
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I'm not able to understand: since we pass a file containing multiple lines, why is there no iteration over the lines? The code looks as if it processes only a single line.

The book explains what Mapper<LongWritable, Text, ...> means: the key is the offset within the file, and the value is a line of text.
It also mentions that TextInputFormat is the default MapReduce input format, which is a subclass of FileInputFormat:
public class TextInputFormat
extends FileInputFormat<LongWritable,Text>
And therefore, the default input types must be LongWritable, Text pairs.
As the JavaDoc says:
Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text.
The book also has sections on defining custom RecordReaders
You need to call job.setInputFormatClass to change this if you want to read anything other than single lines.
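To make the iteration explicit: it is the framework, not your code, that loops over the records in each input split and calls map() once per record, which with TextInputFormat means once per line. Roughly, the loop each map task runs looks like the following simplified sketch (illustrative only, not the verbatim Hadoop source of Mapper.run()):
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        // one call per record; with TextInputFormat, one record == one line
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
So your map() only ever sees a single line at a time; the looping over lines is already done for you.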

Related

Hadoop - MapReduce not reducing

I'm trying to reduce a map like this:
01 true
01 true
01 false
02 false
02 false
where the first column is a Text and the second is a BooleanWritable. The aim is to keep only those keys that have nothing but false next to them, and then write the key's digits as a pair (so the output for the above input would be 0, 2). For this, I wrote the following reducer:
import java.io.IOException;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class BeadReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text _key, Iterable<BooleanWritable> values, Context context) throws IOException, InterruptedException {
// process values
boolean dontwrite= false;
for (BooleanWritable val : values) {
dontwrite = (dontwrite || val.get());
}
if (!dontwrite) {
context.write(new Text(_key.toString().substring(0,1)), new Text(_key.toString().substring(1,2)));
}
else {
context.write(new Text("not"), new Text("good"));
}
}
}
This, however, does nothing. It writes neither the pairs nor "not good", as if it never even enters the if-else branch. All I get is the mapped values (the mapping itself works as intended).
The driver:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class BeadDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "task2");
job.setJarByClass(hu.pack.task2.BeadDriver.class);
// TODO: specify a mapper
job.setMapperClass(hu.pack.task2.BeadMapper.class);
// TODO: specify a reducer
job.setReducerClass(hu.pack.task2.BeadReducer.class);
// TODO: specify output types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(BooleanWritable.class);
// TODO: specify input and output DIRECTORIES (not files)
FileInputFormat.setInputPaths(job, new Path("local"));
FileOutputFormat.setOutputPath(job, new Path("outfiles"));
FileSystem fs;
try {
fs = FileSystem.get(conf);
if (fs.exists(new Path("outfiles")))
fs.delete(new Path("outfiles"),true);
} catch (IOException e1) {
e1.printStackTrace();
}
if (!job.waitForCompletion(true))
return;
}
}
The mapper:
import java.io.IOException;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class BeadMapper extends Mapper<LongWritable, Text, Text, BooleanWritable > {
private final Text wordKey = new Text("");
public void map(LongWritable ikey, Text value, Context context) throws IOException, InterruptedException {
String[] friend = value.toString().split(";");
String[] friendswith = friend[1].split(",");
for (String s : friendswith) {
wordKey.set(friend[0] + s);
context.write(wordKey, new BooleanWritable(true));
wordKey.set(s + friend[0]);
context.write(wordKey, new BooleanWritable(true));
}
if (friendswith.length > 0) {
for(int i = 0; i < friendswith.length-1; ++i) {
for(int j = i+1; j < friendswith.length; ++j) {
wordKey.set(friendswith[i] + friendswith[j]);
context.write(wordKey, new BooleanWritable(false));
}
}
}
}
}
I wonder what the problem is, what am I missing?
The output key and value types of the mapper must be the input types of the reducer, so in your case the reducer must extend
Reducer<Text, BooleanWritable, Text, Text>
(the last two parameters are the reducer's own output types, which in your reduce body are both Text).
setOutputKeyClass and setOutputValueClass set the types for the job output, i.e. for both map and reduce. If you want to specify different types for the mapper, use setMapOutputKeyClass and setMapOutputValueClass.
As a side note: if you don't want the true values in the output, why emit them from the mapper at all? Also, with the code below in the reducer,
for (BooleanWritable val : values) {
dontwrite = (dontwrite || val.get());
}
once dontwrite becomes true it stays true for the rest of the loop, so you can break out early as an optimization.
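Putting that together, here is a sketch of what the corrected reducer could look like (assuming you still want Text output values; adding @Override turns a signature mismatch into a compile-time error instead of a silently ignored method):
import java.io.IOException;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class BeadReducer extends Reducer<Text, BooleanWritable, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<BooleanWritable> values, Context context)
            throws IOException, InterruptedException {
        boolean dontwrite = false;
        for (BooleanWritable val : values) {
            if (val.get()) {
                dontwrite = true;
                break; // once true it stays true, so stop iterating early
            }
        }
        if (!dontwrite) {
            // emit the two digits of the key as a (key, value) pair
            context.write(new Text(key.toString().substring(0, 1)),
                    new Text(key.toString().substring(1, 2)));
        }
    }
}
The driver would then need job.setMapOutputKeyClass(Text.class) and job.setMapOutputValueClass(BooleanWritable.class) for the map output, plus job.setOutputKeyClass(Text.class) and job.setOutputValueClass(Text.class) for the final output.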

In MapReduce, how to send an ArrayList as a value from mapper to reducer [duplicate]

This question already has an answer here:
Output a list from a Hadoop Map Reduce job using custom writable
(1 answer)
Closed 7 years ago.
How can we pass an ArrayList as a value from the mapper to the reducer?
My code has certain rules to work with and creates new values (String) based on those rules. I keep all the outputs (generated after the rule execution) in a list and now need to send this output (the mapper value) to the reducer, but I don't have a way to do so.
Can someone please point me in a direction?
Adding Code
package develop;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import utility.RulesExtractionUtility;
public class CustomMap{
public static class CustomerMapper extends Mapper<Object, Text, Text, Text> {
private Map<String, String> rules;
@Override
public void setup(Context context)
{
try
{
URI[] cacheFiles = context.getCacheFiles();
setupRulesMap(cacheFiles[0].toString());
}
catch (IOException ioe)
{
System.err.println("Error reading state file.");
System.exit(1);
}
}
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
// Map<String, String> rules = new LinkedHashMap<String, String>();
// rules.put("targetcolumn[1]", "ASSIGN(source[0])");
// rules.put("targetcolumn[2]", "INCOME(source[2]+source[3])");
// rules.put("targetcolumn[3]", "ASSIGN(source[1]");
// Above is the "rules", which would basically create some list values from source file
String [] splitSource = value.toString().split(" ");
List<String>lists=RulesExtractionUtility.rulesEngineExecutor(splitSource,rules);
// lists would have values like (name, age) for each line of a huge text file, which is what I want to write to the context and pass to the reducer.
// As of now I haven't implemented the reducer code, as I'm stuck with passing the value from the mapper.
// context.write(new Text(), lists);---- I do not have a way of doing this
}
private void setupRulesMap(String filename) throws IOException
{
Map<String, String> rule = new LinkedHashMap<String, String>();
BufferedReader reader = new BufferedReader(new FileReader(filename));
String line = reader.readLine();
while (line != null)
{
String[] split = line.split("=");
rule.put(split[0], split[1]);
line = reader.readLine();
// rules logic
}
rules = rule;
}
}
public static void main(String[] args) throws IllegalArgumentException, IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
Configuration conf = new Configuration();
if (args.length != 2) {
System.err.println("Usage: customerMapper <in> <out>");
System.exit(2);
}
Job job = Job.getInstance(conf);
job.setJarByClass(CustomMap.class);
job.setMapperClass(CustomerMapper.class);
job.addCacheFile(new URI("Some HDFS location"));
URI[] cacheFiles= job.getCacheFiles();
if(cacheFiles != null) {
for (URI cacheFile : cacheFiles) {
System.out.println("Cache file ->" + cacheFile);
}
}
// job.setReducerClass(Reducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
To pass an ArrayList from the mapper to the reducer, the objects clearly must implement the Writable interface. Why don't you try this library?
<dependency>
<groupId>org.apache.giraph</groupId>
<artifactId>giraph-core</artifactId>
<version>1.1.0-hadoop2</version>
</dependency>
It has an abstract class:
public abstract class ArrayListWritable<M extends org.apache.hadoop.io.Writable>
extends ArrayList<M>
implements org.apache.hadoop.io.Writable, org.apache.hadoop.conf.Configurable
You could create your own class by filling in the abstract methods and implementing the interface methods with your own code. For instance:
public class MyListWritable extends ArrayListWritable<Text>{
...
}
A way to do that (probably neither the only one nor the best one) would be to:
1) serialize your list into a string and pass it as the output value in the mapper;
2) deserialize and rebuild your list from the string when you read the input value in the reducer.
If you do so, you should also get rid of all special symbols in the string containing the serialized list (symbols like \n or \t, for instance). An easy way to achieve that is to use Base64-encoded strings.
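As a minimal sketch of that idea (ListCodec is a hypothetical helper name, and the comma separator is an arbitrary choice that is only safe because each element is Base64-encoded first):
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.hadoop.io.Text;
public class ListCodec {
    private static final String SEP = ",";
    // join the Base64-encoded elements into a single Text the mapper can emit
    public static Text encode(List<String> items) {
        String joined = items.stream()
                .map(s -> Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8)))
                .collect(Collectors.joining(SEP));
        return new Text(joined);
    }
    // split and decode the elements back into a list in the reducer
    public static List<String> decode(Text value) {
        if (value.getLength() == 0) {
            return new ArrayList<>();
        }
        return Arrays.stream(value.toString().split(SEP, -1))
                .map(s -> new String(Base64.getDecoder().decode(s), StandardCharsets.UTF_8))
                .collect(Collectors.toList());
    }
}
In the mapper you would then call context.write(someKey, ListCodec.encode(lists)) and in the reducer call ListCodec.decode(value) for each incoming value.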
You should send Text objects instead of String objects. Then you can use object.toString() in your Reducer. Be sure to configure your driver properly.
If you post your code, we can help you further.

How to override the default sorting of Hadoop

I have a MapReduce job in which the keys are numbers from 1 to 200. My intended output was (number, value) pairs in numerical order.
But I'm getting the output as :
1 value
10 value
11 value
:
:
2 value
20 value
:
:
3 value
I know this is due to the default behavior of MapReduce, which sorts keys in ascending order.
I want my keys to be sorted in numerical order. How can I achieve this?
If I had to take a guess, I'd say that you are storing your numbers as Text objects and not IntWritable objects.
Either way, once you have more than one reducer, only the items within each reducer will be sorted; the overall output won't be totally sorted.
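To see why Text keys produce that ordering, here is a tiny standalone check (purely illustrative; it compares two keys the way the default comparator for each type does during the sort):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
public class KeyOrderDemo {
    public static void main(String[] args) {
        // Text compares bytes lexicographically: "10" sorts before "2"
        System.out.println(new Text("10").compareTo(new Text("2")));           // negative
        // IntWritable compares numeric values: 10 sorts after 2
        System.out.println(new IntWritable(10).compareTo(new IntWritable(2))); // positive
    }
}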
The default WritableComparator in the MapReduce framework would normally handle your numerical ordering if the key were IntWritable. I suspect it's getting a Text key, resulting in lexicographical ordering in your case. Please have a look at the sample code below, which uses an IntWritable key to emit the values:
1) Mapper Implementation
package com.stackoverflow.answers.mapreduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class SourceFileMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
private static final String DEFAULT_DELIMITER = "\t";
private IntWritable keyToEmit = new IntWritable();
private Text valueToEmit = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
keyToEmit.set(Integer.parseInt(line.split(DEFAULT_DELIMITER)[0]));
valueToEmit.set(line.split(DEFAULT_DELIMITER)[1]);
context.write(keyToEmit, valueToEmit);
}
}
2) Reducer Implementation
package com.stackoverflow.answers.mapreduce;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class SourceFileReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
public void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException,
InterruptedException {
for (Text value : values) {
context.write(key, value);
}
}
}
3) Driver Implementation
package com.stackoverflow.answers.mapreduce;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class SourceFileDriver {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Path inputPath = new Path(args[0]);
Path outputDir = new Path(args[1]);
// Create configuration
Configuration conf = new Configuration(true);
// Create job
Job job = new Job(conf, "SourceFileDriver");
job.setJarByClass(SourceFileDriver.class);
// Setup MapReduce
job.setMapperClass(SourceFileMapper.class);
job.setReducerClass(SourceFileReducer.class);
job.setNumReduceTasks(1);
// Specify key / value
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
// Input
FileInputFormat.addInputPath(job, inputPath);
job.setInputFormatClass(TextInputFormat.class);
// Output
FileOutputFormat.setOutputPath(job, outputDir);
job.setOutputFormatClass(TextOutputFormat.class);
// Delete output if exists
FileSystem hdfs = FileSystem.get(conf);
if (hdfs.exists(outputDir))
hdfs.delete(outputDir, true);
// Execute job
int code = job.waitForCompletion(true) ? 0 : 1;
System.exit(code);
}
}
Thank you!

When to use NLineInputFormat in Hadoop Map-Reduce?

I have a text-based input file of around 25 GB, in which a single record consists of 4 lines. The processing for every record is the same, but within each record the four lines are processed differently.
I'm new to Hadoop, so I'd like guidance on whether to use NLineInputFormat in this situation or the default TextInputFormat. Thanks in advance!
Assuming you have the text file in the following format :
2015-8-02
error2014 blahblahblahblah
2015-8-02
blahblahbalh error2014
You could use NLineInputFormat.
With NLineInputFormat functionality, you can specify exactly how many lines should go to a mapper.
In your case you can configure it to send 4 lines to each mapper.
EDIT:
Here is an example for using NLineInputFormat:
Mapper Class:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(key, value);
}
}
Driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Driver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.out
.printf("Two parameters are required for DriverNLineInputFormat- <input dir> <output dir>\n");
return -1;
}
Job job = new Job(getConf());
job.setJobName("NLineInputFormat example");
job.setJarByClass(Driver.class);
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.addInputPath(job, new Path(args[0]));
job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 4);
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MapperNLine.class);
job.setNumReduceTasks(0);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
System.exit(exitCode);
}
}

MapReduce String out of bounds error on String concatenation

I am trying to write a MapReduce job that takes a table stored in a text file. The table has two attributes: the first is an id and the second is a name. The code should take all the values with the same id and concatenate them. For example, 1 xyz 2 xyz 1 abc should result in 1 xyzabc 2 xyz.
The following is my version of the code. As a beginner, I modified the MaxTemperature code to learn how to do this.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperature {
public static class MaxTemperatureMapper
extends Mapper<Text, Text, Text, Text> {
@Override
public void map(Text key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String lastWord = line.substring(line.lastIndexOf(" ")+1);
Text valq = new Text();
valq.set(line.substring(0,4));
context.write(new Text(lastWord), valq );
}
}
public static class MaxTemperatureReducer
extends Reducer<Text, Text, Text, Text> {
@Override
public void reduce(Text key, Iterable<Text> values,
Context context)
throws IOException, InterruptedException {
String p="";
for (Text value : values) {
p=p+value.toString();
}
Text aa= new Text();
aa.set(p);
context.write(key, new Text(aa));
}
}
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
My input file
123456 name
123456 name
123456 age
123456 age
123456 relation
132323 age
123565 name
258963 test
258963 age
254789 age
254259 age
652145 name
985745 name
523698 name
214569 ame
123546 name
123456 age
321456 age
123456 age
124589 hyderabad
Expected Output
123456 name,name,age (all values with index 123456)
124589 hyderabad (all values with index 124589)
I got the following error
java.lang.StringIndexOutOfBoundsException: String index out of range: 4
at java.lang.String.substring(String.java:1907)
at MaxTemperature$MaxTemperatureMapper.map(MaxTemperature.java:39)
at MaxTemperature$MaxTemperatureMapper.map(MaxTemperature.java:26)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Three things:
1) You haven't described the expected input very well, especially in the context of your code.
2) You haven't described what you're trying to do in your map/reduce methods, even if I can guess what you're trying to do.
3) You should check out the Javadoc for String.substring(int, int): http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#substring(int, int)
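For what it's worth, a hedged guess at the likely cause: KeyValueTextInputFormat splits each line on a tab by default, so for space-separated lines like yours the whole line becomes the key and the value is empty, and value.toString().substring(0, 4) on an empty string throws exactly the "String index out of range: 4" you see. Below is a minimal sketch of a mapper that assumes the driver sets the separator to a space (the property name is the one used by recent Hadoop releases; adjust for your version):
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Sketch only: assumes the driver calls
// conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", " ");
// so KeyValueTextInputFormat hands us key = id column, value = name column.
public class IdToNameMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.getLength() == 0) {
            return; // line with no second column; skip instead of throwing
        }
        context.write(key, value); // the reducer then concatenates the values per id
    }
}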
