I have an input text file as given below (partial):
{"author":"Martti Paturi","book":"Aiotko oppikouluun"}
{"author":"International Meeting of Neurobiologists Amsterdam 1959.","book":"Structure and function of the cerebral cortex"}
{"author":"Paraná (Brazil : State). Comissão de Desenvolvimento Municipal.","book":"Plano diretor de desenvolvimento de Maringá"}
I need to perform MapReduce on this file so that the output contains, for each author, a JSON object with all of that author's books collected in a JSON array, in the form:
{"author": "Ian Fleming", "books": [{"book": "Goldfinger"},{"book": "Moonraker"}]}
My code is as follows:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.json.*;
public class CombineBooks {
//TODO define variables and implement necessary components
/*public static class MyTuple implements Writable{
private String author;
private String book;
public void readFields(DataInput in){
JSONObject obj = new JSONObject(in.readLine());
author = obj.getString("author");
book = obj.getString("book");
}
public void write(DataOutput out){
out.writeBytes(author);
out.writeBytes(book);
}
public static MyTuple read(DataInput in){
MyTuple tup = new MyTuple();
tup.readFields(in);
return tup;
}
}*/
public static class Map extends Mapper<LongWritable, Text, Text, Text>{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String author;
String book;
String line = value.toString();
String[] tuple = line.split("\\n");
try{
for(int i=0;i<tuple.length; i++){
JSONObject obj = new JSONObject(tuple[i]);
author = obj.getString("author");
book = obj.getString("book");
context.write(new Text(author), new Text(book));
}
}catch(JSONException e){
e.printStackTrace();
}
}
}
public static class Combine extends Reducer<Text, Text, Text, Text>{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
String booklist = null;
int i = 0;
for(Text val : values){
if(booklist.equals(null)){
booklist = booklist + val.toString();
}
else{
booklist = booklist + "," + val.toString();
}
i++;
}
context.write(key, new Text(booklist));
}
}
public static class Reduce extends Reducer<Text,Text,JSONObject,NullWritable>{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
try{
JSONArray ja = new JSONArray();
String[] book = null;
for(Text val : values){
book = val.toString().split(",");
}
for(int i=0; i<book.length; i++){
JSONObject jo = new JSONObject().put("book", book[i]);
ja.put(jo);
}
JSONObject obj = new JSONObject();
obj.put("author", key.toString());
obj.put("books", ja);
context.write(obj, NullWritable.get());
}catch(JSONException e){
e.printStackTrace();
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: CombineBooks <in> <out>");
System.exit(2);
}
//TODO implement CombineBooks
Job job = new Job(conf, "CombineBooks");
job.setJarByClass(CombineBooks.class);
job.setMapperClass(Map.class);
job.setCombinerClass(Combine.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(JSONObject.class);
job.setOutputValueClass(NullWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
//TODO implement CombineBooks
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
When I try to run it, I get the following error:
java.lang.ClassCastException: class org.json.JSONObject
at java.lang.Class.asSubclass(Class.java:3165)
at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:795)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:964)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:673)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
I am using java-json.jar as an external dependency. I am not sure what the error is here. Any help is appreciated!
The JSON jar file has to be saved in the Hadoop lib folder; then try to execute the program.
Have a look at Hadoop's Writable. You are indeed telling Hadoop which class to use for the output key, but JSONObject doesn't implement the Writable interface.
Why don't you just output text?
context.write(new Text(jo.toString()), NullWritable.get());
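For reference, here is a minimal sketch of how the Reduce class could look once the output key is plain Text (a sketch only, assuming the Combine step is dropped so that each incoming value is a single book title):
public static class Reduce extends Reducer<Text, Text, Text, NullWritable> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        try {
            JSONArray ja = new JSONArray();
            for (Text val : values) {
                ja.put(new JSONObject().put("book", val.toString()));
            }
            JSONObject obj = new JSONObject();
            obj.put("author", key.toString());
            obj.put("books", ja);
            // Text implements Writable, so Hadoop can serialize and compare it.
            context.write(new Text(obj.toString()), NullWritable.get());
        } catch (JSONException e) {
            e.printStackTrace();
        }
    }
}
The driver would then use job.setOutputKeyClass(Text.class); and skip job.setCombinerClass(Combine.class);.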
I am new to MapReduce and Hadoop (Hadoop 3.2.3 and Java 8).
I am trying to separate some lines based on a symbol in a line.
Example: "q1,a,q0," should return ('a',"q1,a,q0,") as (key, value).
My dataset contains ten (10) lines: five (5) for key 'a' and five for key 'b'.
I expect to get 5 lines for each key, but I always get five for 'a' and 10 for 'b'.
Data
A,q0,a,q1;A,q0,b,q0;A,q1,a,q1;A,q1,b,q2;A,q2,a,q1;A,q2,b,q0;B,s0,a,s0;B,s0,b,s1;B,s1,a,s1;B,s1,b,s0
Mapper class:
import java.io.IOException;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MyMapper extends Mapper<LongWritable, Text, ByteWritable ,Text>{
private ByteWritable key1 = new ByteWritable();
//private int n ;
private int count =0 ;
private Text wordObject = new Text();
@Override
public void map(LongWritable key, Text value, Context context)throws IOException, InterruptedException {
String ftext = value.toString();
for (String line: ftext.split(";")) {
wordObject = new Text();
if (line.split(",")[2].equals("b")) {
key1.set((byte) 'b');
wordObject.set(line) ;
context.write(key1,wordObject);
continue ;
}
key1.set((byte) 'a');
wordObject.set(line) ;
context.write(key1,wordObject);
}
}
}
Reducer class:
import java.io.IOException;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
public class MyReducer extends Reducer<ByteWritable, Text, ByteWritable ,Text>{
private Integer count=0 ;
@Override
public void reduce(ByteWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for(Text val : values ) {
count++ ;
}
Text symb = new Text(count.toString()) ;
context.write(key , symb);
}
}
Driver class:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class MyDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: %s [generic options] <inputdir> <outputdir>\n", getClass().getSimpleName());
return -1;
}
@SuppressWarnings("deprecation")
Job job = new Job(getConf());
job.setJarByClass(MyDriver.class);
job.setJobName("separation ");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setMapOutputKeyClass(ByteWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(ByteWritable.class);
job.setOutputValueClass(Text.class);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new MyDriver(), args);
System.exit(exitCode);
}
}
The problem was solved by declaring the variable "count" inside the reduce() function.
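Roughly, the corrected reducer looks like this (just a sketch of the fix described above):
@Override
public void reduce(ByteWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    int count = 0;   // local to this call, so the count is reset for every key
    for (Text val : values) {
        count++;
    }
    context.write(key, new Text(Integer.toString(count)));
}
With the field version, the counter kept accumulating across keys: after 5 for 'a' it continued to 10 for 'b', which matches the output described above.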
Does your input contain more than one line, with 5 more b's on another line? I cannot reproduce the problem with that one line, but your code can be cleaned up.
For the following code, I get output as
a 5
b 5
static class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, ByteWritable, Text> {
final ByteWritable keyOut = new ByteWritable();
final Text valueOut = new Text();
@Override
protected void map(LongWritable key, Text value, org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, ByteWritable, Text>.Context context) throws IOException, InterruptedException {
String line = value.toString();
if (line.isEmpty()) {
return;
}
StringTokenizer tokenizer = new StringTokenizer(line, ";");
while (tokenizer.hasMoreTokens()) {
String token = tokenizer.nextToken();
String[] parts = token.split(",");
String keyStr = parts[2];
if (keyStr.matches("[ab]")) {
keyOut.set((byte) keyStr.charAt(0));
valueOut.set(token);
context.write(keyOut, valueOut);
}
}
}
}
static class Reducer extends org.apache.hadoop.mapreduce.Reducer<ByteWritable, Text, Text, LongWritable> {
static final Text keyOut = new Text();
static final LongWritable valueOut = new LongWritable();
@Override
protected void reduce(ByteWritable key, Iterable<Text> values, org.apache.hadoop.mapreduce.Reducer<ByteWritable, Text, Text, LongWritable>.Context context)
throws IOException, InterruptedException {
keyOut.set(new String(new byte[]{key.get()}, StandardCharsets.UTF_8));
valueOut.set(StreamSupport.stream(values.spliterator(), true)
.mapToLong(v -> 1).sum());
context.write(keyOut, valueOut);
}
}
I am not an expert in Hadoop and I have the following problem. I have a job that has to run on a cluster with Hadoop version 0.20.2.
When I start the job I specify some parameters. Two of them I want to pass to the mapper and reducer classes because I need them there.
I have tried different solutions, and now my code looks like this:
package bigdata;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.TreeMap;
import org.apache.commons.math3.stat.regression.SimpleRegression;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobConfigurable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.io.ParseException;
import com.vividsolutions.jts.io.WKTReader;
public class BoxCount extends Configured implements Tool{
private static String mbr;
private static double cs;
public static class Map extends Mapper<LongWritable, Text, IntWritable, Text> implements JobConfigurable
{
public void configure(JobConf job) {
mbr = job.get(mbr);
cs = job.getDouble("cellSide", 0.1);
}
protected void setup(Context context)
throws IOException, InterruptedException {
// method in which to read the MBR passed as a parameter
System.out.println("mbr: " + mbr + "\ncs: " + cs);
// ...
}
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// some code here
}
protected void cleanup(Context context) throws IOException, InterruptedException
{
// other code
}
}
public static class Reduce extends Reducer<IntWritable,Text,IntWritable,IntWritable>implements JobConfigurable
{
private static String mbr;
private static double cs;
public void configure(JobConf job) {
mbr = job.get(mbr);
cs = job.getDouble("cellSide", 0.1);
}
protected void setup(Context context) throws IOException, InterruptedException
{
System.out.println("mbr: " + mbr + " cs: " + cs);
}
public void reduce(IntWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
//the reduce code
}
@SuppressWarnings("unused")
protected void cleanup(Context context)
throws IOException, InterruptedException {
// cleanup code
}
public BoxCount (String[] args) {
if (args.length != 4) {
// 0 1 2 3
System.out.println("Usage: OneGrid <mbr (Rectangle: (xmin,ymin)-(xmax,ymax))> <cell_Side> <input_path> <output_path>");
System.out.println("args.length = "+args.length);
for(int i = 0; i< args.length;i++)
System.out.println("args["+i+"]"+" = "+args[i]);
System.exit(0);
}
this.numReducers = 1;
//this.mbr = new String(args[0]);
// this.mbr = "Rectangle: (0.01,0.01)-(99.99,99.99)";
// for sierpinski_jts
this.mbr = "Rectangle: (0.0,0.0)-(100.01,86.6125)";
// for diagonal
//this.mbr = "Rectangle: (1.5104351688932738,1.0787616413335854)-(99999.3453727045,99999.98043392139)";
// for uniform
// this.mbr = "Rectangle: (0.3020720559407146,0.2163091760095974)-(99999.68881210628,99999.46079314972)";
this.cellSide = Double.parseDouble(args[1]);
this.inputPath = new Path(args[2]);
this.outputDir = new Path(args[3]);
// Recompute the cellSide so that we get
// at least minNumGriglie (10) grids!
Grid g = new Grid(mbr, cellSide);
if ((this.cellSide*(Math.pow(2,minNumGriglie))) > g.width)
this.cellSide = g.width/(Math.pow(2,minNumGriglie));
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new BoxCount(args), args);
System.exit(res);
}
public int run(String[] args) throws Exception
{
// define new job instead of null using conf
Configuration conf = getConf();
@SuppressWarnings("deprecation")
Job job = new Job(conf, "BoxCount");
// conf.set("mapreduce.framework.name", "local");
// conf.set("mapreduce.jobtracker.address", "local");
// conf.set("fs.defaultFS","file:///");
// pass the mbr value to create the grid
conf.set("mbr", mbr);
// pass the cell side
conf.setDouble("cellSide", cellSide);
job.setJarByClass(BoxCount.class);
// set job input format
job.setInputFormatClass(TextInputFormat.class);
// set map class and the map output key and value classes
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setMapperClass(Map.class);
// set reduce class and the reduce output key and value classes
job.setReducerClass(Reduce.class);
// set job output format
job.setOutputFormatClass(TextOutputFormat.class);
// add the input file as job input (from HDFS) to the variable
// inputFile
TextInputFormat.setInputPaths(job, inputPath);
// set the output path for the job results (to HDFS) to the variable
// outputPath
TextOutputFormat.setOutputPath(job, outputDir);
// set the number of reducers using variable numberReducers
job.setNumReduceTasks(numReducers);
// set the jar class
job.setJarByClass(BoxCount.class);
return job.waitForCompletion(true) ? 0 : 1; // this will execute the job
}
}
But the job does not run. What is the correct solution?
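A common pattern with the new org.apache.hadoop.mapreduce API is to set the values on the Configuration before constructing the Job (the Job takes a copy of the configuration when it is created, so conf.set calls made afterwards are not seen by the tasks) and to read them back in setup() through context.getConfiguration(), without implementing JobConfigurable. A rough sketch, reusing the parameter names and the setDouble/getDouble calls already present in the question:
// Driver side: populate the Configuration first, then build the Job from it.
Configuration conf = getConf();
conf.set("mbr", mbr);
conf.setDouble("cellSide", cellSide);
Job job = new Job(conf, "BoxCount");
// Mapper/Reducer side: read the parameters in setup().
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String mbr = conf.get("mbr");
    double cs = conf.getDouble("cellSide", 0.1);
    System.out.println("mbr: " + mbr + " cs: " + cs);
}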
I'd like to replace values of the input data in my mapper, using dictionaries (CSV) defined in another file. So I tried to put the CSV data into a HashMap and refer to it in the mapper.
The Java code and CSV below are a simplified version of my program. This code works in my local environment (Mac OS X, pseudo-distributed mode), but doesn't on my EC2 instance (Ubuntu, pseudo-distributed mode).
In detail, I got this stdout during the process:
cat:4
human:2
flamingo:1
This means the file reader successfully put the CSV data into the HashMap.
However, the mapper mapped nothing, and therefore I got empty output in the EC2 environment, although it mapped 3 * (the number of lines of the input file) elements and generated the following locally:
test,cat
test,flamingo
test,human
Does anyone have answers or hints?
Test.java
import java.io.IOException;
import java.util.StringTokenizer;
import java.io.FileReader;
import java.io.BufferedReader;
import java.io.DataInput;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.io.WritableUtils;
public class Test {
public static HashMap<String, Integer> map = new HashMap<String, Integer>();
public static class Mapper1 extends Mapper<LongWritable, Text, Text, Text> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
for(Map.Entry<String, Integer> e : map.entrySet()) {
context.write(new Text(e.getKey()), new Text("test"));
}
}
}
public static class Reducer1 extends Reducer<Text, Text, Text, Text> {
@Override
protected void reduce(Text key, Iterable<Text> vals, Context context) throws IOException, InterruptedException {
context.write(new Text("test"), key);
}
}
public static class CommaTextOutputFormat extends TextOutputFormat<Text, Text> {
@Override
public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
Configuration conf = job.getConfiguration();
String extension = ".txt";
Path file = getDefaultWorkFile(job, extension);
FileSystem fs = file.getFileSystem(conf);
FSDataOutputStream fileOut = fs.create(file, false);
return new LineRecordWriter<Text, Text>(fileOut, ",");
}
}
public static void get_list(String list_path){
try {
FileReader fr = new FileReader(list_path);
BufferedReader br = new BufferedReader(fr);
String line = null, name = null;
int leg = 0;
while ((line = br.readLine()) != null) {
if (!line.startsWith("name") && !line.trim().isEmpty()) {
String[] name_leg = line.split(",", 0);
name = name_leg[0];
leg = Integer.parseInt(name_leg[1]);
map.put(name, leg);
}
}
br.close();
}
catch(IOException ex) {
System.err.println(ex.getMessage());
ex.printStackTrace();
}
for(Map.Entry<String, Integer> e : map.entrySet()) {
System.out.println(e.getKey() + ":" + e.getValue());
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
if (args.length != 3) {
System.err.println(
"Need 3 arguments: <input dir> <output base dir> <list path>");
System.exit(1);
}
get_list(args[2]);
Job job = Job.getInstance(conf, "test");
job.setJarByClass(Test.class);
job.setMapperClass(Mapper1.class);
job.setReducerClass(Reducer1.class);
job.setNumReduceTasks(1);
job.setInputFormatClass(TextInputFormat.class);
// mapper output
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
// reducer output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// formatter
job.setOutputFormatClass(CommaTextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
if(!job.waitForCompletion(true)){
System.exit(1);
}
System.out.println("All Finished");
System.exit(0);
}
}
list.csv (args[2])
name,legs
cat,4
human,2
flamingo,1
=================================
I referred to @Rahul Sharma's answer and modified my code as below. Now my code works in both environments.
Thank you very much @Rahul Sharma and @Serhiy for your precise answer and useful comments.
Test.java
import java.io.IOException;
import java.util.StringTokenizer;
import java.io.FileReader;
import java.io.BufferedReader;
import java.io.DataInput;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.net.URI;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.io.WritableUtils;
public class Test {
public static HashMap<String, Integer> map = new HashMap<String, Integer>();
public static class Mapper1 extends Mapper<LongWritable, Text, Text, Text> {
@Override
protected void setup(Context context) throws IOException, InterruptedException {
URI[] files = context.getCacheFiles();
Path list_path = new Path(files[0]);
try {
FileSystem fs = list_path.getFileSystem(context.getConfiguration());
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(list_path)));
String line = null, name = null;
int leg = 0;
while ((line = br.readLine()) != null) {
if (!line.startsWith("name") && !line.trim().isEmpty()) {
String[] name_leg = line.split(",", 0);
name = name_leg[0];
leg = Integer.parseInt(name_leg[1]);
map.put(name, leg);
}
}
br.close();
}
catch(IOException ex) {
System.err.println(ex.getMessage());
ex.printStackTrace();
}
for(Map.Entry<String, Integer> e : map.entrySet()) {
System.out.println(e.getKey() + ":" + e.getValue());
}
}
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
for(Map.Entry<String, Integer> e : map.entrySet()) {
context.write(new Text(e.getKey()), new Text("test"));
}
}
}
public static class Reducer1 extends Reducer<Text, Text, Text, Text> {
@Override
protected void reduce(Text key, Iterable<Text> vals, Context context) throws IOException, InterruptedException {
context.write(new Text("test"), key);
}
}
// Writer
public static class CommaTextOutputFormat extends TextOutputFormat<Text, Text> {
@Override
public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
Configuration conf = job.getConfiguration();
String extension = ".txt";
Path file = getDefaultWorkFile(job, extension);
FileSystem fs = file.getFileSystem(conf);
FSDataOutputStream fileOut = fs.create(file, false);
return new LineRecordWriter<Text, Text>(fileOut, ",");
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
if (args.length != 3) {
System.err.println(
"Need 3 arguments: <input dir> <output base dir> <list path>");
System.exit(1);
}
Job job = Job.getInstance(conf, "test");
job.addCacheFile(new Path(args[2]).toUri());
job.setJarByClass(Test.class);
job.setMapperClass(Mapper1.class);
job.setReducerClass(Reducer1.class);
job.setNumReduceTasks(1);
job.setInputFormatClass(TextInputFormat.class);
// mapper output
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
// reducer output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// formatter
job.setOutputFormatClass(CommaTextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
if(!job.waitForCompletion(true)){
System.exit(1);
}
System.out.println("All Finished");
System.exit(0);
}
}
First you need to learn more about the MapReduce framework.
Your program behaves as expected in local mode because the Mapper, the Reducer and the Job are launched in the same JVM. In pseudo-distributed or fully distributed mode, separate JVMs are allocated for each component. The values you put into the HashMap using get_list are not visible to the mapper and reducer because they run in separate JVMs.
Use the distributed cache to make it work in cluster mode.
In the job's main class, add the file to the distributed cache:
JobConf job = new JobConf();
DistributedCache.addCacheArchive(new URI(args[2]), job);
Access the file in the mapper or reducer:
public void setup(Context context) throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
FileSystem fs = FileSystem.getLocal(conf);
Path[] dataFile = DistributedCache.getLocalCacheFiles(conf);
BufferedReader cacheReader = new BufferedReader(new InputStreamReader(fs.open(dataFile[0])));
// Implement here get_list method functionality
}
This program is supposed to accomplish the MapReduce job. The output of the first job has to be taken as the input of the second job.
When I run it, I get two errors:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException
The mapping part is running 100% but the reducer is not running.
Here's my code:
import java.io.IOException;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.io.LongWritable;
public class MaxPubYear {
public static class FrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
Text word = new Text();
String delim = ";";
Integer year = 0;
String tokens[] = value.toString().split(delim);
if (tokens.length >= 4) {
year = TryParseInt(tokens[3].replace("\"", "").trim());
if (year > 0) {
word = new Text(year.toString());
context.write(word, new IntWritable(1));
}
}
}
}
public static class FrequencyReducer extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
public static class MaxPubYearMapper extends
Mapper<LongWritable, Text, IntWritable, Text> {
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String delim = "\t";
Text valtosend = new Text();
String tokens[] = value.toString().split(delim);
if (tokens.length == 2) {
valtosend.set(tokens[0] + ";" + tokens[1]);
context.write(new IntWritable(1), valtosend);
}
}
}
public static class MaxPubYearReducer extends
Reducer<IntWritable, Text, Text, IntWritable> {
public void reduce(IntWritable key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
int maxiValue = Integer.MIN_VALUE;
String maxiYear = "";
for (Text value : values) {
String token[] = value.toString().split(";");
if (token.length == 2
&& TryParseInt(token[1]).intValue() > maxiValue) {
maxiValue = TryParseInt(token[1]);
maxiYear = token[0];
}
}
context.write(new Text(maxiYear), new IntWritable(maxiValue));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "Frequency");
job.setJarByClass(MaxPubYear.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(FrequencyMapper.class);
job.setCombinerClass(FrequencyReducer.class);
job.setReducerClass(FrequencyReducer.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1] + "_temp"));
int exitCode = job.waitForCompletion(true) ? 0 : 1;
if (exitCode == 0) {
Job SecondJob = new Job(conf, "Maximum Publication year");
SecondJob.setJarByClass(MaxPubYear.class);
SecondJob.setOutputKeyClass(Text.class);
SecondJob.setOutputValueClass(IntWritable.class);
SecondJob.setMapOutputKeyClass(IntWritable.class);
SecondJob.setMapOutputValueClass(Text.class);
SecondJob.setMapperClass(MaxPubYearMapper.class);
SecondJob.setReducerClass(MaxPubYearReducer.class);
FileInputFormat.addInputPath(SecondJob, new Path(args[1] + "_temp"));
FileOutputFormat.setOutputPath(SecondJob, new Path(args[1]));
System.exit(SecondJob.waitForCompletion(true) ? 0 : 1);
}
}
public static Integer TryParseInt(String trim) {
// TODO Auto-generated method stub
return(0);
}
}
Exception in thread "main"
org.apache.hadoop.mapred.FileAlreadyExistsException
A MapReduce job does not overwrite the contents of an existing directory. The output path of an MR job must be a directory path that does not exist; the job will create a directory at the specified path with the output files inside it.
In your code:
FileOutputFormat.setOutputPath(job, new Path(args[1] + "_temp"));
Make sure this path does not exist when you run the MR job.
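If you would rather clear it programmatically while iterating during development, here is a minimal sketch (not part of the original code) that removes a stale output directory before the first job is submitted:
// Caution: this silently deletes previous results.
Path tempOut = new Path(args[1] + "_temp");
FileSystem fs = FileSystem.get(conf);      // needs org.apache.hadoop.fs.FileSystem imported
if (fs.exists(tempOut)) {
    fs.delete(tempOut, true);              // true = recursive delete
}
FileOutputFormat.setOutputPath(job, tempOut);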
Here's my source code
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class PageRank {
public static final String MAGIC_STRING = ">>>>";
boolean overwrite = true;
PageRank(boolean overwrite){
this.overwrite = overwrite;
}
public static class TextPair implements WritableComparable<TextPair>{
Text x;
int ordering;
public TextPair(){
x = new Text();
ordering = 1;
}
public void setText(Text t, int o){
x = t;
ordering = o;
}
public void setText(String t, int o){
x.set(t);
ordering = o;
}
public void readFields(DataInput in) throws IOException {
x.readFields(in);
ordering = in.readInt();
}
public void write(DataOutput out) throws IOException {
x.write(out);
out.writeInt(ordering);
}
public int hashCode() {
return x.hashCode();
}
public int compareTo(TextPair o) {
int x = this.x.compareTo(o.x);
if(x==0)
return ordering-o.ordering;
else
return x;
}
}
public static class MapperA extends Mapper<LongWritable, Text, TextPair, Text> {
private Text word = new Text();
Text title = new Text();
Text link = new Text();
TextPair textpair = new TextPair();
boolean start=false;
String currentTitle="";
private Pattern linkPattern = Pattern.compile("\\[\\[\\s*(.+?)\\s*\\]\\]");
private Pattern titlePattern = Pattern.compile("<title>\\s*(.+?)\\s*</title>");
private Pattern pagePattern = Pattern.compile("<page>\\s*(.+?)\\s*</page>");
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
int startPage=line.lastIndexOf("<title>");
if(startPage<0)
{
Matcher matcher = linkPattern.matcher(line);
int n = 0;
title.set(currentTitle);
while(matcher.find()){
textpair.setText(matcher.group(1), 1);
context.write(textpair, title);
}
link.set(MAGIC_STRING);
textpair.setText(title.toString(), 0);
context.write(textpair, link);
}
else
{
String result=line.trim();
Matcher titleMatcher = titlePattern.matcher(result);
if(titleMatcher.find()){
currentTitle = titleMatcher.group(1);
}
else
{
currentTitle=result;
}
}
}
}
public static class ReducerA extends Reducer<TextPair, Text, Text, Text>{
Text aw = new Text();
boolean valid = false;
String last = "";
public void run(Context context) throws IOException, InterruptedException {
setup(context);
while (context.nextKeyValue()) {
TextPair key = context.getCurrentKey();
Text value = context.getCurrentValue();
if(key.ordering==0){
last = key.x.toString();
}
else if(key.x.toString().equals(last)){
context.write(key.x, value);
}
}
cleanup(context);
}
}
public static class MapperB extends Mapper<Text, Text, Text, Text>{
Text t = new Text();
public void map(Text key, Text value, Context context) throws InterruptedException, IOException{
context.write(value, key);
}
}
public static class ReducerB extends Reducer<Text, Text, Text, PageRankRecord>{
ArrayList<String> q = new ArrayList<String>();
public void reduce(Text key, Iterable<Text> values, Context context)throws InterruptedException, IOException{
q.clear();
for(Text value:values){
q.add(value.toString());
}
PageRankRecord prr = new PageRankRecord();
prr.setPageRank(1.0);
if(q.size()>0){
String[] a = new String[q.size()];
q.toArray(a);
prr.setlinks(a);
}
context.write(key, prr);
}
}
public boolean roundA(Configuration conf, String inputPath, String outputPath, boolean overwrite) throws IOException, InterruptedException, ClassNotFoundException{
if(FileSystem.get(conf).exists(new Path(outputPath))){
if(overwrite){
FileSystem.get(conf).delete(new Path(outputPath), true);
System.err.println("The target file is dirty, overwriting!");
}
else
return true;
}
Job job = new Job(conf, "closure graph build round A");
//job.setJarByClass(GraphBuilder.class);
job.setMapperClass(MapperA.class);
//job.setCombinerClass(RankCombiner.class);
job.setReducerClass(ReducerA.class);
job.setMapOutputKeyClass(TextPair.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setNumReduceTasks(30);
FileInputFormat.addInputPath(job, new Path(inputPath));
SequenceFileOutputFormat.setOutputPath(job, new Path(outputPath));
return job.waitForCompletion(true);
}
public boolean roundB(Configuration conf, String inputPath, String outputPath) throws IOException, InterruptedException, ClassNotFoundException{
if(FileSystem.get(conf).exists(new Path(outputPath))){
if(overwrite){
FileSystem.get(conf).delete(new Path(outputPath), true);
System.err.println("The target file is dirty, overwriting!");
}
else
return true;
}
Job job = new Job(conf, "closure graph build round B");
//job.setJarByClass(PageRank.class);
job.setMapperClass(MapperB.class);
//job.setCombinerClass(RankCombiner.class);
job.setReducerClass(ReducerB.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(PageRankRecord.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setNumReduceTasks(30);
SequenceFileInputFormat.addInputPath(job, new Path(inputPath));
SequenceFileOutputFormat.setOutputPath(job, new Path(outputPath));
return job.waitForCompletion(true);
}
public boolean build(Configuration conf, String inputPath, String outputPath) throws IOException, InterruptedException, ClassNotFoundException{
System.err.println(inputPath);
if(roundA(conf, inputPath, "cgb", true)){
return roundB(conf, "cgb", outputPath);
}
else
return false;
}
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException{
Configuration conf = new Configuration();
//PageRanking.banner("ClosureGraphBuilder");
PageRank cgb = new PageRank(true);
cgb.build(conf, args[0], args[1]);
}
}
Here's how I compile and run it:
javac -classpath hadoop-0.20.1-core.jar -d pagerank_classes PageRank.java PageRankRecord.java
jar -cvf pagerank.jar -C pagerank_classes/ .
bin/hadoop jar pagerank.jar PageRank pagerank result
but I am getting the following errors:
INFO mapred.JobClient: Task Id : attempt_201001012025_0009_m_000001_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: PageRank$MapperA
Can someone tell me what's wrong?
Thanks
If you are using Hadoop 0.20 (and want to use the non-deprecated classes), you can do:
public int run(String[] args) throws Exception {
Job job = new Job();
job.setJarByClass(YourMapReduceClass.class); // <-- omitting this causes above error
job.setMapperClass(MyMapper.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
return 0;
}
Did "PageRank$MapperA.class" end up inside that jar file? It should be in the same place as "PageRank.class".
Try adding "--libjars pagerank.jar". The mapper and reducer run across machines, so you need to distribute your jar to every machine; "--libjars" helps to do that.
For the HADOOP_CLASSPATH you should specify the folder where the JAR file is located...
If you want to understand how the classpath works: http://download.oracle.com/javase/6/docs/technotes/tools/windows/classpath.html
I guess you should change your HADOOP_CLASSPATH variable so that it points to the jar file,
e.g. HADOOP_CLASSPATH=<whatever the path>/PageRank.jar or something like that.
If you are using Eclipse to generate the jar, use the "Extract generated libraries into generated JAR" option.
Although a MapReduce program is parallel processing, the Mapper, Combiner and Reducer classes run in a sequential flow: each stage has to wait for the previous one to complete because it depends on that class's output, which is why you need job.waitForCompletion(true); but the input and output paths must be set before the Mapper, Combiner and Reducer classes start. Reference
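A minimal sketch of that chaining pattern (jobA, jobB, inputPath and outputPath are placeholder names, not taken from the code above):
// The second job reads what the first one wrote, so submit them strictly in sequence.
Path intermediate = new Path(outputPath + "_tmp");
FileInputFormat.addInputPath(jobA, new Path(inputPath));
FileOutputFormat.setOutputPath(jobA, intermediate);
if (jobA.waitForCompletion(true)) {        // block until the first job finishes
    FileInputFormat.addInputPath(jobB, intermediate);
    FileOutputFormat.setOutputPath(jobB, new Path(outputPath));
    System.exit(jobB.waitForCompletion(true) ? 0 : 1);
}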
The solution for this is already answered in https://stackoverflow.com/a/38145962/3452185.