I don't understand why the first line of output from my MapReduce job is 0 and null.
The output format is: url ; number of visits
Here is the mapper class:
public class WordCountMapper extends
Mapper<LongWritable, Text, Text, IntWritable>
{
public void map(LongWritable cle, Text valeur, Context sortie)
throws IOException
{
String url="";
int nbVisites=0;
Pattern httplogPattern = Pattern.compile("([^\\s]+) - - \\[(.+)\\] \"([^\\s]+) (/[^\\s]*) HTTP/[^\\s]+\" [^\\s]+ ([0-9]+)");
String ligne = valeur.toString();
if (ligne.length()>0) {
Matcher matcher = httplogPattern.matcher(ligne);
if (matcher.matches()) {
url = matcher.group(1);
nbVisites = Integer.parseInt(matcher.group(5));
}
}
Text urlText = new Text(url);
IntWritable value = new IntWritable(nbVisites);
try
{
sortie.write(urlText, value);
System.out.println(urlText + " ; " + value);
}
catch (InterruptedException e)
{
e.printStackTrace();
}
}
}
and here is the reducer:
public class WordCountReducer extends
Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text key, Iterable<IntWritable> values, Context sortie) throws IOException, InterruptedException
{
Iterator<IntWritable> it = values.iterator();
int nb=0;
while (it.hasNext()) {
nb = nb + it.next().get();
}
try {
sortie.write(key, new IntWritable(nb));
System.out.println(key.toString() + ";" + nb);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Each line of the input file looks like this:
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
and here is the output:
0
04-dynamic-c.rotterdam.luna.net 4
06-dynamic-c.rotterdam.luna.net 1
10.salc.wsu.edu 3
11.ts2.mnet.medstroms.se 1
128.100.183.222 4
128.102.149.149 4
As you can see, the first line is a couple of null values.
Thank you
You get an empty key (not null, just empty) because your mapper's default url is an empty string: every line that does not match the pattern is still written with that empty key and a count of 0, and the reducer then sums those into the first line of your output.
It works fine if you check that your lines actually match before writing the output.
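The smallest possible change, keeping your own class and variable names, is to move the write inside the matcher.matches() branch so that non-matching lines emit nothing at all (a minimal sketch of your mapper with only that change):
public void map(LongWritable cle, Text valeur, Context sortie) throws IOException
{
    Pattern httplogPattern = Pattern.compile("([^\\s]+) - - \\[(.+)\\] \"([^\\s]+) (/[^\\s]*) HTTP/[^\\s]+\" [^\\s]+ ([0-9]+)");
    String ligne = valeur.toString();
    if (ligne.length() > 0)
    {
        Matcher matcher = httplogPattern.matcher(ligne);
        if (matcher.matches())
        {
            // only emit a (url, visits) pair when the line really is an access-log line
            String url = matcher.group(1);
            int nbVisites = Integer.parseInt(matcher.group(5));
            try
            {
                sortie.write(new Text(url), new IntWritable(nbVisites));
            }
            catch (InterruptedException e)
            {
                e.printStackTrace();
            }
        }
    }
}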
Beyond that minimal change, here is a fully refactored version of your code:
public class WebLogDriver extends Configured implements Tool {
public static final String APP_NAME = WebLogDriver.class.getSimpleName();
public static void main(String[] args) throws Exception {
final int status = ToolRunner.run(new Configuration(), new WebLogDriver(), args);
System.exit(status);
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
Job job = Job.getInstance(conf, APP_NAME);
job.setJarByClass(WebLogDriver.class);
// outputs for mapper and reducer
job.setOutputKeyClass(Text.class);
// setup mapper
job.setMapperClass(WebLogDriver.WebLogMapper.class);
job.setMapOutputValueClass(IntWritable.class);
// setup reducer
job.setReducerClass(WebLogDriver.WebLogReducer.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
final Path outputDir = new Path(args[1]);
FileOutputFormat.setOutputPath(job, outputDir);
return job.waitForCompletion(true) ? 0 : 1;
}
static class WebLogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
static final Pattern HTTP_LOG_PATTERN = Pattern.compile("(\\S+) - - \\[(.+)] \"(\\S+) (/\\S*) HTTP/\\S+\" \\S+ (\\d+)");
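// reuse one Text/IntWritable per mapper instance; Hadoop serializes their contents on each context.write, so this is safe and avoids per-record allocations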
final Text keyOut = new Text();
final IntWritable valueOut = new IntWritable();
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
String line = value.toString();
if (line.isEmpty()) return;
Matcher matcher = HTTP_LOG_PATTERN.matcher(line);
if (matcher.matches()) {
keyOut.set(matcher.group(1));
try {
valueOut.set(Integer.parseInt(matcher.group(5)));
context.write(keyOut, valueOut);
} catch (NumberFormatException e) {
e.printStackTrace();
}
}
}
}
static class WebLogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
static final IntWritable valueOut = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
// use a sequential stream here: Hadoop reuses one IntWritable instance for all values,
// so a parallel stream could buffer stale references and compute a wrong sum
int nb = StreamSupport.stream(values.spliterator(), false)
.mapToInt(IntWritable::get)
.sum();
valueOut.set(nb);
context.write(key, valueOut);
}
}
}
I have made a test with a Hadoop Mapper and Reducer with the code below:
App.java
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
System.out.println("Words Count Start!");
System.setProperty("hadoop.home.dir", "C:/Users/xx/Meekou/Meekou.hadoop/hadoop-3.2.2.tar/hadoop-3.2.2");
URL url = App.class.getClassLoader().getResource("wordcount.txt");
Path inputPath = new Path(URLDecoder.decode(url.getFile(),"UTF-8") );
Configuration config = new Configuration();
Job job = Job.getInstance(config, "WordsCount");
job.setJarByClass(App.class);
job.setMapperClass(WordsCountMapper.class);
job.setReducerClass(WordsCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, inputPath);
FileOutputFormat.setOutputPath(job, new Path("./result.txt"));
job.waitForCompletion(true);
System.out.println("Words Count Complete!");
}
WordsCountMapper.java
public class WordsCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private static final Log LOG = LogFactory.getLog(WordsCountMapper.class);
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
String line = value.toString();
LOG.info(line);
String[] words = line.split(",");
for (String word : words) {
LOG.info(word);
context.write(new Text(word), one);
}
// StringTokenizer tokenizer = new StringTokenizer(line);
// while (tokenizer.hasMoreTokens()) {
// word.set(tokenizer.nextToken());
// context.write(word, one);
// }
}
}
WordsCountReducer.java
public class WordsCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> words,
Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
int frequencyForWord = 0;
for(IntWritable word: words){
frequencyForWord += word.get();
}
System.out.println(key);
context.write(key, new IntWritable(frequencyForWord));
}
}
And the test txt file content is:
tom,jack,mary,
rose,anly,billo,anly,
billo,mary,zoor,
zoor,poly,
It generated an empty result folder instead of the expected result.
Aim
I have two CSV files and I am trying to join them. One contains movieId, title and the other contains userId, movieId, comment-tag. I want to find out how many comment-tags each movie has, by printing title, comment_count. So my code:
Driver
public class Driver
{
public Driver(String[] args)
{
if (args.length < 3) {
System.err.println("input path ");
}
try {
Job job = Job.getInstance();
job.setJobName("movie tag count");
// set file input/output path
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, TagMapper.class);
MultipleInputs.addInputPath(job, new Path(args[2]), TextInputFormat.class, MovieMapper.class);
FileOutputFormat.setOutputPath(job, new Path(args[3]));
// set jar class name
job.setJarByClass(Driver.class);
// set mapper and reducer to job
job.setReducerClass(Reducer.class);
// set output key class
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
int returnValue = job.waitForCompletion(true) ? 0 : 1;
System.out.println(job.isSuccessful());
System.exit(returnValue);
} catch (IOException | ClassNotFoundException | InterruptedException e) {
e.printStackTrace();
}
}
}
MovieMapper
public class MovieMapper extends org.apache.hadoop.mapreduce.Mapper<Object, Text, Text, Text>
{
@Override
protected void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
String line = value.toString();
String[] items = line.split("(?!\\B\"[^\"]*),(?![^\"]*\"\\B)"); //comma not in quotes
String movieId = items[0].trim();
if(tryParseInt(movieId))
{
context.write(new Text(movieId), new Text(items[1].trim()));
}
}
private boolean tryParseInt(String s)
{
try {
Integer.parseInt(s);
return true;
} catch (NumberFormatException e) {
return false;
}
}
}
TagMapper
public class TagMapper extends org.apache.hadoop.mapreduce.Mapper<Object, Text, Text, Text>
{
@Override
protected void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
String line = value.toString();
String[] items = line.split("(?!\\B\"[^\"]*),(?![^\"]*\"\\B)");
String movieId = items[1].trim();
if(tryParseInt(movieId))
{
context.write(new Text(movieId), new Text("_"));
}
}
private boolean tryParseInt(String s)
{
try {
Integer.parseInt(s);
return true;
} catch (NumberFormatException e) {
return false;
}
}
}
Reducer
public class Reducer extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, IntWritable>
{
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
int noOfFrequency = 0;
Text movieTitle = new Text();
for (Text o : values)
{
if(o.toString().trim().equals("_"))
{
noOfFrequency++;
}
else
{
System.out.println(o.toString());
movieTitle = o;
}
}
context.write(movieTitle, new IntWritable(noOfFrequency));
}
}
The problem
The result I get is something like this:
title, count
_, count
title, count
title, count
_, count
title, count
_, count
How does this _ get to be the key? I can't understand it. There is an if statement that checks for an _, counts it, and does not set it as the title. Is there something wrong with the toString() method, so that the equals comparison fails? Any ideas?
It is not weird: you iterate through values, and o is a reference to the elements of values, which here are Text objects. At some point you make movieTitle point to the same object as o (movieTitle = o); Hadoop reuses that object, so in later iterations o points to "_" and therefore movieTitle also points to "_".
If you change your code like this, everything works fine:
int noOfFrequency = 0;
Text movieTitle = null;
for (Text o : values)
{
if(o.toString().trim().equals("_"))
{
noOfFrequency++;
}
else
{
movieTitle = new Text(o.toString());
}
}
context.write(movieTitle, new IntWritable(noOfFrequency));
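Equivalently (a sketch of mine, not part of the original answer), you can keep one Text field and copy the bytes into it with Text.set(Text); the essential point is the same: copy the value instead of keeping a reference to the object the framework reuses. The class name below is made up for the illustration; the imports are the same Hadoop classes used in the question.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MovieTitleReducer extends Reducer<Text, Text, Text, IntWritable>
{
    private final Text movieTitle = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
    {
        int noOfFrequency = 0;
        movieTitle.clear();                 // reset between keys, since the field is reused
        for (Text o : values)
        {
            if (o.toString().trim().equals("_"))
            {
                noOfFrequency++;
            }
            else
            {
                movieTitle.set(o);          // copies the underlying bytes out of the reused object
            }
        }
        context.write(movieTitle, new IntWritable(noOfFrequency));
    }
}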
I am trying to practice Big Data MapReduce by making a movie recommendation system. My code:
*imports
public class MRS {
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, Context con)
throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer token = new StringTokenizer(line);
while(token.hasMoreTokens()){
String userId = token.nextToken();
String movieId = token.nextToken();
String ratings =token.nextToken();
token.nextToken();
con.write(new Text(userId), new Text(movieId + "," + ratings));
}
}
}
public static class Reduce extends
Reducer<Text, IntWritable, Text, Text> {
public void reduce(Text key, Iterable<Text> value,Context con ) throws IOException, InterruptedException{
int item_count=0;
int item_sum =0;
String result="[";
for(Text t : value){
String s = t.toString();
StringTokenizer token = new StringTokenizer(s,",");
while(token.hasMoreTokens()){
token.nextToken();
item_sum=item_sum+Integer.parseInt(token.nextToken());
item_count++;
}
result=result+"("+s+"),";
}
result=result.substring(0, result.length()-1);
result=result+"]";
result=String.valueOf(item_count)+","+String.valueOf(item_sum)+","+result;
con.write(key, new Text(result));
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration con = new Configuration();
Job job = new Job(con,"Movie Recommendation");
job.setJarByClass(MRS.class);
job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I am using the MovieLens dataset from here,
of which the input file is u.data,
and my output after running this code should look like this:
userId Item_count,Item_sum,[list of movie_Id with rating]
However, I am getting this
99 173,4
99 288,4
99 66,3
99 203,4
99 105,2
99 12,5
99 1,4
99 741,3
99 895,3
99 619,4
99 742,5
99 294,4
99 196,4
99 328,4
99 120,2
99 246,3
99 232,4
99 181,5
99 201,3
99 978,3
99 123,3
99 433,4
99 345,3
This should be the output of the Map class
I made a few adjustments to the code and it is giving me the exact expected result.
Here is my new code:
imports*
public class MRS {
public static class Map extends
Mapper<LongWritable, Text, IntWritable, Text> {
public void map(LongWritable key, Text value, Context con)
throws IOException, InterruptedException {
String line = value.toString();
String[] s = line.split("\t");
StringTokenizer token = new StringTokenizer(line);
while (token.hasMoreTokens()) {
IntWritable userId = new IntWritable(Integer.parseInt(token
.nextToken()));
String movieId = token.nextToken();
String ratings = token.nextToken();
token.nextToken();
con.write(userId, new Text(movieId + "," + ratings));
}
}
}
public static class Reduce extends
Reducer<IntWritable, Text, IntWritable, Text> {
public void reduce(IntWritable key, Iterable<Text> value, Context con)
throws IOException, InterruptedException {
int item_count = 0;
int item_sum = 0;
String result = "";
for (Text t : value) {
String s = t.toString();
StringTokenizer token = new StringTokenizer(s, ",");
result = result + "[" + s + "],";
}
result = result.substring(1, result.length() - 2);
System.out.println(result);
con.write(key, new Text(result));
}
}
public static void main(String[] args) throws IOException,
ClassNotFoundException, InterruptedException {
Configuration con = new Configuration();
Job job = new Job(con, "Movie Recommendation");
job.setJarByClass(MRS.class);
job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
What I changed is:
Driver code
job.setOutputKeyClass(IntWritable.class);
Mapper code
Mapper<LongWritable, Text, IntWritable, Text>
Reducer code
public static class Reduce extends
Reducer<IntWritable, Text, IntWritable, Text> {
public void reduce(IntWritable key, Iterable<Text> value, Context con) throws
IOException, InterruptedException {
I think the problem was that the declared output key and output value types only matched the mapper, and that's why it was printing the mapper output and not even executing the reducer.
Correct me if I am wrong.
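For what it's worth, one way to read why the reducer never ran (a sketch of my interpretation, not a verified diagnosis): in the first version the class was declared Reducer<Text, IntWritable, Text, Text> while the reduce method took an Iterable<Text>, so it never actually overrode the framework's reduce(Text, Iterable<IntWritable>, Context). Hadoop therefore fell back to the default identity reduce, which simply re-emits the map output, and that matches what was observed. Declaring the value type to match what the mapper emits and adding @Override makes that kind of mismatch a compile-time error:
public static class Reduce extends
        Reducer<Text, Text, Text, Text> {
    @Override   // fails to compile unless the signature really matches Reducer.reduce
    public void reduce(Text key, Iterable<Text> value, Context con)
            throws IOException, InterruptedException {
        // the aggregation logic from the question would go here
    }
}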
I am new to Hadoop, so pardon me if this looks like a silly question.
I am running my MapReduce program below and getting the following error:
java.lang.Exception: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1019)
Any help is appreciated.
public class WordCount {
// Mapper Class
public static class MapperClass extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
// Mapper method defined
public void mapperMethod(Object key,Text lineContent,Context context){
try{
StringTokenizer strToken = new StringTokenizer(lineContent.toString());
//Iterating through the line
while(strToken.hasMoreTokens()){
word.set(strToken.nextToken());
try{
context.write(word, one);
}
catch(Exception e){
System.err.println(new Date()+" ---> Cannot write data to hadoop in Mapper.");
e.printStackTrace();
}
}
}
catch(Exception ex){
ex.printStackTrace();
}
}
}
// Reducer Class
public static class ReducerClass extends Reducer<Text, IntWritable, Text, IntWritable>{
private IntWritable result = new IntWritable();
//Reducer method
public void reduce(Text key,Iterable<IntWritable> values,Context context){
try{
int sum=0;
for(IntWritable itr : values){
sum+=itr.get();
}
result.set(sum);
try {
context.write(key,result);
} catch (Exception e) {
System.err.println(new Date()+" ---> Error while sending data to Hadoop in Reducer");
e.printStackTrace();
}
}
catch (Exception err){
err.printStackTrace();
}
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
try{
Configuration conf = new Configuration();
String [] arguments = new GenericOptionsParser(conf, args).getRemainingArgs();
if(arguments.length!=2){
System.err.println("Enter both and input and output location.");
System.exit(1);
}
Job job = new Job(conf,"Simple Word Count");
job.setJarByClass(WordCount.class);
job.setMapperClass(MapperClass.class);
job.setReducerClass(ReducerClass.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(arguments[0]));
FileOutputFormat.setOutputPath(job, new Path(arguments[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
catch(Exception e){
}
}
}
You need to override the map method in the Mapper class; instead, you have defined a new method.
Coming to your error: as you don't have the map method overridden, the default (identity) mapper runs and emits LongWritable, Text, while you have declared Text and IntWritable as the map output types, hence the type mismatch in the stack trace.
Hope this explains. For reference, here is a standard WordCount mapper written against the old mapred API:
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
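For completeness, here is a minimal sketch of the poster's own (new API) mapper with the method renamed so that it really overrides Mapper.map; apart from the signature and the @Override annotation, the body is unchanged from the question:
public static class MapperClass extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override   // the compiler now rejects any signature that does not match Mapper.map
    public void map(Object key, Text lineContent, Context context)
            throws IOException, InterruptedException {
        StringTokenizer strToken = new StringTokenizer(lineContent.toString());
        while (strToken.hasMoreTokens()) {
            word.set(strToken.nextToken());
            context.write(word, one);
        }
    }
}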
New to Hadoop, and I'm trying to understand how Hadoop reads file input. I am able to use the code below to run a Hadoop job on a 2-column (key/value) input file.
But what if I have 5 columns and the (key/value) I want is A&E (instead of A&B)? Which function do I need to modify exactly?
public class InverterCounter extends Configured implements Tool {
public static class MapClass extends MapReduceBase
implements Mapper<Text, Text, Text, Text> {
public void map(Text key, Text value,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
output.collect(value, key);
}
}
public static class Reduce extends MapReduceBase
implements Reducer<Text, Text, Text, IntWritable> {
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int count = 0;
while (values.hasNext()) {
values.next();
count++;
}
output.collect(key, new IntWritable(count));
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
JobConf job = new JobConf(conf, InverterCounter.class);
Path in = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
job.setJobName("InverterCounter");
job.setMapperClass(MapClass.class);
job.setReducerClass(Reduce.class);
job.setInputFormat(KeyValueTextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.set("key.value.separator.in.input.line", ",");
JobClient.runJob(job);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new InverterCounter(), args);
System.exit(res);
}
}
Any recommendation would be appreciated. I tried changing job.set("key.value.separator.in.input.line", ","); and job.setInputFormat(KeyValueTextInputFormat.class); with no luck; I still could not figure this out.
Thanks
KeyValueTextInputFormat assumes that the key is at the start of each line, so it isn't applicable for your 6 column data set.
Instead, you can use TextInputFormat and extract the key and value yourself. I'm assuming all values in the line are separated by commas (and that there are no commas in the data, which is another story).
With TextInputFormat you receive the full line in value, and the position of the line in the file in key. We don't need the position so we will ignore it. With the full line in a single Text we can turn it into a String, split it by commas, and derive the key and value to emit:
public class InverterCounter extends Configured implements Tool {
public static class MapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
// with TextInputFormat the map input key is the line's byte offset (LongWritable), which we ignore
public void map(LongWritable key, Text value,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
String[] lineFields = value.toString().split(",");
Text outputKey = new Text(lineFields[0] + "," + lineFields[4]);
Text outputValue = new Text(lineFields[1] + "," + lineFields[2] + "," +
lineFields[3] + "," + lineFields[5]);
output.collect(outputKey, outputValue);
}
}
public static class Reduce extends MapReduceBase
implements Reducer<Text, Text, Text, IntWritable> {
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int count = 0;
while (values.hasNext()) {
values.next();
count++;
}
output.collect(key, new IntWritable(count));
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
JobConf job = new JobConf(conf, InverterCounter.class);
Path in = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
job.setJobName("InverterCounter");
job.setMapperClass(MapClass.class);
job.setReducerClass(Reduce.class);
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
JobClient.runJob(job);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new InverterCounter(), args);
System.exit(res);
}
}
I haven't had a chance to test this, so there may be small bugs. You would probably want to rename the class because it is no longer inverting anything. Finally, the value has been sent to the reducer but it isn't being used, so you could just as easily send a NullWritable instead.
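To illustrate that last point, here is a sketch of the NullWritable variant (untested, the same caveats apply): the mapper emits an empty value, the reducer still just counts how many values arrive per key, and the driver additionally has to declare the map output value class since it now differs from the job's output value class.
public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, NullWritable> output,
                    Reporter reporter) throws IOException {
        String[] lineFields = value.toString().split(",");
        // the key still carries the columns we group by; the value carries nothing
        output.collect(new Text(lineFields[0] + "," + lineFields[4]), NullWritable.get());
    }
}

public static class Reduce extends MapReduceBase
        implements Reducer<Text, NullWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<NullWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int count = 0;
        while (values.hasNext()) {
            values.next();
            count++;
        }
        output.collect(key, new IntWritable(count));
    }
}

// in run(): the map output value class no longer matches the job output value class
// job.setMapOutputValueClass(NullWritable.class);
// job.setOutputValueClass(IntWritable.class);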