Map Reduce Wrong Output / Reducer not working - java

I'm trying to gather the max and min temperature of a particular station and then find the sum of the temperatures per day, but I keep getting an error in the mapper. I have tried other approaches, such as using a StringTokenizer, but I get the same error.
Sample input (columns: Station, Date(YYYYMMDD), element, temperature, flag1, flag2, otherValue).
I only need station, date (key), element and temperature from the input:
USW00003889,20180101,TMAX,122,7,1700
USW00003889,20180101,TMIN,-67,7,1700
UK000056225,20180101,TOBS,56,7,1700
UK000056225,20180101,PRCP,0,7,1700
UK000056225,20180101,SNOW,0,7
USC00264341,20180101,SNWD,0,7,1700
USC00256837,20180101,PRCP,0,7,800
UK000056225,20180101,SNOW,0,7
UK000056225,20180101,SNWD,0,7,800
USW00003889,20180102,TMAX,12,E
USW00003889,20180102,TMIN,3,E
UK000056225,20180101,PRCP,42,E
SWE00138880,20180101,PRCP,50,E
UK000056225,20180101,PRCP,0,a
USC00256480,20180101,PRCP,0,7,700
USC00256480,20180101,SNOW,0,7
USC00256480,20180101,SNWD,0,7,700
SWE00138880,20180103,TMAX,-228,7,800
SWE00138880,20180103,TMIN,-328,7,800
USC00247342,20180101,PRCP,0,7,800
UK000056225,20180101,SNOW,0,7
SWE00137764,20180101,PRCP,63,E
UK000056225,20180101,SNWD,0,E
USW00003889,20180104,TMAX,-43,W
USW00003889,20180104,TMIN,-177,W
public static class MaxMinMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private Text newDate = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String stationID = "USW00003889";
        String[] tokens = value.toString().split(",");
        String station = "";
        String date = "";
        String element = "";
        int data = 0;
        station = tokens[0];
        date = tokens[1];
        element = tokens[2];
        data = Integer.parseInt(tokens[3]);
        if (stationID.equals(station) && (element.equals("TMAX") || element.equals("TMIN"))) {
            newDate.set(date);
            context.write(newDate, new IntWritable(data));
        }
    }
}
public static class MaxMinReducer
        extends Reducer<Text, Text, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sumResult = 0;
        int val1 = 0;
        int val2 = 0;
        while (values.iterator().hasNext()) {
            val1 = values.iterator().next().get();
            val2 = values.iterator().next().get();
            sumResult = val1 + val2;
        }
        result.set(sumResult);
        context.write(key, result);
    }
}
}
Please help me out, thanks.
UPDATE: I verified each row against a condition and changed the data variable to a String (I'll change it back to Integer -> IntWritable at a later stage).
if (tokens.length <= 5) {
    station = tokens[0];
    date = tokens[1];
    element = tokens[2];
    data = tokens[3];
    otherValue = tokens[4];
} else {
    station = tokens[0];
    date = tokens[1];
    element = tokens[2];
    data = tokens[3];
    otherValue = tokens[4];
    otherValue2 = tokens[5];
}
Update 2: OK, I'm getting output written to the file now, but it's the wrong output. I need it to add the two values that have the same date (key). What am I doing wrong?
OUTPUT:
20180101 -67
20180101 122
20180102 3
20180102 12
20180104 -177
20180104 -43
Desired Output
20180101 55
20180102 15
20180104 -220
This is the error I receive as well, even though I get output.
ERROR: (gcloud.dataproc.jobs.submit.hadoop) Job [8e31c44ccd394017a4a28b3b16471aca] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at 'https://console.cloud.google.com/dataproc/jobs/8e31c44ccd394017a4a28b3b16471aca
?project=driven-airway-257512&region=us-central1' and in 'gs://dataproc-261a376e-7874-4151-b6b7-566c18758206-us-central1/google-cloud-dataproc-metainfo/f912a2f0-107f-40b6-94
56-b6a72cc8bfc4/jobs/8e31c44ccd394017a4a28b3b16471aca/driveroutput'.
19/11/14 12:53:24 INFO client.RMProxy: Connecting to ResourceManager at cluster-1e8f-m/10.128.0.12:8032
19/11/14 12:53:25 INFO client.AHSProxy: Connecting to Application History server at cluster-1e8f-m/10.128.0.12:10200
19/11/14 12:53:26 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/11/14 12:53:26 INFO input.FileInputFormat: Total input files to process : 1
19/11/14 12:53:26 INFO mapreduce.JobSubmitter: number of splits:1
19/11/14 12:53:26 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/11/14 12:53:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1573654432484_0035
19/11/14 12:53:27 INFO impl.YarnClientImpl: Submitted application application_1573654432484_0035
19/11/14 12:53:27 INFO mapreduce.Job: The url to track the job: http://cluster-1e8f-m:8088/proxy/application_1573654432484_0035/
19/11/14 12:53:27 INFO mapreduce.Job: Running job: job_1573654432484_0035
19/11/14 12:53:35 INFO mapreduce.Job: Job job_1573654432484_0035 running in uber mode : false
19/11/14 12:53:35 INFO mapreduce.Job: map 0% reduce 0%
19/11/14 12:53:41 INFO mapreduce.Job: map 100% reduce 0%
19/11/14 12:53:52 INFO mapreduce.Job: map 100% reduce 20%
19/11/14 12:53:53 INFO mapreduce.Job: map 100% reduce 40%
19/11/14 12:53:54 INFO mapreduce.Job: map 100% reduce 60%
19/11/14 12:53:56 INFO mapreduce.Job: map 100% reduce 80%
19/11/14 12:53:57 INFO mapreduce.Job: map 100% reduce 100%
19/11/14 12:53:58 INFO mapreduce.Job: Job job_1573654432484_0035 completed successfully
19/11/14 12:53:58 INFO mapreduce.Job: Counters: 55
File System Counters
FILE: Number of bytes read=120
FILE: Number of bytes written=1247665
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
GS: Number of bytes read=846
GS: Number of bytes written=76
GS: Number of read operations=0
GS: Number of large read operations=0
GS: Number of write operations=0
HDFS: Number of bytes read=139
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Killed reduce tasks=1
Launched map tasks=1
Launched reduce tasks=5
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=17348
Total time spent by all reduces in occupied slots (ms)=195920
Total time spent by all map tasks (ms)=4337
Total time spent by all reduce tasks (ms)=48980
Total vcore-milliseconds taken by all map tasks=4337
Total vcore-milliseconds taken by all reduce tasks=48980
Total megabyte-milliseconds taken by all map tasks=8882176
Total megabyte-milliseconds taken by all reduce tasks=100311040
Map-Reduce Framework
Map input records=25
Map output records=6
Map output bytes=78
Map output materialized bytes=120
Input split bytes=139
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=120
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps =5
Failed Shuffles=0
Merged Map outputs=5
GC time elapsed (ms)=1409
CPU time spent (ms)=6350
Physical memory (bytes) snapshot=1900220416
Virtual memory (bytes) snapshot=21124952064
Total committed heap usage (bytes)=1492123648
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=846
File Output Format Counters
Bytes Written=76
Job output is complete
Update 3:
I updated the Reducer (following what LowKey said) and it's giving me the same output as above. It's not doing the addition I want it to do; it's completely ignoring that operation. Why?
public static class MaxMinReducer
        extends Reducer<Text, Text, Text, IntWritable> {

    public IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int value = 0;
        int sumResult = 0;
        Iterator<IntWritable> iterator = values.iterator();
        while (values.iterator().hasNext()) {
            value = iterator.next().get();
            sumResult = sumResult + value;
        }
        result.set(sumResult);
        context.write(key, result);
    }
}
Update 4: Adding my imports and driver class to work out why my reducer won't run.
package mapreduceprogram;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class TempMin {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "tempmin");
        job.setJarByClass(TempMin.class);
        job.setMapperClass(MaxMinMapper.class);
        job.setReducerClass(MaxMinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Is there anything wrong with it that would explain why my reducer class isn't running?

What are you doing wrong? Well, for one thing, why do you have:
final int missing = -9999;
That doesn't make any sense.
Below that, you have some code that apparently is supposed to add two values, but it seems like you are accidentally throwing away items from your list. See where you have:
if (values.iterator().next().get() != missing)
well... you never saved the value, so that means you threw it away.
Another problem is that you are adding incorrectly... For some reason you are trying to add two values for every iteration of the loop. You should be adding one, so your loop should look like this:
int value = 0;
Iterator<IntWritable> iterator = values.iterator();
while (iterator.hasNext()) {
    value = iterator.next().get();
    if (value != missing) {
        sumResult = sumResult + value;
    }
}
The next obvious problem is that you put your output line inside your while loop:
while (values.iterator().hasNext()) {
[...]
context.write(key, result);
}
That means that every time you read an item into your reducer, you write an item out. I think what you are trying to do is read in all the items for a given key, and then write a single reduced value (the sum). In that case, you shouldn't have your output inside the loop; it should be after it.
while ([...]) {
[...]
}
result.set(sumResult);
context.write(key, result);
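Putting those pieces together, here is a minimal sketch of how the whole reducer could look, assuming the mapper emits IntWritable values. Note that the Reducer's second type parameter then also needs to be IntWritable, and adding @Override lets the compiler confirm that this reduce method actually overrides the one the framework calls:

public static class MaxMinReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sumResult = 0;
        // Accumulate every value for this key (e.g. TMAX and TMIN for one date).
        for (IntWritable value : values) {
            sumResult += value.get();
        }
        // Emit a single (date, sum) pair per key, after the loop has finished.
        result.set(sumResult);
        context.write(key, result);
    }
}

With the sample data, the values for key 20180101 would be 122 and -67, so the reducer would emit 20180101 55.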

Are those columns separated by tabs?
If yes, then don't expect to find a space character in there.

Related

HADOOP REDUCER JAVA - context.write don't write anything

I have a context.write(...) call in my reducer function, but it doesn't write anything. The weird thing is that the System.out.println(...) just above it works fine and prints the desired result (as you can see on the following screen):
Image of the System.out.println trace
Here is the complete code:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Jointure {
public static class TokenizerMapper extends Mapper<Object, Text, IntWritable, Text> {
private boolean tab2 = false; // true once the line iteration reaches table 2
public void map(Object key, org.apache.hadoop.io.Text value, Context context)
throws IOException, InterruptedException {
Arrays.stream(value.toString().split("\\r?\\n")).forEach(line -> { // iterate over each line of the input file
if ((!tab2) && (!line.equals(""))) { // if the line belongs to table 1
String[] parts = line.split(";");
int idtoWrite = Integer.parseInt(parts[0]);
String valueToWrite = parts[1] + ";Table1";
try {
context.write(new IntWritable(idtoWrite), new Text(valueToWrite)); // emit a key/value pair as output
} catch (Exception e) {
}
} else if (line.equals("")) { // blank line separating the two tables
tab2 = true;
} else if (tab2 && (!line.equals(""))) { // if the line belongs to table 2
String[] parts = line.split(";");
int idtoWrite = Integer.parseInt(parts[0]);
String valueToWrite = parts[1] + ";Table2";
try {
context.write(new IntWritable(idtoWrite), new Text(valueToWrite)); // emit a key/value pair as output
} catch (Exception e) {
}
}
});
}
}
public static class IntSumReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
public void reduce(IntWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
ArrayList<String> listPrenom = new ArrayList<String>();
ArrayList<String> listPays = new ArrayList<String>();
for (Text val : values) {
String[] parts = val.toString().split(";");
String nomOuPays = parts[0];
String table = "";
try {
table = parts[1];
} catch (Exception e) {
}
if (table.equals("Table1")) {
listPrenom.add(nomOuPays);
} else if (table.equals("Table2")) {
listPays.add(nomOuPays);
}
}
for (int i = 0; i < listPrenom.size(); i++) {
for (int j = 0; j < listPays.size(); j++) {
String toWrite = listPrenom.get(i) + " " + listPays.get(j);
System.out.println("=====================WRITE=======================");
System.out.println(toWrite);
context.write(key, new Text(toWrite));
}
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "jointure");
job.setJarByClass(Jointure.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Have you got any idea? Thanks for your time.
EDIT:
Here is the complete trace of the log when I launch the program:
2019-03-14 20:05:03,049 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2019-03-14 20:05:03,116 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2019-03-14 20:05:03,116 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2019-03-14 20:05:03,475 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2019-03-14 20:05:03,542 INFO input.FileInputFormat: Total input files to process : 1
2019-03-14 20:05:03,564 INFO mapreduce.JobSubmitter: number of splits:1
2019-03-14 20:05:03,674 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1184033728_0001
2019-03-14 20:05:03,675 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-03-14 20:05:03,803 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2019-03-14 20:05:03,803 INFO mapreduce.Job: Running job: job_local1184033728_0001
2019-03-14 20:05:03,804 INFO mapred.LocalJobRunner: OutputCommitter set in config null
2019-03-14 20:05:03,808 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2019-03-14 20:05:03,808 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2019-03-14 20:05:03,809 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2019-03-14 20:05:03,845 INFO mapred.LocalJobRunner: Starting task: attempt_local1184033728_0001_m_000000_0
2019-03-14 20:05:03,848 INFO mapred.LocalJobRunner: Waiting for map tasks
2019-03-14 20:05:03,867 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2019-03-14 20:05:03,867 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2019-03-14 20:05:03,918 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2019-03-14 20:05:03,934 INFO mapred.MapTask: Processing split: file:/media/mathis/OS/Cours/Semestre4/Cloud-Internet-objet/Hadoop-MapReduce/inputTab/file-tab:0+56
2019-03-14 20:05:04,046 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2019-03-14 20:05:04,046 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2019-03-14 20:05:04,046 INFO mapred.MapTask: soft limit at 83886080
2019-03-14 20:05:04,046 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
2019-03-14 20:05:04,046 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
2019-03-14 20:05:04,049 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2019-03-14 20:05:04,059 INFO mapred.LocalJobRunner:
2019-03-14 20:05:04,059 INFO mapred.MapTask: Starting flush of map output
2019-03-14 20:05:04,059 INFO mapred.MapTask: Spilling map output
2019-03-14 20:05:04,059 INFO mapred.MapTask: bufstart = 0; bufend = 110; bufvoid = 104857600
2019-03-14 20:05:04,059 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214376(104857504); length = 21/6553600
=====================WRITE=======================
Pierre Allemagne
=====================WRITE=======================
Pierre France
=====================WRITE=======================
Jacques France
2019-03-14 20:05:04,184 INFO mapred.MapTask: Finished spill 0
2019-03-14 20:05:04,234 INFO mapred.Task: Task:attempt_local1184033728_0001_m_000000_0 is done. And is in the process of committing
2019-03-14 20:05:04,237 INFO mapred.LocalJobRunner: map
2019-03-14 20:05:04,238 INFO mapred.Task: Task 'attempt_local1184033728_0001_m_000000_0' done.
2019-03-14 20:05:04,250 INFO mapred.Task: Final Counters for attempt_local1184033728_0001_m_000000_0: Counters: 18
File System Counters
FILE: Number of bytes read=4319
FILE: Number of bytes written=502994
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=7
Map output records=6
Map output bytes=110
Map output materialized bytes=70
Input split bytes=158
Combine input records=6
Combine output records=3
Spilled Records=3
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=0
Total committed heap usage (bytes)=212860928
File Input Format Counters
Bytes Read=56
2019-03-14 20:05:04,251 INFO mapred.LocalJobRunner: Finishing task: attempt_local1184033728_0001_m_000000_0
2019-03-14 20:05:04,252 INFO mapred.LocalJobRunner: map task executor complete.
2019-03-14 20:05:04,256 INFO mapred.LocalJobRunner: Waiting for reduce tasks
2019-03-14 20:05:04,256 INFO mapred.LocalJobRunner: Starting task: attempt_local1184033728_0001_r_000000_0
2019-03-14 20:05:04,269 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2019-03-14 20:05:04,269 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2019-03-14 20:05:04,270 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2019-03-14 20:05:04,274 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#721f3077
2019-03-14 20:05:04,276 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2019-03-14 20:05:04,300 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=625370688, maxSingleShuffleLimit=156342672, mergeThreshold=412744672, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2019-03-14 20:05:04,301 INFO reduce.EventFetcher: attempt_local1184033728_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2019-03-14 20:05:04,321 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1184033728_0001_m_000000_0 decomp: 66 len: 70 to MEMORY
2019-03-14 20:05:04,325 INFO reduce.InMemoryMapOutput: Read 66 bytes from map-output for attempt_local1184033728_0001_m_000000_0
2019-03-14 20:05:04,326 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 66, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->66
2019-03-14 20:05:04,327 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
2019-03-14 20:05:04,327 INFO mapred.LocalJobRunner: 1 / 1 copied.
2019-03-14 20:05:04,327 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2019-03-14 20:05:04,433 INFO mapred.Merger: Merging 1 sorted segments
2019-03-14 20:05:04,433 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 60 bytes
2019-03-14 20:05:04,436 INFO reduce.MergeManagerImpl: Merged 1 segments, 66 bytes to disk to satisfy reduce memory limit
2019-03-14 20:05:04,438 INFO reduce.MergeManagerImpl: Merging 1 files, 70 bytes from disk
2019-03-14 20:05:04,440 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2019-03-14 20:05:04,440 INFO mapred.Merger: Merging 1 sorted segments
2019-03-14 20:05:04,443 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 60 bytes
2019-03-14 20:05:04,445 INFO mapred.LocalJobRunner: 1 / 1 copied.
2019-03-14 20:05:04,493 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2019-03-14 20:05:04,498 INFO mapred.Task: Task:attempt_local1184033728_0001_r_000000_0 is done. And is in the process of committing
2019-03-14 20:05:04,504 INFO mapred.LocalJobRunner: 1 / 1 copied.
2019-03-14 20:05:04,505 INFO mapred.Task: Task attempt_local1184033728_0001_r_000000_0 is allowed to commit now
2019-03-14 20:05:04,541 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1184033728_0001_r_000000_0' to file:/media/mathis/OS/Cours/Semestre4/Cloud-Internet-objet/Hadoop-MapReduce/output
2019-03-14 20:05:04,542 INFO mapred.LocalJobRunner: reduce > reduce
2019-03-14 20:05:04,542 INFO mapred.Task: Task 'attempt_local1184033728_0001_r_000000_0' done.
2019-03-14 20:05:04,544 INFO mapred.Task: Final Counters for attempt_local1184033728_0001_r_000000_0: Counters: 24
File System Counters
FILE: Number of bytes read=4491
FILE: Number of bytes written=503072
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=70
Reduce input records=3
Reduce output records=0
Spilled Records=3
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=6
Total committed heap usage (bytes)=212860928
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Output Format Counters
Bytes Written=8
2019-03-14 20:05:04,544 INFO mapred.LocalJobRunner: Finishing task: attempt_local1184033728_0001_r_000000_0
2019-03-14 20:05:04,544 INFO mapred.LocalJobRunner: reduce task executor complete.
2019-03-14 20:05:04,807 INFO mapreduce.Job: Job job_local1184033728_0001 running in uber mode : false
2019-03-14 20:05:04,811 INFO mapreduce.Job: map 100% reduce 100%
2019-03-14 20:05:04,816 INFO mapreduce.Job: Job job_local1184033728_0001 completed successfully
2019-03-14 20:05:04,846 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=8810
FILE: Number of bytes written=1006066
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=7
Map output records=6
Map output bytes=110
Map output materialized bytes=70
Input split bytes=158
Combine input records=6
Combine output records=3
Reduce input groups=2
Reduce shuffle bytes=70
Reduce input records=3
Reduce output records=0
Spilled Records=6
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=6
Total committed heap usage (bytes)=425721856
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=56
File Output Format Counters
Bytes Written=8

MapReduce application counter outputs 1

I have developed a MapReduce application and I want to find the average and sum of the input data, but the counter outputs only 1. I checked the counter value in the for loop of the Reducer and the value is correct, but the output file prints 1. I will post a sample of the input data and my code below.
Input data
14974|Customer#000014974|cTBm50vGWOXsnoYdbLR9z|4|14-465-794-1875|8431.32|AUTOMOBILE|pending grouches. silent theodolites sleep furiously quick dependencies. dolphins maintain sly
14970|Customer#000014970|FG9Pxox q6cHPHGomY08u|3|13-185-927-7901|9054.14|AUTOMOBILE|ut the carefully even deposits. regular ideas beneath the deposits nag
14963|Customer#000014963|w75qInZOQrR,WzgipSwdpueOM7qeu|6|16-462-356-2145|8397.42|MACHINERY|ly ironic packages: packages cajole ideas. ironic foxes boost. depe
14929|Customer#000014929|mht7IoZNn1Rcmbgwj3OjxqND3|11|21-970-694-9116|9615.16|MACHINERY| according to the final instructions. carefully even requests sleep across t
14904|Customer#000014904|g4Y,pOSAYE 1|9|19-348-888-7443|9924.56|AUTOMOBILE| final, even deposits wake fluffily along the blithely regular excuses. regular, even excuses unwind about
14867|Customer#000014867| V01ThLgnisvKLqnyA7RLMxi|13|23-436-741-1980|9278.31|HOUSEHOLD| final dependencies sleep furiously along the carefully special accounts. requests engage fluffily amo
14856|Customer#000014856|kzt2v lzu,TvOhL|4|14-475-481-5051|9692.63|AUTOMOBILE|ts haggle blithely final, final foxes. furiously regular ideas nag slyly blithely pending deposi
14848|Customer#000014848|K6rA91M3M2HXTjxz46gJWuj|9|19-592-694-6275|9078.19|BUILDING|en, bold warthogs. silent, regular theodolites sleep quickly theodolites. slyl
4412|Customer#000004412|MNJ9DEIivjnbcGZk2W|7|17-665-838-5600|9781.29|MACHINERY| special, regular foxes above the quickly sp
1|Customer#000000001|IVhzIApeRb ot,c,E|15|25-989-741-2988|711.56|BUILDING|to the even, regular platelets. regular, ironic epitaphs nag e
2|Customer#000000002|XSTf4,NCwDVaWNe6tEgvwfmRchLXak|13|23-768-687-3665|121.65|AUTOMOBILE|l accounts. blithely ironic theodolites integrate boldly: caref
3|Customer#000000003|MG9kdTD2WBHm|1|11-719-748-3364|7498.12|AUTOMOBILE| deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov
4|Customer#000000004|XxVSJsLAGtn|4|14-128-190-5944|2866.83|MACHINERY| requests. final, regular ideas sleep final accou
Code
public static class TokenizerMapper extends Mapper<LongWritable, Text,Text ,Text>{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
float balance = 0;
String custKey = "";
int nation = 0;
Text word = new Text();
Text segment = new Text();
String[] line = value.toString().split("\\|");
if (line.length < 7) {
System.err.println("map: Not enough records");
return;
}
custKey = line[1];
try {
nation = Integer.parseInt(line[3]);
balance = Float.parseFloat(line[5]);
} catch (NumberFormatException e) {
e.printStackTrace();
return;
}
if(balance > 8000 && (nation < 15 && nation > 1)){
segment.set(line[6]);
word.set(custKey + "\t" + balance);
context.write(segment,word);
}
}
}
public static class AvgReducer extends Reducer<Text,Text,Text,Text> {
public void reduce(Text key, Iterable<Text> values,Context context) throws IOException, InterruptedException {
float sumBalance = 0,avgBalance = 0;
int count = 0;
for(Text v : values){
String[] a = v.toString().trim().split("\t");
sumBalance += Float.parseFloat(a[1]);
count++;
}
System.out.println("counter2 "+count);
avgBalance = count <= 1 ? sumBalance : avgBalance / count;
context.write(key,new Text(avgBalance+"\t"+count));
}
}
CMD Output
counter 1counter 2counter 3counter 4counter 5counter 6counter 7counter 8counter 9counter 10counter 11counter 12counter 13counter 14counter 15counter 16counter 17counter 18counter 19counter 20counter 21counter 22counter 23counter 24counter 25counter 26counter 27counter 28counter 29counter 30counter 31counter 32counter 33counter 34counter 35counter 36counter 37counter 38counter 39counter 40counter 41counter 42counter 43counter 44counter 45counter 46counter 47counter 48counter 49counter 50counter 51counter 52counter 53counter 54counter 55counter 56counter 57counter 58counter 59counter 60counter 61counter 62counter 63counter 64counter 65counter 66counter 67counter 68counter 69counter 70counter 71counter 72counter 73counter 74counter 75counter 76counter 77counter 78counter 79counter 80counter 81counter 82counter 83counter 84counter 85counter 86counter 87counter 88counter 89counter 90counter 91counter 92counter 93counter 94counter 95counter 96counter 97counter 98counter 99counter 100counter 101counter 102counter 103counter 104counter 105counter 106counter 107counter 108counter 109counter 110counter 111counter 112counter 113counter 114counter 115counter 116counter 117counter 118counter 119counter 120counter 121counter 122counter 123counter 124counter 125counter 126counter 127counter 128counter 129counter 130counter 131counter 132counter 133counter 134counter 135counter 136counter 137counter 138counter 139counter 140counter 141counter 142counter 143counter 144counter 145counter 146counter 147counter 148counter 149counter 150counter 151counter 152counter 153counter 154counter 155counter 156counter 157counter 158counter 159counter 160counter 161counter 162counter 163counter 164counter 165counter 166counter 167counter 168counter 169counter 170counter 171counter 172counter 173counter 174counter 175counter 176counter 177counter 178counter 179counter 180counter 181counter 182counter 183counter 184counter 185counter 186counter 187counter 188counter 189counter 190counter 191counter 192counter 193counter 194counter 195counter 196counter 197counter 198counter 199counter 200counter 201counter 202counter 203counter 204counter 205counter 206counter 207counter 208counter 209counter 210counter 211counter 212counter 213counter 214counter 215counter 216counter 217counter 218counter 219counter 220counter 221counter 222counter 223counter 224counter 225counter 226counter 227counter 228counter 229counter 230counter 231counter 232counter 233counter 234counter 235counter 236counter 237counter 238counter 239counter 240counter 241counter 242counter 243counter 244counter 245counter 246counter 247counter 248counter 249counter 250counter 251counter 252counter 253counter 254counter 255counter 256counter 257counter 258counter 259counter 260counter 261counter 262counter 263counter 264counter 265counter 266counter 267counter 268counter 269counter 270counter 271counter 272counter 273counter 274counter 275counter 276counter 277counter 278counter 279counter 280counter 281counter 282counter 283counter 284counter 285counter 286counter 287counter 288counter2 288
counter2 0
counter2 0
counter2 0
counter2 0
17/04/15 16:51:57 INFO mapred.MapTask: Finished spill 0
17/04/15 16:51:57 INFO mapred.Task: Task:attempt_local1738495890_0001_m_000000_0 is done. And is in the process of committing
17/04/15 16:51:57 INFO mapred.LocalJobRunner: map
17/04/15 16:51:57 INFO mapred.Task: Task 'attempt_local1738495890_0001_m_000000_0' done.
17/04/15 16:51:57 INFO mapred.LocalJobRunner: Finishing task: attempt_local1738495890_0001_m_000000_0
17/04/15 16:51:57 INFO mapred.LocalJobRunner: map task executor complete.
17/04/15 16:51:57 INFO mapred.LocalJobRunner: Waiting for reduce tasks
17/04/15 16:51:57 INFO mapred.LocalJobRunner: Starting task: attempt_local1738495890_0001_r_000000_0
17/04/15 16:51:57 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/04/15 16:51:57 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
17/04/15 16:51:57 INFO mapred.Task: Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree#748e52c
17/04/15 16:51:57 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#1615c2c3
17/04/15 16:51:57 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
17/04/15 16:51:57 INFO reduce.EventFetcher: attempt_local1738495890_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
17/04/15 16:51:57 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1738495890_0001_m_000000_0 decomp: 94 len: 98 to MEMORY
17/04/15 16:51:57 INFO reduce.InMemoryMapOutput: Read 94 bytes from map-output for attempt_local1738495890_0001_m_000000_0
17/04/15 16:51:57 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 94, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->94
17/04/15 16:51:57 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
17/04/15 16:51:57 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/04/15 16:51:57 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
17/04/15 16:51:57 INFO mapred.Merger: Merging 1 sorted segments
17/04/15 16:51:57 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 81 bytes
17/04/15 16:51:57 INFO reduce.MergeManagerImpl: Merged 1 segments, 94 bytes to disk to satisfy reduce memory limit
17/04/15 16:51:57 INFO reduce.MergeManagerImpl: Merging 1 files, 98 bytes from disk
17/04/15 16:51:57 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
17/04/15 16:51:57 INFO mapred.Merger: Merging 1 sorted segments
17/04/15 16:51:57 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 81 bytes
17/04/15 16:51:57 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/04/15 16:51:57 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
counter 1counter2 1
counter2 0
counter2 0
counter2 0
counter2 0
SQL query to implement in MapReduce
select
c_mktsegment, count(c_custkey), avg(c_acctbal)
from
customer
where c_nationkey == '[NATION]' and c_acctbal > [BALANCE]
group by
c_mktsegment;
You're only incrementing count for one of the reducer keys. You're not outputting any AUTOMOBILE record in the mapper because you insist that the balance exceeds 8000 and the nation be in (1, 15). EDIT: I see now that you're pulling in a lot more data than the 7 sample records you posted.
This may also be a problem, once you get your count thing figured out:
avgBalance = count <= 1 ? sumBalance : avgBalance / count;
I think I had a typo in my last answer and you attempted to fix it by assigning avgBalance = 0 for some unknown reason.
You want to divide the sum by the count, not the average.
float avgBalance = count <= 1 ? sumBalance : (sumBalance / count);
Then, your counter prints the length of values, not the count of the customers for a particular key.
SQL query to implement in MapReduce:
where c_nationkey == '[NATION]'
That isn't what your MapReduce is doing, by the way; your code checks nation < 15 && nation > 1.
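For comparison, here is a minimal sketch of a mapper filter that mirrors the SQL predicate instead of the hard-coded range; targetNation and minBalance are hypothetical parameters that would have to be supplied, for example through the job Configuration:

// Hypothetical stand-ins for [NATION] and [BALANCE] from the query.
int targetNation = context.getConfiguration().getInt("query.nation", 15);
float minBalance = context.getConfiguration().getFloat("query.balance", 8000f);

// c_nationkey == '[NATION]' and c_acctbal > [BALANCE]
if (nation == targetNation && balance > minBalance) {
    segment.set(tokens[6]);
    word.set(custKey + "\t" + balance);
    context.write(segment, word);
}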
Other than that, I've fixed your code to produce this output.
AUTOMOBILE 4 9275.662
BUILDING 1 9078.19
HOUSEHOLD 1 9278.31
MACHINERY 3 9264.623
And here is the solution
(Use a HashSet to count unique customers)
public class AvgMapRed extends Configured implements Tool {
public static final String APP_NAME = AvgMapRed.class.getSimpleName();
public static void main(String[] args) throws Exception {
final int status = ToolRunner.run(new Configuration(), new AvgMapRed(), args);
System.exit(status);
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
Job job = Job.getInstance(conf, APP_NAME);
job.setJarByClass(AvgMapRed.class);
job.setMapperClass(TokenizerMapper.class);
job.setInputFormatClass(TextInputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setReducerClass(AverageReducer.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
static class TokenizerMapper extends Mapper<LongWritable, Text, Text, Text> {
private final Text word = new Text();
private final Text segment = new Text();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] tokens = value.toString().split("\\|");
if (tokens.length < 7) {
System.err.printf("mapper: not enough records for %s", Arrays.toString(tokens));
return;
}
String custKey = tokens[1];
int nation = 0;
float balance = 0;
try {
nation = Integer.parseInt(tokens[3]);
balance = Float.parseFloat(tokens[5]);
} catch (NumberFormatException e) {
e.printStackTrace();
return;
}
if (balance > 8000 && (nation < 15 && nation > 1)) {
segment.set(tokens[6]);
word.set(custKey + "\t" + balance);
context.write(segment, word);
}
}
}
static class AverageReducer extends Reducer<Text, Text, Text, Text> {
private final Text output = new Text();
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
float sumBalance = 0;
int count = 0;
List<String> customers = new ArrayList<>();
for (Text v : values) {
String[] a = v.toString().trim().split("\t");
customers.add(a[0]); // Count all customers for this key
sumBalance += Float.parseFloat(a[1]);
count++;
}
float avgBalance = count <= 1 ? sumBalance : (sumBalance / count);
output.set(customers.size() + "\t" + avgBalance);
context.write(key, output);
}
}
}

Error: Java heap space in reducer phase

I am getting a Java heap space error in my reducer phase. I have used 41 reducers in my application and also a custom Partitioner class.
Below is the error my job throws, followed by my reducer code.
17/02/12 05:26:45 INFO mapreduce.Job: map 98% reduce 0%
17/02/12 05:28:02 INFO mapreduce.Job: map 100% reduce 0%
17/02/12 05:28:09 INFO mapreduce.Job: map 100% reduce 17%
17/02/12 05:28:10 INFO mapreduce.Job: map 100% reduce 39%
17/02/12 05:28:11 INFO mapreduce.Job: map 100% reduce 46%
17/02/12 05:28:12 INFO mapreduce.Job: map 100% reduce 51%
17/02/12 05:28:13 INFO mapreduce.Job: map 100% reduce 54%
17/02/12 05:28:14 INFO mapreduce.Job: map 100% reduce 56%
17/02/12 05:28:15 INFO mapreduce.Job: map 100% reduce 88%
17/02/12 05:28:16 INFO mapreduce.Job: map 100% reduce 90%
17/02/12 05:28:18 INFO mapreduce.Job: map 100% reduce 93%
17/02/12 05:28:18 INFO mapreduce.Job: Task Id : attempt_1486663266028_2653_r_000020_0, Status : FAILED
Error: Java heap space
17/02/12 05:28:19 INFO mapreduce.Job: map 100% reduce 91%
17/02/12 05:28:20 INFO mapreduce.Job: Task Id : attempt_1486663266028_2653_r_000021_0, Status : FAILED
Error: Java heap space
17/02/12 05:28:22 INFO mapreduce.Job: Task Id : attempt_1486663266028_2653_r_000027_0, Status : FAILED
Error: Java heap space
17/02/12 05:28:23 INFO mapreduce.Job: map 100% reduce 89%
17/02/12 05:28:24 INFO mapreduce.Job: map 100% reduce 90%
17/02/12 05:28:24 INFO mapreduce.Job: Task Id : attempt_1486663266028_2653_r_000029_0, Status : FAILED
Error: Java heap space
Here is my reducer code:
public class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Logger logger = Logger.getLogger(MyReducer.class);
StringBuilder sb = new StringBuilder();
private MultipleOutputs<NullWritable, Text> multipleOutputs;
public void setup(Context context) {
logger.info("Inside Reducer.");
multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
}
@Override
public void reduce(NullWritable Key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
for (Text value : values) {
final String valueStr = value.toString();
if (valueStr.contains("Japan")) {
sb.append(valueStr.substring(0, valueStr.length() - 20));
} else if (valueStr.contains("SelfSourcedPrivate")) {
sb.append(valueStr.substring(0, valueStr.length() - 29));
} else if (valueStr.contains("SelfSourcedPublic")) {
sb.append(value.toString().substring(0, valueStr.length() - 29));
} else if (valueStr.contains("ThirdPartyPrivate")) {
sb.append(valueStr.substring(0, valueStr.length() - 25));
}
}
multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), "MyFileName");
}
public void cleanup(Context context) throws IOException, InterruptedException {
multipleOutputs.close();
}
}
Can you suggest any change that will solve my problem?
If we use a combiner class, will it improve things?
Finally, I managed to resolve it.
I just used multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName); inside the for loop, and that solved my problem. I have tested it with a very large data set (a 19 GB file) and it worked fine for me.
This is my final solution. Initially I thought it might create too many objects, but it is working fine for me, and the MapReduce job also completes very fast.
@Override
public void reduce(NullWritable Key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
for (Text value : values) {
final String valueStr = value.toString();
StringBuilder sb = new StringBuilder();
if (valueStr.contains("Japan")) {
sb.append(valueStr.substring(0, valueStr.length() - 20));
} else if (valueStr.contains("SelfSourcedPrivate")) {
sb.append(valueStr.substring(0, valueStr.length() - 24));
} else if (valueStr.contains("SelfSourcedPublic")) {
sb.append(value.toString().substring(0, valueStr.length() - 25));
} else if (valueStr.contains("ThirdPartyPrivate")) {
sb.append(valueStr.substring(0, valueStr.length() - 25));
}
multipleOutputs.write(NullWritable.get(), new Text(sb.toString()),
strName);
}
}
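If the values for a single key still do not fit in one reducer's memory, another option (not part of the fix above, and the numbers here are only illustrative) is to give each reduce task more memory from the driver:

Configuration conf = new Configuration();
// Request 4 GB YARN containers for reduce tasks and give each reducer JVM about 3.2 GB of heap.
conf.set("mapreduce.reduce.memory.mb", "4096");
conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
Job job = Job.getInstance(conf, "my-job"); // job name is hypothetical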

MapReduce Hadoop Runtime String Exception

I am trying my hand at a MapReduce program in Hadoop 2.6 using Java. I tried to refer to other posts on Stack Overflow but failed to debug my code.
First let me describe the type of records:
subId=00001111911128052627towerid=11232w34532543456345623453456984756894756bytes=122112212212212218.4621702216543667E17
subId=00001111911128052639towerid=11232w34532543456345623453456984756894756bytes=122112212212212219.6726312167218586E17
subId=00001111911128052615towerid=11232w34532543456345623453456984756894756bytes=122112212212212216.9431647633139046E17
subId=00001111911128052615towerid=11232w34532543456345623453456984756894756bytes=122112212212212214.7836041833447418E17
Now the Mapper Class: AircelMapper.class
import java.io.IOException;
import java.lang.String;
import java.lang.Long;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;
public class AircelMapper extends Mapper<LongWritable,Text,Text, LongWritable>
{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String acquire=value.toString();
String st=acquire.substring(81, 84);
LongWritable bytes=new LongWritable(Long.parseLong(st));
context.write(new Text(acquire.substring(6, 26)), bytes);
}
}
Now the Driver Class: AircelDriver.class
import java.io.IOException;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
public class AircelDriver
{
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException
{
if(args.length<2)
{ System.out.println(" type ip and op file correctly");
System.exit(-1);
}
Job job = Job.getInstance();
job.setJobName(" ############### MY FIRST PROGRAM ###############");
job.setJarByClass(AircelDriver.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapperClass(AircelMapper.class);
job.setReducerClass(AircelReducer.class);
job.submit();
job.waitForCompletion(true);
}
}
I am not posting the Reducer class since the problem is in the mapper code at runtime. The output of the Hadoop runtime is as follows (which is essentially an indication of job failure):
16/12/18 04:11:00 INFO mapred.LocalJobRunner: Starting task: attempt_local1618565735_0001_m_000000_0
16/12/18 04:11:01 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/12/18 04:11:01 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/12/18 04:11:01 INFO mapred.MapTask: Processing split: hdfs://quickstart.cloudera:8020/practice/Data_File.txt:0+1198702
16/12/18 04:11:01 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/12/18 04:11:01 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/12/18 04:11:01 INFO mapred.MapTask: soft limit at 83886080
16/12/18 04:11:01 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/12/18 04:11:01 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/12/18 04:11:01 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/12/18 04:11:01 INFO mapreduce.Job: Job job_local1618565735_0001 running in uber mode : false
16/12/18 04:11:01 INFO mapreduce.Job: map 0% reduce 0%
16/12/18 04:11:02 INFO mapred.MapTask: Starting flush of map output
16/12/18 04:11:02 INFO mapred.MapTask: Spilling map output
16/12/18 04:11:02 INFO mapred.MapTask: bufstart = 0; bufend = 290000; bufvoid = 104857600
16/12/18 04:11:02 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26174400(104697600); length = 39997/6553600
16/12/18 04:11:03 INFO mapred.MapTask: Finished spill 0
16/12/18 04:11:03 INFO mapred.LocalJobRunner: map task executor complete.
16/12/18 04:11:03 WARN mapred.LocalJobRunner: job_local1618565735_0001
java.lang.Exception: java.lang.StringIndexOutOfBoundsException: String index out of range: 84
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 84
at java.lang.String.substring(String.java:1907)
at AircelMapper.map(AircelMapper.java:13)
at AircelMapper.map(AircelMapper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(Fut
Why is it giving a StringIndexOutOfBoundsException? Does the String class internally have a limit on the size of a string? I do not understand what the problem is on lines 13-15 of the Mapper class.
IndexOutOfBoundsException - if the beginIndex is negative, or endIndex is larger than the length of this String object, or beginIndex is larger than endIndex.
public StringIndexOutOfBoundsException(int index)
Constructs a new StringIndexOutOfBoundsException class with an argument indicating the illegal index. - 84 (in your case)
public StringIndexOutOfBoundsException(String s)
Constructs a StringIndexOutOfBoundsException with the specified detail message. - array out of range (in your case)
Check your input at index 84.
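Since at least some input lines are shorter than 85 characters, one way to avoid the crash is to validate each line before calling substring. A minimal sketch of a defensive map method, reusing the fixed offsets 6, 26, 81 and 84 from the original mapper:

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String acquire = value.toString();
    // Skip lines that are too short to contain the bytes field at offsets 81-84.
    if (acquire.length() < 84) {
        return;
    }
    try {
        LongWritable bytes = new LongWritable(Long.parseLong(acquire.substring(81, 84)));
        context.write(new Text(acquire.substring(6, 26)), bytes);
    } catch (NumberFormatException e) {
        // Skip records whose bytes field is not a valid number.
    }
}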

Min Max Count Using Map Reduce

I have developed a MapReduce application to determine the first and last time a user commented and the total number of comments from that user, based on the book written by Donald Miner.
But the problem with my algorithm is the reducer. I have grouped the comments based on user ID. My test data contains two user IDs, each posting 3 comments on different dates, hence a total of 6 rows.
So my reducer output should print two records, each showing the first and last time the user commented and the total comments for that user ID.
But my reducer is printing six records. Can someone point out what's wrong with the following code?
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.arjun.mapreduce.patterns.mapreducepatterns.MRDPUtils;
import com.sun.el.parser.ParseException;
public class MinMaxCount {
public static class MinMaxCountMapper extends
Mapper<Object, Text, Text, MinMaxCountTuple> {
private Text outuserId = new Text();
private MinMaxCountTuple outTuple = new MinMaxCountTuple();
private final static SimpleDateFormat sdf =
new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSS");
@Override
protected void map(Object key, Text value,
org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException {
Map<String, String> parsed =
MRDPUtils.transformXMLtoMap(value.toString());
String date = parsed.get("CreationDate");
String userId = parsed.get("UserId");
try {
Date creationDate = sdf.parse(date);
outTuple.setMin(creationDate);
outTuple.setMax(creationDate);
} catch (java.text.ParseException e) {
System.err.println("Unable to parse Date in XML");
System.exit(3);
}
outTuple.setCount(1);
outuserId.set(userId);
context.write(outuserId, outTuple);
}
}
public static class MinMaxCountReducer extends
Reducer<Text, MinMaxCountTuple, Text, MinMaxCountTuple> {
private MinMaxCountTuple result = new MinMaxCountTuple();
protected void reduce(Text userId, Iterable<MinMaxCountTuple> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException, InterruptedException {
result.setMin(null);
result.setMax(null);
result.setCount(0);
int sum = 0;
int count = 0;
for(MinMaxCountTuple tuple: values )
{
if(result.getMin() == null ||
tuple.getMin().compareTo(result.getMin()) < 0)
{
result.setMin(tuple.getMin());
}
if(result.getMax() == null ||
tuple.getMax().compareTo(result.getMax()) > 0) {
result.setMax(tuple.getMax());
}
System.err.println(count++);
sum += tuple.getCount();
}
result.setCount(sum);
context.write(userId, result);
}
}
/**
* @param args
*/
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String [] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if(otherArgs.length < 2 )
{
System.err.println("Usage MinMaxCout input output");
System.exit(2);
}
Job job = new Job(conf, "Summarization min max count");
job.setJarByClass(MinMaxCount.class);
job.setMapperClass(MinMaxCountMapper.class);
//job.setCombinerClass(MinMaxCountReducer.class);
job.setReducerClass(MinMaxCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(MinMaxCountTuple.class);
FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
boolean result = job.waitForCompletion(true);
if(result)
{
System.exit(0);
}else {
System.exit(1);
}
}
}
Input:
<row Id="8189677" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-07-30T07:29:33.343" UserId="831878" />
<row Id="8189677" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-08-01T07:29:33.343" UserId="831878" />
<row Id="8189677" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-08-02T07:29:33.343" UserId="831878" />
<row Id="8189678" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-06-30T07:29:33.343" UserId="931878" />
<row Id="8189678" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-07-01T07:29:33.343" UserId="931878" />
<row Id="8189678" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-08-02T07:29:33.343" UserId="931878" />
output file contents part-r-00000:
831878 2011-07-30T07:29:33.343 2011-07-30T07:29:33.343 1
831878 2011-08-01T07:29:33.343 2011-08-01T07:29:33.343 1
831878 2011-08-02T07:29:33.343 2011-08-02T07:29:33.343 1
931878 2011-06-30T07:29:33.343 2011-06-30T07:29:33.343 1
931878 2011-07-01T07:29:33.343 2011-07-01T07:29:33.343 1
931878 2011-08-02T07:29:33.343 2011-08-02T07:29:33.343 1
job submission output:
12/12/16 11:13:52 INFO input.FileInputFormat: Total input paths to process : 1
12/12/16 11:13:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/12/16 11:13:52 WARN snappy.LoadSnappy: Snappy native library not loaded
12/12/16 11:13:52 INFO mapred.JobClient: Running job: job_201212161107_0001
12/12/16 11:13:53 INFO mapred.JobClient: map 0% reduce 0%
12/12/16 11:14:06 INFO mapred.JobClient: map 100% reduce 0%
12/12/16 11:14:18 INFO mapred.JobClient: map 100% reduce 100%
12/12/16 11:14:23 INFO mapred.JobClient: Job complete: job_201212161107_0001
12/12/16 11:14:23 INFO mapred.JobClient: Counters: 26
12/12/16 11:14:23 INFO mapred.JobClient: Job Counters
12/12/16 11:14:23 INFO mapred.JobClient: Launched reduce tasks=1
12/12/16 11:14:23 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12264
12/12/16 11:14:23 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/12/16 11:14:23 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/12/16 11:14:23 INFO mapred.JobClient: Launched map tasks=1
12/12/16 11:14:23 INFO mapred.JobClient: Data-local map tasks=1
12/12/16 11:14:23 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10124
12/12/16 11:14:23 INFO mapred.JobClient: File Output Format Counters
12/12/16 11:14:23 INFO mapred.JobClient: Bytes Written=342
12/12/16 11:14:23 INFO mapred.JobClient: FileSystemCounters
12/12/16 11:14:23 INFO mapred.JobClient: FILE_BYTES_READ=204
12/12/16 11:14:23 INFO mapred.JobClient: HDFS_BYTES_READ=888
12/12/16 11:14:23 INFO mapred.JobClient: FILE_BYTES_WRITTEN=43479
12/12/16 11:14:23 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=342
12/12/16 11:14:23 INFO mapred.JobClient: File Input Format Counters
12/12/16 11:14:23 INFO mapred.JobClient: Bytes Read=761
12/12/16 11:14:23 INFO mapred.JobClient: Map-Reduce Framework
12/12/16 11:14:23 INFO mapred.JobClient: Map output materialized bytes=204
12/12/16 11:14:23 INFO mapred.JobClient: Map input records=6
12/12/16 11:14:23 INFO mapred.JobClient: Reduce shuffle bytes=0
12/12/16 11:14:23 INFO mapred.JobClient: Spilled Records=12
12/12/16 11:14:23 INFO mapred.JobClient: Map output bytes=186
12/12/16 11:14:23 INFO mapred.JobClient: Total committed heap usage (bytes)=269619200
12/12/16 11:14:23 INFO mapred.JobClient: Combine input records=0
12/12/16 11:14:23 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
12/12/16 11:14:23 INFO mapred.JobClient: Reduce input records=6
12/12/16 11:14:23 INFO mapred.JobClient: Reduce input groups=2
12/12/16 11:14:23 INFO mapred.JobClient: Combine output records=0
12/12/16 11:14:23 INFO mapred.JobClient: Reduce output records=6
12/12/16 11:14:23 INFO mapred.JobClient: Map output records=6
Ah, caught the culprit. Just change your reduce method's signature to the following:
protected void reduce(Text userId, Iterable<MinMaxCountTuple> values,
Context context)
throws IOException, InterruptedException {
Basically you just need to have Context and not org.apache.hadoop.mapreduce.Reducer.Context
Now the output looks like:
831878 2011-07-30T07:29:33.343 2011-08-02T07:29:33.343 3
931878 2011-06-30T07:29:33.343 2011-08-02T07:29:33.343 3
I tested it locally for you, and this change did the trick. It is odd behavior though, and it would be great if anyone could shed light on it. It has something to do with generics: when org.apache.hadoop.mapreduce.Reducer.Context is used, the compiler warns that:
"Reducer.Context is a raw type. References to generic type Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>.Context should be parameterized"
But when only 'Context' is used it's alright.
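One way to make the compiler catch this kind of mismatch up front is to annotate the method with @Override; a short sketch, assuming the MinMaxCountTuple writable and the parameterized Reducer declaration from the question:

public static class MinMaxCountReducer extends
        Reducer<Text, MinMaxCountTuple, Text, MinMaxCountTuple> {

    @Override // compilation fails if this signature does not match the inherited reduce(...)
    protected void reduce(Text userId, Iterable<MinMaxCountTuple> values, Context context)
            throws IOException, InterruptedException {
        // ... same body as before ...
    }
}

With the raw org.apache.hadoop.mapreduce.Reducer.Context parameter, the method does not override the framework's reduce, so the default identity reducer runs and every mapper output is written through unchanged, which matches the six-record output shown above.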
