I'm trying to gather the max and min temperature of a particular station and then find the sum of the temperatures per day, but I keep getting an error in the mapper. I have tried other approaches, such as using a StringTokenizer, but I get the same error.
Sample input (columns: Station, Date (YYYYMMDD), Element, Temperature, Flag1, Flag2, OtherValue).
I only need station, date (the key), element, and temperature from the input:
USW00003889,20180101,TMAX,122,7,1700
USW00003889,20180101,TMIN,-67,7,1700
UK000056225,20180101,TOBS,56,7,1700
UK000056225,20180101,PRCP,0,7,1700
UK000056225,20180101,SNOW,0,7
USC00264341,20180101,SNWD,0,7,1700
USC00256837,20180101,PRCP,0,7,800
UK000056225,20180101,SNOW,0,7
UK000056225,20180101,SNWD,0,7,800
USW00003889,20180102,TMAX,12,E
USW00003889,20180102,TMIN,3,E
UK000056225,20180101,PRCP,42,E
SWE00138880,20180101,PRCP,50,E
UK000056225,20180101,PRCP,0,a
USC00256480,20180101,PRCP,0,7,700
USC00256480,20180101,SNOW,0,7
USC00256480,20180101,SNWD,0,7,700
SWE00138880,20180103,TMAX,-228,7,800
SWE00138880,20180103,TMIN,-328,7,800
USC00247342,20180101,PRCP,0,7,800
UK000056225,20180101,SNOW,0,7
SWE00137764,20180101,PRCP,63,E
UK000056225,20180101,SNWD,0,E
USW00003889,20180104,TMAX,-43,W
USW00003889,20180104,TMIN,-177,W
public static class MaxMinMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private Text newDate = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String stationID = "USW00003889";
        String[] tokens = value.toString().split(",");
        String station = "";
        String date = "";
        String element = "";
        int data = 0;

        station = tokens[0];
        date = tokens[1];
        element = tokens[2];
        data = Integer.parseInt(tokens[3]);

        if (stationID.equals(station) && (element.equals("TMAX") || element.equals("TMIN"))) {
            newDate.set(date);
            context.write(newDate, new IntWritable(data));
        }
    }
}
public static class MaxMinReducer
        extends Reducer<Text, Text, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sumResult = 0;
        int val1 = 0;
        int val2 = 0;

        while (values.iterator().hasNext()) {
            val1 = values.iterator().next().get();
            val2 = values.iterator().next().get();
            sumResult = val1 + val2;
        }
        result.set(sumResult);
        context.write(key, result);
    }
}
}
Please help me out, thanks.
UPDATE: I verified each row with a condition and changed the data variable to a String (I'll change it back to Integer -> IntWritable at a later stage).
if (tokens.length <= 5) {
    station = tokens[0];
    date = tokens[1];
    element = tokens[2];
    data = tokens[3];
    otherValue = tokens[4];
} else {
    station = tokens[0];
    date = tokens[1];
    element = tokens[2];
    data = tokens[3];
    otherValue = tokens[4];
    otherValue2 = tokens[5];
}
Update 2: OK, I'm getting output written to the file now, but it's the wrong output. I need it to add the two values that have the same date (key). What am I doing wrong?
OUTPUT:
20180101 -67
20180101 122
20180102 3
20180102 12
20180104 -177
20180104 -43
Desired Output
20180101 55
20180102 15
20180104 -220
This is the error I receive as well, even though I get output.
ERROR: (gcloud.dataproc.jobs.submit.hadoop) Job [8e31c44ccd394017a4a28b3b16471aca] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at 'https://console.cloud.google.com/dataproc/jobs/8e31c44ccd394017a4a28b3b16471aca
?project=driven-airway-257512&region=us-central1' and in 'gs://dataproc-261a376e-7874-4151-b6b7-566c18758206-us-central1/google-cloud-dataproc-metainfo/f912a2f0-107f-40b6-94
56-b6a72cc8bfc4/jobs/8e31c44ccd394017a4a28b3b16471aca/driveroutput'.
19/11/14 12:53:24 INFO client.RMProxy: Connecting to ResourceManager at cluster-1e8f-m/10.128.0.12:8032
19/11/14 12:53:25 INFO client.AHSProxy: Connecting to Application History server at cluster-1e8f-m/10.128.0.12:10200
19/11/14 12:53:26 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/11/14 12:53:26 INFO input.FileInputFormat: Total input files to process : 1
19/11/14 12:53:26 INFO mapreduce.JobSubmitter: number of splits:1
19/11/14 12:53:26 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/11/14 12:53:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1573654432484_0035
19/11/14 12:53:27 INFO impl.YarnClientImpl: Submitted application application_1573654432484_0035
19/11/14 12:53:27 INFO mapreduce.Job: The url to track the job: http://cluster-1e8f-m:8088/proxy/application_1573654432484_0035/
19/11/14 12:53:27 INFO mapreduce.Job: Running job: job_1573654432484_0035
19/11/14 12:53:35 INFO mapreduce.Job: Job job_1573654432484_0035 running in uber mode : false
19/11/14 12:53:35 INFO mapreduce.Job: map 0% reduce 0%
19/11/14 12:53:41 INFO mapreduce.Job: map 100% reduce 0%
19/11/14 12:53:52 INFO mapreduce.Job: map 100% reduce 20%
19/11/14 12:53:53 INFO mapreduce.Job: map 100% reduce 40%
19/11/14 12:53:54 INFO mapreduce.Job: map 100% reduce 60%
19/11/14 12:53:56 INFO mapreduce.Job: map 100% reduce 80%
19/11/14 12:53:57 INFO mapreduce.Job: map 100% reduce 100%
19/11/14 12:53:58 INFO mapreduce.Job: Job job_1573654432484_0035 completed successfully
19/11/14 12:53:58 INFO mapreduce.Job: Counters: 55
File System Counters
FILE: Number of bytes read=120
FILE: Number of bytes written=1247665
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
GS: Number of bytes read=846
GS: Number of bytes written=76
GS: Number of read operations=0
GS: Number of large read operations=0
GS: Number of write operations=0
HDFS: Number of bytes read=139
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Killed reduce tasks=1
Launched map tasks=1
Launched reduce tasks=5
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=17348
Total time spent by all reduces in occupied slots (ms)=195920
Total time spent by all map tasks (ms)=4337
Total time spent by all reduce tasks (ms)=48980
Total vcore-milliseconds taken by all map tasks=4337
Total vcore-milliseconds taken by all reduce tasks=48980
Total megabyte-milliseconds taken by all map tasks=8882176
Total megabyte-milliseconds taken by all reduce tasks=100311040
Map-Reduce Framework
Map input records=25
Map output records=6
Map output bytes=78
Map output materialized bytes=120
Input split bytes=139
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=120
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps =5
Failed Shuffles=0
Merged Map outputs=5
GC time elapsed (ms)=1409
CPU time spent (ms)=6350
Physical memory (bytes) snapshot=1900220416
Virtual memory (bytes) snapshot=21124952064
Total committed heap usage (bytes)=1492123648
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=846
File Output Format Counters
Bytes Written=76
Job output is complete
Update 3:
I updated the Reducer (following what LowKey said) and it's giving me the same output as above. It's not doing the addition I want it to do; it's completely ignoring that operation. Why?
public static class MaxMinReducer
        extends Reducer<Text, Text, Text, IntWritable> {

    public IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int value = 0;
        int sumResult = 0;
        Iterator<IntWritable> iterator = values.iterator();

        while (values.iterator().hasNext()) {
            value = iterator.next().get();
            sumResult = sumResult + value;
        }
        result.set(sumResult);
        context.write(key, result);
    }
}
Update 4: Adding my imports and driver class to work out why my reducer won't run.
package mapreduceprogram;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class TempMin {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "tempmin");
        job.setJarByClass(TempMin.class);
        job.setMapperClass(MaxMinMapper.class);
        job.setReducerClass(MaxMinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Is there anything wrong with it that would explain why my reducer class isn't running?
What are you doing wrong? Well, for one thing, why do you have:
final int missing = -9999;
That doesn't make any sense.
Below that, you have some code that apparently is supposed to add two values, but it seems like you are accidentally throwing away items from your list. See where you have:
if (values.iterator().next().get() != missing)
well... you never saved the value, so that means you threw it away.
Another problem is that you are adding incorrectly... For some reason you are trying to add two values for every iteration of the loop. You should be adding one, so your loop should look like this:
int value = 0;
Iterator<IntWritable> iterator = values.iterator();
while (iterator.hasNext()) {
    value = iterator.next().get();
    if (value != missing) {
        sumResult = sumResult + value;
    }
}
The next obvious problem is that you put your output line inside your while loop:
while (values.iterator().hasNext()) {
    [...]
    context.write(key, result);
}
That means that every time you read an item into your reducer, you write an item out. I think what you are trying to do is read in all the items for a given key and then write a single reduced value (the sum). In that case, you shouldn't have your output inside the loop. It should come after it.
while ([...]) {
    [...]
}
result.set(sumResult);
context.write(key, result);
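Putting those pieces together, a minimal corrected reducer would look like the sketch below. It assumes the reducer's input value type matches what the mapper emits, i.e. the class is declared Reducer<Text, IntWritable, Text, IntWritable> rather than Reducer<Text, Text, Text, IntWritable>, and it reuses the missing sentinel mentioned above:

public static class MaxMinReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        final int missing = -9999;          // sentinel value referenced above
        int sumResult = 0;
        for (IntWritable value : values) {  // one value per iteration
            int v = value.get();
            if (v != missing) {
                sumResult = sumResult + v;
            }
        }
        result.set(sumResult);              // written once per key, after the loop
        context.write(key, result);
    }
}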
Are those columns separated by tabs?
If yes, then don't expect to find a space character in there.
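For example, if the columns were tab-separated, the split in the mapper would have to use a tab instead of a comma (a small illustration, not the question's actual format):

String[] tokens = value.toString().split("\t");   // tab-separated input
// versus the comma-separated case used in the mapper above:
// String[] tokens = value.toString().split(",");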
I am getting a text file with contents like those below. I want to retrieve the data present between start_word=Tax% and end_word="ErrorMessage".
ParsedText:
Tax%
63 2 .90 0.00 D INTENS SH 80ML(48) 9.00% 9.00%
23 34013090 0.0 DS PURE WHIT 1 COG (24) 9.00% 9.00%
"ErrorMessage":"","ErrorDetails":""
After retrieving it, the output would be:
63 2 .90 0.00 D INTENS SH 80ML(48) 9.00% 9.00%
23 34013090 0.0 DS PURE WHIT 1 COG (24) 9.00% 9.00%
Please help. I am using Camel to read the text, and then I want to retrieve the data to process it further as per my requirement.
import org.apache.camel.Exchange;
import org.apache.camel.Processor;

public class DataExtractor implements Processor {

    @Override
    public void process(Exchange exchange) throws Exception {
        String textContent = (String) exchange.getIn().getBody();
        System.out.println("TextContents >>>>>>" + textContent);
    }
}
In the text content I am getting the content that I have given above. I need help retrieving the data in Java.
Below is the code snippet to extract the desired output:
String[] strArr = textContent.split("\\r?\\n");
StringBuilder stringBuilder = new StringBuilder();
boolean appendLines = false;
for (String strLines : strArr) {
    if (strLines.contains("Tax%")) {
        appendLines = true;
        continue;
    }
    if (strLines.contains("\"ErrorMessage\"")) {
        break;
    }
    if (appendLines) {
        stringBuilder.append(strLines);
        stringBuilder.append(System.getProperty("line.separator"));
    }
}
textContent = stringBuilder.toString();
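For completeness, a hedged sketch of how the processor might be wired into a route; the endpoint URIs here are placeholders, not taken from the question:

import org.apache.camel.builder.RouteBuilder;

public class DataExtractorRoute extends RouteBuilder {

    @Override
    public void configure() {
        // Placeholder endpoints; replace with the real source and destination.
        from("file:data/inbox?noop=true")
            .process(new DataExtractor())   // runs the extraction logic above
            .to("file:data/outbox");
    }
}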
I have to pivot the data in a file and then store it in another file. I am having some difficulty pivoting the data.
I have multiple files that contain data which looks somewhat like what I show below. The columns are of variable length. I am trying to merge the files first, but for some reason the output is not correct. I haven't even tried the pivot step, and am not sure how to approach it either.
How can this be achieved?
File 1:
0,26,27,30,120
201008,100,1000,10,400
201009,200,2000,20,500
201010,300,3000,30,600
File 2:
0,26,27,30,120,145
201008,100,1000,10,400,200
201009,200,2000,20,500,100
201010,300,3000,30,600,150
File 3:
0,26,27,120,145
201008,100,10,400,200
201009,200,20,500,100
201010,300,30,600,150
Output:
201008,26,100
201008,27,1000
201008,30,10
201008,120,400
201008,145,200
201009,26,200
201009,27,2000
201009,30,20
201009,120,500
201009,145,100
.....
I am not quite familiar with Spark, but I am trying to use flatMap and flatMapValues. I am not sure how to use them yet, and would appreciate some guidance.
import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.SparkSession;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class ExecutionTest {
public static void main(String[] args) {
Logger.getLogger("org.apache").setLevel(Level.WARN);
Logger.getLogger("org.spark_project").setLevel(Level.WARN);
Logger.getLogger("io.netty").setLevel(Level.WARN);
log.info("Starting...");
// Step 1: Create a SparkContext.
boolean isRunLocally = Boolean.valueOf(args[0]);
String filePath = args[1];
SparkConf conf = new SparkConf().setAppName("Variable File").set("serializer",
"org.apache.spark.serializer.KryoSerializer");
if (isRunLocally) {
log.info("System is running in local mode");
conf.setMaster("local[*]").set("spark.executor.memory", "2g");
}
SparkSession session = SparkSession.builder().config(conf).getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
jsc.textFile(filePath, 2)
.map(new Function<String, String[]>() {
private static final long serialVersionUID = 1L;
@Override
public String[] call(String v1) throws Exception {
return StringUtils.split(v1, ",");
}
})
.foreach(new VoidFunction<String[]>() {
private static final long serialVersionUID = 1L;
@Override
public void call(String[] t) throws Exception {
for (String string : t) {
log.info(string);
}
}
});
}
}
A solution in Scala, as I am not a Java person; you should be able to adapt it, and add sorting, caching, etc.
The data is as follows: 3 files, with a duplicate entry evident; get rid of it if you do not want it.
0, 5, 10, 15, 20
202008, 5,10, 15, 20
202009,10,20,100,200
8 rows generated above.
0,888,999
202008, 5, 10
202009, 10, 20
4 rows generated above.
0, 5
202009,10
1 row, which is a duplicate.
// Bit lazy with column names, but anyway.
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val inputPath: String = "/FileStore/tables/g*.txt"
val rdd = spark.read.text(inputPath)
.select(input_file_name, $"value")
.as[(String, String)]
.rdd
val rdd2 = rdd.zipWithIndex
val rdd3 = rdd2.map(x => (x._1._1, x._2, x._1._2.split(",").toList.map(_.toInt)))
val rdd4 = rdd3.map { case (pfx, pfx2, list) => (pfx,pfx2,list.zipWithIndex) }
val df = rdd4.toDF()
df.show(false)
df.printSchema()
val df2 = df.withColumn("rankF", row_number().over(Window.partitionBy($"_1").orderBy($"_2".asc)))
df2.show(false)
df2.printSchema()
val df3 = df2.withColumn("elements", explode($"_3"))
df3.show(false)
df3.printSchema()
val df4 = df3.select($"_1", $"rankF", $"elements".getField("_1"), $"elements".getField("_2")).toDF("fn", "line_num", "val", "col_pos")
df4.show(false)
df4.printSchema()
df4.createOrReplaceTempView("df4temp")
val df51 = spark.sql("""SELECT hdr.fn, hdr.line_num, hdr.val AS pfx, hdr.col_pos
FROM df4temp hdr
WHERE hdr.line_num <> 1
AND hdr.col_pos = 0
""")
df51.show(100,false)
val df52 = spark.sql("""SELECT t1.fn, t1.val AS val1, t1.col_pos, t2.line_num, t2.val AS val2
FROM df4temp t1, df4temp t2
WHERE t1.col_pos <> 0
AND t1.col_pos = t2.col_pos
AND t1.line_num <> t2.line_num
AND t1.line_num = 1
AND t1.fn = t2.fn
""")
df52.show(100,false)
df51.createOrReplaceTempView("df51temp")
df52.createOrReplaceTempView("df52temp")
val df53 = spark.sql("""SELECT DISTINCT t1.pfx, t2.val1, t2.val2
FROM df51temp t1, df52temp t2
WHERE t1.fn = t2.fn
AND t1.line_num = t2.line_num
""")
df53.show(false)
returns:
+------+----+----+
|pfx |val1|val2|
+------+----+----+
|202008|888 |5 |
|202009|999 |20 |
|202009|20 |200 |
|202008|5 |5 |
|202008|10 |10 |
|202009|888 |10 |
|202008|15 |15 |
|202009|5 |10 |
|202009|10 |20 |
|202009|15 |100 |
|202008|20 |20 |
|202008|999 |10 |
+------+----+----+
What we see is data wrangling: massaging the data into temp views and then JOINing appropriately with SQL.
The key here is knowing how to massage the data to make things easy. Note there is no groupBy etc. With varying row lengths per file, JOINing was not attempted at the RDD level; it is too inflexible. The rank shows the line number, so you know which is the first line, the one with the 0 prefix.
This is what we call data wrangling. It is also what we call hard work for a few points on SO; this is one of my best efforts, and also one of the last of such efforts.
A weakness of the solution is that it takes a lot of work to get the first record of a file; there are alternatives (https://www.cyberciti.biz/faq/unix-linux-display-first-line-of-file/), and preprocessing is what I would realistically consider.
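If staying in Spark, one alternative that sidesteps the first-record problem is to read each file whole and pivot it with flatMap, which also matches the Java flatMap approach the question was reaching for. A hedged sketch (Spark 2.x Java API; the class and method names are placeholders, and it assumes each file fits comfortably in executor memory, that the header is the first line of each file, and that rows are well-formed CSV):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import scala.Tuple2;

public class PivotSketch {

    // Emits "date,headerColumn,value" lines for every data cell in every input file.
    public static JavaRDD<String> pivot(JavaSparkContext jsc, String inputPath) {
        // One (fileName, fileContent) pair per file, so header and data rows stay together.
        JavaPairRDD<String, String> files = jsc.wholeTextFiles(inputPath);
        return files.flatMap(new FlatMapFunction<Tuple2<String, String>, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Iterator<String> call(Tuple2<String, String> fileAndContent) {
                String[] lines = fileAndContent._2().split("\\r?\\n");
                String[] header = lines[0].split(",");      // e.g. 0,26,27,30,120
                List<String> out = new ArrayList<>();
                for (int i = 1; i < lines.length; i++) {
                    String[] row = lines[i].split(",");      // e.g. 201008,100,1000,10,400
                    for (int c = 1; c < row.length && c < header.length; c++) {
                        out.add(row[0].trim() + "," + header[c].trim() + "," + row[c].trim());
                    }
                }
                return out.iterator();
            }
        });
    }
}

Calling .distinct() on the returned RDD removes the duplicate rows mentioned above.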
I'm studying a bit of Hadoop and have read a lot about how to do a natural join. I have two files with keys and info; I want to join them and present the final result as (a, b, c).
My problem is that the reducer seems to be called separately for each file. I was expecting to receive something like (10, [R1, S10, S22]), where 10 is the key; 1, 10, 22 are values from different rows that have 10 as the key; and R and S are tags so I can identify which table they come from.
The thing is that my reducer receives (10, [S10, S22]), and only after finishing with the whole S file do I get another key-value pair like (10, [R1]). That means it groups by key separately for each file and then calls the reducer.
I'm not sure if that is the correct behavior, whether I have to configure it differently, or whether I'm doing everything wrong.
I'm also new to Java, so the code might look bad to you.
I'm avoiding the TextPair data type because I can't come up with that myself yet, and I would think this is another valid way (just in case you are wondering). Thanks.
I'm running Hadoop 2.4.1, based on the WordCount example.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;
public class TwoWayJoin {
public static class FirstMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
Text a = new Text();
Text b = new Text();
a.set(tokenizer.nextToken());
b.set(tokenizer.nextToken());
Text relation = new Text("R" + a.toString());
output.collect(b, relation);
}
}
public static class SecondMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
Text b = new Text();
Text c = new Text();
b.set(tokenizer.nextToken());
c.set(tokenizer.nextToken());
Text relation = new Text("S"+c.toString());
output.collect(b, relation);
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
ArrayList < Text > RelationS = new ArrayList < Text >() ;
ArrayList < Text > RelationR = new ArrayList < Text >() ;
while (values.hasNext()) {
String relationValue = values.next().toString();
if (relationValue.indexOf('R') >= 0){
RelationR.add(new Text(relationValue));
} else {
RelationS.add(new Text(relationValue));
}
}
for( Text r : RelationR ) {
for (Text s : RelationS) {
output.collect(key, new Text(r + "," + key.toString() + "," + s));
}
}
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(MultipleInputs.class);
conf.setJobName("TwoWayJoin");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, FirstMap.class);
MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, SecondMap.class);
Path output = new Path(args[2]);
FileOutputFormat.setOutputPath(conf, output);
FileSystem.get(conf).delete(output, true);
JobClient.runJob(conf);
}
}
R.txt
(a b(key))
2 46
1 10
0 24
31 50
11 2
5 31
12 36
9 46
10 34
6 31
S.txt
(b(key) c)
45 32
45 45
46 10
36 15
45 21
45 28
45 9
45 49
45 18
46 21
45 45
2 11
46 15
45 33
45 6
45 20
31 28
45 32
45 26
46 35
45 36
50 49
45 13
46 3
46 8
31 45
46 18
46 21
45 26
24 15
46 31
46 47
10 24
46 12
46 36
The output for this job completes successfully but is empty, because in each reduce call either the R array or the S array is empty.
All the rows are mapped correctly if I simply collect them one by one without any processing.
The expected output is
key "a,b,c"
The problem is with the combiner. Remember that the combiner applies the reduce function to the map output. So, indirectly, the reduce function is applied to your R and S relations separately, and that is why you get the R and S relations in different reduce calls.
Comment out
conf.setCombinerClass(Reduce.class);
and try running it again; there should not be any problem. As a side note, a combiner is only helpful when applying the aggregation to the map output gives the same result as applying it to the full input after the sort and shuffle.
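To illustrate that side note: a combiner is safe when the reduce logic is a pure aggregation such as a sum, because reducing partial sums again gives the same final answer. A hedged sketch in the same old-API style (the SumCombiner class is hypothetical, not part of the question's job):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Safe as a combiner: summing partial sums equals summing the raw values.
public class SumCombiner extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

The join's Reduce is not like this: it pairs R values with S values, so running it early on partial map output changes the result, which is why it must not be used as a combiner here.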
I'm trying to do a secondary sort in MapReduce with a composite key that consists of:
String natural-key = program name
Long key-for-sorting = time in millis since 1970
The problem is that after sorting I get many reduce calls, one per entire composite key, instead of one per natural key.
By debugging I have verified that the hashCode and the compare functions are correct.
The debug logging below, where each block comes from a different reduce call, shows that either the grouping or the partitioning didn't succeed.
From the debug logs:
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=the voice
14/12/14 00:55:12 INFO popularitweet.EtanReducer: the voice: Thu Dec 11 17:51:03 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: the voice: Thu Dec 11 17:51:03 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key the voice ended
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=top gear
14/12/14 00:55:12 INFO popularitweet.EtanReducer: top gear: Thu Dec 11 17:51:04 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key top gear ended
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=american horror story
14/12/14 00:55:12 INFO popularitweet.EtanReducer: american horror story: Thu Dec 11 17:51:04 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key american horror story ended
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=the voice
14/12/14 00:55:12 INFO popularitweet.EtanReducer: the voice: Thu Dec 11 17:51:04 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key the voice ended
As you can see, "the voice" is sent to two different reduce calls, even though only the timestamp differs.
Any help would be appreciated.
The composite key is the following class:
public class ProgramKey implements WritableComparable<ProgramKey> {
private String program;
private Long timestamp;
public ProgramKey() {
}
public ProgramKey(String program, Long timestamp) {
this.program = program;
this.timestamp = timestamp;
}
@Override
public int compareTo(ProgramKey o) {
int result = program.compareTo(o.program);
if (result == 0) {
result = timestamp.compareTo(o.timestamp);
}
return result;
}
@Override
public void write(DataOutput dataOutput) throws IOException {
WritableUtils.writeString(dataOutput, program);
dataOutput.writeLong(timestamp);
}
@Override
public void readFields(DataInput dataInput) throws IOException {
program = WritableUtils.readString(dataInput);
timestamp = dataInput.readLong();
}
My implemented Partitioner, GroupingComparator, and SortingComparator are these:
public class ProgramKeyPartitioner extends Partitioner<ProgramKey, TweetObject> {
@Override
public int getPartition(ProgramKey programKey, TweetObject tweetObject, int numPartitions) {
int hash = programKey.getProgram().hashCode();
int partition = hash % numPartitions;
return partition;
}
}
public class ProgramKeyGroupingComparator extends WritableComparator {
protected ProgramKeyGroupingComparator() {
super(ProgramKey.class, true);
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
ProgramKey k1 = (ProgramKey) a;
ProgramKey k2 = (ProgramKey) b;
return k1.getProgram().compareTo(k2.getProgram());
}
}
public class TimeStampComparator extends WritableComparator {
protected TimeStampComparator() {
super(ProgramKey.class, true);
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
ProgramKey ts1 = (ProgramKey)a;
ProgramKey ts2 = (ProgramKey)a;
int result = ts1.getProgram().compareTo(ts2.getProgram());
if (result == 0) {
result = ts1.getTimestamp().compareTo(ts2.getTimestamp());
}
return result;
}
}
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
// Create configuration
Configuration conf = new Configuration();
// Create job
Job job = new Job(conf, "test1");
job.setJarByClass(EtanMapReduce.class);
// Set partitioner keyComparator and groupComparator
job.setPartitionerClass(ProgramKeyPartitioner.class);
job.setGroupingComparatorClass(ProgramKeyGroupingComparator.class);
job.setSortComparatorClass(TimeStampComparator.class);
// Setup MapReduce
job.setMapperClass(EtanMapper.class);
job.setMapOutputKeyClass(ProgramKey.class);
job.setMapOutputValueClass(TweetObject.class);
job.setReducerClass(EtanReducer.class);
// Specify key / value
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(TweetObject.class);
// Input
FileInputFormat.addInputPath(job, inputPath);
job.setInputFormatClass(TextInputFormat.class);
// Output
FileOutputFormat.setOutputPath(job, outputDir);
job.setOutputFormatClass(TextOutputFormat.class);
// Delete output if exists
FileSystem hdfs = FileSystem.get(conf);
if (hdfs.exists(outputDir))
hdfs.delete(outputDir, true);
// Execute job
logger.info("starting job");
int code = job.waitForCompletion(true) ? 0 : 1;
System.exit(code);
}
Edit: your TimeStampComparator seems to have a typo; you're setting ts2 to a when it should be set to b:
ProgramKey ts1 = (ProgramKey)a;
ProgramKey ts2 = (ProgramKey)a;
when it should be:
ProgramKey ts1 = (ProgramKey)a;
ProgramKey ts2 = (ProgramKey)b;
This would result in incorrectly sorted key/value pairs and invalidates the assumption made by the grouping comparator that the key/value pairs are sorted.
Check also that the original program names are in UTF-8 as that's what WritableUtils assumes. Is your system's default code page also UTF-8?
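For clarity, the full sort comparator with that fix applied reads:

public class TimeStampComparator extends WritableComparator {
    protected TimeStampComparator() {
        super(ProgramKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        ProgramKey ts1 = (ProgramKey) a;
        ProgramKey ts2 = (ProgramKey) b;   // was (ProgramKey) a
        int result = ts1.getProgram().compareTo(ts2.getProgram());
        if (result == 0) {
            result = ts1.getTimestamp().compareTo(ts2.getTimestamp());
        }
        return result;
    }
}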