Unable to get required output for simple Hadoop MapReduce program - Java

I am trying to write a MapReduce program which has to take input from two files: one has the details of occupations and states, and the other has the details of occupations and job growth percentage. I use two mappers, combine their output, and in my reducer try to find which jobs have a growth percentage of more than 30. My output should ideally be the occupation followed by the list of states. However, I am only getting the occupation names and not the states. I have posted the code and the sample input files below. Please point out what I am doing wrong. Thanks.
(Please note that the samples of the input files I have provided are just small portions of the actual files.)
package com;
import java.io.IOException;
//import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class GrowthState extends Configured implements Tool {
//Parser for Mapper1
public static class StateParser{
private String State,Occupation;
public void parse(String record){
String str[] = record.split("\t");
if(str[4].length() != 0)
setOccupation(str[4]);
else
setOccupation("Default Occupation");
if(str[2].length() != 0)
setState(str[2]);
else
setState("Default State");
}
public void parse(Text record){
parse(record.toString());
}
public String getState() {
return State;
}
public void setState(String state) {
State = state;
}
public String getOccupation() {
return Occupation;
}
public void setOccupation(String occupation) {
Occupation = occupation;
}
}
//Mapper1 - Processing state.txt
public static class GrowthMap1 extends Mapper<LongWritable,Text,Text,Text>{
StateParser sp = new StateParser();
Text outkey = new Text();
Text outvalue = new Text();
public void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException{
sp.parse(value);
outkey.set(sp.getOccupation());
outvalue.set("m1\t"+sp.getState());
context.write(outkey,outvalue);
//String str[] = value.toString().split("\t");
//context.write(new Text(str[2]), new Text("m1\t"+str[4]));
}
}
public static class ProjParser{
private String Occupation,percent;
public void parse(String record){
String str[] = record.split("\t");
if(str[0].length() != 0)
setOccupation(str[0]);
else
setOccupation("Default Occupation");
if(str[5].length() != 0)
setPercent(str[5]);
else
setPercent("0");
}
public void parse(Text record){
parse(record.toString());
}
public String getOccupation() {
return Occupation;
}
public void setOccupation(String occupation) {
Occupation = occupation;
}
public String getPercent() {
return percent;
}
public void setPercent(String percent) {
this.percent = percent;
}
}
//Mapper2 - processing projection.txt
public static class GrowthMap2 extends Mapper<LongWritable,Text,Text,Text> {
ProjParser pp = new ProjParser();
Text outkey = new Text();
Text outvalue = new Text();
public void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException{
pp.parse(value);
outkey.set(pp.getOccupation());
outvalue.set("m2\t"+pp.getPercent());
context.write(outkey, outvalue);
//String str[] = value.toString().split("\t");
//context.write(new Text(str[0]), new Text("m2\t"+str[5]));
}
}
//Reducer
public static class GrowthReduce extends Reducer<Text,Text,Text,Text>{
Text outvalue = new Text();
public void reduce(Text key,Iterable<Text> value,Context context)throws IOException, InterruptedException{
float cent = 0;
String state = "";
for(Text values : value){
String[] str = values.toString().split("\t");
if(str[0].equals("m1")){
state = state + " " + str[1];
}else if(str[0].equals("m2")){
try{
cent = Float.parseFloat(str[1]);
}catch(Exception nf){
cent = 0;
}
}
}
if(cent>=30){
outvalue.set(state);
context.write(key,outvalue );
}
}
}
//Driver
@Override
public int run(String[] args) throws Exception {
Job job = new Job(getConf(), "States of Growth");
job.setJarByClass(GrowthState.class);
job.setReducerClass(GrowthReduce.class);
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, GrowthMap1.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, GrowthMap2.class);
FileOutputFormat.setOutputPath(job,new Path(args[2]));
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
return job.waitForCompletion(true)?0:1;
}
public static void main(String args[]) throws Exception{
int exitcode = ToolRunner.run(new GrowthState(), args);
System.exit(exitcode);
}
}
Sample input file 1:
01 AL Alabama 00-0000 All Occupations total "1,857,530" 0.4 1000.000 1.00 19.66 "40,890" 0.5 8.30 9.72 14.83 23.95 36.04 "17,260" "20,220" "30,850" "49,810" "74,950"
01 AL Alabama 11-0000 Management Occupations major "67,500" 1.1 36.338 0.73 51.48 "107,080" 0.6 24.54 33.09 44.98 62.09 88.43 "51,050" "68,830" "93,550" "129,150" "183,940"
01 AL Alabama 11-1011 Chief Executives detailed "1,080" 4.8 0.580 0.32 97.67 "203,150" 2.5 52.05 67.58 # # # "108,270" "140,570" # # #
01 AL Alabama 11-1021 General and Operations Managers detailed "26,480" 1.5 14.258 0.94 58.00 "120,640" 0.9 27.65 35.76 49.00 71.44 # "57,510" "74,390" "101,930" "148,590" #
01 AL Alabama 11-1031 Legislators detailed "1,470" 8.7 0.790 1.94 * "21,920" 3.5 * * * * * "16,120" "17,000" "18,450" "20,670" "32,820" TRUE
01 AL Alabama 11-2011 Advertising and Promotions Managers detailed 80 16.3 0.042 0.19 44.88 "93,350" 9.5 21.59 30.28 38.92 52.22 74.07 "44,900" "62,980" "80,960" "108,620" "154,060"
01 AL Alabama 11-2021 Marketing Managers detailed 610 11.5 0.329 0.24 61.28 "127,460" 7.4 31.96 37.63 53.39 73.17 # "66,480" "78,280" "111,040" "152,200" #
01 AL Alabama 11-2022 Sales Managers detailed "2,330" 5.4 1.253 0.47 54.63 "113,620" 2.2 27.28 35.42 48.92 67.62 89.42 "56,740" "73,660" "101,750" "140,640" "186,000"
05 AR Arkansas 43-4161 "Human Resources Assistants, Except Payroll and Timekeeping" detailed "1,470" 6.6 1.265 1.26 17.25 "35,870" 1.5 11.09 13.54 17.11 20.74 23.30 "23,060" "28,170" "35,590" "43,150" "48,450"
05 AR Arkansas 43-4171 Receptionists and Information Clerks detailed "7,080" 3.3 6.109 0.84 11.26 "23,420" 0.8 8.14 9.19 10.87 13.09 14.94 "16,940" "19,110" "22,600" "27,230" "31,070"
05 AR Arkansas 43-4181 Reservation and Transportation Ticket Agents and Travel Clerks detailed 590 23.6 0.510 0.50 12.61 "26,220" 6.1 8.99 9.81 10.88 14.82 20.59 "18,710" "20,400" "22,630" "30,830" "42,830"
05 AR Arkansas 43-4199 "Information and Record Clerks, All Other" detailed 920 4.7 0.795 0.61 18.45 "38,370" 1.8 13.59 15.33 18.49 21.35 23.86 "28,270" "31,880" "38,470" "44,410" "49,630"
05 AR Arkansas 43-5011 Cargo and Freight Agents detailed 480 16.5 0.418 0.73 * * * * * * * * * * * * *
05 AR Arkansas 43-5021 Couriers and Messengers detailed 510 12.4 0.444 0.84 11.92 "24,790" 2.1 8.73 9.91 11.26 13.49 16.03 "18,160" "20,620" "23,420" "28,060" "33,350"
Sample input file 2:
Management occupations 11-0000 "8,861.5" "9,498.0" 636.6 7.2 22.2 "2,586.7" "$93,910" — — —
Top executives 11-1000 "2,361.5" "2,626.8" 265.2 11.2 3.3 717.4 "$99,550" — — —
Chief executives 11-1011 330.5 347.9 17.4 5.3 17.7 87.8 "$168,140" Bachelor's degree 5 years or more None
General and operations managers 11-1021 "1,972.7" "2,216.8" 244.1 12.4 1.0 613.1 "$95,440" Bachelor's degree Less than 5 years None
Legislators 11-1031 58.4 62.1 3.7 6.4 — 16.5 "$19,780" Bachelor's degree Less than 5 years None
"Advertising, marketing, promotions, public relations, and sales managers" 11-2000 637.4 700.5 63.1 9.9 3.4 203.3 "$107,950" — — —
Advertising and promotions managers 11-2011 35.5 38.0 2.4 6.9 17.8 13.4 "$88,590" Bachelor's degree Less than 5 years None
Marketing and sales managers 11-2020 539.8 592.5 52.7 9.8 2.6 168.6 "$110,340" — — —
Marketing managers 11-2021 180.5 203.4 22.9 12.7 2.6 61.7 "$119,480" Bachelor's degree 5 years or more None
Sales managers 11-2022 359.3 389.0 29.8 8.3 2.7 106.9 "$105,260" Bachelor's degree Less than 5 years None
Public relations and fundraising managers 11-2031 62.1 70.1 8.0 12.9 1.6 21.3 "$95,450" Bachelor's degree 5 years or more None
Operations specialties managers 11-3000 "1,647.5" "1,799.7" 152.1 9.2 3.3 459.1 "$100,720" — — —
Administrative services managers 11-3011 280.8 315.0 34.2 12.2 0.1 79.9 "$81,080" Bachelor's degree Less than 5 years None
Computer and information systems managers 11-3021 332.7 383.6 50.9 15.3 3.1 97.1 "$120,950" Bachelor's degree 5 years or more None
Financial managers 11-3031 532.1 579.2 47.1 8.9 5.1 146.9 "$109,740" Bachelor's degree 5 years or more None
Industrial production managers 11-3051 172.7 168.6 -4.1 -2.4 6.1 31.4 "$89,190" Bachelor's degree 5 years or more None
Purchasing managers 11-3061 71.9 73.4 1.5 2.1 0.3 17.3 "$100,170" Bachelor's degree 5 years or more None
"Transportation, storage, and distribution managers" 11-3071 105.2 110.3 5.1 4.9 4.8 29.1 "$81,830" High school diploma or equivalent 5 years or more None
Compensation and benefits managers 11-3111 20.7 21.4 0.6 3.1 — 6.1 "$95,250" Bachelor's degree 5 years or more None
Human resources managers 11-3121 102.7 116.3 13.6 13.2 1.0 40.6 "$99,720" Bachelor's degree 5 years or more None
Training and development managers 11-3131 28.6 31.8 3.2 11.2 — 10.7 "$95,400" Bachelor's degree 5 years or more None
Other management occupations 11-9000 "4,215.0" "4,371.0" 156.1 3.7 43.1 "1,207.0" "$81,940" — — —

There is a problem with your reducer.
The faulty code is shown below. The loop gets called for all the values of a particular key (for example, for "Advertising and promotions managers" it gets called twice: once with the value "Alabama" and again with the value "6.9"). The problem is that you have put the if (cent >= 30) statement outside the for loop. It should be inside, so the check is made against the matching key.
for(Text values : value){
String[] str = values.toString().split("\t");
if(str[0].equals("m1")){
state = state + " " + str[1];
}else if(str[0].equals("m2")){
try{
cent = Float.parseFloat(str[1]);
}catch(Exception nf){
cent = 0;
}
}
}
if(cent>=30){
outvalue.set(state);
context.write(key,outvalue );
}
The following piece of code works fine.
//Reducer
public static class GrowthReduce extends Reducer<Text,Text,Text,Text>{
Text outvalue = new Text();
HashMap<String, String> stateMap = new HashMap<String, String>();
public void reduce(Text key,Iterable<Text> value,Context context)throws IOException, InterruptedException{
float cent = 0;
for(Text values : value){
String[] str = values.toString().split("\t");
if(str[0].equals("m1")){
stateMap.put(key.toString().toLowerCase(), str[1]);
}
else if(str[0].equals("m2")){
try{
cent = Float.parseFloat(str[1]);
if(stateMap.containsKey(key.toString().toLowerCase()))
{
if(cent>30) {
outvalue.set(stateMap.get(key.toString().toLowerCase()));
context.write(key, outvalue);
}
stateMap.remove(key.toString().toLowerCase());
}
}catch(Exception nf){
cent = 0;
}
}
}
}
}
The logic is:
As and when you encounter a state (an "m1" value), you put it in the state map.
When you later encounter the percent for the same key (an "m2" value), you check whether the state is already in the map. If it is, you output the key/value pair.

Related

How can I print the content of rows in a Dataset using Java and Spark SQL?

I would like to write a simple Spark SQL program that reads a file called u.data, which contains the movie ratings, creates a Dataset of Rows, and then prints the first rows of the Dataset.
My approach is to read the file into a JavaRDD and map the RDD into a ratingsObject (the object has two parameters, movieID and rating). I just want to print the first rows of this Dataset.
I'm using the Java language and Spark SQL.
public static void main(String[] args){
App obj = new App();
SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example").getOrCreate();
Map<Integer,String> movieNames = obj.loadMovieNames();
JavaRDD<String> lines = spark.read().textFile("hdfs:///ml-100k/u.data").javaRDD();
JavaRDD<MovieRatings> movies = lines.map(line -> {
String[] parts = line.split(" ");
MovieRatings ratingsObject = new MovieRatings();
ratingsObject.setMovieID(Integer.parseInt(parts[1].trim()));
ratingsObject.setRating(Integer.parseInt(parts[2].trim()));
return ratingsObject;
});
Dataset<Row> movieDataset = spark.createDataFrame(movies, MovieRatings.class);
Encoder<Integer> intEncoder = Encoders.INT();
Dataset<Integer> HUE = movieDataset.map(
new MapFunction<Row, Integer>(){
private static final long serialVersionUID = -5982149277350252630L;
@Override
public Integer call(Row row) throws Exception{
return row.getInt(0);
}
}, intEncoder
);
HUE.show();
//stop the session
spark.stop();
}
I've tried a lot of possible solutions that I found, but all of them got the same error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, localhost, executor 1): java.lang.ArrayIndexOutOfBoundsException: 1
at com.ericsson.SparkMovieRatings.App.lambda$main$1e634467$1(App.java:63)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
And here is the sample of the u.data file:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
62 257 2 879372434
286 1014 5 879781125
200 222 5 876042340
210 40 3 891035994
224 29 3 888104457
303 785 3 879485318
122 387 5 879270459
194 274 2 879539794
The first column represents the UserID, the second the MovieID, the third the rating, and the last one the timestamp.
As mentioned before, your data are not space separated.
I'll show you two possible solutions: the first one based on RDDs and the second one based on Spark SQL, which is, in general, the better solution in terms of performance.
RDD (you should use built-in types to reduce the overhead):
import java.util.ArrayList;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
public class SparkDriver {
    public static void main(String args[]) {
        // Create a configuration object and set the name of the application
        SparkConf conf = new SparkConf().setAppName("application_name");
        // Create a Spark context object
        JavaSparkContext context = new JavaSparkContext(conf);
        // Create the final RDD (suppose you have a text file)
        JavaPairRDD<Integer, Integer> movieRatingRDD =
            context.textFile("u.data.txt")
                .mapToPair(line -> {
                    String[] tokens = line.split("\\s+");
                    // MovieID is the second column, rating the third
                    int movieID = Integer.parseInt(tokens[1]);
                    int rating = Integer.parseInt(tokens[2]);
                    return new Tuple2<Integer, Integer>(movieID, rating);
                });
        // Keep in mind that the take operation returns the first n elements
        // and the order is the order of the file.
        ArrayList<Tuple2<Integer, Integer>> list = new ArrayList<>(movieRatingRDD.take(10));
        System.out.println("MovieID\tRating");
        for (Tuple2<Integer, Integer> tuple : list) {
            System.out.println(tuple._1 + "\t" + tuple._2);
        }
        context.close();
    }
}
SQL
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class SparkDriver {
    public static void main(String[] args) {
        // Create the Spark session
        SparkSession session = SparkSession.builder().appName("[Spark app sql version]").getOrCreate();
        Dataset<MovieRatings> movieRatingsDataset = session.read()
            .format("csv")
            .option("header", false)
            .option("inferSchema", true)
            .option("delimiter", "\t")
            .load("u.data.txt")
            .map((MapFunction<Row, MovieRatings>) row -> {
                // Column 1 is the MovieID, column 2 is the rating
                int movieID = row.getInt(1);
                int rating = row.getInt(2);
                return new MovieRatings(movieID, rating);
            }, Encoders.bean(MovieRatings.class));
        // Show the first rows
        movieRatingsDataset.show();
        // Stop the session
        session.stop();
    }
}

Combiner creating map output file per region in HBase scan MapReduce

Hi, I am running an application which reads records from HBase and writes them into text files.
I have used a combiner in my application and a custom partitioner as well.
I have used 41 reducers in my application because I need to create 40 reducer output files that satisfy my condition in the custom partitioner.
Everything works fine, but when I use the combiner in my application, it creates a map output file per region or per mapper.
For example, I have 40 regions in my application, so 40 mappers get initiated and they create 40 map-output files.
But the reducer is not able to combine all the map outputs and generate the final reducer output files, which should be the 40 reducer output files.
The data in the files is correct, but the number of files has increased.
Any idea how I can get only the reducer output files?
// Reducer Class
job.setCombinerClass(CommonReducer.class);
job.setReducerClass(CommonReducer.class); // reducer class
Below are my job details:
Submitted: Mon Apr 10 09:42:55 CDT 2017
Started: Mon Apr 10 09:43:03 CDT 2017
Finished: Mon Apr 10 10:11:20 CDT 2017
Elapsed: 28mins, 17sec
Diagnostics:
Average Map Time 6mins, 13sec
Average Shuffle Time 17mins, 56sec
Average Merge Time 0sec
Average Reduce Time 0sec
Here is my reducer logic
import java.io.IOException;
import org.apache.log4j.Logger;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
public class CommonCombiner extends Reducer<NullWritable, Text, NullWritable, Text> {
private Logger logger = Logger.getLogger(CommonCombiner.class);
private MultipleOutputs<NullWritable, Text> multipleOutputs;
String strName = "";
private static final String DATA_SEPERATOR = "\\|\\!\\|";
public void setup(Context context) {
logger.info("Inside Combiner.");
multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
}
@Override
public void reduce(NullWritable Key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
for (Text value : values) {
final String valueStr = value.toString();
StringBuilder sb = new StringBuilder();
if ("".equals(strName) && strName.length() == 0) {
String[] strArrFileName = valueStr.split(DATA_SEPERATOR);
String strFullFileName[] = strArrFileName[1].split("\\|\\^\\|");
strName = strFullFileName[strFullFileName.length - 1];
String strArrvalueStr[] = valueStr.split(DATA_SEPERATOR);
if (!strArrvalueStr[0].contains(HbaseBulkLoadMapperConstants.FF_ACTION)) {
sb.append(strArrvalueStr[0] + "|!|");
}
multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName);
context.getCounter(Counters.FILE_DATA_COUNTER).increment(1);
}
}
}
public void cleanup(Context context) throws IOException, InterruptedException {
multipleOutputs.close();
}
}
I have replaced multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName);
with
context.write()
and I got the correct output. A minimal sketch of that change is shown below.
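For reference, here is a minimal sketch (not the original class) of a combiner that emits through context.write() and leaves MultipleOutputs to the reducer, which is the change described above; the filtering and file-naming logic from the original code is omitted:
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Sketch only: the combiner passes records on through the normal context,
// so the framework still feeds them to the reducer; any MultipleOutputs
// or file-naming logic stays in the reducer.
public class CommonCombiner extends Reducer<NullWritable, Text, NullWritable, Text> {
    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(key, value);
        }
    }
}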

Is it possible to create a list in Java using data from multiple text files?

I have multiple text files that contain information about the popularity of different programming languages in different countries, based on Google searches. I have one text file for each year from 2004 to 2015. I also have a text file that breaks this down into each week (called iot.txt), but this file does not include the country.
Example data from 2004.txt:
Region java c++ c# python JavaScript
Argentina 13 14 10 0 17
Australia 22 20 22 64 26
Austria 23 21 19 31 21
Belgium 20 14 17 34 25
Bolivia 25 0 0 0 0
etc
example from iot.txt:
Week java c++ c# python JavaScript
2004-01-04 - 2004-01-10 88 23 12 8 34
2004-01-11 - 2004-01-17 88 25 12 8 36
2004-01-18 - 2004-01-24 91 24 12 8 36
2004-01-25 - 2004-01-31 88 26 11 7 36
2004-02-01 - 2004-02-07 93 26 12 7 37
My problem is that I am trying to write code that will output the number of countries that have exhibited 0 interest in python.
This is my current code that I use to read the text files, but I'm not sure of the best way to count the number of regions that have 0 interest in python across all the years 2004-2015. At first I thought the best way would be to create a list from all the text files (not including iot.txt) and then search it for any entries that have 0 interest in python, but I have no idea how to do that.
Can anyone suggest a way to do this?
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.*;
public class Starter{
public static void main(String[] args) throws Exception {
BufferedReader fh =
new BufferedReader(new FileReader("iot.txt"));
//First line contains the language names
String s = fh.readLine();
List<String> langs =
new ArrayList<>(Arrays.asList(s.split("\t")));
langs.remove(0); //Throw away the first word - "week"
Map<String,HashMap<String,Integer>> iot = new TreeMap<>();
while ((s=fh.readLine())!=null)
{
String [] wrds = s.split("\t");
HashMap<String,Integer> interest = new HashMap<>();
for(int i=0;i<langs.size();i++)
interest.put(langs.get(i), Integer.parseInt(wrds[i+1]));
iot.put(wrds[0], interest);
}
fh.close();
HashMap<Integer,HashMap<String,HashMap<String,Integer>>>
regionsByYear = new HashMap<>();
for (int i=2004;i<2016;i++)
{
BufferedReader fh1 =
new BufferedReader(new FileReader(i+".txt"));
String s1 = fh1.readLine(); //Throw away the first line
HashMap<String,HashMap<String,Integer>> year = new HashMap<>();
while ((s1=fh1.readLine())!=null)
{
String [] wrds = s1.split("\t");
HashMap<String,Integer>langMap = new HashMap<>();
for(int j=1;j<wrds.length;j++){
langMap.put(langs.get(j-1), Integer.parseInt(wrds[j]));
}
year.put(wrds[0],langMap);
}
regionsByYear.put(i,year);
fh1.close();
}
}
}
Create a Map<String, Integer> using a HashMap, and each time you find a new country while scanning the incoming data, add it to the map as country -> 0. Each time you find a usage of python, increment the value.
At the end, loop through the entrySet of the map and, for each entry whose getValue() is zero, output its getKey(). A short sketch of this idea is shown below.
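A minimal sketch of that idea, assuming the regionsByYear map built in the question's code (year -> country -> language -> interest) has already been filled and that the language key is "python" as in the sample header:
// Sum the python interest per country across all years, then print the
// countries whose total stays at zero.
Map<String, Integer> pythonInterestByCountry = new HashMap<>();
for (HashMap<String, HashMap<String, Integer>> year : regionsByYear.values()) {
    for (Map.Entry<String, HashMap<String, Integer>> country : year.entrySet()) {
        int python = country.getValue().getOrDefault("python", 0);
        pythonInterestByCountry.merge(country.getKey(), python, Integer::sum);
    }
}
for (Map.Entry<String, Integer> e : pythonInterestByCountry.entrySet()) {
    if (e.getValue() == 0) {
        System.out.println(e.getKey());
    }
}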

Hadoop Multiple Inputs wrongly grouped - Two Way Join Exercise

I'm trying to study a bit of Hadoop and have read a lot about how to do a natural join. I have two files with keys and info, and I want to join them and present the final result as (a, b, c).
My problem is that the reducer is being called separately for each file. I was expecting to receive something like (10, [R1, S10, S22]) (10 being the key; 1, 10, 22 being values from different rows that have 10 as their key; and R and S being tags so I can identify which table they come from).
The thing is that my reducer receives (10, [S10, S22]), and only after finishing with the whole S file do I get another key-value pair like (10, [R1]). That means it groups by key separately for each file and then calls the reducer.
I'm not sure if that is the correct behavior, if I have to configure it in a different way, or if I'm doing everything wrong.
I'm also new to Java, so the code might look bad to you.
I'm avoiding the TextPair data type because I can't come up with that myself yet, and I would think this is another valid way (just in case you are wondering). Thanks.
Running Hadoop 2.4.1, based on the WordCount example.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;
public class TwoWayJoin {
public static class FirstMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
Text a = new Text();
Text b = new Text();
a.set(tokenizer.nextToken());
b.set(tokenizer.nextToken());
Text relation = new Text("R" + a.toString()); // tag the R relation, mirroring SecondMap
output.collect(b, relation);
}
}
public static class SecondMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
Text b = new Text();
Text c = new Text();
b.set(tokenizer.nextToken());
c.set(tokenizer.nextToken());
Text relation = new Text("S"+c.toString());
output.collect(b, relation);
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
ArrayList < Text > RelationS = new ArrayList < Text >() ;
ArrayList < Text > RelationR = new ArrayList < Text >() ;
while (values.hasNext()) {
String relationValue = values.next().toString();
if (relationValue.indexOf('R') >= 0){
RelationR.add(new Text(relationValue));
} else {
RelationS.add(new Text(relationValue));
}
}
for( Text r : RelationR ) {
for (Text s : RelationS) {
output.collect(key, new Text(r + "," + key.toString() + "," + s));
}
}
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(MultipleInputs.class);
conf.setJobName("TwoWayJoin");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, FirstMap.class);
MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, SecondMap.class);
Path output = new Path(args[2]);
FileOutputFormat.setOutputPath(conf, output);
FileSystem.get(conf).delete(output, true);
JobClient.runJob(conf);
}
}
R.txt
(a b(key))
2 46
1 10
0 24
31 50
11 2
5 31
12 36
9 46
10 34
6 31
S.txt
(b(key) c)
45 32
45 45
46 10
36 15
45 21
45 28
45 9
45 49
45 18
46 21
45 45
2 11
46 15
45 33
45 6
45 20
31 28
45 32
45 26
46 35
45 36
50 49
45 13
46 3
46 8
31 45
46 18
46 21
45 26
24 15
46 31
46 47
10 24
46 12
46 36
The output for this code completes successfully but is empty, because I always have either the R array or the S array empty.
All the rows are mapped if I simply collect them one by one without processing anything.
The expected output is
key "a,b,c"
The problem is with the combiner. Remember that a combiner applies the reduce function to the map output. So, indirectly, your reduce function is applied to the R and S relations separately, and that is the reason you get the R and S relations in different reduce calls.
Comment out
conf.setCombinerClass(Reduce.class);
and try running again; there should not be any problem. On a side note, a combiner is only helpful when applying your aggregation to the map output gives the same final result as applying it only once the sort and shuffle is done. A generic example of such an aggregation is sketched below.
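For illustration only (this is not part of the join job above), a sum-style reducer is the classic case where registering the same class as a combiner is safe, because summing partial sums gives the same final counts:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
// Summing is associative and commutative, so using this class as both
// combiner and reducer does not change the final counts.
public class SumReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}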

Why do I get class, enum, interface expected?

/**
* @(#)b.java
*
*
* @author
* @version 1.00 2012/5/4
*/
import java.util.*;
import java.io.*;
import java.*;
public class b {
static void lireBddParcs(String nomFichier) throws IOException
{
LinkedHashMap parcMap = new LinkedHashMap<Parc,Collection<Manege>> ();
boolean existeFichier = true;
FileReader fr = null;
try
{
fr = new FileReader (nomFichier);
}
catch(java.io.FileNotFoundException erreur)
{
System.out.println("Probleme rencontree a l'ouverture du fichier" + nomFichier);
existeFichier = false;
}
if (existeFichier)
{
Scanner scan = new Scanner(new File(nomFichier));
while (scan.hasNextLine())
{
String[] line = scan.nextLine().split("\t");
Parc p = new Parc(line[0], line[1], line[2]);
parcMap.put(p, null);
}
}
scan.close();
}
}
/**
* @param args the command line arguments
*/
public static void main(String[] args) throws IOException
{
lireBddParcs("parcs.txt");
}
}
parc.txt contains:
Great America Chicago Illinois
Magic mountain Los Ageles Californie
Six Flags over Georgia Atlanta Georgie
Darien Lake Buffalo New York
La Ronde Montreal Quebec
The Great Escape Lake Georges New York
Six Flags New Orleans New Orleans Louisiane
Elitch Gardens Denver Colorado
Six Flags over Texas Arlington Texas
Six Flags New England Springfield Massachusetts
Six Flags America Washington D.C.
Great Adventure Jackson New Jersey
error: class, interface, or enum expected line 94
error: class, interface, or enum expected line 99
I decided to change my code because something didn't work out as expected, but now I am getting this and can't get through compilation. Any idea why it doesn't work? I am a complete noob, about to abandon my Java course.
Although the indentation is confusing, the main method is outside of the class while it should be inside it.
That also makes the line scan.close(); invalid, as scan is not defined there. Remove the } before scan.close();.
It's just because there's an extraneous closing brace in your first method here:
}
scan.close();
If you use an IDE such as Eclipse or NetBeans to edit your source files, it will help a lot with automatic brace matching and with highlighting these kinds of errors. A sketch of the corrected structure is shown below.
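A sketch of the corrected structure, assuming the Parc and Manege classes from the rest of the assignment (only the braces move; the parsing logic is unchanged):
import java.util.*;
import java.io.*;
public class b {
    static void lireBddParcs(String nomFichier) throws IOException {
        LinkedHashMap<Parc, Collection<Manege>> parcMap = new LinkedHashMap<Parc, Collection<Manege>>();
        boolean existeFichier = true;
        FileReader fr = null;
        try {
            fr = new FileReader(nomFichier);
        } catch (java.io.FileNotFoundException erreur) {
            System.out.println("Probleme rencontree a l'ouverture du fichier" + nomFichier);
            existeFichier = false;
        }
        if (existeFichier) {
            Scanner scan = new Scanner(new File(nomFichier));
            while (scan.hasNextLine()) {
                String[] line = scan.nextLine().split("\t");
                Parc p = new Parc(line[0], line[1], line[2]);
                parcMap.put(p, null);
            }
            scan.close(); // scan is still in scope here
        }
    } // lireBddParcs ends here; no extra closing brace before this
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException {
        lireBddParcs("parcs.txt");
    }
} // the class ends here, with main inside it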
