Hadoop Map Reduce Program Key and Value Passing - java
I am trying to learn Hadoop.
I downloaded the following file from a free large-data-set website and shortened it for my sample testing. This is the small file:
"CAMIS","DBA","BORO","BUILDING","STREET","ZIPCODE","PHONE","CUISINECODE","INSPDATE","ACTION","VIOLCODE","SCORE","CURRENTGRADE","GRADEDATE","RECORDDATE"
"40280083","INTERCONTINENTAL THE BARCLAY","1","111 ","EAST 48 STREET ","10017","2129063134","03","2014-02-07 00:00:00","D","10F","4","A","2014-02-07 00:00:00","2014-04-24 06:01:04.920000000"
"40356649","REGINA CATERERS","3","6409","11 AVENUE","11219","7182560829","03","2013-07-30 00:00:00","D","08A","12","A","2013-07-30 00:00:00","2014-04-24 06:01:04.920000000"
"40356649","REGINA CATERERS","3","6409","11 AVENUE","11219","7182560829","03","2013-07-30 00:00:00","D","08B","12","A","2013-07-30 00:00:00","2014-04-24 06:01:04.920000000"
"40356731","TASTE THE TROPICS ICE CREAM","3","1839 ","NOSTRAND AVENUE ","11226","7188560821","43","2013-07-10 00:00:00","D","06C","8","A","2013-07-10 00:00:00","2014-04-24 06:01:04.920000000"
"40356731","TASTE THE TROPICS ICE CREAM","3","1839 ","NOSTRAND AVENUE ","11226","7188560821","43","2013-07-10 00:00:00","D","10B","8","A","2013-07-10 00:00:00","2014-04-24 06:01:04.920000000"
"40357217","WILD ASIA","2","2300","SOUTHERN BOULEVARD","10460","7182207846","03","2013-06-19 00:00:00","D","10B","4","A","2013-06-19 00:00:00","2014-04-24 06:01:04.920000000"
"40360045","SEUDA FOODS","3","705 ","KINGS HIGHWAY ","11223","7183751500","50","2013-10-10 00:00:00","D","08C","13","A","2013-10-10 00:00:00","2014-04-24 06:01:04.920000000"
"40361521","GLORIOUS FOOD","1","522","EAST 74 STREET","10021","2127372140","03","2013-12-19 00:00:00","U","08A","16","B","2013-12-19 00:00:00","2014-04-24 06:01:04.920000000"
"40362098","HARRIET'S KITCHEN","1","502","AMSTERDAM AVENUE","10024","2127210045","18","2014-03-04 00:00:00","U","10F","13","A","2014-03-04 00:00:00","2014-04-24 06:01:04.920000000"
"40361322","CARVEL ICE CREAM","4","265-15 ","HILLSIDE AVENUE ","11004","7183430392","43","2013-09-18 00:00:00","D","08A","10","A","2013-09-18 00:00:00","2014-04-24 06:01:04.920000000"
"40361708","BULLY'S DELI","1","759 ","BROADWAY ","10003","2122549755","27","2014-01-21 00:00:00","D","10F","12","A","2014-01-21 00:00:00","2014-04-24 06:01:04.920000000"
"40362098","HARRIET'S KITCHEN","1","502","AMSTERDAM AVENUE","10024","2127210045","18","2014-03-04 00:00:00","U","04N","13","A","2014-03-04 00:00:00","2014-04-24 06:01:04.920000000"
"40362274","ANGELIKA FILM CENTER","1","18","WEST HOUSTON STREET","10012","2129952570","03","2014-04-03 00:00:00","D","06D","9","A","2014-04-03 00:00:00","2014-04-24 06:01:04.920000000"
"40362715","THE COUNTRY CAFE","1","60","WALL STREET","10005","3474279132","83","2013-09-18 00:00:00","D","10B","13","A","2013-09-18 00:00:00","2014-04-24 06:01:04.920000000"
"40362869","SHASHEMENE INT'L RESTAURA","3","195","EAST 56 STREET","11203","3474300871","17","2013-05-08 00:00:00","D","10B","7","A","2013-05-08 00:00:00","2014-04-24 06:01:04.920000000"
"40363021","DOWNTOWN DELI","1","107","CHURCH STREET","10007","2122332911","03","2014-02-26 00:00:00","D","10B","9","A","2014-02-26 00:00:00","2014-04-24 06:01:04.920000000"
"40362432","HO MEI RESTAURANT","4","103-05","37 AVENUE","11368","7187796903","20","2014-04-21 00:00:00","D","06C","10","A","2014-04-21 00:00:00","2014-04-24 06:01:04.920000000"
"40362869","SHASHEMENE INT'L RESTAURA","3","195","EAST 56 STREET","11203","3474300871","17","2013-05-08 00:00:00","D","10F","7","A","2013-05-08 00:00:00","2014-04-24 06:01:04.920000000"
"40363117","MEJLANDER & MULGANNON","3","7615","5 AVENUE","11209","7182386666","03","2013-10-24 00:00:00","D","02G","11","A","2013-10-24 00:00:00","2014-04-24 06:01:04.920000000"
"40363289","HAPPY GARDEN","2","1236 ","238 SPOFFORD AVE ","10474","7186171818","20","2013-12-30 00:00:00","D","10F","8","A","2013-12-30 00:00:00","2014-04-24 06:01:04.920000000"
"40363644","DOMINO'S PIZZA","1","464","3 AVENUE","10016","2125450200","62","2014-03-06 00:00:00","D","08A","11","A","2014-03-06 00:00:00","2014-04-24 06:01:04.920000000"
"30191841","DJ REYNOLDS PUB AND RESTAURANT","1","351 ","WEST 57 STREET ","10019","2122452912","03","2013-07-22 00:00:00","D","10B","11","A","2013-07-22 00:00:00","2014-04-24 06:01:04.920000000"
"40280083","INTERCONTINENTAL THE BARCLAY","1","111 ","EAST 48 STREET ","10017","2129063134","03","2014-02-07 00:00:00","D","10B","4","A","2014-02-07 00:00:00","2014-04-24 06:01:04.920000000"
"40356442","KOSHER ISLAND","5","2206","VICTORY BOULEVARD","10314","7186985800","50","2013-04-04 00:00:00","D","10F","12","A","2013-04-04 00:00:00","2014-04-24 06:01:04.920000000"
"40356483","WILKEN'S FINE FOOD","3","7114 ","AVENUE U ","11234","7184443838","27","2014-01-14 00:00:00","D","10B","10","A","2014-01-14 00:00:00","2014-04-24 06:01:04.920000000"
The file contains restaurant inspection records.
You can see there is a CUISINECODE column. Its values range from "00" upward and can be anything; many restaurants share the same CUISINECODE.
I just want to display the number of restaurants for each CUISINECODE.
This is my MapReduce program:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class RestaurantInspection {
public static class Map extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
String line = value.toString();
if (line.startsWith("\"CAMIS\",")) {
// Line is the header, ignore it
return;
}
List<String> columns = new ArrayList<String>();
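// Split on commas that are outside double quotes, so quoted fields containing commas stay intact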
String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
if (tokens.length != 15) {
// Line isn't the correct number of columns or formatted properly
return;
}
for(String t : tokens) {
columns.add(t.replaceAll("\"", ""));
}
int cusineCode = Integer.parseInt(columns.get(7));
String violations = columns.get(9) + " --- " + columns.get(10);
value.set(violations);
output.collect(value, new IntWritable(cusineCode));
}
}
public static class Reduce extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(RestaurantInspection.class);
conf.setJobName("Restaurent Inspection");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
I am using Hadoop 1.2.1. I copied the above code from the WordCount example and changed just a few lines.
When I run the above code in Hadoop, I get the following lines for the same file given above:
D --- 02G 3
D --- 06C 63
D --- 06D 3
D --- 08A 108
D --- 08B 3
D --- 08C 50
D --- 10B 182
D --- 10F 117
U --- 04N 18
U --- 08A 3
U --- 10F 18
That was just a test; I cannot work out the logic for writing the code that produces the desired output. For the file above, I am expecting the following output:
01 -- 1
03 -- 9
43 -- 3
50 -- 2
18 -- 2
27 -- 2
83 -- 1
17 -- 2
20 -- 2
62 -- 1
I think working through this will help me learn Hadoop and MapReduce.
So how should I write the code? Thanks.
You need the key to be the CUISINECODE and the value to be a count of 1:
String cusineCode = columns.get(7);
output.collect(new Text(cusineCode), new IntWritable(1));
This will do the job for you.
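For reference, here is a minimal sketch of how the map method could look with that change. It reuses the parsing from the question (the header skip and the quote-aware split) and simply emits (cuisineCode, 1); the existing Reduce class then sums the 1s. Note that this counts inspection rows per cuisine code; counting distinct restaurants would additionally require de-duplicating by CAMIS.
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    String line = value.toString();
    if (line.startsWith("\"CAMIS\",")) {
        return; // skip the header line
    }
    // Split on commas that are outside double quotes
    String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
    if (tokens.length != 15) {
        return; // skip malformed lines
    }
    // Column 7 (0-based) is CUISINECODE; strip the surrounding quotes
    String cuisineCode = tokens[7].replaceAll("\"", "");
    // Emit the cuisine code as the key and 1 as the value
    output.collect(new Text(cuisineCode), new IntWritable(1));
}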
Related
PySpark SparkConf() equivalent of spark command option "--jars"
I'd like to run some PySpark script on JupyterLab, and create custom UDF from JAR packages. To do so I need to broadcast these JAR packages to executor nodes. This answer has showed the command line interface approach (invoking --jars option in spark-submit). But I'd like to know the SparkConf() approach. On my JupyterLab sc.version=3.3.0-SNAPSHOT. I'm very new to Spark.. your help will be highly appreciated! Code: import findspark findspark.init() findspark.find() import pyspark from pyspark import SparkContext, SparkConf from pyspark.sql import SparkSession import os # ------------ create spark session ------------ app_name = 'PySpark_Example' path = os.getcwd() conf = SparkConf().setAppName(os.environ.get('JUPYTERHUB_USER').replace(" ", "") + "_" + app_name).setMaster( 'spark://spark-master-svc.spark:7077') command = os.popen("hostname -i") hostname = command.read().split("\n")[0] command.close() conf.set("spark.scheduler.mode","FAIR") conf.set("spark.deployMode","client") conf.set("spark.driver.host",hostname) conf.set('spark.extraListeners','sparkmonitor.listener.JupyterSparkMonitorListener') conf.set("spark.jars", "{path}/my_func.jar,{path}/javabuilder.jar".format(path=path)) conf.set("spark.executor.extraClassPath", "{path}/".format(path=path)) sc = pyspark.SparkContext(conf=conf) spark = SparkSession(sc) spark._jsc.addJar("{}/my_func.jar".format(path)) spark._jsc.addJar("{}/javabuilder.jar".format(path)) # ------------- create sample dataframe --------- sdf = spark.createDataFrame( [ (1, 2.), (2, 3.), (3, 5.), ], ["col1", "col2"] ) sdf.createOrReplaceTempView("temp_table") # -------------- create UDF ---------------------- create_udf_from_jar = "CREATE OR REPLACE FUNCTION my_func AS 'my_func.Class1' " + \ "USING JAR '{}/my_func.jar'".format(path) spark.sql(create_udf_from_jar) spark.sql("SHOW USER FUNCTIONS").show() # -------------- test ---------------------------- spark.sql("SELECT my_func(col1) FROM temp_table").show() Error: --------------------------------------------------------------------------- Py4JJavaError Traceback (most recent call last) ~tmp/ipykernel_4398/644670379.py in <cell line: 1>() ----> 1 spark.sql("SELECT my_func(col1) FROM temp_table").show() ~opt/spark/python/pyspark/sql/session.py in sql(self, sqlQuery, **kwargs) 1033 sqlQuery = formatter.format(sqlQuery, **kwargs) 1034 try: -> 1035 return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) 1036 finally: 1037 if len(kwargs) > 0: ~opt/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py in __call__(self, *args) 1319 1320 answer = self.gateway_client.send_command(command) -> 1321 return_value = get_return_value( 1322 answer, self.gateway_client, self.target_id, self.name) 1323 ~opt/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 188 def deco(*a: Any, **kw: Any) -> Any: 189 try: --> 190 return f(*a, **kw) 191 except Py4JJavaError as e: 192 converted = convert_exception(e.java_exception) ~opt/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client) 325 if answer[1] == REFERENCE_TYPE: --> 326 raise Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". 328 format(target_id, ".", name), value) Py4JJavaError: An error occurred while calling o3813.sql. : java.lang.NoClassDefFoundError: com/mathworks/toolbox/javabuilder/internal/MWComponentInstance
spark.jars is the configuration property you're looking for (Doc).
Combiner creating mapoutput file per region in HBase scan mapreduce
Hi i am running an application which reads records from HBase and writes into text files . I have used combiner in my application and custom partitioner also. I have used 41 reducer in my application because i need to create 40 reducer output file that satisfies my condition in custom partitioner. All working fine but when i use combiner in my application it creates map output file per regions or per mapper . Foe example i have 40 regions in my application so 40 mapper getting initiated then it create 40 map-output files . But reducer is not able to combine all map-output and generate final reducer output file that will be 40 reducer output files. Data in the files are correct but no of files has increased . Any idea how can i get only reducer output files. // Reducer Class job.setCombinerClass(CommonReducer.class); job.setReducerClass(CommonReducer.class); // reducer class below is my Job details Submitted: Mon Apr 10 09:42:55 CDT 2017 Started: Mon Apr 10 09:43:03 CDT 2017 Finished: Mon Apr 10 10:11:20 CDT 2017 Elapsed: 28mins, 17sec Diagnostics: Average Map Time 6mins, 13sec Average Shuffle Time 17mins, 56sec Average Merge Time 0sec Average Reduce Time 0sec Here is my reducer logic import java.io.IOException; import org.apache.log4j.Logger; import org.apache.hadoop.io.NullWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs; public class CommonCombiner extends Reducer<NullWritable, Text, NullWritable, Text> { private Logger logger = Logger.getLogger(CommonCombiner.class); private MultipleOutputs<NullWritable, Text> multipleOutputs; String strName = ""; private static final String DATA_SEPERATOR = "\\|\\!\\|"; public void setup(Context context) { logger.info("Inside Combiner."); multipleOutputs = new MultipleOutputs<NullWritable, Text>(context); } #Override public void reduce(NullWritable Key, Iterable<Text> values, Context context) throws IOException, InterruptedException { for (Text value : values) { final String valueStr = value.toString(); StringBuilder sb = new StringBuilder(); if ("".equals(strName) && strName.length() == 0) { String[] strArrFileName = valueStr.split(DATA_SEPERATOR); String strFullFileName[] = strArrFileName[1].split("\\|\\^\\|"); strName = strFullFileName[strFullFileName.length - 1]; String strArrvalueStr[] = valueStr.split(DATA_SEPERATOR); if (!strArrvalueStr[0].contains(HbaseBulkLoadMapperConstants.FF_ACTION)) { sb.append(strArrvalueStr[0] + "|!|"); } multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName); context.getCounter(Counters.FILE_DATA_COUNTER).increment(1); } } } public void cleanup(Context context) throws IOException, InterruptedException { multipleOutputs.close(); } }
I replaced multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName); with context.write() and got the correct output.
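For clarity, a minimal sketch of that replacement inside the reduce method (assuming the same NullWritable key and Text value output types as the original combiner; the record goes to the regular job output instead of a MultipleOutputs named file):
// before: multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName);
context.write(NullWritable.get(), new Text(sb.toString()));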
How to set PivotTable Field Number Format Cell with Apache POI
I'd like to set the number format of the pivot table value field Sum of Balance to # ##0. The pivot table is created with code based on the official POI CreatePivotTable sample. The code below creates the table and gets the CTPivotField pivotField, but how do I set its number format?
pivotTable.addColumnLabel(DataConsolidateFunction.SUM, 2);
CTPivotField pivotField = pivotTable
    .getCTPivotTableDefinition()
    .getPivotFields()
    .getPivotFieldArray(2);
In MS Excel this is done with the following steps (see screenshot): right-click the Sum of Balance pivot table value, select Field Settings, click Number..., and set Format Cells.
Please help with a decision, advice, or any idea.
Format of pivot table fields is setting by CTDataField.setNumFmtId(long numFmtId) for values and CTPivotField.setNumFmtId(long numFmtId) for columns & rows. numFmtId is id number of format code. Available format codes are represented in Format cells list - Custom category: Predefined format codes, thanks to Ji Zhou - MSFT, is here: 1 0 2 0.00 3 #,##0 4 #,##0.00 5 $#,##0_);($#,##0) 6 $#,##0_);[Red]($#,##0) 7 $#,##0.00_);($#,##0.00) 8 $#,##0.00_);[Red]($#,##0.00) 9 0% 10 0.00% 11 0.00E+00 12 # ?/? 13 # ??/?? 14 m/d/yyyy 15 d-mmm-yy 16 d-mmm 17 mmm-yy 18 h:mm AM/PM 19 h:mm:ss AM/PM 20 h:mm 21 h:mm:ss 22 m/d/yyyy h:mm 37 #,##0_);(#,##0) 38 #,##0_);[Red](#,##0) 39 #,##0.00_);(#,##0.00) 40 #,##0.00_);[Red](#,##0.00) 45 mm:ss 46 [h]:mm:ss 47 mm:ss.0 48 ##0.0E+0 49 # Full list of predefined format codes in MSDN NumberingFormat Class Here is an example of applying format pivot table fields: package ru.inkontext.poi; import org.apache.poi.openxml4j.exceptions.InvalidFormatException; import org.apache.poi.ss.SpreadsheetVersion; import org.apache.poi.ss.usermodel.DataConsolidateFunction; import org.apache.poi.ss.usermodel.Row; import org.apache.poi.ss.util.AreaReference; import org.apache.poi.ss.util.CellReference; import org.apache.poi.xssf.usermodel.XSSFPivotTable; import org.apache.poi.xssf.usermodel.XSSFSheet; import org.apache.poi.xssf.usermodel.XSSFWorkbook; import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTDataFields; import java.io.FileOutputStream; import java.io.IOException; import java.util.List; import java.util.Optional; public class CreatePivotTableSimple { private static void setFormatPivotField(XSSFPivotTable pivotTable, long fieldIndex, Integer numFmtId) { Optional.ofNullable(pivotTable .getCTPivotTableDefinition() .getPivotFields()) .map(pivotFields -> pivotFields .getPivotFieldArray((int) fieldIndex)) .ifPresent(pivotField -> pivotField .setNumFmtId(numFmtId)); } private static void setFormatDataField(XSSFPivotTable pivotTable, long fieldIndex, long numFmtId) { Optional.ofNullable(pivotTable .getCTPivotTableDefinition() .getDataFields()) .map(CTDataFields::getDataFieldList) .map(List::stream) .ifPresent(stream -> stream .filter(dataField -> dataField.getFld() == fieldIndex) .findFirst() .ifPresent(dataField -> dataField.setNumFmtId(numFmtId))); } public static void main(String[] args) throws IOException, InvalidFormatException { XSSFWorkbook wb = new XSSFWorkbook(); XSSFSheet sheet = wb.createSheet(); //Create some data to build the pivot table on setCellData(sheet); XSSFPivotTable pivotTable = sheet.createPivotTable( new AreaReference("A1:C6", SpreadsheetVersion.EXCEL2007), new CellReference("E3")); pivotTable.addRowLabel(1); // set second column as 1-th level of rows setFormatPivotField(pivotTable, 1, 9); //set format of row field numFmtId=9 0% pivotTable.addRowLabel(0); // set first column as 2-th level of rows pivotTable.addColumnLabel(DataConsolidateFunction.SUM, 2); // Sum up the second column setFormatDataField(pivotTable, 2, 3); //set format of value field numFmtId=3 # ##0 FileOutputStream fileOut = new FileOutputStream("stackoverflow-pivottable.xlsx"); wb.write(fileOut); fileOut.close(); wb.close(); } private static void setCellData(XSSFSheet sheet) { String[] names = {"Jane", "Tarzan", "Terk", "Kate", "Dmitry"}; Double[] percents = {0.25, 0.5, 0.75, 0.25, 0.5}; Integer[] balances = {107634, 554234, 10234, 22350, 15234}; Row row = sheet.createRow(0); row.createCell(0).setCellValue("Name"); row.createCell(1).setCellValue("Percents"); 
row.createCell(2).setCellValue("Balance"); for (int i = 0; i < names.length; i++) { row = sheet.createRow(i + 1); row.createCell(0).setCellValue(names[i]); row.createCell(1).setCellValue(percents[i]); row.createCell(2).setCellValue(balances[i]); } } } https://github.com/stolbovd/PoiSamples
Unable to get required output for simple Hadoop mapreduce program
I am trying to write this mapreduce program which has to take input from two files, one has the details of occupations and states , and the other has details of occupation and job growth percentage. I use two mappers and combine them and in my reducer try to see which jobs have growth percent more than 30. My output should ideally be the occupation followed by the list of states. I am however, only getting the occupation names and not the states. I have posted the code and the sample input files below. PLease point out what i am doing wrong. Thanks. (Please note that the samples of the input files i have provided are just small portions of the actual files). package com; import java.io.IOException; //import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.*; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.MultipleInputs; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; public class GrowthState extends Configured implements Tool { //Parser for Mapper1 public static class StateParser{ private String State,Occupation; public void parse(String record){ String str[] = record.split("\t"); if(str[4].length() != 0) setOccupation(str[4]); else setOccupation("Default Occupation"); if(str[2].length() != 0) setState(str[2]); else setState("Default State"); } public void parse(Text record){ parse(record.toString()); } public String getState() { return State; } public void setState(String state) { State = state; } public String getOccupation() { return Occupation; } public void setOccupation(String occupation) { Occupation = occupation; } } //Mapper1 - Processing state.txt public static class GrowthMap1 extends Mapper<LongWritable,Text,Text,Text>{ StateParser sp = new StateParser(); Text outkey = new Text(); Text outvalue = new Text(); public void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException{ sp.parse(value); outkey.set(sp.getOccupation()); outvalue.set("m1\t"+sp.getState()); context.write(outkey,outvalue); //String str[] = value.toString().split("\t"); //context.write(new Text(str[2]), new Text("m1\t"+str[4])); } } public static class ProjParser{ private String Occupation,percent; public void parse(String record){ String str[] = record.split("\t"); if(str[0].length() != 0) setOccupation(str[0]); else setOccupation("Default Occupation"); if(str[5].length() != 0) setPercent(str[5]); else setPercent("0"); } public void parse(Text record){ parse(record.toString()); } public String getOccupation() { return Occupation; } public void setOccupation(String occupation) { Occupation = occupation; } public String getPercent() { return percent; } public void setPercent(String percent) { this.percent = percent; } } //Mapper2 - processing projection.txt public static class GrowthMap2 extends Mapper<LongWritable,Text,Text,Text> { ProjParser pp = new ProjParser(); Text outkey = new Text(); Text outvalue = new Text(); public void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException{ pp.parse(value); outkey.set(pp.getOccupation()); outvalue.set("m2\t"+pp.getPercent()); context.write(outkey, outvalue); //String str[] = value.toString().split("\t"); //context.write(new 
Text(str[0]), new Text("m2\t"+str[5])); } } //Reducer public static class GrowthReduce extends Reducer<Text,Text,Text,Text>{ Text outvalue = new Text(); public void reduce(Text key,Iterable<Text> value,Context context)throws IOException, InterruptedException{ float cent = 0; String state = ""; for(Text values : value){ String[] str = values.toString().split("\t"); if(str[0].equals("m1")){ state = state + " " + str[1]; }else if(str[0].equals("m2")){ try{ cent = Float.parseFloat(str[1]); }catch(Exception nf){ cent = 0; } } } if(cent>=30){ outvalue.set(state); context.write(key,outvalue ); } } } //Driver #Override public int run(String[] args) throws Exception { Job job = new Job(getConf(), "States of Growth"); job.setJarByClass(GrowthState.class); job.setReducerClass(GrowthReduce.class); MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, GrowthMap1.class); MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, GrowthMap2.class); FileOutputFormat.setOutputPath(job,new Path(args[2])); job.setOutputKeyClass(Text.class); job.setOutputValueClass(Text.class); return job.waitForCompletion(true)?0:1; } public static void main(String args[]) throws Exception{ int exitcode = ToolRunner.run(new GrowthState(), args); System.exit(exitcode); } } Sample input file1: 01 AL Alabama 00-0000 All Occupations total "1,857,530" 0.4 1000.000 1.00 19.66 "40,890" 0.5 8.30 9.72 14.83 23.95 36.04 "17,260" "20,220" "30,850" "49,810" "74,950" 01 AL Alabama 11-0000 Management Occupations major "67,500" 1.1 36.338 0.73 51.48 "107,080" 0.6 24.54 33.09 44.98 62.09 88.43 "51,050" "68,830" "93,550" "129,150" "183,940" 01 AL Alabama 11-1011 Chief Executives detailed "1,080" 4.8 0.580 0.32 97.67 "203,150" 2.5 52.05 67.58 # # # "108,270" "140,570" # # # 01 AL Alabama 11-1021 General and Operations Managers detailed "26,480" 1.5 14.258 0.94 58.00 "120,640" 0.9 27.65 35.76 49.00 71.44 # "57,510" "74,390" "101,930" "148,590" # 01 AL Alabama 11-1031 Legislators detailed "1,470" 8.7 0.790 1.94 * "21,920" 3.5 * * * * * "16,120" "17,000" "18,450" "20,670" "32,820" TRUE 01 AL Alabama 11-2011 Advertising and Promotions Managers detailed 80 16.3 0.042 0.19 44.88 "93,350" 9.5 21.59 30.28 38.92 52.22 74.07 "44,900" "62,980" "80,960" "108,620" "154,060" 01 AL Alabama 11-2021 Marketing Managers detailed 610 11.5 0.329 0.24 61.28 "127,460" 7.4 31.96 37.63 53.39 73.17 # "66,480" "78,280" "111,040" "152,200" # 01 AL Alabama 11-2022 Sales Managers detailed "2,330" 5.4 1.253 0.47 54.63 "113,620" 2.2 27.28 35.42 48.92 67.62 89.42 "56,740" "73,660" "101,750" "140,640" "186,000" 05 AR Arkansas 43-4161 "Human Resources Assistants, Except Payroll and Timekeeping" detailed "1,470" 6.6 1.265 1.26 17.25 "35,870" 1.5 11.09 13.54 17.11 20.74 23.30 "23,060" "28,170" "35,590" "43,150" "48,450" 05 AR Arkansas 43-4171 Receptionists and Information Clerks detailed "7,080" 3.3 6.109 0.84 11.26 "23,420" 0.8 8.14 9.19 10.87 13.09 14.94 "16,940" "19,110" "22,600" "27,230" "31,070" 05 AR Arkansas 43-4181 Reservation and Transportation Ticket Agents and Travel Clerks detailed 590 23.6 0.510 0.50 12.61 "26,220" 6.1 8.99 9.81 10.88 14.82 20.59 "18,710" "20,400" "22,630" "30,830" "42,830" 05 AR Arkansas 43-4199 "Information and Record Clerks, All Other" detailed 920 4.7 0.795 0.61 18.45 "38,370" 1.8 13.59 15.33 18.49 21.35 23.86 "28,270" "31,880" "38,470" "44,410" "49,630" 05 AR Arkansas 43-5011 Cargo and Freight Agents detailed 480 16.5 0.418 0.73 * * * * * * * * * * * * * 05 AR Arkansas 43-5021 Couriers and Messengers 
detailed 510 12.4 0.444 0.84 11.92 "24,790" 2.1 8.73 9.91 11.26 13.49 16.03 "18,160" "20,620" "23,420" "28,060" "33,350" sample input file 2: Management occupations 11-0000 "8,861.5" "9,498.0" 636.6 7.2 22.2 "2,586.7" "$93,910" — — — Top executives 11-1000 "2,361.5" "2,626.8" 265.2 11.2 3.3 717.4 "$99,550" — — — Chief executives 11-1011 330.5 347.9 17.4 5.3 17.7 87.8 "$168,140" Bachelor's degree 5 years or more None General and operations managers 11-1021 "1,972.7" "2,216.8" 244.1 12.4 1.0 613.1 "$95,440" Bachelor's degree Less than 5 years None Legislators 11-1031 58.4 62.1 3.7 6.4 — 16.5 "$19,780" Bachelor's degree Less than 5 years None "Advertising, marketing, promotions, public relations, and sales managers" 11-2000 637.4 700.5 63.1 9.9 3.4 203.3 "$107,950" — — — Advertising and promotions managers 11-2011 35.5 38.0 2.4 6.9 17.8 13.4 "$88,590" Bachelor's degree Less than 5 years None Marketing and sales managers 11-2020 539.8 592.5 52.7 9.8 2.6 168.6 "$110,340" — — — Marketing managers 11-2021 180.5 203.4 22.9 12.7 2.6 61.7 "$119,480" Bachelor's degree 5 years or more None Sales managers 11-2022 359.3 389.0 29.8 8.3 2.7 106.9 "$105,260" Bachelor's degree Less than 5 years None Public relations and fundraising managers 11-2031 62.1 70.1 8.0 12.9 1.6 21.3 "$95,450" Bachelor's degree 5 years or more None Operations specialties managers 11-3000 "1,647.5" "1,799.7" 152.1 9.2 3.3 459.1 "$100,720" — — — Administrative services managers 11-3011 280.8 315.0 34.2 12.2 0.1 79.9 "$81,080" Bachelor's degree Less than 5 years None Computer and information systems managers 11-3021 332.7 383.6 50.9 15.3 3.1 97.1 "$120,950" Bachelor's degree 5 years or more None Financial managers 11-3031 532.1 579.2 47.1 8.9 5.1 146.9 "$109,740" Bachelor's degree 5 years or more None Industrial production managers 11-3051 172.7 168.6 -4.1 -2.4 6.1 31.4 "$89,190" Bachelor's degree 5 years or more None Purchasing managers 11-3061 71.9 73.4 1.5 2.1 0.3 17.3 "$100,170" Bachelor's degree 5 years or more None "Transportation, storage, and distribution managers" 11-3071 105.2 110.3 5.1 4.9 4.8 29.1 "$81,830" High school diploma or equivalent 5 years or more None Compensation and benefits managers 11-3111 20.7 21.4 0.6 3.1 — 6.1 "$95,250" Bachelor's degree 5 years or more None Human resources managers 11-3121 102.7 116.3 13.6 13.2 1.0 40.6 "$99,720" Bachelor's degree 5 years or more None Training and development managers 11-3131 28.6 31.8 3.2 11.2 — 10.7 "$95,400" Bachelor's degree 5 years or more None Other management occupations 11-9000 "4,215.0" "4,371.0" 156.1 3.7 43.1 "1,207.0" "$81,940" — — —
There is a problem with your reducer. The faulty code is shown below. The loop below gets called for all the values of a particular key (for e.g. for "Advertising and promotions managers", it gets called twice. Once with value "Alabama" and again with value "6.9"). Problem is, you have put the if(cent >= 30) statement, outside the for loop. It should be inside, for matching the key. for(Text values : value){ String[] str = values.toString().split("\t"); if(str[0].equals("m1")){ state = state + " " + str[1]; }else if(str[0].equals("m2")){ try{ cent = Float.parseFloat(str[1]); }catch(Exception nf){ cent = 0; } } } if(cent>=30){ outvalue.set(state); context.write(key,outvalue ); } Following piece of code works fine. //Reducer public static class GrowthReduce extends Reducer<Text,Text,Text,Text>{ Text outvalue = new Text(); HashMap<String, String> stateMap = new HashMap<String, String>(); public void reduce(Text key,Iterable<Text> value,Context context)throws IOException, InterruptedException{ float cent = 0; for(Text values : value){ String[] str = values.toString().split("\t"); if(str[0].equals("m1")){ stateMap.put(key.toString().toLowerCase(), str[1]); } else if(str[0].equals("m2")){ try{ cent = Float.parseFloat(str[1]); if(stateMap.containsKey(key.toString().toLowerCase())) { if(cent>30) { outvalue.set(stateMap.get(key.toString().toLowerCase())); context.write(key, outvalue); } stateMap.remove(key.toString()); } }catch(Exception nf){ cent = 0; } } } } } The logic is: As and when you encounter a state (value "m1"), you put it in state map. Next time, when you encounter percent with same key (value "m2"), you check if the state is already in the map. If yes, then you output the key/value.
Hadoop Multiple Inputs wrongly grouped - Two Way Join Exercise
I'm trying to study a bit of hadoop and read a lot about how to do the natural join. I have two files with keys and info, I want to cross and present the final result as (a, b, c). My problem is that the mappers are calling reducers for each file. I was expecting to receive something like (10, [R1,S10, S22]) (being 10 the key, 1, 10, 22, are values of different rows that have 10 as key and R and S are tagging so I can identify from which table they come from). The thing is that my reducer receives (10, [S10, S22]) and only after finishing with all the S file I get another key value pair like (10, [R1]). That means, it groups by key separately for each file and calls the reducer I'm not sure if that the correct behavior, if I have to configure it in a different way or if I'm doing everything wrong. I'm also new to java, so code might look bad to you. I'm avoiding using the TextPair data type because I can't come up with that myself yet and I would think that this would be another valid way (just in case you are wondering). Thanks Running hadoop 2.4.1 based on the WordCount example. import java.io.IOException; import java.util.ArrayList; import java.util.Iterator; import java.util.StringTokenizer; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapred.FileOutputFormat; import org.apache.hadoop.mapred.JobClient; import org.apache.hadoop.mapred.JobConf; import org.apache.hadoop.mapred.MapReduceBase; import org.apache.hadoop.mapred.Mapper; import org.apache.hadoop.mapred.OutputCollector; import org.apache.hadoop.mapred.Reducer; import org.apache.hadoop.mapred.Reporter; import org.apache.hadoop.mapred.TextInputFormat; import org.apache.hadoop.mapred.TextOutputFormat; import org.apache.hadoop.mapred.lib.MultipleInputs; public class TwoWayJoin { public static class FirstMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); Text a = new Text(); Text b = new Text(); a.set(tokenizer.nextToken()); b.set(tokenizer.nextToken()); output.collect(b, relation); } } public static class SecondMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); Text b = new Text(); Text c = new Text(); b.set(tokenizer.nextToken()); c.set(tokenizer.nextToken()); Text relation = new Text("S"+c.toString()); output.collect(b, relation); } } public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> { public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { ArrayList < Text > RelationS = new ArrayList < Text >() ; ArrayList < Text > RelationR = new ArrayList < Text >() ; while (values.hasNext()) { String relationValue = values.next().toString(); if (relationValue.indexOf('R') >= 0){ RelationR.add(new Text(relationValue)); } else { RelationS.add(new Text(relationValue)); } } for( Text r : RelationR ) { for (Text s : RelationS) { output.collect(key, new Text(r + "," + key.toString() + "," + s)); } } } } public static void main(String[] args) throws 
Exception { JobConf conf = new JobConf(MultipleInputs.class); conf.setJobName("TwoWayJoin"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(Text.class); conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, FirstMap.class); MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, SecondMap.class); Path output = new Path(args[2]); FileOutputFormat.setOutputPath(conf, output); FileSystem.get(conf).delete(output, true); JobClient.runJob(conf); } } R.txt (a b(key)) 2 46 1 10 0 24 31 50 11 2 5 31 12 36 9 46 10 34 6 31 S.txt (b(key) c) 45 32 45 45 46 10 36 15 45 21 45 28 45 9 45 49 45 18 46 21 45 45 2 11 46 15 45 33 45 6 45 20 31 28 45 32 45 26 46 35 45 36 50 49 45 13 46 3 46 8 31 45 46 18 46 21 45 26 24 15 46 31 46 47 10 24 46 12 46 36 Output for this code is successful but empty because I either have the Array R empty or the Array S empty. I have all the rows mapped if I simply collect them one by one without processing anything. Expected output is key "a,b,c"
The problem is with the combiner. Remember that a combiner applies the reduce function to the map output. So, effectively, the reduce function is applied to your R and S relations separately, and that is why you get the R and S tuples in different reduce calls. Comment out conf.setCombinerClass(Reduce.class); and try running again; there should not be any problem. As a side note, a combiner is only helpful when the aggregated result of the map output would be the same as if the aggregation were applied after the sort and shuffle.
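As a sketch, the driver change is just removing the combiner registration while keeping the reducer:
// conf.setCombinerClass(Reduce.class); // removed: this join's reduce logic is not combiner-safe
conf.setReducerClass(Reduce.class);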