Hadoop Multiple Inputs wrongly grouped - Two Way Join Exercise - java
I'm studying Hadoop and have read a lot about how to implement a natural join. I have two files with keys and info, and I want to join them and present the final result as (a, b, c).
My problem is that the reducer seems to be called separately for each input file. I was expecting to receive something like (10, [R1, S10, S22]), where 10 is the key; 1, 10 and 22 are values from different rows that share key 10; and the R/S prefixes tag which table each value comes from.
Instead, my reducer receives (10, [S10, S22]), and only after the whole S file has been processed do I get another key-value pair like (10, [R1]). In other words, it groups by key separately for each file and calls the reducer once per file.
I'm not sure if that is the correct behavior, if I have to configure the job differently, or if I'm doing something wrong.
I'm also new to java, so code might look bad to you.
I'm avoiding using the TextPair data type because I can't come up with that myself yet and I would think that this would be another valid way (just in case you are wondering). Thanks
Running hadoop 2.4.1 based on the WordCount example.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;
public class TwoWayJoin {
    public static class FirstMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            Text a = new Text();
            Text b = new Text();
            a.set(tokenizer.nextToken());
            b.set(tokenizer.nextToken());
            // Tag the value with "R" so the reducer can tell which relation it came from
            Text relation = new Text("R" + a.toString());
            output.collect(b, relation);
        }
    }
    public static class SecondMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            Text b = new Text();
            Text c = new Text();
            b.set(tokenizer.nextToken());
            c.set(tokenizer.nextToken());
            Text relation = new Text("S" + c.toString());
            output.collect(b, relation);
        }
    }
    public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            ArrayList<Text> RelationS = new ArrayList<Text>();
            ArrayList<Text> RelationR = new ArrayList<Text>();
            while (values.hasNext()) {
                String relationValue = values.next().toString();
                if (relationValue.indexOf('R') >= 0) {
                    RelationR.add(new Text(relationValue));
                } else {
                    RelationS.add(new Text(relationValue));
                }
            }
            for (Text r : RelationR) {
                for (Text s : RelationS) {
                    output.collect(key, new Text(r + "," + key.toString() + "," + s));
                }
            }
        }
    }
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MultipleInputs.class);
        conf.setJobName("TwoWayJoin");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, FirstMap.class);
        MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, SecondMap.class);
        Path output = new Path(args[2]);
        FileOutputFormat.setOutputPath(conf, output);
        FileSystem.get(conf).delete(output, true);
        JobClient.runJob(conf);
    }
}
R.txt
(a b(key))
2 46
1 10
0 24
31 50
11 2
5 31
12 36
9 46
10 34
6 31
S.txt
(b(key) c)
45 32
45 45
46 10
36 15
45 21
45 28
45 9
45 49
45 18
46 21
45 45
2 11
46 15
45 33
45 6
45 20
31 28
45 32
45 26
46 35
45 36
50 49
45 13
46 3
46 8
31 45
46 18
46 21
45 26
24 15
46 31
46 47
10 24
46 12
46 36
The job completes successfully, but the output is empty, because in every reduce call either the R array or the S array is empty.
If I simply collect each value one by one without any processing, all the rows do reach the reducer.
Expected output is
key "a,b,c"
The problem is with the combiner. Remember that the combiner applies the reduce function to the output of each map task. Since each map task here processes only one of the two files, the reduce function effectively gets applied to your R and S relations separately, and that is the reason you get the R and S values in different reduce calls.
Comment out
conf.setCombinerClass(Reduce.class);
and try running again; there should not be any problem. As a side note, a combiner is only useful when applying your reduce function to partial map output gives the same result as applying it once to the input after the sort and shuffle is done, which is not the case for this join.
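Concretely, the only change needed in the driver from the question is to drop (or comment out) that one line; a minimal sketch of the relevant part of main():

    // The join only works when all R and S values for a key meet in the same
    // reduce call, so the reduce function must not be applied to partial map output.
    JobConf conf = new JobConf(MultipleInputs.class);
    conf.setJobName("TwoWayJoin");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    // conf.setCombinerClass(Reduce.class);  // removed: not a valid combiner for this job
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    // ... the MultipleInputs / FileOutputFormat setup stays exactly as before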
Related
Combiner creating mapoutput file per region in HBase scan mapreduce
I am running an application which reads records from HBase and writes them into text files. I use a combiner and also a custom partitioner. I use 41 reducers because I need to create 40 reducer output files that satisfy my condition in the custom partitioner. Everything works fine, but when I use the combiner it creates a map output file per region / per mapper. For example, I have 40 regions, so 40 mappers get initiated and 40 map-output files are created, but the reducers are not able to combine all the map output and generate the final reducer output files (which should be 40 files). The data in the files is correct, but the number of files has increased. Any idea how I can get only the reducer output files?

// Reducer Class
job.setCombinerClass(CommonReducer.class);
job.setReducerClass(CommonReducer.class); // reducer class

Below are my job details:

Submitted: Mon Apr 10 09:42:55 CDT 2017
Started: Mon Apr 10 09:43:03 CDT 2017
Finished: Mon Apr 10 10:11:20 CDT 2017
Elapsed: 28mins, 17sec
Diagnostics:
Average Map Time 6mins, 13sec
Average Shuffle Time 17mins, 56sec
Average Merge Time 0sec
Average Reduce Time 0sec

Here is my reducer logic:

import java.io.IOException;
import org.apache.log4j.Logger;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CommonCombiner extends Reducer<NullWritable, Text, NullWritable, Text> {

    private Logger logger = Logger.getLogger(CommonCombiner.class);
    private MultipleOutputs<NullWritable, Text> multipleOutputs;
    String strName = "";
    private static final String DATA_SEPERATOR = "\\|\\!\\|";

    public void setup(Context context) {
        logger.info("Inside Combiner.");
        multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    public void reduce(NullWritable Key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            final String valueStr = value.toString();
            StringBuilder sb = new StringBuilder();
            if ("".equals(strName) && strName.length() == 0) {
                String[] strArrFileName = valueStr.split(DATA_SEPERATOR);
                String strFullFileName[] = strArrFileName[1].split("\\|\\^\\|");
                strName = strFullFileName[strFullFileName.length - 1];
                String strArrvalueStr[] = valueStr.split(DATA_SEPERATOR);
                if (!strArrvalueStr[0].contains(HbaseBulkLoadMapperConstants.FF_ACTION)) {
                    sb.append(strArrvalueStr[0] + "|!|");
                }
                multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName);
                context.getCounter(Counters.FILE_DATA_COUNTER).increment(1);
            }
        }
    }

    public void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
}
I replaced multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName); with context.write() and I got the correct output.
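A minimal sketch of that change inside the reduce method of the combiner from the question (everything else unchanged) would be:

    // Old: wrote side files directly from the combiner, bypassing the reducers
    // multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName);
    // New: write to the normal context so the record goes through the shuffle
    // and ends up in the reducer output files
    context.write(NullWritable.get(), new Text(sb.toString()));

Presumably the MultipleOutputs setup/cleanup can then be removed from the combiner as well, since only the reducer should write the final files.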
How to set PivotTable Field Number Format Cell with Apache POI
I'd like to set the number format of the pivot table value field Sum of Balance to # ##0. The pivot table is created with code based on the official POI sample CreatePivotTable. The code below creates the pivot table and gets the CTPivotField pivotField, but how do I set its number format?

pivotTable.addColumnLabel(DataConsolidateFunction.SUM, 2);
CTPivotField pivotField = pivotTable
        .getCTPivotTableDefinition()
        .getPivotFields()
        .getPivotFieldArray(2);

In MS Excel this is done with the following steps (see screenshot): right-click on the Sum of Balance pivot table value, select Field Settings, click Number..., and set Format Cells. Please help with a solution, advice or any idea.
Format of pivot table fields is setting by CTDataField.setNumFmtId(long numFmtId) for values and CTPivotField.setNumFmtId(long numFmtId) for columns & rows. numFmtId is id number of format code. Available format codes are represented in Format cells list - Custom category: Predefined format codes, thanks to Ji Zhou - MSFT, is here: 1 0 2 0.00 3 #,##0 4 #,##0.00 5 $#,##0_);($#,##0) 6 $#,##0_);[Red]($#,##0) 7 $#,##0.00_);($#,##0.00) 8 $#,##0.00_);[Red]($#,##0.00) 9 0% 10 0.00% 11 0.00E+00 12 # ?/? 13 # ??/?? 14 m/d/yyyy 15 d-mmm-yy 16 d-mmm 17 mmm-yy 18 h:mm AM/PM 19 h:mm:ss AM/PM 20 h:mm 21 h:mm:ss 22 m/d/yyyy h:mm 37 #,##0_);(#,##0) 38 #,##0_);[Red](#,##0) 39 #,##0.00_);(#,##0.00) 40 #,##0.00_);[Red](#,##0.00) 45 mm:ss 46 [h]:mm:ss 47 mm:ss.0 48 ##0.0E+0 49 # Full list of predefined format codes in MSDN NumberingFormat Class Here is an example of applying format pivot table fields: package ru.inkontext.poi; import org.apache.poi.openxml4j.exceptions.InvalidFormatException; import org.apache.poi.ss.SpreadsheetVersion; import org.apache.poi.ss.usermodel.DataConsolidateFunction; import org.apache.poi.ss.usermodel.Row; import org.apache.poi.ss.util.AreaReference; import org.apache.poi.ss.util.CellReference; import org.apache.poi.xssf.usermodel.XSSFPivotTable; import org.apache.poi.xssf.usermodel.XSSFSheet; import org.apache.poi.xssf.usermodel.XSSFWorkbook; import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTDataFields; import java.io.FileOutputStream; import java.io.IOException; import java.util.List; import java.util.Optional; public class CreatePivotTableSimple { private static void setFormatPivotField(XSSFPivotTable pivotTable, long fieldIndex, Integer numFmtId) { Optional.ofNullable(pivotTable .getCTPivotTableDefinition() .getPivotFields()) .map(pivotFields -> pivotFields .getPivotFieldArray((int) fieldIndex)) .ifPresent(pivotField -> pivotField .setNumFmtId(numFmtId)); } private static void setFormatDataField(XSSFPivotTable pivotTable, long fieldIndex, long numFmtId) { Optional.ofNullable(pivotTable .getCTPivotTableDefinition() .getDataFields()) .map(CTDataFields::getDataFieldList) .map(List::stream) .ifPresent(stream -> stream .filter(dataField -> dataField.getFld() == fieldIndex) .findFirst() .ifPresent(dataField -> dataField.setNumFmtId(numFmtId))); } public static void main(String[] args) throws IOException, InvalidFormatException { XSSFWorkbook wb = new XSSFWorkbook(); XSSFSheet sheet = wb.createSheet(); //Create some data to build the pivot table on setCellData(sheet); XSSFPivotTable pivotTable = sheet.createPivotTable( new AreaReference("A1:C6", SpreadsheetVersion.EXCEL2007), new CellReference("E3")); pivotTable.addRowLabel(1); // set second column as 1-th level of rows setFormatPivotField(pivotTable, 1, 9); //set format of row field numFmtId=9 0% pivotTable.addRowLabel(0); // set first column as 2-th level of rows pivotTable.addColumnLabel(DataConsolidateFunction.SUM, 2); // Sum up the second column setFormatDataField(pivotTable, 2, 3); //set format of value field numFmtId=3 # ##0 FileOutputStream fileOut = new FileOutputStream("stackoverflow-pivottable.xlsx"); wb.write(fileOut); fileOut.close(); wb.close(); } private static void setCellData(XSSFSheet sheet) { String[] names = {"Jane", "Tarzan", "Terk", "Kate", "Dmitry"}; Double[] percents = {0.25, 0.5, 0.75, 0.25, 0.5}; Integer[] balances = {107634, 554234, 10234, 22350, 15234}; Row row = sheet.createRow(0); row.createCell(0).setCellValue("Name"); row.createCell(1).setCellValue("Percents"); 
row.createCell(2).setCellValue("Balance"); for (int i = 0; i < names.length; i++) { row = sheet.createRow(i + 1); row.createCell(0).setCellValue(names[i]); row.createCell(1).setCellValue(percents[i]); row.createCell(2).setCellValue(balances[i]); } } } https://github.com/stolbovd/PoiSamples
Unable to get required output for simple Hadoop mapreduce program
I am trying to write this mapreduce program which has to take input from two files, one has the details of occupations and states , and the other has details of occupation and job growth percentage. I use two mappers and combine them and in my reducer try to see which jobs have growth percent more than 30. My output should ideally be the occupation followed by the list of states. I am however, only getting the occupation names and not the states. I have posted the code and the sample input files below. PLease point out what i am doing wrong. Thanks. (Please note that the samples of the input files i have provided are just small portions of the actual files). package com; import java.io.IOException; //import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.*; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.MultipleInputs; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; public class GrowthState extends Configured implements Tool { //Parser for Mapper1 public static class StateParser{ private String State,Occupation; public void parse(String record){ String str[] = record.split("\t"); if(str[4].length() != 0) setOccupation(str[4]); else setOccupation("Default Occupation"); if(str[2].length() != 0) setState(str[2]); else setState("Default State"); } public void parse(Text record){ parse(record.toString()); } public String getState() { return State; } public void setState(String state) { State = state; } public String getOccupation() { return Occupation; } public void setOccupation(String occupation) { Occupation = occupation; } } //Mapper1 - Processing state.txt public static class GrowthMap1 extends Mapper<LongWritable,Text,Text,Text>{ StateParser sp = new StateParser(); Text outkey = new Text(); Text outvalue = new Text(); public void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException{ sp.parse(value); outkey.set(sp.getOccupation()); outvalue.set("m1\t"+sp.getState()); context.write(outkey,outvalue); //String str[] = value.toString().split("\t"); //context.write(new Text(str[2]), new Text("m1\t"+str[4])); } } public static class ProjParser{ private String Occupation,percent; public void parse(String record){ String str[] = record.split("\t"); if(str[0].length() != 0) setOccupation(str[0]); else setOccupation("Default Occupation"); if(str[5].length() != 0) setPercent(str[5]); else setPercent("0"); } public void parse(Text record){ parse(record.toString()); } public String getOccupation() { return Occupation; } public void setOccupation(String occupation) { Occupation = occupation; } public String getPercent() { return percent; } public void setPercent(String percent) { this.percent = percent; } } //Mapper2 - processing projection.txt public static class GrowthMap2 extends Mapper<LongWritable,Text,Text,Text> { ProjParser pp = new ProjParser(); Text outkey = new Text(); Text outvalue = new Text(); public void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException{ pp.parse(value); outkey.set(pp.getOccupation()); outvalue.set("m2\t"+pp.getPercent()); context.write(outkey, outvalue); //String str[] = value.toString().split("\t"); //context.write(new 
Text(str[0]), new Text("m2\t"+str[5])); } } //Reducer public static class GrowthReduce extends Reducer<Text,Text,Text,Text>{ Text outvalue = new Text(); public void reduce(Text key,Iterable<Text> value,Context context)throws IOException, InterruptedException{ float cent = 0; String state = ""; for(Text values : value){ String[] str = values.toString().split("\t"); if(str[0].equals("m1")){ state = state + " " + str[1]; }else if(str[0].equals("m2")){ try{ cent = Float.parseFloat(str[1]); }catch(Exception nf){ cent = 0; } } } if(cent>=30){ outvalue.set(state); context.write(key,outvalue ); } } } //Driver #Override public int run(String[] args) throws Exception { Job job = new Job(getConf(), "States of Growth"); job.setJarByClass(GrowthState.class); job.setReducerClass(GrowthReduce.class); MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, GrowthMap1.class); MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, GrowthMap2.class); FileOutputFormat.setOutputPath(job,new Path(args[2])); job.setOutputKeyClass(Text.class); job.setOutputValueClass(Text.class); return job.waitForCompletion(true)?0:1; } public static void main(String args[]) throws Exception{ int exitcode = ToolRunner.run(new GrowthState(), args); System.exit(exitcode); } } Sample input file1: 01 AL Alabama 00-0000 All Occupations total "1,857,530" 0.4 1000.000 1.00 19.66 "40,890" 0.5 8.30 9.72 14.83 23.95 36.04 "17,260" "20,220" "30,850" "49,810" "74,950" 01 AL Alabama 11-0000 Management Occupations major "67,500" 1.1 36.338 0.73 51.48 "107,080" 0.6 24.54 33.09 44.98 62.09 88.43 "51,050" "68,830" "93,550" "129,150" "183,940" 01 AL Alabama 11-1011 Chief Executives detailed "1,080" 4.8 0.580 0.32 97.67 "203,150" 2.5 52.05 67.58 # # # "108,270" "140,570" # # # 01 AL Alabama 11-1021 General and Operations Managers detailed "26,480" 1.5 14.258 0.94 58.00 "120,640" 0.9 27.65 35.76 49.00 71.44 # "57,510" "74,390" "101,930" "148,590" # 01 AL Alabama 11-1031 Legislators detailed "1,470" 8.7 0.790 1.94 * "21,920" 3.5 * * * * * "16,120" "17,000" "18,450" "20,670" "32,820" TRUE 01 AL Alabama 11-2011 Advertising and Promotions Managers detailed 80 16.3 0.042 0.19 44.88 "93,350" 9.5 21.59 30.28 38.92 52.22 74.07 "44,900" "62,980" "80,960" "108,620" "154,060" 01 AL Alabama 11-2021 Marketing Managers detailed 610 11.5 0.329 0.24 61.28 "127,460" 7.4 31.96 37.63 53.39 73.17 # "66,480" "78,280" "111,040" "152,200" # 01 AL Alabama 11-2022 Sales Managers detailed "2,330" 5.4 1.253 0.47 54.63 "113,620" 2.2 27.28 35.42 48.92 67.62 89.42 "56,740" "73,660" "101,750" "140,640" "186,000" 05 AR Arkansas 43-4161 "Human Resources Assistants, Except Payroll and Timekeeping" detailed "1,470" 6.6 1.265 1.26 17.25 "35,870" 1.5 11.09 13.54 17.11 20.74 23.30 "23,060" "28,170" "35,590" "43,150" "48,450" 05 AR Arkansas 43-4171 Receptionists and Information Clerks detailed "7,080" 3.3 6.109 0.84 11.26 "23,420" 0.8 8.14 9.19 10.87 13.09 14.94 "16,940" "19,110" "22,600" "27,230" "31,070" 05 AR Arkansas 43-4181 Reservation and Transportation Ticket Agents and Travel Clerks detailed 590 23.6 0.510 0.50 12.61 "26,220" 6.1 8.99 9.81 10.88 14.82 20.59 "18,710" "20,400" "22,630" "30,830" "42,830" 05 AR Arkansas 43-4199 "Information and Record Clerks, All Other" detailed 920 4.7 0.795 0.61 18.45 "38,370" 1.8 13.59 15.33 18.49 21.35 23.86 "28,270" "31,880" "38,470" "44,410" "49,630" 05 AR Arkansas 43-5011 Cargo and Freight Agents detailed 480 16.5 0.418 0.73 * * * * * * * * * * * * * 05 AR Arkansas 43-5021 Couriers and Messengers 
detailed 510 12.4 0.444 0.84 11.92 "24,790" 2.1 8.73 9.91 11.26 13.49 16.03 "18,160" "20,620" "23,420" "28,060" "33,350" sample input file 2: Management occupations 11-0000 "8,861.5" "9,498.0" 636.6 7.2 22.2 "2,586.7" "$93,910" — — — Top executives 11-1000 "2,361.5" "2,626.8" 265.2 11.2 3.3 717.4 "$99,550" — — — Chief executives 11-1011 330.5 347.9 17.4 5.3 17.7 87.8 "$168,140" Bachelor's degree 5 years or more None General and operations managers 11-1021 "1,972.7" "2,216.8" 244.1 12.4 1.0 613.1 "$95,440" Bachelor's degree Less than 5 years None Legislators 11-1031 58.4 62.1 3.7 6.4 — 16.5 "$19,780" Bachelor's degree Less than 5 years None "Advertising, marketing, promotions, public relations, and sales managers" 11-2000 637.4 700.5 63.1 9.9 3.4 203.3 "$107,950" — — — Advertising and promotions managers 11-2011 35.5 38.0 2.4 6.9 17.8 13.4 "$88,590" Bachelor's degree Less than 5 years None Marketing and sales managers 11-2020 539.8 592.5 52.7 9.8 2.6 168.6 "$110,340" — — — Marketing managers 11-2021 180.5 203.4 22.9 12.7 2.6 61.7 "$119,480" Bachelor's degree 5 years or more None Sales managers 11-2022 359.3 389.0 29.8 8.3 2.7 106.9 "$105,260" Bachelor's degree Less than 5 years None Public relations and fundraising managers 11-2031 62.1 70.1 8.0 12.9 1.6 21.3 "$95,450" Bachelor's degree 5 years or more None Operations specialties managers 11-3000 "1,647.5" "1,799.7" 152.1 9.2 3.3 459.1 "$100,720" — — — Administrative services managers 11-3011 280.8 315.0 34.2 12.2 0.1 79.9 "$81,080" Bachelor's degree Less than 5 years None Computer and information systems managers 11-3021 332.7 383.6 50.9 15.3 3.1 97.1 "$120,950" Bachelor's degree 5 years or more None Financial managers 11-3031 532.1 579.2 47.1 8.9 5.1 146.9 "$109,740" Bachelor's degree 5 years or more None Industrial production managers 11-3051 172.7 168.6 -4.1 -2.4 6.1 31.4 "$89,190" Bachelor's degree 5 years or more None Purchasing managers 11-3061 71.9 73.4 1.5 2.1 0.3 17.3 "$100,170" Bachelor's degree 5 years or more None "Transportation, storage, and distribution managers" 11-3071 105.2 110.3 5.1 4.9 4.8 29.1 "$81,830" High school diploma or equivalent 5 years or more None Compensation and benefits managers 11-3111 20.7 21.4 0.6 3.1 — 6.1 "$95,250" Bachelor's degree 5 years or more None Human resources managers 11-3121 102.7 116.3 13.6 13.2 1.0 40.6 "$99,720" Bachelor's degree 5 years or more None Training and development managers 11-3131 28.6 31.8 3.2 11.2 — 10.7 "$95,400" Bachelor's degree 5 years or more None Other management occupations 11-9000 "4,215.0" "4,371.0" 156.1 3.7 43.1 "1,207.0" "$81,940" — — —
There is a problem with your reducer. The faulty code is shown below. The loop iterates over all the values of a particular key (e.g. for "Advertising and promotions managers" it runs twice: once with the value "Alabama" and again with the value "6.9"). The problem is that you have put the if(cent >= 30) check outside the for loop. It should be inside, so that it is evaluated while matching the key.

for (Text values : value) {
    String[] str = values.toString().split("\t");
    if (str[0].equals("m1")) {
        state = state + " " + str[1];
    } else if (str[0].equals("m2")) {
        try {
            cent = Float.parseFloat(str[1]);
        } catch (Exception nf) {
            cent = 0;
        }
    }
}
if (cent >= 30) {
    outvalue.set(state);
    context.write(key, outvalue);
}

The following piece of code works fine:

//Reducer
public static class GrowthReduce extends Reducer<Text, Text, Text, Text> {
    Text outvalue = new Text();
    HashMap<String, String> stateMap = new HashMap<String, String>();

    public void reduce(Text key, Iterable<Text> value, Context context) throws IOException, InterruptedException {
        float cent = 0;
        for (Text values : value) {
            String[] str = values.toString().split("\t");
            if (str[0].equals("m1")) {
                stateMap.put(key.toString().toLowerCase(), str[1]);
            } else if (str[0].equals("m2")) {
                try {
                    cent = Float.parseFloat(str[1]);
                    if (stateMap.containsKey(key.toString().toLowerCase())) {
                        if (cent > 30) {
                            outvalue.set(stateMap.get(key.toString().toLowerCase()));
                            context.write(key, outvalue);
                        }
                        stateMap.remove(key.toString());
                    }
                } catch (Exception nf) {
                    cent = 0;
                }
            }
        }
    }
}

The logic is: when you encounter a state (an "m1" value), you put it in the state map. When you later encounter the percentage for the same key (an "m2" value), you check whether the state is already in the map; if it is, you output the key/value.
Is it possible to create a list in java using data from multiple text files
I have multiple text files that contain information about the popularity of different programming languages in different countries, based on Google searches. I have one text file for each year from 2004 to 2015. I also have a text file that breaks this down into each week (called iot.txt), but that file does not include the country.

Example data from 2004.txt:

Region      java  c++  c#  python  JavaScript
Argentina   13    14   10  0       17
Australia   22    20   22  64      26
Austria     23    21   19  31      21
Belgium     20    14   17  34      25
Bolivia     25    0    0   0       0
etc

Example from iot.txt:

Week                     java  c++  c#  python  JavaScript
2004-01-04 - 2004-01-10  88    23   12  8       34
2004-01-11 - 2004-01-17  88    25   12  8       36
2004-01-18 - 2004-01-24  91    24   12  8       36
2004-01-25 - 2004-01-31  88    26   11  7       36
2004-02-01 - 2004-02-07  93    26   12  7       37

My problem is that I am trying to write code that will output the number of countries that have exhibited 0 interest in python. Below is the code I currently use to read the text files, but I'm not sure of the best way to find the number of regions with 0 interest in python across all the years 2004-2015. At first I thought the best way would be to create a list from all the text files (not including iot.txt) and then search it for any entries that have 0 interest in python, but I have no idea how to do that. Can anyone suggest a way to do this?

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.*;

public class Starter {
    public static void main(String[] args) throws Exception {
        BufferedReader fh = new BufferedReader(new FileReader("iot.txt"));

        //First line contains the language names
        String s = fh.readLine();
        List<String> langs = new ArrayList<>(Arrays.asList(s.split("\t")));
        langs.remove(0); //Throw away the first word - "week"

        Map<String, HashMap<String, Integer>> iot = new TreeMap<>();
        while ((s = fh.readLine()) != null) {
            String[] wrds = s.split("\t");
            HashMap<String, Integer> interest = new HashMap<>();
            for (int i = 0; i < langs.size(); i++)
                interest.put(langs.get(i), Integer.parseInt(wrds[i + 1]));
            iot.put(wrds[0], interest);
        }
        fh.close();

        HashMap<Integer, HashMap<String, HashMap<String, Integer>>> regionsByYear = new HashMap<>();
        for (int i = 2004; i < 2016; i++) {
            BufferedReader fh1 = new BufferedReader(new FileReader(i + ".txt"));
            String s1 = fh1.readLine(); //Throw away the first line

            HashMap<String, HashMap<String, Integer>> year = new HashMap<>();
            while ((s1 = fh1.readLine()) != null) {
                String[] wrds = s1.split("\t");
                HashMap<String, Integer> langMap = new HashMap<>();
                for (int j = 1; j < wrds.length; j++) {
                    langMap.put(langs.get(j - 1), Integer.parseInt(wrds[j]));
                }
                year.put(wrds[0], langMap);
            }
            regionsByYear.put(i, year);
            fh1.close();
        }
    }
}
Create a Map<String, Integer> using a HashMap. Each time you find a new country while scanning the incoming data, add it to the map as country -> 0. Each time you find a usage of python, increment the value. At the end, loop through the entrySet of the map and, for each entry whose value is zero, output the key.
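A minimal sketch of that idea, assuming the regionsByYear map already built in the question's code (year -> region -> language -> interest) and summing the yearly values, which has the same effect for the zero-interest check, could be appended to the end of main():

    // Total python interest per country across all years; countries whose
    // total stays at zero never showed any interest in python.
    Map<String, Integer> pythonInterest = new HashMap<>();
    for (HashMap<String, HashMap<String, Integer>> year : regionsByYear.values()) {
        for (Map.Entry<String, HashMap<String, Integer>> region : year.entrySet()) {
            pythonInterest.putIfAbsent(region.getKey(), 0);
            Integer python = region.getValue().get("python");
            if (python != null) {
                pythonInterest.put(region.getKey(), pythonInterest.get(region.getKey()) + python);
            }
        }
    }
    int zeroCount = 0;
    for (Map.Entry<String, Integer> e : pythonInterest.entrySet()) {
        if (e.getValue() == 0) {
            System.out.println(e.getKey());
            zeroCount++;
        }
    }
    System.out.println(zeroCount + " countries showed 0 interest in python");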
Hadoop Map Reduce Program Key and Value Passing
I am trying to learn hadoop. I have the following file downloaded from free large data set websites. I made it short for my sample testing. This is the small file. "CAMIS","DBA","BORO","BUILDING","STREET","ZIPCODE","PHONE","CUISINECODE","INSPDATE","ACTION","VIOLCODE","SCORE","CURRENTGRADE","GRADEDATE","RECORDDATE" "40280083","INTERCONTINENTAL THE BARCLAY","1","111 ","EAST 48 STREET ","10017","2129063134","03","2014-02-07 00:00:00","D","10F","4","A","2014-02-07 00:00:00","2014-04-24 06:01:04.920000000" "40356649","REGINA CATERERS","3","6409","11 AVENUE","11219","7182560829","03","2013-07-30 00:00:00","D","08A","12","A","2013-07-30 00:00:00","2014-04-24 06:01:04.920000000" "40356649","REGINA CATERERS","3","6409","11 AVENUE","11219","7182560829","03","2013-07-30 00:00:00","D","08B","12","A","2013-07-30 00:00:00","2014-04-24 06:01:04.920000000" "40356731","TASTE THE TROPICS ICE CREAM","3","1839 ","NOSTRAND AVENUE ","11226","7188560821","43","2013-07-10 00:00:00","D","06C","8","A","2013-07-10 00:00:00","2014-04-24 06:01:04.920000000" "40356731","TASTE THE TROPICS ICE CREAM","3","1839 ","NOSTRAND AVENUE ","11226","7188560821","43","2013-07-10 00:00:00","D","10B","8","A","2013-07-10 00:00:00","2014-04-24 06:01:04.920000000" "40357217","WILD ASIA","2","2300","SOUTHERN BOULEVARD","10460","7182207846","03","2013-06-19 00:00:00","D","10B","4","A","2013-06-19 00:00:00","2014-04-24 06:01:04.920000000" "40360045","SEUDA FOODS","3","705 ","KINGS HIGHWAY ","11223","7183751500","50","2013-10-10 00:00:00","D","08C","13","A","2013-10-10 00:00:00","2014-04-24 06:01:04.920000000" "40361521","GLORIOUS FOOD","1","522","EAST 74 STREET","10021","2127372140","03","2013-12-19 00:00:00","U","08A","16","B","2013-12-19 00:00:00","2014-04-24 06:01:04.920000000" "40362098","HARRIET'S KITCHEN","1","502","AMSTERDAM AVENUE","10024","2127210045","18","2014-03-04 00:00:00","U","10F","13","A","2014-03-04 00:00:00","2014-04-24 06:01:04.920000000" "40361322","CARVEL ICE CREAM","4","265-15 ","HILLSIDE AVENUE ","11004","7183430392","43","2013-09-18 00:00:00","D","08A","10","A","2013-09-18 00:00:00","2014-04-24 06:01:04.920000000" "40361708","BULLY'S DELI","1","759 ","BROADWAY ","10003","2122549755","27","2014-01-21 00:00:00","D","10F","12","A","2014-01-21 00:00:00","2014-04-24 06:01:04.920000000" "40362098","HARRIET'S KITCHEN","1","502","AMSTERDAM AVENUE","10024","2127210045","18","2014-03-04 00:00:00","U","04N","13","A","2014-03-04 00:00:00","2014-04-24 06:01:04.920000000" "40362274","ANGELIKA FILM CENTER","1","18","WEST HOUSTON STREET","10012","2129952570","03","2014-04-03 00:00:00","D","06D","9","A","2014-04-03 00:00:00","2014-04-24 06:01:04.920000000" "40362715","THE COUNTRY CAFE","1","60","WALL STREET","10005","3474279132","83","2013-09-18 00:00:00","D","10B","13","A","2013-09-18 00:00:00","2014-04-24 06:01:04.920000000" "40362869","SHASHEMENE INT'L RESTAURA","3","195","EAST 56 STREET","11203","3474300871","17","2013-05-08 00:00:00","D","10B","7","A","2013-05-08 00:00:00","2014-04-24 06:01:04.920000000" "40363021","DOWNTOWN DELI","1","107","CHURCH STREET","10007","2122332911","03","2014-02-26 00:00:00","D","10B","9","A","2014-02-26 00:00:00","2014-04-24 06:01:04.920000000" "40362432","HO MEI RESTAURANT","4","103-05","37 AVENUE","11368","7187796903","20","2014-04-21 00:00:00","D","06C","10","A","2014-04-21 00:00:00","2014-04-24 06:01:04.920000000" "40362869","SHASHEMENE INT'L RESTAURA","3","195","EAST 56 STREET","11203","3474300871","17","2013-05-08 00:00:00","D","10F","7","A","2013-05-08 00:00:00","2014-04-24 
06:01:04.920000000" "40363117","MEJLANDER & MULGANNON","3","7615","5 AVENUE","11209","7182386666","03","2013-10-24 00:00:00","D","02G","11","A","2013-10-24 00:00:00","2014-04-24 06:01:04.920000000" "40363289","HAPPY GARDEN","2","1236 ","238 SPOFFORD AVE ","10474","7186171818","20","2013-12-30 00:00:00","D","10F","8","A","2013-12-30 00:00:00","2014-04-24 06:01:04.920000000" "40363644","DOMINO'S PIZZA","1","464","3 AVENUE","10016","2125450200","62","2014-03-06 00:00:00","D","08A","11","A","2014-03-06 00:00:00","2014-04-24 06:01:04.920000000" "30191841","DJ REYNOLDS PUB AND RESTAURANT","1","351 ","WEST 57 STREET ","10019","2122452912","03","2013-07-22 00:00:00","D","10B","11","A","2013-07-22 00:00:00","2014-04-24 06:01:04.920000000" "40280083","INTERCONTINENTAL THE BARCLAY","1","111 ","EAST 48 STREET ","10017","2129063134","03","2014-02-07 00:00:00","D","10B","4","A","2014-02-07 00:00:00","2014-04-24 06:01:04.920000000" "40356442","KOSHER ISLAND","5","2206","VICTORY BOULEVARD","10314","7186985800","50","2013-04-04 00:00:00","D","10F","12","A","2013-04-04 00:00:00","2014-04-24 06:01:04.920000000" "40356483","WILKEN'S FINE FOOD","3","7114 ","AVENUE U ","11234","7184443838","27","2014-01-14 00:00:00","D","10B","10","A","2014-01-14 00:00:00","2014-04-24 06:01:04.920000000" File is about some inspection in restaurants. You can see there is CUISINECODE. Values of it ranges from "00" to some value or can be any value. There will be many restaurants have the same CUISINECODE. I just want to display the number of restaurants in each cusinecode. This is my MapReducer Program import java.io.IOException; import java.util.ArrayList; import java.util.Iterator; import java.util.List; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapred.FileInputFormat; import org.apache.hadoop.mapred.FileOutputFormat; import org.apache.hadoop.mapred.JobClient; import org.apache.hadoop.mapred.JobConf; import org.apache.hadoop.mapred.MapReduceBase; import org.apache.hadoop.mapred.Mapper; import org.apache.hadoop.mapred.OutputCollector; import org.apache.hadoop.mapred.Reducer; import org.apache.hadoop.mapred.Reporter; import org.apache.hadoop.mapred.TextInputFormat; import org.apache.hadoop.mapred.TextOutputFormat; public class RestaurantInspection { public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { #Override public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); if (line.startsWith("\"CAMIS\",")) { // Line is the header, ignore it return; } List<String> columns = new ArrayList<String>(); String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"); if (tokens.length != 15) { // Line isn't the correct number of columns or formatted properly return; } for(String t : tokens) { columns.add(t.replaceAll("\"", "")); } int cusineCode = Integer.parseInt(columns.get(7)); String violations = columns.get(9) + " --- " + columns.get(10); value.set(violations); output.collect(value, new IntWritable(cusineCode)); } } public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { #Override public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } 
output.collect(key, new IntWritable(sum)); } } public static void main(String[] args) throws Exception { JobConf conf = new JobConf(RestaurantInspection.class); conf.setJobName("Restaurent Inspection"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(Map.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf); } } I am using hadoop 1.2.1. I copied the above code from WordCount Example and just changed few lines. When I run the above code in hadoop I am getting following lines for the same file I given above D --- 02G 3 D --- 06C 63 D --- 06D 3 D --- 08A 108 D --- 08B 3 D --- 08C 50 D --- 10B 182 D --- 10F 117 U --- 04N 18 U --- 08A 3 U --- 10F 18 That was just a test. I am not getting any logic of how to write the code to get the desired output. I am expecting the following output for the above file. 01 -- 1 03 -- 9 43 -- 3 50 -- 2 18 -- 2 27 -- 2 83 -- 1 17 -- 2 20 -- 2 62 -- 1 By this, I think I can learn hadoop and map reduce. So how to write the code? Thanks.
You need the key to be the CUISINECODE:

String cusineCode = columns.get(7);
output.collect(new Text(cusineCode), new IntWritable(1));

This will do the job for you.
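Put together, a minimal sketch of the map method with that change (keeping the rest of the question's code, including the sum reducer, unchanged) might look like the following. Note that it counts input records per cuisine code, so duplicate rows for the same restaurant are counted separately:

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            if (line.startsWith("\"CAMIS\",")) {
                return; // skip the header line
            }
            // Split on commas that are outside quoted fields, as in the question
            String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
            if (tokens.length != 15) {
                return; // skip malformed lines
            }
            String cusineCode = tokens[7].replaceAll("\"", "");
            // Key = cuisine code, value = 1; the existing sum reducer then counts records per code
            output.collect(new Text(cusineCode), new IntWritable(1));
        }
    }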