Not able to use CompositeInputFormat in map-side join - Java

I am trying to implement a map-side join using CompositeInputFormat. However, I am getting the errors below in my MapReduce job, which I am unable to resolve.
1. In the code below I am getting an error while using the compose method, and also an error while setting the input format class. The error reads:
The method compose(String, Class, Path...) in
the type CompositeInputFormat is not applicable for the arguments
(String, Class, Path[])
Can someone please help?
package Hadoop.MR.Practice;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
//import org.apache.hadoop.mapred.join.CompositeInputFormat;
public class MapJoinJob implements Tool{
private Configuration conf;
public Configuration getConf() {
return conf;
}
public void setConf(Configuration conf) {
this.conf = conf;
}
@Override
public int run(String[] args) throws Exception {
Job job = Job.getInstance(getConf(), "MapSideJoinJob");
job.setJarByClass(this.getClass());
Path[] inputs = new Path[] { new Path(args[0]), new Path(args[1])};
String join = CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class, inputs);
job.getConfiguration().set("mapreduce.join.expr", join);
job.setInputFormatClass(CompositeInputFormat.class);
job.setMapperClass(MapJoinMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
//Configuring reducer
job.setReducerClass(WCReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setNumReduceTasks(0);
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.waitForCompletion(true);
return 0;
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
MapJoinJob mjJob = new MapJoinJob();
ToolRunner.run(conf, mjJob, args);
}
}

I would say your problem is likely related to mixing Hadoop APIs. You can see that your imports mix mapred and mapreduce classes.
For example, you're trying to use org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat with org.apache.hadoop.mapred.join.CompositeInputFormat, which is unlikely to work.
You should choose one API (probably mapreduce, I would say) and make sure everything uses it consistently.
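For reference, here is a minimal sketch of what the same driver could look like with everything on the new (mapreduce) API. It assumes Hadoop 2.x, where org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat provides compose() and reads its expression from mapreduce.join.expr; MapJoinMapper is the mapper from the question, and the sketch is otherwise untested:

package Hadoop.MR.Practice;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch: the same driver as in the question, but every class comes from the new API.
public class MapJoinJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "MapSideJoinJob");
        job.setJarByClass(getClass());

        // Both sides of the join use the new-API KeyValueTextInputFormat, so the
        // Class argument matches what the new-API compose() expects.
        String joinExpr = CompositeInputFormat.compose("inner",
                KeyValueTextInputFormat.class,
                new Path(args[0]), new Path(args[1]));
        job.getConfiguration().set("mapreduce.join.expr", joinExpr);

        job.setInputFormatClass(CompositeInputFormat.class);
        job.setMapperClass(MapJoinMapper.class);   // mapper from the question
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setNumReduceTasks(0);                  // pure map-side join
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MapJoinJob(), args));
    }
}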

Related

Issue with specifying the separator for KeyValueTextInputFormat

I was running a MapReduce job to count the number of citations of patents in a dataset. The input format was KeyValueTextInputFormat, and the separator was a comma. I am using the new API. However, specifying the separator in the Config using mapreduce.input.keyvaluelinerecordreader.key.value.separator isn't working, but using key.value.separator.in.input.line (which is meant for the old API) works.
When I go to the declaration of the KeyValueTextInputFormat class, go inside the KeyValueLineRecordReader class, and check the KEY_VALUE_SEPERATOR static variable, it is set to "mapreduce.input.keyvaluelinerecordreader.key.value.separator". So setting this key in the config should have worked, but it doesn't.
Here's the driver code:
package stubs.reverse_index;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class MyJob extends Configured implements Tool {
public int run(String[] args) throws Exception {
Configuration conf = getConf();
// THIS DOESN'T WORK
// conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
// THIS WORKS
conf.set("key.value.separator.in.input.line", ",");
Job job = new Job(conf);
Path in = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
job.setJobName("MyJob");
job.setMapperClass(InverseMapper.class);
job.setReducerClass(CsvReducer.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
boolean success = job.waitForCompletion(true);
return (success ? 0 : 1);
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new MyJob(), args);
System.exit(res);
}
}
What could be the issue here? As I understand it, key.value.separator.in.input.line is for the old API and shouldn't have worked here.
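This is not a definitive diagnosis, but a mismatch between the Hadoop version you compile against and the one running the job would explain the behavior: older releases read the separator from key.value.separator.in.input.line even in the new-API record reader, while newer ones use the renamed property. A hedged workaround is to set the separator under both names before the Job copies the Configuration; SeparatorConfig below is a hypothetical helper, not part of any Hadoop API:

import org.apache.hadoop.conf.Configuration;

// Hypothetical helper: set the key/value separator under both property names,
// so whichever name the cluster's KeyValueLineRecordReader honors is covered.
public class SeparatorConfig {
    public static void setSeparator(Configuration conf, String sep) {
        conf.set("key.value.separator.in.input.line", sep);                            // old name
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", sep); // new name
    }
}

Calling SeparatorConfig.setSeparator(conf, ",") before new Job(conf) then works regardless of which property name the runtime expects.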

Hadoop program (java) to read comma separated input file

I have an input file that looks like below
000001928162247ffaf63185cd8b2a244c78e7c6,2009324,abcat0101001,Sharp,"2011-09-05 12:25:37.42","2011-09-05 12:25:01.187"
0001be1731ee7d1c519bc7e87110c9eb880cb396,1649294,abcat0715001,"Gunnar eyewear","2011-09-23 17:13:36.175","2011-09-23 17:12:18.389"
0001bfa0c494c01f9f8c141c476c11bb4625a746,17240521,cat02015,refrigerator,"2011-10-19 23:43:51.71","2011-10-19 23:43:06.485"
0001fb09f03fea4d04e2267ed3194c806839d997,1271997,abcat0513004,Razer,"2011-09-07 09:20:07.11","2011-09-07 09:19:03.279"
0002965b083b6e508f7740c47c8f39e1072b4219,3562379,pcmcat209400050001,"I phone 4","2011-10-27 14:10:31.92","2011-10-27 14:09:33.327"
0002bb28a9ca07f5515b01996fd5d7ca84742e41,3230638,pcmcat177200050009,"hd antenna","2011-10-20 00:03:49.966","2011-10-20 00:02:01.458"
0002bd9c3d654698bb514194c4f4171ad6992266,9947181,pcmcat253300050012,printer,"2011-10-06 19:51:40.984","2011-10-06 19:47:13.803"
0002fee45e1c32eb94e82fc6c15c4db14e796248,3519969,pcmcat247400050000,vaio,"2011-10-19 23:31:51.015","2011-10-19 23:31:12.213"
00042033d355973baf9454b021a15c6b5b48f4a3,2677297,pcmcat212600050008,"desk top","2011-08-29 12:03:38.265","2011-08-29 12:03:12.348"
000433e0ef411c2cb8ee1727002d6ba15fe9426b,8959317,cat02015,"how i met your mother","2011-09-17 19:44:40.129","2011-09-17 19:43:37.564"
which contains the following information
user_id,product_id,category,query,click_time,query_time
I want to read this file in Hadoop and extract the user_id and the category (the 1st and 3rd fields). I have a basic Hadoop program below that I used for word count. In this task, the items are separated by commas, and I have to store them in an ArrayList.
Here is my starting program:
import java.io.IOException;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
public class popularcats extends Configured implements Tool {
public int run(String[] args) throws Exception {
Job job = new Job(getConf());
job.setJarByClass(getClass());
job.setMapperClass(TokenCounterMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String [] args) throws Exception {
int exitCode = ToolRunner.run(new popularcats(), args);
System.exit(exitCode);
}
}
I suppose Hadoop must have some classes for reading a CSV file. I found the class CSVLineRecordReader at this address:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java
So how should I process this file and extract the desired fields?
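For extracting just the 1st and 3rd fields, a plain split on commas in a custom mapper is usually enough, since the quoted, comma-containing values in the sample rows only appear after the third column; for later fields you would want a real CSV reader such as the linked CSVLineRecordReader. A minimal, untested sketch (the class name and the choice to emit user_id/category pairs are illustrative, not from the question):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: emit (user_id, category) for each line of the click log.
public class UserCategoryMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length > 2) {
            context.write(new Text(fields[0]), new Text(fields[2]));   // user_id, category
        }
    }
}

Plugging it in would mean job.setMapperClass(UserCategoryMapper.class) instead of TokenCounterMapper, with Text as the output value class (and a different reducer, or none) in the driver above.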

When to use NLineInputFormat in Hadoop Map-Reduce?

I have a text-based input file of around 25 GB, in which a single record consists of 4 lines. The processing for every record is the same, but within each record the four lines are processed differently.
I'm new to Hadoop, so I would like guidance on whether to use NLineInputFormat in this situation or the default TextInputFormat. Thanks in advance!
Assuming you have the text file in the following format:
2015-8-02
error2014 blahblahblahblah
2015-8-02
blahblahbalh error2014
You could use NLineInputFormat.
With NLineInputFormat, you can specify exactly how many lines should go to each mapper.
In your case you can use it to send 4 lines to each mapper.
EDIT:
Here is an example for using NLineInputFormat:
Mapper Class:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(key, value);
}
}
Driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Driver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.out
.printf("Two parameters are required for DriverNLineInputFormat- <input dir> <output dir>\n");
return -1;
}
Job job = new Job(getConf());
job.setJobName("NLineInputFormat example");
job.setJarByClass(Driver.class);
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.addInputPath(job, new Path(args[0]));
job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 4);
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MapperNLine.class);
job.setNumReduceTasks(0);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
System.exit(exitCode);
}
}

How to read multiple types of Avro data in single MapReduce

I have two different types of Avro data which have some common fields. I want to read those common fields in the mapper, and I want to do this by spawning a single job on the cluster.
Below is the sample avro schema
Schema 1:
{"type":"record","name":"Test","namespace":"com.abc.schema.SchemaOne","doc":"Avro storing with schema using MR.","fields":[{"name":"EE","type":"string","default":null},
{"name":"AA","type":["null","long"],"default":null},
{"name":"BB","type":["null","string"],"default":null},
{"name":"CC","type":["null","string"],"default":null}]}
Schema 2 :
{"type":"record","name":"Test","namespace":"com.abc.schema.SchemaTwo","doc":"Avro
storing with schema using
MR.","fields":[{"name":"EE","type":"string","default":null},
{"name":"AA","type":["null","long"],"default":null},
{"name":"CC","type":["null","string"],"default":null},
{"name":"DD","type":["null","string"],"default":null}]}
Driver Class:
package com.mango.schema.aggrDaily;
import java.util.Date;
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class AvroDriver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
JobConf conf = new JobConf(super.getConf(), getClass());
conf.setJobName("DF");
args[0] = "hdfs://localhost:9999/home/hadoop/work/alok/aggrDaily/data/avro512MB/part-m-00000.avro";
args[1] = "/home/hadoop/work/alok/tmp"; // temp location
args[2] = "hdfs://localhost:9999/home/hadoop/work/alok/tmp/10";
FileInputFormat.addInputPaths(conf, args[0]);
FileOutputFormat.setOutputPath(conf, new Path(args[2]));
AvroJob.setInputReflect(conf);
AvroJob.setMapperClass(conf, AvroMapper.class);
AvroJob.setOutputSchema(
conf,
Pair.getPairSchema(Schema.create(Schema.Type.STRING),
Schema.create(Schema.Type.INT)));
RunningJob job = JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
long startTime = new Date().getTime();
System.out.println("Start Time :::::" + startTime);
Configuration conf = new Configuration();
int exitCode = ToolRunner.run(conf, new AvroDriver(), args);
long endTime = new Date().getTime();
System.out.println("End Time :::::" + endTime);
System.out.println("Total Time Taken:::"
+ new Double((endTime - startTime) * 0.001) + "Sec.");
System.exit(exitCode);
}
}
Mapper class:
package com.mango.schema.aggrDaily;
import java.io.IOException;
import org.apache.avro.generic.GenericData;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.mapred.Reporter;
public class AvroMapper extends
AvroMapper<GenericData, Pair<CharSequence, Integer>> {
@Override
public void map(GenericData record,
AvroCollector<Pair<CharSequence, Integer>> collector, Reporter reporter) throws IOException {
System.out.println("record :: " + record);
}
}
I am able to read Avro data with this code by setting the input schema:
AvroJob.setInputSchema(conf, new AggrDaily().getSchema());
Since Avro data has the schema built into the data itself, I don't want to pass a specific schema to the job explicitly. I achieved this in Pig, but now I want to achieve the same in MapReduce as well.
Can anybody help me achieve this through MR code, or let me know where I am going wrong?
With the org.apache.hadoop.mapreduce.lib.input.MultipleInputs class we can read multiple Avro inputs through a single MR job.
We cannot use org.apache.hadoop.mapreduce.lib.input.MultipleInputs to read multiple Avro inputs, because each Avro input has a schema associated with it and currently the context can store the schema for only one of the inputs, so the other mappers will not be able to read their data.
The same is true of HCatInputFormat (because every input has a schema associated with it). However, from HCatalog 0.14 onwards there is a provision for this.
AvroMultipleInputs can be used to accomplish this. It works only with Specific and Reflect mappings, and is available from Avro 1.7.7 onwards.
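For completeness, a rough and untested sketch of how AvroMultipleInputs might be wired into the question's old-API driver. Everything here is an assumption to verify against the Avro 1.7.7+ Javadoc for org.apache.avro.mapred.AvroMultipleInputs (in particular the addInputPath parameter order); SchemaOneMapper, SchemaTwoMapper, SCHEMA_ONE_JSON and SCHEMA_TWO_JSON are hypothetical names standing in for one AvroMapper subclass and one schema string per input type:

// Inside run(), replacing the single-input setup from the question. Sketch only;
// verify AvroMultipleInputs.addInputPath against your Avro version's Javadoc.
JobConf conf = new JobConf(super.getConf(), getClass());
conf.setJobName("DF");

Schema schemaOne = new Schema.Parser().parse(SCHEMA_ONE_JSON);   // JSON of "Schema 1" above
Schema schemaTwo = new Schema.Parser().parse(SCHEMA_TWO_JSON);   // JSON of "Schema 2" above

// One mapper and one reader schema per input path; each mapper can then pull
// out the common fields (EE, AA, CC).
AvroMultipleInputs.addInputPath(conf, new Path(args[0]), SchemaOneMapper.class, schemaOne);
AvroMultipleInputs.addInputPath(conf, new Path(args[1]), SchemaTwoMapper.class, schemaTwo);

AvroJob.setOutputSchema(conf,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.INT)));
FileOutputFormat.setOutputPath(conf, new Path(args[2]));
JobClient.runJob(conf);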

Running a mapreduce class in another Java program

I wrote a MapReduce class and created a jar file from it. Now I want to use this jar in another Java program.
Can anyone please help me with how to do this?
Thanks.
Here is my MapReduce program:
package org.apache.cassandra.com;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class CassandraSumLib extends Configured implements Tool
{
public CassandraSumLib(){
}
static final String KEYSPACE = "weather";
static final String COLUMN_FAMILY = "momentinfo1";
static final String OUTPUT_PATH = "/tmp/OutPut";
private static final Logger logger = LoggerFactory.getLogger(CassandraSum.class);
public int CassandraSum(String[] args) throws Exception
{
return ToolRunner.run(new Configuration(), new CassandraSumLib(), args);
}
///////////////////////////////////////////////////////////
public static class Summap extends Mapper<Map<String, ByteBuffer>, Map<FloatWritable, ByteBuffer>, Text, DoubleWritable>
{
Text word = new Text("SUM");
float temp;
public void map(Map<String, ByteBuffer> keys, Map<FloatWritable, ByteBuffer> columns, Context context) throws IOException, InterruptedException
{
for (Entry<FloatWritable, ByteBuffer> column : columns.entrySet())
{
if (!"column".equals(column.getKey()))
continue;
temp = ByteBufferUtil.toFloat(column.getValue());
//System.out.println(temp);
context.write(word, new DoubleWritable(temp));
//System.out.println(word + " " + temp);
}
}
}
///////////////////////////////////////////////////////////
public static class Sumred extends Reducer<Text, DoubleWritable, Text, DoubleWritable>
{
public void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException
{
Double sum = 0.0;
for (DoubleWritable val : values){
// System.out.println(val.get());
sum += val.get();}
context.write(key, new DoubleWritable(sum));
}
}
///////////////////////////////////////////////////////////
public int run(String[] args) throws Exception
{
Job job = new Job(getConf(), "SUM");
job.setJarByClass(CassandraSum.class);
job.setMapperClass(Summap.class);
JobConf conf = new JobConf( getConf(), CassandraSum.class);
// conf.setNumMapTasks(1000);
// conf.setNumReduceTasks(900);
job.setOutputFormatClass(TextOutputFormat.class);
job.setCombinerClass(Sumred.class);
job.setReducerClass(Sumred.class);
job.setOutputKeyClass(Text.class);
job.setNumReduceTasks(900);
job.setOutputValueClass(DoubleWritable.class);
FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
job.setInputFormatClass(CqlPagingInputFormat.class);
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "3");
job.waitForCompletion(true);
return 0;
}
}
I want to call this class from another program. Here is my second program, which calls the first one:
package org.apache.cassandra.com;
import java.util.*;
import org.apache.hadoop.util.RunJar;
import org.apache.cassandra.com.CassandraSumLib;
public class CassandraSum {
public static void main(String[] args) throws Exception{
CassandraSumLib CSL = new CassandraSumLib();
CSL.??? (which method should I write here?)
}
}
thanks
Steps to add a jar file in Eclipse:
1. Right-click on the project
2. Click on Build Path -> Configure Build Path
3. Click on Java Build Path
4. Click on the Libraries tab
5. Click on Add External JARs
6. Choose the jar file
7. Click OK
Add the jar to the classpath of the second program. If you are compiling/running from the command line, use the -cp option.
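As for the CSL.??? line in the second program, one way (a sketch, assuming the jar containing CassandraSumLib is already on the classpath as described above) is to hand the Tool implementation to ToolRunner, which will call its run() method:

package org.apache.cassandra.com;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

// Sketch: run the packaged MapReduce job from another program by passing the
// Tool implementation (CassandraSumLib) to ToolRunner.
public class CassandraSum {
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new CassandraSumLib(), args);
        System.exit(exitCode);
    }
}

Alternatively, since CassandraSumLib already defines CassandraSum(String[]), which wraps the same ToolRunner.run call, the second program could simply call CSL.CassandraSum(args).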
