I am trying to write a custom partitioner that allocates each unique key to a single reducer, as an alternative to the default HashPartitioner provided with Hadoop, which did not distribute my keys the way I wanted.
I keep getting the error below. From what I can tell after some research, it has something to do with the constructor not receiving its arguments, but in this context, with Hadoop, aren't the arguments passed automatically by the framework? I can't find an error in the code.
18/04/20 17:06:51 INFO mapred.JobClient: Task Id : attempt_201804201340_0007_m_000000_1, Status : FAILED
java.lang.RuntimeException: java.lang.NoSuchMethodException: biA3pipepart$parti.<init>()
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:587)
This is my partitioner:
public class Parti extends Partitioner<Text, Text> {
    String partititonkey;
    int result = 0;

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String partitionKey = key.toString();
        if (numPartitions >= 9) {
            if (partitionKey.charAt(0) == '0') {
                if (partitionKey.charAt(2) == '0')
                    result = 0;
                else if (partitionKey.charAt(2) == '1')
                    result = 1;
                else
                    result = 2;
            } else if (partitionKey.charAt(0) == '1') {
                if (partitionKey.charAt(2) == '0')
                    result = 3;
                else if (partitionKey.charAt(2) == '1')
                    result = 4;
                else
                    result = 5;
            } else if (partitionKey.charAt(0) == '2') {
                if (partitionKey.charAt(2) == '0')
                    result = 6;
                else if (partitionKey.charAt(2) == '1')
                    result = 7;
                else
                    result = 8;
            }
        } else {
            result = 0;
        }
        return result;
    } // close method
} // close class
My mapper signature
public static class JoinsMap extends Mapper<LongWritable,Text,Text,Text>{
public void Map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
My reducer signature
public static class JoinsReduce extends Reducer<Text,Text,Text,Text>{
public void Reduce (Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
main class:
public static void main( String[] args ) throws Exception {
Configuration conf1 = new Configuration();
Job job1 = new Job(conf1, "biA3pipepart");
job1.setJarByClass(biA3pipepart.class);
job1.setNumReduceTasks(9);//***
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(Text.class);
job1.setMapperClass(JoinsMap.class);
job1.setReducerClass(JoinsReduce.class);
job1.setInputFormatClass(TextInputFormat.class);
job1.setOutputFormatClass(TextOutputFormat.class);
job1.setPartitionerClass(Parti.class); //+++
// inputs to map.
FileInputFormat.addInputPath(job1, new Path(args[0]));
// single output from reducer.
FileOutputFormat.setOutputPath(job1, new Path(args[1]));
job1.waitForCompletion(true);
}
keys emitted by Mapper are the following:
0,0
0,1
0,2
1,0
1,1
1,2
2,0
2,1
2,2
and the Reducer only writes keys and values it receives.
SOLVED
I just added static to my Parti class, like the mapper and reducer classes, as suggested in a comment (user238607).
public static class Parti extends Partitioner<Text, Text> {
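For context, Hadoop instantiates the partitioner by reflection (see ReflectionUtils.newInstance in the stack trace), and that requires a no-argument constructor. A non-static inner class has no such constructor, because its constructors implicitly take a reference to the enclosing instance, hence the NoSuchMethodException for <init>(). Once the class is static (or moved into its own file), instantiation works. As an aside, since the keys always look like the "digit,digit" pairs listed above, the body of getPartition can be collapsed into a single expression; this is only a sketch under that assumption:
@Override
public int getPartition(Text key, Text value, int numPartitions) {
    String k = key.toString();
    if (numPartitions < 9) {
        return 0;
    }
    // keys are "0,0" .. "2,2": the first digit picks the group of three, the second the offset
    return (k.charAt(0) - '0') * 3 + (k.charAt(2) - '0');
}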
Related
I am looking for a MapReduce program that reads from a Hive table and writes the first column value of each record to an HDFS location. It should contain only a map phase, no reduce phase.
Below is the mapper
public class Map extends Mapper<WritableComparable, HCatRecord, NullWritable, IntWritable> {

    protected void map(WritableComparable key, HCatRecord value, Context context)
            throws IOException, InterruptedException {
        // The group table from /etc/group has name, 'x', id
        // groupname = (String) value.get(0);
        int id = (Integer) value.get(1);
        // Just select and emit the ID
        context.write(null, new IntWritable(id));
    }
}
Main class
public class mapper1 {
public static void main(String[] args) throws Exception {
mapper1 m=new mapper1();
m.run(args);
}
public void run(String[] args) throws IOException, Exception, InterruptedException {
Configuration conf =new Configuration();
// Get the input and output table names as arguments
String inputTableName = args[0];
// Assume the default database
String dbName = "xademo";
Job job = new Job(conf, "UseHCat");
job.setJarByClass(mapper1.class);
HCatInputFormat.setInput(job, dbName, inputTableName);
job.setMapperClass(Map.class);
// An HCatalog record as input
job.setInputFormatClass(HCatInputFormat.class);
// Mapper emits a string as key and an integer as value
job.setMapOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(IntWritable.class);
FileOutputFormat.setOutputPath((JobConf) conf, new Path(args[1]));
job.waitForCompletion(true);
}
}
Is there anything wrong with this code?
It fails with a NumberFormatException for the string 5s, and I am not sure where that value comes from. The error is reported at the HCatInputFormat.setInput() line shown above.
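A side note on the driver above (this may well be unrelated to the NumberFormatException, which is thrown inside HCatInputFormat.setInput): the new-API FileOutputFormat expects the Job object, so casting the plain Configuration to JobConf will throw a ClassCastException once execution reaches that line, and NullWritable.get() is the conventional replacement for a bare null key in the mapper. A minimal sketch of those lines, with fully qualified names to make the intended classes explicit:
// driver: pass the Job to the new-API FileOutputFormat instead of casting conf to JobConf
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);

// mapper: the conventional way to emit an empty key
context.write(NullWritable.get(), new IntWritable(id));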
I've developed a custom InputFormat for Hadoop (including a custom InputSplit and a custom RecordReader) and I'm experiencing a rare NullPointerException.
These classes are going to be used for querying a third-party system which exposes a REST API for records retrieving. Thus, I got inspiration in DBInputFormat, which is a non-HDFS InputFormat as well.
The error I get is the following:
Error: java.lang.NullPointerException at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:762)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
I've looked at the source of MapTask (Hadoop 2.1.0) and the problematic part seems to be the initialization of the RecordReader:
472 NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
473 org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
474 TaskReporter reporter,
475 org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
476 throws InterruptedException, IOException {
...
491 this.real = inputFormat.createRecordReader(split, taskContext);
...
494 }
...
519 @Override
520 public void initialize(org.apache.hadoop.mapreduce.InputSplit split,
521 org.apache.hadoop.mapreduce.TaskAttemptContext context
522 ) throws IOException, InterruptedException {
523 long bytesInPrev = getInputBytes(fsStats);
524 real.initialize(split, context);
525 long bytesInCurr = getInputBytes(fsStats);
526 fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
527 }
Of course, the relevant parts of my code:
// MyInputFormat.java
public static void setEnvironmnet(Job job, String host, String port, boolean ssl, String APIKey) {
backend = new Backend(host, port, ssl, APIKey);
}
public static void addResId(Job job, String resId) {
Configuration conf = job.getConfiguration();
String inputs = conf.get(INPUT_RES_IDS, "");
if (inputs.isEmpty()) {
inputs += resId;
} else {
inputs += "," + resId;
}
conf.set(INPUT_RES_IDS, inputs);
}
@Override
public List<InputSplit> getSplits(JobContext job) {
// resulting splits container
List<InputSplit> splits = new ArrayList<InputSplit>();
// get the Job configuration
Configuration conf = job.getConfiguration();
// get the inputs, i.e. the list of resource IDs
String input = conf.get(INPUT_RES_IDS, "");
String[] resIDs = StringUtils.split(input);
// iterate on the resIDs
for (String resID: resIDs) {
splits.addAll(getSplitsResId(resID, job.getConfiguration()));
}
// return the splits
return splits;
}
@Override
public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
if (backend == null) {
logger.info("Unable to create a MyRecordReader, it seems the environment was not properly set");
return null;
}
// create a record reader
return new MyRecordReader(backend, split, context);
}
// MyRecordReader.java
@Override
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
// get start, end and current positions
MyInputSplit inputSplit = (MyInputSplit) this.split;
start = inputSplit.getFirstRecordIndex();
end = start + inputSplit.getLength();
current = 0;
// query the third-party system for the related resource, seeking to the start of the split
records = backend.getRecords(inputSplit.getResId(), start, end);
}
// MapReduceTest.java
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new MapReduceTest(), args);
System.exit(res);
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = this.getConf();
Job job = Job.getInstance(conf, "MapReduce test");
job.setJarByClass(MapReduceTest.class);
job.setMapperClass(MyMap.class);
job.setCombinerClass(MyReducer.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(MyInputFormat.class);
MyInputFormat.addInput(job, "ca73a799-9c71-4618-806e-7bd0ca1911f4");
MyInputFormat.setEnvironmnet(job, "my.host.com", "443", true, "my_api_key");
FileOutputFormat.setOutputPath(job, new Path(args[0]));
return job.waitForCompletion(true) ? 0 : 1;
}
Any ideas about what is wrong?
By the way, which is the "right" InputSplit for the RecordReader to use, the one given to the constructor or the one passed to the initialize method? Anyway, I've tried both options and the resulting error is the same :)
The way I read your stack trace, real is null on line 524.
But don't take my word for it. Slip an assert or System.out.println in there and check the value of real yourself.
NullPointerException almost always means you dereferenced something you didn't expect to be null. Some libraries and collections will throw it at you as their way of saying "this can't be null".
Error: java.lang.NullPointerException at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524)
To me this reads as: in the org.apache.hadoop.mapred package the MapTask class has an inner class NewTrackingRecordReader with an initialize method that threw a NullPointerException at line 524.
524 real.initialize( blah, blah) // I actually stopped reading after the dot
this.real was set on line 491.
491 this.real = inputFormat.createRecordReader(split, taskContext);
Assuming you haven't left out any more closely scoped real variables that are masking this.real, we need to look at inputFormat.createRecordReader(split, taskContext). If that can return null, it might be the culprit.
Turns out it will return null when backend is null.
@Override
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit split,
TaskAttemptContext context) {
if (backend == null) {
logger.info("Unable to create a MyRecordReader, " +
"it seems the environment was not properly set");
return null;
}
// create a record reader
return new MyRecordReader(backend, split, context);
}
It looks like setEnvironmnet is supposed to set backend
// MyInputFormat.java
public static void setEnvironmnet(
Job job,
String host,
String port,
boolean ssl,
String APIKey) {
backend = new Backend(host, port, ssl, APIKey);
}
backend must be declared somewhere outside setEnvironment (or you'd be getting a compiler error).
If backend hasn't been set to something non-null upon construction and setEnvironmnet was not called before createRecordReader then you should expect to get exactly the NullPointerException you got.
UPDATE:
As you've noted, since setEnvironmnet() is static backend must be static as well. This means that you must be sure other instances aren't setting it to null.
Solved. The problem was that the backend variable is declared static, i.e. it belongs to the Java class, so any other object changing that variable (e.g. setting it to null) affects all the other objects of the same class.
Now setEnvironmnet adds the host, port, SSL flag and API key to the job Configuration (just as setResId already did with the resource ID); when createRecordReader is invoked, that configuration is read and the backend object is created there.
Thanks to CandiedOrange, who put me on the right path!
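For anyone hitting the same thing, here is a minimal sketch of that approach: the connection settings travel through the job Configuration instead of a static field, and the Backend is rebuilt inside createRecordReader on the task side. Backend, MyRecordReader and the method names come from the question (setEnvironment is spelled correctly here); the configuration key names and the getSplits placeholder are assumptions for the sketch.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class MyInputFormat extends InputFormat<LongWritable, Text> {

    // hypothetical configuration key names, chosen for this sketch
    private static final String HOST = "my.backend.host";
    private static final String PORT = "my.backend.port";
    private static final String SSL = "my.backend.ssl";
    private static final String API_KEY = "my.backend.apikey";

    public static void setEnvironment(Job job, String host, String port,
                                      boolean ssl, String apiKey) {
        Configuration conf = job.getConfiguration();
        conf.set(HOST, host);
        conf.set(PORT, port);
        conf.setBoolean(SSL, ssl);
        conf.set(API_KEY, apiKey);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        Configuration conf = context.getConfiguration();
        // rebuild the backend on the task side from the configuration
        Backend backend = new Backend(conf.get(HOST), conf.get(PORT),
                                      conf.getBoolean(SSL, true), conf.get(API_KEY));
        return new MyRecordReader(backend, split, context);
    }

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException, InterruptedException {
        // placeholder: in the real class this builds one split per resource ID, as in the question
        return new ArrayList<InputSplit>();
    }
}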
Driver code:
public class WcDriver {
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf, "WcDriver");
job.setJarByClass(WcDriver.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WcMapper.class);
job.setReducerClass(WcReducer.class);
job.waitForCompletion(true);
}
}
Reducer code
public class WcReducer extends Reducer<Text, LongWritable, Text, String> {

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        String key1 = null;
        int total = 0;
        for (LongWritable value : values) {
            total += value.get();
            key1 = key.toString();
        }
        context.write(new Text(key1), "ABC");
    }
}
Here, in the driver class I have set job.setOutputKeyClass(Text.class) and job.setOutputValueClass(LongWritable.class), but in the reducer class I am writing a String: context.write(new Text(key1), "ABC"). I would expect an error while running the program because the output types do not match, and also because the reducer's key should implement WritableComparable and its value should implement Writable. Strangely, this program runs fine. I do not understand why there is no exception.
Try this:
// job.setOutputFormatClass(TextOutputFormat.class);
// comment this line out and see whether you get a casting exception.
The reason you see no error is that TextOutputFormat does not enforce the declared output key and value classes: its record writer simply calls toString() on whatever key and value the reducer emits. With a binary output format the declared Writable types would matter, but with text output the mismatch goes unnoticed.
Try this:
// job.setOutputValueClass(LongWritable.class); // if you comment this line out you get an error
That setting only declares the key/value classes for the job; what is actually written out depends on the output format, and since the output format here is text-based, the mismatch does not produce an error.
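For completeness, a consistent version of the reducer and the driver settings might look like the sketch below (class names follow the question; this is just one way to line the declared output types up with what the reducer actually writes):
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// reducer now emits Text values, matching job.setOutputValueClass(Text.class) in the driver
public class WcReducer extends Reducer<Text, LongWritable, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable value : values) {
            total += value.get();   // sum the counts for this key
        }
        context.write(key, new Text(Long.toString(total)));
    }
}

// and in WcDriver:
// job.setOutputKeyClass(Text.class);
// job.setOutputValueClass(Text.class);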
I've recently started experimenting with Hadoop and just created my own InputFormat to handle PDFs.
For some reason my custom RecordReader class never has its initialize method called. (I checked with a sysout, because I haven't set up a debugging environment.)
I'm running Hadoop 2.2.0 on Windows 7 32-bit and launching jobs with yarn jar, as hadoop jar is buggy under Windows...
import ...
public class PDFInputFormat extends FileInputFormat<Text, Text>
{
@Override
public RecordReader<Text, Text> getRecordReader(InputSplit arg0,
JobConf arg1, Reporter arg2) throws IOException
{
return new PDFRecordReader();
}
public static class PDFRecordReader implements RecordReader<Text, Text>
{
private FSDataInputStream fileIn;
public String fileName=null;
HashSet<String> hset=new HashSet<String>();
private Text key=null;
private Text value=null;
private byte[] output=null;
private int position = 0;
@Override
public Text createValue() {
int endpos = -1;
for (int i = position; i < output.length; i++){
if (output[i] == (byte) '\n') {
endpos = i;
}
}
if (endpos == -1) {
return new Text(Arrays.copyOfRange(output,position,output.length));
}
return new Text(Arrays.copyOfRange(output,position,endpos));
}
@Override
public void initialize(InputSplit genericSplit, TaskAttemptContext job) throws
IOException, InterruptedException
{
System.out.println("initialization is called");
FileSplit split=(FileSplit) genericSplit;
Configuration conf=job.getConfiguration();
Path file=split.getPath();
FileSystem fs=file.getFileSystem(conf);
fileIn= fs.open(split.getPath());
fileName=split.getPath().getName().toString();
System.out.println(fileIn.toString());
PDDocument docum = PDDocument.load(fileIn);
ByteArrayOutputStream boss = new ByteArrayOutputStream();
OutputStreamWriter ow = new OutputStreamWriter(boss);
PDFTextStripper stripper=new PDFTextStripper();
stripper.writeText(docum, ow);
ow.flush();
output = boss.toByteArray();
}
}
}
As I figured out last night, and in case it helps someone else with this:
RecordReader in the old org.apache.hadoop.mapred API is an interface and doesn't declare an initialize method at all, which explains why it never gets called automatically.
Extending the abstract RecordReader class from org.apache.hadoop.mapreduce instead lets you override initialize, and the framework will call it.
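To illustrate which methods the new API expects, here is a bare-bones skeleton (only a sketch; the PDFBox extraction from the question would go inside initialize(), and nextKeyValue() would populate the key and value):
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class PDFInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new PDFRecordReader();
    }

    public static class PDFRecordReader extends RecordReader<Text, Text> {
        private final Text key = new Text();
        private final Text value = new Text();
        private boolean done = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // called once per split before nextKeyValue(); open the file and
            // run the PDF text extraction from the question here
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (done) {
                return false;   // no more records in this split
            }
            done = true;        // this sketch emits a single (empty) record
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() { return done ? 1.0f : 0.0f; }

        @Override
        public void close() throws IOException {
            // close any streams opened in initialize()
        }
    }
}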
System.out.println() output may not be visible while the job is running. To make sure whether your initialize() is called or not, try throwing a RuntimeException there, as below:
@Override
public void initialize(InputSplit genericSplit, TaskAttemptContext job) throws
IOException, InterruptedException
{
throw new NullPointerException("inside initialize()");
....
This will definitely do.
I have the same problem as mentioned in this question (Type mismatch in key from map when replacing Mapper with MultithreadMapper), but the answer there does not work for me.
The error message I get looks like the following:
13/09/17 10:37:38 INFO mapred.JobClient: Task Id : attempt_201309170943_0006_m_000000_0, Status : FAILED
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1019)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:690)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Here is my main method:
public static int main(String[] args) throws Exception {
Configuration config = new Configuration();
if (args.length != 5) {
System.out.println("Invalid Arguments");
print_usage();
throw new IllegalArgumentException();
}
config.set("myfirstdata", args[0]);
config.set("myseconddata", args[1]);
config.set("mythirddata", args[2]);
config.set("mykeyattribute", "GK");
config.setInt("myy", 50);
config.setInt("myx", 49);
// additional attributes
config.setInt("myobjectid", 1);
config.setInt("myplz", 3);
config.setInt("mygenm", 4);
config.setInt("mystnm", 6);
config.setInt("myhsnr", 7);
config.set("mapred.textoutputformat.separator", ";");
Job job = new Job(config);
job.setJobName("MySample");
// set the outputs for the Job
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// set the outputs for the Job
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
MultithreadedMapper.setMapperClass(job, MyMapper.class);
job.setReducerClass(MyReducer.class);
// In our case, the combiner is the same as the reducer. This is
// possible
// for reducers that are both commutative and associative
job.setCombinerClass(MyReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.setInputPaths(job, new Path(args[3]));
TextOutputFormat.setOutputPath(job, new Path(args[4]));
job.setJarByClass(MySampleDriver.class);
MultithreadedMapper.setNumberOfThreads(job, 2);
return job.waitForCompletion(true) ? 0 : 1;
}
The mapper code looks like this:
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
...
/**
* Sets up mapper with filter geometry provided as argument[0] to the jar
*/
@Override
public void setup(Context context) {
...
}
@Override
public void map(LongWritable key, Text val, Context context)
throws IOException, InterruptedException {
...
// We know that the first line of the CSV is just headers, so at byte
// offset 0 we can just return
if (key.get() == 0)
return;
String line = val.toString();
String[] values = line.split(";");
float latitude = Float.parseFloat(values[latitudeIndex]);
float longitude = Float.parseFloat(values[longitudeIndex]);
...
// Create our Point directly from longitude and latitude
Point point = new Point(longitude, latitude);
IntWritable one = new IntWritable();
if (...) {
int name = ...
one.set(name);
String out = ...
context.write(new Text(out), one);
} else {
String out = ...
context.write(new Text(out), new IntWritable(-1));
}
}
}
You forgot to set the mapper class on the job. You need to add job.setMapperClass(MultithreadedMapper.class); to your code.
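To expand on that: without job.setMapperClass, the job falls back to the identity Mapper, which passes the LongWritable file offsets straight through as keys, which is exactly the "expected Text, received LongWritable" mismatch in the log. The driver needs both calls: the job's mapper class is MultithreadedMapper itself, and the wrapped mapper is configured through MultithreadedMapper's static setters. A minimal sketch of the relevant lines (class names as in the question):
// run the MultithreadedMapper wrapper and tell it which mapper to delegate to
job.setMapperClass(org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper.class);
MultithreadedMapper.setMapperClass(job, MyMapper.class);   // the real map logic
MultithreadedMapper.setNumberOfThreads(job, 2);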