Hadoop: NullPointerException with Custom InputFormat


I've developed a custom InputFormat for Hadoop (including a custom InputSplit and a custom RecordReader) and I'm getting a strange NullPointerException.
These classes will be used to query a third-party system that exposes a REST API for retrieving records. I therefore took inspiration from DBInputFormat, which is also a non-HDFS InputFormat.
The error I get is the following:
Error: java.lang.NullPointerException
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:762)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
I looked at the source of MapTask (Hadoop 2.1.0) and the problematic part is the initialization of the RecordReader:
472 NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
473 org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
474 TaskReporter reporter,
475 org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
476 throws InterruptedException, IOException {
...
491 this.real = inputFormat.createRecordReader(split, taskContext);
...
494 }
...
519 @Override
520 public void initialize(org.apache.hadoop.mapreduce.InputSplit split,
521 org.apache.hadoop.mapreduce.TaskAttemptContext context
522 ) throws IOException, InterruptedException {
523 long bytesInPrev = getInputBytes(fsStats);
524 real.initialize(split, context);
525 long bytesInCurr = getInputBytes(fsStats);
526 fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
527 }
Of course, the relevant parts of my code:
# MyInputFormat.java
public static void setEnvironmnet(Job job, String host, String port, boolean ssl, String APIKey) {
backend = new Backend(host, port, ssl, APIKey);
}
public static void addResId(Job job, String resId) {
Configuration conf = job.getConfiguration();
String inputs = conf.get(INPUT_RES_IDS, "");
if (inputs.isEmpty()) {
inputs += resId;
} else {
inputs += "," + resId;
}
conf.set(INPUT_RES_IDS, inputs);
}
@Override
public List<InputSplit> getSplits(JobContext job) {
// resulting splits container
List<InputSplit> splits = new ArrayList<InputSplit>();
// get the Job configuration
Configuration conf = job.getConfiguration();
// get the inputs, i.e. the list of resource IDs
String input = conf.get(INPUT_RES_IDS, "");
String[] resIDs = StringUtils.split(input);
// iterate on the resIDs
for (String resID: resIDs) {
splits.addAll(getSplitsResId(resID, job.getConfiguration()));
}
// return the splits
return splits;
}
@Override
public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
if (backend == null) {
logger.info("Unable to create a MyRecordReader, it seems the environment was not properly set");
return null;
}
// create a record reader
return new MyRecordReader(backend, split, context);
}
# MyRecordReader.java
@Override
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
// get start, end and current positions
MyInputSplit inputSplit = (MyInputSplit) this.split;
start = inputSplit.getFirstRecordIndex();
end = start + inputSplit.getLength();
current = 0;
// query the third-party system for the related resource, seeking to the start of the split
records = backend.getRecords(inputSplit.getResId(), start, end);
}
# MapReduceTest.java
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new MapReduceTest(), args);
System.exit(res);
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = this.getConf();
Job job = Job.getInstance(conf, "MapReduce test");
job.setJarByClass(MapReduceTest.class);
job.setMapperClass(MyMap.class);
job.setCombinerClass(MyReducer.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(MyInputFormat.class);
MyInputFormat.addResId(job, "ca73a799-9c71-4618-806e-7bd0ca1911f4");
MyInputFormat.setEnvironmnet(job, "my.host.com", "443", true, "my_api_key");
FileOutputFormat.setOutputPath(job, new Path(args[0]));
return job.waitForCompletion(true) ? 0 : 1;
}
Any ideas about what is wrong?
BTW, which is the "right" InputSplit for the RecordReader to use: the one given to the constructor or the one given to the initialize method? Anyway, I've tried both options and the resulting error is the same :)

The way I read your stack trace, real is null on line 524.
But don't take my word for it. Slip an assert or System.out.println in there and check the value of real yourself.
NullPointerException almost always means you dotted off something you didn't expect to be null. Some libraries and collections will throw it at you as their way of saying "this can't be null".
Error: java.lang.NullPointerException
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524)
To me this reads as: in the org.apache.hadoop.mapred package the MapTask class has an inner class NewTrackingRecordReader with an initialize method that threw a NullPointerException at line 524.
524 real.initialize( blah, blah) // I actually stopped reading after the dot
this.real was set on line 491.
491 this.real = inputFormat.createRecordReader(split, taskContext);
Assuming you haven't left out any more closely scoped variables named real that shadow this.real, we need to look at inputFormat.createRecordReader(split, taskContext). If this can return null, then it might be the culprit.
Turns out it will return null when backend is null.
@Override
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit split,
TaskAttemptContext context) {
if (backend == null) {
logger.info("Unable to create a MyRecordReader, " +
"it seems the environment was not properly set");
return null;
}
// create a record reader
return new MyRecordReader(backend, split, context);
}
It looks like setEnvironmnet is supposed to set backend:
# MyInputFormat.java
public static void setEnvironmnet(
Job job,
String host,
String port,
boolean ssl,
String APIKey) {
backend = new Backend(host, port, ssl, APIKey);
}
backend must be declared somewhere outside setEnvironmnet (or you'd be getting a compiler error).
If backend hasn't been set to something non-null upon construction and setEnvironmnet was not called before createRecordReader, then you should expect to get exactly the NullPointerException you got.
UPDATE:
As you've noted, since setEnvironmnet() is static, backend must be static as well. This means that you must be sure other instances aren't setting it to null.
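A failure like this is easier to diagnose if createRecordReader fails fast instead of returning null. The following is only a minimal sketch of that idea: Backend, MyRecordReader and FailFastInputFormat are simplified stand-ins written for this example (they do not extend the Hadoop classes), so that it runs without Hadoop on the classpath.

```java
import java.util.Objects;

// Hypothetical stand-in for the asker's Backend class.
class Backend {
    final String host;

    Backend(String host) {
        this.host = host;
    }
}

// Hypothetical stand-in for the asker's record reader.
class MyRecordReader {
    final Backend backend;

    MyRecordReader(Backend backend) {
        this.backend = Objects.requireNonNull(backend, "backend");
    }
}

class FailFastInputFormat {
    // Static, as in the question: visible only inside the JVM that assigned it.
    static Backend backend;

    MyRecordReader createRecordReader() {
        if (backend == null) {
            // Throw here instead of returning null, so the error names the real cause
            // rather than surfacing later as a NullPointerException in the caller.
            throw new IllegalStateException(
                    "backend is not set in this JVM; was setEnvironmnet called here?");
        }
        return new MyRecordReader(backend);
    }
}

class FailFastDemo {
    public static void main(String[] args) {
        try {
            new FailFastInputFormat().createRecordReader();
        } catch (IllegalStateException e) {
            System.out.println("Failed early: " + e.getMessage());
        }
    }
}
```

With an exception thrown at the point where backend is found to be missing, the stack trace points at the input format itself rather than at MapTask.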

Solved. The problem is that the backend variable is declared as static, i.e. it belongs to the Java class, and thus any other object changing that variable (e.g. to null) affects all the other objects of the same class.
Now, setEnvironmnet adds the host, port, SSL usage and API key to the job configuration (the same way addResId already did with the resource ID); when createRecordReader is invoked, this configuration is read and the backend object is created.
Thanks to CandiedOrange, who put me on the right path!
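The fix described above can be sketched roughly as follows. Note that map tasks run in their own JVM (the stack trace goes through YarnChild.main), so a static field assigned in the driver is never assigned in the task, whereas the job configuration does travel with the job. In this sketch a plain Map stands in for Hadoop's Configuration so that it runs without Hadoop; the property names, the Backend fields and the createBackend helper are assumptions made for illustration, not the asker's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the asker's Backend class.
class Backend {
    final String host;
    final String port;
    final boolean ssl;
    final String apiKey;

    Backend(String host, String port, boolean ssl, String apiKey) {
        this.host = host;
        this.port = port;
        this.ssl = ssl;
        this.apiKey = apiKey;
    }
}

class ConfiguredInputFormat {
    // Hypothetical property names; any keys unique to the job would do.
    static final String HOST = "myinputformat.host";
    static final String PORT = "myinputformat.port";
    static final String SSL = "myinputformat.ssl";
    static final String API_KEY = "myinputformat.apikey";

    // Driver side: store plain strings in the configuration, not objects in static fields.
    static void setEnvironment(Map<String, String> conf, String host, String port,
                               boolean ssl, String apiKey) {
        conf.put(HOST, host);
        conf.put(PORT, port);
        conf.put(SSL, Boolean.toString(ssl));
        conf.put(API_KEY, apiKey);
    }

    // Task side: rebuild the backend from whatever configuration the task received.
    static Backend createBackend(Map<String, String> conf) {
        String host = conf.get(HOST);
        if (host == null) {
            throw new IllegalStateException("Missing " + HOST + " in the job configuration");
        }
        return new Backend(host, conf.get(PORT),
                Boolean.parseBoolean(conf.get(SSL)), conf.get(API_KEY));
    }
}

class ConfiguredDemo {
    public static void main(String[] args) {
        Map<String, String> driverConf = new HashMap<>();
        ConfiguredInputFormat.setEnvironment(driverConf, "my.host.com", "443", true, "my_api_key");

        // Simulate the hand-off to a task JVM: only the key/value pairs travel.
        Map<String, String> taskConf = new HashMap<>(driverConf);
        Backend backend = ConfiguredInputFormat.createBackend(taskConf);
        System.out.println(backend.host + ":" + backend.port + " ssl=" + backend.ssl);
    }
}
```

In the real input format, createRecordReader would read the same keys from context.getConfiguration() and build the Backend there.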

Related

Hadoop is skipping reduce phase entirely

I have set up a Hadoop job like so:
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Legion");
job.setJarByClass(Legion.class);
job.setMapperClass(CallQualityMap.class);
job.setReducerClass(CallQualityReduce.class);
// Explicitly configure map and reduce outputs, since they're different classes
job.setMapOutputKeyClass(CallSampleKey.class);
job.setMapOutputValueClass(CallSample.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(CombineRepublicInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
CombineRepublicInputFormat.setMaxInputSplitSize(job, 128000000);
CombineRepublicInputFormat.setInputDirRecursive(job, true);
CombineRepublicInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
This job completes, but something strange happens. I get one output line per input line. Each output line consists of the output from a CallSampleKey.toString() method, then a tab, then something like CallSample@17ab34d.
This means that the reduce phase is never running and the CallSampleKey and CallSample are getting passed directly to the TextOutputFormat. But I don't understand why this would be the case. I've very clearly specified job.setReducerClass(CallQualityReduce.class);, so I have no idea why it would skip the reducer!
Edit: Here's the code for the reducer:
public static class CallQualityReduce extends Reducer<CallSampleKey, CallSample, NullWritable, Text> {
public void reduce(CallSampleKey inKey, Iterator<CallSample> inValues, Context context) throws IOException, InterruptedException {
Call call = new Call(inKey.getId().toString(), inKey.getUuid().toString());
while (inValues.hasNext()) {
call.addSample(inValues.next());
}
context.write(NullWritable.get(), new Text(call.getStats()));
}
}

What if you try to change your
public void reduce(CallSampleKey inKey, Iterator<CallSample> inValues, Context context) throws IOException, InterruptedException {
to use Iterable instead of Iterator?
public void reduce(CallSampleKey inKey, Iterable<CallSample> inValues, Context context) throws IOException, InterruptedException {
You'll have to then use inValues.iterator() to get the actual iterator.
If the method signature doesn't match, your method is an overload rather than an override, so the framework falls through to the default identity reducer implementation. It's perhaps unfortunate that the underlying default implementation doesn't make it easy to detect this kind of typo, but the next best thing is to always use @Override on all methods you intend to override so that the compiler can help.
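The overload-versus-override effect can be reproduced without Hadoop. In the sketch below, SimpleReducer is a simplified stand-in invented for this example (it is not Hadoop's Reducer): its run method always calls reduce(K, Iterable<V>), so a subclass that declares reduce(K, Iterator<V>) is silently ignored.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for a framework base class: run() only ever calls reduce(K, Iterable<V>).
class SimpleReducer<K, V> {
    final StringBuilder out = new StringBuilder();

    protected void reduce(K key, Iterable<V> values) {
        // Default behaviour: identity, one output line per input value.
        for (V value : values) {
            out.append(key).append('\t').append(value).append('\n');
        }
    }

    final void run(K key, List<V> values) {
        reduce(key, values);
    }
}

class OverloadedReducer extends SimpleReducer<String, Integer> {
    // Iterator instead of Iterable: this is an overload, so run() never calls it.
    // Adding @Override here would turn the mistake into a compile error.
    protected void reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        out.append(key).append('\t').append(sum).append('\n');
    }
}

class OverridingReducer extends SimpleReducer<String, Integer> {
    @Override
    protected void reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        out.append(key).append('\t').append(sum).append('\n');
    }
}

class ReducerSignatureDemo {
    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3);

        OverloadedReducer wrong = new OverloadedReducer();
        wrong.run("k", values);

        OverridingReducer right = new OverridingReducer();
        right.run("k", values);

        System.out.print("overloaded:\n" + wrong.out + "overriding:\n" + right.out);
    }
}
```

The overloaded version emits one line per value, which matches the one-output-line-per-input-line symptom in the question; the overriding version emits a single summed line.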

Hadoop - How to extract a taskId from mapred.JobConf?

Is it possible to create a valid *mapreduce*.TaskAttemptID from *mapred*.JobConf?
The background
I need to write a FileInputFormatAdapter for an ExistingFileInputFormat. The problem is that the Adapter needs to extend mapred.InputFormat and the Existing format extends mapreduce.InputFormat.
I need to build a mapreduce.TaskAttemptContextImpl, so that I can instantiate the ExistingRecordReader. However, I can't create a valid TaskId...the taskId comes out as null.
So how can I get the taskId, jobId, etc. from mapred.JobConf?
In particular in the Adapter's getRecordReader I need to do something like:
public org.apache.hadoop.mapred.RecordReader<NullWritable, MyWritable> getRecordReader(
org.apache.hadoop.mapred.InputSplit split, JobConf job, Reporter reporter) throws IOException {
SplitAdapter splitAdapter = (SplitAdapter) split;
final Configuration conf = job;
/*************************************************/
//The problem is here, "mapred.task.id" is not in the conf
/*************************************************/
final TaskAttemptID taskId = TaskAttemptID.forName(conf.get("mapred.task.id"));
final TaskAttemptContext context = new TaskAttemptContextImpl(conf, taskId);
try {
return new RecordReaderAdapter(new ExistingRecordReader(
splitAdapter.getMapRedeuceSplit(),
context));
} catch (InterruptedException e) {
throw new RuntimeException("Failed to create record-reader.", e);
}
}
This code throws an exception:
Caused by: java.lang.NullPointerException
at org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl.<init>(TaskAttemptContextImpl.java:44)
at org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl.<init>(TaskAttemptContextImpl.java:39)
'super(conf, taskId.getJobID());' is throwing the exception, most likely because taskId is null.

I found the answer by looking through HiveHbaseTableInputFormat. Since my solution is targeted at Hive, this works perfectly.
TaskAttemptContext tac = ShimLoader.getHadoopShims().newTaskAttemptContext(
job.getConfiguration(), reporter);

Best way to distribute a small lookup file using Distributed Cache

Which is the best way to read data from the distributed cache?
public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
ArrayList<String> globalFreq = new ArrayList<String>();
public void setup(Context context) throws IOException{
Configuration conf = context.getConfiguration();
FileSystem fs = FileSystem.get(conf);
URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
Path getPath = new Path(cacheFiles[0].getPath());
BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
String setupData = null;
while ((setupData = bf.readLine()) != null) {
String [] parts = setupData.split(" ");
globalFreq.add(parts[0]);
}
}
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
//Accessing "globalFreq" data .and do further processing
}
}
OR
public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
URI[] cacheFiles;
FileSystem fs; // kept as a field so that map can use it
public void setup(Context context) throws IOException{
Configuration conf = context.getConfiguration();
fs = FileSystem.get(conf);
cacheFiles = DistributedCache.getCacheFiles(conf);
}
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
ArrayList<String> globalFreq = new ArrayList<String>();
Path getPath = new Path(cacheFiles[0].getPath());
BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
String setupData = null;
while ((setupData = bf.readLine()) != null) {
String [] parts = setupData.split(" ");
globalFreq.add(parts[0]);
}
}
}
If we do it as in code 2 and have, say, 5 map tasks, does every map task read the same copy of the data? And since the read happens inside map, does each task read the data multiple times (not just 5 times in total)? Am I right?
In code 1 the file is read in setup, so it is read once per task and the loaded data is then accessed in map.
Which is the right way to use the distributed cache?

Do as much as you can in the setup method: it is called once by each mapper, and whatever it loads stays cached for every record that is passed to the mapper. Parsing your data for each record is overhead you can avoid, since nothing there depends on the key, value and context variables you receive in the map method.
The setup method will be called per map task, but map will be called for each record processed by that task (which can clearly be a very high number).
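As a rough illustration of the difference, here is a stand-alone sketch that counts how often the lookup data is parsed. LookupMapper is a hypothetical stand-in for a mapper (it does not extend Hadoop's Mapper), and the list of strings stands in for the lines of the cached file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a mapper that uses a small lookup file.
class LookupMapper {
    private final List<String> cacheFileLines;       // pretend contents of the cached file
    final List<String> globalFreq = new ArrayList<>();
    int parseCount = 0;                               // how many times the file was parsed

    LookupMapper(List<String> cacheFileLines) {
        this.cacheFileLines = cacheFileLines;
    }

    // Called once per map task: parse the lookup data here.
    void setup() {
        parseCount++;
        for (String line : cacheFileLines) {
            globalFreq.add(line.split(" ")[0]);
        }
    }

    // Called once per input record: only reads the already parsed data.
    boolean map(String record) {
        return globalFreq.contains(record);
    }
}

class SetupDemo {
    public static void main(String[] args) {
        LookupMapper mapper = new LookupMapper(Arrays.asList("apple 10", "pear 3"));
        mapper.setup();

        int hits = 0;
        for (String record : Arrays.asList("apple", "plum", "pear", "apple")) {
            if (mapper.map(record)) {
                hits++;
            }
        }
        System.out.println("parsed " + mapper.parseCount + " time(s), " + hits + " hits");
    }
}
```

If the parsing loop were moved into map, parseCount would grow with the number of records instead of staying at one per task.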

Turn off date comment in properties file [duplicate]

Is it possible to force Properties not to add the date comment in front? I mean something like the first line here:
#Thu May 26 09:43:52 CEST 2011
main=pkg.ClientMain
args=myargs
I would like to get rid of it altogether. I need my config files to be diff-identical unless there is a meaningful change.

Guess not. This timestamp is printed in a private method of Properties and there is no property to control that behaviour.
The only idea that comes to my mind: subclass Properties, override store and copy/paste the content of the store0 method so that the date comment is not printed.
Or provide a custom BufferedWriter that prints all but the first line (which will fail if you add real comments, because custom comments are printed before the timestamp...)

Given the source code of Properties, no, it's not possible. BTW, since Properties is in fact a hash table and its keys are thus not sorted, you can't rely on the properties always being in the same order anyway.
I would use a custom algorithm to store the properties if I had this requirement. Use the source code of Properties as a starter.
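One possible shape for such a custom store, sketched under the assumption that dropping every comment line is acceptable: let Properties do the escaping into a StringWriter, then discard the lines starting with # and sort the rest. The class and method names are made up for this example.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Properties;
import java.util.stream.Collectors;

class SortedPropertiesWriter {
    // Returns the properties as text with no comment lines and with entries sorted.
    static String toCleanString(Properties props) {
        StringWriter buffer = new StringWriter();
        try {
            // Let Properties handle the escaping; each entry ends up on a single line.
            props.store(buffer, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.stream(buffer.toString().split("\\R"))
                .filter(line -> !line.isEmpty() && !line.startsWith("#")) // drops the date comment
                .sorted()
                .collect(Collectors.joining("\n", "", "\n"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("main", "pkg.ClientMain");
        props.setProperty("args", "myargs");
        System.out.print(toCleanString(props));
    }
}
```

Because the output depends only on the keys and values, two runs with the same properties produce identical text, which is what makes the files diff-friendly.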

Based on https://stackoverflow.com/a/6184414/242042 here is the implementation I have written that strips out the first line and sorts the keys.
public class CleanProperties extends Properties {
private static class StripFirstLineStream extends FilterOutputStream {
private boolean firstlineseen = false;
public StripFirstLineStream(final OutputStream out) {
super(out);
}
@Override
public void write(final int b) throws IOException {
if (firstlineseen) {
super.write(b);
} else if (b == '\n') {
firstlineseen = true;
}
}
}
private static final long serialVersionUID = 7567765340218227372L;
@Override
public synchronized Enumeration<Object> keys() {
return Collections.enumeration(new TreeSet<>(super.keySet()));
}
@Override
public void store(final OutputStream out, final String comments) throws IOException {
super.store(new StripFirstLineStream(out), null);
}
}
Cleaning looks like this:
final Properties props = new CleanProperties();
try (final Reader inStream = Files.newBufferedReader(file, Charset.forName("ISO-8859-1"))) {
props.load(inStream);
} catch (final MalformedInputException mie) {
throw new IOException("Malformed on " + file, mie);
}
if (props.isEmpty()) {
Files.delete(file);
return;
}
try (final OutputStream os = Files.newOutputStream(file)) {
props.store(os, "");
}

This is useful if you are modifying a given xxx.conf file.
The write method skips the first line (#Thu May 26 09:43:52 CEST 2011) emitted by the store method: it discards everything up to the end of the first line and writes normally after that.
public class CleanProperties extends Properties {
private static class StripFirstLineStream extends FilterOutputStream {
private boolean firstlineseen = false;
public StripFirstLineStream(final OutputStream out) {
super(out);
}
@Override
public void write(final int b) throws IOException {
if (firstlineseen) {
super.write(b);
} else if (b == '\n') {
// Still write the line break; without this line the output
// continues directly from the existing content of the given file
super.write('\n');
firstlineseen = true;
}
}
}
private static final long serialVersionUID = 7567765340218227372L;
@Override
public synchronized Enumeration<java.lang.Object> keys() {
return Collections.enumeration(new TreeSet<>(super.keySet()));
}
@Override
public void store(final OutputStream out, final String comments)
throws IOException {
super.store(new StripFirstLineStream(out), null);
}
}

Can you not just flag up in your application somewhere when a meaningful configuration change takes place and only write the file if that is set?
You might want to look into Commons Configuration which has a bit more flexibility when it comes to writing and reading things like properties files. In particular, it has methods which attempt to write the exact same properties file (including spacing, comments etc) as the existing properties file.

You can handle this by first writing the properties in a standard order, as described in this Stack Overflow post:
How can I write Java properties in a defined order?
Then write the properties to a string and remove the comments as needed. Finally, write the string to a file.
ByteArrayOutputStream baos = new ByteArrayOutputStream();
properties.store(baos, null);
String propertiesData = baos.toString(StandardCharsets.UTF_8.name());
// (?m) makes ^ match at the start of every line, so all comment lines are removed
propertiesData = propertiesData.replaceAll("(?m)^#.*(\r|\n)+", "");
FileUtils.writeStringToFile(fileTarget, propertiesData, StandardCharsets.UTF_8);
// you may want to validate that the file is readable by reloading it and checking that the expected number of keys matches
Properties testResult = new Properties();
try (InputStream is = new FileInputStream(fileTarget)) {
testResult.load(is);
}

Is Hadoop's ToolRunner thread-safe?

I would like to trigger a few Hadoop jobs simultaneously. I’ve created a pool of threads using Executors.newFixedThreadPool. The idea is that if the pool size is 2, my code will trigger 2 Hadoop jobs at exactly the same time using ‘ToolRunner.run’. In my testing, I noticed that these 2 threads keep stepping on each other.
When I looked under the hood, I noticed that ToolRunner creates a GenericOptionsParser, which in turn calls a static method ‘buildGeneralOptions’. This method uses ‘OptionBuilder.withArgName’, which sets a static variable called ‘argName’. This doesn’t look thread-safe to me and I believe it is the root cause of the issues I am running into.
Any thoughts?

Confirmed that ToolRunner is NOT thread-safe:
Original code (which runs into problems):
public static int run(Configuration conf, Tool tool, String[] args)
throws Exception{
if(conf == null) {
conf = new Configuration();
}
GenericOptionsParser parser = new GenericOptionsParser(conf, args);
//set the configuration back, so that Tool can configure itself
tool.setConf(conf);
//get the args w/o generic hadoop args
String[] toolArgs = parser.getRemainingArgs();
return tool.run(toolArgs);
}
New code (which works):
public static int run(Configuration conf, Tool tool, String[] args)
throws Exception{
if(conf == null) {
conf = new Configuration();
}
GenericOptionsParser parser = getParser(conf, args);
tool.setConf(conf);
//get the args w/o generic hadoop args
String[] toolArgs = parser.getRemainingArgs();
return tool.run(toolArgs);
}
private static synchronized GenericOptionsParser getParser(Configuration conf, String[] args) throws Exception {
return new GenericOptionsParser(conf, args);
}
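The same idea can be tried out without patching Hadoop: serialize only the step that touches shared static state and let everything else run in parallel. The sketch below is self-contained; StaticStateBuilder is a made-up stand-in for a builder that keeps its working state in a static field, and SafeLauncher is a hypothetical wrapper, neither is part of Hadoop or Commons CLI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a builder whose working state lives in a static field,
// which is what makes unsynchronized concurrent use unsafe.
class StaticStateBuilder {
    private static String argName;

    static String build(String name) {
        argName = name;
        Thread.yield();                 // widen the window in which another thread could interfere
        String built = "option:" + argName;
        argName = null;
        return built;
    }
}

class SafeLauncher {
    private static final Object PARSER_LOCK = new Object();

    // Only the non-thread-safe step is serialized; the rest of each job may run in parallel.
    static String parse(String name) {
        synchronized (PARSER_LOCK) {
            return StaticStateBuilder.build(name);
        }
    }

    static List<String> runAll(int jobs, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < jobs; i++) {
                final String name = "job" + i;
                Callable<String> task = () -> parse(name);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(8, 2));
    }
}
```

In application code the equivalent would be to hold a lock of your own while constructing the GenericOptionsParser and to release it before calling tool.run, so that the jobs themselves still overlap.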
