How can I use MultipleoutputFormai in Hadoop 0.20?

How can I use MultipleoutputFormai in Hadoop 0.20? - java

I am working with Hadoop 0.20 and I want to have two reduce output files instead of one output. I know that MultipleOutputFormat doesn't work in Hadoop 0.20. I added the hadoop1.1.1-core jar file in the build path of my project in Eclipse. But it still shows the last error.
Here is my code:
public static class ReduceStage extends Reducer<IntWritable, BitSetWritable, IntWritable, Text>
{
private MultipleOutputs mos;
public ReduceStage() {
System.out.println("ReduceStage");
}
public void setup(Context context) {
mos = new MultipleOutputs(context);
}
public void reduce(final IntWritable key, final Iterable<BitSetWritable> values, Context output ) throws IOException, InterruptedException
{
mos.write("text1", key, new Text("Hello"));
}
public void cleanup(Context context) throws IOException {
try {
mos.close();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
And in the run():
FileOutputFormat.setOutputPath(job, ConnectedComponents_Nodes);
job.setOutputKeyClass(MultipleTextOutputFormat.class);
MultipleOutputs.addNamedOutput(job, "text1", TextOutputFormat.class,
IntWritable.class, Text.class);
The error is:
java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputName(Lorg/apache/hadoop/mapreduce/JobContext;Ljava/lang/String;)V
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.getRecordWriter(MultipleOutputs.java:409)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:370)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:348)
at bitsetmr$ReduceStage.reduce(bitsetmr.java:179)
at bitsetmr$ReduceStage.reduce(bitsetmr.java:1)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
What can I do to have MultipleOutputFormat? Did I use the code right?

You may go for an overridden extension of MultipleTextOutputFormat and then make all the contents of the record to be the part of 'value', while make the file-name or path to be the key.
There is an oddjob library. They have a range of outputformat implementations. The one which you want is MultipleLeafValueOutputFormat : Writes to the file specified by the key, and only writes the value.
Now,say you have to write the following pairs and your separator is say the tab character ('\t'):
<"key1","value1"> (you want this to be written in filename1)
<"key2","value2"> (you want this to be written in filename2)
So, now the output from reducer would transform into follows:
<"filename1","key1\tvalue1">
<"filename2","key2\tvalue2">
Also, don't forget that the above defined class should be added as the outformat class to the job:
conf.setOutputFormat(MultipleLeafValueOutputFormat.class);
One thing to note here is that you will need to work with the old mapred package rather than the mapreduce package. But that shouldn't be a problem.

Firstly, you should make sure FileOutputFormat.setOutputName has the same code between versions 0.20 and 1.1.1. If not, you must have compatible version to compile your code. If the same, there must be some parameter error in your command.
I encountered the same issue and I removed -Dmapreduce.user.classpath.first=true from run command and it works. hope that helps!

Related

'File#renameTo()' does not work in java

I'm trying to rename an existing file using File#renameTo(), but it doesn't seem to work.
The following code represents what I am trying to do:
public class RenameFileDirectory {
public static void main(String[] args) throws IOException {
new RenameFileDirectory();
}
public RenameFileDirectory() throws IOException {
File file = new File("C:\\Users\\User-PC\\Desktop\\Nouveau dossier2\\file.png");
File desFile = new File ("C:\\Users\\User-PC\\Desktop\\Nouveau dossier2\\file2.png");
if (file.renameTo(desFile)) {
System.out.println("successful rename");
} else {
System.out.println("error");
}
}
}

Try using Files.move instead. If you read the javadocs for renameTo, it states that:
Many aspects of the behavior of this method are inherently platform-dependent: The rename operation might not be able to move a file from one filesystem to another, it might not be atomic, and it might not succeed if a file with the destination abstract pathname already exists. The return value should always be checked to make sure that the rename operation was successful.

Hadoop is skipping reduce phase entirely

I have set up a Hadoop job like so:
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Legion");
job.setJarByClass(Legion.class);
job.setMapperClass(CallQualityMap.class);
job.setReducerClass(CallQualityReduce.class);
// Explicitly configure map and reduce outputs, since they're different classes
job.setMapOutputKeyClass(CallSampleKey.class);
job.setMapOutputValueClass(CallSample.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(CombineRepublicInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
CombineRepublicInputFormat.setMaxInputSplitSize(job, 128000000);
CombineRepublicInputFormat.setInputDirRecursive(job, true);
CombineRepublicInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
This job completes, but something strange happens. I get one output line per input line. Each output line consists of the output from a CallSampleKey.toString() method, then a tab, then something like CallSample#17ab34d.
This means that the reduce phase is never running and the CallSampleKey and CallSample are getting passed directly to the TextOutputFormat. But I don't understand why this would be the case. I've very clearly specified job.setReducerClass(CallQualityReduce.class);, so I have no idea why it would skip the reducer!
Edit: Here's the code for the reducer:
public static class CallQualityReduce extends Reducer<CallSampleKey, CallSample, NullWritable, Text> {
public void reduce(CallSampleKey inKey, Iterator<CallSample> inValues, Context context) throws IOException, InterruptedException {
Call call = new Call(inKey.getId().toString(), inKey.getUuid().toString());
while (inValues.hasNext()) {
call.addSample(inValues.next());
}
context.write(NullWritable.get(), new Text(call.getStats()));
}
}

What if you try to change your
public void reduce(CallSampleKey inKey, Iterator<CallSample> inValues, Context context) throws IOException, InterruptedException {
to use Iterable instead of Iterator?
public void reduce(CallSampleKey inKey, Iterable<CallSample> inValues, Context context) throws IOException, InterruptedException {
You'll have to then use inValues.iterator() to get the actual iterator.
If the method signature doesn't match then it's just falling through to the default identity reducer implementation. It's perhaps unfortunate that the underlying default implementation doesn't make it easy to detect this kind of typo, but the next best thing is to always use #Override in all methods you intend to override so that the compiler can help.

Hadoop: NullPointerException with Custom InputFormat

I've developed a custom InputFormat for Hadoop (including a custom InputSplit and a custom RecordReader) and I'm experiencing a rare NullPointerException.
These classes are going to be used for querying a third-party system which exposes a REST API for records retrieving. Thus, I got inspiration in DBInputFormat, which is a non-HDFS InputFormat as well.
The error I get is the following:
Error: java.lang.NullPointerException at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:762)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
I've searched the code for MapTask (2.1.0 version of Hadoop) and I've seen the problematic part is the initialization of the RecordReader:
472 NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
473 org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
474 TaskReporter reporter,
475 org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
476 throws InterruptedException, IOException {
...
491 this.real = inputFormat.createRecordReader(split, taskContext);
...
494 }
...
519 #Override
520 public void initialize(org.apache.hadoop.mapreduce.InputSplit split,
521 org.apache.hadoop.mapreduce.TaskAttemptContext context
522 ) throws IOException, InterruptedException {
523 long bytesInPrev = getInputBytes(fsStats);
524 real.initialize(split, context);
525 long bytesInCurr = getInputBytes(fsStats);
526 fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
527 }
Of course, the relevant parts of my code:
# MyInputFormat.java
public static void setEnvironmnet(Job job, String host, String port, boolean ssl, String APIKey) {
backend = new Backend(host, port, ssl, APIKey);
}
public static void addResId(Job job, String resId) {
Configuration conf = job.getConfiguration();
String inputs = conf.get(INPUT_RES_IDS, "");
if (inputs.isEmpty()) {
inputs += restId;
} else {
inputs += "," + resId;
}
conf.set(INPUT_RES_IDS, inputs);
}
#Override
public List<InputSplit> getSplits(JobContext job) {
// resulting splits container
List<InputSplit> splits = new ArrayList<InputSplit>();
// get the Job configuration
Configuration conf = job.getConfiguration();
// get the inputs, i.e. the list of resource IDs
String input = conf.get(INPUT_RES_IDS, "");
String[] resIDs = StringUtils.split(input);
// iterate on the resIDs
for (String resID: resIDs) {
splits.addAll(getSplitsResId(resID, job.getConfiguration()));
}
// return the splits
return splits;
}
#Override
public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
if (backend == null) {
logger.info("Unable to create a MyRecordReader, it seems the environment was not properly set");
return null;
}
// create a record reader
return new MyRecordReader(backend, split, context);
}
# MyRecordReader.java
#Override
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
// get start, end and current positions
MyInputSplit inputSplit = (MyInputSplit) this.split;
start = inputSplit.getFirstRecordIndex();
end = start + inputSplit.getLength();
current = 0;
// query the third-party system for the related resource, seeking to the start of the split
records = backend.getRecords(inputSplit.getResId(), start, end);
}
# MapReduceTest.java
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new MapReduceTest(), args);
System.exit(res);
}
#Override
public int run(String[] args) throws Exception {
Configuration conf = this.getConf();
Job job = Job.getInstance(conf, "MapReduce test");
job.setJarByClass(MapReduceTest.class);
job.setMapperClass(MyMap.class);
job.setCombinerClass(MyReducer.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(MyInputFormat.class);
MyInputFormat.addInput(job, "ca73a799-9c71-4618-806e-7bd0ca1911f4");
InputFormat.setEnvironmnet(job, "my.host.com", "443", true, "my_api_key");
FileOutputFormat.setOutputPath(job, new Path(args[0]));
return job.waitForCompletion(true) ? 0 : 1;
}
Any ideas about what is wrong?
BTW, which is the "good" InputSplit the RecordReader must use, the one given to the constructor or the one given in the initialize method? Anyway I've tried both options and the resulting error is the same :)

The way I read your strack trace real is null on line 524.
But don't take my word for it. Slip an assert or system.out.println in there and check the value of real yourself.
NullPointerException almost always means you dotted off something you didn't expect to be null. Some libraries and collections will throw it at you as their way of saying "this can't be null".
Error: java.lang.NullPointerException at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524)
To me this reads as: in the org.apache.hadoop.mapred package the MapTask class has an inner class NewTrackingRecordReader with an initialize method that threw a NullPointerException at line 524.
524 real.initialize( blah, blah) // I actually stopped reading after the dot
this.real was set on line 491.
491 this.real = inputFormat.createRecordReader(split, taskContext);
Assuming you haven't left out any more closely scoped reals that are masking the this.real then we need to look at inputFormat.createRecordReader(split, taskContext); If this can return null then it might be the culprit.
Turns out it will return null when backend is null.
#Override
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit split,
TaskAttemptContext context) {
if (backend == null) {
logger.info("Unable to create a MyRecordReader, " +
"it seems the environment was not properly set");
return null;
}
// create a record reader
return new MyRecordReader(backend, split, context);
}
It looks like setEnvironmnet is supposed to set backend
# MyInputFormat.java
public static void setEnvironmnet(
Job job,
String host,
String port,
boolean ssl,
String APIKey) {
backend = new Backend(host, port, ssl, APIKey);
}
backend must be declared somewhere outside setEnvironment (or you'd be getting a compiler error).
If backend hasn't been set to something non-null upon construction and setEnvironmnet was not called before createRecordReader then you should expect to get exactly the NullPointerException you got.
UPDATE:
As you've noted, since setEnvironmnet() is static backend must be static as well. This means that you must be sure other instances aren't setting it to null.

Solved. The problem is the backend variable is declared as static, i.e. it belongs to the java class and thus any other object changing that variable (e.g. to null) affects all the other objects of the same class.
Now, setEnvironment adds the host, port, ssl usage and the API key as configuration (the same than setResId already did with the resource ID); when createRecordReader is invoked this configuration is got and the backend object is created.
Thanks to CandiedOrange who put me in the right path!

Turn off date comment in properties file [duplicate]

Is it possible to force Properties not to add the date comment in front? I mean something like the first line here:
#Thu May 26 09:43:52 CEST 2011
main=pkg.ClientMain
args=myargs
I would like to get rid of it altogether. I need my config files to be diff-identical unless there is a meaningful change.

Guess not. This timestamp is printed in private method on Properties and there is no property to control that behaviour.
Only idea that comes to my mind: subclass Properties, overwrite store and copy/paste the content of the store0 method so that the date comment will not be printed.
Or - provide a custom BufferedWriter that prints all but the first line (which will fail if you add real comments, because custom comments are printed before the timestamp...)

Given the source code or Properties, no, it's not possible. BTW, since Properties is in fact a hash table and since its keys are thus not sorted, you can't rely on the properties to be always in the same order anyway.
I would use a custom algorithm to store the properties if I had this requirement. Use the source code of Properties as a starter.

Based on https://stackoverflow.com/a/6184414/242042 here is the implementation I have written that strips out the first line and sorts the keys.
public class CleanProperties extends Properties {
private static class StripFirstLineStream extends FilterOutputStream {
private boolean firstlineseen = false;
public StripFirstLineStream(final OutputStream out) {
super(out);
}
#Override
public void write(final int b) throws IOException {
if (firstlineseen) {
super.write(b);
} else if (b == '\n') {
firstlineseen = true;
}
}
}
private static final long serialVersionUID = 7567765340218227372L;
#Override
public synchronized Enumeration<Object> keys() {
return Collections.enumeration(new TreeSet<>(super.keySet()));
}
#Override
public void store(final OutputStream out, final String comments) throws IOException {
super.store(new StripFirstLineStream(out), null);
}
}
Cleaning looks like this
final Properties props = new CleanProperties();
try (final Reader inStream = Files.newBufferedReader(file, Charset.forName("ISO-8859-1"))) {
props.load(inStream);
} catch (final MalformedInputException mie) {
throw new IOException("Malformed on " + file, mie);
}
if (props.isEmpty()) {
Files.delete(file);
return;
}
try (final OutputStream os = Files.newOutputStream(file)) {
props.store(os, "");
}

if you try to modify in the give xxx.conf file it will be useful.
The write method used to skip the First line (#Thu May 26 09:43:52 CEST 2011) in the store method. The write method run till the end of the first line. after it will run normally.
public class CleanProperties extends Properties {
private static class StripFirstLineStream extends FilterOutputStream {
private boolean firstlineseen = false;
public StripFirstLineStream(final OutputStream out) {
super(out);
}
#Override
public void write(final int b) throws IOException {
if (firstlineseen) {
super.write(b);
} else if (b == '\n') {
// Used to go to next line if did use this line
// you will get the continues output from the give file
super.write('\n');
firstlineseen = true;
}
}
}
private static final long serialVersionUID = 7567765340218227372L;
#Override
public synchronized Enumeration<java.lang.Object> keys() {
return Collections.enumeration(new TreeSet<>(super.keySet()));
}
#Override
public void store(final OutputStream out, final String comments)
throws IOException {
super.store(new StripFirstLineStream(out), null);
}
}

Can you not just flag up in your application somewhere when a meaningful configuration change takes place and only write the file if that is set?
You might want to look into Commons Configuration which has a bit more flexibility when it comes to writing and reading things like properties files. In particular, it has methods which attempt to write the exact same properties file (including spacing, comments etc) as the existing properties file.

You can handle this question by following this Stack Overflow post to retain order:
Write in a standard order:
How can I write Java properties in a defined order?
Then write the properties to a string and remove the comments as needed. Finally write to a file.
ByteArrayOutputStream baos = new ByteArrayOutputStream();
properties.store(baos,null);
String propertiesData = baos.toString(StandardCharsets.UTF_8.name());
propertiesData = propertiesData.replaceAll("^#.*(\r|\n)+",""); // remove all comments
FileUtils.writeStringToFile(fileTarget,propertiesData,StandardCharsets.UTF_8);
// you may want to validate the file is readable by reloading and doing tests to validate the expected number of keys matches
InputStream is = new FileInputStream(fileTarget);
Properties testResult = new Properties();
testResult.load(is);

parsing twitter json using mapreduce: Java, Pig

I'm sure you might find this question somewhat 'duplicate' but I'm sure I've done my research before posting the same. I also apologize for posting Java & Pig issues in one single thread here but just don't want to create another thread for same problem.
I got a json file with some twitter extracts. I'm trying to perform the parse using java MR & Pig as well, but facing issues. Below is my Java code that I tried writing:
public class twitterDataStore {
private static final ObjectMapper mapper = new ObjectMapper();
public static abstract class Map extends MapReduceBase implements
Mapper<Object, Text, IntWritable, Reporter>{
private static final IntWritable one = new IntWritable();
private Text word = new Text();
public void map(Object key, Text value, OutputCollector<Text, IntWritable> context, Reporter arg3)
throws IOException{
try{
JSONObject jsonObj = new JSONObject(value.toString());
String text = (String) jsonObj.get("retweet_count");
StringTokenizer strToken = new StringTokenizer(text);
while(strToken.hasMoreTokens()){
word.set(strToken.nextToken());
context.collect(word, one);
}
}catch(IOException e){
e.printStackTrace();
} catch (JSONException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Object, Text, IntWritable, Reporter>{
#Override
public void reduce(Object key, Iterator<Text> value,
OutputCollector<IntWritable, Reporter> context, Reporter arg3)
throws IOException {
while(value.hasNext()){
System.out.println(value);
if(value.equals("retweet_count"))
{
System.out.println(value.equals("id_str"));
}
}
// TODO Auto-generated method stub
}
}
public static void main(String[] args) throws IOException{
JobConf conf = new JobConf(twitterDataStore.class);
conf.setJobName("twitterDataStore");
conf.setMapperClass(Map.class);
conf.setReducerClass(Reducer.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
}
}
The issue is, and you might have gotten it by now, is that I can't do the parsing when I execute the jar, most probably because the json jar isn't included in the package. I tried going with the information provided here: "parsing json input in hadoop java". But I can't get pass any option. Whatever #Tejas Patil has suggested and #Fraz tried, I couldn't get anything working for my problem. I'll paste the warning generated here also for an FYI:
14/04/14 21:09:22 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Coming to Pig(version 0.11) loading, I wrote a JsonLoader load to load my tweet data :
data = LOAD '/tmp/twitter.txt' using JsonLoader('in_reply_to_screen_name:chararray,text:chararray,id_str:long,place:chararray,in_reply_to_status_id:chararray, contributors:chararray,retweet_count:CHARARRAY,favorited:chararray,truncated:chararray,source:chararray,in_reply_to_status_id_str:chararray,created_at:chararray, in_reply_to_user_id_str:chararray,in_reply_to_user_id:chararray,user:{(lang:chararray,profile_background_image_url:chararray,id_str:long,default_profile_image: chararray,statuses_count:chararray,profile_link_color:chararray,favourites_count:chararray,profile_image_url_https:chararray,following:chararray, profile_background_color:chararray,description:chararray,notifications:chararray,profile_background_tile:chararray,time_zone:chararray, profile_sidebar_fill_color:chararray,listed_count:chararray,contributors_enabled:chararray,geo_enabled:chararray,created_at:chararray,screen_name:chararray, follow_request_sent:chararray,profile_sidebar_border_color:chararray,protected:chararray,url:chararray,default_profile:chararray,name:chararray, is_translator:chararray,show_all_inline_media:chararray,verified:chararray,profile_use_background_image:chararray,followers_count:chararray,profile_image_url:chararray, id:long,profile_background_image_url_https:chararray,utc_offset:chararray,friends_count:chararray,profile_text_color:chararray,location:chararray)},retweeted:chararray, id:long,coordinates:chararray,geo:chararray');
Sorry for pasting everything unnecessarily here but just don't want to miss anything in the explanation.
I was facing issues with declaring some of the fields as 'integer' but when I converted all integers to chararray, the command passed the check. The error I'm getting here is
2014-04-14 21:19:24,977 [JobControl] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2014-04-14 21:19:24,982 [JobControl] WARN org.apache.hadoop.mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
The same issue with parsing. I tried registering the json jar before this load, but still the same problem. Can anyone help me out in resolving the issue?
Thanks in advance.
-Adil

You are using TextInputFormat so when you do value.toString(), you get a single line as an input. Not the whole JSON object.
I recommend you use this:
https://github.com/alexholmes/json-mapreduce#an-inputformat-to-work-with-splittable-multi-line-json

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How can I use MultipleoutputFormai in Hadoop 0.20? - java

Related

'File#renameTo()' does not work in java

Hadoop is skipping reduce phase entirely

Hadoop: NullPointerException with Custom InputFormat

Turn off date comment in properties file [duplicate]

parsing twitter json using mapreduce: Java, Pig

Categories

Resources