I am trying to limit the number of lines each mapper receives.
My code looks like this:
package com.iathao.mapreduce;
import java.io.IOException;
import java.net.MalformedURLException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.NLineInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.regexp.RESyntaxException;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
public class Main {
public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException, RESyntaxException {
try {
if (args.length != 2) {
System.err.println("Usage: NewMaxTemperature <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(Main.class);
job.getConfiguration().set("mapred.max.map.failures.percent", "100");
// job.getConfiguration().set("mapred.map.max.attempts", "10");
//NLineInputFormat. .setNumLinesPerSplit(job, 1);
job.setInputFormatClass(NLineInputFormat.class);
On the last line of the sample (job.setInputFormatClass(NLineInputFormat.class);) I get the following error:
The method setInputFormatClass(Class<? extends InputFormat>) in the type Job is not applicable for the arguments (Class<NLineInputFormat>)
Did I somehow get the wrong NLineInputFormat class?
You are mixing the old and the new API.
import org.apache.hadoop.mapred.lib.NLineInputFormat;
import org.apache.hadoop.mapreduce.Job;
According to "Hadoop: The Definitive Guide":
The new API is in the org.apache.hadoop.mapreduce package (and subpackages). The old API can still be found in org.apache.hadoop.mapred.
If you plan to use the new API, use the NLineInputFormat from the new API packages (org.apache.hadoop.mapreduce.lib.input.NLineInputFormat, in Hadoop releases that ship it) instead of the one in org.apache.hadoop.mapred.lib.
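The compiler error comes from the bounded type on setInputFormatClass. The sketch below is a toy model, not Hadoop code: OldApi, NewApi, and ToyJob are made-up stand-ins for the mapred package, the mapreduce package, and Job. It only illustrates why a class with the right simple name but the wrong base type is rejected.

```java
public class ApiMixDemo {
    // Stand-in for org.apache.hadoop.mapred (old API); names are illustrative only.
    static class OldApi {
        interface InputFormat {}
        static class NLineInputFormat implements InputFormat {}
    }

    // Stand-in for org.apache.hadoop.mapreduce (new API); names are illustrative only.
    static class NewApi {
        static abstract class InputFormat {}
        static class NLineInputFormat extends InputFormat {}
    }

    // Stand-in for Job: accepts only subclasses of the new-API InputFormat.
    static class ToyJob {
        private Class<? extends NewApi.InputFormat> inputFormat;

        void setInputFormatClass(Class<? extends NewApi.InputFormat> cls) {
            this.inputFormat = cls;
        }

        String inputFormatName() {
            return inputFormat.getName();
        }
    }

    static String configure() {
        ToyJob job = new ToyJob();
        // job.setInputFormatClass(OldApi.NLineInputFormat.class); // does not compile: wrong base type
        job.setInputFormatClass(NewApi.NLineInputFormat.class);    // compiles: matches the bound
        return job.inputFormatName();
    }

    public static void main(String[] args) {
        System.out.println(configure());
    }
}
```

Uncommenting the old-API line in this toy model produces the same kind of "not applicable for the arguments" error as in the question.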
I have the following class to perform PCA on an ARFF file. I have added the Weka jar to my project, but I am still getting an error saying DataSource cannot be resolved, and I don't know how to resolve it. Can anyone suggest what could be wrong?
package project;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.core.converters.ConverterUtils;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.converters.TextDirectoryLoader;
import weka.gui.visualize.Plot2D;
import weka.gui.visualize.PlotData2D;
import weka.gui.visualize.VisualizePanel;
import java.awt.BorderLayout;
import java.io.File;
import java.util.ArrayList;
import javax.swing.JFrame;
import org.math.plot.FrameView;
import org.math.plot.Plot2DPanel;
import org.math.plot.PlotPanel;
import org.math.plot.plots.ScatterPlot;
import weka.attributeSelection.PrincipalComponents;
import weka.attributeSelection.Ranker;
public class PCA {
public static void main(String[] args) {
try {
// Load the Data.
DataSource source = new DataSource("../data/ingredients.arff");
Instances data = source.getDataSet();
// Perform PCA.
PrincipalComponents pca = new PrincipalComponents();
pca.setVarianceCovered(1.0);
//pca.setCenterData(true);
pca.setNormalize(true);
pca.setTransformBackToOriginal(false);
pca.buildEvaluator(data);
// Show transform data into eigenvector basis.
Instances transformedData = pca.transformedData();
System.out.println(transformedData);
} catch (Exception e) {
e.printStackTrace();
}
}
}
For my project, I need to build a class from the Confluent Java code that writes data from a Kafka topic to the HDFS filesystem.
It already works from the CLI with connect-standalone, but I need to do the same thing with the source code, which I built successfully.
I have a problem with the SinkTask and HdfsSinkConnector classes.
An exception is thrown in the put method.
Here is my class:
package io.confluent.connect.hdfs;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.sink.SinkConnector;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTaskContext;
import io.confluent.connect.avro.AvroData;
import io.confluent.connect.hdfs.avro.AvroFormat;
import io.confluent.connect.hdfs.partitioner.DefaultPartitioner;
import io.confluent.connect.storage.common.StorageCommonConfig;
import io.confluent.connect.storage.partitioner.PartitionerConfig;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.config.ConfigDef;
public class main{
private static Map<String, String> props = new HashMap<>();
protected static final TopicPartition TOPIC_PARTITION = new TopicPartition(TOPIC, PARTITION);
protected static String url = "hdfs://localhost:9000";
protected static SinkTaskContext context;
public static void main(String[] args) {
HdfsSinkConnector hk = new HdfsSinkConnector();
HdfsSinkTask h = new HdfsSinkTask();
props.put(StorageCommonConfig.STORE_URL_CONFIG, url);
props.put(HdfsSinkConnectorConfig.HDFS_URL_CONFIG, url);
props.put(HdfsSinkConnectorConfig.FLUSH_SIZE_CONFIG, "3");
props.put(HdfsSinkConnectorConfig.FORMAT_CLASS_CONFIG, AvroFormat.class.getName());
try {
hk.start(props);
Collection<SinkRecord> sinkRecords = new ArrayList<>();
SinkRecord record = new SinkRecord("test", 0, null, null, null, null, 0);
sinkRecords.add(record);
h.initialize(context);
h.put(sinkRecords);
hk.stop();
} catch (Exception e) {
throw new ConnectException("Couldn't start HdfsSinkConnector due to configuration error", e);
}
}
}
I want to write my own word count example using MapReduce and Hadoop v. 1.0.3 (I'm on macOS), but I don't understand why it doesn't work.
Here is my code:
Main:
package org.myorg;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WordCount {
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
// set job name, mapper, combiner, and reducer classes
conf.setJobName("WordCount");
// set input and output formats
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
// set input and output paths
//FileInputFormat. setInputPaths(conf, new Path(input));
//FileOutputFormat.setOutputPath(conf, new Path(output));
FileOutputFormat.setCompressOutput(conf, false);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(org.myorg.Map.class);
conf.setReducerClass(org.myorg.Reduce.class);
String host = args[0];
String input = host + "/" + args[1];
String output = host + "/" + args[2];
// set input and output paths
FileInputFormat.addInputPath(conf, new Path(input));
FileOutputFormat.setOutputPath(conf, new Path(output));
JobClient j=new JobClient(conf);
(j.submitJob(conf)).waitForCompletion();
}
}
Mapper:
package org.myorg;
import java.io.IOException;
import java.util.HashMap;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.Vector;
import java.util.Map.Entry;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapreduce.Mapper.Context;
public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
@Override
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
MapWritable hs = new MapWritable();
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
//hs.put(word, one);
output.collect(word,one);
}
// TODO Auto-generated method stub
}
}
Reducer:
package org.myorg;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.math.RoundingMode;
import java.net.URI;
import java.net.URISyntaxException;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.Map.Entry;
import java.util.Vector;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapreduce.Reducer.Context;
//public class Reduce extends MapReduceBase implements Reducer<Text, MapWritable, Text, Text> {
public class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, Text> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
String host = "hdfs://localhost:54310/";
String tmp = host + "Temporany/output.txt";
FileSystem srcFS;
try {
srcFS = FileSystem.get(new URI(tmp), new JobConf());
srcFS.delete(new Path(tmp), true);
BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(
srcFS.create(new Path(tmp))));
wr.write(key.toString() + ":" + sum);
wr.close();
srcFS.close();
} catch (URISyntaxException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
// context.write(key, new IntWritable(sum));
}
}
The job starts and ends with no error but doesn't write the output file.
I launch the jar with Hadoop using this command:
./Hadoop jar /Users/User/Desktop/hadoop/wordcount.jar hdfs://localhost:54310 /In/testo.txt /Out/wordcount17
This is the output:
2014-03-03 17:56:22.063 java[6365:1203] Unable to load realm info from SCDynamicStore
14/03/03 17:56:22 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/03/03 17:56:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/03/03 17:56:23 WARN snappy.LoadSnappy: Snappy native library not loaded
14/03/03 17:56:23 INFO mapred.FileInputFormat: Total input paths to process : 1
I suppose the problem is "Unable to load native-hadoop library", but that works fine for other jars.
Q: The job starts and ends with no error but doesn't write the output file?
A: I am not sure that the job really finishes without errors.
Problems:
Job configuration:
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
The setOutputKeyClass() and setOutputValueClass() methods control the
output types for the map and the reduce functions, which are often the
same. If they are different, then the map
output types can be set using the methods setMapOutputKeyClass() and
setMapOutputValueClass().
The output classes in your case are:
Map key: Text
Map value: IntWritable
Reduce key: Text
Reduce value: Text
These do not match the configured output classes, which can result in a type mismatch exception.
Reduce:
I am not sure why you are using the HDFS API to write your output to a file.
You should use output.collect(key, value) instead.
If there are multiple reducers, are you handling the simultaneous write operations?
I also wonder what context.write is doing with the old API (it is commented out).
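One consequence of writing the file by hand is easy to miss: reduce is called once per key, and each call in the question deletes and recreates the same path, so at most the last record written would survive. The plain-Java sketch below models that difference; overwriteEachCall, collectAll, and sample are hypothetical names, and no Hadoop classes are involved.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReduceOutputDemo {
    // Sums the counts for one key, as the reducer body in the question does.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Models deleting and recreating one file per reduce call: only the last record remains.
    static String overwriteEachCall(Map<String, List<Integer>> grouped) {
        String fileContents = "";
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            fileContents = e.getKey() + ":" + sum(e.getValue()); // previous contents are lost
        }
        return fileContents;
    }

    // Models output.collect(key, value): every record is kept.
    static Map<String, Integer> collectAll(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            out.put(e.getKey(), sum(e.getValue()));
        }
        return out;
    }

    // Small made-up input: values already grouped by key, as the framework would deliver them.
    static Map<String, List<Integer>> sample() {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        grouped.put("apple", Arrays.asList(1, 1));
        grouped.put("pear", Arrays.asList(1));
        return grouped;
    }

    public static void main(String[] args) {
        System.out.println(overwriteEachCall(sample())); // pear:1
        System.out.println(collectAll(sample()));        // {apple=2, pear=1}
    }
}
```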
You can use the following to get more information and debug your MapReduce job:
Counters
JobTracker web interface
Q: What is the difference between submitJob() and waitForCompletion()?
A: submitJob() submits the job and returns.
waitForCompletion() submits the job and prints the status of the job on the console.
So waitForCompletion() is submitJob() plus status updates until the job completes.
For a word count example, please read the Apache MapReduce documentation.
You can also find hadoop-examples-X.X.X.jar in your installation folder.
Go through $HADOOP_HOME/src/examples/ for the source code.
($HADOOP_HOME is the Hadoop installation folder.)
I have two files that need to be accessed by the Hadoop cluster: good.txt and bad.txt.
Since both files need to be accessed from different nodes, I place them in the distributed cache in the driver class as follows:
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/training/Rakshith/good.txt"),conf);
DistributedCache.addCacheFile(new URI("/user/training/Rakshith/bad.txt"),conf);
Job job = new Job(conf);
Now both the good and bad files are in the distributed cache. I access the distributed cache in the mapper class as follows:
public class LetterMapper extends Mapper<LongWritable,Text,LongWritable,Text> {
private Path[]files;
@Override
protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException {
files=DistributedCache.getLocalCacheFiles(new Configuration(context.getConfiguration()));
}
I need to check whether a word is present in good.txt or bad.txt, so I use something like this:
File file=new File(files[0].toString()); //to access good.txt
BufferedReader br=new BufferedReader(new FileReader(file));
StringBuilder sb=new StringBuilder();
String input=null;
while((input=br.readLine())!=null){
sb.append(input);
}
input=sb.toString();
I am supposed to get the content of the good file in my input variable, but I don't get it. Have I missed anything?
Does the job finish successfully? The map task may fail because you are using JobConf in this line:
files=DistributedCache.getLocalCacheFiles(new JobConf(context.getConfiguration()));
If you change it like this, it should work; I don't see any problem with the remaining code you posted in the question.
files=DistributedCache.getLocalCacheFiles(context.getConfiguration());
or
files=DistributedCache.getLocalCacheFiles(new Configuration(context.getConfiguration()));
@rVr this is my driver class
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class AvgWordLength {
public static void main(String[] args) throws Exception {
if (args.length !=2) {
System.out.printf("Usage: AvgWordLength <input dir> <output dir>\n");
System.exit(-1);
}
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/training/Rakshith/good.txt"),conf);
DistributedCache.addCacheFile(new URI("/user/training/Rakshith/bad.txt"),conf);
Job job = new Job(conf);
job.setJarByClass(AvgWordLength.class);
job.setJobName("Average Word Length");
FileInputFormat.setInputPaths(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job,new Path(args[1]));
job.setMapperClass(LetterMapper.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}
And my mapper class is
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
public class LetterMapper extends Mapper<LongWritable,Text,LongWritable,Text> {
private Path[]files;
@Override
protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException {
files=DistributedCache.getLocalCacheFiles(new Configuration(context.getConfiguration()));
System.out.println("in setup()"+files.toString());
}
@Override
public void map(LongWritable key, Text value, Context context)throws IOException,InterruptedException{
int i=0;
System.out.println("in map----->>"+files.toString());//added just to view logs
HashMap<String,String> h=new HashMap<String,String>();
String negword=null;
String input=value.toString();
if(isPresent(input,files[0])){
h.put(input,"good");
}
else
if(isPresent(input,files[1])){
h.put(input,"bad");
}
}
public static boolean isPresent(String n,Path files2) throws IOException{
File file=new File(files2.toString());
BufferedReader br=new BufferedReader(new FileReader(file));
StringBuilder sb=new StringBuilder();
String input=null;
while((input=br.readLine().toString())!=null){
sb.append(input.toString());
}
input=sb.toString();
//System.out.println(input);
Pattern pattern=Pattern.compile(n);
Matcher matcher=pattern.matcher(input);
if(matcher.find()){
return true;
}
else
return false;
}
}
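One way to structure the lookup, sketched in plain Java with no Hadoop dependency: read each word list once into a Set and test membership per record, instead of re-reading the file and compiling the input as a regex on every map call. The sketch assumes one word per line; WordListLookup, loadWords, classify, and demo are hypothetical names, and the temp files stand in for the localized cache files. The loop tests the result of readLine() for null before using it, because readLine() returns null at end of file and calling toString() on that null throws a NullPointerException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class WordListLookup {
    // Reads a local file (assumed: one word per line) into a set.
    static Set<String> loadWords(Path file) throws IOException {
        Set<String> words = new HashSet<>();
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) { // null marks end of file
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    // Exact-match lookup against the two word lists.
    static String classify(String word, Set<String> good, Set<String> bad) {
        if (good.contains(word)) {
            return "good";
        }
        if (bad.contains(word)) {
            return "bad";
        }
        return "unknown";
    }

    // Demo with made-up word lists written to temporary files.
    static String demo(String word) {
        try {
            Path good = Files.createTempFile("good", ".txt");
            Path bad = Files.createTempFile("bad", ".txt");
            try {
                Files.write(good, Arrays.asList("happy", "great"), StandardCharsets.UTF_8);
                Files.write(bad, Arrays.asList("awful"), StandardCharsets.UTF_8);
                return classify(word, loadWords(good), loadWords(bad));
            } finally {
                Files.deleteIfExists(good);
                Files.deleteIfExists(bad);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("happy"));
        System.out.println(demo("awful"));
    }
}
```

In a mapper, the two sets would be loaded once in setup() from the paths returned by the distributed cache and reused in every map() call.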
I am running a MapReduce program and need to supply the input text file as key/value pairs, so I wrote
job.setInputFormatClass(KeyValueTextInputFormat.class);
The Eclipse compiler reports an error saying that I can't use this InputFormat.
However, I do need to set the input format to KeyValueTextInputFormat.
How do I do this? Any ideas?
My code is:
`
package com.iot.dictionary;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import com.iot.dictionary.Dictionary.AllTranslationsReducer;
import com.iot.dictionary.Dictionary.WordMapper;
public class Driver2 {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: wordcount <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "dictionary");
System.out.println("Job-> "+job.toString());
job.setJarByClass(Dictionary.class);
job.setMapperClass(WordMapper.class);
job.setReducerClass(AllTranslationsReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
`
If you are using the new Hadoop API (Hadoop 0.20.2 and above), you have to import KeyValueTextInputFormat from the org.apache.hadoop.mapreduce.lib.input package; if you are using the old Hadoop API, you have to import it from org.apache.hadoop.mapred.
Your code contains this line:
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
Change it to:
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
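For context on what this input format hands to the mapper: each line is split into a key and a value at the first separator, which is a tab by default, and a line without a separator becomes the key with an empty value. The plain-Java sketch below mimics that rule; KeyValueLineDemo and split are hypothetical names, not Hadoop classes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class KeyValueLineDemo {
    // Splits a line at the first separator; without a separator the whole line is the key.
    static Map.Entry<String, String> split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = split("dog\tHund\tchien", '\t');
        System.out.println(kv.getKey());   // dog
        System.out.println(kv.getValue()); // Hund<TAB>chien (only the first tab splits)
    }
}
```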
Hope this helps.
Thanks