Error - Hadoop Word Count Program in MapReduce - java

I am new to Hadoop, so pardon me if this looks like a silly question.
I am running the MapReduce program below and getting the following error:
java.lang.Exception: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1019)
Any help is appreciated.
public class WordCount {
// Mapper Class
public static class MapperClass extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
// Mapper method defined
public void mapperMethod(Object key,Text lineContent,Context context){
try{
StringTokenizer strToken = new StringTokenizer(lineContent.toString());
//Iterating through the line
while(strToken.hasMoreTokens()){
word.set(strToken.nextToken());
try{
context.write(word, one);
}
catch(Exception e){
System.err.println(new Date()+" ---> Cannot write data to hadoop in Mapper.");
e.printStackTrace();
}
}
}
catch(Exception ex){
ex.printStackTrace();
}
}
}
// Reducer Class
public static class ReducerClass extends Reducer<Text, IntWritable, Text, IntWritable>{
private IntWritable result = new IntWritable();
//Reducer method
public void reduce(Text key,Iterable<IntWritable> values,Context context){
try{
int sum=0;
for(IntWritable itr : values){
sum+=itr.get();
}
result.set(sum);
try {
context.write(key,result);
} catch (Exception e) {
System.err.println(new Date()+" ---> Error while sending data to Hadoop in Reducer");
e.printStackTrace();
}
}
catch (Exception err){
err.printStackTrace();
}
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
try{
Configuration conf = new Configuration();
String [] arguments = new GenericOptionsParser(conf, args).getRemainingArgs();
if(arguments.length!=2){
System.err.println("Enter both and input and output location.");
System.exit(1);
}
Job job = new Job(conf,"Simple Word Count");
job.setJarByClass(WordCount.class);
job.setMapperClass(MapperClass.class);
job.setReducerClass(ReducerClass.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(arguments[0]));
FileOutputFormat.setOutputPath(job, new Path(arguments[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
catch(Exception e){
}
}
}

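You need to override the map method in the Mapper class; instead, you have defined a new method (mapperMethod).
As for your error: because map is not overridden, the framework runs the inherited default map, which is an identity function and passes the input key and value (LongWritable, Text) straight through. The job expects Text keys from the mapper, so it fails with the type mismatch you see. In your code it is enough to rename mapperMethod to map (and annotate it with @Override so the compiler checks the signature).
Hope this explains it. For comparison, a mapper written against the older org.apache.hadoop.mapred API looks like this: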
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
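The effect described in this answer can be reproduced without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: BaseMapper, BrokenMapper, FixedMapper and emit are hypothetical stand-ins invented for illustration, with BaseMapper.map playing the role of the inherited identity map.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the pitfall. These classes are stand-ins, not Hadoop types.
class MapOverrideDemo {

    // Stand-in for Mapper: the default map() is an identity function.
    static class BaseMapper {
        final List<String> keyTypes = new ArrayList<>();

        // Records the runtime type of every emitted key.
        void emit(Object key, Object value) {
            keyTypes.add(key.getClass().getSimpleName());
        }

        void map(Object key, String value) {
            emit(key, value); // pass the input pair through unchanged
        }

        // Stand-in for the framework, which only ever calls map().
        void run(long offset, String line) {
            map(offset, line);
        }
    }

    // Bug: a brand-new method, so run() still executes the identity map().
    static class BrokenMapper extends BaseMapper {
        void mapperMethod(Object key, String value) {
            for (String word : value.split("\\s+")) {
                emit(word, 1);
            }
        }
    }

    // Fix: same name and parameters as the base method, checked by @Override.
    static class FixedMapper extends BaseMapper {
        @Override
        void map(Object key, String value) {
            for (String word : value.split("\\s+")) {
                emit(word, 1);
            }
        }
    }

    static List<String> emittedKeyTypes(BaseMapper mapper, String line) {
        mapper.run(0L, line);
        return mapper.keyTypes;
    }

    public static void main(String[] args) {
        System.out.println(emittedKeyTypes(new BrokenMapper(), "hello world")); // [Long]
        System.out.println(emittedKeyTypes(new FixedMapper(), "hello world"));  // [String, String]
    }
}
```

The broken variant emits the numeric input key untouched, which mirrors the "expected Text, recieved LongWritable" message; the fixed variant emits word keys.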

Related

why is the first output line in map reduce null in java

I don't understand why the first output of my map reduce job is 0 and null
The output is : url ; number of visits
and here is the mapper class :
public class WordCountMapper extends
Mapper<LongWritable, Text, Text, IntWritable>
{
public void map(LongWritable cle, Text valeur, Context sortie)
throws IOException
{
String url="";
int nbVisites=0;
Pattern httplogPattern = Pattern.compile("([^\\s]+) - - \\[(.+)\\] \"([^\\s]+) (/[^\\s]*) HTTP/[^\\s]+\" [^\\s]+ ([0-9]+)");
String ligne = valeur.toString();
if (ligne.length()>0) {
Matcher matcher = httplogPattern.matcher(ligne);
if (matcher.matches()) {
url = matcher.group(1);
nbVisites = Integer.parseInt(matcher.group(5));
}
}
Text urlText = new Text(url);
IntWritable value = new IntWritable(nbVisites);
try
{
sortie.write(urlText, value);
System.out.println(urlText + " ; " + value);
}
catch (InterruptedException e)
{
e.printStackTrace();
}
}
}
and reducer :
public class WordCountReducer extends
Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text key, Iterable<IntWritable> values, Context sortie) throws IOException, InterruptedException
{
Iterator<IntWritable> it = values.iterator();
int nb=0;
while (it.hasNext()) {
nb = nb + it.next().get();
}
try {
sortie.write(key, new IntWritable(nb));
System.out.println(key.toString() + ";" + nb);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Each line of the input file looks like this :
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
and here is the output :
0
04-dynamic-c.rotterdam.luna.net 4
06-dynamic-c.rotterdam.luna.net 1
10.salc.wsu.edu 3
11.ts2.mnet.medstroms.se 1
128.100.183.222 4
128.102.149.149 4
As you can see, the first line is a pair of empty values.
Thank you
You get an empty key (not null) because url is initialised to an empty string and nbVisites to 0, and the mapper writes them even when a line does not match the pattern. The reducer then sums those zeros under the empty key.
It works fine if you check that your lines actually match before writing the output.
Here's a refactored version of your code:
public class WebLogDriver extends Configured implements Tool {
public static final String APP_NAME = WebLogDriver.class.getSimpleName();
public static void main(String[] args) throws Exception {
final int status = ToolRunner.run(new Configuration(), new WebLogDriver(), args);
System.exit(status);
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
Job job = Job.getInstance(conf, APP_NAME);
job.setJarByClass(WebLogDriver.class);
// outputs for mapper and reducer
job.setOutputKeyClass(Text.class);
// setup mapper
job.setMapperClass(WebLogDriver.WebLogMapper.class);
job.setMapOutputValueClass(IntWritable.class);
// setup reducer
job.setReducerClass(WebLogDriver.WebLogReducer.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
final Path outputDir = new Path(args[1]);
FileOutputFormat.setOutputPath(job, outputDir);
return job.waitForCompletion(true) ? 0 : 1;
}
static class WebLogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
static final Pattern HTTP_LOG_PATTERN = Pattern.compile("(\\S+) - - \\[(.+)] \"(\\S+) (/\\S*) HTTP/\\S+\" \\S+ (\\d+)");
final Text keyOut = new Text();
final IntWritable valueOut = new IntWritable();
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
String line = value.toString();
if (line.isEmpty()) return;
Matcher matcher = HTTP_LOG_PATTERN.matcher(line);
if (matcher.matches()) {
keyOut.set(matcher.group(1));
try {
valueOut.set(Integer.parseInt(matcher.group(5)));
context.write(keyOut, valueOut);
} catch (NumberFormatException e) {
e.printStackTrace();
}
}
}
}
static class WebLogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
static final IntWritable valueOut = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
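// Sequential stream: Hadoop reuses the IntWritable instance while iterating, so a parallel traversal is unsafe.
int nb = StreamSupport.stream(values.spliterator(), false)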
.mapToInt(IntWritable::get)
.sum();
valueOut.set(nb);
context.write(key, valueOut);
}
}
}
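The "check that the line matches before writing" rule can be tried in isolation. This is a small sketch under the assumption that an unmatched line should simply produce nothing; LogLineGuardDemo and parse are hypothetical names, while the pattern and the sample line are the ones quoted in this thread.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineGuardDemo {
    // Same pattern as in the refactored mapper.
    static final Pattern HTTP_LOG_PATTERN = Pattern.compile(
            "(\\S+) - - \\[(.+)] \"(\\S+) (/\\S*) HTTP/\\S+\" \\S+ (\\d+)");

    // Returns "host;number" for a matching line, or empty when there is nothing to emit.
    static Optional<String> parse(String line) {
        Matcher m = HTTP_LOG_PATTERN.matcher(line);
        if (!m.matches()) {
            return Optional.empty(); // no default ("", 0) pair is produced
        }
        return Optional.of(m.group(1) + ";" + Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        String good = "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "
                + "\"GET /history/apollo/ HTTP/1.0\" 200 6245";
        System.out.println(parse(good));             // Optional[199.72.81.55;6245]
        System.out.println(parse(""));               // Optional.empty
        System.out.println(parse("not a log line")); // Optional.empty
    }
}
```

Because empty and malformed lines yield no pair at all, no empty key ever reaches the reducer.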

Java Hadoop generates no result

I wrote a test with a Hadoop mapper and reducer using the code below:
App.java
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
System.out.println("Words Count Start!");
System.setProperty("hadoop.home.dir", "C:/Users/xx/Meekou/Meekou.hadoop/hadoop-3.2.2.tar/hadoop-3.2.2");
URL url = App.class.getClassLoader().getResource("wordcount.txt");
Path inputPath = new Path(URLDecoder.decode(url.getFile(),"UTF-8") );
Configuration config = new Configuration();
Job job = Job.getInstance(config, "WordsCount");
job.setJarByClass(App.class);
job.setMapperClass(WordsCountMapper.class);
job.setReducerClass(WordsCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, inputPath);
FileOutputFormat.setOutputPath(job, new Path("./result.txt"));
job.waitForCompletion(true);
System.out.println("Words Count Complete!");
}
WordsCountMapper.java
public class WordsCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private static final Log LOG = LogFactory.getLog(WordsCountMapper.class);
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
String line = value.toString();
LOG.info(line);
String[] words = line.split(",");
for (String word : words) {
LOG.info(word);
context.write(new Text(word), one);
}
// StringTokenizer tokenizer = new StringTokenizer(line);
// while (tokenizer.hasMoreTokens()) {
// word.set(tokenizer.nextToken());
// context.write(word, one);
// }
}
}
WordsCountReducer.java
public class WordsCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> words,
Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
int frequencyForWord = 0;
for(IntWritable word: words){
frequencyForWord += word.get();
}
System.out.println(key);
context.write(key, new IntWritable(frequencyForWord));
}
}
And the content of the test txt file is:
tom,jack,mary,
rose,anly,billo,anly,
billo,mary,zoor,
zoor,poly,
It generated an empty result folder instead of the expected result.

how to custom select column reading for hadoop input in java for map reducer job

I am new to Hadoop and trying to understand how Hadoop reads file input. I am able to use the code below to run a Hadoop job on a 2-column (key/value) input file.
But what if I have 5 columns and the key/value I want is A and E (instead of A and B)? Which function exactly do I need to modify?
public class InverterCounter extends Configured implements Tool {
public static class MapClass extends MapReduceBase
implements Mapper<Text, Text, Text, Text> {
public void map(Text key, Text value,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
output.collect(value, key);
}
}
public static class Reduce extends MapReduceBase
implements Reducer<Text, Text, Text, IntWritable> {
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int count = 0;
while (values.hasNext()) {
values.next();
count++;
}
output.collect(key, new IntWritable(count));
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
JobConf job = new JobConf(conf, InverterCounter.class);
Path in = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
job.setJobName("InverterCounter");
job.setMapperClass(MapClass.class);
job.setReducerClass(Reduce.class);
job.setInputFormat(KeyValueTextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.set("key.value.separator.in.input.line", ",");
JobClient.runJob(job);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new InverterCounter(), args);
System.exit(res);
}
}
Any recommendation would be appreciated. I tried changing job.set("key.value.separator.in.input.line", ","); and job.setInputFormat(KeyValueTextInputFormat.class); with no luck, and still could not figure this out.
Thanks
KeyValueTextInputFormat assumes that the key is at the start of each line and treats everything after the first separator as the value, so it isn't a good fit for your multi-column data set.
Instead, you can use TextInputFormat and extract the key and value yourself. I'm assuming all values in the line are separated by commas (and that there are no commas in the data, which is another story).
With TextInputFormat you receive the full line in value, and the position of the line in the file in key. We don't need the position so we will ignore it. With the full line in a single Text we can turn it into a String, split it by commas, and derive the key and value to emit:
public class InverterCounter extends Configured implements Tool {
public static class MapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
// With TextInputFormat the key is the line's byte offset (LongWritable); it is ignored here.
public void map(LongWritable key, Text value,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
String[] lineFields = value.toString().split(",");
// Assumes five columns (A to E), as described in the question.
Text outputKey = new Text(lineFields[0] + "," + lineFields[4]);
Text outputValue = new Text(lineFields[1] + "," + lineFields[2] + "," + lineFields[3]);
output.collect(outputKey, outputValue);
}
}
}
public static class Reduce extends MapReduceBase
implements Reducer<Text, Text, Text, IntWritable> {
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int count = 0;
while (values.hasNext()) {
values.next();
count++;
}
output.collect(key, new IntWritable(count));
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
JobConf job = new JobConf(conf, InverterCounter.class);
Path in = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
job.setJobName("InverterCounter");
job.setMapperClass(MapClass.class);
job.setReducerClass(Reduce.class);
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
JobClient.runJob(job);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new InverterCounter(), args);
System.exit(res);
}
}
I haven't had a chance to test this, so there may be small bugs. You would probably want to rename the class because it is no longer inverting anything. Finally, the value has been sent to the reducer but it isn't being used, so you could just as easily send a NullWritable instead.
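The split-and-pick step from this answer can be checked on its own before wiring it into a mapper. A minimal sketch, assuming comma-separated lines with no embedded commas; ColumnSelectDemo, select and the sample row are hypothetical and exist only for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class ColumnSelectDemo {
    // Picks two columns (by zero-based index) out of a comma-separated line.
    static Map.Entry<String, String> select(String line, int keyCol, int valueCol) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        if (keyCol >= fields.length || valueCol >= fields.length) {
            throw new IllegalArgumentException(
                    "line has only " + fields.length + " columns: " + line);
        }
        return new SimpleEntry<>(fields[keyCol], fields[valueCol]);
    }

    public static void main(String[] args) {
        // Hypothetical five-column row: A,B,C,D,E
        Map.Entry<String, String> kv = select("a1,b1,c1,d1,e1", 0, 4);
        System.out.println(kv.getKey() + " -> " + kv.getValue()); // a1 -> e1
    }
}
```

Checking the column count up front turns a short or malformed line into a clear error instead of an ArrayIndexOutOfBoundsException deep inside a map task.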

Error: java.io.IOException: wrong value class: class org.apache.hadoop.io.Text is not class Myclass

I have my mapper and reducer as follows, but I am getting a strange exception.
I can't figure out why it is being thrown.
public static class MyMapper implements Mapper<LongWritable, Text, Text, Info> {
@Override
public void map(LongWritable key, Text value,
OutputCollector<Text, Info> output, Reporter reporter)
throws IOException {
Text text = new Text("someText");
//process
output.collect(text, infoObject);
}
}
public static class MyReducer implements Reducer<Text, Info, Text, Text> {
@Override
public void reduce(Text key, Iterator<Info> values,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
String value = "xyz"; //derived in some way
//process
output.collect(key, new Text(value)); //exception occurs at this line
}
}
System.out.println("Starting v14 ");
JobConf conf = new JobConf(RouteBuilderJob.class);
conf.setJobName("xyz");
String jarLocation =ClassUtil.findContainingJar(getClass());
System.out.println("path of jar file = " + jarLocation);
conf.setJarByClass(RouteBuilderJob.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(Info.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
//am i missing something here???
conf.setMapperClass(RouteBuilderJob.RouteMapper.class);
conf.setCombinerClass(RouteBuilderJob.RouteReducer.class);
conf.setReducerClass(RouteBuilderJob.RouteReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
I am getting an exception:
Error: java.io.IOException: wrong value class: class org.apache.hadoop.io.Text is not class com.xyz.mypackage.Info
at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:199)
at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:1307)
at com.xyz.mypackage.job.MyJob$RouteReducer.reduce(MyJob.java:156)
at com.xyz.mypackage.job.MyJob$RouteReducer.reduce(MyJob.java:1)
Internally info object (which implements Writable) is serialized using Text
@Override
public void write(DataOutput out) throws IOException {
Gson gson = new Gson();
String searlizedStr = gson.toJson(this);
Text.writeString(out, searlizedStr);
}
@Override
public void readFields(DataInput in) throws IOException {
String s = Text.readString(in);
Gson gson = new Gson();
JsonReader jsonReader = new JsonReader(new StringReader(s));
jsonReader.setLenient(true);
Info info = gson.fromJson(jsonReader, Info.class);
//set fields using this.somefield = info.getsomefield()
}
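The stack trace shows the exception coming from Task$CombineOutputCollector, so it is thrown while your reducer runs as the combiner (conf.setCombinerClass). A combiner's output is fed into the reducer, so its output types must be the same as its input types, that is, the map output types (Text, Info). Your reducer emits (Text, Text), so it cannot double as the combiner: either drop the setCombinerClass call or write a separate combiner that emits Info values.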
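Why the combiner has to keep the value type can be seen in a toy pipeline. This is a plain-Java sketch, not the Hadoop API: combine and reduce are hypothetical stand-ins, with Integer playing the role of the map output value type and String the role of the final output type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CombinerTypesDemo {
    // Combiner stand-in: input and output value types are the same (Integer in, Integer out).
    static Integer combine(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Reducer stand-in: consumes the map output type and is free to emit another type.
    static String reduce(List<Integer> values) {
        return "total=" + combine(values);
    }

    public static void main(String[] args) {
        // Map outputs for one key, split across two hypothetical map tasks.
        List<Integer> task1 = Arrays.asList(1, 1, 1);
        List<Integer> task2 = Arrays.asList(1, 1);

        // Each task's output is combined first; the partial results are still Integers...
        List<Integer> combined = new ArrayList<>();
        combined.add(combine(task1));
        combined.add(combine(task2));

        // ...so the reducer can consume them. Using reduce() as the combiner would
        // produce Strings here, which reduce(List<Integer>) could not accept.
        System.out.println(reduce(combined)); // total=5
    }
}
```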

ClassCast Error while writing to Cassandra from hadoop job

I am running a Hadoop job and trying to write the output to Cassandra. I am getting the following exception:
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to java.nio.ByteBuffer
at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter.write(ColumnFamilyRecordWriter.java:60)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.hadoop.mapreduce.Reducer.reduce(Reducer.java:156)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
I modeled my map reduce code on the WordCount example given at https://wso2.org/repos/wso2/trunk/carbon/dependencies/cassandra/contrib/word_count/src/WordCount.java
Here's my MR code:
public class SentimentAnalysis extends Configured implements Tool {
static final String KEYSPACE = "Travel";
static final String OUTPUT_COLUMN_FAMILY = "Keyword_PtitleId";
public static class Map extends Mapper<LongWritable, Text, Text, LongWritable> {
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
Sentiment sentiment = null;
try {
sentiment = (Sentiment) PojoMapper.fromJson(line, Sentiment.class);
} catch(Exception e) {
return;
}
if(sentiment != null && sentiment.isLike()) {
word.set(sentiment.getNormKeyword());
context.write(word, new LongWritable(sentiment.getPtitleId()));
}
}
}
public static class Reduce extends Reducer<Text, LongWritable, ByteBuffer, List<Mutation>> {
private ByteBuffer outputKey;
public void reduce(Text key, Iterator<LongWritable> values, Context context) throws IOException, InterruptedException {
List<Long> ptitles = new ArrayList<Long>();
java.util.Map<Long, Integer> ptitleToFrequency = new HashMap<Long, Integer>();
while (values.hasNext()) {
Long value = values.next().get();
ptitles.add(value);
}
for(Long ptitle : ptitles) {
if(ptitleToFrequency.containsKey(ptitle)) {
ptitleToFrequency.put(ptitle, ptitleToFrequency.get(ptitle) + 1);
}
else {
ptitleToFrequency.put(ptitle, 1);
}
}
byte[] keyBytes = key.getBytes();
outputKey = ByteBuffer.wrap(Arrays.copyOf(keyBytes, keyBytes.length));
for(Long ptitle : ptitleToFrequency.keySet()) {
context.write(outputKey, Collections.singletonList(getMutation(new Text(ptitle.toString()), ptitleToFrequency.get(ptitle))));
}
}
private static Mutation getMutation(Text word, int sum)
{
Column c = new Column();
byte[] wordBytes = word.getBytes();
c.name = ByteBuffer.wrap(Arrays.copyOf(wordBytes, wordBytes.length));
c.value = ByteBuffer.wrap(String.valueOf(sum).getBytes());
c.timestamp = System.currentTimeMillis() * 1000;
Mutation m = new Mutation();
m.column_or_supercolumn = new ColumnOrSuperColumn();
m.column_or_supercolumn.column = c;
return m;
}
}
public static void main(String[] args) throws Exception {
int ret = ToolRunner.run(new SentimentAnalysis(), args);
System.exit(ret);
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "SentimentAnalysis");
job.setJarByClass(SentimentAnalysis.class);
String inputFile = args[0];
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(ByteBuffer.class);
job.setOutputValueClass(List.class);
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
job.setInputFormatClass(TextInputFormat.class);
ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, OUTPUT_COLUMN_FAMILY);
FileInputFormat.setInputPaths(job, inputFile);
ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
}
If you look under the Reduce class, I am converting Text field (key) to ByteBuffer properly.
Would appreciate some pointers on how to fix this.
After some trial and error, I figured out how to solve this particular issue. In my reduce method signature I was using Iterator instead of Iterable, so my method was an overload rather than an override and was never called. Hadoop ran the default identity reduce instead and tried to write my Mapper output (Text, LongWritable) to Cassandra, which expects the Reducer's output key/value classes (ByteBuffer, List). This caused the ClassCastException.
Changing the reduce method signature to use Iterable solved the issue.
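The Iterator versus Iterable difference is easy to miss because both versions compile. A minimal plain-Java sketch of the overload resolution involved; BaseReducer, IteratorReducer and IterableReducer are hypothetical stand-ins, not Hadoop or Cassandra classes.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class IterableOverloadDemo {
    // Stand-in for the new-API Reducer: the default reduce() is an identity pass-through.
    static class BaseReducer {
        String reduce(String key, Iterable<Long> values) {
            return "identity";
        }

        // Stand-in for the framework, which always passes an Iterable.
        String run(String key, List<Long> values) {
            return reduce(key, values);
        }
    }

    // Bug: Iterator is a different parameter type, so this is an overload, not an override.
    // Adding @Override here would turn the mistake into a compile error.
    static class IteratorReducer extends BaseReducer {
        String reduce(String key, Iterator<Long> values) {
            return "custom";
        }
    }

    // Fix: the parameter types match the base method exactly.
    static class IterableReducer extends BaseReducer {
        @Override
        String reduce(String key, Iterable<Long> values) {
            return "custom";
        }
    }

    public static void main(String[] args) {
        List<Long> values = Arrays.asList(1L, 2L);
        System.out.println(new IteratorReducer().run("k", values)); // identity
        System.out.println(new IterableReducer().run("k", values)); // custom
    }
}
```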
