Hadoop Map Reduce program for hashing - java

I have written a MapReduce program in Hadoop that hashes every record of a file, appends the hashed value to each record as an additional attribute, and writes the output to the Hadoop file system.
This is the code I have written:
public class HashByMapReduce
public static class LineMapper extends Mapper<Text, Text, Text, Text>
private Text word = new Text();
public void map(Text key, Text value, Context context) throws IOException, InterruptedException
String line = value.toString();
context.write(key, line);
public static class LineReducer
extends Reducer<Text,Text,Text,Text>
private Text result = new Text();
public void reduce(Text key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException
String translations = "";
for (Text val : values)
translations = val.toString()+","+String.valueOf(hash64(val.toString())); //Point of Error
context.write(key, result);
public static void main(String[] args) throws Exception
Configuration conf = new Configuration();
Job job = new Job(conf, "Hashing");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
My logic is that each line is read by the map method, which assigns every value to a single key, so all values go to the same reduce call. The reducer then passes each value to the hash64() function.
But I see that it is passing a null (empty) value to the hash function, and I am unable to figure out why. Thanks in advance.

The problem is most probably caused by the use of KeyValueTextInputFormat. From the Yahoo tutorial:
InputFormat:          Description:                               Key:                                      Value:
TextInputFormat       Default format; reads lines of text files  The byte offset of the line               The line contents
KeyValueInputFormat   Parses lines into key, val pairs           Everything up to the first tab character  The remainder of the line
It splits your input lines at the tab character. I suppose there is no tab in your lines. As a result, the key in the LineMapper is the whole line, while nothing is passed as the value (I'm not sure whether it is null or empty).
From your code, I think you would be better off using the TextInputFormat class as your input format, which produces the line offset as the key and the complete line as the value. This should solve your problem.
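To make the splitting behaviour concrete, here is a minimal plain-Java sketch of what happens to a line with and without a tab. It does not use Hadoop at all; the class and method names are made up for illustration, and it assumes (as far as I know this matches Hadoop's key/value line reader) that a line without the separator ends up entirely in the key with an empty value.

```java
// Plain-Java sketch (no Hadoop dependency) of splitting a line at the first tab.
// KeyValueSplitSketch and split() are hypothetical names used only for illustration.
final class KeyValueSplitSketch {

    /** Returns {key, value}; the value is empty when the line has no separator. */
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            // No separator: the whole line becomes the key, the value stays empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] withTab = split("id42\tsome record", '\t');
        String[] withoutTab = split("some record without a tab", '\t');
        System.out.println("[" + withTab[0] + "] [" + withTab[1] + "]");
        System.out.println("[" + withoutTab[0] + "] [" + withoutTab[1] + "]");
    }
}
```

With input that has no tabs, every record would arrive in the mapper as a key with an empty value, which is consistent with the empty string reaching hash64().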
EDIT: I ran your code with the following changes, and it seems to work fine:
Changed the input format to TextInputFormat and changed the declaration of the Mapper accordingly.
Added proper setMapOutputKeyClass and setMapOutputValueClass calls to the job. These are not mandatory when the map output types match the job output types, but leaving them out often causes problems at run time.
Removed your key.set("single") and added a private outKey to the Mapper.
Since you provided no details of the hash64 method, I used String.toUpperCase for testing.
If the issue persists, then I'm fairly sure that your hash method does not handle null well.
Full code:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HashByMapReduce {

    public static class LineMapper extends
            Mapper<LongWritable, Text, Text, Text> {
        private Text word = new Text();
        private Text outKey = new Text("single");

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Emit every line under the same constant key.
            word.set(line);
            context.write(outKey, word);
        }
    }

    public static class LineReducer extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text val : values) {
                // toUpperCase stands in for hash64 (the original point of error).
                String translations = val.toString() + ","
                        + val.toString().toUpperCase();
                result.set(translations);
                context.write(key, result);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Hashing");
        job.setJarByClass(HashByMapReduce.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
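Since the question never showed hash64, here is a rough sketch of what a null-safe version could look like. This is not the asker's method: it is a hypothetical stand-in that uses 64-bit FNV-1a over the UTF-8 bytes and simply treats null like the empty string, so an empty value from the input format cannot make it throw.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the question's hash64; the real method was not shown.
final class Hash64Sketch {
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    /** 64-bit FNV-1a over the UTF-8 bytes; null is treated like the empty string. */
    static long hash64(String s) {
        long hash = FNV_OFFSET_BASIS;
        if (s == null) {
            return hash;
        }
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= FNV_PRIME;
        }
        return hash;
    }

    public static void main(String[] args) {
        String record = "some,record,line";
        // Append the hash to the record, as the reducer in the question intends.
        System.out.println(record + "," + hash64(record));
    }
}
```

In the reducer you would call it the same way as in the question, for example val.toString() + "," + hash64(val.toString()).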


Hadoop reducer ArrayIndexOutOfBoundsException when passing values from mapper

I'm trying to output two values from the mapper to the reducer by passing them as a single string value, but when I parse the string in the reducer I get an out of bounds error. However, I built the string in the Mapper, so I'm sure it has two values. What am I doing wrong? How can I pass two values from the mapper to the reducer? (Eventually, I need to pass more variables to the reducer, but this makes the problem a bit simpler.)
This is the error:
Error: java.lang.ArrayIndexOutOfBoundsException: 1
at TotalTime$TimeReducer.reduce(TotalTime.java:57)
at TotalTime$TimeReducer.reduce(TotalTime.java:1)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:628)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
and this is my code
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class TotalTime {
public static class TimeMapper extends Mapper<Object, Text, Text, Text> {
Text textKey = new Text();
Text textValue = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String data = value.toString();
String[] field = data.split(",");
if (null != field && field.length == 4) {
String strTimeIn[] = field[1].split(":");
String strTimeOout[] = field[2].split(":");
int timeOn = Integer.parseInt(strTimeIn[0]) * 3600 + Integer.parseInt(strTimeIn[1]) * 60 + Integer.parseInt(strTimeIn[2]);
int timeOff = Integer.parseInt(strTimeOout[0]) * 3600 + Integer.parseInt(strTimeOout[1]) * 60 + Integer.parseInt(strTimeOout[2]);
String v = String.valueOf(timeOn) + "," + String.valueOf(timeOff);
context.write(textKey, textValue);
public static class TimeReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
Text textValue = new Text();
int sumTime = 0;
for (Text val : values) {
String line = val.toString();
// Split the string by commas
String[] field = line.split(",");
int timeOn = Integer.parseInt(field[0]);
int timeOff = Integer.parseInt(field[1]);
int time = timeOff - timeOn;
sumTime += time;
String v = String.valueOf(sumTime);
context.write(key, textValue);
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "User Score");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
The input file looks like this:
It seems the combiner is what causes your code to fail. Remember that a combiner is a piece of code that runs on the mapper output before the reducer. Now imagine this scenario:
Your mapper processes this line:
and writes the following output to the context:
[ID2347, (56451,58904)]
Now the combiner comes into play, processes the output of your mapper before the reducer, and produces this:
[ID2347, 2453]
This line then goes to the reducer, which fails because your code assumes the value has the form val1,val2.
If you want your code to work, just remove the combiner (or change your logic).
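The scenario above can be reproduced without Hadoop. The following is only a sketch: the class and method names are made up, and it assumes the same reduce logic is registered both as combiner and as reducer, which is what the stack trace suggests.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch (no Hadoop dependency): what happens when reducer logic
// that expects "timeOn,timeOff" values is also run as a combiner.
final class CombinerMismatchSketch {

    /** Mirrors the reducer: sums (timeOff - timeOn) over "timeOn,timeOff" values. */
    static String reduceOnce(List<String> values) {
        int sumTime = 0;
        for (String line : values) {
            String[] field = line.split(",");
            int timeOn = Integer.parseInt(field[0]);
            int timeOff = Integer.parseInt(field[1]); // fails if there is no comma
            sumTime += timeOff - timeOn;
        }
        return String.valueOf(sumTime);
    }

    public static void main(String[] args) {
        // Combiner pass: the mapper output is collapsed to a single number.
        String combined = reduceOnce(Arrays.asList("56451,58904"));
        System.out.println("after combiner: " + combined);
        try {
            // Reducer pass: the same logic now receives a value without a comma.
            reduceOnce(Arrays.asList(combined));
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("reducer failed: " + e);
        }
    }
}
```

A combiner has to emit values in the same format it consumes, because its output becomes the reducer's input. Summing durations only works as a combiner if the mapper already emits the duration rather than the two timestamps.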

How to get user input in Hadoop 2.7.5?

I'm trying to make it so that when a user enters a word, the program goes through the txt file and counts all the instances of that word.
I'm using MapReduce and I'm new to it.
I know there is a really simple way to do this, and I've been trying to figure it out for a while.
In this code, I want the program to ask for the user's input and then go through the file and find the instances.
I've seen some code on Stack Overflow, and someone mentioned that setting the configuration with conf.set("userinput","Data") would help somehow.
Apparently there is also a newer way to get the user input.
The if statement in my program is an example: when the user's word is entered, it should find only that word.
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
// So I've seen that this is the correct way of setting it up.
// However, I've heard that there are more efficient ways of setting it up as well.
public void setup(Context context) {
Configuration config=context.getConfiguration();
String wordstring=config.get("mapper.word");
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
if(word=="userinput") {
context.write(word, one);
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
context.write(key, result);
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
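I'm not sure about the setup method, but you can pass the input on the command line as an argument and store it in the configuration.
Configuration conf = new Configuration();
// args[0] is the word to search for
conf.set("mapper.word", args[0]);
Job job =...
// Notice you now need 3 arguments to run this
FileInputFormat.addInputPath(job, new Path(args[1]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));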
In the mapper or reducer, you can get the string
Configuration config=context.getConfiguration();
String wordstring=config.get("mapper.word");
You also need to get the string from the tokenizer before you can compare it, and you need to compare strings with equals(), not a string to a Text object with ==.
String wordstring = config.get("mapper.word");
while (itr.hasMoreTokens()) {
    String token = itr.nextToken();
    if (wordstring.equals(token)) {
        word.set(token);
        context.write(word, one);
    }
}
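If you want to check the comparison logic on its own before running a job, something like the following works as a stand-alone test. It is just a sketch with made-up names and no Hadoop dependency; args[0] plays the role of the word that the driver would store under "mapper.word".

```java
import java.util.StringTokenizer;

// Plain-Java sketch (no Hadoop dependency) of the token comparison in the mapper.
// WordFilterSketch and countMatches() are hypothetical names for illustration.
final class WordFilterSketch {

    /** Counts whitespace-separated tokens in the line that equal the target word. */
    static int countMatches(String line, String target) {
        int count = 0;
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken();
            // equals() compares contents; == would compare object references.
            if (target.equals(token)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // args[0] plays the role of the word passed on the command line.
        String target = args.length > 0 ? args[0] : "hadoop";
        System.out.println(countMatches("hadoop map reduce hadoop", target));
    }
}
```

Note that the comparison is case-sensitive, so "Hadoop" and "hadoop" count as different words unless you normalise the case first.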

How to format the output being written by Mapreduce in Hadoop

I am trying to reverse the contents of a file word by word. The program runs fine, but the output I am getting looks like this:
1 dwp
2 seviG
3 eht
4 tnerruc
5 gnikdrow
6 yrotcerid
7 ridkm
8 desU
9 ot
10 etaerc
I want the output to be something like this:
dwp seviG eht tnerruc gnikdrow yrotcerid ridkm desU
ot etaerc
The code I am working with:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class Reproduce {
public static int temp =0;
public static class ReproduceMap extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text>{
private Text word = new Text();
public void map(LongWritable arg0, Text value,
OutputCollector<IntWritable, Text> output, Reporter reporter)
throws IOException {
String line = value.toString().concat("\n");
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(new StringBuffer(tokenizer.nextToken()).reverse().toString());
output.collect(new IntWritable(temp),word);
public static class ReproduceReduce extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text>{
public void reduce(IntWritable arg0, Iterator<Text> arg1,
OutputCollector<IntWritable, Text> arg2, Reporter arg3)
throws IOException {
String word = arg1.next().toString();
Text word1 = new Text();
arg2.collect(arg0, word1);
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
How do I modify my output without writing another Java program to do that?
Thanks in advance.
Here is some simple code that demonstrates the use of a custom FileOutputFormat:
public class MyTextOutputFormat extends FileOutputFormat<Text, List<IntWritable>> {
public org.apache.hadoop.mapreduce.RecordWriter<Text, List<IntWritable>> getRecordWriter(TaskAttemptContext arg0) throws IOException, InterruptedException {
//get the current path
Path path = FileOutputFormat.getOutputPath(arg0);
//create the full path with the output directory plus our filename
Path fullPath = new Path(path, "result.txt");
//create the file in the file system
FileSystem fs = path.getFileSystem(arg0.getConfiguration());
FSDataOutputStream fileOut = fs.create(fullPath, arg0);
//create our record writer with the new file
return new MyCustomRecordWriter(fileOut);
public class MyCustomRecordWriter extends RecordWriter<Text, List<IntWritable>> {
private DataOutputStream out;
public MyCustomRecordWriter(DataOutputStream stream) {
out = stream;
try {
catch (Exception ex) {
public void close(TaskAttemptContext arg0) throws IOException, InterruptedException {
//close our file
public void write(Text arg0, List arg1) throws IOException, InterruptedException {
//write out our key
out.writeBytes(arg0.toString() + ": ");
//loop through all values associated with our key and write them with commas between
for (int i=0; i<arg1.size(); i++) {
if (i>0)
Finally, we need to tell our job about our output format and the path before running it.
FileOutputFormat.setOutputPath(job, new Path("/home/hadoop/out"));
We can customize the output by writing a custom FileOutputFormat class.
You can use NullWritable as the output value. NullWritable is just a placeholder, since you don't want the number to be displayed as part of your output. I have given a modified reducer class below. Note: you need to add an import statement for NullWritable.
public static class ReproduceReduce extends MapReduceBase implements Reducer<IntWritable, Text, Text, NullWritable> {
    public void reduce(IntWritable arg0, Iterator<Text> arg1,
            OutputCollector<Text, NullWritable> arg2, Reporter arg3)
            throws IOException {
        String word = arg1.next().toString();
        Text word1 = new Text();
        word1.set(word);
        // NullWritable has no public constructor; use the shared singleton.
        arg2.collect(word1, NullWritable.get());
    }
}
and change the driver class or main method accordingly.
In the Mapper, the key temp is incremented for each word, so each word is processed as a separate key-value pair.
The steps below should solve the problem:
1) In the Mapper, just remove the temp++ so that all the reversed words have the key 0 (temp = 0).
2) The Reducer then receives the key 0 and the list of reversed strings.
In the reducer, set the key to NullWritable and write the output.
What you can try is to use one constant key (or simply NullWritable) as the key and your complete line as the value. You can reverse the words either in the mapper or in the reducer. Your reducer will then receive the constant key (or a placeholder, if you used NullWritable as the key) together with the complete line, so you can simply reverse the words of the line and write it to the output file. By not using temp as the key, you avoid writing unwanted numbers to your output file.
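The per-line reversal suggested above can be tried out in plain Java first. This is only a sketch with made-up names and no Hadoop dependency; in a real job the returned string would be what you write as the output key (with NullWritable as the value).

```java
import java.util.StringTokenizer;

// Plain-Java sketch (no Hadoop dependency) of reversing every word in a line
// while keeping the words on one output line.
final class ReverseWordsSketch {

    /** Reverses each whitespace-separated word and joins the results with single spaces. */
    static String reverseWords(String line) {
        StringBuilder out = new StringBuilder();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(new StringBuilder(tokenizer.nextToken()).reverse());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseWords("pwd Gives the current"));
    }
}
```

Because the whole line is handled in one call, the words stay together and no per-word counter is needed as a key.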

Hadoop program error while execution - Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable

I get the following error when I execute my alphabet count program.
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1014)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.example.AlphabetCount$Map.map(AlphabetCount.java:40)
Command used to run: ./bin/hadoop jar /home/ubuntu/Documents/AlphabetCount.jar input output
I searched for the error message on Google and checked the first eight links. I have implemented their advice, yet the error still appears. Can you help me out, please?
package com.example;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class AlphabetCount {
public static class Map1 extends
Mapper<LongWritable, Text, Text, IntWritable> {
private Text alphabet = new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
byte[] byteArray = line.getBytes();
int sum = 0;
for (int i = 0; i < byteArray.length; i++) {
if ((byteArray[i] == 'a') || (byteArray[i] == 'A')) {
sum += 1;
context.write(alphabet, new IntWritable(sum));
public static class Reduce1 extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> value,
Context context) throws IOException, InterruptedException {
final Text alphabet = new Text();
int sum = 0;
while (value.hasNext()) {
sum = sum + value.next().get();
context.write(alphabet, new IntWritable(sum));
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
Update (solution): the above code works! I was getting the error because the jar I was executing was not the jar I was updating with the above code. I had initially exported the jar (with the erroneous code) from Eclipse to location x, and afterwards I kept updating the code in location y while still executing the jar from location x. Damn!
Try specifying your input and output format classes in your main method, and also the input key type of your Mapper. You should have something similar to this:
public class AlphabetCount {
public static class Map1 extends
Mapper<Text, Text, Text, IntWritable> {
public void map(Text key, Text value, Context context)
throws IOException, InterruptedException {
public static class Reduce1 extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> value,
Context context) throws IOException, InterruptedException {
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf);

Unexpected output from Hadoop word count

I modified the code below to output words which occurred at least ten times. But it does not work -- the output file does not change at all. What do I have to do to make it work?
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;
// ...
public class WordCount extends Configured implements Tool {
// ...
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
context.write(word, one);
public static class Reduce extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
// where I modified, but not working, the output file didnt change
if(sum >= 10)
context.write(key, new IntWritable(sum));
public int run(String[] args) throws Exception {
Job job = new Job(getConf());
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
public static void main(String[] args) throws Exception {
int ret = ToolRunner.run(new WordCount(), args);
The code looks completely valid. I suspect your dataset is big enough that every word happens to appear more than 10 times?
Please also make sure that you are indeed looking at the new results.
You can look at the default Hadoop counters to get an idea of what's happening.
The code looks valid.
To be able to help you, we need at least the command line you used to run this. It would also help if you could post the actual output when you feed it a file like this:
two two
three three three
Etc., up to 20
The code is definitely correct. Maybe you are reading the output generated before you modified the code, or maybe you did not rebuild the jar file you were using after modifying the code?
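To see what the filtered output should look like for a small test file like the one suggested above, you can run the counting logic outside Hadoop. This is a sketch only: the names are made up, it uses a plain map instead of a MapReduce job, and it applies the threshold after the full sum has been computed, which is what the reducer needs to do as well.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop dependency) of "keep only words seen at least N times".
final class ThresholdCountSketch {

    /** Counts whitespace-separated words and keeps those with count >= threshold. */
    static Map<String, Integer> countAtLeast(String text, int threshold) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        // Filter only after every occurrence has been summed.
        counts.values().removeIf(count -> count < threshold);
        return counts;
    }

    public static void main(String[] args) {
        // Build test input in which the word "wN" occurs N times, for N = 1..20.
        StringBuilder text = new StringBuilder();
        for (int n = 1; n <= 20; n++) {
            for (int i = 0; i < n; i++) {
                text.append('w').append(n).append(' ');
            }
            text.append('\n');
        }
        System.out.println(countAtLeast(text.toString(), 10));
    }
}
```

If the job output still contains words with counts below 10 for such an input, the cluster is most likely running an old jar or you are reading an old output directory.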

