In the new API (org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat), how do I specify a separator (delimiter) other than tab (the default) to separate the key and the value?
Sample Input:
one,first line
two,second line
Output Required:
Key : one
Value : first line
Key : two
Value : second line
I am specifying KeyValueTextInputFormat as:
Job job = new Job(conf, "Sample");
job.setInputFormatClass(KeyValueTextInputFormat.class);
KeyValueTextInputFormat.addInputPath(job, new Path("/home/input.txt"));
This is working fine for tab as a separator.
In the newer API you should use the mapreduce.input.keyvaluelinerecordreader.key.value.separator configuration property.
Here's an example:
Configuration conf = new Configuration();
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
Job job = new Job(conf);
job.setInputFormatClass(KeyValueTextInputFormat.class);
// next job set-up
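With KeyValueTextInputFormat the mapper receives both the key and the value as Text. A minimal sketch of a matching mapper (the class name SampleMapper is just a placeholder) looks like this:
public class SampleMapper extends Mapper<Text, Text, Text, IntWritable> {
    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = the text before the first separator, e.g. "one"
        // value = the text after it, e.g. "first line"
        context.write(key, new IntWritable(1));
    }
}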
Please set the following in the driver code:
conf.set("key.value.separator.in.input.line", ",");
For KeyValueTextInputFormat the input line should be a key-value pair, separated by "\t" by default:
Key1 Value1,Value2
By changing the default separator, you will be able to read the key and value as you wish.
For the new API, here is the solution:
//New API
Configuration conf = new Configuration();
conf.set("key.value.separator.in.input.line", ",");
Job job = new Job(conf);
job.setInputFormatClass(KeyValueTextInputFormat.class);
Map
public class Map extends Mapper<Text, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Text key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
System.out.println("key---> "+key);
System.out.println("value---> "+value.toString());
        // ...
    }
}
Output
key---> one
value---> first line
key---> two
value---> second line
It's a matter of sequence. The line conf.set("key.value.separator.in.input.line", ",") must come before you create the Job instance, because Job copies the Configuration when it is constructed; properties set on conf afterwards are not seen by the job. So:
conf.set("key.value.separator.in.input.line", ",");
Job job = new Job(conf);
First, the new API was not finished in 0.20.*, so if you want to use the new API in 0.20.* you have to implement the feature yourself. For example, you can use a plain FileInputFormat such as TextInputFormat to achieve this: ignore the LongWritable key and split the Text value on the comma yourself.
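A minimal sketch of such a mapper (the class name and the output types are only illustrative) could look like this:
public class CommaSplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // split on the first comma only: "one,first line" -> key "one", value "first line"
        String[] parts = line.toString().split(",", 2);
        if (parts.length == 2) {
            context.write(new Text(parts[0]), new Text(parts[1]));
        }
    }
}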
By default, the KeyValueTextInputFormat class uses tab as the separator between the key and the value in the input text file.
If you want to read the input with a custom separator, then you have to set the configuration property accordingly.
For the new Hadoop APIs, it is different:
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ";");
Example
public class KeyValueTextInput extends Configured implements Tool {
public static void main(String args[]) throws Exception {
String log4jConfPath = "log4j.properties";
PropertyConfigurator.configure(log4jConfPath);
int res = ToolRunner.run(new KeyValueTextInput(), args);
System.exit(res);
}
public int run(String[] args) throws Exception {
Configuration conf = this.getConf();
//conf.set("key.value.separator.in.input.line", ",");
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator",
",");
Job job = Job.getInstance(conf, "WordCountSampleTemplate");
job.setJarByClass(KeyValueTextInput.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
//job.setMapOutputKeyClass(Text.class);
//job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
Path outputPath = new Path(args[1]);
FileSystem fs = FileSystem.get(new URI(outputPath.toString()), conf);
fs.delete(outputPath, true);
FileOutputFormat.setOutputPath(job, outputPath);
return job.waitForCompletion(true) ? 0 : 1;
}
}
class Map extends Mapper<Text, Text, Text, Text> {
public void map(Text k1, Text v1, Context context) throws IOException, InterruptedException {
context.write(k1, v1);
}
}
class Reduce extends Reducer<Text, Text, Text, Text> {
public void reduce(Text Key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
String sum = " || ";
for (Text value : values)
sum = sum + value.toString() + " || ";
context.write(Key, new Text(sum));
}
}
Related
I am learning Big Data Hadoop on my own and I wrote a simple MapReduce word count program which is not working. Please have a look:
// importing all classes
public class WordCount {
public static class Map extends
Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String Line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(Line);
while (tokenizer.hasMoreTokens()) {
value.set(tokenizer.nextToken());
context.write(value, new IntWritable(1));
}
}
}
public static class Reduce extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable x : values) {
sum = sum + x.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "Word Count");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
}
}
But after replacing these lines in the driver code
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
with these
Path outputPath = new Path(args[1]);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
outputPath.getFileSystem(conf).delete(outputPath);
System.exit(job.waitForCompletion(true) ? 0 : 1);
it works properly.
May I know the reason and what these lines are for?
I am assuming that when you say it is not working, you are getting the error below:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:54310/<<your_output_directory>> already exists
The output directory must not exist before the MapReduce job is submitted, so the job fails with the above exception.
The new lines of code in your driver get the FileSystem (local or HDFS, depending on the conf object) from the path and delete the output path before the job is submitted, so the job now runs because the output directory no longer exists. Note also that the added System.exit(job.waitForCompletion(true) ? 0 : 1) line is what actually submits the job and waits for it to finish; your original main method never submitted the job at all.
Can you please tell us what error you were getting earlier?
Most likely the error is that the output path already exists.
The code you added deletes the output path every time: if the output path already exists, it is deleted before the job runs.
You can write that code like this:
Path outputPath = new Path(args[1]);
FileSystem fs = outputPath.getFileSystem(conf);
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true);
}
Below is the sample data in input.txt; it has two columns, key and value. For each record processed by the mapper, the map output should be written to
1) HDFS => a new file needs to be created based on the key column
2) the Context object
Below is the code, where four files need to be created based on the key column, but the files are not getting created. The output is incorrect too: I am expecting word count output, but I am getting character count output.
input.txt
------------
key value
HelloWorld1|ID1
HelloWorld2|ID2
HelloWorld3|ID3
HelloWorld4|ID4
public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
String line = value.toString();
String[] fileContent = line.split("|");
Path hdfsPath = new Path("/filelocation/" + fileContent[0]);
System.out.println("FilePath : " +hdfsPath);
Configuration configuration = con.getConfiguration();
writeFile(fileContent[1], hdfsPath, configuration);
for (String word : fileContent) {
Text outputKey = new Text(word.toUpperCase().trim());
IntWritable outputValue = new IntWritable(1);
con.write(outputKey, outputValue);
}
}
static void writeFile(String fileContent, Path hdfsPath, Configuration configuration) throws IOException {
FileSystem fs = FileSystem.get(configuration);
FSDataOutputStream fin = fs.create(hdfsPath);
fin.writeUTF(fileContent);
fin.close();
}
}
String.split() takes a regular expression, and '|' is a regex metacharacter (alternation), so you need to escape it: .split("\\|");
See the docs here: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
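For illustration, a small sketch using the sample data from the question:
String line = "HelloWorld1|ID1";
String[] wrong = line.split("|");    // "|" alone matches the empty string, so the line is split
                                     // between every character (hence the character-count output)
String[] right = line.split("\\|");  // escaped: yields {"HelloWorld1", "ID1"}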
I am writing a MapReduce program to process a text file and append a string to each line. The problem I am facing is that the Text value coming into the mapper's map method is incorrect.
Whenever a line in the file is shorter than the previous line, a few characters are automatically appended to the line to make it as long as the previously read line.
The map method parameters are as below:
@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
I am logging the value inside the map method and observing this behavior.
Any pointers?
Code Snippet
Driver
Configuration configuration = new Configuration();
configuration.set("CLIENT_ID", "Test");
Job job = Job.getInstance(configuration, JOB_NAME);
job.setJarByClass(JobDriver.class);
job.setMapperClass(AdwordsMapper.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
Mapper
public class AdwordsMapper extends Mapper<LongWritable, Text, Text, Text> {
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String textLine = new String(value.getBytes());
textLine = new StringBuffer(textLine).append(",")
.append(context.getConfiguration().get("CLIENT_ID")).toString();
context.write(new Text(""), new Text(textLine));
}
}
To my knowledge, the problem in your mapper is getBytes(). Hadoop reuses the Text object between records, and getBytes() returns its whole internal buffer, which can be longer than the current line; only the first getLength() bytes are valid, which is why leftovers from a longer previous line show up.
Instead of this
String textLine = new String(value.getBytes());
try this:
String textLine = value.toString();
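If you really need the raw bytes, a minimal sketch (assuming UTF-8 input and an import of java.nio.charset.StandardCharsets) would limit the conversion to the valid length, since only the first getLength() bytes of the buffer belong to the current record:
String textLine = new String(value.getBytes(), 0, value.getLength(), StandardCharsets.UTF_8);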
I am having a strange problem with a Hadoop Map/Reduce job. The job submits correctly, runs, but produces incorrect/strange results. It seems as if the mapper and reducer are not run at all. The input file is transformed from:
12
16
132
654
132
12
to
0 12
4 16
8 132
13 654
18 132
23 12
I assume the first column contains the keys generated for the pairs before the mapper, but neither the mapper nor the reducer seems to run. The job ran fine when I used the old API.
Source for the job is provided below. I am using Hortonworks as the platform.
public class HadoopAnalyzer
{
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens())
{
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception
{
JobConf conf = new JobConf(HadoopAnalyzer.class);
conf.setJobName("wordcount");
conf.set("mapred.job.tracker", "192.168.229.128:50300");
conf.set("fs.default.name", "hdfs://192.168.229.128:8020");
conf.set("fs.defaultFS", "hdfs://192.168.229.128:8020");
conf.set("hbase.master", "192.168.229.128:60000");
conf.set("hbase.zookeeper.quorum", "192.168.229.128");
conf.set("hbase.zookeeper.property.clientPort", "2181");
System.out.println("Executing job.");
Job job = new Job(conf, "job");
job.setInputFormatClass(InputFormat.class);
job.setOutputFormatClass(OutputFormat.class);
job.setJarByClass(HadoopAnalyzer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(job, new Path("/user/usr/in"));
TextOutputFormat.setOutputPath(job, new Path("/user/usr/out"));
job.setMapperClass(Mapper.class);
job.setReducerClass(Reducer.class);
job.waitForCompletion(true);
System.out.println("Done.");
}
}
Maybe I am missing something obvious, but can anyone shed some light on what might be going wrong here?
The output is as expected because you used the following:
job.setMapperClass(Mapper.class);
job.setReducerClass(Reducer.class);
which should have been:
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
You extended the Mapper and Reducer classes with Map and Reduce but never used them in your job, so the job ran the base Mapper and Reducer, which are identity implementations that simply pass their input through; that is why the output is just the TextInputFormat byte-offset keys and the original lines.
I'm new to Hadoop.
I have a MapReduce job which is supposed to get its input from HDFS and write the reducer's output to HBase. I haven't found any good example.
Here's the code; the error when running this example is "Type mismatch in map, expected ImmutableBytesWritable, received IntWritable".
Mapper Class
public static class AddValueMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, IntWritable> {
/* input <key, line number : value, full line>
* output <key, log key : value >*/
public void map(LongWritable key, Text value,
Context context)throws IOException,
InterruptedException {
byte[] keyBytes;                 // renamed so it does not shadow the 'key' parameter
int parsedValue, pos = 0;        // renamed so it does not shadow the 'value' parameter
String line = value.toString();
String p1 , p2 = null;
pos = line.indexOf("=");
//Key part
p1 = line.substring(0, pos);
p1 = p1.trim();
keyBytes = Bytes.toBytes(p1);
//Value part
p2 = line.substring(pos +1);
p2 = p2.trim();
parsedValue = Integer.parseInt(p2);
context.write(new ImmutableBytesWritable(keyBytes), new IntWritable(parsedValue));
}
}
Reducer Class
public static class AddValuesReducer extends TableReducer<
ImmutableBytesWritable, IntWritable, ImmutableBytesWritable> {
public void reduce(ImmutableBytesWritable key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
long total =0;
// Loop values
while(values.iterator().hasNext()){
total += values.iterator().next().get();
}
// Put to HBase
Put put = new Put(key.get());
put.add(Bytes.toBytes("data"), Bytes.toBytes("total"),
Bytes.toBytes(total));
context.write(key, put);
}
}
I had a similar job with HDFS only and it works fine.
Edited 18-06-2013: the college project finished successfully two years ago. For the job configuration (driver part), check the accepted answer.
Here is the code which will solve your problem
Driver
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf,"JOB_NAME");
job.setJarByClass(yourclass.class);
job.setMapperClass(yourMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(inputPath));
TableMapReduceUtil.initTableReducerJob(TABLE,
yourReducer.class, job);
job.setReducerClass(yourReducer.class);
job.waitForCompletion(true);
Mapper & Reducer
class yourMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // @Override map()
}
class yourReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    // @Override reduce()
}
Not sure why the HDFS version works: normally you have to set the input format for the job, and FileInputFormat is an abstract class. Perhaps you left some lines out, such as:
job.setInputFormatClass(TextInputFormat.class);
The best and fastest way to bulk-load data into HBase is to use HFileOutputFormat and the CompleteBulkLoad utility.
You will find sample code here:
Hope this will be useful :)
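For illustration, a minimal, hedged sketch of such a bulk-load driver (the class names BulkLoadDriver and BulkLoadMapper, the table name "mytable", and the path variables are assumptions; the HTable-based configureIncrementalLoad signature is from the older HBase 0.9x-era API):
// Sketch only: assumes HFileOutputFormat.configureIncrementalLoad(Job, HTable) is available.
Configuration conf = HBaseConfiguration.create();
Job job = Job.getInstance(conf, "bulkload");
job.setJarByClass(BulkLoadDriver.class);            // hypothetical driver class
job.setMapperClass(BulkLoadMapper.class);           // hypothetical mapper emitting ImmutableBytesWritable/Put
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
FileInputFormat.addInputPath(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(hfileOutputPath));
HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "mytable"));
job.waitForCompletion(true);
// Afterwards, load the generated HFiles with the completebulkload tool:
// hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hfileOutputPath> mytable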
public void map(LongWritable key, Text value,
Context context)throws IOException,
InterruptedException {
Change these to ImmutableBytesWritable and IntWritable.
I am not sure... hope it works.