Hadoop MapReduce job with HDFS input and HBase output

I'm new to Hadoop.
I have a MapReduce job that is supposed to read its input from HDFS and write the reducer output to HBase. I haven't found a good example.
Here's the code. Running it fails with the error: Type mismatch in map, expected ImmutableBytesWritable, received IntWritable.
Mapper Class
public static class AddValueMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, IntWritable> {
    /* input  <key: byte offset of the line, value: full line>
     * output <key: log key, value: parsed integer> */
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        int pos = line.indexOf("=");
        // Key part (named rowKey so it does not shadow the "key" parameter)
        byte[] rowKey = Bytes.toBytes(line.substring(0, pos).trim());
        // Value part (named number so it does not shadow the "value" parameter)
        int number = Integer.parseInt(line.substring(pos + 1).trim());
        context.write(new ImmutableBytesWritable(rowKey), new IntWritable(number));
    }
}
Reducer Class
public static class AddValuesReducer
        extends TableReducer<ImmutableBytesWritable, IntWritable, ImmutableBytesWritable> {
    @Override
    public void reduce(ImmutableBytesWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        long total = 0;
        // Sum all values for this key
        for (IntWritable v : values) {
            total += v.get();
        }
        // Put to HBase: row = key, column = data:total
        Put put = new Put(key.get());
        put.add(Bytes.toBytes("data"), Bytes.toBytes("total"), Bytes.toBytes(total));
        context.write(key, put);
    }
}
I have a similar job that uses only HDFS, and it works fine.
Edited 18-06-2013. The college project finished successfully two years ago. For the job configuration (driver part), see the accepted answer.

Here is the code that will solve your problem.
Driver
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "JOB_NAME");
job.setJarByClass(yourclass.class);
job.setMapperClass(yourMapper.class);
// The map output types must match the types the mapper declares
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(inputPath));
TableMapReduceUtil.initTableReducerJob(TABLE, yourReducer.class, job);
job.setReducerClass(yourReducer.class);
job.waitForCompletion(true);
Mapper & Reducer
class yourMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // override map()
}
class yourReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    // override reduce()
}
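As a side note on the mapper in the question: it splits each line at "=" and parses the right-hand side as an integer, and a line without "=" would make substring(0, -1) throw. Here is a minimal plain-Java sketch of that parsing step with the guard added. The class name KeyValueLineParser and the choice to return null for bad lines are my own assumptions for illustration, and no Hadoop or HBase types are involved.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java stand-in for the mapper's parsing step (hypothetical helper).
final class KeyValueLineParser {

    /**
     * Splits "name = 42" into ("name", 42).
     * Returns null when the line has no '=' or the value is not an integer.
     */
    static Map.Entry<String, Integer> parse(String line) {
        int pos = line.indexOf('=');
        if (pos < 0) {
            return null; // without this guard, substring(0, -1) would throw
        }
        String key = line.substring(0, pos).trim();
        String rawValue = line.substring(pos + 1).trim();
        try {
            return new SimpleEntry<>(key, Integer.parseInt(rawValue));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("errors = 17"));  // errors=17
        System.out.println(parse("no separator")); // null
    }
}
```

In a real mapper you would probably skip the record (or count it) when parse returns null, rather than let the task fail.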

Not sure why the HDFS version works: normally you have to set the input format for the job, and FileInputFormat is an abstract class. Perhaps you left some lines out, such as
job.setInputFormatClass(TextInputFormat.class);

The best and fastest way to bulk load data into HBase is to use HFileOutputFormat and the CompleteBulkLoad utility.
You will find sample code here:
Hope this will be useful :)

public void map(LongWritable key, Text value,
Context context)throws IOException,
InterruptedException {
Change this to ImmutableBytesWritable, IntWritable.
I am not sure, but I hope it works.

Related

Write multiple outputs from Mapper

Below is the sample data, input.txt. It has two columns, key and value. For each record processed by the Mapper, the output of map should be written to:
1) HDFS => a new file needs to be created based on the key column
2) the Context object
Below is the code. Four files should be created based on the key column, but the files are not getting created. The output is incorrect too: I am expecting word count output, but I am getting character count output.
input.txt
------------
key value
HelloWorld1|ID1
HelloWorld2|ID2
HelloWorld3|ID3
HelloWorld4|ID4
public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
String line = value.toString();
String[] fileContent = line.split("|");
Path hdfsPath = new Path("/filelocation/" + fileContent[0]);
System.out.println("FilePath : " +hdfsPath);
Configuration configuration = con.getConfiguration();
writeFile(fileContent[1], hdfsPath, configuration);
for (String word : fileContent) {
Text outputKey = new Text(word.toUpperCase().trim());
IntWritable outputValue = new IntWritable(1);
con.write(outputKey, outputValue);
}
}
static void writeFile(String fileContent, Path hdfsPath, Configuration configuration) throws IOException {
FileSystem fs = FileSystem.get(configuration);
FSDataOutputStream fin = fs.create(hdfsPath);
fin.writeUTF(fileContent);
fin.close();
}
}

String.split takes a regular expression, so you need to escape the '|', as in .split("\\|");
See docs here: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
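To make the difference concrete, here is a small plain-Java sketch (the class name PipeSplitDemo is just for illustration). An unescaped "|" is regex alternation between two empty patterns, so it matches between every character. That is why the job produced character counts, and why fileContent[0] and fileContent[1] were single letters instead of the key and the ID. Pattern.quote is an alternative to escaping by hand.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class PipeSplitDemo {

    // Unescaped: "|" means "empty OR empty", which matches between characters.
    static String[] splitUnescaped(String line) {
        return line.split("|");
    }

    // Escaped: matches a literal '|' character.
    static String[] splitEscaped(String line) {
        return line.split("\\|");
    }

    // Pattern.quote builds the escaped form for you.
    static String[] splitQuoted(String line) {
        return line.split(Pattern.quote("|"));
    }

    public static void main(String[] args) {
        String line = "HelloWorld1|ID1";
        System.out.println(splitUnescaped(line).length);         // many one-character pieces
        System.out.println(Arrays.toString(splitEscaped(line))); // [HelloWorld1, ID1]
        System.out.println(Arrays.toString(splitQuoted(line)));  // [HelloWorld1, ID1]
    }
}
```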

Hadoop MapReduce querying on large json data

Hadoop n00b here.
I have installed Hadoop 2.6.0 on a server where I have stored twelve json files I want to perform MapReduce operations on. These files are large, ranging from 2-5 gigabytes each.
The structure of the JSON files is an array of JSON objects. Snippet of two objects below:
[{"campus":"Gløshaugen","building":"Varmeteknisk og Kjelhuset","floor":"4. etasje","timestamp":1412121618,"dayOfWeek":3,"hourOfDay":2,"latitude":63.419161638078066,"salt_timestamp":1412121602,"longitude":10.404867443910122,"id":"961","accuracy":56.083199914753536},{"campus":"Gløshaugen","building":"IT-Vest","floor":"2. etasje","timestamp":1412121612,"dayOfWeek":3,"hourOfDay":2,"latitude":63.41709424828986,"salt_timestamp":1412121602,"longitude":10.402167488838765,"id":"982","accuracy":7.315199988880896}]
I want to perform MapReduce operations based on the fields building and timestamp, at least in the beginning until I get the hang of this. For example: MapReduce the data where building equals a parameter and timestamp is greater than X and less than Y. The relevant fields I need after the reduce process are latitude and longitude.
I know there are different tools (Hive, HBase, Pig, Spark, etc.) you can use with Hadoop that might solve this more easily, but my boss wants an evaluation of the MapReduce performance of standalone Hadoop.
So far I have created the main class triggering the map and reduce classes, implemented what I believe is a start in the map class, but I'm stuck on the reduce class. Below is what I have so far.
public class Hadoop {
public static void main(String[] args) throws Exception {
try {
Configuration conf = new Configuration();
Job job = new Job(conf, "maze");
job.setJarByClass(Hadoop.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path inPath = new Path("hdfs://xxx.xxx.106.23:50070/data.json");
FileInputFormat.addInputPath(job, inPath);
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}catch (Exception e){
e.printStackTrace();
}
}
}
Mapper:
public class Map extends org.apache.hadoop.mapreduce.Mapper{
private Text word = new Text();
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
try {
JSONObject jo = new JSONObject(value.toString());
String latitude = jo.getString("latitude");
String longitude = jo.getString("longitude");
long timestamp = jo.getLong("timestamp");
String building = jo.getString("building");
StringBuilder sb = new StringBuilder();
sb.append(latitude);
sb.append("/");
sb.append(longitude);
sb.append("/");
sb.append(timestamp);
sb.append("/");
sb.append(building);
sb.append("/");
context.write(new Text(sb.toString()),value);
}catch (JSONException e){
e.printStackTrace();
}
}
}
Reducer:
public class Reducer extends org.apache.hadoop.mapreduce.Reducer{
private Text result = new Text();
protected void reduce(Text key, Iterable<Text> values, org.apache.hadoop.mapreduce.Reducer.Context context) throws IOException, InterruptedException {
}
}
UPDATE
private static String BUILDING;
private static int tsFrom;
private static int tsTo;

public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
    try {
        JSONArray ja = new JSONArray(key.toString());
        StringBuilder sb;
        for (int n = 0; n < ja.length(); n++) {
            JSONObject jo = ja.getJSONObject(n);
            String latitude = jo.getString("latitude");
            String longitude = jo.getString("longitude");
            int timestamp = jo.getInt("timestamp");
            String building = jo.getString("building");
            if (BUILDING.equals(building) && timestamp < tsTo && timestamp > tsFrom) {
                sb = new StringBuilder();
                sb.append(latitude);
                sb.append("/");
                sb.append(longitude);
                context.write(new Text(sb.toString()), value);
            }
        }
    } catch (JSONException e) {
        e.printStackTrace();
    }
}

@Override
public void configure(JobConf jobConf) {
    System.out.println("configure");
    BUILDING = jobConf.get("BUILDING");
    tsFrom = Integer.parseInt(jobConf.get("TSFROM"));
    tsTo = Integer.parseInt(jobConf.get("TSTO"));
}
This works for a small data set. Since I am working with LARGE JSON files, I get a Java heap space exception. Since I am not familiar with Hadoop, I'm having trouble understanding how MapReduce can read the data without getting an OutOfMemoryError.

If you simply want a list of LONG/LAT under the constraints building=something and timestamp=somethingelse, this is a simple filter operation, and you do not need a reducer for it. In the mapper, check whether the current JSON object satisfies the condition, and only then write it out to the context. If it fails to satisfy the condition, you don't want it in the output.
The output should be LONG/LAT (no building/timestamp, unless you want them there as well)
If no reducer is present, the output of the mappers is the output of the job, which in your case is sufficient.
As for the code:
Your driver should pass the building ID and the timestamp range to the mapper through the job configuration. Anything you put there will be available to all your mappers.
Configuration conf = new Configuration();
conf.set("Building", "123");
conf.set("TSFROM", "12300000000");
conf.set("TSTO", "12400000000");
Job job = new Job(conf);
Your mapper class needs to implement JobConfigurable.configure; in there you read from the configuration object into static variables. (With the new org.apache.hadoop.mapreduce API, override setup(Context) and read from context.getConfiguration() instead.)
private static String BUILDING;
private static long tsFrom;
private static long tsTo;
public void configure(JobConf job) {
    BUILDING = job.get("Building");
    tsFrom = Long.parseLong(job.get("TSFROM"));
    tsTo = Long.parseLong(job.get("TSTO"));
}
Now your map function needs to check:
if (BUILDING.equals(building) && timestamp < tsTo && timestamp > tsFrom) {
    sb = new StringBuilder();
    sb.append(latitude);
    sb.append("/");
    sb.append(longitude);
    // There is no reducer, so only the key carries data
    context.write(new Text(sb.toString()), NullWritable.get());
}
This means that any rows belonging to other buildings, or outside the timestamp range, will not appear in the result.
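The filter condition can be tried out without Hadoop. Below is a minimal sketch in plain Java, assuming the bounds arrive as strings the way they would from a job configuration. The class name RecordFilter and the bounds used in main are made up for illustration; the building name and timestamp are taken from the sample data in the question. It also shows why the bounds should be parsed as long: a value such as "12300000000" does not fit in an int, so Integer.parseInt rejects it.

```java
final class RecordFilter {
    private final String building;
    private final long tsFrom;
    private final long tsTo;

    // Values arrive as strings, as they would from a job configuration.
    RecordFilter(String building, String tsFrom, String tsTo) {
        this.building = building;
        this.tsFrom = Long.parseLong(tsFrom);
        this.tsTo = Long.parseLong(tsTo);
    }

    // True when the record is in the wanted building and strictly inside the range.
    boolean accepts(String recordBuilding, long timestamp) {
        return building.equals(recordBuilding) && timestamp > tsFrom && timestamp < tsTo;
    }

    public static void main(String[] args) {
        // Hypothetical bounds around the timestamps in the sample data.
        RecordFilter filter = new RecordFilter("IT-Vest", "1412121600", "1412121700");
        System.out.println(filter.accepts("IT-Vest", 1412121612L));   // true
        System.out.println(filter.accepts("IT-Vest", 1412121800L));   // false
        System.out.println(filter.accepts("Kjelhuset", 1412121612L)); // false
        try {
            Integer.parseInt("12300000000"); // larger than Integer.MAX_VALUE
        } catch (NumberFormatException e) {
            System.out.println("does not fit in an int");
        }
    }
}
```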

Exception in Reducer in Hadoop when run on Cluster

I have a MapReduce program that runs perfectly in stand-alone mode, but when I run it on the Hadoop cluster at my school, an exception occurs in the Reducer. I have no clue what exception it is. I found this out because when I put a try/catch in the reducer, the job passes but produces empty output; without the try/catch, the job fails. Since it is a school cluster, I do not have access to the job trackers or any other files. Anything I can find out has to be found programmatically. Is there a way to find out what exception happened on Hadoop at run time?
Following are snippets of my code
public static class RowMPreMap extends MapReduceBase implements
Mapper<LongWritable, Text, Text, Text> {
private Text keyText = new Text();
private Text valText = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
// Input: (lineNo, lineContent)
// Split each line using the separator for the dataset.
String line[] = null;
line = value.toString().split(Settings.INPUT_SEPERATOR);
keyText.set(line[0]);
valText.set(line[1] + "," + line[2]);
// Output: (userid, "movieid,rating")
output.collect(keyText, valText);
}
}
public static class RowMPreReduce extends MapReduceBase implements
Reducer<Text, Text, Text, Text> {
private Text valText = new Text();
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
// Input: (userid, List<movieid, rating>)
float sum = 0.0F;
int totalRatingCount = 0;
ArrayList<String> movieID = new ArrayList<String>();
ArrayList<Float> rating = new ArrayList<Float>();
while (values.hasNext()) {
String[] movieRatingPair = values.next().toString().split(",");
movieID.add(movieRatingPair[0]);
Float parseRating = Float.parseFloat(movieRatingPair[1]);
rating.add(parseRating);
sum += parseRating;
totalRatingCount++;
}
float average = ((float) sum) / totalRatingCount;
for (int i = 0; i < movieID.size(); i++) {
valText.set("M " + key.toString() + " " + movieID.get(i) + " "
+ (rating.get(i) - average));
output.collect(null, valText);
}
// Output: (null, <M userid, movieid, normalizedrating>)
}
}
Exception happens in the above reducer. Below is the config
public void normalizeM() throws IOException, InterruptedException {
JobConf conf1 = new JobConf(UVDriver.class);
conf1.setMapperClass(RowMPreMap.class);
conf1.setReducerClass(RowMPreReduce.class);
conf1.setJarByClass(UVDriver.class);
conf1.setMapOutputKeyClass(Text.class);
conf1.setMapOutputValueClass(Text.class);
conf1.setOutputKeyClass(Text.class);
conf1.setOutputValueClass(Text.class);
conf1.setKeepFailedTaskFiles(true);
conf1.setInputFormat(TextInputFormat.class);
conf1.setOutputFormat(TextOutputFormat.class);
FileInputFormat.addInputPath(conf1, new Path(Settings.INPUT_PATH));
FileOutputFormat.setOutputPath(conf1, new Path(Settings.TEMP_PATH + "/"
+ Settings.NORMALIZE_DATA_PATH_TEMP));
JobConf conf2 = new JobConf(UVDriver.class);
conf2.setMapperClass(ColMPreMap.class);
conf2.setReducerClass(ColMPreReduce.class);
conf2.setJarByClass(UVDriver.class);
conf2.setMapOutputKeyClass(Text.class);
conf2.setMapOutputValueClass(Text.class);
conf2.setOutputKeyClass(Text.class);
conf2.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf2, new Path(Settings.TEMP_PATH + "/"
+ Settings.NORMALIZE_DATA_PATH_TEMP));
FileOutputFormat.setOutputPath(conf2, new Path(Settings.TEMP_PATH + "/"
+ Settings.NORMALIZE_DATA_PATH));
Job job1 = new Job(conf1);
Job job2 = new Job(conf2);
JobControl jobControl = new JobControl("jobControl");
jobControl.addJob(job1);
jobControl.addJob(job2);
job2.addDependingJob(job1);
handleRun(jobControl);
}

I caught the exception in the reducer and wrote the stack trace to a file in the file system. I know this is the dirtiest possible way of doing this, but I have no other option at this point. Here is the code in case it helps anyone in the future. Put it in the catch block.
String valueString = "";
while (values.hasNext()) {
valueString += values.next().toString();
}
StringWriter sw = new StringWriter();
e.printStackTrace(new PrintWriter(sw));
String exceptionAsString = sw.toString();
Path pt = new Path("errorfile");
FileSystem fs = FileSystem.get(new Configuration());
BufferedWriter br = new BufferedWriter(new OutputStreamWriter(fs.create(pt,true)));
br.write(exceptionAsString + "\nkey: " + key.toString() + "\nvalues: " + valueString);
br.close();
Suggestions for a cleaner way to do this are welcome.
On a side note, I eventually found that it was a NumberFormatException. Counters would not have helped me identify this. Later I realized that the input is split differently in stand-alone mode and on the cluster, and I have yet to find the reason.

Even if you don't have access to the server, you can get the counters for a job:
Counters counters = job.getCounters();
and dump the set of counters to your local console. These counters will show, among other things, the counts for the number of records input to and written from the mappers and reducers. The counters that have value zero indicate the problem location in your workflow. You can instrument your own counters to help debug/monitor the flow.
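To illustrate the "instrument your own counters" idea, here is a plain-Java sketch that classifies each "movieid,rating" pair instead of swallowing the exception. The class RatingParser and the Outcome names are hypothetical stand-ins. In an old-API job like the one in the question, I would expect the increments to go through reporter.incrCounter(...) rather than a local map, but treat that wiring as an assumption to check against your Hadoop version.

```java
import java.util.EnumMap;
import java.util.Map;

final class RatingParser {
    // Hypothetical counter names, standing in for custom job counters.
    enum Outcome { PARSED, TOO_FEW_FIELDS, BAD_NUMBER }

    private final Map<Outcome, Integer> counts = new EnumMap<>(Outcome.class);

    // Parses "movieid,rating"; returns null and bumps a counter on bad input.
    Float parseRating(String pair) {
        String[] parts = pair.split(",");
        if (parts.length < 2) {
            bump(Outcome.TOO_FEW_FIELDS);
            return null;
        }
        try {
            Float rating = Float.parseFloat(parts[1].trim());
            bump(Outcome.PARSED);
            return rating;
        } catch (NumberFormatException e) {
            bump(Outcome.BAD_NUMBER);
            return null;
        }
    }

    int count(Outcome outcome) {
        return counts.getOrDefault(outcome, 0);
    }

    private void bump(Outcome outcome) {
        counts.merge(outcome, 1, Integer::sum);
    }

    public static void main(String[] args) {
        RatingParser parser = new RatingParser();
        parser.parseRating("42,3.5");
        parser.parseRating("42,three");
        parser.parseRating("42");
        System.out.println(parser.count(Outcome.PARSED));         // 1
        System.out.println(parser.count(Outcome.BAD_NUMBER));     // 1
        System.out.println(parser.count(Outcome.TOO_FEW_FIELDS)); // 1
    }
}
```

A non-zero BAD_NUMBER or TOO_FEW_FIELDS count would point at the input splitting as the place to look, without needing access to the task logs.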

Simple word count MapReduce example yielding strange results

I am having a strange problem with a Hadoop Map/Reduce job. The job submits correctly, runs, but produces incorrect/strange results. It seems as if the mapper and reducer are not run at all. The input file is transformed from:
12
16
132
654
132
12
to
0 12
4 16
8 132
13 654
18 132
23 12
I assume the first column is the key generated for each pair before the mapper, but neither the mapper nor the reducer seems to run. The job ran fine when I used the old API.
Source for the job is provided below. I am using Hortonworks as the platform.
public class HadoopAnalyzer
{
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens())
{
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception
{
JobConf conf = new JobConf(HadoopAnalyzer.class);
conf.setJobName("wordcount");
conf.set("mapred.job.tracker", "192.168.229.128:50300");
conf.set("fs.default.name", "hdfs://192.168.229.128:8020");
conf.set("fs.defaultFS", "hdfs://192.168.229.128:8020");
conf.set("hbase.master", "192.168.229.128:60000");
conf.set("hbase.zookeeper.quorum", "192.168.229.128");
conf.set("hbase.zookeeper.property.clientPort", "2181");
System.out.println("Executing job.");
Job job = new Job(conf, "job");
job.setInputFormatClass(InputFormat.class);
job.setOutputFormatClass(OutputFormat.class);
job.setJarByClass(HadoopAnalyzer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(job, new Path("/user/usr/in"));
TextOutputFormat.setOutputPath(job, new Path("/user/usr/out"));
job.setMapperClass(Mapper.class);
job.setReducerClass(Reducer.class);
job.waitForCompletion(true);
System.out.println("Done.");
}
}
Maybe I am missing something obvious, but can anyone shed some light on what might be going wrong here?

The output is as expected because you used the following,
job.setMapperClass(Mapper.class);
job.setReducerClass(Reducer.class);
Which should have been --
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
You extended the Mapper and Reducer classes with Map and Reduce but didn't use them in your job.
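This also explains the first column. The base Mapper and Reducer classes just pass their input through, and with TextInputFormat the input key is the byte offset at which each line starts. The sketch below recomputes those offsets in plain Java. The class name LineOffsets is made up, and the "\r\n" terminator is an assumption on my part: it is the line ending that reproduces the offsets shown in the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LineOffsets {

    // Returns the byte offset at which each line starts, given the
    // terminator that follows every line (for example "\n" or "\r\n").
    static List<Long> startOffsets(List<String> lines, String terminator) {
        int terminatorBytes = terminator.getBytes(StandardCharsets.UTF_8).length;
        List<Long> offsets = new ArrayList<>();
        long position = 0;
        for (String line : lines) {
            offsets.add(position);
            position += line.getBytes(StandardCharsets.UTF_8).length + terminatorBytes;
        }
        return offsets;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("12", "16", "132", "654", "132", "12");
        System.out.println(startOffsets(lines, "\r\n")); // [0, 4, 8, 13, 18, 23]
    }
}
```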

How to specify KeyValueTextInputFormat Separator in Hadoop-.20 api?

In the new API (apache.hadoop.mapreduce.KeyValueTextInputFormat), how do I specify a separator (delimiter) other than tab (the default) to separate key and value?
Sample Input :
one,first line
two,second line
Output Required :
Key : one
Value : first line
Key : two
Value : second line
I am specifying KeyValueTextInputFormat as :
Job job = new Job(conf, "Sample");
job.setInputFormatClass(KeyValueTextInputFormat.class);
KeyValueTextInputFormat.addInputPath(job, new Path("/home/input.txt"));
This is working fine for tab as a separator.

In the newer API you should use mapreduce.input.keyvaluelinerecordreader.key.value.separator configuration property.
Here's an example:
Configuration conf = new Configuration();
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
Job job = new Job(conf);
job.setInputFormatClass(KeyValueTextInputFormat.class);
// next job set-up
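For intuition about what the separator does, here is a plain-Java approximation of the behaviour as I understand it: each line is cut at the first occurrence of the separator, and a line with no separator becomes a key with an empty value. This is a sketch with a made-up class name, not the Hadoop record reader itself.

```java
final class FirstSeparatorSplit {

    // Returns {key, value}. Splits at the first occurrence of the separator
    // only, so later separators stay inside the value. With no separator the
    // whole line becomes the key and the value is empty.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("one,first line", ',');
        System.out.println("Key : " + kv[0]);   // Key : one
        System.out.println("Value : " + kv[1]); // Value : first line
    }
}
```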

Please set the following in the Driver Code.
conf.set("key.value.separator.in.input.line", ",");

For KeyValueTextInputFormat, each input line should be a key/value pair separated by "\t":
Key1 Value1,Value2
By changing the default separator, you will be able to read the input as you wish.
For the new API, here is the solution:
//New API
Configuration conf = new Configuration();
conf.set("key.value.separator.in.input.line", ",");
Job job = new Job(conf);
job.setInputFormatClass(KeyValueTextInputFormat.class);
Map
public class Map extends Mapper<Text, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Text key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
System.out.println("key---> "+key);
System.out.println("value---> "+value.toString());
.
.
Output
key---> one
value---> first line
key---> two
value---> second line

It's a matter of ordering.
The line conf.set("key.value.separator.in.input.line", ",") must come before you create the Job instance, because Job takes a copy of the Configuration when it is constructed. So:
conf.set("key.value.separator.in.input.line", ",");
Job job = new Job(conf);

First, the new API was not finished in 0.20.*, so if you want to use the new API in 0.20.* you have to implement this feature yourself. For example, you can use a FileInputFormat such as TextInputFormat:
ignore the LongWritable key, and split the Text value on the comma yourself.

By default, the KeyValueTextInputFormat class uses tab as a separator for key and value from input text file.
If you want to read the input from a custom separator, then you have to set the configuration with the attribute that you are using.
For the new Hadoop APIs, it is different:
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ";");

Example
public class KeyValueTextInput extends Configured implements Tool {
public static void main(String args[]) throws Exception {
String log4jConfPath = "log4j.properties";
PropertyConfigurator.configure(log4jConfPath);
int res = ToolRunner.run(new KeyValueTextInput(), args);
System.exit(res);
}
public int run(String[] args) throws Exception {
Configuration conf = this.getConf();
//conf.set("key.value.separator.in.input.line", ",");
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator",
",");
Job job = Job.getInstance(conf, "WordCountSampleTemplate");
job.setJarByClass(KeyValueTextInput.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
//job.setMapOutputKeyClass(Text.class);
//job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
Path outputPath = new Path(args[1]);
FileSystem fs = FileSystem.get(new URI(outputPath.toString()), conf);
fs.delete(outputPath, true);
FileOutputFormat.setOutputPath(job, outputPath);
return job.waitForCompletion(true) ? 0 : 1;
}
}
class Map extends Mapper<Text, Text, Text, Text> {
public void map(Text k1, Text v1, Context context) throws IOException, InterruptedException {
context.write(k1, v1);
}
}
class Reduce extends Reducer<Text, Text, Text, Text> {
public void reduce(Text Key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
String sum = " || ";
for (Text value : values)
sum = sum + value.toString() + " || ";
context.write(Key, new Text(sum));
}
}
