Hadoop MapReduce querying on large json data

Hadoop n00b here.
I have installed Hadoop 2.6.0 on a server where I have stored twelve json files I want to perform MapReduce operations on. These files are large, ranging from 2-5 gigabytes each.
The structure of the JSON files is an array of JSON objects. Snippet of two objects below:
[{"campus":"Gløshaugen","building":"Varmeteknisk og Kjelhuset","floor":"4. etasje","timestamp":1412121618,"dayOfWeek":3,"hourOfDay":2,"latitude":63.419161638078066,"salt_timestamp":1412121602,"longitude":10.404867443910122,"id":"961","accuracy":56.083199914753536},{"campus":"Gløshaugen","building":"IT-Vest","floor":"2. etasje","timestamp":1412121612,"dayOfWeek":3,"hourOfDay":2,"latitude":63.41709424828986,"salt_timestamp":1412121602,"longitude":10.402167488838765,"id":"982","accuracy":7.315199988880896}]
I want to perform MapReduce operations based on the fields building and timestamp, at least in the beginning until I get the hang of this. E.g. MapReduce the data where building equals a parameter and timestamp is greater than X and less than Y. The relevant fields I need after the reduce process are latitude and longitude.
I know there are different tools (Hive, HBase, Pig, Spark etc.) you can use with Hadoop that might solve this more easily, but my boss wants an evaluation of the MapReduce performance of standalone Hadoop.
So far I have created the main class triggering the map and reduce classes, implemented what I believe is a start in the map class, but I'm stuck on the reduce class. Below is what I have so far.
public class Hadoop {
public static void main(String[] args) throws Exception {
try {
Configuration conf = new Configuration();
Job job = new Job(conf, "maze");
job.setJarByClass(Hadoop.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path inPath = new Path("hdfs://xxx.xxx.106.23:50070/data.json");
FileInputFormat.addInputPath(job, inPath);
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}catch (Exception e){
e.printStackTrace();
}
}
}
Mapper:
public class Map extends org.apache.hadoop.mapreduce.Mapper{
private Text word = new Text();
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
try {
JSONObject jo = new JSONObject(value.toString());
String latitude = jo.getString("latitude");
String longitude = jo.getString("longitude");
long timestamp = jo.getLong("timestamp");
String building = jo.getString("building");
StringBuilder sb = new StringBuilder();
sb.append(latitude);
sb.append("/");
sb.append(longitude);
sb.append("/");
sb.append(timestamp);
sb.append("/");
sb.append(building);
sb.append("/");
context.write(new Text(sb.toString()),value);
}catch (JSONException e){
e.printStackTrace();
}
}
}
Reducer:
public class Reducer extends org.apache.hadoop.mapreduce.Reducer{
private Text result = new Text();
protected void reduce(Text key, Iterable<Text> values, org.apache.hadoop.mapreduce.Reducer.Context context) throws IOException, InterruptedException {
}
}
UPDATE
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
private static String BUILDING;
private static int tsFrom;
private static int tsTo;
try {
JSONArray ja = new JSONArray(key.toString());
StringBuilder sb;
for(int n = 0; n < ja.length(); n++)
{
JSONObject jo = ja.getJSONObject(n);
String latitude = jo.getString("latitude");
String longitude = jo.getString("longitude");
int timestamp = jo.getInt("timestamp");
String building = jo.getString("building");
if (BUILDING.equals(building) && timestamp < tsTo && timestamp > tsFrom) {
sb = new StringBuilder();
sb.append(latitude);
sb.append("/");
sb.append(longitude);
context.write(new Text(sb.toString()), value);
}
}
}catch (JSONException e){
e.printStackTrace();
}
}
@Override
public void configure(JobConf jobConf) {
System.out.println("configure");
BUILDING = jobConf.get("BUILDING");
tsFrom = Integer.parseInt(jobConf.get("TSFROM"));
tsTo = Integer.parseInt(jobConf.get("TSTO"));
}
This works for a small data set. Since I am working with LARGE JSON files, I get a Java heap space exception. Since I am not familiar with Hadoop, I'm having trouble understanding how MapReduce can read the data without throwing an OutOfMemoryError.

If you simply want a list of LONG/LAT pairs under the constraints building=something and timestamp=somethingelse, this is a simple filter operation, and you do not need a reducer. In the mapper, check whether the current JSON object satisfies the condition, and only then write it out to the context. If it fails the condition, you don't want it in the output.
The output should be LONG/LAT (no building/timestamp, unless you want them there as well).
If no reducer is present (set the number of reduce tasks to zero with job.setNumReduceTasks(0)), the output of the mappers is the output of the job, which in your case is sufficient.
As for the code:
Your driver should pass the building ID and the timestamp range to the mapper, using the job configuration. Anything you put there will be available to all your mappers.
Configuration conf = new Configuration();
conf.set("Building", "123");
conf.set("TSFROM", "12300000000");
conf.set("TSTO", "12400000000");
Job job = new Job(conf);
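Your mapper class needs to read these values back before it processes any records. With the new API that your code uses (org.apache.hadoop.mapreduce.Mapper), override setup(Context); JobConfigurable.configure(JobConf) belongs to the old org.apache.hadoop.mapred API. In there, read from the configuration object into local static variables:
private static String BUILDING;
private static Long tsFrom;
private static Long tsTo;
@Override
protected void setup(Context context) {
// Runs once per mapper task, before the first call to map().
Configuration conf = context.getConfiguration();
BUILDING = conf.get("Building");
tsFrom = Long.parseLong(conf.get("TSFROM"));
tsTo = Long.parseLong(conf.get("TSTO"));
}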
Now, your map function needs to check:
if (BUILDING.equals(building) && timestamp < tsTo && timestamp > tsFrom) {
sb = new StringBuilder();
sb.append(latitude);
sb.append("/");
sb.append(longitude);
// The value must be a Writable, so wrap the 1 in an IntWritable.
context.write(new Text(sb.toString()), new IntWritable(1));
}
This means any rows belonging to other buildings, or falling outside the timestamp range, will not appear in the result.
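The check described above can be sketched in plain Java, without any Hadoop types, so it can be tried outside a cluster. The class name BuildingTimeFilter and its method names are hypothetical and only illustrate the condition and the "latitude/longitude" output format; in a real job the same logic would sit inside map().

```java
/** Plain-Java sketch of the mapper's filter; names here are illustrative assumptions. */
class BuildingTimeFilter {
    private final String building;
    private final long tsFrom;
    private final long tsTo;

    BuildingTimeFilter(String building, long tsFrom, long tsTo) {
        this.building = building;
        this.tsFrom = tsFrom;
        this.tsTo = tsTo;
    }

    /** True when the record is in the wanted building and strictly between tsFrom and tsTo. */
    boolean accepts(String recordBuilding, long timestamp) {
        return building.equals(recordBuilding) && timestamp > tsFrom && timestamp < tsTo;
    }

    /** Builds the output key in the "latitude/longitude" form used in the answer. */
    static String outputKey(double latitude, double longitude) {
        return latitude + "/" + longitude;
    }
}
```

A record is written only when accepts(...) returns true; everything else is silently dropped, which is all a filter job needs.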

Related

Stackoverflowerror while using distinct in apache spark

I use Spark 2.0.1.
I am trying to find distinct values in a JavaRDD as below
JavaRDD<String> distinct_installedApp_Ids = filteredInstalledApp_Ids.distinct();
I see that this line is throwing the below exception
Exception in thread "main" java.lang.StackOverflowError
at org.apache.spark.rdd.RDD.checkpointRDD(RDD.scala:226)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:84)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
..........
The same stacktrace is repeated again and again.
The input filteredInstalledApp_Ids is large, with millions of records. Is the issue the number of records, or is there a more efficient way to find distinct values in a JavaRDD? Any help would be much appreciated. Thanks in advance. Cheers.
Edit 1:
Adding the filter method
JavaRDD<String> filteredInstalledApp_Ids = installedApp_Ids
.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String v1) throws Exception {
return v1 != null;
}
}).cache();
Edit 2:
Added the method used to generate installedApp_Ids
public JavaRDD<String> getIdsWithInstalledApps(String inputPath, JavaSparkContext sc,
JavaRDD<String> installedApp_Ids) {
JavaRDD<String> appIdsRDD = sc.textFile(inputPath);
try {
JavaRDD<String> appIdsRDD1 = appIdsRDD.map(new Function<String, String>() {
@Override
public String call(String t) throws Exception {
String delimiter = "\t";
String[] id_Type = t.split(delimiter);
StringBuilder temp = new StringBuilder(id_Type[1]);
if ((temp.indexOf("\"")) != -1) {
String escaped = temp.toString().replace("\\", "");
escaped = escaped.replace("\"{", "{");
escaped = escaped.replace("}\"", "}");
temp = new StringBuilder(escaped);
}
// To remove empty character in the beginning of a
// string
JSONObject wholeventObj = new JSONObject(temp.toString());
JSONObject eventJsonObj = wholeventObj.getJSONObject("eventData");
int appType = eventJsonObj.getInt("appType");
if (appType == 1) {
try {
return (String.valueOf(appType));
} catch (JSONException e) {
return null;
}
}
return null;
}
}).cache();
if (installedApp_Ids != null)
return sc.union(installedApp_Ids, appIdsRDD1);
else
return appIdsRDD1;
} catch (Exception e) {
e.printStackTrace();
}
return null;
}

I assume the main dataset is in inputPath. Judging by the split on "\t" in your code, it is a tab-separated file with JSON-encoded values.
I think you could make your code a bit simpler by combining Spark SQL's DataFrames with the from_json function. I'm using Scala and leave converting the code to Java as a home exercise :)
Loading the inputPath text file and splitting its lines into columns can be as simple as the following:
import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...
val dataset = spark.read.option("sep", "\t").csv(inputPath)
You can display the content using show operator.
dataset.show(truncate = false)
You should see the JSON-encoded lines.
It appears that the JSON lines contain eventData and appType fields.
val jsons = dataset.withColumn("asJson", from_json(...))
See functions object for reference.
With JSON lines, you can select the fields of your interest:
val apptypes = jsons.select("eventData.appType")
And then union it with installedApp_Ids.
I'm sure the code gets easier to read (and hopefully to write, too). The migration also gives you optimizations that you may or may not be able to write yourself using the assembler-like RDD API.
Best of all, filtering out nulls is as simple as using the na operator, which gives you DataFrameNaFunctions such as drop. I'm sure you'll like them.
This does not necessarily answer your initial question, but the java.lang.StackOverflowError might go away just by migrating the code, and the code gets easier to maintain, too.

Write multiple outputs from Mapper

Below is the sample data, input.txt. It has 2 columns, key and value. For each record processed by the Mapper, the output of map should be written to:
1) HDFS => a new file needs to be created based on the key column
2) the Context object
Below is the code, where 4 files need to be created based on the key column, but the files are not getting created. The output is incorrect too: I am expecting word-count output, but I am getting character-count output.
input.txt
------------
key value
HelloWorld1|ID1
HelloWorld2|ID2
HelloWorld3|ID3
HelloWorld4|ID4
public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
String line = value.toString();
String[] fileContent = line.split("|");
Path hdfsPath = new Path("/filelocation/" + fileContent[0]);
System.out.println("FilePath : " +hdfsPath);
Configuration configuration = con.getConfiguration();
writeFile(fileContent[1], hdfsPath, configuration);
for (String word : fileContent) {
Text outputKey = new Text(word.toUpperCase().trim());
IntWritable outputValue = new IntWritable(1);
con.write(outputKey, outputValue);
}
}
static void writeFile(String fileContent, Path hdfsPath, Configuration configuration) throws IOException {
FileSystem fs = FileSystem.get(configuration);
FSDataOutputStream fin = fs.create(hdfsPath);
fin.writeUTF(fileContent);
fin.close();
}
}

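String.split takes a regular expression, and '|' is the regex alternation operator, so an unescaped "|" matches the empty string and splits the line into single characters. You need to escape the '|', like .split("\\|");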
See docs here: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
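As a small illustration of the escaping, here is a sketch that splits one of the sample records on the literal pipe. The class and method names (PipeSplitDemo, splitRecord, splitRecordQuoted) are made up for this example; Pattern.quote is shown as an alternative way to get the same literal match.

```java
import java.util.regex.Pattern;

/** Sketch: splitting a "key|value" record on a literal pipe character. */
class PipeSplitDemo {
    /** "\\|" is the regex \| , which matches a literal pipe instead of acting as alternation. */
    static String[] splitRecord(String line) {
        return line.split("\\|");
    }

    /** Same result, letting Pattern.quote handle the escaping. */
    static String[] splitRecordQuoted(String line) {
        return line.split(Pattern.quote("|"));
    }

    public static void main(String[] args) {
        String[] parts = splitRecord("HelloWorld1|ID1");
        System.out.println(parts[0] + " -> " + parts[1]); // prints "HelloWorld1 -> ID1"
    }
}
```

With the split fixed, fileContent[0] is the key used for the file name and fileContent[1] is the value, which is what the mapper in the question expects.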

Exception in Reducer in Hadoop when run on Cluster

I have a MapReduce program that runs perfectly in stand-alone mode, but when I run it on the Hadoop cluster at my school, an exception occurs in the Reducer. I have no clue what exception it is. I found this out because when I put a try/catch in the reducer, the job passes but produces empty output; when I remove the try/catch, the job fails. Since it is a school cluster, I do not have access to the job trackers or any other files. Everything I can find out has to be done programmatically. Is there a way to find out what exception happened on Hadoop at run time?
Following are snippets of my code
public static class RowMPreMap extends MapReduceBase implements
Mapper<LongWritable, Text, Text, Text> {
private Text keyText = new Text();
private Text valText = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
// Input: (lineNo, lineContent)
// Split each line using seperator based on the dataset.
String line[] = null;
line = value.toString().split(Settings.INPUT_SEPERATOR);
keyText.set(line[0]);
valText.set(line[1] + "," + line[2]);
// Output: (userid, "movieid,rating")
output.collect(keyText, valText);
}
}
public static class RowMPreReduce extends MapReduceBase implements
Reducer<Text, Text, Text, Text> {
private Text valText = new Text();
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
// Input: (userid, List<movieid, rating>)
float sum = 0.0F;
int totalRatingCount = 0;
ArrayList<String> movieID = new ArrayList<String>();
ArrayList<Float> rating = new ArrayList<Float>();
while (values.hasNext()) {
String[] movieRatingPair = values.next().toString().split(",");
movieID.add(movieRatingPair[0]);
Float parseRating = Float.parseFloat(movieRatingPair[1]);
rating.add(parseRating);
sum += parseRating;
totalRatingCount++;
}
float average = ((float) sum) / totalRatingCount;
for (int i = 0; i < movieID.size(); i++) {
valText.set("M " + key.toString() + " " + movieID.get(i) + " "
+ (rating.get(i) - average));
output.collect(null, valText);
}
// Output: (null, <M userid, movieid, normalizedrating>)
}
}
Exception happens in the above reducer. Below is the config
public void normalizeM() throws IOException, InterruptedException {
JobConf conf1 = new JobConf(UVDriver.class);
conf1.setMapperClass(RowMPreMap.class);
conf1.setReducerClass(RowMPreReduce.class);
conf1.setJarByClass(UVDriver.class);
conf1.setMapOutputKeyClass(Text.class);
conf1.setMapOutputValueClass(Text.class);
conf1.setOutputKeyClass(Text.class);
conf1.setOutputValueClass(Text.class);
conf1.setKeepFailedTaskFiles(true);
conf1.setInputFormat(TextInputFormat.class);
conf1.setOutputFormat(TextOutputFormat.class);
FileInputFormat.addInputPath(conf1, new Path(Settings.INPUT_PATH));
FileOutputFormat.setOutputPath(conf1, new Path(Settings.TEMP_PATH + "/"
+ Settings.NORMALIZE_DATA_PATH_TEMP));
JobConf conf2 = new JobConf(UVDriver.class);
conf2.setMapperClass(ColMPreMap.class);
conf2.setReducerClass(ColMPreReduce.class);
conf2.setJarByClass(UVDriver.class);
conf2.setMapOutputKeyClass(Text.class);
conf2.setMapOutputValueClass(Text.class);
conf2.setOutputKeyClass(Text.class);
conf2.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf2, new Path(Settings.TEMP_PATH + "/"
+ Settings.NORMALIZE_DATA_PATH_TEMP));
FileOutputFormat.setOutputPath(conf2, new Path(Settings.TEMP_PATH + "/"
+ Settings.NORMALIZE_DATA_PATH));
Job job1 = new Job(conf1);
Job job2 = new Job(conf2);
JobControl jobControl = new JobControl("jobControl");
jobControl.addJob(job1);
jobControl.addJob(job2);
job2.addDependingJob(job1);
handleRun(jobControl);
}

I caught the exception in the reducer and wrote the stack trace to a file in the file system. I know this is the dirtiest possible way of doing this, but I have no other option at this point. Following is the code, in case it helps anyone in the future. Put the code in the catch block.
String valueString = "";
while (values.hasNext()) {
valueString += values.next().toString();
}
StringWriter sw = new StringWriter();
e.printStackTrace(new PrintWriter(sw));
String exceptionAsString = sw.toString();
Path pt = new Path("errorfile");
FileSystem fs = FileSystem.get(new Configuration());
BufferedWriter br = new BufferedWriter(new OutputStreamWriter(fs.create(pt,true)));
br.write(exceptionAsString + "\nkey: " + key.toString() + "\nvalues: " + valueString);
br.close();
Suggestions for doing this in a clean way are welcome.
On a side note, I eventually found that it is a NumberFormatException. Counters would not have helped me identify this. Later I realized that the input is being split differently in stand-alone mode and on the cluster, and I have yet to find the reason.

Even if you don't have access to the server, you can get the counters for a job:
Counters counters = job.getCounters();
and dump the set of counters to your local console. These counters will show, among other things, the counts for the number of records input to and written from the mappers and reducers. The counters that have value zero indicate the problem location in your workflow. You can instrument your own counters to help debug/monitor the flow.
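To give a rough idea of what instrumenting your own counter could look like for the reducer in the question, here is a plain-Java sketch with no Hadoop dependency. The class RatingParser and its malformed field are hypothetical; the AtomicLong only stands in for a job counter, which in a real old-API job would presumably be incremented through the Reporter passed to reduce().

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: parse "movieid,rating" values and count the ones that cannot be parsed. */
class RatingParser {
    /** Stand-in for a custom job counter (assumption: a real job would use a Hadoop counter). */
    final AtomicLong malformed = new AtomicLong();

    /** Returns the rating, or null after bumping the counter when the value is malformed. */
    Float parseRating(String movieRatingPair) {
        String[] parts = movieRatingPair.split(",");
        if (parts.length != 2) {
            malformed.incrementAndGet();
            return null;
        }
        try {
            return Float.parseFloat(parts[1].trim());
        } catch (NumberFormatException e) {
            malformed.incrementAndGet();
            return null;
        }
    }
}
```

A non-zero count after the job finishes points at bad input values rather than at the job wiring, which is the kind of signal that is otherwise hard to get without access to the task logs.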

Hadoop DistributedCache object changed during job

I'm trying to run KMeans on AWS, and I ran into the following exception when trying to read updated cluster centroids from the DistributedCache:
java.io.IOException: The distributed cache object s3://mybucket/centroids_6/part-r-00009 changed during the job from 4/8/13 2:20 PM to 4/8/13 2:20 PM
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(TrackerDistributedCacheManager.java:401)
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.localizePublicCacheObject(TrackerDistributedCacheManager.java:475)
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.getLocalCache(TrackerDistributedCacheManager.java:191)
at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:182)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1246)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1237)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1152)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2541)
at java.lang.Thread.run(Thread.java:662)
What sets this question apart from this one is the fact that this error appears intermittently. I've run the same code successfully on a smaller dataset. Furthermore, when I change the number of centroids from 12 (seen above in the code) to 8, it fails on iteration 5 instead of 6 (which you can see in the centroids_6 name above).
Here's the relevant DistributedCache code in the main driver that runs the KMeans loop:
int iteration = 1;
long changes = 0;
do {
// First, write the previous iteration's centroids to the dist cache.
Configuration iterConf = new Configuration();
Path prevIter = new Path(centroidsPath.getParent(),
String.format("centroids_%s", iteration - 1));
FileSystem fs = prevIter.getFileSystem(iterConf);
Path pathPattern = new Path(prevIter, "part-*");
FileStatus [] list = fs.globStatus(pathPattern);
for (FileStatus status : list) {
DistributedCache.addCacheFile(status.getPath().toUri(), iterConf);
}
// Now, set up the job.
Job iterJob = new Job(iterConf);
iterJob.setJobName("KMeans " + iteration);
iterJob.setJarByClass(KMeansDriver.class);
Path nextIter = new Path(centroidsPath.getParent(),
String.format("centroids_%s", iteration));
KMeansDriver.delete(iterConf, nextIter);
// Set input/output formats.
iterJob.setInputFormatClass(SequenceFileInputFormat.class);
iterJob.setOutputFormatClass(SequenceFileOutputFormat.class);
// Set Mapper, Reducer, Combiner
iterJob.setMapperClass(KMeansMapper.class);
iterJob.setCombinerClass(KMeansCombiner.class);
iterJob.setReducerClass(KMeansReducer.class);
// Set MR formats.
iterJob.setMapOutputKeyClass(IntWritable.class);
iterJob.setMapOutputValueClass(VectorWritable.class);
iterJob.setOutputKeyClass(IntWritable.class);
iterJob.setOutputValueClass(VectorWritable.class);
// Set input/output paths.
FileInputFormat.addInputPath(iterJob, data);
FileOutputFormat.setOutputPath(iterJob, nextIter);
iterJob.setNumReduceTasks(nReducers);
if (!iterJob.waitForCompletion(true)) {
System.err.println("ERROR: Iteration " + iteration + " failed!");
System.exit(1);
}
iteration++;
changes = iterJob.getCounters().findCounter(KMeansDriver.Counter.CONVERGED).getValue();
iterJob.getCounters().findCounter(KMeansDriver.Counter.CONVERGED).setValue(0);
} while (changes > 0);
How else would the files be modified? The only possibility I can think of is that, at the completion of one iteration, the loop begins again before the centroids from the previous job have finished writing. But within the comment, I invoke the job with waitForCompletion(true), so there shouldn't be any residual parts of the job running when the loop starts over. Any ideas?

This isn't really an answer, but I did realize it was silly to use the DistributedCache in the way I was, as opposed to reading the results from the previous iteration directly from HDFS. I instead wrote this method in the main driver:
public static HashMap<Integer, VectorWritable> readCentroids(Configuration conf, Path path)
throws IOException {
HashMap<Integer, VectorWritable> centroids = new HashMap<Integer, VectorWritable>();
FileSystem fs = FileSystem.get(path.toUri(), conf);
FileStatus [] list = fs.globStatus(new Path(path, "part-*"));
for (FileStatus status : list) {
SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
IntWritable key = null;
VectorWritable value = null;
try {
key = (IntWritable)reader.getKeyClass().newInstance();
value = (VectorWritable)reader.getValueClass().newInstance();
} catch (InstantiationException e) {
e.printStackTrace();
} catch (IllegalAccessException e) {
e.printStackTrace();
}
while (reader.next(key, value)) {
centroids.put(new Integer(key.get()),
new VectorWritable(value.get(), value.getClusterId(), value.getNumInstances()));
}
reader.close();
}
return centroids;
}
This is invoked in the setup() method of the Mapper and Reducer during each iteration, to read the centroids of the previous iteration.
protected void setup(Context context) throws IOException {
Configuration conf = context.getConfiguration();
Path centroidsPath = new Path(conf.get(KMeansDriver.CENTROIDS));
centroids = KMeansDriver.readCentroids(conf, centroidsPath);
}
This allowed me to remove the block of code in the loop in my original question which writes the centroids to the DistributedCache. I tested it, and it now works on both large and small datasets.
I still don't know why I was getting the error I posted about (how would something in the read-only DistributedCache be changed? especially when I was changing HDFS paths on every iteration?), but this seems to both work and be a much less hack-y way of reading the centroids.

hadoop map reduce job with HDFS input and HBASE output

I'm new to Hadoop.
I have a MapReduce job which is supposed to read its input from HDFS and write the output of the reducer to HBase. I haven't found any good example.
Here's the code. The error when running this example is: Type mismatch in map, expected ImmutableBytesWritable, received IntWritable.
Mapper Class
public static class AddValueMapper extends Mapper < LongWritable,
Text, ImmutableBytesWritable, IntWritable > {
/* input <key, line number : value, full line>
* output <key, log key : value >*/
public void map(LongWritable key, Text value,
Context context)throws IOException,
InterruptedException {
byte[] key;
int value, pos = 0;
String line = value.toString();
String p1 , p2 = null;
pos = line.indexOf("=");
//Key part
p1 = line.substring(0, pos);
p1 = p1.trim();
key = Bytes.toBytes(p1);
//Value part
p2 = line.substring(pos +1);
p2 = p2.trim();
value = Integer.parseInt(p2);
context.write(new ImmutableBytesWritable(key),new IntWritable(value));
}
}
Reducer Class
public static class AddValuesReducer extends TableReducer<
ImmutableBytesWritable, IntWritable, ImmutableBytesWritable> {
public void reduce(ImmutableBytesWritable key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
long total =0;
// Loop values
while(values.iterator().hasNext()){
total += values.iterator().next().get();
}
// Put to HBase
Put put = new Put(key.get());
put.add(Bytes.toBytes("data"), Bytes.toBytes("total"),
Bytes.toBytes(total));
Bytes.toInt(key.get()), total));
context.write(key, put);
}
}
I had a similar job that used only HDFS, and it works fine.
Edited 18-06-2013. The college project finished successfully two years ago. For job configuration (driver part) check correct answer.

Here is the code which will solve your problem
Driver
// HBaseConfiguration.create() returns a plain Configuration.
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf,"JOB_NAME");
job.setJarByClass(yourclass.class);
job.setMapperClass(yourMapper.class);
// These must match the output types declared by the mapper below.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(inputPath));
TableMapReduceUtil.initTableReducerJob(TABLE,
yourReducer.class, job);
job.setReducerClass(yourReducer.class);
job.waitForCompletion(true);
Mapper and Reducer
class yourMapper extends Mapper<LongWritable, Text, Text,IntWritable> {
// @Override map()
}
class yourReducer
extends
TableReducer<Text, IntWritable,
ImmutableBytesWritable>
{
// @Override reduce()
}

Not sure why the HDFS version works: normally you have to set the input format for the job, and FileInputFormat is an abstract class. Perhaps you left some lines out, such as
job.setInputFormatClass(TextInputFormat.class);

The best and fastest way to bulk load data into HBase is to use HFileOutputFormat and the CompleteBulkLoad utility.
You will find a sample code here:
Hope this will be useful :)

public void map(LongWritable key, Text value,
Context context)throws IOException,
InterruptedException {
Change this to ImmutableBytesWritable, IntWritable.
I am not sure, but I hope it works.
