I am looking for a MapReduce program that reads from a Hive table and writes the first column value of each record to an HDFS location. It should contain only a map phase, no reduce phase.
Below is the mapper:
public class Map extends Mapper<WritableComparable, HCatRecord, NullWritable, IntWritable> {
    @Override
    protected void map(WritableComparable key, HCatRecord value, Context context)
            throws IOException, InterruptedException {
        // The group table from /etc/group has name, 'x', id
        // String groupname = (String) value.get(0);
        int id = (Integer) value.get(1);
        // Emit only the ID; use NullWritable.get() for the key (a raw null would NPE during serialization)
        context.write(NullWritable.get(), new IntWritable(id));
    }
}
Main class
public class mapper1 {
    public static void main(String[] args) throws Exception {
        mapper1 m = new mapper1();
        m.run(args);
    }

    public void run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Get the input table name as an argument
        String inputTableName = args[0];
        // Assume the default database
        String dbName = "xademo";
        Job job = new Job(conf, "UseHCat");
        job.setJarByClass(mapper1.class);
        HCatInputFormat.setInput(job, dbName, inputTableName);
        job.setMapperClass(Map.class);
        // An HCatalog record as input
        job.setInputFormatClass(HCatInputFormat.class);
        // Mapper emits a NullWritable key and an integer value
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Pass the Job here; casting a plain Configuration to JobConf fails at runtime
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
Is there anything wrong with this code?
It fails with a NumberFormatException for the string "5s", and I am not sure where that value is coming from. The error is reported at the HCatInputFormat.setInput() line.
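For reference, here is a minimal sketch of what a map-only driver for this task usually looks like (an illustration only, assuming the org.apache.hive.hcatalog artifacts; the class name UseHCatDriver is a placeholder). Note the explicit setNumReduceTasks(0) for a map-only job, and that FileOutputFormat.setOutputPath is given the Job rather than a Configuration cast to JobConf. The NumberFormatException on the string "5s" usually comes from a configuration property whose value carries a time suffix (such as "5s") being parsed by client code that expects a bare number, so it is worth checking the *-site.xml files on the classpath rather than this code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class UseHCatDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String dbName = "xademo";                  // assumed default database, as in the question
        String inputTableName = args[0];

        Job job = new Job(conf, "UseHCat");
        job.setJarByClass(UseHCatDriver.class);

        // Read HCatRecords from the Hive table via HCatalog.
        HCatInputFormat.setInput(job, dbName, inputTableName);
        job.setInputFormatClass(HCatInputFormat.class);

        job.setMapperClass(Map.class);
        // Map-only job: the mapper's output types are the job's output types.
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(0);                  // no reduce phase

        // Write plain text files to the given HDFS path; pass the Job, not a cast Configuration.
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}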
Related
I am trying to write a custom partitioner to allocate each unique key to a single reducer; this was after the default HashPartitioner failed.
An alternative to the default HashPartitioner provided with Hadoop.
I keep getting the following error. From what I can tell from some research, it has something to do with the constructor not receiving its arguments, but in this context, with Hadoop, aren't the arguments passed automatically by the framework? I can't find an error in the code.
18/04/20 17:06:51 INFO mapred.JobClient: Task Id : attempt_201804201340_0007_m_000000_1, Status : FAILED
java.lang.RuntimeException: java.lang.NoSuchMethodException: biA3pipepart$parti.<init>()
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:587)
This is my partitioner:
public class Parti extends Partitioner<Text, Text> {
String partititonkey;
int result=0;
@Override
public int getPartition(Text key, Text value, int numPartitions) {
String partitionKey = key.toString();
if(numPartitions >= 9){
if(partitionKey.charAt(0) =='0' ){
if(partitionKey.charAt(2)=='0' )
result= 0;
else
if(partitionKey.charAt(2)=='1' )
result= 1;
else
result= 2;
}else
if(partitionKey.charAt(0)=='1'){
if(partitionKey.charAt(2)=='0' )
result= 3;
else
if(partitionKey.charAt(2)=='1' )
result= 4;
else
result= 5;
}else
if(partitionKey.charAt(0)=='2' ){
if(partitionKey.charAt(2)=='0' )
result= 6;
else
if(partitionKey.charAt(2)=='1' )
result= 7;
else
result= 8;
}
} //
else
result= 0;
return result;
}// close method
}// close class
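(An aside, not part of the original question: given that the keys are of the form digit,digit with digits 0 to 2, as listed further below, the nested branches can be collapsed into one arithmetic expression. A sketch under that assumption, shown as a standalone class for brevity:)

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class Parti extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String k = key.toString();               // keys look like "0,0" ... "2,2"
        if (numPartitions < 9) {
            return 0;                            // same fallback as the original
        }
        // first digit * 3 + second digit maps the nine keys to partitions 0..8
        return (k.charAt(0) - '0') * 3 + (k.charAt(2) - '0');
    }
}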
My mapper signature
public static class JoinsMap extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
My reducer signature
public static class JoinsReduce extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
main class:
public static void main( String[] args ) throws Exception {
Configuration conf1 = new Configuration();
Job job1 = new Job(conf1, "biA3pipepart");
job1.setJarByClass(biA3pipepart.class);
job1.setNumReduceTasks(9);//***
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(Text.class);
job1.setMapperClass(JoinsMap.class);
job1.setReducerClass(JoinsReduce.class);
job1.setInputFormatClass(TextInputFormat.class);
job1.setOutputFormatClass(TextOutputFormat.class);
job1.setPartitionerClass(Parti.class); //+++
// inputs to map.
FileInputFormat.addInputPath(job1, new Path(args[0]));
// single output from reducer.
FileOutputFormat.setOutputPath(job1, new Path(args[1]));
job1.waitForCompletion(true);
}
keys emitted by Mapper are the following:
0,0
0,1
0,2
1,0
1,1
1,2
2,0
2,1
2,2
and the Reducer only writes keys and values it receives.
SOLVED
I just added static to my Parti class, like the mapper and reducer classes, as suggested in the comment by user238607.
public static class Parti extends Partitioner<Text, Text> {
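For context, the reason static matters: the framework creates the partitioner with ReflectionUtils.newInstance, which needs a no-argument constructor, and a non-static inner class has an implicit constructor parameter for its enclosing instance, which is exactly what NoSuchMethodException: biA3pipepart$parti.<init>() is complaining about. A minimal illustration (the class names here are placeholders):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class OuterJob {

    // Works: a static nested class keeps a true no-arg constructor,
    // so ReflectionUtils.newInstance can create it on the task side.
    public static class StaticParti extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            return 0;
        }
    }

    // Fails with NoSuchMethodException: <init>() -- the constructor of a
    // non-static inner class implicitly requires an enclosing OuterJob instance.
    public class InnerParti extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            return 0;
        }
    }
}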
I don't know how to use the MultipleOutputs class correctly. I'm using it to create multiple output files. The following is a code snippet from my driver class:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(CustomKeyValueTest.class);//class with mapper and reducer
job.setOutputKeyClass(CustomKey.class);
job.setOutputValueClass(Text.class);
job.setMapOutputKeyClass(CustomKey.class);
job.setMapOutputValueClass(CustomValue.class);
job.setMapperClass(CustomKeyValueTestMapper.class);
job.setReducerClass(CustomKeyValueTestReducer.class);
job.setInputFormatClass(TextInputFormat.class);
Path in = new Path(args[1]);
Path out = new Path(args[2]);
out.getFileSystem(conf).delete(out, true);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
MultipleOutputs.addNamedOutput(job, "islnd" , TextOutputFormat.class, CustomKey.class, Text.class);
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
MultipleOutputs.setCountersEnabled(job, true);
boolean status = job.waitForCompletion(true);
and in Reducer, I used MultipleOutputs like this,
private MultipleOutputs<CustomKey, Text> multipleOutputs;
@Override
public void setup(Context context) throws IOException, InterruptedException {
multipleOutputs = new MultipleOutputs<>(context);
}
@Override
public void reduce(CustomKey key, Iterable<CustomValue> values, Context context) throws IOException, InterruptedException {
...
multipleOutputs.write("islnd", key, pop, key.toString());
//context.write(key, pop);
}
public void cleanup() throws IOException, InterruptedException {
multipleOutputs.close();
}
}
When I use context.write I get output files with data in them, but when I remove context.write the output files are empty. I don't want to call context.write because it creates the extra part-r-00000 file. As stated here (last paragraph in the class description), I used LazyOutputFormat to avoid the part-r-00000 file, but it still didn't work.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
This means: if you are not creating any output, don't create empty files.
Can you please look at the Hadoop counters and check
1. map.output.records
2. reduce.input.groups
3. reduce.input.records
to verify whether your mappers are sending any data to the reducers (a sketch of reading these counters from the driver follows below).
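A minimal sketch of reading those counters from the driver once the job has finished, assuming the Hadoop 2.x mapreduce API (the class name CounterCheck is a placeholder):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterCheck {
    // Call this after job.waitForCompletion(true) to see whether the mappers
    // actually sent anything to the reducers.
    static void printShuffleCounters(Job job) throws IOException, InterruptedException {
        Counters counters = job.getCounters();
        System.out.println("map output records   = "
                + counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue());
        System.out.println("reduce input groups  = "
                + counters.findCounter(TaskCounter.REDUCE_INPUT_GROUPS).getValue());
        System.out.println("reduce input records = "
                + counters.findCounter(TaskCounter.REDUCE_INPUT_RECORDS).getValue());
    }
}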
Code with an integration test for MultipleOutputs is available at:
http://bytepadding.com/big-data/map-reduce/multipleoutputs-in-map-reduce/
I am a beginner at big data. First I want to try out how MapReduce works with HBase. The scenario is summing the uas field in my HBase table using MapReduce, based on the date, which is the primary key. Here is my table:
Hbase::Table - test
ROW COLUMN+CELL
10102010#1 column=cf:nama, timestamp=1418267197429, value=jonru
10102010#1 column=cf:quiz, timestamp=1418267197429, value=\x00\x00\x00d
10102010#1 column=cf:uas, timestamp=1418267197429, value=\x00\x00\x00d
10102010#1 column=cf:uts, timestamp=1418267197429, value=\x00\x00\x00d
10102010#2 column=cf:nama, timestamp=1418267180874, value=jonru
10102010#2 column=cf:quiz, timestamp=1418267180874, value=\x00\x00\x00d
10102010#2 column=cf:uas, timestamp=1418267180874, value=\x00\x00\x00d
10102010#2 column=cf:uts, timestamp=1418267180874, value=\x00\x00\x00d
10102012#1 column=cf:nama, timestamp=1418267156542, value=jonru
10102012#1 column=cf:quiz, timestamp=1418267156542, value=\x00\x00\x00\x0A
10102012#1 column=cf:uas, timestamp=1418267156542, value=\x00\x00\x00\x0A
10102012#1 column=cf:uts, timestamp=1418267156542, value=\x00\x00\x00\x0A
10102012#2 column=cf:nama, timestamp=1418267166524, value=jonru
10102012#2 column=cf:quiz, timestamp=1418267166524, value=\x00\x00\x00\x0A
10102012#2 column=cf:uas, timestamp=1418267166524, value=\x00\x00\x00\x0A
10102012#2 column=cf:uts, timestamp=1418267166524, value=\x00\x00\x00\x0A
My code is as follows:
public class TestMapReduce {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "Test");
job.setJarByClass(TestMapReduce.TestMapper.class);
Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);
TableMapReduceUtil.initTableMapperJob(
"test",
scan,
TestMapReduce.TestMapper.class,
Text.class,
IntWritable.class,
job);
TableMapReduceUtil.initTableReducerJob(
"test",
TestReducer.class,
job);
job.waitForCompletion(true);
}
public static class TestMapper extends TableMapper<Text, IntWritable> {
@Override
protected void map(ImmutableBytesWritable rowKey, Result columns, Mapper.Context context) throws IOException, InterruptedException {
System.out.println("mulai mapping");
try {
//get row key
String inKey = new String(rowKey.get());
//get new key having date only
String onKey = new String(inKey.split("#")[0]);
//get value s_sent column
byte[] bUas = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("uas"));
String sUas = new String(bUas);
Integer uas = new Integer(sUas);
//emit date and sent values
context.write(new Text(onKey), new IntWritable(uas));
} catch (RuntimeException ex) {
ex.printStackTrace();
}
}
}
public class TestReducer extends TableReducer {
public void reduce(Text key, Iterable values, Reducer.Context context) throws IOException, InterruptedException {
try {
int sum = 0;
for (Object test : values) {
System.out.println(test.toString());
sum += Integer.parseInt(test.toString());
}
Put inHbase = new Put(key.getBytes());
inHbase.add(Bytes.toBytes("cf"), Bytes.toBytes("sum"), Bytes.toBytes(sum));
context.write(null, inHbase);
} catch (Exception e) {
e.printStackTrace();
}
}
}
I got the following error:
Exception in thread "main" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:451)
at org.apache.hadoop.util.Shell.run(Shell.java:424)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:656)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:745)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:728)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
at TestMapReduce.main(TestMapReduce.java:97)
Java Result: 1
Help me please :)
Let's look at this part of your code:
byte[] bUas = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("uas"));
String sUas = new String(bUas);
For the current key you are trying to get the value of column uas from column family cf. This is a non-relational DB, so it is entirely possible that this key doesn't have a value for that column. In that case, the getValue method returns null. The String constructor that accepts a byte[] as input can't handle null values, so it throws a NullPointerException. A quick fix looks like this:
byte[] bUas = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("uas"));
String sUas = bUas == null ? "" : new String(bUas);
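As an aside (this is an assumption based on the scan output above, where the stored values look like 4-byte integers, e.g. \x00\x00\x00d is 100): if the uas column was written with Bytes.toBytes(int), then parsing it as a String will also fail, and Bytes.toInt is the safer decode. A sketch of the same lines inside the map method:

byte[] bUas = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("uas"));
if (bUas != null && bUas.length == Bytes.SIZEOF_INT) {
    int uas = Bytes.toInt(bUas);                 // decode the 4-byte big-endian int
    context.write(new Text(onKey), new IntWritable(uas));
}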
Which is the best way to read distributed cache data?
public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
ArrayList<String> globalFreq = new ArrayList<String>();
public void setup(Context context) throws IOException{
Configuration conf = context.getConfiguration();
FileSystem fs = FileSystem.get(conf);
URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
Path getPath = new Path(cacheFiles[0].getPath());
BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
String setupData = null;
while ((setupData = bf.readLine()) != null) {
String [] parts = setupData.split(" ");
globalFreq.add(parts[0]);
}
}
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
//Accessing "globalFreq" data .and do further processing
}
OR
public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    URI[] cacheFiles;
    FileSystem fs;
    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        fs = FileSystem.get(conf);          // kept as a field so map() can use it
        cacheFiles = DistributedCache.getCacheFiles(conf);
}
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
ArrayList<String> globalFreq = new ArrayList<String>();
Path getPath = new Path(cacheFiles[0].getPath());
BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
String setupData = null;
while ((setupData = bf.readLine()) != null) {
String [] parts = setupData.split(" ");
globalFreq.add(parts[0]);
}
}
So if we do it like code 2, does that mean that if we have, say, 5 map tasks, every map task reads the same copy of the data, and because it is written this way for each map call, the data is read multiple times, am I right (5 times)?
Code 1: since the read is done in setup, the data is read once and the global data is then accessed in map.
Which is the right way of using the distributed cache?
Do as much as you can in the setup method: it is called once per mapper, and whatever you build there is then available for every record passed to that mapper. Parsing your data for each record is overhead you can avoid, since nothing in the parsing depends on the key, value, and context variables you receive in the map method.
The setup method is called once per map task, but map is called for each record processed by that task (which can clearly be a very high number).
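For completeness, a sketch of the first pattern written against the non-deprecated cache API (Job.addCacheFile in the driver, context.getCacheFiles() in the mapper), assuming Hadoop 2.x, where DistributedCache itself is deprecated; the class name and file path are placeholders:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TrailMapperV2 extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final ArrayList<String> globalFreq = new ArrayList<String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        URI[] cacheFiles = context.getCacheFiles();   // replacement for DistributedCache.getCacheFiles(conf)
        Path cachePath = new Path(cacheFiles[0].getPath());
        FileSystem fs = FileSystem.get(conf);
        BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(cachePath)));
        try {
            String line;
            while ((line = br.readLine()) != null) {
                globalFreq.add(line.split(" ")[0]);   // parse once per task, not once per record
            }
        } finally {
            br.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // globalFreq is already populated here; use it for each record
    }
}

In the driver the file would be registered with job.addCacheFile(new URI("/user/example/freq.txt")); the path is only a placeholder.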
Driver code:
public class WcDriver {
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf, "WcDriver");
job.setJarByClass(WcDriver.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WcMapper.class);
job.setReducerClass(WcReducer.class);
job.waitForCompletion(true);
}
}
Reducer code
public class WcReducer extends Reducer<Text, LongWritable, Text,String>
{
@Override
public void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
String key1 = null;
int total = 0;
for (LongWritable value : values) {
total += value.get();
key1= key.toString();
}
context.write(new Text(key1), "ABC");
}
}
Here, in the driver class I have set job.setOutputKeyClass(Text.class) and job.setOutputValueClass(LongWritable.class), but in the reducer class I am writing a String: context.write(new Text(key1), "ABC");. I think there should be an error while running the program because the output types do not match, and also the reducer's key should implement WritableComparable and the value should implement the Writable interface. Strangely, this program runs fine. I do not understand why there is no exception.
Try this:
// job.setOutputFormatClass(TextOutputFormat.class);
// comment out this line, and you will surely get a casting exception.
This is because TextOutputFormat assumes LongWritable as the key and Text as the value: if you do not define the output format class, the job expects the default Writable behaviour, but once you do specify it, the output is implicitly converted to the given type.
Also try this:
// job.setOutputValueClass(LongWritable.class); // if you comment out this line you get an error
This call only declares the key/value pair; what is actually written depends on the output format, and since it is text here, no error is raised.
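To make the declarations and the reducer consistent (assuming you really do want to emit the literal string ABC as the value), a sketch would look like this: the reducer's fourth type parameter becomes Text, and the driver would declare job.setOutputValueClass(Text.class) instead of LongWritable.class.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Declared output types now match what is actually written.
public class WcReducerTyped extends Reducer<Text, LongWritable, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable value : values) {
            total += value.get();    // summed as in the original, even though only "ABC" is written
        }
        context.write(key, new Text("ABC"));
    }
}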