I am a beginner at big data. As a first step I want to see how MapReduce works with HBase. The scenario: sum the uas field in my HBase table using MapReduce, grouped by the date that forms the first part of the row key. Here is my table:
Hbase::Table - test
ROW COLUMN+CELL
10102010#1 column=cf:nama, timestamp=1418267197429, value=jonru
10102010#1 column=cf:quiz, timestamp=1418267197429, value=\x00\x00\x00d
10102010#1 column=cf:uas, timestamp=1418267197429, value=\x00\x00\x00d
10102010#1 column=cf:uts, timestamp=1418267197429, value=\x00\x00\x00d
10102010#2 column=cf:nama, timestamp=1418267180874, value=jonru
10102010#2 column=cf:quiz, timestamp=1418267180874, value=\x00\x00\x00d
10102010#2 column=cf:uas, timestamp=1418267180874, value=\x00\x00\x00d
10102010#2 column=cf:uts, timestamp=1418267180874, value=\x00\x00\x00d
10102012#1 column=cf:nama, timestamp=1418267156542, value=jonru
10102012#1 column=cf:quiz, timestamp=1418267156542, value=\x00\x00\x00\x0A
10102012#1 column=cf:uas, timestamp=1418267156542, value=\x00\x00\x00\x0A
10102012#1 column=cf:uts, timestamp=1418267156542, value=\x00\x00\x00\x0A
10102012#2 column=cf:nama, timestamp=1418267166524, value=jonru
10102012#2 column=cf:quiz, timestamp=1418267166524, value=\x00\x00\x00\x0A
10102012#2 column=cf:uas, timestamp=1418267166524, value=\x00\x00\x00\x0A
10102012#2 column=cf:uts, timestamp=1418267166524, value=\x00\x00\x00\x0A
My code looks like this:
public class TestMapReduce {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "Test");
job.setJarByClass(TestMapReduce.TestMapper.class);
Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);
TableMapReduceUtil.initTableMapperJob(
"test",
scan,
TestMapReduce.TestMapper.class,
Text.class,
IntWritable.class,
job);
TableMapReduceUtil.initTableReducerJob(
"test",
TestReducer.class,
job);
job.waitForCompletion(true);
}
public static class TestMapper extends TableMapper<Text, IntWritable> {
@Override
protected void map(ImmutableBytesWritable rowKey, Result columns, Mapper.Context context) throws IOException, InterruptedException {
System.out.println("mulai mapping");
try {
//get row key
String inKey = new String(rowKey.get());
//get new key having date only
String onKey = new String(inKey.split("#")[0]);
//get value s_sent column
byte[] bUas = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("uas"));
String sUas = new String(bUas);
Integer uas = new Integer(sUas);
//emit date and sent values
context.write(new Text(onKey), new IntWritable(uas));
} catch (RuntimeException ex) {
ex.printStackTrace();
}
}
}
public class TestReducer extends TableReducer {
public void reduce(Text key, Iterable values, Reducer.Context context) throws IOException, InterruptedException {
try {
int sum = 0;
for (Object test : values) {
System.out.println(test.toString());
sum += Integer.parseInt(test.toString());
}
Put inHbase = new Put(key.getBytes());
inHbase.add(Bytes.toBytes("cf"), Bytes.toBytes("sum"), Bytes.toBytes(sum));
context.write(null, inHbase);
} catch (Exception e) {
e.printStackTrace();
}
}
}
I get this error:
Exception in thread "main" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:451)
at org.apache.hadoop.util.Shell.run(Shell.java:424)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:656)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:745)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:728)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
at TestMapReduce.main(TestMapReduce.java:97)
Java Result: 1
Help me please :)
Let's look at this part of your code:
byte[] bUas = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("uas"));
String sUas = new String(bUas);
For the current key you are trying to get the value of column uas from column family cf. This is a non-relational database, so it is entirely possible that this key has no value for that column. In that case the getValue method returns null. The String constructor that accepts a byte[] cannot handle a null input, so it throws a NullPointerException. A quick fix looks like this:
byte[] bUas = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("uas"));
String sUas = bUas == null ? "" : new String(bUas);
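Note also that, judging by the table dump above (value=\x00\x00\x00d is the 4-byte encoding of 100), the uas cells look like they were written with Bytes.toBytes(int). If that is the case, parsing them as text will fail even when the bytes are present, and Bytes.toInt is the safer conversion. A sketch of the relevant part of the mapper under that assumption:
byte[] bUas = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("uas"));
if (bUas == null) {
    // This row has no cf:uas cell -- skip it instead of failing the task.
    return;
}
// Assumes the cell was written with Bytes.toBytes(int), as the
// \x00\x00\x00d dump above suggests (0x64 == 100).
int uas = Bytes.toInt(bUas);
context.write(new Text(onKey), new IntWritable(uas));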
I am looking for a MapReduce program that reads from a Hive table and writes the first column value of each record to an HDFS location. It should contain only a map phase, no reduce phase.
Below is the mapper
public class Map extends Mapper<WritableComparable, HCatRecord, NullWritable, IntWritable> {
protected void map( WritableComparable key,
HCatRecord value,
org.apache.hadoop.mapreduce.Mapper<WritableComparable, HCatRecord,
NullWritable, IntWritable>.Context context)
throws IOException, InterruptedException {
// The group table from /etc/group has name, 'x', id
// groupname = (String) value.get(0);
int id = (Integer) value.get(1);
// Just select and emit the name and ID
context.write(null, new IntWritable(id));
}
}
Main class
public class mapper1 {
public static void main(String[] args) throws Exception {
mapper1 m=new mapper1();
m.run(args);
}
public void run(String[] args) throws IOException, Exception, InterruptedException {
Configuration conf =new Configuration();
// Get the input and output table names as arguments
String inputTableName = args[0];
// Assume the default database
String dbName = "xademo";
Job job = new Job(conf, "UseHCat");
job.setJarByClass(mapper1.class);
HCatInputFormat.setInput(job, dbName, inputTableName);
job.setMapperClass(Map.class);
// An HCatalog record as input
job.setInputFormatClass(HCatInputFormat.class);
// Mapper emits a string as key and an integer as value
job.setMapOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(IntWritable.class);
FileOutputFormat.setOutputPath((JobConf) conf, new Path(args[1]));
job.waitForCompletion(true);
}
}
Is there anything wrong in this code?
It fails with a NumberFormatException for the string "5s", and I am not sure where that value comes from. The error points to the line with HCatInputFormat.setInput().
I'm creating an RDD in the first part of the application, then converting it to a list using rdd.collect().
But for some reason the list size comes out as 0 in the second part of the application, even though the RDD from which I create the list is not empty. Even rdd.toArray() gives an empty list.
Below is my program.
public class Query5kPids implements Serializable{
List<String> ListFromS3 = new ArrayList<String>();
public static void main(String[] args) throws JSONException, IOException, InterruptedException, URISyntaxException {
SparkConf conf = new SparkConf();
conf.setAppName("Spark-Cassandra Integration");
conf.set("spark.cassandra.connection.host", "12.16.193.19");
conf.setMaster("yarn-cluster");
SparkConf conf1 = new SparkConf().setAppName("SparkAutomation").setMaster("yarn-cluster");
Query5kPids app1 = new Query5kPids(conf1);
app1.run1(file);
Query5kPids app = new Query5kPids(conf);
System.out.println("Both RDD has been generated");
app.run();
}
private void run() throws JSONException, IOException, InterruptedException {
JavaSparkContext sc = new JavaSparkContext(conf);
query(sc);
sc.stop();
}
private void run1(File file) throws JSONException, IOException, InterruptedException {
JavaSparkContext sc = new JavaSparkContext(conf);
getData(sc,file);
sc.stop();
}
private void getData(JavaSparkContext sc, File file) {
JavaRDD<String> Data = sc.textFile(file.toString());
System.out.println("RDD Count is " + Data.count());
// here it prints some count value
ListFromS3 = Data.collect();
// ListFromS3 = Data.toArray();
}
private void query(JavaSparkContext sc) {
System.out.println("RDD Count is " + ListFromS3.size());
// Prints 0
// So cant convert the list to RDD
JavaRDD<String> rddFromGz = sc.parallelize(ListFromS3);
}
}
NOTE: in the actual program, the RDD and the list are of the following types.
List<UserSetGet> ListFromS3 = new ArrayList<UserSetGet>();
JavaRDD<UserSetGet> Data = new ....
where UserSetGet is a POJO with setter and getter methods, and it is Serializable.
app1.run1 puts the RDD contents into app1.ListFromS3. Then you look at app.ListFromS3, which is empty. app1.ListFromS3 and app.ListFromS3 are fields on two different objects. Setting one does not set the other.
I think you meant ListFromS3 to be static, meaning it belongs to the Query5kPids class, not to a particular instance. Like this:
static List<String> ListFromS3 = new ArrayList<String>();
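To make the difference concrete, here is a minimal, self-contained sketch (hypothetical class and field names, nothing Spark-specific) showing why a value written through one instance is invisible through another instance, and why a static field behaves differently:
import java.util.ArrayList;
import java.util.List;

public class StaticFieldDemo {
    // Instance field: every object gets its own copy.
    List<String> instanceList = new ArrayList<String>();
    // Static field: one copy shared by all instances of the class.
    static List<String> sharedList = new ArrayList<String>();

    public static void main(String[] args) {
        StaticFieldDemo app1 = new StaticFieldDemo();
        StaticFieldDemo app = new StaticFieldDemo();

        app1.instanceList.add("value");          // fills only app1's copy
        StaticFieldDemo.sharedList.add("value"); // visible through any instance

        System.out.println(app.instanceList.size());           // 0, the same symptom as app.ListFromS3
        System.out.println(StaticFieldDemo.sharedList.size());  // 1
    }
}
Note that this only helps when run1 and run execute in the same driver JVM; a static field is not shared across separate JVMs or executors.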
I am creating a Spark job in Java. Here is my code.
I am trying to filter records from a CSV file. The header contains the fields OID, COUNTRY_NAME, ...
Instead of just filtering with s.contains("CANADA"), I would like to be more specific and filter on COUNTRY_NAME.equals("CANADA").
Any thoughts on how I can do this?
public static void main(String[] args) {
String gaimFile = "hdfs://xx.yy.zz.com/sandbox/data/acc/mydata";
SparkConf conf = new SparkConf().setAppName("Filter App");
JavaSparkContext sc = new JavaSparkContext(conf);
try{
JavaRDD<String> gaimData = sc.textFile(gaimFile);
JavaRDD<String> canadaOnly = gaimData.filter(new Function<String, Boolean>() {
private static final long serialVersionUID = -4438640257249553509L;
public Boolean call(String s) {
// My file is a CSV with header OID, COUNTRY_NAME, ...
// Here, instead of just saying s.contains,
// I would like to be more specific and say
// if COUNTRY_NAME.equals("CANADA")
return s.contains("CANADA");
}
});
}
catch(Exception e){
System.out.println("ERROR: G9 MatchUp Failed");
}
finally{
sc.close();
}
}
You will have to map your values into a custom class first:
rdd.map(line => ConvertToCountry(line))
   .filter(country => country.countryName == "CANADA")
class Country{
...ctor that takes an array and fills properties...
...properties for each field from the csv...
}
ConvertToCountry(line: String){
return new Country(line.split(','))
}
The above is a combination of Scala and pseudocode, but you should get the point.
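If you want to stay in Java, roughly the same idea looks like the sketch below. It replaces the filter call in your code above; the assumption that COUNTRY_NAME is the second column (index 1) is mine, so adjust the index to your actual header:
JavaRDD<String> canadaOnly = gaimData.filter(new Function<String, Boolean>() {
    private static final long serialVersionUID = 1L;
    public Boolean call(String s) {
        // Assumed layout: OID, COUNTRY_NAME, ... -> COUNTRY_NAME is at index 1.
        String[] fields = s.split(",", -1);
        // The header line is dropped automatically, because its second
        // field is the literal "COUNTRY_NAME", not "CANADA".
        return fields.length > 1 && fields[1].trim().equals("CANADA");
    }
});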
I am getting a NullPointerException in the driver class at the conf.getStrings() call. This driver class is invoked from my custom website.
Below are the driver class details:
@SuppressWarnings("unchecked")
public void doGet(HttpServletRequest request,
HttpServletResponse response)
throws ServletException, IOException
{
Configuration conf = new Configuration();
//conf.set("fs.default.name", "hdfs://localhost:54310");
//conf.set("mapred.job.tracker", "localhost:54311");
//conf.set("mapred.jar","/home/htcuser/Desktop/ResumeLatest.jar");
Job job = new Job(conf, "ResumeSearchClass");
job.setJarByClass(HelloForm.class);
job.setJobName("ResumeParse");
job.setInputFormatClass(FileInputFormat.class);
FileInputFormat.addInputPath(job, new Path("hdfs://localhost:54310/usr/ResumeDirectory"));
job.setMapperClass(ResumeMapper.class);
job.setReducerClass(ResumeReducer.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setSortComparatorClass(ReverseComparator.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(FileOutPutFormat.class);
FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:54310/usr/output" + System.currentTimeMillis()));
long start = System.currentTimeMillis();
var = job.waitForCompletion(true) ? 0 : 1;
I get the NullPointerException at the following two lines of code:
String[] keytextarray=conf.getStrings("Keytext");
for(int i=0;i<keytextarray.length;i++) //GETTING NULL POINTER EXCEPTION HERE IN keytextarray.length
{
//some code here
}
if(var==0)
{
RequestDispatcher dispatcher = request.getRequestDispatcher("/Result.jsp");
dispatcher.forward(request, response);
long finish= System.currentTimeMillis();
System.out.println("Time Taken "+(finish-start));
}
}
I have removed some unrelated code from the driver class method above.
Below is the RecordWriter class, where I use conf.setStrings() in the write() method to set the values:
public class RecordWrite extends org.apache.hadoop.mapreduce.RecordWriter<IntWritable, Text> {
TaskAttemptContext context1;
Configuration conf;
public RecordWrite(DataOutputStream output, TaskAttemptContext context)
{
out = output;
conf = context.getConfiguration();
HelloForm.context1=context;
try {
out.writeBytes("result:\n");
out.writeBytes("Name:\t\t\t\tExperience\t\t\t\t\tPriority\tPriorityCount\n");
} catch (IOException e) {
e.printStackTrace();
}
}
public RecordWrite() {
// TODO Auto-generated constructor stub
}
@Override
public void close(TaskAttemptContext context) throws IOException,
InterruptedException
{
out.close();
}
int z=0;
@Override
public void write(IntWritable value,Text key) throws IOException,
InterruptedException
{
conf.setStrings("Keytext", key1string); //setting values here
conf.setStrings("valtext", valuestring);
String[] keytext=key.toString().split(Pattern.quote("^"));
//some code here
}
}
I suspect this NullPointerException happens because I call conf.getStrings() after the job has completed (job.waitForCompletion(true)). Please help me fix this issue.
If the above is not the correct way of passing values from the RecordWriter to the driver class, please let me know how to pass them.
I have also tried setting the values from the RecordWriter on a custom static class and reading that object back in the driver class, but that again returns null when I run the code on a cluster.
If you have the values of key1string and valuestring available in the job (driver) class, try setting them there, before submitting the job, rather than in the RecordWriter.write() method. On a real cluster write() runs in a separate task JVM with its own copy of the Configuration, so conf.setStrings() calls made there never reach the Configuration object the driver reads after waitForCompletion().
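A minimal sketch of that direction, assuming the values are already known (or computed) on the driver side before the job is submitted; the names are taken from the code above:
Configuration conf = new Configuration();
// Set the values on the driver-side Configuration before the Job is created,
// so the tasks and any later conf.getStrings("Keytext") call both see them.
conf.setStrings("Keytext", key1string);
conf.setStrings("valtext", valuestring);

Job job = new Job(conf, "ResumeSearchClass");
// ... the rest of the job setup from the driver code above ...
job.waitForCompletion(true);

String[] keytextarray = conf.getStrings("Keytext"); // no longer null
If the values are only produced inside the tasks, the usual pattern is to write them into the job output (or a side file on HDFS) and read them back in the driver after waitForCompletion(), because task-side Configuration changes do not travel back to the driver.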
I have the same problem as mentioned in this question (Type mismatch in key from map when replacing Mapper with MultithreadMapper), but the answer does not work for me.
The error message I get looks like the following:
13/09/17 10:37:38 INFO mapred.JobClient: Task Id : attempt_201309170943_0006_m_000000_0, Status : FAILED
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1019)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:690)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Here is my main method:
public static int main(String[] init_args) throws Exception {
Configuration config = new Configuration();
if (args.length != 5) {
System.out.println("Invalid Arguments");
print_usage();
throw new IllegalArgumentException();
}
config.set("myfirstdata", args[0]);
config.set("myseconddata", args[1]);
config.set("mythirddata", args[2]);
config.set("mykeyattribute", "GK");
config.setInt("myy", 50);
config.setInt("myx", 49);
// additional attributes
config.setInt("myobjectid", 1);
config.setInt("myplz", 3);
config.setInt("mygenm", 4);
config.setInt("mystnm", 6);
config.setInt("myhsnr", 7);
config.set("mapred.textoutputformat.separator", ";");
Job job = new Job(config);
job.setJobName("MySample");
// set the outputs for the Job
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// set the outputs for the Job
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
MultithreadedMapper.setMapperClass(job, MyMapper.class);
job.setReducerClass(MyReducer.class);
// In our case, the combiner is the same as the reducer. This is
// possible
// for reducers that are both commutative and associative
job.setCombinerClass(MyReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.setInputPaths(job, new Path(args[3]));
TextOutputFormat.setOutputPath(job, new Path(args[4]));
job.setJarByClass(MySampleDriver.class);
MultithreadedMapper.setNumberOfThreads(job, 2);
return job.waitForCompletion(true) ? 0 : 1;
}
The mapper code looks like this:
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
...
/**
* Sets up mapper with filter geometry provided as argument[0] to the jar
*/
@Override
public void setup(Context context) {
...
}
@Override
public void map(LongWritable key, Text val, Context context)
throws IOException, InterruptedException {
...
// We know that the first line of the CSV is just headers, so at byte
// offset 0 we can just return
if (key.get() == 0)
return;
String line = val.toString();
String[] values = line.split(";");
float latitude = Float.parseFloat(values[latitudeIndex]);
float longitude = Float.parseFloat(values[longitudeIndex]);
...
// Create our Point directly from longitude and latitude
Point point = new Point(longitude, latitude);
IntWritable one = new IntWritable();
if (...) {
int name = ...
one.set(name);
String out = ...
context.write(new Text(out), one);
} else {
String out = ...
context.write(new Text(out), new IntWritable(-1));
}
}
}
You forgot to set the mapper class on the job: you need to add job.setMapperClass(MultithreadedMapper.class); to your code. Without it the job runs the default identity Mapper, which passes the LongWritable input key straight through, and that is exactly the type mismatch reported in the error.
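In context, the mapper-related lines of the main method above would then look like this (only those lines are shown):
// Run the MultithreadedMapper wrapper as the job's mapper...
job.setMapperClass(MultithreadedMapper.class);
// ...and tell it which mapper implementation to run inside its threads.
MultithreadedMapper.setMapperClass(job, MyMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 2);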