I wrote a MapReduce class and created a jar file from it. Now I want to use this jar in another Java program.
Can anyone please tell me how I could do this?
thanks
Here is my MapReduce program:
package org.apache.cassandra.com;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class CassandraSumLib extends Configured implements Tool
{
public CassandraSumLib(){
}
static final String KEYSPACE = "weather";
static final String COLUMN_FAMILY = "momentinfo1";
static final String OUTPUT_PATH = "/tmp/OutPut";
private static final Logger logger = LoggerFactory.getLogger(CassandraSumLib.class);
public int CassandraSum(String[] args) throws Exception
{
return ToolRunner.run(new Configuration(), new CassandraSumLib(), args);
}
///////////////////////////////////////////////////////////
public static class Summap extends Mapper<Map<String, ByteBuffer>, Map<String, ByteBuffer>, Text, DoubleWritable>
{
Text word = new Text("SUM");
float temp;
public void map(Map<String, ByteBuffer> keys, Map<String, ByteBuffer> columns, Context context) throws IOException, InterruptedException
{
for (Entry<String, ByteBuffer> column : columns.entrySet())
{
if (!"column".equals(column.getKey()))
continue;
temp = ByteBufferUtil.toFloat(column.getValue());
//System.out.println(temp);
context.write(word, new DoubleWritable(temp));
//System.out.println(word + " " + temp);
}
}
}
///////////////////////////////////////////////////////////
public static class Sumred extends Reducer<Text, DoubleWritable, Text, DoubleWritable>
{
public void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException
{
Double sum = 0.0;
for (DoubleWritable val : values){
// System.out.println(val.get());
sum += val.get();}
context.write(key, new DoubleWritable(sum));
}
}
///////////////////////////////////////////////////////////
public int run(String[] args) throws Exception
{
Job job = new Job(getConf(), "SUM");
job.setJarByClass(CassandraSumLib.class);
job.setMapperClass(Summap.class);
JobConf conf = new JobConf(getConf(), CassandraSumLib.class);
// conf.setNumMapTasks(1000);
// conf.setNumReduceTasks(900);
job.setOutputFormatClass(TextOutputFormat.class);
job.setCombinerClass(Sumred.class);
job.setReducerClass(Sumred.class);
job.setOutputKeyClass(Text.class);
job.setNumReduceTasks(900);
job.setOutputValueClass(DoubleWritable.class);
FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
job.setInputFormatClass(CqlPagingInputFormat.class);
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "3");
job.waitForCompletion(true);
return 0;
}
}
I want to call this class from another program. Here is my second program that calls my first program:
package org.apache.cassandra.com;
import java.util.*;
import org.apache.hadoop.util.RunJar;
import org.apache.cassandra.com.CassandraSumLib;
public class CassandraSum {
public static void main(String[] args) throws Exception{
CassandraSumLib CSL = new CassandraSumLib();
CSL.??? (which method should I write here?)
}
}
thanks
Steps to add a jar file in Eclipse:
1. Right-click on the project
2. Click on Build Path -> Configure Build Path
3. Click on Java Build Path
4. Click on the Libraries tab
5. Click on the Add External JARs button
6. Choose the jar file
7. Click OK
Add the jar to the classpath of the second program. If you are compiling/running from the command line, use the -cp option.
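For example, here is a minimal sketch of what the calling program's main could look like (just a sketch, assuming the jar built from CassandraSumLib is already on the classpath, either via the Eclipse steps above or via -cp). Since CassandraSumLib implements Tool, you can hand an instance of it to ToolRunner:
package org.apache.cassandra.com;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class CassandraSum {
    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic Hadoop options and then calls CassandraSumLib.run(args).
        int exitCode = ToolRunner.run(new Configuration(), new CassandraSumLib(), args);
        System.exit(exitCode);
    }
}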
Related
I have a unique requirement where I have to pass zip shell commands from a text file, and the mapper will process the script so that zip files are created in parallel using only mappers. I am thinking of executing the shell command using exec in Java. I am a bit stuck on how to implement the custom mapper, as my output would be in a compressed format.
Below is my mapper class -
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class Map extends Mapper<LongWritable, Text, Text, NullWritable>{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String line= value.toString();
StringTokenizer tokenizer= new StringTokenizer(line);
while(tokenizer.hasMoreTokens()){
value.set(tokenizer.nextToken());
context.write(value,NullWritable.get());
}
}
}
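For the zipping itself, this is a rough sketch of the kind of mapper I have in mind (an illustration only: it assumes the zip binary and the source folders exist locally on every task node, and ZipCommandMapper is just a placeholder name). Each input line is taken as one complete shell command and executed with ProcessBuilder:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ZipCommandMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String command = value.toString().trim();   // one full zip command per input line
        if (command.isEmpty()) {
            return;
        }
        // Run the line through a shell so the quoting in mapr.txt is honored.
        ProcessBuilder pb = new ProcessBuilder("bash", "-c", command);
        pb.redirectErrorStream(true);               // merge stderr into stdout
        Process p = pb.start();
        int exitCode = p.waitFor();                 // block until the zip finishes
        // Record which command ran and how it finished in the job output.
        context.write(new Text(command + " -> exit " + exitCode), NullWritable.get());
    }
}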
Processor class -
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
public class ZipProcessor extends Configured implements Tool {
public static void main(String [] args) throws Exception{
int exitCode = ToolRunner.run(new ZipProcessor(), args);
System.exit(exitCode);
}
public int run(String[] args) throws Exception {
if(args.length!=2){
System.err.printf("Usage: %s needs two arguments, input and output files\n", getClass().getSimpleName());
return -1;
}
Configuration conf=new Configuration();
Job job = Job.getInstance(conf,"zipping");
job.setJarByClass(ZipProcessor.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapperClass(Map.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
int returnValue = job.waitForCompletion(true) ? 0:1;
if(job.isSuccessful()) {
System.out.println("Job was successful");
} else if(!job.isSuccessful()) {
System.out.println("Job was not successful");
}
return returnValue;
}
}
Sample mapr.txt
zip -r "/folder1/file.zip" "sourceFolder"
zip -r "/folder2/file.zip" "sourceFolder"
zip -r "/folder3/file.zip" "sourceFolder"
I'm a newbie to Hadoop programming and I have started learning by setting up Hadoop 2.7.1 on a three-node cluster. I tried running the hello-world jars that come out of the box with Hadoop and they ran fine, but I wrote my own driver code on my local machine, bundled it into a jar, and executed it this way, and it fails with no error messages.
Here is my code and this is what I did.
WordCountMapper.java
package mot.com.bin.test;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
public void map(LongWritable key, Text Value,
OutputCollector<Text, IntWritable> opc, Reporter r)
throws IOException {
String s = Value.toString();
for (String word :s.split(" ")) {
if( word.length() > 0) {
opc.collect(new Text(word), new IntWritable(1));
}
}
}
}
WordCountReduce.java
package mot.com.bin.test;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WordCountReduce extends MapReduceBase implements Reducer < Text, IntWritable, Text, IntWritable>{
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> opc, Reporter r)
throws IOException {
// TODO Auto-generated method stub
int i = 0;
while (values.hasNext()) {
IntWritable in = values.next();
i+=in.get();
}
opc.collect(key, new IntWritable (i));
}
}
WordCount.java
/**
* DRIVER
*/
package mot.com.bin.test;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.io.Text;
//import com.sun.jersey.core.impl.provider.entity.XMLJAXBElementProvider.Text;
/**
* @author rgb764
*
*/
public class WordCount extends Configured implements Tool{
/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
}
public int run(String[] arg0) throws Exception {
if (arg0.length < 2) {
System.out.println("Need input file and output directory");
return -1;
}
JobConf conf = new JobConf();
FileInputFormat.setInputPaths(conf, new Path( arg0[0]));
FileOutputFormat.setOutputPath(conf, new Path( arg0[1]));
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(WordCountMapper.class);
conf.setReducerClass(WordCountReduce.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
return 0;
}
}
First I tried exporting it as a jar from Eclipse and running it on my Hadoop cluster: no errors, but no success either. Then I moved my individual Java files to my NameNode, compiled each one, and created the jar file there; the hadoop command still returns no results, but no errors either. Kindly help me with this.
hadoop jar WordCout.jar mot.com.bin.test.WordCount /karthik/mytext.txt /tempo
I extracted all dependent jar files using Maven and added them to the classpath on my NameNode. Help me figure out what I am doing wrong and where.
IMO you are missing the code in your main method that instantiates the Tool implementation (WordCount in your case) and runs it:
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new WordCount(), args);
System.exit(res);
}
Refer to this.
I have a text-based input file around 25 GB in size. In that file, a single record consists of 4 lines, and the processing for every record is the same, but within each record the four lines are processed differently.
I'm new to Hadoop, so I wanted guidance on whether to use NLineInputFormat in this situation or the default TextInputFormat. Thanks in advance!
Assuming you have the text file in the following format:
2015-8-02
error2014 blahblahblahblah
2015-8-02
blahblahbalh error2014
You could use NLineInputFormat.
With NLineInputFormat functionality, you can specify exactly how many lines should go to a mapper.
In your case you can use it to send 4 lines to each mapper.
EDIT:
Here is an example for using NLineInputFormat:
Mapper Class:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(key, value);
}
}
Driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Driver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.out
.printf("Two parameters are required for DriverNLineInputFormat- <input dir> <output dir>\n");
return -1;
}
Job job = new Job(getConf());
job.setJobName("NLineInputFormat example");
job.setJarByClass(Driver.class);
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.addInputPath(job, new Path(args[0]));
job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 4);
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MapperNLine.class);
job.setNumReduceTasks(0);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
System.exit(exitCode);
}
}
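As a side note, on a reasonably recent Hadoop 2.x release the same setting can be made through a helper on NLineInputFormat instead of writing the configuration key by hand (an equivalent alternative, not a required change); inside run() above you could write:
// Same effect as setting "mapreduce.input.lineinputformat.linespermap" to 4
NLineInputFormat.setNumLinesPerSplit(job, 4);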
I have 2 files which need to be accessed by the Hadoop cluster: good.txt and bad.txt respectively.
Firstly, since both of these files need to be accessed from different nodes, I place them in the distributed cache in the driver class as follows:
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/training/Rakshith/good.txt"),conf);
DistributedCache.addCacheFile(new URI("/user/training/Rakshith/bad.txt"),conf);
Job job = new Job(conf);
Now both the good and bad files are placed in the distributed cache. I access the distributed cache in the mapper class as follows:
public class LetterMapper extends Mapper<LongWritable,Text,LongWritable,Text> {
private Path[]files;
@Override
protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException {
files=DistributedCache.getLocalCacheFiles(new Configuration(context.getConfiguration()));
}
I need to check if a word is present in good.txt or bad.txt, so I use something like this:
File file=new File(files[0].toString()); //to access good.txt
BufferedReader br=new BufferedReader(new FileReader(file));
StringBuilder sb=new StringBuilder();
String input=null;
while((input=br.readLine())!=null){
sb.append(input);
}
input=sb.toString();
I am supposed to get the content of the good file in my input variable, but I don't get it. Have I missed anything?
Does the job finish successfully? The map task may fail because you are using JobConf in this line:
files=DistributedCache.getLocalCacheFiles(new JobConf(context.getConfiguration()));
If you change it like this it should work; I don't see any problem with the remaining code you posted in the question.
files=DistributedCache.getLocalCacheFiles(context.getConfiguration());
or
files=DistributedCache.getLocalCacheFiles(new Configuration(context.getConfiguration()));
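Putting that together, a minimal sketch of the corrected setup() (keeping the files field and the imports already present in your mapper; the length check is just an extra guard, not required) would be:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // Read the localized cache file paths straight from the task's configuration.
    files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (files == null || files.length < 2) {
        throw new IOException("Expected good.txt and bad.txt in the distributed cache");
    }
}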
@rVr this is my driver class:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class AvgWordLength {
public static void main(String[] args) throws Exception {
if (args.length !=2) {
System.out.printf("Usage: AvgWordLength <input dir> <output dir>\n");
System.exit(-1);
}
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/training/Rakshith/good.txt"),conf);
DistributedCache.addCacheFile(new URI("/user/training/Rakshith/bad.txt"),conf);
Job job = new Job(conf);
job.setJarByClass(AvgWordLength.class);
job.setJobName("Average Word Length");
FileInputFormat.setInputPaths(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job,new Path(args[1]));
job.setMapperClass(LetterMapper.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}
And my mapper class is
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
public class LetterMapper extends Mapper<LongWritable,Text,LongWritable,Text> {
private Path[]files;
@Override
protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException {
files=DistributedCache.getLocalCacheFiles(new Configuration(context.getConfiguration()));
System.out.println("in setup()"+files.toString());
}
@Override
public void map(LongWritable key, Text value, Context context)throws IOException,InterruptedException{
int i=0;
System.out.println("in map----->>"+files.toString());//added just to view logs
HashMap<String,String> h=new HashMap<String,String>();
String negword=null;
String input=value.toString();
if(isPresent(input, files[0])){
h.put(input,"good");
}
else
if(isPresent(input, files[1])){
h.put(input,"bad");
}
}
public static boolean isPresent(String n,Path files2) throws IOException{
File file=new File(files2.toString());
BufferedReader br=new BufferedReader(new FileReader(file));
StringBuilder sb=new StringBuilder();
String input=null;
while((input=br.readLine())!=null){
sb.append(input);
}
input=sb.toString();
//System.out.println(input);
Pattern pattern=Pattern.compile(n);
Matcher matcher=pattern.matcher(input);
if(matcher.find()){
return true;
}
else
return false;
}
}
I am trying to run a WordCount MapReduce program to read and count data stored in a Cassandra table (column family), but when I compile my program I get the same error repeated several times. Below is my source code and the error I got. Can anyone help me solve this issue? Thanks in advance.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.*;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.*;
import org.apache.cassandra.utils.ByteBufferUtil;
/**
* This sums the word count stored in the input_words_count ColumnFamily for the key "key-if-verse1".
*
* Output is written to a text file.
*/
public class WordCountCounters extends Configured implements Tool
{
private static final Logger logger = LoggerFactory.getLogger(WordCountCounters.class);
static final String COUNTER_COLUMN_FAMILY = "input_words";
private static final String OUTPUT_PATH_PREFIX = "/Users/Deepu/Documents/dse-3.2.4/dse-data/word_count_counters";
public static void main(String[] args) throws Exception
{
// Let ToolRunner handle generic command-line options
ToolRunner.run(new Configuration(), new WordCountCounters(), args);
System.exit(0);
}
public static class SumMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable>
{
public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context) throws IOException, InterruptedException
{
long sum = 0;
for (IColumn column : columns.values())
{
logger.debug("read " + key + ":" + column.name() + " from " + context.getInputSplit());
sum += ByteBufferUtil.toLong(column.value());
}
context.write(new Text(ByteBufferUtil.string(key)), new LongWritable(sum));
}
}
public int run(String[] args) throws Exception
{
Job job = new Job(getConf(), "wordcountcounters");
job.setJarByClass(WordCountCounters.class);
job.setMapperClass(SumMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH_PREFIX));
job.setInputFormatClass(ColumnFamilyInputFormat.class);
ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
ConfigHelper.setInputColumnFamily(job.getConfiguration(), WordCount.KEYSPACE, WordCountCounters.COUNTER_COLUMN_FAMILY);
SlicePredicate predicate = new SlicePredicate().setSlice_range(
new SliceRange().
setStart(ByteBufferUtil.EMPTY_BYTE_BUFFER).
setFinish(ByteBufferUtil.EMPTY_BYTE_BUFFER).
setCount(100));
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
job.waitForCompletion(true);
return 0;
}
}
Compilation errors are:
Perhaps because you commented out these two lines:
//import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
//import org.apache.cassandra.hadoop.ConfigHelper;