I am running a Hadoop job and trying to write the output to Cassandra. I am getting the following exception:
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to java.nio.ByteBuffer
at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter.write(ColumnFamilyRecordWriter.java:60)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.hadoop.mapreduce.Reducer.reduce(Reducer.java:156)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
I modeled my MapReduce code on the WordCount example given at https://wso2.org/repos/wso2/trunk/carbon/dependencies/cassandra/contrib/word_count/src/WordCount.java
Here's my MR code:
public class SentimentAnalysis extends Configured implements Tool {
static final String KEYSPACE = "Travel";
static final String OUTPUT_COLUMN_FAMILY = "Keyword_PtitleId";
public static class Map extends Mapper<LongWritable, Text, Text, LongWritable> {
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
Sentiment sentiment = null;
try {
sentiment = (Sentiment) PojoMapper.fromJson(line, Sentiment.class);
} catch(Exception e) {
return;
}
if(sentiment != null && sentiment.isLike()) {
word.set(sentiment.getNormKeyword());
context.write(word, new LongWritable(sentiment.getPtitleId()));
}
}
}
public static class Reduce extends Reducer<Text, LongWritable, ByteBuffer, List<Mutation>> {
private ByteBuffer outputKey;
public void reduce(Text key, Iterator<LongWritable> values, Context context) throws IOException, InterruptedException {
List<Long> ptitles = new ArrayList<Long>();
java.util.Map<Long, Integer> ptitleToFrequency = new HashMap<Long, Integer>();
while (values.hasNext()) {
Long value = values.next().get();
ptitles.add(value);
}
for(Long ptitle : ptitles) {
if(ptitleToFrequency.containsKey(ptitle)) {
ptitleToFrequency.put(ptitle, ptitleToFrequency.get(ptitle) + 1);
}
else {
ptitleToFrequency.put(ptitle, 1);
}
}
byte[] keyBytes = key.getBytes();
outputKey = ByteBuffer.wrap(Arrays.copyOf(keyBytes, keyBytes.length));
for(Long ptitle : ptitleToFrequency.keySet()) {
context.write(outputKey, Collections.singletonList(getMutation(new Text(ptitle.toString()), ptitleToFrequency.get(ptitle))));
}
}
private static Mutation getMutation(Text word, int sum)
{
Column c = new Column();
byte[] wordBytes = word.getBytes();
c.name = ByteBuffer.wrap(Arrays.copyOf(wordBytes, wordBytes.length));
c.value = ByteBuffer.wrap(String.valueOf(sum).getBytes());
c.timestamp = System.currentTimeMillis() * 1000;
Mutation m = new Mutation();
m.column_or_supercolumn = new ColumnOrSuperColumn();
m.column_or_supercolumn.column = c;
return m;
}
}
public static void main(String[] args) throws Exception {
int ret = ToolRunner.run(new SentimentAnalysis(), args);
System.exit(ret);
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "SentimentAnalysis");
job.setJarByClass(SentimentAnalysis.class);
String inputFile = args[0];
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(ByteBuffer.class);
job.setOutputValueClass(List.class);
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
job.setInputFormatClass(TextInputFormat.class);
ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, OUTPUT_COLUMN_FAMILY);
FileInputFormat.setInputPaths(job, inputFile);
ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
}
If you look at the Reduce class, you can see that I am converting the Text key to a ByteBuffer.
Would appreciate some pointers on how to fix this.
After some trial and error, I was able to figure out how to solve this particular issue. In my reduce method signature I was using Iterator instead of Iterable, so my reduce method was never actually called. Hadoop was therefore trying to write my mapper output (Text, LongWritable) to Cassandra using the output key/value classes declared for the reducer (ByteBuffer, List), which caused the ClassCastException.
Changing the reduce method signature to Iterable solved this issue.
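For reference, here is a minimal sketch of the corrected reducer, assuming the body otherwise stays as posted above:
public static class Reduce extends Reducer<Text, LongWritable, ByteBuffer, List<Mutation>> {
    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // With Iterable this method actually overrides Reducer.reduce and gets called;
        // with Iterator it was just an unrelated overload, so the identity reduce ran instead.
        java.util.Map<Long, Integer> ptitleToFrequency = new HashMap<Long, Integer>();
        for (LongWritable value : values) {
            Long ptitle = value.get();
            Integer count = ptitleToFrequency.get(ptitle);
            ptitleToFrequency.put(ptitle, count == null ? 1 : count + 1);
        }
        // Copy only the valid bytes of the Text key (getBytes() may return a larger backing array).
        ByteBuffer outputKey = ByteBuffer.wrap(Arrays.copyOf(key.getBytes(), key.getLength()));
        for (Long ptitle : ptitleToFrequency.keySet()) {
            context.write(outputKey,
                    Collections.singletonList(getMutation(new Text(ptitle.toString()), ptitleToFrequency.get(ptitle))));
        }
    }
    // getMutation(...) unchanged from the original Reduce class
}
The @Override annotation is worth adding here: with the Iterator version it would have produced a compile error instead of silently falling back to the default reduce implementation.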
This is my code:
public class solution1 {
public static void main(String[] args) throws IOException {
String localStr = args[0];
String hdfsStr = args[1];
Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(URI.create(hdfsStr), conf);
FileSystem local = FileSystem.getLocal(conf);
Path inputDir = new Path(localStr);
String folderName = inputDir.getName();
Path hdfsFile = new Path(hdfsStr, folderName);
try {
FileStatus[] inputFiles = local.listStatus(inputDir);
FSDataOutputStream out = hdfs.create(hdfsFile);
for (int i=0; i<inputFiles.length; i++) {
System.out.println(inputFiles[i].getPath().getName());
FSDataInputStream in = local.open(inputFiles[i].getPath());
byte buffer[] = new byte[256];
int bytesRead = 0;
while( (bytesRead = in.read(buffer)) > 0) {
out.write(buffer, 0, bytesRead);
}
in.close();
}
out.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
I am trying to use this code to loop through hundreds of .txt files.
This is the content of the .txt files, and there are 500 of these files.
Updated:
This is what I typed in VirtualBox and the result I get. The output is not the expected output; there is no successful result.
It is not reading the content inside the individual .txt files.
But if I use the same command to run another Java file, there is a result and it runs successfully.
The code for WordCount.java is here:
public class WordCount {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
}
I am stuck and cannot really figure out how to proceed. What should I type in VirtualBox for my code to read all 500 .txt files?
I create two jobs and want to chain them so that one job is executed just after the previous job is complete. So I wrote the following code. But as far as I have observed, job1 finishes correctly while job2 never seems to execute.
public class Simpletask extends Configured implements Tool {
public static enum FileCounters {
COUNT;
}
public static class TokenizerMapper extends Mapper<Object, Text, IntWritable, Text>{
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
String line = itr.nextToken();
String part[] = line.split(",");
int id = Integer.valueOf(part[0]);
int x1 = Integer.valueOf(part[1]);
int y1 = Integer.valueOf(part[2]);
int z1 = Integer.valueOf(part[3]);
int x2 = Integer.valueOf(part[4]);
int y2 = Integer.valueOf(part[5]);
int z2 = Integer.valueOf(part[6]);
int h_v = Hilbert(x1,y1,z1);
int parti = h_v/10;
IntWritable partition = new IntWritable(parti);
Text neuron = new Text();
neuron.set(line);
context.write(partition,neuron);
}
}
public int Hilbert(int x,int y,int z){
return (int) (Math.random()*20);
}
}
public static class IntSumReducer extends Reducer<IntWritable,Text,IntWritable,Text> {
private Text result = new Text();
private MultipleOutputs<IntWritable, Text> mos;
public void setup(Context context) {
mos = new MultipleOutputs<IntWritable, Text>(context);
}
<K, V> String generateFileName(K k) {
return "p"+k.toString();
}
public void reduce(IntWritable key,Iterable<Text> values, Context context) throws IOException, InterruptedException {
String accu = "";
for (Text val : values) {
String[] entry=val.toString().split(",");
String MBR = entry[1];
accu+=entry[0]+",MBR"+MBR+" ";
}
result.set(accu);
context.getCounter(FileCounters.COUNT).increment(1);
mos.write(key, result, generateFileName(key));
}
}
public static class RTreeMapper extends Mapper<Object, Text, IntWritable, Text>{
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
System.out.println("WOWOWOWOW RUNNING");// NOTHING SHOWS UP!
}
}
public static class RTreeReducer extends Reducer<IntWritable,Text,IntWritable,Text> {
private MultipleOutputs<IntWritable, Text> mos;
Text t = new Text();
public void setup(Context context) {
mos = new MultipleOutputs<IntWritable, Text>(context);
}
public void reduce(IntWritable key,Iterable<Text> values, Context context) throws IOException, InterruptedException {
t.set("dsfs");
mos.write(key, t, "WOWOWOWOWOW"+key.get());
//ALSO, NOTHING IS WRITTEN TO THE FILE!!!!!
}
}
public static class RTreeInputFormat extends TextInputFormat{
protected boolean isSplitable(FileSystem fs, Path file) {
return false;
}
}
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Enter valid number of arguments <Inputdirectory> <Outputlocation>");
System.exit(0);
}
ToolRunner.run(new Configuration(), new Simpletask(), args);
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Job1");
job.setJarByClass(Simpletask.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
boolean complete = job.waitForCompletion(true);
//================RTree Loop============
int capacity = 3;
Configuration rconf = new Configuration();
Job rtreejob = Job.getInstance(rconf, "rtree");
if(complete){
int count = (int) job.getCounters().findCounter(FileCounters.COUNT).getValue();
System.out.println("File count: "+count);
String path = null;
for(int i=0;i<count;i++){
path = "/Worker/p"+i+"-m-00000";
System.out.println("Add input path: "+path);
FileInputFormat.addInputPath(rtreejob, new Path(path));
}
System.out.println("Input path done.");
FileOutputFormat.setOutputPath(rtreejob, new Path("/RTree"));
rtreejob.setJarByClass(Simpletask.class);
rtreejob.setMapperClass(RTreeMapper.class);
rtreejob.setCombinerClass(RTreeReducer.class);
rtreejob.setReducerClass(RTreeReducer.class);
rtreejob.setOutputKeyClass(IntWritable.class);
rtreejob.setOutputValueClass(Text.class);
rtreejob.setInputFormatClass(RTreeInputFormat.class);
complete = rtreejob.waitForCompletion(true);
}
return 0;
}
}
For a MapReduce job, the output directory should not exist. The framework checks for the output directory first; if it exists, the job fails. In your case, you specified the same output directory for both jobs. I modified your code: I changed args[1] to args[2] for job2, so the third argument is now the output directory of the second job. So pass a third argument as well.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Job1");
job.setJarByClass(Simpletask.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//AND THEN I WAIT THIS JOB TO COMPLETE.
boolean complete = job.waitForCompletion(true);
//I START A NEW JOB, BUT WHY IS IT NOT RUNNING?
Configuration conf = new Configuration();
Job job2 = Job.getInstance(conf, "Job2");
job2.setJarByClass(Simpletask.class);
job2.setMapperClass(TokenizerMapper.class);
job2.setCombinerClass(IntSumReducer.class);
job2.setReducerClass(IntSumReducer.class);
job2.setOutputKeyClass(IntWritable.class);
job2.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job2, new Path(args[0]));
FileOutputFormat.setOutputPath(job2, new Path(args[2]));
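If you do want to keep reusing an output location, one option (a sketch of my own, not part of the code above) is to delete the old output directory before submitting the second job:
// Remove job2's output path if it already exists, so the job passes the
// output-directory check. fs.delete(path, true) deletes recursively.
Path out2 = new Path(args[2]);
FileSystem fs = FileSystem.get(job2.getConfiguration());
if (fs.exists(out2)) {
    fs.delete(out2, true);
}
FileOutputFormat.setOutputPath(job2, out2);
Writing each job to a fresh directory, as above, is still the safer habit.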
A few possible causes of errors:
conf is declared twice (no compile error there?)
The output path of job2 already exists, as it was created from job1 (+1 to Amal G Jose's answer)
I think you should also explicitly set job.setMapOutputKeyClass(IntWritable.class); and job.setMapOutputValueClass(Text.class); for both jobs, so they match the mapper's actual output types.
Do you also have a command to execute job2 after the code snippet that you posted? I mean, do you actually run job2.waitForCompletion(true); or something similar to that? (See the sketch after this list.)
Overall: check the logs for error messages, which should clearly explain what went wrong.
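For that last point, here is a minimal chaining sketch, assuming both jobs are otherwise fully configured as in the question (this is my own illustration, not code from the original post):
// Run job1 and block until it finishes; only submit job2 if job1 succeeded.
boolean job1Succeeded = job.waitForCompletion(true);
if (job1Succeeded) {
    boolean job2Succeeded = job2.waitForCompletion(true);
    System.exit(job2Succeeded ? 0 : 1);
}
System.exit(1);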
I am trying to use HBase and Hadoop together. When I run the JAR file I get this error. Here is my source code:
public class TwitterTable {
final static Charset ENCODING = StandardCharsets.UTF_8;
final static String FILE_NAME = "/home/hduser/project04/sample.txt";
static class Mapper1 extends TableMapper<ImmutableBytesWritable, IntWritable>
{
byte[] value;
@Override
public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException
{
value = values.getValue(Bytes.toBytes("text"), Bytes.toBytes(""));
String valueStr = Bytes.toString(value);
System.out.println("GET: " + valueStr);
}
}
public static class Reducer1 extends TableReducer<ImmutableBytesWritable, IntWritable, ImmutableBytesWritable> {
public void reduce(ImmutableBytesWritable key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
}
}
public static void main( String args[] ) throws IOException, ClassNotFoundException, InterruptedException
{
Configuration conf = new Configuration();
@SuppressWarnings("deprecation")
Job job = new Job(conf, "TwitterTable");
job.setJarByClass(TwitterTable.class);
HTableDescriptor ht = new HTableDescriptor( "twitter" );
ht.addFamily( new HColumnDescriptor("text"));
HBaseAdmin hba = new HBaseAdmin( conf );
if(!hba.tableExists("twitter"))
{
hba.createTable( ht );
System.out.println( "Table Created!" );
}
//Read the file and add to the database
TwitterTable getText = new TwitterTable();
Scan scan = new Scan();
String columns = "text";
scan.addColumn(Bytes.toBytes(columns), Bytes.toBytes(""));
TableMapReduceUtil.initTableMapperJob("twitter", scan, Mapper1.class, ImmutableBytesWritable.class,
IntWritable.class, job);
job.waitForCompletion(true);
//getText.readTextFile(FILE_NAME);
}
void readTextFile(String aFileName) throws IOException
{
Path path = Paths.get(aFileName);
try (BufferedReader reader = Files.newBufferedReader(path, ENCODING)){
String line = null;
while ((line = reader.readLine()) != null) {
//process each line in some way
addToTable(line);
}
}
System.out.println("all done!");
}
void addToTable(String line) throws IOException
{
Configuration conf = new Configuration();
HTable table = new HTable(conf, "twitter");
String LineText[] = line.split(",");
String row = "";
String text = "";
row = LineText[0].toString();
row = row.replace("\"", "");
text = LineText[1].toString();
text = text.replace("\"", "");
Put put = new Put(Bytes.toBytes(row));
put.addColumn(Bytes.toBytes("text"), Bytes.toBytes(""), Bytes.toBytes(text));
table.put(put);
table.flushCommits();
table.close();
}
}
I added the classpath to hadoop-env.sh, still no luck. I don't know what the problem is. Here is my hadoop-env.sh classpath:
export HADOOP_CLASSPATH=
/usr/lib/hbase/hbase-1.0.0/lib/hbase-common-1.0.0.jar:
/usr/lib/hbase/hbase-1.0.0/lib/hbase-client.jar:
/usr/lib/hbase/hbase-1.0.0/lib/log4j-1.2.17.jar:
/usr/lib/hbase/hbase-1.0.0/lib/hbase-it-1.0.0.jar:
/usr/lib/hbase/hbase-1.0.0/lib/hbase-common-1.0.0-tests.jar:
/usr/lib/hbase/hbase-1.0.0/conf:
/usr/lib/hbase/hbase-1.0.0/lib/zookeeper-3.4.6.jar:
/usr/lib/hbase/hbase-1.0.0/lib/protobuf-java-2.5.0.jar:
/usr/lib/hbase/hbase-1.0.0/lib/guava-12.0.1.jar
OK, I found it. Maybe you cannot add everything to the classpath. In that case, copy all the libraries from HBase and add them to Hadoop (refer to hadoop-env.sh), under:
HADOOP_DIR/contrib/capacity-scheduler
It worked for me.
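As an alternative worth trying (this is my suggestion, not part of the answer above), HBase's TableMapReduceUtil can ship the HBase jars with the job itself, which avoids editing hadoop-env.sh on every node. In the driver it would look roughly like:
// Let HBase add its dependency jars to the job's distributed cache
// instead of relying on HADOOP_CLASSPATH being set everywhere.
TableMapReduceUtil.initTableMapperJob("twitter", scan, Mapper1.class,
        ImmutableBytesWritable.class, IntWritable.class, job);
TableMapReduceUtil.addDependencyJars(job);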
I am trying to calculate word frequency using the order inversion design pattern.
Here is my Java code:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;
import java.io.*;
import java.util.*;
public class WordFreq2 {
static enum StatusCounters {MAP_COUNTER, REDUCE_COUNTER, TOTAL_WORDS}
static enum MyExceptions {IO_EXCEPTION, INTERRUPTED_EXCEPTION, NULL_POINTER_EXCEPTION}
public static class MyComparator extends WritableComparator {
public int compare(WritableComparable a, WritableComparable b)
{
if (a.toString().equals("special_key0") && b.toString().equals("special_key1") )
return 0;
else
if ( a.toString().equals("special_key0") || a.toString().equals("special_key1") )
return -1;
else
if ( b.toString().equals("special_key0") || a.toString().equals("special_key1") )
return 1;
else
return a.toString().compareTo(b.toString());
}
}
public static class MyPartitioner extends Partitioner<Text,IntWritable>
{
public int getPartition(Text key, IntWritable value, int num)
{
if ( key.toString().equals("special_key0") )
return 0;
else
if ( key.toString().equals("special_key1") )
return 1;
else
return key.hashCode() % num;
}
}
public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
private Text word = new Text();
private final int MEMORYHASHSIZE = 7;
private final HashMap<String,Integer> memoryHash = new HashMap<String,Integer>(MEMORYHASHSIZE);
private int special_key_count = 0;
protected void setup(Context context) throws IOException, InterruptedException {
}
protected void cleanup(Context context) throws IOException, InterruptedException {
flushMap(context);
for ( int c = 0; c < context.getNumReduceTasks(); c++)
{
word.set("special_key"+c);
context.write(word,new IntWritable(special_key_count));
}
}
private void flushMap(Context context) throws IOException, InterruptedException
{
Iterator<Map.Entry<String, Integer>> entries = memoryHash.entrySet().iterator();
while (entries.hasNext()) {
Map.Entry<String, Integer> entry = entries.next();
word.set(entry.getKey());
context.write(word,new IntWritable(entry.getValue()));
entries.remove();
}
}
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
context.progress(); //in case of long running code, report that something is happening
while (tokenizer.hasMoreTokens())
{
String current_token = tokenizer.nextToken();
// Key present in our in-memory hash table
if ( memoryHash.containsKey(current_token) )
{
// Increase the corresponding counter
Integer val = memoryHash.get(current_token);
memoryHash.put(current_token,++val);
}
else
{
// Flush the HashTable if size limit reached
if ( memoryHash.size() == MEMORYHASHSIZE)
flushMap(context);
memoryHash.put(current_token,1); // Make a new key with corresponding count 1
}
special_key_count++;
context.getCounter(StatusCounters.MAP_COUNTER).increment(1);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, FloatWritable>
{
int total_words;
protected void setup(Context context) throws IOException, InterruptedException {
total_words=0;
}
protected void cleanup(Context context) throws IOException, InterruptedException {
context.getCounter(StatusCounters.TOTAL_WORDS).increment(total_words);
}
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
float frequency;
for (IntWritable val : values)
{
if(key.toString().equals("special0") || key.toString().equals("special1"))
{
total_words = total_words + val.get();
}
else
{
frequency = val.get() / total_words;
context.write(key, new FloatWritable(frequency));
}
}
context.progress(); //in case of long running code, report that something is happening
context.getCounter(StatusCounters.REDUCE_COUNTER).increment(1);
}
}
private static boolean deleteOutputDir(Job job, Path p) throws IOException {
boolean retvalue = false;
Configuration conf = job.getConfiguration();
FileSystem myfs = p.getFileSystem(conf);
if(myfs.exists(p) && myfs.isDirectory(p)) {
retvalue = myfs.delete(p,true);
}
return retvalue;
}
public static void main(String[] args) throws Exception {
Job job = Job.getInstance();
job.setJarByClass(WordFreq2.class);
job.setJobName("wordfreq");
/* type of map output */
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
/* type of reduce output */
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
/* specify input/output directories */
FileInputFormat.setInputPaths(job, new Path(args[0]));
deleteOutputDir(job,new Path(args[1]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
/* How to read and write inputs/outputs */
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
/* specify program components */
job.setMapperClass(MyMap.class);
job.setReducerClass(Reduce.class);
job.setNumReduceTasks(2); // Set the number of reducer to two
job.setSortComparatorClass(MyComparator.class);
job.setPartitionerClass(MyPartitioner.class);
boolean result = job.waitForCompletion(true);
Counters counters = job.getCounters();
Counter acounter = counters.findCounter(MyExceptions.IO_EXCEPTION);
long iocount = acounter.getValue();
System.exit(result?0:1);
}
}
However, I constantly hit this error:
Error: java.lang.NullPointerException
at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:128)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1245)
at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:74)
at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:63)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1575)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1462)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:700)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
I am unable to figure out the issue. Can anyone point me in the right direction?
So I am writing a little test program just to get the hang of Hadoop's InputFormat classes. I already had a word search built that took in lines as values and searched for the word line by line. I wanted to see if I could get Hadoop to take in values word by word, but Hadoop doesn't seem to like that and keeps giving me results from the default mapper. My mapper's initialize function is never even called.
I do know my record reader is called and that it is doing more or less what it is supposed to, and I'm pretty sure the output of the record reader is what my mapper is searching for, so why does Hadoop decide not to call my mapper?
Here is the relevant code
Input Format Class
public class WordReader extends FileInputFormat<Text, Text> {
@Override
public RecordReader<Text, Text> createRecordReader(InputSplit split,
TaskAttemptContext context) {
return new MyWholeFileReader();
}
}
Record Reader
public class MyWholeFileReader extends RecordReader<Text, Text> {
private long start;
private LineReader in;
private Text key = null;
private Text value = null;
private ArrayList<String> outputvalues;
public void initialize(InputSplit genericSplit,
TaskAttemptContext context) throws IOException {
outputvalues = new ArrayList<String>();
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
start = split.getStart();
final Path file = split.getPath();
// open the file and seek to the start of the split
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
in = new LineReader(fileIn, job);
if (key == null) {
key = new Text();
}
key.set(split.getPath().getName());
if (value == null) {
value = new Text();
}
}
public boolean nextKeyValue() throws IOException {
if (outputvalues.size() == 0) {
Text buffer = new Text();
int i = in.readLine(buffer);
String str = buffer.toString();
for (String vals : str.split(" ")) {
outputvalues.add(vals);
}
if (i == 0 || outputvalues.size() == 0) {
key = null;
value = null;
return false;
}
}
value.set(outputvalues.remove(0));
System.out.println(value.toString());
return true;
}
@Override
public Text getCurrentKey() {
return key;
}
@Override
public Text getCurrentValue() {
return value;
}
/**
*
* Get the progress within the split
*/
public float getProgress() {
return 0.0f;
}
public synchronized void close() throws IOException {
if (in != null) {
in.close();
}
}
}
Mapper
public class WordSearchMapper extends Mapper<Text, Text, OutputCollector<Text,IntWritable>, Reporter> {
static String keyword;
BloomFilter<String> b;
public void configure(JobContext jobConf) {
keyword = jobConf.getConfiguration().get("keyword");
System.out.println("keyword>> " + keyword);
b = new BloomFilter<String>(.01,10000);
b.add(keyword);
System.out.println(b.getExpectedBitsPerElement());
}
public void map(Text key, Text value, OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException {
int wordPos;
System.out.println("value.toString()>> " + value.toString());
System.out.println(((FileSplit) reporter.getInputSplit()).getPath()
.getName());
String[] tokens = value.toString().split("[\\p{P} \\t\\n\\r]");
for (String st :tokens) {
if (b.contains(st)) {
if (value.toString().contains(keyword)) {
System.out.println("Found one");
wordPos = ((Text) value).find(keyword);
output.collect(value, new IntWritable(wordPos));
}
}
}
}
}
Driver:
public class WordSearch {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf,"WordSearch");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(WordSearchMapper.class);
job.setInputFormatClass( WordReader.class);
job.setOutputFormatClass(TextOutputFormat.class);
conf.set("keyword", "the");
FileInputFormat.setInputPaths(job, new Path("search.txt"));
FileOutputFormat.setOutputPath(job, new Path("outputs"+System.currentTimeMillis()));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
And I figured it out... this is why Hadoop needs to stop supporting multiple versions of its API, or why I should stop jamming multiple tutorials together. It turns out my mapper needs to be set up like this, for the way my mapper and record reader interact:
public class WordSearchMapper extends Mapper { static String keyword;
I only realized this after looking at my imports and seeing that Reporter was from the package org.apache.hadoop.mapred, as opposed to org.apache.hadoop.mapreduce.
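For completeness, here is a sketch of the mapper once it matches the new-API record reader above. The generic types <Text, Text, Text, IntWritable> are my inference from the Text/Text pairs emitted by MyWholeFileReader and the Text/IntWritable output the original map body collects:
public class WordSearchMapper extends Mapper<Text, Text, Text, IntWritable> {
    static String keyword;

    @Override
    protected void setup(Context context) {
        // new-API replacement for the old configure(JobConf) hook
        keyword = context.getConfiguration().get("keyword");
    }

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // each value is a single word produced by MyWholeFileReader
        if (value.toString().contains(keyword)) {
            context.write(value, new IntWritable(value.find(keyword)));  // offset of the keyword inside the value
        }
    }
}
With the old-API signature (OutputCollector and Reporter from org.apache.hadoop.mapred), the new-API framework never finds a matching map() method and silently falls back to the identity mapper, which matches the behaviour described in the question. Note also that conf.set("keyword", "the") probably needs to happen before new Job(conf, ...) in the driver, since Job copies the configuration it is given.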