I want to parse PDF files in my Hadoop 2.2.0 program. I found this, followed what it says, and so far I have these three classes:
PDFWordCount: the main class containing the map and reduce functions (just like the native Hadoop word count sample, but using my PDFInputFormat class instead of TextInputFormat).
PDFRecordReader extends RecordReader<LongWritable, Text>: this is where the main work happens. In particular, I have included my initialize function here for illustration.
public void initialize(InputSplit genericSplit, TaskAttemptContext context)
throws IOException, InterruptedException {
System.out.println("initialize");
System.out.println(genericSplit.toString());
FileSplit split = (FileSplit) genericSplit;
System.out.println("filesplit convertion has been done");
final Path file = split.getPath();
Configuration conf = context.getConfiguration();
conf.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
FileSystem fs = file.getFileSystem(conf);
System.out.println("fs has been opened");
start = split.getStart();
end = start + split.getLength();
System.out.println("going to open split");
FSDataInputStream filein = fs.open(split.getPath());
System.out.println("going to load pdf");
PDDocument pd = PDDocument.load(filein);
System.out.println("pdf has been loaded");
PDFTextStripper stripper = new PDFTextStripper();
in =
new LineReader(new ByteArrayInputStream(stripper.getText(pd).getBytes(
"UTF-8")));
start = 0;
this.pos = start;
System.out.println("init has finished");
}
(You can see my System.out.println calls for debugging.)
This method fails when converting genericSplit to FileSplit. The last thing I see in the console is this:
hdfs://localhost:9000/in:0+9396432
which is genericSplit.toString()
PDFInputFormat extends FileInputFormat<LongWritable, Text>: which just creates a new PDFRecordReader in its createRecordReader method.
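A minimal sketch of such a class (assumed here, since the question does not include its code; only the class names from the description above are used):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Assumed sketch of the PDFInputFormat described above (not shown in the question).
public class PDFInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        // Hand each split to the custom PDFRecordReader from the question.
        return new PDFRecordReader();
    }
}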
I want to know what my mistake is.
Do I need extra classes or something?
Reading PDFs is not that difficult; you need to extend FileInputFormat as well as RecordReader. The FileInputFormat subclass should not split PDF files, since they are binary.
public class PDFInputFormat extends FileInputFormat<Text, Text> {
@Override
public RecordReader<Text, Text> createRecordReader(InputSplit split,
TaskAttemptContext context) throws IOException, InterruptedException {
return new PDFLineRecordReader();
}
// Do not allow to ever split PDF files, even if larger than HDFS block size
@Override
protected boolean isSplitable(JobContext context, Path filename) {
return false;
}
}
The RecordReader then performs the reading itself (I am using PDFBox to read PDFs).
public class PDFLineRecordReader extends RecordReader<Text, Text> {
private Text key = new Text();
private Text value = new Text();
private int currentLine = 0;
private List<String> lines = null;
private PDDocument doc = null;
private PDFTextStripper textStripper = null;
@Override
public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {
FileSplit fileSplit = (FileSplit) split;
final Path file = fileSplit.getPath();
Configuration conf = context.getConfiguration();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream filein = fs.open(fileSplit.getPath());
if (filein != null) {
doc = PDDocument.load(filein);
// Could the PDF be read?
if (doc != null) {
textStripper = new PDFTextStripper();
String text = textStripper.getText(doc);
lines = Arrays.asList(text.split(System.lineSeparator()));
currentLine = 0;
}
}
}
// False ends the reading process
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
if (key == null) {
key = new Text();
}
if (value == null) {
value = new Text();
}
if (currentLine < lines.size()) {
String line = lines.get(currentLine);
key.set(line);
value.set("");
currentLine++;
return true;
} else {
// All lines are read? -> end
key = null;
value = null;
return false;
}
}
@Override
public Text getCurrentKey() throws IOException, InterruptedException {
return key;
}
@Override
public Text getCurrentValue() throws IOException, InterruptedException {
return value;
}
@Override
public float getProgress() throws IOException, InterruptedException {
return (100.0f / lines.size() * currentLine) / 100.0f;
}
@Override
public void close() throws IOException {
// If done close the doc
if (doc != null) {
doc.close();
}
}
}
Hope this helps!
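To try the format above, a minimal driver is enough. The sketch below is hedged: the class name PDFReadDriver is made up, and since no mapper or reducer is set, Hadoop's identity Mapper and Reducer simply pass the Text/Text pairs from PDFLineRecordReader straight through, which is a quick way to verify that the input format works.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PDFReadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "pdf read test");
        job.setJarByClass(PDFReadDriver.class);
        // Plug in the custom format; with no mapper/reducer set, the identity classes are used.
        job.setInputFormatClass(PDFInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}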
package com.sidd.hadoop.practice.pdf;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.sidd.hadoop.practice.input.pdf.PdfFileInputFormat;
import com.sidd.hadoop.practice.output.pdf.PdfFileOutputFormat;
public class ReadPdfFile {
public static class MyMapper extends
Mapper<LongWritable, Text, LongWritable, Text> {
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// context.progress();
context.write(key, value);
}
}
public static class MyReducer extends
Reducer<LongWritable, Text, LongWritable, Text> {
public void reduce(LongWritable key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
if (values.iterator().hasNext()) {
context.write(key, values.iterator().next());
} else {
context.write(key, new Text(""));
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "Read Pdf");
job.setJarByClass(ReadPdfFile.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(PdfFileInputFormat.class);
job.setOutputFormatClass(PdfFileOutputFormat.class);
removeDir(args[1], conf);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
public static void removeDir(String path, Configuration conf) throws IOException {
Path output_path = new Path(path);
FileSystem fs = FileSystem.get(conf);
if (fs.exists(output_path)) {
fs.delete(output_path, true);
}
}
}
Related
I have to process 250 XML files, each 25 MB in size. To process them I am using the XMLInputFormat from Apache Mahout and generating a sequence file in which the key is the filename and the value is the entire file contents. The problem with this approach is that 250 mappers are launched, which makes the MapReduce job slow.
I have come across CombineFileInputFormat (while going through the Tom White book), with which 250 mappers would not be launched for 250 files. But CombineFileInputFormat is an abstract class, and I am having difficulty implementing it for XML files as I am new to both Java and Hadoop.
So, can someone please provide me with an implementation of CombineFileInputFormat for XML files?
Driver Code:
package com.ericsson.sequencefile;
//A MapReduce program for packaging a collection of small files as a single SequenceFile.
//hadoop jar sequencefiles.jar com.ericsson.sequencefile.SmallFilesToSequenceFileConverter -D xmlinput.start="<XMLstart>" -D xmlinput.end="</XMLstart>" /IRIS_NG/pfinder2/ccn/archive /IRIS_NG/pfinder2/output
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class SmallFilesToSequenceFileConverter extends Configured implements Tool {
public static class SequenceFileMapper extends Mapper<LongWritable, Text, Text, Text> {
private Text filenameKey;
@Override
public void setup(Context context) throws IOException, InterruptedException {
InputSplit split = context.getInputSplit();
Path path = ((FileSplit) split).getPath();
filenameKey = new Text(path.toString() + "\n");
}
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String document = value.toString();
context.write(filenameKey, new Text(document));
}
}
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.printf("Usage: %s [generic options] <input> <output>\n", getClass().getSimpleName());
ToolRunner.printGenericCommandUsage(System.err);
return -1;
}
Configuration conf = getConf();
Job job = Job.getInstance(conf,"SmallFilesToSequenceFile");
job.setJarByClass(getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setInputFormatClass(XmlInputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(SequenceFileMapper.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(), args);
System.exit(exitCode);
}
}
XmlInputFormat.java
package com.ericsson.sequencefile;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.slf4j.*;
import java.io.IOException;
/**
* Reads records that are delimited by a specific begin/end tag.
*/
public class XmlInputFormat extends TextInputFormat {
private static final Logger log =
LoggerFactory.getLogger(XmlInputFormat.class);
public static final String START_TAG_KEY = "xmlinput.start";
public static final String END_TAG_KEY = "xmlinput.end";
@Override
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit split, TaskAttemptContext context) {
try {
return new XmlRecordReader((FileSplit) split,
context.getConfiguration());
} catch (IOException ioe) {
log.warn("Error while creating XmlRecordReader", ioe);
return null;
}
}
/**
* XMLRecordReader class to read through a given xml document to
* output xml blocks as records as specified
* by the start tag and end tag
*/
public static class XmlRecordReader
extends RecordReader<LongWritable, Text> {
private final byte[] startTag;
private final byte[] endTag;
private final long start;
private final long end;
private final FSDataInputStream fsin;
private final DataOutputBuffer buffer = new DataOutputBuffer();
private LongWritable currentKey;
private Text currentValue;
public XmlRecordReader(FileSplit split, Configuration conf)
throws IOException {
startTag = conf.get(START_TAG_KEY).getBytes("UTF-8");
endTag = conf.get(END_TAG_KEY).getBytes("UTF-8");
// open the file and seek to the start of the split
start = split.getStart();
end = start + split.getLength();
Path file = split.getPath();
FileSystem fs = file.getFileSystem(conf);
fsin = fs.open(split.getPath());
fsin.seek(start);
}
private boolean next(LongWritable key, Text value)
throws IOException {
if (fsin.getPos() < end && readUntilMatch(startTag, false)) {
try {
buffer.write(startTag);
if (readUntilMatch(endTag, true)) {
key.set(fsin.getPos());
value.set(buffer.getData(), 0, buffer.getLength());
return true;
}
} finally {
buffer.reset();
}
}
return false;
}
@Override
public void close() throws IOException {
fsin.close();
}
@Override
public float getProgress() throws IOException {
return (fsin.getPos() - start) / (float) (end - start);
}
private boolean readUntilMatch(byte[] match, boolean withinBlock)
throws IOException {
int i = 0;
while (true) {
int b = fsin.read();
// end of file:
if (b == -1) {
return false;
}
// save to buffer:
if (withinBlock) {
buffer.write(b);
}
// check if we're matching:
if (b == match[i]) {
i++;
if (i >= match.length) {
return true;
}
} else {
i = 0;
}
// see if we've passed the stop point:
if (!withinBlock && i == 0 && fsin.getPos() >= end) {
return false;
}
}
}
@Override
public LongWritable getCurrentKey()
throws IOException, InterruptedException {
return currentKey;
}
@Override
public Text getCurrentValue()
throws IOException, InterruptedException {
return currentValue;
}
@Override
public void initialize(InputSplit split,
TaskAttemptContext context)
throws IOException, InterruptedException {
}
@Override
public boolean nextKeyValue()
throws IOException, InterruptedException {
currentKey = new LongWritable();
currentValue = new Text();
return next(currentKey, currentValue);
}
}
}
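To address the original goal of not launching one mapper per small file, the XmlRecordReader above can be reused from a CombineFileInputFormat subclass. The following is an untested sketch; the class names CombinedXmlInputFormat and CombinedXmlRecordReader are made up, and only the Hadoop API calls are real.
package com.ericsson.sequencefile;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Packs many small XML files into combined splits so far fewer mappers are launched.
public class CombinedXmlInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader creates one CombinedXmlRecordReader per file in the split.
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, CombinedXmlRecordReader.class);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // keep each XML file whole
    }

    // Adapts one file of a CombineFileSplit to the XmlRecordReader defined above.
    public static class CombinedXmlRecordReader extends RecordReader<LongWritable, Text> {
        private final XmlInputFormat.XmlRecordReader delegate;

        // CombineFileRecordReader requires exactly this constructor signature.
        public CombinedXmlRecordReader(CombineFileSplit split, TaskAttemptContext context,
                Integer index) throws IOException {
            FileSplit fileSplit = new FileSplit(split.getPath(index), split.getOffset(index),
                    split.getLength(index), null /* locality hints omitted in this sketch */);
            delegate = new XmlInputFormat.XmlRecordReader(fileSplit, context.getConfiguration());
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return delegate.nextKeyValue();
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return delegate.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return delegate.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}
In the driver above, swapping job.setInputFormatClass(XmlInputFormat.class) for job.setInputFormatClass(CombinedXmlInputFormat.class) should then be enough to get far fewer map tasks.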
I am new to Hadoop MapReduce. I have written a MapReduce task and I am trying to run it on my local machine, but the job hangs after map 100%.
Below is the code; I don't understand what I am missing.
I have a custom key class:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
public class AirlineMonthKey implements WritableComparable<AirlineMonthKey>{
Text airlineName;
Text month;
public AirlineMonthKey(){
super();
}
public AirlineMonthKey(Text airlineName, Text month) {
super();
this.airlineName = airlineName;
this.month = month;
}
public Text getAirlineName() {
return airlineName;
}
public void setAirlineName(Text airlineName) {
this.airlineName = airlineName;
}
public Text getMonth() {
return month;
}
public void setMonth(Text month) {
this.month = month;
}
@Override
public void readFields(DataInput in) throws IOException {
// TODO Auto-generated method stub
this.airlineName.readFields(in);
this.month.readFields(in);
}
@Override
public void write(DataOutput out) throws IOException {
// TODO Auto-generated method stub
this.airlineName.write(out);
this.month.write(out);
}
@Override
public int compareTo(AirlineMonthKey airlineMonthKey) {
// TODO Auto-generated method stub
int diff = getAirlineName().compareTo(airlineMonthKey.getAirlineName());
if(diff != 0){
return diff;
}
int m1 = Integer.parseInt(getMonth().toString());
int m2 = Integer.parseInt(airlineMonthKey.getMonth().toString());
if(m1>m2){
return -1;
}
else
return 1;
}
}
and the mapper and reducer classes that use the custom key, as below.
package com.mapresuce.secondarysort;
import java.io.IOException;
import java.io.StringReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import com.opencsv.CSVReader;
public class FlightDelayByMonth {
public static class FlightDelayByMonthMapper extends
Mapper<Object, Text, AirlineMonthKey, Text> {
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String str = value.toString();
// Reading Line one by one from the input CSV.
CSVReader reader = new CSVReader(new StringReader(str));
String[] split = reader.readNext();
reader.close();
String airlineName = split[6];
String month = split[2];
String year = split[0];
String delayMinutes = split[37];
String cancelled = split[41];
if (!(airlineName.equals("") || month.equals("") || delayMinutes
.equals(""))) {
if (year.equals("2008") && cancelled.equals("0.00")) {
AirlineMonthKey airlineMonthKey = new AirlineMonthKey(
new Text(airlineName), new Text(month));
Text delay = new Text(delayMinutes);
context.write(airlineMonthKey, delay);
System.out.println("1");
}
}
}
}
public static class FlightDelayByMonthReducer extends
Reducer<AirlineMonthKey, Text, Text, Text> {
public void reduce(AirlineMonthKey key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
for(Text val : values){
context.write(new Text(key.getAirlineName().toString()+" "+key.getMonth().toString()), val);
}
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage:<in> <out>");
System.exit(2);
}
Job job = new Job(conf, "Average monthly flight delay");
job.setJarByClass(FlightDelayByMonth.class);
job.setMapperClass(FlightDelayByMonthMapper.class);
job.setReducerClass(FlightDelayByMonthReducer.class);
job.setOutputKeyClass(AirlineMonthKey.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Also, I have created a job and configuration in main. I don't know what I am missing. I am running all of this in a local environment.
Try writing custom implementations of toString, equals and hashCode in your AirlineMonthKey class.
Read the link below:
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/WritableComparable.html
It is important for key types to implement hashCode().
Hope this helps.
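For reference, a minimal sketch of what those methods could look like in AirlineMonthKey (field names taken from the question; this is an illustration, not the asker's actual fix):
@Override
public int hashCode() {
    // Combine both fields so keys spread evenly across partitions.
    return airlineName.hashCode() * 163 + month.hashCode();
}

@Override
public boolean equals(Object o) {
    if (o instanceof AirlineMonthKey) {
        AirlineMonthKey other = (AirlineMonthKey) o;
        return airlineName.equals(other.airlineName) && month.equals(other.month);
    }
    return false;
}

@Override
public String toString() {
    return airlineName + " " + month;
}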
The issue was that I had to provide a default constructor in AirlineMonthKey (which I did) and initialize the instance variables in the custom key class (which I didn't).
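In code, the fix amounts to initializing the Text fields in the no-argument constructor, since Hadoop instantiates the key through it and then calls readFields(); for example:
public AirlineMonthKey() {
    super();
    // Fields must not be null when readFields() is called during deserialization.
    this.airlineName = new Text();
    this.month = new Text();
}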
This program is supposed to run two chained MapReduce jobs, where the output of the first job is taken as the input of the second job.
When I run it, I get two errors:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException
The map part runs to 100%, but the reducer does not run.
Here's my code:
import java.io.IOException;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.io.LongWritable;
public class MaxPubYear {
public static class FrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
Text word = new Text();
String delim = ";";
Integer year = 0;
String tokens[] = value.toString().split(delim);
if (tokens.length >= 4) {
year = TryParseInt(tokens[3].replace("\"", "").trim());
if (year > 0) {
word = new Text(year.toString());
context.write(word, new IntWritable(1));
}
}
}
}
public static class FrequencyReducer extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
public static class MaxPubYearMapper extends
Mapper<LongWritable, Text, IntWritable, Text> {
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String delim = "\t";
Text valtosend = new Text();
String tokens[] = value.toString().split(delim);
if (tokens.length == 2) {
valtosend.set(tokens[0] + ";" + tokens[1]);
context.write(new IntWritable(1), valtosend);
}
}
}
public static class MaxPubYearReducer extends
Reducer<IntWritable, Text, Text, IntWritable> {
public void reduce(IntWritable key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
int maxiValue = Integer.MIN_VALUE;
String maxiYear = "";
for (Text value : values) {
String token[] = value.toString().split(";");
if (token.length == 2
&& TryParseInt(token[1]).intValue() > maxiValue) {
maxiValue = TryParseInt(token[1]);
maxiYear = token[0];
}
}
context.write(new Text(maxiYear), new IntWritable(maxiValue));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "Frequency");
job.setJarByClass(MaxPubYear.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(FrequencyMapper.class);
job.setCombinerClass(FrequencyReducer.class);
job.setReducerClass(FrequencyReducer.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1] + "_temp"));
int exitCode = job.waitForCompletion(true) ? 0 : 1;
if (exitCode == 0) {
Job SecondJob = new Job(conf, "Maximum Publication year");
SecondJob.setJarByClass(MaxPubYear.class);
SecondJob.setOutputKeyClass(Text.class);
SecondJob.setOutputValueClass(IntWritable.class);
SecondJob.setMapOutputKeyClass(IntWritable.class);
SecondJob.setMapOutputValueClass(Text.class);
SecondJob.setMapperClass(MaxPubYearMapper.class);
SecondJob.setReducerClass(MaxPubYearReducer.class);
FileInputFormat.addInputPath(SecondJob, new Path(args[1] + "_temp"));
FileOutputFormat.setOutputPath(SecondJob, new Path(args[1]));
System.exit(SecondJob.waitForCompletion(true) ? 0 : 1);
}
}
public static Integer TryParseInt(String trim) {
// Parse the year, falling back to 0 for malformed values.
try {
return Integer.valueOf(trim);
} catch (NumberFormatException e) {
return 0;
}
}
}
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException
A MapReduce job does not overwrite the contents of an existing directory. The output path given to an MR job must be a directory that does not exist; the job will create a directory at the specified path with the output files inside it.
In your code:
FileOutputFormat.setOutputPath(job, new Path(args[1] + "_temp"));
Make sure this path does not exist when you run the MR job.
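Alternatively, the driver can delete a stale output directory before submitting the job. A short sketch (a fragment that assumes the conf and args variables from the question's main method, plus an org.apache.hadoop.fs.FileSystem import):
// Remove the intermediate output directory if a previous run left it behind.
Path tempOut = new Path(args[1] + "_temp");
FileSystem fs = FileSystem.get(conf);
if (fs.exists(tempOut)) {
    fs.delete(tempOut, true); // recursive delete
}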
I recently updated to Hadoop 2.2 (using this tutorial here).
My main job class looks like this, and throws an IOException:
import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.chain.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.mapreduce.lib.reduce.*;
public class UFOLocation2
{
public static class MapClass extends Mapper<LongWritable, Text, Text, LongWritable>
{
private final static LongWritable one = new LongWritable(1);
private static Pattern locationPattern = Pattern.compile("[a-zA-Z]{2}[^a-zA-Z]*$");
private Map<String, String> stateNames;
@Override
public void setup(Context context)
{
try
{
URI[] cacheFiles = context.getCacheFiles();
setupStateMap(cacheFiles[0].toString());
}
catch (IOException ioe)
{
System.err.println("Error reading state file.");
System.exit(1);
}
}
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException
{
String line = value.toString();
String[] fields = line.split("\t");
String location = fields[2].trim();
if (location.length() >= 2)
{
Matcher matcher = locationPattern.matcher(location);
if (matcher.find())
{
int start = matcher.start();
String state = location.substring(start, start + 2);
context.write(new Text(lookupState(state.toUpperCase())), one);
}
}
}
private void setupStateMap(String filename) throws IOException
{
Map<String, String> states = new HashMap<String, String>();
BufferedReader reader = new BufferedReader(new FileReader(filename));
String line = reader.readLine();
while (line != null)
{
String[] split = line.split("\t");
states.put(split[0], split[1]);
line = reader.readLine();
}
stateNames = states;
}
private String lookupState(String state)
{
String fullName = stateNames.get(state);
return fullName == null ? "Other" : fullName;
}
}
public static void main(String[] args) throws Exception
{
Configuration config = new Configuration();
Job job = Job.getInstance(config, "UFO Location 2");
job.setJarByClass(UFOLocation2.class);
job.addCacheFile(new URI("/user/kevin/data/states.txt"));
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
Configuration mapconf1 = new Configuration(false);
ChainMapper.addMapper(job, UFORecordValidationMapper.class, LongWritable.class,
Text.class, LongWritable.class,Text.class, mapconf1);
Configuration mapconf2 = new Configuration(false);
ChainMapper.addMapper(job, MapClass.class, LongWritable.class,
Text.class, Text.class, LongWritable.class, mapconf2);
job.setMapperClass(ChainMapper.class);
job.setCombinerClass(LongSumReducer.class);
job.setReducerClass(LongSumReducer.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I get an IOException because it can't find the file "/user/kevin/data/states.txt" when it tries to instantiate the BufferedReader in the setupStateMap() method.
Yes, it is deprecated; Job.addCacheFile() should be used to add the files, and in your tasks (map or reduce) the files can be accessed with Context.getCacheFiles().
// addCacheFile and getCacheFiles are the 2.x API; in setup() you can do something like this:
URI[] uri = context.getCacheFiles();
FileSystem fileSystem = FileSystem.get(context.getConfiguration());
Path path = new Path(uri[0].getPath());
if (fileSystem.exists(path)) {
FSDataInputStream dataInputStream = fileSystem.open(path);
byte[] data = new byte[1024];
while (dataInputStream.read(data) > 0) {
// do your stuff here
}
dataInputStream.close();
}
The deprecated functionality should still work anyway.
Here's my source code
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class PageRank {
public static final String MAGIC_STRING = ">>>>";
boolean overwrite = true;
PageRank(boolean overwrite){
this.overwrite = overwrite;
}
public static class TextPair implements WritableComparable<TextPair>{
Text x;
int ordering;
public TextPair(){
x = new Text();
ordering = 1;
}
public void setText(Text t, int o){
x = t;
ordering = o;
}
public void setText(String t, int o){
x.set(t);
ordering = o;
}
public void readFields(DataInput in) throws IOException {
x.readFields(in);
ordering = in.readInt();
}
public void write(DataOutput out) throws IOException {
x.write(out);
out.writeInt(ordering);
}
public int hashCode() {
return x.hashCode();
}
public int compareTo(TextPair o) {
int x = this.x.compareTo(o.x);
if(x==0)
return ordering-o.ordering;
else
return x;
}
}
public static class MapperA extends Mapper<LongWritable, Text, TextPair, Text> {
private Text word = new Text();
Text title = new Text();
Text link = new Text();
TextPair textpair = new TextPair();
boolean start=false;
String currentTitle="";
private Pattern linkPattern = Pattern.compile("\\[\\[\\s*(.+?)\\s*\\]\\]");
private Pattern titlePattern = Pattern.compile("<title>\\s*(.+?)\\s*</title>");
private Pattern pagePattern = Pattern.compile("<page>\\s*(.+?)\\s*</page>");
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
int startPage=line.lastIndexOf("<title>");
if(startPage<0)
{
Matcher matcher = linkPattern.matcher(line);
int n = 0;
title.set(currentTitle);
while(matcher.find()){
textpair.setText(matcher.group(1), 1);
context.write(textpair, title);
}
link.set(MAGIC_STRING);
textpair.setText(title.toString(), 0);
context.write(textpair, link);
}
else
{
String result=line.trim();
Matcher titleMatcher = titlePattern.matcher(result);
if(titleMatcher.find()){
currentTitle = titleMatcher.group(1);
}
else
{
currentTitle=result;
}
}
}
}
public static class ReducerA extends Reducer<TextPair, Text, Text, Text>{
Text aw = new Text();
boolean valid = false;
String last = "";
public void run(Context context) throws IOException, InterruptedException {
setup(context);
while (context.nextKeyValue()) {
TextPair key = context.getCurrentKey();
Text value = context.getCurrentValue();
if(key.ordering==0){
last = key.x.toString();
}
else if(key.x.toString().equals(last)){
context.write(key.x, value);
}
}
cleanup(context);
}
}
public static class MapperB extends Mapper<Text, Text, Text, Text>{
Text t = new Text();
public void map(Text key, Text value, Context context) throws InterruptedException, IOException{
context.write(value, key);
}
}
public static class ReducerB extends Reducer<Text, Text, Text, PageRankRecord>{
ArrayList<String> q = new ArrayList<String>();
public void reduce(Text key, Iterable<Text> values, Context context)throws InterruptedException, IOException{
q.clear();
for(Text value:values){
q.add(value.toString());
}
PageRankRecord prr = new PageRankRecord();
prr.setPageRank(1.0);
if(q.size()>0){
String[] a = new String[q.size()];
q.toArray(a);
prr.setlinks(a);
}
context.write(key, prr);
}
}
public boolean roundA(Configuration conf, String inputPath, String outputPath, boolean overwrite) throws IOException, InterruptedException, ClassNotFoundException{
if(FileSystem.get(conf).exists(new Path(outputPath))){
if(overwrite){
FileSystem.get(conf).delete(new Path(outputPath), true);
System.err.println("The target file is dirty, overwriting!");
}
else
return true;
}
Job job = new Job(conf, "closure graph build round A");
//job.setJarByClass(GraphBuilder.class);
job.setMapperClass(MapperA.class);
//job.setCombinerClass(RankCombiner.class);
job.setReducerClass(ReducerA.class);
job.setMapOutputKeyClass(TextPair.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setNumReduceTasks(30);
FileInputFormat.addInputPath(job, new Path(inputPath));
SequenceFileOutputFormat.setOutputPath(job, new Path(outputPath));
return job.waitForCompletion(true);
}
public boolean roundB(Configuration conf, String inputPath, String outputPath) throws IOException, InterruptedException, ClassNotFoundException{
if(FileSystem.get(conf).exists(new Path(outputPath))){
if(overwrite){
FileSystem.get(conf).delete(new Path(outputPath), true);
System.err.println("The target file is dirty, overwriting!");
}
else
return true;
}
Job job = new Job(conf, "closure graph build round B");
//job.setJarByClass(PageRank.class);
job.setMapperClass(MapperB.class);
//job.setCombinerClass(RankCombiner.class);
job.setReducerClass(ReducerB.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(PageRankRecord.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setNumReduceTasks(30);
SequenceFileInputFormat.addInputPath(job, new Path(inputPath));
SequenceFileOutputFormat.setOutputPath(job, new Path(outputPath));
return job.waitForCompletion(true);
}
public boolean build(Configuration conf, String inputPath, String outputPath) throws IOException, InterruptedException, ClassNotFoundException{
System.err.println(inputPath);
if(roundA(conf, inputPath, "cgb", true)){
return roundB(conf, "cgb", outputPath);
}
else
return false;
}
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException{
Configuration conf = new Configuration();
//PageRanking.banner("ClosureGraphBuilder");
PageRank cgb = new PageRank(true);
cgb.build(conf, args[0], args[1]);
}
}
Here's how I compile and run:
javac -classpath hadoop-0.20.1-core.jar -d pagerank_classes PageRank.java PageRankRecord.java
jar -cvf pagerank.jar -C pagerank_classes/ .
bin/hadoop jar pagerank.jar PageRank pagerank result
but I am getting the following errors:
INFO mapred.JobClient: Task Id : attempt_201001012025_0009_m_000001_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: PageRank$MapperA
Can someone tell me what's wrong?
Thanks
If you are using Hadoop 0.20 (and want to use the non-deprecated classes), you can do:
public int run(String[] args) throws Exception {
Job job = new Job();
job.setJarByClass(YourMapReduceClass.class); // <-- omitting this causes above error
job.setMapperClass(MyMapper.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
return 0;
}
Did "PageRank$MapperA.class" end up inside that jar file? It should be in the same place as "PageRank.class".
Try adding "--libjars pagerank.jar". The mapper and reducer run on remote machines, so you need to distribute your jar to every machine; "--libjars" helps do that.
For the HADOOP_CLASSPATH you should specify the folder where the JAR file is located...
If you want to understand how the classpath works: http://download.oracle.com/javase/6/docs/technotes/tools/windows/classpath.html
I guess you should change your HADOOP_CLASSPATH variable so that it points to the jar file,
e.g. HADOOP_CLASSPATH=<whatever the path>/PageRank.jar or something like that.
If you are using Eclipse to generate the jar, use the "Extract generated libraries into generated JAR" option.
Although a MapReduce program processes data in parallel, the Mapper, Combiner and Reducer classes run as a sequential flow: each stage has to wait for the previous one to complete, which is why you need job.waitForCompletion(true). You must also set the input and output paths before starting the Mapper, Combiner and Reducer classes. Reference
A solution for this has already been answered at https://stackoverflow.com/a/38145962/3452185.