I recently updated to Hadoop 2.2 (using this tutorial here).
My main job class looks like this, and throws an IOException:
import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.chain.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.mapreduce.lib.reduce.*;
public class UFOLocation2
{
public static class MapClass extends Mapper<LongWritable, Text, Text, LongWritable>
{
private final static LongWritable one = new LongWritable(1);
private static Pattern locationPattern = Pattern.compile("[a-zA-Z]{2}[^a-zA-Z]*$");
private Map<String, String> stateNames;
@Override
public void setup(Context context)
{
try
{
URI[] cacheFiles = context.getCacheFiles();
setupStateMap(cacheFiles[0].toString());
}
catch (IOException ioe)
{
System.err.println("Error reading state file.");
System.exit(1);
}
}
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException
{
String line = value.toString();
String[] fields = line.split("\t");
String location = fields[2].trim();
if (location.length() >= 2)
{
Matcher matcher = locationPattern.matcher(location);
if (matcher.find())
{
int start = matcher.start();
String state = location.substring(start, start + 2);
context.write(new Text(lookupState(state.toUpperCase())), one);
}
}
}
private void setupStateMap(String filename) throws IOException
{
Map<String, String> states = new HashMap<String, String>();
BufferedReader reader = new BufferedReader(new FileReader(filename));
String line = reader.readLine();
while (line != null)
{
String[] split = line.split("\t");
states.put(split[0], split[1]);
line = reader.readLine();
}
stateNames = states;
}
private String lookupState(String state)
{
String fullName = stateNames.get(state);
return fullName == null ? "Other" : fullName;
}
}
public static void main(String[] args) throws Exception
{
Configuration config = new Configuration();
Job job = Job.getInstance(config, "UFO Location 2");
job.setJarByClass(UFOLocation2.class);
job.addCacheFile(new URI("/user/kevin/data/states.txt"));
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
Configuration mapconf1 = new Configuration(false);
ChainMapper.addMapper(job, UFORecordValidationMapper.class, LongWritable.class,
Text.class, LongWritable.class,Text.class, mapconf1);
Configuration mapconf2 = new Configuration(false);
ChainMapper.addMapper(job, MapClass.class, LongWritable.class,
Text.class, Text.class, LongWritable.class, mapconf2);
job.setMapperClass(ChainMapper.class);
job.setCombinerClass(LongSumReducer.class);
job.setReducerClass(LongSumReducer.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I get an IOException because it can't find the file "/user/kevin/data/states.txt" when it tries to instantiate the BufferedReader in the setupStateMap() method.
Yes, it is deprecated; Job.addCacheFile() should be used to add the files, and in your tasks (map or reduce) the files can be accessed with Context.getCacheFiles().
// It's fine: addCacheFile and getCacheFiles are from the 2.x API; you can use something like this
Path path = new Path(uri[0].getPath().toString());
if (fileSystem.exists(path)) {
FSDataInputStream dataInputStream = fileSystem.open(path);
byte[] data = new byte[1024];
while (dataInputStream.read(data) > 0) {
//do your stuff here
}
dataInputStream.close();
}
Deprecated functionality should still work anyway.
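For reference, here is a minimal sketch of both halves of that 2.x API: the driver registers the file with Job.addCacheFile(), and the mapper's setup() retrieves it through Context.getCacheFiles(). The path /user/kevin/data/states.txt is taken from the question above; the class and variable names are only illustrative, not part of the original code.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
public class CacheFileSketch {
    public static class CacheAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // URIs registered on the driver side with Job.addCacheFile()
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles != null && cacheFiles.length > 0) {
                Path path = new Path(cacheFiles[0].getPath());
                FileSystem fs = FileSystem.get(context.getConfiguration());
                BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // parse each line of the cached lookup file here
                    }
                } finally {
                    reader.close();
                }
            }
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache file sketch");
        job.setJarByClass(CacheFileSketch.class);
        // same path the question uses; it must already exist on HDFS
        job.addCacheFile(new URI("/user/kevin/data/states.txt"));
        job.setMapperClass(CacheAwareMapper.class);
        // ... set input/output formats, paths, and key/value classes as usual ...
    }
}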
Related
I have an input text file as given below (partial):
{"author":"Martti Paturi","book":"Aiotko oppikouluun"}
{"author":"International Meeting of Neurobiologists Amsterdam 1959.","book":"Structure and function of the cerebral cortex"}
{"author":"Paraná (Brazil : State). Comissão de Desenvolvimento Municipal.","book":"Plano diretor de desenvolvimento de Maringá"}
I need to perform MapReduce on this file to get as output a JSON object which has all the books from the same author in a JSON array, in the form:
{"author": "Ian Fleming", "books": [{"book": "Goldfinger"},{"book": "Moonraker"}]}
My code is as follows:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.json.*;
public class CombineBooks {
//TODO define variables and implement necessary components
/*public static class MyTuple implements Writable{
private String author;
private String book;
public void readFields(DataInput in){
JSONObject obj = new JSONObject(in.readLine());
author = obj.getString("author");
book = obj.getString("book");
}
public void write(DataOutput out){
out.writeBytes(author);
out.writeBytes(book);
}
public static MyTuple read(DataInput in){
MyTuple tup = new MyTuple();
tup.readFields(in);
return tup;
}
}*/
public static class Map extends Mapper<LongWritable, Text, Text, Text>{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String author;
String book;
String line = value.toString();
String[] tuple = line.split("\\n");
try{
for(int i=0;i<tuple.length; i++){
JSONObject obj = new JSONObject(tuple[i]);
author = obj.getString("author");
book = obj.getString("book");
context.write(new Text(author), new Text(book));
}
}catch(JSONException e){
e.printStackTrace();
}
}
}
public static class Combine extends Reducer<Text, Text, Text, Text>{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
String booklist = null;
int i = 0;
for(Text val : values){
if(booklist.equals(null)){
booklist = booklist + val.toString();
}
else{
booklist = booklist + "," + val.toString();
}
i++;
}
context.write(key, new Text(booklist));
}
}
public static class Reduce extends Reducer<Text,Text,JSONObject,NullWritable>{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
try{
JSONArray ja = new JSONArray();
String[] book = null;
for(Text val : values){
book = val.toString().split(",");
}
for(int i=0; i<book.length; i++){
JSONObject jo = new JSONObject().put("book", book[i]);
ja.put(jo);
}
JSONObject obj = new JSONObject();
obj.put("author", key.toString());
obj.put("books", ja);
context.write(obj, NullWritable.get());
}catch(JSONException e){
e.printStackTrace();
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: CombineBooks <in> <out>");
System.exit(2);
}
//TODO implement CombineBooks
Job job = new Job(conf, "CombineBooks");
job.setJarByClass(CombineBooks.class);
job.setMapperClass(Map.class);
job.setCombinerClass(Combine.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(JSONObject.class);
job.setOutputValueClass(NullWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
//TODO implement CombineBooks
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
When I try to run it, I get the following error:
java.lang.ClassCastException: class org.json.JSONObject
at java.lang.Class.asSubclass(Class.java:3165)
at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:795)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:964)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:673)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
I am using java-json.jar as an external dependency. I am not sure what the error is here. Any help is appreciated!
The JSON jar file has to be saved in the Hadoop lib folder; then try to execute the program.
Have a look at Hadoop Writable. You are indeed telling Hadoop to use JSONObject as the output key, but JSONObject doesn't implement the Writable interface.
Why don't you just output text?
context.write(new Text(jo.toString()), NullWritable.get());
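If you go that route, the reducer and driver could look roughly like the following. This is only a sketch under the assumption that the combiner from the question is dropped, so each value reaching the reducer is a single book title; the class name TextOutputReduce is made up for illustration.
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
public class TextOutputReduce extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        try {
            JSONArray ja = new JSONArray();
            for (Text val : values) {
                // one {"book": ...} entry per book title received for this author
                ja.put(new JSONObject().put("book", val.toString()));
            }
            JSONObject obj = new JSONObject();
            obj.put("author", key.toString());
            obj.put("books", ja);
            // serialize the JSON to text; Text implements Writable, JSONObject does not
            context.write(new Text(obj.toString()), NullWritable.get());
        } catch (JSONException e) {
            throw new IOException(e);
        }
    }
}
In the driver the output classes then become job.setOutputKeyClass(Text.class) and job.setOutputValueClass(NullWritable.class), which avoids the ClassCastException in the stack trace above.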
I have a requirement wherein the MapReduce code should read the local file system on each node. The program will be running on HDFS, and I cannot change the FileSystem property for Hadoop in the XML configuration files.
I have tried the following approaches, but none of them worked.
Approach 1
Configuration config = new Configuration();
FileSystem localFileSystem = FileSystem.get(config);
localFileSystem.set("fs.defaultFS", "file:///");
BufferedReader bufferRedaer = new BufferedReader(new InputStreamReader(localFileSystem.open(new Path("/user/input/localFile"))));
Approach 2
Configuration config = new Configuration();
LocalFileSystem localFileSystem = FileSystem.getLocal(config);
BufferedReader bufferRedaer = new BufferedReader(new InputStreamReader(localFileSystem.open(new Path("/user/input/localFile"))));
Approach 3
Configuration config = new Configuration();
LocalFileSystem localFileSystem = FileSystem.getLocal(config);
localFileSystem.set("fs.defaultFS", "file:///");
BufferedReader bufferRedaer = new BufferedReader(new InputStreamReader(localFileSystem.open(new Path("/user/input/localFile"))));
Approach 4
Configuration config = new Configuration();
LocalFileSystem localFileSystem = FileSystem.getLocal(config);
BufferedReader bufferRedaer = new BufferedReader(new InputStreamReader(localFileSystem.getRaw().open(new Path("/user/input/localFile"))));
This did not work either:
Reading HDFS and local files in Java
Each of the approaches gave the error: No such file exists.
Error Stack
attempt_201406050021_0018_m_000000_2: java.io.FileNotFoundException: File /home/cloudera/sftp/id_rsa does not exist
attempt_201406050021_0018_m_000000_2: at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:468)
attempt_201406050021_0018_m_000000_2: at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:380)
attempt_201406050021_0018_m_000000_2: at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:231)
attempt_201406050021_0018_m_000000_2: at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:183)
attempt_201406050021_0018_m_000000_2: at org.apache.hadoop.fs.LocalFileSystem.copyFromLocalFile(LocalFileSystem.java:81)
attempt_201406050021_0018_m_000000_2: at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1934)
attempt_201406050021_0018_m_000000_2: at com.skanda.ecomm.sftp.FTPMapper.configure(FTPMapper.java:91)
I am hoping to get a positive solution here. Let me know where I am going wrong.
Main class (Driver class)
/*
* #SFTPClient.java #May 20, 2014
*
*
*/
package com.skanda.ecomm.sftp;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/**
*
* <p>
* SFTPClient Class
* </p>
*
* @author skanda
* @version 1.0
*
*/
public class SFTPClient extends Configured implements Tool {
public int run(String[] args) throws Exception {
Configuration config = getConf();
String inputPath = config.get(ApplicationConstants.INPUT_PATH);
String outputPath = config.get(ApplicationConstants.OUTPUT_PATH);
String configPath = config.get(ApplicationConstants.CONFIG_PATH);
int reducers = Integer.parseInt(config.get(ApplicationConstants.REDUCERS));
if(outputPath == null || inputPath == null || configPath == null) {
throw new Exception("Usage: \n" + "-D configPath=<configPath> -D inputPath=<inputPath> -D reducers=<reducers" +
"-D outputPath=<path>");
}
JobConf conf = new JobConf(SFTPClient.class);
conf.setJobName("SFTP Injection client");
DistributedCache.addCacheFile(new URI(configPath),conf);
conf.setMapperClass(FTPMapper.class);
conf.setReducerClass(FTPReducer.class);
conf.setMapOutputKeyClass(IntWritable.class);
conf.setMapOutputValueClass(Text.class);
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(IntWritable.class);
// configuration should contain reference to your namenode
FileSystem fs = FileSystem.get(new Configuration());
fs.delete(new Path(outputPath), true); // true stands for recursively, deleting the folder you gave
conf.setStrings(ApplicationConstants.INPUT_PATH, inputPath);
conf.setStrings(ApplicationConstants.OUTPUT_PATH, outputPath);
FileInputFormat.setInputPaths(conf, new Path(inputPath));
FileOutputFormat.setOutputPath(conf, new Path(outputPath));
conf.setNumReduceTasks(reducers);
conf.setInt(ApplicationConstants.NUNBER_OF_REDUCERS, reducers);
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new SFTPClient(), args);
System.exit(exitCode);
}
}
Mapper
/*
* #FTPMapper.java #May 20, 2014
*
*
*/
package com.skanda.ecomm.sftp;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import com.ftp.mapreduce.CommonUtility;
import com.ftp.mapreduce.RetrieveFileNames;
import com.jcraft.jsch.hm.Channel;
/**
*
* <p>
* FTP Mapper Class
* </p>
*
* @author skanda
* @version 1.0
*
*/
@SuppressWarnings("unused")
public class FTPMapper extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text> {
private URI[] localFiles;
private String userName;
private String hostName;
private String folderPath;
private int reducers;
private byte[] pvtKey;
private String fileName;
private String startDate;
private String endDate;
private String sshKeyPath;
private String password;
public void configure(JobConf job) {
Properties properties = new Properties();
try {
localFiles = DistributedCache.getCacheFiles(job);
if (localFiles != null && localFiles.length == 1) {
Configuration conf = new Configuration();
FileSystem fileSystem = FileSystem.get(localFiles[0], conf);
BufferedReader bufferRedaer=new BufferedReader(new InputStreamReader(fileSystem.open(new Path(localFiles[0]))));
properties.load(bufferRedaer);
userName = properties.getProperty(ApplicationConstants.USER_NAME);
reducers = job.getInt(ApplicationConstants.NUNBER_OF_REDUCERS, 30);
hostName = properties.getProperty(ApplicationConstants.SFTP_SERVER_HOST);
folderPath = properties.getProperty(ApplicationConstants.HOSTFILE_DIRECTORY_PATH);
fileName = properties.getProperty(ApplicationConstants.FILE_NAME_PATTERN);
startDate = properties.getProperty(ApplicationConstants.FILE_START_DATE);
endDate = properties.getProperty(ApplicationConstants.FILE_END_DATE);
sshKeyPath = properties.getProperty(ApplicationConstants.SSH_KEY_PATH);
password = properties.getProperty(ApplicationConstants.PASSWORD);
System.out.println("--------------------------------------------------");
/*FileSystem fs = FileSystem.getLocal(conf);
//Path inputPath = fs.makeQualified(new Path(sshKeyPath));
String inputPath = new Path("file:///home/cloudera/"+sshKeyPath).toUri().getPath();
fs.copyFromLocalFile(new Path(inputPath), new Path("outputSFTP/idFile") );*/
try{
Configuration conf1 = new Configuration();
Path pt = new Path("file:///home/cloudera/.ssh/id_rsa");
FileSystem fs = FileSystem.get( new URI("file:///home/cloudera/.ssh/id_rsa"), conf);
LocalFileSystem localFileSystem = fs.getLocal(conf1);
BufferedReader bufferRedaer1 = new BufferedReader(new InputStreamReader(localFileSystem.open(pt)));
String str = null;
while ((str = bufferRedaer1.readLine())!= null)
{
System.out.println("-----------");
System.out.println(str);
}
}catch(Exception e){
System.out.println("failed again");
String computername=InetAddress.getLocalHost().getHostName();
System.out.println(computername);
e.printStackTrace();
}
System.out.println("--------------------------------------------------");
Configuration config = new Configuration();
config.set("fs.defaultFS", "file:////");
LocalFileSystem localFileSystem = FileSystem.getLocal(config);
bufferRedaer = new BufferedReader(new InputStreamReader(localFileSystem.open(new Path(sshKeyPath))));
/*Configuration config = new Configuration();
//config.set("fs.defaultFS", "file:///home/cloudera/.ssh/id_rsa");
LocalFileSystem fileSystm = FileSystem.getLocal(config);
Path path = fileSystm.makeQualified(new Path("/home/cloudera/.ssh/id_rsa"));*/
//FileInputFormat.setInputPaths(job, path);
//bufferRedaer = new BufferedReader(new InputStreamReader(fileSystem.open(path)));
String key = "";
try {
String line = "";
while ((line = bufferRedaer.readLine()) != null) {
key += line + "\n";
}
pvtKey = key.getBytes();
} catch(Exception e){
e.printStackTrace();
} finally {
//fileSystem.close();
//bufferRedaer.close();
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
public void map(LongWritable key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter)
throws IOException {
List<String> filterFileNamesList = new ArrayList<String>();
Channel channel = CommonUtility.connectSFTP(userName, hostName, pvtKey);
Map<String, String> fileNamesMap = CommonUtility.getFileNames(channel, folderPath);
List<String> filterFileNameList_output = RetrieveFileNames.FILTER_BY_NAME.retrieveFileNames(fileNamesMap, filterFileNamesList,
fileName, startDate, endDate);
for (int i = 0; i < filterFileNameList_output.size(); i++) {
int keyGroup = i % reducers;
output.collect(new IntWritable(keyGroup), new Text(filterFileNameList_output.get(i)));
}
}
}
This code works for me when the program runs on HDFS and my txt file is in this location:
/home/Rishi/Documents/RishiFile/r.txt
public class HadoopRead {
public static void main(String[] args) {
try{
Configuration conf = new Configuration();
Path pt = new Path("/home/Rishi/Documents/RishiFile/r.txt");
FileSystem fs = FileSystem.get( new URI("/home/Rishi/Documents/RishiFile"), conf);
LocalFileSystem localFileSystem = fs.getLocal(conf);
BufferedReader bufferRedaer = new BufferedReader(new InputStreamReader(localFileSystem.open(pt)));
String str = null;
while ((str = bufferRedaer.readLine())!= null)
{
System.out.println("-----------");
System.out.println(str);
}
}catch(Exception e){
e.printStackTrace();
}
}
}
Word count example for reading a local file on HDFS.
My main class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class FileDriver extends Configured implements Tool {
public static void main(String[] args) {
try{
ToolRunner.run(new Configuration(), new FileDriver(), args);
System.exit(0);
}catch(Exception e){
e.printStackTrace();
}
}
public int run(String[] arg0) throws Exception {
Configuration conf = new Configuration();
Path pt = new Path("file:///home/winoria/Documents/Ri/r");
Job job = new Job(conf, "new Job");
job.setJarByClass(FileDriver.class);
job.setMapperClass(FileMapper.class);
job.setReducerClass(FileReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, pt);
FileSystem.get(job.getConfiguration()).delete(new Path("Output2"), true);
FileOutputFormat.setOutputPath(job, new Path("Output2"));
job.waitForCompletion(true);
return 0;
}
}
Mapper class:
public class FileMapper extends Mapper<LongWritable, Text, Text, Text> {
protected void map(LongWritable key, Text value,Context context) throws java.io.IOException ,InterruptedException {
String str[] = value.toString().split(" ");
for(int i =0; i<str.length;i++){
context.write(new Text(str[i]), new Text());
}
};
}
Reducer Class:
public class FileReducer extends Reducer<Text, Text, Text, Text> {
protected void reduce(Text key,Iterable<Text> value,Context context) throws java.io.IOException ,InterruptedException {
int count=0;
for (Text text : value) {
count++;
}
context.write(key, new Text(count+""));
};
}
I want to parse PDF files in my Hadoop 2.2.0 program. I found this, followed what it says, and so far I have these three classes:
PDFWordCount: the main class containing the map and reduce functions (just like the native Hadoop wordcount sample, but instead of TextInputFormat I used my PDFInputFormat class).
PDFRecordReader extends RecordReader<LongWritable, Text>: this is where the main work happens. In particular, I have put my initialize function here for illustration.
public void initialize(InputSplit genericSplit, TaskAttemptContext context)
throws IOException, InterruptedException {
System.out.println("initialize");
System.out.println(genericSplit.toString());
FileSplit split = (FileSplit) genericSplit;
System.out.println("filesplit convertion has been done");
final Path file = split.getPath();
Configuration conf = context.getConfiguration();
conf.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
FileSystem fs = file.getFileSystem(conf);
System.out.println("fs has been opened");
start = split.getStart();
end = start + split.getLength();
System.out.println("going to open split");
FSDataInputStream filein = fs.open(split.getPath());
System.out.println("going to load pdf");
PDDocument pd = PDDocument.load(filein);
System.out.println("pdf has been loaded");
PDFTextStripper stripper = new PDFTextStripper();
in =
new LineReader(new ByteArrayInputStream(stripper.getText(pd).getBytes(
"UTF-8")));
start = 0;
this.pos = start;
System.out.println("init has finished");
}
(You can see my System.out.println calls for debugging.)
This method fails when converting genericSplit to FileSplit. The last thing I see in the console is this:
hdfs://localhost:9000/in:0+9396432
which is genericSplit.toString()
PDFInputFormat extends FileInputFormat<LongWritable, Text>: this just creates a new PDFRecordReader in its createRecordReader method.
I want to know what my mistake is. Do I need extra classes or something?
Reading PDFs is not that difficult; you need to extend the FileInputFormat class as well as the RecordReader. The FileInputFormat subclass should not be able to split PDF files, since they are binary files.
public class PDFInputFormat extends FileInputFormat<Text, Text> {
@Override
public RecordReader<Text, Text> createRecordReader(InputSplit split,
TaskAttemptContext context) throws IOException, InterruptedException {
return new PDFLineRecordReader();
}
// Do not allow to ever split PDF files, even if larger than HDFS block size
@Override
protected boolean isSplitable(JobContext context, Path filename) {
return false;
}
}
The RecordReader then performs the reading itself (I am using PDFBox to read PDFs).
public class PDFLineRecordReader extends RecordReader<Text, Text> {
private Text key = new Text();
private Text value = new Text();
private int currentLine = 0;
private List<String> lines = null;
private PDDocument doc = null;
private PDFTextStripper textStripper = null;
@Override
public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {
FileSplit fileSplit = (FileSplit) split;
final Path file = fileSplit.getPath();
Configuration conf = context.getConfiguration();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream filein = fs.open(fileSplit.getPath());
if (filein != null) {
doc = PDDocument.load(filein);
// Could the PDF be read?
if (doc != null) {
textStripper = new PDFTextStripper();
String text = textStripper.getText(doc);
lines = Arrays.asList(text.split(System.lineSeparator()));
currentLine = 0;
}
}
}
// False ends the reading process
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
if (key == null) {
key = new Text();
}
if (value == null) {
value = new Text();
}
if (currentLine < lines.size()) {
String line = lines.get(currentLine);
key.set(line);
value.set("");
currentLine++;
return true;
} else {
// All lines are read? -> end
key = null;
value = null;
return false;
}
}
@Override
public Text getCurrentKey() throws IOException, InterruptedException {
return key;
}
@Override
public Text getCurrentValue() throws IOException, InterruptedException {
return value;
}
@Override
public float getProgress() throws IOException, InterruptedException {
return (100.0f / lines.size() * currentLine) / 100.0f;
}
@Override
public void close() throws IOException {
// If done close the doc
if (doc != null) {
doc.close();
}
}
}
Hope this helps!
package com.sidd.hadoop.practice.pdf;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.sidd.hadoop.practice.input.pdf.PdfFileInputFormat;
import com.sidd.hadoop.practice.output.pdf.PdfFileOutputFormat;
public class ReadPdfFile {
public static class MyMapper extends
Mapper<LongWritable, Text, LongWritable, Text> {
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// context.progress();
context.write(key, value);
}
}
public static class MyReducer extends
Reducer<LongWritable, Text, LongWritable, Text> {
public void reduce(LongWritable key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
if (values.iterator().hasNext()) {
context.write(key, values.iterator().next());
} else {
context.write(key, new Text(""));
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "Read Pdf");
job.setJarByClass(ReadPdfFile.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(PdfFileInputFormat.class);
job.setOutputFormatClass(PdfFileOutputFormat.class);
removeDir(args[1], conf);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
public static void removeDir(String path, Configuration conf) throws IOException {
Path output_path = new Path(path);
FileSystem fs = FileSystem.get(conf);
if (fs.exists(output_path)) {
fs.delete(output_path, true);
}
}
}
I've written a small Hadoop map program to parse (via regex) information from log files generated by other apps. I found this article: http://www.nearinfinity.com//blogs/stephen_mouring_jr/2013/01/04/writing-hive-tables-from-mapreduce.html
This article explains how to parse the data and write it into a Hive table.
Here is my code:
import java.io.IOException;
import java.util.ArrayList;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class ParseDataToDB {
public static final String SEPARATOR_FIELD = new String(new char[] {1});
public static final String SEPARATOR_ARRAY_VALUE = new String(new char[] {2});
public static final BytesWritable NULL_KEY = new BytesWritable();
public static class MyMapper extends Mapper<LongWritable, Text, BytesWritable, Text> {
//private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
private ArrayList<String> bazValues = new ArrayList<String>();
public void map(LongWritable key, Text value,
OutputCollector<BytesWritable, Text> context)
throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while(tokenizer.hasMoreTokens()){
word.set(tokenizer.nextToken());
if(word.find("extract") > -1) {
System.out.println("in herer");
bazValues.add(line);
}
}
// Build up the array values as a delimited string.
StringBuilder bazValueBuilder = new StringBuilder();
int i = 0;
for (String bazValue : bazValues) {
bazValueBuilder.append(bazValue);
++i;
if (i < bazValues.size()) {
bazValueBuilder.append(SEPARATOR_ARRAY_VALUE);
}
}
// Build up the column values / fields as a delimited string.
String hiveRow = new String();
hiveRow += "fooValue";
hiveRow += SEPARATOR_FIELD;
hiveRow += "barValue";
hiveRow += SEPARATOR_FIELD;
hiveRow += bazValueBuilder.toString();
System.out.println("in herer hiveRow" + hiveRow);
// StringBuilder hiveRow = new StringBuilder();
// hiveRow.append("fooValue");
// hiveRow.append(SEPARATOR_FIELD);
// hiveRow.append("barValue");
// hiveRow.append(SEPARATOR_FIELD);
// hiveRow.append(bazValueBuilder.toString());
// Emit a null key and a Text object containing the delimited fields
context.collect(NULL_KEY, new Text(hiveRow));
}
}
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
Job job = new Job(conf, "MyTest");
job.setJarByClass(ParseDataToDB.class);
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(BytesWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(BytesWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
But when I run this app, I get an error saying "expected ByteWritable but received LongWritable". Can someone tell me what I'm doing wrong? I'm new to Hadoop programming. I'm also open to creating external tables and pointing them at HDFS, but again I'm struggling with the implementation.
Thanks.
From looking at the article you provided (LINK), it shows that you haven't set any value for NULL_KEY.
It should be:
public static final BytesWritable NULL_KEY = new BytesWritable(null);
I think, since you are trying to output NULL as the key from the map, you can use NullWritable. Your code would then be something like the below:
import java.io.IOException;
import java.util.ArrayList;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class ParseDataToDB {
public static final String SEPARATOR_FIELD = new String(new char[] {1});
public static final String SEPARATOR_ARRAY_VALUE = new String(new char[] {2});
public static class MyMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
//private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
private ArrayList<String> bazValues = new ArrayList<String>();
public void map(LongWritable key, Text value,
OutputCollector<NullWritable, Text> context)
throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while(tokenizer.hasMoreTokens()){
word.set(tokenizer.nextToken());
if(word.find("extract") > -1) {
System.out.println("in herer");
bazValues.add(line);
}
}
// Build up the array values as a delimited string.
StringBuilder bazValueBuilder = new StringBuilder();
int i = 0;
for (String bazValue : bazValues) {
bazValueBuilder.append(bazValue);
++i;
if (i < bazValues.size()) {
bazValueBuilder.append(SEPARATOR_ARRAY_VALUE);
}
}
// Build up the column values / fields as a delimited string.
String hiveRow = new String();
hiveRow += "fooValue";
hiveRow += SEPARATOR_FIELD;
hiveRow += "barValue";
hiveRow += SEPARATOR_FIELD;
hiveRow += bazValueBuilder.toString();
System.out.println("in herer hiveRow" + hiveRow);
// Emit a null key and a Text object containing the delimited fields
context.collect(NullWritable.get(), new Text(hiveRow));
}
}
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
Job job = new Job(conf, "MyTest");
job.setJarByClass(ParseDataToDB.class);
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Here's my source code
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class PageRank {
public static final String MAGIC_STRING = ">>>>";
boolean overwrite = true;
PageRank(boolean overwrite){
this.overwrite = overwrite;
}
public static class TextPair implements WritableComparable<TextPair>{
Text x;
int ordering;
public TextPair(){
x = new Text();
ordering = 1;
}
public void setText(Text t, int o){
x = t;
ordering = o;
}
public void setText(String t, int o){
x.set(t);
ordering = o;
}
public void readFields(DataInput in) throws IOException {
x.readFields(in);
ordering = in.readInt();
}
public void write(DataOutput out) throws IOException {
x.write(out);
out.writeInt(ordering);
}
public int hashCode() {
return x.hashCode();
}
public int compareTo(TextPair o) {
int x = this.x.compareTo(o.x);
if(x==0)
return ordering-o.ordering;
else
return x;
}
}
public static class MapperA extends Mapper<LongWritable, Text, TextPair, Text> {
private Text word = new Text();
Text title = new Text();
Text link = new Text();
TextPair textpair = new TextPair();
boolean start=false;
String currentTitle="";
private Pattern linkPattern = Pattern.compile("\\[\\[\\s*(.+?)\\s*\\]\\]");
private Pattern titlePattern = Pattern.compile("<title>\\s*(.+?)\\s*</title>");
private Pattern pagePattern = Pattern.compile("<page>\\s*(.+?)\\s*</page>");
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
int startPage=line.lastIndexOf("<title>");
if(startPage<0)
{
Matcher matcher = linkPattern.matcher(line);
int n = 0;
title.set(currentTitle);
while(matcher.find()){
textpair.setText(matcher.group(1), 1);
context.write(textpair, title);
}
link.set(MAGIC_STRING);
textpair.setText(title.toString(), 0);
context.write(textpair, link);
}
else
{
String result=line.trim();
Matcher titleMatcher = titlePattern.matcher(result);
if(titleMatcher.find()){
currentTitle = titleMatcher.group(1);
}
else
{
currentTitle=result;
}
}
}
}
public static class ReducerA extends Reducer<TextPair, Text, Text, Text>{
Text aw = new Text();
boolean valid = false;
String last = "";
public void run(Context context) throws IOException, InterruptedException {
setup(context);
while (context.nextKeyValue()) {
TextPair key = context.getCurrentKey();
Text value = context.getCurrentValue();
if(key.ordering==0){
last = key.x.toString();
}
else if(key.x.toString().equals(last)){
context.write(key.x, value);
}
}
cleanup(context);
}
}
public static class MapperB extends Mapper<Text, Text, Text, Text>{
Text t = new Text();
public void map(Text key, Text value, Context context) throws InterruptedException, IOException{
context.write(value, key);
}
}
public static class ReducerB extends Reducer<Text, Text, Text, PageRankRecord>{
ArrayList<String> q = new ArrayList<String>();
public void reduce(Text key, Iterable<Text> values, Context context)throws InterruptedException, IOException{
q.clear();
for(Text value:values){
q.add(value.toString());
}
PageRankRecord prr = new PageRankRecord();
prr.setPageRank(1.0);
if(q.size()>0){
String[] a = new String[q.size()];
q.toArray(a);
prr.setlinks(a);
}
context.write(key, prr);
}
}
public boolean roundA(Configuration conf, String inputPath, String outputPath, boolean overwrite) throws IOException, InterruptedException, ClassNotFoundException{
if(FileSystem.get(conf).exists(new Path(outputPath))){
if(overwrite){
FileSystem.get(conf).delete(new Path(outputPath), true);
System.err.println("The target file is dirty, overwriting!");
}
else
return true;
}
Job job = new Job(conf, "closure graph build round A");
//job.setJarByClass(GraphBuilder.class);
job.setMapperClass(MapperA.class);
//job.setCombinerClass(RankCombiner.class);
job.setReducerClass(ReducerA.class);
job.setMapOutputKeyClass(TextPair.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setNumReduceTasks(30);
FileInputFormat.addInputPath(job, new Path(inputPath));
SequenceFileOutputFormat.setOutputPath(job, new Path(outputPath));
return job.waitForCompletion(true);
}
public boolean roundB(Configuration conf, String inputPath, String outputPath) throws IOException, InterruptedException, ClassNotFoundException{
if(FileSystem.get(conf).exists(new Path(outputPath))){
if(overwrite){
FileSystem.get(conf).delete(new Path(outputPath), true);
System.err.println("The target file is dirty, overwriting!");
}
else
return true;
}
Job job = new Job(conf, "closure graph build round B");
//job.setJarByClass(PageRank.class);
job.setMapperClass(MapperB.class);
//job.setCombinerClass(RankCombiner.class);
job.setReducerClass(ReducerB.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(PageRankRecord.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setNumReduceTasks(30);
SequenceFileInputFormat.addInputPath(job, new Path(inputPath));
SequenceFileOutputFormat.setOutputPath(job, new Path(outputPath));
return job.waitForCompletion(true);
}
public boolean build(Configuration conf, String inputPath, String outputPath) throws IOException, InterruptedException, ClassNotFoundException{
System.err.println(inputPath);
if(roundA(conf, inputPath, "cgb", true)){
return roundB(conf, "cgb", outputPath);
}
else
return false;
}
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException{
Configuration conf = new Configuration();
//PageRanking.banner("ClosureGraphBuilder");
PageRank cgb = new PageRank(true);
cgb.build(conf, args[0], args[1]);
}
}
Here's how I compile and run it:
javac -classpath hadoop-0.20.1-core.jar -d pagerank_classes PageRank.java PageRankRecord.java
jar -cvf pagerank.jar -C pagerank_classes/ .
bin/hadoop jar pagerank.jar PageRank pagerank result
but I am getting the following errors:
INFO mapred.JobClient: Task Id : attempt_201001012025_0009_m_000001_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: PageRank$MapperA
Can someone tell me what's wrong?
Thanks
If you are using Hadoop 0.20 (and want to use the non-deprecated classes), you can do:
public int run(String[] args) throws Exception {
Job job = new Job();
job.setJarByClass(YourMapReduceClass.class); // <-- omitting this causes above error
job.setMapperClass(MyMapper.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
return 0;
}
Did "PageRank$MapperA.class" end up inside that jar file? It should be in the same place as "PageRank.class".
Try adding "--libjars pagerank.jar". The mapper and reducer run across machines, so you need to distribute your jar to every machine; "--libjars" helps do that.
For the HADOOP_CLASSPATH you should specify the folder where the JAR file is located...
If you want to understand how the classpath works: http://download.oracle.com/javase/6/docs/technotes/tools/windows/classpath.html
I guess you should change your HADOOP_CLASSPATH variable so that it points to the jar file,
e.g. HADOOP_CLASSPATH=<whatever the path>/PageRank.jar or something like that.
If you are using Eclipse to generate the jar, then use the "Extract generated libraries into generated JAR" option.
Although a MapReduce program does parallel processing, the Mapper, Combiner, and Reducer classes follow a sequential flow; each phase has to finish before the next one that depends on it can start, which is why you need job.waitForCompletion(true). But the input and output paths must be set before starting the Mapper, Combiner, and Reducer classes. Reference
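As a rough sketch of that ordering (the mapper and reducer here are made-up placeholders, not code from any answer above), the paths are configured on the Job before waitForCompletion(true) blocks until every phase has finished:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class DriverSketch {
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // purely illustrative: emit each input line with a count of 1
            context.write(value, one);
        }
    }
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "driver ordering sketch");
        job.setJarByClass(DriverSketch.class);
        job.setMapperClass(MyMapper.class);
        job.setCombinerClass(MyReducer.class);   // combiner reuses the reducer logic
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // input and output paths must be set before the job is submitted
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // blocks until map, combine, and reduce have all completed
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}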
A solution for this has already been answered in https://stackoverflow.com/a/38145962/3452185.