Hadoop : Provide directory as input to MapReduce job - java

I'm using Cloudera Hadoop. I'm able to run a simple MapReduce program where I provide a file as input to the MapReduce program.
This file contains all the other files to be processed by the mapper function.
But, I'm stuck at one point.
/folder1
- file1.txt
- file2.txt
- file3.txt
How can I specify the input path to the MapReduce program as "/folder1", so that it can start processing each file inside that directory?
Any ideas?
EDIT :
1) Initially, I provided inputFile.txt as input to the MapReduce program. It was working perfectly.
>inputFile.txt
file1.txt
file2.txt
file3.txt
2) But now, instead of giving an input file, I want to provide an input directory as arg[0] on the command line.
hadoop jar ABC.jar /folder1 /output

The problem is that FileInputFormat doesn't read files recursively in the input path directory.
Solution: use the following code
FileInputFormat.setInputDirRecursive(job, true);
before this line in your MapReduce code:
FileInputFormat.addInputPath(job, new Path(args[0]));
You can check here for which version it was fixed.
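For context, a minimal driver sketch using the new mapreduce API could look like the following; the FolderInputDriver class name and job name are placeholders, and the identity Mapper/Reducer stand in for your own classes:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class FolderInputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "process folder1");
        job.setJarByClass(FolderInputDriver.class);
        // the identity Mapper/Reducer keep the sketch self-contained; substitute your own classes here
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        // make FileInputFormat descend into subdirectories of the input path
        FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /folder1
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}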

You could use FileSystem.listStatus to get the file list from the given directory; the code could be as below:
// get the FileSystem; you will need to initialize it properly
FileSystem fs = FileSystem.get(conf);
// get the FileStatus list from the given dir
FileStatus[] status_list = fs.listStatus(new Path(args[0]));
if (status_list != null) {
    for (FileStatus status : status_list) {
        // add each file to the list of inputs for the map-reduce job
        FileInputFormat.addInputPath(conf, status.getPath());
    }
}
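If the directory may also contain subdirectories, a small variant of the loop above can skip anything that is not a regular file; this assumes FileStatus.isFile() is available in your Hadoop version:
for (FileStatus status : fs.listStatus(new Path(args[0]))) {
    // only add regular files, skipping any subdirectories
    if (status.isFile()) {
        FileInputFormat.addInputPath(conf, status.getPath());
    }
}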

You can use HDFS wildcards in order to provide multiple files.
So, the solution:
hadoop jar ABC.jar /folder1/* /output
or
hadoop jar ABC.jar /folder1/*.txt /output
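The same glob can also be set programmatically; a short sketch, on the assumption that FileInputFormat expands glob patterns in its input paths:
// the glob pattern is expanded when the job lists its input files
FileInputFormat.addInputPath(job, new Path("/folder1/*.txt"));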

Use the MultipleInputs class.
MultipleInputs.addInputPath(Job job, Path path,
    Class<? extends InputFormat> inputFormatClass,
    Class<? extends Mapper> mapperClass)
Have a look at the working code.
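A sketch of what such a call might look like in a driver; TextInputFormat and the Folder1Mapper class name are illustrative assumptions rather than part of the original answer:
// each input path gets its own InputFormat and Mapper
MultipleInputs.addInputPath(job, new Path("/folder1"),
        TextInputFormat.class, Folder1Mapper.class); // Folder1Mapper is a hypothetical mapper class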

Related

Hadoop, MapReduce - Multiple Input/Output Paths

When making the jar for my MapReduce job, I am using the hadoop-local command with my input file. I wanted to know whether there was a way of passing all the files from my input folder, instead of specifically specifying the path of each file in that folder to be used in the MapReduce job. This is because the contents and number of files could change due to the nature of the MapReduce job I am trying to configure, and since I do not know the exact number of files, only their contents, is there a way to pass all the files from the input folder into my MapReduce program and then iterate over each file to compute a certain function, which would then send the results to the reducer? I am only using one Map/Reduce program and I am coding in Java. I am able to use the hadoop-moonshot command, but I am working with hadoop-local at the moment.
Thanks.
You don't have to pass each individual file as input to your MapReduce job.
The FileInputFormat class already provides an API to accept a list of multiple files as input to a MapReduce program.
public static void setInputPaths(Job job, Path... inputPaths) throws IOException
Set the given Paths as the list of inputs for the map-reduce job.
Parameters:
job - the Job to modify
inputPaths - the Paths of the input directories or files
Example code from Apache tutorial
Job job = Job.getInstance(conf, "word count");
FileInputFormat.addInputPath(job, new Path(args[0]));
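For the varargs setInputPaths form shown above, a call could look like this; the directory names are only placeholders:
// replaces any previously configured inputs with exactly these paths
FileInputFormat.setInputPaths(job, new Path("/folder1"), new Path("/folder2"));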
MultipleInputs provides the API below.
public static void addInputPath(Job job, Path path,
    Class<? extends InputFormat> inputFormatClass,
    Class<? extends Mapper> mapperClass)
Add a Path with a custom InputFormat and Mapper to the list of inputs for the map-reduce job.
Related SE question:
Can hadoop take input from multiple directories and files
Refer to MultipleOutputs API regarding your second query on multiple output paths.
FileOutputFormat.setOutputPath(job, outDir);
// Defines additional text-based output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
    LongWritable.class, Text.class);
// Defines additional sequence-file based output 'seq' for the job
MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class,
    LongWritable.class, Text.class);
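On the reducer side the named outputs are then written through a MultipleOutputs instance; a sketch of the usual pattern, assuming the LongWritable/Text types configured above (the NamedOutputReducer class name is just illustrative):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
public class NamedOutputReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    private MultipleOutputs<LongWritable, Text> mos;
    @Override
    protected void setup(Context context) {
        // one MultipleOutputs instance per task, created once
        mos = new MultipleOutputs<>(context);
    }
    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            mos.write("text", key, value); // goes to the 'text' named output
            mos.write("seq", key, value);  // goes to the 'seq' named output
        }
    }
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // flush and close all named outputs
        mos.close();
    }
}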
Have a look at related SE questions regarding multiple output files.
Writing to multiple folders in hadoop?
hadoop method to send output to multiple directories

Hadoop MapReduce: let addInputPath work with a specific file name

Hey, this is more a Java question, but it is related to Hadoop.
I have this code in my MapReduce Java job:
JobConf conf= new JobConf(WordCount.class);
conf.setJobName("Word Count");
.............
.............
.............
FileInputFormat.addInputPath(conf, new Path(args[0]));
instead of "giving" a directory with many files how do i set specific file name ?
From the book "Hadoop: The Definitive Guide":
An input path is specified by calling the static addInputPath() method
on FileInputFormat, and it can be a single file, a directory (in which
case the input forms all the files in that directory), or a file
pattern. As the name suggests, addInputPath() can be called more than
once to use input from multiple paths.
So to answer your question, you should be able to just pass the path to your specific single file, and it will be used as the only input (as long as you do not make further calls to addInputPath() with other paths).
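For example, something along these lines, where the file path is purely illustrative:
// point the job at one specific file instead of a whole directory
FileInputFormat.addInputPath(conf, new Path("/user/hadoop/input/file1.txt"));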
If you only want to do map-reduce work on one file, a quick and easy workaround is to move that file into a folder by itself and then provide that folder's path to addInputPath.
If you're trying to read a whole file per map task then might I suggest taking a look at this post:
Reading file as single record in hadoop
What exactly are you trying to do?
I would have posted this as a comment, but apparently I don't have sufficient privileges...

How to join the Pig output files?

The Pig script outputs a few part files (part-m-00000, part-m-00001, etc.) along with .pig_header and .pig_schema, and I am trying to join them into one output CSV.
I tried to use the Hadoop getmerge command:
hadoop fs -getmerge ./output output.csv
but the files are merged with the .pig_schema file as well, so it becomes something like
header1,header2,header3
{"fields":[{"name": "header1", "type":...}]}
value1,value2,value3
How do I join them correctly without the .pig_schema included?
Thanks!
Use a fileglob: hadoop fs -getmerge ./output/part* output.csv

Hadoop Mapreduce multiple Input files

So I need two files as input to my MapReduce program: City.dat and Country.dat.
In my main method I'm parsing the command line arguments like this:
Path cityInputPath = new Path(args[0]);
Path countryInputPath = new Path(args[1]);
Path outputPath = new Path(args[2]);
MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
If I now run my program with the following command:
hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output
I get the following error:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /user/cloudera/capital/input/Country.dat already exists
Why does it treat this as my output directory? I specified another directory as the output directory. Can somebody explain this?
Based on the stack trace, your output directory is not empty, so the simplest thing is to delete it before running the job:
bin/hadoop fs -rmr /user/cloudera/capital/output
Besides that, your arguments start with the class name of your main class, org.myorg.Capital, so that is the argument at index zero (based on the stack trace and the code you have provided).
Basically, you need to shift all your indices one to the right:
Path cityInputPath = new Path(args[1]);
Path countryInputPath = new Path(args[2]);
Path outputPath = new Path(args[3]);
MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
Don't forget to clear your output folder though!
Also a small tip for you: you can separate the files with a comma "," so you can set them with a single call like this:
hadoop jar capital.jar /user/cloudera/capital/input/City.dat,/user/cloudera/capital/input/Country.dat
And in your Java code:
FileInputFormat.addInputPaths(job, args[1]);
What is happening here is that the class name is deemed to be the first argument!
By default, the first non-option argument is the name of the class to be invoked. A fully-qualified class name should be used. If the -jar option is specified, the first non-option argument is the name of a JAR archive containing class and resource files for the application, with the startup class indicated by the Main-Class manifest header.
So what I would suggest is that you add a manifest file to your jar in which you specify the main class. Your MANIFEST.MF file may look like:
Manifest-Version: 1.0
Main-Class: org.myorg.Capital
And now your command would look like:
hadoop jar capital.jar /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output
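To bundle that manifest into the jar, something along these lines should work (a sketch that assumes your compiled classes live under a classes/ directory):
jar cfm capital.jar MANIFEST.MF -C classes/ .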
You can certainly just change the index values used in your code, but that's not an advisable solution.
Can you try this:
hadoop jar capital.jar /user/cloudera/capital/input /user/cloudera/capital/output
This should read all files in the single input directory.

How to put a serialized object into the Hadoop DFS and get it back inside the map function?

I'm new to Hadoop and recently I was asked to do a test project using Hadoop.
So while I was reading about Big Data, I happened to learn about Pail. Now what I want to do is something like this: first create a simple object, then serialize it using Thrift and put it into HDFS using Pail. Then I want to get that object inside the map function and do whatever I want. But I have no idea how to get that object inside the map function.
Can someone please point me to some references or explain how to do that?
I can think of three options:
1) Use the -files option and name the file in HDFS (preferable, as the task tracker will download the file once for all jobs running on that node).
2) Use the DistributedCache (similar logic to the above), but you configure the file via some API calls rather than through the command line.
3) Load the file directly from HDFS (less efficient, as you're pulling the file over HDFS for each task).
As for some code, put the load logic into your mapper's setup(...) or configure(...) method (depending on whether you're using the new or old API), as follows:
protected void setup(Context context) throws IOException {
    // the -files option makes the named file available in the task's local working directory
    File file = new File("filename.dat");
    // open the file and load its contents ...
    // or load the file directly from HDFS
    FileSystem fs = FileSystem.get(context.getConfiguration());
    InputStream hdfsInputStream = fs.open(new Path("/path/to/file/in/hdfs/filename.dat"));
    // load the file contents from the stream ...
}
DistributedCache has some example code in the Javadocs
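For the second option, the usual pattern is roughly the following; this is a sketch against the older org.apache.hadoop.filecache.DistributedCache API (newer Hadoop versions expose the same idea through Job.addCacheFile and Context.getCacheFiles):
// in the driver, before submitting the job (URI is java.net.URI)
DistributedCache.addCacheFile(new URI("/path/to/file/in/hdfs/filename.dat"), conf);
// in the mapper's setup()/configure() method
Path[] cachedFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
if (cachedFiles != null && cachedFiles.length > 0) {
    // the file has been copied to the local disk of the task node
    File localCopy = new File(cachedFiles[0].toString());
    // open localCopy and deserialize your object ...
}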
