I am using the hadoop-local command when building the jar for my MapReduce job. Instead of specifying the path of each file in my input folder individually, is there a way to pass all the files in that folder to the job? The contents and number of files can change because of the nature of the MapReduce job I am configuring, and since I do not know the exact number of files (only their contents), I would like to pass the whole input folder to my MapReduce program and then iterate over each file, compute a certain function, and send the results to the reducer. I am only using one Map/Reduce program and I am coding in Java. I am able to use the hadoop-moonshot command, but I am working with hadoop-local at the moment.
Thanks.
You don't have to pass individual files as input to the MapReduce job.
The FileInputFormat class already provides an API to accept a list of multiple files as input to a MapReduce program.
public static void setInputPaths(Job job,
                                 Path... inputPaths)
                          throws IOException
Set the array of Paths as the list of inputs for the map-reduce job.
Parameters:
job - the job to modify
inputPaths - the Paths of the input directories/files for the map-reduce job.
Example code from Apache tutorial
Job job = Job.getInstance(conf, "word count");
FileInputFormat.addInputPath(job, new Path(args[0]));
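For example, a minimal sketch in the driver (the directory names below are hypothetical) that registers several inputs, either by calling addInputPath() repeatedly or by handing them all to setInputPaths():

// add paths one by one...
FileInputFormat.addInputPath(job, new Path("/user/me/input/dir1"));
FileInputFormat.addInputPath(job, new Path("/user/me/input/dir2"));

// ...or set them all in a single call
FileInputFormat.setInputPaths(job, new Path("/user/me/input/dir1"),
        new Path("/user/me/input/dir2"));

Passing a directory is enough for your case: every file inside it becomes part of the job's input, whatever the number of files.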
The MultipleInputs class provides the API below.
public static void addInputPath(Job job,
Path path,
Class<? extends InputFormat> inputFormatClass,
Class<? extends Mapper> mapperClass)
Add a Path with a custom InputFormat and Mapper to the list of inputs for the map-reduce job.
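A short sketch of how that call is typically used (the paths and the FirstMapper/SecondMapper classes are placeholders, not from the original post):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

MultipleInputs.addInputPath(job, new Path("/data/input1"),
        TextInputFormat.class, FirstMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/input2"),
        TextInputFormat.class, SecondMapper.class);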
Related SE question:
Can hadoop take input from multiple directories and files
Refer to the MultipleOutputs API regarding your second query on multiple output paths.
FileOutputFormat.setOutputPath(job, outDir);
// Defines additional single text based output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
LongWritable.class, Text.class);
// Defines additional sequence-file based output 'sequence' for the job
MultipleOutputs.addNamedOutput(job, "seq",
SequenceFileOutputFormat.class,
LongWritable.class, Text.class);
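To actually write to those named outputs you need a MultipleOutputs instance inside your reducer (or mapper). A minimal sketch, assuming the key/value types configured above (MyReducer is a placeholder class name):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public static class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    private MultipleOutputs<LongWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            mos.write("text", key, value); // goes to the 'text' named output
            mos.write("seq", key, value);  // goes to the 'seq' named output
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}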
Have a look at related SE questions regarding multiple output files.
Writing to multiple folders in hadoop?
hadoop method to send output to multiple directories
Related
Hey, this is more of a Java question, but it is related to Hadoop.
I have this line of code in my MapReduce Java job:
JobConf conf= new JobConf(WordCount.class);
conf.setJobName("Word Count");
.............
.............
.............
FileInputFormat.addInputPath(conf, new Path(args[0]));
instead of "giving" a directory with many files how do i set specific file name ?
From the book "Hadoop: The Definitive Guide":
An input path is specified by calling the static addInputPath() method
on FileInputFormat, and it can be a single file, a directory (in which
case the input forms all the files in that directory), or a file
pattern. As the name suggests, addInputPath() can be called more than
once to use input from multiple paths.
So to answer your question, you should be able to just pass the path of your specific single file, and it will be used as the only input (as long as you do not make further calls to addInputPath() with other paths).
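For example (the file name below is hypothetical, purely to illustrate pointing at a single file instead of a directory):

FileInputFormat.addInputPath(conf, new Path("/user/me/input/part-00000.txt"));

A file pattern such as new Path("/user/me/input/*.txt") can be passed in the same way.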
If you only want to run the MapReduce job on one file, a quick and easy workaround is to move that file into a folder by itself and then pass that folder's path to addInputPath().
If you're trying to read a whole file per map task then might I suggest taking a look at this post:
Reading file as single record in hadoop
What exactly are you trying to do?
I would have posted this as a comment, but apparently I don't have sufficient privileges...
I am trying to run a graph verifier app on a distributed system using Hadoop.
I have the input in the following format:
Directory1
---file1.dot
---file2.dot
…..
---filen.dot
Directory2
---file1.dot
---file2.dot
…..
---filen.dot
Directory670
---file1.dot
---file2.dot
…..
---filen.dot
The .dot files store the graphs.
Is it enough for me to add the input directories' paths using FileInputFormat.addInputPath()?
I want Hadoop to process the contents of each directory on the same node, because the files in each directory contain data that depends on the presence of the other files in that same directory.
Will the Hadoop framework take care of distributing the directories evenly across the nodes of the cluster (e.g. directory 1 to node 1, directory 2 to node 2, and so on) and processing them in parallel?
The files in each directory depend on each other for data. To be precise, each directory contains a file main.dot, which holds an acyclic graph whose vertices are the names of the rest of the files.
My verifier traverses each vertex of the graph in main.dot, searches for the file of the same name in the same directory, and, if found, processes the data in that file.
All the files are processed in the same way, and the combined output after processing every file in the directory is displayed.
The same procedure applies to the rest of the directories.
To cut a long story short:
As in the famous word count application (if the input is a single book), Hadoop splits the input and distributes the task to each node in the cluster, where the mapper processes each line and counts the relevant words.
How can I split the task here (do I need to split it at all)?
How can I leverage Hadoop's power for this scenario? A sample code template would certainly help :)
The solution given by Alexey Shestakov will work, but it does not leverage MapReduce's distributed processing framework: most likely only one map task will read the file (the file containing the paths of all input directories) and then process all the input data.
How can we allocate all the files in a directory to a single mapper, so that the number of mappers equals the number of directories?
One solution could be to use the org.apache.hadoop.mapred.lib.MultipleInputs class.
Use MultipleInputs.addInputPath() to add each directory path together with the map class for that directory; each mapper then gets one directory and processes all the files within it, as sketched below.
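A minimal sketch of that idea (the directory names and the GraphMapper class are placeholders, and the example uses the new-API class org.apache.hadoop.mapreduce.lib.input.MultipleInputs):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

String[] dirs = { "/graphs/Directory1", "/graphs/Directory2", "/graphs/Directory670" };
for (String dir : dirs) {
    // one input path (and hence a dedicated mapper class binding) per directory
    MultipleInputs.addInputPath(job, new Path(dir),
            TextInputFormat.class, GraphMapper.class);
}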
You can create a file with the list of all directories to process:
/path/to/directory1
/path/to/directory2
/path/to/directory3
Each mapper would process one directory, for example:
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // 'value' is one line of the listing file, i.e. the path of one directory
    FileSystem fs = FileSystem.get(context.getConfiguration());
    for (FileStatus status : fs.listStatus(new Path(value.toString()))) {
        // process file
    }
}
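To actually get one mapper per directory with this listing-file approach, the job also needs to produce one split per line. One way to do that (my own suggestion, not part of the original answer; the listing file name is hypothetical) is NLineInputFormat:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.addInputPath(job, new Path("/path/to/directory-list.txt")); // the listing file above
NLineInputFormat.setNumLinesPerSplit(job, 1); // one directory path per mapper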
Will the Hadoop framework take care of distributing the directories evenly across the nodes of the cluster (e.g. directory 1 to node 1, directory 2 to node 2, and so on) and processing them in parallel?
No, it won't. Files are not distributed to each node in the sense that the files are copied to the node to be processed. Instead, to put it simply, each node is given a set of file paths to process with no guarantee on location or data locality. The datanode then pulls that file from HDFS and processes it.
There's no reason why you can't just open other files you may need directly from HDFS.
I'm using Cloudera Hadoop. I'm able to run a simple MapReduce program where I provide a file as input to the program.
This file contains the names of all the other files to be processed by the mapper function.
But, I'm stuck at one point.
/folder1
- file1.txt
- file2.txt
- file3.txt
How can I specify the input path to MapReduce program as "/folder1", so that it can start processing each file inside that directory ?
Any ideas ?
EDIT :
1) Initially, I provided inputFile.txt as input to the MapReduce program. It was working perfectly.
>inputFile.txt
file1.txt
file2.txt
file3.txt
2) But now, instead of giving an input file, I want to provide an input directory as args[0] on the command line.
hadoop jar ABC.jar /folder1 /output
The problem is that FileInputFormat doesn't read files recursively in the input path directory.
Solution: use the following code.
Add FileInputFormat.setInputDirRecursive(job, true); before the line below in your MapReduce code:
FileInputFormat.addInputPath(job, new Path(args[0]));
You can check here for which version it was fixed.
You could use FileSystem.listStatus to get the file list from the given directory; the code could be as below:
// get the FileSystem; you will need to initialize it properly
FileSystem fs = FileSystem.get(conf);
// get the FileStatus list from the given dir
FileStatus[] status_list = fs.listStatus(new Path(args[0]));
if (status_list != null) {
    for (FileStatus status : status_list) {
        // add each file to the list of inputs for the map-reduce job
        FileInputFormat.addInputPath(conf, status.getPath());
    }
}
You can use HDFS wildcards (glob patterns) to provide multiple files.
So, the solution:
hadoop jar ABC.jar /folder1/* /output
or
hadoop jar ABC.jar /folder1/*.txt /output
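The same glob can also be used programmatically in the driver, since FileInputFormat expands glob patterns when listing its input paths; a short sketch:

FileInputFormat.addInputPath(job, new Path("/folder1/*.txt"));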
Use the MultipleInputs class.
MultipleInputs.addInputPath(Job job, Path path,
        Class<? extends InputFormat> inputFormatClass,
        Class<? extends Mapper> mapperClass)
Have a look at working code
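For the layout in this question the call would look roughly like this (MyMapper is a placeholder for your own mapper class):

MultipleInputs.addInputPath(job, new Path("/folder1"),
        TextInputFormat.class, MyMapper.class);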
I want to insert the output of my MapReduce job into an HBase table using the HBase bulk loading API LoadIncrementalHFiles.doBulkLoad(new Path(), hTable).
I am emitting the KeyValue data type from my mapper and then using HFileOutputFormat to prepare my HFiles with its default reducer.
When I run my MapReduce job, it completes without any errors and creates the output file; however, the final step of inserting the HFiles into HBase does not happen. I get the error below after the MapReduce job completes:
13/09/08 03:39:51 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://localhost:54310/user/xx.xx/output/_SUCCESS
13/09/08 03:39:51 WARN mapreduce.LoadIncrementalHFiles: Bulk load operation did not find any files to load in directory output/. Does it contain files in subdirectories that correspond to column family names?
But I can see the output directory containing:
1. _SUCCESS
2. _logs
3. _0/2aa96255f7f5446a8ea7f82aa2bd299e file (which contains my data)
I have no clue why my bulk loader is not picking up the files from the output directory.
Below is the code of my MapReduce driver class:
public static void main(String[] args) throws Exception {
    String inputFile = args[0];
    String tableName = args[1];
    String outFile = args[2];
    Path inputPath = new Path(inputFile);
    Path outPath = new Path(outFile);

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // set the configurations
    conf.set("mapred.job.tracker", "localhost:54311");

    // Input data to HTable using Map Reduce
    Job job = new Job(conf, "MapReduce - Word Frequency Count");
    job.setJarByClass(MapReduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, inputPath);

    fs.delete(outPath);
    FileOutputFormat.setOutputPath(job, outPath);

    job.setMapperClass(MapReduce.MyMap.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);

    HTable hTable = new HTable(conf, tableName.toUpperCase());

    // Auto configure partitioner and reducer
    HFileOutputFormat.configureIncrementalLoad(job, hTable);

    job.waitForCompletion(true);

    // Load generated HFiles into table
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(new Path(outFile), hTable);
}
How can I figure out what is going wrong here and preventing my data from being inserted into HBase?
Finally, I figured out why my HFiles were not getting dumped into HBase. Below are the details:
My CREATE TABLE DDL statement did not specify any column family, so my guess is that Phoenix created the default column family as "_0". I was able to see this column family in my HDFS /hbase directory.
However, when I used HBase's LoadIncrementalHFiles API to fetch the files from my output directory, it was not picking up my directory named after the column family ("_0" in my case). I debugged the LoadIncrementalHFiles code and found that it skips all directories in the output path that start with "_" (e.g. "_logs").
I retried the same thing, but this time specifying a column family, and everything worked perfectly. I am able to query the data using Phoenix SQL.
I'm new to Hadoop and recently I was asked to do a test project using Hadoop.
So while I was reading about Big Data, I happened to learn about Pail. Now what I want to do is something like this: first create a simple object, then serialize it using Thrift and put it into HDFS using Pail. Then I want to get that object inside the map function and do whatever I want with it. But I have no idea how to get that object inside the map function.
Can someone please tell me of any references or explain how to do that?
I can think of three options:
Use the -files option and name the file in HDFS (preferable as the task tracker will download the file once for all jobs running on that node)
Use the DistributedCache (similar logic to the above), but you configure the file via some API calls rather than through the command line
Load the file directly from HDFS (less efficient as you're pulling the file over HDFS for each task)
As for some code, put the load logic into your mapper's setup(...) or configure(..) method (depending on whether you're using the new or old API) as follows:
protected void setup(Context context) throws IOException {
    // the -files option makes the named file available in the local directory
    File file = new File("filename.dat");
    // open file and load contents ...

    // load the file directly from HDFS
    FileSystem fs = FileSystem.get(context.getConfiguration());
    InputStream hdfsInputStream = fs.open(new Path("/path/to/file/in/hdfs/filename.dat"));
    // load file contents from stream...
}
DistributedCache has some example code in the Javadocs
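As a short sketch of option 2, using the Job and task-context methods that wrap the DistributedCache on current Hadoop versions (the path is hypothetical):

import java.net.URI;

// driver side: register the HDFS file with the distributed cache
job.addCacheFile(new URI("/path/to/file/in/hdfs/filename.dat"));

// task side (e.g. in setup()): list the URIs of the cached files
URI[] cacheFiles = context.getCacheFiles();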