Hadoop MapReduce: let addInputPath work with a specific file name - java

Hey, this is more of a Java question, but it is related to Hadoop.
I have this line of code in my MapReduce Java job:
JobConf conf= new JobConf(WordCount.class);
conf.setJobName("Word Count");
.............
.............
.............
FileInputFormat.addInputPath(conf, new Path(args[0]));
Instead of "giving" a directory with many files, how do I set a specific file name?

From the book "Hadoop: The Definitive Guide":
An input path is specified by calling the static addInputPath() method
on FileInputFormat, and it can be a single file, a directory (in which
case the input forms all the files in that directory), or a file
pattern. As the name suggests, addInputPath() can be called more than
once to use input from multiple paths.
So to answer your question, you should be able to just pass the path of your specific single file, and it will be used as the only input (as long as you do not make further calls to addInputPath() with other paths).
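For instance, a minimal sketch mirroring the JobConf setup from the question (the file path below is just a placeholder for your own file):
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("Word Count");
// pass the path of one specific file instead of a whole directory;
// only this file will be read unless addInputPath() is called again
FileInputFormat.addInputPath(conf, new Path("/user/hadoop/input/myfile.txt"));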

If you only want to do map-reduce work on one file, a quick and easy workaround is to move that file into a folder by itself and then pass that folder's path to addInputPath.
If you're trying to read a whole file per map task, then might I suggest taking a look at this post:
Reading file as single record in hadoop
What exactly are you trying to do?
I would have posted this as a comment, but I don't have sufficient privileges, apparently...

Related

Processing/Java File Count Issue With File Pathway (Variable Type)

Although the title isn't very descriptive, I do have a simple issue. I'm trying to write some code in a Processing sketch (https://processing.org/) that can count how many files are in a folder. The problem is that it doesn't accept the variable type.
File folder = File("My File Path");
folder.listFiles().size;
It says the function File(String) doesn't exist. When I try to put in the file path without quotation marks, it still doesn't work!
If you have a solution then please use a functioning example so that I know how it works. Thanks for any help!
As Joakim Danielson says, it is a constructor, so you need to use the new keyword.
The code below will work for you:
File folder = new File("My File Path");
int fileLength = folder.listFiles().length;
It's a constructor, so you need to use new:
File folder = new File("My File Path");
//To get the number of files in the folder
folder.listFiles().length;
Assuming the "My File Path" folder is inside your sketch folder, you need to provide the full path to it. Luckily, Processing already provides a helper function for that: sketchPath()
Here's an example:
File folder = new File(sketchPath("My File Path"));
println("folder.exists: " + folder.exists());
if (folder.exists()) {
  println(folder.listFiles().length + " files and/or directories");
} else {
  println("folder does not exist, double check the path");
}
Bear in mind there's also a dataPath() function, which points to a folder named data in your sketch folder. The data folder is typically used for storing external data, e.g. assets (raster or vector images, Processing font files) or raw data (binary/text/csv/xml/json/etc.). This is useful for separating your sketch source files from the data loaded/accessed by your sketch.
Also, Processing has a few utility functions for listing files and folders.
Be sure to check out Processing > Examples > Topics > File IO > DirectoryList
The example includes less-documented functions such as listFiles() (which returns an array of java.io.File objects based on the filters set) and listPaths() (which returns an array of String objects: just the paths).
The options and filters are quite handy; for example, if you want to list directories only and ignore files, you can simply write:
println("directories: " + listFiles(sketchPath("My File Path"),"directories").length);
For example, if you want to list all the wav files in a data/audio directory inside the sketch, you can use:
File[] files = listFiles(dataPath("audio"), "files", "extension=wav");
This will ignore directories and any other file that does not have the .wav extension.
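These filters can also be combined; here is a hedged sketch (the data/images folder name is just an assumption, and the options used are listed below):
// recursively list only the png and jpg files under data/images,
// including files in nested sub-directories
File[] images = listFiles(dataPath("images"), "files", "recursive", "extensions=png|jpg");
println(images.length + " image files found");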
To make this answer complete, here are a few more details on the options for listFiles/listPaths from Processing's source code:
"relative" -> no effect with the Files version, but important for listPaths
"recursive"-> traverse nested directories
"extension=js" or "extensions=js|csv|txt" (no dot)
"directories" -> only directories
"files" -> only files
"hidden" -> include hidden files (prefixed with .) disabled by default

Hadoop, MapReduce - Multiple Input/Output Paths

When making the jar for my MapReduce job, I am using the hadoop-local command. I wanted to know whether, instead of specifically specifying the path of each file in my input folder to be used in the MapReduce job, I could just pass all the files from my input folder. This is because the contents and number of files could change due to the nature of the MapReduce job I am trying to configure, and as I do not know the specific number of files, only their contents, is there a way to pass all files from the input folder into my MapReduce program and then iterate over each file to compute a certain function, which would then send the results to the Reducer? I am only using one Map/Reduce program, and I am coding in Java. I am able to use the hadoop-moonshot command, but I am working with hadoop-local at the moment.
Thanks.
You don't have to pass each individual file as input to a MapReduce job.
The FileInputFormat class already provides an API to accept a list of multiple files as input to a MapReduce program.
public static void setInputPaths(Job job, Path... inputPaths) throws IOException
Set the array of Paths as the list of inputs for the map-reduce job.
Parameters:
job - the job to modify
inputPaths - the Paths of the input directories/files for the map-reduce job
Example code from Apache tutorial
Job job = Job.getInstance(conf, "word count");
FileInputFormat.addInputPath(job, new Path(args[0]));
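For instance, a rough sketch of passing a whole input folder, or several paths at once (the paths shown are just placeholders):
Job job = Job.getInstance(conf, "word count");
// passing a directory: every file inside it becomes input to the job
FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));
// or set several files/directories in a single call
FileInputFormat.setInputPaths(job, new Path(args[0]), new Path(args[1]));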
MultipleInputs provides the API below.
public static void addInputPath(Job job,
Path path,
Class<? extends InputFormat> inputFormatClass,
Class<? extends Mapper> mapperClass)
Add a Path with a custom InputFormat and Mapper to the list of inputs for the map-reduce job.
Related SE question:
Can hadoop take input from multiple directories and files
Refer to MultipleOutputs API regarding your second query on multiple output paths.
FileOutputFormat.setOutputPath(job, outDir);
// Defines additional single text based output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
LongWritable.class, Text.class);
// Defines additional sequence-file based output 'seq' for the job
MultipleOutputs.addNamedOutput(job, "seq",
SequenceFileOutputFormat.class,
LongWritable.class, Text.class);
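Once the named outputs are defined, a reducer (or mapper) can write to them through a MultipleOutputs instance. A rough sketch follows, using a hypothetical reducer whose key/value types match the definitions above:
public static class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
  private MultipleOutputs<LongWritable, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<LongWritable, Text>(context);
  }

  @Override
  protected void reduce(LongWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      mos.write("text", key, value);  // goes to the 'text' named output
      mos.write("seq", key, value);   // goes to the 'seq' named output
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();  // flush and close the additional outputs
  }
}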
Have a look at related SE questions regarding multiple output files.
Writing to multiple folders in hadoop?
hadoop method to send output to multiple directories

Java File - Append pre-defined filename to a user-defined directory

I have a program that creates multiple output files e.g. daily_results.txt, overall_results.txt etc.
I want to allow the user to specify the directory that these files will be saved to using JFileChooser.
So say the user selected "C:\temp\" as the directory they want their output saved to. What is the best way to append daily_results.txt to that File object? Is there a more elegant way to do this other than:
File file = new File(userDirectory.getPath() + "daily_results.txt");
Any ideas?
Apologies!
I think this can quite easily be accomplished with JFileChooser's setSelectedFile method.
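For example, a minimal sketch of that idea, combining a directory-only chooser with the two-argument File(parent, child) constructor (all names here are just illustrative):
JFileChooser chooser = new JFileChooser();
chooser.setFileSelectionMode(JFileChooser.DIRECTORIES_ONLY);
if (chooser.showSaveDialog(null) == JFileChooser.APPROVE_OPTION) {
  File userDirectory = chooser.getSelectedFile();
  // File(parent, child) inserts the path separator for you,
  // so there is no need to concatenate strings by hand
  File dailyResults = new File(userDirectory, "daily_results.txt");
}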

Multiple directories as Input format in hadoop map reduce

I am trying to run a graph verifier app on a distributed system using Hadoop.
I have the input in the following format:
Directory1
---file1.dot
---file2.dot
…..
---filen.dot
Directory2
---file1.dot
---file2.dot
…..
---filen.dot
Directory670
---file1.dot
---file2.dot
…..
---filen.dot
The .dot files are files storing the graphs.
Is it enough for me to add the input directories' paths using FileInputFormat.addInputPath()?
I want Hadoop to process the contents of each directory on the same node, because the files present in each directory contain data that depends on the presence of the other files of the same directory.
Will the Hadoop framework take care of distributing the directories equally to the various nodes of the cluster (e.g. directory1 to node1, directory2 to node2, ... and so on) and process them in parallel?
The files in each directory are dependent on each other for data. To be precise, each directory contains a file, main.dot, which holds an acyclic graph whose vertices are the names of the rest of the files. My verifier will traverse each vertex of the graph in main.dot, search for the file of the same name in the same directory and, if found, process the data in that file. Similarly, all the files will be processed, and the combined output after processing each file in the directory is displayed; the same procedure goes for the rest of the directories.
To cut a long story short:
As in the famous word count application (if the input is a single book), Hadoop will split the input and distribute the task to each node in the cluster, where the mapper processes each line and counts the relevant words.
How can I split the task here (do I need to split it at all, by the way)?
How can I leverage Hadoop's power for this scenario? Some sample code template would help for sure :)
The solution given by Alexey Shestakov will work, but it does not leverage MapReduce's distributed processing framework: probably only one map process will read the file (the file containing the paths of all input files) and then process the input data.
How can we allocate all the files in a directory to a mapper, so that the number of mappers equals the number of directories?
One solution could be using the "org.apache.hadoop.mapred.lib.MultipleInputs" class.
Use MultipleInputs.addInputPath() to add the directories and a map class for each directory path. Now each mapper can get one directory and process all the files within it.
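A rough sketch of that idea with the new-API MultipleInputs (the directory paths and the GraphMapper class are placeholders, and the same mapper is reused here for every directory):
Job job = Job.getInstance(conf, "graph verifier");
// one call per input directory; a different mapper class per directory is also possible
MultipleInputs.addInputPath(job, new Path("/input/Directory1"),
    TextInputFormat.class, GraphMapper.class);
MultipleInputs.addInputPath(job, new Path("/input/Directory2"),
    TextInputFormat.class, GraphMapper.class);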
You can create a file with a list of all the directories to process:
/path/to/directory1
/path/to/directory2
/path/to/directory3
Each mapper would process one directory, for example:
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    for (FileStatus status : fs.listStatus(new Path(value.toString()))) {
        // process file
    }
}
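To get exactly one mapper per line (i.e. per directory path), one option is to use NLineInputFormat in the driver; a hedged sketch, where dirs.txt is the assumed name of the list file:
Job job = Job.getInstance(conf, "process directories");
// each input split holds one line of dirs.txt, so each mapper
// receives exactly one directory path as its value
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 1);
FileInputFormat.addInputPath(job, new Path("/path/to/dirs.txt"));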
Will the Hadoop framework take care of distributing the directories equally to the various nodes of the cluster (e.g. directory1 to node1, directory2 to node2, ... and so on) and process them in parallel?
No, it won't. Files are not distributed to each node in the sense that the files are copied to the node to be processed. Instead, to put it simply, each node is given a set of file paths to process with no guarantee on location or data locality. The datanode then pulls that file from HDFS and processes it.
There's no reason why you can't just open other files you may need directly from HDFS.

How to put a serialized object into the Hadoop DFS and get it back inside the map function?

I'm new to Hadoop and recently I was asked to do a test project using Hadoop.
So while I was reading Big Data, I happened to learn about Pail. Now what I want to do is something like this: first create a simple object, then serialize it using Thrift and put it into HDFS using Pail. Then I want to get that object inside the map function and do whatever I want with it. But I have no idea how to get that object inside the map function.
Can someone please tell me of any references or explain how to do that?
I can think of three options:
Use the -files option and name the file in HDFS (preferable as the task tracker will download the file once for all jobs running on that node)
Use the DistributedCache (similar logic to the above), but you configure the file via some API calls rather than through the command line
Load the file directly from HDFS (less efficient as you're pulling the file over HDFS for each task)
As for some code, put the load logic into your mapper's setup(...) or configure(..) method (depending on whether you're using the new or old API) as follows:
protected void setup(Context context) throws IOException {
    // the -files option makes the named file available in the task's local working directory
    File file = new File("filename.dat");
    // open file and load contents ...

    // or load the file directly from HDFS
    FileSystem fs = FileSystem.get(context.getConfiguration());
    InputStream hdfsInputStream = fs.open(new Path("/path/to/file/in/hdfs/filename.dat"));
    // load file contents from stream...
}
DistributedCache has some example code in the Javadocs
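For the DistributedCache route (option 2), a rough new-API sketch; the HDFS path is just a placeholder, and the checked exceptions are left to the caller:
// driver side: register an HDFS file with the distributed cache
job.addCacheFile(new URI("/path/to/file/in/hdfs/filename.dat"));

// mapper side, in setup(Context context): list the registered cache files
URI[] cacheFiles = context.getCacheFiles();
// the localized copy is normally available in the task's working
// directory under the file's name, much like with the -files option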
