Hadoop MapReduce multiple input files - Java

So I need two files as input to my MapReduce program: City.dat and Country.dat.
In my main method I'm parsing the command line arguments like this:
Path cityInputPath = new Path(args[0]);
Path countryInputPath = new Path(args[1]);
Path outputPath = new Path(args[2]);
MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
If I now run my program with the following command:
hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output
I get the following error:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /user/cloudera/capital/input/Country.dat already exists
Why does it treat this as my output directory? I specified a different directory as the output directory. Can somebody explain this?

Based on the stacktrace, your output directory is not empty. So the simplest thing is actually to delete it before running the job:
bin/hadoop fs -rmr /user/cloudera/capital/output
Besides that, your arguments start with the class name of your main class, org.myorg.Capital, so that is the argument at the zeroth index (based on the stacktrace and the code you have provided).
Basically you need to shift all your indices one to the right:
Path cityInputPath = new Path(args[1]);
Path countryInputPath = new Path(args[2]);
Path outputPath = new Path(args[3]);
MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
Don't forget to clear your output folder though!
Also a small tip for you: you can separate the files with a comma ",", so you can set them with a single call like this:
hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat,/user/cloudera/capital/input/Country.dat
And in your java code:
FileInputFormat.addInputPaths(job, args[1]);

What is happening here is that the class name is deemed to be the first argument!
By default, the first non-option argument is the name of the class to be invoked. A fully-qualified class name should be used. If the -jar option is specified, the first non-option argument is the name of a JAR archive containing class and resource files for the application, with the startup class indicated by the Main-Class manifest header.
So what I would suggest is that you add a manifest file to your jar in which you specify the main class. Your MANIFEST.MF file may look like:
Manifest-Version: 1.0
Main-Class: org.myorg.Capital
And now your command would look like:
hadoop jar capital.jar /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output
You can certainly just change the index values used in your code, but that's not an advisable solution.

Can you try this:
hadoop jar capital.jar /user/cloudera/capital/input /user/cloudera/capital/output
This should read all files in the single input directory.

Related

moving File leads to file with path separator "/" in filename

I try to move a file from one directory to another.
I do this with
File fileToMove = new File("/Users/kai-dj/separator_problem/from/file_to_move.file");
File destDir = new File("/Users/kai-dj/separator_problem/to");
if (fileToMove.exists() && destDir.isDirectory()) {
    fileToMove.renameTo(new File(destDir.getAbsolutePath() + File.pathSeparator + fileToMove.getName()));
}
I'd expect to find file_to_move.file in folder /Users/kai-dj/separator_problem/to after execution, but I get a file named to/file_to_move.file placed in the parent folder /Users/kai-dj/separator_problem. At least that's what Finder shows.
As I thought "file names mustn't contain path separator characters, this can't be true", I also checked what ls would output in the terminal:
mac-book:separator_problem kai-dj$ ls
from to:file_to_move.file
to
OK, so it seems there is no / in the file name. Very strange nonetheless.
Why does Finder show it as file name containing /?
Why does Java rename the file to <dirname>:<filename>, especially when I used File.pathSeparator, not / and certainly not :?
I also tried with Files.move – same result.
EDIT: Solved, but I'd still love to know, why Finder shows : as / ^^
As mentioned in the comment above, the correct member to use is called File.separator. File.pathSeparator is the delimiter between entries of a path list such as the classpath, which is ":" on Unix-like systems, and macOS allows ":" in file names at the POSIX level. As for the EDIT: Finder displays a ":" in a file name as "/" (and vice versa), a convention left over from classic Mac OS, where ":" was the path separator.
Also, you can avoid using File.separator in general, and use Paths instead:
System.out.println(Paths.get("/Users/kai-dj/separator_problem/to", fileToMove.getName()).toAbsolutePath());
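For instance, here is a self-contained sketch of the move with java.nio.file, using temp directories as stand-ins for the from/to folders above:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class MoveFile {
    public static void main(String[] args) throws IOException {
        // stand-ins for .../separator_problem/from and .../separator_problem/to
        Path fromDir = Files.createTempDirectory("from");
        Path destDir = Files.createTempDirectory("to");
        Path fileToMove = Files.createFile(fromDir.resolve("file_to_move.file"));

        // resolve() joins directory and file name with the platform's name
        // separator, so no manual concatenation with File.separator is needed
        Path target = destDir.resolve(fileToMove.getFileName());
        Files.move(fileToMove, target, StandardCopyOption.REPLACE_EXISTING);

        System.out.println(Files.exists(target));        // true
        System.out.println(Files.notExists(fileToMove)); // true
    }
}
```

Unlike File.renameTo, Files.move also throws a descriptive exception instead of silently returning false when the move fails.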

Inputting file name through main function argument (args[0])

EDIT: To run my code I am using "java filename.java input1.txt". Is this correct?
I am creating a program where I have to tokenize a string into separate words, and that string is in a text file. I have to specify the text file name in the terminal through command line arguments (args[0], etc.). I am able to scan and print the content of the text file if I specify it through a path, but when I try to do it using args[0] it doesn't seem to work. I am using NetBeans. I will attach my section of code here:
public static void main(String[] args) {
    try {
        File f = new File(args[0]);
        // using a path instead works:
        // File f = new File("NetBeansProjects/SentenceUtils/src/input1.txt");
        Scanner input = new Scanner(new FileInputStream(f));
        while (input.hasNext()) {
            String s = input.next();
            System.out.println(s);
        }
    } catch (FileNotFoundException fnfe) {
        System.out.println("File not found");
    }
    SentenceUtils s = new SentenceUtils();
}
java filename.java input1.txt
is not correct for running a Java program; you need to compile the *.java file to get a *.class file, which you can then run like:
java filename input1.txt
assuming your class is in the default package and you are running the command in the output directory of your compile command, or else using the fully qualified class name, i.e. including the package name. For example, if your class is in the package foo/bar/baz (subfolders in your source folder) and has the package declaration package foo.bar.baz;, then you need to specify your class like this:
java [-cp your-classpath] foo.bar.baz.filename input1.txt
For input1.txt to be found, it has to be in the same directory where you run the command.
your-classpath is a list of directories or archives, separated by a system-dependent delimiter (; for Windows, : for Linux, ...), which the java command uses to look up the specified class and its dependencies.
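As a quick sanity check, both the delimiter and the classpath the running JVM actually resolved are available at runtime; a small sketch:

```java
import java.io.File;

public class ClasspathInfo {
    public static void main(String[] args) {
        // ";" on Windows, ":" on Linux and macOS
        System.out.println("delimiter: " + File.pathSeparator);
        // the classpath the JVM resolved from -cp, CLASSPATH, or -jar
        System.out.println("classpath: " + System.getProperty("java.class.path"));
    }
}
```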
NetBeansProjects/SentenceUtils/src/input1.txt is a relative path.
File f = new File("NetBeansProjects/SentenceUtils/src/input1.txt");
If this works, then it means that the current working directory (i.e. the directory against which all relative paths are resolved) is the directory containing NetBeansProjects.
You get FileNotFoundException because input1.txt is then expected to be directly in that working directory.
To find out which is the current working directory for your running program you can add the following statement:
System.out.println(new File("").getAbsolutePath());
Place input1.txt in that directory and it will be found.
Alternatively, you can pass the absolute path of your input file. An absolute path locates your file regardless of which directory your program is run from. For example:
java -cp <your-classpath> <fully-qualified-name-of-class> /home/john/myfiles/myprogects/...../input1.txt
To sum up, what you need to know/do is the following:
the location of your program class and its package (filename)
the location of your input file (input1.txt)
pass the correct argument accordingly

Hadoop Map Reduce let addInputPath work with spesific file name

Hey, this is more of a Java question, but it is related to Hadoop.
I have this line of code in my MapReduce Java job:
JobConf conf= new JobConf(WordCount.class);
conf.setJobName("Word Count");
.............
.............
.............
FileInputFormat.addInputPath(conf, new Path(args[0]));
Instead of "giving" a directory with many files, how do I set a specific file name?
From the book "Hadoop: The Definitive Guide":
An input path is specified by calling the static addInputPath() method
on FileInputFormat, and it can be a single file, a directory (in which
case the input forms all the files in that directory), or a file
pattern. As the name suggests, addInputPath() can be called more than
once to use input from multiple paths.
So to answer your question: you should be able to just pass the path to your specific single file, and it will be used as the only input (as long as you do not make further calls to addInputPath() with other paths).
If you only want to run map-reduce on one file, a quick and easy workaround is to move that file into a folder by itself and then provide that folder's path to addInputPath.
If you're trying to read a whole file per map task then might I suggest taking a look at this post:
Reading file as single record in hadoop
What exactly are you trying to do?
I would have posted this as a comment, but I don't have sufficient privileges, apparently...

Java binary not picking up resources from the specified class path

I am trying to load a text file as an InputStream, but the file is never picked up and the InputStream is always null. I don't know why this is happening and I would like to request assistance.
nohup java -jar crawler-jar-2014.11.0-SNAPSHOT.jar -classpath /home/nbsxlwa/crawler/resources/common-conf:/home/nbsxlwa/crawler/resources/dotcom-conf:./plugins/*:./lib/* &
The txt file is located in /home/nbsxlwa/crawler/resources/dotcom-conf directory. I can confirm that the file does exist, so I don't know why the file is not being picked up. The setup is given below:
System.getProperty("java.class.path") returns the following value:
crawler-jar-2014.11.0-SNAPSHOT.jar
The following code block tries to create an input stream from the text file:
String fileRules = conf.get(URLFILTER_REGEX_FILE);
System.out.println("file rules = " + fileRules);
// Pass the file as a resource in classpath.
// return conf.getConfResourceAsReader(fileRules);
// Pass the file as a resource in classpath.
InputStream is = RegexURLFilter.class.getResourceAsStream("/" + fileRules);
System.out.println("Inputstream is = " + is);
System.out.println(ClassLoader.getSystemResourceAsStream(fileRules));
The output to the above snippet is
file rules = regex-urlfilter.txt
Inputstream is = null
null
I tried adding the classpath folders to the Class-Path entry in MANIFEST.MF. MANIFEST.MF contains the following entries, but the output still returned null:
Class-Path: resources/common-conf resources/dotcom-conf
resources/olb-conf lib/gora-cassandra-0.3.jar ** OTHER JARS**
Note the java man page entry for -jar
When you use this option, the JAR file is the source of all user
classes, and other user class path settings are ignored.
So your other classpath entries are ignored. You'll have to use the plain old java command, specifying a class with a main method and including your .jar in the classpath.
The relative paths you've specified in the MANIFEST are relative to the JAR itself or are complete URLs.
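To see the resource lookup succeed once a directory really is on the classpath, here is a self-contained sketch: it builds a throwaway directory and loads it through a URLClassLoader, which mimics putting resources/dotcom-conf on -cp (the directory and file names are stand-ins):

```java
import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

public class ResourceLookup {
    public static void main(String[] args) throws Exception {
        // stand-in for resources/dotcom-conf containing regex-urlfilter.txt
        Path confDir = Files.createTempDirectory("dotcom-conf");
        Files.write(confDir.resolve("regex-urlfilter.txt"), "+.".getBytes());

        // equivalent of putting the directory on the classpath with -cp
        try (URLClassLoader cl = new URLClassLoader(
                new URL[] { confDir.toUri().toURL() })) {
            InputStream is = cl.getResourceAsStream("regex-urlfilter.txt");
            System.out.println(is != null); // true: the resource is found
        }
    }
}
```

With -jar, no such extra entries ever reach the class loader, which is why the lookup in the question returns null.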

Hadoop : Provide directory as input to MapReduce job

I'm using Cloudera Hadoop. I'm able to run simple mapreduce program where I provide a file as input to MapReduce program.
This file contains all the other files to be processed by mapper function.
But, I'm stuck at one point.
/folder1
- file1.txt
- file2.txt
- file3.txt
How can I specify the input path to MapReduce program as "/folder1", so that it can start processing each file inside that directory ?
Any ideas ?
EDIT :
1) Initially, I provided inputFile.txt as input to the MapReduce program. It was working perfectly.
>inputFile.txt
file1.txt
file2.txt
file3.txt
2) But now, instead of giving an input file, I want to provide an input directory as arg[0] on the command line.
hadoop jar ABC.jar /folder1 /output
The problem is that FileInputFormat doesn't read files recursively in the input path directory.
Solution: add the following line before addInputPath in your MapReduce code:
FileInputFormat.setInputDirRecursive(job, true);
FileInputFormat.addInputPath(job, new Path(args[0]));
You can check here for which version it was fixed.
You could use FileSystem.listStatus to get the file list from the given dir; the code could be as below:
// get the FileSystem; you will need to initialize it properly
FileSystem fs = FileSystem.get(conf);
// get the FileStatus list from the given dir
FileStatus[] status_list = fs.listStatus(new Path(args[0]));
if (status_list != null) {
    for (FileStatus status : status_list) {
        // add each file to the list of inputs for the map-reduce job
        FileInputFormat.addInputPath(conf, status.getPath());
    }
}
You can use HDFS wildcards in order to provide multiple files, so the solution is:
hadoop jar ABC.jar /folder1/* /output
or
hadoop jar ABC.jar /folder1/*.txt /output
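HDFS glob patterns (expanded by FileInputFormat, not your shell) match much like java.nio file-system globs; as a purely local illustration of what /folder1/*.txt would select (the file names are stand-ins):

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class GlobDemo {
    public static void main(String[] args) throws Exception {
        // stand-in for /folder1 with three .txt files and one other file
        Path dir = Files.createTempDirectory("folder1");
        for (String name : new String[] { "file1.txt", "file2.txt", "file3.txt", "notes.dat" }) {
            Files.createFile(dir.resolve(name));
        }
        int matched = 0;
        // "*.txt" is the same glob syntax as in the command above
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path ignored : stream) {
                matched++;
            }
        }
        System.out.println(matched); // 3
    }
}
```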
Use the MultipleInputs class:
MultipleInputs.addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass, Class<? extends Mapper> mapperClass)
Have a look at working code
