I'm trying to figure out the various ways to write content/files to the HDFS in a Hadoop cluster.
I know there are org.apache.hadoop.fs.FileSystem.get() and org.apache.hadoop.fs.FileSystem.getLocal() to create an output stream and write byte by byte. But if you are making use of OutputCollector.collect(), that doesn't seem like the intended way to write to HDFS. I believe you have to use OutputCollector.collect() when implementing Mappers and Reducers; correct me if I'm wrong.
I know you can call FileOutputFormat.setOutputPath() before even running the job, but it looks like it only accepts objects of type Path.
When looking at org.apache.hadoop.fs.Path and the Path class, I do not see anything which allows you to specify remote or local. And when looking at org.apache.hadoop.fs.FileSystem, I do not see anything which returns an object of type Path.
Does FileOutputFormat.setOutputPath() always have to write to the local file system? I don't think this is true; I vaguely remember reading that a job's output can be used as another job's input. This leads me to believe there is also a way to set this to HDFS.
Is using a data stream as described the only way to write to HDFS?
org.apache.hadoop.fs.FileSystem.get() and org.apache.hadoop.fs.FileSystem.getLocal() return a FileSystem object, which is a generic abstraction that can be backed by either the local file system or a distributed file system.

OutputCollector doesn't write to HDFS itself. It just provides a collect() method for mappers and reducers to emit their output (both intermediate and final). By the way, it is deprecated in favor of the Context object.

FileOutputFormat.setOutputPath() sets the final output directory by setting mapred.output.dir, which can be on your local file system or a distributed one. About remote or local: fs.default.name sets that. If you set it to file:/// it will use the local file system; if set to hdfs:// it will use HDFS, and so on.

And about writing to HDFS: whatever method you use to write files in Hadoop, it will use FSDataOutputStream underneath. FSDataOutputStream is a wrapper around java.io.OutputStream. Whenever you want to write to a file system in Java, you have to create a stream object for it.

FileOutputFormat has the method FileOutputFormat.setOutputPath(job, output_path), where in place of output_path you can specify whether you want to use the local file system or HDFS, overriding the settings of core-site.xml. For example, FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/path_to_file")) will set the output to be written to HDFS; change it to file:/// and you can write to the local file system. Change localhost and the port number as per your settings. In the same way, the input can also be overridden per job.
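To make the stream part concrete, here is a minimal sketch of writing a file to HDFS through FSDataOutputStream; the host, port, and output path are placeholders, so adjust them to your cluster:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // fs.default.name decides which FileSystem implementation get() returns:
        // "hdfs://host:port" gives HDFS, "file:///" gives the local file system.
        conf.set("fs.default.name", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);

        // Every write in Hadoop goes through an FSDataOutputStream underneath.
        Path outFile = new Path("/user/hadoop/example.txt"); // placeholder path
        FSDataOutputStream out = fs.create(outFile, true);   // true = overwrite if present
        out.writeUTF("written through FSDataOutputStream");
        out.close();

        fs.close();
    }
}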
Related
I have a pig script that generates some output to a HDFS directory. The pig script also generates a SUCCESS file in the same HDFS directory. The output of the pig script is split into multiple parts as the number of reducers to use in the script is defined via 'SET default_parallel n;'
I would now like to use Java to concatenate/merge all the file parts into a single file. I obviously want to ignore the SUCCESS file while concatenating. How can I do this in Java?
Thanks in advance.
You can use getmerge through a shell command to merge multiple files into a single file.
Usage: hdfs dfs -getmerge <srcdir> <destinationdir/file.txt>
Example: hdfs dfs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
In case you don't want to use a shell command to do it, you can write a Java program and use the FileUtil.copyMerge method to merge the output files into a single file. The implementation details are available in this link.
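For completeness, a minimal Java sketch using FileUtil.copyMerge (available up to Hadoop 2.x); the paths are placeholders, and since the SUCCESS marker is an empty file it adds no content anyway, but it is deleted here first to be explicit about skipping it:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergePigOutput {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        Path srcDir  = new Path("/output/dir/on/hdfs");  // directory with the part-* files
        Path dstFile = new Path("/output/merged.txt");   // single merged file

        // The _SUCCESS marker is empty, but delete it so only part files remain.
        hdfs.delete(new Path(srcDir, "_SUCCESS"), false);

        // Concatenate every file left in srcDir into dstFile.
        FileUtil.copyMerge(hdfs, srcDir, hdfs, dstFile,
                false,  // keep the source part files
                conf,
                null);  // no separator string between files
    }
}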
If you want a single output on HDFS itself through Pig, then you need to pass everything through a single reducer. To do so, set the number of reducers to 1 by putting the lines below at the start of your script.
--Assigning only one reducer in order to generate only one output file.
SET default_parallel 1;
I hope this will help you.
The reason why this does not seem easy to do is that typically there would be little purpose. If I have a very large cluster, and I am really dealing with a Big Data problem, my output as a single file would probably not fit onto any single machine.
That being said, I can see use cases like metrics collection, where maybe you just want to output some metrics about your data, like counts.
In that case I would first run your MapReduce program,
then create a second map/reduce job that reads that data and funnels all the elements to the same single reducer by using a static key in your reduce function.
Or you could also just get a single output file from your original program by using a single reducer:

job.setNumReduceTasks(1);
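If you go the second-job route, here is a minimal sketch of the static-key idea (the class names are hypothetical, and the driver of that second job would typically also call job.setNumReduceTasks(1) so only one part file comes out):

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper that funnels every record to the same reducer by emitting one constant key.
public class StaticKeyMapper extends Mapper<Object, Text, Text, Text> {
    private static final Text STATIC_KEY = new Text("all");

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(STATIC_KEY, value);
    }
}

// Reducer that simply writes the values back out, now into a single file.
class StaticKeyReducer extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text v : values) {
            context.write(NullWritable.get(), v);
        }
    }
}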
I would like to write a hadoop application which takes as input a file and an input folder which contains several files. The single file contains keys whose records need to be selected and extracted out of the other files in the folder. How can I achieve this?
By the way, I have a running hadoop mapreduce application which takes as input a path to a folder, does the processing and writes out the result into a different folder.
I am kind of stuck on how to use a file to get keys that need to be selected and extracted out of other files in a specific directory. The file containing the keys is big, so it cannot fit into main memory directly. How can I do it?
Thx!
If the number of keys is too large to fit in memory, then consider loading the key set into a bloom filter (of suitable size to yield a low false positive rate) and then process the files, checking each key for membership in the bloom filter (Hadoop comes with a BloomFilter class, check the Javadocs).
You'll also need to perform a second MR Job to do a final validation (most probably in a reduce side join) to eliminate the false positives output from the first job.
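A rough sketch of that first pass using Hadoop's org.apache.hadoop.util.bloom.BloomFilter; the vector size, hash count, and one-key-per-line file format are assumptions you would tune to your own data:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class KeyFilterBuilder {

    // Build a Bloom filter from the key file on HDFS; size the bit vector and
    // hash count for your expected key count and acceptable false-positive rate.
    public static BloomFilter buildFilter(Path keyFile, Configuration conf)
            throws IOException {
        BloomFilter filter = new BloomFilter(10000000, 6, Hash.MURMUR_HASH);
        FileSystem fs = FileSystem.get(conf);
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(keyFile)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                filter.add(new Key(line.trim().getBytes("UTF-8")));
            }
        } finally {
            reader.close();
        }
        return filter;
    }

    // In the mapper, only emit records whose key passes the membership test.
    // False positives are possible, which is why a second job (e.g. a reduce-side
    // join) is needed to validate the matches.
    public static boolean probablyWanted(BloomFilter filter, String recordKey)
            throws IOException {
        return filter.membershipTest(new Key(recordKey.getBytes("UTF-8")));
    }
}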
I would read the single file first, before running the job, and store all the needed keys in the job configuration. You can then write a job that reads the files from the folder. In your mapper/reducer setup(context) method, read the keys back out of the configuration and store them in a field, so that they are available during map or reduce.
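A minimal sketch of that idea, assuming the keys actually fit in the configuration (the property name wanted.keys and the tab-separated record layout are just placeholders):

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KeyLookupMapper extends Mapper<Object, Text, Text, Text> {

    private final Set<String> wantedKeys = new HashSet<String>();

    @Override
    protected void setup(Context context) {
        // In the driver: job.getConfiguration().set("wanted.keys", "k1,k2,k3");
        String keys = context.getConfiguration().get("wanted.keys", "");
        wantedKeys.addAll(Arrays.asList(keys.split(",")));
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String recordKey = value.toString().split("\t")[0]; // assumed record layout
        if (wantedKeys.contains(recordKey)) {
            context.write(new Text(recordKey), value);
        }
    }
}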
I want to run a Hadoop unit test, using the local filesystem mode... I would ideally like to see several part-m-* files written out to disk (rather than just 1). However, since it's just a test, I don't want to process 64 MB of data (the default size is ~64 MB per block, I believe).
In distributed mode we can set this using
dfs.block.size
I am wondering whether there is a way I can get my local file system to write small part-m files out, i.e. so that my unit test will mimic the contents of large-scale data with several (albeit very small) files.
Assuming your input format can handle splittable files (see the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.isSplitable(JobContext, Path) method), you can amend the input split size to process a smaller file with multiple mappers (I'm going to assume you're using the new API mapreduce package):
For example, if you're using the TextInputFormat (or most input formats that extend FileInputFormat), you can call the static util methods:
FileInputFormat.setMaxInputSplitSize(Job, long)
FileInputFormat.setMinInputSplitSize(Job, long)
The long argument is the size of the split in bytes, so just set it to your desired size.
Under the hood, these methods set the following job configuration properties:
mapred.min.split.size
mapred.max.split.size
Final note: some input formats may override the FileInputFormat.getFormatMinSplitSize() method (which defaults to 1 byte for FileInputFormat), so be wary if you set a value and Hadoop appears to ignore it.
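For example, a Hadoop 2.x-style driver for a small map-only test job might look roughly like this (the 1 MB cap is just an illustrative value, and the input/output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallSplitTestJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small-split-test");
        job.setJarByClass(SmallSplitTestJob.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setNumReduceTasks(0);               // map-only, so each mapper writes a part-m-* file
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Cap each split at 1 MB so even a small test input produces several
        // mappers, and therefore several (small) part-m-* output files.
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, 1024L * 1024L);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}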
A final point - have you considered MRUnit http://incubator.apache.org/mrunit/ for actual 'unit' testing of your MR code?
Try doing this; it will work:
hadoop fs -D dfs.block.size=16777216 -put 25090206.P .
Is it possible to pass the locations of files in HDFS as the value to my mapper, so that I can run an executable on them to process them?
Yes, you can create a file with the file names in HDFS and use it as input for the map/reduce job. You will need to create a custom splitter in order to serve several file names to each mapper. By default your input file will be split by blocks, and probably the whole file list will be passed to one mapper.
Another solution would be to define your input as not splittable. In this case each file will be passed whole to a mapper, and you are free to create your own InputFormat which uses whatever logic you need to process the file - for example, calling an external executable. If you go this way, the Hadoop framework will take care of data locality.
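A minimal sketch of the non-splittable option (the class name is hypothetical; set it as the job's input format class):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Each input file becomes exactly one split, so every mapper sees one whole file
// and the framework can still schedule the task close to the file's blocks.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}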
Another way of approaching this is by obtaining the file name through FileSplit; this can be done using the following code:
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String filename = fileSplit.getPath().getName();
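If the goal is then to run an executable on that file, the full hdfs:// URI from fileSplit.getPath().toString() can be handed to the external program, for example with a small helper like this (the tool path is a placeholder, and this is only a sketch):

import java.io.IOException;

public class ExternalToolRunner {
    // Launch an external tool on one HDFS path and wait for it to finish.
    // The tool must understand hdfs:// URIs, or you can first copy the file to
    // local disk with FileSystem.copyToLocalFile and pass the local path instead.
    public static int runOn(String hdfsPath) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("/usr/local/bin/my_tool", hdfsPath)
                .redirectErrorStream(true)
                .start();
        return p.waitFor();
    }
}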
Hope this helps
The value output from my map/reduce is a BytesWritable array, which is written to the output file part-00000 (Hadoop does so by default). I need this array for my next map function, so I wanted to keep it in the distributed cache. Can somebody tell me how I can read from the output file (part-00000), which may not be a text file, and store it in the distributed cache?
My suggestion:
Create a new Hadoop job with the following properties:
Input the directory with all the part-... files.
Create a custom OutputFormat class that writes to your distributed cache.
Now make your job configuration look essentially like this:
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setMapperClass(IdentityMapper.class);
conf.setReducerClass(IdentityReducer.class);
conf.setOutputFormat(DistributedCacheOutputFormat.class);
Have a look at the Yahoo Hadoop tutorial because it has some examples on this point: http://developer.yahoo.com/hadoop/tutorial/module5.html#outputformat
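As a complementary sketch that is not part of the answer above: since the SequenceFileInputFormat there implies part-00000 is a SequenceFile, you could also just add that file to the DistributedCache of the next job and read it back in the mapper's configure()/setup(); the path below is a placeholder and the value type follows the question's BytesWritable:

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class CachedPartFileReader {

    // In the driver of the next job: put the previous job's part file into the cache.
    public static void addToCache(Configuration conf) throws URISyntaxException {
        DistributedCache.addCacheFile(new URI("/prev-job-output/part-00000"), conf);
    }

    // In the next job's mapper setup/configure: read the cached SequenceFile back.
    public static void readCached(Configuration conf) throws IOException {
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        SequenceFile.Reader reader =
                new SequenceFile.Reader(FileSystem.getLocal(conf), cached[0], conf);
        try {
            // Key/value classes must match whatever the first job wrote; the
            // question says the values are BytesWritable.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                // value.getBytes() / value.getLength() give the byte array back
            }
        } finally {
            reader.close();
        }
    }
}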
HTH