File id in Hadoop - java

I want to store some information about the files being processed from HDFS. What would be the most suitable way, in a Java program, to read the file location and byte offset of a file stored in HDFS?
Is there a concept of a unique file id associated with each file stored in Hadoop 1?
If yes, then how can it be fetched in a MapReduce program?

As per my understanding:
You can use the org.apache.hadoop.fs.FileSystem class for all your needs.
1. You can uniquely identify each file by its URI, or you can use getFileChecksum(Path path).
2. You can get all block locations of a file with getFileBlockLocations(FileStatus file, long start, long len).
TextInputFormat gives, as the key, the byte offset of the record's starting location within the file, which is not the same as the file's offset on HDFS.
There are many other methods available in org.apache.hadoop.fs.FileSystem; please go through its API for a better understanding.
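For illustration, a minimal sketch of how those FileSystem calls fit together (the path is just an example, and getFileChecksum may return null on some file systems):

import java.net.URI;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileInfo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);                    // e.g. /user/me/input/data.txt

    FileStatus status = fs.getFileStatus(path);

    // 1. Identify the file by its fully qualified URI, or by its checksum.
    URI uri = fs.makeQualified(path).toUri();
    FileChecksum checksum = fs.getFileChecksum(path); // may be null on some file systems
    System.out.println("URI:      " + uri);
    System.out.println("Checksum: " + checksum);

    // 2. Block locations: offset, length and the hosts holding each block.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
  }
}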
Hope it helps.

According to "Hadoop: The Definitive Guide", the TextInputFormat input format gives the byte offset within the file as the key.
For the filename you can look into these (a small sketch combining both follows after the links):
Mapper input Key-Value pair in Hadoop
How can to get the filename from a streaming mapreduce job in R?
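As a minimal sketch of both points combined (new mapreduce API assumed), a mapper can pair the byte-offset key supplied by TextInputFormat with the filename taken from its input split:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class OffsetAndFileMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable byteOffset, Text line, Context context)
      throws IOException, InterruptedException {
    // With TextInputFormat the key is the byte offset of this line within its file;
    // the filename itself comes from the input split, not from the key/value pair.
    String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
    context.write(new Text(filename), new Text(byteOffset.get() + "\t" + line.toString()));
  }
}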

Related

HDFS File System - How to get the byte count for a specific file in a directory

I am trying to get the byte count for a specific file in an HDFS directory.
I tried to use fs.getFileStatus(), but I do not see any method for getting the byte count of the file; I can only see the getBlockSize() method.
Is there any way I can get the byte count of a specific file in HDFS?
fs.getFileStatus() returns a FileStatus object, which has a getLen() method that returns the "length of this file, in bytes." Maybe you should have a closer look at this: https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileStatus.html.
BUT be aware that the file size is not that important on HDFS. Files are organized in so-called data blocks; each data block is 64 MB by default in Hadoop 1 (128 MB in Hadoop 2). So if you deal with many small files (which is one big anti-pattern on HDFS) you may have less capacity than you expect. See this link for more details:
https://hadoop.apache.org/docs/r2.6.1/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Data_Blocks
We need to use fs.getFileStatus(path).getLen() to get the file byte count.
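A minimal sketch of that call (the path is taken from the command line here):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ByteCount {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // getLen() returns the length of the file in bytes, regardless of block size.
    FileStatus status = fs.getFileStatus(new Path(args[0]));
    System.out.println(status.getPath() + " is " + status.getLen() + " bytes");
  }
}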

Hadoop MapReduce: Read a file and use it as input to filter other files

I would like to write a hadoop application which takes as input a file and an input folder which contains several files. The single file contains keys whose records need to be selected and extracted out of the other files in the folder. How can I achieve this?
By the way, I have a running hadoop mapreduce application which takes as input a path to a folder, does the processing and writes out the result into a different folder.
I am kind of stuck with how to use a file to get the keys that need to be selected and extracted out of other files in a specific directory. The file containing the keys is big, so it cannot fit into main memory directly. How can I do it?
Thx!
If the number of keys is too large to fit in memory, then consider loading the key set into a bloom filter (of suitable size to yield a low false positive rate) and then process the files, checking each key for membership in the bloom filter (Hadoop comes with a BloomFilter class, check the Javadocs).
You'll also need to perform a second MR Job to do a final validation (most probably in a reduce side join) to eliminate the false positives output from the first job.
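A minimal sketch of the Bloom-filter idea above, assuming the key file has been shipped to each task (for example via the distributed cache) under the name keys.txt with one key per line, and that the join key is the first tab-separated field of every record; the filter sizing numbers are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  // Placeholder sizing: choose vectorSize/nbHash to hit the false-positive rate you need.
  private final BloomFilter filter = new BloomFilter(10000000, 5, Hash.MURMUR_HASH);

  @Override
  protected void setup(Context context) throws IOException {
    // keys.txt is assumed to have been shipped to the task, e.g. via the distributed cache.
    BufferedReader reader = new BufferedReader(new FileReader("keys.txt"));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        filter.add(new Key(line.trim().getBytes("UTF-8")));
      }
    } finally {
      reader.close();
    }
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    // Assumption: the join key is the first tab-separated field of each record.
    String candidate = record.toString().split("\t", 2)[0];
    if (filter.membershipTest(new Key(candidate.getBytes("UTF-8")))) {
      context.write(record, NullWritable.get()); // may still contain false positives
    }
  }
}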
I would read the single file first, before running the job, and store all the needed keys in the job configuration. You can then write a job that reads the files from the folder. In your mapper/reducer setup(context) method, read the keys out of the configuration and store them globally, so that you can access them during map or reduce.

How do I output whole files from a map job?

This is a basic question about mapreduce outputs.
I'm trying to create a map function that takes in an XML file and makes a PDF using Apache FOP. However, I'm a little confused as to how to output it, since I know that it goes out as a (key, value) pair.
I'm also not using streaming to do this.
The point of map-reduce is to tackle large amounts of data that would usually not fit in memory - so input and output would usually be stored on disks somehow (a.k.a. files).
Input-output must be specified in key-value format
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
I have not tried this, but this is what I would do:
Write the output of the mapper in this form: the key is the filename as Text (keep filenames unique) and the value is the output of FOP. Write it using TextOutputFormat (a rough sketch follows below).
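A rough sketch only of what such a mapper could look like. It assumes a whole-file input format that delivers (file name, full XML document) pairs, holds the value as BytesWritable rather than Text because PDF content is binary, and renderPdfWithFop() is a hypothetical placeholder for the actual Apache FOP call:

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class XmlToPdfMapper extends Mapper<Text, Text, Text, BytesWritable> {

  @Override
  protected void map(Text fileName, Text xmlDocument, Context context)
      throws IOException, InterruptedException {
    // Derive a unique output name from the input file name.
    String pdfName = fileName.toString().replaceAll("\\.xml$", ".pdf");
    byte[] pdfBytes = renderPdfWithFop(xmlDocument.toString()); // hypothetical helper
    context.write(new Text(pdfName), new BytesWritable(pdfBytes));
  }

  private byte[] renderPdfWithFop(String xml) {
    // Placeholder: run the XML through Apache FOP and return the resulting PDF bytes.
    throw new UnsupportedOperationException("FOP rendering not shown in this sketch");
  }
}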
Suggestion:
I am assuming that your use case is just reading input XML (maybe doing some operation on its data) and writing the data to PDF files using FOP. I don't think this is a Hadoop use case in the first place, because whatever you want to do can be done by a batch script. How big are your XML files? How many XML files do you have to process?
EDIT:
SequenceFileOutputFormat writes to a SequenceFile. A SequenceFile has its own headers and other metadata along with the stored data, and it stores the data as key/value pairs.
SequenceFile Common Header
version - A byte array: 3 bytes of magic header 'SEQ', followed by 1 byte of actual version no. (e.g. SEQ4 or SEQ6)
keyClassName - String
valueClassName - String
compression - A boolean which specifies if compression is turned on for keys/values in this file.
blockCompression - A boolean which specifies if block compression is turned on for keys/values in this file.
compressor class - The classname of the CompressionCodec which is used to compress/decompress keys and/or values in this SequenceFile (if compression is enabled).
metadata - SequenceFile.Metadata for this file (key/value pairs)
sync - A sync marker to denote end of the header.
Using a SequenceFile will ruin your application, as you will end up with corrupted output PDF files. Try it out and see for yourself.
You have lots of input files, and this is where hadoop sucks (read this). Still, I feel that you can do your desired operation using a script to invoke FOP on every document, one by one. If you have multiple nodes, run the same script on different subsets of the input documents. Trust me, this will run FASTER than Hadoop, considering the overhead involved in creating maps and reduces (you don't need reduces, I know).

Pass file location as value to hadoop mapper?

Is it possible to pass the locations of files in HDFS as values to my mapper so that I can run an executable on them to process them?
Yes, you can create a file containing the file names in HDFS and use it as the input for the map/reduce job. You will need to create a custom splitter in order to serve several file names to each mapper; by default your input file will be split by blocks, and probably the whole file list will be passed to one mapper.
Another solution is to define your input as not splittable. In this case each file will be passed whole to a mapper, and you are free to create your own InputFormat which uses whatever logic you need to process the file - for example, calling an external executable. If you go this way, the Hadoop framework will take care of data locality.
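A minimal sketch of the "not splittable" idea, assuming plain text input (each file then goes to a single mapper in one piece); if you instead want several file names per mapper, NLineInputFormat is also worth a look:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // never split, so one mapper sees the whole file
  }
}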
Another way of approaching this is to obtain the file name through FileSplit; this can be done using the following code:
// inside map() or setup(): the split this mapper is processing
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String filename = fileSplit.getPath().getName();
Hope this helps

hadoop, map/reduce output file(part-00000) and distributed cache

The value output from my map/reduce is a BytesWritable array, which is written to the output file part-00000 (Hadoop does so by default). I need this array for my next map function, so I wanted to keep it in the distributed cache. Can somebody tell me how I can read from the output file (part-00000), which may not be a text file, and store it in the distributed cache?
My suggestion:
Create a new Hadoop job with the following properties:
Input the directory with all the part-... files.
Create a custom OutputFormat class that writes to your distributed cache.
Now set up your job so that it looks essentially like this:
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setMapperClass(IdentityMapper.class);
conf.setReducerClass(IdentityReducer.class);
conf.setOutputFormat(DistributedCacheOutputFormat.class);
Have a look at the Yahoo Hadoop tutorial because it has some examples on this point: http://developer.yahoo.com/hadoop/tutorial/module5.html#outputformat
HTH
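As an additional note: if the first job wrote its output with SequenceFileOutputFormat, the part-00000 file can also be read directly with SequenceFile.Reader. A minimal sketch (the path is passed on the command line here):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class ReadPartFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]); // e.g. the job's output part-00000

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      // Instantiate key/value objects of whatever Writable types the file declares.
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}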
