I want to write a MapReduce application which can process both text and zip files. For this I want to use to different input formats, one for text and another for zip. Is it possible to do so?
Extending a bit on @ChrisWhite's answer, what you need is a custom InputFormat and RecordReader that work with ZIP files. You can find a sample ZipFileInputFormat here and a sample ZipFileRecordReader here.
Given this, and as Chris suggested, you should use MultipleInputs. Here is how I would do it if you don't need a custom mapper for each type of file:
MultipleInputs.addInputPath(job, new Path("/path/to/zip"), ZipFileInputFormat.class);
MultipleInputs.addInputPath(job, new Path("/path/to/txt"), TextInputFormat.class);
Look at the API docs for MultipleInputs (old API, new API). They are not hugely self-explanatory, but you should be able to see that you call the addInputPath methods in your job configuration to register each input path (which can be a glob), its input format, and its associated mapper.
You should be able to Google for some examples; in fact, here's a SO question / answer that shows some usage.
Consider writing a custom InputFormat that checks what kind of input is being read and, based on that check, delegates to the required InputFormat.
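A rough sketch of that delegating idea, using the file extension as the check. ZipTextInputFormat is a hypothetical zip-aware format assumed to emit the same <LongWritable, Text> pairs as TextInputFormat; whichever formats you delegate to, both branches have to agree on the key/value types:

// Sketch of a delegating InputFormat that picks a reader based on the file extension.
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class TextOrZipInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    Path path = ((FileSplit) split).getPath();
    if (path.getName().toLowerCase().endsWith(".zip")) {
      // ZipTextInputFormat is hypothetical: a zip-aware format that also emits <LongWritable, Text>.
      return new ZipTextInputFormat().createRecordReader(split, context);
    }
    return new TextInputFormat().createRecordReader(split, context);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // ZIP archives cannot be read from an arbitrary offset, so only split plain text files.
    return !file.getName().toLowerCase().endsWith(".zip");
  }
}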
I am trying to run a job where each mapper 'type' receives a different input file. I know there is a way to do this in Java using the MultipleInputs class, like so:
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CounterMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, CountertwoMapper.class);
Where CounterMapper.class and CountertwoMapper.class are the respective mapper 'types'.
I am trying to achieve similar functionality with MrJob for Python or any other language that is not Java (please don't ask why!).
This image is similar to what I want to achieve.
Any help is appreciated.
I have found a way in which different mappers can be associated with a single input path. This doesn't exactly answer your question, but I hope it helps you; see the link below:
Using multiple mapper inputs in one streaming job on hadoop?
I need to read an Excel (.xls) file stored on a Hadoop cluster. I did some research and found out that I need to create a custom InputFormat for that. I read many articles, but none of them is helpful from a programming point of view. Could someone help me with sample code for writing a custom InputFormat, so that I can understand the basics of programming an InputFormat and use the Apache POI library to read the Excel file?
I have already written a MapReduce program that reads a text file. Now I need help with the following: even if I somehow manage to code my own custom InputFormat, where would that code go in relation to the MapReduce program I have already written?
PS: converting the .xls file into a .csv file is not an option.
Yes, you should create a RecordReader to read each record from your Excel document. Inside that RecordReader you should use a POI-like API to read from the Excel file. More precisely, follow these steps:
Extend FileInputFormat to create your own CustomInputFormat and override getRecordReader.
Create a CustomRecordReader by extending RecordReader; here you have to write how to generate a key/value pair from a given FileSplit.
So first read the bytes from the FileSplit, and from those buffered bytes read out the desired key and value using POI.
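To make the steps above concrete, here is a minimal sketch, assuming the new (org.apache.hadoop.mapreduce) API, where the method to override is called createRecordReader, and Apache POI's HSSF classes for .xls files. The class names ExcelInputFormat and ExcelRecordReader and the comma-joined row format are choices made for this example, not anything standard:

// Sketch: an InputFormat/RecordReader pair that emits one spreadsheet row per record.
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;

public class ExcelInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // A .xls workbook cannot be parsed from an arbitrary offset, so never split it.
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
    return new ExcelRecordReader();
  }

  public static class ExcelRecordReader extends RecordReader<LongWritable, Text> {
    private final List<String> rows = new ArrayList<String>(); // one comma-joined string per row
    private int current = -1;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
      Path path = ((FileSplit) split).getPath();
      FileSystem fs = path.getFileSystem(context.getConfiguration());
      InputStream in = fs.open(path);
      try {
        // POI loads the whole workbook into memory; fine for modestly sized .xls files.
        Sheet sheet = new HSSFWorkbook(in).getSheetAt(0);
        for (Row row : sheet) {
          StringBuilder line = new StringBuilder();
          for (Cell cell : row) {
            if (line.length() > 0) line.append(',');
            line.append(cell.toString());
          }
          rows.add(line.toString());
        }
      } finally {
        in.close();
      }
    }

    @Override
    public boolean nextKeyValue() {
      current++;
      if (current >= rows.size()) return false;
      key.set(current);             // key = row index
      value.set(rows.get(current)); // value = comma-joined cells of that row
      return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return rows.isEmpty() ? 1.0f : Math.min(1.0f, (current + 1) / (float) rows.size()); }
    @Override public void close() { }
  }
}

As for where this sits relative to your existing program: the driver of your current job would simply call job.setInputFormatClass(ExcelInputFormat.class) and the mapper's input types would become LongWritable/Text; the mapper itself stays an ordinary mapper that receives one spreadsheet row per call.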
You can check my own CustomInputFormat and RecordReader that deal with custom data objects here:
myCustomInputFormat
Your research is correct: you need a custom InputFormat for Hadoop. If you are lucky, somebody has already created one for your use case.
If not, I would suggest looking for a Java library that is able to read Excel files.
Since Excel is a proprietary file format, it is unlikely that you will find an implementation that works perfectly.
Once you found a library that is able to read Excel files, integrate it with the InputFormat.
To do that, you have to extend Hadoop's FileInputFormat. The RecordReader returned by your ExcelInputFormat's getRecordReader must emit the rows from your Excel file. You probably also have to override the getSplits() method to tell the framework not to split the file at all.
I am writing a map-reduce job in Java, and I would like to know whether it is possible to obtain the output of the job as a stream (maybe an output stream) rather than a physical output file. My objective is to use that stream in another application.
You can write a custom OutputFormat and use it to write to any stream you want, not necessarily a file. See this tutorial on how to write a custom OutputFormat.
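As an illustration of that idea, here is a rough sketch of an OutputFormat whose RecordWriter sends each key/value pair over a TCP socket instead of to HDFS. The class name, the configuration keys and the tab-separated wire format are all assumptions made for this sketch, not anything provided by Hadoop:

// Sketch: each reduce task streams its key/value pairs to a TCP endpoint instead of an HDFS file.
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SocketOutputFormat extends OutputFormat<Text, Text> {

  @Override
  public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context) throws IOException {
    // Target host/port are read from the job configuration (keys chosen for this sketch).
    String host = context.getConfiguration().get("socket.output.host", "localhost");
    int port = context.getConfiguration().getInt("socket.output.port", 9999);
    final Socket socket = new Socket(host, port);
    final DataOutputStream out = new DataOutputStream(socket.getOutputStream());

    return new RecordWriter<Text, Text>() {
      @Override
      public void write(Text key, Text value) throws IOException {
        // Tab-separated, one record per line.
        out.write((key.toString() + "\t" + value.toString() + "\n").getBytes(StandardCharsets.UTF_8));
      }

      @Override
      public void close(TaskAttemptContext ctx) throws IOException {
        out.close();
        socket.close();
      }
    };
  }

  @Override
  public void checkOutputSpecs(JobContext context) {
    // Nothing to validate: there is no output directory.
  }

  @Override
  public OutputCommitter getOutputCommitter(TaskAttemptContext context) throws IOException, InterruptedException {
    // Reuse the do-nothing committer from NullOutputFormat since no files are produced.
    return new NullOutputFormat<Text, Text>().getOutputCommitter(context);
  }
}

You would set it with job.setOutputFormatClass(SocketOutputFormat.class). Note that each reduce task opens its own connection, so the receiving application has to cope with several concurrent streams, which is essentially the point the other answer below makes.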
Alternatively, you can make use of the Hadoop Streaming API. Have a look here for that.
I don't think you can do this with Apache Hadoop out of the box. It is designed to work as a distributed system, and AFAIK exposing a single output stream would defeat the purpose: how would the framework decide which reducer's stream to emit? You can write to a flat file, a database, Amazon S3, etc., but you probably won't get a stream.
I am trying not to use files as input in Hadoop. I have a Java program that produces output like 'chicken','10'; I store these entries in arrays, and I want them to be fed directly to the Mapper class. Does anyone have an idea how to feed such input directly into Hadoop without using files as input?
Thanks a lot
You should use your own class that extends InputFormat<K,V> and tell your Hadoop job to use this class instead of FileInputFormat. See the documentation here.
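To sketch what that can look like, here is a minimal InputFormat that takes its records from the job Configuration rather than from files. The class name, the "inmemory.records" key and the key=value encoding are all made up for this example, and the Configuration is only suitable for small amounts of data:

// Sketch: records come from the job configuration (set by the driver from an in-memory array).
import java.io.DataInput;
import java.io.DataOutput;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class InMemoryInputFormat extends InputFormat<Text, Text> {

  // Configuration key (chosen for this sketch) under which the driver stores records
  // as "key=value" strings, e.g. "chicken=10".
  public static final String RECORDS_KEY = "inmemory.records";

  // A split that carries no data of its own; the records live in the Configuration.
  public static class EmptySplit extends InputSplit implements Writable {
    @Override public long getLength() { return 0; }
    @Override public String[] getLocations() { return new String[0]; }
    @Override public void write(DataOutput out) { }
    @Override public void readFields(DataInput in) { }
  }

  @Override
  public List<InputSplit> getSplits(JobContext context) {
    return Collections.<InputSplit>singletonList(new EmptySplit());
  }

  @Override
  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
    return new RecordReader<Text, Text>() {
      private String[] records;
      private int pos = -1;
      private final Text key = new Text();
      private final Text value = new Text();

      @Override
      public void initialize(InputSplit s, TaskAttemptContext ctx) {
        records = ctx.getConfiguration().getStrings(RECORDS_KEY, new String[0]);
      }

      @Override
      public boolean nextKeyValue() {
        pos++;
        if (pos >= records.length) return false;
        String[] kv = records[pos].split("=", 2); // "chicken=10" -> key "chicken", value "10"
        key.set(kv[0]);
        value.set(kv.length > 1 ? kv[1] : "");
        return true;
      }

      @Override public Text getCurrentKey() { return key; }
      @Override public Text getCurrentValue() { return value; }
      @Override public float getProgress() { return records.length == 0 ? 1.0f : Math.min(1.0f, (pos + 1) / (float) records.length); }
      @Override public void close() { }
    };
  }
}

In the driver you would populate it from your array, e.g. conf.setStrings(InMemoryInputFormat.RECORDS_KEY, "chicken=10", "beef=5"), and call job.setInputFormatClass(InMemoryInputFormat.class). For anything beyond a small amount of data, writing the array out to a temporary HDFS file is usually the more practical route.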
Is it possible to pass the locations of files in HDFS as the values to my mapper, so that I can run an executable on them to process them?
Yes, you can create a file containing the file names in HDFS and use it as the input for the map/reduce job. You will need to create a custom splitter in order to serve several file names to each mapper; by default your input file will be split by blocks, and probably the whole file list will be passed to one mapper.
Another solution is to define your input as not splittable. In this case each file will be passed whole to a mapper, and you are free to create your own InputFormat that applies whatever logic you need to process the file, for example calling an external executable. If you go this way, the Hadoop framework will take care of data locality.
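As a rough illustration of the first approach, assume the job's input is a text file of HDFS paths read with the ordinary TextInputFormat (one path per line), so each map() call receives one path as its value. The class name and "/path/to/executable" are placeholders for this sketch:

// Sketch: each map() call copies one HDFS file locally and runs an external executable on it.
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExecutableMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable offset, Text hdfsPath, Context context)
      throws IOException, InterruptedException {
    Path src = new Path(hdfsPath.toString());
    FileSystem fs = src.getFileSystem(context.getConfiguration());

    // Copy the file into the task's local working directory so the executable can read it.
    Path local = new Path(src.getName());
    fs.copyToLocalFile(src, local);

    // "/path/to/executable" is a placeholder for whatever binary you want to run.
    Process p = new ProcessBuilder("/path/to/executable", local.toString()).start();
    int exitCode = p.waitFor();

    context.write(hdfsPath, new Text("exit=" + exitCode));
  }
}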
Another way of approaching this is to obtain the file name through the FileSplit; this can be done using the following code:
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String filename = fileSplit.getPath().getName();
Hope this helps