I am new to Hadoop. I have multiple folders containing files that need to be processed in Hadoop, and I am unsure how to implement the mapper in the MapReduce algorithm. Can I specify multiple mappers to process multiple files and combine all of the input into one output using a single reducer? If possible, please give guidelines for implementing these steps.
If you have multiple files, use MultipleInputs
The addInputPath() method can be used to:
add multiple paths that share one common mapper implementation, or
add multiple paths, each with its own mapper and input format implementation.
To have a single reducer, set the number of reduce tasks to one (job.setNumReduceTasks(1)); you can also have every mapper emit the same output key, say 1 or "abc", so that all records end up in the same reduce group.
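A minimal driver sketch of this approach (FolderAMapper, FolderBMapper and SumReducer are hypothetical class names, not from the question):

// Hypothetical driver: one folder per mapper, everything funneled into a single reducer.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiFolderDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "multi-folder");
        job.setJarByClass(MultiFolderDriver.class);

        // One input path per folder; each path may use its own mapper class.
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, FolderAMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, FolderBMapper.class);

        // A single reduce task merges the output of all mappers into one file.
        job.setNumReduceTasks(1);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}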
If the files are to be mapped in the same way (e.g. they all have the same format and processing requirements) then you can configure a single mapper to process all of them.
You do this by configuring the TextInputFormat class:
String folder1 = "file:///home/chrisgerken/blah/blah/folder1";
String folder2 = "file:///home/chrisgerken/blah/blah/folder2";
String folder3 = "file:///home/chrisgerken/blah/blah/folder3";
TextInputFormat.setInputPaths(job, new Path(folder1), new Path(folder2), new Path(folder3));
This will result in all of the files in folders 1, 2 and 3 being processed by the mapper.
Of course, if you need to use a different input type you'll have to configure that type appropriately.
I have different JSON files and need to read, process, and write the JSON objects contained in a JSON array in each file.
The output format (more specifically: the output class) is the same for all files. Let's call it OutputClass. Hence the item processor is something like ItemProcessor<X, OutputClass>, where X is the class of the specific JSON file.
The differences between the files are:
The JSON array (the information of interest) is at a different position in every JSON file
The structure of the JSON objects in the JSON array is different (the objects in file A have a different syntax than the ones in file B)
I already came across @StepScope and was able to dynamically generate a reader (depending on job parameters) that starts reading at a different position in the JSON structure.
But I have no idea how to dynamically choose an ItemProcessor depending on the job parameters. I have many different JSON files and want to reduce the amount of code I have to write for each file.
Since you were able to create a dynamic item reader based on job parameters by using a step-scoped bean (which is the way I would do it too), you can use the same approach to create a dynamic item processor as well.
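A minimal sketch of that idea, assuming a hypothetical job parameter named fileType and two placeholder processors FileAProcessor and FileBProcessor that both produce OutputClass:

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ProcessorConfiguration {

    // "fileType" is an assumed job parameter name; FileAProcessor and
    // FileBProcessor are placeholders for per-file implementations of
    // ItemProcessor<Object, OutputClass>.
    @Bean
    @StepScope
    public ItemProcessor<Object, OutputClass> itemProcessor(
            @Value("#{jobParameters['fileType']}") String fileType) {
        if ("fileA".equals(fileType)) {
            return new FileAProcessor();
        }
        return new FileBProcessor();
    }
}

Because the bean is step-scoped, it is created per step execution, so each run of the job can pick a different processor from its parameters without extra wiring per file.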
I have two Mapper classes that take files from the same folder as input; the timestamp in each file's name determines which mapper the file should be given to. At times the same input file has to be given as input to both mappers. I have tested this to work when the two mappers are given different inputs, but when I give them the same input, one of the mapper classes does not generate the result that is to be used for comparison in the reducer.
The code is enormous, so instead of posting it here I will describe what I did. I created two lists, and while scanning through the files in the directory I put them into the two lists based on the timestamps in their names, then added each list to a different mapper. The two sets of files are computed differently, so I use different mappers, and the results are then compared in the reducer. But when the same input file meets the time criteria for both mappers, one of the mappers does not generate any result. Is this because one mapper cannot access the file while the other is using it, and if so, is there any way around it?
Here MapPath1 is one list while MapPath2 is another
for (i = 0; i < MapPath1.size(); i++) {
    MultipleInputs.addInputPath(job, new Path(MapPath1.get(i)), TextInputFormat.class, Map1.class);
}
if (type.equals("comparative")) {
    for (i = 0; i < MapPath2.size(); i++) {
        MultipleInputs.addInputPath(job, new Path(MapPath2.get(i)), TextInputFormat.class, Map2.class);
    }
}
Update
I just found this question (Multiple mappers in hadoop) to be similar to mine, but I don't want to duplicate the input file as it can be large. Can anyone direct me on how I can create two separate jobs using different Mappers and provide their output to a single reducer?
one of the Mapper class doesn't generate the result to be used for comparison in the reducer.
My guess is that both mappers are getting launched on the same TaskTracker node and the intermediate mapper output location is shared by both map tasks. You should check the TaskTracker nodes where these map tasks are launched to confirm this.
Also, you should run a mapper-only job by setting the number of reduce tasks to zero and check the output; this confirms whether the mappers are sharing output directories.
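For that mapper-only check, the only change needed is in the driver (assuming the Job object from your existing setup):

// Diagnostic run: with zero reduce tasks each mapper writes its own
// part-m-NNNNN file, so you can inspect the map output directly.
job.setNumReduceTasks(0);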
To suggest a solution to your problem: it sounds like you are passing the same file to both mappers and the data from both mappers goes to a single reducer. This introduces some duplication; is it acceptable for your job output to contain it?
I'm writing a Pig script that looks as follows:
...
myGroup = group simplifiedJoinData by (dir1, dir2, dir3, dir4);
betterGroup = foreach myGroup {
    value1Value2 = foreach simplifiedJoinedGroup generate value1, value2;
    distinctValue1Value2 = DISTINCT value1Value2;
    generate group, distinctValue1Value2;
}
store betterGroup into '/myHdfsPath/myMultiStorageTest' using MyMultiStorage('output', '0', 'none' );
Please note that the schema of simplifiedJoinData is simplifiedJoinedGroup: {dir1: long,dir2: long,dir3: chararray,dir4: chararray,value1: chararray,value2: chararray}
It uses a custom storage class (MyMultiStorage - basically a modified version of MultiStorage in the piggybank) that writes multiple output files. The custom storage class expects that the values passed to it are in the following format:
{group:(dir1:long,dir2:long,dir3:chararray,dir4:chararray), bag:{(value1:chararray,value2:chararray)}}
What I'd like the custom storage class to do is output multiple files as follows:
dir1/dir2/dir3/dir4/value1_values.txt
dir1/dir2/dir3/dir4/value2_values.txt
where value1_values.txt contains all the value1 values and value2_values.txt contains all the value2 values. Ideally I would prefer not to write multiple part files that have to be combined later. (Note that the example has been simplified for the purposes of this discussion; the real output files are binary structures that can't be combined with a simple cat.) I have this working for small data sets; however, when I run with larger data sets, I get exceptions in Hadoop saying that the output file already exists or is already being created:
java.io.IOException: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException
I suspect that this is because multiple mappers or reducers are attempting to write the same file, since I am not using part IDs in the filename as PigStorage does. However, I would have expected that by grouping the data I would have only one record for each dir1, dir2, dir3, dir4 combination, and as such only one mapper or reducer would attempt to write a particular file in a given run. I've tried running without speculative execution for both map and reduce tasks, but that seems to have had no effect. Clearly I don't understand what's going on here.
My question is: Why am I getting the AlreadyBeingCreatedException?
If there is no way for me to have a single reducer write all the data for each record, it would be acceptable to write multiple part output files in a directory (one per reducer) and combine them after the fact; it just wouldn't be ideal. However, as of yet I have not been able to determine the proper way to have the custom storage class generate a unique filename, and I still end up with multiple reducers trying to create/write the same file. Is there a particular method in the job configuration or context that would allow me to coordinate parts across the job?
Thanks in advance for any help you can provide.
It turns out that there was a condition where I was generating the same file name because of a tuple parsing error, and that was exactly why I was getting the AlreadyBeingCreatedException.
There is nothing wrong with the custom store function or with approaching the problem in this manner. It was just a silly mistake on my part!
I have a Hadoop job. When the job is started, some number of mappers start, and each mapper writes a file to disk, like part-m-00000, part-m-00001. As I understand it, each mapper creates one part file. I have a large amount of data, so there must be more than one mapper, but can I somehow control the number of these output files? I mean, Hadoop might start, for example, 10 mappers, but could there be only three part files?
I found this post
How do multiple reducers output only one part-file in Hadoop?
But that question uses the old version of the Hadoop library. I'm using classes from org.apache.hadoop.mapreduce.* and not from org.apache.hadoop.mapred.*
I'm using Hadoop version 0.20 and hadoop-core:1.2.0.jar
Is there any possibility to do this, using new hadoop API?
The number of output files equals the number of reducers, or the number of mappers if there aren't any reducers.
You can add a single reducer to your job so that the output from all the mappers is directed to it and you get a single output file. Note that this will be less efficient, as all the data (the output of the mappers) will be sent over the wire (network IO) to the node where the reducer runs. Also, since a single process will (eventually) get all the data, it will probably run slower.
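In the new (org.apache.hadoop.mapreduce) API this is a single driver setting, assuming your existing Job object:

// All mapper output goes to one reduce task, producing a single part-r-00000 file.
job.setNumReduceTasks(1);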
By the way, the fact that there are multiple parts shouldn't be very significant, as you can pass the directory containing them to subsequent jobs.
I'm not sure you can do it (your link is about multiple outputs, not about converging to only one), and why use only one output? You would lose all the parallelism on the sort.
I'm also working with big files (~10 GB each) and my MR jobs process almost 100 GB each. So to lower the number of map tasks, I set a higher block size in HDFS (this applies only to newly written files) and a higher value of mapred.min.split.size in mapred-site.xml.
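If you would rather set the minimum split size per job than in mapred-site.xml, a sketch (the property name shown is the Hadoop 1.x one; 2.x renamed it to mapreduce.input.fileinputformat.split.minsize):

// Ask for splits of at least 256 MB so fewer map tasks are launched.
Configuration conf = new Configuration();
conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
Job job = new Job(conf, "fewer-maps");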
You might want to look at MultipleOutputFormat
Part of what the Javadoc says:
This abstract class extends the FileOutputFormat, allowing to write the output data to different output files.
Both Mapper and Reducer can use this.
Check this link for how you can specify an output file name (or more) from different mappers to output to HDFS.
NOTE: Moreover, make sure you don't use context.write(), so that 10 files from 10 mappers don't get created; use only MultipleOutputFormat for output.
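For the new API the asker is using (org.apache.hadoop.mapreduce), the closest equivalent is MultipleOutputs rather than MultipleOutputFormat; a minimal mapper sketch under that assumption, with the base path name chosen arbitrarily:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class NamedOutputMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Write via MultipleOutputs instead of context.write(); note that the
        // framework still appends the task's part suffix (e.g. filtered/part-m-00000).
        mos.write(NullWritable.get(), value, "filtered/part");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}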
If the job has no reducers, partitioners, or combiners, each mapper outputs one file. At some point you should run some post-processing to collect the outputs into one large file.
Is it possible to pass the locations of files in HDFS as the values to my mapper so that I can run an executable on them to process them?
Yes, you can create a file containing the file names in HDFS and use it as the input for the map/reduce job. You will need to create a custom splitter in order to serve several file names to each mapper; by default your input file will be split by blocks, and probably the whole file list will be passed to one mapper.
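One built-in way to get several file names per mapper without writing a fully custom splitter is NLineInputFormat (org.apache.hadoop.mapreduce.lib.input), which hands each map task a fixed number of lines from the list file; a sketch, assuming your Job object and a hypothetical list path:

// The input is a plain text file in HDFS with one file path per line;
// each map task receives 10 of those lines as its split.
job.setInputFormatClass(NLineInputFormat.class);
FileInputFormat.addInputPath(job, new Path("/user/me/file-list.txt"));  // assumed path
NLineInputFormat.setNumLinesPerSplit(job, 10);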
Another solution would be to define your input as non-splittable. In this case each file will be passed whole to a mapper, and you are free to create your own InputFormat that uses whatever logic you need to process the file, for example calling an external executable. If you go this way, the Hadoop framework will take care of data locality.
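A minimal sketch of that non-splittable approach, subclassing TextInputFormat so each file becomes exactly one split (the class name is illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Each input file becomes a single split, so one mapper sees the whole file
// and can, for example, hand its path to an external executable.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

Register it in the driver with job.setInputFormatClass(WholeFileTextInputFormat.class).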
Another way of approaching this is to obtain the file name through FileSplit, which can be done using the following code:
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String filename = fileSplit.getPath().getName();
Hope this helps