I have a job for Hadoop. When the job is started, some number of mappers are started, and each mapper writes a file to disk, like part-m-00000, part-m-00001. As I understand it, each mapper creates one part file. I have a large amount of data, so there must be more than one mapper, but can I somehow control the number of these output files? I mean, Hadoop will start, for example, 10 mappers, but can there be only three part files?
I found this post
How do multiple reducers output only one part-file in Hadoop?
But that uses the old version of the Hadoop library. I'm using classes from org.apache.hadoop.mapreduce.* and not from org.apache.hadoop.mapred.*
I'm using hadoop version 0.20, and hadoop-core:1.2.0.jar
Is there any possibility to do this using the new Hadoop API?
The number of output files equals the number of reducers, or the number of mappers if there are no reducers.
You can add a single reducer to your job so that the output from all the mappers is directed to it and you get a single output file. Note that this will be less efficient, as all the data (the output of the mappers) will be sent over the wire (network IO) to the node where the reducer runs. Also, since a single process will (eventually) get all the data, it would probably run slower.
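For example, a rough driver sketch with the new API (MyMapper, MyReducer and the Text output types are placeholders, not something from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "single-output");    // Hadoop 1.x constructor; newer releases use Job.getInstance(conf)
        job.setJarByClass(SingleOutputDriver.class);
        job.setMapperClass(MyMapper.class);          // placeholder: your existing mapper
        job.setReducerClass(MyReducer.class);        // placeholder: can be an identity-style reducer
        job.setOutputKeyClass(Text.class);           // adjust to your actual key/value types
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(1);                    // one reducer -> a single part-r-00000 output file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}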
By the way, the fact that there are multiple parts shouldn't be very significant, as you can pass the directory containing them to subsequent jobs.
I'm not sure you can do it (your link is about multiple outputs, not about converging to only one output), and why use only one output? You would lose all parallelism on the sort.
I'm also working on big files (~10 GB each) and my MR jobs process almost 100 GB each. So to lower the number of map tasks, I set a higher block size in HDFS (this applies only to newer files) and a higher value of mapred.min.split.size in mapred-site.xml.
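As a sketch, the same knob can also be set per job from the driver instead of cluster-wide in mapred-site.xml (the 256 MB figure is just an example value):

// Per-job alternative to editing mapred-site.xml
Configuration conf = new Configuration();
conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);   // bigger minimum split -> fewer map tasks
// dfs.block.size only affects files written after the change, as noted above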
You might want to look at MultipleOutputFormat
Part of what Javadoc says:
This abstract class extends the FileOutputFormat, allowing to write
the output data to different output files.
Both Mapper and Reducer can use this.
Check this link for how you can specify an output file name (or more than one) from different mappers to output to HDFS.
NOTE: Also, make sure you don't use context.write(), so that 10 files from 10 mappers don't get created. Use only MultipleOutputFormat to write the output.
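Since the question uses the new API, the closest equivalent there is MultipleOutputs in org.apache.hadoop.mapreduce.lib.output (check that your release ships it). A rough mapper sketch, with the key/value types and the "custom/part" base path as placeholders:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CustomOutputMapper extends Mapper<LongWritable, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // write through MultipleOutputs instead of context.write(), per the note above
        mos.write(new Text("someKey"), value, "custom/part");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}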
If the job has no reducers, partitioners, or combiners, each mapper writes one output file. At some point you would have to run some post-processing to collect the outputs into one large file.
Related
I'm trying to write a MapReduce job that can read two sequence files in the Mapper. I've tried reading and writing a sequence file in main(), but I don't know how to do it in the Mapper. I think I'm not really familiar with how MapReduce works. Thanks for helping me out.
If everything is correct, in the main() method you wrote something like this:
FileInputFormat.addInputPath()
FileOutputFormat.setOutputPath()
to tell Hadoop the directory where to find your two input files and where to write the results of the computation.
When the job starts, Hadoop starts reading the files it finds in the input directory and calls the map() method of the mapper, passing it the records of the files one at a time as arguments (for a text file a record is a line; for a sequence file it is a key/value pair).
At the end of the computation, when the reducer emits its data, Hadoop is going to write the results in one (or more) files in the specified output directory.
So, the mapper and the reducer don't need to know anything about input/output files.
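A hedged driver sketch wiring two sequence-file inputs into one job (the class names and argument layout are assumptions; match the mapper's input key/value types to whatever your sequence files actually contain):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SeqFileDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "read-seq-files");
        job.setJarByClass(SeqFileDriver.class);
        job.setMapperClass(MySeqMapper.class);                    // placeholder mapper; its map() receives one record at a time
        job.setInputFormatClass(SequenceFileInputFormat.class);   // hand the mapper sequence-file records instead of text lines
        FileInputFormat.addInputPath(job, new Path(args[0]));     // first sequence file
        FileInputFormat.addInputPath(job, new Path(args[1]));     // second sequence file
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}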
I have a pig script that generates some output to a HDFS directory. The pig script also generates a SUCCESS file in the same HDFS directory. The output of the pig script is split into multiple parts as the number of reducers to use in the script is defined via 'SET default_parallel n;'
I would like to now use Java to concatenate/merge all the file parts into a single file. I obviously want to ignore the SUCCESS file while concatenating. How can I do this in Java?
Thanks in advance.
You can use getmerge through a shell command to merge multiple files into a single file.
Usage: hdfs dfs -getmerge <srcdir> <destinationdir/file.txt>
Example: hdfs dfs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
In case you don't want to use a shell command to do it, you can write a Java program and use the FileUtil.copyMerge method to merge the output files into a single file. The implementation details are available in this link.
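If you go the Java route and also want to skip the SUCCESS marker, here is a rough sketch using the plain FileSystem API instead of copyMerge (the argument layout and the "part-" prefix check are assumptions about how the outputs are named):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeParts {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path srcDir = new Path(args[0]);      // Pig output directory on HDFS
        Path dstFile = new Path(args[1]);     // single merged file to create

        FSDataOutputStream out = fs.create(dstFile);
        for (FileStatus status : fs.listStatus(srcDir)) {
            String name = status.getPath().getName();
            if (!name.startsWith("part-")) {
                continue;                      // skips the success marker and anything else that isn't a part file
            }
            FSDataInputStream in = fs.open(status.getPath());
            IOUtils.copyBytes(in, out, conf, false);   // false = keep 'out' open between parts
            in.close();
        }
        out.close();
    }
}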
If you want a single output on HDFS itself through Pig, then you need to pass everything through a single reducer. To do so, set the number of reducers to 1 by putting the line below at the start of your script.
--Assigning only one reducer in order to generate only one output file.
SET default_parallel 1;
I hope this will help you.
The reason why this does not seem easy to do is that typically there would be little purpose. If I have a very large cluster and am really dealing with a Big Data problem, my output as a single file would probably not fit onto any single machine.
That being said, I can see use cases, such as metrics collection, where maybe you just want to output some metrics about your data, like counts.
In that case I would first run your MapReduce program,
Then create a second map/reduce job that reads that data and funnels all the elements to one and the same reducer by emitting a static key from the map function (a rough sketch follows below).
Or you could also just use your original program with a single reducer:
job.setNumReduceTasks(1);
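A rough sketch of the static-key idea from the second option above (the key/value types and the constant key are assumptions); combined with job.setNumReduceTasks(1), every record ends up at the same reducer:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StaticKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Text KEY = new Text("all");   // one constant key funnels everything into a single reduce group

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(KEY, line);
    }
}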
I have two Mapper classes which take some files from the same folder as input; the name of each file, which contains a timestamp, determines which mapper the file is given to as input. At times the same input file has to be given to two different Mappers. I've tested that it works when two different inputs are given to the two Mappers, but when I give them the same input, one of the Mapper classes doesn't generate the result to be used for comparison in the reducer.
The code is enormous, so instead of putting it here I'll describe what I did. I created two lists while scanning through the files in the directory and, based on the names of the files (which have timestamps), I put them in two different lists and then add them to two different Mappers. The two inputs are computed differently, so I use different Mappers, and their results are then compared in the reducer. But when it is the same input file (because the time criteria for both mappers are almost the same), one of the mappers doesn't generate any result. Is it because one mapper is not able to access the file while the other is using it, and if that is the case, is there any way around it?
Here MapPath1 is one list while MapPath2 is another
for (i = 0; i < MapPath1.size(); i++)
    MultipleInputs.addInputPath(job, new Path(MapPath1.get(i)), TextInputFormat.class, Map1.class);

if (type.equals("comparative"))
    for (i = 0; i < MapPath2.size(); i++)
        MultipleInputs.addInputPath(job, new Path(MapPath2.get(i)), TextInputFormat.class, Map2.class);
Update
I just found this question (Multiple mappers in hadoop) to be similar to mine, but I don't want to duplicate the input file as it can be large. Can anyone direct me on how I can create two separate jobs using different Mappers and feed their output to a single reducer?
one of the Mapper class doesn't generate the result to be used for comparison in the reducer.
My guess is that both mappers are getting launched on the same task tracker node and the intermediate mapper output location is shared by both map tasks. You should check the task tracker nodes where these map tasks are launched to confirm this.
Also, you should run a mapper-only job by setting the number of reduce tasks to zero and check the output; this is to confirm that the mappers are not sharing output directories.
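For that map-only check, something like this in the driver should do (sketch):

job.setNumReduceTasks(0);   // map-only job: each mapper writes its part-m-* file directly, no shuffle or reduce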
As for a solution to your problem: it sounds like you are passing the same file to both mappers and the data from both mappers goes to a single reducer. This introduces some duplication; is it OK for your job output to contain that duplication?
I want to run a Hadoop unit test using the local filesystem mode... I would ideally like to see several part-m-* files written out to disk (rather than just one). However, since it is just a test, I don't want to process 64 MB of data (the default size is ~64 MB per block, I believe).
In distributed mode we can set this using
dfs.block.size
I am wondering whether there is a way to get my local file system to write out small part-m files, i.e. so that my unit test mimics the layout of large-scale data with several (albeit very small) files.
Assuming your input format can handle splittable files (see the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.isSplitable(JobContext, Path) method), you can amend the input split size to process a smaller file with multiple mappers (I'm going to assume you're using the new API mapreduce package):
For example, if you're using the TextInputFormat (or most input formats that extend FileInputFormat), you can call the static util methods:
FileInputFormat.setMaxInputSplitSize(Job, long)
FileInputFormat.setMinInputSplitSize(Job, long)
The long argument is the size of the split in bytes, so just set it to your desired size.
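For example (the 1 MB value is arbitrary; pick whatever suits your test data):

// Cap splits at 1 MB so even a small local test file is carved into several map tasks
FileInputFormat.setMaxInputSplitSize(job, 1024L * 1024);
FileInputFormat.setMinInputSplitSize(job, 1L);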
Under the hood, these methods set the following job configuration properties:
mapred.min.split.size
mapred.max.split.size
One final note: some input formats may override the FileInputFormat.getFormatMinSplitSize() method (which defaults to 1 byte for FileInputFormat), so be wary if you set a value and Hadoop appears to ignore it.
A final point - have you considered MRUnit http://incubator.apache.org/mrunit/ for actual 'unit' testing of your MR code?
Try doing this; it will work:
hadoop fs -D dfs.block.size=16777216 -put 25090206.P .
I want to create a file in HDFS that has a bunch of lines, each generated by a different call to map. I don't care about the order of the lines, just that they all get added to the file. How do I accomplish this?
If this is not possible, then is there a standard way to generate unique file names to put each line of output into a separate file?
There is no way to append to an existing file in Hadoop at the moment, but that's not what it sounds like you want to do anyway. It sounds like you want the output from your MapReduce job to go to a single file, which is quite possible. The number of output files is (less than or) equal to the number of reducers, so if you set your number of reducers to 1, you'll get a single file of output.
Before you go and do that, however, think about whether that's what you really want. You'll be creating a bottleneck in your pipeline, since it needs to pass all your data through a single machine for that reduce. Within the HDFS distributed file system, the difference between having one file and having several files is pretty transparent. If you want a single file outside the cluster, you might do better to use getmerge from the file system tools.
Both your map and reduce functions should output the lines. In other words, your reduce function is a pass-through function that doesn't do much. Set the number of reducers to 1. The output will be a list of all the lines in one file.
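A hedged sketch of such a pass-through reducer (the Text/NullWritable types are assumptions about how the mapper emits its lines):

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PassThroughReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text line, Iterable<NullWritable> occurrences, Context context)
            throws IOException, InterruptedException {
        // re-emit the line once per occurrence so duplicate lines are preserved
        for (NullWritable ignored : occurrences) {
            context.write(line, NullWritable.get());
        }
    }
}
// In the driver: job.setNumReduceTasks(1);  // all lines land in one output file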