dfs.block.size for local Hadoop jobs? - java

I want to run a Hadoop unit test using the local filesystem mode... I would ideally like to see several part-m-* files written out to disk (rather than just 1). However, since it's just a test, I don't want to process 64M of data (the default is ~64 MB per block, I believe).
In distributed mode we can set this using
dfs.block.size
I am wondering whether there is a way to get my local file system to write small part-m files out, i.e. so that my unit test will mimic the contents of large-scale data with several (albeit very small) files.

Assuming your input format can handle splittable files (see the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.isSplitable(JobContext, Path) method), you can amend the input split size so that a smaller file is processed by multiple mappers (I'm going to assume you're using the new-API mapreduce package):
For example, if you're using the TextInputFormat (or most input formats that extend FileInputFormat), you can call the static util methods:
FileInputFormat.setMaxInputSplitSize(Job, long)
FileInputFormat.setMinInputSplitSize(Job, long)
The long argument is the size of the split in bytes, so just set it to your desired size.
Under the hood, these methods set the following job configuration properties:
mapred.min.split.size
mapred.max.split.size
Final note: some input formats may override the FileInputFormat.getFormatMinSplitSize() method (which defaults to 1 byte for FileInputFormat), so be wary if you set a value and Hadoop appears to ignore it.
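Putting it together, a driver fragment along these lines (a minimal sketch; the job name, input path and the 1 MB ceiling are placeholder values I've made up) would force several small splits, and therefore several part-m-* files, even for a tiny test input:
// Minimal sketch of the relevant driver lines; surrounding job setup is omitted.
Configuration conf = new Configuration();
Job job = new Job(conf, "small-split-test");                // placeholder job name
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path("test-input"));  // placeholder input path

// Cap splits at 1 MB so even a small test file is handed to several mappers.
FileInputFormat.setMinInputSplitSize(job, 1L);
FileInputFormat.setMaxInputSplitSize(job, 1024L * 1024L);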
A final point - have you considered MRUnit http://incubator.apache.org/mrunit/ for actual 'unit' testing of your MR code?

Try doing this; it will work:
hadoop fs -D dfs.block.size=16777216 -put 25090206.P .

Related

Hadoop - Merge reducer outputs to a single file using Java

I have a Pig script that generates some output to an HDFS directory. The Pig script also generates a SUCCESS file in the same HDFS directory. The output of the Pig script is split into multiple parts, as the number of reducers to use in the script is defined via 'SET default_parallel n;'
I would like to now use Java to concatenate/merge all the file parts into a single file. I obviously want to ignore the SUCCESS file while concatenating. How can I do this in Java?
Thanks in advance.
You can use getmerge through the shell to merge multiple files into a single file.
Usage: hdfs dfs -getmerge <srcdir> <destinationdir/file.txt>
Example: hdfs dfs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
In case you don't want to use the shell command to do it, you can write a Java program and use the FileUtil.copyMerge method to merge the output files into a single file. The implementation details are available in this link.
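If you would rather do the merge by hand with the FileSystem API (which makes it easy to skip the _SUCCESS marker explicitly), a rough sketch could look like the following; the source and destination paths are placeholders:
// Rough sketch: merge every part file in the Pig output directory into a single
// HDFS file, skipping _SUCCESS and other marker/hidden files. Paths are placeholders.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path srcDir = new Path("/pig/output/dir");
Path dstFile = new Path("/pig/output/merged");

FSDataOutputStream out = fs.create(dstFile);
for (FileStatus status : fs.listStatus(srcDir)) {
    String name = status.getPath().getName();
    if (status.isDir() || name.startsWith("_") || name.startsWith(".")) {
        continue; // skip sub-directories, _SUCCESS and hidden files
    }
    FSDataInputStream in = fs.open(status.getPath());
    IOUtils.copyBytes(in, out, conf, false); // 'false' keeps the output stream open
    in.close();
}
out.close();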
If you want a single output on HDFS itself through Pig, then you need to pass it through a single reducer. To do so, set the number of reducers to 1 by putting the line below at the start of your script.
--Assigning only one reducer in order to generate only one output file.
SET default_parallel 1;
I hope this will help you.
The reason this does not seem easy to do is that typically there would be little purpose: if I have a very large cluster and am really dealing with a Big Data problem, my output as a single file would probably not fit onto any single machine.
That being said, I can see use cases such as metrics collection, where maybe you just want to output some metrics about your data, like counts.
In that case I would first run your MapReduce program, then create a second map/reduce job that reads the data and reduces all the elements to the same single reducer by using a static key with your reduce function.
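As a rough illustration of the static-key idea (the class and key names here are made up), the mapper of that second job could emit every record under one constant key so that the single reducer sees all of them:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the second-pass mapper: every record is written under the same key,
// so one reducer receives (and can aggregate or concatenate) all of them.
public class StaticKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Text STATIC_KEY = new Text("all");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(STATIC_KEY, line);
    }
}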
Or you could also just use a single reducer with your original program, via
job.setNumReduceTasks(1);

Hadoop - How to get a Path object of an HDFS file

I'm trying to figure out the various ways to write content/files to the HDFS in a Hadoop cluster.
I know there is org.apache.hadoop.fs.FileSystem.get() and org.apache.hadoop.fs.FileSystem.getLocal() to create an output stream and write byte by byte. If you are making use of OutputCollector.collect(), it doesn't seem like this is the intended way to write to HDFS. I believe you have to use OutputCollector.collect() when implementing Mappers and Reducers; correct me if I'm wrong.
I know you can set FileOutputFormat.setOutputPath() before even running the job, but it looks like this only accepts objects of type Path.
When looking at org.apache.hadoop.fs.Path and the Path class, I do not see anything which allows you to specify remote or local. Then when looking up org.apache.hadoop.fs.FileSystem, I do not see anything which returns an object of type Path.
Does FileOutputFormat.setOutputPath() always have to write to the local file system? I don't think this is true; I vaguely remember reading that a job's output can be used as another job's input. This leads me to believe there is also a way to set this to HDFS.
Is the only way to write to the HDFS to use a data stream as described?
org.apache.hadoop.fs.FileSystem.get and org.apache.hadoop.fs.FileSystem.getLocal return a FileSystem object, which is a generic abstraction that can be backed by either a local filesystem or a distributed filesystem.
OutputCollector doesn't write to HDFS itself. It just provides a collect method for mappers and reducers to emit their output (both intermediate and final). By the way, it is deprecated in favor of the Context object.
FileOutputFormat.setOutputPath sets the final output directory by setting mapred.output.dir, which can be on your local filesystem or a distributed one. About remote vs. local: fs.default.name sets that. If you have set it to file:/// it will use the local filesystem; if set to hdfs:// it will use HDFS, and so on.
About writing to HDFS: whatever method you use to write files in Hadoop, it will be using FSDataOutputStream underneath. FSDataOutputStream is a wrapper around java.io.OutputStream. (Whenever you want to write to a filesystem in Java, you have to create a stream object for that.)
FileOutputFormat has the method FileOutputFormat.setOutputPath(job, output_path), where in place of output_path you can specify whether you want to use the local filesystem or HDFS, overriding the settings of core-site.xml. For example, FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/path_to_file")) will set the output to be written to HDFS. Change it to file:/// and you can write to the local filesystem. Change localhost and the port number as per your settings. In the same way, the input can also be overridden at the per-job level.
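To make the two write paths concrete, here is a hedged sketch of (a) pointing a job's output at HDFS per job and (b) writing a file directly through the FileSystem API. The host and port come from the localhost:9000 example above, the file paths are placeholders, and 'job' stands for the org.apache.hadoop.mapreduce.Job being configured:
// (a) Per-job override: send this job's output to HDFS explicitly.
FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/user/me/output"));

// (b) Writing directly with the FileSystem API; any write in Hadoop ends up on an
//     FSDataOutputStream underneath.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
FSDataOutputStream out = fs.create(new Path("/user/me/notes.txt")); // placeholder path
out.writeUTF("written straight to HDFS");
out.close();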

Control number of hadoop mapper output files

I have a job for Hadoop. When the job is started, I have some number of mappers started, and each mapper writes a file to disk, like part-m-00000, part-m-00001. As I understand it, each mapper creates one part file. I have a big amount of data, so there must be more than one mapper, but can I somehow control the number of these output files? I mean, can Hadoop start, for example, 10 mappers but produce only three part files?
I found this post
How do multiple reducers output only one part-file in Hadoop?
But that uses an old version of the Hadoop library. I'm using classes from org.apache.hadoop.mapreduce.* and not from org.apache.hadoop.mapred.*.
I'm using Hadoop version 0.20, and hadoop-core:1.2.0.jar.
Is there any possibility to do this using the new Hadoop API?
The number of output files equals the number of reducers, or the number of mappers if there aren't any reducers.
You can add a single reducer to your job so that the output from all the mappers is directed to it and you get a single output file. Note that this will be less efficient, as all the data (the output of the mappers) will be sent over the wire (network IO) to the node where the reducer runs. Also, since a single process will (eventually) get all the data, it would probably run slower.
By the way, the fact that there are multiple parts shouldn't be very significant, as you can pass the directory containing them to subsequent jobs.
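With the org.apache.hadoop.mapreduce API the question asks about, that comes down to a single call on the Job object; the driver lines around it are only a sketch, and MyDriver/MyMapper/MyReducer are placeholder class names:
// New-API driver sketch: one reducer means one part-r-00000 output file.
Configuration conf = new Configuration();
Job job = new Job(conf, "single-output-job");   // Job.getInstance(conf, ...) on newer releases
job.setJarByClass(MyDriver.class);              // placeholder driver class
job.setMapperClass(MyMapper.class);             // placeholder mapper
job.setReducerClass(MyReducer.class);           // placeholder reducer
job.setNumReduceTasks(1);                       // funnel all mapper output into one reducer
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);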
I'm not sure you can do it (your link is about multiple outputs, not about converging to only one), and why use only one output? You will lose all parallelism on the sort.
I'm also working on big files (~10 GB each) and my MR jobs process almost 100 GB each. So to lower the number of maps, I set a higher block size in HDFS (applies only to newer files) and a higher value of mapred.min.split.size in mapred-site.xml.
You might want to look at MultipleOutputFormat
Part of what Javadoc says:
This abstract class extends the FileOutputFormat, allowing to write
the output data to different output files.
Both Mapper and Reducer can use this.
Check this link for how you can specify an output file name (or more) from different mappers to output to HDFS.
NOTE: Moreover, make sure you don't use context.write(), so that 10 files from 10 mappers don't get created. Use only MultipleOutputFormat to write the output.
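Since the question is on the new API, the corresponding helper there is org.apache.hadoop.mapreduce.lib.output.MultipleOutputs rather than the old-API MultipleOutputFormat. A rough sketch of routing mapper output through a named output, with made-up names ("summary", RoutingMapper):
// Driver side: declare a named output (the name "summary" is made up).
MultipleOutputs.addNamedOutput(job, "summary", TextOutputFormat.class, Text.class, Text.class);

// Mapper side: write through MultipleOutputs instead of context.write().
public class RoutingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private MultipleOutputs<Text, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // "summary" must match the named output declared in the driver
        out.write("summary", new Text(Long.toString(key.get())), value);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close(); // flush and close the named outputs
    }
}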
If the job has no reducers, partitioners or combiners, each mapper writes one output file. At some point, you should run some post-processing to collect the outputs into one large file.

How to pass an argument to the main program in Hadoop

Each time I run my Hadoop program I need to change the number of mappers and reducers. Is there any way to pass the number of mappers and reducers to my program from the command line (when I run the program) and then use args to retrieve them?
It is important to understand that you cannot really specify the number of map tasks. Ultimately the number of map tasks is defined as the number of input splits, which depends on your InputFormat implementation. Let's say you have 1 TB of input data and your HDFS block size is 64 MB, so Hadoop will compute around 16k map tasks; from there, if you specify a manual value less than 16k it will be ignored, but a value greater than 16k will be used.
To pass via command-line, the easiest way is to use the built-in class GenericOptionsParser (described here) which will directly parse common command-line Hadoop-related arguments like what you are trying to do. The good thing is that it allows you to pass pretty much any Hadoop parameters you want without having to write extra code later. You would do something like this:
public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // GenericOptionsParser consumes the standard Hadoop options (e.g. -D key=value)
    // and applies them to conf; anything left over is returned for your own parsing.
    String[] extraArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    // do something with your non-Hadoop parameters if needed
}
Now the properties you need to define to modify the number of mappers and reducers are respectively mapred.map.tasks and mapred.reduce.tasks, so you can just run your job with these parameters:
-D mapred.map.tasks=42 -D mapred.reduce.tasks=10
and they will get parsed directly by your GenericOptionsParser and populate your Configuration object automatically. Note that there is a space between -D and the property; this is important, otherwise it will be interpreted as a JVM parameter.
Here is a good link if you want to know more about this.
You can specify the number of mappers and reducers (and really any parameter you can specify in the config) by using the -D parameter. This works for all default Hadoop jars and for your own jars, as long as you extend Configured.
hadoop jar myJar.jar -Dmapreduce.job.maps=<Number of maps> -Dmapreduce.job.reduces=<Number of reducers>
From there you can retrieve the values using:
configuration.get("mapreduce.job.maps");
configuration.get("mapreduce.job.reduces");
or for Reducers
job.getNumReduceTasks();
Specifying the mappers with the configuration values will not work when mapreduce.jobtracker.address is "local". See Charles' answer where he explains how Hadoop usually determines the number of Mappers by data size.

How do I output whole files from a map job?

This is a basic question about mapreduce outputs.
I'm trying to create a map function that takes in an XML file and makes a PDF using Apache FOP. However, I'm a little confused as to how to output it, since I know that it goes out as a (key, value) pair.
I'm also not using streaming to do this.
The point of map-reduce is to tackle large amounts of data that would usually not fit in memory, so input and output are usually stored on disk somehow (a.k.a. files).
Input-output must be specified in key-value format
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
I have not tried this, but this is what I would do:
Write the output of the mapper in this form: the key is the filename as Text (keep the filenames unique) and the value is the output of FOP. Write it using TextOutputFormat.
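A skeletal version of that mapper (the names are invented and the actual FOP call is left as a placeholder; this only illustrates the (filename, content) shape of the output) might look like:
// Skeletal mapper: key = unique output filename, value = generated content.
public class XmlToPdfMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text xmlRecord, Context context)
            throws IOException, InterruptedException {
        String fileName = "doc-" + offset.get() + ".pdf";        // keep filenames unique
        String rendered = renderWithFop(xmlRecord.toString());   // placeholder for the FOP step
        context.write(new Text(fileName), new Text(rendered));
    }

    private String renderWithFop(String xml) {
        // placeholder: run Apache FOP on the XML and return the rendered result
        return xml;
    }
}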
Suggestion:
I am assuming that your use case is just reading input XML (maybe doing some operation on its data) and writing the data to PDF files using FOP. I don't think this is a Hadoop use case in the first place, because whatever you want to do can be done by a batch script. How big are your XML files? How many XML files do you have to process?
EDIT:
SequenceFileOutputFormat writes to a SequenceFile. A SequenceFile has its own headers and other metadata along with the text that is stored. It also stores data in the form of key:value pairs.
SequenceFile Common Header
version - A byte array: 3 bytes of magic header 'SEQ', followed by 1 byte of actual version no. (e.g. SEQ4 or SEQ6)
keyClassName - String
valueClassName - String
compression - A boolean which specifies if compression is turned on for keys/values in this file.
blockCompression - A boolean which specifies if block compression is turned on for keys/values in this file.
compressor class - The classname of the CompressionCodec which is used to compress/decompress keys and/or values in this SequenceFile (if compression is enabled).
metadata - SequenceFile.Metadata for this file (key/value pairs)
sync - A sync marker to denote end of the header.
Using a SequenceFile will ruin your application, as you will end up with corrupted output PDF files. Try it out and see for yourself.
You have lots of input files... and this is where Hadoop sucks (read this). Still, I feel that you can do your desired operation using a script that invokes FOP on every document, one by one. If you have multiple nodes, run the same script but on different subsets of the input documents. Trust me, this will run FASTER than Hadoop, considering the overhead involved in creating maps and reduces (you don't need reduces, I know).
