Set the number of map tasks

Set the number of map tasks - java

While configuring a Map Reduce job, I know that one can set the number of reduce tasks by using the method job.setNumReduceTasks(2);.
Can we set the number of map tasks?
I don't see any methods to do this.
If there is no such functionality, does someone know why this framework has the ability to have more than 1 reduce task, but not more than 1 map task?

There used to be property for setting the number of map tasks, which was setNumMapTasks. Bur it was merely a hint to the framework, and could not guarantee that you'll get only the specified number of maps. The map creation is actually governed by the InputFormat you are using in your job. And this is the reason it is not supported anymore.
If you are not happy with the number of mappers created by the framework, you could try tweaking the values of following 2 properties as per your requirements :
- mapred.min.split.size
- mapred.max.split.size

Number of map tasks is not something the programmer sets,rather its something that the hadoop framework,in particular the TaskTracker that creates as many mappers as the number of input splits(generally of 64mb but can be changed) of the InputFile from HDFS...

Related

Java Hadoop - Can the input to a reducer be the output of a reducer?

I'm writing a map-reduce program with (currently) 3 map-reduce phases. I need to do another reduce to the output of the 3rd phase reduce - I can use a map of identity (takes (key, value) and outputs them without changing) but I don't want to do that extra map (time and resources wise) and wish to simply pass them to a reducer.
Is it possible? If so, how to I code the "jobs"?
I can post my whole code if it might help (maybe I'm doing something redundant/insufficient in the previous 3 phases).
Thank you for the help.

I don't think it will be feasible to use reduce only jobs. Moreover if you want to use reducer2 on output of reducer 1 then you can make your map 2 as a unity which simply means that map2 will do not perform any operation on reducer 1 output and will let it pass to reducer 2.
Major reason why reducer only jobs are not feasible is because reducer node reads data from output of map nodes that is why maps are required. I will suggest you to visit this page this will clear your concept of how map reduce jobs works ( www.javacrunch.in/MR.jsp ).
Hope this solve your query

I/O time when reading from HDFS in Hadoop

I would like to measure the time taken for map and reduce when performing I/O (reading from HDFS) in Hadoop. I am using Yarn.
Hadoop 2.6.0.
What are the options for that?

If you need exact measurments - you could use btrace, add it as a javaagent to your tasks via mapreduce.{map,reduce}.java.opts - and then write script which measures whatever you like. Sample of btrace scripts are here.
Also there is HTrace - that might also be helpful.

One rough estimation could be creating custom counters. For both mapper and reducer you could collect the timestamp when mapper(or reducer) starts processing and when it ends. From starting and ending timestamp, calculate and add it to custom counters, i.e mappers add to MAPPER_RUNNING_TIME and reducers add to REDUCER_RUNNING_TIME (or whatever name you would like to give it). When the execution is finished, subtract the aggregated value of your counters from MILLIS_MAPS and MILLIS_REDUCES respectively. You might need to look into Hadoop source code though to confirm if the staging time is or is not being included into MILLIS_MAPS and MILLIS_REDUCES. With this estimation you would need to take into account that the tasks are being executed concurrently, so the time will be rather total (or aggregated for all mappers and reducers).
I have not done this personally, but I think this solution could work unless you find better one.

Modify mapper size (split size) in MapReduce to have a faster performance

Are there any ways to improve the MapReduce performance by changing the number of map tasks or changing the split sizes of each mapper?
For example, I have a 100GB text file and 20 nodes. I want to run a WordCount job on the text file, what is the ideal number of mappers or the ideal split size so that it can be done faster?
Would it be faster with more mappers?
Would it be faster with a smaller split size?
EDIT
I am using hadoop 2.7.1, just so you know there is YARN.

It is not necessarily faster when you use more mappers. Each mapper has a start up and setup time. In the early days of hadoop when mapreduce was the de facto standard it was said that a mapper should run ~10 minutes. Today the documentations recommends 1 minute. You can vary the number of map tasks by using setNumMapTasks(int) which you can define within the JobConf. IN the documentation of the method are very good information about the mapper count:
How many maps?
The number of maps is usually driven by the total size
of the inputs i.e. total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps
per-node, although it has been set up to 300 or so for very cpu-light
map tasks. Task setup takes awhile, so it is best if the maps take at
least a minute to execute.
The default behavior of file-based InputFormats is to split the input
into logical InputSplits based on the total size, in bytes, of input
files. However, the FileSystem blocksize of the input files is treated
as an upper bound for input splits. A lower bound on the split size
can be set via mapreduce.input.fileinputformat.split.minsize.
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
you'll end up with 82,000 maps, unless setNumMapTasks(int) is used to
set it even higher.
Your question is probably related to this SO question.
To be honest, try to have a look at modern frameworks as well, like Apache Spark and Apache Flink.

Java/Scala Metrics - Codahale - Cluster/Mulitnode & Graphite Reporter

When using CodaHale Metrics in Java or Scala code a clustered environment what are the gotchas when reporting to Graphite?
If I have multiple instances of my application running and creating different metrics can Graphite cope - i.e. is reporting cumulative?
For example if I have AppInstance A and B. If B has a gauge reporting 1.2 and another reporting 1.3 - what would be the result in Graphite? Will it be an average? or will one override the other.
Are counters cumulative?
Are timers cumulative?
Or should I somehow give each instance some tag to differentiate different JVM instances?

You can find your default behavior for the case where graphite receive several points during the aggregation period in aggreagtion-rules.conf.
I think graphite default is to take last received point in the aggregation period.
If you might be interested in metric detail by process instance (and you'll probably be at some point), you should tag instances in some way and use that tag in metric path. Graphite is extremely useful for aggregation at request time and finding a way to aggregate individuals metrics (sum, avg, max, or more complex) you be difficult.
One thing that can make you reluctant to have different metrics by sender process would be if you have a very versatile environment where instances change all the time (thus creating many transient metrics). Otherwise, just use ip+pid and you'll be fine.

I added a 'count' field to every set of metrics that I knew went in at the same time. Then I aggregated all of the values including the counts as 'sum'. This let me find both the average, sum and count for all metrics in a set. (Yes, graphite's default is to take the most recent sample for a time period. You need to use the carbon aggregator front end.)
Adding the IP address to the metric name lets you calculate relative speeds for different servers. If they're all the same type and some are 4x as fast as others you have a problem. (I've seen this). As noted above adding a transitory value like IP creates a dead metric problem. If you care about history you could create a special IP for 'old' and collect dead metrics there, then remove the dead entries. In fact, the number of machines at any timer period would be a very useful metric.

We've found that the easiest way to handle this is by using per instance metrics. This way you can see how each instance is behaving independently. If you want an overall view of the cluster it's also easy to look at the sumSeries of a set of metrics by using wildcards in the metric name.
The caveat to this approach is that you are keeping track of more metrics in your graphite instance, so if you're using a hosted solution this does cost more.

How to combine multiple Hadoop MapReduce Jobs into one?

I have a massive amount of input data (that's why I use Hadoop) and there are multiple tasks that can be solved with various MapReduce steps of which the first mapper needs all the data as input.
My goal: Compute these different tasks as fast as possible.
I currently let them run sequentially each reading in all the data. I assume it will be faster when combining the tasks and executing their similar parts (like feeding all data to the mapper) only once.
I was wondering if and how I can combine these tasks. For every input key/value pair the mapper could emit a "super key" that includes a task id and the task specific key data along with a value. This way reducers would get key/value pairs for a task and a task-specific key and could decide when seeing the "superkey" which task to perform on the included key and values.
In pseudo code:
map(key, value):
emit(SuperKey("Task 1", IncludedKey), value)
emit(SuperKey("Task 2", AnotherIncludedKey), value)
reduce(key, values):
if key.taskid == "Task 1":
for value in values:
// do stuff with key.includedkey and value
else:
// do something else
The key could be a WritableComparable which can include all the necessary information.
Note: the pseudo code suggests a terrible architecture and it can definitely be done in a smarter way.
My questions are:
Is this a sensible approach?
Are there better alternatives?
Does it have some terrible drawback?
Would I need a custom Partitioner class for this approach?
Context: The data consists of some millions of RDF quadruples and the tasks are to calculate clusters, statistics and similarities. Some tasks can be solved easily with just Hadoop Counters in a reducer, but some need multiple MapReduce steps.
The computation will eventually take place on Amazon's Elastic MapReduce. All tasks are to be computed on the whole dataset and as fast as possible.

Is this a sensible approach?
There's nothing inherently wrong with it, other than the coupling of the maintenance of the different jobs' logic. I believe it will save you on some disk I/O, which could be a win if your disk is a bottleneck for your process (on small clusters this can be the case).
Are there better alternatives?
It may be prudent to write a somewhat framework-y Mapper and Reducer which each accept as configuration parameters references to the classes to which they should defer for the actual mapping and reducing. This may solve the aforementioned coupling of the code (maybe you've already thought of this).
Does it have some terrible drawback?
The only thing I can think of is that if one of the tasks' map logic fails to complete its work in a timely manner, the scheduler may fire up another node to process that piece of input data; this could result in duplicate work, but without knowing more about your process, it's hard to say whether this would matter much. The same would hold for the reducers.
Would I need a custom Partitioner class for this approach?
Probably, depending on what you're doing. I think in general if you're writing a custom output WritableComparable, you'll need custom partitioning as well. There may be some library Partitioner that could be configurable for your needs, though (such as KeyFieldBasedPartitioner, if you make your output of type Text and using String field-separators instead of rolling your own).
HTH. If you can give a little more context, maybe I could offer more advice. Good luck!

You can use:
Cascading
Oozie
Both are used to write workflows in hadoop.

I think Oozie is the best option for this. Its a workflow scheduler, where you can combine multiple hadoop jobs, where the output of one action node will be the input to the next action node. And if any of the action fails, then next time when u execute it again ,the scheduler starts from the point where the error was encountered.
http://www.infoq.com/articles/introductionOozie

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.