Hadoop: Chaining jobs in 0.20.203

Hadoop: Chaining jobs in 0.20.203 - java

I currently have a task where i need to chain a few jobs in Hadoop.
What i am dong right now is that i have 2 jobs. My first job has a map function,a combiner and a reducer. Well i need one more phase of reduce so i created a second job with a simple map task that passes the output of the previous reducer to the final reducer.
I find that this is a bit "stupid" because there has to be a way to simply chain this. Moreover i think the I/Os would be decreased that way.
I am using the 0.20.203 version and i only find deprecated examples of ChainMapper and ChainReducer using JobConf.
I have found these:
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/lib/chain/ChainMapper.html
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/lib/chain/ChainReducer.html
that seems to work with Job class and not with the JobConf which is deprecated in 203, but there isn't any package that contains these classes in 203.

You can consider using oozie. Creating a workflow would be much easier.

Related

I/O time when reading from HDFS in Hadoop

I would like to measure the time taken for map and reduce when performing I/O (reading from HDFS) in Hadoop. I am using Yarn.
Hadoop 2.6.0.
What are the options for that?

If you need exact measurments - you could use btrace, add it as a javaagent to your tasks via mapreduce.{map,reduce}.java.opts - and then write script which measures whatever you like. Sample of btrace scripts are here.
Also there is HTrace - that might also be helpful.

One rough estimation could be creating custom counters. For both mapper and reducer you could collect the timestamp when mapper(or reducer) starts processing and when it ends. From starting and ending timestamp, calculate and add it to custom counters, i.e mappers add to MAPPER_RUNNING_TIME and reducers add to REDUCER_RUNNING_TIME (or whatever name you would like to give it). When the execution is finished, subtract the aggregated value of your counters from MILLIS_MAPS and MILLIS_REDUCES respectively. You might need to look into Hadoop source code though to confirm if the staging time is or is not being included into MILLIS_MAPS and MILLIS_REDUCES. With this estimation you would need to take into account that the tasks are being executed concurrently, so the time will be rather total (or aggregated for all mappers and reducers).
I have not done this personally, but I think this solution could work unless you find better one.

Is Hazelcast's built in CountAggregation really inefficient?

I've been looking into replacing our Oracle database of currently-executing commands with a Hazelcast distributed map implementation. To do this, I need to replace our SQL queries with the Hazelcast equivalent. Hazelcast provides some built-in aggregations, such as a count. I've been happily using this, but when I came to writing my own aggregations, I had a look at the source code for the CountAggregation. It can be found here: http://grepcode.com/file/repo1.maven.org/maven2/com.hazelcast/hazelcast/3.3-RC2/com/hazelcast/mapreduce/aggregation/impl/CountAggregation.java
Aggregations in Hazelcast are implemented using the MapReduce algorithm. But to me, the source above seems to be really inefficient. For the Mapper stage of the algorithm, they use a SupplierConsumingMapper, which simply emits mappings using the same key as the supplied key. What this then means is that the reducing stage doesn't actually reduce anything, because all of the emitted keys are different, and you end up with a whole load of 1's to count up at the final collation stage, rather than a number of partial counts to add together.
Surely what they should be doing is using a mapper which always emits the same key? Then the combiners and reducers could actually do some combining and reducing. It seems to me that the source code above is incorrectly using the MapReduce model, although the result you end up with is correct. Have I misunderstood something?

Hey you're absolutely correct. The implementation is a bit to simple at that place :) Can you please file an issue at github so we won't forget to fix that one. Thanks Chris

Set the number of map tasks

While configuring a Map Reduce job, I know that one can set the number of reduce tasks by using the method job.setNumReduceTasks(2);.
Can we set the number of map tasks?
I don't see any methods to do this.
If there is no such functionality, does someone know why this framework has the ability to have more than 1 reduce task, but not more than 1 map task?

There used to be property for setting the number of map tasks, which was setNumMapTasks. Bur it was merely a hint to the framework, and could not guarantee that you'll get only the specified number of maps. The map creation is actually governed by the InputFormat you are using in your job. And this is the reason it is not supported anymore.
If you are not happy with the number of mappers created by the framework, you could try tweaking the values of following 2 properties as per your requirements :
- mapred.min.split.size
- mapred.max.split.size

Number of map tasks is not something the programmer sets,rather its something that the hadoop framework,in particular the TaskTracker that creates as many mappers as the number of input splits(generally of 64mb but can be changed) of the InputFile from HDFS...

In hadoop programming, where is partition, comparison?

In hadoop programs , i have always seen mapper and reducer class.
But mapreduce algorithm is a combination of
map
comparison
partition
reduce
So where are partition, comparison processes in program?

Take a look at these posts. They present map/reduce, partitioner, combiner and counters

A good place where you can figure things out is if you go into your hadoop deployment, and checkout the source code. Most of the time you can find your answers there. Give it a go! :)

How to combine multiple Hadoop MapReduce Jobs into one?

I have a massive amount of input data (that's why I use Hadoop) and there are multiple tasks that can be solved with various MapReduce steps of which the first mapper needs all the data as input.
My goal: Compute these different tasks as fast as possible.
I currently let them run sequentially each reading in all the data. I assume it will be faster when combining the tasks and executing their similar parts (like feeding all data to the mapper) only once.
I was wondering if and how I can combine these tasks. For every input key/value pair the mapper could emit a "super key" that includes a task id and the task specific key data along with a value. This way reducers would get key/value pairs for a task and a task-specific key and could decide when seeing the "superkey" which task to perform on the included key and values.
In pseudo code:
map(key, value):
emit(SuperKey("Task 1", IncludedKey), value)
emit(SuperKey("Task 2", AnotherIncludedKey), value)
reduce(key, values):
if key.taskid == "Task 1":
for value in values:
// do stuff with key.includedkey and value
else:
// do something else
The key could be a WritableComparable which can include all the necessary information.
Note: the pseudo code suggests a terrible architecture and it can definitely be done in a smarter way.
My questions are:
Is this a sensible approach?
Are there better alternatives?
Does it have some terrible drawback?
Would I need a custom Partitioner class for this approach?
Context: The data consists of some millions of RDF quadruples and the tasks are to calculate clusters, statistics and similarities. Some tasks can be solved easily with just Hadoop Counters in a reducer, but some need multiple MapReduce steps.
The computation will eventually take place on Amazon's Elastic MapReduce. All tasks are to be computed on the whole dataset and as fast as possible.

Is this a sensible approach?
There's nothing inherently wrong with it, other than the coupling of the maintenance of the different jobs' logic. I believe it will save you on some disk I/O, which could be a win if your disk is a bottleneck for your process (on small clusters this can be the case).
Are there better alternatives?
It may be prudent to write a somewhat framework-y Mapper and Reducer which each accept as configuration parameters references to the classes to which they should defer for the actual mapping and reducing. This may solve the aforementioned coupling of the code (maybe you've already thought of this).
Does it have some terrible drawback?
The only thing I can think of is that if one of the tasks' map logic fails to complete its work in a timely manner, the scheduler may fire up another node to process that piece of input data; this could result in duplicate work, but without knowing more about your process, it's hard to say whether this would matter much. The same would hold for the reducers.
Would I need a custom Partitioner class for this approach?
Probably, depending on what you're doing. I think in general if you're writing a custom output WritableComparable, you'll need custom partitioning as well. There may be some library Partitioner that could be configurable for your needs, though (such as KeyFieldBasedPartitioner, if you make your output of type Text and using String field-separators instead of rolling your own).
HTH. If you can give a little more context, maybe I could offer more advice. Good luck!

You can use:
Cascading
Oozie
Both are used to write workflows in hadoop.

I think Oozie is the best option for this. Its a workflow scheduler, where you can combine multiple hadoop jobs, where the output of one action node will be the input to the next action node. And if any of the action fails, then next time when u execute it again ,the scheduler starts from the point where the error was encountered.
http://www.infoq.com/articles/introductionOozie

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.