In Hadoop programs, I have always seen a mapper and a reducer class.
But the MapReduce algorithm is a combination of:
map
comparison
partition
reduce
So where are the partition and comparison steps in the program?
Take a look at these posts. They present map/reduce, the partitioner, the combiner and counters.
A good place to figure things out is your Hadoop deployment itself: go in and check out the source code. Most of the time you can find your answers there. Give it a go! :)
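To expand a little: the partition and comparison (sort) steps have defaults, hash partitioning and the key class's own compareTo() ordering, so you only see them in code when you override them on the Job. A minimal sketch of where those hooks live (the job name is arbitrary, and the defaults are set explicitly here just to make them visible):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WhereTheStepsLive {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "where-the-steps-live");

        // The two classes you always see:
        // job.setMapperClass(...);    // map
        // job.setReducerClass(...);   // reduce

        // The "partition" step: decides which reducer a map output key goes
        // to. HashPartitioner is the default; write your own Partitioner to change it.
        job.setPartitionerClass(HashPartitioner.class);

        // The "comparison" step: how keys are sorted (and grouped) before
        // they reach the reducer. Defaults to the key's own compareTo();
        // here Text's comparator is set explicitly as an example.
        job.setSortComparatorClass(Text.Comparator.class);
        // job.setGroupingComparatorClass(...);  // optional: change only the grouping
    }
}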
I would like to measure the time taken by the map and reduce phases when performing I/O (reading from HDFS) in Hadoop.
I am using YARN with Hadoop 2.6.0.
What are the options for that?
If you need exact measurements, you could use btrace: add it as a javaagent to your tasks via mapreduce.{map,reduce}.java.opts and then write a script that measures whatever you like. Samples of btrace scripts are here.
There is also HTrace, which might be helpful as well.
One rough estimation could be done with custom counters. In both the mapper and the reducer, collect the timestamp when the task starts processing and when it ends, compute the difference, and add it to a custom counter, i.e. mappers add to MAPPER_RUNNING_TIME and reducers add to REDUCER_RUNNING_TIME (or whatever names you would like to give them). When the execution is finished, subtract the aggregated value of your counters from MILLIS_MAPS and MILLIS_REDUCES respectively. You might need to look into the Hadoop source code to confirm whether the staging time is included in MILLIS_MAPS and MILLIS_REDUCES. With this estimation, keep in mind that the tasks execute concurrently, so the result is a total aggregated over all mappers and reducers rather than wall-clock time.
I have not done this personally, but I think this solution could work unless you find a better one.
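A minimal sketch of the counter idea on the map side (the counter group name and the mapper's input/output types are arbitrary here; the reduce side would mirror it with REDUCER_RUNNING_TIME):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TimedMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private long startMillis;

    @Override
    protected void setup(Context context) {
        // Record when this map task starts processing.
        startMillis = System.currentTimeMillis();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... the normal map logic goes here ...
    }

    @Override
    protected void cleanup(Context context) {
        // Add this task's elapsed time to a job-wide custom counter.
        long elapsed = System.currentTimeMillis() - startMillis;
        context.getCounter("TIMING", "MAPPER_RUNNING_TIME").increment(elapsed);
    }
}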
I've been looking into replacing our Oracle database of currently executing commands with a Hazelcast distributed map implementation. To do this, I need to replace our SQL queries with the Hazelcast equivalent. Hazelcast provides some built-in aggregations, such as a count, and I've been happily using them. But when I came to writing my own aggregations, I had a look at the source code for the CountAggregation. It can be found here: http://grepcode.com/file/repo1.maven.org/maven2/com.hazelcast/hazelcast/3.3-RC2/com/hazelcast/mapreduce/aggregation/impl/CountAggregation.java
Aggregations in Hazelcast are implemented using the MapReduce algorithm, but the source above seems really inefficient to me. For the mapper stage of the algorithm, they use a SupplierConsumingMapper, which simply emits mappings under the same key as the supplied key. This means the reducing stage doesn't actually reduce anything, because all of the emitted keys are different, and you end up with a whole load of 1s to count up at the final collation stage, rather than a number of partial counts to add together.
Surely they should be using a mapper which always emits the same key? Then the combiners and reducers could actually do some combining and reducing. It seems to me that the source code above uses the MapReduce model incorrectly, even though the result you end up with is correct. Have I misunderstood something?
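In Hadoop terms, what I expected is something like the following sketch, where every record is emitted under one constant key so a combiner can pre-aggregate partial counts before the reducer adds them up (this is just the pattern I mean, not Hazelcast code; the same SumReducer class can serve as both combiner and reducer):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CountJob {

    public static class CountMapper extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Every record is emitted under the same key, so partial
            // counts can be combined before reaching the reducer.
            context.write(NullWritable.get(), ONE);
        }
    }

    // Used as both combiner and reducer: sums the partial counts.
    public static class SumReducer extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
        @Override
        protected void reduce(NullWritable key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }
}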
Hey, you're absolutely correct. The implementation is a bit too simple in that place :) Can you please file an issue on GitHub so we won't forget to fix it? Thanks, Chris
I currently have a task where I need to chain a few jobs in Hadoop.
What I am doing right now is that I have two jobs. My first job has a map function, a combiner and a reducer. I need one more reduce phase, so I created a second job with a simple map task that passes the output of the previous reducer to the final reducer.
I find this a bit "stupid", because there has to be a way to simply chain the phases. Moreover, I think the I/O would be reduced that way.
I am using version 0.20.203, and I only find deprecated examples of ChainMapper and ChainReducer using JobConf.
I have found these:
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/lib/chain/ChainMapper.html
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/lib/chain/ChainReducer.html
which seem to work with the Job class rather than the deprecated JobConf, but there isn't any package in 0.20.203 that contains these classes.
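For reference, my current driver chains the two jobs roughly like this (the paths and the mapper/combiner/reducer class names are placeholders), and the intermediate write followed by the re-read is the part that feels wasteful:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // written by job 1, read by job 2
        Path output = new Path(args[2]);

        // Job 1: map + combiner + reduce
        Job job1 = new Job(conf, "first pass");
        job1.setMapperClass(FirstMapper.class);
        job1.setCombinerClass(FirstCombiner.class);
        job1.setReducerClass(FirstReducer.class);
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) System.exit(1);

        // Job 2: pass-through map feeding the final reducer
        Job job2 = new Job(conf, "second pass");
        job2.setMapperClass(PassThroughMapper.class);
        job2.setReducerClass(FinalReducer.class);
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}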
You can consider using Oozie. Creating a workflow would be much easier.
We do stats and such on large sets of data. Right now it is all done on one machine. We're studying the feasibility of moving to a map-reduce paradigm, where we decompose the data into subsets, run some operations on each subset, then combine the results.
Is there any sort of mathematical test that can be applied to a set of operations to determine if the data they operate on can be decomposed?
Or maybe a list somewhere saying what can and cannot be decomposed?
For instance, I didn't think there was a way to decompose standard deviation, but there is...
Take a look at this paper: http://www.janinebennett.org/index_files/ParallelStatisticsAlgorithms.pdf. They have algorithms for many common statistical problems, and open source code is available.
Variance, as well as the mean, can be calculated online (in a single pass); see Wikipedia. There is also a parallel algorithm for merging the partial results.
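To illustrate, partial results of the form (count, mean, M2) computed on disjoint subsets can be merged exactly; a small sketch of the single-pass update and the merge step (following the parallel formula referenced above):

public class RunningStats {
    long n;        // number of samples seen
    double mean;   // running mean
    double m2;     // sum of squared deviations from the mean

    void add(double x) {
        // Online (single-pass) update for one new sample.
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    // Merge a partial result computed on another subset of the data.
    static RunningStats merge(RunningStats a, RunningStats b) {
        RunningStats r = new RunningStats();
        r.n = a.n + b.n;
        double delta = b.mean - a.mean;
        r.mean = a.mean + delta * b.n / r.n;
        r.m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / r.n;
        return r;
    }

    double variance() {
        return n > 1 ? m2 / (n - 1) : 0.0;   // sample variance
    }
}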
Parallel computing is best suited to problems which are "embarrassingly parallel", i.e. there is no dependency between any two tasks.
Please check out http://en.wikipedia.org/wiki/Embarrassingly_parallel
Also, in cases where the operations are commutative or associative, MapReduce programs can easily be optimized for better performance (for example, by doing partial aggregation in a combiner).
I have a massive amount of input data (that's why I use Hadoop), and there are multiple tasks that can be solved with various MapReduce steps, of which the first mapper needs all the data as input.
My goal: Compute these different tasks as fast as possible.
I currently let them run sequentially, each reading in all the data. I assume it will be faster to combine the tasks and execute their shared parts (like feeding all data to the mapper) only once.
I was wondering if and how I can combine these tasks. For every input key/value pair, the mapper could emit a "super key" that includes a task id and the task-specific key data, along with a value. The reducers would then get key/value pairs for a task and a task-specific key, and could decide on seeing the "super key" which task to perform on the included key and values.
In pseudo code:
map(key, value):
    emit(SuperKey("Task 1", IncludedKey), value)
    emit(SuperKey("Task 2", AnotherIncludedKey), value)

reduce(key, values):
    if key.taskid == "Task 1":
        for value in values:
            // do stuff with key.includedkey and value
    else:
        // do something else
The key could be a WritableComparable which can include all the necessary information.
Note: the pseudo code suggests a terrible architecture and it can definitely be done in a smarter way.
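Such a key might look roughly like this (the names are made up, and the task-specific key is simplified to a string):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class SuperKey implements WritableComparable<SuperKey> {
    private Text taskId = new Text();
    private Text includedKey = new Text();

    public SuperKey() {}

    public SuperKey(String taskId, String includedKey) {
        this.taskId.set(taskId);
        this.includedKey.set(includedKey);
    }

    public Text getTaskId() { return taskId; }
    public Text getIncludedKey() { return includedKey; }

    @Override
    public void write(DataOutput out) throws IOException {
        taskId.write(out);
        includedKey.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        taskId.readFields(in);
        includedKey.readFields(in);
    }

    @Override
    public int compareTo(SuperKey other) {
        // Sort by task first, then by the task-specific key.
        int cmp = taskId.compareTo(other.taskId);
        return cmp != 0 ? cmp : includedKey.compareTo(other.includedKey);
    }

    @Override
    public int hashCode() {
        return taskId.hashCode() * 31 + includedKey.hashCode();
    }
}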
My questions are:
Is this a sensible approach?
Are there better alternatives?
Does it have some terrible drawback?
Would I need a custom Partitioner class for this approach?
Context: The data consists of some millions of RDF quadruples and the tasks are to calculate clusters, statistics and similarities. Some tasks can be solved easily with just Hadoop Counters in a reducer, but some need multiple MapReduce steps.
The computation will eventually take place on Amazon's Elastic MapReduce. All tasks are to be computed on the whole dataset and as fast as possible.
Is this a sensible approach?
There's nothing inherently wrong with it, other than the coupling of the maintenance of the different jobs' logic. I believe it will save you some disk I/O, which could be a win if your disk is a bottleneck for your process (on small clusters this can be the case).
Are there better alternatives?
It may be prudent to write a somewhat framework-y Mapper and Reducer which each accept as configuration parameters references to the classes to which they should defer for the actual mapping and reducing. This may solve the aforementioned coupling of the code (maybe you've already thought of this).
Does it have some terrible drawback?
The only thing I can think of is that if one of the tasks' map logic fails to complete its work in a timely manner, the scheduler may fire up another node to process that piece of input data; this could result in duplicate work, but without knowing more about your process, it's hard to say whether this would matter much. The same would hold for the reducers.
Would I need a custom Partitioner class for this approach?
Probably, depending on what you're doing. I think in general, if you're writing a custom output WritableComparable, you'll need custom partitioning as well. There may be some library Partitioner that can be configured for your needs, though (such as KeyFieldBasedPartitioner, if you make your output of type Text and use String field separators instead of rolling your own).
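For example, a custom partitioner could hash on both parts of the key so that each task's keys are still spread across all reducers (a sketch, assuming a key shaped like the SuperKey from your question):

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

public class SuperKeyPartitioner extends Partitioner<SuperKey, Writable> {
    @Override
    public int getPartition(SuperKey key, Writable value, int numPartitions) {
        // Combine the task id and the task-specific key into one hash
        // so no single task's keys pile up on the same reducer.
        int hash = key.getTaskId().hashCode() * 31 + key.getIncludedKey().hashCode();
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }
}

You'd register it with job.setPartitionerClass(SuperKeyPartitioner.class).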
HTH. If you can give a little more context, maybe I could offer more advice. Good luck!
You can use:
Cascading
Oozie
Both are used to write workflows in Hadoop.
I think Oozie is the best option for this. It's a workflow scheduler where you can combine multiple Hadoop jobs, so that the output of one action node becomes the input to the next action node. And if any of the actions fails, then the next time you execute the workflow, the scheduler starts from the point where the error was encountered.
http://www.infoq.com/articles/introductionOozie