What I need to do, and am having some trouble doing, is output two values for one key from my reduce function. The reduce function receives data in the form of an Id and a list of integers associated with that Id. It needs to output that Id, the average of the integers in the list, and the length of the list.
However, the implementation of the reduce function is supposed to take an OutputCollector<Text, IntWritable> as an argument, which seems to limit the output associated with each key to a single IntWritable.
Any help in this regard would be greatly appreciated. Thanks in advance.
Hadoop version: 2.0.0
You have to use MultipleOutputs. In the Job:
MultipleOutputs.addMultiNamedOutput(job, "Name",
        SequenceFileOutputFormat.class, Text.class, Writable.class);
In the reducer:
multipleOutputs.getCollector(...)
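To make this concrete, here is a rough sketch of how the pieces can fit together with the old mapred API the question is using, declaring two plain (non-multi) named outputs; the names "avg" and "len" and the class name StatsReducer are just placeholders:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

// Driver side (JobConf): declare one named output per extra "column".
// JobConf conf = new JobConf(MyJob.class);
// MultipleOutputs.addNamedOutput(conf, "avg", SequenceFileOutputFormat.class,
//         Text.class, IntWritable.class);
// MultipleOutputs.addNamedOutput(conf, "len", SequenceFileOutputFormat.class,
//         Text.class, IntWritable.class);

public class StatsReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs multipleOutputs;

    @Override
    public void configure(JobConf job) {
        multipleOutputs = new MultipleOutputs(job);
    }

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0, count = 0;
        while (values.hasNext()) {
            sum += values.next().get();
            count++;
        }
        // Average goes to the avg-r-* files, length to the len-r-* files.
        multipleOutputs.getCollector("avg", reporter)
                .collect(key, new IntWritable(sum / count));
        multipleOutputs.getCollector("len", reporter)
                .collect(key, new IntWritable(count));
    }

    @Override
    public void close() throws IOException {
        multipleOutputs.close();   // flushes the named outputs
    }
}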
Here are a few answers to your vague question.
You can call collect() as many times as you want for the same key if you don't mind each piece of data (length and mean) appearing on its own record in a blended output. This could be accomplished by writing the key differently to distinguish the different record types, as follows:
oc.collect(new Text(k.toString() + " mean"), new IntWritable(mean));
oc.collect(new Text(k.toString() + " length"), new IntWritable(length));
OR
You should choose a different value type V3 instead of IntWritable. Either create a PairOfIntWritable or use an ArrayWritable to shove whatever you want into a single call to collect. Then the length and mean can be "fields" of the value for a single record for each key.
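For instance, here is a minimal sketch of the ArrayWritable route (the IntArrayWritable subclass name is just illustrative); the job's output value class would then be set to this type instead of IntWritable:

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

// ArrayWritable needs a subclass with a no-arg constructor so it can be
// deserialized with the right element type.
public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable() {
        super(IntWritable.class);
    }
}

// In the reducer, pack both statistics into one value:
// IntArrayWritable pair = new IntArrayWritable();
// pair.set(new Writable[] { new IntWritable(mean), new IntWritable(length) });
// output.collect(key, pair);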
OR
If you absolutely have to use IntWritable, use an invertible pairing function to combine two integers into one. You'll need to ensure that no pair you could generate from your input data can exceed the maximum value of an IntWritable.
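If you go that route, one simple invertible pairing, sketched here under the assumption that both values are non-negative and fit in 16 bits, is to pack them into the two halves of a single int:

// Packing: only safe if 0 <= mean < 65536 and 0 <= length < 65536.
int packed = (mean << 16) | (length & 0xFFFF);
output.collect(key, new IntWritable(packed));

// Unpacking on the consumer side:
int unpackedMean = packed >>> 16;
int unpackedLength = packed & 0xFFFF;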
OR
Use MultipleOutputs to send one record each to a different file distinguished by name, so that the part-r-nnnnn files contain means and the length-r-nnnnn files contain lengths, for example. The JavaDoc on MultipleOutputs explains its use.
I am relatively new to the Hadoop world. I have been following the examples I could find to understand how the record-splitting step works for MapReduce jobs. I noticed that TextInputFormat splits a file into records with the byte offset as the key and the line of text as the value. In this case, two different records in a mapper could have the same offset because they come from different input files.
Does this affect the mapper in any way? I think the uniqueness of the key is irrelevant to the mapper if we do not process it (e.g., wordcount). But if we have to process it in the mapper, the key may have to be unique. Can anyone elaborate on this?
Thanks in advance.
The input to a mapper is a file split (or HDFS block), not a key-value pair. In other words, the mapper itself creates the key-value pairs and is not impacted by duplicate keys.
The grouped output of the map phase, as the reducer eventually sees it, is effectively a multivalued map:
<Key, List<Values>>
This output becomes the input to the Reducer. All values of a key are processed by the same reducer. It is fine for mappers to create more than one value for a key; in fact, some solutions depend on this behavior.
Actually, the answer to your question depends entirely on the scenario.
If you are not using the key (the byte offset from TextInputFormat is rarely used, although with KeyValueTextInputFormat you probably are using it), then it has no impact. But if your map() logic performs some calculation based on the key, then it will definitely have an impact.
So it totally depends on the scenario.
There is a misunderstanding here. Actually:
For every input split of the file, one Mapper will be assigned. All the records from a single input split will be processed by only one Mapper for a given job.
You do not need to worry about records with duplicate keys arriving at a mapper, since the mapper's execution scope is always a single key/value pair at a time.
The output of a map task, which is some number of key/value pairs, is eventually merged, sorted, and partitioned based on the keys.
Each reducer collects its partition of the output from all the mappers and arranges the key/value pairs in memory as <key, Iterable<value>>.
I'm still trying to get an intuition as to when to use the Hadoop combiner class (I saw a few articles but they did not specifically help in my situation).
My question is, is it appropriate to use a combiner class when the value of the pair is of the Text class? For instance, let's say we have the following output from the mapper:
fruit apple
fruit orange
fruit banana
...
veggie carrot
veggie celery
...
Can we apply a combiner class here so that what reaches the reducer becomes:
fruit apple orange banana
...
veggie carrot celery
...
before it even reaches the reducer?
Combiners are typically suited to problems where you are performing some form of aggregation (min, max, sum, etc.) on the data: these values can be calculated in the combiner for the map output, and then calculated again in the reducer over all the combined outputs. This is useful because it means you are not transferring all the data across the network between the mappers and the reducers.
Now there is no reason you can't introduce a combiner to accumulate a list of the values observed for each key (I assume this is what your example shows), but there are some things that would make it trickier.
If you have to output <Text, Text> pairs from the mapper, and consume <Text, Text> in the reducer then your combiner can easily concatenate the list of values together and output this as a Text value. Now in your reducer, you can do the same, concatenate all the values together and form one big output.
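A sketch of what such a concatenating combiner could look like with the old mapred API used elsewhere in this thread (the class name is illustrative); the reducer can then apply exactly the same logic to the partially concatenated values:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ConcatCombiner extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        StringBuilder list = new StringBuilder();
        while (values.hasNext()) {
            if (list.length() > 0) {
                list.append(' ');
            }
            list.append(values.next().toString());
        }
        // Emits e.g. ("fruit", "apple orange banana"); the reducer
        // concatenates these partial lists again in the same way.
        output.collect(key, new Text(list.toString()));
    }
}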
You may run into a problem if you wanted to sort and dedup the output list - as the combiner / reducer logic would need to tokenize the Text object back into words, sort and dedup the list and then rebuild the list of words.
To directly answer your question about when it would be appropriate, I can think of some examples:
If you wanted to find the lexicographically smallest or largest value associated with each key
You have millions of values for each key and you want to 'randomly' sample a small set of the values
A combiner class is used when the operation being performed is commutative or associative. Commutative example:
a*b*c = c*b*a. During the combine task you can compute a*b = d and send d and c to the reducer. The reducer then has to perform only one operation (d*c) instead of two (a*b = d, then d*c) to get the final answer.
Similarly for associativity: (a+b)+c = a+(b+c).
Associativity (grouping) and commutativity (reordering) mean the result does not differ however you group or order the multiplications or additions. A combiner is mainly useful for operations that obey associativity and commutativity.
Advantages of a combiner:
It reduces network I/O between the map and reduce phases.
It reduces disk I/O in the reducer, since part of the work already happens in the combiner.
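As a concrete example, integer addition is both commutative and associative, so the classic word-count sum reducer can also be registered as the combiner. A sketch of the relevant driver lines with the old JobConf API (the Map, Reduce, and WordCount class names are illustrative):

JobConf conf = new JobConf(WordCount.class);   // WordCount is a placeholder driver class
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);   // computes partial sums on the map side
conf.setReducerClass(Reduce.class);    // computes the final sums on the reduce side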
It sounds like a simple job, but with MapReduce it doesn't seem that straightforward.
I have N files, each of which contains only one line of text. I'd like the Mapper to output key-value pairs like <filename, score>, in which 'score' is an integer calculated from the line of text. As a side note, I am using the snippet below to get the filename (I hope it's correct).
FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
String fileName = fileSplit.getPath().getName();
Assuming the mapper does its job correctly, it should output N key value pairs.
Now the problem is how should I program the Reducer to output the one key value pair with the maximum 'score'?
From what I know, a Reducer only works with key-value pairs that share the same key. Since the outputs in this scenario all have different keys, I am guessing something should be done before the Reduce step. Or perhaps the Reduce step should be omitted altogether?
Let's assume that
File1 has 10,123,23,233
File2 has 1,3,56,1234
File3 has 6,1,3435,678
Here is the approach for finding the maximum number from all the input files.
Let's first do some random sampling (say, every Nth record): from File1, 123 and 10; from File2, 56 and 1; from File3, 1 and 678.
Pick the maximum number from the random sampling, which is 678.
Pass the maximum number from the random sampling to the mappers, ignore any input number less than that maximum, and emit the rest. The mappers will ignore anything less than 678 and emit 678, 1234, and 3435.
Configure the job to use 1 reducer and find the max of all the numbers sent to it. In this scenario the reducer will receive 678, 1234, and 3435, and will calculate the maximum to be 3435.
Some observations on the above approach:
The data has to be passed twice.
The data transferred between the mappers and reducers is decreased.
The data processed by the reducers also decreases.
The better the input sampling, the faster the job completes.
A combiner with the same functionality as the reducer will further improve the job time.
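A rough sketch of the mapper-side filtering, assuming the sampled maximum is handed to the mappers through the job configuration under a made-up property name such as score.threshold:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ThresholdMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private int threshold;

    @Override
    public void configure(JobConf job) {
        // "score.threshold" is an illustrative property set by the driver.
        threshold = job.getInt("score.threshold", Integer.MIN_VALUE);
    }

    @Override
    public void map(LongWritable offset, Text line,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        for (String token : line.toString().split(",")) {
            int value = Integer.parseInt(token.trim());
            if (value >= threshold) {
                // Single constant key so the lone reducer sees every survivor.
                output.collect(new Text("max"), new IntWritable(value));
            }
        }
    }
}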
You can use the setup() and cleanup() methods (configure() and close() in the old API).
Declare an instance variable in the reduce class that tracks the maximum score. On each call to reduce, compare the input value (the score) with that variable.
setup() is called once before all reduce invocations in the same reduce task, and cleanup() is called after the last reduce invocation in the same reduce task. So, if you have multiple reducers, setup() and cleanup() are called separately in each reduce task.
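A minimal sketch of that pattern with the newer mapreduce API's setup()/cleanup() hooks (the class and field names are illustrative); note that with more than one reduce task you would get one maximum per task and would still need a final step to pick the overall winner:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxScoreReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private Text maxKey = null;
    private int maxScore = Integer.MIN_VALUE;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        for (IntWritable value : values) {
            if (maxKey == null || value.get() > maxScore) {
                maxScore = value.get();
                maxKey = new Text(key);   // copy: Hadoop reuses the key object
            }
        }
        // Nothing is written here; the single best pair is emitted in cleanup().
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        if (maxKey != null) {
            context.write(maxKey, new IntWritable(maxScore));
        }
    }
}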
You can emit the filename and the score together as the value, and just emit any constant as the key from your mapper.
Refer slide 32 & 33 of http://www.slideshare.net/josem.alvarez/map-reduceintro
I used the same approach and got the result. The only concern is that when you have multiple fields, you need to create fieldnamemin and fieldnamemax individually.
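A sketch of that constant-key idea, reusing the FileSplit snippet from the question; the key string "all" and the computeScore() helper are placeholders:

public void map(LongWritable offset, Text line,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
    String fileName = fileSplit.getPath().getName();
    int score = computeScore(line.toString());   // computeScore() is hypothetical
    // Every record shares the key "all", so a single reducer sees all of
    // them and can keep just the pair with the highest score.
    output.collect(new Text("all"), new Text(fileName + "\t" + score));
}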
Omit the Reducer!
Use the Configuration to hold the current maximum score and its key as a global variable, then access it in the mapper to do a simple selection of the max score, using that global variable as the memory of the max score and key.
It should be simple. I guess.
The current methods to set or increment Hadoop counters only take long values.
eg: increment(long incr) and setValue(long value) are two methods I pulled out from the Hadoop Javadocs.
My requirement is to store a more complex type of information as part of the counters (as key/value pairs). This info might involve (string, string) key/value pairs.
How do I achieve this using Hadoop counters?
If this is not possible, is there any other data structure or facility in Hadoop/MapReduce that allows storing such miscellaneous information so that it can be retrieved later by specifying the job_id, etc.?
Thanks,
Params
Counters work because counts can be summed: each task has its own counter, and these can be aggregated higher up. Strings don't carry the same kind of information (how do you increment a string?).
Check out ZooKeeper for this. It is great for storing miscellaneous information and coordinating between processes. You can create a znode that represents a job run (the job_id perhaps?) and then have individual strings as children.
Since Hadoop counters support only long values (not strings), key/value pairs can be used to collect string statistics via OutputCollector.collect(K, V) in the map and reduce functions.
The advantage of this is that statistics emitted from the mapper through the OutputCollector can be further processed (e.g., aggregated) in the reducer, while statistics emitted from the reducer are simply written to the specified output format without any further processing.
A simple wordcount reducer in Ruby looks like this:
#!/usr/bin/env ruby

# Accumulate a count for every key read from STDIN ("word|count" per line).
wordcount = Hash.new
STDIN.each_line do |line|
  keyval = line.split("|")
  wordcount[keyval[0]] = wordcount[keyval[0]].to_i + keyval[1].to_i
end

# Emit the final totals.
wordcount.each_pair do |word, count|
  puts "#{word}|#{count}"
end
It gets all the mappers' intermediate values on STDIN, not just the values for a specific key.
So actually there is only ONE reducer for everything (and not a reducer per word or per set of words).
However, in Java examples I saw this interface that gets a key and a list of values as input, which means the intermediate map values are grouped by key before being reduced, and reducers can run in parallel:
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Is this a Java only feature? Or can I do it with Hadoop Streaming using Ruby?
Reducers will always run in parallel, whether you're using streaming or not (if you're not seeing this, verify that the job configuration is set to allow multiple reduce tasks -- see mapred.reduce.tasks in your cluster or job configuration). The difference is that the framework packages things up a little more nicely for you when you use Java versus streaming.
For Java, the reduce task gets an iterator over all the values for a particular key. This makes it easy to walk the values if you are, say, summing the map output in your reduce task. In streaming, you literally just get a stream of key-value pairs. You are guaranteed that the values will be ordered by key, and that the values for a given key will not be split across reduce tasks, but any state tracking you need is up to you. For example, in Java your map output comes to your reducer symbolically in the form
key1, {val1, val2, val3}
key2, {val7, val8}
With streaming, your output instead looks like
key1, val1
key1, val2
key1, val3
key2, val7
key2, val8
For example, to write a reducer that computes the sum of the values for each key, you'll need a variable to store the last key you saw and a variable to store the sum. Each time you read a new key-value pair, you do the following:
check if the key is different from the last key.
if so, output your key and current sum, and reset the sum to zero.
add the current value to your sum and set last key to the current key.
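Here is a minimal sketch of that last-key pattern, written in Java for consistency with the other examples in this thread; a streaming reducer can be any executable that reads STDIN, so the same logic ports directly to Ruby:

import java.io.BufferedReader;
import java.io.InputStreamReader;

// A sketch of a streaming-style sum reducer: reads "key<TAB>value" lines
// from STDIN, which arrive sorted by key.
public class StreamingSumReducer {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String lastKey = null;
        long sum = 0;
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            String key = parts[0];
            long value = Long.parseLong(parts[1].trim());
            if (lastKey != null && !lastKey.equals(key)) {
                System.out.println(lastKey + "\t" + sum);   // key changed: flush
                sum = 0;
            }
            lastKey = key;
            sum += value;
        }
        if (lastKey != null) {
            System.out.println(lastKey + "\t" + sum);       // flush the final key
        }
    }
}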
HTH.
I haven't tried Hadoop Streaming myself but from reading the docs I think you can achieve similar parallel behavior.
Instead of passing a key with the associated values to each reducer, streaming will group the mapper output by keys. It also guarantees that values with the same keys won't be split over multiple reducers. This is somewhat different from normal Hadoop functionality, but even so, the reduce work will be distributed over multiple reducers.
Try to use the -verbose option to get more information about what's really going on. You can also try to experiment with the -D mapred.reduce.tasks=X option where X is the desired number of reducers.