Hadoop Combiner Class for Text - java

I'm still trying to get an intuition as to when to use the Hadoop combiner class (I saw a few articles but they did not specifically help in my situation).
My question is, is it appropriate to use a combiner class when the value of the pair is of the Text class? For instance, let's say we have the following output from the mapper:
fruit apple
fruit orange
fruit banana
...
veggie carrot
veggie celery
...
Can we apply a combiner class here so that the output becomes:
fruit apple orange banana
...
veggie carrot celery
...
before it even reaches the reducer?

Combiners are typically suited to a problem where you are performing some form of aggregation (min, max, etc.) on the data - these values can be calculated in the combiner for the map output, and then calculated again in the reducer for all the combined outputs. This is useful as it means you are not transferring all the data across the network between the mappers and the reducer.
Now there is no reason that you can't introduce a combiner to accumulate a list of the values observed for each key (I assume this is what your example shows), but there are some things which would make it trickier.
If you have to output <Text, Text> pairs from the mapper, and consume <Text, Text> in the reducer then your combiner can easily concatenate the list of values together and output this as a Text value. Now in your reducer, you can do the same, concatenate all the values together and form one big output.
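For instance, a rough (untested) sketch of such a concatenating combiner, using the newer mapreduce API with illustrative class and variable names, could look like this:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Concatenates every value seen for a key into one space-separated Text value.
public class ConcatCombiner extends Reducer<Text, Text, Text, Text> {
    private final Text result = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        for (Text value : values) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(value.toString());
        }
        result.set(sb.toString());
        context.write(key, result);
    }
}

You would register it with job.setCombinerClass(ConcatCombiner.class); the same class could also serve as the reducer here, since a concatenation of concatenations is still a concatenation.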
You may run into a problem if you want to sort and dedup the output list, as the combiner / reducer logic would need to tokenize the Text object back into words, sort and dedup the list, and then rebuild the list of words.
To directly answer your question - when would it be appropriate? Well, I can think of some examples:
If you wanted to find the lexicographical smallest or largest value associated with each key
You have millions of values for each key and you want to 'randomly' sample a small set of the values

A combiner class is used when the operation being performed is commutative or associative.
Commutative example: a·b·c = c·b·a. During the combine task you can compute a·b = d and send d and c to the reducer. The reducer then has to perform only one operation (d·c) instead of two (a·b = d, then d·c) to get the final answer.
Similarly for associativity: (a + b) + c = a + (b + c).
Associative (grouping) and commutative (reordering) operations give the same result no matter how you group or reorder the operands, so a combiner is mainly useful for data whose aggregation obeys the associative and commutative properties.
Advantages of a combiner:
It reduces network I/O between the map and reduce phases.
It reduces disk I/O in the reducer, since part of the work already happens in the combiner.
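For a concrete example of an operation that is both commutative and associative, here is a minimal (untested) sketch of the classic word-count sum combiner; the class name is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the partial counts for each word. This is safe as a combiner because
// addition is commutative and associative, so map-side partial sums do not
// change the final result computed by the reducer.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable partialSum = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        partialSum.set(sum);
        context.write(key, partialSum);
    }
}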

Related

Does record splitting need to generate unique keys for each record in hadoop?

I am relatively new to the Hadoop world. I have been following the examples I could find to understand how the record-splitting step works for MapReduce jobs. I noticed that TextInputFormat splits a file into records with the byte offset as the key and the line contents as a string value. In this case, two different records in a mapper could have the same offset because they come from different input files.
Does it affect the mapper in any way? I think the uniqueness of the key to the mapper is irrelevant if we do not process it (e.g. word count). But if we have to process it in the mapper, the key may have to be unique. Can anyone elaborate on this?
Thanks in advance.
Input to a mapper is a file (or HDFS block), not a key-value pair. In other words, the mapper itself creates the key-value pairs and is not impacted by duplicate keys.
The "final" output generated by a Mapper is a multivalued hashmap.
< Key, <List of Values>>
This output becomes input to Reducer. All values of a key are processed by same reducer. It is ok for mappers to create more than one value for a key. Infact some solutions depend on this behavior.
Actually, the answer to your question depends entirely on the scenario.
If you are not using the key (the byte offset from TextInputFormat is rarely used, though if you are using KeyValueTextInputFormat you may well be using it), then it has no impact. But if your map() logic does calculations based on the key, then it will definitely have an impact.
So it depends entirely on the scenario.
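For illustration, here is a sketch of a word-count style mapper in which the byte-offset key is simply ignored, so duplicate offsets across input files are harmless (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The offset key is never read, so two records from different files
        // that happen to share the same byte offset cannot affect the output.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}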
There is a misunderstanding. Actually,
For every input split of the file one Mapper will be assigned. All the
records from a single input split will be processed by only one Mapper
for a given job.
You do not need to worry about records with duplicate keys arriving at a mapper, since the mapper's execution scope is always a single key/value pair at any point in time.
The output from a mapper task, which is some number of key/value pairs, is eventually merged, sorted, and partitioned based on the keys.
The reducer collects the required outputs from all mappers according to the partition, brings them into the reducer's memory, and arranges the key/value pairs as <key, Iterable<value>>.

Multiple outputs for one key for reducer function, Hadoop

What I need to do and am having some trouble doing is to have two values output for one key as the output to my reduce function. The reduce function receives data in the form of an Id and a list of integers associated with that Id. It needs to output that Id, the average of the integers in the list and the length of the list.
However, the implementation of the reduce function is supposed to have OutputCollector <Text, IntWritable> as an argument which clearly limits the number of outputs associated with each key to 1.
Any help in this regard would be greatly appreciated. Thanks in advance.
Hadoop version: 2.0.0
You have to use MultipleOutputs. In the job setup (this is the old mapred API, where the first argument is the JobConf):

MultipleOutputs.addMultiNamedOutput(job, "Name",
        SequenceFileOutputFormat.class, Text.class, Writable.class);

In the reducer, get an OutputCollector for the named output with multipleOutputs.getCollector(...) and call collect(key, value) on it.
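If you are on the newer mapreduce API (available in Hadoop 2.x), the rough equivalent is addNamedOutput in the driver plus write() in the reducer. A sketch, with the "stats" output name and class names chosen purely for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Driver side: declare the named output on the Job.
// MultipleOutputs.addNamedOutput(job, "stats",
//         SequenceFileOutputFormat.class, Text.class, IntWritable.class);

public class NamedOutputReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        for (IntWritable value : values) {
            mos.write("stats", key, value); // goes to the "stats" named output files
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flushes the extra output files
    }
}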
Here are a few answers to your vague question.
You can call collect() as many times as you want for the same key if you don't mind each piece of data (length and mean) appearing on its own record in a blended output. This can be accomplished by writing the key differently to distinguish the different record types, as follows:
oc.collect(new Text(k.toString() + " mean"), mean);
oc.collect(new Text(k.toString() + " length"), length);
OR
You could choose a different value type V3 instead of IntWritable. Either create a PairOfIntWritable or use an ArrayWritable to shove whatever you want into a single call to collect. Then the length and mean can be "fields" of the value in a single record for each key (see the sketch after these options).
OR
If you absolutely have to use IntWritable, use an invertible pairing function to combine the two integers into one. You'll need to ensure that you can't exceed the maximum value of an IntWritable with any possible pair you could generate from your input data.
OR
Use MultipleOutputs to send one record each to a different file distinguished by name, so the part-r-nnnnn contain means and length-r-nnnnn contain lengths, for example. The JavaDoc on MultipleOutputs explains its use.
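Here is a rough sketch of the ArrayWritable option mentioned above; the class name is illustrative, and in practice you usually define a small ArrayWritable subclass (with a no-arg constructor) so the value can be read back downstream:

import java.io.IOException;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

// Packs mean and length into one ArrayWritable value per key, so a single
// write per key carries both numbers.
public class MeanLengthReducer extends Reducer<Text, IntWritable, Text, ArrayWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        int count = 0;
        for (IntWritable value : values) {
            sum += value.get();
            count++;
        }
        int mean = (count == 0) ? 0 : (int) (sum / count);
        ArrayWritable pair = new ArrayWritable(IntWritable.class,
                new Writable[] { new IntWritable(mean), new IntWritable(count) });
        context.write(key, pair);
    }
}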

Specifying text/string types as value for Hadoop counters

The current methods to set/increment hadoop counters only take in long values.
eg: increment(long incr) and setValue(long value) are two methods I pulled out from the Hadoop Javadocs.
My requirement is to store more complex type of information as part of the counters (as key/value pairs). This info might involve (string, string) key,value pairs.
How do I achieve this using Hadoop counters?
If this is not possible, is there any other datastructure/facility in Hadoop/MR that allows storing such misc information that could be retrieved later by specifying the job_id, etc.
Thanks,
Params
Counters work because counts can be summed: each task has its own counter, which is aggregated further up. Strings don't carry the same type of information (how do you increment a string?).
Check out ZooKeeper for this. It is great for storing miscellaneous information and coordinating between processes. You can create a znode that represents a job run (the job_id perhaps?) and then have individual strings as children.
Since Hadoop Counters do not support String values (only long counts), key/value pairs can instead be collected as statistics using OutputCollector.collect(K, V) in the map and reduce functions.
The advantage of this is that statistics emitted from the mapper through the OutputCollector can be further processed (e.g. aggregated) in the reducer function. Statistics emitted from the reducer are just written to the specified output format without any further processing.
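As a sketch of that idea, a mapper can emit its statistics as ordinary key/value pairs from cleanup(); the "stat:"/"data:" prefixes, the field layout, and the class name here are assumptions for illustration only:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StatsEmittingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private long malformedRecords = 0;
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t");
        if (fields.length < 2) {
            malformedRecords++;   // the statistic we want to report
            return;
        }
        outKey.set("data:" + fields[0]);
        context.write(outKey, new Text(fields[1]));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the statistic as a regular pair; the reducer can aggregate all
        // "stat:" keys and write them separately from the "data:" keys.
        context.write(new Text("stat:malformedRecords"),
                new Text(Long.toString(malformedRecords)));
    }
}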

Sorting large data using MapReduce/Hadoop

I am reading about MapReduce and the following thing is confusing me.
Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows:
Write a mapper function that sorts integers. So the framework will divide the input file into multiple chunks and would give them to different mappers. Each mapper will sort their chunk of data independent of each other. Once all the mappers are done, we will pass each of their results to Reducer and it will combine the result and give me the final output.
My doubt is: if we have one reducer, then how does this leverage the distributed framework if, eventually, we have to combine the result in one place? The problem boils down to merging 1 million entries in one place. Is that so, or am I missing something?
Thanks,
Chander
Check out merge-sort.
It turns out that sorting partially sorted lists is much more efficient in terms of operations and memory consumption than sorting the complete list.
If the reducer gets 4 sorted lists it only needs to look for the smallest element of the 4 lists and pick that one. If the number of lists is constant this reducing is an O(N) operation.
Also, typically the reducers are themselves "distributed" in something like a tree, so the work can be parallelized too.
As others have mentioned, merging is much simpler than sorting, so there's a big win there.
However, doing an O(N) serial operation on a giant dataset can be prohibitive, too. As you correctly point out, it's better to find a way to do the merge in parallel, as well.
One way to do this is to replace the partitioning function, swapping the default hash partitioner (which is what's normally used) for something a bit smarter. What Pig does for this, for example, is sample your dataset to come up with a rough approximation of the distribution of your values, and then assign ranges of values to different reducers. Reducer 0 gets all elements < 1000, reducer 1 gets all elements >= 1000 and < 5000, and so on. Then you can do the merge in parallel, and the end result is sorted, since you know the ordering of the reducer tasks.
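A sketch of such a range partitioner follows; the cut points are hard-coded purely for illustration, whereas in practice they come from sampling (as Pig does) or from Hadoop's TotalOrderPartitioner:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Keys below the first cut go to reducer 0, keys below the second cut to
// reducer 1, and so on, so concatenating part-r-00000, part-r-00001, ...
// yields a globally sorted result.
public class RangePartitioner extends Partitioner<IntWritable, Text> {
    private static final int[] CUTS = { 1000, 5000 }; // boundaries for 3 reducers

    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        for (int i = 0; i < CUTS.length && i < numPartitions - 1; i++) {
            if (key.get() < CUTS[i]) {
                return i;
            }
        }
        return Math.min(CUTS.length, numPartitions - 1);
    }
}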
So the simplest way to sort using map-reduce (though not the most efficient one) is to do the following:
During the map phase:
For each (Input_Key, Input_Value), emit (Input_Value, Input_Key)
The reducer is an identity reducer.
So, for example, if our data is a student/age database, then your mapper input would be
('A', 1) ('B',2) ('C', 10) ... and the output would be
(1, A) (2, B) (10, C)
I haven't tried this logic out, but it is a step in a homework problem I am working on. I will post an update with the source code / a link to the logic.
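In the meantime, here is a minimal sketch of the swap-based map phase described above, assuming tab-separated "name<TAB>age" input lines (the class name and input layout are assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (age, name); the shuffle sorts by key, so an identity reducer
// receives (and writes out) the records in age order.
public class SwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t");
        context.write(new IntWritable(Integer.parseInt(parts[1])), new Text(parts[0]));
    }
}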
Sorry for being late but for future readers, yes, Chander, you are missing something.
The logic is that a reducer can only handle the shuffled and sorted data of the node on which it is running. A reducer running on one node cannot look at another node's data; it applies the reduce algorithm to its own data only. So the merging procedure of merge sort cannot be applied across nodes.
So for big data we use TeraSort, which is nothing but an identity mapper and reducer with a custom partitioner. You can read more about it in Hadoop's implementation of TeraSort. It states:
"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."
I think combining multiple sorted lists is more efficient than combining multiple unsorted lists. So the mappers do the task of sorting chunks, and the reducer merges them. Had the mappers not done the sorting, the reducer would have a tough time doing the sort.
Sorting can be efficiently implemented using MapReduce. But you seem to be thinking about implementing merge-sort using mapreduce to achieve this purpose. It may not be the ideal candidate.
Like you alluded to, the mergesort (with map-reduce) would involve following steps:
Partition the elements into small groups and assign each group to the mappers in a round-robin manner
Each mapper will sort its subset and return {K, {subset}}, where K is the same for all the mappers
Since the same K is used across all mappers, there is only one reduce group and hence only one reducer. The reducer can merge the data and return the sorted result
The problem here is that, as you mentioned, there can be only one reducer, which precludes parallelism during the reduce phase. As mentioned in other replies, MapReduce-specific implementations like TeraSort can be considered for this purpose.
Found the explanation at http://www.chinacloud.cn/upload/2014-01/14010410467139.pdf
Coming back to merge-sort, this would be feasible if the Hadoop (or equivalent) tool provided a hierarchy of reducers, where the output of one level of reducers goes to the next level of reducers or loops back to the same set of reducers.

Detect changes in random ordered input (hash function?)

I'm reading lines of text that can come in any order. The problem is that the output can actually be identical to the previous output. How can I detect this without sorting the output first?
Is there some kind of hash function that can take identical input, but in any order, and still produce the same result?
The easiest way would seem to be to hash each line on the way in, store the hash alongside the original data, and then compare each new hash with your collection of existing hashes. If you get a match, compare the actual data to make sure it's not a false positive. Since collisions are extremely rare, you could go with a quicker hash algorithm, like MD5 or CRC (instead of something like SHA, which is slower but less likely to collide), just so it's fast, and then compare the actual data whenever you get a hit.
So you have input like
A B C D
D E F G
C B A D
and you need to detect that the first and third lines are identical?
If you want to find out if two files contain the same set of lines, but in a different order, you can use a regular hash function on each line individually, then combine them with a function where ordering doesn't matter, like addition.
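A sketch of that combination step, using String.hashCode purely for illustration (any per-line hash would do):

import java.util.List;

public class UnorderedHash {
    // Hash each line, then combine with addition: the total is the same
    // no matter what order the lines arrive in.
    public static long hashLines(List<String> lines) {
        long total = 0;
        for (String line : lines) {
            total += line.hashCode();
        }
        return total;
    }
}

Comparing two files then comes down to comparing two long values; as a later answer points out, summing hashes weakens them, so on a match you should still compare the actual lines.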
If the lines are fairly long, you could just keep a list of the hashes of each line -- sort those and compare with previous outputs.
If you don't need a 100% fool-proof solution, you could store the hash of each line in a Bloom filter (look it up on Wikipedia) and compare the Bloom filters at the end of processing. This can give you false positives (i.e. you think you have the same output but it isn't really the same) but you can tweak the error rate by adjusting the size of the Bloom filter...
If you add up the ASCII values of each character, you'd get the same result regardless of order.
(This may be a bit too simplified, but perhaps it sparks an idea for you.
See Programming Pearls, section 2.8, for an interesting back story.)
Any of the hash-based methods may produce bad results because more than one string can produce the same hash. (It's not likely, but it's possible.) This is particularly true of the suggestion to add the hashes, since you would essentially be taking a particularly bad hash of the hash values.
A hash method should only be attempted if it's not critical that you miss a change or spot a change where none exists.
The most accurate way would be to keep a Map using the line strings as keys and storing the count of each as the value. (If each string can only appear once, you don't need the count.) Compute this for the expected set of lines. Duplicate this collection to examine the incoming lines, reducing the count for each line as you see it.
If you encounter a line with a zero count (or no map entry at all), you've seen a line you didn't expect.
If you end this with non-zero entries remaining in the Map, you didn't see something you expected.
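A sketch of that count-map comparison (the class and method names are illustrative):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LineMultisetChecker {
    // Returns true only if 'actual' contains exactly the same lines as
    // 'expected', with the same multiplicities, in any order.
    public static boolean sameLines(List<String> expected, List<String> actual) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : expected) {
            Integer current = counts.get(line);
            counts.put(line, current == null ? 1 : current + 1);
        }
        for (String line : actual) {
            Integer remaining = counts.get(line);
            if (remaining == null || remaining == 0) {
                return false; // saw a line we did not expect
            }
            counts.put(line, remaining - 1);
        }
        for (Integer remaining : counts.values()) {
            if (remaining != 0) {
                return false; // an expected line was missing
            }
        }
        return true;
    }
}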
Well the problem specification is a bit limited.
As I understand it you wish to see if several strings contain the same elements regardless of order.
For example:
A B C
C B A
are the same.
The way to do this is to create a set of the values from each string and then compare the sets. To build a set:

Set<String> set = new HashSet<>();
for (String item : line.split(" ")) {   // split the line into its items
    set.add(item);
}

Then just compare the contents of the sets by running through one of them and checking it against the others. The execution time will be O(N) instead of the O(N log N) of the sorting approach.
