Creating multiple output files with AvroMultipleOutputs - java

I have a Reducer that uses AvroKeyOutputFormat as the output format. By default, MapReduce writes all my keys to a single output file. I would like to write a separate output file for each key value. Avro provides the AvroMultipleOutputs class, but examples are slim. The one provided in the Apache AvroMultipleOutputs documentation shows how to pre-configure the various outputs when defining the job. The example shows:
JOB:
AvroMultipleOutputs.addNamedOutput(job, "avro1", AvroOutputFormat.class, schema);
AvroMultipleOutputs.addNamedOutput(job, "avro2", AvroOutputFormat.class, null);
REDUCER:
amos = new AvroMultipleOutputs(conf);
amos.getCollector("avro1", reporter).collect(datum);
amos.getCollector("avro2", "A", reporter).collect(datum);
amos.getCollector("avro3", "B", reporter).collect(datum);
But I don't know how many files I will need or what their names are, since it is based on the key values that come out of my reducer. How could I modify this to accommodate dynamic file naming?

A strategy you can use in this situation (sketched below):
- use a map-only job (zero reduce tasks)
- configure a single named output
- during map(), use your key value as the base output path in AvroMultipleOutputs.write(String namedOutput, Object key, Object value, String baseOutputPath)
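A minimal sketch of that strategy, assuming the newer org.apache.avro.mapreduce API (AvroKeyOutputFormat, AvroMultipleOutputs) and Avro-keyed input; the named output "data" and the field name "myKeyField" are illustrative, so adapt them to your job:
JOB:
job.setNumReduceTasks(0);   // map-only
AvroMultipleOutputs.addNamedOutput(job, "data", AvroKeyOutputFormat.class, schema);
MAPPER:
private AvroMultipleOutputs amos;

@Override
protected void setup(Context context) {
    amos = new AvroMultipleOutputs(context);
}

@Override
protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
        throws IOException, InterruptedException {
    // Use the key value as the base output path; each distinct value gets its own
    // set of files, e.g. <output dir>/<keyValue>-m-00000.avro
    String keyValue = key.datum().get("myKeyField").toString();
    amos.write("data", key, NullWritable.get(), keyValue);
}

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    amos.close();   // flush all per-value writers
}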

Related

Mapreduce questions

I am trying to implement a MapReduce program to do word counts from 2 files, and then compare the word counts from these files to see what the most common words are...
I noticed that after doing the word count for file 1, the results go into the directory "/data/output1/", and there are 3 files inside:
- "_SUCCESS"
- "_logs"
- "part-r-00000"
The "part-r-00000" is the file that contains the results from file1 wordcount. How do I make my program read that particular file if the file name is generated in real-time without me knowing beforehand the filename?
Also, for the (key, value) pairs, I have added an identifier to the "value", so as to be able to identify which file and count that word belongs to.
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
    Text newValue = new Text();
    newValue.set(value.toString() + "_f2");
    context.write(key, newValue);
}
At a later stage, how do I "remove" the identifier so that I can just get the "value"?
Just point your next MR job to /data/output1/. It will read all three files as input, but _SUCCESS and _logs are both empty so they'll have no effect on your program. They're just written so that you can tell that the MR job writing to the directory has finished successfully.
If you want to implement word count from 2 different files, you can use the MultipleInputs class, which lets you run your MapReduce program on both files simultaneously. Refer to this link for an example of how to implement it: http://www.hadooptpoint.com/hadoop-multiple-input-files-example-in-mapreduce/. There you define a separate mapper for each input file, so you can add a different identifier in each mapper; when their output reaches the reducer, it can tell which map file the input came from and process it accordingly. You can remove the identifiers the same way you added them: for example, if you add one prefix to mapper 1's output and a different prefix to mapper 2's output, then in the reducer you can tell which mapper the input came from by its prefix and simply strip that prefix off.
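A rough sketch of that setup, staying with the question's own "_f2"-style suffix on the value rather than a key prefix (the paths and mapper class names here are made up for illustration):
MultipleInputs.addInputPath(job, new Path("/data/file1"), TextInputFormat.class, FileOneMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/file2"), TextInputFormat.class, FileTwoMapper.class);

// FileOneMapper appends "_f1" to its values, FileTwoMapper appends "_f2" (as in the question).
// The reducer strips the suffix again before using the value:
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    for (Text val : values) {
        String tagged = val.toString();
        int cut = tagged.lastIndexOf('_');
        String source = tagged.substring(cut);     // "_f1" or "_f2" -> which file it came from
        String count = tagged.substring(0, cut);   // the bare value with the identifier removed
        // ... compare the counts from the two files here
    }
}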
As for your other question about reading the output file: the output file names always follow a pattern. If you are using Hadoop 1.x, the result is stored in a file named part-00000 and onward; with Hadoop 2.x, the result is stored in a file named part-r-00000, and if there is further output that needs to be written to the same output path, it goes into part-r-00001 and onward. The other two files that are generated have no significance for the developer; they act more as markers for Hadoop itself.
Hope this solves your query. Please comment if the answer is not clear.

A pair of strings as a KEY in reduce function - HADOOP

Hello, I am implementing a Facebook-like program in Java using the Hadoop framework (I am new to this). The main idea is that I have an input .txt file like this:
Christina Bill,James,Nick,Jessica
James Christina,Mary,Toby,Nick
...
The first entry is the user, and the comma-separated names are his friends.
In the map function I scan each line of the file and emit the user with each one of his friends like
Christina Bill
Christina James
which will be converted to (Christina,[Bill,James,..])...
BUT the description of my assignment specifies that the Reduce function will receive as its key the tuple of two users, followed by both of their friends; you count the common ones, and if that number is equal to or greater than a set number, like 5, you can safely assume that their uncommon friends can be suggested. So how exactly do I pass a pair of users to the reduce function? I thought the input of the reduce function has to be the same as the output of the map function. I started coding this but I don't think this is the right approach. Any ideas?
public class ReduceFunction<KEY> extends Reducer<KEY, Text, KEY, Text> {
    private Text suggestedFriend = new Text();

    public void reduce(KEY key1, KEY key2, Iterable<Text> value1, Iterable<Text> value2, Context context) {
    }
}
The output of the map phase should, indeed, be of the same type as the input of the reduce phase. This means that, if there is a requirement for the input of the reduce phase, you have to change your mapper.
The idea is simple:
map(user u, friends F):
    for each f in F do
        emit (u-f, F \ f)

reduce(userPair u1-u2, friends F1, F2):
    #commonFriends = |F1 intersection F2|
To implement this logic, you can just use a Text key, in which you concatenate the names of the users, using, e.g., the '-' character between them.
Note that in each reduce method, you will only receive two lists of friends, assuming that each user appears once in your input data. Then, you only have to compare the two lists for common names of friends.
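A minimal Java sketch of that mapper/reducer pair (class names are illustrative; standard org.apache.hadoop.mapreduce and java.util imports are omitted). The pair key is ordered alphabetically so that both users' friend lists arrive in the same reduce call:
public static class FriendPairMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\\s+");   // e.g. "Christina Bill,James,Nick,Jessica"
        String user = parts[0];
        for (String friend : parts[1].split(",")) {
            // Sort the pair so (Christina,Bill) and (Bill,Christina) map to the same key.
            String pair = user.compareTo(friend) < 0 ? user + "-" + friend : friend + "-" + user;
            context.write(new Text(pair), new Text(parts[1]));   // emit the user's full friend list
        }
    }
}

public static class FriendPairReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<Text> friendLists, Context context)
            throws IOException, InterruptedException {
        List<Set<String>> lists = new ArrayList<>();
        for (Text t : friendLists) {
            lists.add(new HashSet<>(Arrays.asList(t.toString().split(","))));
        }
        if (lists.size() == 2) {                    // one friend list from each user of the pair
            Set<String> common = new HashSet<>(lists.get(0));
            common.retainAll(lists.get(1));         // #commonFriends = |F1 intersection F2|
            context.write(pair, new IntWritable(common.size()));
        }
    }
}
Emitting the full friend list instead of F \ f does not change the intersection count, assuming a user never lists themselves as a friend.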
Check whether you can implement a custom record reader that reads two records at once from the input file in the mapper class. Then emit context.write(outkey, NullWritable.get()); from the mapper. In the reducer you then need to handle the two records that arrive as the key (outkey) from the mapper. Good luck!

Java Map Reduce Read from Different Format - Avro, Textfile

I have a few Hive tables, some of which are in Avro format and some in plain text files. The schemas are slightly different, but all contain certain attributes that I need.
I am planning to write a MapReduce job to process the data. I am trying to avoid tons of separate jobs and to simplify the process as much as possible. Fingers crossed that I only need to write one job.
Is there any example showing how to read different formats of input in one mapper?
Say, for example, I have one HDFS path that I know is in Avro, and another HDFS path where the data is in plain text files.
// Pseudo code
mapper(Paths) {
    for (Path in Paths) {
        if Path.containsAvro() {
            ... read as avro
        } else {
            ... read as textfile
        }
        ...
    }
}
Use two different mappers, one for each format, for the same job. The mappers can each read their own format of data but must both write the same format of data. Use something like this to configure:
MultipleInputs.addInputPath(job, new Path(path_to_data_with_format_1), SomeInputFormat.class, ReadFormatOneMapper.class);
MultipleInputs.addInputPath(job, new Path(path_to_data_with_format_2), SomeOtherInputFormat.class, ReadFormatTwoMapper.class);
Of course, SomeInputFormat and SomeOtherInputFormat aren't real input format classes. In this example the two mapper classes would output key/value pairs with the same key/value types, and the reducer, if you have one, would get the data from both mappers.
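For the Avro-plus-text case specifically, the configuration might look roughly like this (the schema variable, paths, and mapper names are illustrative; both mappers emit Text, Text so a single reducer can consume the output of either):
AvroJob.setInputKeySchema(job, myRecordSchema);   // schema of the Avro input

MultipleInputs.addInputPath(job, new Path(avroPath),
        AvroKeyInputFormat.class, AvroSideMapper.class);   // Mapper<AvroKey<GenericRecord>, NullWritable, Text, Text>
MultipleInputs.addInputPath(job, new Path(textPath),
        TextInputFormat.class, TextSideMapper.class);      // Mapper<LongWritable, Text, Text, Text>

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);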

Different keys in Mapper & Combiners

In MapReduce, is it possible to have different types of keys propagated between Mappers, Combiners and Reducers?
For example, if I have a mapper (implemented in Java) which outputs Text,IntWritable as key/value pairs.
Then, in the combiner, I consolidate all the output under a single key and want to output that as NullWritable, Text.
And then in the Reducer, I want to output Text, IntWritable.
Is it possible to do something like above ? If not, why ?
You can specify different key/value types for mapper and reducer with methods:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
and
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
But you can't set key/value types for the combiner that differ from the mapper's. That's because combiners are used only to reduce the amount of data sent from the mapper to the reducer; you should be able to remove a combiner without any side effects. If your combiner produced NullWritable, Text pairs while the mapper produced Text, IntWritable pairs, your program would fail without the combiner.
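As an illustration (not tied to the asker's actual job), here is a word-count style combiner that keeps the mapper's Text, IntWritable output types, so it can be dropped without breaking anything:
public static class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();   // partial sum only; the reducer repeats the same aggregation
        }
        context.write(key, new IntWritable(sum));
    }
}

// job.setCombinerClass(SumCombiner.class);   // same key/value types in and out as the mapper emits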
You can use Text instead of IntWritable:
Mapper output: Text,Text
Combiner: Text,Text. As the combiner output key, NullWritable can be used.
Reducer input: Text,Text
Hadoop uses the Combiner to optimize processing, but it does not guarantee that the Combiner will be executed. Therefore, you cannot assume it will run, and Hadoop may send the map output data directly to the Reduce phase.
For a discussion of this, I recommend reading Chapter 3 of this book; page 48 of the PDF comments on that problem.

Hadoop and MapReduce: How do I send the equivalent of an array of lines pulled from a CSV to the map function, where each array contains lines x - y?

Okay, so I have been reading a lot about Hadoop and MapReduce, and maybe it’s because I’m not as familiar with iterators as most, but I have a question I can’t seem to find a direct answer to. Basically, as I understand it, the map function is executed in parallel by many machines and/or cores. Thus, whatever you are working on must not depend on prior code having executed for the program to make any kind of speed gains. This works perfectly for me, but what I’m doing requires me to test information in small batches. Basically I need to send batches of lines in a .csv as arrays of 32, 64, 128 or whatever lines each. Like lines 0 – 127 go to core1’s execution of the map function, lines 128 – 255 go to core2’s, etc. Also I need to have the contents of each batch available as a whole inside the function, as if I had passed it an array. I read a little about how the new Java API allows for something called push and pull, and that this allows things to be sent in batches, but I couldn’t find any example code. I dunno, I’m going to continue researching, and I’ll post anything I find, but if anyone knows, could they please post in this thread. I would really appreciate any help I might receive.
edit
If you could simply ensure that the chunks of the .csv are sent in sequence, you could perform it this way. I guess this also assumes that there are globals in MapReduce.
//** concept not code **//
GLOBAL_COUNTER = 0;
GLOBAL_ARRAY = NEW ARRAY();

map()
{
    GLOBAL_ARRAY[GLOBAL_COUNTER] = ITERATOR_VALUE;
    GLOBAL_COUNTER++;
    if(GLOBAL_COUNTER == 128)
    {
        //EXECUTE TEST WITH AN ARRAY OF 128 VALUES FOR COMPARISON
        GLOBAL_COUNTER = 0;
    }
}
If you're trying to get a chunk of lines from your CSV file into the mapper, you might consider writing your own InputFormat/RecordReader and potentially your own WritableComparable object. With the custom InputFormat/RecordReader you'll be able to specify how objects are created and passed to the mapper based on the input you receive.
If the mapper is doing what you want, but you need these chunks of lines sent to the reducer, make the output key for the mapper the same for each line you want in the same reduce function.
The default TextInputFormat will give input to your mapper like this (the keys/offsets in this example are just random numbers):
0 Hello World
123 My name is Sam
456 Foo bar bar foo
Each of those lines will be read into your mapper as a key,value pair. Just modify the key to be the same for each line you need and write it to the output:
0 Hello World
0 My name is Sam
1 Foo bar bar foo
The first time the reduce function is read, it will receive a key,value pair with the key being "0" and the value being an Iterable object containing "Hello World" and "My name is Sam". You'll be able to access both of these values in the same reduce method call by using the Iterable object.
Here is some pseudo code:
int count = 0

map(key, value) {
    int newKey = count / 2
    context.write(newKey, value)
    count++
}

reduce(key, values) {
    for value in values
        // Do something to each line
}
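In actual Java the mapper from the pseudo code could look something like this (a sketch; note the counter is local to each map task, so batches never span input splits):
public static class BatchKeyMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private int count = 0;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(new IntWritable(count / 2), line);   // every 2 consecutive lines share a key
        count++;
    }
}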
Hope that helps. :)
If the end goal of what you want is to force certain sets to go to certain machines for processing you want to look into writing your own Partitioner. Otherwise, Hadoop will split data automatically for you depending on the number of reducers.
I suggest reading the tutorial on the Hadoop site to get a better understanding of M/R.
If you simply want to send N lines of input to a single mapper, you can use the NLineInputFormat class. You could then do the line parsing (splitting on commas, etc.) in the mapper.
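For example (a sketch; 128 just mirrors the batch sizes mentioned in the question):
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 128);   // each map task receives a split of 128 input lines
Note that map() is still called once per line; if you need the whole batch in memory at once, collect the lines in the mapper and process them in cleanup().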
If you want to have access to the lines before and after the line the mapper is currently processing, you may have to write your own input format. Subclassing FileInputFormat is usually a good place to start. You could create an InputFormat that reads N lines, concatenates them, and sends them as one block to a mapper, which then splits the input into N lines again and begins processing.
As far as globals in Hadoop go, you can specify some custom parameters when you create the job configuration, but as far as I know, you cannot change them in a worker and expect the change to propagate throughout the cluster. To set a job parameter that will be visible to workers, do the following where you are creating the job:
job.getConfiguration().set(Constants.SOME_PARAM, "my value");
Then, to read the parameter's value in the mapper or reducer:
public void map(Text key, Text value, Context context) {
    Configuration conf = context.getConfiguration();
    String someParam = conf.get(Constants.SOME_PARAM);
    // use someParam in processing input
}
Hadoop has support for basic types such as int, long, string, bool, etc. to be used as parameters.
