A pair of strings as a KEY in reduce function - HADOOP - java

Hello, I am implementing a Facebook-like program in Java using the Hadoop framework (I am new to this). The main idea is that I have an input .txt file like this:
Christina Bill,James,Nick,Jessica
James Christina,Mary,Toby,Nick
...
The first name on each line is the user, and the comma-separated names are his friends.
In the map function I scan each line of the file and emit the user paired with each one of his friends, like
Christina Bill
Christina James
which will be converted to (Christina,[Bill,James,..])...
BUT the description of my assignment specifies that the reduce function will receive as its key a tuple of two users, followed by both of their friend lists; you count the common friends, and if that number is greater than or equal to some set number, say 5, you can safely assume that their uncommon friends can be suggested. So how exactly do I pass a pair of users to the reduce function? I thought the input of the reduce function had to be of the same type as the output of the map function. I started coding this, but I don't think this is the right approach. Any ideas?
public class ReduceFunction<KEY> extends Reducer<KEY, Text, KEY, Text> {
    private Text suggestedFriend = new Text();

    public void reduce(KEY key1, KEY key2, Iterable<Text> value1, Iterable<Text> value2, Context context) {
    }
}

The output of the map phase should, indeed, be of the same type as the input of the reduce phase. This means that, if there is a requirement for the input of the reduce phase, you have to change your mapper.
The idea is simple:
map(user u, friends F):
    for each f in F do
        emit (u-f, F\f)

reduce(userPair u1-u2, friends F1, F2):
    #commonFriends = |F1 intersection F2|
To implement this logic, you can just use a Text key, in which you concatenate the names of the users, using, e.g., the '-' character between them.
Note that in each reduce method, you will only receive two lists of friends, assuming that each user appears once in your input data. Then, you only have to compare the two lists for common names of friends.
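For illustration, a minimal mapper sketch along those lines might look like the following (this is only a sketch: the class name is made up, and ordering the two names before concatenating them is an extra detail added here so that both users' emissions land on the same reducer key):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// For each friend f of user u, emits (u-f, F\f): the user pair as a Text key
// and the rest of u's friend list as the value.
public class FriendPairMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text pairKey = new Text();
    private final Text friendList = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Line format: "Christina Bill,James,Nick,Jessica"
        String[] parts = value.toString().split("\\s+", 2);
        if (parts.length < 2) {
            return;
        }
        String user = parts[0];
        String[] friends = parts[1].split(",");

        for (String friend : friends) {
            // F\f: every friend except the current one
            List<String> rest = new ArrayList<>();
            for (String other : friends) {
                if (!other.equals(friend)) {
                    rest.add(other);
                }
            }
            // Order the two names so "Christina-James" and "James-Christina"
            // end up under the same reducer key.
            String pair = user.compareTo(friend) < 0
                    ? user + "-" + friend
                    : friend + "-" + user;
            pairKey.set(pair);
            friendList.set(String.join(",", rest));
            context.write(pairKey, friendList);
        }
    }
}

The reducer then receives each user pair with (at most) two friend lists and only has to intersect them.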

Check whether you can implement a custom record reader that reads two records at once from the input file in the mapper class. Then emit context.write(outkey, NullWritable.get()); from the mapper class. In the reducer class you then need to handle the two records that arrive as the key (outkey) from the mapper. Good luck!

Related

Mapreduce questions

I am trying to implement a MapReduce program to do word counts from 2 files, and then compare the word counts from these files to see what the most common words are...
I noticed that after doing the word count for file 1, the results go into the directory "/data/output1/", and there are 3 files inside:
- "_SUCCESS"
- "_logs"
- "part-r-00000"
The "part-r-00000" is the file that contains the results from file1 wordcount. How do I make my program read that particular file if the file name is generated in real-time without me knowing beforehand the filename?
Also, for the (key, value) pairs, I have added an identifier to the "value", so as to be able to identify which file and count that word belongs to.
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
    Text newValue = new Text();
    newValue.set(value.toString() + "_f2");
    context.write(key, newValue);
}
At a later stage, how do I "remove" the identifier so that I can just get the "value"?
Just point your next MR job to /data/output1/. It will read all three files as input, but _SUCCESS and _logs are both empty, so they'll have no effect on your program. They're just written so that you can tell that the MR job writing to the directory finished successfully.
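Roughly, in your second job's driver that is just the following (the job name here is only illustrative):

Job secondJob = Job.getInstance(new Configuration(), "compare word counts");
// Give the whole directory as input; Hadoop will pick up the part-r-* files.
FileInputFormat.addInputPath(secondJob, new Path("/data/output1/"));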
If you want to implement word count over 2 different files, you can use the MultipleInputs class, which lets you run a MapReduce program over both files simultaneously. See this link for an example of how to implement it: http://www.hadooptpoint.com/hadoop-multiple-input-files-example-in-mapreduce/. There you define a separate mapper for each input file, so you can add a different identifier in each mapper; when their output reaches the reducer, it can tell which mapper (and therefore which file) each value came from and process it accordingly. You can remove the identifiers the same way you added them: for example, if you add one prefix to mapper 1's output and a different prefix to mapper 2's output, then in the reducer you can tell which mapper a value came from by its prefix and simply strip that prefix off.
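A rough driver sketch of that MultipleInputs setup might look like this (File1Mapper, File2Mapper and CompareReducer are placeholders for your own classes, and the paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountCompareDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count compare");
        job.setJarByClass(WordCountCompareDriver.class);

        // One mapper per input file; each mapper tags its output values
        // with a different marker (e.g. "_f1" vs "_f2").
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, File1Mapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, File2Mapper.class);

        job.setReducerClass(CompareReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In the reducer, stripping the marker is then just string handling: for instance, a value ending in "_f2" tells you the count came from file 2, and the marker can be cut off with substring().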
As for your other query about reading the output file: the output file name always follows a pattern. If you are using Hadoop 1.x, the result is stored in a file named part-00000 and onwards; with Hadoop 2.x, the result is stored in a file named part-r-00000, and if there is further output that needs to be written to the same output path, it goes into part-r-00001 and onwards. The other two files that are generated have no significance for the developer; they act more as markers for Hadoop itself.
Hope this solves your query. Please comment if the answer is not clear.

In JBehave, how do I pass an array as a parameter from a story file to a step file?

I've been reading the JBehave docs and I'm not finding anything that speaks to this specific use case. The closest I found was this on parameterised scenarios, but it's not quite what I'm looking for. I don't need to run the same logic many times with different parameters, I need to run the step logic once with a set of parameters. Specifically, I need to pass combinations of the numbers 1-4. Is there a way to do this?
Do you mean something like Tabular Parameters?
You could use it like this:
Given the numbers:
|combinations|
|1234|
|4321|
|1324|
When ...
and then:
#Given("the numbers: $numbersTable")
public void theNumbers(ExamplesTable numbersTable) {
List numbers = new ArrayList();
for (Map<String,String> row : numbersTable.getRows()) {
String combination = row.get("combinations");
numbers.add(combination);
}
}
I just rewrote the JBehave example so it fits your needs. You can pass any number of combinations into the tables inside the given/when/then steps and transform them into an array or, as in my example, into a list.

Parsing and looking up a string with variable number of fields java

I have to read a file and store the values and then later do a lookup.
For e.g., the file will look as follows:
Gryffindor = 5
Gryffindor.Name.Harry = 10
Gryffindor.Name.Harry.Cloak.Black = 15
and so on...
I need to store these (I was thinking of a map). Later, I need to process every character and lookup this map to assign them points. Suppose I encounter Harry, I know that he's from Gryffindor and he's wearing a blue cloak. I will have to lookup this map (or whatever object I use) as
Gryffindor.Name.Harry.Cloak.Blue
which should return me nothing. I then need to fall back to just the name and lookup
Gryffindor.Name.Harry
that should return me a 10.
Similarly, if I lookup for Ron, (suppose he's wearing black),
Gryffindor.Name.Ron.Cloak.Black
should return nothing, fall back to
Gryffindor.Name.Ron
again nothing, fall back to
Gryffindor
which should return 5.
What will be an elegant way to store and read this data? I was thinking of using a map for storing the key value pairs and then a switch case to read them back. How would you do it?
Java has a built-in Properties class that implements Map and can read and write the data format you describe (see that class's load() and store() methods).
There's nothing in there to implement your "fall back to a higher-level key" feature, so you'll need to write a method that looks in the Properties instance for data under the desired key, and keeps trying successively shorter versions of the same key if it finds nothing.
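A minimal sketch of that fallback, assuming dot-separated keys as in your example (the class and method names are made up; it drops one segment at a time, which gives the same results as your examples as long as purely structural prefixes such as Gryffindor.Name never carry a value of their own):

import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

public class PointsLookup {

    private final Properties points = new Properties();

    public PointsLookup(String fileName) throws IOException {
        try (FileReader reader = new FileReader(fileName)) {
            // Parses lines like "Gryffindor.Name.Harry = 10"
            points.load(reader);
        }
    }

    // Tries the full key, then keeps dropping the last dot-separated segment
    // until a value is found; returns null if nothing matches at all.
    public Integer lookup(String key) {
        String current = key;
        while (true) {
            String value = points.getProperty(current);
            if (value != null) {
                return Integer.valueOf(value.trim());
            }
            int lastDot = current.lastIndexOf('.');
            if (lastDot < 0) {
                return null;
            }
            current = current.substring(0, lastDot);
        }
    }
}

With your sample data, lookup("Gryffindor.Name.Ron.Cloak.Black") would fall back through the shorter keys and return 5, while lookup("Gryffindor.Name.Harry.Cloak.Blue") would return 10.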

Java sorting with collections plus manual sorting

I am writing a console application that calculates various prices using hashtables. It writes the prices out with a class called Priceprint. I am using hashtables for the rest of the program because order is not particularly important, but it orders the keys before creating a list as output. It puts them in order by placing the keys in a vector, sorting the vector with Collections.sort(), and manually swapping the first and second elements with the entries whose keys are exchange and special. It then uses an Enumeration to get everything from the vector and calls another function to write each entry to the screen.
public void out(Hashtable<String, Double> b, Hashtable<String, Double> d) {
    Vector<String> v;
    Enumeration<String> k;
    String te1, te2, e;
    int ex, sp;

    v = new Vector<String>(d.keySet());
    Collections.sort(v);
    te1 = new String(v.get(0));
    ex = v.indexOf("exchange");
    v.set(ex, te1); v.set(0, "exchange");
    te2 = new String(v.get(1));
    ex = v.indexOf("special");
    v.set(ex, te2); v.set(1, "special");
    if (msgflag == true)
        System.out.println("Listing Bitcoin and dollar prices.");
    else {
        System.out.println("Listing Bitcoin and dollar prices, "
                + message + ".");
        msgflag = true;
    }
    k = v.elements();
    while (k.hasMoreElements()) {
        e = new String(k.nextElement());
        out(e, d.get(e), b.get(e));
    }
}
Now, the problem I've run into, through lack of thought alone, is that this swaps the entries after sorting the list into alphabetical order of its keys. So when I run the program, exchange and special are at the top, but the rest of the list is no longer in order. I might have to scrap the essential design, where lists are output through the code for single entries, with the keys exchange and special coming to the top while keeping order for every other part of the list. It's a shame, because pretty much all of it might need to go, and I really liked the design.
Here is the full code (ignore the fact that I'm using constructors on a class that evidently should be using static methods; I overlooked that): http://pastebin.com/CdwhcV2L
Here is the code that uses Printprice to create a list of prices, to test another part of the program but also the Printprice lists: http://pastebin.com/E2Fq13zF
Output:
john#fekete:~/devel/java/pricecalc$ java backend.test.Test
I test CalcPrice, but I also test Printprice(Hashtable, Hashtable, String).
Listing Bitcoin and dollar prices, for unit test, check with calculator.
Exchange rate is $127.23 (USDBTC).
Special is 20.0%.
privacy: $2.0 0.0126BTC, for unit test, check with calculator.
quotaband: $1.5 0.0094BTC, for unit test, check with calculator.
quotahdd: $5.0 0.0314BTC, for unit test, check with calculator.
shells: $5.0 0.0314BTC, for unit test, check with calculator.
hosting: $10.0 0.0629BTC, for unit test, check with calculator.
The problem appears to be that you are putting the first and second elements back into the vector at the locations that "exchange" and "special" came from, instead of removing "exchange" and "special" from the vector and inserting them at the top of the vector.
Doing this correctly would be more efficient with a LinkedList instead of a Vector. To carry out the required operations, assuming v is a List:
v.add(0, v.remove(v.indexOf("special")));
v.add(0, v.remove(v.indexOf("exchange")));
This should put "exchange" first, "special" second and the rest of the list will remain in sorted order afterwards.
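Applied to your out() method, the reordering part might then look roughly like this (keeping your Vector, although any List would do):

v = new Vector<String>(d.keySet());
Collections.sort(v);                          // alphabetical order first
v.add(0, v.remove(v.indexOf("special")));     // move "special" to the front
v.add(0, v.remove(v.indexOf("exchange")));    // then put "exchange" in front of it
// v is now: exchange, special, then everything else still in sorted order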

Hadoop and MapReduce, How do I send the equivalent of an array of lines pulled from a csv to the map function, where each array contained lines x - y;

Okay, so I have been reading a lot about Hadoop and MapReduce, and maybe it's because I'm not as familiar with iterators as most, but I have a question I can't seem to find a direct answer to. Basically, as I understand it, the map function is executed in parallel by many machines and/or cores. Thus, whatever you are working on must not depend on prior code having been executed for the program to make any kind of speed gains. This works perfectly for me, but what I'm doing requires me to test information in small batches. Basically I need to send batches of lines of a .csv as arrays of 32, 64, 128 or however many lines each. Like lines 0-127 go to core1's execution of the map function, lines 128-255 go to core2's, etc. Also I need to have the contents of each batch available as a whole inside the function, as if I had passed it an array. I read a little about how the new Java API allows for something called push and pull, and that this allows things to be sent in batches, but I couldn't find any example code. I dunno, I'm going to continue researching, and I'll post anything I find, but if anyone knows, could they please post in this thread? I would really appreciate any help I might receive.
edit
If you could simply ensure that the chunks of the .csv are sent in sequence, you could perform it this way. I guess this also assumes that there are globals in MapReduce.
//** concept not code **//
GLOBAL_COUNTER = 0;
GLOBAL_ARRAY = NEW ARRAY();

map()
{
    GLOBAL_ARRAY[GLOBAL_COUNTER] = ITERATOR_VALUE;
    GLOBAL_COUNTER++;
    if(GLOBAL_COUNTER == 127)
    {
        //EXECUTE TEST WITH AN ARRAY OF 128 VALUES FOR COMPARISON
        GLOBAL_COUNTER = 0;
    }
}
If you're trying to get a chunk of lines from your CSV file into the mapper, you might consider writing your own InputFormat/RecordReader and potentially your own WritableComparable object. With the custom InputFormat/RecordReader you'll be able to specify how objects are created and passed to the mapper based on the input you receive.
If the mapper is doing what you want, but you need these chunks of lines sent to the reducer, make the output key for the mapper the same for each line you want in the same reduce function.
The default TextInputFormat will give input to your mapper like this (the keys/offsets in this example are just random numbers):
0 Hello World
123 My name is Sam
456 Foo bar bar foo
Each of those lines will be read into your mapper as a key,value pair. Just modify the key to be the same for each line you need and write it to the output:
0 Hello World
0 My name is Sam
1 Foo bar bar foo
The first time the reduce function is read, it will receive a key,value pair with the key being "0" and the value being an Iterable object containing "Hello World" and "My name is Sam". You'll be able to access both of these values in the same reduce method call by using the Iterable object.
Here is some pseudo code:
int count = 0

map (key, value) {
    int newKey = count / 2
    context.write(newKey, value)
    count++
}

reduce (key, values) {
    for value in values
        // Do something to each line
}
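In Java, the mapper part of that pseudocode could look roughly like this (the class name is made up, and the counter is per map task, so lines are only paired within a single mapper):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Gives every two consecutive lines the same key so the reducer
// receives them together in one Iterable.
public class PairingMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private int count = 0;
    private final IntWritable newKey = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        newKey.set(count / 2);   // lines 0 and 1 -> key 0, lines 2 and 3 -> key 1, ...
        context.write(newKey, value);
        count++;
    }
}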
Hope that helps. :)
If the end goal of what you want is to force certain sets to go to certain machines for processing you want to look into writing your own Partitioner. Otherwise, Hadoop will split data automatically for you depending on the number of reducers.
I suggest reading the tutorial on the Hadoop site to get a better understanding of M/R.
If you simply want to send N lines of input to a single mapper, you can use the NLineInputFormat class. You could then do the line parsing (splitting on commas, etc.) in the mapper.
If you want to have access to the lines before and after the line the mapper is currently processing, you may have to write your own input format. Subclassing FileInputFormat is usually a good place to start. You could create an InputFormat that reads N lines, concatenates them, and sends them as one block to a mapper, which then splits the input into N lines again and begins processing.
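For the NLineInputFormat route, the relevant part of the job setup might look roughly like this (128 lines per split is just an example value, and the input path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class BatchJobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "batch csv lines");
        job.setJarByClass(BatchJobSetup.class);
        job.setInputFormatClass(NLineInputFormat.class);
        // Each map task receives a split of (at most) 128 input lines.
        NLineInputFormat.setNumLinesPerSplit(job, 128);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        // ... set mapper/reducer classes and the output path as usual ...
    }
}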
As far as globals in Hadoop go, you can specify some custom parameters when you create the job configuration, but as far as I know, you cannot change them in a worker and expect the change to propagate throughout the cluster. To set a job parameter that will be visible to workers, do the following where you are creating the job:
job.getConfiguration().set(Constants.SOME_PARAM, "my value");
Then to read the parameters value in the mapper or reducer,
public void map(Text key, Text value, Context context) {
    Configuration conf = context.getConfiguration();
    String someParam = conf.get(Constants.SOME_PARAM);
    // use someParam in processing input
}
Hadoop has support for basic types such as int, long, String, boolean, etc., to be used as parameters.
