How to parse CustomWritable from text in Hadoop - java

Say I have timestamped values for specific users in text files, like
#userid; unix-timestamp; value
1; 2010-01-01 00:00:00; 10
2; 2010-01-01 00:00:00; 20
1; 2010-01-01 01:00:00; 11
2; 2010-01-01 01:00:00; 21
1; 2010-01-02 00:00:00; 12
2; 2010-01-02 00:00:00; 22
I have a custom class "SessionSummary" implementing readFields and write of WritableComparable. Its purpose is to sum up all values per user for each calendar day.
So the mapper maps the lines to each user, the reducer sums all values per day per user and outputs a SessionSummary via TextOutputFormat (using SessionSummary's toString, as tab-separated UTF-8 strings):
1; 2010-01-01; 21
2; 2010-01-01; 41
1; 2010-01-02; 12
2; 2010-01-02; 22
If I need to use these summary-entries for a second Map/Reduce stage, how should I parse this summary data to populate the members? Can I reuse the existing readFields and write-methods (of the WritableComparable interface implementation) by using the text String as DataInput somehow? This (obviously) did not work:
public void map(...) {
    SessionSummary ssw = new SessionSummary();
    ssw.readFields(new DataInputStream(new ByteArrayInputStream(value.getBytes("UTF-8"))));
}
In general: Is there a best practice to implement custom keys and values in Hadoop and make them easily reusable across several M/R stages, while keeping human-readable text output at every stage?
(Hadoop version is 0.20.2 / CDH3u3)

The output format for your first MR job should be SequenceFileOutputFormat - this will store the Key/Value pairs output from the reducer in a binary format that can then be read back in by your second MR job using SequenceFileInputFormat. Also make sure you set the outputKeyClass and outputValueClass on the Job accordingly.
The mapper in the second job then receives SessionSummary as its input key (and whatever the value type is) directly, with no text parsing needed.
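For concreteness, here is a minimal sketch of how the two jobs might be wired together. The driver class name, the paths, and the use of NullWritable as the value type are assumptions for illustration, not taken from the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SummaryDriver {                                  // hypothetical driver class
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: the reducer emits SessionSummary keys; persist them as a SequenceFile
        Job job1 = new Job(conf, "daily-session-summary");
        // ... set mapper/reducer classes and the text input path as usual ...
        job1.setOutputKeyClass(SessionSummary.class);
        job1.setOutputValueClass(NullWritable.class);         // assumption: no separate value
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputPath(job1, new Path("summary-out"));
        job1.waitForCompletion(true);

        // Job 2: read the binary Key/Value pairs straight back in, no text parsing needed
        Job job2 = new Job(conf, "second-stage");
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(job2, new Path("summary-out"));
        // The second job's mapper is then declared as Mapper<SessionSummary, NullWritable, ...>
        // ... set its mapper/reducer and output settings as usual ...
        job2.waitForCompletion(true);
    }
}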
If you need to see the textual output from the first MR job, you can run the following on the output files in HDFS:
hadoop fs -libjars my-lib.jar -text output-dir/part-r-*
This will read in the sequence file Key/Value pairs and call toString() on both objects, tab-separating them when writing to stdout. The -libjars option tells hadoop where to find your custom Key/Value classes.

Related

Implementing Apriori Algorithm on Hadoop

I am attempting to implement the Apriori algorithm using Hadoop. I have already implemented a non-distributed version of the Apriori algorithm, but my lack of familiarity with Hadoop and MapReduce has raised a number of concerns.
The way I want to implement the algorithm is in two phases:
1) In the first phase, the MapReduce job will operate on the original transaction dataset. The output of this phase is a file containing all of the 1-itemsets and their supports.
2) In the second phase, I want to read in the output of the previous phase and then construct the new itemsets. Importantly, I then want to determine, in the mapper, whether any of the new itemsets are still found in the dataset. I imagine that if I send the original dataset as the input to the mapper, it will partition the original file so that each mapper only scans a partial dataset. The candidate list, however, needs to be constructed from all of the previous phase's output. This will then iterate in a loop for a fixed number of passes.
My problem is figuring out how to specifically ensure that I can access the full itemsets in each mapper, as well as being able to access the original dataset to calculate the new support in each phase.
Thanks for any advice, comments, suggestions or answers.
EDIT: Based on the feedback, I just want to be more specific about what I'm asking here.
Before you start, I suggest you read the Hadoop Map-Reduce Tutorial.
Step 1:
Load your data file into HDFS. Let's assume your data is a text file and each set is one line.
a b c
a c d e
a e f
a f z
...
Step 2:
Follow the Map-Reduce Tutorial to build your own Apriori Class.
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    Text word = new Text();
    IntWritable one = new IntWritable(1);
    // Separate the line into tokens by space
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        // Emit each item with a count of 1, word-count style
        // (for k-itemsets you would instead collect the tokens into a writable set and emit the set)
        word.set(itr.nextToken());
        context.write(word, one);
    }
}
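For completeness, the matching reducer would be the tutorial's standard sum reducer. This is only a sketch along the lines of the WordCount example from the Map-Reduce Tutorial, not part of the original answer; it adds up the 1s emitted by the mapper:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A standard sum reducer, as in the tutorial's WordCount example
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the occurrence counts for this item/itemset
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}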
Step 3:
Run the MapReduce jar file. The output will be in a file in HDFS.
You will have something like:
a b 3 (number of occurrences)
a b c 5
a d 2
...
Based on the output file, you can calculate the relationships between the itemsets.
On a related note, you might want to consider using a higher-level abstraction than MapReduce, such as Cascading or Apache Spark.
I implemented the AES algorithm in both Apache Spark and Hadoop MapReduce using Hadoop Streaming.
I know it is not the same as Apriori, but you can try to use my approach.
Simple example of AES implemented using Hadoop Streaming MapReduce.
Project structure for AES Hadoop Streaming:
1n_reducer.py / 1n_combiner.py: the combiner is the same code as the reducer, but without the CONSTRAINT check.
import sys

CONSTRAINT = 1000


def do_reduce(word, _values):
    return word, sum(_values)


prev_key = None
values = []

for line in sys.stdin:
    key, value = line.split("\t")
    if key != prev_key and prev_key is not None:
        result_key, result_value = do_reduce(prev_key, values)
        if result_value > CONSTRAINT:
            print(result_key + "\t" + str(result_value))
        values = []
    prev_key = key
    values.append(int(value))

if prev_key is not None:
    result_key, result_value = do_reduce(prev_key, values)
    if result_value > CONSTRAINT:
        print(result_key + "\t" + str(result_value))
base_mapper.py:
import sys


def count_usage():
    for line in sys.stdin:
        elements = line.rstrip("\n").rsplit(",")
        for item in elements:
            print("{item}\t{count}".format(item=item, count=1))


if __name__ == "__main__":
    count_usage()
2n_mapper.py uses the result of the previous iteration.
In answer to your question, you can read the output of the previous iteration to form the itemsets in this way:
import itertools
import sys

sys.path.append('.')

N_DIM = 2


def get_2n_items():
    items = set()
    with open("part-00000") as inf:
        for line in inf:
            parts = line.split('\t')
            if len(parts) > 1:
                items.add(parts[0])
    return items


def count_usage_of_2n_items():
    all_items_set = get_2n_items()
    for line in sys.stdin:
        items = line.rstrip("\n").rsplit(",")  # 74743 43355 53554
        exist_in_items = set()
        for item in items:
            if item in all_items_set:
                exist_in_items.add(item)
        for combination in itertools.combinations(exist_in_items, N_DIM):
            combination = sorted(combination)
            print("{el1},{el2}\t{count}".format(el1=combination[0], el2=combination[1], count=1))


if __name__ == "__main__":
    count_usage_of_2n_items()
From my experience, the Apriori algorithm is not suitable for Hadoop if the number of unique combinations (itemsets) is too large (100K+).
If you find an elegant solution for implementing the Apriori algorithm with Hadoop MapReduce (Streaming or Java MapReduce), please share it with the community.
PS. If you need more code snippets, please ask.

pmml model created from xgboost in R leads to different result than original model in R

I have a ranking task, where my training data looks like this:
session_id item_id item_features target
---------------------------------------------
session1 item1 ... 1
session1 item2 ... 0
...
sessionN item1 ... 0
sessionN itemX ... 10
sessionN itemY ... 0
...
I am using xgboost in R with the objective "rank:pairwise" for training the model. xgboost expects grouped data (same session_id) to be bunched together in the training and test sets. The lines belonging to the same session_id have to be specified using the function setinfo() (e.g. setinfo(model, 'group', group_info)).
When I evaluate the model in R, applying new data works perfectly. However, I have used the package pmml to convert the model into a pmml file in order to use it in Java.
In Java the pmml file gets parsed and evaluated via the org.jpmml pmml-evaluator dependency (v. 1.3.15). Feeding the same data as in R to the org.jpmml.evaluator.Evaluator yields different results, though. The results are mostly negative values, which is not a valid result in my setup; all predicted targets should be positive.
I have come up with two possible explanations:
There might be a bug in the pmml conversion in my scenario
I have no idea where I can apply the equivalent of setinfo() in Java. Since I am only applying the model to a single session at a time, I was under the impression that I did not need to specify it. But maybe I was wrong.
Please contact me for a fully working example including training and test data; I will send it via mail. But for starters, here is the R code from training the model:
library(xgboost)
example_matrix_train <- xgb.DMatrix(X, label = y)
setinfo(example_matrix_train, 'group', example_train_groupInfo)
example.model <- xgboost(data = example_matrix_train, objective = "rank:pairwise", max.depth = 8, eta = 0.2, nthread = 8, nround = 10, verbose=0)
library(pmml)
library(pmmlTransformations)
xgb.dump(example.model, "example.model.dumped.trees")
logfile <- file(paste0("pmml_example_model",Sys.Date(),".txt"), open="a")
sink(logfile)
pmml(example.model, inputFeatureNames = colnames(example_train), outputLabelName = "prediction1", xgbDumpFile = "example.model.dumped.trees")
sink()
Any help is welcome
I have come up with two possible explanations: There might be a bug in the pmml conversion
This is the true explanation - the pmml package is producing incorrect PMML for XGBoost models. The technical reason is that it uses the XGBoost text dump file as input, but the information contained therein is incomplete (e.g. rounded threshold values).
If you're looking to export XGBoost models into PMML, then you should be using the r2pmml package, which uses XGBoost binary files as input.
In truth, the 'pmml' package currently does not support the 'rank:pairwise' objective function you need. The upcoming release of the 'pmml' package (version 1.5.3) includes a check for unsupported objective functions.

Seems like reducer method not running in my Reducer class

I have a sample input file, shown below, which contains a sequence number, name, medicine, gender, and amount spent. My requirement is to get the total amount spent on each medicine. I have written a MapReduce program and ran it on my local machine in a single-node cluster with Hadoop and the other necessary packages installed.
Irma Ellison,avil,female,872
Hilary Bush,avil,male,999
Ahmed Mejia,paracetamol,female,654
Grace Boone,metacin,female,918
Hayes Ortiz,paracetamol,male,734
Lani Matthews,paracetamol,female,836
Cathleen Stewart,paracetamol,male,178
Jonas Boone,metacin,female,649
Desiree Pearson,avil,male,439
Britanney Sullivan,metacin,female,659
For the above input I am expecting the output below.
avil 2310
metacin 2226
paracetamol 2402
When I declare my reducer class as
public class VisReducer extends Reducer<Text, IntWritable, Text, IntWritable>. I am getting my expected output and everything looks good.
But mistakenly I changed my reducer class declaration to
public class VisReducer extends Reducer<Text, Iterable<IntWritable>, Text, IntWritable>. The output seems to be just the Mapper output, and it looks like the reduce method in the Reducer class has not run for some reason. I have added a System.out.println() in the reduce method and checked the logs, but could not see what I printed, whereas in the first case I can see the output. I am not able to understand what is causing the issue.
Can someone help me to understand what exactly is happening.
Output in my second case.
avil 439
avil 999
avil 872
metacin 659
metacin 649
metacin 918
paracetamol 178
paracetamol 836
paracetamol 734
paracetamol 654
It might be a very basic question, as I am just starting my Hadoop learning and could not find any relevant questions online.
You will get the desired output when you declare the Reducer as per the specification.
Visit the Apache documentation page on Reducer; the Reducer class takes four type parameters:
org.apache.hadoop.mapreduce
Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
KEYIN - the input key type
VALUEIN - the input value type
KEYOUT - the output key type
VALUEOUT - the output value type
From your example:
public class VisReducer extends Reducer<Text, IntWritable, Text, IntWritable>
KEYIN - Text
VALUEIN - IntWritable
KEYOUT - Text
VALUEOUT - IntWritable
If you pass the input key as Text and the input value as IntWritable to the Reducer, it will generate the output key as Text and the output value as IntWritable.
After all mappers complete their work, they emit their output as key,value pairs.
For example, let's assume 2 mappers; in your case the mapper output is:
mapper1 output:
key1,value1
key2,value1
mapper2 output:
key1,value2
key3,value1
key2,value2
Then the Reducer class is called. The Reducer has 3 phases.
1. Shuffle: The Reducer copies the sorted output from each Mapper using HTTP across the network.
Here the shuffled temp output is:
key1,value1
key2,value1
key1,value2
key3,value1
key2,value2
2. Sort: The framework merge-sorts Reducer inputs by keys (since different Mappers may have output the same key).
Here the sorted temp output is:
key1,value1
key1,value2
key2,value1
key2,value2
key3,value1
3. Reduce: In this phase the reduce(Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context) method is called for each <key, (collection of values)> in the sorted inputs.
Here the actual reduce method works on the mapper output; it takes the input as:
key1,<value1,value2>
key2,<value1,value2>
key3,<value1>
The Reducer class declaration and the reduce method inside the Reducer class are declared differently: the type parameters of the Reducer class are the output types of the Mapper class (in most cases), while the reduce method parameters are (Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context).
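For reference, a minimal sketch of how the corrected declaration and reduce method fit together (VisReducer and the types are taken from the question; the point is that Iterable appears only in the method signature, never in the class's type parameters). With Iterable in the class declaration, your reduce method no longer overrides Reducer.reduce, so the framework most likely falls back to the default identity reduce, which simply writes the mapper output through - consistent with the output you are seeing in the second case:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class VisReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the amounts spent per medicine
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}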

MapReduce distributed reducer

Just started learning MapReduce and I have a file where there are an actor and a movie he played in (per line). I want to create a file as follows:
actor movie1, movie2, ..., movieN
i.e. a key-value file with only one line per actor listing all his movies. This is no problem.
After I have created this file, I want to find the actor with the most movies played in, as a second MR job. I read my new file (the output of the previous job) and simply replace (in map()) the movies with their number. In my Reducer I just have to compare with the previous result:
if (numberOfRoles.get() < sum) {
    numberOfRoles.set(sum);
    actorWithMostRoles.set(key);
}
where numberOfRoles and actorWithMostRoles are attributes of the Reducer - Class.
This works without any problems.
My output of jps:
$ jps
32347 Jps
25323 DataNode
25145 NameNode
25541 SecondaryNameNode
I know that there can be multiple Mappers & Reducers, for example Reducer_0 and Reducer_1, each of which will output the actor with the most movies played in. Given the following data:
actor1 movie1, movie2, movie3
actor2 movie4, movie5
So Reducer_0 will get actor1 to count and thus output actor1 3, and Reducer_1 will output actor2 2. So I will have two lines instead of one (actor1), because each Reducer has found its own actor.
Having described my approach, I have the following question:
Either I don't understand how it works (with multiple reducers in a cluster), or do I have to do some synchronisation somehow?
Yes, you understand how it works.
You will need another map reduce job to finish it up for you in this setup.
or, just use a single reducer and be done with it!
In the second MR job, read your new file (the output of the previous job) and change your MR to something like the below.
Mapping Phase :
Read each actor and their movie count, and output it with the special key "max" and a value pair of actor name and movie count, like this:
output key = "max"
output value = ("actor", movieCount)
Reducing Phase :
You will get all of the actors and their movie counts as a value list in a single reducer, so just find the max movie count in the value list:
input key = "max"
input value = [("actor",movie_count), ("actor",movie_count) ...]
output key = "most movies played"
output value = max_value
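A rough Java sketch of that approach follows. The class names, the tab-separated format of the first job's output, and the comma-separated movie list are assumptions based on the question, not a definitive implementation:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MostRolesJob {

    public static class MaxRolesMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text MAX_KEY = new Text("max");

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed input line from job 1: actor \t movie1, movie2, ..., movieN
            String[] parts = value.toString().split("\t");
            if (parts.length == 2) {
                int movieCount = parts[1].split(",").length;
                context.write(MAX_KEY, new Text(parts[0] + "\t" + movieCount));
            }
        }
    }

    public static class MaxRolesReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // All (actor, movieCount) pairs arrive under the single key "max"
            String bestActor = null;
            int bestCount = -1;
            for (Text value : values) {
                String[] parts = value.toString().split("\t");
                int count = Integer.parseInt(parts[1]);
                if (count > bestCount) {
                    bestCount = count;
                    bestActor = parts[0];
                }
            }
            context.write(new Text("most movies played"), new Text(bestActor + " " + bestCount));
        }
    }
}

Because everything is emitted under the single key "max", all pairs end up in one reduce call even if several reduce tasks are configured, so no extra synchronisation is needed.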

Tanimoto Coefficient in Mahout returns only 1.0 as prediction value

I have tried to run the Mahout framework and use the Tanimoto coefficient on a set of items. Fortunately, it works for me; however, it returns the value 1.0 for all predicted items. The code is as follows:
public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("stack.csv")); // load the data file needed for the computation
    UserSimilarity similarity = new TanimotoCoefficientSimilarity(model); // Tanimoto coefficient similarity will be used for making recommendations
    /* To use LogLikelihoodSimilarity instead, replace "TanimotoCoefficientSimilarity" with "LogLikelihoodSimilarity".
       The UserSimilarity implementation defines how similar two users are. */
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model); // defines a neighborhood of the 2 users most similar to a given user
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity); // creates a recommendation engine
    List<RecommendedItem> recommendations = recommender.recommend(3, 5);
    /* 5 recommendations for the user with ID 3. In Mahout it always takes integer values, i.e. the userId and the number of items to be recommended */
    for (RecommendedItem recommendation : recommendations) {
        System.out.println(recommendation);
    }
}
The output is as follows:
[main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Creating FileDataModel for file stack.csv
[main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Reading file info...
[main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Read lines: 696
RecommendedItem[item:589, value:1.0]
RecommendedItem[item:380, value:1.0]
RecommendedItem[item:2916, value:1.0]
RecommendedItem[item:3107, value:1.0]
RecommendedItem[item:2028, value:1.0]
Part of my data file is as follows:
1 3408
1 595
1 2398
1 2918
1 2791
1 2687
1 3105
.
.
.
To the best of my knowledge, the Tanimoto coefficient value is usually between 0 and 1.0, but here it shows only 1.0, which seems impossible to me. So, does anybody have an idea how I can solve this problem? Is there any threshold that I can change?
Any help with this is highly appreciated.
Many thanks in advance.
The Tanimoto coefficient, also known as the Jaccard coefficient, completely ignores preference values and just considers that the user likes an item, nothing more. How is it computed? The final value is the number of items that two users both express some preference for (in other words, both like) divided by the number of items that either user expresses some preference for.
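As a quick illustration (this is not Mahout's internal code, just the formula written out in Java), the similarity boils down to intersection over union of the two users' item-ID sets:

import java.util.HashSet;
import java.util.Set;

public class TanimotoExample {
    // Jaccard/Tanimoto similarity between two users' item-ID sets
    static double tanimoto(Set<Long> itemsA, Set<Long> itemsB) {
        Set<Long> intersection = new HashSet<Long>(itemsA);
        intersection.retainAll(itemsB);
        Set<Long> union = new HashSet<Long>(itemsA);
        union.addAll(itemsB);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }
}

Note that the data file in the question has only user and item columns and no preference value, so the input is effectively boolean, which is exactly the kind of data this similarity is intended for.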
Read more about Jaccard coefficient here: http://en.wikipedia.org/wiki/Jaccard_index
Read more about the Mahout's implementation TanimotoCoefficientSimilarity in the book Mahout in Action.
