I am attempting to implement the Apriori algorithm using Hadoop. I have already implemented a non-distributed version of the Apriori algorithm, but my lack of familiarity with Hadoop and MapReduce has raised a number of concerns.
The way I want to implement the algorithm is in two phases:
1) In the first phase, the MapReduce job will operate on the original transaction dataset. The output of this phase is a file containing all of the 1-itemsets and their support counts.
2) In the second phase, I want to read in the output of the previous phase and then construct the new itemsets. Importantly, I want to then, in the mapper, determine if any of the new itemsets are still found in the dataset. I imagine that if I send the original dataset as the input to the mapper, it will partition the original file so that each mapper only scans through a partial dataset. The candidate list however needs to be constructed from all of the previous phase's output. This will then iterate in a loop for a fixed number of passes.
My problem is figuring out how to ensure that each mapper can access the full candidate itemset list, and how it can also access the original dataset to calculate the new support counts in each phase.
Thanks for any advice, comments, suggestions or answers.
EDIT: Based on the feedback, I just want to be more specific about what I'm asking here.
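Concretely, in the second phase I picture each mapper looking roughly like the sketch below (Java-ish pseudocode; CandidateCountMapper and the helper names are just placeholders I made up). My question is how the commented part in setup() can build the complete candidate list from all of the previous phase's output, while the map() calls only ever see one split of the original transaction file.

public class CandidateCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    // Needs to hold the FULL candidate list built from the previous phase's output
    private List<Set<String>> candidates = new ArrayList<Set<String>>();

    @Override
    protected void setup(Context context) {
        // ??? how do I load the complete candidate list here,
        // given that this mapper only receives one split of the transaction file?
    }

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // One transaction per line, items separated by spaces
        Set<String> transaction = new HashSet<String>(Arrays.asList(value.toString().split(" ")));
        for (Set<String> candidate : candidates) {
            if (transaction.containsAll(candidate)) {
                word.set(candidate.toString());
                context.write(word, one); // contributes to this candidate's support count
            }
        }
    }
}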
Before you start, I suggest you read the Hadoop Map-Reduce Tutorial.
Step 1:
Load your data file into HDFS. Let's assume your data is a text file and each transaction is a line.
a b c
a c d e
a e f
a f z
...
Step 2:
Follow the MapReduce tutorial to build your own Apriori class.
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Separate the line into tokens by space
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        // Emit each item with a count of one
        // (word is a Text field and one an IntWritable(1), as in the tutorial's WordCount)
        word.set(itr.nextToken());
        context.write(word, one);
    }
}
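The reducer can be the tutorial's summing reducer essentially unchanged; a minimal sketch:

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the per-line counts to get the support of the itemset
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}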
Step 3:
Run the MapReduce jar file. The output will be in a file in HDFS.
You will have something like:
a b 3 (number of occurrences)
a b c 5
a d 2
...
Based on the output file, you can compute the support of each itemset and derive the association rules you are interested in.
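On the original poster's follow-up question (how can every mapper see the complete candidate list while only scanning its own split of the transactions): a common pattern is to keep the transaction file as the normal job input, so it is split across mappers, and to ship the previous pass's output to every mapper via the distributed cache, reading it back in Mapper.setup(). A rough driver sketch with placeholder class and path names; on older Hadoop releases DistributedCache.addCacheFile(uri, conf) plays the role of job.addCacheFile:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "apriori-pass-k");
job.setJarByClass(AprioriDriver.class);          // placeholder driver class
job.setMapperClass(CandidateCountMapper.class);  // counts candidate itemsets per split
job.setReducerClass(IntSumReducer.class);        // sums the counts into supports
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// The original transaction file is the job input, so each mapper scans only its split
FileInputFormat.addInputPath(job, new Path("/data/transactions.txt"));
FileOutputFormat.setOutputPath(job, new Path("/data/itemsets-pass-k"));

// The previous pass's frequent itemsets are shipped to every node;
// Mapper.setup() can open the cached file and build the candidate list in memory
job.addCacheFile(new Path("/data/itemsets-pass-k-1/part-r-00000").toUri());

job.waitForCompletion(true);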
On a related note, you might want to consider using a higher-level abstraction than MapReduce, such as Cascading or Apache Spark.
I implemented the AES algorithm in both Apache Spark and Hadoop MapReduce using Hadoop Streaming. I know it is not the same as Apriori, but you can try to use my approach.
Below is a simple example of AES implemented with Hadoop Streaming MapReduce.
(Image: project structure for the AES Hadoop Streaming job.)
1n_reducer.py (1n_combiner.py is the same code, just without the CONSTRAINT check):
import sys

CONSTRAINT = 1000


def do_reduce(word, _values):
    return word, sum(_values)


prev_key = None
values = []

for line in sys.stdin:
    key, value = line.split("\t")
    if key != prev_key and prev_key is not None:
        result_key, result_value = do_reduce(prev_key, values)
        if result_value > CONSTRAINT:
            print(result_key + "\t" + str(result_value))
        values = []
    prev_key = key
    values.append(int(value))

if prev_key is not None:
    result_key, result_value = do_reduce(prev_key, values)
    if result_value > CONSTRAINT:
        print(result_key + "\t" + str(result_value))
base_mapper.py:
import sys


def count_usage():
    for line in sys.stdin:
        elements = line.rstrip("\n").rsplit(",")
        for item in elements:
            print("{item}\t{count}".format(item=item, count=1))


if __name__ == "__main__":
    count_usage()
2n_mapper.py uses the result of the previous iteration.
In answer to your question, you can read the output of the previous iteration to form the itemsets this way. Note that the previous pass's output file (part-00000 here) has to be available in each mapper's working directory; Hadoop Streaming can ship it alongside the job.
import itertools
import sys

sys.path.append('.')

N_DIM = 2


def get_2n_items():
    items = set()
    with open("part-00000") as inf:
        for line in inf:
            parts = line.split('\t')
            if len(parts) > 1:
                items.add(parts[0])
    return items


def count_usage_of_2n_items():
    all_items_set = get_2n_items()
    for line in sys.stdin:
        items = line.rstrip("\n").rsplit(",")  # 74743 43355 53554
        exist_in_items = set()
        for item in items:
            if item in all_items_set:
                exist_in_items.add(item)
        for combination in itertools.combinations(exist_in_items, N_DIM):
            combination = sorted(combination)
            print("{el1},{el2}\t{count}".format(el1=combination[0], el2=combination[1], count=1))


if __name__ == "__main__":
    count_usage_of_2n_items()
In my experience, the Apriori algorithm is not well suited to Hadoop if the number of unique combinations (itemsets) is very large (100K+).
If you find an elegant solution for implementing the Apriori algorithm with Hadoop MapReduce (Streaming or Java), please share it with the community.
PS: If you need more code snippets, please ask.
I'm quite new to Spark and I would like to extract features (basically counts of words) from a text file using the Dataset class. I have read the "Extracting, transforming and selecting features" tutorial on Spark, but every example reported starts from a bag of words defined "on the fly". I have tried several times to generate the same kind of Dataset starting from a text file, but I have always failed. Here is my code:
SparkSession spark = SparkSession
.builder()
.appName("Simple application")
.config("spark.master", "local")
.getOrCreate();
Dataset<String> textFile = spark.read()
.textFile("myFile.txt")
.as(Encoders.STRING());
Dataset<Row> words = textFile.flatMap(s -> {
return Arrays.asList(s.toLowerCase().split("AG")).iterator();
}, Encoders.STRING()).filter(s -> !s.isEmpty()).toDF();
Word2Vec word2Vec = new Word2Vec()
.setInputCol("value")
.setOutputCol("result")
.setVectorSize(16)
.setMinCount(0);
Word2VecModel model = word2Vec.fit(words);
Dataset<Row> result = model.transform(words);
I get this error message: Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column value must be of type equal to one of the following types: [ArrayType(StringType,true), ArrayType(StringType,false)] but was actually of type StringType.
I think I have to convert each line into a Row using something like:
RowFactory.create(0.0, line)
but I cannot figure out how to do that.
Basically, I was trying to train a classification system based on word counts of strings generated from a long sequence of characters. My text file contains one sequence per line so I need to split and count them for each row (the sub-strings are called k-mers and a general description can be found here). Depending on the length of the k-mers I could have more than 4^32 different strings, so I was looking for a scalable machine learning algorithm like Spark.
If you just want to count occurrences of words, you can do:
Dataset<String> words = textFile.flatMap(s -> {
return Arrays.asList(s.toLowerCase().split("AG")).iterator();
}, Encoders.STRING()).filter(s -> !s.isEmpty());
Dataset<Row> counts = words.toDF("word").groupBy(col("word")).count();
Word2Vec is a much more powerful ML algorithm, and in your case it's not necessary to use it. Remember to add import static org.apache.spark.sql.functions.*; at the beginning of the file.
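If you do later want to feed Word2Vec, note that the error says the input column must be an array of strings, not one string per row. A hedged, untested sketch of one way to build such a column straight from the text file (reusing the file name and parameters from the question):

import static org.apache.spark.sql.functions.*;

// Split each line into an array of tokens so the "value" column is ArrayType(StringType)
Dataset<Row> lines = spark.read().textFile("myFile.txt").toDF("text");
Dataset<Row> tokenized = lines.withColumn("value", split(lower(col("text")), "AG"));

Word2Vec word2Vec = new Word2Vec()
    .setInputCol("value")
    .setOutputCol("result")
    .setVectorSize(16)
    .setMinCount(0);
Word2VecModel model = word2Vec.fit(tokenized);
// Note: the split may leave empty strings in the arrays; filter them out if that matters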
I am looking for a way to write a custom index with Apache Lucene (PyLucene to be precise, but a Java answer is fine).
What I would like to do is the following: when adding a document to the index, Lucene will tokenize it, remove stop words, etc. This is usually done with the Analyzer, if I am not mistaken.
What I would like to implement is the following: before Lucene stores a given term, I would like to perform a lookup (say, in a dictionary) to check whether to keep the term or discard it (if the term is present in my dictionary, I keep it; otherwise I discard it).
How should I proceed?
Here is (in Python) my custom implementation of the Analyzer:
class CustomAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName, reader):
        source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
        filter = StandardFilter(Version.LUCENE_4_10_1, source)
        filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
        filter = StopFilter(Version.LUCENE_4_10_1, filter,
                            StopAnalyzer.ENGLISH_STOP_WORDS_SET)

        ts = tokenStream.getTokenStream()
        token = ts.addAttribute(CharTermAttribute.class_)
        offset = ts.addAttribute(OffsetAttribute.class_)
        ts.reset()
        while ts.incrementToken():
            startOffset = offset.startOffset()
            endOffset = offset.endOffset()
            term = token.toString()
            # accept or reject term
        ts.end()
        ts.close()
        # How to store the terms in the index now ?
        return ????
Thank you for your guidance in advance!
EDIT 1: After digging into Lucene's documentation, I figured it had something to do with the TokenStreamComponents. It returns a TokenStream with which you can iterate through the token list of the field you are indexing.
Now there is something to do with the Attributes that I do not understand. Or, more precisely, I can read the tokens, but have no idea how I should proceed afterward.
EDIT 2: I found this post where they mention the use of CharTermAttribute. However (in Python, though), I cannot access or get a CharTermAttribute. Any thoughts?
EDIT 3: I can now access each term; see the updated code snippet. Now what is left to be done is actually storing the desired terms...
The way I was trying to solve the problem was wrong. This post and femtoRgon's answer were the solution.
By defining a filter extending PythonFilteringTokenFilter, I can make use of the accept() function (as used in the StopFilter, for instance).
Here is the corresponding code snippet:
class MyFilter(PythonFilteringTokenFilter):

    def __init__(self, version, tokenStream):
        super(MyFilter, self).__init__(version, tokenStream)
        self.termAtt = self.addAttribute(CharTermAttribute.class_)

    def accept(self):
        term = self.termAtt.toString()
        accepted = False
        # Do whatever is needed with the term
        # accepted = ... (True/False)
        return accepted
Then just append the filter to the other filters (as in the code snippet of the question):
filter = MyFilter(Version.LUCENE_4_10_1, filter)
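Since the question says a Java answer is fine, here is the same idea as a hedged Java sketch against the Lucene 4.10.x API used above; DictionaryFilter and the dictionary set are names I made up:

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.FilteringTokenFilter;

// Keeps only the terms that are present in the supplied dictionary
public final class DictionaryFilter extends FilteringTokenFilter {

    private final Set<String> dictionary;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public DictionaryFilter(TokenStream in, Set<String> dictionary) {
        super(in);
        this.dictionary = dictionary;
    }

    @Override
    protected boolean accept() throws IOException {
        // Accept the token only if the term appears in the dictionary
        return dictionary.contains(termAtt.toString());
    }
}

It would then be chained in createComponents exactly like the StopFilter, e.g. filter = new DictionaryFilter(filter, myDictionary).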
I have built a Java parser using Stanford CoreNLP. I am having an issue getting consistent results with the CoreNLP object: I get different entity types for the same input text. It seems like a bug in CoreNLP to me. I was wondering whether any StanfordNLP users have encountered this issue and found a workaround. Below is my service class, which I am instantiating and reusing.
class StanfordNLPService {
//private static final Logger logger = LogConfiguration.getInstance().getLogger(StanfordNLPServer.class.getName());
private StanfordCoreNLP nerPipeline;
/*
Initialize the nlp instances for ner and sentiments.
*/
public void init() {
Properties nerAnnotators = new Properties();
nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
nerPipeline = new StanfordCoreNLP(nerAnnotators);
}
/**
* @param text Text from which entities are to be extracted.
*/
public void printEntities(String text) {
// boolean tracking = PerformanceMonitor.start("StanfordNLPServer.getEntities");
try {
// Properties nerAnnotators = new Properties();
// nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
// nerPipeline = new StanfordCoreNLP(nerAnnotators);
Annotation document = nerPipeline.process(text);
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
// Get the entity type and offset information needed.
String currEntityType = token.get(CoreAnnotations.NamedEntityTagAnnotation.class); // Ner type
int currStart = token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class); // token offset_start
int currEnd = token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class); // token offset_end.
String currPos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class); // POS type
System.out.println("(Type:value:offset)\t" + currEntityType + ":\t"+ text.substring(currStart,currEnd)+"\t" + currStart);
}
}
}catch(Exception e){
e.printStackTrace();
}
}
}
Example of the discrepancy: the type changed from MISC to O compared with the initial run.
Iteration 1:
(Type:value:offset) MISC: Appropriate 100
(Type:value:offset) MISC: Time 112
Iteration 2:
(Type:value:offset) O: Appropriate 100
(Type:value:offset) O: Time 112
Here is the answer from the NER FAQ:
http://nlp.stanford.edu/software/crf-faq.shtml
Is the NER deterministic? Why do the results change for the same data?
Yes, the underlying CRF is deterministic. If you apply the NER to the same sentence more than once, though, it is possible to get different answers the second time. The reason for this is the NER remembers whether it has seen a word in lowercase form before.
The exact way this is used as a feature is in the word shape feature, which treats words such as "Brown" differently if it has or has not seen "brown" as a lowercase word before. If it has, the word shape will be "Initial upper, have seen all lowercase", and if it has not, the word shape will be "Initial upper, have not seen all lowercase".
This feature can be turned off in recent versions with the flag -useKnownLCWords false
I've looked over the code some, and here is a possible way to resolve this:
What you could do to solve this is load each of the 3 serialized CRF's with useKnownLCWords set to false, and serialize them again. Then supply the new serialized CRF's to your StanfordCoreNLP.
Here is a command for loading a serialized CRF with useKnownLCWords set to false, and then dumping it again:
java -mx600m -cp "*:." edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -useKnownLCWords false -serializeTo classifiers/new.english.all.3class.distsim.crf.ser.gz
Use whatever names you want, obviously! This command assumes you are in stanford-corenlp-full-2015-04-20/ and have a directory classifiers with the serialized CRFs. Change as appropriate for your setup.
This command should load the serialized CRF, override with the useKnownLCWords set to false, and then re-dump the CRF to new.english.all.3class.distsim.crf.ser.gz
Then in your original code:
nerAnnotators.put("ner.model","comma-separated-list-of-paths-to-new-serialized-crfs");
Please let me know if this works or if it's not working, and I can look more deeply into this!
After doing some research, I found that the issue is in the ClassifierCombiner.classify() method. One of the baseClassifiers loaded by default, edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz, returns a different type on some occasions. I am trying to load only the first model to resolve this issue.
The problem is in the following area of the code:
CRFClassifier.classifyMaxEnt()
int[] bestSequence = tagInference.bestSequence(model); // line 1249
ExactBestSequenceFinder.bestSequence() returns a different sequence for the above model for the same input when called multiple times.
Not sure if this needs a code fix or some configuration changes to the model. Any additional insight is appreciated.
I am trying to implement the RandomForest algorithm using Apache Spark MLlib. I have the dataset in CSV format with the following features:
DayOfWeek(int),AlertType(String),Application(String),Router(String),Symptom(String),Action(String)
0,Network1,App1,Router1,Not reachable,YES
0,Network1,App2,Router5,Not reachable,NO
I want to use MLlib's RandomForest to do prediction on the last field, Action, and I want the response as YES/NO.
I am following code from GitHub to create the RandomForest model. Since all of my features are categorical except one int feature, I have used the following code to convert them into JavaRDD<LabeledPoint> - is any of that wrong?
// Load and parse the data file.
JavaRDD<String> data = jsc.textFile("/tmp/xyz/data/training-dataset.csv");
// I have 14 features so giving 14 as arg to the following
final HashingTF tf = new HashingTF(14);
// Create LabeledPoint datasets for Actionable and nonactionable
JavaRDD<LabeledPoint> labledData = data.map(new Function<String, LabeledPoint>() {
    @Override
    public LabeledPoint call(String alert) {
        List<String> featureList = Arrays.asList(alert.trim().split(","));
        String actionType = featureList.get(featureList.size() - 1).toLowerCase();
        // compare against the lower-cased label, otherwise every point gets label 0
        return new LabeledPoint(actionType.equals("yes") ? 1 : 0, tf.transform(featureList));
    }
});
Similarly to the above, I create test data and use it in the following code to do the prediction:
JavaPairRDD<Double, Double> predictionAndLabel =
testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
@Override
public Tuple2<Double, Double> call(LabeledPoint p) {
return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
}
});
How do I get the prediction based on my last field, Action, so that the prediction comes out as YES/NO? The current predict method returns a double, and I can't work out how to implement that. Also, am I following the correct approach for turning categorical features into a LabeledPoint? I am new to machine learning and Spark MLlib.
I am more familiar with the Scala version, but I'll try to help.
You need to map the target variable (Action) and all categorical features into levels starting at 0, like 0,1,2,3... For example Router1, Router2, ... Router5 become 0,1,2,...,4. The same goes for your target variable, which I think is the only one you actually mapped: YES/NO to 1/0 (I am not sure what your tf.transform(featureList) is actually doing).
Once you have done this you can train your RandomForest classifier, specifying the map for categorical features. Basically you need to tell it which features are categorical and how many levels they have. This is the Scala version, but you can easily translate it into Java:
val categoricalFeaturesInfo = Map[Int, Int]((2,2),(3,5))
This is basically saying that in your list of features the 3rd one (index 2) has 2 levels ((2,2)) and the 4th one (index 3) has 5 levels ((3,5)). The rest are treated as continuous doubles.
Now you pass the categoricalFeaturesInfo when training the classifier together with the other parameters as:
val modelRF = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
Now when you need to evaluate it, the predict function will return a double (0 or 1) and you can use that to compute accuracy, precision or any metric needed.
This is an example (sorry, Scala again), assuming you have testData where you applied the same transformations as before:
val predictionAndLabels = testData.map { point =>
val prediction = modelRF.predict(point.features)
(point.label, prediction)
}
Here your results are clear, the label as 1/0 and the predicted value is also 1/0, any computation of Accuracy, Precision and Recall is straightforward.
I hope it helps!!
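Since the question uses Java, a hedged Java translation of the Scala snippets above (untested; trainingData is assumed to be a JavaRDD<LabeledPoint> built with the 0-based level encoding described above, and the tuning values are just examples):

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.RandomForestModel;

// 3rd feature (index 2) has 2 levels, 4th feature (index 3) has 5 levels
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
categoricalFeaturesInfo.put(2, 2);
categoricalFeaturesInfo.put(3, 5);

int numClasses = 2;
int numTrees = 10;
String featureSubsetStrategy = "auto";
String impurity = "gini";
int maxDepth = 5;
int maxBins = 32;
int seed = 12345;

RandomForestModel modelRF = RandomForest.trainClassifier(
    trainingData, numClasses, categoricalFeaturesInfo,
    numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed);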
You're heading in the correct direction, and you've already managed to train a model, which is great.
For binary classification it will return either a 0.0 or a 1.0, and it's up to you to map this back to your string values.
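In other words, inside the question's own PairFunction you can translate the returned double back to the original label; a tiny hedged sketch:

double prediction = model.predict(p.features());
String action = (prediction == 1.0) ? "YES" : "NO"; // map the numeric class back to the label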
Say I have timestamped values for specific users in text files, like
#userid; unix-timestamp; value
1; 2010-01-01 00:00:00; 10
2; 2010-01-01 00:00:00; 20
1; 2010-01-01 01:00:00; 11
2; 2010-01-01 01:00:00; 21
1; 2010-01-02 00:00:00; 12
2; 2010-01-02 00:00:00; 22
I have a custom class "SessionSummary" implementing readFields and write of WritableComparable. Its purpose is to sum up all values per user for each calendar day.
So the mapper maps the lines to each user, and the reducer sums all values per day per user and outputs a SessionSummary via TextOutputFormat (using SessionSummary's toString, as tab-separated UTF-8 strings):
1; 2010-01-01; 21
2; 2010-01-01; 41
1; 2010-01-02; 12
2; 2010-01-02; 22
If I need to use these summary-entries for a second Map/Reduce stage, how should I parse this summary data to populate the members? Can I reuse the existing readFields and write-methods (of the WritableComparable interface implementation) by using the text String as DataInput somehow? This (obviously) did not work:
public void map(...) {
SessionSummary ssw = new SessionSummary();
ssw.readFields(new DataInputStream(new ByteArrayInputStream(value.getBytes("UTF-8"))));
}
In general: Is there a best practice to implement custom keys and values in Hadoop and make them easily reusable across several M/R stages, while keeping human-readable text output at every stage?
(Hadoop version is 0.20.2 / CDH3u3)
The output format for your first MR job should be SequenceFileOutputFormat - this will store the key/value output from the reducer in a binary format that can then be read back in by your second MR job using SequenceFileInputFormat. Also make sure you set the outputKeyClass and outputValueClass on the Job accordingly.
The mapper in the second job then receives SessionSummary (and whatever the value type is) as its input key and value types directly, with no text parsing needed.
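A hedged sketch of the relevant job wiring (new mapreduce API; class names, paths and the value type are placeholders):

// First job: write binary SequenceFile output keyed by SessionSummary
Job first = new Job(conf, "daily-session-summary");
first.setOutputFormatClass(SequenceFileOutputFormat.class);
first.setOutputKeyClass(SessionSummary.class);
first.setOutputValueClass(LongWritable.class); // whatever your real value type is
FileOutputFormat.setOutputPath(first, new Path("summaries"));

// Second job: read the SequenceFile back and receive SessionSummary objects directly
Job second = new Job(conf, "second-stage");
second.setInputFormatClass(SequenceFileInputFormat.class);
FileInputFormat.addInputPath(second, new Path("summaries"));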
If you need to see the textual output from the first MR job, you can run the following on the output files in HDFS:
hadoop fs -libjars my-lib.jar -text output-dir/part-r-*
This will read in the sequence file key/value pairs and call toString() on both objects, tab-separating them when outputting to stdout. The -libjars option specifies where Hadoop can find your custom key/value classes.