I'm quite new to Spark and I would like to extract features (basically count of words) from a text file using the Dataset class. I have read the "Extracting, transforming and selecting features" tutorial on Spark but every example reported starts from a bag of words defined "on the fly". I have tried several times to generate the same kind of Dataset starting from a text file but I have always failed. Here is my code:
SparkSession spark = SparkSession
.builder()
.appName("Simple application")
.config("spark.master", "local")
.getOrCreate();
Dataset<String> textFile = spark.read()
.textFile("myFile.txt")
.as(Encoders.STRING());
Dataset<Row> words = textFile.flatMap(s -> {
return Arrays.asList(s.toLowerCase().split("AG")).iterator();
}, Encoders.STRING()).filter(s -> !s.isEmpty()).toDF();
Word2Vec word2Vec = new Word2Vec()
.setInputCol("value")
.setOutputCol("result")
.setVectorSize(16)
.setMinCount(0);
Word2VecModel model = word2Vec.fit(words);
Dataset<Row> result = model.transform(words);
I get this error message: Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column value must be of type equal to one of the following types: [ArrayType(StringType,true), ArrayType(StringType,false)] but was actually of type StringType.
I think I have to convert each line into a Row using something like:
RowFactory.create(0.0, line)
but I cannot figure out how to do that.
Basically, I was trying to train a classification system based on word counts of strings generated from a long sequence of characters. My text file contains one sequence per line, so I need to split and count them for each row (the sub-strings are called k-mers and a general description can be found here). Depending on the length of the k-mers I could have more than 4^32 different strings, so I was looking for a scalable machine learning framework like Spark.
If you just want to count occurrences of words, you can do:
Dataset<String> words = textFile.flatMap(s -> {
return Arrays.asList(s.toLowerCase().split("AG")).iterator();
}, Encoders.STRING()).filter(s -> !s.isEmpty());
Dataset<Row> counts = words.toDF("word").groupBy(col("word")).count();
Word2Vec is a much more powerful ML algorithm; in your case it's not necessary to use it. Remember to add import static org.apache.spark.sql.functions.*; at the beginning of the file.
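If you do still want to feed Word2Vec, note that the exception is telling you its input column must be an array of strings rather than a single string per row. A minimal sketch of one way to build that (my own addition, not part of the original answer), keeping one row per input line and reusing your word2Vec definition:
import static org.apache.spark.sql.functions.*;

// Keep one row per line and turn the line into an array-of-strings column,
// which is the ArrayType(StringType) input that Word2Vec expects.
Dataset<Row> sentences = textFile.toDF("line")
    .withColumn("value", split(lower(col("line")), "AG"))
    .drop("line");
// Note: split() can leave empty tokens in the array; filter them out if they matter.

Word2VecModel model = word2Vec.fit(sentences);   // no more type error on "value"
Dataset<Row> result = model.transform(sentences);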
I have a TensorFlow model trained in Python by following this article. After training I have generated the frozen graph. Now I need to use this graph to generate recognition in a Java-based application.
For this I was looking at the following example. However, I failed to understand how to collect my output. I know that I need to provide 3 inputs to the graph.
From the example given in the official tutorial, I have read the code, which is based on Python.
def run_graph(wav_data, labels, input_layer_name, output_layer_name,
              num_top_predictions):
    """Runs the audio data through the graph and prints predictions."""
    with tf.Session() as sess:
        # Feed the audio data as input to the graph.
        # predictions will contain a two-dimensional array, where one
        # dimension represents the input image count, and the other has
        # predictions per class
        softmax_tensor = sess.graph.get_tensor_by_name(output_layer_name)
        predictions, = sess.run(softmax_tensor, {input_layer_name: wav_data})

        # Sort to show labels in order of confidence
        top_k = predictions.argsort()[-num_top_predictions:][::-1]
        for node_id in top_k:
            human_string = labels[node_id]
            score = predictions[node_id]
            print('%s (score = %.5f)' % (human_string, score))
        return 0
Can someone help me understand the TensorFlow Java API?
The literal translation of the Python code you listed above would be something like this:
public static float[][] getPredictions(Session sess, byte[] wavData, String inputLayerName, String outputLayerName) {
    try (Tensor<String> wavDataTensor = Tensors.create(wavData);
         Tensor<Float> predictionsTensor = sess.runner()
             .feed(inputLayerName, wavDataTensor)
             .fetch(outputLayerName)
             .run()
             .get(0)
             .expect(Float.class)) {
        float[][] predictions = new float[(int) predictionsTensor.shape()[0]][(int) predictionsTensor.shape()[1]];
        predictionsTensor.copyTo(predictions);
        return predictions;
    }
}
The returned predictions array will have the "confidence" values of each of the predictions, and you'll have to run the logic to compute the "top K" on it similar to how the Python code is using numpy (.argsort()) to do so on what sess.run() returned.
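For illustration, here is a rough sketch of that top-K step in plain Java (my own addition, assuming you take the first row of the returned predictions array and have already loaded the labels file into a list):
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sort label indices by descending confidence, like argsort()[::-1] in the Python code.
public static void printTopK(float[] predictions, List<String> labels, int k) {
    List<Integer> indices = new ArrayList<>();
    for (int i = 0; i < predictions.length; i++) {
        indices.add(i);
    }
    indices.sort(Comparator.comparing((Integer i) -> predictions[i]).reversed());
    for (int i = 0; i < Math.min(k, indices.size()); i++) {
        int nodeId = indices.get(i);
        System.out.printf("%s (score = %.5f)%n", labels.get(nodeId), predictions[nodeId]);
    }
}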
From a cursory reading of the tutorial page and code, it seems that predictions will have 1 row and 12 columns (one for each hotword). I got this from the following Python code:
import tensorflow as tf
graph_def = tf.GraphDef()
with open('/tmp/my_frozen_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())
output_layer_name = 'labels_softmax:0'
tf.import_graph_def(graph_def, name='')
print(tf.get_default_graph().get_tensor_by_name(output_layer_name).shape)
Hope that helps.
I am attempting to implement the Apriori algorithm using Hadoop. I have already implemented a non-distributed version of the Apriori algorithm, but my lack of familiarity with Hadoop and MapReduce has presented a number of concerns.
The way I want to implement the algorithm is in two phases:
1) In the first phase, the MapReduce job will operate on the original transaction dataset. The output of this phase is a file containing all of the 1-itemsets and their support counts.
2) In the second phase, I want to read in the output of the previous phase and then construct the new itemsets. Importantly, I want to then, in the mapper, determine if any of the new itemsets are still found in the dataset. I imagine that if I send the original dataset as the input to the mapper, it will partition the original file so that each mapper only scans through a partial dataset. The candidate list however needs to be constructed from all of the previous phase's output. This will then iterate in a loop for a fixed number of passes.
My problem is figuring out how to specifically ensure that I can access the full itemsets in each mapper, as well as being able to access the original dataset to calculate the new support in each phase.
Thanks for any advice, comments, suggestions or answers.
EDIT: Based on the feedback, I just want to be more specific about what I'm asking here.
Before you start, I suggest you read the Hadoop Map-Reduce Tutorial.
Step 1:
Load your data file to HDFS. Let's assume your data is a text file and each set is a line.
a b c
a c d e
a e f
a f z
...
Step 2:
Follow the Map-Reduce Tutorial to build your own Apriori Class.
public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
    // Separate the line into tokens by space
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        // Add the token into a writable set
        ... put the element into a writable set ...
    }
    context.write(word, one);
}
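The reduce side isn't shown above; presumably it is the standard summing reducer from the same tutorial. A sketch of what that would look like (my assumption, not part of the original answer):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the ones emitted by the mapper to get the total count per itemset.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}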
Step 3:
Run the MapReduce jar file. The output will be in a file in HDFS.
You will have something like:
a b 3 (number of occurrences)
a b c 5
a d 2
...
Based on the output file, you could calculate the relationship.
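To come back to the original question of how each mapper can see the full candidate list from the previous phase while still scanning only its own split of the original dataset, one common option is Hadoop's distributed cache. A rough, untested sketch (the paths and class names here are hypothetical):
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In the driver: job.addCacheFile(new URI("/output/phase1/part-r-00000"));
public class CandidateCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Set<String> candidates = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The cached file is localized into the task's working directory under its base name.
        URI[] cached = context.getCacheFiles();
        File local = new File(new Path(cached[0].getPath()).getName());
        try (BufferedReader reader = new BufferedReader(new FileReader(local))) {
            String line;
            while ((line = reader.readLine()) != null) {
                candidates.add(line.split("\t")[0]); // first column: the itemset, second: its support
            }
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each mapper sees only its split of the original transactions,
        // but checks every candidate itemset against each transaction.
        Set<String> items = new HashSet<>(Arrays.asList(value.toString().trim().split("\\s+")));
        for (String candidate : candidates) {
            boolean containsAll = true;
            for (String item : candidate.split("\\s+")) {
                if (!items.contains(item)) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) {
                context.write(new Text(candidate), ONE);
            }
        }
    }
}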
On a related note, you might want to consider using a higher-level abstraction than MapReduce, such as Cascading or Apache Spark.
I implemented the AES algorithm in both Apache Spark and Hadoop MapReduce using Hadoop Streaming.
I know it is not the same as Apriori, but you can try to use my approach.
Simple example of AES implemented using Hadoop Streaming MapReduce.
Project structure for AES Hadoop Streaming
1n_reducer.py / 1n_combiner is the same code but without the constraint.
import sys

CONSTRAINT = 1000


def do_reduce(word, _values):
    return word, sum(_values)


prev_key = None
values = []

for line in sys.stdin:
    key, value = line.split("\t")
    if key != prev_key and prev_key is not None:
        result_key, result_value = do_reduce(prev_key, values)
        if result_value > CONSTRAINT:
            print(result_key + "\t" + str(result_value))
        values = []
    prev_key = key
    values.append(int(value))

if prev_key is not None:
    result_key, result_value = do_reduce(prev_key, values)
    if result_value > CONSTRAINT:
        print(result_key + "\t" + str(result_value))
base_mapper.py:
import sys


def count_usage():
    for line in sys.stdin:
        elements = line.rstrip("\n").rsplit(",")
        for item in elements:
            print("{item}\t{count}".format(item=item, count=1))


if __name__ == "__main__":
    count_usage()
2n_mapper.py uses the result of the previous iteration.
In answer to your question, you can read the output of the previous iteration to form itemsets in this way.
import itertools
import sys

sys.path.append('.')

N_DIM = 2


def get_2n_items():
    items = set()
    with open("part-00000") as inf:
        for line in inf:
            parts = line.split('\t')
            if len(parts) > 1:
                items.add(parts[0])
    return items


def count_usage_of_2n_items():
    all_items_set = get_2n_items()
    for line in sys.stdin:
        items = line.rstrip("\n").rsplit(",")  # 74743 43355 53554
        exist_in_items = set()
        for item in items:
            if item in all_items_set:
                exist_in_items.add(item)
        for combination in itertools.combinations(exist_in_items, N_DIM):
            combination = sorted(combination)
            print("{el1},{el2}\t{count}".format(el1=combination[0], el2=combination[1], count=1))


if __name__ == "__main__":
    count_usage_of_2n_items()
From my experience, the Apriori algorithm is not suitable for Hadoop if the number of unique combinations (itemsets) is too large (100K+).
If you find an elegant solution for an Apriori implementation using Hadoop MapReduce (Streaming or Java MapReduce), please share it with the community.
PS: if you need more code snippets, please ask.
I'm trying to use a TensorFlow model that I trained in Python to score data in Scala (using the TF Java API). For the model, I've used this regression example, with the only change being that I dropped as_text=True from export_savedmodel.
My snippet of Scala:
val b = SavedModelBundle.load("/tensorflow/tf-estimator-tutorials/trained_models/reg-model-01/export/1531933435/", "serve")
val s = b.session()
// output = predictor_fn({'csv_rows': ["0.5,1,ax01,bx02", "-0.5,-1,ax02,bx02"]})
val input = "0.5,1,ax01,bx02"
val inputTensor = Tensor.create(input.getBytes("UTF-8"))
val result = s.runner()
.feed("csv_rows", inputTensor)
.fetch("dnn/logits/BiasAdd")
.run()
.get(0)
When I run, I get the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Input to reshape is a tensor with 2 values, but the requested shape has 4
[[Node: dnn/input_from_feature_columns/input_layer/alpha_indicator/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _output_shapes=[[?,2]], _device="/job:localhost/replica:0/task:0/device:CPU:0"](dnn/input_from_feature_columns/input_layer/alpha_indicator/Sum, dnn/input_from_feature_columns/input_layer/alpha_indicator/Reshape/shape)]]
at org.tensorflow.Session.run(Native Method)
at org.tensorflow.Session.access$100(Session.java:48)
at org.tensorflow.Session$Runner.runHelper(Session.java:298)
at org.tensorflow.Session$Runner.run(Session.java:248)
I figure that there's a problem with how I've prepared my input Tensor, but I'm stuck on how to best debug this.
The error message suggests that the shape of the input tensor in some operation isn't what is expected.
Looking at the Python notebook you linked to (particularly section 8a and 8c), it seems that the input tensor is expected to be a "batch" of string tensors, not a single string tensor.
You can observe this by comparing the shapes of the tensors in your Scala and Python programs (inputTensor.shape() in Scala vs. the shape of csv_rows provided to predict_fn in the Python notebook).
From that, it seems what you want is for inputTensor to be a vector of strings, not a single scalar string. To do that, you'd want to do something like:
val input = Array("0.5,1,ax01,bx02")
val inputTensor = Tensor.create(input.map(x => x.getBytes("UTF-8")))
Hope that helps
I am trying to implement the RandomForest algorithm using Apache Spark MLlib. I have the dataset in CSV format with the following features:
DayOfWeek(int),AlertType(String),Application(String),Router(String),Symptom(String),Action(String)
0,Network1,App1,Router1,Not reachable,YES
0,Network1,App2,Router5,Not reachable,NO
I want to use MLlib's RandomForest to do prediction on the last field, Action, and I want the response as YES/NO.
I am following code from GitHub to create the RandomForest model. Since I have all categorical features except one int feature, I have used the following code to convert them into JavaRDD<LabeledPoint> - is any of that wrong?
// Load and parse the data file.
JavaRDD<String> data = jsc.textFile("/tmp/xyz/data/training-dataset.csv");
// I have 14 features so giving 14 as arg to the following
final HashingTF tf = new HashingTF(14);
// Create LabeledPoint datasets for Actionable and nonactionable
JavaRDD<LabeledPoint> labledData = data.map(new Function<String, LabeledPoint>() {
@Override public LabeledPoint call(String alert) {
List<String> featureList = Arrays.asList(alert.trim().split(","));
String actionType = featureList.get(featureList.size() - 1).toLowerCase();
return new LabeledPoint(actionType.equals("YES")? 1 : 0, tf.transform(featureList));
}
});
Similarly to the above, I create test data and use it in the following code to do prediction:
JavaPairRDD<Double, Double> predictionAndLabel =
testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
@Override
public Tuple2<Double, Double> call(LabeledPoint p) {
return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
}
});
How do I get a prediction based on my last field, Action, with the prediction coming back as YES/NO? The current predict method returns a double, and I am not able to understand how to implement this. Also, am I following the correct approach for turning categorical features into a LabeledPoint? I am new to machine learning and Spark MLlib.
I am more familiar with the Scala version, but I'll try to help.
You need to map the target variable (Action) and all categorical features into levels starting at 0, like 0,1,2,3... For example router1, router2, ..., router5 become 0,1,2,...,4. The same goes for your target variable, which I think was the only one you actually mapped, yes/no to 1/0 (I am not sure what your tf.transform(featureList) is actually doing).
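For illustration, a hypothetical sketch of that mapping in Java, replacing HashingTF with explicit lookup tables (the level dictionaries below are placeholders built from the sample rows in the question; you would build them from your own data, and data is the JavaRDD<String> from your snippet):
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

// Placeholder dictionaries mapping each categorical value to a level starting at 0.
Map<String, Integer> alertLevels = new HashMap<>();
alertLevels.put("Network1", 0);
Map<String, Integer> appLevels = new HashMap<>();
appLevels.put("App1", 0);
appLevels.put("App2", 1);
Map<String, Integer> routerLevels = new HashMap<>();
routerLevels.put("Router1", 0);
routerLevels.put("Router5", 1);
Map<String, Integer> symptomLevels = new HashMap<>();
symptomLevels.put("Not reachable", 0);

JavaRDD<LabeledPoint> labeledData = data.map(line -> {
    List<String> f = Arrays.asList(line.trim().split(","));
    // Last column (Action) becomes the label: YES -> 1.0, NO -> 0.0.
    double label = f.get(5).equalsIgnoreCase("YES") ? 1.0 : 0.0;
    double[] features = new double[] {
        Double.parseDouble(f.get(0)),   // DayOfWeek, already numeric
        alertLevels.get(f.get(1)),      // AlertType
        appLevels.get(f.get(2)),        // Application
        routerLevels.get(f.get(3)),     // Router
        symptomLevels.get(f.get(4))     // Symptom
    };
    return new LabeledPoint(label, Vectors.dense(features));
});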
Once you have done this you can train your RandomForest classifier, specifying the map for categorical features. Basically you need to tell it which features are categorical and how many levels they have. This is the Scala version, but you can easily translate it into Java:
val categoricalFeaturesInfo = Map[Int, Int]((2,2),(3,5))
This is basically saying that in your list of features the 3rd one (index 2) has 2 levels (2,2) and the 4th one (index 3) has 5 levels (3,5). The rest are treated as continuous Doubles.
Now you pass the categoricalFeaturesInfo when training the classifier together with the other parameters as:
val modelRF = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
Now when you need to evaluate it, the predict function will return a double (0 or 1), and you can use that to compute accuracy, precision, or any metric needed.
This is an example (sorry, Scala again), assuming you have a testData set where you did the same transformations as before:
val predictionAndLabels = testData.map { point =>
val prediction = modelRF.predict(point.features)
(point.label, prediction)
}
Here your results are clear: the label is 1/0 and the predicted value is also 1/0, so any computation of accuracy, precision, and recall is straightforward.
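Since your own code is in Java, here is a rough, untested translation of the Scala snippets above. trainingData and testData are assumed to be JavaRDD<LabeledPoint> built with your own level mapping, and the tree parameters are just example values:
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.RandomForestModel;

import scala.Tuple2;

// Same map as the Scala example: feature index -> number of levels.
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
categoricalFeaturesInfo.put(2, 2);
categoricalFeaturesInfo.put(3, 5);

int numClasses = 2;
int numTrees = 100;                      // example value
String featureSubsetStrategy = "auto";
String impurity = "gini";
int maxDepth = 5;                        // example value
int maxBins = 32;                        // example value
int seed = 12345;

RandomForestModel modelRF = RandomForest.trainClassifier(
    trainingData, numClasses, categoricalFeaturesInfo, numTrees,
    featureSubsetStrategy, impurity, maxDepth, maxBins, seed);

// Pair of (prediction, label) for each test point, ready for accuracy/precision/recall.
JavaPairRDD<Double, Double> predictionAndLabels = testData.mapToPair(
    p -> new Tuple2<Double, Double>(modelRF.predict(p.features()), p.label()));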
I hope it helps!!
You're heading in the correct direction, and you've already managed to train a model, which is great.
For binary classification it will return either a 0.0 or a 1.0, and it's up to you to map this back to your string values.
I'm trying to sum the length of all the lines of a file using Spark.
These lines are formatted as follows:
A1004AX2J2HXGL\tB0007RT9LC\tMay 30, 2005\t3\n\t4\t5.0\tLes carottes sont cuites
To achieve my aim, I tried this code, given in the documentation:
JavaRDD<String> txtFile = sc.textFile(filePath);
JavaRDD<Integer> linesLength = txtFile.map(s -> s.length());
long totalLength = linesLength.reduce((a, b) -> a+b);
However, it doesn't work. For instance, for a 5.8GB text file, it returns 1602633268 when it should return 5897600784.
I suppose it is due to the fact that some lines may contain strange characters which stop the reading of the line.
With good old Java, this problem could be solved with a BufferedReader, as in this case. However, I found no mention of something similar for Spark in the documentation.
How could I proceed?
I know you already found at least part of your problem and answered the question, but I'd like to point out another problem: you are counting characters in this Spark code, but it sounds like you are trying to find the file size in bytes. These are not necessarily the same thing.
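If the goal really is a byte count, a minimal sketch (my addition, assuming UTF-8 encoded text; note that the newline separators stripped by textFile are still not counted):
import java.nio.charset.StandardCharsets;
import org.apache.spark.api.java.JavaRDD;

// Count the UTF-8 bytes of each line instead of its number of characters.
JavaRDD<String> txtFile = sc.textFile(filePath);
JavaRDD<Long> byteLengths = txtFile.map(s -> (long) s.getBytes(StandardCharsets.UTF_8).length);
long totalBytes = byteLengths.reduce((a, b) -> a + b);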
I had it all wrong; it was just an integer overflow. I made it work by changing Integer to Long:
JavaRDD<String> txtFile = sc.textFile(path);
JavaRDD<Long> linesLength = txtFile.map(s -> Long.valueOf(s.length()));
Long totalLength = linesLength.reduce((a, b) -> a + b);