Predict on user input data with Spark ML Pipeline in Java

I have built a model using Spark ML in Java with a dataset that was split into training and test sets. How can I use this model on real-time data, such as input from a user?
Tokenizer tokenizer = new Tokenizer()
.setInputCol("col1")
.setOutputCol("tokens");
CountVectorizer countVectorizer = new CountVectorizer()
.setInputCol("tokens")
.setOutputCol("features")
.setMinDF(1);
RandomForestClassifier rf = new RandomForestClassifier()
.setLabelCol("y_value")
.setFeaturesCol("features");
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{tokenizer, countVectorizer, rf});
PipelineModel model = pipeline.fit(training);
Dataset<Row> predictions = model.transform(test);
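One possible way (a minimal sketch, assuming the fitted PipelineModel above and an existing SparkSession named spark, neither of which is shown in the question) is to wrap the user's text in a one-row Dataset with the same input column name, "col1", and run it through model.transform:
import java.util.Collections;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
// "spark" is the existing SparkSession and "model" the fitted PipelineModel from above.
StructType inputSchema = new StructType(new StructField[]{
    DataTypes.createStructField("col1", DataTypes.StringType, false)
});
Dataset<Row> userInput = spark.createDataFrame(
    Collections.singletonList(RowFactory.create("text typed by the user")),
    inputSchema);
// The fitted Tokenizer and CountVectorizerModel inside the PipelineModel are re-applied,
// so the features are built exactly as they were at training time.
Dataset<Row> userPrediction = model.transform(userInput);
userPrediction.select("prediction").show();
The label column ("y_value") is not needed here; transform only requires the columns consumed by the pipeline's first stage.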

Related

How to predict out of sample values using Spark (Java API)

I'm quite new to Spark and I need to use the Java API. Our goal is to serve predictions on the fly, where the user provides some of the variables but, of course, not the label or target variable.
However, the model seems to need the data split into training and test sets for training and validation.
How can I get predictions and the RMSE for the out-of-sample data that the user will query on the fly?
Dataset<Row>[] splits = df.randomSplit(new double[] {0.99, 0.1});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = df_p;
My out-of-sample data has the following format (where 0s indicate data the user cannot provide):
IMO,PORT_ID,DWT,TERMINAL_ID,BERTH_ID,TIMESTAMP,label,OP_ID
0000000,1864,80000.00,5689,6060,2020-08-29 00:00:00.000,1,2
'label' is the result I want to predict.
This is how I use the model:
// Train a GBT model.
GBTRegressor gbt = new GBTRegressor()
.setLabelCol("label")
.setFeaturesCol("features")
.setMaxIter(10);
// Chain indexer and GBT in a Pipeline.
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {assembler, gbt, discretizer});
// Train model. This also runs the indexer.
PipelineModel model = pipeline.fit(trainingData);
// Make predictions.
Dataset<Row> predictions = model.transform(testData);
// Select example rows to display.
predictions.select("prediction", "label", "weekofyear", "dayofmonth", "month", "year", "features").show(150);
// Select (prediction, true label) and compute test error.
RegressionEvaluator evaluator = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("rmse");
double rmse = evaluator.evaluate(predictions);
System.out.println("Root Mean Squared Error (RMSE) on test data = " + rmse);

How to predict multiple attributes in Java using the Weka API?

I have a question about how to make predictions for multiple attributes using Weka in Java.
I have this dataset:
MONTH,LOC,CLASS,METHOD,MGOD,CGOD
1,2115,9,192,1,1
2,2115,9,192,1,1
3,2115,9,192,1,1
4,2387,9,210,2,1
5,2356,9,208,2,1
6,2356,9,208,2,1
7,2510,9,219,2,2
8,2348,9,206,2,1
9,2356,9,206,2,1
10,2356,9,206,2,1
11,2051,7,172,2,0
12,2051,7,172,2,0
13,2048,7,172,2,0
14,2048,7,172,2,0
15,2083,7,173,1,0
16,2083,7,173,1,0
17,2143,7,171,1,0
18,2143,7,171,1,0
19,1909,7,155,1,0
20,1909,7,155,1,0
21,1909,7,155,1,0
22,1909,7,155,1,0
23,1909,7,155,1,0
24,1820,6,156,1,0
25,1826,6,157,1,0
26,1826,6,157,1,0
27,1826,6,157,1,0
I would like to make a prediction for month 28 based on the previous months.
My code:
DataSource ds = new DataSource("src/main/java/dataset.arff");
Instances inst = ds.getDataSet();
inst.setClassIndex(1);
LinearRegression nb = new LinearRegression();
nb.buildClassifier(inst);
Instance novo = new DenseInstance(6);
novo.setDataset(inst);
novo.setValue(0, 28);
double prediction[] = nb.distributionForInstance(novo);
System.out.println("Prediction: "+Math.round(prediction[0]));

Testing an OpenNLP classifier model

I'm currently training a model for a classifier. Yesterday I found out that it is good practice to also evaluate the trained classification model. I tried searching the internet for how to test a model: testing an OpenNLP model. But I can't get it to work. I think the reason is that I'm using OpenNLP version 1.8.3 instead of 1.5. Could anyone explain how to properly test my model in this version of OpenNLP?
Thanks in advance.
Below is how I'm training my model:
public static DoccatModel trainClassifier() throws IOException
{
// read the training data
final int iterations = 100;
InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("src/main/resources/trainingSets/trainingssetTest.txt"));
ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
// define the training parameters
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, iterations+"");
params.put(TrainingParameters.CUTOFF_PARAM, 0+"");
params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);
// create a model from traning data
DoccatModel model = DocumentCategorizerME.train("NL", sampleStream, params, new DoccatFactory());
return model;
}
I can think of two ways to test your model. Either way, you will need annotated documents (and by annotated I really mean expert-classified).
The first way involves using the opennlp DoccatEvaluator command-line tool. The syntax would be something akin to
opennlp DoccatEvaluator -model model -data sampleData
The format of your sampleData should be
OUTCOME <document text....>
with documents separated by newline characters.
The second way involves creating a DocumentCategorizer (where model is the DoccatModel from your question). Something like:
DocumentCategorizer categorizer = new DocumentCategorizerME(model);
// could also use: Tokenizer tokenizer = new TokenizerME(tokenizerModel)
Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
// linesample is an ObjectStream<String>, like in your question...
for (String sample = linesample.read(); sample != null; sample = linesample.read()) {
String[] tokens = tokenizer.tokenize(sample);
double[] outcomeProb = categorizer.categorize(tokens);
String sampleOutcome = categorizer.getBestCategory(outcomeProb);
// check if the outcome is right...
// keep track of # right and wrong...
}
// calculate agreement metric of your choice
Since I typed the code here, there may be a syntax error or two (which either I or the SO community can fix), but the idea of running through your data, tokenizing, passing it through the document categorizer and keeping track of the results is how you want to evaluate your model.
Hope it helps...
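A third option worth mentioning is the built-in DocumentCategorizerEvaluator, which consumes the same DocumentSample stream format used for training and reports accuracy directly. A sketch (the evaluation file path is hypothetical; it should point at held-out, expert-classified samples in the same "CATEGORY<whitespace>document text" line format as the training set):
import java.io.File;
import java.io.IOException;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerEvaluator;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public static double evaluateClassifier(DoccatModel model) throws IOException {
    // Held-out data, never seen during training (file name is just an example).
    InputStreamFactory dataIn = new MarkableFileInputStreamFactory(
            new File("src/main/resources/trainingSets/evaluationSet.txt"));
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

    // Wrap the model in a categorizer and let the evaluator run through all samples.
    DocumentCategorizerEvaluator evaluator =
            new DocumentCategorizerEvaluator(new DocumentCategorizerME(model));
    evaluator.evaluate(sampleStream);
    return evaluator.getAccuracy();
}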

How to work with Java Apache Spark MLlib when DataFrame has columns?

So I'm new to Apache Spark and I have a file that looks like this:
Name Size Records
File1 1,000 104,370
File2 950 91,780
File3 1,500 109,123
File4 2,170 113,888
File5 2,000 111,974
File6 1,820 110,666
File7 1,200 106,771
File8 1,500 108,991
File9 1,000 104,007
File10 1,300 107,037
File11 1,900 111,109
File12 1,430 108,051
File13 1,780 110,006
File14 2,010 114,449
File15 2,017 114,889
This is my sample/test data. I'm working on an anomaly detection program and I have to test other files with the same format but different values, and detect which ones have anomalies in the size and records values (if the size/records of another file differ a lot from the standard ones, or if size and records are not proportional to each other). I decided to start trying different ML algorithms, and I wanted to begin with the k-means approach. I tried passing this file to the following line:
KMeansModel model = kmeans.fit(file);
file has already been parsed into a Dataset variable. However, I get an error and I'm pretty sure it has to do with the structure/schema of the file. Is there a way to work with structured/labeled/organized data when trying to fit it to a model?
I get the following error: Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
And this is the code:
public class practice {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("Anomaly Detection").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SparkSession spark = SparkSession
.builder()
.appName("Anomaly Detection")
.getOrCreate();
String day1 = "C:\\Users\\ZK0GJXO\\Documents\\day1.txt";
Dataset<Row> df = spark.read().
option("header", "true").
option("delimiter", "\t").
csv(day1);
df.show();
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(df);
}
}
Thanks
By default all Spark ML models train on a column called "features". One can specify a different input column name via the setFeaturesCol method http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/clustering/KMeans.html#setFeaturesCol(java.lang.String)
update:
One can combine multiple columns into a single feature vector using VectorAssembler:
VectorAssembler assembler = new VectorAssembler()
.setInputCols(new String[]{"size", "records"})
.setOutputCol("features");
Dataset<Row> vectorized_df = assembler.transform(df);
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(vectorized_df);
One can further streamline and chain these feature transformations with the pipeline API https://spark.apache.org/docs/latest/ml-pipeline.html#example-pipeline
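For example, a pipeline version of the snippet above might look like the following sketch (same column names and the same df as before):
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[]{"size", "records"})
    .setOutputCol("features");
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
// The pipeline runs the assembler first, so fit() can be called directly on the raw DataFrame.
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{assembler, kmeans});
PipelineModel pipelineModel = pipeline.fit(df);
// transform() adds the "features" column and a "prediction" column with the cluster id.
Dataset<Row> clustered = pipelineModel.transform(df);
clustered.show();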

Spark: Running multiple queries on multiple files, optimization

I am using Spark 1.5.0.
I have a set of files on S3 containing JSON data in sequence file format, around 60 GB in total. I have to run around 40 queries on this dataset and store the results back to S3.
All queries are select statements with a condition on the same field, e.g. select a,b,c from t where event_type='alpha', select x,y,z from t where event_type='beta', etc.
I am using an AWS EMR 5-node cluster with 2 core nodes and 2 task nodes.
Some fields could be missing in the input, e.g. a could be missing, so the first query, which selects a, would fail. To avoid this I have defined a schema for each event_type. So for event_type alpha, the schema would be like {"a": "", "b": "", "c": "", "event_type": ""}.
Based on the schema defined for each event, I'm creating a DataFrame from the input RDD for each event with the corresponding schema.
I'm using the following code:
JavaPairRDD<LongWritable,BytesWritable> inputRDD = jsc.sequenceFile(bucket, LongWritable.class, BytesWritable.class);
JavaRDD<String> events = inputRDD.map(
new Function<Tuple2<LongWritable,BytesWritable>, String>() {
public String call(Tuple2<LongWritable,BytesWritable> tuple) throws JSONException, UnsupportedEncodingException {
String valueAsString = new String(tuple._2.getBytes(), "UTF-8");
JSONObject data = new JSONObject(valueAsString);
JSONObject payload = new JSONObject(data.getString("payload"));
return payload.toString();
}
}
);
events.cache();
for (String event_type: events_list) {
String query = //read query from another s3 file event_type.query
String jsonSchemaString = //read schema from another s3 file event_type.json
List<String> jsonSchema = Arrays.asList(jsonSchemaString);
JavaRDD<String> jsonSchemaRDD = jsc.parallelize(jsonSchema);
DataFrame df_schema = sqlContext.read().option("header", "true").json(jsonSchemaRDD);
StructType schema = df_schema.schema();
DataFrame df_query = sqlContext.read().schema(schema).option("header", "true").json(events);
df_query.registerTempTable(tableName);
DataFrame df_results = sqlContext.sql(query);
df_results.write().format("com.databricks.spark.csv").save("s3n://some_location");
}
This code is very inefficient; it takes around 6-8 hours to run. How can I optimize it?
Should I try using HiveContext?
I think the current code is making multiple passes over the data, though I'm not sure, since I have cached the RDD. How can I do it in a single pass, if that is the case?
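One idea that might help (a sketch only, not benchmarked on 1.5.0): parse the JSON into a DataFrame once with a merged superset schema, cache that DataFrame, register it a single time, and run only the lightweight select queries inside the loop, so the expensive sequence file read and JSON parsing happen in one pass. mergeEventSchemas and readQueryFromS3 below are hypothetical helpers standing in for the schema-merging and S3 lookups already present in the original code:
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.StructType;
// Hypothetical helper: merge the per-event schemas into one superset StructType,
// so missing fields come back as null instead of breaking a query.
StructType mergedSchema = mergeEventSchemas(eventSchemas);
// Parse the JSON once, cache the parsed DataFrame and register it a single time.
DataFrame allEvents = sqlContext.read().schema(mergedSchema).json(events);
allEvents.cache();
allEvents.registerTempTable("t");
// Each of the ~40 queries now only scans the cached, already-parsed DataFrame.
for (String event_type : events_list) {
    String query = readQueryFromS3(event_type); // hypothetical helper, same lookup as before
    DataFrame df_results = sqlContext.sql(query);
    df_results.write()
        .format("com.databricks.spark.csv")
        .save("s3n://some_location/" + event_type);
}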
