I'm quite new to Spark and I need to use the Java API. Our goal is to serve predictions on the fly, where the user provides a few of the variables, but of course not the label or goal variable.
But the model seems to need the data to be split into training data and test data for training and validation.
How can I get the prediction and the RMSE for the out-of-sample data that the user will query on the fly?
Dataset<Row>[] splits = df.randomSplit(new double[] {0.99, 0.1});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = df_p;
My out-of-sample data has the following format (where 0s represent data the user cannot provide):
IMO,PORT_ID,DWT,TERMINAL_ID,BERTH_ID,TIMESTAMP,label,OP_ID
0000000,1864,80000.00,5689,6060,2020-08-29 00:00:00.000,1,2
'label' is the result I want to predict.
This is how I used the models:
// Train a GBT model.
GBTRegressor gbt = new GBTRegressor()
.setLabelCol("label")
.setFeaturesCol("features")
.setMaxIter(10);
// Chain the assembler, GBT and discretizer in a Pipeline.
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {assembler, gbt, discretizer});
// Train model. This also runs the feature transformers.
PipelineModel model = pipeline.fit(trainingData);
// Make predictions.
Dataset<Row> predictions = model.transform(testData);
// Select example rows to display.
predictions.select("prediction", "label", "weekofyear", "dayofmonth", "month", "year", "features").show(150);
// Select (prediction, true label) and compute test error.
RegressionEvaluator evaluator = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("rmse");
double rmse = evaluator.evaluate(predictions);
System.out.println("Root Mean Squared Error (RMSE) on test data = " + rmse);
Related
I have made a model using Spark ML in Java with a dataset that was split into train and test. I was wondering how to use the model on real-time data, such as user input?
Tokenizer tokenizer = new Tokenizer()
.setInputCol("col1")
.setOutputCol("tokens");
CountVectorizer countVectorizer = new CountVectorizer()
.setInputCol("tokens")
.setOutputCol("features")
.setMinDF(1);
RandomForestClassifier rf = new RandomForestClassifier()
.setLabelCol("y_value")
.setFeaturesCol("features");
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{tokenizer, countVectorizer, rf});
PipelineModel model = pipeline.fit(training);
Dataset<Row> predictions = model.transform(test);
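One way to score live input with that fitted pipeline is to wrap the raw user text in a one-row Dataset whose column name matches the training input column ("col1") and transform it exactly like the test set; the label column ("y_value") is not needed at prediction time. A minimal sketch, assuming spark is your SparkSession and the usual org.apache.spark.sql imports:
StructType inputSchema = new StructType().add("col1", DataTypes.StringType);
Row inputRow = RowFactory.create("text typed by the user"); // hypothetical user input
Dataset<Row> userInput = spark.createDataFrame(Collections.singletonList(inputRow), inputSchema);
Dataset<Row> userPrediction = model.transform(userInput);
userPrediction.select("col1", "prediction").show();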
I'm currently training a model for a classifier. Yesterday I found out that it will be more accurate if you also test the created classification model. I tried searching the internet for how to test a model: testing openNLP model. But I can't get it to work. I think the reason is that I'm using OpenNLP version 1.8.3 instead of 1.5. Could anyone explain how to properly test my model in this version of OpenNLP?
Thanks in advance.
Below is the way I'm training my model:
public static DoccatModel trainClassifier() throws IOException
{
// read the training data
final int iterations = 100;
InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("src/main/resources/trainingSets/trainingssetTest.txt"));
ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
// define the training parameters
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, iterations+"");
params.put(TrainingParameters.CUTOFF_PARAM, 0+"");
params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);
// create a model from training data
DoccatModel model = DocumentCategorizerME.train("NL", sampleStream, params, new DoccatFactory());
return model;
}
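For completeness, a small usage sketch (the model file name is an assumption, and the java.io / opennlp.tools.doccat imports are omitted): train once, persist the DoccatModel, and reload it later so the categorizer does not have to be retrained every time:
public static void saveAndReloadModel() throws IOException
{
    // train and persist the model
    DoccatModel model = trainClassifier();
    try (OutputStream out = new BufferedOutputStream(new FileOutputStream("doccat-model.bin"))) {
        model.serialize(out);
    }
    // reload it later, e.g. in the application that serves classifications
    DoccatModel reloaded = new DoccatModel(new File("doccat-model.bin"));
    DocumentCategorizerME categorizer = new DocumentCategorizerME(reloaded);
    String[] tokens = {"example", "document", "tokens"};
    System.out.println(categorizer.getBestCategory(categorizer.categorize(tokens)));
}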
I can think of two ways to test your model. Either way, you will need to have annotated documents (and by annotated I really mean expert-classified).
The first way involves using the opennlp DoccatEvaluator command-line tool. The syntax would be something akin to
opennlp DoccatEvaluator -model model -data sampleData
The format of your sampleData should be
OUTCOME <document text....>
Documents are separated by the newline character (one document per line).
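For example (hypothetical outcomes and text):
POSITIVE the delivery was fast and the product works perfectly
NEGATIVE the item arrived broken and support never replied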
The second way involves creating a DocumentCategorizer. Something like:
(the model is the DocCat model from your question)
DocumentCategorizer categorizer = new DocumentCategorizerME(model);
// could also use: Tokenizer tokenizer = new TokenizerME(tokenizerModel)
Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
// linesample is like in your question...
for(String sample=linesample.read(); sample != null; sample=linesample.read()){
String[] tokens = tokenizer.tokenize(sample);
double[] outcomeProb = categorizer.categorize(tokens);
String sampleOutcome = categorizer.getBestCategory(outcomeProb);
// check if the outcome is right...
// keep track of # right and wrong...
}
// calculate agreement metric of your choice
Since I typed the code here, there may be a syntax error or two (which I or the SO community can fix), but the idea is the same: run through your data, tokenize each document, pass it through the document categorizer, and keep track of the results. That is how you want to evaluate your model.
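For reference, OpenNLP 1.8.x also ships a programmatic evaluator in opennlp.tools.doccat that does this bookkeeping for you. A minimal sketch, assuming model is your trained DoccatModel and the held-out file (name is an assumption) uses the same one-document-per-line format as the training data:
InputStreamFactory testIn = new MarkableFileInputStreamFactory(new File("src/main/resources/trainingSets/testSet.txt"));
ObjectStream<DocumentSample> testSamples = new DocumentSampleStream(new PlainTextByLineStream(testIn, "UTF-8"));
DocumentCategorizerEvaluator evaluator = new DocumentCategorizerEvaluator(new DocumentCategorizerME(model));
evaluator.evaluate(testSamples);
System.out.println("Accuracy: " + evaluator.getAccuracy());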
Hope it helps...
So I'm new to Apache Spark and I have a file that looks like this:
Name Size Records
File1 1,000 104,370
File2 950 91,780
File3 1,500 109,123
File4 2,170 113,888
File5 2,000 111,974
File6 1,820 110,666
File7 1,200 106,771
File8 1,500 108,991
File9 1,000 104,007
File10 1,300 107,037
File11 1,900 111,109
File12 1,430 108,051
File13 1,780 110,006
File14 2,010 114,449
File15 2,017 114,889
This is my sample/test data. I'm working on an anomaly detection program and I have to test other files with the same format but different values, and detect which ones have anomalies in the size and records values (if size/records in another file differ a lot from the standard one, or if size and records are not proportional to each other). I decided to start trying different ML algorithms and I wanted to start with the k-means approach. I tried passing this file to the following line:
KMeansModel model = kmeans.fit(file);
file is already parsed into a Dataset variable. However, I get an error and I'm pretty sure it has to do with the structure/schema of the file. Is there a way to work with structured/labeled/organized data when trying to fit it on a model?
I get the following error: Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
And this is the code:
public class practice {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("Anomaly Detection").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SparkSession spark = SparkSession
.builder()
.appName("Anomaly Detection")
.getOrCreate();
String day1 = "C:\\Users\\ZK0GJXO\\Documents\\day1.txt";
Dataset<Row> df = spark.read().
option("header", "true").
option("delimiter", "\t").
csv(day1);
df.show();
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(df);
}
}
Thanks
By default all Spark ML models train on a column called "features". One can specify a different input column name via the setFeaturesCol method http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/clustering/KMeans.html#setFeaturesCol(java.lang.String)
update:
One can combine multiple columns into a single feature vector using VectorAssembler:
VectorAssembler assembler = new VectorAssembler()
.setInputCols(new String[]{"size", "records"}) // both input columns must be numeric, not strings
.setOutputCol("features");
Dataset<Row> vectorized_df = assembler.transform(df);
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(vectorized_df);
One can further streamline and chain these feature transformations with the pipeline API https://spark.apache.org/docs/latest/ml-pipeline.html#example-pipeline
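For example, a sketch of the same two steps chained together (reusing the assembler above; the size and records columns are assumed to be numeric):
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{assembler, kmeans});
PipelineModel pipelineModel = pipeline.fit(df);
Dataset<Row> clustered = pipelineModel.transform(df); // adds the "features" and "prediction" columns
clustered.show();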
I am using spark 1.5.0.
I have a set of files on S3 containing JSON data in sequence file format, totalling around 60 GB. I have to run around 40 queries on this dataset and store the results back to S3.
All queries are select statements with a condition on the same field, e.g. select a,b,c from t where event_type='alpha', select x,y,z from t where event_type='beta', etc.
I am using an AWS EMR 5-node cluster with 2 core nodes and 2 task nodes.
Some fields could be missing in the input, e.g. a could be missing, so the first query, which selects a, would fail. To avoid this I have defined a schema for each event_type. So, for event_type alpha, the schema would be like {"a": "", "b": "", "c": "", "event_type": ""}.
Based on the schemas defined for each event, I'm creating a dataframe from input RDD for each event with the corresponding schema.
I'm using the following code:
JavaPairRDD<LongWritable,BytesWritable> inputRDD = jsc.sequenceFile(bucket, LongWritable.class, BytesWritable.class);
JavaRDD<String> events = inputRDD.map(
new Function<Tuple2<LongWritable,BytesWritable>, String>() {
public String call(Tuple2<LongWritable,BytesWritable> tuple) throws JSONException, UnsupportedEncodingException {
String valueAsString = new String(tuple._2.getBytes(), "UTF-8");
JSONObject data = new JSONObject(valueAsString);
JSONObject payload = new JSONObject(data.getString("payload"));
return payload.toString();
}
}
);
events.cache();
for (String event_type: events_list) {
String query = //read query from another s3 file event_type.query
String jsonSchemaString = //read schema from another s3 file event_type.json
List<String> jsonSchema = Arrays.asList(jsonSchemaString);
JavaRDD<String> jsonSchemaRDD = jsc.parallelize(jsonSchema);
DataFrame df_schema = sqlContext.read().option("header", "true").json(jsonSchemaRDD);
StructType schema = df_schema.schema();
DataFrame df_query = sqlContext.read().schema(schema).option("header", "true").json(events);
df_query.registerTempTable(tableName);
DataFrame df_results = sqlContext.sql(query);
df_results.write().format("com.databricks.spark.csv").save("s3n://some_location");
}
This code is very inefficient; it takes around 6-8 hours to run. How can I optimize my code?
Should I try using HiveContext?
I think the current code is taking multiple passes over the data, though I'm not sure since I have cached the RDD. How can I do it in a single pass, if that is so?
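Not a definitive fix, but one direction to sketch under your setup (Spark 1.5, the cached events RDD, sqlContext and events_list from above): parse the JSON only once with a schema that is the union of all per-event fields, cache the resulting DataFrame, register it once, and run all 40 queries against that single cached table instead of re-reading events with a different schema on every iteration. buildMergedSchema and readQueryFromS3 below are hypothetical placeholders:
StructType mergedSchema = buildMergedSchema(events_list); // hypothetical: union of all per-event schemas
DataFrame allEvents = sqlContext.read().schema(mergedSchema).json(events);
allEvents.cache(); // parsed once, reused by every query
allEvents.registerTempTable("t");
for (String event_type : events_list) {
    String query = readQueryFromS3(event_type); // hypothetical, as in your loop
    DataFrame df_results = sqlContext.sql(query);
    df_results.write().format("com.databricks.spark.csv").save("s3n://some_location/" + event_type);
}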
I am new to both Jena TDB and SPARQL, so this might be a silly question. I am using tdb-0.9.0 on Windows XP.
I am creating the TDB model for my trail_1.rdf file. My understanding here (correct me if I am wrong) is that the following code will read the given RDF file into a TDB model and also store/load (not sure which is the better word) the model in the given directory D:\Project\Store_DB\data1\tdb:
// open TDB dataset
String directory = "D:\\Project\\Store_DB\\data1\\tdb";
Dataset dataset = TDBFactory.createDataset(directory);
Model tdb = dataset.getDefaultModel();
// read the input file
String source = "D:\\Project\\Store_DB\\tmp\\trail_1.rdf";
FileManager.get().readModel( tdb, source);
tdb.close();
dataset.close();
Is this understanding correct?
As per my understanding, since the model is now stored in the D:\Project\Store_DB\data1\tdb directory, I should be able to run queries on it at some later point in time.
So, to query the TDB store at D:\Project\Store_DB\data1\tdb, I tried the following, but it prints nothing:
String directory = "D:\\Project\\Store_DB\\data1\\tdb" ;
Dataset dataset = TDBFactory.createDataset(directory) ;
Iterator<String> graphNames = dataset.listNames();
while (graphNames.hasNext()) {
String graphName = graphNames.next();
System.out.println(graphName);
}
I also tried this, which also did not print anything:
String directory = "D:\\Project\\Store_DB\\data1\\tdb" ;
Dataset dataset = TDBFactory.createDataset(directory) ;
String sparqlQueryString = "SELECT (count(*) AS ?count) { ?s ?p ?o }" ;
Query query = QueryFactory.create(sparqlQueryString) ;
QueryExecution qexec = QueryExecutionFactory.create(query, dataset) ;
ResultSet results = qexec.execSelect() ;
ResultSetFormatter.out(results) ;
What am I doing incorrect? Is there anything wrong with my understanding that I have mentioned above?
For part (i) of your question, yes, your understanding is correct.
For part (ii), the reason that listNames does not return any results is because you have not put your data into a named graph. In particular,
Model tdb = dataset.getDefaultModel();
means that you are storing data into TDB's default graph, i.e. the graph with no name. If you wish listNames to return something, change that line to:
Model tdb = dataset.getNamedModel( "graph42" );
or something similar. You will, of course, then need to refer to that graph by name when you query the data.
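For example, once the data has been loaded into a named graph, a sketch like this (reusing the directory from the question) lists each named graph together with its triple count:
String directory = "D:\\Project\\Store_DB\\data1\\tdb";
Dataset dataset = TDBFactory.createDataset(directory);
String q = "SELECT ?g (count(*) AS ?count) WHERE { GRAPH ?g { ?s ?p ?o } } GROUP BY ?g";
QueryExecution qexec = QueryExecutionFactory.create(QueryFactory.create(q), dataset);
ResultSetFormatter.out(qexec.execSelect());
qexec.close();
dataset.close();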
If your goal is simply to test whether or not you have successfully loaded data into the store, try the command line tools bin/tdbdump (Linux) or bat\tdbdump.bat (Windows).
For part (iii), I tried your code on my system, pointing at one of my TDB images, and it works just as one would expect. So: either the TDB image you're using doesn't have any data in it (test with tdbdump), or the code you actually ran was different to the sample above.
The problem in your part 1 code is, I think, that you are not committing the data.
Try this version of your part 1 code:
String directory = "D:\\Project\\Store_DB\\data1\\tdb";
Dataset dataset = TDBFactory.createDataset(directory);
Model tdb = dataset.getDefaultModel();
// read the input file
String source = "D:\\Project\\Store_DB\\tmp\\trail_1.rdf";
FileManager.get().readModel( tdb, source);
dataset.commit(); // INCLUDE THIS STATEMENT
tdb.close();
dataset.close();
and then try with your part 3 code :) ....
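If you prefer to be explicit about it, TDB 0.9.0 also supports transactions, so the load can be wrapped in a write transaction; a sketch using the same paths as above (ReadWrite comes from the Jena query package):
String directory = "D:\\Project\\Store_DB\\data1\\tdb";
String source = "D:\\Project\\Store_DB\\tmp\\trail_1.rdf";
Dataset dataset = TDBFactory.createDataset(directory);
dataset.begin(ReadWrite.WRITE);
try {
    Model tdb = dataset.getDefaultModel();
    FileManager.get().readModel(tdb, source); // load inside the transaction
    dataset.commit(); // make the load durable
} finally {
    dataset.end();
}
dataset.close();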