Spark - createDataFrame returns NPE - Java

I'm trying to run these lines:
dsFinalSegRfm.show(20, false);

Long compilationTime = System.currentTimeMillis() / 1000;

JavaRDD<CustomerKnowledgeEntity> customerKnowledgeList = dsFinalSegRfm.javaRDD().map(
    (Function<Row, CustomerKnowledgeEntity>) rowRfm -> {
        CustomerKnowledgeEntity customerKnowledge = new CustomerKnowledgeEntity();
        customerKnowledge.setCustomerId(new Long(getString(rowRfm.getAs("CLI_ID"))));
        customerKnowledge.setKnowledgeType("rfm-segmentation");
        customerKnowledge.setKnowledgeTypeId("default");

        InformationsEntity infos = new InformationsEntity();
        infos.setCreationDate(new Date());
        infos.setModificationDate(new Date());
        infos.setUserModification("addKnowledge");
        customerKnowledge.setInformations(infos);

        List<KnowledgeEntity> knowledgeEntityList = new ArrayList<>();
        List<WrappedArray<String>> segList = rowRfm.getList(rowRfm.fieldIndex("SEGS"));
        for (WrappedArray<String> seg : segList) {
            KnowledgeEntity knowledge = new KnowledgeEntity();
            Map<String, Object> attr = new HashMap<>();
            attr.put("segment", seg.apply(1));
            attr.put("segmentSemester", seg.apply(2));
            knowledge.setKnowledgeId(seg.apply(0));
            knowledge.setAttributes(attr);
            knowledge.setPriority(0);
            knowledge.setCount(1);
            knowledge.setDeleted(false);
            knowledgeEntityList.add(knowledge);
        }
        customerKnowledge.setKnowledgeCollections(knowledgeEntityList);
        return customerKnowledge;
    });

Long dataConstructionTime = System.currentTimeMillis() / 1000;

Dataset<Row> dataset = sparkSession
    .createDataFrame(customerKnowledgeList, CustomerKnowledgeEntity.class)
    .repartition(16)
    .cache();
The dsFinalSegRfm.show(20, false) call prints exactly what I expect, but I'm getting a NullPointerException from the createDataFrame method.
I'm learning Spark, but I find it very opaque to debug...
Any help is appreciated!
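For what it's worth, a common trigger for this kind of NPE is a bean field that the reflective schema inference cannot handle (here the Map<String, Object> attributes and the nested KnowledgeEntity list are likely suspects). A minimal sanity-check sketch, assuming CustomerKnowledgeEntity is a plain JavaBean with a public no-arg constructor and getters/setters for every field, is to build the bean encoder explicitly so any inference problem surfaces on the driver rather than deep inside createDataFrame:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// If a field type is not supported by the bean encoder, this line fails on
// the driver, which is easier to pin down than an NPE inside createDataFrame.
Encoder<CustomerKnowledgeEntity> encoder = Encoders.bean(CustomerKnowledgeEntity.class);

// createDataset gives a typed Dataset; toDF() yields the same Dataset<Row>
// the createDataFrame(...) line above was meant to produce.
Dataset<CustomerKnowledgeEntity> typed =
    sparkSession.createDataset(customerKnowledgeList.rdd(), encoder);
typed.printSchema(); // inspect what Spark actually inferred

Dataset<Row> dataset = typed.toDF().repartition(16).cache();

This is only a diagnostic sketch, not a confirmed fix for the post above.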

Related

IF/ELSE conditions on Spark / Java

I'm trying to run the following code:
for (java.util.Iterator<Row> iter = dataframe1.toLocalIterator(); iter.hasNext();) {
    Row it = iter.next();
    String item = it.get(2).toString();
    String rayon = it.get(6).toString();
    Double d = Double.parseDouble(rayon) / 100000;
    String geomType = it.get(14).toString();

    Dataset<Row> res_f = null;
    if (geomType.equalsIgnoreCase("Polygon")) {
        res_f = dataframe2.withColumn("ST_WITHIN",
            expr("ST_WITHIN(ST_GeomFromText(CONCAT('POINT(',longitude,' ',latitude,')',4326)),ST_GeomFromWKT('" + item + "'))"));
    } else {
        res_f = dataframe2.withColumn("ST_BUFFERR",
                expr("ST_Buffer(ST_GeomFromWKT('" + item + "')," + d + ")"))
            .withColumn("ST_WITHIN",
                expr("ST_WITHIN(ST_GeomFromText(CONCAT('POINT(',longitude,' ',latitude,')',4326)),ST_BUFFERR)"));
    }
    res_f.show();
}
But res_f returns nothing and is always null.
I'm using Spark with Java. I need your help.
EDIT
I solved the problem: I just changed the line Dataset<Row> res_f = null; to Dataset<Row> res_f; (see the sketch below).
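For reference, a minimal sketch of that change, with the long SQL expression strings abbreviated as hypothetical withinExpr / bufferExpr / bufferedWithinExpr variables standing in for the exact strings concatenated in the original loop:

// Declare res_f without the null initializer; since both branches assign it,
// the compiler guarantees it is set before res_f.show() runs.
// withinExpr, bufferExpr and bufferedWithinExpr are placeholders for the same
// SQL strings built in the original loop body.
Dataset<Row> res_f;
if (geomType.equalsIgnoreCase("Polygon")) {
    res_f = dataframe2.withColumn("ST_WITHIN", expr(withinExpr));
} else {
    res_f = dataframe2.withColumn("ST_BUFFERR", expr(bufferExpr))
                      .withColumn("ST_WITHIN", expr(bufferedWithinExpr));
}
res_f.show();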

Can't save Dataframe to mongodb

I have Scala code that reads data from Twitter using streaming, and I would like to do the same in Java. I'm trying to serialize the data with the Jackson mapper, but I get an error on this line: MongoSpark.save(dataFrame, writeConfig); It is underlined with "Cannot resolve method save(org.apache.spark.sql.DataFrame, com.mongodb.spark.config.WriteConfig)". Can I do this another way? I'm also confused about this line, and whether I can do the same in Java:
MongoSpark.save(rawTweetsDF.coalesce(1).write.format("org.apache.spark.sql.json").option("forensicdb","LiveRawTweets").mode("append"), writeConfig)
P.S. I'm using Spark 1.6.2.
object tweetstreamingmodel {
  //***********************************************************************************
  @transient
  @volatile private var spark_SparkSession: SparkSession = _ // Equivalent of SQLContext

  val naivemodelpth = "/home/smulwa/data/naiveBayesModel"

  case class SummaryStats(Recall: Double, Precision: Double, F1measure: Double, Accuracy: Double)

  var tweetcategory: String = _
  //***********************************************************************************

  def main(args: Array[String]) {
    try {
      var totalTweets: Long = 0
      if (spark_SparkSession == null) {
        spark_SparkSession = SentUtilities.getSparkSession() // Get Spark Session Object
      }
      val spark_streamcontext = SentUtilities.getSparkStreamingContext(spark_SparkSession.sparkContext)
      spark_streamcontext.checkpoint("hdfs://KENBO-SPK08.forensics.net:54310/checkpoint/")

      // Load Naive Bayes Model from local drive.
      val sqlcontext = spark_SparkSession.sqlContext // Create SQLContext from SparkSession Object
      import sqlcontext.implicits._

      val twitteroAuth: Some[OAuthAuthorization] = OAuthUtilities.getTwitterOAuth()
      val tweetfilters = MongoScalaUtil.getTweetFilters(spark_SparkSession)
      val Twitterstream: DStream[Status] = TwitterUtils.createStream(spark_streamcontext, twitteroAuth, tweetfilters,
        StorageLevel.MEMORY_AND_DISK_SER).filter(_.getLang() == "en")

      Twitterstream.foreachRDD { rdd =>
        if (rdd != null && !rdd.isEmpty() && !rdd.partitions.isEmpty) {
          saveRawTweetsToMongoDB(rdd)
          rdd.foreachPartition { partitionOfRecords =>
            if (!partitionOfRecords.isEmpty) {
              partitionOfRecords.foreach(record =>
                MongoScalaUtil.SaveRawtweetstoMongodb(record.toString, record.getUser.getId, record.getId,
                  SentUtilities.getStreamDate(), SentUtilities.getStreamTime())) // mongo_utilities.save(record.toString, spark_SparkSession.sparkContext)
            }
          }
        }
      }

      val jacksonObjectMapper: ObjectMapper = new ObjectMapper()

      // @param rdd -- RDD of Status objects to save.
      def saveRawTweetsToMongoDB(rdd: RDD[Status]): Unit = {
        try {
          val sqlContext = spark_SparkSession.sqlContext
          val tweet = rdd.map(status => jacksonObjectMapper.writeValueAsString(status))
          val rawTweetsDF = sqlContext.read.json(tweet)
          val readConfig: ReadConfig = ReadConfig(Map("uri" ->
            "mongodb://10.0.10.100:27017/forensicdb.LiveRawTweets?readPreference=primaryPreferred"))
          val writeConfig: WriteConfig = WriteConfig(Map("uri" ->
            "mongodb://10.0.10.100:27017/forensicdb.LiveRawTweets"))
          MongoSpark.save(rawTweetsDF.coalesce(1).write.format("org.apache.spark.sql.json").option("forensicdb",
            "LiveRawTweets").mode("append"), writeConfig)
        } catch {
          case e: Exception => println("Error Saving tweets to Mongodb:", e)
        }
      }
}
And the Java analogue:
public class Main {

    // Set system credentials for access to Twitter
    private static void setTwitterOAuth() {
        System.setProperty("twitter4j.oauth.consumerKey", TwitterCredentials.consumerKey);
        System.setProperty("twitter4j.oauth.consumerSecret", TwitterCredentials.consumerSecret);
        System.setProperty("twitter4j.oauth.accessToken", TwitterCredentials.accessToken);
        System.setProperty("twitter4j.oauth.accessTokenSecret", TwitterCredentials.accessTokenSecret);
    }

    public static void main(String[] args) {
        setTwitterOAuth();
        SparkConf conf = new SparkConf().setMaster("local[2]")
            .setAppName("SparkTwitter");

        // Spark contexts
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        JavaStreamingContext jssc = new JavaStreamingContext(sparkContext, new Duration(1000));
        JavaReceiverInputDStream<Status> twitterStream = TwitterUtils.createStream(jssc);

        // Stream that contains just tweets in English
        JavaDStream<Status> enTweetsDStream = twitterStream.filter(status -> "en".equalsIgnoreCase(status.getLang()));
        enTweetsDStream.persist(StorageLevel.MEMORY_AND_DISK());

        enTweetsDStream.foreachRDD(rdd -> {
            if (rdd != null && !rdd.isEmpty() && !rdd.partitions().isEmpty()) {
                saveRawTweetsToMondoDb(rdd, sparkContext);
            }
        });
        enTweetsDStream.print();

        jssc.start();
        jssc.awaitTermination();
    }

    static void saveRawTweetsToMondoDb(JavaRDD<Status> rdd, JavaSparkContext sparkContext) {
        try {
            ObjectMapper objectMapper = new ObjectMapper();
            Function<Status, String> toJsonString = status -> objectMapper.writeValueAsString(status);
            SQLContext sqlContext = new SQLContext(sparkContext);
            JavaRDD<String> tweet = rdd.map(toJsonString);
            DataFrame dataFrame = sqlContext.read().json(tweet);

            // Settings for reading
            Map<String, String> readOverrides = new HashMap<>();
            readOverrides.put("uri", "mongodb://127.0.0.1/forensicdb.LiveRawTweets");
            readOverrides.put("readPreference", "primaryPreferred");
            ReadConfig readConfig = ReadConfig.create(sparkContext).withJavaOptions(readOverrides);

            // Settings for writing
            Map<String, String> writeOverrides = new HashMap<>();
            writeOverrides.put("uri", "mongodb://127.0.0.1/forensicdb.LiveRawTweets");
            WriteConfig writeConfig = WriteConfig.create(sparkContext).withJavaOptions(writeOverrides);

            MongoSpark.write(dataFrame).option("collection", "LiveRawTweets").mode("append").save();
            MongoSpark.save(dataFrame, writeConfig); // the line that does not resolve
        } catch (Exception e) {
            System.out.println("Error saving to database");
        }
    }
}
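Regarding the unresolved overload: in the Spark 1.6-era mongo-spark-connector the save method that takes a WriteConfig is, as far as I know, declared for a DataFrameWriter rather than for a DataFrame, which would explain the "cannot resolve" error. A sketch of the Java equivalent of the Scala call, under that assumption:

// Assumption: mongo-spark-connector 1.1.x for Spark 1.6.x, which exposes
// MongoSpark.save(DataFrameWriter, WriteConfig) but not save(DataFrame, WriteConfig).
// Note: the target collection is normally given via the "collection" option;
// the Scala line's option("forensicdb", "LiveRawTweets") looks like a typo.
MongoSpark.save(
        dataFrame.coalesce(1)
                 .write()
                 .option("collection", "LiveRawTweets")
                 .mode("append"),
        writeConfig);

If that overload is available, the MongoSpark.write(dataFrame)...save() line already in the method performs essentially the same write, so one of the two calls should be enough.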

Ignite: how to save and re-load a trained model

Following is the piece of code I used to train my model. After that, how and where can I save my model and read it back, other than with the FileExporter class? Can it only go to a file, or can I store it in a cache and access it later?
IgniteCache<Integer, double[]> cache = ignite.getOrCreateCache("MLData_IRIS");

// extracting sepal length, sepal width, petal length, petal width
IgniteBiFunction<Integer, double[], Vector> featureExtractor = new RangeExtractor(1, 5);
IgniteBiFunction<Integer, double[], Double> labelExtractor = new PointExtractor(0);

System.out.println(">>> Create new training dataset splitter object.");
TrainTestSplit<Integer, double[]> split = new TrainTestDatasetSplitter<Integer, double[]>()
    .split(0.5, 0.5);
IgniteBiPredicate<Integer, double[]> testData = split.getTestFilter();
IgniteBiPredicate<Integer, double[]> trainData = split.getTrainFilter();

// Set up the trainer
KMeansTrainer trainer = new KMeansTrainer()
    .withDistance(new EuclideanDistance()) // other metrics are HammingDistance, ManhattanDistance
    .withAmountOfClusters(3)               // number of clusters to create
    .withMaxIterations(100)
    .withEpsilon(1.0E-4D)
    .withSeed(1234L);

long t1 = System.currentTimeMillis();
KMeansModel mdl = trainer.fit(
    ignite,
    cache,
    trainData,
    featureExtractor,
    labelExtractor
);
long t2 = System.currentTimeMillis();

System.out.println("time taken to build the model : " + (t2 - t1) + " ms");
System.out.println(">>> --------------------------------------------");
System.out.println(">>> trained model: " + mdl.toString(true));
For now, Ignite has only this mechanism: FileExporter.
But for version 2.8 we have already implemented a model storage.
Sample for saving a model:
ModelStorage storage = new ModelStorageFactory().getModelStorage(ignite);
storage.mkdirs("/");
storage.putFile("/my_model", serializedMdl);

ModelDescriptor desc = new ModelDescriptor(
    "MyModel",
    "My Cool Model",
    new ModelSignature("", "", ""),
    new ModelStorageModelReader("/my_model"),
    new IgniteModelParser<>()
);

ModelDescriptorStorage descStorage = new ModelDescriptorStorageFactory().getModelDescriptorStorage(ignite);
descStorage.put("my_model", desc);
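The snippet above uses serializedMdl without showing where it comes from. A minimal sketch of one way to produce it, assuming the trained model is java.io.Serializable and using plain Java serialization (an illustration, not necessarily how the 2.8 examples do it):

// Hypothetical helper: turn the trained model (mdl from the training code above)
// into the byte[] expected by storage.putFile(...), via standard Java serialization.
// Requires java.io.ByteArrayOutputStream, java.io.ObjectOutputStream, java.io.IOException.
byte[] serializedMdl;
try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
     ObjectOutputStream oos = new ObjectOutputStream(bos)) {
    oos.writeObject(mdl);
    oos.flush();
    serializedMdl = bos.toByteArray();
} catch (IOException e) {
    throw new RuntimeException("Could not serialize the model", e);
}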
Sample for loading a model:
Ignite ignite = Ignition.ignite();

ModelDescriptorStorage descStorage = new ModelDescriptorStorageFactory().getModelDescriptorStorage(ignite);
ModelDescriptor desc = descStorage.get(mdl);
Model<byte[], byte[]> infMdl = new SingleModelBuilder().build(desc.getReader(), desc.getParser());

Vector input = VectorUtils.of(x);

try {
    return deserialize(infMdl.predict(serialize(input)));
} catch (IOException | ClassNotFoundException e) {
    throw new RuntimeException(e);
}
where x is a vector of doubles and mdl is the model name.
NOTE: this API will be available with release 2.8, but you can already try it now if you build Ignite from the master branch.
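The loading sample relies on serialize and deserialize helpers that are not shown; a sketch of what they might look like, again assuming plain Java serialization on both sides:

// Hypothetical helpers matching the calls in the loading sample.
static byte[] serialize(Object obj) throws IOException {
    try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
         ObjectOutputStream oos = new ObjectOutputStream(bos)) {
        oos.writeObject(obj);
        oos.flush();
        return bos.toByteArray();
    }
}

static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
    try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
        return ois.readObject();
    }
}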

Getting latest data from AWS custom CloudWatch in Java

I have a custom metric in AWS CloudWatch, and I am putting data into it through the AWS Java API.
for (int i = 0; i < collection.size(); i++) {
    String[] cell = collection.get(i).split("\\|\\|");

    List<Dimension> dimensions = new ArrayList<>();
    dimensions.add(new Dimension().withName(dimension[0]).withValue(cell[0]));
    dimensions.add(new Dimension().withName(dimension[1]).withValue(cell[1]));

    MetricDatum datum = new MetricDatum().withMetricName(metricName)
        .withUnit(StandardUnit.None)
        .withValue(Double.valueOf(cell[2]))
        .withDimensions(dimensions);
    PutMetricDataRequest request = new PutMetricDataRequest()
        .withNamespace(namespace + "_" + cell[3])
        .withMetricData(datum);
    String response = String.valueOf(cw.putMetricData(request));

    GetMetricDataRequest res = new GetMetricDataRequest().withMetricDataQueries();
    //cw.getMetricData();
    com.amazonaws.services.cloudwatch.model.Metric m = new com.amazonaws.services.cloudwatch.model.Metric();
    m.setMetricName(metricName);
    m.setDimensions(dimensions);
    m.setNamespace(namespace);

    MetricStat ms = new MetricStat().withMetric(m);
    MetricDataQuery metricDataQuery = new MetricDataQuery();
    metricDataQuery.withMetricStat(ms);
    metricDataQuery.withId("m1");

    List<MetricDataQuery> mqList = new ArrayList<MetricDataQuery>();
    mqList.add(metricDataQuery);
    res.withMetricDataQueries(mqList);
    GetMetricDataResult result1 = cw.getMetricData(res);
}
Now I want to be able to fetch the latest data entered for a particular namespace, metric name, and dimension combination through the Java API. I am not able to find appropriate documentation from AWS on this. Can anyone please help me?
I got the results from CloudWatch with the code below.
GetMetricDataRequest getMetricDataRequest = new GetMetricDataRequest().withMetricDataQueries();
Integer integer = new Integer(300);

Iterator<Map.Entry<String, String>> entries = dimensions.entrySet().iterator();
List<Dimension> dList = new ArrayList<Dimension>();
while (entries.hasNext()) {
    Map.Entry<String, String> entry = entries.next();
    dList.add(new Dimension().withName(entry.getKey()).withValue(entry.getValue()));
}

com.amazonaws.services.cloudwatch.model.Metric metric = new com.amazonaws.services.cloudwatch.model.Metric();
metric.setNamespace(namespace);
metric.setMetricName(metricName);
metric.setDimensions(dList);

MetricStat ms = new MetricStat().withMetric(metric)
    .withPeriod(integer)
    .withUnit(StandardUnit.None)
    .withStat("Average");

MetricDataQuery metricDataQuery = new MetricDataQuery().withMetricStat(ms)
    .withId("m1");

List<MetricDataQuery> mqList = new ArrayList<>();
mqList.add(metricDataQuery);
getMetricDataRequest.withMetricDataQueries(mqList);

long timestamp = 1536962700000L;
long timestampEnd = 1536963000000L;
Date d = new Date(timestamp);
Date dEnd = new Date(timestampEnd);
getMetricDataRequest.withStartTime(d);
getMetricDataRequest.withEndTime(dEnd);

GetMetricDataResult result1 = cw.getMetricData(getMetricDataRequest);
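Since the goal is the most recent datapoints rather than a fixed window, the hard-coded epoch millis above can be replaced by a window relative to the current time. A small sketch (the five-minute lookback is an arbitrary choice; match it to the metric's period):

// Query the last five minutes instead of a fixed timestamp range,
// and ask for newest-first ordering so the first value returned is the latest.
long nowMillis = System.currentTimeMillis();
Date startTime = new Date(nowMillis - 5 * 60 * 1000L);
Date endTime = new Date(nowMillis);
getMetricDataRequest.withStartTime(startTime);
getMetricDataRequest.withEndTime(endTime);
getMetricDataRequest.withScanBy("TimestampDescending");
GetMetricDataResult latest = cw.getMetricData(getMetricDataRequest);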

How to pass csv mapped bean class to Dataset

I wrote code to read a CSV file and map all the columns to a bean class.
Now I'm trying to load these values into a Dataset and I'm getting an error:
7/08/30 16:33:58 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: object is not an instance of declaring class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
If I set the values manually, it works fine.
public void run(String t, String u) throws FileNotFoundException {
    JavaRDD<String> pairRDD = sparkContext.textFile("C:/temp/L1_result.csv");

    JavaPairRDD<String, String> rowJavaRDD = pairRDD.mapToPair(new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String rec) throws FileNotFoundException {
            String[] tokens = rec.split(";");
            String[] vals = new String[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                vals[i] = tokens[i];
            }
            return new Tuple2<String, String>(tokens[0], tokens[1]);
        }
    });

    ColumnPositionMappingStrategy cpm = new ColumnPositionMappingStrategy();
    cpm.setType(funds.class);
    String[] csvcolumns = new String[]{"portfolio_id", "portfolio_code"};
    cpm.setColumnMapping(csvcolumns);

    CSVReader csvReader = new CSVReader(new FileReader("C:/temp/L1_result.csv"));
    CsvToBean csvtobean = new CsvToBean();
    List csvDataList = csvtobean.parse(cpm, csvReader);
    for (Object dataobject : csvDataList) {
        funds fund = (funds) dataobject;
        System.out.println("Portfolio:" + fund.getPortfolio_id() + " code:" + fund.getPortfolio_code());
    }

    /* funds b0 = new funds();
    b0.setK("k0");
    b0.setSomething("sth0");
    funds b1 = new funds();
    b1.setK("k1");
    b1.setSomething("sth1");
    List<funds> data = new ArrayList<funds>();
    data.add(b0);
    data.add(b1); */

    System.out.println("Portfolio:" + rowJavaRDD.values());

    // manual set works fine
    // Dataset<Row> fundDf = SQLContext.createDataFrame(data, funds.class);
    Dataset<Row> fundDf = SQLContext.createDataFrame(rowJavaRDD.values(), funds.class);
    fundDf.printSchema();
    fundDf.write().option("mergeschema", true).parquet("C:/test");
}
The line below, using rowJavaRDD.values(), is the one that fails:
Dataset<Row> fundDf = SQLContext.createDataFrame(rowJavaRDD.values(), funds.class);
What is the resolution for this? Whatever values I'm column-mapping should be passed here, but how should that be done? Any idea really helps me.
Dataset fundDf = SQLContext.createDataFrame(csvDataList, funds.class);
Passing the list worked!
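For completeness: rowJavaRDD.values() is a JavaRDD<String>, not an RDD of funds beans, while createDataFrame(rdd, funds.class) expects bean instances, hence the "object is not an instance of declaring class" error; passing csvDataList (a list of funds) works for that reason. An alternative sketch that builds the bean RDD directly, assuming funds has setters matching the getters used above (setPortfolio_id / setPortfolio_code are my assumption) and that sqlContext is an SQLContext instance:

// Map each CSV line straight to a funds bean, then hand the bean RDD to Spark.
JavaRDD<funds> fundsRDD = pairRDD.map(rec -> {
    String[] tokens = rec.split(";");
    funds f = new funds();
    f.setPortfolio_id(tokens[0]);   // assumed setter names
    f.setPortfolio_code(tokens[1]);
    return f;
});
Dataset<Row> fundDf = sqlContext.createDataFrame(fundsRDD, funds.class);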
