I'm trying to run these lines:
dsFinalSegRfm.show(20, false);

Long compilationTime = System.currentTimeMillis() / 1000;

JavaRDD<CustomerKnowledgeEntity> customerKnowledgeList = dsFinalSegRfm.javaRDD().map(
    (Function<Row, CustomerKnowledgeEntity>) rowRfm -> {
        CustomerKnowledgeEntity customerKnowledge = new CustomerKnowledgeEntity();
        customerKnowledge.setCustomerId(new Long(getString(rowRfm.getAs("CLI_ID"))));
        customerKnowledge.setKnowledgeType("rfm-segmentation");
        customerKnowledge.setKnowledgeTypeId("default");

        InformationsEntity infos = new InformationsEntity();
        infos.setCreationDate(new Date());
        infos.setModificationDate(new Date());
        infos.setUserModification("addKnowledge");
        customerKnowledge.setInformations(infos);

        List<KnowledgeEntity> knowledgeEntityList = new ArrayList<>();
        List<WrappedArray<String>> segList = rowRfm.getList(rowRfm.fieldIndex("SEGS"));
        for (WrappedArray<String> seg : segList) {
            KnowledgeEntity knowledge = new KnowledgeEntity();
            Map<String, Object> attr = new HashMap<>();
            attr.put("segment", seg.apply(1));
            attr.put("segmentSemester", seg.apply(2));
            knowledge.setKnowledgeId(seg.apply(0));
            knowledge.setAttributes(attr);
            knowledge.setPriority(0);
            knowledge.setCount(1);
            knowledge.setDeleted(false);
            knowledgeEntityList.add(knowledge);
        }
        customerKnowledge.setKnowledgeCollections(knowledgeEntityList);
        return customerKnowledge;
    });

Long dataConstructionTime = System.currentTimeMillis() / 1000;

Dataset<Row> dataset = sparkSession
    .createDataFrame(customerKnowledgeList, CustomerKnowledgeEntity.class)
    .repartition(16)
    .cache();
The dsFinalSegRfm.show(20, false) call returns what I expect.
But I'm getting a Null Pointer Exception from createDataFrame method.
I'm learning Spark but I find it very opaque for debugging...
Any help is appreciated!
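One way to narrow this down (a debugging sketch of my own, reusing the names from the snippet above, not a confirmed fix): force the mapped RDD to evaluate on its own before calling createDataFrame, so a failure in the row mapping shows up separately from a failure in the bean-to-schema step.

// Debugging sketch (my addition): run the map() on its own first. If this count
// succeeds, every row maps cleanly (including new Long(getString(rowRfm.getAs("CLI_ID"))))
// and the NPE is more likely thrown while createDataFrame inspects
// CustomerKnowledgeEntity itself (its getters and field types); if the count
// fails, the problem is inside the mapping.
long mappedCount = customerKnowledgeList.count();
System.out.println("mapped records: " + mappedCount);

Dataset<Row> dataset = sparkSession
    .createDataFrame(customerKnowledgeList, CustomerKnowledgeEntity.class)
    .repartition(16)
    .cache();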
I'm trying to run the following code:
for (java.util.Iterator<Row> iter = dataframe1.toLocalIterator(); iter.hasNext();) {
    Row it = iter.next();
    String item = it.get(2).toString();
    String rayon = it.get(6).toString();
    Double d = Double.parseDouble(rayon) / 100000;
    String geomType = it.get(14).toString();
    Dataset<Row> res_f = null;
    if (geomType.equalsIgnoreCase("Polygon")) {
        res_f = dataframe2.withColumn("ST_WITHIN", expr("ST_WITHIN(ST_GeomFromText(CONCAT('POINT(',longitude,' ',latitude,')',4326)),ST_GeomFromWKT('" + item + "'))"));
    } else {
        res_f = dataframe2.withColumn("ST_BUFFERR", expr("ST_Buffer(ST_GeomFromWKT('" + item + "')," + d + ")")).withColumn("ST_WITHIN", expr("ST_WITHIN(ST_GeomFromText(CONCAT('POINT(',longitude,' ',latitude,')',4326)),ST_BUFFERR)"));
    }
    res_f.show();
}
But res_f returns nothing and is always null.
I'm using Spark with Java.
EDIT
I solved the problem: just change the line Dataset<Row> res_f = null; to Dataset<Row> res_f;
I need your help.
I have some Scala code that reads data from Twitter using streaming, and I would like to do the same in Java. I'm trying to serialize the data with the Jackson mapper, but I get an error on MongoSpark.save(dataFrame, writeConfig); this call is underlined with "Cannot resolve method save(org.apache.spark.sql.DataFrame, com.mongodb.spark.config.WriteConfig)". Can I do the same another way? I'm also confused about the following line; can I do the same in Java?

MongoSpark.save(rawTweetsDF.coalesce(1).write.format("org.apache.spark.sql.json").option("forensicdb","LiveRawTweets").mode("append"), writeConfig)
P.S. I'm using Spark version 1.6.2.
object tweetstreamingmodel {
//***********************************************************************************
@transient
@volatile private
var spark_SparkSession: SparkSession = _ //Equivalent of SQLContext
val naivemodelpth = "/home/smulwa/data/naiveBayesModel"
case class SummaryStats(Recall: Double, Precision: Double, F1measure: Double, Accuracy: Double)
var tweetcategory: String = _
//***********************************************************************************
def main(args: Array[String]) {
try {
var totalTweets: Long = 0
if (spark_SparkSession == null) {
spark_SparkSession = SentUtilities.getSparkSession() //Get Spark Session Object
}
val spark_streamcontext = SentUtilities.getSparkStreamingContext(spark_SparkSession.sparkContext)
spark_streamcontext.checkpoint("hdfs://KENBO-SPK08.forensics.net:54310/checkpoint/")
// Load Naive Bayes Model from local drive.
val sqlcontext = spark_SparkSession.sqlContext //Create SQLContext from SparkSession Object
import sqlcontext.implicits._
val twitteroAuth: Some[OAuthAuthorization] = OAuthUtilities.getTwitterOAuth()
val tweetfilters = MongoScalaUtil.getTweetFilters(spark_SparkSession)
val Twitterstream: DStream[Status] = TwitterUtils.createStream(spark_streamcontext, twitteroAuth, tweetfilters,
StorageLevel.MEMORY_AND_DISK_SER).filter(_.getLang() == "en")
Twitterstream.foreachRDD {
rdd =>
if (rdd != null && !rdd.isEmpty() && !rdd.partitions.isEmpty) {
saveRawTweetsToMongoDB(rdd)
rdd.foreachPartition {
partitionOfRecords =>
if (!partitionOfRecords.isEmpty) {
partitionOfRecords.foreach(record =>
MongoScalaUtil.SaveRawtweetstoMongodb(record.toString, record.getUser.getId, record.getId, SentUtilities.getStreamDate(), SentUtilities.getStreamTime())) //mongo_utilities.save(record.toString,spark_SparkSession.sparkContext))
}
}
}
}
val jacksonObjectMapper: ObjectMapper = new ObjectMapper()
// #param rdd -- RDD of Status objects to save.
def saveRawTweetsToMongoDB(rdd: RDD[Status]): Unit = {
try {
val sqlContext = spark_SparkSession.sqlContext
val tweet = rdd.map(status => jacksonObjectMapper.writeValueAsString(status))
val rawTweetsDF = sqlContext.read.json(tweet)
val readConfig: ReadConfig = ReadConfig(Map("uri" ->
"mongodb://10.0.10.100:27017/forensicdb.LiveRawTweets?readPreference=primaryPreferred"))
val writeConfig: WriteConfig = WriteConfig(Map("uri" ->
"mongodb://10.0.10.100:27017/forensicdb.LiveRawTweets"))
MongoSpark.save(rawTweetsDF.coalesce(1).write.format("org.apache.spark.sql.json").option("forensicdb",
"LiveRawTweets").mode("append"), writeConfig)
} catch {
case e: Exception => println("Error Saving tweets to Mongodb:", e)
}
}
}
And the Java analogue:
public class Main {
// Set system credentials for access to twitter
private static void setTwitterOAuth() {
System.setProperty("twitter4j.oauth.consumerKey", TwitterCredentials.consumerKey);
System.setProperty("twitter4j.oauth.consumerSecret", TwitterCredentials.consumerSecret);
System.setProperty("twitter4j.oauth.accessToken", TwitterCredentials.accessToken);
System.setProperty("twitter4j.oauth.accessTokenSecret", TwitterCredentials.accessTokenSecret);
}
public static void main(String[] args) {
setTwitterOAuth();
SparkConf conf = new SparkConf().setMaster("local[2]")
.setAppName("SparkTwitter");
// Spark contexts
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaStreamingContext jssc = new JavaStreamingContext(sparkContext, new Duration(1000));
JavaReceiverInputDStream<Status> twitterStream = TwitterUtils.createStream(jssc);
// Stream that contains just tweets in english
JavaDStream<Status> enTweetsDStream = twitterStream.filter((status) -> "en".equalsIgnoreCase(status.getLang()));
enTweetsDStream.persist(StorageLevel.MEMORY_AND_DISK());
enTweetsDStream.foreachRDD(rdd -> {
if (rdd != null && !rdd.isEmpty() && !rdd.partitions().isEmpty()) {
saveRawTweetsToMondoDb(rdd, sparkContext);
}
});
enTweetsDStream.print();
jssc.start();
jssc.awaitTermination();
}
static void saveRawTweetsToMondoDb(JavaRDD<Status> rdd, JavaSparkContext sparkContext) {
try {
ObjectMapper objectMapper = new ObjectMapper();
Function<Status, String> toJsonString = status -> objectMapper.writeValueAsString(status);
SQLContext sqlContext = new SQLContext(sparkContext);
JavaRDD<String> tweet = (JavaRDD<String>) rdd.map(toJsonString);
DataFrame dataFrame = sqlContext.read().json(tweet);
// Setting for read
Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://127.0.0.1/forensicdb.LiveRawTweets");
readOverrides.put("readPreference", "primaryPreferred");
ReadConfig readConfig = ReadConfig.create(sparkContext).withJavaOptions(readOverrides);
// Settings for writing
Map<String, String> writeOverrides = new HashMap<>();
writeOverrides.put("uri", "mongodb://127.0.0.1/forensicdb.LiveRawTweets");
WriteConfig writeConfig = WriteConfig.create(sparkContext).withJavaOptions(writeOverrides);
MongoSpark.write(dataFrame).option("collection", "LiveRawTweets").mode("append").save();
MongoSpark.save(dataFrame, writeConfig);
} catch (Exception e) {
System.out.println("Error saving to database");
}
}
}
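One thing worth trying (a sketch, not a verified fix): the connector's save overload that takes a WriteConfig expects a DataFrameWriter rather than a DataFrame, which is also what the Scala line above passes (rawTweetsDF.coalesce(1).write...). The Java equivalent, reusing dataFrame and writeConfig from the method above, would look roughly like this:

// Sketch only (assuming the mongo-spark-connector build that matches Spark 1.6):
// pass the DataFrameWriter, not the DataFrame, so the call resolves to the
// save(DataFrameWriter, WriteConfig) overload, mirroring the Scala line above.
MongoSpark.save(
    dataFrame.write()
        .option("collection", "LiveRawTweets")
        .mode("append"),
    writeConfig);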
Following is the piece of code I used to train my model. After that, how and where can I save my model and read it back, other than with the FileExporter class? Is it only to a file, or can I store it in a cache and access it back?
IgniteCache<Integer, double[]> cache = ignite.getOrCreateCache("MLData_IRIS");
// extracting sepal length, sepal width, petal length, petal width
IgniteBiFunction<Integer, double[], Vector> featureExtractor = new RangeExtractor(1, 5);
IgniteBiFunction<Integer, double[], Double> labelExtractor = new PointExtractor(0);
System.out.println(">>> Create new training dataset splitter object.");
TrainTestSplit<Integer, double[]> split = new TrainTestDatasetSplitter<Integer, double[]>()
.split(0.5, 0.5);
IgniteBiPredicate<Integer, double[]> testData = split.getTestFilter();
IgniteBiPredicate<Integer, double[]> trainData = split.getTrainFilter();
// Set up the trainer
KMeansTrainer trainer = new KMeansTrainer()
.withDistance(new EuclideanDistance()) //other metrics are HammingDistance, ManhattanDistance
.withAmountOfClusters(3) // number of clusters to create
.withMaxIterations(100)
.withEpsilon(1.0E-4D)
.withSeed(1234L);
long t1 = System.currentTimeMillis();
KMeansModel mdl = trainer.fit(
ignite,
cache,
trainData,
featureExtractor,
labelExtractor
);
long t2 = System.currentTimeMillis();
System.out.println("time taken to build the model : " + (t2 - t1) + " ms");
System.out.println(">>> --------------------------------------------");
System.out.println(">>> trained model: " + mdl.toString(true));
For now, Ignite has only this mechanism: FileExporter.
But for version 2.8 we have already implemented a model storage.
Sample for saving a model:
ModelStorage storage = new ModelStorageFactory().getModelStorage(ignite);
storage.mkdirs("/");
storage.putFile("/my_model", serializedMdl);
ModelDescriptor desc = new ModelDescriptor(
"MyModel",
"My Cool Model",
new ModelSignature("", "", ""),
new ModelStorageModelReader("/my_model"),
new IgniteModelParser<>()
);
ModelDescriptorStorage descStorage = new ModelDescriptorStorageFactory().getModelDescriptorStorage(ignite);
descStorage.put("my_model", desc);
Sample for loading a model:
Ignite ignite = Ignition.ignite();
ModelDescriptorStorage descStorage = new ModelDescriptorStorageFactory().getModelDescriptorStorage(ignite);
ModelDescriptor desc = descStorage.get(mdl);
Model<byte[], byte[]> infMdl = new SingleModelBuilder().build(desc.getReader(), desc.getParser());
Vector input = VectorUtils.of(x);
try {
return deserialize(infMdl.predict(serialize(input)));
}
catch (IOException | ClassNotFoundException e) {
throw new RuntimeException(e);
}
where x is a vector of doubles and mdl is the model name.
NOTE: this API will be available with release 2.8, but you can try it right now if you build Ignite from the master branch.
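The serialize and deserialize helpers used in the loading sample are not shown; a minimal sketch using plain Java object serialization (an assumption on my part, not something Ignite ships) could look like this:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical helpers backing the serialize/deserialize calls above,
// based on standard Java object serialization.
static byte[] serialize(Object obj) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
        oos.writeObject(obj);
    }
    return bos.toByteArray();
}

static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
    try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
        return ois.readObject();
    }
}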
I have a custom metric in AWS CloudWatch and I am putting data into it through the AWS Java API.
for(int i =0;i<collection.size();i++){
String[] cell = collection.get(i).split("\\|\\|");
List<Dimension> dimensions = new ArrayList<>();
dimensions.add(new Dimension().withName(dimension[0]).withValue(cell[0]));
dimensions.add(new Dimension().withName(dimension[1]).withValue(cell[1]));
MetricDatum datum = new MetricDatum().withMetricName(metricName)
.withUnit(StandardUnit.None)
.withValue(Double.valueOf(cell[2]))
.withDimensions(dimensions);
PutMetricDataRequest request = new PutMetricDataRequest().withNamespace(namespace+"_"+cell[3]).withMetricData(datum);
String response = String.valueOf(cw.putMetricData(request));
GetMetricDataRequest res = new GetMetricDataRequest().withMetricDataQueries();
//cw.getMetricData();
com.amazonaws.services.cloudwatch.model.Metric m = new com.amazonaws.services.cloudwatch.model.Metric();
m.setMetricName(metricName);
m.setDimensions(dimensions);
m.setNamespace(namespace);
MetricStat ms = new MetricStat().withMetric(m);
MetricDataQuery metricDataQuery = new MetricDataQuery();
metricDataQuery.withMetricStat(ms);
metricDataQuery.withId("m1");
List<MetricDataQuery> mqList = new ArrayList<MetricDataQuery>();
mqList.add(metricDataQuery);
res.withMetricDataQueries(mqList);
GetMetricDataResult result1= cw.getMetricData(res);
}
Now I want to be able to fetch the latest data entered for a particular namespace, metric name, and dimension combination through the Java API. I am not able to find appropriate documentation from AWS about this. Can anyone please help me?
I got the results from CloudWatch with the code below.
GetMetricDataRequest getMetricDataRequest = new GetMetricDataRequest().withMetricDataQueries();
Integer integer = new Integer(300);
Iterator<Map.Entry<String, String>> entries = dimensions.entrySet().iterator();
List<Dimension> dList = new ArrayList<Dimension>();
while (entries.hasNext()) {
Map.Entry<String, String> entry = entries.next();
dList.add(new Dimension().withName(entry.getKey()).withValue(entry.getValue()));
}
com.amazonaws.services.cloudwatch.model.Metric metric = new com.amazonaws.services.cloudwatch.model.Metric();
metric.setNamespace(namespace);
metric.setMetricName(metricName);
metric.setDimensions(dList);
MetricStat ms = new MetricStat().withMetric(metric)
.withPeriod(integer)
.withUnit(StandardUnit.None)
.withStat("Average");
MetricDataQuery metricDataQuery = new MetricDataQuery().withMetricStat(ms)
.withId("m1");
List<MetricDataQuery> mqList = new ArrayList<>();
mqList.add(metricDataQuery);
getMetricDataRequest.withMetricDataQueries(mqList);
long timestamp = 1536962700000L;
long timestampEnd = 1536963000000L;
Date d = new Date(timestamp);
Date dEnd = new Date(timestampEnd);
getMetricDataRequest.withStartTime(d);
getMetricDataRequest.withEndTime(dEnd);
GetMetricDataResult result1= cw.getMetricData(getMetricDataRequest);
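To read the numbers back out of result1 (not shown above), iterating the returned MetricDataResult entries should work with the v1 AWS SDK; the loop variable names are mine:

// Classes come from com.amazonaws.services.cloudwatch.model (v1 SDK).
// Each MetricDataResult corresponds to one MetricDataQuery id ("m1" above)
// and carries parallel lists of timestamps and values.
for (MetricDataResult r : result1.getMetricDataResults()) {
    List<Date> timestamps = r.getTimestamps();
    List<Double> values = r.getValues();
    for (int i = 0; i < values.size(); i++) {
        System.out.println(r.getId() + " " + timestamps.get(i) + " -> " + values.get(i));
    }
}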
I wrote code to read a CSV file and map all the columns to a bean class.
Now I'm trying to load these values into a Dataset and I'm running into this issue:
7/08/30 16:33:58 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: object is not an instance of declaring class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
If I try to set the values manually, it works fine.
public void run(String t, String u) throws FileNotFoundException {
JavaRDD<String> pairRDD = sparkContext.textFile("C:/temp/L1_result.csv");
JavaPairRDD<String,String> rowJavaRDD = pairRDD.mapToPair(new PairFunction<String, String, String>() {
public Tuple2<String,String> call(String rec) throws FileNotFoundException {
String[] tokens = rec.split(";");
String[] vals = new String[tokens.length];
for(int i= 0; i < tokens.length; i++){
vals[i] =tokens[i];
}
return new Tuple2<String, String>(tokens[0], tokens[1]);
}
});
ColumnPositionMappingStrategy cpm = new ColumnPositionMappingStrategy();
cpm.setType(funds.class);
String[] csvcolumns = new String[]{"portfolio_id", "portfolio_code"};
cpm.setColumnMapping(csvcolumns);
CSVReader csvReader = new CSVReader(new FileReader("C:/temp/L1_result.csv"));
CsvToBean csvtobean = new CsvToBean();
List csvDataList = csvtobean.parse(cpm, csvReader);
for (Object dataobject : csvDataList) {
funds fund = (funds) dataobject;
System.out.println("Portfolio:"+fund.getPortfolio_id()+ " code:"+fund.getPortfolio_code());
}
/* funds b0 = new funds();
b0.setK("k0");
b0.setSomething("sth0");
funds b1 = new funds();
b1.setK("k1");
b1.setSomething("sth1");
List<funds> data = new ArrayList<funds>();
data.add(b0);
data.add(b1);*/
System.out.println("Portfolio:" + rowJavaRDD.values());
//manual set works fine ///
// Dataset<Row> fundDf = SQLContext.createDataFrame(data, funds.class);
Dataset<Row> fundDf = SQLContext.createDataFrame(rowJavaRDD.values(), funds.class);
fundDf.printSchema();
fundDf.write().option("mergeschema", true).parquet("C:/test");
}
The line below, which uses rowJavaRDD.values(), is the one giving the issue:
Dataset<Row> fundDf = SQLContext.createDataFrame(rowJavaRDD.values(), funds.class);
What is the resolution to this? Whatever values I am mapping to columns should be passed here, but how does this need to be done? Any idea really helps me.
Dataset fundDf = SQLContext.createDataFrame(csvDataList, funds.class);
Passing the list worked!
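That fits with how createDataFrame infers the schema: it needs a collection or RDD of the bean type itself, while rowJavaRDD.values() is a JavaRDD<String>, so reflection against funds.class fails with "object is not an instance of declaring class". A rough sketch of the two working shapes (assuming funds has setPortfolio_id/setPortfolio_code setters matching the getters used above, and an SQLContext instance named sqlContext; both are my assumptions):

// Passing the parsed bean list (the fix described above):
@SuppressWarnings("unchecked")
List<funds> beans = (List<funds>) csvDataList;
Dataset<Row> fromList = sqlContext.createDataFrame(beans, funds.class);

// Or map each CSV line to a funds bean first, then pass a JavaRDD<funds>:
JavaRDD<funds> fundsRdd = pairRDD.map(line -> {
    String[] tokens = line.split(";");
    funds f = new funds();
    f.setPortfolio_id(tokens[0]);    // hypothetical setters, not from the original post
    f.setPortfolio_code(tokens[1]);
    return f;
});
Dataset<Row> fromRdd = sqlContext.createDataFrame(fundsRdd, funds.class);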