Convert JavaDStream<String> to JavaRDD<String> - java

I have a JavaDStream which gets its data from an external source. I'm trying to integrate Spark Streaming and Spark SQL. A JavaDStream is made up of JavaRDDs, and I can only call applySchema() when I have a JavaRDD. Please help me convert it to a JavaRDD. I know there are functions for this in Scala, where it is much easier, but I need to do it in Java.

You can't transform a DStream into an RDD. As you mention, a DStream contains RDDs. The way to get access to the RDDs is by applying a function to each RDD of the DStream using foreachRDD. See the docs: https://spark.apache.org/docs/1.1.0/api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#foreachRDD(org.apache.spark.api.java.function.Function2)

You first have to access the RDDs inside the DStream using foreachRDD (Scala syntax shown):
javaDStream.foreachRDD( rdd => {
rdd.collect.foreach({
...
})
})

I hope this helps to convert a JavaDStream to a JavaRDD!
JavaDStream<String> lines = stream.map(ConsumerRecord::value);
// Create a JavaRDD<Row> from each micro-batch
lines.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> rdd) {
        JavaRDD<Row> rowRDD = rdd.map(new Function<String, Row>() {
            @Override
            public Row call(String msg) {
                return RowFactory.create(msg);
            }
        });
        // Create the schema: a single nullable string column named "value"
        StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("value", DataTypes.StringType, true)});
        // Get the Spark 2.0 session (JavaSparkSessionSingleton is a helper class from the Spark examples, not part of the Spark API)
        SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
        Dataset<Row> msgDataFrame = spark.createDataFrame(rowRDD, schema);
        msgDataFrame.show();
    }
});

Related

Create a Dataset from String Spark Java (Without RDD)

I need to create a Dataset from String. Key is the String
Header h = new Header();
h.setName(Key);
SQLContext sqlC = spark.sqlContext();
Dataset<String> ds = sqlC.createDataset(Collections.singletonList(h), Encoders.STRING());
ds.show();
I need to write it to a text file (is there a writer for that? I am using csv right now):
ds.write().format("com.databricks.spark.csv").mode("overwrite")
.save(SomeLocation);
According to the documentation, DataFrameWriter has a text() method, so you can call ds.write().text(path):
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameWriter.html#text-java.lang.String-

How to convert the datasets of Spark Row into string?

I have written the code to access the Hive table using SparkSQL. Here is the code:
SparkSession spark = SparkSession
.builder()
.appName("Java Spark Hive Example")
.master("local[*]")
.config("hive.metastore.uris", "thrift://localhost:9083")
.enableHiveSupport()
.getOrCreate();
Dataset<Row> df = spark.sql("select survey_response_value from health").toDF();
df.show();
I would like to know how I can convert the complete output to a String or a String array, as I need to pass it to another module that only accepts String or String[] values.
I have tried other approaches such as .toString() and casting to String, but they did not work for me.
Please let me know how I can convert the Dataset values to String.
Here is the sample code in Java.
public class SparkSample {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local[*]")
.getOrCreate();
//create df
List<String> myList = Arrays.asList("one", "two", "three", "four", "five");
Dataset<Row> df = spark.createDataset(myList, Encoders.STRING()).toDF();
df.show();
//using df.as
List<String> listOne = df.as(Encoders.STRING()).collectAsList();
System.out.println(listOne);
//using df.map
List<String> listTwo = df.map(row -> row.mkString(), Encoders.STRING()).collectAsList();
System.out.println(listTwo);
}
}
"row" is a Java 8 lambda parameter. See developer.com/java/start-using-java-lambda-expressions.html
You can use the map function to convert every row into a string, e.g.:
df.map(row => row.mkString())
Instead of just mkString you can of course do more sophisticated work.
The collect method can then retrieve the whole thing into an array:
val strings = df.map(row => row.mkString()).collect
(This is Scala syntax; I think the Java version is quite similar.)
If you plan to read the dataset row by row, you can use an iterator over the dataset:
Dataset<Row> csv = session.read().format("csv").option("sep", ",").option("inferSchema", true).option("escape", "\"").option("header", true).option("multiline", true).load("users/abc/....");
for (Iterator<Row> iter = csv.toLocalIterator(); iter.hasNext();) {
String item = iter.next().toString();
System.out.println(item);
}
To get everything as a single string, from the SparkSession you can do (Scala syntax):
sparkSession.read.textFile(filePath).collect.mkString
This assumes your Dataset is of type Dataset[String].
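Once the values have been collected to the driver as a List<String> (for example with collectAsList() in the Java sample above), turning them into a String[] or a single String is plain Java. The following is a minimal sketch; the class and method names are hypothetical and chosen only for illustration. Keep in mind that collecting pulls the whole dataset into driver memory, so this is only reasonable for small results.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Spark: converts values that were already
// collected to the driver into the shapes the other module expects.
class CollectedRows {

    // List<String> -> String[]
    static String[] toArray(List<String> rows) {
        return rows.toArray(new String[0]);
    }

    // List<String> -> one String, with the given separator between values
    static String toSingleString(List<String> rows, String separator) {
        return String.join(separator, rows);
    }

    public static void main(String[] args) {
        // Stand-in for the result of df.as(Encoders.STRING()).collectAsList()
        List<String> rows = Arrays.asList("one", "two", "three", "four", "five");
        System.out.println(Arrays.toString(toArray(rows)));
        System.out.println(toSingleString(rows, ","));
    }
}
```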

How to convert JSON from KAFKA to pass it to Spark's machine learning Algorithm

I am trying to learn Spark and Spark Streaming using Java while developing an IoT application.
I have a Kafka server in place which accepts JSON data, and I am able to parse it using SQLContext and the foreach function.
Data format is as follows,
[{"t":1481368346000,"sensors":[{"s":"s1","d":"+149.625"},{"s":"s2","d":"+23.062"},{"s":"s3","d":"+16.375"},{"s":"s4","d":"+235.937"},{"s":"s5","d":"+271.437"},{"s":"s6","d":"+265.937"},{"s":"s7","d":"+295.562"},{"s":"s8","d":"+301.687"}]}]
Here t is the timestamp of each data record,
and sensors is an array of sensor data, with s as the name of each sensor and d containing its reading.
What I have done so far is:
JavaPairInputDStream<String, String> directKafkaStream =
KafkaUtils.createDirectStream(ssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topics);
SQLContext sqlContext = spark.sqlContext();
StreamingLinearRegressionWithSGD model = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(2));
JavaDStream<String> json = directKafkaStream.map(new Function<Tuple2<String,String>, String>() {
public String call(Tuple2<String,String> message) throws Exception {
return message._2();
};
});
json.print();
json.foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> jsonRecord) throws Exception {
System.out.println("JSON Record ---- "+jsonRecord);
if(!jsonRecord.isEmpty()){
Dataset<Row> timestamp = sqlContext.read().json(jsonRecord).select("t");
timestamp.printSchema();
timestamp.show(false);
Dataset<Row> data = sqlContext.read().json(jsonRecord).select("sensors");
data.printSchema();
data.show(false);
//DF in table
Dataset<Row> df = data.select(org.apache.spark.sql.functions.explode(org.apache.spark.sql.functions.col("sensors")))
.toDF("sensors").select("sensors.s","sensors.d").where("sensors.s = 's1'");
Row firstRow = df.head();
String valueOfFirstSensor = firstRow.getString(1);
System.out.println("---------valueOfFirstSensor --------"+ valueOfFirstSensor);
double[] values = new double[1];
values[0] = firstRow.getDouble(0);
new LabeledPoint(timestamp.head().getDouble(0), Vectors.dense(values));
df.show(false);
}
}
});
ssc.start();
ssc.awaitTermination();
What I want to do is convert json, which is a JavaDStream<String>, into a data structure that the StreamingLinearRegressionWithSGD model accepts.
When I try to use Spark's map function to map the json stream to a JavaDStream<LabeledPoint> as follows,
JavaDStream<LabeledPoint> forML = json.map(new Function<String, LabeledPoint>() {
@Override
public LabeledPoint call(String jsonRecord) throws Exception {
// TODO Auto-generated method stub
System.out.println("\n\n\n here is JSON in"+ jsonRecord);
LabeledPoint returnObj = null;
if(!jsonRecord.isEmpty()){
Dataset<Row> timestamp = sqlContext.read().json(jsonRecord).select("t");
timestamp.printSchema();
timestamp.show(false);
Dataset<Row> data = sqlContext.read().json(jsonRecord).select("sensors");
data.printSchema();
data.show(false);
//DF in table
Dataset<Row> df = data.select(org.apache.spark.sql.functions.explode(org.apache.spark.sql.functions.col("sensors")))
.toDF("sensors").select("sensors.s","sensors.d").where("sensors.s = 's1'");
Row firstRow = df.head();
String valueOfFirstSensor = firstRow.getString(1);
System.out.println("---------valueOfFirstSensor --------"+ valueOfFirstSensor);
double[] values = new double[1];
values[0] = firstRow.getDouble(0);
returnObj = new LabeledPoint(timestamp.head().getDouble(0), Vectors.dense(values));
df.show(false);
}
return returnObj;
}
}).cache();
model.trainOn(forML);
and then call model.trainOn, it fails with a NullPointerException at:
Dataset<Row> timestamp = sqlContext.read().json(jsonRecord).select("t");
My questions are:
Am I doing this right?
How can I predict values, and why and how do I need to create a different stream to pass to the model's predictOn function?
I will be receiving multiple sensors but a single value for each sensor, and there can be thousands of such streams. How can I create a different model for each of those thousand sensors and predict efficiently for such a large amount of data?
Are there any other good machine learning algorithms or approaches that can be used for this type of sensor data?
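A likely cause of the NullPointerException is that sqlContext is a driver-side object and is not usable inside functions such as map, which run on the executors. One way around this is to parse each record with ordinary Java code inside the map function. The sketch below assumes records always have the format shown in the question and uses regular expressions only to stay dependency-free; the class and method names are hypothetical, and a JSON library would be more robust in a real job.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark: extracts the timestamp and a single
// sensor reading from a record in the format shown in the question.
class SensorRecordParser {

    // Matches "t":<digits>
    private static final Pattern TIMESTAMP = Pattern.compile("\"t\"\\s*:\\s*(\\d+)");

    // Returns the timestamp as a double, or NaN if the record has none
    static double timestamp(String json) {
        Matcher m = TIMESTAMP.matcher(json);
        return m.find() ? Double.parseDouble(m.group(1)) : Double.NaN;
    }

    // Returns the reading "d" of the named sensor, or NaN if it is absent.
    // Assumes "s" is immediately followed by "d" inside each sensor object.
    static double reading(String json, String sensor) {
        Pattern p = Pattern.compile(
            "\"s\"\\s*:\\s*\"" + Pattern.quote(sensor) + "\"\\s*,\\s*\"d\"\\s*:\\s*\"([^\"]+)\"");
        Matcher m = p.matcher(json);
        return m.find() ? Double.parseDouble(m.group(1)) : Double.NaN;
    }

    public static void main(String[] args) {
        String json = "[{\"t\":1481368346000,\"sensors\":["
            + "{\"s\":\"s1\",\"d\":\"+149.625\"},{\"s\":\"s2\",\"d\":\"+23.062\"}]}]";
        System.out.println(timestamp(json));
        System.out.println(reading(json, "s1"));
    }
}
```

Inside json.map(...), the two values could then be combined into a LabeledPoint, for example with the timestamp as the label and new double[] { reading } wrapped by Vectors.dense as the features, mirroring what the question's code attempts.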

Creating a simple 1-row Spark DataFrame with Java API

In Scala, I can create a single-row DataFrame from an in-memory string like so:
val stringAsList = List("buzz")
val df = sqlContext.sparkContext.parallelize(stringAsList).toDF("fizz")
df.show()
When df.show() runs, it outputs:
+-----+
| fizz|
+-----+
| buzz|
+-----+
Now I'm trying to do this from inside a Java class. Apparently JavaRDDs don't have a toDF(String) method. I've tried:
List<String> stringAsList = new ArrayList<String>();
stringAsList.add("buzz");
SQLContext sqlContext = new SQLContext(sparkContext);
DataFrame df = sqlContext.createDataFrame(sparkContext
.parallelize(stringAsList), StringType);
df.show();
...but still seem to be coming up short. Now when df.show(); executes, I get:
++
||
++
||
++
(An empty DF.) So I ask: Using the Java API, how do I read an in-memory string into a DataFrame that has only 1 row and 1 column in it, and also specify the name of that column? (So that the df.show() is identical to the Scala one above)?
I have created 2 examples for Spark 2 if you need to upgrade:
Simple Fizz/Buzz (or foe/bar - old generation :) ):
SparkSession spark = SparkSession.builder().appName("Build a DataFrame from Scratch").master("local[*]")
.getOrCreate();
List<String> stringAsList = new ArrayList<>();
stringAsList.add("bar");
JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());
JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String row) -> RowFactory.create(row));
// Creates schema
StructType schema = DataTypes.createStructType(
new StructField[] { DataTypes.createStructField("foe", DataTypes.StringType, false) });
Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();
2x2 data:
SparkSession spark = SparkSession.builder().appName("Build a DataFrame from Scratch").master("local[*]")
.getOrCreate();
List<String[]> stringAsList = new ArrayList<>();
stringAsList.add(new String[] { "bar1.1", "bar2.1" });
stringAsList.add(new String[] { "bar1.2", "bar2.2" });
JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());
JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String[] row) -> RowFactory.create(row));
// Creates schema
StructType schema = DataTypes
.createStructType(new StructField[] { DataTypes.createStructField("foe1", DataTypes.StringType, false),
DataTypes.createStructField("foe2", DataTypes.StringType, false) });
Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();
Code can be downloaded from: https://github.com/jgperrin/net.jgp.labs.spark.
You can achieve this by converting the List to an RDD and then creating a schema that contains the column name.
There may be other ways as well; this is just one of them.
List<String> stringAsList = new ArrayList<String>();
stringAsList.add("buzz");
JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String row) -> {
return RowFactory.create(row);
});
StructType schema = DataTypes.createStructType(new StructField[] { DataTypes.createStructField("fizz", DataTypes.StringType, false) });
DataFrame df = sqlContext.createDataFrame(rowRDD, schema).toDF();
df.show();
// +----+
// |fizz|
// +----+
// |buzz|
// +----+
Building on what @jgp suggested: if you want to do this for mixed types, you can do:
List<Tuple2<Integer, Boolean>> mixedTypes = Arrays.asList(
new Tuple2<>(1, false),
new Tuple2<>(1, false),
new Tuple2<>(1, false));
JavaRDD<Row> rowRDD = sparkContext.parallelize(mixedTypes).map(row -> RowFactory.create(row._1, row._2));
StructType mySchema = new StructType()
.add("id", DataTypes.IntegerType, false)
.add("flag", DataTypes.BooleanType, false);
Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, mySchema).toDF();
This might help with @jdk2588's question.
This post here provides a solution that doesn't go through sparkContext.parallelize(...): https://timepasstechies.com/create-spark-dataframe-java-list/

Converting Java Map to Spark DataFrame (Java API)

I'm trying to use Spark (Java API) to take an in-memory Map (that potentially contains other nested Maps as its values) and convert it into a dataframe. I think I need something along these lines:
Map myMap = getSomehow();
RDD myRDD = sparkContext.makeRDD(myMap); // ???
DataFrame df = sparkContext.read(myRDD); // ???
But I'm having a tough time seeing the forest through the trees here...any ideas? Again this might be a Map<String,String> or a Map<String,Map>, where there could be several nested layers of maps-inside-of-maps-inside-of-maps, etc.
So I tried something, not sure if this is the most efficient option to do it, but I do not see any other right now.
SparkConf sf = new SparkConf().setAppName("name").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(sf);
SQLContext sqlCon = new SQLContext(sc);
Map map = new HashMap<String, Map<String, String>>();
// Build the inner map first so it exists before it is added to the outer map
HashMap putMap = new HashMap<String, String>();
putMap.put("1", "test");
map.put("test1", putMap);
List<Tuple2<String, HashMap>> list = new ArrayList<Tuple2<String, HashMap>>();
Set<String> allKeys = map.keySet();
for (String key : allKeys) {
list.add(new Tuple2<String, HashMap>(key, (HashMap) map.get(key)));
}
JavaRDD<Tuple2<String, HashMap>> rdd = sc.parallelize(list);
System.out.println(rdd.first());
List<StructField> fields = new ArrayList<>();
StructField field1 = DataTypes.createStructField("String", DataTypes.StringType, true);
StructField field2 = DataTypes.createStructField("Map",
DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType), true);
fields.add(field1);
fields.add(field2);
StructType struct = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = rdd.map(new Function<Tuple2<String, HashMap>, Row>() {
@Override
public Row call(Tuple2<String, HashMap> arg0) throws Exception {
return RowFactory.create(arg0._1, arg0._2);
}
});
DataFrame df = sqlCon.createDataFrame(rowRDD, struct);
df.show();
In this scenario I assumed that the Map in the Dataframe is of Type (String, String). Hope this helps!
Edit: Obviously you can delete all the prints. I did this for visualization purposes!
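The answer above handles one level of nesting with a MapType column. For maps nested to an arbitrary depth, one possible alternative (an assumption on my part, not something the answer proposes) is to flatten the structure into dot-separated key paths before building rows, so that each entry becomes a simple (path, value) pair of strings. A minimal plain-Java sketch, with hypothetical class and method names:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Spark: flattens nested maps into a
// single-level map whose keys are dot-separated paths.
class MapFlattener {

    static Map<String, String> flatten(Map<String, ?> source) {
        Map<String, String> out = new LinkedHashMap<>();
        flattenInto("", source, out);
        return out;
    }

    // Recursively walks nested maps; non-map values are stored as strings
    private static void flattenInto(String prefix, Map<?, ?> source, Map<String, String> out) {
        for (Map.Entry<?, ?> e : source.entrySet()) {
            String path = prefix.isEmpty()
                ? String.valueOf(e.getKey())
                : prefix + "." + e.getKey();
            Object value = e.getValue();
            if (value instanceof Map) {
                flattenInto(path, (Map<?, ?>) value, out);
            } else {
                out.put(path, String.valueOf(value));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new LinkedHashMap<>();
        inner.put("1", "test");
        Map<String, Object> outer = new LinkedHashMap<>();
        outer.put("test1", inner);
        outer.put("test2", "plain");
        System.out.println(flatten(outer));
    }
}
```

Each flattened entry could then be turned into a Row with RowFactory.create(path, value) and described by a schema of two string columns, in the same way the answer builds its rows.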
