Create a Dataset from a String in Spark Java (Without RDD)

I need to create a Dataset from a String. Key is the String:
Header h = new Header();
h.setName(Key);
SQLContext sqlC = spark.sqlContext();
// Encoders.STRING() expects a List<String>, so pass the Key string itself rather than the Header object
Dataset<String> ds = sqlC.createDataset(Collections.singletonList(Key), Encoders.STRING());
ds.show();
I need to write it to a text file (is there a text format? I am using CSV right now):
ds.write().format("com.databricks.spark.csv").mode("overwrite")
.save(SomeLocation);

From the documentation, you can use df.write().text():
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameWriter.html#text-java.lang.String-
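A minimal sketch of the text approach (assuming spark is your SparkSession, Key is the String from the question, and SomeLocation is an output directory):
Dataset<String> ds = spark.createDataset(Collections.singletonList(Key), Encoders.STRING());
// text() writes one row per line and requires a single string column, which Dataset<String> provides
ds.write().mode("overwrite").text(SomeLocation);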

Related

How can I change a non numeric value in all the data set using Spark?

I'm using a data set with a lot of columns, and it has ? values throughout. I would like Spark (Java) to change every ? to 0. So far I can only do this for one column, but I would like to do it everywhere:
Dataset<Row> csvData = spark.read()
.option("header", false)
.option("inferSchema", true)
.option("maxColumns", 50000)
.csv("src/main/resources/K9.data");
csvData = csvData.withColumn("_c5409", when(col("_c5409").isNull(),0).otherwise(col("_c5409")) )
.withColumn("_c0", when(col("_c0").equalTo("?"),0).otherwise(col("_c0")) );
Maybe this has an easy solution; I'm new to Java and Spark :)
You can build a list of columns using when and pass it to select; this also works when you have to deal with more complex if/else cases:
List<org.apache.spark.sql.Column> list = new ArrayList<org.apache.spark.sql.Column>();
for (String col : csvData.columns()) {
    list.add(when(csvData.col(col).isNull(), 0).otherwise(csvData.col(col)).alias(col));
}
csvData = csvData.select(list.toArray(new org.apache.spark.sql.Column[0]));
If it is simply to replace nulls, this is good enough
csvData = csvData.na().fill(0, csvData.columns());
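Since the question is about the literal ? rather than nulls, the same loop can be adapted with equalTo; a sketch, assuming the affected columns are read as strings:
List<org.apache.spark.sql.Column> cols = new ArrayList<>();
for (String c : csvData.columns()) {
    // replace a literal "?" with 0 and keep every other value unchanged
    cols.add(when(csvData.col(c).equalTo("?"), 0).otherwise(csvData.col(c)).alias(c));
}
csvData = csvData.select(cols.toArray(new org.apache.spark.sql.Column[0]));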

How to convert the datasets of Spark Row into string?

I have written the code to access the Hive table using SparkSQL. Here is the code:
SparkSession spark = SparkSession
.builder()
.appName("Java Spark Hive Example")
.master("local[*]")
.config("hive.metastore.uris", "thrift://localhost:9083")
.enableHiveSupport()
.getOrCreate();
Dataset<Row> df = spark.sql("select survey_response_value from health").toDF();
df.show();
I would like to know how I can convert the complete output to a String or a String array, as I am trying to work with another module to which I can only pass String or String-array values.
I have tried other methods like .toString or casting to String, but they did not work for me.
Kindly let me know how I can convert the Dataset values to String.
Here is the sample code in Java.
public class SparkSample {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local[*]")
.getOrCreate();
//create df
List<String> myList = Arrays.asList("one", "two", "three", "four", "five");
Dataset<Row> df = spark.createDataset(myList, Encoders.STRING()).toDF();
df.show();
//using df.as
List<String> listOne = df.as(Encoders.STRING()).collectAsList();
System.out.println(listOne);
//using df.map
List<String> listTwo = df.map(row -> row.mkString(), Encoders.STRING()).collectAsList();
System.out.println(listTwo);
}
}
"row" is java 8 lambda parameter. Please check developer.com/java/start-using-java-lambda-expressions.html
You can use the map function to convert every row into a string, e.g.:
df.map(row => row.mkString())
Instead of just mkString you can of course do more sophisticated work
The collect method can then retrieve the whole thing into an array:
val strings = df.map(row => row.mkString()).collect
(This is the Scala syntax, I think in Java it's quite similar)
If you are planning to read the dataset line by line, then you can use the iterator over the dataset:
Dataset<Row> csv = session.read().format("csv")
    .option("sep", ",")
    .option("inferSchema", true)
    .option("escape", "\"")
    .option("header", true)
    .option("multiline", true)
    .load("users/abc/....");
for (Iterator<Row> iter = csv.toLocalIterator(); iter.hasNext();) {
    String item = iter.next().toString();
    System.out.println(item);
}
To get everything as a single string, from the SparkSession you can do (Scala):
sparkSession.read.textFile(filePath).collect.mkString
assuming your Dataset is of type String, i.e. Dataset[String].
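A rough Java equivalent, as a minimal sketch (assuming spark is a SparkSession, filePath points at a text file, and the data is small enough to collect to the driver):
String whole = String.join("", spark.read().textFile(filePath).collectAsList()); // mkString uses no separator; pass "\n" instead to keep line breaks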

How to convert JSON from KAFKA to pass it to Spark's machine learning Algorithm

I am trying to learn Spark and Spark Streaming using Java, and I am developing an IoT application.
I have a Kafka server in place which accepts JSON data, and I am able to parse it using SQLContext and a foreach function.
Data format is as follows,
[{"t":1481368346000,"sensors":[{"s":"s1","d":"+149.625"},{"s":"s2","d":"+23.062"},{"s":"s3","d":"+16.375"},{"s":"s4","d":"+235.937"},{"s":"s5","d":"+271.437"},{"s":"s6","d":"+265.937"},{"s":"s7","d":"+295.562"},{"s":"s8","d":"+301.687"}]}]
Here t is the timestamp of each data stream,
and sensors is an array of sensor data, with s as the name of each sensor and d containing its data.
What I have done till now is,
JavaPairInputDStream<String, String> directKafkaStream =
KafkaUtils.createDirectStream(ssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topics);
SQLContext sqlContext = spark.sqlContext();
StreamingLinearRegressionWithSGD model = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(2));
JavaDStream<String> json = directKafkaStream.map(new Function<Tuple2<String,String>, String>() {
public String call(Tuple2<String,String> message) throws Exception {
return message._2();
};
});
json.print();
json.foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> jsonRecord) throws Exception {
System.out.println("JSON Record ---- "+jsonRecord);
if(!jsonRecord.isEmpty()){
Dataset<Row> timestamp = sqlContext.read().json(jsonRecord).select("t");
timestamp.printSchema();
timestamp.show(false);
Dataset<Row> data = sqlContext.read().json(jsonRecord).select("sensors");
data.printSchema();
data.show(false);
//DF in table
Dataset<Row> df = data.select(org.apache.spark.sql.functions.explode(org.apache.spark.sql.functions.col("sensors")))
.toDF("sensors").select("sensors.s","sensors.d").where("sensors.s = 's1'");
Row firstRow = df.head();
String valueOfFirstSensor = firstRow.getString(1);
System.out.println("---------valueOfFirstSensor --------"+ valueOfFirstSensor);
double[] values = new double[1];
values[0] = firstRow.getDouble(0);
new LabeledPoint(timestamp.head().getDouble(0), Vectors.dense(values));
df.show(false);
}
}
});
ssc.start();
ssc.awaitTermination();
What I want to do is convert json, which is a JavaDStream<String>, into a data structure that the StreamingLinearRegressionWithSGD model accepts.
When I try to use Spark's map function to map the json stream to a JavaDStream<LabeledPoint> as follows,
JavaDStream<LabeledPoint> forML = json.map(new Function<String, LabeledPoint>() {
@Override
public LabeledPoint call(String jsonRecord) throws Exception {
// TODO Auto-generated method stub
System.out.println("\n\n\n here is JSON in"+ jsonRecord);
LabeledPoint returnObj = null;
if(!jsonRecord.isEmpty()){
Dataset<Row> timestamp = sqlContext.read().json(jsonRecord).select("t");
timestamp.printSchema();
timestamp.show(false);
Dataset<Row> data = sqlContext.read().json(jsonRecord).select("sensors");
data.printSchema();
data.show(false);
//DF in table
Dataset<Row> df = data.select(org.apache.spark.sql.functions.explode(org.apache.spark.sql.functions.col("sensors")))
.toDF("sensors").select("sensors.s","sensors.d").where("sensors.s = 's1'");
Row firstRow = df.head();
String valueOfFirstSensor = firstRow.getString(1);
System.out.println("---------valueOfFirstSensor --------"+ valueOfFirstSensor);
double[] values = new double[1];
values[0] = firstRow.getDouble(0);
returnObj = new LabeledPoint(timestamp.head().getDouble(0), Vectors.dense(values));
df.show(false);
}
return returnObj;
}
}).cache();
model.trainOn(forML);
and then call model.trainOn, it fails with a NullPointerException at
Dataset<Row> timestamp = sqlContext.read().json(jsonRecord).select("t");
Now the questions I have are:
Am I doing this right?
How will I be able to predict values, and why and how do I need to create a different stream to pass to the model's predictOn function?
I will be receiving multiple sensors, but a single value for each sensor, and there can be thousands of such streams. How can I create a different model for each of those thousands of sensors and predict efficiently over such a vast amount of data?
Are there any other good machine learning algorithms or approaches that can be utilized for this type of sensor data?
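One commonly suggested direction, not from the original post: SQLContext and SparkSession live on the driver and generally cannot be used inside a map that runs on the executors, which is a typical cause of this kind of NullPointerException. A minimal sketch that parses each record with a plain JSON parser instead (assuming Jackson, which ships with Spark, and the record format shown above; the timestamp is kept as the label only to mirror the original code):
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

JavaDStream<LabeledPoint> forML = json.map(jsonRecord -> {
    // parse the record directly; no SQLContext is needed on the executor side
    JsonNode root = new ObjectMapper().readTree(jsonRecord).get(0);
    double label = root.get("t").asDouble();
    double s1 = 0.0;
    for (JsonNode sensor : root.get("sensors")) {
        if ("s1".equals(sensor.get("s").asText())) {
            // values like "+149.625" parse fine with Double.parseDouble
            s1 = Double.parseDouble(sensor.get("d").asText());
        }
    }
    return new LabeledPoint(label, Vectors.dense(s1));
});
model.trainOn(forML);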

Comparison between different methods of executing SQL queries on Cassandra Column Families using spark

As part of my project, I have to create a SQL query interface for a very large Cassandra dataset. I have therefore been looking at different methods for executing SQL queries on Cassandra column families using Spark, and I have come up with three different methods:
1. Using Spark SQLContext with a statically defined schema
// statically defined in the application
public static class TableTuple implements Serializable {
private int id;
private String line;
TableTuple (int i, String l) {
id = i;
line = l;
}
// getters and setters
...
}
and I consume the definition as:
SparkConf conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CASSANDRA_HOST)
.setJars(jars);
SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<CassandraRow> rowrdd = javaFunctions(sc).cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);
JavaRDD<TableTuple> rdd = rowrdd.map(row -> new TableTuple(row.getInt(0), row.getString(1)));
DataFrame dataFrame = sqlContext.createDataFrame(rdd, TableTuple.class);
dataFrame.registerTempTable("lines");
DataFrame resultsFrame = sqlContext.sql("Select line from lines where id=1");
System.out.println(Arrays.asList(resultsFrame.collect()));
2. Using Spark SQLContext with a dynamically defined schema
SparkConf conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CASSANDRA_HOST)
.setJars(jars);
SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<CassandraRow> cassandraRdd = javaFunctions(sc).cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);
JavaRDD<Row> rdd = cassandraRdd.map(row -> RowFactory.create(row.getInt(0), row.getString(1)));
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("id", DataTypes.IntegerType, true));
fields.add(DataTypes.createStructField("line", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);
DataFrame dataFrame = sqlContext.createDataFrame(rdd, schema);
dataFrame.registerTempTable("lines");
DataFrame resultDataFrame = sqlContext.sql("select line from lines where id = 1");
System.out.println(Arrays.asList(resultDataFrame.collect()));
3. Using CassandraSQLContext from the spark-cassandra-connector
SparkConf conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CASSANDRA_HOST)
.setJars(jars);
SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
CassandraSQLContext sqlContext = new CassandraSQLContext(sc);
DataFrame resultsFrame = sqlContext.sql("Select line from " + CASSANDRA_KEYSPACE + "." + CASSANDRA_COLUMN_FAMILY + " where id = 1");
System.out.println(Arrays.asList(resultsFrame.collect()));
I would like to know the advantages and disadvantages of one method over another. Also, for the CassandraSQLContext method, are queries limited to CQL, or is it fully compatible with Spark SQL? I would also like an analysis pertaining to my specific use case: I have a Cassandra column family with ~17.6 million tuples having 62 columns. For querying such a large database, which method is the most adequate?

Convert JavaDStream<String> to JavaRDD<String>

I have a JavaDStream which gets its data from an external source. I'm trying to integrate Spark Streaming and Spark SQL. It's known that a JavaDStream is made up of JavaRDDs, and I can only apply the function applySchema() when I have a JavaRDD. Please help me convert it to a JavaRDD. I know there are functions for this in Scala, where it's much easier, but help me out in Java.
You can't transform a DStream into an RDD. As you mention, a DStream contains RDDs. The way to get access to the RDDs is by applying a function to each RDD of the DStream using foreachRDD. See the docs: https://spark.apache.org/docs/1.1.0/api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#foreachRDD(org.apache.spark.api.java.function.Function2)
You first have to access all the RDDs inside the DStream using foreachRDD, for example:
javaDStream.foreachRDD(rdd -> {
    rdd.collect().forEach(item -> {
        ...
    });
});
I hope this helps to convert a JavaDStream to a JavaRDD!
JavaDStream<String> lines = stream.map(ConsumerRecord::value);
//Create JavaRDD<Row>
lines.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> rdd) {
        JavaRDD<Row> rowRDD = rdd.map(new Function<String, Row>() {
            @Override
            public Row call(String msg) {
                Row row = RowFactory.create(msg);
                return row;
            }
        });
        //Create Schema
        StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("value", DataTypes.StringType, true)});
        //Get Spark 2.0 session
        SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
        Dataset<Row> msgDataFrame = spark.createDataFrame(rowRDD, schema);
        msgDataFrame.show();
    }
});
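If the goal is to then query each micro-batch with Spark SQL, a short follow-up sketch (assuming Spark 2.x) registers the per-batch DataFrame as a temporary view right after msgDataFrame.show():
msgDataFrame.createOrReplaceTempView("messages");
Dataset<Row> values = spark.sql("select value from messages");
values.show();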
